Big Data Reproducibility: Applications in Brain Imaging and Genomics

Abstract. Reproducibility, the ability to replicate analytical findings, is a prerequisite for both scientific discovery and clinical utility. Troublingly, we are in the midst of a reproducibility crisis, in which many investigations fail to replicate. Although many believe that these failings are due to misunderstanding or misapplication of statistical inference (e.g., p-values or the dichotomization of "statistically significant"), we believe the shortcomings arise much earlier in the data science workflow, at the level of measurement, including data acquisition and reconstruction. A key to reproducibility is that multiple measurements of the same item (e.g., experimental sample or clinical participant) are similar to one another, while being dissimilar from other items. The intra-class correlation coefficient (ICC) quantifies reproducibility in this way, but only for univariate (one-dimensional) data, while relying on Gaussian assumptions for validity. In contrast, big data are multivariate (high-dimensional), non-Gaussian, and often non-Euclidean (including text, images, speech, and networks), rendering ICC inadequate. We propose a novel statistic, discriminability, which quantifies the degree to which individual samples are similar to one another, without restricting the data to be univariate, Gaussian, or even Euclidean. We then introduce the possibility of optimizing experimental design by increasing discriminability. We prove that optimizing discriminability yields an improved ability to use the data for subsequent inference tasks, without specifying the inference task a priori. We then apply this approach to three different datasets: a brain imaging dataset built by the Consortium for Reliability and Reproducibility, which consists of 28 disparate magnetic resonance imaging datasets, and two genomics datasets.
Discriminability is the only statistic for which optimization improves performance on all subsequent inference tasks for each dataset, even though those tasks were not considered during the optimization. We therefore suggest that designing experiments and analyses to optimize discriminability may be a crucial step in solving the reproducibility crisis.

1 Introduction
Reproducibility and reliability are central tenets of data science, whether applied to scientific discovery or clinical utility. As a rule, if results do not reproduce, we do not trust them. In statistics, reproducibility is closely related to robustness, which quantifies the sensitivity of an estimator to model misspecifications (such as additional sources of variance not accounted for in the model) [1]. Reproducibility is also related to stability, which generalizes robustness in certain ways, for example, asking whether an estimate is stable over multiple random samples of the data [2]. Engineering and operations research have long been concerned with reliability, as they require that their products be reliable under various conditions. Very recently, the general research community became interested in these issues, as individuals began noticing and publishing failures to reproduce across fields, including neuroscience and psychology [3][4][5].
A number of strategies have been suggested to resolve this "reproducibility crisis." For example, the editors of Basic and Applied Social Psychology banned the use of p-values [6]. Unfortunately, an analysis of publications since the ban indicates that studies tended to overstate, rather than understate, their claims, suggesting that the proposal may have had the opposite of its intended effect [7]. More recently, the American Statistical Association released a statement recommending that the phrase "statistically significant" be abandoned for similar reasons [8].
A different strategy has been to quantify the reproducibility or reliability of one's data after collection, using studies with multiple measurements in which each item is measured multiple times under identical acquisition parameters. This practice has been particularly popular in brain imaging, where many studies have been devoted to quantifying the reproducibility of different properties of the data [9-12]. However, this approach has severe limitations. Perhaps the most problematic is clear from the popular adage, "garbage in, garbage out" [13]: if the measurements themselves are not reproducible, analyses built upon them cannot be trusted either. We therefore believe optimizing experiments to improve reproducibility, specifically by maximizing Discr, will be useful for a wide range of disciplines and sectors. To facilitate its use, we make all of our code and data derivatives open access at https://neurodata.io/mgc.

Intra-Class Correlation
The intra-class correlation coefficient (ICC) is a commonly used data reproducibility statistic [22]. ICC is the fraction of the total variability that is across-item variability, that is, ICC is defined as the across-item variability divided by the within-item plus across-item variability. ICC has several severe limitations. First, it is univariate, meaning if the data are multidimensional, they must first be represented by univariate statistics, thereby discarding multivariate information. Second, ICC is based on an (overly simplistic) Gaussian assumption characterizing the data. Thus, any deviations from this assumption render the interpretation of the magnitude of ICC questionable, because non-Gaussian measurements that are highly reliable could yield quite low ICC.
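As a concrete reference, the classical one-way estimator, ICC(1), can be computed from the between- and within-item mean squares; the paper does not specify which ICC variant it uses, so this is a sketch of one common choice:

```python
import numpy as np

def icc_oneway(Y):
    """One-way ICC, i.e., ICC(1), for an (n items x s measurements)
    array Y: across-item variance over total variance, estimated via
    one-way ANOVA mean squares."""
    Y = np.asarray(Y, dtype=float)
    n, s = Y.shape
    item_means = Y.mean(axis=1)
    msb = s * np.sum((item_means - Y.mean()) ** 2) / (n - 1)      # between-item
    msw = np.sum((Y - item_means[:, None]) ** 2) / (n * (s - 1))  # within-item
    return (msb - msw) / (msb + (s - 1) * msw)
```

When repeated measurements of each item are tightly clustered around well-separated item means, this estimator approaches 1; with no item-specific signal it is near (or slightly below) 0.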

Image Intra-Class Correlation (I2C2)
The Image Intra-Class Correlation (I2C2) was introduced to mitigate ICC's univariate limitation [23]. Specifically, I2C2 operates on covariance matrices, rather than variances. To obtain a univariate summary of reproducibility, I2C2 uses the trace of the covariance matrices, one of several possible strategies, similar to most multivariate analysis of variance procedures [24]. Thus, while overcoming one limitation of ICC, I2C2 still heavily leverages Gaussian assumptions about the data to justify its validity. We introduce a complementary multivariate parametric generalization of ICC which we call Principal Component Intra-Class Correlation (PICC). PICC is simply ICC computed on the first principal component of the data. The main advantage of PICC over I2C2 is conceptual and computational simplicity; empirically, it performs approximately as well.
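A sketch of PICC as just defined, assuming the rows of the measurement matrix are ordered item-by-item (the projection and the ICC variant are this sketch's assumptions):

```python
import numpy as np

def picc(X, n_items, s):
    """PICC sketch: project the (n_items * s) x d measurement matrix X
    onto its first principal component, then compute a one-way ICC on
    the resulting univariate scores. Assumes rows are ordered
    item-by-item (s consecutive rows per item)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # First right singular vector of the centered data = first PC direction.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                       # univariate PC1 scores
    Y = scores.reshape(n_items, s)
    means = Y.mean(axis=1)
    msb = s * np.sum((means - Y.mean()) ** 2) / (n_items - 1)
    msw = np.sum((Y - means[:, None]) ** 2) / (n_items * (s - 1))
    return (msb - msw) / (msb + (s - 1) * msw)
```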

Fingerprinting Index (FPI)
The Fingerprinting Index (FPI) [25,26] provides a metric for discovering individual-specific connectivity profiles in resting-state functional MRI (fMRI). Specifically, FPI operates on the pairwise correlations of the vectorized connectivity matrices. A high FPI corresponds to the connectivity matrices being more strongly correlated within-subject than between-subject. Unlike the strategies employed in this manuscript, FPI produces a statistic for each possible ordered pair of measurement sessions; i.e., if each item is measured s times, FPI produces s(s − 1) statistics. To make FPI usable for assessing the effectiveness of a strategy, we instead propose using the FPI averaged across all s(s − 1) statistics, which we use for the duration of this manuscript (denoted AFPI).

Distance Components (DISCO)
Distance Components (DISCO) [27] extends the classical Analysis of Variance (ANOVA) framework to cases where the distributions are not necessarily Gaussian.
In contrast to ANOVA, which makes simplifying assumptions of normality, DISCO operates on the dispersion of the samples based on the Euclidean distance, comparing the within-class dispersion to the between-class dispersion. DISCO yields a consistent test against general alternatives as the number of observations s per item goes to infinity. Note that in many real data scenarios s is small (in particular, most "repeated measurements" datasets have s = 2), and the finite-sample performance of DISCO on such a small number of repeated trials is not known.

Maximum Mean Discrepancy (MMD)
Maximum mean discrepancy (MMD) [28] provides a nonparametric framework for testing whether two samples are drawn from the same distribution. MMD avoids Gaussian assumptions by embedding the points in a reproducing kernel Hilbert space (RKHS) and searching over functions in the unit ball of the RKHS for the one that maximizes the difference in the means of the embedded points. In the two-item regime, MMD can be shown to be equivalent to the Hilbert-Schmidt Independence Criterion (HSIC) [29][30][31], which provides a natural generalization of MMD when the number of classes exceeds two.

Computing Discr
Discr quantifies reproducibility without any parametric assumptions, and without requiring the data to be univariate. Here we outline how one can compute Discr from data.
Consider n items, where each item has s measurements, resulting in N = n × s total measurements across items. Discr can then be computed as follows:
1. Compute the distance between all pairs of samples (resulting in an N × N matrix).
2. For all samples of all items, compute the fraction of times that a within-item distance is smaller than an across-item distance.
3. The Discr of the dataset is the average of the fractions computed above.
A high Discr indicates that within-item measurements are more similar to one another than across-item measurements. For more algorithmic details, see Algorithm 1. For formal definitions of terms, see Appendix A.
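The three steps above can be sketched directly in a few lines; Euclidean distance is assumed, and ties between a within-item and an across-item distance count against Discr here, a simplifying assumption relative to Algorithm 1:

```python
import numpy as np

def discriminability(X, items):
    """Compute Discr for measurements X (rows) with item labels `items`,
    following the three steps described in the text."""
    X, items = np.asarray(X, dtype=float), np.asarray(items)
    # Step 1: N x N Euclidean distance matrix between all measurements.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    fractions = []
    for i in range(len(items)):
        within = items == items[i]
        within[i] = False                 # exclude the zero self-distance
        across = D[i, items != items[i]]
        # Step 2: fraction of across-item distances larger than each
        # within-item distance for this measurement.
        fractions += [np.mean(across > d) for d in D[i, within]]
    # Step 3: average over all within-item comparisons.
    return float(np.mean(fractions))
```

For two well-separated items measured twice each, every within-item distance is smaller than every across-item distance, so Discr is 1.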
3 Theoretical properties of Discr
Under reasonably general assumptions, if within-item variability increases, predictive accuracy will decrease. Therefore, a statistic that is sensitive to within-item variance is desirable for optimal experimental design, regardless of the distribution of the data. Carmines and Zeller [32] introduce a univariate parametric framework in which predictive accuracy can be lower-bounded by an increasing function of ICC; as a direct consequence, a strategy with a higher ICC will, on average, have higher predictive performance on subsequent inference tasks. Unfortunately, this valuable theoretical result is limited in its applicability, as it is restricted to univariate data, whereas big data analysis strategies often produce multivariate data. We therefore prove the following generalization of this theorem (see Appendix B for the proof):

Theorem 3.1. Under the multivariate additive noise setting, Discr provides a lower bound on the predictive accuracy of a subsequent classification task. Consequently, a strategy with a higher Discr provably provides a higher bound on predictive accuracy than a strategy with a lower Discr.
Thus, Discr provides a theoretical extension of ICC to a multivariate model, and correspondingly, motivates optimizing experiments to obtain higher Discr.
4 Empirical properties of Discr on simulated data

4.1 Simulation settings
To develop insight into the performance of Discr, we consider several different simulation settings (see Appendix D for details). Each setting includes between 2 and 20 items, with 128 total measurements, in two dimensions. Figure 1A shows a two-dimensional scatterplot of each setting, and Figure 1B shows the Euclidean distance matrix between samples, ordered by item:
1. Gaussian: Sixteen items are each distributed according to a spherically symmetric Gaussian, thereby respecting the assumptions that motivate ICC and I2C2.
2. Cross: Two items have Gaussian distributions with the same mean and different diagonal covariance matrices.
3. Ball/Circle: One item is distributed in the unit ball, the other on the unit circle; Gaussian noise is added to both.
4. XOR: Each of two items is a mixture of two spherically symmetric Gaussians, with means organized in an XOR fashion; that is, the means of the first item are (0, 1) and (1, 0), whereas the means of the second are (0, 0) and (1, 1). As a result, many measurements from a given item are farther from each other than from measurements of the other item.
5. No Signal: Both items have the same Gaussian distribution.
We compare the three statistics that we developed (Discr, PICC, and AFPI) to previously existing statistics, including I2C2 and MMD.

Discr empirically predicts performance on subsequent inference tasks
We investigate the sensitivity of the reproducibility statistics to changes in within-item variability in each of the settings. Figure 1C shows the impact of increasing within-item variance on the different simulation settings. For those with predictive information (the top four), increasing variance decreases predictive accuracy (green line). As desired, Discr also decreases nearly perfectly monotonically with increasing variance. However, only in the first setting, where each item has a spherically symmetric Gaussian distribution, do I2C2, PICC, and AFPI drop proportionally. Even in the second (Gaussian) setting, I2C2, PICC, and AFPI are effectively uninformative about the within-item variance. And in the third and fourth (non-Gaussian) settings, they are similarly useless. This suggests that, of these statistics, only Discr and MMD can serve as satisfactory surrogates for predictive accuracy under these relatively simple settings.

Figure 1: Five simulated settings demonstrate the value of Discr for optimal experimental design. All simulations are two-dimensional, with 128 samples, α = 0.05, and 500 iterations per setting. For all simulations, the variance is normalized (see Appendix D for details). (A) For each setting, class label is indicated by shape, and color indicates item identity. (B) Euclidean distance matrix between samples within each simulation setting. Samples are organized by item. Simulation settings in which items are discriminable tend to have a block structure in which samples from the same item are relatively similar to one another. (C) Reproducibility statistic versus variance. Here, we can compute the Bayes accuracy (the best achievable accuracy for predicting class label) as a function of variance. Discr and MMD are mostly monotonic relative to within-item variance across all settings, suggesting that one can predict improved performance via improved Discr. (D) Goodness of fit test of whether data are discriminable. Discr achieves power nearly as high as or higher than alternative strategies in all cases. (E) Comparison test of which approach is more discriminable. Discr achieves the highest power for most settings and variances.

A goodness of fit test for Discr
A prerequisite for making item-specific predictions is that items differ from one another in predictable ways, that is, that they are discriminable. If not, the same assay applied to the same individual on multiple trials could yield unacceptably variable results. Thus, prior to embarking on a machine learning search for predictive accuracy, one can simply test whether the data are discriminable at all; if not, predictive accuracy will be hopeless. Letting S denote the Discr of a dataset with n items and s measurements per item, and S_0 denote the Discr of a dataset of the same size with zero item-specific information, the goodness of fit hypothesis test for Discr is

H_0: S = S_0 versus H_A: S > S_0.    (1)

One could replace Discr with some other test statistic, such as PICC or I2C2. We devised a permutation test to obtain a distribution of the test statistic under the null, and a corresponding p-value. To evaluate the different procedures, we compute the power of each test, that is, the probability of correctly rejecting the null when it is false (which is one minus the type II error; see Appendix C.1 for details). Figure 1D shows that Discr achieves power as high as or higher than all competing approaches in essentially all settings and variances. This demonstrates that, despite not relying on Gaussian assumptions, Discr performs as well as or better than parametric methods when the data satisfy those assumptions. In the cross setting, only Discr and MMD correctly identify that items differ from one another, despite the fact that the data are Gaussian. In both the ball/circle and XOR settings, most statistics perform well despite the non-Gaussianity of the data. And when there is no signal, all tests are valid, achieving power less than or equal to the critical value.
Non-parametric Discr therefore has the power of parametric approaches for data for which those assumptions are appropriate, and much higher power for other data. Though MMD performs comparably to Discr here, it does not consistently perform well in the comparison test evaluations below.
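The permutation test for the goodness of fit hypothesis can be sketched as follows; the null construction (shuffling item labels, which are exchangeable under zero item-specific information) and the +1 p-value correction are standard conventions rather than the paper's exact algorithm:

```python
import numpy as np

def discr_stat(X, items):
    """Minimal Discr: average fraction of across-item distances that
    exceed each within-item distance."""
    X, items = np.asarray(X, dtype=float), np.asarray(items)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    fr = []
    for i in range(len(items)):
        within = items == items[i]
        within[i] = False                      # exclude the self-distance
        across = D[i, items != items[i]]
        fr += [np.mean(across > d) for d in D[i, within]]
    return float(np.mean(fr))

def discr_gof_test(X, items, n_perm=200, seed=0):
    """Permutation goodness-of-fit test sketch: permuting item labels
    yields the null distribution of Discr under zero item-specific
    information."""
    rng = np.random.default_rng(seed)
    items = np.asarray(items)
    observed = discr_stat(X, items)
    null = [discr_stat(X, rng.permutation(items)) for _ in range(n_perm)]
    p = (1 + sum(s >= observed for s in null)) / (1 + n_perm)
    return observed, p
```

With well-separated items the observed Discr is high and the p-value is small; because the null is built by permutation, the test is valid by construction when there is no signal.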

A comparison test for whether one experimental design is more discriminable than another
Given two experimental designs, which can differ in acquisition and/or analysis details, are the measurements produced by one method more discriminable than those of the other? Letting D^(1) be the Discr of the first approach and D^(2) be the Discr of the second approach, we have the following comparison hypothesis test for Discr:

H_0: D^(1) = D^(2) versus H_A: D^(1) > D^(2).    (2)

Again, one could replace Discr with other test statistics; we devised a permutation test to obtain the distribution of the test statistic under the null, and corresponding p-values (see Appendix C.2 for details). Figure 1E shows that Discr achieves power nearly as high as or higher than the other approaches. Specifically, only AFPI achieves higher power in the Gaussian setting, but it achieves almost no power in the cross setting. MMD achieves extremely low power in all settings, as does PICC. I2C2 achieves power similar to Discr only in the Gaussian and ball/circle settings. All tests are valid in that they achieve power approximately equal to or below the critical value when there is no signal. The fact that Discr achieves nearly equal or higher power than the statistics built upon Gaussian methods, even under Gaussian assumptions, suggests that Discr will be a superior metric for optimal experimental design.
5 Empirical Discr on Real Data

5.1 Human brain imaging data acquisition and analysis
The Consortium for Reliability and Reproducibility (CoRR) [33] has generated functional, anatomical, and diffusion magnetic resonance imaging (dMRI) scans from >1,600 participants, often with multiple measurements, collected through 28 different studies (22 of which have both age and sex annotation) spanning over 20 sites. Each site uses different scanners, technicians, and scanning protocols, thereby representing a wide variety of acquisition settings with which one can test different analysis pipelines. Figure 2A shows the six-stage sequence of analysis steps for converting the raw fMRI data into networks, or connectomes, that is, estimates of the strength of connections between all pairs of brain regions. At each stage of the pipeline, we consider several different "standard" approaches, that is, approaches that have previously been proposed in the literature, typically with hundreds or thousands of citations [34]. Moreover, they have all been collected into an analysis engine, the Configurable Pipeline for the Analysis of Connectomes (C-PAC) [35]. In total, for the six stages together, we consider 2 × 2 × 2 × 2 × 4 × 3 = 192 different analysis pipelines. Because each stage is nonlinear, it is possible that the best sequence of choices is not equivalent to the best choice at each stage considered on its own. For this reason, publications that evaluate a single stage in isolation, using any metric, could reach misleading conclusions if one is searching for the best sequence of steps. The dMRI connectomes were generated via 48 analysis pipelines using the NeuroData MRI Graphs (ndmg) pipeline [36]. Appendix E provides specific details for both fMRI and dMRI analysis, as well as the options attempted. Figure 2B shows that the analysis strategy has a large impact on the Discr of the resulting fMRI connectomes.
Each column shows one of 64 different analysis strategies, ordered by how significantly different they are from the pipeline with the highest Discr (averaged over all datasets, tested using the above comparison test). Interestingly, pipelines with worse average Discr also tend to have higher variance across datasets. The best pipeline, FNNNCP, uses FSL registration, no frequency filtering, no scrubbing, no global signal regression, the CC200 parcellation, and converts edge weights to ranks. While all strategies across all datasets with multiple participants are discriminable at α = 0.05 (Discr goodness of fit test), the majority of the strategies (51/64 ≈ 80%) show significantly worse Discr than the optimal strategy at α = 0.05 (Discr comparison test).
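The pipeline space is the Cartesian product of the per-stage options, which can be sketched as follows; only the per-stage counts (2, 2, 2, 2, 4, 3) come from the text, and the single-letter option labels below, which mimic the paper's pipeline codes such as FNNNCP, are illustrative placeholders:

```python
from itertools import product

# Hypothetical per-stage options; only the counts are taken from the text.
stages = {
    "registration":     ["F", "A"],
    "frequency_filter": ["F", "N"],
    "scrubbing":        ["S", "N"],
    "global_signal":    ["G", "N"],
    "parcellation":     ["A", "C", "D", "H"],
    "edge_weights":     ["raw", "log", "rank"],
}

pipelines = ["".join(opts) for opts in product(*stages.values())]
print(len(pipelines))  # 2 * 2 * 2 * 2 * 4 * 3 = 192 candidate pipelines
```

Because the stages interact nonlinearly, the full product must be evaluated rather than optimizing each stage independently.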

Discr identifies which acquisition and analysis decisions are most important for improving performance
While the above analysis provides evidence for which sequence of analysis steps is best, it does not provide information about which choices individually have the largest impact on overall Discr. To do so, it is inadequate to simply fix a pipeline and swap out algorithms for a single stage, as such an analysis will only provide information about that fixed pipeline. Therefore, we evaluate each choice in the context of all 192 considered pipelines in Figure 3A. The pipeline constructed by identifying the best option for each analysis stage is FNNGCP (Figure 3A). Although it is not exactly the same as the pipeline with the highest Discr (FNNNCP), it is also not much worse (Discr two-sample test, p-value ≈ 0.14). Moreover, except for scrubbing, each stage has a significant impact on Discr after correction for multiple hypotheses (Wilcoxon signed-rank statistic, p-values all < 0.001).
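The per-stage significance assessment can be sketched with scipy's Wilcoxon signed-rank test on paired Discr values; the numbers below are synthetic placeholders, not the study's results:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Synthetic paired Discr values for 96 pipeline pairs that are identical
# except for one stage's option; option B is constructed to be slightly worse.
discr_a = rng.uniform(0.80, 0.95, size=96)
discr_b = discr_a - rng.uniform(0.005, 0.05, size=96)

# Paired one-sided test: does option A yield systematically higher Discr?
stat, p = wilcoxon(discr_a, discr_b, alternative="greater")
# With several such stage-wise tests, a Bonferroni correction would
# multiply p by the number of comparisons, as in the paper.
```

The signed-rank test is appropriate here because each comparison is paired (the rest of the pipeline is held fixed) and no Gaussian assumption on the Discr differences is needed.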
Another choice is whether to estimate connectomes using functional or diffusion MRI (Figure 3B).
Whereas both data acquisition strategies have known problems [37], the Discr of the two experimental modalities has not been directly compared. Using four datasets from CoRR that acquired both fMRI and dMRI on the same subjects, with quite similar demographic profiles, we tested whether fMRI- or dMRI-derived connectomes were more discriminable. We compared the best-performing fMRI pipeline (FNNNCP) against the dMRI pipeline with the CC200 parcellation. For three of the four datasets, dMRI connectomes were more discriminable. This is not particularly surprising, given the susceptibility of fMRI data to changes in state rather than trait (e.g., the amount of caffeine consumed prior to the scan [35]).
The above results motivate investigating which aspects of the dMRI analysis strategy were most effective. We focus on two criteria: how to scale the weights of connections, and how many regions of interest (ROIs) to use. For scaling the connection weights, we consider three options: using the raw edge weights ("Raw"), taking the log of the edge weights ("Log"), and ranking the non-zero edge weights in sequentially increasing order ("Rank"). Figure 3C.i shows that both the rank and log transforms significantly exceed raw edge weights (Wilcoxon signed-rank statistic, sample size = 60, p-values all < 0.001). Figure 3C.ii shows that parcellations with larger numbers of ROIs tend to have higher Discr. Unfortunately, most parcellations with semantic labels (e.g., visual cortex) have hundreds, not thousands, of parcels. This result therefore motivates the development of more refined semantic labels.
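The three edge-weight scalings can be sketched as follows; the zero-handling in the log transform is an assumed convention, since the paper does not specify it:

```python
import numpy as np
from scipy.stats import rankdata

def transform_edges(W, how):
    """The three edge-weight scalings for a connectome W: raw, log, and
    ranking the non-zero edges in sequentially increasing order."""
    W = np.asarray(W, dtype=float)
    if how == "raw":
        return W
    if how == "log":
        # Offset by half the smallest non-zero weight to guard log(0);
        # this particular offset is an assumption of this sketch.
        return np.log(W + W[W > 0].min() / 2)
    if how == "rank":
        out = np.zeros_like(W)
        nz = W > 0
        out[nz] = rankdata(W[nz])          # sequential ranks of non-zeros
        return out
    raise ValueError(f"unknown transform: {how}")
```

Ranking discards the heavy right tail of raw streamline counts while preserving ordering, which is one plausible reason it improves Discr.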

Optimizing Discr improves downstream inference performance
We next examined the relationship between the Discr of each pipeline and the amount of information it preserves about two properties of interest: sex and age. Based on the simulations above, we expect that analysis pipelines with higher Discr will yield connectomes with more information about covariates. Indeed, Figure 4 shows that, for virtually every dataset including sex and age annotation (22 in total), pipelines with higher Discr tend to preserve more information about both covariates. The amount of information is quantified by the effect size of the distance correlation DCorr (which is exactly equivalent to MMD [38,39]), a statistic that quantifies the magnitude of association for both linear and nonlinear dependence structures. In contrast, if one were to use PICC, DISCO, or I2C2 to select the optimal pipeline, subsequent predictive performance would degrade for many datasets. AFPI performs similarly to Discr on this dataset. These results are highly statistically significant: the slopes of effect size versus Discr and AFPI across datasets are significantly positive for both age and sex in 82 and 95 percent of all studies, respectively (robust Z-test, α = 0.05). DISCO performs poorly essentially always, because k-sample tests are designed to perform well with many samples from a small number of different populations, whereas questions of reproducibility across repeated measurements involve a few samples from each of many different populations.

Figure 3: (A) Pipelines are aggregated for a particular analysis step, with pairwise comparisons with the remaining analysis options held fixed. The beeswarm plot shows the difference between the overall best performing option and the second best option for each stage (mean shown as a larger black dot); the x-axis label indicates the best performing strategy. The best strategies are FNIRT, no frequency filtering, no scrubbing, global signal regression, the CC200 parcellation, and the ranked edge transformation. A Wilcoxon signed-rank test is used to determine whether the mean for the best strategy exceeds that of the second best: a * indicates that the p-value is at most 0.001 after Bonferroni correction. Of the best options, only no scrubbing is not significantly better than the alternative strategies. Note that the options that perform marginally best are not significantly different from the best performing strategy overall, shown in Figure 2. (B) A comparison of the stabilities for the 4 datasets with both fMRI and dMRI connectomes. dMRI connectomes tend to be more discriminable, in 14 of 20 total comparisons. (C.i) Comparing raw edge weights (Raw), ranking (Rank), and log-transforming the edge weights (Log) for the diffusion connectomes, the Log and Rank transformed edge weights tend to show higher Discr than Raw. (C.ii) As the number of ROIs increases, Discr tends to increase.
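The DCorr effect size used above can be sketched with the biased sample estimator built from doubly centered distance matrices; dedicated packages provide unbiased estimators and hypothesis tests, so this is illustrative only:

```python
import numpy as np

def _centered(z):
    """Doubly centered Euclidean distance matrix of the rows of z."""
    D = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    return D - D.mean(axis=0) - D.mean(axis=1)[:, None] + D.mean()

def dcorr(x, y):
    """Biased sample distance correlation between paired samples x and y
    (each an n-by-d array of observations)."""
    A = _centered(np.asarray(x, dtype=float))
    B = _centered(np.asarray(y, dtype=float))
    dcov2 = (A * B).mean()                        # squared distance covariance
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))
```

For y a deterministic linear function of x, dcorr equals 1; for independent samples the biased estimator concentrates near 0.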

Discr in Genomics
The first genomics study aimed to explore variation in gene expression across human induced pluripotent stem cell (hiPSC) lines [20], and includes RNAseq data from 101 healthy individuals, comprising 38 males and 63 females. Expression was interrogated across donors by studying between one and seven replicated iPSC lines from each donor, yielding bulk RNAseq data from a total of 317 individual hiPSC lines. While the pipeline includes many steps, for simplicity we focus here on (1) counting and (2) normalizing. The two counting approaches we study are the raw counts from the hiPSC lines and counts-per-million (CPM). Given counts, we consider four different normalization options: raw, rank, and log-transformed (as described above), as well as mean-centering (normalizing each sample to have an average count of 0). The task of interest was to identify the sex of the individual.

The second genomics study [21] includes 331 individuals, consisting of 135 patients with non-metastatic cancer and 196 healthy controls, each with eight DNA samples. The study leverages a PCR-based assay called the Repetitive element aneuploidy sequencing system to analyze ~750,000 amplicons distributed throughout the genome to investigate the presence of aneuploidy (abnormal chromosome counts) in samples from cancer patients (see Appendix E.1 for more details). The possible processing strategies include using the raw amplicons or the amplicons downsampled to bins of 5 × 10^4 bases, 5 × 10^5 bases, or 5 × 10^6 bases, or to the individual chromosome level (the resolution of the data), followed by normalizing through the previously described approaches (Raw, Rank, Log), yielding 5 × 3 = 15 possible strategies in total. The task of interest was to identify whether the sample was collected from a cancer patient or a healthy control.
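The counting and normalization options can be sketched as follows; the pseudo-count in the log transform and the per-sample ranking are assumed conventions rather than the studies' exact choices:

```python
import numpy as np
from scipy.stats import rankdata

def cpm(counts):
    """Counts-per-million scaling of a (samples x genes) count matrix."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True) * 1e6

def normalize(X, how):
    """The four normalization options: raw, per-sample rank, log with a
    pseudo-count, and mean-centering each sample to an average of 0."""
    X = np.asarray(X, dtype=float)
    if how == "raw":
        return X
    if how == "rank":
        return np.apply_along_axis(rankdata, 1, X)
    if how == "log":
        return np.log(X + 1)  # pseudo-count guards against log(0)
    if how == "center":
        return X - X.mean(axis=1, keepdims=True)
    raise ValueError(f"unknown normalization: {how}")
```

Each counting/normalization pair defines one processing strategy, whose Discr can then be compared across strategies exactly as for the neuroimaging pipelines.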
Across both tasks, the slope for discriminability is positive, and for the first task the slope is significantly greater than zero (robust Z-test, p-value = .001, α = .05). Similarly, for PICC, the slope is positive in both datasets and the effect is significant. Neither I2C2 nor DISCO provides value for subsequent inference. While AFPI is informative for the sex study, it provides no information for the second study, producing a statistic of 1 for all of the processing strategies considered.

Figure 4: For each study, the effect size is regressed onto the statistic of interest. Color and line width correspond to the study and number of scans, respectively (see Figure 2B). The solid black line is the weighted mean over all studies. Discr is the only statistic for which all slopes are positive. Moreover, the corrected p-value [40,41] is significant across datasets for both covariates (all p-values < .001). This indicates that pipelines with higher Discr correspond to larger effect sizes for the covariate of interest, and that this relationship is stronger for Discr than for other statistics. A similar experiment is performed on two genomics datasets, measuring the effects due to sex and whether an individual has cancer. Each point represents a single processing strategy (there is only one dataset per covariate). (III) indicates the fraction of datasets with positive slopes and with significantly positive slopes, ranging from 0 ("None", red) to 1 ("All", green), at both the task and aggregate level. Discr is the statistic for which the most datasets have positive slopes, and for which the most datasets have significantly positive slopes, across the neuroimaging and genomics datasets considered. Appendix E.2 details the methodologies employed.

Discussion
We propose the use of the Discr statistic as a simple and intuitive measure for experimental design featuring multiple measurements. Numerous efforts have established the value of quantifying reliability, repeatability, and replicability (or discriminability) using parametric measures such as ICC and I2C2. However, they have not been used to optimize reproducibility-that is, they are only used post-hoc to determine reproducibility, not used as criteria for searching over the design space-nor have non-parametric multivariate generalizations of these statistics been available. We derive goodness of fit and comparison (equality) tests for Discr, and demonstrate via theory and simulation that Discr provides numerous advantages over existing techniques across a range of simulated settings. Our neuroimaging and genomics use-cases exemplify the utility of these features of the Discr framework for optimal experimental design.
Discr provides a number of connections with related statistical algorithms worth further consideration. Discr is related to energy statistics [42], in which the statistic is a function of distances between observations [43]. Energy statistics provide approaches for goodness-of-fit (one-sample), equality (two-sample), and multi-sample testing [27]. However, we note an important distinction: a goodness of fit test for discriminability can be thought of as a K-sample test in the classical literature, and a comparison of discriminabilities is analogous to a comparison of K-sample tests. Further, like Discr, energy statistics make relatively few assumptions. However, energy statistics require a large number of measurements per item, which is often unsuitable for biological data where we frequently have only a small number of repeated measurements. Discr is most closely related to multiscale generalized correlation (MGC) [38,39], which, like Discr, combines energy statistics with nearest-neighbor ideas. Like many energy-based statistics, Discr relies upon the construction of a distance matrix. As such, Discr generalizes readily to high-dimensional data, and many packages accelerate distance computation in high dimensions [44].
Limitations

While Discr provides experimental design guidance for big data, other considerations may play a role in a final determination. For example, the connectomes analyzed here are resting-state, as opposed to task-based, fMRI connectomes. Recent literature suggests that the global signal in a rs-fMRI scan may be a nuisance variable for task-based approaches [45,46]. Thus, while Discr is an effective tool for experimental design, knowledge of the techniques in conjunction with the inference task remains a necessary component of any investigation. Further, in this study we consider only the Euclidean distance, which may not be appropriate for all datasets of interest. For example, if the measurements live on a manifold (such as images, text, speech, and networks), one may be interested in dissimilarity or similarity functions other than the Euclidean distance. To this end, Discr readily generalizes to alternative comparison functions, and will produce an informative result as long as the choice of comparison function is appropriate for the measurements.
It is important to emphasize that Discr, as well as the related statistics, is neither necessary nor sufficient for a measurement to be practically useful. For example, categorical covariates, such as sex, are often meaningful in an analysis but not discriminable. Human fingerprints are discriminable, but typically not biologically useful. In addition, none of the statistics studied here is immune to sample characteristics, so interpreting results across studies deserves careful scrutiny. For example, a sample with variable ages will show increased inter-subject dissimilarity for any metric dependent on age (such as the connectome). With these caveats in mind, Discr remains a key experimental design consideration in a wide variety of settings.
Conclusion

The use-cases provided herein serve to illustrate how Discr can be used to facilitate experimental design and mitigate reproducibility issues. We envision that Discr will find substantial applicability across disciplines and sectors beyond brain imaging and genomics, such as pharmaceutical research. To this end, we provide open-source implementations of Discr for both Python and R [47,48]. Code for reproducing all the figures in this manuscript is available at https://neurodata.io/mgc.

bioRxiv preprint doi: https://doi.org/10.1101/802629; this version posted April 9, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.

… as the measurement protocol f and the analysis choices g.
The population Discr represents a property of the distribution of θ_i. Since in real data we do not observe the true distribution, we instead rely on the sample Discr. Suppose a dataset consists of i ∈ {1, …, n} items, where each item i has J_i repeat measurements. The sample Discr, D̂, is defined as the fraction of comparisons in which a within-item distance ‖x_i^j − x_i^{j′}‖ is smaller than a between-item distance ‖x_i^j − x_{i′}^{j″}‖. It can be shown [49] that, under the multivariate additive noise model x_i^j = θ_i + ε_i^j with E[ε_i^j] = c, the sample Discr D̂ is both a consistent and unbiased estimator of the population Discr.
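The sample Discr can be sketched in a few lines of Python. This is a minimal illustration, not the authors' optimized implementation [47,48]; the toy data and the plain Euclidean distance are assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sample_discr(X, ids):
    """Sample Discr: the fraction of comparisons in which a within-item
    distance is smaller than a between-item distance."""
    D = squareform(pdist(np.asarray(X, dtype=float)))
    ids = np.asarray(ids)
    hits, total = 0, 0
    for a in range(len(ids)):
        for b in range(len(ids)):
            if a == b or ids[a] != ids[b]:
                continue  # (a, b) must be two measurements of the same item
            other = ids != ids[a]  # measurements of all other items
            hits += int(np.sum(D[a, other] > D[a, b]))
            total += int(np.sum(other))
    return hits / total

# two measurements each of two well-separated items
X = [[0.0], [0.1], [5.0], [5.1]]
print(sample_discr(X, ids=[0, 0, 1, 1]))  # prints 1.0
```

When repeated measurements of each item cluster tightly relative to other items, the statistic approaches 1; for exchangeable measurements it hovers near 0.5.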

Appendix B. Discr Provides an Informative Bound for Inference.
During experimental design, the extent of subsequent inference tasks may be unknown. A natural question is: what are the implications of selecting a discriminable experimental design? Formally, assume the task of interest is binary classification: that is, Y = {0, 1}, and we seek a classifier h : X → Y. The goal of experimental design in this context is to choose the options (f*, g*) that minimize the classification loss L_{f,g}(h) = P(h(x) ≠ y). For a fixed (f, g), the minimal prediction error is achieved by the Bayes classifier [50], h*_{f,g}(x) = argmax_{y ∈ Y} P(y | x), and we let L*_{f,g} denote the Bayes error, that is, the error achieved by h*_{f,g}.
Theorem B.1. Assume the multivariate Gaussian additive noise setting; that is, x_i^j = θ_i + ε_i^j with Gaussian θ_i and ε_i^j. Then there exists a decreasing function γ(·), which depends only on θ and y, such that L*_{f,g} ≤ γ(D_{f,g}). That is, the Bayes error can be upper bounded by a decreasing function of Discr, as shown in the proof below. As a direct consequence of this theorem, we obtain:

Corollary B.1. Assume (f₁, g₁) and (f₂, g₂) are two analysis strategies, and suppose that D_{f₁,g₁} > D_{f₂,g₂}. Then, since γ is decreasing, γ(D_{f₁,g₁}) < γ(D_{f₂,g₂}).

In other words, the Bayes error achieved by strategy (f₁, g₁) is upper bounded by a quantity no larger than the corresponding bound for strategy (f₂, g₂). Consequently, under the described setting, the pipeline that achieves a higher Discr can facilitate improved inference relative to competing strategies, despite the fact that the task is unknown during data acquisition and analysis. Complementarily, if we instead consider the predictive accuracy 1 − L*_{f,g}, an analogous argument yields a lower bound on the predictive accuracy via an increasing function of Discr. That is, in the context of the corollary, a more discriminable pipeline will tend to have higher accuracy on an arbitrary predictive task.
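Under the theorem's bound, the corollary's argument reduces to a single chain of inequalities (a restatement of the reasoning above, using the fact that γ is decreasing and D_{f₁,g₁} > D_{f₂,g₂}):

```latex
L^{*}_{f_1,g_1} \;\le\; \gamma\!\left(D_{f_1,g_1}\right) \;\le\; \gamma\!\left(D_{f_2,g_2}\right)
```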

Proof of Theorem (B.1).
Consider the additive noise setting, that is, x_i^j = θ_i + ε_i^j. To bound the probability above, we bound the θ_i − θ_{i′} term and the ε term separately. We start with the first term. Here, σ₂² = tr(Σ_θ) is the trace of the covariance matrix of θ_i, and we can apply Markov's inequality for any t > 0 (Inequality 4). Let σ₁² = tr(Σ_ε) denote the trace of the covariance matrix of ε_i^j, let a and b be two constants, and let t² = √2·a·σ₁·σ₂. If a²σ₁² ≥ 2σ₂², then θ ≤ 1. The Paley–Zygmund inequality [51] states that for all 0 ≤ θ ≤ 1 and Z ≥ 0, P(Z > θ·E[Z]) ≥ (1 − θ)²·E[Z]²/E[Z²]; plugging in the θ above yields one bound, and plugging t² into Inequality (4) yields the other. Using the fact that the θ's and ε's are independent, we can combine the two inequalities. Note that the resulting bound holds even if a²σ₁² < 2σ₂², as the right-hand side becomes greater than 1. We thus obtain a bound on σ₂/σ₁. To obtain a bound on the Bayes error, we apply Devijver and Kittler's result [52]:

L ≤ 2π₀π₁ / (1 + π₀π₁ · Δμᵀ Σ⁻¹ Δμ).
Here, π₀ and π₁ are the prior probabilities of the two classes, and Δμ is the difference between the class means. Since ε is assumed to be independent of y, Δμ = E(x | y = 0) − E(x | y = 1) = E(θ | y = 0) − E(θ | y = 1).

C.1 Goodness of Fit Test
Recall the goodness of fit test, shown in Equation (1). We approximate the distribution of D̂ under the null through a permutation approach. The item labels of our N samples are first permuted at random, and D̂₀,N is computed each time given the observed data X and the permuted labels. For a level α significance test, we compare D̂ to the (1 − α) quantile Q₁₋α of the empirical null distribution of D̂₀,N, and reject the null hypothesis if D̂_N > Q₁₋α. This approach provides a consistent and valid test under general assumptions.
Note that the permutation-based approach requires r computations of the sample Discr. The total computational complexity is then O(N² max(p, r·s)). This approach is only linear in the number of desired repetitions, and is therefore sensible for most settings in which the sample Discr can itself be computed. Moreover, we can greatly speed up this computation through parallelization. With T cores, the computational complexity is instead O(N² max(p, (r/T)·s)), as shown in Algorithm 1. We extend this goodness of fit test to both PICC and I2C2 to provide a robust p-value associated with each statistic of interest. Note that the permutation approach can be generalized to any statistic quantifying repeatability based on repeated measurements.
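The permutation goodness-of-fit procedure can be sketched as follows. This is a minimal illustration: `toy_stat` is a hypothetical stand-in for the sample Discr, and the data are synthetic:

```python
import numpy as np

def permutation_pvalue(X, ids, stat, r=200, seed=0):
    """Permutation goodness-of-fit test: shuffle the item labels r times to
    build the null distribution of a repeatability statistic, and return the
    fraction of permuted statistics at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(ids)
    observed = stat(X, ids)
    null = np.array([stat(X, rng.permutation(ids)) for _ in range(r)])
    return (1 + int(np.sum(null >= observed))) / (1 + r)

# hypothetical stand-in for the sample Discr: negative mean within-item spread
def toy_stat(X, ids):
    d = [abs(X[a][0] - X[b][0])
         for a in range(len(ids)) for b in range(len(ids))
         if a < b and ids[a] == ids[b]]
    return -float(np.mean(d))

# four items, two tightly repeated measurements each
X = [[0.0], [0.05], [5.0], [5.05], [10.0], [10.05], [15.0], [15.05]]
p = permutation_pvalue(X, [0, 0, 1, 1, 2, 2, 3, 3], toy_stat)
```

Because each permuted statistic is computed independently, the loop over `r` parallelizes trivially across cores, as noted above.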
Algorithm 1 Discr Goodness of Fit Test.
Input:
(1) x_i^j, j ∈ [J_i], i ∈ [n]: the observed data, n items each featuring J_i measurements.
(2) r: an integer for the number of permutations.
Output: p ∈ [0, 1], the p-value associated with the test.
1: Compute the observed sample Discr.
2: (The for-loop below can be parallelized over T cores, as the iterations are independent.)
3: for i in 1, …, r do
4:   Apply a random shuffling to the item labels of the measurements.
5:   Compute the sample Discr under the shuffled labels.
6: end for
7: The p-value is the fraction of permuted statistics at least as extreme as the observed sample Discr.

C.2 Comparison Test

We implement comparison testing using a permutation approach, similar to the goodness of fit test. First, we compute the observed difference in Discr between the two design choices. The null distribution of the difference in Discr is constructed by first taking random convex combinations of the observed data from each of the two design choices (the "randomly combined datasets").
Discr is computed for each of the two randomly combined datasets for each permutation. Then, the differences in Discr across all pairs of randomly combined datasets are computed, and the observed statistic is compared with these null differences. The p-value is the fraction of null differences at least as extreme as the observed statistic. Note that we can use this approach for both one- and two-tailed hypotheses, for an experimental design having higher Discr, lower Discr, or unequal Discr relative to a second approach; we implement all three in the software implementation of the comparison test. The algorithm for the comparison test is shown in Algorithm 2, with the alternative hypothesis as specified in Equation (2). The computational complexity is O((r/T)·N²·max(p, max_i(s_i))). Note that for each permutation, the limiting step is the computation of the Discr in O(N² max(p, s)); this is offset through parallelization over T cores in the implementation. We extend this comparison test to all competing approaches to provide a robust p-value associated with each statistic of interest, for similar reasons to the above. Again, this permutation approach can be generalized to any statistic quantifying repeatability based on repeated measurements.
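The comparison test can be sketched as follows. This is a simplified illustration of the convex-combination null with a hypothetical repeatability statistic standing in for Discr; the packaged implementations [47,48] are the reference versions:

```python
import numpy as np

def comparison_test(x, z, ids, stat, r=100, seed=0):
    """Sketch of the comparison test: the null distribution of the difference
    in a repeatability statistic is built from random convex combinations of
    the two (measurement-shuffled) datasets, and the p-value is the fraction
    of null differences at least as extreme as the observed difference."""
    rng = np.random.default_rng(seed)
    x, z = np.asarray(x, float), np.asarray(z, float)
    observed = stat(x, ids) - stat(z, ids)
    null = np.empty((r, 2))
    for i in range(r):
        lam = rng.uniform()  # convex-combination weight
        for k, (a, b) in enumerate([(x, z), (z, x)]):
            perm = rng.permutation(len(ids))
            null[i, k] = stat(lam * a + (1 - lam) * b[perm], ids)
    diffs = (null[:, 0][:, None] - null[:, 1][None, :]).ravel()
    return (1 + int(np.sum(diffs >= observed))) / (1 + diffs.size)

# hypothetical repeatability statistic: negative mean within-item spread
def within_spread(X, ids):
    ids = np.asarray(ids)
    return -float(np.mean([np.abs(X[a] - X[b]).sum()
                           for a in range(len(ids)) for b in range(len(ids))
                           if a < b and ids[a] == ids[b]]))

ids = [0, 0, 1, 1, 2, 2, 3, 3]
x = np.array([[0.0], [0.01], [3.0], [3.01], [6.0], [6.01], [9.0], [9.01]])
z = np.random.default_rng(7).normal(size=(8, 1))
p = comparison_test(x, z, ids, within_spread, r=50)
```

The one-sided alternative shown here (`>=`) corresponds to H_A: D^(1) > D^(2); the other alternatives follow by flipping or two-siding the comparison.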

Algorithm 2 Discr Comparison Test. Our implementation of the permutation test for the hypothesis given in Equation (2) requires O((r/T)·N²·max(p, s)) time, where r is the number of permutations and T is the number of cores available for the permutation test. Above, the only alternative considered is H_A: D^(1) > D^(2); our implementation also provides strategies for H_A: D^(1) < D^(2) and H_A: D^(1) ≠ D^(2).
Input:
(1) x_i^j, j ∈ [J_i], i ∈ [n]: the observed data, n items each featuring J_i measurements, from the first sample.
(2) z_i^j, j ∈ [J_i], i ∈ [n]: the observed data, from the second sample.
(3) r: an integer for the number of permutations.
Output: p ∈ [0, 1], the p-value associated with the test.
1: Compute the Discr of the first sample.
2: Compute the Discr of the second sample.
3: Compute the observed difference in Discr between samples 1 and 2.
4: (The for-loop below can be parallelized over T cores, as each iteration is independent.)
5: for i in 1 : r do
6:   Generate a synthetic null dataset for each of the 2 samples, using a convex combination of the elements of each sample:
7:   for k in 1 : 2 do
8:     Apply a random shuffle to the measurements.
9:     Form a convex combination of random elements from each sample.
10:    Compute the Discr of the convexly combined elements.
11:   end for
12: end for
13: Compute all pairwise differences in Discr using the convexly combined samples:
14: for i in 1, …, r − 1 do
15:   for j in i + 1, …, r do
16:     Record the difference as a draw from the null distribution.
17:   end for
18: end for
19: The p-value is the fraction of null differences at least as extreme as the observed difference.

dMRI Analysis Pipelines

The dMRI connectomes were acquired as follows. The dMRI scans were corrected for eddy currents using FSL's eddy-correct [54]. FSL's "standard" linear registration pipeline was used to register the sMRI and dMRI images to the MNI152 atlas [54-57]. A tensor model was fit using DiPy [58] to obtain an estimated tensor at each voxel. A deterministic tractography algorithm was applied using DiPy's EuDX [58,59] to obtain streamlines, which indicate the voxels connected by an axonal fiber tract. Graphs are formed by contracting voxels into graph vertices depending on spatial [60], anatomical [61-64], or functional [65-68] similarity. Given a parcellation with vertices V and a corresponding mapping P(v_i) indicating the voxels within a region i, we contract our fiber streamlines as follows.
The region-level edge weight sums an indicator F_{u,w} over voxel pairs u ∈ P(v_i), w ∈ P(v_j), where F_{u,w} is true if a fiber tract exists between voxels u and w, and false if there is no fiber tract between voxels u and w. The specific parcellations leveraged are detailed in Kiar et al. [36], consisting of parcellations defined in the MNI152 space [61-68]. The graphs are then re-weighted using the aforementioned weighting schemes described in the fMRI Analysis Pipelines (Appendix E.1); namely, the raw, ranked, and log edge-weights. All parcellations are available in the neuroparc human brain atlases [53].

Genomics Analysis Pipeline

… NEBNext Ultra II Q5 Master Mix (New England Biolabs cat # M0544S), and 5 µL of DNA containing 5% of the PCR product from the first round. The cycling conditions were: one cycle of 98 °C for 120 s, then 15 cycles of 98 °C for 10 s, 65 °C for 15 s, and 72 °C for 120 s. Amplification products from the second round were purified with AMPure XP beads (Beckman cat # A63880), as per the manufacturer's instructions, prior to sequencing. As noted above, each sample was amplified in eight independent PCRs in the first round. Each of the eight independent PCRs was then re-amplified using index primers in the second PCR round. Bowtie2 was then used to align reads to the human reference genome assembly GRCh37 [69] for each well. After alignment to ∼750,000 amplicons, the wells were downsampled into non-overlapping windows of 5 × 10⁴ bases, 5 × 10⁵ bases, 5 × 10⁶ bases, or to the individual chromosome level (the resolution of the data).
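The window-based downsampling described above can be sketched as follows (a minimal single-chromosome illustration; the read positions, chromosome length, and the 5 × 10⁴-base window are hypothetical):

```python
import numpy as np

def downsample_to_windows(read_positions, chrom_length, window=50_000):
    """Count aligned reads within non-overlapping fixed-size windows
    along a single chromosome (0-based read positions)."""
    n_bins = -(-chrom_length // window)      # ceiling division
    counts = np.zeros(n_bins, dtype=int)
    for pos in read_positions:
        counts[pos // window] += 1
    return counts

# hypothetical read positions on a 200 kb toy chromosome
counts = downsample_to_windows([10, 49_999, 50_000, 120_000, 199_999],
                               chrom_length=200_000)
```

Repeating this per chromosome at each window size (5 × 10⁴, 5 × 10⁵, 5 × 10⁶, or whole-chromosome) yields the multi-resolution count vectors used downstream.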

E.2 Effect Size Investigation
In this investigation, we are interested in learning how maximization based on the observed notion of reliability correlates with real performance on a downstream inference task. Recalling Corollary (B.1), we explore the implications of this corollary in a large neuroimaging dataset provided by the Consortium for Reliability and Reproducibility [19], and demonstrate that selection of the experimental design via Discr, in fact, facilitates improved downstream inference on both a regression and classification task. We further extend this to two separate genomics datasets investigating classification tasks, and again demonstrate that selection of experimental design via Discr improves downstream inference. This provides strong motivation for leveraging the Discr for experimental design.
Ideally, for a particular summary reference statistic, a high value will generally correlate with a positive effect size. For datasets i = 1, …, M (where M is the total number of datasets), analysis strategies j = 1, …, 192 (for 192 total analysis strategies), and summary reference statistics of interest indexed by k (Discr, PICC, AFPI, I2C2, DISCO), we fit the standard linear regression model Y = βX + ε, where we model the effect size Y, estimated by DCorr [70], via a linear relationship with X, the observed reference statistic for approach k, with coefficient β. Note that the interpretation of β is the expected change in the effect size Y due to a single unit change in the observed reference statistic X. Both Y and X are uniformly normalized across all strategies within a single dataset to facilitate intuitive comparison across methods. For each reference statistic k, we pose the hypothesis H₀: β ≤ 0 against H_A: β > 0. Acceptance of the alternative hypothesis has the interpretation that an increase in the observed reference statistic X tends to correspond to an increase in the observed effect size Y, and the relevant test is a one-sided Z-test. To robustify against model assumptions, we use robust standard errors [41]. Rejecting the null in favor of the alternative provides evidence that an increase in the sample statistic corresponds to an increase in the observed effect size, where the responses (age, sex, cancer status) were not considered at the time the data were analyzed nor when the reference statistics were computed. This provides evidence that the statistic is informative for experimental design within the context of this investigation.
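The regression with robust standard errors described above can be sketched as follows (a numpy-only illustration with synthetic data and HC0 sandwich errors; the manuscript's analysis instead uses R's lm [71] with robust standard errors [41]):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(size=n)                                  # normalized reference statistic
y = 0.5 * x + rng.normal(scale=0.1 + 0.2 * x, size=n)    # normalized effect size

# OLS fit of y = beta0 + beta1 * x
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# HC0 robust (sandwich) covariance: (X'X)^-1 X' diag(e^2) X (X'X)^-1
XtX_inv = np.linalg.inv(X.T @ X)
cov = XtX_inv @ (X.T * resid**2) @ X @ XtX_inv
se = np.sqrt(np.diag(cov))

z = beta[1] / se[1]                            # Z-statistic for H_A: beta1 > 0
p_one_sided = 0.5 * (1.0 - erf(z / sqrt(2.0))) # P(Z >= z) under the null
```

The sandwich estimator keeps the Z-test valid under the heteroskedastic noise deliberately injected here, which is the point of using robust standard errors.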
Model fitting for this investigation is conducted using the lm function in the R programming language [71].

Figure 6: dMRI Dataset Descriptions. In the above table, Dir corresponds to the number of diffusion directions. Rows with NA entries do not have available metadata associated with the scanning protocol. The sample stabilities correspond to the Discr of the pipeline with the CPAC200 parcellation and the log-transformed edges.
Useful Data Links

All relevant analysis scripts and data for figure reproduction in this manuscript are made publicly available at https://neurodata.io/mgc.
