Eliminating accidental deviations to minimize generalization error: applications in connectomics and genomics

Reproducibility, the ability to replicate scientific findings, is a prerequisite for scientific discovery and clinical utility. Troublingly, we are in the midst of a reproducibility crisis. A key to reproducibility is that multiple measurements of the same item (e.g., experimental sample or clinical participant) under fixed experimental constraints are relatively similar to one another. We demonstrate that existing reproducibility statistics, such as intra-class correlation coefficient and fingerprinting, are not valid measures of reproducibility, in that they can provide unreasonably low or high results, even without model misspecification. We therefore propose a novel statistic, discriminability, which quantifies the degree to which an individual’s samples are relatively similar to one another, without restricting the data to be univariate, Gaussian, or even Euclidean. Using this statistic, we introduce the possibility of optimizing experimental design via increasing discriminability and prove that optimizing discriminability improves performance bounds in subsequent inference tasks. In extensive simulated and real datasets (focusing on brain imaging and demonstrating on genomics), only optimizing data discriminability improves performance on all subsequent inference tasks for each dataset. We therefore suggest that designing experiments and analyses to optimize discriminability may be a crucial step in solving the reproducibility crisis, and more generally, mitigating accidental measurement error.

1 Introduction

Understanding variability, and the sources thereof, is fundamental to all of statistical science. Even the first papers on modern statistical methods concerned themselves with distinguishing systematic from accidental variability [1]. Systematic deviations are true deviations in the process under study, for example, actual changes in brain activity or connectivity within an individual over time through experimental manipulation, clinical intervention, or natural biological variation. Accidental deviations, however, correspond to measurement noise and "batch effects", sources of variance that are not of scientific interest. Delineating between these two sources of variability is a central quest of modern statistics, and failure to do so has been problematic in modern science.
Scientific reproducibility, repeatability, and reliability are central in data science, whether applied to basic discovery or clinical utility. As a rule, if results do not reproduce, we cannot justifiably trust them. The concept of reproducibility is closely related to the statistical concepts of stability [2] and robustness [3]. Engineering and operations research have long been concerned with reliability, as they require that their products be reliable under various conditions. Only recently did the general research community become interested in these issues, as individuals began noticing and publishing failures to reproduce across fields, including neuroscience and psychology [4][5][6].
A number of strategies have been suggested to resolve this "reproducibility crisis." For example, the editors of "Basic and Applied Social Psychology" have banned the use of p-values [7]. Unfortunately, an analysis of the publications since banning indicates that studies after the ban tended to overstate, rather than understate, their claims, suggesting that this proposal possibly had the opposite effect [8]. More recently, the American Statistical Association released a statement recommending banning the phrase "statistically significant" for similar reasons [9,10].
A different strategy has been to quantify the reproducibility or reliability of one's data by measuring each sample (or individual) multiple times. This practice has been particularly popular in brain imaging, where many studies have been devoted to quantifying the reproducibility of different properties of the data [11][12][13][14]. However, this approach has severe limitations. Perhaps the most problematic aspect of this approach is clear from the popular adage, "garbage in, garbage out" [15]. If the measurements themselves are not sufficiently reproducible, then scalar summaries of the data cannot be reproducible either. This perspective, the primacy of measurement, is fundamental in statistics, so much so that one of the first modern statistics textbooks, R.A. Fisher's "The Design of Experiments" [16], is focused on taking measurements.
Motivated by Fisher's work on experimental design and Spearman's work on measurement, rather than recommending different post-data-acquisition inferential techniques or computing the reproducibility of data after collection, we take a different approach. Specifically, we advocate for designing experiments to maximize something like reproducibility. Experimental design has a rich history, including in psychology [17] and neuroscience [18,19]. The vast majority of work in experimental design, however, focuses on designing an experiment to answer a particular scientific question. In this big data age, experiments are often designed to answer many questions, including questions not even considered at the time of data acquisition. How can one even conceivably design experiments to obtain data that are particularly useful for such questions?
We propose to design experiments to optimize the discriminability of individual items (for example, participants in a study, or samples in an experiment). This idea is closely inspired by, and closely related to, ideas proposed in Cronbach's "Theory of Generalizability" [20,21].
To do so, we leverage our recently introduced Discr statistic [22]. Discr quantifies the degree to which multiple measurements of the same item are more similar to one another than they are to measurements of other items [23], essentially capturing the desiderata of Spearman from over 100 years ago. This statistic has several advantages over existing statistics that one could potentially use to optimize experimental design. First, it is nonparametric, meaning that its validity does not depend on any parametric assumptions, such as Gaussianity. Second, it can readily be applied to multivariate Euclidean data, or even non-Euclidean data (such as images, text, speech, or networks). Third, it can be applied to any stage of the data science pipeline, from data acquisition to data wrangling to data inference. This manuscript provides the following contributions:
1. Demonstrates that Discr is the only existing valid statistic for quantifying data discriminability.
2. Formalizes hypothesis tests to assess discriminability of a dataset, and whether one dataset or approach is more discriminable than another.
3. Provides sufficient conditions for Discr to provide a lower bound on predictive accuracy.
4. Illustrates, on 28 datasets from the Consortium for Reliability and Reproducibility (CoRR) [24], which preprocessing pipelines maximize Discr, and demonstrates that maximizing Discr is significantly associated with maximizing the amount of information about multiple covariates, in contrast to other reliability statistics.
5. Replicates the above on multiple ultrahigh-dimensional genomics datasets.
6. Provides all source code and data derivatives open access at https://neurodata.io/mgc.

The discriminability statistic
Testing for discriminability is closely related to, but distinct from, k-sample testing. In k-sample testing we observe k groups, and we want to determine whether they differ at all. In discriminability, the k groups are in fact k different items (or individuals), and we care about whether replicates within each of the k groups are close to each other, which is a specific kind of difference. As a general rule, if one can specify the kind of difference one is looking for, then tests can have more power for that particular kind of difference. The canonical example is a t-test: if one only looks at whether the means differ across the groups, one obtains higher power than if one also looks for differences in variances.
To give a concrete example, assume one item has replicates on a circle with radius one, with random angles. Consider another item whose replicates live on another circle, concentric with the first, but with a different radius. The two items differ, and many nonparametric two-sample tests would indicate so (because one can perfectly identify the item by the radius of the sample). However, the discriminability in this example is not one, because some samples of an item are further from other samples of that same item than they are from samples of the other item.
On this basis, we developed our discriminability test statistic (Discr), which is inspired by, and builds upon, nonparametric two-sample and k-sample testing approaches called "energy statistics" [25] and "kernel mean embeddings" [26] (which are equivalent [27]). These approaches compute all pairwise similarities (or distances) and operate on them. Discr builds on these approaches in two ways. First, rather than operating on all pairwise distances, Discr considers only the subset that involve measurements of the same item; all other distances are discarded, as they do not provide information about the question of interest here. Second, Discr effectively operates on the ranks of the distances, rather than the distances themselves, rendering it robust to monotonic transformations of the data [28]. Figure 1 shows three different simulations illustrating the differences between Discr and other reliability statistics, including the fingerprinting index (Fingerprint) [29], the intraclass correlation coefficient (ICC) [30], and Kernel [26] (see Appendix A for details). All four statistics operate on the pairwise distance matrices in column (B). However, Discr, unlike the other statistics, considers only the elements of each row whose magnitudes are smaller than the within-item distance. Thus, Discr explicitly quantifies the degree to which multiple measurements of the same item are more similar to one another than they are to measurements of other items.

Definition 1 (Discriminability). Assume we have n items, where each item has s_i measurements, yielding N = Σ_i s_i total measurements. For simplicity, assume s_i = 2 for the definition below, and that there are no ties. Then Discr can be computed as follows (for a more formal definition and pseudocode, please see Appendix B):
1. Compute the distance between all pairs of samples (resulting in an N × N matrix), Figure 1(B).
2. Identify replicated measurements of the same individual (green boxes). The number of green boxes is g = n × 2.
3. For each measurement, identify measurements that are more similar to it than the other measurement of the same item, i.e., measurements whose magnitude is smaller than that in the green box (orange boxes). Let f be the total number of orange boxes.
4. Discriminability is defined as the fraction of comparisons in which the within-item distance is smaller than the across-item distance, Discr = 1 − f / (g × (N − 2)).

A high Discr indicates that within-item measurements tend to be more similar to one another than across-item measurements. See Wang et al. [31] for a theoretical analysis of Discr as compared to these and other data reliability statistics.
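To make the computation concrete, the following is a minimal Python sketch of steps 1-4, assuming a data matrix X of shape (N, p) and an array ids of item labels; the function name and interface are illustrative, not those of our reference implementations [49,50].

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def discr_stat(X, ids):
    """Sample Discr: the fraction of within-item distances that are
    smaller than the across-item distances they are compared against."""
    D = squareform(pdist(X))          # step 1: N x N distance matrix
    ids = np.asarray(ids)
    comparisons = violations = 0
    for i in range(len(ids)):
        for j in range(len(ids)):
            if i == j or ids[i] != ids[j]:
                continue              # keep only within-item pairs (green boxes)
            d_within = D[i, j]
            across = ids != ids[i]    # distances to measurements of other items
            comparisons += across.sum()
            violations += (D[i, across] < d_within).sum()  # orange boxes
    return 1.0 - violations / comparisons
```

For the s_i = 2, no-ties case this reduces exactly to 1 − f/(g × (N − 2)) above, and it generalizes naturally to more than two replicates per item.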

Testing for discriminability
Letting R denote the reliability of a dataset with n items and s measurements per item, and R_0 denote the reliability of a dataset of the same size with zero item-specific information, the one-sample test for reliability is H_0: R = R_0 against H_A: R > R_0. One can use any 'data reliability' statistic for R and R_0 [31]. We devised a permutation test to obtain a distribution of the test statistic under the null, and a corresponding p-value. To evaluate the different procedures, we compute the power of each test, that is, the probability of correctly rejecting the null when it is false (which is one minus the type II error; see Appendix E.1 for details).

Testing for better discriminability
Letting R^(1) be the reliability of one dataset or approach, and R^(2) the reliability of a second, we have the following comparison hypothesis for reliability: H_0: R^(1) = R^(2) against H_A: R^(1) > R^(2). Again, we devised a permutation test to obtain the distribution of the test statistic under the null, and p-values (see Appendix E.2 for details).

Simulation settings
To develop insight into the performance of Discr, we consider several different simulation settings (see Appendix D for details). Each setting includes between 2 and 20 items, with 128 total measurements, in two dimensions (a generator sketch for one setting follows the list):
1. Gaussian: Sixteen items are each distributed according to a spherically symmetric Gaussian, therefore respecting the assumptions that motivate intraclass correlations.
2. Cross: Two items have Gaussian distributions with the same mean and different diagonal covariance matrices.
3. Ball/Circle: One item is distributed in the unit ball, the other on the unit circle; Gaussian noise is added to both.
4. XOR: Each of two items is a mixture of two spherically symmetric Gaussians, with means organized in an XOR fashion; that is, the means of the first item are (0, 1) and (1, 0), whereas the means of the second are (0, 0) and (1, 1). The implication is that many measurements from a given item are further away than any measurement of the other item.
5. No Signal: Both items have the same Gaussian distribution.
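As one concrete example, here is a minimal sketch of a generator for the XOR setting, using the means stated above; the noise level sigma and per-item sample counts are illustrative parameters, not the values used in Appendix D.

```python
import numpy as np

def sample_xor(n_per_item=64, sigma=0.1, seed=None):
    """Two items, each a mixture of two spherical Gaussians whose means
    are arranged in an XOR pattern."""
    rng = np.random.default_rng(seed)
    means = {0: np.array([[0, 1], [1, 0]]),   # item 0
             1: np.array([[0, 0], [1, 1]])}   # item 1
    X, ids = [], []
    for item, mu in means.items():
        comp = rng.integers(2, size=n_per_item)   # mixture component per draw
        X.append(mu[comp] + sigma * rng.standard_normal((n_per_item, 2)))
        ids += [item] * n_per_item
    return np.vstack(X), np.array(ids)
```

The output can be passed directly to the discr_stat sketch above.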

Theoretical properties of Discriminability
Under reasonably general assumptions, if within-item variability increases, predictive accuracy will decrease. Therefore, a statistic that is sensitive to within-item variance is desirable for optimal experimental design, regardless of the distribution of the data. Carmines and Zeller [32] introduce a univariate parametric framework in which predictive accuracy can be lower-bounded by a decreasing function of ICC; as a direct consequence, a strategy with a higher ICC will, on average, have higher predictive performance on subsequent inference tasks. Unfortunately, this valuable theoretical result is limited in its applicability, as it is restricted to univariate data, whereas big data analysis strategies often produce multivariate data. We therefore prove the following generalization of this theorem.

Theorem 2. Under the multivariate Gaussian mixture model plus additive Gaussian noise setting, Discr provides a lower bound on the predictive accuracy of a subsequent classification task. Consequently, a strategy with a higher Discr provably provides a higher bound on predictive accuracy than a strategy with a lower Discr.

See Appendix C for the proof. Thus, Discr provides a theoretical extension of ICC to a multivariate model, and correspondingly motivates optimizing experiments to obtain higher Discr.

The (in)validity of various reliability statistics

In Figure 1, we highlight the properties of different statistics across a range of basic one-dimensional simulations, all of which display a characteristic notion of reproducibility: samples of the same item tend to be more similar to one another than samples from different items. In three different univariate simulations, we observe two samples from each of ten items (Figure 1A): (i) discriminable, in which each item's samples are closer to each other than to any other item's; (ii) offset, which shifts the second measurement slightly, so that it is further from the first measurement than a measurement of another item; and (iii) outlier, which is the same as discriminable but includes two items with an outlier measurement.

We compare Discr to the intraclass correlation coefficient (ICC), the fingerprinting index (Fingerprint) [29], and k-sample kernel testing (Kernel) [33] (see Appendix A for details). ICC provides no ability to differentiate between the discriminable and offset simulations, despite the fact that the data in discriminable are more reproducible than in offset. Moreover, ICC is uninterpretable in the presence of even a very small number of outliers, where ICC is negative. Fingerprint suffers from the limitation that if the nearest measurement is anything but a measurement of the same item, it will be at or near zero, as shown in offset. Kernel also performs poorly in offset and in the presence of outliers. In contrast, across all simulations, Discr shows reasonable validity: the statistic is high across all simulations, and highest when repeated measurements of the same item are more similar than measurements from any of the other items.

The power of reliability statistics in multivariate experimental design
We evaluate Discr, PICC (which applies ICC to the top principal component of the data), I2C2, Fingerprint, and Kernel on five two-dimensional simulation settings (see Appendix A for details). Figure 2A shows a two-dimensional scatterplot of each setting, and Figure 2B shows the Euclidean distance matrix between samples, ordered by item. Figure 2C shows the power of each statistic: Discr typically achieves the highest power for all settings and variances, while in several settings the competing statistics are similarly useless. In the fifth simulation they are all at chance levels, as they should be, because there is no information about class in the data. This suggests that of these statistics, only Discr and Kernel can serve as satisfactory surrogates for predictive accuracy under these quite simple settings.

Discriminability empirically predicts performance on subsequent inference tasks
A test to determine reliability. A prerequisite for making item-specific predictions is that items are different from one another in predictable ways, that is, that they are discriminable. If not, the same assay applied to the same individual on multiple trials could yield unacceptably variable results. Thus, prior to embarking on a machine learning search for predictive accuracy, one can simply test whether the data are discriminable at all. If not, predictive accuracy will be hopeless. Figure 2C shows that Discr achieves approximately the highest power among all competing approaches in all settings and variances. This result demonstrates that, despite the fact that Discr does not rely on Gaussian assumptions, it performs as well as or better than parametric methods when the data satisfy those assumptions (row (i)). In row (ii) cross, only Discr and Kernel correctly identify that items differ from one another, despite the fact that the data are Gaussian, though not spherically symmetric Gaussians. In both rows (iii) ball/circle and (iv) XOR, most statistics perform well despite the non-Gaussianity of the data. And when there is no signal, all tests are valid, achieving power less than or equal to the critical value. Nonparametric Discr therefore has the power of parametric approaches for data for which those assumptions are appropriate, and much higher power for other data. Kernel performs comparably to Discr in these settings, though with somewhat less power in row (iv) XOR.
A test to compare reliabilities. Given two experimental designs, which can differ by acquisition and/or analysis details, are the measurements produced by one method more discriminable than those of the other? Figure 2D shows that Discr typically achieves the highest power among all statistics considered. Specifically, only Fingerprint achieves higher power in the Gaussian setting, but it achieves almost no power in the cross setting. Kernel achieves extremely low power for all settings, as does PICC. I2C2 achieves power similar to Discr only in the Gaussian and ball/circle settings. All tests are valid in that they achieve power approximately equal to or below the critical value when there is no signal. Note that these comparisons are not the typical "k-sample comparisons" with many theoretical results; rather, they compare across multiple disparate k-sample settings, and in general there is a lack of theoretical guarantees for this setting. Nonetheless, the fact that Discr achieves nearly equal or higher power than the statistics that build upon Gaussian methods, even under Gaussian assumptions, suggests that Discr will be a superior metric for optimal experimental design in real data.

Optimizing experimental design via maximizing reliability in human brain imaging data
Human brain imaging data acquisition and analysis. The Consortium for Reliability and Reproducibility (CoRR) [34] has generated functional, anatomical, and diffusion magnetic resonance imaging (dMRI) scans from >1,600 participants, often with multiple measurements, collected through 28 different studies (22 of which have both age and sex annotation) spanning over 20 sites. Each site uses different scanners, technicians, and scanning protocols, thereby representing a wide variety of acquisition settings with which one can test different analysis pipelines. Figure 3A shows the six-stage sequence of analysis steps for converting the raw fMRI data into networks or connectomes, that is, estimates of the strength of connections between all pairs of brain regions. At each stage of the pipeline, we consider several different "standard" approaches, that is, approaches that have previously been proposed in the literature, typically with hundreds or thousands of citations [35]. Moreover, they have all been collected into an analysis engine, called the Configurable Pipeline for the Analysis of Connectomes (C-PAC) [36]. In total, for the six stages together, we consider 2 × 2 × 2 × 2 × 4 × 3 = 192 different analysis pipelines. Because each stage is nonlinear, it is possible that the best sequence of choices is not equivalent to the best choices on their own. For this reason, publications that evaluate a given stage using any metric could reach misleading conclusions if one is searching for the best sequence of steps. The dMRI connectomes were acquired via 48 analysis pipelines using the Neurodata MRI Graphs (ndmg) pipeline [37]. Appendix F provides specific details for both fMRI and dMRI analysis, as well as the options attempted.
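To make the combinatorics concrete, the following sketch enumerates such a grid of stage options; the option labels are illustrative placeholders rather than C-PAC identifiers, with counts chosen to mirror the 2 × 2 × 2 × 2 × 4 × 3 grid (a code such as FNNNCP names one cell of this grid).

```python
from itertools import product

# Illustrative stage options; the counts mirror the 2x2x2x2x4x3 grid.
stages = {
    "registration":        ["FSL", "other"],
    "frequency_filtering": ["filtered", "none"],
    "scrubbing":           ["scrubbed", "none"],
    "global_signal":       ["regressed", "none"],
    "parcellation":        ["P1", "P2", "P3", "CC200"],
    "edge_weights":        ["raw", "log", "rank"],
}

pipelines = [dict(zip(stages, combo)) for combo in product(*stages.values())]
assert len(pipelines) == 192
```

Because the stages are nonlinear, the Discr of each of these 192 full sequences must be evaluated, rather than each stage in isolation.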

Different analysis strategies yield widely disparate stabilities
The analysis strategy has a large impact on the Discr of the resulting fMRI connectomes (Figure 3B). Each column shows one of 64 different analysis strategies, ordered by how significantly different they are from the pipeline with the highest Discr (averaged over all datasets, tested using the above comparison test). Interestingly, pipelines with worse average Discr also tend to have higher variance across datasets. The best pipeline, FNNNCP, uses FSL registration, no frequency filtering, no scrubbing, no global signal regression, CC200 parcellation, and converts edge weights to ranks. While all strategies across all datasets with multiple participants are significantly discriminable at α = 0.05 (Discr goodness of fit test), the majority of the strategies (51/64 ≈ 80%) show significantly worse Discr than the optimal strategy at α = 0.05 (Discr comparison test).

Discriminability identifies which acquisition and analysis decisions are most important for improving performance

While the above analysis provides evidence for which sequence of analysis steps is best, it does not provide information about which choices individually have the largest impact on overall Discr. To do so, it is inadequate to simply fix a pipeline and only swap out algorithms for a single stage, as such an analysis will only provide information about that fixed pipeline. Therefore, we evaluate each choice in the context of all 192 considered pipelines in Figure 4A. The pipeline constructed by identifying the best option for each analysis stage is FNNGCP (Figure 4A). Although it is not exactly the same as the pipeline with the highest Discr (FNNNCP), it is also not much worse (Discr 2-sample test, p-value ≈ 0.14). Moreover, except for scrubbing, each stage has a significant impact on Discr after correction for multiple hypotheses (Wilcoxon signed-rank statistic, p-values all < 0.001).

Another choice is whether to estimate connectomes using functional or diffusion MRI (Figure 4B).
Whereas both data acquisition strategies have known problems [38], the Discr of the two experimental modalities has not been directly compared. Using four datasets from CoRR that acquired both fMRI and dMRI on the same subjects, with quite similar demographic profiles, we tested whether fMRI- or dMRI-derived connectomes were more discriminable. The pipelines considered were the best-performing fMRI pre-processing pipeline (FNNNCP) and the dMRI pipeline with the CC200 parcellation. For three of the four datasets, dMRI connectomes were more discriminable. This is not particularly surprising, given the susceptibility of fMRI data to changes in state rather than trait (e.g., amount of caffeine prior to scan [36]).
The above results motivate investigating which aspects of the dMRI analysis strategy were most effective. We focus on two criteria: how to scale the weights of connections, and how many regions of interest (ROIs) to use. For scaling the weights of the connections, we consider three possible criteria: using the raw edge weights ("Raw"), taking the log of the edge weights ("Log"), and ranking the non-zero edge weights in sequentially increasing order ("Rank"); a sketch of these scalings follows. Figure 4C.i shows that both the rank and log transforms significantly exceed raw edge weights (Wilcoxon signed-rank statistic, sample size = 60, p-values all < 0.001). Figure 4C.ii shows that parcellations with larger numbers of ROIs tend to have higher Discr. Unfortunately, most parcellations with semantic labels (e.g., visual cortex) have hundreds, not thousands, of parcels. This result therefore motivates the development of more refined semantic labels.
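A minimal sketch of the three edge-weight scalings on a weighted adjacency matrix; log1p is used here as one reasonable way to handle zero-valued edges, and the function is illustrative rather than the ndmg implementation.

```python
import numpy as np
from scipy.stats import rankdata

def scale_edges(A, method):
    """Scale the edge weights of a nonnegative adjacency matrix."""
    A = np.asarray(A, dtype=float)
    if method == "raw":
        return A                      # untouched edge weights
    if method == "log":
        return np.log1p(A)            # log transform (log1p keeps zeros at zero)
    if method == "rank":
        out = np.zeros_like(A)
        nz = A != 0
        out[nz] = rankdata(A[nz])     # rank non-zero edges in increasing order
        return out
    raise ValueError(f"unknown method: {method}")
```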
Optimizing Discriminability improves downstream inference performance

We next examined the relationship between the Discr of each pipeline and the amount of information it preserves about two properties of interest: sex and age. Based on the simulations above, we expect that analysis pipelines with higher Discr will yield connectomes with more information about covariates. Indeed, Figure 5 shows that, for virtually every dataset including sex and age annotation (22 in total), a pipeline with higher Discr tends to preserve more information about both covariates. The amount of information is quantified by the effect size of the distance correlation DCorr (which is exactly equivalent to Kernel [28,39]), a statistic that quantifies the magnitude of association for both linear and nonlinear dependence structures. In contrast, if one were to use PICC, Kernel, or I2C2 to select the optimal pipeline, subsequent predictive performance would degrade for many datasets.
Fingerprint performs similarly to Discr on this dataset. These results are highly statistically significant: the slopes of effect size versus Discr and Fingerprint across datasets are significantly positive for both age and sex in 82 and 95 percent of all studies, respectively (robust Z-test, α = 0.05).
Kernel performs poorly essentially always, because k-sample tests are designed to perform well with many samples from a small number of different populations, whereas questions of reproducibility across repeated measurements involve a few samples from each of many different populations.

Reliability of genomics data
The first genomics study aimed to explore variation in gene expression across human induced pluripotent stem cell (hiPSC) lines with between one and seven replicates [40]. This data includes RNAseq data from 101 healthy individuals, comprising 38 males and 63 females. Expression was interrogated across donors by studying up to seven replicated iPSC lines from each donor, yielding bulk RNAseq data from a total of 317 individual hiPSC lines. While the pipeline includes many steps, we focus here for simplicity on (1) counting and (2) normalizing. The two counting approaches we study are the raw counts and counts-per-million (CPM). Given counts, we consider four different normalization options: Raw, Rank, and Log-transformed (as described above), as well as mean-centering (normalizing each sample to have an average count of 0). The task of interest was to identify the sex of the individual.
The second genomics study [41] includes 331 individuals, consisting of 135 patients with non-metastatic cancer and 196 healthy controls, each with eight DNA samples. The study leverages a PCR-based assay called the Repetitive Element Aneuploidy Sequencing System to analyze ~750,000 amplicons distributed throughout the genome to investigate the presence of aneuploidy (abnormal chromosome counts) in samples from cancer patients (see Appendix F.1 for more details). The possible processing strategies include using the raw amplicons, the amplicons downsampled into non-overlapping windows of 5 × 10^4, 5 × 10^5, or 5 × 10^6 bases, or aggregation to the individual chromosome level (the resolution of the data), followed by normalization via the previously described approaches (Raw, Rank, Log-transformed), yielding 5 × 3 = 15 possible strategies in total. The task of interest was to identify whether the sample was collected from a cancer patient or a healthy control.
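A minimal sketch of the counting and normalization options specific to these studies, assuming a features-by-samples count matrix; the raw, log, and rank options are analogous to the edge-weight scalings sketched earlier, so only CPM and mean-centering are shown (function names are illustrative).

```python
import numpy as np

def counts_per_million(counts):
    """CPM: rescale each sample (column) so its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def mean_center(X):
    """Normalize each sample (column) to have an average count of 0."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0, keepdims=True)
```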
Across both tasks, the slope for Discr is positive, and for the first task, the slope is significantly greater than zero (robust Z-test, p-value = .001, α = .05). Fingerprint and Kernel are similarly informative for only one of the two genomics studies. For PICC, in both datasets the slope is positive and the effect is significant. I2C2 does not provide value for subsequent inference.

Figure 5 caption: Discr (x-axis) versus effect size (y-axis); both axes are normalized by the minimum and maximum statistic. For each study, the effect size is regressed onto the statistic of interest. Color and line width correspond to the study and number of scans, respectively (see Figure 3B); the solid black line is the weighted mean over all studies. Discr is the only statistic for which all slopes are positive. Moreover, the corrected p-value [42,43] is significant across datasets for both covariates (all p-values < .001), indicating that pipelines with higher Discr correspond to larger effect sizes for the covariate of interest, and that this relationship is stronger for Discr than for other statistics. A similar experiment is performed on two genomics datasets, measuring the effects due to sex and whether an individual has cancer; each point represents a single processing strategy (there is only one dataset per covariate). Panel (III) indicates the fraction of datasets with positive slopes and with significantly positive slopes, ranging from 0 ("None", red) to 1 ("All", green), at both the task and aggregate level.
Discr is the statistic for which the most datasets have positive slopes, and for which the most datasets have significantly positive slopes, across the neuroimaging and genomics datasets considered. Appendix F.2 details the methodologies employed.

Discussion
We propose the use of the Discr statistic as a simple and intuitive measure for experimental design featuring multiple measurements. Numerous efforts have established the value of quantifying reliability, repeatability, and replicability (or discriminability) using parametric measures such as ICC and I2C2. However, these measures have not been used to optimize reproducibility; that is, they are only used post-hoc to determine reproducibility, not as criteria for searching over the design space. Nor have nonparametric multivariate generalizations of these statistics been available. We derive goodness of fit and comparison (equality) tests for Discr, and demonstrate via theory and simulation that Discr provides numerous advantages over existing techniques across a range of simulated settings. Our neuroimaging and genomics use-cases exemplify the utility of these features of the Discr framework for optimal experimental design.

An important consideration is that quantifying reliability and reproducibility with multiple measurements may seem like a limitation for many fields, in which the end derivative typically used for inference may be just a single sample for each item measured. However, a single measurement may often consist of many sub-measurements for a single individual, which are combined to produce the single derivative. For example, in brain imaging, a functional magnetic resonance imaging (fMRI) scan consists of tens to thousands of scans of the brain at numerous time points, and the image can be broken into identical-width time windows. In another example, taken directly from the cancer genomics experiment above, a genomics count table was produced from eight independent experiments, each of which yielded a single count table; the last step of the pre-processing procedure was to aggregate these into the single summary derivative that the experimenters traditionally considered a single measurement. In each case, the typical "measurement" unit can really be thought of as an aggregate of multiple smaller measurement units, and a researcher can leverage these smaller measurements as surrogates for multiple measurements. In the neuroimaging example, the fMRI scan can be segmented into identical-width sub-scans, with each treated as a single measurement; in the genomics example, the independent experiments can each be used as a single measurement.
Discr provides a number of connections with related statistical algorithms worth further consideration. Discr is related to energy statistics [44], in which the statistic is a function of distances between observations [25]. Energy statistics provide approaches for goodness-of-fit (one-sample), equality (two-sample), and multi-sample testing [45]. However, we note an important distinction: a goodness of fit test for discriminability can be thought of as a k-sample test in the classical literature, and a comparison of discriminabilities is analogous to a comparison of k-sample tests. Further, similar to Discr, energy statistics make relatively few assumptions. However, energy statistics require a large number of measurements per item, which is often unsuitable for biological data where we frequently have only a small number of repeated measurements. Discr is most closely related to multiscale generalized correlation (MGC) [28,39], which, like Discr, combines energy statistics with nearest neighbors. Like many energy-based statistics, Discr relies upon the construction of a distance matrix; as such, Discr generalizes readily to high-dimensional data, and many packages accelerate distance computation in high dimensions [46].
Limitations. While Discr provides experimental design guidance for big data, other considerations may play a role in a final determination. For example, the connectomes analyzed here are resting-state, as opposed to task-based, fMRI connectomes. Recent literature suggests that the global signal in a rs-fMRI scan may be a nuisance variable for task-based approaches [47,48]. Thus, while Discr is an effective tool for experimental design, knowledge of the techniques in conjunction with the inference task is still a necessary component of any investigation. Further, in this study we only consider the Euclidean distance, which may not be appropriate for all datasets of interest. For example, if the measurements live on a manifold (such as images, text, speech, and networks), one may be interested in dissimilarity or similarity functions other than Euclidean distance. To this end, Discr readily generalizes to alternative comparison functions, and will produce an informative result as long as the choice of comparison function is appropriate for the measurements.
It is important to emphasize that Discr, as well as the related statistics, is neither necessary nor sufficient for a measurement to be practically useful. For example, categorical covariates, such as sex, are often meaningful in an analysis but not discriminable. Human fingerprints are discriminable, but typically not biologically useful. In addition, none of the statistics studied here is immune to sample characteristics, so interpreting results across studies deserves careful scrutiny. For example, having a sample with variable ages will increase the inter-subject dissimilarity of any metric dependent on age (such as the connectome). With these caveats in mind, Discr remains a key experimental design consideration in a wide variety of settings.
Conclusion. The use-cases provided herein serve to illustrate how Discr can be used to facilitate experimental design and mitigate reproducibility issues. We envision that Discr will find substantial applicability across disciplines and sectors beyond brain imaging and genomics, such as pharmaceutical research. To this end, we provide open-source implementations of Discr for both Python and R [49,50]. Code for reproducing all the figures in this manuscript is available at https://neurodata.io/mgc.

A.1 Intraclass Correlation Coefficient

The intraclass correlation coefficient (ICC) is a commonly used data reproducibility statistic [30]. The population ICC, or ICC(1,1), is the fraction of the total variability that is across-item variability; that is, ICC is the across-item variability divided by the sum of the within-item and across-item variability. ICC has several severe limitations. First, it is univariate: if the data are multidimensional, they must first be reduced to univariate statistics, thereby discarding multivariate information. Second, ICC is based on an (overly simplistic) Gaussian assumption characterizing the data; any deviation from this assumption renders the interpretation of the magnitude of ICC questionable, because non-Gaussian measurements that are highly reproducible can yield quite low ICC [51][52][53]. Third, ICC is highly sensitive to the design of the study [53,54]; care must be taken to ensure that the form of ICC chosen accurately reflects the design of the study of interest. Finally, there are numerous definitions of estimators of ICC [30], and researchers regularly use (and misuse) the different estimators in generic contexts [53,55].
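Concretely, under the one-way random-effects model the population ICC is σ²_b / (σ²_b + σ²_w), where σ²_b and σ²_w are the across-item and within-item variances. A minimal sketch of the standard ANOVA-based estimator of ICC(1,1) follows (a sketch, not the estimator used in any particular package):

```python
import numpy as np

def icc_1_1(x):
    """ICC(1,1) from an (n_items, s) array of repeated univariate measurements,
    via the usual one-way ANOVA mean squares."""
    x = np.asarray(x, dtype=float)
    n, s = x.shape
    ms_between = s * np.sum((x.mean(axis=1) - x.mean()) ** 2) / (n - 1)
    ms_within = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2) / (n * (s - 1))
    return (ms_between - ms_within) / (ms_between + (s - 1) * ms_within)
```

Note that this estimator can be negative, consistent with the outlier behavior described in the main text.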
The Image Intra-Class Correlation (I2C2) was introduced to mitigate ICC's univariate limitation [56]. Specifically, I2C2 operates on covariance matrices rather than variances. To obtain a univariate summary of reproducibility, I2C2 operates on the trace of the covariance matrices, one of several possible strategies, similar to most multivariate analysis of variance procedures [57]. Thus, while overcoming one limitation of ICC, I2C2 still heavily leverages Gaussian assumptions about the data to justify its validity.

A.2 Fingerprinting Index
The fingerprinting index [29,58] provides a metric for quantifying individual connectivity profiles in resting-state functional MRI (fMRI). Specifically, the fingerprinting index operates on the pairwise correlations of the vectorized connectivity matrices. A high fingerprinting index corresponds to the connectivity matrices being more strongly correlated within-subject than between-subject. An important clarification for fingerprinting is that a subject's connectivity matrices must be more strongly correlated with each other than with any other measurement within a particular scanning session; otherwise, the corresponding contribution to the fingerprinting index will be 0. Unlike the other strategies employed in this manuscript, the fingerprinting index produces a statistic for each possible ordering of 2 measurement sessions; that is, if each item is measured s times, fingerprinting produces s(s − 1) statistics. To enable fingerprinting for assessing the effectiveness of a strategy, we instead averaged across all s(s − 1) statistics, which we henceforth refer to as Fingerprint.
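For s = 2 sessions, a minimal sketch of one of the two ordered statistics follows, assuming row-aligned (n_subjects, p) arrays of vectorized connectomes (a sketch of the matching rule, not the original implementation [29]):

```python
import numpy as np

def fingerprint_index(session1, session2):
    """Fraction of subjects whose most-correlated connectome in session2
    is their own."""
    a = session1 - session1.mean(axis=1, keepdims=True)
    b = session2 - session2.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    corr = a @ b.T                      # pairwise Pearson correlations
    return np.mean(corr.argmax(axis=1) == np.arange(len(corr)))
```

Averaging this quantity over both session orderings gives the Fingerprint statistic used in this manuscript.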
A.3 Kernel Methods

Maximum mean discrepancy (MMD) [33] provides a nonparametric framework for testing whether two samples are drawn from the same distribution. MMD subverts Gaussian assumptions by embedding the points in a reproducing kernel Hilbert space (RKHS) and looking for functions over the unit ball in the RKHS that maximize the difference in the means of the embedded points. In the two-item regime, MMD can be shown to be equivalent to the Hilbert-Schmidt Independence Criterion (HSIC) [27,59,60], which provides a natural generalization of MMD when the number of classes exceeds two. To date, to our knowledge, there does not exist a k-sample variant of MMD.
Distance Components (DISCO) [45] extends the classical Analysis of Variance (ANOVA) framework to cases where the distributions are not necessarily Gaussian. In contrast to ANOVA, which makes simplifying assumptions of normality, DISCO operates on the dispersion of the samples based on the Euclidean distance, comparing the within-class dispersion to the between-class dispersion. DISCO produces a consistent test against general alternatives as the number of observations s per item goes to infinity. Shen and Vogelstein [61] show a closed-form relationship between Kernel and other energy statistics approaches, such as distance correlation; the result is that using distance correlation for k-sample testing yields a test statistic that is biased relative to the Kernel statistic, but yields the same p-value. Further, Shen and Vogelstein [61] show the equivalence between distance correlation and HSIC/MMD. Thus, in this manuscript, we use Kernel to refer to either DISCO or MMD, as appropriate. In all cases, we use the default kernel, the Gaussian kernel with the typical bandwidth specification, as implemented in the kernlab package [62] (MMD) and the energy package [63] (DISCO).
Note that in many real data scenarios, s is small (particularly, most "repeat measurements" datasets have s = 2), and the finite-sample performance of Kernel on such a small number of repeat trials is not known.

Appendix B. Population and Sample Discr.
Suppose that θ_i ∈ Θ represents a physical property of interest for a particular item i. In a biological context, for instance, an item could be a participant in a study, and the property of interest could be the individual's true brain network, or connectome. We cannot directly observe the physical property; rather, we must first measure it and then "wrangle" the measurement. Call the measurement function f ∈ F, where F is a family of possible measurement functions and f : Θ → W; measurements of θ_i are observed as f(θ_i) = w_i. However, w_i may be noisy, with measurement artefacts. Alternatively, w_i might not be the property of interest; for example, if the property is a network, perhaps w_i is a multivariate time series from which we can estimate a network. We therefore have another function, g ∈ G with g : W → X, which represents the data wrangling procedure that takes the measurement and produces an informative derivative (for instance, after confound removal); G is the family of possible data wrangling procedures. In this fashion, the output of interest is x_i^j = g(f(θ_i)), the j-th measurement of item i after wrangling.

The goal of experimental design is to choose an f and g that yield high-quality and useful inferences, that is, that yield x's that we can use for various inferential purposes. When we have repeated measurements of the same items, we can use those samples to our advantage. Given x_i^j, the j-th measurement of item i, we would expect x_i^j to be more similar to x_i^{j′} (another measurement of the same item) than to any measurement of a different item, x_{i′}^{j″}. Formally, let δ : X × X → [0, ∞) be a distance metric; we define the population Discr as

D = P( δ(x_i^j, x_i^{j′}) < δ(x_i^j, x_{i′}^{j″}) ).

That is, population Discr D represents the average probability that the within-item distance δ(x_i^j, x_i^{j′}) is less than the between-item distance δ(x_i^j, x_{i′}^{j″}). Discr depends on the choice of distance δ, as well as on the measurement protocol f and the analysis choices g.
The population Discr represents a property of the distribution of the θ_i. In real data, since we do not observe the true distribution, we instead rely on the sample Discr. Suppose a dataset consists of i ∈ {1, ..., n} items, where each item i has J_i repeated measurements. The sample Discr is the empirical analogue of the probability above: the fraction, over all items i, replicate pairs j ≠ j′, and measurements x_{i′}^{j″} of other items i′ ≠ i, of comparisons for which δ(x_i^j, x_i^{j′}) < δ(x_i^j, x_{i′}^{j″}). It can be shown [23] that, under the multivariate additive noise model in Assumption 3, that is, x_i^j = θ_i + ε_i^j with E[ε_i^j] = c, the sample Discr is both a consistent and unbiased estimator of the population Discr.

Appendix C. Discr Provides an Informative Bound for Inference.
During experimental design, the extent of subsequent inference tasks may be unknown. A natural question, then, is: what are the implications of selecting a discriminable experimental design? Formally, assume the task of interest is binary classification; that is, Y = {0, 1}, and we seek a classifier h : X → Y. The goal of experimental design in this context is to choose the options (f*, g*) that minimize the classification loss:

(f*, g*) = argmin_{(f,g) ∈ F × G} P( h(g(f(θ))) ≠ y ).
For a fixed (f, g), the minimal prediction error is achieved by the Bayes optimal classifier [64], which assigns each observation to the class y maximizing log P(y_i = y | g(f(θ_i))) + log π_y, where π_y = P(y_i = y); let L*_{f,g} denote the error of the Bayes optimal classifier, that is, the error achieved by h*_{f,g}.

Assumption 3 (Multivariate Additive Noise Setting). We suppose the following multivariate additive Gaussian noise setting: to connect the above model with Eq. (1), we can let x_i^j = τ_i^j + η_i^j, where we assume that η_i^j ⊥⊥ τ_i^j, and both η_i^j and τ_i^j are multivariate Gaussian.

Using Bayes' rule and Assumption 3, the probability that an observation x_i^j is from class y satisfies

log P( y_i = y | x_i^j ) ∝ −(1/2) (x_i^j − μ_y)ᵀ Σ_x^{−1} (x_i^j − μ_y) + log π_y,

where Σ_x = Σ_θ + Σ_ε is constant between the two classes (that is, the variance is homoscedastic), μ_y is the class mean, and y is a generic value in {0, 1} that a realization y_i can take. This follows directly by taking the log of the density function of the multivariate normal distribution and removing terms not proportional in y. The Bayes optimal classifier assigns x_i^j to the class maximizing this quantity, and the Bayes error can then be computed explicitly using standard rules of integration.
Importantly, the Bayes error can, in fact, be upper bounded by a decreasing function of Discr, as shown in the following theorem. In words, this theorem specifies the desirability of high Discr: a higher discriminability yields a smaller upper bound on the error of subsequent inference tasks. Correspondingly, a strategy with a higher discriminability has a smaller error bound than a strategy with a lower discriminability.
Theorem 4. Under Assumption 3, L*_{f,g} is upper bounded by a decreasing function of D, where L* is the Bayes error, or the error achieved by the Bayes optimal classifier h*_{f,g}(θ_i).

Proof of Theorem 4.
Consider the additive noise setting, that is, x_i^j = θ_i + ε_i^j. To bound the probability above, we bound the term involving θ_i − θ_{i′} and the noise term separately. We start with the first term. Here, σ_θ² = tr(Σ_θ) is the trace of the covariance matrix of θ_i, and we can apply Markov's inequality for any t > 0. Let a and b be two constants satisfying the requisite conditions, let t² = √2 a σ σ_θ, and define the corresponding constant θ; if a²σ² ≥ 2σ_θ², then θ ≤ 1. According to the Paley-Zygmund inequality [65],

P( Z > θ E[Z] ) ≥ (1 − θ)² E[Z]² / E[Z²]   for all 0 ≤ θ ≤ 1 and Z ≥ 0,

we can plug in the θ above, and plugging t² into the inequality in Equation (3) gives a second bound. Given that the θ_i's and ε_i^j's are independent by supposition, we can combine the two inequalities. Note that the resulting bound holds true even if a²σ² < 2σ_θ², as the right-hand side becomes greater than 1. This produces a bound for σ_θ/σ. To obtain a bound on the Bayes error, we apply Devijver and Kittler's result [66], which is that

L* ≤ 2π_0π_1 / (1 + π_0π_1 Δμᵀ Σ_x^{−1} Δμ),

where Δμ is the difference between the means of the two classes. Since ε_i^j is assumed to be independent of y_i, Σ_x is the weighted covariance matrix of x:

Σ_x = π_0 Var(x_i^j | y_i = 0) + π_1 Var(x_i^j | y_i = 1) = π_0 Var(θ_i | y_i = 0) + π_1 Var(θ_i | y_i = 1) + Var(ε_i^j).
Therefore, Σ_x^{−1} can be bounded in terms of the Discr-dependent quantity above, and we obtain the theorem. As a direct consequence of this theorem, we see:

Corollary 5. Suppose (f_1, g_1) and (f_2, g_2) are two analysis strategies, and suppose that D_{f_1,g_1} > D_{f_2,g_2}. Then the bound on the Bayes error for (f_1, g_1) is lower than the bound on the Bayes error for (f_2, g_2).
Proof. Direct application of Theorem 4.

Consequently, under the described setting, the pipeline that achieves a higher Discr has a lower bound on the Bayes error than competing strategies, despite the fact that the task is unknown during data acquisition and analysis. Complementarily, note that if we instead consider the predictive accuracy 1 − L*_{f,g}, a similar argument yields a lower bound on the predictive accuracy via an increasing function of Discr. That is, in the context of the corollary, a more discriminable pipeline will tend to have a higher bound on the accuracy for an arbitrary predictive task.

Appendix D. Simulations.
The following simulations were constructed, where σ_min and σ_max are the variance ranges; settings were run at 15 intervals in [σ_min, σ_max], with 500 repetitions per setting. For a simulation setting with variance σ, the variance is reported as the normalized variance σ̃ = (σ − σ_min)/(σ_max − σ_min). Dimensionality is 2, the number of items is K, and the total number of measurements across all items is 128. Typically, i indicates the individual identifier and j the measurement index. Notationally, in the descriptions below, we adopt the convention that z_i^j obeys the true distribution for a single observation j of item i, and x_i^j incorporates the controlled error term; for example, one setting assigns

z_1^t = (0, 0) for t ∈ {1, ..., 32}, and z_1^t = (1, 1) for t ∈ {33, ..., 64}.

Bayes error was estimated by simulating n = 10,000 points according to the above simulation settings and approximating the Bayes error through numerical integration. The classification labels for the K = 2 simulations were consistent with the individual labels; for K = 16, the first class consists of the 8 distributions whose means were leftmost, and the rest of the distributions form the other class.

D.2 Comparison Testing
Items are sampled with the same true distributions z_i^j as before, with the following augmentation: the observed data x_{i,k}^j for item i, observation j, and sample k ∈ [2] are such that the first sample is distributed according to the true item distribution, and the second sample is distributed according to the true item distribution with an added noise term, where ε_i^j iid~ N(0, σ²I):

x_{i,1}^j = z_i^j,   x_{i,2}^j = z_i^j + ε_i^j.

By construction, one would anticipate the Discr of the first sample to exceed that of the second sample, as the second sample has additional error. Therefore, the natural hypothesis is H_0: D^(1) = D^(2) against H_A: D^(1) > D^(2).

Appendix E. Hypothesis Testing.

E.1 Goodness of Fit Test
Recall the goodness of fit test, shown in Equation (2.2). We approximate the distribution of the sample Discr D̂ under the null through a permutation approach. The item labels of our N samples are first permuted randomly, and the null statistic D̂_{0,N} is computed each time given the observed data X and the permuted labels. For a level-α significance test, we compare D̂_N to the (1 − α) quantile Q_{1−α} of the empirical null distribution, and reject the null hypothesis if D̂_N > Q_{1−α}. This approach provides a consistent and valid test under general assumptions.
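A minimal sketch of this permutation test, reusing the discr_stat sketch from earlier (illustrative; see [49,50] for the reference implementations):

```python
import numpy as np

def discr_one_sample_test(X, ids, n_perms=1000, seed=None):
    """Permutation p-value for the goodness of fit test H0: R = R0."""
    rng = np.random.default_rng(seed)
    observed = discr_stat(X, ids)
    null = np.array([discr_stat(X, rng.permutation(ids))  # shuffle item labels
                     for _ in range(n_perms)])
    # right tail: fraction of null statistics at least as large as observed
    return observed, (1 + np.sum(null >= observed)) / (1 + n_perms)
```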
Note that the permutation-based approach requires r computations of the sample Discr, each on n items of data with J_i measurements per item, plus a random shuffling of the measurement labels per permutation.

E.2 Comparison Test

For the comparison test, Discr is computed for each of the two randomly combined datasets for each permutation. Then, for each permutation, all pairs of differences in Discr are computed. Finally, the observed statistic is compared with these differences under the null of the randomly combined datasets: the p-value is the fraction of times that the observed statistic is more extreme than the null. Note that we can use this approach for both one- and two-tailed hypotheses, for an experimental design having higher Discr, lower Discr, or equal Discr relative to a second approach; we implement all three in the software implementation of the comparison test. The algorithm for the comparison test is shown in Algorithm 2, with the alternative hypothesis as specified in Equation (2.3). For each permutation, the limiting step is the computation of Discr in O(N² max(p, s)) time; this is offset through parallelization over T cores in the implementation. We extend this comparison test to all competing approaches to provide a robust p-value associated with each statistic of interest, for similar reasons to the above. Again, this permutation approach can be generalized to any statistic quantifying repeatability based on repeated measurements.

Algorithm 2 Discr Discriminability Comparison Test. Our implementation of the permutation test for the hypothesis given in Equation (2.3) requires O((r/T) N² max(p, s)) time, where r is the number of permutations and T is the number of cores available for the permutation test. Below, the only alternative considered is H_A: D^(1) > D^(2); our code-based implementation provides strategies for H_A: D^(1) < D^(2) and H_A: D^(1) ≠ D^(2) as well.

Require:
(1) x_i^j, j ∈ [J_i], i ∈ [n]: the observed data from the first sample, n items each featuring J_i measurements.
(2) z_i^j, j ∈ [J_i], i ∈ [n]: the observed data from the second sample.
(3) r: an integer number of permutations.
Ensure: p ∈ [0, 1], the p-value associated with the test.
1: d^(1) ← the Discr of the first sample.
2: d^(2) ← the Discr of the second sample.
3: d ← d^(1) − d^(2), the observed difference in Discr between samples 1 and 2.
4: The for-loop below can be parallelized over T cores, as each iteration is independent.
5: for i in 1 : r do
6:   Generate a synthetic null dataset for each of the 2 samples, using a convex combination of the elements of each sample:
7:   for k in 1 : 2 do
8:     Apply a random shuffle to the measurements.
9:     Form a convex combination of random elements from each sample.
10:    Compute the Discr d^(k)_{n,i} of the convexly combined elements.
11:  end for
12: end for
13: Compute all pairs of differences in Discr using the convexly combined samples:
14: for i in 1, ..., r − 1 do
15:   for j in i + 1, ..., r do
16:     d_n ← ( d_n, d^(1)_{n,i} − d^(2)_{n,j}, d^(2)_{n,j} − d^(1)_{n,i} )
17:   end for
18: end for
19: p ← the fraction of null differences d_n at least as extreme as the observed difference d.
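A minimal Python sketch of the comparison test's convex-combination null follows, simplified to a single pooled null per permutation pair; this is an illustration of the scheme in Algorithm 2, not our reference implementation, and it assumes measurements of the same items are row-aligned across the two samples.

```python
import numpy as np

def discr_two_sample_test(X1, X2, ids, n_perms=500, seed=None):
    """Permutation p-value for H_A: Discr(X1) > Discr(X2)."""
    rng = np.random.default_rng(seed)
    observed = discr_stat(X1, ids) - discr_stat(X2, ids)
    null1, null2 = np.empty(n_perms), np.empty(n_perms)
    for i in range(n_perms):
        lam = rng.random(len(ids))[:, None]   # per-row convex weights
        perm = rng.permutation(len(ids))      # random pairing across samples
        null1[i] = discr_stat(lam * X1 + (1 - lam) * X2[perm], ids)
        null2[i] = discr_stat(lam * X2 + (1 - lam) * X1[perm], ids)
    diffs = null1[:, None] - null2[None, :]   # all pairs of null differences
    return observed, (1 + np.sum(diffs >= observed)) / (1 + diffs.size)
```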

F.1 Data Acquisition and Analysis
fMRI Analysis Pipelines. The fMRI connectomes were acquired as follows. Motion correction is performed via mcflirt to estimate the 6 motion parameters (x, y, z translations and rotations). Registration is performed by first performing a cross-modality registration from the functional to the anatomical image. For the genomics data, the amplicons were downsampled into non-overlapping windows of 5 × 10^4, 5 × 10^5, or 5 × 10^6 bases, or aggregated to the individual chromosome level (the resolution of the data).

F.2 Effect Size Investigation
In this investigation, we are interested in how maximization based on an observed notion of reliability correlates with real performance on a downstream inference task. Recalling Corollary 5, we explore the implications of this corollary in a large neuroimaging dataset provided by the Consortium for Reliability and Reproducibility [24], and demonstrate that selection of the experimental design via Discr in fact facilitates improved downstream inference on both a regression and a classification task. We further extend this to two separate genomics datasets investigating classification tasks, and again demonstrate that selection of the experimental design via Discr improves downstream inference. This provides strong motivation for leveraging Discr for experimental design.
Ideally, for a particular summary reference statistic, a high value will generally correlate with a positive effect size. For datasets i = 1, ..., M, where M is the total number of datasets, analysis strategies j = 1, ..., 192, and summary reference statistics of interest k = 1, ..., 5 (Discr, PICC, Fingerprint, I2C2, Kernel), we fit the standard linear regression model Y = βX + ε, where we model the effect size Y estimated by DCorr [84] via a linear relationship with X, the observed reference statistic for approach k, with coefficient β. The interpretation of β is the expected change in the effect size Y due to a unit change in the observed reference statistic X. Both Y and X are uniformly normalized across all strategies within a single dataset to facilitate intuitive comparison across methods. For each reference statistic k, we pose the hypothesis H_0: β ≤ 0 against H_A: β > 0. Acceptance of the alternative hypothesis has the interpretation that an increase in the observed reference statistic X tends to correspond to an increase in the observed effect size Y; the relevant test is a one-sided Z-test. To robustify against model assumptions, we use robust standard errors [43]. Rejection of the null in favor of the alternative provides evidence that an increase in the sample statistic corresponds to an increase in the observed effect size, where the responses (age, sex, cancer status) were considered neither when the data were analyzed nor when the reference statistics were computed. This provides evidence that the statistic is informative for experimental design within the context of this investigation. Model fitting for this investigation is conducted using lm in the R programming language [85].
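A minimal Python sketch of this fit, using statsmodels' HC robust covariance as a stand-in for the robust standard errors of [43] (the paper's analysis used R's lm):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def slope_test(ref_stat, effect_size):
    """One-sided Z-test of H0: beta <= 0 vs HA: beta > 0 with robust SEs."""
    scale = lambda v: (v - v.min()) / (v.max() - v.min())  # normalize to [0, 1]
    X = sm.add_constant(scale(np.asarray(ref_stat, dtype=float)))
    y = scale(np.asarray(effect_size, dtype=float))
    fit = sm.OLS(y, X).fit(cov_type="HC3")   # heteroskedasticity-robust SEs
    z = fit.params[1] / fit.bse[1]
    return fit.params[1], norm.sf(z)         # slope and one-sided p-value
```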

F.3 Human Brain Imaging Dataset Descriptions
Useful Data Links. All relevant analysis scripts and data for figure reproduction in this manuscript are made publicly available, and can be found at https://neurodata.io/mgc.