Optimal Experimental Design for Big Data: Applications in Brain Imaging

Data acquisition and analysis are becoming prohibitively expensive for many research groups across disciplines. And yet, as more data become available, more researchers wish to investigate them, often to answer previously unconceived questions. How can one optimally acquire and analyze data to answer many different questions posed by many different investigators? We propose an approach to experimental design that leverages multiple measurements for each distinct item. The key insight is that each measurement of an item should be more similar to other measurements of that item than to measurements of any other item. In other words, we seek to optimally discriminate one item from another. We formalize the notion of discriminability, and introduce both a non-parametric and a parametric statistic to quantify the discriminability of potentially multivariate and/or non-Euclidean data. With this notion, one can make optimal decisions, with regard to either acquisition or analysis of data, by maximizing discriminability. Crucially, this optimization can be performed in the absence of any task-specific (or supervised) information. Optimizing decisions with respect to discriminability provably bounds performance on subsequent inference tasks. Simulations corroborate and extend this theory by demonstrating improved predictive accuracy with improved discriminability. We then apply this strategy to a brain imaging dataset built by the "Consortium for Reliability and Reproducibility", which consists of 28 disparate magnetic resonance imaging datasets, each with tens to hundreds of individuals who were imaged multiple times. Optimizing pipelines with respect to discriminability improves performance on multiple subsequent inference tasks, even though discriminability does not consider the tasks whatsoever.

1 Introduction
As the size of data increases, scientists face two questions: in what manner should data be (i) acquired/collected and (ii) analyzed/processed? When the data will be used to answer multiple different questions, there is a conflict: if one optimizes for a single question, information required to answer other questions could be lost. This problem is exacerbated when the data will be used to answer unknown future questions, which is common for expensive data. In such scenarios, how can one make decisions that yield satisfactory answers for many questions? In other words, which experimental and analytical properties of the measurements should one optimize?
One goal would be to maximize aspects of measurement validity, such as the degree to which a measurement corresponds to what it purports to measure. However, measurement validity often cannot be observed directly [1,2]. Even when it can be observed, increased validity often comes at the cost of increased variance. To give a simple example, a broken clock is typically not valid: its reading corresponds to the true time only twice per day. Yet, it has zero variance. Thus, if one were to increase variance, validity could also increase [3].
To complicate matters, in scientific measurement, only some sources of variability are of interest. For example, quantifying veridical biological heterogeneity is a typical goal in biomedical studies. On the other hand, many sources of variability are a nuisance, such as measurement noise. So, it is not all variability that one would desire to reduce, just the undesirable variability. Thus, a natural quantity to optimize would be a function that preserves biological variability while mitigating extraneous variability.
If one has acquired multiple measurements per item (e.g., an individual), then the intra-class correlation coefficient (ICC) is a possible quantity to optimize. ICC is the fraction of the total variability that is across-subject variability; that is, across-subject variability divided by the sum of within-subject and across-subject variability. ICC provides an index that can be naturally compared across datasets. ICC is therefore a useful quantity to optimize in experimental design. However, optimizing ICC has, to our knowledge, not previously been proposed, perhaps because it requires acquiring multiple measurements per item.
This is despite the fact that ICC is the de facto standard metric for evaluating the reliability of an experiment. That said, ICC has several limitations if one were to use it to optimize experimental design. First, it is a univariate measure, meaning that if the data are multidimensional, they must first be represented by univariate statistics, thereby discarding multivariate information. Second, ICC is based on an overly simplistic Gaussian assumption characterizing the data. Thus, any deviations from this assumption render the interpretation of the magnitude of ICC questionable, because non-Gaussian measurements that are highly reliable could yield quite low ICC. We therefore generalize ICC in two ways. First, we introduce a multivariate parametric generalization, PICC, in which we compute ICC on the first principal component of the data. Second, we introduce discriminability (abbreviated Discr throughout), a multivariate non-parametric generalization of ICC, replacing the variance computation with a rank-based distance computation. For both generalizations, we introduce a permutation procedure to obtain confidence intervals and p-values. This non-parametric Discr statistic and test enable comparing experiments and analyses for the study of repeatability and reliability without making strong assumptions or reducing the data to univariate statistics. We provide both theory and extensive simulations to illustrate the value of Discr for optimal experimental design.
Affiliations: 1 Johns Hopkins University, 2 Shanghai Jiaotong University, 3 Child Mind Institute, 4 Beijing Normal University, Nan-
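To make the ICC ratio and its PICC generalization concrete, the following is a minimal sketch. It assumes a balanced design and uses the one-way random-effects ANOVA estimator of ICC; the text does not state which ICC estimator is intended, so that choice, along with the function names `icc_oneway` and `picc`, is an assumption for illustration only.

```python
import numpy as np

def icc_oneway(y):
    """One-way random-effects ICC for y of shape (n_items, s_measurements).

    Assumption: this ANOVA-based estimator is one common choice, not
    necessarily the estimator used in the paper.
    """
    y = np.asarray(y, dtype=float)
    n, s = y.shape
    item_means = y.mean(axis=1)
    # Mean squares between items and within items.
    msb = s * np.sum((item_means - y.mean()) ** 2) / (n - 1)
    msw = np.sum((y - item_means[:, None]) ** 2) / (n * (s - 1))
    return float((msb - msw) / (msb + (s - 1) * msw))

def picc(X):
    """ICC of the first principal component; X has shape (n_items, s, p)."""
    X = np.asarray(X, dtype=float)
    n, s, p = X.shape
    flat = X.reshape(n * s, p)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are principal directions; vt[0] is the first one.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    return icc_oneway(scores.reshape(n, s))
```

Because the sign of a principal component is arbitrary, the sketch relies on the fact that this ICC estimator is unchanged when all scores are negated.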
The motivation of this work is a brain imaging dataset generated by the Consortium for Reliability and Reproducibility (CoRR) [4]. This dataset is an amalgamation of over 28 different studies, many of which were collected using different scanners, manufactured by different companies, run by different people, using different settings. Moreover, the scanned individuals span various age ranges, sexes, and ethnicities. Nonetheless, we are interested in finding a pipeline to process the data such that they can be used for many different inference tasks. After evaluating nearly 200 different pipelines on over 3000 scans, we determined the optimal pipeline, that is, the pipeline with the highest Discr. We then demonstrate that for every single dataset, on average, pipelines that achieve higher Discr also yield data with more information about multiple phenotypes. This is despite the fact that no phenotypic information whatsoever was incorporated into the optimal design criterion. This is in contrast with other potential design criteria, which did not exhibit this property. We therefore believe this approach to optimal experimental design will be useful for a wide range of disciplines and sectors. To facilitate its use, we make all of our code and data derivatives open access at https://neurodata.io/mgc.

2 Computing the Non-Parametric Discriminability Statistic
Consider n items, where each item has s measurements, resulting in N = n × s total measurements across items. The non-parametric discriminability statistic is computed as follows:
1. Compute the distance between all pairs of samples (resulting in an N × N matrix).
2. For all samples of all items, compute the fraction of times that a within-item distance is smaller than an across-item distance.
3. The Discr of the dataset is the average of the above-mentioned fractions.
A high Discr indicates that within-item measurements are more similar to one another than across-item measurements. For more algorithmic details, see Algorithm 9. For formal definitions of terms, see Appendix A.
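The three steps above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the reference code: it uses the Euclidean distance, and the function name `discr` and its arguments are hypothetical.

```python
import numpy as np

def discr(X, labels):
    """Sample discriminability of X (N x p) given item labels (length N)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Step 1: N x N matrix of pairwise distances (Euclidean is an assumption).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    fractions = []
    for a in range(len(X)):
        same = labels == labels[a]
        same[a] = False                 # exclude the measurement itself
        other = labels != labels[a]
        # Step 2: for each within-item distance, the fraction of
        # across-item distances that are strictly larger.
        for d_within in D[a, same]:
            fractions.append(np.mean(D[a, other] > d_within))
    if not fractions:
        raise ValueError("need repeated measurements and at least two items")
    # Step 3: average the fractions.
    return float(np.mean(fractions))
```

With two tight, well-separated items, every within-item distance is smaller than every across-item distance and the statistic equals 1; interleaving the labels so that each item spans both clusters pushes it well below 0.5.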
bioRxiv preprint, doi: https://doi.org/10.1101/802629. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
3 Theoretical properties of discriminability
Under reasonably general assumptions, if within-item variability increases, predictive accuracy will decrease. Therefore, a statistic that is sensitive to within-item variance is desirable for selecting amongst datasets and pipelines, regardless of the distribution of the data. Carmines and Zeller [5] introduce a univariate parametric framework in which predictive accuracy can be lower-bounded by an increasing function of ICC; as a direct consequence, a strategy with a higher ICC can, on average, have higher predictive performance on subsequent inference tasks. Unfortunately, this valuable theoretical result is limited in its applicability, as it is restricted to univariate data, whereas big data processing strategies often produce data in high dimensions. We therefore prove the following generalization of this theorem (see Appendix B for proof):
Theorem 3.1. Under the multivariate additive noise setting, Discr provides a lower bound on the predictive accuracy of a subsequent classification task. Consequently, a strategy with a higher Discr provably provides a higher lower bound on predictive accuracy than a strategy with a lower Discr.
Thus, Discr provides a theoretical extension of ICC to a general multivariate model, and correspondingly, motivates the theoretical desirability of strategies with higher Discr than competing strategies.
4 Empirical properties of discriminability on simulated data
4.1 Simulation settings
To develop insight into the performance of Discr, we consider four different simulation settings. Each includes between 2 and 20 items, with s measurements per item, in 2 dimensions. Figure 1A shows a two-dimensional scatterplot of each setting, and Figure 1B shows the Euclidean distance matrix between samples, ordered by item. The four settings are (see Appendix D for details):
1. Gaussian Each item is distributed according to a spherically symmetric Gaussian, therefore respecting the assumptions of ICC.
2. Cross Both items have Gaussian distributions with the same mean, but they are no longer spherically symmetric and have different covariances: the different dimensions of each item have different distributions.
3. Ball/Circle One item is distributed in the unit ball, the other on the unit circle, so that neither item is entirely characterized by a Gaussian.
4. No Signal Both items have the same Gaussian distribution.
These settings were selected to illustrate the relative merits and demerits of three different statistics that one could use to evaluate the reliability of data. The first is I2C2, a previously proposed multivariate method to quantify reliability of data [6]. I2C2 is based on a particular multivariate Gaussian assumption on the data, and could therefore be appropriate in such settings. The second is PICC, which computes ICC on the first principal component of the data and could be appropriate for more general Gaussian settings. The third is Discr, which is non-parametric and should therefore be robust across distributional settings.
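As a rough illustration of the four settings, the sketch below generates two-dimensional data for each. The exact parameters are given in Appendix D and are not reproduced here, so every numeric value (center spread, covariance scales, default s) and the function name `simulate` are assumptions chosen only to convey the geometry.

```python
import numpy as np

def simulate(setting, s=50, n_items=2, noise=1.0, seed=0):
    """Return (X, labels) for one of four illustrative 2-D settings.

    All numeric parameters are assumptions, not the values in Appendix D.
    """
    rng = np.random.default_rng(seed)
    if setting == "gaussian":
        # Spherically symmetric Gaussians around item-specific centers.
        centers = rng.normal(scale=3.0, size=(n_items, 2))
        groups = [c + noise * rng.standard_normal((s, 2)) for c in centers]
    elif setting == "cross":
        # Same mean, but covariances elongated along different axes.
        groups = [rng.standard_normal((s, 2)) * np.array([3.0, 0.3]),
                  rng.standard_normal((s, 2)) * np.array([0.3, 3.0])]
    elif setting == "ball_circle":
        angle = rng.uniform(0.0, 2.0 * np.pi, size=(2, s))
        radius = np.sqrt(rng.uniform(size=s))  # sqrt gives uniform density in the disc
        ball = radius[:, None] * np.column_stack([np.cos(angle[0]), np.sin(angle[0])])
        circle = np.column_stack([np.cos(angle[1]), np.sin(angle[1])])
        groups = [ball, circle]
    elif setting == "no_signal":
        # Both items share one Gaussian distribution.
        groups = [rng.standard_normal((s, 2)) for _ in range(2)]
    else:
        raise ValueError(f"unknown setting: {setting}")
    X = np.vstack(groups)
    labels = np.repeat(np.arange(len(groups)), s)
    return X, labels
```

Increasing `noise` in the Gaussian setting corresponds to increasing within-item variance while leaving the item centers fixed for a given seed.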

4.2 Discr empirically predicts performance on subsequent inference tasks
We compare the empirical performance of Discr to PICC and I2C2 by investigating the sensitivity of each statistic to changes in within-item variability. Figure 1C shows the impact of increasing within-item variance on the four different simulation settings. For the three with predictive information, increasing variance decreases predictive accuracy (green line). As desired, Discr also decreases nearly perfectly proportionally.
However, only in the first setting, where each item has a spherically symmetric Gaussian distribution, do I2C2 and PICC drop proportionally. Even in the second (Gaussian) setting, I2C2 and PICC are effectively uninformative about the within-subject variance. And in the third (non-Gaussian) setting, they are similarly uninformative. This suggests that, of these statistics, only Discr can serve as a satisfactory surrogate for predictive accuracy under these general settings.

4.3 Discr provides a one-sample test for whether items are discriminable
A prerequisite for making subject-specific predictions is that subjects are different from one another in predictable ways, that is, are discriminable. If not, the same assay applied to the same individual on multiple trials could result in unacceptably variable results. Thus, prior to embarking on a machine learning search for predictive accuracy using some data, one can simply test whether the data are discriminable at all. If not, predictive accuracy will be hopeless. Letting D denote the Discr of a dataset with n items and s measurements per item, and D_0 denote the Discr of the same size dataset with zero subject-specific information, the formal hypothesis test for Discr is
$$H_0 : D = D_0, \qquad H_A : D > D_0. \qquad (4.1)$$
Figure 1D shows that Discr achieves as high power as I2C2, and higher power than PICC, in the spherical Gaussian setting. This result demonstrates that although Discr does not rely on Gaussian assumptions, it still performs as well as or better than parametric methods when the data satisfy these assumptions. In the cross setting, only Discr correctly identifies that items differ from one another, despite the fact that the data are Gaussian. In the ball/circle setting, both Discr and I2C2 perform comparably. And when there is no signal, all tests are valid, achieving power less than or equal to the critical value. Non-parametric Discr therefore has the power of parametric approaches for data for which those assumptions are appropriate, and much higher power for other data.

4.4 Discr provides two- and multi-sample tests for optimal experimental design
Given two approaches for obtaining a given dataset, which can differ in experimental protocols and/or processing pipelines, are the measurements produced by one approach more discriminable than those produced by the other? Formally, letting D^(1) be the Discr of the first approach, and D^(2) be the Discr of the second approach, we have the following hypothesis:
$$H_0 : D^{(1)} = D^{(2)}, \qquad H_A : D^{(1)} > D^{(2)}.$$
Again, one could replace Discr with other test statistics, and we devised a permutation test to obtain confidence intervals and p-values (see Appendix C.2 for details). Figure 1D shows Discr achieves nearly as high or higher power than both I2C2 and ICC for all three settings with across-subject differences, and all tests are valid. The fact that Discr achieves nearly equal or higher power than the Gaussian methods, even under Gaussian assumptions, suggests that Discr will be a superior metric for optimal experimental design.
5 Empirical properties of discriminability on real data
5.1 Real data collection and processing
The Consortium for Reliability and Reproducibility (CoRR) [7] has generated functional MRI (fMRI) and diffusion MRI (dMRI) scans from >1,600 participants, often with multiple measurements, collected through 28 different studies spanning over 20 sites. Each of the sites uses different scanners, technicians, and scanning protocols, thereby representing a wide variety of settings with which one can test different processing pipelines. Figure 2A shows the six-stage sequence of pre-processing steps for converting the raw fMRI data into connectomes, that is, estimates of the strength of connections between all pairs of brain regions. At each stage of the pipeline, we consider several different "standard" approaches, that is, approaches that have previously been proposed in the literature, typically with hundreds or thousands of citations [8]. Moreover, they have all been collected into a pre-processing pipeline engine, called Configurable Pipeline for the Analysis of Connectomes (C-PAC) [9]. In total, for the six stages together, we consider 2 × 2 × 2 × 2 × 4 × 3 = 192 different processing pipelines. Because each stage is nonlinear, it is possible that the best sequence of choices is not equivalent to the best choices on their own. For this reason, publications that evaluate a given stage in isolation, using any metric, could reach misleading conclusions if one is searching for the best sequence of steps. The dMRI connectomes were acquired via 48 preprocessing pipelines using the Neurodata MRI Graphs (ndmg) pipeline [10]. Appendix E provides specific details for both fMRI and dMRI preprocessing, as well as the options attempted. Figure 2B shows that the preprocessing strategy has a large impact on the Discr of the resulting fMRI connectomes.
Each column shows one of 64 different preprocessing strategies, ordered by how significantly different they are from the pipeline with greatest Discr (averaged over all datasets, tested using the above two-sample test). Interestingly, pipelines with worse average Discr also tend to have higher variance across datasets. The best pipeline, FNNNCP, uses FSL registration, no frequency filtering, no scrubbing, no global signal regression, CC200 parcellation, and converts edge weights to ranks. The majority of the strategies (51/64 = 79.7%) show significantly worse Discr than the optimal strategy at α = 0.05 (Discr 2-sample test). In other words, several standard procedures for processing these data reduce Discr on average, calling into question whether they should be used in resting state fMRI connectomics.

Discr identifies which acquisition and analysis decisions are most important for improving performance
While the above analysis provides evidence for which sequence of processing steps is best, it does not provide information about which choices individually have the largest impact on overall Discr. To do so, it is inadequate to simply fix a pipeline and only swap out algorithms for a single stage, as such an analysis will only provide information about that fixed pipeline. Therefore, we evaluate each choice in the context of all 192 considered pipelines in Figure 3A. The pipeline obtained by independently selecting the best option for each pre-processing stage (FNNGCP) is not exactly the same as the pipeline with highest Discr (FNNNCP), but it is also not significantly worse (Discr 2-sample test, p-value = 0.14). Moreover, except for no scrubbing, each stage has a significant impact on Discr after correction for multiple hypotheses (Wilcoxon signed-rank statistic, p-values all < 0.001).
Another choice is whether to estimate connectomes using functional or diffusion MRI. Whereas both data acquisition strategies have known problems [11], the Discr of the two experimental modalities has not been directly compared. Using four datasets from CoRR that acquired both fMRI and dMRI on the same subjects, and have quite similar demographic profiles, we tested whether fMRI- or dMRI-derived connectomes were more discriminable. For three of the four datasets, dMRI connectomes were more discriminable. This is not particularly surprising, given the susceptibility of fMRI data to changes in state rather than trait (e.g., amount of caffeine prior to scan [9]).
Figure 2 caption (fragment): [...] per pipeline is a weighted sum of its discriminabilities across datasets. Each pipeline is compared to the optimal pipeline with the highest mean Discr, FNNNCP, using the above two-sample hypothesis test. The remaining strategies are arranged according to p-value, indicated in the top row.
The above results motivate investigating which aspects of the dMRI processing strategy were most effective. We focus on two criteria: how to scale the weights of connections, and how many regions of interest (ROIs) to use. Figure 3C.i shows that the log transform tends to yield more discriminable connectomes, though not significantly so (Wilcoxon signed-rank statistic = 891, p-value = 0.42). Both rank and log transform significantly exceed raw edge weights (Wilcoxon signed-rank statistic, p-value < 0.001). [...]ities. Unfortunately, most parcellations with semantic labels (e.g., visual cortex) have hundreds, not thousands, of parcels. This result therefore motivates the development of more refined semantic labels.
Figure 3 caption (fragment): (A) [...] pipelines are aggregated for a particular preprocessing step, with pairwise comparisons with the remaining preprocessing options held fixed. The beeswarm plot shows the difference between the overall best performing option and the second best option for each stage (mean in bigger black dot); the x-axis label indicates the best performing strategy. The best strategies are FNIRT, no frequency filtering, no scrubbing, global signal regression, the CC200 parcellation, and ranks edge transformation. A Wilcoxon signed-rank test is used to determine whether the mean for the best strategy exceeds the second best strategy: a * indicates that the p-value is at most 0.001 after Bonferroni correction. Of the best options, only no scrubbing is not significantly better than alternative strategies. Note that the options that perform marginally the best are not significantly different from the best performing strategy overall, as shown in Figure 2. (B) A comparison of the discriminabilities for the 4 datasets with both fMRI and dMRI connectomes. dMRI connectomes tend to be more discriminable, in 14 of 20 total comparisons. (C.i) Comparing raw edge weights (Raw), ranking (Rank), and log-transforming the edge weights (Log) for the diffusion connectomes, the Log and Rank transformed edge weights tend to show higher Discr than Raw. (C.ii) As the number of ROIs increases, the Discr tends to increase.

Optimizing Discr improves downstream inference performance
We next examined the relationship between the Discr of each pipeline, and the amount of information it preserves about two properties of interest: sex and age. Based on the simulations above, we expect that pipelines with higher Discr will yield connectomes with more information about covariates. Indeed, Figure 4 shows that, for every single one of the 28 datasets, a pipeline with higher Discr tends to preserve more information about both covariates. The amount of information is quantified by the effect size of the multiscale graph correlation statistic MGC [12,13], a statistic that quantifies the magnitude of association for both linear and nonlinear dependence structures. In contrast, if one were to use either PICC or I2C2 to select the optimal pipeline, for many datasets, subsequent predictive performance would degrade. These results are highly statistically significant: the slopes of effect size versus Discr are positive (Fisher's corrected [14] t-test, p-value < 0.001 for both sex and age), but not so for the other test statistics.

Discussion
We propose the use of Discr as a simple and intuitive measure for experimental design featuring multiple measurements. Numerous efforts have established the value of quantifying reliability, repeatability, and replicability (or more generally, stability) using parametric measures such as ICC and I2C2. However, they have not been used to optimize stability (that is, they are only used post-hoc to determine stability, not as criteria for searching over the design space), nor have non-parametric multivariate generalizations of these statistics been available. We derive one-sample (goodness-of-fit) and two-sample (equality) tests for Discr, and demonstrate via theory and simulation that Discr provides numerous advantages over existing techniques across a range of simulated settings. Our neuroimaging use-case exemplifies the utility of these features of the Discr framework for optimal experimental design.
Figure 4 caption (fragment): Both the x and y axes are normalized by the minimum and maximum statistic. For each study, the effect size is regressed onto the statistic. Color and line width correspond to the study and number of scans, respectively (see Figure 2B). The solid black line is the weighted mean over all studies. Discr is the only statistic for which all slopes exceed zero. Moreover, we find that the corrected p-value [14] is significant across datasets for both covariates (med. p-value < .001). This indicates that pipelines with higher Discr correspond to larger effect sizes for the covariate of interest, and that this relationship is stronger for Discr than other statistics. Appendix E.2 details the methodologies employed.
Discr has a number of connections with related statistical algorithms worth further consideration. Discr is related to energy statistics [15], in which the statistic is a function of distances between observations [16]. Energy statistics provide approaches for goodness-of-fit (one-sample), equality (two-sample), and multi-sample testing [17]. Similar to Discr, energy statistics make relatively few assumptions. However, energy statistics require a large number of measurements per item, which is often unsuitable for biological data, where we frequently have only a small number of repeated measurements. Discr is most closely related to multiscale graph correlation (MGC) [12,13], which combines energy statistics with nearest neighbors, as does Discr.
While Discr provides experimental design guidance for big data, other considerations may play a role in a final determination. For example, the connectomes analyzed here are resting-state, as opposed to task-based fMRI connectomes. Recent literature suggests that the global signal in a rs-fMRI scan may be a nuisance variable for task-based approaches [18,19]. Thus, while Discr is an effective tool for experimental design, knowledge of the techniques in conjunction with the inference task is still a necessary component of any investigation.
On this note, it is important to emphasize that Discr, like the related statistics, is neither necessary nor sufficient for a measurement to be practically useful. For example, categorical covariates, such as sex, are often meaningful in an analysis, but not discriminable. Human fingerprints are discriminable, but typically not biologically useful. In addition, none of the statistics studied here are immune to sample characteristics, so interpreting results across studies deserves careful scrutiny. For example, having a sample with variable ages will increase the inter-subject dissimilarity of any metric dependent on age (such as the connectome). With these caveats in mind, Discr remains a key experimental design consideration across a wide variety of settings.
Due to the high volume of open-access data with informative downstream inferential covariates, as well as the large number of open-source libraries for analyzing data, the connectomics use-case provided herein serves to illustrate how Discr can be used to facilitate experimental design. We envision that Discr will find substantial applicability across disciplines and sectors beyond brain imaging, such as genomics, pharmaceutical research, and many other aspects of big data science and industry. To this end, we provide open-source implementations of Discr for both Python and R [20,21]. Code for reproducing all the figures in this manuscript is available at https://neurodata.io/mgc.

Appendix A. Population and Sample Discriminability.
Suppose that $\theta_i \in \Theta$ represents a physical property of interest for a particular item $i$. In a biological context, for instance, an item could be a participant in a study, and the property of interest could be the individual's true brain network, or connectome. We cannot directly observe the physical property; rather, we must first measure $\theta_i$ and then "wrangle" (or pre-process) it. Call the measurement function $f \in \mathcal{F}$, for a family of possible measurement functions $\mathcal{F}$. That is, $f : \Theta \to \mathcal{W}$. So, measurements of $\theta_i$ are observed as $f(\theta_i) = w_i$. However, $w_i$ may be noisy, with measurement artefacts. Alternatively, $w_i$ might not be the property of interest; for example, if the property is a network, perhaps $w_i$ is a multivariate time series from which we can estimate a network. We therefore have another function, $g \in \mathcal{G}$, $g : \mathcal{W} \to \mathcal{X}$, which represents the data wrangling procedure that takes the measurement and produces an informative derivative (for instance, confound removal). The family of possible data wrangling procedures is $\mathcal{G}$. In this fashion, the output of interest is
$$x_i = g(f(\theta_i)).$$
The goal of experimental design is to choose an $f$ and $g$ that yield high-quality and useful inferences, that is, that yield $x$'s that we can use for various inferential purposes. When we have repeated measurements of the same items, we can use those samples to our advantage. Given $x_i^j$, the $j$th measurement of item $i$, we would expect $x_i^j$ to be more similar to $x_i^{j'}$ (another measurement of the same item) than to any measurement of a different item, $x_{i'}^{j''}$. Formally, let $\delta : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ be a distance metric. We define the population discriminability:
$$D = \mathbb{P}\big(\delta(x_i^j, x_i^{j'}) < \delta(x_i^j, x_{i'}^{j''})\big).$$
That is, the population discriminability $D$ represents the average probability that the within-item distance $\delta(x_i^j, x_i^{j'})$ is less than the across-item distance $\delta(x_i^j, x_{i'}^{j''})$. Discriminability depends on the choice of distance $\delta$, as well as the measurement process $f$ and the analysis choices $g$.
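The population discriminability represents a property of the distribution of $\theta_i$. In real data, since we do not observe the true distribution, we instead rely on the sample discriminability. Suppose a dataset consists of $i \in \{1, \ldots, n\}$ subjects, where each subject $i$ has $J_i$ repeat measurements. The sample discriminability is defined:
$$\hat{D} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{J_i} \sum_{j' \neq j} \sum_{i' \neq i} \sum_{j''=1}^{J_{i'}} \mathbb{I}\big\{\delta(x_i^j, x_i^{j'}) < \delta(x_i^j, x_{i'}^{j''})\big\}}{\sum_{i=1}^{n} \sum_{j=1}^{J_i} \sum_{j' \neq j} \sum_{i' \neq i} J_{i'}}.$$
It can be shown [22] that under the multivariate additive noise model, that is, $x_i^j = \theta_i + \epsilon_i^j$ with $\mathbb{E}[\epsilon_i^j] = c$, the sample discriminability Discr is both a consistent and an unbiased estimator of the population discriminability.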

Appendix B. Discriminability Provides an Informative Bound for Inference.
During experimental design, the extent of subsequent inference tasks may be unknown. A natural question is then: what are the implications of selecting a discriminable experimental design? Formally, assume the task of interest is binary classification: that is, $\mathcal{Y} = \{0, 1\}$, and we seek a classifier $h : \mathcal{X} \to \mathcal{Y}$. The goal of experimental design in this context is to choose the options $(f^*, g^*)$ that will minimize the classification loss:
$$(f^*, g^*) = \operatorname{argmin}_{(f, g) \in \mathcal{F} \times \mathcal{G}} \; \min_{h} \; \mathbb{P}\big(h(x) \neq y\big), \qquad x = g(f(\theta)).$$
For a fixed $(f, g)$, the minimal prediction error is achieved by the Bayes classifier [23]:
$$h^*_{f,g}(x) = \operatorname{argmax}_{y \in \{0, 1\}} \mathbb{P}(y \mid x),$$
and we let $L^*_{f,g}$ denote the Bayes error, that is, the error achieved by $h^*_{f,g}$.

Theorem B.1. Assume the multivariate Gaussian additive noise setting; that is, $x_i^j = \theta_i + \epsilon_i^j$, with $\theta_i$ and $\epsilon_i^j$ multivariate Gaussian and independent. There exists a decreasing function $\gamma(\cdot)$, which depends only on $\theta$ and $y$, such that:
$$L^*_{f,g} \leq \gamma(D_{f,g}).$$
That is, the Bayes error can, in fact, be upper bounded by a decreasing function of discriminability, as shown in the proof below. As a direct consequence of this theorem, we see:
Corollary B.1. Assume $(f_1, g_1)$ and $(f_2, g_2)$ are two processing strategies, and suppose that $D_{f_1,g_1} > D_{f_2,g_2}$. Then, because $\gamma$ is decreasing:
$$\gamma(D_{f_1,g_1}) \leq \gamma(D_{f_2,g_2}).$$
In other words, the upper bound on the Bayes error achieved by strategy $(f_1, g_1)$ is no larger than the upper bound on the Bayes error achievable by strategy $(f_2, g_2)$. Consequently, under the described setting, the pipeline that achieves a higher Discr can facilitate improved inference relative to competing strategies, despite the fact that the task is unknown during data collection and processing. Complementarily, if we instead consider the predictive accuracy $1 - L^*_{f,g}$, a similar argument gives a lower bound on the predictive accuracy via an increasing function of Discr. That is, in the context of the corollary, a more discriminable pipeline will tend to have a higher accuracy on an arbitrary predictive task.

Proof of Theorem (B.1).
Consider the additive noise setting, that is, $\mathbf{x}_i^j = \theta_i + \epsilon_i^j$. To bound the probability above, we bound the $\theta_i - \theta_{i'}$ term and the noise term separately. We start with the first term.

Here, $\sigma_2^2 = \mathrm{tr}(\Sigma_\theta)$ is the trace of the covariance matrix of $\theta_i$. We can apply Markov's inequality for any $t > 0$. Let $\sigma_1^2 = \mathrm{tr}(\Sigma_\epsilon)$ denote the trace of the covariance matrix of $\epsilon_i^j$, and let $a$ and $b$ be two suitably chosen constants. Furthermore, we let $t^2 = \sqrt{2}\, a \sigma_1 \sigma_2$ and choose a threshold $\theta$ for which $\theta \le 1$ whenever $a^2 \sigma_1^2 \ge 2\sigma_2^2$. According to the Paley-Zygmund inequality [24], for all $0 \le \theta \le 1$ and $Z \ge 0$, $P(Z > \theta\, E[Z]) \ge (1 - \theta)^2\, E[Z]^2 / E[Z^2]$. Plugging in the $\theta$ above yields the corresponding bound for the noise term. Likewise, plugging $t^2$ into inequality (B.2) yields a second bound. Because the $\theta$'s and $\epsilon$'s are independent, the two inequalities can be combined. Note that the resulting bound holds even if $a^2 \sigma_1^2 < 2\sigma_2^2$, as the right-hand side then becomes greater than 1. This gives a bound on the ratio $\sigma_2 / \sigma_1$. To obtain a bound on the Bayes error, we apply Devijver and Kittler's result [25], which is $L \le \frac{2\pi_0\pi_1}{1 + \pi_0\pi_1\, \Delta\mu^T \Sigma^{-1} \Delta\mu}$.
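As a numerical sanity check on the last inequality, the sketch below compares the Devijver and Kittler bound with the exact Bayes error for two Gaussian classes that share a covariance matrix. It is an illustration rather than part of the proof: the function names are hypothetical, `delta_sq` stands for the Mahalanobis term $\Delta\mu^T \Sigma^{-1} \Delta\mu$, and the closed form used for the exact error is the standard one for this two-class Gaussian setting, assumed here rather than taken from the text.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def exact_bayes_error(delta_sq, pi0):
    """Bayes error for two Gaussian classes with a shared covariance.

    delta_sq is the squared Mahalanobis distance between the class means
    (must be positive) and pi0 is the prior of class 0 (0 < pi0 < 1).
    """
    pi1 = 1.0 - pi0
    delta = math.sqrt(delta_sq)
    shift = math.log(pi0 / pi1) / delta  # moves the decision threshold
    return (pi0 * normal_cdf(-delta / 2.0 - shift)
            + pi1 * normal_cdf(-delta / 2.0 + shift))


def devijver_kittler_bound(delta_sq, pi0):
    """Upper bound 2*pi0*pi1 / (1 + pi0*pi1*delta_sq) quoted in the text."""
    pi1 = 1.0 - pi0
    return 2.0 * pi0 * pi1 / (1.0 + pi0 * pi1 * delta_sq)


# Evaluate both quantities on a small grid of priors and separations.
rows = []
for pi0 in (0.5, 0.7, 0.9):
    for delta_sq in (0.25, 1.0, 4.0, 16.0):
        rows.append((pi0, delta_sq,
                     exact_bayes_error(delta_sq, pi0),
                     devijver_kittler_bound(delta_sq, pi0)))

for pi0, delta_sq, exact, bound in rows:
    print(f"pi0={pi0:.1f} delta_sq={delta_sq:5.2f} "
          f"exact={exact:.4f} bound={bound:.4f}")
```

On this grid the exact error stays below the bound, and the bound shrinks as the separation term grows, which is the monotone behaviour the theorem relies on.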

C.1 One-Sample Test
Recall the one-sample hypothesis test, shown in Equation (4.1). We approximate the distribution of $\hat{D}$ under the null through a permutation approach. The subject labels of our $N$ samples are first permuted randomly, and $\hat{D}_{0,N}$ is computed each time given the observed data $\mathbf{X}$ and the permuted labels. For a level-$\alpha$ significance test, we compare $\hat{D}_N$ to the $(1 - \alpha)$ quantile $Q_{1-\alpha}$ of the empirical null distribution of $\hat{D}_{0,N}$, and reject the null hypothesis if $\hat{D}_N > Q_{1-\alpha}$. This approach provides higher power than the former approach, under similar assumptions.
Note that the permutation-based approach requires $r$ computations of the sample Discr. The total computational complexity is then $O(N^2 \max(p, rs))$. This approach is only linear in the number of desired repetitions, and is therefore sensible for most settings in which the sample Discr can itself be computed. Moreover, the computation can be sped up substantially through parallelization. With $T$ cores, the computational complexity is instead $O(N^2 \max(p, \frac{r}{T} s))$, as shown in Algorithm 1. We extend this one-sample test to both PICC and I2C2 to provide a robust p-value associated with both statistics of interest. Note that the permutation approach can be generalized to any statistic quantifying repeatability based on repeated measurements.
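The permutation procedure described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_discr` encodes one reading of the sample statistic (the average fraction of between-item distances that exceed each within-item distance, with Euclidean distance and ties counted as one half), the `+1` correction in the p-value is a common convention that the text does not specify, and all names and the toy data are hypothetical.

```python
import math
import random


def sample_discr(data, labels):
    """Mean, over all within-item pairs, of the fraction of between-item
    distances that exceed the within-item distance (ties count one half).
    Assumes at least two distinct items and one repeated item."""
    n = len(data)
    total, count = 0.0, 0
    for a in range(n):
        dists = [math.dist(data[a], data[b]) for b in range(n)]
        between = [dists[b] for b in range(n) if labels[b] != labels[a]]
        for b in range(n):
            if b == a or labels[b] != labels[a]:
                continue
            wins = sum(1.0 if d > dists[b] else 0.5 if d == dists[b] else 0.0
                       for d in between)
            total += wins / len(between)
            count += 1
    return total / count


def one_sample_test(data, labels, reps=200, seed=0):
    """Permutation p-value for the observed discriminability."""
    rng = random.Random(seed)
    observed = sample_discr(data, labels)
    exceed = 0
    for _ in range(reps):  # iterations are independent, so they could run in parallel
        shuffled = list(labels)
        rng.shuffle(shuffled)
        if sample_discr(data, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (reps + 1)


# Toy data (made up): 5 items, 3 noisy 2-D measurements each.
rng = random.Random(1)
data, labels = [], []
for item in range(5):
    for _ in range(3):
        data.append((10.0 * item + rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)))
        labels.append(item)

observed, p_value = one_sample_test(data, labels)
print(f"observed Discr = {observed:.3f}, p-value = {p_value:.4f}")
```

Because the toy items are far apart relative to the measurement noise, the observed statistic sits near its maximum and few or no shuffled labelings match it, so the p-value is small. Each call to `sample_discr` is quadratic in the number of measurements, which is why the overall cost grows linearly in the number of repetitions.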

Algorithm 1 Discr One-Sample Permutation Test
Input: (1) $\{\mathbf{x}_i^j\}_{j \in [J_i], i \in [n]}$: n items of data, each featuring $J_i$ measurements.
(2) r: an integer for the number of permutations.
Output: p ∈ [0, 1]: the p-value associated with the test.
1: compute the observed sample discriminability
2: (note that this for-loop can be parallelized over T cores, as the loops are independent processes)
3: for i in 1, . . . , r do
4:   draw a random shuffling of the measurements
5:   compute the sample discriminability under the shuffled measurements
6: end for
7: p ← fraction of shuffles whose discriminability is at least the observed value

C.2 Two-Sample Test
We implement two-sample testing using a permutation approach, similar to the one-sample test. First, we compute the observed difference in Discr between the two design choices. The null distribution of the difference in Discr is constructed by first taking random convex combinations of the observed data from each of the two design choices (the "randomly combined datasets").
Discr is computed for each of the two randomly combined datasets for each permutation. Next, all pairwise differences in Discr between the randomly combined datasets are computed; these form the null distribution. Finally, the observed statistic is compared with the differences under the null. The p-value is the fraction of null differences that are at least as extreme as the observed statistic. Note that we can use this approach for both one- and two-tailed hypotheses, for an experimental design having higher Discr, lower Discr, or equal Discr relative to a second approach; we implement all three in the software implementation of the two-sample test. The procedure for the two-sample test is shown in Algorithm 2, with the alternative hypothesis as specified in Equation (4.2). The computational complexity is then $O(\frac{r}{T} N^2 \max(p, \max_i s_i))$. Note that for each permutation, the limiting step is the computation of the Discr in $O(N^2 \max(p, s))$. This is then offset through parallelization over $T$ cores in the implementation. We extend this two-sample test to both PICC and I2C2 to provide a robust p-value associated with both statistics of interest, for similar reasons to the above. Again, this permutation approach can be generalized to any statistic quantifying repeatability based on repeated measurements.

Algorithm 2 Discr Two-Sample Test
Input: (1) $\{\mathbf{x}_i^j\}_{j \in [J_i], i \in [n]}$: n items of data, each featuring $J_i$ measurements, from the first sample.
(2) $\{\mathbf{z}_i^j\}_{j \in [J_i], i \in [n]}$: the observed data, from the second sample.
(3) r: an integer for the number of permutations.
Output: p ∈ [0, 1]: the p-value associated with the test.
1: compute the Discr of the first sample
2: compute the Discr of the second sample
3: compute the observed difference in Discr between samples 1 and 2
4: (the for-loop below can be parallelized over T cores, as each loop is an independent process)
5: for i in 1 : r do
6:   (generate a synthetic null dataset for each of the 2 samples, using a convex combination of the elements of each sample)
7:   for k in 1 : 2 do
8:     draw a random shuffle of the measurements
9:     form a convex combination of random elements from each sample
10:    compute the Discr of the convexly combined elements
11:  end for
12: end for
13: (compute all pairs of differences in Discr using the convexly combined samples)
14: for i in 1, . . . , r − 1 do
15:   for j in i + 1, . . . , r do
16:     add the difference in Discr to the null distribution
17:   end for
18: end for
19: p ← fraction of null differences at least as extreme as the observed difference in Discr

E.2 Effect Size Investigation
In this investigation, we are interested in learning how maximization based on the observed notion of reliability correlates with real performance on a downstream inference task. Recalling Corollary (B.1), we explore its implications in a large neuroimaging dataset provided by the Consortium for Reliability and Reproducibility [4], and demonstrate that selecting the experimental design via Discr, in fact, facilitates improved downstream inference on both a regression and a classification task. This provides strong motivation for leveraging Discr for experimental design.
Ideally, for a particular summary statistic, a high value will generally correlate with a positive effect size. For datasets i = 1, . . . , M, where M is the total number of datasets, processing strategies j = 1, . . . , 192, and summary statistics k = 1, . . . , 3 (Discr, PICC, and I2C2), we fit the standard linear regression model $Y = \beta X + \epsilon$, where we model the effect size $Y$ estimated by MGC [42] via a linear relationship with $X$, the observed sample statistic for approach k (Discr, PICC, or I2C2), with coefficient $\beta$. Note that the interpretation of $\beta$ is the expected change in the effect size $Y$ due to a single unit change in the observed sample statistic $X$. Both $Y$ and $X$ are uniformly normalized across all strategies within a single dataset to facilitate intuitive comparison across methods. For each summary statistic k, we test the null hypothesis of no positive relationship against the alternative $H_A: \beta > 0$, and the relevant test is the one-sided t-test. Acceptance of the alternative hypothesis against the null provides evidence that an increase in the observed sample statistic $X$ tends to correspond to an increase in the observed effect size $Y$, where neither of the responses (age, sex) was known or considered when the data were processed or when the sample statistics were computed. This provides evidence that the statistic is informative for experimental design within the context of this investigation. Model fitting for this investigation is conducted using the lm function in the R programming language [43].
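The regression step can be illustrated with a short, dependency-free Python sketch. It is an assumption-laden illustration, not the analysis code: "uniformly normalized" is read here as min-max scaling within each dataset, the fit includes an intercept (as R's `lm` does by default), the function names are hypothetical, and the numbers are made up for two hypothetical datasets rather than taken from the study.

```python
import math


def minmax(values):
    """Rescale values to [0, 1] within one dataset (assumed normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def slope_t_statistic(x, y):
    """Least-squares slope of y on x (with intercept), its t statistic,
    and the residual degrees of freedom."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((b - intercept - beta * a) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return beta, beta / std_err, n - 2


# Made-up numbers: sample statistic and effect size for four strategies
# in each of two hypothetical datasets.
statistic = {"A": [0.80, 0.85, 0.90, 0.95], "B": [0.60, 0.70, 0.75, 0.90]}
effect = {"A": [0.10, 0.14, 0.15, 0.21], "B": [0.30, 0.33, 0.41, 0.40]}

# Normalize within each dataset, then pool the strategies across datasets.
x, y = [], []
for name in statistic:
    x.extend(minmax(statistic[name]))
    y.extend(minmax(effect[name]))

beta, t_stat, dof = slope_t_statistic(x, y)
print(f"beta = {beta:.3f}, t = {t_stat:.2f} on {dof} degrees of freedom")
```

A one-sided p-value for $H_A: \beta > 0$ is the upper tail of the t distribution at `t_stat` with `dof` degrees of freedom, which can be obtained, for example, with `scipy.stats.t.sf(t_stat, dof)` if SciPy is available.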

E.3 Dataset Descriptions
Useful Data Links. All relevant analysis scripts and data for figure reproduction in this manuscript are made publicly available and can be found at https://neurodata.io/mgc.
