Numerical Instabilities in Analytical Pipelines Lead to Large and Meaningful Variability in Brain Networks

The analysis of brain-imaging data requires complex processing pipelines to support findings on brain function or pathologies. Recent work has shown that variability in analytical decisions can lead to substantial differences in results, endangering trust in conclusions [1-7]. We explored the instability of results by instrumenting a connectome-estimation pipeline with Monte Carlo Arithmetic [8, 9] to introduce random noise throughout. We evaluated the reliability of the connectomes, their features [10, 11], and the impact on analysis [12, 13]. The stability of results ranged from perfectly stable to highly unstable. This paper highlights the potential of leveraging induced variance in estimates of brain connectivity to reduce bias in networks while increasing the robustness of their applications in the classification of individual differences. We demonstrate that stability evaluations are necessary for understanding the error inherent to scientific computing, and how numerical analysis can be applied to typical analytical workflows. Overall, while the extreme variability in results due to analytical instabilities could severely hamper our understanding of brain organization, it also leads to an increase in the reliability of datasets.

equivalent analyses and found widely inconsistent results [1], and it is likely that software instabilities played a role. The present study approached evaluating reproducibility from a computational perspective, in which a series of brain-imaging studies were numerically perturbed such that the […]

[…]ing as a cost-effective and context-agnostic method for dataset augmentation.

While the separability of individuals is essential for the identification of brain networks, it is similarly reliant on network similarity across equivalent acquisitions (Hypothesis 2).

In this case, connectomes were grouped based upon session, rather than subject, and the ability to distinguish one session from another was computed within-individual and aggregated.

Both the unperturbed and pipeline-perturbation settings perfectly preserved differences between cross-sectional sessions, with a score of 1.0 (p < 0.005; optimal score: 0.5; chance: […]) […] acquisition-dependent bias inherent in the brain graphs. In stark contrast, input perturbations led to highly unstable […]

[…] Body Mass Index (BMI) groups and brain connectivity [12, 13], using standard dimensionality reduction and classification tools, and compared this to reference and random performance (Figure 3).
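The classification workflow described here (dimensionality reduction, then a standard classifier on phenotype groups, compared against reference performance) can be illustrated on synthetic data. Everything below is a toy sketch: the Gaussian "connectomes", the jitter used as a stand-in for perturbation-generated variants, and the PCA-plus-nearest-centroid classifier are our assumptions, not the study's actual pipeline or tools.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_fit_transform(X, k):
    """Project rows of X onto their top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Assign each test point to the class with the nearest centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy stand-ins for vectorized connectomes from two phenotype groups.
X = np.vstack([rng.normal(0.0, 1.0, (20, 50)),
               rng.normal(0.8, 1.0, (20, 50))])
y = np.repeat([0, 1], 20)

# Augment with jittered copies, standing in for perturbation-induced
# variants of each connectome.
X_aug = np.vstack([X, X + rng.normal(0.0, 0.05, X.shape)])
y_aug = np.concatenate([y, y])

Z = pca_fit_transform(X_aug, k=5)
pred = nearest_centroid_predict(Z, y_aug, Z)
accuracy = float((pred == y_aug).mean())
```

On this toy data the groups separate cleanly, so resubstitution accuracy is high; the point is only the shape of the augmentation workflow, in which each perturbed variant enters the training set with its original label.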

The analysis was perturbed through distinct samplings of the dataset across both pipelines and perturbation methods. […] given that the quality of relationships between phenotypic data and brain networks will be limited by the stability of the […]

[…] where e_x is the exponent value of x and ξ is a uniform random variable […] where g_ij is a graph belonging to class i that was measured at observation j, with i ≠ i′ and j ≠ j′.
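The "where e_x … ξ" clause above belongs to the Monte Carlo Arithmetic perturbation, conventionally written inexact(x) = x + 2^(e_x − t) · ξ, with virtual precision t and ξ uniform on (−1/2, 1/2). A minimal Python sketch of that operation, as an illustration only — the study instruments pipelines at the compiler/library level rather than in Python:

```python
import math
import random

def inexact(x, t=53):
    """Monte Carlo Arithmetic-style perturbation:
    inexact(x) = x + 2**(e_x - t) * xi, where e_x is the binary
    exponent of x, t is the virtual precision, and xi is uniform
    on (-1/2, 1/2)."""
    if x == 0.0:
        return x
    _, e_x = math.frexp(x)          # x = m * 2**e_x with 0.5 <= |m| < 1
    xi = random.uniform(-0.5, 0.5)  # uniform random noise
    return x + math.ldexp(xi, e_x - t)
```

With t equal to the working precision (53 bits for doubles), the perturbation sits at the level of rounding error; repeated evaluations yield a distribution of results whose spread estimates the numerical stability of downstream computations.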

Discriminability can then be read as the probability that an observation belonging to a given class will be more similar to […]

The correlations between observed graphs (Figure S1) across each grouping follow the same trend as percent deviation, as shown in Figure 1. However, notably different from percent deviation, there is no significant difference in the correlations between pipeline or input instrumentations. By this measure, the probabilistic pipeline is more stable in all cross-MCA and cross-directions except for the combination of input perturbation and cross-MCA (p < 0.0001 for all; exploratory).
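Read this way, discriminability can be estimated by counting, for every within-class pair, how many cross-class distances from the same anchor observation are larger. A minimal NumPy sketch under an assumed Euclidean distance on vectorized connectomes (the helper name is ours; the study's implementation may differ):

```python
import numpy as np

def discriminability(X, labels):
    """Estimate discriminability: the probability that a within-class
    pair of observations is closer than a cross-class pair sharing one
    of its members."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    # Pairwise Euclidean distances between all observations.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    hits, total = 0, 0
    for i in range(n):
        same = labels == labels[i]
        same[i] = False                 # exclude the observation itself
        other = labels != labels[i]
        for j in np.where(same)[0]:
            # Count cross-class distances from anchor i that exceed
            # this within-class distance.
            hits += int(np.sum(D[i, other] > D[i, j]))
            total += int(np.sum(other))
    return hits / total
```

A perfectly repeatable pipeline yields a score of 1.0 (every observation is closest to its own class), while indistinguishable classes approach chance.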

The marked lack of drop-off in performance across these settings, inconsistent with the measures shown in Figure 1, is due to the nature of the measure and the graphs. Given that structural graphs are sparse and contain considerable numbers of […]

The complete discriminability analysis includes comparisons across more axes of variability than the condensed version.

The reduction in the main body was such that only axes which would be relevant for a typical analysis were presented.

Figure S2 explores the stability of univariate graph-theoretical metrics computed from the perturbed graphs, including modularity, global efficiency, assortativity, average path length, and edge count. When aggregated across individuals and perturbations, the distributions of these statistics (Figures S2A and S2B) showed no significant differences between perturbation methods for either deterministic or probabilistic pipelines.
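Of the univariate statistics listed, global efficiency has a compact closed form: the mean inverse shortest-path length over all node pairs. A minimal NumPy sketch for binary undirected graphs (an illustration with an assumed helper name, not the graph-theory toolbox the study used):

```python
import numpy as np

def global_efficiency(A):
    """Global efficiency of a binary undirected graph: the mean inverse
    shortest-path length over all ordered node pairs."""
    n = A.shape[0]
    # Floyd-Warshall shortest paths on the binary adjacency matrix.
    D = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    with np.errstate(divide="ignore"):
        inv = 1.0 / D               # disconnected pairs contribute 0
    np.fill_diagonal(inv, 0.0)
    return inv.sum() / (n * (n - 1))
```

Evaluating such a statistic on every perturbed connectome from one session produces the per-session distribution whose spread the supplementary analysis quantifies.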

However, when quantifying the stability of these measures across connectomes derived from a single session of data, the two perturbation methods show considerable differences. The number of significant digits in univariate statistics for Pipeline Perturbation-instrumented connectome generation exceeded 11 digits for all measures except modularity, which contained more than 4 significant digits of information (Figure S2C). When detecting outliers from the distributions of observed statistics for a given session, the false positive rate (using a threshold of p = 0.05) was approximately 2% for all statistics, with the exception of modularity, which again was less stable, with an approximately 10% false positive rate. The probabilistic pipeline is significantly more stable than the deterministic pipeline (p < 0.0001; exploratory) for all features except modularity. When similarly evaluating these features from connectomes generated in the input perturbation setting, no statistic was stable with more than 3 significant digits or a false positive rate lower than nearly 6% (Figure S2D). The deterministic pipeline was more stable than the probabilistic pipeline in this setting (p < 0.0001; exploratory).
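The "number of significant digits" reported here can be estimated from a set of perturbed results. A minimal sketch assuming Parker's classic base-10 definition, s = log10(|μ| / σ) (our helper name, not necessarily the study's exact estimator):

```python
import numpy as np

def significant_digits(samples):
    """Estimated count of significant base-10 digits shared across
    perturbed results, via Parker's definition: s = log10(|mean| / std)."""
    x = np.asarray(samples, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)
    if sigma == 0.0:
        # All samples identical: report full double precision (~15 digits).
        return float(np.finfo(float).precision)
    return float(np.log10(abs(mu) / sigma))
```

Under this reading, the gap between more than 11 digits for pipeline perturbations and fewer than 3 for input perturbations corresponds to roughly eight orders of magnitude difference in the relative spread of the statistics.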

Two notable differences between the two perturbation methods are, first, the uniformity in the stability of the statistics,