Computationally scalable regression modeling for ultrahigh-dimensional omics data with ParProx

Abstract Statistical analysis of ultrahigh-dimensional omics scale data has long depended on univariate hypothesis testing. With growing numbers of data features and samples, the obvious next step is to establish multivariable association analysis as a routine method to describe genotype–phenotype association. Here we present ParProx, a state-of-the-art implementation to optimize overlapping and non-overlapping group lasso regression models for time-to-event and classification analysis, with selection of variables grouped by biological priors. ParProx enables multivariable model fitting for ultrahigh-dimensional data within an architecture for parallel or distributed computing via latent variable group representation. It thereby aims to produce interpretable regression models consistent with known biological relationships among independent variables, a property often explored post hoc, not during model estimation. Simulation studies clearly demonstrate the scalability of ParProx with graphics processing units in comparison to existing implementations. We illustrate the tool using three different omics data sets featuring moderate to large numbers of variables, where we use genomic regions and biological pathways as variable groups, rendering the selected independent variables directly interpretable with respect to those groups. ParProx is applicable to a wide range of studies using ultrahigh-dimensional omics data, from genome-wide association analysis to multi-omics studies where model estimation is computationally intractable with existing implementations.


Introduction
Omics technologies are principal modalities in today's systems biology and molecular research.
Despite significant advances and wide applicability, clinical omics data sets are modestly sized in the vast majority of biomedical studies, often providing insufficient statistical information for phenotype association analysis or prognostication. Since the arrival of gene expression microarrays in the 1990s, the small sample size in omics data has challenged biostatisticians and bioinformaticians to grapple with the "small n, large p" problem, which constrains the ability of statistical models or machine learning methods to exploit the multi-dimensional feature space at sufficient depth. As a consequence, omics data analysis has inevitably depended on univariate hypothesis testing for individual molecular features in combination with post hoc multiple testing correction procedures to control the overall type I error, further complemented by enrichment analysis of biological processes and pathways in the molecular features not discarded by hypothesis testing. However, hypothesis testing-based analysis does not describe genotype-phenotype relationships in a multivariate space, and the resulting statistical analyses often ignore the covariance between functionally related molecules, i.e., the underlying structure among data features in omics data sets.
There are well-established approaches to multivariable modeling of high-dimensional data in supervised analysis problems, ranging from regularized linear regression models 1, 2, 3 to machine learning methods such as decision tree-based random forests (RF) 4 and support vector machines (SVM) 5 . The spectrum of model complexity in different approaches is often characterized by the trade-off between bias and variance for prediction. 6 On the one hand, more complex models tend to explain a training data set well, but an overfit model bears large variance in prediction between samples with similar data features in external data sets. On the other hand, simpler models may not fit the training data as well as more complex machine learning methods, but predictions made by simpler models in independent data sets are more stable and less sensitive to small differences in the training data. With the sample size almost always smaller than the number of molecules measured in omics data, the overfitting problem of complex models is unavoidable in most applications of machine learning methods, especially in the model-free approaches. By contrast, simpler methods such as regularized regression models are not flexible enough to capture non-linear relationships, but they produce highly interpretable models with low variance in predictions.
Today, with increasing throughput and decreasing cost of the experimental platforms, we are already transitioning from the era of small n, large p problems to a time of "large n, very large p" problems. This phenomenon is perhaps best exemplified by genome-wide association studies with genotypes at millions of loci and with a sample size greater than hundreds of thousands 7 , or multi-omics studies with tens of thousands of tumor biopsies in The Cancer Genome Atlas (TCGA). 8 In spite of the emergence of large-sample, high-dimensional omics studies, most data modeling approaches implicitly require in-memory storage of data and computation on central processing units (CPUs), which may be computationally expensive on standard computer hardware. The breaking point is yet to be recognized by those who routinely perform omics data analysis, as computer hardware has improved over time, but this emerging reality poses implementation challenges for the future development of biostatistical and bioinformatics tools.
Motivated by these considerations, here we present new software for fitting regularized linear regression models on high-dimensional clinical omics data, embodying an efficient optimization strategy and optional parallel/distributed computing capability. The implementation, called PARPROX, fits group least absolute shrinkage and selection operator (group lasso) regression models for survival analysis or sample classification. During model fitting, variables are regularized by non-overlapping or overlapping group penalties specified by the user, and the variables in the same group are penalized jointly to reflect known group information such as pathways and gene ontologies (GO). More importantly, we implemented PARPROX in the Julia language to natively allow for parallel computation with graphics processing units (GPUs) or distributed computing over cloud environments (Amazon Web Services, Google Cloud Platform, Microsoft Azure, etc.), which enables modeling of ultrahigh-dimensional data sets. We provide detailed protocols and illustration through three representative example data sets below.

Overview of applications
We demonstrate regression modeling of clinical omics data using PARPROX through three case studies. In the first application, we present a Cox regression analysis for overall survival outcome in 9,707 patients in TCGA, using the somatic mutations as predictor variables. 8,9 We create the predictor variables by counting mutations for ~56,000 DNA sequence segments in the codons and regulatory regions, in which mutation counts of individual sequence segments become independent variables (or covariates) and the individual proteins or interacting protein pairs harboring those sequence segments are considered as variable groups for structured regularization. In the second application, we establish a gene expression-based logistic regression model for pathological complete response (pCR) to neoadjuvant chemotherapy for breast cancer, using ~12,000 mRNA-level measurements in 469 patients as covariates and pathways/GO terms as the overlapping variable groups for structured regularization. 10,11,12,13 In the last application, we present a Cox regression analysis of 428 liver cancer patients using DNA methylation status of CpG islands in and out of coding regions as covariates. Each methylation probe represents a CpG island on the genome, and the probes associated with the same gene form a variable group in this case. As certain chromosomal regions are densely populated by multiple protein coding genes, some probes may belong to multiple adjacent genes, creating overlap in the variable groups, i.e. between adjacent genes. The methylation array platform used by TCGA contains as many as >865,000 probes originally, but we have trimmed this data to 289,509 probes for demonstration purposes. Each of these three data sets takes up to 4.3 gigabytes of storage even after trimming. 
Unless carefully managed, data of this size may cause serious issues in memory allocation, and hence in reading and modifying data entries, especially when there are overlaps among the variable groups (see Methods).
Here we describe each data set in more detail first. The first case study explores a multivariable Cox regression analysis using somatic mutations as predictors of cancer death risk.
Since somatic mutations are sparse and not reproducibly detected at predetermined loci in early tumors, statistical analysis associating "locus-level" somatic mutation data and clinical endpoints such as patient survival remains challenging. Alternatively, counts or rates of somatic mutations can be aggregated on predefined polymorphic regions or individual genes. 14, 15 However, such "gene-level" analysis fails to retain the resolution required for investigating the potential functional impact of mutations (Figure 1A). To address this issue, we have recently proposed a functional region-based association testing approach for exome sequencing data, called Gene-to-Protein-to-Disease (GPD). 16 GPD separately counts mutations for coding regions pertaining to protein domains and 11 amino acid-long windows surrounding post-translational modification (PTM) sites, and performs univariate statistical analyses with a clinical endpoint. Figure 1B illustrates how GPD summarizes mutation counts per protein sequence segment of three different types. These newly organized data can be used as covariates in the regression model.
Using this framework, we summarized the entire somatic mutation data into a count data set with 9,707 patients and 55,961 protein sequence segments across the human exome. We then fit a pan-cancer Cox regression model with 55,961 variables and group lasso penalties for overall survival of the 9,707 patients using PARPROX (Figure 1C). Here, all sequence segments in one gene are considered to form a variable group, representing the hypothesis that individual mutations accrued in different protein sequences may have independent functional impact, but mutations in the same gene are more likely to be associated with the disease risk of a given individual subject collectively than individually. We also attempted the same regression analysis with group penalties reflecting simultaneous mutations on sequence segments in physically interacting proteins, which produced an interesting outcome that led to selection of no predictor variables. This is discussed in more detail below.
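The GPD-style aggregation of locus-level mutations into segment-level covariates, with genes as non-overlapping variable groups, can be sketched as follows. This is a minimal illustration in Python; the segment labels and the `build_count_matrix`/`gene_groups` helpers are hypothetical and not part of PARPROX.

```python
from collections import defaultdict

def build_count_matrix(mutations, segments, patients):
    """Aggregate locus-level mutation hits into per-segment counts.

    mutations: list of (patient, segment) hits; `segments` and `patients`
    fix the column and row order. Returns a dense list-of-lists matrix
    (patients x segments) of mutation counts.
    """
    seg_idx = {s: j for j, s in enumerate(segments)}
    pat_idx = {p: i for i, p in enumerate(patients)}
    X = [[0] * len(segments) for _ in patients]
    for p, s in mutations:
        X[pat_idx[p]][seg_idx[s]] += 1
    return X

def gene_groups(segment_to_gene):
    """Group segment column indices by their parent gene.

    All segments of one gene form one (non-overlapping) variable group.
    """
    groups = defaultdict(list)
    for j, gene in enumerate(segment_to_gene):
        groups[gene].append(j)
    return dict(groups)
```

For example, two patients with mutations in three hypothetical segments of TP53 and KRAS yield a 2 x 3 count matrix and two variable groups.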
In the second example, we demonstrate that PARPROX produces a biologically interpretable logistic regression model of genes associated with pCR to neoadjuvant chemotherapy, a binary outcome determined by expert pathologists (Figure 1D). In this conventional "small n, large p" data example, we use biological pathways and GO terms as variable groups with arbitrary degrees of overlap and nesting, 17,18 and show that the group lasso regression model optimized by PARPROX identifies a sparse prognostic gene signature enriched with specific biological processes, rendering the prognostic model more interpretable than other similarly performing alternatives.
In the third example, we demonstrate PARPROX in the context of analyzing data sets with a p so large that parallel computing is required to fit a regression model. We fit a Cox regression model with overlapping group lasso penalties on a DNA methylation data set from the liver hepatocellular carcinoma cohort of TCGA. The DNA methylation array platform has probes representing genomic regions of high GC content, and as such, the dimensionality is much higher than in other omics data sets where the measurements are often summarized to the individual gene level, e.g., gene expression or DNA copy number data. We show that PARPROX can perform regularized Cox regression with penalties jointly applied to probes associated with different segments of regulatory and coding regions for individual genes, where 90,099 variable groups are formed over 289,509 variables. We show that the analysis can be completed within a reasonable amount of time using a single GPU, whereas another software package for overlapping group lasso regression analysis could not handle the size of the data.
Through these case studies, we show that PARPROX provides a robust and computationally efficient implementation to fit regularized regression models, with the ability to find optimal, interpretable models under the constraint of highly complex group penalties with arbitrary degrees of overlap. This capability opens the door to incorporating prior knowledge on the relationship among predictor variables in the regression modeling ( Figure 1E). Another important innovation of this implementation is that the model fitting process can be parallelized through distributed or parallel computing environments, if necessary, in case exorbitantly large data need to be analyzed, as illustrated by the third application data set.

Proximal gradient optimization for logistic and Cox regression models
Before the demonstration, we first describe the computational workflow of PARPROX in detail, beginning with the typical binary classification setting. The goal of the regression modeling is to understand the influence of $p$ covariates $x = (x^1, \ldots, x^p)$ on the probability $P(y = 1 \mid x)$ of a subject belonging to class 1, where the two classes are labeled 0 and 1. In logistic regression, if there are $n$ subjects, given the observed label $y_i$ and covariates $x_i = (x_i^1, \ldots, x_i^p)$ for each individual $i$, the likelihood of the observed data is modeled based on the assumption that the log odds of the class membership is a linear combination of the covariates, yielding the likelihood of the linear combination coefficients $\beta = (\beta^1, \ldots, \beta^p)$:
$$L(\beta) = \prod_{i=1}^{n} \left( \frac{e^{x_i^\top \beta}}{1 + e^{x_i^\top \beta}} \right)^{y_i} \left( \frac{1}{1 + e^{x_i^\top \beta}} \right)^{1 - y_i}.$$
If the number of covariates $p$ is large, as in omics data, it is customary and also reasonable to assume that only a few independent variables mainly determine the response. This is promoted by multiplying the likelihood by a prior probability that causes all but a few coefficients among $(\beta^1, \ldots, \beta^p)$ to be zero, a process known as regularization. This prior typically takes the form $\pi(\beta) \propto \exp(-\lambda \|\beta\|)$ for a norm $\|\cdot\|$, so that model fitting amounts to minimizing
$$f(\beta) = \ell(\beta) + \lambda \|\beta\|, \qquad (1)$$
where $\ell(\beta) = -\log L(\beta)$ is the negative log-likelihood. The regularization parameter $\lambda$ is typically selected via cross-validation.
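As a minimal illustration of the logistic objective, the following Python sketch evaluates the negative log-likelihood and its lasso-penalized version for small dense data. The function names are ours, not part of the PARPROX API.

```python
import math

def logistic_nll(beta, X, y):
    """Negative log-likelihood of logistic regression.

    For each subject i with linear predictor eta_i = x_i . beta, the
    contribution is log(1 + exp(eta_i)) - y_i * eta_i.
    """
    nll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))  # linear predictor
        nll += math.log(1.0 + math.exp(eta)) - yi * eta
    return nll

def objective(beta, X, y, lam):
    """Objective of the form (1): negative log-likelihood plus lasso penalty."""
    return logistic_nll(beta, X, y) + lam * sum(abs(b) for b in beta)
```

At `beta = 0` every subject contributes `log 2`, which is a quick sanity check of the implementation.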
In a typical survival analysis setting, the goal is to understand the influence of $p$ covariates $x = (x^1, \ldots, x^p)$ on the survival probability $S(t \mid x) = P(T > t \mid x)$ that the survival time $T$ of a subject is longer than time $t$. In the Cox proportional hazards model, 19,20 given the $i$-th subject, $i = 1, \ldots, n$, with covariates $x_i = (x_i^1, \ldots, x_i^p)$, whose time-to-death $T_i$ or right-censoring time $C_i$ is measured so that the observed survival time is $y_i = \min(T_i, C_i)$, the survival probability is equivalently modeled through the hazard function $h(t \mid x) = -\frac{d}{dt} \log S(t \mid x)$, the negative logarithmic derivative of the survival function $S$:
$$h(t \mid x_i) = h_0(t) \exp(x_i^\top \beta),$$
where $h_0(t)$ is the unspecified baseline hazard function. That is, the covariates affect the hazard multiplicatively, in such a way that a linear combination of the covariates determines the strength of the multiplication. The coefficient vector of the linear combination is denoted by $\beta = (\beta^1, \ldots, \beta^p)$. Cox (1972) then proposes to get rid of the unknown baseline hazard in the fitting procedure by maximizing the partial likelihood
$$PL(\beta) = \prod_{i: \delta_i = 1} \frac{\exp(x_i^\top \beta)}{\sum_{j: y_j \geq y_i} \exp(x_j^\top \beta)},$$
where $\delta_i$ is the indicator that is 1 if $T_i \leq C_i$, i.e., the death of subject $i$ is observed, and 0 otherwise.
Like logistic regression, if the number of covariates is large, a prior of the form $\pi(\beta) \propto \exp(-\lambda \|\beta\|)$ is multiplied to the partial likelihood to promote a sparse model. The model fitting procedure then amounts to the optimization problem of minimizing
$$f(\beta) = -\log PL(\beta) + \lambda \|\beta\|, \qquad (2)$$
which takes the form of a generalized linear model, similarly to the logistic regression problem (1). The regularization parameter $\lambda$ is typically selected via cross-validation.
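The negative log partial likelihood can be computed directly from its definition, as in this illustrative Python sketch. We assume no tied event times for simplicity; the function name is hypothetical, not PARPROX's API.

```python
import math

def cox_neg_partial_loglik(beta, X, time, event):
    """Negative log partial likelihood of the Cox model.

    For each observed death i (event[i] == 1), the contribution is
    -x_i . beta + log( sum over the risk set {j: time[j] >= time[i]}
    of exp(x_j . beta) ). Assumes no tied event times.
    """
    eta = [sum(b * x for b, x in zip(beta, xi)) for xi in X]
    nll = 0.0
    for i in range(len(X)):
        if event[i] == 1:
            risk = sum(math.exp(eta[j]) for j in range(len(X))
                       if time[j] >= time[i])
            nll += math.log(risk) - eta[i]
    return nll
```

With `beta = 0` and all deaths observed, the risk sets shrink by one at each event time, so three subjects give `log 3 + log 2 = log 6`.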
If the objective function (1) or (2) of the optimization problem were differentiable in $\beta$, then the typical gradient descent method, which iteratively updates the current estimate of $\beta$ by moving slightly in the direction opposite to the vector of first-order partial derivatives of the objective function, would eventually yield the correct estimate. Unfortunately, the objective functions (1) and (2) are not differentiable due to the presence of the norm $\|\beta\|$. Nevertheless, the term $\ell(\beta)$ in (1) and (2) is differentiable, hence an extension of the gradient descent method called the proximal gradient method can instead be applied. The proximal gradient method for (1) or (2) consists of two steps 21 :
1) Compute the gradient $\nabla \ell(\beta^{(k)})$ of $\ell(\beta)$ at the current estimate $\beta^{(k)}$ of the coefficient vector $\beta$.
2) Update the estimate by the formula
$$\beta^{(k+1)} = \arg\min_{\beta} \frac{1}{2t} \left\| \beta - \left( \beta^{(k)} - t \nabla \ell(\beta^{(k)}) \right) \right\|_2^2 + \lambda \|\beta\|, \qquad (3)$$
where $\|\cdot\|_2^2$ in the right-hand side of equation (3) is the squared Euclidean norm, i.e., $\|\beta\|_2^2 = (\beta^1)^2 + \cdots + (\beta^p)^2$, and the argmin operator refers to the parameter value that minimizes the expression on the right-hand side. The scalar $t > 0$ is the step size that determines how far to move the estimate from the current candidate $\beta^{(k)}$. The idea of the proximal gradient method is to approximate $\ell(\beta)$ by a spherically shaped quadratic function that is tangential to $\ell(\beta)$ at $\beta^{(k)}$ and lies above it for all other values of $\beta$, and then minimize the approximate objective function. By iteratively doing so, the minimum of the original function (1) or (2) can be found (Figure 2A) even when the objective function is not differentiable at the optimal solution. For many choices of the norm $\|\beta\|$, the right-hand side of the second step (3) has a closed-form solution, known as the proximal operator, that can be evaluated coordinate by coordinate and hence in parallel. This contrasts with the coordinate descent method widely used to solve (1) and (2), which is inherently sequential and requires the whole dataset to be stored in a single device. With the advance of modern high-performance computing infrastructure, such as the GPU, the ability of parallel and distributed computation has become a commodity, and running a parallel algorithm is getting increasingly easier; PARPROX is an instance.
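For the lasso penalty, the proximal operator in step 2 is the coordinate-wise soft-thresholding rule, so the two steps can be sketched in a few lines of Python. This is a didactic ISTA-style loop with a fixed step size, not the PARPROX implementation.

```python
import math

def soft_threshold(z, t):
    """Proximal operator of t * |.| : shrink z toward zero by t."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def prox_grad_logistic_lasso(X, y, lam, step, iters=500):
    """Proximal gradient descent for L1-penalized logistic regression.

    Step 1: gradient of the smooth negative log-likelihood,
            sum_i (sigmoid(x_i . beta) - y_i) * x_i.
    Step 2: gradient step followed by the separable proximal operator,
            applied coordinate by coordinate (hence parallelizable).
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, xi))
            resid = 1.0 / (1.0 + math.exp(-eta)) - yi  # p_i - y_i
            for j in range(p):
                grad[j] += resid * xi[j]
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam)
                for j in range(p)]
    return beta
```

On a toy separable problem the single fitted coefficient takes the expected sign; in practice the step size must satisfy the Lipschitz condition discussed in Methods.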

Pan-cancer survival analysis of somatic mutations using group lasso Cox regression with non-overlapping and overlapping group penalties
We first demonstrate PARPROX through survival analysis of somatic exome mutation data in the TCGA pan-cancer cohort. As mentioned in the overview, we have mapped all somatic mutations curated by the Pan-Cancer consortium of TCGA to human protein sequence segments (see Methods). While the group lasso model with gene-level variable groups selected a prognostic signature (below), the model with overlapping group penalties defined over physically interacting protein pairs selected no predictor variables, suggesting that prognostic mutations rarely co-occur in proteins that interact with one another. This result can be interpreted in two different ways. It is possible that cancer death risk-associated somatic mutations do not necessarily co-localize in genes encoding protein complex members, especially in the somatic mutation data of early primary tumors at the time of diagnosis.
In fact, when we examined co-mutation events in patients who died within five years of follow-up, only 126 pairs of interacting proteins had simultaneous mutations in more than ten such patients. Further restricting to patients who were deceased within two years, we had only 71 protein interactions with simultaneous mutations (Supplementary Table 2).
Alternatively, it is also possible that tumors collected at early diagnosis have a very low probability of harboring functionally consequential mutations on two or more essential members of a protein complex, and such events would not have been observed frequently in the early primary tumor collection of TCGA in the first place. Indeed, when we compared the number of interacting protein pairs with simultaneous mutations across the patients, we observed that only 8,682 of 133,146 total interaction pairs (6.5%) bear more frequent simultaneous mutations (more than 10 subjects) than expected by random co-mutation on any pair of proteins (Supplementary Table 2).
To benchmark the group lasso model, we also ran the same regression with Cox lasso regression, i.e., with an L1 penalty on individual sequence segments but no group-wise regularization 23,24 (Supplementary Table 1). To our surprise, the two analyses selected substantially distinct prognostic models (Figure 3C). First, the group lasso model selected 861 sequence segments as the prognostic signature of cancer death risk, whereas the lasso model selected 288 sequence segments. The smaller number of sequence segments selected by the lasso model was expected, since group lasso maintains a sequence segment as a predictor as long as another sequence segment in the same protein carries prognostic signal. Second, the cancer type differences in cancer death risk were highly inconsistent with the frequency of multi-year death events in the lasso model.
For instance, LAML is one of the cancers with the lowest survival rate in TCGA, yet the coefficient representing the difference in mutation-independent death risk between LAML and GBM (the baseline cancer in this analysis) was negative, potentially suggesting that the death rate differences were over-adjusted by mutation data in the model fitting. BRCA is a cancer with an overall 95% survival rate at five years of follow-up, yet the coefficient for the difference between BRCA and GBM was positive, again producing coefficients with signs inconsistent with the mortality rate differences between the two cancers. Therefore, not accounting for known structure amongst the covariates, i.e., fitting the lasso model, seems to have a considerable impact on the interpretability of the final model.
Based on the same selection criteria, the top prognostic sequence segments from the lasso model are shown in Figure 3B.
We next compared the regression coefficients from the pan-cancer analysis with those from models fit on individual cancer data separately. Figure 3D shows the heatmap of regularized coefficients obtained from the pan-cancer analysis as well as those from analyses of individual cancer data. As expected, the signs of the coefficients were highly congruent among the different analyses, with a few exceptions. Hence, we conclude that the pan-cancer survival analysis by the group lasso regression of PARPROX successfully pools shared effects of mutations on the risk of cancer death in this data.

Prognostic gene expression signature of pathological complete response using group lasso regression with overlapping groups
In the next case study, we demonstrate PARPROX through a classification problem. Re-analyzing the meta-analysis data of Prat et al., 13 we aim to identify an mRNA gene expression signature to classify breast cancer patients undergoing chemotherapy with anthracycline and neoadjuvant agents into two groups, i.e., pathological complete response (pCR) and residual disease (RD).
Here we use gene expression data sets of 12,307 genes and 469 patients in the training data set 11 and two test data sets 10, 12 (N=115 and N=244), and we use pathways and GO terms as variable group information in the logistic group lasso regression. The analysis workflow is visually represented in Figure 4A. We next built classifiers of pCR using four different methods: logistic regression with lasso penalty, logistic regression with pathway-level overlapping group penalty (PARPROX), random forest, and support vector machine (SVM). In the PARPROX analysis, we used an external data resource that combined multiple pathway databases to define variable groups, resulting in 11,734 groups among the 12,307 variables (including singletons that do not belong to any pathway or GO term). 18 With the smaller size of this data set (12,307 by 469), the analysis was performed within a reasonable amount of time with PARPROX on a GPU (7 min for cross-validation, 19 min for final model fitting). A similar analysis could be performed using the GRPREGOVERLAP package in R (28 min for cross-validation, 28 min for the entire solution path calculation). As shown in Figure 4A, we trained the classifiers in the training data by Hatzis et al., and made predictions of pCR on the two test data sets. When we compared the area under the curve of the receiver operating characteristic (ROC), the first three methods performed as well as one another (Figure 4B), and the predictions from the SVM with radial basis kernel, with cost and gamma parameters optimized through 10-fold cross-validation within the training data, did not perform better than the other three methods (scores shown in Figure 4C).
Given the highly similar performance metrics across different methods, we next investigated the interpretability of the gene expression signatures. Since the two machine learning methods with greater complexity (RF and SVM) utilize all features in the respective classifiers, we did not pursue interpreting the underlying predictor. Instead, we compared the selected genes between the two logistic regression models with and without group penalties. Logistic regression with lasso penalty selected a total of 289 genes in the predictive signature (182 with positive and 107 negative coefficients, Supplementary Table 4). Subsequent pathway enrichment analysis showed that the genes with positive regression coefficients, those contributing to the better chance of pCR, had mild enrichment of mitotic cell cycle and DNA replication genes, whereas the genes with negative coefficients were not particularly enriched in any known pathways other than ECM organization.
By contrast, the PARPROX analysis incorporating the pathway membership of genes selected a total of 829 genes (481 positive and 348 negative), a larger panel of genes than the lasso logistic model above. As stated in the previous case study, this is an expected consequence of using the group penalty, which tends to select genes in the same pathway together if there is a true effect of pathway-wide gene regulation. A clear advantage of the group lasso penalty is that one can rank pathways based on the number of genes with non-zero coefficients (Figure 5A). We selected five GO terms and one KEGG pathway with the largest numbers of genes with non-zero coefficients and large magnitudes of the summed coefficients; all six related to one overarching theme and shared many common genes: DNA replication during the mitotic cell cycle (Supplementary Table 3).
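Ranking pathways by the number of member genes with non-zero coefficients, as in Figure 5A, amounts to a simple tally over the fitted coefficient vector. The following Python helper is a hypothetical sketch of that bookkeeping, not part of PARPROX.

```python
def rank_pathways(coef, gene_names, pathways):
    """Rank pathways by the number of member genes with non-zero
    coefficients, breaking ties by the magnitude of the summed coefficients.

    coef: fitted coefficients aligned with gene_names.
    pathways: pathway name -> list of member gene names (may overlap).
    Returns (pathway, n_selected, |sum of selected coefficients|) tuples.
    """
    selected = {g: c for g, c in zip(gene_names, coef) if c != 0.0}
    ranking = []
    for name, members in pathways.items():
        hits = [selected[g] for g in members if g in selected]
        if hits:
            ranking.append((name, len(hits), abs(sum(hits))))
    ranking.sort(key=lambda r: (-r[1], -r[2]))
    return ranking
```

Because groups may overlap, the same selected gene can contribute to several pathways, which is exactly the behavior the latent group penalty permits.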

Prognostic signature of DNA methylation in liver cancer data - high-dimensional data
In the third data set, we fitted a Cox regression model with overall survival as the outcome variable and DNA methylation probes located in distinct genomic positions relative to protein coding genes as predictor variables in a liver cancer data set. In this data set, even after selecting only the probes located near protein coding genes, the number of data features (p) is 289,508, with a sample size of N = 428. In addition, we treated 90,099 genomic regions representing unique relative positions of probes as variable groups, including TSS1500, TSS200, 5' UTR, 1st exon, gene body, and 3' UTR as annotated by the microarray vendor (see Methods). We thus guide the Cox regression model fitting to jointly penalize the probes in these genomic segments.
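Constructing overlapping variable groups from probe annotations can be sketched as follows. The annotation format and the `probe_groups` helper are hypothetical; real Illumina manifests encode this information differently.

```python
from collections import defaultdict

def probe_groups(annotations):
    """Build variable groups from per-probe annotations.

    annotations: list (one entry per probe, in column order) of
    (gene, region) labels, e.g. ("TP53", "TSS200"). A probe annotated
    to several genes joins several groups, creating the overlap
    described in the text.
    """
    groups = defaultdict(list)
    for j, labels in enumerate(annotations):
        for gene, region in labels:
            groups[(gene, region)].append(j)
    return dict(groups)
```

A probe lying in a region shared by two adjacent genes then appears in both groups, which the latent group lasso penalty handles naturally.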
Using overall survival as the clinical endpoint, we first attempted to fit the overlapping group Cox regression using GRPREGOVERLAP in R; the software was unable to perform the model fitting and produced memory allocation errors on multiple desktop computers with at least 8GB RAM and 3.4 GHz quad-core Intel i5 CPUs or better. By contrast, PARPROX was able to complete the C index-based cross-validation and final model fitting on a single GPU. We benchmarked the model against a Cox regression model with the plain lasso penalty, which selected 114 CpG island probes. To our surprise, the probes selected by Cox regression with the lasso penalty had poor overlap with the probes selected in the group lasso model, sharing only 16 common probes. In addition, the C-index was comparable between the two models: 0.65 (± 0.03) with the group lasso penalty and 0.63 (± 0.03) with the lasso penalty. Similar to the second example above, this result is likely due to the fact that there is a large number of weakly predictive regression models with different predictor variable combinations and comparable degrees of association with overall survival. Among those options, group lasso chose the model that best represents the variable group structure we specified as the modelers. This observation also reaffirms that the specification of variable group structure plays an influential role in the selection of data features associated with the clinical endpoint, and PARPROX provides the interface to fit these models in ultrahigh-dimensional data sets that would otherwise have been impossible to fit.

Discussion
In this work, we presented a scalable implementation to fit regression models for survival and classification analysis with structured group penalties representing biological prior information regarding the relationships among the independent variables. The proximal gradient optimization implemented in the Julia language can distribute the iterative updates for parallel computation in the case of large-scale data sets, which is the major advance offered by PARPROX. We demonstrated the robustness of the implementation in both the 'large n, large p' case (mutation data example) and the 'large p, small n' case (gene expression data example) and showed that PARPROX can deal with survival regression using a very large-scale data set (p = 289,508) using parallel computing with a GPU.
In contrast to conventional differential expression analyses via hypothesis testing, our one-shot regression analysis strategy describes the multivariate relationship between the clinical endpoint and high-dimensional molecular data using linear models. Linear models are often thought to be too restrictive to describe complex relationships between genotype and phenotype, but they remain highly interpretable, and the proximal gradient optimization lets the computation scale almost linearly with the addition of hardware, e.g., GPUs. This is the key contrast to the GRPREG and GRPREGOVERLAP packages implementing the coordinate descent method, which is an inherently sequential algorithm.

Optimization with structured sparsity
Both the logistic and proportional hazards models of the Results section admit a negative log-likelihood $\ell(\beta)$ of generalized linear model form. To encode a collection $\mathcal{G}$ of possibly overlapping variable groups, PARPROX employs the latent group lasso penalty
$$\Omega(\beta) = \min_{\{\gamma_g\}} \sum_{g \in \mathcal{G}} \lambda_g \|\gamma_g\| \quad \text{subject to} \quad \beta = \sum_{g \in \mathcal{G}} V_g \gamma_g, \qquad (4)$$
where $\gamma_g$ is a latent coefficient vector supported on group $g$ and $V_g$ is the matrix that expands $\gamma_g$ into the full coordinate space. In other words, the regression coefficient is decomposed as a sum of latent components $\beta_g = V_g \gamma_g$, and it is the norm of these components that is penalized (note $\|V_g \gamma_g\| = \|\gamma_g\|$). In this way, overlaps between the groups are allowed. When there is no overlap, penalty (4) reduces to the classical group lasso penalty. 2 Although the latter penalty may be straightforwardly defined for overlapping groups, it tends to select the complement of a union of groups: if two groups share a gene but one group is not selected, then the coefficient for the shared gene must be zero and the other group is only partially selected. The penalty (4), on the other hand, promotes the opposite, selection of a union of groups, and this property is desired for pathway selection.
Estimation of the generalized linear model under the penalty (4) can be formulated as the optimization problem
$$\min_{\beta, \{\gamma_g\}} \; \ell(\beta) + \sum_{g \in \mathcal{G}} \lambda_g \|\gamma_g\| \quad \text{subject to} \quad \beta = \sum_{g \in \mathcal{G}} V_g \gamma_g,$$
which, with an appropriate matrix $V$ such that $\beta = V\gamma$, can be equivalently written as the unconstrained optimization problem
$$\min_{\gamma} \; \ell(V\gamma) + \sum_{g \in \mathcal{G}} \lambda_g \|\gamma_g\|. \qquad (5)$$
It can be shown that the second term $\sum_{g \in \mathcal{G}} \lambda_g \|\gamma_g\|$ is a norm of the aggregated latent vector $\gamma = (\gamma_g)_{g \in \mathcal{G}}$, and hence problem (5) has the same structure as problems (1) and (2), so the proximal gradient descent (PGD) iteration of the previous section applies, with a step size $t$ between 0 and $2/L$, where $L$ is the Lipschitz constant of the gradient of $\ell(V\gamma)$. Most importantly, PGD allows a distributed update of the latent groups since they are independent of each other, hence it can be significantly accelerated by modern parallel computing devices such as a GPU or distributed computing environments such as clouds.
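The distributed update of latent groups relies on the fact that the proximal operator of the penalty in (5) decomposes into independent block soft-thresholding operations, one per latent component. The following Python sketch illustrates this (serial here, but the per-block loop is exactly what a GPU parallelizes; function names are ours):

```python
import math

def group_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2 on one latent group.

    The whole block is shrunk toward zero, and zeroed out entirely
    if its Euclidean norm is below t (group-level sparsity).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        return [0.0] * len(v)
    scale = 1.0 - t / norm
    return [scale * x for x in v]

def prox_latent_groups(gamma_blocks, t):
    """Apply the proximal operator block by block.

    Each update touches only its own latent component gamma_g, so the
    iterations over blocks are independent and can run in parallel.
    """
    return [group_soft_threshold(v, t) for v in gamma_blocks]
```

Either a whole block survives (shrunk) or it is eliminated, which is what drives selection of entire variable groups.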

Comparison with other software packages
The software package GRPREG for the R statistical computing environment fits linear, logistic, and Cox regression models with non-overlapping group lasso penalties, hence solves problems (1) and (2). The package GRPREGOVERLAP extends GRPREG to handle the latent group lasso penalty (4) to allow overlaps between groups, hence solves problem (5). The key differences between these packages and PARPROX are the optimization algorithm (coordinate descent versus proximal gradient) and the ability of the latter to parallelize the computation on GPUs or distributed computing environments.

Grid points for cross-validation
In PARPROX, the regularization parameter $\lambda$ was chosen among 31 equally log-spaced values between $10^{-4}$ and $10^{-7}$ in the first case study, among 31 equally log-spaced values between $10^{-6}$ and $10^{-9}$ in the second case study, and among 21 equally log-spaced values starting from $10^{-6}$ in the third case study.
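Such grids are simply equally spaced exponents on the log10 scale, traversed from the largest value down, as a regularization path is usually run. A generic helper (not PARPROX code):

```python
def log_spaced(lo_exp, hi_exp, n):
    """Return n values equally spaced on the log10 scale, descending
    from 10**hi_exp down to 10**lo_exp."""
    step = (hi_exp - lo_exp) / (n - 1)
    return [10.0 ** (hi_exp - k * step) for k in range(n)]
```

For example, `log_spaced(-7, -4, 31)` reproduces the 31-point grid of the first case study.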

Analysis of co-mutation frequency on protein interaction networks
The protein-protein interaction network has 133,146 unique pairs of interactions among 12,047 unique proteins. To estimate the significance of the co-mutation frequency (the number of subjects having simultaneous mutations on both interacting proteins) of each pair of interacting proteins, we randomly sampled 133,146 pairs of interactions from the pool of proteins 1,000 times and calculated the co-mutation frequency for the randomly sampled pairs in each iteration. The p-value of a pair with co-mutation frequency F is defined as the number of pairs with co-mutation frequency higher than F divided by the total number of interactions (133,146), averaged across the 1,000 iterations.
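The permutation procedure above can be sketched as follows. This is a pure-Python illustration with hypothetical input structures (the actual analysis sampled 133,146 pairs per iteration over 1,000 iterations).

```python
import random

def permutation_pvalue(F, proteins, mutated_patients, n_pairs, n_iter, seed=0):
    """Permutation p-value for an observed co-mutation frequency F.

    In each iteration, sample n_pairs random protein pairs and record
    the fraction whose co-mutation frequency (number of patients with
    mutations in both proteins) exceeds F; the p-value is that fraction
    averaged over iterations.

    mutated_patients: protein -> set of patients with a mutation in it.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iter):
        exceed = 0
        for _ in range(n_pairs):
            a, b = rng.sample(proteins, 2)
            if len(mutated_patients[a] & mutated_patients[b]) > F:
                exceed += 1
        total += exceed / n_pairs
    return total / n_iter
```

An F larger than any achievable frequency gives a p-value of 0, and a negative F gives 1, which bounds the estimate as expected.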

Breast cancer neoadjuvant chemotherapy complete response microarray data sets
We downloaded gene expression microarray data sets with sample annotation information from the Gene Expression Omnibus database, based on the information from Prat et al. 13

Interaction networks and gene pathways for variable group information
For variable group information in the TCGA somatic mutation data analysis with network penalty, we used protein-protein interaction network data from iRefIndex 32 and BioPlex. 33