A crowdsourcing approach for reusing and meta-analyzing gene expression data

To the Editor: Advances in high-throughput technologies have led to a rapid increase in the amount of data generated on a molecular, cellular and organismal scale1,2. The reuse and meta-analysis of large-scale data from multiple independent studies can increase the statistical power to obtain new and robust biological insights, compared with the analysis of any one study, and may serve as a productive starting point for informing the design of experiments3. Previous studies have successfully combined publicly available data from published studies to both reposition drugs4 and identify robust gene-expression signatures of transplant rejection5, infection status6,7, tumor subtypes and cancer progression8. However, these meta-analysis approaches are not trivial, often requiring study-related information that is not always available, as well as computational and statistical expertise that could discourage direct, hands-on participation of many biologists. Here we present OMics Compendia Commons (OMiCC) (https://omicc.niaid.nih.gov), a freely available tool, aimed at biologists with limited bioinformatics training, that uses a crowdsourcing approach to help overcome some of these challenges. OMiCC enables the broader biomedical research community to generate and test hypotheses through reuse and (meta-) analysis of existing data sets. Annotations, metadata and components of cross-study data compendia created by users are stored and made available to other users of the platform so that they may build on previous analyses and contribute their own annotations and analysis designs. In this way, OMiCC may help bring down barriers across communities and encourage a culture of sharing and openness in biomedical research.
Millions of gene expression profiles reside in public databases1,2. These data could potentially be used to generate, assess or replicate hypotheses, even if the experiments were not originally designed to answer the same research questions. For example, data for evaluating the effect of a drug (in which drug-treated versus untreated subjects are compared) could be used to investigate the effects of gender on drug treatment. In addition, meta-analysis approaches9 will become increasingly effective for drawing robust conclusions from similar data sets generated from independent studies. However, the wealth of information available in public databases remains largely untapped, particularly by experimental biologists. One reason for this is that the steps involved in retrieving, processing and analyzing these data can be computationally and statistically complex for many biologists. Numerous resources have been created to enable the reuse and analysis of large-scale expression data (Supplementary Note 1), but they are generally limited to one or a subset of analytical steps, and therefore additional programming is still required for most workflows. Although commercial software has been developed to address some of these limitations, the algorithms are often proprietary, which makes incorporating external data into any analysis difficult, if not impossible. Furthermore, fee-based services could limit the size and diversity of the user community; less well-funded groups and research areas, as well as organizations from developing countries, tend to have less access.
Another major barrier for both experimental and computational biologists alike is that structured meta-information critical for data reuse and cross-study analyses is typically not readily available. It is often necessary to determine which samples from a study can be grouped, which groups can be meaningfully compared (e.g., a particular type of tumor samples versus normal), and what groups can be collated or compared within and across studies. Constructing such sample groups and comparison pairs requires biological expertise specific to the biological domain of the study; doing so en masse for all available studies is thus enormously time-consuming and challenging.
OMiCC provides programming-free capabilities for the (meta-) analysis of public gene-expression data sets. It was also designed to serve as a 'commons' for engaging diverse segments of the biomedical research community to help build a comprehensive and reusable repository of meta-information (e.g., sample groups, pairs and their annotations), which are essential building blocks for constructing data compendia and performing meta-analyses (Fig. 1a). OMiCC can further serve as an educational tool to help students learn new biology by exposing them to hands-on exploration of large-scale data sets. This didactic mission is particularly important as biology is increasingly dominated by 'big data'-driven approaches. More than 26,000 pre-normalized and quality-checked human and mouse studies comprising ~690,000 expression profiles from the Gene Expression Omnibus (GEO) (Supplementary Note 2) are now accessible through OMiCC.
A core feature is the ability to easily create, annotate and share comparison group pairs (CGPs; Fig. 1b). A CGP comprises two collections (called sample groups) of gene expression profiles from a study, for example, blood transcriptomes of diabetic patients and of healthy controls. OMiCC provides easy-to-use interfaces for constructing sample groups and CGPs and for annotating them using medical subject headings (MeSH)10, a standardized biomedical vocabulary used by PubMed, so that the resulting annotations are more easily interpretable and reusable by the community. Once a CGP is formed, OMiCC can compute significantly differentially expressed genes and a differential expression profile (DEP) capturing the differences in expression values for all genes between the sample groups (Fig. 1b). In contrast to approaches that use only statistically significant differentially expressed genes for comparison among CGPs4,11, DEPs can be collated across CGPs spanning one or more studies to form a data matrix operable by existing analysis tools and algorithms, including clustering and gene set enrichment


Supplementary Information
A crowdsourcing approach for reusing and meta-analyzing gene expression data across studies and platforms
Naisha Shah 1,*, Yongjian Guo 2,*, Katherine V. Wendelsdorf 1,*, Yong Lu 1, Rachel Sparks 1, John S. Tsang 1,#
1 Systems Genomics and Bioinformatics Unit and 2 Office of the Chief Laboratory of Systems Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA 20892
* These authors contributed equally to this work.
# Correspondence: JST (john.tsang@nih.gov)

List of Supplementary Tables (Excel tables)
Supplementary Table 1: CGPs used in the meta-analyses of the inflammatory bowel diseases use case (this information was generated using OMiCC and pieced together in Excel; see Supplementary Note 3).
Supplementary Table 2: Meta-analysis results of ulcerative colitis (UC) and Crohn's disease (CD) (output from OMiCC, which uses the RankProd 1 package for computing these statistics). Note that the column names "pfp.DOWNregulated" and "pfp.UPregulated" are terms used by the RankProd package; they can be interpreted as the FDR for genes with increased or decreased expression in disease relative to healthy controls, respectively. Also see Box 1 and Supplementary Note 3.
Supplementary Table 3: Gene-set enrichment analysis results for genes whose expression was increased or decreased in disease vs. healthy comparisons for ulcerative colitis and Crohn's disease (these appear in different tabs of the Excel sheet). Analysis was performed separately for increased and decreased genes (DE genes determined using an FDR cutoff of 0.05; see Box 1 and Supplementary Note 3).
Supplementary Table 4: Meta-analysis results of validation UC CGPs constructed from independent studies using OMiCC. The first tab of the Excel sheet ("gene stats") contains the gene-level statistics (output from OMiCC in the same format as Supplementary Table 2). The "up genes pathway" and "down genes pathway" tabs contain the gene-set enrichment analysis results for up- and down-changing genes in UC vs. healthy comparisons. Analysis was performed separately for up- and down-changing genes (DE genes determined using an FDR cutoff of 0.05; see Box 1 and Supplementary Note 3).

Supplementary Note 1
Numerous useful resources have been created to enable the reuse and analysis of large-scale expression data, including 1) databases of manually curated expression data (e.g., GEO DataSet 2 and NextBio 3 ), 2) tools for data retrieval (e.g., MaRe 4 ), sample annotation, and data collation (e.g., InSilicoDB 5 ), 3) software for data analysis and visualization (e.g., GenePattern 6 ), and 4) packages for meta-analysis of multiple data sets (e.g., INMEX 7 ). Despite these advances, these tools tend to focus on one or a subset of steps, and thus forming complete workflows requires additional programming.
While there have been active commercial developments, including a platform called NextBio 8 that offers more turnkey solutions, its contents and annotations were pre-compiled using proprietary algorithms, and thus customizing analysis parameters, forming new signatures, and using data sets not already available within the platform can be difficult, if not impossible. Another commercial platform called InSilicoDB 5 offers interfaces for annotating, collating, and analyzing both user-supplied and public data. However, it only allows a limited number of free analyses using public data sets (10 as of December 2015). In general, a fee-based service could limit the size and diversity of the user community; for example, less well-funded groups and research areas, as well as organizations from developing countries, tend to have less access.

OMiCC software framework
OMiCC includes backend data-processing pipelines, a MySQL database (database schema available upon request), and a web server serving the web user interface (UI). The backend data-processing pipelines were mainly developed using the BPipe package 9 . These pipelines were used for retrieving data from GEO 2 , performing data normalization, formatting data output and updating the contents of the MySQL database. The pipelines were run on a Linux-based computer cluster through the Sun Grid Engine.
The OMiCC database is updated and accessed by the backend data-processing pipelines and web server through Hibernate, an object-relational mapping library that facilitates efficient data processing.
The OMiCC web server was developed using the Grails framework and is run on a Tomcat 7 web container. The Grails searchable plugin was used to index and search GEO data records through a Lucene indexing engine to further improve performance. The front-end web interface was developed using the jQuery library. The OMiCC data analysis jobs are executed using R on an Rserve server.

Gene-expression data and pre-processing
More than 26,000 human and mouse gene expression studies and the associated meta-data were downloaded from GEO, capturing ~90% of the data deposited on or before June 2015 (not all data are covered because certain technology platforms are excluded; see below). We plan to update OMiCC with new GEO releases once every 6 months, retrieving only data that are at least 6 months old, because newly deposited data tend to be updated frequently after initial deposition.
Up to three types of expression data are made available to OMiCC users depending on the platform and data availability in GEO: 1) for Affymetrix platforms, RMA-normalized data derived from the raw CEL files using the Affymetrix Power Tools package 10 ; 2) GEO series matrix data sets (i.e., GEO user-submitted versions of the data); and 3) quantile-normalized versions of (2), with normalization performed using the preprocessCore R package 11 .
GEO encourages submission of Minimum Information About a Microarray Experiment (MIAME) compliant data, which requires submission of both raw and normalized data files. However, a substantial proportion of the GEO studies still do not have associated raw data files available. In addition, not all studies or their associated publications, if any, contain sufficient information on the process used to generate the normalized data. Thus, in addition to providing the GEO user-submitted normalized data, we also provide a quantile-normalized version of that data processed using our own procedure. In OMiCC, the choice of which normalized data to use, if more than one type is available, depends on issues such as potential batch effects and technical artifacts of individual studies.
Under the assumption that the user-submitted normalized files in GEO (i.e., the GEO series matrix data) were processed appropriately, and because such files are available for almost all studies, OMiCC uses the quantile-normalized version of the GEO series matrix data by default. However, the user can choose other normalized file(s) for performing analyses within OMiCC.
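As an illustration of what quantile normalization does to a genes-by-samples matrix, the following is a minimal Python sketch. It is not the preprocessCore implementation used by OMiCC; the function name `quantile_normalize` is hypothetical, and ties are broken by row order here, which may differ from how preprocessCore treats tied values.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Illustrative sketch only: ties are broken by row order, which is an
    assumption and not necessarily what preprocessCore does.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    # Column-wise view: one list of values per sample.
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    # Sort each sample, then average across samples at each rank position.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[i] for col in sorted_cols) / n_samples for i in range(n_genes)
    ]
    # Replace every value by the mean of the values that share its rank.
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene_idx in enumerate(order):
            normalized[gene_idx][j] = rank_means[rank]
    return normalized


# Toy example: 4 genes (rows) measured in 3 samples (columns).
example = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 3.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized_example = quantile_normalize(example)
```

After normalization every sample (column) has the same set of values, so the distributions are identical across samples while the within-sample ordering of genes is preserved.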
Quality control assessment was performed on all data sets using the arrayQualityMetrics R package 12 . Samples flagged as outliers by all three outlier-detection methods in the package were removed. The metrics used by these outlier-detection methods were: 1) distances between arrays using mean absolute difference, 2) signal intensity distributions of the arrays by computing the Kolmogorov-Smirnov statistic, and 3) individual array quality using Hoeffding's D-statistic 12 .
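The removal rule described above (a sample is dropped only when every method flags it) amounts to a set intersection. The Python sketch below shows that consensus step only; it does not compute the three metrics themselves, and the method labels and sample identifiers are hypothetical placeholders.

```python
def consensus_outliers(flags_by_method):
    """Return the samples flagged by every outlier-detection method."""
    flagged_sets = [set(samples) for samples in flags_by_method.values()]
    if not flagged_sets:
        return set()
    return set.intersection(*flagged_sets)


# Hypothetical flags from the three checks (labels and IDs are made up).
flags = {
    "distance_between_arrays": {"GSM1", "GSM4"},
    "intensity_distribution": {"GSM1", "GSM2"},
    "individual_array_quality": {"GSM1", "GSM4"},
}
samples = ["GSM1", "GSM2", "GSM3", "GSM4"]

to_remove = consensus_outliers(flags)
kept = [s for s in samples if s not in to_remove]
```

Requiring agreement among all three methods is a conservative choice: a sample flagged by only one or two methods (GSM2 and GSM4 in this toy example) is retained.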
We also provide probe-to-HUGO gene symbol mapping for more than 1,900 GEO experimental platforms (GPLs 2 ) covering ~90% of the samples in our database. For a given platform, we first attempted to obtain the mapping from the manufacturer's website; if it was not available, we retrieved it either from the AILUN database 13 or from the corresponding GPL record in GEO 2 .

Construction of comparison group pairs (CGPs)
Gene expression studies and data sets from GEO can be searched and browsed using the OMiCC web interface. For querying, the US National Library of Medicine's controlled vocabulary MeSH (Medical Subject Headings) database 14 is used to facilitate searching using synonymous terms.
OMiCC can be used to create groups of samples from a study. A Comparison Group Pair (CGP) consists of two sample groups, e.g., a perturbed sample group and a control/unperturbed sample group (Fig. 1b). Once CGPs are created, they can be collated, and differential expression profiles (DEPs; see below) can be created automatically using OMiCC. Both sample groups and CGPs can be shared with other OMiCC users by flagging them as "public"; these can then be searched and used by other OMiCC users. Each sample group can be annotated using six concepts, namely perturbation (e.g., IL-4 treatment), time with perturbation, disease, sample type (e.g., monocytes), sample source (e.g., blood) and other (to capture free text and additional annotation categories). We use MeSH terms to assist with the annotation process to better categorize the samples and to facilitate structured searches by other users. We also allow free text for annotation, especially in cases where MeSH does not contain the appropriate tag terms. Similar to the study search interface, OMiCC also facilitates searches on publicly available groups of samples and CGPs.
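The relationship between sample groups, their six annotation concepts, and CGPs can be pictured with a small data model. The Python sketch below is purely illustrative: the class and field names are assumptions and do not reflect OMiCC's actual database schema (which the authors state is available upon request), and the study and sample identifiers are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The six annotation concepts named in the text (identifiers are assumed).
ANNOTATION_CONCEPTS = (
    "perturbation",
    "time_with_perturbation",
    "disease",
    "sample_type",
    "sample_source",
    "other",
)


@dataclass
class SampleGroup:
    name: str
    study_id: str
    sample_ids: List[str]
    annotations: Dict[str, List[str]] = field(default_factory=dict)
    public: bool = False

    def annotate(self, concept: str, term: str) -> None:
        """Attach a MeSH term or free-text tag under one of the six concepts."""
        if concept not in ANNOTATION_CONCEPTS:
            raise ValueError(f"unknown annotation concept: {concept}")
        self.annotations.setdefault(concept, []).append(term)


@dataclass
class ComparisonGroupPair:
    test_group: SampleGroup
    control_group: SampleGroup
    public: bool = False

    def __post_init__(self) -> None:
        # A CGP pairs two sample groups drawn from the same study.
        if self.test_group.study_id != self.control_group.study_id:
            raise ValueError("both sample groups must come from the same study")


# Placeholder identifiers, not real GEO accessions.
disease = SampleGroup("UC biopsies", "STUDY_A", ["S1", "S2", "S3"])
disease.annotate("disease", "Colitis, Ulcerative")
disease.annotate("sample_source", "Colon")
healthy = SampleGroup("Healthy biopsies", "STUDY_A", ["S4", "S5"])
cgp = ComparisonGroupPair(disease, healthy, public=True)
```

Keeping annotations as lists per concept leaves room for both MeSH terms and free-text tags under the same concept.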

Gene-expression analysis
OMiCC can perform two main types of analyses given a CGP: 1) differential expression analysis to derive a differential expression profile (DEP), i.e., the magnitude and statistical significance of gene-expression differences between the two groups within the CGP for all genes, and the significantly differentially expressed (DE) genes based on a statistical cutoff threshold; and 2) meta-analysis of more than one CGP within the same or across different studies and experimental platforms (Fig. 1a). OMiCC can compute DEP statistics using several different methods: 1) linear modeling using the Limma R package 15 , 2) the Mann-Whitney test 16 , and 3) Student's t-test 17 . When samples across the two groups within a CGP are paired (i.e., each sample in condition 1 has a corresponding matching sample in condition 2, e.g., the paired samples come from the same subject over two time points), a paired statistical test is performed. By default, OMiCC selects the widely used Limma (Linear Models for Microarray data) 18 method to calculate expression differences between two sample groups in a CGP. Limma has been shown to perform better, especially when working with small sample sizes 19 . The Limma R package provides several methods for adjusting the p-values to account for multiple hypothesis testing. We support the same set of multiple testing correction methods for Limma, as well as for the other two options in OMiCC (Mann-Whitney and Student's t-tests), using the p.adjust function in R. A list of significant DE genes is generated using a cutoff threshold on the resulting p-values or the adjusted p-values (at the user's discretion).
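To make the multiple-testing step concrete, the sketch below reimplements one of the corrections offered by R's p.adjust, the Benjamini-Hochberg ("BH") procedure, in plain Python and then applies a cutoff to obtain a DE gene list. OMiCC itself calls p.adjust in R; the function name, gene symbols and p-values here are illustrative assumptions.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank among ascending p-values
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up per-gene p-values from a two-group comparison.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)

# DE gene list at a user-chosen cutoff on the adjusted p-values.
cutoff = 0.03
de_genes = [g for g, p in zip(genes, adj_p) if p < cutoff]
```

With these inputs the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02, so only GENE_A and GENE_D pass a cutoff of 0.03.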
DEPs from selected CGPs within a compendium are merged into a gene/probe-by-CGP matrix file that can be downloaded for downstream analysis. The user has the option of selecting different types of statistics for creating the data matrix, such as t-statistics and B-values from Limma.
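The merge into a gene-by-CGP matrix can be sketched as follows in Python. This is an illustration under stated assumptions rather than OMiCC's code: the function name `collate_deps` is hypothetical, the gene set is taken as the union across CGPs, and genes missing from a CGP are filled with `None`; the source does not specify how OMiCC handles such gaps.

```python
def collate_deps(deps, missing=None):
    """Merge per-CGP profiles into a gene-by-CGP matrix.

    deps maps a CGP name to a dict of gene -> statistic (for example a
    t-statistic). Returns (genes, cgp_names, matrix), where matrix[i][j]
    is the statistic of genes[i] in cgp_names[j].
    """
    cgp_names = list(deps)
    # Assumption: use the union of genes; absent entries become `missing`.
    genes = sorted(set().union(*(profile.keys() for profile in deps.values())))
    matrix = [
        [deps[cgp].get(gene, missing) for cgp in cgp_names] for gene in genes
    ]
    return genes, cgp_names, matrix


# Made-up statistics for two CGPs.
deps = {
    "CD_vs_healthy": {"IL6": 2.1, "TNF": 1.4},
    "UC_vs_healthy": {"IL6": 1.8, "CXCL8": 3.0},
}
genes, cgp_names, matrix = collate_deps(deps)
```

Because every column holds the same kind of statistic, the resulting matrix can be handed to generic tools such as clustering routines.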
In addition, one can visualize hierarchical clustering of DEPs and the 500 most varying genes within the data matrix by downloading a heatmap created using the pheatmap R package 20 .
Meta-analysis in OMiCC is performed using the RankProd R package 1 . It is a non-parametric method and can achieve higher sensitivity and specificity compared to other meta-analysis methods according to Hong et al. 21 . RankProd first converts fold-change values derived from normalized expression data within a CGP into ranks. Then, a rank product is calculated for each gene across the CGPs. Lastly, a meta-p value and a false discovery rate (called "percentage of false positive predictions" in RankProd) are calculated based on permuted expression values. See the original RankProd publication referenced above for further details.
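The first two steps of this procedure (ranking fold changes within each CGP, then combining ranks across CGPs) can be sketched in a few lines of Python. This is a simplified illustration, not the RankProd package: the rank product is computed here as the geometric mean of one rank per CGP, the permutation-based p-values and PFP are omitted, and the function name and input values are assumptions.

```python
import math


def rank_products(fold_changes_by_cgp):
    """Geometric mean of each gene's rank across CGPs.

    fold_changes_by_cgp is a list with one dict per CGP, each mapping
    gene -> log fold change; all dicts must contain the same genes.
    Rank 1 is the most increased gene within a CGP.
    """
    genes = sorted(fold_changes_by_cgp[0])
    log_rank_sums = {gene: 0.0 for gene in genes}
    for fold_changes in fold_changes_by_cgp:
        ordered = sorted(genes, key=lambda g: fold_changes[g], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_rank_sums[gene] += math.log(rank)
    k = len(fold_changes_by_cgp)
    return {gene: math.exp(total / k) for gene, total in log_rank_sums.items()}


# Made-up log fold changes for three genes in two CGPs.
cgp1 = {"A": 2.0, "B": 0.5, "C": -1.0}
cgp2 = {"A": 1.0, "B": 1.5, "C": -0.2}
rp = rank_products([cgp1, cgp2])
```

A small rank product means a gene is consistently near the top of the list in every CGP; reversing the sort order gives the corresponding score for decreased genes.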
To support cross-platform analyses, DEP creation, collation, and meta-analyses can be performed at the gene (HUGO symbols) level whenever probe-to-gene mapping information is available (see "Gene-expression data and pre-processing" in Methods). Promiscuous probes, i.e., those that mapped to multiple gene symbols, were removed from further analysis. For gene symbols covered by multiple probes, we adopted the commonly used approach of taking the median across these probes 22 .
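The probe-to-gene collapsing rule (drop promiscuous probes, take the median over probes that map to the same gene) is sketched below in Python. The function name, probe identifiers and values are hypothetical, and the sketch treats the input as one value per probe (for a single sample or a single statistic); the exact stage at which OMiCC applies the median is not specified in the text.

```python
from statistics import median


def collapse_probes_to_genes(probe_values, probe_to_genes):
    """Collapse probe-level values to gene-level values.

    probe_values maps probe ID -> value; probe_to_genes maps probe ID ->
    list of gene symbols. Unmapped probes and probes mapping to more than
    one gene symbol are skipped; remaining probes are summarized per gene
    by their median.
    """
    values_by_gene = {}
    for probe, value in probe_values.items():
        symbols = set(probe_to_genes.get(probe, []))
        if len(symbols) != 1:
            continue  # unmapped or promiscuous probe
        values_by_gene.setdefault(symbols.pop(), []).append(value)
    return {gene: median(values) for gene, values in values_by_gene.items()}


# Made-up probe IDs and values.
probe_values = {"p1": 1.0, "p2": 3.0, "p3": 2.0, "p4": 9.0, "p5": 4.0, "p6": 7.0}
probe_to_genes = {
    "p1": ["TNF"],
    "p2": ["TNF"],
    "p3": ["TNF"],
    "p4": ["IL6", "IL6R"],  # promiscuous: removed
    "p5": ["IL6"],
    # p6 has no mapping and is skipped
}
gene_values = collapse_probes_to_genes(probe_values, probe_to_genes)
```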

OMiCC output, visualization and supported downstream analysis tools
When a DEP analysis is performed on CGP(s), OMiCC provides, for each CGP: 1) statistics describing differential expression for all genes or probes (depending on the user's selection of working at the probe or gene level); 2) a list of differentially expressed (DE) genes based on a default or a user--

Supplementary Note 3: Meta-analysis of IBD
We conducted the analysis when OMiCC contained data deposited to GEO on or before January 2014 (OMiCC currently covers up to June 2015; see Supplementary Note 2 on OMiCC's data update approach). We used OMiCC's search interface to query for human IBD studies using the term "Crohn's" with the aim of finding studies containing both disease subtypes (CD and UC). Based on the information provided in OMiCC and in the original paper of each study, we limited studies to those containing tissue samples from CD, UC and healthy subjects in the same study. For simplicity, we restricted microarray platforms to two versions of the most widely used platforms from Affymetrix (Affymetrix Human Genome U133 Plus 2.0 Array, and Affymetrix GeneChip Human Genome U133 Plus 2.0 Array). OMiCC allows users to filter on microarray platforms in a search; it also indicates which platforms are the most widely used to help guide users in selecting the platforms that are likely the most robust. Note that OMiCC contains pre-processed data from, and supports, a large number of platforms, not only Affymetrix. We next created CGPs from studies found using these criteria (Supplementary Table 1).
We performed two meta-analyses at the gene level using "Normalized GEO Data" (see Supplementary Note 2 on the different types of data): 1) for all CD versus healthy CGPs, and 2) for all UC versus healthy CGPs. Significantly differentially expressed genes with a "percentage of false positive predictions" (PFP; a false discovery rate (FDR)-equivalent metric from the RankProd 1 meta-analysis package used in OMiCC, shown as "Adjusted P value" in OMiCC) of less than 0.05 were identified within OMiCC. The meta-analysis results were downloaded from OMiCC for downstream gene-set enrichment analysis using the ToppGene Suite software 26 , separately for the up- and down-changing genes in disease vs. healthy subjects. The meta-analysis results can be accessed at the following OMiCC analysis results page:
For our UC validation meta-analysis, we queried human datasets in OMiCC using the term "Ulcerative Colitis". Studies were limited to those that contained samples from subjects with UC, but not CD samples (assessing CD replication using meta-analysis was not possible as only one additional CD study was found using the same search criteria we used for UC). From this list of studies, we selected the same platforms we used in our "discovery" analysis (Affymetrix Human Genome U133 Plus 2.0 Array, Affymetrix GeneChip Human Genome U133 Plus 2.0 Array). We then eliminated any studies in which the RNA was not derived from tissue biopsy. Next, studies that did not contain both affected subjects and control subjects were removed. Finally, we evaluated the primary literature of each study and eliminated any studies in which we were unsure whether samples may have been duplicated in another candidate study. This included studies in which the publications had shared authors or originated from the same research institution. In some instances, the authors of the respective publications were contacted to clarify whether some samples may have been duplicated.
The meta-analysis results can be accessed at the following OMiCC analysis results page (also see Supplementary Table 4): https://omicc.niaid.nih.gov/myProject/showResult/837?ac=O7NQ5LMDM
We imported data from Supplementary Tables 2-4 into R to assess replication between discovery and validation CGP sets. Supplementary Figures 1-3 were generated in R (the "VennDiagram" package 27 was used to generate Venn diagrams). We used the GeneOverlap package to compute Fisher's exact test p values on overlaps at the gene and gene set/pathway levels (https://www.bioconductor.org/packages/release/bioc/html/GeneOverlap.html). Correlation in Supplementary Fig. 1 was assessed using a linear model. R code will be available upon request.
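For readers who want to see what an overlap test of this kind computes, the following Python sketch evaluates a one-sided Fisher's exact test (the upper hypergeometric tail) for the overlap between two gene lists. It is not the authors' R code and does not call GeneOverlap; the function name, gene labels and universe size are illustrative assumptions.

```python
from math import comb


def overlap_pvalue(set_a, set_b, universe_size):
    """One-sided Fisher's exact test p-value for the overlap of two sets.

    Gives the probability of observing an overlap at least as large as
    the actual one if set_b were drawn at random from a universe of
    `universe_size` genes that contains all of set_a.
    """
    a, b = set(set_a), set(set_b)
    observed = len(a & b)
    total = comb(universe_size, len(b))
    largest_possible = min(len(a), len(b))
    tail = sum(
        comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
        for i in range(observed, largest_possible + 1)
    )
    return tail / total


# Made-up gene lists from a "discovery" and a "validation" analysis.
discovery = {"G1", "G2", "G3"}
validation = {"G1", "G2", "G4"}
p_value = overlap_pvalue(discovery, validation, universe_size=10)
```

The universe size (the number of genes that could have been called significant in both analyses) strongly affects the result, so it should be chosen to match the genes actually tested.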