PathCORE: identifying and visualizing globally co-occurring pathways in large transcriptomic compendia

Background Investigators often interpret genome-wide data by analyzing the expression levels of genes within pathways. While this within-pathway analysis is routine, the products of any one pathway can affect the activity of other pathways. Past efforts to identify relationships between biological processes have evaluated overlap in knowledge bases or evaluated changes that occur after specific treatments. Individual experiments can highlight condition-specific pathway-pathway relationships; however, constructing a complete network of such relationships across many conditions requires analyzing results from many studies. Results We developed the PathCORE software to identify global pathway-pathway relationships, i.e. those evident across a broad data compendium. PathCORE starts with the results of robust feature construction algorithms, which are now being developed and applied to transcriptomic data. PathCORE identifies pathways grouped together in features more than expected by chance as functionally co-occurring. We performed example analyses using PathCORE for a microbial compendium for which eADAGE features were already available and a TCGA dataset of 33 cancer types that we analyzed via NMF. PathCORE recapitulated previously described pathway-pathway relationships and suggested additional relationships with biological plausibility that remain to be explored. The software also identifies genes associated with each relationship and includes demonstration source code for a web interface that users can modify to (1) visualize the network produced by applying the PathCORE software to their specific inputs and (2) review the expression levels of associated genes in the original data. The web interface can help biologists design experiments to test the relationships that were identified using PathCORE. Conclusions PathCORE is a hypothesis generation tool that identifies co-occurring pathways from the results of unsupervised analysis of the growing body of gene expression data and visualizes the relationships between pathways as a network. Software that steps beyond within-pathway relationships to between-pathway relationships can reveal levels of organization that have been less frequently considered.

within pathways. While this within-pathway analysis is routine, the products of any one pathway can affect 23 the activity of other pathways. Past efforts to identify relationships between biological processes have 24 evaluated overlap in knowledge bases or evaluated changes that occur after specific treatments.

25
Individual experiments can highlight condition-specific pathway-pathway relationships; however, 26 constructing a complete network of such relationships across many conditions requires analyzing results 27 from many studies.

28
Results: We developed the PathCORE software to identify global pathway-pathway relationships, i.e. 29 those evident across a broad data compendium. PathCORE starts with the results of robust feature 30 construction algorithms, which are now being developed and applied to transcriptomic data. PathCORE 31 identifies pathways grouped together in features more than expected by chance as functionally co-32 occurring. We performed example analyses using PathCORE for a microbial compendium for which 33 eADAGE features were already available and a TCGA dataset of 33 cancer types that we analyzed via 34 NMF. PathCORE recapitulated previously described pathway-pathway relationships and suggested 35 additional relationships with biological plausibility that remain to be explored. The software also identifies 36 genes associated with each relationship and includes demonstration source code for a web interface that 37 users can modify to (1)  The number of publicly available genome-wide datasets is growing rapidly [1]. High-throughput 50 sequencing technologies that measure gene expression quickly with high accuracy and low cost continue 51 to enable this growth [2]. Expanding public data repositories have laid the foundation for computational 52 methods that consider entire compendia of gene expression data to extract biological patterns [3]. These 53 patterns may be difficult to detect in measurements from a single experiment. Unsupervised approaches, 54 which identify important signals in the data without being constrained to previously-described patterns, 55 may discover new expression modules and thus will complement supervised methods, particularly for 56 exploratory analyses [4,5].

58
Feature extraction methods are a class of unsupervised algorithms that can reveal unannotated 59 biological processes from genomic data [5]. Features can be constructed so that they preserve the genes 60 in the dataset: each feature is defined by a subset of influential genes, and these genes suggest the 61 biological or technical pattern captured by the feature. However, these features are often designed to be 62 independent or may be considered in isolation [6,7]. When examined in the context of knowledgebases 63 such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [8], most features are significantly 64 enriched for more than one biological gene set [5]. In this work, we refer to such a gene set by the 65 colloquial term, pathway. It follows then that such features can be described by sets of functionally related 66 pathways. We introduce the PathCORE (pathway co-occurrence relationships) software, an approach for 67 connecting features learned from the data to known biological gene sets, e.g. pathways from KEGG or 68 other databases. features. This means that the pathways must be perturbed in a sufficient fraction of the compendium to 77 be captured by the model. The number of samples in the compendium used for model generation and the 78 fraction of samples needed to observe a specific signature is expected to vary for different machine 79 learning methods. To avoid simply discovering relationships between gene sets that share many genes, 80 PathCORE also incorporates an optional pre-processing step that corrects for a situation Donato

154
Implementation 155 PathCORE identifies functional links between known pathways from the output of feature 156 construction methods applied to gene expression data (Fig. 1a, b). The result is a network of pathway co-157 occurrence relationships that represents the grouping of biological gene sets within those features. We 158 correct for gene overlap in the pathway annotations to avoid identifying co-occurrence relationships 159 driven by shared genes. Additionally, PathCORE implements a permutation test for evaluating and 160 removing edges-pathway relationships-in the resulting network that cannot be distinguished from a null 161 model of random associations. Though we refer to the relationships in the network as co-occurrences, it 162 is important to note that the final network displays co-occurrences that have been filtered based on this 163 permutation test (Fig. 1c). Our software is written in Python and pip-installable (PyPI package name:  [26]; in ADAGE or eADAGE it is termed the weight matrix 177 [5,7]; in NMF it is the matrix W, where the NMF approximation of the input dataset A is A ~ WH

374
As an optional step, users can set up a Flask application for each PathCORE network. Metadata 375 gathered from the analysis are saved to TSV files, and we use a script to populate collections in a 376 MongoDB database with this information. The co-occurrence network is rendered using the D3.js force-377 directed graph layout [28]. Users can select a pathway-pathway relationship in the network to view a new 378 page containing details about the genes annotated to one or both pathways (Fig. 4a).

380
We created a web interface for deeper examination of relationships present in the pathway co-

408
The information specified in (2) requires an "expression score" for every sample. A sample 409 expression score is calculated using the genes we selected in goal (1) The methods we implement in PathCORE can be used independently of each other (Fig. 1c).

432
Here, we present analyses that can be produced by applying the full PathCORE pipeline to models 433 created from a transcriptomic compendium by an unsupervised feature construction algorithm. Input 434 pathway definitions are "overlap-corrected" (correcting for gene overlap between definitions) for each 435 feature before enrichment analysis. An overlap-corrected, weighted pathway co-occurrence network is 436 built by connecting the pairs of pathways that are overrepresented in features of the model. Finally, we 437 remove edges that cannot be distinguished from a null model of random associations based on the 438 results of a permutation test.

440
PathCORE also offers support for users interested in experimentally verifying a pathway-pathway 441 relationship (Fig. 4). We provide the code for setting up a web application where the network can be this to emphasize the co-occurrence relationships that are more stable [29]-that is, the relationships that 472 appear across multiple models. Finally, we removed edges in the aggregate network that were not 473 significant after FDR correction when compared to the background distributions generated from 10,000 474 permutations of the network. We applied PathCORE to features built by both NMF (Fig. S2) and eADAGE 475 (discussed below). The pathway-pathway network generated by NMF [30] is smaller than that generated 476 by eADAGE; that is, there are fewer pathway-pathway connections in the network. It is possible that this 477 is due to a difference in the stability of the results from the two algorithms or the comprehensiveness of 478 the features extracted by each approach. eADAGE includes an ensemble step that improves model 479 consistency. The eADAGE authors also observed that models constructed by this ensemble procedure 480 also more comprehensively captured pathways than non-ensemble models [5]. The PathCORE analysis 481 of a 300-feature NMF decomposition of the P. aeruginosa compendium produced a KEGG network that is 482 similar in size to the PID network (Fig. S2, 6).

484
The eADAGE co-occurrence network identifies a number of pathway-pathway relationships that 485 have been previously characterized (Fig. 5). This suggests that PathCORE can capture functional links 486 between biological pathways. Three glucose catabolism processes co-occur in the network: glycolysis, 487 pentose phosphate, and the Entner-Doudoroff pathway (Fig. 5a). We also found a cluster relating 488 organophosphate and inorganic phosphate transport-and metabolism-related processes (Fig. 5b).

489
Notably, phosphate uptake and acquisition genes are directly connected to the hxc genes that encode a 490 type II secretion system. Studies in P. aeruginosa clinical isolates demonstrated that the Hxc secretion

508
We used the PathCORE web application for the eADAGE KEGG P. aeruginosa network 509 (https://pathcore-demo.herokuapp.com/PAO1) to analyze the connection between the phosphate 510 transport system and a type II general secretion system pathway (Fig. 3a, b; edge page at 511 https://goo.gl/Hs5A3e). The sixteen genes reported on the edge page all have odds ratios above 34; 512 these genes are at least 34 times more likely to appear in the gene signatures in which both of these 513 pathways are overrepresented. Such genes may help to reveal the biological basis of the co-occurrence 514 relationship. In this case, the results suggest that there may be some overlap between machinery for 515 transporting phosphate into the cell and secreting substances out of the cell via type II secretion.

516
Alternatively, the two processes may be mechanistically separate but coregulated such that phosphate 517 scavenging molecules may be secreted by type II secretion coincidentally to aid in phosphate acquisition.

518
The heatmap of the fifteen least expressed samples shows that the pstB and pstS genes, annotated to 519 the phosphate transport system, are consistently expressed higher relative to the other genes in the edge 520 for these samples. The pstS, pstC, pstA, and pstB genes are proximal of the phosphate-specific transport 521 (Pst) operon that encodes a high-affinity orthophosphate transport system [37]. Future studies will 522 examine whether deletion of the Pst phosphate system impairs secretion by the type II secretion system.

524
To assess whether the relationships identified in the pathway analysis paralleled gene expression 525 patterns in the context of a published experiment, we looked across experiments to determine if the 526 genes contained in the edge were co-regulated. As an example of the types of relationships that we 527 observed, we present a single experiment, E-GEOD-22164 from Folsom et al. (Fig. 3c;

570
We found that PathCORE detects modules of co-occurring pathways consistent with our current 571 understanding of cancer-related interactions (Fig. 6). Importantly, because the connections were 572 constructed from many different cancer-types, these modules may represent pathway-pathway 573 interactions present in a large proportion of all tumors and may be good candidates for targeted 574 treatments.

576
For example, a module composed of a FoxM1 transcription factor network, an E2F transcription 577 factor network, Aurora B kinase signaling, ATR signaling, PLK1 signaling, and members of the Fanconi 578 anemia DNA damage response pathway are densely connected (Fig 6a). When two pathways share an 579 edge in the co-occurrence network, they are overrepresented together in one or more features. The 580 connections in this module recapitulate well known cancer hallmarks including cellular proliferation 581 pathways and markers of genome instability, such as the activation of DNA damage response pathways 582 [42]. We found that several pairwise pathway co-occurrences correspond with previously reported 583 pathway-pathway interactions [43][44][45]. We also observed a hub of pathways interacting with Wnt 584 signaling (Fig. 6b)

591
Two modules in the network relate to angiogenesis, or the formation of new blood vessels (Fig.   592 6c, d). Tumors transmit signals that stimulate angiogenesis because a blood supply provides the 593 necessary oxygen and nutrients for their growth and proliferation. One module relates angiogenesis 594 factors to cell proliferation. This module connects the following pathways: PDGFR-beta signaling, FAK-595 mediated signaling events, VEGFR1 and VEGFR2-mediated signaling events, nuclear SMAD2/3 596 signaling regulation, and RB1 regulation (Fig. 6c). These functional connections are known to be involved 597 in tumor proliferation [49][50][51]. The other module indicates a direct relationship by which the VEGF 598 pathway interacts with the S1P3 pathway through Beta3 integrins (Fig. 6d). S1P3 is a known regulator of

605
Finally, PathCORE revealed a large, densely connected module of immune related pathways 606 (Fig. 6e). We found that this module contains many co-occurrence relationships that align with immune

643
Unsupervised analyses of genome-scale datasets that summarize key patterns in the data have 644 the potential to improve our understanding of how a biological system operates via complex interactions 645 between molecular processes. However, interpreting the features generated by unsupervised approaches 646 is still challenging. PathCORE is a component of a software ecosystem that connects the features 647 extracted from data to curated resources. It is important to note that PathCORE will only be able to 648 identify pathways that occur in features of the underlying model. For these pathways to be mapped to 649 features, they will need to be transcriptionally perturbed in at least some subset of the compendium.

650
PathCORE is post-hoc analysis software that can be applied to models already known to capture 651 biological patterns from gene expression compendia. A graph resulting from PathCORE can help to 652 identify global pathway-pathway relationships-a baseline graph-that complements existing work to 653 identify interactions between pathways in the context of a specific disease. The specific niche that 654 PathCORE aims to fill is in revealing to researchers which gene sets are most closely related to each 655 other in machine learning-based models of gene expression, which genes play a role in this co-