From genome to phenome: Predicting multiple cancer phenotypes based on somatic genomic alterations via the genomic impact transformer

Cancers are mainly caused by somatic genomic alterations (SGAs) that perturb cellular signaling systems and eventually activate oncogenic processes. Therefore, understanding the functional impact of SGAs is a fundamental task in cancer biology and precision oncology. Here, we present a deep neural network model with encoder-decoder architecture, referred to as genomic impact transformer (GIT), to infer the functional impact of SGAs on cellular signaling systems through modeling the statistical relationships between SGA events and differentially expressed genes (DEGs) in tumors. The model utilizes a multi-head self-attention mechanism to identify SGAs that likely cause DEGs, or in other words, differentiating potential driver SGAs from passenger ones in a tumor. GIT model learns a vector (gene embedding) as an abstract representation of functional impact for each SGA-affected gene. Given SGAs of a tumor, the model can instantiate the states of the hidden layer, providing an abstract representation (tumor embedding) reflecting characteristics of perturbed molecular/cellular processes in the tumor, which in turn can be used to predict multiple phenotypes. We apply the GIT model to 4,468 tumors profiled by The Cancer Genome Atlas (TCGA) project. The attention mechanism enables the model to better capture the statistical relationship between SGAs and DEGs than conventional methods, and distinguishes cancer drivers from passengers. The learned gene embeddings capture the functional similarity of SGAs perturbing common pathways. The tumor embeddings are shown to be useful for tumor status representation, and phenotype prediction including patient survival time and drug response of cancer cell lines.*


Introduction
Cancer is mainly caused by the activation of oncogenes or deactivation of tumor suppressor genes (collectively called "driver genes") as results of somatic genomic alterations (SGAs; Vogelstein et al., 2013), including somatic mutations (SMs, Kandoth et al. 2013;Lawrence et al. 2014), somatic copy number alterations (SCNAs; Ciriello et al. 2013;Zack et al. 2013), DNA structure variations (SVs; Stransky et al. 2014), and epigenetic changes (Jones and Baylin, 2002).Precision oncology relies on the capability of identifying and targeting tumor-specific aberrations resulting from driver SGAs and their effects on molecular and cellular phenotypes.However, our knowledge of driver SGAs and cancer pathways remains incomplete.Particularly, it remains a challenge to determine which SGAs (among often hundreds) in a specific tumor are drivers, which cellular signals or biological processes a driver SGA perturbs, and which molecular/cellular phenotypes a driver SGA affects.
Current methods for identifying driver genes mainly concentrate on identifying genes that are mutated at a frequency above expectation, based on the assumption that mutations in these genes may provide oncogenic advantages and thus are positively selected (Dees et al., 2012;Lawrence et al., 2013).Some works further focus on the mutations perturbing conserved (potentially functional) domains of proteins as indications they may be driver events (Reva et al., 2011;Niu et al., 2016).However, these methods do not provide any information regarding the functional impact of enriched mutations on molecular/cellular phenotypes of cells.Without the knowledge of functional impact, it is difficult to further determine whether an SGA will lead to specific molecular, cellular and clinical phenotypes, such as response to therapies.What's more, while both SMs and SCNAs may activate/deactivate a driver gene, there is no well-established frequency-based method that combines different types of SGAs to determine their functional impact.
Conventionally, an SGA event perturbing a gene in a tumor is represented as a "one-hot" vector spanning gene space, in which the element corresponding to the perturbed gene is set to "1".This representation simply indicates which gene is perturbed, but it does not reflect the functional impact of the SGA, nor can it represent the similarity of distinct SGAs that perturb a common signaling pathway.We conjecture that it is possible to represent an SGA as a high-dimensional vector, in the same manner as the "word embedding" (Mikolov et al., 2013;Pennington et al., 2014) in the natural language processing (NLP) field, such that the representation reflects the functional impact of a gene on biological systems, and genes sharing similarly functions should be closely located in such embedding space.Motivated by this, we propose a scheme for learning "gene embeddings" for SGA-affected genes, i.e., a mapping from individual genes to low-dimensional vectors of real numbers that are useful on multiple predictive tasks.Our gene embeddings focus on the functional impact of SGAs.
Based on the assumption that SGAs perturbing cellular signaling systems often eventually lead to changes in gene expression, we introduce an encoder-decoder architecture neural network model called "genomic impact transformer" (GIT) to predict DEGs and detect potential cancer drivers with the supervision of DEGs.While deep learning models are being increasingly used to model different bioinformatics problems (Lee et al., 2016;Lan et al., 2018), to our knowledge there are few studies using neural network to model the relationships between SGAs and molecular/cellular phenotypes in cancers.The proposed GIT model has the following innovative characteristics: (1) The encoder part of transformer (Vaswani et al., 2017) first uses SGAs observed in a tumor as inputs, maps each SGA into a gene embedding representation, and combines gene embeddings of SGAs to derive a personalized "tumor embedding".Then decoder part decodes and translates the tumor embedding to predict DEGs.
(2) A multi-head self-attention mechanism (Bahdanau et al., 2014;Xu et al., 2015) is utilized in the encoder, which is a technique widely used in NLP to choose the input features that significantly influence the output.It perfectly suits our hope of selecting the important cancer drivers from the whole set of SGAs.It differentiates SGAs by assigning different weights to them so that it can potentially distinguish SGAs that have impact on DEG from those do not, i.e., detecting drivers from passengers.(3) Pooling inferred weighted impact of SGAs in a tumor produces a personalized tumor embedding, which can be used as an effective feature to predict DEGs and other phenotypes.(4) Gene embeddings are pre-trained by a "Gene2Vec" algorithm and refined by the GIT, which capture the functional impact of SGAs to the cellular signaling system.Our results and analysis indicate that above innovative approaches enable us to derive powerful gene embedding and tumor embedding representations that are highly informative of molecular, cellular and clinical phenotypes.

Somatic genomic alterations (SGAs) pre-processing
We obtain SGA data, including SMs and SCNAs of 4,468 tumors consisting of 16 cancer types directly from TCGA portal (Network et al., 2013).We collectively designate all SGAs affecting a gene using the name of the gene being perturbed.After processing genomic data from TCGA, we use a binary variable in a "one-hot" vector to indicate the genomic status of a gene.For example, we represent the genomic status of TP53 as 1, if it is perturbed by one or more of SM/SCNA events in a tumor.

Differentially expressed genes (DEGs) pre-processing
Gene expression data are pre-processed and obtained from the Firehose browser of the Broad Institute.We determine whether a gene is differentially expressed in a tumor by comparing the gene's expression in the tumor against a distribution of the expression values of the gene in the corresponding tissue-specific "normal" or control samples.For a given cancer type, assuming the expression of each gene (log 2 based) follows a Gaussian distribution in control sample, we calculate the pvalues by determine the probability of observing a expression value from control distribution.If the p-value is equal or smaller than 0.005, the gene is considered as differentially expressed in the corresponding tumor.However, if a DEG is associated with an SCNA event affecting it, we remove it from the DEG list of the tumor.

2.3
The genomic impact transformer (GIT) neural network 2.3.1 GIT network structure: encoder-decoder architecture Figure 1a shows the general structure of the genomic impact transformer (GIT) model with an overall encoder-decoder architecture.GIT mimics hierarchically organized cellular signaling system (Chen et al., 2015(Chen et al., , 2016)), in which a neuron may potentially encode the signal of one or more signaling proteins.When cellular signaling system is perturbed by SGAs, it often can lead to changes in measured molecular phenotypes, such as gene expression changes.
Thus, for a tumor t, the set of its SGAs {g} m g=1 is connected to the GIT neural network as observed input (Figure 1a bottom part squares).
The impact of SGAs is represented as embedding vectors {e g } m g=1 , and the impact of them in a tumor is combined into a tumor embedding vector e t through an attention mechanism in the encoder (Figure 1a middle part).We further explicitly represent cancer type s and its influence on encoding system e s of the tumor because tissue type influences which genes are expressed in cells of a specific tissue and thereby can be differentially expressed in cancer cells when compare to normal tissue.Since it is wellknown that the majority of SGAs in a tumor are likely passengers and only few are drivers, we design a self-attention module to assign a tumorspecific weight to each SGA to modulate its impact (Figure 1b,c).We use a multi-head attention mechanism to represent the fact that a tumor usually involves multiple processes, and they potentially can be represented by multiple single-head self-attention mechanisms.After encoding the SGAs and cancer type into tumor embedding, the decoder module, which consists of a feed forward multi-layer perceptron (MLP; Rosenblatt, 1958), finally translates the functional impact of SGAs and cancer type into DEGs of the tumor (Figure 1a top part).

Pre-training gene embeddings using Gene2Vec algorithm
In this study, we project the discrete binary representation of SGAs perturbing a gene into a continuous embedding space, which we call "gene embeddings" of corresponding SGAs.While such embeddings can be directly learned using the GIT model, it has been shown in the field of NLP that the pre-trained word embeddings can significantly improve the performance in other related NLP tasks (Devlin et al., 2018;Tao et al., .. .

2018)
. Such pre-trained word embeddings can capture the knowledge of co-occurrence pattern of the words in languages and exhibit sound semantic properties: words of similar semantic meanings are close in embedding space, e.g., e "each" ≈ e "every" .We therefore propose an algorithm called "Gene2Vec" to pre-train the gene embeddings, which is closely related the skip gram word2vec (Mikolov et al., 2013) pre-training algorithm.The biology rationale behind Gene2Vec algorithm is that we are able to portrait the co-occurrence pattern of SGAs in each tumor, i.e., mutually exclusive mutations (Vandin et al., 2012), using gene embeddings and gene context embeddings.
Given the gene embedding e g of an SGA-affected gene g and context embedding of any possible SGA-affected gene c : V = {v c } c ∈G , where G is the set of all possible SGA-affected genes, the skip gram paradigm assumes the probability that an alteration in gene c happens together with the alteration in gene g within a tumor with probability: . (1) We use the negative sampling (NS) technique, a simplified version of noise contrastive estimation (Gutmann and Hyvärinen, 2012), to approximately maximize the log-likelihood of skip gram, which will otherwise be computational expensive to optimize if directly following Equation 1.By initializing gene embeddings with the pre-trained gene embeddings in the GIT, we transfer the knowledge learned from SGA pattern behind tumors.These gene embeddings are further updated and fine-tuned by the GIT model with the supervision of affected DEGs.

Encoder: multi-head self-attention mechanism
Instead of allowing SGAs and cancer type variable to directly connect to encoder, we start with an embedding layer, which enables us to use attention mechanisms to differentiate impact of potential driver and passenger SGAs (Figure 1a middle part).For all SGA-affected genes {g} m g=1 and the cancer type s of a tumor t, we map them to corresponding gene embeddings {e g } m g=1 and a cancer type embedding e s from a lookup table E = {e g } g∈G ∩ {e s } s∈S , where e g and e s are real-valued vectors.From implementation perspective, we treat cancer types in the same way as SGAs, except the attention weight of it is fixed to be "1".
The look-up table {e g } g∈G is initialized with Gene2Vec pre-trained gene embeddings and refined by GIT here.

Decoder: multi-layer perceptron (MLP)
For a specific tumor t, we feed tumor embedding e t into a MLP with one hidden layer as decoder, using non-linear activation functions and fully connected layers, to produce the final predictions ŷ for DEGs y; where ReLU(x) = max(0, x) is rectified linear unit, and σ(x) = (1+exp(−x)) −1 is sigmoid activation function.
The output of decoder and actual values of DEGs are used to calculate the 2 -regularized cross entropy, which is minimized during training: where W = {W l } 2 l=0 , Θ and E defined in Section 2.3.3;cross entropy loss defined as: and p regularizer defined as:

Training and evaluation
We utilize PyTorch (https://pytorch.org/) to train, validate and test the Gene2Vec, GIT (variants) and other conventional models (Lasso and MLPs; Section 3.1).The training, validation and test sets are split in the ratio of 0.33:0.33:0.33 and fixed across different models.The hyperparameters are tuned over the training and validation sets to get best results, trained on training and validation sets, and finally applied to the test set for evaluation if not further mentioned below.The models are trained by updating parameters using backpropagation (Rumelhart et al., 1986), specifically, using mini-batch Adam (Kingma and Ba, 2014) with default momentum parameters.Gene2Vec uses mini-batch stochastic gradient descent (SGD) instead of Adam.Dropout (Hinton et al., 2012) and weight decay ( p -regularization) are used to prevent overfitting.We train all the models 30 to 42 epochs until they fully converge.The output DEGs are represented as a sparse binary vector, with overall 31.3% of genes being differentially expressed.In order to judge the performance of different methods, we utilize various performance metrics including accuracy, precision, recall and F1 score, where F1 is the harmonic mean of precision and recall: The training and test are repeated for five runs get the mean and variance of evaluation metrics.More evaluation methods and details, including that of GO enrichment analysis (Section 3.2), driver detection (Section 3.3), survival analysis (Section 3.4) and drug response (Section 3.5) are in corresponding sections to make the discussion more explicit and compact.

GIT statistically detects real biological signals
The task of GIT is to predict DEGs (dependent variables) using SGAs as input (independent variables).We plot both F1 score and accuracy on test set as the function of trained epochs (Figure 2 "real data"), which indicate that the model gains capability of predicting DEGs as training proceeds, and finally reachs a stable state.In order to validate that GIT is able to extract real statistical relationships between SGAs and DEGs, we randomly shuffle the positions of DEGs in the DEG vector of a tumor, i.e., randomly relabel DEG names, and then train a GIT to predict DEGs from SGAs.We compare the performance of models trained with random datasets, by plotting F1 score and accuracy during training of the models (Figure 2 "shuffled data").Note that, since most DEGs in the data are zeros, a trivial solution is to call every DEG as 0, which can also achieve good overall accuracy and minimize loss, but that will result in a low F1 because of low recall.Indeed, the test F1 score in the DEG-permutation case drops to a very low value due to the same reason.These results demonstrates that GIT is able to capture real statistical relationships between SGAs and DEGs from the noisy biological data, instead of simply behaving as a super function fitter that fits random relationships.
As comparison, we also train and test the Lasso (Tibshirani, 1994) and MLPs (Rosenblatt, 1958) as baseline prediction models to predict DEGs based on SGAs.The Lasso model is appealing in our setting because, when predicting a DEG, it can filter out most of irrelevant input variables (SGAs) and keep only the most informative ones, and it is a natural choice in our case where there are 19.8kpossible SGAs.However, in comparison to MLP, it lacks capability of portraiting complex relationships between SGAs and DEGs.On the other hand, while conventional MLPs have sufficient power to capture complex relationships-particularly, the neurons in hidden layers may mimic signaling proteins (Chen et al., 2016)  We compare the precision, recall, F1 score, as well as accuracy of different methods to have a thorough comparison between GIT and traditional methods, as shown from 1st to 4th, and last rows of Table 1, one can conclude that GIT outperforms all these other conventional baseline methods for predicting DEGs in all metrics, indicating the specifically designed structure of GIT is able to soar the performance in the task of predicting DEGs from SGAs.
In order to evaluate the utility of each module (procedure) in GIT, we conduct ablation study by removing one module at a time: the cancer type input ("can"), the multi-head self-attention module ("attn"), and the initialization with pre-trained gene embeddings ("init").The impact of each module can be detected by comparing to the full GIT model.As one can see from 5th to lat row of Table 1, all the modules in GIT help to improve the prediction of DEGs from SGAs in terms of overall performance: F1 score and accuracy.

Gene embeddings of SGAs serve as abstract representations of their functional impact
We examine whether the gene embeddings really reflect some biological properties of genes.We first examine if the genes sharing similar embeddings also share same any function annotation, using the Gene Ontology (GO; Ashburner et al., 2000) as a means to evaluate functional similarity of genes.GO is an ontology database that describes the relations of biological concepts and systematically assigns annotations to specific genes or proteins.We mainly concentrate on evaluating whether SGAaffected genes share GO annotations in the "biological process" domain, based on the assumption that genes involved in a common biological process will likely share common functional impact.The top 1,474 frequently altered genes (affected by SGAs for more than 150 times across all the patients in the dataset) are used for evaluation, considering that the gene embeddings of rare SGAs may not be well learned.
The first metric we apply to examine the consistency of gene embedding space, is what we call "nearest neighborhood (NN) accuracy" with respect to GO similarity.It is defined as the expectation of whether a pair of genes (g, c) that are nearest neighbors in the embedding space share at least one same GO term: where 1(statement) is the indicator function; GO(g) the set of GO terms assigned to gene g; NN(e g ) the set of nearest neighbors of e g .
The expectation E is approximated by iterating over all possible pairs of genes.Table 2 shows that, by capturing the co-occurrence pattern of somatic alterations, the Gene2Vec pre-tained gene embeddings improve 36% in NN accuracy over random chance of any pair of the genes Apart from the NN accuracy, which only reflects the function similarities between two adjacent genes in embedding space, we also evaluate whether a cluster of genes close in an embedding space share GO annotations through "GO enrichment", which is defined as follows: where Clust(g) is the cluster that gene g belongs to.GO enrichment considers the function similarities of genes that are close in the embedding space.The larger it is, the higher correlated are the GO functions and clusters (and it equals to 1 in random case).We perform clustering analysis of SGAs in embedding space using k-means clustering, and calculate GO enrichment, and we vary number of clusters (k) to derive clusters with different degrees of granularity.The change of GO enrichment with different numbers of clusters is shown in Figure 3a.As one can see, when the genes are randomly distributed in the embedding space, they get GO enrichment of 1.However, in the gene embedding space, the GO enrichment increases fast until the number of clusters reaches 40, indicating strong correlation between the clusters in embedding space and the functions of the genes.Finally, we group the genes into 40 clusters, and show the t-SNE visualization (Maaten and Hinton, 2008) of genes in the embedding space (Figure 3b left panel).Using PANTHER GO enrichment analysis (Mi et al., 2013), 12 out of 40 clusters are shown to be enriched in at least one biological process.As one can see, most of gene clusters are well-defined and tight located in the projected t-SNE space.As case study, we take a close look at one cluster (Figure 3b right panel), which contains a set of functionally similar genes, such as that code a protein family of type I interferons (IFNs), which are responsible for immune and viral response (de Weerd and Nguyen, 2012).In sum, all the above results indicate that the gene embeddings capture the functional similarity of SGAs.

Self-attention mechanism reveals SGAs with impact on cancer cell transcriptome
While it is a widely accepted that cancer is mainly caused by SGAs, but not all SGAs observed in a cancer cell are causative (Vogelstein et al., 2013).Previous methods mainly concentrate on searching for SGAs with higher than expected frequency to differentiate candidate drivers SGAs from passenger SGAs.GIT provides a novel perspective to address the problem: identifying the SGAs that have functional impact on cellular signaling systems and eventually lead DEGs as candidate drivers.The multi-head self-attention module provides us a means to select the tumorspecific important drivers by assigning higher weights to such genes, where the weights are calculated given the set of all SGAs in a tumor (Figure 1c; Equation 3-6).For example, the SGAs that are weighted most by the attention mechanism happen in gene PIK3CA, TP53, GATA3 in breast cancer patient "TCGA-D8-A1JJ" (Figure 1a).These results agree with the knowledge that these genes are important drivers for breast cancers, and they all have significant impact on transcriptome of cancer cells, particularly given that both TP53 and GATA are transcription factors.
Here we compare the relationship of overall attention weights and the frequencies of somatic alterations in all the cancer types (Pan-Cancer) from our test data, as shown in Figure 4.In general the attention weights is in high correlation with the alteration frequencies of the genes, agreeing with the notion that SGAs which are significantly enriched in cancers are likely drivers (SGAs bearing functional impact).For example, common cancer drivers such as TP53 and PIK3CA are the top two SGAs selected by both methods.Interestingly, our self-attention mechanism assigns high weights to many of genes previously not designated as drivers, indicating these genes are potential cancer drivers although their roles in cancer development remain to be further studied.
Table 3 lists top SGAs ranked according to both attention weights and frequencies in pan-can and five selected cancer types, where known cancer drivers from TumorPortal (Lawrence et al., 2014) and IntOGen (Gonzalez-Perez et al., 2013) are marked as bold font.Apart from TP53 and PIK3CA as drivers in pan-can analysis (Kandoth et al., 2013), we also find the top cancer drivers in specific cancer types consistent with our knowledge of cancer oncology.For example, CDH1 and GATA3 are drivers of breast invasive carcinoma (BRCA; Network et al., 2012), CASP8 is known driver of head and neck squamous cell carcinoma (HNSC; Stransky et al., 2011), STK11, KRAS, KEAP1 are known drivers of lung adenocarcinoma (LUAD; Network et al., 2014b), PTEN and RB1 are drivers of glioblastoma (GBM; Brennan et al., 2013), and FGFR3, RB1, HSP90AA1, STAG2 are known drivers in urothelial bladder carcinoma (BLCA; Network et al., 2014a).It can be inferred that the attention mechanism has better ability to detect drivers than frequency method.It should be noted that attention weights reflect relatively how important an SGA affects overall DEGs.Therefore an SGA that affects only a small number of DEGs will receive lower weight in our model, but this does not necessarily exclude fact that a low weight gene can be cancer drivers.

Personalized tumor embeddings reveal distinct survival profiles of cancer patients
Besides learning the specific biological function impact of SGAs on DEGs, we further examine whether the learned embedding of a tumor can represent characteristics of the tumor that may affect phenotypes of patients.Since "tumor embedding" e t of tumor t is generated by pooling attention-weighted embeddings of SGAs in a tumor, it reflects perturbation status of cellular signaling system.Here, we evaluate the utility of tumor embeddings in two perspectives: (1) Discovering patterns of tumors potentially sharing common disease mechanisms across different cancer types; (2) Using tumor embedding to predict patient survival.
After training a GIT model using pan-can samples, the full tumor embedding vector e t representing each tumor, and its consisting part -cancer type embedding vector e s are extracted.We use the t-SNE plot of tanh-curved tumor embeddings to illustrate the relationships of tumors across different cancer types.Figure 5a shows that, when cancer type embedding e s is included in full tumor embedding e t , which has a much higher weight than any individual gene embedding (Figure 1b, Equation 2) and dominates the full tumor embedding, tumor samples are clustered according to cancer types.This is not surprising as it is well appreciated that expression expression of many genes are tissuespecific (Hoadley et al., 2014), and it is the reason that we include a cancer type embedding in our model so that it can capture the cancertype-specific information, thus allowing gene embeddings to capture the universal impact of SGAs in different cancer types.To examine the pure  effect of SGAs on tumor embedding, we remove the effect of tissue by subtracting cancer type embeddings e s , followed by clustering tumors in these stratified tumor embedding space.Figure 5b shows t-SNE plot of stratified tumor embedding (e t -e s ).It is interesting to see that each dense area (potential tumor clusters) includes tumors from different tissues of origins.Clusters of attentioned-weighted gene embeddings in these tumors may reflect shared disese mechanisms (pathway perturbations) among tumors, warranting further investigations.
The second experiment is to test whether differences in tumor embeddings (thereby difference in disease mechanisms) are predictive of patient clinical outcomes.We conduct unsupervised k-means clustering using only breast cancer tumors from our test set of TCGA, which reveals 3 three groups (Figure 5c).The patient groups exhibit significant difference in survival profiles, with a log-rank test (Mantel, 1966) p-value of 0.017 (Figure 5d).Using tumor embeddings as input features, we train 1,2regularized (elastic net; Zou and Hastie, 2005) Cox proportional hazard models (Cox, 1992) for a 10-fold cross-validation experiment.This leads to a ranked list of tumors according to predicted survivals/hazards, and the concordance index (CI) value of our rank is 0.795, indicating that the trained model is very accurate.We further split test samples into two groups divided by the median of predicted survivals/hazards and plot the ground truth survival profiles of the patients in the two groups, which yields a logrank p-value of 5.1×10 −8 (Figure 5e).Note, if our prediction is random, it will mix patients with different survival outcomes randomly into two groups and lead to insignificant survival difference.
In an essence, the differences in tumor embeddings among tumors from a common cancer type reflect differences of the impact of SGAs from the tumors.As shown in above sections, distinct SGAs may share similar embeddings if they share similar functional impact.Thus, two tumors may have similar tumor embeddings even though they do not share any SGAs, as long as the functional impact of distinct SGAs from these tumors are similar.Therefore, we hypothesize that tumor embedding makes it easier to discover common disease mechanisms and their impact on patient survival.In contrast, if one represent tumors in original SGA space, it is more difficult to find common disease mechanisms.To test this, we also perform clustering analysis on breast cancer tumors represented in original SGA space, followed similar surivival analysis as described in the previous paragraph (Figure 5f-h).Indeed, representing tumors in SGA space has much smaller power to detect the difference in patient survival.

Tumor embeddings are informative for predicting drug responses of cancer cell lines
Precision oncology concentrates on using patient-specific omics data to determine optimal therapies for a patient.We set out to see if SGA data of cancer cells can be used to predict their sensitivity to anti-cancer drugs.We hypothesize that, if tumor embedding e t serves as an abstract representation of collective impact of SGAs on the signaling system in a cancer cell, tumor embedding should be informative features for training prediction models.We use the CCLE (Barretina et al., 2012) dataset, which performs drug sensitivity screening over hundreds of cancer cell lines and 24 anti-cancer drugs.The study collects genomic and transcriptomic data of these cell lines, but in general the genomic data (except the molecularly targeted genes) from a cell line are not sufficient to predict sensitivity its sensitivity to different drugs.
We discretize the response of each drug following the procedure described by Ding et al. (2017); Barretina et al. (2012).Since CCLE only contains a small subset of mutations in TCGA dataset (around 1,600 gene mutations), we retrain the GIT with this limited set of SGAs in TCGA, using default hyperparameters we set before.Cancer type input is removed as well, which is not explicitly provided in CCLE dataset.The output of tumor embeddings e t are then extracted as features.
We formulate drug response prediction as a binary classification problem with 1 -regularized cross entropy loss (Lasso), where the input can be raw sparse SGAs or tanh-curved tumor embeddings tanh(e t ).
Following Barretina et al. (2012), we perform 10-fold cross-validation experiment training Lasso using either inputs to test the drug response prediction task of four drugs with distinct targets: TKI258 (FGFR inhibitor), RAF265 (RAF inhibitor), Lapatinib (EGFR inhibitor), and Sorafenib (RTK inhibitor).As is shown in Figure 6, Lasso regression using tumor embeddings consistently outperforms the models trained with original SGAs as inputs.Specifically, in the case of drug Sorafenib, the raw mutations just give random prediction results, while the tumor embedding  is able to give predictable results.It should be noted that it is possible that certain cancer cells may host SGAs along the pathways related to FGFR, RAF, EGFR and RTK, rendering them sensitive to the above drugs.Such informaiton can be implicitly captured and represented by the tumor embeddings, so that the information from raw SGAs are captured and pooled to enhance classification accuracy.

Discussion
Despite the significant advances in cancer biology, it remains a challenge to reveal disease mechanisms of each individual tumor, particularly which and how SGAs in a cancer cell lead to the development of cancer.Here we propose the GIT model to learn the general impact of SGAs, in the form of gene embeddings, and to precisely portrait their effects on the downstream DEGs with higher accuracy and F1 score.With the supervision of DEGs, we can further assess the importance of an SGA using multi-head selfattention mechanisms in each individual tumor.
The key advantage of transforming SGA into a gene embedding space is that it enables the detection and representation of functional impact of SGAs on cellular processes, which in turn enables detection of common disease mechanisms of tumors even if they host different SGAs.This capability is valuable in systems biology, when systematic perturbations are applied to a cellular system, and certain molecular/cellular phenotypes are measured as readout of perturbation.We anticipate that GIT, or other future models like it, can be applied broadly to gain mechanistic insights of how genomic alterations (or other perturbations) lead to specific phenotypes, thus providing a general tool to connect genome to phenome in different biological fields and genetic diseases.
There are a few future directions for further improving the GIT model.First of all, GIT is data-driven, which has the advantage of discovering novel causal relationships between SGAs and DEGs.However, an intelligent model should be able to combine prior knowledge with data to gain insights of problems at the hand.Decades of biomedical research has accumulated a rich body of knowledge, e.g., Gene Ontology and gene regulatory networks, may be incorporated as prior of the model to boost the performance, as done by (Ma et al., 2018).Secondly, although our network has considered several different types of input like SGAs and cancer types, it is possible that, in some cases, we can have other useful information such as clinical feature which might be critical to our tasks.By modifying the GIT a bit, we might get a suitable network structure for these tasks.Thirdly, in our current experiments, we only have 4,468 tumor samples for training, test and validation, which is a small sample size in comparison to most other deep learning tasks.Nonetheless, we still got excellent results with fairly low bias.We expect that by getting a larger corpus of tumor data with mutations and gene expressions, we will be able to train a better GIT model to minimize potential overfitting problems or variance of models.Lastly, more clinically oriented investigations are warranted to examine, when trained with large volume of tumor omics data, the learned embeddings of SGAs and tumors may be applied to predict sensitivity or resistance to anti-cancer drugs based SGA data that are becoming readily available in contemporary oncology practice.
the Pennsylvania Department of Health.The content is solely the responsibility of the authors and does not necessarily represent the official views of the above funding agencies.Skip gram method with negative sampling loss is utilized to pre-train the gene embeddings which can reflect the co-occurrence pattern of somatic genomic alterations (SGAs) in tumors.Hyperparameters including initial step size, decay of step size, and mini-batch size are tuned to make the pre-training of gene embeddings converge fast and sound; see "Gene2Vec" row of Table S1 for parameters.The negative sampling loss converges after training for around 20 epochs on the SGAs of 4,468 patients in TCGA dataset.The feasibility of pre-trained gene embeddings is further validated in Section 3.2.
Data: Genomic alterations in each tumor: T = T i = {g i1 , g i2 , ..., g im(i) } i=1,2,...,N .Result: Pretrained gene embedding of each gene: Zn f (g) 3/4 , g ∈ G; // Normalized sub-sampled frequency // Initialize gene embeddings and context embeddings while not converges do l ← 0; // Total loss of a mini-batch samples for b = 1, 2, ..., batch_size do g ∼ f ; // Sample a gene Algorithm S1: Gene2Vec algorithm to pre-train the gene embeddings using skip gram with negative sampling loss.Given the context information of somatic genomic alterations (SGAs) in each cancer patient, i.e., whether two SGAs happened together in a single tumor, we pre-train the gene embeddings (and context gene embeddings) using similar techniques to word2vec.Skip gram is used to predict the probability of co-occurred SGAs c given a known SGA g, as explained in Equation 1. Negative sampling loss is utilized to accelerate the maximization of log-likelihood in the skip gram assumption.Instead of original mutation frequency f (g), the negative sampling frequency of SGA is sub-sampled by scaling to f (g) 3/4 .In practice, the step size η in mini-batch gradient descent is decayed after training for every epoch to converge fast and prevent overfitting.Note that E here is defined slightly different from that in main context, which contains both gene and cancer type embeddings.
Table S1.Tuned model structures and hyperparameters of Gene2Vec, Lasso, MLPs and GIT.The hyperparameters, including number of neurons in each hidden layer, step size, training epochs, training batch size, dropout rate, coefficients of regularizer are tuned using training and validation sets.Gene2Vec: Stochastic gradient descent (SGD) is used following word2vec (Mikolov et al., 2013).The step size is decayed by 0.78 after training for every epoch.The ratio of negative sample size to positive sample size is set to be 5. Lasso: It requires more epochs and larger step size to converge due to its large number of parameters.MLPs: Dropout is applied at each hidden layer to reduce overfitting.It is not shown in the structure below due to space limit.GIT: There are additional structure parameters for the attention module.The dimension of attention parameter β j is 400, and the number of heads h is set to be 128 (j=1,2,...,128).Notations: "ns" means "negative sampling", "emb" means "embedding layer", "σ" means "sigmoid activation function", "fc" means "fully connected layer", "relu" means "ReLU activation function", "attn" means "attention mechanism".Left panel shows the case when one-hot cancer type vector is concatenated with binary SGA vector, while right panels is simply the case of binary SGA vector.Opposed to represented as compact tumor embeddings in Figure 5a,b, no matter whether the cancer type information is included, the t-SNE visualization of tumors represented as raw SGA doesn't show any significant patterns.We conclude that since binary SGA representation of tumor is sparse, it is hard to measure the difference/similarity between different tumors in Euclidean space.Table S2.Groups of genes using k -means clustering in the gene embedding space.Genes that are altered in at least 150 out of 4,468 tumors are clustered to filter out genes whose gene embeddings that may not be learned well.1,474 candidate genes in total are finally analyzed.

Fig. 1 .
Fig. 1.Structure of GIT.(a) GIT first encodes the overall impact of SGAs {g} m g=1 and cancer type s on the state of the cancer cells in a tumor t using multi-head self-attention mechanism.Then the tumor embedding e t is decoded by a MLP to predict DEGs.Example SGAs of a TCGA breast cancer case (TCGA-D8-A1JJ) are shown, in which PIK3CA, TP53 and GATA3 are assigned with largest attention weights.(b) A two-dimensional demo that shows how attention mechanism combines multiple gene embeddings of SGAs {e g } m g=1 and cancer type embedding e s into a tumor embedding vector e t using attention weights {α g } m g=1 .(c) Calculation of attention weights {α g } m g=1 using gene embeddings {e g } m g=1 and parameters of attention sub-network {W 0 , Θ = {θ j } h j=1 }.Each gene embedding e g is fed through a 2 layer sub-network to produce a scalar α g .

Fig. 2 .
Fig. 2. The change of F1 score and accuracy on the test set as GIT trains on real data or DEG-permuted data.

Fig. 3 .Fig. 4 .
Fig. 3. (a) GO enrichment of genes within clusters derived from k -means clustering using tumor embeddings as input features.The line and shade area show the mean and the standard deviation of GO enrichment scores of clusters.X-axis indicates the number clusters k as input for k -means algorithm; the Y-axis denotes enrichment score calculated according to Equation 13.(b) t-SNE visualization of gene embeddings learned by GIT model.The different colors represent different k -means (40 clusters) clustering labels.An enlarged inset of a cluster is shown as well, which contains a set of closely related genes which we refer to "IFN pathway".
Fig. 5. (a-b) t-SNE visualization of tumor embeddings learned by GIT model.(a) t-SNE of full tumor embedding e t learned with cancer type available as an input.(b) t-SNE of stratified tumor embedding (e t -e s ), where cancer type embedding e s is subtracted.(c-e) Tumor embeddings reveal distinct survival profiles of cancer patients.(c) k -means clustering of breast cancer tumors using tumor embeddings.PCA plot shows internal distribution of BRCA tumors.(d) Kaplan-Meier estimators of the three breast cancer groups using tumor embeddings.(e) Cox regression using tumor embeddings.(f-h) SGAs alone as tumor representations are not informative of predicting survival profiles.(f) PCA plot showing k -means clustering of BRCA tumors using their SGA vectors.Most tumors merge around origin (Cluster 1; with small number of SGAs), while others (Cluster 2,3; with large number of SGAs) are outliers and far away from origin.(g) KM estimators and log-rank test on the three BRCA tumor groups in the SGA space.(h) Cox regression using SGAs (top mutated 474 genes are used).

Fig. 6 .
Fig. 6.ROC curves and the areas under the curve (AUCs) of Lasso models trained with original SGAs and tumor embeddings representations on predicting responses to four drugs.

Fig. S1 .
Fig. S1.Somatic genomic alteration (SGA) distribution of different cancer types in the TCGA dataset.16 cancer types are sorted in descending order of sample size.Tumor samples of each cancer type are further sorted in ascending order of SGA numbers.Each red line shows the median number of SGAs in the tumors of specific cancer type.
Fig. S5.Number of shared differentially expressed genes (DEGs) caused by somatic genomic alterations (SGAs) in the "IFN pathway".Based on the results of causal relationship of SGAs and DEGs from Tumor-specific Causal Inference(Cai et al., 2018), we calculate the number of shared caused DEGs of any pair of SGAs in the dataset, whose average value is 0.75.However, in the IFN pathway, the average value is 359.29, which is significantly much larger than the random case.This indicates that the genes in IFN pathway are functionally similar and regulate common sets of downstream biological processes.The number of shared DEGs in the diagonal (equals to the total number of DEGs caused by a single SGA) is not shown for clarity.

Fig. S7 .
Fig. S7.Survival profiles of breast cancer subtypes inferred from PAM50.PAM50 is a widely used baseline for identifying subgroups of breast cancer, which classifies the breast cancers based on the expression profiles of the 50 most important genes that indicative of basal and myoepithelial features in breast cancer(Network et al., 2012).However, the log-rank test (p-value = 0.27) of PAM50 subtypes is not significant possibly due to limited sample size.Comparing with Figure5d, this reflects the utility of tumor embedding to represent tumor status.

Table 1 . Performances of GIT (variants) and conventional methods.
genomics, nor do they explain the signaling process and distinguish driver SGAs.

Table 2 . NN accuracy with respect to GO in different gene embedding spaces.
in the database.The fine-tuned embeddings by GIT further show one fold increase in NN accuracy.The higher NN accuracy, the functionally similar genes are more close to each other in the embedding space.These results indicate that the learned gene embeddings are consistent with the gene functions, and it maps the discrete binary SGA representation into more meaningful and compact embedding space.

Table 3 . Top five SGA-affected genes for Pan-Cancer and a few selected cancer types, ranked according to attention weight and alteration frequency respectively.
(Gonzalez-Perez et al., 2013)rding to TumorPortal(Lawrence et al., 2014)and IntOGen(Gonzalez-Perez et al., 2013)are marked in bold font.