Influence of tissue context on gene prioritization for predicted transcriptome-wide association studies

Transcriptome-wide association studies (TWAS) have recently gained great attention due to their ability to prioritize complex trait-associated genes and promote potential therapeutics development for complex human diseases. TWAS integrates genotypic data with expression quantitative trait loci (eQTLs) to predict genetically regulated gene expression components and associates predictions with a trait of interest. As such, TWAS can prioritize genes whose differential expressions contribute to the trait of interest and provide mechanistic explanation of complex trait(s). Tissue-specific eQTL information grants TWAS the ability to perform association analysis on tissues whose gene expression profiles are otherwise hard to obtain, such as liver and heart. However, as eQTLs are tissue context-dependent, whether and how the tissue-specificity of eQTLs influences TWAS gene prioritization has not been fully investigated. In this study, we addressed this question by adopting two distinct TWAS methods, PrediXcan and UTMOST, which assume single tissue and integrative tissue effects of eQTLs, respectively. Thirty-eight baseline laboratory traits in 4,360 antiretroviral treatment-naïve individuals from the AIDS Clinical Trials Group (ACTG) studies comprised the input dataset for TWAS. We performed TWAS in a tissue-specific manner and obtained a total of 430 significant gene-trait associations (q-value < 0.05) across multiple tissues. Single tissue-based analysis by PrediXcan contributed 116 of the 430 associations including 64 unique gene-trait pairs in 28 tissues. Integrative tissue-based analysis by UTMOST found the other 314 significant associations that include 50 unique gene-trait pairs across all 44 tissues. Both analyses were able to replicate some associations identified in past variant-based genome-wide association studies (GWAS), such as high-density lipoprotein (HDL) and CETP (PrediXcan, q-value = 3.2e-16). Both analyses also identified novel associations. Moreover, single tissue-based and integrative tissue-based analysis shared 11 of 103 unique gene-trait pairs, for example, PSRC1-low-density lipoprotein (PrediXcan’s lowest q-value = 8.5e-06; UTMOST’s lowest q-value = 1.8e-05). This study suggests that single tissue-based analysis may have performed better at discovering gene-trait associations when combining results from all tissues. Integrative tissue-based analysis was better at prioritizing genes in multiple tissues and in trait-related tissue. Additional exploration is needed to confirm this conclusion. Finally, although single tissue-based and integrative tissue-based analysis shared significant novel discoveries, tissue context-dependency of eQTLs impacted TWAS gene prioritization. This study provides preliminary data to support continued work on tissue context-dependency of eQTL studies and TWAS.

influences TWAS gene prioritization has not been fully investigated. In this study, we addressed this question by adopting two distinct TWAS methods, PrediXcan and UTMOST, which assume single tissue and integrative tissue effects of eQTLs, respectively. Thirty-eight baseline laboratory traits in 4,360 antiretroviral treatment-naïve individuals from the AIDS Clinical Trials Group (ACTG) studies comprised the input dataset for TWAS. We performed TWAS in a tissue-specific manner and obtained a total of 430 significant gene-trait associations (q-value < 0.05) across multiple tissues. Single tissue-based analysis by PrediXcan contributed 116 of the 430 associations including 64 unique gene-trait pairs in 28 tissues. Integrative tissue-based analysis by UTMOST found the other 314 significant associations that include 50 unique gene-trait pairs across all 44 tissues. Both analyses were able to replicate some associations identified in past variant-based genome-wide association studies (GWAS), such as high-density lipoprotein (HDL) and CETP (PrediXcan, q-value = 3.2e- 16). Both analyses also identified novel associations. Moreover, single tissue-based and integrative tissue-based analysis shared 11 of 103 unique gene-trait pairs, for example, PSRC1-low-density lipoprotein (PrediXcan's lowest q-value = 8.5e-06; UTMOST's lowest q-value = 1.8e-05). This study suggests that single tissue-based analysis may have performed better at discovering gene-trait associations when combining results from all tissues. Integrative tissue-based analysis was better at prioritizing genes in multiple tissues and in traitrelated tissue. Additional exploration is needed to confirm this conclusion. Finally, although single tissue-based and integrative tissue-based analysis shared significant novel discoveries, tissue context-dependency of eQTLs impacted TWAS gene prioritization. This study provides preliminary data to support continued work on tissue context-dependency of eQTL studies and TWAS.

Introduction
Improving antiretroviral therapy (ART) efficacy and safety is an ongoing goal for addressing the HIV pandemic. According to the Joint United Nations Programme on HIV and AIDS (UNAIDS) (http://aidsinfo.unaids.org/), approximately 36.7 million people worldwide were living with human immunodeficiency virus (HIV) in 2016. Over the past three decades there has been immense progress on HIV care and treatment, and in 2017 there were about 20.9 million HIV-positive people who had access to ART. The connection of genomics with pharmacology has led to the discovery of numerous single nucleotide polymorphisms (SNPs) in drug absorption, distribution, metabolism, and elimination (ADME) genes and off-target genes. Many SNPs have been related to effects and/or pharmacokinetics of antiretroviral drugs 1-6 . However, most trait-related SNPs lack connections to actual functional genes, which suggests the need for alternative analysis approaches.
The emerging field of transcriptome-wide association studies (TWAS) offer a new way to directly identify gene-trait associations via integration of genotypic data and expression quantitative trait loci (eQTLs). eQTLs are an important class of genetic functional elements, which affect transcriptional regulation on target genes. Integration of eQTL information with genotypic data allows TWAS to estimate the extent to which a gene's expression level is regulated by genetic variants and how this correlates with traits of interest 8 . The Genotype Tissue Expression Project (GTEx 7 ) provides the data and the opportunity to identify eQTLs and estimate effect sizes for multiple human tissues (44 tissues in GTEx v6p). With GTEx, TWAS can explore gene-trait associations on tissues whose gene expression profiles are otherwise hard to obtain, such as liver and heart. However, current TWAS focuses primarily on eQTLs identified in a tissue-by-tissue manner, while many studies have either acknowledged or supported the power of an integrative tissue context in identifying singletissue and multi-tissue eQTLs 9,10 .
In this study, we aimed to address whether and how single tissue and integrative tissue context of eQTLs influence TWAS gene prioritization by comparing two distinct TWAS methods, PrediXcan 11 and Unified Test for MOlecular SignaTures (UTMOST 12 ). PrediXcan uses elastic-net regression model and identifies eQTLs in a tissue-by-tissue manner. UTMOST adopts group-lasso and search through all tissues at once to spot eQTLs of a certain gene. This strategy allows UTMOST to identify single-tissue specific eQTLs similar to PrediXcan but increase the chance of detecting multi-tissue eQTLs. Here, 38 baseline (i.e. pre-ART) laboratory values and genotypic data of 4,360 ACTG clinical trials participants from multiple previous studies [13][14][15][16][17][18][19] comprised the input for TWAS. Genotyping had been previously generated in multiple phases with Illumina assays: 650Y (phase I), 1M Duo (phase II and III), or Human Core Exome (phase IV). We performed the two TWAS methods separately in a tissue-specific manner (i.e. 44 tissues) ( Figure 1). If tissue contextdependency of eQTLs did not affect TWAS gene prioritization, we expected to observe shared gene-trait associations between single tissue-based analysis (PrediXcan) and integrative tissue-based analysis (UTMOST). The results partially supported this hypothesis, but also suggested varied gene prioritization abilities of single tissue-based and integrative tissue-based approaches respectively. The former found more unique gene-trait pairs, while the latter tended to prioritize genes expressed in multiple tissues. This study provides supportive evidence for tissue context-dependency of eQTLs and its impact on TWAS gene prioritization.

Data and Study Participants
In this study, we used four different genotyping phases of ACTG studies in a combined dataset that included samples and data from participants in prospective, randomized ARTnaïve treatment trials [13][14][15][16][17][18][19] . Clinical trial designs and results, and results of a genome-wide pleiotropic study results for baseline laboratory values have been described elsewhere 13 The computational preparation of genotypic data included pre-imputation quality control (QC), imputation, and post-imputation quality control. Pre-and post-imputation quality control followed the same guidelines 22 and used PLINK1.90 23 and R programming language. Imputation was performed on ACTG phase I-IV combined genotype data. Genotyped variants surviving the preimputation quality control comprised the input datasets for imputation, which used IMPUTE2 24 with 1000 Genomes 25 Phase 1 v3 as the reference panel. ACTG phase I-IV combined imputed data had 4,941 individuals and 27,438,241 variants. The following procedures/parameters were used in the post-imputation quality control by PLINK1.90: sample inclusion in phase I-IV phenotype collection, biallelic SNP check, imputation score (> 0.7), sex check, genotype call rate (> 99%), sample call rate (> 98%), and minor allele frequency (MAF > 5%), and relatedness check (π > 0.25). Subsequent principal component analysis (EIGENSOFT 26 ) projected remaining individuals onto the 1000 Genomes Project sample space to examine for population stratification. The first three principal components were used as covariates to adjust for population structure in the subsequent analysis. The final QC'ed ACTG phase I-IV combined imputed data contained 2,185,490 genotyped and imputed biallelic SNPs for 4,360 individuals ( Figure 1).

Phenotypic data-
The ACTG clinical trials included in this analysis collected baseline (i.e., pre-ART) laboratory traits from 5,185 ART-naïve individuals. We only retained individuals who have been genotyped and traits that were normally distributed and met a criterion of phenotype missing rate < 80%. The final combined phenotype dataset of ACTG genotyping phase I-IV retained 38 traits and the same number of individuals as the QC'ed imputed dataset ( Figure 1).

Predict Unmeasured Gene Expression Levels
We adopted two TWAS methods, PrediXcan and UTMOST, to predict unmeasured gene expression levels in a tissue-specific manner. PrediXcan and UTMOST have estimated SNP effect sizes on gene expression levels in 44 tissues, which are available at http:// predictdb.org/ and https://github.com/Joker-Jerome/UTMOST, respectively. The PrediXcan and UTMOST scripts were pulled from their GitHub project repositories on April 23 rd and Jun 6 th , 2018, respectively.
PrediXcan and UTMOST followed the same multivariate models. Let N denote the sample size and M denote the number of eQTLs in a certain gene. A gene's expression level can be predicted using the multivariate model as follows: where E is the N × 1 vector of predicted gene expression levels of the gene, X is the N × M matrix of genotypes, and β is the M × 1 vector of eQTLs' estimated regulatory effects on the gene.
Predicted gene expression levels were likely to differ between the two methods as each has a different hypothesis of eQTL regulatory mechanisms in terms of tissue context-dependency.
To discover trait-related tissues without assumptions, we predicted gene expression levels in 44 tissues.

Transcriptome-wide Association Analysis
We tested for gene-trait associations by performing transcriptome-wide association tests on predicted gene expression levels and ACTG baseline lab traits using PLATO 27,28 . All baseline labtraits included in this study were continuous and thus were modeled using linear regression. Age, sex, and the first three principal components calculated by EIGENSOFT were included as covariates in linear models to adjust for sampling biases and underlying population structure. PrediXcan and UTMOST have different degrees of diversity in the number of eGenes and gene-trait associations among tissues. To avoid biases due to an uneven number of associations among tissues, p-values were adjusted using FDR with using Benjamini-Hochberg procedure 29 in a tissue-specific manner. For this study, we consider gene-trait associations significant if they had single tissue-wise q-value < 0.05.

Results
We compared the influence of tissue context-dependency of eQTLs on TWAS gene prioritization by comparing single tissue-based analysis (PrediXcan) and integrative tissuebased analysis (UTMOST). We performed TWAS on ACTG phase I-IV combined datasets. The data aggregation of ACTG phase I-IV provided a larger sample size to ensure the power of identifying gene-trait association. QC procedures left the ACTG phase I-IV combined imputed data with 4,360 individuals and 2,185,490 SNPs. There were 38 baseline lab traits in the final phenotypic datasets.
Single tissue-based and integrative tissue-based analysis identified a total of 430 significant gene-trait associations (103 unique gene-trait pairs regardless of tissue, q-value < 0.05) and share 11 unique gene-trait pairs. Single tissue-based analysis identified 116 of the 430 significant associations (64 unique gene-trait pairs), encompassing 41 genes, 17 traits, and 28 tissues. Integrative tissue-based analysis identified the remaining 314 significant associations (50 unique gene trait pairs), encompassing 38 genes, 20 traits, and all 44 tissues.

Tissue Context-dependency Influenced TWAS Gene Prioritization
Gene prioritization results from single tissue-based analysis (PrediXcan) and integrative tissue-based analysis (UTMOST) were compared to evaluate the influence of tissue contextdependency of eQTLs on TWAS. Single and integrative tissue-based analyses shared 11 of 103 unique gene-trait pairs regardless of tissue ( Table 1). Several of these replicated the findings of previous studies ( Table 2). The lowest p-value by integrative tissue-based analysis was for MROH2A-total bilirubin levels 20 (UTMOST, q-value = 6.0e-27), which had a moderate p-value from single tissue-based analysis (q-value = 0.005). Another replication was between PSRC1 and two lipid-related traits, cholesterol and LDL, which have been reported in other studies [30][31][32][33] . Although it was SORT1, which neighbors PSRC1, that has been functionally related to LDL via mice knockdown experiments 34 . ALDH5A1 and GPLD1 have been associated with the liver function test, alkaline phosphatase (ALP) 35 . In the cases of PSRC1, ALDH5A1, and GPLD1, integrative tissue-based analysis (UTMOST) prioritized the genes in their biological function-related organ, liver, which was not always the case for single tissue-based analysis (PrediXcan). Possible novel associations were observed between absolute neutrophil count and C1orf204 36 , ATF6, and VANGL2 37 .
Single tissue-based analysis identified novel gene-trait associations, which warrants further investigation. One interesting example was the association of ITLN1 with multiple traits, including HIV-1 viral load, triglyceride, and total neutrophil count. As ITLN1 was reported in a previous Crohn's disease study 40 , our result suggested an potential relationship between Crohn's disease and HIV infection 41 .

Integrative Tissue-based Analysis Found Multi-tissue Gene-trait Associations
Regardless of tissue, integrative tissue-based analysis using UTMOST identified 50 unique gene-trait pairs (Figure 3). Although it prioritized fewer genes, the integrative tissue-based analysis was more likely to prioritize multiple tissues where genes are expressed. For instance, PSRC1 is highly expressed in almost all tissues 7 . PSRC1-LDL and cholesterol associations were prioritized in at least ten more tissues by integrative tissue-based analysis Most importantly, they were found consistently in the liver which is critically involved in lipid regulation. There was some evidence for distinct associations identified via integrative tissue-based approach (Table 2), such as ADAMTS4 42 with white blood cell count (artery, qvalue = 0.023), and AMFR 43 with fasting HDL (most significant in heart, q-value = 3.2e-05).
Other prioritized genes suggested novel associations and potential pleiotropy. Most prioritized genes have been associated with other traits by GWAS according to GWAS Catalog 44 . Similar to the single tissue-based approach, integrative tissue-based analysis prioritized total bilirubin-associated genes from the UGT1A 45 gene locus (UGT1A7 and UGT1A10) across multiple tissues.

Discussions
This study investigated whether and how TWAS gene prioritization was influenced by tissue context-dependency of eQTLs by comparing two approaches, single tissue-based TWAS (implemented in PrediXcan) and integrative tissue-based TWAS (implemented in UTMOST). PrediXcan evaluated eQTLs' effects in the context of a single tissue, which did not consider potential multi-tissue effects of eQTLs UTMOST estimated eQTLs' effect in an integrative tissue setting and increase the chance of identifying multi-tissue eQTLs. We found that both types of analyses could replicate associations discovered by previous studies and identify novel ones. While there were a fair number of overlaps, the two types of analyses prioritized different sets of genes. Single tissue-based analysis identified more unique gene-trait associations. Integrative tissue-based analysis tended to prioritize the same associations in multiple tissues and most importantly association were found in tissues critically related to traits of interest. Results suggest that tissue context-dependency of eQTLs influenced TWAS gene prioritization results.
The comparison raised questions of power and type I error rate of tested TWAS approaches. Integrative tissue context has shown an improved power in identifying eQTLs. As such, integrative tissue-based analysis might have universally greater power in identifying traitassociated genes than single tissue-based analysis. However, in this study, single tissuebased analysis found more validated associations ( Table 2). It is hard to tell if integrative tissue-based analysis has universally greater power as expected, whereas single tissue-based analysis happened to identify more false positives. It is also possible that one type of analysis outperformed the other at certain scenarios. A simulation study is necessary to discern these possibilities. Similar to GWAS, prioritized genes might merely be tag genes for causal ones. Both kinds of analyses prioritized genes at the chromosome 1p13.3 locus where a lipid-related gene, SORT1, is located. Single tissue-based analysis associated multiple lipid-related traits with genes that neighbor SORT1, such as SARS, CELSR2, PSRC1, and ALX3, which all are in the 1p13.3 locus and the same topologically associating domain (TAD 46,47 ). Besides PSRC1, integrative tissue-based analysis repetitively identified SLC6A17. Even though it is not adjacent to SORT1, this gene is in the 1p13.3 locus and might serve as a tag gene for causal one(s). Hence, for TWAS, prioritized genes might be merely tag genes and finemapping of causal genes may need a larger search boundary than GWAS, such as TADs.
Future investigation or validation experiments may be needed to explain the prioritized genes and/or tissues. For example, UGT1A1 glucuronidates bilirubin in the liver 48 , but single tissue-based analysis only identified a UGT1A1-total bilirubin association in skin. Further analysis found that there was no single UGT1A1 eQTL identified in liver by either PrediXcan or UTMOST trained on GTEx v6p or v7 data. It is likely that identification of UGT1A1 eQTLs is limited by tissue sample size (N liver = 175) or genetic variants may regulate UGT1A1 via mechanisms other than transcriptional regulation. Another observation of this study was that genes adjacent to UGT1A1 sporadically showed up as significant in either single tissue-based or integrative tissue-based analysis, including USP40, UGT1A6, UGT1A7, UGT1A10, KCNJ13, and also MROH2A 20 . These genes span 1Mbp in chromosome 2 and locate within the same TAD 46,47 . The repetitive pattern may suggest a specific regulatory activity that targets the whole genetic region of KCNJ13-USP40-UGT1A-MROH2A.
TWAS can prioritize trait-related genes, which may be important for HIV-positive patients regarding genetically informed therapeutic development and drug safety. This study showed that TWAS were able to not only replicate known associations, but also identify novel genetrait associations. It also suggested the importance of biological context in eQTL studies, and the ensemble of TWAS methods with different transcriptional regulation assumptions gave a more comprehensive picture of gene-trait relationships. In the future, we would like to perform cross-tissue TWAS analysis 12,49 , which aggregate gene-trait association information across all tissues and even across different consortia to further prioritize the trait-related genes and better describe the genetic architecture of complex diseases. This study investigates the influence of tissue context-dependency of eQTLs on TWAS gene prioritization by comparing two distinct TWAS methods, PrediXcan and UTMOST. PrediXcan assumes single tissue context of eQTLs, while UTMOST assumes eQTLs to possibly have effects in multiple tissues. Manhattan plot of gene-trait associations identified by PrediXcan. X-axis showed only significant traits. Y-axis was the q-value transformed by -log10. For simplicity, the plot only shows the lowest p-value of a gene-trait pair, which may appear in multiple tissues. Manhattan plot of gene-trait associations identified by UTMOST. X-axis showed only significant traits. Y-axis was the q-value transformed by -log10. For simplicity, the plot only showed the most significant p-value of a gene-trait pair, which may appear in multiple tissues. Li  Bolded tissues are known trait-related tissues.
* denotes the most significant tissue and/or trait that were associated with genes. + q-value in the most significant tissue denoted by asterisk.