Coupled Mixed Model for joint genetic analysis of complex disorders from independently collected data sets: application to Alzheimer’s disease and substance use disorder

,

Genome-wide Association Studies (GWAS) have helped revealed about 10,000 of genetic 2 associations between human genome and diseases [1]. With the success of GWAS on 3 single data sets, a natural follow-up is to investigate the results by integrating multiple 4 data sets [2], which we refer to as "joint study or analysis". A joint study may unveil 5 genetic mechanisms which cannot be found in a single analysis [3]. For example, several 6 recent studies have revealed overlapping genetic factors that influence multiple 7 psychiatric disorders [4], genetic correlations between schizophrenia, ADHD, depression, 8 and use of cigarettes and cannabis [5], and association between schizophrenia and illicit 9 drug use [6]. Co-occurrence of substance use disorder (SUD) and psychopathology has 10 been observed in national epidemiologic surveys [7,8] which suggests further joint study 11 of SUD and diseases involving cognitive dysfunction should be conducted. 12 However, joint genetic analysis using independently collected data sets can be 13 extremely challenging. In addition to the issues commonly expected when analyzing 14 single data sets such as population stratification [9], a naïve application of conventional 15 methods for the joint analysis will almost inevitably result in false discoveries which can 16 be caused by other confounding factors such as the confounders due to different data 17 collection procedures. Moreover, independently collected data sets often do not share 18 the phenotypes of interest. For example, there are barely any genetic studies that 19 collect enough samples with both SUD and Alzheimer's disease as phenotypes, which 20 pose challenging situations for studies that aim to find common genetic factors 21 underlying these diseases. These problems are illustrated in detail in Fig. 1. For two 22 data sets that are originally collected for an independent study of the red and the blue 23 phenotype, respectively, a joint analysis aims to discover the common genetic 24 associations for these phenotypes. However, key problems arise as to how to deal with 25 the challenges due to i) missing information about the phenotypes, e.g., as illustrated in 26 Fig. 1, the corresponding phenotypes of the other data set (i.e. red phenotype for blue 27 data set, or blue phenotype for red data set), and ii) confounding factors present in 28 different data sets (including population stratification, family structures, cryptic 29 relatedness, and data collection confounders). In other words, all the information that is 30 enclosed in dashed lines in Fig. 1 needs to be inferred. 31 Several methodological attempts have been made towards joint analysis using 32 different data sets with different phenotypes. The proposed methods are built on 33 summary statistics calculated using single data sets. For example, McGeachie et al. 34 conducted a post analysis after independent univariate GWAS testings on different 35 phenotypes [10]. Giambartolomei et al. proposed a method which assesses whether the 36 common genetic variants associated with different phenotypes identified using single 37 data sets are consistent [11]. Kang et al. jointly analyzed multiple studies with varying 38 environmental conditions using a meta-analysis based on a random-effect model [12]. Illustration of the existing challenges when conducting joint analysis on two independently collected data sets for two phenotypes. For example, two independent data sets (blue data and red data) are collected with case/control phenotypes for blue symptom and case/control phenotypes for red symptom, respectively. The goal is to estimate the coefficients of SNPs (β (1) and β (2) ) for blue symptom and red symptom respectively. The only prior knowledge we have is that there are some common genetic factors between blue symptom and red symptom, and therefore, the blue symptom and red symptom might be correlated in a weak manner. However, we do not have blue symptoms collected for red data set, or red symptoms collected for blue data set. More importantly, there will be population confounding factors and data collection confounding factors intervening the estimation of coefficients and these confounding factors must be estimated and accounted. This figure shows all the variables that are related to the joint analysis. The variables enclosed in dashed lines must be inferred. meta-analysis [14]. He et al. introduced PleioPred, which jointly models multiple 46 genetically correlated diseases and a variety of external information including linkage 47 disequilibrium and diverse functional annotations [15]. Finally, Wen et al. proposed a 48 Bayesian hierarchical model for co-localization analysis [16]. 49 In this paper, we propose a method for joint association analysis of two GWAS data 50 sets, namely Coupled Mixed Model (CMM), that aims to address all the aforementioned 51 challenges and provide a reliable joint analysis of the data sets by inferring the missing 52 information as illustrated in Fig. 1. In particular, CMM infers the missing phenotypes 53 and various confounding factors with the maximum likelihood estimation derived from 54 multiple coupled multivariate sparse mixed models. 55 Different from the summary statistics-based methods described above, our method 56 models the genotypes and phenotypes of the samples directly from the original data sets 57 to estimate the effect sizes of genetic variants. Our method is also different from the 58 approaches for missing phenotype imputation such as [17,18] in that our method aims 59 to address the challenges when there are no empirical data which allows the correlation 60 between different phenotypes to be measured, which is a common situation for 61 independently collected data sets usually not generated for joint analysis purposes. 62 Using extensively simulation experiments, we compare our method with previous 63 approaches for conducting joint analysis. Furthermore, we apply our methods to real 64 GWAS data sets previously generated for studying substance use disorder (SUD) and 65 Alzheimer's diseases (AD), respectively, for joint analysis.

67
We now introduce the CMM method for joint genetic analysis of two independently 68 collected GWAS data sets. We first present a model that deals with missing phenotypes 69 and the confounders as discussed above and also shown in Fig. 1, and then in the next 70 section, we present an algorithm that estimates the effect sizes of genetic variants by 71 optimizing the objective function derived from the model.

72
The following are the notations we use in this work: The subscript denotes the 73 identifier of data set, and the superscript in parentheses denotes the identifier of 74 phenotypes. Genotypes and phenotypes are denoted as X and y, respectively. Also, n 75 denotes the sample size, and p denotes the number of SNPs. Specifically, consider a 76 scenario as illustrated in Fig. 1, X 1 and X 2 represent the genotypes of the samples in 77 data sets 1 and 2 with the dimension of n 1 × p and n 2 × p, respectively. y 2 is not observed.

82
Our method does not require n 1 = n 2 . However, for the convenience of the 83 presentation of our method, we will assume n 1 = n 2 = n. The case of n 1 = n 2 can be 84 easily generalized by weighing the corresponding cost function components with 1/n 1 85 and 1/n 2 , respectively. Following the similar logic, although our method can be 86 extended to the case of generalized linear models (e.g. for case-control data), we 87 introduce our method with the simplest linear models for the clearness of discussion.

88
Straightforwardly, we have the following relations for the scenario shown in Fig. 1: where (j) i stands for residual noises for data set i with phenotype j, and , where I is an identity matrix with the shape of n × n; u for the confounding effects due to population stratification, family structures and 91 cryptic relatedness in data set i with phenotype j; and v i accounts for the confounding 92 effects due to data collection in data set i. 93 We have u ) for data set i with phenotype j. As observed by 94 Devlin et al. [9], population stratification can cause false discoveries because there exist 95 real associations between a phenotype and untyped SNPs that have similar allele 96 frequencies with some typed SNPs not actually associated with the phenotype, which, 97 as a result, can lead to false associations between the phenotype and the typed SNPs.

98
Because these false associations due to confounders from population stratification are 99 phenotype specific, we model σ 2 u (j) as phenotype-specific. Hence, although we have four 100 different variance terms for population confounders (i.e. u  2 ), they 101 are only parameterized by two scalars (i.e., σ 2 u (1) and σ 2 u (2) ). K i = X i X T i is the kinship 102 matrix, constructed following standard genetics convention. A more sophisticated 103 construction of the kinship matrix may be used to improve detection of the signals, but 104 these details are beyond the scope of this paper. One can refer to examples in [19][20][21][22] 105 for more details.

106
On the other hand, we have v i ∼ N (0, Iσ 2 vi ) for data set i. Because the confounders 107 due to data collection are only related to the data collection procedure, we model σ 2 vi as 108 data set specific.

109
For the independently collected data sets, we only observe < X 1 , y 2 >. We are interested in estimating β (1) and β (2) , and to achieve this goal, we 111 also need to estimate y 2 , σ 2 u j , and σ 2 vi in Eq. 1.

112
In order to estimate β (1) and β (2) , we minimize the joint negative log-likelihood 113 function with the Laplace distribution as a prior distribution over β (1) and β (2) to 114 account for the fact that there are only a subset of SNPs that contribute to the 115 phenotype. Additionally, as our goal is to identify the SNPs jointly associated with 116 phenotypes 1 and 2, we encourage the method to find common SNPs responsible for 117 both phenotypes. Therefore, we constrain our method to find a non-empty intersection 118 (i.e., the cardinality of the intersection is at least c, where c is a scalar, and c > 0) of the 119 support of β 1 and β 2 . As a result, the main cost function, which is the negative regularization of β and common support constraint, can be represented as follows: where Σ is the covariance matrix defined as: and we have:σ and supp(·) stands for the support of a vector and | · | stands for the cardinality.

125
The detailed derivation of the optimization function 2 is described in S1 File. The 126 key steps involve replacing y 1 with X 1 β (2) , and replacing y 2 with X 2 β (1) , and then 127 writing out the joint likelihood function of Equation 1.

134
Estimating variances of the population confounders and data 135 collection confounders 136 Despite the complicated cost function, solving for is relatively straightforward because we have: 2 , and X 2 are all fully observed.

137
Equivalently, we have: Thus, we can use conventional methods for solving linear mixed models to estimate [23,24]). K i is usually constructed from X i X T i in genomic 139 applications [21], which may introduce an "over-representing problem", since X i X T i has 140 a full rank and hence represents the relationship between every pair of the samples in the 141 data set [25]. To solve this problem, we use a truncated-rank approach proposed in [26] 142 to reduce the rank of K. Specifically, we set the non-dominant eigenvalues of K to be 143 zero with a simple inspection of the slope of the eigenvalues as follows: if Si−Si+1 S0 ≤ 1 n , 144 where n is the number of samples, then we set the eigenvalue S i to be zero.

145
An iterative updating algorithm for estimating effect sizes and 146 the covariance matrix of the joint likelihood 147 We only have {β (1) , β (2) , t} left to be estimated, however, directly estimating t is 148 difficult since it involves in four coupled terms in Function 2. To simplify the problem, 149 we introduce an approximation to decouple the dependencies among t, β i and σ ii , 150 leading to a neat solution involving two steps that can be conducted iteratively until 151 convergence. The proof of the convergence of the algorithm is in the S1 File.

152
Calculating t and σ ii given β i 153 Calculating t and σ ii is straightforward because both calculations can be done analytically. The analytical form for σ ii is shown in Equation 3. The analytical form of t can be derived by equating the derivative of this following convex function to zero: Numeric methods are needed for estimating {β (1) , β (2) }, which we achieve by 155 alternating the direction method of multipliers (ADMM) [27]. ADMM introduces 156 another constraint forcing β (1) = β (2) . In general, constraining two vectors to be equal 157 is stronger than constraining two vectors to have common support. Hence, we remove 158 the common support constraint in Function 2. With ADMM, the optimization involves 159 solving the following three functions iteratively: Function 4 and Function 5 can be solved via proximal gradient descent [28] where the 1 163 regularization term is regarded as the proximal operator.

165
Although we have introduced the CMM method in the context of analyzing two 166 independently collected data sets with different phenotypes, it is noteworthy that CMM 167 can still be used for analyzing multiple data sets with two different phenotypes.

168
May 24, 2018 6/18 However, analyzing multiple data sets with multiple different phenotypes requires a 169 more sophisticated design of the method which is beyond the scope of this paper.

170
Implementation 171 The implementation of Coupled Mixed Model (CMM) is available as a python software. 172 Without installation, one can run the software with a single command line. It takes 173 binary data in a standard Plink format for each of the two data sets as input. If there 174 are mismatched SNPs between the data sets, CMM will use the intersection of these 175 SNPs. We recommend the users to query CMM to identify a specific number SNPs for 176 each data set and CMM can tune the hyperparameters according to this specified 177 number. However, to achieve the same goal, users can also choose to specify 178 regularization parameters (i.e., λ 1 , λ 2 in Equation 2 and ρ in Equation 6). If none of 179 these information is specified, CMM will automatically conduct cross-validation to tune 180 parameters. The implementation is available as a standalone software 1 and will be Benjamini-Hochberg (BH) procedure [30].

198
• CD: Combining data-set approach. CD directly merges two data sets into one, and linearly dependent features, mostly used in genomic studies [33]. PL is 214 applied to the independent data sets with the same manner as LR.

215
• JL: Joint Lasso. A method we implement in this paper for a fair comparison of 216 our proposed CMM method. JL solves the lasso problems jointly with the 217 constraint β (1) = β (2) with ADMM. This step can be roughly seen as an 218 intermediate step between the popular CD approach and our proposed method. 219 We use the logistic regression-equivalent as the phenotypes are binary.

221
Data Generation

222
We use SimPop [34] to simulate the SNPs of the two independently collected data sets. 223 For each data set, we simulate 10000 SNPs for 500 samples from five different 224 populations with migration behaviors. Each population also unevenly splits into five 225 sub-populations. Therefore, it can be seen as these 500 samples are from 25 regions 226 (denoted as G) out of five continents. These two SNP data sets are denoted as X 1 and 227 X 2 , following the notation described in Section . X 1 and X 2 are both 500 × 10000 228 matrices.

237
To introduce confounders such as population stratification and family structure, we 238 use the aforementioned regions G to affect the phenotypes differently (the effects of 239 these regions are denoted as γ, sampled from a Gaussian distribution N (0, 100)). To 240 simulate data collection confounders, we introduce a scalar e, sampled from a Gaussian 241 distribution N (0, 1).

242
Finally, we have the responses as: Then, we transform these continuous phenotypes into binary phenotypes via Bernoulli sampling with the outcome of the inverse logit function (g −1 (·)) over current responses. Therefore, we have: We run the experiments with 10 different random seeds and evaluate these methods   associated with one phenotype from the corresponding data set. We use the ROC curve 246 for the visualization of the performance comparison. (We also plot the area under ROC 247 curves in S1 Fig. -S3 Fig).  Furthermore, we plot the results of the ROC curves of the compared methods in 281 term of their ability to find the associated SNPs separately for each data set, which are 282 shown in Fig. 3, where "Phe 1" and "Phe 2" stands for the two phenotypes, respectively. 283 Notably, it is not surprising to see that the two curves for the two phenotypes almost 284 overlap with each other. This happens because i) the two data sets were generated by 285 using the same data generation protocol with two different sets of parameters, and ii) 286 the results reported are averaged results across 10 different random seeds. Moreover, we 287 notice that, these curves exhibit the same patterns as those in Fig. 2, which is also not 288 surprising for most competing methods when the overlapping set is only selected as the 289 intersection of the support of the two estimated coefficients.

290
The results shown in these curves, along with those in Fig. 2 In the read data analysis, we apply our proposed CMM method to two GWAS data sets 301 generated previously to investigate genetic association in Alzheimer's disease (AD) and 302 substance use disorder (SUD). The AD data set was collected from the Alzheimer's previous studies [35]. There are 257361 SNPs in these two data sets left to be examined. 313 Due to the statistical limitation of selecting variables using cross-validation and 314 information criteria in high dimensional data [33], we inquire the method for a fixed 315 number of SNPs to be selected, following previous studies (i.e. [36,37]) and the 316 hyperparameters of our model will be tuned automatically through binary search for the 317 set of parameters to report the number of SNPs we require. To mitigate the 318 computation load, the algorithm will terminate the searching of hyperparameters when 319 the number of the reported SNPs lies within 50% to 200% of the number we inquire. 320 We inquire for 30 SNPs selected in each data set, and CMM identifies 20 SNPs 321 associated with SUD and 40 SNPs associated with AD when the algorithm converges.

322
There are five SNPs associated with both SUD and AD. The identified SNPs are listed 323 in Tables 1 and 2 for AD and SUD, respectively. Although our method identified 40 324 SNPs for AD, we only show the top 20 as the primary interest of the current work is to 325 identify common SNPs associated with both of the phenotypes.

326
In particular, rs2131691, which resides in the ANO3 gene is identified as the SNP 327 with the most significant effect size. ANO3 is reported to be associated with AD in 328 silico [39]. To the best of our knowledge, there is no previous literature that associates 329 ANO3 with SUD. However, evidence has shown that mutations in ANO3 produced 330 functional changes that affected striatal signal transduction pathways [59,60] and also 331 that dysfunction of striatal signal transduction pathways is implicated in SUD [61].

332
Therefore, these previous observations support our finding that rs2131691 is connected 333 to SUD. positive modulation of the TRPV1 channels can be a potential target for mitigation of 337 AD [41], suggesting an important involvement of TRPV1 in AD. On the other 338 hand, [43] have also shown that TRPV1 plays a key role in morphine addiction.

339
Also, [42] showed that the deletion of TRPV1 in mice altered behavioral effects of 340 ethanol which indicates a connection between TRPV1 and alcoholism. 341 rs1709317, which resides in KLHL29, is identified as the 8 th SNP and 5 th SNP 342 associated with AD and SUD respectively. We do not find strong evidences that 343 support the association between KLHL29 and either AD or SUD. However, we notice 344 that KLHL29 has been reported as one of the top 25 small airway epithelium 345 hypomethylated genes of smokers compared with nonsmokers [62], also reported to be 346 associated with smoking cessation [45]. 347 rs4713797, which resides in PACSIN1, is identified as the 10 th SNP and 6 th SNP 348 associated with AD and SUD, respectively. PACSIN1 has been implicated in various 349 neurodegenerative disorders, including Huntington and Alzheimer's diseases [49,50,63]. 350 Additionally, the association between PACSIN1 and SUD has been reported in a study 351 which investigated whether DNA methylation patterns in early life are prospectively 352 associated with SUD in adolescence [51].

353
rs1057744, which resides in JAG2, is identified as the 11 th and 16 th SNP associated 354 with AD and SUD, respectively. [52] reported the association of JAG2 with late-onset 355 Alzheimer's disease. However, it is not previously known that JAG2 is associated with 356 SUD. Table 2. The top SNPs that our method identifies in substance use disorder study with joint study in Alzheimer's disease The SNPs are ranked by the absolute values of their estimated effect sizes. SNPs in bold are the ones that are shared with the results in Alzheimer's disease study. The MAFs reported in the table are calculated using the case-control substance use disorder experimental data. The information of whether a SNP is located within a region of a gene is taken from the Database for Single Nucleotide Polymorphisms (dbSNP) [38], and listed in the 'Gene' column. The literature support showed supports the association between the corresponding gene and the disease either in vivo or in silico from other independent data sets. Abbreviations: AD: Alzheimer's Disease, ALC: Alcoholism, HD: Huntington's disease, SCZ In this paper, we propose a method that can enable a joint genetic association analysis 363 of different phenotypes using two independently collected data sets. Typically, 364 conducting joint analysis of independently collected data sets with naïve approaches will 365 inevitably lead to many false discoveries, because these data sets usually do not have 366 the same phenotypes, and also many confounding effects such as population 367 stratification, family structures, cryptic relatedness, and data collection confounders are 368 present in the data sets.

369
To address the challenges in the joint analysis, we propose a novel method, Coupled 370 Mixed Model (CMM), that can estimate the effect sizes of genetic variants and 371 accounting for both population confounding factors and data collection confounding 372 factors. We further present an algorithm that allows an efficient parameter estimation 373 of the objective function derived from our model. As an essential step of our method 374 involves decoupling of the dependency of parameters and updating them iteratively, we 375 also present a convergence proof of our proposed iterative updating algorithm.

376
With extensive simulation experiments, we showed the superior performance of our 377 methods in comparison with seven competing methods in terms of both finding the 378 common genetic variants associated with two different phenotypes, as well as finding the 379 genetic variants associated with each phenotype. Further, we apply our CMM method 380 to identify the common SNPs associated with both Alzheimer's disease and SUD. CMM 381 identified five SNPs associated with both Alzhimer's disease and SUD. Literature survey 382 provide strong evidences to support these findings.

383
To the best of our knowledge, our proposed method is a novel computational tool 384 that enables the joint analysis of two independently collected data sets while accounting 385 for various confounders simultaneously. In the future, possible extensions include 386 development of more sophisticated methods that allow joint association analysis of 387 multiple phenotypes by using multiple individually collected data sets simultaneously.