A General Quantitative Genetic Model for Haplotyping a Complex Trait in Humans

Uncertainty about linkage phases of multiple single nucleotide polymorphisms (SNPs) in heterozygous diploids challenges the identification of specific DNA sequence variants that encode a complex trait. A statistical technique implemented with the EM algorithm has been developed to infer the effects of SNP haplotypes from genotypic data by assuming that one haplotype (called the risk haplotype) performs differently from the rest (called the non-risk haplotype). This assumption simplifies the definition and estimation of genotypic values of diplotypes for a complex trait, but will reduce the power to detect the risk haplotype when non-risk haplotypes contain substantial diversity. In this article, we incorporate general quantitative genetic theory to specify the differentiation of different haplotypes in terms of their genetic control of a complex trait. A model selection procedure is deployed to test the best number and combination of risk haplotypes, thus providing a precise and powerful test of genetic determination in association studies. Our method is derived on the maximum likelihood theory and has been shown through simulation studies to be powerful for the characterization of the genetic architecture of complex quantitative traits.


INTRODUCTION
The high-throughout technology of single nucleotide polymorphisms (SNPs) provides a powerful tool for studying the detailed genetic and developmental architecture of complex traits, such as human diseases, because SNPs residing within a coding sequence can alter the biological function of a protein that forms a phenotype [1][2]. However, current experimental techniques have still not achieved a point at which multiple SNPs can be easily observed at their diplotype level [3]. Such a technological limitation makes it difficult to associate the phenotypes of a trait with specific DNA sequence variants (known as haplotypes) constructed by a set of SNPs, although recent genetic studies suggest that a gene may determine a complex trait through its haplotype rather than genotype [4][5][6][7][8]. More recently, a statistical model has been derived to estimate and test haplotype effects on trait variation with a random sample drawn from a natural population [9][10][11]. This model implements the population genetic properties of gene segregation into a unifying mixture-model framework for haplotype discovery. It assumes that one haplotype composed of alleles at multiple SNPs is different from the remaining haplotypes in terms of genetic effects on a trait. The former is called the reference or risk haplotype [9], whereas the latter is collectively called the non-reference or non-risk haplotype. This simplified assumption allows the direct use of a traditional biallelic quantitative genetic model [12] and facilitates the definition and estimation of genetic effects triggered by different haplotypes, but it is limited in *Address correspondence to this author at the Department of Statistics, University of Florida, Gainesville, FL 32611, USA; Tel: (352)392-3806; Fax: (352)392-8555; E-mail: rwu@stat.ufl.edu practical use when there is substantial variation among the non-risk haplotypes.
The motivation of this work is to expand Liu et al.'s [9] original idea to model all possible effects of individual haplotypes by constructing a multi-allelic quantitative genetic model within the mixture model framework. The multiallelic model deals with genetic effects triggered by multiple alleles at a single gene and is thought to be important for explaining genetic variation in a natural population. We use the multi-allelic model to define various additive and dominance effects due to multiple risk haplotypes. Conventional model selection criteria are incorporated to choose the optimal number and combination of risk haplotypes responsible for quantitative variation of a trait. We derived closed forms for the EM algorithm to estimate a variety of genetic parameters including haplotype frequencies and haplotype effects. Simulation studies are used to test the statistical behavior of the model and validate its usefulness and utilization.

Population and Quantitative Genetic Models
Suppose there are genetically associated SNPs each with two alleles designated as 1 and 0. Let p and q be the 1-allele frequencies for the first and second SNP, respectively. Thus, the 0-allele frequencies at different SNPs will be 1 -p and 1-q. These two SNPs segregating in a natural population form four haplotypes, 11, 10, 01 and 00, whose frequencies are constructed by allele frequencies and linkage disequilibrium (D) between the two SNPs, i.e., p 11 = pq + D, p 10 = p(1 -q) -D, p 01 = (1 -p)q -D, and p 00 = (1 -p)(1 -q) -D. We use p = (p 11 , p 10 , p 01 , p 00 ) to denote the haplotype frequency vector. These haplotypes unite randomly to generate 10 distinct diplotypes and 9 distinct genotypes. If the population is at Hardy-Weinberg equilibrium, the frequency of a diplotype is expressed as the product of the frequencies of the two haplotypes that constitute the diplotype ( Table 1). The frequency of the double zygotic genotype is the summation of the frequencies of its two possible diplotypes.
We are interested in the detection of risk haplotype(s) constructed by the alleles of the two SNPs which encodes a quantitative trait. Below given are different genetic models used to identify risk haplotypes.

Biallelic Model
Liu et al. [9] assumed that all haplotypes are sorted into two groups, risk and non-risk, and defined the combination of risk and non-risk haplotypes as a composite diplotype. Let A 1 and A 0 be the risk and non-risk haplotypes, respectively, which are equivalent to two alternative alleles if the two associated SNPs considered are viewed as a "locus". Thus, for such a "biallelic locus", we have three possible composite diplotypes whose genotypic values are specified as

Composite Diplotype
Genotypic Value where μ is the overall mean, a is the additive effect due to the substitution of the risk haplotype by the non-risk haplo-type, and d is the dominance effect due to the interaction between the risk and non-risk haplotypes. These parameters are arrayed in qB = (μ, a, d).
There are a total of seven options to choose the risk haplotype. First, because any one haplotype from 11, 10, 01 and 00 can be risk, there are four choices for determining the risk haplotype. Second, any two haplotypes can be different from the rest, which includes three possibilities for combining the risk vs. non-risk haplotypes. All these options can be tabulated as follows: The optimal choice of a risk haplotype for the biallelic model is based on the maximum of the likelihoods calculated for each of the seven options described above.

Triallelic Model
It is possible that there are two distinct risk haplotypes which are each different from non-risk haplotypes. This case

A0A0 A0A0 A0A0
Two alleles for each of the two SNPs are denoted as 1 and 0, respectively. Genotypes at different SNPs are separated by a slash. Diplotypes are the combination of two bracketed maternally and paternally derived haplotypes. Risk haplotype(s) is assumed as [11] for the biallelic model, [11] and [10] for the triallelic model, and [11], [10] and [01] for the quadriallelic model. is regarded as a "triallelic locus". Let A 1 and A 2 be the first and second risk haplotypes, and A 0 be the non-risk haplotype, which form six composite diplotypes with genotypic values expressed as Composite Diplotype Genotypic Value where μ is the overall mean, a 1 and a 2 are the additive effects due to the substitution of the first and second risk haplotype by the non-risk haplotype, and d 12 , d 10 and d 20 are the dominance effects due to the interaction between the first and second risk haplotype, between the first risk haplotype and the non-risk haplotype and between the second risk haplotype and non-risk haplotype, respectively. These parameters are arrayed in qT = (μ, a 2 , a 2 , d 12 The optimal combination of risk haplotypes for the triallelic model corresponds to the maximum of the likelihoods calculated for each of the six possibilities.

Quadriallelic Model
If there are three distinct risk haplotypes, we need a quadriallelic genetic model to specify haplotype effects. Let A 1 , A 2 and A 3 be the first, second and third risk haplotypes, and A 0 be the non-risk haplotype, which form 10 composite diplotypes with genotypic values expressed as

Composite Diplotype
Genotypic Value where μ is the overall mean, a 1 , a 2 and a 3 are the additive effects due to the substitution of the first, second and third risk haplotype by the non-risk haplotype, and d 12 , d 13 , d 23 , d 10 , d 20 , and d 30 are the dominance effects due to the interaction between the first and second risk haplotype, between the first and third risk haplotype, between the second and third risk haplotype, between the first risk and non-risk haplotype, between the second risk and non-risk haplotype, and between the third risk and non-risk haplotype, respectively. These parameters are arrayed in qQ = (μ, a 1 , a 2 , a 3 d 12 , d 13 , d 23 , d 10 , d 20 , d 30 ).

LIKELIHOOD
Assume that a total of n subjects are sampled from a Hardy-Weinberg equilibrium population and that each subject is genotyped for many SNPs and phenotyped for a quantitative trait. Consider two of the SNPs that form nine genotypes with observed numbers generally expressed as n r 1 r' 1 /r 2 r' 2 (r 1 , r' 1 , r 2 , r' 2 = 1,0). The phenotypic value of the trait for subject i is expressed in terms of the two-SNP haplotypes as where i is the indicator variable defined as 1 if subject i has a composite diplotype j and 0 otherwise, e i is the residual error, normally distributed as N(0, 2 ), and J is the number of composite diplotypes expressed as 3 for the biallelic model J = 6 for the triallelic model (7) 10 for the quadriallelic model.
The genotypic values of composite diplotypes and variance are arrayed by a quantitative genetic parameter vector q = ( qB , 2 ) for the biallelic model, ( qT , 2 ) for the triallelic model, and ( qQ , 2 ) for the quadriallelic model.

THE EM ALGORITHM
A closed-form solution for the EM algorithm has been derived to estimate the unknown parameters that maximize the likelihoods. The estimates of haplotype frequencies are based on the log-likelihood function L( p | S), whereas the estimates of genotypic values of composite diplotypes and the residual variance are based on the log-likelihood function L( q | y, S, p ). These two different types of parameters can be estimated using a two-stage hierarchical EM algorithm (see [9] for a detailed implementation).

MODEL SELECTION
The formulation of likelihoods (10), (11) and (12) is based on the assumption that one or more haplotypes are risk haplotypes for the biallelic, triallelic and quadriallelic model. However, a real risk haplotype under each of these models is unknown from raw data (y, S). Also, we are uncertain about the optimal number of risk haplotypes. An additional step for the choice of the most likely risk haplotypes and their number should be implemented. The simplest way to do so is to calculate and compare the likelihood values within the model by assuming that any one or more of the four haplotypes can be a risk haplotype, and AIC or BIC among the models by assuming different numbers of risk haplotypes [13]. Thus, we obtain possible likelihood values and AIC/BIC as follows: The largest likelihood and the smallest AIC/BIC value calculated is thought to correspond to the most likely risk haplotypes and their optimal number.

HYPOTHESIS TESTS
The genetic architecture of a quantitative trait is characterized by quantitative genetic parameters (including haplotype effects and the mode of their inheritance). The model proposed provides a meaningful way for estimating the genetic architecture of a trait. The estimated genotypic values for the composite diplotypes can be used to estimate additive and dominance genetic effects of haplotypes by The additive and dominance effects under different models can be tested by formulating the null hypothesis that the effect being tested is equal. The estimates of the parameters under the null hypotheses can be obtained with the same EM algorithm derived for the alternative hypotheses but with a constraint of the tested effect equal to zero. The loglikelihood ratio test statistics for each hypothesis is thought to asymptotically follow a x 2 -distributed with the degree of freedom equal to the difference of the numbers of the parameters being tested under the null and alternative hypotheses.

HAPLOTYPING WITH THREE SNPS
Li et al. [11] constructed a conceptual framework and statistical algorithm for haplotyping a quantitative trait with three SNPs. For a set of three SNPs, there are eight different haplotypes, among which it is possible to have one to seven risk haplotypes. The biallelic model specifies one risk haplotype which may be composed of one (8 cases), two (24 cases), three (56) or four haplotypes (170). The triallelic, quadrialleli, pentaallelic, hexaallelic, septemallelic and octoallelic models contains 28, 56, 170, 56, 24 and 8 cases, respectively. It can be seen that the model selection procedure to determine the optimal number and combination of risk haplotypes will become exponentially more complicated when the number of SNPs increases.

MONTE CARLO SIMULATION
The statistical properties of the model are investigated through simulation studies. Given a certain sample size of subjects (n = 100, 400 or 1000), two SNPs (each with two alleles 1 and 0) were simulated by assuming that 10 diplotypes follow a multinomial distribution with the frequencies determined by allele frequencies p = 0.6 and q = 0.6 and linkage disequilibrium D = 0.05. By hypothesizing risk haplotypes under biallelic, triallelic and quadriallelic models, composite diplotypes can be determined for each double-SNP genotype. The phenotypic values of a quantitative trait were simulated as a normal distribution with mean depending on composite diplotypes and variance determined under different heritability levels (H 2 = 0.1, 0.2 and 0.4).
For a practical data set, the number and combination of risk haplotypes that govern a phenotypic trait is unknown. Thus, the simulation performed here will elucidate the procedure and power to determine risk haplotypes by the new model. The data sets simulated with given risk haplotypes under each quantitative genetic model were analyzed by biallelic, triallelic and quadriallelic models, respectively. For each analysis, the likelihood and model selection criteria, AIC and BIC, are calculated with display (13). The power to correctly identify risk haplotypes was calculated from 1000 simulation replicates. Fig. (1) illustrates such power under different heritabilities, sample sizes and genetic models. For the data simulated under the biallelic model, a correct risk haplotype can well be determined with a sample size of 200 even when the heritability of the trait is modest (0.1). In this case in which a small number of genetic parameters are included, the BIC performs better than the AIC. For a data set simulated under the triallelic model, the power of haplotype detection reduces considerably, compared with the data set simulated by the biallelic model. If the heritability of a trait is as low as 0.1, about 1000 subjects are needed to achieve the power of 0.8. With the heritability increasing to 0.2 or 0.4, the same power needs about 600 or 300 subjects, respectively. It is interesting to note that the AIC performs better than the BIC when the heritability is low (0.1 or 0.2), whereas the two criteria perform similarly when the heritability is high (0.4).
The data set simulated under the quadriallelic model contains a very large number of genetic parameters to be estimated. As expected, the power of haplotype detection in this case will be reduced (Fig. 1). When the heritability is as low as 0.1, a sample size of 1000 can only achieve a power of 0.2. But with an increasing heritability, the power will increase dramatically. For example, a power of > 0.9 can be achieved with 600 subjects when the heritability is 0.4. For the quadriallelic model-simulated data, the AIC always performs better than the BIC because the latter poses too heavy penalty in this case.
The estimates of population (including allele frequencies and linkage disequilibrium) and quantitative genetic parameters (including additive and dominance effects) for each simulated data set were evaluated by calculating their sampling errors. Previous work suggested that the estimates of population genetic parameters display great precision even for a sample size of 100 [9]. Here, our simulation studies will focus on the assessment of the precision of quantitative genetic parameter estimates under different heritabilities and sample sizes. For the data set simulated with the biallelic model, the additive and dominance effects can be precisely estimated even with a heritability of 0.1 and a sample size of 100 ( Table 2). Increasing heritabilities and sample sizes increase estimation precision dramatically.
The data set simulated under the triallelic model contains two additive effects and three dominance effects. A sample size of 100 is adequate for precise estimates of the additive effects even for a low heritability (0.1), but the reasonable estimates of the dominance effects need increasing sample size (400 or more) if the heritability is 0.1 ( Table 3). For a high heritability (0.4), a small sample size (100) can provide relatively precise estimates of the dominance effects. For the data set simulated with the quadriallelic model, three additive effects and six dominance effects are included. Still, a low sample size (100) can provide very good estimates of the additive effects even for a low heritability. To reasonably estimate the dominance effects, we need a large sample size (1000) for the heritability of 0.1 or a moderately large sample size (400) for the heritability of 0.4 ( Table 4).

DISCUSSION
Single nucleotide polymorphisms (SNPs) are powerful markers that can explain interindividual differences in disease risk and drug responsiveness in humans. For genes containing multiple SNPs, haplotype structure (i.e., the linear arrangement of different SNP alleles on each of the two homologous chromosomes) is thought to be the principal de- Fig. (1). Power to detect correct risk haplotypes from the data simulated by a biallelic, triallelic and quadriallelic model, respectively, under different heritabilities and sample sizes. Model selection criteria are based on AIC and BIC.  [9], our model was founded on the mixture model-based framework in which the frequencies of haplotype distribution and haplotype effects are estimated with the closed form of the EM algorithm. But our model was incorporated by two important theories from different fields, one regarding the segregation and inheritance of multiple alleles at a single locus in quantitative genetics and the second regarding model selection procedures in statistics. Liu et al.'s [9] model was framed on a biallelic model in which one haplotype constructed by a set of associated SNPs was assumed to perform differently from the rest of the haplotypes. Traditional quantitative genetic theory mostly based on biallelic inheritance provide a basis for es-  timating the additive and dominance effects due to two alternative alleles at a functional gene, but fails to characterize the genetic effects due to all possible combinations between multiple alleles. We have for the first time implemented multiallelic quantitative genetic theory into the estimation process of haplotype effects, in which multiple additive effects and multiple dominance effects due to multiple functional haplotypes can be estimated and tested separately or jointly.
The new model expands the idea of haplotyping a complex trait to study the detailed genetic control of the trait in a precise way.
To deal with multiple risk haplotypes, an issue arises naturally about the selection of most likely risk haplotypes from a pool of haplotypes. This will include the optimal number of risk haplotypes and their combination that provide a best fit to the given data. We implemented model selection procedures into the test process of haplotype diversity and effects with two commonly used criteria, AIC and BIC. Extensive simulation studies were performed to investigate the statistical properties of the model and its utilization. Given a real data set, we do not know about the type and number of risk haplotypes. But these can be estimated with model selection by assuming different types of genetic models, biallelic (one risk haplotype), triallelic (two risk haplotypes) and quadriallelic (three risk haplotypes). Simulation studies with two-SNP haplotypes provide a table of model selection approaches (Tables 2-4) to detect most likely risk haplotypes hidden in a genetic association data set based on a range of sample size and heritability as well as the types of genetic models.
The human genome contains millions of SNPs distributed over 23 pairs of chromosomes [14]. However, these SNPs were observed to locate in different haplotype blocks of the human genome [15][16]. For a given block, there are a particular number of representative SNPs or htSNPs that uniquely identify the common haplotypes in this block or QTN. Several algorithms have been developed to identify a minimal subset of htSNPs that can characterize the most common haplotypes [2,[17][18]. The idea given in this article can be used to find risk haplotypes of these htSNPs by modeling an arbitrary number of SNPs [11], and extended to detect haplotype-haplotype interactions [10], haplotypeenvironment interactions, parent-of-origin effects of haplotypes in genetic association studies and haplotypes regulating pharmacodynamic reactions of drugs [19]. Although these works will be computationally expensive, it should not be computationally prohibitive if combinatorial mathematics, graphical models, and machine learning are incorporated into closed forms of parameter estimation. With detailed extensions that take account into more realistic biological and genetic problems, our model may provide an efficient solution to the growing need for haplotype data collection and association studies.

ACKNOWLEDGEMENT
The preparation of this manuscript is supported by NSF grant (0540745) to RW.