Best Prediction of the Additive Genomic Variance in Random-Effects Models

The additive genomic variance in linear models with random marker effects can be defined as a random variable that is in accordance with classical quantitative genetics theory. Common approaches to estimate the genomic variance in random-effects linear models based on genomic marker data can be regarded as estimating the unconditional (or prior) expectation of this random additive genomic variance, and result in a negligence of the contribution of linkage disequilibrium (LD). We introduce a novel best prediction (BP) approach for the additive genomic variance in both the current and the base population in the framework of genomic prediction using the genomic best linear unbiased prediction (gBLUP) method. The resulting best predictor is the conditional (or posterior) expectation of the additive genomic variance when using the additional information given by the phenotypic data, and is structurally in accordance with the genomic equivalent of the classical additive genetic variance in random-effects models. In particular, the best predictor includes the contribution of (marker) LD to the additive genomic variance and possibly fully eliminates the missing contribution of LD that is caused by the assumptions of statistical frameworks such as the random-effects model. We derive an empirical best predictor (eBP) and compare its performance with common approaches to estimate the additive genomic variance in random-effects models on commonly used genomic datasets.

T HE additive genetic variance is defined as the variance of the breeding value (BV) and is the most important determinant of the response of a population to selection (Falconer and Mackay 1996). The additive variance can be estimated from observations made on the population and is a principal component of the (narrow-sense) heritability, which is one of the main quantities of interest in many genetic studies (Falconer and Mackay 1996). The heritability is em-inent, among other things, for the prediction of the response to selection in the breeder's equation (Piepho and Moehring 2007;Hill 2010). Although nonadditive genetic variation exists, most of the genetic variation is additive, such that it is usually sufficient to investigate the additive genetic variance (Hill et al. 2008). More specifically, epistasis is only important on the gene level but not for the genetic variance (Hill et al. 2008), and Zhu et al. (2015) show that, for human complex traits, dominance variation contributes little. Nevertheless, linkage disequilibrium (LD) is an important factor, especially when departing from random mating and Hardy-Weinberg equilibrium, which is often the case in animal breeding (Hill et al. 2008;L. Dempfle, personal communication).
The additive genomic variance is defined as the variance of a trait that can be explained by a linear regression on a set of markers (de los Campos et al. 2015).
genome-wide association studies (GWAS) in order to find quantitative trait loci (QTL) by single-marker fixed-effect regression combined with variable selection. After having added the estimated corresponding genomic variances of the single statistically significant loci, they asserted that they could only account for a fraction of the "true" genetic variance. For instance, Maher (2008) found that only 5% instead of the widely accepted heritability estimate of 80% of human height could be explained. Golan et al. (2014) state that the "true" genetic variance is generally underestimated when applying variable selection, e.g., GWAS, to genomic datasets which are typically characterized by their high dimensionality, where the number of variables (markers) p is much larger than the number of observations n. It is well known that a lot of traits are influenced by many genes, and that at least some loci with tiny effects are missed when using variable selection or even single-marker regression models.
Consequently, Bernardo (1994) decided to fit all [restriction fragment length polymorphism (RFLP-)] markers in maize jointly using genomic best linear unbiased prediction (gBLUP), where he assumes the marker effect vector to be random. In animal breeding, Meuwissen et al. (2001) used Bayesian approaches (BayesA and BayesB) to fit all markers jointly in order to predict BVs.
Then, Yang et al. (2010) estimated the genomic variance in an approach that they termed genome-wide complex trait analysis genomic restricted maximum likelihood (GCTA-GREML) (Yang et al. 2011). They showed that quantifying the combined effect of all single-nucleotide polymorphisms (SNPs) explains a larger part of the heritability than only using certain variants quantified by GWAS methods. They illustrate their results on a dataset on human height by pointing out that they could explain a heritability, also termed "chip heritability" (Zhou et al. 2013), of $45%. They concluded that the main reason for the remaining missing heritability was incomplete LD of causal variants with the genotyped SNPs, which refers to the general difference between the genetic variance and the genomic variance (Powell et al. 2010;de los Campos et al. 2015). However, the GCTA-GREML approach can be biased upwards as well as downward (Wolc et al. 2013;de los Campos et al. 2015;Fernando et al. 2017a;Lehermeier et al. 2017).
Recently, there has been a general discussion whether estimators for the genomic variance account for LD between markers, which is defined as the covariance between the marker genotypes (Bulmer 1971). Some authors argue that estimators similar to GCTA-GREML lack the contribution of LD (Krishna Kumar et al. 2016a,b;Lehermeier et al. 2017), whereas others (Yang et al. 2016) resolutely disagree. More specifically, Krishna Kumar et al. (2016a,b) state that, in GCTA-GREML, the contributions of the p markers to the phenotypic values are assumed to be independent normally distributed random variables with equal variances. Thus, they claim that the random contribution made by each marker is not correlated with the random contributions made by any other marker, which leads to a negligence of the contribution of LD to the additive genomic variance.
In a study on the model plant Arabidopsis thaliana (1001 Genomes Consortium 2016), Lehermeier et al. (2017) use Bayesian ridge regression (BRR) to relate the phenotype flowering time to the genomic data. They use an estimator (termed M2) based on the posterior distribution of the marker effects obtained by Markov Chain Monte Carlo (MCMC) methods and show that this estimator explains a larger proportion of the phenotypic variance than the estimator, termed M1, based on gBLUP (VanRaden 2008;Yang et al. 2010Yang et al. , 2011. Lehermeier et al. (2017) argue that the reason for the better performance of the Bayesian estimator for the additive genomic variance [already mentioned in Sorensen et al. (2001), Fernando and Garrick (2013), Zhou et al. (2013), Fernando et al. (2017b] is the explicit inclusion of LD. We show that the additive genomic variance in linear models with random marker effects (RME) can be defined as a random variable. Based on this premise, we propose a novel predictor of the additive genomic variance and place existing estimators in a joint framework permitting comparison with the new predictor. We contribute to the solution of many of the above mentioned controversies by reviewing common approaches to estimate the additive genomic variance, e.g., GCTA-GREML, and show that they estimate the unconditional (or prior) expectation of the random additive genomic variance. Combined with the assumptions on the unconditional distribution of the marker effects in the gBLUP-method, this leads to an insufficient adaptation to the data and a negligence of the contribution of LD.
We introduce a novel best prediction approach for the additive genomic variance in both the current and the base population, i.e., we use the conditional (or posterior) expectation of the random additive genomic variance given the additional information by the phenotypic values for an improved adaptation to the data. We decompose the best predictor into the GCTA-GREML estimator and a function for the contribution of marker LD, which determines whether GCTA-GREML is biased up-or downward. The best predictor is structurally in accordance with the genomic equivalent of the additive genetic variance from classical quantitative genetics, i.e., it explicitly includes the contribution of LD. We propose an empirical best predictor (eBP), and illustrate our theoretical results on several commonly used genomic datasets.

Linear models
The connection of the n-vector y of phenotypic values and the n-vector g of genomic values is given by (1) where m denotes a fixed intercept, 1 n :¼ ð1; . . . ; 1Þ ⊤ is a n-rowvector containing 1s, e $ N ð0; s 2 e I n Þ denotes environmental deviations, and I n is the identity matrix of dimension n.
The statistical model that is probably most popular in genomic applications is the random-effects model (REM), in which the genomic values are assumed to be normally distributed g $ N ð0; s 2 g GÞ; (2) where G is a variance-covariance matrix and s 2 g . 0 is the variance component of the vector of genomic values g.
In the following, we assume that the genome is mapped with p 2 ℕ markers and we denote by X the n 3 p design matrix coding the genotypes of the markers.
In this framework, the n 3 n-matrix G introduced in Equation 2 is called the genomic relationship matrix (GRM), and is most commonly defined as (3) where P :¼ I n 2 1 n 1 ⊤ n =n is the idempotent n 3 n-matrix used for column-wise mean-centering, and c :¼ 2 P p j ð1 2 p j Þ, where p j denotes the frequency of the minor allele at marker j (VanRaden 2008).
Then, the genomic values g can be separated into the coded genotypes of the single markers X and their corresponding p-vector b of marker effects such that The p individual components of the marker effect vector b are drawn at random from a common fixed normal distribution for each marker genotype: with positive variance component s 2 b :¼ s 2 g =c. Model (1) is called linear equivalent model (Henderson 1984) to the "standard" additive linear regression model The centering matrix P in Equations 4 and 6 is mandatory for the equivalence of models (1) and (6), and to be able to consistently apply model (6). The restriction of the column-means of the marker genotypes in model (6) to be 0 guarantees that the sample mean of the genomic values in Equation 4 equals 0 ð g :¼ 1 n 1 ⊤ n g ¼ 0Þ. This ensures the uniqueness of the definition of the vector g of genomic values in Equation 4. Otherwise, a different coding of the marker genotypes leads to different genomic values g.
Inferences on quantities based on genomic data in model (1) and (6) are often performed with the gBLUP method (Bernardo 1994). Model (6) allows for marker-specific investigations and inferences on the genomic contribution to the phenotypic values, whereas estimation of parameters in model (1) has computational advantages. For additional information on the gBLUP-method, we refer to Appendix Genomic Best Linear Unbiased Prediction.
Model (6) is a realization of n draws from the underlying data-generating process of the (mean-centered) marker genotypes ðX 1 ; . . . ; X p Þ (Bühlmann and van de Geer 2011). This distribution, as well as the corresponding genomic values in Equation 4, relate to the current population of individuals.
When we are interested in the genomic values in the corresponding consistent base population, we should take the relationship (correlation) between the individuals into account (Powell et al. 2010;Legarra 2016). In what follows, we assume a given n 3 n relationship matrix R. Instead of the genomic values g or PXb; we investigate the uncorrelated genomic values defined by These are realizations of n draws from the underlying datagenerating process ðX * 1 ; . . . ; X * p Þ of marker genotypes in the base population. The sample mean of the genomic values in the base population, 1 n 1 ⊤ n g * , is usually different from 0.

Definitions of the genomic variance
The additive genetic variance (in the base population also called genic variance) with respect to a trait is defined as the theoretical variance of the genetic value based on the respective QTLs (Falconer and Mackay 1996). The genomic variance has been described as the variance of the genomic values that can be explained by a linear regression on a set of markers (de los Campos et al. 2015). This does not, however, imply a unique formal definition of the genomic variance.
Here, we give an overview of different approaches to define a genomic variance in the framework of the linear models (1) and (6).
Without further assumptions on the nature of the genome, we can define the sample variance of the mean-centered n-vector of genomic values g ¼ PXb (see Equation 4) in the current population (Ould Estaghvirou et al. 2013). Here,Ŝ X :¼ X ⊤ PX=ðn 2 1Þ defines the sample variance-covariance matrix of the marker genotypes in the current population.
In the base population, we define the sample variance of the uncorrelated genomic values g * ¼ X * b (see Equation  7). Here,Ŝ X * :¼ ðX * Þ ⊤ PX * =ðn 2 1Þ defines the sample variance-covariance matrix of the marker genotypes in the base population. Alternatively, we can define the theoretical variance of the genomic values directly in the REM. The linear model is generated by drawing from the data-generating process of the marker genotypes (representative individual), and the model assumptions in the REM dictate that marker effects are random variables. This gives rise to three different scenarios for the variance of the genomic values in the REM (marker genotypes random, marker effects random, or both random).
The additive genomic variance of a randomly sampled (representative) individual (Gianola et al. 2009;de los Campos et al. 2015;Fernando et al. 2017b where S X denotes the variance-covariance matrix of the marker genotypes. The variance of a randomly sampled (representative) individual with RME is given by This is not the additive genomic variance (Gianola et al. 2009;de los Campos et al. 2015). The variance of a randomly sampled trait averaged over individuals with fixed genotypes X equals and does not equal the additive genomic variance. We derive the equalities in Equations 10-12 in more detail in the Appendix Theoretical Variances of the Genomic Values in the REM. These quantities refer to the genotypes in the current population. We can apply the same definitions in the base population by considering the data-generating process of the genotypes in the base population (exchange X by X * ).
In Table 1, we give an overview of the different possibilities to define the variance of the genomic values in the REM.
In actual applications, we have to replace S X in Equation 10 by its estimatorŜ X . Consequently, the sample variance (Equation 8) as well as the theoretical variance (Equation 10) effectively represent the additive genomic variance, i.e., the genomic equivalent of the additive genetic variance (Bulmer 1971;Falconer and Mackay 1996), in the current population. In the following, we do not explicitly distinguish between the sample or the theoretical version of the variance, and will speak only of the additive genomic variance.
From now on, we focus on the estimation of the additive genomic variance in the general form which is a non-negative quadratic form of the genomic values. By specifying the positive semidefinite and symmetric n 3 n-matrix B, we determine whether the genomic variance refers to the current population ðB ¼ I n Þ (see Equation 8), or the base population ðB ¼ R 20:5 PR 20:5 Þ (see Equation 9). Because the randomness of the marker genotypes is not explicitly necessary to derive Equation 13, we can easily express all results in terms of the genomic values g defined in the equivalent model. In the framework of the REM, the marker effects b in model (6) and the genomic values g in model (1) are random variables. Consequently, the additive genomic variance in Equation 13 is also a random variable, and, as such, has to be predicted in an optimal way before finally being estimated [in the same way that the RME b are predicted by the BLUP and then estimated by the eBLUP (Kackar and Harville 1984;Jiang 1999)].
First, we will show that estimators for the unconditional expectation of Equation 13, like GCTA-GREML, are of the form of Equations 11 and 12, and, therefore, do not estimate the additive genomic variance.
Then, we introduce the (frequentist) best predictor for the additive genomic variance s 2 g;B ; and show that this approach maintains the structure of the additive genomic variance (Equation 13)-the genomic equivalent of the additive genetic variance.

The expectation of the additive genomic variance
The expectation of the random variable s 2 g;B in Equation 13 minimizes the quadratic form with respect to all real numbers a, i.e.,ã :¼ E½s 2 g;B is the best approximation of s 2 g;B in the absence of additional information (van der Vaart 2007). The unconditional (or prior) expecta- because of the properties of the trace. For the additive genomic variance in the current population, s 2 g , we choose B ¼ I n in Equation 14 and obtain Analogous quantities for the base population can be obtained by exchanging X by X * . The sample variance s 2 g and the theoretical variance VarðXbjbÞ define the sample and theoretical version of the additive genomic variance.
in model (6) in model (6) or in the equivalent model. Often (VanRaden 2008;Yang et al. 2010Yang et al. , 2011Speed et al. 2012;Vinkhuyzen et al. 2013;Legarra 2016), the matrix R used for the transformation to the base population is assumed to be the GRM G defined in Equation 3. Then, the unconditional expectation V * of the additive genomic variance simplifies to and the variance component s 2 g from the gBLUP-method is considered as the additive genomic variance in the base population. We have shown, however, that s 2 g is only the unconditional expectation of the additive genomic variance.
We recommend caution when using the simplification in Equation 19 because the GRM G is in general singular (because P is singular), and therefore G 21 is not well defined.
We emphasize that only the diagonal elements of the sample variance-covariance matrix (Ŝ X orŜ X* ) of the marker genotypes influence the unconditional expectations V, V * ; and V * s of the additive genomic variance. The model assumptions in the REM dictate the matrix E½bb ⊤ to be diagonal, which leads to a negligence of the off-diagonal elements ofŜ X orŜ X* in Equation 14. The covariances (LD) between the marker genotypes are not included, and V, V * ; and V * s are of the same form as VarðXbÞ in Equation 11 and 1 n trðCovðXb j XÞÞ in Equation 12. This implies that the unconditional expectation of the random additive genomic variance s 2 g;B is structurally not fully in accordance with the (random) additive genomic variance in Equation 13.
Explicit formulae for the estimation of the unconditional expectations V, V * ; and V * s will be given in the Appendix Estimation of the Additive Genomic Variance in the REM.

Best prediction of the additive genomic variance
The unconditional (prior) expectation of s 2 g;B in Equation 14 is strongly influenced by the model assumption on the marginal distribution of the marker effects, and does not use additional information given by the phenotypic values y in model Equations 1 and 6. Estimation procedures for the unconditional expectation [which include e.g., restricted maximum likelihood (REML)], however, make use of the data y to a certain extent.
By contrast, the conditional (posterior) expectation, given the phenotypic values y, makes best use of the information provided by y even before the final estimation is performed.
Generally, the conditional (or posterior) expectation of a random variable Z (in our case Z ¼ s 2 g;B ) given the knowledge of the random vector Y is defined as the "best prediction" (Searle et al. 1992; van der Vaart 2007) of the random variable Z. The best predictor is the unique function g 0 ðYÞ that minimizes the mean square error of prediction The best predictor in Equation 20 is, by definition, an unbiased predictor for the random variable Z and g 0 ðYÞ maximizes the correlation CorðZ; gðYÞÞ, i.e., we can replace the target random variable Z by the best predictor defined in Equation 20 in an optimal way (Searle et al. 1992). Instead of inferring the unobservable target random variable, we conduct inferences on the best predictor. Because the best predictor has realized in a given dataset ðY ¼ yÞ, it is estimable (Searle et al. 1992).
In the following, we introduce a novel approach of considering the frequentist best predictor instead of the unconditional expectation for the random additive genomic variance s 2 g;B in Equation 13. We proceed according to Equation 20, and define for the given phenotypic values y, because X is constant, and, therefore, independent of y. The matrix of conditional second moments of the marker effects b is usually nondiagonal (contrary to E½bb ⊤ ), and can be expressed as using the BLUP m bj y :¼ E½ bj y of the random vector b, and the variance-covariance matrix P bj y :¼ Covðb j yÞ of b given the data y. Then, the best predictor equals BP s 2 where the last equality holds because of the connection of the BLUPs and the conditional variance-covariance matrices in models (1) and (6), see also Appendix Genomic Best Linear Unbiased Prediction.
For the best predictor of the additive genomic variance in the current population we set B ¼ I n in Equation 21, and obtain in model (6) or in the terminology of the equivalent model (1). For the best predictor of the additive genomic variance in the base population we set B ¼ R 20:5 PR 20:5 in Equation 21, and obtain in model (6) or in the terminology of the equivalent model (1). We note that Equations 24 and 25 have structural similarities with the iterative maximum-likelihood equation for the variance component s 2 b when using the expectationmaximization algorithm (Sorensen and Gianola 2002).
We emphasize that the best predictor of the additive genomic variance in the current population (W), as well as in the base population ðW * Þ; includes the contribution of all elements of the sample variance-covariance matrix of marker genotypes (Ŝ X orŜ X* ), and, hence, comprises LD information, contrary to the unconditional expectations V, V * ; and V * s of the additive genomic variance from the previous section.
Explicit formulae for the eBPs of the additive genomic variance, as well as a formula for W * s (approximate approach using the GRM G for transformation to the base population), will be given in the Appendix Estimation of the Additive Genomic Variance in the REM. We compare the use of the unconditional expectation and the best predictor for the prediction of the random additive genomic variance in the REM in Table A.1 and Table A.2 in the Appendix.

Statistical analysis (genomic data)
For an illustration of the theoretical results of the previous sections, we used the mice dataset that comes with the Rpackage BGLR (Pérez and de los Campos 2014). The data originally stem from an experiment by Valdar et al. (2006a,b) in a mouse population. The dataset contains the matrix X with values in f0; 1; 2g of p ¼ 10; 346 polymorphic marker genotypes that were measured in n ¼ 1814 mice. The trait (n-vector y) under consideration was body length (BL). The relationship of the mice is recorded in the n 3 n pedigree matrix R, and is used for the transformation to the base population.
Additionally, we used the publicly available historical wheat dataset that also comes with the R-package "BGLR" (Pérez and de los Campos 2014). The data originally stems from CIMMYT's Global Wheat Program and consists of n ¼ 599 lines of wheat, where the trait under consideration was average grain yield. The phenotypes are divided up into four basic target sets of environments designated as Wheat I, Wheat II, Wheat III, and Wheat IV, of which we considered the only first. The dataset contains the matrix of marker genotypes for p ¼ 1279 markers as well as a relationship matrix.
Moreover, we analyzed a population of n ¼ 1057 fully sequenced Arabidopsis lines for which phenotypes and genotypes are publicly available by the effort of the Arabidopsis 1001 Genomes project (1001 Genomes Consortium 2016). The lines represent natural inbred lines and we examined the same trait, namely flowering time at 10°(FT10), and the same p ¼ 193; 697 SNP-markers that were used in Lehermeier et al. (2017). For these data, no relationship matrix was available.
We conducted all calculations with the free software R (R Development Core Team 2017). For each dataset, we used the gBLUP-method in the equivalent version (computational advantages) implemented in the R-package "sommer" (Covarrubias-Pazaran 2017) to fit a REM. We used REML to obtain estimates (ŝ 2 g andŝ 2 e ) for the variance components. We also obtained estimates of the best predictor of the genomic effects m gjy and the their variance-covariance matrix Sm gjy .
We used this outcome for the estimation of the unconditional expectation V and the BP W of the additive genomic variance in the current population, as well as the unconditional expectation V * and the BP W * for the additive genomic variance in the base population (except for the Arabidopsis dataset, where no relationship matrix was available). Although the GRM is not invertible, we will show in the Appendix Estimation of the Additive Genomic Variance in the REM how to use the GRM for a transformation to the base population, and to calculate the corresponding unconditional expectation V * s and the BP W * s for the additive genomic variance in the base population.
Detailed information about the calculations and the programming code is provided in the Supplemental Material, File S1.

Data availability
The authors affirm that all data necessary for confirming the conclusions of this article are represented fully within the manuscript and the supplemental material that has been uploaded to figshare. File S1 contains a detailed description of the estimation of the genomic variances for the gBLUPmethod as well as the corresponding R-code and its output. Supplemental material available at Figshare: https://doi.org/ 10.25386/genetics.8114117.

Results
In the first section of Table 2, we present the estimation results for the unconditional expectation V and the best predictor W for the additive genomic variance in the current population. In the mice and wheat datasets,V exceedsŴ, whereas for the Arabidopsis data, the eBP is about double the size of the unconditional expectation.
The sample variance of the phenotypic values has been scaled to 1. The sum ofV and the residual variance is larger than the phenotypic variance for the mice and wheat data but smaller for the Arabidopsis data. Technically, it is possible to define the heritability in two ways, namely with respect to the phenotypic variance, and with respect to the sum of the additive genomic variance and the residual variance. The sum of the eBPŴ and the residual variance, however, equals the scaled phenotypic variance of 1 remarkably exactly for all datasets considered.
In the second section of Table 2, we first present the estimation results for the unconditional expectation V * and the best predictor W * for the additive genomic variance in the base population using the given relationship matrices for the transformation. For the mice data,V * andŴ * are similar to their analogsV andŴ in the current population. For the wheat data, however, the estimated unconditional expectation and eBP in the base population are around five times larger than those in the current population, and exceed the sample phenotypic variance in the current population. By this approach, it is not possible to define a heritability in the base population because both the estimate of the residual variance and the phenotypic sample variance refer to the current population.
The estimation results for the unconditional expectation V * s and the best predictor W * s for the additive genomic variance in the base population using the GRM for the transformation differ from those using the given relationship matrices by a considerable amount. In the mice and wheat data, V * s is larger than W * s , whereas, for the Arabidopsis data the eBP exceeds the estimated unconditional expectation. This conforms to the behavior of V and W in the current population.

Discussion
We have shown that commonly used estimators for the additive genomic variance in the REM with genomic marker data are based on the unconditional expectation of the random additive genomic variance. We have introduced a novel best prediction approach for the random additive genomic variance in both the current and the base population. In the following, we discuss several important implications.

Current and base population
Common ways of estimating the additive genomic variance focus on the base population. These approaches are independent of the actual current population, and, consequently, valid even if the generations change.
If one aims at the response of a population to selection, however, it might be more meaningful to estimate the additive genomic variance in the actual given population. This implies that the estimation of the genomic variance has to be conducted again when the individuals change. A formal definition of the heritability is best possible in the current population, where the phenotypic and residual variance are estimable.
We have preferred to use given relationship matrices for the transformation of the genomic values to the base population. In the case that such a matrix is not available, we have shown how to use genomic relationship matrices for the We also present the corresponding heritabilities with respect to the sample variance of the phenotypic values and with respect to the sum of the additive genomic and residual variance s 2 e : In addition, we depict the estimation results for the unconditional expectation V * (V * s when using the GRM for the transformation) and the best predictor W * (W * s when using the GRM for the transformation) for the additive genomic variance in the base population. a Heritability with respect to phenotypic sample varianceŝ 2 y ; which has been scaled to 1. b Alternative definition of the phenotypic variance that depends on the estimate of the genomic variance. c Alternative definition of the heritability that depends on the alternative definition of the phenotypic variance.
transformation, although a formal inversion of GRMs is, in general, not possible.
In Table 2, we illustrate that we can decompose the sample phenotypic variance into the sum of the eBP of the additive genomic variance in the current population and into the estimated residual variance. This is due to the orthogonal projection property of the conditional expectation, which gives the best approximation of the random additive genomic variance. This enables a unique definition of the heritability in the current population. It is never possible, however, to transfer the residual variance to the base population. Consequently, a definition of the heritability in the base population is not straightforward.

The gBLUP-method and the Bayesian approach
The frequentist gBLUP-method can also be set up in the context of Bayesian regression models (with prior distribution for the effect vector as defined in Equation 5, and uninformative priors for the variance components). Lehermeier et al. (2017) considered the additive genomic variance s 2 g in the current population (see Equation 8), and used BRR to estimate in the current population. M 2 does not describe the genomic variance in the base population, and should not directly be compared with approaches introduced e.g., in Yang et al. (2010Yang et al. ( , 2011. Analogously to the best predictor W * (see Equation 24), for the genomic variance in the base population, one can consider as the posterior mean of the genomic variance in the base population in Bayesian regression models. The frequentist gBLUP-method provides a more formal approach to the prediction of the random additive genomic variance in linear models with random effects than the Bayesian approach. It enables the derivation of explicit formulas for the predictors (unconditional expectation and best predictor) of the random additive genomic variance using the standard output of the gBLUP-method, which goes hand-in-hand with a fast implementation of the empirical version of the predictors. The connection between the BLUP m bj y and its covariance for the RME with the additive genomic variance are clearly visible. This enables us, for instance, to derive the decomposition of the best predictor of the random additive genomic variance into the unconditional expectation and a function for the marker LD in the following section.

Influence of LD
In the section Definitions of the genomic variance, we saw that the (random) additive genomic variance equals in the current population, and in the base population. We emphasize that the variance-covariance matrix of the marker genotypes (marker LD) plays a decisive part in the determination of the additive genomic variances in both the current and the base population. The variances s 2 g and s 2 g * are structurally in accordance with the classical additive genetic variance (Bulmer 1971;Falconer and Mackay 1996), which is caused by the genotypes, whereas the genotypic effects are fixed.
In the REM, however, the marker effects are random, with unconditional expectation 0 and unconditional diagonal variance-covariance matrix with equal variances s 2 b . As a consequence, the unconditional expectation of the additive genomic variance in the base population contain only the variances of the marker genotypes in the corresponding population. In addition, the unconditional expectation resembles both the variance of a randomly sampled trait for a randomly sampled individual and the variance of a randomly sampled trait for individual with fixed genotypes, see Table 1 for an overview. We show in the Appendix Estimation of the Additive Genomic Variance in the REM that we can partition the best predictor of the random additive genomic variance s 2 g;B in the following way: in the current population, and Lð yÞ ¼ in the base population ðS m bj y ¼ Covðm bj y ÞÞ.
The best predictor, therefore, consists of the unconditional expectation of the additive genomic variance (no contribution of LD) and a function that explicitly contains the weighted contribution of marker LD. This function determines whether estimators like GCTA-GREML (unconditional expectation of the random genomic variance in the base population) are biased upwards or downward, i.e., it determines the direction and the magnitude of the bias of GCTA-GREML (this method is based on the assumption that the function L constantly equals 0). In addition, we notice that this bias depends not only on the sign of the covariance between the marker genotypes, but both on the sign and the magnitude of the weighted covariances.
When best (conditional) estimators are considered, the inclusion of LD is intrinsic; in that case, discussions about whether or not LD should be included or even optimally included become immaterial.
We emphasize that, contrary to the unconditional expectation, the best predictor maintains the structure of the additive genomic variance s 2 g and s 2 g * , because the function L can be decomposed into the weighted sample variances and covariances of the marker genotypes. Instead of the marker effects, the components of the matrix m bj y m ⊤ bj y 2 S m bj y , which is typically nonzero and nondiagonal, take the part of the weighting factors of the elements ofŜ X andŜ X* : The best predictor maintains the structure of the additive genomic variance in both the current ðs 2 g Þ and the base population ðs 2 g * Þ; and, thus, conforms to the classical genetic variance (Bulmer 1971;Falconer and Mackay 1996).
The difference between the estimators V and W (V * and W * ;V * s and W * s ) is given by the estimated Lð yÞ; and can be obtained from Table 2. We notice that the weighted contribution of marker LD is large and positive in the case of the Arabidopsis data, whereas in the mice and wheat data, the weighted contribution of marker LD is slightly negative.
To sum up, the application of the unconditional expectation of the additive genomic variance combined with the model assumptions on the marker effects in random effect models cause, at least partially, the missing contribution of LD to the estimated additive genomic variance. This goes hand-in-hand with the critique expressed in Krishna Kumar et al. (2016a,b). It is, however, less important when estimating the additive genomic variance in the base population, where the individuals are uncorrelated and less LD persists (although the marker genotypes need not be uncorrelated).
The best prediction approach eliminates the problem of the missing contribution of LD to the additive genomic variance that is caused by mathematical modeling (e.g., the assumptions in the random-effects model).

Concluding remarks
The variability in the marker genotypes causes the variability in the genomic values in a population of individuals. Hence, the additive genomic variance is induced by the variance of the marker genotypes. In the application of the random-effects model, the marker effects are assumed to be random, which introduces another source of variability, namely a statistical one, in the genomic values.
We have shown that commonly used estimators use the unconditional expectation to handle this statistical randomness. However, we recommend the use of the best prediction approach (conditional expectation) that uses the additional information given by the phenotypic data in an optimal way, minimizes the mean square error of prediction, includes the contribution of LD, and maintains the structure of the genomic equivalent of the classical additive genetic variance.

Genomic Best Linear Unbiased Prediction
In the REM ðb j $ N ð0; s 2 b ÞÞ for the model we have that The marker effect vector b cannot be estimated because it is a random variable. Henderson (1984) introduced the concept of the prediction of b, which refers to the estimation of the realized values of the random effects. The best linear unbiased predictor (BLUP) m bj y for b is given by m bj y ¼ E½b j y (Henderson 1984;Searle et al. 1992). The conditional expectation is the unique best predictor, i.e. it is unbiased and has minimal mean square error of prediction see e.g. Kotz et al. (2000). Consequently, the BLUP equals and is linear in y. The conditional variance-covariance matrix of b equals and the variance-covariance matrix of the BLUP m bj y equals S m bjy :¼ Covðm bj y Þ ¼ CovðE½b j yÞ ¼ CovðbÞ 2 E½Covðb j yÞ The actual estimation of the parameters in model (6) with the BLUP-method is a two-stage procedure (Das et al. 2004). In the first stage, a BLUE for the fixed quantities and a BLUP for the random variables are derived. However, they involve the variance components s 2 b and s 2 e as unknown parameters. In a second stage, these parameters are replaced by estimates, and the estimators for the BLUE and the BLUP are referred to as empirical BLUE (eBLUE) and empirical BLUP (eBLUP) (see Kackar and Harville 1984;Jiang 1999). Investigations on the properties of the eBLUE and the eBLUP are very complex (Searle et al. 1992), and often only approximate results are obtained (Kackar and Harville 1984;Jiang 1999;Das et al. 2004).
Assume that we are provided with estimators for the variance components using, e.g., REML (Patterson and Thompson 1971;Corbeil and Searle 1976;Searle et al. 1992). These estimated variance components are nonlinear functions of the data y, and, consequently, the eBLUEm for the intercept and the eBLUP for the marker effects b are not even linear in the data y anymore (despite their naming). The unbiasedness of the eBLUE and the eBLUP can be asserted if the estimated variance componentsŝ 2 b andŝ 2 e are non-negative, even functions in y, translation-invariant, and if the expectations of the eBLUE and eBLUP are finite (Kackar and Harville 1984). When using REML estimates for the variance components, these requirements are satisfied and the eBLUEm and the eBLUPm bj y are bias-free estimators for m and b (Jiang 1999).
Conditional on the estimation of the variance components (ignoring the randomness in the second stage of the estimation of the eBLUP), the variance-covariance matrix of the eBLUPm bj y equals because Cov y 2m1 n j s 2 b ¼ŝ 2 b ; s 2 e ¼ŝ 2 We can transfer the results derived in model (6) to model by using their equivalence in distribution. The gBLUP for g equals PXs ðy 2 m1 n Þ: The conditional variance-covariance matrix of g is obtained as PXðs Gs 2 g ; (34) as well as the variance-covariance matrix of the BLUP m gjy S m gjy :¼ Covðm gj y Þ ¼ PXCovðm bj y ÞX ⊤ P ¼ ð30Þ PXs 2 b X ⊤ PSPXs 2 b X ⊤ P ¼ ð3Þ s 2 g GðGs 2 g þ s 2 e I n Þ 21 Gs 2 g : The variance-covariance matrix of the eBLUP equals s 2 g GSGs 2 g 2 s 2 g GS1 n 1 ⊤ nS Gs 2 g 1 ⊤ nS 1 n :

Theoretical Variances of the Genomic Values in the REM
We review three different definitions of the theoretical variance of the genomic values in the REM (marker genotypes random, marker effects random, or both random). We focus the following analysis on the linear model (6) because of the explicit separation of marker genotypes and marker effects. For simplicity, we focus on the genomic variance in the current population. The results for the base population are obtained by replacing the data-generating process X with X * .

Random genotypes and random effects
If the marker genotypes as well as the marker effects are the source of genomic variation, we calculate the variance of the genomic value according to the law of total variance as: The unconditional expectation and the variance operator in the second line apply to the random marker effect vector b. Because of the model assumptions on the marker effects in Equation 5, and the mean-centered marker genotypes ðE½X ¼ 0Þ, we obtain with the interpretation as the variance of a randomly sampled (representative) individual for a trait with random effects.

Fixed genotypes and random effects
If the genomic variation is caused by the marker effects only, and the marker genotypes are fixed, then the n-vector of genomic values is normally distributed: In order to obtain an average theoretical variance of the individuals in the sample, we calculate the mean trace of the variance-covariance matrix of the genomic values: This approximately equals the variance of a randomly sampled individual for a randomly sampled trait (see Equation 37). Even when the marker genotypes are fixed, their sample variance-covariance matrix contributes to the theoretical variance of the genomic values when averaging over the individuals in the sample.

Random genotypes and fixed effects
Probably the most common assumption on the nature of the genome is that the marker genotypes are random, whereas the marker effects are fixed (Falconer and Mackay 1996). In order to translate these assumption to the variance of the genomic values in the REM, we have to condition on the marker effects (i.e. we fix them). Then, the theoretical variance of the genomic values of a individual with random marker genotypes (representative individual), and with fixed marker effects, equals and describes the genomic equivalent of the definition of the additive genetic variance (Bulmer 1971;Falconer and Mackay 1996).

Estimation of the Additive Genomic Variance in the REM
In the sections The expectation of the additive genomic variance and Best prediction of the additive genomic variance, we have introduced ways to predict the random additive genomic variance s 2 g;B :¼ in the REM, namely by using the unconditional expectation and the best predictor BP s 2 In the following, we introduce estimators for these quantities and investigate their properties.

Estimation of the unconditional expectation
For any given positive semidefinite matrix B, the unconditional expectation 1 n 2 1 s 2 b trðX ⊤ PBPXÞ in model (6) can be estimated by 1 n 2 1ŝ 2 b trðX ⊤ PBPXÞ after having obtained an estimateŝ 2 b of the variance component s 2 b . In the equivalent model (1), we can estimate 1 n 2 1 s 2 g trðBGÞ by using 1 n 2 1ŝ 2 g trðBGÞ for any positive semidefinite matrix B.
The specification of B as in the section The expectation of the additive genomic variance leads to the explicit form of the estimatorsV

Empirical best prediction
Because of equalities (Equations 29 and 30) and the variancecovariance matrix Consequently, the best predictor of s 2 g;B defined in Equation 21 equals BP s 2 where Lð yÞ :¼ 1 n 2 1 trðX ⊤ PBPX½m bj y m ⊤ bj y 2 S m bjy Þ ¼ 1 n 2 1 trðB½m gj y m ⊤ gj y 2 S m gjy Þ: We have partitioned the best predictor of s 2 g;B into the unconditional expectation of s 2 g;B and the random variable L, which is realized in the phenotypic data y. The random variable L specifies the adaption of the best predictor to the data and incorporates the contribution of (marker) LD. The expectation of L over all possible data y is 0 because E h m bj y m ⊤ bj y 2 S m bjy i ¼ Covðm bj y ÞþE½m bj y E½m bj y ⊤ 2S m bjy ¼ 0: The sign of the realization of L determines whether the best predictor is larger (positive weighted LD) or smaller (negative weighted LD) than the unconditional expectation.
The task of finding an eBP for s 2 g;B is reduced to estimating the realized values of L because of the connection derived in Equation 42. We replace the BLUP and their variance-covariance matrix in Equation 43 by the eBLUP and their estimated variancecovariance matrix: We assume that we are provided with REML-estimatorsŝ 2 b andŝ 2 e for the variance components. Then, we find because the eBLUP is unbiased for b, i.e. E½m bj y ¼ E½b ¼ 0 (Jiang 1999). Unfortunately, the unbiasedness of the estimated variance-covariance matrix of the eBLUP can be asserted only in a trivial way by conditioning on the estimated variance components: for the additive genomic variance s 2 g;B . The specification of B as in the section Best prediction of the additive genomic variance leads to the explicit form W :¼ eBPðs 2 g Þ ¼V þ trðŜ X ½m bj ym ⊤ bj y 2Ŝm bj y Þ ¼V þ 1 n 2 1 trð½m gj ym ⊤ gj y 2Ŝm gj y Þ of the eBP for the additive genomic variance in the current population, and to the eBP W * :¼ eBPðs 2 g * Þ ¼V * þ trðŜ X* ½m bj ym ⊤ bj y 2Ŝm bjy Þ ¼V * þ 1 n 2 1 trðPR 20:5 ½m gj ym ⊤ gj y 2Ŝm gj y R 20:5 Þ for the additive genomic variance in the base population. Using the GRM G for a transformation to the base population is not well-defined because G is singular. However, becauseV * s (see Equation 41), is commonly used, we want to find an analogous formula for the eBP in this set-up.
Instead of calculating G 20:5m gj y ¼ G 20:5ŝ2 g GðGŝ 2 g þŝ 2 e I n Þ 21 ð y 2m1 n Þ and G 20:5Ŝm gj y ¼ G 20:5ŝ2 g G b SGŝ 2 g 2ŝ 2 g G b S1 n 1 ⊤ n b SGŝ 2 g 1 ⊤ n b S1 n 2 4 3 5 G 20:5 ; we usem * gj y :¼ŝ 2 g G 0:5 ðGŝ 2 g þŝ 2 e I n Þ 21 ð y 2m1 n Þ and^S * m gjy :¼ŝ 2 g G 0:5 b SG 0:5ŝ2 g 2ŝ 2 g G 0:5 b S1 n 1 ⊤ as an approximation of the eBP of the additive genomic variance in the base population when using the GRM for the transformation. Overview of prediction approaches for the random additive genomic variance in the random-effects model with the gBLUP-method s 2 g;B ¼ 1 n 2 1 b ⊤ X ⊤ PBPXb

Unconditional expectation Best prediction
General Formula E½s 2 g;B ¼ 1 n 2 1 s 2 b trðX ⊤ PBPXÞ BPðs 2 g;B Þ ¼ 1 n 2 1 trðX ⊤ PBPX½m bj y m ⊤ bj y þ S bj y Þ Current population V ¼ s 2 b trðŜ X Þ W ¼ trðŜ X ½m bj y m ⊤ bj y þ S bj y Þ Base population V * ¼ s 2 b trðŜ X* Þ W * ¼ trðŜ X* ½m bj y m ⊤ bj y þ S bj y Þ Features • Best approximation of the additive genomic variance in the absence of information • Best approximation of the additive genomic variance using additional information given by phenotypic values (optimal adaptation to the data by application of the conditional or posterior expectation) • No inclusion of LD • Explicit inclusion of LD • Orthogonal decomposition of the phenotypic variance in the current population (unique definition of the heritability) • Genomic equivalent of the additive genetic variance X, matrix of marker genotypes; P, matrix for column-wise mean-centering; B, positive semidefinite matrix; s 2 b ; variance component of the marker effects b; m bjy ; BLUP of b; S bjy ; conditional covariance matrix of b given the phenotypic data y;Ŝ X ; sample variance-covariance matrix of the marker genotypes in the current population;Ŝ X* ; sample variance-covariance matrix of the marker genotypes in the base population. Table A.2 Overview of prediction approaches for the random additive genomic variance in the random-effects model with the gBLUP-method in the equivalent version of the linear model s 2 g;B ¼ 1 n 2 1 g ⊤ Bg

Unconditional expectation Best prediction
General formula E½s 2 g;B ¼ 1 n 2 1 s 2 g trðBGÞ BPðs 2 g;B Þ ¼ 1 n 2 1 trðB½m gjy m ⊤ gj y þ S gj y Þ Current population V ¼ 1 n 2 1 s 2 g trðGÞ W ¼ 1 n 2 1 trð½m gj y m ⊤ gj y þ S gj y Þ Base population V * ¼ 1 n 2 1 s 2 g trðPR 20:5 GR 20:5 Þ W * ¼ 1 n 2 1 trðPR 20:5 ½m gj y m ⊤ gj y þ S gj y R 20:5 Þ Features • Best approximation of the additive genomic variance in the absence of information • Best approximation of the additive genomic variance using additional information given by phenotypic values (optimal adaptation to the data by application of the conditional or posterior expectation)

• No inclusion of LD
• Explicit inclusion of marker LD • Transformation with GRM: s 2 g replaces V * • Orthogonal decomposition of the phenotypic variance in the current population (unique definition of the heritability) • Genomic equivalent of the additive genetic variance G, genomic relationship matrix; P, matrix for column-wise mean-centering; B, positive semidefinite matrix; s 2 g ; variance component of the genomic values g; m gjy ; the BLUP of g; S gjy ; conditional covariance matrix of g given the phenotypic data y; R, relationship matrix.