Quantifying the impact of between-study heterogeneity in multivariate meta-analyses

Measures that quantify the impact of heterogeneity in univariate meta-analysis, including the very popular I2 statistic, are now well established. Multivariate meta-analysis, where studies provide multiple outcomes that are pooled in a single analysis, is also becoming more commonly used. The question of how to quantify heterogeneity in the multivariate setting is therefore raised. It is the univariate R2 statistic, the ratio of the variance of the estimated treatment effect under the random and fixed effects models, that generalises most naturally, so this statistic provides our basis. This statistic is then used to derive a multivariate analogue of I2, which we call . We also provide a multivariate H2 statistic, the ratio of a generalisation of Cochran's heterogeneity statistic and its associated degrees of freedom, with an accompanying generalisation of the usual I2 statistic, . Our proposed heterogeneity statistics can be used alongside all the usual estimates and inferential procedures used in multivariate meta-analysis. We apply our methods to some real datasets and show how our statistics are equally appropriate in the context of multivariate meta-regression, where study level covariate effects are included in the model. Our heterogeneity statistics may be used when applying any procedure for fitting the multivariate random effects model. Copyright © 2012 John Wiley & Sons, Ltd.


Introduction
Meta-analysis, the statistical process of pooling the results from separate studies concerned with the same treatment or issue, is a well-established tool in medical statistics. Meta-analysis does, however, present both computational and conceptual difficulties associated with between-study heterogeneity. This additional source of variation is usually modelled using the random effects model [1][2][3]. Here, the between-study variance allows for any apparent over-dispersion of studies' results.
If the between-study variance is assumed to be zero, then the model is conventionally referred to as the fixed effects model. This model simplifies the resulting interpretations, and eases computation, but the assumption of no between-study variation seems generally implausible, unless it is known that the studies are performed in the same way and involve individuals sampled from the same population.
Tests for the presence of heterogeneity exist but have low power [4], and their use to choose between fixed and random effects models is generally discouraged [4][5][6]. Statistics that quantify the impact of heterogeneity have been proposed as an alternative to this testing [5], and I 2 , which describes the proportion of variability in point estimates that is due to heterogeneity rather than within-study sampling error, is now almost always provided in addition to the results from the standard heterogeneity test. In addition to I 2 , Higgins and Thompson [5] suggested two further heterogeneity statistics, H 2 and R 2 ; we describe all three of these heterogeneity statistics in Section 4. The very popular I 2 has, however, recently received criticism from Rücker et al. [7] who show how this quantity depends on the size of the studies. Although this dependence is clearly explained by Higgins and Thompson [5], this has resulted in some questioning its use. Our position is that the heterogeneity statistics are useful descriptive statistics when used in conjunction with the estimate of the between-study variance and all the other usual inferential statistics, such as the pooled effect.
More recently, multivariate meta-analysis [8][9][10][11] has become established. Here, multiple study outcomes are combined in a single multivariate analysis, to account for their correlation. For example, diagnostic test studies provide estimates of sensitivity and specificity, which are usually negatively correlated between studies. The multivariate methods are generalisations of their more commonly used univariate counterparts and possess many advantages, but they also have their limitations [10]. The most commonly referred to advantage of the multivariate approach is the 'borrowing of strength' that can occur as a result of the utilisation of correlation. This applies to both the pooled estimates and the between-study variance estimates. For example, it has been shown that the multivariate model gives a smaller mean-square error and, on average, standard error of the pooled estimates than the univariate method [12]. Now that multivariate meta-analysis has arrived, and the importance of univariate measures for quantifying the impact of heterogeneity is well understood, an obvious missing component is the development of appropriate multivariate measures of heterogeneity. Although there has been methodological development in the form of White's I 2 statistics [13], as described in Section 5.4, the intention here is to develop some multivariate heterogeneity statistics that are either generalisations or analogues of the established univariate statistics.
We follow Higgins and Thompson [5] by conceptualising the heterogeneity statistics as quantifying the impact of between-study heterogeneity. By this, we refer to the impact of both the between-study variances and correlations, that is, the entire between-study covariance matrix. Testing the null hypothesis that there is no between-study variation, and estimating the magnitude of this, are related procedures that address different statistical questions. We focus on the impact that the heterogeneity has on the precision of the estimated treatment effect, by comparing the precision of estimates from a random effects meta-analysis to those from a fixed effects analysis. For example, if the random effects model provides pooled estimates with similar precision to those from a fixed effects meta-analysis, then the heterogeneity is considered to have little impact.
Although we focus on the relative precision of estimates, the random effects model gives smaller studies greater weight so that the heterogeneity can also impact directly on the point estimate of treatment effect if small studies provide estimates that differ to those from larger ones. If this is the case, then a special investigation is required because the various heterogeneity measures do not attempt to quantify the impact of small study effects or publication bias. This type of issue is exacerbated in the multivariate setting because, in addition to these possibilities, the borrowing of strength may also depend on the amount of heterogeneity. We therefore might anticipate that the multivariate methods provide greater capacity for random and fixed effects analyses to provide notably different point estimates. Here, we do not attempt to quantify the impact of heterogeneity on the location of the point estimate of treatment effect, or the amount of borrowing of strength afforded by multivariate rather than univariate analyses, but these are also important issues and may form the subject of future work.
The unfashionable (because I 2 has become so popular) R 2 statistic, the ratio of the variances of the pooled treatment effect under the random and fixed effects models, is the most natural to extend multivariately. We begin with this as our basis and define an R statistic; univariately, R is the square root of the established R 2 statistic. We then apply this to define a multivariate I 2 statistic, I 2 R . Our I 2 R statistic describes the proportion of the variation of the pooled vector of estimates under the random effects model, which is due to between-study variation. We also provide a multivariate H 2 statistic, the ratio of a generalisation of Cochran's heterogeneity statistic and its associated degrees of freedom, and an accompanying generalisation of the usual I 2 statistic I 2 H . The R statistic is based on the covariance matrix of the estimated treatment effects, and H 2 is based on the residuals from a fixed effects model fit. Hence, they can also be used in the context of multivariate meta-regression [14], where covariate effects are included, and for any procedure for fitting the random effects model.
We set out the rest of the paper as follows. We briefly present the univariate and multivariate models in Section 2 and apply these to our sample datasets in Section 3. We present the univariate heterogeneity statistics in Section 4. In Section 5, we derive our multivariate measures; and in Section 6, we apply our proposed measures, and their univariate counterparts, to our examples. We explain how our measures of heterogeneity may be used in the context of multivariate meta-regression in Section 7 and conclude with a discussion in Section 8.

Univariate and multivariate random effects meta-analysis
We present the multivariate random effects meta-analysis model and explain how this reduces to the usual univariate model in a single dimension. We denote the vector of outcomes (or estimates) for the ith study as Y i . For example, Y i may be a vector containing the log hazard ratios of overall and disease-free survival.
The entries of Y i may be correlated, and it is assumed that where N denotes a multivariate normal distribution, i is the true underlying effect for the ith study and S i is the covariance matrix of Y i . The matrices S i are referred to as the within-study covariance matrices; their entries are estimated in each study in practice but regarded fixed and known when pooling the studies' results. Estimating the within-study covariances or correlations, to provide the off diagonal entries of the S i , is often difficult in practice, but a variety of approaches are possible [10].
The multivariate random effects model allows for the possibility that the i may vary from one study to the next and further assumes that is the (overall) treatment effect vector and † is the between-study covariance matrix. Marginally, this provides the conventional multivariate random effects meta-analysis model where the Y i are further assumed to be independent. If all entries of † are constrained to zero, then the model reduces to a fixed effects model. The conventional univariate random effects model is simply the marginal distribution of the first (say) study outcome. In one dimension, and written in the more usual univariate notation, this means that each study provides a univariate Y i N. ; 2 i C 2 /. If all within-study and between-study correlations are assumed to be zero, then the multivariate random effects model is the collection of the univariate random effects models for each of the study outcomes.
The standard procedure for making inferences about the treatment effect approximates the true between-study covariance with O † [10]. After performing this estimation, the pooled (maximum likelihood) estimates are given by where n is the number of studies, and these estimates are approximately normally distributed with covariance matrix Alternatively, the covariance matrix can be obtained from the observed Fisher information matrix, and Stata's mvmeta [13] uses this method as its default. Equations (2) and (3) require an estimated betweenstudy covariance matrix, and a variety of estimates are available [10]. A fixed effects model is fitted by constraining all entries of O † to zero in (2) and (3). If some studies have missing outcomes, then, assuming that these are missing at random, such studies can be incorporated into these matrix solutions by allocating notional estimates with very large within-study variances and corresponding within-study correlations of zero, or better by modifying these equations to use the marginal model from (1) for the observed data. If inferences for particular subsets of outcomes are required, then these are obtained from the corresponding marginal distributions from (2) and (3). In one dimension, this reduces to the usual univariate formulae, that is, (2) and (3)

Examples
In this section, we apply the methods described in Section 2 to some contrasting examples and informally assess the impact of the between-study heterogeneity on the precision of the pooled estimates. We used the Stata program mvmeta with its defaults throughout. Hence, we adopted the restricted maximum likelihood estimation of the between-study covariance matrix and used the entire observed Fisher information matrix (including the variance components) to compute the covariance matrix of the treatment effect parameters. White [13] described in detail the alternative estimation methods, but we used the defaults here because these acknowledge the uncertainty in the estimation of the between-study covariance matrix and because restricted maximum likelihood is well established in multivariate meta-analysis. Hence, the uncertainty in the between-study covariance matrix is reflected in the results that follow, but we do not wish to imply that this is fully taken into account.

Example 1: the periodontal data
The periodontal data from Berkey et al. [15,16] involve five studies providing the mean difference between surgical and non-surgical procedures for treating periodontal disease, with improvement in probing depth and improvement in attachment level as the two endpoints of interest (measured in mm one year after treatment). We show the data in Table I, and the within-study covariances are known. The within-study correlations are positive as one might expect because both outcomes are associated with positive patient outcomes. We show the results from the univariate and multivariate random effects meta-analyses in Table II. The univariate and multivariate analyses are in good agreement and indicate that the surgical procedure improves probing depth by about 0.35 mm more than the non-surgical procedure, but that the non-surgical procedure improves attachment level by a similar amount than the surgical procedure. There is, however, a large amount of between-study variation whose impact is clear from the covariance matrix of the estimated treatment effects from random and fixed effects multivariate meta-analyses. These covariance matrices are, using mvmeta's defaults, respectively. Because the within-study covariance matrices are regarded as known, C F is treated as a constant, and is given by (3) with all entries of O † set to zero. From an inspection of the relative magnitudes of the diagonal entries of these two covariance matrices, it appears that heterogeneity has a greater Table I. Periodontal data, providing the mean difference between surgical and non-surgical procedures for treating periodontal disease, with improvement in probing depth and improvement in attachment level as the two end points of interest (measured in mm, one year after treatment). impact on the second outcome than on the first. Because the univariate and multivariate analyses are in such good agreement, it might be anticipated that the conventional univariate heterogeneity statistics (Section 4) will give a good indication of the impact of heterogeneity on the marginal inferences for both outcomes. The best way to quantify the impact of heterogeneity on the joint inferences of both outcomes, when using the multivariate model, is less clear however.

Example 2: the sleep data
Our second example is the sleep data of McDaid et al. [17]. This is another bivariate example, and the full dataset is available from the authors on request. Here, we have 26 studies providing the mean effect of treatment for obstructive sleep apnoea/hypopnoea syndrome in terms of change in Epworth Sleepiness Scale (Y 1 ) and change in systolic blood pressure (Y 2 ). Twenty three studies give information on Y 1 , but only 10 studies give information on Y 2 , so there is a considerable scope for borrowing strength for 2 in a multivariate meta-analysis. The within-study correlations, and hence the off diagonal entries of the within-study covariance matrices, are unknown but Y 1 and Y 2 may be positively correlated because both a lack of sleep and high blood pressure may be associated with elevated stress levels. Here, we perform an illustrative multivariate analysis assuming all within-study correlations are 0.4, which provide modest correlations within studies; other values could also be explored in a sensitivity analysis, and other options for dealing with the unknown within-study correlations are possible [10]. We show the results from the univariate and multivariate random effects meta-analyses in Table III. Here, the estimated between-study correlation is one so that the estimated random effects lie at the edge of their parameter space, which has consequences for the estimation [12].
The results in Table III suggest that the treatment for obstructive sleep apnoea/hypopnoea syndrome is effective for both outcomes. Again, there is considerable heterogeneity whose impact is shown by the covariance matrix of the estimated treatment effects from random and fixed effects multivariate meta-analyses The univariate and multivariate estimates are very different, particularly for the second outcome (Table III). In particular, the estimate of the marginal between-study variance of the second outcome is sensitive to the choice between univariate or multivariate meta-analyses. Thus, the conventional univariate heterogeneity statistics cannot be expected to give a good indication of the impact of heterogeneity on the marginal inferences for the second outcome when using the multivariate model, or the joint inferences for both outcomes.

Example 3: the prognostic value of MYCN and chromosome 1p
This is similar to the example used by Riley [11], but here, we include 73 observational studies that examine two effects: overall and disease-free survival. The full dataset is available from the authors upon request. These studies assess the prognostic value of up to two factors, MYCN and chromosome 1p, in patients with neuroblastoma and were also used as an example by Jackson et al. [10]. Patients either have a 'high' or 'low' level of MYCN and either chromosome 1p presence or deletion. Studies provide up to four estimates of effect, each of which is an estimated unadjusted log hazard ratio of survival, either of the high relative to the low level group of MYCN, or chromosome 1p deletion to its presence. The within-study correlations are unknown to the authors but are taken here in an illustrative analysis as 0.7. The log hazard ratios are highly likely to be positively correlated within and between studies across all four outcomes, because MYCN and chromosome 1p are often highly correlated in a patient, whereas overall and disease-free survival are inherently correlated by their definitions. We used the variables Y 1 Table III. Parameter estimates for the sleep data in example 2 using the random effects model.  The parameters 1 and 2 denote the average log hazard ratios for disease-free survival for high to low MYCN and the deletion to the presence of chromosome 1p, respectively. Parameters 3 and 4 denote these same hazard ratios for overall survival. We show the standard errors of the treatment effect parameters in parentheses.
and Y 2 to denote the log hazard ratio for disease-free survival for high to low MYCN and the deletion to the presence of chromosome 1p markers, respectively; Y 3 and Y 4 denote the corresponding overall survival log hazard ratios. Thirty four, 8, 50 and 10 studies report Y 1 to Y 4 , respectively. The average log hazard ratio estimates in Table IV are significantly greater than zero; and hence, chromosome 1p and MYCN have a prognostic value for both disease-free and overall survival. The heterogeneity also has notable impact for this example and The estimates of the marginal between-study variances are quite sensitive to the choice between univariate and multivariate meta-analyses. For an example such as this, in a relatively high dimension and where there is much missing data and borrowing of strength occurs [10], there is little to provide reassurance that the conventional univariate heterogeneity statistics will adequately quantify the impact of heterogeneity in the multivariate analysis. Furthermore, particular subsets of the treatment effects are jointly of interest, such as those relating to the two types of survival and the two markers separately. Methods for quantifying the impact of heterogeneity for more than a single outcome are particularly valuable here, and something more sophisticated than the established univariate statistics is required.

Conventional univariate measures of heterogeneity
Higgins and Thompson [5] originally defined three univariate measures of the impact of heterogeneity, R 2 , H 2 and I 2 , which we will extend multivariately. Variations have, however, subsequently been suggested [18]. We use the univariate notation in this section, Y i N i ; 2 i C 2 . We use O 2 to denote DerSimonian and Laird's [1] estimate of the between-study variance and Q to denote Cochran's heterogeneity statistic [1,18]. Hence, we have where V R and V F are the length of the confidence intervals for the treatment effect that arise from the random and fixed effects model, respectively, assuming that standard normal quantiles are used to construct both intervals; t distribution quantiles are sometimes suggested when using random effects models in meta-analysis [14]. Higgins and Thompson define R in terms of the treatment effect's standard errors under the random and fixed effects models, but it is the relative length of the confidence intervals that generalises multivariately. We use the notation V R and V F because multivariately, these quantities become generalised notions of the volumes of confidence regions that arise from the two models. This involves a slight clash of notation with Higgins and Thompson who use v R and v F to denote the variance of the estimates of treatment effect. Simulation-based [19] and analytical [18] investigations of the univariate measures of heterogeneity have been performed.

Multivariate measures of heterogeneity
From a comparison of the O C R and C F obtained from multivariate meta-analyses for our three examples, and our interpretation of the impact of heterogeneity as referring to the relative precision of estimates resulting from random and fixed effects multivariate meta-analyses, it is clear that the heterogeneity has a considerable impact for all three examples. The aim is to quantify this impact.

Multivariate R statistic
The univariate R statistic is perhaps the univariate heterogeneity statistic that is most naturally extended to achieve this aim. Let p denote the number of treatment effect parameters that the heterogeneity statistic applies to, which for now we assume is the dimension of the meta-analysis, and we also assume that standard normal quantiles are used to construct both random and fixed effects confidence regions. As alluded to earlier, we denote the volumes of the confidence regions for all p outcomes in that arise from the random and fixed effects models as V R and V F , respectively, and we define This R statistic reduces to the usual univariate meta-analysis measure when p D 1. In one dimension, V R and V F are the lengths of the random and fixed effects confidence intervals; and in two and three dimensions, they are areas and volumes, respectively. In four or more dimensions, V R and V F are generalised notions of volumes of the random and fixed effects confidence regions. We calculate R as shown in equation (4) irrespective of the method used to estimate the between-study covariance matrix, the form that this takes, or the method used to obtain O C R once the between-study covariance has been estimated. One interpretation of R is as the ratio of the geometric means of the standard errors of the random and fixed effects estimates of treatment effect that result when their normal approximations are reparameterised in terms of the rotated co-ordinate system where the principal axes are used. In linear algebra, this is referred to as writing the normal approximation's associated quadratic form in its standard position. R is therefore an average ratio of the lengths of random and fixed effects confidence intervals across all outcomes.
We show in Appendix A that (4) can be conveniently written as where jC a j is the determinant of C a . The matrices O C R and C F are obtained when fitting the random and fixed effects models, respectively, so that the measure (5) is easy to obtain from the standard output from statistical software. Equation (5) provides further interpretation of our R statistic because the determinant of a covariance matrix is referred to as the generalised variance. This is considered to be a good scalar dispersion statistic for multivariate data. Equation (5) shows that R is a function of the ratio of the generalised variances of the estimated treatment effect under the random and fixed effects models.

Multivariate I 2
R statistic It is the univariate I 2 statistic that is most commonly used in practice, so a multivariate version of this can be expected to be especially valuable. Higgins and Thompson [5] provide the univariate relationship I 2 D .H 2 1/=H 2 (their Equation (10)) and show that H 2 and R 2 measure similar quantities. This suggests the definition The quantity j O C R j 1=p is the square of the geometric mean of the standard errors of the estimated treatment effect from the random effects model when its associated quadratic form is in its standard position. Hence, j O C R j 1=p is an average variance resulting from the random effects model, and I 2 R is the proportion of this variance, which is explained by between-study heterogeneity. The proposed multivariate I 2 R is therefore an analogue of the univariate I 2 statistic, with a similar but not identical interpretation to its established univariate counterpart. If all studies are the same 'size' (S i D S 1 for all i), then (6) simplifies to the usual I 2 statistic univariately, but this is not the case in general. We return to the apparent issue of truncating potentially negative I 2 R statistics to zero so that these cannot be negative, just as in the univariate case when equating I 2 D .H 2 1/=H 2 as described in Section 4, in Section 5.5.

Multivariate H 2 and I 2
H statistics A multivariate H 2 statistic is also desirable, primarily to provide a direct extension of the univariate I 2 statistic. The univariate H 2 statistic is defined directly in terms of Q as shown in Section 4. The difficulty in using the matrix Q proposed by Jackson et al. [14] for the purposes of quantifying heterogeneity multivariately is discussed when presenting White's I 2 statistics in Section 5.4 but an alternative multivariate generalisation of Q is where O Y i is the fitted value from the fixed effects model. The subscript s emphasises that Q s is a scalar, a 2 test statistic that can be used to test the null hypothesis that there is no between-study heterogeneity.
We make the obvious multivariate generalisation where v denotes the degrees of freedom of Q s , that is, the total number of univariate estimates minus the dimension of the meta-analysis. We can then define an I 2 statistic on the basis of H 2 as another possible measure of heterogeneity. The H 2 statistic retains its interpretation from the univariate case and the multivariate H 2 and I 2 H statistics simplify to the conventional H 2 and I 2 statistics univariately. We suggest that I 2 H is truncated to zero if .H 2 1/=H 2 < 0, following the usual convention in the univariate case.
The quantity that the multivariate H 2 statistic estimates is derived in Appendix B.

White's I 2 statistics
Another way to generalise the univariate I 2 statistic multivariately is to exploit the fact that the univariate I 2 may be written in terms of Q, as max.0; .Q .n 1//=Q/ [18]. A multivariate Q matrix has recently been suggested, which can be used to estimate the between-study covariance matrix. Jackson et al. [14], however, 'note that the proposed Q matrix is not based on matrix operations'. Hence, it is not clear that any standard matrix operation would be capable of transforming this Q matrix into a heterogeneity statistic. It is, at best, extremely difficult to base any truly multivariate measure of heterogeneity on this Q matrix; and hence, I 2 is hard to generalise multivariately in this way. Despite these issues, White [13] has recently proposed multivariate I 2 statistics that reduce to the univariate measure in one dimension. This defines I 2 statistics as the ratios of the estimated between-study variances and the sum of these variances and 'typical within-study variances', where these within-study variances are the ratio of coefficients from the recently proposed multivariate DerSimonian and Laird estimating equations [14]. This may not be inappropriate, but White's I 2 , just like the conventional univariate one, depends crucially on this typical variance. Any method for computing such a variance becomes increasingly problematic as the dimension of the meta-analysis increases. Furthermore, White's I 2 statistics do not truly reflect the multivariate nature of the model fit, or the association between the estimates, because they are merely functions of the estimated marginal between-study variances and the within-study covariance matrices. Despite these limitations, we will compare White's I 2 statistics with those that we develop here.

Multivariate R and I 2
R statistics for subsets of the outcomes One might be especially interested in quantifying heterogeneity for a subset of the outcomes; for example, some effects might be considered to relate to primary trial outcomes. Both the R and I 2 R measures can be easily applied to just a subset of the estimated effects by taking the corresponding submatrices of O C R and C F and by performing the calculations shown in (5) and (6) where p is taken as the dimension of the subset of the outcomes under consideration. We show in Appendix C that our multivariate R statistics, for all or just some of the outcomes, are greater than or equal to one if (3) is used to obtain the variance of the pooled estimates; and hence, the corresponding I 2 R statistics are greater or equal to zero. Therefore, no truncation of I 2 R statistics to zero in such instances is ever required. Because other methods for obtaining the variance of the pooled estimates, for example, using the observed Fisher information matrix, approximate the variance (3), we anticipate that the truncation out of I 2 R statistics is not likely to be a common occurrence irrespective of the procedure used, but we suggest that this is performed where necessary.
It is also possible that one might be interested in quantifying heterogeneity for certain contrasts or linear combinations of the effects. For example, a linear combination of sensitivity and specificity might be important in determining the value of a diagnostic test [20]. Upon obtaining the covariance matrix of the fixed and random effects models' estimates of these linear combinations, these could also be used in (5) and (6) to obtain multivariate R and I 2 R statistics for any linear combinations of interest. However, the properties of Q s , if we instead sum over a subset of the outcomes (and so use the corresponding subvectors of Y i and O Y i , and submatrices of S 1 i , when computing (7)) are not clear. We therefore propose that H 2 and I 2 H be used to quantify the heterogeneity for all treatment effect parameters, whereas R and I 2 R can be used for all, or a subset of, these parameters as desired. We provide a summary of all the various heterogeneity measures and in Table V, where the interpretations of the conventional univariate heterogeneity statistics are as described by Higgins and Thompson [5].

Applying the proposed heterogeneity statistics to our examples
We summarized the various heterogeneity measures for our three example datasets in Tables VI, VII and VIII, where we continue to use p to denote the number of treatment effect parameters that the heterogeneity statistics apply to. We present all the heterogeneity statistics in our tables but restrict our interpretations to the I 2 statistics. This is because the univariate I 2 statistic is easily the most popular, but the reader may also use the R and H 2 statistics to interpret the impact that the heterogeneity has in each case.   The variable p is the number of treatment effect parameters that the statistic applies to, and columns 1 -4 indicate whether the statistics apply to this particular parameter. R, I 2 R , H 2 and I 2 H are the proposed multivariate heterogeneity statistics; I 2 u and I 2 W are the conventional univariate I 2 statistic and White's [13] I 2 statistic, respectively.

Example 1: the periodontal data
The standard univariate I 2 and White's I 2 statistics are in good agreement in Table VI. The I 2 R statistic for the first outcome is, however, noticeably larger than the others (0.78 compared with 0.72). This suggests that the heterogeneity may have a little more impact for the first outcome than the standard univariate statistic indicates. Interestingly, the I 2 statistics relating to both outcomes jointly .p D 2/ are almost as great as those for the second outcome alone. This suggests that the very considerable heterogeneity, coupled with the uncertainty in the estimates of the between-study covariance matrix from pooling just five studies, has more impact for the joint analysis than an average of the univariate measures might be thought to indicate. To summarise, the univariate heterogeneity statistics describe the impact of the heterogeneity for the marginal inferences quite well, as anticipated, but the multivariate heterogeneity statistics add further insight. All the I 2 statistics are greater than 0.7, and some are larger than 0.9, which indicates that the heterogeneity has a considerable impact.

Example 2: the sleep data
The univariate and multivariate heterogeneity statistics for the first outcome in Table VII are in good agreement, but those for the second outcome differ greatly as anticipated. Because the standard univariate and White's I 2 statistics for the second outcome depend on the univariate and multivariate estimates of † 22 , respectively, it is clear from Table III that they must be in poor agreement. Those more familiar with the standard univariate measure might suspect that White's I 2 is unduly effected by the large multivariate estimate of † 22 , because this I 2 is a function of this particular and rather extreme entry of the estimated between-study covariance matrix. Although I 2 R tempers White's I 2 for the second outcome slightly, it confirms that the impact of the heterogeneity is more considerable for 2 in a multivariate meta-analysis than a univariate meta-analysis, even after taking into account the entire model fit. The multivariate I 2 R and I 2 H heterogeneity statistics for both treatments, and hence the joint inference and the meta-analysis as a whole, are in good agreement. The I 2 statistics are not as large as those for the previous example, but they are all greater than 0.3, indicating that the heterogeneity has a quite considerable impact.

Example 3: the prognostic value of MYCN and chromosome 1p
White's I 2 statistics are generally larger than the usual univariate I 2 statistics in Table VIII, as anticipated from the larger estimated between-study variances from the multivariate model shown in able IV. However, the I 2 R statistics for single treatment effect parameters (p D 1) are even larger than White's, suggesting that the impact of the heterogeneity is even greater. It seems that allowing for the uncertainty in the model fit, coupled with the relatively high dimension of the multivariate meta-analysis and the amount of missing data, means that the heterogeneity has a little more impact than either of the previously suggested I 2 statistics indicate. These larger multivariate heterogeneity statistics are also apparent for the p > 2 statistics, which are averages of the corresponding p D 1 heterogeneity statistics. The general picture from Table VIII is that the impact of the heterogeneity is really quite considerable for this example, and much more so than suggested by the univariate or White's I 2 statistics.

Multivariate meta-regression
The multivariate random effects model may be extended to a multivariate meta-regression model where the treatment effect vector includes study level covariate effects. For example, if the first outcome depends on a covariate x, we replace 1 with˛1 Cˇ1x i . We refer to all parameters in the mean of (1) as treatment effect parameters, irrespective of whether they are intercept or covariate effects. The inference for the treatment effect parameters follows in an analogous way to (2) and (3); here, the model is fitted as a weighted linear regression model, where all weights are regarded as known [14].
All our heterogeneity measures are immediately applicable to a multivariate meta-regression. We can obtain the covariance matrix of the estimates of all treatment effect parameters under the assumptions of fixed and random effects meta-regression models and obtain R and I 2 R statistics for any combinations of treatment effect parameters of interest. Now that covariate effects are included, we may be especially interested in all parameters for a particular outcome, for example. Equation (B5) continues to provide the expected value of Q s , where the degrees of freedom v is the number of observations minus the total number of treatment effect parameters (intercept terms plus covariate effects). Hence, we define H 2 D Q s =v and I 2 H D .H 2 1/=H 2 as heterogeneity statistics for a meta-regression; but, just as in meta-analysis, we only use these to quantify the impact of heterogeneity for all treatment effect parameters.

Discussion
We have proposed some multivariate measures of the impact of heterogeneity in a multivariate meta-analysis. All aspects of the data contribute to the calculations; and hence, our measures can be expected to perform well regardless of the amount of borrowing of strength involved and any vagaries of the particular dataset under consideration. A considerable advantage of our proposals is that they are relatively easily computed from standard output. A potential limitation of our proposals is that it is tempting to interpret them in an overly simplistic fashion. For example, the Cochrane handbook [6] has important things to say about the interpretation of univariate I 2 statistics, and these same issues apply here. Perhaps most importantly, the Cochrane handbook makes it clear that the use of particular thresholds when interpreting heterogeneity statistics can be misleading.
Perhaps one advantage of H 2 and I 2 H is that they both reduce to the conventional measures univariately. The multivariate R statistic also simplifies to the univariate R, but I 2 R is an analogue of I 2 that only simplifies to I 2 univariately if all studies are the same size. I 2 R has a similar but different interpretation to the conventional univariate I 2 statistic. Another advantage of H 2 and I 2 H is that their computation does not require fitting the random effects model; only the fixed effects model fit is required. This eases their computation, and these multivariate heterogeneity statistics can be obtained without comparing the fixed effects results to any particular random effects model. Hence, H 2 and I 2 H provide a means to quantify the heterogeneity concisely in situations where many possible random effects models and estimation methods are to be considered. Alternative R and I 2 R statistics are obtained for different fitted random effects models, for example, using different estimation procedures for estimating the betweenstudy covariance matrix. Alternatively, one could fit a reduced random effects model, where perhaps all between-study variances are assumed to be identical. Irrespective of how the between-study covariance matrix has been estimated, and the assumptions made about its form, the R and I 2 R statistics may be calculated for each fitted model. This may also be considered an advantage of these statistics, however, because the impact of heterogeneity may be thought to depend on this modelling and estimation.
There is therefore no single 'best' multivariate heterogeneity statistic, but the meta-analyst who desires a single heterogeneity statistic, and is committed to using I 2 in the univariate scenario, is likely to find the I 2 H statistic appealing. However, the meta-analyst who requires a more thorough investigation into the impact of the heterogeneity, on all the various combinations of treatment effects of interest, is more likely to use R and I 2 R . Our measures quantify the impact of heterogeneity in an analogous way to the conventional univariate heterogeneity statistics, but the multivariate scenario allows a much richer array of possibilities. We are currently developing statistics that describe the amount borrowing of strength afforded by multivariate, rather than univariate, meta-analyses and other related quantities that might also be of interest. In addition to such possibilities, we now advocate routinely reporting the estimated between-study covariance matrix, and the covariance matrix of the estimate of the treatment effect, although we recognise that we ourselves have not always fully reported them. These quantities enable others to make use of all aspects of the model fit in any subsequent analysis and provide many insights into the properties of the fitted multivariate model. We also advocate providing our and other heterogeneity statistics as further descriptive statistics because they also add insight. For example, by comparing univariate and multivariate measures of heterogeneity, meta-analysts can directly assess the impact of heterogeneity for univariate and multivariate analyses of their data.
The uncertainty in the univariate heterogeneity measures is usually considerable, however, and this can also be expected multivariately. A related concern is that the properties of the proposed multivariate statistics are poorly understood, and we accept that these deserve further investigation and this may form the subject of a future paper. Attempting to derive multivariate confidence regions corresponding to the proposed measures is at best extremely difficult. We leave the best way to do this, and indeed the question of whether this is necessary or desirable, as an open research question, but those who require an indication of the uncertainty in their measures might consider bootstrapping. In any case, practitioners typically use the univariate heterogeneity statistics as descriptive statistics. The statistics are interpreted as measuring the impact of the heterogeneity for their particular meta-analysis, and any regard for the properties of the measures under repeated sampling is a secondary consideration.
Other multivariate measures are also possible, and a statistic similar in concept to (5), but based on the trace of the covariance matrices, may warrant consideration. Some meta-analysts use quantiles from the t distribution when calculating confidence intervals using the random effects model and may wish to scale up our R statistics by the ratio of t and standard normal quantiles to reflect this, which provides another variation of our methods.
The cautiously minded statistician is likely to want to perform separate univariate meta-analyses in addition to a multivariate meta-analysis, to see if the univariate and multivariate results differ qualitatively. Similarly, it may be of interest to see how the univariate heterogeneity measures compare with those proposed here. If the various heterogeneity statistics differ substantially, then this is of interest; and the reasons for this should be explored and, if possible, explained. Our proposed methods measure the impact of heterogeneity differently to the standard methods, and we think that they are preferable when multivariate meta-analysis has been used because they accurately reflect the multivariate nature of the model fit. Greater insight into the data is generally afforded by looking at the data in a variety of ways, and we hope that our methods will embellish, rather than diminish, the established ways of measuring the impact of heterogeneity in meta-analysis that we have taken as our inspiration here.
confidence ellipsoid (the length of the interval in 1-d) is therefore where k p D 2 p=2 =p.p=2/; if p D 1, k p D 2; if p D 2, k p D and if p D 3, k p D 4 =3, for example. Because the product of the eigenvalues is the determinant of a square, matrix (A1) becomes Similarly, The ratio of the volume of the confidence ellipsoids is V R =V F . Combining the two equations immediately above with (4) gives (5).

Appendix B
The univariate H 2 statistic estimates the quantity 1 C D 1 C 2 = 2 when all studies are the same size [5]. We can assess the properties of Q s , and hence what H 2 estimates, by assembling all the Y i into a single vector Y and fitting a fixed effects meta-regression model where O is the vector of residuals from the regression of Z on W, H is the corresponding 'hat' matrix and are the true errors that can be obtained from (B4). We have the usual results that .I H/ is symmetrical, idempotent and its trace equals its rank, which equals the regression's degrees of freedom (v  (5) and I 2 R > 0 in (6).