Simultaneous inference for multiple marginal generalized estimating equation models

Motivated by small-sample studies in ophthalmology and dermatology, we study the problem of simultaneous inference for multiple endpoints in the presence of repeated observations. We propose a framework in which a generalized estimating equation model is fit for each endpoint marginally, taking into account dependencies within the same subject. The asymptotic joint normality of the stacked vector of marginal estimating equations is used to derive Wald-type simultaneous confidence intervals and hypothesis tests for multiple linear contrasts of regression coefficients of the multiple marginal models. The small sample performance of this approach is improved by a bias adjustment to the estimate of the joint covariance matrix of the regression coefficients from multiple models. As a further small sample improvement a multivariate t-distribution with appropriate degrees of freedom is specified as reference distribution. In addition, a generalized score test based on the stacked estimating equations is derived. Simulation results show strong control of the family-wise type I error rate for these methods even with small sample sizes and increased power compared to a Bonferroni-Holm multiplicity adjustment. Thus, the proposed methods are suitable to efficiently use the information from repeated observations of multiple endpoints in small-sample studies.


Introduction
In empirical studies where for each subject multiple endpoints are observed, it is often of interest to identify predictive factors for several of these endpoints. To this end, regression models for the different endpoints can be defined to test respective null hypotheses on the model parameters. However, if for each endpoint, one (or more) hypotheses are tested, a multiple testing problem arises and adjustments for multiplicity are required to control, for example, the family wise type I error rate (FWER).
In this manuscript, we focus on settings where all or some of the multiple endpoints are measured repeatedly and derive multiple testing procedures and simultaneous confidence intervals that account for the correlation between the endpoints as well as the correlation between the repeated measurements of each endpoint. The endpoints may be on different scales (we particularly consider continuous, binary and count data), and the regression models may differ across endpoints. The proposed tests improve the Bonferroni test which is typically strictly conservative.
The testing procedures are based on generalized estimating equation (GEE) models 1 that are separately fitted for each endpoint. Thereby each model accounts for the dependencies between repeated observations of the according endpoint.
We use a representation of stacked estimating equations to show joint multivariate asymptotic normality of the regression coefficient estimators from different models and to estimate their covariance matrix, such that parametric inference methods based on a multivariate normal approximation can be applied. The method generalizes the approach by Pipper et al. 2 who used stacked estimating equations for multiple generalized linear models and Rochon 3 who studied the case of repeated bivariate measurements comprising a continuous and a binary endpoint. Jensen et al. 4 applied a similar representation to multiple linear mixed models for repeatedly observed continuous endpoints. Also see Verbeke et al. 5 for a recent review on the analysis of multivariate longitudinal data.
While inference based on the multivariate normal approximation can be justified for large sample sizes by asymptotic arguments, 6 it may be inaccurate for smaller samples. In particular, the bias and variability of nuisance parameter estimates are neglected in purely asymptotic methods, resulting in too liberal inference procedures.
To improve the small sample properties of Wald tests, we generalize bias-adjustment procedures proposed for covariance matrix estimators in single GEE models, 7 to the case of multiple marginal GEE models. Furthermore, similar to the studies by Hasler and Hothorn 8 and Pan and Wall, 9 we use multivariate t-and F-distributions to better control the type I error rate. We further propose a maximum-type generalized score test and show via simulation that it is a viable small sample alternative to the Wald test.
The paper is structured as follows: In Section 2, multiple marginal GEE models and a bias-adjusted covariance matrix estimator are introduced. In Section 3, we define Wald and score test statistics to test multiple linear contrasts and derive corresponding simultaneous confidence intervals. In Section 4, the proposed methods are applied to a retina disease study. Furthermore, in Section 5, we investigate the small sample properties of the proposed methods in a simulation study. Finally, in Section 6, we conclude with a discussion. and ðmÞ is a scale parameter. In typical applications, the link function and the variance function are derived from the canonical representation of an exponential family model. 10 Throughout the manuscript, we assume that regression coefficients and nuisance parameters are unique to one model and not shared between any two models. The models for different endpoints are estimated independently, see Section 7 for a discussion on alternative approaches of joint estimation.

Generalized estimating equations
To account for dependencies between repeated observations within the same subject, the regression coefficients b ðmÞ and their covariance matrix are estimated based on the generalized estimating equation approach. 1 Thus, the estimateb ðmÞ is given by the solution of the generalized estimating equation with subject-wise contributions U ðmÞ Given b ðmÞ , the parameters a ðmÞ and ðmÞ may be consistently estimated from the residuals S ðmÞ i , i ¼ 1, . . . , K by moment estimators. 1 Given a ðmÞ and ðmÞ , an estimate for b ðmÞ is found as solution to equation (1). Iteration of these two estimation steps results in a consistent estimateb ðmÞ such that asymptotically K 1=2 ðb ðmÞ À b ðmÞ Þ is multivariate normal (Theorem 2 in Liang and Zeger 1 ). A consistent estimator for the covariance matrix of the limiting normal distribution as proposed in Liang  Note that in case the mean model is misspecified,b ðmÞ will typically converge to a vector b (m) that defines the model within the chosen mean structure that best approximates the true model in the sense of minimized Kullback-Leibler distance. 11,12 In that case, the proposed methods provide inference on the parameters of the approximating model.

Multiple marginal models
We are interested in simultaneous inference on the regression coefficient vectors b ð1Þ , . . . , b ðMÞ and approximate the joint distribution of the stacked vectorb ¼ ðb ð1ÞT , . . . ,b ðMÞT Þ T by a multivariate normal distribution based on the framework of Pipper et al. 2 By equation (1),b (together with the marginal model estimates for the nuisance parameters a ðmÞ and ðmÞ , m ¼ 1, . . . , M) is the solution to the stacked estimating equation Similar to the case of a single GEE model, for increasing number of subjects K, ffiffiffi ffi K p ðb À bÞ converges to a multivariate normal distribution with mean zero and covariance matrix lim p 1 K @UðbÞ @b À1 providedb is consistent for b and certain regularity conditions are met. Here lim p denotes the limit in probability when K goes to infinity. Consistency ofb follows ifb ðmÞ is consistent for b ðmÞ for all m ¼ 1,. . .,M. The essential regularity conditions concern the derivatives of U with respect to b (see Chapter 5.3 in the book by Van der Vaart 6 and Chapter 9.1 in the book by Cox and Hinkley 13 ). However, the matrix of first derivatives H ¼ @UðbÞ @b is a block diagonal matrix of the matrices H ðmÞ ¼ @U ðmÞ ðb ðmÞ Þ @b ðmÞ . Hence, conditions such as existence of derivatives, a dominating function, expectation and a matrix-inverse are inherited if they are met by all marginal models.
An estimate of the covariance matrix ofb is given bŷ T is calculated from the stacked vectors U i . The componentsB ðm,m 0 Þ ¼ P K i¼1 U ðmÞ i ðb ðmÞ ÞU m 0 i ðb ðm 0 Þ Þ T correspond to the empirical correlation between the contributions of a subject to the estimating equations of models m, m 0 .Ĥ is a block diagonal matrix with block elementsĤ ðmÞ ¼ @U ðmÞ ðb ðmÞ Þ @b ðmÞ that are the corresponding estimates from the marginal models. The resulting multiple model sandwich variance estimator maintains the marginal GEE sandwich variance estimators in the diagonal blocks, while the off diagonal blocks contain estimates for the covariances between estimated regression coefficients from different models.

Bias-adjusted covariance estimator
The covariance matrix estimator in GEE models is consistent but it is in general not unbiased. With small sample sizes, the variances may be underestimated which leads to an inflation of the type I error rate of hypothesis tests and to confidence intervals with coverage less than the nominal 1 -level. Based on the bias-adjusted estimator proposed by Mancl and DeRouen 7 (see also Wang et al. 14 ) for a single GEE model, we derive a bias-adjusted covariance estimator for multiple models, given bŷ and I i is the identity matrix with matching dimension. See the supplemental material Section S.1 for details. The matrices D i , V i and, consequently,P ii are block diagonal with block elements D ðmÞ i , V ðmÞ i ,P ðmÞ ii , m ¼ 1, . . . , M, respectively.D adj contains in the diagonal blocks the bias-adjusted variance matrix estimators that would result from separate marginal models and the off-diagonal blocks contain bias-adjusted estimates of the covariances between regression coefficient estimates from different models.

Missing data
Values of some Y ðmÞ ij or x ðmÞ ij may be missing in an actual data set. As for single GEE models, 1b is consistent for b if some observations are missing completely at random (MCAR), i.e. missingness is completely independent from any missing or non-missing values of the included variables, and each model is fit with the respective available data. Also, inference based on the asymptotic normality of the stacked score vector U stays unaffected under MCAR. If a subject i has to be excluded entirely from the model for the m-th endpoint due to missing data, the realization of the respective contribution to the estimating equation is treated as U ðmÞ (1) and (2) and subsequent calculations. Note that this does not bias the estimated covariance matrix (3) or (4) as the effective sample size enters these equations in terms of the observed informationĤ À1 and not the number of clusters.
If data are missing at random (MAR), i.e. the missingness may depend on non-missing values of observed variables, residuals may be biased. Consequently, a model fit with available data may result in biased and inconsistent estimates. This bias may be counteracted by weighing observed residuals with the inverse probability of non-missingness of the given data point. Under MAR, these probabilities may in principle be estimated from the observed data. There are different ways to introduce weights in the generalized estimating equation. [15][16][17][18] Our software implementation, discussed in Section 6, allows for weights that resemble a scale factor for each observation, similar to the GENMOD procedure in SAS. 19 Here, subject-wise contributions to a correspondingly weighted generalized estimating equation are of the form where W i is a diagonal matrix of weights for the observations of subject i. When all observations of a subject receive the same weight, this formulation is equivalent to the cluster-weighted GEE proposed by Fitzmaurice and Laird. 15 With an identity working correlation, it is equivalent to the weighted GEE proposed by Robins et al. 16 3 Maximum-type tests and quadratic form tests for linear hypotheses Consider a null hypothesis of the form H 0 : Lb À r ¼ 0, where L is a matrix of linear constraints with c rows and number of columns equal to the length of b and r is a vector of matching dimension. Each row of this equation corresponds to an elementary null hypothesis H i : ðLbÞ i ¼ r i , i ¼ 1, . . . , c. Furthermore, assume that Lb ¼ r has at least one solution in b. In Section 3.1, we construct asymptotic hypothesis tests for the global null hypothesis H 0 that are based on the maximum of multivariate Wald statistics, and we propose adjustments of the tests for small samples. In Section 3.2, a maximum test based on score statistics is proposed. In Section 3.3, we discuss Wald and score tests for H 0 that are based on quadratic forms. In Section 3.4, the closed testing principle is applied to construct multiple testing procedures, allowing for decisions on intersection and elementary hypotheses with type I error rate control. Furthermore, simultaneous confidence intervals for ðLbÞ i , i ¼ 1, . . . , c corresponding to a single step multiple testing procedure are derived.

Maximum-type Wald test
where we use the normal approximation ffiffiffi ffi K p ðLb À LbÞ % Nð0, KLDL T Þ to define the critical value q 1À as the solution of Pðmax i¼1, ..., c jZ i j q 1À Þ ¼ 1 À . 20 Here, Z denotes a c-dimensional multivariate normal variable with mean 0, unit variances and correlation structure given by LDL T , such that covðZÞ ¼ diagðŜEÞ À1 LDL T diagðŜEÞ À1 , where the vector of standard errorsŜE is given by the square roots of the diagonal entries of LDL T . Similarly, the p-value for the maximum test is defined as Pðmax i¼1, ..., c jZ i j ! max i¼1, ..., c jððLbÞ i À r i Þ=Ŝ E i jÞ.
L may be below full rank since the quantile q 1À is also defined for a degenerate multivariate normal distribution with singular covariance matrix.

Small sample improvements
For small samples, the type I error rate of the above test can be considerably greater than the nominal level. As a first small sample improvement, the bias-adjusted covariance estimateD adj (equation (4)) may be used instead ofD (equation (3)). Furthermore, to also account for the variability of the covariance estimators, the critical value q 1À of the multivariate normal distribution can be replaced by the critical value t 1À of a multivariate t-distribution, 21 such that Pðmax i¼1, ..., c jT i j t 1À Þ ¼ 1 À , where T is distributed according to a c-dimensional multivariate t-distribution with correlation matrix diagðŜEÞ À1 LDL T diagðŜEÞ À1 and an appropriate number of degrees of freedom (df). See the earlier studies 7,8,22 for related approaches in the context of multiple contrast tests. As a simple method to choose the error degrees of freedom, we propose df ¼ min m¼1, ..., M ðK À p ðmÞ Þ where p ðmÞ is the number of regression coefficients in model m (compare with Munzel and Hothorn 23 ). Alternative methods to choose degrees of freedom for multivariate comparisons are discussed in Section 7.

Maximum-type score test
We derive the maximum-type generalized score test as an approximation to the Wald test. By first order approximation, Lb À r ¼ Lb À Lb % ÀLH À1 UðbÞ. Hence, tests for H 0 can be constructed based on the right hand side ÀLH À1 UðbÞ and its normal approximation under the null hypothesis, Nð0, LH À1 BH À1 L T Þ. Under a simple null hypothesis, the true b is known, under a composite null hypothesis, a restricted estimate e b, which satisfies L e b À r ¼ 0, is plugged in. To estimate the limiting distribution covariance matrix, H and B are replaced by The maximum-type score test rejects Z denotes a c-dimensional multivariate normal variable with mean 0, and covariance matrix given by diagð f For a single marginal GEE model, the restricted estimate e b ðmÞ can be computed by the iterative restricted weighted least squares algorithm e b ðm,jþ1Þ ¼ e b ðm,j Þ À @U @b ð e b ðm,j Þ Þ À1 ðUð e b ðm,j Þ Þ À L T k ð j Þ Þ, with the vector of Lagrange multipliers , compare with Rao and Toutenburg. 24 Here, the second superscript indicates the iteration number. Whereb can be understood to maximize a quasi-likelihood with first derivative U, e b maximizes the quasi-likelihood subject to the restriction of H 0 .
If the null hypothesis Lb À r ¼ 0 has a block diagonal structure such that each constraint involves only parameters of one marginal model, we have Then, the restricted estimate e b is a stacked vector of restricted estimates from the marginal models. We consider only null hypotheses covered by equation (5). Otherwise, the elements of e b needed to be estimated jointly for all models. However, contrasts between coefficients from different marginal models are rarely of interest, if they correspond to different units or scales of measurement. If outcomes are indeed measured at the same scale and units, they can be modelled together in one GEE model.

Small sample considerations
With the score test, nuisance parameters are estimated under the null hypothesis based on the restricted estimate e b, which is less variable thanb. (In the limit covð ffiffiffi ffi Consequently, we may expect that the nuisance parameters are estimated with less variability, too, and the type I error rate control with the score test is improved compared to the unadjusted Wald test. In principle, though, the Mancl and DeRouen bias adjustment can be extended to the estimate e B. Instead of See supplemental material Section S.1 for the derivation. Also, a multivariate t reference distribution could be used. We will, however, focus on the unadjusted score test in the numeric simulations.

Quadratic form tests
Alternatives to the maximum-type tests may be derived based on the normal approximation of the multivariate Wald and score statistics. An example are quadratic form tests. The quadratic form Wald test rejects H 0 if where Q ð 2 Þ c ð1 À Þ denotes the 1 -quantile of the chi-squared distribution with c degrees of freedom. As small sample improvement,D may be replaced byD adj . Further, Q ð 2 Þ c ð1 À Þ may be replaced by cQ ðFÞ c,df ð1 À Þ or df dfÀcþ1 cQ ðFÞ c,dfÀcþ1 ð1 À Þ, where Q ðFÞ c,df ð1 À Þ denotes the 1 -quantile of an F-distribution with c numerator degrees of freedom and df denominator degrees of freedom, chosen in the same way as for the maximum test. The former option is analogous to an F-test, where the variability of an assumedly independent and chi-squared distributed single nuisance parameter is taken into account. The latter option is analogous to Hotelling's test which adjusts for the variability of an assumedly independent and Wishart distributed covariance matrix estimate. This approach was described by Kenward and Roger 25 for random effects models and by Pan and Wall 9 in the context of single GEE models.
Similar to the maximum score test, this test is based on the normal approximation for the statistic ÀL e H ðÀ1Þ Uð e bÞ. For an alternative derivation see the study by Boos. 26 If L is below full rank, a generalized matrix inverse may be applied in the calculation of the quadratic form statistics and c is replaced by the rank of L.
In terms of the multivariate space of Lb À r, the quadratic form test statistic and the maximum test statistic apply different metrics to measure deviations from the null vector. The quadratic form test is monotone in the noncentrality parameter ðLb À rÞ T ðLDL T Þ À1 ðLb À rÞ; however, it is in general not monotone in the observed effects jðLbÞ i j. In contrast, the maximum test is monotone in jðLbÞ i j. In the setting of single regression models, a familiar application of quadratic form tests is testing a null hypothesis of equal effects between all, c þ 1 say, stages of some grouping variable. There, the elements of Lb constitute a set of c arbitrarily selected between-group differences and a quadratic metric is appropriate to assess the overall deviation from the null hypothesis. If, however, the null hypothesis refers to a set of effects in multiple models, each elements of Lb has an individual interpretation and asks for a test that is monotone in the individual observed effects. Therefore, a maximum-type test will typically be preferable when testing hypotheses across multiple models.

Multiple testing procedures
The tests considered so far test a global null hypothesis H 0 : Lb ¼ r. Each row of this equation corresponds to an elementary null hypotheses H i : ðLbÞ i ¼ r i , i ¼ 1, . . . , c while controlling the FWER in the strong sense requires an appropriate multiple testing procedure.

Single step procedure
Based on the maximum-type Wald test, a single step multiple testing procedure with strong FWER control at level rejects H i if jððLbÞ i À r i Þ=Ŝ E i j 4 q 1À . Corresponding simultaneous 1 -Wald confidence intervals for Lb are given by q 1À may be replaced by t 1À to use a multivariate t reference distribution. Note that the single step procedure based on the maximum-type score test may not control the FWER because its multivariate reference distribution is valid only under the global null hypothesis H 0 .

Closed testing procedure
A general and more powerful multiple testing procedure can be constructed with the closed testing principle. 27 Let I ¼ f1, . . . , cg denote the index set of the elementary hypotheses. According to the closed testing principle, an intersection hypothesis \ i2S H i , S I can be rejected with strong control of the FWER at level if all intersection hypotheses \ i2S 0 H i with S S 0 are rejected by a local level test. Note that in the context of linear hypotheses, the intersection hypothesis \ i2S H i corresponds to the set of linear contrasts ðLbÞ i2S ¼ ðrÞ i2S . Thus, to construct a multiple testing procedure, we define for each such intersection hypothesis a level test given by one of the tests described above and decide on the intersection and elementary hypothesis according to the closed testing principle. If the matrix L is not of full rank, some intersection hypotheses are equivalent. It is therefore not necessary, and in fact would reduce the power of the procedure, to test all intersection hypotheses in the closed testing procedure. Instead, the test of a hypothesis \ i2S H i may be substituted by the test for an equivalent hypothesis \ i2S 0 H i with S & S 0 . Shaffer 28 describes a general method to identify redundant intersection hypotheses. In the context of linear contrast tests, and under the assumption that Lb ¼ r has at least one solution, two intersection hypotheses \ i2S H i and \ i2S 0 H i are equivalent if the corresponding contrast matrices ðLÞ i2S and ðLÞ i2S 0 define the same region in the parameter space. This is the case if rank ðLÞ i2S Multiplicity adjusted p-values for the test of \ i2S H i are defined as the smallest family-wise significance level for which \ i2S H i can be rejected using the closed testing procedure or, equivalently, as the maximum of local p-values for all local tests of \ i2S 0 H i , S S 0 .
When using maximum-type tests, a weighted closed testing procedure as discussed by Xi et al. 29 may be applied to account for differences in importance of the tested hypotheses, depending on the study aims.

Example -A retina disease study
In a recent exploratory study, the association between two metric endpoints Y ð1Þ and Y ð2Þ , both measuring retinal function, and three categorical variables X ð1Þ , X ð2Þ , X ð3Þ , each representing the condition of one of three retinal cell layers, was analyzed. X ð1Þ and X ð2Þ allow for three stages of deterioration in {0, 1, 2} and X ð3Þ comprises two stages {0, 1}. Within each eye, the set of variables was measured at 29 to 51 distinct locations defined through a common grid. In total, the study data comprise observations from 1489 locations in 35 eyes of 18 patients. Six marginal analysis models (7) to (12) EðY ð2Þ ij Þ ¼ ð6Þ Here, 1 is the indicator function. Each model was fit using the GEE method with patient as clustering variable and specifying an exchangeable working correlation structure. Note that the robust variance estimation via the GEE approach was preferred over a mixed model since the true correlation structure is most likely too complicated to be explicitly modelled correctly.
Six null hypotheses, addressing the association between an outcome and one independent factor, are regarded in the study: H 1 : ð1Þ 1 ¼ ð1Þ 1 ¼ 0. We illustrate the application of multivariate inference for the set of hypotheses fH i , i ¼ 1, . . . , 6g based on the joint distribution of the coefficients from all six models. Following the discussion at the end of Section 3.3, we use maximum tests. Define the contrast matrices L ð1Þ ¼ L ð2Þ ¼ L ð4Þ ¼ L ð5Þ ¼ ðð0, 1, 0Þ T , ð0, 0, 1Þ T , ð0, 1, À 1Þ T Þ T and let L ð3Þ ¼ L ð6Þ ¼ ð0, 1Þ. The right-hand side vector is r ¼ 0. Then, the intersection hypotheses \ i2S H i , S f1, . . . , 6g correspond to L S b S ¼ 0 where L S is a block diagonal matrix composed of the matrices L ðiÞ , i 2 S and b S is the stacked vector of b ðiÞ , i 2 S. Each of these hypotheses is tested by a maximum-type Wald test, using the bias-adjusted covariance matrix estimate and a multivariate t-distribution with df ¼ K -3 ¼ 18 -3 ¼ 15 degrees of freedom (or df ¼ K -2 ¼ 16 if only H 3 and H 6 are involved) as reference distribution. Adjusted pvalues resulting from the closed test for fH i , i ¼ 1, . . . , 6g are calculated as described in Section 3.4. For comparison, adjusted p-values according to the Bonferroni-Holm method 30 are also calculated. Table 1 shows the unadjusted p-values of the separate maximum tests for H 1 , . . . , H 6 , adjusted p-values resulting from the application of the Bonferroni-Holm method to the former unadjusted p-values and adjusted p-values calculated by applying the closed testing procedure outlined in Section 3.4 to the set of hypotheses fH 1 , . . . , H 6 g. Hypotheses H 1 , H 2 and H 3 are rejected at a family-wise 5% significance level with both multiplicity adjustments. Also, for H 4 and H 5 , both methods give similar results and do not reject. The test for H 6 has a local p-value of 0.0237, and the Bonferroni-Holm adjustment results in an adjusted p-value of 0.0710, such that the hypothesis is not rejected with this procedure. In contrast, the closed test based on maximum tests across multiple marginal GEE models results in a multiplicity adjusted p-value of 0.0362, allowing for the rejection of H 6 .

Type I error rate and power comparisons in finite samples
A simulation study was performed to investigate the power and type I error rate of the proposed hypothesis tests as well as the coverage probability of simultaneous confidence intervals in settings with multiple differently scaled endpoints. The supplemental material Section S.2 contains a further simulation study assessing the coverage of simultaneous confidence intervals based on multiple marginal GEE models in a recently planned clinical trial in actinic keratosis with a continuous and a binary endpoint.
Here, Group is a binary variable in {0, 1}, e.g. indicating treatment versus control. In the simulation study, inference for the corresponding coefficients ðmÞ 1 , m ¼ 1, . . . , M was studied. x ðmÞ i corresponds to a covariate that is specific for the m-th outcome and that is observed once for each patient. The vector ðx ð1Þ i , . . . , x ðMÞ i Þ was drawn from a multivariate normal distribution with zero mean vector, unit variances and all pair-wise correlations set to 0.4.
To simulate correlated observations ðY ð1Þ i1 , . . . , Y ð1Þ in i , . . . , Y ðMÞ i1 , . . . , Y ðMÞ in i Þ, for each subject, i ¼ 1, . . . , K first a latent multivariate normal vector n i ¼ ð ð1Þ i1 , . . . , ð1Þ in i , . . . , ðMÞ i1 , . . . , ðMÞ in i Þ with mean zero and unit variances was sampled. Elements of n i corresponding to repeated observations of the same endpoint had pair-wise correlations of 0.75. The correlation between elements corresponding to different endpoints was 0, 0.25, 0.5 or 0.75 to model zero, weak, intermediate and strong correlations between endpoints. The intermediate correlation was our base setting used in most simulation scenarios. Observations on the continuous, count data and binary outcomes were then obtained by the quantile substitution Y ðmÞ ij ¼ Q ðmÞ ij ðÈð ðmÞ ij ÞÞ. Here, È is the standard normal distribution function and Q ðmÞ ij is the quantile function of, depending on the type of endpoint, a normal distribution with mean ðmÞ ij and variance 1, a negative binomial distribution with mean ðmÞ ij and dispersion parameter 1 (resulting in a variance of ðmÞ ij þ ðmÞ 2 ij ), or a Bernoulli distribution with mean ðmÞ ij . The resulting pair-wise correlations between the marginal Wald or score statistics were close to the corresponding correlation between the latent variables or, for pairs involving non-continuous endpoints, slightly below the latent variable correlation.

Hypothesis tests and confidence intervals
We tested the global null hypothesis H 0 : ð1Þ  (13) to (15), canonical variance functions for linear, Poisson and logistic regression models, respectively, and subject as clustering variable. Note that inference in the Poisson type GEE models is valid in the presence of overdispersion due to the robust covariance estimation. The exchangeable working correlation structure was specified for all models.
We investigated the performance of the following hypothesis tests as described in Section 3: The quadratic form Wald test using the chi-squared, F or scaled F reference distribution, the maximum-type Wald test using the multivariate normal or multivariate t-distribution as reference distribution, the quadratic form score test with a chi-squared reference distribution and the maximum-type score test with a multivariate normal reference distribution. For the Wald statistics, all tests were calculated, both, with and without bias adjustment of the covariance matrix estimate. For comparison, we further included Bonferroni-Holm tests for the maximumtype Wald statistics using the 1-/2/M quantile of a univariate normal or univariate t-distribution as critical quantile.
Simultaneous confidence intervals according to equation (6) were calculated for scenarios with three endpoints and intermediate correlations for ð1Þ 1 , ð2Þ 1 and ð3Þ 1 , with and without bias adjustment of the covariance matrix estimate and based on a critical quantile of either a multivariate normal or t-distribution.
For methods based on a multivariate t-distribution or an F-distribution, df ¼ K -3 error degrees of freedom were used. For all tests, the nominal type I error rate was ¼ 0.05.

Simulation scenarios
We considered scenarios with K ¼ 40 and K ¼ 100 subjects, with K/2 subjects in each class of the Group variable. The true coefficients in the data generating models were b ðmÞ ¼ ð0, ðmÞ 1 , 0:25Þ for the continuous endpoints, For each scenario, 10 5 simulation runs were performed, except for more computation intensive simulations addressing the effect of increasing numbers of endpoints, where 2 Â 10 4 simulation runs were performed. The power for each test was calculated as the proportion of simulation runs in which H 0 was rejected.

Simulation results
Simulation results regarding the type I error rate of tests for H 0 with M ¼ 3 and M ¼ 12 and intermediate correlations are shown in Table 2. Using any of the Wald tests without small sample adjustments leads to severe inflation of the type I error rate, and the inflation is increasing with the number of endpoints and decreasing with the number of subjects. Among the studied scenarios, the type I error rate was up to 10% for the unadjusted maximum-type Wald test and up to 41% for the unadjusted quadratic form Wald test. Both, the bias adjustment of the covariance estimate and the distributional approximation via an F-or a multivariate t-distribution, are required to control the type I error rate at the nominal level. For the quadratic form Wald test, using a scaled F statistic in analogy to Hotelling's T 2 test is required to control the type I error rate across all scenarios.
The score statistics exhibit favorable properties in the simulation, with type I error rates very close to the nominal level. No small sample adjustment in terms of bias adjustment or refined distributional approximation is required.
We studied the power of those procedures for which type I error rate control was observed in the simulation.  Table 3. Throughout these scenarios, the multivariate maximum-type Wald test has a power advantage of some percentage points over the Bonferroni test. Furthermore, the score test is considerably more powerful than the Wald test and this holds for, both, the quadratic form statistics and the maximum statistics.
We further studied for scenario (a), the power of closed testing procedures, which utilize the above tests for each intersection hypothesis, to reject particularly H 1 , H 2 or H 3 , as well as the power to reject at least one elementary hypothesis or all elementary hypotheses, see Table 4. For comparison, the closed testing procedure based on Bonferroni tests (which results in the Bonferroni-Holm procedure) is included. Similar to the results on the global test, the maximum-type Wald test is more powerful than the Bonferroni-Holm test and the score tests are for most decisions more powerful than the Wald tests.
The coverage probability of simultaneous confidence intervals for ð ð1Þ 1 , ð2Þ 1 , ð3Þ 1 Þ did not depend on the actual values of the coefficients, up to simulation error. The observed values for scenario (a) are shown in Table 5.

Statistic
Type Approximation Bias adj.
Type I error rate (%) Both considered small sample adjustments are required to achieve a coverage probability of at least the nominal value of 95%. Similar results were observed in the simulation contained in the supplemental material Section S.2.
In a further simulation study, we investigated the impact of increasing the number of endpoints and increasing the correlation between endpoints for scenarios with K ¼ 40 subjects and an effect in all endpoints under the alternative. We included only those tests that controlled the type I error rate. The power was calculated for sample sizes K ¼ 40 and K ¼ 100 in three different scenarios. In scenario (a), there was an effect in all three endpoints ( ðiÞ   The simulation results are shown in Figure 1. Here, we also included the case M ¼ 1. To allow for an unambiguous comparison with the case of multiple endpoints of different types, we computed for M ¼ 1 the results for models with a single continuous, count data and binary endpoint, respectively, and plotted the average power across these three models from a total of 2 Â 10 4 simulations. The benefit of the maximum-type Wald and score tests over the Bonferroni test becomes more pronounced for larger correlations and their power (under the considered alternative with an effect in all endpoints) is increasing with the number of endpoints. Under high correlations, the power of the maximum-type Wald and score test is approximately constant with an increasing number of endpoints, whereas the power of the Bonferroni test is decreasing. Note that in an extreme case of correlation 1, the maximum test would be identical to a test for a single endpoint, with no loss in power, whereas the Bonferroni test would correspond to a single-endpoint test at level /M hence loosing power.
The power of the quadratic form Wald and score tests depends strongly on the correlation between endpoints. As seen in Table 3 and discussed in Section 3.3, if there is an effect in all endpoints and the endpoints are positively correlated, the direction of the effects is not the direction of deviations from the null hypothesis which are considered particularly large by the metric of these tests. This becomes more pronounced with an increasing number of endpoints and increasing correlation, and the power of the quadratic form tests decreases rapidly.

Software implementation
The proposed test procedures were implemented in the R-package 'mmmgee' 31 that is available from the CRAN repository. 32 The model fitting routines are based on those of the R-package 'geeM', 33 and multivariate normal or t distribution probabilities are calculated using the package 'mvtnorm'. 34 The mmmgee package provides three main functions: geem2 fits marginal GEE models as described in Sections 2.1 and 2.2. mmmgee calculates the estimate (equation (3)) or the bias adjusted estimate (equation (4)) of the covariance matrix of a stacked vector of regression coefficients from multiple marginal GEE models fitted with geem2. mmmgee.test calculates the multiple hypothesis tests and simultaneous confidence intervals described in Section 3. The latter functions are applied to a list of models fitted with geem2. As a special case, the package may also be applied to test hypotheses within a single GEE model.
An instance of a simulated data set with M ¼ 3, K ¼ 40 and intermediate correlations as described in Section 5 is included in the package as exemplary data. The R code below invokes an analysis as used in the simulation studies in Section 5. Marginal GEE models are fit for the three endpoints using geem2. The function mmmgee.test is applied to test the global null hypothesis H 0 : ð1Þ 1 ¼ ð2Þ 1 ¼ ð3Þ 1 ¼ 0 as well as the elementary hypotheses H 1 : ð1Þ 1 ¼ 0, H 2 : ð2Þ 1 ¼ 0 and H 3 : ð3Þ 1 ¼ 0 in a closed testing procedure. In the example, a maximum-type Wald test using the bias adjusted covariance matrix estimate and a multivariate t reference distribution is requested.
The output includes the test statistic, degrees of freedom and p-value for the test of the global null hypothesis as described in Section 3.1. It further shows the estimated contrasts, which in this case correspond to ð1Þ 1 , ð2Þ 1 and ð3Þ 1 , the right hand side value of each contrast under the respective null hypothesis, the unadjusted p-values and multiplicity adjusted p-values according to the closed testing procedure of Section 3.4: library(mmmgee) data(datasim) mod1<-geem2(Y.lin$gr.langþx1,id¼id,data¼datasim, family¼"gaussian",corstr¼"exchangeable") mod2<-geem2(Y.poi$gr.langþx2,id¼id,data¼datasim, family¼"poisson",corstr¼"exchangeable") mod3<-geem2(Y.bin$gr.langþx3,id¼id,data¼datasim, family¼"binomial",corstr¼"exchangeable") Li<-matrix(c(0,1,0),nrow¼1) mmmgee.test(list(mod1,mod2,mod3),L¼list(Li,Li,Li), statistic¼"Wald",type¼"maximum",biascorr¼TRUE, asymptotic¼FALSE,closed.test¼TRUE) Pan and Wall 9 report good results for single GEE models using a Satterthwaite approximation. However, this method requires the estimation of the variance of the covariance matrix estimate. Alternatively, degrees of freedom may be calculated from the effective sample size, which is the number of independent observations that would result in the same efficiency as the observed sample of partially dependent observations. 22 This method requires that the covariance structure is correctly specified, which is otherwise not required for GEE models. In some cases, it may be reasonable to attribute different degrees of freedom to different tested contrast, e.g. if the number of parameters strongly differs between the marginal models. In that case, a method described by Hasler and Hothorn 8 may be utilized: For each tested contrast, a critical value is calculated from a multivariate t-distribution with common correlation matrix and contrast-specific degrees of freedom. The test decision is based on comparison of the individual test statistics with the respective critical values. Another way to improve the distributional normal approximation of a statistic is to directly calculate and adjust for the error of approximation. Kauermann and The information in parentheses shows that the bias adjustment for the covariance matrix was applied to all tests using Wald statistics; furthermore, the reference distributions with abbreviations as in Table 2 are indicated. The results are based on 2 Â 10 4 simulation runs.
Carroll 37 propose this solution for the case of univariate contrast tests in GEE models. The method could in principle be extended to multivariate tests.
The other small sample improvement we investigated is the bias adjustment of the covariance matrix estimate. We focused on the method of Mancl and DeRouen 7 ; however, related methods proposed in the literature for single GEE models 14 may as well be extendable to the multiple marginal models approach.
We regarded the score test as an approximation to the Wald test. Note that generalized score tests with quadratic form test statistics may also be constructed from the score vector U and a generalized inverse of its asymptotic covariance matrix (which may be singular). 26 For the case of a linear hypothesis H 0 , the resulting test is identical to the quadratic form score test motivated via approximation of the Wald test. The latter approach is, however, easily applicable to construct maximum type tests. In the numeric investigations, the score tests did not require adjustments to the asymptotic approximation to control the type I error rate and they were more powerful than the corresponding Wald tests. In contrast to Guo et al., 38 who studied quadratic form score tests for single GEE models, we did not observe a conservative behaviour, but the type I error rate was controlled almost exactly at the nominal level. These results may not hold for all possible analysis scenarios but they suggest the score tests as viable small sample alternative to the Wald tests. Confidence sets for Lb corresponding to a score test may in principle be found as set of all vectors r 0 2 R c such that H 0 : Lb À r 0 ¼ 0 is rejected. 38 To provide contiguous intervals or sets, the test statistic needs to be a convex function of r. In contrast to the Wald statistic, the score statistic depends on r in a non-trivial way, as the statistic is based on a model fitted under the constraint Lb À r ¼ 0. Thus, the required convexity property needs to be checked for each given class of models, which may not be easily done except for some special cases.
We focused on two-sided inference. The extension to one-sided tests for null hypotheses of the form H 0 0 ¼ \ c i¼1 fðLbÞ i r i g and according one-sided confidence intervals is straight forward for maximum-type Wald tests (compare with Hothorn 20 ). The least favorable configuration under H 0 0 is Lb ¼ r sinceD is estimated without restriction and the multivariate normal or t reference distribution is monotone in the assumed mean vector. Hence, evaluation of the one-sided Wald test under the configuration Lb ¼ r is sufficient. To extend the maximum-type score test to one-sided hypotheses, the restricted estimate e b has to be calculated under the according inequality restriction which is subject of further research.