Imputing missing covariate values for the Cox model

Multiple imputation is commonly used to impute missing data, and is typically more efficient than complete cases analysis in regression analysis when covariates have missing values. Imputation may be performed using a regression model for the incomplete covariates on other covariates and, importantly, on the outcome. With a survival outcome, it is a common practice to use the event indicator D and the log of the observed event or censoring time T in the imputation model, but the rationale is not clear. We assume that the survival outcome follows a proportional hazards model given covariates X and Z. We show that a suitable model for imputing binary or Normal X is a logistic or linear regression on the event indicator D, the cumulative baseline hazard H0(T), and the other covariates Z. This result is exact in the case of a single binary covariate; in other cases, it is approximately valid for small covariate effects and/or small cumulative incidence. If we do not know H0(T), we approximate it by the Nelson–Aalen estimator of H(T) or estimate it by Cox regression. We compare the methods using simulation studies. We find that using log T biases covariate-outcome associations towards the null, while the new methods have lower bias. Overall, we recommend including the event indicator and the Nelson–Aalen estimator of H(T) in the imputation model. Copyright © 2009 John Wiley & Sons, Ltd.


INTRODUCTION
Multiple imputation (MI) [1] is commonly used to perform statistical inference in the presence of missing data. Unlike simpler imputation methods, it can yield inferences that accurately reflect the uncertainty due to the missing data. MI is typically more efficient than complete cases analysis when covariates have missing values. Implementations in Stata [2,3], SAS [4] and R [5] have led to its widespread use.  broadly missing at hospital level: that is, its measurement appears to have been largely a matter of hospital policy, and it appears to be approximately MCAR. We exclude three patients with no follow-up. A number of possible prognostic variables were available, from which five were selected by analysis of the patients with observed ESR. The variables are listed in Table I. Analysis by multivariable fractional polynomials [16] suggested that wcc and t_mt should be entered into the analysis model as wccˆ3 and log(t_mt+1), respectively. Because of the large number of missing values of ESR, complete cases analysis uses less than half the data set. However, the rest of the data set carries information about the associations between the other covariates and the outcome, so it is sensible to use MI for the analysis of these data. We will use the data to compare different ways to incorporate the outcome in the imputation model.

Multiple imputation
We briefly describe MI for a single incomplete variable X , a vector of complete variables Z and complete outcome Y . We assume that we have an imputation model p(X |Y, Z ; ) parameterized by . Formally, MI involves drawing values of the missing data X mis from the predictive distribution p(X mis |X obs , Y, Z ) = p(X mis |X obs , Y, Z ; ) p( |X obs , Y, Z )d , where p( |X obs , Y, Z ) is the Bayesian posterior distribution of [1]. In practice, this may be achieved (with implicit vague priors) by (1) fitting the model p(X |Y, Z ; ) to the cases with observed X , yielding an estimate (typically an MLE)ˆ with estimated variance-covariance matrix S ; (2) drawing a value of , * , say, from its posterior, perhaps approximated as N (ˆ , S ); and (3) drawing values of X mis from p(X |Y, Z ; * ) [6].
Where some of the Z variables are also incomplete, the method of MI by chained equations (MICE) [10] starts by filling in missing values arbitrarily, then applies the above univariate method for each incomplete variable in turn, using the current imputed values of Z when drawing new values of X , and vice versa. The procedure is iterated until convergence, which often requires fewer than 10 cycles [2]. An alternative non-iterative procedure is available if the data are monotonically missing [8].
Once imputed data sets have been created, analysis is performed on each data set separately. Let Q r be the point estimate of a (scalar or vector) parameter of interest for the r th imputed data set (r = 1, . . . , m) with variance-covariance matrix U r . These values are then combined by Rubin's rules [1]: the overall point estimate isQ = (1/m) r Q r with varianceŪ +(1+1/m)B, wherē U = (1/m) r U r and B = (1/(m −1)) r (Q r −Q)(Q r −Q) T . Tests and confidence intervals for a scalar parameter are constructed using a t-distribution with degrees of freedom given by Rubin's formula [1] or an alternative [17].

Conditional distribution of covariates
We now focus on the case of a survival outcome T with event indicator D (1 for events, 0 for censored observations). We assume that the outcome follows the Cox proportional hazards model where again X is incomplete and Z is complete. We also need an 'exposure model' p(X |Z ; ) in order to allow for the incomplete X .
In the appendix, we prove a number of exact and approximate results about the imputation model p(X |T, D, Z ) in terms of the model parameters = ( , X , Z , h 0 (.)). These results are used to motivate regression models p(X |T, D, Z ; ), where the parameter is some function of . In practice, we do not know , but we can estimate the parameters directly from the complete cases. Therefore, the models below are stated in terms of the unknown parameters , which typically differ across different models.
First, with binary X and no Z , we have In other words, the missing X may be imputed by fitting a logistic regression of X on D and H 0 (T ) to the complete cases.
Second, with binary X and binary or categorical Z , if we take the most general exposure model logit p(X = 1|Z ) = Z , then we get where terms such as 3Z represent a set of dummy variables with their coefficients.
In other cases we can only obtain approximate results. For binary X with more general (possibly vector-valued) Z , we make a Taylor series approximation for exp ( Z Z ) that is valid when Z Z has small variance. Using the exposure model logit p(X = 1|Z ) = 0 + 1 Z gives and the addition of an interaction term 4 H 0 (T )Z improves the accuracy of the approximation. Further, if the user believed that a particular transformation of Z was needed for predicting X , then this transformation should be entered in the imputation model.
For Normal X , we make a fuller Taylor series approximation for exp ( X X + Z Z ) that is valid when X X + Z Z has small variance. Using the exposure model X |Z ∼ N( 0 + 1 Z , 2 ) gives using a first-order approximation, and again the addition of an interaction term 4 H 0 (T )Z improves the accuracy of the approximation. Equations (A7) and (A8) in the appendix suggest that departures from the above model will be most marked when both var( X X ) and H 0 (t) exp ( XX + ZZ ) (roughly the overall cumulative hazard at the event time T ) are large.

A small empirical investigation
We explored the distribution of X |T, D empirically using 100 000 simulated data points and the model described above with standard Normal X , X = 0.7, h 0 (t) = 1 (so H 0 (t) = t) and censoring times uniformly distributed on [0, 2]. Figure 1 shows smoothed graphs of the conditional mean and standard deviation, E[X |T, D] and SD(X |T, D). A linear regression on D and H 0 (T ) would be shown by parallel straight lines for the mean and a constant SD. Some departures from linear regression are seen: the mean graphs are somewhat curved and converging, and the SD declines with T .
Taken together, these results suggest that logistic or linear regression of X on D, H 0 (T ) and Z may be appropriate in many situations, and that including an interaction between Z and H 0 (T ) may improve the approximation, but that the approximation will not work well in situations with strong covariate effects and large cumulative incidences.

Implementation
In practice, H 0 (T ) is unknown and must be estimated. We consider three possible methods.

Substantive knowledge.
In many applications, the baseline hazard may be approximately known: for example, in following a cohort of healthy individuals over a small number of years, the baseline hazard could be assumed to be roughly constant. In this case it would be reasonable to assume H 0 (T ) ∝ T . This may be a useful 'off-the-shelf' method.

Nelson-Aalen method.
When the covariate effects X and Z are small, we may approximate H 0 (T ) ≈ H (T ), which is easily estimated before imputation using the Nelson-Aalen estimator. It seems possible that this method will perform well for moderate sized X and Z because small errors in estimating H 0 (T ) are unlikely to have much impact on the imputations.

Cox method.
We also propose estimating H 0 (T ) iteratively: first, imputing X using the current estimate of H 0 (T ), then fitting the Cox proportional hazards model to the data using the current values of the covariates X, Z and extracting a revised estimate of the baseline hazard function H 0 (T ). This fits conveniently within the MICE algorithm: in each imputation cycle, as well as updating each incomplete variable in turn, we also update H 0 (T ) by fitting the Cox model.
Because it is unlikely that H 0 (T ) will change much from one iteration to the next, we also consider a less computationally intensive version in which H 0 (T ) is updated only on the first k cycles. Here we will use k = 2.

Theoretical properties.
We note that the methods described in Sections 3.4.2 and 3.4.3 do not acknowledge the uncertainty in estimating H 0 (T ). As a result, they are not Bayesianly proper [1], so that standard errors may be too small and confidence intervals may be too narrow. However, we do allow for uncertainty in the coefficient of H 0 (T ), so we do not expect any undercoverage to be important.

SIMULATION STUDY
We now present simulation studies to compare the methods introduced in Section 3. These are summarized in Table II. We first consider the simple case of binary or Normal X and no Z .

One covariate: design of simulation study
The covariate X was either binary with P(X = 1) = X or standard Normal, so that its standard deviation = √ X (1− X ) or 1, respectively. X was missing completely at random with probability M . Survival times were drawn from a Weibull distribution h T (t) = T t −1 exp ( X X ). Random censoring times were drawn from a Weibull distribution with the same shape parameter, h C (t) = C t −1 . The parameter values used were M = 0.5; X = 0.5; T = 0.002; = 1; X = 0, 0.5, 1; and C = 0.002 (corresponding to approximately 50 per cent censoring). When X = 0, Table II the sample size n was chosen to give 90 per cent power to detect a significant association between binary X and survival at the 5 per cent level, using Collett's formula [18]. For Normal X , the sample size was chosen to be the same as for binary X with the same value of X . When X = 0, the sample size was chosen to be the same as that for X = 0.5.
In sensitivity analyses, one parameter was varied at a time: 'High censoring', C = 0.01 (corresponding to approximately 83 per cent censoring); 'low missing', M = 0.2; 'shape 2', = 2; and 'administrative censoring', censoring at a fixed time computed to give the same censored fraction as random censoring when X = 0.
The imputation methods described above were used to construct m = 10 imputed data sets. The analysis model was a Cox regression on X . Results from the imputed data sets were combined using Rubin's rules as described in Section 3.1. For comparison, we also analysed each simulated data set before introducing missing values (PERFECT) and using complete cases only (CC). In each case we estimated the bias and the empirical standard error of the point estimate; the relative error in the average model-based standard error, defined as its difference from the empirical standard error of the point estimate minus 1; the coverage of a nominal 95 per cent Normal-theory confidence interval; and the power of a Normal-theory 5 per cent significance test of the null hypothesis of =0. Table III shows the key results. NO-T is strongly biased towards the null, the proportionate bias equalling the proportion of missing data. LOGT is mildly biased (up to 10 per cent) towards the null. All other methods have no appreciable bias. All methods except NO-T have similar empirical standard errors (results not shown). NO-T has smaller empirical standard error as a result of its bias towards the null.

One covariate: results for binary X .
All methods except NO-T have model-based standard errors that compare well with the empirical standard errors. NO-T has a standard error that is up to 70 per cent too large. All methods except NO-T have coverage between 93 and 96 per cent in all cases, while coverage of NO-T varies from 73 to 100 per cent (results not shown).
Power was very low for NO-T, reduced by up to 6 per cent for LOGT and by up to 3 per cent for T2 (only when = 1), compared with other methods. Differences in power between other methods appear to be consistent with chance.
Results with 'high censoring' were very similar to the base case; results with 'low missing' showed weaker patterns than the base case; and with 'Shape 2' and 'Administrative censoring', results with < 1 showed weaker patterns than those shown with = 1.
Because of its poor properties, we do not consider NO-T in further simulation studies. Table IV show that some methods, notably COX but also LOGT and T2 (when = 1), show small bias towards the null. (Note that when n=84 the PERFECT and CC methods show small-sample bias away from the null.) We did not explore precision because of the presence of bias. Coverage was 93-96 per cent for LOGT, T, T2 and NA, and 92-97 per cent for COX and COX* (results not shown). Power was greatest with T, NA and COX* methods. T2 had noticeably less power than T when =1, but was not superior when = 2. We conclude that LOGT is somewhat suspect because of potential bias towards the null. All other methods considered are adequate, and T, NA and COX* may be the best. There is no gain from the extra computational burden in COX, which if anything performs worse than COX*.

Two covariates: design of simulation study
We next added in a complete covariate Z . We took X and Z to be standard Normal with correlation . The analysis model was now h T (t) = T t −1 exp ( X X + Z Z ). We were especially interested in seeing what happens as X and Z get larger, since Section 3.2 suggested that this is where our approximations may break down.
We induced missing data in X only, using a MCAR mechanism as before. We took M = 0.5, T = 0.002, = 1 and random censoring with C = 0.002 in all simulations: these choices for M and C were found in the univariate study to be most sensitive to different analysis methods. Further, we took all combinations of = 0, 0.5; X = 0, 0.25, 0.5; and Z = 0, 0.25, 0.5.
To explore how the missing data mechanism affects the results, we repeated the bivariate simulation under the MAR mechanism logit P(M X |X, Z ) = Z , where M X indicates missingness of X : this yielded 50 per cent missing values. We did this in the case X = Z = 1 only.
We used all the methods proposed before, with the exception of NO-T, which had performed very poorly, and COX, which had not performed well enough to justify its computational burden in the univariate study. In addition, we introduced a modification of the NA method that includes the interaction of Z with H (T ) in the imputation model: we call this method NA-INT.

Two covariates: results.
Results forˆ X are given in Table V. We first consider the MCAR case. Bias towards the null increases with increasing values of X , Z and . It is worst for T2, being up to 20 per cent of the true value of . Precision is not compared because of the presence of bias. Model-based standard errors are up to 17 per cent too high, with the discrepancy increasing with X . Despite these problems, coverage was adequate (94-97 per cent) for all methods (results not shown). Power was greatest with T, NA, NA-INT and COX* methods, and worst for LOGT and T2. The NA-INT method performed very similarly to the NA method.
Results for the MAR case show increased bias inˆ X , increased error in the model-based standard errors and decreased power, but the comparisons between methods are similar to the MCAR case.
Results forˆ Z are given in Table VI. There was small bias away from the null when X >0 and =0.5 because the small bias inˆ X seen previously leads to residual confounding. Model-based standard errors were all accurate to within 10 per cent. Coverages ranged from 94 to 97 per cent (results not shown). Power was similar for all MI methods, but was substantially greater for MI than for CC.

RESULTS FOR THE RENAL CANCER DATA
As stated in Section 2, the analysis model of interest for these data is a proportional hazards model including covariates esr, haem, who, trt, (wcc)ˆ3 and log(t_mt+1). For ease of comparison, we scale the quantitative covariates esr, haem, (wcc)ˆ3 and log(t_mt+1) by their CC standard deviations.
Before imputing the missing values, the skewed variables wcc and t_mt were transformed to an approximate Normal distribution using the lnskew0 program in Stata, which replaces a variable X with log (±X −k) where k and the sign are chosen so that log (±X −k) has zero skewness. Although esr was non-Normally distributed, it was not transformed because exploratory linear   Bold cells indicate estimates that differ from the NA estimate by more than 20 per cent of the NA standard error, or standard errors that differ from the NA standard error by more than 20 per cent of the NA standard error. Monte Carlo error in parameter estimates is no more than 0.003 in all cases.
regression on the other covariates suggested that its conditional distribution was approximately Normal. Imputation was performed on the transformed variables using the ice routine in Stata [2, 3] and including the outcome variables appropriate to each method. Transformed values of wcc and t_mt were converted back to the original scale and then formed into the terms wccˆ3 and log(t_mt+1) for the analysis model. The COX and COX* methods were implemented by additional programming within ice. We used m = 1000 imputations so that Monte Carlo error did not disguise any differences between methods.
We first look at differences between the CC method and all imputation methods (Table VII). One would expect standard errors for the coefficient of a variable X to be smaller by MI than by CC when there are a substantial number of observations with observed X , but missing data in other variables. In the present data, this would suggest that, compared with CC standard errors, MI standard errors would be similar for esr and smaller for all other variables. The expected pattern is observed for the other variables. However, the standard errors for esr are somewhat increased. This may reflect other features of the current data or may be a chance finding. There are substantial differences in point estimates.
Turning to comparisons between imputation methods, the main differences are seen for esr, with T2 and NO-T giving point estimates less than half those for other methods. Smaller differences are seen for other methods. The only other variable whose coefficients show substantial differences between imputation methods is haem. This is the variable with the strongest correlation (−0.61) with esr, and the differences between the methods reflect residual confounding as a consequence of the attenuated estimates of the coefficient of esr.

DISCUSSION
We have developed an approximate theoretical rationale for imputing missing covariates in a Cox model using new methods based on the cumulative baseline or marginal hazard (NA, NA-INT, COX and COX*). These methods have the appealing property that they are invariant to monotonic transformation of the time axis, like the Cox proportional hazards model itself, but unlike more commonly used methods (LOGT and T).
Our simulation study allows us to choose between these methods. The NA method performed at least as well as the more complex NA-INT, COX and COX* methods, appearing to have the lowest bias and highest power in most simulations. We therefore consider this to be the best method in general. The NA method is simple to implement in standard software. For example, using ice in Stata [2,3], after the data have been stset, the Nelson-Aalen estimator is produced by sts gen HT=na and then the MICE algorithm is implemented by ice HT _d X* with appropriate options.
However, all methods were somewhat biased towards the null when covariates were strongly predictive of outcome. This is because the imputation models were not entirely correct. The MI procedure might be improved by using predictive mean matching [19], which aims to draw from the empirical distribution rather than the fitted conditional distribution. Our explorations of this approach in the context of our simulation studies suggest that it can perform very poorly: in particular, when there is no true association between covariate and outcome, predictive mean matching gives implausible distributions of imputed values and very variable estimated coefficients. This appears to be a consequence of small imputation models; strengths and limitations of predictive mean matching are a topic for further research.
This paper did not aim to compare MI with complete cases analysis. The standard view is that MI is more efficient than complete cases for estimating the coefficient of a variable whenever some of the other model covariates are incomplete [11]. Our results support this view, since MI procedures had greater power than complete cases in Table VI but not in Tables III, IV and V.  Indeed, MI had worse power in Tables III, IV and V because we used too few imputations: had our aim been a fair comparison of MI with complete cases, we would probably have needed to use m = 50 or more imputations.
We assumed a proportional hazards survival model. Other non-parametric survival models, such as the accelerated life model and the proportional odds model, do not yield simple imputation models for the covariates. For the proportional odds model, it can be shown that a Taylor series approximation with X ≈ 0 suggests a logistic model for X on S 0 (T ), D and their interaction. Thus in principle, different methods are required for these models. We suggest that the NA method might be a reasonable first choice, but that more flexible imputation models should be carefully considered.
We have explored and compared methods in the setting of a single incomplete covariate, but our finding that D and H 0 (T ) should be included in imputation models for incomplete covariates is equally relevant for any form of MI that is based on regression models for incomplete covariates. These include imputing from a multivariate normal distribution [13], imputing using monotone missing methods and imputing via chained equations [5]. We therefore recommend that, instead of the logarithm of survival time, imputation should be based on the Nelson-Aalen estimate of the cumulative hazard to the survival time.
APPENDIX A Under the PH model h(t|X, Z ) = h 0 (t) exp ( X X + Z Z ), the log-likelihood for the outcomes, given complete data, is = + X D + X H 0 (T ) (A3) so the correct model for imputing missing X is a logistic regression on D and H 0 (T ). If in addition X is small, the model further simplifies to logit p(X = 1|T, D) = + X (D − H 0 (T )), but this simplification is unlikely to be useful in practice, and we do not pursue it further. More generally, in the presence of a single categorical Z , model (A2) is exactly a logistic regression on D, Z , H 0 (T ) and the interaction between Z and H 0 (T ).
In other cases, we have no exact results. However, if we assume an exposure model logit p(X = 1|Z ) = 0 + 1 Z and approximate exp ( Z Z ) ≈ exp ( ZZ ) (whereZ is the sample mean of Z ) in (A2) for small var( Z Z ), we get