Likelihood estimation for a longitudinal negative binomial regression model with missing outcomes

Joint damage in psoriatic arthritis can be measured by clinical and radiological methods, the former being done more frequently during longitudinal follow-up of patients. Motivated by the need to compare findings based on the different methods with different observation patterns, we consider longitudinal data where the outcome variable is a cumulative total of counts that can be unobserved when other, informative, explanatory variables are recorded. We demonstrate how to calculate the likelihood for such data when it is assumed that the increment in the cumulative total follows a discrete distribution with a location parameter that depends on a linear function of explanatory variables. An approach to the incorporation of informative observation is suggested. We present analyses based on an observational database from a psoriatic arthritis clinic. Although the use of the new statistical methodology has relatively little effect in this example, simulation studies indicate that the method can provide substantial improvements in bias and coverage in some situations where there is an important time varying explanatory variable.


Introduction and motivation
This work was motivated by on-going analyses of a longitudinal database of information on patients with the chronic condition psoriatic arthritis (Hanley et al., 1988). This database, which was initiated in 1978, derives from the psoriatic arthritis clinic at the Toronto Hospital and records information from visits to the clinic which occur approximately once every 6 months. The principal measure of a patient's disease progression is the number of damaged joints. Damage to a joint is a permanent condition, typically reflecting immobility, and can be contrasted with disease activity, which is reflected by pain and inflammation of joints, which can often be alleviated by treatment.
There are two principal ways to measure joint damage: through a clinical examination and clinician assessment of damage, and through assessment of X-rays of joints, this typically being restricted to hands and feet. The clinical examination can be performed on every clinic visit but is often considered more subjective than the radiological examination, which is generally undertaken at intervals of approximately 2 years. More frequent routine X-rays would not be clinically acceptable owing to safety and resource implications.
Past studies of disease progression (e.g. Gladman and Farewell (1999), Gladman et al. (1994) and Bond et al. (2007)) have used the number of damaged joints developing between visits to the clinic as a primary outcome measure where the joint counts are those determined by clinical examination. The increment in joint counts has been related to explanatory variables, with a particular interest in the relationship between time varying explanatory variables reflecting disease activity and the subsequent occurrence of damage. With complete data, where we observe the damaged joint count and all explanatory variables of interest at each clinic visit, it is straightforward to compute the increment in the number of damaged joints and to fit a relevant model, such as a Poisson or negative binomial regression model.
After 30 years of data collection, the next phase of data analysis will focus on predictors of clinical damage at the individual joint level rather than at the patient level. Before initiating this work, it was felt important to provide as much evidence as possible that findings based on clinical damage are informative about disease progression and, in particular, would be consistent with those that might have been derived through use of radiological measures of damage if these were as frequently available. There is now enough evidence to consider this for patient level measurements and the work that is reported here was motivated by the desire to examine patient level predictors of radiological damage to confirm earlier findings that were based on clinical damage. The situation that arises in such analyses is that we have information on time varying explanatory variables being collected at each visit to the clinic but the primary radiological outcome measure is updated less frequently.
One approach is to consider only the data that are collected on the clinic visits when a radiograph is taken. This enables a regression model to be fitted using standard software for the implementation of generalized linear models (McCullagh and Nelder, 1989) and was done by Bond et al. (2007). However, it may be desired, and felt to be more sensible, to use the updated information on explanatory variables that is available from the intermediate clinic visits when a radiograph has not been taken. The primary aim of this paper is to present a means to implement a likelihood method for performing such estimation and to consider how much difference there is between the first 'naive' but straightforward method and the second more comprehensive approach to the analysis of the radiological outcome data. If consistent results are found between these two methods for the modelling of radiological data, then, given the comparable results that were found by Bond et al. (2007), reassurance is provided concerning the use of clinical measures of damage in the analysis of data at the individual joint level. Although, in this application, consistency of results between the two analyses of radiological data is a particular advantage, in other settings there might be the expectation that the more complicated analysis will differ from and be more appropriate than the naive method. Of course, if this were true for the psoriatic arthritis data, the more important question clinically would be whether the appropriate analysis of the radiological damage is consistent with the analysis of the clinical damage.
Section 2 defines some general notation and presents the technical details for dealing with missing outcome data. We perform some simulations in Section 3 to determine how much the analysis suggested can differ from a naive analysis in situations that are characterized by the influence and variability of an explanatory variable as well as the pattern of missingness. Section 4 outlines very briefly how informative observation might be incorporated into the likelihood methodology. Section 5 applies the method to the data that motivated the work. In addition, models for clinical damage are examined so that by deleting observations we can check how the method compares with the use of complete data with no missing outcomes. The paper concludes with a discussion in Section 6.

Methodology
Assume that each patient i has m i clinic visits (m i 2), and there are n patients. At each visit to the clinic there is, potentially unobserved, a variable D i,j the number of damaged joints, and a vector of explanatory variables X i,j , where 1 i n and 1 j m i . Define the response variable Y i,j = D i,j+1 − D i,j , 1 j < m i , so the current set of explanatory variables is used to predict the next increment in the damaged joint count. Note that the number of increments Y i,j observed is one fewer than the number of visits at which the joint counts D i,j are observed and that there is no restriction that the damaged joint count at the first clinic visit, D i,0 , be 0. We could allow the explanatory variables X i,j to include functions of the damaged joint counts observed up to and including visit j. For simplicity, however, we exclude this possibility initially and extend the methodology to permit it subsequently.
We assume that where f {·} defines a probability distribution function, on the integers, and has a finite number of parameters. The location parameter is assumed to be a known function μ.·/ of a linear combination of the explanatory variables plus any offset terms, η = X β + O, and any further nuisance parameters are given by ψ. In our motivating example, f is the negative binomial distribution, μ is the exponential function and ψ is a single nuisance parameter representing overdispersion. If all the D i,j , and consequently all the Y i,j , were observed then we can define the likelihood function to be In contrast, consider the case where we observe (dropping the i-suffix) X 1 , X 2 and X 3 , the explanatory variables on visits 1, 2 and 3, but we observe only D 1 and D 3 with the second visit missing. In this case we can only infer that 0 Y 1 D 3 − D 1 , 0 Y 2 D 3 − D 1 and Y 1 + Y 2 = D 3 − D 1 . Provided that the observation pattern corresponds to the Y s being missing at random (Rubin, 1976), the likelihood function can be written as where η j = X j β + O j . The fact that the random variable Y 2 has a finite and discrete sample space means that we can calculate the likelihood accurately, rather than having to approximate an integral, which would be the case if the state space was continuous.
Where there is a sequence of multiple missing observations, it is helpful to think of the idea of a multistate model (Cox and Miller, 1965;Billingsley, 1981). Define the states to be the number of damaged joints. If we observe X 1 , . . . , X k−1 and only D 1 and D k then we know that the unobserved sequence of states must be an increasing sequence starting at D 1 and finishing at D k . Modelling assumptions provide us with the transition probabilities. These are defined to be the matrices P j , 1 j k − 1, with elements P j r,c (r, c meaning 'row-column'). P j r,c is the probability that the patient goes to state c at visit j + 1 conditional on their being in state r at visit j, so P j r,c = f {c − r, μ.η j /, ψ} r c, 0 r > c. .2/ For the first transition, P 1 is a row vector, since we have observed that r = D 1 , rather than a range of possible values. For the final transition, P k−1 is a column vector, since we have observed that c = D k . For the remaining values of j (1 < j < k − 1/ P j is a square matrix with D k − D 1 + 1 rows and columns, since r and c both can take values from {D 1 , D 1 + 1, . . . , D k − 1, D k }. The likelihood is the matrix product, which is a scalar number since the first and last matrices are row and column vectors respectively. Hitherto, we assumed that X j did not include any of the values of the response variable observed at current or previous visits. When calculating the transition probability matrix P j for a set of parameter values, this allows us to calculate μ.η j / just once, and then the vector of values of f {·} can be calculated for just the first row of P j , and then reused (without any further calculations) in the subsequent rows. This helps to ease the computational cost. However, if we want to include functions of the current, but potentially unobserved, value of D j at transition j, which is denoted as Z.D j /, then an extension of equation (2) is required. A motivation for this is to take account of correlation. In this case, we modify the location parameter μ.η j / to become The computational cost of this is that to calculate P j now requires a separate evaluation of f {·} with different arguments for every non-zero entry of P j . We have defined the likelihood contribution for every sequence of contiguous missing values that are bounded by observed values. Two or more successive observed values give a simple product of probabilities, similar to equation (1). If the patient's records finish with a sequence of contiguous missing values then these observations are discarded since they give no information. The complete likelihood for the entire data is the product over visits and patients of all such terms. The complete likelihood is then used to derive maximum likelihood estimates or a posterior distribution if combined with prior distributions on the parameters. If the likelihood is maximized by using a Newton-type method (Dennis and Schnabel, 1983) the approximation to the Hessian matrix is used to estimate the covariance matrix for the parameter estimates and subsequently to form confidence intervals that are based on the asymptotic normality of maximum likelihood estimates.

Simulation
For our simulations, we generated independent outcome variables Y from a negative binomial distribution, with dispersion parameter 1, and with the linear predictor 1 + Xβ. The coefficient β took values 1 or 2; the explanatory variable X was a binary variable with values 0 or 1. This implicitly assumes equidistant observations. Our first simulation held X constant within a patient and hence is comparable with a randomized control trial. This situation acts as a validation for the simulation in that little gain can be expected when explanatory variables are not time varying. We generated data for 10 'patients', five of whom had X = 1, and each patient had 10 observations. The situation of 10 patients is not of practical interest but is used to illustrate the nature of possible effects and a more realistic scenario is examined subsequently. To study the effect of the amount of missingness, we selected at random, without replacement, a sample of fixed size from the integers {2, 3, . . . , 9}. These observation numbers were then deleted from the simulated data, and only the sum of the relevant Y s retained (we ensured that the first and last observations were never deleted). The simulated data were then analysed in three ways: using all the original data and standard software to give a benchmark for comparison, using the missing data and standard software where the extra information that is provided by additional observations of explanatory variables was just ignored and using the missing data and the new methods of Section 2. For each model specification, 500 simulations were undertaken and the maximum likelihood estimates and associated 95% confidence intervals were recorded. From these, the mean-squared error, the bias and coverage of the three parameter estimates (intercept, β and dispersion) were estimated. Table 1, part (a), presents selected results of these simulations, tabulated by the number of missing observations and by method.
As expected, all three methods have minimal bias and adequate coverage for the location term and the regression coefficient which should be estimated consistently by all methods. The same patterns were seen for both values of β and therefore results only for β = 1 are presented here and subsequently. If data were simulated according to a Poisson distribution, and Poisson regression models were used, the naive method and our likelihood method would coincide since the explanatory variable x is constant within patients and the sum of Poisson distributions is Poisson. The negative binomial distribution can be considered as a Poisson distribution with a multiplicative random effect whose variance is a function of the dispersion parameter. Hence it is not surprising that the naive method performs reasonably well for the coefficients (when associated with patient constant explanatory variables) but not the dispersion. Thus, although the naive method is biased for estimation of the dispersion parameter of the generating negative binomial, it does provide consistent estimation of the appropriate dispersion parameter for the naive analysis. The difference between these dispersion parameters increases as the number of missing observations increases. With observations that are not equidistant, the naive method would also involve some lack of fit since the sum of negative binomials would not be a negative binomial in this situation but the effect might be minimal in practice. Tables 1, part (b), and 2, part (a), present the results of additional simulation exercises. These were similar to the first two; however, the difference was that the explanatory variable x varied within patients. This is the situation in which the naive method has the potential to give misleading results. The set of explanatory variables was simulated according to the distributions P.X 1 = 1/ = 0:5, The parameter p was set to 0.8 to generate a set of X -values that had a positive dependence, and with p = 0:2 for a negative dependence. These two patterns of explanatory variables were then held constant over 500 simulations.
The results demonstrate that the naive method performs poorly for all parameters, with increasing bias with increased missingness. Our likelihood method is unbiased and gives good coverage. The effect of the missingness is to increase the mean-squared error.
Given that the simulations with a negative dependence produced the largest bias in the naive method, we choose to consider a large sample simulation using the same 'true' model. We repeated the simulation but with 100 patients rather than 10, each patient with 10 observations and with the same patterns of missingness as used previously. The results are shown in Table 2, part (b).
The results are similar to those for the smaller sample size. The bias and mean-square error are similar to those in Table 2, part (a), for all three parameters in the naive method but the bias, mean-squared error and coverage for our likelihood method and the full data become very close with increased sample size. The coverage for the naive method becomes worse, since the standard errors become smaller with increased sample size, but the bias is of order 1.
This simple simulation study illustrates that the methodology in Section 2 is of value primarily when explanatory variables are time varying. Furthermore, the effect of ignoring updates of such variables that are available from visits for which an outcome measure is not taken depends on the extent of mismeasurement of the explanatory variables that this induces.

Informative observation
The previous section was based on an implicit assumption that the pattern of the missing count data was not informative and thus missing at random. To generalize the approach, the observed data can be regarded as coarsened data (Heitjan and Rubin, 1991) in the sense that what is observed is a subset of the complete-data sample space that is defined by all observed and unobserved Y i,j -values. A similar view was adopted by Shardell et al. (2007) for longitudinal observation of discrete event time data. Non-informative observation can then also be termed coarsening at random.
To allow coarsening not at random, we require some model of the observation process. Let G i,j be a variable that describes the level of coarsening. Further, let h{g i,j |y i,j , x i,j , z.d i,j /, γ} be the probability function for G i,j conditional on the joint counts and explanatory variables at visit j and with unknown parameters γ. Then the likelihood development of Section 2 will be unchanged except that all occurrences of f {r, μ.η/, ψ} will be replaced by a product of the form f {·} h{·}.
We let G i,j be a binary variable that takes the value 1 if the increment in damaged joint count Y i,j is observed and 0 otherwise; hence In principle, the observation process could be modelled in other manners, e.g. by a model for the width of the interval in which Y is observed to lie. Again, this could be simply incorporated into the development of Section 2. However, when considering informative observation in the following section, we shall restrict attention to a binary observation variable and further simplify, for illustration purposes, by allowing observation to depend only on the increment in the joint count Y. It is convenient then to adopt the logistic model logit{h.1|y i,j /} = γ 0 + γ 1 y i,j : . 4/ This is the selection model for a dropout process that was suggested by Diggle and Kenward (1994) where a non-zero value for the regression coefficient γ 1 corresponds to a departure from an assumption of missingness at random. The identifiability of such models is highly dependent on distributional assumptions (Fitzmaurice et al., 1996). Also, the effect of missingness that depends only on outcome is expected to be limited with discrete data, being non-existent in the case of logistic regression (Farewell, 1979). We adopt the model represented by equation (4) here simply to illustrate the potential to extend the methods in Section 2. Adapting the methods to pattern-mixture models (Little, 1993) would also be of interest.

Application
The starting point for the results that are reported here was the negative binomial model that was presented by Bond et al. (2007) in their Table 4. The use of a negative binomial model is motivated by overdispersion in the data, particularly linked to an excess of 0s compared with a Poisson model. The model was developed from the psoriatic arthritis data that were introduced in Section 1. A summary of the characteristics of the patients at their initial clinic visits is given in Table 3. The model was for the increment in the number of damaged joints, determined clinically, between clinic visits. On the basis of a process of variable selection, it modelled the relationship between the log-relative-damage rates and explanatory variables: sex, age, time in clinic, initial erythrocyte sedimentation rate (ESR) current number of tender joints, current number of swollen joints, medication (none, non-steroidal anti-inflammatory drugs (NSAIDs), disease modifying anti-rheumatic drugs (DMARDs) and steroids). These correspond to the explanatory variables X as defined in Section 2 and most of these variables are time varying so there is the potential for bias in the estimation of regression coefficients if outcome data are not available at each clinic visit. The model also estimated relative rates that are associated with the current damage count (corresponding to the dynamic covariates Z.D/ which were referred to in Section 2), in particular with a categorization of the damage count (0, [1-4], [5-9] and greater than 9), and an interaction variable being the product of the length of the disease and an indicator of whether the patient has any damaged joints. The offset term O j was the logarithm of the time interval between observations j and j + 1; hence the coefficients estimate the relative damage rate (on a log-scale). Here we consider models with the same explanatory variables and dynamic covariates as chosen by Bond et al. (2007) but which are estimated from different subsets of the data. The specific form of the model can be written as where X i,j and Z i,j are row vectors of the covariates described above for patient i at visit j, which occurred at time t i,j ; β X and β Z are vectors of coefficients to be estimated and θ is the dispersion parameter to be estimated. Our analysis incorporates the interpatient correlation of the observations from one patient through the use of dynamic covariates, i.e. the current damage count and a related interaction. Aalen et al. (2004) have discussed the practical use of this approach. Alternative routes could include using random effects (McCulloch and Searle, 2001), or using generalized estimating equations (Liang and Zeger, 1986). These are discussed briefly in Section 6. Use of the current damage count has the practical advantage of being immediately interpretable to a clinical audience. The implicit assumption is that, conditional on the current damage count, the next increment is independent of previous increments. In previous work (Bond et al., 2007), the model that is represented by equations (5) and (6) was fitted and standard errors of the regression coefficients were estimated by using the generalized estimating equation methodology with an exchangeable working correlation. The estimates and standard errors were very similar to those obtained by using regular maximum likelihood estimation. This is consistent with Cox and Snell (1981), page 23, who advised that it is sensible to be most critical about primary aspects of a problem but simplifying assumptions can be made with regard to secondary aspects. Here, primary interest is in the marginal effects of the explanatory variables, and some reasonable attempt to capture any correlation structure should be adequate.
We examine a series of models that should produce roughly comparable estimates.  (4). Table 4 shows the results of fitting model (a). The regression coefficients reflect the importance of the various explanatory variables on the development of clinical damage. Fig. 1 then shows 95% confidence intervals and estimates for the 18 regression coefficients, the elements of β, and the dispersion parameter for models (a)-(e). Each box refers to a different regression coefficient and there are five error bars for each of the five models that were fitted. There are no obvious patterns in the regression coefficients, either in terms of the estimates or the standard deviations. The dispersion parameter does appear to become larger as the amount of data that is used becomes less and therefore more uncertainty is introduced. It is also larger for the model based on radiological damaged joint counts, model (d), than for the comparable model based on clinical damage, model (c). Comparison of the lengths of the confidence intervals from models (b) and (c) indicates that there is no consistent pattern across regression coefficients.
The logistic parameters, γ 0 and γ 1 , in model (e) were estimated as −1.06 and 0.34 with stan-  . 1. Comparison of estimated parameters and 95% confidence intervals: models (a)-(e) are from left to right dard errors 0.04 and 0.03. Thus there is highly significant evidence that observation depends on the increase in damaged joint count. From Fig. 1, however, it is clear that the coefficients that are estimated for models (d) and (e) are very similar. Thus, for these data, little bias appears to have been introduced by the adoption of a coarsening at random assumption in model (d).
From this example, no consistent evidence emerges that the more frequent updating of explanatory variable information that is enabled by the likelihood methodology of Section 2 is dramatically superior to simpler analyses. In addition, our estimates broadly agree with previous studies of these data (Gladman and Farewell, 1999;Gladman et al., 1994).

Discussion
In this paper we have illustrated how to implement likelihood-based inference for a regression model for longitudinal data when an outcome variable is only partially observed. The method is applicable to maximum likelihood estimation when the outcome variable is the increment in a cumulative, discrete quantity that can be missing at specific times of observation but when the time varying explanatory variables are available at all observation times. Generalization to account for informative observation is relatively straightforward. The choice of probability distribution is arbitrary as long as the sample space is discrete.
As pointed out by a referee, in the general context of regression models for longitudinal data, we cannot expect to improve a likelihood-based complete-case analysis if outcomes are missing but missingness at random holds. Here, the situation is special in that, in spite of the missing data, there is some information on the outcome because the sum over several outcomes is known.
In the example, our choice of the negative binomial can be regarded as a Poisson distribution with a gamma-distributed multiplicative random effect. We could generalize the assumption of conditional independence between increments, given current damage counts, by use of a Poisson model with a gamma-distributed random effect that is common within a patient. To use the multistate model framework to cope with the missing outcomes, we would have to factorize the joint likelihood of all measurements on a patient, which is available in a closed form, into a product of conditional likelihoods: L.y 1 |X 1 / L.y 2 |X 2 , y 1 /. . . L.y k |X k , y 1 , . . . , y k−1 /: After some algebra, it can be shown that L.y i |X i , y 1 , . . . , y i−1 / = Γ.θ i + y i / Γ.θ i / μ y i i λ θ i i .μ i + λ i / θ i +y i , which is very similar to equations (5) and (6), with equation (6) remaining the same, but with θ i and λ i obeying the recursive relationships Hence such a random-effects model could be implemented in principle and is a topic for future research.
In contrast, it is not apparent how to combine the multistate model approach to the missing outcomes, which is a full likelihood approach, with a generalized estimating equation approach (Liang and Zeger, 1986) that uses a working correlation matrix other than the independent correlation matrix. Even the use of a working independence assumption is computationally complex. Hence we have not pursued the generalized estimating equation approach.
Our method was developed in response to a specific application: the analysis of data from psoriatic arthritis patients. In this setting, the uncertainty in how best to measure joint damage required the analysis of radiological damage information which is not observed at each clinic visit. Our software was written by using the R programming language (Ihaka and Gentleman, 1996) and can be downloaded from http://www.mrc-bsu.cam.ac.uk. The download also contains a subset of the data consisting of the response and a few of the covariates. Computational efficiency might be improved with alternative programming packages.
Our simulation studies focused on the negative binomial distribution, which is suitable for count data with overdispersion. It shows that the naive method that ignores the missing observations is biased in estimating regression coefficients when the explanatory variables vary over the observations with missing outcomes. This bias does not improve with larger sample size. However, our likelihood method is consistent and has accurate confidence interval coverage error.
The practical implications of the incorporation of updated explanatory variable information may vary from application to application, depending primarily on the strength of relationships with explanatory variables and, most importantly, the amount of variation over time in the explanatory variables. With our psoriatic arthritis data, the effect was shown to be minimal, which is itself of considerable importance in strengthening the findings of Bond et al. (2007) that demonstrated that the use of clinical measures of damage provided similar information regarding progression of disease as would be provided by the use of radiological measures. Consideration of other realistic settings with comparable characteristics would be informative.