Epidemiologic Perspectives & Innovations BioMed Central Methodology

Longitudinal studies are helpful in understanding how subtle associations between factors of interest change over time. Our goal is to apply statistical methods which are appropriate for analyzing longitudinal data to a repeated measures epidemiological study as a tutorial in the appropriate use and interpretation of random effects models. To motivate their use, we study the association of alcohol consumption on markers of HIV disease progression in an observational cohort. To make valid inferences, the association among measurements correlated within a subject must be taken into account. We describe a linear mixed effects regression framework that accounts for the clustering of longitudinal data and that can be fit using standard statistical software. We apply the linear mixed effects model to a previously published dataset of HIV infected individuals with a history of alcohol problems who are receiving HAART (n = 197). The researchers were interested in determining the effect of alcohol use on HIV disease progression over time. Fitting a linear mixed effects multiple regression model with a random intercept and random slope for each subject accounts for the association of observations within subjects and yields parameters interpretable as in ordinary multiple regression. A significant interaction between alcohol use and adherence to HAART is found: subjects who use alcohol and are not fully adherent to their HIV medications had higher log RNA (ribonucleic acid) viral load levels than fully adherent non-drinkers, fully adherent alcohol users, and non-drinkers who were not fully adherent. Longitudinal studies are increasingly common in epidemiological research. Software routines that account for correlation between repeated measures using linear mixed effects methods are now generally available and straightforward to utilize. These models allow the relaxation of assumptions needed for approaches such as repeated measures ANOVA, and should be routinely incorporated into the analysis of cohort studies.


Introduction
Injury-related mortality rate estimates are often analyzed under the assumption that case counts follow a Poisson distribution. [1][2][3][4] Certain types of injury incidents occasionally involve multiple fatalities, however, resulting in dependencies between cases that are not reflected in the simple Poisson model and which can affect even elementary analyses of rates. This paper examines the application of the compound Poisson process model [5][6][7] to address this issue, emphasizing adjustments to some commonly used interval estimators for rates and rate ratios. Accompanying examples demonstrate the proposed adjustments and provide comparisons of results obtained under the Poisson and compound Poisson process models. This paper was motivated by the need for basic statistical methods applicable to data from the National Violent Death Reporting System (NVDRS). [8] The NVDRS data provide a census of violent deaths occurring in the states covered by the reporting system. Data are collected on incidents and persons (victims/suspects), and records for all persons associated with each incident are linked. Two types of incidents (not mutually exclusive) are of particular note in the present context: (i) those involving multiple homicides and (ii) those involving homicide followed by suicide. The NVDRS data for the year 2004 (covering 13 states) indicate that over 4% of homicide-related incidents involved multiple homicides. Correspondingly, at the individual (person) level approximately 9% of homicides were associated with multiple-homicide incidents. These data also show that nearly 12% of homicide-suicide incidents involved multiple (usually two) homicides, while at the individual level approximately 23% of homicides associated with a homicide-suicide incident were part of a multiple homicide-suicide incident.

An Analysis Framework Based on the Compound Poisson Process Model
Analyses of vital statistics data often rely on a conceptual framework in which case counts, even though based on a census, are considered inherently variable. [1,2,4,9,10] In selected National Center for Health Statistics reports, for example, mortality rate estimates are evaluated for statistical stability by assuming that census-level case counts (rate numerators) follow a Poisson distribution while atrisk population estimates (rate denominators) are assumed constant. [1,2] The simple Poisson model includes the assumption that cases occur independently. Cases (fatalities) associated with multiple-case incidents are not independent, however, and for this reason such a model does not adequately characterize the types of incidents described above. The compound Poisson process model [5][6][7] provides a closer conceptual parallel, by incorporating a two-level counting process. Applying this model to the NVDRS data, incident counts represent the first level and are assumed to follow a simple Poisson distribution. The counts of cases associated with each incident represent the second level. These incident-specific case counts are assumed to follow a common (discrete) probability distribution of unspecified form. Incident-specific case counts are also assumed to be (i) independent across incidents and (ii) independent of the count of incidents. [5][6][7] As an illustration of the basic aspects of the compound Poisson process model, suppose that occurrences of a specific type of incident are consistent with a Poisson process having person-year rate parameter λ. Letting person-years at risk be denoted by P, it follows that the incident count N has a Poisson distribution with (unknown) mean λP, denoted by N ~ Poisson(λP). At the next level, suppose that the incident-specific case counts C 1 , C 2 , C 3 , ..., C N have a common underlying distribution with mean μ and variance σ 2 (both generally unknown) and satisfy the independence assumptions specified above. The total case count then conforms to a compound Poisson process model, with underlying mean E [C] = λP × μ and underlying variance Var(C) = λP × (μ 2 + σ 2 ). [5][6][7] Effectively, the compound Poisson process model reduces to the simple Poisson model in analyses where multiplecase incidents do not occur. [7] In such situations C k = 1 for every incident, so that the total case count C is equal to the incident count N, with the latter variable assumed to follow a simple Poisson distribution. In this way, the framework based on the compound Poisson process model encompasses the more customary analysis framework based on the simple Poisson model.

Rate Estimators and Variances
Using terms defined above, the typical estimate of the population-based case rate per 100,000 person-years is provided by R ≡ C/P × 100,000. Under the compound Poisson process model E [R] = E [C]/P × 100,000 = λ × μ × 100,000. Since this latter quantity also corresponds to the underlying case rate per 100,000 person-years, it follows that R is an unbiased estimator.

Confidence Intervals for Rates
Confidence intervals for rates and rate ratios frequently involve an initial logarithmic transformation to the point estimate. Applying this transformation to the rate estimator R, the "delta" method [11][12][13][14] suggests the following general form (not model-dependent) for an approximate 95% confidence interval for the underlying rate [15]: When R is based on a total case count C that is assumed to follow a simple Poisson distribution, interval (1) simplifies to the more recognizable form [10,16,17]: Alternatively, when the total case count C is assumed to conform to a compound Poisson process model, substitution of the earlier expression for Vâr(R) into (1) provides the following adjustment to interval (1a) to account for multiple-case incidents: where C 1 , C 2 , C 3 , ..., C N are the incident-specific case counts defined previously. By inspection of the square root terms in (1a) and (1b), it can be seen that the estimated variance of ln(R) is increased by the factor , exp ln( ) .
under the compound Poisson process model. When no multiple-case incidents appear in the data it follows that (because C k = 1 for each incident covered), whereupon interval (1b) reduces to interval (1a) and the distinction between models becomes academic.

Example 1
This example briefly demonstrates the calculations required for intervals (1a)  The resulting interval estimate (0.104, 0.238) is approximately 19% wider than the unadjusted interval estimate.

Example 2
The coverage properties of the unadjusted and adjusted confidence intervals (1a) and (1b) were compared using stochastic simulation. Assuming that the Poisson distribution is an appropriate model for incident counts, the relevant simulation inputs are the underlying mean incident count λP and the underlying distribution of cases within incidents. The ranges of the simulation input values were selected in part to cover the observed values from Example 1.
The underlying mean incident count λP can be varied either through λ (the incident occurrence rate per personyear) or through P (person-years at risk) and for purposes of simulation the choice of which to vary is arbitrary. Therefore, P was held constant at 20 million person-years while λ was varied across the values 0.050 × 10 -5 , 0.125 × 10 -5 , 0.250 × 10 -5 , and 0.500 × 10 -5 . These values correspond to underlying mean incident counts of λP = 10, 25, 50, 100 for the hypothetical period of observation.
Each value of the incident occurrence rate λ was considered in combination with five different within-incident case count distributions. The within-incident case count distributions are denoted by quadruples (p 1 , p 2 , p 3 , p 4 ) indicating the probabilities that an incident involves 1, 2, 3, or 4 cases of interest, respectively. The first within-incident distribution (0.76, 0.24, 0.00, 0.00) matches the observed distribution from Example 1 in which each incident involved one or two cases. The second within-incident distribution (0.95, 0.05, 0.00, 0.00) reflects a comparatively small fraction of incidents involving multiple cases. The remaining three distributions reflect a successively increasing fraction of incidents involving multiple cases as well as larger numbers of cases. Consistent with earlier notation, μ and σ 2 denote the mean and variance for each within-incident case count distribution.
The simulation was programmed using the Statistical Analysis System (SAS). [19] One-hundred thousand simulation replicates were generated for each combination of simulation inputs. Each replicate involved simulation of an incident count according to a Poisson distribution having the indicated mean λP, followed by simulation of incident-specific case counts according to the indicated discrete distribution. The 95% confidence intervals (1a) and (1b) were calculated for each simulation replicate, and for each interval estimate it was noted whether the underlying case rate per 100,000 person-years (λ × μ × 100,000) fell between the interval endpoints. The relative frequencies with which intervals (1a) and (1b) covered C C k k , exp ln( . ) . 0 157 1 96 the true case rate for the various combinations of simulation inputs are displayed in Tables 2 and 3, respectively.
The results in Table 2 indicate that as incidents involving multiple cases and larger numbers of cases become more frequent, coverage for the unadjusted interval (1a) drops well below the nominal 95% level. In the first line of the table (paralleling the NVDRS data from Example 1) coverage falls to about 90%. In the second line, where multiplecase incidents are less common, the difference between effective and nominal coverage is minimal. In the fifth line of the table, representing the most extreme departure from one case per incident, coverage is reduced to just over 85%. In contrast, the results in Table 3 show that coverage for adjusted interval (1b) conforms closely to the nominal 95% level under all of the simulation parameter combinations considered.
Further simulation results (not shown) indicate that the performance of interval (1b) is sensitive to extra-Poisson variability in the incident counts. Although more general models compensating for such extra-Poisson variability might be considered, the primary interest here is the effect of multiple-case incidents. Moreover, multiple-case incidents are unmistakable when represented in the data, whereas extra-Poisson variability in the incident counts may be more difficult to detect. The remaining presentation therefore continues to rely on the compound Poisson process model, which adequately addresses the issue of multiple-case incidents and results in tractable computational expressions.

Confidence Intervals for Rate Ratios
The analysis is somewhat more complicated when considering rate ratios as opposed to individual rates. Letting R S1 and R S2 denote rate estimators for two demographic subgroups (for example, persons under 21 years of age and persons 21 years of age or older) the estimated rate ratio is defined in the usual way as RR ≡ R S1 /R S2 . Applying a logarithmic transformation to this ratio, the delta method [11][12][13][14] suggests the following general form for an approximate 95% confidence interval for the underlying rate ratio: Let C S1 and C S2 denote the case counts used to calculate R S1 and R S2 , respectively. These counts (and hence R S1 and R S2 ) are often considered independent, and the covariance term in (2) thus omitted. Assuming that these counts follow a simple Poisson distribution, interval (2) reduces to the more customary interval [10,16,20] for the underlying rate ratio: The adjustment to interval (2a) to account for multiplecase incidents must address dependencies not only within the two subgroups, but also dependencies extending across the subgroups. The latter type of dependency occurs when some multiple-case incidents include cases in both subgroups, and in such instances the covariance term in interval (2) cannot simply be omitted. Let C S1·1 , C S1·2 , C S1·3 , ..., C S1·N denote the incident-specific case counts (some possibly zero) for subgroup 1 (so ). Similarly, let C S2·1 , C S2·2 , C S2·3 , ..., C S2·N denote the incident-specific case counts for subgroup 2 (so ). With P S1 and P S2 denoting person-years at risk for the respective subgroups, an unbiased estimate of Cov(R S1 , R S2 ) under the compound Poisson process model is given by / (P S1 × P S2 ) × 100,000 2 (see Appendix B). Substituting the appropriate variance and covariance estimates into (2) and simplifying provides the adjustment to interval (2a) to reflect both within-group and cross-group dependencies:  As with the adjusted interval for a rate, when no multiplecase incidents are represented in the data it follows that , , all cross-products C S1·k × C S2·k are zero, and the adjusted interval (2b) reduces to the unadjusted interval (2a).

Example 3
The calculations required for intervals (2a) and (2b) are illustrated using NVDRS summary data for the year 2004. These data indicate a total of 144 incidents (13 states) involving homicide followed by suicide. Of these, 127 incidents involved a single homicide, 15 involved a double homicide, one involved a triple homicide, and one involved a quadruple homicide, for a total of 164 homicides. [21] Expanding on Example 1, Table 4 provides a summary of the incident-specific homicide counts associated with all 144 homicide-suicide incidents, with homi-cides classified into the two demographic subgroups described earlier (<21 years of age, 21+ years of age).
The left half of Table 4 contains summary data for the homicide-suicide incidents. The right half of the table shows the calculation of the sums appearing in intervals (2a) and (2b). For example, the third summary line represents four separate incidents, each with two homicides in the first age group and none in the second age group. Since C S1·k = 2 and C S2·k = 0 for each of the four incidents represented by this line, it adds 4 × 2 = 8 to the sum and 4 × 2 2 = 16 to the sum .
The totals at the bottom of the right half of Table 4 are used to calculate the unadjusted and adjusted confidence interval estimates for the rate ratio. In the states covered by NVDRS for 2004, the size of the population for the first age group was approximately 19.8 million, and approximately 48.9 million for the second age group. [18] Since the reporting period covered one year, these population figures approximate person-years at risk in the respective age groups, so the estimated rate ratio is:   The adjusted interval estimate is approximately 10% wider than the unadjusted estimate. In this example, the influence of the increased variance estimates for the rate ratio components (the numerator and denominator rate estimates) is partially offset by the covariance term.
When cases associated with multiple-case incidents are concentrated mostly within subgroups the covariance term in adjusted interval (2b) will be relatively small and this interval will generally be wider than unadjusted interval (2a). For example, if the subgroups represent separate geographic regions, multiple-case incidents will rarely involve both subgroups and the covariance term will be negligible. Conversely, in situations where multiple-case incidents frequently involve both subgroups, the covariance term in (2b) offsets the influence of cases concentrated within subgroups (as in the previous computational example). In some instances the covariance term can dominate to the extent that the adjusted interval is narrower than the unadjusted interval.
The performance of intervals (2a) and (2b) under the types of conditions just described was evaluated using stochastic simulation, assuming various combinations of the incident occurrence rate, the underlying rate ratio, and the within-incident case count distribution (p 1 , p 2 , p 3 , p 4 ). The initial set of simulations randomly assigned all cases associated with any given multiple-case incident to one of the two subgroups (according to probabilities consistent with the assumed rate ratio) thereby reducing the covariance term in interval (2b) to zero. Provided that the underlying mean incident count for each subgroup was not less than 10, the estimated coverage for adjusted interval (2b) was within 1% of the nominal level (95%) for all simulation inputs considered. By contrast, the estimated coverage for unadjusted interval (2a) dropped to about 85% when assuming the most extreme within-incident case count distribution (0.70, 0.20, 0.07, 0.03) from Example 2.
When modified to include multiple-case incidents simultaneously involving both subgroups, the simulations again indicated an effective coverage for adjusted interval (2b) close to the nominal level for the inputs considered. However, the cross-group dependencies introduced by this simple change also resulted in an adjusted interval with average width about the same as that of the unad-   justed interval. Consequently, the estimated coverage of unadjusted interval (2a) was also close to the nominal level.

Confidence Intervals for Age-Standardized Rates and Rate Ratios
The methods considered thus far pertain to crude rates (and ratios of crude rates) estimated using a single case count (numerator) and a single value for person-years at risk (denominator). The treatment of age-standardized rates and ratios of age-standardized rates follows from straightforward extensions of results already presented.
To illustrate the proposed extension for age-standardized rates, assume that there are M age groups into which the data are partitioned and denote the corresponding agegroup rate estimators by R G1 , R G2 , R G3 , ..., R GM . Let ω G1 , ω G2 , ω G3 , ..., ω GM denote corresponding age-group population fractions (assumed fixed) in the referent (standard) population, such that . Applying the direct method of standardization [14] the age-standardized rate estimator is given by: The usual formula for the variance of a weighted sum provides the following expression for the estimated variance of R a : Here, dependencies between cases within any given age group will affect the variance of the age-group rate estimator, while dependencies between cases in different age groups will result in nonzero covariances between the agegroup rate estimators. The analog to interval (1) for agestandardized rates is given by: Appendix equations (A.1) and (B.1) provide variance and covariance estimation formulas applicable to the agegroup rate estimators (with rate scale per 100,000 personyears) assuming a compound Poisson process model. These can be substituted into (3) to obtain a computational formula for Vâr(R a ), which when used in (4) provides an interval adjusting for multiple-case incidents.
When considering a ratio of age-standardized rate estimates for two subgroups, there are three potential types of dependency associated with multiple-case incidents: (i) between cases within the same subgroup and age group, (ii) between cases in different age groups within the same subgroup, and (iii) between cases in different subgroups. All three effects can be simultaneously illustrated by considering the ratio of age-standardized rates for males and females. A multiple-case incident may variously involve several males (or females) in the same age group; males (or females) in different age groups (dependency between case counts contributing to the same age-standardized rate estimate); or both males and females (dependency between case counts contributing to both numerator and denominator rate estimates).
Letting R a·S1 and R a·S2 denote the respective age-standardized rate estimators for two subgroups, the age-standardized rate ratio is estimated by RR a ϵ R a·S1 /R a·S2 . The analog to interval (2) for age-standardized rate ratios is: Computational formulas for Vâr(R a·S1 ) and Vâr(R a·S2 ) under the compound Poisson process model can be obtained as described above for interval (4). Appendix equation (C.1) provides a computational formula for Côv(R a·S1 , R a·S2 ) (with rate scale per 100,000 personyears) assuming the compound Poisson process model. These formulas can be substituted into (5) to obtain an interval adjusting for the effects of multiple-case incidents.
When there are no multiple-case incidents represented in the data, the covariance estimates in (3) vanish as does the term Côv(R a·S1 , R a·S2 ) appearing in (5). Under such circumstances (4) and (5) reduce to intervals appropriate when case counts are assumed to follow a simple Poisson distribution and subgroups are assumed independent. [22] Stochastic simulation was used to evaluate the coverage properties of adjusted intervals (4) and (5) when case counts conform to a compound Poisson process model. When the age distributions of the study groups of interest do not depart too greatly from that of the referent population, the simulation results suggest that coverage levels are comparable to those reported earlier for the adjusted confidence intervals for crude rates and rate ratios. In particular, estimated coverage was close to the nominal level provided that underlying mean subgroup incident counts were not less than 10.

Assessment of Bias
It was noted at the outset that crude rate estimators are unbiased; by extension age-standardized rate estimators are also unbiased. Consequently, the coverage properties of adjusted intervals for crude and age-standardized rates depend on the appropriateness of the compound Poisson process model as well as the accuracy of the normal approximation implied when applying the delta method.
While the crude and age-standardized rate estimators are unbiased, the rate ratio estimators are only asymptotically unbiased. A supplementary assessment of finite-sample bias in the simulations described following Example 3 suggests that it is relatively small compared to the standard error of the rate ratio estimator, for the simulation inputs considered. In none of the simulations was the bias strong enough to cause the effective coverage of the adjusted (compound Poisson) interval to differ substantially from the nominal level.

Conclusion
By referring to the compound Poisson process model in place of the simple Poisson model for case counts, confidence intervals for injury-related mortality rates and rate ratios can be adjusted to account for statistical dependencies associated with multiple-case incidents. The adjustments rely on closed-form computations and offer meaningful improvements in the accuracy of statistical statements. The adjusted interval estimators described in this paper have been programmed as general routines using SAS. [19] When the data show any pattern of multiple-case incidents, the adjusted intervals for rates will be wider than their unadjusted counterparts. This does not hold for the adjusted intervals for rate ratios, however; different patterns in the data can variously widen or narrow these intervals relative to their unadjusted counterparts.
It is evident that in situations where multiple-case incidents are very infrequent and involve small numbers of cases when they do occur, there will be little difference between the statistical results obtained using the methods based on the compound Poisson process model and those based on the simple Poisson model. In the context of the NVDRS data, for example, when suicides are considered separately there is almost no distinction between suiciderelated incident counts and suicide case counts (because multiple-suicide incidents are extremely infrequent).
Conversely, there may be situations covered by other reporting systems where multiple-case incidents are more prominent and/or involve larger numbers of cases than in the examples considered in this paper. Simulations show that in such instances, the gaps between nominal and effective coverage probabilities for the unadjusted interval estimators become quite substantial.

A. The Estimated Variance of a Total Case Count
Let N ~ Poisson(λP) denote the count of incidents and let C 1 , C 2 , C 3 , ..., C N denote the incident-specific case counts.
That is an unbiased estimator for the variance of the total case count under a compound Poisson process model can be demonstrated using a basic conditioning argument. Employing the assumptions specified in the text (particularly the independence assumptions) it follows that: Because the last term in the sequence of equalities corresponds to the underlying variance of the total case count C under the compound Poisson model, it follows that is an unbiased estimator for Var(C).
Since the estimated case rate (per 100,000 person-years) is given by R ≡ C/P × 100,000, an unbiased estimate of Var(R) is: