Comparison of different methods in analyzing short-term air pollution effects in a cohort study of susceptible individuals

Background Short-term fluctuations of ambient air pollution have been associated with exacerbation of cardiovascular disease. A multi-city study was designed to assess the probability of recurrent hospitalization in a cohort of incident myocardial infarction survivors in five European cities. The objective of this paper is to discuss the methods for analyzing short-term health effects in a cohort study based on a case-series. Methods Three methods were considered for the analyses of the cohort data: Poisson regression approach, case-crossover analyses and extended Cox regression analyses. The major challenge of these analyses is to appropriately consider changes within the cohort over time due to changes in the underlying risk following a myocardial infarction, slow time trends in risk factors within the population, dynamic cohort size and seasonal variation. Results Poisson regression analyses, case-crossover analyses and Extended Cox regression analyses gave similar results. Application of smoothing methods showed the capability to adequately model the complex time trends. Conclusion From a practical point of view, Poisson regression analyses are less time-consuming, and therefore might be used for confounder selection and most of the analyses. However, replication of the results with Cox models is desirable to assure that the results are independent of the analytical approach used. In addition, extended Cox regression analyses would allow a joint estimation of long-term and short-term health effects of time-varying exposures.


Background
Ambient air pollution has been associated with increases in acute morbidity and mortality [1]. Patients with underlying chronic diseases such as for example diabetes, ischemic heart disease or heart failure may be at particular risk for the effects of ambient air pollution. Series of incident cases may be followed over time to assess the impact of short-term fluctuation on recurrent disease exacerbation. These studies deviate from the classical cohort as well as time-series designs. The major statistical challenge in these studies is the control for changes within the cohort over time due to changes in the underlying risk following disease inception, slow time trends in risk factors within population, dynamic cohort size and seasonal variation. The impact of time is particular important when effects of air pollution are considered, because concentrations of air pollutants vary by season and yearly averages of pollutants change due to weather, air pollution control measures and changes in sources.
In order to evaluate the impact of air pollution on patients with pre-existing diseases, previous hospitalization might be used to define a potentially susceptible subgroup [2][3][4]. Poisson regression analyses and case-crossover analyses have been used to estimate the impact of variations in daily average air pollution concentrations on the risk of death in previously hospitalized subgroups of the population [5][6][7][8].
A multi-city study was designed to follow cohorts of myocardial infarction (MI) survivors in five European cities: Augsburg, Barcelona, Helsinki, Rome and Stockholm (HEAPSS-Study). For entry into the cohort, incident first myocardial infarction cases were considered. The outcomes considered were re-infarctions and other related rehospitalization or death. Ambient air pollution was characterized based on existing air monitoring networks. In addition, condensation particle counters were set up in each location to measure the ambient particle number concentrations (PNC) and to retrospectively estimate PNC for the entire study period in each location. This paper describes and compares different statistical approaches for analyzing short-term air pollution health effects in a cohort with ongoing recruitment during the follow-up. Three different methods to analyze the cohort data were considered: (a) Poisson regression analyses on the calendar-time axis, (b) case-crossover analyses and (c) extended Cox regression analyses. As an example the analyses of the association between NO 2 and any cardiac readmission of MI survivors are presented.

Data collection
Incident myocardial infarction survivors were recruited in five European cities (Augsburg, Barcelona, Helsinki, Rome, and Stockholm) during 1992 to 2000 as has been described elsewhere [9]. Briefly, data sources were AMI Registries in Augsburg and Barcelona, administrative databases of hospital admissions in Helsinki, Rome and Stockholm. Enrollment was restricted to residents of the above cities, aged 35 years or more (35-74 in Augsburg, 35-79 in Barcelona) who had their first AMI (index AMI) during the recruitment period. Subsequent first cardiac rehospitalizations of cohort members within the study area were recorded from day 29 after the index AMI until the individual end of follow-up period defined by death, migration out of the study area, or center specific end of the study. Readmissions of interest were those with primary diagnoses of re-infarctions (ICD-9: 410; ICD-10: i21, i22), acute angina pectoris (ICD-9: 411, 413; ICD-10: i20-i22, i24, i25), dysrhythmia (ICD-9: 427; i46.0, I46.9, i47-i49, r00.1, r00.8) and heart failure (ICD-9:428; ICD-10: i50). Vital status and place of residence at the end of the follow-up period were ascertained for all cohort members.

Statistical analyses
In the study, MI survivors entered the cohort during an extended period. Once the cohort was complete at least one year of follow-up without recruitment was added. As a consequence the following issues had to be considered in the statistical analyses: (a) the number of subjects at risk increases over the recruitment period, (b) the risk of the cohort for a recurrent event on the calendar-time axis varies based on the proportion of recent MI survivors in the entire cohort. In the following we refer to calendar time when we number the dates during follow-up and to cohort-time when we number the person-time followed.

Poisson regression analyses
Epidemiological studies in air pollution research have developed techniques using Poisson regression analyses on the calendar-time axis [10]. In a time series analysis the daily counts of events are regressed on the daily predictor variables such as trend, season, weather, and air pollution. This design assumes that the events are independent and that the event rate is changes only slowly over time. The event rate λ t at the time point t is modeled as follows ln(λ t ) = α + ∑β i x it With λ t = y t /N t being the number of cases y observed at time t divided by the number of subjects N at risk at time t. The model can be rewritten as ln(y t ) = ln(N t ) + α + ∑β i x it In time-series analyses it is sometimes assumed that the underlying population at risk does not change, and therefore the population at risk is not modeled explicitly in the regression analyses. However, when applied to a cohort the number of subjects at risk can be explicitly modeled as given in the equation above. In the cohort setting, when analyzing the data on the calendar time axis, the trend captures three different underlying reasons for changes in the rate of MI over time: (a) long-term underlying trends in the study base due to life-style changes, changes in health care or aging of the population, (b) seasonal variation and (c) changing composition of the cohort. Therefore, in the cohort setting proper trend control is very important because it is likely that the effect estimates will be biased if the trend is not correctly specified. Recent discussions on modeling in time-series analyses [11,12] have lead to a broader use of different smoothing techniques in these analyses and different approaches have been used to assess the trend over time [13]. We selected three approaches to model the trend: (a) natural splines, (b) penalized splines and (c) locally weighted least square (loess) smoothers.
In a hierarchical approach potential confounders were selected, including long term trend, season, days of the week, holidays and meteorology, before adding air pollution concentration as independent variable. Generalized additive models were used to allow for non-parametric functions of the confounders in R (The R Foundation for Statistical Computing Version 1.8.1) using the package "mgcv" (version 0.9-6) [14][15][16]. All models included the natural logarithm of the number of persons at risk each day as offset and the daily number of events as outcome variable. To allow for possible over-or under-dispersion the quasi-likelihood family was used to estimate the parameters without specifying the underlying distribution function. Penalized regression splines were tested for the continuous confounder variables. The choice of degrees of freedom was left to the algorithm ("magic") in the "mgcv" -package that minimized the Generalized Cross Validation (GCV) criterion. The default of 10 knots as starting value was adjusted to higher values if necessary due to the structure of the data. If the smooth function was not significant or the estimated degrees of freedom were less than two, a linear term was tested instead. Decisions for keeping a covariate in a model were based on judgment using the p value (<.1), GCV score (the smaller the better), and the autocorrelation function (ACF) (the nearer to zero the better).
Trend was included in the model as an obligatory variable with a penalized spline starting with 6 knots per year to control for long term trends, seasonality and changes in the baseline risk. Then current day temperature and the deviation of the current day temperature from the mean temperature of lag day 1-3 were tested as penalized splines. At least one temperature term had to remain in the model. Thereafter penalized splines of air pressure and relative humidity, and then dummies for days of the week and the city specific indicators (holidays, population decrease) were tested one after the other. Finally the model was "fine tuned" changing the parameters where the decision had not been clear and comparing ACF plots and GCV-score to choose the final covariates to include for the further analyses.
Sensitivity analyses were performed to compare the results of the final models with the estimates obtained when modifying the smoothing-functions used or the confounders entered in the Poisson model. Thus the analyses were repeated using natural cubic splines or loess instead of the penalized splines, with comparable effective degrees of freedom to those in the final model with penalized splines. Temperature was replaced by apparent temperature [17] and dew point temperature. Instead of temperature difference the mean of lag day 1 to 3 was used. The sensitivity of the results to the choice made when selecting the confounders was tested by excluding or including borderline significant confounders that had or had not entered the model. At the same time, those that had been included with a linear term were considered with a smooth function in the sensitivity analyses.

Case-crossover analyses
Case-crossover analyses have been developed to study transient effects of acute exposures using a case-only design. This design samples information on exposure status from case and referent periods selected from the person-time of the cases. The period of exposure of the cases is selected as a plausible hazard period immediately preceding the event. Referent periods are chosen to represent the exposure distribution in the non-case time periods at risk.
The referent period selection poses the main challenge to the case-crossover analyses. Different sampling designs for referent periods have been proposed to estimate the effects of air pollution. Exposures that do not change in association with the case-status, such as air pollution can be sampled also from person-time after the case occurred [18]. The stratified approach by Lumley and Levy [19] controls time trends by design.
Case-crossover analyses were performed as an alternative to the Poisson regression using conditional logistic regression models in the S-Plus statistical package version 6.0. We used the "coxph" function with a strata statement. The date of each case contributed a hazard period that was matched with referent periods selected with the stratified approach (stratifying criteria were year, month and week-day). It controls for weekday by design. The same confounder variables were included as in the Poisson regression in order to obtain maximum comparability of the models. P-Splines were used as smoothing method for meteorology variables. The effective degrees of freedom of the smoothers in the Poisson models were translated by adjusting the smoothing parameter of the P-splines.

Proportional hazard models
An alternative possibility is to model the data as cohort data using Cox proportional hazard models. Time t now denotes the time since MI. Extended Cox regression allows for time-varying covariates in survival analyses [20]. The hazard h at time t is given as where X(t) = (X 1 , X 2 ,....., X p1 , X 1 (t), X 2 (t),....., X p2 (t)) and X 1 , X 2 ,....., X p1 are time-invariant variables such as for example age or gender, and X 1 (t), X 2 (t),....., X p2 (t) are time-varying variables such as weather or air pollution.
The model makes no assumption about the baseline hazard, and therefore is suitable for the analyses proposed here because the risk for cardiac readmission changes during the follow-up of the MI survivors. While X j (t) is varying over time, the hazard model provides only one regression coefficient δ j for each time-varying variable in the model. Thus at time t, there is only one value of the variables X j (t) that has an effect on the hazard: the value being measured at time t. The model therefore assumes a uniform relative risk for all time points and consequently does not per se address the possibility of effect modification of the air pollution effects by length of follow-up.
The association with daily pollution levels was analyzed in SAS statistical software (SAS Institute Inc., Cary, NC, USA, Release 8.02) PROC PHREG using the counting process style of input. For each subject one record for each day at risk was created. All models included the same cov-ariates as the Poisson models, but instead of smooth functions quadratic functions were used. Trend entered as a linear term. As constant covariates age at entry (in years as quadratic function), diabetes, hypertension and sex (as dummy variables) were considered, since they were identified as predictors of survival in a classical Cox regression.

Results
Data from Rome is used to illustrate the properties of the data. Rome was selected because it had one of the larger data sets and had clear evidence for seasonal variation. In Rome between 1998 and 2000, 7384 subjects survived at least 28 days after their first MI (Table 1). Between 1998 and 2001, 1916 readmissions for any cardiac disease defined as readmission for angina pectoris, myocardial infarction, congestive heart failure or arrhythmia were observed. Figure 1 describes various aspects of the cohort data for any cardiac readmission of MI survivors in Rome displayed on the calendar time-axis. During the followup, the size of the cohort changed daily (figure 1a). In addition, the composition with respect to the distribution of length of follow also changes constantly if considered on the calendar-time axis. Consequently, the number of cases observed at each time-point of the calendar-time axis of the follow-up reflects the size of the cohort and its composition (figure 1b). For each MI survivor, the risk of hospitalization for any cardiac disease is elevated during the first half year of follow-up and thereafter stabilizes (figure 1c). Figure 2 shows the number of subjects followed considering the time of follow-up as time-axis. The number of subjects followed over time steadily decreases due to the occurrence of an event, loss to follow-up or end of the observational period (figure 2a). The number of events (figure 2b) and the incidence rate (figure 2c) observed is greatest at the beginning of the follow-up due to the vulnerability of the patients in the time immediately after the index event. The incidence rate observed after two or three years of follow-up is low (figure 2c).  locally weighted least square smoothers (loess). Here the time-axis is the calendar-time axis and the functions shown in figure 3 correspond to the descriptive data in incidence rates in figure 1c. All functions suggest a decreased probability to observe hospitalizations due to any cardiac disease over time overlaid by seasonal variation. Thereby, they are able to capture the trend components discussed above remarkably well. The different smoothing approaches render quite comparable functions. The different smoothing approaches show comparable air pollution effect estimates in the Poisson regression analyses. The pooled estimates for NO 2 after confounder control by natural splines or loess functions are somewhat reduced, but one would still draw the same conclusions. Removing confounders overall increases the estimates whereas adding more confounders slightly decreases the regression coefficients. In the case-crossover analyses, the estimates are sometimes slightly higher (Augsburg) and

The Effect of NO 2
Number of MI survivors followed (a), the number of readmissions to the hospitals due to any cardiac disease dur-ing the followup (b) and incidence rate during the follow-up in Rome (c) as part of the HEAPSS study on the cohort time-axis Figure 2 Number of MI survivors followed (a), the number of readmissions to the hospitals due to any cardiac disease during the followup (b) and incidence rate during the follow-up in Rome (c) as part of the HEAPSS study on the cohort timeaxis.
Number of MI survivors followed (a), the number of readmissions to the hospitals due to any cardiac disease dur-ing the followup (b) and incidence rate during the follow-up in Rome (c) as part of the HEAPSS study on the calendar-time axis Figure 1 Number of MI survivors followed (a), the number of readmissions to the hospitals due to any cardiac disease during the followup (b) and incidence rate during the follow-up in Rome (c) as part of the HEAPSS study on the calendartime axis.
sometimes slightly lower (Barcelona, Helsinki, and Rome) than the Poisson model estimates. The pooled estimate would suggest smaller regression coefficients and larger standard errors than Poisson regression analyses. The results of the extended Cox regression analyses were consistent with the results of the Poisson regression analyses in Augsburg, Stockholm and Helsinki. Slightly smaller effect estimates were obtained for Barcelona and Rome. Overall, the pooled estimates were slightly smaller than those obtained with Poisson regression analyses, but would also suggest an association between NO 2 and hospital readmissions in MI survivors (figure 4). For extended Cox regression analyses individual characteristics were also considered as confounders in the analyses and the results were nearly identical to those obtained without consideration of individual characteristics. No strong evidence for heterogeneity of the city-specific effect estimates was observed and pooled random effect estimates were identical to the pooled fixed effect estimates (data not shown).

Discussion
The major challenge of analyzing short-term health effects of air pollution in a cohort of diseased subjects is to consider simultaneously other time-varying confounders and the changes in the probability of a recurrent event due to the individual characteristics. In the case of myocardial infarction survivors, a major determinant of individual vulnerability is the time since index event due to the underlying healing of the heart tissue.
Three different approaches were considered in the analyses. The first one, Poisson regression analysis summarizes the events on a calendar time-axis. The data in this study demonstrate an example where the underlying probability of observing an event changes with the composition of the cohort. Therefore, time is not only a measure for slow trends and seasonal variation, but also represents the changing fractions of persons at high risk over time.
Smoothing methods were used to attempt to model these three different components in the time trend. The recent discussion on smoothing techniques [11,12] was considered by the selection of three different approaches, namely natural splines, penalized splines and locally weight least square smoothers. All three methods gave quite comparable results and their function is consistent with a decreased risk of hospital readmission as the cohort ages and a seasonal variation in hospital readmissions. In the sensitivity analyses, results were robust to changes in the confounders selected in the final model.
The case-crossover method was chosen because it was designed to control for temporal changes by design. More specifically, the stratified referent period selection approach considered here controls for trends by design [19]. However, it has been noted that case-crossover analyses have reduced power compared to Poisson regression analyses [19]. In this study we observed slightly smaller effect estimates with larger standard errors. The case of a cohort with ongoing recruitment and dynamically changing composition has not been methodologically considered before. One may assume that the underlying changes in rates are constant within each stratum [21]. However, as observed in figure 1, panel C this assumption might be violated at the beginning of the study. Therefore, the changing number of subjects at risk might be responsible for the small differences observed between Poisson regression analyses, which explicitly consider the varying number of subjects at risk on a given day, and the casecrossover analyses. Nevertheless, case-crossover analyses were considered for these analyses because they quite elegantly allow analyses of subgroups.
Extended Cox regression analyses, on the other hand, were formulated to consider the present study design in the correct way. Here, the underlying risk is modeled by a hazard function h 0 (t) which is variable over time and considers both the changing hazard due to the recovery from the incident MI as well as changes in the composition of the cohort with respect to season and calendar year.
instead of subject's age as time axis [22]. While these two approaches should give similar effect estimates we favor the model on the follow-up time axis as it more appropriately considers the changing risk levels within the cohort.
A recent review by the American Heart Association has highlighted the emerging evidence for the biological plausibility between ambient air pollution concentrations in urban areas and cardiovascular disease exacerbation [23]. However, effect estimates obtained from the general population might underestimate the risk of susceptible subpopulations, which have also higher baseline risks [24]. Cohort studies assessing the risk of susceptible populations are highly recommended to provide better estimates for risk assessment. For example, the age at first MI, socioeconomic status as well as the time since the first MI might modify the risk of short-term air pollution exposures for an individual. All three methods might be used to assess the susceptibility of subgroups. The extended Cox regression analysis is the only method that would allow the estimation of the main effect of the considered effect modifier, but computation of the models is time-consuming.
Recent research indicated that spatial and temporal variability of long-term exposures to ambient particles may be important factors to consider [25,26]. Furthermore, future research might consider short-term fluctuations as well as individualized estimates of long-term exposures to ambient particles in assessing the health impact of environmental exposures. For these studies, Extended Cox regression analyses would be the method of choice. In its most simplistic version, one may estimate jointly the effect of homes exposed to high traffic together with air pollution concentrations from a central monitoring side. However, also more sophisticated approaches of exposure assessment building on spatio-temporal models such as for example described by Gulliver and Briggs [27] or Sahu and colleagues [28] can be foreseen.

Conclusion
Of the three methods considered for the analyses of the HEAPSS Study, the Poisson regression approach and the extended Cox regression analyses gave similar results. Case-crossover analyses might underestimate the strength of the association in this specific setting, but the differences were small. Further methodological investigation may be warranted. From a practical point of view, Poisson regression analyses are less time-consuming, and therefore might be used for confounder selection and most of the analyses. However, replication of the results with Cox models is desirable to assure that the results are independent of the analytical approach used. For the identification of susceptible subgroups within a cohort of susceptible populations, case-crossover analyses might be the least time consuming approach, however, Extended Cox regres-Effect estimates for the association between NO 2 (8 μg/m 3 ) hospital readmissions in MI survivors obtained in Poisson regres-sion analyses and Extended Cox regression analyses of all five cities within the HEAPSS study and the pooled estimates based on a fixed effect model Figure 4 Effect estimates for the association between NO 2 (8 μg/m 3 ) hospital readmissions in MI survivors obtained in Poisson regression analyses and Extended Cox regression analyses of all five cities within the HEAPSS study and the pooled estimates based on a fixed effect model.