Prediction of Cardiovascular Disease Risk Accounting for Future Initiation of Statin Treatment

Abstract Cardiovascular disease (CVD) risk-prediction models are used to identify high-risk individuals and guide statin initiation. However, these models are usually derived from individuals who might initiate statins during follow-up. We present a simple approach to address statin initiation to predict “statin-naive” CVD risk. We analyzed primary care data (2004–2017) from the UK Clinical Practice Research Datalink for 1,678,727 individuals (aged 40–85 years) without CVD or statin treatment history at study entry. We derived age- and sex-specific prediction models including conventional risk factors and a time-dependent effect of statin initiation constrained to 25% risk reduction (from trial results). We compared predictive performance and measures of public-health impact (e.g., number needed to screen to prevent 1 event) against models ignoring statin initiation. During a median follow-up of 8.9 years, 103,163 individuals developed CVD. In models accounting for (versus ignoring) statin initiation, 10-year CVD risk predictions were slightly higher; predictive performance was moderately improved. However, few individuals were reclassified to a high-risk threshold, resulting in negligible improvements in number needed to screen to prevent 1 event. In conclusion, incorporating statin effects from trial results into risk-prediction models enables statin-naive CVD risk estimation and provides moderate gains in predictive ability but had a limited impact on treatment decision-making under current guidelines in this population.


Editor's note: An invited commentary on this article appears on page 2015.
Cardiovascular disease (CVD) remains the leading cause of morbidity and mortality worldwide (1). Identifying individuals who are at high CVD risk is important for effectively implementing prevention strategies with limited health-care resources (2). For this purpose, many prediction models have been developed and subsequently recommended by primary prevention guidelines to help identify individuals at high risk of CVD who should benefit the most from preventive interventions, such as lifestyle advice and statin treatment (3-17). As such, CVD risk-prediction models are typically intended for treatment-naive populations (i.e., for the assessment of CVD risk in the absence of future treatment initiation) (10); however, they are rarely developed and validated in populations that remain treatment-naive throughout follow-up (18-20). Indeed, most contemporary models have been developed using data that excluded statin users at baseline without taking into account statin initiation during follow-up (so-called "treatment drop-ins") (19,21), leading to a possible underestimation of risk and hence undertreatment of high-risk individuals (22). The problem of treatment drop-ins in risk-prediction modeling is underappreciated (23).
Given the absence of an ideal treatment-naive population in which to develop risk-prediction models, it is important to explore statistical methods that address treatment drop-in effects (23). Previous studies have investigated the use of inverse probability weighting (18) or marginal structural models (20) to enable the estimation of treatment-naive risks. However, these models require estimating an unbiased treatment effect within the study population, relying on randomized study designs or cohorts with no unmeasured confounders.
Here, we propose incorporating causal evidence from clinical trials to provide a novel and simple approach to address time-dependent treatment drop-in for the estimation of treatment-naive risks (interpretable as risk estimates in the absence of future treatment initiation) (23). We illustrate this approach through the derivation and validation of a CVD risk model that estimates 10-year statin-naive CVD risk, using longitudinal electronic health records from a large and representative UK population.

Study population
Data source. We used medical records from English National Health Service general practices that contributed anonymized primary-care electronic health records to the Clinical Practice Research Datalink (CPRD), covering approximately 6.9% of the UK population (24). Patients in CPRD are broadly representative of the UK general population with respect to age, sex, and ethnicity (24). CPRD was linked to secondary care admissions from Hospital Episode Statistics and national mortality records from the Office for National Statistics.
The data used in this study were obtained under license from the UK Medicines and Healthcare Products Regulatory Agency (protocol 162RMn2).
Study outcomes. CVD was defined as a combination of new diagnoses of nonfatal or fatal events of coronary heart disease (including myocardial infarction and angina), stroke, and transient ischemic attack, matching the definition used by the QRISK algorithm (8,15), which is recommended by UK CVD risk assessment guidelines for 40- to 84-year-olds (25). Read codes (used to identify outcomes in CPRD) and International Classification of Diseases, Tenth Revision, codes (used to identify outcomes in primary or secondary diagnosis fields from Hospital Episode Statistics and in underlying or subordinate cause of death fields from the Office for National Statistics) are provided in Web Appendix 1, Web Tables 1 and 2 (available at https://doi.org/10.1093/aje/kwab031). We defined incident CVD as the first occurrence of CVD in any of the 3 databases (CPRD, Hospital Episode Statistics, and Office for National Statistics).
Risk factors. Conventional CVD risk factors (10,26) were selected, and included systolic blood pressure (SBP), total cholesterol, high-density lipoprotein (HDL) cholesterol (for which details of measurements have been previously described (24)), hypertension treatment (yes/no, ascertained from CPRD prescription information), smoking status (current smoker or not, ascertained from CPRD Read codes), and previous diagnoses of diabetes (yes/no, ascertained from CPRD Read codes (27)). Individuals were assumed to have hypertension treatment or diabetes for the rest of follow-up after their first prescription or diagnosis. In addition, we defined statin initiation as the date of first CPRD prescription (code list for CPRD prescription provided in Web Appendix 2, Web Table 3). The following biologically implausible risk-factor values were set to missing: SBP >250 mm Hg or <60 mm Hg; total cholesterol >20 mmol/L or <1.75 mmol/L; HDL cholesterol >3.1 mmol/L or <0.3 mmol/L (28,29). Values of SBP, total cholesterol, and HDL cholesterol were standardized using sex-specific means and standard deviations.
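As a rough illustration of the cleaning rules above, the following Python sketch sets out-of-range values to missing and then standardizes within sex. The field names (`sex`, `sbp`, `total_chol`, `hdl_chol`) and the record layout are hypothetical and not taken from CPRD; only the cutoffs come from the text.

```python
from statistics import mean, stdev

# Plausible ranges quoted in the text; values outside them become missing (None).
PLAUSIBLE_RANGES = {
    "sbp": (60.0, 250.0),        # mm Hg
    "total_chol": (1.75, 20.0),  # mmol/L
    "hdl_chol": (0.3, 3.1),      # mmol/L
}


def clean_value(name, value):
    """Return the value if it lies within the plausible range, else None."""
    if value is None:
        return None
    low, high = PLAUSIBLE_RANGES[name]
    return value if low <= value <= high else None


def standardize_by_sex(records, name):
    """Clean `name` in place and add `<name>_z` using sex-specific mean and SD.

    `records` is a list of dicts with a hypothetical "sex" key.
    """
    values_by_sex = {}
    for rec in records:
        rec[name] = clean_value(name, rec.get(name))
        if rec[name] is not None:
            values_by_sex.setdefault(rec["sex"], []).append(rec[name])

    # A sex-specific SD needs at least two non-missing values and must be nonzero.
    stats = {}
    for sex, values in values_by_sex.items():
        if len(values) > 1 and stdev(values) > 0:
            stats[sex] = (mean(values), stdev(values))

    for rec in records:
        if rec[name] is None or rec["sex"] not in stats:
            rec[name + "_z"] = None
        else:
            centre, spread = stats[rec["sex"]]
            rec[name + "_z"] = (rec[name] - centre) / spread
    return records
```

Whether standardization in the study used all measurements or one value per person is not stated here, so the pooling in this sketch is an assumption.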
Study entry and exit. Individuals entered our study on the latest of 4 dates: the date 6 months after registration at the general practice; the date the individual turned 40 years of age (note, prior information from age 30 years onward was extracted for these individuals); the date that the data for the practice were up to standard (30); or April 1, 2004, the date of introduction of the Quality and Outcomes Framework (31). Individuals were censored at the earliest date of the following: the individual's death or the first incident CVD event; the date that the individual turned 85 years of age (note, follow-up data up to age 95 years were extracted for these individuals); the date of deregistration at the practice; the last contact date for the practice with CPRD; or November 30, 2017, the end of data availability.
Study eligibility criteria. Of the 2,589,074 individuals with linked data, those with CVD or statin treatment identified before study entry were excluded. We also excluded individuals who had no measurements of any of SBP, total cholesterol, HDL cholesterol, or smoking status between study entry and exit dates. A total of 1,678,727 individuals (762,606 men and 916,121 women) were included in the study (flowchart in Web Figure 1).
We randomly allocated 2/3 of practices (263 practices with 1,141,098 individuals) to the derivation data set and 1/3 of practices (135 practices with 537,629 individuals) to the validation data set.
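A practice-level split of this kind might be sketched as below. The function names, the seed, and the `practice_id` field are assumptions for illustration. The study's split yielded 263 and 135 practices, which a simple rounding rule like this one will not necessarily reproduce.

```python
import random


def split_practices(practice_ids, derivation_fraction=2 / 3, seed=2021):
    """Randomly assign whole practices to derivation and validation sets.

    Splitting by practice (not by individual) keeps everyone from one
    practice in the same data set.
    """
    ids = sorted(set(practice_ids))
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    rng.shuffle(ids)
    n_derivation = round(len(ids) * derivation_fraction)
    return set(ids[:n_derivation]), set(ids[n_derivation:])


def label_individuals(individuals, derivation_practices):
    """Tag each individual (a dict with a hypothetical "practice_id" key)."""
    for person in individuals:
        in_derivation = person["practice_id"] in derivation_practices
        person["data_set"] = "derivation" if in_derivation else "validation"
    return individuals
```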

Statistical modeling
To utilize all available electronic health records data, we used a 2-stage landmark approach for the construction of 10-year CVD risk-prediction models (32). We briefly describe the methods here and provide more detail in Web Appendix 3, Web Figures 2-6. In the derivation data set, we developed 92 age- and sex-specific prediction models (i.e., for men and women and at ages 40, 41, 42, . . . , 85, denoted as "landmark ages"). Participants meeting the study eligibility criteria contributed to a model if they had no CVD diagnoses and no statin prescription before the landmark age. Ten-year crude CVD incidence rates and statin-initiation rates were calculated for each landmark age and sex.
In the first stage, to better utilize repeat risk-factor measurements and allow for incomplete data, error-free risk-factor values for SBP, total cholesterol, HDL cholesterol, and smoking status were estimated as best linear unbiased predictors (BLUPs) from landmark age- and sex-specific multivariate mixed-effects linear regression models (Web Appendix 3).
In the second stage, 10-year statin-naive CVD risk was modeled using landmark age- and sex-specific Weibull models, with time since landmark age as the time scale and with the following risk factors: the most recently observed diabetes status and hypertension treatment status; estimated error-free risk-factor values for SBP, total cholesterol, HDL cholesterol, and smoking status; and a time-dependent effect of statin initiation constrained to a 25% risk reduction as reported from published meta-analyses of trials (33,34). For example, in Stata (StataCorp LLC, College Station, Texas) this can be implemented by splitting the follow-up data at the time of statin initiation and using the offset option in the survival model (see example code in Web Appendix 3). Incorporating the effect of statins in this way ignores the potential error in the effect and assumes homogeneity in treatment effect (i.e., a 25% risk reduction for everyone) regardless of the time on statins and other characteristics. The Weibull distribution and proportional hazards assumptions were checked and verified (see Web Appendix 3, Web Figures 5 and 6). We also derived a standard model ignoring the effect of statin initiation.
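The episode-splitting idea described above can be illustrated in Python. This is a minimal sketch, not the authors' Stata code (which is in Web Appendix 3). It assumes a Weibull proportional hazards parameterization with hazard h(t) = λ ν t^(ν-1) exp(xβ + offset), where the offset is fixed at log(0.75) after statin initiation and 0 before. All function and argument names are hypothetical.

```python
import math

# Fixed (not estimated) log hazard ratio for statins: a 25% risk reduction.
LOG_HR_STATIN = math.log(0.75)


def split_at_statin(follow_up, event, statin_time=None):
    """Split one person's follow-up at statin initiation.

    Times are years since the landmark age. Returns rows of
    (start, stop, event, offset); the event is attached to the last row.
    """
    if statin_time is None or statin_time >= follow_up:
        return [(0.0, follow_up, event, 0.0)]
    rows = []
    if statin_time > 0:
        rows.append((0.0, statin_time, 0, 0.0))
    rows.append((max(statin_time, 0.0), follow_up, event, LOG_HR_STATIN))
    return rows


def weibull_loglik_row(start, stop, event, offset, log_lambda, nu, lin_pred):
    """Log-likelihood contribution of one (start, stop] episode.

    The offset enters the linear predictor with its coefficient fixed at 1,
    so the statin effect is constrained and never estimated.
    """
    eta = log_lambda + lin_pred + offset
    cum_hazard = math.exp(eta) * (stop ** nu - start ** nu)
    if event:
        log_hazard = eta + math.log(nu) + (nu - 1.0) * math.log(stop)
        return log_hazard - cum_hazard
    return -cum_hazard
```

Summing `weibull_loglik_row` over all rows and maximizing over `log_lambda`, `nu`, and the coefficients behind `lin_pred` would give the constrained fit; the optimizer is left out to keep the sketch short.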
In the validation data set, we predicted 10-year statin-naive and standard CVD risks, using risk-factor values estimated from the multivariate mixed-effects models.
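Given fitted parameters, a 10-year risk under a Weibull proportional hazards model follows from the cumulative hazard. The sketch below assumes the parameterization H(t) = λ t^ν exp(xβ), which may differ from the one used in the study, and uses hypothetical coefficient names. For a statin-naive prediction the statin offset is simply left at zero.

```python
import math


def linear_predictor(coefficients, covariates):
    """Sum of coefficient * covariate over the (hypothetical) risk factors."""
    return sum(coefficients[name] * covariates[name] for name in coefficients)


def ten_year_risk(lin_pred, log_lambda, nu, horizon=10.0):
    """Predicted risk by `horizon` years: 1 - exp(-H(horizon)).

    No statin offset is added, so the result is a statin-naive risk when the
    coefficients come from the model that accounted for statin initiation.
    """
    cum_hazard = math.exp(log_lambda + lin_pred) * horizon ** nu
    return 1.0 - math.exp(-cum_hazard)
```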

Assessment of model predictive performance
Performance measures for the standard CVD models ignoring statin initiation were calculated from comparisons between the predicted standard CVD risks and observed survival times and risks in the validation data set. To appropriately assess model performance, we compared statin-naive CVD risks against observed risks using counterfactual statin-naive survival times. Under the Weibull model, counterfactual survival times were estimated as:
$$t^{*} = \left[ t_s^{\nu} + 0.75 \left( t^{\nu} - t_s^{\nu} \right) \right]^{1/\nu}$$
where t* is the counterfactual statin-naive survival time; t is the observed follow-up time; t_s is the time of statin initiation (which equals t if not observed); 0.75 = exp(log 0.75) is the hazard ratio representing the effect of statins from trial results of 25% risk reduction; and ν is the shape parameter of the Weibull model estimated in the derivation data set. Further details are provided in Web Appendix 4. Several measures were used to assess the model and compare the performance in the validation data set (full definitions and the use of counterfactual statin-naive survival times in performance assessment are provided in Web Table 4). Calibration was assessed visually (35,36) and with the calibration slope (35-38); predictive accuracy and explained variation were assessed using the Brier score and R², respectively (36,39,40); and discrimination was assessed by the D statistic (39) and Harrell's C index (35) with bootstrap standard errors. Reclassification measures, including the net reclassification improvement (NRI), with both continuous NRI (41) and categorical NRI (42) using the predicted 10-year risk cutoff at <10% and ≥10% (i.e., the threshold of recommended statin treatment in the current UK guidelines (25)), together with the integrated discrimination index (42), were used to compare the statin-naive and the standard CVD risks at ages 40, 50, 60, and 70 years. Potential public health impact, including the number needed to screen and number needed to treat to prevent 1 CVD event (38,43,44), was estimated under the assumption that statin treatment is allocated to individuals with 10-year CVD risk greater than 10% and reduces CVD risk by 25%.
Notes to the table of participant characteristics. Abbreviations: CVD, cardiovascular disease; HDL, high-density lipoprotein; SD, standard deviation. a Included 1,678,727 individuals aged 40-85 years, without prevalent CVD or statin initiation at study entry, and with at least 1 measurement value of systolic blood pressure, total cholesterol, HDL cholesterol, or smoking status between their study entry and study exit dates. b Calculated using the first measurement values taken after study entry. c Recorded as "yes" if any of the measurement values showed "yes" throughout the follow-up time.
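The counterfactual survival time used in this assessment can be sketched in Python. Under a Weibull proportional hazards model the cumulative hazard is proportional to t^ν, so equating the hazard accrued on the observed path (hazard ratio 0.75 after statin initiation) with that of a never-treated path gives t* = [t_s^ν + 0.75 (t^ν - t_s^ν)]^(1/ν). This is our reading of the description in the text, not code from the Web Appendix, and the function names are hypothetical.

```python
def counterfactual_time(t, nu, t_s=None, hazard_ratio=0.75):
    """Statin-naive survival time implied by a Weibull PH model.

    t            observed follow-up time (years since landmark age)
    nu           Weibull shape parameter from the derivation data set
    t_s          time of statin initiation; None if never initiated
    hazard_ratio constrained statin effect (0.75 = 25% risk reduction)
    """
    if t_s is None or t_s >= t:
        return t  # no time on statins, so nothing to adjust
    return (t_s ** nu + hazard_ratio * (t ** nu - t_s ** nu)) ** (1.0 / nu)


def counterfactual_event_within(t, event, nu, t_s=None, horizon=10.0):
    """True if an observed event falls within `horizon` on the counterfactual scale."""
    return bool(event) and counterfactual_time(t, nu, t_s) <= horizon
```

Time spent on statins is shrunk by the hazard ratio, so the counterfactual time is never longer than the observed time.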
In addition, to quantify the impact of models accounting for statin initiation on treatment decision-making, we compared the proportion of individuals with 10-year predicted risk exceeding a range of treatment thresholds from 5% to 30% by using the statin-naive versus the standard CVD risk for each landmark age. Weighted proportions across all ages were calculated using the most recent available data for an age- and sex-standard English population (2015) (45) aged 40-85 years. To directly demonstrate the predictive ability of statin-naive CVD risk, measures of model performance were also assessed on the subset of individuals with no statin initiation during follow-up.
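The age-weighted proportions described above amount to a weighted average of landmark-age-specific proportions. The sketch below illustrates this with made-up inputs; the dictionary layout and the use of raw population counts as weights are assumptions, and no real population figures are included.

```python
def proportion_at_or_above(risks, threshold):
    """Share of predicted 10-year risks at or above a treatment threshold."""
    return sum(1 for r in risks if r >= threshold) / len(risks)


def weighted_proportion(proportion_by_age, population_by_age):
    """Average age-specific proportions using standard-population weights.

    Both arguments map landmark age -> value; only ages present in
    `proportion_by_age` contribute to the weights.
    """
    total = sum(population_by_age[age] for age in proportion_by_age)
    weighted_sum = sum(
        proportion_by_age[age] * population_by_age[age] for age in proportion_by_age
    )
    return weighted_sum / total
```

In the study this would be done separately for men and women and for each threshold from 5% to 30%.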
All statistical analyses were conducted using Stata, version 15.1 (StataCorp LLC), and R, version 3.6.1 (R Foundation for Statistical Computing, Vienna, Austria). The 2-sided P value threshold was <0.05, and we calculated 95% confidence intervals.

Characteristics of participants
At study entry, the mean age was 50.9 (standard deviation, 13.1) years, and 45% of the participants were men. Characteristics of participants in the derivation and validation data sets were similar (Table 1). The median follow-up was 8.9 years (interquartile range, 5.3-11.4), during which 237,806 individuals initiated statins and there were 103,163 incident CVD events (Web Figure 7).

Statin initiation and CVD incidence rates
Ten-year statin-initiation rates were higher in men, increased with age until approximately 70 years, and then declined (Figure 1). The overall 10-year CVD incidence rate was 7.39 (95% confidence interval (CI): 7.34, 7.43) per 1,000 person-years. The 10-year CVD incidence rates increased rapidly after age 65, and were higher in men and those who initiated statins during follow-up (Figure 1 and Web Table 5). Rates were broadly similar in the derivation and validation data sets (Web Tables 6 and 7).

Risk-factor associations with incident CVD
Hazard ratios for CVD attenuated at older landmark ages for all risk factors (Figure 2). Hazard ratios for total cholesterol and diabetes were somewhat higher in models accounting for statin initiation during follow-up (particularly for ages 60-70) compared with models ignoring statin initiation, but were similar for other CVD risk factors (Figure 2, Web Tables 8 and 9).

Model calibration, performance, and discrimination
The models appeared generally well calibrated, especially at younger ages (Web Figures 9-11). Compared against the models ignoring statin initiation, models accounting for statin initiation generally exhibited better model performance and discrimination, quantified by lower values for overall Brier score (Table 2), higher explained variation (Web Figure 12), and higher overall C index (Table 2) and D measure (Web Figure 13). The age-specific C indexes were higher in women, decreased with age, and were slightly higher in models accounting for statin initiation, especially for ages 60-70 (Figure 4).

Public health modeling
Reclassification. There were moderate improvements in risk classification using 10-year statin-naive versus standard CVD predictions. Generally, individuals with future CVD events within 10 years were more likely to be reclassified from <10% to ≥10% risk categories (quantified by the categorical NRI) and have higher predicted risks (quantified by the category-free integrated discrimination index and continuous NRI) than were individuals who remained CVD event-free for 10 years (Tables 3-6). Above the ages of 69 for men and 76 for women, all predicted 10-year statin-naive and standard CVD risks were greater than 10%.
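For readers unfamiliar with the categorical NRI at a single cutoff, the following Python sketch shows the usual two-category calculation. It is a generic illustration, not the study's code. It assumes inputs restricted to individuals not censored at 10 years, with event status already defined on the counterfactual statin-naive time scale, and the argument names are hypothetical.

```python
def categorical_nri(standard_risk, statin_naive_risk, event, threshold=0.10):
    """Two-category NRI comparing statin-naive with standard risk predictions.

    Returns (nri_events, nri_nonevents, nri_total). Upward moves count in
    favor of the new model among events, downward moves among nonevents.
    """
    up_events = down_events = n_events = 0
    up_nonevents = down_nonevents = n_nonevents = 0
    for old, new, had_event in zip(standard_risk, statin_naive_risk, event):
        moved_up = new >= threshold and old < threshold
        moved_down = new < threshold and old >= threshold
        if had_event:
            n_events += 1
            up_events += moved_up
            down_events += moved_down
        else:
            n_nonevents += 1
            up_nonevents += moved_up
            down_nonevents += moved_down
    if n_events == 0 or n_nonevents == 0:
        raise ValueError("need at least one event and one nonevent")
    nri_events = (up_events - down_events) / n_events
    nri_nonevents = (down_nonevents - up_nonevents) / n_nonevents
    return nri_events, nri_nonevents, nri_events + nri_nonevents
```

When every predicted risk is already at or above the cutoff under both models, as at the oldest landmark ages, no one can move and the categorical NRI is 0.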
Potential public health impact. Fewer younger people needed to be screened to prevent 1 CVD event using statin-naive compared with standard 10-year CVD risk predictions (Figure 5, Web Table 12). Above age 60, the number needed to be screened to prevent 1 event was generally similar between the 2 risk predictions, as was the number needed to treat to prevent 1 event (Figure 5). The weighted proportions across all ages of individuals with 10-year predicted risk exceeding the treatment threshold were slightly higher after accounting for statin initiation (Figure 6). For example, at the threshold of ≥10%, the proportions were 55.6% in men and 33.5% in women using models ignoring statin initiation, and 57.1% in men and 34.8% in women using models accounting for statin initiation.
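The number needed to screen and number needed to treat can be illustrated with a simplified Python sketch. The definitions used in the study are given in Web Table 4 and may differ. Here we assume that everyone at or above the threshold is treated, that treatment removes 25% of their expected events, and that predicted risk stands in for the expected event probability. All names are hypothetical.

```python
def screening_impact(risks, threshold=0.10, risk_reduction=0.25):
    """Return (number needed to screen, number needed to treat).

    risks: predicted 10-year CVD risks for everyone screened.
    Events prevented = risk_reduction * expected events among those treated.
    """
    treated = [r for r in risks if r >= threshold]
    events_prevented = risk_reduction * sum(treated)
    if events_prevented == 0:
        return float("inf"), float("inf")  # nobody treated, nothing prevented
    nns = len(risks) / events_prevented
    nnt = len(treated) / events_prevented
    return nns, nnt
```

Comparing the output for statin-naive and standard risks at a given landmark age mirrors the comparison summarized above, under these simplifying assumptions.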
Results were similar when analyses were performed using a validation subset including 463,017 individuals who did not initiate statins during the follow-up (Web Tables 13-16; Web Figures 14-21).

DISCUSSION
In this study, we described a novel and simple approach to account for statin initiation for the prediction of 10-year statin-naive CVD risk, illustrated using primary-care data collected in a general UK population; the approach is applicable to other study designs with similar information. Our analyses showed that, after adding a time-dependent effect of statin initiation constrained to a 25% CVD risk reduction, 10-year CVD predicted risks were higher, especially among 60- to 70-year-olds. These differences reflect the somewhat stronger associations between total cholesterol and CVD outcome after accounting for statin initiation and are in line with what is expected in a statin-naive population. Models that accounted for statin initiation also showed moderate improvements in calibration and discrimination, but these translated into limited public health and clinical relevance in our study population.
Currently recommended CVD risk-prediction models do not consider the effect of statin treatment drop-in during follow-up (19) and produce standard 10-year CVD risk estimates that are often interpreted in clinical practice, by practitioners and patients, as statin-naive CVD risk predictions (18-21). In this study, we found stronger hazard ratios for total cholesterol in models accounting for statin initiation, a phenomenon previously described as an "intervention effect" in clinical prediction models (46). Despite our study showing that statin-naive CVD risk predictions are generally higher than standard CVD risk predictions, we found little benefit in their use for clinical decision-making in this population of 40- to 85-year-olds. Accounting for statin initiation made the largest difference to risk estimates for individuals aged 60-70 (i.e., those more likely to start statins); however, a large proportion of these individuals were already categorized as being in a high-risk group (≥10%) on the basis of their age. Greater public health impact might be found 1) in other populations with higher statin-initiation rates or with higher CVD risk (e.g., diabetic patients), 2) with models using more conservative CVD endpoint definitions in risk model derivation (10), and/or 3) with use of age-specific risk thresholds (although these are not currently recommended by clinical guidelines).
Previous studies have attempted to account for statin drop-in by modeling the probability of statin initiation during follow-up (based on baseline risk factors), either through inverse probability weighting (18) or in marginal structural models (20). If the propensity model is incorrectly specified, then it might not fully account for the treatment drop-in. By contrast, our simpler approach incorporated causal evidence of 25% risk reduction with statin initiation from trial results. A similar approach using a time-fixed constrained treatment effect has been applied to estimate medication efficacy in long-term clinical trials (47) and in breast cancer prognostic models (48), as well as to adjust population-level incidence rates for CVD (33). However, to our knowledge, incorporating time-dependent statin treatment effects (which results in adjustment of risk-factor coefficients) for the prediction of the statin-naive 10-year CVD risk has not been fully explored; this approach is aligned with the "hypothetical strategy" described previously (23). Our study assumed the same risk-reduction effect for all individuals regardless of treatment duration and discontinuations. It is possible to extend our model to allow individuals' risk reduction in response to statin initiation to vary by dose, treatment duration, and other demographic and socioeconomic factors (49,50), which in combination might result in individuals having larger or smaller changes in risk, although on average the reduction is likely to be smaller than we have modeled.
Notes to the reclassification tables (men and women). Abbreviation: CVD, cardiovascular disease. a The results are presented in 10-year increments in landmark age at 40, 50, 60, and 70. Above landmark age 69 for men and 76 for women, the predicted 10-year CVD risk for all individuals in the risk set was greater than 10% for both standard risk predictions and statin-naive risk predictions; therefore, there was no movement between the 2 categories for those older landmark age groups. b Events within 10 years and event-free at 10 years for the reclassification table were defined using the counterfactual follow-up time, assuming statin had not been initiated.
Notes to the NRI and IDI table (men). Abbreviations: CI, confidence interval; CVD, cardiovascular disease; IDI, integrated discrimination improvement; NRI, net reclassification improvement. a The results are presented in 10-year increments in landmark age at 40, 50, 60, and 70. Above landmark age 69 for men, the predicted 10-year CVD risk for all individuals in the risk set was greater than 10% for both standard risk predictions and statin-naive risk predictions; therefore, there was no movement between the 2 categories and the categorical NRIs were 0 for those older landmark age groups. b Categorical NRI and IDI were calculated using information from individuals who were not censored at 10 years (either with CVD events within 10 years or event-free at 10 years). Events within 10 years and event-free at 10 years, for the calculation of categorical NRI and IDI, were defined using the counterfactual follow-up time assuming statin had not been initiated. c Categorical NRI was calculated based on the 2 categories of predicted risk of <10% and ≥10%. d Continuous NRI (the prospective form of NRI) was calculated based on continuous predicted risk and used information from all individuals, including the censored subjects. e Events and nonevents for continuous NRI (the prospective form of NRI) were the expected results estimated using the Kaplan-Meier approach with counterfactual follow-up time assuming statin had not been initiated, so this prospective form of NRI uses the whole sample and does not require the restriction to the noncensored subjects.
In our study, we found a reduction in the predictive ability of CVD risk-prediction models at older ages, due to attenuating hazard ratios of conventional risk factors (irrespective of whether statin treatment drop-in was accounted for). Previous assessments of the Framingham Risk Score also noted poorer performance in older individuals (51,52). This was mainly attributed to older individuals still in the risk set being a homogeneous group in whom conventional CVD risk factors have little impact (36). This highlights the need to assess new CVD biomarkers across different age groups.
Our study has several strengths. This study proposed a simple approach to account for statin treatment drop-in and assessed it using CVD risk factors and events recorded in a large and representative UK population data set combining primary and secondary care health records. The landmark framework allowed us to optimally use repeated measurements of risk factors recorded in electronic health-records data and to assess the changes in hazard ratios and discrimination with age when accounting for statin initiation. Multivariate mixed-effects models allowed estimation of error-free risk-factor values at each landmark age, even when some risk factors were not observed, avoiding nonrepresentative "complete-case" analyses. In addition, a parametric Weibull model allowed a closed-form estimation of counterfactual statin-naive survival times, further allowing for model performance assessment in the "statin-naive" setting, with consistent results using the subset of individuals who remained statin-naive during follow-up. Our landmark models are easy to derive in standard software and to use in practice. We focused on the effect of statin drop-in, but the approach is generalizable to other causal relationships occurring in follow-up, such as short-term medications (e.g., corticosteroids), long-term medications (e.g., hypertension treatment), and lifestyle modification changes (e.g., smoking and smoking cessation), as well as other diseases.
Notes to the NRI and IDI table (women). a The results are presented in 10-year increments in landmark age at 40, 50, 60, and 70. Above landmark age 76 for women, the predicted 10-year CVD risk for all individuals in the risk set was greater than 10% for both standard risk predictions and statin-naive risk predictions; therefore, there was no movement between the 2 categories and the categorical NRIs were 0 for those older landmark age groups. b Categorical NRI and IDI were calculated using information from individuals who were not censored at 10 years (either with CVD events within 10 years or event-free at 10 years). Events within 10 years and event-free at 10 years, for the calculation of categorical NRI and IDI, were defined using the counterfactual follow-up time assuming statin had not been initiated. c Categorical NRI was calculated based on the 2 categories of predicted risk of <10% and ≥10%. d Continuous NRI (the prospective form of NRI) was calculated based on continuous predicted risk and used information from all individuals, including the censored subjects. e Events and nonevents for continuous NRI (the prospective form of NRI) were the expected results estimated using the Kaplan-Meier approach with counterfactual follow-up time assuming statin had not been initiated, so this prospective form of NRI uses the whole sample and does not require the restriction to the noncensored subjects.
Our study also has limitations that should be noted. Our data contain records only of statin prescriptions, with no information about treatment adherence, and so statin users might be incorrectly classified or indeed be treatment "dropouts," as the proportion of people with poor adherence to statins might not be negligible (53). We also ignored any impact of informative observations, whereby more risk-factor measurements are made in sicker individuals who visit their general practitioners more frequently, or in the "worried-well" (54,55); however, our previous work found that adjusting for the rate of general practitioner visits had negligible impact (32). We further ignored uncertainty in the constrained effect of statins, which might lead to slight overprecision in other estimated parameters. Additionally, the use of the Weibull model relies on strong parametric assumptions and is less flexible than the commonly used Cox model. It is possible to estimate counterfactual survival times in a Cox model with additional effort (outlined in Web Appendix 4). However, these limitations are unlikely to affect the between-model comparisons in prediction performance.
In conclusion, trial evidence on the effect of statins on CVD risk can be simply incorporated into the derivation of risk models using electronic health records, yielding statin-naive risk estimates interpretable as risk in the absence of future statin initiation. In our study population, accounting for statin initiation moderately improved measures of calibration and discrimination but had limited benefits for clinical decision-making under the statin-initiation threshold recommended by current UK guidelines.