A method to predict breast cancer stage using Medicare claims

Background In epidemiologic studies, cancer stage is an important predictor of outcomes. However, cancer stage is typically unavailable in medical insurance claims datasets, thus limiting the usefulness of such data for epidemiologic studies. Therefore, we sought to develop an algorithm to predict cancer stage based on covariates available from claims-based data. Methods We identified a cohort of 77,306 women age ≥ 66 years with stage I-IV breast cancer, using the Surveillance Epidemiology and End Results (SEER)-Medicare database. We formulated an algorithm to predict cancer stage using covariates (demographic, tumor, and treatment characteristics) obtained from claims. Logistic regression models derived prediction equations in a training set, and the equations' test characteristics (sensitivity, specificity, positive predictive value [PPV], and negative predictive value [NPV]) were calculated in a validation set. Results Of the entire sample of women diagnosed with invasive breast cancer, 51% had stage I; 26% stage II; 11% stage III; and 4% stage IV disease. The equation predicting stage IV disease achieved sensitivity of 81%, specificity 89%, PPV 24%, and NPV 99%, while the equation distinguishing stage I/II from stage III disease achieved sensitivity 83%, specificity 78%, PPV 98%, and NPV 31%. Combined, the equations most accurately identified early stage disease and ascertained a sample in which 98% of patients were stage I or II. Conclusions A claims-based algorithm was utilized to predict breast cancer stage, and was particularly successful when used to identify early stage disease. These prediction equations may be applied in future studies of breast cancer patients, substantially improving the utility of claims-based studies in this group. This method may similarly be employed to develop algorithms permitting claims-based epidemiologic studies of patients with other cancers.


Background
Administrative medical insurance claims are an important source of population-based data used in epidemiologic studies of various diseases. Specifically, in older patients, national Medicare data have been useful for the study of many conditions, including myocardial infarction, heart failure, chronic kidney disease, Parkinson's disease, and venous thromboembolism [1][2][3][4][5]. However, for studying cancer, the use of national Medicare data has, to date, been limited. Medicare claims data are clearly recognized as potentially a rich source for cancer epidemiology and outcomes research, and in fact demonstrate acceptable validity for identifying cancer diagnoses and treatment patterns [6][7][8][9][10][11][12][13]. Unfortunately, the lack of cancer stage data in Medicare claims remains a major limiting factor in maximizing the utility of these datasets for retrospective, outcomes-based research in cancer patients [14,15]. In particular, cancer stage is a crucial predictor of disease outcome and a key factor in determining the appropriateness of treatment. For example, in breast cancer patients, stage is associated with overall and disease-free survival and, furthermore, stage influences treatment decisions such as selection and timing of surgery, radiotherapy, and chemotherapy [16]. Epidemiologic studies of cancer patients typically employ stage variables as covariates or as inclusion and exclusion criteria, and thus it is essential to develop accurate algorithms to account for cancer stage in studies using claims data.
Surprisingly, the need to derive such algorithms has largely been ignored in the literature. Only one prior study has developed a claims-based algorithm to predict stage in breast cancer patients. Cooper et al. used the Surveillance Epidemiology and End Results (SEER)-Medicare database, and the authors reported that their claims-based, single-predictor models were insufficient for identifying patients with local, regional, and distant stage disease. Sensitivity of these models for distinguishing local from regional and distant disease was low; for example, in breast cancer patient samples, it was only approximately 60% [17]. In the decade since the prior algorithm was derived, no other algorithm has been presented in the literature attempting to improve cancer stage classification using claims data. Thus studies have continued to apply the algorithm by Cooper and colleagues to derive cancer stage variables, despite the recognized limitations of this algorithm [18,19] and the introduction of measurement errors into such analyses.
Accordingly, we sought to derive an expanded predictive algorithm based on multivariate modeling and to improve the sensitivity and specificity for identifying cancer stage, using our study sample of breast cancer patients as an illustrative case. Using available Medicare claims for breast cancer patients found in the SEER-Medicare database, we developed a prediction algorithm to identify patients with distant (stage IV) disease at diagnosis and, among patients without distant disease, a prediction algorithm to classify the extent of locoregional (stages I-III) disease.

Study sample
The SEER-Medicare database is comprised of a population-based cohort of Medicare beneficiaries with incident cancer identified through SEER registries, which account for up to 26% of the United States' population [20,21]. Our initial study population consisted of 150,764 women (age ≥ 65 years) diagnosed with breast cancer between 1992 and 2002 identified through SEER-Medicare. From this population, we excluded 5,217 patients with unknown SEER historic stage (as this variable indicated the presence or absence of metastases), and 19,816 with in situ disease (as we intended to focus only on invasive disease). We further excluded 47,114 patients who did not have continuous Medicare Fee-for-Service coverage or had any HMO coverage from 12 months prior to 9 months after their diagnosis date (as claims information might be incomplete during these periods), and the 308 patients age <66, since these patients potentially would not have had comprehensive claims information to define the independent predictor covariates. We finally excluded 1,003 patients who died or were lost to follow-up within 9 months of their diagnosis date. This yielded a final sample size of 77,306 patients in our study.
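As an illustrative check that is not part of the original analysis, the cohort attrition described above can be tallied step by step. The dictionary labels below are shorthand for the exclusion criteria in the text, not SEER-Medicare field names.

```python
# Cohort attrition as reported in the text.
# Labels are shorthand chosen for this sketch, not SEER-Medicare variable names.
initial_population = 150_764

exclusions = {
    "unknown SEER historic stage": 5_217,
    "in situ disease": 19_816,
    "no continuous fee-for-service coverage, or any HMO coverage": 47_114,
    "age < 66 years": 308,
    "died or lost to follow-up within 9 months": 1_003,
}

remaining = initial_population
for reason, n_excluded in exclusions.items():
    remaining -= n_excluded
    print(f"after excluding {reason}: {remaining:,}")

final_sample = remaining
```

The running total ends at the reported final sample size, so the listed exclusions account for every patient removed from the initial population.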

Dependent variable: Cancer stage
The "gold standard" for identifying cancer stage at diagnosis was determined using a combination of tumor variables available through SEER. Distant disease was determined using the American Joint Committee on Cancer (AJCC) [22] historic stage as reported to SEER, which indicates tumor present in any distant site at cancer diagnosis (compared with tumor limited only to local or regional sites at diagnosis). For our analysis, patients with any distant disease were considered stage IV.
Patients without distant (stage IV) disease had local or regional AJCC historic stage. T and N classifications in these patients were assigned based on SEER variables for tumor size and extent of disease. Tumor size and extent were categorized as ≤ 2 cm (T1); >2 to 5 cm (T2); >5 cm (T3); or invading into the chest wall, ribs, intercostals or serratus anterior muscles, extensive invasion into the skin, inflammatory carcinoma, or further contiguous extension into the skin (T4). Nodal disease was categorized as 0 positive lymph nodes (N0); 1-3 positive lymph nodes (N1); 4-9 positive lymph nodes (N2); or 10 or more positive lymph nodes (N3) [21]. Due to the extent of missing data in the SEER database, location of positive lymph nodes was not included in N classification. Stage I included T1N0 disease; stage II included T0N1, T1N1, T2N0, T2N1, and T3N0 disease; and stage III included T0-2N2, T3N1-2, T4N0-2, and T0-4N3 disease. These classifications are based on AJCC 2003 staging criteria [22].
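The T, N, and stage groupings described above can be expressed as a small lookup. The following is a minimal sketch under the assumption that T and N are already coded as integers; the function names are hypothetical and do not correspond to SEER variables.

```python
def n_category(positive_nodes: int) -> int:
    """N category from the count of positive lymph nodes (node location is ignored, as in the text)."""
    if positive_nodes < 0:
        raise ValueError("node count cannot be negative")
    if positive_nodes == 0:
        return 0
    if positive_nodes <= 3:
        return 1
    if positive_nodes <= 9:
        return 2
    return 3


def t_category_from_size(size_cm: float, t4_extension: bool = False) -> int:
    """T category from tumor size; t4_extension flags chest wall/skin extension or inflammatory carcinoma."""
    if t4_extension:
        return 4
    if size_cm <= 2:
        return 1
    if size_cm <= 5:
        return 2
    return 3


def stage_from_tn(t: int, n: int) -> str:
    """Stage I, II, or III for patients without distant disease, using the groupings listed in the text."""
    if not (0 <= t <= 4 and 0 <= n <= 3):
        raise ValueError("expected T in 0-4 and N in 0-3")
    if t == 4 or n >= 2:
        return "III"                      # T4N0-2, T0-2N2, T3N2, T0-4N3
    if t == 3:
        return "III" if n == 1 else "II"  # T3N1 is stage III; T3N0 is stage II
    if n == 1:
        return "II"                       # T0N1, T1N1, T2N1
    if t == 1:
        return "I"                        # T1N0
    if t == 2:
        return "II"                       # T2N0
    raise ValueError("T0N0 is not assigned a stage in the text")
```

For example, `stage_from_tn(t_category_from_size(1.5), n_category(0))` corresponds to T1N0 disease and returns stage I.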

Independent predictors
Candidate independent predictors were selected a priori based on statistical significance in bivariate analyses (P < 0.25) and clinical significance in prior studies of cancer patients [20,23-28]. Variables were defined by searching inpatient, outpatient, and carrier Medicare claims, or the denominator file of the SEER-Medicare linked data for demographic variables. A comprehensive list of International Classification of Diseases, Ninth Revision (ICD-9), Current Procedural Terminology (CPT), and Revenue Center codes for each predictor is provided in Table S1, Additional file 1.

Statistical analysis
All statistical analyses were conducted using SAS version 9.1.3 (SAS Institute Inc, Cary, NC), and all statistical tests assumed a 2-tailed α of 0.05. The University of Texas M. D. Anderson Cancer Center institutional review board deemed this study exempt from review, since the data were without identifiers.
We derived two separate logistic models and implemented the models sequentially. The first model tested the associations between predictor covariates and the dichotomous outcome of stage IV versus non-stage IV disease. Among the subset of patients who were not categorized as having stage IV disease, the second model tested the associations between predictor covariates (excluding the covariate for metastatic disease at diagnosis) and the dichotomous outcome of stage I/II (early) versus stage III disease. Outcomes were dichotomized based on clinical rationale, given that treatment of metastatic disease is palliative and that treatment of early stage disease is distinct in that breast conserving therapy is a treatment option.
We used a split sample approach to develop and validate our logistic models. Each model was derived from the "training set," selected using simple random sampling without replacement (38,653 of 77,306 patients). Parsimonious models were then selected based on statistical significance (P < 0.25), clinical significance of covariates in prior studies [20,23-28], and goodness-of-fit. Prior studies served as an initial guide for selecting covariates to consider. The significance cutoff (P < 0.25) was used to decide which covariates to retain, and the goodness-of-fit of the overall model was used to decide which covariates to exclude. In combination, these three criteria determined the final model.
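The split sample step can be sketched as follows. This is an illustration only: the original analysis was run in SAS, and the integer patient identifiers, the function name, and the random seed here are assumptions made for the example.

```python
import random


def split_sample(patient_ids, n_training, seed=0):
    """Draw a training set by simple random sampling without replacement; the rest form the validation set."""
    rng = random.Random(seed)            # fixed seed (an assumption) so the split is reproducible
    training = set(rng.sample(patient_ids, n_training))
    validation = [pid for pid in patient_ids if pid not in training]
    return sorted(training), validation


# Hypothetical identifiers standing in for the 77,306 patients in the final sample.
patient_ids = list(range(77_306))
training_set, validation_set = split_sample(patient_ids, n_training=38_653)
```

With 38,653 of 77,306 patients drawn for training, the remaining 38,653 patients form the validation set, so the two sets are of equal size and do not overlap.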

Testing
Patients not included in the training set constituted the "validation set." The parameter estimated for each covariate in the training set was applied to each patient in the validation set to calculate that patient's predicted probability of having stage IV disease in the first model and stage I/II or stage III disease in the second model. The calculated probability took the logistic form $p = \exp(\beta_0 + \sum_i \beta_i x_i) / [1 + \exp(\beta_0 + \sum_i \beta_i x_i)]$, where $\beta_0$ is the intercept, $x_i$ is the patient's value for covariate $i$, and $\beta_i$ is the corresponding parameter estimate. Test characteristics were calculated for probability cutpoints between 0.05 and 0.90, using two-by-two tables. The "gold standard" for stage was considered the SEER stage; the test stage was based on the calculated probability (for example, for a probability cutpoint of 0.05, patients were predicted to have stage I/II disease if their calculated probability was ≥ 0.05, and not to have stage I/II disease if their calculated probability was <0.05).

Combining equations
The prediction equations were then applied to isolate a sample of patients with early stage disease. Specifically, the two prediction equations were applied sequentially to the validation sample in order to identify a subset of patients with stage I/II disease. The first step used a probability cutpoint of 0.05 to exclude patients with predicted stage IV disease. The second step was applied to the subset identified in the first step and used a probability cutpoint of 0.90 to include patients with predicted stage I/II disease. These cutpoints were chosen based on their test characteristics (high sensitivity, specificity, and positive predictive value [PPV] or negative predictive value [NPV]). Finally, we also compared the test characteristics derived from our multivariate prediction equations to test characteristics derived from single-predictor equations for distant and regional disease used in a prior study, [17] to determine whether multivariate equations improved test characteristics compared with the single-predictor equations.
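The sequential use of the two equations can be sketched as a two-step rule. The cutpoints of 0.05 and 0.90 are those stated in the text; the function name and the returned labels are hypothetical.

```python
def classify_early_stage(p_stage_iv, p_stage_i_ii, iv_cutpoint=0.05, early_cutpoint=0.90):
    """Apply the stage IV equation first, then the stage I/II equation to patients not excluded."""
    # Step 1: exclude patients whose predicted probability of stage IV disease reaches the cutpoint.
    if p_stage_iv >= iv_cutpoint:
        return "predicted stage IV (excluded)"
    # Step 2: among the remainder, include patients with a high predicted probability of stage I/II disease.
    if p_stage_i_ii >= early_cutpoint:
        return "predicted stage I/II (included)"
    return "not predicted stage I/II (excluded)"
```

For instance, a patient with a stage IV probability of 0.01 and a stage I/II probability of 0.95 would be retained in the early stage sample, whereas a stage IV probability of 0.20 would exclude the patient at the first step regardless of the second probability.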

Implementation
Example in practice: Medicare test sample
Finally, we present an example that applies the prediction equation. We used a test sample based on a claims-only dataset, the national Medicare dataset. The national Medicare dataset includes claims data for all Medicare beneficiaries in the United States. Files contain data collected by Medicare for reimbursement of health care services for each beneficiary and include institutional (inpatient and outpatient) and non-institutional (physicians or other providers) final action claims [29]. We initially included 127,607 women (age ≥ 65) with a diagnosis claim indicating invasive breast cancer in 2003 (International Classification of Diseases, Ninth Revision (ICD-9) diagnosis code of 174) who underwent a breast cancer-related procedure. We then excluded 23,715 patients who did not have at least 2 claims (on different dates) specifying a diagnosis of invasive breast cancer between January 1, 2003 and December 31, 2004 (at least 1 claim must have occurred during 2003); 16,471 patients who had a breast cancer-related diagnosis or procedure claim between January 1, 2002, and December 31, 2002; 5,719 patients who were receiving Medicare coverage because of end-stage renal disease or disability; and 6,612 patients who lacked Part A or B coverage or who had intermittent health maintenance organization coverage in the 9 months after or in the 1 year before their breast cancer diagnosis date (of these patients, 3,561 had incomplete information in the year prior to diagnosis because they were <66 years of age); for a total sample size of 56,725 patients. This method for sample selection of incident breast cancer has been validated in a prior study [30].
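One of the selection criteria above, the requirement of at least 2 diagnosis claims on different dates in 2003-2004 with at least 1 in 2003, can be sketched as a simple date check. The function name and the input format (a list of claim dates per patient) are assumptions; actual Medicare claims files are structured differently.

```python
from datetime import date

WINDOW_START = date(2003, 1, 1)
WINDOW_END = date(2004, 12, 31)


def meets_diagnosis_claim_rule(claim_dates):
    """True if the patient has >= 2 invasive breast cancer diagnosis claims on different dates
    within the window, at least one of which falls in 2003."""
    # A set collapses multiple claims filed on the same date into one.
    in_window = {d for d in claim_dates if WINDOW_START <= d <= WINDOW_END}
    has_two_distinct_dates = len(in_window) >= 2
    has_claim_in_2003 = any(d.year == 2003 for d in in_window)
    return has_two_distinct_dates and has_claim_in_2003
```

Under this rule, two claims filed on the same day count as a single date, and a patient whose only claims fall in 2004 does not qualify.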
In this test sample, we applied our derived algorithm and calculated the frequency of patients classified as predicted stage IV and predicted stage I/II disease. Again, for this sample, the first step used a probability cutpoint of 0.05 and the second step a probability cutpoint of 0.90. As a test of our algorithm for consistency, we compared the predicted frequencies to the actual stage distribution in two populations: 1) the SEER-Medicare population (age >65 years) and 2) the National Cancer Data Base population (age ≥ 70 years) [31].

Prediction algorithm equations and test characteristics for probability cutpoints
Candidate covariates tested in prediction equations are listed in Table S1, Additional file 1. Parameter estimates for the covariates included in each final prediction equation are listed in Table S2, Additional file 1.

Stage IV
Fourteen percent of all patients and 73% of patients with stage IV disease had a claims code indicating possible metastatic disease. Accordingly, the single-predictor model including only this covariate had sensitivity of 73%, specificity 89%, PPV 22%, and NPV 99% for identifying stage IV disease. After including covariates (Table S2, Additional file 1) in the multivariate model, the sensitivity was 81% (95% confidence interval, 80%-84%) for identifying stage IV disease at a probability cutpoint of 0.05. At this cutpoint, specificity was 89% (86%-89%), PPV 24% (22%-25%), and NPV 99% (99%-99%); the model's c-statistic was 0.93 (Table S3, Additional file 1). The distribution of calculated predicted probabilities in the validation set for patients with stage IV versus stage I-III disease is presented in Figure 1a.
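The combination of high sensitivity and specificity with a low PPV follows from the low prevalence of stage IV disease. A minimal sketch of the relationship is given below, assuming a prevalence of 4% (the stage IV share reported for the entire sample in the abstract; the prevalence in the validation set may differ slightly). The function name is hypothetical.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """PPV and NPV implied by sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Multivariate stage IV equation at the 0.05 cutpoint, with the assumed 4% prevalence.
ppv, npv = ppv_npv(sensitivity=0.81, specificity=0.89, prevalence=0.04)
```

Under these assumptions the implied PPV is roughly 0.23 and the implied NPV roughly 0.99, close to the reported 24% and 99%. Because stage I-III patients greatly outnumber stage IV patients, even an 11% false-positive rate produces more false positives than true positives.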

Comparison with other single predictors
For comparison's sake, for identifying stage IV disease, the second most important predictor was axillary lymph node dissection. This predictor alone would yield the following test characteristics: sensitivity 67%; specificity 74%; PPV 10%; and NPV 98%. For identifying stage I/II disease, the second most important predictor was breast conserving surgery vs. mastectomy, yielding the following test characteristics: sensitivity 49%; specificity 82%; PPV 95%; and NPV 18%.

Combining equations
The prediction equations were most accurate for isolating patients with early stage disease. Specifically, after applying the two prediction equations sequentially to the validation sample to identify patients with predicted stage I/II disease, a subset of 23,285 patients was selected. Of 38,653 patients, 36,417 were predicted to [...] were excluded (classified as other than stage I/II) as a result of the algorithm (4,604 from the first model and 2,236 from the second model) (Figure 2, Figure 3).
[Table rows recovered from this position: No. visits to medical oncologist, mean (SD): 4 (9); No. visits to radiation oncologist, mean (SD): 3 (5); Imaging (CT, MRI, PET, or bone scan): 25]

Discussion
In this cohort of older breast cancer patients, Medicare claims data assisted the prediction of cancer stage.
Predictor equations using claims data alone were able to achieve approximately 80% sensitivity and specificity for identifying stage IV disease as well as distinguishing stage I/II from stage III disease. Prediction models maximized NPV when distinguishing stage IV from stage I-III disease but maximized PPV when distinguishing stage I/II from III disease. With a resulting tradeoff of lower PPV in the first model and lower NPV in the second model, the algorithm was therefore found to be best suited to most accurately identify early stage disease. Specifically, an algorithm combining the two equations seeking to identify patients with early stage disease was able to achieve a sample in which 98% of patients had stage I or II disease. Our prediction models represent an improvement over the single other previously published model. In this prior study, Cooper et al. used single-predictor equations to identify cancer stage. To identify patients with distant disease, authors tested a single variable based on claims codes for metastatic disease. This single-predictor model demonstrated 60% sensitivity and 58% PPV. To distinguish patients with local versus regional disease, authors tested a single variable based on the claim code for axillary lymph node involvement. This single-predictor model demonstrated 62% sensitivity and 85% PPV [17]. The relatively poor test characteristics from this prior study demonstrated that these single-predictor models would be insufficient for predicting stage in patients with breast cancer and suggested that claims data alone would be inadequate for epidemiologic studies of cancer patients. In contrast, our prediction models have improved upon these test characteristics. Our single predictor model demonstrated improvement, likely in part due to a more extensive list of claims codes, with multiple covariates providing added value. 
Moreover, our algorithm demonstrated consistency when results were compared with population-based data from SEER-Medicare and the NCDB. There are two important future research applications of our prediction models. First, our multivariate logistic modeling method for developing a stage predictor algorithm may similarly be applied to test models and potentially develop stage prediction equations for patients diagnosed with cancers of other sites. Second, our prediction equations may also be applied directly to claims-based databases of breast cancer patients who have unknown stage. Using a combination of multiple predictors along with claims codes for metastatic disease and axillary lymph node dissection, parameter estimates and calculated probabilities can be applied to the prediction of patient breast cancer stage.
Our algorithm can therefore serve as a tool to assist in the investigation of a variety of epidemiologic research questions in breast cancer patients by allowing a sample selection of those patients with early stage disease. In addition, predicted early stage disease can be applied as a covariate. Accordingly, since disease stage may be better accounted for, claims databases of breast cancer patients may also be better applied to address such questions as the assessment of treatment utilization, geographic variation, or outcomes in patients diagnosed with early stage breast cancer. Specifically, for the stage IV prediction model, a probability cutpoint between ≥ 0.05 and ≥ 0.10 would be highly specific and sensitive for identifying patients with stage IV disease. For the stage I/II prediction model, a cutpoint between ≥ 0.80 and ≥ 0.90 would be highly specific and sensitive for distinguishing patients with stage I/II disease from patients with stage III disease.
For the identification of patients with stage IV disease, a selected probability cutpoint criterion could be translated into a dichotomous variable, and used either to select a sample of patients with stage IV disease or used in a "rule out" context, as an exclusion criterion. The high NPV in our proposed model suggests that when using these cutpoints to identify a sample limited to patients with stage I-III disease, the likelihood of misclassification bias (bias due to the inappropriate inclusion of patients with stage IV disease in the sample) would be low in a "rule-out" setting.
For distinguishing patients with stage I/II versus III disease, the probability cutpoint criterion, translated into a dichotomous variable, could be useful in various contexts, such as excluding patients with stage III disease in order to refine a study population of patients with early stage breast cancer, or creating a dichotomous covariate to adjust for potential confounding associated with stage I-III disease. The test characteristics in our analysis suggest that the combination of these prediction equations may be particularly useful in the context of identifying breast cancer patients with early stage (stage I and II) disease.
Our study has limitations to consider. First, our cohort was limited only to older patients with breast cancer. Although the variables associated with stage are likely to be similar in younger patients, exact parameter estimates may differ, and the application of these models in younger patients requires further validation. Additionally, as our predictor variables were derived from Medicare claims, these models will also require validation in other claims-based data. If not all the proposed variables in our models are available, however, at a minimum, adjustment in multivariate analysis for as many possible available candidate predictors proposed in our study could be useful to improve modeling of breast cancer outcomes in future studies. Although we excluded from our parsimonious model covariates that required long-term follow-up (specifically, overall survival and mastectomy 9 or more months after diagnosis), our models still required both retrospective and prospective data for up to 1 year prior to and 1 year after the date of diagnosis. Thus, studies applying our models would be limited to patients with continuous coverage and complete claims information over this time period. The gold standard for our outcome, cancer stage, was based on pathologic variables in SEER-Medicare, though given a lack of central pathology review by the SEER program, unmeasured error may have affected the gold standard, yielding potentially less than 100% accuracy. Finally, if a sample was selected based on the algorithm, sample characteristics derived from algorithm predictor variables (for example, chemotherapy, radiotherapy, and surgery utilization) may be under- or overestimated compared with the complete patient population, depending on the direction and significance of their association with disease stage in the prediction equations.
Smith et al. Epidemiologic Perspectives & Innovations 2010, 7:1 http://www.epi-perspectives.com/content/7/1/1

Conclusions
Medicare claims data can be utilized to derive a useful algorithm to predict stage in breast cancer patients. In particular, the predicted probability of early stage disease can be easily generated when applying the prediction algorithm to this patient population, thus substantially improving the utility of Medicare claims data for studying breast cancer.
Additional file 1: Table S1. Candidate Covariates and Claims Codes.