External Validation of European System for Cardiac Operative Risk Evaluation II (EuroSCORE II) for Risk Prioritization in an Iranian Population

Introduction The European System for Cardiac Operative Risk Evaluation II (EuroSCORE II) is a prediction model that maps 18 predictors to a 30-day post-operative risk of death, with the aim of accurately stratifying candidates for cardiac surgery. Objective The objective of this study was to determine the performance of EuroSCORE II risk predictions among patients who underwent heart surgery in one region of Iran. Methods A retrospective cohort study was conducted to collect the required variables for all consecutive patients who underwent heart surgery at Emam Reza hospital, Northeast Iran, between 2014 and 2015. Univariate and multivariate analyses were performed to identify covariates that contribute significantly to a higher EuroSCORE II in our population. External validation was performed by comparing observed and expected mortality, using the area under the receiver operating characteristic curve (AUC) to assess discrimination. The Brier score and the Hosmer-Lemeshow goodness-of-fit test were used to assess overall performance and calibration, respectively. Results Two thousand five hundred eighty-one patients (59.6% males) were included. The observed mortality rate was 3.3%, whereas EuroSCORE II predicted 4.7%. Although the overall performance was acceptable (Brier score=0.047), the model showed poor discrimination (AUC=0.667; sensitivity=61.90%, specificity=66.24%) and poor calibration (Hosmer-Lemeshow test, P<0.01). Conclusion Our study showed that the discrimination power of EuroSCORE II is less than optimal for outcome prediction and that the model is less accurate for resource allocation programs. It highlights the need to recalibrate this risk stratification tool in order to improve post cardiac surgery outcome predictions in Iran.


INTRODUCTION
A growing body of literature underlines the need for reliable information on the cost-effectiveness of adult cardiac surgery. Moreover, potential post-operative adverse events highlight the significance of perioperative clinical decision making. Various prediction models have been developed to estimate risk-adjusted mortality, morbidity and length of intensive care unit stay following cardiac surgery [1]. The European System for Cardiac Operative Risk Evaluation (EuroSCORE) is a risk stratification tool which incorporates 18 variables describing the patient, the heart and the proposed surgery to predict 30-day post-
characteristics and were used for statistical analysis. The relation of each variable was examined, and the number of patients with each value was compared with the original EuroSCORE II population. Then, the overall model performance was reported using the Brier score (a score function that measures the closeness of predictions to actual outcomes, ranging from 0 for a perfect model to 0.25 for a non-informative model) [12]. The area under the receiver operating characteristic curve (AUC) was used to indicate the discriminative ability of the model (1 indicates perfect discrimination, whereas 0.5 indicates random classification). The Hosmer-Lemeshow goodness-of-fit test was employed to assess the fit of the model to the data by comparing observed with predicted mortality by decile of predicted probability [13]. Analyses were performed using MedCalc 13.3.3.0 and R 3.3.1 (ResourceSelection package).
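The three performance measures described above can be illustrated with a minimal sketch in plain Python. This is not the authors' code (they report using MedCalc and R); the function names and the toy data are hypothetical, and predicted risks are assumed to be probabilities strictly between 0 and 1 with at least one death and one survivor in the sample.

```python
from typing import Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def auc(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Probability that a randomly chosen death received a higher predicted
    risk than a randomly chosen survivor (ties count as one half)."""
    deaths = [p for p, y in zip(probs, outcomes) if y == 1]
    survivors = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(deaths) * len(survivors))


def hosmer_lemeshow(probs: Sequence[float], outcomes: Sequence[int],
                    groups: int = 10) -> Tuple[float, int]:
    """Hosmer-Lemeshow chi-square statistic over risk-ordered groups.

    Returns the statistic and its degrees of freedom (groups - 2).
    Assumes at least `groups` patients and 0 < p < 1 for every patient.
    """
    pairs = sorted(zip(probs, outcomes))  # order patients by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected deaths in the group
        observed = sum(y for _, y in chunk)   # observed deaths in the group
        # Contribution of deaths plus contribution of survivors.
        stat += (observed - expected) ** 2 / expected
        stat += (observed - expected) ** 2 / (size - expected)
    return stat, groups - 2


# Hypothetical toy data: predicted risks and observed 30-day outcomes.
toy_probs = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
toy_outcomes = [0, 0, 1, 0, 1, 1]
print(brier_score(toy_probs, toy_outcomes))
print(auc(toy_probs, toy_outcomes))
print(hosmer_lemeshow(toy_probs, toy_outcomes, groups=3))
```

The P-value of the Hosmer-Lemeshow test would then be read from a chi-square distribution with the returned degrees of freedom; that step is left out here to keep the sketch free of external libraries.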

Patients' Baseline Characteristics
The mean age of the 2581 patients was 56.3±13.88 years (minimum=17, maximum=93). The mortality rate was 3.3% (N=84). The mean height and weight of patients were 1.64±0.1 meters and 68.4±13.4 kilograms, respectively. About 7.8% (N=201) of patients were aged 75 years or older and 22.2% (N=572) were diabetic. While 6.1% (N=158) had some type of chronic kidney disease, 15.1% of these (N=24) underwent regular dialysis; 10.6% (N=274) were current or past smokers and 2.2% (N=56) were diagnosed with chronic obstructive pulmonary disease (COPD). Table 1 compares some characteristics of our patients with the original EuroSCORE II population.
As all procedures were elective operations, there were no urgent surgeries. Also, 23 patients undergoing valve surgery were suffering from active endocarditis, extracardiac arteriopathy. Poor mobility was observed in 48 patients. No patient was in Canadian Cardiovascular Society (CCS) class 4 or in a critical preoperative state, and none of the surgeries involved the thoracic aorta. Because no deaths occurred among patients with these factors, the factors were excluded from the regression analysis. Some other details are presented in Tables 1 and 2. The univariate and multivariate analyses are presented in Table 3.

Patients' Heart Status
According to the New York Heart Association (NYHA) classification, 37.1% (N=957) of patients were in class III heart failure, 6.2% (N=161) had had congestive heart failure within the three months before surgery, and 1% (N=26) had atrial fibrillation. While 61.4% of surgeries were on-pump, the rest were performed off-pump. Table 2 gives more information about the biological and clinical characteristics of patients.

Performance Measures
As mentioned before, the overall mortality was 3.3%. When applied to the current data set, EuroSCORE II predicted a mortality of 4.7%. This means that the current risk-adjusted mortality ratio (RAMR=observed/predicted) for the previous additive model is about 0.67 (a ratio below 1 indicates that the model overestimates mortality), which is not adequate for outcome prediction or resource allocation programs.
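As a worked check of this ratio, the following uses only the rounded figures reported in this section (3.3% observed, 4.7% predicted, 84 deaths among 2581 patients). Treating 4.7% as the mean predicted risk across all patients is an assumption.

```latex
\mathrm{RAMR} = \frac{O}{E}
  = \frac{\text{observed mortality}}{\text{predicted mortality}}
  \approx \frac{3.3\%}{4.7\%} \approx 0.70,
\qquad
\frac{84}{0.047 \times 2581} \approx \frac{84}{121.3} \approx 0.69
```

Both values are below 1 and close to the reported ratio of about 0.67, which points to overprediction of mortality by the model; the small gap from 0.67 cannot be resolved from the rounded figures given here.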
operative risk of death [2]. The predictive power of EuroSCORE II has been evaluated on different samples of the target population in European countries. The vast majority of these studies have reported acceptable calibration (how many patients with a risk prediction of x% experienced the outcome?) and discrimination (do patients who experienced the outcome receive higher risk predictions than those who did not?) in comparison with the Society of Thoracic Surgeons (STS) score, especially for patients undergoing coronary artery bypass grafting (CABG) [3].
An international evaluation study was performed by Roques et al. [4] in 2000 to assess the predictive ability of EuroSCORE on 18676 patients from six European countries (Germany, Spain, England, France, Italy, and Finland). Despite clinical and epidemiological differences, EuroSCORE provided acceptable predictions for all datasets (especially for Spanish patients). Geissler et al. [5] compared six prediction models using a single-center 2-year dataset, which resulted in the best performance measures for EuroSCORE II. While previous studies reported acceptable applicability of EuroSCORE II for patients undergoing CABG [6,7], conflicting reports exist for Australian samples [8].
Similar studies in Iran reflect poor applicability of EuroSCORE II in patients undergoing different types of cardiac surgery [9,10]. Diverse surgical techniques and potential risk factors that are already established in different communities may mislead prediction models and result in erroneous interpretations. Thus, mathematical localization studies are required in different geographical regions to ensure that a model predicts properly before routine clinical use [11]. This study was conducted to investigate the accuracy of the quantitative prioritization scores estimated by EuroSCORE II in an Iranian population.

Participants and Setting
A retrospective single-center cohort study was conducted to include all consecutive patients undergoing cardiac surgery at Emam Reza hospital, Northeast Iran, from January 1, 2014 to December 31, 2015. Once the patient was hospitalized, a cardiologist or a general physician evaluated the pre-, peri-, and postoperative state and filled out a pre-designed structured paper form.
A total of 2907 patients were included, and the 30-day outcome was ascertained using the hospital information system or direct contact with the patient's family. About 11.2% (N=326) of records were excluded because of missing values in major variables, and all data items were rechecked to verify their consistency, reliability and integrity. In some cases (less than 3% of records), on the physicians' recommendation, missing data were imputed with normal values.

Statistical Analysis
First, univariate and multivariate analyses of the relevant EuroSCORE II prognostic factors were performed to identify significant covariates that contributed to higher risk. EuroSCORE II was calculated and entered into the dataset using an online calculator (available at: http://riskcalc.sts.org/stswebriskcalc/#). The data were aggregated in a single electronic dataset, summarized considering the demographic and clinical
prediction after adult cardiac surgery. The analysis of the ROC curve showed that the discrimination power of EuroSCORE II is less than optimal (AUC=0.667) for outcome prediction and less accurate for resource allocation programs, because an AUC above 0.7 is regarded as the minimum acceptable value for a useful prediction model [5]. Although a Brier score below 0.05 indicates good overall performance for the model [12], the Hosmer-Lemeshow test showed unacceptable matching of predicted probabilities to observed events. In general, EuroSCORE II did not predict the outcome for our population as well as it did for European populations. Thus, a recalibration process seems to be essential for the Iranian population prior to daily clinical use.
It is well known that risk assessment is central to the evaluation of perioperative risk. The application of risk stratification tools gives an objective appraisal of risk for both physicians and patients and presents a good estimation for
The Brier score lower than 0.05 indicates acceptable overall performance. However, the AUC of 0.667 indicates poor discrimination (cut-off=3.0, sensitivity=61.90%, specificity=66.24%). Also, the Hosmer-Lemeshow test showed unacceptable matching of predicted probabilities to observed events (P<0.01) (Table 4). Performance measures of EuroSCORE II are presented in Figure 1 and Table 4.
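The sensitivity and specificity quoted at a given cut-off can be reproduced with a short calculation. The following is a minimal Python sketch rather than the authors' MedCalc procedure; the function name and toy scores are hypothetical, and it assumes that a patient is flagged as high risk when the score is strictly above the cut-off.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(scores: Sequence[float], outcomes: Sequence[int],
                            cutoff: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score > cutoff flags high risk.

    Assumes the sample contains at least one death and one survivor.
    """
    tp = fn = tn = fp = 0
    for score, died in zip(scores, outcomes):
        flagged = score > cutoff  # assumption: strict inequality
        if died and flagged:
            tp += 1
        elif died:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: EuroSCORE II values in percent and 30-day outcomes.
toy_scores = [1.0, 2.5, 3.5, 6.0, 0.8, 4.2]
toy_outcomes = [0, 0, 0, 1, 1, 1]
sens, spec = sensitivity_specificity(toy_scores, toy_outcomes, cutoff=3.0)
print(f"sensitivity={sens:.2%}, specificity={spec:.2%}")
```

With 84 deaths in the cohort, a sensitivity of 61.90% would correspond to roughly 52 deaths scoring above the cut-off.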

Main Finding
Our single-center study, based on consecutive patients who underwent cardiac surgery, revealed that EuroSCORE II demonstrated moderate overall statistical performance, while poor discrimination and calibration remain concerning issues regarding 30-day post-operative mortality
Table 1. Comparison of demographic and comorbidity characteristics between the original EuroSCORE II population and an Iranian sample [2]. The expected and observed mortality can also be compared by any variable [2].
[9,10]. Further studies concentrating on the recalibrated version of the model published unacceptable results [2,15,21,23]. The observed 30-day mortality rate in our sample (3.3%) was similar to those published by Roques et al. [7] (3.4%), Nashef et al. [6] (3.9%), Geissler et al. [5] (4%), and Pitkänen et al. [1] (2%). While Mir Mohammad Sadeghi et al. [9] reported a similar mortality rate in Isfahan (central Iran), four years later Jamaati et al. [10] evaluated EuroSCORE II on a sample with a 12.2% mortality rate. The AUC of 66.7% in our study is lower than in all similar studies, including 78% by Geissler et al. [5], 77% by Pitkänen et al. [1], and 75.4% by Antunes et al. [24]. Similar studies in Iran, however, also confirmed the poor discriminative ability of EuroSCORE II [9,10].

Currently, interest in prediction models as powerful tools for outcome prediction, cost-effectiveness strategies, reasonable resource allocation, and consequently quality control processes has been growing [1,9,25].
According to the results of our study, despite the small differences between the two populations (Tables 1 to 3), EuroSCORE II may not be completely reliable for risk prioritization or resource allocation programs in Iran. The poor performance measures of EuroSCORE II highlight the need to reformulate this risk stratification tool in order to improve post cardiac surgery outcome predictions in Iran. This may be done by calibrating a mortality risk scoring model (e.g. the EuroSCORE model) for the region or by creating new models with accurate localized parameter sets [11,20].
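One common way to calibrate an existing score for a region is logistic recalibration, which refits only an intercept and a slope on the logit of the original predicted risk. The Python sketch below illustrates that general technique under stated assumptions; it is not a method reported in this study, and the function names and grouped toy data are hypothetical. Predicted risks are assumed to lie strictly between 0 and 1.

```python
import math
from typing import Sequence, Tuple


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def expit(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_recalibration(probs: Sequence[float], outcomes: Sequence[int],
                      iterations: int = 25) -> Tuple[float, float]:
    """Fit outcome ~ a + b * logit(original risk) by Newton-Raphson.

    Returns (a, b); a = 0 and b = 1 would mean no recalibration is needed.
    """
    x = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # start from the original, unmodified model
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # gradient with respect to a
            g1 += (yi - mu) * xi     # gradient with respect to b
            h00 += w                 # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def recalibrate(p: float, a: float, b: float) -> float:
    """Map an original predicted risk to its recalibrated value."""
    return expit(a + b * logit(p))


# Hypothetical toy cohort: (original risk, patients, observed deaths).
toy_groups = [(0.02, 100, 1), (0.05, 100, 3), (0.10, 100, 6), (0.30, 100, 20)]
toy_probs, toy_outcomes = [], []
for risk, n, deaths in toy_groups:
    toy_probs += [risk] * n
    toy_outcomes += [1] * deaths + [0] * (n - deaths)

a, b = fit_recalibration(toy_probs, toy_outcomes)
print(a, b, recalibrate(0.10, a, b))
```

Because the fit includes an intercept, the mean recalibrated risk matches the observed mortality in the fitting sample, which corrects overall overprediction; discrimination (AUC) is unchanged when the slope is positive, since the ranking of patients is preserved.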

Limitation
Although sampling was done in one of the largest hospitals performing various cardiac procedures and the study has an adequate sample size, including only one center may limit the generalizability of the results to the entire country.

Future Studies
Given the key prognostic role of prediction models, further investigation of clinical risk factors and a recalibration process on large samples of the target population from different centers around the country seem essential to improve outcome predictions.

CONCLUSION
Our study showed that the discrimination power of EuroSCORE II is less than optimal for outcome prediction and that the model is less accurate for resource allocation programs. It highlights the need to recalibrate this risk stratification tool in order to improve post cardiac surgery outcome predictions in Iran.

ACKNOWLEDGEMENT
The authors would like to express their appreciation to the staff of the cardiac surgery ICU of Emam Reza hospital, Mashhad, for their tremendous support in data collection. We thank Prof. Ameen Abu-Hanna for remarks on an earlier version of this paper. This study was a part of the first author's PhD thesis, which was supported by a grant from Mashhad University of Medical Sciences Research Councils.
allocation of resources. However, some features may not be fully covered by the models, including center-to-center variability in outcomes, the type of surgery required and the inherent complexity of some diseases [14]. On the other hand, patient populations differ significantly between institutions and countries. Thus, the comparison of absolute numbers such as mortality rates is not feasible. A large variety of risk scores have been developed because of differences in patient populations, and comparisons of observed versus expected mortality have been reported [15-17].

Comparison to Similar Studies
EuroSCORE was first developed in 1999 to estimate postsurgical mortality in a European population who underwent cardiac surgery (70% of procedures were CABG) [6]. In 2010, a meta-analysis by Parolari et al. [18] revealed poor performance of EuroSCORE in valve surgery. Moreover, as the quality of medical techniques continues to improve over time, model expiration was considered an inevitable issue in this area [2,18-20]. To address the surgical type bias, the development dataset of the modified version of the model included a reasonable number of patients who underwent CABG and valve surgeries [2]. The new version of EuroSCORE showed acceptable discrimination power among both European and non-European samples [17,21-23]. The new model also provided more acceptable predictions for surgeries other than CABG [21-23]. However, evaluation studies in Iran reported poor performance measures for both EuroSCORE I