Performance of the high-dimensional propensity score in adjusting for unmeasured confounders

High-dimensional propensity scores (hdPS) can adjust for measured confounders, but it remains unclear how well it can adjust for unmeasured confounders. Our goal was to identify if the hdPS method could adjust for confounders which were hidden to the hdPS algorithm. The hdPS algorithm was used to estimate two hdPS; the first version (hdPS-1) was estimated using data provided by 6 data dimensions and the second version (hdPS-2) was estimated using data provided from only two of the 6 data dimensions. Two matched sub-cohorts were created by matching one patient initiated on a high-dose statin to one patient initiated on a low-dose statin based on either hdPS-1 (Matched hdPS Full Info Sub-Cohort) or hdPS-2 (Matched hdPS Hidden Info Sub-Cohort). Performances of both hdPS were compared by means of the absolute standardized differences (ASDD) regarding 18 characteristics (data on seven of the 18 characteristics were hidden to the hdPS algorithm when estimating the hdPS-2). Eight out of the 18 characteristics were shown to be unbalanced within the unmatched cohort. Matching on either hdPS achieved adequate balance (i.e., ASDD <0.1) on all 18 characteristics. Our results indicate that the hdPS method was able to adjust for hidden confounders supporting the claim that the hdPS method can adjust for at least some unmeasured confounders.


Introduction
The high-dimensional propensity score (hdPS) has been used in different contexts and within multiple databases for the control of confounding by indication and it has been shown to be at least equivalent and potentially superior to the propensity score in this regard [1][2][3][4][5][6][7]. Superiority of the hdPS is generally attributed to the greater number of covariates drawn from the database to include in the final hdPS model [5]. However, the performance of the hdPS has not been assessed when information regarding some of these potential confounders within the examined database is limited.
Our aim was to assess the impact of limited information regarding potential confounders on the performance of the hdPS. To achieve this goal, we compared the performance of the hdPS in a scenario where the algorithm had full access to all of the data contained within a database to its performance in a scenario where only partial data were available to the algorithm.
The administrative database situation in Quebec, Canada provides an interesting setting in which to examine this issue.
There are two distinct sets of medico-administrative databases available in Quebec; the Régie de l'assurance maladie du Québec (RAMQ) databases (physician and pharmacists billing data) and the Maintenance et Exploitation des Données pour l'Étude de la Clientèle Hospitalière (MED-ECHO) databases (hospital discharge data). RAMQ and MED-ECHO data may overlap. However, they differ on their source of information (e.g., only the RAMQ databases provide outpatient data) and may be more detailed in specific areas (e.g., the MED-ECHO databases provide more detailed and more precise information regarding patients' entry date/discharge date and on in-hospital diagnoses and therapeutic and diagnostic procedures) [8][9][10].
To test the performance of the hdPS under conditions of limited information regarding potential confounders, we examined the association between the risk of diabetes and exposure to high versus low statin doses [7,[11][12][13][14][15][16]. Assessing this association in a Quebec incident statin user population may be hindered by the presence of confounding by indication since patients started on a higher statin dose have been shown to be sicker and at higher risk for diabetes than those started on a lower dose [7].
We compared the performance of the hdPS within two scenarios: (1) the algorithm used in the hdPS estimation had full access to all the data provided by both the MED-ECHO and RAMQ databases, and (2) the algorithm had only access to the data provided by the MED-ECHO databases.
One of the uses of hdPS is to select a matched sub-cohort from the main cohort (all patients initiated on statins) where the characteristics of patients who received treatment A (high dose statins) are similar to the characteristics of patients who received treatment B (low-dose statins) [5]. That is, we assessed the performance of the hdPS on its ability to select a balanced sub-cohort when it is used as a matching criterion [7,[17][18][19][20][21]. The performance of the restricted information hdPS was assessed by comparing the balance achieved with this method to the balance achieved when all information was available to the algorithm.

Data sources
The different data sources used within this study have been described elsewhere [7]. Briefly, we obtained data on a cohort of 800,551 new statin users from RAMQ. For this study, we used data from both the RAMQ databases (i.e., demographic database, medical services, and claims database and pharmaceutical database) and from the MED-ECHO databases (i.e., hospitalization-description database, hospitalization-diagnoses database, and hospitalization-intervention database). Patient records were linked across all databases by use of a unique identification number which was encrypted to protect patient confidentiality. Access to data was granted by the Commission d'accès à l'information and the protocol was approved by the Centre hospitalier de l'Université de Montréal's ethics' committee.

Full cohort
The Full Cohort used within this study has been described elsewhere [7]. Briefly, it was comprised of 404,129 patients newly initiated on a statin (either simvastatin, lovastatin, pravastatin, fluvastatin, atorvastatin, or rosuvastatin) between January 1st, 1998 and December 31st, 2010. Patients were defined as having been newly initiated on a statin if they did not receive any statin dispensation in the year prior to the date of first statin dispensation (hereby defined as the cohort entry date).

Identification of exposure group
All patients were categorized into two groups based on the strength of the daily statin dose of their first statin dispensation [12]. Patients initiated on a daily dose of ≥10 mg of rosuvastatin, ≥20 mg of atorvastatin or ≥40 mg of simvastatin formed the high dose group and the remaining patients formed the low dose group.

Identification of the study outcome
Onset of diabetes within 2 years follow-up was used as our study outcome. Patients were defined as cases if they received either a dispensation of a drug used in the treatment of diabetes (WHO ATC A10) or a diagnosis of diabetes (ICD-9 code: 250.x; ICD-10 codes: E10.x-E14.x) within the 2 years following the cohort entry date; all other patients were considered to be diabetes-free.

High-dimensional propensity score method
Two distinct hdPS models were created and resulting hdPS were calculated for all patients included in the Full Cohort. Detailed description of the hdPS method can be found elsewhere [5]. Both models were created using the default setting of the SAS hdPS macro v.1 [22].
Six potential data dimensions were defined using the data collected from the year prior to the cohort entry date: (1) drugs dispensed in an outpatient setting, (2) physician claims for procedures codes, (3) physician claims for diagnostic codes, (4) specialty of the physician providing care, (5) hospitalization discharge data for inpatient procedure codes, and (6) hospitalization discharge data for inpatient diagnostic code.

Full information model
The first hdPS model (hereby defined as hdPS full info model) was created by selecting the top 500 covariates, as assessed by the hdPS algorithm, contained within all 6 data dimensions. In addition to these 500 covariates, the following known confounders were forced within the hdPS full info model: [12] patients' sex, age, poverty level status (yes versus no) at the cohort entry date, year of entry within the cohort (as a categorical variable), and ≥1 hospitalization, ≥5 outpatient visits, ≥5 distinct drugs dispensed to the patient, all within the year prior to the cohort entry date. The resulting hdPS full info model was used to estimate each patient's hdPS-1.

Hidden information model
The second hdPS model (hereby defined as the hdPS hidden info model) was created by selecting the top 500 covariates, as assessed by the hdPS algorithm, contained within the 2 data dimensions provided from the MED-ECHO databases since it was believed a priori that it would contain less potential covariates, therefore increasing the risk of unmeasured confounding (the 4 data dimensions provided by RAMQ were hidden to the algorithm). In addition to these 500 variables, the following covariates were forced within the hdPS hidden info model: patients' sex, age, and poverty level status (yes versus no) at the cohort entry date, the year of entry within the cohort (as a categorical variable) and ≥1 hospitalization in the year prior to the cohort entry date. Within this model, hospitalization status (≥1 hospitalization yes vs no) was assessed solely from data available within the MED-ECHO databases. Outpatient medical resource utilization and outpatient drug dispensation covariates, forced within the previous model, were excluded from this list since they were based on information solely available within the RAMQ databases. The resulting hdPS hidden info model was used to estimate each patient's hdPS-2.

Creation of the matched sub-cohorts
Trimming was performed and patients located within nonoverlapping regions of the hdPS-1 distribution were excluded [23][24][25], all other patients were eligible for inclusion within the Matched hdPS Full Info Sub-Cohort. Low dose controls were found for patients initiated on a high dose using a greedy, nearest neighbor 1:1 matching algorithm. Matching occurred if the difference in the logit of hdPS-1 between the nearest neighbors was within a caliper width equal to 0.2 times the SD of the logit of the hdPS-1 [26]. Patients selected by the matching algorithm were included within the Matched hdPS Full Info Sub-Cohort. These two steps were reproduced using hdPS-2 in order to create the Matched hdPS Hidden Info Sub-Cohort.

Statistical analyses
Patients' baseline characteristics within both sub-cohorts were assessed using the information provided from the full database. Absolute standardized differences (ASDD) were used to compare patients' baseline characteristics between patients included in the high dose group versus those included in the low dose group within both matched sub-cohorts [19,21]. ASDD <0.1 are generally assumed to indicate good balance between groups [21,27].
Discrete data are presented in absolute values and percentages and continuous data are presented as mean (± SD). All statistics were performed using SAS version 9.3 (Cary, North Carolina).

Results
Description of the full cohort The Full Cohort is comprised of 404,129 patients, 264,947 patients (65.6 %) of which were in the low dose group while the remaining 139,182 patients (34.4 %) were in the high dose group; as mentioned previously, patients in the high dose group were different and overall sicker than those in the low dose group. Specifically, eight of the 18 examined patient characteristics (i.e., sex, ≥5 outpatient medical visits, ≥1 hospitalization, history of myocardial infarction, history of percutaneous coronary intervention, dispensation of beta-blockers, dispensation of angiotensin receptor blockers [ARB], dispensation of angiotensin converting enzyme inhibitors [ACEI]) were unbalanced (ASDD >0.1) within the Full Cohort [7].
Of the 404,129 patients included within the Full Cohort, none were classified as diabetic at the cohort entry date. Diabetes was identified in 12,978 patients (3.2 %) within the 2 years follow-up. Table 1 shows the number of potential covariates, with and without the assessment of recurrence, within each of the 6 data dimensions considered within this study. The 4 data dimensions provided from the RAMQ databases (n without assessment of recurrence =2758 [71.

Characteristics of the patients included within the Matched hdPS Full Info Sub-Cohort
Using data contained within all 6 available high-dimensions, we created the hdPS full info model which was used to estimate patients' hdPS-1. Three hundred and one patients (0.0 %) had hdPS-1 located within non-overlapping regions and were excluded from the analysis. Among the remaining 403,828 patients, we matched 116,014 patients (28.7 %) from the high dose group to 116,014 patients (28.7 %) from the low dose group based on their individual hdPS-1; selected patients formed the Matched hdPS Full Info Sub-Cohort (Fig. 1). The hdPS full info model was created from the information present within all 6 data dimensions while the hdPS hidden info model was limited to the information present within the 2 data dimension provided by MED-ECHO MED-ECHO Maintenance et Exploitation des Données pour l'Étude de la Clientèle Hospitalière ; RAMQ Régie de l'assurance maladie du Québec a Any covariate not present within at least 100 patients is excluded by the hdPS algorithm and was therefore not included within this table Patients included within the Matched hdPS Full Info Sub-Cohort were on average 64.6 years old (SD 11.2) and 116,688 of them were males (50.3 %) ( Table 2). Balance (ASDD <0.1) was obtained in all 18 examined patient characteristics (ASDD ranged from 0.001 to 0.023 with an average of 0.008).

Characteristics of the patients included within the Matched hdPS Hidden Info Sub-Cohort
Using data from the 2 data dimensions selected from the MED-ECHO databases, we created the hdPS hidden info model to estimate each patient's individual hdPS-2. Sixty-six patients (0.0 %) had hdPS-2 located within non-overlapping regions and were excluded from the analysis. Among the remaining 404,063 patients, we matched 119,376 patients (29.5 %) from the high dose group to 119,376 patients (29.5 %) from the low dose group based on their individual hdPS-2; selected patients formed the Matched hdPS Hidden Info Sub-Cohort (Fig. 1). About half of the patients included within this sub-cohort were male (n = 120,238 [50.4 %]) and the average age was 64.5 years old (SD 11.2) ( Table 3).
Balance within this sub-cohort was obtained for all 18 examined patient characteristics (ASDD ranged from 0.004 to 0.075 with an average of 0.027), including those which were hidden to the hdPS algorithm (ASDD for the hidden covariates ranged from 0.011 to 0.075 with an average of 0.036).
Relative performance of the two matched sub-cohorts ASDD obtained within both matched sub-cohorts are shown within Fig. 2. The Matched hdPS Full Info Sub-Cohort was shown to achieve better balance on 16 of the 18 examined patient characteristics, the two remaining characteristics were equally balanced within both matched sub-cohorts.

Discussion
Our results show that matching on the hdPS hidden info model achieved balance on all 18 examined patient characteristics. This result shows that the hdPS algorithm was able to adjust for imbalance regarding patient characteristics, some of which were Comorbidity status, drug dispensations, and medical utilization rates were assessed in the year prior to the cohort entry date. Absolute standardized differences are defined as the between group difference as a proportion of the pooled standard deviation of the two groups a At the cohort entry date b Identifies baseline characteristics which had 0.10< ASDD ≤0.20 within the unmatched populations [7] c Identifies baseline characteristics which had ASDD >0.20 within the unmatched populations [7] unavailable to the hdPS algorithm. Among these, some were very important variables regarding outpatient medical visits and drug dispensations which can be highly associated with both the choice of treatment and the risk of diabetes. As expected the hdPS algorithm had access to a greater number of potential covariates to build the hdPS full info model (n = 3854 potential covariates) than when building the hdPS hidden info model (n = 1096 [28.4 %] potential covariates). This difference implied that 431 (86.2 %) covariates selected within the hdPS full info model were no longer available for selection and had to be replaced by the algorithm when it built the hdPS hidden info model.
The main strength of our study is that it provides support to the claim that the hdPS is able to adjust for at least some unmeasured confounders. In their original paper, Schneeweiss and colleagues [5] hinted that some of the covariates selected by the hdPS algorithm may not be direct confounders but may actually be proxies of unmeasured confounders. Although adjusting for a perfect proxy of an unmeasured confounder is equivalent to directly adjusting for this confounder [28], it remained unclear if the hdPS could truly adjust for a confounder not present within the examined database. Four important known confounders (i.e., ≥5 medical outpatient visits, dispensation of beta-blockers, dispensation of ARB, and dispensation of ACEI; all shown to be unbalanced within the full cohort) [7] were not available to the hdPS hidden info model. Our results show that this model was able to achieve balance within all examined patient characteristics, including the four previously mentioned (Table 3). Such a result is of significant value since the PS technique may not adjust for variables not included within the PS model [29]. However, we were unable to identify which covariates selected by the hdPS algorithm were used as proxies for these four confounders.
Our study has limitations. First, since our study shows that hdPS was able to control for measured confounders which were unavailable to the hdPS algorithm in the restricted data setting, it is reasonable to think that the algorithm is also able to control for some unmeasured confounders. Of note, the ability of hdPS to control for unmeasured confounders may be specific to these covariates/databases and to this specific population and may not be true in other settings. Comorbidity status, drug dispensations, and medical utilization rates were assessed in the year prior to the cohort entry date. Absolute standardized differences are defined as the between group difference as a proportion of the pooled standard deviation of the two groups a At the cohort entry date b Identifies baseline characteristics which had 0.10< ASDD ≤0.20 within the unmatched populations [7] c Identifies baseline characteristics which had ASDD >0.20 within the unmatched populations [7] d Identifies covariates which were hidden to the hdPS algorithm within the hdPS hidden info model Second, we only examined a limited number of patient characteristics. Although balance was achieved within both sub-cohorts regarding all 18 examined patient characteristics, we cannot guarantee that this balance would be achieved in other patient characteristics or in other unmeasured confounders.
Finally, we did not examine the relative performance of the two matched sub-cohorts in regards to the measure of association which would have been obtained in an eventual etiological study. To do so would require the existence of a Bgold standard^, providing the nature and magnitude of the Btrueâ ssociation, to which we could compare our results [2][3][4][5]. Despite this fact, we consider that the quality of the match is a good marker of the performance of the hdPS method within this study since only this approach could illustrate that the hdPS method truly adjusted for the seven hidden confounders [7,[17][18][19][20][21].
In conclusion, our results show that, within the confines of our study, the hdPS was able to adequately adjust for confounders which were hidden to the algorithm. Such results support the claim that the hdPS can adjust for at least some unmeasured confounders and further support its use in future observational studies. Acknowledgment This study was made possible through data sharing agreements between Canadian Network for Observational Drug Effect Studies (CNODES) and the provincial government of Quebec. The opinions, results, and conclusions reported in this paper are those of the authors. No endorsement by the province is intended or should be inferred. We would like to also thank the CNODES investigators and collaborators for their contribution in developing the study protocol evaluated in this paper.
The work presented within this manuscript has not been published previously except within abstract form and partly within Jason R Guertin's doctoral thesis work.

Compliance with ethical standards
Contributions of authors' statement Jason R Guertin helped in the design of the study, conducted all of the analyses and wrote the first draft of the manuscript. Elham Rahme and Jacques LeLorier both conceived the study and revised the draft of the manuscript. All authors approved the final draft of the manuscript.   2 Head-to-head comparison of the absolute standardized differences obtained within the two matched sub-cohorts. ACEI angiotensin converting enzyme inhibitors; ARB angiotensin receptor blockers; BB beta-blockers; CABG coronary artery bypass graft; Calc blockers calcium blockers; CHF congestive heart failure; hdPS high-dimensional propensity score; PCI percutaneous coronary intervention; PVD peripheral vascular disease; Absolute standardized differences <0.1 are assumed to indicate balance; all 18 patient characteristics were considered to be balanced within the two sub-cohorts