Matching on Race and Ethnicity in Case-Control Studies as a Means of Control for Population Stratification

Abstract
Some investigators argue that controlling for self-reported race or ethnicity, either in statistical analysis or in study design, is sufficient to mitigate unwanted influence from population stratification. In this report, we evaluated the effectiveness of a study design involving matching on self-reported ethnicity and race in minimizing bias due to population stratification within an ethnically admixed population in California. We estimated individual genetic ancestry using structured association methods and a panel of ancestry informative markers, and observed no statistically significant difference in distribution of genetic ancestry between cases and controls (P=0.46). Stratification by Hispanic ethnicity showed similar results. We evaluated potential confounding by genetic ancestry after adjustment for race and ethnicity for 1260 candidate gene SNPs, and found no major impact (>10%) on risk estimates. In conclusion, we found no evidence of confounding of genetic risk estimates by population substructure using this matched design. Our study provides strong evidence supporting the race- and ethnicity-matched case-control study design as an effective approach to minimizing systematic bias due to differences in genetic ancestry between cases and controls.

Epidemiology Open Access

Introduction
In genetic epidemiology studies, potential confounding by genetic ancestry, known as population stratification, may bias results [1][2][3]. Although there is no agreement on the magnitude of bias that might be attributable to population stratification, many genetic epidemiology studies have taken measures to guard against it. Large genome-wide association studies often restrict analyses to subjects who conform to a certain degree of ethnic/genetic homogeneity [4]. In studies where such restriction is impossible or even undesirable, estimates of genetic ancestry based on structured association methods are often calculated [4] and adjusted for by regression analyses. However, some investigators argue that controlling for self-reported race or ethnicity, either in statistical analysis or in study design, is sufficient to mitigate unwanted influence from population stratification [2]. Such approaches are desirable under certain circumstances, such as when a study population is being used to type a limited number of genetic variants as part of a replication study, and costs of high-density genotyping are prohibitive. In this report, we examine the effectiveness of a study design involving matching on self-reported ethnicity and race in minimizing bias due to population stratification within an ethnically admixed population.

Materials and Methods
Since 1995, we have been conducting a population-based case-control study of childhood leukemia in 35 counties of Northern and Central California. The study population includes 41% Hispanics, a recently admixed population, and 44% non-Hispanic whites, with the remaining 15% comprising smaller numbers of blacks, Asians and other races/ethnicities. The study has been described previously [5]. Briefly, incident cases of childhood leukemia (age under 15 years) were ascertained through a rapid reporting system established with participating hospitals. Controls were selected from the California birth registry and individually matched to cases on date of birth, gender, maternal race and child's Hispanic ethnicity, that is, mother's report of either parent being Hispanic. This study was reviewed and approved by institutional review committees at the authors' academic institutions, the California Department of Public Health (CDPH), and the participating hospitals. Written informed consent was obtained from all parent respondents; participating case and control children over age 7 also provided written assent.
We used a panel of ancestry informative markers (AIMs) developed specifically to estimate individual genetic ancestry of the major ancestral populations comprising Hispanics: Africans, Amerindians, and Europeans [6]. This panel of SNPs was selected based on high allele frequency differences between the ancestral populations and low linkage disequilibrium (r² <0.6) between the SNPs within each population. Using a custom 1536-single nucleotide polymorphism (SNP) Illumina GoldenGate genotyping panel, we attempted to genotype 95 AIMs in our study subjects. The remaining SNPs included on the Illumina panel were variants in 183 candidate genes. Most of these were haplotype-tagging SNPs, selected to capture genetic variation at r² >0.80, based on data from the 30 Caucasian trios in the HapMap project (Release 19, Build 34, www.hapmap.org) and the 23 Hispanics in the SNP500Cancer project (www.snp500cancer.nci.nih.gov). After applying an Illumina GenCall threshold of 0.25, and SNP-wise and subject-wise call rate thresholds of ≥90% and ≥95%, respectively, we successfully genotyped 80 AIMs and 1260 candidate gene SNPs in whole-genome amplified DNA extracted from buccal cytobrush specimens (73.4%) or archived newborn dried blood spot specimens (26.6%) collected from 376 subjects with childhood acute lymphocytic leukemia (ALL) and 447 controls.
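The call-rate filtering described above can be illustrated with a minimal sketch. It assumes genotypes are held in a subject-by-SNP table in which a call falling below the GenCall threshold has already been recorded as missing (`None`), and it assumes SNPs are filtered before subjects; the text does not state the order used, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0/1/2 allele copies; None marks a no-call


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of genotype calls that are not missing."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_by_call_rate(
    matrix: List[List[Genotype]],
    snp_min: float = 0.90,
    subject_min: float = 0.95,
) -> Tuple[List[int], List[int]]:
    """Return (kept_subject_indices, kept_snp_indices).

    matrix[i][j] is the genotype of subject i at SNP j.
    Assumed order (not stated in the text): SNPs are filtered first,
    then subjects are scored on the surviving SNPs only.
    """
    n_snps = len(matrix[0]) if matrix else 0
    kept_snps = [
        j for j in range(n_snps)
        if call_rate([row[j] for row in matrix]) >= snp_min
    ]
    kept_subjects = [
        i for i, row in enumerate(matrix)
        if call_rate([row[j] for j in kept_snps]) >= subject_min
    ]
    return kept_subjects, kept_snps
```

Filtering subjects first, or iterating the two filters until nothing more is dropped, are equally plausible variants and can retain slightly different sets.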
Individual estimates of genetic admixture, i.e., the percent contribution of each of the three ancestral populations, were obtained from maximum likelihood estimation models as described previously [7,8]. We used Hotelling's T² test to assess the statistical significance of the association between computed genetic ancestry distribution and case-control status.
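As an illustration of the two-sample Hotelling test on ancestry estimates, the sketch below computes the test statistic and its F transformation for two ancestry proportions per subject. Because the three proportions sum to 1, their covariance matrix is singular, so the sketch assumes one component is dropped before testing; which components the authors actually entered is not stated. The function names are hypothetical, and the P value would be read from an F distribution with the returned degrees of freedom.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]  # e.g. (African proportion, Amerindian proportion)


def mean_vector(rows: List[Vector]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def scatter_matrix(rows: List[Vector], mean: List[float]) -> List[List[float]]:
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[k] - mean[k] for k in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return s


def hotelling_t2_two_sample(
    cases: List[Vector], controls: List[Vector]
) -> Tuple[float, float, int, int]:
    """Return (T2, F, df1, df2) for two groups of 2-dimensional vectors."""
    n1, n2, p = len(cases), len(controls), 2
    m1, m2 = mean_vector(cases), mean_vector(controls)
    s1, s2 = scatter_matrix(cases, m1), scatter_matrix(controls, m2)
    # Pooled covariance matrix across the two groups.
    sp = [[(s1[a][b] + s2[a][b]) / (n1 + n2 - 2) for b in range(p)]
          for a in range(p)]
    # Closed-form inverse of the 2x2 pooled covariance.
    det = sp[0][0] * sp[1][1] - sp[0][1] * sp[1][0]
    inv = [[sp[1][1] / det, -sp[0][1] / det],
           [-sp[1][0] / det, sp[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    quad = sum(d[a] * inv[a][b] * d[b] for a in range(p) for b in range(p))
    t2 = (n1 * n2) / (n1 + n2) * quad
    # Under the null hypothesis, F follows F(p, n1 + n2 - p - 1).
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f, p, n1 + n2 - p - 1
```

Identical mean ancestry vectors in cases and controls give a statistic of zero; larger values indicate greater separation of the group means relative to the pooled within-group variation.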
To assess the degree of potential confounding due to case-control differences in genetic ancestry, we calculated the confounding risk ratio (CRR) for each of the remaining successfully genotyped SNPs on the Illumina panel. The CRR is defined as the ratio of the unadjusted odds ratio (OR) to the OR adjusted for a potential confounder (in this case, genetic ancestry) [9]. Because genotyping data were not available for every individually matched case-control set, we report the results of an unmatched analysis, using unconditional logistic regression to calculate the CRR and compare risk estimates adjusted for the matching factors (age, gender, self-reported race and Hispanic ethnicity) to those further adjusted for estimated genetic ancestry. The results reported here were similar to those from a matched analysis using conditional logistic regression of the 278 complete matched case-control sets (201 pairs and 77 triplets). Accordingly, to maximize sample size and statistical power, we present only the results of the unmatched analysis.
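The CRR calculation can be sketched as follows, assuming the per-allele log odds ratios (regression coefficients) for each SNP have already been obtained from the two logistic models. The function names and the example coefficients are hypothetical, and the symmetric 10% cut-off is one reading of the ">10%" criterion, not a rule stated by the authors.

```python
import math
from typing import Dict, Tuple


def confounding_risk_ratio(beta_unadjusted: float, beta_adjusted: float) -> float:
    """CRR = OR_unadjusted / OR_adjusted = exp(beta_unadj - beta_adj).

    beta_unadjusted: log OR from the model with matching factors only.
    beta_adjusted:   log OR from the model that also includes ancestry.
    """
    return math.exp(beta_unadjusted - beta_adjusted)


def flag_material_confounding(crr: float, threshold: float = 0.10) -> bool:
    """True when the CRR departs from 1 by more than `threshold` (assumed rule)."""
    return abs(crr - 1.0) > threshold


# Hypothetical (beta_unadjusted, beta_adjusted) pairs for two made-up SNP labels.
example_betas: Dict[str, Tuple[float, float]] = {
    "snp_a": (math.log(1.30), math.log(1.25)),
    "snp_b": (math.log(1.50), math.log(1.20)),
}

crr_by_snp = {
    name: confounding_risk_ratio(b_unadj, b_adj)
    for name, (b_unadj, b_adj) in example_betas.items()
}
flagged = [name for name, crr in crr_by_snp.items()
           if flag_material_confounding(crr)]
```

A CRR above 1 means the estimate without ancestry adjustment is larger than the ancestry-adjusted estimate; a CRR near 1 indicates that adjusting for ancestry leaves the risk estimate essentially unchanged.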

Results
The mean percentages of African, European, and Amerindian genetic ancestry by case-control status are presented in Table 1. European ancestry dominated our total population, followed by Amerindian ancestry, then African ancestry, present at just 7%. Overall, we observed no statistically significant difference in ancestry distribution between cases and controls (P=0.46). Similarly, stratification by Hispanic ethnicity showed no significant case-control differences in ancestry distribution (P=0.66 and 0.60 for Hispanics and non-Hispanics, respectively).
The CRRs for the 1260 candidate gene SNPs assessing potential confounding by genetic ancestry are shown in Figure 1. Overall, we found no major differences (>10%) in risk estimates due to adjustment for genetic ancestry over and above adjustment for race and ethnicity, though the observation of more CRRs above 1.05 than below 0.95 suggests that risk estimates unadjusted for genetic ancestry tend to be somewhat inflated.
Figure 1 footnote: *CRR calculated as the ratio of the log-additive odds ratios (ORs) for an individual SNP, comparing the genetic ancestry-unadjusted risk estimate (adjusted for race, ethnicity, age, and gender) to the genetic ancestry-adjusted risk estimate (adjusted for race, ethnicity, age, gender, and genetic ancestry), and plotted across the X-axis by chromosomal position.

Discussion
In this study, we found no significant differences in estimated genetic ancestry between race- and ethnicity-matched cases and controls overall. Furthermore, we found no evidence of confounding of genetic risk estimates by population substructure using this matched design. The results of this study demonstrate that careful study design can overcome potential differences in genetic ancestry between cases and controls that can lead to population stratification.
In order for confounding of a gene-disease association by any factor to exist, evidence of an association between that factor and the disease must be observed. In our matched case-control study, among all races/ethnicities together as well as after stratification by Hispanic ethnicity, we found no evidence of association between disease and genetic ancestry. However, it should be noted that the AIMs we used were optimized for discerning major continental ancestry origins [6]. As such, they are more informative in discerning ancestry among Hispanics than among non-Hispanics. Indeed, the overwhelmingly dominant ancestry among our non-Hispanic subjects was European (80-83%). Thus, potential confounding of results by subtle, intracontinental differences such as Northern vs. Southern European ancestry would have to be investigated by more sophisticated matching (such as matching on parental or grandparental country of origin) and/or a more extensive set of AIMs than was done in our study.
It should also be noted that the individual matching of our original study design requires adjustment for matching factors when performing an unmatched analysis. These factors included race, ethnicity, gender, and age. Accordingly, assessment of confounding by ancestry via the CRR necessitated inclusion of these matching variables in regression analyses. This was done even though the distributions of the matching factors were balanced between cases and controls by the very design of the study.
Our intention with this investigation was not to gauge the general magnitude of population stratification on results, but to comment on the absence of material changes in risk estimates in race- and ethnicity-matched case-control studies after adjustment for estimated genetic ancestry. This is particularly relevant given the large proportion of subjects reporting Hispanic ethnicity in our study population. Hispanics are a recently admixed group [10] and are reported to have the highest incidence of childhood leukemia in California [11]. It is therefore imperative to include this group in the search for genetic susceptibility loci for childhood ALL. Accordingly, the absence of a significant association between genetic ancestry and disease within this ethnic group is especially reassuring.
Similar to the results of our investigation, a recent study in New York found little improvement in model fit for specific genotype-phenotype associations due to adjustment for genetic ancestry over self-reported race/ethnicity among ethnically heterogeneous populations and specifically among Hispanics [12]. Another study in a multi-ethnic population compared multiple methods for adjusting for genetic structure, and found that the potential impact of population stratification was effectively mitigated by adjustment for self-reported race/ethnicity or AIMs-based ancestry estimates [13]. The authors attribute the modest extent of bias due to population stratification observed to both the frequency matching of cases and controls on race/ethnicity and the sampling of subjects of the same ethnicity from the same geographic area. These findings support our conclusion that adjustment for race/ethnicity in admixed populations (by design, analysis, or both), paired with careful attention to recruitment of subjects from comparable geographic regions, can effectively mitigate effects of population stratification.
Matching reduces heterogeneity between cases and controls that may be due to the matching factors, or more specifically, broad unmeasured risk factors that are associated with the matching factors. In the case of matching on self-reported race and ethnicity, these include both genetic and non-genetic components. The matched study design's reduction in heterogeneity improves efficiency to study effects that do not involve these unmeasured components. However, individual matching such as was performed for this study is logistically challenging and costly, and matching on race/ethnicity precludes assessment of these factors as potential risk factors. Furthermore, there exists the risk of overmatching with respect to environmental and lifestyle factors particular to certain racial/ethnic groups. Nevertheless, for a rare disease such as childhood ALL (31.9 cases per million per year in the US [14]), the number of cases available to participate in epidemiologic studies tends to be small, and thus the improvement in statistical efficiency afforded by individual matching of controls is useful.
In summary, our findings support the race- and ethnicity-matched case-control study design as an effective approach to mitigating systematic bias due to differences in genetic ancestry between cases and controls.