High-density genotyping of immune-related loci identifies new SLE risk variants in individuals with Asian ancestry

Systemic lupus erythematosus (SLE) has a strong but incompletely understood genetic architecture. We conducted an association study with replication in 4,492 SLE cases and 12,675 controls from six East-Asian cohorts, to identify novel and better localize known SLE susceptibility loci. We identified 10 novel loci as well as 20 known loci with genome-wide significance. Among the novel loci, the most significant was GTF2IRD1-GTF2I at 7q11.23 (rs73366469, Pmeta=3.75×10−117, OR=2.38), followed by DEF6, IL12B, TCF7, TERT, CD226, PCNXL3, RASGRP1, SYNGR1 and SIGLEC6. We localized the most likely functional variants for each locus by analyzing epigenetic marks and gene regulation data. Ten putative variants are known to alter cis- or trans-gene expression. Enrichment analysis highlights the importance of these loci in B- and T-cell biology. Together with previously known loci, the explained heritability of SLE increases to 24%. Novel loci share functional and ontological characteristics with previously reported loci, and are possible drug targets for SLE therapeutics.


Study participants
This study included Koreans (KR) recruited from a hospital network run by Hanyang University Hospital for Rheumatic Diseases and 7 other hospitals in South Korea (1843 SLE patients and 3262 controls); Han Chinese (HC) recruited from Peking University in Beijing, China (500 patients and 500 controls); and Malaysian Chinese (MC) recruited from the Health Center of the University of Malaya (302 patients and 296 controls, Malaysians of Chinese ancestry)(Supplementary Table 1). All patients satisfied American College of Rheumatology classification for SLE 90,91 based on review of medical records. Controls were recruited from the same geographic regions as SLE cases. Participants provided written consent at study enrollment, and the Internal Review Boards (IRBs) or ethical committees of participating institutions approved this study. Potential identifying information was removed for all participants.
To increase statistical power we incorporated an independent KR out-of-study control GWAS dataset (genotyped using Illumina HumanOmni1-Quad BeadChip GWAS array at the Korea National Institute of Health, which was used in a previous rheumatoid arthritis case-control study 21 ). We only used GWAS data (Illumina OMNI-1) from Koreans, since Koreans are incredibly genetically homogenous and constitute our largest cohort. The addition of out-of study controls boosted the statistical power by 12-16% for SNPs with MAF>0.15 and P<5x10 -8 (Supplementary Fig. 2).
The in silico replication samples are from a Japanese GWAS dataset (891 cases and 3384 controls) 59 ; the other two replication samples are from two independent Han Chinese cohorts (BHC and SHC) recruited at the Peking University First Hospital (601 cases and 1034 controls), and through the Joint Molecular Rheumatology Laboratory of the Institute of Health Sciences and Shanghai Renji Hospital, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai Jiaotong University School of Medicine (501 cases and 622 controls), respectively (Supplementary Table 1).

Genotyping and quality control
A total of 2645 cases and 4058 controls from three Asian populations were genotyped using the Illumina ImmunoChip array 9 at the Oklahoma Medical Research Foundation (OMRF). Subjects were excluded from analysis if they had >10% missing genotyping. SNPs with overlapping clusters were filtered out before further QC was performed. SNPs were also excluded if they had >5% missing genotyping, were out of Hardy-Weinberg equilibrium (P<0.0001 in controls) or had <0.5% minor allele frequency (MAF). All QC'd SNPs (Supplementary Table 2) for ImmunoChip datasets were aligned with NCBI Genome Build 37 and oriented to the forward/+ strand. Excluding known SLE-risk loci, there were 11,234 common, independent (i.e., LD r 2 <0.3) SNPs across all ImmunoChip and out-of-study control datasets. We used this SNP set for all QC on the samples. We calculated the pair-wise relationship between individuals for each cohort using GCTA 92 (estimating the genetic relationship matrix, grm). Individuals with grm>25% were excluded in subsequent analysis due to possible cryptic relatedness. We identified ancestral outliers as those exceeding six standard deviations beyond the mean for the first three principal components (Supplementary Fig. 15) obtained by EIGENSTRAT 93 PCA. After QC, we removed 160 cases (133 KR, 10 HC, 17 MC) and 111 controls (95 KR, 7 HC, 9 MC) due to low call rate, relatedness and outliers, and integrated 3669 KR into this study. In order to examine genomic inflation due to population stratification, we used the same set of SNPs to estimate genomic inflation ( ) and create Q-Q plots (Supplementary Fig. 16). We compared genomic inflation across three cohorts by estimating the sample size-corrected 1000 94 . There was little evidence of stratification, measured by the genomic control inflation factor ( GC : KR =1.12, HC =1.07, MC =1.02) and sample sizecorrected inflation factor 94 ( 1000KR =1.04; 1000HC =1.09; 1000MC =1.07).
Several signals were only genotyped on the ImmunoChip array, for various reasons. GTF2IRD1-GTF2I was analyzed using only ImmunoChip because of low SNP density and imputation quality in the out-of-study controls' GWAS array. ImmunoChip data alone was also used for SYNGR1, due to low quality imputation.
To confirm the strongest novel signal within the GTF2IRD1-GTF2I region, we used TaqMan assays to genotype 2-6 promising SNPs (including rs73366469) on a subset of ImmunoChip-genotyped samples. We genotyped 6 SNPs in 1363 KR (624 cases and 739 controls), 4 SNPs in 322 MC (141 cases and 181 controls) at OMRF, and 2 SNPs in 941 HC (477 cases and 464 controls). We also used the Sequenom array to genotype 4 SNPs in both 1635 BHC (601 cases and 1034 controls), and 1123 SHC (501 cases and 622 controls) at Peking University First Hospital, China (Supplementary Table 6).

Replication of novel loci
In order to confirm our novel loci, we used data from three replication cohorts. The in silico replication study of the Japanese population was conducted by RIKEN Center for Integrative Medical Sciences, Japan. The GWAS data of 891 SLE cases and 3,384 controls were used. SLE cases were collected under the support of the autoimmune disease study group of Research in Intractable Diseases, Japanese Ministry of Health, Labor and Welfare. The control subjects were collected by BioBank Japan Project. Detailed characteristics of the subjects and the study design are described elsewhere 59 . Japanese SLE cases and controls were genotyped using Illumina HumanHap610-Quad and Illumina HumanHap550v3 Genotyping BeadChips (Illumina, CA, USA), respectively. Details of the QC filters (sample call rates 0.98, SNP call rates 0.99, exclusions of closely related subjects and outliers in terms of PCA analysis, exclusion of SNPs with HWE-P < 1.0×10 -6 , SNPs with MAF < 0.01) are described elsewhere 59 . We applied whole-genome genotype imputation of the GWAS data using MiniMac2 software 95,96 . We adopted the East-Asian subjects of 1000 Genomes Project Phase I (a) data as an imputation reference. We excluded the imputed SNPs with MAF < 0.01 or Rsq values < 0.5 (Supplementary Table 27).
For our two other replication sets, we genotyped 22 SNPs (Supplementary Table 28) on a custom Sequenom array for two independent Han Chinese cohorts from Beijing (BHC) and Shanghai (SHC).

Analysis of the HLA locus
In order to dissect HLA associations, we imputed SNPs, 2-and 4-digit classical alleles, as well as amino-acid residues for eight HLA genes (A, B, C, DRB1, DQA1, DQB1, DPA1, DPB1) in KR, HC and MC using SNP2HLA 97 . We used the expanded Asian reference panel described in Kim et al. 22 , which included genotyping data for SNPs and HLA variants for 854 individuals of Korean, Southern Han Chinese, Singapore Chinese, Southern Malaya and Tamil Indian descent 22 . SNPs, amino-acid residues and 2-and 4-digit alleles with low imputation quality (info <0.70) were excluded from the analysis. SLE association analysis for the imputed markers was performed as previously described 98 . In brief, logistic regression and log-likelihood ratio test ("omnibus" test comparing log-likelihood of the null and alternative logistic regression models) were used to test associations of bi-allelic markers and multi-residue amino-acid positions, respectively. The fitted logistic regression models included the first three principal components and a dummy variable to account for the differences between cohorts. To identify independently associated variants within the extended MHC region in the merged set, we performed step-wise conditional logistic regression. In this approach, we started with the most significant variant and added new variants as covariates, until the significance level of the remaining variants reached >5x10 -8 . For example, when we tested the secondary association of amino-acid positions in HLA-DRB1, we conditioned on all residues for each position (excluding the most frequent residue, to avoid collinearity). Similarly, when we investigated the secondary effects outside of particular genes (e.g., HLA-DRB1), we conditioned on all 4-digit classical alleles of DRB1 and examined the remaining variant P-values (new variants would have been included if their P<5x10 -8 ). Supplementary Fig 1. Venn diagram for 578 imputation regions with P<5x10 -3 from three populations' ImmunoChip genotyping. KR=Korean; HC= Han Chinese; MC=Malaya Chinese. Numbers represent the number of regions of the genome with association P <5x10 -3 prior to imputation.

Supplementary Fig 2.
Statistical power for combined cases and controls, before and after addition of out-of-study KR controls. It is of note that our initial statistical power comparison between ImmunoChip-only and ImmunoChip + Korean GWAS data sets showed that for MAF > 0.15, adding controls boosted statistical power by 12-16%. For example, using a cutoff at P 5x10 -8 and OR = 1.30, the power increased by 15% (from 0.67 to 0.82) at MAF = 0.30. Supplementary Fig 3. Regional association plots for 10 novel SLE loci. Plots of -log10(Discovery Pmeta) versus position in megabase pairs. The lead SNP is highlighted in purple, while the linkage disequilibrium of the rest of the variants (r2) to the lead SNP is shown in the color legend. a. TERT; b.

Supplementary Fig 4.
Epigenetic annotation of novel SLE-associated regions. ENCODE tracks for DNase hypersensitivity, HMM promoter/enhancer/insulator/transcription classification, chromatin marks (CTCF, EZH2, H2Az, H3K27Ac, H3K27me3, H3K6me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9Ac, H3K9me3, H4K20me1), transcription factor binding sites and RNA-binding protein sites are shown. Data is taken from the ENCODE tracks for GM12878 cells. For chromatin marks and transcription factor/RNA-binding protein sites, shading of the box indicates intensity of the signal, with black meaning maximum signal. The SNPs tested in this study are shown at the top, with the height of the bar (y-axis on left) indicating -log 10 (P-value). Chromosome position is indicated on the x-axis. For each locus, a blow-up of the region containing the lead SNP and other potentially functional SNPs is shown in a second panel, with key SNPs labeled.

Supplementary Fig 5.
Long-range chromatin interactions assessed through Hi-C and ChIA-PET from nine cell lines. Long range interaction strength is shown in the thickness of the connecting lines.

Supplementary Fig 6.
Regional association plots for known SLE loci (35 known non-HLA genes). Plots of -log10(Discovery P-meta) versus position in megabase pairs. The lead SNP is highlighted in purple, while the linkage of the rest of the variants (r2) Fig 7. Regional association plot for the extended MHC/HLA region. Plots of -log10(Discovery P-meta) versus position in megabase pairs. The lead SNP is highlighted in purple, while the linkage of the rest of the variants (r2) to the lead SNP is shown in the color legend. a) SNP based association analysis; b) Imputed HLA variants (classical allele and residue) association; c) After conditioning for the effect of HLA-DRB1, no association remained. We positioned a red dotted line at the genome-wide significant level (P<5x10-8), as well as a suggestive association line (5x10 -8 <P<5x10 -5 ). Fig 8. Enrichment analysis of the novel and SLE loci in cell-specific gene ontology categories. We estimated enrichment of our gene-set in the Gene Ontology database. Overrepresented cell types have a high correlation (Pearson's correlation coefficient) of SLE-loci expression (dark red). P-values (blue bars) that passed the multiple testing threshold (black line) show significant enrichment in SLE loci. Supplementary Fig 9. Induced network module analysis. In order to investigate how our novel loci fit together, we created an induced network connecting our set of novel loci to interacting genes (blue), proteins (yellow) and drug targets (red). Network illustrating published relationships between known loci (blue) and novel loci (green). Relationships are defined as the co-occurrence of gene names (or their synonyms) within the same abstract. Line thickness is proportional to the number of times the genes were mentioned together. Node size is proportional to the number of connections within this network (i.e., bigger = more connections). Genes that were not co-mentioned with other genes on the list are shown at the left, divided according to how much published information exists for them (poorly < 15 papers mentioning the gene in any context, moderately = between 16 and 50 papers, well-annotated >100 papers). Fig 12. Protein-protein interaction sub-groups identified through DAPPLE analysis of reported biochemical interactions. Supplementary Fig 13. Enrichment analysis of the novel and known SLE loci in cell type-specific gene expression. We estimated enrichment of our gene-set in human (a: GeneAtlas) and mouse (b: ImmGen) cell lines. Overrepresented cell types have a high correlation (Pearson's correlation coefficient) of SLEloci expression (dark red). P-values (blue bars) that passed the multiple testing threshold (black line) show significant enrichment in SLE loci. Supplementary Fig 14. Weighted genomic risk scores (wGRS). a) ROC curves for all Asians constructed by weighted genetic risk scores before (blue), replicated (green) and after (red) adding newly identified SLE loci. b) Density plot of weighted genetic risk scores in Asians. Distribution of wGRS is shown for cases (red) and controls (blue).   Table 4. Six novel regions with suggestive association (Pmeta<5x10-5). We used mach2dat for the single SNP post-imputation-based association tests with the first three principal components as covariates to correct for population stratification in KR, HC and MC, respectively. Imputed SNPs are marked as "*". A1/A2= Minor/major allele. FA/FU= affected/control minor allele frequency. OR= odds ratio. CI= confidence interval. Sample sizes for each cohort are presented as cases/controls. KR=Korean; HC=Han Chinese; MC=Malaya Chinese; JPT=Japanese; SHC=Shanghai Han Chinese; BHC=Beijing Han Chinese.

Supplementary Table 5.
All associated SNPs all novel regions, note that GTF2IRD1-GTF2I and SYNGR1 used three ImmunoChip data sets only. FA/FU= affected/control minor allele frequency. OR= odds ratio. CI= confidence interval. Rsq= Imputation quality measure.

Supplementary Table 6.
A. Experimental genotyping association results for confirming GTF2IRD1-GTF2I. BP= position. A1/A2= minor/major allele. FA/FU= affected/control minor allele frequency. OR= odds ratio. CI= confidence interval. Sample sizes are presented under each cohort as case/control B.Conditional analysis of four GTF2IRD1-GTF2I SNPs based on partial genotyping data. C: The linkage disequilibrium (LD) r2 value based on each cohort's control data set. Table 7. Functional annotation of associated lead SNPs, based on Haploreg, ChromMM, and GWAS3D. GERP RS scores of conservation are presented. ChromMM prediction for chromatin state predictions are presented below. LeadSNP=novel SNP used for reference; r2= linkage disequilibrium frequency correlation between the reference SNP and SNP within the feature; SNP =SNP in the feature; TFBS affinity= transcription factor binding site; CTCF = presence of CTCF mark; GERP =conservation status as extracted from GERP; SELF= same SNP as the lead SNP. Table 8. Areas of chromatin interaction. Both Hi-C and CHiA-PET chromatin interaction experimental data from nine different cell lines were used. LeadSNP=novel SNP used for reference; r2= linkage disequilibrium frequency correlation between the reference SNP and SNP within the feature; SNP =SNP in the feature; Locus 1/2= Genomic coordinates of Sides 1 and 2 of the chromatin interaction; Gene1/2= Genes within locus1/2; Score=Read count of interactions; Validation =experiment type. Table 9. Allelic expression changes through eQTLs. eQTLs for both the lead SNP and highly correlated (r2>0.7) SNPs. Probe information presented as chromosome and the center position. Type of eQTL (cis/trans). The gene affected is presented as well as the Source of the data: Bloodeqtl (Westra et al. 2013) and SNPexp(Holm et al. 2010). Table 10. Replicated 36 known loci with P<0.005. KR= Korean; HC =Han Chinese; MC=Malaya Chinese; A1/A2= minor/major allele. FA/FU= affected/control minor allele frequency. OR= odds ratio. CI= confidence interval. Sample sizes are presented under each cohort as case/control. Each SNP was either genotyped in the Immunochip or was imputed. Supplementary Table 11. Secondary associations for Non-HLA loci. A1/A2= Minor/major allele. FA/FU= affected/control minor allele frequency. OR= odds ratio. CI= confidence interval. Sample sizes are presented under each cohort as case/control. Alleles: I= Insertion D=Deletion. Table 12. HLA-association with classical allele imputation. HLA association was based on dosage data, average case and control dosages are presented. OR= odds-ratio, 95%CI= 95% confidence interval. INFO= imputation quality. For the omnibus test, D is the difference between the base model likelihood and the model with the residue effect. Table 13. All associated SNPs within 6 novel suggestive regions (KRIC+GWAS). A1/A2= Minor/major allele. FA/FU= affected/control minor allele frequency. OR= odds ratio. CI= confidence interval. Sample sizes are presented under each cohort as case/control. Rsq imputation quality.

Supplementary Table 14.
Reported SLE SNPs in our cohort. Source of the original report is presented as well as its P-value for each of our cohorts, compared to the European reported p-value and OR. We labeled column "same direction" as NO if the effect of the SNP is opposite in our cohorts, and NA if there was no information on the direction of the effect in the original publication. Source publications are detailed at the right of the table. A1/A2= Minor/major allele. FA/FU= affected/control minor allele frequency. OR= odds ratio. CI= confidence interval. Sample sizes are presented under each cohort as case/control.

Supplementary Table 15.
Four loci for which we found highly significant novel uncorrelated SNPs compared to the previously reported GWAS peak SNPs, thereby shifting the association peak. A1/A2= Minor/major allele. FA/FU= affected/control minor allele frequency. OR= odds ratio. CI= confidence interval. Sample sizes are presented under each cohort as case/control. Red color indicated our novel SNPs from this study, black indicated published SNPs. FA=allele 1 frequency of SLE affected; FU=allele1 frequency of unaffected Supplementary Table 16. Annotation of Bayesian Credible Sets. We used the Bayesian credible set method to better localize the possible set of functional SNPs within each novel region. BF=Bayes Factor; Posterior probability of the SNP assessed. We created a 95% and 99% probability credible set, labeled ad 95 and 99. Each Lead SNP (Table 1 SNP) was labeled as 1. Functional characteristics for each member of the credible set such as protein binding sites, Motif binding, DNAse, Promoter and Enhancer activity, as well as conservation score as determined by GERP/SiPhy were extracted from Haploreg. Table 17. Annotation of the PICS identified SNPs. In order to identify functional SNPs related to our lead SNPs, we used PICS method. The PICS methods assigned a probability of the SNP to be functional within their Linked SNP counterparts (PICS_probability), Functional characteristics for each member of the PICS set such as protein binding sites, Motif binding, DNAse, Promoter and Enhancer activity, as well as conservation score as determined by GERP/SiPhy were extracted from Haploreg. Table 18. Enrichment of drug targets through gene set enrichment analysis. Enriched terms contain= the term, species surveyed, GEO platform accession number, GEO dataset accession number. Overlap: Overlap of our gene set versus the whole set in each category. A combined score for the significance of each category is the combination of the Fisher's P-value with the Z-score estimated from the deviation from the expected mean. For ease of search, we included a column that indicates if any of the novel loci is included in the enriched category. Table 19. Enrichment analysis of Mouse Phenotypes (MGI), as well as GO-Processes. Overlap: Overlap of our gene set versus the whole set in each category. A combined score for the significance of each category is the combination of the Fisher's P-value with the Z-score estimated from the deviation from the expected mean. For ease of search, we included a column that indicates if any of the novel loci is included in the enriched category. We searched for reported in which any of our novel or suggestive loci might be associated with any of 12 autoimmune diseases. All citations as well as the reported SNPs are presented below. Supplementary Table 29. Over-representation analysis comparison of 100 gene sets taken at random from the Immunochip. We counted each time a pathway/Ontology that was found to be over-represented in our study gene-set appeared in one of the random sets (COUNTinSIM). Size= Size of the gene-set in each category; effective size=corrected size gene set. Statistical power for combined cases and controls, before and after addition of out-of-study KR controls. It is of note that our initial statistical power comparison between ImmunoChip-only and ImmunoChip + Korean GWAS data sets showed that for MAF > 0.15, adding controls boosted statistical power by 12-16%. For example, using a cutoff at P ≤ 5x10 -8 and OR = 1.30, the power increased by 15% (from 0.67 to 0.82) at MAF = 0.30.