Omics versus Questionnaires: Using methylation data to assess the environmental impact of smoking on obesity-related traits

Variation in complex traits related to obesity, such as body weight and body mass index, has a genetic basis with heritabilities between 40 and 70%. Nonetheless, the so-called global obesity pandemic is usually associated with environmental changes related to diet, lifestyle, and sociocultural and socioeconomic changes. However, most genetic studies do not include all relevant environmental covariates so their contribution, alongside genetics, to variation in obesity-related traits can not be assessed. Similarly, some studies have described interactions between a few individual genes linked to obesity and different environmental variables but the total contribution to differences between individuals is unknown. In this study we explored the effect of smoking and gene-by-smoking interactions on obesity related traits from a genome-wide perspective to estimate the amount of variance they explain by modelling them using self-reported data and a proxy created using methylation data. Our results indicate that exploiting omic measures as proxies for environmental variation can improve our models for complex traits such as obesity and can be used as a substitute of environmental measures when they are not available or jointly to improve their accuracy.


Introduction
Variation in obesity-related traits such as body mass index (BMI) has a complex basis with heritabilities ranging from 40 to 70% and variants detected so far explaining up to 5% of BMI variation 1 . Although the genetic component is large, studies suggest that the increase in obesity prevalence in the past decades is linked with environmental causes like changes in the diet and a more sedentary lifestyle 2,3,4,5 , but not all relevant environmental effects are accounted for in genetic studies. On top of these independent environmental effects, several studies suggest that gene-by-environment interactions also play an important role in obesity and other complex traits 2,6,7,8,9,10 and many researchers are focusing on finding interactions between specific genes and certain environments. Genotype-by-age interactions and genotype-by-sex interactions have also been detected for several health-related traits 10,11,12 . Recently, several studies have relied on stratifying the samples based on smoking status or explicitly modelling interactions when performing GWAS on traits like BMI, lipids or blood pressures 13,14,15 . Inclusion of smoking as a covariate in GWAS has led to identification of some new genetic variants associated with traits of interest. The newly identified variants tend to explain a very small proportion of the variation in the traits, in a similar way as GWAS hits do not account for as much genetic variance as expected based on classic heritability estimates 16 .
Some studies have attempted to quantify the overall contribution of genomic interactions with smoking.
Robinson, et al. 12 estimated them to explain around 4% of BMI variation in a subset of UK Biobank unrelated samples. In contrast, also in UK Biobank, using a new approach that only requires summary statistics, Shin & Lee 17 estimated the contributions of the interactions to be much smaller: 0.6% of BMI variation.
DNA methylation is an epigenetic mark that can be affected by genetics and environmental exposures 18,19,20,21,22,23 . Variation in methylation is correlated with gene expression and plays a crucial role in development, in maintaining genomic stability 24,25,26 , and has been associated with disease 27,28,29,30,31 and aging 32,33 . Epigenome-wide association analyses (EWAS) have identified multiple associations between DNA methylation levels at specific genomic locations and smoking 19,34,35,36 . These so-called signatures of smoking in the epigenome can help discriminate the smoking status of the individuals in a cohort 20 , and, if accurate enough, could be an improvement on self-reported measures, by adding information not captured accurately in the self-reported measure, as passive smoking or real quantity of tobacco smoked.
In this study, we aim to estimate the contribution to obesity variation of smoking and its interaction with genetic variation in two different cohorts, using self-reported measures of smoking and a methylomic proxy. Thus, we measure the contribution of smoking-associated methylation signatures and genome-by-methylation interactions to trait variation. We performed analyses in both sexes jointly and independently and also including genome-by-smoking-by-sex interactions, and we showed that omics data can be exploited as proxies for environmental exposures to improve our understanding of complex trait architecture. We observed that using an appropriate set of CpG sites, methylation can be used to model trait variation associated with smoking, and genome-by-smoking interactions suggesting potential applications for better prediction and prognosis of complex disease, and expanding these modelling approach to other environments and traits.

Results
The aim of this work was to explore the influence of smoking and gene-by-smoking interactions on trait variation, modelling them from self-reported information and using DNA methylation in both sexes jointly and separately. We used a variance component approach to fit a linear mixed model including a set of different covariance matrices representing: two genetic effects (G: common SNP-associated genetic effects and K: pedigree-associated genetic effects not captured by the genotyped markers at a population level; the inclusion of matrix K in the analyses allows to use the related individuals in the sample), environmental effects reflecting impact of smoking (modelled as fixed or random effects), and genome-by-smoking effects (GxSmk) representing sharing of both genetics (G) and environment (SMK), and we estimated the proportion of variation that each component explained for seven obesity-related measures and height. We defined the environment using either self-reported questionnaire data or its associated methylation signature as a proxy. A summary of the experimental design used in this study is shown in Figure 1. For more detailed information, see Methods.   The estimated contributions of smoking status (and the other covariates) to trait variation ranged between 0.35% (for height, assessed as a negative control, as we do not expect to find the same type of effects as with obesity-related measures) and 1.2% (for HDL cholesterol) and are shown in Supplementary

UK Biobank
We sought to replicate the results observed in Generation Scotland with data from the UK Biobank cohort (UKB). Analyses were run in four sub-cohorts (G1, G2, G3 and G4, grouping individuals in close recruitment centres; for more information see Methods and Supplementary Table 3), with the two sexes considered jointly and separately in three different analyses (the sample size of these groups permitted estimates to be obtained with the two sexes separately). Individual sub-cohort analyses were metaanalysed.   Meta-analyses of the sub-cohorts showed significant genome-by-smoking interactions in all traits except for height when analysing both sexes together and males separately, whereas in females, only fat percentage showed a significant effect of the interaction. Similarly, the genome-by-smoking-by-sex interactions were significant for all traits but height. Genome-by-smoking-by-sex interaction effects explained between 2 and 6% of the observed variation.

Smoking-associated methylation
To explore the value of DNA methylation data as a proxy for environmental variation, we modelled similarity between individuals based on their DNA methylation levels at a subset of 62 CpG sites previously associated with smoking 19,34 and which had heritabilities lower than 40%, aiming to target methylation variation that is predominantly capturing environmental variation (for details see Methods). Figure   interaction components (genome-by-smoking and genome-by-methylation) the estimates were not significant (or just nominally significant in the case of BMI).

Discussion
Most complex diseases have moderate heritabilities, with different environmental sources of variation, for example, lifestyle and socioeconomic differences between individuals, also contributing to disease risk 5 . These diseases, particularly obesity, pose major challenges for public health and are associated with heavy economic burdens 3,4,37 . To prevent the problems resulting from complex diseases, effective personalised approaches that help individuals to reach and maintain a healthy lifestyle are required. To achieve that aim, knowledge on environmental effects and gene-by environment interactions (GxE, i.e., differential effects of an environmental exposure on a trait in individuals with different genotypes 38 ) is required. This is a challenge, particularly for certain environments that are not easy to measure, or that are measured with a lot of error. It has previously been assumed that GxE effects contribute to variation in obesity-related traits 6,8 , but the total contribution to trait variation was not known. Previous analyses exploring GxE in obesity, as well as other traits, took advantage of particular individual variants with known effects, or constructing polygenic scores combining several of them reflecting genetic risks for the individuals 39,40 . Here we analysed contributions of interactions between the genome (as a whole) with smoking, both using self-reported measures and methylation data as a proxy for smoking.
Our estimates of the effects of genome-by-smoking interactions in obesity-related traits are larger than those estimated in Shin and Lee 17 but in line with Robinson, et al. 12 for BMI. However, our analyses indicate that the magnitude is substantially different in the two sexes, with interactions playing a bigger role in males for most traits studied (weight, waist, hips, fat%). Analysing jointly males and females provides less accurate estimates, suggesting that for similar interaction studies splitting the sexes or modelling the interactions with sex is a more sensible way of analysing the data. The estimates obtained from the genome-by-methylation analyses provided large estimates with large standard errors that would not be significant after multiple correction testing, but are large and should e investigated further.
We estimated that the impact of genome-by-smoking ranges between 5 and 10% of variation in obesityrelated traits. Results were similar across the different traits, with the exception of height that we used as a negative control. Our results suggest a larger interaction component in traits associated with weight (BMI, weight, waist, hips) than in those more involved in adiposity (waist-to-hip ratio, fat percentage).
Biological interpretation of these interactions implies that some genes contributing to obesity differences between individuals have different effects depending on their smoking status. This could be mediated in several ways, for example, via genetic variants that affect both obesity and smoking. Some metabolic factors associated with food intake, such as leptin, are suspected to play a role in smoking behaviours, and rewarding effects of food and nicotine are partly mediated by common neurobiological pathways 41 . For example, if these common genetic architectures counteract the two behaviours (i.e., more tobacco consumption leading to eating less 41 ) the genetic effects of obesity-related traits will be different depending on the smoking status. The interactions could also be driven by gene-by-gene interactions (GxG), i.e., genetic variants affecting obesity modulated by smoking associated genetic variants. Under this scenario smoking status would be capturing smoking associated variants, and the GxSmoking interaction would represent GxG instead of GxE. However, given the relatively small heritability of tobacco smoking (SNP heritability ~18% 42 ), it is unlikely that all the variation we detected is driven by GxG, although part of it could be.
When we estimated the effect of smoking using the methylomic proxy (62 CpG sites associated with smoking from two independent studies 19,34 ), the smoking associated variance increased substantially for all traits (from 2% to 6% for BMI). The methylation component captured the same variance as the selfreported component and some extra variation (Supplementary table 8b). This increase in variation captured could be due to a better ability to separate differences between different levels of smoking (e.g., the self-reported status does not include amount of tobacco smoked, while the methylation might be able to capture those differences better). These smoking associated CpG sites could also be picking up variation from other environmental sources that are not exclusively driven by smoking, but correlated with it, such as alcohol intake.
The fact that variation in obesity can be explained by CpG sites associated with smoking does not imply a causal effect of smoking or methylation on obesity. Methylation is affected by both genetic and environmental effects. Here we selected a subset of CpG sites with moderate to small heritability (lower than 40%, Supplementary Table 9) and we modelled them jointly with a genomic similarity matrix, making it unlikely that the variance picked up by the methylation matrix is genetic in nature. While most changes in methylation at these CpG sites are thought to be causally driven by smoking 19 , associations between methylation and other complex traits, such as BMI, are more unclear and mostly likely to be reversely caused 43 (i.e., BMI affecting methylation), however, since our aim was to use methylation as a proxy for the environment, the direction of the association with the traits does not impact the conclusion of the study. It is, however, important to notice the variable nature of the methylation data, To conclude, we showed that methylation data can be used as a proxy to assess smoking contributions to complex trait variation. We used DNA methylation levels at CpG sites associated with smoking as a proxy for smoking status to assess the contribution of smoking to variation in obesity-related traits. This principle could be extended to take advantage of the wealth of uncovered associations between methylation (and other omics) and other environmental exposures of interest, particularly for those that are difficult to measure, to expand our knowledge on their contribution to complex diseases, and potentially, help understand the underlying biology and to improve prediction and prognosis.

Data.
Generation Scotland. We used data from Generation Scotland: Scottish Family Health Study (GS) 46 Genotypes were available for 534,427 common markers spread over the 22 autosomes.
For computational reasons, UKB individuals were split in four sub-cohorts to be analysed separately.
The grouping was based in latitudinal differences between the assessment centres the individuals attended. Number of individuals and assessment centres are shown in Supplementary Table 3.
Phenotypes with values greater or smaller than the mean ± 4 standard deviations (after transformation and adjusting for sex, age and age 2 ) were set to missing. The traits were pre-adjusted for the effects of sex, age, age 2 , clinic where the measures were taken, and a rank-based inverse normal transformation was performed on the residuals. These values were used in all the analyses.
UK Biobank. We used measured phenotypes for anthropometric traits: height, weight, body mass index (BMI, computed as weight/height 2 ),waist circumference (waist), hip circumference (hips), waist-to-hip ratio (WHR, computed as waist/hips), body fat percentage (fat%), and bio-impedance analysis whole body fat mass (fatmass). Phenotypes with values greater or smaller than the mean ± 4 standard deviations (after transformation and adjusting for sex, age and age 2 ) were set to missing. The traits were pre-adjusted for the effects of sex, age, age 2 , clinic where the measures were taken, and a rank-based inverse normal transformation was performed on the residuals. These values were used in all the analyses.

Smoking status.
We used self-reported smoking status on both cohorts. Individuals were classified with respect of smoking as "never smoked", "ex-smoker" and "current smoker" for Generation Scotland, and as "never smoked", "ex-smoker", "current smoker", and "occasional smoker" for UK Biobank. The number of individuals in each category are shown in Supplementary Table 3.

DNA Methylation data.
DNA methylation data is available for a subset of 9,537 participants from the GS:SFHS cohort, as part of the Stratifying Resilience and Depression Longitudinally (STRADL) project 53 . From those, we used N = 8,821 individuals that had complete information for all the same set of covariates as used in the smoking status analysis. We refer to this subset of individuals as GS9K. DNA methylation was measured at 866,836 CpGs from whole blood genomic DNA, using the Illumina Infinium MethylationEPIC array.
Quality control was performed using R (version 3.6.0) 54 , and packages shinyMethyl 55 and meffil 56 . We removed outliers based on overall array signal intensity and control probe performance and samples showing a mismatch between recorded and predicted sex. We removed samples with more than 0.5% of sites with a detection p-value of > 0.01; and probes with more than 5% samples with a bead count smaller than 3. Normalization was performed using the R package minfi 57  Smoking associated CpG sites. We selected a subset of CpG sites identified in two epigenome-wide association studies of tobacco consumption 19,34 . We selected CpG sites with a p-value lower than 10 -7 in both Ambatipudi et al. 19 (associations between CpG sites and differences between groups: smokers v non-smokers, smokers v ex-smokers, ex-smokers v non-smokers) and in Joehanes et al. 34 (associations between CpG sites and dosage of tobacco smoked) to obtain a subset of CpG sites confidently associated with smoking (i.e., from two sources). We identified those CpG sites with heritabilities lower than 40% in Generation Scotland (as measured in the last step of the quality control of the data, see below) that are available in Generation Scotland. The list of 62 CpG sites is available in Supplementary Table 9.
To model the different sources of variance we used a set of covariance matrices representing similarity between individuals based on genetic components, environmental components, or both.
Genetic matrices: G is a genomic relationship matrix (GRM) reflecting the genetic similarity between individuals 16,60 . K is a matrix representing pedigree relationships as in Zaitlen et al. 61 . It is a modification of G obtained by setting those entries in G lower than 0.025 to 0.
Smoking matrices: ESMK is a matrix representing common environmental effects shared between individuals with same smoking status i.e., ESMK contains a value of 1 between individuals in the same smoking category and a 0 between individuals in different categories.

Gene-Environment interaction matrices:
GxSmk is a matrix representing genome-by-smoking interactions. It was computed as the cell-by-cell product (Hadamard or Schur product) of the corresponding G and ESMK matrices. For an element of the GxSmk matrix, if the corresponding G or the E elements are close to zero, the GxSmk term will be zero or close to zero as well. Therefore, similarity between individuals due to the interactions represented in the GxSmk matrices requires similarity at both genetic and environmental level. This method resembles a reaction norm modelling approach 62 .

Methylation-derived matrices: MSMK is a matrix representing similarity between individuals based on
DNA methylation levels at 62 smoking associated CpG sites (see Smoking associated CpG sites above).
A similarity matrix was created using OSCA v 0.45 63 using algorithm 3 (i.e., iteratively standardizing probes and individuals). GxMSMK is a genome-by-smoking interaction matrix computed as a Hadamard product of G and MSMK.

Analyses
We performed several variance component analyses using GCTA 50 , based in the following linear mixed models: (1) y = Xβ + gg + gkin + ε (2) y = Xβ + gg + gkin + wL + ε (3) y = Xβ + gg + gkin + wL + gw + ε (4) y = Xβ + gg + gkin + gw + ε where y is an n × 1 vector of observed phenotypes with n being the number of individuals, β is a vector of fixed effects and X is its design matrix, gg is an n × 1 vector of the total additive genetic effects of the individuals captured by genotyped SNPs with gg ~ N(0, Gσ 2 g); gkin is an n × 1 vector of the extra genetic effects associated with the pedigree for relatives with gkin ~ N(0, Kσ 2 k). w is a n × 1 vector representing the common environmental effects of smoking, with w ~ N(0, Eσ 2 w). gw is a n × 1 vector representing interactions between markers and environments with gw ~ N(0, GxSmkσ 2 gw). ε is an n × 1 vector for the residuals. The four basic models shown above were expanded to include all combinations of random and fixed effects showed in Figure 1.
The estimates for variance explained by the genome-by-smoking components in the four sub-cohorts of UK Biobank were meta-analysed using the R 54 package metafor 64 .      Supplementary Table 9.Smoking associated CpG sites information. Name, chromosome, location, and heritability of the 62 CpG sites associated with smoking.