The impact of genome-wide association studies on biomedical research publications

The past decade has seen major investment in genome-wide association studies (GWAS). Among the many goals of GWAS, a major one is to identify and motivate research on novel genes involved in complex human disease. To assess whether this goal is being met, we quantified the effect of GWAS on the overall distribution of biomedical research publications and on the subsequent publication history of genes newly associated with complex disease. We found that the historical skew of publications toward genes involved in Mendelian disease has not changed since the advent of GWAS. Genes newly implicated by GWAS in complex disease do experience additional publications compared to control genes, and they are more likely to become exceptionally studied. But the magnitude of both effects has declined over the past decade. Our results suggest that reforms to encourage follow-up studies may be needed for GWAS to most successfully guide biomedical research toward the molecular mechanisms underlying complex human disease.
Electronic supplementary material: The online version of this article (10.1186/s40246-018-0172-4) contains supplementary material, which is available to authorized users.


Background
Since the first successful genome-wide association studies (GWAS) were published over a decade ago [1][2][3][4], thousands have been performed [5]. These studies have identified tens of thousands of statistical associations between genetic variants and human diseases [5]. The large investment in GWAS has been criticized [6], perhaps because initial hopes for quick clinical impact were overenthusiastic [7]. The average time from basic science discovery to clinical practice is 17 years [8], so it is unsurprising that few GWAS results directly affect patients yet. But direct clinical impact is not the only goal of GWAS.
One major goal of GWAS has been to broadly characterize the genetic basis of human traits and complex disease. GWAS have shown that most traits are highly polygenic and that most common variants exhibit small effect size on phenotype [9,10]. They have also shown that genetic variants associated with disease are strongly enriched in regulatory regions [11] and that pleiotropy is pervasive [12,13]. They have also enabled polygenic prediction of traits by aggregating the weak effects of many these classic examples, going from association to therapy demanded substantial follow-up research.
Beyond anecdotal examples, how much follow-up research typically occurs when a gene is newly associated with complex disease via GWAS? To answer this question, we assessed the impact of GWAS on subsequent biomedical research publications. Our motivation was that if there is little follow-up research on associated genes, then important medical innovations are possibly being missed, and reforms may be necessary to encourage follow-up research.
Published GWAS are themselves often highly cited, for example [4,26,29]. A systematic comparison also found that GWAS are more highly cited than comparable candidate gene studies [30]. But a paper that cites a GWAS does not necessarily follow up on the associations reported by that GWAS. To quantify how much follow-up research is motivated by GWAS, we focused on the subsequent publication record of newly associated genes.
The distribution of biomedical research publications is highly unequal among human genes (Fig. 1a; [31]). Much of this inequality stems from historical momentum, driven by the availability of prior functional information [32] or research tools [33]. Consequently, many potentially medically important genes may be understudied [34]. Because GWAS are largely unbiased by previous knowledge about genes [35], they provide an opportunity for understudied genes to be brought to the scientific forefront.
We evaluated the effect of GWAS on the biomedical research literature in three ways. At a broad scale, we tested whether the distribution of publications among human genes has changed since the advent of GWAS. At a narrower scale, we quantified the effect of being newly associated with complex disease on the subsequent publication histories of human genes. Lastly, we identified outlier genes with exceptional publication activity and tested whether GWAS might play a role in motivating such activity. Overall, we find that genes newly associated with complex disease do experience increases in publication activity, but this effect has declined over the past decade.

Results
We measured research output on genes using scientific publications, as collected in the NCBI Gene database [36]. We prefer this manually curated database to automatic text mining, because text mining may introduce false positives when a gene is mentioned in passing.
In total, we considered 553,184 biomedical research publications that appeared in the annotations for one or more human genes, most of which were published after 1995 (Additional file 1: Figure S1).

Fig. 1 Biomedical scientific publications are highly unequally distributed and strongly skewed toward genes involved in Mendelian disease, even after the advent of GWAS. a The distribution of publications among all human genes is highly uneven. Plotted is the number of publications per gene, with genes sorted by number of publications. (The gene with the fewest publications is plotted as rank 1, and the gene with the most publications as rank 20,422.) A few genes are the subject of thousands of publications each, whereas thousands of genes are the subject of fewer than ten publications each. b The distribution of publications among all human genes is more uneven in the post-GWAS era (2005 and later) than in the pre-GWAS era (before 2005). Shown in this Gini plot are the cumulative proportions of publications in each category versus gene rank. The further the curve is from the diagonal, the more uneven the distribution. For comparison, the distribution of publications among yeast genes is shown, with the yeast x-axis stretched to match the number of human genes. c Highly studied genes tend to be involved in Mendelian disease. Plotted are the distributions of genes among publication rank for genes of each possible type of disease association and for both the pre- and post-GWAS eras. (Distributions are not normalized across types of disease association.) In both eras, genes involved in Mendelian diseases are strongly enriched toward high publication ranks. By contrast, many genes involved only in complex disease rank low in terms of publications.

Broad patterns of publications on human genes
We used the Online Mendelian Inheritance in Man (OMIM) database [37] and the EBI-NCBI GWAS catalog [5] to classify genes into those associated with Mendelian disease (N=1126), complex disease (N=3648), both (N=595), or no disease (N=15,043). As expected [31], we found that the distribution of publications among human genes was highly uneven. A small number of genes were the subject of many thousands of publications, while a large number of genes were the subject of only a few (Fig. 1a).
To quantify the unevenness of publications among genes, we used the Gini coefficient, which ranges from 0 (perfectly even distribution) to 1 (perfectly uneven). The Gini coefficient is calculated from the cumulative distribution of publications versus the gene rank (Fig. 1b).
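The Gini calculation described above can be illustrated with a minimal sketch. It uses the standard rank-weighted formula on sorted counts, which is one common way to compute the coefficient and is not necessarily the exact procedure used for Fig. 1b. The function name `gini` and the toy publication counts are illustrative assumptions.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even)."""
    values = sorted(counts)  # ascending, so rank 1 is the least-studied gene
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one gene and one publication")
    # Rank-weighted sum; algebraically equivalent to twice the area between
    # the cumulative-share curve and the diagonal of a Gini plot.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Toy data (hypothetical): four genes with equal vs. concentrated attention.
even = gini([5, 5, 5, 5])
skewed = gini([0, 0, 0, 100])
```

With a finite number of genes n, this formula reaches at most (n − 1)/n, so the value 1 is approached only for large gene sets in which a single gene receives every publication.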
To quantify the effect of GWAS on the distribution of publications among human genes, we compared that distribution before and after 2005. We chose 2005 as the cutoff between pre- and post-GWAS eras, because that is the year of the first entry in the GWAS catalog [5]. Other appropriate cutoff years might be 2007, when the first large GWAS were published, or 2009, to give time for publication patterns to change. Using either of these cutoff years does not qualitatively change our results (Additional file 1: Figure S2). The inequality of publications among human genes is larger in the post-GWAS era than in the pre-GWAS era (Gini coefficient 0.73 vs 0.65; Fig. 1b). It is not inevitable that the distribution of publications should be so unequal; the Gini coefficient of publications among yeast genes is much lower at 0.43 (Fig. 1b).
The ultimate goal of most biomedical research is to improve human health, so the distribution of publications is expected to be skewed toward genes involved in human disease. In the pre-GWAS era, genes associated with Mendelian disease were, almost without exception, among the most highly studied human genes (Fig. 1c and Additional file 1: Figure S2). By contrast, many genes that would later be associated with complex disease were among the least studied human genes (Fig. 1c). The advent of GWAS led to the discovery of many genes associated with complex human disease. The focus of biomedical publications on Mendelian disease genes, however, remains strong in the post-GWAS era (Fig. 1c). In particular, many genes associated with complex disease remain among the least studied genes in the human genome (Fig. 1c). The distribution of publication ranks for genes associated only with complex disease has shifted slightly toward higher ranks in the post-GWAS era compared to the pre-GWAS era (Mann-Whitney U test, p ∼ 10⁻⁹, N=3648), but the distribution has not changed qualitatively. Examining the distributions of publication ranks at higher temporal resolution also does not reveal any qualitative changes (Additional file 1: Figure S3).

Subsequent publications on individual genes
To quantify the immediate effect of GWAS on research into individual newly associated genes, we considered all genes that were first associated with complex disease via GWAS before 2015 (N=2442), and we focused on the calendar year of the first association and the following 2 years. For each new GWAS gene, we compared the publications over this period with a control non-GWAS gene chosen to have as similar a prior publication history as possible (see the "Materials and methods" section). The variance in an associated gene's publications is strongly correlated with the number of publications on that gene in the prior 3 years (Fig. 2a). Dividing the excess in publications relative to the control gene by the square root of the number of recent publications stabilizes the variance (Fig. 2b), consistent with a Poisson model for publication output [38]. The normalized excess in publications for a GWAS gene is slightly but significantly shifted above zero (Fig. 2c; one-sample t test, p ∼ 5 × 10⁻³⁴, N=2442). The mean normalized excess is 1.24 units, corresponding to a mean excess of 2.95 publications over the 3 years following association.
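As a minimal sketch of the normalization just described, the excess of a GWAS gene over its control can be divided by the square root of the gene's recent publication count. The function name, the argument names, and the floor of 1 for genes with no recent publications are assumptions made for illustration; the text does not state how zero counts were handled.

```python
import math


def normalized_excess(gwas_pubs, control_pubs, recent_pubs):
    """Publication excess over the control gene, in variance-stabilized units.

    gwas_pubs:    publications on the GWAS gene in the 3-year window
    control_pubs: publications on the matched control gene in the same window
    recent_pubs:  publications on the GWAS gene in the prior 3 years
    """
    # Assumption: use a floor of 1 so genes with no recent publications
    # do not cause a division by zero.
    scale = math.sqrt(max(recent_pubs, 1))
    return (gwas_pubs - control_pubs) / scale


# Hypothetical gene: 12 publications vs. 6 for its control, 9 in the prior 3 years.
example = normalized_excess(12, 6, 9)  # (12 - 6) / sqrt(9)
```

Under a Poisson model the variance of a count equals its mean, so dividing by the square root of the recent count puts heavily and lightly studied genes on a comparable scale.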
We next sought to identify the factors that determine how large an effect a GWAS will have on an associated gene's subsequent publications. For example, the more heavily studied a gene was previously, the smaller the effect of GWAS association (Fig. 2b, Spearman rank correlation, p ∼ 6 × 10⁻⁸, N = 2442).
The strength of a GWAS association is quantified by its statistical p value and its estimated biological effect size, which is most commonly an odds ratio. The normalized publication excess for a newly associated gene is weakly positively correlated with the p value of its association (Fig. 2d; p ∼ 1 × 10⁻⁴, N=2442). By contrast, the normalized publication excess is not significantly correlated with the estimated effect size of the reported association (Fig. 2e; p ∼ 0.14, N=1327).
The strongest predictor of the effect of a GWAS on future publications for associated genes is the year in which the GWAS was published. The typical normalized publication excess has declined dramatically since the early years of GWAS (Fig. 2f; p ∼ 9 × 10⁻²³, N = 2442).
The predictors for the effect of GWAS on subsequent publications that we have studied may themselves be correlated; to disentangle their effects, we built a linear regression model. In that model, the effects of the number of recent publications and of the GWAS publication date are statistically significant (Table 1). By contrast, the quantitative properties of the association itself, the p value and the estimated effect size, have weak effects that are not statistically significant.
The GWAS catalog uses a relatively liberal p value threshold of 10⁻⁵ for inclusion of associations into the catalog, and large p value associations may be statistical noise that subsequent researchers properly ignore. To account for this effect, we repeated our analyses using only genes for which the first reported association had p < 10⁻⁸, the suggested threshold for testing low-frequency variants [39]. When we restricted our analysis to these high-confidence associations (Additional file 1: Figure S4), we found that normalized publication excess was no longer significantly correlated with p value (ρ = 0.044, p ∼ 0.23; N = 724), but it was positively correlated with estimated effect size (ρ = 0.094; p ∼ 0.025; N = 570). The negative correlation between normalized publication excess and GWAS publication date was stronger than in the full data (ρ = −0.33; p ∼ 7 × 10⁻²⁰). The linear regression model (Additional file 1: Table S1) gave results similar to those for the full data: the effects of p value and estimated effect size were not statistically significant, whereas the effects of the number of recent publications and GWAS publication date were significant. Further restricting our analysis to associations for which the lower bound of the 95% confidence interval on the estimated odds ratio was larger than 1.1 (Additional file 1: Figure S5) yielded qualitatively similar results (Additional file 1: Figure S6 and Table S2).
Association with particular diseases might lead to particularly intense study. To test this possibility, we considered the class of disease that each gene was associated with as an additional predictor in the linear regression model. Of the 20 disease classes tested, only metabolic disease had a significant effect on the normalized publication excess (Additional file 1: Table S3). Further stratifying among metabolic diseases, we found that this trend is driven by studies on type II diabetes and obesity (Additional file 1: Table S4).

Genes with exceptional publication records
The typical new GWAS gene experiences a modest increase in subsequent publications, but some exceptional genes may experience large increases, so-called hot genes. To identify such genes, we used the model of Pfeiffer and Hoffmann [38] to predict the number of publications for each gene in each year, based on that gene's prior publication history. We trained the model on all genes never implicated in complex disease through GWAS. By comparing the model predictions and publication data, we then identified particular years in which particular genes had unexpectedly large numbers of publications (Additional file 2). For example, complement factor H had a significant excess of publications in all 3 years following its association with macular degeneration (Fig. 3a).
The total number of hot genes per year has recently fluctuated (Fig. 3b). Between 2009 and 2016, on average, 0.3% of genes were hot in any given year. Of the genes that were newly associated with complex disease via GWAS within the past 3 years, the probability of being hot was 1.3%. So, being newly associated with complex disease does increase the probability that a gene will become hot. The total number of hot genes that were recently associated with complex disease via GWAS peaked, however, in 2009 (Fig. 3b), even as the number of new GWAS genes each year has grown (Fig. 3c). Thus, the proportion of hot genes that were recent GWAS hits has declined (Fig. 3d).
To further quantify the role of GWAS in creating hot genes, we used a logistic regression model (Table 2). Consistent with the overall probabilities (Fig. 3), this model showed that being a recent new GWAS hit was an important factor in determining whether a gene would be hot. The effect of being a GWAS hit, however, had a negative interaction with the year. In other words, the effect of GWAS on creating hot genes with exceptional publication records decreased with time.

Discussion
We analyzed biomedical research publications to quantify the effect of genome-wide association studies on published scientific research. We found that even after the advent of GWAS, publications remain highly skewed toward Mendelian disease genes, with many complex disease genes receiving little attention (Fig. 1c). New complex disease genes identified by GWAS do receive additional study and subsequent publications (Fig. 2c), but that effect has declined (Fig. 2f, Table 1). Being newly associated with complex disease does increase a gene's chance of becoming a "hot" gene, but this effect has also declined (Fig. 3d, Table 2). Together, our results suggest that GWAS have been successful in bringing research attention to novel genes involved in complex human disease, but this influence is waning.

Fig. 3 The effect of GWAS in generating exceptionally studied genes. a A significantly elevated number of studies were published on complement factor H following its association with macular degeneration via GWAS in 2005 [2]. Solid line is the predicted publication history from the model of Pfeiffer and Hoffmann [38], points indicate actual publication counts, and starred points indicate years with a statistically significant excess (one-sided Bonferroni-corrected p < 0.05). b The total number of genes exhibiting an unusual excess in publications peaked in 2009, as did the number of those genes that were recently newly associated with complex disease via GWAS. c The number of genes newly associated with complex disease through GWAS has grown since the inception of GWAS. d The proportion of genes exhibiting an unusual excess in publications that were recently identified in GWAS peaked at roughly 20% in 2009 and has since declined.
Considering the overall distribution of biomedical publications, we found that GWAS have not reduced the inequality among human genes. The distribution of publications among human genes is characterized by a Gini coefficient of 0.73 in the post-GWAS era (Fig. 1a). By comparison, the Gini coefficient of money income among American households was 0.48 in 2016 [40] and among global households was 0.625 in 2013 [41]. The inequality of publications among genes is thus substantially greater than the inequality of income among households.
Focusing on individual genes, we found that association with complex disease via GWAS is correlated with an increase in subsequent publications (Fig. 2). Interestingly, the p value and estimated effect size of the association play a statistically insignificant role in determining the magnitude of that increase (Table 1 and Additional file 1: Table S1). We found a stronger effect on the subsequent publications for genes newly associated with metabolic disease (Additional file 1: Tables S3 and S4), perhaps reflecting its recent emphasis in public health [42]. We also found that association with complex disease via GWAS does raise the chances of a gene becoming an exceptionally studied "hot" gene (Fig. 3). But most dramatically, we found that the effects of new association via GWAS have declined over the past decade (Figs. 2f and 3d).
The direct results of a GWAS are associations of a disease with genetic variants, not with genes. For simplicity, we associated each variant with the closest gene, as long as that gene was within 500 kb. But many variants are regulatory, and gene regulation is complex, so some variants may actually most strongly affect other more distant genes [23]. Thus, some of the gene associations we study may be spurious. But this issue has existed since the advent of GWAS and has not changed markedly since, so it cannot explain why the effect of GWAS on subsequent publications has declined over time. When studying the effects of genetic evidence on drug development, Nelson et al. [43] used a more complex approach for assigning variants to genes. They incorporated linkage disequilibrium and attempted to infer regulatory relationships using expression quantitative trait loci (eQTLs) and DNase hypersensitivity sites. When we analyzed their collection of association data, we found similar results to our original analysis, although the effects were somewhat weaker (Additional file 1: Table S5 and Figure S7). In particular, we still found a negative relationship between the publication date of an association and its effect on subsequent publications.
Our measures of scientific publications do not necessarily capture the full effects of GWAS on biomedical research. We considered studies of specific associated genes, but the broad insights GWAS have given into the genetic basis of human disease have substantially affected biomedical research [10][11][12][16]. Motivated by the example of complement factor H (Fig. 3a), we focused on the publications in a 3-year window following the GWAS. Some follow-up studies may take longer, but using a 5-year window does not change our qualitative conclusions (Additional file 1: Figure S8 and Tables S6 and S7). GWAS may also promote biomedical research in ways that do not involve new publications. For example, drugs with associated genetic evidence are more likely to progress along the development pipeline [43], suggesting that GWAS promote efficient drug development. More broadly, we focused on associations with complex disease, the most common biomedical application of GWAS. But GWAS for drug response have already provided important guidance for personalized treatment [44]. Lastly, human GWAS have applications beyond health. For an evolutionary example, GWAS data have been used to detect adaptation in the human genome [45].
What explains the declining effect of GWAS on subsequent publications regarding newly associated genes? Perhaps early GWAS captured most genetic variants of large effect, so more recent studies find less compelling associations. But estimated effect size is not a strong predictor of subsequent publications (Table 1). Moreover, the typical estimated effect size of new associations has declined only modestly, and the absolute number of large-effect associations has grown (Additional file 1: Figure S9). Or perhaps journal publication criteria have changed over time, making GWAS less visible or follow-up studies more challenging to publish. The typical impact factor of journals GWAS are published in has declined slightly since the advent of GWAS (Additional file 1: Figure S10A). But the impact factor of the GWAS publication has only a weak effect on the publication excess of newly associated genes (Additional file 1: Figure S10B). When we included GWAS publication impact factor in our linear regression model, its effect was statistically significant but insufficient to explain the effect of publication date (Additional file 1: Table S8). Or perhaps researchers are spreading their effort among newly associated genes, so effects on individual genes have declined. But the summed publication excess over all genes newly associated with complex disease in a given time period has also declined over the past decade (Fig. 4). Or perhaps the availability of funding for follow-up studies has declined, as overall biomedical research funding has declined in both North America and Europe [46]. Or perhaps the capacity and interest to perform follow-up analyses has not kept pace with the "fire hose" of GWAS results [47]. Our data do not point toward a definitive explanation, and further investigation is needed to understand why recent GWAS promote less follow-up study on associated genes than early GWAS.
Over the past decade, GWAS have undeniably contributed greatly to biomedical knowledge [16]. The development of large-scale accessible databases of phenotypic and genotypic data, such as the UK Biobank [48], will fuel further contributions. But few GWAS results are directly medically actionable, so follow-up research is essential to translate novel associations into medical innovations. Our results suggest that the ability of GWAS to motivate published follow-up research on associated genes is declining. To maximize the positive impact of GWAS on human health, this trend must be understood and reversed.

Publication data
We obtained Entrez GeneIDs for all 20,422 human protein-coding genes from NCBI Gene [36] on December 12, 2017. For all those genes, we collected PubMed identifiers of associated publications from NCBI Gene's gene2pubmed file, downloaded December 12, 2017. This file contains both associations created manually during the curation of Gene References Into Function (GeneRIFs) and associations collected from organism-specific databases, Gene Ontology, and other curated data sources. We then obtained date information for each publication from PubMed, taking the earlier of the reported year and the EYear, using BioPython [49]. We followed a similar procedure for yeast genes. We obtained impact factor data from the 2016 InCites Journal Citation Reports [50].

Fig. 4 Total publication excess of new GWAS genes. For 6-month periods, plotted is the total publication excess (compared to control genes) of genes newly associated with complex disease via GWAS during each period.

Disease data
To identify genes associated with Mendelian disease, we downloaded the Online Mendelian Inheritance in Man (OMIM) Gene Map of connections from genes to traits [37] on January 17, 2018. We filtered to keep only entries with a confidence code of "confirmed" and to ignore entries indicating a potentially spurious mapping or association with a non-disease trait. We further considered only entries with Entrez GeneIDs, to avoid ambiguity among gene names and aliases. This procedure yielded 1878 genes associated with disease traits. Of these, 1543 genes were associated with Mendelian but not complex multifactorial disease, 157 were associated with complex multifactorial but not Mendelian disease, and 178 were associated with both Mendelian and complex multifactorial disease.
To further identify genes associated with complex disease and to gather GWAS data, we used the January 1, 2017, release of NHGRI-EBI's GWAS Catalog [5]. We filtered the catalog to remove non-disease traits, by keeping only entries that were children of the term "disease" (EFO-0000408) in the Experimental Factor Ontology [51]. To connect associated variants with genes, we began with the Mapped Genes column in the catalog. We then connected each variant with its closest mapped gene, if that gene was within 500 kb. If a variant was within two overlapping genes, we connected with both genes. This procedure yielded 4069 genes associated with complex disease. To analyze the classes of disease, we used the children of the term "disease" in the Experimental Factor Ontology.
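The variant-to-gene rule described above can be sketched as follows. This is a minimal illustration, assuming each mapped gene is given as a `(name, start, end)` interval on the variant's chromosome; the function names, the gene names, and the handling of exact distance ties between non-overlapping genes are assumptions and are not specified in the text.

```python
MAX_DISTANCE = 500_000  # 500 kb cutoff stated in the text


def distance_to_gene(position, start, end):
    """Distance in bp from a variant to a gene interval; 0 if inside it."""
    if start <= position <= end:
        return 0
    return start - position if position < start else position - end


def assign_genes(position, mapped_genes):
    """Return the gene name(s) connected to a variant at `position`.

    mapped_genes: list of (name, start, end) tuples on the same chromosome.
    A variant inside several overlapping genes is connected to all of them.
    """
    distances = [(distance_to_gene(position, s, e), name)
                 for name, s, e in mapped_genes]
    if not distances:
        return []
    closest = min(d for d, _ in distances)
    if closest > MAX_DISTANCE:
        return []  # no mapped gene within 500 kb
    # Assumption: genes at exactly the same nonzero distance are all kept.
    return [name for d, name in distances if d == closest]


# Hypothetical intervals: GENE_A and GENE_B overlap; GENE_C is far away.
genes = [("GENE_A", 1000, 2000), ("GENE_B", 1500, 3000), ("GENE_C", 900000, 950000)]
inside_both = assign_genes(1700, genes)
nearest_only = assign_genes(100000, genes)
too_far = assign_genes(5000000, genes)
```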
Our analysis of OMIM and the GWAS catalog yielded 5369 total disease-associated genes. Considering genes associated with only Mendelian disease in OMIM and not associated with disease through GWAS yielded 1126 Mendelian disease genes. Considering genes associated with only complex multifactorial disease in OMIM or associated with disease through GWAS yielded 3648 complex disease genes. The remaining 595 genes were associated with both Mendelian and complex disease.
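One way to read the classification above is as set arithmetic over three gene lists. The sketch below reflects that reading only: a gene counts as "complex" if OMIM labels it complex multifactorial or the GWAS catalog associates it with disease, and as "both" if it also has a Mendelian association in OMIM. The function name, argument names, and gene identifiers are hypothetical.

```python
def classify_disease_genes(omim_mendelian, omim_complex, gwas_disease):
    """Split disease genes into Mendelian-only, complex-only, and both."""
    mendelian = set(omim_mendelian)
    # Complex disease evidence can come from either OMIM or the GWAS catalog.
    complex_any = set(omim_complex) | set(gwas_disease)
    return {
        "mendelian_only": mendelian - complex_any,
        "complex_only": complex_any - mendelian,
        "both": mendelian & complex_any,
    }


# Hypothetical gene identifiers, for illustration only.
groups = classify_disease_genes(
    omim_mendelian={"G1", "G2", "G3"},
    omim_complex={"G3", "G4"},
    gwas_disease={"G2", "G5"},
)
```

The three categories are disjoint by construction, which matches the reported totals: 1126 + 3648 + 595 = 5369 disease-associated genes.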
Of the disease genes in the GWAS catalog, 2442 were first associated prior to 2015, so we could analyze three full years of publication data. For those genes, we identified odds ratios as reported effect sizes without units for variants that had a reported frequency of the risk allele. For our odds ratio analysis, we analyzed the 1327 genes for which an odds ratio was reported in the first year of GWAS association.
We also analyzed the association data of Nelson et al. [43]. They connected variants to genes using linkage disequilibrium, expression QTLs, and DNase hypersensitivity. We filtered their Supplementary Data Set 1 to remove associations from OMIM, which may be Mendelian diseases. We also manually classified traits as disease or non-disease (Additional file 3), filtering out the non-disease traits.

Control genes
For each of our 2442 GWAS genes, we identified its control gene as the non-GWAS gene with the closest number of total publications prior to the year the gene was first associated with complex disease. If multiple genes were tied for closest, we compared the previous year as well, continuing either until there was no ambiguity or until we reached 1950. For the 233 GWAS genes with ambiguous control genes, we compared subsequent publications between the GWAS gene and the average of the control genes.
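The matching procedure above can be sketched as an iterative tie-break on cumulative publication counts. This is one interpretation of the description, assuming that "comparing the previous year" means comparing cumulative totals one year earlier; the function names and the `{year: publications}` data layout are hypothetical.

```python
def cumulative_before(yearly_counts, year):
    """Total publications in all years strictly before `year`."""
    return sum(n for y, n in yearly_counts.items() if y < year)


def find_control_genes(gwas_counts, candidates, association_year, first_year=1950):
    """Return the non-GWAS gene(s) whose prior publication history is closest.

    gwas_counts: {year: publications} for the GWAS gene
    candidates:  {gene_name: {year: publications}} for non-GWAS genes
    More than one name is returned only if ties persist back to `first_year`.
    """
    remaining = list(candidates)
    year = association_year
    while len(remaining) > 1 and year > first_year:
        target = cumulative_before(gwas_counts, year)
        gaps = {g: abs(cumulative_before(candidates[g], year) - target)
                for g in remaining}
        best = min(gaps.values())
        remaining = [g for g in remaining if gaps[g] == best]
        year -= 1  # still tied: compare totals one year earlier
    return remaining


# Hypothetical histories: A and B tie on the total, B matches better a year earlier.
gwas_gene = {2000: 2, 2001: 3, 2002: 5}
controls = {
    "A": {2001: 4, 2002: 6},
    "B": {2000: 5, 2002: 5},
    "C": {2002: 3},
}
matched = find_control_genes(gwas_gene, controls, association_year=2003)
```

When several genes are returned, the comparison would use the average of their subsequent publications, as described for the 233 GWAS genes with ambiguous controls.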

Publication rate model
We used the model of Pfeiffer and Hoffmann [38] to predict expected per-gene publication rates (Eq. 1). In Eq. 1, P_{i,t+1} is the predicted number of publications for gene i in year t + 1, and P_{i,t} and P*_t are the cumulative number of publications in previous years for the gene and the average cumulative number of publications for all genes in the organism, respectively. The term in the denominator models saturation of publication rates. The three rate parameters, k_1, k_2, and k_3, and the saturation parameters, P_S and α, were assumed to be identical for all genes. To fit the parameters to our data, we constructed a likelihood function by assuming that the number of publications each year for each gene was independently Poisson distributed with mean P_{i,t+1} given by Eq. 1. We then maximized that likelihood with respect to the five model parameters, using publication data from 1950 to 2015 for all non-GWAS genes. The maximum-likelihood parameter values were k_1 = 0.0214, k_2 = 0.225, k_3 = 0.00288, P_S = 24.1, and α = 1.67. Five genes each had one publication prior to 1950 that was not included in the data fit.
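The likelihood used for fitting can be illustrated with a short sketch. Because the functional form of Eq. 1 is not reproduced here, the sketch takes the predicted means as an input rather than computing them from the five parameters; the function name and the toy counts are assumptions, and the numerical maximization over parameters is not shown.

```python
import math


def poisson_log_likelihood(observed, predicted):
    """Sum of log Poisson probabilities over (gene, year) observations.

    observed:  publication counts, one per gene-year
    predicted: model means for the same gene-years (e.g., from Eq. 1)
    """
    total = 0.0
    for k, mu in zip(observed, predicted):
        if mu <= 0:
            raise ValueError("predicted means must be positive")
        # log of exp(-mu) * mu**k / k!
        total += k * math.log(mu) - mu - math.lgamma(k + 1)
    return total


# Toy check with a constant-rate stand-in for the model (hypothetical data):
# the likelihood should be highest at the sample mean of the counts.
counts = [2, 4, 6]
ll_at_3 = poisson_log_likelihood(counts, [3.0] * 3)
ll_at_4 = poisson_log_likelihood(counts, [4.0] * 3)
ll_at_5 = poisson_log_likelihood(counts, [5.0] * 3)
```

In a full fit, the predicted means would be recomputed from candidate parameter values at each step of an optimizer, and the parameter set with the largest log-likelihood would be retained.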
To identify the years in which genes had significantly elevated publication rates, our null model was that publications were Poisson distributed with mean given by Eq. 1.
Significant gene years were defined as those in which the probability of generating at least the observed number of publications was less than the Bonferroni-corrected significance cutoff 0.05/(N_g N_y). Here, N_g = 20,442 was the total number of genes considered, and N_y = 67 was the total number of years.
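The significance test described above can be sketched with a direct summation of the Poisson upper tail, which keeps very small probabilities accurate. The function names are illustrative assumptions; the expected count would come from Eq. 1, and the values of N_g and N_y in the example are those stated in the text.

```python
import math


def poisson_upper_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu), summed term by term in log space."""
    if k <= 0:
        return 1.0
    if mu <= 0:
        return 0.0
    total = 0.0
    j = k
    while True:
        term = math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1))
        total += term
        j += 1
        # Past the mode, terms shrink monotonically; stop once negligible.
        if j > mu and term <= total * 1e-17:
            break
    return min(total, 1.0)


def is_significant_gene_year(observed, expected, n_genes, n_years, alpha=0.05):
    """True if the observed count exceeds the Bonferroni-corrected cutoff."""
    cutoff = alpha / (n_genes * n_years)
    return poisson_upper_tail(observed, expected) < cutoff


# Hypothetical gene-years: 30 vs. 3 observed publications when 2 are expected.
strong = is_significant_gene_year(30, 2.0, n_genes=20442, n_years=67)
weak = is_significant_gene_year(3, 2.0, n_genes=20442, n_years=67)
```

With these values the cutoff is roughly 3.7 × 10⁻⁸ per gene-year, so only large departures from the predicted count qualify.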