Keeping Pace with the Red Queen: Identifying the Genetic Basis of Susceptibility to Infectious Disease

The results of genome-wide association studies are known to be affected by epistasis and gene-by-environment interactions. Using a statistical model....

I NFECTIOUS diseases are pervasive. So pervasive, in fact, that without effective mechanisms of resistance, host populations can be quickly reduced in size or even driven to extinction. For instance, chestnut blight effectively wiped out the American chestnut, which had little if any resistance to this novel pathogen, after its introduction to North America in the early 1900s (Anagnostakis 2000;Anderson et al. 2004). Similarly, when Myxoma virus was introduced to Australia in the 1950s, local rabbit populations were almost entirely susceptible, resulting in millions of deaths and the decimation of local populations (Ratcliffe et al. 1952). Human populations, too, have been heavily affected by infectious disease in the past, perhaps most notably during the 1918 influenza pandemic that killed .50 million people before fading away in 1920 (Johnson and Mueller 2002;Taubenberger and Morens 2006). Although these examples are striking and demonstrate the impact of unchecked infectious disease, they are far from the norm. More com-monly, host populations have effective mechanisms of resistance against pathogens they encounter regularly (Revers et al. 2014), with significant variability between populations depending on their history of exposure (Bartholomew 1998;Weatherall and Clegg 2002).
The existence of substantial variation in resistance to infectious disease within host populations has generated hope that it may be possible to identify the genes conferring resistance. Identifying such resistance genes would pave the way for genetic engineering of resistant crops and livestock, focus drug development efforts on likely targets, and open the door to gene therapeutic approaches within human populations. As the genomic revolution has progressed, it has become increasingly common to search for these "resistance genes" using genome-wide association studies (GWAS) (Newport and Finan 2011;Rowell et al. 2012). Loosely speaking, these studies compare the marker genotypes of individuals infected with disease and those uninfected and ask which loci predict an individual's infection status. The GWAS approach has now been used to successfully identify a range of candidate genes thought to be important in resistance to infectious disease in plants and animals (Chapman and Hill 2012;Khor and significant limitations. For instance, GWAS are most powerful when resistance depends on common genetic variants with relatively large phenotypic effects (Manolio et al. 2009). In addition, which candidate genes are identified by this method may depend on the environment in which the study is conducted (Thomas 2010). These limitations apply to GWAS in general, not just those studies focused on infectious disease, and are widely recognized. When GWAS are used to understand the genetic basis of resistance to infectious disease, however, a potentially more important problem arises. Specifically, the resistance genes identified within the host population may depend on the genetic composition of the infectious disease itself (Newport and Finan 2011). This sensitivity of the GWAS approach to the genetic composition of the infectious disease becomes acute any time genotype-by-genotype (G 3 G) interactions exist; in other words, when particular combinations of host and pathogen genes yield resistance whereas other combinations lead to susceptibility. These G 3 G interactions may have drastic effects on the results of genetic association studies and our understanding of disease resistance (Lambrechts 2010), similar to the effects of gene-by-environment interactions. One particularly disconcerting possibility is that rapid pathogen evolution or host-pathogen coevolution will cause the host resistance genes that can be identified by GWAS to fluctuate rapidly over time.
Here we quantitatively explore the performance of GWAS when resistance to infectious disease involves G 3 G interactions between host and disease. We begin by presenting a general mathematical model of an association study to investigate disease resistance and evaluate the role of G 3 G interactions for several forms of host-parasite interactions. We then simulate host-pathogen coevolution to illustrate the extent to which G 3 G interactions may vary across time and/or space. We conclude by reanalyzing published genome-wide association data  of Daphnia magna resistance to its Pasteuria ramosa pathogen, distinguishing regions of the genome associated with overall health from those involved in resistance specific to a particular P. ramosa strain.

Model
We consider a scenario, common in practice, where host resistance is measured as a continuous quantitative trait. This would be the case, for instance, if host resistance is assessed by measuring viral load, duration of infection, or damage to host tissues. Our model assumes that host resistance depends on the value of a quantitative trait in the host, z H ; relative to the value of a quantitative trait in the pathogen, z P : Specifically, we assume host susceptibility, S, is given by the following function: The function f is sufficiently general to accommodate many commonly observed resistance mechanisms. For instance, in the interaction between the snail Biomphalaria glabrata and its trematode parasites, resistance depends on the relative quantities of reactive oxygen molecules in the snail (z H ) and reactive oxygen scavenging molecules produced by the parasite (z P ) (Bayne 2009;Mon et al. 2011). In cases like these, the function f may take a sigmoid form which we call the phenotypic-difference model ( Figure 1A) (Nuismer et al. 2007;Ashby and Boots 2017): In contrast, in the interaction between the schistosome parasite, Schistosoma mansoni, and its snail host, B. glabrata, resistance depends on the degree to which the conformation of defensive FREP molecules produced by the snail (z H ) match the conformation of parasite mucin molecules (z P ) and successfully bind to them (Mitta et al. 2012). In such cases, the function f may take a Gaussian form which we call a phenotypic-matching model ( Figure 1B) (Kopp and Gavrilets 2006): To study the effects of genetic interactions on susceptibility to infection, S, we must integrate genetics into our phenotypic model. For a haploid host and pathogen where z H and z P depend on n H and n P biallelic loci, respectively, we can write general expressions for these phenotypes as functions of alleles present in each species: In these expressions, X Mi is an indicator variable describing the allelic state (0 or 1) of an individual of species M at locus i, b M0 is the phenotype of an individual of species M with all "0" alleles, and b Mi is the additive effect of carrying a "1" allele at locus i in species M. Our goal now is to use this genetic model to predict the sensitivity of GWAS to the genetic composition of the pathogen population. We will explore both traditional, singlespecies GWAS approaches and a novel approach that takes genetic information from both host and pathogen into account (co-GWAS). Our investigation will rely on a pair of complementary approaches. First, we will develop and analyze analytical approximations that quantify the sensitivity of GWAS and co-GWAS approaches to changes in pathogen genotype frequencies. These analytical approximations will rely on simplified genotype-phenotype maps and will not explicitly integrate evolution and coevolution. Second, we will develop and analyze simulations that allow us to explore the consequences of rapid pathogen evolution and coevolution between the species on the performance of both GWAS and co-GWAS approaches.

Analytical Approximation
To simplify the genetic model of resistance developed in the previous section sufficiently for mathematical analysis, we begin by considering the case where n H ¼ n P ¼ 2: In addition, we assume that the phenotypes of host and pathogen are not too far from one another, such that the quantity z H 2 z P is small relative to the extent of phenotypic specificity (a in Equations 2 and 3). Under this assumption, Equation 1 can be approximated by its second order Taylor series expansion. This allows the genetic model of susceptibility to be simplified to the following approximate expression: where primes indicate derivatives with respect to the distance between host and pathogen phenotypes. With (5) in hand, we have a model that predicts host resistance as a function of host and pathogen genotypes. In the following two sections, we will use (5) to investigate how the genetic composition of the pathogen population influences the results of GWAS and co-GWAS. Extending these models to complete G 3 G association studies requires a large number of pathogen loci (n P 2) and thus may be computationally prohibitive. For many pathogens, however, strain type or subtype may be known and capture much of the relevant genetic variation in the pathogen population. In these cases, tracking pathogen types can greatly reduce the effective number of loci, even to n P ¼ 2 as in Equation 5. Such simplifications should allow us to expand beyond two host loci to a whole host genome (n H 2), while avoiding the computational complexity of tracking all possible genetic interactions between the full host genome and the full parasite genome.

Single-species GWAS
We envision a standard GWAS where susceptibility to infection has been measured for some number of host individuals, each of which has also been genotyped at a large number of marker loci. To focus our model on the effects of species interactions, we will assume this data accurately provides us with the genotype of individuals at the two host resistance loci. Using this data, the goal of the genetic association study is to partition host susceptibility between these genes relative to their effects. This can be done by fitting susceptibility with a linear combination of the genetic indicator variables: where the b coefficients can be found using least squares regression. The biological interpretation of this linear model is straightforward. The intercept coefficient, b H0 ; is the expected host resistance when both 0 host alleles are present. The coefficients b H1 and b H2 are the inferred additive effects of the 1 alleles at the first and second loci, respectively, and b H1;H2 captures the epistatic interaction between the two host 1 alleles. Solving for the coefficients in (6) we have (see Supplemental Material, Mathematica notebook in File S1): for i ¼ {1,2}, where f ð0Þ; f 9ð0Þ; and f $ð0Þ are the resistance function and its first and second derivative evaluated at 0 as in Equation 5, and whereb Importantly, these expressions for the coefficients depend on the allele frequency at the pathogen loci, q P1 and q P2 ; as well as the linkage disequilibrium between them, D P : Note that the relevant allele frequencies and linkage disequilibrium are among pathogens to which the host is exposed, which may not be equivalent to the pathogen population as a whole. As a result of the dependence of the coefficients in (7) on the pathogen allele frequencies and linkage disequilibrium, the allelic effects (b's) inferred by a host-only GWAS can be quite sensitive to the genetic composition of the pathogen population ( Figure 2). Changes in pathogen allele frequency can alter the magnitude and sign of the inferred effects. From a practical standpoint, if susceptibility is assayed in two host populations that are exposed to pathogen populations that differ greatly in their allele frequencies, one may find a host allele has a protective effect in one population but increases risk in the other. Similar to hidden host population structure, uncontrolled differences in the pathogen population can greatly alter the inferences of single-species GWAS.
A second result that can be drawn from Equation 7 is that when the resistance function is approximately linear, f $ð0Þ ¼ 0; the inferred additive and epistatic effects, b H1 ; b H2 ; and b H1;H2 are independent of the pathogen allele frequencies. For example, in contrast to the nonlinear phenotypic-matching model where the inferred effects vary with pathogen allele frequency, the inferred effects remain constant in the approximately linear phenotypic-difference model (Figure 2). A third conclusion from Equation 7 is that, at least under the assumption that z H 2 z P is small, the epistatic interaction between the host loci, b H1;H2 ; is independent of pathogen genetics. We will explore the consequences of this dependence on the pathogen allele frequencies for the stability of GWAS-inferred effects across evolutionary time (See the Host-Parasite Coevolution section below).
In addition to identifying the allelic effects on host resistance, an important metric of GWAS performance is the proportion of phenotypic variation explained by the identified causative loci. Given the dependence of the estimated allelic effects on pathogen allele frequencies, we calculated the total phenotypic variation explained by the host loci across the range of pathogen allele frequencies (Figure 2, C and D). When the pathogen population is monomorphic (q P1 ¼ q P2 ¼ 0 or 1), the host loci can explain 100% of the genetic variation in the phenotype. If the pathogen population is polymorphic, however, the host-only approach may explain as little as 10% of the variation. Partitioning the total variation explained into the additive and epistatic contributions demonstrates that, due to changes in the additive effect size b Hi ; the relative contribution of additive and epistatic effects also varies with pathogen allele frequency and depends on the form of the host-parasite interaction.

Two-species co-GWAS
The results derived in the previous section demonstrate that traditional single-species GWAS may be sensitive to the genetic composition of the pathogen population at loci involved in host-pathogen specificity. In this section, we attempt to overcome this problem by developing an alternative GWAS design in which both host and pathogen genetics are incorporated. In contrast to the traditional method where only host genotypes are recorded, this design requires that both host and pathogen genotypes are known. As with Equation 6, we now attempt to fit host resistance as a linear function of the allelic indicator variables, but we include pathogen indicators as well as interaction terms between host and pathogen loci: As with Equation 6, the coefficients of this equation have straightforward biological interpretations. The intercept, b 0 ; describes the expected host resistance when all host and pathogen loci have 0 alleles. Terms 2, 3, 5, and 6 describe the additive effects of each individual host and pathogen 1 allele; and terms 4 and 7 describe the epistatic interactions between loci within the host and pathogen, respectively. The remaining four terms describe the G 3 G interactions between pairs of host and pathogen loci.
Despite the complexity of Equation 8, and hence the logistical and computational challenges of applying it, the expressions for each of these coefficients in terms of the host and pathogen phenotypic effects are simple (see Mathematica notebook in File S1): Comparing the equations in (9) with the coefficients in (7) reveals an important conclusion: the effect sizes no longer depend on the pathogen allele frequencies nor the linkage disequilibrium ( Figure 3, A and B). This result suggests that the two-species, co-GWAS approach is more robust to changes in the genetic composition of the pathogen population and thus may be much less sensitive to rapid evolution and spatial genetic structuring within the pathogen population. In addition to stabilizing the estimated allelic effects across pathogen allele frequencies, the total phenotypic variation explained by the co-GWAS greatly exceeds that of the hostonly GWAS. For the two-locus case explored here, the co-GWAS approach can explain 100% of the variation regardless of pathogen allele frequency (Figure 3, C and D). The contributions of additive, epistatic, and G 3 G interactions do, however, vary with pathogen allele frequency. As with the host-only approach, when the pathogen population is monomorphic the host effects explain all of the observed phenotypic variation. In summary, unlike the host-only model, the effect size coefficients (Equation 9) and the total variation explained, no longer vary with pathogen allele frequency. This contrast between the host-only and co-GWAS ap-proaches is particularly relevant any time the composition of the pathogen population is likely to differ between the sample used for the association study and the population in which the resulting inferences are applied. In the following section we explore how temporal changes in the host and pathogen populations driven by coevolution affects the reproducibility of GWAS over time and, by extension, space.

Host-Parasite Coevolution
To simulate host-parasite coevolution, we envision a system where each host comes into contact with a single parasite each generation. The probability that this contact results in infection is determined by host susceptibility, S, which is a function of the host and parasite genotype. Infected hosts experience a fitness cost j H ; whereas their infecting parasites receive a fitness benefit j P : In the absence of infection, both hosts and parasites have a fitness of 1. Together, these assumptions lead to the following fitness of a host with genotype fX H1 ; X H2 g that comes into contact with a pathogen with genotype fX P1 ; X P2 g : W H ¼ 1 2 j H SðX H1 ; X H2 ; X P1 ; X P2 Þ; whereas the pathogen has a fitness of W P ¼ 1 þ j P SðX H1 ; X H2 ; X P1 ; X P2 Þ: Given these fitnesses, we simulate allele frequencies and linkage disequilibrium over time assuming random mating, a per-locus mutation rate of m, and a recombination rate r (see Mathematica notebook in File S1). We then use Equations 7 and 9 to calculate the inferred allelic effect sizes by using a host-only GWAS or co-GWAS for each generation over the course of coevolution for both the phenotypic-difference and phenotypic-matching models ( Figure 4). As expected, using the host-only GWAS approach, the inferred allelic effects can vary over time but only under the quadratic-shaped, phenotypic-matching model. As noted above, the estimated effects can even change sign, having large positive values when sampled in one generation and large negative values when sampled only a few generations later. In contrast, the inferred effects remain constant in the co-GWAS approach regardless of the coevolutionary model. In terms of the phenotypic variation explained, the host-only approach explains only a portion of genetically determined phenotypic variation, whereas the co-GWAS approach can explain up to 100%. The contribution of different genetic components to the total variation explained remains approximately constant under the phenotypic-difference model but varies rapidly as allele frequency changes in the phenotypicmatching model.

Data availability
The analysis, numerical simulations, and scripts to generate the original figures were coded in Wolfram Mathematica 11 (File S1) and are available for download from the Dryad Digital Repository (DOI: https://doi.org/10.5061/dryad.tb25q).

Daphnia-Pasteuria GWAS
Taken together, our analytical model and simulations illustrate that incorporating pathogen genetic information into the search for disease genes can greatly increase the explanatory power and repeatability of genome scans. Testing these theoretical predictions with biological data is a critical step in evaluating the power of the co-GWAS approach relative to a traditional single-species GWAS. Analysis of biological data will include several complications that we ignored above, including finite sample sizes, arbitrary forms of coevolutionary interactions, and complex genomic architectures. Unfortunately, we know of no studies that include full host and parasite genomic data as well as the outcome of infection experiments. Further, the computational tools to perform a co-GWAS in the form of Equation 8 do not yet exist. We can, however, use recently published data by Bourgeois et al. (2017) on the susceptibility of D. magna to two P. ramosa strains, C1 and C19, as a preliminary test of our analytical predictions. In particular, we compare the results of genome scans for C1 and C19 susceptibility analyzed separately to a single genome scan for susceptibility using all the data but ignoring pathogen strain type. Our analytical model predicts that, despite having half the sample size, the separate genome scans for C1 and C19 resistance should reveal loci that determine host-parasite specificity, whereas the full data scan will have lower power to do so. Note that strain type captures almost all of the relevant genetic information in this case, given that the parasite is clonal.
The original data set, provided on Dryad by the authors , sampled 97 D. magna clones from three distinct geographic regions-1 site in Germany, 1 in Switzerland, and 11 sites in Finland-and provided the sequence at 6403 SNPs. Host susceptibility (S: susceptible; R: resistant) infection by each P. ramosa strain, C1 and C19, was determined by assessing whether fluorescently labeled spores attached to the host's esophagus (Duneau et al. 2011). All four possible combinations of susceptibility and resistance to the two strains (SS, SR, RS, and RR) were present. By performing two separate association studies, one for each strain, Bourgeois et al. (2017) used this experimental design to identify genomic regions associated with susceptibility to a specific parasite strain. Following the methods in the original work, we compare their results to a third genome scan including all the data, a total of 194 samples, ignoring the Pasteuria strain type tested. All genome scans were performed using the R package GenAbel, adjusting for population structure and repeated measures of the same host genotype using the Eigenstat method (Aulchenko et al. 2007).
To accurately assess which genomic regions are associated with susceptibility to C1, C19, and/or "overall" susceptibility from the complete data set, we used the Daphnia genetic map constructed by Dukić et al. (2016) to array the scaffolds into 10 linkage groups. To limit the detection of false positives, we followed an approach analogous to that used in Bourgeois et al. (2017) where SNPs were only considered significantly associated with a given susceptibility phenotype if there existed four SNPs in a 10-cM region with a log-likelihood score .2 ( Figure 5). Multiple genomic regions are significantly associated with susceptibility to C1, C19, and to disease susceptibility in the complete data set without strain information. Four linkage groups (4, 5, 7, and 9), with a total of 28 significant SNPs, are associated with C1 susceptibility. Three linkage groups (1, 4, and 7) with 38 SNPs are associated with C19 susceptibility, and two linkage groups (4 and 5) with 35 SNPs are associated with susceptibility in the complete data set. Thus, while the complete data set has twice as many measures of disease susceptibility, it has less power to detect genetic regions underlying disease susceptibility because of the lack of parasite information.
The contrast between the associations for C1 and C19 susceptibility to overall susceptibility in the complete data set provides additional information about the nature of the genetic basis to resistance. Genomic regions associated with the overall resistance regardless of parasite type, particularly when these regions are also associated with C1 and C19 resistance, provide increased resistance regardless of the parasite strain tested and are consistent with general host health and nonspecific immune response. By contrast, sites that are not associated with overall resistance-despite the data set having twice the size-but are associated with either C1 or C19, are good candidates for loci that act in a parasitespecific manner. Examining Figure 5, we therefore conclude that linkage group 4 and possibly 5 are involved in general health and resistance. In contrast, the regions on the far left and right of linkage group 7 as well as the regions on linkage group 1 and 9, which are associated only with C1 or C19 resistance, are indicative of parasite-specific resistance loci.
These conclusions are in agreement with the hypothesized model and previous molecular work on Daphnia resistance to Pasteuria. In particular, resistance to Pasteuria is hypothesized to be controlled by a three-locus, matching-alleles system. One of these loci (the C locus) determines overall Figure 4 Allelic effects over coevolutionary time. Top row: Phenotypes z H (red) and z P (blue) simulated over coevolutionary time in the phenotypicdifference (left) and phenotypic-matching models (right). Middle row: Coefficients estimated under the host-only model (7) (black is b 0 ; solid red is b Hi ; dashed red is b H1;H2 ). Bottom row: Coefficients estimated under the host-pathogen model (9) (black is b 0 ; solid red is b Hi ; dashed red is b H1;H2 ; blue is b Pi ; dashed blue is b P1;P2 ; purple dashed is b Hi;Pj ). Because epistatic and G 3 G interactions are absent in the phenotypic-difference model, their allelic effects all overlap at 0 and hence are not all visible.
host susceptibility regardless of pathogen strain and is thought to reside on linkage group 4 (Bento et al. 2017). In the absence of protection from the C locus, a second "A locus" is thought to confer resistance to C1 when the dominant allele is present. The regions detected on linkage groups 7 and 9 in the hosts exposed to C1 may only be candidates for such C1-specific resistance. Finally, if the C locus and A locus are both homozygous recessive, a third "B locus" determines susceptibility to the C19 strain. Such a locus would likely be hard to detect in a GWAS due to epistasis between the A, B, and C loci; nevertheless, the regions associated with only C19 resistance (on linkage groups 1 and 7) would be candidates for such a B locus. Overall we conclude that significant SNPs obtained without accounting for parasite type may signal general health status. Against this background, a co-GWAS can help identify genes whose regions are likely critical to host-parasite specificity and variation in host susceptibility.

Discussion
Identifying genes that determine a host's susceptibility to infection is a promising frontier with a wide range of applications, including agriculture and human health. Yet, as our mathematical models demonstrate, association studies focusing on identifying genes in a single species without accounting for the genetics of the interacting species can drastically affect our ability to detect disease genes involved in hostpathogen specificity and limit our ability to account for the genetic variation in disease susceptibility. When the genetic composition of the pathogen population varies over time and/or space, this can further lead to inconsistencies in the results of genetic association studies. Finally, using previ-ously published data on D. magna resistance to its Pasteuria parasite, we illustrate that performing association studies with and without information about pathogen type can be used to distinguish genomic regions affecting general vs. specific resistance to pathogens. Consistent with current models for Daphania-Pasteuria interactions, we identify one region associated with general health as well as candidate regions more directly involved in mediating host-pathogen specificity.
The mathematical analysis presented above focuses on host-pathogen interactions of a specific form, given by Equation 1. Although we have relied on an approximation that assumes weak phenotype differences, i.e., z H 2 z P is small, we postulate that the power to detect strain-specific resistance genes will be increased whenever parasite information is incorporated, even when genes have major effects and phenotypic differences become large. Similarly, the methods used above can be extended to include alternative interaction types such as a "matching-alleles" interaction (see Mathematica notebook in File S1). The expressions for the b coefficients under this interaction model are unruly and difficult to interpret. Using a numerical approach, we observe that once again G 3 G interactions can explain a significant proportion of the variation in susceptibility ( Figure S1 available on Dryad), particularly in highly variable pathogen populations. Unlike the phenotypic-difference and phenotypic-matching models, however, the co-GWAS approach (Equation 8) no longer explains all of the variation in susceptibility and the coefficients vary with pathogen allele frequency. This is a result of higher order interactions not included in our model. Hence, although the co-GWAS approach performs significantly better than a single-species approach, it will not always capture the full genetic basis of infection because of the second order approximation used in Equation 8.
Regardless of the form of the interaction, our analytical models and simulations illustrate that incorporating pathogen genetics into the search for disease genes can greatly increase the explanatory power and repeatability of genome scans. Unfortunately, several logistical and computational challenges preclude applying a full two-species GWAS. Most notably, such a design requires additional genetic data that is not currently available. More specifically, this design requires genotyping all hosts and the pathogens to which they are exposed, not just the host-parasite combinations observed in infected individuals. Future exploration is warranted to determine whether uninfected individuals can simply be treated as unknown with respect to pathogen exposure, and what the consequences of doing so would be for the statistical power of our approach.
The complexity of the two-species design (Equation 8) relative to that of a single-species design (Equation 6) also introduces computational challenges. In addition to requiring larger sample sizes, estimating the effects of the large number of potential G 3 G interactions in a full host-genome by parasite-genome study is computationally unrealistic. In Figure 5 GWAS of D. magna susceptibility. Genetic associations of each SNP with C1 (red • ), C19 (blue n), and overall susceptibility in the complete data set without parasite-type information (green :). Hence, each SNP is represented three times, once for each genome-wide scan. Note that closely linked SNPs often overlap with one another and are not all individually visible. Significant SNPs are shown in color while those below the log-likelihood of two threshold or that are not clustered within a 10-cM region of three other significant SNPs are shown in gray. The 10 linkage groups are delineated by vertical dashed lines.
addition to the large number of pairwise interactions between hosts and pathogens, depending on the form of the interaction, higher order genetic interactions may be necessary to fully explain the variation in susceptibility. These higher order interactions can be particularly important as the number of loci underlying susceptibility, n H and n P ; increases. Although incorporating complete pathogen genetic data may be unfeasible, there often exists some form of pathogen typing, which is largely indicative of the pathogen's genotype and may be sufficient for the purposes of a host genome-wide scan. For example, despite its vast diversity, Hepatitis C virus has been subdivided into seven genotypes (Irvine et al. 1993;Murphy et al. 2015), which may capture much of the relevant variation in host susceptibility.
The Daphnia-Pasteuria data set we analyzed provides a valuable test case for a two-species co-GWAS. In this study, we know exactly to which pathogen type individuals have been exposed, which is generally not known in natural populations. This information may have increased the power of the study to detect loci underlying C1 and C19 susceptibility. Despite this increased power, we chose to use the arguably lenient significance threshold of a log-likelihood score .2 plus clustering of four or more SNPs, as in the original article. Requiring more stringent threshold corrections for multiple sampling, such as a Bonferroni correction, does not yield any significant SNPs. Given the correspondence between the GWAS results and those of functional studies (Bento et al. 2017), however, many of the observed SNPs are arguably not false positives. Using the log-likelihood of two and clustering threshold, we observe fewer genomic regions associated with overall susceptibility when parasite information is not incorporated than when conducting GWAS with exposure to either C1 or C19, despite the complete data set containing twice the number of data points. As an alternative to analyzing the complete data set, we could hold the sample size constant in a combined analysis by randomly choosing whether the host was exposed to C1 or to C19 for each host genotype ( Figure S2 available on Dryad). Interestingly, this "mixed" GWAS not only identifies the same regions on linkage group 4 and 5 but also identifies regions on linkage groups 1, 9, and 10, as were found in the single pathogentype GWAS. The fact that this mixed analysis picks up some of the potentially parasite-specific loci is likely due to randomly sampling an excess of C1-or C19-tested clones. Consistent with this interpretation, exactly which parasite-specific regions are identified varies with the random sample chosen. Nevertheless, as with the complete data set, a comparison between C1, C19, and mixed susceptibility provides additional information about which genes are involved in general health vs. parasite-specific susceptibility.
The results presented here highlight several important avenues for future research. First and foremost, designing genome-wide association methods that allow for G 3 G interactions is critically important, as is the collection of genotypic data from hosts and pathogens. This could be approached, for example, by adapting GWAS designs and analyses used to detect gene-by-environment interactions (Winham and Biernacka 2013). Recognizing the importance of host-pathogen genetic interactions is important for understanding the applicability and limitations of single-species association scans. Developing metrics that capture relevant variability in host and pathogen populations may facilitate the application of these results. Finally, incorporating G 3 G interactions into our association studies will also enable us to understand what mathematical models of host-parasite interactions best predict the genetic interactions observed in natural systems, allowing for further refinements of the models.