Using multiple infection history to infer contact network structure

The structure of host interactions within a population shapes the spread of infectious diseases but contact patterns between hosts are difficult to access. We hypothesised that key properties of these patterns can be inferred by using multiple infections data from individual longitudinal follow-ups. To show this, we simulated multiple infections on a contact network in an unbiased way by implementing a non-Markovian extension of the Gillespie algorithm for a community of parasites spreading on this network. We then analysed the resulting individual infection time series in an original way by introducing the concept of ‘infection barcodes’ to represent the infection history in each host. We find that, depending on infection multiplicity and immunity assumptions, knowledge about the barcode topology makes it possible to recover key properties of the network topology and even of individual nodes. The combination of individual-based simulations and barcode analysis of infection histories opens promising perspectives for the study of infectious disease transmission networks. Significance Statement The way hosts interact with each other is known to shape epidemics spread. However, these interactions are difficult to infer, especially in human populations. Using recent developments in stochastic epidemiological modeling and barcode theory, we show that the diversity of infections each host has undergone over time contains key information about contact network between hosts. This means that longitudinal follow-ups of some individuals in a population can tell us how hosts are in contact with each other. It can also inform us on how connected a particular individual is. This opens new possibilities regarding the use of genetic diversity of infectious diseases in epidemiology.

The structure of host interactions within a population shapes the spread of infectious diseases but contact patterns between hosts are difficult to access. We hypothesised that key properties of these patterns can be inferred by using multiple infections data from individual longitudinal follow-ups. To show this, we simulated multiple infections on a contact network in an unbiased way by implementing a non-Markovian extension of the Gillespie algorithm for a community of parasites spreading on this network. We then analysed the resulting individual infection time series in an original way by introducing the concept of 'infection barcodes' to represent the infection history in each host. We find that, depending on infection multiplicity and immunity assumptions, knowledge about the barcode topology makes it possible to recover key properties of the network topology and even of individual nodes. The combination of individualbased simulations and barcode analysis of infection histories opens promising perspectives for the study of infectious disease transmission networks.
network | epidemiology | coinfections | infection history | barcodes Host populations are often assumed to be 'well-mixed' even though individuals tend to only interact with a small subset of the whole population and these contact patterns between individuals can dramatically affect the way epidemics spread (1)(2)(3). For instance, it was shown during the early phase of the HIV pandemics that the average number of sexual partners but also the variance in the number of partners both increase the basic reproduction number (R0) of sexually transmitted infections (4). Since then, studies have identified how key parameters of the host contact network affect the risk of outbreak (5,6) and the spread of an epidemic (7)(8)(9)(10)(11)(12).
Measuring host contact networks is a lively research topic (13), which ranges from definition issues (defining a 'contact'), to assessing the appropriateness of various types of data. In some cases such as networks connecting farms together (14), interactions between individual units can be well approximated through shipping logs. Other settings, such as wild populations or human populations, are more challenging to analyze. For humans, the field has traditionally relied on self-reported data, but new insights have been provided by airline transportation registries (15) or cell phone data (16). More recently, parasite sequence data was used to infer network properties by analysing phylogenies of infections (17). Note that we use the term parasite in its evolutionary ecology meaning, which encompasses both micro-and macro-parasites. Importantly, this genetic data, which by definition pre-dates new outbreaks, can be used to make relevant predictions (18).
We hypothesize that key properties from host contact networks can be inferred from individual longitudinal data about infection status. Such longitudinal data is classically used in epidemiological studies to measure the odds that a specific event may occur (19), however they are rarely coupled to mathematical models of disease spread.
In our case, we also track multiple parasite strains or species in these individual histories. This is both feasible and highly relevant, assuming that parasites spread on the same network. Indeed, the increasing affordability and power of sequencing strengthens the assertion that most infections are genetically diverse (20). In fact, in the case of genital infections by human papillomaviruses (HPVs), not only do we know that coinfections, i.e. the simultaneous infection by multiple genotypes, are the rule rather than the exception, we also know that they strongly correlate with the number of lifetime sexual partners (21). Another classical example is the distribution of the number of macroparasites per host, which has been used to infer population structure (22). This makes multiple infections an ideal candidate to measure epidemic properties (23).
To show how individual infection histories for multiple parasite strains can inform us about the underlying transmission contact network, we conduct a simulation study, which requires alleviating two obstacles. First, we need to simulate multiple infections on a network. This in itself is not an easy task and few studies have attempted it (12,(24)(25)(26)(27)(28)(29). Second, we need to extract information from longitudinal follow-up data. For this, we take advantage of recent development in stochastic epidemiological modelling and implement a non-Markovian version of the well-known Gillespie algorithm (30). This allows us to make sure a host's infection history is not lost every time it acquires a new strain. It also addresses statistical evidence that for many parasites infection duration does not follow exponential but heavy-tailed distributions (31)(32)(33)(34)(35)(36). Finally, we need to compare infection histories, which we do by using barcode theory from computational topology (37).
We first illustrate the importance of infection histories

Significance Statement
The way hosts interact with each other is known to shape epidemics spread. However, these interactions are difficult to infer, especially in human populations. Using recent developments in stochastic epidemiological modeling and barcode theory, we show that the diversity of infections each host has undergone over time contains key information about contact network between hosts. This means that longitudinal follow-ups of some individuals in a population can tell us how hosts are in contact with each other. It can also inform us on how connected a particular individual is. This opens new possibilities regarding the use of genetic diversity of infectious diseases in epidemiology.
CS and SA conceived the study, CS generated and analyzed the data, CS and SA wrote the manuscript.
The authors declare no conflict of interest.. 1 To whom correspondence should be addressed. E-mail: christian.selinger@ird.fr

D R A F T
when modelling multiple infections. We find that accounting for the documented fact that recovery from infection becomes more likely with increasing age of infection (38)(39)(40)(41) leads to higher prevalence peaks and earlier extinction times than in memory-less models. Then, we simulate individual multi-strain infection histories for epidemiological models with genotypes-specific immunity on random clustered networks to demonstrate the impact of network topology on infection diversity. We refer to 'genotypes' to describe the parasite diversity, but our results apply to different genotypes of the same species or of different species of parasites provided that they are spreading on the same network. Finally, we show that infection barcodes can inform us on the connectivity in the network in several ways. First, similarity matrices between individual infection histories highly correlate with the network adjacency matrices. Second, an individual host's network connectivity can be inferred from its connectivity in the infection barcode matrix.

Memory in multiple infections.
Under the assumption of neutrality and absence of interactions between genotypes, coinfections should intuitively follow the principle of 'first in, first out', i.e. the parasite genotype which is acquired first should also be cleared first from the host. To show that memoryless coinfection processes violate this principle, we simulate epidemics where we sequentially introduce four genotypes on random clustered networks and assume that full genotypespecific immunity is conferred once a host recovered from infection. The model is parameterized such that recovery from infection episodes follows exponential (memory-less) or Weibull (with memory) distribution at unit mean ( Figure 1A). Corresponding hazard rates used as model inputs are constant or increase linearly with the age of infection respectively. We find that the assumption of recovery with memory impacts epidemic outcomes with increased peaks, shorter duration and higher infection multiplicity when compared to the case of memory-less recovery ( Figure 1B). Moreover, sequential introduction of different genotypes leads to further imbalance of infection and recovery events. Figure 1C shows the prevalence time series for 4 genotypes introduced one time unit apart from each other, genotype 1 being introduced first. Because we assume neutrality, their order of introduction should not matter. However, without memory (in red), the earlier a genotype is introduced during the epidemic, the higher prevalence because hosts are more likely to recover but also to create new infections. Those introduced later will initially be less likely to produce recovery or infection events, but lead to sustained epidemics later on, once most of the genotypes introduced first have been cleared and hosts become immune to them. Taking into account the age of infection for recovery (in blue) can alleviate this imbalance such that average epidemic curves and final epidemic sizes are similar between genotypes therefore resembling a scenario where all genotypes are introduced simultaneously (data not shown).

Network topology and multiple infections.
To better understand how the topology of a contact network impacts multiple infection dynamics, we simulate SIR epidemics with simultaneous introduction of four infections by different genotypes on random clustered graphs with distinct summary statistics. First, we consider random clustered graphs with low average but highly over-dispersed degree, and low clustering coefficient. These emulate contact networks with relatively low but variable number of contacts spread out evenly in the network (type 1 in Table 1). With a multiplicity of infection (MOI) of 1.5 and a diversity index of 3 at the peak of the epidemic (Figure 2), simulation results indicate that an average host is singly infected and that three out of the four genotypes circulate at equivalent prevalence. Next, we increase the clustering coefficient while keeping average node degree low, which results in less dispersed degree distributions emulating networks with highly clustered pockets of contacts but generally small number of contacts (type 2 in Table 1). Multiple infections on such networks have very similar MOI patterns but slightly increased and more sustained diversity of genotype prevalence ( Figure 2). Finally, we consider networks with both high degree and high clustering coefficient resulting in higher degree assortativity such that the highest average MOI is approximately 2 during the epidemic peak (type 3 in Table 1). During the initial expansion phase of such epidemics, a typical host is coinfected, and genotype prevalence is more unevenly distributed due to high degree assortativity.   Table 1.

Infection barcodes and network structure.
To infer properties of the host contact network from individual infection histories, we describe each of these through an infection barcode, i.e. a set containing all intervals of infection episodes with different parasite genotypes. In order to measure similarity (or distance) between infection barcodes of any two hosts, we use the length and frequency of overlaps between infection episodes. As detailed in the Methods, we can consider overlaps for matching genotypes only or for any genotypes, the latter being less informative. This allows us to obtain barcode matrices that indicate the similarity (or distance) between any two nodes. These can then be compared to more classical matrices that capture global (shortest path between nodes) or local properties (binary adjacency) of contact networks restricted to infected individuals. Simulation results with four different parasite genotypes show that, if the entire network of infected individuals is known, the barcode similarity matrix highly correlates with local properties of the contact network given by its adjacency matrix ( Figure 3A). In more realistic situations, infected nodes would only be partially observed. Our results show that spatial correlations become less significant as the sampling rate of network nodes decreases but remains significant for sampling proportions of 25% or more. Given the importance of multiple infection histories in this approach, decreasing the number of circulating genotypes from 4 (violet) to 2 (yellow) also decreases inference power. We also investigate spatial correlations between the barcode similarity matrix and global properties such as the shortest path metric. We find that the correlations are highly significant even at a network sampling rate of 10% ( Figure 3B), but suffer similar limitations if the number of genotypes circulating in the epidemic is reduced to 2.
Overall, we found that the barcode matrix built using the genotype-specific similarity index was more informative than that built on the distance function. Correlation coefficients obtained from the Mantel test are as high as 0.3 for comparisons of barcode similary matrices with shortest path matrices regardless of network sampling rates and number of genotypes ( Figure S2). Alternatively, considering barcode distance matrices gives highly significant correlations with the graph adjacency metric, but no significant correlations with shortest path metric, even at high network sampling rates ( Figure S1). B B B B B B B B B B B B B B   Based on the results from multinomial regression models, the host's node degree is predicted from its normalized barcode similarity degree value. The bars on the right show the average node degree frequency from the simulated data on which the statistical model was trained. The higher the node degree in the barcode similarity matrix, the higher its degree in the adjacency matrix.

A A A A A A A A A A A A A A A A
The spatial correlations between barcode similarity indices and both local and global graph properties also translate to the individual level. From the barcode similarity matrix, we calculate each host's connectivity (i.e. degree) in terms of infection history by summing the host's barcode similarity with respect to all other infected hosts and normalizing appropriately (see Methods). We hypothesise that the infection barcode connectivity within the network of infected hosts can inform the degree of connectivity in the contact network. Multinomial regression tests show that the odds of having higher degree in the contact network (i.e. more neighbors) increases as infection barcode connectivity goes up.
Model predictions show that, with increasing barcode similarity degree, hosts are more likely to have higher network degree (Figure 4), indicating that a host's network degree can be inferred from the infection barcode similarity with all infected hosts in the network.

Discussion
Understanding the properties of host contact networks is key to predicting epidemic spread but raises many practical challenges (13). Some phylodynamics studies have shown that some of these properties can be inferred from virus sequence data (17,42). Here, we use individual infection histories to gain insights into contact network structure.
A first obstacle was to simulate epidemics on contact networks, while allowing for coinfections, i.e. the simultaneous infection of hosts by multiple parasites (43). Indeed, a direct implementation of the Gillespie algorithm, which is classically used to perform stochastic simulations based on systems of differential equations, is problematic. Indeed, this method being memory-less, it fails to keep track of the infection history. As we show, this generates results that are at odds with the biology. To address this problem, we implemented a non-Markovian version of the Gillespie algorithm following recent developments in statistics (30). As expected, network topology directly impacted multiple infection dynamics, with increased level of clustering leading to increased parasite strain diversity.
Being able to simulate epidemics of multiple infections in an unbiased way, we encountered a second obstacle, which was comparing complex individual infection histories. To address this issue, we captured these histories using barcodes and compared two infection histories using tools from computational topology (37). We could then show that global properties of the network are correlated with these barcodes, i.e. similarity matrices inferred from the barcodes correlate with matrices inferred from the network adjacency matrix. Furthermore, even individual-centered properties such as a host's degree can also be inferred from infection barcodes.
Overall, we provide a proof of principle that multiple infections and infection history can be used to gain insight into host contact network properties. This is biologically sound, given that infections by multiple parasite genotypes are extremely frequent, and realistic. Indeed, for many human infections, there are longitudinal follow-ups with a detailed record of parasite diversity, one of the clearest examples being human papillomaviruses (44).
There are several directions in which our work could be extended. First, we only reported results obtained on random clustered networks. We also obtained similar results on other types of topologies (e.g. Erdős-Rényi, data not shown) but it would be valuable to know whether infection history is more or less informative depending on the type of network considered and if network comparison can be performed. We also assumed that the contact network was static but in reality, it can be highly dynamical. Recent evidence suggests that these dynamics could be detected in sequence data (45) and it would be interesting to explore this with infection histories.
We assumed the life-cycle of the parasite to be that of an SIR with perfect genoytpe-specific immunity. Dynamics with partial cross-immunity or no immunity can be readily implemented. Following many existing studies, e.g. in phylodynamics, we used a neutrality assumption. However, genotypes are known to interact (46) and multiple infections can be a means to detect these interactions (23,47). In general, exploring richer life-cycles is a promising path for future studies.
In terms of articulating this framework with clinical data, we considered that infection history was based on parasite genotype presence in the host (e.g. PCR detection test) but much more data could be obtained using serologies, i.e. evidence of past presence in the host via antibody detection. An advantage of this approach is that it does not require a detailed longitudinal follow-up. However, the downside is that we then ignore the origin of the infection. Simulation studies could be instrumental in assessing our ability to infer network properties from serological data.
A separate line of future research has to do with the inference per se. One possibility could be to use Approximate Bayesian Computation to get a more precise idea of accuracy of the prediction we can make on key network topology parameters. This could also allow us to compare between different classes of networks, e.g. using random forest algorithms (48).
Finally, with the increased power and decreasing cost of sequencing, it may be possible in the future to have both the information about infection history and the sequence data for multiple pathogens. It would then be worth determining whether we can get the best out of the two types of data, infection history being less precise but also less noisy than phylodynamic inference.

Materials and Methods
Simulation algorithm. We developed an event-driven stochastic model of multiple infections on networks in Python. For the purpose of this simulation study, we considered static, random networks to model contact (i.e. edges with binary weights) between hosts (i.e. nodes). Contact networks were generated using the class of random clustered graphs (49)(50)(51) implemented in Python (52). Given a propensity list of edge degrees and triangle degrees as input, this algorithm generates a random graph with predefined average degree and average clustering coefficient. The clustering coefficient of a graph was defined as the local clustering coefficient (i.e. the degree of a node divided by the number of all possible edges in the node's neighbourhood) averaged over all nodes. Degree dispersion was defined as the ratio of degree variance and degree mean. Degree assortativity, i.e. the propensity for nodes in the network that have many connections to be connected to other nodes with many connections was defined following (53).
Upon network initialization, we randomly seeded outbreaks of one infection per genotype, allowing for up to four genotypes per host. Disease dynamics followed Gillespie's stochastic simulation algorithm (SSA) (54,55) adapted to a non-Markovian setting (56) to incorporate memory-dependent processes. Here in particular, we considered recovery from infection as a process depending on the age of infection. We assumed life-long genotype-specific immunity, i.e. after recovery from an infection with a particular genotype, the host would not be susceptible to this genotype again.Transmission and recovery rates were considered equal for all genotypes and fixed in advance, i.e. genotype frequencies could not vary over time at neutrality without mutation nor migration.
The Gillespie SSA allowed us to simulate disease dynamics by performing regular updates in the values of the rates. For this, at each time point we first created a rate vector {r k } k∈E indexed by the set of all possible events E (recovery from an ongoing infection or acquisition of a new infection by an infectious host in the graph neighborhood). If the node i had spent t i,g time in an infection with genotype g, then the rate of recovery from this infection was given by a Weibull hazard rate function h(t i,g , α, µ) = αµt α−1 i,g with shape parameter α and scale parameter µ. If the node i was currently D R A F T uninfected with genotype g but had a network neighbour infected with g, then the rate of infection was unity. To better handle the computational workload of multiple infections on networks, we stored the infection histories in an interval tree data structure comprised of the times of onset and end of each infection episode as well as the genotype and the host. We refer to this data structure as infection barcodes (see below). Given the rate vector at present, we first drew an exponentially distributed random variate with rate k r k to determine the time increment to the next event. Certain rates r k depended on the infection age, and it was shown in (56) that drawing exponentially distributed random variates with respect to these rates provides good Markovian approximations of non-Markovian processes, granted the number of events is sufficiently large. We then drew a random variate according to the total probability vector {r k / k r k } k∈E to determine the next event.
Depending on the event, we updated the list of possible events. In the case of recovery of host i from infection with genotype g, we wrote the end time of the infection episode to the interval tree, removed the recovered infection from the rate vector and added potential infections with genotypes other than g from the host's network neighborhood. In the case of a new infection of the host i with g, we wrote the time of onset of the infection episode to the interval tree, removed all possible edges from g-infected neighbors of i and added possible infection edges from i to all neighbours that had not been or were not currently infected with g. We then updated the rates vector again and proceeded until the epidemic became extinct.
Unless stated otherwise, we simulated disease transmission with four genotypes introduced simultaneously on the giant component of random clustered graphs with 250 nodes at average degree of 4 and average clustering coefficients of 0.36. For memory-less recovery we set αg = 1, and for simulations with memory we chose αg = 2. For a given parameter set, we seeded a random outbreak at the beginning of each simulation with one infection per genotype and ran 50 stochastic replicates until the disease-free state or a pre-defined time horizon was reached. Epidemic outcomes were reported as average and 95% confidence intervals for equidistant time bins.
The source code, configuration files used for simulations and container files will be publicly available on https://gitlab-bioinfo.ird.fr/ ETE/MultiNet.

Genotype diversity and multiplicity of infection.
We measured genotype diversity during the course of epidemics with N = 4 genotypes by Shannon's diversity index (57) defined as H(t) = exp i p i (t) log p i (t) , where p i (t) is the frequency of infections with genotype i relative to all infections present at time t. The index H ∈ [1,4] is maximised when all genotypes have equal frequency and minimized when only one genotype is present. We emphasize that this index pertains to diversity at the population-level and does not distinguish between a multiple infection within a single host and single infections in several hosts. To highlight the importance of multiple infections, we also considered the multiplicity of infection (MOI) given by the number of genotypes present within infected hosts at a given time. We reported MOI averaged over all infected nodes.
Infection barcodes. We summarized each individual multiple infection history by a barcode, i.e. the set of intervals describing the onset and clearance of infection episodes accumulated within a host during the course of an epidemic. We explicitly included the presence of multiple genotypes in a host's infection history. The notion of barcode first arose in computational topology (58) to succinctly summarize and compare topological properties of metric spaces. Here, we considered the metric space of host barcodes endowed with two complementary notions of distance.
In the first approach, we considered a similarity index (59) such that two hosts with largely overlapping infection episodes tended to be similar to each other, regardless of the respective genotypes. More precisely, a given pair of barcodes between any two hosts A = {A i } and B = {B j } was transformed into a weighted bipartite graph. The nodes of this graph consisted of infection episodes, and the nodes where partitioned according to the two hosts. The edge weight of the graph was given by the overlap length |A i ∩ B j | between the infection episodes constituting the edge (i, j). To obtain the similarity index s 1 (A, B), we calculated the maximum bipartite graph matching of this graph. The similarity index was transformed into a distance by d 1 = −2s 1 + i |A i | + j |B j |. For the second approach, we considered the overlap length of infection episodes between two hosts of a particular genotype g. By summing over all genotypes we obtained the similarity index s 2 = g |Ag ∩ Bg|. This index was transformed into a distance by setting d 2 = −2s 2 + i |A i | + j |B j |. To determine whether similarity between infection histories implies proximity in the transmission network, we compared the metric space of infection barcodes to the metric space of the network, restricted to nodes that had been infected during the course of the epidemic. For the network, we considered two different distance notions, i.e. the discrete metric based on the binary adjacency matrix and the shortest path distance of the graph.
We used two-sided p-values from the Mantel test (60) between the barcode similarity (resp. distance) matrix and the adjacency or shortest path length matrix respectively to assess for each model the correlation between infection barcode and network topology (61).
To assess whether spatial correlations between similarity matrices also translate to local properties such as a host's node degree (i.e. number of network neighbors), we performed a multinomial regression analysis.
where β 0 is the intercept, β 1 is the coefficient for continuous variable of the normalized barcode similarity degree, and β 2 is the coefficient for the categorical variable of Ng = 2 relative to the base level Ng = 4. We evaluated the model for genotype-agnosticŝ 1 and genotypespecificŝ 2 similarity degrees. Significance for each coefficient was assessed by p-values from two-sided Z-tests. We used the Akaike Information Criterion (AIC) to compare with ordered logistic models log P(degree = k) P(degree < k) = β 0 + β 1ŝi + β 2 Ng which yielded similar predicted values at consistently higher AIC (data not shown).