Quantifying transmission dynamics of acute hepatitis C virus infections in a heterogeneous population using sequence data

Opioid substitution and syringes exchange programs have drastically reduced hepatitis C virus (HCV) spread in France but HCV sexual transmission in men having sex with men (MSM) has recently arisen as a significant public health concern. The fact that the virus is transmitting in a heterogeneous population, with different transmission routes, makes prevalence and incidence rates poorly informative. However, additional insights can be gained by analyzing virus phylogenies inferred from dated genetic sequence data. By combining a phylodynamics approach based on Approximate Bayesian Computation (ABC) and an original transmission model, we estimate key epidemiological parameters of an ongoing HCV epidemic among MSMs in Lyon (France). We show that this new epidemic is largely independent of the previously observed non-MSM HCV epidemics and that its doubling time is ten times lower (0.44 years versus 4.37 years). These results have practical implications for HCV control and illustrate the additional information provided by virus genomics in public health.

It is estimated that 71 million people worldwide suffer from chronic hepatitis C virus (HCV) infections 2 [1,2]. The World Health Organisation (WHO) and several countries have issued recommendations 3 towards the 'elimination' of this virus, which they define as an 80% reduction in new chronic 4 infections and a 65% decline in liver mortality by 2030 [2]. HIV-HCV coinfected patients are targeted 5 with priority because of the shared transmission routes between the two viruses [3] and because of 6 the increased virulence of HCV in coinfections [4][5][6]. Successful harm reduction interventions, such 7 as needle-syringe exchange and opiate substitution programs, as well as a high level of enrolment 8 into care programs for HIV-infected patients, have led to a drastic drop in the prevalence of active 9 HCV infections in HIV-HCV coinfected patients in several European countries during the recent 10 years [7-10]. Unfortunately, this elimination goal is challenged by the emergence of HCV sexual 11 transmission, especially among men having sex with men (MSM). This trend is reported to be 12 driven by unprotected sex, drug use in the context of sex ('chemsex'), and potentially traumatic 13 practices such as fisting [11][12][13]. The epidemiology of HCV infection in the Dat'AIDS cohort has 14 been extensively described from 2000 to 2016 [14][15][16]. The incidence of acute HCV infection has been 15 estimated among HIV-infected MSM between 2012 and 2016, among HIV-negative MSM enrolled in 16 PrEP between in 2016-2017 [13] and among HIV-infected and HIV-negative MSMs from 2014 to 17 2017 [17]. In the area of Lyon (France), HCV incidence has been shown to increase concomitantly 18 with a shift in the profile of infected hosts [17]. Understanding and quantifying this recent increase 19 is the main goal of this study. 20 Several modelling studies have highlighted the difficulty to control the spread of HCV infections in 21 HIV-infected MSMs in the absence of harm reduction interventions [12,18]. Furthermore, we recently 22 described the spread of HCV from HIV-infected to HIV-negative MSMs, using HIV pre-exposure 23 prophylaxis (PrEP) or not, through shared high-risk practices [17]. More generally, an alarming 24 incidence of acute HCV infections in both HIV-infected and PrEP-using MSMs was reported in 25 France in 2016-2017 [13]. Additionally, while PrEP-using MSMs are regularly screened for HCV, 26 those who are HIV-negative and do not use PrEP may remain undiagnosed and untreated for years. 27 In general, we know little about the population size and practices of HIV-negative MSM who do not 28 use PrEP. All these epidemiological events could jeopardize the goal of HCV elimination by creating 29 a large pool of infected and undiagnosed patients, which could fuel new infections in intersecting 30 populations. Furthermore, the epidemiological dynamics of HCV infection have mostly been studied 31 in intravenous drug users (IDU) [19][20][21][22] and the general population [23,24]. Results from these 32 populations are not easily transferable to other populations, which calls for a better understanding 33 of the epidemiological characteristics of HCV sexual transmission in MSM. 34 Given the lack of knowledge about the focal population driving the increase in HCV incidence, 35 we analyse virus sequence data with phylodynamics methods. This research field has been blooming 36 over the last decade and hypothesizes that the way rapidly evolving viruses spread leaves 'footprints' 37 in their genomes [25][26][27] Assuming an epidemiological transmission model with two host types, 'classical' and 'new' (see 50 the Methods), we use dated virus sequences to estimate the date of onset of the HCV epidemics 51 in 'classical' and 'new' hosts, the level of mixing between hosts types, and, for each host type, the 52 duration of the infectious period and the effective reproduction ratio (i.e. the number of secondary 53 infections, [34]). To validate our results we performed a parametric bootstrap analysis, we tested 54 the sensitivity of the method to differences in sampling proportions between the two types of hosts. 55 We also tested the sensitivity of the method to phylogenetic reconstruction uncertainty, and we 56 performed a cross-validation analysis to explore the robustness of our inference framework. We find 57 that the doubling time of the epidemics is one order of magnitude lower in 'new' than in 'classical' 58 hosts, therefore emphasising the urgent need for public health action.

60
The phylogeny inferred from the dated virus sequences shows that 'new' hosts (in red) tend to be 61 grouped in clades ( Figure 1). This pattern suggests a high degree of assortativity in the epidemics 62 (i.e. hosts tends to infect hosts from the same type). The ABC phylodynamics approach allows us 63 to go beyond a visual description and to quantify several epidemiological parameters.

64
As for any Bayesian inference method, we need to assume a prior distribution for each parameter. 65 These priors, shown in grey in Figure 2, are voluntarily designed to be large and uniformly distributed 66 to be as little informative as possible. One exception is the date of onset of the epidemics, for which 67 we use the output of the phylogenetic analysis conducted in Beast (see the Methods) as a prior. We 68 also assume the date of the 'new' hosts epidemics to be after 1997 based on epidemiological data.

69
The inference method converges towards posterior distributions for each parameter, which 70 are shown in red in Figure 2. The estimate for the origin of the epidemic in 'classical' hosts is 71 1957.47 [1948.61; 1961.96] (numbers in brackets indicate the 95% Highest Posterior Density, or 72 HPD). For the 'new' host type, we were not able to estimate when the epidemic (t 2 ) has started. 73 We find the level of assortativity between host types to be high for 'classical' (a 1   hosts, before 1997 t  Figure S2. is also slightly negatively correlated to γ 1 , which 104 most likely comes from the fact that for a given R 0 , epidemics with a longer infection duration have 105 a lower doubling time and therefore a weaker epidemiological impact. Overall, these correlations do 106 not affect our main results, especially the pronounced difference in infection periods (γ 1 and γ 2 ).

107
To validate these results, we performed a goodness-of-fit test by simulating phylogenies using the 108 resulting posterior distributions to determine whether these are similar to the target dataset (see the 109 Methods). In Figure 3, we see that the target data in red, i.e. the projection of the observed summary 110 statistics from the phylogeny shown in Figure  test.

118
To further explore the robustness of our inference method, we use simulated data to perform a 119 'leave one out' cross-validation (see the Methods). As shown in Supplementary Figure S5, the mean 120 relative error made for each parameter inference is limited and comparable to what was found using 121 a simpler SIR model [33]. One exception is for the 'new' hosts' level of assortativity (a 2 ). This is 122 likely due to the poor signal given the small size of the observed phylogeny.

123
A potential issue is that the sampling rate of 'new' hosts may be higher than that of 'classical' 124 hosts. To explore the effect of such sampling biases on the accuracy of our results, we sub-sampled 125 the 'new' hosts population by pruning the target phylogeny, i.e. randomly removing 50% of the 'new' 126 hosts' tips. In Supplementary Figure ?? show that the estimates from our ABC method are qualitatively similar for all these trees. with that respect, next-generation sequence (NGS) data could be particularly informative [38][39][40]. 156 Some potential limitations of the study are related to the sampling scheme, the assessment of 157 the host type, and the transmission model. Regarding the sampling, the proportion of infected 158 'new' host that is sampled is unknown but could be high. For the 'classical' hosts, we selected 159 a representative subset of the patients detected in the area but this sampling is likely to be low. 160 However, the effect of underestimating sampling for the new epidemics would be to underestimate its 161 spread, which is already faster than the classical epidemics. When running the analyses on different 162 phylogenies with half of the 'new' hosts sequences, we find results similar to those obtained with the 163 whole phylogeny, suggesting that our ABC framework is partly robust to sampling biases. In general, 164 implementing a more realistic sampling scheme in the model would be possible but it would require 165 a more detailed model and more data to avoid identifiability issues. Regarding assigning hosts to 166 one of the two types, this was performed by clinicians independently of the sequence data. resistance mutations on the R 0 of HIV strains [42]. Both of these are implemented in Beast2. We ran 184 an analysis using the BEAST 2 package bdmm with our data. We were unable to conclude anything 185 from this analysis.However, this is probably due to difficulties in estimating both evolutionary 186 and epidemiological parameters, when in this ABC inference study we inferred epidemiological 187 parameters using a fixed phylogeny.

188
Overall, we show that our ABC approach, which we validated for simple SIR epidemiological 189  Interval (1,10) new hosts, we found a similar estimate (1960) and the same 95% HPD of [1942; 1975], which we 229 used as a prior distribution to estimate the origin of the classical hosts t 0 (Table 1).

230
Epidemiological model and simulations 231 We assume a Birth-Death model with two hosts types (Supplementary Figure S1) with 'classical' 232 hosts (numbered 1) and new hosts (numbered 2). This model is described by the following system 233 of ordinary differential equations (ODEs): In the model, transmission events are possible within each type of hosts and between the two types 235 of hosts at a transmission rate β. Parameter ν corresponds to the transmission rate differential 236 between classical and new hosts. Individuals can be 'removed' at a rate γ 1 from an infectious 237 compartment (I 1 or I 2 ) via infection clearance, host death or change in host behaviour (e.g. condom 238 use). The assortativity between host types, which can be seen as the percentage of transmissions 239 that occur with hosts from the same type, is captured by parameter a i .

240
The effective reproduction number (denoted R t ) is the number of secondary cases caused by an 241 infectious individual in a fully susceptible host population [34]. We seek to infer the R t from the 242 classical epidemic, denoted R (1) and defined by R (1) = β/γ 1 , as well as the R t of the new epidemic, 243 denoted R (2) and defined by R (2) = νβ/γ 2 = νR (1) γ 1 /γ 2 .

244
The doubling time of an epidemic (t D ) corresponds to the time required for the number of 245 infected hosts to double in size. It is usually estimated in the early stage of an epidemic when 246 epidemic growth can assumed to be exponential. To calculate it, we assume perfect assortativity 247 (a 1 = a 2 = 1) and approximate the initial exponential growth rate by β − γ 1 for 'classical' hosts and 248 νβ − γ 2 for 'new' hosts. Following [45], we obtain t The prior distributions used are summarized in Table 1 and shown in Figure 2. Given the phylogeny 256 structure suggesting a high degree of assortativity, we assume the assortativity parameters, a 1 and 257 a 2 , to be higher than 50%. For the prior distribution of parameter ν, we combined a uniform 258 distribution from 0 to 1 with a uniform distribution from 1 to 10. This was done to ensure that the 259 probability to have ν < 1 is equal to the probability to have ν > 1. no modification of the phylogeny (if one of the lineages is not sampled). 274 We implicitly assume that the sampling rate is low, which is consistent with the limited number 275 of sequences in the dataset. We also assume that the virus can still be transmitted after sampling. 276 We simulate 60, 000 phylogenies from known parameter sets drawn in the prior distributions 277 shown in Table 1 the RPANDA R package [48]. 286 We also compute new summary statistics to extract information regarding the heterogeneity of the population, the assortativity, and the difference between the two R. To do so, we annotate each internal node by associating it with a probability to be in a particular state (here the host type, 'classical' or 'new'). We assume that this probability is given by the ratio P (Y ) = number of descendent leaves labelled Y number of descendent leaves (2) where Y is a state (or host type). Each node is therefore annotated with n ratios, n being the 287 number of possible states. Since in our case n = 2, we only follow one of the labels and use the 288 mean and the variance of the distribution of the ratios (one for each node) as summary statistics. 289 In a phylogeny, cherries are pairs of leaves that are adjacent to a common ancestor. There are 290 n(n + 1)/2 categories of cherries. Here, we compute the proportion of homogeneous cherries for each 291 label and the proportion of heterogeneous cherries. We also consider pitchforks, which we define as 292 a cherry and a leaf adjacent to a common ancestor, and introduce three categories: homogeneous 293 pitchforks, pitchforks whose cherries are homogeneous for a label and whose leaf is labelled with 294 another trait, and pitchforks whose cherries are heterogeneous.

295
The Lineage-Through-Time (LTT) plot displays the number of lineages of a phylogeny over time. 296 In this plot, the number of lineages is incremented by one every time there is a new branch in the 297 phylogeny and is decreased by one every time there is a new leaf in the phylogeny. We use the ratios 298 defined for each internal node to build an LTT plot for each label type, which we refer to as 'LTT 299 label plot'. After each branching event in phylogeny, we increment the number of lineages by the 300 value of the ratio of the internal node for the given label. This number of lineages is decreased by 301 one every time there is a leaf in the phylogeny. In the end, we obtain n = 2 LTT label plots.

302
Finally, for each label, we compute some of our branch lengths summary statistics on homogeneous 303 clades and heterogeneous clades present in the phylogeny. Homogeneous clades are defined by 304 their root having a ratio of 1 for one type of label and their size being greater than N min . For 305 heterogeneous clades, we keep the size criterion and impose that the ratio is smaller than 1 but 306 greater than a threshold . After preliminary analyses, we set N min = 4 leaves and = 0.7. We then 307 obtain a set of homogeneous clades and a set of heterogeneous clades, the branch lengths of which 308 we pool into two sets to compute the summary statistics of heterogeneous and homogeneous clades. 309 Note that we always select the largest clade, for both homogeneous and heterogeneous cases, to 310 avoid redundancy.

311
Regression-ABC 312 We first measure multicollinearity between summary statistics using variance inflation factors (VIF). 313 Each summary statistic is kept if its VIF value is lower than 10. This stepwise VIF test leads to the 314 selection of 101 summary statistics out of 330. 315 We then use the abc function from the abc R package [49] to infer posterior distributions generated 316 using only the rejection step. Finally, we perform linear adjustment using an elastic net regression. 317 The abc function performs a classical one-step rejection algorithm [32] using a tolerance parameter 318 P δ , which represents a percentile of the simulations that are close to the target. To compute the 319 distance between a simulation and the target, we use the Euclidian distance between normalized 320 simulated vectors of summary statistics and the normalized target vector.

321
Before linear adjustment, the abc function performs smooth weighting using an Epanechnikov 322 kernel [32]. Then, using the glmnet package in R, we implement an elastic-net (EN) adjustment, 323 which balances the Ridge and the LASSO regression penalties [50]. Since the EN performs a linear 324 regression, it is not subject to the risk of over-fitting that may occur for non-linear regressions 325 (e.g. when using neural networks, support vector machines or random forests).

326
In the end, we obtain posterior distributions for t 0 , t 2 , a 1 , a 2 , ν, γ 1 , γ 2 , R (1),t1 and R (1),t2 using 327 our ABC-EN regression model with P δ = 0.05. informative, we expect the target data to be contained in the envelope. This analysis was performed 336 either on the posterior distribution, or on a uniform distribution based on the 95% HPD posterior 337 distribution of each parameter, the latter being less informative.

338
To assess the robustness of our ABC-EN method to infer epidemiological parameters of our BD model, we also perform a 'leave-one-out' cross-validation as in [33]. This consists in inferring posterior distributions of the parameters from one simulated phylogeny, assumed to be the target phylogeny, using the ABC-EN method with the remaining 59, 999 simulated phylogenies. We run the cross-validation 100 times with 100 different target phylogenies. We consider three parameter distributions θ: the prior distribution, the prior distribution reduced by the feasibility of the simulations and the ABC inferred posterior distribution. For each of these parameter distributions, we measure the median and compute, for each simulation scenario, the mean relative error (MRE) such as: where Θ is the true value.