Modelling tree shape and structure in viral phylodynamics

Epidemiological models have highlighted the importance of population structure in the transmission dynamics of infectious diseases. Using HIV-1 as an example of a model evolutionary system, we consider how population structure affects the shape and the structure of a viral phylogeny in the absence of strong selection at the population level. For structured populations, the number of lineages as a function of time is insufficient to describe the shape of the phylogeny. We develop deterministic approximations for the dynamics of tips of the phylogeny over evolutionary time, the number of ‘cherries’, tips that share a direct common ancestor, and Sackin's index, a commonly used measure of phylogenetic imbalance or asymmetry. We employ cherries both as a measure of asymmetry of the tree as well as a measure of the association between sequences from different groups. We consider heterogeneity in infectiousness associated with different stages of HIV infection, and in contact rates between groups of individuals. In the absence of selection, we find that population structure may have relatively little impact on the overall asymmetry of a tree, especially when only a small fraction of infected individuals is sampled, but may have marked effects on how sequences from different subpopulations cluster and co-cluster.


Introduction
Viruses, especially RNA viruses such as human immunodeficiency virus type 1 (HIV-1), hepatitis C virus and influenza A virus, may exhibit a great deal of genetic variation at the population level, allowing the reconstruction of viral phylogenies that reflect the past transmission of the virus. The shape of the phylogeny can tell us a great deal about how population processes, such as changes in population size and geographical population structure, and immunological processes, such as selection on the virus to escape immune responses, interact [1]. For example, 'star-like' phylogenies are typical of populations that are growing exponentially, while 'ladder-like' phylogenies are consistent with a model where one variant is replaced by another due to immune escape. This integration of ecological, epidemiological and evolutionary processes has been dubbed 'phylodynamics' [2]. Phylodynamic approaches have been used in hundreds of studies of viruses [3] and have generated important insights into the transmission dynamics of many viral pathogens, such as the spread of HIV in the UK [4,5], as well as the geographical spread of influenza A [6][7][8]. The information obtained by applying phylodynamic models to viral sequence data would be hard, if not impossible, to obtain through more classical epidemiological approaches.
The majority of phylodynamic studies have employed models derived from simple population dynamic models of single species. However, these models may be inappropriate when considering the spread of a virus in a population.
A key quantity in these models is the coalescence rate, the rate at which lineages coalesce in a phylogeny as we go backwards in time from the present. From the coalescence rate, these models generate estimates of effective population size or N e , which is commonly (mis)interpreted as being proportional to the number of infected individuals. Previously, we have demonstrated that the coalescence rate of an infectious disease is related not only to the prevalence, but also to the rate of transmission (i.e. the incidence) [9]. Consequently, the conclusions of previous studies, particularly those that integrate viral sequence data with information on prevalence, may have to be reinterpreted. The use of epidemiological models to underpin viral evolutionary models can lead to results that are more easily interpretable, and permit the inclusion of prior information, such as that on the duration of the infectious period, as well as facilitating the integration of phylogenetic data with other forms of epidemiological data [10].
Recently, there has been increased interest in considering the phylodynamics of structured populations, for example, in the context of studying the spatial spread of viruses from sequence data ('phylogeography') [11]. In addition, there are many other forms of heterogeneity that may be important, including differences by age, duration of infection, contact rate, infectiousness, susceptibility, treatment or vaccination status, etc., depending on the system being studied. When data are available on which subpopulation a viral sequence is associated with, there are a variety of tests that can be used to assess whether there is significant population structure (see Zárate et al. [12] for a comparison of several approaches to within-host HIV population structure). A particular challenge arises when data on the subpopulations are lacking. For example, while acute HIV infection is associated with higher infectiousness, information on the time since infection may not be available; similarly, while there may be differences in contact rates between different subpopulations, many molecular epidemiological studies of HIV do not collect behavioural data. Recently, Leventhal et al. [13] took an inventive approach to this problem; using a phylogeny from the Swiss HIV epidemic, they showed that the phylogeny was significantly more unbalanced than expected from a simple model of random mixing, which they argued could be due to contact structure in the at-risk population. They used a measure of tree balance, Sackin's index [14], and derived an approximation of the expectation of Sackin's index given a transmission network. Of note, they did not present a similar approximation for Sackin's index given the underlying contact network.
In this study, we consider how population structure may affect phylodynamic patterns, using HIV-1 as a model system. We first introduce the notion of tree imbalance or asymmetry. We then review our framework for modelling coalescence using ordinary differential equations, presenting another perspective on our past results, which we extend to consider the dynamics of external branches (or tips or leaves) of the phylogenetic tree. This device allows us to model cherries [15], pairs of tips that share a direct common ancestor, which we use to capture both tree asymmetry as well as population structure. We also derive an approximation to Sackin's index as a complementary measure of tree asymmetry. We apply this approach to determine how (i) higher infectiousness during acute infection and (ii) the presence of a high-risk group with a high contact rate may affect the shape and the structure of the viral phylogeny.

Asymmetry of phylogenetic trees
The most widely used model used in studies of viral phylodynamics is the time-varying coalescent model [16], which considers the genealogical process in a population that changes size in a deterministic fashion according to some relative size function, n(t), where t is time measured in generations, starting with the present and going backwards. Assuming a sample of n individuals taken at time t ¼ 0, and that the sample can be traced back to a single common ancestor with probability one, the dynamics of the number of distinct ancestors of the sample at time t is modelled as a stochastic process fA n ðtÞ; t ! 0g, which starts at A n (0) ¼ n, and moves down in steps of 1 until reaching 1, at which point the sample has been traced back to the common ancestor. In a small time-step h, the transition probabilities are determined by the following: This model assumes that the rate of coalescence between any two lineages is the same for all pairs of lineages, but varies over time. If the rate of coalescences varies between lineages at a given time, then this may have an impact on the shape of the tree [17]. Hence, deviations of the shape of an inferred tree from that expected under the coalescent model suggests that additional biological complexity may need to be considered. There are a number of different measures of tree shape that can be used for this purpose [18,19], but we focus on two specific measures: the number of 'cherries' [15] and Sackin's index [14]. Figure 1 illustrates how these statistics are calculated for two small trees, one symmetric and one asymmetric. Measures of tree shape tend to consider another stochastic process that generates trees, a linear birth or the Yule process [20]; however, as this model gives the same probability distribution on cladograms (i.e. the topology of the tree) [21], results on asymmetry for the Yule process also hold for the coalescent model.
Cherries are defined as the number of tips that share a direct ancestor, which are generated when two tips coalesce. The expected number of cherries in a tree with n taxa under a Yule or coalescent model is n/3 [15]. In an asymmetric tree, tips tend to coalesce with branches deeper in the tree, and there are fewer cherries than expected. We denote the number of cherries as C, which is not to be confused with another measure of asymmetry, Colless' index [22].
Sackin's index is a measure of the topological distance from the tips of the tree to the root and is defined as follows. If the distance d j of a leaf j is the number of internal nodes that need to be traversed when following the path from the root of the tree to a leaf j, then Sackin's index is the sum of all such paths, I S ¼ P j d j . The expectation of Sackin's index for n taxa, EðI S ðnÞÞ, under a Yule or coalescent process [23] is as follows: where c ð0Þ is the polygamma function of order 0, and g e the Euler-Mascheroni constant (% 0:577). For large n, EðI S ðnÞÞ % 2n log(n). As Sackin's index increases with sample size, it is often standardized by dividing by the number of sequences. Although this has a direct biological interpretation-the mean rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120208 root-to-tip distance (in terms of nodes)-we employ a different standardization used by Leventhal et al. [13], which is as follows: Under the null Yule or coalescent model, I S ðnÞ ¼ 0, which allows one to assess deviations from the null model more easily. We calculated these tests for two HIV phylogenies, one from an early clinical trial, ACTG 241 [9,24], and another of group M viral sequences sampled from HIV-infected individuals from the Democratic Republic of Congo [25][26][27]. Both trees show moderate, but statistically significant, evidence of asymmetry (see the electronic supplementary material, figure S1). For both of these datasets, the sequences were sampled at approximately the same time. Although many viral datasets are collected in serial samples, and this can result in more asymmetric trees than if sequences are sampled at a single timepoint (see the electronic supplementary material, figure S2), in order to keep the exposition simple, we will consider sampling at a single timepoint, although the approach taken here can also be extended to serial samples.
The number of cherries and Sackin's index complement each other well, as the number of cherries captures asymmetry in the recent evolutionary past, while Sackin's index captures asymmetry over the entire evolutionary history of the sample, and simulations demonstrate that these statistics are only weakly correlated under the coalescent model (see the electronic supplementary material, figure S3).

Tree shape in a simple model of HIV infection
To investigate how the shape of a viral phylogeny is linked to transmission, we first considered a simple model commonly used to study the spread of HIV among men who have sex with men (for a comparison of the deterministic and stochastic version of this model, see Jacquez & Simon [28]). If S denotes the number of susceptible individuals and I denotes the number of infected individuals, the rates of change of S and I are as follows: Here, b is the per-contact probability of infection, c the contact rate, m represents the natural mortality rate, g denotes the excess mortality caused by infection, and L is the rate of immigration/birth of new susceptibles. The dynamical behaviour of the model depends on the value of the basic reproductive number R 0 ¼ bc=ðm þ gÞ. If  Let A(s) denote the set of lineages that are ancestral to the sample at time s, so that AðsÞ ¼ jAðsÞj. U(s) will denote the set of lineages which are not ancestral to the sample and will have cardinality U(s). Lower-case symbols will denote elements of these sets: a [ A and u [ U. will denote the removal of a lineage. At each internal node of the tree, we denote the types of daughter lineages i and j and the state of the parent k using the notation ði; jÞ ! k. The possible types of transition as we go backwards in time are as follows: rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120208 The above transitions assume that the population sizes for I, A and U are large, such that IðI À 1Þ % I 2 , etc., and we consider sampling with replacement for the coalescence of lineages. The first type of transition, ða; aÞ ! a reflects the decrease in lineages when there is an infection involving two sampled lineages. Infections where neither the source nor the recipient individual are sampled do not affect the number of lineages in the sample, and so we do not need to consider transitions of the form ðu; uÞ ! u for the NLFT. The transition ða; uÞ ! a occurs either when a sampled individual infects another individual but we do not sample the latter individual, or when an unsampled individual infects a sampled individual. These transitions reflect what we have described as an 'invisible' transmission [32,33]. While such transitions do not affect the number of lineages, for structured populations they may result in a change in state of the lineage, and so are important for more complex models, which we will demonstrate later. The removal of lineages, for example by death or recovery of infected individuals, while changing the number of unsampled lineages ( ! u), does not directly affect the number of sampled lineages, as the transitions ! a is not possible; in addition, the transition ða; uÞ ! u is not possible. These transitions suggest the following set of differential equations for the dynamics of A and U: Transitions in this model are more complex than those considered in a simple Wright-Fisher model. Firstly, an infection gives rise to another through transmission, such that there are overlapping 'generations'. Secondly, sampling effects enter into the transition rates. Nevertheless, there is a one-to-one correspondence of the coalescence rate in this epidemiological model with that in standard population genetics models widely used in viral phylodynamic studies. In a haploid Wright-Fisher model, the dynamics of the effective population size N e is in units of generations. The resulting estimates from models fitted to phylogenies where branch lengths are in real time are usually interpreted as N e t, or the 'scaled effective population size', where N e is the 'effective number of infections' and t the generation time. For the model given by equations (3.1)-(3.2), the coalescence rate is the same as a haploid Wright-Fisher model if we define the number of infected individuals I/2 as the 'effective number of infections' and the generation time t ¼ f SI =I ¼ bcS=N (the incidence-to-prevalence ratio [34]), with the 'scaled effective population size' being It/2. The use of the term 'generation time' here, in a population genetics context, should not be confused with epidemiological interpretations of the generation time [35,36], which is defined at the individual level, as the time between the infection time of an infected person, and the infection time of his or her infector, rather than the average time between infections at a given time at the population level. Figure 2a illustrates the dynamics of n ¼ It/2 over time in relation to the number of infected individuals I for the model given by equations (3.1)-(3.2). This demonstrates that n is out of phase with I, and also exhibits differences in the magnitude of fluctuations. We also fitted a 'Bayesian skyride' model [37] to simulated coalescent intervals generated from a forwards-time, stochastic, discrete event version of the model, using a fast approximation that is fitted directly to a phylogenetic tree [38]. Such non-parametric models for n tend to smooth out fluctuations, as well as underestimating n at a given timepoint, as they average n over a time period with a simulation time of 40 years. Simulations of the differential equations were performed using the SIMECOL library [29] in R [30], fitting of the skyline plot used the INLA library [31], while the stochastic simulations were performed using SIMPY v. 1.9.1 in PYTHON (see [3] for more details). Simulations were conditioned on reaching a quasi-equilibrium state, and registered by aligning the peaks of the simulated number of infected individuals to the peak of infected individuals from the ordinary differential equations. Code to perform simulations is available from http://code.google.com/p/ simonfrost. rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120208 as the harmonic mean [39]. Nevertheless, the skyride performs well in identifying the overall trajectory of n.
These results argue that the term 'effective number of infections' is potentially misleading [3], as it implies that the ancestral size function n is directly proportional to the number of infected individuals. This is not generally the case, owing to a time-varying generation time t(t) over the course of an epidemic (figure 2b), which is short during the early stages of an epidemic, and becomes longer as the number of susceptible individuals becomes limiting. However, there are time periods, such as the case of exponential growth, and at endemic equilibrium, where the generation time is constant, and hence n is proportional to the number of infected individuals [3,40].

(b) The number of leaves and cherries
While the distribution of coalescent intervals is sufficient for inference of n under simple models, this is not the case for more realistic models that incorporate heterogeneity. As a prelude to discussing these models, we consider the number of cherries in the homogeneous model. As cherries are generated when two tips coalesce, we consider the dynamics of tips and internal branches separately. L(s) and B(s) will denote the set of tips and internal branches, with cardinality L(s) and B(s), respectively. As before, lower-case symbols will denote elements of these sets. We denote the cumulative number of cherries as C, and consider the following transitions backwards in time: The rationale for this scheme is as follows. When two leaves coalesce, they form a single branch, as well as a cherry. When a leaf and a branch coalesce, this either results in the loss of a leaf, or a loss of a branch and a change of state from a leaf to a branch; both of these occur at the same rate, and result in the same net changes in L and B (hence the factor of two). When two branches coalesce, this results in the loss of a branch. Consideration of the dynamics of tips also allows us to consider the proportion of lineages that cluster with at least one other sequence, a common approach when analysing HIV phylogenies [5], and is related to the concept of an operational taxonomic unit. The proportion of unclustered tips is PðsÞ ¼ LðsÞ=Að0Þ, and the distribution of tip lengths is ÀdPðsÞ=ds. The mean number of taxa per cluster [9], M, is included for completeness. If A(0) sequences are sampled at a single timepoint s ¼ 0, then the initial conditions are L(0) ¼ A(0), B(0) ¼ 0, C(0) ¼ 0 and M(0) ¼ 1. This leads to the following set of differential equations for L, B, C and M: The total number of cherries in a tree is simply the solution of C(s) at the time to the most recent common ancestor (TMRCA). The only subtlety that arises is the calculation of the TMRCA. In a standard coalescent framework, the TMRCA is the time at which the last two lineages coalesce; in an epidemiological model, this is the time at which the first transmission takes place involving two infected individuals ancestral to the sample, which may occur after the first transmission by the first infected individual in the population. We make the approximation that the TMRCA is the time at which A ¼ L þ B ¼ 1.
Theory based on extended Polya urn models [15] has shown that the expected number of cherries in a tree generated by a Yule or coalescent process is n/3, where n is the number of sequences. We considered the dynamics of cherries for the simple HIV model at endemic equilibrium. If we define a constant k ¼ f SI =ðI 2 Þ, then the solution of equations (3.4)-(3.9) for A(s), L(s) and C(s) is as follows: The time to the most recent common ancestor, s MRCA ¼ ðAð0Þ À 1Þ=ðkAð0ÞÞ, is the solution of A(s) ¼ 1 for S, and the total number of cherries, Cðs MRCA Þ ¼ ðAð0Þ=3Þ ð1 À 1=A 3 Þ; for large n, Cðs MRCA Þ % Að0Þ=3, i.e. the total number of cherries in the differential equation model is approximately the same as the mean from a Yule or coalescent process, with only a negligible difference for sample sizes typical of many viral studies, in the order of a hundred or more.
(c) Sackin 0 s index As Sackin 0 s index is the number of internal nodes (including the root) from each tip, summed over all tips, to obtain an approximation for Sackin 0 s index we need to consider the coalescence rate, f SI ðA=IÞ 2 , and the expected change in Sackin 0 s index given a coalescence, which is 2Að0Þ=AðsÞ ¼ 2M, where M is the mean cluster size; the factor of two arises due to coalescent events affecting the counts for two lineages. A differential equation for K(s), the cumulative value of Sackin 0 s index at time S in the past (where K(0) ¼ 0), is as follows: A : ð3:14Þ Kðs MRCA Þ provides an approximation for Sackin 0 s statistic. Considering the simple HIV model at equilibrium, substituting s MRCA ¼ ðAð0Þ À 1Þ=ðkAð0ÞÞ into equation (3.14) gives Kðs MRCA Þ ¼ 2n logðnÞ. Although as the number of sequences tends to infinity, Kðs MRCA Þ ! EðI S ðnÞÞ, the rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120208 difference between Kðs MRCA Þ and EðI S ðnÞÞ (on the order of 5% for sample sizes in the hundreds) is large enough that we standardize Kðs MRCA Þ by 2n logðnÞ rather than 2ðc ð0Þ ðn þ 1Þ þ g e À 1Þ, i.e.

Heterogeneity and tree shape
The model given by equations (3.1)-(3.2) considers only a single type of susceptible and a single type of infected individual. More generally, we can consider models that include heterogeneity between individuals. Examples of such heterogeneity include differences in infectivity at different times since infection, differences between hosts in contact rates, and geographical heterogeneity. Such heterogeneity can have a profound effect on the transmission dynamics. Incorporating heterogeneity in our phylodynamic models presents additional challenges, as we need to consider ancestral lineages for each type of infected individual, and coalescences between lineages of both the same and different types.

(a) The number of lineages as a function of time
We begin by considering the dynamics of the total number of lineages of each type for a two-type system, although these results can easily be extended to more than two types. Considering forwards time, we define a time-varying matrix F(t), comprising elements f ij (t), the rate at which a lineage of type i generates another of type j, and a matrix G(t), comprising elements g ij (t), the rate at which a lineage of type i changes to one of type j. These matrices are used to express the transition rates for changes in the number of ancestral lineages of different types [32], which for a two-type system are as follows: This leads to the following differential equation for the dynamics of A 1 (s), with an analogous equation rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120208 for dA 2 ðsÞ=ds: ð4:1Þ

(b) The number of tips and cherries
Derivation of the dynamics of tips in this two-type system requires us to consider four types of tips, l 11 , l 21 , l 12 and l 22 , based on the (unobserved) state at time S in the past (referred to by the first subscript) and the (observed) initial state at the time of sampling. As for simplicity, we consider sampling at a single timepoint, this is at s ¼ 0. For example, transitions for the dynamics of tips involving l 11 are as follows: transition DL 11 DL 21 DL 12 DL 22 rate Consideration of these transitions leads to the following differential equation for the dynamics of tips L 11 , with analogous expressions for the dynamics of the other tips: ð4:2Þ The first line of equation (4.2) represents coalescence of tips, the second 'invisible' transmissions, which result in a change in state and the third migration events. The fraction of unclustered lineages that are in state j at the tips of the tree, and are in state i at some time s in the past, P ij (s), can be obtained from the above in a similar fashion as in the single-population model.
In this two-type system, there are three types of cherries, which we denote by c ij . The rates of coalescence of different types of tip (l 11 , l 12 , l 21 and l 22 ), and the types of cherry generated are as follows: transition rate cherry It is important to note that we have to consider both lineages as potential 'sources' of infection when considering coalescence between tips of different types, hence the factor of two for ðl 11 ; l 12 Þ ! b 1 and ðl 21 ; l 22 Þ ! b 2 . Consideration of these transitions gives rise to the following differential equations for the dynamics of the number of cherries: : ð4:5Þ The total number of cherries, C ¼ C 11 þ C 12 þ C 22 gives a measure of tree asymmetry, which can be compared against the null of % Að0Þ=3. To facilitate comparison of trees with different numbers of tips, L(0) ¼ A(0), we define a normalized number of cherries, C norm ¼ C=Að0Þ.
(i) The composition of cherries as a measure of clustering Capturing how different types cluster together on a tree, i.e. co-clustering, is difficult, as-except at the tips of the treethe type of a lineage is not directly observable. Previously, we have derived equations for the correlations in numbers of sequences of different types in a cluster [33]. Here, we consider clustering in terms of the composition of different types of cherries, with relatively low values of C 12 being indicative of separation between types. We define the following measure of assortativity, based on that of Newman [41]. We denote a matrix E with elements e ij as follows: The assortativity coefficient, r, is defined as follows, where Under a null panmictic model, r ¼ 0, while for a model where types are completely separated, r ¼ 1. An estimate of r can also be obtained directly from a viral phylogeny.
(c) Sackin 0 s index Extending our approximation for Sackin 0 s index (equation (3.14)) to two subpopulations is relatively straightforward, except now we have to consider three different types of coalescence. When lines of type i and j coalesce, they produce a clade with a mean number of descendants X i ðsÞ=A i ðsÞþ X j ðsÞ=A j ðsÞ, where X i (s) denotes the number of taxa descended from all extant lineages of type i at time s in the past, with P i X i ðsÞ ¼ Að0Þ. Such clades are produced at the rate F ij ðsÞðA i ðsÞ=I i ðsÞÞðA j ðsÞ=I j ðsÞÞ. This leads to the following differential equation for the cumulative Sackin index until time s: In order to aid comparison with the simple model without heterogeneity, we derive a normalized version of k, K ¼ ðK À 2n logðnÞÞ=ð2n logðnÞÞ.
The dynamics of X 1 (s) can be described with the following equation, with an analogous equation for X 2 (s), with initial conditions X 1 ð0Þ ¼ A 1 ð0Þ; X 2 ð0Þ ¼ A 2 ð0Þ: This equation simply captures flow between the states, either by coalescent events (captured by the matrix F) or by 'migration' between states (captured by the matrix G). To verify the deterministic approximations, and to determine the variability in tree shape due to finite sample size, we also simulated trees using an approximation to the coalescent in structured populations developed by Volz [32], which takes the matrices F and G and the numbers of infected individuals, I 1 and I 2 , at different time points as input.

Applications
To determine how structure and sampling affects phylodynamic patterns, in terms of the number of lineages over time, the extent to which sequences cluster and co-cluster, and the extent of tree asymmetry, we now consider two specific models of HIV that incorporate heterogeneity, either in infectiousness over the course of infection or differences between groups in contact rates.

(a) Acute and chronic HIV infection
The infectiousness of HIV-1 is thought to be much higher during acute infection than during chronic infection [42]. Previously, we have analysed models of HIV transmission that include acute and chronic infection [9,32,33]. We recap some of the main results here, as well as extending them to consider more tree statistics. We denote the number of acutely infected individuals by I 1 , and the number of chronically infected individuals by I 2 . We allow acutely infected individuals to have a different per-act probability of infecting a susceptible person, b 1 , which we assume to be higher than the probability for a chronically infected person, i.e. b 1 . b 2 . We assume that acute infection progresses to chronic infection at rate a, and that acutely infected individuals do not suffer any excess mortality due to HIV infection. These generalizations to the simple HIV model result in the following set of differential equations: The matrices F and G for this model are as follows: To investigate the effects of heterogeneity in contact rates between individuals, we considered a model with two rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120208 groups of individuals with different contact rates, c i , with the fraction of contacts made by a person in group i with a person in group j denoted by p ij [43]. A number of assumptions can be made regarding the fraction of contacts of a person in group i with a person in group j, p ij . A common assumption is proportionate mixing, in which the fraction of the contacts of group i with group j is equal to the fraction of the total contacts made by the population that are due to group j, such that p ij ¼ c j N j = P k c k N k . A more general formulation, that allows a wider range of mixing matrices, is the preferred mixing structure described by Jacquez et al. [43], in which a fraction r i of the contacts of group i are reserved for within-group contacts. The elements p ii and p ij (i = j ) under this model are as follows (note that this corrects an error in the term for p ij reported in Jacquez et al. [43]): For both the acute/chronic model, and the differential risk model, the differential equations captured the mean number of cherries, the assortativity coefficient and Sackins index calculated from simulations of multiple trees, at a fraction of the computational burden (see the electronic supplementary material, figures S4 -S6).

(c) Tree shape and structure
We simulated the acute/chronic model and the differential risk model, assuming either proportionate or preferential mixing, for a range of sample fractions, f, from 0.1 to 0.9. Model outputs for the number of cherries, the assortativity coefficient and Sackin's index at a fixed time of sampling are shown in figure 4.
For the acute/chronic model, we considered a range of values for the relative infectiousness of acute and chronic HIV infection, while maintaining the same mean infectiousness over the infection period. Assortativity increased with higher infectiousness of acute infection, in line with our previous results examining the composition of clusters [33]. Although higher infectiousness resulted in more asymmetric trees, this depended on both the sample fraction and the choice of statistic in a nonlinear way. Sackin's index showed the greatest evidence of asymmetry for intermediate values for the relative infectiousness of acute infection and was generally insensitive to sampling fraction. In contrast, when the sample fraction was high, asymmetry, as measured by a low number of cherries, was the greatest for high infectiousness during acute infection.
For proportionate mixing, the differential risk models over a range of contact rates for the high-risk population relative to the low-risk population demonstrated asymmetry similar in magnitude to those of the acute/chronic model, in terms of the number of cherries, but showed less extreme values for Sackin's index. Assortativity was generally low, with small negative values of the assortativity coefficient for greater relative contact rates for the high-risk group. However, although variation in contact rates could result in asymmetric phylogenies under proportionate mixing, this effect was almost completely eliminated when mixing was preferential (r ¼ 0.9), as such population subdivision limits the impact that individuals with a high contact rate can have on the entire viral phylogeny. Also in contrast to proportionate mixing, assortativity was much more marked when mixing was preferential. The assortativity coefficient, r, was relatively insensitive to contact rate variation, being mainly driven by the mixing between the high-and low-risk groups, captured by the parameter r (results not shown), although the assortativity of different types of infected individuals, r, may be much less than the assortativity of different types of all individuals, r, especially for low sample fractions.

Discussion
We have extended our previous differential equation-based framework for modelling the NLFT to consider tree asymmetry and, in the case of structured population models, co-clustering of different states. Of note is that our models generate trajectories of measures of asymmetry and assortativity over evolutionary time, rather than just summary measures over the whole tree. We have also presented examples of how heterogeneity in the susceptible and/or infected individuals can result in different phylodynamic patterns. The two models presented here have a wide range of applications. For example, the model used for acute and chronic HIV infection can also be used to consider a simple form of treatment, where I 1 and I 2 represent the number of untreated and treated individuals, respectively, and a represents the rate of going on treatment, while the model of different risk groups can be used to examine heterosexual spread of HIV, by setting c ii , i ¼ 1,2 to zero, or a spatial model, where 'migration' of infections occurs via transmission between individuals in different geographical areas.
Given the nonlinearities in the system, it is hard to develop scenarios where the impact of a single parameter can be examined. For example, changing the relative rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120208 infectiousness of acute infection changes the whole trajectory of the epidemic, such that sampling at a fixed time since the introduction is not strictly comparable across parameter values. As our models comprise a relatively small number of differential equations, which can be simulated quickly, they are well suited for exploring how tree shape is affected by population structure. In contrast, simulating trees may be extremely time consuming, especially for large numbers of taxa, and large numbers of simulations for a given parameter set may be needed, owing to the high variability in many tree shape statistics.
Our results suggest that in many cases, the level of asymmetry of the tree may be rather insensitive to the underlying population structure. This is not particularly surprising for a number of reasons, including the relatively weak selection of HIV-1 at the population level [2], the averaging of asymmetry over the entire tree, and that the risk among those infected is likely to be higher and less variable among infected individuals than susceptible individuals. Given these considerations, it is somewhat surprising that Leventhal et al. [13] found asymmetry in their analysis of the Swiss HIV epidemic, especially as the overall phylogeny comprised distinct risk groups, which our results suggest generates less, not more asymmetry. This may be due to biases in when each risk group was sampled, and/or the unusually high sampling fraction in this epidemic (30 -40%). Indeed, the three largest transmission clusters, which were more homogeneous in terms of risk (one associated with heterosexual risk/injection drug use and two clusters associated with men who have sex with men) showed much lower asymmetry (I 2 S ,0.5). Our models also show that factors other than contact rate, such as high infectiousness during acute infection, may have a more dramatic impact on asymmetry; while high-risk groups may be at a minority in a population, all infected individuals go through a period of increased infectiousness during acute infection. Moreover, as sequences sampled at different times will generate more asymmetric trees for rapidly evolving pathogens such as HIV-1 (see the electronic , and the relative contact rate in the differential risk model, assuming either (b) proportionate mixing (r ¼ 0) or (c) preferential mixing (r ¼ 0.9). Parameter values for the acute/chronic model are as follows: c ¼ 1, a ¼ 1 8 , g ¼ 1 512 , m ¼ 1 3640 , L ¼ 10 000 3640 , S(0) ¼ 9999, I 1 (0) ¼ 1, I 2 (0) ¼ 0. The infectivity parameters b i were constrained such that b 2 ¼b ðd 1 þ d 2 Þ=ðkd 1 þ d 2 Þ and b 1 ¼ kb 2 , whereb is the mean infectiousness (withb ¼ 0:01), d i the mean duration of stage i and k the fold increase in infectiousness during acute infection. Parameter values for the differential risk model are b ¼ 0.01, c 2 ¼ 1, g ¼ 1 520 , m ¼ 1 3640 , L 1 ¼ 1000 3640 , L 2 ¼ 9000 3640 , S 1 (0) ¼ 999, I 1 (0) ¼ 1, S 2 (0) ¼ 9000, I 2 (0) ¼ 0. The simulation time is 30 years, with weekly timesteps, assuming 52 weeks per year. rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120208 supplementary material, figure S2), measures of asymmetry may be difficult to interpret for serial samples, which are commonplace in HIV-1 phylogenetic studies, and difficult to compare between studies that have different temporal sampling patterns.
The composition of cherries may be highly informative about patterns of mixing between populations, provided that the sample size is sufficient to include representatives from all groups. However, in order to calculate the composition of cherries, we need to specify subpopulations a priori, and this may be difficult to perform, especially for variables such as sexual contact rates. Although, ideally, other data such as behavioural data should be collected in order to identify risk groups, as clustering is also related to contact rates, it may be possible to identify individuals with higher contact rates based on patterns of clustering. However, patterns of clustering have to be interpreted carefully, as differences in clustering may also be driven by differences in the time since infection at which samples are taken ( [33]; electronic supplementary material, figure S7) and by the underlying frequencies of the groups (see the electronic supplementary material, figure S8).
Our simulations assumed a random sample of taxa across all groups. In practice, random sampling of infected individuals may not be feasible, or in some cases it may even be desirable to oversample particular groups. For example, while our model of acute and chronic HIV infection predicts increasing assortativity as the assumed relative infectiousness during acute infection increases, it may be difficult to test this empirically, as generally acutely infected individuals are relatively infrequent, and sampling variation in the assortativity coefficient may be high (see the electronic supplementary material, figures S4-S6). Our framework can accommodate over-or under-sampling of specific groups, although prior information on the size of each group is highly desirable in order to make accurate inferences.
We have focused on developing and simulating phylodynamic models, rather than inferring parameter values of these models from sequence data. As highlighted in our discussion of the simple HIV model, some simple epidemiological models are just special cases of the time-varying coalescent model, for which methods of inference are well established. While the theory presented for structured models can also be used as a basis for inference, full likelihood-based fitting may be computationally intensive, and approximations to the likelihood may be required [32]. The models presented here, which can generate a number of summary measures of phylogenetic structure, can be used as the basis for Approximate Bayesian Computation (ABC) approaches [44], in which parameter values are found that generate simulated data that resemble the observed data. The use of more biologically realistic phylodynamic models can be used not only to determine whether a population deviates from random mixing [13], but also to determine the type of population structure. By linking asymmetry, assortativity and the number of lineages through time, bespoke models of viral phylodynamics may be able to provide rich insights into the dynamics of viral transmission.