Comparing mutational pathways to lopinavir resistance in HIV-1 subtypes B versus C

Although combination antiretroviral therapies seem to be effective at controlling HIV-1 infections regardless of the viral subtype, there is increasing evidence for subtype-specific drug resistance mutations. The order and rates at which resistance mutations accumulate in different subtypes also remain poorly understood. Most of this knowledge is derived from studies of subtype B genotypes, despite subtype B not being the most abundant subtype worldwide. Here, we present a methodology for the comparison of mutational networks in different HIV-1 subtypes, based on Hidden Conjunctive Bayesian Networks (H-CBN), a probabilistic model for inferring mutational networks from cross-sectional genotype data. We introduce a Monte Carlo sampling scheme for learning H-CBN models for a larger number of resistance mutations and develop a statistical test to assess differences in the inferred mutational networks between two groups. We apply this method to infer the temporal progression of mutations conferring resistance to the protease inhibitor lopinavir in a large cross-sectional cohort of HIV-1 subtype C genotypes from South Africa, as well as to a data set of subtype B genotypes obtained from the Stanford HIV Drug Resistance Database and the Swiss HIV Cohort Study. We find strong support for different initial mutational events in the protease, namely at residue 46 in subtype B and at residue 82 in subtype C. The inferred mutational networks for subtype B versus C are significantly different, sharing only five constraints on the order of accumulating mutations, with the mutation at residue 54 as the parental event. The results also suggest that mutations can accumulate along various alternative paths within subtypes, as opposed to a unique total temporal ordering. Beyond HIV drug resistance, the statistical methodology is applicable more generally for the comparison of inferred mutational networks between any two groups.

The emergence of drug-resistant viral strains, a process driven by the evolutionary escape dynamics of HIV-1, limits the success of antiretroviral therapies. Today, HIV-1 infections are clinically manageable by using two or more antiretroviral drugs together, a treatment strategy known as combination antiretroviral therapy (cART) [1]. However, when viral replication is inadequately suppressed, the virus may acquire mutations which confer resistance to cART, and the regimen must be replaced [2]. A better understanding of the underlying evolutionary process leading to resistance is believed to be crucial for predicting therapy outcome [3-5] and designing effective therapy sequences [6].

Mutational pathways of HIV-1 under the selective pressure of several antiretroviral drugs have been studied by sequencing the viral genome derived from patients over the course of treatment [7-12]. However, such longitudinal data are not available for most antiretroviral therapies. To leverage information from large cohorts and cross-sectional studies, different statistical models have been proposed to investigate mutational pathways leading to drug resistance. These approaches include several probabilistic graphical models, such as Markov processes [13]; a Markov model incorporating information from phylogenetic trees [4]; mutagenetic trees [3,14]; Bayesian networks [15-19]; discrete and continuous-time Conjunctive Bayesian Networks (CBN) [20,21]; and Suppes-Bayes Causal Networks (SBCN) [22]; as well as a Cox proportional-hazards model used to identify pairs of resistance mutations in which one mutation alters the hazard of the other [6].

Most of the aforementioned methods have been applied to study the accumulation of drug resistance mutations in HIV-1 subtype B infections. As an exception, Deforche et al. [15,16] combined observations from various subtypes to investigate dependencies among resistance mutations and polymorphisms using Bayesian networks. The inferred network was used to explain the lower prevalence of protease mutation 30N in subtypes G and A, as compared to subtype B, through an interaction with the polymorphic locus 89L/M. Indeed, there is increasing evidence of differences in mutation profiles and evolutionary rates among subtypes [23-29], but the implications for the order of accumulating mutations have not been systematically addressed. Here, we investigate the rate and order of accumulation of drug resistance mutations in different HIV-1 subtypes. Specifically, we compare mutational pathways to lopinavir resistance in HIV-1 subtypes B versus C. Although HIV-1 subtype B is the best studied and most prevalent subtype in Europe and North America, subtype C alone accounts for nearly half of all HIV-1 infections worldwide.

First, Beerenwinkel et al. [21] used a mixture model with two components to distinguish signal from noise. This model has been applied to learn mutational pathways in HIV under different selective pressures. The data sets originally analyzed included at most nine resistance mutations, but Montazeri et al. [32] presented a Monte Carlo expectation-maximization algorithm for parameter estimation of the mixture model with hundreds of mutations. The mixture error model has, however, several limitations. Every genotype that violates the ordering constraints is assumed to occur with equal probability, regardless of, e.g., the number of violations. Moreover, as the number of mutations increases, the chance of obtaining an error-free genotype decreases rapidly. For instance, with a 1% per-locus error rate and 64 mutations, we expect only around 53% of the genotypes to be correct. The mutational network is, however, inferred exclusively from the portion of the data assigned to the signal component of the mixture model, which can quickly result in a large portion of the data being discarded.

Here, we take advantage of the improved error model of the H-CBN, but address its limitation regarding scalability in the number of mutations by employing an approximation scheme for the estimation of model parameters. We assess the performance of our method on simulated data and compare it to the original H-CBN method. Furthermore, we incorporate an adaptive simulated annealing algorithm to infer the maximum likelihood mutational network from the data, including different moves to explore the discrete space of networks. The resulting model and inference methods, called H-CBN2, are implemented as part of the MC-CBN R-package available at https://github.com/cbg-ethz/MC-CBN.

We use the H-CBN2 method to infer evolutionary pathways to lopinavir resistance in HIV-1 subtypes B and C, confirming previous knowledge on frequently observed patterns of resistance-conferring mutations in response to lopinavir treatment [33,34]. We also devise a statistical test to assess the similarity between two CBN models, which is available at https://github.com/cbg-ethz/H-CBN2-comparison-test. When applied to subtypes B versus C, we find significant differences in their mutational pathways.

We first recapitulate the probabilistic graphical model underlying this work, the H-CBN. CBNs are probabilistic graphical models in which a directed acyclic graph (DAG) represents the order in which genetic events may accumulate [20]. In the CT-CBN, the time between genetic events is modeled by independent exponential distributions [21].

The H-CBN extends the CT-CBN model by introducing hidden variables to model the error-prone observational process [31].

Formally, the CT-CBN is defined by a partially ordered set (poset) of genetic events, or mutations, and a rate for each mutation to occur. A poset (P, ≺) consists of a set P of size p = |P| and a binary relation ≺. The relation l ≺ k indicates that mutation l must take place before k. Further, a relation l ≺ k is a cover relation if l ≺ z ≺ k implies z = l or z = k. Drawing a directed edge from node l to node k for every cover relation l ≺ k yields a DAG which is transitively reduced and uniquely represents the poset (Fig 1A). It is therefore sufficient to consider transitively reduced DAGs only.

A genotype is a subset of genetic events of P, represented by a binary vector x = (x_1, ..., x_p), where x_j = 1 indicates that mutation j has occurred. A genotype x is called compatible with the poset P if (x_l, x_k) ≠ (0, 1) for all cover relations l ≺ k. The collection of all genotypes compatible with P is the space of all feasible mutational patterns, and it is denoted by J(P) (Fig 1B).

Fig 1. A partially ordered set of p = 4 mutations. (A) The hidden variable T_j is the waiting time to mutation j, and edges between mutation times encode the temporal ordering constraints among mutations. The hidden variable X_j represents the true binary mutation status, whereas Y_j denotes the corresponding error-prone binary observation. (B) The genotype lattice J(P) consists of eight genotypes compatible with the poset. The lattice encodes all possible pathways from the wild-type (0,0,0,0) to the fully mutated type (1,1,1,1).
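As a concrete illustration of genotype compatibility, the following sketch enumerates a genotype lattice J(P) by filtering all binary vectors against the cover relations. The poset here is a hypothetical example of our own (it differs from the one in Fig 1), and all names are illustrative:

```python
from itertools import product

# Hypothetical poset over p = 4 mutations, given as a dict mapping each
# mutation to the set of its parents (cover relations point parent -> child).
parents = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}}

def compatible(x, parents):
    """A genotype x is compatible with the poset if no mutation is present
    while one of its parents is absent, i.e. (x_l, x_k) != (0, 1) for
    every cover relation l < k."""
    return all(x[l] for k, ps in parents.items() if x[k] for l in ps)

# The genotype lattice J(P): all binary vectors compatible with the poset.
lattice = [x for x in product((0, 1), repeat=4) if compatible(x, parents)]
```

For this toy poset the lattice contains six of the sixteen binary vectors; e.g. (0,1,0,0) is excluded because mutation 1 cannot occur before its parent 0.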
The waiting time to each mutation j is represented by a random variable T_j. Their joint distribution is defined recursively as

T_j = Z_j + max_{u ∈ pa(j)} T_u,

where pa(j) denotes the set of parents of j in the DAG, i.e., the set of mutations which precede mutation j. The random variable Z_j is exponentially distributed with rate λ_j and accounts for the time elapsed for generating and fixating mutation j after its predecessors have occurred. The probability density of T_j conditioned on the times to mutation of its parents pa(j) is defined as

f(t_j | (t_u)_{u ∈ pa(j)}; λ_j) = λ_j exp(−λ_j (t_j − max_{u ∈ pa(j)} t_u)) · I(t_j > max_{u ∈ pa(j)} t_u),

where the indicator function I encodes the ordering constraints of the poset. That is, the density function is zero if a mutation occurs before any of its predecessors.

The time at which every individual mutation emerges is generally unknown. Instead, patients are monitored with certain regularity, and oftentimes, when the viral load increases, the virus population is sequenced. The sampling time is generally also unknown and typically differs among patients. To account for this uncertainty, an exponentially distributed random variable T_s ∼ Exp(λ_s) is introduced. Hence, the observed data are censored, and a mutation j has occurred if and only if its waiting time t_j was smaller than the sampling time t_s, i.e., x_j = 1 if t_j < t_s and x_j = 0 otherwise. The model is not identifiable as long as the rate λ_s is unknown. Therefore, unless known, this scaling factor is set to λ_s = 1 [21,31].

There is another hidden process, namely the generation of viral genotype data. To account for false positives and false negatives, a variable Y is introduced in the H-CBN model to denote the observed genotype, an error-bearing version of the true genotype X [31] (Fig 1A). Assuming errors are independent and identically distributed across mutations, the probability of observing genotype Y given the true underlying genotype X is

Pr(Y = y | X = x; ε) = ε^{d_H(x,y)} (1 − ε)^{p − d_H(x,y)},

where ε is the per-locus error probability and d_H is the Hamming distance.
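This error model is straightforward to evaluate; a minimal sketch (the function name is our own):

```python
def obs_prob(y, x, eps):
    """Probability of observing genotype y given true genotype x under the
    i.i.d. per-locus error model: eps**d * (1 - eps)**(p - d), where d is
    the Hamming distance between x and y and p the number of loci."""
    p = len(x)
    d = sum(a != b for a, b in zip(x, y))  # Hamming distance d_H(x, y)
    return eps**d * (1 - eps)**(p - d)
```

For a fixed true genotype x, these probabilities sum to one over all 2^p possible observations y, since each locus is flipped independently with probability eps.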
The probability of the true genotype X is defined in terms of the waiting times as discussed above,

Pr(X = x; λ, P) = ∫ I(z, z_s ⊢ x) ∏_{j=1}^{p} f_Z(z_j; λ_j) f(z_s; λ_s) dz,

where the indicator function I encodes the mutation times that can give rise to genotype x, written z, z_s ⊢ x, with z_j = t_j − max_{u ∈ pa(j)} t_u (j = 1, ..., p) and f_Z(z_j; λ_j) = f(t_j | (t_u)_{u ∈ pa(j)}; λ_j). Henceforth, we also set z = (z_1, ..., z_p, z_s).

Owing to the censoring of mutation times and the unobserved true genotypes, the Expectation Maximization (EM) algorithm has been previously used to obtain maximum likelihood estimates of the model parameters ε and λ_j, j = 1, ..., p [31]. To address the limitation on the scalability in the number of mutations, we develop a Monte Carlo Expectation Maximization (MCEM) algorithm.

In the expectation step (E step) of the MCEM algorithm, we estimate the expected value of the complete-data log-likelihood ℓ_{X,Z,Y}(λ, ε) with respect to the current conditional distribution of the hidden data (i.e., the unobserved true genotypes X = (X^(1), ..., X^(N)) and mutation times Z = (Z^(1), ..., Z^(N))), given the observed genotypes Y and the current estimates of the parameters, λ^(k) and ε^(k), where k denotes the current MCEM iteration. Assuming independent observations, the complete-data log-likelihood decomposes into a sum over patients. According to Bayes' theorem,

Pr(x, z | y; λ^(k), ε^(k)) = Pr(y | x; ε^(k)) Pr(x, z; λ^(k)) / Pr(y; λ^(k), ε^(k)).

We denote the numerator by A^(k)(x, y, z) to obtain the expected complete-data log-likelihood. For small H-CBN models, this expectation has been computed by decomposing it into a sum of integrals over all possible maximal chains in the genotype lattice [21,31]. However, the number of maximal chains is p! in the worst case, where p is the number of mutations. Moreover, the summation over all possible genotypes in J(P) is bounded by the total number of unobserved true (binary) genotypes, 2^p. For moderate to large numbers of mutations, the exact computation of the expected value thus becomes computationally infeasible. To overcome this limitation, we approximate the expected value (11) using importance sampling: we generate L samples of the unobserved true genotypes x and the mutation times z from a proposal distribution Q(x, z) and weight each sample l proportionally to A^(k)(x^(l), y, z^(l)) / Q(x^(l), z^(l)). Intuitively, we would like to draw samples from the important region, e.g., samples that are likely to have given rise to the observed data. We use two types of importance sampling schemes, which we refer to as forward and backward sampling, and implement and compare several variations of them (see next subsections).

In the maximization step (M step), we are concerned with maximizing Eq. (11) with respect to the parameters ε and λ_j, j = 1, ..., p. The maximum likelihood (ML) estimate ε̂ of the error rate is found from the conditional expectation of the sufficient statistic d_H(X, Y) obtained in the E step,

ε̂ = (1 / (Np)) Σ_{i=1}^{N} E[d_H(X^(i), y^(i)) | y^(i)].

Similarly, the ML estimates of the rate parameters λ̂_j are obtained from the conditional expectations of the waiting times Z_j,

λ̂_j = N / Σ_{i=1}^{N} E[Z_j^(i) | y^(i)].

Forward sampling

Assume the rate parameters λ and the poset P are known. We generate a candidate error-free genotype x by sampling the mutation and sampling times z = (z_1, ..., z_p, z_s) from the corresponding exponential distributions, z_j ∼ Exp(λ_j) and z_s ∼ Exp(λ_s). To determine the waiting times t = (t_1, ..., t_p), we set t_j = z_j + max_{u ∈ pa(j)} t_u.

Whenever the waiting time t_j for mutation j is smaller than the sampling time t_s, we record that mutation j has been observed. If we do this for every mutation j, we obtain a sample of an error-free genotype x = (x_1, ..., x_p). We draw samples by traversing the DAG in topological order to ensure that we compute t_u for all u ∈ pa(j) before visiting any dependent mutation j. Because we know neither the rate parameters λ nor the poset P, in each iteration of the MCEM algorithm we use their current estimates, λ^(k) and P^(k).
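A single draw of the forward sampling scheme can be sketched as follows, assuming the poset is given by parent sets and a topological order (all names here are our own, not the MC-CBN API):

```python
import random

def forward_sample(parents, rates, topo_order, lam_s=1.0, rng=random):
    """One forward sample: draw z_j ~ Exp(lambda_j) in topological order,
    set t_j = z_j + max of the parents' waiting times, draw the sampling
    time t_s ~ Exp(lam_s), and record x_j = 1 iff t_j < t_s."""
    t = {}
    for j in topo_order:
        z_j = rng.expovariate(rates[j])
        # Parents are visited first, so their waiting times are available.
        t[j] = z_j + max((t[u] for u in parents[j]), default=0.0)
    t_s = rng.expovariate(lam_s)
    return tuple(int(t[j] < t_s) for j in sorted(t))
```

By construction, a mutation's waiting time is never smaller than any of its parents', so every sampled genotype is compatible with the poset.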

For each observed genotype y^(i), i = 1, ..., N, we draw L samples using the forward sampling scheme described above. A sample is a tuple of waiting times and the corresponding error-free genotype. Because of the graph traversal and the loop over parents, the worst-case time complexity of the forward sampling is O(NLp²). We note that the candidate hidden genotypes are generated without accounting for the observed data. Alternatively, we implement a second forward sampling scheme called forward-pool. In this case, for each iteration of the MCEM algorithm, we draw an initial pool of K waiting time vectors (t_j^(l), j = 1, ..., p), with K ≫ L, and for each observed genotype, we choose a subset of L samples according to their similarity to the observed genotype, as explained below. For each of the waiting time samples, we first construct the error-free genotype x^(l) and then draw L genotypes, each with probability proportional to the likelihood of the observed genotype under the error model, Pr(y^(i) | x^(l); ε^(k)).

Backward sampling

For the backward sampling, we construct the sample of candidate error-free genotypes x^(l), l = 1, ..., L, based on the observed genotype y^(i), and then sample the mutation times from truncated exponential distributions (TExp) consistent with the candidate genotype.

In a second scheme, each locus of the observed genotype y^(i) is perturbed with probability equal to the current estimate of the error rate ε̂^(k). We draw L candidate genotypes, some of which may be incompatible with the current poset P^(k) and thus obtain a zero sampling weight; i.e., they do not contribute to the estimation of the model parameters. This sampling scheme is referred to as Bernoulli sampling. The third approach is a two-step scheme. First, we decide uniformly at random whether to (i) leave the genotype y^(i) unperturbed, (ii) add, or (iii) remove a mutation. For (ii) and

(iii), we draw a mutation from the set of mutations that can be added or removed, respectively. If we remove an event j, it is chosen with probability proportional to κ_j = 1/λ_j + max_{l ∈ pa(j)} κ_l, which corresponds to a greedy approximation of the time to mutation, assuming that the process is dominated by the slowest predecessor in each reverse breadth-first search generation. The rationale is to remove, with higher probability, mutations from the genotype y^(i) that are likely to occur at later times. On the other hand, if we add an event, it is chosen with a probability inversely proportional to its probability of being removed. In this case, we add, with higher probability, mutations that can arise faster. In the second step of this scheme, we ensure the genotype is compatible with the current poset P^(k) by adding or removing all incompatible mutations. This sampling scheme is referred to as backward-add/remove (backward-AR) sampling.
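The greedy removal weights κ_j can be computed in a single topological pass; a small sketch with illustrative names:

```python
def removal_weights(parents, rates, topo_order):
    """kappa_j = 1/lambda_j + max kappa over pa(j): a greedy estimate of
    the time to mutation j, assuming the slowest predecessor dominates.
    Late events (large kappa) are then removed with higher probability."""
    kappa = {}
    for j in topo_order:
        kappa[j] = 1.0 / rates[j] + max((kappa[u] for u in parents[j]), default=0.0)
    return kappa
```

A mutation with a slow rate or a slow ancestor accumulates a large κ_j, matching the intuition that it is a late event in the process.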

Evaluation of sampling schemes

We evaluate the accuracy of the different approximation schemes by computing the probability of a genotype y^(i) and comparing it to the exact solution (Eq. 5). Since

Pr(Y = y^(i)) are the factors of the likelihood, we are assessing the accuracy of the likelihood computation. We approximate the probability of genotype y^(i) by drawing L samples from each of the proposal distributions (Eq. 17).

Structure learning

Gerstung et al. [31] implemented a simulated annealing (SA) algorithm with a geometric annealing schedule to infer the network structure of the H-CBN model. However, as the size of the model increases, the poset search space grows rapidly (sequence A001035 in The On-Line Encyclopedia of Integer Sequences, https://oeis.org/A001035), and the standard SA algorithm is more prone to converge to local optima and to miss globally optimal or near-optimal solutions. Here, we incorporate an adaptive simulated annealing (ASA) algorithm [35] to improve the efficacy of the search. As in the standard SA algorithm [36], in each iteration we propose an update P′ of the current poset P^(k) and accept the new poset with probability

min{1, exp[(ℓ(P′) − ℓ(P^(k))) / Θ]},

where ℓ denotes the log-likelihood and Θ the temperature. Conventionally, the temperature Θ is gradually reduced over iterations, initially allowing the system to explore a broad region of the search space, but ultimately moving exclusively towards solutions that improve the likelihood. In the ASA algorithm, the cooling schedule is adjusted according to the search progress, following the same principle as before, i.e., gradually changing the temperature such that the system is able to converge [37,38]. We have adopted the cooling schedule from Srivatsa et al. [39] as follows. For every interval consisting of m consecutive iterations, we set the temperature

Θ_m = Θ_{m−1} exp((0.5 − a_{m−1}^c) · a_r),

where a_{m−1} is the observed acceptance rate of the previous interval, a_r is a custom adaptation rate, and c = −log(2)/log(a_ideal) is a scaling factor accounting for deviations from an optimal acceptance rate.
Following previous work [39], the optimal acceptance rate is set to a_ideal = 1/p, where p is the number of mutations. Moreover, the adaptation rate a_r is an additional free parameter that enables further control of the abruptness of temperature changes.
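One reading of this schedule (the exact expression in the source is garbled, so treat this as a sketch under the assumption Θ_m = Θ_{m−1}·exp((0.5 − a^c)·a_r)) can be written as:

```python
import math

def next_temperature(theta, acc_rate, p, a_r=1.0):
    """Adaptive cooling step: theta_m = theta_{m-1} * exp((0.5 - a**c) * a_r),
    with a_ideal = 1/p and c = -log(2)/log(a_ideal). The scaling c is chosen
    so that a_ideal**c = 0.5, i.e. the temperature stays unchanged when the
    observed acceptance rate equals the ideal one."""
    a_ideal = 1.0 / p
    c = -math.log(2.0) / math.log(a_ideal)
    return theta * math.exp((0.5 - acc_rate**c) * a_r)
```

With this form, an acceptance rate above a_ideal cools the system (the exponent is negative), while a rate below a_ideal heats it, steering the search towards the target acceptance rate.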

The optimization includes proposing a neighboring poset, which ultimately defines the moves through the discrete space of posets. To compare the mutational networks inferred for two groups, we use as test statistic the Jaccard distance D_J between the two inferred posets, under the null hypothesis of both data sets having been generated by the same underlying poset. The alternative is that the two data sets have been generated by two different posets.

We compute the distribution of the test statistic D_J under the null hypothesis as follows. We combine all genotypes from the considered groups and randomly split the data into two disjoint sets S_1 and S_2 with N_1 and N_2 genotypes, respectively, where N_1 and N_2 are the sizes of the two original data sets. That is, we permute the group labels of the genotypes. Then we apply H-CBN2 to infer the poset for S_1 and S_2 separately and compute their Jaccard distance. We repeat this procedure B times and construct the distribution of the test statistic D_J under the null by aggregating the computed Jaccard distances (Fig 2). We assess how likely it is to observe a test statistic at least as extreme as the observed distance d_J under the null hypothesis by means of the associated p-value,

p = 1 − F̂(d_J),

where F̂ is the empirical cumulative distribution function of the Jaccard distances under the null.

Fig 2. Schematic representation of the comparison of CBN models. (A) Data sets D_1 and D_2 consist of N_1 and N_2 genotypes, respectively, and, in this example, p = 4 mutations. We combine both data sets into a single one, D_0, with N_1 + N_2 genotypes. (B) We randomly split data set D_0 into data sets S_1 and S_2, and we do so B times. (C) For each data set, we apply the H-CBN2 approach to learn the structure of the network, and for each pair S_1 and S_2 we compute the Jaccard distance. (D) The empirical distribution of the test statistic is computed by aggregating the distances between pairs S_1 and S_2. (E) We compare the inferred CBN posets from the original data sets D_1 and D_2 by computing the Jaccard distance and assessing its significance.
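The permutation procedure can be sketched as follows; `infer_poset` stands in for the H-CBN2 structure learner and is assumed to return a set of cover relations (all names here are illustrative, not the actual package API):

```python
import random

def jaccard_distance(edges1, edges2):
    """Jaccard distance between two posets represented by their sets of
    cover relations (directed edges)."""
    e1, e2 = set(edges1), set(edges2)
    if not e1 and not e2:
        return 0.0
    return 1.0 - len(e1 & e2) / len(e1 | e2)

def permutation_pvalue(d1, d2, infer_poset, B=100, rng=random):
    """Permutation test: pool the genotypes of data sets d1 and d2, split
    the pool B times into groups of the original sizes, re-infer a poset
    for each half, and compare the observed Jaccard distance against the
    resulting null distribution."""
    d_obs = jaccard_distance(infer_poset(d1), infer_poset(d2))
    pool = list(d1) + list(d2)
    null = []
    for _ in range(B):
        perm = rng.sample(pool, len(pool))  # permute the group labels
        s1, s2 = perm[:len(d1)], perm[len(d1):]
        null.append(jaccard_distance(infer_poset(s1), infer_poset(s2)))
    # Fraction of null distances at least as extreme as the observed one,
    # i.e. 1 minus the empirical CDF evaluated at d_obs.
    p = sum(d >= d_obs for d in null) / B
    return d_obs, p
```

Counting the fraction of null distances at least as large as the observed one is equivalent to evaluating 1 minus the empirical cumulative distribution function at the observed distance.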
Jaccard distance

Given two posets with sets of cover relations E_1 and E_2, the Jaccard distance between them is d_J = 1 − |E_1 ∩ E_2| / |E_1 ∪ E_2|.

Simulation study

We assess the quality of approximating the probability of a genotype y (Eq. 17) by the H-CBN2 importance sampling schemes and compare it to the exact solution (Eq. 5).

For a poset with p = 16 mutations, we vary the number of samples, L, drawn from the proposal distribution and find, as expected, that the accuracy of the approximation improves with L (Fig S1-Fig S5). In most cases, we are able to accurately approximate the likelihood of the H-CBN model (Eq. 4) with L = 1000 (relative absolute error ≤ 0.02, Table S1). For the forward-pool sampling, the relative absolute error of the

Fig 3. Assessment of parameter estimation for various numbers of mutations, error rates, and posets. (A) Box plots of the difference between the true (ε) and estimated (ε̂) error rate (y-axis) for 100 simulated data sets for each of the evaluated model sizes (x-axis). (B) Box plots of the relative median error (RME; y-axis) of the estimated rate parameters λ̂. Different colors indicate different importance sampling schemes. The sample size is N = min(50p, 1000) and the number of samples drawn from the proposal distribution is set to L = 1000 unless specified otherwise. We run 100 iterations of the Monte Carlo EM algorithm.

As an exception, the estimates obtained by the Bernoulli sampling deteriorate as the number of mutations increases. This is because the fraction of incompatible genotypes increases with the number of mutations, and it becomes less likely to sample candidate genotypes with non-zero weight (i.e., compatible with the poset). In fact, the Bernoulli sampling scheme failed to provide any samples for the data sets with a 5% error rate and 256 mutations, as well as for the data sets with a 10% error rate and 128 or 256 mutations.

By contrast, the relative error in estimating the rate parameters λ increases with the number of mutations. This is likely due to the number of genotypes being bounded at 1000 for data sets with more than 32 mutations. For this particular constellation of sample size N and number L of samples drawn from the proposal distribution, for posets with more than 32 mutations the sampling schemes fall short of accurately estimating the rate parameters. We also evaluate the Hamming k-neighborhood sampling for different k (Fig S6 E). We observe that, as the number of mutations increases, we need to expand the neighborhood based on the Hamming distance for accurate estimation of the model parameters. However, the run time increases substantially and becomes a limiting factor for larger posets (Fig S7). Once again, the performance of the Bernoulli sampling scheme declines as the number of mutations increases, and for data sets with more than 64 mutations, the relative median error for the rate parameters is outside the range displayed in Figure 3.

We also assess the run time performance of the various sampling schemes implemented in H-CBN2 and compare it to H-CBN (Fig S7). Each run corresponds to 100 iterations of the MCEM or EM algorithm. We observe that H-CBN is faster than any of the H-CBN2 sampling schemes for small posets, but its run time grows exponentially with the number of mutations, and it is outperformed by H-CBN2 for p ≳ 10. We also observe that, for larger posets, the forward-pool sampling is slower than the standard forward sampling, because the size of the pool increases with the number of mutations; we set K = pL to ensure accurate parameter estimates (Fig S2). As the number of mutations increases, the computation of the Hamming distance becomes the limiting factor (Eq. 15).

The forward sampling and the backward-AR sampling perform equally well in terms of the accuracy of the estimated model parameters for small- and medium-sized posets, even when the number L of samples drawn from the proposal distribution is set to 100 for the backward-AR sampling (Fig S6 G-H). The run times of these sampling schemes with L = 1000 and L = 100, respectively, are also similar. The forward and the backward-AR sampling schemes thus enable performant parameter estimation for posets with more than 14 mutations and up to about 32 mutations. Since we do not observe any advantage in using the backward-AR sampling over the forward sampling, we choose the latter for all further analyses presented in this work.

Assessment of the simulated annealing algorithm on simulated data

So far, we have assumed that the poset P is known. In the following, we evaluate the performance of the H-CBN2 structure learning algorithm, which, in addition to adding or removing an edge, includes new moves to propose candidate posets, as well as an ASA schedule. We employ a similar approach as before: (i) draw a transitively reduced DAG and parameters at random, (ii) generate a data set from the joint probability distribution of the model, and (iii) infer the network structure in addition to the model parameters.

We first compare the accuracy of the estimated model parameters when the poset P is also learned. We do not observe any manifest difference in the absolute error between the true and the estimated error rate (Fig S8 A), but the relative absolute error of the rate parameters, as well as the absolute error of the log-likelihood, is marginally larger when the poset is learned in addition to the model parameters (Fig S8 B-C).

Next, we compare different SA strategies for structure learning. We observe a notable improvement in the log-likelihood of the reconstructed network after including the additional new moves (SA+) compared to a simulated annealing algorithm (SA) with only addition and removal of edges (Fig 4A). Incorporating, in addition, an adaptive annealing schedule yields similar performance to SA+. Similarly, the error in the estimation of the model parameters also decreases, mostly upon including the new moves (Fig S9 A-B). We also compute the recall and precision of reconstructing the elements of the cover relation and find a clear improvement of SA+ over SA (Fig 4B-C).

In addition, we analyze two genotype-treatment data sets from the HIVDB corresponding to HIV-1 subtype B and C genotypes. For the latter, we exclude genotypes from South Africa that constitute the data set described above. All patients in these data sets were treated with lopinavir, or lopinavir and ritonavir, but not with another protease inhibitor. The data sets include 298 and 775 sequences of subtype B and C genotypes, respectively. Additionally, we consider 172 HIV-1 subtype B sequences from the SHCS derived from patients treated with lopinavir as the first PI. We jointly analyse all 470 subtype B genotypes to mitigate the small sample size.

We use H-CBN2 for analyzing and comparing the accumulation of resistance mutations in HIV-1 subtype B and subtype C under the selective pressure of lopinavir. We employ the forward sampling scheme to learn the partial order among mutations.

The robustness of the network estimation is investigated by using 100 bootstrap samples, and the consensus networks are shown in Figs 5A and 6 (p = 20 and p = 18, respectively). In the South African cohort (subtype C sequences), we identify a mutation at residue 82 in the protease as an early event. The initial substitution is likely to be V82A, as it is predominantly observed in the data set (Fig S10). After this initial event, we find strong support for mutations at residues 10, 33, 46, 54 and 76 (Fig 5A). For subtype B, we find strong support for a mutation at residue 46 as an initial event (Fig 6). The inferred posets can explain previously observed mutation patterns, such as M46I+I54V alone or in combination with L76V or V82A in subtype B [33], as well as M46I+I54V+V82A and L10F+M46I+I54V+L76V+V82A in subtype C [34].

To assess whether the two H-CBN2 posets are significantly different beyond reconstruction uncertainty, we developed a customized statistical test based on the Jaccard distance between the posets. The distance between the maximum likelihood posets (Fig S11 and Fig S12) is 0.802. To assess the significance of this result, we compare it to the empirical distribution of pairwise distances computed between reconstructed networks after randomly permuting the group labels (Fig 7). At a significance level of 5%, we reject the null hypothesis that the data sets stem from the same underlying poset (p-value < 0.02, Fig 7B) for p = 18 mutations. Similarly, we reject the null hypothesis when comparing subtype-specific CBN models for HIV-1 subtypes B and C with p = 11 mutations (p-value = 0.04, Fig 7A). The smaller data sets are obtained by discarding mutations with marginal counts less than or equal to 5 in either of the two data sets.

As a negative control, we also compare the two H-CBN2 models for subtype C inferred from the South African cohort versus the remaining subtype C genotypes from the HIVDB (Fig 5). The consensus posets share 16 cover relations, namely, L10FR ≺

Fig 5. Consensus posets for lopinavir resistance for two different HIV-1 subtype C data sets. Shown are the consensus posets for (A) the South African cohort and (B) the remaining HIV-1 subtype C sequences retrieved from the HIVDB. Nodes in the network correspond to amino acid changes in the HIV-1 protease, where mutations at the same locus are grouped together in one event. Only edges with a bootstrap support greater than 0.7 are shown, and the edge thickness indicates the bootstrap support. Nodes with white background show residues with at least one major PI mutation.

annealing algorithm, including additional move types that allow exploring the space of posets more efficiently. We validated the structure learning algorithm for 16 mutations, which aligns with the numbers of mutations relevant for our application to HIV-1.

Structure learning is, however, a hard problem, and further improving the efficiency of this step may be worth addressing in future research.
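
The simulated annealing search over poset space with an adaptive temperature, as evaluated in the validation experiments (target acceptance rate 1/p, adaptation rates a_r), can be sketched generically. This is a simplified illustration under stated assumptions, not the H-CBN2 implementation: `neighbors` and `log_lik` are hypothetical placeholders for the application-specific move proposals and likelihood evaluation.

```python
import math
import random

def anneal(initial_poset, neighbors, log_lik, n_iter=25000,
           theta0=50.0, target_accept=None, adapt_rate=0.3, seed=0):
    """Simulated annealing over discrete structures (a generic sketch).
    `neighbors(state, rng)` proposes a random neighboring structure and
    `log_lik(state)` scores it; both are placeholders.  If `target_accept`
    is given (e.g. 1/p for p mutations), the temperature is adapted by
    stochastic approximation so the acceptance rate approaches it."""
    rng = random.Random(seed)
    current, cur_ll = initial_poset, log_lik(initial_poset)
    best, best_ll = current, cur_ll
    theta = theta0
    for _ in range(n_iter):
        proposal = neighbors(current, rng)
        prop_ll = log_lik(proposal)
        delta = (prop_ll - cur_ll) / theta
        accepted = delta >= 0 or rng.random() < math.exp(delta)
        if accepted:
            current, cur_ll = proposal, prop_ll
            if cur_ll > best_ll:
                best, best_ll = current, cur_ll
        if target_accept is not None:
            # lower the temperature when accepting too often,
            # raise it when accepting too rarely
            theta *= math.exp(adapt_rate * (target_accept - (1.0 if accepted else 0.0)))
    return best, best_ll
```

For illustration, the same routine works on any state space with a neighborhood relation; in the paper's setting the states are transitively reduced DAGs and the proposals are the poset move types.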

Even though there are descriptive analyses of subtype-specific PI mutation profiles [23,33,34,44], to our knowledge, this study is the first comparative analysis of pathways of accumulating mutations over time in different HIV-1 subtypes. In addition to a more systematic approach to investigating mutation patterns, the number of observations in our study is greater than in any of the previous studies, which ranged from 88 to 165 patients. We applied the H-CBN2 approach to learn the partial temporal ordering of resistance mutations in HIV-1 subtypes B and C under the selective pressure of lopinavir. Our results indicate that, despite some similarities, for the considered numbers of mutations the resistance pathways differ significantly between the two subtypes. Moreover, as a validation of the distance-based test, we compared H-CBN2 posets for subtype C inferred from two independent data sets; the outcome aligns with the expectation that a single underlying poset explains both data sets better than two distinct posets.

In our analysis, we included major PI mutations associated with lopinavir resistance and non-polymorphic accessory mutations. Although some polymorphisms, in combination with PI resistance mutations, are associated with an increase in viral fitness [45], these are also highly prevalent in treatment-naïve patients, especially in non-B subtypes [46-48]. Therefore, despite observing polymorphisms with relatively high prevalence, we did not include these mutations in our study. We also found more than one PI-associated mutation in only about 14% and 16% of the patients in the South African cohort (subtype C) and the subtype B data set, respectively. The absence of resistance mutations in the protease gene has been repeatedly observed at virological failure, even in the absence of reverse transcriptase inhibitors [33,49-52]. In addition to poor adherence to treatment [53,54], there may be other reasons for observing a low percentage of patients harboring PI resistance mutations, some of which are listed below. First, the genetic barrier to lopinavir resistance appears to be high. Barber et al. [33] have suggested that PI resistance mutations are more likely to accumulate under prolonged virological failure. Second, there is increasing evidence that mutations in the gag gene play a role in decreasing susceptibility to protease inhibitors by, e.g., inhibiting the proteolytic cleavages necessary for protein maturation [19,55,56]. Virions with immature particles may not adequately complete cell entry or reverse transcription [56]. Third, resistance mutations may exist in the intra-host virus population at frequencies below the detection threshold. Mutations are typically detected by Sanger sequencing-based methods, while next-generation sequencing methods could improve the sensitivity of mutation detection [57].

A limitation of the methodology is the aggregation of different amino acid substitutions at the same locus into single events. This is required because, when observing a specific substitution, we do not know which other mutations at that locus led to the current state. One potential way to overcome this limitation is to incorporate additional data sources, such as time-series data per patient, single-genome amplification data, or next-generation sequencing data. However, these data are not generally available in public databases, nor are they part of routine diagnostic tests. If such data were available, the hidden states could be accounted for in an extended model. Nonetheless, with the number of genotypes available in the current application, we may not be able to include more mutations without decreasing the accuracy of the parameter estimates of the H-CBN model.
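
The aggregation step can be made concrete with a small helper. This is an illustrative sketch, assuming mutation names follow the usual wild-type/position/variant convention (e.g. V82A); the function name is ours, not from the paper.

```python
def aggregate_by_locus(mutations):
    """Collapse distinct amino-acid substitutions at the same protease
    residue into a single binary event, keyed by position.  For example,
    'V82A' and 'V82F' both map to event 82."""
    events = set()
    for m in mutations:
        # extract the residue number between the flanking amino-acid letters
        pos = int("".join(ch for ch in m if ch.isdigit()))
        events.add(pos)
    return events
```

Under this encoding, a genotype carrying any substitution at a residue is treated as having "the" mutation event at that residue, which is exactly why the identity of the specific substitution is hidden from the model.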

The comparison of the cross-sectional data sets is challenging due to the existence of several confounders. First, the data are gathered from various sources, which entails potential differences in HIV surveillance and clinical monitoring protocols. Moreover, observations come from distinct geographical locations, which implies, e.g., differences in socio-demographic aspects and health-care standards. Lastly, therapeutic strategies tend to differ between developed and developing countries, and there is a limited number of observations of various subtypes undergoing the same therapy. In the present study, the number of observations in the subtype B data set is approximately half of that available for subtype C. Such an imbalance poses additional challenges for the CBN comparisons. The spread of the empirical distribution of Jaccard distances might be wider for imbalanced data sets, which could result in an apparent increase in false negatives. However, this is not a shortcoming of the distance-based test per se; small sample sizes generally lead to reduced accuracy of the parameter estimates, including the network structure.

In summary, the inferred CBN models provide insights into the evolution of drug resistance in HIV-1 subtype C infections and enable comparisons with other subtypes, as demonstrated for subtype B. Moreover, the methods proposed in this work can be generally applied to investigate subtype-associated differences pertaining to HIV-1 drug resistance.

[…] neighborhood including the leading, the first-order, and the second-order terms (k = 2), and G-I a neighborhood including the leading, the first-order, the second-order, and the third-order terms (k = 3). In this case, the value of L indicates the number of waiting time vectors sampled per genotype in the neighborhood.

[…] corresponding to 20 different transitively reduced DAGs for simulated data sets with 16 mutations and an error rate of 5%. We use the forward sampling scheme with L = 1000 samples drawn from the proposal distribution. We fix the ideal acceptance rate to 1/p = 0.0625 and run 25000 iterations of the simulated annealing algorithm. The initial temperature is set to Θ_0 = 50 for all runs, and for the adaptive simulated annealing three adaptation rates are evaluated (a_r = 0.1, 0.3, 0.5). SA: simulated annealing, ASA: adaptive simulated annealing, +: with additional new moves.

Table S1. Relative error in approximating the log-likelihood via importance sampling. The relative error is computed by dividing the absolute error by the absolute value of the true log-likelihood.

File S1. Patient identifiers of the South African cohort as retrieved from the HIVDB.