Reassortment Networks and the evolution of pandemic H1N1 swine-origin influenza.

Prior research developed Reassortment Networks to reconstruct the evolution of segmented viruses under both reassortment and mutation. We report their application to the swine-origin pandemic H1N1 virus (S-OIV). A database of all influenza A viruses, for which complete genome sequences were available in Genbank by October 2009, was created and dynamic programming was used to compute distances between all corresponding segments. A reassortment network was created to obtain the minimum cost evolutionary paths from all viruses to the exemplar S-OIV A/California/04/2009. This analysis took 35 hours on the Cray Extreme Multithreading (XMT) supercomputer, which has special hardware to permit efficient parallelization. Six specific H1N1/H1N2 bottleneck viruses were identified that almost always lie on minimum cost paths to S-OIV. We conjecture that these viruses are crucial to S-OIV evolution and worthy of careful study from a molecular biology viewpoint. In phylogenetics, ancestors are typically medians that have no functional constraints. In our method, ancestors are not inferred, but rather chosen from previously observed viruses along a path of mutation and reassortment leading to the target virus. This specificity and functional constraint render our results actionable for further experiments in vitro and in vivo.


INTRODUCTION
A newly emergent strain of influenza subtype H1N1, termed the swine origin influenza virus (S-OIV), was first detected during an outbreak in Mexico and southwestern USA in spring of 2009 [1], [2]. The disease was next observed in other parts of the USA in late March 2009 and the virus was first isolated in mid April 2009 [3], [4]. Throughout the spring, the epidemic spread worldwide and the World Health Organization (WHO) declared an influenza pandemic, phase 6, on 11 June 2009 [5]. By 25 April 2010, 17,919 deaths attributable to H1N1 S-OIV were reported with confirmed cases in more than 214 countries [6].
Influenza is a negative sense single-stranded RNA virus in the family Orthomyxoviridae. It is maintained in migratory waterfowl and can infect many species including humans, birds, pigs, horses, and other animals. The virus genome contains eight segments that code for 10 or 11 proteins. Three segments encode the polymerase complex: basic polymerase 2 (PB2), basic polymerase 1 (PB1), and the acidic protein (PA). The nucleoprotein segment (NP) encodes a protein that binds to viral RNA. The matrix segment (MP) encodes two proteins: a structural component of the viral capsid and a membrane ion channel. The nonstructural segment (NS) encodes a protein essential for cellular RNA processing and transport. Two other segments, hemagglutinin (HA) and neuraminidase (NA), encode viral surface glycoproteins responsible for host cell entry and exit, respectively [7], [8]. The hemagglutinin and neuraminidase genes determine the viral subtype, designated HxNy, where x is one of 16 known hemagglutinin subtypes [9] and y is one of nine known neuraminidase subtypes [7].
When two different influenza viruses infect the same host cell, novel combinations of the eight genomic segments can create a novel virus through a process known as reassortment [7], [8], a form of horizontal gene transfer (HGT). Often, the host population is immunologically naïve to the new viruses created through reassortment, which can lead to worldwide pandemics as observed in 1957 with the H2N2 pandemic, in 1968 with the H3N2 pandemic, and in 2009 with the H1N1 pandemic [8], [10]. The role of reassortment in influenza evolution is also discussed in [11] and [12].
Reassortment events leading to novel influenza viruses are poorly understood, greatly underestimated, and thus are an area of continuing research. Reassortment events are often identified by visual comparisons of incongruence among phylogenetic trees [13]. Some algorithms have been developed to infer reassortment events through statistical techniques in the absence of phylogenetics [14], [15]. Prior research in using phylogenetic networks to analyze situations where HGT occurs has been done by Huson and Bryant [16], and Makarenkov and Legendre [17]. An extensive survey of combinatorial methods for phylogenetic networks appears in [18].
In this paper, we build on previous work by Bokhari and Janies [19] to explore the evolution of H1N1 S-OIV pandemic viruses using a reassortment network. This paper reports multidisciplinary research, involving several diverse fields. As such, we must necessarily cover a wide range of issues in order to provide a clear understanding of 214 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 9, NO. 1, JANUARY/FEBRUARY 2012 our methods and results. It is important to describe many of the low-level implementation details to allow other researchers to appreciate the complexities of this project and pursue its applications and extensions. In particular, we cover the details of data gathering and processing that were necessary to create the database that is at the center of this work. The issues of parallel computation, both on a distributed memory commodity cluster as well as on a shared-memory massively multithreaded supercomputer are also covered, as the reassortment algorithm could not be run in an acceptable amount of time without these. Finally, the presentation of output as "in-trees" that capture both mutation and reassortment is important for interpretation and evaluation of our results.
The main contributions of this paper are: 1. Implementation of a parallel version of the Bokhari and Janies [19]  almost every evolutionary path to the 2009 S-OIV set of viruses. These possibly played a significant role in the S-OIV pandemic and are worthy of further study in a molecular biology context. We start the paper with an overview of the workflow of our research in Section 2, which is followed by a discussion of the architecture of the Cray XMT supercomputer. In Section 3, we review existing models for S-OIV reassortment. The main results of our paper appear in Sections 4 and 5. We conclude with a discussion of our results and suggestions for future research in Section 6. 1

OVERVIEW OF WORKFLOW
There are three main components of our research (Fig. 1). To start with, virus names, dates, locations, subtypes, and nucleotide sequences were downloaded from the National Center for Biotechnology Information (www.ncbi.nlm.nih. gov). Next, pairwise distances between all pairs of corresponding sequences were computed using dynamic programming. Finally, the distances were used to create a Reassortment Network and to compute the minimum cost paths from all viruses to a given target virus. (A minimum cost path, or shortest path, from node s to node t has the shortest distance over all paths from s to t.) These shortest paths are used to create the in-tree that has, in general, all known viruses as its leaves and the given target virus as its root.

Creating the Virus Database
Although the RNA sequences of influenza viruses are available in the public databases www.ncbi.nlm.nih.gov/ genomes/FLU [20] and www.ncbi.nlm.nih.gov/sites/ batchentrez, extracting this information in a form that could be used for our study was a challenging problem that required significant manual curation.
We need, for each virus, a unique strain-specific identifier (a TaxID), strain name (e.g., A/Canada-NS/ RV1535/2009(H1N1)), subtype (H1N1 in this case), year (2009), and RNA sequences for each of the eight segments. Acquiring this information is complicated by the following issues: . Many segment sequences are repeated in the database due to redundant effort. . The year entry in the database is often not standardized (e.g., 1979-chicken instead of 1979). . Sequences include those isolated from laboratory adapted and vaccine strains that have to be excluded by inspection. . Segment types are part of name strings that also include place names. For example, "PA" could be a location (a state of the US) or the segment type. . Segment types are labeled in numerous ways, e.g., polymerase 2, polymerase basic protein 2, segment 1, Sequence 1, P2, PB2, pb2. These issues make it difficult to fully automate the process of extracting information from the database and require significant manual effort. Full details of the steps required to acquire this data are given in the Supplemental Material, Appendix E, which can be found on the

Computing Intersegment Distances
Weighted edit distances for every pair of corresponding segments were computed using standard dynamic programming [21,Section 11.5], as detailed in [19]. We use "end-space free" alignments [21, Section 11.6.4]. Such alignments ignore any missing bases at the ends of the input strings, which are due to sequencing artifact. When computing distances, we needed to account for symbols other than A, C, T, G, that represent ambiguous bases [22]. This issue is described in detail in the online Supplemental Material, Appendix F. The number of fully sequenced (all eight segments known) influenza A viruses in www.ncbi.nlm.nih.gov/ genomes/FLU as of 15 October 2009 was 5,016. Thus, we needed to carry out 8 Â ð5;016 Â 5;017Þ=2 % 101 Â 10 6 alignments. The memory requirements in this case are modest, as the length of an influenza segment is <2;500 bases, implying dynamic programming matrices of size <10 7 . Furthermore, the 101 Â 10 6 alignments are completely independent and thus trivial to parallelize.
We initially used the resources of Ohio Supercomputer Center (OSC, www.osc.edu), which has several thousand processors. The computation of all distances took 12-24 hours on this shared system, where we had to compete with hundreds of other users.
During the course of our research, our department (Biomedical Informatics) acquired its own cluster, called "Bucki," with 576 AMD Opteron 2378 (2.4 GHz) processors that we could use in dedicated mode. The time on this system was reduced to 90 minutes by finely dividing the workload, as detailed in the online Supplemental Material, Appendix G.

The Reassortment Network
We now turn to the actual creation and solution of reassortment networks. As the theory underlying these networks has been discussed in detail by Bokhari and Janies [19], we present only a brief sketch.
A reassortment network representing stages of evolution has 2 þ 1 layers 0 . . . 2. Even layers represent viruses and odd layers represent events, which can be stasis, mutation, or reassortment. Edges extend only from layer i to i þ 1, 0 i < 2 (resulting in a multipartite graph). A path from a virus v in the first layer to a virus u in the last layer represents a series of stasis, mutation, or reassortment events that transform v to u.
Edges in the reassortment network have weights on them. The weighting scheme is designed such that the cost of the path between virus v in layer 0 and virus u in layer 2 equals the sum of the costs of the mutation and reassortment events that transform v into u (stasis costs between identical segments are, of course, zero).
The cost of a mutation of virus i into virus k is the sum of the edit costs of the individual segments. When virus i reassorts to obtain a segment from virus j to become virus k, the reassortment cost is the sum of the segment edit distances between virus i (with segment replaced by the corresponding segment of virus j) and virus k. These concepts are clarified in Figs. 12 and 13 in the online Supplemental Material, Appendix D.

Motivation for Reassortment Algorithm
Our reassortment network is a layered graph in which alternating layers of nodes correspond to viruses and reassortment events between pairs of viruses. Edges in this graph correspond to transitions between viruses and reassortment events, with edge weights corresponding to the costs of transitions. These costs are the sums of segment distances between pairs of viruses.
Paths in a reassortment network correspond to evolutionary changes and the lengths of the paths correspond to the sums of the edge weights in the paths. Lower path lengths correspond to smaller sums of distances between the corresponding sequence of viruses. It follows that the shortest path indicates the minimum cost sequence of mutation and/or reassortment events required to transform one virus into another.

Finding Shortest Paths
Once a stage reassortment network for n viruses with s segments each has been set up, the shortest evolutionary path between any two viruses can be found in time Oðs 2 n 3 Þ. This time suffices to find shortest paths from all viruses in layer 0 to a given target virus t in layer 2. These paths constitute the in-tree for the target virus. The expression Oðs 2 n 3 Þ represents a very large number of computational steps, given that % 30; s ¼ 8, and n % 5;000, and massive parallelism is required to achieve reasonable run times, as discussed below.

The Cray XMT
The algorithm for finding shortest paths in the reassortment network is difficult to parallelize on conventional cluster machines because 1. the reassortment network is a graph that has little locality and thus requires long-range communications between graph nodes, 2. there is a strict precedence relationship on the order in which the nodes are labeled, that is, from layer 0 to layer 2n, 3. if layers are assigned to individual processors, then all processors whose layers are currently not being updated will be idle, and 4. if layers are partitioned over processors, then there will be need for frequent and heavy interprocessor communications.
The Cray Extreme Multithreading (XMT) supercomputer is the latest in a family of machines that originated in the Tera [23] and subsequently evolved into the MTA (Multithreaded Architecture) [24]. Reviews of the architecture of this machine and its applications are available in [25] and [26].
The key features of this machine include 1. hardware support for 128 threads per processor, 2. zero overhead switching between threads, 3. large uniformly accessible shared memory, 4. extremely fine grained synchronization using Full/ Empty bits with individual 64-bit words, 5. powerful interconnect, and 6. powerful compiler and performance analysis tools. These features of the XMT allow straightforward and efficient parallelization of many important problems, especially those in bioinformatics, as surveyed in [26]. In particular, the XMT is the only machine that can efficiently parallelize large-scale reassortment networks.
The largest XMT that is available to researchers as of 2010 is the 128 processor, 1 Terabyte machine at the Center for Adaptive Supercomputer Software (CASS) in the Pacific Northwest National Laboratory (PNL).

Parallelization of Algorithm
The XMT allows the parallelization of ordinary C code with some loop restructuring and the judicious addition of pragmas (compiler directives in code). The use of machine specific synchronization is occasionally required.
As an example, Fig. 2 describes the routine that labels event nodes from virus nodes when finding shortest paths in a reassortment network. The #pragmas assure the compiler that the loops that follow are safely parallelizable. Only the i and k loops need to be parallelized since the number of viruses ¼ 5;016 and this results in enough parallelism (5;016 2 computations) to keep the machine busy. Thus, the s loop is not parallelized. The readfe(.) (wait until full, then read and set empty) (writeef(.) (wait until empty, then write and set full)) functions lock (unlock) their arguments (which are individual words) so that multiple threads can safely and correctly update them. Although expressed as functions, these are really machine operations that are executed in one clock cycle, leading to very efficient, fine grained synchronization. This behavior is possible because the Cray XMT supercomputer has special synchronization bits with each memory location. An illustration of synchronization is provided in the online Supplemental Material, Appendix H. Notice how variables are declared only in the scopes where they are used. This ensures that if a loop is parallelized, the corresponding thread will have an independent copy of the variable, thus eliminating contention.

Implementation Details
When going from a serial to a parallel version of our code, we first examined the major loops to identify which could be parallelized. In these loops, we had to ensure that all accesses to shared variables were safely parallelizable. This is easy to do, since the XMT compiler will indicate unsafe parallelization (i.e., situations where there are accesses to shared locations without appropriate synchronization). Full/Empty operations, readfe(.) and writeef(.), as described above, were used in such situations. In a few cases, loops had to be restructured so as to allow efficient parallelization.
One of the major issues when using the XMT is the speed of input/output. When large volumes of data, such as the distance matrices (which are of size 101 million words) are to be input, these have first to be loaded into a special parallel file system and then transferred in parallel into the 1 Terabyte shared memory of the XMT. Ordinary C input/ output functions are executed serially and cannot be used for large data volumes.
Our code occupied about 120 GB of memory and ran at the rate of about 1.1 hours per stage, for a total of 35 hours for a 32-stage problem, using 128 processors.

S-OIV EVOLUTION MODELS
A number of research groups have presented models of the emergence of influenza in human hosts from influenza in swine hosts, including S-OIV. These include [4], [27], [28], [29], [30]. Figs. 3, 4, and 5 present the salient features of these models.
In work by Dawood et al. [4], Olsen [27], Trifonov et al. [28], and Kingsford et al. [29], "classes" of viruses such as "Classical H1N1 swine," "Eurasian H1N1 swine," etc. are mentioned. We have been unable to find enumerated lists of these classes of viruses in the literature. This lack of specificity complicates the comparison of the results of our algorithm with those of other research groups.
Consider class "Eurasian Swine Influenza" shown in of [29, Fig. 3], which is a phylogenetic tree for the NA segment and includes many avian viruses, e.g., A/duck/Nanchang/ 1904/1992, A/goose/Italy/296426/2003, A/chicken/ Hebei/718/2001, etc. The rationale behind this inclusion is not explained and appears to be based only on the similarity between the NA segments and not on any clinical evidence of infection of swine with avian viruses.

VALIDATION OF MODEL
We ran our reassortment network for different target viruses. To start with, we present the results of two runs with non-SOIV viruses as targets. The objective here is to test our algorithm against prior results by Karasin et al. [31]. The following validation covers virus evolution over the period 1977-1999, while the S-OIV investigation (Section 5, below) covers the years 1930-2009. The same master database of intersegment distances was used in both cases; the reassortment algorithm is insensitive to the year of the target virus.

Constraints
We used a threshold of 500: if two viruses differed in more than distance 500 in any one segment, the mutation or reassortment event was ignored. This is necessary to get useful information from our reassortment network. This is because the algorithm searches for shortest paths in terms of sums of edge weights. If a high threshold (or no threshold) is used, all evolutionary paths will be only a few edges long, thus obscuring fine grained information on mutations and reassortments. If a zero threshold is used, no path will be found (as the target will be disconnected from the rest of the graph). For our 5,016 virus, 35 stage problem, we found that 500 was a choice of threshold that gave useful information. This phenomenon is discussed in Section 5.4 and Fig. 8 of our previously published paper [19].
We allowed reassortments between viruses i and j to yield k only if the years of i and j were the year of k. Under these constraints, some viruses may not participate in paths to the target.

Interpreting In-Trees
As described in Fig. 6, a black box represents a virus with the number on the right of the box indicating the length of the shortest path from that virus to the target. Two black boxes connected by a black line indicate a mutation event. Each dashed red box indicates a reassortment, with the segment name and number indicated on the right hand side. In Fig. 6, A/PuertoRico/8/34 (which has total distance 2,348 from the target (not shown)) mutates into A/Alaska/ 1935 which is at distance 2,407 from the target. A/Albany/ 1618/1951 reassorts: it obtains segment 1 (PB2) from A/ HongKong/117/1977 to become A/Tientsin/78/1977.
Virus names that are too long to fit in available space are truncated (indicated with "Ã") and full names are available in the corresponding digital annotations in the pdf files in the online Supplemental Material. Fig. 7 shows part of the tree resulting from a run with A/ swine/Colorado/1/1977(H3N2) as target. The full tree is in Supplemental file ASwineColorado.pdf. The blue cut associates (on its right hand side) viruses that are distance 900 from the target virus. These viruses are exclusively of human origin. Nonhuman origin viruses, on the left side of the cut, have weight greater than 1,800. There is a clear differentiation between human (<900) and nonhuman Fig. 4. The model presented by Trifonov et al. [28] is a subset of the model in Fig. 3; it omits a stage of swine H3N2 viruses.  [30] differs significantly from the preceding two models. Here, "Eurasian H1N1 swine" do not stand in isolation but are derived from Avian H1N1.

A/Swine/Nebraska/209/98
The in-tree from a run with target A/swine/Nebraska/209/ 98(H3N2) is shown in Fig. 8. The full tree is in Supplemental file ASwineNebraska.pdf. According to Karasin et al. [31], the PA and PB2 segments should be derived from avian viruses and the remaining from human. In the tree of Fig. 8, the lowest cost paths (blue box) pass through reassortments with avian viruses to obtain PA and PB2 just before the target. This partially supports Karasin et al. [31], since these are two disjoint sets of paths and not one path with two reassortments, as we would have expected. We conjecture that this is due to missing data-a richer data set might have yielded the expected path.

Parameters of 32-Stage Run
We now present the results of a run with the exemplar S-OIV virus A/California/04/2009 as target. This 32-stage run took 35 hours on a 128 processor Cray XMT. The reassortment network generates an in-tree of shortest paths from 5,015 viruses to A/California/04/2009. In addition to the constraints stated in Section 4.1, we used a reassortment overhead of 10 units to suppress trivial, small distance reassortments that would otherwise clutter up our paths. Such reassortments are indistinguishable from point mutations and drive interesting large-distance reassortments out of the 35-stage range of our experiment. The overhead is incorporated by adding 10 units to the reassortment edges of our network.

Suppressing Intra-S-OIV Events
As a result of the intense interest in the pandemic strains, the number of S-OIV viruses in our database is a disproportionately large fraction of the total viruses. Consequently, the target virus is surrounded by a very large subtree that represents intra-S-OIV evolution and obscures other evolutionary events. To concentrate on our immediate objective of tracing the origin of the exemplar S-OIV virus A/California/04/2009, we suppressed paths through all other S-OIVs by temporarily setting their distances from everyone else to infinity.

Viewing the In-Trees
Despite the constraints mentioned in Section 4.1, the size of the database is such that very large trees are still generated and it is a challenge to visualize the results. After trying several approaches, we have chosen to separate the in-trees by the years of their source viruses. Thus, Fig. 14 in the online Supplemental Material, Appendix J, shows the in-tree for viruses from the 1930s, 40s, 50s while Fig. 15 shows those from the 1960s (trees for subsequent decades are available in the online Supplemental Material). For clarity, even with this approach, we needed to prune paths in which the leaf nodes were highly similar in terms of their distance to target. As the number of sequenced viruses has been increasing dramatically since the 1960s, it is impossible to show results from the 1970s onward on paper. Full information from the run is available as a spreadsheet as described below.
In Supplemental Fig. 14 These viruses are two edges away from the target, have path weight 1;010 and form part of an important "bottleneck" set that we will discuss below. The annotated tree is in Supplemental file tree304050.pdf.
The in-tree for the 1960s is shown in Supplemental Fig. 15. It is noteworthy that the bottleneck set has now expanded to include A/swine/HongKong/1110/2006(H1N2) and that most paths pass through this set. The annotated tree is in Supplemental file tree60.pdf. Supplemental file tree70.pdf holds the in-tree for the 1970s. The bottleneck viruses are the same as for the 1960s, and have path weight to target 1;010. A number of additional viruses now occur at distance two edges from the target, but these have path weight >2;000, which is significantly greater than the bottleneck viruses. For the 1980s, Supplemental file tree80.pdf has a new bottleneck virus: A/swine/ Shanghai/1/2007. This has slightly greater distance, i.e., 1,030, than the others, but is still significantly smaller than other viruses at distance two edges from target. Supplemental file tree90.pdf shows a new bottleneck virus A/ Iowa/CEID23/2005(H1N1), with distance 1,008 to target.
For viruses that were identified in 2000 and later, an annotated tree is available in Supplemental file tree00.pdf. Fig. 9 shows a selection of paths that pass through a set of six bottleneck viruses, five of which were encountered in previous decades. The new bottleneck is A/ swine/Shanghai/1/2007(H1N2).
The virus A/duck/NC/91347/01(H1N2) is located in the same position in the tree as the bottleneck viruses and has distance to target of 954, which is smaller than any of the bottlenecks. However, in our run, only one path is found through it, unlike the bottleneck viruses which have hundreds of paths, as described below. We, therefore, do not include this in our list of bottleneck viruses, but do consider it worthy of future analysis in a molecular biology context because of its low distance.

Bottleneck Viruses
Analysis of the results of the 32-stage run reveals that 3,600 out of 3,926 paths pass through a set of bottleneck viruses before reaching the S-OIV A/California/04/2009, as shown in Table 1. These viruses are: to obtain the NA segment before reaching A/California/04/ 2009. Viruses 1-6 are obtained by reassortments with a number of different viruses. Some of these are the bottlenecks themselves. For example, one of the possible evolutionary paths in Fig. 9 Fig. 9. A subset of the in-tree for A/California/04/2009 that shows only source viruses (i.e., leaf nodes) from 2000 to 2009. The first six of seven viruses in the rightmost column are "bottleneck" viruses, with hundreds of paths through them. Only a few representative paths are shown. Black boxes represent viruses; red boxes stand for reassortment events (see Fig. 6). A/duck/NC/91347/01 is not a bottleneck, as discussed in text. Fig. 10 illustrates the bottleneck viruses. This subtree was obtained from the full tree by only including paths that pass through bottleneck viruses 1-6, listed above, and excluding everything beyond the second reassortment.
A very large proportion of paths reach the S-OIV target through the bottleneck viruses, as shown in Table 1. This strongly suggests that these paths represent actual evolutionary events. Note, however, that only a subset of these paths is shown in the figures discussed above. The fact that these six bottleneck viruses occur repeatedly and at the lowest distance found from the target A/California/04/ 2009 suggests that they are important in the evolutionary history of S-OIV influenza A.

Spreadsheet
The Supplemental Material includes a spreadsheet SOIVspread.xls that lists all paths to A/California/04/2009, from the 32-stage run, sorted by year of source virus and then by path cost. Bottleneck viruses are marked with "ÃÃÃ" in this spreadsheet and in the tables that follow below. Explanation of the notation used in this spreadsheet appears in the Supplemental Appendix I.

DISCUSSION
Our analysis yields detailed information on the series of reassortments and mutations required to transform a set of viruses to S-OIV viruses. We have identified six bottleneck viruses that almost invariably occur on shortest paths found to the target S-OIV virus.
To investigate these bottleneck viruses further, Table 2 shows details of the reassortments that occur in the final stage of the network. The entries in this table can be interpreted using the notation given in Section 2.    7½6; tÞ shows the distances between the reassorted virus (i.e., the virus obtained by replacing segment 6 (NA) of virus 1 with the corresponding segment of virus 7) and the target virus t. Further elucidation of the reassortment notation is available in the Supplemental Appendix C.
In Table 2, all bottleneck viruses have small differences from the target virus in all segments except NA. A reassortment with A/swine/HongKong/NS29/2009, which has low distance (160) for the NA segment results in a virus that is very close to the target. Furthermore, the years of the three viruses in each possible reassortment are obviously consistent (as a result of constraints built into the reassortment network). Finally, most of these viruses are remarkably close to each other, as Table 3 demonstrates. This distance matrix gives the absolute distances (sums of the segment-wise distances) between all pairs of bottleneck viruses. The four H1N2 viruses are very close to each other as are the two H1N1 viruses. Tables 4 and 5 show the shortest paths from our run, organized by number of edges and permit us to identify unusual events. We would expect the path length to increase with the number of edges in a path across paths and within paths and this does, in general, happen. However it is noticeable that the weight of the shortest path for edges ¼ 2 in Table 4 is remarkably small. This path starts in A/duck/ NC/91347/01, a virus that we have already noted, in Section 5.3, as being oddly similar to the bottleneck viruses (though only having a single path through it). Olsen et al. [32] described this strain in 2003 and showed that it has great similarity with swine origin viruses. Similarly, for paths with three edges, A/swine/OH/511445/2007 (not a bottleneck) has an unusually low distance to the target, compared with other viruses with the same number of edges.
In Table 5, it is immediately noticeable that nonbottleneck A/Swine/Nebraska/209/98 has path length 1,441 (with edges ¼ 6) which is significantly different from other paths that are %2;550 for edges ¼ 5, 7, and 8.
Turning to Table 6, we note that while the oldest shortest path is 2,066 for A/swine/USA/1976-MA/1931, the shortest paths for later decades (1940; . . . ; 1960) are not significantly different. In the 1970s, some biological event results in a large multiplicity of shortest paths of length %2;000. In subsequent decades the shortest path lengths decrease significantly, dropping to 950-1500 in the 2000s. Once again we see A/ duck/NC/91347/01 standing out because it is dated 2001, unlike the remaining viruses from this decade that are from 2004 or later.

Significance of Bottleneck Viruses
As far as we are aware, the discovery of bottleneck viruses is new to the field and has not been reported elsewhere.    (Fig. 3). 2 Matches only Smith et al. (Fig. 5). 3 Does not match any of Figs. 3, 4 or 5.  Fig. 10 (see Section 6.2). A/swine/Shanghai/1/2007 is also included in Table 1 of Trifonov et al. [28]. Both Hong Kong viruses are listed in Table S3 in Smith et al. [30, ].
A/swine/Kansas/77778/2007 is a particularly virulent strain discussed by Ma et al. [36]. This strain also appears in the phylogenetic trees given by Kingsford et al. [29]. The bottleneck viruses are, thus, the key members of the large sets that have been identified by other researchers as being important to the SOIV evolution. Our algorithmic technique highlights these six viruses as being crucial to the SOIV pandemic. We were able to discover these viruses via search of a comprehensive database without reliance on preconceived notions of lineages to sample. It is now for biologists, virologists, and epidemiologists to apply molecular biology techniques to establish the functional reasons for the prominence of these viruses. Our research, which is of a purely algorithmic nature, will proceed in the directions given in Section 6.4.
One of the main accomplishments of our research is to provide means for identification of specific viral isolates that are likely similar to ancestors of epidemic viruses. By concentrating on actual viral isolates rather than inferred ancestors as in phylogenetics, we assure that our results are functionally plausible. In phylogenetics, the goal is to infer median states at ancestors to optimize an objective function over edit costs-there are typically no functional constraints attempted or implied (but see Wheeler [38] for a counterexample). The specificity and functional constraints inherent to our method are important because they permit other researchers, especially in molecular biological domains, to use our results to choose in vitro and in vivo models.

Nonbottleneck Viruses of Interest
In addition to the bottleneck viruses, A/duck/NC/91347/ 01, A/swine/OH/511445/2007, and A/Swine/Nebraska/ 209/98 are worthy of detailed study because of their unusually low distance to the target.

Validation of Reassortments
It is of great interest to compare the reassortments in the paths of Tables 4 and 5 against the models described in Section 3. We consider a reassortment indicated in our intree as matching a reassortment in Figs. 3, 4, or 5, if there is a path in the figures matching (in terms of subtype, host) the reassortment from the in-tree. For example, in Table 4 the three-edge path for 1977 gets PB1 from H3N2 swine. There is a path from H3N2 swine to SOIV in each of Figs. 3, 4, and 5.
All reassortments in Table 4 and most in Table 5 match the models of Olsen and Kingsford, Nagarjan and Salzberg (Fig. 3), Trifonov, Khiabanian, and Rabadan (Fig. 4), and Smith et al. (Fig. 5). Those that do not match are indicated by footnotes in Table 5.
As stated in Section 3, the unavailability of enumerated lists of classes of viruses (e.g., "Classical H1N1 swine," etc.) limits the granularity of our validation. Nevertheless, we see a broad agreement between our reassortments and the results of other researchers, which indicates that our research is consistent with, and a useful addition to, the existing knowledge and methods base. The in-trees generated by our algorithm provide an alternate and functionally actionable model for analyzing the evolution of S-OIV and other viruses.

Future Research
Some issues that present themselves for future work are: 1. Bokhari and Janies [19] have proposed means for incorporating temporal, geographic, and host constraints in reassortment networks. These would result in more precise analyses. 2. Given that fine grained data are available for the present pandemic, it is of great interest to monitor intra-SOIV evolution, and 3. Refine the algorithm to repeat this analysis at a finer granularity, using distances between specific proteins encoded in segments, rather than global distances between segments. . For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.