Portal protein diversity and phage ecology

Oceanic phages are critical components of the global ecosystem, where they play a role in microbial mortality and evolution. Our understanding of phage diversity is greatly limited by the lack of useful genetic diversity measures. Previous studies, focusing on myophages that infect the marine cyanobacterium Synechococcus, have used the coliphage T4 portal-protein-encoding homologue, gene 20 (g20), as a diversity marker. These studies revealed 10 sequence clusters, 9 oceanic and 1 freshwater, where only 3 contained cultured representatives. We sequenced g20 from 38 marine myophages isolated using a diversity of Synechococcus and Prochlorococcus hosts to see if any would fall into the clusters that lacked cultured representatives. On the contrary, all fell into the three clusters that already contained sequences from cultured phages. Further, there was no obvious relationship between host of isolation, or host range, and g20 sequence similarity. We next expanded our analyses to all available g20 sequences (769 sequences), which include PCR amplicons from wild uncultured phages, non-PCR amplified sequences identified in the Global Ocean Survey (GOS) metagenomic database, as well as sequences from cultured phages, to evaluate the relationship between g20 sequence clusters and habitat features from which the phage sequences were isolated. Even in this meta-data set, very few sequences fell into the sequence clusters without cultured representatives, suggesting that the latter are very rare, or sequencing artefacts. In contrast, sequences most similar to the culture-containing clusters, the freshwater cluster and two novel clusters, were more highly represented, with one particular culture-containing cluster representing the dominant g20 genotype in the unamplified GOS sequence data. 
Finally, while some g20 sequences were non-randomly distributed with respect to habitat, there were always numerous exceptions to general patterns, indicating that phage portal proteins are not good predictors of a phage's host or the habitat in which a particular phage may thrive.

Studying the diversity of phages has proven difficult because no universal gene, analogous to the 16S rRNA gene used for microbes, exists throughout all phage families (Paul et al., 2002). Thus family-specific genes have been proposed for use as taxonomic tools in phage ecology (Rohwer and Edwards, 2002). One such marker, a homologue to the coliphage T4 portal protein gene 20 (g20), has been developed to study the diversity of cyanomyophages. Initial studies (using denaturing gradient gel electrophoresis and terminal-restriction fragment length polymorphism banding patterns) revealed variability in g20 diversity across gradients in space and time from a variety of different environments (Wilson et al., 1999; Frederickson et al., 2003; Dorigo et al., 2004; Wang and Chen, 2004; Mühling et al., 2005; Sandaa and Larsen, 2006). These studies concluded that g20 diversity was as great within a sample as between oceans (Wilson et al., 1999), that phage g20 diversity increased as Synechococcus abundance increased (Wilson et al., 1999; Frederickson et al., 2003; Wang and Chen, 2004; Sandaa and Larsen, 2006) and that some g20 types were ubiquitous in the habitats examined (Wilson et al., 1999; Frederickson et al., 2003; Dorigo et al., 2004). In addition, a temporal study by Mühling and colleagues (2005) correlated 'cyanophage' diversity (inferred from g20 sequence types) with Synechococcus diversity (inferred from rpoC1 sequence types).
Subsequent cloning and sequencing of g20 PCR amplicons from both cultured isolates and wild populations have allowed phylogenetic analyses of cyanomyophage diversity. Although initial studies (Zhong et al., 2002) suggested some correlation between ocean habitat and g20 phylogeny (e.g. phylogenetic cluster II represents 'open ocean' g20 sequences), further sampling revealed that this was not the case, as seven g20 sequences from coastal Synechococcus myophages isolated from Rhode Island waters clustered with the putative 'open ocean' sequences (Marston and Sallee, 2003). As more g20 sequence data have accumulated from diverse environments (Zhong et al., 2002; Marston and Sallee, 2003; Dorigo et al., 2004; Short and Suttle, 2005; Sandaa and Larsen, 2006; Wilhelm et al., 2006), it has become clear that marine g20 sequences form nine phylogenetic clusters (first described by Zhong et al., 2002), and g20 sequences originating from freshwater environments form a separate, tenth cluster (Dorigo et al., 2004; Short and Suttle, 2005; Wilhelm et al., 2006). Three of the nine marine clusters (clusters I-III in Zhong et al., 2002) contain cultured representatives (hereafter called 'culture-containing clusters'), whereas the remaining six marine clusters (clusters A-F) and the 'freshwater' cluster do not (hereafter called 'environmental-sequence-only clusters'). The cultured representatives were isolated using only Synechococcus hosts (seven strains: WH7803, WH7805, WH8007, WH8012, WH8018, WH8101, WH8113), which undoubtedly limits the diversity represented, considering the larger diversity of Synechococcus strains (Rocap et al., 2002; Fuller et al., 2003; Ahlgren and Rocap, 2006) and that the sister genus Prochlorococcus is also abundant in open ocean waters. This raises the question: could these seven environmental-sequence-only clusters represent novel cyanomyophages that infect this broader diversity of Synechococcus host strains, Prochlorococcus or other cyanobacteria?
To address this question, we isolated phages on a broad diversity of Prochlorococcus and Synechococcus hosts (Table 1), sequenced their g20 homologues and analysed their diversity in the context of published PCR-generated sequences from natural populations. We then combined the g20 sequences from these new cultured isolates with all environmental g20 sequences available [including all PCR-generated environmental sequences, as well as primer-independent sequences available in the Global Ocean Survey (GOS) metagenomic data set], to examine the broad diversity of g20 observed in the wild. This allowed us to ask: do any of the new environmental sequences cluster with the previously observed environmental-sequence-only clusters? Furthermore, are g20 sequence clustering patterns ecologically meaningful? Do they reflect the habitat (and, by inference, the microbial community) of the site from which they were isolated?

Results and discussion

Analysis of g20 diversity captured by several g20 primer sets
As our understanding of marine myoviruses has grown over the years, multiple primer sets have been developed and used to specifically amplify cyanomyophage g20 sequences from field samples (Fuller et al., 1998; Wilson et al., 1999; Zhong et al., 2002; Frederickson et al., 2003; Marston and Sallee, 2003; Dorigo et al., 2004; Wang and Chen, 2004; Sandaa and Larsen, 2006; Wilhelm et al., 2006). Each of these primer sets was designed based on a limited number of sequences from cultured isolates. Thus we wondered how well these primer sets would capture the diversity of g20 sequences in our relatively extensive Prochlorococcus and Synechococcus cyanophage collection (Table 1). We found that the CPS4GC/5 primer set (Wilson et al., 1999) amplified g20 sequences from 80% of the cyanomyophages screened (bold entries in Table 1). This primer set, however, amplifies only a small region of this gene (~165 bp), thus its utility for subsequent phylogenetic analyses is limited. In contrast, the CPS1/8 primer set (Zhong et al., 2002), which captures a larger segment of the gene (~594 bp), amplified the g20 sequence of only 56% of the cyanomyophages screened (Table 1). Using genome sequence data from two Prochlorococcus cyanomyophages (Sullivan et al., 2005) that became available after these primer sets were designed, we modified the CPS1/8 primer set with the hope of amplifying g20 from all of our isolates for use in subsequent phylogenetic analyses. Indeed, the redesigned set (CPS1.1/8.1) captured g20 homologues from all cyanomyophage isolates screened (Table 1). Despite its degeneracy, the redesigned primer set remained specific only for cyanomyophage isolates, as inferred from repeatedly negative PCR results against the sipho- and podo-cyanophages, as well as the non-cyanomyophages we examined (Table 1).

Table 1 notes. a. M, P and S represent the virus families Myoviridae, Podoviridae and Siphoviridae respectively, as determined by morphology. 'M?' indicates that the assignment is based solely on amplification and sequencing of a g20 PCR product and has not been confirmed with electron microscopy. b. Reference where cultured isolate was originally described: 1, Sullivan and colleagues (2003); 2, this study; 3, Waterbury and Valois (1993); 4, Marston and Sallee (2003); 5, Wilson and colleagues (1993); 6, Zhong and colleagues (2002); 7, Wichels and colleagues (1998). '+' indicates positive PCR amplification; '-' indicates that there was no PCR product of the expected size. The new g20 sequences contributed in this study are shown in bold letters. CPS1.1/8.1 is the new primer set designed for this study, while CPS4GC/5 and CPS1/8 were published previously.

Phylogenetic relationships of g20 sequences
We next analysed how these new g20 sequences from cultured isolates compared with selected sequences (see Experimental procedures) from the databases (Fig. 1). Randomly paired g20 sequence identities from this data set ranged from 59% to 100% amino acid identity, notably with some identical g20 protein sequences observed multiple times (alphanumeric clusters #1-13 in Fig. 1). This is not unprecedented: even at the level of the gene, identical viral sequences have been previously reported from vastly different aquatic environments using two separate gene markers including g20 (Zhong et al., 2002; Marston and Sallee, 2003; Short and Suttle, 2005) and DNA polymerase (Breitbart et al., 2004b; Breitbart and Rohwer, 2005).
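The pairwise identity range quoted above can be illustrated with a minimal sketch. It assumes the sequences are already aligned to equal length and that identity is scored only over columns where neither sequence has a gap; the text does not state which denominator was used, so that choice, the function names and the toy sequences are all hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: gap-containing columns are excluded from the comparison.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def identity_range(alignment: dict) -> tuple:
    """Lowest and highest pairwise identity across all sequence pairs."""
    values = [
        percent_identity(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    ]
    return min(values), max(values)


# Toy aligned protein fragments (hypothetical, not real g20 data).
toy = {
    "phage_a": "MKT-LLV",
    "phage_b": "MKTALLV",
    "phage_c": "MRTALIV",
}
lowest, highest = identity_range(toy)
print(f"pairwise identity range: {lowest:.1f}% to {highest:.1f}%")
```

A different gap treatment (for example, counting gap columns as mismatches) would shift the lower bound, so any comparison with published values should use the same convention.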
In phylogenetic analyses, 40 of 45 g20 sequences from cyanomyophages (38 new, 7 previously published) grouped within the clusters that contain cultured representatives (I, II and III), four fell into a new monophyletic cluster (indicated by 'P-SSM9/11/12 new cluster' on Fig. 1), and one (P-ShM1) fell onto a long branch. None fell into the previously defined (by Zhong et al., 2002) environmental-sequence-only clusters A-F, which were thought to be from marine cyanomyophages because of the use of isolate-designed and -tested 'cyanophage-specific primers'. Thus either our phage culture collection is still not diverse enough to represent the g20 diversity of phages that infect marine cyanobacteria, or the sequences in the environmental-sequence-only clusters A-F represent myophages that infect other hosts. Observations made by Short and Suttle (2005) lend support to the latter. They found three g20 sequences in waters 3246 m deep in the Arctic Chukchi Sea, waters unlikely to contain cyanobacteria and their phages, which grouped with cluster A.
Given our extensive host range information for these cyanobacteria phage-host systems, we examined g20 clustering patterns for relationships with respect to the host strains upon which the phage were isolated or could cross-infect. None of the three culture-containing clusters (I, II, III) was composed solely of g20 sequences from phages with similar hosts (Fig. 1), and no clear-cut patterns emerged when subclusters within these clusters were evaluated. This is consistent with the observations of Stoddard and colleagues (2007), who recently reported that g20 sequences could not predict the pattern of cross-resistance observed when selecting for cyanophage resistance (Stoddard et al., 2007). Thus for the Prochlorococcus/Synechococcus/myophage system in Fig. 1, it appears that commonly used phage and host genetic markers lack the ability to predict either the range of hosts that a phage can infect, or the range of phages to which a host is susceptible.

Fig. 1. Evolutionary relationships determined using 183 amino acids of the portal protein gene (g20) amplified from cultured phage isolates (names begin with 'S-' or 'P-' and are coloured orange or green for Synechococcus or Prochlorococcus phages respectively) from this study (italicized), as well as previous studies (non-italicized), and environmental g20 sequences (names in black) (Zhong et al., 2002; Marston and Sallee, 2003). Clusters defined by Zhong and colleagues (2002) are as follows: clusters I-III contain g20 sequences from cultured phage isolates, while clusters A-F represent only environmental g20 sequences. Clusters containing identical g20 protein sequences are numbered (1-13). For cultured phages, the phage isolate names are followed by black lettering that indicates the original host strain used for isolation, while the phage host range is indicated as high light-adapted Prochlorococcus (green circle or dash), low light-adapted Prochlorococcus (blue circle or dash) or Synechococcus (orange circle or dash). A circle indicates that cross-infection was observed within this group of hosts tested, whereas a dash indicates that no cross-infection was observed. Isolates not available for host range testing have no indication of their host range. The tree shown was inferred by neighbour-joining as described in the Experimental procedures. Support values shown at the nodes are neighbour-joining bootstrap/maximum parsimony bootstrap/maximum likelihood quartet puzzling support (only values > 50 are shown). Well-supported nodes (as defined in Experimental procedures) are designated by italicized support values, including six nodes that represent subclusters within the culture-containing clusters I-III. The g20 sequence from the non-cyanomyophage isolate T4 was used as an outgroup to root this tree. The scale bar is in substitutions per position.

We next added more recently published g20 sequences to this analysis, including those from the non-PCR-based GOS metagenomics database (Rusch et al., 2007) and all published PCR-based environmental sequences (Fig. 2, Table 2). Only sequences of sufficient length for phylogenetic analysis were used. The majority (464 of 769) of these environmental sequences, including 401 GOS sequences, grouped in culture-containing clusters I, II and III. First, we found that 13 of the 38 GOS sample sites included in our analysis lack Prochlorococcus and Synechococcus (as determined by dot-blots in Rusch et al., 2007), yet 75 g20 sequences from these sites fell into clusters I, II and III (Fig. 2), thought, from earlier studies, to represent myophages that infect marine picocyanobacteria. Thus it appears that clusters I, II and III likely represent phages that infect a diversity of hosts and are not limited to picocyanobacteria-dominated environments. Second, these analyses revealed that cluster II contains ~10-fold more GOS sequences than clusters I and III (336 versus 32 and 33 respectively). If we ignore possible cloning bias, this suggests that cluster II sequences are by far the most abundant type in the environments sampled. Third, we note that a relatively tiny number of the GOS sequences fell into the environmental-sequence-only clusters (clusters A-F in Fig. 1) that were defined by Zhong and colleagues (2002) (Fig. 2). The 12 that fell into cluster A originated from seven sites with different physicochemical characteristics (see colour rings, Fig. 2).
Even fewer sequences fell into environmental-sequence-only clusters B-F, suggesting that these types of g20 sequences are either extremely rare in the environments sampled to date, or are sequencing artefacts.
This expanded data set lends support for three additional g20 lineages (Fig. 2). These include 93 sequences that group with the previously identified 'freshwater' cluster (Dorigo et al., 2004; Short and Suttle, 2005; Wilhelm et al., 2006; labelled as 'new cluster #1' in Fig. 2), 25 sequences that group with the new culture-containing P-SSM9/11/12 cluster (named after the original phage isolates forming this cluster in Fig. 1, labelled as 'new cluster #2' in Fig. 2) and 84 environmental sequences (74 GOS + 10 non-GOS environmental sequences, labelled as 'new cluster #3' in Fig. 2) of mixed biogeographic and habitat origin that form a new environmental-sequence-only cluster.

Relationship between g20 clusters and habitat
Using Unifrac distance metric statistical tools (Lozupone et al., 2006), we examined the meta-g20 data set for correlates between sequence clustering and habitat descriptors, such as the microbial community type, temperature and salinity of the original sample. As a first approximation of the microbial community type, we used previously defined environmental categories originally inferred from ribotype dot-blots and metagenomic sequence data (figs 9 and 10 in Rusch et al., 2007) for the GOS g20 sequences, then assigned such categories where reasonable assumptions could be made for non-GOS sequences (details in Table 3 legend). We found that the g20 sequence clusters were non-randomly distributed with respect to sequences that originated from freshwater, tropical freshwater, arctic/polar, estuarine, Sargasso and hypersaline environments, while eight other environments lacked statistically significant clustering (Table 3). Beyond habitat-related properties, we also observed non-random g20 sequence distributions relative to abiotic factors, such as salinity (four of five categories significant, Table 4) and temperature (three of five categories significant, Table 5). In both cases, the outermost categories (e.g. 'cold' and 'hot', but not 'medium' for temperature) were significantly structured, but median categories were not. Qualitatively, some of these clustering patterns are also evident in the colour-coded rings in Fig. 2.

Notably, however, clustered sequences, when significantly correlated with a habitat characteristic, always contained exceptions. For example, the 'freshwater' category was one of the most significantly non-random sequence categories (Fig. 2, Tables 3-5). In spite of this, the 'freshwater' cluster also contained 6 sequences from brackish waters, while 68 additional freshwater sequences were distributed elsewhere in the tree (light blue in the outer circle in Fig. 2). Similarly, while sequences in the 'tropical freshwater' category were found to be non-randomly distributed (Table 3), this is likely driven by the 24 sequences that form a well-defined subcluster within cluster II (GOS site 20 subcluster, Fig. 2). However, another 18 sequences from this same sample are scattered throughout the rest of the tree (11 in cluster II, 4 in cluster I and 3 in other clusters).

Table 2 notes. The 'PCR-based' column indicates whether the environmental sequence was obtained by PCR or metagenomic approaches (N/A indicates that this is not applicable for sequences from cultured phage isolates). Reference code: 1, Rusch and colleagues (2007); 2, Short and Suttle (2005); 3, Marston and Sallee (2003); 4, Wilhelm and colleagues (2006); 5, Dorigo and colleagues (2004); 6, Zhong and colleagues (2002); 7, this study; 8, T4-like phage genomes website http://phage.bioc.tulane.edu/

Table 3 legend. Qualitative characterization of the relative abundance of dominant ribotypes using published data from the GOS (detailed data available in Rusch et al., 2007). These data represent only those microbes captured in 0.1-0.8 μm size fraction samples, except for the Fringing Reef sample, which is the 0.8-3.0 μm size fraction. No data are available for freshwater and arctic/polar samples because these were not part of the GOS sampling expedition. The Unifrac distance metric (Lozupone and Knight, 2005) was used for the analysis. A P-value < 0.05 (italicized) indicates that sequences from that category are non-randomly distributed with respect to habitat in the phylogenetic analysis. In the Unifrac analysis presented here, we used the environmental categories given to the GOS sample g20 sequences by Rusch and colleagues (2007) (inferred using ribotype dot-blots and shared metagenomic content; figs 9 and 10 in Rusch et al., 2007), whereas we assumed which environmental category non-GOS sequences belonged to as follows: (i) Woods Hole, Plymouth, NE Providence Channel and Rhode Island waters were considered 'temperate ocean -north' (akin to GOS sample 8, Newport Harbor, RI), (ii) freshwater, the Sargasso Sea or estuaries were considered 'freshwater', 'Sargasso Sea' or 'estuary' respectively, (iii) arctic or polar water sequences were given their own category. We did not assume an environmental category for non-GOS samples originating from the Red Sea, Atlantic Ocean continental shelf and slope waters, Dauphin Island and Gulf Stream, so they were not used in this analysis (temperature and salinity data were available for many of these samples, so they were used in subsequent analyses). A total of 698 categorized sequences were used in the Unifrac analysis. To provide an overall picture of the microbial community for each environmental category, we provide qualitative relative abundance microbial community data for each environmental category inferred from the ribotype data published for the GOS samples in Rusch and colleagues (2007).
In other words, while some patterns emerge, exceptions are so frequent that one must conclude that the g20 sequence is not a good predictor of the habitat from which the phage originated. This is perhaps not surprising given that the sheer abundance of phages on the planet (10³¹ phages) and the apparent promiscuity of viral-host interactions allow a lot of 'rule breakers' to persist. For example, not only can viral particles survive the physical challenges of extreme environmental shifts (Breitbart et al., 2004c), but viruses from one environment (e.g. freshwater Great Lakes) are also readily capable of infecting hosts from another environment (e.g. oceanic Synechococcus; Wilhelm et al., 2006). Further, in coliphage T4, the g20 gene encodes a portal protein (Marusich and Mesyanzhinov, 1989) involved in functions quite removed from the direct interaction between phage and host. In contrast, the distal tail fibre gene is known to be the direct determinant of host range in T-even coliphages (Henning and Hashemolhosseini, 1994; Tetart et al., 1998). Thus, g20 sequence patterns might no longer correlate to host range at the fine scales (e.g. cyanobacteria and their phages) where host range 'jumps' could more commonly occur (e.g. by simple tail fibre-switching sensu Tetart et al., 1998) that would de-couple host properties from vertically evolved g20 sequence lineages.

Fig. 2. Evolutionary relationships determined using 554 base pairs of the portal protein gene (g20) from 769 available g20 sequences. Clusters defined by Zhong and colleagues (2002) are identified as culture-based clusters I-III and environmental-sequence-only clusters A-F. New clusters defined since Zhong and colleagues (2002) are indicated with the preface 'new cluster', a number and a brief description. The tree shown is the consensus (majority rules) tree from 11 GARLI iterations inferred using the maximum likelihood criterion (see Experimental procedures), with the Aeromonas phage Aeh1 g20 sequence used as an outgroup to root the tree. Three colour rings reflect the habitat type from which the g20 sequence originated. For most of these sequences (GOS sequences), there is ribotype dot-blot and metagenomic information about the microbial community structure at the site, while for non-GOS sequences such information was assumed where reasonable to do so (see Table 3 legend). The inner ring is the microbial community structure information listed as Rusch and colleagues (2007)-defined environmental categories, while the other two rings reflect the temperature and salinity of the original sampling site.

Table 4 legend. The Unifrac distance metric (Lozupone et al., 2006) was used for the analysis. Salinity values, when not available from the published work, were obtained from the communicating author of the paper in which the g20 sequence was first reported. All freshwater samples were assumed to have a salinity of < 0.50 ppt. All but the sequences from brackish waters clustered non-randomly (P < 0.05) with respect to the habitat type as defined by salinity.

Table 5 legend. The Unifrac distance metric (Lozupone et al., 2006) was used for the analysis. Temperature values, when not available from the published work, were obtained from the communicating author of the paper in which the g20 sequence was first reported. All but the sequences from moderate temperatures clustered non-randomly (P < 0.05) with respect to the habitat type as defined by temperature.

Concluding remarks
Taken together, these data reveal that oceanic phage g20 sequence clustering patterns are, at a fine level (e.g. cyanobacteria-cyanophages), largely uncorrelated to host factors. As one zooms out to more generally consider the relationship between g20 sequences from the wild and the habitat characteristics from which they were collected, we find that they are non-randomly distributed, reflecting in some cases a connection between habitat properties, microbial community structure and phage community composition as defined by the g20 gene. We posit that the latter patterns, when evident, reflect host range-limited vertical evolution of g20 sequences, while the former reflects highly specific 'tip-of-the-tree' phage-host interactions whose evolution is disconnected from that of the g20 protein product.

PCR amplification and sequencing
Previous g20 PCR primer sets [non-degenerate CPS4GC/CPS5 (Wilson et al., 1999) and degenerate CPS1/CPS8 (Fuller et al., 1998; Zhong et al., 2002)] were designed to amplify ~200 bp and ~592 bp fragments, respectively, of the T4 g20 homologue in myophages. The PCR reactions for CPS4GC/CPS5 and CPS1/CPS8 were conducted as described previously (Wilson et al., 1999; Zhong et al., 2002). Briefly, 2 μl of cyanophage lysate was added as DNA template to a PCR reaction mixture (total volume 50 μl) containing the following: 20 pmol each of forward and reverse primer, 1× PCR buffer (50 mM Tris-HCl, 100 mM NaCl, 1.5 mM MgCl2), 250 μM of each dNTP and 0.75 U of Expand high-fidelity DNA polymerase (Roche, Indianapolis, IN). The PCR amplification was carried out with a PTC-100 DNA Engine Thermocycler (MJ Research, San Francisco, CA). Optimized thermal cycling conditions varied slightly from those reported, as follows: CPS4GC/CPS5 required an initial denaturation step of 94°C for 3 min, followed by 35 cycles of denaturation at 94°C for 1 min, annealing at 50°C for 1 min, ramping at 0.3°C s⁻¹, and elongation at 73°C for 1 min, with a final elongation step at 73°C for 4 min, whereas both primer sets CPS1/CPS8 and CPS1.1/CPS8.1 required an initial denaturation step of 94°C for 3 min, followed by 35 cycles of denaturation at 94°C for 15 s, annealing at 35°C for 1 min, ramping at 0.3°C s⁻¹, and elongation at 73°C for 1 min, with a final elongation step at 73°C for 4 min. Systematic PCR screening using various primer sets was conducted using the same PCR reaction conditions and amplification protocol, but replacing the high-fidelity DNA polymerase with the less expensive Taq DNA polymerase (Invitrogen, Carlsbad, CA) and using only 20 μl reactions, as replicate (range 3-8) PCR reactions were pooled before sequencing to decrease PCR bias (Polz and Cavanaugh, 1998). In all cases, a 5-10 μl aliquot of PCR product was analysed in a 1.5% TAE gel stained with EtBr.
The gel image was captured and analysed with an Eagle Eye II gel documentation system (Stratagene, La Jolla, CA). For purification and sequencing, replicate PCR reactions were combined, run out on a 1.5% TAE gel and purified using the QIAGEN QIAquick gel extraction kit (Qiagen, Valencia, CA). The purified PCR products were sequenced directly on both strands using the degenerate PCR primers used to obtain the product (CPS1, CPS8, CPS1.1, CPS8.1), with best results at primer concentrations ~10-fold those suggested by the sequencing facility (40 pmol per reaction). To have greater confidence in negative PCR results, templates that did not produce amplified product were tested against optimized primer sets multiple times (data not shown). To confirm that our correctly sized amplicons from 'positive' PCR reactions were in fact g20 sequences, we sequenced the products. In all cases, the amplicon sequences were from g20 homologues. Where identical g20 sequences were observed in our study, we confirmed that the match was real and not the result of PCR contamination by re-amplifying and sequencing directly from fresh phage isolates (e.g. for P-SSM4, P-RSM3, S-SSM2 and 'Syn' phages Syn2, Syn9, Syn10, Syn26, Syn30, Syn33, Syn1, Syn19), many of which were obtained from stocks kept at a separate institution.

Phylogenetic analysis
For the new sequences presented in Fig. 1 of this study, paired sequence data were aligned using ClustalW (Thompson et al., 1997) and corrected manually using the sequence chromatograms. Consensus sequences for each cyanophage isolate were then translated in-frame into amino acids. Published g20 sequences from PCR-amplified environmental clone libraries and phage isolates were screened by building preliminary neighbour-joining trees to select representative sequences that spanned the known g20 diversity, and these were added to this data set. Multiple sequence alignments of translated amino acid consensus sequences were done with ClustalW using the Gonnet protein weight matrix, a gap opening penalty of 15 and gap extension penalty of 0.30 (although changing these penalties did not significantly alter the alignments). Phylogenetic reconstruction was done using PAUP 4.0 (Swofford, 2002) for parsimony and distance trees and Tree-Puzzle 5.0 (Schmidt et al., 2002) for maximum likelihood trees. Evolutionary distances for neighbour-joining trees were calculated based on mean character distances, while evolutionary distances for maximum likelihood trees were calculated using the JTT model of substitution assuming a gamma-distributed model of rate heterogeneities with 16 gamma-rate categories empirically estimated from the data. A heuristic search with 10 random addition replicates using the tree-bisection-reconnection branch swapping algorithm was used for parsimony trees. Bootstrap analysis was used to estimate node reproducibility and tree topology for neighbour-joining (1000 replicates) and parsimony (100 replicates) trees, while quartet puzzling (10 000 replicates) indicates support for the maximum likelihood tree. The g20 sequence from coliphage T4 was used as the outgroup taxon for all analyses.
Phylogenetic analyses of 183 amino acids from viral g20 sequence from 79 taxa yielded robust, similar trees using both algorithmic (neighbour-joining) and tree-searching (parsimony and maximum likelihood) methods. The translated g20 sequences contained phylogenetically informative regions (e.g. for parsimony analyses, 41 positions were constant, 25 were parsimony uninformative and 117 were parsimony informative). Differences between the parsimony, distance and maximum likelihood trees were limited to the branching order of the terminal nodes in a given cluster. To evaluate whether g20 sequence diversity correlated to the host-related properties presented in Fig. 1, we empirically defined a 'well supported node' as one where the average support across all three phylogenetic methods was 80% or greater.
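The 'well supported node' criterion can be written as a one-line check. The sketch below assumes each node carries one support value per method on a 0-100 scale; the function name and the example values are hypothetical, and the text does not say how a node with a missing value for one method would be handled.

```python
def is_well_supported(nj_bootstrap: float, mp_bootstrap: float,
                      ml_puzzling: float, threshold: float = 80.0) -> bool:
    """True when the mean of the three support values is at least `threshold`.

    Arguments are the neighbour-joining bootstrap, maximum parsimony bootstrap
    and maximum likelihood quartet puzzling support for one node.
    """
    mean_support = (nj_bootstrap + mp_bootstrap + ml_puzzling) / 3.0
    return mean_support >= threshold


# Hypothetical nodes: (NJ, MP, ML) support triples.
example_nodes = {
    "node_1": (95.0, 70.0, 78.0),  # mean 81.0
    "node_2": (90.0, 60.0, 85.0),  # mean 78.3
}
for name, support in example_nodes.items():
    print(name, is_well_supported(*support))
```

Averaging lets one strongly supported method compensate for a weaker one, which is a looser rule than requiring every method to exceed the threshold on its own.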

GOS g20 identification, filtering and phylogenetic analyses
Using the 549 bp g20 fragment from all available cultured isolates as queries (Table 1), we retrieved 553 sequence reads with similarity (bit score > 100) to this region of the g20 gene from the GOS databases (downloaded from http://camera.calit2.net/), then combined these GOS sequences with available published g20 sequences. The combined sequences were aligned using Clustal X and filtered to remove short, phylogenetically uninformative sequences, as well as sequences with poor quality at the ends. This manual curation left 769 total sequences (512 GOS sequences, details in Table 2) with 554 aligned nucleotide positions. Eleven maximum likelihood trees were generated using GARLI (Zwickl, 2006), starting from a neighbour-joining topology calculated in PAUP v4b10 (Swofford, 2002). Tree searching was terminated after 100 000 generations with no significantly better scoring topology, and a score improvement threshold for termination of 0.05. Topology mutation proportions were 0.1-0.2 nearest neighbour interchange and 0.8-0.9 limited SPR (subtree pruning-regrafting), with the maximum SPR range of 8-10 branches. From the 11 resulting trees, a majority-rule consensus tree (threshold 50% agreement) was generated in PAUP and is presented in Fig. 2.
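The majority-rule step can be illustrated with a small sketch. It is not the PAUP implementation: it assumes each rooted tree has already been reduced to its set of clades (each clade a frozenset of tip names) and simply keeps clades found in more than half of the trees. The function name and the toy trees are hypothetical.

```python
from collections import Counter


def majority_rule_clades(trees, threshold=0.5):
    """Return {clade: frequency} for clades present in more than `threshold` of trees.

    `trees` is a list in which each tree is an iterable of clades, and each
    clade is a frozenset of tip names.
    """
    # set(tree) guards against counting a clade twice within one tree.
    counts = Counter(clade for tree in trees for clade in set(tree))
    n_trees = len(trees)
    return {
        clade: count / n_trees
        for clade, count in counts.items()
        if count / n_trees > threshold
    }


# Three toy trees on tips A-D (hypothetical).
ab, abc, cd = frozenset("AB"), frozenset("ABC"), frozenset("CD")
toy_trees = [
    [ab, abc],  # ((A,B),C),D
    [ab, abc],  # same topology
    [ab, cd],   # (A,B),(C,D)
]
consensus = majority_rule_clades(toy_trees)
for clade, freq in sorted(consensus.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), round(freq, 2))
```

Two clades that each occur in more than half of the trees cannot conflict with one another, so the retained clades always fit together into a single consensus topology.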
Statistical analyses to evaluate whether g20 clustering patterns uncovered in the phylogenetic reconstructions were related to the habitat features of the original sample (e.g. microbial community type, temperature and salinity) were carried out using the Unifrac distance metric statistical tools available at http://bmf2.colorado.edu/unifrac/index.psp (Lozupone and Knight, 2005). The database and the tree file used for the analysis are provided in Supplementary Information (Files S1 and S2). Briefly, all g20 sequences were assigned to environmental categories using meta data for each sequence, with some assumptions made as described in Table 3 legend. Missing meta data for published g20 sequences were obtained where possible from the authors of the original work, as indicated in Tables 4 and 5. The patterns of these meta data were evaluated for 'each environment separately' in the context of a single neighbour-joining tree that included branch lengths (File S2) using Unifrac; all statistical results were similar using the P-test (also available at the Unifrac site, data not shown).
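The idea behind this test can be sketched in a few lines. The code below is an illustration only, not the Unifrac web tool: it assumes the tree is supplied as a list of (branch length, descendant tips) pairs, scores one environment against all others pooled as the fraction of branch length leading to tips from only one of the two groups, and estimates a P-value by shuffling environment labels across tips. All names, the toy tree and the permutation count are hypothetical.

```python
import random


def unifrac_one_vs_rest(branches, env_of, env):
    """Fraction of branch length leading to tips from only `env` or only the rest."""
    unique = total = 0.0
    for length, tips in branches:
        in_env = any(env_of[t] == env for t in tips)
        in_rest = any(env_of[t] != env for t in tips)
        total += length
        if in_env != in_rest:  # descendants come from exactly one group
            unique += length
    return unique / total


def permutation_p_value(branches, env_of, env, n_perm=999, seed=0):
    """Share of label shufflings whose score is at least the observed score."""
    rng = random.Random(seed)
    observed = unifrac_one_vs_rest(branches, env_of, env)
    tips = list(env_of)
    labels = [env_of[t] for t in tips]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if unifrac_one_vs_rest(branches, dict(zip(tips, labels)), env) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy tree ((a,b),(c,d)): four tip branches and two internal branches.
toy_branches = [
    (1.0, {"a"}), (1.0, {"b"}), (1.0, {"c"}), (1.0, {"d"}),
    (2.0, {"a", "b"}), (2.0, {"c", "d"}),
]
clustered = {"a": "freshwater", "b": "freshwater", "c": "marine", "d": "marine"}
mixed = {"a": "freshwater", "b": "marine", "c": "freshwater", "d": "marine"}

print(unifrac_one_vs_rest(toy_branches, clustered, "freshwater"))
print(unifrac_one_vs_rest(toy_branches, mixed, "freshwater"))
print(permutation_p_value(toy_branches, clustered, "freshwater"))
```

With only four tips the permutation test has little power, which is one reason a significant result in a tree of several hundred sequences can still coexist with many individual exceptions.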

Nucleotide sequence accession numbers
The nucleotide sequences determined in this study were submitted to GenBank and assigned accession numbers EU715778-15813.