Genomic Stability and Genetic Defense Systems in Dolosigranulum pigrum a Candidate Beneficial Bacterium from the Human Microbiome

Dolosigranulum pigrum is positively associated with indicators of health in multiple epidemiological studies of human nasal microbiota. Knowledge of the basic biology of D. pigrum is a prerequisite for evaluating its potential for future therapeutic use; however, such data are very limited. To gain insight into D. pigrum’s chromosomal structure, pangenome and genomic stability, we compared the genomes of 28 D. pigrum strains that were collected across 20 years. Phylogenomic analysis showed closely related strains circulating over this period and closure of 19 genomes revealed highly conserved chromosomal synteny. Gene clusters involved in the mobilome and in defense against mobile genetic elements (MGEs) were enriched in the accessory genome versus the core genome. A systematic analysis for MGEs identified the first candidate D. pigrum prophage and insertion sequence. A systematic analysis for genetic elements that limit the spread of MGEs, including restriction modification (RM), CRISPR-Cas, and deity-named defense systems, revealed strain-level diversity in host defense systems that localized to specific genomic sites including one RM system hotspot. Analysis of CRISPR spacers pointed to a wealth of MGEs against which D. pigrum defends itself. These results reveal a role for horizontal gene transfer and mobile genetic elements in strain diversification while highlighting that in D. pigrum this occurs within the context of a highly stable chromosomal organization protected by a variety of defense mechanisms. IMPORTANCE Dolosigranulum pigrum is a candidate beneficial bacterium with potential for future therapeutic use. This is based on its positive associations with characteristics of health in multiple studies of human nasal microbiota across the span of human life. For example, high levels of D. pigrum nasal colonization in adults predicts the absence of Staphylococcus aureus nasal colonization. Also, D. pigrum nasal colonization in young children is associated with healthy control groups in studies of middle ear infections. Our analysis of 28 genomes revealed a remarkable stability of D. pigrum strains colonizing people in the U.S. across a 20-year span. We subsequently identified factors that can influence this stability, including genomic stability, phage predators, the role of MGEs in strain-level variation and defenses against MGEs. Finally, these D. pigrum strains also lacked predicted virulence factors. Overall, these findings add additional support to the potential for D. pigrum as a therapeutic bacterium.

Genomic Stability and Genetic Defense Systems in Dolosigranulum pigrum, a Candidate Beneficial Bacterium from the Human Microbiome ABSTRACT Dolosigranulum pigrum is positively associated with indicators of health in multiple epidemiological studies of human nasal microbiota. Knowledge of the basic biology of D. pigrum is a prerequisite for evaluating its potential for future therapeutic use; however, such data are very limited. To gain insight into D. pigrum's chromosomal structure, pangenome, and genomic stability, we compared the genomes of 28 D. pigrum strains that were collected across 20 years. Phylogenomic analysis showed closely related strains circulating over this period and closure of 19 genomes revealed highly conserved chromosomal synteny. Gene clusters involved in the mobilome and in defense against mobile genetic elements (MGEs) were enriched in the accessory genome versus the core genome. A systematic analysis for MGEs identified the first candidate D. pigrum prophage and insertion sequence. A systematic analysis for genetic elements that limit the spread of MGEs, including restriction modification (RM), CRISPR-Cas, and deity-named defense systems, revealed strain-level diversity in host defense systems that localized to specific genomic sites, including one RM system hot spot. Analysis of CRISPR spacers pointed to a wealth of MGEs against which D. pigrum defends itself. These results reveal a role for horizontal gene transfer and mobile genetic elements in strain diversification while highlighting that in D. pigrum this occurs within the context of a highly stable chromosomal organization protected by a variety of defense mechanisms. IMPORTANCE Dolosigranulum pigrum is a candidate beneficial bacterium with potential for future therapeutic use. This is based on its positive associations with characteristics of health in multiple studies of human nasal microbiota across the span of human life. For example, high levels of D. pigrum nasal colonization in adults predicts the absence of Staphylococcus aureus nasal colonization. Also, D. pigrum nasal colonization in young children is associated with healthy control groups in studies of middle ear infections. Our analysis of 28 genomes revealed a remarkable stability of D. pigrum strains colonizing people in the United States across a 20-year span. We subsequently identified factors that can influence this stability, including genomic stability, phage predators, the role of MGEs in strain-level variation, and defenses against MGEs. Finally, these D. pigrum strains also lacked predicted virulence factors. Overall, these findings add additional support to the potential for D. pigrum as a therapeutic bacterium. elements, thus limiting the introduction of new genes and maintaining genomic stability.
Comparing genomic content and chromosomal organization of D. pigrum strains collected 20 years apart, and mostly in the United States, we identified the following characteristics: (i) highly similar strains circulating across 20 years; (ii) stable chromosomal synteny across the phylogeny; (iii) the first predicted D. pigrum prophage and insertion sequence; and (iv) a diverse collection of RM, deity-named defense and CRISPR-Cas systems incorporated at conserved chromosomal insertion sites across strains. Together, these reveal a stable synteny and a high-level of sequence conservation within the D. pigrum core genome, along with an open pangenome and active defense against HGT.

RESULTS
Detection of highly similar Dolosigranulum pigrum strains over a 20-year span. To identify genomic shifts in D. pigrum strains currently circulating in human nasal microbiota compared to strains from approximately 20 years ago, we collected 17 new nostril isolates of D. pigrum from volunteers in 2017 and 2018 and sequenced the genomes of these isolates using SMRTSeq (PacBio), fully circularizing 14 (Table 1). We compared these 17 new genomes to 11 described genomes (28), 9 of which are from strains collected in the late 1990s (48). This refined existing and uncovered new information about the basic genomic characteristics of D. pigrum (see Table S1 in the supplemental material).
To assess the similarity of these 28 D. pigrum strains, we generated a phylogenomic tree based on 1,102 single-copy core GCs (Fig. 1). Some of the terminal clades include strains collected during different decades. The average number of pairwise single nucleotide polymorphisms (SNPs) among isolates collected approximately 20 years apart  Table S2A). Thus, closely related strains of D. pigrum have circulated among people in the United States over a span of time that has an upper bound of 20 years and a lower bound of 8 to 13 years. (This lower bound allows for the possibility that the recent isolates were stably acquired in infancy since most of the 2018 strains were from children in the 7-to 12-year age range.) Alloiococcus otitis (49) is the closest genome-sequenced bacterium to D. pigrum in 16S rRNA gene phylogenies. A. otitis ATCC 51267 shared 789 core GCs with the D. pigrum strains (see Fig. S1A). Using these 789 core GCs, we constructed a phylogenomic tree with A. otitis as an outgroup (see Fig. S1B and C). In contrast to the D. pigrum-only phylogeny (Fig. 1), the phylogeny including A. otitis displayed poor support for many of the branches within the D. pigrum clade. This is likely due to the reduced number of SNPs among D. pigrum strains when using only the 789 GCs shared with A. otitis (see Table S2A). Therefore, we based subsequent inferences on the D. pigrum-only phylogeny.
The chromosome of D. pigrum exhibits conserved synteny across a phylogeny spanning 20 years. Based on the observed similarity of circulating strains over time, we hypothesized there would be a high-level of genomic stability across the D. pigrum This maximum-likelihood core-gene-based phylogeny shows recently collected strains (bold), mostly from 2018, and strains collected before 2000 intermingled in three of the four distinct clades (clades C1 to C4 are color coded, and the year of collection is in parentheses). Strains separated by 18 to 19 years grouped together in terminal clades: KPL1914 and CDC4294-98, KPL3246 and CDC4199-99, and KPL3250 and CDC4792-99. The genomes of strains in boldface plus strain CDC4709-98 (asterisk) are closed. Strains KPL3065 and KPL3086 were collected from two different individuals and have almost identical genomes, differing by just 4 core SNPs and 6 gene clusters (4 and 2 in KPL3086 and KPL3065, respectively). We created this unrooted phylogeny using the concatenated alignment of 1102 conservative single-copy core GCs (see Fig. S3A), a GTR1F1R3 substitution model of evolution, 553 maximum-likelihood searches, and 1,000 ultrafast bootstraps with IQ-Tree v.1.  (50,51) of these 19 genomes starting at the dnaA gene revealed a remarkable conservation of the overall chromosomal structure with no visible shifts in the position of large blocks of sequence ( Fig. 2A). Dispersed among these blocks are regions with higher numbers of insertions and deletions (indels) ( Fig. 2A; see also Fig. S2A). D. pigrum has a core genome that has leveled off, an open pangenome, and a high degree of conservation at the amino acid and nucleotide level. Analysis of all 28 D. pigrum genomes revealed a conservative core of 1,102 single-copy GCs, as defined by the intersection of results from three algorithms, including bidirectional best hits (BDBH) (see Fig. S3A). A core of 1,134 GCs was defined by the intersection of two algorithms when BDBH was excluded (see Table S1 and Fig. S3B). The D. pigrum core genome has leveled off in size (Fig. 3A). Meanwhile, the pangenome continued to increase, with each additional genome (Fig. 3B) reaching 3,700 GCs (see Fig. S3B); of these, 30.6% (1,134/3,700) are core. The average number of coding sequences (CDS) per genome was 1,765 and, on average, the core constituted ;64% (1,134/1,765) of the CDS in each individual genome (see Table S1). These results from GET_HOMOLOGUES (42) generally agreed with those from Anvi'o (52,53), allowing us to leverage Anvi'o for additional analyses. In the Anvi'o-derived single-copy core (38.2%; Fig. S3D), 89.4% (993/1,111) of the GCs had a functional homogeneity index score $0.98, indicating a high degree of  Fig. 1, shows a conserved order of chromosomal blocks across the phylogeny of strains collected 20 years apart. Vertical bars represent clades: clade 1, green; clade 2, purple; clade 3, orange; and clade 4, blue. CS1 and CS2 designate the CRISPR-Cas sites (Fig. 8), R1H represents the RM system insertional hot spot and R2 represents the site containing either a type II m5C RM system or a type IV restriction system (Fig. 7). (B) This PPanGGOLiN partitioned pangenome graph displays the overall genomic diversity of the 28 D. pigrum genomes. Each graph node corresponds to a GC; the node size is proportional to the total number of genes in a given cluster, and the node color represents the PPanGGOLiN assignment of GCs to the partitions: persistent (orange), shell (green), and cloud (blue). Edges connect nodes that are adjacent in the genomic context and their thickness is proportional to the number of genomes sharing that neighboring connection. The insets on the right depict subgraphs for sites R1H, CS1, and R2 showing several branches corresponding to multiple alternative shell and cloud paths. These sites with higher genomic diversity are surrounded by longer regions with conserved synteny, i.e., long stretches of consecutive persistent nodes (GCs). The static image depicted here was created with the Gephi software (https://gephi.org) (133) using the ForceAtlas2 algorithm (134) with the following parameters: scaling = 20,000, stronger gravity = true, gravity = 6.0, LinLog mode = true, and edge weight influence = 2.0.
Genomic Stability and Defense Systems in D. pigrum September/October 2021 Volume 6 Issue 5 e00425-21 msystems.asm.org 5 conservation at the amino acid level. This fits with an average nucleotide identity (ANI) over 97.58% for all 28 genomes (see Fig. S3C), matching earlier findings with 11 strain genomes (28). Moreover, two sets of three recently collected strains each shared over 99% ANI, as well as similar accessory Clusters of Orthologous Group (COG) annotations (see Fig. S3E). This revealed highly similar strains in the nasal microbiota of different individuals in Massachusetts. Of these, two strains collected from different people were nearly identical, differing by just 4 core SNPs and 6 GCs (4 and 2 in KPL3086 and KPL3065, respectively) with a MASH-distance of 3.10E-05 (P=0 ; Table S2B). (Henceforth, we refer to these two strains as KPL3065/KPL3086.) In contrast KPL3086 and KPL3043, which are in that same distal clade in Fig. 1, have a MASH distance of 0.0045 (P=0). The D. pigrum accessory genome is enriched for gene clusters involved in mobilome and host defense. Of the 49,412 individual genes identified across the 28 genomes, 63.8% (31,501/49,412) had informative calls to a single functional COG annotation (i.e., their assignment corresponds to a single COG category other than S or R) (54, 55) (Fig. 4A). Using Anvi'o, we observed that GCs involved in mobilome, in defense mechanisms, and in carbohydrate transport and metabolism were overrepresented in the accessory compared to the core genome (Fig. 4B). GCs classified to these three COG categories accounted for 3.9, 6.6, and 8.5% of the D. pigrum accessory genome, respectively. The proportion of accessory functions was similar among all strains, but the sizes of their accessory genomes varied (see Fig. S3E and F). Because genome stability is relevant to suitability of a candidate beneficial microbe for therapeutic use, we focused subsequent analysis on the predicted mobilome and defense mechanisms.
D. pigrum hosts distinct integrated phage elements, insertional elements, and a group II intron. Of the total GCs in the pangenome, 2.2% were predicted to be part of the mobilome. MGEs can negatively affect genome stability and can positively affect strain diversification. Therefore, we systematically searched for various types of MGEs, including phage elements, plasmids, and insertional elements that interact with D. pigrum. First, using the Phage Tool Enhanced Release (PHASTER) database (56,57), we identified four distinct, and mostly intact, integrated phage elements, i.e., prophages (Fig. 5). We gave these the provisional names Dolosigranulum phage L1 through L4. All four were in the size range common for Firmicutes phages and had a life cycle-specific organization of its CDS with lytic and lysogenic genes separated (Fig. 5) (58-60). Predicted prophage L1 from D. pigrum KPL3069 was the most intact with two . Prophages L2 and L3 from D. pigrum KPL3090 also had intact integrases, with similarity to other streptococcal phages, but lacked distinguishable attP sites. Beyond these similarities, other CDS from L1 to L4 displayed few and dissimilar matches to known phage elements (see Text S1), indicating that D. pigrum hosts a distinct set of lysogenic phage that are expected to have a limited host range. Second, using the Gram-positive plasmid database PlasmidFinder (62), we detected no autonomous plasmids. However, a nearly complete fragment of the S. aureus plasmid pUB110 is integrated in the chromosome of four strains and includes a gene encoding kanamycin resistance (see Fig. S2B). This prompted a systematic search for antibiotic resistance genes using the Comprehensive Antibiotic Resistance Database in the Resistance Gene Identifier (CARD-RGI) (63,64). Of the 28 genomes, 6 are predicted to encode antibiotic resistance genes for erythromycin and/or kanamycin, which are located within a CRISPR array or the integrated plasmid, respectively (see Text S1).
Third, we identified GCs predicted to be either transposases (eight) or integrases (five) using a multistep approach (see Table S3). Transposases are thought to function both as detrimental, selfish genetic elements that can disrupt important genes and as diversifying agents that can provide benefit to host cells through gene activation or rearrangements (65,66). Among the 26 genomes containing at least one transposase CDS, the mean was 4.42 (median, 3.5), with a maximum of 13 per genome. Transposases were more prevalent and abundant than integrases (see Table S3). One of the predicted transposases was the GC containing the third largest number of sequences. This is consistent with reports that genes encoding transposases are the most prevalent protein-encoding genes detected across the tree of life when accounting for both ubiquity and abundance (67). We detected 74 intact instances of this most common transposase, an ISL3 family transposase with similarity to ISSau8, across 22 of the D. pigrum genomes with a mean (median) of 3.36 (2) and a maximum of 11 copies per genome (GC_00000003; Table S3). As shown on the PPanGGOLin graph (Fig. 6Ai), FIG 4 The accessory genome of D. pigrum has functional enrichment for defense mechanisms, mobilome, and carbohydrate transport and metabolism genes. (A) Of the total 49,412 individual genes identified across the 28 analyzed genomes, up to 8,242 genes (16.7%) lacked a COG annotation, 5,221 (10.6%) had an ambiguous COG category annotation (more than one COG category), and 4,448 (9.0%) had an uninformative annotation (belonging to the S or R COG category). At the gene cluster (GC) level, only 37.2% of the 1,517 GCs present in the accessory genome had an informative COG assignment compared to 68.7% of the 1,388 GCs in the soft/core. (B) The number of GCs present in the accessory genome was severalfold higher than in the soft/core for the following informative COG assignments (colored categories): defense mechanisms (olive, 2.60-fold), mobilome: prophages, transposons (orange, 14.88-fold), and carbohydrate transport and metabolism (khaki, 1.66-fold). This was determined using the COG functional annotations defined in our Anvi'o analysis of the soft/core ("core" and "soft core" bins) versus accessory ("shell" and "cloud" bins). Since many GCs have individual genes with distinct COG annotations each individual gene was counted as 1/x, with x being the number of genes in each GC. this transposase is inserted at multiple different sites within and across the genomes ( Fig. 6Aiii; see also Table S3). The most common of these is likely the ancestral insertion site (Fig. 6Bii). The absence of a cotraveling CDS is consistent with this ISL3 family transposase being part of an insertional sequence (IS). According to standards for IS nomenclature, we propose the name ISDpi1 (66).
Fourth, the PPanGGOLin graph (68) revealed insertion of a predicted group II intron reverse transcriptase-maturase at multiple sites across multiple D. pigrum genomes ( Fig. 6Aii and Bi; see also Table S3). Group II introns are MGEs commonly found in bacterial genomes that consist of a catalytic RNA and an intron-encoded protein that assists in splicing and mobility (69). Like transposases, group II introns can play both detrimental and beneficial roles within their host. We detected this intron-encoding GC in all 28 genomes with a mean (median) of 4.7 (3.5) and range of 1 to 14 copies per genome. This GC contained the highest number of individual gene sequences of any GC with 132 (GC_00000001 ; Table S3). It is most closely related to the bacterial class C intron-encoded protein from La.re.I1 in Lactobacillus reuteri with 44% identity and 65% similarity over 419 amino acids (70). These data are consistent with an intact bacterial reverse transcriptase/maturase expected to facilitate splicing and mobility of the group II intron (69).
A systematic search identifies multiples types of defense systems to protect D. pigrum from MGEs. The enrichment for defense mechanisms in the accessory genome of D. pigrum is combined with the relative paucity of plasmids and prophages among D. pigrum genomes. Based on this, we performed a systematic search of the pangenome for known bacterial host defense systems, including RM, deity-named defense, and CRISPR-Cas systems.
D. pigrum harbors a diverse collection of RM systems. In bacteria, individual RM systems can differ with respect to target sequence, active site architecture, and reaction mechanisms, but all recognize the methylation status of target sequences on incoming DNA and degrade inappropriately methylated (non-self) DNA. Type I to III systems largely recognize and digest a target sequence when it lacks the appropriate methyl group. In contrast, type IV systems, which lack a methyltransferase, are composed of a methyl-dependent restriction endonuclease (REase) that cuts a target sequence when it contains a specific methyl modification. RM systems and their recognition sequences are often strain specific. Therefore, we characterized and compared the repertoire of RM systems present in each of the 19 D. pigrum strains sequenced via SMRTseq, defining the methylome of each strain using SMRTseq kinetics (Basemod analysis) and predicting the recognition sequences of each system via REBASE analysis (71) (Fig. 7A; see also Table S4 and Text S1). Most of the modifications detected were . Each graph node corresponds to a GC; node size is proportional to the total number of genes in a given cluster; and node color represents PPanGGOLiN assignment of GCs to the partitions: persistent (orange), shell (green), and cloud (blue). Edges connect nodes that are adjacent in the genomic context and their thickness is proportional to the number of genomes sharing that neighboring connection. In panels Aii and Aiii, only the adjacent neighboring nodes and edges for each of the depicted GCs are contrast colored against the background pangenome graph. (B) Most common genomic neighborhoods for the predicted group II intron reverse transcriptase-maturase (Bi) and the ISL3 family transposase (Bii). OCTAPUS (https://github.com/FredHutch/octapus) identified the chromosomal coordinates of each MGE integration event in individual strains, and groupings of colocated genes residing within the same neighborhood structure across strains were visualized using Clinker (https:// github.com/gamcil/clinker). ClustalOmega alignments of flanking regions across groupings revealed predicted terminal sequence boundaries (consistent 59-39 sequences across integration events) for each MGE. The three most common genomic loci for each MGE were rendered using BioRender. m6A with only one m4C being found. There were several genes coding for m5C enzymes, but their products are not usually detected by the PacBio software. Only one positive m5C enzyme was identified. Among the RM systems, most were type II, although half the strains had a type IV enzyme of unknown specificity. The type I to III systems were associated with 19 individual target recognition motifs identified by methylome analysis (Fig. 7A; see also Table S4 and Text S1). The D. pigrum type IV RM system is inversely related to a specificm5C-associated type II system. We noted an inverse relationship between the presence of the D. pigrum type IV REase and a specific m5C-associated type II RM system that modified the second cytosine residue within the motif GCNGC (Fig. 7A). This inverse relationship was found to be interdependent between strains based upon a Fisher exact test (P=0.0055). The type II m5C system was present in nine D. pigrum genomes that lacked the type IV REase. Conversely, the type II m5C system was absent in eight strains that contained the type IV REase. Type IV REase that target m5C-modified motifs have the potential to limit the spread of RM systems that utilize m5C modifications. The D. pigrum type IV REase appears related (99% coverage/43% identity) to S. aureus SauUSI, a modified cytosine restriction system targeting S 5m CNGS (either m5 Co r 5hm C), where S is C or G. Based on the inverse relationship of the type IV and type II m5C systems, this strongly indicates that the D. pigrum type IV system targets m5C containing sequences, including GCNGC, GGNCC, and potentially the recognition sequence of the other m5C enzyme, M.Dpi3264ORF6935P.
The type IV and specific m5C-associated type II RM systems are present at the same integration site. To decipher the basis for the inverse relationship between these two RM systems, we sought to determine where each was incorporated in the D. pigrum genomes. In 18 of the 19 strains, the type IV REase or the m5C-associated type II system are inserted into the same genetic locus, dubbed R2 ( Fig. 2B; see also Fig. S4A). In the one strain that carried both the type IV REase and the m5C-associated type II system, CDC4709-98, the type IV is present at R2, whereas the m5C system is integrated at an unrelated locus downstream from a tRNA-Leu site. MGEs that carry similar integrases tend to integrate at the same sites in the chromosome, but in most strains we did not observe any integrase or additional genes cooccurring with the RM systems at this site. Many D. pigrum RM systems compete for an integration hot spot. Extending our analysis, we identified a genomic locus with an unexpectedly high frequency of variable genes across all 28 genomes. We dubbed this site RM system integration site 1 hot spot (R1H), because it harbors a diverse collection with 12 different RM systems spanning types I, II, and III across strains (Fig. 2B and Fig. 7B). Cooccurring with these RM systems in R1H, we also identified three of the antiphage deity-named defense systems: Hachiman, Gabija, and Kiwa present across seven strains (Fig. 7B). A third RM system integration site (R3) contained two different type II systems, along with an IS66 transposase family of genes (see Fig. S4B), consistent with the known association of defense systems and MGEs (72).
D. pigrum encodes subtype II-A and I-E CRISPR-Cas systems. CRISPR-Cas systems provide adaptive/acquired defense (immunity) against MGEs (46). All of the complete D. pigrum genomes encoded at least one subtype II-A or I-E CRISPR-Cas system ( Fig. 8A; see also Table S5A), based on the CRISPRDetect database (73). Of the 32 CRISPR-Cas systems detected, 22 are subtype II-A, which is mostly found in Firmicutes (74) and is the predominant CRISPR-Cas system among Lactobacillus spp. (75). Subtypes II-A (circles, Fig. 8A) and I-E (stars, Fig. 8A) CRISPR-Cas systems were generally intermixed within the four major clades, although several distal clades harbored only one type. A single genomic locus (CS1) contained either a subtype II-A or a subtype I-E CRISPR-Cas system in all 19 closed genomes ( Fig. 2B and Fig. 8A and B). A second CRISPR-Cas system (triangles, Fig. 8A) was found at a second location (CS2) in 4 of these 19 genomes, from three of the four clades (Fig. 8B).
D. pigrum CRISPR-Cas spacers point to undiscovered D. pigrum MGEs. Each of the 19 closed genomes included at least one complete CRISPR array. (As expected, most of the arrays were incomplete in the unclosed genomes.) Examining the CRISPR arrays in the 19 closed genomes revealed two key findings. First, the spacer sequences predict the existence of a diversity of undiscovered D. pigrum phages and plasmids with mean numbers of spacers per array of 13 (median, 12.5) for subtype II-A and 11.1 (median, 12) for subtype I-E (see Table S5A). Second, spacer sequences show a sparsely shared history of exposure to many MGEs ( Fig. 8C; see also Table S5). Only 60 of the 161 unique identified spacers were shared by more than one strain (Fig. 8C). The exceptions to this limited shared history were two distal clades with shorter branch lengths within Clade 4, which shared 15 and 12 spacers, respectively. Of these 27 spacers, 9 had similarity to known MGEs (see Text S1). A few other shared spacers were scattered among D. pigrum strains outside these two distal clades. For example, D. pigrum KPL3033 (clade 3) and KPL1914 (clade 4) shared five spacers (Fig. 8C), one of which matched to the Clostridium phiCDHM19 phage (LK985322;s p a c e r1 2 9 )( 7 6 ). These shared spacers suggest strains within the host-range of specific MGEs. Spacer similarity to known MGEs indicated prior D. pigrum exposure to phage and plasmid elements that might be related to those found in other genera of Firmicutes, e.g., Clostridium, Lactococcus, Streptococcus, Staphylococcus,a n dEnterococcus. However, only 46/161 spacers had significant matches (match score .15) to previously identified MGEs and none had matches to any of the predicted prophage from Fig. 5, indicating that that D. pigrum CRISPR-Cas systems likely target a variety of yet-to-be-identified hostspecific D. pigrum plasmids and phages.

DISCUSSION
Multiple recent studies of the composition of human nasal microbiota identify D. pigrum as a candidate beneficial bacterium . Our systematic analysis of 28 D. pigrum strain genomes, including 19 complete and closed genomes, reveals a phylogeny in which strains collected 20 years apart intermingled in clades and showed remarkable stability in genome structure ( Fig. 1 and 2). Many of the older D. pigrum strains were collected in the context of human disease (48), making it unclear whether these strains were contributors to disease, bystanders, or contaminants. In a previous  Table S5A). Two distal clades had only a subtype II-A system (KPL3043, KPL3065/KPL3086, KPL3090, KPL3052, and KPL3069) or a subtype I-E system (KPL3070, KPL3084, and KPL391). Three genomes (KPL3077, KPL3246, and CDC2949-98) have both types of system, with each at a different locus. (B) The most common location, CRISPR-Cas system insertion site (CS1), is between the ABC transporter permease protein (yxdM) and the glyxyolytate/hydroxypyruvate reductase A (ghrA) genes. However, subtype II-A systems are also found in between the guanine/hydoxanthine permease (pbuO; NCS2 family permease) and dipeptidylpeptidase 5 (dpp5; S9 family peptidase) genes at CRISPR-Cas insertion site 2 (CS2). Five of the strains with a subtype II-A system in CS1 had a predicted rRNA adenine N-6-methyltransferase (ermC9) gene integrated in their CRISPR arrays (open circles) (C) Representation of the spacers (see Table S5B and Text S1 in the supplemental material) found among the different CRIPSR systems in the 19 closed genomes. We found 161 unique spacers, less than one-third of which were homologous to phages and plasmids found among other Firmicutes. analysis, we detected no virulence factors in the genomes of nine of these older strains from Laclaire and Facklam (48), consistent with D. pigrum having a commensal or mutualistic relationship with humans (28). Adding further support for this, we detected no virulence factors in any of our newer strains here (see Text S1), which were all isolated from healthy volunteers. Plus, many of the older strains are closely related in the phylogeny with these recent healthy-donor-derived strains (Fig. 1). These findings are consistent with there being only a few isolated reports of D. pigrum growth in samples from different types of infections (77)(78)(79)(80)(81)(82). Of these, the repeated detection of D. pigrum alone in keratitis/keratoconjunctivitis raises the possibility that some strains might be rare causes of eye surface infection (83)(84)(85)(86). We recommend future genome sequencing of ophthalmic infection isolates to ascertain whether and how these vary from currently sequenced avirulent strains.
Our results show that strain-level variation in D. pigrum is driven by gene gain/loss in variable regions located between large blocks of syntenic DNA (Fig. 2). This pattern is consistent with the findings of Oliveira et al. for the chromosomal structures of 80 different bacterial species (87). Furthermore, D. pigrum core GCs exhibit very high conservation of nucleotide sequences ($97.5%), and the 19 closed genomes show the order of syntenic blocks of core genes is conserved ( Fig. 2A). D. pigrum has an average genome size of 1.93 Mb (median, 1.91 Mb) (see Table S1) with an open pangenome (Fig. 3B). About 64% of each D. pigrum strain genome consists of core CDS, whereas only about 30% of the D. pigrum pangenome consists of core GCs. This is similar to the percentage of core genes in the pangenomes of other colonizers of the human upper respiratory tract, such as Staphylococcus aureus (36%) and Streptococcus pyogenes (37%) (88).
HGT, much of it likely mediated by MGEs, plays an important role in strain diversification in free-living bacteria. However, a systematic search identified few such elements per genome among D. pigrum strains. In terms of MGEs that commonly mediate HGT, we detected no autonomous plasmids. However, we identified one complete and three partial predicted prophages (Fig. 5) among 27 distinct strain genomes (2 of the 28 genomes were almost identical). To our knowledge, the predicted complete prophage (L1) is the first phage element identified in D. pigrum. The disparate nature of these candidate prophages compared to those in current databases is consistent with D. pigrum having its own specific pool of yet-to-be-identified phage predators, consistent with the strain-level specificity of many known phages. This is further supported by the scarce homology of the phage spacers in the CRISPR arrays to those available in the databases. However, some D. pigrum prophages might share a distant common ancestor with streptococcal phages, as almost one fifth L1's and at least one third of L2's (77/202) and L3's (51/187) predicted genes shared the most similarities to Streptococcus phage genes (see Text S1). Based on our findings, we predict that phage elements targeting D. pigrum have a narrow host range, consistent with patterns exhibited by other Firmicutes-targeting phages, such as those targeting Listeria and Clostridium difficile (58,76). The identification of D. pigrum prophages creates the opportunity for future work to systemically query nasal metagenomic data sets for these and other D. pigrum MGEs, as well as for CRISPR spacers.
In terms of MGEs that commonly move within genomes, D. pigrum genomes host a group II intron and most also host a small number of predicted transposases and/or integrases (see Table S3). Once present in a genome, IS movement can lead to phenotypic variation among closely related strains through disruption of open reading frames (ORFs) or changes in transcription due to insertion in or adjacent to promoters (65).
The small number of MGEs identified might be related to the multiple defense mechanisms present in each D. pigrum genome. RM systems are ubiquitous in bacteria and present in ;90% of genomes (71). They play a key role in protecting bacterial genomes from HGT, including MGEs, and maintaining genome stability. The variety of RM systems within and among D. pigrum genomes is consistent with this role. To our knowledge, this is the first report of a strongly inverse relationship between an m5C-Genomic Stability and Defense Systems in D. pigrum September/October 2021 Volume 6 Issue 5 e00425-21 msystems.asm.org 13 targeting type IV REase and an m5C-associated type II system within the same chromosomal locus. A similar relationship was described previously for two antagonistic type II systems in Streptococcus pneumoniae, where strains possess either DpnI (which cleaves only modified G m6 ATC) or DpnII (which cleaves only unmodified GATC) (89). It remains unclear whether the inverse relationship observed between the two D. pigrum systems results from competition for an integration site within a D. pigrum genome (R2; Fig. 7) or whether the type II system's m5C-modified target motif is incompatible with the type IV REase. Determination of the exact underlying mechanism for this type IV/type II relationship warrants future investigation and has implications for other bacterial genomes. CRISPR-Cas systems are another common bacterial defense system that maintain genomic stability. In a recent analysis of complete genomes from 4010 bacterial species in NCBI RefSeq, 39% encode cas clusters (74). Several characteristics of the predicted D. pigrum CRISPR-Cas systems suggest these are active. First, the preserv a t i o no fr e p e a t sa n ds p a c e r sa l o n gw i t ha l lo ft h ec o r eC a sg e n es u g g e s t sa c t i v e systems, since inactive systems often show evidence of degeneration in terms of inconsistent repeat/spacer lengths (75). Second, the diversity of spacers among D. pigrum strains supports the likelihood of activity (90). D. pigrum belongs to the order Lactobacillales in the phylum Firmicutes. Similar to our observations in D. pigrum (Fig. 8), among 171 Lactobacillus species, when multiple CRISPR-Cas systems are present in a single genome these are most often a subtype I-E and subtype II-A, and these two subtypes predominate among type I and II systems in Lactobacillus (75). More broadly, there is a positive association between subtype I-E and subtype II-A systems within the phylum Firmicutes (74). Within Lactobacillus, type I systems contain the longest arrays (average 27 spacers) (75), and we see something similar among the D. pigrum strains. Of the spacers with matches to known plasmid and phage elements in the GenBank-Phage, Refseq-Plasmid, and IMGVR databases in CRISPRTarget, almost half of the identified spacers corresponded to plasmid elements. Subtype II-A systems in Lactobacillus actively transcribe and encode spacers t h a tp r o v i d er e s i s t a n c ea g a i n s tp l a s m i du p t a k eb a s e do np l a s m i di n t e r f e r e n c e assays in which an exogenous plasmid is engineered to contain endogenous spacer sequences (75,91). This defense mechanism might explain the lack of autonomous plasmids in D. pigrum strain genomes to date.
The majority of D. pigrum CRISPR spacers lack homology to known MGEs. This is consistent with a large-scale analysis of bacterial and archaeal genomes in which only 1% to 19% of spacers (global average ;7%) in genomes match known MGEs, mostly phages and plasmids and uncommonly to self. Also, spacers without a match share basic sequence properties with MGE-matching spacers pointing to species-specific MGEs as the source for CRISPR spacers (92). In this context, our findings indicate D. pigrum strains defend themselves against a wealth of yet-to-be-identified D. pigrum-specific MGEs. Some of these MGEs might be key to developing a system for genetic engineering of D. pigrum.
Like other pangenomic studies, this one has both general and species-specific limitations. First, the open pangenome indicates that the accessory gene space of D. pigrum remains to be more completely assessed through sequencing strains beyond the 28 investigated here. All but 1 of these 28 strains were collected in North America, so a next step is genome sequencing D. pigrum isolates from human volunteers from diverse geographic settings on other continents. Second, many more isolates would need to be collected over time to generate a comprehensive analysis of D. pigrum strain circulation in humans across the United States, and beyond. Third, this is a systematic computational prediction of genome defense systems and MGEs. The next step is experimental verification of the function of these computationally predicted entities, which underscores the need for a system to genetically engineer D. pigrum. Fourth, in this study, we systematically identified known genomic elements that can affect bacterial genomic stability. This leaves a large proportion of D. pigrum's accessory genome to be explored in future work. In conclusion, a growing number of studies point to D. pigrum as a candidate beneficial bacterium with the potential for future therapeutic use to manage the composition of human nasal microbiota to prevent disease and promote health (40). One standard for bacterial strains for use in humans, either in foods, the food chain or therapeutics, is the absence of antimicrobial resistance (AMR) genes against clinically useful antibiotics (93). A prior report of 27 D. pigrum strains shows all are susceptible to clinically used antibiotics with the exception of frequent resistance to erythromycin (48). Consistent with this, only 6 of the 17 new D. pigrum genomes reported here encode AMR genes with predicted resistance to erythromycin and/or kanamycin (see Text S1). This confirms the broad antimicrobial susceptibility of D. pigrum. Further supporting its safety, we detected no virulence factors in these 28 genomes. Moreover, this pangenomic analysis of 28 D. pigrum isolates collected over the span of 20 years revealed remarkable stability in both strain circulation and chromosomal structure. Consistent with this stability, we detected relatively few MGEs in each genome; however, each genome hosted a variety of defense systems for protection against MGEs, and HGT in general. The antibiotic susceptibility, genomic stability, capacity for defense against HGT, and lack of known virulence factors described here all support the safety of D. pigrum as a candidate for use in clinical trials to determine its potential for therapeutic use.

MATERIALS AND METHODS
Collection of new D. pigrum strains. We collected strains of D. pigrum from children and adults using supervised self-sampling of the nostrils with sterile swabs at scientific outreach events in Massachusetts in April 2017 and April 2018 under a protocol approved by the Forsyth Institutional Review Board (FIRB 17-02). All adults provided informed consent. A parent/guardian provided informed consent for children (,18 years old), and all children $5 years provided assent. (Self-sampling by children was considered evidence of assent.) Briefly, participants rubbed a sterile rayon swab (BBL, Franklin Lakes, NJ) around the surface of one nasal vestibule (nostril) for 20 s, and then we immediately inoculated this onto BBL Columbia colistin-nalidixic acid agar with 5% sheep's blood (CNA blood agar). After 48 h of incubation at 37°C in a 5% CO 2 enriched atmosphere, each CNA blood agar plate was examined, and colonies with a morphology typical for D. pigrum were selected for purification. Purified isolates were verified to be D. pigrum by 16S rRNA gene colony PCR (GoTaq Green; Promega, Madison, WI) using the primers 27F and 1492R and Sanger sequencing from primer 27F (Macrogen USA, Cambridge, MA).
Genomic DNA extraction. All D. pigrum strains were cultured from frozen stocks on CNA blood agar plates at 37°C with 5% CO 2 for 48 h. For each strain, cells from eight plates were harvested with a sterile cotton swab (Puritan, Guilford, ME) and resuspended in 1 ml of sterile 1Â phosphate-buffered saline (PBS; Fisher, Waltham, MA). Then, 10 100-ml resuspensions were spread and grown on 47-mm, 0.22-mmpore-size polycarbonate membranes (EMD Millipore, Burlington, MA) atop CNA blood agar plates at 37°C with 5% CO 2 for 24 h. Three membranes were resuspended in 20 ml of TES buffer (20 mM Tris-HCl, 1 M [pH 8.0]; 50 mM EDTA; filter sterilized) and normalized to an optical density at 600 nm of 1.0 6 0.02. Half the resuspension was spun down at 5,000 rpm (2,935 Â g) for 10 min at 4°C. The genomic DNA was extracted using a Lucigen Masterpure (Epicentre, Middleton, WI) Gram-positive DNA purification kit according to the manufacturer's instructions with the following modifications: we increased the amount of Ready-Lyse lysozyme added per preparation to 2.5 ml and deleted the bead-beating step. The extracted genomic DNA was assessed for quantity using a Qubit per manufacturer instructions, for quality on a 0.5% agarose gel, and for purity by measuring 260/280 and 260/230 ratios on a NanoDrop spectrophotometer.
Whole-genome sequencing, assembly, and annotation. Single molecule, real-time sequencing (SMRTseq) was carried out on a PacBio Sequel Instrument (Pacific Biosciences; Menlo Park, CA) with V2.1 chemistry, following standard SMRTbell template preparation protocols for base modification detection. Genomic DNA samples (5 to 10 mg) were sheared to an average size of 20 kbp via G-tube (Covaris, Woburn, MA), end repaired, and ligated to hairpin barcoded adapters prior to sequencing. Sequencing reads were processed using the Pacific Biosciences SMRTlink pipeline (https://smrtflow.readthedocs.io/en/latest/smrtlink _system_high_level_arch.html) according to the HGAP version 4.0 assembly tool standard protocol. Single contigs generated through HGAP were also processed through Circlator version 1.5.5 using default settings to assign the start site of each sequence to dnaA (94). All genomes were annotated with the NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) (95,96) and uploaded to the NCBI database (accession numbers CP040408 to CP040424).
Determination of the conservative core genome and the pangenome sizes. All the genomes were annotated with Prokka version 1.13.0 (97) prior to identification of the conservative core genome with GET_HOMOLOGUES version 3.1.4. (42,98) using the cluster intersection (compare_clusters.pl; blastp) result of three algorithms: bidirectional best-hits (BDBH), cluster of orthologs (COG) triangles (99), and Markov Cluster Algorithm OrthoMCL (OMCL) (100). The nucleotide level clustering for each of these algorithms was calculated with the get_homologues.pl script and the following parameters: -a CDS, -A, -t 28, -c, -R, and either -G for COG, -M for OMCL, or no flag for BDBH. To obtain the nucleotide instead of the protein outputs, blastn instead of blastp was used to report clusters (parameter -a CDS).
Genomic Stability and Defense Systems in D. pigrum September/October 2021 Volume 6 Issue 5 e00425-21 msystems.asm.org 15 The pangenome was established using the OMCL and COG triangle algorithm with -t 0 parameter to get all possible clusters when running get_homologues.pl. The total clusters from the OMCL and COG pangenomes were then used by compare_clusters.pl with the -m flag to create a pangenome matrix tab file. The cloud, shell, soft core, and core genome of the isolates were then determined using the par-se_pangenome_matrix.pl script in GET_HOMOLOGUES using the -s flag and the pangenome matrix tab file. The average nucleotide identity and genome composition analysis were also implemented (using the -A and -c parameters, respectively, in get_homologues.pl). For the genome composition analysis, which shows how many new CDS are added to the pangenome per new genome addition, the conservative default parameters and a random seed (-R) of 1234 was selected.
Phylogenomic tree construction. A core gene alignment was created for phylogenetic analysis using the nucleotide sequences from the conservative single-copy core GCs (n = 1,102) identified with GET_HOMOLOGUES. These GCs were aligned with MAFFT version 7.245 (101) using default settings, renamed to match the isolate's strain name, and concatenated into an MSA file through the catfasta2phyml.pl script using the concatenate (-concatenate) and fasta (-f) parameters (copyright 2010-2018 Johan Nylander). The core gene multiple sequence alignment was converted into a phylip file format with Seaview version 4.7 (102). An unrooted phylogenetic tree of the conservative single-copy core ( Fig. 1) was generated using this phylip file and IQ-Tree version 1.6.9. (103). The ModelFinder function in IQ-Tree identified the GTR1F1 as the appropriate substitution model for tree construction (BIC value 5597954.8128) (104). Using this model, 553 maximum-likelihood searches with 1,000 ultrafast rapid Bootstraps (105) were used to generate the final maximum likelihood tree (ML = 22854949.911). A clade was defined as a monophyletic group of strains sharing a well-supported ancestral node. SNP pairwise distance in the rooted and unrooted tree were determined using the "harrietr" R package (https://cran.r -project.org/web/packages/harrietr/README.html) applied with an in-house script. Pairwise MASH-distances were calculated for all D. pigrum strains using the implementation of the MASH algorithm (106)i nt h e PanACoTA pipeline (107). Code and data files for this part of the analysis are available online (https://github .com/KLemonLab/DpiMGE_Manuscript/blob/master/SupplementalMethods_PhylogeneticDistances.md).
Synteny analysis. We performed a whole-genome sequence alignment on all closed genomes using progressive Mauve in Mauve version 2.4.0.r4736 with its default settings (50,51). For the five genomes that we were unable to circularize, we manually fixed the start site to dnaA and added NNNNNNNNN to the region concatenating the ends of the contigs to mark it as a region of uncertainty in the synteny alignment. Manual curating was done with SnapGene version 4.2.11 GUI platform (SnapGene software from GSL Biotech).
Functional analysis of the pangenome using Anvi'o. All genomes were reannotated with an updated Prokka version (1.14.6) (97) with default parameters, including gene recognition and translation initiation site identification with Prodigal (108). The pangenome was analyzed using Anvi'o version 7 (52,53). We followed the pangenome workflow to import Prokka annotated genomes into Anvi'o( http://merenlab.org/ 2017/05/18/working-with-prokka/), followed by the addition of functional COG annotations using the anvirun-ncbi-cogs program with the -sensitive flag (runs sensitive version of DIAMOND [109]) and the 2020 updated COG20 database (110,111). KEGG/KOfam (112,113), and Pfam (114) annotations were also added to each genome .db file, as well as hmm-hits (115). The pangenome was calculated with the anvi-pan-genome program (flags: -minbit 0.5, -mcl-inflation 10, and -use-ncbi-blast) using blastp search (116), muscle alignment (117), "minbit heuristic" (118) to filter weak hits, and the MCL algorithm (119). The functional and geometric homogeneity index, and the rest of the information shown in Fig. S3D were calculated following the standard Anvi'op a n g e n o m i cp i p e l i n e( http://merenlab.org/2016/11/08/pangenomics-v2). The core (n =2 8 ) ,s o f tc o r e( 2 8. n $ 26), shell (26 . n $ 3), and cloud (n # 2) annotations from GET_HOMOLOGUES were added to the Anvi'o pangenomic database using the interactive interface. We defined the accessory as GCs present in #25 genomes and core as GCs present in $26 genomes. The output of this Anvi'o pangenomic analysis and the code used to generate it are available online (https://github .com/KLemonLab/DpiMGE_Manuscript/blob/master/SupplementalMethods_Anvio.md). We used the summary file we exported from the Anvi'o pangenomic analysis to generate the functional enrichment plots shown in Fig. 4 and in Fig. S3E and F using an in-house R script (https://github.com/KLemonLab/DpiMGE _Manuscript/blob/master/SupplementalMethods_COGs.md) to wrangle and extract information on the informative COG20 annotated gene clusters (120,121).
PPanGGOLiN analysis. Gene clustering and annotation data were exported from the Anvi'oo u t p u t and imported into PPanGGOLiN version 1.1.141 (Partitioned PanGenome Graph Of Linked Neighbors) (68) to create a partitioned pangenome graph (PPG) that assigned GCs to the "persistent,""shell," and "cloud" partitions. Regions of genome plasticity (RGPs) and spots of insertion were predicted (122), and subgraphs of the hot spots of interest were generated by providing the sequence of the flanking proteins in a fasta file. The output of the PPanGGOLiN analysis and the code used to generate it are available online (https:// github.com/KLemonLab/DpiMGE_Manuscript/blob/master/SupplementalMethods_PPanGGOLiN.md). The subgraphs represented as inserts on Fig. 2B were obtained with the command "ppanggolin align -p pangenome.h5 -getinfo -draw_related -proteins" using the amino acid sequences for the proteins upstream and downstream of each spot of interest. Since PPanGGOLiN does not currently allow creation of subgraphs using GCs imported from external clustering methods, the pangenome was run again using the default PPanGGOLiN workflow with MMseqs2 clustering (default settings: -identity 0.8, -coverage 0.8, and -defrag).
Characterization of MGEs. We searched all genomes for phage elements using the PHASTER database and web server (http://phaster.ca) on 8 November 2018 (56,57). We took the "intact" phage elements as defined by a phage score of .90 and queried their ORFs using blastp to manually reannotate their phage genes in the SnapGene GUI. We searched for plasmid elements in all genomes using the PlasmidFinder 2.0 database and GUI interface (https://cge.cbs.dtu.dk/services/PlasmidFinder/) on 13 November 2018 using the default parameters (62). For strains with hits for a plasmid element, ORFs 1,000 kb upstream and downstream of the element were queried through blastp. Manual gene reannotation was performed using the SnapGene GUI platform.
The summary file exported from the Anvi'o pangenomic analysis (see above) was also used for the identification of MGEs on the Prokka, COG20, Pfam, and KOfam annotations. We identified 23 GCs as coding for putative transposases. GC alignments were visually inspected in AliView (123), and full-length representative sequences were selected for Pfam search at the Pfam batch sequence search/HMMER website (114,124). We identified eight GCs with complete ($80% coverage) Pfam Transposase (tnp) domains as true predicted transposases and five GCs with complete ($80% coverage) Pfam rve domains as integrases. We used Operon ConTextulization Across Prokaryotes to Uncover Synteny (OCTAPUS; https://github.com/FredHutch/octapus) to identify the gene neighborhoods in which the selected transposases and integrases were located across all 28 D. pigrum genomes (see Table S3). The approach used by OCTAPUS is to search for a set of defined query genes across a collection of reference genomes by translated amino acid alignment and then to summarize the results by their physical colocation and organization. In this way, operon structure can be identified as the consistent colocation of a set of genes across multiple genomes in the same relative orientation (including both position and strand). The groups of genes identified with OCTAPUS at minimum percent identity 85% and minimum coverage 80% were visualized using clinker (https://github.com/gamcil/clinker) (125), and summary data provided in (see Table S3) were calculated using the matrixStats package (https://github.com/HenrikBengtsson/ matrixStats). Detailed methods for this part of the analysis, as well as relevant files, are available online (https://github.com/KLemonLab/DpiMGE_Manuscript/blob/master/SupplementalMethods_MGEs.md).
We similarly used OCTAPUS to identify the gene neighborhood of the group II intron identified with Anvi'o and PPanGGOLiN (GC_00000001). Using Pfam, we confirmed two predicted domains in a sequence from D. pigrum KPL3250 in GC_00000001: a reverse transcriptase and a maturase. The best hit in a blastx search with this same sequence against the Bacterial Group II Intron Database was to the bacterial class C intron-encoded protein from La.re.I1 in Lactobacillus reuteri with 44% identity and 65% similarity over 419 amino acids (70).
Base modification analysis and prediction of restriction-modification systems. For methylome analysis, interpulse durations were measured and processed for all pulses aligned to each position in the reference sequence. We used Pacific Biosciences' SMRTanalysis v8, which uses an in silico kinetic reference and a t-te st-based kinetic score detection of modified base positions, to identify modified positions (126).
We identified RM systems using SMRTseq data, as previously described (127), using the SEQWARE computer resource, a BLAST-based software module in combination with the curated restriction enzyme database (REBASE; http://rebase.neb.com/rebase/rebase.html) (71). Prediction was supported by sequence similarity, presence, and order of predictive functional motifs, plus the known genomic context and characteristics of empirically characterized RM system genes within REBASE. This facilitated reliable assignment of candidate methyltransferase genes to each modified motif based on their RM type.
Prediction of CRISPR-Cas systems. CRISPR cas genes were detected using the CRISPRFinder (https:// crispr.i2bc.paris-saclay.fr/Server/) (129), and the array elements downstream from these genes were found using CRISPRDetect software (http://crispr.otago.ac.nz/CRISPRDetect/predict_crispr_array.html) (73). The spacers identified using CRISPRDetect were queried through databases of possible phage targets in the GenBank-Phage, Refseq-Plasmid, and IMGVR databases with CRISPRtarget (http://crispr.otago .ac.nz/CRISPRTarget/crispr_analysis.html) (73,130), keeping hits with a cutoff score greater than 14. These spacers were also queried against the predicted L1 to L4 prophages through CRISPRtarget using a cutoff score .0. All gene and array element searches were completed on the webserver on 16 February 2019-with the exception of the spacers' q u e r yt h r o u g ht h eL 1t oL 4p r o p h a g e so n2 7 July 2021-using the default parameters. We also queried the genomes through CRISPRdb and CRISPRCompar (https://crispr.i2bc.paris-saclay.fr) website on 18 March 2019 to identify and annotate spacers shared among the different strains, keeping hits with scores higher than 15 to indicate similarity (129,131,132).
Data availability. All genomes are available from the NCBI. Table 1 lists the accession numbers for each D. pigrum strain genome used in this study.  TEXT S1, PDF file, 0.1 MB.

ACKNOWLEDGMENTS
We are deeply grateful to the participants who donated nostril swab samples at a 2017 and 2018 science festival. Their contribution was critical to expanding our knowledge of D. pigrum. We thank colleagues and lab members who provided invaluable assistance at both outreach events, in particular Javier Fernandez Juarez, Kerry Maguire, Pallavi Murugkar, Pooja Balani, Sowmya Balasubramanian, Fan Zhu, Andy Kempczynski, Andrew Collins, Brian Klein, and Megan Lambert. For critical logistical support, we are grateful to Genevieve Holmes. We also thank Melinda M. Pettigrew and Yong Kong for advice on genome analysis, as well as Tsute (George) Chen, Daniel Utter, Edoardo Pasolli, Nicola Segata, and Michael Wollenberg for their computational and phylogenetic advice over the course of the project. We thank members of the Johnston Lab, the KLemon Lab, and the Starr-Johnston-Dewhirst-Lemon joint lab meeting for critique and suggestions. Author