Post-weaning shifts in microbiome composition and metabolism revealed by over 25,000 pig gut metagenome assembled genomes

Using a previously described metagenomics dataset of 27 billion reads, we reconstructed over 50,000 metagenome-assembled genomes (MAGs) of organisms resident in the porcine gut, 46.5% of which were classified as >70% complete with a <10% contamination rate, and 24.4% were nearly complete genomes. Here we describe the generation and analysis of those MAGs using time-series samples. The gut microbial communities of piglets appear to follow a highly structured developmental program in the weeks following weaning, and this development is robust to treatments including an intramuscular antibiotic treatment and two probiotic treatments. The high resolution we obtained allowed us to identify specific taxonomic “signatures” that characterize the microbiome development immediately after weaning. Additionally, we characterized the carbohydrate repertoire of the organisms resident in the porcine gut, identifying 294 carbohydrate active enzymes. We tracked the shifts in abundance of these enzymes across time, and identified the species and higher-level taxonomic groups carrying each of these enzymes in their MAGs, raising the possibility of modifying the piglet microbiome through the tailored provision of carbohydrate substrates.


INTRODUCTION
Advances in sequencing technology and computational analysis enable the study of whole microbial communities from a given environment in a high-throughput manner, bypassing the need to cultivate individual organisms. Microbial community profiling is generally approached in a targeted (e.g: 16S rRNA) or untargeted (shotgun) manner.
Only a tiny fraction of known bacterial species are represented in genome sequence databases, and in particular, difficult-to-culture and non-pathogenic organisms remain greatly underrepresented [1][2][3] . On the contrary, taxa of interest for their pathogenicity or potential in the biotech sector are more often sequenced and therefore tend to be well represented in public databases. Although an increasing diversity of reference genomes is becoming available, referencebased methods are not suited to discover new genomes, as they depend, directly or indirectly, on existing databases 4 . Moreover, proof of the nature of functional redundancy in the microbiome 5 , and recent findings highlighting the striking functional implications of small genomic differences between organisms of the same species 6 , should make us ever more aware that important compositional and functional features may be masked completely when the view of the microbiome is restricted to a small gene fragment, as in the 16S rRNA amplicon profiling technique.
Shotgun metagenomic sequencing, on the other hand, allows a much more complete view of the microbial community. The technique uses assembly and binning for the reconstruction of genomes from metagenomic reads, but it comes with many technical challenges. Among the main challenges are: 1. the detection of less prevalent taxa within the sample 7 , and 2. the high sequence similarity among strains of the same species, making assembly a particularly challenging task 8,9 . Previous work in computational metagenomics has established that repeated sampling from an environment or subject can facilitate the reconstruction of genomes from species 10,11 and strains 12 in the microbial community. Two major steps are involved in the process: 1. assembly and 2.
binning. The joint assembly of samples from an individual host is called co-assembly, while binning consists in the grouping of co-assemblies into MAGs in the process of differential coverage binning 13 . The abundance per sample is then inferred from coverage depth and/or k-mer frequencies 14 . The rationale behind this type of binning method is based on the observation that the abundance of genetic material from the same organism changes in one subject or environment over time in a highly correlated manner 10,11,13,15 . Although over half of all species in the gut are represented by single strains 16 , the fact that strains persist within a host for long periods of time (i.e. decades) 17 can be used as leverage (or: a strengthening point) to the scope of resolving genomes. of species per enzyme ID was derived as follows: MAGs were grouped by subject, enzyme ID and GTDB-assigned species; distinct groups were selected and a count was obtained; the sum of distinct species falling within one enzyme ID was obtained and the proportion was derived.

Analysis workflow and scripts
In order to manage the processing of the data in the High Performance Computing environment (sequence data processing, pooled assembly, binning, MAGs quality assessment), Nextflow 42 was used. For the installation and management of environments, conda 43 was used (v4.7.12). R 44 and the following R packages were used for the data analysis and visualization: ape 45  A schematic workflow of the data processing and analysis is represented in Supplementary   Figure 2. Workflows and scripts used for the data processing are available in the Github repository https://github.com/koadman/metapigs. Scripts used for the data analysis and visualization are available in the Github repository https://github.com/GaioTransposon/metapigs_dry.

RESULTS
A total of 51,170 MAGs from 911 samples were generated in this study, 92.96% (n=47569) of which derived from samples of post-weaning piglets, 5.24% (n=2680) from samples of the piglets' mothers, and 1.81% (n=926) from negative and positive control samples. The binning method employed here makes use of coverage information across multiple samples, with the potential for increased accuracy with increasing numbers of samples. We observed a positive correlation (Spearman r 2 =0.92; p value<0.0001) between the number of time points sampled per subject and the number of bins obtained for that subject (

Completeness and contamination of MAGs
According to quality analysis with CheckM 27 , 46.5% (n=23,798) of the total MAGs (n=51,170), were classified as ≥70% complete with a ≤10% contamination rate, while 24.4% (n=12486) were classified as nearly complete genomes as by the Bowers et al (2015) definition of MAGs with ≥90% completeness and ≤5% contamination (Figure 2). Three hundred and thirty MAGs were ≥99% complete and had a ≤0.1% contamination rate. The distribution of contig counts (median=126.0; mean=132.6; sd=70.0) per MAG is displayed in Supplementary Figure 3. Contigs had an average N50 of 36,276 nucleotides (median=25,438).

MAG ANI clusters and taxonomic clusters
As a reference-free technique of MAG generation was adopted, we grouped the MAGs into clusters of similar genomes using two methods: 1) 95% and 99% sequence similarity using the genome dereplication tool dRep 31 ; and 2) taxonomic clustering with Kraken2 29 against the GTDB 30 . Using the first method, nearly half of the MAGs (43.8%) passed the established length-filtering threshold (500,000 nt) and were assigned a cluster ID based on average nucleotide identity (ANI).
A total of 1,267 unique primary clusters (95% ANI) and 4,480 unique secondary clusters (99% ANI) were obtained. Primary or secondary clusters that were found in only one subject were defined as unique, whereas clusters that were found in more than one subject were defined as common. Primary clusters (95% ANI) were common among subjects at a higher rate than secondary clusters (99% ANI): 98.6% and 86.2% among piglets and 91.4% and 72.1% among the mothers, respectively, delineating strain specificity within hosts (Supplementary Figure 4). The second method applies taxonomic clustering using the most comprehensive to date bacterial and archaeal taxonomic database. We assessed the extent of agreement between the first method (dRepgrouping based on ANI) and the second (GTDB taxonomic assignment). The degree of agreement ranged from a mean of 94.4% at the species level (median=100%), 96.7% at the genus level (median=100%), and 98.3% at the family level (median=100%) to reach a >99.1% agreement at the order, class and phylum level (Supplementary Figure 5). Out of all MAGs assigned a cluster (43.8%), no duplicate secondary clusters were detected within the same host. On the contrary, 24 MAGs from 11 hosts were assigned a total of 7 distinct primary clusters. Among primary clusters found more than once within a host, two were also found more than once among other hosts (primary cluster "838": 3 hosts; primary cluster "1099": 4 hosts). The MAGs assigned primary cluster 838 were all assigned to the same species (F23-B02 sp000431075 species; Oscillospiraceae family) by GTDB. The MAGs assigned primary cluster 1099 were assigned to three species by GTDB, falling under the Lachnospiraceae family.
Based on GTDB clustering of MAGs derived from positive control samples (3 positive control types; 8 replicates within each), MAG taxonomic assignments matched the expected profile and the profile obtained from analysis of the reads with MetaPhlAn2 77 reported in our previous study 18 (Supplementary Figure 6).

A remarkable tightly regulated compositional shift
We found the aging-associated compositional shift throughout the length of the study to be consistent among hosts, independent of treatment, and marked by changes that were highly specific in the taxonomy. This tight regulation is observable in Figure 4, where, for visualization purposes, the abundance changes across time in all subjects are displayed for only the 26 most abundant species. A more comprehensive temporal heatmap, comprising the 112 most abundant species in the piglet population is provided in Supplementary Figure 10.
Differential abundance between time points and statistical analysis of the changes over time was obtained with SIAMCAT 33 . Time points analyzed were at consecutive weeks from the start (t0): t0, t2, t4, t6, t8, and t10. The top 10 significantly shifting species in piglets between t0 and t4, and between t4 and t8, are shown in Figure 5. Correlations with p value < 0.05 after false discovery rate correction were considered significant. Significance values, fold change and other metrics are provided in Supplementary File 1.
We display all significant abundance shifts (FDR adjusted p value < 0.05) in Supplementary   Figure 17. From this plot it is evident that most species increase during the first time interval (red dots). While most species kept increasing the following week (yellow dots; fc > 0), some decreased (yellow dots; fc < 0; cluster on the bottom left). Notably, all Prevotella species (n=16) significantly increased in abundance during the first or the second week (red and yellow dots, respectively), and nearly all of these Prevotella species (n=13) decreased during the last time interval (pink dots; fc < 0; top left). (S17) suited to highly heterogeneous environments such as the gut microbiome. The second method assumes that a set of organisms exists that do not change between subjects or over time, and these are used as internal references. This is also a risky assumption in unstable microbial environments such as that of the post-weaning piglet gut microbiota, as we witnessed in this study. The increase or decrease of a certain organism over time does not necessarily cause a decrease or increase of another organism, but the use of proportions (third method) can lead to such conclusions, as correlations "piggyback" on each other. This phenomenon is referred to as spurious correlation of ratios. Despite the latter drawback, normalisation by proportions is currently among the most widely used normalisation methods in microbiome studies.
In our study we applied a number of normalisation methods in parallel, among which we chose median sequencing depth, rarefaction and proportions. Although the temporal shift was evident when applying any of these normalisation methods, we are aware of the fact that the use of proportions to normalise for library size can lead to a higher discovery of false correlations.
Bias can also arise from the nature of organisms in the sample. The content of GC 118 and the genome size 119 of each organism both influence the representation of species in the sample, by preferential affinity of the polymerase and the portion each organism takes up in the sample having a direct link to its genome size.   Distribution of scaffold counts (A), distribution of predicted genes counts (B), and completeness and contamination rates over MAGs of partial quality (grey), medium quality (pink) and nearly complete MAGs (blue) (C). Partial and medium quality MAGs have a level of completeness between 60 and 80, and between 80 and 90, respectively, and a rate of contamination of < 10, and ≤ 10, respectively. MAGs with a ≥ 90% completeness and a ≤ 5% contamination are considered nearly complete as by the GSC MIMAG standards (Bowers et al, 2017). predicted genes The heatmap reports the log transformed normalized relative abundance from all piglets. Each panel represents one time point and, within each panel, all samples from all piglets are included. At t0 piglets are aged 3 to 4 weeks old; at t10 piglets are aged 8 to 9 weeks old. Samples are rarefied to obtain equal amount of GTDB clustered MAGs per sample, and subsequently the most prevalent taxa, present in at least 20% of all the samples are included. Analysis is performed with PhyloSeq.  Diversity of GTDB−predicted species in the piglet population Figure 5. Significantly changing GTDB-species with time.
Shown are the top 10 significantly changing GTDB taxonomically clustered MAGs at the species level, between time points 0 and 4 (2 weeks interval) (left), and between time points 4 and 8 (2 weeks interval) (right). Points show normalized and log -transformed abundance within each subject and GTDB-predicted species.  Two-hundred ninety-four carbohydrate active enzymes are represented. The position along the x-axis indicates the relative abundance (as a percentage) of the species with the highest copy number of genes encoding a specific enzyme, while the y-axis represented the prevalence among the pig population (n=126 piglets; n=42 mothers). Taxonomic assignments of MAGs are based on the GTDB.   Primary clusters (95% ANI) (A-B) were shared across subjects at a higher rate than secondary clusters (99% ANI) (B-C): 98.6% and 86.2% among piglets (A-C) and 91.4% and 72.1% among the mothers (B-D), respectively, delineating strain specificity. It appears that each individual subject harbors a specific set of ANI clustered MAGs (95% ANI or 99% ANI) that are unique to the subject, as well as a set ANI clustered MAGs (95% ANI or 99% ANI) that are common among subjects. A larger fraction of unique ANI clustered MAGs (either at 95% or 99% ANI) is seen among the mothers than among the piglets. Extent of agreement is shown between GTDB taxonomic clustering and dRep-primary clusters (95% ANI) (top) and secondary clusters (99% ANI) (bottom). Color coding represents MAG frequency (lower: dark; higher: light)   Taxonomic profile is obtained from PhyloSeq analysis of GTDB clustered MAGs of the positive controls samples. The taxonomic profile within each replicate is displayed in relative abundance. Each replicate sample is normalized by median sequencing depth.   Figure 7. Taxonomic composition of the piglet gut.

B A
Tree maps were built from MAGs taxonomically clustered with GTDB, without scaling for the relative abundance of each MAG. This accounts for prevalence of the MAG in the pig population (e.g. if the same taxon occurs in multiple piglets it can get counted more than once). The most common Phyla (top left), the least common Phyla (top right), the 50 most common Genera colored by Order (bottom left), and the 50 most common Species colored by Family (bottom right) are shown. , across the first and the second (left), and the third and fourth principal components (right). Prior to PCA, the data was normalized by proportions, the average proportion was computed by time point and treatment cohort, and the centered log transformation was applied. Labels report the treatment cohorts: Control, D-Scour, ColiGuard, Neomycin, NeoD (Neomycin followed by D-Scour), NeoC (Neomycin followed by ColiGuard). The heatmap reports the log transformed normalized relative abundance from all piglets. Each panel represents one time point and, within each panel, all samples from all piglets are included. At t0 piglets are aged 3 to 4 weeks old; at t10 piglets are aged 8 to 9 weeks old. Samples are rarefied to obtain equal amount of GTDB clustered MAGs per sample, and subsequently the most prevalent taxa, present in at least 20% of all the samples are included. Analysis is performed with PhyloSeq.