Phylogenomics of SAR116 clade reveals two subclades with different evolutionary trajectories and important role in the ocean sulfur cycle

The SAR116 clade within the class Alphaproteobacteria represents one of the most abundant groups of heterotrophic bacteria inhabiting the surface of the ocean. The small number of cultured representatives of SAR116 (only two to date) is a major bottleneck that has prevented an in-depth study at the genomic level to understand the relationship between genome diversity and its role in the marine environment. In this study, we use all publicly available genomes to provide a genomic overview of the phylogeny, metabolism and biogeography within the SAR116 clade. This increased genomic diversity revealed has led to the discovery of two subclades of SAR116 that, despite having similar genome size (ca. 2.4 Mb) and coexist in the same environment, display different properties in their genomic make up. One represents a novel subclade for which no pure cultures have been isolated and is composed mainly of single-amplified genomes (SAGs). Genomes within this subclade showed convergent evolutionary trajectories with more streamlining features, such as low GC content (ca. 30%), short intergenic spacers (<22 bp) and strong purifying selection (low dN/dS). Besides, they were more abundant in metagenomic databases recruiting also at the deep chlorophyll maximum. Less abundant and restricted to the upper photic layers of the global ocean, the other subclade of SAR116, enriched in MAGs, accommodated the only two pure cultures. Genomic analysis suggested that both clades have a significant role in the sulfur cycle with differences in the way in which both clades can metabolize the dimethylsulfoniopropionate (DMSP). IMPORTANCE SAR116 clade of Alphaproteobacteria is an ubiquitous group of heterotrophic bacteria inhabiting the surface of the ocean, but the information about their ecology and population genomic diversity is scarce due to the difficulty of getting pure culture isolates. The combination of single-cell genomics and metagenomics has become an alternative approach to study this kind of microbes. Our results expand the understanding of the genomic diversity, distribution, and lifestyles within this clade and provide evidence of different evolutionary trajectories in the genome make-up of the two subclades that could serve to understand how evolutionary pressure can drive different adaptations to the same environment. Therefore, the SAR116 clade represents an ideal model organism for the study of the evolutionary streamlining of genomes in microbes that have relatively close relatedness to each other.

8 genus HGC2D, which showed a genome size higher than the rest with an average of 193 3.2 Mb (Table S1 and Fig 1A). Likewise, this genus also exhibited high values for both 194 GC content and intergenic spacer sizes. As a consequence of the smaller size of the 195 intergenic space, genomes within the LGC have on average more than 100 genes per 196 Mb of genome (Table S3). 197 198 These genomic features suggested that members within the LGC subclade are 199 undergoing a streamlining process without genome reduction. For that reason, we 200 studied other characteristic genomic parameters that have been proposed to be 201 relevant in the streamlined genomes such as selective pressure and the number of 202 paralogs (34-37). Microevolution was measured as the ratio of nonsynonymous to 203 synonymous polymorphisms (dN/dS ratio). We found that the median dN/dS value was 204 ca. 0.09 for LGC, this value was comparable to the better known marine SAR11 clade 205 (34) and suggests a strong purifying selection acting on the genome evolution of this 206 subclade (Table S3). Within the HGC subclass, we observed much more variable 207 values. While the genus HGC2A showed similar values to the LGC (0.08), in the other 208 genera within the HGC we found markedly higher median dN/dS (ca 0.15) ( Table S3). 209 However, the number of paralogs (ca. 170) was consistent across genera in both 210 subclades (Table S3). 211

212
In order to put these genomic features into perspective, we compared these groups 213 with a collection of reference marine microbes with different ecological strategies 214 (Table S3 and Fig 1C). Despite the divergence, genomes within the LGC subclade 215 showed consistent genomic parameters, some of them (GC content and dN/dS ratio) 216 typical of well-studied streamlined genomes such as SAR11 or Ca. Actinomarina 217 minuta (35) ( Table S3 and Fig 1C). The median intergenic distance was higher than 218 these two microbes, although it was slightly lower than other marine microbes with 219 streamlined genomes such as marine ammonia-oxidizing thaumarchaeon "Ca 220 Nitrosopelagicus brevis" and the cyanobacterial Prochlorococcus marinus CCMP1986 221 (Table S3 and Fig 1C). However, the estimated genome size was double that of all 222 these reference genomes. On the other hand, the HGC group shows multiple genomic 223 evolutionary trajectories with features more similar to the marine copiotrophic 224 heterotrophs such as Erythrobacter and Alteromonas or the cyanobacterium 225 Synechococcus sp. CC9902. The case of the HGC2A group is outstanding in 226 displaying an intermediate trend with strong purifying selection and lower GC more 227 similar to LGC (Table S3 and Fig 1C). In addition, like the LGC groups, HGC2A had a 228 higher proportion of genomes recovered by single-cell genomics (Fig 1B). 229 230 Ecological distribution (metagenomic recruitment) 231 The differential genomic features observed between both subclades could be related to 232 adaptations to specific ecological niches. Therefore, we analyzed the distribution 233 patterns using metagenomic read recruitment analysis in the large global dataset from 234 the Tara Oceans Project (16). 235 236 First, we analyzed the relative abundance of all the genomes against their occurrence 237 in the metagenomics samples which allowed for the determination of several 238 genomospecies i.e groups of genomes with close phylogenomic relationship and 239 similar relative abundances within the same geographical locations (35, 38). We were 240 able to differentiate 22 genomospecies (Table S1 and Fig 1A). The minimum pairwise 241 ANI value among these ecogenomic units of classification was ca. 85%. The results 242 showed that SAR116 clade microbes were found exclusively associated with the upper 243 layers of the epipelagic zone. None of the genomospecies was present in the cold-244 water stations of the Southern Ocean or in mesopelagic zones (>200m) (Fig 2A). While 245 HGC members were only found in surface waters, LGCs showed a broader 246 distribution, being present at a higher number of stations and depths, which suggests 247 adaption to a wider range of conditions (Fig 2A). For instance, genomospecies LGC1-248 A1 and LGC1-A2 recruited in the highest number of stations from surface and deep 249 chlorophyll-maximum (DCM) (Fig 2A). While genomospecies B1, B2 within the HCG2 250 and A1, A2, B1, C1 and D1 from the LGC could be considered the most cosmopolitan, 251 present in several oceanic provinces from 30ºN to 30ºS, other genomospecies 252 presented predilection for specific regions such as the Mediterranean Sea (HGC1-A1 253 and HGC2-A2) and Pacific Ocean South-East (HGC2-A1 and LGC1-A3) (Fig 2A) Ocean (Fig 2A). 260

261
In order to examine the intrapopulation sequence diversity, we used the metagenomic 262 recruited reads to determine the read-based average nucleotide identity (ANIr). Most 263 genomospecies in both subclades (HGC and LGC) showed a median ANIr value of ca. 264 95% (species threshold). None of the genomospecies within the HGC presented a 265 lower value but genomospecies HGC1-A1, HGC2-A2 and D2 showed lower 266 intrapopulation sequence diversity (ANIr >96%). These genomospecies could be 267 considered endemic to the Mediterranean Sea and the station TARA_004 (located at 268 the connection between the Mediterranean and the Atlantic Ocean). Therefore, it could 269 suggest a more recent divergence of these groups adapted to the special conditions of 270 the Mediterranean such as limiting P concentration. A similar example has already 271 been described in the SAR11 genomospecies Ia.3/VII, which also showed a 272 preferential presence in the Mediterranean (34). However, three LGC subclade 273 genomospecies (LGC2-B2, LGC2-D1 and D2) showed higher intra-population diversity 274 which could indicate higher ecological persistence over time of these populations ( Fig  275   2A) (39). This is reflected in the linear recruitment plots of these genomospecies 276 (LGC2-D2) with a minimum alignment identity threshold located at ca. 85% and HGC2-277 D1 whose pattern could be associated with a less diverse population (ca. 97%) (Fig  278   2B). 279

280
The linear recruitments revealed the presence of metagenomic islands in two 281 genomospecies (LGC1-A1 and LGC2-C1) belonging to different families within the 282 LGC subclade in metagenomic samples from different locations (Fig S2A and S2B). The results showed a highly hypervariable region that was always preserved. The 284 location of the island was conserved among the genomes within the same 285 genomospecies. Detailed analysis of the gene content showed that they are involved in 286 synthesizing the outer glycosidic envelope of the cells (Fig S2C). This high diversity 287 has been explained because they are important phage recognition targets (40). 288 289

General Metabolic features within SAR116 HGC and LGC genomes 290
The isolation and sequencing one decade ago of two bacterial strains, IMCC1322 and 291 HIMB100 (28, 29), shed light on the physiology and metabolic potential of the SAR116 292 clade in the oceans. Here, with the increased genomic diversity of SAGs and MAGs, 293 we have expanded the knowledge of this ubiquitous marine group. Given the 294 incomplete nature of SAGs and MAGs, we used the pangenome as a unit to analyse 295 the metabolism against several functional databases (see methods). We included in 296 the comparison the genome of the pure culture IMCC1322, which was 297 phylogenomically classified into the HCG2C (Table S1). Most of the results are in 298 agreement with previous metabolic reports (28, 29) ( Fig 3A). Both HGC and LGC 299 subclades are aerobic, chemoorganotrophic microorganisms, encoding enzymes for 300 the three common glycolysis pathways (Embden-Meyerhof-Parnas, Entner-Doudoroff, 301 and pentose phosphate), although as reported from the pure cultures (28, 29), all 302 genomes of both subclades lack 6-phosphofructokinase (pfkA); the tricarboxylic acid 303 cycle (TCA cycle); and the complexes I to IV involved in the electron transport chain 12 (ETC). However, in the latter, some differences arose among subgroups. Complex II 305 succinate dehydrogenase could not be detected within the genus LGC2C (18 306 genomes). 307 308 Besides, while the most common version of the complex I detected was the H + -NADH 309 ubiquinone oxidoreductase (nuo) operon, we detected a horizontal gene transfer (HGT) 310 event within the genomes of LGC2C and LGC2D, on which the nuo operon was 311 replaced with the sodium equivalent Na+-pumping NADH:quinone oxidoreductase (nqr) 312 operon, being the closest relative to this complex the methylotrophic bacteria 313 HTCC2181 (67.52% average amino acid identity) ( Fig S3A). It has already been 314 reported that multiple HGT events have allowed the dispersal of this operon among 315 different bacterial lineages (41). In fact, there is still reminiscence of a gene belonging 316 to the nuo cluster (nuoL) in these genomes immediately adjacent to the nqr operon 317 which is not present in the HTCC2181 genome ( Fig S3A). The use of sodium ion 318 transport to generate an electrochemical potential that can be used both for ATP 319 synthesis and also as a primary sodium pump to maintain ionic homeostasis could be 320 an evolutionary advantage in the marine environment. In marine bacteria of the phylum 321 Marinimicrobia the presence of these different versions of respiratory complex I have 322 been correlated with improved ecological adaptation to discrete niches (epipelagic and 323 mesopelagic environments) (42). 324

325
The glyoxylate shunt (GS), a two-step metabolic pathway that serves as an alternative 326 to the TCA cycle was only detected in some genera of the HGC subgroup (HGC1, 327 HGC2A, HGC2D) and LGC1. In addition, we detected marked differences in the 328 acquisition and degradation of multiple sugar compounds. Overall, families of glycoside 329 hydrolases (GHs) involved in the degradation of simple and complex oligosaccharides, 330 such as glycogen, cellulose or chitin, and sugar transporters ( Fig 3A) were detected in 331 all subgroups, although it is remarkable the elevated number of GH families within 332 genera HGC2 and LGC1. Contrastingly, the low numbers of these degradative 333 enzymes within LGC2 and HGC1 may indicate different ecological strategies degrading 334 organic carbon sources (e.g., cellulase was only detected in HGC2). 335

336
Regarding the metabolism of amino acids and vitamins, they all carried the necessary 337 genes for biosynthesis of the twenty common amino acids (data not shown) and the 338 vitamins B2 (riboflavin), B5 (pantothenate) B6 (pyridoxal), B9 (folate), B12 (cobalamin), 339 the molybdenum cofactor, and the heme group ( Fig 3A). Remarkably, functional 340 annotation of proteins indicated that instead of using the aspartate 4-decarboxylase, 341 involved in the transformation of aspartate to alanine, they synthesise the latter via the 342 inhabiting these places need access to other P-compounds (e.g, phosphonates) to 350 grow and/or survive (38, 45). All of them encoded for the synthesis of a 351 proteorhodopsin. Amino acid sequence analysis indicated that all of them were proton 352 pumps (DTE motif, (46)) and most of them (90 out of 91) absorbed in the blue 353 spectrum. Next to the proteorhodopsin (co-located on the same strand), it is found the 354 gene cluster involved in the synthesis of retinal (Fig S3B). The position of these genes 355 varies between HGC and LGC, and among genera within the high GC groups, which 356 could suggest several independent acquisition events after a common ancestor (Fig  357   S3B). However, in all members of the LGC subclade, the gene coding for isopentenyl 358 diphosphate isomerase (ispA) is not present. Remarkably, LGC1 (9 genomes) is the 359 only group that lacks the retinal biosynthesis operon (Fig S3B). This genomic deletion 360 forces the bacterium to retrieve retinal from the environment, like many other marine 361 streamlined organisms (35,47,48). Despite the different evolutionary trajectories in 362 terms of genomic architecture, at the functional level, both subclades appear to have 363 many similarities including the absence of essential genes in certain pathways 364 suggesting that they have evolved from a common ancestor. 365 366

Contribution of SAR116 to the sulfur cycle in the ocean 367
The ocean represents a major reservoir of sulfur (mainly in the form of sulphates) on 368 Earth (49). In this environment, the water column can be considered as a 369 heterogeneous habitat, formed by many kinds of microorganisms that interact with the 370 sulfur cycle. For instance, photosynthetic eukaryotes can reduce sulphate to assimilate 371 it into reduced organic sources. Some bacteria can couple sulphate respiration to 372 degrade organic matter in the absence of oxygen, such as the minimum oxygen zones 373 (50). Conversely, other prokaryotic groups can oxidize inorganic and organic sulfur to 374 produce energy (51). 375 376 Functional inference of SAR116 genomes showed that this clade plays a key role in 377 the sulfur cycle (Fig 4). Reduced organic sulfur, in the form of DMSP, an organosulfur 378 compound produced by phytoplankton as compatible solute (52) can be degraded into 379 DMS gas, one of the main sources of sulfur in the atmosphere and reduced sulfur (53, 380 54) and acrylate by the activity of a DMSP lyase. We found two types of DMSP lyases, 381 dddL and dddP (Fig 4). DMS can be biotically transformed to dimethyl sulfoxide detected only in the genomes of the HGC2, and LGC1 subgroups (Fig 3B), while the 395 demethylation pathway (dmdA) was exclusively detected on LGC subclade (LGC2 396 genera). Regarding the rest of the genes involved in the degradation of MMPA to 397 methanethiol, we found homologs to dmdB and dmdC with low identity (ca. 40%), but 398 not for dmdD. This same pattern has already been described in SAR11 suggesting that 399 the function of this gene (dmdD) could be replaced by other non-orthologous 400 isofunctional enzymes (56). It is remarkable that the main pathway to degrade DMSP, 401 found in many epipelagic microorganisms (53)  among LGC1 and LGC2 SAR116 groups, but also detected in HGC2A and HGC2C 416 (Fig 3C and 4). Previous studies demonstrated the presence and activity of sulfur-417 oxidizing chemolithoautotrophs to use reduced sources of sulfur (e.g. SUP05 and 418 OM252 clades) in anaerobic waters (58, 59), but also in the photic aerobic water 419 column in which sox genes are common (18, 60, 61) for energy generation, sometimes 420 coupled to inorganic carbon fixation (62). In this sense, it seems that SAR116, like 421 many other marine prokaryotes (63, 64), may be capable of generating energy from the 422 oxidation of inorganic sulfur on surface waters. Thus, although both subclades appear 423 to be important players in the sulfur cycle in surface ocean waters, the LGC subclade 424 may have an advantage for their capability to demethylate DMSP. The LGC1 group 425 despite its streamlined genome seems to have a higher metabolic versatility than the 426 rest of the LGC group, more similar in this sense to the HGC members, not only in 427 sulfur metabolism but also with a higher richness of both GHs and transporters (Fig 3) 428 which could be one of the reasons for its abundance at DCM (Fig 2). (https://github.com/kassambara/factoextra) libraries of R were used for these analyses. 505 The FactoMineR library was used to standardize the data during the PCA analysis. The 506 plot was made using the Biplot function, in which values on the same side as the 507 variable have a high value for that variable regardless of their position in the plot. 508 509

Metagenomic fragment recruitment and SAR116 biogeography. Metagenomes 510
from Tara Oceans expedition (16) were used to study ecological distribution patterns of 511 SAR116 genomes. Only those genomes recruiting at least three reads per kilobase of 512 genome and gigabase of metagenome (RPKG) and genome coverage of >70% and 513 with an identity threshold of ≥98% were kept for further analyses. To avoid the bias 514 caused by the high similarity rRNA operon, it was removed from all genomes before 515 recruitment (35, 38). Metagenomic reads were aligned using BLASTN (82), using a 516 cut-off of 98% nucleotide identity over a minimum alignment length of 50 nucleotides 517 and ≥50% of each genome should be covered by reads for consideration. The same 518 high-quality parameters were used for the metagenomic linear recruitment. The 519 resulting alignments, together with the distribution of the reads according to the identity 520 of the alignment (histogram) were plotted using the ggplot2 package in R.      Table S1: Detailed information about the genomes used in this study. 884 LGC