Proteomic and Bioinformatic Analysis of Streptococcus suis Human Isolates: Combined Prediction of Potential Vaccine Candidates

Streptococcus suis is a Gram-positive bacterium responsible for major infections in pigs and economic losses in the livestock industry, but also an emerging zoonotic pathogen causing serious diseases in humans. No vaccine is available so far against this microorganism. Conserved surface proteins are among the most promising candidates for new and effective vaccines. Until now, research on this pathogen has focused on swine isolates, and studies that identify and characterize surface proteins from human clinical isolates are lacking. In this work, we performed a comparative proteomic analysis of six clinical isolates from human patients, all belonging to the major serotype 2, by “shaving” the live bacterial cells with trypsin, followed by LC-MS/MS analysis. We identified 131 predicted surface proteins and carried out a label-free semi-quantitative analysis of protein abundances within the six strains. Then, we combined our proteomics results with bioinformatic tools to help improve the selection of novel antigens that can enter the pipeline of vaccine candidate testing. Our work thus complements the reverse vaccinology concept.


Introduction
Streptococcus suis is a Gram-positive bacterium that inhabits the upper respiratory tract of pigs as a commensal, colonizing up to 100% of the animals [1]. However, it can cause severe infections such as bronchopneumonia in the lower respiratory system of swine, as well as invasive diseases including meningitis, endocarditis, sepsis, and even sudden death [2,3]. Therefore, in addition to its impact on animal welfare, the economic importance of this pathobiont species is very high: it is responsible for monetary losses in the livestock industry worldwide, and it also increases production costs because of the need to supply prophylactic antibiotics [4].
In addition, S. suis is considered an emerging zoonotic pathogen, causing infections in humans who are in contact with infected pigs, mainly in the slaughter industry, as well as in people consuming raw or poorly cooked pork meat, or other pork byproducts [5-7]. Two outbreaks with high mortality rates, in 1998 and 2005, caused numerous human casualties in China [1,3,8], and the disease has become endemic in other South-East Asian countries. In Vietnam and Thailand, S. suis infections are amongst the most common causes of meningitis in adults [4,8,9]. In addition to Asia, many other cases of infections in humans have been reported in Europe, America, and Oceania [8].
S. suis strains are classified in 35 different serotypes according to their serological reaction of the capsular polysaccharide [10]. Of these, serotype 2 (SS2) is by far the most prevalent worldwide, being highly virulent both in pigs and in humans. Whereas there are differences in the geographical

Bacterial Surface "Shaving" and Peptide Extraction
Peptides from surface proteins were obtained by the bacterial "shaving" approach as already described [25,30], with slight modifications. Briefly, 25 mL of cultures were centrifuged at 3500 × g for 10 min at 4 °C, and the pelleted bacteria were washed twice with phosphate-buffered saline (PBS). Cells were resuspended in 0.5 mL PBS/30% sucrose (pH 7.4). Protease "shaving" of resuspended bacteria was performed with 1 µg trypsin (Promega, Madison, WI, USA) for 30 min at 37 °C with top-down agitation within an incubator. The resulting digestion mixtures were centrifuged again at 3500 × g for 10 min at 4 °C, and the supernatants, which contained the peptides (i.e., the "surfome" fractions), were filtered using 0.22 µm pore-sized filters (Merck-Millipore, Burlington, MA, USA). Surfomes were re-digested with 0.5 µg trypsin overnight at 37 °C with top-down agitation. Peptides were purified prior to analysis, using Oasis HLB extraction cartridges (Waters, Milford, MA, USA). Peptides were eluted with increasing concentrations of acetonitrile (ACN)/0.1% formic acid, according to the manufacturer's instructions. Peptide fractions were concentrated with a vacuum concentrator (Eppendorf, Hamburg, Germany), resuspended in 100 µL of 2% ACN/0.1% formic acid, and kept at −20 °C until further analysis.

LC-MS/MS Analysis
Peptide separation was performed by nano-LC using a Dionex Ultimate 3000 nano UPLC (Thermo Scientific, San Jose, CA, USA), equipped with a reverse phase C18 75 µm × 50 Acclaim Pepmap column (Thermo Scientific) at 300 nL/min and 40 °C for a total run time of 85 min. The mix of peptides was previously concentrated and cleaned up on a 300 µm × 5 mm Acclaim Pepmap cartridge (Thermo Scientific) in 2% ACN/0.05% formic acid for 5 min, with a flow of 5 µL/min. Solution A (0.1% formic acid) and solution B (80% ACN, 0.1% formic acid) were used as mobile phase for the chromatographic separation according to the following elution conditions: 4-35% solution B for 60 min; 35-55% solution B for 3 min; 55-90% solution B for 3 min followed by 8 min washing with 90% solution B, and re-equilibration for 12 min with 4% solution B.
Peptides eluted from the column were ionized by a nano-electrospray ionization source and analyzed in positive mode on a trihybrid Thermo Orbitrap Fusion (Thermo Scientific) mass spectrometer operating in Top30 Data Dependent Acquisition mode with a maximum cycle time of 3 s. Single MS scans of peptide precursors were acquired in a 400-1500 m/z range at 120,000 resolution (at 200 m/z) with a 4 × 10⁵ ion count target threshold. For MS/MS, precursor ions were first isolated in the quadrupole with a 1.2 Da window, and then CID-fragmented in the ion trap with 35% normalized collision energy. Monoisotopic precursor selection was turned on. Ion trap parameters were: (i) the automatic gain control was 2 × 10³; (ii) the maximum injection time was 300 ms; and (iii) only those precursors with charge state 2-5 were sampled for MS/MS. To avoid redundant fragmentations, a dynamic exclusion time of 15 s was set, with a 10-ppm tolerance around the selected precursor and its isotopes.

Protein Identification and Database Searches
The mass spectrometry raw data were processed using Proteome Discoverer (version 2.1.0.81, Thermo Scientific). Charge state deconvolution and deisotoping were not performed. MS/MS spectra were searched with the SEQUEST engine (version v.27, Thermo Scientific) against a local database containing all the proteins derived from the genome sequence of Streptococcus suis BM407 (downloaded from [31]), applying the following search parameters: trypsin was used for theoretical digestion of protein sequences, allowing up to one missed cleavage; methionine oxidation was set as variable modification; and a mass tolerance of 10 ppm was set for precursor ions, and of 0.1 Da for product ions. Peptide identifications were accepted if they exceeded the filter parameter Xcorr score versus charge state with SequestNode Probability Score (+1 = 1.5, +2 = 2.0, +3 = 2.25, +4 = 2.5). Validation of peptide spectral matches (PSM) was done at a 1% false discovery rate (FDR) using Percolator, based on q-values. For protein quantification, precursor ion areas were calculated using the precursor ion area detector and normalized by the total protein amount mode in Proteome Discoverer 2.1.
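The charge-dependent Xcorr filter described above can be illustrated with a minimal sketch. It is not the Proteome Discoverer implementation: the function name, the example PSMs, the inclusive comparison, and the handling of charge states above +4 (for which the text gives no cut-off) are all assumptions made for illustration.

```python
# Charge-dependent Xcorr cut-offs quoted in the text
# (+1 = 1.5, +2 = 2.0, +3 = 2.25, +4 = 2.5).
XCORR_MIN = {1: 1.5, 2: 2.0, 3: 2.25, 4: 2.5}


def passes_xcorr(charge, xcorr):
    """Return True if a PSM meets the Xcorr cut-off for its charge state.

    Assumptions (not stated in the text): the comparison is inclusive, and
    charge states above +4 reuse the +4 cut-off.
    """
    if charge < 1:
        raise ValueError("charge state must be a positive integer")
    return xcorr >= XCORR_MIN[min(charge, 4)]


# Hypothetical PSMs: (peptide label, charge state, Xcorr score)
psms = [
    ("peptide_A", 2, 2.4),
    ("peptide_B", 2, 1.8),
    ("peptide_C", 3, 2.25),
    ("peptide_D", 5, 2.6),
]
accepted = [label for label, charge, xcorr in psms if passes_xcorr(charge, xcorr)]
print(accepted)
```

In a real workflow this score filter would be followed by the 1% FDR validation, which is not reproduced here.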

Computational Prediction of Protein Subcellular Localization
Primary predictions of S. suis BM407 protein subcellular localization were assigned by using the web-based algorithm LocateP v2 [32]. These predictions were cross-checked with several feature-based algorithms: TMHMM [33] for searching transmembrane helices; SignalP 5.0 [34] for type-I signal peptides; and LipoP 1.0 [35] for identifying type-II signal peptides, which are characteristic of lipoproteins. Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) annotations were retrieved from the KEGG database and were used for inferring relationships using the KEGG Mapper suite [36]. In silico prediction of protein vaccine candidates and antigens was done using the web-based algorithm VaxiJen 2.0 [37].

Data and Statistical Analysis
Peptide extractions were made in triplicate from three independent cultures and "shaving" experiments for each strain. Proteins were considered to be present in a given sample as long as they were identified in at least two out of the three biological replicates for such a sample. Otherwise, proteins found only in one biological replicate were not considered to be found in the sample(s) and were discarded from the overall count of identified proteins. For further quantitative analysis, means and standard deviations were calculated using an Excel spreadsheet (Microsoft Excel 2011 v14.0.0 for Mac, Microsoft, Redmond, WA, USA). Values were z-scored prior to the principal component and clustering analysis. The R package FactoMineR was used to analyze the data mainly through principal component analysis. The factoextra package was used to represent these analyses, and the pheatmap package to cluster the data and represent the corresponding heatmaps. Non-detected proteins in samples were assigned a 0 value to avoid the processing of not available (NA) data.
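The replicate rule, the zero-filling of non-detected proteins, and the z-scoring described above can be sketched as follows. This is a minimal illustration, not the authors' spreadsheet or R code: the function names and the example peak areas are hypothetical, and the use of the sample standard deviation for the z-score is an assumption.

```python
from statistics import mean, stdev


def present_in_sample(replicate_areas, min_hits=2):
    """A protein counts as present when it is detected (area > 0) in at
    least `min_hits` of the biological replicates; None means not detected."""
    hits = sum(1 for area in replicate_areas if area is not None and area > 0)
    return hits >= min_hits


def zero_fill(values):
    """Replace non-detected values (None) by 0, as done to avoid NA data."""
    return [0.0 if v is None else float(v) for v in values]


def z_scores(values):
    """Centre and scale a list of abundances.

    Assumption: the sample standard deviation (n - 1) is used; a constant
    vector is mapped to zeros to avoid dividing by zero.
    """
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


# Hypothetical peak areas for two proteins in three replicates of one strain
areas = {
    "protein_1": [1.2e6, None, 9.0e5],   # detected in 2 of 3 replicates: kept
    "protein_2": [None, None, 3.0e5],    # detected in 1 of 3 replicates: discarded
}
kept = {name: zero_fill(v) for name, v in areas.items() if present_in_sample(v)}
print(sorted(kept))
print(z_scores([2.0, 4.0, 6.0]))
```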

Bacterial Surface "Shaving" and Protein Identification
We analyzed six human clinical isolates from different and distant provinces of Spain. All the patients suffered from meningitis. Three strains were isolated from cerebrospinal fluid (CSF), and the other three from blood. The six isolates were SS2.
After "shaving" the six SS2 isolates with trypsin and further LC-MS/MS analysis, the MS/MS spectra were searched against the human SS2 reference strain BM407. From the total of 557 surface proteins predicted by the LocateP v2 prediction algorithm, 131 surface proteins (23.5%) were identified in the six isolates (i.e., the "pan-surfome" of these strains), grouped in the following categories (Table 2): 31 signal peptide II lipoproteins, out of 40 from this category predicted from the BM407 genome (i.e., 77.5%); 18 LPXTG cell wall-anchoring proteins, out of 20 predicted in the reference strain, representing 90% of all the proteins in this category; 11 out of 18 secreted proteins (i.e., proteins with signal peptide I), representing 61.1% of all predicted secretory proteins; and finally, 71 proteins with one or more transmembrane domains (TMD), out of the 479 predicted from the genome of this reference strain (110 with one TMD plus 369 with more than one; i.e., 14.8% of total membrane proteins). Of these, 45 possessed only 1 TMD (110 predicted in the BM407 genome, i.e., 40.9%), and 26 were multi-transmembrane proteins (i.e., proteins with more than 1 TMD). These represented only 7.1% of the predicted multi-transmembrane proteins (369 in total in the genome). In addition, 759 proteins predicted as cytoplasmic were identified, but they were excluded because this work was focused on those proteins predicted to be at the extracellular side of the bacterial cell.
a Protein categories were defined as follows from LocateP v2 predictions: lipoproteins were those predicted as lipid-anchored proteins; cell wall proteins, as those possessing an LPXTG motif; secretory proteins, as those with an SP1-type signal peptide; membrane proteins with one transmembrane domain (TMD), as those possessing either a C- or an N-terminally anchored transmembrane region; multi-transmembrane proteins were membrane proteins with more than one TMD. The sum of the previous categories is considered as the total number of surface proteins (either identified experimentally or predicted from the S. suis BM407 genome).
Table 3 shows the complete list of the 131 identified surface proteins, as well as their presence in each individual SS2 isolate. As expected, the vast majority of lipoproteins and predicted secreted proteins were identified in all the isolates. The same occurred with cell wall proteins, with some exceptions: two putative glucan-binding surface-anchored proteins, encoded by loci SSUBM407_0471 and SSUBM407_0949, were found only in the 1086/11 strain. However, this was not the case for membrane proteins: those with 1 TMD were frequently found in all (or most of the) isolates, but those with more than 1 TMD were identified more rarely. Actually, only four in this last subcategory were found in the six clinical isolates: SSUBM407_0896, SSUBM407_1298, SSUBM407_1682, and SSUBM407_1747.
Interestingly, among the LPXTG cell wall proteins, we identified the major pilus protein encoded by the locus SSUBM407_0414. Pili proteins have been shown previously to be trypsin resistant [21,38,39], but here the protein identified was found in the six clinical isolates analyzed, with 30 peptides covering 54% of the protein sequence in its immature form ( Figure S1), or 61% of the mature protein sequence (once N-term signal peptide and C-term sortase post-processing sequence are removed).
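The difference between the coverage of the immature and the mature sequence comes from the denominator: removing the N-terminal signal peptide and the C-terminal sorting region shortens the protein while most identified peptides remain. A minimal sketch of such a coverage calculation follows; the protein length, the peptide positions, and the mature-region boundaries are hypothetical and are not those of SSUBM407_0414.

```python
def coverage_percent(protein_length, peptide_spans):
    """Percent of residues covered by at least one peptide.

    `peptide_spans` holds 1-based inclusive (start, end) positions.
    Overlapping peptides are merged so that no residue is counted twice.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(peptide_spans):
        if end <= last_end:
            continue  # fully inside a stretch that was already counted
        covered += end - max(start, last_end + 1) + 1
        last_end = end
    return 100.0 * covered / protein_length


def clip_spans(spans, first, last):
    """Restrict spans to residues first..last and renumber them from 1."""
    clipped = []
    for start, end in spans:
        s, e = max(start, first), min(end, last)
        if s <= e:
            clipped.append((s - first + 1, e - first + 1))
    return clipped


# Hypothetical 100-residue protein whose mature form spans residues 11-90
spans = [(5, 20), (15, 30), (41, 60)]
immature = coverage_percent(100, spans)
mature = coverage_percent(80, clip_spans(spans, 11, 90))
print(immature, mature)
```

With these made-up positions the mature-form coverage is higher than the immature-form coverage, which is the same direction as the 54% versus 61% reported above.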

Analysis of Differences in Surface Protein Abundances among the Clinical Isolates
Next, after protein identification, we performed a label-free semi-quantitative analysis to determine differences in the abundances of surface proteins among the six clinical isolates, based on chromatography peak areas (Table S1). To this end, we first carried out a principal component analysis (PCA) to evaluate differences in the overall pattern of surface protein abundances among the six isolates (Figure 1). The first two dimensions of the analysis explained 70.1% of the variance, with principal component (PC) 1 responsible for 47.9%, and PC2 for 22.2%. In general, the three biological replicates of each strain were well grouped, except for isolate 117/12, in which replicate #1 showed a great dispersion from the other two replicates, as measured by the Euclidean distances (Figure S2). The PCA showed that strains 857/06 and 41/14 were clearly separated from the other four isolates, and that 1299/06 and 34/11 were quite close to each other, with 1086/11 separated from 1299/06 and partially overlapping with 34/11. The isolate 117/12 could also constitute a clearly differentiated group from the rest, but the dispersion due to the distance of replicate #1 from the other two made this strain partly overlap with the group formed by isolates 1299/06, 34/11, and 1086/11.
Then, in hierarchically clustered heatmaps, we presented the z-score abundances of the 131 identified surface proteins, grouped in four major categories of subcellular localization: lipoproteins, cell wall proteins, secreted proteins, and membrane proteins. In general terms, for most proteins there was a higher expression on the surface of isolates 857/06, 41/14, and 117/12 compared to the other three strains, although this tendency varied according to particular proteins and subcellular localization category. Thus, for lipoproteins, the highest abundances of most proteins were found in strains 857/06, 41/14, and 117/12 (Figure 2a). The highest abundance levels of LPXTG cell wall proteins were found in 41/14 and 117/12, followed by 857/06 (Figure 2b). However, in this category there was a greater dispersion of values between replicates. Of note, as stated before, the 1086/11 strain was the only one to express the two putative glucan-binding surface-anchored proteins SSUBM407_0471 and SSUBM407_0949. For most of the 11 identified secreted proteins, higher abundances were found in 857/06 and 41/14, and to a lesser extent in 117/12, compared to the other three isolates (Figure 2c). Finally, the category of membrane proteins exhibited a high heterogeneity in protein abundance distribution: there were many proteins identified only in one (or a few) isolates, with the highest number for 857/06, followed by 41/14 (Figure 3). A qualitative analysis for enrichment of KO terms (available in Table S1) showed that the isolate 857/06 seemed to express higher levels of proteins participating in solute binding and acting as transporters, as well as some exclusive proteins of this kind (Supplementary Dataset 1). Nevertheless, most of these proteins were also found in the other strains. On the other hand, the isolate 41/14 had a higher proportion of exclusive proteins annotated as "putative membrane" or "putative exported", without any assigned KO.
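The dispersion of replicates mentioned for isolate 117/12 was assessed with Euclidean distances. As a minimal sketch of that idea, the pairwise distance between replicate abundance profiles can be computed and the replicate lying furthest from the others flagged. The replicate labels and values below are hypothetical, and ranking replicates by their mean distance to the others is an assumption made for illustration, not the procedure behind Figure S2.

```python
from itertools import combinations
from math import dist


def replicate_distances(profiles):
    """Pairwise Euclidean distances between replicate abundance vectors.

    `profiles` maps a replicate label to a list of (z-scored) abundances,
    with proteins in the same order for every replicate.
    """
    labels = sorted(profiles)
    return {(a, b): dist(profiles[a], profiles[b]) for a, b in combinations(labels, 2)}


def most_dispersed(profiles):
    """Label of the replicate with the largest mean distance to the others."""
    distances = replicate_distances(profiles)
    mean_distance = {}
    for label in profiles:
        others = [d for pair, d in distances.items() if label in pair]
        mean_distance[label] = sum(others) / len(others)
    return max(mean_distance, key=mean_distance.get)


# Hypothetical z-scored abundances of three proteins in three replicates
profiles = {
    "rep1": [3.0, 4.0, 0.0],
    "rep2": [0.0, 0.0, 0.0],
    "rep3": [0.0, 1.0, 0.0],
}
distances = replicate_distances(profiles)
outlier = most_dispersed(profiles)
print(distances)
print(outlier)
```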

Prediction of New Potential Vaccine Candidates
We combined our experimental proteomics approach with bioinformatic tools to predict new vaccine candidates with antigenic potential, either as whole proteins or as sequence fragments from particular proteins. A priori, the highly abundant and exposed proteins (mainly cell wall-anchored proteins and lipoproteins) are expected to be antigenic, as extensively demonstrated. Therefore, we searched for new potential vaccine candidates (PVCs) in the group of membrane proteins, and particularly within those with more than one predicted TMD, as they have been poorly studied at immunogenic and protective levels.
Then, for the 26 multi-transmembrane proteins found after "shaving" the bacteria and analyzing the "surfomes" via LC-MS/MS, we mapped the experimentally identified peptides on the protein sequences (Appendix A, Supplementary Dataset 2) and represented the theoretical topologies with Topo2, according to TMHMM predictions (Figure 4). For 19 out of the 26 proteins, the identified peptides matched regions or loops that were predicted by TMHMM to be extracellularly exposed. The other 7 proteins (SSUBM407_0279, SSUBM407_1297, SSUBM407_1333, SSUBM407_1406, SSUBM407_1682, SSUBM407_1834, and SSUBM407_1895) were identified from peptides that theoretically mapped to intracellular loops.
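The mapping of identified peptides onto predicted topologies can be sketched in a few lines. This is only an illustration of the comparison, not the Topo2 or TMHMM workflow itself: the topology segments and peptide positions are hypothetical, the segment labels follow the inside/TMhelix/outside vocabulary used in TMHMM output, and assigning a peptide to the region with which it overlaps most is an assumption.

```python
def classify_peptide(peptide_span, topology):
    """Return the topology label that overlaps most with a peptide.

    `peptide_span` is a 1-based inclusive (start, end) pair; `topology` is a
    list of (start, end, label) segments covering the protein, with labels
    such as "inside", "TMhelix", or "outside".
    """
    start, end = peptide_span
    overlap = {}
    for seg_start, seg_end, label in topology:
        shared = min(end, seg_end) - max(start, seg_start) + 1
        if shared > 0:
            overlap[label] = overlap.get(label, 0) + shared
    if not overlap:
        raise ValueError("peptide lies outside the annotated topology")
    return max(overlap, key=overlap.get)


# Hypothetical two-helix membrane protein of 150 residues
topology = [
    (1, 20, "inside"),
    (21, 43, "TMhelix"),
    (44, 90, "outside"),
    (91, 113, "TMhelix"),
    (114, 150, "inside"),
]

# Hypothetical identified peptides
peptides = {"pep_1": (50, 62), "pep_2": (120, 131), "pep_3": (40, 60)}
locations = {name: classify_peptide(span, topology) for name, span in peptides.items()}
print(locations)
```

A peptide classified as "inside" in such a comparison corresponds to the discordant cases described above, where the experimental evidence and the predicted topology disagree.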

Finally, we evaluated the potential antigenicity of these 26 multi-transmembrane proteins using the web-based VaxiJen tool. For each protein, we compared the whole sequence with the sequences found after trypsinization, and/or with the regions between discontinuous experimentally identified peptides when these were close to each other. In most cases, the antigenic scores improved when the experimentally identified regions were selected, compared to the whole sequences of their corresponding proteins (Table 4). In particular, for six proteins (SSUBM407_0454, SSUBM407_0552, SSUBM407_1333, SSUBM407_1682, SSUBM407_1834, and SSUBM407_1894), VaxiJen predicted that the whole proteins were not antigenic (score < 0.4), but for four of them (SSUBM407_0454, SSUBM407_0552, SSUBM407_1834, and SSUBM407_1894) the selected peptides or regions corresponding to the sequences experimentally found by proteomics increased the score and caused them to be predicted as antigenic (score ≥ 0.4). For five out of the seven proteins in which the peptides found matched predicted intracellular loops (SSUBM407_1297, SSUBM407_1406, SSUBM407_1682, SSUBM407_1834, and SSUBM407_1895), there was an increase in the VaxiJen score after selecting the experimentally identified regions, compared to the whole protein sequences.
a The first score corresponds to the peptide matching the loop between the 3rd and the 4th transmembrane domain (TMD); the second score, to the peptide matching the loop between the 5th and the 6th TMD.
b Scores corresponding to each of the three peptides identified in this protein, from N- to C-term.
c Scores corresponding to each of the two peptides identified in this protein, from N- to C-term.
d The first score corresponds to the region covering from the first to the last peptide identified matching the extracellular loop between the first and the second TMD; the second score, to the peptide matching the loop between the 3rd and the 4th TMD.
e The two scores correspond to the regions covered by peptides matching the first and the second predicted intracellular loops, from N- to C-term, respectively.
f Scores corresponding to each of the four peptides identified in this protein, from N- to C-term.
g The two scores correspond to the peptides matching the 2nd and the 4th predicted intracellular loops, from N- to C-term, respectively.
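The comparison between whole-protein and region scores reduces to applying the 0.4 antigenicity threshold to both and checking whether the experimentally identified region scores higher. A minimal sketch follows; the protein labels and scores are hypothetical and are not taken from Table 4, the scores are assumed to have been obtained beforehand from the VaxiJen web server, and summarizing several region scores by their maximum is an assumption.

```python
THRESHOLD = 0.4  # antigenicity cut-off quoted in the text


def is_antigenic(score, threshold=THRESHOLD):
    """A sequence is predicted as antigenic when its score is >= threshold."""
    return score >= threshold


def compare_scores(whole_score, region_scores):
    """Compare a whole-protein score with the scores of identified regions.

    Assumption: when several regions were scored, the best one is used.
    """
    best_region = max(region_scores)
    return {
        "whole_antigenic": is_antigenic(whole_score),
        "region_antigenic": is_antigenic(best_region),
        "improved": best_region > whole_score,
    }


# Hypothetical scores: (whole protein, list of region scores)
scores = {
    "protein_X": (0.35, [0.52, 0.31]),
    "protein_Y": (0.38, [0.36]),
    "protein_Z": (0.45, [0.61]),
}
summary = {name: compare_scores(whole, regions) for name, (whole, regions) in scores.items()}
rescued = sorted(
    name for name, s in summary.items()
    if not s["whole_antigenic"] and s["region_antigenic"]
)
print(rescued)
```

Here "rescued" lists the hypothetical proteins that move from non-antigenic to antigenic once only the identified regions are scored, which is the pattern reported for four of the six proteins above.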

Discussion
In the interplay between cells and their environment, surface proteins are key molecules playing many important biological roles [40,41]. In the particular case of bacterial pathogens, many surface proteins are involved in virulence and pathogenicity; yet since they are normally exposed and in contact with elements of the host immune system, they have the highest chances of becoming effective candidates for drug and vaccine development [14,42].
In this work, we applied for the first time the "shaving" approach, successfully used by our research group in different Gram-positive bacteria [19-21,25,26,39], to perform a comparative proteomic analysis of six SS2 human clinical isolates. In a previous paper, we carried out a similar study in a large collection of S. suis clinical isolates from pigs belonging to different serotypes [25], which led to the discovery of an immunoprotective cell wall protein, namely SsnA [43,44]. These studies were followed by a comparative immunosecretomic analysis of the same isolates [27]. However, there is a lack of studies on S. suis human isolates aimed at comparing their proteomic profile with those of animal isolates and at discovering potential vaccine candidates that can be effective in the next outbreak affecting humans. This is especially important given the increasing concern about emerging zoonoses.
Our proteomic analysis resulted in the identification of 131 surface proteins in the six human isolates. This number is similar to that of our previous work on 39 swine isolates, in which 113 surface proteins were found [25]. Actually, the numbers of proteins identified per category of subcellular location were almost identical, except for membrane proteins: 71 proteins with one or more TMD in the present work, compared to 54 in our previous work [25]. Cytoplasmic proteins, which can be released because of residual cell lysis, non-canonical secretion pathways, or extracellular vesicle blebbing [45], were not considered in the downstream workflow of PVC selection. Here, we used as database the BM407 strain, which is an SS2 reference strain from humans, whereas in our previous work with animal isolates we used the swine isolate P1/7, one of the reference strains for SS2 in pigs. Therefore, to compare the current and the previous protein lists, we searched for the homologous proteins of BM407 in P1/7, and almost all of them share nearly 100% identity. The equivalence for both strains can be found in Table S1. Thus, out of the 131 surface proteins that we identified in the six human isolates, 15 cell wall proteins were also present in the 39 pig isolates [25], as well as 28 lipoproteins, 8 secreted proteins, and 22 membrane proteins with 1 TMD, considering that only 28 of the latter were found in pigs. This is expected, as these categories are highly exposed and abundant, generating many peptides. For this reason, these proteins could be considered as potential vaccine candidates for future formulations of universal vaccines to protect both humans and animals. Actually, the vast majority of these proteins were found in the six human isolates analyzed, with some exceptions; for example, the already cited cell wall putative glucan-binding surface-anchored proteins, found only in 1086/11.
However, the overlap between both lists was lower for multi-transmembrane proteins: only 10 of the 26 multi-transmembrane proteins identified in the human isolates were also found in the swine isolates. This is also reasonable, as these proteins are more embedded into the cell wall, and therefore less accessible and less abundant.
Notably, we identified the major pilus subunit, SSUBM407_0414, in the six isolates. This was an unexpected result, as Gram-positive pilus proteins, the so-called Lancefield T antigens, have been described to be trypsin resistant [46]. In previous works we used unspecific proteases, like proteinase K, to get a few peptides from this family of proteins [21,38,39]. However, in the present study 30 peptides were identified for the major pilus subunit in the six strains, representing 54% coverage of the protein sequence. We do not know why SSUBM407_0414 is sensitive to trypsin, in contrast to other pilin proteins already described to be resistant to this protease.
So far, the "shaving" approach has been used to identify proteins and to compare strains in a qualitative way (absent versus present proteins). However, we have also reported that the number of identified peptides is indicative of protein abundance, although ratios between samples cannot be calculated using this parameter [19,21,25]. Here we used the chromatography peak areas in a label-free semi-quantitative way to report protein abundance levels. We have strong evidence that this parameter is more consistent than spectral counting or other label-free methods (Rodríguez-Ortega, unpublished results). After PCA to evaluate the contribution of surface proteins to differences among the six SS2 human isolates, we built clustered heatmaps to get some hints on either similarities or differences of protein expression in such strains. Thus, in general, we observed that isolates 857/06, 41/14, and 117/12 had higher abundances of surface proteins for most categories, but exhibited a high variability in membrane proteins. This can be explained by assuming that these proteins are less abundant (more embedded in the surface, rendering fewer peptides, and with lower copy numbers, rendering fewer peptide spectral matches), since they are more difficult to identify across all the isolates, in contrast to lipoproteins or cell wall proteins. Furthermore, KEGG Mapper analysis showed that the isolates 857/06 and 41/14 expressed more solute-binding and transporter proteins. We do not know whether this is related to biological phenomena, like higher virulence or antibiotic resistance, as we lack this information for the isolates.
Twenty years ago, reverse vaccinology revolutionized the selection of vaccine candidates by introducing the concept that the most promising antigens can be easily predicted from a genome using adequate algorithms [28,29]. The first success in applying this concept was achieved with Neisseria meningitidis [47]. Surface proteins, being in contact with the environment and, therefore, with the host immune system, have the highest chances of becoming effective vaccine antigens. It is estimated that approximately one third of the genes in any genome code for surface proteins, including those secreted to the extracellular milieu [14]. However, the genome does not indicate which genes are expressed, when, and to what extent, nor whether the encoded proteins are synthesized. This limitation led the reverse vaccinology concept to evolve and incorporate wet lab-based techniques, such as protein arrays or classical proteomics [48]. In previous works, we demonstrated that "shaving"-based proteomics is useful for selecting the most antigenic surface proteins, as there is a strong correlation between trypsinized proteins and antibody binding to the surface of live, intact cells [21]. In the present study we found most of the surface proteins that have shown protective activity in animal models so far, including Sao [49,50], Sat [51], SsnA [43], HP0197 [52], SsPepO [53], and Sly [54]. We did not identify either the muramidase-released protein Mrp or the extracellular protein factor Epf, two important S. suis protective antigens [55]. Although these two highly variable genes are present in the BM407 strain, it does not express either protein, like many other strains belonging to serotype 2 or to other serotypes [56]. It is therefore not surprising that the human isolates analyzed in this study do not express these proteins.
Moreover, subcellular localization algorithms based on signal features sometimes provide misleading predictions. We have previously demonstrated that such predictions can be corrected, or at least revisited, by mapping the experimentally identified peptides onto the predicted protein topologies [19,21,25,26,45]. In the present study, in the search for alternative PVCs, we focused on less studied proteins, such as multi-transmembrane proteins, since the immunogenicity and vaccine potential of lipoproteins and cell wall proteins have been extensively analyzed in animal models of infection (for an extensive review of S. suis protein vaccine candidates, see [13]). We mapped the peptides identified in our proteomic workflow onto the corresponding protein sequences and represented the theoretical topologies. For most proteins (19 out of 26), the identified regions agreed with the predicted extracellular domains. We then used a machine-learning (ML) algorithm, VaxiJen, to predict the antigenicity of these proteins and to compare, for each protein, the whole sequence with the restricted region containing the experimentally identified peptides. We chose VaxiJen because ML-based predictors do not discard proteins based on sequence signal features, as decision-tree predictors (e.g., Vaxign, Jenner-predict) do. Multi-transmembrane proteins would be discarded if a decision-tree algorithm were used, whereas ML-based ones consider all proteins [57]. In our study, most of the multi-transmembrane proteins had a higher antigenicity score for the experimentally identified regions than for the whole sequences; in particular, this was the case for five out of the seven proteins for which the predicted topology did not agree with the experimental results. 
This indicates that combining proteomics and bioinformatics within the reverse vaccinology concept is useful for selecting PVCs that would not be considered a priori, and for refining the workflows by which candidates enter the production and testing pipelines. The "shaving" approach thus reinforces classical reverse vaccinology, which relies only on in silico studies of subcellular location, protein topology, and antigenicity, allowing a more accurate selection of PVCs.
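The mapping step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sequence, peptides, and predicted extracellular segment are invented; "concordance" is taken here to mean that every identified peptide lies entirely within a predicted extracellular segment, which may be stricter than the criterion applied in this study; and the "restricted region" is assumed to be the shortest contiguous stretch containing all identified peptides. The function names are hypothetical, and the call to the antigenicity predictor is deliberately omitted.

```python
def peptide_spans(protein, peptides):
    """Return 0-based, half-open (start, end) spans of the first match of each peptide."""
    spans = []
    for pep in peptides:
        start = protein.find(pep)
        if start != -1:
            spans.append((start, start + len(pep)))
    return spans


def is_concordant(spans, extracellular):
    """True if every peptide span lies fully inside a predicted extracellular segment.

    `extracellular` is a list of half-open (start, end) segments taken
    from a topology prediction (assumed input format).
    """
    if not spans:
        return False
    return all(
        any(lo <= start and end <= hi for lo, hi in extracellular)
        for start, end in spans
    )


def restricted_region(protein, spans):
    """Shortest contiguous stretch of the sequence containing all peptide spans."""
    if not spans:
        return ""
    lo = min(start for start, _ in spans)
    hi = max(end for _, end in spans)
    return protein[lo:hi]


# Toy data (hypothetical): two hydrophobic stretches flanking one exposed loop.
toy_protein = "MLLVVAILLAGSDNKPEELTRGQWESDGKFILVVAFLLGA"
toy_peptides = ["SDNKPEELTR", "GQWESDGK"]
predicted_outside = [(10, 29)]  # loop predicted as extracellular

spans = peptide_spans(toy_protein, toy_peptides)
concordant = is_concordant(spans, predicted_outside)
region = restricted_region(toy_protein, spans)
print(concordant, region)
```

The whole sequence and the string returned by `restricted_region` would then be submitted separately to the antigenicity predictor so that the two scores can be compared.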

Conclusions
This study presents the first comparative "shaving"-based proteomic analysis of several serotype 2 human isolates of the major zoonotic pathogen S. suis. We obtained a list of identified proteins, many of them present in all or most of the strains. Multivariate classification and clustering analysis allowed us to distinguish the expression patterns and abundances of surface proteins. The combination of proteomics and bioinformatic tools made it possible to select, for further testing in animal models, PVCs that would not be prioritized in classical reverse vaccinology approaches, such as many multi-transmembrane proteins or some of their exposed domains. Thus, our approach can be considered an experimentally aided reverse vaccinology method for PVC selection. Further research is necessary to measure the immunogenic and protective capacities of the selected polypeptides.