Species level resolution of female bladder microbiota from marker gene surveys

The human bladder contains bacteria in the absence of infection. Interest in studying these bacteria and their association with bladder conditions is increasing, but the chosen experimental method can limit the resolution of the taxonomy that can be assigned to the bacteria found in the bladder. 16S rRNA gene sequencing is commonly used to identify bacteria, but is typically restricted to genus-level identification. Our primary aim was to determine if accurate species-level identification of bladder bacteria is possible using 16S rRNA gene sequencing. We evaluated the ability of different classification schemes, each consisting of a combination of a 16S rRNA gene variable region, a reference database, and a taxonomic classification algorithm, to correctly classify bladder bacteria. We show that species-level identification is possible, and that the reference database chosen is the most important component, followed by the 16S variable region sequenced.

Importance
Species-level information may deepen our understanding of associations between bladder microbiota and bladder conditions, such as lower urinary tract symptoms and urinary tract infections. The capability to identify bacterial species depends on large databases of sequences, algorithms that leverage statistics and available computer hardware, and knowledge of bacterial genetics and classification. Taken together, this is a daunting body of knowledge to become familiar with before the simple question of bacterial identity can be answered. Our results show that the choice of taxonomic database and variable region of the 16S rRNA gene sequence makes species-level identification possible. We also show that this improvement can be achieved through more careful application of existing methods and use of existing resources.


Introduction
The human body provides a wide range of habitats, supporting a variety of microorganisms that include bacteria, archaea, viruses and fungi, collectively known as the human microbiome (1). Recent evidence from sequence-based and enhanced culturing techniques has revealed a population of microbes (bacteria, fungi and viruses) that exist in the bladder, even in the absence of infection(2-7). The discovery of the bladder microbiota (also known as the bladder urobiome) has led researchers to question how these microbes influence the health of the host. Studies have shown that altered bladder urobiome diversity is associated with urgency urinary incontinence (UUI)(4,8) and with urinary tract infection after instrumentation of the urinary tract(9,10), and is predictive of response to a common UUI drug (11). These studies collectively provide evidence that the bladder urobiome, while previously overlooked, is clinically relevant and warrants further investigation.
To study the relationships between the bacteria found in the human bladder and health of the host, it is necessary to accurately identify bacteria in a rapid and large-scale manner. Reliable methods of determining the identity of an unknown bacterium include Matrix Assisted Laser Desorption/Ionization-time of flight (MALDI-TOF) analysis or whole genome sequencing (WGS) of purified colonies; both techniques permit species-level identification of bacteria(12). However, culturing specific bacterial species is also time consuming and laborious. This limitation has been circumvented by adopting culture-independent methods of sequencing DNA directly from an environmental sample, such as shotgun metagenomic sequencing and targeted amplicon sequencing, the latter most commonly involving the 16S rRNA gene(13).
These culture-independent sequencing methods are an attractive strategy because they can more accurately reveal microbiota diversity by identifying bacteria that are difficult to grow in culture.
Targeted amplicon sequencing is currently the most practical method for identifying bladder bacteria in a large-scale manner. When performing targeted amplicon sequencing, DNA is first extracted from all cells in a sample, including host and bacterial cells. Next, the polymerase chain reaction (PCR) is used to amplify a small segment of the bacterial genome. This segment is then sequenced in a high-throughput manner. Finally, bioinformatics tools are used to process the resulting sequences and identify the taxonomy of the bacteria. Algorithms compare the short DNA sequences recovered from a sample to known sequences held in a reference database until the closest match is found. In general, longer or more unique strings of sequenced DNA can be used to identify bacteria at a higher level of precision, though sequence length is often limited by the sequencing technology. The 16S rRNA gene is commonly used in amplicon sequencing studies due to its universal presence in bacteria. The 16S rRNA gene conveniently contains multiple "variable regions" with unique strings of sequence that can be used for bacterial identification. A common target is the 4th variable region (V4), as this region has good phylogenetic resolution down to the genus level for many bacteria(14).
When identifying bacteria using targeted amplicon sequencing there are three important components (Figure 1). These components are: 1) the identifier, or DNA sequence of the unknown bacterium; 2) a database of DNA sequences annotated with taxonomic information; and 3) a classifier, which is the algorithm that compares the unknown sequence to those in a database until the closest match is found. These components work together as a classification scheme.
One common classification scheme uses the V4 region from the 16S rRNA gene sequence(15) as the identifier, the Silva database(16), and the Naïve Bayes algorithm(17 Methods). These computational amplicons (Figure 2A) were used to determine how well the currently available classification schemes can distinguish bladder bacterial species. To assess different classification schemes, we tested multiple combinations of the variable regions listed above with different databases (i.e. Greengenes, Silva, or NCBI 16S Microbial) and different classifiers (i.e. Naïve Bayes or BLCA, see Figure 1).
To quantify the amount of information contained across variable regions of the 16S rRNA gene among commonly identified bladder bacteria, we performed a sliding window analysis on a multiple sequence alignment (MSA) of all genomes from the Thomas-White dataset. We calculated entropy as a measure of information content along the MSA (Figure 2B). As expected, the defined variable regions contained regions of high entropy, suggesting variability across species, whereas variable regions were flanked by conserved regions with low entropy containing sequences that are similar among species. The V1 and V2 regions contained the highest entropy, while V7 and V8 contained the lowest.
Evaluation of classification scheme performance. To evaluate the ability of currently available resources to identify bladder species, we calculated the recall, precision and F-measure for each classification scheme implemented (see Methods). Briefly, each resulting taxonomic classification was evaluated as a true match, true non-match, false match or false non-match based on whether the taxonomic classification was correctly assigned or not. Recall refers to the proportion of matches that the classification scheme correctly identified out of all possible matches.
Precision refers to the proportion of matches that the classification scheme called correctly out of all classified matches. The F-measure is the equally weighted harmonic mean of recall and precision.
In general, the classification schemes that use the NCBI 16S Microbial database perform the best (Figure 3), with high recall and precision (range 60.3%-91.0% for both classifiers). Those using the Silva database show reduced precision and recall (range 23.1%-70.5% for both measures). Because the Greengenes database is missing many of the bacterial species found in the bladder, it is less precise. As such, classification schemes using the Greengenes
B) The information of variable regions, measured by entropy from a sliding window analysis of the MSA. Higher entropy indicates that the region has more variability across species, and therefore more information to identify a bacterial species. Lower entropy indicates that the region has little variability (i.e. is conserved) across species and therefore less information to identify a bacterial species.
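The sliding window entropy analysis described above can be illustrated with a minimal sketch. It uses plain, unweighted Shannon entropy per alignment column and averages it over a fixed window; the function names, the window size, the treatment of gaps as an ordinary symbol, and the toy alignment are all assumptions made for illustration, and this is not the custom "weighted_ent.py" script used in the study.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column.

    Assumption: every character, including a gap, is counted as a symbol.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def sliding_window_entropy(alignment, window=3):
    """Mean column entropy for each window of `window` adjacent columns."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    per_column = [
        column_entropy([seq[i] for seq in alignment]) for i in range(length)
    ]
    return [
        sum(per_column[i:i + window]) / window
        for i in range(length - window + 1)
    ]


# Hypothetical toy alignment: columns 0-2 and 5 are conserved, 3-4 vary.
toy_alignment = ["ACGTAC", "ACGTTC", "ACGAGC", "ACGCCC"]
profile = sliding_window_entropy(toy_alignment, window=3)
for start, value in enumerate(profile):
    print(f"columns {start}-{start + 2}: mean entropy {value:.2f} bits")
```

Windows that cover the two variable columns score higher than the window over the fully conserved columns, which is the pattern the caption describes for variable versus conserved regions.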

Confidence scores affect classification. The BLCA and Naïve Bayes classifiers used in this study will classify an unknown sequence even when the posterior probability for that taxon is very low. To account for this situation, a confidence score is calculated that measures how much the classification changes through random permutation (bootstrapping) and produces a value that reflects the "goodness of fit" of that classification. When there is no basis for choosing the confidence score that minimizes the number of errors of a classification scheme (i.e. when a test set is not available), a predefined confidence score threshold can be used. Here, we evaluated the performance of classification schemes when confidence score thresholds of 50% or 80% were used, such that matches returned with confidence scores less than the threshold were considered non-matches. Figure 4 shows the effect of increasing the confidence score on the number of true matches returned by each classification scheme.
Almost all classification schemes had a decrease in recall when using a default confidence score of 80% (Supplemental Figure 1). This effect is especially marked for the classification schemes that use the Silva database, which show a 79.3% reduction in recall, on average. Classification schemes that use the NCBI 16S database are unequally affected; for example, the V1-V3 identifier shows a slight reduction in recall (7.1% on average), while the V6 identifier shows the largest (43.3% on average). Classification schemes that use the Greengenes database are only slightly affected. These reductions in recall are mirrored in all classification schemes when a confidence score of 50% is used as a threshold, but at a smaller magnitude.
Figure 3: Classification scheme evaluation when ignoring confidence scores. The performance of each classification scheme is summarized by the precision (y axis) and recall (x axis) for each variable region (color). The best classification scheme would lie in the upper right-hand corner. Overall, classification schemes using the NCBI 16S Microbial database performed better than those using the Greengenes or Silva databases.
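The thresholding rule described above, in which matches with a confidence score below the threshold are treated as non-matches, can be sketched as follows. The function name, the tuple layout, and all confidence values are hypothetical and chosen only for illustration; they are not output from BLCA or the Naïve Bayes classifier.

```python
def apply_confidence_threshold(assignments, threshold=None):
    """Keep assignments whose confidence score is at least `threshold`.

    `assignments` holds (query ID, assigned species, confidence in percent).
    A threshold of None means confidence scores are ignored.
    """
    if threshold is None:
        return list(assignments)
    return [a for a in assignments if a[2] >= threshold]


# Hypothetical classifier output; the confidence values are invented.
assignments = [
    ("q1", "Lactobacillus crispatus", 97.0),
    ("q2", "Gardnerella vaginalis", 81.5),
    ("q3", "Escherichia coli", 62.0),
    ("q4", "Streptococcus anginosus", 34.0),
]

for threshold in (None, 50, 80):
    kept = apply_confidence_threshold(assignments, threshold)
    label = "ignored" if threshold is None else f"{threshold}%"
    print(f"threshold {label}: {len(kept)} of {len(assignments)} matches kept")
```

A stricter threshold can only remove matches, which is why recall falls as the threshold rises, while precision may rise or fall depending on whether the removed matches were mostly false or mostly true.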
Changes in precision depend most strongly on the database used (Supplemental Figure 1). For the classification schemes that use the NCBI 16S database, precision is generally improved with either confidence score threshold, but by unequal amounts. For example, using the 80% threshold, the V1-V3 identifier shows a slight increase of 3.5% on average, while the V6 identifier shows a large 39.7% increase on average. Classification schemes that use the Silva database are unequally affected, with both reductions and gains in precision. A dramatic increase in precision is shown by the classification scheme composed of the Silva database, V4 identifier, and the Naïve Bayes classifier. When ignoring confidence scores, this classification scheme has a precision of 52.6%, but shows a 63.1% gain when using a confidence score of 80% as a threshold. In general, precision is reduced when using the BLCA classifier and the Silva database. As with recall, classification schemes that use the Greengenes database show slight changes in precision.
The overall changes in how these classification schemes perform when using a 50% or 80% confidence score can be summarized by comparing the F-measure values shown in Supplemental
Amplicons spanning more than one variable region identify a higher number of bladder bacterial species. Amplicons spanning more than one variable region identified more unique bladder bacteria at the species level than amplicons spanning a single variable region. For example, with the commonly used V4 variable region and Naïve Bayes classifier, 21.8% of bladder bacteria are correctly identified with the Greengenes database, whereas 52.6% are identified with the Silva database and 73.1% with the NCBI 16S database (Figure 5).
However, with the NCBI database, when using amplicons spanning more than one variable region, such as the V1-V3 region, 91.0% of bacteria are correctly identified at the species level.
Species identified depends on choice of database and variable region. While the results thus far have focused on summarizing overall performance of classification schemes for identifying bladder bacteria at the species level, we also sought to determine which classification schemes could be used to identify specific bacteria (Table 1, Supplemental Figure 3). Although the NCBI database contains the largest representation of bladder species, some species were not identified with certain variable regions, if at all. For example, Lactobacillus species were overall best represented within the NCBI database, with 8 out of 9 species being identified with the V1-V3 and V2-V3 variable regions (Figure 6). However, the other variable regions only identified between 4 and 6 Lactobacillus species when using the NCBI database. Interestingly, Lactobacillus crispatus was identifiable with the Silva and NCBI databases when using the V4-V6 regions, but only with the NCBI database using the V1-V3 and V2-V3 regions, and only the Silva database when using the V4 and V6 regions independently. Lactobacillus iners was not correctly identified from our dataset with any classification scheme.
Additionally, we found that there were important discrepancies for bacteria that are thought to play a role in bladder health and disease (Supplemental Figure 3). Several bladder species, such as Gardnerella vaginalis, were only detected with the NCBI and Silva databases. Staphylococcus species were poorly identified with the V4 region but were distinguishable with all other regions. Streptococcus and Corynebacterium species were best identified with NCBI. Escherichia coli is not well represented in any of the databases, and was only detected with the V4 region and the NCBI database.
With the commonly used V4 variable region and BLCA classifier, 17% of bladder bacteria are correctly identified using the Greengenes database, compared with 35% correctly identified using the Silva database and 67% using the NCBI 16S database. A similar trend is seen with the Naïve Bayes classifier. Using other variable regions can lead to improved species-level resolution, up to a maximum of 91% correctly identified.
(4,11). We reprocessed the raw sequencing data (see Methods) and performed taxonomic classification to assess the performance of our computational findings. Since 16S rRNA gene sequencing will detect many more bacteria than those identified even with enhanced culture, we restricted the evaluation to only the bacteria that grew in culture from a given sample. We used accuracy to assess the number of predicted matches that were correctly identified in the V4 dataset, using classification schemes composed of the V4 identifier, each of three databases, and the BLCA classifier (Figure 7). All databases had good accuracy with high proportions of accurate identifications at the species level (80% for NCBI 16S, 86% for Silva, and 88% for Greengenes). Accuracy was reduced when the default confidence score of 80% was applied (61% for NCBI 16S, 67% for Silva, and 86% for Greengenes). The default confidence score of 50% reduced the accuracy of two of the classification schemes (76% for NCBI 16S and 80% for Silva). We also evaluated classification schemes with the Naïve Bayes classifier and found similar results (Supplemental Figures 4 and 5).
Figure 7. Taxonomic classification of the V4 validation dataset. A) Results when using a classification scheme including the V4 identifier, NCBI 16S database, and BLCA classifier. Blue dots represent species identified in cultured isolates, but not identified in targeted amplicon sequencing using this classification scheme. Yellow dots represent the species that were present in cultured isolates and successfully identified by the classification scheme. B) Summary of accuracy for classification schemes that use the V4 rRNA identifier, BLCA classifier, and the three databases (Greengenes, Silva and NCBI 16S). Rows show accuracy results when ignoring confidence scores, and when using confidence scores of 50% or 80% as thresholds.

Discussion
Our study demonstrates that it is possible to gain higher resolution results at the species level with existing resources when performing targeted amplicon sequencing of urinary specimens. Though higher resolution is possible, it requires a carefully chosen classification scheme. Within the classification scheme, the reference database strongly influences the identification of bacteria at the species level. Overall, we found the NCBI 16S database performs the best, whereas the Greengenes database performs the worst, primarily because it does not currently contain representatives of bladder bacteria. The identifier, or 16S rRNA variable region that is chosen, can also influence the types of bacterial species that are identified. The choice of classifier did not drastically affect the identification of species and thus is less critical within the classification scheme.
The largest limitation of any reference database is that the number of records of accurately classified bacteria is dwarfed by the number and diversity of unidentified sequences obtained through metagenomic sequencing of environmental samples. Because of the considerable amount of work required to construct and maintain databases, they will inevitably represent existing bacteria incompletely.
For species-level taxonomy assignments, the reference database must contain species-level information. In other words, if species of bacteria are expected in a sample, it must be verified that the database contains those species. For example, we found that the Greengenes database does not currently contain many bacterial species that are found in the human bladder. In contrast, the NCBI 16S Microbial and Silva databases had representation of all species that were identified from prior studies of bladder bacteria. Thus, the latter two databases are better choices for evaluating bacterial species from the bladder.
While the databases reviewed in this study do have species-level information associated with the records, additional work was needed before species-level identification could be achieved with the Naïve Bayes classifier. This classifier requires a database that has undergone the "training" steps that convert the DNA sequences to the frequencies with which each k-mer occurs in a taxon. For available classification algorithms like the RDP classifier(25) and QIIME2(17), training is currently done only to reliably identify bacteria to the genus level. For this study, it was necessary to train the Silva and NCBI 16S databases to the species level for use with the Naïve Bayes classifier. While training the reference databases did take significant computational effort, once completed the trained databases were used repeatedly.
The classifiers used in this study are examples of two different strategies designed to overcome the common challenges of searching an extremely large dataset in order to find matching pairs of query sequences and reference records. While these two approaches are different in concept, we did not find significant differences in their performance for species-level classification of bladder bacteria.
BLCA is an example of sequence comparison by pairwise alignment. The strength of this method is that the similarities between two DNA samples are directly compared. This is the most effective way to compare the characteristics of a sample to those that define a taxon; however, until recent advances in computer technology, it remained impractical because of the computational burden. The Naïve Bayes classifier is an example of a k-mer-based classification approach, and was designed to circumvent the computational challenges that are faced with use of a pairwise alignment classifier.
However, there are limitations when using Naïve Bayes for species-level identification. The first limitation arises from the database training process. If one taxon has more training examples than another, Naïve Bayes generates unfavorable weights for the decision boundary(26). The second limitation is that all features (i.e. the k-mers generated from the DNA sequences) are assumed to be independent, and weights for taxa with strong dependencies among the associated k-mers are larger than those for taxa with weakly dependent k-mers(26).
Finally, as shown by both the computational and V4 validation results, the use of the 50% or 80% confidence score thresholds significantly reduced the recall and accuracy of the classification schemes. Precision increased in several cases, for example with classification schemes that use the NCBI 16S database or those that use the Silva database and Naïve Bayes classifier, but at the cost of severely decreasing the number of species identified. These results show that the default settings of 50% or 80% are restrictive, and limit the ability to detect bladder species, especially when using the Silva and Greengenes databases. This could be resolved through the use of a comparative data set to find the confidence score values yielding optimal performance of these classification schemes.
Affordable sequencing of large-scale data is presently done with short read sequencing technology, such as Illumina MiSeq. This is currently limited to sequencing reads up to 300 nucleotides in length. Until full-length 16S rRNA gene sequencing can be achieved affordably on a large scale (such as with Oxford Nanopore and PacBio technologies), choosing the optimal region of the 16S rRNA gene for identification purposes remains a significant part of the experimental design. Thus, the variable regions that are used as identifiers require some consideration.
Our findings show that use of the V2-V3 and V1-V3 regions of the 16S rRNA gene allowed for the correct identification of the most bladder bacterial species when combined with the NCBI 16S database and either classifier. In general, amplicons that span more than one variable region perform better than those that contain single variable regions. This is likely due to the increased information available with longer reads. It is important to note that longer reads can also have limitations, which are discussed in more detail below. While shorter variable regions, such as the V4 region, did not perform as well as longer amplicons, they were able to identify many bladder bacteria at the species level (52 out of 78 primer set that would amplify the entire dataset may not be possible for this region. A future research direction could be to stratify the Thomas-White dataset into smaller, more closely related phylogenetic groups for more specific primer design.

Conclusion
Species-level taxonomy assignment will greatly benefit studies focused on the urobiome and its relationship to bladder health and disease. Our results show that it is possible to reliably classify bladder bacterial species using targeted amplicon sequencing of the 16S rRNA gene variable regions with existing classification algorithms and databases. We determined that the most important component of the classification scheme is the database used, and that the NCBI database allows for best identification of bladder species. Our validation with V4 amplicon data demonstrates that the predicted computational outcomes are a good approximation for how a classification scheme will perform on real data. The knowledge that a majority of the predicted matches reflect reality is encouraging. It can be expected that the alternate variable regions covered in this study, such as the V2-V3 region of the 16S rRNA gene, would have similar outcomes.
Importantly, we found that no single variable region gives 100% coverage of all bladder bacteria species. Thus, the choice of variable region may significantly affect the results of a given study. One approach to resolve this could be to use multiple amplicon sequencing or long read sequencing technology. These technologies are emerging and may prove to be beneficial for the urobiome community. Furthermore, no database has 100% coverage across a variable region. This could be resolved by using more than one database for classification, though this approach is complicated by differences in databases in terms of formatting, as well as conflicting classifications. Both of these components are important for planning experimental and computational aspects of urobiome studies, and should be considered when comparing results across studies.

Material and Methods
Code resources. All scripts that were written for this project can be found in the GitHub repository (https://github.com/lakarstens/BladderBacteriaSpecies). All scripts sourced from this repository are referred to as "custom."
The Thomas-White dataset. The 78 species of bladder bacteria used in this study were identified by culturing 149 urine samples and performing whole-genome sequencing, as described in . This set of identified species served as the basis for our computational analysis and is referred to as the Thomas-White dataset. For each species identified, the 16S rRNA gene sequence of the corresponding type strain was downloaded from the Silva v132 release (https://www.arb-silva.de/) on 4/27/2019. A type strain is the sequence of the cultured isolate that was subject to the metabolic, genotypic and phenotypic evaluations taken to define the bacterial species(27), and is the agreed bacterial organism to which the taxonomic name refers. Sequences were searched using the "[T]" filter setting, and sequences longer than 1450 nt with alignment and pintail quality scores greater than 95% were selected. For the species that had no hits, the taxonomic synonym (see below) was used as the search query, if available. One unidentified Corynebacterium species had no type strain available, and was excluded from the analysis.
The V4 validation dataset. Targeted amplicon sequences from 24 urine samples, using the V4 region of the 16S rRNA gene sequence, are referred to as the V4 validation dataset. These 24 urine samples originated from a subsample of the women whose samples comprised the Thomas-White dataset. Sequencing data were generated as part of two other published studies using Illumina sequencing of the V4 region of the 16S rRNA gene(4,11). The raw sequence reads were processed with DADA2 version 1.14.1(18) to generate amplicon sequence variants (ASVs).
The ASVs were subjected to taxonomic classification with the BLCA algorithm.
Synonyms of species. Species names have changed in response to advances in bacterial systematics. All currently known species synonyms were downloaded from the Prokaryotic Nomenclature Up-to-Date(28) (PNU) website on 1/5/2020. PNU includes information down to the strain level, but these entries were consolidated to the species level. For example, entries like Enterobacter cloacae and Enterobacter cloacae dissolvens are treated as synonyms of Enterobacter cloacae. Classification results were then checked for synonyms using the custom "validate_match_batch.py" script.
Databases. The Greengenes database version 13_5 was downloaded on 9/23/19 from (http://greengenes.secondgenome.com/?prefix=downloads/greengenes_database/gg_13_5/). For use with BLCA, the database was processed using the provided "1.subset_db_gg.py" script (https://github.com/qunfengdong/BLCA/). For use with the Qiime2 package, the FASTA file was reformatted to work with Qiime2 using the custom "write_qiime_train_db.py" script, and trained to work with the Naïve Bayes classifier with the provided "fit-classifier-naive-bayes" script.
The Silva database version 132 was downloaded on 9/14/19 from (https://www.arb-silva.de/no_cache/download/archive/release_132/Exports/) as a FASTA formatted file. The FASTA file was compiled into a database that could be used with BLCA by using the "makeblastdb" utility provided in the Blast+ suite. The taxonomy file that was required by BLCA was generated with the custom "write_taxonomy.py" script. For use with the Qiime2 package, the FASTA file was reformatted to work with Qiime2 using the custom "write_qiime_train_db.py" script, and trained to work with the Naïve Bayes classifier with the provided "fit-classifier-naive-bayes" Qiime2 script.
The 16SMicrobial database is bundled with the BLCA package, but is also available from (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). For use with BLCA, the database was processed using the provided "1.subset_db_acc.py" script included with BLCA. For use with the Qiime2 package, a FASTA file was extracted from the bundled BLCA database using the "blastdbcmd" utility provided in the Blast+ suite, and reformatted to work with Qiime2 using the custom "write_qiime_train_db.py" script. The database was trained to work with the Naïve Bayes classifier with the provided "fit-classifier-naive-bayes" script included in Qiime2.

Presence of Thomas-White species in databases.
To verify that all species from the Thomas-White dataset were present in the databases used in this study, each database was first converted to a FASTA file (if needed) using the "blastdbcmd" utility included in the Blast+ suite. The FASTA file was then searched using regular expressions and the Linux command-line program grep for a match of each species in the dataset. The commands were implemented using the custom "species_in_db.bash" script. The presence or absence of each species was recorded.
Multisequence alignment. The 16S gene sequences from the Thomas-White dataset were formed into a multi-sequence alignment using the T-coffee program(29 sequence were validated, and the relative amount of variability was quantified, using the custom "weighted_ent.py" script.
False match - All record pairs assigned as a match that did not have identical genus and species labels.
False non-match - If a record representing a species in the Thomas-White dataset was present in the database, but was not assigned as a match, the record was evaluated as a false non-match.
True non-match - All records in the reference database that were not in the Thomas-White dataset. While records assigned to this category were not used in evaluating the classification schemes in this manuscript, the definition is still included for completeness.
Performance measures. Recall, precision and the F-measure were used to evaluate the performance of each classification scheme implemented. Recall refers to the proportion of matches that the classification scheme correctly identified (true matches) out of all possible matches (true matches plus false non-matches). Precision refers to the proportion of matches that the classification scheme called correctly (true matches) out of all classified matches (true matches plus false matches).
The F-measure is the equally weighted harmonic mean of recall and precision. For this study, we chose to maximize recall over precision, because the number of true matches impacts the subsequent work on diversity measures, such as species richness and evenness(38).

Figure 8. Example of classification evaluation used in this study. Suppose there is a classification scheme comprising a set of query sequences (the rows E,F,G) and the set of reference sequences (the columns E,F,L,M) held in a reference database. In this example, the number of reference records is greater than the number of query records, and the reference is missing a record corresponding to G in the query set. A) If the query and reference record letters are the same, then they are designated as a match. If they are different, they are designated as a non-match. B) Next, the classifier is allowed to assign record pairs as matches or non-matches for all query sequences, represented as green plus signs for matches and blank cells for non-matches. Some results are correct, and some are not. Note that despite the lack of a matching record in the reference database, the classifier still designated the (G:M) pair as a match. C) Using the definitions for assigning the classifications to the confusion matrix, there is one true match (green square), two false matches (red squares), one false non-match (yellow square), and eight true non-matches (white squares). D) The cell values of the confusion matrix are then filled out, and performance measurements can be calculated. For this classification scheme, the precision is 1/(1+2) = 0.33, recall is 1/(1+1) = 0.5, and the F-measure is (2*0.33*0.5)/(0.33+0.5) = 0.40.

Evaluating V4 validation results. The species of bacteria in the V4 sequencing data were identified using classification schemes composed of the V4 sequencing results as the identifier, the BLCA classifier, and the Greengenes, Silva, and NCBI 16S microbial databases. To determine the expected bacterial species in each sample, the results of whole genome sequencing of the isolates cultured from the corresponding subject were used. For each classification scheme, accuracy was calculated by enumerating the number of species identified by WGS that were also identified by the V4 16S targeted amplicon sequencing using the custom "real_world_data_2020-4-17.Rmd" file.

The results of the V4 validation set were evaluated according to the following definitions (Figure 9):

True match - All matches from the computational classification scheme that were correctly identified by V4 16S targeted amplicon sequencing.

False match - All species identified by V4 16S targeted amplicon sequencing that were not identified by the computational classification scheme.

False non-match - All matches from the computational classification scheme that were not identified by V4 16S targeted amplicon sequencing.

True non-match - All species that were not identified by either the computational classification schemes or the V4 16S targeted amplicon sequencing.

Figure 9: Definitions of how the classification scheme outcomes are assigned to the cells of the confusion matrix for the V4 validation results. This example shows the classification scheme composed of the Greengenes database, BLCA classifier, and the V4 region of the 16S rRNA gene as the identifier. When the Thomas-White dataset is restricted to the 24 samples that underwent targeted amplicon sequencing, a smaller set of 49 species remains. The light yellow rows indicate the species correctly identified by the computational classification scheme. Blue dots represent species identified in the collected samples by whole genome sequencing after expanded urine culturing and isolation. Yellow dots indicate species identified in those samples by V4 targeted amplicon sequencing. Yellow dots in light yellow rows are true matches; when found elsewhere, they are false matches. Blue dots in the light yellow rows are false non-matches; when found elsewhere, they are true non-matches.
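
The recall, precision and F-measure definitions given under Performance measures can be made concrete with a short Python sketch. It takes the confusion-matrix counts as input and reproduces the Figure 8 example (one true match, two false matches, one false non-match). The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption made for this illustration, not a rule stated in the text.

```python
def performance_measures(true_matches, false_matches, false_non_matches):
    """Return (precision, recall, f_measure) from confusion-matrix counts."""
    classified = true_matches + false_matches      # all classified matches
    possible = true_matches + false_non_matches    # all possible matches
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = true_matches / classified if classified else 0.0
    recall = true_matches / possible if possible else 0.0
    total = precision + recall
    # Equally weighted harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Counts from the Figure 8 example.
precision, recall, f_measure = performance_measures(1, 2, 1)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
# prints: precision=0.33 recall=0.50 F=0.40
```

True non-matches do not enter any of the three measures, which is consistent with that category not being used to evaluate the classification schemes.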

Data Availability.
This project used previously acquired publicly available data(20). All code that was written for this project can be found in the GitHub repository: