Direct quantification of in vivo mutagenesis and carcinogenesis using duplex sequencing

Significance

Error-corrected next-generation sequencing (ecNGS) can be used to rapidly detect and quantify the in vivo mutagenic impact of environmental exposures or endogenous processes in any tissue, from any species, at any genomic location. The greater speed, higher scalability, richer data outputs, and cross-species and cross-locus applicability of ecNGS compared to existing methods make it a powerful new tool for mutational research, regulatory safety testing, and emerging clinical applications.

A taxonomy database was constructed with k-mers from the human, rat, cow, and mouse genomes. The taxonomic classifier Kraken 1 was used to identify contaminating error-corrected paired-end reads and to confidently indicate which reads were exclusively of Mus musculus origin. Reads left unassigned by this method are often true sequences from the source genomes; however, they contain an `N`-call or variant base often enough that no single k-mer exists to support a positive classification to the target genome. Reads of ambiguous assignment were discarded because they did not contain enough sequence information to positively assign them to any of the organisms at the species level.
To eliminate confounding assignment due to the human HRAS transgene in the Tg-rasH2 mouse model, a masked human genome was used for all classification; the masked territory was the exact sequence copy integrated into Tg-rasH2.
Out of a total of 52,509,726 error-corrected paired-end reads across all 62 (1.2×10⁻⁴%) murine tissue samples, 50,910,333 were taxonomically classified as Mus musculus, 34 to Rattus norvegicus albus, 33 (6.3×10⁻⁴%) to Homo sapiens, and 0 to Bos taurus (0%). A total of 84,865 (0.2%) paired-end reads were unclassified and 1,514,494 (2.8%) were of ambiguous taxonomic origin. Only sequence data that could be positively identified as originating from the mouse genome were retained for downstream analysis. Furthermore, every error-corrected paired-end read supporting a variant call in this cohort underwent manual review and BLAST+ alignment against the BLAST nucleotide (nt) collection, which confirmed the true positive rate of taxonomic classification on this error-corrected dataset to be 100.000000%.
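The retention rule described above (keep only read pairs positively assigned to mouse, and tally everything else) can be illustrated with a minimal sketch. It assumes each read pair has already been reduced to a `(read_id, taxid)` tuple, that a taxid of 0 means unclassified, and that any taxid outside the four species in the database represents an ambiguous, above-species assignment. The function names and the input shape are hypothetical and are not the pipeline used in this study.

```python
from collections import Counter

# NCBI taxonomy IDs for the four species in the classification database.
SPECIES = {
    10090: "Mus musculus",
    9606: "Homo sapiens",
    10116: "Rattus norvegicus",
    9913: "Bos taurus",
}
TARGET_TAXID = 10090  # only mouse read pairs are retained


def categorize(taxid):
    """Map a taxid to a species name, 'unclassified', or 'ambiguous'.

    Assumption: taxid 0 means no assignment, and any other taxid that is
    not one of the four species is an above-species (ambiguous) assignment.
    """
    if taxid == 0:
        return "unclassified"
    return SPECIES.get(taxid, "ambiguous")


def partition_reads(assignments):
    """Return (ids of read pairs kept for analysis, per-category tally)."""
    tally = Counter()
    kept = []
    for read_id, taxid in assignments:
        tally[categorize(taxid)] += 1
        if taxid == TARGET_TAXID:
            kept.append(read_id)
    return kept, tally


# Hypothetical input: one mouse pair, one human contaminant, one
# unclassified pair, and one pair assigned only at a higher rank
# (40674 is the NCBI taxid for Mammalia).
example = [("pair1", 10090), ("pair2", 9606), ("pair3", 0), ("pair4", 40674)]
kept_ids, counts = partition_reads(example)
```

In this sketch, contaminating, unclassified, and ambiguous read pairs are all counted but never passed downstream, which mirrors the conservative choice of analyzing only positively identified mouse sequence.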
Tissue samples from vehicle-control-exposed mouse ID 9951 contained 29 paired-end reads from Homo sapiens, and a tissue sample from the benzo[α]pyrene-exposed mouse ID 9310 contained 28 paired-end reads from Rattus norvegicus albus, suggesting that most contaminating events in both mouse cohorts were punctuated and private to just a few samples. The mean per-nucleotide mutant frequency for mouse 9951 is 1.2×10⁻⁷; had the contaminating reads not been removed, it would have risen to a level equivalent to, or greater than, the mutant frequencies detected in the positive control samples.

SUPPLEMENTARY FIGURES

Figure S1. Mutant frequency (MF) comparison in a mutagen-exposed sample with and without duplex consensus-level error correction. Alternative forms of error-corrected next-generation sequencing (ecNGS) may perform error correction on single strands without resolving a complete duplex consensus. These single-strand forms of ecNGS are not sensitive enough to resolve small effect sizes in mutant frequency induction in experiments like the TGR assays. To illustrate this, we performed single-strand error correction on data generated with Duplex Sequencing Adapters from two Tg-rasH2 mouse lung samples, one treated with urethane and one treated with the vehicle control. The per-nucleotide mutant frequencies for the vehicle control and urethane-exposed samples are 8.2×10⁻⁸ and 2.15×10⁻⁶ using Duplex Sequencing. When the same metric is measured using only single-strand consensus sequencing (SSCS), the two mutant frequencies rise to 8.6×10⁻⁵ and 8.6×10⁻⁵, respectively. With Duplex Sequencing, the mutant frequencies of the exposed and control tissues differ significantly (p-value less than 2.2×10⁻¹⁶). In contrast, the single-strand error-corrected mutant frequencies do not differ significantly (p-value 0.98). Both statistical tests were performed using Fisher's exact test for count data. Error bars reflect 95% confidence intervals.

Table S4. Early neoplastic evolution is detected with Duplex Sequencing in the cancer-predisposed Tg-rasH2 mouse. Variant allele counts of A·T→T·A mutations at codon 61 of the human HRAS transgene in the Tg-rasH2 mouse model. The variant allele counts observed at this locus are those of A·T→T·A in the context CTG for urethane-exposed tissues. All but one of the urethane-exposed lung tissues harbor a variant at significant clonality. A single urethane-exposed splenic sample has a small clone of two counts (0.018%) at this locus.
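The Fisher's exact test comparison described in the Figure S1 legend can be sketched as follows. The underlying mutant and base counts are not given in the text, so the counts below are hypothetical values chosen only to approximate the reported Duplex Sequencing frequencies (about 8×10⁻⁸ and 2.15×10⁻⁶). The function `fisher_exact_two_sided` is an illustrative pure-Python implementation, not the software used for the published analysis.

```python
import math


def log_comb(n, k):
    """Natural log of C(n, k) as a sum of k terms.

    Numerically stable when n is very large (sequenced bases) and k is
    small (mutant counts), which is the regime of mutant frequency data.
    """
    return sum(math.log(n - k + i) - math.log(i) for i in range(1, k + 1))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are samples; columns are (mutant bases, non-mutant bases).
    The p-value sums every hypergeometric outcome that is no more
    probable than the observed table.
    """
    row1, row2, mutants = a + b, c + d, a + c
    lo, hi = max(0, mutants - row2), min(mutants, row1)
    log_total = log_comb(row1 + row2, mutants)
    log_p = {
        x: log_comb(row1, x) + log_comb(row2, mutants - x) - log_total
        for x in range(lo, hi + 1)
    }
    cutoff = log_p[a] + 1e-7  # small tolerance for floating-point ties
    return min(1.0, sum(math.exp(v) for v in log_p.values() if v <= cutoff))


# Hypothetical counts (NOT from the study): 10^8 duplex bases per sample.
BASES = 100_000_000
urethane_mutants, control_mutants = 215, 8
p_duplex = fisher_exact_two_sided(
    urethane_mutants, BASES - urethane_mutants,
    control_mutants, BASES - control_mutants,
)
```

Under these assumed counts the exposed and control samples separate very strongly, whereas two samples with identical mutant counts and depths (as in the SSCS comparison, where both frequencies are 8.6×10⁻⁵) yield a p-value close to 1.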