Modeling site-specific nucleotide biases affecting Himar1 transposon insertion frequencies in TnSeq datasets

In bacterial TnSeq experiments, a library of transposons insertion mutants is generated, selected under various growth conditions, and sequenced to determine the profile of insertions at different sites in the genome, from which the fitness of mutant strains can be inferred. The widely used Himar1 transposon is known to be restricted to insertions at TA dinucleotides, but otherwise, few site-specific biases have been identified. As a result, most analytical approaches assume that insertion counts are expected a priori to be randomly distributed among TA sites in non-essential regions. However, recent analyses of independent Himar1 Tn libraries in M. tuberculosis have identified a local sequence pattern that is non-permissive for Himar1 insertion. This suggests there are site-specific biases that affect the frequency of insertions of the Himar1 transposon at different TA sites. In this paper, we use statistical and machine learning models to characterize patterns in the nucleotides surrounding TA sites associated with high and low insertion counts. We not only affirm that the previously discovered non-permissive pattern (CG)GnTAnC(CG) suppresses insertions, but conversely show that an A in the -3 position or T in the +3 position from the TA site encourages them. We demonstrate that these insertion preferences exist in Himar1 TnSeq datasets other than M. tuberculosis, including mycobacterial and non-mycobacterial species. We build predictive models of Himar1 insertion preferences as a function of surrounding nucleotides. The final predictive model explains about half of the variance in insertion counts, presuming the rest comes from stochastic variability between libraries or due to sampling differences during sequencing. Based on this model, we present a new method, called the TTN-Fitness method, to improve the identification of conditionally essential genes or genetic interactions, i.e., to better distinguish true biological fitness effects by comparing the observed counts to expected counts using a site-specific model of insertion preferences. Compared to previous methods like Hidden Markov Models, the TTN-Fitness method can make finer distinctions among genes whose disruption causes a fitness defect (or advantage), separating them out from the large pool of non-essentials, and is able to classify the essentiality of many smaller genes (with few TA sites) that were previously characterized as uncertain.


Introduction 37
TnSeq has become a popular tool for evaluating gene essentiality in bacteria under various 38 conditions (Cain, Barquist et al. 2020). The most widely used transposons for bacterial TnSeq are those in 39 the mariner family, such as Himar1 (Sassetti, Boyd et al. 2003). To date, it has generally been assumed 40 that the Himar1 transposon, frequently used to generate the transposon libraries, inserts randomly at TA 41 dinucleotide sites in non-essential regions across the genome (Lampe, Akerley et al. 1999). The abundance 42 of transposon insertions at each TA site can be quantified efficiently using next-generation sequencing 43 (Long, DeJesus et al. 2015). Genes or loci with an absence of insertions are considered to be essential, as 44 disruption in these regions are not tolerated (Sassetti, Boyd et al. 2003). Genes or loci with a reduced 45 mean insertion count are considered mutants with growth defects, as disruptions in these regions are not 46 fatal but cause growth impairments or fitness defects (van Opijnen, Bodi et al. 2009). Genes that have 47 significant changes in mean counts between conditions are deemed as conditionally essential (Gawronski,48 Wong et al. 2009). 49 There are several sources of noise in TnSeq experiments, including stochastic variations in the 50 library generation process as well as instrument and sampling-error in DNA sequencing, resulting in a high 51 variability in insertion counts. Statistical methods developed thus far to assess gene essentiality typically 52 assume that insertions occur randomly at TA sites in non-essential regions, and the reason some sites 53 have more insertions than others is largely due to stochastic differences in abundance in the library. 54 However, some studies suggest that transposon insertions at non-essential sites is influenced by the 55 surrounding nucleotides or genomic context. Transposons Tn5 and Mu (not restricted to TA dinucleotides) 56 showed a bias towards insertions in GC-rich regions and resulted in a less uniform distribution of insertions 57 in the A-T rich genome (61% AT) of C. glabrata than their notably less-biased counterpart Tn7 (Green, 58 Bouchier et al. 2012). In addition, Lampe (Lampe, Akerley et al. 1999) showed that local bendability of the 59 DNA strand can affect the probability of Himar1 insertion at different chromosomal locations in E. coli. In this paper, we use statistical and machine learning models to identify patterns in the 65 nucleotides surrounding TA sites associated with high and low insertion counts. We discover nucleotide 66 biases within a ±4-base pair window around the TA site that suppress Himar1 insertions, and other 67 patterns that appear to select for them (i.e., associated with high insertion counts). We capture these 68 biases in a predictive model of Himar1 insertion preferences that can be used to predict expected insertion 69 counts at any TA site as a function of the surrounding nucleotide context. We demonstrate that these 70 insertion preferences exist in other Himar1 TnSeq datasets from Mtb, as well as other mycobacterial and 71 non-mycobacterial species. The final predictive model explains about half of the variance in insertion 72 counts, presuming the rest comes from stochastic variability between libraries or due to sampling 73 differences during sequencing. We demonstrate that this model can be used to improve the assessment 74 of genes' fitness by comparing the observed counts to expected counts using a site-specific model of 75 insertion preferences. 76 77 Results 78

Insertion counts at TA sites are correlated between libraries 79
Variability in insertion counts at TA sites can be attributed to various sources, including 80 abundance in library, experimental randomness, and local sequence biases, as well as genuine biological 81 significance (fitness effects). To attempt to differentiate these, we re-analyzed a previously published 82 collection of 14 independent Himar1 TnSeq libraries grown in standard laboratory medium (DeJesus,83 Gerrick et al. 2017). An extended HMM analysis the 14 datasets (see Methods) suggests that 84 approximately 11.6% of the organism's TA sites are essential for growth, and insertions in approximately 85 3.5% of the sites can cause a growth defect. In addition, 9% of sites in non-essential regions have few to 86 no insertions due to a non-permissive sequence pattern (DeJesus, Gerrick et al. 2017). Insertions at TA 87 sites in regions other than these are generally expected to occur randomly. If true, the insertion counts at 88 the same TA site in different libraries would be expected to be uncorrelated on average. However, our 89 analysis of the 14 H37Rv Tn libraries shows that there is substantial correlation of counts at individual TA 90 sites, suggesting that some TA sites have a higher propensity for Himar1 insertion than others. Figure 1  91 shows the distribution of log 10 insertions counts from each library in a genomic region with 75 TA sites 92 (the log of counts was taken to better fit a Gaussian distribution). Each library was TTR-normalized to 93 make counts from datasets of different total size comparable (DeJesus, Ambadipudi et al. 2015). Panel A 94 shows that mean insertion counts differ widely among non-essential TA sites, and the variability between 95 TA sites is more than within each site. Thus, high counts at a TA site in one library tend to have high counts 96 in other libraries and similarly, sites with low counts occur symmetrically across the libraries. As a 97 comparison, insertion counts at TA sites (excluding those marked essential or following the non-98 permissive pattern) were randomized within each library. Panel B shows the same 75 consecutive TA sites 99 in this randomized dataset. When randomized, the distribution of counts at sites in non-essential regions 100 is much more uniform. The average variance of all insertions within a TA site is 0.430, significantly lower 101 (p-value < 0.001) than the variance of 0.929 found in the randomized dataset. This makes it evident that 102 the correlation of log insertion counts across libraries is greater than expected. In fact, pairwise 103 correlations of the randomized datasets range from 0.15 to 0.33, averaging to 0.28. Pairwise correlations 104 of 14 libraries are considerably higher, ranging from 0.5 to 0.97, averaging to 0.62 (see Supplemental 105 Figure S1). 80 of the 90 pairwise correlations had significant p-values (< 0.01) from comparison by a two-106 tailed t-test (see Supplemental Table T1). A significant high correlation across libraries suggests there are 107 site-specific influences, in addition to those previously observed, on insertion probabilities at different TA 108 sites. 109 110

Modeling Insertion Counts using Linear Regression 111
To determine whether the nucleotides surrounding a TA site influence the probability of 112 insertion, we examined the association of proximal nucleotides on insertion counts, averaged over all non-113 essential TA sites in the genome. Figure 2 presents evidence of site-specific nucleotide effects that 114 influence the relative abundance of insertions at TA sites. Panel A shows overall nucleotide probabilities 115 ±20 bp from the TA site. Most of the deviation in nucleotide probabilities occurs within 4 bp of the central 116 TA site, with probabilities varying up to 20% for some nucleotides. Further insight can be gained by 117 dividing the TA sites into thirds: sites with lowest counts, sites with medium counts, and sites with highest 118 counts. Panel B, depicting the lowest third of the range of insertion counts, shows an increase in 119 probabilities of nucleotides C and G and a decrease in probabilities of nucleotides 'A' and 'T' especially at 120 positions ±2 and ±3. Panel D, depicting the highest third of insertion counts, also shows drastic changes in 121 nucleotide probability, with a notable increase in propensity for 'A' at -3 and 'T' at +3. These observations 122 suggest a correlation between the magnitude of insertion counts and nucleotides surrounding TA sites. 123 Thus, insertion counts at a TA site could be affected by the surrounding nucleotides. 124 We trained a linear regression model on the 40 nucleotides surrounding the TA site (positions -125 20...+20) to predict insertion counts in known non-essential regions (67,670 TA sites) using the mean 126 counts from the 14 libraries of H37Rv. The input to the model was a one-hot-encoding of the nucleotides, 127 where each nucleotide at each position was represented by 4 bits and concatenated into a bit vector, 128 totaling 160 binary features. The resulting linear model was: 129 is the nucleotide at position i relative to the TA site and weights w ij 131 correspond to each of the 160 binary features. This formula is equivalent to a dot-product a 160-bit 132 vector (plus an intercept) with a vector of weights, log 10 (Insertion Count) = w 0 + w 1 b -20=A + w 2 b -20=C +w 3 133 b -20=T + w 4 b -20=G +…+ w 157 b +20=A + w 158 b +20=C +w 159 b +20=T + w 160 b +20=G , where every four bits encode the 134 nucleotide at a position  20 bp from the TA site. The model was trained and evaluated using 10-fold 135 cross validation. Figure 3 shows the average correlation between predicted and observed log 10 insertion 136 counts. The model has some predictive power (R 2 value of 0.32), but also has high variance. A slight bias 137 can be seen in the figure, where the low counts are predicted too high, and the high counts too low. This 138 is a consequence of the regression model making predictions that do not span as wide of a range as the 139 actual data, due to inaccurate predictions for the sites with the most extreme (largest or smallest 140 counts). The accuracy of predictions made by this initial simplified regression will increase with 141 improved models (below) and thus this effect will be reduced. 142 In Figure 3B, nucleotides with highest coefficients in the trained model are located within a 143 window of ±4 bp around the TA site. The pattern created by the nucleotides of these coefficients are 144 consistent with the non-permissive pattern (CG)GnTAnC(CG) previously reported (DeJesus, Gerrick et al. 145 2017). Nucleotide 'G' has the highest absolute coefficient value in the -2 position and 'C' has the highest 146 absolute coefficient value in the +2 position. Moreover, both 'C' and 'G' have similarly high absolute 147 coefficients in the -3 and +3 positions. In addition to the confirmation of the non-permissive pattern (large 148 negative coefficients for 'G' at -3 and 'C' at +3), the figure shows nucleotides 'A' and 'T' with relatively 149 high positive coefficients in positions -3 and +3 from the TA site. These patterns reinforce the observations 150 made in Figure 1 and provide further evidence of previously undetected site-specific nucleotide biases 151 that affect Himar1 insertion counts. 152 153

Prediction of insertion counts at TA sites relative to local average counts 154
We assume that insertion counts are proportional to the permissiveness of a site i.e., a site with 155 a less permissive pattern will have lower counts than a site with a more permissive pattern. However, 156 insertion counts are also affected by biological fitness. It is likely that a TA site with a specific nucleotide 157 pattern in a fitness-defect gene will have a lower insertion count than a TA site with the same pattern in 158 a non-essential gene. But this effect (decrease or increase in counts) should be shared by multiple TA sites 159 locally. We can compare the insertion count observed at a site to the observed counts at other TA sites in 160 the region, the level of which should reflect the general fitness effect of disrupting the gene. Thus, 161 modeling this relative (or local) change in insertion counts would allow us to factor out biological effects 162 on counts and focus on the effect of nucleotide patterns on the insertion counts. 163 This change in insertion counts is quantified for every TA site as a log-fold-change (LFC) value. 164 The local average was calculated for each site by taking the mean insertion counts from the previous 5 165 and next 5 TA sites from the site of interest (i.e., using a sliding window of 11 consecutive TA sites). 166 The local mean excludes the central site itself and any locations marked as essential during pre-processing. 168 The LFC for each TA site was calculated by taking the log insertion count at that site plus a pseudo count 169 of 10 (to smooth out high variability of LFCs for sites with low counts) and dividing it by the local average: 170 ( ) = 2 ( ( ) + 10 ( ) + 10 ) 171 As with the previous model, this linear model was trained and tested using 10-fold cross 172 validation. The resulting model (see Supplemental Figure S2) has an average R 2 value of 0.38, indicating 173 that training the model to predict changes in insertion counts (relative to local mean) rather than absolute 174 insertion counts greatly reduces the noise due to local fitness effects (e.g., in a gene where insertion cause 175 growth defects, systematically reducing abundance of insertions in the region). This allows the model to 176 better capture the effect of nucleotides surround TA sites on Himar1 insertion preferences. 177 178 A neural network model explains up to 50% of the variability in insertion counts 179 As they can capture non-linear patterns, neural networks are considered to be some of the most 180 powerful predictors in Machine Learning (Rumelhart, Hinton et al. 1986). To see if we could increase the 181 accuracy of our model, we tried using our data to train a fully connected multi-layer feed-forward Neural 182 Network. The model contained one hidden layer of 50 nodes. This parameter along with other hyper 183 parameters of the network were tuned using a grid search (see details in Methods and Materials). A 184 random subset of 70% of the data was used to find the ideal hyper parameters through cross validation, 185 with the remaining 30% of the data used to test the final hyper parameters. A 10-fold cross-validation of 186 the entire dataset was used to train and test the model i.e., judge the model accuracy using our data. The 187 input to the model consisted of bit-vectors encoding nucleotides surrounding each TA site in the dataset, 188 totaling to 160 features. The target value was LFCs (log-fold-changes of insertion counts relative to local 189 mean). The model performed better than the previous models with an average R 2 of 0.51 (R 2 of 0.509 on 190 the hyper parameter test data) (see Supplemental Figure S3). Thus, the neural network can explain around 9 half of the variability in insertion counts at TA sites based on surrounding nucleotides; presumably, the 192 remaining differences in counts still reflect stochastic differences in abundance between libraries (or other 193 influences on TA insertion preferences for which we have not yet accounted). However, as is typical for 194 neural networks, this model (as a matrix of connection weights) does not provide us much insight into 195 nucleotide patterns that led to the predictions for the TA sites. counterpart 'T' in position +3 are both associated with higher mean LFCs, and hence can be interpreted 213 as being more permissive for Himar1 insertions (associated with increased counts). However, the effects 214 of multiple biases appearing in a single sequence are not additive. For instance, a 'C' in the -2 and an 'A' 215 in the -3 position do not "cancel" each other out; they are interdependent. We quantify how effects like 216 these combine in the tetra-nucleotide model below. 217 There appears to be a slight periodic pattern of the G and C nucleotides surrounding the TA site, 218 between 20 and 4 bp from the TA site in Figure 4 (also evident in Figure 2 Figure S4). This is reinforced by 234 the heatmaps, as a majority of the apparent effects occur within 4 bp from the TA site. If we use all the 235 sequence combinations of the nucleotides in positions -4...+4 as features in our model, we will have 4 8 236 =65,536 features (i.e., terms in a linear model, or inputs to a neural network). However, the patterns of 237 nucleotide biases are symmetrical (reverse-complement), as shown by the heatmaps, thus making the 238 distinction between all 8 nucleotides unnecessary. The four nucleotides upstream of the TA appear to 239 affect the insertion counts in the same way as the reverse-complement of the 4 nucleotides downstream 240 of the TA site. Therefore, it is only necessary to capture the association of 4 nucleotides at a time on LFCs 241 in the model. Hence, we shift to training our models based on combinations of 4 nucleotides, i.e., tetra-242 nucleotides, which reduces the number of features in our model to 4 4 = 256. 243 As input to the STLM, each TA site is represented as a vector where all features are set to 0 244 except for the upstream tetra-nucleotide and reverse-complemented downstream tetra-nucleotide (see 245 Figure 6). This is essentially the same as adding two bit-vectors, one vector with the bit for the upstream 246 tetra-nucleotide on and another separate vector with the bit for the downstream tetra-nucleotide on. The 247 result is a sparse 256-bit vector with only 2 bits on (except when the two tetra-nucleotides are the same, 248 in which case the single feature value for the tetra-nucleotide is set to 2). The result is a linear model that 249 follows the equation 250 where 1 … . 256 are the weights associated with tetra-nucleotides (to be trained by the model) and 252

… .
are the bits corresponding to the presence of the adjacent tetra-nucleotide features for 253 every TA site. Encoding both the upstream and reverse-complemented downstream tetra-nucleotides 254 allows us to use the same model to represent the bias from both sides of the TA simultaneously as 255 independent features, additively contributing equal weight. Assume for a given TA site, both upstream 256 and downstream tetra-nucleotides are associated with high LFCs; then they will reinforce to predict an 257 even higher insertion count for that site. But if the upstream tetra-nucleotide has a trend to contribute a 258 high LFC and the reverse-complemented downstream tetra-nucleotide has a trend to contribute a low 259 LFC, they will tend to cancel each other out. 260 As seen in Figure 5A, 10-fold cross validation using the H37Rv data resulted in an average R 2 261 value of 0.469. This R 2 value is slightly lower than, but nearly equal to, that of the neural network (p-value 262 < 0.01 from two-tailed t-test). However, the STLM provides us more insight into patterns contributing to 263 the prediction of the LFCs. In a regression with these tetra-nucleotide features, we expect each coefficient 264 (i.e., weights) of the model to correlate with the average LFC associated with each tetra-nucleotide (over 265 TA sites surrounded by these tetra-nucleotides). Figure 5B shows the relationship of the STLM coefficients, 266 and the mean observed LFCs of the corresponding tetra-nucleotides (shifted on the y-axis by the bias 267 (intercept) in our data). The strong linear trend visible adheres to the expectation of a high correlation 268 and indicates our model accurately represents our data. The individual tetra-nucleotide coefficients are 269 shown in Panel C, sorted in decreasing order (See Supplemental Table T4 for full table) in insertion counts at different TA sites for these datasets (using coefficients trained on M. tuberculosis 308 H37Rv data but applied to datasets from other mycobacterial species). The predictive power of our model 309 on the M. abscessus dataset (R 2 of 0.504) is slightly higher than, but about the same as, the Mtb test set. 310 biases. The predictive power of our model on some datasets, such as M. avium was lower (R 2 of 0.262), 312 but they still exhibited a correlation between observed and predicted LFCs (and hence insertion counts) 313 and displayed a nucleotide pattern similar to the heatmaps from the other mycobacterial TnSeq datasets. 314 To examine whether these biases also occur outside of the Mycobacterium genus, we obtained 315 Himar1 TnSeq datasets from Caulobacter crescentus ( between the reference and isolate strain at these sites for every TA site with an adjacent SNP can be seen 348 in Figure 10 (see Methods and Materials). We expected that when a nucleotide with a high negative bias 349 was mutated, the observed LFC would increase, and when a nucleotide with a high positive bias was 350 changed, the observed LFC would decrease. Figure 10 (for full table see Supplemental Table T2). In addition to the general pattern 360 observed in the plot, we see that the magnitudes of the LFC differences correspond to the magnitudes of 361 nucleotide biases. In the previous heatmaps,  Table T3. 447 An example of a gene whose fitness interpretation is changed via the Gene+TTN model in the 448

TTN-Fitness method (compared to Genes-only model), to better reflect its biological significance, is 449
Rv0833 (PE_PGRS13). The gene is seen in Figure 12B  Thus, the Gene+TTN model was able to evaluate the fitness of PE_PGRS13 more accurately than the Gene-460 Only model, demonstrating that incorporating the nucleotide context surrounding each TA site improves 461 the fitness assessment of this gene. 462 To investigate genes that exhibit large differences in fitness assessment between the TTN-463 Fitness method and the HMM+NP method, Figure 13   perhaps insertion counts at different TA sites could be predicted based on surrounding nucleotides. We 498 developed a model that captures nucleotide biases and uses them to predict changes in relative insertion 499 counts i.e., LFCs. The LFC metric compares raw counts at a site to the local average, which allows us to 500 predict the deviation in insertion counts from the neighborhood rather than the absolute insertion counts 501 themselves. This method allows us to examine just the effect of the nucleotides on the insertion counts, 502 independent of biological effects (e.g., genes with different levels of growth defect). The STLM developed 503 for the task incorporated tetra-nucleotides upstream and downstream of the TA site, taking advantage of 504 the symmetric nature of the bias patterns observed in the heatmaps. Furthermore, the tetra-nucleotide 505 features ensured that the model could capture non-linear combinations (interactions) of nucleotides 506 proximal to the TA site, not just incorporating the effects or individual nucleotides in an additive way. The 507 STLM statistically performed as well as the neural network, and in addition was able to provide further 508 insight into nucleotide patterns that influence insertion counts. 509 The coefficients of the trained STLM showed that there was a pattern of insertion count 510 suppression consistent with the non-permissive pattern previously observed (DeJesus, Gerrick et al. 511 2017). In addition, a pattern of increased insertion counts in the presence 'A' in the -3 position or 'T' in 512 the +3 position was also visible. But the linear model represents these patterns in a more general way so 513 that they can be used to predict expected insertion counts at any TA site, conditioned on the surrounding 514 nucleotides. These nucleotide biases were able to explain up to ~50% of insertion count variance in the 515 other Himar1 datasets. These site-specific nucleotide biases were observed in a variety of TnSeq datasets 516 from other mycobacterial and non-mycobacterial species. Comparing TA sites with substitutions in the ±4 517 bp window between two divergent strains of M. abscessus showed changes in observed LFCs that 518 corresponded to nearby SNPs as predicted by the STLM, providing further evidence of the generality of 519 these biases. 520 There is a precedent for transposons in some families having insertion biases for certain sequence 521 patterns. For example, even though the Tn5 transposase can insert anywhere in a genome, it tends to 522 insert in GC-rich regions   (Green, Bouchier et al. 2012). Furthermore, a 523 detailed pattern analysis applied to known Tn5 insertion sites suggested that the consensus pattern for 524 preferred target sites is A-GNTYWRANC-T (Goryshin, Miller et al. 1998). Tc1 (also in mariner family) was 525 shown to weakly prefer inserting at TA sites with this consensus pattern: CAYATATRTG (Korswagen, Durbin et al. 1996). The pattern included a coupled symmetric target site preference of an 'A' in position -3 and 527 a 'T' in position +3, consistent with our model. We were able to identify similar sequence-dependent 528 patterns and quantify them in a more general way with a model that can predict expected insertion counts 529 for every TA site. 530 Early studies in E. coli suggested that the Himar1 transposon tends to insert at TA sites in more 531 "bendable" regions of the genome (Lampe, Grant et al. 1998), as measured experimentally. Bendability is 532 a cumulative effect of specific nucleotides on local geometric parameters of the DNA helical axis; each 533 nucleotide makes a small contribution, on the order of a few degrees, to angular distortion (bend, roll, 534 tilt) of the axis, with different nucleotides (or combinations of nucleotides) having a different effect. This 535 can accumulate over tens of nucleotides to produce a macroscopic bend or kink in the DNA. Goodsell and 536 Dickerson (Goodsell and Dickerson 1994) parameterized the geometric effects for each trinucleotide and 537 used this to generate a model which can be used to predict the bend and twist of the helical axis 538 accumulated locally using a sliding window. It was speculated that local bendability could facilitate the 539 melting of the double-helix, recognition/binding of the transposase, and formation of the pre-cleavage 540 complex (Lampe, Grant et al. 1998). However, while it is possible that bendability contributes weakly to 541 Himar1 insertion preferences, the effect likely spans a larger window of nucleotides than just ±4 bp around 542 the TA sites; local bendability is not likely to be substantially affected by the 4 nucleotides on either side 543 of the TA sites, which have a predominant influence according to our statistical analysis. In addition, we 544 computed this around the TA sites in our dataset and added it as a covariate in our linear models, but it 545 did not improve the performance of the models. 546 The patterns of nucleotide biases on Himar1 transposon insertion preferences may have emerged 547 as a result of the physical interaction between the Himar-1 transposase and the DNA. Figure 14 displays 548 the X-ray crystal structure of the complex between the Mos1 transposon (also in the mariner family) and 549 the pre-cleavage state of the DNA double helix (Dornan, Grey et al. 2015). As expected, the components of the TA dinucleotide (T57, A58) interact with the protein (residues 119-124 (WVPHEL)-orange). 551 However, the 4 adjacent nucleotides also make extensive contact with the protein in a small tunnel by 552 packing against Asp284-His293 (green). Arg118 likely makes charged-polar interactions with the 553 nucleotides at position -2 and -3. These positions are where different nucleotides proximal to TA 554 dinucleotides are observed to have insertion biases in Himar1 datasets. The interactions between these 555 TA-adjacent nucleotides and amino acid side chains in the transposase could influence the energetics and 556 therefore the frequency of successful transposon reactions at TA sites. While it would be tempting to try 557 to perform a detailed analysis of the hydrogen-bonding and other molecular interactions between 558 nucleotides in the DNA fragment and amino acid side-chains of the transposase they contact to derive a 559 structural explanation for the observed preferences for certain nucleotides surrounding the TA site, it 560 must be remembered that this structure is of MosI (whose insertion biases are unknown, except for TA 561 restriction), and a detailed analysis of molecular interactions relevant to the biases of the Himar1 562 transposase, as we have characterized, will have to await determination of an X-ray crystal structure of a 563 complex of the Himar1 transposase bound to a target DNA fragment (containing a TA site). 564 We demonstrated the utility of our model of nucleotide biases on Himar1 insertion frequencies 565 by using it to improve gene essentiality predictions via the TTN-Fitness method. One way to determine 566 the essentiality of a gene is to take the average count of insertions at all the TA sites in the gene, and 567 determine the essentiality based on a set of cutoffs (Zomer, Burghout et al. 2012). This method treats all 568 TA sites as being equivalent a priori (i.e., as independent, and identically distributed observations, with 569 equal prior probability of insertion), and does not allow for site-specific differences that can greatly affect 570 the insertion count at each site. Incorporating these surrounding nucleotides takes out (or corrects for) 571 the effect of insertion biases and focuses the analysis on true biological effects, thus increasing our 572 certainty in fitness calls for these genes. In the TTN-Fitness method, we fit a linear model to the insertion 573 counts at TA sites, incorporating the gene in which it resides and the surrounding nucleotides each site as 25 covariates. The coefficients associated with genes in the regression model reflect how much the mean 575 insertion counts in the gene deviate from the global average, after correcting for the expected insertion 576 counts at each site in the gene. For most genes, predicted fitness did not change substantially between 577 the ablative Genes-only model and the Gene+TTN model of the TTN-Fitness method. However, the 578 assessment for some notable genes did change with the inclusion of tetra-nucleotide features. PGRS13 579 was implied to be a 'Growth Defect' gene by the previous insertion count based methodology due to the 580 low insertion counts at its TA sites. However, sites in the gene are surrounded by mostly 'G's and 'C's 581 which have been determined by Himar1 preferences to suppress insertions. So, the insertions are low, 582 but are expected to be low, and thus the gene is determined to be less essential than previously predicted. 583 The Gene+TTN model used in the TTN-Fitness method has an advantage for small genes with less than or 584 equal to 3 TA sites (220 in H37Rv genome) such as Rv3461c (rpmJ), previously undetermined by 585 essentiality estimates. The model is less susceptible to noisy counts (high or low) at individual sites 586 because we can compare the observed counts at those sites to expected counts from their nucleotide 587 context, correcting for the effect of insertion biases, and thus improving the identification of conditionally 588 essential genes and genetic interactions, i.e., to better distinguish true biological fitness effects by 589 comparing the observed counts to expected counts using a site-specific model of insertion preferences. 590 This method could also be helpful for analyzing differences in essentiality of genes between different 591 strains (e.g. clinical isolates), where the TTN-Fitness model can correct for expected counts at TA sites to 592 account for differences in the surrounding nucleotides (e.g. due to the different genetic backgrounds of 593 the libraries). This high level of saturation enabled us to reliably observe the nucleotide bias of insertion 607 counts at different TA Sites. The dataset was normalized using TTR normalization in Transit (dividing by 608 the total counts in each dataset, with top 1% trimmed to mitigate influence of outliers, and scaling back 609 up so the mean count at non-zero sites is 100.0). We identified essential regions as consecutive sequences 610 of 6 or more TA sites with counts of two or less and subsequently removed them. Using the resulting 611 dataset, we were able to explain nucleotide bias at TA Sites not only for H37Rv but also for other 612 mycobacteria and non-mycobacterial Himar1 TnSeq datasets. 613 614

Significance of the Correlation of Insertion Counts between TnSeq Datasets 615
The correlation of insertion counts at TA sites between TnSeq datasets was calculated using a Pearson 616 correlation coefficient. As mentioned previously, the log of insertion counts was used, since the Pearson 617 Correlation Coefficient assumes that the input data is normally distributed. The two-tailed T-test for the 618 means of two independent samples was used to measure whether the expected value differs significantly 619 between samples. Since we do not assume population variance between the two datasets is equal, the 620 Welch's t-test is performed. 621

10-Fold Cross Validation Linear Regression 623
The data was split for 10-Fold cross validation using sklearn.model_selection.KFold. Within these 624 folds, we used sklearn.Ridge with alpha=0.1 to train and test linear models (target values of log insertion 625 counts or LFCs) with L2 regularization. 626 627 Hyper parameter Tuning the Neural Network 628 The data was separated into training and testing using a 70-30 train-test split. We used 10-Fold 629 cross validation on the 70% training split of the data to tune the number of nodes per hidden layer, 630 number of hidden layers, the activation function, value of alpha and whether to use early stopping. We 631 used scikit-learn's GridSearchCV to perform this operation and checked the accuracy of the final hyper 632 parameters set on the reserved 30% set. Afterwards, we used the tuned parameters to perform a 10-fold 633 cross validation on the whole dataset to accurately judge the model and account for data biases. to be the mean observed LFC difference for that SNP. We performed a similar calculation for the predicted 657 LFCs. Using the STLM, we found the predicted LFCs at a TA sites in the reference strain and predicted LFCs 658 at TA sites with a specific SNP in the clinical isolate. The average of the difference in these two predicted 659 LFCs was the mean change in predicted LFC. We use this formula to determine the minimum n such that P(n)<0.05, and we then label any genes with 672 n or more TA sites, all of which have insertion counts of 0, as 'Essential-B' ("ESB"). This method is a 673 necessary additional step to the Gumbel method to find smaller genes that may have been missed, 674 especially in datasets with lower saturation. In Panel A, a point is plotted for insertion counts at each coordinate for each replicate. This scatter plot 748 is then overlayed with a box-and-whisker plot reflecting the mean and range of insertion counts at each 749 site. The region includes trpG for comparison, which is an essential gene, and hence insertion counts 750 are 0 in this gene. In the non-essential genes, the insertion counts vary more between TA sites than 751 within, supporting that some TA sites have a higher propensity for insertions than others. Panel B shows 752 the same 75 sites after randomizing the insertion counts at all TA sites except those marked ES and 753 those showing the non-permissive pattern. The mean and range of counts at each non-essential TA site 754 are much more uniform when randomized. 755 756 757 and log insertion counts as the output using 10-fold cross validation. The predictive power is moderate 773 (R 2 =0.318) , meaning it is able to explain 31% of the variation in insertion counts based on surrounding 774 nucleotides. Panel B shows Coefficients from the trained Linear Model. The coordinates along the x-axis 775 give the positions relative to, but not including, the TA site. The model is trained on one-hot-encoded 776 nucleotides and a target value of log Insertion Counts. The symmetry of the pattern is visible in positions 777 -4, -3, -2, -1 and +1, +2, +3, +4. The non-permissive pattern (CG)GnTAnC(CG) is visible in this window, as 778 well as high coefficients associated with "A" and "T".   The four heatmaps are calculated in the same manner that the H37Rv heatmap was calculated in Figure  812 4. The mean of each filtered nucleotide at every position in a ±20 bp window around the TA sites is 813 calculated. The patterns of all the heatmaps look very similar to both each other and to the H37Rv 814 heatmap in Figure 4. 815 816 817 818 819 820 The four heatmaps are calculated in the same manner that the H37Rv heatmap (Fig. 4) and 822 mycobacterial heatmaps (Fig. 7) were calculated. The mean of each filtered nucleotide at every position 823 in a ±20 bp window around each TA site is calculated, and centered on the median of mean LFCs. 824 825 826 827 828 The heatmap in Panel A is calculated in the same manner that the previous heatmaps were calculated. 830 The pattern of this heatmap looks very similar to the H37Rv heatmap (Fig. 4) as well as the heatmap for 831 the M. abscessus ATCC 19977 reference strain (Fig. 8)  in both models. Genes determined to be "Uncertain" in the HMM+NP model are assigned other states in 855 the TTN-Fitness method. A fraction of genes labeled "NE" in the HMM+NP model (highlighted matrix 856 components) are reassigned to be "GA" or "GD" using the TTN-Fitness method, indicating that the TTN-857 Fitness method is more sensitive in estimating fitness than the HMM+NP model. determined by the HMM+NP model. 877 The HMM+NP methodology labels genes as "Non-Essential" (NE), "Essential" (ES/ESD), "Growth Defect" 878 (GD), "Growth Advantage" (GA) and "Uncertain". "Uncertain" genes are typically smaller genes. The correlation of log insertion counts in the 14-replicates wig files using the Pearson correlation 901 coefficient as recorded in Supplemental Table T1. to predict LFCs. 907 Panel A shows Predicted Counts vs. Actual LFC using Linear Regression. The average predictive power of 908 the linear regression model trained with one-hot-encoded nucleotides in 20 bp from the TA site as the 909 input and LFCs as the output using 10-fold cross validation. The predictive power was not much higher 910 than the previous Insertion Counts model, but the variance has decreased, indicating a better model. 911 Panel B shows coefficients from the trained model. The coordinates along the x-axis give the positions 912 relative to, but not including, the TA site. A symmetric pattern is visible in positions -4,-3,-2,-1 and +1, 913 +2, +3, +4. The non-permissiveness pattern (CG)GnTAnC(CG) is visible in this window as well as high 914 coefficients associated with "A" and "T". Coefficients of a linear model trained on one-hot-encoded TTN and a target value of LFCs.