3 ′-5 ′ crosstalk contributes to transcriptional bursting

Background Transcription in mammalian cells is a complex stochastic process involving shuttling of polymerase between genes and phase-separated liquid condensates. It occurs in bursts, which results in vastly different numbers of an mRNA species in isogenic cell populations. Several factors contributing to transcriptional bursting have been identified, usually classified as intrinsic, in other words local to single genes, or extrinsic, relating to the macroscopic state of the cell. However, some possible contributors have not been explored yet. Here, we focus on processes at the 3 ′ and 5 ′ ends of a gene that enable reinitiation of transcription upon termination. Results Using Bayesian methodology, we measure the transcriptional bursting in inducible transgenes, showing that perturbation of polymerase shuttling typically reduces burst size, increases burst frequency, and thus limits transcriptional noise. Analysis based on paired-end tag sequencing (PolII ChIA-PET) suggests that this effect is genome wide. The observed noise patterns are also reproduced by a generative model that captures major characteristics of the polymerase flux between the ends of a gene and a phase-separated compartment. Conclusions Interactions between the 3 ′ and 5 ′ ends of a gene, which facilitate polymerase recycling, are major contributors to transcriptional noise. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-020-02227-5).

We obtained flow cytometry data (from the BD LSRFortessa TM cell analyzer and BD FACSDiva TM software) for cell lines expressing the genes env and HBB, in both their wild-type (WT) and mutated (mut) versions. For each gene, experimental data were collected in four replicates (8 in total), each containing groups of observations corresponding to cells stimulated with tetracycline (Tet) at concentrations of 5, 10, 20, 40, 80, and 250 ng/mL, respectively.
Each data-set was stored in a .fcs format file and it was imported and pre-processed in R as an object of class flowFrame, which consists of an annotated data-frame class defined in the flowCore R package [1] and designed to deal with flow-cytometry data. Rows in such data frames correspond to single measurements. Each row contains the values of two fluorescence intensities that correspond to staining for mRNA and total DNA and are labeled by R640-670/14-A and UV355-450/50-A, respectively. These readings were compesated for spectral overlap with flowCore. In addition to this, the values of four scattering observations, namely FSC.H, FSC.W, SSC.H, and SSC.W, were recorded. Such observations are thought to be correlated to cell size and granularity. Values for each observation are stored in so-called "arbitrary units" (a.u.) [2].
The first task is to identify records in the data sets corresponding to either cell debris or clumps of cells, which have to be removed from subsequent analysis. We apply the robust model-based clustering approach of Ref. [3], distributed as the flow-Clust package [4], to identify cell populations in the data. Based on the scattering observations, the points were grouped into 3 clusters, and the set corresponding to single cells is the one with intermediate size and granularity, as suggested by the DNA content (see Fig. S1). For the data sets where the three detected clusters overlap, points where grouped into two clusters, instead. For this second case, inspection shows that the cluster with lower size and granularity corresponds to single cells. Standard rectangle gates were applied to remove a few outlier points whose UV355-450/50-A reads were lower than 500. Kernel density estimates (KDE) for the populations after this sub-setting are plotted in Fig. 2 (Main Text) and Fig. S2. Technical variation affects the shapes of the distributions only for some HIV replicates, with shoulders at the lower ends sometimes merging with the main mode. The cells corresponding to the data points in the shoulders exhibit normal characteristics (in terms of cell size and DNA content) and thus probably reflect cells without mRNA. The parameter estimates, reported in section S6, appear robust with regards to the absence or presence of the shoulders save for the highest µ X s.

B. Control cells
For each replicate k, we consider control cells, where the gene of interest has been deleted (see Main text). Such control cells were subjected to the same staining procedure as the others, which leaves a background of fluorescence probes that are not specifically bound to the mRNA. We argue that such background fluorescence stain are also present in the cells expressing the transgenes and contribute a term (k) i to the signal detected by the cell-analyzer channel of label R640-670/14-A for each cell i. The histograms of the signal from control cells appear skewed, as illustrated for example in Fig. S3 (left). We chose to fit the Azzalini's skew-normal distribution, that has PDF f (y|a (k) , µ (k) , σ (k) ) = 2Φ((y − µ (k) ) σ (k) a (k) ) φ(y|µ (k) , σ (k) ), (1) to such data, where Φ and φ are the standard normal CDF and normal PDF, respectively, while the mean µ (k) , the standard deviation σ (k) , and the skewness parameter a (k) are point estimates from the control data sets. The maximum likelihood estimates for each replicate are reported in Table S1 (see also Figs. S3(left)).

C. SmFISH and Nanostring barcoding data
Flow-FISH data are supplemented by microscopybased single-molecule FISH counts (which we simply refer to as smFISH) and Nanostring nCounter ® Technology bar-coding measurements. These assays are used to choose informative priors for the mean mRNA abundance and, in turn, to calibrate the flow-FISH readouts. Symbols x, s 2 x , and s x represent sample mean, sample variance, and standard error of the mean, respectively. Based on these, we chose truncated normal informative priors for the average expression level µ X ∼ N (x, s x ), with the constraint µ X > 0, for all replicates k.
HEK293 cells are not ideal for smFISH, since they tend to overlap when growing, producing dense clusters after dividing. A further problem with sm-FISH is the limited dynamic range of suitable microscopes. In fact, images tend to be overexposed when recorded during transcriptional bursts at settings that are otherwise optimal for lower transcript numbers and, conversely, optimal settings for transcriptional bursts do not cope well when transcript numbers are low. Therefore we only exploit smFISH results to infer the average expression level, and rely on flow-FISH to study the noise. SmFISH data yield the summary statistics of Table S2 for the mean expression of HBB. Nanostring data (SI Dataset 1 ) yields the summary statistics of Table S3 for the mean expression of env.

S2. PHENOMELOGICAL GENE EXPRESSION MODELS
We describe the gene expression in terms of the standard phenomenological two-state model [5]. This model assumes that the gene randomly alternates between an "on" and an "off" state, and that the mRNA is only transcribed, at rateα, during the on state. The gene switches from "off" to "on" and from "on" to "off" states after an exponentially distributed random time with mean 1/k on and 1/k off , respectively. Consequently, the transcriptional bursting is fully characterised by the ratesα,k on , andk off . In addition to this, mRNA is degraded at rated. It is convenient to express the rates in units of the inverse of the mean mRNA life-timed, i.e.,k It can be shown that the stationary probability density function (PDF) of the mRNA population x for this model is (see, e.g., Ref. [6]) where Γ is the gamma function and 1 F 1 is the confluent hyper-geometric function of the first kind. An alternative representation of the PDF (5) is where are density distributions of Poisson and beta random variables (RVs), respectively. The PDF of equation (6) encodes the following hierarchy X|α, P ∼ Poi(αP ), (9) P |k on , k off ∼ Beta(k on , k off ).
Further details can be found, e.g., in Refs. [6,7]. It is convenient to reparametrise the Poisson-beta distribution in terms of its mean to get In fact, this allows us to exploit knowledge on µ X in the form of informative priors of S1 C. The expression for the squared coefficient of variation (CV 2 ) can also be written in terms of µ X , i.e., where the second term on the r.h.s. quantifies the overdispersion of X with respect to a Poisson random variable. Such a functional relation between CV 2 X and µ X has been encountered in gene expression data [8][9][10][11][12]. The probability P cannot be directly accessed and therefore is a latent (hidden) variable for the model. In fact, for our data, the mRNA number is a latent variable too, being only inferred from the measured fluorescence signals. This can be encoded into a measurement equation, as explained in the next section.
In the limit as k off → ∞, α → ∞, with the ratio α/k off held fixed, the population mean and CV 2 satisfy respectively, while the distribution of X approaches the negative binomial distribution with PDF This can be easily proven using the Poisson-gamma mixture formulation of the negative binomial RV X, i.e., In fact, the beta distribution scaled by α > 0 approaches the gamma distribution as k off → ∞, α → ∞, i.e., 1 α which follows from known asymptotic relations The ratio α/k off has a simple interpretation, being the expected number of transcription events during an on phase. In Ref. [13] this ratio has been referred to as the "expected burst size". In the limit as k on → ∞ with the mean expression µ X held fixed, the negative binomial distribution (17) approaches the distribution

S3. MEASUREMENT AND TECHNICAL ERROR MODEL
The DB FACSDiva TM software manual [14] specifies that the light intensity from fluorescent dyes is amplified linearly within a wide range (see also, e.g., Refs. [9,15]). Based on this, we assume that the measured fluorescence Y i of cell i is proportional to the true mRNA abundance X i and therefore can be expressed as in the following "measurement" equation, where k indexes the replicate, κ can be thought of as a scale and i is the zero of such a scale, also corresponding to the background of unspecific staining and auto-fluorescence of the ith cell [7]. The dispersion of biological data is typically due to both technical errors, caused by the measurement process, and the variability intrinsic to the underlying biology. In our measurement model, for the variables X   Informative priors for (k) i are chosen according to section S1 B, i.e., where the parameters a (k) , µ (k) , and σ (k) are estimated from the control cells at 250 ng/mL Tet. The standard errors of the maximum likelihood estimates are neglected. As a consequence, all the single-cell measurements can be thought of as being subjected to the same random background, thus mitigating tractability issues. For a more comprehensive fully-Bayesian hierarchical approach see Ref. [7].
To pin down informative priors for κ (k) , we perform gamma regression. For each flow-FISH data-set, 500 random cell readings are selected for the main Monte Carlo estimation of section S4. The remaining reads are used as response variables for a gamma regression with identity link. Covariates are mean expression level point estimates from section S1 C. As an example, this is illustrated in Fig. S4 for k = 3, WT HBB gene. The GLM estimates of the expected values µ κ along with the standard errors s κ are reported in Table S4. Our prior choice is the truncated normal RV with the constraint κ > 1, where the mean µ κ and standard deviation s κ are obtained from the 16 values in Table S4 according to the laws of total expectation and variance, respectively, i.e., where the bar notation represents averages. For the remaining parameters we assume which is a classical choice for vague priors with positive support [17].
Since the replicates are independent, the likelihood of the parameters of the Poisson-beta model, for a data-set of N measurements y  where θ (k) := (κ, a (k) , µ (k) , σ (k) ) is the vector of the parameters that describe the experimental setting. This completes the definition of the first Bayesian model for the observed data. The directed acyclic graph (DAG) of the full posterior of this model is illustrated in Fig. S5(A).
Consistently, the likelihood of the parameters of the negative-binomial model is as illustrated in Fig. S5(B). The simplest Poisson model, the likelihood is whose DAG is illustrated in Fig. S5(C).

A. MCMC samplers
Adaptive Metropolis-Hastings samplers to fit the model to the data where implemented using the PyMC library for probabilistic programming [18], version 2.3.7, which has a flexible object oriented syntax and provides tools to handle long traces and perform diagnostics. To improve the convergence speed, the array containing the N elements of a dataset was numerically sorted by value and split into M = N/10 blocks of size 10. The latent random-variable arrays P 1:N and X 1:N were batched too, as the RVs conditioned on the data of a single block are strongly correlated and are conveniently updated during single Metropolis-Hastings steps. Using the symbol 1 x for the identity matrix and N x to represent a multivariate normal RV of dimension x, the simulation of posterior samples for the Poisson-beta model proceeds as follows: • For i = 0, 1, . . . , M , values of the 10-value blocks P (i 10+1):(i+1)10 and (X i 10+1):(i+1)10 are updated according to a random walk Metropolis with proposals N 10 (µ • The RVs µ X , κ, k on , and k off are updated as a single block according to the adaptive Metropolis-Hastings method of Ref. [19] with proposal N 4 (µ, Σ).
The proposal parameters µ i , (i = 1, . . . , M ), µ, and Σ are chosen adaptively. To improve the adaptation (noting that the posterior for k off is more disperse than those of µ X , κ, and k on ), Σ is initialised to the diagonal matrix diag(0.1, 0.1, 0.1, 1).
In order to mitigate tractability issues (which is mainly due to the large number of presence of latent variables), the model is only fitted to a randomly sampled subset of N = 500 data points. For a more modern approach to cope with latent variables see Ref. [7], which also defines a more complex hierarchical model.
The sampler implemented for the negativebinomial model is similar to the one implemented for the Poisson-beta model (with data organised into M ranked batches) but converges and mixes more rapidly, as it does not encode for the latent variables P i , i = 1, 2, . . . , N . The simulation proceeds as follows: • For i = 0, 1, . . . , M , values of X (i 10+1):(i+1) 10 are updated according to a random-walk • The random variables µ X , κ and k on are updated simultaneously according to the adaptive Metropolis-Hastings method with proposal N 3 (µ, Σ), . . , M ), µ, and Σ are defined as in the former case. The Poisson-model sampler is analogous to the negative binomial, except that it does not include the parameter k on . All the samplers were successfully tested with simulated data.

B. Consensus posteriors
The posterior p(ϑ|y (1) 1:N , y (2) 1:N y represents the consensus belief on the vector of parameters ϑ among all the replicates k = 1, 2, 3, 4, with p(ϑ) being the prior. We approximate such a posterior by means of a consensus Monte Carlo approach, i.e., by running a separate MCMC on each of the datasets y (k) 1:N , k = 1, 2, 3, 4, and then averaging individual Monte Carlo draws across, as in Ref. [20]. The draws ϑ (k) j , j = 1, . . . , L, k = 1, 2, 3, 4, are combined into the weighted averages where w (k) is the vector of the reciprocal of the marginal posterior variances. This method has been only justified rigorously for Gaussian posteriors and only yields approximate posteriors, in general. However, it allowed us to distribute the Bayesian analysis across different machines , and therefore was of great utility to aggregate results.

C. Goodness of fit
To evaluate the goodness of fit (GoF) we estimate the posterior predictive distribution where θ is the vector of all parameters, by generating pseudo-dataỹ Comparison between the RMSD results for the Poisson-beta and the negative-binomial models is shown in Fig. S7, which suggests that both these models fit the data equally well. Conversely, the RMSDs for the Poisson model are always higher that the RMSDs for the two former models (see Fig. S7) implying that the Poisson model does not fit as well. Further, we computed the Wasserstein distance in distribution between the data and the pseudo-data. The Wasserstein distance between two distributions u and v is defined as where U and V are the empirical cumulative distribution functions associated to and u and v, respectively. Fig. S7 shows that the distances are always smaller than the 95% percentile of bootstrapsamples distances from the true data, thus confirming GoF. Bin sizes for the empirical distribution were chosen according to the Freedman-Diaconis rule.

S5. MRNA DECAY RATES
The draws from the posteriors of the dimensionless rates k on , k off , and α are converted to number of events per minutek on ,k off , andα by using estimated decay rates of mRNA. For the HBB gene, decay rates were measured in Ref. [22] (and are reported in Table S5). Due to the detection of two different mRNA isoforms, viz., "rt" and "pA", the empirical mRNA distribution can be thought of a Gaussian mixture density with PDF (39) p rt +p rt = 1, with parameters of Table S5. According to equations (2)-(4), the traces from k on , k off , and α were multiplied by draws from this Gaussian mixture to obtaink on ,k off , andα.
We measured total env RNA content (including non-poly-adenylated transcripts) with RT-qPCR, using gene-specific primers (forward primers bind exon 1 and reverse primers bind the 3'UTR, see section S10 C). The decay rates of env transcripts were obtained by fitting a linear model to the logarithm of RT-qPCR measurements of transcripts vs time in GoF analysis. (left and centre) RMSDs of the data with respect to the sample means of the (posterior predictive (see eq. (37)) for each dataset. Comparison of Poisson-beta model vs the negative-binomial model (left scatter plot) shows that the two models achieve similar GoF, while the RMSDs obtained from the Poisson model are always the largest (central scatter plot). (right) Wasserstein distances between the empirical histogram of the data and the negative-binomial model posterior predictive (x-axis) is always smaller that the 95% quantiles of the bootstrapped distances from the true data, which suggests that that the posterior-predictive samples for the negativebinomial model always reproduce the true flow-FISH data (as in, e.g., Fig. S6). TABLE S5. . Decay rates in number of events per minutes of the two mRNA isoforms for the HBB gene. "pA" and "rt" refer to polyadenylated and read-through isoforms, respectively (it is possible to foresee larger dispersion for the mutant than for the WT).

S6. SUMMARY OF MCMC RESULTS
A. Poisson-beta distribution Fig. S9 shows two results obtained from fitting the Poisson-beta model. The traces of the posterior chains from each replicate were combined according to the consensus Monte Carlo procedure (see section S4 B) to obtain a representation of the consensus belief of Fig. 3 (Main text). It is worth noting that the credible intervals fork off andα are very wide, while the MCMC draws of k off and α appear strongly cross-correlated (see, e.g., Fig. S10), where the drawn samples form an angle arccot(α/k off ) with the abscissae axis.
The 90% highest posterior density credible intervals (HPD CIs) and medians of the estimated parameters κ, µ X , k on , k off , and α/k off are reported in Tables S6-S10 (HBB gene) and Tables S11-S15 (HIV gene).

B. Negative binomial distribution
In contrast to the Poisson-beta model, the negative-binomial model directly encodes the ratio α/k off as a single parameter, which is inferred with rather narrow credible intervals. Parsimony suggests that the negative-binomial model is a reasonable choice for the genes considered here, as it encodes for the most relevant kinetic parameters, viz., the average burst size α/k off and the burst frequency k on . Fig. S11 shows the parameters estimated from the negative-binomial model. The traces of the posteriors chains from each replicates were combined according to the consensus Monte Carlo procedure (see section S4 B) to obtain a representation of the consensus belief in Fig. S12.
The 90% HPD CIs and medians of the estimated parameters κ, µ X , k on , and α/k off are reported in Tables S16-S19 (HBB gene) and Tables S20-S23 (HIV gene).

C. Poisson distribution
The Poisson model encodes only one biological parameter, viz., the average gene expression level µ X . We fitted this model to data from one of the replicates as a benchmark. The 90% HPD CIs and medians of the estimated parameters κ and µ X , are reported in Tables S24-S25 (HBB gene) and Tables S26-S27 (HIV gene). It is worth noting that, compared to the prior derived in section S3 and both the estimates from the Poisson-beta and negativebinomial models of Tables S6, S11, S16, and S20, the κ is overestimated. In fact, high values of κ compensate for the small dispersion encoded in a Poisson random variable. Jointly with the fact that the Poisson model shows lower GoF than the two general models (subsection S4 C and figure S7), we conclude that the expression of the genes HIV and HBB is relative to a Poisson random variable and a flexible gene expression model for X (k) i , such as the Poisson-beta or the negative-binomial models, is necessary to exploit the measurement equation (24).

S7. CELL CYCLE
Staining for DNA concentration allows us to heuristically find cells that are in G1, S, and G2 phases of the cell cycle, see Figs. S1(right) and S13. We considered the dataset with cells treated at concentration of 40 ng/mL of Tet. We separated the data points corresponding to the G1 phase from those from S and G2 using flowClust [4]. The less dense cluster S-G2 was further separated in two groups (corresponding to the phases S and G2) running the same algorithm again. Results are shown in Fig. S13 for HBB cell line, k = 3.
Data from phase G1, S, and G2 are referred to as y (k) G1 , y (k) S , and y (k) G2 , respectively, for each replicate k. We refer to their averages (sample standard deviations of mean) asȳ G1 ,ȳ S , andȳ G2 , (sȳ G1 , sȳ S , and sȳ G2 ) respectively. To take into account that the mean gene expression seems to change with the cell phase, we introduce the conversion factors c (k) = x/ȳ (k) to obtainx i = G1, S, G2. Equations (40) take the place of µ X in the DAG of Fig. S5(B) for the G1, S, and G2 phase, respectively, for each replicate k. Fitting the negative-binomial model to 500 samples from each dataset yields the consensus estimates of Fig. 4(C-D) (Main text). In addition to this, we also subset reads form each phase into three groups by their size (based on the values of their FSC-A fluorescense signal, see Main text). We assume that the cell cycle and the cell size are extrinsic contributors to the total transgene mRNA variability, with the remaining variability sources thought of as being intrinsic. As in Ref. [8], using the symbol · I for the average over the intrinsic variables, with the cell phase held fixed, and · E for the average over the different cell phases, the law of total variance allows us to write, for the mRNA abundance X, where the first term on the r.h.s. is the intrinsic noise, while the second term is the extrinsic noise. Computing the two terms gives the intrinsic and extrinsic noise levels of Tables S30 and Fig. 4(D) (Main text), which show that the cell cycle and the cell size always contributed only a minor term to the total noise.

S8. POLII-MEDIATED 3'-5' INTERACTIONS BY CHIA-PET
We considered scRNAseq and ChIA-PET data accessible at the GEO Series numbers GSE124682 [23] and GSE33664 [24], respectively. Raw chromatin contact frequency is highest for small genomic distances. In order to normalise for this and expose deviations from this general relationship, we need to divide by the expected number of reads at a given genomic distance. We calculate this by random sampling 10000 genomic intervals and measuring the contact frequency over this sample. This estimate appears robust to decreasing the number of sampled intervals. The relation between gene length and the normalised 3'-5' interaction score is illustrated in Fig. S15. Genes with length smaller than the resolution of the interaction matrices are discarded from the main analysis. We fitted the model of equations (54)-(56) to the smRNAseq data, thus enabling a convenient classification of genes based on transcription; see also, e.g., reference [25]. Without correction for incomplete capture of mRNA, the parameterk on incorporates here both biological and technical above-Poisson noise, whilst allowing the ranking of the genes based on their total noise.

S9. MICROSCOPIC GENE EXPRESSION MODEL
We referred to the models of section S2 as the phenomelogical models. In fact, our main concern there was to exploit a minimal description of the statistics of the transcription events and the stationary mRNA distribution-which is the (observed) phenomenology, indeed. Due to their simplicity, these models allowed us to attain the important goal of separating the technical noise (due to background fluorescence and measurement process) from the biological noise encoded into X i .
Nevertheless, there are specific microscopic biological mechanisms, more difficult to observe, that may give rise to the observed phenomena. In our tetracycline-inducible genes, Tet repressor (TetR) homodimers bind to the operator TetO 2 downstream of the transcription start site (TSS). When such a binding event occurs, the transcription is inhibited as the elongation is impeded. Adding Tet in turn alters the conformation of TetR and hinders the binding events, having the net effect of inducing the gene expression. Crucially, during the "on" phase, the transcription rate is proportional to the abundance of PolII (law of mass action), which can be thought of as waiting in a compartment upstream of the TSS [26]. Therefore, when the gene is actively transcribing, its rate can vary in time according to the amount of PolII ready to initiate transcription. After transcription, PolII can either be re-injected into the compartment and set ready for a new initiation event (PolII recycling), or disposed into the nuclear environment. Also, the compartment recruits PolII from the nuclear environment. This can be described by means of the following chemical reaction scheme: where DNA on and DNA off are unlocked and locked DNA configurations, respectively. The presence of the 3'-5' crosstalk loop is thought to facilitate the recycling of PolII after each transcription event; therefore we can study the effect of the recycling on the simulated expression data by tuning l in the reac-tion scheme. Obviously, the pA mutation lowers the recycling probability l with respect to the WT, but l is not supposed to be zero in mutant genes, as the recycling can occur by means of other mechanisms (e.g., diffusion). By the law of mass action where K λ is a chemical affinity and n is the concentration of TetR. Hence, we can imitate variations in the Tet dose by fine-tuning n, with large values of Tet (high induction levels) corresponding to small values of n. Unlike the simpler phenomenological models, we do not have an analytical likelihood for this model, thus parameter inference is more challenging, to be addressed with likelihood-free methods. We simulate the model using the Doob-Gillespie algorithm; sample trajectories of mRNA abundances are plotted in Fig. 5 D (Main text). When all the chemical species are highly abundant and the gene is always in "on" state (this can be achieved in the limit as k off → 0), it is straightforward to derive the following rate equations, where [X] is the abundance of the species X. The stationary mRNA abundance is then which corresponds to the vertical lines of Fig. 4

B (Main text).
While the parameters γ, β, d, δ, K λ are chosen to simulate mRNA abundances and noises in ranges consistent with those of the real data, fine-tuning the recycling probability and the induction parameters l and n yields patterns similar to those observed in the experimental setting (i.e., those of Figs. S9, S11, and 2 (Main text)). More specifically, a simple scatter plot of the sample averages versus the CV 2 of [mRNA] shows a drift of the noise curve from the Poisson case CV 2 X = 1/µ X as the recycling rate l increases. Fitting a negative binomial (NB) Bayesian model mRNA ∼ NB(µ X , k on ), (54) µ X ∼ Gamma(0.001, 0.001), (55) k on ∼ Gamma(0.001, 0.001), to 500 simulated stationary mRNA abundances, allowed us to estimate the average burst size α/k off = µ X /k on and the burst frequency k on shown in Figs. S16 and 5 C (Main text). The wildtype HBB and HIV-1-env cell lines have been utilized in previous studies [22,27,28]. The nomenclature was changed for the present manuscript, with the cell lines denoted HBB WT, HBB mut, HIV WT and HIV mut, which had been denoted β pA+, β pA−, HIV-1 pA+ and HIV-1 pA− in [22], respectively. Cells were maintained in DMEM medium supplemented with 10% fetal bovine serum and 100 µg mL −1 penicillinstreptomycin (DMEM-10). Induction of cell lines was carried out for 16 hours before downstream experiments. Deletion cell line construction. The design and construction of the deletion cell line used the protocol detailed in [29], with the following changes. A dual sgRNA strategy was employed with a 5' guide binding between the AmpR promoter and CMV enhancer and a 3' guide binding just after the 3' FRT site. The use of plasmid pSpCas9(BB)-2A-GFP (PX458) and dual targeting necessitated the transfection with two plasmids, each containing a respective guide. Transfection was carried out using calcium phosphate, followed by washing the cells with warm PBS after 16-24 hours and replacing the me-dia. Cells were allowed to recover for 48-72 hours before single-cells were isolated in 96-well plates via FACS (BD ARIAFusion), with the brightest 10% GFP positive cells being sorted. Testing for deletion was initially verified via genomic DNA extraction and PCR, followed by smFISH assay using flow cytometry.

B. Single-molecule RNA fluorescence in situ hybridization
Probe sets. Probe sets for HBB and HIV-1-env RNA were designed with the tool at www.biosearchtech.com/stellarisdesigner (see Table S31 for sequences). The probe sets were synthesized by LGC Biosearch Technologies as custom Stellaris ® probe sets. AKT1 probes were readymade and ordered from LGC Biosearch. smFISH. smFISH staining followed the probe manufacturer's protocol. Briefly, cells were grown on poly-L-lysine treated glass coverslips overnight, fixed in 3.7% formaldehyde for 10 minutes and permeabilized in ethanol for > 1 hour. After overnight staining at 37°C in dextran sulphate and formamide buffered with SSC, cells were washed, followed by mounting onto a slide using Vectashield with DAPI as the mounting medium. Imageing was carried out DNA staining was carried out with FxCycle TM Violet Stain (ThermoFisher, F10347) at a concentration of 1 µg mL −1 . Fixed cells were analysed on a BD Fortessa. Processing and data analysis of raw flow cytometry data was carried out using the flowcore R package [1] (v1.48, R version 3.3). SmFISH spot counting. Quantification of RNA was carried out using FISH-quant [30]. Images were imported with the following settings: XY 64.8 nm; Z 200 nm; Refractive index 1.515; NA 1.40; Em 592; Ex 546; Microscope widefield. Cell outlines were drawn manually. A single image was then processed and the settings used to batch-process the remaining set. The threshold and quality score parameters of FISH-quant were set to quantify as many spots as possible while reducing spurious detection through batch-specific selection of these parameters.
C. RNA isolation and preparation, and degradation rate estimation Total RNA was extracted from the respective cell lines following the RNeasy Mini Kit (Qiagen, 47104) protocol, using QIAshredder (Qiagen, 79654). RNA for RNA-seq analysis was treated with TURBO DNA-free TM kit (ThermoFisher, AM1907).
To estimate the mRNA degradation rate, RNA was reverse transcribed using random primers (Promega, C118A) and M-MLV reverse transcriptase (Promega, M170A) followed by qPCR using SensiMix TM SYBR ® No-Rox (Bioline, QT650-02) on a Qiagen Rotor-Gene Q. Gene-specific primers were used (HIV Forward TCTCCTACGGCAGGAAGAAG; HIV Reverse GGTAGCTGAAGAGGCACAGG). Analysis was carried out by calculating the CT values using the qpcR R package [31] (v1.4-1) and from this 2 −∆∆Ct were calculated using the mut time 0 concentration as the reference sample. A degradation time series was carried out by standard induction method at 250 ng mL −1 tetracycline for 16 hours, followed by removal of media and washing with warm DMEM-10. Cells were then placed in fresh medium and samples were taken at different time points following on from this.

D. Nanostring
Cells were seeded, induced and processed as indicated previously (subsection S10 A), with the following alteration: after trypsinisation cells were resuspended in 1 mL of PBS and kept on ice. Counting of cells was carried out via Countess (Ther-moFisher) cell counter with 100 µL (50 : 50) PBS to trypan blue. Samples were spun down at 500 g for 5 minutes and were then resuspended in RLT buffer from RNeasy Mini Kit (Qiagen, 47104) with beta-mercaptoethanol to obtain a concentration of 6500 cells per µL. Samples were then vortexed for 1 minute and placed at −80°C. Cell lysis was verified under a microscope. Samples were shipped on dry ice to an external provider for processing. Custom probe sets, including probes targeting HIV-1env along with GAPDH and AKT1 as house-keeping genes, were designed and shipped by NanoString Technologies.

E. RNA-seq
Library preparation. RNA-seq libraries were prepared using 500 ng of total input RNA and the NEBNext ® UltraTM II Directional RNA Library Prep Kit for Illumina (E7760L), along with the NEBNext ® rRNA Depletion Kit (E6310L) and NEBNext ® Multiplex Oligos for Illumina Set 1 and 2 (E7335, E7500). Ribo-depletion was carried out to capture transgene RNA regardless of the absence or presence of a poly(A) tail. The manufacturer's manual was followed, with the final PCR amplification using 9 cycles. Libraries were assessed via Bioanalyser, diluted and mixed before being sequenced on an Illumina ® NextSeq 500, generating paired-end reads with read length 42. RNA-seq analysis. Data quality control was performed with FastQC v0.11.5. Read and adapter trimming was carried out using TrimGalore! v0.4.3 with cutadapt v0.4.3 using default settings [32]. Indices for STAR to map to were constructed from the human genome (GRCh38.p12, Gencode primary annotation) and the respective (HBB WT/mut and HIV-1-env WT/mut) transgenic sequence.
Fold changes for the HBB and HIV genes were calculated using DESeq2 v1.22.1 [38] from Bioconductor release 3.8 and R v3.5.1. Splicing analysis. To analyse potential alternative splicing in the env transgene, BAM files were imported into R (v4.0.2) and analysed using SGSeq (v1.4) [39]. A table of counts relating to potential splice variants (Table S32) indicates that there are potential alternative splicing events, however the number of reads related to each variant indicates that they are present across all samples and appear related to the overall abundance of mRNA. To further analyse the difference between introns and exons in the HIV-1 env wild-type and mutant cell lines, the Transcripts Per Kilobase Million (TPM), a normalisation that takes account of both sequence depth and gene length, were calculated (Fig. S17). BAM files were imported into R (v4.0.2) and analysed using Rsubread (v2.2.6) [40], calling FeatureCounts with the following parameters: minOverlap=20, isPairedEnd=TRUE, strandSpecific=2. We do see a slightly higher fraction of intronic reads present in the env mutant at 250 ng mL −1 Tet, although the principal difference between the mutant and wildtype appears to be overall mRNA abundance. In addition to this, ref. [22] quantified the levels of HBB pre-mRNA relative to the total HBB RNA and determined that ratios are the same within the first 2 hours of induction (splicing was slightly effected after 24 hours, albeit this phenotype appeared to arise subsequent to RNAPII depletion).   2  28  21  14  233  164  223  42  46  44  404  434  303  3  28  24  27  46  51  58  31  27  45  33  28