Identification and prediction of developmental enhancers in sea urchin embryos

Background The transcription of developmental regulatory genes is often controlled by multiple cis-regulatory elements. The identification and functional characterization of distal regulatory elements remains challenging, even in tractable model organisms like sea urchins. Results We evaluate the use of chromatin accessibility, transcription and RNA Polymerase II for their ability to predict enhancer activity of genomic regions in sea urchin embryos. ATAC-seq, PRO-seq, and Pol II ChIP-seq from early and late blastula embryos are manually contrasted with experimental cis-regulatory analyses available in sea urchin embryos, with particular attention to common developmental regulatory elements known to have enhancer and silencer functions differentially deployed among embryonic territories. Using the three functional genomic data types, machine learning models are trained and tested to classify and quantitatively predict the enhancer activity of several hundred genomic regions previously validated with reporter constructs in vivo. Conclusions Overall, chromatin accessibility and transcription have substantial power for predicting enhancer activity. For promoter-overlapping cis-regulatory elements in particular, the distribution of Pol II is the best predictor of enhancer activity in blastula embryos. Furthermore, ATAC- and PRO-seq predictive value is stage dependent for the promoter-overlapping subset. This suggests that the sequence of regulatory mechanisms leading to transcriptional activation have distinct relevance at different levels of the developmental gene regulatory hierarchy deployed during embryogenesis. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07936-0.


9
Similar to SpHox11/13b module ME, distal regulatory module Intron-D of onecut integrates enhancer 188 and silencer functions that are necessary and sufficient to recapitulate the expression of this 189 transcription factor (36). Likewise, ATAC-seq peak calls underscore Intron-D in both stages, but only 20 190 hour dREG peak calls highlight Intron-D, coincident with an augment of pause at the onecut promoter 191 (Fig. S2). Thus, both in SpHox11/13b ME and onecut Intron-D PRO-seq signals may correspond to a 192 blend of transcriptional activation and silencing functions in different embryo regions. Only ATAC-seq 193 peaks intersect onecut Intron-C module, which is inactive in 20 hour embryos (Fig. S2). Thus, similar to 194 module L of SpHox11/13b, the ATAC-seq peak call does not correspond with enhancer activity. 195 There are far more ATAC-seq peak calls than dREG predictions at both loci ( Fig. 3 and Fig. S2), along with 196 the general genomic trend (Fig. 2. E). Both loci where scanned by a comprehensive reporter tiling 197 scheme to test the entire regions for enhancer activity in an unbiased manner (18,36). Remarkably,198 dREG TRE predictions correspond closely with regulatory elements experimentally mapped to their 199 minimum range ( Fig. 3 and Fig. S2). However, most dREG predictions do not match CRM enhancer 200 reporter activity ( Fig. 3 and Fig. S2), and even a higher proportion of ATAC peaks do not correspond with 201 enhancer-active CRMs. The distribution of Pol II ChIP peaks is even broader, particularly at introns, 202 which are prone to contain TREs ( Fig. 3 and Fig. S2), and, therefore, Pol II ChIP signal seems poorly 203 suited for TRE predictions on its own. Manual analysis of other experimentally characterized 204 transcriptional regulatory elements (37-39) generally confirms the trends outlined above (Fig. S2). 205 Additional regions of the genome can be analyzed with the data sets deposited at the NCBI Gene 206 Expression Omnibus (40). In summary, our manual analysis suggest that accessibility may have a looser 207 correspondence with enhancer activity. In addition, although increased pause does not necessarily 208 correspond with silencing, as it may be associated with increased release and elongation, the dual 209 report of transcriptional elongation and pause by PRO-seq may correspond to enhancer and silencer 210 activities differentially deployed in space, which is very common among early embryo regional 211 specification TREs (4). 212

Prediction of enhancer activity from chromatin accessibility and Pol II 213
We decided to systematically test if machine learning models using chromatin accessibility, Pol II 214 distribution and transcription initiation could predict the previously quantified enhancer activity of 389 215 CRMs primarily selected for their evolutionarily sequence conservation (27). We first tested a subset of 216 reporters quantified at high temporal resolution using nanoString technology (26), but the very small 217 number of inactive reporters was unsuitable for model training (results not shown). We therefore 218 settled for a previous data set that measured reporter enhancer activity by qPCR and included 12 and 24 219 hour time points (27). Although this generated a mismatch with the 20 hour stage examined in our 220 genomic data, no major regulatory transitions have been identified for the majority of the genes 221 involved during this 4 hour period (41). The average size of the 389 CRMs (2,839 bp) is about one order 222 of magnitude larger than the average of ATAC peaks (316 bp) Pol II peaks (338 bp), and dREG TRE 223 predictions (373 bp) (Fig. 4 A). Thus, in order to reduce confounding inputs to the CRMS that may not 224 relate to TRE function, such as background transcription at introns, or background accessibility along the 225 CRM span, the ATAC-, Pol II ChIP-and PRO-seq signals were computed at CRM regions overlapping peak 226 calls and dREG predictions. CRMs were defined as active if they drove reporter expression twice above 227 the basal promoter ( Fig. 4 B and C). CRM activity cannot be explained by CRM size because there is no 228 significant size difference between active and inactive enhancers (Fig. 4 A, inset, Wilconox p-value = 229 0.11). However, active CRMs in 12 and 24 hour embryos have significatively higher PRO-, ATAC-, and Pol 230 II ChIP-seq signals (Fig. 4 D, p-values < 1.8 e-06). 231 Logistic regression classifiers trained and tested by 5 fold cross-validation repeated 50 times resulted in 232 predictions significatively above random guess in both embryo stages (Fig. 4 E and F). The performance 233 11 of models using ATAC, dREG (PRO-seq reads at TRE dREG predictions), their combination (ATAC + dREG), 234 and Pol II ChIP was slightly higher for ATAC + dREG models when evaluated by the Area Under the 235 Receiver Operating Characteristic (AUROC) plot ( Fig. 4 E and F), which graphs the relation between true 236 and false positive rates at different model prediction thresholds. However, the CRM expression data set 237 is highly unbalanced, with about 10 times more CRMs reporting inactive than active enhancer activity 238 ( Fig. 4 B and C), and, in these cases, Precision-Recall Curves (PRCs), which plot precision values along the 239 range of true positive rates, provide a better discrimination metric for classifier evaluation (42). 240 When AUPRCs are used for model evaluation, more distinct model performances are obtained, 241 particularly for the 20 hour data sets ( Fig. 4 F, bottom). Individually, all assays perform much better 242 than chance in both stages ( Fig. 4 E and F). The combination of ATAC and dREG predictors may slightly 243 improve performance at some recall values (Fig. 4 E and F, bottom), and Pol II ChIP signal does not 244 facilitate better enhancer activity predictions alone (Fig. 4 F) or in combination with other data sets 245 (results not shown). Incorporation of other parameters such as peak summit value, did not improve any 246 predictive models, as illustrated for dREG (Fig. 4 H). As expected, lower model performance was also 247 obtained when the functional genomic data were computed along the entire CRM instead of restricting 248 the signal input to peak call windows (not shown). Optimization of other machine learning methods, 249 such as random forest and support vector machine, did not improve classifier performance over logistic 250 regression (not shown), likely reflecting the small size of the available data. In short, total ATAC-seq and 251 PRO-seq signals at dREG peaks are the best predictors of active enhancer activity among the profiles 252 tested in this study. 253 Interestingly, about half of the CRMs that overlap promoters are active in the reporter assays, indicating 254 a high degree of enhancer activity from promoter-adjacent DNA in sea urchin embryos. Nearly all these 255 promoter-overlapping CRMs were previously shown to be active in both orientations (27), 256 demonstrating bona fide enhancer activity. The sizes of promoter-overlapping CRMs (Fig. 4 A) suffice to 257 12 include both distal and proximal TREs, including promoters. ATAC and dREG models trained with the 258 entire data set (Fig. 4 F) underperformed relative to Pol II ChIP based models in the prediction of 259 enhancer activity of the promoter-overlapping CRM subset (Fig. 4 G, top). In the complementary 260 analysis, the Pol II ChIP model trained with the entire set further underperformed relative to ATAC and 261 dREG models in the prediction of CRMs not overlapping promoters, while ATAC and dREG maintained 262 performance similar to predictions with the entire set ( Fig. 4 G, bottom). The exclusion of the 41 263 promoter-overlapping CRMs from the training and testing data set decreased the prediction 264 performance of all models in both stages ( Fig. S3 A and B). Overall performance was broadly similar 265 between ATAC and dREG models trained and tested with the promoter overlapping or non-overlapping 266 ( Fig. S3 A and C). In contrast, ATAC and dREG models trained with CRM-overlapping promoters failed to 267 predict the activity of their hold out set and where outperformed by Pol II ChIP models in the 20 hour 268 data set (Fig. S3 D). All of the above, suggests that distinct functional marks associate with the enhancer 269 activity of promoter proximal and distal TREs, and that the enhancer activity predictive power of ATAC-270 seq and PRO-seq for promoter-proximal CRMs dramatically devalues during the 12 to 20 hour transition. 271 The larger proportion of positive enhancers among CRMs that overlap promoters relative to CRMs not 272 overlapping promoters, ~ 50 % vs. ~13% in 20 hour embryos, is not surprising given the bias for 273 regulatory genes active during development of this data set (27) combined with the general trend of 274 enhancers to be near their promoter targets (43). The enhancer activity of promoters has precedents 275 (44,45) and it is perhaps not surprising for evolutionary reasons (2). 276 We tested if the functional genomic datasets could predict the levels of reporter enhancer activity of 277 CRMs. In all cases, better linear regression model prediction was obtained with non-promoter 278 overlapping CRM sets. The best performing model included the ATAC-seq plus the ATAC:dREG 279 interaction, which explained about one third of the expression variation (average R 2 = 0.29) in 20 hour 280 embryos (Fig. 5). ATAC-seq was a better predictor or enhancer activity in 20 hour embryos relative to 12 281 13 hour embryos (R 2 = 0.26 versus R 2 = 0.17, p-value < 2.2 e-16), and outperformed dREG in 20 hour 282 embryos (R 2 = 0.17, p-value = 7.4 e-16) (Fig. S4). The difference in dREG model performances between 283 stages or with ATAC-seq models in 12 hour embryos was not significant due to the small size of the 284 dataset. Nevertheless, the relative enhancer predictive power of ATAC-seq and PRO-seq is stage-285 dependent. The predicted value of most CRMs generally varies with the training set, as expected. 286 However, there is a group of CRMs that are consistently and erroneously predicted as barely active ( reporter expression levels. The mismatch between the size of TRE peak calls and CRMs tested is less 299 than ideal. In most high-throughput reporter assays (11), the regulatory regions tested are usually 300 smaller than the few hundred base pairs of common TREs, which represents one of the several 301 limitations of enhancer activity evaluation by reporter constructs (11). In contrast, the genomic regions 302 tested in our data set are large (Fig. 4 A) and often contain several ATAC and dREG peak calls, whose 303 signals would possibly better match enhancer activity if individually tested. In addition, enhancer 304 14 activity is tested with a heterologous promoter, allowing for the mismatch between functional genomic 305 assays and reporter activity due to enhancer-promoter specificity (46). 306 307 Despite the tight match between dREG TRE predictions and CRMs experimentally narrowed down to the 308 smallest functional regulatory elements in a generally unbiased manner (Fig. 3), our results reveal that 309 PRO-seq has similar predictive power as ATAC-seq. Perhaps this results from the dual report of 310 transcription and pause by PRO-seq. In addition, enhancer and silencer activities in different territories 311 are common for developmental regulatory elements (4,47), as previously discussed in the context of 312 SpHox11/13b and onecut ( Fig. 3 and Fig. S2). Alternatively, the small set of positive enhancers 313 measured by reporter assays could result in having insufficient statistical power. Nevertheless, despite 314 all these caveats, ATAC-seq and PRO-seq alone suffice to explain between one quarter and one fifth of 315 the reporter enhancer activity in 20 hour embryos ( Fig. 5 and Fig. S4). It is reasonable to expect even 316 better performance in single cell assays exclusively testing the genomic regions highlighted by ATAC-317 and PRO-seq profiles. 318 Our results confirm and extend reports of distinct enhancer prediction performance for promoter-319 proximal regulatory elements previously obtained with a distinct set of functional genomic profiles (14).

Conclusions 343
In summary, ATAC-and PRO-seq are efficient predictors of reporter enhancer activity of distal CRMs in 344 sea urchin embryos, while the prediction of promoter-overlapping CRMs is stage-dependent. In late 345 blastula embryos, Pol II enrichment is the best predictor of promoter-proximal CRM enhancer activity. 346 There is a net increase in dREG TRE predictions during later embryonic stages, while accessibility peaks 347 remain relatively constant. In combination, this suggests that the sequence of regulatory events leading 348 to developmental TRE enhancer activity has different relevance at different GRN levels or 349 developmental stages. Our work facilitates ongoing developmental gene regulatory studies by mapping 350 genome-wide candidate TREs, identifies PRO-seq and ATAC-seq as candidate factor-independent 16 methods that predict developmental enhancer activity in whole embryos, and outlines the stage-352 dependency and predictive value of distinct functional genomic profiles associated with proximal and 353 distal regulatory elements. 354

Preparation of nuclear extracts and sequencing libraries 357
Sea urchin embryos were reared to different stages as previously described (19). Nuclei for ATAC-seq 358 and PRO-seq were prepared using a modified version of a density gradient method (51) as follows. Sea 359 urchin embryos were centrifuged at 500 g for 3 minutes at 0 °C, the pellet was resuspended in 10 360 volumes of ice cold lysis buffer consisting of 20 mM EDTA, 2% polyethylene glycol, and 4 mg/ml of 361 Protease Inhibitor Tablets (Thermo Scientific™ Pierce™ #A32965), added just before use, in 0.1 X PBS 362 (PBS is 0.137 mM NaCl, 2.7 mM Cl, 10 mM Na 2 HPO 4 and 18 mM KH 2 PO 4 ), and incubated on ice for 5 363 minutes. Dissociated cells were further disrupted with 50 or more strokes in a fine dounce 364 homogenizer. Density gradient nuclear wash and floating layers were prepared by diluting iodixanol 60 365 % (OptiPrep™) in 1 X PBS to 20 % and 40%, respectively. About 5 ml of nuclear lysate was deposited on 366 top of 10 ml of nuclear wash and nuclei were collected over 200 µl of floating layer after centrifugation 367 for 30 minutes at 2 °C and 3,000 g in a swing bucket rotor. Nuclei aliquots were flash-frozen in liquid 368 nitrogen. For the 20 hour stage, nuclei of one of the two ATAC-seq biological replicates was prepared as 369 previously described (19). For ChIP-seq, fixation was performed by resuspending embryo pellets in 370 crosslinking solution (1 mM EDTA, 0.5 mM EGTA, 100 mM NaCl, 1.8 % formaldehyde, 50 mM HEPES, pH 371 8.0) for 15 minutes at 22 o C, followed by gravity settling and subsequent resuspension in stop solution 372 (125 mM glycine, 0.1% Triton X-100 in PBS), 500 g centrifugation, and two washes with PBT (0.1% Triton 373 X-100 in PBS). The embryos were transferred to 25 ml of ice cold homogenization buffer (15 mM Tris-374 HCl pH 7.4, 0.34 M sucrose, 15 mM NaCl, 60 mM KCl, 0.2 mM EDTA, 0.2 mM EGTA, with 4 mg/ml 375 protease inhibitors) and incubated on ice for 5 minutes. Embryos were first dounced 20 times with 376 pestle type A (loose), followed by 10 times with pestle type B (tight). Nuclei were then filtered through 377 a 20 μM filter and pelleted at 3,500 g for 5 minutes at 4 °C. The nuclei were resuspended in 7.5 ml of 378 PBTB (5% BSA in PBT buffer, with proteinase inhibitors). Propidium iodine stained nuclei were 379 quantified using a hemocytometer and a fluorescence microscope. 380 ATAC-seq library preparation and Illumina sequencing followed similar procedures to those previously 381 described (28). The ENCOCE-DCC atac-seq-pipeline (52)  For ChIP-seq library preparation, 50 to 100 million nuclei were spun at 4,000 g for 5 minutes at 4°C. 395 Nuclei were resuspended in 1 ml FA buffer (50 mM HEPES/KPH pH 7.5, 1 mM EDTA, 1% Triton X-100, 396 0.1% sodium deoxycholate, 150 mM NaCl, 0.1% sarkosyl, and protease and phosphatase inhibitors). The 397 resuspended nuclei were then sonicated at 4°C for 15 minutes on high, cycle of 30 seconds on and 30 398 seconds off, to obtain an extract with fragmented chromatin. Extracts were brought up to 440 μL with 399 FA buffer with protease and phosphatase inhibitors. Between 1 to 2 mg of extract and 4 μg of antibody 400 were used per ChIP. Prior to the addition of antibody, 5% of the extract was taken for input. Mouse ROCs and PRCs. The 95 % confidence interval abutting in gray the average PR and ROC curves ( Fig. 4 and 444  Open chromatin assays are known to have significant GC bias. Please take this into consideration as necessary. Mitochondrial reads are filtered out by default. The non-redundant fraction (NRF) is the fraction of nonredundant mapped reads in a dataset; it is the ratio between the number of positions in the genome that uniquely mapped reads map to and the total number of uniquely mappable reads. The NRF should be > 0.8. The PBC1 is the ratio of genomic locations with EXACTLY one read pair over the genomic locations with AT LEAST one read pair. PBC1 is the primary measure, and the PBC1 should be close to 1. Provisionally 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking, 0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. The PBC2 is the ratio of genomic locations with EXACTLY one read pair over the genomic locations with EXACTLY two read pairs. The PBC2 should be significantly greater than 1.

Library complexity quality metrics
NRF (non redundant fraction) PBC1 (PCR Bottleneck coefficient 1) PBC2 (PCR Bottleneck coefficient 2) PBC1 is the primary measure. Provisionally 0-0.5 is severe bottlenecking 0.5-0.8 is moderate bottlenecking 0.8-0.9 is mild bottlenecking 0.9-1.0 is no bottlenecking For overlap/IDR peaks: repX_vs_repY: Comparing two peaks from true replicates X and Y repX-pr1_vs_repX-pr2: Comparing two peaks from both pseudoreplicates from replicate X pooled-pr1_vs_pooled-pr2: Comparing two peaks from 1st and 2nd pooled pseudo replicates  Mitochondrial reads are filtered out by default. The non-redundant fraction (NRF) is the fraction of nonredundant mapped reads in a dataset; it is the ratio between the number of positions in the genome that uniquely mapped reads map to and the total number of uniquely mappable reads. The NRF should be > 0.8. The PBC1 is the ratio of genomic locations with EXACTLY one read pair over the genomic locations with AT LEAST one read pair. PBC1 is the primary measure, and the PBC1 should be close to 1. Provisionally 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking, 0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. The PBC2 is the ratio of genomic locations with EXACTLY one read pair over the genomic locations with EXACTLY two read pairs. The PBC2 should be significantly greater than 1. repX: Peak from true replicate X repX-prY: Peak from Yth pseudoreplicates from replicate X pooled: Peak from pooled true replicates (pool of rep1, rep2, ...) pooled-pr1: Peak from 1st pooled pseudo replicate (pool of rep1-pr1, rep2-pr1, ...) pooled-pr2: Peak from 2nd pooled pseudo replicate (pool of rep1-pr2, rep2-pr2, ...) For overlap/IDR peaks: repX_vs_repY: Comparing two peaks from true replicates X and Y repX-pr1_vs_repX-pr2: Comparing two peaks from both pseudoreplicates from replicate X pooled-pr1_vs_pooled-pr2: Comparing two peaks from 1st and 2nd pooled pseudo replicates