Deciphering the Gene Regulatory Landscape Encoded in DNA Biophysical Features

Summary Gene regulation in higher organisms involves a sophisticated interplay between genetic and epigenetic mechanisms. Despite advances, the logic in selective usage of certain genomic regions as regulatory elements remains unclear. Here we show that the inherent biophysical properties of the DNA encode epigenetic state and the underlying regulatory potential. We find that the propeller twist (ProT) level is indicative of genomic location of the regulatory elements, their strength, the affinity landscape of transcription factors, and distribution in the nuclear 3D space. We experimentally show that ProT levels confer increased DNA flexibility and surface accessibility, and thus potentially primes usage of high ProT regions as regulatory elements. ProT levels also correlate with occurrence and phenotypic consequences of mutations. Interestingly, cell-fate switches involve a transient usage of low ProT regulatory elements. Altogether, our work provides unprecedented insights into the gene regulatory landscape encoded in the DNA biophysical features.


INTRODUCTION
The genome consists of multiple gene regulatory units comprising of proximal and distal regulatory elements. Recent studies have shown that the function and utilization of these elements during cellular differentiation and in response to intracellular and extracellular cues relies on a dynamic control by epigenetic machinery in concert with transcription factors (TFs) (Li et al., 2007;Long et al., 2016;Margueron and Reinberg, 2010;Moris et al., 2016). Enhancers are known to be critical for setting up transcriptome underlying cell-type identity and function from far away distances on their target promoters (Heinz et al., 2015;Long et al., 2016;Rickels and Shilatifard, 2018;Sakabe et al., 2012;Schoenfelder and Fraser, 2019;Shlyueva et al., 2014;Spitz and Furlong, 2012). Importantly, mutations in these gene regulatory elements are known to disrupt their function affecting gene expression and ultimately cell identity and hence underlie several diseases (Li et al., 2019;Rickels and Shilatifard, 2018;Sakabe et al., 2012;Weinhold et al., 2014). Typically, a range of methods are employed to identify and validate such distal regulatory elements including quantifying certain histone modifications and DNase hypersensitivity assays (Ong and Corces, 2011;Pradeepa et al., 2016;Shlyueva et al., 2014;Zentner and Henikoff, 2013). These methods have their own limitations and a number of alternate assays and histone modifications have recently been used to discover enhancers (Arnold et al., 2013;Pradeepa et al., 2016;Vanhille et al., 2015). Thus, our current approach to reveal regulatory elements in entirety is highly limited and vouches to search for conserved features that can explain enhancer evolution and function.
Interestingly, sequences from regulatory loci are able to recapitulate endogenous TF binding pattern, chromatin state, and cell-type-specific activity when placed at an exogenous genomic site or tested in isolation (Lienert et al., 2011;Yanez-Cuna et al., 2013). In addition, computational analysis has further shown that the occurrence of certain sequences at genomic loci is predictive of their regulatory potential (Colbran et al., 2017;Yanez-Cuna et al., 2014;Yang et al., 2017a). These lines of evidence strongly suggest the existence of inherent gene regulatory potential of these genomic loci at the sequence level. Despite these advances, we lack understanding of the evolutionary constrains in the selection of certain genomic DNA elements for their gene regulatory function (Pennacchio et al., 2013). It is thus important to decode the power of sequence features in determining the gene regulatory potential and differential usage in cell-type specification. In addition, it is important to catalog novel regulatory elements, an effort that is limited by insufficient knowledge of existing features of these elements.
Several laboratories have attempted to employ computational approaches to predict enhancers based on sequence information (Kleftogiannis et al., 2016;Lee et al., 2011;Rusk, 2014). Although these methods were able to predict enhancers to a certain degree, they were unable to decipher the underlying code that drives enhancer selection and strength (Pennacchio et al., 2013). A previous study suggested that the local DNA topography differs at functional noncoding regions of the genome including enhancers (Parker et al., 2009). Interestingly, DNA shape features such as propeller twist (ProT), major grove width, and helical twisting determine different local geometries, which in turn contribute to the control of transcription factor binding and gene regulation Ma et al., 2017;Mathelier et al., 2016;Zhou et al., 2013). Overall, the existing evidences suggest a genetic feature code beyond simple sequence that may dictate selection of enhancers and their strength of function. Here we show that DNA shape features are highly informative of the gene regulatory potential of genomic loci. We discover that the ProT levels can reveal the location of enhancers, their strength, the affinity landscape of transcription factors, and distribution in the nuclear 3D space with high accuracy. Using experimental assays including single-molecule AFM imaging measurements, we show that indeed high ProT levels cause increased DNA flexibility and surface accessibility and may potentially explain their usage as regulatory elements. Furthermore, ProT levels also determine the effectivity landscape of the genome to tolerate mutations. Altogether, this work reveals the gene regulatory landscape encoded in the basic genetic sequence features and provides a significant advance in unfolding the mysteries of genetic code.

Genomic Surface Accessibility and Flexibility Are Encoded in DNA Shape Features
The ability for genomic regions to function as gene regulatory elements is thought to be significantly influenced by their inherent accessibility for DNA-binding proteins such as TFs (Bell et al., 2011). To probe accessibility, we began by investigating whether the surface accessibility of DNA is influenced by its biophysical features. We used hydroxyl radical cleavage maps as a proxy for solvent accessible surface area of the DNA  and correlated this with various DNA shape features such as ProT, major groove width (MGW), helix turn (HelT), and roll predicted by an established tool-DNAshape (Zhou et al., 2013). We found that ProT, defined as the angle of twisting of two neighboring nucleotides from the axis of their geometrical center, highly correlates with the DNA surface accessibility (Pearson correlation coefficient = 0.967, pval <0.001) ( Figure 1A). The other features do not show as strong a correlation with hydroxyl radical cleavage maps and with each other (Figures S1A-S1C). This analysis established ProT as a proxy to measure inherent surface accessibility of DNA.
Next, to directly test how increased levels of ProT affect the mechanical properties of DNA segments, we decided to carry out high-resolution AFM imaging experiments of $1 kbp long DNA sequences with different ProT levels (Data S1). Toward this, we first used PCRs to generate different linear sequences predicted to be either genomic ''random'' or ''high'' in terms of ProT levels. Subsequently, atomic force microscopy (AFM) images of the DNA molecules were measured and analyzed by tracing the DNA paths ( Figure 1B). An analysis of the mean-squared separation of pairs of points located at different distances along the contour length confirmed that the DNA molecules were equilibrated at the surface ( Figure S1D) and allowed determination of the bending persistence length. We found bending persistence length values P z 56 nm, in good agreement with previous measurements under similar conditions (Mazur and Maaloum, 2014;Rivetti et al., 1996;Wiggins et al., 2006). The data did not reveal significant differences in the persistence lengths from control (P = 56.1 G 0.2 nm) and high ProT sequences (P = 56.7 G 0.6 nm), suggesting that the bending stiffness at longer length scales is similar for different levels of ProT. Therefore, to probe the local flexibility of the DNA sequences, we analyzed the distribution of bend angles between points separated by 5 nm along the contour. Taking the negative logarithm of the histogram of bend angles directly gives the effective bending energy ( Figure 1C). For both random and high ProT sequences, the data for angles up to q $ 1 rad are well described by a simple elastic model, the so-called worm-like chain, whereas for bending angles q > 1 rad clear deviations from the elastic model are apparent, as have been observed previously (Wiggins et al., 2006). Interestingly, the high ProT sequences exhibited larger deviations from the elastic model and a significantly higher fraction of medium (q > 0.8 rad; p = 7.5$10À6) and large bends (q > 1.1 rad; p = 0.0013) compared with the control sequences ( Figure 1D). In contrast, different control sequences and different high ProT sequences gave the same fractions of medium and large bends, respectively, within experimental error. Taken together, the AFM imaging analysis suggested that on short length scales ($5 nm), high ProT sequences exhibit enhanced bendability compared to random sequences.

Propeller Twist Levels Correlate with 3D Nuclear Positioning of Distinct Chromatin States
Eukaryotic genomes are compartmentalized into distinct domains marked by active (eu-) and inactive (hetero-) chromatin. Inspired by the observation that ProT highly correlates with the inherent surface accessibility and bendability of DNA, we hypothesized that these regions could potentially mark open active chromatin regions that are also known to be more fluid in nature. A previous study employed single cell Hi-C assays to reconstruct the 3D genome of mouse embryonic stem (mES) cells at a high resolution (Stevens et al., 2017). We processed these data and overlaid with histone modifications indicative of (C) Energy landscape for bend angles reconstructed from the bend angle distribution for pooled control (N molecules = 801) and pooled ProT (N molecules = 425) sequences. In total, we traced 87,168 bend angles from 1,226 imaged DNA molecules. The broken line depicts the energy landscape expected for a wormlike chain with persistence length P = 55 nm. (D) Fraction of large bends (>0.8 rad) for pooled control (N large bends = 182; N bends, total = 56,874) and high ProT (N large bends = 157; N bends, total = 30,294) sequences. The fraction of large bends is significantly higher for the ProT versus control sequences (p = 7.5 3 10 -6 ). The error bar is the standard deviation from counting statistics, i.e. the square root of the counts divided by the number of total counts. See also Figure S1. euchromatin and heterochromatin, H3K27ac and H3K9me3, respectively. Interestingly, euchromatin was found to have a higher surface depth (as defined in Stevens et al., 2017) as compared with heterochromatin (Figures 2A and S2A). These findings are also in line with the local enrichment of heterochromatic laminaassociated domains (LADs) at the nuclear periphery (van Steensel and Belmont, 2017).
We next analyzed the radial distribution of sequences with different ProT levels within the Hi-C data derived from 3D nuclear positioning. Interestingly, ProT levels were found to correlate well with the nuclear distribution, where ''high ProT'' sequences occupy an internal position, whereas ''low ProT'' sequences are localized at the periphery ( Figure 2B). Importantly, while the surface depths of genomic loci at the single cell level across various ES cells is variable, the overall radial positioning of differential ProT regions in the genome is highly consistent (Figures S2B, S2C, and 2B). Simultaneous visualization of chromosome-wide surface depth and ProT profiles also showed a clear correlation between these two features (Figures 2C,2D,and S2D). Collectively, these results suggested a potential contribution of ProT in influencing the nuclear positioning and its association with distinct epigenetic states.

Propeller Twist Encodes the Regulatory Potential of Genetic Elements
Intrigued by the above findings, we next attempted to perform a detailed characterization of high ProT regions. In line with our previous findings, we find that ''high ProT'' sequences are prevalent at regions To further validate these findings, we segmented the epigenome of human K562 myeloma cells into 15 different chromatin states using ChromHMM (Ernst and Kellis, 2012) and determined their ProT levels. Consistent with the previous observations, we found generally higher ProT levels at genomic regions marked by active chromatin marks as compared with repressive ones (Figures S3A-S3C). An interesting exception was H3K27me3, a repressive mark, which correlates with higher ProT levels ( Figure S3C). This may be explained by the fact that H3K27me3 marks certain genomic regions that permit enhancer activity under certain physiological conditions (Taberlay et al., 2011). Furthermore, this mark is also known to be present at ''poised'' promoters that represent a transcription ready state (Bernstein et al., 2006).
We next extracted experimentally validated regulatory regions from a variety of cell types and analyzed their ProT profiles. Strikingly, we noticed that the regulatory elements defined by H3K27ac mark show a highly characteristic distribution of ProT levels where ProT peaks appear symmetrically next to the center of H3K27ac peaks ( Figure 3C). Furthermore, ProT peaks overlap with the centers of regulatory elements identified by CAGE or STARR-seq experiments, suggesting that ProT is an intrinsic property of regulatory regions ( Figures 3D and 3E). Extended analysis of CAGE-defined enhancers across 71 cell types further supports these findings  ( Figures 3F and S3D).
Next, we sought to monitor the correlation of ProT levels with enhancer activity in a quantitative manner. Because H3K27ac levels at enhancers are known to correlate with gene expression levels, we used this as a proxy for enhancer usage (Karlic et al., 2010). A comparison of H3K27ac enrichment with ProT levels demonstrated a clear relationship ( Figure 3G), which was also true with CAGE-or STARR-seq-determined enhancer strength across multiple cell types ( Figures 3H-3J). Based on these observations we also hypothesized whether ProT levels could also help discriminate enhancer usage across cell types. Interestingly, we indeed observed that higher ProT level-containing regions tend to be ubiquitous enhancers, whereas those showing lower ProT level were enhancers of cell-type specific genes ( Figure 3K). This may relate to an easy activatable state of housekeeping genes versus those of cell-type specific genes that generally require distinct machinery and program to induce their expression. Further, we observed that the ProT levels correlate with expression levels ( Figures 3L and S3E), suggesting that the transcriptional competence is potentially orchestrated at the genetic level by DNA shape features.
The current repertoire of histone modifications does not seem sufficient to define all genomic regulatory elements, and efforts are continuously being made to uncover new chromatin features that allow mapping all enhancers. In line with this, H3K122ac modification was shown to mark enhancers that do not exhibit any H3K27ac mark (Pradeepa et al., 2016). Further corroborating our previous observations, H3K122ac positive and H3K27ac negative enhancers show a characteristic high ProT profile ( Figure 3M). Thus, high ProT levels constitute a common feature of enhancers irrespective of the chromatin mark defining these regions. These findings argue that high ProT levels constitute a common feature of

ProT Profile Is a Deterministic Feature of Enhancers
To further establish predictive nature of ProT levels in priming genomic regions for a gene regulatory function we developed SVM (Support Vector Machine) models to classify between random genomic loci and CAGE-defined brain enhancers using single nucleotide ProT values across a 2000 bp window. The resulting nine models could classify the location of enhancers at randomly chosen genomic sites with very high accuracy (mAUC = 0.78) (Figure 4A). Next, we trained five models in a similar manner to identify STARR-seq defined enhancers from mouse NIH3T3cells. Again, our SVM models closely predicted enhancer locations (AUC = 0.96) ( Figure 4B). Enhancers are known to contain multiple TF (Transcription Factor) binding sites, and given a strong relationship between ProT and enhancer occurrence and usage, we next had a closer look at TF motifs at ProT profiles. Here we clustered TF-bound motifs, as derived from actual ChIP-seq assays for these TFs, into eight different clusters based on various histone modification patterns at these sites as hallmark of euchromatin and heterochromatin ( Figures S4A-S4C). Interestingly, an analysis of ProT levels at the center of motif at these TF bound sites for each of these clusters showed that the activator and repressor TF motifs can be clearly delineated by ProT levels. The TFs that function primarily as activators preferably target motifs embedded in high ProT environments, whereas TFs acting mainly as repressors bind motifs within lower ProT environments ( Figures 4C and 4D). These findings argue that ProT profile is a deterministic feature of enhancers and can predict the active distal gene regulatory landscape with high accuracy.

Cell-Fate Switches Involve a Transient Usage of Low ProT Regulatory Elements
We next explored the ProT dependency landscape of different TFs to reveal the possible impact of DNA structure on the function of general vs cell-fate-determining TFs. Toward this, we looked for systems that involve dynamic reprogramming of cell-fate using defined TFs. Somatic cells can be efficiently reprogrammed into an embryonic stem cell state, i.e. induced pluripotent stem cells (iPSCs), using a distinct set of TFs, namely Oct4, Sox2, Klf4, and c-Myc (Takahashi and Yamanaka, 2006). Therefore, using datasets from a previous study we analyzed binding of these four TFs during fibroblast-to-iPSC reprogramming and assessed its relation to ProT levels and nucleosome occupancy in pre-induced human fibroblasts, as measured by MNase sequencing (MNase-seq) (Chronis et al., 2017). Interestingly, c-Myc showed lesser affinity for nucleosome bound regions, whereas Oct4, Sox2, and Klf4 preferentially targeted nucleosome-enriched sites ( Figure 5A). These data are consistent with a previous finding that Oct4, Sox2, and Klf4, but not c-Myc, could function as pioneer TFs during reprogramming by virtue of their ability to target ''closed'' chromatin sites (Soufi et al., 2015). Furthermore, while c-Myc, Klf4 and Sox2 ChIP-seq enrichment relied on levels of ProT, this was less so in the case of Oct4.
Intrigued by these findings, we analyzed the rewiring of the active chromatin landscape using H3K27ac mark and its relationship with Oct4 binding dynamics during distinct stages of reprogramming. Strikingly, although H3K27ac sites in either fibroblasts or iPSCs show stable ProT levels, those occurring during any of the transient states during reprogramming show a significant drop in their ProT levels ( Figure 5B). This pattern was closely mimicked by genomic regions targeted by Oct4 at distinct stages of reprogramming ( Figure 5B). These results suggest that while Oct4 binding and enhancer activation occurs at ''low ProT'' regions in transient cell states occurring during reprogramming, the acquisition of a fully reprogrammed cell-fate involves utilization of ''average ProT'' sites as enhancers ( Figure 5B). Altogether, these findings imply that although high ProT sites are hallmark of enhancers in defined cell types, relatively lower ProT sites may play a crucial role during setting up of these cell-fates during reprogramming and potentially in development.

ProT Levels Correlate with Occurrence and Phenotypic Consequences of Mutations
Given our findings of a deterministic role of ProT in the regulatory potential of distinct genomic sites, we next assessed the differential sensitivity of ProT sites to tolerate mutations. Strikingly, our analysis revealed a higher occurrence of random mutations (i.e. non-phenotype associated) at ''high ProT'' regions of the genome ( Figures 6A and S5). This implies that inherent higher DNA accessibility plays a critical role in enhancing mutability, possibly because mutagenic agents or machinery have an easier access to such sites. In contrast, the occurrence of cancer-associated mutations (i.e. phenotype associated) is higher in genomic loci characterized by a lower ProT ( Figures 6A and S5). It is likely that these low ProT sites offer reduced access to DNA repair machineries and consequently more likely to result in phenotypic consequences.
We further employed LINSIGHT predictions to determine effectivity of SNPs in causing a detectable phenotype (Huang et al., 2017). We analyzed genomic segments in 1 kb bins and found that ProT levels are lower at phenotype-associated SNPs than at random genomic loci, in line with our previous observations ( Figure 6B). We further implemented SVM models trained on ProT information for 2 kbp genomic segments to classify effective versus random genomic loci in terms of phenotypic association. Strikingly, the generated models were able to predict whether mutation at a specific locus could be phenotypic with a considerably high accuracy ( Figure 6C). Taken together, we show that ProT levels are correlated with occurrence and phenotypic consequences of mutations.

DISCUSSION
The DNA sequence composition is known to influence local DNA shape across the genome (Parker et al., 2009). However, at a broader scale, the DNA sequence and structures appear as independent and deconvolved features (Abe et al., 2015). Previous studies have hinted upon a role of DNA sequence as well as topography in determining certain epigenetic features (Arnold et al., 2017;Parker et al., 2009;Rusk, 2014;Wang and Moazed, 2017). The genetic features were also shown to be important in determining TF access on the DNA (Yang et al., 2017b;. However, whether the local DNA biophysical features play any role in a chromatin context and in gene regulation is unknown. This study provides unprecedented insights into the deterministic role of DNA biophysical features in governing epigenetic and gene regulatory landscape underlying cell identity and function. Our study has discovered ProT as a novel proxy for measuring inherent surface accessibility as determined by OH-radical cleavage mapping and local DNA flexibility as probed by atomic force microscopy. Further assessment of distribution of different ProT sequences within the 3D nuclear space revealed that higher ProT regions are enriched in euchromatic domains and are more interiorly located in 3D genome structure. In contrast, lower ProT regions are enriched in heterochromatic domains and are more exterior in their location within the 3D genome structure. These intriguing results implied a novel role for ProT levels of DNA sequences in guiding DNA surface accessibility and flexibility, epigenetic state, and ultimately the nuclear organization of distinct chromatin domains. As LADs are known to be AT-rich, future studies should attempt to dissect any contribution of DNA sequence versus DNA shape to our observations (van Steensel and Belmont, 2017). Importantly further, although our analysis found ProT to positively correlate with the DNA surface accessibility, HelT showed negative correlation. Further investigation is required to uncover the relevance of this anti-correlation, in particular towards the gene regulatory landscape.
A large scale, systematic analysis showed specific enrichment of ProT at the regulatory elements that were previously identified by a number of independent experimental measures in multiple cell-types and across species. Importantly, further, the regulatory potential of the genomic regions showed a strong correlation with ProT levels. Additional analysis revealed that ProT is a deterministic feature of the enhancers irrespective of the chromatin mark used for their identification. Strikingly, the predictive power of ProT to identify sequence-intrinsic enhancer features, as experimentally measured by STARR-seq (Arnold et al., 2013;Muerdter et al., 2015), was very high, suggesting that ProT predictions are able to decode sequence logic with confidence in such experiments and offers an alternative to heavy experimentation-based analysis. Collectively, our findings argue that the local DNA biophysical features hold the potential to prime a genomic region for particular epigenetic state and gene regulatory potential, thus revealing an underestimated role of DNA structure in guiding genome function. This is further in line with previous studies that have shown that the DNAshape algorithm works better than certain k-mer (2,3) combinations for some biological functions (Abe et al., 2015;Mathelier et al., 2016).
ProT, along with other DNA shape features, have been shown to contribute to TF access DNA (Yang et al., 2017b;. Our analysis shows that the activator and repressor TFs have an inverse binding affinity with ProT levels. Importantly, further, in contrast to general TFs, pioneer TFs have lesser dependency on these DNA features for their binding. A number of efficient reprogramming TFs are known to be pioneer TFs (Pataskar et al., 2016;Soufi et al., 2015). Interestingly, employing pioneer TF binding and H3K27ac data during distinct stages of cellular reprogramming of a differentiated to a pluripotent state, we find that the switch between cell-fates involves a transient usage of low ProT as regulatory elements.
In further support of this highly influential role of ProT in guiding gene regulation, DNA flexibility, and surface accessibility, we found a higher occurrence of random or non-phenotype-associated mutations at ''high ProT'' regions of the genome, whereas the occurrence of phenotype-associated mutation is higher at lower ProT genomic regions, which we identified to mark cell-identity enhancers. This implies that high ProT-imposed increased DNA accessibility plays a critical role in enhancing mutability in the genome, potentially because mutagenic agents or machinery have an easier access to such sites. A previous study has suggested that the GC-rich sequences tend to have less pronounced ProT values, whereas AT-rich sequences tend to have more negative ProT values (Hancock et al., 2013). It is thus also likely that the observed relationship to the mutation rate is also influenced to certain extent by the sequence composition and warrants further investigation.
Altogether, our work provides unprecedented insights into the gene regulatory landscape encoded in the DNA biophysical features. Our findings open a new area of investigation that was previously underestimated for its relevance and vouches for the necessity to include DNA shape features while studying epigenetic gene regulatory mechanisms in various contexts. Our study hypothesizes that DNA sequences have evolved in a highly orchestral manner, wherein genomic DNA is segmented into compartments of different inherent biophysical states, which are then chosen to be in different nuclear chromatin compartments of different regulatory potential. Follow-up studies should aim to investigate how epigenetic machineries such as DNA modifying enzymes alter the DNA structure at specific sites. Furthermore, it will also be important to determine how combinatorial TF binding influences DNA flexibility and if this crosstalk is relevant for iScience 21, 638-649, November 22, 2019 647 cell-fate decisions during development. In addition, it will be interesting to investigate how genomic sites with various ProT levels are utilized in the 3D nuclear space in response to external cues during development and in various diseases. Such investigations will ultimately decode the relationship between genetic and epigenetic mechanisms, which is key for a comprehensive understanding of genome function in health and disease.

Limitations of the Study
DNAshape algorithm predicts shape from DNA sequence, and hence it will show high correlation to the sequence. Given this close dependency, it is almost impossible to fully dissect the contribution of DNA sequence versus shape. This limitation is valid for several previous publications using DNAshape algorithm as well as our own study.

METHODS
All methods can be found in the accompanying Transparent Methods supplemental file.

ACKNOWLEDGMENTS
We would like to especially thank Dr. Michael Stadler (FMI, Switzerland) and Dr. Attila Nemeth for assistance and critical feedback to this study. We would also like to thank the members of the Tiwari lab for their cooperation and critical feedback during this study. We also thank Tim Liedl and Joachim Rä dler for access to the AFM imaging lab. This study was supported by the Deutsche Forschungsgemeinschaft TI 799/1-3 to V.K.T. and Deutsche Forschungsgemeinschaft SFB 863, project A11 to J.L..

DECLARATION OF INTERESTS
The authors declare no competing financial interests.

Figure S4
PC1 ( Cluster 1   respectively. PCR products were purified using a PCR cleanupCleanup kit and stored in Tris-EDTA buffer (10 mM Tris-HCl, pH = 8.0; 1 mM EDTA) at 4° C. To verify successful amplification of the desired sequence, the product was analyzed on a 1 % (w/v) agarose gel.

AFM imaging
A buffer solution (12 mM MgCl2; 10 mM Tris-HCl, pH = 8.0) containing 1167 bp linear DNA construct at 0.5 ng/ L was deposed onto freshly cleaved mica by drop-casting for 30 seconds, followed by gentle rinsing with milliQ water (25 mL) and drying under N 2 . AFM imaging was performed on a Multimode AFM equipped with a Nanoscope III controller (Digital Instruments) and a type E scanner (Bruker). Images were recorded on dried samples, under ambient conditions, and using silicon cantilevers (Nanosensor; SSS-NCHR; resonance frequency ≈ 300 kHz). Typical scans were recorded at 1-3 Hz line frequency, with optimized feedback parameters and 512 × 512 pixels of 1.9 nm each.

AFM data analysis
AFM topography images were loaded in Scanning Probe Imaging Processor software (v6.4) and background corrected using global fitting with a second order polynomial and in a lineby-line fashion using the histogram alignment routine. The processed images were saved in ASCII format and analyzed using custom-written code implemented in Python framework.
Image analysis involves finding (x,y) coordinates that define the DNA curvilinear length by tracing the molecular contour with a step-length l = 5 nm, following the routine introduced by Wiggins et al. (Wiggins et al., 2006). Bend angles are defined as the deviation from linearity between a certain set of tangent vectors separated by l. In total, we traced 87168 bend angles from 1226 imaged DNA molecules. The energy landscape for bending was reconstructed from the corresponding bend angle distribution by taking the negative logarithm. The energy required to introduce a bend increases with the bend angle θ. Up to θ ~ 1 rad the increase is approximately quadratic and in agreement with the prediction of the worm-like chain model EWLC(θ) = 1/2·kBT·(P/l)·θ 2 , with P ~ 55 nm ( Figure 1C, dashed line), where kB is the Boltzmann constant, T the absolute temperature (295 K), and l the segment length (5 nm in our analysis). For bending angles θ > 1 rad, the energy to bend the DNA grows more slowly -in other words large bends occur more frequently-than predicted by the WLC model, as has been reported previously (Wiggins et al., 2006). While both the control and high ProT sequences deviate from the WLC prediction, the energy required to introduce a large bend for the high ProT sequences is lower and, therefore, deviates more strongly from the WLC prediction compared to the control sequences.
Further, the (x,y) coordinates are used to quantify the end-to-end distances R between any two traced positions as a function of their separation L along the contour. In turn, the relation between the mean squared end-to-end distance <R 2 > and L can be used to examine surface equilibration using the formula <R 2 > = 4 P L (1 -2 P/L (1 -exp(-L / 2P))) with P the bending persistence length.

DNAshape and OH-radical cleavage prediction
DNAshapeR package (Chiu et al., 2016) was used to generate and plot DNA shape feature predictions, namely Propeller Twist, Major Groove Width, Helix Turn and Roll. Density plots of these features were plotted using DNAShapeR (Chiu et al., 2016). OH-radical cleavage predictions were obtained from ORCHID2  server. For genomewide predictions of DNAshape features, bigwig files (mm9 and hg19 for mouse and human respectively) were downloaded from GBShape (Chiu et al., 2015). Correlation within DNAshape (Chiu et al., 2015) features and with OH-radical cleavage intensity (ORCHID) were done by using Deeptools (Ramirez et al., 2014) at the resolution of 1KB.

Reconstruction of 3D Genome Structures and overlay with Chromatin features and
Propeller Twist predictions HDF5 Datafiles containing genome structure features reconstructed from single-cell HiC experiments of seven mouse ES cells were downloaded from supplementary information provided in study from Stevens et al (Stevens et al., 2017). The surface depths with resolution of 1MB were analyzed from these HDF5 files. Euchromatin segments were defined as those 1 MB segments that shows more than 10 peaks of H3K27ac chromatin marks, while heterochromatin marks showed enrichment of more than 10 peaks of H3K9me3. Propeller Twist predictions at resolution of 1MB in mouse genome (mm9) was analyzed from BigWig files downloaded from GBShape (Chiu et al., 2015) using Deeptools (Ramirez et al., 2014). For visualization of 3D genome and chromosome structures, PDB files were downloaded from supplementary information provided in study from Stevens et al. (Stevens et al., 2017). The PDB files visualization and programming was done using Pymol. Overlay of chromatin states, surface depth and propeller Twist quartiles was done by processing PDB files in R and visualization in Pymol. Scripts used for processing are available upon request.

Analysis of CAGE and STARR-seq enhancers
capSTARR-seq data was downloaded from the supplementary data published in the study from Vanhille et al (Vanhille et al., 2015). CAGE Enhancer analysis was downloaded from FANTOM5 Atlas . Propeller Twist profiles were overlaid using mean across BED file, centered at mid, using DNAshapeR (Chiu et al., 2016).

Histone modification ChIPseq analysis
Following ChIP-seq were processed in this manner: mES H3K27ac, mES H3K9me3, K562 H3K27ac. The ChIP-sequencing output in FASTQ format was subjected to a quality check using FASTQC v2.6.14 (Andrews). Bowtie v0.12.9 (Langmead, 2010) was used to align the reads uniquely, i.e., each read was maximally aligned to one position, to mm9 genome with UCSC annotations (Rosenbloom et al., 2015). The alignment output files from two biological replicates were merged together after checking for correlations across the replicates using the SAMTOOLS v0.1.19 (Li et al., 2009) merge function. SAMTOOLS v0.1.19 (Li et al., 2009) was used for the alignment file format conversions and sorting of alignment output files. The WIGGLE files for the alignment files were generated using QuasR package (Gaidatzis et al., 2015). The peaks were computed without providing input with MACS v2.0.10.20120913 (Zhang et al., 2008) using the default parameters. The enrichment was calculated by QuasR (Gaidatzis et al., 2015) using the following formulae: ℎ = 2(( * min( , ) + )/( * min( , ) + p) ) where ns is the total number of reads that align at the genome level in the ChIP-seq Sample, Ns is the number of reads that aligned the entire ChIP-seq sample, nb is the number of reads that aligned at the genomic level in the input, and Nb is the total number of reads that aligned in the input. P is the pseudocount, which is used to correct the enrichment values at the genomic features with low read counts and was set to 8. For ChIPseq of H3K27ac during time-points of reprogramming, data was downloaded from supplementary information provided in the study by Chen et al.(Chen et al., 2016) All other ChIPseq data was downloaded pre-analyzed from ENCODE (de Souza, 2012).

Transcription factor ChIPseq analysis
Transcription factor ChIPseq data for K562 cells was obtained from ENCODE (de Souza, 2012). ChIPseq data was analyzed in the same way as Histone Modification data given below with specifically restricting to narrow peak calling algorithm from MACS (Zhang et al., 2008) Transcription factor motifs were obtained from TRANSFAC (Wingender et al., 2000) database. Only those Transcription factors were further analyzed for which motif information were available. Motif enrichment analysis was performed using HOMER (Heinz et al., 2010).
Binding sites with ChIPseq peaks centered accross motifs were retained for further DNAshape analysis using DNAshapeR (Chiu et al., 2016). The transcription factors were classified according to families from AnimalTF Database (Zhang et al., 2015). For ChIPseq of Oct4 during time-points of reprogramming, data was downloaded from supplementary information provided in the study by Chen et al. (Chen et al., 2016).

Histone modification ChIP-seq in K562 analysis
Histone modification ChIP-seq data in K562 was obtained from ENCODE (de Souza, 2012) as BAM and peak files. Heatmap and density plot enrichment of Histone modifications was generated using ngs.plot.r (Loh and Shen, 2016;Shen et al., 2014). For the PCA plot of Transcription factor distribution on PCA dimensions as a function of enrichments across various chromatin states was undertaken with data of overlap of Transcription factor peaks with histone modification peaks.

Support Vector Machine and linear and logistic regression analysis
Support vector machine was implemented using e1071 R package 80% of data was used for training of SVM model, while rest 20% of rest was used for testing unless specified. Multiple models as independently mentioned were trained for the reducing chances of overfitting.
ROC curves were plotted using ROCR package (Sing et al., 2005). Regression analysis was done using R with similar strategy.

Generating of features for Support vector machine analysis
For classification of enhancer positive and random genomic loci, the corresponding sequences of 2000BP centered on the peak-mid were extracted using BEDTOOLS (Quinlan, 2014)and DNAshape analysis was undertaken using DNAshapeR (Chiu et al., 2016).
Propeller Twist values for each base pair step was used as feature set for SVM analysis. For classification of cell-type-cluster specific classification, the feature set included PHASTCons (Felsenstein and Churchill, 1996) derived conservation scores of enhancer sequences from each clusters, TRANSFAC (Wingender et al., 1996) transcription factor motifs enrichments for each enhancer sets, PhastCons derived Conservation scores of the transcription factor motif gene were used.

Conservation and Mutation analysis
For mutation analysis, COSMIC database  was used to download phenotypic and non-phenotypic mutations. The genomic classes in increasing order of mutation density was used by scanning number of mutations occurring in each nonoverlapping 2000 BP regions, and stratification by the quartile analysis of the counts.
Conservation scores were obtained from PhastCons (Felsenstein and Churchill, 1996) Programming and scripting Scripts for data analysis is written in R and PERL and is available upon request.