Semi-nonparametric modeling of topological domain formation from epigenetic data

Background Hi-C experiments capturing the 3D genome architecture have led to the discovery of topologically-associated domains (TADs) that form an important part of the 3D genome organization and appear to play a role in gene regulation and other functions. Several histone modifications have been independently associated with TAD formation, but their combinatorial effects on domain formation remain poorly understood at a global scale. Results We propose a convex semi-nonparametric approach called nTDP based on Bernstein polynomials to explore the joint effects of histone markers on TAD formation as well as predict TADs solely from the histone data. We find a small subset of modifications to be predictive of TADs across species. By inferring TADs using our trained model, we are able to predict TADs across different species and cell types, without the use of Hi-C data, suggesting their effect is conserved. This work provides the first comprehensive joint model of the effect of histone markers on domain formation. Conclusions Our approach, nTDP, can form the basis of a unified, explanatory model of the relationship between epigenetic marks and topological domain structures. It can be used to predict domain boundaries for cell types, species, and conditions for which no Hi-C data is available. The model may also be of use for improving Hi-C-based domain finders.


Background
The emerging evidence suggests that 3D nuclear architecture is important for the regulation of gene expression and it is tightly linked to the function of the genome. For instance, expression in the beta-globin locus is mediated by folding to bring an enhancer and associated transcription factors within close proximity of a gene [1,2]. Similarly, loci of mutations that affect expression of genomically far-away genes (eQTLs) interact significantly more frequently within a contact range in 3D to their regulated genes [3], indicating that 3D genome structure plays a wide-spread role in gene regulation. Lastly, spatial regions that interact with nuclear lamina are generally inactive [4]. Measuring and modeling the 3D shape of a genome is thus essential to obtain a more complete understanding of how cells function.
Chromatin interactions obtained from a variety of recent chromosome conformation capture experimental techniques such as Hi-C [5] have resulted in significant advances in our understanding of the geometry of chromatin structure [6,7]. These experiments yield matrices of counts that represent the frequency of cross-linking between restriction fragments of DNA at a certain resolution. Analysis of the resulting matrix by Dixon et al. [8] led to the discovery of topologically-associated domains (TADs) which correspond to consecutive, highly-interacting matrix regions typically a few megabases in size that are close in densely packed chromatin. TADs have been identified across different cell cycle phases and in prokaryotes [9]. Several lines of evidence suggest that Sefer and Kingsford Algorithms Mol Biol (2019) 14:4 TADs are a building block of genomic regulatory architecture [10,11]. Segmental packaging of genome via TADs likely have critical roles in cell dynamics such as long-range transcriptional regulation and cell differentiation [12,13].
The mechanism by which these TADs form and are demarcated is still largely unknown. A plethora of epigenetic modifications have been identified in metazoan genomes that are associated with 3D genome shape [14], and thus TADs. Several modifications have been found to be specifically correlated with TAD boundaries [8]. For instance, histone modifications H3K4me3m H3K27ac and insulator proteins are enriched within TAD boundaries [15], although the causal direction of these associations is still unknown [12]. Despite these analyses, the complete picture of how histone modifications are related to TAD formation is missing. This is partially because previous analyses relating histone marks to domain boundaries have often considered each histone mark independently, without accounting for their combined affects. It is unknown to what extent relationships between the histone markers are important or whether there is a small set of markers that are of primary importance.
Here, we develop and train a joint model, which we call nTDP (Nonparameteric Topological Domain Partitioner), of how histone modifications are associated with domain boundaries and interiors. We show that we are able to train this model optimally in polynomial time because its likelihood function is convex. The model does not make any assumptions about the effect of each histone mark on domain formation, and instead fits the histone-domain relationship nonparametrically. Using this model, we systematically identify a small set of histone markers that in combination appear to explain TAD boundaries. We find a small number of epigenetic elements account for a large proportion of the accuracy of TAD prediction. All of these identified marks fail to predict domain boundaries when considered independently. We show that these markers are conserved across species and cell types in a very strong way: models trained on mouse continue to work well on human, and models trained on IMR90 cells continue to work on embryonic stem cells.
Our approach, nTDP, can form the basis of a unified, explanatory model of the relationship between epigenetic marks and topological domain structures. It can be used to predict domain boundaries for cell types, species, and conditions for which no Hi-C data is available. The model may also be of use for improving Hi-C-based domain finders. The rest of the paper is organized as follows: We start by formally defining nTDP model. Then, we present provably optimal methods to train our model from markers and TADs, as well as to predict TADs over trained model. Lastly, we present results on prediction of domains on the same species as well as across species and cell types.

Related work
Previous work mainly focused on analyzing epigenetic data in an unsupervised way. Segway [16] and Chrom-HMM [17] take as input a collection of genomics datasets and learn chromatin states that exhibit similar epigenetic activity patterns which then have different interpretations such as transcriptionally active, Polycomb-repressed. Libbrecht et al. [18] improve Segway predictions by integrating Hi-C data which is not as abundant as histone data, whereas [19] jointly infers chromatin state maps in multiple genomes by a hierarchical model. However, none of these methods deal directly with TADs. Even though a subset of their chromatin states overlap with TADs, predicting TADs from them heuristically does not perform well. Additionally, they either ignore the histone densities, or make parametric distribution assumptions such as geometric or normal which are not always reflected in the true data. When modified to run in a supervised setting, they cannot capture the most informative subset of epigenetic elements.
The recent approach [20] proposes a supervised learning method based on random forests to predict TAD boundaries from histone modifications and chromatin proteins. In general, this approach is reported to perform quite accurately in predicting boundaries. However, it does not model interior TAD segments and it treats each segment independently ignoring the fact that TADs form as a result of the joint effects of multiple segments.

The likelihood function
Let V be the ordered set of genome restriction fragments (bins), where each bin v represents the interval We propose a supervised, semi-nonparametric, highdimensional model nTDP that uses H to model and predict D. Our model can be seen as a generalization of Conditional Random Field [21,22] where we have continuous weights instead of binary features and where we model the marker effects semi-nonparametrically.
Specifically, we assume there are 3 types of segments in V that are relevant for modeling: those that are at the domain boundaries ( V b ), those that are in the interior of domains ( V i ), and those that are not part of a domain ( V e ), and we have , that will describe the relationship between marker count c and the fragment type (b, i, e) for marker type m. Here, w b m , w i m , w e m are parameters that we will fit to determine the shape of the effect function. Thus, for example, f i m (c, w i m ) will describe how a count of c for marker m influences whether the fragment is in the interior (i) of a domain.
We assume that these effect functions combine linearly. Therefore, let be the total effect of all the markers on fragment v for boundary formation (b). Summations E i vq and E e vq are defined analogously for interior (i) and inter-domain fragments (e).
Let W be the union of model parameters w b m , w i m , w e m , and let D train = {D q : q = 1, . . . , Q} be several domain decompositions (in different sequences or conditions) and let H train = {H q : q = 1, . . . , Q} be a set of corresponding histone markers. Under the assumption that the training pairs are independent, the log-likelihood of parameters W given D train is is the total quality of partition D q and marker data H q under model parameters W. Let V q be the set of segments in pair q. Due to independence of segments: is the partition function defined over all possible nonoverlapping partitions D ′ , c b , c i , c e are relative weights of different types of fragments to account for unbalanced training set, and V q e is the set of fragments that do not belong to any domain in D q .

Nonparametric form of the effect functions
Because the shape of the marker effect function is unknown, we choose the f functions from the nonparametric family of Bernstein basis polynomials. Bernstein polynomials can approximate any effect function and additionally can handle imposed shape constraints such as monotonicity and concavity. Let A be the chosen dimension of these polynomials; larger A results in a more expressive family, but more parameters to fit. Let m max be the maximum possible density of marker m. This is is used to transform the input c q vm to the range [0, 1]; therefore define p are the base Bernstein kernels.

Optimal algorithms for training and inference
We must train the parameters W for the above model using data of the form D train , H train . We will examine these trained parameters (and several good solutions for them) for insights into which markers are most informative for describing D train and thus topological domains.

Problem 1
Training: Given a set of marker data H train , likely from several chromosomes and cell conditions, and corresponding set of TAD decompositions D train , we estimate the most likely parameters W according to Eq. 2.

Problem 2 Inference:
Given marker data H model parameters W, we estimate the best domain partition D of the track.

Training
A nice feature of the objective (3) is that it is convex in its arguments, {w b m , w i m , w e m } m∈M , which follows from linearity, composition rules for convexity, and convexity of the negative logarithm. However, training involves several challenges: (a) computing the partition function  (3), and (b) estimating W so that the weights are sparse. We solve each of these challenges next.

Estimating the partition function
We estimate Z q |V q | in (3) recursively in polynomial time since each segment can belong to one of 4 states: domain start (sb), inside a domain (i), domain end (eb), nondomain (e), and state of each segment depends only on the previous segment's state.

Estimating a sparse set of good histone effect parameters
We would like to augment objective function (2) so that we select a sparse subset of markers from the data and avoid overfitting. If the coefficients w b m = 0 , then there is no influence of marker m. For this purpose, we will impose grouped lasso type of regularization on the coefficients w mk . Grouped lasso regularization has the tendency to select a small number of groups of non-zero coefficients but push other groups of coefficients to be zero.
We introduce two types of regularization. First, we require that many of the weights be 0 using an L 2 -norm regularization term. Second, we want the effect functions {f } to be smooth. Let P = {b, i, e} . We modify our objective to trade off between these goals: where 1 , 2 are the regularization parameters, and R(f p m ) is the smoothing function for effect of marker m at p ∈ P: Group lasso in (6) uses the square of block l 1 -norm instead of l 2 -norm group lasso which does not change the regularization properties [23]. Objective function (6) is convex which follows from the convexity of R(f p m ) as proven in Theorem 1.
Proof Second-order derivative in (7) can be written more explicitly as in (8) according to [24]: (9): We note that (6) is convex, but it is a nonsmooth optimization problem because of the regularizer. We solve it efficiently by using an iterative algorithm from multiple kernel learning [23]. By Cauchy-Schwarz inequality: where γ mp ≥ 0 , m∈M γ mp = 1, p ∈ P , and the equality in (11) holds when This modification turns the objective into the following which is jointly convex in both w p m and γ mp : We solve this by alternating between the optimization of w p m and γ mp . When we fix γ mp , we can find the optimal w p m by any quasi-newton solver such as L-BFGS [25] which runs faster than the other solvers such as iterative scaling or conjugate gradient. When we fixed w p m , we can obtain the best γ mp by the closed form equation (12). Both steps iterate until convergence.

Training extensions
We can model a variety of shape-restricted effect functions by Bernstein polynomials that cannot be easily achieved by other nonparametric approaches such as smoothing splines [24]. We add the following constraints to ensure monotonicity: which is a realistic assumption since increasing the marker density should not decrease its effect. We can also ensure concavity of the effect function by: which has a natural diminishing returns property: the increase in the value of the effect function generated by an increase in the marker density is smaller when output is large than when it is small. Our problem is different than smoothing splines since our loss function is more complicated than traditional spline loss functions due to partition function estimation in (5) which makes it hard to directly apply the smoothing spline methods [26]. In addition, these nonnegativity and other shape constraints can be naturally enforced in our method.
We can also extend the problem to modeling multiple domain subclasses instead of a single class where (12) domains are categorized into subclasses according to their gene-expression profiles such as TADs with highlyactive genes, TADs with repressive genes, etc.

Inferring domains using the trained model
Given marker data H over a single track and W, the inference log-likelihood is: where D = {[s, e] | s, e ∈ V , e − s ≥ 1} is the set of all potential domains of length at least 2 and The intuition is that variable x se = 1 when the solution contains interval [s, e], and variable y v = 1 if v is not assigned to any domain. The log(Z |V | ) term is removed during inference since it is same for all D. We solve (19)- (20) to find the best partition D: where M[v] is the set of intervals that span fragment v. We replace y v in (18) with 1 − [s,e]∈M [v] x se since each segment can be assigned to at most a single domain. (20) ensures that inferred domains do not overlap. Problem (19)- (20) is Maximum Weight Independent Set in interval graph defined over domains which can be solved optimally by dynamic programming in O(|V | 2 ) time.

Experimental setup
We binned ChIP-Seq histone modification and DNaseseq data at 40 kb resolution, estimate RPKM (Reads Per Kilobase per Million) measure for each bin, and transform values x in each bin by log(x + 1) , which reduces the distorting effects of high values. In the case of 2 or more replicates, the RPKM-level for each bin is averaged to get a single histone modification file, in order to minimize batch-related differences. We convert any data mapped to hg19 (mm8) to hg18 (mm9) using UCSC lift-Over tool. We define TADs over human IMR90, human embryonic stem (ES), and mouse ES cells Hi-C data [8] at 40 kb resolution after normalization by [27]. We use consensus domains from Armatus [28] as the true TAD partition by selecting threshold γ where maximum Armatus domain size is closest to the maximum Dixon et al. [8] (18) argmax D log (P (D|W , H) domain size ( γ = 0.5 for IMR90, γ = 0.6 for human ES, and γ = 0.2 for mouse ES cells).
We solved the training optimization problem by L-BFGS [25]. We use the public implementation of Armatus [28], and obtain histone modifications from NIH Roadmap Epigenomics [29] and UCSC Encode [30]. Code and datasets can be found at http://www.cs.cmu. edu/~cking sf/resea rch/ntdp. nTDP is reasonably fast: we train on all human IMR90 chromosomes in less than 3 h on a MacBook Pro with 2.5Ghz processor and 8Gb Ram. The iterative procedure in general converges in fewer than 10 iterations.
We prevent overfitting by following a two-step nested cross-validation which has inner and outer steps. The outer K-fold cross-validation, for example, trains on all autosomal human chromosomes except the one to be predicted. Within each loop of outer cross-validation, we perform (K − 1)-fold inner cross-validation to estimate the regularization parameters.

nTDP finds a small subset of modifications predictive of TADs
We identified a minimal set of histone marks that can model TADs as follows: we run nTDP independently on each chromosome of human IMR90 to obtain 21 sets of marks. These sets overlap significantly across all chromosome pairs (hypergeometric p < 0.05 for all pair-wise comparisons), and a total of 16 modifications cover all chromosomes. Despite the regularization, the weights of several of these marks are still very close to 0, so we identify a non-redundant subset of the modifications by Bayesian information criterion (BIC) [21] which penalizes model complexity more strongly.
As we increase the number of included modifications from 1 to 16, the BIC decrease nearly stops after 4 modifications, with some additional small reduction up to 6 modifications. The sets of 4 and 6 modifications that were most informative are: {H3K36me3, H3K4me1, H3K4me3, H3K9me3} and {H3K4me3, H3K79me2, H3K27ac, H3K9me3, H3K36me3, H4K20me1}. These non-redundant set of elements are preserved when we repeat this procedure between species. We find that only these 4-6 modifications are needed to accurately predict TADs.

These marks are common in good models
The 4 modifications {H3K36me3, H3K4me1, H3K4me3, H3K9me3} are also enriched among a collection of high quality training solutions. We measure the agreement between estimated and true partitions by normalized variation of information NVI = VI log |V | [31] where VI measures the similarity between two partitions and lower score means better performance. We analyze the fraction of models with 4 histone modifications for which NVI score is at least 95% of optimum NVI score obtained by running nTDP over all modifications as in Fig. 1. We find 161, 139 solutions satisfying this criteria among 1820 candidates for human IMR90 and human ES histone modifications respectively. We find the 4 histone modifications above to be significantly overrepresented in the set of models for both human IMR90 and ES cells (hypergeometric p < 0.0001 ). As a boundary case, restricting the effect function to be a linear function of model parameters in human IMR90 cells does not significantly change the results as in Fig. 2. In another species, mouse ES cells, these 4 histone modifications are also the most informative predictors of TADs as in Fig. 3. These significance values combined with the results above suggest the importance of the identified modifications in TADs.

These marks have nearly optimal coherence score
We assess the performance of various subset of modifications by the coherence score which is the exponential of the negative mean log-likelihood of each chromosome on the test set, and it is normalized by the best model coherence score as in Table 1. As such it is a relative measure of the quality of various models. The coherence score using only the set {H3K36me3, H3K4me1, H3K4me3, H3K9me3} is almost as high as the score for all 28 histone modifications in human IMR90. Restricting the effect function shape to be nonnegative and concave slightly improves the score. Similar ordering of models according to coherence scores is also observed in human ES cells as in Table 2. Our analysis indicates that the remaining modifications carry either redundant information or are less important for TADs.

Predicting TADs from histone marks in human
nTDP is able to predict domain boundaries accurately using 4 histone marks alone in both human IMR90 and human ES cells. We compare TAD prediction performance of nTDP with the chromatin state partition predicted by Segway [16] in terms of NVI. Even though Segway does not predict TADs directly, its chromatin state partition can still be used as a baseline. Training with all 28 histone modifications instead of with the identified 4 modifications does not lead to a major performance increase as shown in Fig. 4a even though it increases the training time approximately 4 times for human IMR90 cells. Restricting the effect function to be monotonic and concave only slightly increases the performance. Chromatin states inferred by Segway do not directly correspond to TADs which leads to a lower TAD prediction performance even though they have other meaningful interpretations.  We find combinatorial effects of histone modifications to be important for accurate domain prediction since none of the modifications can achieve NVI score better than 0.2 when considered independently. To verify that there are not inherent structures in the data that can lead to an easy prediction, we randomly shuffle domains in the training set by preserving their lengths without shuffling modifications, which NVI score is never better than 0.3 in all chromosomes showing the importance of histone modification distributions in TADs.
nTDP also predicts TADs accurately across different species as well as across different cell types as in Fig. 4bd. For example, if we train on human IMR90 data, the model we obtain is still able to recover domains in human ES cells (Fig. 4b). Using the identified 4 histone modifications achieves NVI score of 0.13 in human ES whereas using all modifications achieves slightly lower 0.11 on average. This performance difference is not significant except chromosomes 7 and 21 in human ES. This holds true across species as well: training on human ES data, for example, produces a model that can work well on mouse ES data. Accurate prediction of TADs by training with the identified 4 histone modifications across different species and cell types suggests the consistency of the identified modifications across species and cell types.

Multiscale analysis of the predicted TADs
We find that our predicted TADs match true TADs more accurately at different scales defined by different Armatus γ 's as in Fig. 5. We observe a slight performance improvement if we define true TAD partition at lower Armatus γ values in human IMR90 which correspond to longer TADs. This figure suggests that some of our wrong TAD predictions may actually correspond to longer TAD blocks which we erroneously interpret as incorrect due to a scale mismatch.

Conclusion
We formulate semi-nonparametric modeling of TADs in terms of histone modifications, and propose an efficient provably optimal solution nTDP for training and inference. Experimental results on human and mouse cells show that a common subset of histone modifications can accurately predict TADs across cell types and species. Via our trained model, we also accurately  predict TADs without using any Hi-C data which is especially useful for understanding the 3D genome conformation on species with limited Hi-C data. We expect our method to become increasingly useful with faster accumulation of epigenomic datasets than Hi-C interaction data. Additionally, some of our mispredictions may actually correspond to TADs at different scales suggesting a possibility of better inference performance than presented here.