Inference of temporally varying Bayesian Networks

Motivation: When analysing gene expression time series data, an often overlooked but crucial aspect of the model is that the regulatory network structure may change over time. Although some approaches have addressed this problem previously in the literature, many are not well suited to the sequential nature of the data. Results: Here, we present a method that allows us to infer regulatory network structures that may vary between time points, using a set of hidden states that describe the network structure at a given time point. To model the distribution of the hidden states, we have applied the Hierarchical Dirichlet Process Hidden Markov Model, a non-parametric extension of the traditional Hidden Markov Model, which does not require us to fix the number of hidden states in advance. We apply our method to existing microarray expression data as well as demonstrating is efficacy on simulated test data. Contact: thomas.thorne@imperial.ac.uk


Introduction
The analysis of gene expression data in the field of systems biology is a challenging problem for a number of reasons, not least because of the high dimensionality of the data and relative dearth of data points.A number of approaches have been taken to inferring regulatory interactions from such data, often employing graphical models or sparse regression techniques [13,18,22].
These problems are further compounded by the nature of the biological systems under consideration, due to the influence of unobserved actors that may alter the behaviour of the system.Often experiments are performed over long time periods during which it is natural to expect the regulatory interactions at work to change.It is also worth noting that the time scales of regulatory responses to stimuli often differ from those of signalling and metabolic responses, and so it may be that responses to stimuli, around which experiments are often designed, take place in several phases each having different time scales.
Previous studies have attempted to address this problem by introducing changepoints in the time series, allowing the inferred network structure to differ between the resulting segments of the time series.For example in Lèbre et al. [14] a changepoint model is applied in which Dynamic Bayesian Networks are inferred for each segment of the time series.However such approaches may place strong prior assumptions on the number of changepoints that may be observed, and do not adjust for the complexity of the observed data automatically.Instead in Grzegorczyk et al. [10] an allocation sampler is used in combination with Bayesian Networks to assign each observation to a group, but unlike changepoint models this method treats the observations as being exchangeable, ignoring the fact that the data are sequential.The similar methodology in Ickstadt [11] uses a more flexible nonparametric prior on group assignments, applied to the modelling of molecular interactions using Bayesian Networks, but suffers the same drawbacks in not recognising the sequential nature of the data.
Here we present a methodology that allows us to infer network structures that may change between observations in a nonparametric framework whilst modelling the sequential nature of the data.To that end we have employed the infinite hidden Markov model (iHMM) of Beal et al. [1], also known as the hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) [25], in particular the "Sticky" extension of Fox et al. [5], in conjunction with a Bayesian network model of the gene regulatory network structure.The HDP-HMM allows the number of different states of the network structure to adapt as necessary to explain the observed data, including a potentially infinite number of states, of course restricted in practice by the finite number of experimental observations.In the previous work of Rodriguez et al. [20] it was demonstrated that the HDP-HMM outperforms a Dirichlet Process mixture for Gaussian graphical models on heterogeneous time series.
We apply our methodology to both simulated data and gene expression data for Arabadopsis Thaliana and Drosophila Melanogaster, demonstrating its effectiveness in detecting changes in network structure from time series data.

Methods
Given gene expression time series data over m genes at n time points, we denote the observations as the n × m matrix X = (x 1 , . . ., x n ) T , where x j = (x j1 , . . ., x jm ) T , the column vector of expression levels for each of the m genes at time point j.We formulate our model as a Hierarchical Dirichlet Process Hidden Markov Model, a stochastic process whereby a set of hidden states s 1 , . . ., s n govern the parameters of some emission distribution F over a sequence of time points 1 . . .n.
Each observation x j is then generated from a corresponding emission distribution F (θ k ), where s j = k.For our emission distributions, F , we use a Bayesian Network model over the m variables to represent the regulatory network structures corresponding to each hidden state.

Hierarchical Dirichlet Process Hidden Markov Models
To model a hidden state sequence that evolves over time we apply the methodology first introduced in Beal et al. [1] whereby a finite state Hidden Markov Model, consisting of a set of hidden states s 1 , . . ., s n over some alphabet 1 . . .K, is extended so that K → ∞.In a classical Hidden Markov Model, the number of states K is typically specified in advance, and states follow a Markovian distribution whereby transitions are made between states with probability π kl = p(s j = l|s j−1 = k), so that the next state in the sequence is drawn conditional only on the previous state.
The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) [1,25,26] instead applies a Dirichlet Process prior to the transition probabilities π k• out of each of the states, and uses a hierarchical structure to couple the distributions between the individual states to ensure a shared set of potential states into which transitions can be made across all of the π.This allows for an unlimited number of potential states, of course limited in practice by the number of observed data points.
More formally, each hidden state k possesses a Dirichlet Process G k from which the next state is drawn, and a common (discrete) base measure G 0 is shared between the G k , so that G k ∼ DP(α, G 0 ).As a result transitions are made into a discrete set of states shared across all of the G k , and drawn from G 0 .The base measure G 0 is in turn drawn from a Dirichlet Process, G 0 ∼ DP(γ, H), H being our prior over parameters for the emission distributions F k .
Then using the stick breaking construction of Sethuraman [23] for G 0 and drawing θ l independently from H, we have that G 0 = ∞ l β l δ θ l , with β ∼ GEM (γ), and so G k = ∞ l π kl δ θ l with π k ∼ DP(α, β).The resulting model is outlined in figure 1.
However in a biological system it is probably realistic to assume that only a subset of the wide variety of potential behaviours of the hidden state sequence is relevant, as behaviour such as rapid The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) represented as a graphical model.
cycling between states at adjacent time points would a priori seem to be unlikely to be observed in most gene expression data sets.Thus we choose to apply the Sticky HDP-HMM [4,5], that introduces an extra parameter κ that biases the prior probability of transitions between states towards remaining in the current state rather than transitioning to a differing one.Adding such a prior assumption simply states that we expect the state of the system to remain the same between successive time points; this is both parsimonious and would seem to be justified in the case of gene expression microarray data sets, where we might only expect to observe a small number of transitions to differing states across the time series.
This modification to the HDP-HMM gives us a model generating observed data points x j as [5] β|γ ∼ GEM(γ), x j |s j ∼ F (θ s j ). (5)

Gibbs sampling for the Sticky HDP-HMM
To sample from the hidden state sequence we have used a Gibbs sampling procedure based on the conditional probabilities for the hidden state s i given the remaining hidden states s −i [4], updating each hidden state individually in a sweep over the n states, where s −j denotes the state sequence s 1 , . . ., s n excluding s j , N −j kl indicates the number of observed transitions from state k to state l within the hidden state sequence s −j , and N −j k• the total number of transitions from state k within s −j .

Bayesian Network emission distributions
To model the regulatory network structure corresponding to the hidden states of the HDP-HMM, we have applied a Bayesian Network methodology to capture the relationships between the genes represented in our data.Thus each hidden state has a unique Bayesian Network describing the interactions occurring between the genes at the time points corresponding to a particular state.
Bayesian Networks are probabilistic models, whereby a directed graph defines the conditional independence relationships between a set of random variables [12].For the model to remain consistent, the graph structure G, with nodes u ∈ N G representing random variables and directed edges (v, u) ∈ E G representing conditional probability relationships between them, must be acyclic.
For a given Bayesian network structure G and model parameters θ, the joint distribution p(X|G, θ) factorises as a product of local distributions for each node, where for each observation the value x iu of a node u is dependent on the values of its set of parent nodes pa G (u) = {v ∈ N |(v, u) ∈ G} and some parameters θ u .Here we have used a BGe model [7] that allows the variables to take continuous values and defines the local distributions for each observation i ∈ 1, . . ., m of a gene u as with parameters θ u = µ u , b u , σ 2 u .Then using a conjugate Normal-Wishart prior and making certain assumptions we can calculate the local marginal likelihoods p(x u |pa G (u)) as described in Geiger and Heckerman [7] and from these derive the joint probability p(X|G).
Unfortunately, due to the restriction of the network structure to that of a directed acyclic graph (DAG) it is difficult to explore the space of possible network structures.Several MCMC schemes have been proposed, including those of Grzegorczyk and Husmeier [9] and Madigan et al. [16], but performing random walks over DAG network structures faces the problem that proposing moves that maintain the DAG structure can be complex and time consuming, and mixing of the Markov chain can be slow.
However, as noted in Friedman and Koller [6], a DAG structure G corresponds to a partial ordering on the nodes and so induces a (non-unique) total ordering, and so the method of Friedman and Koller [6] instead proposes to perform a random walk over total orderings of the nodes.This allows the Markov chain to move quickly around the space of possible graph structures, improving the mixing properties of the chain.
Whilst this introduces a bias in the prior distribution over graph structures [9] it greatly simplifies the computational complexity of the MCMC procedure and such a bias may be justified by arguments of parsimony, since graphs consistent with more orderings are more likely to be sampled.Furthermore the uniform prior on DAG structures is not uniform over Markov equivalent graphs, and so also introduces a different kind of bias in the results.Finally, a trivial modification of the algorithm of Friedman and Koller [6] allows for a correction of the bias [3].
Thus in our methodology we apply the MCMC sampler of Friedman and Koller [6] to infer Bayesian Network structures for each hidden state of the HDP-HMM by sampling over total orderings of the nodes given the data points corresponding to the state in question.It is easy to calculate the likelihood of an ordering using [6] where pa G denotes the set of possible parent sets over the nodes of G consistent with the ordering .Then we can use a Metropolis Hastings sampler to sample from the posterior of orderings p( |X) = p(X| )p( ) [3], by beginning with an initial ordering and proposing and accepting new orderings , distributed as q( | ) with probability according to the Metropolis Hastings acceptance probability over a number of iterations.We choose to propose changes by swapping nodes in the ordering rather than more complex schemes such as "deck cutting", as these were found to have little impact on performance in previous studies [3,6].Proposals ∼ q(•| ) are thus drawn by selecting two nodes within the ordering uniformly at random and exchanging their positions to produce a new ordering.In the absence of a compelling alternative, we take the prior over orderings p( ) as the uniform distribution.
Then for our emission distribution for a given state k we apply a Bayesian Network ordering k generating observed data points X k distributed as p(X k | k ) where by X k we denote the subset of X ij including only rows i corresponding to states s i = k.
The full method is outlined in algorithm 1, and combines Gibbs updates of the hidden state sequence with Metropolis Hastings updates of the node orderings of the Bayesian Networks for each state at every iteration.To sample hidden state sequences and orderings from the posterior distribution the algorithm is first run for a number of burn-in iterations, after which samples are collected.Since a single iteration of our algorithm combines a full Gibbs update sweep along with an update of the Bayesian Network orderings over a number of Metropolis Hastings steps, in practice a comparatively small number of iterations of the algorithm are required to reach the posterior.To reduce the computational complexity of the Bayesian Network inference, we restrict the number of potential parents of a gene to be 2 or less.Even in such a case, we still face a large number of possible parent sets, of size and so in the analyses presented below we restrict our data set to a subset of genes of special interest, as is commonly the case in gene expression data analysis.
Finally, once we have inferred the hidden state sequence and generated a posterior sample of orderings corresponding to each state, we can then easily sample DAG structures from the posterior by first sampling an order from the posterior of a given state, and then sampling from the graphs consistent with this ordering, weighting the choice of parents by the local scores, and optionally attempting to account for the bias in the prior as described in Ellis and Wong [3].Initialise state sequence s 1 , . . ., s n ; Set K ← number of states in s; Initialise orderings 1 , . . ., K ; for iter ← 1 to N iter do for j ← 1 to n do for k ← 1 to K do l k ← p( k |X k )p(s j = k|s −j , α, β, κ); end Sample new ordering K+1 from p( ) l K+1 ← p( K+1 |x j )p(s j = K + 1|s −j , α, β, κ); s j ← sample from 1, . . ., K + 1 with weights l; Update K; end Sample β|α, γ, s 1 , . . ., s n ; Algorithm 1: MCMC sampler for HDP-HMM Bayesian Networks 3 Applications

Example -simulated data
To evaluate the efficacy of our method we generated simulated data from three different Bayesian network structures and interleaved the data points into a single time series.Applying our methodology to this data we then attempted to recover the hidden state sequence.Three different Bayesian Networks of 10 nodes each having random structures and parameters were used, with the restriction that each node had at most two parents.Such a restriction is realistic for real world biological networks and reduces the computational complexity of the Bayesian network inference as the number of potential sets of parents of each node is greatly reduced by constraining the search.A total of 100 data points were used, consisting of a sequence of 25 generated by the first network, 25 by a second network structure, another 25 from a third network structure and finally a further 25 data points generated by the second network structure.
The Gibbs sampling MCMC scheme outlined in algorithm 1 was applied over 500 iterations after a 1000 iteration burn in, with 100 MCMC iterations of the Bayesian network order sampler run on each network structure between each Gibbs sweep.
In figure 2 we show a comparison of the true hidden state sequence with the state sequences for the 500 samples from the Gibbs sampler.Our method perfectly recreates the original hidden state sequence, correctly identifying that the network structure is the same between two separate segments of the sequence of data points.

D. melanogaster midgut development gene expression data
Applying our method to real world gene expression data, we took the publicly available gene expression data set of Li and White [15], as stored in the Gene Expression Omnibus database [2].This data set gives tissue specific expression levels for genes in D. melanogaster midgut at time points before and after pupariam formation, taken at 11 time points.A subset of genes to analyse was chosen by selecting genes having the highest variance across the time series, using the genefilter R package in Bioconductor www.bioconductor.org[8,19].This resulted in a data set of 23 genes at 11 time points.This allows us to apply our approach without having to consider the additional issues arising the 'large-p-small-n' problem.
The results shown in figure 3 identify two regions of the time series having different network structures, with a change occuring after the 0 hour time point at which pupariam formation occurs.This suggests that a different structure of regulatory interactions is at work during the midgut development after the pupariam formation begins.The networks inferred for each of the different states are shown in figure 4, illustrating a clear change between differing network structures.

Transcriptome of starch metabolism during A. thaliana diurnal cycle
We have also analysed the gene expression data set of Smith et al. [24], as included in the GeneNet [21] R package [19].The data set consists of expression levels for 800 genes encoding enzymes involved in starch synthesis and in conversion of starch to maltose and Glc, at 11 time points over 12 hours, transitioning from dark to light.The first 5 time points were collected during a dark period after which a switch to a light period was made, with time points spaced so that expression is measured at 0, 1, 2, 4 and 8 hours after the switch to the dark period, and the same intervals after the switch to the light period [24], as well as a final 24 hour time point at the switch back to the dark period.A reduced subset of the 800 genes in the data set was selected using the genefilter R package, as described previously, giving a subset of 40 genes that were analysed using our method.
In figure 5 we show the results generated by our method, clearly indicating two distinct phases within the time series.It appears that one phase is detected from 1 to 12 hours, with a second phase inferred between 13 and 24 hours, that is also represented at the initial time point.This is consistent with the design of the experiment, as a change in expression would perhaps not be expected to be    Figure 6: Sampled Bayesian Network structures for the two states inferred by our method applied to the A. thaliana diurnal cycle expression data [24].
observed immediately at the point at which the switch between light and dark takes place, but rather later at a subsequent time point, as is observed here.Since the 24 hour time point was taken under the same conditions as the initial time point, one would expect these two time points to be grouped together using our method.The networks inferred for the two different phases, shown in figure 6, again demonstrate a clear change in the network structure, with the two networks having distinct topologies.

Discussion
From our simulated data it is clear that the HDP-HMM Bayesian Network sampler we have constructed accurately infers the hidden state sequences governing Bayesian Networks that capture how the regulatory organisation of a biological system changes with time.The accuracy of our method on test data suggests that it will perform well on real world data sets, whilst the existence of more sophisticated and demonstrably more efficient samplers indicates that there is room for even further improvement and computational efficiency.For example the beam sampler of Van Gael et al. [27] and the Hierarchical Chinese Restaurant Process (HCRP) formalism of Makino et al. [17] have better mixing and perform better than standard Gibbs samplers, especially on time series such as those we examine here where neighbouring states are likely to be correlated.We remark that it is essential to consider the fluid nature of regulatory network structures when inferring networks from data sets where such change is likely.Performing an analysis on data using a model with a fixed network structure, when it is known that the network structure will change due to some stimulus, is inherently incorrect, and thus will introduce unnecessary bias into the results.Whilst it may be possible to infer correct results from an incorrect model, it would not seem wise to rely on such approaches when alternatives exist.
Our methodology crucially accounts for the sequential nature of the data, something that has previously been overlooked [10,11], but we feel is crucial to the modelling of gene expression time series data sets.Furthermore our methodology has an advantage over changepoint models that data may be shared between distinct segments of the time series sharing the same hidden state when inferring the network structure -something that is explicitly represented in our model, but generally omitted in changepoint models.This is especially import in gene expression data analysis where time points are a scarce and valuable resource.
The versatility of the HDP-HMM means that our methodology is applicable not only to time series data where the underlying process is divided into distinct contiguous segments, as would be expected in gene regulatory networks, but also to processes describable by a Markov process, for example rapidly changing between a sequence of hidden states with some underlying transition mechanism.Thus it may be of use for other problems of network inference in systems biology outside of the area of sequential gene expression time series data, or in other fields where networks that change with time are encountered.
Finally whilst other methods may require manual specification of an appropriate prior distribution on the number of possible states, taking a nonparametric approach allows our prior distribution to naturally expand to explain the observed data as the size and complexity of the data grows.Bayesian nonparametric methods demonstrably outperform regular priors in a variety of applications and we have shown here their potential in modelling hidden variables in theoretical systems biology.

Figure 2 :
Figure 2: (a) Counts of the number of the 500 posterior samples in which pairs of data points where assigned to the same state.(b) The same illustration for the true hidden state sequence.

Figure 3 :
Figure 3: Posterior distribution of states at each time point inferred by our method applied to the D. melanogaster midgut development expression data [15].States are represented by colours, and frequencies of their appearance for each time point in the posterior samples is plotted.

Figure 4 :
Figure 4: Sampled Bayesian Network structures for the two states inferred by our method applied to the D. melanogaster cell cycle expression data [15].

Figure 5 :
Figure 5: Posterior distribution of states at each time point inferred by our method applied to the A. thaliana diurnal cycle expression data [24].States are represented by colours, and frequencies of their appearance for each time point in the posterior samples is plotted.