Inferring Time-Varying Network Topologies from Gene Expression Data

Most current methods for gene regulatory network identification lead to the inference of steady-state networks, that is, networks prevalent over all times, a hypothesis which has been challenged. There has been a need to infer and represent networks in a dynamic, that is, time-varying fashion, in order to account for different cellular states affecting the interactions amongst genes. In this work, we present an approach, regime-SSM, to understand gene regulatory networks within such a dynamic setting. The approach uses a clustering method based on these underlying dynamics, followed by system identification using a state-space model for each learnt cluster—to infer a network adjacency matrix. We finally indicate our results on the mouse embryonic kidney dataset as well as the T-cell activation-based expression dataset and demonstrate conformity with reported experimental evidence.


INTRODUCTION
Most methods of graph inference work very well on stationary time-series data, in that the generating structure for the time series does not exhibit switching. In [1,2], some useful method to learn network topologies using linear statespace models (SSM), from T-cell gene expression data, has been presented. However, it is known that regulatory pathways do not persist over all time. An important recent finding in which the above is seen to be true is following examination of regulatory networks during the yeast cell cycle [3], wherein topologies change depending on underlying (endogeneous or exogeneous) cell condition. This brings out a need to identify the variation of the "hidden states" regulating gene network topologies and incorporating them into their network inference framework [4]. This hidden state at time t (denoted by x t ) might be related to the level of some key metabolite(s) governing the activity (g t ) of the gene(s). These present a notion of condition specificity which influence the dynamics of various genes active during that regime (condition). From time-series microarray data, we aim to partition each gene's expression profile into such regimes of expression, during which the underlying dynamics of the gene's controlling state (x t ) can be assumed to be stationary. In [5], the powerful notion of context sensitive boolean networks for gene relationships has been presented. However, at least for short timeseries data, such a boolean characterization of gene state requires a one-bit quantization of the continuous state, which is difficult without expert biological knowledge of the activation threshold and knowledge of the precise evolution of gene expression. Here, we work with gene profiles as continuous variables conditioned on the regime of expression. Each regime is related to the state of a state-space model that is estimated from the data.
Our method (regime-SSM) examines three components: to find the switch in gene dynamics, we use a change-point detection (CPD) approach using singular spectrum analysis (SSA). Following the hypothesis that the mechanism causing the genes to switch at the same time came from a common underlying input [3,6], we group genes having similar change points. This clustering borrows from a mixture of Gaussian (MoG) model [7]. The inference of the network adjacency matrix follows from a state-space representation of expression dynamics among these coclustered genes [1,2]. Finally, we present analyses on the publicly available embryonic kidney gene expression dataset [8] and the T-cell 2 EURASIP Journal on Bioinformatics and Systems Biology activation dataset [1], using a combination of the above developed methods and we validate our findings with previously published literature as well as experimental data.
For the embryonic kidney dataset, the biological problem motivating our network inference approach is one of identifying gene interactions during mammalian nephrogenesis (kidney formation). Nephrogenesis, like several other developmental processes, involves the precise temporal interaction of several growth factors, differentiation signals, and transcription factors for the generation and maturation of progenitor cells. One such key set of transcription factors is the GATA family, comprising six members, all containing the (-GATA-) binding domain. Among these, Gata2 and Gata3 have been shown to play a functional role [8,9] in nephric development between days 10-12 after fertilization. From a set of differentially expressed genes pertinent to this time window (identified from microarray data), our goal is to prospectively discover regulatory interactions between them and the Gata2/3 genes. These interactions can then be further resolved into transcriptional, or signaling interactions on the basis of additional biological information.
In the T-cell activation dataset, the question is if events downstream of T-cell activation can be partitioned into early and late response behaviors, and if so, which genes are active in a particular phase. Finally, can a network-level influence be inferred among the genes of each phase and do they correlate with known data? We note here that we are not looking for the behavior of any particular gene, but only interested in genes from each phase.
As will be shown in this paper, regime-SSM generates biologically relevant hypotheses regarding time-varying gene interactions during nephric development and T-cell activation. Several interesting transcripts are seen to be involved in the process and the influence network hereby generated resolves cyclic dependencies.
The main assumption for the formulation of a linear state-space model to examine the possibility of gene-gene interactions is that gene expression is a function of the underlying cell state and the expression of other genes at the previous time step. If longer-range dependencies are to be considered, the complexity of the model would increase. Another criticism of the model might be that nonlinear interactions cannot be adequately modeled by such a framework. However, around the equilibrium point (steady state), we can recover a locally linearized version of this nonlinear behavior.

SSA AND CHANGE-POINT DETECTION
First we introduce some notations. Consider N gene expression profiles, g (1) , g (2) , . . . , g (N) ∈ R T , T being the length of each gene's temporal expression profile (as obtained from microarray expression). The jth time instant of gene i's expression profile will be denoted by g (i) j . State-space partitioning is done using singular spectrum analysis [10] (SSA). SSA identifies structural change points in time-series data using a sequential procedure [11]. We will briefly review this method.
Consider the "windowed" (width N W ) time-series data given by {g (i) 1 , g (i) 2 , . . . , g (i) NW }, with M (M ≤ N W /2) as some integer-valued lag parameter, and a replication parameter K = N W − M + 1. The SSA procedure in CPD involves the following.
(i) Construction of an l-dimensional subspace: here, a "trajectory matrix" for the time series, over the interval [n + 1, n + T] is constructed, where T yields a collection of singular vectors-a grouping of l of these Singular vectors, corresponding to the l highest eigenvalues-denoted by I = {1, . . . , l}, establishes a subspace L n,I of R M .
(iii) Construction of the test matrix: Here, we use the length (p) and location (q) of test sample. We choose p ≥ K, with K = N W − M + 1. Also q > p, here we take q = p + 1. From this construction, the matrix columns are the vectors G i,(n) j , j = p + 1, . . . , q. The matrix has dimension M × Q, Q = (q − p) = 1.
(iv) Computation of the detection statistic: the detection statistics used in the CPD are (a) the normed Euclidean distance between the column span of the test matrix, that is, G i,(n) j and the ldimensional subspace L n,I of R M . This is denoted by D n,I,p,q ; (b) the normalized sum of squares of distances, denoted by S n = D n,I,p,q /MQμ n,I , with μ n,I = D m,I,0,K , where m is the largest value of m ≤ n so that the hypothesis of no change is accepted; (c) a cumulative sum-(CUSUM-) type statistic W 1 = S 1 , W n+1 = max{(W n + S n+1 − S n − 1/3MQ), 0}, n ≥ 1.
The CPD procedure declares a structural change in the time series dynamics if for some time instant n, we observe W n > h with the threshold h = (2t α /(MQ)) (1/3)q(3MQ − Q 2 + 1), t α being the (1 − α) quantile of the standard normal distribution.
(v) Choice of algorithm parameters: (a) window width (N W ): here, we choose N W T/5, T being the length of the original time series, the algorithm 3 provides a reliable method of extracting most structural changes. As opposed to choosing a much smaller N W , this might lead to some outliers being classified as potential change points, but in our set-up this is preferred in contrast to losing genuine structural changes based on choosing larger N W ; (b) choice of lag M: in most cases, choose M = N W /2.

MIXTURE-OF-GAUSSIANS (MoG) CLUSTERING
Having found change points (and thus, regimes) from the gene trajectories of the differentially expressed genes, our goal is to now group (cluster) genes with similar temporal profiles within each regime. In this section, we derive the parameter update equations for a mixture-of-Gaussian clustering paradigm. As will be seen later, the Gaussian assumptions on the gene expression permit the use of coclustered genes for the SSM-based network parameter estimation.
We now consider the group of gene expression profiles G = {g (1) , g (2) , . . . , g (n) }, all of which share a common change point (time of switch)-c 1 . Consider gene profile i, Tc 1 ] T , a T c1 -dimensional random vector which follows a k-component finite mixture distribution described by where α 1 , . . . , α k are the mixing probabilities, each φ m is the set of parameters defining the mth component, and θ ≡ {φ 1 , . . . , φ k , α 1 , . . . , α k } is the set of complete parameters needed to specify the mixture. We have For a set of n independently and identically distributed samples, (1) , g (2) , . . . , g (n) , the log-likelihood of a k-component mixture is given by (i) Treat the labels, Z = {z (1) , . . . , z (n) }, associated with the n samples-as missing data. Each label is a binary vector where z (i) m = 1 and z (i) p = 0, for p = m indicate that sample g (i) was produced by the mth component.
In this setting, the expectation maximization algorithm can be used to derive the cluster parameter (θ) update equations.
In the E-step of the EM algorithm, the function Q(θ, where w (i) m is the posterior probability of the event z (i) m = 1, on observing g (i) m . The estimate of the number of components (k) is chosen using a minimum message length (MML) criterion [7]. The MML criterion borrows from algorithmic information theory and serves to select models of lowest complexity to explain the data. As can be seen below, this complexity has two components: the first encodes the observed data as a function of the model and the second encodes the model itself. Hence, the MML criterion in our setup becomes, N p is number of parameters per component in the k component mixture, given the number of clusters k min ≤ k ≤ k max .
For the kidney expression data, since we are interested in the role of Gata2 and Gata3 during early kidney development, we consider all the genes which have similar change points as the Gata2 and Gata3 genes, respectively. We perform an MoG clustering within such genes and look at those coclustered with Gata2 or Gata3. Coclustering within a regime potentially suggests that the governing dynamics are the same, even to the extent of coregulation. We note that just because a gene is coclustered with Gata2 in one regime, it does not mean that it will cocluster in a different regime. This approach suggests a way to localize regimes of correlation instead of the traditional global correlation measure that can mask transient and condition-specific dynamics. For this gene expression data, the MML penalized criterion indicates that an adequate number of clusters to describe this data is 4 EURASIP Journal on Bioinformatics and Systems Biology two (k = 2). In Tables 1 and 2, we indicate some of the genes with similar coexpression dynamics as Gata2/Gata3 and a cluster assignment of such genes. We observe that this clustering corresponds to the first phase of embryonic development (days 10-12 dpc), the phase where Gata2 and Gata3 are perhaps most relevant to kidney development [12][13][14][15].
A word about Table 1 is in order. The entries in each column of a row (gene) indicate the change points (as found by the SSA-CPD procedure) in the time series of the interpolated gene expression profile. Our simulation studies with the T-cell data indicate that the SSM and CoD performance is not much worse with the interpolated data compared to the original time series (Table 7). We note that because of the present choice of parameters N W , we might have the detection of some false positive change points, but this is preferable to the loss of genuine change points. An examination of the change points of the various genes in Table 1 indicates three regimes-between points approximately 1-5, 5-11 and 12-20. The missing entries mean that there was no change point identified for a certain regime and are thus treated as such. Since our focus is early Gata3 behavior, we are interested in time points 1-12, and hence we examine the evolution of network-level interactions over the first two regimes for the genes coclustered in these regimes.
To clarify the validity of the presented approach, we present a similar analysis on another data set-the T-cell expression data presented in [1]. This data looks at the expression of various genes after T-cell activation using stimulation with phorbolester PMA and ionomycin [16]. This data has the profiles of about 58 genes over 10 time points with 44(34 + 10) replicate measurements for each time point. Since here we have no specific gene in mind (unlike earlier where we were particularly interested in Gata3 behavior), the change point procedure (CPD) yields two distinct regimesone from time points 1 to 4 and the other from time points 5 to 10. Following the MoG clustering procedure yields the optimal number of clusters to be 1 (from MML) in each regime. We therefore call these two clusters "early response" and "late response" genes and then proceed to learn a network relationship amongst them, within each cluster. The CPD and cluster information for the early and late responses are summarized in Table 3.

STATE-SPACE MODEL
For a given regime, we treat gene expression as an observation related to an underlying hidden cell state (x t ), which is assumed to govern regime-specific gene expression dynamics for that biological process, globally within the cell. Suppose there are N genes whose expression is related to a single process. The ith gene's expression vector is denoted as where T is the number of time points for which the data is available. The state-space model (SSM) is used to model the gene expression (g (i) t , i = 1, 2, . . . , N and t = 1, 2, . . . , T) as a function of this underlying cell state (x t ) as well as some external inputs. A notion of influence among genes can be integrated into this model by considering the SSM inputs to be the gene expression values at the previous Table 1: Change-point analysis of some key genes, prior to clustering (annotations in Table 8). The numbers indicate the time points at which regime changes occur for each gene.  Table 8).

Gene symbol Change point I Change point II Change point III
Genes with the same dynamics as Gata3 Genes with the same dynamics as Gata2 Cldn4 Table 3: Some of the genes related to early and late responses in T-cell activation (annotations in Table 9).
Genes related to early response (time points: 1-4) Genes related to late response (time points: 5-10) IL2r (ii) observation equation: t ] T . A likelihood method [1] is used to estimate the state dimension K. The noise vectors e s,t and e o,t are Gaussian distributed with mean 0 and covariance matrices Q and R, respectively. From the state and observation equations (10) and (11), we notice that the matrix-valued parameter D = [D i, j ] j=1,...,N i=1,...,N quantifies the influence among genes i and j from one time instant to the next, within a specific regime. To infer a biological network using D, we use bootstrapping to estimate the distribution of the strength of association estimates amongst genes and infer network linkage for those associations that are observed to be significant.
Within this proposed framework, we segment the overall gene expression time trajectories into smaller, approximately stationary, gene expression regimes. We note that the MoG clustering framework is a nonlinear one in that the regimespecific state space is partitioned into clusters. These cluster assignments of correlated gene expression vectors can change with regime, allowing us to capture the sets of genes that interact under changing cell condition.

SYSTEM IDENTIFICATION
We consider the case where we have R g = B × P realizations of expression data for each gene available. Arguably, mRNA level is a measure of gene expression, B(= 2) denotes the number of biological replicates, and P(= 16 perfect match probes) denotes the number of probes per gene transcript. Each of these R g realizations is T-time-point long and is obtained from Affymetrix U74Av2 murine microarray raw CEL files. In the section below, we derive the update equations for maximum-likelihood estimates of the parameters A, B, C, D, Q and R (in (10) and (11)) using an EM algorithm, based on [17,18]. The assumptions underlying this model are outlined in Table 4. A sequence of T output vectors (g 1 , g 2 , . . . , g T ) is denoted by {g}, and a subsequence {g t0 , g t0+1 , . . . , g t1 } by {g} t1 t0 . We treat the (x t , g t ) vector as the complete data and find the log-likelihood log P({x}, {g}) under the above assumptions. The complete E-and M-steps involved in the parameter update steps are outlined in Tables 5 and 6.

BOOTSTRAPPED CONFIDENCE INTERVALS
As suggested above, the entries of the D matrix indicate the strength of influence among the genes, from one time step to the next (within each regime). We use bootstrapping to find confidence intervals for each entry in the D matrix and if it is significant, we assign a positive or negative direction (+1 or −1) to this influence.
The bootstrapping procedure [19] is adapted to our situation as follows. Initial state covariance (iv) Using the above obtained sample mean and variance, estimate confidence intervals for the elements of D. If D lies in this bootstrapped confidence interval, we infer a potential influence and if not, we discard it. Note that even though we write D, we carry out this hypothesis test for each D i, j , i = 1, . . . , n; j = 1, . . . , n; for each of the n genes under consideration in every regime.

SUMMARY OF ALGORITHM
Within each regime identified by CPD, we model gene expression as Gaussian distributed vectors. We cluster the genes using a mixture-of-Gaussians (MoG) clustering algorithm [7] to identify sets of genes which have similar "dynamics of expression" -in that they are correlated within that regime. We then proceed to learn the dynamic system parameters (matrices A, B, C, D, Q, and R) for the state-space model (SSM) underlying each of the clusters. We note two important ideas: (i) we might obtain different cluster assignments for the genes depending on the regime; (ii) since all these genes (across clusters within a regime) are still related to the same biological process, the hidden state x t is shared among these clusters.
Therefore, we learn the SSM parameters in an alternating manner by updating the estimates from cluster to cluster Table 6: E-step of the EM algorithm for state-space parameter estimation.

E-
Step Forward while still retaining the form of the state vector x t . The learning is done using an expectation-maximization-type algorithm. The number of components during regime-specific clustering is estimated using a minimum message length criterion. Typically, O(N) iterations suffice to infer the mixture model in each regime with N genes under consideration. Thus, our proposed approach is as follows.
(i) Identify the N key genes based on required phenotypical characteristic using fold change studies. Preprocess the gene expression profiles by standardization and cubic spline interpolation. (ii) Segment each gene's expression profile into a sequence of state-dependent trajectories (regime change points), from underlying dynamics, using SSA. (iii) For each regime (as identified in step 2), cluster genes using an MoG model so that genes with correlated expression trajectories cluster together. Learn an SSM [17,18] for each cluster (from (10) and (11) for estimation of the mean and covariance matrices of the state vector) within that regime. The input to observation matrix (D) is indicative of the topology of the network in that regime.
(iv) Examine the network matrices D (by bootstrapping to find thresholds on strength of influence estimates) across all regimes to build the time-varying network.
The discussion of the network inference procedure would be incomplete in the absence of any other algorithms for comparison. For this purpose, we implement the CoD-(coefficient-of-determination-) based approach [20,21] along with the models proposed in [1] (SSM) and [22] (GGM). The CoD method allows us to determine the association between two genes within a regime via an R 2 goodness of fit statistic. The methods of [1,22] are implemented on the time-series data (with regard to underlying regime). Such a study would be useful to determine the relative merits of each approach. We believe that no one procedure can work for every application and the choice of an appropriate procedure would be governed by the biological question under investigation. Each of these methods use some underlying assumptions and if these are consistent with the question that we ask, then that method has great utility. These individual results, their evaluation, and their comparison are summarized in Section 8.

Application to the GATA pathway
To illustrate our approach (regime-SSM), we consider the embryonic kidney gene expression dataset [8] and study the set of genes known to have a possible role in early nephric development. An interruption of any gene in this signaling cascade potentially leads to early embryonic lethality or abnormal organ development. An influence network among these genes would reveal which genes (and their products) become important at a certain phase of nephric development. The choice of the N(= 47) genes is done using FDR fold change studies [23] between ureteric bud and metanephric mesenchyme tissue types, since this spatial tissue expression is of relevance during early embryonic development. The dataset is obtained by daily sampling of the mRNA expression ranging from 11.5-16.5 days post coitus (dpc). Detailed studies of the phenotypes characterizing each of these days is available from the Mouse Genome Informatics Database at http://www.informatics.jax.org/. We follow [24] and use interpolated expression data pre-processing for cluster analysis. We resample this interpolated profile to obtain twenty points per gene expression profile. Two key aspects were confirmed after interpolation [24,25]: (1) there were no negative expression values introduced, (2) the differences in fold change were not smoothed out.
Initial experimental studies have suggested that the 10.5-12.5 dpc are relatively more important in determination of the course of metanephric development. We chose to explore which genes (out of the 47 considered) might be relevant in this specific time window. The SSA-CPD procedure identified several genes which exhibit similar dynamics (have approximately same change points, for any given regime) in the early phase and distinctly different dynamics in later phases (Table 1).
Our approach to influence determination using the statespace model yields up to three distinct regimes of expression over all the 47 genes identified from fold change studies between bud and mesenchyme. MoG clustering followed by state-space modeling yield three regime topologies of which we are interested in the early regime (days 10.5-12.5). This influence topology is shown in Figure 1. We compare our obtained network (using regime-SSM) with the one obtained using the approach outlined in [1], shown in Figure 2. We note that the network presented in Figure 2 extends over all time, that is, days 10.5-16.5 for which basal influences are represented but transient and condition-specific influences may be missed. Some of these transient influences are recaptured in our method ( Figure 1) and are in conformity (lower false positives in network connectivity) with pathway entries in Entrez Gene [15] as well as in recent reviews on kidney expression [8,12] (also, see Table 8). For example, the Mapk1-Rara [26] or the Pax2-Gdf11 [27] interactions are completely missed in Figure 2this is seen to be the case since these interactions only occur during the 10.5-12.5 dpc regime. We also see that the Acvr2b-Lamc2 [28] interaction is observed in the steady state but not in the first regime. This interaction becomes active in the second regime (first via the Acvr2b-Gdf11 and then via the Gdf11-Lamc2), indicating that it might not have particular relevance in the day 10.5-12.5 dpc stage. Several of these predicted interactions need to be experimentally characterized in the laboratory. It is especially interesting to see the Rara gene in this network, because it is known that Gata3 [29,30] has tissue-specific expression in some cells of the developing eye. Also Gdf11 exhibits growth factor activity and is extremely important during organ formation.
In Figure 3, we give the results of the CoD approach of network inference. Here the Gata3-Pax2 interaction seems reversed and counterintuitive. As can be seen, some of the interactions (e.g., Pax2-Gata3) can be seen here (via other nodes: Mapk1-Wnt11), but there is a need to resolve cycles (Ros1-Wnt11-Mapk1) and feedback/feedforward loops (Bmp7-Gata3-Wnt11). Both of these topologies can convey potentially useful information about nephric development. Thus a potentially useful way to combine these two methods is to "seed" the network using CoD and then try to resolve cycles using regime-SSM.  Figure 3: Steady-state network inferred using CoD (solid lines represent the first regime, and the dotted lines indicate the second regime).
LAT T-cell activation

T-cell activation
The regime-SSM network is shown in Figure 4. The corresponding network learnt in each regime using CoD is also shown ( Figure 5). The study of this network using GGM (for the whole time-series data) is already available in [22]. Though there are several interactions of interest discovered in both the SSM and CoD procedures, we point out a few of interest. It is already known that synergistic interactions between IL-6 and IL-1 are involved in T-cell activation [31]. IL-2 receptor transcription is affected by EGR1 [32]. An examination of the topology of these two networks (CoD and SSM) would indicate some matches and is worth pursuing for experimental investigation. However, as already alluded to above, we have to find a way to resolve cycles from the CoD network [33]. Several of these match the interactions reported in [1,22]. However, the additional information that we can glean is that some of the key interactions occur during "early response" to stimulation and some occur subsequently (interleukin-6 mediated T-cell activation) in the "late phase." An examination of the gene ontology (GO) terms represented in each cluster as well as the functional annotations in Entrez Gene shows concordance with literature findings (Table 9). Because this dataset has been the subject of several interesting investigations, it would be ideal to ask other questions related to network inference procedures, for the purpose of comparison. One of the primary questions we seek to answer is what is the performance of the network inference procedure if a subsampled trajectory is used instead? In Table 7, the performances of the CoD and SSM algorithms are summarized. Using the T-cell (10 points, 44 replicates) data, we infer a network using the SSM procedure. With the identified edges as the gold standard for comparison, we now use SSM network inference on an undersampled version of this time series (5 points, 44 replicates) and check for any new edges ( f new ) or deletion of edges ( f lost ). Ideally, we would want both these numbers to be zero. f new is the fraction of new edges added to the original set and f lost is number of edges lost from the original data network over both regimes. Further, we now interpolate this undersampled data to 10 points and carry out network inference. This is done for each of the identified regimes. The same is done for the CoD method. We note that this is not a comparison between SSM and CoD (both work with very different assumptions), but of the effect of undersampling the data and subsequently interpolating this undersampled data to the original data length (via resampling). Table 7 suggests that as expected, there is degradation in performance (SSM/CoD) in the absence of all the available information. However, it is preferred to infer some false positives rather than lose true positive edges. This also indicates that interpolated data does not do worse than the undersampled data in terms of true positives ( f lost ).
We make three observations regarding this method of network inference.
(i) It is not necessary for the target gene (Gata2/Gata3) to be present as part of the inferred network. We can obtain insight into the mechanisms underlying transcription in each regime even if some of the genes with similar coexpression dynamics as the target gene(s) are present in the inferred network. (ii) Probe-level observations from a small number of biological replicates seem to be very informative for network inference. This is because the LDS parameter estimation algorithm uses these multiple expression realizations to iteratively estimate the state mean, covariance and other parameters, notably D [17]. Hence inspite of few time points, we can use multiple measurements (biological, technical, and probe-level repli-  Figure 6: Steady-state network inferred using GGMs. cates) for reliable network inference. This follows similar observations in [34] that probe-level replicates are very useful for understanding intergene relationships. (iii) Following [24], it would seem that several network hypotheses can individually explain the time evolution behavior captured by the expression data. The LDS parameter estimation procedure seeks to find a maximum-likelihood (ML) estimate of the system parameters A, B, C, and D and then finally uses bootstrapping to only infer high confidence interactions. This ML estimation of the parameters uses an EM algorithm with multiple starts to avoid initializationrelated issues [17], and thus finds the "most consistent" hypothesis which would explain the evolution of expression data. It is this network hypothesis that we investigate. Since this network already contains our gene of interest Gata3, we can proceed to verify these interactions from literature and experimentally.

DISCUSSION
One of the primary motivations for computational inference of state specific gene influence networks is the understanding of transcriptional regulatory mechanisms [36]. The networks inferred via this approach are fairly general, and thus there is a need to "decompose" these networks into transcriptional, signal transduction or metabolic using a combination of biological knowledge and chemical kinetics. Depending on the insights expected, the tools for dissection of these predicted influences might vary. For comparison, we additionally investigated a graphical Gaussian model (GGM) approach as suggested in [35] using partial correlation as a metric to quantify influence ( Figure 6). This method works for short time-series data but we could not find a way to incorporate previous expression values as inputs to the evolution of state or individual observations-something we could explicitly do in the statespace approach. However, we are now in the process of examining the networks inferred by the GGM approach over the regimes that we have identified from SSA. Again, we observe that the network connections reflect a steady-state behavior and that transient (state-specific) changes in influence are not fully revealed. The same is observed in the case of the T-cell data, from the results reported in [22]. A comparison of all the presented methods, along with regime-SSM, has been presented in Table 10. The comparisons are based on whether these frameworks permit the inference of directional influences, regime specificity, resolution of cycles, and modeling of higher lags.

CONCLUSIONS
In this work, we have developed an approach (regime-SSM) to infer the time-varying nature of gene influence network topologies, using gene expression data. The proposed approach integrates change-point detection to delineate phases of gene coexpression, MoG clustering implying possible coregulation, and network inference amongst the regimespecific coclustered genes using a state-space framework. We can thus incorporate condition specificity of gene expression dynamics for understanding gene influences. Comparison of the proposed approach with other current procedures like GGM or CoD reveals some strengths and would very well complement existing approaches (Table 10). We believe that this approach, in conjunction with sequence and transcription factor binding information, can give very valuable clues to understand the mechanisms of transcriptional regulation in higher eukaryotes.