Centrality analysis methods for biological networks and their application to gene regulatory networks.

The structural analysis of biological networks includes the ranking of the vertices based on the connection structure of a network. To support this analysis we discuss centrality measures which indicate the importance of vertices, and demonstrate their applicability on a gene regulatory network. We show that common centrality measures result in different valuations of the vertices and that novel measures tailored to specific biological investigations are useful for the analysis of biological networks, in particular gene regulatory networks.


Introduction
The interaction of biological entities such as genes, proteins and metabolites is of great interest in life science research and is increasingly important for systems biological approaches (Oltvai and Barabási (2002); Kitano (2002)). The interplay of different interactions is often represented by biological networks such as gene regulatory, protein interaction and metabolic networks. To investigate these complex and large networks, different network analysis methods have been developed or adopted from other fields of science (Junker and Schreiber (2008)). Centrality analysis, the ranking of network elements used to identify interesting elements of a network, is one of these methods (Koschützki et al. (2005)). It is particularly useful to identify key players in biological processes. For example, it has been shown that highly connected vertices in protein interaction networks are often functionally important and the deletion of such vertices is related to lethality (Jeong et al. (2001)). Wuchty and Stadler applied three different types of centralities to metabolic, protein interaction and domain sequence networks (Wuchty and Stadler (2003)). Fell and Wagner discuss the possibility that metabolites with highest degree (i.e. highest number of connections) may belong to the oldest part of the metabolism (Fell and Wagner (2000)). However, it has also been shown that the degree of a vertex alone, as a specific centrality measure, is not sufficient to distinguish lethal proteins clearly from viable ones (Wuchty (2002)), that in protein networks there is no relation between network connectivity and robustness against amino acid substitutions (Hahn et al. (2004)), and that for biological network analysis several centrality measures have to be considered (Wuchty and Stadler (2003); Koschützki and Schreiber (2004)).
To assist scientists in the exploration of biological networks, we discuss and compare different centrality measures. Some of them are already known in biological sciences, others are transferred from different fields of science such as social network analysis. We also show that it is useful to consider biological knowledge in network analysis and discuss motif-based centralities which have been specifically developed for gene regulatory networks.

Graphs and Centralities
A network is an informal description for a set of elements with connections between them. In a formal way a network is modelled as a mathematical object called graph. A directed graph G = (V, E) consists of a finite set V of vertices and a finite set E ⊆ V × V of directed edges. An edge e = (u, v) connects two vertices u and v and is directed from u to v. The vertices u and v are said to be incident with the edge e and adjacent to each other. The set of all vertices which are adjacent to a vertex u is called the neighbourhood N(u) of u.
The degree $d(v)$ of a vertex $v$ is the number of its incident edges. Let $(e_1, \ldots, e_k)$ be a sequence of edges in a graph. This sequence is called a walk if there are vertices $v_0, \ldots, v_k$ such that $e_i = (v_{i-1}, v_i)$ for $i = 1, \ldots, k$, that is, the end vertex of an edge $e_i$ is the start vertex of the edge $e_{i+1}$. If all edges are pairwise distinct and all vertices are pairwise distinct, the walk is called a path. The length of a walk or path is given by its number of edges. A shortest path between two vertices $u$, $v$ is a path with minimal length. The distance $\mathrm{dist}(u, v)$ between two vertices $u$, $v$ is the length of a shortest path between them. If no path exists between two vertices $u$, $v$, then the distance $\mathrm{dist}(u, v)$ is undefined. Two vertices $u$, $v$ of a graph are called strongly connected if there exists a walk from vertex $u$ to vertex $v$ and a walk from vertex $v$ to vertex $u$. If any pair of different vertices of the graph is strongly connected, the graph is called strongly connected.
A subgraph of the graph $G = (V, E)$ is a graph $G' = (V', E')$ with $V' \subseteq V$ and $E' \subseteq E \cap (V' \times V')$. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic if there is a one-to-one correspondence between their vertices, and there is an edge directed from one vertex to another vertex of one graph if and only if there is an edge with the same direction between the corresponding vertices in the other graph.
Small recurring subgraphs within a given graph are called motifs. A motif $M$ is a directed graph. A match $G_M$ of a motif $M$ in a graph $G$ is a subgraph of $G$ which is isomorphic to the motif $M$. The motif match set $MS_G$ of a motif $M$ is the set of all matches of $M$ in the graph $G$. Figure 1 shows a motif and two matches of the motif in a graph.

Centralities in networks
Formally a centrality is a function C which assigns every vertex v of a graph a numeric value C(v). As we are interested in the ranking of the vertices of the given graph G we choose the convention that a vertex u is more important than another vertex v if and only if C(u) > C(v).
In the following sections we explain different centrality measures and show an example graph and the corresponding centrality values. We restrict our analysis to centrality measures which have been used to analyze biological networks or are used in our study in the second part of this paper. A comprehensive overview of different centrality measures was published in (Koschützki et al. (2005)).

Degree centrality
An obvious order of the vertices of a graph can be established by sorting them according to their degree. The corresponding centrality measure degree-centrality is defined as $C_{deg}(v) = d(v)$. For directed networks two degree centralities, the in-degree centrality (considering only ingoing edges) and the out-degree centrality (considering only outgoing edges), exist. Degree centrality is a local centrality measure: only the immediate neighbourhood of the vertex of interest is considered. Degree can be computed for all kinds of networks. See the work of Freeman (1979) for a list of references to the usage of degree-centrality in social network analysis. For biological network analysis degree centrality has been applied in numerous situations. For example, it is used by Jeong et al. (2001) to correlate the degree of a protein in the network with the lethality of its removal. Another study by Hahn and Kern (2005) compared three centralities (degree, closeness and betweenness) for the identification of essential proteins in three different organisms: Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. In all three networks and for all three centralities it was shown that the mean centrality value for essential proteins is significantly higher than the centrality value of nonessential proteins.
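As a minimal sketch of the degree-based measures described above, the following Python snippet computes in-degree, out-degree and total degree from a directed edge list. The function name `degree_centralities` and the toy network are illustrative assumptions, not part of the original study.

```python
from collections import defaultdict

def degree_centralities(edges):
    """Return (in_degree, out_degree, degree) dicts for a directed edge list."""
    in_count = defaultdict(int)
    out_count = defaultdict(int)
    for u, v in edges:
        out_count[u] += 1   # edge leaves u
        in_count[v] += 1    # edge enters v
    vertices = {x for e in edges for x in e}
    in_deg = {v: in_count[v] for v in vertices}
    out_deg = {v: out_count[v] for v in vertices}
    # Total degree counts every incident edge, regardless of direction.
    degree = {v: in_deg[v] + out_deg[v] for v in vertices}
    return in_deg, out_deg, degree

# Hypothetical toy network: a regulates b and c, b regulates c.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c")]
in_deg, out_deg, degree = degree_centralities(toy_edges)
```

In this toy network all three vertices have the same total degree, while in-degree and out-degree rank them in opposite order, which illustrates why the directed variants are reported separately.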

Closeness centrality
Closeness-centrality uses information about the length of the shortest paths within a network; it uses the sum of the minimal distances of a vertex to all other vertices. The closeness-centrality is defined as the reciprocal of this sum:
$$C_{clo}(v) = \frac{1}{\sum_{w \in V} \mathrm{dist}(v, w)}$$
As the distance between vertices is only defined for pairwise strongly connected vertices, this centrality can only be applied to strongly connected networks. Closeness-based centrality has been used in different studies. Wuchty and Stadler (2003) apply this centrality to different biological networks and show the correspondence with the service facility location problem. According to a slight modification of the closeness centrality, 8 of the top 10 metabolites of the metabolic network of E. coli are part of the glycolysis and citric acid cycle pathways (Ma and Zeng (2003)).
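The closeness definition (reciprocal of the sum of shortest-path distances from a vertex to all others) can be sketched with a breadth-first search on an unweighted directed graph. The helper names, the adjacency-dict representation and the toy cycle are assumptions made for illustration; the sketch also assumes at least two vertices and raises an error when the graph is not strongly connected, mirroring the restriction stated above.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness(adj):
    """Closeness centrality; only defined for strongly connected graphs."""
    vertices = set(adj) | {w for ws in adj.values() for w in ws}
    result = {}
    for v in vertices:
        dist = bfs_distances(adj, v)
        if len(dist) != len(vertices):
            raise ValueError("graph is not strongly connected")
        result[v] = 1.0 / sum(dist.values())
    return result

# Hypothetical toy graph: a directed 3-cycle a -> b -> c -> a.
cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
clo = closeness(cycle)
```

In the 3-cycle every vertex reaches the other two at distances 1 and 2, so all closeness values coincide.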

Radiality and integration
Similar to the closeness measure are the centralities radiality and integration introduced by Valente and Foreman (1998). The computation of both centralities is based on the reverse distance matrix $RD = (rd_{uv})$, which is defined on the basis of the distance matrix as $rd_{uv} = \mathrm{diam}(G) + 1 - \mathrm{dist}(u, v)$, where $\mathrm{diam}(G)$ is the diameter, the highest distance value, of the graph. On the basis of this matrix $RD$ radiality is defined as
$$C_{rad}(v) = \frac{\sum_{w \in V, w \neq v} rd_{vw}}{n - 1}$$
and integration as
$$C_{int}(v) = \frac{\sum_{w \in V, w \neq v} rd_{wv}}{n - 1}$$
where $n$ is the number of vertices of the graph. A vertex with a high radiality value can easily reach other vertices. A vertex with a high integration value is easily reachable from other vertices. Similarly to closeness both radiality and integration are shortest path based measures. In contrast to closeness which can be only computed for strongly connected networks, radiality and integration can also be computed for weakly connected or even unconnected networks.
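The following sketch illustrates how radiality and integration might be computed. It rests on several assumptions that are not spelled out in this text: the reverse distance of a pair is taken as diameter + 1 − distance, unreachable pairs contribute a reverse distance of 0, the diameter is the largest finite distance, and both sums are divided by n − 1. Function names and the toy chain are hypothetical.

```python
from collections import deque

def all_distances(adj, vertices):
    """BFS from every vertex; dist[s][t] exists only if t is reachable from s."""
    dist = {}
    for s in vertices:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj.get(u, ()):
                if w not in d:
                    d[w] = d[u] + 1
                    queue.append(w)
        dist[s] = d
    return dist

def radiality_integration(adj):
    vertices = sorted(set(adj) | {w for ws in adj.values() for w in ws})
    n = len(vertices)  # assumed to be at least 2
    dist = all_distances(adj, vertices)
    diam = max(d for s in vertices for d in dist[s].values())

    def rd(u, v):
        # Assumption: unreachable pairs get reverse distance 0.
        return diam + 1 - dist[u][v] if v in dist[u] else 0

    # Radiality sums over outgoing reachability (rows of RD) ...
    rad = {v: sum(rd(v, w) for w in vertices if w != v) / (n - 1)
           for v in vertices}
    # ... integration over incoming reachability (columns of RD).
    integ = {v: sum(rd(w, v) for w in vertices if w != v) / (n - 1)
             for v in vertices}
    return rad, integ

# Hypothetical toy graph: a regulatory chain a -> b -> c.
rad, integ = radiality_integration({"a": ["b"], "b": ["c"]})
```

On the chain, the start vertex has the highest radiality and the end vertex the highest integration, which matches the interpretation of the two measures given above.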

Shortest path betweenness centrality
Shortest path betweenness centrality quantifies the ability of a vertex to monitor communication between other vertices. Every vertex that is part of a shortest path between two other vertices can monitor communication or flow between them. Counting how many such communications a vertex may monitor leads to an intuitive definition of a centrality: a vertex is central if it can monitor many communications between other vertices. In the following let $\sigma_{st}$ denote the number of shortest paths between two vertices $s$ and $t$, and let $\sigma_{st}(v)$ denote the number of shortest paths between $s$ and $t$ that use $v$ as an interior vertex. The rate of communication between $s$ and $t$ that can be monitored by an interior vertex $v$ is denoted by $\delta_{st}(v) = \sigma_{st}(v) / \sigma_{st}$. If no shortest path between $s$ and $t$ exists we set $\delta_{st}(v) = 0$. The shortest path betweenness centrality (Freeman (1977)) is defined as
$$C_{spb}(v) = \sum_{s \in V, s \neq v} \sum_{t \in V, t \neq v} \delta_{st}(v)$$
There are several studies investigating shortest path betweenness in biological networks. For an S. cerevisiae protein interaction network it was reported that proteins with a high betweenness centrality value cover a broad range of degree centrality values. In particular, proteins with a high betweenness and low degree value (HBLC, high betweenness low connectivity proteins) are prominent as they are supposed to support modularization of the network (Joy et al. (2005)). Shortest-path betweenness centrality was applied to mammalian transcriptional regulatory networks and it was noted that betweenness appears to be an interesting topological characteristic in regard to the biological significance of distinct elements (Potapov et al. (2005)).
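One way to compute shortest path betweenness for an unweighted directed graph is sketched below. It uses Brandes' dependency accumulation scheme, which is not mentioned in this text and is chosen here only for efficiency; the sum runs over ordered pairs (s, t), and the graph is assumed to have no parallel edges. The function name and the toy graph are hypothetical.

```python
from collections import deque

def shortest_path_betweenness(adj):
    vertices = set(adj) | {w for ws in adj.values() for w in ws}
    cb = {v: 0.0 for v in vertices}
    for s in vertices:
        # Phase 1: BFS from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in vertices}
        sigma = {v: 0 for v in vertices}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj.get(u, ()):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Phase 2: accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in vertices}
        while order:
            w = order.pop()
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb

# Hypothetical toy graph: two parallel shortest paths from a to d.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
spb = shortest_path_betweenness(diamond)
```

In the toy graph each of the two intermediate vertices lies on one of the two shortest paths from a to d and therefore monitors half of that communication.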

Katz status index and PageRank
For the analysis of gene regulatory networks discussed in the second part two further centralities can be applied: the status index defined by Katz (1953) and the PageRank centrality (Page et al. (1998)), which is the algorithmic method behind the search engine Google. Both centralities are best described as computations performed on the adjacency matrix associated with the graph of interest. As we focus on the results of different centralities and their comparison, we skip a lengthy formal definition here and refer to the literature for details (Katz (1953); Page et al. (1998); Koschützki et al. (2005); Koschützki (2008)).
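Although the formal definitions are left to the literature, a rough sketch may help to see what kind of computation is involved. The code below uses common textbook formulations, which are assumptions here and may differ in detail from the variants used in the study: the Katz status of a vertex is the sum over k of alpha^k times the number of walks of length k ending in that vertex, and PageRank is computed by power iteration with a damping factor, distributing the rank of vertices without outgoing edges uniformly. The parameter values `alpha` and `damping` are illustrative only.

```python
def katz_status(edges, alpha=0.1, max_iter=100, tol=1e-12):
    """Katz status via the truncated series sum_k alpha^k * (#walks of length k into v)."""
    vertices = sorted({x for e in edges for x in e})
    term = {v: 1.0 for v in vertices}       # alpha^0 * walks of length 0
    status = {v: 0.0 for v in vertices}
    for _ in range(max_iter):
        nxt = {v: 0.0 for v in vertices}
        for u, v in edges:
            nxt[v] += alpha * term[u]        # extend each walk by edge u -> v
        for v in vertices:
            status[v] += nxt[v]
        term = nxt
        if max(term.values()) < tol:         # series has converged
            break
    return status

def pagerank(edges, damping=0.85, max_iter=200, tol=1e-12):
    """PageRank by power iteration; dangling vertices spread rank uniformly."""
    vertices = sorted({x for e in edges for x in e})
    n = len(vertices)
    out_deg = {v: 0 for v in vertices}
    for u, _ in edges:
        out_deg[u] += 1
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in vertices if out_deg[v] == 0)
        nxt = {v: (1 - damping) / n + damping * dangling / n for v in vertices}
        for u, v in edges:
            nxt[v] += damping * rank[u] / out_deg[u]
        change = sum(abs(nxt[v] - rank[v]) for v in vertices)
        rank = nxt
        if change < tol:
            break
    return rank

# Hypothetical toy network: a and b both regulate c.
toy = [("a", "c"), ("b", "c")]
toy_reversed = [(v, u) for u, v in toy]      # all edge directions reversed
kat = katz_status(toy, alpha=0.5)
kat_rev = katz_status(toy_reversed, alpha=0.5)
par = pagerank(toy)
```

Both measures follow edge direction, so on the original toy graph the regulated vertex c scores highest, whereas on the reversed graph the score moves to the regulators a and b.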

Motif-based centralities
Given a graph $G$, a motif $M$ and the corresponding motif match set $MS_G$ a centrality can be defined. The motif-based centrality $C_{mb}$ assigns to every vertex $v$ the number of matches the vertex $v$ occurs in (Koschützki et al. (2007)). For example the vertex v01 in the graph shown in Figure 2 occurs in two matches of the FFL motif shown in Figure 3. Therefore $C_{mb}(v01) = 2$. Two extensions of this centrality exist: motif-based centrality with roles and motif-based centrality with classes.
Vertices of motifs may represent different functions. For example, in the gene regulatory network context three different functions of the vertices of the feed forward loop (FFL) motif as shown in Figure 3 can be identified: (1) the vertex at the top is the master regulator, this vertex regulates the other two vertices; (2) the vertex on the right side is the intermediate regulator, it is regulated by the master regulator and itself regulates together with the master regulator the vertex at the bottom; and (3) the vertex at the bottom of the drawing is regulated by both other vertices and is therefore called the regulated vertex. Such different functions of vertices within motifs are called roles and three roles can be assigned to the vertices of the FFL motif. The motif-based centrality with roles $C_{mbr}$ restricts the number of counted matches to those matches where the vertex occurs in the match with the role under consideration; see Koschützki et al. (2007) for details.
Using the previously introduced concepts we can extend the motif-based centrality method further. By assigning the same role to similar vertices of a group of similar motifs we can establish a centrality based on a class (or group) of motifs. Consider, for example, a group of chains (see Fig. 4), where all vertices at the start of such chains have a similar characteristic (no incoming edges) and all vertices at the end have another similar characteristic (no outgoing edges). For gene regulatory networks several motif classes are known. For example, the regulatory chain motif class, as in the example above, consists of a set of chains of three or more regulators in which one regulator regulates another regulator, which in turn regulates a third one and so forth (Lee et al. (2002)). In the motif class single input motif (SIM) a set of vertices is exclusively regulated by a single vertex (Shen-Orr et al. (2002)). The motif-based centrality with classes $C_{mbc}$ therefore is the sum of motif-based centralities with roles $C_{mbr}$ for the same role in similar or related motifs.
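The role-based counting for the FFL motif can be sketched as follows. The snippet enumerates all triples of distinct vertices (a, b, c) with edges a→b, b→c and a→c and counts, per vertex, how often it appears as master regulator, intermediate regulator or regulated vertex. The mapping of these roles to the labels "A", "B" and "C" is an assumption made for this illustration, as are the function name and the toy network.

```python
def ffl_role_centralities(edges):
    """Count FFL matches per vertex and role (A: master, B: intermediate, C: regulated)."""
    edge_set = set(edges)
    succ = {}
    for u, v in edge_set:
        succ.setdefault(u, set()).add(v)
    vertices = {x for e in edge_set for x in e}
    roles = {v: {"A": 0, "B": 0, "C": 0} for v in vertices}
    for a in vertices:
        for b in succ.get(a, ()):
            if b == a:
                continue
            for c in succ.get(b, ()):
                if c == a or c == b:
                    continue
                if (a, c) in edge_set:      # closing edge of the feed forward loop
                    roles[a]["A"] += 1
                    roles[b]["B"] += 1
                    roles[c]["C"] += 1
    return roles

# Hypothetical toy network with two FFL matches: (x, y, z) and (x, w, z).
toy_edges = [("x", "y"), ("y", "z"), ("x", "z"), ("x", "w"), ("w", "z")]
roles = ffl_role_centralities(toy_edges)
# Motif-based centrality without roles: total number of matches a vertex occurs in.
c_mb = {v: sum(r.values()) for v, r in roles.items()}
```

Summing the three role counts of a vertex gives the plain motif-based centrality, so the role variant can be seen as a finer breakdown of the same matches.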
Several motifs have been studied in all kinds of biological networks. The best studied motif is the FFL motif, whose functional properties have been analyzed in detail theoretically and experimentally, especially in gene regulatory networks (Shen-Orr et al. (2002); Wall et al. (2005)). However, in these approaches only the occurrence of motifs is considered; motifs are not used to rank the genes.
Different motifs occurring in a human cellular signalling network were analysed by Awan et al. (2007). They discovered that genes which are related to cancer are enriched in the target vertices of several motifs and that cell mobility genes are enriched in the source vertices of motifs. For a gene regulatory network of E. coli Wang and Purisima (2005) discovered that transcripts with short half-lives are enriched in motifs, especially in SIMs, FFLs and bi-fans.
Table 1 shows the centrality values for the centralities that are applicable to the example graph in Figure 2.

Analysing Gene Regulatory Networks with Centralities
The applicability of specific centrality measures for the investigation of biological networks depends on the type of the particular network, and depending on the type of the network different centrality measures are used. Here we focus our analysis on gene regulatory networks.
As an example, we analyze centralities within the gene regulatory network (GRN) of Escherichia coli. The network is based on the data of transcriptional regulatory interactions of genes from RegulonDB, Version 5.5 (Salgado et al. (2006)). Genes are represented by vertices and transcriptional regulatory interactions between genes are modelled as edges, a common approach to model GRNs. The interactions between genes represent the control that transcription factors exert on the transcription of regulated genes. There are a few cases where transcription factors are formed by subunits of different gene products. They are here replaced by a common identifier which corresponds to the transcription factor, e.g. ihfA and ihfB are replaced by ihfAB. The regulatory interactions of such different subunits are assigned to this new identifier, and parallel edges which occurred due to the previous operation are replaced by a single edge. The resulting network consists of 1250 vertices and 2515 edges. In gene regulatory networks genes at a high level within the hierarchy of regulatory control are of particular interest due to their far reaching influence on other genes within the network. These genes are commonly called global regulators. Some criteria for the characterization of global regulators have been proposed, such as the number of regulated genes, the number and type of co-regulators, the number of other regulators they control, the size of their evolutionary family, and the variety of conditions where they exert their control (Martínez-Antonio and Collado-Vides (2003)).
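The preprocessing step described above (replacing subunit genes by a common identifier and collapsing the resulting parallel edges) could look roughly like the sketch below. Only the identifiers ihfA, ihfB and ihfAB are taken from the text; the function name, the other vertex names and the listed interactions are invented placeholders and do not represent RegulonDB content.

```python
def merge_subunits(edges, subunit_to_complex):
    """Rename subunit genes to their complex identifier and drop parallel edges."""
    merged = set()  # a set keeps each (source, target) pair only once
    for u, v in edges:
        merged.add((subunit_to_complex.get(u, u), subunit_to_complex.get(v, v)))
    return sorted(merged)

# Mapping from the text: ihfA and ihfB are both replaced by ihfAB.
mapping = {"ihfA": "ihfAB", "ihfB": "ihfAB"}

# Placeholder interactions (tf1, g1, g2 are hypothetical names).
raw_edges = [("ihfA", "g1"), ("ihfB", "g1"), ("ihfA", "g2"), ("tf1", "ihfB")]
grn_edges = merge_subunits(raw_edges, mapping)
# grn_edges == [("ihfAB", "g1"), ("ihfAB", "g2"), ("tf1", "ihfAB")]
```

The two edges from ihfA and ihfB to the same target collapse into a single edge from ihfAB, so the merged network has fewer edges than the raw interaction list.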

Comparison of different centralities for GRN
In this section, we compare different centrality measures that can be applied to GRNs. As GRNs are directed graphs that are not necessarily strongly connected, only the centralities degree, shortest path betweenness, integration, radiality, Katz status index, PageRank and the different motif-based centralities can be applied. The centralities PageRank and Katz status index are sensitive to the directionality of the edges and therefore we consider two variants of the graph, the original graph and the graph with all edge directions reversed.
Table 1. The centrality values that are discussed in this paper computed for the example graph in Figure 2. Abbreviations: chains: motif-based centrality for the chain class; ffl A, ffl B and ffl C: motif-based centrality for the FFL motif with roles (different roles A, B, C; see Figure 3); ffl Sum: motif-based centrality for the FFL motif without roles; ideg: in-degree; int: integration; kat: Katz status index; katR: Katz status index for the reversed graph; odeg: out-degree; par: PageRank; parR: PageRank for the reversed graph; rad: radiality; spb: shortest-path betweenness.
The top 25 genes (top 2% of all genes) according to the eight best centrality measures (i.e. the centrality measures which identify the highest number of global regulators within the top 2% of all genes) are shown in Table 2. In total 18 global regulators have been identified by Martínez-Antonio and Collado-Vides (2003). All centrality measures shown in Table 2 are able to identify more than 50% of the global regulators within the top 2% of the ranked genes. For example, shortest path betweenness finds 11 global regulators and motif-based centrality with the chain motif class is able to identify 15 global regulators.
It should also be noted that for nearly all centrality measures the top 5 positions are occupied by global regulators. However, all centralities result in different rankings, even for global regulators, which are often ranked very high. For example, the gene ihfAB is ranked either very high at the second position (e.g. radiality, PageRank) or not even among the top 25 genes (shortest path betweenness). Radiality ranks similarly to the motif-based centrality with the chain motif class (in short, chain centrality) but even in this short list differences are visible. For example, the global regulator fur, ranked at position 8 by radiality, is ranked at position 18 by the chain centrality.
Correlation coefficients are a valid measure to show that centralities do not coincide. Table 3 shows the pairwise Kendall's correlation coefficients for the centralities used in Table 2. Of these centralities only a few correlate with a coefficient above 0.9 with other centralities. These are out-degree, PageRank, Katz status index, radiality and the motif-based centrality with chain classes (chain). The centralities based on the FFL motif and shortest-path betweenness correlate with other centralities only with coefficients below 0.9.
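A pairwise rank correlation of two centrality vectors can be sketched in plain Python as shown below. Since centrality vectors contain many tied values, the sketch uses the tau-b variant of Kendall's coefficient; which variant was used for Table 3 is not stated, so this choice is an assumption. The quadratic pair loop is adequate for a network of this size.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences of centrality values."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                 # tied in both: ignored by tau-b
            elif dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1          # pair ordered the same way in x and y
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + only_x_tied)
                 * (concordant + discordant + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical centrality vectors over the same four vertices.
tau_same = kendall_tau_b([1, 2, 3, 4], [10, 20, 30, 40])
tau_opposite = kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1])
tau_ties = kendall_tau_b([1, 2, 3], [1, 1, 2])
```

To compare centralities only on a subset of vertices, both vectors can be filtered to that subset before calling the function.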
Table 2. Names of the top 25 genes (top 2% of all genes) according to the 8 best centrality measures, i.e. centralities which find a high number of global regulators within the top 2% of all genes. Global regulators according to Martínez-Antonio and Collado-Vides (2003) are highlighted in bold face. Note that in the few cases where genes with the same centrality value occur, they are ranked in alphabetical order. For each centrality the last row of the
For the five centralities with a correlation coefficient above 0.9 these high coefficients can easily be explained: 1101 out of 1250 (88.08%) vertices have an out-degree of zero. All these vertices are assigned the same centrality value of nearly zero for the Katz status index and the PageRank centrality, and the value zero for the radiality and the motif-based centrality with chain classes. Therefore, the comparison of correlations between all centrality values is not feasible for the complete vector of centralities: all five centralities rank these 1101 vertices into the same group. Table 4 shows the pairwise correlation coefficients for the centrality values of the vertices which have a non-zero out-degree. These coefficients show a different picture: all five centralities rank the remaining 149 genes differently; only the centralities radiality and Katz status index achieve a considerably high correlation to each other and to the motif-based centrality with chain classes.
In conclusion, the centralities applied to the GRN rank the genes differently and the motif-based centrality with chain classes is able to rank the highest number of interesting genes (global regulators) within the top 2% of all genes. The chain centrality identifies 15 out of 18 global regulators (83%) identified by Martínez-Antonio and Collado-Vides (2003) and outperforms the other centralities used.
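The evaluation used in this section, counting how many known global regulators appear among the top-ranked genes of a centrality, can be sketched as follows. Ties are broken alphabetically, following the note in the Table 2 caption; the function names, gene names and values in the example are hypothetical.

```python
def top_k(centrality, k):
    """Names of the k highest-ranked vertices; equal values are ordered alphabetically."""
    ranked = sorted(centrality.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]

def regulators_found(centrality, regulators, k):
    """Number of known regulators among the top k vertices of a centrality."""
    return len(set(top_k(centrality, k)) & set(regulators))

# Hypothetical centrality values and regulator set.
values = {"g1": 5.0, "g2": 5.0, "g3": 1.0, "g4": 0.0}
known_regulators = {"g2", "g4"}
top2 = top_k(values, 2)
found = regulators_found(values, known_regulators, 2)

# Share reported in the text for the chain centrality: 15 of 18 regulators.
share_percent = round(100 * 15 / 18)
```

With 1250 genes, the top 2% corresponds to k = 25, which is the list length used in Table 2.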

Discussion
To investigate large biological networks different analysis methods have been developed, and centrality analysis is a particularly useful method to analyze the structure of these networks. In this paper we discussed and compared different centrality measures and applied them to a gene regulatory network of E. coli. The results show that using centrality analysis methods from other fields of science such as social network analysis is a starting point to investigate gene regulatory networks. However, we also show that it is useful to consider biological knowledge in network analysis and that the recently introduced motif-based centrality outperforms other methods.
The comparison of the pairwise correlation coefficients and the analysis of the rankings of the top 25 genes show that the motif-based centralities, in particular with the chain motif class, produce rankings different from the rankings computed by existing centralities, and that these rankings show interesting features of the gene regulatory network under analysis.