Algorithm for the Construction of a Global Enzymatic Network to be Used for Gene Network Reconstruction

Relationships between genes are best represented using networks constructed from information of different types, with metabolic information being the most valuable and widely used for genetic network reconstruction. Other types of information are usually also available, and it would be desirable to systematically include them in algorithms for network reconstruction. Here, we present an algorithm to construct a global metabolic network that uses all available enzymatic and metabolic information about the organism. We construct a global enzymatic network (GEN) with a total of 4226 nodes (EC numbers) and 42723 edges representing all known metabolic reactions. As an example we use microarray data for Arabidopsis thaliana and combine it with the metabolic network constructing a final gene interaction network for this organism with 8212 nodes (genes) and 4606,901 edges. All scripts are available to be used for any organism for which genomic data is available.


INTRODUCTION
One of the major goals in systems biology is to understand how functional relationships between genes under specific conditions determine changes in the organism's behavior and cell physiology. Co-expression networks and genome-scale metabolic models, have been successfully applied to advance this kind of biological knowledge [1][2][3][4].
There are many strategies which allow the construction of such co-expression networks [1][2][3], however, only a few of those are designed to integrate more than one type of genomic data [4][5][6][7][8], namely: gene expression, sequence homology, cell location, vicinity of chromosomal genes, sites binding to transcription factors, fusion events and phylogenetic profiles. Although all these types of data provide information for the construction of networks, the most valuable type are metabolic networks, since they directly give information on the relationship between cellular entities (enzymes) and metabolites. Moreover, metabolic reactions are global for all organisms and therefore a global metabolic network containing all known metabolic reactions is of special interest for biological network reconstruction of any kind.
One technique that has been implemented for the reconstruction of metabolic networks, and one that integrates various types of genomic data is Kernel Canonical Correlation Analysis (KCCA) [6,8,9]. One major difficulty with KCCA occurs, however, in the necessary tuning of arbitrary parameters (for example thresholds) for the network construction. Methodologies have also been developed for integrating genomic data types into partially known networks [10], generating networks of lesser genomic coverage. Similarly, other proposed methodologies are able to add similarity results on networks established previously as gold standards [9]. Other techniques rely on probabilistic approaches such as Boolean or Bayesian networks [11] or fully probabilistic descriptions [4], which however, have the drawback of being computationally non-practical for databases with a large number of genes. New and effective methodologies that integrate different types of genomic data are therefore needed.
Of particular interest would be novel approaches that do not depend on arbitrary parameters and that are applicable to a large number of organisms. Such flexible and relatively simple methods will help advance biological knowledge of both model and less studied organisms through the generation of functional gene predictions leading to the formulation of new biological hypotheses.
In this work we propose a general methodology for constructing gene networks using information on metabolic reactions and gene expression data. Our strategy follows the following basic steps: i) construction of the global enzymatic network (GEN); ii) construction of the organismic enzymatic network (OEN) using a gene similarity matrix based on expression data of an organism of interest; iii) integration of both networks in order to obtain a final weighted organismic enzymatic network (WOEN). The implementation of the methodology is available under request as Perl scripts.
To illustrate the proposed methodology, we consider the model organism Arabidopsis thaliana and construct WOEN using microarray data and the GEN constructed separately. The used GEN has a total of 4226 nodes (EC numbers) and 42723 edges, and the final WOEN of Arabidopsis thaliana has 8212 nodes (genes) and 4606,901 edges, the increase in the number of nodes between the GEN and the WOEN, is explained by the fact that in an organism more than one enzyme can have the same enzymatic activity. The GEN we construct here can be used in combination with gene expression data from the same or any other organism in order to construct the associated gene interaction networks. In that sense, we are using the GEN as a starting point in order to obtain WOENs representing gene coregulations that will depend on the microarray data used. Moreover, any other genomic data type that is also representable as a similarity matrix can be included in the construction by combination with the GEN and could therefore enrich the obtained WOEN.
In addition to the gene expression data, information on all known metabolic reactions was obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG). The KEGG API was used to download the list of up to date EC numbers, enzymes related to each EC number and enzyme sequence. Perl scripts were developed to connect to the REST based KEGG API service, and download all metabolic information required for the global metabolic network; this ensures that any future users of the scripts will use up to date information.

Preprocessing of Microarray Data and Gene-gene Similarity Matrix
The downloaded datasets were independently preprocessed for noise reduction, quantile normalization and log2 transformation. The RMA (Robust Multiarray Average) method was applied to Affymetrix data using R affy library [12,13]. Probe IDs were converted to gene IDs. Single probes that matched more than one gene were removed [14]. For those multiple probes that matched a single gene, the maximum expression among the multiple probes was assigned to the gene as suggested in Dozmorov [15]. Genes common to all databases were used for construction of the gene-gene similarity matrix. Microarrays expression data was used to assess the gene expression similarity matrix (GESM). The similarity in GESM for gene pairs was obtained using the mutual information coefficient [16].

Global Enzymatic Network Construction
Using the names, codes and reactions of biochemical activities performed by enzymes, which are defined by the Enzyme Commission (EC) [17], a global enzymatic network (GEN) was constructed, in which the nodes are enzymatic activities (EC numbers), and two activities are connected by an edge if they share at least one metabolite, either as substrate or product [18].
For the construction of the GEN the following strategy was applied, for which several Perl scripts ( Table 1) were developed (see Fig. 1, GEN construction):

Enzymatic Reactions File Construction
A list of all currently known EC numbers was retrieved from the KEGG API, and for each one of the EC numbers, all metabolites involved in the reactions catalyzed by the enzymes annotated with that EC number were downloaded, so that a file with EC numbers and associated metabolites was created. The Perl script used to accomplish this was called "1_Fetch_EC_met.pl" and does not need any input; it connects directly to the KEGG database and retrieves all information needed.

Scripts
"1_metabolite_name-code_hash.pl" and "2_reactions_metID.pl" were created for construction of GEN from ftp KEGG data: i) "1_metabolite_name-code_hash.pl" uses as input one file with the identification code, name and alternative names of all biological compounds to build a list of equivalences ("dictionary") relating names and alternate names of metabolites with an identification code (used in KEGG). The goal of this step is to create the metabolic network with a less complex and easy to understand nomenclature for future users; ii) using the equivalence list, the metabolic reactions were converted into reactions with codified metabolites. Furthermore, as each EC number can have many reaction variants in different organisms, we summarized this information relating each metabolite to one EC number using the main reaction as handle. This step is achieved with the script "2_reactions_metID.pl", that uses information of each known enzymatic activity (names of the activity, reaction and involved metabolites) and the list containing information on names and codes of metabolites.

Filtering of Metabolites
As some metabolites, for example H 2 O or NADH, are very common and used in many enzymatic reactions, they have to be withdrawn to avoid an overconnected network. Following Kharchenko [18], the forty most common metabolites were filtered (Supplementary Table 1). This filtering was done with the perl script "3_filter_reactions _slim.pl", which allows filtering the first n most common metabolites.

Comparison of Enzymatic Activities and Construction of the GEN
Once the n(=40) most common metabolites were filtered, the GEN was constructed, connecting two nodes (EC numbers) if they share at least one metabolite. This step is achieved using the script "4_network_construction.pl". Weights the OEN using the GESM "Gen-Gen_adjacency.matrix"; a GESM valid file "Gen-Gen_similarity.matrix"

Organismic Enzymatic Network Construction
In the organismic enzymatic network (OEN) nodes are genes from the genome of the organism of interest, and two genes are connected by an edge if the EC numbers associated to each one of the genes share a metabolite. The following strategy was designed to construct the OEN (Fig. 1, OEN construction):

Match of Enzymatic Activities to Genes in the Genome of Interest
The first step, is to assess sequence homology between all known enzymes assigned to each one of the EC numbers (Gold Standard Enzymes (GSE)) [18] present in the GEN and all protein-coding genes in the genome of the organism of interest (Fig. 1). This step is achieved with the Perl script "1_Fetch_genes.pl", which downloads from KEGG aminoacid (aa) sequences of all known GSE. Then, this script performs a BLASTP homology search comparing these aa sequences and all known protein-coding genes of the organism of interest. The E-value threshold can be chosen, and for this implementation we used 1x10 -5 . Because this search is computationally expensive, the script was developed to run the BLASTP for subsets of EC numbers, providing therefore the option of parallel running over different subsets, and finally combine the results into a single file. The result is a file with a list of protein coding genes and associated EC numbers.

Construction of the OEN
Once the list of enzymes of the organism of interest has been obtained and related to the corresponding EC numbers of the GEN, the OEN can be constructed (Fig. 1). The Perl script "2_DataBase_network.pl" does this by searching in the GEN, and linking up genes to define the edges in the OEN, if the associated EC numbers are linked by an edge in the GEN.

Representation of the OEN as an Adjacency Matrix
Finally, the script "3_Gen-gen_adjacency_matrix.pl" represents the OEN in terms of an adjacency matrix. This matrix is a matrix containing 1 if an edge exists between enzymes coded by the genes of the organism and 0 elsewhere. All protein-coding genes, for which a metabolic activity could not be associated by the BLAST search, are not included in the matrix.  2SGEN, 3SGEN, 4SGEN, 1SOEN and 2SOEN represent the corresponding abbreviation for the scripts in (Table 1). Black boxes marked with ^ represent examples of the files generated by the related scripts; ECM: first column are EC numbers and the other columns are codified metabolites associated to each EC number; ECMF: ECM after applying the filter for the most common metabolites; ECEC: each line is a pair of EC numbers representing an edge in the GEN; ECG: first column are EC numbers and second column are the genes assigned to each EC number; GG: each line is a pair of genes representing an edge in the OEN.

Weighted Organismic Enzymatic Network Construction
At this point a final network (weighted organismic enzymatic network (WOEN)) is obtained combining the OEN and the GESM. This WOEN is represented as an adjacency matrix (Fig. 2, WOEN construction). Edges could be weighted using the script "4_Weight_adjacency_matrix.pl".

Topological Analysis of Networks
Different topological properties of the obtained network were computed in order to reveal changes in the network configuration at each step of our methodology. We consider the networks: GEN, OEN and WOEN; and the following topological properties: number of nodes, number of edges, clustering coefficient, average path length and centralization [19]. This analysis was performed using the R library igraph [20].

Validation of Edges Linking Immunity Related Genes (IRGs)
The WOEN was mined to validate some functional relationships between immunity related genes (IRGs). Given their importance on immune processes, four IRGs were selected (FLS2, CLV1, RPS2 and RPS4), their edges were filtered and compared with interactions previously reported in literature.

GESM for Arabidopsis thaliana Microarray Data
After calculating the similarity measurements between all gene pairs, we obtained a GESM of 8,212x8,212 genes for the microarray data of Arabidopsis thaliana. This similarity represents the amount of coordinated activity between all pairs of genes and contains all known genes of the organisms, regardless whether their products participate or not in enzymatic activities. Similarities between genes on this matrix strongly depend on the gene expression data used.

GEN Construction
Using all enzymatic activities data known to date, a GEN ( Table 2) was constructed (representing 4,226 enzymatic activities, with a final number of 2'043,335 associated GSE in a GEN constructed after filtering for the 40 most common metabolites). The GEN was drawn using Gephi [21] (Fig. 3). This network is available at, as a list of connected pairs of enzymatic activities.

OEN and WOEN Construction for Arabidopsis thaliana
Using the proposed methodology a OEN and subsequently a WOEN ( Table 2) was constructed for Arabidopsis thaliana in order to illustrate the methodology. These networks do not aim to represent overall genetic networks, although if very general gene expression data is used (representing many different conditions), a more general network could be achieved. Nevertheless, a detailed analysis of the network allowed us to retrieve several interesting relationships between genes previously reported in the literature (see section on validation).

Topological Analysis of Networks
One way of identifying the effect of the different steps of the proposed algorithm is to track the changes in topological properties of the obtained networks ( Table 2). It was found, for example, that GEN is a small graph compared to the subsequent constructed OEN and WOEN. After the BLASTP homology search was performed on GEN, a huge amount of hits per enzyme was revealed. Consequently, the number of edges increased drastically when passing to OEN.
Due to the high number of edges, we expected the OEN to have a higher clustering coefficient; however, during the BLASTP step, the number of nodes augmented simultaneously. As a consequence, both networks exhibit approximately the same degree of clustering. This result supposes that modules of highly connected nodes can be observed Fig. (2). Workflow of the procedure for the WOEN construction. * 1SWOEN represents the corresponding abbreviation for the script in (Table 1). Black box marked with ^ represent an example of the files generated by the related script; WGG: each line is a pair of genes representing an edge with the assigned weight in the WOEN.

WOEN Construction
Nodes are genes, these are connected by edges if a metabolite is shared between the assigned EC numbers, edge weight is set by the similarity score between two genes in the GESM Assign the similarity score in the GESM as the weight for each gene pairs representing and edge  Same clustering coefficients do not mean equal connectivity in the process represented. To evaluate how edges improve the graph global connectivity, we calculated the centralization and average path length for each network ( Table  2). The higher number of nodes in OEN generated a better connectivity as each pair of nodes is separated by 2 edges contrasting the 4 edges in GEN. The better connectivity found in OEN and WOEN also means that graphs are neither centralized nor hubs-dependent. Besides, in metabolic networks a low average path length is an indicative of more efficient transfer processes [22].
The weighting step reduced the number of nodes and edges in the network. Despite this size reduction, the value of the topological properties of WOEN was about the same. We conclude that genes without expression data are not relevant for the system representation. However, these genes could not be identified using just pathways data. It must be pointed out that expression data allowed us to weight the functional relationships, but also to drop irrelevant nodes from the final WOEN. Finally, our topological analysis results are comparable to those from other methodologies [23].

Validation of Edges Linking Immunity Related Genes (IRGs) Based on Previous Works
Four of the most important IRGs were searched on the WOEN and their edges were compared with literature ( Table  3). One of them, FLS2, was found linked toBAK1 and BRI1. The protein FLS2 is a LRR receptor-like serine/threonineprotein kinase that recognizes peptide from flagellin and triggers plant immunity [24]. On the other hand, BAK1can regulate the tradeoff between immunity and responses to hormones. While BRI1 is a receptor of the growth hormonebrassinosteroid [25]. BAK1 is not only a co-receptor of FLS2 but alto interacts with BAK1 as reported previously [24,26,27].
Other edges predicted for CLV1 and RPS4 are referred in (Table 3). CLV1 controls shoot and floral meristem size. Equally, PSY1 is a tyrosine-sulfated peptide hormone. This hormone stimulates cellular proliferation and maintenance of root stem cells [32]. PSY1 and other secreted peptide hormones such as CLE2, suffer post-translational modifications and could function as ligands of CLV1 [32]. Finally, the R protein RPS4 specifies resistance to Pseudomonas syringae pv. tomato expressing avrRps4. SGT1 is an ubiquitin ligaseassociated protein proved to have a role in host and non-host resistance [33]. The work ofLi, Li, Bi, Cheng, Li and Zhang [34] suggested that SGT1 conforms a complex that negatively regulates RPS4 accumulation. All in all, the functional predictions for these Arabidopsis IRGs are well documented and therefore, the WOENs can be mined for potential IRGs.

CONCLUSION
The proposed procedure allows obtaining a global enzymatic network and a gene network for any organism for which genomic data is available. Topological analyses showed the graph transformation at each step. The tendency of nodes to cluster remains constant along the process, while an improvement in connectivity and noise reduction was observed after the blast search and expression data integration. The WOEN edges are reinforced with the biological data found in literature. Furthermore, our results from the merging of immunity microarray data and the obtained metabolic network, predict a strong relationship between some genes immune processes in Arabidopsis.

AVAILABILITY AND REQUIREMENTS
Pre-processing of microarray data and similarity measurements between genes based on the gene expression profiles, can be obtained with the aid of several software packages. We recommend using R (R Development Core Team, 2011). Perl is free software and the scripts described in this work can be run in any LINUX/UNIX operating system with a running installation of BLAST+. Scripts are available under request at llopezk@unal.edu.co.

CONFLICT OF INTEREST
The author(s) confirm that this article content has no conflict of interest.

ACKNOWLEDGEMENTS
This research has been partially financed by a Grant of the Colombian Banco de la República (Project # 3202).

AUTHORS' CONTRIBUTIONS
• Liliana López-Kleine proposed and directed the Banco de la Repúplica Project proposing. Jorge Ramírez was the co-director of this Project. They design the strategy of the software together. Andrés Quintero wrote all scripts and implemented the software for the A. thaliana data in Perl. Luis Leal analyzed the topology of the obtained network and reviewed biological information on immunity processes of A. thaliana • Andrés Quintero is a biologist and actually is a student of the Master of Science -Biology program at Universidad Nacional de Colombia -Sede Bogotá. His main research area is reconstruction of biological networks.
• Jorge Ramirez is a civil engineer with a MSc and a PhD in mathematics. He currently holds an associate professorship at the Mathematics department of Universidad Nacional de Colombia. His research is mainly focused on stochastic processes and their applications to problems in the natural sciences.
• Luis Leal is a chemical engineer with MSc in statistics at Universidad Nacional de Colombia. His research focuses on statistical approaches to genomics.
• Liliana López-Kleine is a biologist with PhD in statistics and actually associate profesor at the Universidad Nacional -Sede Bogotá Statistics Department. Her main research area is Statistical Genomics.

SUPPLEMENTARY MATERIALS
Supplementary material is available on the publisher's web site along with the published article.