Genomes to Hits In Silico - a Country Path Today, a Highway Tomorrow: a Case Study of Chikungunya

These are exciting times for bioinformaticians, computational biologists and drug designers, with genome and proteome sequences and related structural databases growing at an accelerated pace. The post-genomic era has triggered high expectations for rapid and successful treatment of diseases. However, in this scenario, rich in biological information but poor in functional knowledge, the challenges are indeed grand, no less than the assembly of the genome of a whole organism. These include functional annotation of genes, identification of druggable targets, prediction of three-dimensional structures of protein targets from their amino acid sequences, and arriving at lead compounds for these targets, followed by a transition from bench to bedside. We propose here a "Genome to Hits In Silico" strategy (called Dhanvantari) and illustrate it on Chikungunya virus (CHIKV). "Genome to hits" is a novel pathway incorporating a series of steps such as gene prediction, protein tertiary structure determination, active site identification, hit molecule generation, and docking and scoring of hits to arrive at lead compounds. The current state of the art for each of the steps in the pathway is highlighted and the feasibility of creating an automated genome to hits assembly line is discussed.


INTRODUCTION
Automating the pathway from genomes to hit molecules poses several challenges. It involves, inter alia, (i) accurate genome annotation, (ii) identification of druggable target proteins, (iii) determination of three-dimensional structures of protein targets, (iv) identification of hits for the target, and (v) optimization of hits to lead molecules to realize high affinity and selectivity for the target and low toxicity. Here, we describe the progress achieved in each of the above areas and the feasibility of a "Genome to hits" assembly line in silico (Fig. 1), and illustrate the approach with chikungunya virus (CHIKV).

BACKGROUND
We describe here the science and the software behind the "Genome to Hits" assembly line, which comprises six steps (Fig. 1), classifiable into three major areas of research, viz. (a) genome annotation (steps 1 and 2), (b) protein tertiary structure prediction (step 3) and (c) structure based drug design (steps 4 to 6). Information available on chikungunya virus, which is taken up as an illustrative case in this study, is summarized in subsection (d).
(a). Genome Annotation. Computational genome annotation can play a vital role in finding potential therapeutic target molecules for pathogens. At present, structural and functional annotation of a whole genome sequence or of its translated ORFs (open reading frames) remains a major challenge. These annotations can be used in comparative genomics, pathway reconstruction and particularly in drug design.
Genome annotation is the process of exploring biological/functional information from sequences (Table 1). It is done by following two main steps: (i) identification of distinct, potentially functional elements on the genome, a process called gene prediction in the context of identification of protein coding regions, and (ii) assignment of biological function to these elements (genes or proteins).
*Address correspondence to this author at the Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India; Tel: 91-11-2659 1505; 91-11-2659 6786; Fax: 91-11-2658 2037; Emails: bjayaram@chemistry.iitd.ac.in; website: www.scfbio-iitd.res.in
Automated annotation tools are faster than manual annotation (curation), which relies on human expertise. Ideally, these approaches coexist and complement each other in the same annotation pipeline. The basic level of annotation involves finding genes and separating protein coding sequences from non-coding sequences. A variety of computational approaches have been developed to permit scientists to view and share genome annotations (Table 2). Most of the available computational methods are knowledge-based and adopt techniques like Hidden Markov Models or machine learning methods. The accuracies of these models are limited by the availability of data on experimentally validated genes and, as typically seen in newly sequenced genomes, can lead to suboptimal levels of prediction. Ab initio methods originating in the physico-chemical properties of DNA can help overcome the limitations of knowledge-based methods [1,2]. The performance of PSI-BLAST and other database search tools in identifying homologs of a given query in a sequence database has been measured by others [3]. However, these benchmarks do not address the requirements of genome annotation. Our efforts are aimed at overcoming the limitations of PSI-BLAST in correctly annotating protein coding sequences in genomes by using an ab initio approach. Physico-chemical properties such as hydrogen bonding, stacking and solvation show clear signatures of the functional destiny of DNA sequences [4][5][6][7][8], which has formed the basis of ChemGenome. In the present study, we have used ChemGenome, the SCFBio tool (http://www.scfbio-iitd.res.in/chemgenome/chemgenome3.jsp), to produce and interpret structural annotations for the genome of Chikungunya virus.
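To make the notion of separating protein coding regions from non-coding sequence concrete, the following minimal sketch scans the three forward reading frames for stretches that begin with ATG and end at a stop codon. It is a naive illustration only and is not the ChemGenome algorithm, which relies on physico-chemical signatures; the function name `find_orfs` and the `min_codons` threshold are assumptions introduced here.

```python
# Illustrative only: a naive forward-strand ORF scan, NOT the ChemGenome method.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Return sorted (start, end, frame) tuples for ATG...stop ORFs.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons (hypothetical threshold) counts codons from ATG up to,
    but not including, the stop codon.
    """
    dna = dna.upper().replace("U", "T")  # accept RNA sequences written with U
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume searching after the stop codon
    return sorted(orfs)


# Toy sequence (made up for illustration): one short ORF in frame 2.
toy_sequence = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
toy_orfs = find_orfs(toy_sequence, min_codons=3)
```

A scan of this kind over-predicts on real genomes, since any long enough stop-free stretch is reported; this is one reason knowledge-based or physico-chemical scoring is layered on top of ORF detection.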
(b). Protein tertiary structure prediction. Genome annotation is followed by protein annotation at the structural, functional and genomic scales, which is essential for routine work in biology and for any systematic approach to the modeling of biological systems. To bridge the expanding sequence-structure gap, many computational approaches are becoming available which assign structure to a novel protein from its amino acid sequence. These automated methods fall into three classes: (a) homology modeling, (b) fold recognition or threading, and (c) ab initio or de novo methods. Homology modeling and fold recognition methods utilize the information derived from structures solved previously via X-ray and NMR methods. Homology modeling is effective, popular, reliable and fast for protein tertiary structure prediction when a close sequence homolog exists in the structural repositories. Several protein structure prediction tools are available in the public domain (Table 3).
To make biological sense out of large volumes of sequence data, it is necessary to compare the protein sequences with those of proteins that have already been characterized biochemically. Structural annotation also plays an important role in the design of drug molecules. Structural genomics (SG) efforts facilitate such comparisons by determining the structures for a large number of protein sequences, but most SG targets have not been functionally characterized. It is already known that accurate functional details of a protein cannot be inferred from its sequence alone, nor from sequence comparisons with other proteins whose structures and functions are known, but only from its own native structure [9][10][11].
Ab initio or de novo methods predict tertiary structures of proteins from basic physical principles, irrespective of the availability of structural homologs. In this study, the Bhageerath and Bhageerath-H servers are employed for protein structure prediction. Bhageerath is an energy based software suite for predicting tertiary structures of small globular proteins, available at http://www.scfbio-iitd.res.in/bhageerath/index.jsp [12,27]. It predicts five candidates for the native structure from the input query sequence. Bhageerath-H [28] is a hybrid (homology + ab initio) server for protein tertiary structure prediction [29,30]. It identifies regions which show local sequence similarity to sequences in the RCSB Protein Data Bank to generate 3D fragments, which are patched with ab initio modeled fragments to generate complete structures of the proteins. This server likewise returns the five most energetically favorable structures, which are expected to be close to the native. The knowledge of tertiary structures of proteins serves as a basis for structure-based drug design.
(c). Structure based drug design. Design of small molecules in structure based drug discovery requires knowledge of the binding pocket on the protein which, upon blockade, results in loss of function. Experimental information on protein active sites and function loss is useful. In the absence of any experimental information, one could identify all potential binding sites on the protein from the structural information (Table 4). In this study we use the AADS methodology (http://www.scfbio-iitd.res.in/dock/ActiveSite_new.jsp) for automated identification of ten potential binding pockets, which are expected to bracket the true "active site" (binding pocket). AADS requires the 3D structure of the target protein and detects the top 10 potential binding sites, which are reported to capture the actual binding (active) site with 100% accuracy.
Once the binding pockets on proteins are identified, libraries of small molecules are screened against these sites to identify a few hit molecules using software such as RASPD (http://www.scfbio-iitd.res.in/software/drugdesign/raspd.jsp). The RASPD protocol is designed in the spirit of the structure-based drug design approach but with a rapid turnover rate. RASPD screens small molecule databases against the active sites based on physicochemical descriptors, broadly Lipinski-type parameters such as hydrogen bond donors, hydrogen bond acceptors, molar refractivity, Wiener index and volume, for the protein and drug, and also the functional groups [31][32][33]. The most interesting feature of RASPD is that it generates a set of hit molecules based on the complementarity of the aforementioned properties, subject to a cutoff binding affinity, bypassing the conventional docking and scoring strategies, which reduces the search time significantly. The libraries incorporated in RASPD are a million compound library of small molecules and a natural product library. Users can also sketch molecules of their choice or use a non-redundant dataset of small molecules, NRDBSM [34] (http://www.scfbio-iitd.res.in/software/nrdbsm/index.jsp), and submit them for RASPD screening. The screening is followed by atomic level docking and scoring (Table 5) with tools such as Sanjeevini (http://www.scfbio-iitd.res.in/sanjeevini/sanjeevini.jsp) to identify a few candidates which could be pursued as leads for experimental synthesis and validation [35,36]. The ParDOCK module of Sanjeevini is an all-atom energy based Monte Carlo algorithm for protein-ligand docking. It positions ligands in their optimal configuration in the protein binding site and scores them based on their interaction energies. This utility is freely accessible at http://www.scfbio-iitd.res.in/dock/pardock.jsp [37].
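Of the descriptors named above, the Wiener index is the least familiar: it is the sum of shortest-path bond counts over all pairs of heavy atoms in the molecular graph. The sketch below computes it by breadth-first search and adds a simple descriptor-window check of the kind a rapid pre-docking filter might apply. It is not the RASPD implementation; the function names and the window values (taken here from the standard Lipinski limits on hydrogen bond donors and acceptors) are assumptions for illustration.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered heavy-atom pairs.

    adjacency maps each atom label to the list of labels it is bonded to.
    """
    total = 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from this atom
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end


def passes_window(descriptors, window):
    """True if every descriptor named in window lies within its (low, high) range."""
    return all(low <= descriptors[name] <= high
               for name, (low, high) in window.items())


# Heavy-atom skeleton of n-butane, C1-C2-C3-C4.
butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
butane_wiener = wiener_index(butane)

# Hypothetical window; RASPD's actual thresholds are not given in the text.
lipinski_window = {"h_bond_donors": (0, 5), "h_bond_acceptors": (0, 10)}
candidate = {"h_bond_donors": 2, "h_bond_acceptors": 6}
candidate_ok = passes_window(candidate, lipinski_window)
```

Because such descriptors are computed once per molecule and compared against precomputed pocket properties, a filter of this kind scales to million-compound libraries far more cheaply than atomic-level docking.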
ParDOCK uses the BAPPL scoring function [38] for atomic level scoring of non-metallo protein-ligand complexes and for ranking them by their estimated binding free energies. BAPPL is also freely accessible at http://www.scfbio-iitd.res.in/software/drugdesign/bappl.jsp. The accuracy of this scoring function in predicting binding free energy is high, with an average error of ±1.02 kcal/mol and a correlation coefficient of 0.92 between the predicted and experimental binding energies for 161 protein-ligand complexes. An extended version of BAPPL, BAPPL-Z, can be used to predict binding energies of complexes having a zinc ion in their active sites. The BAPPL-Z utility is accessible at http://www.scfbio-iitd.res.in/software/drugdesign/bapplz.jsp [39]. All these tools are gathered in Sanjeevini, a complete drug design software suite, freely accessible at http://www.scfbio-iitd.res.in/sanjeevini/sanjeevini.jsp [34][40][41][42][43][44][45][46][47]. Thus, candidate molecules are assessed based on their binding energies, and the molecules identified as good binders to the target are considered further for synthesis and testing.
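The two accuracy figures quoted for the scoring function, an average error and a correlation coefficient, can be computed from paired predicted and experimental binding energies as sketched below. The sketch assumes that "average error" means the mean absolute deviation and that the correlation is Pearson's r; the text does not state either explicitly, and the sample values are invented for illustration rather than taken from the 161-complex dataset.

```python
import math


def mean_absolute_error(predicted, experimental):
    """Average of |predicted - experimental| over all complexes (kcal/mol)."""
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Invented example values (kcal/mol), for illustration only.
predicted_energies = [-8.0, -9.5, -6.2, -11.0]
experimental_energies = [-7.0, -10.0, -6.0, -10.5]

error = mean_absolute_error(predicted_energies, experimental_energies)
correlation = pearson_r(predicted_energies, experimental_energies)
```

The two measures are complementary: a scoring function can rank ligands well (high correlation) while being systematically offset in absolute energy, or the reverse, so both are needed to judge its usefulness for lead selection.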
(d). Chikungunya Virus. Chikungunya fever (CHIK) is a devastating mosquito-borne (Aedes aegypti) disease caused by Chikungunya virus (CHIKV), an alphavirus belonging to the family Togaviridae. It is one of the most important re-emerging infectious diseases in Africa and Asia, recurring at sporadic intervals, and has a significant global impact on public health [48][49][50][51][52][53][54][55][56][57][58][59][60][61][62]. CHIKV was listed as a category C pathogen in 2008 by the National Institute of Allergy and Infectious Diseases (NIAID) and is classified as a biosafety level 3 (BSL3) pathogen [50][63][64][65][66]. CHIKV causes a debilitating and prolonged arthralgic syndrome, incapacitating the affected population for long periods. CHIKV is usually found in the tropics but has spread across the globe in recent years due to a range of transmission vectors, globalization and climatic changes. The word 'Chikungunya' originates from the Makonde root verb kungunyala, meaning "that which bends up" [112,113], which refers to drying up or becoming contorted and signifies the stooped posture developed due to the excruciating joint and muscle pain and other rheumatologic manifestations [114,115]. The disease presents as a sudden onset of fever with arthralgia, which generally resolves within a few days [116,117]. Female mosquitoes acquire the virus by taking blood from viremic vertebrate hosts (Fig. 2). The virus establishes a persistent infection and replicates rapidly, especially in the salivary glands of the insects [118,119]. In addition to the salivary glands, it replicates in various other organs inside the body cavity, including gut, ovary, neural tissue and body fat [120]. When a CHIKV-loaded mosquito bites a healthy human, it transfers the virus into the bloodstream. The virions, through interaction with receptors, enter the target cells by endocytosis.
The acidic environment of the endosome triggers conformational changes in the viral envelope that expose the E1 peptide [121][122][123][124][125], which mediates virus-host cell membrane fusion. This allows cytoplasmic delivery of the core and release of the viral genome into the cytoplasm. Transcription of viral mRNA takes place in the cell cytoplasm.
Although no specific drugs are available, CHIK is usually treated with non-steroidal anti-inflammatory drugs (NSAIDs), with inconsistent success (Table 7). Owing to the non-availability of an effective drug to cure the disease, there is an urgent need for a systematic strategy to develop new therapeutics. We describe in the following section how computational approaches can help reduce the time taken to arrive at potential lead molecules.

CALCULATIONS & RESULTS: APPLICATION OF THE G2H ASSEMBLY LINE TO CHIKV
The genome sequence of Chikungunya virus was retrieved from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/NC_004162). For gene prediction, the sequence was processed using ChemGenome 3.0 (http://www.scfbio-iitd.res.in/chemgenome/chemgenome3.jsp) [5,6]. The results showed two genes matching those already published, implying that in this case 100% accuracy is achieved with ChemGenome 3.0. These nucleotide sequences were translated to protein sequences by ChemGenome 3.0. The proteins in CHIKV are polyproteins, i.e. each translated sequence contains the sequences of all proteins coded by the gene. The individual proteins are cleaved from the polyprotein during post-translational processing. To date, no reliable computational approach is available to cleave polyproteins; the sequences were therefore dissected manually for each protein, using cleavage sites identified from literature and experimental evidence. The ChemGenome 3.0 results are archived at http://www.scfbio-iitd.res.in/software/chemgenomeresult.jsp.
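The two operations described above, translation of a predicted gene and manual dissection of the resulting polyprotein, can be sketched as follows. The codon table is the standard genetic code; the function names, the toy sequences and the cleavage positions are hypothetical and stand in for the literature-derived sites used in this study, which are not reproduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)}


def translate(dna):
    """Translate a coding sequence from its first base up to the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def split_polyprotein(protein, cut_after):
    """Split a polyprotein after the given 1-based residue positions."""
    bounds = [0] + sorted(cut_after) + [len(protein)]
    return [protein[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy inputs, invented for illustration only.
toy_protein = translate("ATGGCTAAATGA")
toy_fragments = split_polyprotein("MAKLTGGSW", cut_after=[3, 6])
```

Keeping the cleavage positions as explicit input makes the manual step auditable: each position can be traced to the literature source that supports it, and the split can be rerun if a site is revised.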
The sequences extracted from ChemGenome 3.0 served as inputs to Bhageerath-H (http://www.scfbio-iitd.res.in/bhageerath/bhageerath_h.jsp), a tertiary structure prediction server [28]. For each submitted sequence, five structures were returned by the server. The results received from Bhageerath-H are shown in Fig. 4. As no homolog information is available to support these structural models, all five structures are considered plausible candidates for the native and are carried forward for further studies. It may be noted that tertiary structure prediction of structural proteins associated with membranes is a nascent area with a low success rate at this stage; hence the focus here has been on nonstructural proteins, which can fold autonomously. Most experimentally determined structures carry some information on the ligand binding domain/site, but CHIKV proteins presently lack such structural information, necessitating detection of ligand binding sites (active sites). For active site detection, AADS (Automated active site docking and scoring) (http://www.scfbio-iitd.res.in/dock/ActiveSite_new.jsp), an automated version of the active site finder, is utilized; it predicts the potential binding site(s) and docks the selected molecule to the top ten cavities in an automated mode [40]. Binding sites on each of the five structural models of each nonstructural protein are identified. Not all cavities determined by the active site identifier may be true binding sites with functional implication, but one among them is very likely to be such a site. The additional cavities can be checked for their ability to act as allosteric sites. The predicted top 10 binding sites are shown as black dots in the protein structures (Fig. 4).
In search of probable hits, the 10 cavities per structure identified by AADS are further subjected to RASPD (Rapid screening of preliminary drugs) (http://www.scfbio-iitd.res.in/software/drugdesign/raspd.jsp) [41]. RASPD returned more than 500 molecules against the predicted cavities of CHIKV proteins with -8.00 kcal/mol as the binding energy cutoff.
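The cutoff step can be illustrated with a short sketch that retains molecules whose estimated binding energy is at or below -8.00 kcal/mol (more negative meaning stronger predicted binding) and ranks them. Whether RASPD treats the cutoff as inclusive is an assumption made here, and the molecule identifiers and energies are invented.

```python
CUTOFF_KCAL_PER_MOL = -8.00  # binding energy cutoff quoted in the text


def select_hits(scores, cutoff=CUTOFF_KCAL_PER_MOL, top_n=None):
    """Return (molecule_id, energy) pairs at or below cutoff, strongest first.

    scores maps a molecule identifier to its estimated binding energy in
    kcal/mol; top_n optionally truncates the ranked list.
    """
    kept = sorted((energy, molecule) for molecule, energy in scores.items()
                  if energy <= cutoff)
    ranked = [(molecule, energy) for energy, molecule in kept]
    return ranked if top_n is None else ranked[:top_n]


# Invented scores for one hypothetical cavity, for illustration only.
cavity_scores = {"mol_a": -9.1, "mol_b": -7.5, "mol_c": -8.0, "mol_d": -10.4}
cavity_hits = select_hits(cavity_scores)
best_two = select_hits(cavity_scores, top_n=2)
```

Applying the same function per cavity and per structural model keeps the bookkeeping simple when, as here, each protein contributes five models with ten cavities each.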
In silico drug design beyond this stage involves rigorous docking and scoring [173,174]. The hits identified by RASPD screening are docked to their respective target sites using Sanjeevini (http://www.scfbio-iitd.res.in/sanjeevini/sanjeevini.jsp), which utilizes ParDOCK as its docking tool. For all the modeled structures, one molecule for each cavity has been proposed on SCFBio's CHIKV webpage, accessible at http://www.scfbio-iitd.res.in/software/chikv.jsp. This webpage contains information on the genome annotation, protein tertiary structure prediction, hit molecule identification, and docking and scoring results of the complete genome to hit protocol.
The best 20 molecules selected against the nonstructural proteins of CHIKV are listed in Table 8. From here on, the in silico strategies go hand-in-hand with experimentation. In an iterative process of synthesis, testing, modification, docking and scoring, these molecules can be further improved to yield candidate drugs while taking care of the ADMET profiles [175][176][177][178][179][180].

DISCUSSION ON THE G2H ASSEMBLY LINE
The wealth of information available from experimental host-pathogen interaction studies invites computational biologists to develop databases and newer computational methods to advance further focused experimentation. Consequently, bioinformatics is rapidly evolving into independent fields addressing specific problems in interpreting (i) genomic sequences, (ii) protein sequences and 3D-structures, as well as (iii) transcriptome and macromolecular interaction data. It is thus increasingly difficult for the biologist to choose the computational approaches best suited to the goal of inhibiting the growth of a pathogen in its host.
A basic overview of the G2H technology is given in this review with an application to Chikungunya virus. The G2H assembly line is a culmination of several recent advances in computational chemistry and computational biology implemented in a high performance computing environment. At least five areas for further improvement can be immediately identified: (i) development of algorithms for cleavage of polyproteins, (ii) algorithms for identification of druggable protein targets, (iii) improved accuracies in tertiary structure prediction of nonstructural proteins, (iv) development of methods for determining tertiary structures of structural proteins and (v) identification of hit molecules with reduced toxicities. This protocol should ultimately accelerate the emergence of new methods for treating infectious diseases. Similarly, metabolic disorders can also be addressed via the "Genome to Hit" pathway.

CONCLUSION & PERSPECTIVES
The post-genomic research era encompasses many diverse aspects of modern science. The "Genome to hits" pathway described here symbolizes the emergence of an integrated technology to address specific health issues and, more specifically, provides a novel and rapid approach to identifying new and potent hit molecules from genomic information.

CONFLICT OF INTEREST
The authors confirm that this article content has no conflicts of interest.

ACKNOWLEDGEMENTS
This project is funded by the programme support to SCFBio from the Department of Biotechnology, Govt. of India. The authors gratefully acknowledge the help received from Ms. Garima Khandelwal, Ms. Priyanka Dhingra, Ms. Tanya Singh, Mr. Goutam Mukherjee, Mr. Avinash Mishra, Mr. Shashank Shekhar and Ms. Vandana Shekhar. The authors thank Dr. Aditya Mittal for useful discussions and a critical reading of the manuscript.

SUPPLEMENTARY INFORMATION ON CHIKUNGUNYA AT SCFBIO WEBSITE
Details of the results on genes and protein tertiary structures predicted, binding pockets, hit molecules identified and lead molecules proposed for synthesis are available for free download from the SCFBio website (http://www.scfbio-iitd.res.in/software/chikv.jsp). These results will be updated periodically with improvements in protocols for protein structure prediction and ADMET evaluations.