Application of biasing-potential replica-exchange simulations for loop modeling and refinement of proteins in explicit solvent

Comparative or homology modeling of a target protein based on sequence similarity to a protein with known structure is widely used to provide structural models of proteins. Depending on the target-template similarity these model structures may contain regions of limited structural accuracy. In principle, molecular dynamics (MD) simulations can be used to refine protein model structures and also to model loop regions that connect structurally conserved regions but it is limited by the currently accessible simulation time scales. A recently developed biasing potential replica exchange (BP-REMD) method was used to refine loops and complete decoy protein structures at atomic resolution including explicit solvent. In standard REMD simulations several replicas of a system are run in parallel at different temperatures allowing exchanges at pre-set time intervals. In a BP-REMD simulation replicas are controlled by various levels of a biasing potential to reduce the energy barriers associated with peptide backbone dihedral transitions. The method requires much fewer replicas for efficient sampling compared with T-REMD. Application of the approach to several protein loops indicated improved conformational sampling of backbone dihedral angle of loop residues compared to conventional MD simulations. BP-REMD refinement simulations on several test cases starting from decoy structures deviating significantly from the native structure resulted in final structures in much closer agreement with experiment compared to conventional MD simulations.


INTRODUCTION
Knowledge of the three-dimensional (3D) structure of protein molecules is essential to understand its function. Experimental structure determination is time consuming and costly and the structure of only a limited number of proteins relative to the total number of protein sequences has so far been solved. The enormous recent progress in high-throughput genome sequencing further increases the gap between the number of known protein sequences and known structures. Comparative modeling techniques are increasingly being used to generate model structures for many proteins with sequence similarity to a known template structure. 1 Template based comparative (or homology) modeling methods are so far the most reliable modeling approach for modeling protein structures. 1-12 However the prediction accuracy is limited by the availability of suitable template structures, the accuracy of the target-template sequence alignment, and the modeling of segments with low (or no) similarity to a template. 3,12,13 Often homology models of proteins are accurate for parts of the protein with high sequence similarities to a template but are much less realistic for flexible regions such as loops that connect conserved structural elements. Loops often determine the functional specificity of a given protein structure and can contribute to enzymatic activity or ligand binding properties. 14 Loop modeling is defined as construction of 3D atomic models for short protein segments that connect regular secondary structure elements. Several methods have been developed to predict loop conformations. 12-16 Generally, loop modeling methods can be classified into knowledge-based methods, de novo or ab initio methods and combined approaches. Knowledge-based approaches employ databases of experimental protein structures as a source of loop conformations. 17-24 Possible loop conformations for a given protein segment are evaluated by using rule-based filters, with evaluating criteria such as geometric fit (e.g., distance of the loop termini), sequence similarity to the target segment and inclusion of knowledge-based potentials. The de novo structure prediction approaches perform conformational

ABSTRACT
Comparative or homology modeling of a target protein based on sequence similarity to a protein with known structure is widely used to provide structural models of proteins. Depending on the target-template similarity these model structures may contain regions of limited structural accuracy. In principle, molecular dynamics (MD) simulations can be used to refine protein model structures and also to model loop regions that connect structurally conserved regions but it is limited by the currently accessible simulation time scales. A recently developed biasing potential replica exchange (BP-REMD) method was used to refine loops and complete decoy protein structures at atomic resolution including explicit solvent. In standard REMD simulations several replicas of a system are run in parallel at different temperatures allowing exchanges at preset time intervals. In a BP-REMD simulation replicas are controlled by various levels of a biasing potential to reduce the energy barriers associated with peptide backbone dihedral transitions. The method requires much fewer replicas for efficient sampling compared with T-REMD. Application of the approach to several protein loops indicated improved conformational sampling of backbone dihedral angle of loop residues compared to conventional MD simulations. BP-REMD refinement simulations on several test cases starting from decoy structures deviating significantly from the native structure resulted in final structures in much closer agreement with experiment compared to conventional MD simulations. searches or generate loop conformations from scratch and these searches are guided by force field energy functions or other types of scoring functions 25-40 with a variety of treatments of electrostatics and solvation. [41][42][43] Knowledge-based potentials have also been used in combination with conformational sampling methods, as well as energy functions that combine molecular mechanics force-field terms with statistical potentials. 21,44-46 The databases method are limited by the rapid increase in the number of geometrically possible conformations as a function of loop length. 44 Loop models generated using database methods may require also extensive optimization for loops longer than four residues because a discrete loop template may not exactly close the loop. Systematic de novo search methods encounter similar difficulties because the number of putative loop conformations or size of the conformational space increases with increasing loop length. 7 Similar to modeling of protein loop segments the refinement of protein structures built by comparative modeling requires both large scale conformational sampling and accurate atomic energy functions. Progress has been made using various knowledge based potentials that are augmented by energy terms motivated by consideration of important physical interactions and sampling is often facilitated by low resolution initial conformational searches. 47,48 Baker and coworkers 48 for example, reported encouraging results for 5 of 16 small proteins using a refinement protocol in which multiple rounds of random torsion-angle perturbation and Monte Carlo (MC) relaxation were performed on low-resolution models built from a set of sequence homologues of the target protein using the Rosetta approach. Another method based on fragment assembly and MC simulation is Tasser, 49 which has been applied to the refinement structures determined by NMR spectroscopy. 50 Recently, these ab initio approaches have also been applied to the refinement of homology models. Misura et al.. 51 attempted to refine a series of homology models using Rosetta in combination with evolutionarily derived distance constraints. The authors found in 22 out of 39 cases a model that is closer to the native structure than the template over the aligned regions within the 10 lowest-energy models. However, the method is computationally expensive with the refinement of one model requiring several CPU days. 51 In addition to the approaches outlined earlier, a number of methods that employ statistical potentials or empirical scoring functions to select near-native models from an ensemble of homology models have also been described. [52][53][54][55][56][57][58][59][60][61][62][63][64][65][66][67][68][69] Alternatively, molecular dynamics (MD) should be potentially useful for structure refinement and loop modeling. 70-75 In principle, MD simulations include conformational entropy effects and could serve as tool for conformational search of the loop region and also for atomic resolution refinement of protein model. 26 However, at room temperature affordable MD simulations can be kinetically trapped and explore only the ba-sin of attraction near the starting structures. Several methods have been proposed to over come the barrier crossing problems. For example, Viktor and Simmerling 26,28 have used low barrier MD simulations in implicit solvent to generate loop conformations. Kirsi 27 and Riemann and Zacharias 64 used adjustable -barrier dihedral potentials for conformational search of protein loops during MD simulations. Hao and Mark 70 carried out conventional (c)MD simulations in explicit water for the refinement of homology modeled structures. Though still a significant deviation of simulated structures from the experimental structures was observed, simulations of tens to hundreds of nanoseconds resulted on average in final conformations in closer agreement with experiment than the start structures. 70 The replica exchange MD (REMD) simulation method has been frequently used for enhancing the conformational sampling of biomolecules. 76-79 In REMD simulations, several copies (replicas) of the system are simulated independently and simultaneously using classical MD at different simulation temperatures (or force fields: Hamiltonians). At preset intervals, pairs of replicas (neighboring pairs) are exchanged with a specified transition probability. In most REMD simulations the temperature is used as a parameter that varies among the replicas (T-REMD). The random walk in temperature allows conformations trapped in locally stable states (at a low simulation temperature) to escape by exchanging with replicas at higher simulation temperature. T-REMD simulations in combination with a Generalized Born implicit solvent model have been used for the prediction of loop conformations and refinement. 34,35,80,81 Recently, Jiang et al. 82 have used T-REMD simulations with a statistical potential for refinement of homology modeled proteins in explicit solvent. However, a major drawback of T-REMD simulations is the rapid increase of the number of required replicas with the system size to cover a desired temperature range 83 in turn requiring rapidly increasing computational resources.
As outlined above, the majority of loop prediction and structural refinement methods employs an implicit solvent description. In particular, in case of solvent exposed loop structure this may strongly limit the realistic prediction of loop structures. Ultimately, it is desirable to efficiently predict loop structures and perform structural refinement accounting explicitly for surrounding solvent and ions. Promising results have been achieved using very extensive conventional MD simulations 70 but requiring large computational resources not useful as standard rapid refinement methodology.
Recently, we have proposed a Biasing Potential Replica Exchange (BP-REMD) method that specifically lowers barriers for peptide backbone transitions along the replicas. 84-86 This method requires much fewer replicas than T-REMD especially in the presence of explicit solvent. In the current work we have applied BP-REMD simulations for modeling of loops in protein structures and to refine protein decoy structures in explicit solvent. In 13 test cases the 5 replica BP-REMD protocol is able to identify a most populated conformational cluster closer to experiment than the start structure and performed always better than five independent cMD simulations.

Test structures
The initial structures for the loop modeling and for refinement were selected from the Rosetta decoy set. 87 For loop modeling several alpha, beta and alpha/beta protein decoys were chosen (see Table I), in such a way that the model contained already most (>90%) of the correct secondary structure elements and differed only in the connecting loop parts from the native structures. To perform realistic loop modeling, distances restraints were applied to the rest of protein molecule other then the specific loop residues. It is important to note that distances between all C a -atoms excluding the specific loop residues, were derived from the initial decoy structure (not the native structure) and were included during both MD and BP-REMD simulations. During the simulations structures were allowed to move freely within C a 2 C a distance changes of AE0.5 Å and a quadratic penalty with a force constant of 1 kcal mol 21 Å 22 for deviations larger then 0.5 Å from the decoy start structure. The target loop regions connecting secondary structure elements were completely free during all stages of the simulations.
For the refinement simulations of complete proteins seven alpha and alpha/beta protein decoy structures from the Rosetta decoy set were used (see Table II). No restraints were used during these refinement simulations.

Simulation setup
Hydrogen atoms were added to the decoy structures using the xleap module of the Amber9 package. 88 Explicit TIP3P water molecules 89 were added to all decoys to form a truncated octahedral boxes using xleap (minimum distance between protein and box boundary: 10 Å ). All MD simulations were carried out with the Sander module of the AMBER9 package in combination with the parm03 force field. 90 The simulation systems were first subjected to energy minimization (1000 steps). During MD simulations, the protein was initially harmonically restrained (25 kcal mol 21 Å 22 ) to the energy minimized start coordinates, and the system was heated up to 300 K in steps of 100 K followed by gradual removal of the positional restraints and a 1ns unrestrained equilibration at 300 K. The resulting system was used as starting structure for both BP-REMD and conventional MD simulations. During all simulations the long range electrostatic interactions were treated with the particle mesh Ewald (PME) method 91 using a real space cutoff distance of cutoff 5 9Å (constant pressure of 1 bar). The Settle algorithm 92 was used to constrain bond vibrations involving hydrogen atoms, which allowed a time step of 2 fs.

Biasing potential replica-exchange simulations
The BP-REMD method is a Hamiltonian REMD method that employs a biasing potential for the F and Rmsd of loop Ca atoms and heavy atoms (in brackets) after best superposition of the complete protein to the native reference structure (the second row for each case indicates the results for a second simulation starting from a different loop start structure). N res corresponds to the total number of residues in the protein model. The last 5ns of simulation trajectories were used for calculation of lowest Rmsd and Rmsd after cluster analysis (Rmsd was calculated for the loop-C a atoms after best superposition of the complete protein onto the native experimental structure). Rmsd(Å ) cluster indicates the Rmsd of the centroid (cluster center) representing the most populated conformational cluster relative to the native structure.
C peptide backbone dihedral angles. 84 The biasing potential is based on a potential of mean force (PMF) for each of the two dihedral angles calculated for a model peptide (alanine dipeptide) in explicit solvent. 84 Addition of the biasing potential during a simulation lowers the energy barriers for backbone dihedral transitions in a peptide or protein essentially making the protein chain more flexible. The biasing potential had been derived previously by potential-of-mean force (PMF) free energy simulations on the backbone dihedral F and C of Alanine dipeptide in explicit water. 84 During a BP-REMD the level of biasing is gradually changed along the replicas such that frequent transitions are possible at high levels of biasing. Note, that the same biasing potential derived from the model system was used for all backbone dihedral in the protein (except Gly and the Pro F angle). In a BP-REMD simulation different biasing potential levels are applied in each replica (one reference replica runs without any biasing potential) and replica exchanges between neighboring biasing levels were attempted every 1000 MD steps (2ps) and accepted or rejected according to a Metropolis criterion 93 (5000 attempted exchanges in 10 ns). An advantage compared to T-REMD is the fact that the energy differences are only affected by the force field term that changes upon going from one replica to another replica run. For the present simulations we used five replicas, and the acceptance probability for replica exchanges was in the range of 30-40%. For each case two independent sets of BP-REMD simulations (10 simulations in total) and two sets of conventional MD simulations with five independent standard MD simulations per set were carried out at 300K using two different starting decoy conformations. In the case of loop modeling distance restraints were used during both MD and BP-REMD simulations (no restraints during full refinement).

Prediction of protein loop structure
MD loop modeling simulations were performed on three alpha helical proteins, two beta proteins and one a-beta protein structures. The protein structures contained loop structures ranging in size between 3 and 7 residues. For each case two different starting structures were used with heavy atom root-mean square deviation (Rmsd) of 2.5-5.5 Å from the native loop structure (Table I). Decoy start structures were obtained from the Rosetta protein decoy set 34 and contained already almost correctly folded secondary structures (>90% according to the DSSP program) with the main structural variation due to loop regions that connect secondary structure elements. To stabilize the near-native secondary structure and also the spatial arrangement of secondary structure elements distance restraints (between backbone Ca atoms) were included during both MD, and BP-REMD simulations to prevent the system from unfolding or sampling states far away from the folded structure (see Methods for details). Note, that the distance restraints were derived entirely from the decoy start structures without inclusion of any knowledge of the native structure.
The distance restraints were sufficiently soft to allow distance fluctuations of AE0.5 Å without any change in restraining energy. The protocol corresponds basically to a loop refinement approach that in contrast to most other approaches with fixed loop anchor sites 80,81 allows considerable freedom for adjustment of secondary anchor elements connected to the flexible loop segments (the loop regions were completely mobile during all simulation stages). For all the cases five independent standard MD simulations (with different initial velocities) were carried out from two different start structures at 300 K. The results of MD loop modeling simulations are summarized in Table I and are illustrated for pdb5znf and pdb1pft in Figure 1. The Rmsd of sampled conformations was calculated by superimposing the whole protein on to its native structure and calculating the Rmsd only of the loop part. Hence, the calculated Rmsd includes also the arrangement of the loop relative to the rest of the protein. In case of pdb1pft (loop 2, residues 13-17) during five independent 10 ns simulations (with different initial velocities the deviation from the native loop structure remained similar to the start structure [ Fig. 1(a)]. For pdb5znf (loop1, residues 5-8) during three out of five cMD simulations the Rmsd from the native loop structure decreased from $4 Å to $2.5 Å [ Fig. 1(d)]. Along with standard cMD simulations BP-REMD simulations with five biasing levels at 300K were carried from the same initial decoy structures (two independent BP-REMD simulations with different start structures such that the total computational demand is identical to the cMD control simulations). For both pdb1pft and pdb5znf the reference replica (without biasing) in the BP-REMD simulations sampled loop conformations significantly closer to experiment than observed during cMD simulations [ Fig. 1(b,e)]. The rapidly oscillating Rmsd in the reference replica (during the whole simula- tion) also indicates extensive exchanges between replicas during the entire BP-REMD runs. Similar to pdb1pft and pdb5anf the BP-REMD simulations sampled loop conformations close to the native structure often with heavy atom Rmsds of only $1 to 1.5 Å (Table I). Moreover, the most populated cluster centroid was also in all cases close to the native loop structure. This indicates that the native loop topology can simply be identified by a structure representing the most populated conformational cluster.
In contrast, starting from the same incorrect decoy loop conformations cMD simulations sampled mainly conformations that remained close to the starting conformations (Table I) with an Rmsd of the most populated cluster centroid close to the Rmsd of the start structure (cf. column 5 and 7 in Table I).
The sampling quality of BP-REMD and cMD for sampling relevant loop structures was also compared in terms of backbone dihedrals angles for the central residues of loop1 in pdb5znf (see Fig. 2). With standard cMD simulations, the sampled dihedral states remain close to the starting dihedral angle state presumably because of the presence of energy barriers. The lowering of dihedral angle barriers along the replicas in the BP-REMD allows more transitions that if relevant for the sampling in the unbiased force field can exchange and improve sampling in the unbiased reference simulation (see Fig. 2). For all four loop residues a broader sampling of dihedral states Comparison of backbone dihedral angle sampling of the four central loop residues of pdb5znf during cMD simulations and in the unbiased reference replica of the BP-REMD simulation (during final 5ns). The upper/lower two plots correspond to the results starting from the first/second loop decoy structure, respectively. Each dot in the Ramachandran plots corresponds to a F-C-pair recorded every 2 ps (same number of dots for each case). Note, that in case of the cMD simulations the combined results of all five cMD simulations is shown (to achieve the same point density only every fifth data point is plotted). Start (green circle) and native (red triangle) dihedral angle combination are indicated.
was observed in case of the BP-REMD compared to cMD simulations overlapping with the native dihedral angle state.

Refinement simulations of decoy protein structures
MD refinement simulations were carried out on several decoy model protein structures including five alpha helical proteins and two alpha-beta protein. Similar to the loop modeling simulations, the initial structures were taken from the Rosetta protein decoy set. In contrast to the loop prediction simulations no restraints were used during the decoy refinement simulations. The decoy structures already contained partially correct secondary structures and in some cases an overall topology similar to the native structure [ Fig.   3(a)]. However, for three cases the initial decoy structures were far from the native structure (Rmsd >5 Å ) with partially incorrect topology [pdb1gab, 2ptl, 1bw6; illustrated in Fig. 4(a)]. The backbone Rmsd with respect to the native structure was in all cases between 2.4 and 8.1 Å (corresponding to $3.4 to 9.6 Å Rmsd of heavy atoms; Table II). Starting from two different decoy structures five independent cMD simulations and BP-REMD simulation with five replicas were carried out at 300K in explicit solvent.
In the case of standard cMD simulations during 10 ns simulation time the cluster centroid of the most populated cluster and the lowest Rmsd (from native) structures showed overall in most cases only a slight or no improvement compared to the decoy start conformations (Table II). In contrast, application of the BP-REMD methodology starting from the same conformations  resulted in basically all cases in finally sampled conformations in much closer agreement with experiment than the cMD simulations (Table II, Fig. 3).
Remarkably, for the three proteins with largest deviation of initial decoys from the native structure (backbone Rmsd 5.1-8.1 Å ) [pdb1GAB, 2PTL and 1BW6; Fig. 4(a)] application of the BP-REMD method resulted in finally most populated conformational clusters with a deviation of 3.5 to 4.8 Å from the native conformation. Even in the final stage of the simulations frequent exchanges between the near native compact structures and more expanded structures were observed. For example, in case of the 1GAB protein the reference replica of the REMD simulation frequently sampled a near-native conformation but also less folded structures close to the start conformation in contrast to cMD simulations (see Fig. 5). This indicates also a small free energy difference between the more open state and the folded compact state of the 1GAB protein.

DISCUSSION
Depending on the degree of sequence similarity to a template structure homology modeled structures often require further refinement in particular in loop regions or segments of low target-template similarity. Even within a family of homologous proteins functional variations can arise as a consequence of structural differences which are often found on exposed loop regions. And more over the generation of more realistic models that are closer to the native state than the template structure is still a great challenge. In recent years a variety of loop modeling and structural refinement methods have been developed (see Introduction for an overview). The great majority of these methods employ implicit solvation models 26,28 and often use rigid anchor conformations which may severally restrict the available conformational space of the loop segment. 33 Preferably, loop structure prediction and refinement should be performed in explicit solvent to represent the aqueous environment of the protein as realistically as possible. MD simulations in explicit solvent have already been used to refine homology modeled proteins but require in the order of $100ns simulation time to achieve reasonable refinement depending on the deviation of the starting conformation. 70 The application of the BP-REMD method significantly enhanced the conformational sampling of loop structures compared to cMD simulations and transitions  to near-native loop conformations were observed in basically all loop modeling test cases and accumulated as most populated cluster in the final phase of the simulations. In contrast to many other procedures not only the loop segment was flexible during the simulations but also the anchor regions and the rest of the protein were allowed to adjust within predefined limits. This might be important for realistic loop prediction because inaccuracies in the placement of the anchor regions (e.g., during comparative protein modeling) may interfere with realistic loop structure prediction. Similar to loop structure prediction the BP-REMD method showed also significantly better performance than standard cMD simulation during refinement of complete protein structures in explicit solvent conditions. The much smaller computational demand compared to T-REMD allows application of the methodology to a large range of proteins and modeling under the currently most accurate explicit solvent description to realistically account for protein solvation. The relatively short simulation length of $10 ns may also allow the methodology to effectively compete in terms of speed with systematic search approaches or other MD-based methods that employ implicit solvent descriptions. An implicit solvent model may not always be sufficiently accurate to realistically describe the relative energies of different conformational states of proteins and loop structures and to correctly account for solvent entropy contributions. It should be kept in mind that not only sampling but also the accuracy of the underlying force field limits the outcome of refinement simulations. However, the improved sampling in BP-REMD simulations compared to cMD simulations could also help to systematically refine force field parameters to further improve the output of comparative modeling and refinement calculations.

ACKNOWLEDGMENTS
This work was performed using the computational resources of the CLAMV (Computer Laboratories for Animation, Modeling and Visualization) at Jacobs University Bremen and supercomputer resources of the EMSL (Environmental Molecular Science Laboratories) at the PNNL (Pacific Northwest National Laboratories).