Correction: Prediction and analysis of the modular structure of cytochrome P450 monooxygenases

In our article (Sirim et al. BMC Struct Biol 2010, 10: 34) the cytochrome P450 monooxygenase P450cam was referred to as CYP101D. In the latest release 3.0 of our cytochrome P450 database CYPED (www.CYPED.uni-stuttgart.de) P450cam was reassigned to family CYP101A and referred to as CYP101A1.


Background
Cytochrome P450 monooxygenases (CYPs) are a ubiquitous protein family found in all eukaryotes, most prokaryotes and Archaea. These heme-containing enzymes catalyze the monooxygenation of a large variety of substrates [1]. CYPs have an essential function in drug metabolism and are therefore a focus of the pharmaceutical industry [2]. They are also of great interest for synthetic applications in biotechnology as versatile biocatalysts [3]. A profound knowledge of the factors mediating selectivity and activity of these proteins is a prerequisite for the development of CYPs with improved properties. Therefore, deeper insights into the relationships between sequence, structure and function are of great interest.
According to Nelson's classification [4], CYPs are grouped into homologous families and superfamilies, predominantly based on sequence similarity. The sequence identity between proteins from different superfamilies is extremely low and may be less than 20% [5]. Only three amino acids are totally conserved: the glutamic acid and the arginine of the ExxR motif, which is involved in stabilizing the core and in heme binding [6], and the heme-binding cysteine. However, the increasing number of crystal structures shows that despite this unusual variability the overall structure is highly conserved: CYPs consist of structurally conserved modules that are essential for structure and function, and of variable regions that mediate the individual biochemical properties. The defined conserved secondary structures are named αA-L and β1-5; they could be identified in all CYP structures and make up the so-called CYP fold [7][8][9].
Most CYPs require interaction with a reductase to provide electrons, either as separate proteins or as fusion proteins. Depending on the nature of their electron transfer partner, CYPs are assigned to different classes. Although no consensus has been reached on the definition of this classification, several schemes have been proposed which subdivide CYPs into up to nine classes [10][11][12]. The most general one, which was applied in this work, discriminates between two major classes of CYPs [13]: class I, which comprises mitochondrial and bacterial CYPs, and class II, which comprises CYPs interacting with a cytochrome P450 reductase-type (CPR-type) FMN/FAD reductase. This scheme represents a simplification of the widely accepted classification scheme by Kelly et al. [1]. Further, some CYPs are known which do not need a reductase for their reaction [14]. Fusion proteins, such as the self-sufficient class II CYP102A1 from Bacillus megaterium (P450 BM-3), which contains a heme domain and a reductase, as well as CYPs which do not require any reductase interaction, appear very rarely in nature [15]. Therefore, in most CYPs the interaction with their appropriate redox partner is a prerequisite for their reaction to occur. Many different CYP isoenzymes interact with only one reductase, and it is assumed that CYPs of the same class are comparable with regard to their reductase interaction sites [16]. Favorable electrostatic interactions are expected between CYPs and their electron transfer partner [17]. A crystal structure of a CYP-reductase complex is not yet available. Even though the kinetics of P450 reduction may not be generalized among different P450 systems, and the concepts regarding the influence of a rate-limiting step are not universal [18], the electron transfer from the reductase to the heme domain is often slow and one of the rate-limiting aspects in many CYP systems [19].
However, the interactions between the components of the electron transfer systems still remain unclear. A deeper understanding of the factors determining reductase interaction gained by the analysis of the reductase interaction sites of CYPs will assist in improving interactions and consequently lead to optimized enzymes for biocatalytic applications [20].
Previous analyses of structure conservation in CYPs showed that all CYPs have a well-conserved heme-binding structural core formed by αD, αE, αI and αL as well as αJ and αK [21]. The β-bulge region which contains the thiolate heme ligand is referred to as the Cys-pocket. Between αK and the Cys-pocket lies a structurally conserved region, the so-called 'meander' loop. It spans 7-10 amino acid residues and is thought to play a role in heme binding and stabilization of the tertiary structure. The proposed reductase interaction face of CYPs mainly comprises αJ/αJ' and the insertion following the meander loop [6]. Since the structures of all CYPs are highly similar, but the enzymes differ in substrate specificity and in their electron transfer partners, the different biochemical properties of CYPs are mediated by the diverse regions, which vary in both sequence and structure [8].
Six regions which are involved in recognition and binding of substrates and hence determine substrate specificity were described as substrate recognition sites (SRSs) [22]. SRS1 lies in the highly variable loop region between αB and αC (BC-loop), SRS2 is located in the C-terminal end of αF, SRS3 and SRS4 are spanned by the N-terminal regions of αG and αI, β1-4 houses SRS5 and β4-1 houses SRS6. While the access of the substrate to the binding pocket is limited by flexible regions in the entrance channel, such as αF and αG, which undergo strong conformational changes upon substrate binding [23,24], the regions directly flanking the binding pocket and thus limiting the access of the substrate to the heme, namely αI, the BC-loop region and SRS5, were observed to remain rigid during simulation [25,26]. In a systematic analysis of SRS5 in more than 6300 sequences, single substrate- and heme-interacting residues could be identified in this region [27]. One residue in SRS5, a hotspot for regio- and stereoselectivity, and one position in the BC-loop (F87) were previously reported as key residues in determining activity, regio- and stereoselectivity in CYP102A1 [28][29][30]. Combinations of variants of these two positions were applied to design a minimal mutant library with improved selectivity [31]. Due to the high variability of the BC-loop, identifying the position equivalent to position 87 of CYP102A1 in other CYPs remains a challenge for sequences without structural information.
To serve as a tool for a comprehensive comparison of protein sequences and structures within the vast and diverse family of CYPs in order to transfer the newly gained insights among the CYP sequences, the Cytochrome P450 Engineering Database (CYPED) [32] has been designed. In its current version 2.02 it contains 8614 sequences [33]. The highly similar structures have been compared in detail to identify the common core and to assign the variable regions. For this purpose a structural alignment was used as a base to generate a reliable structure profile. With this profile all structurally conserved regions (SCR) could be predicted and annotated among all CYPED protein sequence entries, hence allowing a structural navigation in those sequences lacking structural information. Beyond this, the CYPED website provides an interface which allows the prediction of the SCRs for every user-specified CYP sequence.

CYP Structures
A set of 31 PDB structures [34] was extracted from version 1.1 of the CYPED [32], as listed in Table 1. The selection includes 16 bacterial structures of class I and 12 structures of class II, which comprises CYPs that interact with a CPR-type FMN/FAD reductase. The structures in this class are predominantly of mammalian origin. The only exception is CYP102A1 (P450 BM-3) from Bacillus megaterium, which is a fusion enzyme consisting of a P450 domain and an FMN/FAD reductase domain [15]. Because of its structural similarity to CYP102A1, the bacterial CYP175A1 isolated from the thermophilic Thermus thermophilus was also assigned to class II [14]. The following crystal structures were analyzed in addition: CYP8A (human prostacyclin synthase), which accepts endoperoxides or hydroperoxides as substrates and does not require any electron transfer partner or molecular oxygen [35]; and CYP55A2 from Fusarium oxysporum and CYP152A1 from Bacillus subtilis (P450 Bsβ), which are representatives of CYPs that obtain electrons directly from NAD(P)H or catalyze a peroxide-dependent reaction. All structures represent the closed form of CYPs, since including the open form, as available for example for CYP2B4 [36], would worsen the alignment quality. Eleven recently published CYP structures were not included in the alignment but were used to validate the prediction of the structurally conserved regions.

CYP Sequences
The analysis of CYP sequences and structures was performed based on the updated version 2.02 of the CYPED [33]. It integrates sequences of 8614 proteins. The proteins are organized into 249 superfamilies and 619 homologous families according to Nelson [4]. Reliable multisequence alignments are available for each family. The sequences are annotated by automatically extracted GenBank annotations [37], which were manually enriched. Secondary structure information is available as DSSP annotation within the multisequence alignments for those homologous families containing members with existing PDB structures.

Structure-based HMM profile
SCRs were determined by generating a structure-based multisequence alignment using STAMP [38]. STAMP estimates the probability of structural equivalence of residues [39] and uses the Smith-Waterman algorithm [40] to determine the best path through a matrix of numerical pairwise similarity values of corresponding sequence positions. This allows STAMP to calculate two measures of alignment confidence: P'ij, a measure of residue equivalence, and Sc, the STAMP score, which reflects overall alignment quality. An Sc > 5.5 implies a high degree of similarity of the considered structures. Stretches of residues in the alignment having P'ij > 6.0 imply regions of conserved secondary structure and are marked by black boxes in the alignment output. To visualize secondary structure information in the alignment output, STAMP uses DSSP [41] outputs. Therefore, in a first step DSSP was applied to the CYP structures to calculate secondary structure information. The resulting structure-based multisequence alignment was checked for correctly aligned secondary structures, ExxR motif and Cys-pocket. Regions with high P'ij, which indicate conserved secondary structures, were defined as SCRs, extracted from the alignment and visualized (Figure 1) on the structure of CYP102A1 [PDB: 1BU7] as reference structure using PyMOL [42]. Structure-based HMM profiles were derived from the structure-based multisequence alignments using HMMER (http://hmmer.janelia.org/). Ligand-free and ligand-bound structures are indicated by - and +, respectively.
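The extraction of conserved stretches from per-column scores can be illustrated with a minimal Python sketch. It assumes the P'ij values have already been parsed from the STAMP output into a plain list with one value per alignment column; the function name, the `min_length` parameter and the toy scores are hypothetical and are not part of STAMP or of the original workflow.

```python
from typing import List, Tuple


def conserved_stretches(
    scores: List[float], threshold: float = 6.0, min_length: int = 1
) -> List[Tuple[int, int]]:
    """Return (start, end) column indices (0-based, inclusive) of every
    contiguous run of columns whose score is strictly above `threshold`."""
    stretches: List[Tuple[int, int]] = []
    start = None  # index where the current run began, or None outside a run
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                stretches.append((start, i - 1))
            start = None
    # Close a run that extends to the last column.
    if start is not None and len(scores) - start >= min_length:
        stretches.append((start, len(scores) - 1))
    return stretches


# Toy per-column scores (made up for illustration only).
toy_scores = [1.0, 7.0, 8.0, 6.0, 9.0, 9.5, 2.0, 6.5]
print(conserved_stretches(toy_scores))
```

Raising `min_length` discards isolated high-scoring columns; whether such a filter was used in the original analysis is not stated in the text.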

Structural analysis
Structural superpositions and visualizations were generated using PyMOL [42].

Sequence analysis
For the analysis of all CYP sequences, the CYPED and the DWARF system [43] were applied. The data warehouse system DWARF is the in-house repository for the CYPED data and assists local analysis. Besides integrating sequences and structures of this protein family, it provides a set of bioinformatics tools for sequence and structure analysis. We took advantage of its modular and extensible architecture and designed a Perl program which implements an automated procedure that subsequently generates a structure-based multisequence alignment for every CYPED entry by mapping it on the structure-based HMM profile which was derived from the STAMP alignment. Using the alignment row which represents the structure of CYP102A1 as a reference, the start and stop positions of each conserved secondary structure were identified within each alignment and transferred to the query sequence. Therefore, the absolute positions of the SCRs of each query sequence could be predicted. The positions were stored as annotations in the CYPED and are visualized in the multisequence alignments and on the feature page for each CYPED entry. The same procedure as for the identification of the SCRs was applied to identify the specificity and regioselectivity determining position which corresponds to F87 in CYP102A1 in all sequences among the CYPED. Again, the sequence of the structure of CYP102A1 was used as the reference. Each CYPED query sequence was mapped on the structure-based HMM profile and the resulting alignment was used to determine the residue corresponding to F87.
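The transfer of a reference position to a query sequence can be sketched as follows. This is an illustrative Python example, not the Perl program described above: it assumes the reference row (CYP102A1) and the query row have been extracted from the same profile alignment as equal-length strings, and that `-` and `.` denote gaps. The function name and the toy sequences are hypothetical.

```python
from typing import Optional

GAP_CHARS = "-."  # assumed gap symbols in the alignment rows


def map_reference_position(
    ref_row: str, query_row: str, ref_pos: int
) -> Optional[int]:
    """Map the 1-based residue index `ref_pos` of the reference sequence to
    the 1-based residue index of the query sequence in the same alignment
    column. Returns None if the query has a gap in that column."""
    if len(ref_row) != len(query_row):
        raise ValueError("alignment rows must have equal length")
    ref_count = 0
    query_count = 0
    for ref_char, query_char in zip(ref_row, query_row):
        ref_is_residue = ref_char not in GAP_CHARS
        query_is_residue = query_char not in GAP_CHARS
        if ref_is_residue:
            ref_count += 1
        if query_is_residue:
            query_count += 1
        if ref_is_residue and ref_count == ref_pos:
            return query_count if query_is_residue else None
    raise IndexError("ref_pos exceeds the number of reference residues")


# Toy alignment rows (made up; not real CYP sequences).
reference = "MK-TFA"
query = "M-QT-A"
print(map_reference_position(reference, query, 3))  # reference T -> query residue 3
print(map_reference_position(reference, query, 4))  # query gap -> None
```

Indices here count residues along the sequence. Where the numbering of a crystal structure differs from the sequence numbering, an additional offset would have to be applied.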
The accuracy of this method was tested in a leave-one-out cross-validation [44]: for each of the 30 crystal structures, a structure-based HMM profile was generated with that structure left out, and the sequence of the left-out crystal structure was mapped on the corresponding profile. The generated alignment was checked for the correct prediction of the residue corresponding to F87.
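The leave-one-out scheme can be written generically, as in the Python sketch below. The callables `build_model` and `predict` are placeholders for profile construction and position prediction, which in the original work relied on STAMP and HMMER; the toy stand-ins at the bottom only exercise the control flow and carry no biological meaning.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence


def leave_one_out_accuracy(
    items: Sequence[Hashable],
    build_model: Callable[[List[Hashable]], Any],
    predict: Callable[[Any, Hashable], Any],
    truth: Dict[Hashable, Any],
) -> float:
    """Fraction of items whose prediction, made by a model built from all
    other items, equals the known value in `truth`."""
    items = list(items)
    correct = 0
    for i, held_out in enumerate(items):
        training = items[:i] + items[i + 1:]  # every item except the held-out one
        model = build_model(training)
        if predict(model, held_out) == truth[held_out]:
            correct += 1
    return correct / len(items)


# Toy stand-ins: the "model" is the maximum of the training values and the
# "prediction" is whether the held-out value does not exceed it.
toy_items = [1, 2, 3, 4]
toy_truth = {1: True, 2: True, 3: True, 4: True}
print(leave_one_out_accuracy(toy_items, max, lambda m, x: m >= x, toy_truth))
```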
An online version of the prediction tool was integrated into the CYPED homepage. Since the method operates exclusively on sequences with the CYP fold, input sequences are first checked for applicability by sequence homology via a BLAST [45] query using an E-value of 10^-100. Structurally conserved regions are determined as described above.

Structural Core
From the simultaneous superposition of the 31 structures using STAMP, a multiple sequence alignment was derived which contained 257 structurally equivalent residues out of 400-450 residues. The average RMS deviation after fitting all structures by these 257 residues was 2.4 Å, and their average sequence identity was 25%. The overall STAMP alignment score Sc was 6.0, which is above the threshold for highly similar structures. Stretches of structurally equivalent residues (P'ij > 6.0) are marked by black boxes in the structure-based sequence alignment (figure S1, Additional file 1). The residues of the conserved core are organized into 19 SCRs that include, at least partially, all defined secondary structures αA-L and β1-4. The SCRs extracted from the structural alignment were mapped on the reference structure CYP102A1 from Bacillus megaterium (Figure 1).
A topological overview of the conserved CYP structure illustrates the distribution of SCRs on the CYP structure (Figure 2). Some SCRs are part of individual secondary structures; other SCRs include several secondary structure elements. Among these, SCR3 comprises β1-2 and αB, and SCR7 comprises β3-1 and αE. SCR11 is assembled by αI and αJ, and SCR13 by β1-4 and β2-1. β2-2, β1-3 and αK' together form SCR14, and the heme-binding Cys-pocket and αL together form SCR16. The structural alignment further revealed that the β-5 sheet, which is not present in all CYP structures, does not belong to the conserved parts of the CYP structures [14]. The variable termini of the secondary structure elements αF, αG, αI, β1-4, β4-1, and the BC-loop surround the heme and house the residues defining the SRS regions 1-6.
By applying the procedure to each CYPED sequence and mapping it on the HMM profile generated from the STAMP alignment, the SCRs could be identified and annotated in all sequence entries. The conserved secondary structures appear in the online version of the CYPED either within the annotated multisequence alignments or on the feature page of each protein entry. The label of a region is displayed when the cursor is moved over it. The results of the online prediction for any CYP sequence are displayed as colored and annotated regions and as a tabular output listing each conserved secondary structure and the corresponding start and stop position.

BC-loop
In CYP102A1, the phenylalanine at position 87 is assumed to mediate selectivity and activity. Due to its proximity to the heme center, there is strong evidence that this residue is involved in substrate binding and controls substrate specificity and regioselectivity [31]. Therefore, the identification of residues corresponding to this position would be beneficial in the design of CYPs with engineered properties. Since it is located in the SRS1 region of the highly variable BC-loop, the identification of this position in enzymes without structural information is not possible merely by sequence alignment. However, a comprehensive analysis of the BC-loops in the structures analyzed in this work revealed that, although the BC-loop is highly variable (Figure 3A), in almost every structure of the different proteins compared it houses one residue which points directly towards the heme. Comparing multiple structures of the same protein showed that this residue remains rigid during substrate binding (figures S2 and S3, Additional file 1). The overall superposition of structures of different proteins on the structure of CYP102A1 showed that this residue occupies exactly the same position as the phenylalanine at position 87 in CYP102A1 (Figure 3B). Table 2 lists the corresponding residue in each structure.
To validate our structure-based method to assign SCRs in a leave-one-out cross-validation, the position which corresponds to F87 in CYP102A1 was predicted for the sequence of each structure. For 23 out of 30 structures (almost 80%), the predicted positions agreed with the crystal structure; in 7 CYPs they deviated by up to 2 residues. To further apply and validate the procedure, the position was predicted in eleven structures published during the course of this study. Eight correct predictions, two deviations by one position, and one wrong prediction for CYP7A1, which has no residue located at this position in the crystal structure, again confirmed an accuracy of about 80%. It should be noted that in some crystal structures the residue numbering deviates from the residue numbering of the sequences due to missing residues; therefore the numbering of the protein structure was considered. The crystal structure of CYP231A from the thermoacidophilic Picrophilus torridus was missing a part of the BC-loop [46], which made the prediction ambiguous.

Amino acid composition of the F87 corresponding position
In addition to the identification of the F87 corresponding position, a comprehensive analysis of the amino acid composition at this position was performed for the sequences of all 8614 CYPED protein entries, by predicting the position in all sequences analogously to the SCR prediction. It was observed that 73% of the residues predicted at this position are aliphatic residues or phenylalanine. A further 24% are small polar residues and only 3% are charged residues. Phenylalanine (22%), leucine (22%), and valine (12%) were the most frequently occurring amino acids, followed by isoleucine (10%) and alanine (9%). Other amino acids appear more rarely, with frequencies of less than 4% (Figure 4). A predicted gap at this position indicates that the BC-loop region houses no residue located close to the heme, or that the BC-loop itself winds away from the active site, as could be observed for example in the structures of CYP8A and CYP51B1.
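The composition analysis amounts to counting the residue predicted at one position across all sequences. A minimal Python sketch follows, assuming the predicted residues have already been collected as one-letter codes, with `-` standing for a predicted gap; the function name and the toy input are hypothetical, and no grouping into aliphatic, polar or charged classes is attempted because the text does not define the class membership.

```python
from collections import Counter
from typing import Dict, Iterable


def residue_frequencies(residues: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each one-letter residue code, ordered from
    most to least frequent."""
    counts = Counter(residues)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues given")
    return {aa: n / total for aa, n in counts.most_common()}


# Toy input: residues predicted at the F87-equivalent position in four
# made-up sequences.
print(residue_frequencies(["F", "F", "L", "V"]))
```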

Analysis of reductase interaction sites
The structural regions αJ/J' and the insertion between the meander loop and the Cys-pocket are of particular interest since they were previously proposed to form the reductase interacting face of the molecules [6]. These sites vary strongly in their length and conformation. The structural analysis (Figure 5) reflects the differences of αJ/J' (further referred to as reductase interaction site 1, RIS1) (Figure 5A) and the insertion between meander loop and Cys-pocket (further referred to as reductase interaction site 2, RIS2) (Figure 5B) of CYPs from different redox classes. A comparison of the human CYP2C9 and the bacterial P450cam CYP101D shows that RIS1 (αJ/J' region) of CYP2C9 is 18 residues longer. RIS2 differs by 9 residues between CYP2C9 and CYP101D. Counting the number of residues spanning these regions in the STAMP alignment (figure S1, Additional file 1) revealed that these regions are long in class II CYPs interacting with CPR-type reductases and extremely short or absent in class I CYPs, and that those CYPs which do not require any electron transfer partner form a subgroup of class II, in some cases with extremely long loops. The length of the αJ/J' region ranges from 21 to 22 residues for class II (long) and from 3 to 5 residues for class I CYPs (short). The length of the meander insertion ranges from 11 to 17 residues for class II (long), reaches up to 23 residues (very long) in those CYPs which do not require a redox partner, and ranges from 3 to 5 residues for class I CYPs (short). Counting the number of amino acids in each CYPED sequence for RIS1 (Figure 6A) revealed two peaks in the RIS1 length distribution. This allowed two classes to be defined. Proteins having short RIS1 with less than 10 residues spanning the αJ/J' region make up 17.5% of all protein entries. According to the result of the length analysis of RIS1 in the structural alignment, they comprise class I CYPs.
Proteins having long RIS1 with more than 15 residues spanning the αJ/J' region make up 81% of all protein entries. According to the result of the length analysis of RIS1 in the structural alignment, they comprise class II CYPs. Only 1% of all protein entries cannot reliably be assigned by RIS1 length since their length is between 10 and 15 amino acids.
The analysis of the length of RIS2 in each CYPED sequence (Figure 6B) showed a distribution in three main areas. Therefore, three classes were defined according to the result of the length analysis of RIS2 in the structural alignment. Proteins having short RIS2 with less than 7 residues spanning the meander insertion make up 18% of all protein entries. According to the result of the length analysis of RIS2 in the structural alignment, they comprise class I CYPs. Proteins having long RIS2 with between 11 and 17 residues spanning the meander insertion make up 66% of all protein entries. According to the result of the length analysis of RIS2 in the structural alignment, they comprise class II CYPs, with a subgroup of proteins having very long RIS2 with more than 18 residues spanning the meander insertion. 4% of all protein entries cannot reliably be assigned by RIS2 length since their length is between 8 and 10 amino acids.
0.5% of entries with RIS1 and 0.5% of entries with RIS2 length above 35 amino acids were formally assigned to class II, but could not be further analyzed since they comprise biochemically not characterized proteins.
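The length thresholds given above can be collected into two small classification functions. The Python sketch below uses only the cut-offs stated in the text. RIS2 lengths of exactly 7 and 18 residues are not covered by the stated ranges, so returning "unassigned" for them is an assumption of this sketch; the function names are hypothetical.

```python
def classify_ris1(length: int) -> str:
    """Classify the alphaJ/J' region (RIS1) by its length in residues."""
    if length < 10:
        return "short"       # associated with class I in the text
    if length > 15:
        return "long"        # associated with class II in the text
    return "unassigned"      # 10 to 15 residues


def classify_ris2(length: int) -> str:
    """Classify the meander insertion (RIS2) by its length in residues."""
    if length < 7:
        return "short"       # associated with class I in the text
    if 11 <= length <= 17:
        return "long"        # associated with class II in the text
    if length > 18:
        return "very long"   # subgroup of class II in the text
    # 8 to 10 residues per the text; 7 and 18 are not covered by the
    # stated ranges and are treated as unassigned here (assumption).
    return "unassigned"


# Loop lengths of 22 (RIS1) and 23 (RIS2) residues, as reported for CYP8A1.
print(classify_ris1(22), "/", classify_ris2(23))
```

Entries with loop lengths above 35 residues were formally assigned to class II in the text; the sketch does not treat them separately.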

Discussion
Despite their inherently low sequence similarity, all CYPs share a common structural fold. The well-defined secondary structure elements can be found in all determined crystal structures, which house their active-site with the cofactor heme deeply inside the protein [21]. The generation of a structural alignment out of 31 CYP structures revealed structurally conserved regions which contain most of the described secondary structure elements of the CYP fold. It could be shown that some of the secondary structure elements merge together to structure modules, described as structurally conserved regions (SCR) 1-19, reflecting the modular structure of cytochrome P450 monooxygenases. The generation of a reliable structure-based HMM profile which was applied to every CYPED entry assisted in consistently annotating the conserved secondary structures in the CYPED entries. But besides addressing the problem of predicting conserved regions, an even more challenging issue could be solved: the identification and classification of the variable regions.
Since the residues that determine the substrate specificity of CYPs are assumed to lie in the variable regions [8,22], their identification is of greatest interest for engineering of biochemical properties. Two of the six proposed substrate recognition sites, SRS1 and SRS5, together with the helix I directly flank the substrate binding cavity and are therefore supposed to interact with the substrate [27]. SRS1 houses a residue, which previously was described as essential for activity, regio-and stereoselectivity in CYP102A1 [28][29][30]. Located at position 87 and pointing directly towards the heme, a corresponding residue to this phenylalanine can be found in almost all CYP structures. Its location in the highly variable BC-loop region makes its determination very difficult in sequences without structural information.
The position, which corresponds to F87 in CYP102A1 could be correctly predicted in almost 80% of all analyzed CYP structures. By surveying more recent CYP structures, the validity of the prediction could be confirmed. The analysis of this position in all 8614 CYP sequences in the CYPED revealed that the residues at this position predominantly are of aliphatic nature or a phenylalanine, less frequently small polar amino acids and only very infrequently of charged nature. Since the characteristics of the residue at this position highly influence substrate specificity and regioselectivity, its identification contributes to the design of CYPs with more suitable properties for biocatalytic applications.
Even though two reductase interaction sites, termed RIS1 and RIS2, were proposed to be located in αJ/αJ' and in the insertion following the meander loop [6], these regions are highly variable in sequence and structure and were therefore difficult to locate in sequences. The identification of the preceding and the succeeding SCR solved this problem. Depending on the loop length, two classes were introduced for RIS1 (short and long RIS1) and three classes for RIS2 (short, long and very long RIS2). From the analysis of the CYP structures with respect to their redox partner it was assumed that class II CYPs have long RIS1 and long RIS2, whereas class I CYPs have short RIS1 and short RIS2.
The largest percentage of all CYPs has long RIS1 and long RIS2 (53%). All CYPs with available structure which possess these long loops clearly belong to class II, and most of them are of human origin. The class II protein P450 BM-3 also shows the characteristic CPR-interacting loop length. The 12% of proteins with short RIS1 and short RIS2 are assumed to be class I proteins. 27% could not be clearly classified, either because of unusually long loops (above 35 residues) or because of a combination of short RIS1 with long RIS2 or vice versa. This comparison of reductase interaction sites allows conclusions to be drawn about the reductase interaction of a CYP.
The remaining 8% of CYPs consist of proteins with long RIS1 and very long RIS2. Members of this unusual group cannot easily be categorized with regard to their reductase interaction. For example, the human prostacyclin synthase CYP8A1, which has endoperoxidase activity and does not require a reductase as a source of electrons, is a representative of this class of proteins [35]. It has a long RIS1, consisting of 22 amino acids, and a very long RIS2 of 23 amino acids. However, the recently solved crystal structure of the human cholesterol 7 alpha-hydroxylase CYP7A1 also contains very long proximal loops [47], which were correctly predicted to contain 22 (RIS1) and 23 (RIS2) amino acids. CYP7A1 was previously compared to the structure of CYP8A1 [48], but in contrast is a typical monooxygenase.
The fatty acid hydroxylase CYP152A1 from Bacillus subtilis (P450 Bsβ) is a hydrogen peroxide-driven enzyme [49] and could therefore be assigned to those CYPs which do not require a redox partner. CYP152A1 has a short RIS1 of 5 amino acid residues and a long RIS2 of 11 residues, like the CPR-type interacting class II CYPs, which is unexpected for this kind of CYP. Indeed, CYP152A1 and its homologous protein CYP152A2 from Clostridium acetobutylicum (P450 CLA) experimentally showed much higher conversions in the presence of a CPR-type reductase than in the presence of hydrogen peroxide and the absence of a reductase [50]. The allene oxide synthase CYP74A1, whose crystal structure was recently solved, is an atypical cytochrome P450 family member and does not require a reductase [51]. However, CYP74A1 also shows loop lengths similar to those of class II CYPs, with a RIS1 of 21 amino acids and a RIS2 of 10 amino acids. Due to an unusual nine amino acid insert in the Cys-pocket which allows its access to the protein surface, the interaction with a redox partner might be disrupted [51]. Therefore, CYP74A1 cannot be compared by our model to typical monooxygenases with similar RIS1 and RIS2 length.
Since most CYPs require electrons from a redox partner, and CYP152A1 and CYP152A2 showed higher activities by adding a reductase, it can be assumed that the interaction of CYPs with reductases plays a pivotal role in the CYP mechanism. Finding the optimal redox partner for CYPs may significantly enhance their activity but is quite difficult. The analysis and classification which led to the prediction of possible redox partner interactions offers the potential of engineering enhanced interactions.

Conclusion
In order to navigate in CYP sequences and to determine functionally relevant residues, a procedure which allows identifying conserved modules and functionally relevant sites within variable regions was implemented. Regions involved in substrate binding as well as redox partner recognition and interaction could be determined in the absence of structural information, based on sequence only. The structurally annotated sequences and multisequence alignments are accessible on the current version of the CYPED (http://www.cyped.uni-stuttgart.de). Via a web interface integrated in the CYPED homepage at http://www.cyped.uni-stuttgart.de/cgi-bin/strpred/dosecpred.pl, the structural prediction is provided for every sequence which is similar to CYPs or presumably shares the CYP fold. The navigation in CYP sequences and the determination of functionally relevant sites is in turn a great advantage in the prediction of promising targets for the design of CYPs with improved biocatalytic properties.

Additional material
Additional file 1: This file contains figures S1, S2, and S3 mentioned in the text.