Computational Identification and Analysis of Ubiquinone-Binding Proteins

Ubiquinone is an important cofactor that plays vital and diverse roles in many biological processes. Ubiquinone-binding proteins (UBPs) are receptor proteins that dock with ubiquinones. Analyzing and identifying UBPs via a computational approach will provide insights into the pathways associated with ubiquinones. In this work, we were the first to propose a UBPs predictor (UBPs-Pred). The optimal feature subset selected from three categories of sequence-derived features was fed into the extreme gradient boosting (XGBoost) classifier, and the parameters of XGBoost were tuned by multi-objective particle swarm optimization (MOPSO). The experimental results over the independent validation demonstrated considerable prediction performance with a Matthews correlation coefficient (MCC) of 0.517. After that, we analyzed the UBPs using bioinformatics methods, including the statistics of the binding domain motifs and protein distribution, as well as an enrichment analysis of the gene ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway.


Introduction
Ubiquinone, also known as coenzyme Q (CoQ), is a fat-soluble organic compound that is ubiquitous in all cell membranes of all animals and most bacteria. It is a powerful antioxidant in cell membranes and lipoproteins, an important component of the electron transport chain, and plays a central role in the process of mitochondrial oxidative phosphorylation [1,2]. Coenzyme Q-10 (CoQ10) is the most common form of ubiquinones, where Q represents the quinone chemical group, and 10 represents the number of isoprenyl chemical subunits in its tail [3]. CoQ10 tends to concentrate in organs with high energy requirements, such as the heart, liver, and kidney. Special interest in CoQ10 has arisen from the fact that it associates with many types of diseases, such as heart disease [4,5], chronic kidney disease [6,7], and cancer [8,9]. However, the pharmacology of CoQ10 is not entirely clear, and further comprehensive studies are required for clarity.
Like other ligands, ubiquinones perform their functions mainly by binding with receptors. Ubiquinone-binding proteins (UBPs) are receptor proteins that dock with ubiquinones. The identification and characterization of UBPs provide important clues for understanding the metabolic pathways involving ubiquinones. Currently, abundant protein sequence data have been garnered. Recognizing UBPs from large numbers of proteins through traditional biochemical experiments has become more and more and the remaining 148 samples (74 UBPs and 74 non-UBPs) were used as the testing dataset (UBPs148). To avoid the potential instability of the predictor, we repeated the process for extracting negative samples five times and built five benchmark datasets. All the predicted results reported in this work were the average of the five experiments using these datasets. The benchmark datasets can be found in the Supplementary Materials (Data S1: BenchmarkDatasets.zip).

Feature Construction
The primary amino acid sequences of the proteins cannot be directly used by machine learning methods. Encoding proteins with digital features is the first step in constructing a predictor. The quality of the features directly affects the performance of the predictor. We chose three types of features to encode the proteins: the amino acid composition (AAC), the dipeptide composition (DC), and the position-specific scoring matrix (PSSM).

Amino Acid Composition (AAC)
The twenty amino acids directly encoded by triplet genetic codons are known as the standard amino acids. A protein sequence is a linear chain of standard amino acid residues. Proteins differ from one another primarily in their arrangement of amino acids, which results in the protein folding into a specific structure that determines its function. The amino acid composition (AAC) represents the distribution of twenty amino acids in proteins, which is the most intuitive way to describe the differences between proteins. Numerous methods have been developed by using AAC as an important component of features for annotation of protein function [19,20]. The amino acid composition of a protein is a 20-dimension vector in which the element is the ratio of the corresponding residue that appears in the protein. For a given protein, the AAC feature can be defined as where AA i represents the number of the i-th type of amino acids appearing in the protein, and L represents the length of the protein.

Dipeptide Composition (DC)
Similar to the feature of AAC, the dipeptide composition (DC) is another amino acid composition descriptor that introduces the intrinsic information of the protein sequences. The DC feature has been applied in many predictive problems, such as mycobacterial membrane protein type identification [21] and bioluminescent protein prediction [22]. DC is a 400-dimension vector that indicates the occurrence frequency of all the possible adjacent amino acid pairs. The element of the vector is the ratio of the corresponding amino acid pair that appears in the protein. Given a protein, the DC feature can be defined as where AA i AA j represents the number of the corresponding adjacent amino acid pair that appears in the given protein, and L represents the length of the protein.

Position-Specific Scoring Matrix (PSSM)
With the evolutionary process over successive generations, certain heritable characteristics of proteins become more common or rarer within a protein family. The similarities of evolutionary conservation are always associated with structural or functional needs [23,24]. The position-specific scoring matrix (PSSM) is one of the most effective and widely used descriptors that represent the evolutionary conservation of protein sequences. PSSM has received a great deal of attention from researchers and has been successfully used in a number of problems, such as protein secondary structure prediction [25] and DNA-binding protein identification [26,27]. The PSSM of a given protein can be obtained by using the PSI-BLAST [28] tool to search the Swiss-Prot database (released on June 5, 2019) through three iterations, with an E-value threshold of 0.01. The E-value is the statistical measurement of the number of expected matches in the database. The lower the E-value, the more likely the match is to be significant. The PSSM of a protein can be defined as where S i,AA j is the element's value of PSSM, which represents the occurrence frequency of AA j at the i-th position of the given protein in the result of multiple sequence alignment. L represents the length of the protein. For a given protein, we further flattened the original PSSM into a vector with equal length and obtained a 400-dimension feature that can be defined as where R i represents the i-th residue in the protein sequence.
In summary, 820-dimensional features were used to encode proteins, including 20 AAC, 400 DC, and 400 PSSM.

Feature Selection Strategy
According to the descriptions in the previous section, three categories of features were used for the prediction model: AAC, DC, and PSSM. All of these features were statistically based and considered all amino acid types and pairwise combinations. Although these features introduced the intrinsic information of protein sequences, noisy and redundant features inevitably existed. To better understand these features and correctly verify the prediction model, a feature selection strategy was required. By using the process of feature selection, noisy features were removed, and the prediction performance was further improved.
Random forest (RF) [29][30][31] was used to rank the importance of the features in this work. The importance score for a given feature was computed by averaging the decrease in the Gini index when the feature was chosen to split the nodes. Features with larger scores were ranked as more important [32]. A total of 526 out of 820 features remained in this step because the important scores of the other features were 0. We believed that the more important features would contribute more to the classification performance. According to the ranked feature list, the incremental feature selection (IFS) [33] strategy was used to build a series of feature subsets by increasing the features one by one. For each feature subset, a model was built and evaluated. The model that achieved the highest MCC value was chosen as the final prediction model, and the features in the corresponding subset were chosen as the optimal features. The optimal feature subset contained 242 features, including 9 AAC, 87 DC, and 146 PSSM features.

Binary Prediction Model
Extreme gradient boosting (XGBoost) [34] is a decision tree-based ensemble algorithm that applies the principle of boosting weak learners using gradient descent architecture. XGBoost is an optimization of the gradient boosting algorithm using both software and hardware optimization techniques. XGBoost was first proposed by Chen T.Q. and Guestrin C. in 2016. Since its introduction, this algorithm has become the driving force in several cutting-edge research fields, including (but not limited to) bioinformatics. XGBoost has been widely used in many bioinformatics problems, such as gene expression value prediction [35], protein subcellular localization [36], and internal ribosome entry site prediction [37].

Parameter Tuning
XGBoost is a highly sophisticated algorithm that involves multiple parameters. We needed to consider parameters and tune their values to fully leverage the advantages of XGBoost. These parameters could be divided into three categories: general, booster, and learning task parameters. General parameters define the overall functionality, notable among them are booster and nthread. "booster" controls the type of booster to be run at each iteration, which can be gbtree (tree-based booster, which is the default) or gblinear (linear booster). We set it as default because the tree-based booster significantly outperforms the linear booster. "nthread" controls the number of cores used for parallel processing based on the maximum number of threads that are available. We set nthread as default to obtain the optimal speed. In each iteration, the booster parameters define the individual tree-based booster, and the learning task parameters guide the optimization objective. There are plenty of parameters in both categories, but not all of them are worthy of tuning. We tuned eight parameters that significantly influence the predictor, including learning_rate, n_estimators, max_depth, subsample, colsample_bytree, gamma, reg_alpha, and reg_lambda. The details of these parameters are illustrated in Table 1. Tuning the parameters of XGBoost is a typical multi-objective optimization problem, and the traditional grid search method is extraordinarily time-consuming. In this case, multi-objective particle swarm optimization (MOPSO) [38] was employed to obtain the ideal values of these parameters. Particle swarm optimization (PSO) [39] is a single-objective optimization algorithm that mimics the social behavior of bird flocks. MOPSO is an extension of PSO for multi-objective optimization problems, which takes the concept of Pareto dominance to list the best solutions to guide the movement of the particles. We randomly initialized 80 sets of 8-dimension vectors as the particle population, in which each item in the vector of a particle represented one parameter of XGBoost. The stopping criterion was when the maximum number of iterations was reached (200 times). The default values of the parameters before tuning, the searching thresholds, and the optimal value combination tuned by the MOPSO is illustrated in Table 1.

Performance Evaluation
The identification of UBPs is a typical binary classification problem that requires reliable evaluation processes and metrics. When training the model and tuning the parameters on the training dataset, we used 5-fold cross-validation to evaluate the model. First, the training dataset was randomly divided into five equal subsets. Then, one subset was chosen as the validation dataset, while all of the remaining samples were used for training. This process was repeated ten times to build ten submodels. Finally, the average validation results between the rounds were used as an estimate of the identifier. After training the model, the samples in the testing dataset were tested to assess the identification model's ability to predict the first seen samples.
To quantitatively evaluate the proposed identifier, six measurements that are widely used for binary classification problems were adopted in this study: sensitivity (Sen), specificity (Spe), precision (Pre), accuracy (ACC), F1-measure (F1), and the Matthews correlation coefficient (MCC). Sen measures the proportion of observed UBPs that are correctly identified. Spe measures the proportion of observed non-UBPs that are correctly identified. Pre measures the proportion of predicted UBPs that are correctly identified. ACC measures the proportion of samples that are correctly identified. The F1-measure is a comprehensive measurement of test accuracy that considers both the Sen and the Pre. The value range of these five metrics is [0,1]. The higher the coefficient, the better the prediction performance. MCC is the correlation coefficient between the observed and predicted classification values, whose range is [−1,1]. The coefficient value 1 represents an entirely correct prediction, −1 represents a completely wrong prediction, and 0 represents a random prediction. MCC was regarded as the most reliable evaluation metric when tuning the model.
where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative, respectively.

Comparison of Different Classifiers
A powerful binary classifier is the foundation of a predictor. Thus, we tested six commonly used machine learning methods via cross-validation, including naïve Bayes (NB), multi-layer perceptron (MLP), support vector machine (SVM), adaptive boosting (AdaBoost), random forest (RF), and extreme gradient boosting (XGBoost). According to Table 2, XGBoost achieved the best prediction performance and was chosen as the classifier for the predictor. Notably, the last three methods were ensemble learning-based and significantly outperformed the others.

The Feature Selection Result
After ranking the importance of the features by random forest, 526 out of 820 features remained. The incremental feature selection (IFS) strategy was used to build a series of feature subset by increasing the features one by one. For each feature subset, a model was built and then evaluated over cross-validation. The performance of these prediction models was measured by the MCC value. As shown in Figure 1, the MCC value reached its maximum when 242 features were collected as the optimal feature subset. The predictor obtained an MCC value of 0.490 using all the features and obtained an MCC of 0.560 using the optimal feature subset.  [1][2][3][4][5][6] are the performance evaluation indicators of the predictor: Sen represents the sensitivity; Spe represents the specificity; Pre represents the precision; ACC represents the accuracy; F1 represents the F1-measure; MCC represents the Matthews correlation coefficient (MCC). 7 The bolded parts represent the highest value of the corresponding evaluation indicator. To discover the contributions of different types of features, we investigated the distribution of each kind of feature in the optimal feature subset. As shown in Figure 2, 9 out of 20 (45%) for the amino acid composition (AAC), 87 out of 400 (22%) for dipeptide composition (DC), and 146 out of the position-specific scoring matrix (PSSM) were selected for the optimal feature subset. The PSSM was obviously much better than the others. Although the number of DC in the optimal feature subset was higher than that of the AAC, the percentage of the selected DC among all the DCs was much lower than that of the AAC. We thus considered AAC to be more effective than DC.

The Result of Parameter Tuning
XGBoost is a highly sophisticated algorithm that involves multiple parameters. To leverage the advantages of XGBoost, we tuned the eight most common parameters using multi-objective particle swarm optimization (MOPSO). The prediction performance of the predictor before and after the parameter tuning over the cross-validation and independent validation are shown in Table 3. The prediction performance of the predictor was significantly improved after parameter tuning. This demonstrated that XGBoost was parameter-sensitive, and the process of parameter tuning was necessary.

Case Studies
We investigated four proteins that composed a mitochondrial respiratory complex II (PDB ID: 1YQ4 [40]) as case studies to demonstrate the effectiveness of UBPs-Pred. The UniProt IDs of the four proteins were Q9YHT1 (SdhA), Q9YHT2 (SdhB), D0VWW3 (SdhC), and Q5ZIS0 (SdhD). It should be noted that the sequences of these proteins were different in PDB and UniProt, and we used the version of UniProt. UBPs-Pred identified that SdhB, SdhC, and SdhD were ubiquinone-binding proteins, but SdhA was not a ubiquinone-binding protein. SdhB and SdhD were annotated as ubiquinone-binding proteins in UniProt. Although SdhC was not annotated as a UBP in UniProt, previous work demonstrated that SdhC contained ubiquinone binding sites [41]. These proteins are potential drug targets through which the nuclear SdhB, SdhC, and SdhD genes encoding complex II are considered to be tumor suppressor genes. Furthermore, in SdhC, mutation (Gly-713 to Glu) leads to the increased production of reactive oxygen species and premature aging that shortens the life span [42,43]. The sequence and structure information of these proteins can be found in the Supplementary Materials (Data S2: CaseStudies.zip).
Respiratory complex II or succinate dehydrogenase (Sdh) is an enzyme complex that can be found in the inner mitochondrial membrane. It is the only enzyme that participates in both the Krebs cycle [44] and the mitochondrial respiratory chain [45]. As illustrated in Figure 3, Sdh is composed of four subunits: a flavoprotein subunit (SdhA), an iron-sulfur protein subunit (SdhB), a cytochrome b560 subunit (SdhC), and a cytochrome b small subunit (SdhD) [46]. SdhA and SdhB are hydrophilic proteins where the enzymatic activity of the complex takes place. SdhC and SdhD are hydrophobic transmembrane proteins that anchor to the inner mitochondrial membrane. In the Krebs cycle, Sdh catalyzes the oxidation of succinate to fumarate with the reduction of ubiquinone to ubiquinol [47]. In the mitochondrial respiratory chain, the fully oxidized form of flavin adenine dinucleotide (FAD) is reduced to its hydroquinone form (FADH 2 ). Electrons flow from FAD to FADH 2 and are then transferred to ubiquinone through a series of iron-sulfur clusters: Fe 2 S 2 , Fe 4 S 3 , and Fe 3 S 4 . The ubiquinone undergoes reversible redox cycling between its oxidized form (ubiquinone) and its reduced form (ubiquinol). This redox cycling allows the ubiquinone to function as an electron carrier in the mitochondrial respiratory chain [48,49].

Ubiquinone-Binding Domain Analysis
In genetics, a protein sequence motif is a sequence pattern that is widespread and has biological significance. We tried to determine the motif within ubiquinone-binding domains that might assist the discovery of potential drug targets. The ubiquinone-binding domains were extracted from all 10,224 UBPs in the Swiss-Prot database. A total of 1803 binding sites were annotated for 1791 UBPs. The ubiquinone-binding domain for a given binding site was constructed by slicing a sequence fragment with 21 residues for which the binding site was the center. We determined the motifs of these sequence fragments by using MEME [50]. The statistical significance of each motif was measured by its E-value. Motifs with an E-value within 0.05 were considered to be statistically significant, and eight motifs were identified. Figure 4 shows the sequence logos of eight motifs and the 3D visualizations of these examples. Details on motif discovery can be found in the Supplementary Materials (Data S3: BindingDomain.zip).   . Sequence logos of the motif within the ubiquinone-binding domains. The threshold of the E-value is 0.05. "Sites" represents the number of sites contributing to the construction of the motif. "Width" represents the width of the motif. The 3D visualization on the right is an example of the corresponding motif. "Protein" represents the PDB ID_Chain (domain). "Ligand" represents the type of ubiquinone.

Most UBPs are Membrane Proteins
As of June 5, 2019, a total of 10,224 UBPs were found in the Swiss-Prot database. According to the statistics, 8880 out of 10,224 proteins (86.9%) were membrane proteins (MPs). Among them, 6085 proteins (68.5%) were transmembrane proteins (TMPs). All of the TMPs were α-helical transmembrane proteins (α-TMPs). After preprocessing, we collected 524 UBPs to construct the benchmark datasets, including the training dataset and the testing dataset. According to the statistics of the selected UBPs, 387 out of 524 proteins (73.9%) were membrane proteins (MPs). Among them, 265 proteins (68.5%) were α-helical transmembrane proteins (α-TMPs). All of the TMPs were α-helical transmembrane proteins (α-TMPs). It was obvious that the majority of UBPs were membrane proteins. The statistics of UBP types can be found in Supplementary Materials (Distribution.zip).

Superfamilies of UBPs
The protein superfamily is the largest protein clade with common ancestry. This common ancestry is usually inferred from the similarity of proteins' sequence, structure, and function. A superfamily typically contains several homologous protein families that have similar motifs or are involved in similar biological processes. We analyzed the superfamily distribution of the selected UBPs by extracting the clan annotation from the Pfam protein family database [51]. A total of 280 out of 525 UBPs were annotated by Pfam, and the statistical results are illustrated in Figure 5.
A total of 47 protein superfamilies appeared in the statistics, and the top 10 superfamilies contained about 74.1% UBPs. The top three superfamilies were ComplexI-N (CL0425), NADP_Rossmann (CL0063), and FumRed-TM (CL0335). ComplexI-N (CL0425) contained 80 UBPs. Proteins in this superfamily were part of respiratory complex I, which is associated with proton translocation across the membrane by catalyzing the electron transfers from NADH to ubiquinone. NADP_Rossmann (CL0063) [52] contained 34 UBPs and was a class of redox enzymes that contains two domains: one is a catalytic domain that confers the precise reaction of the enzyme, and the other one is the Rossmann domain that binds nicotinamide adenine dinucleotide (NAD) and FAD. FumRed-TM (CL0335) included 20 UBPs that contain transmembrane proteins from both the succinate dehydrogenase and fumarate reductase complexes [53]. Information about the superfamilies of UBPs can be found in the Supplementary Materials (Data S4: Distribution.zip). UBPs appeared in various superfamilies that were involved in multiple complex biological processes. In order to understand the functions of UBPs, especially in the human body, we further analyzed the gene ontology enrichment and KEGG pathway enrichment of UBPs in the human species.

Gene Ontology Enrichment Analysis
Gene ontology (GO) is an important bioinformatics project that unifies information of genes and gene products across species. The basic information of GO contains information about the biological process (BP), cell component (CC), and molecular function (MF). The enrichment analysis of GO was used to test whether the queried set of genes was statistically enriched in a GO term, which could be measured by P-values: where M is the number of all genes that are annotated by certain GO terms, m is the number of query genes annotated by certain GO terms, N is the number of all the genes of the specific organism that are annotated in GO, and n is the number of query genes annotated by the GO term. The cut-off for the P-value was set to 0.05. We obtained information regarding 113 human ubiquinone-binding proteins from Swiss-Prot. Figure 6 illustrates general information of the GO enrichment analysis results for these proteins, which feature 10 significantly enriched terms in BP, CC, and MF. A total of 2225 BPs were enriched, and this was considered statistically significant for 923 of them. The mitochondrial respiratory chain and metabolic processes were the most highly enriched biological processes. In total, 266 CCs were enriched, and this was considered statistically significant for 130 of them. The mitochondrial associated cell components were highly enriched; 407 MFs were enriched, and this was considered statistically significant for 140 of them. Apart from ubiquinone binding, catalytic activity, oxidoreductase activity, and dehydrogenase activity were the three molecular functions that were observed to be the most highly enriched. The description on the left side of the bar refers to the name of the gene term. "Percent of Genes" refers to the percentage of the number of genes involved in a given term compared to the total number of genes in the query proteins. The digital label on the right side of the bar of a gene term refers to the number of the genes involved in this term and its corresponding P-value. "Max Level" refers to the maximal annotated level of the given term in the GO graph. Different colors refer to the different max levels. Terms with the same max level are sorted according to P-value.

KEGG Pathway Enrichment Analysis
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a bioinformatics database that unifies information on genomes, biological pathways, diseases, drugs, and chemical substances. Enrichment analysis of the KEGG pathway was used to test whether a query set of genes was statistically enriched in the KEGG pathway. This enrichment could be measured by the P-value using the formula used for GO enrichment analysis. Figure 7 illustrates the top 10 significantly enriched KEGG pathways for the human UBPs. For the 113 human UBPs, 70 pathways were enriched, and 11 of them were considered statistically significant. Three categories of pathways were enriched, including five metabolism pathways, one organismal system pathway, and four human disease pathways. Parkinson's disease, Alzheimer's disease, liver disease, and Huntington's disease are the top four disease pathways associated with UBPs. Details about the KEGG pathway analysis can be found in the Supplementary Materials (Data S6: KEGG.zip).

Conclusions
In this study, we proposed the first UBPs predictor (UBPs-Pred), for which three types of sequence-derived features were collected: AAC, DC, and PSSM. Subsequently, a feature selection strategy that combined RF and IFS was used to build the optimal feature subset. Then, the selected features were fed into XGBoost, and the parameters of XGBoost were tuned using MOPSO. The experimental results demonstrated excellent prediction performance, with an MCC value of 0.517.
We then analyzed the UBPs using bioinformatics methods. By analyzing the ubiquinone-binding domains, we found eight motifs that were considered statistically significant. By analyzing the distribution of UBPs, we found that 86.9% of UBPs were membrane proteins. UBPs appeared in 47 superfamilies, and the top 10 superfamilies contained about 74.1% UBPs. ComplexI-N (CL0425), NADP_Rossmann (CL0063), and FumRed-TM (CL0335) were the top three superfamilies. By analyzing the GO enrichment of human UBPs, we found that 923 BP, 130 CC, and 140 MF were statistically significant. The mitochondrial respiratory chain and metabolic processes were the most strongly enriched biological processes. The mitochondrial associated cell components were highly enriched. Apart from ubiquinone binding, catalytic activity, oxidoreductase activity, and dehydrogenase activity were the three molecular functions that were found to be the most highly enriched. By analyzing the KEGG pathway enrichment of human UBPs, we found that 11 pathways were statistically significant. Among them, Parkinson's disease, Alzheimer's disease, liver disease, and Huntington's disease were the top four disease pathways associated with UBPs.

Conflicts of Interest:
The authors declare no conflicts of interest.