Identifying Operons with Deep Neural Networks

Genes in prokaryotic genomes often assemble in transcription units called operons. Detecting operons are of signiﬁcant importance to help infer functionality and detect regulatory networks. Several tools have been proposed to detect such operons computationally. We propose a new method, which we name Operon Hunter, that uses visual representations of genomic fragments and a residual neural network architecture to make operon predictions. Our method uses a pertained network via transfer learning to leverage big datasets. We report the highest accuracy in the literature that we know of when tested on the standard datasets that are reported in various studies: E. coli and B. subtilis, with an F1 score of 0.83, outperforming the previously reported state of the art tools. Our method also demonstrates a clear advantage when it comes to detecting full operons rather than separate gene pairs.


Introduction
In prokaryotes, contiguous genes are often arranged into operons. These operons usually include metabolically or functionally related genes that are often grouped together on the same strand and transcribed in the same polycistronic messenger RNA [1][2][3][4][5][6][7][8][9][10][11] . Genes in the same operon share a common promoter and terminator 12 . We reserve the definition of Operons as those including at least two genes, distinguishing them from Transcription Units, which may include one or more genes [1]. It is estimated that more than 50% of bacterial genes form operons 13 . Since operons are often formed by genes that are related functionally or metabolically, predicting operons could help us predict higher levels of gene organization and regulatory networks 1,2,[8][9][10][11]14 . Such predictions could help annotate gene functions 15 , aid in drug development 16 , and antibiotic resistance 17 .
Many tools perform their predictions on gene pairs, and then assemble their predictions that are made on an entire genome into operons by combining contiguous pairs that are labeled as operonic. 1,3 . These tools are often benchmarked using two genomes whose operons have been studied extensively, E. coli and B. subtilis 3 .
Among the tools that predict operons, we focus on two methods. The first is reported to have the highest accuracy among operon prediction tools 7 . It is based on an artificial neural network that uses inter-genetic distance and protein functional relationships. The method uses gene neighborhood, fusion, co-occurrence and co-expression in addition to protein-protein interaction and literature mining to generate a score that can be used as input to their network 5,7 . The predictions made by this method were compiled into what is called the Prokaryotic Operon Database (ProOpDB) 8 . The second tool is called the Database of Prokaryotic Operons (DOOR) and was ranked as the best operon predictor among 14 other predictors 2 , and was also reported by another study to have the second highest accuracy after ProOpDB 7 . Their algorithm uses a combination of a non-linear (decision-tree based) classifier and a linear (logistic function-based classifier) depending on the number of experimentally validated operons that could be used in the training 2 .
Before delving into our method, it's worth pointing out some of the challenges that undermine operon prediction. Predictors that rely on features such as gene conservation or functional assignment are limited in the sense that they could not be applied to genomes or genomic fragments that lack these features. So while such predictors might perform well on gene pairs that include the necessary features, their performance might drop considerably when considering the whole genome 1 . Moreover, even though most methods validate their results by comparing their predictions to experimentally verified operons, the fact that the experimentally verified datasets are not available for all prokaryotes or that the datasets used vary between studies poses extra challenges when trying to compare prediction methods. Brouwer et al tried to compare several methods on a uniform dataset and noticed significant gap in the metrics reported. The drop was even higher when considering full operons rather than gene pairs 7,36 . Finally, Some methods include the testing dataset in the training dataset which leads to reported accuracies that are significantly higher, whereas the information flow would not be easily and readily transferable to other genomes 9 . With those challenges in mind, we describe out method in the next section and how we tried to alleviate some of the challenges, and present our results in section 3.

Methods
In our previous work 37 , we demonstrated a successful method that relies on using images to represent genomic data to identify genomic islands. The idea is that using graphical representations makes it easier to leverage the powerful ML technologies that have become the state of art in solving computer vision problems. Algorithms based on Deep Neural Networks have proven to be superior in competitions such as ImageNet Large Scale Visual Recognition Challenge (ILSVRS) 38 . Deep learning is the process of training deep neural networks (i.e neural networks with many hidden layers). The depth of these networks allows them to learn more complex patterns and higher order relationships, at the cost of being more computationally expensive and requiring more data to work effectively. Improvements in such algorithms have been translated to improvements in a myriad of domains reliant on computer vision tasks 39 . In our approach, the images were generated by PATRIC (The Pathosystems Resource Integration Center), which is a bacterial Bioinformatics resource center that we are part of (https://www.patricbrc.org) 40 . More specifically, one of the services PATRIC provides is the compare region viewer service, where a query genome is aligned against a set of other related genomes anchored at a specific focus gene. The service starts with finding other genes that are of the same family as the query gene, and then aligning their flanking regions accordingly. Such graphical representations are appealing as they help users visualize the genomic areas of interest. We replicated the service by implementing an offline version forked from the production UI, which is necessary for computational efficiency and for consistency in the face of any future UI changes. Each row of arrows in the generated image represents a region in a genome, with the query genome being the top row. Each arrow represents a single gene, capturing its size and strand directionality. The arrows and the distance between them are scaled on each row to reflect the actual length and distance between the genes. Colors represent functionality. The blue arrows are reserved to represent the query gene pair. The rest of the genes share the same color if they belong to the same family, or are colored black if they are not found in the query genome's focus region. The families used in the coloring process are PATRIC's Global Pattyfams that are generated by mapping signature k-mers to protein functionality, using non-redundant protein databases built per genus before being transferred across genera 41 .
We also introduce slight modifications to the produced images, to capture more meaningful operon-specific features. Namely, we highlight the inter-genetic distance between the query gene pair by filling the gap as a red rectangle. As a result, the generated images capture most of the prominent features mentioned earlier, such as gene conservation, functionality, strand direction, size, and inter-genetic distance. Figure 1 shows an example of how the generated images look like.
Images offer a natural way to compare genes (horizontally) and clusters across genomes (vertically) with 2D convolution. The fact that the compare region view sorts genomes by evolutionary distance allows the neural network to exploit locality and place more emphasis on close genomes via incremental pooling. An additional benefit of working with the image modality is to be able to leverage the state-of-the-art deep learning models, many of which were first developed in vision tasks and perfected over years of iterations. Other than our previously mentioned successful approach to detect genomic islands, Google researchers have used spectrogram (instead of text) in direct speech translation 42 and DNA sequence pile-up graphs (instead of alignment data) in genetic variant calling 43 . In both cases, the image-based models outperformed their respective previous state-of-the-art method based on traditional domain features. Further, the low-level visual feature patterns learned in pre-trained image models have been demonstrated to transfer to distant learning tasks on non-image data in several preliminary studies ranging from environmental sound classification to cancer gene expression typing 44 . Much like feature engineering methods, casting tabular data to images encodes information in a way more amenable to learning without explicitly adding information. It can also be easily integrated with other data modalities in the latent vector representation to prevent information loss. We hypothesize this emerging trend of representing data with images will continue until model tuning and large-scale pre-training in scientific domains start to catch up with those in computer vision.
When it comes to training the model, perhaps the most important step is building the right dataset. We wanted to stick to Figure 1. Example of an image generated by our service to be fed as input to the neural network. Each arrow represents a single gene. Each row captures the area of interest in a genome. The query genome is the top row. The rest of the rows are genomes selected by evolutionary distance. The query gene pair are color-coded with blue, and the inter-genetic distance between them is colored red. The rest of the genes share the same color if they belong to the same family.
experimentally verified data instead of rely on other tools' predictions. But considering how the former is more limited, and how deep the models we were using are, it put the model at the risk of over-fitting. To get around that, we used a technique referred to as transfer learning 45 . In transfer learning, a model does not have to be trained from scratch. Instead, the idea is to retrain a model that has been previously trained on a related task. The newly retrained model should then be able to transfer its existing knowledge and apply it to the new task. This approach gives us the ability to reuse models that have been trained on huge amounts of data, while adding the necessary adjustments to make them available to work with more limited datasets, adding a further advantage to our approach of representing the data visually. We used the FastAI 46 platform to train and test our model. After testing the performance of the different available pre-trained models on our dataset, we observed the best performance when using the ResNet18 model. These models were previously trained on Imagenet, which is a database that contains more than a million images belonging to more than a thousands categories 38 . Thus, a model that was previously trained on ImageNet is already good at feature extraction and visual recognition. To make the model compatible with the new task, the top layer of the network is retrained on our operon dataset, while the rest of the network is left intact, which is more powerful than starting with a deep network with random weights. We also found better results when we changed the mode of transfer learning to also re-train the previous layers of the model (as opposed to just the last layer) by unfreezing their weights. For our training data, we aligned the query genome with the set of reference+representative genomes found on PATRIC. For each genome used, our program produces an image for every consecutive gene pair, capturing the surrounding 5 Kbp (kilo base pairs) flanking region. A balanced dataset was then curated from the total set of images created. In this supervised learning approach, we used the "known operons" section of the Operon DataBase (ODB) 47 to label our images. The Known Operons section of ODB contains experimentally verified operons. To construct the training dataset, we used 8 genomes (Table 1) with a significant number of labeled operons. The result was 4,861 images of Operonic gene pairs. We generated a close number of images representing non-Operonic gene pairs. To construct the negative dataset (the non-operons), we resorted to the same approach reported by ProOpDB 8 as the standard. The idea is to select the known Operon boundaries along with the respective upstream or downstream gene and label the resulting pair as non-operonic.

Results
To verify our results, we resort to two extensively studied genomes with experimentally verified operons: E. coli and B. subtilis, and report the achieved specificity, sensitivity, and F1 score. These genomes are the standard datasets for verifying operon prediction results in the literature, and serve as a good comparison ground. We compared our predictions to those made by ProOpDB and DOOR, the two tools with top accuracies as reported by independent studies 2,7,8 . To build the testing dataset,  49 , as is commonly done. RegulonDB is a database containing verified operons found in E. coli, and although it is regularly updated, we used the predictions made by the version that matches ProOpDB's release. DBTBS is a database that contains verified operon predictions for B. subtilis. The resulting dataset consists of 1081 operon gene pairs. We used the same approach mentioned earlier to construct a matching negative non-operon dataset. Namely, we used the boundaries of operons reported by ODB, RegulonDB and DBTBS, and reported each boundary with its neighboring gene as non-operonic pair. We then chose a subset of all the pairs to have a balanced dataset. We also filtered out the non-operonic pairs that were predicted by ProopDB or DOOR as operons, so the resulting negative dataset turned out to be slightly smaller than the positive operon dataset. Using predicted operons as well as verified ones diminishes the possibility of labeling an operonic gene pair with the wrong class. We skipped single-gene operons and rna genes as these are not handled by our tool. Table 2 shows a breakdown of the testing dataset. It is important to highlight that while other tools face the challenge of transferring their performance across lineages, our approach does not face that challenge, which is made clear by how different the genomes constituting our datasets are. It is worth noting at this point that ProOpDB included its testing data as part of its training data, and reported a decrease in accuracy when tested on separate dataset. In our experiments however, we exclude the testing dataset from the training dataset. So for example when we are evaluating the model's performance on E. coli, we train the model on all the images belonging to all the genomes except those that belong to E. coli, and then test the model's performance on the images generated from the E. coli genome. In addition, since we mentioned earlier that accuracies tend to change when trying to detect full operons rather than gene pairs 7 , we present a more in-depth breakdown to show how much they change. To make full operon predictions, we first make predictions on every consecutive gene pair in a genome, and then merge the consecutively labeled operon pairs into full operons. Tables 3 and 4 report the specificity and sensitivity achieved by the three mentioned tools over the constructed datasets, and Table 5 shows the resulting F1 score. results reported by ProOpDB vary from those reported in their paper, which could be attributed to the fact that we are reported results that cover the whole genome, and not just gene pairs with COG assignments as is required by their tool. Also note that DOOR's performance on E. coli seems to be significantly more superior than its performance on B. subtilis. To make sure that the algorithms are not simply over-predicting the positive class, we compare the predictions made on a negative dataset in Table  4. To present a fair comparison in one table, we present the resulting F1 score for each tool in Table 5. It's worth noting that even though ProOpDB scored the highest sensitivity, it scored the lowest specificity. While the high sensitivity could be attributed to having the testing data as part of their training data, the low specificity could be due to a significant number of genes missing COG assignments, resulting in a lower overall F1 score. Predicting full operons is a more challenging task that requires accurate boundary detection. We present how the 3 tools cope with full predictions in Table 6. We can see that Operon Hunter achieves the highest accuracy, with 62% of its predictions exactly matching full operons, and 24% being partial hits, while missing 14% of operons mentioned in the dataset, which is also the most across the three tools. Compare that to ProOpDB whose accuracy drops significantly to 46% when making exact hits on full operon predictions. DOOR comes in last getting only 21% of exact matches with full operons. Operon Hunter seems better adept to predict full operon boundaries accuractely, while ProOpDB and DOOR mostly get partial hits. Even though DOOR gets a small percentage of full hits, it does get all the other reported operons in the datasets partially, with the minimum number of absolute misses among the three tools (0%). Table 6. Comparison of the results between OperonHunter, ProOpDB, and DOOR when considering full operon predictions. Full hits are the percentage of operons predicted that exactly match the validated results, partial hits are predictions that have different boundaries but intersect with the verified results, and misses are verified operons that are missed by the tool. The percentages are reported over 479 operons.

Availability of data and materials
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.