Tool recommender system in Galaxy using deep learning

Abstract Background Galaxy is a web-based and open-source scientific data-processing platform. Researchers compose pipelines in Galaxy to analyse scientific data. These pipelines, also known as workflows, can be complex and difficult to create from thousands of tools, especially for researchers new to Galaxy. To help researchers with creating workflows, a system is developed to recommend tools that can facilitate further data analysis. Findings A model is developed to recommend tools using a deep learning approach by analysing workflows composed by researchers on the European Galaxy server. The higher-order dependencies in workflows, represented as directed acyclic graphs, are learned by training a gated recurrent units neural network, a variant of a recurrent neural network. In the neural network training, the weights of tools used are derived from their usage frequencies over time and the sequences of tools are uniformly sampled from training data. Hyperparameters of the neural network are optimized using Bayesian optimization. Mean accuracy of 98% in recommending tools is achieved for the top-1 metric. Conclusions The model is accessed by a Galaxy API to provide researchers with recommended tools in an interactive manner using multiple user interface integrations on the European Galaxy server. High-quality and highly used tools are shown at the top of the recommendations. The scripts and data to create the recommendation system are available under MIT license at https://github.com/anuprulez/galaxy_tool_recommendation.


Introduction
Life sciences depend increasingly on high-throughput data (HTD), turning them into data science to a large extent. However, raw HTD has little value on its own without proper analysis and interpretation. To simplify the data analysis process and to ensure a reproducible analysis, several workflow engines have emerged ((20); (14); (2)). The main idea behind workflow systems is based on the observation that any computational analysis of HTD encompasses multiple steps such as quality control, preprocessing, quantification and statistical analysis to transform raw data into scientific results. Collectively, these steps form a workflow where each step performs a definite transformation of the data, which can be performed using standardised tools. Using a workflow for the analysis is simple and convenient and has several advantages. First, it is easy to replace individual tools by a newer version or to assess the influence of the associated step on the final result. Second, a workflow can be saved, shared and used multiple times, which ensures reproducible research. Therefore, workflows are becoming essential in the analysis of scientific data and there are multiple platforms where researchers can create workflows for their analyses. However, a critical question is how to assess whether a generated workflow is state-of-the-art or even valid at all. To give a concrete example, one can use several real-valued input vectors (such as fluorescence-based measurements stemming from arrays), transform them into integer-based values in the first step and combine them with a tool that uses count-based statistics (such as the negative binomial distribution used in DESeq2) to determine values that show high differential behaviour. While this workflow would run on a workflow system without problems and even produce some results, the generated results are not valid because of the wrong statistical model.
Galaxy is an open-source data processing platform which enables researchers to create and store their workflows for multiple scientific analyses ((1)). A workflow in Galaxy is a directed acyclic graph and consists of one or many tool sequences to analyse scientific data such as DNA and RNA sequences. A tool consumes one or more data files as input and produces one or more data files as output and has a defined number of data types for these input and output files. In workflows, the tools are connected one after another following a constraint that the adjacent tools must have compatible data types. In other words, the data types of output files of a tool should match the data types of input files of the following tool. Galaxy has thousands of accessible tools and acquiring familiarity and constructing workflows with these tools can be a complex and time-consuming task, especially for researchers new to Galaxy. To assist them in creating workflows and to make them aware of the possible tools for further analyses, a recommender system is devised. The benefits of having such a system are manifold. First, it saves the time lost in creating erroneous or suboptimal workflows from wrongly chosen tools, thereby making researchers more efficient. Second, it helps them bypass the step of searching for tools separately, which can further reduce the time spent in creating workflows and increase the accessibility of tools. Third, it promotes tools having higher usage frequencies in the past to the top of the recommendations and moves those having lower usage frequencies to the bottom of the recommendations. This is achieved by assigning weights to tools which are derived from their usage frequencies over a period of time. Finally, it can also be used to promote the newly added tools in Galaxy by showing them alongside the recommended tools predicted using the deep learning approach.
A. Recommender systems. The objective of having recommender systems in fields such as scientific literature search, online shopping, travel bookings, media-service providers and many others is to let people discover suitable, interesting and newly released products. These recommended products are identified based on the usage and purchasing patterns of people in the past. In the field of scientific literature search, the exponential increase in the number of published papers necessitates having a recommender system to help scientists explore relevant and recent papers quickly ((3)). Companies such as Amazon and Netflix have used recommender systems to learn the preferences of their respective customers in selecting products such as their favourite books or movies and to propose a few products out of a large store. It becomes faster for their customers to sift through a few recommended products to find the most suitable ones rather than looking through the complete stores. By enabling their customers to discover reasonable and customised products, the recommender systems have helped these companies grow as organisations ((32); (31)). The successful implementations of recommender systems by many organisations across the world, working in diverse areas to assess the needs of their customers in choosing relevant items and to propose the most useful ones, motivated us to create a tool recommender system in Galaxy.

B. Related work.
To simplify creating workflows for scientific analyses a few approaches have been proposed which suggest alternative tools and workflows. ((25)) makes use of EDAM and semantic annotations of tools to compose workflows automatically for mass-spectrometry based proteomics. The annotations include the names, functionalities, input and output data types of tools. The PROPHETS (Process Realisation and Optimisation Platform using Human-readable Expression of Temporal-logic Synthesis) ((24)) program generates suitable candidates of workflows which match the goal of the proposed workflow and its annotations. WINGS offers multiple variations of a workflow created using different tools. It makes use of the input parameters, types of datasets and functions of tools to build the variations ((34)). The approach used by ((13)) utilises data types to facilitate the automatic creation of workflows. All these approaches depend either on annotations or matching input and output data types of adjacent tools in workflows and they pose challenges such as the addition and maintenance of the meaningful annotations of tools and extracting input and output data types of adjacent tools. Moreover, these approaches have their workflow generation restricted to a few specific bioinformatics analyses such as proteomics or proteogenomics. In addition, they do not discuss the presence of higher-order relationships ((22)) in tool sequences of workflows. Our approach to recommend tools in workflows aims to overcome these challenges in the following manner. First, it does not require storing the metadata of tools. Second, it takes into account the higher-order relationships among tools in the tool sequences. Finally, it incorporates workflows from multiple scientific analyses to train the neural network.

C. Sequential learning on workflows. Workflows, created by many researchers in Galaxy for different scientific analyses, are decomposed into numerous tool sequences (figure 1).
The sequential nature of these tool sequences where tools are connected one after another inspires us to apply similar learning techniques used for other sequential data such as text and speech. There are multiple studies in the fields of natural language processing, clinical research and speech recognition which apply deep learning techniques on sequential data to obtain good accuracy in predicting future items. ((36)) finds context in the long sequences of words for sentiment analysis and part-of-speech tagging using RNN and achieves 85% and 93% accuracy, respectively. For clinical data as well, learning on long sequences of health states proves to be beneficial. The health states of patients recorded at different time points are analysed by accessing their electronic health records. The future health states of patients are predicted by training RNN on the sequences of their health states in the past to achieve 85% accuracy ((21)). Moreover, the variants of RNN are used to model speech and music signals ((9); (5)). These successful studies benefit from the sequential learning techniques using different variants of RNN. Therefore, in our work as well, a variant of RNN (GRU) is used to create the tool recommender system in Galaxy. A Bayesian network can also be used for modeling directed acyclic graphs (workflows) ((18); (33)). It requires computing joint and conditional probabilities of nodes in graphs and an increase in the number of nodes can lead to a higher cost to compute these probabilities. In addition, making predictions by learning a probabilistic network is a hard problem ((6);(11);(7)). Because of these drawbacks of using a Bayesian network it is not used in our approach to create the recommender system in Galaxy.

Materials and methods
To create a tool recommender system in Galaxy, all the workflows are collected from the European Galaxy server. A workflow may have one or many tool sequences where tools are connected one after another. Tool sequences are transformed into matrices and provided as input to a GRU neural network to learn patterns in the connections of tools.

D. Data preparation.
A workflow (figure 1a) is divided into smaller tool sequences (figure 1b, 1c and 1d). The last tool, shown in green, of each tool sequence (of length n) is assigned as the label of the sub-sequence (of length n-1) shown in blue (figure 1). A label is an output which is learned and predicted by the recommender system. In the neural network learning, a tool is a label. For example, in figure 1b, Tools D and E are the labels of the sub-sequence Tool A → Tool B → Tool C. They show higher-order dependencies in their connections, which implies that a tool is not only dependent on its immediate predecessor but also on all prior tools in the tool sequence. For example, in figure 1c, Tool C is dependent on Tools B and A. By analysing multiple workflow fragments in this way the neural network should learn that the label of the tool sequence Tool A → Tool B is Tool C. It is expected that dividing a tool sequence into fragments with a minimum length of two tools, as shown in figure 1c and 1d, will improve the generalisation performance of the neural network because it gets more tool sequences with a variety of lengths to learn from. The dependencies present in tool sequences, shown in figure 1b, 1c and 1d, are learned using the GRU neural network by modeling the conditional probability given by equation 1 ((17)). The probability of a tool ($x_T$) is estimated given all the prior tools ($x_1, x_2, \ldots, x_{T-1}$) in the tool sequence:

$$P(x_T \mid x_1, x_2, \ldots, x_{T-1}) \qquad (1)$$

The neural network learning is classification because there are labels for tool sequences which are learned and then predicted. Moreover, the classification is multi-class (multiple tools as labels) and multi-label (multiple tools as labels for a tool sequence) ((35)). To ensure an unbiased learning and evaluation by the neural network, the set of tool sequences is divided into two parts: training and test. The training data is used for learning a model and the test data is used for evaluating the model.
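The decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decompose`, the parameter `min_length` and the two example paths are hypothetical and are not taken from the repository. Each prefix of a path with at least two tools becomes a fragment, its last tool becomes a label, and fragments sharing the same sub-sequence collect several labels (the multi-label case).

```python
from collections import defaultdict


def decompose(paths, min_length=2):
    """Map every sub-sequence (a fragment without its last tool) to its labels."""
    labels = defaultdict(set)
    for path in paths:
        # Every prefix with at least `min_length` tools becomes a fragment.
        for end in range(min_length, len(path) + 1):
            fragment = path[:end]
            # The last tool of the fragment is the label of the tools before it.
            labels[tuple(fragment[:-1])].add(fragment[-1])
    return dict(labels)


# Hypothetical workflow with two paths that share the prefix A -> B -> C.
paths = [
    ["Tool A", "Tool B", "Tool C", "Tool D"],
    ["Tool A", "Tool B", "Tool C", "Tool E"],
]
samples = decompose(paths)
for sub_sequence, tools in samples.items():
    print(" -> ".join(sub_sequence), "=>", sorted(tools))
```

In this toy example the sub-sequence Tool A → Tool B → Tool C receives both Tool D and Tool E as labels, mirroring figure 1b.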
E. Relevance of tools. The tools in Galaxy have different usage patterns. Some tools are used more often than other tools for multiple reasons such as differences in their functions and the availability of similar but better tools. It is essential to analyse the usage patterns of tools because the recommender system proposes tools for researchers and these tools should have high relevance to their analyses. One of the key indicators of the relevance of a tool is a high usage frequency. If a tool has been used often in the recent past, this indicates that the tool is relevant. However, if a tool was used often a few years ago but has been used less often in the last six months, then the relevance of that tool has declined. The usage frequencies of tools, shown as labels in figure 1, over the past year are shown in figure 2. To incorporate this usage-based relevance of tools in the recommender system, the usage frequencies of all the tools used in the past year have been collected and are used in the neural network training as the weights (logarithm of usage frequencies) of tools.
A tool which has been used often in the past year (for example Tool B in figure 2) is assigned a higher weight than a tool which has been used less often in the past year (for example Tool C in figure 2). When tools are recommended, a score is assigned to each tool by the neural network. It is expected that a tool with a higher weight gets a higher score and a tool with a lower weight gets a lower score. To summarise, the relevance of a tool to be used in a workflow decays if its usage drops over time in Galaxy. Alternatively, the relevance of tools can also be ascertained by counting the occurrences of each tool in all workflows and using these occurrences as weights in the neural network training. However, some tools which were used often in the past to create workflows may not be used anymore. Therefore, weights based on the occurrences of tools in workflows may not be a good indicator of their relevance and, overall, may not be optimal.
F. Implementation. Tool sequences extracted from workflows are transformed into vectors because neural networks require input data to be represented as vectors and matrices.
Each tool sequence has one or more labels (figure 1) and they are transformed into different vectors: a tool sequence vector (figure 3b) and a label vector (figure 3d). To form these vectors a dictionary of tools is needed which stores an index for each tool. Using the indices of tools, a tool sequence vector is created preserving the original order of tools as in the tool sequence. For example, Tool A has an index of "12" in the dictionary, therefore it is replaced by "12" in the vector. In the label vector, the positions of the labels (tools) are turned "on" (set to 1), specifying that these tools are the labels of the tool sequence, and the others are not (set to 0). It has the same size as the dictionary of tools. In the machine learning field, it is also known as a multi-hot encoded vector. Together, these two vectors form a training sample for the neural network. A pair of vectors is created in this manner for each tool sequence and for all the tool sequences they are combined to form two matrices: one for tool sequences and another for their respective labels. These matrices form the input data to the neural network which learns patterns of connections in tool sequences and maps them to their respective labels during training.

[...] (figure 4). GRU has certain advantages which help it to learn on sequential data. First, it avoids the problems of vanishing and exploding gradients which commonly occur in traditional RNN ((26)). This is important because learning higher-order dependencies depends on the gradients of errors with respect to the parameters (recurrent and input weight matrices) of GRU layers. Second, GRU has slightly fewer parameters than the long short-term memory network (LSTM), another variant of RNN, which makes using GRU simpler than LSTM. Finally, it achieves similar accuracy to the LSTM ((9)).
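The encoding of a tool sequence and its labels into a pair of vectors, as described earlier in this section, might look as follows. This is a minimal sketch: the dictionary, the helper names `encode_sequence` and `encode_labels`, the fixed length `max_len` and the left-padding with zeros are assumptions made for illustration only. Indices start at 1 so that 0 can serve as the padding value.

```python
def encode_sequence(sequence, dictionary, max_len):
    """Replace tools by their dictionary indices, keeping their order."""
    indices = [dictionary[tool] for tool in sequence]
    # Assumption: shorter sequences are left-padded with zeros to a fixed length.
    return [0] * (max_len - len(indices)) + indices


def encode_labels(labels, dictionary):
    """Build a multi-hot vector with the same size as the dictionary."""
    vector = [0] * len(dictionary)
    for tool in labels:
        # Indices start at 1, so index i is stored at position i - 1.
        vector[dictionary[tool] - 1] = 1
    return vector


# Hypothetical dictionary of five tools.
dictionary = {"Tool A": 1, "Tool B": 2, "Tool C": 3, "Tool D": 4, "Tool E": 5}

sequence_vector = encode_sequence(["Tool A", "Tool B", "Tool C"], dictionary, max_len=5)
label_vector = encode_labels({"Tool D", "Tool E"}, dictionary)
print(sequence_vector, label_vector)
```

Stacking such pairs for all tool sequences gives the two matrices (tool sequences and labels) that are passed to the neural network.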

Output layer:
The last component of the neural network architecture is a dense layer which computes the predictions (figure 4). The dimension of this layer is equal to the number of unique tools because it predicts a score for each tool (label). The predicted score of each tool is considered as its probability of being the label of an input tool sequence. The closer the predicted score of a tool is to 1, the more probable it is to be the recommended tool and the closer it is to 0, the less probable it is to be the recommended tool.
Dropout layer: Overfitting happens when a neural network performs exceptionally well on the training data but its performance on test (unseen) data remains poor. To prevent it, dropout is used between two layers of the neural network. It works by setting some randomly chosen connections in the neural network to 0 ((37); (15)). Three dropout layers are used in our approach: one between the embedding and the first GRU layers, one between the two GRU layers and the last one between the second GRU and dense layers.
Activations: They are mathematical functions which are used in neural networks to transform the inputs to a layer into its outputs. Two activations are used in this work: exponential linear units (ELU) ((10)) and sigmoid (equation 2). ELU is used for both the GRU layers and has the special feature of being negative when the input is negative, which allows the mean activation (output) to get closer to 0 compared to other activation functions such as ReLU ((23)), whose output is never negative. As mean activations get closer to 0, the approximated and actual gradients get closer to each other. Therefore, using ELU as an activation in our neural network can be useful to achieve faster training, a larger drop in loss and better accuracy. Sigmoid is used in the output layer; it maps any real number to a value between 0 and 1, which is considered as the probability of each tool.
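For reference, the standard definitions of the two activations are given below: the sigmoid (referred to as equation 2 in the text) and ELU. The scale parameter $\alpha > 0$ of ELU is an assumption here, as its value is not stated in the text; a common default is 1.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\mathrm{ELU}(x) =
\begin{cases}
x, & x > 0 \\
\alpha \left(e^{x} - 1\right), & x \le 0
\end{cases}
```

For negative inputs ELU approaches $-\alpha$, whereas ReLU returns 0, which is why the mean activation of ELU can lie closer to 0.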
Usage frequencies of tools as weights: To ensure that the relevance of tools decays with time if they have not been used regularly in the recent past, their usage frequencies are used as their respective weights in the neural network training. The usage frequencies of tools over the last year (figure 2) have been collected from Galaxy. A curve is fitted through the usage frequencies of each tool using support vector regression (SVR) to capture the trend of the tool's usage over time. Using this trend, the usage of the tool for the next month is predicted and its logarithm is used as the weight for this tool. The logarithm of usage frequencies is computed to normalise them because a few tools have a significantly larger magnitude of usage than the remaining tools, which may lead the neural network to learn and predict only the tools with a very large magnitude of usage and ignore the other tools. Learning a trend for each tool involves 5-fold cross-validation and optimising two hyperparameters of SVR, kernel and degree, using grid search. The values used for the kernel are "rbf", "poly" and "linear" and the values used for the degree are 2 and 3. With the grid search, there are 3 (kernels) x 2 (degrees) = 6 different combinations of hyperparameters to be evaluated to find the best curve for each tool ((27)).
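The trend fitting described above can be sketched with scikit-learn as follows. The monthly usage counts are invented for illustration, and the scoring function and the clamping of the prediction at 1 (so that the logarithm is defined and non-negative) are assumptions of this sketch rather than details given in the text.

```python
import math

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Hypothetical monthly usage counts of one tool over the last 12 months.
usage = np.array(
    [120, 135, 150, 160, 155, 170, 190, 210, 205, 230, 250, 260], dtype=float
)
months = np.arange(len(usage), dtype=float).reshape(-1, 1)

# 3 kernels x 2 degrees = 6 configurations, each evaluated with 5-fold CV.
param_grid = {"kernel": ["rbf", "poly", "linear"], "degree": [2, 3]}
search = GridSearchCV(SVR(), param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(months, usage)

# Predict the usage for the next month with the best curve found.
next_month = np.array([[float(len(usage))]])
predicted_usage = float(search.predict(next_month)[0])

# Assumption: clamp at 1 so that the logarithm is defined and non-negative.
weight = math.log(max(predicted_usage, 1.0))
print(search.best_params_, round(weight, 3))
```

Note that the `degree` parameter only affects the "poly" kernel in scikit-learn, so some of the six configurations give identical fits.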
Loss function: A neural network learns patterns from data by minimising a loss function. Cross-entropy is a popular choice for a loss function in classification problems ((16)). In our approach the cross-entropy function is used in the GRU neural network to compute the loss between the true and predicted labels, and it is weighted by the label's weight. The loss is summed up over all labels of a tool sequence and then averaged (equation 3). The term $T$ is the total number of labels (size of the label bit vector). The term $w_i$ is the weight of the $i$-th label. The terms $p_a$ and $p_b$ refer to the true and predicted label vectors for a tool sequence, respectively. In general, the loss is large when $p_a$ and $p_b$ are far away from each other, which means that the neural network has not learned well. If they are close, the loss is low and the predictions are better. When an unweighted cross-entropy is used as the loss function for a classification problem ((29)), it is assumed that all the predictions have the same weight and it does not differentiate between the more and less dominant labels. If it were used as the loss function in our approach, the predicted labels, even when correct, would not necessarily have large weights and thereby may be less relevant. Therefore, to reduce the possibility of less relevant labels appearing in recommendations, the loss is weighted by the weights of labels. This ensures that if a label with a larger weight is misclassified, which means that the true and predicted values are different, then the overall loss is higher. In this way, the wrong classification of labels with a larger weight is penalised more than the wrong classification of labels with a smaller weight.
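The effect of the weights can be illustrated with a small pure-Python sketch. Equation 3 is not reproduced in this text, so the sketch assumes the form $-\frac{1}{T}\sum_{i=1}^{T} w_i \left[p_a^i \log p_b^i + (1 - p_a^i)\log(1 - p_b^i)\right]$, which is consistent with the description above; the function name, the clipping constant `eps` and all numbers are hypothetical.

```python
import math


def weighted_cross_entropy(true_labels, predicted, weights, eps=1e-7):
    """Mean over T labels of the binary cross-entropy scaled by each label's weight."""
    total = 0.0
    for p_a, p_b, w in zip(true_labels, predicted, weights):
        # Clip the prediction so that log() stays finite.
        p_b = min(max(p_b, eps), 1.0 - eps)
        total += -w * (p_a * math.log(p_b) + (1 - p_a) * math.log(1 - p_b))
    return total / len(true_labels)


true_labels = [1, 0, 1]
weights = [5.0, 1.0, 1.0]      # the first label is the most used tool

miss_heavy = [0.1, 0.1, 0.9]   # misses the label with the large weight
miss_light = [0.9, 0.1, 0.1]   # misses a label with a small weight

loss_heavy = weighted_cross_entropy(true_labels, miss_heavy, weights)
loss_light = weighted_cross_entropy(true_labels, miss_light, weights)
print(loss_heavy > loss_light)
```

With all weights set to 1 the two predictions would receive the same loss, which is the behaviour of the unweighted cross-entropy discussed above.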
The loss in equation 3 is computed for all tool sequences in the training data and is minimised using a root mean square propagation (RMSProp) optimiser. It follows an adaptive approach to estimate the learning rate by keeping knowledge of gradients in prior iterations. The learning rate is updated by dividing it by the square root of a moving average of the squares of the prior gradients ((28)).
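A single RMSProp update can be written out for one scalar parameter as below. This is a sketch of the standard update rule, not of the Keras implementation; the values of the learning rate `lr`, the decay `rho` and the constant `eps` are assumptions chosen for the toy problem.

```python
import math


def rmsprop_step(param, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-7):
    """One RMSProp update for a single scalar parameter."""
    # Exponential moving average of the squared gradients.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad * grad
    # The learning rate is divided by the root of that average.
    param = param - lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq


# Toy problem: minimise f(x) = x^2 (gradient 2x), starting from x = 5.
x, avg_sq = 5.0, 0.0
for _ in range(2000):
    x, avg_sq = rmsprop_step(x, 2.0 * x, avg_sq)
print(abs(x) < 0.1)
```

Because the step is scaled by the recent gradient magnitude, parameters with consistently large gradients take smaller effective steps than parameters with small gradients.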
Hyperparameter tuning: A neural network has multiple hyperparameters. In our approach they are the number of dimensions of the embedding layer, the learning and dropout rates, the number of units of the GRU layers and the batch size. They should be optimised to find the best configuration (a combination of hyperparameters) for training on tool sequences, as a different configuration may give a different performance on the same training data. Grid and random searches are popular techniques to optimise hyperparameters. One limitation of these approaches is that they evaluate each configuration independently and have a high time-complexity to find the best configuration. Therefore, the hyperparameters in this work are optimised using Bayesian (sequential model-based) optimisation ((4)). It learns from the previously evaluated configurations, which leads to faster convergence. Reasonable ranges of all the hyperparameters to be optimised are given and the best configuration is found after 30 evaluations.
F.2. Learning. The neural network learns patterns in the tool sequences from the training data and creates a model. The ability of the model to recommend tools is evaluated on the test data, which is unseen by the neural network during training. While learning, the complete training data is divided into batches of equal size and the weights (belonging to multiple layers of the neural network) are learned in iterations. These iterations together make an epoch, at the end of which all the tool sequences in the training data have been used for learning. The number of tool sequences extracted from workflows is approximately 200,000. The training data forms 80% of all tool sequences and it is iterated over for 10 epochs of neural network training. The remaining 20% is used as the test data. The running time of the training is approximately 50 hours on a single core of an Intel(R) Xeon(R) CPU provided by a high-performance computing cluster.
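The arithmetic behind the split and the epochs can be made concrete with a short sketch. The number of sequences, the 80/20 split and the 10 epochs come from the text; the batch size of 128 and the random seed are hypothetical, since the tuned batch size is not stated here.

```python
import math
import random

n_sequences = 200_000      # approximate number reported in the text
train_fraction = 0.8
epochs = 10
batch_size = 128           # hypothetical; the tuned value is not stated here

# Shuffle the sequence indices with a fixed seed for a reproducible split.
indices = list(range(n_sequences))
random.Random(0).shuffle(indices)

n_train = int(n_sequences * train_fraction)
train_idx, test_idx = indices[:n_train], indices[n_train:]

# One iteration processes one batch; one epoch covers all training sequences.
iterations_per_epoch = math.ceil(len(train_idx) / batch_size)
total_iterations = iterations_per_epoch * epochs
print(len(train_idx), len(test_idx), iterations_per_epoch, total_iterations)
```

Under these assumptions there are 160,000 training and 40,000 test sequences, and each epoch consists of 1,250 weight updates.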

F.3. Predictions.
Learning on training data using a neural network creates a model to predict tools and each tool gets a probability score of being the recommended tool of a tool sequence. The predictions are sorted in the descending order of their probabilities and the top ones (with the highest probabilities) are shown as recommendations. Top-k precision (precision@k) is a popular metric for evaluating a recommender system ((30); (19); (12)). Precision@k is the fraction of the k predicted tools that are correct. For example, k = 3 means that the 3 tools with the highest predicted scores are considered. If only 2 of them are correct, then the precision@3 is 2/3 ≈ 0.67. In this way, precision@3 is computed for all the tool sequences in the test data and then averaged to get an overall precision@3. Precision@1 (top-1), precision@2 (top-2) and precision@3 (top-3) metrics are used in this approach to evaluate the quality of the tool recommender system.
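The precision@k computation described above can be sketched as follows. The function names and the example scores are hypothetical; the sketch follows the definition in the text, where the number of correct tools among the top k is divided by k.

```python
def precision_at_k(scores, true_labels, k):
    """Fraction of the k highest-scoring tools that are true labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_k = ranked[:k]
    return sum(tool in true_labels for tool in top_k) / k


def mean_precision_at_k(samples, k):
    """Average precision@k over (scores, true_labels) pairs of the test data."""
    return sum(precision_at_k(s, t, k) for s, t in samples) / len(samples)


# Hypothetical predicted scores for one tool sequence and its true labels.
scores = {"Tool C": 0.91, "Tool D": 0.85, "Tool E": 0.40, "Tool F": 0.05}
true_labels = {"Tool C", "Tool E"}

p_at_3 = precision_at_k(scores, true_labels, k=3)
p_at_1 = precision_at_k(scores, true_labels, k=1)
print(round(p_at_3, 2), p_at_1)
```

Here the three highest-scoring tools are Tool C, Tool D and Tool E, of which two are true labels, which reproduces the 2/3 example from the text.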

F.4. Multiple neural network architectures.
Multiple architectures, a convolutional neural network (CNN) and a dense neural network (DNN) with only dense layers, are used to compare their predictive strengths with the GRU neural network (figures 5 and 6). In these architectures too, the embedding layer is used as the first (input) layer and a dense layer with the same dimension as the number of tools is used as the output layer. Additionally, in CNN, convolutional and max-pooling layers are used to learn spatial patterns in tool sequences and to downsample the input, respectively. Moreover, two dense layers are also used and the last one serves as the output layer. DNN uses two dense layers as hidden layers. The cross-entropy, with and without weights, is used as the loss function and RMSProp is used as the optimiser. Bayesian optimisation is used to optimise the hyperparameters of these architectures.
F.5. Library and model. The Keras deep learning library is used for producing the neural network architectures ( (8)). The trained model is saved as an H5 file to simplify its distribution to different Galaxy instances. The file is an HDF5 store containing the weights of different layers of the neural network and their configurations, a dictionary of tools and their indices and the weights of tools. The weights and configuration of the neural network are needed to recreate the trained model. The dictionary is used to replace IDs of the predicted tools by their indices in the tool sequence.

Results
The models obtained after training all the neural network architectures are used to predict tools for the tool sequences in the test data after every training iteration. The precision and usage frequencies of the predicted tools for top-1, top-2 and top-3 metrics are computed over 10 training iterations for each experiment run. They are averaged and their respective standard deviations are computed over 10 experiment runs. The mean precision is shown by line plots and shaded region spans the region between one standard deviation above and below the mean (figures 5 and 6). The GRU neural network with the weighted cross-entropy loss function shows a superior performance to DNN (figure 5a and 5b) by achieving 97% precision (figure 5f) which proves that the GRU layers in a neural network are better for learning on sequential data than the dense layers. Moreover, it shows lower divergence in the means of precision and usage frequencies (figures 5 and 6) establishing that its predictive strength is more stable than DNN over multiple experiment runs. Surprisingly, the weighted cross-entropy loss function does not have any beneficial effect on DNN as its precision deteriorates over training iterations (figure 5b) with a large standard deviation. Due to poor accuracy, DNN is not used in our approach. In contrast to DNN, CNN achieves a similar precision to GRU neural networks with smaller standard deviations (figure 5c and 5d). It also shows an increase in usage frequencies of predicted tools when weighted cross-entropy is used as a loss function ( figure 6c and 6d). Despite exhibiting promising results for learning on temporal data (figure 5c and 5d), it gathers lower magnitude of usage frequencies than the GRU neural network with cross-entropy loss function (figure 6d and 6f) which drives it to classify tools with higher usage frequencies more robustly. 
In other words, GRU neural network with cross-entropy loss function predicts tools with higher usage frequencies and precision than all other approaches. Therefore, it is used in our approach to learn on tool sequences and recommend tools. To illustrate the real-time usage of the recommender system in Galaxy, two examples have been provided -one shows recommended tools for a tool sequence with 3 tools, Trimmomatic → Bowtie2 → FreeBayes in the workflow editor of Galaxy (figure 7) and another displays recommended tools after the execution of RNA-star tool (figure 8).

Discussion
A recommender system to predict tools in Galaxy is built by analysing workflows using a variant of RNN (GRU) and a weighted cross-entropy loss function. The recommended tools are relevant for multiple scientific analyses with a high accuracy and are easily accessible through simple UI integrations; together, these properties improve user experience by helping researchers to easily create correct workflows. Moreover, the approach does not need to store any metadata of tools and the recommendations are made by only learning the patterns of tool connections in workflows. The model created using this approach is integrated into the European Galaxy server to show recommended tools to researchers. An API is provided, residing with other Galaxy APIs, which accepts a tool or a tool sequence specified by researchers and shows its recommendations in real-time. The API is used at two different places in Galaxy: one shows recommendations in the workflow editor and another shows them after each tool execution. The list of recommended tools is sorted in decreasing order of their (predicted) scores. These scores are positive real numbers and are computed independently of one another by the GRU neural network. To make these scores more meaningful, they are normalised by dividing each tool's predicted score by the maximum predicted score. On a usual Galaxy server the workflows and tools are dynamic, as new tools and workflows are added regularly. Therefore, it is important to train the GRU neural network on the complete set of workflows periodically to keep the tool recommendation model updated with the latest tools and workflows. This model can be created using the set of workflows and tools on a local Galaxy instance by following the steps mentioned in the repository. A script, "extract_data.sh", is provided for collecting raw data from a Galaxy instance. It produces the input datasets: one contains workflows and another contains usage frequencies of tools. Sample input datasets are also provided with the scripts. The values of multiple hyperparameters of the neural network, the number of training iterations and the sizes of training and test data can be altered using the "train.sh" training bash script. To execute the scripts on a GPU-enabled machine, the "tensorflow-gpu" package should be installed instead of "tensorflow" as mentioned in the conda dependencies file, "environment.yml". Alternatively, a Galaxy tool is also available to create this model which can be executed directly on Galaxy. This simplifies the creation of a model by providing a UI where the parameters pertaining to the datasets and the GRU neural network can be changed. To see recommended tools, an IPython notebook, "tool_recommendation.ipynb", is also provided which predicts tools for a tool or a tool sequence. A Galaxy admin can overwrite the recommended tools predicted using the trained model with a different set of tools using the Galaxy API, which can be beneficial to highlight newly added tools.

Fig. 7. The image shows recommended tools in the workflow editor of Galaxy. The recommended tools can be seen in a modal popup after clicking on the right arrow button placed in the top-right corner of each tool. Clicking on any of the recommended tools opens a new block for that tool which can be connected to the tool sequence.
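The sorting and normalisation of predicted scores mentioned in the Discussion can be sketched as below. The function name and the example scores are hypothetical and are not part of the Galaxy API; the sketch assumes, as stated in the text, that all predicted scores are positive.

```python
def normalise_scores(scores):
    """Sort tools by decreasing score and divide every score by the maximum."""
    top = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(tool, score / top) for tool, score in ranked]


# Hypothetical predicted scores for three candidate tools.
predicted = {"Tool X": 0.20, "Tool Y": 0.80, "Tool Z": 0.40}
recommendations = normalise_scores(predicted)
for tool, score in recommendations:
    print(tool, round(score, 2))
```

After normalisation the top recommendation always has a score of 1, and the remaining scores express each tool's relevance relative to it.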