ORCESTRA: a platform for orchestrating and sharing high-throughput pharmacogenomic analyses

Reproducibility is essential to Open Science, as a finding that cannot be reproduced by independent research groups has limited relevance, regardless of its validity. It is therefore crucial for scientists to describe their experiments in sufficient detail so they can be reproduced, challenged, and built upon. However, due to recent advances in the biological and computational sciences, it has become difficult to process, analyze, and share data with the community in a transparent manner. This has made reproducing research findings more challenging, with some researchers going as far as suggesting that the biomedical sciences are experiencing a “reproducibility crisis”. To overcome these issues, we created a cloud-based platform called ORCESTRA (www.orcestra.ca), which provides a flexible framework for the reproducible processing of multimodal biomedical data. The platform enables processing of genomic and pharmacological profiles of cancer samples through user-customizable automated pipelines executed with Pachyderm, a data versioning and orchestration tool. ORCESTRA creates an integrated and fully documented data object known as a PharmacoSet (PSet), with a persistent identifier (DOI), that can be used and shared for future analyses using the Bioconductor PharmacoGx package.


Introduction
The demand for large volumes of multimodal biomedical data has grown drastically, driven in part by active research in personalized medicine and efforts to better understand diseases [1][2][3]. This shift has made reproducing research findings much more challenging because of the need to ensure adequate data handling methods, leading the validity and relevance of studies to be questioned 4. Even though data sharing immensely helps in reproducing study results 5, current sharing practices are inadequate with respect to the size of the data and the corresponding infrastructure requirements for transfer and storage 2,6. As the computational processing required for such data becomes increasingly complex 3, expertise is now needed to build the corresponding tools and workflows. Equally important is data provenance, which includes information about a dataset's origin, how it was generated, whether any modifications were made relative to precedent versions, and what these modifications were 14,26,27. However, datasets published online, including those residing in repositories and journals, are often not accompanied by sufficient metadata 28. In the field of genomics, metadata issues often include mislabelling/annotation of data (e.g., incorrect identification numbers), improper data characterization (e.g., mapping files to respective samples and protocols), and inconsistency in the way metadata is presented and communicated (non-uniform structure used across consortiums) 29. Overall, performing new analyses or repeating existing ones with a dataset that contains poor or no metadata undermines data quality and confidence in the results. Provenance also extends to the scientific workflows/pipelines developed to process datasets 2: relevant source code is often not shared 30, nor is documentation about the workflow, such as in graphical user interface (GUI) based systems like Galaxy, affecting the ability to reproduce results 2.
In addition, pharmacogenomic data maintainers/consortiums, such as the Cancer Cell Line Encyclopedia (CCLE) 31, the Genomics of Drug Sensitivity in Cancer (GDSC), and the Genentech Cell Line Screening Initiative (gCSI) 32, often process a dataset using only the one pipeline they believe is most suitable, without documenting why the chosen pipeline was selected over competing ones in the field 32,33. Therefore, only a single form of the dataset is released, which makes it difficult for other researchers to perform a diverse set of analyses that require different processing pipelines. A lack of provenance and the release of a dataset in a single form affect transparency, underscoring the need to share pharmacogenomic data in a reproducible manner.

Current approaches to sharing pharmacogenomics data and pipelines
In the field of pharmacogenomics, where genomic data are combined with phenotypic measurements of drug response, data portals have been created for accessing and sharing the molecular and pharmacological components of these large datasets, but with limitations in regard to reproducibility 34 (Table 1). The Genomic Data Commons Data Portal (NIH/NCI GDC) hosts raw data for the Cancer Cell Line Encyclopedia (CCLE) from the Broad Institute, including RNA-seq, WXS, and WGS, allowing users to select and download the data type(s) of interest.
The data can be obtained through direct download or through the GDC Data Transfer Tool by providing a manifest file that contains the unique identifier (UUID) of each file; these UUIDs also allow users to locate the files again through the portal, along with their corresponding run, analysis, and experimental metadata. This is advantageous, as all of the raw data (public and controlled access) is located within one portal and can be accessed efficiently.
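A GDC manifest is a tab-separated file whose `id` column holds each file's UUID, and each UUID maps to a direct download URL on the public GDC API. As an illustrative sketch (the UUID and filename below are placeholders, not real GDC entries), a manifest can be turned into download URLs as follows:

```python
import csv
import io

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"  # public GDC API data endpoint

def manifest_to_urls(manifest_text):
    """Parse a GDC manifest (tab-separated, with an 'id' column of file
    UUIDs) and build a direct-download URL for each listed file."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return {row["filename"]: f"{GDC_DATA_ENDPOINT}/{row['id']}" for row in reader}

# Example manifest snippet; columns follow the GDC manifest layout,
# but the UUID and filename are placeholders.
sample = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "0001-aaaa\tsample1.bam\tabc123\t1024\treleased\n"
)

print(manifest_to_urls(sample))
```

The same UUIDs can be passed to the GDC Data Transfer Tool for bulk transfers; the direct URLs are convenient for scripting single-file retrievals.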
However, no release notes are provided for data newly uploaded to or modified on GDC, which makes it challenging for researchers to keep track of different versions of the dataset and ensure their analyses remain reproducible. Moreover, recently added CCLE data (e.g., additional RNA-seq cell lines) 33 are found on the European Nucleotide Archive (ENA), but not on GDC, resulting in data source inconsistency that is difficult for users to manage and follow. In addition, current and previous versions of other CCLE data (i.e., annotation, drug-response) are hosted on a Broad Institute portal, but no release notes or documentation accompany each version, forcing researchers to manually identify changes within each file after every release. gCSI, a dataset generated by Genentech, provides easy access to its drug-response and processed molecular data via the compareDrugScreens R package (http://research-pub.gene.com/gCSI-cellline-data/) 32, along with its raw molecular data through the European Genome-Phenome Archive (EGA) and ArrayExpress data portals. GRAY, a dataset generated by Dr. Joe Gray's lab at the Oregon Health and Science University, has had three updates, with raw data hosted on NCBI, and drug-response and annotation data hosted on SYNAPSE, DRYAD, and/or the papers' supplementary sections [35][36][37]. Because each version of the dataset is associated with a different paper, the data are scattered among various hosts, which makes it challenging to keep track of each source, and for each source to ensure that the data remain readily available, as one failed link would make it difficult for a researcher to reproduce any results.
However, for the GRAY dataset, NCBI provides detailed information about the methodology used for the experiments, SYNAPSE provides a wiki, a contact source for the dataset, and a provenance tracker for each uploaded file, and DRYAD stores each publication's data as a package with accompanying descriptions to keep the data organized.
Along with CCLE, the Genomics of Drug Sensitivity in Cancer (GDSC) from the Wellcome Trust Sanger Institute also demonstrates good data sharing practices, providing a data downloading tool as well as a single web page with links and descriptions for both drug-response and raw data, keeping everything in one place. However, other than a short description of each data file, GDSC does not specify what was updated within each file after a release, making it difficult to track provenance. Therefore, there is a need for a platform that acts as a common point of entry for both molecular and drug sensitivity data and that combines the positive data sharing practices of the available data portals.

Key functionalities for an effective orchestration platform
The increasing utilization of and demand for big data have resulted in the need for effective data orchestration 38, a process that involves organizing, gathering, and coordinating the distribution of data from multiple locations across compute resources with specific processing requirements, in order to meet the needs of an end user. An ideal orchestration platform for handling large-scale heterogeneous data would consist of the following: (1) a defined workflow; (2) a programming model/framework 38; and (3) broad availability of a compute infrastructure. At the workflow level, data from different sources/lineages, including data that are not static, must be effectively managed through the definition of workflow components (tasks) that interact and rely on one another 38. Moreover, a programming model should be utilized for the workflow components responsible for handling the respective data (static/dynamic), such as a batch processing model (MapReduce) 38. Lastly, a scalable compute environment is needed to meet the processing requirements of the workflow.
Resource allocation. Pachyderm requires persistent RAM/CPU allocation for each pipeline within the Kubernetes cluster, even after a pipeline is successfully executed, which permits automatic pipeline triggering. Thus, an increased amount of compute resources (VMs scaled up or out) may be required for specific pipelines, which also impacts cost efficiency.

ORCESTRA platform
We have developed a cloud-based paradigm for data sharing and processing in pharmacogenomics based on automation, reproducibility, and transparency. The platform is deployed on Microsoft Azure and encompasses automated Pachyderm pipelines that allow a user to create a custom PharmacoSet (PSet), an R data object that stores molecular profiles, drug sensitivity data, and experimental metadata for the respective cell lines and drugs, allowing for integrative analysis of the molecular and pharmacological data for biomarker and drug discovery 39. The platform utilizes datasets from the largest pharmacogenomic consortia, including CCLE, GDSC, gCSI, GRAY, and UHNBreast (Table 3). PharmacoSets can accommodate all types of molecular profile data; however, ORCESTRA currently supports gene expression (RNA-sequencing, microarray), CNV, mutation, and fusion data. For RNA-seq data, a user can select a reference genome of interest, a combination of quantification tools and their respective versions, and reference transcriptomes from two genome databases (Ensembl, Gencode) to generate custom RNA-seq expression profiles for all of the cell lines in the dataset. Therefore, each PSet is generated through a custom orchestrated Pachyderm pipeline path, where each piece of input data, pipeline, and output data option is tracked and given a unique identifier to ensure the entire process is completely transparent and reproducible.
Unlike other platforms, to ensure PSet generation is fully transparent and provenance is completely defined, each PSet is automatically uploaded to Zenodo and given a public DOI, which is shared via a personalized persistent web page containing a detailed overview of the data each DOI-associated PSet contains and how it was generated. This includes publication sources, drug sensitivity information and its source, raw data sources, the exact pipeline parameters used for the chosen processing tools, and URLs to the reference genomes and transcriptomes used by the tool(s). This page is automatically sent to each user via email, providing one custom page that hosts all of the information required to understand how the PSet was generated. Therefore, all of the data used in the PSet is shared in a transparent fashion, and researchers can identify the true origins of all data used with confidence and effectively reproduce results.

ORCESTRA structure
To make the platform as transparent as possible, it harnesses an architecture with three distinct layers that not only work independently to process and interpret precedent data, but also have the capability to scale (Figure 1). The first layer contains the web application, which was developed using a Node.js API and a React front-end with MongoDB as the database. This layer provides the user with an interaction point to the ORCESTRA platform, allowing users to search for existing PSets, request a new PSet by entering pipeline parameters, view PSet request status, and register a personal account to save existing PSets of choice.
The second data-processing layer encompasses a Kubernetes cluster on Microsoft Azure that hosts Pachyderm, which utilizes Docker images for PharmacoGx and respective R-packages.
All of the RNA-seq raw data for the CCLE, GDSC, gCSI, GRAY, and UHNBreast PSets have been pre-processed with Kallisto and Salmon Snakemake pipelines in an HPC environment and subsequently pushed to assigned data repositories on Pachyderm, allowing for specified selection from the web app (transcriptome and tool version). The Pachyderm PharmacoGx pipeline incorporates the RNA-seq and drug-sensitivity data specified via the web app, along with annotation data hosted on GitHub and other molecular profile data (e.g., CNV, mutation), into a PSet (Figure 2). The GitHub-hosted files encompass cell line, drug, and tissue annotation data that can be viewed at the file level for changes and edited, which automatically triggers the Pachyderm PharmacoGx pipeline with the new modifications to produce a new PSet. A unique feature of Pachyderm is the prevention of re-processing already computed data; for example, an update of RNA-seq annotations will not trigger the re-processing of thousands of drug-sensitivity experiments. Each generated PSet then enters the third data-sharing layer, where it is automatically uploaded to an online data-sharing repository known as Zenodo with a DOI, giving the data object a persistent, uniquely identifiable location on the internet.
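Pachyderm pipelines of this kind are declared as JSON specifications that name an input repository, a glob pattern determining what triggers re-processing, and a Docker transform. The following is a minimal illustrative spec, not ORCESTRA's actual pipeline definition (the repo, image, and script names are hypothetical), expressed as a Python dict and serialized as it would be submitted with `pachctl create pipeline`:

```python
import json

# Illustrative Pachyderm pipeline spec; the field names follow the
# Pachyderm pipeline specification, but the repo, image, and script
# names are hypothetical placeholders.
pipeline_spec = {
    "pipeline": {"name": "pharmacogx-pset"},
    # New commits to the "annotations" repo matching the glob trigger a run.
    "input": {"pfs": {"repo": "annotations", "glob": "/*"}},
    "transform": {
        "image": "example/pharmacogx:latest",
        "cmd": ["Rscript", "/scripts/build_pset.R"],
    },
}

# Serialize the spec as it would be passed to `pachctl create pipeline`.
print(json.dumps(pipeline_spec, indent=2))
```

Because the glob scopes the trigger to the annotations repo alone, an annotation edit re-runs only this pipeline, which mirrors the no-redundant-reprocessing behavior described above.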
The generated DOI is then associated with a custom meta-data web page that is generated based on the contents of the PSet. Below is a detailed overview of each ORCESTRA layer:

Web Application Layer: Features and Use Cases
ORCESTRA comes with a web application interface allowing users to interact with the data-processing and data-sharing layers. These features are outlined in the following use cases:

Search and obtain information about existing PSets ("Search/Request" and "Individual PSet" views)
Users can search existing PSets in the "Search/Request" view by filtering them with the "PSet Parameters" panel. Users can filter existing PSets by selecting datasets with associated drug sensitivity releases, genome references, RNA-seq transcriptomes, and RNA-seq processing tools with respective versions, along with the associated DNA (mutation or CNV) and RNA (microarray or RNA-seq) data types. Changes in the parameter selections update the set of matching PSets displayed. ORCESTRA keeps track of PSet requests submitted by users based on their email addresses, even without registration. These PSets are automatically added to a user's favorite PSets and can be viewed in the "User Profile" view.
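One plausible way to implement such filtering against the MongoDB backend is to translate the selected parameters into a query document; the field names below are illustrative assumptions, not ORCESTRA's actual schema:

```python
def build_pset_filter(dataset=None, genome=None, rnaseq_tool=None):
    """Build a MongoDB-style filter document from "PSet Parameters"
    selections. Field names are illustrative, not ORCESTRA's schema."""
    query = {}
    if dataset:
        query["dataset.name"] = {"$in": dataset}  # any of the chosen datasets
    if genome:
        query["genome"] = genome
    if rnaseq_tool:
        query["rnaTool.name"] = rnaseq_tool
    return query

# An empty selection yields an empty filter, i.e. all PSets match.
print(build_pset_filter(dataset=["CCLE", "GDSC"], genome="GRCh38"))
```

A dict of this shape could be passed directly to a pymongo `collection.find(...)` call.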

ORCESTRA Deployment Options and Costs
In order to cater to a specific set of computing needs and resource limitations for each user, ORCESTRA offers flexibility in platform configuration and deployment for each of its three layers.
The web application and data-processing layers can be customized and deployed in different environments, such as a remote server, a virtualized server on a cloud computing platform (e.g., Microsoft Azure, Google Cloud, AWS), or an on-premise server. While the platform uses Zenodo as its data-sharing layer by default, it can be configured to use different data sharing repositories if a user wishes to do so. Such flexibility in platform deployment and configuration enables users to deploy and utilize ORCESTRA within the limitations of their computing environment. For example, users may choose to deploy the web application and data-processing layers on a physical server located on-premise. This gives users greater control over platform configuration and cost management, especially with regard to deploying a Kubernetes cluster, the core framework used for the data-processing layer. Running the cluster on a cloud computing platform such as Microsoft Azure could incur a large cost (on the order of thousands of US dollars per month). Therefore, deployment of the data-processing layer on-premise may be favorable if an adequate server is available. On the other hand, commercially available cloud computing platforms typically come with a number of useful features to assist users in deploying web applications, databases, and computing clusters. Microsoft Azure's App Service plan and database services, for example, allow users to easily deploy web applications and configure databases without installing additional tools or software. Pachyderm can also be easily installed and configured using Azure Kubernetes Service (AKS). These built-in services offer users ease and reliability in deploying and maintaining ORCESTRA. With this deployment flexibility, users can choose to deploy each platform layer in a different computing environment to meet their needs and limitations in computing and monetary costs.
Users may choose to take advantage of the ease of web application deployment and maintenance offered by a commercially available cloud computing platform while deploying the data-processing layer on-premise to reduce monetary costs. Another way ORCESTRA allows users to reduce computing costs is through its online/offline request processing features. For example, users who deploy the data-processing layer on a cloud computing platform may choose to activate the layer only when pipeline requests need to be processed and deactivate it once processing is complete, in order to reduce monetary costs. ORCESTRA accommodates such use cases by providing Pachyderm status checking, offline request processing, and manual push request features, as described below:
Pachyderm status check. Users can easily identify the status of Pachyderm (the instances hosting it are either online or offline) with the status indicator in the right-hand corner of the top navigation bar on ORCESTRA. The web application checks the online/offline status each time it is reloaded in a web browser.
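A status check of this kind can be sketched as a simple reachability probe with a short timeout; the health endpoint URL below is a hypothetical example, not ORCESTRA's actual Pachyderm address:

```python
import urllib.request
import urllib.error

def pachyderm_online(health_url, timeout=2.0):
    """Return True if the Pachyderm endpoint answers an HTTP request
    within the timeout. The URL is supplied by the caller; the health
    endpoint used here is a hypothetical example."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        # Connection refused, unreachable host, or timeout: report offline.
        return False

# With no cluster listening on this local port, the probe reports offline.
print(pachyderm_online("http://127.0.0.1:1/health"))
```

Running the probe on each page load, as the web application does, keeps the indicator current without any persistent connection to the cluster.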
Offline request submission. The web app interface allows users to submit pipeline requests even when the Pachyderm cluster is offline. These requests are saved in the database with "pending" status which can be viewed in the "Request Status" page.
Manual pipeline request push. The web app interface provides an administration feature that allows an administrator to manage pending pipeline requests that are submitted while Pachyderm is offline. When a user is logged in as an administrator, the user can "push" those pending requests to the data-processing layer in the "Request Status" view, upon activation of Pachyderm.
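The pending-request bookkeeping behind this feature can be sketched as a small pure function that, given the queued requests and the current Pachyderm status, selects the "pending" entries eligible to be pushed; the request shape here is an illustrative assumption:

```python
def select_pushable(requests, pachyderm_is_online):
    """Return the queued requests saved with 'pending' status that an
    administrator could push to the data-processing layer. Status names
    follow the "Request Status" view; the dict shape is illustrative."""
    if not pachyderm_is_online:
        return []  # nothing can be pushed while Pachyderm is offline
    return [r for r in requests if r.get("status") == "pending"]

# Hypothetical queue: two requests submitted while Pachyderm was offline.
queue = [
    {"id": 1, "status": "pending"},
    {"id": 2, "status": "complete"},
    {"id": 3, "status": "pending"},
]
print(select_pushable(queue, pachyderm_is_online=True))
```

Keeping the selection logic pure like this makes the push action easy to test independently of the database and the cluster.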

Summary and Future Directions
ORCESTRA is a platform that harnesses a robust framework to ensure the reproducible and transparent processing and sharing of pharmacogenomic data with the scientific community. Through the use of Pachyderm, the platform is able to track the provenance of all pharmacological and molecular data across a plethora of cell lines and drug compounds that are incorporated into a single unified data object for cancer research. Not only are we hoping to add additional pharmacogenomic datasets to the platform, but we are also interested in expanding our platform to securely process and host private PSets that can be accessed through the existing user profiles. This will allow researchers to effectively collaborate and share data with other teams, along with standardizing the space in which their data is hosted. Lastly, we aim to allow users to select additional tools for the processing of RNA-seq data, such as STAR and HISAT2, along with CNV and mutation data, which will increase the flexibility of the platform.
Figure. ORCESTRA web-application connectivity with the data-processing layer through commit ID scanning for user-selected pipeline requests, and subsequent PSet DOI tracking with MongoDB queries.
Table 1. Common data portals for sharing genomic data with notable and expected functionalities for transparent and reproducible processing, analysis, and interpretation.
Table 3. Datasets (PSets) processed and shared through ORCESTRA with respective drug sensitivity versions, number of cell lines and drugs, and number of drug sensitivity experiments.