Knowledge-based Biomedical Data Science 2019

Knowledge-based biomedical data science (KBDS) involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge, often in the form of knowledge graphs. Here we survey the progress of the last year in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as in approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing, and the expansion of knowledge-based approaches to novel domains, such as Traditional Chinese Medicine and biodiversity.


INTRODUCTION

What is Knowledge-based Biomedical Data Science?
Knowledge-based biomedical data science (KBDS) involves the design and implementation of computer systems that act as if they knew about biomedicine (1). There are many ways in which a system might act as if it knew something: for example, it might be able to use existing knowledge to generate, rank, or evaluate hypotheses about a dataset, or to answer a natural language question about a biomedical topic.
Knowledge-based systems have long been a theme in artificial intelligence research.
Knowledge-based systems specify a knowledge representation -- how a computer system represents knowledge internally -- and one or more methods of inference or reasoning -- how computations over those representations (perhaps combined with other inputs) are used to produce outputs. Classical descriptions of knowledge representation and reasoning systems, e.g., (2), characterize them by the ontological commitments a knowledge representation makes (that is, what it can or cannot describe), which inferences are possible within it, and, sometimes, which of those inferences can be made efficiently. These issues remain useful in thinking about how knowledge representation and reasoning play a role in today's data science environment.

Ontologies
Knowledge representations are said to be grounded in a set of primitive terms that specify those ontological commitments: the entities and processes that can be referred to by that knowledge representation. Computational ontologies are collections of primitives relevant to a domain, often related to each other by explicit subsumption (subclass of) and meronomy (part of) statements. Biomedical ontologies (such as the Gene Ontology (3)) are community consensus views of the entities involved in biology, medicine, and biomedical research, analogous to how nomenclature committees systematize naming conventions. Creating knowledge-bases with primitives from community-curated ontologies, rather than with idiosyncratic or single-use sets of primitives, provides significant advantages for reproducibility in scientific research and for interoperability, and helps avoid pitfalls in the modeling of knowledge. While lacking some useful aspects of a computational ontology, terminological resources such as the UMLS (4), SNOMED CT (5), and the NCI Thesaurus (6) have also been used to provide interoperable foundations for knowledge representations.
Primitives can be combined into assertions that express facts about the world. In the simplest case, an assertion links two represented elements with a specific relationship.
Consider, for example, that the Protein Ontology contains a representation of human TP53, the Gene Ontology contains a representation of the process of DNA strand renaturation, and the Relation Ontology contains a representation of participation that can link a physical entity to a process in which it participates. Those three ontological entities can be composed into an assertion that could be part of a knowledge-base: human TP53 participates in DNA strand renaturation. Collections of assertions, generally called knowledge-bases, can be created and shared, and then in turn used by other systems that apply various inference methods to fulfill particular application needs.
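The composition described above can be sketched in code. In this toy illustration, the CURIE-style identifiers are placeholders rather than the actual ontology IDs; an assertion is simply a subject-predicate-object triple over ontology primitives:

```python
# Composing a knowledge-base assertion from ontology primitives.
# The CURIE-style identifiers are placeholders, not real ontology IDs.
assertion = (
    "PR:human_tp53",               # Protein Ontology primitive (placeholder)
    "RO:participates_in",          # Relation Ontology primitive (placeholder)
    "GO:dna_strand_renaturation",  # Gene Ontology primitive (placeholder)
)

labels = {
    "PR:human_tp53": "human TP53",
    "RO:participates_in": "participates in",
    "GO:dna_strand_renaturation": "DNA strand renaturation",
}

def verbalize(triple, labels):
    """Render a subject-predicate-object assertion as a sentence."""
    s, p, o = triple
    return f"{labels[s]} {labels[p]} {labels[o]}"

sentence = verbalize(assertion, labels)
```

Separating the identifiers from their human-readable labels mirrors how ontology-grounded knowledge-bases keep machine-usable primitives distinct from presentation.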

Standards
While ontologies provide the primitive elements from which a knowledge representation is constructed, they are agnostic about the mechanisms by which entities are assembled into assertions. In 2011, the World Wide Web Consortium promulgated a collection of international standards for linking entities with shared meaning into assertions and for managing collections of assertions, together referred to as the Semantic Web (7). The Semantic Web builds on the Resource Description Framework (RDF) standard (8), which provides a way to link three uniform resource identifiers (URIs) (9) to specify a relationship between a pair of entities (forming an RDF "triple"). Collections of triples form a graph, where the entities are nodes and the relationships are edges connecting them. A computational mechanism for managing such collections is called a triple store (10). The Semantic Web standards also define RDF Schema (RDFS) and the Web Ontology Language (OWL), which provide additional expressivity; SPARQL (SPARQL Protocol and RDF Query Language), which provides a query language for interrogating RDF graphs or triple stores; and the Simple Knowledge Organization System (SKOS), which provides a basic ontology.
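The triple and triple-store machinery can be illustrated with a minimal, purely illustrative sketch. A real system would use an RDF library and a SPARQL engine; here the `ex:` identifiers are invented, and a single basic graph pattern is matched directly:

```python
# A toy in-memory triple store with SPARQL-like triple-pattern matching.
# Identifiers with the "ex:" prefix are invented for illustration.
triples = {
    ("ex:TP53", "rdf:type", "ex:Protein"),
    ("ex:TP53", "ex:participatesIn", "ex:DNAStrandRenaturation"),
    ("ex:DNAStrandRenaturation", "rdf:type", "ex:BiologicalProcess"),
}

def match(pattern, store):
    """Yield variable bindings for one triple pattern; '?x' marks a variable."""
    for triple in store:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                binding[want] = have
            elif want != have:
                break  # constant mismatch: this triple cannot match
        else:
            yield binding

# analogous to: SELECT ?p ?proc WHERE { ?p ex:participatesIn ?proc }
results = list(match(("?p", "ex:participatesIn", "?proc"), triples))
```

The pattern-with-variables style is the core idea behind SPARQL basic graph patterns; production triple stores add indexing, joins over multiple patterns, and much more.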
The Web Ontology Language (OWL) (11) specifies two types of entities: instances and classes. Instances are particular entities or processes in the world (e.g., a particular molecule of TP53) and classes are groupings of instances that meet a defined set of individually necessary and collectively sufficient criteria (e.g., human TP53 proteins). As it lacks variables and quantification, OWL cannot express all logical statements about primitives; the subset of first-order logic that OWL can express is inspired by description logics. For reasons rooted in description logic theory, OWL ontologies are sometimes called TBoxes (terminologies), and collections of assertions ABoxes.

Uses of Knowledge Graphs
A major use of knowledge graphs (KGs) is simply to organize knowledge for information retrieval. Such systems are designed to make it possible to find facts or evidence regarding a wide variety of topics, ranging in this review from cataloging traditional Chinese medical practices to decision support for pharmacovigilance. KGs have also been used to improve other forms of information retrieval, such as finding relevant publications in the literature.
Computer science has produced many algorithms that operate on graphs, and therefore on KGs. One particular class of graph algorithm that is widely used in KBDS is edge (or link) prediction. Edge prediction methods generally use the structure of a graph to identify edges that are likely but missing in the graph. In KGs, these are predictions of unrepresented facts about the world. This is a form of hypothesis generation, and often includes an estimate of the confidence in the prediction. Many approaches to drug repurposing in this review use edge prediction algorithms over KGs of drugs and diseases to identify new indications. Another broad class of graph algorithms does community finding, or identification of groups of entities in a KG that are similar or highly related to each other. For example, some approaches to disease sub-phenotyping apply community-finding approaches to KGs encoding information about patients.
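As a concrete sketch of the simplest family of edge-prediction heuristics, the following toy example scores unlinked node pairs by their number of shared neighbors; the drug, gene, and disease names are invented:

```python
# Common-neighbor link prediction over a toy KG (invented entities).
from itertools import combinations

edges = {
    ("drugA", "geneX"), ("drugB", "geneX"),
    ("drugA", "diseaseY"), ("geneX", "diseaseY"),
}

adj = {}  # undirected adjacency map
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def predict_links(adj):
    """Score unlinked node pairs by their number of shared neighbors."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            shared = len(adj[u] & adj[v])
            if shared:
                scores[(u, v)] = shared
    return scores

scores = predict_links(adj)
# drugB and diseaseY share a neighbor (geneX): a candidate repurposing edge
```

Published methods use far richer signals (paths, node types, embeddings), but they share this shape: rank absent edges by graph-structural evidence and treat the top-ranked ones as hypotheses.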
Machine learning, particularly in the form of artificial neural networks, is also widely used in the KG context. One frequent application of neural networks to KGs is to create vector embeddings of entities or assertions by training auto-encoder networks with inputs constructed from the KG. These embeddings can then in turn be used to compute knowledge-based similarities, e.g. between drugs, proteins, and diseases. Neural network methods have also been used to identify parts of a KG relevant to answering an input question.
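As an illustrative stand-in for the auto-encoder approaches described above (and not any specific published method), the following sketch derives low-dimensional entity vectors by factorizing a toy KG's adjacency matrix with a truncated SVD, then compares entities by cosine similarity:

```python
# Toy KG embeddings via truncated SVD of the adjacency matrix -- a linear
# stand-in for auto-encoder embedding methods; entities are invented.
import numpy as np

entities = ["drugA", "drugB", "proteinX", "diseaseY"]
edges = [("drugA", "proteinX"), ("drugB", "proteinX"), ("proteinX", "diseaseY")]

idx = {e: i for i, e in enumerate(entities)}
A = np.zeros((len(entities), len(entities)))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0

U, S, _ = np.linalg.svd(A)
emb = U[:, :2] * S[:2]   # rank-2 embedding of each entity

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# entities with identical neighborhoods (both drugs link only to proteinX)
# receive (near-)identical embeddings
sim_drugs = cosine(emb[idx["drugA"]], emb[idx["drugB"]])
```

The point of the sketch is the pipeline, not the factorization: KG structure goes in, dense vectors come out, and knowledge-based similarity becomes a cheap vector operation.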
The Semantic Web OWL standard was designed to facilitate two important classes of reasoning over KGs: satisfiability and subsumption inference. It is possible for a KG to define a class that has no members (e.g., TP53 homologs in bacteria); satisfiability inference checks whether a class definition is logically satisfiable. Subsumption inference uses class definitions to identify all classes that are fully contained within some other class (e.g., all proteins are nitrogen-containing compounds). Specific reasoners, such as ELK (15) or HermiT (16), can be used to make these inferences with particular computational performance guarantees, which can be important in large KGs.
Subsumption inference in particular is useful in KBDS because it makes explicit many edges that are otherwise implicit in KGs, and therefore can improve the results of other algorithms that depend on the structure of the graph, such as link prediction or vector embeddings.
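The payoff of making implicit edges explicit can be sketched with a toy transitive-closure computation; a full OWL reasoner does far more than this, but the effect on graph structure is the same in spirit:

```python
# Materializing implicit subclass-of edges by transitive closure -- a tiny
# slice of what subsumption reasoners like ELK or HermiT provide.
asserted = {
    ("human TP53", "subclass of", "protein"),
    ("protein", "subclass of", "nitrogen-containing compound"),
}

def materialize(triples, relation="subclass of"):
    """Add every edge implied by transitivity of the given relation."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for s1, _, o1 in list(closed):
            for s2, _, o2 in list(closed):
                if o1 == s2 and (s1, relation, o2) not in closed:
                    closed.add((s1, relation, o2))
                    changed = True
    return closed

inferred = materialize(asserted) - asserted
# human TP53 -> nitrogen-containing compound is now an explicit edge
```

Downstream algorithms that count edges or neighbors (link prediction, embeddings) now see the inferred edge just like an asserted one.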

Known Challenges
Computational performance is a challenge in other areas as well. Biomedical knowledge is very extensive, and broad biomedical KGs can contain billions of assertions. A wide variety of schemes have been proposed to address the computational complexity of both querying and inference over KGs (17).
A few KGs (e.g., Gene Ontology annotations (18) or Reactome (20)) are constructed through painstaking and expensive manual curation efforts. However, several algorithmic approaches have been proposed to either augment these efforts or fully automate them. Automated approaches to KG construction fall into two broad classes: natural language processing (NLP) and data-driven. Data-driven KG construction can involve the integration of previously disparate resources, or the direct analysis of large-scale datasets.
NLP methods propose to extract information from a set of documents to create a KG, (e.g. SemMedDB (21)). As NLP methods are all imperfect, these approaches are often focused on assessing the reliability of the information extracted, or on techniques to manage missing or erroneous assertions and other sources of noise.
Some data-driven approaches simply transform existing databases (e.g., DrugBank (22)) into KG form, which can facilitate adherence to FAIR (Findable, Accessible, Interoperable, and Reusable) research principles (23). More frequently, data-driven KG construction integrates multiple sources of data into a single KG. If an integrated KG can ground the different sources to one set of primitives (ideally from a community-curated ontology), that facilitates inference over the combined information. As there are thousands of public, biomedically important databases (24), integration approaches that support semantic compatibility are important. Integration can also lead to improved data quality, as incompatibilities sometimes signal errors (25).
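A minimal sketch of grounding-based integration follows; the source records, facts, and ChEBI-style identifier are all invented for illustration. Mapping both sources to one shared identifier makes visible a conflict that neither source could detect alone:

```python
# Grounding two invented sources to one shared identifier (placeholder
# "CHEBI:XXXX") and flagging incompatible assertions for review.
source_a = {"acetylsalicylic acid": "CHEBI:XXXX"}   # name -> shared identifier
source_b = {"aspirin": "CHEBI:XXXX"}

facts_a = {("acetylsalicylic acid", "treats", "headache")}
facts_b = {("aspirin", "does_not_treat", "headache")}   # invented conflict

def ground(facts, mapping):
    """Rewrite subjects onto shared identifiers (objects kept as-is here)."""
    return {(mapping.get(s, s), p, o) for s, p, o in facts}

merged = ground(facts_a, source_a) | ground(facts_b, source_b)

# same grounded subject and object with different predicates -> flag for review
conflicts = {
    (s1, o1)
    for (s1, p1, o1) in merged
    for (s2, p2, o2) in merged
    if s1 == s2 and o1 == o2 and p1 != p2
}
```

Real integration pipelines ground objects and predicates too, and use curated mapping resources rather than hand-written dictionaries, but the quality-control effect is the same.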

RESULTS
The search phrases above returned 52 papers from PubMed and 7,752 papers from Google Scholar. These papers were manually reviewed to identify those focused on the use or construction of KGs within the biomedical domain, which resulted in a reduced set of 174 papers. This set was then further reduced to include only papers published or posted to public manuscript archives within the last year (January 2018-September 2019) whose full-text version was publicly available at the time of review, resulting in a final set of 83 papers reviewed here. The final set of papers was further broken down by year of publication, publication type, and publication venue (i.e., the journal or archive name). Among the 83 papers, 44 were published in 2018. The majority of papers were published in conference proceedings (n=39) or in peer-reviewed journals (n=25), with the remaining papers published as online preprints (n=19). Among these, the majority of the 2018 papers were submitted to peer-reviewed journals (64.0%), whereas most of the 2019 papers were submitted to arXiv (73.7%). The number of conference submissions increased slightly between 2018 and 2019 (56.4%); 2018 papers were primarily submitted to the Institute of Electrical and Electronics Engineers (n=11), whereas 2019 papers were submitted to the Association for Computing Machinery (n=4) and the Association for the Advancement of Artificial Intelligence (n=2).
Information about each paper included in the final set is presented in Tables 1 and 2 (with more detailed tables in supplementary materials), and broad themes spanning multiple papers are described below.

Organization and Presentation of Findings
These publications fall into two broad categories, which we use to organize the review: application of KGs (n=53) versus production of KGs (n=30). Applications are noted in a wide variety of biomedical research domains, ranging from analysis of genomic data to clinical decision support. There is also a close relationship between KGs and biomedical NLP: KGs can be used to improve the quality of NLP, and NLP can be used to generate KGs from the literature. We conclude by considering some nascent projects likely to be important in the near future, characterizing current barriers to building and using biomedical KGs, and making some recommendations. Table 1 provides a high-level summary of the reviewed papers which applied KGs to help solve a biomedical data science problem.

Clinical Applications
There were three primary themes identified within this domain: the use of KGs to improve the retrieval of information from the literature or from large sources of clinical data (26-30); the use of KGs to provide confidence, either by adding evidence to support phenomena observed in data (31-35) or by completing missing information and deriving new hypotheses (36-40); and the use of KGs to improve the representation of complicated patient data or personal health information (41,42) or the presentation of complex information or results (43-45).
KGs can be used to refine user queries and otherwise improve information retrieval from the literature or from an EHR system. One study demonstrated that using KGs together with traditional rule-based approaches for information retrieval performed better than using either approach alone (26). Liu et al. (27) also applied KGs to improve retrieval, and another study (45) introduced a novel approach to creating custom systematic literature reviews by formulating the review as a biomedical KG that contains information relevant to specific hypotheses provided by a user.
Node and edge embeddings provide a powerful method to suggest relationships among entities via similarity functions, in ways that complement path traversal through the graph.
Hypotheses inspired by semantic similarity include drug-drug (42,49,53), drug-target (53), and protein-protein interactions (48,50), many of which were in turn applied to drug repurposing. KG-based embeddings into low-dimensional spaces were also used to visualize clusters in two- or three-dimensional projections (43) to better display entities of interest.
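This kind of similarity-driven hypothesis generation can be sketched as follows; the vectors here are hand-made toys standing in for learned KG embeddings, and the drug names are invented:

```python
# Ranking candidate drug-drug interaction hypotheses by embedding similarity.
# The vectors are hand-made toys standing in for learned KG embeddings.
import numpy as np

embeddings = {
    "drugA": np.array([1.0, 0.1]),
    "drugB": np.array([0.9, 0.2]),
    "drugC": np.array([-0.2, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = [("drugA", "drugB"), ("drugA", "drugC"), ("drugB", "drugC")]
ranked = sorted(
    pairs,
    key=lambda p: cosine(embeddings[p[0]], embeddings[p[1]]),
    reverse=True,
)
# the most similar pair is the strongest interaction hypothesis
```

In practice the candidate list is enormous, so approximate nearest-neighbor search replaces the exhaustive pairwise ranking shown here.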
In a particularly innovative approach, Tripodi et al. (40) combined gene expression time series and KG embeddings from a Human-centric KG (61) to create specific and detailed hypotheses regarding mechanisms of toxicity. The KG subgraphs that made up the hypothesized mechanisms were far richer than black-box toxicity predictions and were also used to generate natural language narratives describing the mechanisms and the evidence for them.

Natural Language Processing Applications
KGs have been used to improve NLP performance in a wide variety of genres, including summarization or information extraction from EHRs and answering medical questions (17,28,29,33,42,62,63). KG-derived embeddings used alone, or in combination with text-derived features (48) improved performance of a variety of NLP tasks, including named-entity recognition (64), coreference resolution (65) and relation extraction (66).
Several applications demonstrate the utility of KGs in information extraction methods.
Ontologies can serve as formal dictionaries allowing for rapid indexing in named entity recognition and word-sense disambiguation tasks (67,68). Compared to lexicons, KGs offer far richer semantic context, identifying not only similar concepts, but a rich collection of relationships that can be used to disambiguate or otherwise improve concept recognition in texts (67-70).
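The dictionary-style use of an ontology can be sketched as greedy longest-match lookup of ontology labels in text; the labels and identifiers below are illustrative placeholders:

```python
# Greedy longest-match concept recognition against an ontology-derived
# lexicon. Labels and identifiers are illustrative placeholders.
lexicon = {
    "tp53": "PR:tp53_id",
    "apoptosis": "GO:apoptosis_id",
    "dna strand renaturation": "GO:renaturation_id",
}

def recognize(text, lexicon):
    """Return (matched span, concept id) pairs, preferring longer labels."""
    tokens = text.lower().split()
    max_len = max(len(label.split()) for label in lexicon)
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in lexicon:
                matches.append((span, lexicon[span]))
                i += n
                break
        else:
            i += 1  # no label starts at this token
    return matches

hits = recognize("TP53 participates in DNA strand renaturation", lexicon)
```

Preferring the longest label is what keeps "dna strand renaturation" from being fragmented; KG-aware systems go further, using relationships among the matched concepts to disambiguate.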

Constructing Knowledge Graphs
As the applications of KGs are many and varied, and their construction is demanding, substantial effort was reported on methods and new results for constructing KGs, as well as for extending, integrating, and evaluating them. Table 2 provides a high-level summary of these publications.
Last year saw the announcement of a new approach to Gene Ontology (GO) annotation called Causal Activity Models, or GO-CAMs (19). Although GO annotations are perhaps the most widely used knowledge representations in biomedical research, until GO-CAMs were introduced, the annotations could not be assembled into a coherent knowledge graph. While individual annotations implicitly linked GO classes to gene products, contextual information was lost; for example, the annotation process could not capture that cytochrome C participates in apoptosis only when it is in the cytoplasm. GO-CAM models and associated tooling are gradually replacing the traditional GO annotation process within the Alliance of Genome Resources, meaning future GO annotation will produce an increasingly rich, manually curated KG.
Other efforts to produce domain-specific KGs were published in a variety of areas, including biodiversity (71-73), the microbiome (74), and the enrichment of clinical data (75,76). The articles on biodiversity focused specifically on how a KG could be created and linked to identifiers in the literature (71,72) or in other important biodiversity resources (73). In contrast, papers using KGs for clinical enrichment aimed to use KGs as a way to link clinical data to sources of evidence that support clinical observations (76-78) or to help make the data more interpretable with respect to underlying biological mechanism(s) (75) for improved diagnosis (79).
Historically, NLP information extraction efforts have often been used to construct KGs; two novel methods to do so were published last year. One proposed a minimum-supervision approach that combined a traditional NLP pipeline for information extraction with the use of biomedical context embeddings (80). The other focused on improving the extraction of biomedical facts from the literature by leveraging and refining specific seed patterns (81).
Although not a construction method per se, one paper (82) presented an evaluation of one of the most widely used NLP-constructed KGs, SemMedDB. A large number of contradictory assertions were found in a variety of fundamental relationship categories, underscoring the need for caution regarding noise in NLP-derived KGs.
Finally, an ontology called BioKNO and a set of associated OWL-based tools were presented to assist scientists attempting to share data according to FAIR principles (83).

Organizational efforts in knowledge-based biomedical data science
Both US and European scientific institutions support KG efforts. Perhaps the most ambitious of these is the National Institutes of Health's National Center for Advancing Translational Science's Biomedical Data Translator project (84). The goal of the Translator is a computational system that integrates sources of existing biomedical knowledge in order to translate clinical inquiries into relevant biomedical research results, synthesizing elements of the integrated knowledge to directly answer the inquiry or generate testable hypotheses (85). A recent funding call targets $13.5 million per year for up to five years towards the construction of "Knowledge Providers" and "Autonomous Relay Agents." Knowledge providers are systems that seek out, integrate, and provide high-value data sources within a specific scope of Translator-relevant knowledge, and presumably would primarily use KGs to do so. Relay agents are to take clinical queries in a standardized format, dispatch subtasks to appropriate knowledge providers, receive responses back from knowledge sources (presumably also as subgraphs of a KG), and process responses using scoring metrics in order to return the most relevant and highest-quality potential responses.
Elixir Europe (86) is a large multinational (and European Commission) project with the goal of managing and safeguarding the data generated by publicly funded life science research and integrating bioinformatics resources. In pursuit of those goals, Elixir's interoperability platform promotes efforts in the European life science community to adopt standardized file formats, metadata, vocabulary and identifiers, including efforts in Semantic Web and adoption of community-curated ontologies. The Elixir Core Data Resources (87) are leaders in the production of interoperable knowledge resources, and are widely used components of biomedical KGs.

Knowledge Graph Embeddings
A very active area of research is using KGs to create knowledge-based vector embeddings. Good vector embeddings are important to the performance of machine learning systems, and therefore have wide applicability. KGs have been used to create embeddings for entities of many kinds (ranging from genes to patients), as well as for relations, assertions, and more complex representations. Applications of these embeddings include prediction of drug-drug interactions, prediction of drug-target interactions, target discovery, finding clinically relevant evidence, and more. In addition to reusing embeddings from the surveyed papers, we recommend considering the tools described in the BioKEEN paper, which describes a Python-based library for training and tuning models to produce new knowledge-based embeddings.

Natural Language Processing-based Knowledge Graphs
A major theme in the literature is the use of text mining and natural language processing techniques to generate KGs. While this approach offers the potential for breadth missing from most manually curated KGs, it comes at the cost of a large proportion of errors. The Cong et al. (82) paper evaluated SemMedDB, a widely used KG produced by the US National Library of Medicine, and found nearly half a million inconsistent assertions, as well as a wide variety of apparently missing relationships. While the authors suggested methods that could be used to improve the quality of SemMedDB, our recommendation is to recognize that NLP-generated KGs are likely to be very noisy and need to be used with caution.
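One simple check in the spirit of such an evaluation can be sketched as follows; the predicate names and triples are invented for illustration:

```python
# Flagging entity pairs asserted with mutually contradictory predicates.
# Predicate names and triples are invented for illustration.
CONTRADICTORY = {("STIMULATES", "INHIBITS"), ("CAUSES", "PREVENTS")}

extracted = {
    ("geneA", "STIMULATES", "processP"),
    ("geneA", "INHIBITS", "processP"),
    ("drugD", "CAUSES", "diseaseZ"),
}

def find_contradictions(triples, contradictory):
    """Group predicates by (subject, object) and flag contradictory pairs."""
    by_pair = {}
    for s, p, o in triples:
        by_pair.setdefault((s, o), set()).add(p)
    return {
        pair
        for pair, preds in by_pair.items()
        for p1, p2 in contradictory
        if p1 in preds and p2 in preds
    }

suspect = find_contradictions(extracted, CONTRADICTORY)
```

Note that a flagged pair is only a review candidate: some apparent contradictions reflect genuine context-dependent biology rather than extraction error.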

Barriers and Future Work
Barriers
Current barriers to constructing and using KGs include everything from KG and data availability and data licensing issues (sometimes there are different licenses for each data source), to a lack of agreed-upon standards for constructing KGs, to dependency on resources (i.e., software languages or applications) that may be obsolete, deprecated, or outside the user's skill set or area of expertise.
The construction of KGs is a difficult task, so re-use of existing resources is desirable whenever possible. However, there are challenges in applying each of the existing KGs (Table 3) to new tasks. While GO-CAMs have great promise, as of this writing only a relatively small number of them have been curated. Reactome provides a very high-quality and extensive KG grounded in community-curated ontologies, but it is limited in scope to biochemical reactions and pathways. Other manually annotated resources (such as GO annotations) cannot be straightforwardly assembled into KGs amenable to OWL reasoners. Data-integration-derived KGs such as KaBOB are both broad and grounded in community-curated ontologies, but licensing restrictions mean that users have to download software and build the KG themselves, which requires expertise and computational resources. Hetionet is broad, but not grounded in community-curated ontologies, which makes integrating it with other resources difficult. Bio2RDF is a mashup of many different kinds of data without a consistent set of primitives, allowing inconsistencies in the represented knowledge. Finally, NLP-derived systems such as SemMedDB are noisy, making trustworthiness an issue.
Automatic KG construction from literature sources is usually framed as a relation extraction problem, where semantic triples are inferred from text and then assembled into a KG. The "correctness" of this approach to KG construction can be evaluated either before the KG is constructed, by evaluating the relation extraction process itself (the more traditional approach), or afterwards, by evaluating the quality of the resulting KG. Evaluating the constructed KG allows for the use of the reasoning methods described above.
The lack of standards for constructing KGs within the biomedical domain may be one of the reasons why they are challenging to evaluate. Of the reviewed articles on constructing KGs that evaluated their KGs (n=26), four provided qualitative evaluation (e.g., case studies or domain expert review of results, conceptual models, or prototypes, and focus groups), five provided quantitative evaluation (e.g., most often by applying a machine learning model to a specific set of held out data or to a new dataset or by performing a KG completion task like edge prediction), and 17 provided both types of evaluation. Of note, one of the articles that provided both types of evaluation utilized crowdsourcing as a means to validate triples from their KG (78).

Future Work
The trends we observed in the last year's work are likely to continue. Applications of KGs will likely continue to involve the generation of embeddings and other uses of KGs in machine learning. The close relationship between the development of NLP methods and KGs is likely to persist. The expansion of KGs to areas beyond molecular biology (e.g., biodiversity and Traditional Chinese Medicine this year) is also likely to continue. Some previous areas of research, e.g., KG-based enrichment analysis for gene sets, that did not see new results this year may also continue to be fruitful.
New methods applying KGs to analyze different sorts of experimental data (e.g., images) seem ripe for development. Robust and biologically meaningful ways to incorporate experimental data into biomedical KGs would help improve the precision of predictions used to generate novel hypotheses, and would provide a means of interpreting experimental results. Similarly, for clinical KGs, it will be important to find clinically meaningful ways to incorporate quantitative measures (e.g., laboratory test results and biomarker measurements) and outcomes from EHR data.