Cheminformatics Research at the Unilever Centre for Molecular Science Informatics Cambridge

The Centre for Molecular Informatics, formerly Unilever Centre for Molecular Science Informatics (UCMSI), at the University of Cambridge is a world-leading driving force in the field of cheminformatics. Since its opening in 2000 more than 300 scientific articles have fundamentally changed the field of molecular informatics. The Centre has been a key player in promoting open chemical data and semantic access. Though mainly focussing on basic research, close collaborations with industrial partners ensured real world feedback and access to high quality molecular data. A variety of tools and standard protocols have been developed and are ubiquitous in the daily practice of cheminformatics. Here, we present a retrospective of cheminformatics research performed at the UCMSI, thereby highlighting historical and recent trends in the field as well as indicating future directions.


1 Introduction
In December 2000 the Unilever Centre for Molecular Science Informatics (UCMSI) was opened at the Department of Chemistry of the University of Cambridge. Based on an investment by the industrial partner Unilever, a new world-leading research group in the emerging field of molecular informatics was established. The investment included a new building, an established chair in Molecular Science Informatics and three lectureships as well as set-up costs (equipment, networking, and software). The Unilever research grants were renewed in year five and year ten. In addition, over the period of the UCMSI's existence, significant additional grants from a variety of industrial, charitable and research council sources were obtained to support the objectives of the UCMSI.
The research centre is located in central Cambridge, thus profiting from a stimulating and exciting research environment. Daily interactions with several other local institutes at the University of Cambridge, the EMBL European Bioinformatics Institute (EBI), and the Cambridge Crystallographic Data Centre (CCDC) create a world-class research cluster. Furthermore, several major industrial partners are located in close proximity in Cambridge's science parks and on corporate research sites.
Collaborations with more than twenty industrial partners, and especially Unilever, allowed access to high quality data sets and formed the basis for state-of-the-art computational modelling. [1] Further industrial partners have included Boehringer Ingelheim, AstraZeneca, BASF, Pfizer, Johnson & Johnson, GSK, Aboca and Eli Lilly, to name but a few. Additional third party funding was attracted at the national level from the UK Engineering and Physical Sciences Research Council, the Medical Research Council, the Wellcome Trust, and the Biotechnology and Biological Sciences Research Council, as well as the National Institutes of Health in the USA. At the European level, major grants from the European Chemical Industry Council (CEFIC), the European Research Council (ERC), and the Framework-7 programme were successfully obtained, funding a variety of international informatics research projects.
The research at the UCMSI covers broad areas of cheminformatics, which, coupled with experiments and collaborations with industrial partners in bringing products to the market, ensures real life feedback in several interdisciplinary research efforts. A main goal of the currently 40 scientists in the UCMSI is the integration of chemistry, biology and materials science through the development and application of molecular informatics. Robert Glen has directed the UCMSI since its opening. He heads an interdisciplinary research group using a broad set of computational methodologies to tackle basic scientific questions in the general area of molecular biosystems. Four additional research groups focus on relevant areas of cheminformatics research at the UCMSI. The group of Jonathan Goodman focuses on synthesis, computation and informatics and applies cheminformatics to tackle questions in chemical reactivity and catalysis. Peter Murray-Rust's group focuses on semantic web technologies and develops methodologies and software for intelligent storage and retrieval of chemical data. Andreas Bender's group drives the integration of new large scale data sources (gene expression, biological networks, phylogenetics) in the field of cheminformatics. Recently, Lucy Colwell has established a new group at the Centre; her research aims to identify structural features within large datasets using advanced statistical and data analytics methodologies. Two former group leaders have recently departed from the UCMSI to set up significant research groups. Peter J. Bond recently moved on to a principal investigator position at the Bioinformatics Institute (BII), A*STAR, in Singapore and continues his research efforts on multiscale modelling and large scale simulations. Former group leader John Mitchell has moved to the University of St. Andrews as a Reader and continues broad research in cheminformatics from quantum chemistry to molecular simulation technologies. Other former group leaders include David Lary, Guy Grant, Dmitry Nerukh, Maxim Federov and Hamse Mussa.
Over the years of the Centre's existence 67 PhDs have been trained and 36 postdoctoral research associates performed high quality research. The UCMSI hosted 50 conferences/workshops and organized over 200 seminars. Scientists at the UCMSI have won several awards and prizes including the RSC Bader Award to Jonathan Goodman, the Hansch Award to Andreas Bender, and the Novartis Chemistry Lectureship to Robert Glen. In this article we describe the developments in cheminformatics research performed at the UCMSI, aiming to identify challenges and opportunities for the field in the future. We retrieved all entries related to the UCMSI in all Web of Science databases. [2] To this end, we used the Web of Science web interface and extracted entries with at least one author's address containing all of the words "Cambridge", "University", "Molecular", and "Unilever". Furthermore, we discarded entries in the database earlier than the year 2001, to remove three false positive hits from earlier years. Unfortunately, some entries (e.g. [3,4]) are discarded as false negatives at this stage, as different abbreviations in author affiliations have been used. Citing articles and total citations were extracted for all articles and analyzed in terms of self- and non-self-citations according to the tools provided in the Web of Science online mask. The Hirsch index (h-index) [5] was calculated to estimate the total impact of the science performed at the UCMSI.
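The selection rule and the h-index described above can be illustrated with a minimal Python sketch. It is not the procedure actually run in the Web of Science interface: the `Record` structure, the case-insensitive substring matching, and all example records are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List

# Words that must all occur in a single author address (taken from the text).
REQUIRED_WORDS = ("cambridge", "university", "molecular", "unilever")
FIRST_YEAR = 2001  # entries earlier than this year are discarded


@dataclass
class Record:
    """One bibliographic entry (hypothetical structure, not a Web of Science schema)."""
    year: int
    addresses: List[str]  # one string per author address
    citations: int


def is_centre_record(record: Record) -> bool:
    """True if the year qualifies and at least one address contains all required words."""
    if record.year < FIRST_YEAR:
        return False
    return any(
        all(word in address.lower() for word in REQUIRED_WORDS)
        for address in record.addresses
    )


def h_index(citation_counts: List[int]) -> int:
    """Largest h such that h articles have at least h citations each."""
    h = 0
    ranked = sorted(citation_counts, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


FULL_ADDRESS = "University of Cambridge, Unilever Centre for Molecular Science Informatics"

# Made-up records for illustration only.
records = [
    Record(2000, [FULL_ADDRESS], 50),                               # too early: discarded
    Record(2003, ["Univ Cambridge, Unilever Ctr Mol Informat"], 30),  # abbreviated: missed
    Record(2005, [FULL_ADDRESS], 10),
    Record(2008, ["University of Oxford", FULL_ADDRESS], 3),
    Record(2010, [FULL_ADDRESS], 2),
    Record(2012, [FULL_ADDRESS], 1),
]

selected = [r for r in records if is_centre_record(r)]
print(len(selected), h_index([r.citations for r in selected]))
```

The second example record shows how an abbreviated affiliation slips through such a word filter, which is the kind of false negative mentioned above.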

Analysis of Publications Using Word Clouds
Furthermore, we extracted information on author names and abstracts (where available) from all publications. We identified the leading scientists and their central research topics using word cloud representations ("wordles") generated using WordItOut. [6] We used author surnames only to ensure consistency between different data sources. Full abstracts were used for the generation of topic-related word clouds. To allow for identification of trends over time we split the complete data sets for authors and abstracts into sections of years. Thereby, we analyzed author and thematic contributions for

3 Results
We identified in total 325 published items of the UCMSI in the Web of Science database. After an initial growth phase, the article output has reached a stable plateau of between 20 and 40 published items per year (see Figure 1A). The observed decrease for the year 2014 arises from incomplete data for the current year. Linear extrapolation to the complete year increases the number of published articles listed in Web of Science to the average level of approximately 20. Due to lagging indexing in Web of Science, linear extrapolation is expected to underestimate values for the year 2014. When setting aside the incomplete year 2014, the Pearson correlation between years and published items is 0.78, indicating a steadily growing output of research performed at the Centre.
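The two statistics used here, the Pearson correlation between year and output and the linear extrapolation of a partial year, can be sketched as follows. The yearly counts below are invented placeholders, not the values behind Figure 1A, and scaling a partial-year count by the fraction of the year elapsed is an assumed reading of "linear extrapolation".

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def extrapolate_to_full_year(count_so_far: float, months_elapsed: float) -> float:
    """Scale a partial-year count linearly to twelve months (assumed interpretation)."""
    return count_so_far * 12 / months_elapsed


# Hypothetical per-year item counts for 2001-2013 (illustration only).
years = list(range(2001, 2014))
items = [2, 8, 15, 22, 25, 30, 28, 35, 27, 33, 24, 38, 31]

print(round(pearson(years, items), 2))
print(extrapolate_to_full_year(10, 6))
```

Because indexing lags behind publication, a count scaled in this way would tend to underestimate the final figure for the incomplete year, as noted above.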
In addition to the number of published items, citations of articles published by the Centre were counted. Soon after publication of the first two articles in 2001, the first citations are recorded (see Figure 1B). The number of citations continues to grow steadily over the years and breaks the barrier of 100 citations in the year 2005. Results are shown on a per-year basis, not as a cumulative count. Nevertheless, an almost linear increase of citations, with a Pearson correlation coefficient of 0.97 between year and citations, is observed when excluding the incomplete year 2014. With 988 citations in the year 2013 alone, research from the UCMSI is referenced almost three times per day. Linear extrapolation for the year 2014 indicates that the barrier of 1000 citations is likely to be broken for the first time.
In total 5508 citations of the 325 articles were recorded via Web of Science. Only nine percent of these represent self-citations (506 citations), thus reflecting a broad audience and considerable impact within the cheminformatics community. A total of 4109 unique indexed articles reference scientific reports from the UCMSI, with less than five percent of them being self-citing (177 articles). On average, articles from the Centre are cited 16.95 times.
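The derived figures quoted above follow directly from the reported counts. The short check below uses only numbers stated in the text; the variable names are chosen here for illustration.

```python
# Counts as reported in the text.
total_citations = 5508
articles = 325
self_citations = 506
citing_articles = 4109
self_citing_articles = 177
citations_2013 = 988

mean_citations = total_citations / articles               # citations per article
self_citation_share = self_citations / total_citations    # fraction of self-citations
self_citing_share = self_citing_articles / citing_articles
citations_per_day_2013 = citations_2013 / 365

print(f"mean citations per article: {mean_citations:.2f}")
print(f"self-citation share:        {self_citation_share:.1%}")
print(f"self-citing article share:  {self_citing_share:.1%}")
print(f"citations per day in 2013:  {citations_per_day_2013:.2f}")
```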
By creating word clouds from author lists of all published items we identified key scientists at the UCMSI in Cambridge (see Figure 2A). Robert C. Glen has published 83 items over the years, closely followed by the group leaders Peter Murray-Rust, Andreas Bender, and Jonathan Goodman, each contributing between 60 and 70 articles. Splitting author contributions according to intervals of years allows the development of the UCMSI to be followed. In earlier years, lecturer John Mitchell was a highly active researcher at the UCMSI, as was Henry Rzepa in collaboration with Peter Murray-Rust (see Figure 2B). In later years the name "Andreas Bender" emerges more and more, with a short break during his time at Novartis and the University of Leiden (see Figure 2C and Figure 2D). In recent years, the group of Peter Bond, focusing on simulations, added additional output to the UCMSI (see Figure 2E).
Wordles created on the basis of published abstracts identify key topics and methods applied at the UCMSI (see Figure 3A). "Data" is the most abundant single word and appears mostly in connection with the next most frequent words, "molecular" and "chemical". Chemical data thus form the foundation of all research performed at the Centre. "Model" and "models" appear prominently in the list, hinting at how chemical data are utilized in the generation of computational models for a variety of mostly chemical and biological properties. Splitting abstracts according to publication years does not reveal any consistency in the research efforts on-going at the UCMSI (see Figures 3B-3E). Contributions of "molecular" and "data" are constantly high. Some words linked to the observables predicted in established models appear transiently, e.g. "target" for recently established target prediction tools. Nevertheless, the low frequency of individual modelled molecular properties reflects the breadth of topics covered. Furthermore, structure-based modelling approaches [19] appear more prominently in recent years, as indicated by the increasing occurrence of the keywords "protein", "water", and "dynamics".
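The word counts underlying such word clouds can be approximated with a simple frequency table per interval of years. The sketch below is not the WordItOut procedure, whose internal filtering is not described here; the stop-word list, the tokenisation, the year intervals, and the example abstracts are all assumptions for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Small illustrative stop-word list (an assumption, not the one used by WordItOut).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def word_counts(texts: Iterable[str]) -> Counter:
    """Count lower-cased alphabetic words across texts, skipping stop words."""
    counts: Counter = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


def counts_by_interval(
    abstracts: List[Tuple[int, str]],
    intervals: List[Tuple[int, int]],
) -> Dict[Tuple[int, int], Counter]:
    """Word counts for each inclusive (start_year, end_year) interval."""
    return {
        (start, end): word_counts(text for year, text in abstracts if start <= year <= end)
        for start, end in intervals
    }


# Made-up abstracts for illustration only.
abstracts = [
    (2003, "Chemical data and molecular descriptors for models of solubility"),
    (2004, "Open chemical data in molecular informatics"),
    (2012, "Protein dynamics and water in molecular simulations of target binding"),
]

per_period = counts_by_interval(abstracts, [(2001, 2005), (2010, 2014)])
for period, counts in per_period.items():
    print(period, counts.most_common(3))
```

Comparing the most frequent words of each interval is one simple way to spot terms that are constantly present and terms that appear only transiently.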

4 Discussion
Analysis of citations has allowed the identification of key topics and methods applied at the UCMSI. Science is often centred around method development, which is reflected by several publications in leading journals in the field of cheminformatics and modelling. Additionally, application of the newly developed methods ensures real life feedback and has enabled publications in fields ranging from drug discovery to synthesis, and from materials science to electrochemistry. In the following paragraphs we select particular fields of research where major advances have been made at the Centre.

Linking Cheminformatics and Biology
Starting from innovative ways to encode and compare chemical environments in molecules, [7,8,10] several steps have been taken to link chemical with biological properties of molecules [20] based on statistical modelling techniques. [21,22] Thereby, the fields of cheminformatics and bioinformatics increasingly overlap and even fuse, leading to approaches like proteochemometric modelling (PCM). [23,24] With the establishment of target prediction tools based on statistical modelling, [25,26] novel approaches to cluster molecules using predicted biological effect can be implemented. [27] Furthermore, these novel methodologies proved to be helpful in multi-target drug design. [28] Recent directions in the area comprise the inclusion of further biological data sources, e.g. from biological networks and phylogenetics [29,30] as well as gene expression data. [31] Several successful applications of target prediction algorithms [32][33][34] underline the increasing accuracy of such data-driven approaches and point towards a bright future given the increase in available data sources. [35,36] In a recent success story cheminformatic tools were applied to identify potential targets of a series of synthetic biscoumarins showing anti-cancer activity in vitro and in vivo. [37] After in-depth computational characterization of predicted protein-ligand interactions (see Figure 4A), the in silico predicted protein target tumour necrosis factor α was verified by experimental techniques. An emerging future direction in this field is the prediction of biological effects of compound combinations, which is financed via an ERC Starting Grant to Andreas Bender. Further stimuli in this area are expected from statistical analyses of genomic sequence data. [38,39]

Data Semantics and Accessibility
Growth of the available ("Big") data brings new challenges to the field of cheminformatics. Millions of chemical substances have been characterized, many hundreds of thousands of three-dimensional structures have been deposited in databases, and associated biological data are spread over millions of publications and patents. [40] Therefore, the development of standards for chemical identification, indexing and storage is crucial for future successful data retrieval. Usage of the IUPAC standard International Chemical Identifiers (InChIs) allows the encoding of chemical information using a unique text descriptor, e.g. for database searches. [41] Based on InChIs several tools have been developed, e.g. to convert structures to chemical names in a fully automated way [42] or for chemistry-aware text mining. [43] Recently, InChIs have been extended to RInChIs to depict chemical reactions unambiguously. [44] Chemical Markup Language (CML) introduces and specifies particular data fields to efficiently store and retrieve chemical data. [45] Based on CML the World-Wide Molecular Matrix (WWMM) has been introduced to collect and connect chemical information from various sources. [46] Chem4Word has been created in a collaboration between Microsoft Research and the UCMSI to facilitate the handling of chemical information within text processing software. To date, the plug-in has been downloaded more than 400 000 times. Over the last decade, the UCMSI has been a key player in setting standards for open data and data quality in chemistry. [47][48][49] In recent years, technologies for data semantics and natural language processing have been employed in attempts to directly extract the scientific context of published chemical data. [50]
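To illustrate why a canonical text identifier is convenient for database searches, the following sketch splits an InChI string into its layers and uses the full string as a lookup key. It is a deliberately simplified illustration and not one of the tools cited above: `split_inchi` and the tiny in-memory `database` are hypothetical, special cases such as strings without a formula layer are ignored, and generating InChIs from structures requires the official InChI software, which is not shown.

```python
from typing import Dict, List, Tuple

# Standard InChI of ethanol, used as the worked example.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


def split_inchi(inchi: str) -> Dict[str, object]:
    """Split an InChI string into version, formula and prefixed layers (simplified)."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *parts = inchi[len(prefix):].split("/")
    formula = parts[0] if parts else ""
    # Remaining layers start with a one-letter prefix, e.g. 'c' (connections), 'h' (hydrogens).
    layers: List[Tuple[str, str]] = [(part[0], part[1:]) for part in parts[1:] if part]
    return {"version": version, "formula": formula, "layers": layers}


# Hypothetical in-memory "database" keyed by the identifier string.
database = {ETHANOL: {"name": "ethanol"}}

parsed = split_inchi(ETHANOL)
print(parsed["version"], parsed["formula"], parsed["layers"])
print(database[ETHANOL]["name"])
```

Because the same compound always yields the same string, an exact string match is sufficient for the lookup, with no structure comparison needed at query time.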

Modelling of Physicochemical Properties and ADMET Parameters
The increase in data sources over recent years has allowed the establishment of more accurate computational models, even for complex biological phenomena, e.g. absorption, distribution, metabolism, excretion and toxicity (ADMET) of xenobiotics. [51] Dozens of computational methods for prediction of metabolic reactions on different levels of complexity have been published in the literature. [16] Thorough classification of annotated biotransformations [52] facilitated the development of novel predictive models for general metabolic reactivity (Metaprint2D), [53,54] P-glycoprotein transport, [55] solubility, [56] and cytochrome P450-catalyzed metabolic reactions. [57][58][59] Thereby, innovative computational methods making use of recent advances in graphics processing unit (GPU) based computing have been employed. Using these approaches, new levels of throughput and thus modelling accuracy are within reach. Furthermore, the UCMSI ran a solubility competition (with over 100 entries) for the cheminformatics community and provided high quality experimental data to facilitate further method developments in the area. [60][61][62]

Bioactive Compound Discovery
The UCMSI has been a driving force of computer-aided drug design over the past decades. In addition to support of external drug design efforts, local lead discovery projects have been guided by computational technologies. Novel ligands for the apelin receptor, a G-protein coupled receptor (GPCR), have been successfully identified and biologically characterized. [63] Based on modelled structures of the apelin receptor (see Figure 4B), optimization of the peptide-derived compounds is on-going and shows enormous potential for further development; recently the first human study of apelin biased agonists has been completed in Addenbrooke's Hospital in Cambridge. Small molecule antagonists of the 5-HT1B GPCR have been identified and optimized for development as a potential treatment for pulmonary hypertension. [64] Several newly designed and synthesized compounds are currently in clinical studies in Addenbrooke's Hospital, Cambridge. In another compound discovery effort large scale simulation approaches (see Figure 4C) have been used to identify covalent binders of the endonuclease domain of IRE-1. [65,66] In a follow-up study the selectivity of the Schiff base forming compounds has been investigated by advanced computational techniques. [67] Research at the UCMSI also led to joint patents with Unilever on the CB1/2 receptors and NCKX ion channels showing benefits in skin for use in home and personal care products. To assist drug discovery several analyses of molecular diversity in the context of diversity-oriented chemical synthesis have been performed in collaboration with experimental groups. [68][69][70]
Figure 4 (caption, partially recovered): The predicted binding mode of the apelin-13 peptide and surrounding residues are highlighted in the included zoom on the binding cavity. C) Illustration of the simulation approach followed to investigate selectivity of Schiff base forming covalent IRE-1 inhibitors. IRE-1 is shown as cartoon with the reactants Lys-907 and inhibitor 4m8C highlighted as sticks.
In addition to these four areas the UCMSI has been particularly active in the development of innovative simulation and analysis methodologies [71,72] and cheminformatic support for organic synthesis. [73,74]
Industrial partners emphasize the productivity of collaborative research efforts with the Centre for Molecular Informatics. Ola Engkvist, team leader of Computational Chemistry at AstraZeneca and involved in several joint research projects, highlights: "The collaboration with the UCMSI provides AstraZeneca with the opportunity to work closely with one of the world leading groups in cheminformatics. The combination of the UCMSI's outstanding scientific knowledge with AstraZeneca's industrial experience provides a platform for state-of-the-art research in cheminformatics. The proximity to one of the AstraZeneca science hubs is an additional plus that facilitates smooth collaborations." Jim Crilly, senior vice-president of the Strategic Science Group at Unilever, adds: "Unilever is very proud to have instigated in partnership with the University of Cambridge the Centre for Molecular Informatics under the Leadership of Professor Robert Glen and pleased that it has developed into a global centre of excellence in a hugely important field of research. Collaboration with the centre has brought a new way of working into our own research endeavours and accelerated our discovery process."
Following a clear mission statement published in 2002, [75] the UCMSI is developing tools and standards in molecular informatics. With the increase in computer power and data accessibility, molecular modelling allows chemical and biological effects of increasing complexity and size to be captured. Available data sources range from chemistry and related bioactivity (PubChem, [76] ChEMBL, [35] DrugBank [77]) via protein sequence (UniProt [78]) and structure (Protein Data Bank, [79] Pfam [80]), to cellular pathways (KEGG [81]) and responses (LINCS [82]).
A collection of online molecular biology databases has recently been published with the database issue of Nucleic Acids Research. [83] With the advent of microsecond simulations of biological macromolecules [84] and sophisticated enhanced sampling methods, [85] dynamic processes underlying macromolecular recognition can be modelled with a new level of accuracy. [86] Protein-ligand binding processes may be studied at atomistic resolution; reports include full sampling of binding and unbinding of both fragments [87] and small molecules. [88] With an increasingly accurate description of protein dynamics, its role in biomolecular recognition processes can be studied, thereby allowing pharmaceutically relevant properties like binding specificity [89] or binding kinetics, [90] including allosteric mechanisms, [91] to be probed. On the other hand, integration of innovative data sources from emerging "omics" fields such as proteomics, [92] metabonomics/metabolomics, [93] or lipidomics [94] will allow the capture of novel biological properties which have limited or no direct chemical leads to their mechanism and function.

5 Conclusions
The data presented underline the central role in the cheminformatics world that is occupied by the UCMSI. The research institute sets world-wide standards and publishes technologies and software that are widely used and cited. The broad range of topics covered at the UCMSI ensures its broad impact, ranging from basic cheminformatics, data mining and machine learning techniques to complex simulations, chemical reactivity and synthesis. The connection of all these areas in a single research centre offers unique opportunities for the scientists involved, for industrial partners, and for the whole cheminformatics community.