Annotating Human P-Glycoprotein Bioassay Data

Abstract: Huge amounts of small-compound bioactivity data have been entering the public domain as a consequence of open innovation initiatives. It is now time to carefully analyse existing bioassay data and give it a systematic structure. Our study aims to annotate prominent in vitro assays used for the determination of bioactivities of human P-glycoprotein inhibitors and substrates as they are represented in the ChEMBL and TP-search open source databases. Furthermore, we explore whether data determined in different assays can be combined with each other. The results suggest that, for inhibitors of human P-glycoprotein, data coming from the same assay type can be combined if the cell lines used are also identical and the fluorescent or radiolabeled substrates have overlapping binding sites. In addition, the study demonstrates the need for larger, chemically diverse datasets that have been measured in a panel of different assays. Such datasets would greatly facilitate the search for other inter-correlations between bioactivity data yielded by different assay setups.

P-glycoprotein fulfils an important dual role in the process of drug discovery and development: it is recognised in its own right as a drug target but is also dreaded because of its antitarget character. Since the discovery of its role as a drug efflux pump in multidrug resistant tumour cells in 1976, [1] there has been an ongoing search for potent and selective inhibitors of P-glycoprotein, which might be used for the treatment of multidrug resistance (MDR), one of the major obstacles to successful cancer chemotherapy. [2] The antitarget property of P-glycoprotein is likewise significant, especially with respect to the pharmacokinetics of compounds that are substrates of this xenobiotic efflux pump. This is of special importance at the blood-brain barrier, in the gastro-intestinal tract, and in the liver. [3] Inhibitors of P-glycoprotein can affect the disposition of co-administered substrates; thus, Drug-Drug Interaction (DDI) studies need to be performed if the compound being clinically developed is known to interact with P-glycoprotein. [4] It is therefore of key importance to be able to study and understand the interaction of compounds with P-glycoprotein by both in silico and in vitro methods at an early phase of the drug discovery process.
On a molecular basis, P-glycoprotein is well known for its promiscuous ligand recognition pattern, paving the way for exploring the chemical world in a search for new scaffolds as inhibitor leads of this protein. However, the more poly-specific P-glycoprotein was recognised to be, the more difficult it became to postulate a unique binding/transport site for ligands. This was also recently underpinned by the resolution of a mouse P-glycoprotein crystal structure, which revealed topologically distinct binding sites for ligands. [5]
Given the central importance of this drug transporter, it is not surprising that a large panel of biological assays probing the transport function of P-glycoprotein has been reported in the literature. However, different biological assays designed to probe the response to a certain agent will also lead to variation in the bioactivity values for a single compound. In the case of competitive inhibition of transport, the probe substrate used for measuring inhibition of transport will play a major role. In addition, the type of assay used (e.g. transport, cellular accumulation, inhibition), the cell line, and the assay conditions used (e.g. probe substrate concentration, duration of experiment) will certainly influence the outcome.
Interpreting bioassay data is in itself a challenging task. Moreover, the huge amount of bioassay data now accessible on the World Wide Web demands a systematic structuring and standardization of the data point entries. [6] With respect to small compound data, the situation changed significantly when an unprecedented body of bioactivity data was made available to the public domain. This is a consequence of several factors: a change to an open innovation business model in the pharmaceutical industry, the development of public-private partnerships such as the Innovative Medicines Initiative (IMI), large screening initiatives such as the NIH Molecular Libraries and Imaging Program (MLP), [7] and finally, large-scale manual and computational indexing of the primary literature. [8] The most prominent examples of such Open Access databases in the life sciences are the PubChem BioAssay database [9] and the ChEMBL database. [10,11] PubChem hosts data from high throughput screening (HTS) experiments, while ChEMBL is a manually curated collection of literature data, coming mainly from Structure Activity Relationship (SAR) studies. In the field of transport proteins we would also like to highlight the focussed transporter database TP-search. [12,13] Compound data presented herein were also retrieved from literature sources.
All three resources include compound bioactivity data with a brief tagline description of the underlying assays. However, the way data are retrieved and the respective output differ for each of these data sources. Given the complexity of the information on the bioassays, disparities are to be expected both when comparing data from different sources and when comparing the individual entries within one source. Databases compiled from literature sources are mainly built from Quantitative Structure Activity Relationship (QSAR) datasets, which are generally quite small and biased towards a set of structurally related compounds. Furthermore, as most of these studies are carried out in an academic setting and are typically run sporadically and at low throughput, assays and assay conditions differ and are often not directly comparable.
Thus, in order to establish predictive and robust in silico models covering a broad chemical space, it is necessary to combine datasets from different studies. Nevertheless, the final size and quality of such benchmark datasets depend strongly on the potential of certain assays to be combined.
Being aware that expert knowledge represents a decisive skill for studying biological data, in this study we focus on human P-glycoprotein, the best-characterised and most extensively bioassayed ABC-transporter so far. However, it is worth noting that the methodologies devised here for P-glycoprotein data can also be applied to data on other transporters, which are, in general, less abundant.
The final aim of this study was the annotation/classification of prominent in vitro assays used for the determination of bioactivities of human P-glycoprotein inhibitors and substrates as they are represented in the ChEMBL and TP-search Open Access databases. In addition, we explored whether bioassay data coming from distinct literature sources (in ChEMBL or TP-search) can be combined with each other.

Regression Analysis
To create a set suitable for QSAR studies, all the data retrieved from ChEMBL (version ChEMBL_13) and TP-search (last update June 26, 2007) were grouped according to their numerical endpoints IC50 (concentration at half-maximum inhibition), EC50 (concentration at half-maximum effect), and Ki (inhibitor constant for the protein-inhibitor complex). In order to compare datasets from different assays we set a limit of at least ten compounds being measured in both assays. With this criterion, almost all the comparisons we were able to perform belong to the group 'IC50'. For the Ki readout we could only carry out one comparison; for EC50 it was not possible to find any two datasets with at least ten compounds in common.
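The grouping by endpoint and the ten-compound overlap criterion can be sketched as follows. This is a minimal Python illustration, not the procedure actually used in the study; the record layout (endpoint, assay identifier, compound identifier) and the function name `comparable_assay_pairs` are assumptions made for the example.

```python
from itertools import combinations

MIN_SHARED = 10  # overlap criterion stated in the text


def comparable_assay_pairs(records, min_shared=MIN_SHARED):
    """Return assay pairs with the same endpoint that share enough compounds.

    `records` is an iterable of (endpoint, assay_id, compound_id) tuples.
    This layout is hypothetical; it is not the schema of ChEMBL or TP-search.
    """
    compounds = {}
    for endpoint, assay_id, compound_id in records:
        compounds.setdefault((endpoint, assay_id), set()).add(compound_id)

    pairs = []
    for key_a, key_b in combinations(sorted(compounds), 2):
        if key_a[0] != key_b[0]:  # only compare IC50 with IC50, Ki with Ki, ...
            continue
        shared = compounds[key_a] & compounds[key_b]
        if len(shared) >= min_shared:
            pairs.append((key_a[0], key_a[1], key_b[1], sorted(shared)))
    return pairs


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    demo = [("IC50", "assay_A", f"cpd_{i}") for i in range(12)]
    demo += [("IC50", "assay_B", f"cpd_{i}") for i in range(2, 14)]
    demo += [("EC50", "assay_C", f"cpd_{i}") for i in range(12)]
    for endpoint, a, b, shared in comparable_assay_pairs(demo):
        print(endpoint, a, b, len(shared))
```

Keeping the endpoint in the grouping key prevents an IC50 series from being paired with an EC50 or Ki series, in line with the grouping described above.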
In order to draw correlation plots and calculate the correlation coefficient R, the squared correlation coefficient (coefficient of determination) R², adjusted R², and standard error, we converted the IC50 and Ki values to the pIC50 (−log IC50 [M]) and pKi (−log Ki [M]) scale, respectively, in which higher values indicate exponentially greater potency.
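As a hedged illustration of this step, the conversion to the negative logarithmic scale and the regression statistics can be computed with the Python standard library alone. The function names and the example values are assumptions; the text does not say which standard error was reported, so the standard error of the estimate is used here as one plausible reading.

```python
import math


def to_p_scale(value_molar):
    """Return -log10 of a concentration given in mol/L (pIC50 or pKi)."""
    if value_molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value_molar)


def regression_stats(x, y):
    """Least-squares fit of y on x with R, R^2, adjusted R^2 and standard error."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equally long series with at least three points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("both series must vary")
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    r2 = r * r
    ss_res = max(syy - slope * sxy, 0.0)  # residual sum of squares
    return {
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": r,
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - 2),  # one predictor
        "std_err": math.sqrt(ss_res / (n - 2)),  # standard error of the estimate
    }


if __name__ == "__main__":
    # Made-up IC50 values in micromolar for the same compounds in two assays.
    assay_a_um = [0.5, 2.0, 8.0, 30.0, 120.0]
    assay_b_um = [0.9, 1.5, 12.0, 25.0, 200.0]
    p_a = [to_p_scale(v * 1e-6) for v in assay_a_um]  # micromolar -> molar
    p_b = [to_p_scale(v * 1e-6) for v in assay_b_um]
    print(regression_stats(p_a, p_b))
```

Working on the logarithmic scale keeps a few very weak or very potent compounds from dominating the fit, which is the reason for converting before correlating.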
In the results section R² values are discussed. More detailed information on the regression statistics of the correlations drawn is given as Supporting Information (Table S1).

Creation of the Classification Dataset for Human P-Glycoprotein Inhibitors
The SDF file of chemical compounds was created with a combination of automated and manual protocols, using a number of cheminformatics tools and searches of online databases. The approach included name-to-structure conversion tools to convert systematic names (generally IUPAC names) to chemical structures. Where necessary, stereochemistry (when defined in the systematic name) was introduced into the chemical structure manually. The software tools included ACD/Name (version 12) [14] and the OPSIN online service. [15] Trivial names and other synonyms were searched against public domain databases, generally ChemSpider, [16] ChEBI, [17] ChEMBL and PubChem. Results from multiple data sources were manually curated for consistency and compared to other reference resources such as the Merck Index and, where appropriate, Wikipedia. In some cases the names were ambiguous, and prior experience with the compounds under study was used to identify and include the chemicals.

Bioassays in ChEMBL and TP-Search for Human P-Glycoprotein
In order to get a first impression of the composition of the two databases, we determined the overlap of ChEMBL and TP-search for human P-glycoprotein inhibitors/substrates with readout IC50, EC50, and Ki: out of approximately 1200 data point entries (compounds with associated bioactivities; 846 in ChEMBL and 352 in TP-search) we found only 40 compounds present in both databases.
Figures 1 and 2 depict the proportion of different assays for human P-glycoprotein with readout IC50 values as they appear in ChEMBL and TP-search, respectively. Figure 2 shows that the majority of assays with readout IC50 in TP-search are indirect transport assays, where the inhibition of a substrate's transepithelial transport is measured in order to determine the inhibitory activity of the compounds under investigation. Major substrates used for these assays are daunorubicin, digoxin, calcein-AM, LDS-751, and rhodamine 123. In ChEMBL, however, almost half of the assays with readout IC50 are calcein-AM accumulation assays (where an increase in a substrate's intracellular accumulation is measured). Five percent of the assays measure an increase in rhodamine 123 accumulation. Transport assays are underrepresented in ChEMBL when compared to TP-search (inhibition of calcein-AM transport 11 %; inhibition of [3H]vinblastine transport 9 %). Other assays, each with an incidence of less than ten percent of all the IC50 assays in ChEMBL, include the measurement of cytotoxic effects, reversal of multidrug resistance (MDR), antiproliferative effects, and radioligand-binding assays.
Inspecting the distribution of assays giving EC50 as well as Ki values in ChEMBL, we noticed that 61 % (198 compounds) of the data entries with EC50 and 73 % (80 compounds) of the data entries with Ki values were determined in a daunorubicin efflux (transport) assay. In TP-search there are very few data points with EC50 or Ki values, so that no general tendencies of assay distributions could be deduced.
When the underlying literature sources of TP-search and ChEMBL are examined, there is a clear difference in the journals contributing most of the data: TP-search has a higher fraction of journals concentrating on reporting late-stage preclinical/clinical development of a compound (e.g. Journal of Pharmacology and Experimental Therapeutics, Drug Metabolism & Disposition), whereas ChEMBL has a higher fraction of data from medicinal chemistry optimisation assays (e.g. Journal of Medicinal Chemistry, Bioorganic & Medicinal Chemistry Letters). Therefore, part of the differences in the assay distribution probably reflects custom and practice in the relevant scientific communities.
Studying the prevalence of substrates and cell lines used in the two databases, we observed that 45 % of all the compounds (with bioactivities in IC50, EC50 and Ki) in ChEMBL are measured in assays using daunorubicin as a substrate and 29 % using calcein-AM. In TP-search there is a slightly different tendency, with 24 % using calcein-AM, 17 % digoxin, and 15 % daunorubicin.
With respect to the cell lines used in the respective assays, there were around 25 different ones in each of the databases, many of which were identical between ChEMBL and TP-search. For a list of all the cell lines used see the Supporting Information (Table S2).
Throughout our study we noticed that assay nomenclature is not homogeneous. For instance, the terms 'efflux' and 'transport', which in the context of P-glycoprotein substrates are equivalent, are used interchangeably. The same is true for the equivalents 'uptake' and '(intracellular) accumulation'. Additionally, the precision of the assay description or of the assay's name itself varies: e.g. 'Calcein-AM efflux assay', 'Calcein-AM accumulation assay', or just 'Calcein-AM assay', which does not even specify whether uptake or efflux is measured.
The assays were therefore first grouped according to a description of what is actually measured, which often required examining the original publications that were cited. The following major assay types were identified (X is the respective substrate used): 'Increase in X intracellular accumulation', 'Inhibition of X (transepithelial) transport', 'Reversal of Multidrug Resistance (MDR)', 'Cytotoxic effect', 'Antiproliferative effect', 'Inhibition of X-stimulated ATPase activity', and 'Inhibition of X binding'. Further sub-classification is possible by taking into account the different substrates (X) and the cell lines used for the experiment.
From this large-scale analysis it is possible to highlight alternate expressions of the same or related assays, to develop canonical or recommended ways of expressing the assay, and to drive further curation of the underlying resources.
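A first-pass mapping from free-text assay names to the assay types listed above could look like the sketch below. The keyword rules, their order, and the function name `annotate_assay` are illustrative assumptions and not the curation rules used in this study; in practice the grouping often required reading the original publications, so an unresolved description is returned as `None` rather than guessed.

```python
import re

# Keyword rules are assumptions for illustration and are checked in order.
ASSAY_TYPE_RULES = [
    (r"atpase", "Inhibition of X-stimulated ATPase activity"),
    (r"binding", "Inhibition of X binding"),
    (r"reversal", "Reversal of Multidrug Resistance (MDR)"),
    (r"cytotox", "Cytotoxic effect"),
    (r"antiprolif", "Antiproliferative effect"),
    (r"accumulation|uptake", "Increase in X intracellular accumulation"),
    (r"efflux|transport", "Inhibition of X (transepithelial) transport"),
]

# Probe substrates named in the text.
SUBSTRATES = [
    "calcein-AM", "daunorubicin", "digoxin",
    "LDS-751", "rhodamine 123", "vinblastine",
]


def annotate_assay(description):
    """Return (assay_type, substrate); either element is None if undetermined."""
    text = description.lower()
    assay_type = None
    for pattern, label in ASSAY_TYPE_RULES:
        if re.search(pattern, text):
            assay_type = label
            break
    substrate = next((s for s in SUBSTRATES if s.lower() in text), None)
    return assay_type, substrate


if __name__ == "__main__":
    for name in ("Calcein-AM efflux assay",
                 "Calcein-AM accumulation assay",
                 "Calcein-AM assay"):
        print(name, "->", annotate_assay(name))
```

The third example has no keyword indicating whether uptake or efflux is measured, so its assay type stays undetermined and would be flagged for manual inspection of the cited publication.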

Combining Bioactivities from Identical Assays
For ligand-based drug design studies it is generally highly recommended to measure all compounds in an identical assay setup. In this manner a very clean and consistent dataset is obtained (within the experimental error) which is able to reflect the structural differences in terms of differences in the pharmacological activity values.
Assembling such a dataset (only one assay type/substrate/cell line) from ChEMBL or TP-search shows that the largest compound dataset that can be retrieved from ChEMBL comprises 198 entities with EC50 values measured in a daunorubicin efflux assay in MDR CCRF vcr1000 cells. The size of this dataset is certainly sufficient for QSAR-related studies, as is the activity range of six orders of magnitude. However, one weakness might be the lack of structural diversity of most of the compounds. [18-20] From TP-search, the most comprehensive dataset we could extract comprises 37 compounds with IC50 values. These compounds were measured in a daunorubicin transport assay in NIH-3T3-G185 cells and show a sufficient degree of structural diversity. Bioactivities in this dataset span three orders of magnitude.

Correlating Bioactivities from Different Assay Setups
In principle there are two general types of models that can be generated, those based on QSAR methods and those based on classifications. When combining data from different assays certain requirements need to be fulfilled. The combination of QSAR datasets with a significant overlap of compounds tested in both assays is highly recommended. Subsequent correlation analysis will then show if the two assays can be combined using regression analysis.
Interestingly, despite the huge number of data point entries (~1200) at our disposal when performing the study, there were only very few datasets with a sufficient number of compounds (at least ten) available which have been tested in more than one assay. Thus, the number of correlations we were able to establish was very limited.
In the case of classification models the definition of a reference compound which is routinely tested in all assays, and used as a threshold for defining active/inactive, serves as a valuable approach. It would be a great help to the community if publications reporting frequently used assays reported explicitly such internal standard data.
For this study, it must also be clearly stated that the subsequent analysis does not take into account assays especially designed for measuring substrates of P-glycoprotein, such as in vitro (direct) transcellular transport assays. These assays directly measure transport of substrate but report transport ratios (basolateral-to-apical vs. apical-to-basolateral), which are not as easy to correlate with each other as the values from assays based on IC50, EC50, and Ki. Thus, our annotation scheme and the conclusions we draw are aimed specifically at inhibitors of human P-glycoprotein, although this does not preclude that certain compounds discussed herein could also be classified as substrates in other assays.

Comparison: Same Assay Type, Different Substrates (X), Same Cell Line
Naturally, Ki values relate more closely to competitive inhibitors and a true binding affinity for P-glycoprotein than IC50 values. Ekins et al. [21] published a set of 17 compounds measured in both a [3H]-vinblastine (radiolabeled) accumulation assay and a calcein-AM (fluorescent) accumulation assay. Both assays are of the same type ('Increase in X intracellular accumulation') and both were measured in LLC-PK1 (pig kidney epithelial) cells. Such a comparison is best suited for predicting whether or not two different substrates have overlapping binding sites in this protein. In the case of vinblastine and calcein it has been proposed that their binding sites only partially overlap. [22] That might be the reason for the rather poor correlation of the bioactivities (pKi values; R² = 0.56; Figure 3 and Table 1).
Other correlations that were obtained from ChEMBL and TP-search are all based on comparisons of pIC50 values.
Using a fluorescence indicator (displacement) assay (assay type: 'Inhibition of X (transepithelial) transport') with different fluorescence markers (= fluorescent substrates of P-glycoprotein) but always the same cell line, NIH-3T3-G185 (MDR1-transfected mouse embryo fibroblast cell line), Wang et al. [23] determined IC50 values with LDS-751, daunorubicin (DNR), and rhodamine 123 (Rho 123) as markers (Table 2). Just one compound, quinidine, has a much more potent effect on LDS-751 (IC50 = 1.0 µM) compared to the other substrates (DNR: IC50 = 18.8 µM; Rho 123: IC50 = 33.9 µM). If the bioactivity of quinidine measured with LDS-751 as marker is correct, this also gives some very specific information about the LDS-751 binding site, since its stereoisomer quinine has a 75-fold lower affinity. Of course, the difference in potency also affects the correlation coefficients obtained for the comparisons LDS-751/DNR and LDS-751/Rho (see Figures 5 and 6).
Removing this outlier from the two correlations gave very good R² values of 0.82 (LDS-751/DNR) and 0.90 (LDS-751/Rho), respectively. It is well known that different substrate markers favour different substrate binding sites of P-glycoprotein [24-26] and that their responsiveness and wide applicability depend on their balanced affinity to more than one site. [23] Thus, the ability to combine assays with different underlying (fluorescent or radiolabeled) substrates strongly depends on the overlap of the substrates' binding sites as well as on the series of compounds (inhibitors) under investigation. In the cases of daunorubicin, LDS-751, and rhodamine 123 it has been postulated that all three preferentially bind to the same site (the so-called 'R site'). [25,27] This was well reflected by the correlations we obtained.
The question as to whether identical assay setups, but different cell lines, will lead to similar activity values becomes even more interesting if the cell lines under investigation express P-glycoprotein from different species. Schwab et al. [28] used an indirect indicator assay with calcein-AM as fluorescent dye (assay type: 'Increase in X intracellular accumulation') to measure the P-glycoprotein inhibitory activity of 28 compounds both in polarized pig kidney epithelial LLC-PK1 cells transfected with human MDR1 (L-MDR1 cells) and in porcine brain capillary endothelial cells (PBCEC cells). There was quite a good correlation between the pIC50 values obtained using cells expressing human or porcine protein (R² = 0.73; Figure 7 and Table 3). Vinblastine was classified differently in the two cell lines, being an inhibitor in PBCEC cells but not in L-MDR1 cells. On removal of this outlier from the correlation plot an R² of 0.78 could be achieved. As a general tendency of this correlation we observed that a few inhibitors had lower IC50 values in PBCEC cells (itraconazole, ritonavir, saquinavir, verapamil). Particularly noticeable is itraconazole, with a 70-fold higher inhibitory activity measured in porcine cells than in those transfected with human P-glycoprotein (R² with vinblastine and itraconazole removed = 0.83).
By performing such comparisons (between P-glycoprotein from different species), outliers might provide some species specific information about binding to this protein.
In general, combining assay data from different cell lines is quite risky, as distinct types of cells might also express other ABC-transporters with compound binding profiles partly overlapping with those for P-glycoprotein, rendering data interpretation even more difficult.
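The effect of single compounds such as quinidine, vinblastine or itraconazole on R² can be examined systematically with a leave-one-out recalculation. The following sketch is only one possible way to do this and is not the procedure used in this study, where outliers were identified by inspecting the correlation plots; the function names and the example values are assumptions.

```python
def r_squared(pairs):
    """Squared Pearson correlation for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three points")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("both series must vary")
    return sxy * sxy / (sxx * syy)


def leave_one_out_r2(p_values):
    """Map each compound name to the R^2 obtained when it is left out.

    `p_values` maps a compound name to its (pIC50 in assay A, pIC50 in assay B).
    """
    return {
        name: r_squared([v for k, v in p_values.items() if k != name])
        for name in p_values
    }


if __name__ == "__main__":
    # Made-up pIC50 pairs; "cpd_e" is deliberately placed off the trend.
    demo = {
        "cpd_a": (4.0, 4.1), "cpd_b": (5.0, 4.9), "cpd_c": (6.0, 6.2),
        "cpd_d": (7.0, 6.8), "cpd_e": (5.5, 8.0),
    }
    print("all compounds:", round(r_squared(list(demo.values())), 2))
    for name, value in sorted(leave_one_out_r2(demo).items()):
        print(name, round(value, 2))
```

A compound whose removal raises R² far more than any other is a candidate outlier, but whether it reflects an experimental artefact or genuine site- or species-specific binding still has to be judged from the underlying data.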

Comparing Different Assay Types
Going further with the analysis, we also tried to combine data from different assay types, with different markers (substrates) and different cell lines. Because different labs performing the same assay are likely to obtain slightly different results, we established correlations by taking into account only data from one publication for each test series. Unfortunately, this led to datasets with an overlap of only 7-8 compounds, which is probably not very representative. Still, we found a good correlation (R² = 0.72; Table 4) when comparing the pIC50 values of a calcein-AM accumulation assay in L-MDR1 cells to those of a daunorubicin transport assay in NIH-3T3-G185 cells. However, when comparing results from a calcein-AM/L-MDR1 accumulation assay to those of transport assays with rhodamine 123/NIH-3T3-G185 (R² = 0.4) or with LDS-751/NIH-3T3-G185 (R² = 0.36), we were not able to establish any correlations between these combinations of substrates and cell lines (data not shown). Data extracted from ChEMBL/TP-search were originally reported in the publications of Schwab et al. [28] and Wang et al. [23] (as described earlier).
Secondly, we also studied possible correlations for a series of tetrahydroisoquinoline and piperazine derivatives measured in a calcein-AM accumulation assay in MDCK-MDR1 cells versus a [3H]-vinblastine transport inhibition assay with Caco-2 cells. An R² of 0.07 indicated no correlation at all between these assays (data not shown). The authors of this study (Colabufo et al.) [29] argued that this lack of any correlation could be due either to the different cell lines or to the different underlying assays (with different sensitivities). [26] The result is also in accordance with the study by Ekins et al., [21] which suggests that vinblastine and calcein might only have partially overlapping binding sites.
We also observed that correlations were better for one compound class (tetrahydroisoquinolines) than for the other (piperazines). That points to the influence of the inhibitors' chemical structure on the usability of certain bioassays.

More broadly, this analysis implies that the prediction of drug-drug interactions will require both a better characterisation of existing drugs and development of sub-site specific models in order to have higher predictivity.

QSAR and Classification Datasets for Human P-Glycoprotein Inhibitors
Based on this study it was not possible to build up a large dataset of P-glycoprotein inhibitors based on the combination of different assays. The largest set contains 198 compounds and was obtained by merging three SAR datasets from ChEMBL, which were all measured in a daunorubicin transport assay in MDR CCRF vcr1000 cells (see Supporting Information, File S1). [18-20] Even though 24 % of the compounds with IC50 values for human P-glycoprotein in TP-search have also been measured in a daunorubicin transport assay (see Figure 2), none of these assays uses CCRF vcr1000 cells. Our studies suggest that bioactivities from daunorubicin, LDS-751, and rhodamine 123 transport assays can be combined in the case of NIH-3T3-G185 cells. It is not clear whether the same is true for other cell lines. Thus, it was not possible to combine the 198-compound dataset with other datasets measured in different assay setups. Recently, other groups combined different literature sources (and thus data coming from different assays) for the generation of large classification databases by setting certain thresholds.
Broccatelli et al. [30] used a classification scheme derived from the observations of Rautio et al.: [22] inhibitor (IC50) < 15 µM; non-inhibitor (IC50) > 100 µM. Rautio et al. tested twenty compounds with different protocols using five different probe substrates. However, not all combinations of assay types/probe substrates/cell lines could be taken into account.
Chen et al. [31] based the determination of whether a compound is an inhibitor or not on the experimental MDRR (multidrug resistance reversal) ratio. Comparing these two databases (of approximately 1200 compounds each) we found 33 collisions (differently classified compounds). From our point of view, a cleaner and more robust dataset can be achieved by using a tailored threshold for every assay rather than a single threshold. Such thresholds can be derived by carefully inspecting the bioactivity measures of a compound that has been determined in many different assays. To our knowledge, verapamil is one of the most broadly studied molecules interacting with P-glycoprotein. Table 5 lists all the IC50 values for verapamil, with the respective underlying assays, that we could find in ChEMBL and TP-search for human P-glycoprotein. For each (inhibitor) assay with a determined verapamil activity, we classified all compounds having an equal or better affinity/potency than verapamil in that assay as inhibitors. Compounds less potent than verapamil were classified as non-inhibitors. In this way we created a benchmark dataset useful for P-glycoprotein classification studies, which comprises 77 inhibitors and 126 non-inhibitors (see Supporting Information, File S2).
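The difference between a single fixed threshold and an assay-specific reference threshold can be sketched as follows. This is a minimal illustration under stated assumptions: all IC50 values are taken to be in micromolar, compounds falling between the two fixed thresholds are left unassigned (the text does not say how they were handled), and the function names and example values are hypothetical.

```python
FIXED_INHIBITOR_UM = 15.0       # fixed threshold quoted above for inhibitors
FIXED_NON_INHIBITOR_UM = 100.0  # fixed threshold quoted above for non-inhibitors


def classify_fixed(ic50_um):
    """Classify by the fixed thresholds; None for the range in between."""
    if ic50_um < FIXED_INHIBITOR_UM:
        return "inhibitor"
    if ic50_um > FIXED_NON_INHIBITOR_UM:
        return "non-inhibitor"
    return None  # assumption: intermediate values stay unassigned


def classify_by_reference(assay_ic50_um, reference="verapamil"):
    """Classify every compound of ONE assay relative to the reference compound.

    A compound with an IC50 equal to or lower than that of the reference
    (equal or better potency) is an inhibitor, otherwise a non-inhibitor.
    """
    if reference not in assay_ic50_um:
        raise KeyError(f"{reference} was not measured in this assay")
    threshold = assay_ic50_um[reference]
    return {
        name: "inhibitor" if value <= threshold else "non-inhibitor"
        for name, value in assay_ic50_um.items()
    }


if __name__ == "__main__":
    # Made-up IC50 values (micromolar) from one hypothetical assay.
    assay = {"verapamil": 5.0, "cpd_a": 2.0, "cpd_b": 12.0, "cpd_c": 150.0}
    for name, value in assay.items():
        print(name, classify_fixed(value), classify_by_reference(assay)[name])
```

In this made-up example `cpd_b` is an inhibitor under the fixed scheme but a non-inhibitor relative to verapamil, which is the kind of collision a tailored, per-assay threshold is meant to make explicit.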

We recommend measuring verapamil activity routinely when running a certain assay for a compound dataset. As a consequence, the pool of available information on verapamil thresholds will increase and then further serve for building up larger classification databases. In addition, verapamil bioactivities in certain assay setups already give an indication of the ability to combine them.

Conclusions
In conclusion, bioactivity values from different assay setups should always be combined with caution. Our study indicates that for inhibitors of human P-glycoprotein it is possible under certain conditions to combine data coming from the same assay type, if the cell lines used are also identical and the fluorescent or radiolabeled substrates have overlapping binding sites.
As a result of this study we have determined that there is only a very limited number of datasets with an appropriate number of compounds (at least ten) available, which have been measured in more than one assay and thus can serve for determining correlations. In order to prove our hypotheses, and to be able to find new inter-correlations between assays, there is an urgent need for a chemically diverse dataset measured in a panel of different assays, with different markers, and different cell lines.
It was observed throughout this study that expert knowledge is needed to organize and annotate existing bioactivity data, in this case specifically for human P-glycoprotein. There are several useful tools available which could help to increase the systematic structuring of bioassay data. One is MIABE (Minimum Information About a Bioactive Entity), which provides a formal list of information that should be provided when describing the synthesis and subsequent analysis of any potentially bioactive entity. [32] Another example is the STRENDA (standards for reporting enzymological data) [33] Commission. Again, this initiative aims at providing a checklist of information that should be included when reporting data, but with a focus on enzyme data. In addition, STRENDA gives recommendations for uniform assay standards and standardization of enzyme data. Above all, the BioAssay Ontology project (BAO) should be mentioned, which has been initiated to facilitate the standardization of annotating the screening setup and the data generated. [34] In the BAO, there are already more than 350 assays from PubChem annotated with standardized BAO terms. [35] By defining the assay type and the compound action, functional viability and uptake assays can be separated from binding assays, while the activation or inhibition is annotated elsewhere. A standardized vocabulary enables extended searches regarding assay design, which can include detection technology and instrument, dye specification, and other assay conditions. However, of the ABC superfamily of transport proteins there is only one member, the Multidrug resistance-associated protein 1 (UniProt ID: P33527; alternative protein names: ATP-binding cassette sub-family C member 1, Leukotriene C(4) transporter = LTC4 transporter; gene names: ABCC1, MRP, or MRP1), represented in the beta version of the BAO search tool. [36] This indicates that far more work needs to be done to improve the coverage of this important antitarget family. Emerging standardization efforts as represented by BAO also underpin the importance of a bioassay annotation/classification at the time of assay data deposition.
A standardized vocabulary for assay design and technology would also make data interoperable and would considerably increase the capabilities of data integration platforms such as Open PHACTS. [37,38] This will furthermore promote enhanced access to and use of data within both public sources and the pharmaceutical industry. Even more interesting would be the possibility to mark specific assays as interchangeable, which would considerably expand the chemical space available for certain targets. This will significantly increase the usefulness of the huge data repositories that are freely available.

Supporting Information
Table S1: Regression statistics of correlation plots (Figures 3-8).
Table S2: Cell lines used for bioassays determining bioactivities of compounds interacting with human P-glycoprotein in ChEMBL and TP-search.
File S1: SDF file of 198 chemical compounds, including chemical structures and bioactivity values, measured in a daunorubicin efflux assay in MDR CCRF vcr1000 cells.
File S2: SDF file of 203 chemical compounds classified as either inhibitors or non-inhibitors of human P-glycoprotein, including chemical structures.
In addition, SDF files of the two datasets will also be available from our web page pharminfo.univie.ac.at and from chemspider.com.