Single-cell entropy for accurate estimation of differentiation potency from a cell’s transcriptome

The ability to quantify differentiation potential of single cells is a task of critical importance for single-cell studies. So far however, there is no robust general molecular correlate of differentiation potential at the single cell level. Here we show that differentiation potency of a single cell can be approximated by computing the signaling promiscuity, or entropy, of a cell’s transcriptomic profile in the context of a cellular interaction network, without the need for model training or feature selection. We validate signaling entropy in over 7,000 single cell RNA-Seq profiles, representing all main differentiation stages, including time-course data. We develop a novel algorithm called Single Cell Entropy (SCENT), which correctly identifies known cell subpopulations of varying potency, enabling reconstruction of cell-lineage trajectories. By comparing bulk to single cell data, SCENT reveals that expression heterogeneity within single cell populations is regulated, pointing towards the importance of cell-cell interactions. In the context of cancer, SCENT can identify drug resistant cancer stem-cell phenotypes, including those obtained from circulating tumor cells. In summary, SCENT can directly estimate the differentiation potency and plasticity of single-cells, allowing unbiased quantification of intercellular heterogeneity, and providing a means to identify normal and cancer stem cell phenotypes. Software Availability SCENT is freely available as an R-package from github: https://github.com/aet21/SCENT

4 conditions in which the requirement for a cell's phenotypic plasticity is high (e.g. as in a 91 pluripotent state).

93
Validation of single-cell entropy as a measure of differentiation potency 94 To test that signaling entropy correlates with differentiation potency, we first estimated it for 95 1018 single-cell RNA-seq profiles generated by Chu et al [20], which included pluripotent 96 human embryonic stem cells (hESCs) and hESC-derived progenitor cells representing the 3 97 main germ-layers (endoderm, mesoderm and ectoderm) ("Chu et al set", SI table S1, Online highly statistically significant, with DEPs exhibiting significantly lower entropy values than 105 hESCs (Wilcoxon rank sum P<1e-50 ( Fig.2A). Likewise, TBs exhibited lower entropy than 106 hESCs (P<1e-50), but higher than HFFs (P<1e-7) ( Fig.2A). Importantly, signaling entropy 107 correlated very strongly with a pluripotency score obtained using a previously published cancer-associated fibroblasts (CAFs). For a given cell-type and individual, variation between 119 single cells was substantial and similar to the variation seen between individuals (SI fig.S1). 120 Mean entropy values however, were generally stable, showing little inter-individual variation, 121 except for T-cells from 4 out of 15 patients, which exhibited a distinctively different 122 distribution (SI fig.S1). In order to assess overall trends, we pooled the single-cell entropy 123 data from all patients together, which confirmed that all lymphocytes (T-cells, B-cells and 124 NK-cells) had similar average signaling entropy values (Fig.2E). Intra-tumor macrophages, 125 which are derived from monocytes, exhibited a marginally higher signaling entropy (Fig.2E). 126 The highest signaling entropy values were attained by endothelial cells and CAFs (Fig.2E), 127 consistent with their known high phenotypic plasticity [23][24][25][26]. Importantly, the entropy 128 values for all of these non-malignant differentiated cell-types were distinctively lower with the fact that hESCs and progenitors have much higher differentiation potency. To test 131 this formally, we compared hESCs, mesoderm progenitors, and terminally differentiated cells 132 within the mesoderm lineage (which included all endothelial cells and lymphocytes), which 133 revealed a consistent decrease in signaling entropy between all three potency states 134 (Wilcoxon rank test P<1e-50, Fig.2F). Of note, signaling entropy could discriminate 135 progenitor and differentiated cells better than the score derived from the pluripotency gene 136 expression signature [21], attesting to its increased robustness as a general measure of 137 differentiation potency (Fig.2G, SI fig.S2). 138 Next, we assessed signaling entropy in the context of a time-course differentiation 139 experiment, whereby hESCs were induced to differentiate into definite endoderm progenitors Signaling entropy is robust to choice of PPI network and NGS platform 147 We verified that signaling entropy is robust to the choice of PPI network (SI fig.S3). This 148 robustness to the network stems from the fact that signaling entropy depends mainly on the 149 relative connectivity of the proteins in the network (SI fig.S4A). Importantly, signaling 150 entropy lost its power to discriminate pluripotent from non-pluripotent cells if expression 151 values were randomly reshuffled over the network (SI fig.S4B-C), demonstrating that 152 features such as pluripotency are encoded in a subtle positive correlation between expression 153 levels and connectivity. In order to test the robustness of signaling entropy across 154 independent studies, we analyzed scRNA-Seq data for an independent set of single cell 155 hESCs derived from the primary outgrowth of the inner cell mass ("hESC set" [28], SI table 156 S1). Obtained signaling entropy values were most similar to those of single cells derived 157 from the H1 and H9 hESC lines, confirming the robustness of signaling entropy across 158 different studies and next-generation sequencing platforms (Fig.2D, SI table S1).

160
Non-linear association between single cell entropy and cell-cycle phase 161 A major source of variation in scRNA-Seq data is cell-cycle phase [22,29]. We explored the 162 relation between signaling entropy and cell-cycle phase in a large scRNA-Seq dataset 163 encompassing 3256 non-malignant and 1257 cancer cells derived from the microenvironment 164 of melanomas (Melanoma set, [22], SI table S1). A cycling score for both G1-S and G2-M 165 phases and for each cell was obtained using a validated procedure [22,29,30] and compared 166 to signaling entropy, which revealed a strong yet highly non-linear correlation (SI fig.S5). 167 Specifically, we observed that cells with a low signaling entropy were never found in either 168 6 the G1-S or G2-M phase (SI fig.S5). In contrast, cells with high signaling entropy could be 169 found in either a cycling or non-cycling phase. These results are consistent with the view that 170 cycling-cells must increase expression of promiscuous signaling proteins and hence exhibit 171 an increased signaling entropy.

174
Given that signaling entropy correlates with differentiation potency, we used it to develop the 175 SCENT algorithm (Fig.1C). Briefly, the SCENT algorithm uses the estimated signaling 176 entropies of single cells to derive the distribution of discrete potency states across the cell 177 population (Fig.1C, Online Methods). Thus, SCENT can be used to quantify expression 178 heterogeneity at the level of potency. In addition, SCENT can be used to directly order single 179 cells in pseudo-time [14] to facilitate reconstruction of lineage trajectories. A key feature of 180 SCENT is the assignment of each cell to a unique potency state and co-expression cluster, 181 which results in the identification of potency-clusters (which we call "landmarks"), through 182 which lineage trajectories are then inferred (Online Methods).

183
To test SCENT, we applied it to the scRNA-Seq data from Chu et al, a non-time course 184 single-cell experiment, which includes hESCs and progenitor cell populations (SI table S1).

185
SCENT correctly predicted a parsimonious two potency state model, with a high potency 186 pluripotent state and a lower potency non-pluripotent progenitor-like state (Fig.3A).

187
Interestingly, a small fraction (approximately 4%) of the single hESCs were deemed to be non-pluripotent cells (Fig.3B). Correspondingly, the Shannon index, which quantifies the 196 level of heterogeneity in potency, was highest for the NPC population (Fig.3C). In total, 197 SCENT predicted 6 co-expression clusters, which combined with the two potency states, 198 resulted in a total of 7 landmark clusters (Fig.3D). These landmarks correlated very strongly 199 with cell-type, with only NPCs being distributed across two landmarks of different potency 200 (Fig.3E). SCENT correctly inferred a lineage trajectory between the high potency NPC 201 subpopulation and its lower potency counterpart, as well as a trajectory between hESCs and 202 DEPs (Fig.3F). The other cell-types exhibited lower entropies ( Fig.2B & Fig.3F), and 203 correspondingly did not exhibit a direct trajectory to hESCs, suggesting several intermediate 204 states which were not sampled in this experiment.

205
To ascertain the biological significance of the two NPC subpopulations (Fig.3B,E,F), we first 206 verified that the NPCs deemed pluripotent did indeed have a higher pluripotency score (SI  relapsed sample, scRNA-Seq for 96 single AML cells was available (AML set, SI table S1). 276 We posited that comparing the signaling entropy values of these 96 cells would allow us to 277 identify a CSC-like subset responsible for relapse. Since in AML there are well accepted CSC 278 markers (CD34, CD96), we tested whether expression of these markers in high entropy AML 279 single cells is higher than in low entropy AML single cells (Fig.5D). Both CD34 and CD96

314
In order to obtain further evidence for regulated heterogeneity, we note that matched bulk 315 RNA-Seq data is not absolutely required since bulk samples can be approximated by  Fig.6D). This result was unchanged if the bulk samples at relapse were replaced with 338 bulk samples at diagnosis (Fig.6D). In summary, these data strongly support the view that the 339 differentiation potential or phenotypic plasticity of a cell population is higher than that of a 340 randomly picked single cell in the population, consistent with a model in which expression 341 heterogeneity between single cells is regulated.    context. The weight of an edge between protein g and protein h, denoted by w gh , is assumed 462 to be proportional to the normalized expression levels of the coding genes in the sample, i.e.

463
we assume that w gh ~ x g x h . We interpret these weights (if normalized) as interaction 464 probabilities. The above construction of the weights is based on the assumption that in a 465 sample with high expression of g and h, that the two proteins are more likely to interact than 466 in a sample with low expression of g and/or h. Viewing the edges generally as signaling 467 interactions, we can thus define a random walk on the network, assuming we normalize the 468 weights so that the sum of outgoing weights of a given node i is 1. This results in a stochastic

542
To identify single cells in either the G1-S or G2-M phases of the cell-cycle we followed the 543 procedure described in [22]. Briefly, genes whose expression is reflective of G1-S or G2-M 544 phase were obtained from [29,30]. A given normalized scRNA-Seq data matrix is then 545 z-score normalized for all genes present in these signatures. Finally, a cycling score for each 546 phase and each cell is obtained as the average z-scores over all genes present in each 547 signature.

548
To obtain an independent estimate of pluripotency we used the pluripotency gene expression