TrAp: a tree approach for fingerprinting subclonal tumor composition

Revealing the clonal composition of a single tumor is essential for identifying cell subpopulations with metastatic potential in primary tumors or with resistance to therapies in metastatic tumors. Sequencing technologies provide only an overview of the aggregate of numerous cells. Computational approaches to de-mix a collective signal composed of the aberrations of a mixed cell population of a tumor sample into its individual components are not available. We propose an evolutionary framework for deconvolving data from a single genome-wide experiment to infer the composition, abundance and evolutionary paths of the underlying cell subpopulations of a tumor. We have developed an algorithm (TrAp) for solving this mixture problem. In silico analyses show that TrAp correctly deconvolves mixed subpopulations when the number of subpopulations and the measurement errors are moderate. We demonstrate the applicability of the method using tumor karyotypes and somatic hypermutation data sets. We applied TrAp to Exome-Seq experiment of a renal cell carcinoma tumor sample and compared the mutational profile of the inferred subpopulations to the mutational profiles of single cells of the same tumor. Finally, we deconvolve sequencing data from eight acute myeloid leukemia patients and three distinct metastases of one melanoma patient to exhibit the evolutionary relationships of their subpopulations.


Introduction
The mechanisms of cancer evolution and metastatic onset are still largely unknown.The diversity, complexity and evasive nature of tumor biology are central reasons for the seemingly slow progress in the cure of most cancer types, particularly in controlling the ability of tumor populations to spread.Tumor populations are dynamic aggregates of constantly evolving subclones, each carrying a variety of genomic aberrations [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16].This genetic heterogeneity is often associated with differences in the biological behavior of different cell subpopulations.Some of these subclones are likely to be the primary instigators of invasion, metastases or relapse following treatment [11,[17][18][19][20].An effective characterization of the aggressive potential of tumors at early stages has an enormous potential to guide new clinical interventions and translational research [21].
Recently, several efforts have been made to provide a complete genealogical perspective of cancer evolution [22][23][24][25][26]. Using fluorescent labeling techniques, or more recently, single cell sequencing, it is technically possible to separate single cells from tumor samples in order to investigate their evolutionary patterns [22][23][24][25][26][27][28][29][30][31].However, these approaches are limited to either a small number of fluorescent markers [23,32], or to a relatively small number of single cells.On one hand, the identification and selection of uncharacterized subclones in high-throughput experiments is beyond the capabilities of current cell-sorting technologies; on the other hand, isolation and profiling of enough single cells in order to obtain a statistically representative sample of a tumor composed of millions of cells has, currently, prohibitive costs.For this reason, genomics profiling of tumors still relies on pooling in order to provide global averaged signals over the subclonal population within a tumor sample [33][34][35][36].Computational methods for identifying subclones, quantifying their relative abundance and monitoring their emergence and dynamics could prove extremely useful for the analysis of the heterogeneity of these pooled samples.This problem has been often overlooked due to its mathematical complexity.
We herein present a mathematical approach to de-mix signals from heterogeneous cell populations into their subclonal components and subsequently unveil the underlying dynamic tumor heterogeneity.Our proposed method is related to the problem of blind source separation [37][38][39][40][41][42][43][44], where both the underlying sources and their relative composition are unknown.In contrast to blind source separation methods we cannot assume that the underlying sources are statistically independent, we have no prior knowledge of the number of sources and we have at our disposal only one mixture of the unknown sources.This mathematical problem has a vast number of solutions and can be addressed only if additional constraints are imposed.Hence, we assume that tumor cell populations develop in a parsimonious evolutionary process.Furthermore, based on empirical observations, we introduce a sparsity constraint that limits the number of subpopulations.Distinctively from the standard problem of phylogeny [45][46][47][48][49][50][51][52][53][54][55][56], where each species is observed and measured separately, our methodology, which we term Tree Approach to Clonality (TrAp), is specifically designed to deconvolve a single aggregate signal into its different subclonal components.TrAp incorporates biologically motivated constraints which allow us to infer: 1) the composition of the subclones in a single aggregate sample 2) the abundance of each subclone and 3) the evolutionary path of the subclones.
The paper is organized as follows: we first define the subclonal deconvolution problem and present an efficient algorithm for finding all its solutions.Using in silico simulated data we verify that the algorithm is able to correctly deconvolve mixed subclonal populations and that the method is robust to realistic measurement errors.Further, the solution is often unique when the number of populated subclones is moderate.In addition, we also show that TrAp can correctly deconvolve random mixtures of karyotypes of several cells from the same tumor biopsy or from mixture of sequences generated in a study involving somatic hypermutations in B cells.We subsequently compare the mutation profiles of twenty Exome-Seq single cell experiments to those inferred using an aggregate signal generated by exome sequencing from the same renal cell carcinoma tumor.Finally, we apply TrAp to Exome-Seq data from three metastases from three distinct body compartments and compare their subclonal compositions and evolutionary histories.

Results
The results are divided in four parts.In the first part we describe our novel TrAp algorithm for subclonal deconvolution of aggregate genomic signals consisting of aberrations' frequencies and we show that the TrAp algorithm always identifies at least one solution; we also introduce an extension to our work that is suited for signals where each locus can be affected by distinct consecutive aberrations (e.g.several consecutive point mutations affecting the same nucleotide).In the second part, we simulate noisy aggregate signals constructed by random linear combinations of simulated subclonal aberration profiles.We used these simulated data to show that TrAp can correctly deconvolve mixtures of evolutionarily related subclones even in presence of levels of noise that are typically found in current genomics experiments.In the third part, we generated realistic aggregate signals by mixing subclonal genomic profiles obtained from cytogenetic analyses using random coefficients.We generate these data separately for each patient and show that, for nearly all aggregate samples, TrAp recovered the subclonal components.Similarly, we generated realistic aggregate signals from somatic hypermutated (SHM) regions from B cells.As we show, somatic hypermutation is a particularly suitable system for the framework of our TrAp algorithm, which successfully recovered all components from the aggregate signals.In the fourth part, we apply our approach to exome-sequencing experiments.We present an analysis of recent single-cell exome sequencing from a renal cell carcinoma study where, besides a collection of twenty single cells, an aggregate has also been measured.Despite the reported difference between the aggregate and mean aberration profile of the single cell experiments, TrAp could still identify subclones with co-occurring aberrations consistent with co-occurring aberrations found in direct single cell measurements.Finally, we analyze three metastases from separate body compartments of a melanoma patient and compare their inferred evolutionary patterns in a genomic region surrounding the DCC gene.

The TrAp algorithm
The subclonal deconvolution problem We consider a population of cells where each cell can be described by a binary vector, which we call genotype.Each element of the genotype vector has an aberration state modeled as a binary variable, e.g. the presence/absence of a mutated nucleotide in a specific genomic position or the presence/absence of a specific copy number variation event in a specific locus.For each cell, the i-th element of the genotype vector is 1 if the i-th aberration is present in the cell and 0 if the aberration is absent.A subclone is a collection of all cells that have identical genome-wide aberration profile.A subclone is populated in the sample if the fraction of cells sharing the subclone's genome is greater than zero and can be detected by the experiment.
We define the subclonal deconvolution problem as the task of de-mixing an aggregate measurement into a linear combination of (unknown) subclonal genotypes.The only information that is required as input is the aggregate frequency vector y, whose elements y i correspond to the frequency of the i-th aberration in the sample cell population.For efficiency, we remove from the genome all genotype entries k that are homogenous within the population (i.e.y k = 0 or y k = 1), as they do not need to be deconvolved.Next, to ensure that the aberration-free non cancerous cells (wildtype) are included in the solution of the problem, we add one dummy aberration to all the normal and cancerous cells in the sample.By construction, the aggregate frequency of this dummy aberration y 1 is equal to 1. Finally, without loss of generality, we sort the aggregate frequency vector y in descending order such that 1 = y 1 > y 2 ≥ . . .≥ y N > 0, where N is equal to the number of aberration events considered including the dummy aberration y 1 .The subclonal deconvolution problem can be written in matrix notation as where C is a matrix of size N × M whose elements c ij are 1 if aberration i is present in subclone C j and 0 otherwise; N is the size of the vector y; M is the number of subclones; and x is a vector of size M where each element x j corresponds to the frequency of subclone C j in the sample.We note that, without introducing the wildtype aberration, the wildtype subclone would correspond to a vector of zeros and we would not be able to infer the frequency of the wildtype component using the linear model of Equation (1).Furthermore, since the dummy aberration is present in the wildtype and all other subclones, it follows that (i)∀j, c 1j = 1 and (ii) j x j = 1.We note that since M , C and x are all unknown in this problem there is an intractable number of possible solutions.As previously discussed [57], for M > 2 the system is underdetermined and the aggregate signal cannot be uniquely deconvolved by solving the linear system and it is not even possible to find parsimonious unique solutions using sparse reconstruction methods.However, by introducing additional biologically motivated constraints to the model, we can dramatically reduce the number of possible solutions, such that the problem may have a tractable number of optimal solutions, ideally only one.We therefore seek the family of solutions (TrAp-solutions) that sequentially satisfy the following constraints: 1. Evolutionarity.The subclones are generated in an evolutionary process starting from a subclone with no aberrations.Every subclone is generated from an existing subclone by adding to it a single new aberration event.
2. Parsimony.The number of subclones M that are generated during the evolution process is minimal.
3. Sparsity.The number of populated subclones P (i.e. the number of subclones j for which x j > 0) is minimal.

4.
Shallowness.The depth of the evolutionary tree (i.e. the number of generations) A schema of a TrAp solution is shown in Figure 1.
A schema of deconvolution of the mixed signal of four subclones.In this example, the aggregate signal frequency vector y on the left hand side of the matrix-vector equation represents the frequency of five aberrations in the aggregate sample.To allow the heterogeneous mixture of subclones to include normal cells we introduce a dummy aberration that is present in any cell.The frequency of the dummy aberration y 1 is equal to one.The frequencies of the actual five aberrations A 2 , A 3 , A 4 , A 5 , and A 6 encoded in the remaining elements of the vector Y are given by y 2 = 0.6, y 3 = 0.4, y 4 = 0.35, y 5 = 0.3 and y 6 = 0.1, respectively.In this example, the optimal TrAp solution is unique and has four populated subclones: C 2 with aberrations {A 2 }, C 4 with aberrations {A 2 , A 4 }, C 5 with aberrations {A 3 , A 5 } and C 6 with aberrations {A 3 , A 6 }.The optimal solution is shown both as an evolutionary tree (top) and in matrix form according to Equation (1) (bottom), where the tree topology is encoded in the binary matrix and the relative composition of the subclones is represented in the column vector.
The evolutionarity constraint is used in many biological systems, in particular when studying development of cell populations [45][46][47][48][49][50][51][52][53][54][55].The parsimony constraint is typically satisfied because the expected probability of a specific aberration event in a nucleotide or a locus is low and it is therefore unlikely that an event occurs more than once and independently in distant subclones.This constraint is the main criterion used to determine the optimality of maximum parsimony methods commonly used in phylogeny [48,56,58,59].The parsimony constraint dramatically reduces the number of possible solutions of Equation ( 1) because it limits the number of subclones M .The sparsity constraint is justified by the fact that some subclones may die out or may be too rare to be detected.Also, it has been shown in several studies that few subclones acquire an evolutionary advantage and outgrow the other subclones [5,13,60,61], thus reducing the number of populated subclones.The shallowness constraint is justified as excessive genomic instability may not be viable, thus evolutionary trees tend to be shallow and wide rather than deep and narrow.

Properties of TrAp solutions
In this subsection we show that in the subclonal deconvolution problem the evolutionarity and parsimony constraint can always be satisfied by a naïve model in which each aberration event occurs exactly once during evolution (i.e.M = N ).We call any solution with M = N an N -solution and its evolutionary tree an N -tree.
The N -solution optimally satisfies the evolutionary and parsimony constraints.Since all detected aberrations need to occur at least once in the evolutionary tree, the number of subclones must be greater than or equal to the the number of aberration events considered, i.e.M ≥ N .We note that it is always possible to construct a valid solution for Equation (1) of exactly M = N subclones for every aggregate frequency vector y.Specifically, for a cascade-like evolutionary process with no branching, where the wildtype subclone C 1 is the root of the tree and every other subclone C i is a direct descendant of the subclone C i−1 , the solution of Equation ( 1) is given by: While this cascade-like solution satisfies both evolutionarity and parsimony constraints, this solution is not necessarily optimal with respect to the sparsity constraint and is the least optimal in terms of the shallowness constraint.Furthermore, the existence of this solution guarantees that there is always at least one solution to the subclonal deconvolution problem.Since we imposed that the four constraints to the subclonal deconvolution problem must be satisfied sequentially, any solution with M > N subclones will always be less optimal than the solution given in Equation ( 2), regardless of its sparseness and shallowness.We can thus limit the search space of the TrAp algorithm to N -solutions.For this subset of solutions, the vector x is of size N and C is a square matrix of size N ×N .Importantly, the index of each subclone C i indicates the subclone for which the i-th aberration occurs for the first time.Hence, for any TrAp solution there is a one-to-one correspondence between the i-th aberration and the subclone C i whose index indicates that it evolved from its parent subclone by acquiring the i-th aberration.Moreover, each aberration event of an N -tree occurs exactly once and cannot be reverted during evolution.Each aberration i is thus present only in subclone C i and its subclonal descendants: where α ij is the ancestor indicator variable that is equal to 1 if subclone C i is an ancestor of subclone C j and 0 otherwise.We note that ∀j > 1, α 1j = 1, which means that the wildtype clone C 1 is always the root of the evolutionary tree, as required by the evolutionarity constraint.
In Equation (3), we expressed the aggregate frequency y i as the sum of the frequencies of all subclones descending from C i .We now express the aggregate frequency y i as a function of the direct descendants of C i .We define the parent indicator variable φ ij , which is 1 if C i is the parent of C j (i.e. if subclone C j is the result of a single aberration event in subclone C i ) and 0 otherwise.Finally, using the parent indicator variables we express y i in terms of the aggregate frequencies of the direct descendants of C i Equation ( 4) can be rearranged to express the vector x in terms of y and the parent indicator matrix Φ: x = y − Φy, where Φ is the N × N matrix whose elements are given by the parent indicator variables φ i,j .Because of the evolutionary constraint, the matrix Φ has only N − 1 nonzero elements, reflecting the fact that each subclone except the wildtype has exactly one parent subclone, i.e.
N i=1 φ i1 = 0 and ∀j > 1, Furthermore, an important corollary of Equation ( 4) is that the subclone C i is not populated if and only if In other words, the clone C i is not populated when the aggregate frequency y i of aberration i is equal to the sum of the aggregate frequencies of all the direct descendants of the subclone C i .Therefore, the number of non-populated subclones of the N -tree encoded by Φ is given by the number of aberrations i that satisfy Equation (6).Finally, we summarize the relationships between C and the indicator variables α and φ.Using Equation (3), we can express the subclonal deconvolution problem as y = Cx = (I + A)x, where I is the N × N identity matrix and A is the N × N matrix of whose elements are given by the ancestor indicator variable α ij .Furthermore, Equation (5) allows us to write the subclonal deconvolution problem as x = (I − Φ)y.We can therefore express the relationships between the matrices C, A and Φ as: A brute-force algorithm for solving the subclonal deconvolution problem We now observe that the relationship between two subclones C i and C j such that i < j (which also implies y i ≥ y j as the vector y is sorted), must be one of the following: i) C i is an ancestor of C j , i.e. α ij = 1, α ji = 0 and y i ≥ y j ; or ii) C i and C j are on separate branches, i.e. α ij = 0, α ji = 0 and y i + y j ≤ 1.This property implies that all evolutionary trees can be generated iteratively by starting with the wildtype clone C 1 and adding the clone C i at step i to all trees generated at step i − 1.In detail, for any tree that can be generated using subclones C 1 , . . ., C i−1 we generate a new tree by adding the subclone C i as direct descendant of subclone C j for all j < i for which the resulting x j (calculated using Equation ( 5)) remains nonnegative after adding C i .For completeness, when y i = y j the subclones C i and C j can be either on separate branches or on the same branch.If they are on the same branch, then there is an ambiguity regarding the order in which the two aberrations occur (Figure 2).However, in the case that these two subclones are on the same branch, the aberration profile of the ancestor subclone (shown in green in Figure 2) is not informative because this subclone is not populated (x a = 0) and aberrations i and j are both present in the descendant subclone regardless of the order in which they occur.Since these two aberrations cannot be observed separately (i.e. the coefficient x a associated with the ancestor aberration is zero, whereas the x d associated with both aberrations could be nonzero), we only output the solution for which C min{i,j} is an ancestor of C max{i,j} (left solution in Figure 2).This choice ensures that for every pair of aberrations i < j, C j cannot be an ancestor of C i .A step-by-step example of the TrAp solution obtained using the brute-force algorithm is given in Figure 3.
. Deconvolution of a mixture where two aggregate signal frequencies are equal.In this example, five aberrations (A 2 , A 3 , A 4 , A 5 and A 6 ) were measured from an aggregate sample and their frequencies were y 2 = 0.8, y 3 = 0.5, y 4 = 0.5, y 5 = 0.4 and y 6 = 0.2, respectively.The dummy measurement y 1 = 1 was also added to generate the aggregate signal frequency vector y = [1, 0.8, 0.5, 0.5, 0.4, 0.2].In this example, there are two optimal TrAp solutions (left and right), each shown both as an evolutionary tree (top) and in matrix form according to Equation (1) (bottom).Both solutions have 4 common populated subclones, namely C 2 with aberration {A 2 }, C 5 with aberrations {A 2 , A 3 , A 4 , A 5 }, C 6 with aberration {A 6 } and a subclone with aberrations {A 2 , A 3 , A 4 }.In both cases, the ancestors of the clone with aberrations {A 2 , A 3 , A 4 } (C 3 of the left tree and C 4 of the right tree) are not populated.We remark that these two solutions are practically indistinguishable and that the TrAp algorithm outputs only the solution where the subclonal indices along each branch are arranged in an increasing order (as shown in the left tree solution).

S1 S2 S3
Figure 3. Brute-force search approach to deconvolve a mixture of four subclones.In this example, five aberrations (A 2 , A 3 , A 4 , A 5 and A 6 ) were measured from an aggregate sample and their frequencies were y 2 = 0.8, y 3 = 0.5, y 4 = 0.5, y 5 = 0.4 and y 6 = 0.2, respectively.The dummy measurement y 1 = 1 was also added to generate the aggregate signal frequency vector y = [1, 0.8, 0.5, 0.5, 0.4, 0.2].In the first step, the wild type clone C 1 (representing the wildtype subpopulation) is positioned at the root of the tree.In the second step, a tree reconstruction begins and C 2 is added to the only possible ancestor clone C 1 .In the third step, C 3 must be added as direct descendant of C 2 .C 3 cannot be added to the tree on a different branch as a direct descendant of C 1 because based on Equation ( 5) this would imply a negative frequency Likewise, the subclones C 4 and C 5 can only be added as direct descendants of C 3 and C 4 , respectively.Finally, C 6 can be added as a direct descendant of subclones C 1 , C 2 or C 5 generating solutions S 1 , S 2 or S 3 , respectively.However, S 1 is the only TrAp-solution as its number of populated subclones is minimal and corresponds the solution shown in the left side of Figure 2. Solution S 3 is the cascade-like solution described in Equation ( 2).
The upper bound on the number of possible evolutionary N -trees is thus (N − 1)!, as every subclone i can only be the direct descendant of i − 1 subclones.This bound is significantly smaller than the number of all possible trees with N labeled vertices, which is N N −2 according to Cailey's formula [62,63].Furthermore, we note that the parent indicator matrix Φ and C are upper triangular and that both C and I − Φ are of rank N and invertible, which guarantees that Equation (7) can always be used to switch between the representation with the parent indicator variable Φ (Equation ( 5)) and the representation with the subclone matrix C (Equation ( 1)).We also note that, given a matrix Φ (or C = (I − Φ) −1 ), the vector x is uniquely determined.In particular, if the C matrix is known, the vector x can be efficiently found by solving the linear system Cx = y using back-substitution (i.e. by solving Equation (3) first for x N , then using x N to solve for x N −1 and repeating through x 1 ).

TrAp: a fast algorithm for solving the subclonal deconvolution problem
As we have shown, the number of non-populated subclones of a given N -tree is equal to the number of aberrations i that satisfy Equation (6).Moreover, in order to satisfy the sparsity constraint of a solution, we do not need to know the topology of the whole evolutionary tree, but only the subset of rows of the matrix Φ that satisfy Equation (6).We now leverage on this property to efficiently generate sparse solutions to the subclonal deconvolution problem.
First, we group each subset of subclones that satisfy Equation ( 6) into a first generation tree T i , which we define as the subset of subclones C Ti p , C Ti 1 , . . ., C Ti Ni such that the subclone C Ti p is not populated (i.e.x Ti p = 0) and that N i subclones C Ti 1 , . . ., C Ti Ni are its direct descendants.Each first generation tree is represented by a row of the Φ matrix.For example, there are three first generation trees for the aggregate signal Y = {1, 0.6, 0.4, 0.35, 0.3, 0.1}, namely 4).We note that the optimal TrAp solution for this example contains the first generation trees T 1 and T 3 (Figure 1).Furthermore, a Φ matrix associated with a first generation tree must follow the evolutionary constraints previously described (∀j > 1, N i=1 φ ij = 1) and thus the first generation tree also gives complete information about the columns of Φ corresponding to the direct descendant subclones C Ti 1 , . . ., C Ti Ni .For example, the first generation tree T 1 = {C 1 , C 2 , C 3 } implies that φ i,2 = 0 and φ i,3 = 0 for every i = 1 (Figure 4).
Next, we define a partial tree as a collection of first generation trees {T 1 , . . ., T h } that can jointly be contained in a full evolutionary tree.Since each first generation tree can be represented by a row of the Φ matrix, a partial tree that is comprised of h first generation trees can be represented by h rows of the Φ matrix.In the example above, the partial tree that contains the first generation trees T 1 and T 3 is represented by rows 1 and 3 of the matrix Φ in the bottom panel of Figure 4. Similarly to first generation trees, the matrix Φ associated with a partial tree must follow the evolutionary constraint, which implies that not all combinations of first generation trees give rise to partial trees.In the example above, the partial trees T 1 and T 3 can be combined to generate the partial tree P T 1 = {T 1 , T 3 } (Figure 4 bottom), whereas the pairs {T 1 , T 2 } and {T 2 , T 3 } cannot be combined to generate partial trees.Therefore, in the example above the possible partial trees are P T 1 = {T 1 , T 3 }, the empty partial tree P T 2 = {} and the partial trees P T 3 = {T 1 }, P T 4 = {T 2 } and P T 5 = {T 3 }.Moreover, we note that all N -trees that contain a partial tree comprising of h first generation trees have N − h populated subclones.This implies that TrAp solutions, which must satisfy the sparsity constraint, must also contain one of the partial trees comprising the maximum number of first generation trees.In the example above, the optimal TrAp solution (Figure 1) contains the partial tree P T 1 , which is the only partial tree comprising two first generation trees.
Since all TrAp solutions contain the maximum number of first generation trees, the TrAp algorithm dramatically reduces the search space by identifying the optimal partial trees and later utilizing them as starting points for rapidly building all the sparsest solutions to the subclonal deconvolution problem.In details, the TrAp algorithm solves the subclonal deconvolution problem in the following steps (Figure 5): 1. Identify all first generation trees from the aggregate signal vector y.
3. Discard partial trees that do not have the minimum number of populated subclones.
4. Generate all evolutionary trees consistent with the partial trees comprising the maximum number of first generation trees.This step is done iteratively for each partial tree, similarly to the way described for the brute-force algorithm.The only difference is that, when the parent clone C Ti p of a first generation tree Ni is added to the tree, the subclones C Ti 1 , . . ., C Ti Ni are automatically added as direct descendants of C Ti p . 5. Optimize the shallowness constraint by sorting the generated solutions by the depth of the generated tree.
Step 1: First generation trees T 1 T 2 Step 2: Partial trees Step 3: Tree generation from PT 4 Add C 5 Put C 1 (and T 1 ) Add C 3 (and T 2 ) Figure 5. Illustration of the usage of first generation trees and partial trees for deriving the TrAp solution of a mixture of four subclones.In this example, five aberrations were measured from an aggregate sample and their frequencies were y 2 = 0.8, y 3 = 0.5, y 4 = 0.5, y 5 = 0.4 and y 6 = 0.2, respectively.The dummy measurement y 1 = 1 was also added to generate the aggregate signal frequency vector y = [1, 0.8, 0.5, 0.5, 0.4, 0.2].In the first step, TrAp identifies all first generation trees, namely In the second step, TrAp generates the possible partial trees, namely P T 1 = {}, P T 2 = {T 1 }, P T 3 = {T 2 } and P T 4 = {T 1 , T 2 }, and consequently selects only P T 4 = {T 1 , T 2 } as it is the only partial tree which contains a maximum number of first generation trees.In the third step, TrAp generates evolutionary trees starting from the partial tree P T 4 = {T 1 , T 2 }.To complete the evolutionary tree starting from P T 4 , the subclone C 1 is positioned as the root of the tree.Since C 1 is part of the first generation tree T 1 , the subclones C 2 and C 6 are automatically added as a direct descendant of C 1 .Next, C 3 is added as a direct descendant of C 2 .Since C 3 is part of the first generation tree T 2 , C 4 is automatically added as direct descendant of C 3 .Finally, C 5 is added as a direct descendant of C 4 , generating the optimal TrAp solution to the subclonal deconvolution problem.We remark that the optimal solution generated by the TrAp algorithm is equal to S1 in Figure 3 and to the left solution of Figure 2.
The performance of the TrAp algorithm is equivalent to the brute-force approach when there are no first generation trees (i.e. when all subclones are populated) and becomes superior to a brute-force approach when P < N .Furthermore, the algorithm can be modified by imposing additional constraints that need to be satisfied at each step of the iterative tree reconstruction procedure (an example of such modification is presented in the subsection on extension of TrAp to non-binary aberrations).The TrAp algorithm by default returns only the solution(s) that optimize all the constraints.In addition, the user can specify parameters to relax the sparsity and shallowness constraints.

Generalization of TrAp to non-binary aberrations
In the previous sections we represented the genome of each subclone by a vector of binary values whose entries represent if a genomic position is in a normal state (0) or in an aberrated state (1).In general the number of states in a given genomic position could be larger than two and hence subclones cannot be represented by vectors of binary values without loss of information.For example, a nucleotide found in the reference genome or in the germline at a specific position may undergo multiple distinct point mutation events into more than one specific nucleotide.In this subsection, we describe an extension of the TrAp algorithm to deal with such cases.
We consider the generalized subclonal deconvolution problem in which the genome consists of N positions each of which can have S different states.We also assume that the genome of the wildtype subclone is known.The only information that we utilize as input is the aggregate frequency matrix Z whose elements z k,s correspond to the observed frequency of the aberrated state s at position k.We note that, by construction, 0 ≤ z k,s ≤ 1 and that S s=1 z ks = 1.To utilize our framework for solving the subclonal deconvolution problem, we convert the information encoded in Z as a vector y whose elements represent frequencies of binary events.We perform this transformation in several steps.First, we vectorize the matrix Z by concatenating its rows to construct the vector z, which has KS elements.Then, we remove from the vector z the entries for which z ks = 0 as they are not informative.As a result, for each position k there are S k entries, where S k ranges from 1 (only the unmutated state is observed) to S (all aberrated states are observed).For illustration of these first steps, we consider a toy example of a genome of length three whose wildtype sequence is "CAT" and we analyze an aggregate sample made of three subclones with sequences "TCT", "TAC" and "CGT" mixed with frequencies 0.1, 0.3 and 0.6, respectively (Figure 6).In this example the z vector consists of the elements z 1C = 0.6, z 1T = 0.4, z 2A = 0.3, z 2C = 0.1, z 2G = 0.6, z 3C = 0.3 and z 3T = 0.7 (Figure 6).
Next, we wish to design a binarization matrix B to encode the information contained in z as a vector y = Bz whose elements represent the frequency of binary aberrations and can thus be used as input to the subclonal deconvolution problem for the whole genome (Equation ( 1)).For every position k, we assume that each state s (1 ≤ s ≤ S k ) is reached by a sequence of aberration events.We denote by A ks an aberration event to state s at position k.In the example above, there are two states at position 1.The unmutated state C is reached by a dummy aberration A 1C (in analogy to the dummy aberration of the wildtype clone in the subclonal deconvolution problem) and the aberrated state T is reached by the sequence of the dummy aberration A 1C followed by the aberration A 1T .Since the dummy aberration A 1C is present when we observe states C or T at position 1, the frequency of the unmutated aberration is y 1C = z 1C + z 1T = 1.However, the aberration A 1T is present only when we observe the state T , therefore the frequency of the aberration A 1T is y 1T = z 1T = 0.4 (Figure 6).Since measurements at different genomic positions do not affect one another, we construct B as a block-diagonal matrix: 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 Application of the TrAp algorithm to deconvolve a mixture of three sequences in presence of poly-allelic mutations.We analyzed an aggregate sample composed of three subclones with sequences "TCT", "TAC" and "CGT" mixed with frequencies 0.1, 0.3 and 0.6, respectively.In this particular example, the wildtype sequence "CAT" is absent in the mixture.Nonzero elements of the aggregate frequency matrix Z (top right) are then concatenated in the z vector, which consists of the elements z 1C = 0.6, z 1T = 0.4, z 2A = 0.3, z 2C = 0.1, z 2G = 0.6, z 3C = 0.3 and z 3T = 0.7 (middle left).In the center of the middle panel we show the binarization matrix B (1,1,1) and the matrix-vector product B (1,1,1) z associated with it, which are consistent with Equation ( 8) and lead to the optimal TrAp solution.Next, rows and columns corresponding to unmutated states (i.e.1C, 2A and 3T , shown in green) are substituted with a dummy aberrated state corresponding to the wildtype (shown in blue).In the bottom row, the TrAp solution is shown both as an evolutionary tree (left) and in matrix form according to Equation (9) (right).
where y k = [y k1 , . . ., y kS k ] and zk = [z k1 , . . ., z kS k ].Combining Equation (1) and Equation (8) gives It is important to note that for any pair of different states s 1 and s 2 at position k where 1 ≤ s 1 ≤ S k and 1 ≤ s 2 ≤ S k , the value of b ks1,ks2 defines the ancestral relationship between the aberration events A ks1 and A ks2 .Using the ancestor indicator variable α we can express this relationship as α ks1,ks2 = b ks1,ks2 .
To preserve these ancestral relationships in both sides of Equation ( 9) (we recall that C = I + A), the element c ks1,ks2 of C must be equal to the element b ks1,ks2 of B for every pair of states s 1 and s 2 at position k, where 1 In order to find the solutions to Equation ( 8), we solve independently each subproblem y k = B k z k and we require that the B k matrix encodes a tree which satisfies the evolutionary and parsimony constraints and whose root is the dummy unmutated aberration at position k.The number of solutions for each subproblem is equal to the number of trees with S k labeled vertices and, as discussed earlier, this number is equal to S S k −2 k .We then denote by ) and by y (t k ) k the aggregate frequency vector associated to it.The solutions to the problem for S k ≤ 3 are: There is only one solution.Since the input consists only of the unmutated state, it follows that k = [1] and the corresponding aggregate frequency vector is y There is only one solution.Assuming the input is z k = [z k1 , z k2 ] and the unmutated state is k1, and the corresponding aggregate frequency vector is y S k = 3.There are 3 solutions.Assuming the input is z k = [z k1 , z k2 , z k3 ] and unmutated state is k1, the solutions are B


 and their corresponding y k vectors are given by y = [1, z k2 + z k3 , z k3 ] and y The values b k2,k3 and b k3,k2 define the ancestral relationship between aberrations A k2 and A k3 .In the first solution A k2 and A k3 must be on separate branches, in the second solution A k2 must be an ancestor of A k3 , and in the third solution A k3 must be an ancestor of A k2 .
The number of solutions grows dramatically with S k (16 solutions for S k = 4, 125 solutions for S k = 5, 1296 solutions for S k = 6).Therefore, herein we explicitly show the solutions for S k < 4 and implement the method to adress practical scenarios, such as nucleotide point mutations (S k ≤ 4).
The set of all possible solutions to Equation ( 8) is given by the Cartesian product of the solution sets of each subproblem.This implies that the number of solutions of Equation ( 8) is . We denote by B (t1,...,tK) the block matrix solution associated with the blocks B t1 1 , . . ., B tK K representing the solutions of each subproblem.Similarly, we denote by y (t1,...,tK) the aggregate signal vector given by y (t1,...,tK) = B (t1,...,tK) z.It is possible to reduce the number of rows and columns in Equation ( 9) by substituting all the K rows corresponding to the unmutated states (shown in green in Figure 6) with a single dummy wildtype aberration (shown in blue in Figure 6).In our toy example, one aberrated state is observed at positions 1 and 3 (S 1 = S 3 = 2), while two aberrated states (C and G) are observed at position 2 (S 2 = 3).Therefore, there are three solutions to Equation (8): B (1,1,1) , B (1,2,1) and B (1,3,1) .Figure 7 illustrates the best solution for each of these binarization matrices in a graphical and matrix form.The solution associated with B (1,1,1) is the only TrAp-solution of the generalized subclonal deconvolution problem since it has the minimum number of populated subclones (sparsity constraint).
In summary the generalized subclonal deconvolution problem can be solved as follows: 1. Vectorize the aggregate frequency matrix Z and identify all binarization matrices B (t1,...,tK) (Equation ( 8)) compatible with the vector z.
3. Discard partial trees that do not have the minimum number of populated subclones.
1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 . Three solutions of the generalized subclonal deconvolution problem for a mixture of three sequences in presence of poly-allelic mutations.We analyzed an aggregate sample composed of three subclones with sequences "TCT", "TAC" and "CGT" mixed with frequencies 0.1, 0.3 and 0.6, respectively.In this example, the wildtype sequence "CAT" is absent in the mixture.Nonzero elements of the aggregate frequency matrix Z are concatenated in the z vector, which consists of the elements z 1C = 0.6, z 1T = 0.4, z 2A = 0.3, z 2C = 0.1, z 2G = 0.6, z 3C = 0.3 and z 3T = 0.7.There are three binarization matrices (B (1,1,1) , B (1,2,1) and B (1,3,1) ) to Equation ( 8) and one solution for each binarization matrix is shown.Mutations are shown using the notation "position : reference→ mutated", e.g. the notation 2 : A → G indicates that nucleotide at position 2 was mutated from Adenine to Guanine.The notation 2 : A → G → C indicates that nucleotide at position 2 was first mutated from Adenine to Guanine and then from Guanine to Cytosine.Top: solution based on the binarization matrix B (1,1,1) , in which the subclones C 3 and C 4 associated with the aberration events A 2C and A 2G are on separate branches; Middle: solution based on the binarization matrix B (1,2,1) , in which the aberration event A 2G (subclone C 4 ) happens before of A 2C (subclone C 3 ; bottom: solution based on the binarization matrix B (1,3,1) , in which the aberration event A 2C (subclone C 3 ) happens before A 2G (clone C 4 ).The solutions are shown both as evolutionary trees (left) and in matrix form according to Equation (9).
4. Generate all evolutionary trees consistent with the partial trees comprising a maximum number of first generation trees.This step is performed as described above, but with the additional constraint that c ks1,ks2 = b ks1,ks2 for any pair of states s 1 and s 2 at position k, where 1 5. Optimize the shallowness constraint by sorting the generated solutions by the depth of the generated tree.

Benchmarks using simulated data
The models presented above show that TrAp is an efficient algorithm for inferring subclonal components from the aggregate measure.In particular we have shown that in the absence of noise TrAp returns the exact solution when the underlying subclonal population satisfies reasonable constraints and that the algorithm is always able to find at least one solution.However, experimental measurements are often noisy and can only have finite precision.In this section, we first discuss two approaches to treat noisy input and then we apply our algorithm to simulated aggregate signals from random in silico trees showing that TrAp is robust to typical noise levels found in genomic experiments.

Handling Measurement Errors
In order to accommodate different types of noise that may arise in genomic data, we incorporated two error models in the TrAp algorithm.In both error models, the input to the TrAp algorithm requires an additional vector ε of size N whose elements ε i are related to the precision of the measure y i .The error related to the dummy variable is denoted by ε 1 and is set to 0 as y 1 = 1 is a constraint of the model and thus ε 1 must vanish.First, we examine the bound error model in which we assign a threshold to the error ε i ≥ 0 of every underlying aggregate signal ỹi such that each measured signal y i will be in the range [ ỹi − ε i , ỹi + ε i ].Equation ( 6) is then modified accordingly and we can state that the subclone C i defined by aberration i is not populated if and only if: Next, we implement a normal error model assuming normally distributed measurements errors using a confidence interval approach to determine whether a subclone is populated or not.Specifically, we assume that the underlying aggregate signal is normally distributed around the observed signal, i.e. y i ∼ N ỹi , ε 2 i .We substitute each term of the left-hand side of Equation ( 6) by its normal distribution in order to derive the distribution r ∼ N µ r , σ 2 r with mean µ r = ỹi − N j=1 φ ij ỹj and variance Using the distribution of r and a desired confidence level α > 0 (default 0.05), we can define that clone C i is not populated if 0 falls within the confidence interval [q α 2 , q 1−α 2 ], where q α is the α quantile of the distribution of r.
Once the error model is chosen, the algorithm generates all optimal TrAp-trees in a similar fashion to the noise-free case.The main difference is that in the first step of the TrAp algorithm, Equation (10) (or a confidence interval on the distribution r ∼ N µ r , σ 2 r ) is used instead of Equation ( 6) to identify first generation trees.Moreover, instead of using back-substitution for finding the vector x, we solve the nonnegative linear least square problem Cx = y with the constraint x k = 0 for all non-populated subclones k associated with the parents of the first generation trees.This fitting allows us to obtain a value of exactly zero for all non-populated subclones and to distribute measurement error more evenly in the vector x.

Deconvolution of simulated noisy aggregates
To confirm that TrAp can correctly infer the subclonal composition from aggregate noisy signals with typical noise levels found in genomic experiments, we performed simulations starting with random in silico evolutionary trees (see Methods) with different numbers of aberrations N and different numbers of populated subclones P .For each tree, we also studied the effect of different magnitudes of measurement errors ε and we investigated the operating conditions for which TrAp would assign the best score to the true solution.
For each aggregate signal from a random tree we examined: (i) whether the true solution (i.e. the solution associated with the simulated tree), which is by construction one of the possible solutions to the subclonal deconvolution problem, had the minimum number of populated subclones among all solutions (sparsity constraint), (ii) whether the true solution had the minimum number of populated subclones and minimum number of levels of the evolutionary tree among all solutions generated by TrAp (sparsity and shallowness constraints), and (iii) whether the true solution was the only TrAp solution (sparsity and shallowness constraints and uniqueness of the optimal solution).
The results of the simulations show that aggregate signals from sparse trees are deconvolved correctly even in presence of typical noise levels of sequencing experiments (Figure 8).We note that for simulations of non-sparse trees, TrAp generates a large number of possible solutions of which only one is the true solution.Furthermore, in the presence of high levels of noise, the TrAp algorithm identifies a large number of first generation trees that satisfy Equation ( 10) and generates solution trees whose number of populated subclones is smaller than the number of populated subclones of the true solution.We generated 1000 simulations (see Methods) for each combination of number of populated subclones (columns), number of mutations (rows).We performed this analysis using different level of noise (error) drawn from a uniform distribution U (−ε, ε).
The heatmaps (tables) show the percentage of trees in each cell in which the TrAp solution has the minimum number of subclones (left panel), is also the shallowest solution with the best score (middle panel) and in addition to these two conditions is also unique (right panel).If the best solution is unique.

Analysis of simulated mixtures of biological data
Deconvolution of a mathematical mixtures of karyotyping data from single tumor biopsies After showing that our approach can correctly deconvolve aggregate signals of subclones with a tree-like genealogy, we sought to investigate whether actual subclonal populations can be charted on evolutionary trees.For this purpose we analyzed the Mitelman database, consisting of cytogenetic analyses of more than 60,000 biopsies (see Methods).For each tumor type we counted how frequently the relationships between cancer clones from the same biopsy could be explained by an evolutionary tree that follows the evolutionarity and parsimony constraints (but not necessarily the sparsity and shallowness constraints).We found that almost all biopsies in the Mitelman database can be represented by evolutionary trees (Table 1), with the notable exception of astrocytoma of grades III and IV (Figure 9).
We mathematically mix all karyotypes of each single patient from the Mitelman database and apply the TrAp algorithm for each of these mixtures.The ability of the TrAp algorithm to extract the correct underlying clonal or subclonal components depends on the number of actual components (columns) and the multiplicity of aberrations studied in each mixture (rows).The frequency in which TrAp is able to recover the correct underlying components is shown in percentages.The number of mixtures for a given size of aberration multiplex (row) and given number of actual underlying components (column) is shown in parentheses.Note that when the column index is greater than the row index (entries shown in bold) the parsimony constraint cannot be satisfied.
Next, we investigated whether the TrAp algorithm could uniquely deconvolve mixtures of the cancer subclones within a biopsy.As these aggregate signals are obtained by mixing actual subclonal profiles, we consider these signals to be more realistic than our previous in silico simulations.For each biopsy, we generated in silico mixtures by combining the cytogenetic profiles of each subclone using random nonnegative coefficients (see Methods).We then applied our TrAp method to deconvolve in silico mixtures of these cancer clones and found that TrAp could correctly deconvolve 99.7% of these realistic aggregate signals.However, we note that this task was trivial for a significant fraction of the biopsies whose number of aberrations and/or subclones is small.Moreover, the TrAp algorithm inferred also intermediate nodes in the evolutionary tree that did not correspond to any of the cytogenetic profiles for the biopsy, providing a plausible picture of the evolutionary order in which the aberrations occurred.Figure 10 shows the result of two deconvolution simulations, one from a melanoma sample with 2 subclonal populations [64] and one from an adenocarcinoma sample with 3 subclonal populations [65].An interesting, yet rare (only 5 examples, shown in bold in Table 1), case of biopsies showed more clones than aberrations.Albeit a treelike genealogical relationship can be constructed, these biopsies do not satisfy the parsimony constraint since the number of subclones M is greater than the number of mutations N .For this reasons, their genealogy cannot be reconstructed by the TrAp algorithm or by any other method that makes use of a similar parsimony constraint [48,56,58,59].

Melanoma Adenocarcinoma
Figure 10.Deconvolution of random mixtures of three subclones.The boxes represent different subclones, each denoted by the list of its aberrations.The aberration profiles of two subclones identified by cytogenetics in a melanoma biopsy (left) and the aberration profiles of three subclones identified in an adenocarcinoma biopsy (right) have been mixed in silico using random coefficients.In both case, the mixtures were successfully deconvolved.Aberrations are grouped within the boxes according to the order of occurrence.The reconstructed evolutionary trees suggest intermediate (white boxes), probably rare, subclones that were not reported in the cytogenetic data.
Deconvolution of a mathematical mixture of somatic hypermutation data with poly-allelic mutations in a single nucleotide Somatic hypermutation (SHM) introduces mutations that target the variable regions associated with immune adaptivity, i.e.Ig loci.In particular, SHM involves a programmed process of mutations that affects the variable regions of immunoglobulin genes and starts from an initial dividing single cell (a naïve B cell in this case).All descendants of the founder cell accumulate mutations and, at the same time, are subjected to a strong selective pressure.For this reason, SHM is a particularly good biological model system to test our deconvolution method, which imposes tree-like evolutionary constraints.We considered a dataset where a total of 20 mutated nucleotides in the V(D)J region of the Ig locus were measured in 8 sequences extracted from the same germinal center (see Methods) [66].This dataset was particularly interesting because of the high number of mutations found and because of the presence of poly-allelic mutations.
We mathematically mixed the multi-subclonal data and applied our TrAp algorithm taking into account that the SHM scenario consists of non-binary aberrations.In all simulations, TrAp was able to recover the original sequences and the solution was unique in 65% of the simulations.The TrAp-solution of one simulation is shown in Figure 11.However, even if the solution was not always unique, in more than 97% of the simulations there were only five or less candidate solutions satisfying the evolutionary, parsimony and sparsity constraints, all of which correctly identified six out of seven subclones.
In addition to the identification of the underlying subclones, the TrAp algorithm generates evolutionary trees, which represent the B cell lineage during the somatic hypermutation process.The reconstruction of B cell lineage can provides important insights into the mechanisms that regulate adaptive immunity.B cell lineage reconstruction is generally performed using maximum parsimony constraints [56] using the sequences of several Ig loci as input.However, in contrast to these approaches, the TrAp algorithm is able to generate maximum parsimony trees when only the relative frequency of mutations at each nucleotide is available.Therefore, the TrAp algorithm can be used to generate parsimonious evolutionary trees when only partial sequence information is available, e.g. when only short read sequences from a single aggregate sample are available, or when the loci analyzed span a region that is too large to be fully sequenced.

Comparison between subclonal aberration profiles inferred from heterogeneous cell populations and singl-cell aberration profiles
We analyzed data from a recent study on renal cell carcinoma where two aggregate samples and twenty single cells were isolated from a clear cell renal cell carcinoma (ccRCC) and subjected to exome sequencing.Interestingly, the original study only showed partial similarity between the single cells and the aggregate [24].However, since the single cells and the aggregate used in the experiments are from the same tumor, we sought to investigate whether any subclones inferred by TrAp would share a similar combination of mutations found in any of the single cells.
We applied our TrAp algorithm to the aggregate sample and obtained an evolutionary tree consisting of three main subclones.Due to the lack of extensive validations, we limited ourselves to investigate whether mutations that co-occur in the TrAp-solution also co-occur in single cell samples.We considered the mutations that were validated by bioinformatics analysis (Table S3A from Xu et al. [24]) and by PCR validation (TableS3B).The fraction of correctly estimated co-occurrences was 0.76 for mutations validated by bioinformatics analysis and 0.74 for mutations validated by PCR.

Melanoma data
Finally, we applied our algorithm to investigate evolutionary mechanisms in tumor metastases using exome sequencing data from three tumor metastases (TM1: left lateral chest wall, TM4: pleural cavity, and TM3: right axilla) a matched normal sample (N: left lateral chest wall) of one melanoma patient [67].TrAp can efficiently handle aggregate signal vectors of about 20 unique frequencies and therefore we perform deconvolution analysis only on one chromosome.We selected chromosome 18 as it contains the tumor suppressor Deleted in Colorectal Cancer (DCC ) gene, which is known to exhibit a high load Figure 11.Deconvolution of a random mixture of eight sequences from somatic hypermutation data.Eight sequences from the Ig locus of eight cells extracted from the same germinal center were mixed with the random coefficients given by x = [8.8%,22.6%, 20.2%, 7.9%, 5.7%, 8.5%, 10.9%, 15.5%].Since sequences 5 and 8 are identical, they are grouped in a single clone whose relative frequency is 5.7% + 15.5% = 21.2%.In total, twenty mutated nucleotides were found in the data and two different mutations were identified at position 170.Mutations are shown using the notation "position : reference→ mutated", e.g. the notation 170 : A → G indicates that the nucleotide at position 170 was mutated from Adenine to Guanine.The notation 170 : A → G → C indicates that the nucleotide at position 170 was mutated twice, first from Adenine to Guanine and then from Guanine to Cytosine.In this example, all seven subclones were correctly deconvolved by the TrAp algorithm, the frequency of the subclones was correctly estimated and the solution was unique. of mutations only in melanoma [68], in contrast to low expression, loss of heterozygosity or epigenetic silencing in other tumors.
To apply the TrAp algorithm, we first preprocessed the data and selected 19 mutations on chromosome 18 (see Methods).We labeled each mutation according to the gene affected and the amino acid change caused by the mutation (e.g. the label DCC.L1099H indicates a mutation in the DCC gene that causes a mutation from a Leucine to a Histidine at position 1099 in the DCC protein).There are six mutations with > 99% frequency in all samples (including the normal): ADNP2.G54G, ALPK2.I2157V, CD226.S307G, DCC.F23L, NETO1.S481N, and SLC39A6.E119D.The only other mutation found in the normal sample was TCEB3CL.S301C, which occurs with frequency > 90% in all samples.Moreover, the mutations ALPK2.R136S, CHST9.S122N, FAM38B.V2463, LAMA1.S1577A, LAMA1.K2002E, MYOM1.T215M, SERPINB10.R246C and SLC14A2.A880T were found in all three tumor samples and shared a similar frequency profile.The mutations DSC3.A28D, DSG1.M11V and IMPACT.D125E were found only in the metastases samples TM3 and TM4 and shared a similar frequency profile.Finally, the mutation DCC.L1099H was found only in the sample TM3.
Independent runs on the three metastatic samples gave 33 optimal solutions for TM1, 222 for TM3 and only 1 TrAp-solution for TM4.These high number of solutions is due to the substantial noise of the experiment (in the range 0.005 − 0.025) and the fact that in samples TM1 and TM3, two of the subclones have similar frequencies and are thus difficult to separate from one another.However, TrAp identified an unique solution in the sample TM4, where the three populated subclones are distributed with significantly different frequencies (Figure 12 middle).Next, we reasoned that the metastatic TM1, TM3 and TM4 samples may share common ancestors and that their evolutionary profiles may be related.We then applied our TrAp algorithm while also imposing that all evolutionary trees must be a subset of one global evolutionary tree.This approach returned an unique solution for each sample, all of which were among the solution sets identified in the previous analyses.We observe that this approach can be very powerful since, in principle, it allows the reconstruction of very large trees by combining several snapshots of the related subclonal populations.
The results of the deconvolution are shown in Figure 12.We observe the presence of two main branches.Mutations in the left branch of TM4 (77%) are more abundant than in TM1 and TM3 (48%).We note that LAMA1, a protein that is involved in cell adhesion, is present in the right branch.44% (19%) of the subclones of TM3 (TM4) have mutations in DSC3, DSG1 and IMPACT.The TM3 subclone also acquires a second mutation in the DCC gene (DCC.L1099H-L) in addition to the mutation DCC.F23L, which was hereditary.The novel mutation in the DCC gene occurs close to the boundary between the extracellular domain and the transmembrane domain of the protein product.The resulting Histidine amino-acid is positively charged, opposed to the Leucine amino-acid of wildtype, which is neutrally charged.Because this change is next to the cell membrane, it may have repercussions on the functionality of the DCC protein product, perhaps causing inactivation, similar to the inactivation caused by loss of heterozygosity and transcript suppression observed in other cancer types.

Discussion
In the present study we presented the TrAp algorithm, a tool for inferring subclonal composition and abundance from a single aggregate measurement experiment.As we have shown, TrAp is robust to noise and it can deconvolve mixtures where multiple mutations occurs at the same locus.TrAp efficiently enumerates all possible solutions that are compatible with the model constraints, thus always identifying the sparsest and most parsimonious solution(s).However, TrAp will also generate trees (cf.Equation 2) in cases where no tree structure can be inferred.As we have shown, such structures, while deviating from the true underlying population structure, can still capture relevant co-occurrences of mutations that are specific to certain subclones.Further, in contrast to parsimonious neighbor-joining approaches, which rely on sampling single subclones from the population (e.g.single cell experiments), TrAp utilizes one single aggregate experiment as input, thus overcoming the issue of small sampling size, which may be insufficient to cover the whole spectrum of subclones in a sample.We successfully deconvolved systems of up to 25 aberrations.Although this number may not always be large enough to consider all somatic mutations found in a tumor sample, this problem can be circumvented by clustering aberrations with similar frequencies before running the TrAp algorithm.
The level of characterization achieved by subclonal deconvolution holds high potential for person-  alized therapies.Possible applications include the classification of subclones in primary tumors, the identification of the seeds of metastases, tracing of resistant subclones especially after drug treatments and developing treatment strategies to eliminate resistant subclones.Furthermore, our proposed model can be applied to other medical problems, such as tracing bacterial or viral paths of adaptation within the infected host, detailed genome-wide reconstruction of the epigenetic differentiation program, or class specification in the hematopoietic system or in other systems.

Methods
In this section, we describe the procedures used to preprocess input data for the TrAp algorithm.

Data processing and generation
Random in silico evolutionary trees We performed simulations by sampling genotypes whose size N ranged from 1 to 12 and with underlying number of populated subclones P ranging from 1 to N −1.The simulations were repeated for measurement error values ε equal to 10 −2 , 10 −3 and 10 −4 .For each combination of these quantities, we performed 1000 runs using in silico generated data as follows: during each run, a random evolutionary tree with N aberrations was generated.The set of P populated subclones was then selected by first including all leaves of the generated tree and then adding the remaining subclones randomly.The frequency of the populated subclones was randomly assigned and the frequency of the non-populated clones was set to zero.Next, the aggregate frequency vector ỹ was calculated from the generated tree.Finally, we perturbed each element of ỹi by adding an error ε i drawn from a uniform distribution U (−ε, ε).The elements y i = ỹi + ε i and the error ε i are used as input for the bound error model option of the TrAp algorithm.

Cytogenetic data
The cytogenetic data was obtained from the Mitelman database, which contains 61,846 biopsies as of August 15, 2012.We accessed the database through the CGAP Website [69] and we filtered out 29,842 biopsies with uncertain calls (indicated by a "?" in the karyotype data).We focused our search only on aberrations that are binary by nature, namely chromosome deletions and translocations.For each biopsy, we performed 100 in silico simulations in which we mixed the subclones using random nonnegative coefficients.

Somatic hypermutation data
SHM sequencing data was derived from B cells undergoing somatic hypermutation (SHM), a process that leads to high-affinity antibody molecules [70].In details, we analyzed sequences of the V(D)J region extracted from the same germinal center from the sample 11930d16 4print.2,which was sequenced by Anderson et al. [66,71].The sequences were aligned using the IMGT alignment tool [72,73] in order to properly align the V, D and J regions.We selected 8 sequences that were aligned to the same V(D)J sequence (V 1 (D 1 )J 1 ).Since these sequences are from the same germinal center and align to the same V(D)J sequence, they are expected to stem from a single naïve B cell and have evolved through the somatic hypermutation process.From the sequencing experiment, 20 mutated nucleotides were identified.Furthermore, a polyallelic mutations were found at position 170, where both A → C and A → G mutations were observed.Next, we considered the 7 unique sequences (sequences 5 and 8 were identical) as representatives of the genome of 7 different subclones.Similarly to the previous application, we mixed these subclones using random nonnegative coefficients and performed 1000 simulations using random nonnegative coefficients.

Exome capture sequencing data
Exome-capture data [74] was obtained from a recent clear cell renal cell carcinoma (ccRCC) study [24] and from the melanoma patient YUHUY of the Yale cohort, for which DNA from normal circulating lymphocytes and three metastases (TM1, TM3 and TM4) were subjected to exome-capture Illumina Hi-Seq sequencing [67].Exome-Seq reads from the aggregate samples of the ccRCC patient were combined and aligned to the human reference genome (assembly hg19) using Bowtie [75] with parameters "-n3 -k1 -m20 -l32 -chunkmbs 1024 --best --strata".The frequency and coverage of each point mutation was computed using VarScan [76].We further selected the mutations that were validated by Xu et al. [24] by PCR validation (TableS3B) and by bioinformatics analysis (Table S3A ), whose genomic coordinated were realigned to the assembly hg19 using the Lift-Over tool of Galaxy [77].
For the melanoma patient YUHUY [67], we selected 19 mutations that were homozygous in the normal sample (i.e.y ≈ 0 or y ≈ 1), had high sequence coverage (i.e. more than 200 reads covering the specific nucleotide) and were localized on chromosome 18.
Since none of the genomic positions analyzed contained poly-allelic mutations, we assign a binary state (normal/mutated) to each selected genomic position and we estimated the aggregate signal and measurement error for each mutation event using a normal approximation as where n i is total the number of reads covering position i and m i is the number of reads covering position i where the nucleotide i is mutated.Finally, the y and ε vectors were used as input for the TrAp algorithm, which was set to use the normal error model.

Implementation of the TrAp algorithm
TrAp was programmed in Java.TrAp makes use of the Java Matrix package [78] for linear regression and code by Josh Vermaas to solve the nonnegative least square problem using JAMA.The Java Universal Network/Graph Framework (JUNG) [79] is used for creating pictorial representations of evolutionary trees.
TrAp is released under the GNU Lesser General Public License 3.0 and can be downloaded on the Source-Forge repository of Kluger's lab at the URL http://sourceforge.net/projects/klugerlab/files/TrAp/.

Figure 4 .
Figure 4. Identification of first generation trees.In this example, the aggregate signal frequency vector y = [1, 0.6, 0.4, 0.35, 0.3, 0.1] is consistent with three first generation treesT 1 = {C 1 , C 2 , C 3 }, T 2 = {C 1 , C 2 , C 5 , C 6 } and T 3 = {C 3 , C 5 , C 6 }.Each first generation tree is visualized as a matrix equation X = Y − ΦY according to Equation (5) (left) and as a partial evolutionary tree (right).In the bottom row, the partial tree P T 1 given by the union of the partial trees T 1 and T 3 is shown.Question marks indicate values that are unknown as they are not specified by the first generation tree or by the partial tree.

1 Figure 8 .
Figure 8. Deconvolution of simulated data.We generated 1000 simulations (see Methods) for each combination of number of populated subclones (columns), number of mutations (rows).We performed this analysis using different level of noise (error) drawn from a uniform distribution U (−ε, ε).The heatmaps (tables) show the percentage of trees in each cell in which the TrAp solution has the minimum number of subclones (left panel), is also the shallowest solution with the best score (middle panel) and in addition to these two conditions is also unique (right panel).If the best solution is unique.

Figure 12 .
Figure 12.Evolutionary trees inferred from three metastases of a melanoma patient.Each subclone in these trees is represented by a box with a list of mutations that includes only its new mutations (ancestral mutations can be read off by tracing back the mutation lists of all of its ancestors).Mutations are labeled according to the gene affected and the amino acid change caused by the mutation (e.g. the label DCC.L1099H indicates a mutation in the DCC gene that causes a mutation from a Leucine to a Histidine at position 1099 in the DCC protein).Highly expressed genes from this patient are indicated in bold.Mutations in the left branch of TM4 are more abundant than in TM1 and TM3.44% (19%) of the subclones of TM3 (TM4) have mutations in DSC3, DSG1 and IMPACT.The TM3 subclone has an additional mutation in DCC.

Table 1 .
Applicability of the TrAP algorithm for different number of aberrations events and underlying subclones.