Cocos: Constructing multi-domain protein phylogenies

Phylogenies of multi-domain proteins have to incorporate macro-evolutionary events, which dramatically increases the complexity of their construction


Introduction
Domains characterize proteins structurally, evolutionarily and functionally [1] . More than half of the proteins in prokaryotes and about 80 percent of the proteins in eukaryotes are composed of multiple domains [2] . About 200 domains in eukaryotes occur in diverse architectures [3] and provide a challenge for phylogenetic inference, as proteins can be composed of non-homologous elements. Evolutionary events such as the fusion of proteins or the loss of domains need to be considered in phylogenetic analyses of multi-domain proteins (MDPs).
Behzadi and Vingron put forward an iterative procedure (BV) to reconstruct ancestral domain compositions using the phylogenetic relationship within domain families [4] . See Figure 1 for an overview. Their algorithm minimizes the number of the macro-evolutionary events protein fusion and domain loss using a set-theoretic formulation and is independent of the order of the domains in the proteins.

Introduction
Domains characterize proteins structurally, evolutionarily and functionally [1] . More than half of the proteins in prokaryotes and about 80 percent of the proteins in eukaryotes are composed of multiple domains [2] . About 200 domains in eukaryotes occur in diverse architectures [3] and provide a challenge for phylogenetic inference, as proteins can be composed of non-homologous elements. Evolutionary events such as the fusion of proteins or the loss of domains need to be considered in phylogenetic analyses of multi-domain proteins (MDPs).
Behzadi and Vingron put forward an iterative procedure (BV) to reconstruct ancestral domain compositions using the phylogenetic relationship within domain families [4] . See Figure 1 for an overview. Their algorithm minimizes the number of the macro-evolutionary events protein fusion and domain loss using a set-theoretic formulation and is independent of the order of the domains in the proteins. 3. Recursively walk through the species tree nodes (bottom-up). In all child species, the domains are already partitioned. Establish correspondence between domain nodes in the child species and those in the current species (relabeling of domain nodes), then find an optimal partition in the current species which is closest to all child partitions in terms of the weighted number of fusions and deletions.

PLOS Currents Tree of Life
Previous analyses of MDPs have decided against the use of phylogenetic trees for domains [5] or relied on establishing phylogenies only for domain trees with high internal bootstrap support [6] . Recently, an alternative approach for the reconstruction of MDPs including domain trees was proposed [7] .
We implemented BV, tested it and and identified critical issues that need to be addressed for successful reconstruction of phylogenies of MDPs using BV. Due to the large number of possible domain combinations a good set of partitions cannot be found by brute-force enumeration. We implemented a heuristic called weak edge erosion, which yields close to optimal solutions faster than simulated annealing suggested by [4] . In the practical application it showed that most domain trees are incongruent to each other and the species tree. We implemented a simple procedure to detect and rectify problematic cases.
In the following, we present our findings in detail, provide an implementation and show how to use it in practice.
First, BV and the individual improvements are introduced formally. 3. Recursively walk through the species tree nodes (bottom-up). In all child species, the domains are already partitioned. Establish correspondence between domain nodes in the child species and those in the current species (relabeling of domain nodes), then find an optimal partition in the current species which is closest to all child partitions in terms of the weighted number of fusions and deletions.

Methods
Previous analyses of MDPs have decided against the use of phylogenetic trees for domains [5] or relied on establishing phylogenies only for domain trees with high internal bootstrap support [6] . Recently, an alternative approach for the reconstruction of MDPs including domain trees was proposed [7] .
We implemented BV, tested it and and identified critical issues that need to be addressed for successful reconstruction of phylogenies of MDPs using BV. Due to the large number of possible domain combinations a good set of partitions cannot be found by brute-force enumeration. We implemented a heuristic called weak edge erosion, which yields close to optimal solutions faster than simulated annealing suggested by [4] . In the practical application it showed that most domain trees are incongruent to each other and the species tree. We implemented a simple procedure to detect and rectify problematic cases.
In the following, we present our findings in detail, provide an implementation and show how to use it in practice.
First, BV and the individual improvements are introduced formally. 3. Recursively walk through the species tree nodes (bottom-up). In all child species, the domains are already partitioned. Establish correspondence between domain nodes in the child species and those in the current species (relabeling of domain nodes), then find an optimal partition in the current species which is closest to all child partitions in terms of the weighted number of fusions and deletions.

Methods
Previous analyses of MDPs have decided against the use of phylogenetic trees for domains [5] or relied on establishing phylogenies only for domain trees with high internal bootstrap support [6] . Recently, an alternative approach for the reconstruction of MDPs including domain trees was proposed [7] .
We implemented BV, tested it and and identified critical issues that need to be addressed for successful reconstruction of phylogenies of MDPs using BV. Due to the large number of possible domain combinations a good set of partitions cannot be found by brute-force enumeration. We implemented a heuristic called weak edge erosion, which yields close to optimal solutions faster than simulated annealing suggested by [4] . In the practical application it showed that most domain trees are incongruent to each other and the species tree. We implemented a simple procedure to detect and rectify problematic cases.
In the following, we present our findings in detail, provide an implementation and show how to use it in practice.
First, BV and the individual improvements are introduced formally.

The algorithm by Behzadi and Vingron
The algorithm BV uses the information of domain trees and the composition of extant proteins to infer the domain composition of ancestral proteins [4] The entire tree is reconstructed in a bottom-up pass including the root.

Heuristics for partitioning
As the number of possible partitions for domains into genes grows rapidly, complete enumeration is only

The algorithm by Behzadi and Vingron
The algorithm BV uses the information of domain trees and the composition of extant proteins to infer the domain composition of ancestral proteins [4]. Assigning costs for union and deletion yields the partitioning score The entire tree is reconstructed in a bottom-up pass including the root.

Heuristics for partitioning
As the number of possible partitions for domains into genes grows rapidly, complete enumeration is only

The algorithm by Behzadi and Vingron
The algorithm BV uses the information of domain trees and the composition of extant proteins to infer the domain composition of ancestral proteins [4]. The original publication contains a worked example recommended for further study. Let be a species tree in which is the parent species of and . Each species, e.,g. , is assigned a set of domains that belong to a domain family . Let be a partition of called a domain composition family of domain compositions . As these are sets, the ordering of domains within a gene is ignored.
Phylogenetic The entire tree is reconstructed in a bottom-up pass including the root.

Heuristics for partitioning
As the number of possible partitions for domains into genes grows rapidly, complete enumeration is only 3 PLOS Currents Tree of Life possible for the simplest cases and not suitable for real applications. For BV, it was suggested to use simulated annealing to solve the partitioning problem [4] . We explored a deterministic algorithm we call weak edge erosion (see below) to find a suitable partition

Weak Edge Erosion
Weak edge erosion is a hierarchical graph clustering method based on the idea of attacking a network of affinities between elements at its weakest points and recursively create clusters by separating network components. The affinity graph is defined as follows: Consider an undirected loop-free graph with a vertex set . An edge between vertices is assigned a cost according to the number of sets in the child partitions that contain as a subset after relabeling. Note that each set induces a clique in this affinity graph. The edge weight measures the affinity of two vertices to occur together in a set. Thus removing edges corresponds to breaking the affinity between elements. To find a good approximation for the partitioning problem, we let store in a tree node , cut the graph and store the resulting components in child nodes of . These nodes are processed recursively until the vertex set in a node scores 0 with the score . Then the nodes are checked in a bottom-up pass as to whether their subset or the subsets of their children score better, and this solution is passed to the parent.
Cluster boundaries are induced by regions sparse in edges. But dividing a set to create partitions translates to cutting through a large number of edges in a clique, so minimal cuts and related concepts of connectivity are of limited use. Particularly, min-cut tends to separate single vertices from cliques, thus creating a suboptimal partition.
To obtain meaningful cuts for our purpose, we introduce the concept of weak edges. Let each vertex be labeled by the sum of weights of all its incident edges. From the perspective of an edge, high vertex weights mean that the vertices have a strong connection to a set of other vertices, so edges are regarded to become weaker as their vertex weights grow. We thus define a total order of weakness: let be two edges, and be two vertices such that . is weaker than , denoted by , if the corresponding edge has a lower weight: This first condition accounts for that we prefer weaker affinities to be violated. If the edge weights are equal, we want to exploit the weakening effect of high vertex weights, thus the weakness order is determined by the heavier vertices: If these are equal as well, the relation between the lighter vertices decides: The concept of edge weakness is illustrated in Figure 2. As the cost function is additive, we can find an approximate solution to the partitioning problem by splitting into nested recursively and combine the local results. We use Algorithm 1 to find a good partition of with a cost . The weakest edges are removed until the graph is decomposed into two or more components. A cut tree of is built such that a node contains and its children contain the connected components. See Figure 3 for an example.
possible for the simplest cases and not suitable for real applications. For BV, it was suggested to use simulated annealing to solve the partitioning problem [4] . We explored a deterministic algorithm we call weak edge erosion (see below) to find a suitable partition

Weak Edge Erosion
Weak edge erosion is a hierarchical graph clustering method based on the idea of attacking a network of affinities between elements at its weakest points and recursively create clusters by separating network components. The affinity graph is defined as follows: Consider an undirected loop-free graph with a vertex set . An edge between vertices is assigned a cost according to the number of sets in the child partitions that contain as a subset after relabeling. Note that each set induces a clique in this affinity graph. The edge weight measures the affinity of two vertices to occur together in a set. Thus removing edges corresponds to breaking the affinity between elements. To find a good approximation for the partitioning problem, we let store in a tree node , cut the graph and store the resulting components in child nodes of . These nodes are processed recursively until the vertex set in a node scores 0 with the score . Then the nodes are checked in a bottom-up pass as to whether their subset or the subsets of their children score better, and this solution is passed to the parent. To obtain meaningful cuts for our purpose, we introduce the concept of weak edges. Let each vertex be labeled by the sum of weights of all its incident edges. From the perspective of an edge, high vertex weights mean that the vertices have a strong connection to a set of other vertices, so edges are regarded to become weaker as their vertex weights grow. We thus define a total order of weakness: let be two edges, and be two vertices such that . is weaker than , denoted by , if the corresponding edge has a lower weight: This first condition accounts for that we prefer weaker affinities to be violated. If the edge weights are equal, we want to exploit the weakening effect of high vertex weights, thus the weakness order is determined by the heavier vertices: If these are equal as well, the relation between the lighter vertices decides: The concept of edge weakness is illustrated in Figure 2. As the cost function is additive, we can find an approximate solution to the partitioning problem by splitting into nested recursively and combine the local results. We use Algorithm 1 to find a good partition of with a cost . The weakest edges are removed until the graph is decomposed into two or more components. A cut tree of is built such that a node contains and its children contain the connected components. See Figure 3 for an example.
possible for the simplest cases and not suitable for real applications. For BV, it was suggested to use simulated annealing to solve the partitioning problem [4] . We explored a deterministic algorithm we call weak edge erosion (see below) to find a suitable partition

Weak Edge Erosion
Weak edge erosion is a hierarchical graph clustering method based on the idea of attacking a network of affinities between elements at its weakest points and recursively create clusters by separating network components. The affinity graph is defined as follows: Consider an undirected loop-free graph with a vertex set . An edge between vertices is assigned a cost according to the number of sets in the child partitions that contain as a subset after relabeling. Note that each set induces a clique in this affinity graph. The edge weight measures the affinity of two vertices to occur together in a set. Thus removing edges corresponds to breaking the affinity between elements. To find a good approximation for the partitioning problem, we let store in a tree node To obtain meaningful cuts for our purpose, we introduce the concept of weak edges. Let each vertex be labeled by the sum of weights of all its incident edges. From the perspective of an edge, high vertex weights mean that the vertices have a strong connection to a set of other vertices, so edges are regarded to become weaker as their vertex weights grow. We thus define a total order of weakness: let be two edges, and be two vertices such that . is weaker than , denoted by , if the corresponding edge has a lower weight: This first condition accounts for that we prefer weaker affinities to be violated. If the edge weights are equal, we want to exploit the weakening effect of high vertex weights, thus the weakness order is determined by the heavier vertices: If these are equal as well, the relation between the lighter vertices decides: The concept of edge weakness is illustrated in Figure 2. As the cost function is additive, we can find an approximate solution to the partitioning problem by splitting into nested recursively and combine the local results. We use Algorithm 1 to find a good partition of with a cost . The weakest edges are removed until the graph is decomposed into two or more components. A cut tree of is built such that a node contains and its children contain the connected components. See Figure 3 for an example. If edges tie, such as , and , the heavier nodes decide:

PLOS Currents Tree of Life
If the upper nodes tie as well, as for , the weights of the lighter nodes decide: Walking through the cut-tree in a bottom-up procedure, we decide whether the block in the parent node or the two set blocks in its children yield a better score , and pass this local solution upwards until the root contains an approximate solution for the partitioning problem. If edges tie, such as , and , the heavier nodes decide: If the upper nodes tie as well, as for , the weights of the lighter nodes decide: Walking through the cut-tree in a bottom-up procedure, we decide whether the block in the parent node or the two set blocks in its children yield a better score , and pass this local solution upwards until the root contains an approximate solution for the partitioning problem. If edges tie, such as , and , the heavier nodes decide: If the upper nodes tie as well, as for , the weights of the lighter nodes decide: Walking through the cut-tree in a bottom-up procedure, we decide whether the block in the parent node or the two set blocks in its children yield a better score , and pass this local solution upwards until the root contains an approximate solution for the partitioning problem. The complete procedure is summarized in Algorithm 1: The complete procedure is summarized in Algorithm 1: The complete procedure is summarized in Algorithm 1:

Simulated annealing
The original authors of BV propose simulated annealing to solve the underlying partitioning problem [8]. We implemented a dynamic cooling schedule, which sets most parameters automatically [9]. Starting from a valid domain configuration, the fusion and fission of groups of domains and the swap of individual domains are used to generate related domain compositions. The simulated annealing procedure is then used to minimize the score of the domain composition.

Identification and curation of implausible domain trees
Due to their short length, the inferred domain phylogenies often disagree with each other and the species tree.
The BV algorithm was proposed for ideal data and does not consider errors in the underlying domain topologies.
A practical consequence of such errors are that the order of speciation and duplication events between adjacent domains do not agree. These conflicts can lead to duplicate nodes in the reconstructed composition, for example nodes that have successors within the same protein. The BV algorithm will then produce MDPs with a high partition score due to additional copies of the conflicting domains. Our algorithm aims to produce a partition with an improved score by applying nearest neighbor interchanges on the conflicting domain trees. For each modification the ancestral composition is reconstructed and the modification with the lowest score is used as a correction.

Simulation of MDP phylogenies
Species trees were generated following the Yule-Harding model. An initial domain composition of three families in a single protein was placed in the root and passed to its children, whose protein domain compositions underwent evolutionary events of genes (fusion, duplication) and domains (gain, duplication, loss). Subtree pruning and regrafting (SPR) operations were applied on the domain trees to create perturbed input data.

Construction of domain phylogenies
For an empirical evaluation we ran global models of the PFAM database [10] with HMMER's hmmpfam and hmmalign [11] against the UniProt/Swiss-Prot database [12] to identify domains and construct alignments.
Maximum Likelihood trees were inferred from the alignments using PhyML [13]. The trees were rooted using

Simulated annealing
The original authors of BV propose simulated annealing to solve the underlying partitioning problem [8]. We implemented a dynamic cooling schedule, which sets most parameters automatically [9]. Starting from a valid domain configuration, the fusion and fission of groups of domains and the swap of individual domains are used to generate related domain compositions. The simulated annealing procedure is then used to minimize the score of the domain composition.

Identification and curation of implausible domain trees
Due to their short length, the inferred domain phylogenies often disagree with each other and the species tree.
The BV algorithm was proposed for ideal data and does not consider errors in the underlying domain topologies.
A practical consequence of such errors are that the order of speciation and duplication events between adjacent domains do not agree. These conflicts can lead to duplicate nodes in the reconstructed composition, for example nodes that have successors within the same protein. The BV algorithm will then produce MDPs with a high partition score due to additional copies of the conflicting domains. Our algorithm aims to produce a partition with an improved score by applying nearest neighbor interchanges on the conflicting domain trees. For each modification the ancestral composition is reconstructed and the modification with the lowest score is used as a correction.

Simulation of MDP phylogenies
Species trees were generated following the Yule-Harding model. An initial domain composition of three families in a single protein was placed in the root and passed to its children, whose protein domain compositions underwent evolutionary events of genes (fusion, duplication) and domains (gain, duplication, loss). Subtree pruning and regrafting (SPR) operations were applied on the domain trees to create perturbed input data.

Construction of domain phylogenies
For an empirical evaluation we ran global models of the PFAM database [10] with HMMER's hmmpfam and hmmalign [11] against the UniProt/Swiss-Prot database [12] to identify domains and construct alignments.
Maximum Likelihood trees were inferred from the alignments using PhyML [13]. The trees were rooted using

Simulated annealing
The original authors of BV propose simulated annealing to solve the underlying partitioning problem [8]. We implemented a dynamic cooling schedule, which sets most parameters automatically [9]. Starting from a valid domain configuration, the fusion and fission of groups of domains and the swap of individual domains are used to generate related domain compositions. The simulated annealing procedure is then used to minimize the score of the domain composition.

Identification and curation of implausible domain trees
Due to their short length, the inferred domain phylogenies often disagree with each other and the species tree.
The BV algorithm was proposed for ideal data and does not consider errors in the underlying domain topologies.
A practical consequence of such errors are that the order of speciation and duplication events between adjacent domains do not agree. These conflicts can lead to duplicate nodes in the reconstructed composition, for example nodes that have successors within the same protein. The BV algorithm will then produce MDPs with a high partition score due to additional copies of the conflicting domains. Our algorithm aims to produce a partition with an improved score by applying nearest neighbor interchanges on the conflicting domain trees. For each modification the ancestral composition is reconstructed and the modification with the lowest score is used as a correction.

Simulation of MDP phylogenies
Species trees were generated following the Yule-Harding model. An initial domain composition of three families in a single protein was placed in the root and passed to its children, whose protein domain compositions underwent evolutionary events of genes (fusion, duplication) and domains (gain, duplication, loss). Subtree pruning and regrafting (SPR) operations were applied on the domain trees to create perturbed input data.

Construction of domain phylogenies
For an empirical evaluation we ran global models of the PFAM database [10] with HMMER's hmmpfam and hmmalign [11] against the UniProt/Swiss-Prot database [12] to identify domains and construct alignments.
Maximum Likelihood trees were inferred from the alignments using PhyML [13]. The trees were rooted using 7 PLOS Currents Tree of Life Notung [14].

Results and Discussion
As we cannot obtain true ancestral multi-domain protein compositions, we relied on simulations to test the validity of the algorithm and to inspect its performance when errors are introduced. The practical performance was evaluated on known instances of MDPs, available on the accompanying website. A brief, non-trivial example is presented in Fig. 1.

Simulations
To assess the performance of the algorithm on controlled input, species and domain phylogenies were simulated. The reconstructed domain compositions were compared to the original compositions using the partition distance measure, which ranges between zero and the size of the compositions [15]. Most ancestral partitions can be reconstructed perfectly but not all events can be mapped correctly. For example, the gain of a new domain family and the subsequent loss in one of its immediate children cannot be reconstructed accurately. The high standard deviation suggests that there are a few compositions which differ very much from their simulated counterpart. SPR operations lower the quality of the reconstruction.

Results and Discussion
As we cannot obtain true ancestral multi-domain protein compositions, we relied on simulations to test the validity of the algorithm and to inspect its performance when errors are introduced. The practical performance was evaluated on known instances of MDPs, available on the accompanying website. A brief, non-trivial example is presented in Fig. 1.

Simulations
To assess the performance of the algorithm on controlled input, species and domain phylogenies were simulated. The reconstructed domain compositions were compared to the original compositions using the partition distance measure, which ranges between zero and the size of the compositions [15]. Most ancestral partitions can be reconstructed perfectly but not all events can be mapped correctly. For example, the gain of a new domain family and the subsequent loss in one of its immediate children cannot be reconstructed accurately. The high standard deviation suggests that there are a few compositions which differ very much from their simulated counterpart. SPR operations lower the quality of the reconstruction.

Results and Discussion
As we cannot obtain true ancestral multi-domain protein compositions, we relied on simulations to test the validity of the algorithm and to inspect its performance when errors are introduced. The practical performance was evaluated on known instances of MDPs, available on the accompanying website. A brief, non-trivial example is presented in Fig. 1.

Simulations
To assess the performance of the algorithm on controlled input, species and domain phylogenies were simulated. The reconstructed domain compositions were compared to the original compositions using the partition distance measure, which ranges between zero and the size of the compositions [15]. Most ancestral partitions can be reconstructed perfectly but not all events can be mapped correctly. For example, the gain of a new domain family and the subsequent loss in one of its immediate children cannot be reconstructed accurately. The high standard deviation suggests that there are a few compositions which differ very much from their simulated counterpart. SPR operations lower the quality of the reconstruction.

Performance
The run times of our implementation are short. We simulated 50 replicates of input data containing 50 taxa and no regrafting operations and ran the implementation under Linux on an AMD 64 X2 3200+ processor. The erosion heuristic inferred a MDP reconstruction in 2.06s±0.25s, while the simulated annealing procedure took 153s±22s to achieve comparable solutions. In practice, the calculation of reliable domain families bounds the performance.

Application
JMJ domains are found in proteins involved in chromatin remodeling complexes often together with domain families such as ARID, PHD and PLUN-1 [16] . They provide a concise example, parts of which are presented in

Conclusion
Our solution for the inference of phylogenies of multi-domain proteins provides a simple and easy-to-use interface. The weak edge erosion heuristic provides considerable speed-up over simulated annealing while maintaining comparable solution quality. Beyond the application on MDPs, such methods could be applied to reconstruction of partial homologous units such as bacterial operons or protein complexes. Future work will be directed at improving the quality of the tree reconciliation.

Availability and requirements
The program was successfully tested under Python 2.6 and 2.7 on Windows, Mac OS X and Linux. It receives input for species and domain trees as well as parameters in Nexus format. The output can be visualized using GraphViz. The source code with additional figures and examples can be found at http://virulence.molgen.mpg.de/cocos/ and is freely available under a BSD license. Table 1. Average partition distance and standard deviation of reconstructed domain compositions. The simulated trees had 10, 30 and 50 taxa and up to ten SPR operations were applied. The distance was calculated between simulated and reconstructed domain compositions of all ancestral species and normalized by the number of taxa. Each experiment was repeated 50 times.

Author contributions
An automated correction of incongruent domain phylogenies is effective for cases with minor errors. Manual curation is advised if domain phylogenies cannot be inferred reliably. To this end, internal nodes with high scores are tagged for assessment.

Performance
The run times of our implementation are short. We simulated 50 replicates of input data containing 50 taxa and no regrafting operations and ran the implementation under Linux on an AMD 64 X2 3200+ processor. The erosion heuristic inferred a MDP reconstruction in 2.06s±0.25s, while the simulated annealing procedure took 153s±22s to achieve comparable solutions. In practice, the calculation of reliable domain families bounds the performance.

Application
JMJ domains are found in proteins involved in chromatin remodeling complexes often together with domain families such as ARID, PHD and PLUN-1 [16] . They provide a concise example, parts of which are presented in Fig. 1. Only three species were selected for display; using a larger number of species is advisable for the inference of individual domain phylogenies.

Conclusion
Our solution for the inference of phylogenies of multi-domain proteins provides a simple and easy-to-use interface. The weak edge erosion heuristic provides considerable speed-up over simulated annealing while maintaining comparable solution quality. Beyond the application on MDPs, such methods could be applied to reconstruction of partial homologous units such as bacterial operons or protein complexes. Future work will be directed at improving the quality of the tree reconciliation.

Availability and requirements
The program was successfully tested under Python 2.6 and 2.7 on Windows, Mac OS X and Linux. It receives input for species and domain trees as well as parameters in Nexus format. The output can be visualized using GraphViz. The source code with additional figures and examples can be found at http://virulence.molgen.mpg.de/cocos/ and is freely available under a BSD license. Table 1. Average partition distance and standard deviation of reconstructed domain compositions. The simulated trees had 10, 30 and 50 taxa and up to ten SPR operations were applied. The distance was calculated between simulated and reconstructed domain compositions of all ancestral species and normalized by the number of taxa. Each experiment was repeated 50 times.

Author contributions
An automated correction of incongruent domain phylogenies is effective for cases with minor errors. Manual curation is advised if domain phylogenies cannot be inferred reliably. To this end, internal nodes with high scores are tagged for assessment.

Performance
The run times of our implementation are short. We simulated 50 replicates of input data containing 50 taxa and no regrafting operations and ran the implementation under Linux on an AMD 64 X2 3200+ processor. The erosion heuristic inferred a MDP reconstruction in 2.06s±0.25s, while the simulated annealing procedure took 153s±22s to achieve comparable solutions. In practice, the calculation of reliable domain families bounds the performance.

Application
JMJ domains are found in proteins involved in chromatin remodeling complexes often together with domain families such as ARID, PHD and PLUN-1 [16] . They provide a concise example, parts of which are presented in Fig. 1. Only three species were selected for display; using a larger number of species is advisable for the inference of individual domain phylogenies.

Conclusion
Our solution for the inference of phylogenies of multi-domain proteins provides a simple and easy-to-use interface. The weak edge erosion heuristic provides considerable speed-up over simulated annealing while maintaining comparable solution quality. Beyond the application on MDPs, such methods could be applied to reconstruction of partial homologous units such as bacterial operons or protein complexes. Future work will be directed at improving the quality of the tree reconciliation.

Availability and requirements
The program was successfully tested under Python 2.6 and 2.7 on Windows, Mac OS X and Linux. It receives input for species and domain trees as well as parameters in Nexus format. The output can be visualized using GraphViz. The source code with additional figures and examples can be found at http://virulence.molgen.mpg.de/cocos/ and is freely available under a BSD license. 9 PLOS Currents Tree of Life