Phylogenetic Analysis of Guinea 2014 EBOV Ebolavirus Outbreak

Abstract
Members of the genus Ebolavirus have caused outbreaks of haemorrhagic fever in humans in Africa. The most recent outbreak in Guinea, which began in February of 2014, is still ongoing. Recently published analyses of sequences from this outbreak suggest that the outbreak in Guinea is caused by a divergent lineage of Zaire ebolavirus. We report evidence that points to the same Zaire ebolavirus lineage that has previously caused outbreaks in the Democratic Republic of Congo, the Republic of Congo and Gabon as the culprit behind the outbreak in Guinea.


Introduction
A recent article 1 suggests that the ongoing outbreak in Guinea is caused by a divergent variant of the Zaire ebolavirus (EBOV) lineage. EBOV has previously caused Ebola outbreaks in the Democratic Republic of Congo (DRC), the Republic of Congo (RC) and Gabon. The authors publish three complete genome sequences from the Guinea outbreak and perform a phylogenetic analysis using 24 sequences of the Zaire and other representative lineages. One finding is that the 2014 sequences fall as a divergent lineage outside the Zaire lineage, suggesting that this may be a pre-existing endemic virus in West Africa rather than the result of spread of the EBOV lineage from the Central African countries that have had previous human outbreaks.
Previously, a dynamic re-interpretation of EBOV emergence in Central Africa has been suggested, citing correlations between time, geographic distance and genetic distance of Ebola haemorrhagic fever outbreaks 2 and the recent ancestry of related EBOV lineages in fruit bats 3.

May 2, 2014 · Research
Gytis Dudas

Materials and Methods
All complete genome sequences from the genus Ebolavirus (which includes the Bundibugyo (BDBV), Reston (RESTV), Sudan (SUDV), Tai Forest (TAFV) and Zaire (EBOV) ebolavirus species) were collated from GenBank, including the sequences from the Guinea outbreak. GenBank accessions and sources for the sequences can be found at http://epidemic.bio.ed.ac.uk/ebolavirus_sequences. The Ebolavirus genome consists of a single strand of negative sense RNA and contains 7 protein coding genes (in order 3′-NP-VP35-VP40-GP-VP30-VP24-L, separated by various intergenic regions) 4. We collated the protein coding regions of each gene (alignment length 14647 nucleotides) and, in a separate alignment, the non-coding intergenic regions. Phylogenetic trees were inferred in PhyML 5 or MrBayes 6 using the GTR 7 +Γ substitution model. We were able to replicate the analysis presented in Baize et al. 1 only when omitting the accommodation of rate heterogeneity modelled as a discretized Γ distribution. We suspect the difficulty in replicating the analysis is due to a combination of using different sequences, a different alignment and the inherently unreliable rooting of the EBOV clade using highly divergent sequences from other ebolavirus clades. We have uploaded the alignments we used (whole genome, coding and non-coding) to a GitHub repository at https://github.com/evogytis/ebolaGuinea2014.
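The partitioning into coding and non-coding alignments can be illustrated with a minimal sketch. It assumes the alignment is held as a dictionary of equal-length strings and that gene boundaries are known as alignment columns. The function name `partition_alignment` and the toy coordinates are hypothetical and are not taken from the alignments in the repository.

```python
def partition_alignment(alignment, coding_regions):
    """Split an alignment into concatenated coding and non-coding columns.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    coding_regions: list of (start, end) column ranges, 0-based, end-exclusive.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()

    # Mark every column that falls inside at least one coding region.
    is_coding = [False] * length
    for start, end in coding_regions:
        if not 0 <= start < end <= length:
            raise ValueError(f"region ({start}, {end}) is outside the alignment")
        for col in range(start, end):
            is_coding[col] = True

    coding, noncoding = {}, {}
    for name, seq in alignment.items():
        coding[name] = "".join(b for b, c in zip(seq, is_coding) if c)
        noncoding[name] = "".join(b for b, c in zip(seq, is_coding) if not c)
    return coding, noncoding


# Toy example (made-up sequences): two "genes" at columns (2, 6) and (8, 11).
toy = {"seqA": "ttATGCggTAAc", "seqB": "tcATGTgaTAGc"}
coding, noncoding = partition_alignment(toy, [(2, 6), (8, 11)])
```

Keeping the two partitions as separate alignments makes it straightforward to infer a tree from each and compare where the outgroup attaches.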
We also compiled a dataset containing only the glycoprotein (GP) sequences, for which more sequences are available. Many of the extra sequences come from wild ape carcasses 8 in Gabon and RC.
These sequences were analyzed in BEAST 9 to establish a time frame for the split of the Guinea viruses from other EBOV lineages. The data were analyzed using the GTR+Γ nucleotide substitution model, an uncorrelated relaxed molecular clock (following a lognormal distribution) 10 and under different demographic models (constant population size, exponential growth or the non-parametric Bayesian skyride 11).
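Under an uncorrelated lognormal relaxed clock, each branch of the tree receives its own rate, drawn independently from a shared lognormal distribution. The following is a minimal sketch of that sampling step only, assuming the distribution is parameterized by its real-space mean and the standard deviation of the log rate. The function name and the numerical values are illustrative assumptions; this is not BEAST's implementation.

```python
import math
import random


def draw_branch_rates(n_branches, mean_rate, log_sd, seed=None):
    """Draw one independent rate per branch from a lognormal distribution.

    mean_rate is the mean in real space (substitutions/site/year);
    log_sd is the standard deviation of log(rate).
    """
    if mean_rate <= 0 or log_sd <= 0:
        raise ValueError("mean_rate and log_sd must be positive")
    rng = random.Random(seed)
    # For a lognormal, E[rate] = exp(mu + log_sd**2 / 2), so solve for mu.
    mu = math.log(mean_rate) - log_sd ** 2 / 2.0
    return [rng.lognormvariate(mu, log_sd) for _ in range(n_branches)]


# Illustrative values only: log_sd is arbitrary, the mean is a rate of the
# same order as published EBOV estimates.
rates = draw_branch_rates(n_branches=50_000, mean_rate=7e-4, log_sd=0.5, seed=1)
sample_mean = sum(rates) / len(rates)
```

Because the rates are independent across branches, neighbouring branches are free to differ, which is what distinguishes this clock from autocorrelated alternatives.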
The GP results reported here come from the relaxed molecular clock analysis under an exponential growth tree prior (which reduces to a constant population size scenario when the growth rate is 0); the analysis was found to be quite robust to the choice of demographic model.
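The relationship between the exponential growth and constant size models can be made concrete with a short sketch. It assumes the usual coalescent parameterization in which time runs backwards from the present; the function name and the numbers are illustrative only.

```python
import math


def exponential_population_size(n0, growth_rate, t):
    """Effective population size t years before the present.

    n0 is the present-day size; growth_rate is per year, forward in time.
    """
    return n0 * math.exp(-growth_rate * t)


# With a growth rate of 0 the size is the same at every time point,
# so the constant-size model is a special case of exponential growth.
constant = [exponential_population_size(100.0, 0.0, t) for t in (0, 10, 40)]
growing = [exponential_population_size(100.0, 0.1, t) for t in (0, 10, 40)]
```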

Analysis
An alignment of complete genomes and a maximum likelihood tree (PhyML) appear to confirm the phylogenetic position shown in the recent paper 1 (Figure 1), although the position of the Guinea outbreak sequences is not very well supported.
When only the coding sequences are used, the Guinea outbreak sequences appear to be derived from within the diversity of Gabon/DRC EBOV lineages. Expanding the EBOV region of the tree (same tree as Figure 2, but with the divergent ebolavirus species cropped out) we see that the Guinea outbreak sequences are nested within the EBOV clade.
[...] Figure 3 and on the 1995 Kikwit outbreak for the intergenic regions in Figure 4). This shows that the rooting of this clade using the highly divergent other ebolavirus species is very problematic.
However, EBOV is estimated to evolve at about 7×10⁻⁴ substitutions per site per year 12, which means that the virus will have accumulated a substantial number of substitutions over the nearly 40 years since the first recorded outbreak in 1976. We can use this to root the EBOV tree and look at where the Guinea outbreak lies. Path-O-Gen (available at http://tree.bio.ed.ac.uk/software/pathogen/) was used to find the root that gave the best association between genetic divergence and time.
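To give a sense of scale, the expected divergence along a single lineage can be worked out directly from the quoted rate. The sketch below assumes a constant rate, takes 2014 as the most recent sampling year, and uses the coding alignment length from the methods as the number of sites. It is a back-of-the-envelope calculation rather than part of the analysis, and it comes to a little under 0.03 substitutions per site, or a few hundred substitutions across the coding regions.

```python
RATE = 7e-4            # substitutions per site per year, as cited in the text
FIRST_OUTBREAK = 1976  # first recorded outbreak
LATEST_SAMPLE = 2014   # assumed year of the most recent sequences
CODING_SITES = 14647   # length of the coding alignment described in the methods

years = LATEST_SAMPLE - FIRST_OUTBREAK
per_site_divergence = RATE * years
expected_substitutions = per_site_divergence * CODING_SITES

print(f"{years} years -> {per_site_divergence:.4f} substitutions per site")
print(f"about {expected_substitutions:.0f} substitutions across the coding alignment")
```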
The relationship between genetic divergence and time after rooting the tree using least squares regression is shown in Figure 5.
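The regression behind a root-to-tip plot can be sketched as follows. For a given rooting, each tip contributes its sampling date and its genetic distance from the root; the slope of the least-squares line estimates the substitution rate and the x-intercept estimates the date of the root. This is a minimal sketch of the regression for one fixed rooting, not Path-O-Gen's code, which additionally searches over root positions. The function name and the data are hypothetical.

```python
def root_to_tip_regression(dates, divergences):
    """Ordinary least-squares fit of divergence = rate * (date - t_root).

    Returns (rate, t_root, r_squared): the slope, the x-intercept
    (the date at which divergence is zero) and the fit quality.
    """
    n = len(dates)
    if n != len(divergences) or n < 2:
        raise ValueError("need at least two (date, divergence) pairs")
    mean_x = sum(dates) / n
    mean_y = sum(divergences) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in divergences)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, divergences))
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    t_root = -intercept / rate if rate != 0 else float("nan")
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else float("nan")
    return rate, t_root, r_squared


# Synthetic tips, NOT data from this study: generated with a rate of 7e-4
# and a root in 1970, so the fit should recover those two numbers.
dates = [1976.0, 1995.0, 2002.0, 2007.0, 2014.0]
divergences = [7e-4 * (d - 1970.0) for d in dates]
rate, t_root, r_squared = root_to_tip_regression(dates, divergences)
```

Repeating this fit for every candidate root position and keeping the one with the best fit is the idea behind temporal rooting.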

Estimating the date of introduction of EBOV into Guinea
The analysis of GP sequences in BEAST revealed rooting consistent with that found in Figure 6 as well as a nucleotide substitution rate (mean of lognormal distribution from which the rates were drawn is 1.07×10 ) on a scale expected, given previously published rates 12 and the fact that GP codes for a surface glycoprotein.
In Figure 7 the estimate of the split between the lineage now causing an outbreak in Guinea and the Central African lineage that had caused outbreaks in DRC and Gabon is late 2002 (95% HPD interval 2000-2006). This gives us a lower boundary on the introduction of the Central African lineage of EBOV into Guinea, although these estimates should be interpreted with caution. We also find very good support for the common ancestry of Guinea and DRC/Gabon lineages (posterior probability = 1.0). Figure 7 also highlights the importance of environmental sampling: many sequences in the tree come from ape carcasses and are more diverse (not shown) than sequences from human outbreaks, giving this dataset much better resolution.
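For readers unfamiliar with the term, a 95% highest posterior density (HPD) interval is the shortest interval that contains 95% of the posterior samples of a parameter, here the date of the split. The following is a minimal sketch of one common way to compute it from a list of samples. It is not the code used to produce the interval above, and the samples in the example are synthetic.

```python
import math
import random


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < mass <= 1:
        raise ValueError("mass must be in (0, 1]")
    ordered = sorted(samples)
    n = len(ordered)
    # Number of samples each candidate interval must cover.
    window = max(1, math.ceil(mass * n))
    best_low, best_high = ordered[0], ordered[window - 1]
    # Slide the window along the sorted samples and keep the narrowest one.
    for i in range(1, n - window + 1):
        low, high = ordered[i], ordered[i + window - 1]
        if high - low < best_high - best_low:
            best_low, best_high = low, high
    return best_low, best_high


# Synthetic "posterior" of a split date, NOT output from the BEAST analysis.
rng = random.Random(42)
samples = [rng.gauss(2003.0, 1.5) for _ in range(20_000)]
low, high = hpd_interval(samples)
```

Unlike an equal-tailed interval, the HPD interval need not be symmetric around the mean, which matters for skewed posteriors such as node ages.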

Conclusion
The phylogenetic analysis of the five ebolavirus species here does not substantially improve on that presented by Baize et al. 1 in that even when partitioning the alignment into coding and non-coding regions we get inconsistent rooting positions for the EBOV clade. We believe that at present no suitable outgroup sequences to root the EBOV phylogeny exist and that a temporal rooting gives the most consistent results.
This approach indicates that the outbreak in Guinea is likely caused by a Zaire ebolavirus lineage that has spread from Central Africa into Guinea and West Africa in recent decades, and does not represent the emergence of a divergent and endemic virus.
As the GP sequences show, without more diverse sequences, especially those from the animal reservoir, it is difficult to narrow down the estimates of when and through what means the Central African EBOV lineage has been introduced into West Africa.

Competing Interests
The authors have declared that no competing interests exist.
