Sampling bias and model choice in continuous phylogeography: Getting lost on a random walk

Phylogeographic inference allows reconstruction of past geographical spread of pathogens or living organisms by integrating genetic and geographic data. A popular model in continuous phylogeography—with location data provided in the form of latitude and longitude coordinates—describes spread as a Brownian motion (Brownian Motion Phylogeography, BMP) in continuous space and time, akin to similar models of continuous trait evolution. Here, we show that reconstructions using this model can be strongly affected by sampling biases, such as the lack of sampling from certain areas. As an attempt to reduce the effects of sampling bias on BMP, we consider the addition of sequence-free samples from under-sampled areas. While this approach alleviates the effects of sampling bias, in most scenarios this will not be a viable option due to the need for prior knowledge of an outbreak’s spatial distribution. We therefore consider an alternative model, the spatial Λ-Fleming-Viot process (ΛFV), which has recently gained popularity in population genetics. Despite the ΛFV’s robustness to sampling biases, we find that the different assumptions of the ΛFV and BMP models result in different applicabilities, with the ΛFV being more appropriate for scenarios of endemic spread, and BMP being more appropriate for recent outbreaks or colonizations.


Introduction
Genetic data can be very informative of migration histories and spatial patterns of living organisms, and of geographic spread of outbreaks, in particular when combined with information regarding present and past geographic ranges. Phylogeography combines genetic and geographic data to study geographical spread; in the context of geographic spread of outbreaks, which we will focus on in this manuscript, phylogeography often interprets observed genetic sequences as the result of sequence evolution along an evolutionary phylogenetic tree (see (1)), while modeling spatial spread as a separate evolutionary process along the same phylogeny (see e.g. (2)(3)(4)(5)(6)(7)(8)). In recent years, Bayesian phylogeographic inference has gained remarkable popularity, in large part due to convenient implementations such as in the Bayesian phylogenetic inference software package BEAST (9,10). Bayesian phylogeography in BEAST allows users to investigate past geographical spread using genetic sequences possibly collected at different times. Genetic data is integrated with geographical and temporal sampling information, and all data is interpreted jointly in terms of evolution along a phylogenetic tree with heterochronous leaves (6,7,(11)(12)(13)(14)(15). BEAST uses Markov chain Monte Carlo (MCMC) to efficiently integrate over the joint parameter space -which can also include parameter related to demographic reconstruction and phenotypic trait evolution -and in doing so, accurately accounts for uncertainty in phylogeny and model parameters, and possibly uncertainty in sampling time and location. Bayesian phylogeographic approaches in BEAST can be divided into two categories depending on the type of geographical data: discrete space phylogeography and continuous space phylogeography. Discrete space phylogeography is typically used when samples are clustered based on their geographic location; this is appropriate when spread within a geographical unit is more or less free, while spread between units is hindered by geographical or political barriers (such as bodies of water, mountain chains, national borders, etc). In this case, the geographical data for a collected sample consists of a discrete geographical unit (e.g. a country). Oftentimes, the use of discrete phylogeography is one of necessity, e.g. when only the country of origin of the collected samples is known. Evolution of this location over time (e.g. spread between countries) along the phylogeny is usually modeled using a continuous-time Markov chain (see (6,11)), similarly to popular phylogenetic models of sequence evolution (see (1)). On the other hand, when the longitudinal and latitudinal coordinates of the samples are known, and when spread is assumed to happen more or less in a geographically homogeneous way over some area (such as on one island, or within one continent), continuous space phylogeography is often employed. This approach typically models geographical spread along the branches of the tree as a Brownian motion process, which can be thought of as consisting of many small movements in random directions over many short time intervals (see (7,14)). The results of continuous phylogeography can subsequently be used to determine factors causing non-homogeneous spatial spread through space (16,17). A problem of discrete space phylogeography is that sampling bias (samples not being collected across locations proportionally to their prevalence) can strongly affect statistical inference (13). Unbiased sampling can be very hard to achieve, as it requires knowledge of the full geographic range of an outbreak, access to the whole of this range, and extensive sampling and sequencing efforts. An alternative is to use models that are not affected, or less affected, by sampling biases, such as the structured coalescent and its approximations (see (12,13,15)). The structured coalescent model, however, is far more computationally demanding than classical discrete space phylogeography and can differ from it on several aspects other than sampling assumptions. For example, the structured coalescent assumes that the migratory process and the distribution of cases across locations are at equilibrium, but these assumptions are rarely met in practice and do not match outbreaks that recently expanded into new areas.
Here, we investigate the effect of sampling biases in continuous space phylogeography. We show that sampling only certain areas of an outbreak can result in strongly inaccurate inference of dispersal history and the related model parameters. A possible alternative to the Brownian motion phylogeography ("BMP") model used in continuous space phylogeography is the spatial Λ-Fleming-Viot process ("ΛFV") recently introduced in population genetics (see (18)(19)(20)(21)(22)(23)). The ΛFV addresses, among other things, the undesirable equilibrium properties of classical models of geographic spread (24). The ΛFV represents an alternative to the BMP, robust to sampling bias. We here show that the BMP and the ΛFV are non-interchangeable models, which are suitable for very different evolutionary scenarios. We also investigate the use of "sequence-free" samples (samples without genetic information) as a means to correct or help diagnose the effects of sampling biases on BMP.

Materials and methods
We assume that N samples s 1 , . . . , s N have been collected, and each sample s i is associated with a genetic sequence S i , a collection time t i , and a location of collection l i ∈ R 2 . Location l i is made up of longitude l (1) i and latitude l (2) i , and represents the location of the sample at the time t i of collection. Sequence S i represents the genome (or part of the genome) of the sample, and usually provides most of the phylogenetic information. We assume that the phylogenetic tree τ is a time-stamped phylogeny, where the dates of the tips are known (corresponding to the collection times t i ) and can differ from each other; branch lengths are represented in units of time. Our main focus is to infer the history of geographical spread, represented in particular here by the reconstruction of the location of the root node of τ , and to infer the parameters of the migration process itself. We use two models to simulate and infer the migration process in continuous space: Brownian motion phylogeography (BMP) and the spatial Λ-Fleming-Viot process (ΛFV). Below we describe both models in detail.
Brownian Motion Phylogeography (BMP). BMP assumes that changes in location happen along branches of τ according to a time-homogeneous Brownian (Wiener) diffusion pro-cess (25,26). Given any branch b of length t in τ , and assuming that we know the location l = (l (1) , l (2) ) of the parent node of this branch, then, under the BMP, the distribution of potential locations of the child node of b is centered on l and is multivariate normally distributed with variance tP −1 , where P is the precision matrix of the BMP. In other words, conditional on the top node of b being in position l, the location of the bottom node of the branch has distribution N 2 (l, tP −1 ). We assume that the precision matrix P is the same for all branches, and has three free parameters: two marginal precisions, and the correlation coefficient between dimensions. These parameters describe respectively how fast spatial movement happens in each dimension and how correlated the movements are in the two dimensions. For simplicity, we assume no changes in diffusion rates across branches, although we recognize that variation in diffusion rates is important in many real-life scenarios (7). Under the BMP, the posterior probability of a set of parameters τ (the phylogeny), Θ (the parameters describing sequence evolution along τ ), and P (the precision matrix of the BMP) conditional on the data t 1 , . . . , t N , S 1 , . . . , S N , l 1 , . . . , l N is: This means that, given τ and P, the migratory history (and therefore the observed locations) is independent of genetic data and evolution. Similarly, given τ and Θ, sequence evolution (and therefore observed sequences) is independent of geographic data and migratory process. It is usually not feasible to calculate the probability of the data (known as the marginal likelihood, or the normalizing constant), P (t 1 , . . . , t N , S 1 , . . . , S N , l 1 , . . . , l N ), which appears in the denominator above. Instead, BEAST employs MCMC to obtain samples from the posterior density of model parameters without the need to calculate this probability. The terms in the nominator are: • the prior P (P) on the precision matrix P (usually a Wishart distribution (7)), and the prior P (Θ) on the substitution model and parameters Θ. For P (Θ), many choices are possible, depending on prior information available regarding the mutational process, and the models considered (27).
• the tree prior P (τ, t 1 , . . . , t N ) which represents the prior probability of observing a given tree and sampling times. Possible priors can be based on birthdeath models (28) or coalescent models (29) (note however that for coalescent priors a different notation from Equation 1 is required, conditioning all probabilities on the sampling times t 1 , . . . , t N ).
• the geographic likelihood P (l 1 , . . . , l N |τ, t 1 , . . . , t N , P) is the probability of the geographic locations given the precision matrix P and tree. This can be efficiently calculated by integrating out the location of internal tree nodes, similarly to Felsenstein's pruning algorithm but for a continuous trait (14,31). Some approaches opt for Gibbs sampling the ancestral node locations, for example in the work of (7); in such cases, the notation of Equation 1 needs to be slightly modified.
There are a number of features that distinguish the BMP from the ΛFV presented in the next section, which are important to keep in mind. In Figure 1A we give a graphical representation of the BMP, and we here provide a short summary of the features of the model: • BMP assumes that the prior probability of the tree τ is not affected by the migration process P. Note however that the posterior probability of the tree might instead be very much affected by the geographic migration model and parameters.
• BMP normally does not assume boundaries on possible geographic locations, so sample and ancestral node coordinates can be anywhere in the considered space (including in bodies of water, for example). Prior ancestral root locations can also be specified, see (32), and a normal prior distribution over root location is typically assumed, see (7)).
• BMP does not assume that the density of the overall population of cases over space and over time is uniform or at equilibrium, and does not aim to describe, at least explicitly, the migratory and reproductive dynamics of the whole population, but only of the ancestral lineages of the considered samples. It assumes instead that there is no interaction among cases (for example, limited resources or susceptible individuals within one area), so that different lineages evolve and spread independently of each other no matter how close they are in geographic space.
• in BMP, sampling locations are considered a result of pathogen spread, and not an arbitrary choice of the investigator. As such, sampling locations, even in the absence of genetic sequences, can be very informative about the process of geographic spread, as it is assumed that sampling locations are representative of the geographic range of the pathogen. This also means that absence of samples from certain areas will be interpreted by the model as evidence of absence of cases from such areas. In practice, if the sampling process is dependent on geography, for example when cases from some areas are more likely to be sampled than cases from other areas, then the inference under BMP can be affected, as we show below. This should not necessarily be considered a negative aspect of the model: if there is no sampling bias, then considering sampling locations as informative of the process of geographic spread can increase the inference power of the model.
• Currently, no backward-in-time descriptions of the BMP exist; such a description of a dual process of the BMP could be useful for performing BMP inference while avoiding assumptions about (and therefore biases from) the sampling process. Spatial Λ-Fleming-Viot Process (ΛFV). The ΛFV can be used to model migration and evolution of individuals within a population distributed across an area. The geographical area A under consideration is usually a torus (as in the simulator discsim (22)), or a rectangle (as in the phylogeographic inference software PhyREX (23)). Migration is only allowed from and into A, potentially representing, for example, the case of an island or a continental mass. Individuals of the population are assumed to be spread over A with uniform density ρ. Migration and reproduction of individuals are modeled through reproduction-extinction events (from now on, just "events") which happen at rate λ over time. Each event e i happening at time t i is centered at a location c i taken at random uniformly from A. Individuals in the population are affected by the event according to their distance from c i . For example, in discsim all individuals within a radius r around c i are affected, while in PhyREX individuals are affected with a probability that decreases with their distance from c i (specifically, according to a normal kernel with variance θ 2 ). Individuals affected by e i then die with probability µ, and new individuals are born around c i . In the case of disc events (as in discsim), new individuals are born uniformly within the event disc with density ρµ. In the case of normal kernel events, as in PhyREX, new individuals are similarly placed so to leave the population distribution uniform. Lastly, one (or more in case of recombination (22,33)) parents for all the individuals born at e i are chosen, again with a probability that decreases as a function of the distance from c i (again, either uniformly on a disc as in discsim or with a normal kernel as in PhyREX, for example). While the ΛFV is very different from the BMP, some aspects of the two models can be compared. For example, for narrow event kernels (i.e. small θ), per-dimension displacement of individuals after time t is approximately normally distributed with variance tσ 2 , where σ 2 = 4πθ 4 λµ/|A| and |A| is the area of A (23), and at the limit of very small θ and very large λ, the movements of individuals approach a Brownian motion with diffusion rate σ 2 . Similarly, in the case of disc events of radius r, the mean per-dimension diffusion rate approaches σ 2 = πr 4 λµ 2|A| (see Supplementary Section ). Despite the fact that for small and frequent events individuals might move almost in a Brownian motion, there are still significant differences between the ΛFV and the BMP. The posterior probability of ΛFV model parameters (which we collectively represent as Λ), of a certain history E of events E = {e 1 , . . . , e |E| }, of tree τ , and of substitution model parameters Θ is: . (2) Similarly to the BMP, samples from the joint posterior density of model parameters can be obtained using MCMC, as is done by PhyREX. The terms in the numerator are: • the prior P (Λ) on the ΛFV model parameters, and the prior P (Θ) on the substitution model parameters Θ.
• the likelihood of the history of events, and the ancestry and ancestral locations of the samples P (τ, E|Λ, t 1 , . . . , t N , l 1 , . . . , l N ), which can be computed following (23).
In Figure 1B we give a graphical representation of the ΛFV. From Equation 2 and the description of the model, a number of differences with the BMP can be noted, of which we again provide a summary here: • in the ΛFV, the probability of a tree τ can be affected by the spatial dynamics of the model.
• the ΛFV is defined over a finite space, and is hence more appropriate at describing migration within a limited area (such as an island or continent).
• the ΛFV assumes that the spatial density of the population is homogeneous and at equilibrium. This means that the model describes the case where resources are homogeneously spread across the environment, and the pathogen or species is endemic within an area (this excludes recent colonizations or recent outbreaks where the pathogen has not yet spread across the whole area).
• calculating the likelihood of the ΛFV, at least in implementations proposed so far (23,34), requires the explicit parameterization of individual events. This means that inference under this model is typically going to be more computationally demanding than inference under the BMP, except for scenarios with very few events.
• the ΛFV always conditions on sampling times and sampling locations (see Equation 2). This is because, while the population is assumed homogeneously distributed through time and space, the sampling process is assumed to be arbitrary and not reflective or related to the density of the population or the migratory history. As such, the ΛFV should not be affected by any sampling bias.
• the ΛFV model has a backward-in-time dual process (22,23). This process describes the distribution of past events given data collected later on (see term P (τ, E|l 1 , . . . , l N , t 1 , . . . , t N ) above), thereby naturally accommodating possible spatial sampling biases.

Sampling biases in BMP.
To investigate the effect of sampling bias on BMP, we simulated evolution and migration under the same BMP model used for inference, and tested different sampling scenarios. We simulated a Yule phylogenetic tree with birth rate 1.0, and stopped the simulations when 1000 tips were generated. Genetic sequences were assumed 10kb long, and we simulated their evolution using an HKY model (κ = 3 and uniform nucleotide frequencies) and a substitution rate of 0.01 per unit time, ensuring reasonable levels of genetic diversity to allow reliable phylogenetic inference. Trees and sequences were simulated using DendroPy (35). Using a custom python script, we simulated migration along the tree under the BMP model with two independent dimensions each with diffusion rate equal to 1 unit of square distance per time unit, and we always placed the root in (0.0, 0.0). Of the 1000 tips in the total tree (representing all the cases in the considered outbreak), we sampled 50 tips (representing the samples collected and sequenced) under four different strategies to simulate different types of sampling bias: • in the first scenario ("Random Sampling"), samples were collected independently of their location, and as such no bias is expected and there is no model misspecification; • in the second scenario ("Central Sampling"), the closest samples to the source of the outbreak (0.0, 0.0) were collected; • in the third scenario ("Diagonal Sampling"), the samples closest to the x = y diagonal were collected; • in the fourth and last scenario ("One-Sided Sampling"), the samples with the highest X coordinate (the most eastern samples) were collected.
We used BEAST v1.10.4 to perform inference under the classical BMP model (7), assuming the default priors in BEAUti. During inference we did not restrict the two diffusion processes in the two dimensions to be independent or of equal rate, and inferred the correlation in the two diffusion processes and their rates. During both inference and simulations we assumed a constant rate migration process (see (7)). We ran the MCMC for 10 7 steps and sampled the posterior every 1000 steps, which was sufficient to reach convergence (ESS much higher than 200, checked using Tracer (36)). We ran 100 simulated replicates, and we analysed each replicate four times according to the four sampling scenarios above. Under these four sampling scenarios, we find at least moderate correlation between samples' geographic distances and genetic distances: averages over 100 simulations of 0.192 for random sampling, 0.042 for central sampling, 0.176 for diagonal sampling, and 0.260 for one-sided sampling. This suggests that, in all scenarios, at least a moderate amount of signal to estimate geographic spread is present in the generated data. We found that the sampling strategy affects root location inference using BMP (Figure 2A,D and Supplementary Figure S1). In the absence of sampling bias, inference appears accurate (unbiased and calibrated, Figure 2A). With central or diagonal sampling bias, the uncertainty and error of root location is further reduced (Supplementary Figure S1C-F), but this probably reflects the fact that samples were collected close to the true origin of the simulated outbreak. When collecting samples at one extreme end of the outbreak, instead, we found that root location inference is strongly biased, with posterior distributions usually not containing the true simulated origin locations ( Figure 2D). The effects of sampling bias on the inference of BMP migration parameters are even more noticeable ( Figure 2B,C,E,F and Supplementary Figure S2). While inference of diffusion rate with no sampling bias is correct and calibrated ( Figure  2B), in every biased sampling scenario it is underestimated.
In particular, with central and diagonal sampling the posterior distributions usually do not contain the true value ( Figure 2E and Supplementary Figure S2). The reason for this is probably that the small sampled range (compared to the actual range of the outbreak) is interpreted as evidence of a small outbreak range, and therefore as low diffusion rate (absence of samples in an area interpreted as absence of cases). In the case of diagonal sampling, BMP also infers a strong correlation between the migration processes in the two dimensions (the true value of 0 covariance is never covered by the posterior distributions, Figure 2F).
To test the effects of tree uncertainty and sequence data, we also ran inference under the scenario that the simulated tree is perfectly known, representing the case in which sufficient genetic information is available so that there is negligible uncertainty in tree inference. We provided no input alignment and specified no phylogenetic likelihood or substitution model, but instead fixed the tree to the simulated one and removed all transition kernels in BEAST that affect the tree. In this case, our analyses required substantially fewer MCMC steps (10 4 ), with parameters sampled every 10 steps. We found virtually identical results as those presented in Figure  Compensating the effects of sampling biases using sequence-free samples. The biases shown above originate from the fact that the BMP assumes that samples are collected independently of location, and so the absence of samples from an area is evidence -for the BMP -of absence of cases in that area. Here, we explore the possibility of for compensating the effects of sampling bias in BMP by adding "sequence-free" samples to the analyses. This is representative of the case, for example, that we know that an outbreak has spread into a location, and we know the time and place of some of the cases in that location, but we cannot collect or sequence samples from those cases; so, some of the samples will be "proper", that is, will encompass genetic sequences, while the other "sequence-free" samples will have sampling location and time, but no genetic sequence (see also (37,38)).
To recreate this scenario, we used the 100 Yule trees simulated before. As before, from each simulation, we considered 50 tips sampled according to the four sampling scenarios, representing "proper" samples with genetic sequence. Then, we selected another 50 sequence-free tips randomly (and independently of location) from the remaining 950 tips. These other 50 sequences were added to the BEAST analyses (for a total of 100 samples per replicate) without sequence data (or, more precisely, with uninformative sequences made only of gap characters "-") but with correct sampling location and date.
Adding these extra sequence-free samples greatly reduces the effects of sampling biases, but does not eliminate them (Figure 2G-I and Supplementary Figures S6, S7). To completely eliminate these biases one would probably have to include enough sequence-free samples to make the overall sampling strategy unbiased, which would however come at considerable additional computational demand and be impractical in most real-life scenarios. Plots A-C are from simulations with non-biased sampling. Plots D-F are respectively with "One-sided" sampling bias, "Central" sampling bias, and "Diagonal" sampling bias. Plots G-I are like D-F but with the addition of 50 sequence-free samples (see Section ) collected independently of their geographic location. When plotting root locations, since in many cases the MRCA (root) of the collected samples is not the root of the whole simulated phylogeny (which is always located at (0.0, 0.0)), in each replicate all posterior locations are translated (in mathematical sense) so that the true MRCA location is always at (0.0, 0.0).
Can the ΛFV correct the sampling bias in BMP?.
As mentioned before, the ΛFV has a number of differences from the BMP. One of these differences is that the model does not assume that the sampled locations are representative of the range of the outbreak, but instead the model assumes uniform density of cases over a considered, limited space. For this reason, the ΛFV should not be affected by sampling bias (see "Models" Section). We performed inference using the software PhyREX within the package PhyML v3.3.20190909 (23) downloaded on 4th of January 2020 from https://github.com/ stephaneguindon/phyml.git. PhyREX implements the ΛFV model on a rectangular space. We fixed the tree to the simulated true one to greatly reduce the parameter space to be explored and to consider the case in which tree uncertainty is negligible (for example due to abundant genetic data). We used PhyREX to infer the diffusion rate (σ 2 ) of the migration process (see Supplement Section ) and the migration histories, together with the other parameters of the ΛFV model. We ran each PhyREX replicate analysis for 1 week or a maximum of 2 × 10 8 MCMC steps, sampling every 2000 steps. This seemed generally sufficient to reach convergence in all scenarios: in the vast majority of cases effective sample size (ESS) was larger than 100, and in most cases larger than 200 for all parameters considered.
First, we considered the same exact 1000-tips simulated Yule trees as described above, with BMP migration, four different sampling bias scenarios and 50 collected tips. The ΛFV model used for inference might now be very different from the BMP model used for simulations, and so model misspecification could have a considerable impact. One important difference is that the BMP has no spatial boundaries by default, while the ΛFV is defined over a finite space. In PhyREX, we define the geographical space (where the migration process takes place) to be a square with dimension double the maximum coordinate of any simulated outbreak case, and centered in (0.0, 0.0), so that all simulated samples are contained within the considered square. We find results from PhyREX to be very different from those in BEAST. Credible intervals of the root location are now much broader, and always contain the truth ( Figure 3A and Supplementary Figure  S8). On the other hand, the diffusion rate is highly overesti-mated, up to hundreds of times, and the corresponding posterior distributions usually do not contain the truth ( Figure  3B and Supplementary Figure S9). The large uncertainty in the root location is probably caused by the fact that the ΛFV model uses less information than the BMP (by not assuming that sampling locations are representative of prevalence) and is less affected by sampling bias; however, the high inferred diffusion rates suggest that model mis-specification also plays a strong role in these analyses. We found that setting a prior on the radius parameter so as to mimic BMP (i.e., migration events preferentially taking place over short distances) can often reduce this bias, allowing PhyREX to obtain realistic estimates of the dispersal parameter.  To further investigate the differences between the BMP and ΛFV models, we simulated trees and migration under the ΛFV model implemented in discsim (22). The ΛFV models in discsim and PhyREX differ in some aspects. One difference is that discsim assumes that death and recolonization events happen over discs, while PhyREX uses normal distribution kernels. Another difference is that discsim assumes that migration happens on a torus, while PhyREX uses a rectangle (no migration outside the rectangle allowed, representing, for example, the edges of a continent or island).
In discsim, we always assume a torus of length and width L = 100, and in PhyREX we run inference assuming a square space with the same dimensions. We simulated discs of radius r = 0.1, impact u = 0.1, and event rate λ = 2L 2 ur 4 π ; these parameters were chosen so that migration histories are composed of many small migration events, therefore approximating a Brownian motion, with diffusion rate per dimension approximately σ 2 = 1.0 (see Supplementary Section ). We consider two sampling strategies: • "wide sampling", where 100 samples are collected uniformly at random from the central square of dimensions 50 × 50.
• "narrow sampling", where 100 samples are collected uniformly at random from a central square of dimensions 10 × 10.
In both sampling strategies, differently from BMP simulations, the diffusion rate was not overestimated by PhyREX ( Figure 3F and Supplementary Figure S10). Root location inference in PhyREX is accurate, but posterior intervals usually span most of the simulated geographical range ( Figure  3E and Supplementary Figure S11). When we run BEAST inference on the discsim simulations, BMP seems to consistently underestimate the diffusion rate σ 2 ( Figure 3D and Supplementary Figure S12). While usually containing the true values, posterior distributions of root locations are even broader than those inferred by PhyREX, and, in particular, broader than the allowed geographical range ( Figure 3C and Supplementary Figure S13). These results suggest that the large discrepancies between the simulations under BMP and inference in PhyREX are due to model mis-specification and the inherent differences between the BMP and ΛFV models. In BMP simulations, the very high diffusion rate inferred by PhyREX is likely because the ΛFV model would usually assume that ancestral lineages traverse the considered geographical space several times, backward in time, before finally coalescing, at least in the limit of small and frequent events. The BMP, instead, not assuming endemicity but a rapid spread from an original location, expects shorter distances traveled before lineages find a common ancestor backward in time. This seems, conversely, also the most plausible reason why the BMP infers low diffusion rate in ΛFV simulations. It seems harder instead to explain why root location posterior distributions inferred by the BMP are broader than those inferred with the ΛFV in ΛFV simulations, while the opposite is true for BMP simulations. A possible reason is that, because the ΛFV assumes a finite space, inferred root locations have to be contained within this space, even if, as typical, lineages are inferred to travel, backward in time, long distances before reaching the root. Under the BMP, in contrast, geographical space is unlimited, and in ΛFV simulations the simulated tree is very long, suggesting long traveled distances from the root to the tips, and therefore high uncertainty in root locations, which more than offsets the effect of sample locations being concentrated inside the ΛFV finite space of interest.

Analysis of a West Nile Virus Outbreak.
To showcase the importance of these observations with respect to practical epidemiological and phylodynamic investigations, we consider a dataset from a recent West Nile Virus outbreak in North America (14). We choose this particular dataset due to availability of the data and of clear instructions on how to repeat the published analyses in BEAST https://beast.community/workshop_ continuous_diffusion_wnv (accessed on August 2019), reducing the chances of errors on our part. As described in the tutorial, we include sampling time, sampling location, and genetic sequence data for each sample. We use a separate HKY model for each of the three codon positions, but assume no variation in substitution rates across codons, and we assumed an uncorrelated relaxed molecular clock model (39) with an underlying lognormal distribution.
As the tree prior, we employ an exponential growth coalescent model. We assume homogeneous Brownian motion along tree branches.
To investigate the possible effects of sampling bias, we consider two datasets: the first including all samples, and the second including only the western-most half of the samples. This second scenario artificially recreates sampling bias, such as the case where only cases from one half of the country are accessible or considered. We consider the inference of the location of the root (MRCA) of the western half of the samples. The posterior densities of this same ancestor in the two analyses is very different: when using only western samples, this phylogenetic node is confidently placed in western USA, but when using the whole dataset this same node is confidently placed in the eastern USA instead (Figure 4). Another difference between the two analyses is that when restricting to just the western samples diffusion was inferred to be slower (95% HPD interval [166, 284] km/yr versus [339,498] in the full analysis). Next, we wanted to see whether, in this scenario, including some sequence-free samples from the eastern side of the country could help in the scenario of biased sampling. To do so, we ran an analysis of the 52 western samples with additionally the 52 eastern samples added as sequence-free samples. These sequence-free eastern samples were included with correct location and sampling time data but without sequence data. In this analysis, the inferred location of the considered node (the MRCA of the western samples) is now shifted eastward, but it is still very different from the inferred location of the same node from the full analysis ( Figure 4). It is remarkable that in this dataset, unlike in our simulations, the addition of sequence-free samples does not seem to alleviate the effects of sampling bias very much. One possible explanation for this observation is that, unlike in our BMP simulations, in this case the outbreak seems to migrate westward as time progresses (14), a feature that sequence-free samples are insufficient, in this case, to capture, and that a more specific model might be able to address (40). This is also hinted at by the fact that performing the same analyses as above but removing the western samples from the full dataset instead of the eastern ones shows almost no effects of the artificially introduced sampling bias (Supplementary Figure S14). Analysing the same datasets with PhyREX also shows different estimates after removing the eastern samples, although this time there is considerable overlap between the different ancestral location estimates and different diffusion rate estimates (Supplementary Figures S15, S16, S17, S18). In principle we would not expect to see considerable differences for different subsampling schemes since the ΛFV model should be robust to sampling biases, as shown in our simulations. This further supports the hypothesis that the progressive westward shift of the outbreak plays a major role in the apparent strong effects of sampling bias in this case. A noticeable difference between BEAST and PhyREX results, also observed in simulations, is that the inferred uncertainty in ancestral location is much larger in PhyREX than in BEAST.

Analysis of a Yellow Fever Virus Outbreak.
As a second example of real world epidemic analysis, we considered a recent dataset of Yellow Fever Virus (YFV) from Brazil (41). 65 YFV genomes were collected between January and April 2017, mostly from the Brazilian state of Minas Gerais. Again, we chose this dataset due to availability of data and instructions for repeating the analysis https://beast.community/ workshop_continuous_diffusion_yfv (accessed on August 2019). Following the tutorial, we used the same substitution model as for the West Nile Virus dataset, a skygrid coalescent (42) tree prior with 36 grid points, and a Cauchy relaxed random walk model (7). When recreating sampling bias along a north-south gradient, we find that directional sampling bias seems to have considerable effect in BEAST analyses when removing northern samples (Supplementary Figure S20), greatly reduced by introducing sequence-free samples. We observed instead little impact from removing southern samples (Supplementary Figure S19). PhyREX inference seems, expectedly, mostly unaffected by sampling bias, and shows much broader posterior distributions for ancestral locations (Supplementary Figures S21 and S22). We also observed that many samples of this dataset were collected from few locations: six from Ladainha, five from Novo Cruzeiro, seven from Teófilo Otoni and five from Itambacuri. So, in a second alternative sub-sampling strategy, we reduced the maximum number of samples from any of these locations to two. As before, we aim to artificially recreate different sampling scenarios. We find that, after downsampling, the origin of the outbreak is not inferred anymore to be solely nearby Teófilo Otoni, but also possibly south, close to another cluster of samples near Caratinga ( Figure 5). A third possible, but low-probability area remains near Belo Horizonte, close to the phylogenetic outgroup location.

Fig. 4. Recreating the effects of biased sampling over a West Nile Virus outbreak investigation.
We re-analysed the West Nile Virus North America dataset of Pybus and colleagues (14). At top, we show the maximum clade credibility tree. Branch lengths are in years. Green circles represent eastern samples while red squares represent western samples. The red triangle in the tree represents the node whose location is considered here: the most recent common ancestor (MRCA) of all western samples. Below, the sample locations are shown on a map of the USA. Sample numbers are only used to link samples on the map onto the phylogeny. All three kernel density estimate areas (red, orange and blue) on the map represent the posterior densities of the location of the MRCA of all western samples (red triangle in the phylogeny). The red area represents the posterior from the analysis of only western samples; the blue area is the posterior from the analysis of all samples; the orange area is the posterior from the analysis of the western samples and of sequence-free eastern samples (eastern samples included but without sequence data).
These results further suggest that the decision of where to collect samples and which samples to include or exclude from a BMP analysis can significantly impact its conclusions, and that great care should be taken to make sure that the range of samples collected and their density reflect real geographic distributions.

Conclusions
We have shown that continuous space phylogeographic inference can be negatively affected by sampling biases, such as sampling efforts being focused in certain areas over others. These biases can lead to strongly mis-inferred ancestral node locations, up to completely excluding the true origin of outbreaks with complete confidence. These biases also usually lead to underestimating the dispersal velocity of pathogens, and can in some cases lead to inference of artificial patterns of correlated spread across space dimensions.
We explored possible ways to tackle these issues. A possible approach is to include sequence-free samples, which correspond to known cases (for which we know date and location) which have no corresponding genetic information. We find that sequence-free samples can considerably improve inference and compensate sampling biases, but that in most scenarios it would be computationally unfeasible or unrealistic to completely eliminate the effects of sampling biases, if possible at all. We confirm these results on real datasets from West Nile Virus and Yellow Fever Virus outbreaks by artificially recreating scenarios of sampling bias.
As an alternative, we investigated the use of an inference model that is not affected by sampling biases: the ΛFV implemented in PhyREX. Indeed, we found that this model is seemingly unaffected by different sampling strategies, but, more importantly, the model is also very different from the BMP, resulting in very different estimates. The assumptions and applicability of these two models being so different, we would expect few scenarios of common applicability. The BMP, in fact, well-describes the spread of outbreak within a new, unlimited environment, or at least within an area that is large compared to the current range of the outbreak. For example, in BMP simulations, lineages generally spread out from the original source and move in all directions, on average spreading further away from the origin as time progresses. The ΛFV, instead, fits better a scenario where an outbreak (or any population) has become endemic within an area, or at least where lineages are expected to have migrated across the area since their introduction. For example, in ΛFV simulations lineages usually tend to cross the considered geographic space several times before they all find a common ancestor. It is possible, however, that, in some scenarios or with some modifications, these two models would more substantially overlap in applicability. An example could be for example when restricting the allowed geographic range within the BMP to a limited space, that is, not allowing BMP migration outside of a confined area. In fact, we suspect that simulating migration under such a version of the BMP, and simulating a long phylogeny (in terms of distance traveled from the root before samples are collected) would lead to patterns very close to those simulated under the ΛFV.
In the future, it would be of great interest to make the BMP robust to sampling biases by conditioning the BMP geographical likelihood on the location of collected samples; however, so far, a simple solution remains elusive. While in this manuscript we have only considered a simple model of migration, that is with no directional bias and no variation in diffusion rate over time, location or lineages, it will be interesting in the future to investigate how the relaxation of these assumptions (7,40) would impact the results presented here. In conclusion, we report that often the choice of model and of sampling strategy has dramatic effects on the results of a continuous phylogeographic analysis. We therefore recommend attention be paid when deciding a sampling strategy for BMP so that the range and distribution of collected samples would reflect the geographical distribution of the outbreak as much as possible. We also recommend an appropriate phylogeographic model to be used, depending on the history and range of the considered outbreak.

Code availability
The code used for this project is is available from: https: //github.com/NicolaDM/Phylogeography

Supplementary Information
Estimates of diffusion rate σ 2 .
Defining σ 2 for the ΛFV in discsim. Here we want to relate the dispersion rate of lineages under the ΛFV simulation model in discsim with the diffusion rate inferred by PhyREX and the BMP in BEAST. To do this, we consider what is the dispersion rate of lineages in discsim given certain model parameter values, and in particular we focus on the limit of frequent but small events, so that lineages migrate approximately in a Brownian motion over short times (see (21)). Given λ the rate of events in the disc-based ΛFV, µ the death probability, r the radius of event discs, and assuming the considered space A is a torus with both dimensions of size L (see main text for a description of the ΛFV), then the rate of events overlapping with the location of a certain individual is: When an individual is covered by the disc D of an event, the probability that it dies is µ. When a new individual is born at location (x 2 , y 2 ), the mean square distance of one-event displacement from its parent at (x 1 , y 1 ) is the mean square distance between two points chosen uniformly at random in the disc; this is r 2 , as can be seen from the following (see also (43)): since, without loss of generality, we can assume that D is centred at 0 and observing that terms such as 2x 1 x 2 integrate to 0. So, over a very short time t, the mean square displacement of a lineage is t λµπr 2 L 2 r 2 = t λµπr 4 L 2 and the diffusion rate per dimension is To enforce σ 2 = 1 we therefore simulate under the condition λ = 2L 2 µπr 4 , and in particular with L = 100, r = 0.1, and µ = 0.1 .
Estimating σ 2 from the ΛFV in PhyREX. A theoretical estimate of σ 2 from the PhyREX inference is obtained similarly as following (23), where θ, similarly to r, measures the spatial size of events. Throughout the manuscript, we use this classical measure of σ 2 . However, as this is an approximation assuming a limit of a Brownian motion, we also test alternative statistics below, which however seem overall less reliable estimates. In one of the statistics, "dispersion from the root", we consider the average squared Euclidean distance from the current root location (at the current MCMC step) and the tip locations, and divide this by twice the time distance between the root and the tips (the tips are all assumed collected at the same time). Because lineages are inferred to travel several times across the considered space before coalescing, the dispersion from the root statistic would usually not represent the instantaneous dispersion rate of lineages well; in fact, we see in Figure S10 that this measure severely underestimates the diffusion rate from discsim simulations.
As another alternative we also consider the "dispersion across short distance from the tips", which is the sum of the squared Euclidean distances between each tip and its location after (backward in time) its first event affecting its location, divided by the sum of the times for each tip to each such event. This "dispersion near the tips statistic" better summarizes the short-term dispersion of lineages in the PhyREX model; however, this statistic seems to both underestimate the diffusion rate in discsim simulations ( Figure S10) and to overestimate it in BMP simulations ( Figure S9). Finally, we considered as a statistic the "dispersion across long distance from the tips", which is half the average square Euclidean distance of the tips from their ancestral position one time unit in the past. This statistic seems to overestimate the diffusion rate in BMP simulations ( Figure S9) while giving overall unreliable estimates in discsim simulations ( Figure S10). All four estimates of σ 2 mentioned above (the theoretical one, the dispersion from the root, dispersion across short distance from the tips, and the dispersion across long distance from the tips) have been included in PhyREX and are now part of its output.
σ 2 for the BMP. For the BMP in our simulations we used identity diffusion and precision matrices, which leads, over a short time t, to a mean square displacement of 2t and so to σ 2 = 1. Fig. S1. Effects of sampling bias on BMP root location inference. Here a BMP model was used both for simulation and inference.

Supplementary Figures.
Plots show inferred posterior distributions for the X dimension position of the tree root (plots A,C,E,G), and its Y dimension position (plots B,D,F,H). In each plot, the 100 distributions represent 100 independent replicates, and are vertically sorted based on the posterior median. Vertical black lines show the true, simulated values (in this case always 0). Plots A,B are from simulations with non-biased samples, plots C,D with "Central" biased samples, plots E,F with "Diagonal" biased samples, and plots G,H with "One-sided" sampling bias. Since in many cases the MRCA of the collected samples is not the root of the whole simulated phylogeny (which was simulated at location (0, 0)), in each simulation all locations are translated (in mathematical sense) so that the true simulated sample MRCA is always at (0, 0). simulated values (in this case 1 for rates and 0 for the correlation). Plots A-C are from simulations with non-biased samples, plots D-F with "Central" biased samples, plots G-I with "Diagonal" biased samples, and plots J-L with "One-sided" sampling bias.  (plots A,D), the diffusion rate along the X dimension (plots B,E), and the correlation between the diffusion in the two dimensions (plots C,F). In each plot, the 100 distributions represent 100 independent replicates, and are vertically sorted based on the posterior median. Plots A-C are from simulations with non-biased samples. Plots D,E,F are respectively with "One-sided" sampling bias, "Central" sampling bias, and "Diagonal" sampling bias.  Figure S1, here we show BMP inference of root locations under BMP simulations, but this time the phylogenetic tree is assumed to be known without uncertainty. Figure  S2, here we show BMP inference of diffusion parameters under BMP simulations, but this time the phylogenetic tree is assumed to be known without uncertainty. Figure S1, here we show BMP inference of root locations under BMP simulations, but this time we include 50 extra sequence-free samples (without genetic sequence but with correct date and sampling location). Figure S2, here we show BMP inference of diffusion parameters under BMP simulations, but this time we include 50 extra sequence-free samples (without genetic sequence but with correct date and sampling location).              Figure S19, we consider the effects or restricting an analysis to a southern geographical subsample of the YFV dataset. We compare the results of BMP analysis using all the data from (41) versus using only the southern samples (latitude below −19.0, red squares in the phylogeny and on the map). On top is the maximum clade credibility phylogeny inferred from analysing the whole dataset. On the map (bottom) we show the location of the samples and the inferred location of the most recent common ancestor (red triangle in the phylogeny). The three colored areas on the map show the inferred posterior distribution (kernel density estimate) of the root location from three analyses: using only the southern samples (red area), using all samples (blue area) or using only the southern samples but adding the northern ones as sequence-free samples (orange areas). Due to difficulties in convergence, and following results from the analysis with all samples, in the analysis with sequence-free samples we added a normal distribution prior over root height with mean 0.7 and standard deviation 0.25.