The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

We report the Simons Genome Diversity Project (SGDP) dataset: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioral modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that in other non-Africans.

ignore smaller populations that are also important for understanding human diversity. In addition, many of these studies have sequenced genomes to only 4-6-fold coverage. Here, we report the Simons Genome Diversity Project (SGDP): deep genome sequences of 300 individuals from 142 populations chosen to span much of human genetic, linguistic, and cultural variation (Supplementary Data Table 1).

Data set and catalog of novel variants
We sequenced the samples to an average coverage of 43-fold (range 34-83 fold) at Illumina Ltd.; almost all samples (278) were prepared using the same PCR-free library preparation 2. We aligned reads to the human reference genome hs37d5/hg19 using BWA-MEM (BWA-0.7.12) 3 (Supplementary Information section 1). We genotyped each sample separately using the Genome Analysis Toolkit (GATK) 4, with a modification to eliminate bias toward genotypes matching the reference (Supplementary Information section 1). We developed a filtering procedure that generates a sample-specific mask. At "filter level 1", which we recommend for most analyses, we retain an average of 2.13 Gb of sequence per sample and identify 34.4 million single nucleotide polymorphisms (SNPs) and 2.1 million insertion/deletion polymorphisms (indels) (Supplementary Information section 2). We have made the GATK-processed data available in a file small enough to download by FTP, along with software to analyze these data (Supplementary Information section 3). The SGDP dataset highlights the incompleteness of current catalogs of human variation, with the fraction of heterozygous positions not discovered by the 1000 Genomes Project being 11% in the KhoeSan and 5% in New Guineans and Australians (Extended Data Fig. 1; Supplementary Data Table 1). We used FermiKit 5 to map short reads against each other, store the assemblies in a compressed form that retains all the information required for polymorphism discovery and analysis, and identify SNPs by comparing against the human reference. We find that FermiKit has comparable sensitivity and specificity to GATK for SNP discovery and genotyping, and is more accurate for indels (Supplementary Information section 4). FermiKit also identified 5.8 Mb of contigs that are present in the SGDP but absent in the human reference genome, presumably because they are deleted there; these contigs, which we have made publicly available, can be used as "decoys" to improve read mapping (Supplementary Information section 5). Finally, we called copy number variants 6 and used lobSTR 7,8 to genotype 1.6 million short tandem repeats (STRs) (Supplementary Information section 6). The high quality of the STR genotypes (r² = 0.92 to capillary sequencing calls) is evident from their accurate reconstruction of population relationships, even for difficult-to-genotype mononucleotide repeats (Extended Data Fig. 2).

The structure of human genetic diversity
To obtain an overview of population relationships, we carried out ADMIXTURE 9 (Extended Data Fig. 3) and principal component analysis 10 (Extended Data Fig. 4a). We also built neighbor-joining trees based on pairwise divergence per nucleotide (Fig. 1a) and F_ST (Extended Data Fig. 4b), whose topologies are consistent with previous findings that the deepest splits among human populations are among Africans. We computed heterozygosity (the proportion of diallelic genotypes per base pair) and recapitulate previous findings that the highest genetic diversity is found in sub-Saharan Africa and that there is a much lower ratio of X-to-autosome diversity in non-Africans than in Africans (Fig. 1b) 11. A surprise is that African "Pygmy" hunter-gatherers have reduced X-to-autosome diversity ratios relative to all other sub-Saharan Africans. This pattern remains even after we remove the third of chromosome X known to be subject to the strongest natural selection, suggesting that the finding is driven by demographic history rather than by natural selection (Supplementary Information section 7). It has been suggested that the reduced X-to-autosome heterozygosity ratio in non-Africans is due to ongoing male-driven admixture 11,12. Male non-Pygmy admixture into Pygmies is well-documented 13,14, so this process could explain these findings.
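As an illustration of the heterozygosity and X-to-autosome comparison described above, the following is a minimal sketch and not the pipeline used in this study. The function names, the genotype representation, and the toy values are all hypothetical; in practice the denominator would be the number of bases passing the sample-specific mask, and X-chromosome heterozygosity can only be measured in individuals carrying two X chromosomes.

```python
from typing import Sequence, Tuple

# Hypothetical representation: the two alleles called at one site, e.g. ("A", "G").
Genotype = Tuple[str, str]


def heterozygosity(genotypes: Sequence[Genotype], callable_bases: int) -> float:
    """Heterozygous (diallelic) genotypes per callable base pair."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    het_sites = sum(1 for first, second in genotypes if first != second)
    return het_sites / callable_bases


def x_to_autosome_ratio(x_het: float, autosome_het: float) -> float:
    """Ratio of X-chromosome to autosomal heterozygosity."""
    if autosome_het <= 0:
        raise ValueError("autosome_het must be positive")
    return x_het / autosome_het


# Toy genotype calls, invented for illustration only.
autosome_calls = [("A", "G"), ("C", "C"), ("T", "C"), ("G", "G")]
x_calls = [("A", "A"), ("G", "T")]

autosome_het = heterozygosity(autosome_calls, callable_bases=2000)
x_het = heterozygosity(x_calls, callable_bases=2000)
ratio = x_to_autosome_ratio(x_het, autosome_het)
```

Normalizing by callable bases rather than by the number of variant sites keeps the two heterozygosity values comparable when the chromosomes differ in how much sequence passes filtering.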
Comparisons of ancient to present-day human genomes have shown that all non-Africans today possess Neanderthal ancestry 15 with more in eastern non-Africans 16,17, and that Australo-Melanesians and to a lesser extent other eastern non-Africans possess Denisovan ancestry 18-20. However, these studies only analyzed genomes from a handful of populations.
We computed statistics informative about Neanderthal and Denisovan ancestry and provide a fine-scale view of these ancestry distributions worldwide (Fig. 1c,d; Supp. Data Table 1; Supplementary Information section 8). We do not detect any population with a higher proportion of Neanderthal ancestry than is present in East Asians. However, we do find suggestive evidence of an excess of Denisovan ancestry in some South Asians compared to other Eurasians. This signal may not have been detected before because earlier surveys of archaic introgression largely excluded South Asians (Fig. 1d; Supp. Data Table 1).

The time course of human population separation
We studied demographic history by leveraging the fact that variation across the genome in divergent sites per base pair can be used to reconstruct population size changes and separations. We used the Pairwise Sequential Markovian Coalescent (PSMC) 21 to reconstruct population size changes, and the multiple sequentially Markovian coalescent 22 (MSMC) to study the time course of population separations. We infer that the population ancestral to all present-day humans began to develop substructure at least two hundred thousand years ago (kya), which is most apparent when comparing the ancestors of some present-day African hunter-gatherers (southern African KhoeSan and central African Mbuti Pygmies) and other populations (Fig. 2a). However, it is also clear that this substructure developed slowly, as all pairs of present-day populations including African hunter-gatherers share a substantial subset of their ancestors as recently as a hundred thousand years ago 23-26. Quoting the time at which MSMC infers that more than 50% (25-75%) of lineages for a pair of populations are descended from the same ancestral population, we estimate that non-Africans separated substantially from KhoeSan 131 (82-173) kya and almost as anciently from the Mbuti around 112 (67-171) kya. Within Africa (Fig. 2a-b), we infer that the Yoruba separated substantially from the KhoeSan 87 (58-120) kya; from the Mbuti 56 (32-85) kya; and from the Dinka 19 (9-25) kya. We estimate a relatively rapid 21 (21-36) kya separation of northern and southern KhoeSan 24,27, potentially reflecting isolation since the last glacial maximum; and 38 (27-44) kya separation between western (Biaka) and eastern (Mbuti) Pygmies, confirming very old substructure between these two central African hunter-gatherer groups 28. Outside Africa, the most ancient structure dates to around 50 kya (Fig. 2c) during or shortly after the deepest part of the shared non-African bottleneck 40-60 kya, consistent with the archaeological evidence of the dispersal of modern humans into Eurasia during this period. We are not confident about the estimates of the date of separation of Australians, New Guineans and Andamanese from other populations because we find that these inferences change depending on the computational method we use for phasing, likely due to these populations not being represented in the 1000 Genomes haploid genome reference panel (Supplementary Information section 9). We caution that the date estimates also do not take into account uncertainty about the true value of the human mutation rate, which could plausibly be 30% higher or lower than the point estimate we use 29.
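The separation dates quoted above are read off as the times at which the cross-coalescence curve reaches 25%, 50% and 75%. The following is a minimal sketch of that read-out, under the assumption that the curve is available as paired arrays of time and relative cross-coalescence rate and that linear interpolation between points is acceptable. The function name and the toy curve are hypothetical and are not taken from MSMC or from this study's results.

```python
from typing import Optional, Sequence


def crossing_time(times_kya: Sequence[float],
                  rates: Sequence[float],
                  level: float) -> Optional[float]:
    """Most recent time at which the curve first reaches `level`.

    `times_kya` must be ordered from recent to ancient. Linear interpolation
    between neighbouring points is an assumption of this sketch.
    """
    if len(times_kya) != len(rates) or len(times_kya) < 2:
        raise ValueError("need two equally long sequences with at least two points")
    if rates[0] >= level:
        return times_kya[0]
    for i in range(1, len(rates)):
        r0, r1 = rates[i - 1], rates[i]
        if r0 < level <= r1:
            t0, t1 = times_kya[i - 1], times_kya[i]
            return t0 + (level - r0) * (t1 - t0) / (r1 - r0)
    return None  # the curve never reaches the requested level


# Toy curve, invented for illustration only.
toy_times = [0.0, 40.0, 80.0, 120.0, 160.0]
toy_rates = [0.0, 0.2, 0.5, 0.9, 1.0]

t25 = crossing_time(toy_times, toy_rates, 0.25)
t50 = crossing_time(toy_times, toy_rates, 0.50)
t75 = crossing_time(toy_times, toy_rates, 0.75)
```

Reporting the 25% and 75% crossings alongside the 50% point conveys how gradual a separation was: a wide interval indicates that the two populations continued to share ancestors over a long period.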

Early modern human dispersals contributed little to non-African populations
There is intense debate about whether present-day Australians, New Guineans and Asian "Negrito" populations are descended from the same source population as mainland Eurasians, or whether they also derive some ancestry from an early, independent dispersal of modern humans into Asia 30-32. To explore this scenario rigorously, we fit an admixture graph 33 (a phylogenetic tree incorporating mixture events) to the allele frequency correlations among Neanderthals, Denisovans, Upper Paleolithic Europeans, East Asians, New Guineans, Australians, and Andamanese. We obtain a good fit to the data if we include known Neanderthal and Denisovan introgression and model all modern human ancestry in New Guineans, Australians and Andamanese as part of an eastern clade together with mainland East Asians (Supplementary Information section 11; Fig. 3). Furthermore, when we manually introduce a deeply diverging modern human lineage contributing ancestry to Australians, New Guineans, and Andamanese (or when we repeat the analysis in a model without Andamanese), no position or proportion of the deep lineage improves the fit. If this putative source population branched off the main lineage leading to non-Africans more than about 10-20 ky prior to the separation of European and East Asian ancestors, we obtain an upper bound of a few percent for the possible contribution to Australians and New Guineans (Fig. 3 inset; Supplementary Information section 11). These results are at odds with an inference of substantial early dispersal ancestry in a previous analysis of an Australian genome 32; however, that study used a less complete model that, notably, did not include the known Denisovan admixture into Australo-Melanesians 18.
The findings for Australians are also unlikely to be due to some unusual feature of the individuals we sequenced, as when we compared three different Australian samples for which there is published genome-wide data, they are all consistent with descending from a common homogeneous population since separation from New Guineans (Supplementary Information section 10). These results are not in conflict with skeletal and archaeological evidence of an early modern human presence outside of Africa 30,34, as early migrations could have occurred but not contributed substantially to present-day populations. The possibility of populations that once flourished but did not contribute substantially to living groups is especially plausible now that ancient DNA from the ~45 kya Ust'-Ishim 29 and the ~40 kya Oase 1 individuals 35 has directly documented their existence.

More mutation accumulation in non-Africans than in Africans
The SGDP data provide an opportunity to compare the rates at which mutations have accumulated across populations. We restricted our analyses to samples for which our genotypes are likely to be most reliable (this included restricting to samples which were all processed in the same way), and we used the highest level of filtering ("level 9") (Supplementary Information section 7). We pooled samples by region to increase power, and for all pairs of regions, computed the expected number of positions where, if we picked a random chromosome from each, region A would mismatch chimpanzee and region B would be identical to chimpanzee (or vice versa). If the rate of accumulation of mutations has been the same since the two populations diverged, these numbers are expected to be equal 36. However, when we compute the ratio of mutations on one lineage or the other since separation, we find a subtle (average of 0.5%) but significant excess of mutations in non-Africans relative to sub-Saharan Africans (3.3<|Z|<9.4 standard errors from zero; Extended Data Table 1). Because any difference must reflect events since non-African/African population divergence, which is less than a tenth of average genetic divergence (Fig. 2a), this implies a greater difference in mutation accumulation rates since population divergence (~5%). We were concerned that these results might be biased by the fact that the human genome reference sequence is more closely related to non-Africans than to Africans, or by higher levels of heterozygosity in Africans, as both these issues could make detection of divergent sites in Africans more difficult.
However, we replicated the findings after remapping to chimpanzee, which is equally distant to all present populations, and after restricting analyses to the X chromosome in males (males only have a single X chromosome, and so this procedure avoids bias due to different error rates in detecting heterozygous genotypes in populations with different rates of heterozygosity) (Extended Data Fig. 5). These observations are most likely to be explained by acceleration in the rate of mutation accumulation in non-Africans, since the same signal appears in comparisons to sub-Saharan Africans related in different ways to non-Africans (Extended Data Table 1). It is known that the rate of CCT>CTT mutations differs across human populations. However, this particular mutation class was found to be enriched relative to Africans in Europeans but not in East Asians, and thus cannot explain our signal 37 . One of several possible explanations for these findings is a decrease in the generation interval in non-Africans compared to Africans since separation 38 .
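The comparison of mutation accumulation described in this section can be sketched as follows. This is an illustrative reading of the text and not the study's code: it assumes that, for each site, the frequency of the allele mismatching chimpanzee is known in each pooled region, and it treats the rescaling from the genome-wide excess to the excess since divergence as a simple first-order division. All names and toy frequencies are hypothetical.

```python
from typing import Sequence


def accumulation_ratio(mismatch_freq_a: Sequence[float],
                       mismatch_freq_b: Sequence[float]) -> float:
    """Expected sites where A mismatches chimpanzee and B matches, divided by
    the expected sites where B mismatches and A matches, when one chromosome
    is drawn at random from each region."""
    if len(mismatch_freq_a) != len(mismatch_freq_b):
        raise ValueError("frequency lists must cover the same sites")
    a_only = sum(pa * (1.0 - pb) for pa, pb in zip(mismatch_freq_a, mismatch_freq_b))
    b_only = sum(pb * (1.0 - pa) for pa, pb in zip(mismatch_freq_a, mismatch_freq_b))
    if b_only == 0:
        raise ValueError("no informative sites on lineage B")
    return a_only / b_only


def excess_since_divergence(overall_excess: float,
                            fraction_of_divergence: float) -> float:
    """First-order rescaling (an assumption of this sketch): an excess measured
    over the whole lineage, if it arose only after population divergence,
    corresponds to a larger excess over that shorter interval."""
    if not 0.0 < fraction_of_divergence <= 1.0:
        raise ValueError("fraction_of_divergence must be in (0, 1]")
    return overall_excess / fraction_of_divergence


# Toy per-site frequencies, invented for illustration only.
toy_ratio = accumulation_ratio([1.0, 0.5, 0.0, 0.5], [0.0, 0.5, 1.0, 0.0])

# The text's figures: a 0.5% excess, with population divergence taken as one
# tenth of average genetic divergence, gives roughly 5%.
scaled = excess_since_divergence(0.005, 0.1)
```

Because population divergence is stated to be less than a tenth of average genetic divergence, the value obtained with one tenth should be read as approximate under this simple rescaling.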

No evidence for species-wide sweeps since the origin of anatomically modern humans
We finally used the SGDP dataset to address the hypothesis that the widespread appearance of modern human behavior in the archaeological record after ~50 kya was driven by one or a few changes in neurological genes that swept through the population shortly before this time 39. We first applied the 3P-CLR method 40 to search for locations in the genome with low allele frequency differentiation between KhoeSan and other modern humans, combined with high differentiation between modern and archaic (Neanderthal and Denisovan) humans, as might be expected from a selective sweep in the ancestors of all modern humans (Supplementary Information section 12; Extended Data Fig. 6). We found no strong outlier signals, although a caveat is that our scan has imperfect power and we could not apply it to filtered sections of the genome. We also applied the PSMC method 21 to estimate the average time since the most recent common ancestor (TMRCA) of individuals' two chromosomes in the genomic regions within the largest 3P-CLR peaks (38 peaks corresponding to the top 0.1%). In none of the regions did we find that the great majority of all pairs of modern humans are inferred to share a common ancestor <100 kya, as would be expected for a sweep just prior to ~50 kya (Supplementary Data Table 2).
As a second approach to scanning for species-wide selective sweeps, we applied the PSMC to infer TMRCA for SGDP samples across the entire genome. This analysis found no regions where the great majority of pairs of human genomes are inferred to share a common ancestor <100 kya (the largest fraction seen anywhere in the genome is 68%; Extended Data Fig. 7).
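The criterion used in both scans, whether the great majority of genome pairs share a common ancestor more recently than 100 kya in a given region, can be sketched as below. This is a hedged illustration only: the region names, the TMRCA values, and the function names are invented, and the sketch assumes per-pair TMRCA estimates are already available for each region.

```python
from typing import Dict, Sequence, Tuple


def fraction_recent(tmrcas_kya: Sequence[float], threshold_kya: float = 100.0) -> float:
    """Fraction of genome pairs whose inferred TMRCA is below the threshold."""
    if not tmrcas_kya:
        raise ValueError("need at least one TMRCA estimate")
    return sum(1 for t in tmrcas_kya if t < threshold_kya) / len(tmrcas_kya)


def most_sweep_like(regions: Dict[str, Sequence[float]],
                    threshold_kya: float = 100.0) -> Tuple[str, float]:
    """Region with the largest fraction of recently coalescing pairs."""
    scored = ((name, fraction_recent(values, threshold_kya))
              for name, values in regions.items())
    return max(scored, key=lambda item: item[1])


# Toy per-pair TMRCA estimates in kya, invented for illustration only.
toy_regions = {
    "region_1": [80.0, 120.0, 150.0, 300.0],
    "region_2": [60.0, 90.0, 95.0, 400.0, 500.0],
}

best_region, best_fraction = most_sweep_like(toy_regions)
```

A species-wide sweep shortly before ~50 kya would be expected to push this fraction close to 1 in the affected region, which is why the maximum over the genome is the quantity of interest.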
Taken together, these results do not rule out the possibility that genetic changes played a meaningful role in the changes in human behaviors after 50 kya; for example, changing selection can produce shifts in the frequencies of pre-existing mutations to bring a population to a new and advantageous set-point for a phenotype, as occurred in the case of height differences between northern and southern Europeans 41. For polygenic selection, however, genetics is not the creative force, but instead is responding to selection pressures imposed by new environmental conditions or lifestyles. Thus, our results provide evidence against a model in which one or a few mutations were responsible for the rapid developments in human behavior in the last 50 ky. Instead, changes in lifestyles due to cultural innovation or exposure to new environments are likely to have been the ultimate driving forces behind the rapid transformations in human behavior that became evident after 50 kya 42,43.

Extended Data
Extended Data Figure 1. Heatmap of fraction of heterozygous sites missed in the 1000 Genomes Project
For each sample, we examine all heterozygous sites passing filter level 1, and compute the fraction included as known polymorphisms in the 1000 Genomes Project.

Extended Data Figure 3. ADMIXTURE analysis
We carried out unsupervised ADMIXTURE 1.23 9,44 analysis over the 300 SGDP individuals in 20 replicates with randomly chosen initial seeds, varying the number of ancestral populations between K=2 and K=12 and using default 5-fold cross-validation (--cv flag). We used genotypes of at least filter level 1, and restricted analysis to sites where at least two individuals carried the variant allele (as singleton variants are non-informative for population clustering). After further restricting to sites with at least 99% completeness and performing linkage-disequilibrium based pruning in PLINK 1.9 45,46 with parameters (--indep-pairwise 1000 100 0.2), a total of 482,515 single nucleotide polymorphisms remained. This figure shows the highest likelihood replicate for each value of K. We found that log likelihood monotonically increases with K, while the value K=5 minimizes cross-validation error (not shown). The solution at K=5 corresponds to major continental groups (Sub-Saharan Africans, Oceanians, East Asians, Native Americans, and West Eurasians), but we show the full range of K here as it illustrates finer-scale population structure that may be useful to users of the data.
date T at which specified fractions of sample pairs are inferred to have a TMRCA less than T. C: Percentile points of the cumulative distribution function of B.
Extended Data Table 1. Fewer accumulated mutations in Africans than in non-Africans.

Figure 2. Cross-coalescence rates and effective population sizes for selected population pairs
A-C: Cross-coalescence rates as a function of time in thousands of years ago (kya) estimated using MSMC, with four haplotypes per pair. In each subfigure legend, we give the point estimate of the date at which 25%, 50% and 75% of lineages in the pair of populations have coalesced into a common ancestral population. We generated these plots using data phased with the 1000 Genomes reference panel (method PS1 described in Supplementary Information section 9), but only show pairs of populations for which the cross-coalescence rates are relatively insensitive to the phasing approach. A: Selected African cross-coalescence rates. B: Central African rainforest hunter-gatherer cross-coalescence rates. C: Ancient non-African cross-coalescence rates. D-F: Effective population sizes inferred using PSMC, using one diploid genome per population, for the same populations that we used in A-C.
model likelihood is maximized with zero early dispersal ancestry, and no more than a few percent is consistent with the data.