Benchmarking Relatedness Inference Methods with Genome-Wide Data from Thousands of Relatives

Relatedness inference is an essential component of many genetic analyses and popular in consumer genetic testing. Ramstetter et al. evaluate twelve.....

Values that map to degree: (1) k (2)  Supplemental Table 1: Degrees of relatedness with the ranges of kinship coefficients (φ) and genome proportions inferred to be IBD0 (k (0) ) that map to that degree. Includes example relationships for each degree and the corresponding number of meioses that separate a pair of individuals with that relationship. Number of meioses with text (×2) and (×4) correspond, respectively, to samples that are related along two (e.g., full siblings) and four lines of descent, with the indicated meiotic distance on all lines. Also lists the proportions of the genome that are expected to be IBD0, IBD1, and IBD2 between samples that have each relationship, denoted k (0) , k (1) , and k (2) , respectively; and the corresponding expected kinship coefficient φ. Table does not include all possible relationship types for these degrees of relatedness.

SAMAFS data and quality control procedures
Our analysis focuses on SNP array data from the San Antonio Mexican American Family Studies 1-3 (SAMAFS). The 2,490 samples were genotyped via one of the following Illumina arrays: the Human660W, Human1M, Human1M-Duo, or both the HumanHap500 and the HumanExon510S array which together provide roughly the same content as the Human1M array. We began by using data that had quality control measures carried out in a prior study 4 . In brief, sites with non-Mendelian errors were set to missing and we used BWA 5 v0.7.5a-r405 to map the SNP array probe sequences to human reference sequence GRCh37.
Only SNPs with probe sequences that aligned with no mismatches at exactly one genomic position and that do not align to any other location with either zero or one mismatches were kept. We omitted SNPs for which any of the following was true: (1) multiple probes aligned to the same location (we retained only one SNP for any location), (2) dbSNP reported either more than two variants or had non-simple alleles (i.e., not A/C/G/T), (3) the raw genotype data had differing alleles as compared to those reported in the manifest files, (4) the manifest file listed SNP alleles that differed from those in dbSNP at the aligned location, (5) dbSNP listed the SNP as having multiple locations or as 'suspected', (6) the SNP location is outside the 'accessible genome' as reported by the 1000 Genomes Project 6 , (7) the site occurs in a region that is segmentally duplicated with a Jukes-Cantor K-value of <2%, or (8) the site occurs within a total of 17 Mb of the genome that received excess reads aligned using 1000 Genome Project data 7 .
Following these procedures, we filtered the dataset to include SNPs with less than 2% missingness and individuals with less than 10% missingness. This resulted in data for a total of 2,485 individuals at 521,184 SNPs that overlap between the two types of arrays and are of high quality. To run methods that require independent SNPs, we used the PLINK command --indep-pairwise 1000 25 0.25 which uses a sliding window method that considers blocks of 1,000 SNPs and removes SNPs with r 2 > 0.25, afterward shifting the window by 25 SNPs. This process yielded a total of 140,314 SNPs after linkage disequilibrium (LD) was filtered. We explore the effect of different filtering thresholds below in the "Varying linkage disequilibrium filters" section.
For the ADMIXTURE 8 analyses described below, we merged the above LD-pruned SAMAFS dataset with HapMap phase 3 samples 9 and again filtered to include SNPs with less than 2% missingness from the combined dataset. This resulted in a sample with 102,527 SNPs.
After an initial analysis, we discovered several pairs of individuals reported as unrelated but inferred to be first or second degree relatives based on the output from Refined IBD. When these individuals are in the same SAMAFS pedigree, relatedness degrees between their descendants can be inflated relative to that implied by the reported relationships. In order to reduce this effect, we excluded from consideration reported pairs of relatives in which both samples descend from one of the unexpected relatives. Specifically, for a given pair of inferred but unreported relatives from the same pedigree, (ind 1 , ind 2 ), we found the sets D 1 and D 2 of all descendants of ind 1 and ind 2 , respectively. Then for each d 1 ∈ D 1 and d 2 ∈ D 2 , we removed from the set of relatives to be analyzed each pair (d 1 , d 2 ), as well as all pairs of the forms (ind 1 , d 2 ) and (d 1 ,ind 2 ). This resulted in the removal of 2,618 pairs of individuals: 3, 80, 322, 625, 763, 554, 120, and 151 pairs reported as second through eighth degree relatives and unrelated, respectively.
Some reported relationships in the dataset have an expected kinship coefficient that does not map to a specific degree of relatedness. For example, three-quarter-siblings share one parent in common and have unshared parents that are first degree relatives (such as full siblings); their expected kinship coefficient is intermediate between first and second degree relatives, and as a result their degree of relatedness is unclear.
We omit all such relatives from the analysis, with a total of 91 pairs removed.
Although monozygotic (MZ) twins have a degree of relatedness of 0, for the analyses in this paper, we combine the 10 reported pairs of MZ twins with the set of first degree relatives and call all individuals with estimated kinship φ > 1 2 5/2 as first degree relatives.

Effects of the SAMAFS relatedness structure on inference quality
As the SAMAFS data consist of numerous large families, allele and haplotype frequencies estimated from the sample may be biased, potentially affecting the inference results in a way that is not representative of the methods' accuracy in other datasets. To assess how severely this may impact the results of the allele frequency-based methods on all of SAMAFS, we tested PLINK 10 on datasets composed primarily of individuals with fairly low level relatedness. To generate these sample sets, we first determined a set of distantly related individuals using FastIndep 11 , a program that uses estimated kinship coefficients and a maximum allowed relatedness threshold to identify a set of individuals in which no pairwise relatedness exceeds the given threshold. For pairs reported as unrelated, we use the kinship coefficients from PLINK, and for pairs reported as related, we use the expected kinship coefficient (Supplemental Table 1) value for that pair. We input these kinship coefficients to FastIndep with the relatedness threshold set to 0.015, which is roughly the expected kinship coefficient for fifth degree relatives. This produced a set of 529 individuals,  Table 2: Number of pairs of individuals that passed quality filters, are reported to have first through fifth degree relatedness, and which we included in at least one of the 1,000 datasets composed primarily of individuals with PLINK-estimated kinship less than 0.015. (See the "Effects of reducing relatedness in input data" section.) Pairs counted as unrelated for this analysis are from distinct pedigrees. We include reported MZ twins with the set of first degree relatives.
denoted as L, in which all sample pairs are expected or estimated to share no more than 3% of their genome IBD. We note that PLINK is somewhat biased in inferring relatedness and identifies a non-trivial proportion of samples that are reported to be unrelated as fifth degree or closer relatives ( Figure 1). Therefore, using PLINK kinship estimates should provide an aggressive filter against potential relatedness in these sample sets. Next, we created 1,000 datasets containing the base set of samples L merged with no more than one randomly selected pair of related individuals from each SAMAFS pedigree, resulting in a total of 26,026 pairs of fifth degree or closer relatives, and nearly two million pairs from distinct pedigrees so reported as unrelated (Supplemental Table 2). When adding a related pair of individuals to the dataset, we checked if either of the individuals is reported to be a fifth degree or closer relative of a sample in L, and in that case, removed the sample in L from the dataset. Finally, we ran PLINK on each of the 1,000 datasets and show performance accuracy results in comparison to running PLINK on the full SAMAFS dataset in Supplemental Figure 2. While some differences exist between the two analyses, the accuracy results differ by less than 3% for all considered relatedness classes, indicating that the relatedness structure within SAMAFS has little affect on allele frequency estimates and resulting relatedness measures.
In the presence of multiple copies of various haplotype segments that stem from the numerous relatives in the full SAMAFS dataset, phasing and therefore IBD accuracy may be inflated compared to non-pedigree samples. This may increase the inference quality of IBD segment-based programs (which utilize either internal phasing models or pre-phased data) compared to the other programs. To assess the performance of the IBD segment-based methods in a setting with relatively outbred data, we again used datasets comprised mostly of individuals with low relatedness. Specifically, starting with the 1,000 datasets with the base set of samples contained in L as outlined above, we merged genotypes from 580  Figure 3. The accuracy of the IBD segment-based methods does drop for higher degrees of relatedness in the reduced datasets compared to all of SAMAFS, in some cases by as much as 9.6%. In this case the performance of IBD segment methods and allele frequency methods are more similar, suggesting that for smaller datasets, phasing errors can reduce the efficacy of IBD segment methods for inferring relatedness. Still, the IBD segment-based methods are comparable to or more accurate than the allele frequency methods even in this setting. Moreover, for larger datasets where it is possible to achieve phase accuracy at the megabase-scale 12 , the results from the full dataset indicate that IBD segment-based methods provide greater accuracy than allele frequency-based methods for inferring relatives with third degree or more distant relatedness.
Supplemental Figure 2: Accuracy results from PLINK run on the entire SAMAFS dataset (labeled "Full"), and from PLINK run on 1,000 reduced datasets composed mostly of individuals with PLINK-derived kinship estimates of less than 0.015 (labeled "Reduced"). When a pair of relatives is present in more than one reduced dataset, we selected results from an arbitrary dataset to determine accuracy. We combined results for pairs inferred to have sixth degree or more distant relatedness in one class for this analysis.
Addressing population structure while including data from diverse samples Samples that have admixed ancestry can confound relatedness estimation methods due to some individuals having more or less similar population-level ancestry, thus inducing an increased or decreased correlation in genotypes among admixed samples that may or may not be recently related 13,14 . While methods such as REAP 13 and RelateAdmix 15 adjust for admixture, they rely on the output of model-based ancestry inference methods such as ADMIXTURE 8 which have difficulty distinguishing between ancestral populations and more recent relatedness among samples 14 .
We sought to determine whether adding genotypic variance corresponding to the populations relevant to the ancestry of the SAMAFS samples would enable ADMIXTURE to reliably estimate ancestral population proportions despite the relatedness in the full dataset. To that end, we generated a dataset containing the entire (LD-pruned) SAMAFS sample together with 372 unrelated HapMap individuals. These HapMap individuals are a subset of the 580 individuals described above (including samples with African, European, and Native American ancestry pertinent to SAMAFS), but with samples filtered out by FastIndep using previously estimated kinship coefficients 16 as input and a filtering threshold of 0.015. We then ran ADMIXTURE on this dataset with K = 3. Next, we ran ADMIXTURE with K = 3 on another dataset containing the 372 Supplemental Figure 3: Accuracy results from the full dataset for all IBD segment-based methods, PC-Relate, and PREST-plus along with results from running ERSA, GERMLINE, and IBDseq on 1,000 reduced datasets composed mostly of individuals with inferred kinship values of less than 0.015. Results from programs run on both types of data are indicated with a label "(F)" and red text for the full dataset and "(R)" and blue text for the reduced datasets. The accuracies in all cases are for pairs of samples that were included in at least one reduced dataset so that the results are directly comparable between data types. We report sample pairs inferred as sixth degree or more distant together here.
unrelated HapMap samples and the set L (above) containing 529 SAMAFS samples inferred by PLINK to have kinship less than 0.015. As this latter dataset has little relatedness structure, ADMIXTURE should readily infer the samples' ancestral proportions from African, European, and Native American populations. Consistent with this, we located the inferred ancestry components likely to correspond to these three groups using individuals from the HapMap YRI, CEU, and MXL populations in the two different ADMIXTURE runs. We then computed correlations separately for each component between the set L of 529 SAMAFS samples contained in both datasets. These correlations are extremely high at >0.97 for all three populations, indicating that the output from ADMIXTURE run on all of SAMAFS together with the 372 unrelated HapMap individuals reliably infers population-level ancestry proportions.
Building on this analysis, we used the ADMIXTURE results from the dataset containing the full SAMAFS and unrelated HapMap samples, extracting only the ancestry estimates for SAMAFS, in order to examine whether the accuracy of REAP and RelateAdmix improve using these ancestry estimates. Additionally, we ran KING and PC-Relate (the other methods that address population structure) on this same combined Supplemental Figure 4: Accuracies from running KING, PC-Relate, REAP, and RelateAdmix on: a dataset containing all of SAMAFS and unrelated HapMap individuals, with a label "(F+H)" in blue text; and on the full SAMAFS sample, with an added label "(F)" in black text. Red horizontal bars under a bar plot indicate that the corresponding inferences agree with the reported relationships.
dataset to compare their performance using this more diverse sample to that from the SAMAFS-only data. As shown in Supplemental Figure 4, the accuracy of REAP and RelateAdmix increase, indicating that the ADMIXTURE results improve when using the augmented dataset. In contrast, the performance of KING and PC-Relate are generally consistent in both analyses. As PC-Relate identifies a set of individuals with little relatedness and uses these to infer principal components (PCs), it is less susceptible to confounding due to relatedness in the input data, and appears not to require aids to distinguish shared variation due to population ancestry from recent relatedness. Notably, in the case of seventh degree relatives PC-Relate's accuracy increases to 41.1% compared to 27.4% in the original dataset. While this increase is striking, our analysis includes fewer than 100 seventh degree relatives, and it is unclear to what extent this result may be due to stochastic factors. Overall, PC-Relate performs well in both settings, and indeed outperforms REAP and RelateAdmix regardless of whether we include HapMap samples with SAMAFS.

Additional analyses of PC-Relate
We further examined the potential for improving accuracy in PC-Relate by (1) using different methods to produce the kinship coefficients given to PC-Relate as input, testing PLINK and Refined IBD as well as the recommended method KING 17 (as used in the main analysis); (2) changing the relatedness threshold that determines the set of samples used to infer PCs, varying this between the default of excluding fourth degree relatives or closer to excluding seventh degree relatives or closer; and (3) iterating PC-Relate by using its own inferred kinship coefficients as input to a second round inference, using either KING or Refined IBD as the initial input, an approach the authors suggested previously 17 . As shown in Supplemental Figure 5, all tested scenarios produce similar accuracy, with sizeable variation only observed in seventh degree relative inference, potentially driven by noisy results for this smaller number of distant relatives. We conclude that the results in Figure 1 are representative of PC-Relate's optimal performance in SAMAFS.

Varying linkage disequilibrium filters
For methods that are designed to handle only independent SNPs, LD filtering of the genotype data is necessary, but the parameters used in filtering have the potential to affect inference quality. We consider the accuracy of inferring relatedness using genotypes with r 2 up to 0.05, 0.1, 0.2, and our original threshold of 0.25. Using the original SAMAFS dataset, we filtered using the --indep-pairwise 1000 x option in PLINK, where x is the upper bound for r 2 . This resulted in datasets with 23,761, 57,198, and 114,541 SNPs for r 2 set to 0.05, 0.1, and 0.2, respectively. Supplemental Figure 6 shows the rates for inferring sample pairs to have the same degree of relatedness as reported in SAMAFS as well as one degree higher and lower. In general, changes in accuracy for inferring individuals to have the reported relatedness degree are minimal. Some larger differences exist for seventh degree relative inference, but as previously noted, this is a smaller set of relative pairs and differences may be due to random fluctuations. No clear trend exists among the methods, as some have improved performance for lower r 2 thresholds, while some perform best with the highest value.

Samples with inferred relatedness different from that reported in SAMAFS
We sought to identify relationships that are reported as either first degree, second degree, or unrelated, but are inferred to have a substantially different degree of relatedness. Using unanimous agreement from ERSA, GERMLINE, and Refined IBD to locate confidently inferred relationships, we located several pairs of samples with such differences. For inferred relationships that differ by more than one degree from the reported relationship (e.g., reported as second degree but inferred as sixth degree), we assumed that the inference is valid as this is unlikely to occur due to data errors or statistical fluctuations. For relationships that are inferred to differ by only one degree from the reported relationship, we further required that either: (1) the discrepant relatedness call be supported by a consistent call involving at least one other sample (example follows); (2) in cases of reported siblings inferred to be second degree relatives, that their IBD2 proportion be less than 1 2 5/2 ; or (3) in cases of reported half-sibling pairs inferred to be first degree relatives, that their IBD2 proportion be greater than 1 2 5/2 . As an example of an inference supported by another sample, given a set of three or more reported siblings, if the methods infer a pair of siblings as likely second degree relatives (presumably half-siblings), we checked that one of the other siblings also support a second degree relatedness inference involving one of the two original samples to ensure consistency. We used results from Refined IBD to quantify IBD2 levels. Note that the expected proportion of IBD2 between full siblings is 1 4 , and we used 1 2 5/2 as the cutoff for confirming full vs. half-sibling calls.
Supplemental Figure 5: Inference results from running PC-Relate with different input and parameter settings. Labels indicate the method used to supply input kinship coefficients-KING, Refined IBD, or PLINK-that PC-Relate uses to identify samples with which it infers PCs. Labels in green with "+ iter" text correspond to results from iterating by rerunning PC-Relate using its own kinship estimates as input for a second round of inference. Labels in blue with "(7)" are from running PC-Relate with a threshold set to exclude seventh degree or closer relatives rather than the default of excluding fourth degree or closer relatives. Red horizontal bars under a bar plot indicates that the corresponding inferences agree with the reported relationships.
The IBD2 levels of two reported half-siblings from two pedigrees were greater than that seen for most halfsiblings but less than typical for full siblings, and appear to be best explained as being a less commonly described class of relatives known as three-quarter-siblings defined earlier (see "SAMAFS data and quality control procedures"). Individuals with this class of relatedness share non-trivial proportions of IBD2 but at Supplemental Figure 6: Accuracies of KING, PC-Relate, PLINK, REAP, and RelateAdmix using four different LD filtering thresholds: 0.05, 0.1, 0.2, and 0.25. Depicts the proportion of sample pairs inferred to have the same degree of relatedness as reported as well as inferred to have one higher and one lower relatedness degree.
a lower level than for full siblings. For the potential three-quarter-siblings we identified, we did not have genotype data for one of the fathers in both cases and therefore could not validate whether the fathers were siblings. As the degree of relatedness for such pairs is ambiguous, and because other factors may explain these findings, we do not count these pairs as inconsistent. As shown in Supplemental Table 3, the majority of pairs in the dataset have inferred relatedness consistent with the reported relationships. However, we find a few discrepancies, including unrelated individuals inferred to be closely related (Supplemental Figure 7), and even some second degree relatives inferred to have sixth degree or more distant relatedness. This latter group could be explained by sample swaps or possibly by the individuals being adopted. Supplemental Table 3: Pairs of relationships that are confidently inferred using unanimous agreement from three programs and further checks described in the text (for some discrepant relationships) in SAMAFS.
(HS) are half-sibling pairs, (A) are avuncular pairs, (GP) are grandparent-grandchild pairs, and (DC) are double cousin pairs. Bold numbers indicate agreement between reported and inferred relationships. We only count pairs whose relationships are unanimously agreed upon by the methods and, when discrepant, could be verified as probable misreports using the described checks.
Supplemental Figure 7: Relationships discovered between individuals from different SAMAFS pedigrees. Bands on the perimeter of the elliptical plot indicate distinct pedigrees within SAMAFS with band size proportional to the number of individuals in the pedigree. Curves between two bands correspond to discovered relative pairs with curve color indicating the degree of relatedness: red for first degree, green for second degree, and blue for third degree. Points where the curves end correspond to specific individuals, and a single point may have multiple curves running to it, indicating several relationships between that individual and others in the dataset.

Method testing details
Details on how we ran each of the methods appears below. Note that these descriptions apply to the inference results presented in Figure 1. We performed several additional analyses using altered input or parameters to the methods that are described above. In all cases, we provided methods with genotype data only and removed all reported family relationships.
fastIBD: Following recommended practices 18 , we ran fastIBD using the default settings 10 times using a different seed each time and took the union of the called IBD segments from the multiple runs as our final set of IBD regions.
complete the analysis due to apparent bugs, and/or required the purchase of a license to run, and we did not include these in our study.