qpdstat: Estimate f4 statistics
In uqrmaie1/admixtools: Inferring demographic history from genetic data

qpdstat

R Documentation

Estimate f4 statistics

Description

Computes f4-statistics of the form f4(A, B; C, D). For allele frequencies a, b, c, d, ⁠f4(A, B; C, D)⁠ is computed as the average of (a-b)*(c-d) over all SNPs in each SNP block. This is equivalent to (f2(A, D) + f2(B, C) - f2(A, C) - f2(B, D)) / 2 (assuming no missing data). The input of this function can either be a 3d array of f2-statistics generated by f2_from_precomp or f2_from_geno, a directory with f2-statistics, or the prefix of genotype files. Computing f4 from genotype files directly is slower, but provides more flexibility in dealing with missing data (see details).

Usage

qpdstat(
  data,
  pop1 = NULL,
  pop2 = NULL,
  pop3 = NULL,
  pop4 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  comb = TRUE,
  blgsize = NULL,
  block_lengths = NULL,
  f4mode = TRUE,
  afprod = TRUE,
  cpp = TRUE,
  verbose = is.character(data),
  ...
)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` (fastest option) A directory which contains pre-computed f2-statistics The prefix of genotype files (slowest option)
`pop1`	One of the following four: `NULL`: all possible population combinations will be returned A vector of population labels. All combinations with the other `pop` arguments will be returned A matrix with population combinations to be tested, with one population per column and one combination per row. Other `pop` arguments will be ignored. the location of a file (`poplistname` or `popfilename`) which specifies the populations or population combinations to be tested. Other `pop` arguments will be ignored.
`pop2`	A vector of population labels
`pop3`	A vector of population labels
`pop4`	A vector of population labels
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`sure`	The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.
`unique_only`	If `TRUE` (the default), redundant combinations will be excluded
`comb`	Generate all combinations of `pop1`, `pop2`, `pop3`, `pop4`. If `FALSE`, `pop1`, `pop2`, `pop3`, `pop4` should all be vectors of the same length.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). Only used when `data` is the prefix of genotype files
`block_lengths`	Vector with lengths of each jackknife block. `sum(block_lengths)` has to match the number of SNPs. only used when `data` is the prefix of genotype files
`f4mode`	Set this to `FALSE` to compute D-statistics instead of f4. This only has an effect if the first argument is a genotype prefix. D-statistics are computed as `(a-b)(c-d) / ((a + b - 2ab) (c + d - 2cd))`, which is the same as `(P(BABA) - P(ABBA)) / (P(ABBA) + P(BABA))`
`afprod`	Compute f4 from allele frequency products instead of f2 (default `TRUE`). Only used if `data` is a directory with precomputed data. For populations with lots of missing data, this option reduces bias that can result from setting `maxmiss` to values greater than 0. In all other cases it should not make a difference.
`cpp`	Use C++ functions. Setting this to `FALSE` will be slower but can help with debugging.
`verbose`	Print progress updates
`...`	Additional arguments passed to `f4blockdat_from_geno` if `data` is a genotype file prefix or `f2_from_precomp` if `data` is a directory with f2-statistics

Details

f4- and D-statistics are informative about how four populations are related to one another. Estimates of f4 are unbiased as long as assumptions about SNP ascertainment and mutation rates are met. Missing data can violate the ascertainment assumptions: an f4-statistic may be significantly different when it is calculated from all non-missing SNPs, compared to what it would be if it were calculated from all SNPs in the genome. However, because this difference is often small, f4 is often calculated using samples or populations with missing data, on a particular subset of all SNPs. There are different strategies for choosing the SNPs in this case, and these strategies differ in how many SNPs they use, how likely they lead to bias, and whether pre-computed f2-statistics can be used.

Use the same SNPs for every f4-statistic
This is the most conservative option, but also the option which will use the smallest number of SNPs, which may result in a lack of power. It is the default option when pre-computing f2-statistics (maxmiss = 0 in extract_f2 or f2_from_geno).
Use different SNPs for each f4-statistic
This option strikes a balance between avoiding bias and using a larger number of SNPs. For each f4-statistic it selects all SNPs which are present in all four populations. This option only works when the first argument is a genotype prefix (it doesn't work with pre-computed f2-statistics). In that case, this option is used by default. To turn it off and instead use the same SNPs for every f4-statistic, set allsnps = FALSE. This option is the default option in the original qpDstat program (it is in fact the only option for selecting SNPs in the original qpDstat; however in the original qpAdm and qpGraph programs, this mode of selecting SNPs can be enabled with the allsnps option, whereas the default mode in qpAdm and qpGraph is equivalent to maxmiss = 0)
Use different SNPs for each f2-statistic
This is the least conservative option, but also the option which uses most of the available information. It makes it possible to use pre-computed f2-statistics for a large number of populations without losing a large number of SNPs. To use this option, set the maxmiss parameter to a value greater than 0 (and not larger than 1) in extract_f2 or f2_from_geno. When using this option, be aware that bias is possible, in particular for f4-statistics where some populations have large amounts of missing data. To reduce the bias that can result from using this option, you may want to combine it with using the option afprod = TRUE in f2_from_precomp. In summary, whenever you work with populations with missing data, there is no guarantee that f4- or D-statistics involving these populations are not skewed in some way. If you choose to analyze these populations anyway, and you decide which SNPs to use, there is a trade-off between maximizing power and minimizing the risk of bias. One strategy might be to first use the least conservative option (setting maxmiss = 1 in extract_f2) to get an overview, and then spot-check individual results using more conservative options.

Value

qpdstat returns a data frame with f4 statistics, with columns for populations 1 to 4, f4-estimate (est), standard error of the estimate (se), z-score (z, est/se), p-value (2*(1-pnorm(z))), and the number of SNPs used (n; only if first argument is genotype prefix)

Alias

f4

References

Patterson, N. et al. (2012) Ancient admixture in human history Genetics

Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics

Examples

pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat(example_f2_blocks, pop1, pop2, pop3, pop4)
## Not run: 
qpdstat(f2_dir, pop1, pop2, pop3, pop4)

## End(Not run)
## Not run: 
# compute D-statistics instead
qpdstat("/geno/prefix", pop1, pop2, pop3, pop4)

## End(Not run)
## Not run: 
# make a data frame with the population combinations for which f4 should be computed
combinations = tibble(pop1 = pop1, pop2 = pop2[1], pop3 = pop3, pop4 = pop4)
qpdstat(example_f2_blocks, combinations)

## End(Not run)

uqrmaie1/admixtools documentation built on July 16, 2025, 4:01 p.m.