qpdstat | R Documentation |
Computes f4-statistics of the form f4(A, B; C, D)
. For allele frequencies a
, b
, c
, d
, f4(A, B; C, D)
is computed as the average of (a-b)*(c-d)
over all SNPs in each SNP block. This is equivalent to (f2(A, D) + f2(B, C) - f2(A, C) - f2(B, D)) / 2
(assuming no missing data). The input of this function can either be a 3d array of f2-statistics generated by f2_from_precomp
or f2_from_geno
, a directory with f2-statistics, or the prefix of genotype files. Computing f4 from genotype files directly is slower, but provides more flexibility in dealing with missing data (see details).
qpdstat(
data,
pop1 = NULL,
pop2 = NULL,
pop3 = NULL,
pop4 = NULL,
boot = FALSE,
sure = FALSE,
unique_only = TRUE,
comb = TRUE,
blgsize = NULL,
block_lengths = NULL,
f4mode = TRUE,
afprod = TRUE,
cpp = TRUE,
verbose = is.character(data),
...
)
data |
Input data in one of three forms:
|
pop1 |
One of the following four:
|
pop2 |
A vector of population labels |
pop3 |
A vector of population labels |
pop4 |
A vector of population labels |
boot |
If |
sure |
The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large. |
unique_only |
If |
comb |
Generate all combinations of |
blgsize |
SNP block size in Morgan. Default is 0.05 (5 cM). Only used when |
block_lengths |
Vector with lengths of each jackknife block. |
f4mode |
Set this to |
afprod |
Compute f4 from allele frequency products instead of f2 (default |
cpp |
Use C++ functions. Setting this to |
verbose |
Print progress updates |
... |
Additional arguments passed to |
f4- and D-statistics are informative about how four populations are related to one another. Estimates of f4 are unbiased as long as assumptions about SNP ascertainment and mutation rates are met. Missing data can violate the ascertainment assumptions: an f4-statistic may be significantly different when it is calculated from all non-missing SNPs, compared to what it would be if it were calculated from all SNPs in the genome. However, because this difference is often small, f4 is often calculated using samples or populations with missing data, on a particular subset of all SNPs. There are different strategies for choosing the SNPs in this case, and these strategies differ in how many SNPs they use, how likely they lead to bias, and whether pre-computed f2-statistics can be used.
Use the same SNPs for every f4-statistic
This is the most conservative option, but also the option which will use the smallest number of SNPs, which may result in a lack of power. It is the default option when pre-computing f2-statistics (maxmiss = 0
in extract_f2
or f2_from_geno
).
Use different SNPs for each f4-statistic
This option strikes a balance between avoiding bias and using a larger number of SNPs. For each f4-statistic it selects all SNPs which are present in all four populations. This option only works when the first argument is a genotype prefix (it doesn't work with pre-computed f2-statistics). In that case, this option is used by default. To turn it off and instead use the same SNPs for every f4-statistic, set allsnps = FALSE
. This option is the default option in the original qpDstat program (it is in fact the only option for selecting SNPs in the original qpDstat; however in the original qpAdm and qpGraph programs, this mode of selecting SNPs can be enabled with the allsnps
option, whereas the default mode in qpAdm and qpGraph is equivalent to maxmiss = 0
)
Use different SNPs for each f2-statistic
This is the least conservative option, but also the option which uses most of the available information. It makes it possible to use pre-computed f2-statistics for a large number of populations without losing a large number of SNPs. To use this option, set the maxmiss
parameter to a value greater than 0 (and not larger than 1) in extract_f2
or f2_from_geno
. When using this option, be aware that bias is possible, in particular for f4-statistics where some populations have large amounts of missing data. To reduce the bias that can result from using this option, you may want to combine it with using the option afprod = TRUE
in f2_from_precomp
.
In summary, whenever you work with populations with missing data, there is no guarantee that f4- or D-statistics involving these populations are not skewed in some way. If you choose to analyze these populations anyway, and you decide which SNPs to use, there is a trade-off between maximizing power and minimizing the risk of bias. One strategy might be to first use the least conservative option (setting maxmiss = 1
in extract_f2
) to get an overview, and then spot-check individual results using more conservative options.
qpdstat
returns a data frame with f4 statistics, with columns for populations 1 to 4,
f4-estimate (est
), standard error of the estimate (se
), z-score (z
, est
/se
), p-value (2*(1-pnorm(z))
),
and the number of SNPs used (n
; only if first argument is genotype prefix)
f4
Patterson, N. et al. (2012) Ancient admixture in human history Genetics
Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat(example_f2_blocks, pop1, pop2, pop3, pop4)
## Not run:
qpdstat(f2_dir, pop1, pop2, pop3, pop4)
## End(Not run)
## Not run:
# compute D-statistics instead
qpdstat("/geno/prefix", pop1, pop2, pop3, pop4)
## End(Not run)
## Not run:
# make a data frame with the population combinations for which f4 should be computed
combinations = tibble(pop1 = pop1, pop2 = pop2[1], pop3 = pop3, pop4 = pop4)
qpdstat(example_f2_blocks, combinations)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.