fst_NEI87 | R Documentation |
The function calculates for diploid genomes the classic Nei's Gst (1987),
Nei's G'st (prime), that comes with a correction for the bias that stems
from sampling a limited number of populations. Also calculated is Jost's D,
an index of population differentiation that is independent of the amount of
within-population diversity (Jost, 2008).
Both overall and pairwise Fst can be estimated with
confidence intervals based on bootstrap of markers (resampling with replacement).
The function should give identical results at the 4th decimal when tested
against genet.dist
in hierfstat
and with
the Fst computed in Calculate Distances
or
GenoDive.
The fastest computation is still
GenoDive,
but here, the R solution computes confidence intervals and it's very fast.
The computations takes advantage of tidyverse packages and parallel.
The impact of unbalanced design on estimates can be tested by using the
subsample argument.
Special concerns for genome-wide estimate and filtering bias
During computation, the function first starts by keeping only the polymorphic markers in common between the populations. Keep this in mind when filtering your markers to use this function characteristic strategically to get better genome-wide estimate. This is even more important when your project involves more than 2 populations that evolved more by neutral processes (e.g. genetic drift) than by natural selection (see the vignette for more details).
fst_NEI87(
data,
pop.levels = NULL,
pop.labels = NULL,
strata = NULL,
holdout.samples = NULL,
pairwise = FALSE,
ci = FALSE,
iteration.ci = 100,
quantiles.ci = c(0.025, 0.975),
subsample = NULL,
iteration.subsample = 1,
digits = 9,
parallel.core = parallel::detectCores() - 1,
verbose = FALSE,
...
)
data |
A file in the working directory or object in the global environment
in wide or long (tidy) formats. To import, the function uses internally
radiator
How to get a tidy data frame ?
radiator
|
pop.levels |
(optional, string) This refers to the levels in a factor. In this
case, the id of the pop.
Use this argument to have the pop ordered your way instead of the default
alphabetical or numerical order. e.g. |
pop.labels |
(optional, string) Use this argument to rename/relabel
your pop or combine your pop. e.g. To combine |
strata |
(optional) A tab delimited file with 2 columns with header:
|
holdout.samples |
(optional) Samples that don't participate in the Fst
computation (supplementary). Data frame with one column |
pairwise |
(logical, optional) With |
ci |
(logical, optional) Compute bootstrapped confidence intervals.
Default: |
iteration.ci |
(integer, optional) The number of iterations for
the boostraps (resampling with replacement of markers).
Default: |
quantiles.ci |
(double, optional)
The quantiles for the bootstrapped confidence intervals.
Default: |
subsample |
(Integer or character)
With |
iteration.subsample |
(Integer) The number of iterations to repeat
subsampling.
With |
digits |
(optional, integer) The number of decimal places to be used in
results.
Default: |
parallel.core |
(optional) The number of core for parallel computation
of pairwise Fst.
If not selected |
verbose |
(logical, optional) |
... |
other parameters passed to the function. |
Input data:
To discriminate the long from the wide format,
the function radiator tidy_wide
searches
for MARKERS or LOCUS
in column names (TRUE = long format).
The data frame is tab delimitted.
Wide format:
The wide format cannot store metadata info.
The wide format starts with these 2 id columns:
INDIVIDUALS
, STRATA
(that refers to any grouping of individuals),
the remaining columns are the markers in separate columns storing genotypes.
Long/Tidy format:
The long format is considered to be a tidy data frame and can store metadata info.
(e.g. from a VCF see radiator tidy_genomic_data
). A minimum of 4 columns
are required in the long format: INDIVIDUALS
, STRATA
,
MARKERS or LOCUS
and GENOTYPE or GT
. The rest are considered metata info.
2 genotypes formats are available:
6 characters no separator: e.g. 001002 of 111333
(for heterozygote individual).
6 characters WITH separator: e.g. 001/002 of 111/333
(for heterozygote individual).
The separator can be any of these: "/", ":", "_", "-", "."
.
How to get a tidy data frame ?
radiator tidy_genomic_data
can transform 6 genomic data formats
in a tidy data frame.
The function returns a list with several objects.
When sumsample is selected the objects end with .subsample
.
$subsampling.individuals
: the combinations of individuals and subsamples,
$fst.markers
: Nei's Gst, Nei's G'st and Jost's D by markers,
$fst.ranked
: Nei's Gst, Nei's G'st and Jost's D by markers ranked by Nei's Gst,
$fst.overall
: the overall Nei's Gst, Nei's G'st and Jost's D by markers with confidence intervals.
$fis.markers
: the fis by markers,
$fis.overall
: the mean fis overall markers with confidence intervals and the number of markers,
$fst.plot
: the histogram of the overall G'st per markers,
$pairwise.fst
: pairwise Nei's Gst, Nei's G'st and Jost's D in long/tidy data frame and the number of markers ,
$pairwise.fst.upper.matrix
: the pairwise fst prime in a upper triangle matrix,
$pairwise.fst.full.matrix
: the pairwise fst prime matrix (duplicated upper and lower triangle),
$pairwise.fst.ci.matrix
: matrix with pairwise fst prime in the upper triangle
and the confidence intervals in the lower triangle.
when subsample is selected $pairwise.fst.subsample.mean
is a summary
of all pairwise comparisons subsample. The mean is calculated accross summary
statistics.
Negative Fst are technical artifact of the computation (see Roesti el al. 2012) and are automatically replaced with zero inside this function.
Why no p-values ?
There is no null hypothesis testing with P-values. Confidence intervals provided with the F-statistics enables more reliable conclusions about the biological trends in the data.
Thierry Gosselin thierrygosselin@icloud.com
Nei M. (1987) Molecular Evolutionary Genetics. Columbia University Press
Meirmans PG, Van Tienderen PH (2004) genotype and genodive: two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes, 4, 792-794.
Roesti M, Salzburger W, Berner D. (2012) Uninformative polymorphisms bias genome scans for signatures of selection. BMC Evol Biol., 12:94. doi:10.1111/j.1365-294X.2012.05509.x
Jost L. (2008) G(ST) and its relatives do not measure differentiation. Molecular Ecology. 17: 4015-4026.
From GenoDive manual: 'In general, rather than to test differentiation between all pairs of populations, it is adviseable to perform an overall test of population differentiation, possibly using a hierarchical population structure, (see AMOVA)'
To compute an AMOVA, use GenoDive
Link for GenoDive
For Fisher's exact test and p-values per markers
see mmod
diff_test
.
tidy_genomic_data
to transform numerous genomic data
format in tidy data frames.
## Not run:
wombat.fst.pairwise <- fst_NEI87(
data = "wombat.filtered.tidy.tsv",
sep = "/",
pop.levels = c("ATL", "MLE", "BIS", "PMO", "SOL", "TAS", "ECU"),
holdout.samples = NULL,
pairwise = TRUE,
ci = TRUE,
iteration.ci = 10000,
quantiles.ci = c(0.025,0.975),
parallel.core = 8,
verbose = TRUE
)
To get the overall Fst estimate:
wombat.fst.pairwise$fst.overall
To get the Fst plot:
wombat.fst.pairwise$fst.plots
To get the pairwise Fst values with confidence intervals in a data frame:
wombat.fst.pairwise$pairwise.fst
To get the pairwise fst and ci matrix in a data frame:
# rename, data frame, put rownames in column
pairwise.fst.ci.df <- data.frame(pairwise.fst.ci.matrix) %>% add_rownames("POP")
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.