The degree of genetic differentiation between populations is often measured by the fixation index Gst (Nei, 1973). However, differentiation at polymorphic loci with more than two alleles is much better reflected by the D value (Jost, 2008; Gerlach et al., 2010). The functions of this package estimate pairwise Gst and D values between populations for codominant markers, locus by locus and averaged over loci, as well as their averages over all populations. P-values (indicating the strength of evidence against the null hypothesis of no genetic differentiation) and 95% confidence limits are obtained from bootstrap methods. Depending on whether or not all populations are in Hardy Weinberg Equilibrium for a given locus, either alleles or genotypes are randomized over populations, respectively (see Goudet, 1996).
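As a rough illustration of the two measures, the following sketch computes uncorrected Gst and D for a single locus from a table of allele frequencies. It assumes equally weighted populations and uses the textbook formulas Gst = (Ht - Hs) / Ht and D = ((Ht - Hs) / (1 - Hs)) * (n / (n - 1)); the frequencies are invented for illustration, and the code is not the package's own implementation.

```r
# Allele frequencies at one locus: rows = populations, columns = alleles.
# Illustrative numbers only, not taken from the package's example data.
freqs <- rbind(P1 = c(a175 = 0.5, a183 = 0.5, a230 = 0.0),
               P2 = c(a175 = 0.0, a183 = 0.5, a230 = 0.5))

n  <- nrow(freqs)                    # number of populations
Hs <- mean(1 - rowSums(freqs^2))     # mean within-population heterozygosity
Ht <- 1 - sum(colMeans(freqs)^2)     # heterozygosity of the pooled populations

Gst <- (Ht - Hs) / Ht                          # fixation index (Nei, 1973)
D   <- ((Ht - Hs) / (1 - Hs)) * (n / (n - 1))  # differentiation (Jost, 2008)
```

With these frequencies the two populations share only one of three alleles, yet Gst stays low (0.2) while D is 0.5, which is the kind of discrepancy the text refers to.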
filename |
Its syntax depends on the setting of the argument |
bias |
An argument providing two options ( |
object |
This argument can be set as |
format.table |
A logical argument either set as |
pm |
A two-level argument providing the opportunity to compare
populations pairwise ( |
statistics |
A four-level argument to select whether no statistics
( |
bt |
A numeric argument (default= |
The input format
The input data can be in one of two formats, both of which must be tab-delimited. The information that has to be provided is a name or number for each individual, the population it was sampled from and the alleles (length in base pairs, rounded) at each locus. Two alleles have to be defined for each diploid individual. Haplotype data cannot be evaluated with this package. Missing alleles have to be set to zero (possible: 0, 00, 000).
The data table that has to be transformed by choosing format.table=TRUE can be provided in the following format:
individual | population | locus1.allele.a | locus1.allele.b | locus2.allele.a | locus2.allele.b |
P1.1 | P1 | 175 | 183 | 110 | 110 |
P1.2 | P1 | 183 | 183 | 123 | 126 |
P2.1 | P2 | 230 | 225 | 110 | 110 |
. | . | . | . | . | . |
. | . | . | . | . | . |
. | . | . | . | . | . |
The number of populations and loci is not restricted.
The column names individual and population must be included. The other columns listing the fragment lengths in base pairs can be named arbitrarily. It is recommended to give the two columns that refer to the same locus the same name (e.g. locus1.allele.a and locus1.allele.b should both be named Locus1). Mathematical signs like + or - should be avoided, and spaces are not allowed in column names.
Alternatively, when the input data are given in the following format, they do not have to be transformed (format.table=FALSE):
individual | population | fragment.length | locus |
P1.1 | P1 | 175 | L1 |
P1.1 | P1 | 183 | L1 |
P1.2 | P1 | 183 | L1 |
P1.2 | P1 | 183 | L1 |
P2.1 | P2 | 230 | L1 |
P2.1 | P2 | 225 | L1 |
. | . | . | . |
. | . | . | . |
. | . | . | . |
P1.1 | P1 | 110 | L2 |
P1.1 | P1 | 110 | L2 |
P1.2 | P1 | 123 | L2 |
P1.2 | P1 | 126 | L2 |
P2.1 | P2 | 110 | L2 |
P2.1 | P2 | 110 | L2 |
. | . | . | . |
. | . | . | . |
. | . | . | . |
The data in the column fragment.length represent numbers of base pairs.
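The relationship between the two layouts can be sketched in base R as follows. This is only an illustration of the reshaping that format.table=TRUE implies, not the package's own code; it assumes that the two allele columns of each locus are adjacent, and the locus labels L1, L2 are generated here for the example.

```r
# Wide layout: one row per individual, two allele columns per locus.
wide <- data.frame(
  individual = c("P1.1", "P1.2", "P2.1"),
  population = c("P1", "P1", "P2"),
  locus1.allele.a = c(175, 183, 230),
  locus1.allele.b = c(183, 183, 225),
  locus2.allele.a = c(110, 123, 110),
  locus2.allele.b = c(110, 126, 110)
)

allele_cols <- setdiff(names(wide), c("individual", "population"))
# Assumption: columns come in adjacent pairs, one pair per locus.
locus_id <- paste0("L", rep(seq_len(length(allele_cols) / 2), each = 2))

# Stack every allele column into the long layout: one row per allele.
long <- do.call(rbind, lapply(seq_along(allele_cols), function(i) {
  data.frame(individual = wide$individual,
             population = wide$population,
             fragment.length = wide[[allele_cols[i]]],
             locus = locus_id[i])
}))

# Sort so that the two alleles of an individual follow each other per locus.
long <- long[order(long$locus, long$population, long$individual), ]
rownames(long) <- NULL
```

The resulting data frame has the four columns individual, population, fragment.length and locus, with two rows per individual and locus.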
Details on confidence interval calculation
95% confidence intervals of the D or Gst values are based on the range of these values from reallocated data sets that are obtained by bootstrapping alleles (or genotypes) of one locus within populations. Hardy Weinberg Equilibrium (HWE) is tested for each locus and each population. If all of the tested populations are in HWE, the alleles of a single locus are randomized within populations. Otherwise, alleles are not inherited independently of each other, and genotypes are randomized within populations (Goudet, 1996). The lower and upper 95% confidence limits are evaluated from the lower (0.025) and upper (0.975) quantiles of the D or Gst values from the resampled data, using the function quantile:
Empirical D or Gst +(-) upper(lower) quantile bound
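The resampling step can be sketched as follows for a single locus in HWE, where alleles are resampled with replacement within each population. The helper jost_d and the allele vectors are hypothetical and only serve the illustration; how the package combines the quantile bounds with the empirical value is not reproduced here.

```r
set.seed(1)
# Hypothetical alleles at one locus for two populations (illustration only).
alleles <- list(P1 = c(175, 183, 183, 183, 175, 183),
                P2 = c(230, 225, 230, 183, 225, 230))
lev <- sort(unique(unlist(alleles)))   # all alleles observed at this locus

# Uncorrected Jost's D from a list of allele vectors (hypothetical helper).
jost_d <- function(pops, lev) {
  freqs <- t(vapply(pops, function(a) {
    as.numeric(table(factor(a, levels = lev))) / length(a)
  }, numeric(length(lev))))
  n  <- nrow(freqs)
  Hs <- mean(1 - rowSums(freqs^2))
  Ht <- 1 - sum(colMeans(freqs)^2)
  ((Ht - Hs) / (1 - Hs)) * (n / (n - 1))
}

d_emp <- jost_d(alleles, lev)

# Bootstrap: resample alleles with replacement WITHIN each population.
boot_d <- replicate(1000, jost_d(lapply(alleles, sample, replace = TRUE), lev))

# Lower (0.025) and upper (0.975) quantiles of the bootstrapped values.
ci <- quantile(boot_d, probs = c(0.025, 0.975))
```

When the populations are not all in HWE, the same scheme would resample genotypes (pairs of alleles) instead of single alleles.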
Details on p-value calculation
To test the null hypothesis of no genetic differentiation between populations, a bootstrap method is used in which alleles (or genotypes) of one locus are randomized over all compared populations. Hardy Weinberg Equilibrium (HWE) is tested for each locus and each population. If all of the tested populations are in HWE, the alleles of a single locus are randomized over all populations. Otherwise, alleles are not inherited independently of each other, and genotypes are randomized over all populations (Goudet, 1996). Reallocating alleles or genotypes simulates populations that share a common gene pool and are not differentiated. Since the empirical value of genetic differentiation is expected to be larger than a value obtained from within a panmictic population when the tested populations are significantly differentiated, a one-tailed test is carried out. The null hypothesis (panmictic populations) can be rejected at the 5% significance level (p<0.05) when the empirical value is larger than 95% of the bootstrapped test statistics. The p-value is calculated according to Manly (1997, p. 62).
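A minimal sketch of such a one-tailed randomization test for a single locus in HWE is given below. The helper jost_d and the data are hypothetical, and the p-value counts the empirical value as one of the randomizations, which is a common convention for randomization tests and may differ in detail from the package's computation.

```r
set.seed(42)
# Hypothetical alleles at one locus, six per population (illustration only).
alleles <- c(175, 183, 183, 183, 175, 183, 230, 225, 230, 183, 225, 230)
pop     <- rep(c("P1", "P2"), each = 6)

# Uncorrected Jost's D from allele and population labels (hypothetical helper).
jost_d <- function(alleles, pop) {
  freqs <- prop.table(table(pop, alleles), margin = 1)
  n  <- nrow(freqs)
  Hs <- mean(1 - rowSums(freqs^2))
  Ht <- 1 - sum(colMeans(freqs)^2)
  ((Ht - Hs) / (1 - Hs)) * (n / (n - 1))
}

d_emp <- jost_d(alleles, pop)

# Null distribution: shuffle alleles OVER all populations (common gene pool).
d_null <- replicate(999, jost_d(sample(alleles), pop))

# One-tailed p-value: share of values at least as large as the empirical one,
# with the empirical value itself included in the count.
p_value <- (sum(d_null >= d_emp) + 1) / (length(d_null) + 1)
```

With 999 randomizations the smallest attainable p-value is 0.001, so the number of resamplings (argument bt) limits the resolution of the test.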
When more than two populations are compared with one another using the option pm="pairwise", the p-values are adjusted to account for the multiple comparisons made on one data set, using the function p.adjust of the package stats. The adjusted p-values represent the smallest overall significance levels at which the hypothesis would be rejected (Wright, 1992). The p-values for different loci are adjusted independently of each other. The p-values for the differentiation averaged over all loci are adjusted relative to one another. The adjustment is performed by Bonferroni correction, by Holm's method, by Hommel's method and by the method of Benjamini and Hochberg. See the help file of the function p.adjust for further information on these methods.
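The four adjustments can be reproduced directly with p.adjust. The raw p-values below are invented for three hypothetical pairwise comparisons; only the method names are taken from the stats package.

```r
# Hypothetical raw p-values from three pairwise population comparisons.
p_raw <- c(P1.P2 = 0.004, P1.P3 = 0.020, P2.P3 = 0.300)

# The four adjustment methods named above, as spelled in stats::p.adjust.
methods <- c("bonferroni", "holm", "hommel", "BH")

# One column of adjusted p-values per method, one row per comparison.
adjusted <- sapply(methods, function(m) p.adjust(p_raw, method = m))
```

Bonferroni is the most conservative of the four; the Benjamini and Hochberg method ("BH") controls the false discovery rate rather than the family-wise error rate.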
Test for Hardy Weinberg Equilibrium (HWE)
Before bootstrapping, populations are automatically tested for HWE by comparing the empirical numbers of genotypes with those expected under HWE, using the function chisq.test with the arguments simulate.p.value=TRUE and B=10000. This means that the p-value is obtained from a Monte Carlo method with 10000-fold resampling. The null hypothesis of HWE is rejected when p is smaller than 0.05.
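For one biallelic locus in one population, such a test might look like the sketch below. The genotype counts are hypothetical, and the way expected proportions are passed to chisq.test (via its p argument) is an assumption about how the comparison could be set up, not the package's own code. Note that chisq.test names the number of Monte Carlo replicates B (upper case).

```r
set.seed(7)
# Hypothetical genotype counts at one biallelic locus in one population.
obs <- c(AA = 30, AB = 50, BB = 20)

n_ind <- sum(obs)
p_A <- (2 * obs[["AA"]] + obs[["AB"]]) / (2 * n_ind)   # frequency of allele A

# Genotype proportions expected under HWE: p^2, 2pq, q^2.
expected_prop <- c(AA = p_A^2, AB = 2 * p_A * (1 - p_A), BB = (1 - p_A)^2)

# Goodness-of-fit test with a Monte Carlo p-value from 10000 replicates.
hwe <- chisq.test(x = obs, p = expected_prop,
                  simulate.p.value = TRUE, B = 10000)

in_hwe <- hwe$p.value >= 0.05   # HWE is rejected when p < 0.05
```

Here the observed counts are very close to the HWE expectation, so the null hypothesis is not rejected and alleles, rather than genotypes, would be randomized for this locus.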
Results are saved as .txt files (space-delimited) in the current working directory, which is normally the one your input data were loaded from. The path of the working directory can be queried by typing getwd() and changed with the function setwd(). During the calculation, the output is printed in the R console, together with a short description of the kind of data and the names of the respective .txt files. The filenames include the argument filename and the current date.
If you compare more than two populations pairwise and calculate p-values and/or confidence intervals, you will be informed about the estimated end of the analysis after completion of the first pairwise comparison.
If the same analysis is carried out more than once on the same day on a single dataset, all results will be found in the same file, one written below the other and separated by a row of column names (provided the working directory was not changed).
The output files are described in the following paragraphs:
allelefrequencies |
A data table comprising the following columns:
|
sample sizes |
A data table comprising the following columns:
|
heterozygosities |
A data table that lists heterozygosities which are calculated according to the formulas given in Jost (2008).
|
Depending on whether populations are compared pairwise (pm="pairwise") or differentiation / fixation is estimated over all populations (pm="overall"), the result tables comprising the D/Gst values differ slightly.
When overall D or Gst values are evaluated, the output comprises the following two data tables (X stands for D, Dest, Gst or Gst.est values):
X.loci.over.all.populations |
|
X.mean.over.all.populations |
|
When populations are compared pairwise, INTERMEDIATE RESULTS are automatically printed and saved after each comparison. The next INTERMEDIATE RESULT is printed to the same file, separated from the preceding result by a row of column names. When the whole analysis is completed, the END RESULT, containing the information of all the INTERMEDIATE RESULTs in a single data frame, is printed and saved to the same file, separated from the preceding INTERMEDIATE RESULTs by a row of column names. Appending the results one below the other avoids loss of data, but some care is needed: if you want to work with INTERMEDIATE RESULTs that have already been saved, it is recommended to copy the respective file and work with the copy. Otherwise, problems can arise when you work with the original file while R tries to write new results to it, which could interrupt the analysis.
If an analysis is carried out more than once on the same day, all results will be found in the same file, one written below the other and separated by a row of column names (provided the working directory was not changed).
If an analysis runs for more than one day, the INTERMEDIATE RESULTs will be saved in different files, according to the date on which they were calculated. All the INTERMEDIATE RESULTs will nevertheless be included in the END RESULT, in which they are finally saved together.
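Working on a copy of such an appended file can be sketched as follows. The file content, the column names and the temporary file paths are stand-ins invented for this example; only the general layout (space-delimited blocks, each preceded by the same row of column names) is taken from the description above.

```r
# Write a small stand-in file: two result blocks, each preceded by the
# same row of column names (layout assumed, column names hypothetical).
original <- tempfile(fileext = ".txt")
writeLines(c("Population1 Population2 Dest",
             "P1 P2 0.31",
             "Population1 Population2 Dest",
             "P1 P2 0.31",
             "P1 P3 0.12"), original)

# Work on a copy so that R can keep appending to the original file.
work_copy <- tempfile(fileext = ".txt")
file.copy(original, work_copy)

lines  <- readLines(work_copy)
header <- lines[1]
body   <- lines[lines != header]   # drop the repeated rows of column names

results <- read.table(text = c(header, body), header = TRUE,
                      stringsAsFactors = FALSE)
```

Because INTERMEDIATE RESULTs are repeated in the END RESULT, the combined table can contain duplicated rows; unique(results) removes them if needed.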
The output comprises data tables with the following information (X stands for D, Dest, Gst or Gst.est values):
X.loci.pairwise.comparison |
|
X.mean.pairwise.comparison |
|
When you choose the option format.table=TRUE, a data file called “Output-Inputformat.txt” is created that is needed by the functions of this package to analyze the data.
Depending on the size of your data set and the performance of your computer, the bootstrapping process for calculating p-values and confidence intervals can take a long time, so you might want to run the analysis overnight.
When you carry out pairwise population comparisons, you will be informed, after the data for the first population pair have been evaluated, when the whole analysis is estimated to finish.
Alexander Jueterbock, Alexander-Jueterbock@web.de
Philipp Kraemer, philipp.kraemer@mail.uni-oldenburg.de
Gerlach G., Jueterbock A., Kraemer P., Deppermann J. and Harmand P. 2010
Calculations of population differentiation based on Gst and D:
forget Gst but not all of statistics!
Molecular Ecology 19, p. 3845–3852.
Goudet J., Raymond M., deMeeues T. and Rousset F. 1996
Testing differentiation in diploid populations.
Genetics 144, 4, p. 1933–1940.
Jost, L. 2008
Gst and its relatives do not measure differentiation.
Molecular Ecology 17, 18, p. 4015–4026.
Manly, B.F.J. 1997
Randomization, bootstrap and Monte Carlo methods in biology
Chapman & Hall.
Nei, M. 1973
Analysis of gene diversity in subdivided populations.
Proceedings of the National Academy of Sciences of the United States of America 70, 12, p. 3321–3323.
Nei M., Chesser R. 1983
Estimation of fixation indices and gene diversities.
Annals of Human Genetics 47, 253–259.
Wright, S.P. 1992
Adjusted p-values for simultaneous inference.
Biometrics 48, 1005–1013.
# loading data from the example files of this package
data(Example.transformed)
Example.t <- Example.transformed
data(Example.untransformed)
Example.u <- Example.untransformed
# Calculating mean Dest values (averaged over all populations) with
# p-values and confidence intervals using only 10 bootstrap resamplings
D.Jost("Example.t", bias="correct", object=TRUE, format.table=FALSE,
pm="overall", statistics="all", bt=10)
# Calculating pairwise Gst values without any statistics
Gst.Nei("Example.u", bias="uncorrected", object=TRUE, format.table=TRUE,
pm="pairwise", statistics="none")
# If you do not know where the results of these example tables have been
# saved, type getwd()