dapc_infer | R Documentation |
Takes a long-format data table of genotypes and assists in a preliminary inference of k, the effective number of populations. Inference of k is facilitated through examination of the PCA screeplot and through K-means clustering tests.
dapc_infer(
dat,
scaling = "covar",
sampCol = "SAMPLE",
locusCol = "LOCUS",
genoCol = "GT",
kTest = 1:10L,
pTest,
screeMax = 20L,
plotLook = "ggplot"
)
dat |
Data table: A long data table, e.g. like that imported from
|
scaling |
Character: How should the data (loci) be scaled?
Set to |
sampCol |
Character: The column name with the sampled individual information.
Default is "SAMPLE". |
locusCol |
Character: The column name with the locus information.
Default is "LOCUS". |
genoCol |
Character: The column name with the genotype information.
Default is "GT". |
kTest |
Integer: A vector of the number of (k) values to test.
Default is 1:10L. |
pTest |
Integer: A vector of the number of (p) PC axes to fit K-means with. |
screeMax |
Integer: The maximum number of PC axes to plot in the screeplot. |
plotLook |
Character: The look of the plot. Default = "ggplot". |
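As an illustration of the long-format input described above, the following is a minimal base R sketch of a table with one row per sample per locus, reshaped into a samples-by-loci matrix of the kind a PCA operates on. The sample and locus names are hypothetical, and coding GT as integer counts (0, 1, 2) of one allele is an assumption made for this sketch, not a statement about the encoding dapc_infer requires.

```r
# Hypothetical long-format genotype table: one row per sample per locus.
# ASSUMPTION: GT is coded as the count (0, 1, 2) of one allele.
dat_long <- data.frame(
  SAMPLE = rep(c("Ind1", "Ind2", "Ind3"), each = 2),
  LOCUS  = rep(c("Locus1", "Locus2"), times = 3),
  GT     = c(0, 1, 2, 1, 1, 0)
)

# Reshape to a wide matrix: samples in rows, loci in columns.
# Each sample-locus pair occurs once, so sum() simply returns that value.
geno_mat <- tapply(dat_long$GT, list(dat_long$SAMPLE, dat_long$LOCUS), FUN = sum)
geno_mat
```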
DAPC (discriminant analysis of principal components) was popularised in the population genetics/molecular ecology community by Jombart et al. (2010). The method uses a discriminant analysis (DA) to model the genetic differences among populations, using PC axes of genotypes as predictors.
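The two-step idea (a PCA of genotypes, followed by a DA on the leading PC axes) can be sketched with base R and the MASS package. This is only an illustration on simulated data, not the genomalicious implementation: the simulated allele frequencies, the object names, and the use of MASS::lda for the DA step are all assumptions of this sketch.

```r
library(MASS)  # provides lda(); used here as a stand-in DA implementation

set.seed(1)
n_per_pop <- 30; n_loci <- 200; k <- 3

# Hypothetical allele frequencies for each of k simulated populations
freqs <- matrix(runif(k * n_loci, 0.1, 0.9), nrow = k)
pops  <- factor(rep(paste0("Pop", 1:k), each = n_per_pop))

# Simulated genotypes as allele counts (0, 1, 2): individuals x loci
genos <- t(sapply(as.integer(pops), function(i) rbinom(n_loci, 2, freqs[i, ])))

# Step 1: PCA of the genotype matrix (centred, unscaled)
pca <- prcomp(genos, center = TRUE, scale. = FALSE)

# Step 2: DA with the leading k - 1 PC axes as predictors
pc_axes <- pca$x[, 1:(k - 1), drop = FALSE]
da   <- lda(x = pc_axes, grouping = pops)
pred <- predict(da, pc_axes)

# Cross-tabulate the known and the predicted populations
table(observed = pops, predicted = pred$class)
```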
The choice of the number of PC axes to use as predictors of genetic
differences among populations should be determined using the k-1 criterion
described in Thia (2022). This criterion is based on the findings of
Patterson et al. (2006) that only the leading k-1 PC axes of a genotype
dataset capture biologically meaningful structure. Users can use the function
genomalicious::dapc_infer to examine eigenvalue screeplots and
perform K-means clustering with different parameters to infer the number of
biologically informative PC axes.
Users should examine both the screeplot of eigenvalues and the different
K-means plots produced. The screeplot typically exhibits a break in the scree
around the putative k. Additionally, different parameterisations of
K-means clustering should converge on a similar conclusion. Users may
also find it useful to visualise scatterplots, e.g., using pca_genos.
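The workflow described above (inspect the eigenvalue screeplot, then compare K-means fits across values of k and p) can be approximated in base R as a rough sketch. It runs on simulated data with three populations and does not call dapc_infer. The BIC formula used, n * log(WSS / n) + k * log(n), is a common formulation for K-means and is an assumption here; it may differ from what dapc_infer computes internally. All object names are hypothetical.

```r
set.seed(42)
n_per_pop <- 30; n_loci <- 200; k_true <- 3

# Simulated allele-count genotypes (0, 1, 2) for three hypothetical populations
freqs     <- matrix(runif(k_true * n_loci, 0.1, 0.9), nrow = k_true)
pop_index <- rep(seq_len(k_true), each = n_per_pop)
genos     <- t(sapply(pop_index, function(i) rbinom(n_loci, 2, freqs[i, ])))

# PCA and eigenvalues
pca       <- prcomp(genos, center = TRUE, scale. = FALSE)
eigenvals <- pca$sdev^2

# Screeplot of the first 10 eigenvalues: look for a break in the scree
plot(eigenvals[1:10], type = "b", xlab = "PC axis", ylab = "Eigenvalue")

# K-means over a grid of k (clusters) and p (number of leading PC axes)
n    <- nrow(genos)
grid <- expand.grid(k = 1:6, p = c(2, 10))
grid$bic <- mapply(function(k, p) {
  fit <- kmeans(pca$x[, 1:p, drop = FALSE], centers = k, nstart = 10)
  # ASSUMED BIC formulation; lower values indicate a better-supported k
  n * log(fit$tot.withinss / n) + k * log(n)
}, grid$k, grid$p)

grid
```

If the different values of p point to a similar k, and that k matches the break in the screeplot, the inference is better supported.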
This function can also be used to determine populations de novo if the user
has no a priori expectation of the number of populations and the designation
of individuals. See Miller et al. (2020) and Thia (2022) for the distinction
between, and importance of, a priori vs. de novo population designations. The function
returns all K-means solutions for all parameter combinations. After inspecting
the screeplot and the K-means solutions, the desired K-means fit can be
extracted from the returned object to obtain de novo population designations
for downstream analysis, e.g., with genomalicious::dapc_fit.
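A minimal sketch of turning a chosen K-means fit into de novo population designations follows. To stay self-contained it builds a stand-in for the returned $fit element (a named list of kmeans outputs) from random scores rather than calling dapc_infer. The POP column name, the "Pop" labels, and the assumption that the cluster vector carries sample names are all hypothetical.

```r
set.seed(7)

# Stand-in PC scores for 60 hypothetical individuals on 3 PC axes
pc_scores <- matrix(
  rnorm(60 * 3), ncol = 3,
  dimnames = list(paste0("Ind", 1:60), paste0("PC", 1:3))
)

# Stand-in for the `$fit` element: a named list of kmeans() outputs
fit_list <- list(`k=3,p=3` = kmeans(pc_scores, centers = 3, nstart = 10))

# Extract the desired fit and build a sample-to-population table.
# kmeans() names the cluster vector using the row names of its input.
chosen <- fit_list$`k=3,p=3`
denovo_pops <- data.frame(
  SAMPLE = names(chosen$cluster),
  POP    = paste0("Pop", chosen$cluster)
)

head(denovo_pops)
table(denovo_pops$POP)
```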
Returns a list:
$tab is a data table of the k and p values examined and their associated BIC values.
$fit contains the individual outputs from kmeans for each combination of parameters fitted.
$plot is a ggplot object.
Jombart et al. (2010) BMC Genetics. DOI: 10.1186/1471-2156-11-94
Miller et al. (2020) Heredity. DOI: 10.1038/s41437-020-0348-2
Patterson et al. (2006) PLoS Genetics. DOI: 10.1371/journal.pgen.0020190
Thia (2022) Mol. Ecol. Resour. DOI: 10.1111/1755-0998.13706
library(genomalicious)
data(data_Genos)
# Test k = 1 to 10, fitting K-means with p = 3, 10, 20, and 40 PC axes,
# plotting just the first 10 eigenvalues from the PCA, with the ggplot look.
inferK <- dapc_infer(
data_Genos,
kTest=1:10L,
pTest=c(3,10,20,40),
screeMax=10L,
plotLook='ggplot'
)
# Tabulated statistics
inferK$tab
# The K-means clustering results for k=3 fitted with p=3 PC axes
inferK$fit$`k=3,p=3`
# The plot
inferK$plot