dapc_fit | R Documentation |
Takes a long-format data table of genotypes and conducts a PCA using R's
prcomp
function, then fits a DA using R's lda
function.
Can also be used to assess DA model fit using a leave-one-out cross-validation
or training-testing partitioning.
dapc_fit(
dat,
sampCol = "SAMPLE",
locusCol = "LOCUS",
genoCol = "GT",
popCol = "POP",
scaling = "covar",
pcPreds,
method = "fit",
numCores = 1,
trainProp = 0.7
)
dat |
Data table: A long data table, e.g. like that imported from
|
sampCol |
Character: The column name with the sampled individual information.
Default is |
locusCol |
Character: The column name with the locus information.
Default is |
genoCol |
Character: The column name with the genotype information.
Default is |
popCol |
Character: An optional argument. The column name with the
population information. Default is |
scaling |
Character: How should the data (loci) be scaled for PCA?
Default is |
pcPreds |
Integer: The number of leading PC axes to use as predictors of among-population genetic differences in DA. See details. |
method |
Character: The analysis to perform. Default is |
numCores |
Integer: The number of cores to run leave-one-out cross-validation,
only required when |
trainProp |
Numeric: The proportion of the data to reserved as the training set, with the remaining proportion used as the testing set. Default is 0.7. See details. |
DAPC was made popular in the population genetics/molecular ecology community following Jombart et al.'s (2010) paper. The method uses a DA to model the genetic differences among populations using PC axes of genotypes as predictors.
The choice of the number of PC axes to use as predictors of genetic
differences among populations should be determined using the k-1 criterion
described in Thia (2022). This criterion is based on the findings of
Patterson et al. (2006) that only the leading k-1 PC axes of a genotype
dataset capture biologically meaningful structure. Users can use the function
genomalicious::dapc_infer
to examine eigenvalue screeplots and
perform K-means clustering with different parameters to infer the number of
biologically informative PC axes.
Assessing model fit of DAPC requires partitioning data into sets
for training and testing. When method=='loo_cv'
, leave-one-out cross-validation
is performed: each ith sample is withheld as a testing sample, the model is
fit without the ith sample, and then the model is used to predict the ith sample's
population. This method is preferable when sample sizes are small.
When method=='train_test'
, a proportion of trainProp
individuals
from each populations are used to train the DAPC model which is then used to
predict the populations in the remaining testing individuals.
Returns a list, whose contents depend on the method
specified.
If method=='fit'
, the list contains:
$da.fit
: an lda
object, a DA of genotypic PC axes.
If also contains the index $exp.var
, which is the percent of among
population variance captured by each LD axis, and $among.var
, the
percent of among population variance.
$da.tab
: a data table of LD axis scores, with columns $POP
,
the designated population, $SAMPLE
, the sample ID, $[axis]
,
where each [axes]
is a column for a different DA axis.
$da.prob
: a data table of posterior probabilities of group
assignment for each sample, with columns $POP
, the designated
population, $SAMPLE
, the sample ID, $POP.PRED
, the predicted
population, $PROB
, the posterior probability for the predicted
populations (sums to 1 across predicted populations per sample).
pca.fit
: a prcomp
object, a PCA of genotypes.
pca.tab
: a data table of PC axis scores, with columns $POP
,
the designated populations, $SAMPLE
, the sample ID, $[axis]
,
where each [axes]
is a column for a different PC axis.
snp.contrib
: a data table of SNP contributions to LD axes,
with columns, $LOCUS
, the SNP locus, $LD[x]
, the individual
LD axes, with [x]
denoting the axis number.
If method=='loo_cv'
or method=='train_test'
, the list contains:
$tab
: a data table of predictions for tested samples, with
columns, $POP
, the designated population, SAMPLE
, the
sample ID, and $POP.PRED
, the predicted population. Note, that
for method=='loo_cv'
, as samples with be present, but for
method=='train_test'
, only the samples retained for the testing
set will be present.
$global
: a single numeric, the global correct assignment rate.
$pairs.long
: a long-format data table, of pairwise population correct
assignment rates, with columns, $POP
, the designated population,
$POP.PRED
, the predicted population, and $ASSIGN
, the
assignment rate. Note, the correct assignment rate are those
instances where values in POP==POP.PRED
.
$pairs.wide
: a wide-format data table of pairwise population
assignment rates, with columns, $POP
, the designated population,
and $[pop]
, the predicted populations, where each possible predicted
population is a [pop]
column. The cell contents are the assignment
rate, with correct assignment rates on the diagonal.
Jombart et al. (2010) BMC Genetics. DOI: 10.1186/1471-2156-11-94 Patterson et al. (2006) PLoS Genetics. DOI: 10.1371/journal.pgen.0020190 Thia (2022) Mol. Ecol. DOI: 10.1111/1755-0998.13706
library(genomalicious)
data(data_Genos)
### Fit the DAPC with the first 3 PC axes as predictors
DAPC.fit <- dapc_fit(dat=data_Genos, pcPreds=3, method='fit')
# Table of LD and PC axis scores
DAPC.fit$da.tab
DAPC.fit$pca.tab
# The lda and prcomp objects
DAPC.fit$da.fit
DAPC.fit$pca.fit
# The contributions of SNP to the LD axes
DAPC.fit$snp.contrib
# The posterior probabilities
DAPC.fit$da.prob
### Leave-one out cross-validation with 2 cores
DAPC.loo <- dapc_fit(data_Genos, method='loo_cv', pcPreds=3, numCores=2)
# Predictions
DAPC.loo$tab
# Global correct assignment rate
DAPC.loo$global
# Pairwise assignment rates in long-format data table
DAPC.loo$pairs.long
# Pairwise correct assignment rates from long-format data table
DAPC.loo$pairs.long[POP==POP.PRED]
# Pairwise assignment rates in wide-format data table
DAPC.loo$pairs.wide
#### Training-testing partitioning with 80% used as trianing
DAPC.tt <- dapc_fit(data_Genos, method='train_test', pcPreds=3, trainProp=0.8)
# Pairwise assignment rates in wide-format data table
DAPC.tt$pairs.wide
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.