dapc_fit: Perform a DAPC of individual genotype data in a data table

View source: R/dapc_fit.R

dapc_fitR Documentation

Perform a DAPC of individual genotype data in a data table

Description

Takes a long-format data table of genotypes and conducts a PCA using R's prcomp function, then fits a DA using R's lda function. Can also be used to assess DA model fit using a leave-one-out cross-validation or training-testing partitioning.

Usage

dapc_fit(
  dat,
  sampCol = "SAMPLE",
  locusCol = "LOCUS",
  genoCol = "GT",
  popCol = "POP",
  scaling = "covar",
  pcPreds,
  method = "fit",
  numCores = 1,
  trainProp = 0.7
)

Arguments

dat

Data table: A long data table, e.g. like that imported from genomalicious::vcf2DT. Genotypes can be coded as '/' separated characters (e.g. '0/0', '0/1', '1/1'), or integers as Alt allele counts (e.g. 0, 1, 2). Must contain the following columns,

  1. The population designation (see param popCol)

  2. The sampled individuals (see param sampCol).

  3. The locus ID (see param locusCol).

  4. The genotype column (see param genoCol).

sampCol

Character: The column name with the sampled individual information. Default is 'SAMPLE'.

locusCol

Character: The column name with the locus information. Default is 'LOCUS'.

genoCol

Character: The column name with the genotype information. Default is 'GT'.

popCol

Character: An optional argument. The column name with the population information. Default is NULL. If specified, population membership is stored in the returned object.

scaling

Character: How should the data (loci) be scaled for PCA? Default is 'covar' to scale to mean = 0, but variance is not adjusted, i.e. PCA on a covariance matrix. Set to 'corr' to scale to mean = 0 and variance = 1, i.e. PCA on a correlation matrix. Set to 'patterson' to use the Patteron et al. (2006) normalisation. Set to 'none' to if you do not want to do any scaling before PCA.

pcPreds

Integer: The number of leading PC axes to use as predictors of among-population genetic differences in DA. See details.

method

Character: The analysis to perform. Default is 'fit', which is a DAPC fitted to all the samples. 'loo_cv' performs leave-one-out cross-validation, and 'train_test' performs training-testing partitioning, for assessing model fit. See details

numCores

Integer: The number of cores to run leave-one-out cross-validation, only required when method=='loo_cv'. Default is 1.

trainProp

Numeric: The proportion of the data to reserved as the training set, with the remaining proportion used as the testing set. Default is 0.7. See details.

Details

DAPC was made popular in the population genetics/molecular ecology community following Jombart et al.'s (2010) paper. The method uses a DA to model the genetic differences among populations using PC axes of genotypes as predictors.

The choice of the number of PC axes to use as predictors of genetic differences among populations should be determined using the k-1 criterion described in Thia (2022). This criterion is based on the findings of Patterson et al. (2006) that only the leading k-1 PC axes of a genotype dataset capture biologically meaningful structure. Users can use the function genomalicious::dapc_infer to examine eigenvalue screeplots and perform K-means clustering with different parameters to infer the number of biologically informative PC axes.

Assessing model fit of DAPC requires partitioning data into sets for training and testing. When method=='loo_cv', leave-one-out cross-validation is performed: each ith sample is withheld as a testing sample, the model is fit without the ith sample, and then the model is used to predict the ith sample's population. This method is preferable when sample sizes are small. When method=='train_test', a proportion of trainProp individuals from each populations are used to train the DAPC model which is then used to predict the populations in the remaining testing individuals.

Value

Returns a list, whose contents depend on the method specified.

If method=='fit', the list contains:

  1. $da.fit: an lda object, a DA of genotypic PC axes. If also contains the index $exp.var, which is the percent of among population variance captured by each LD axis, and $among.var, the percent of among population variance.

  2. $da.tab: a data table of LD axis scores, with columns $POP, the designated population, $SAMPLE, the sample ID, $[axis], where each [axes] is a column for a different DA axis.

  3. $da.prob: a data table of posterior probabilities of group assignment for each sample, with columns $POP, the designated population, $SAMPLE, the sample ID, $POP.PRED, the predicted population, $PROB, the posterior probability for the predicted populations (sums to 1 across predicted populations per sample).

  4. pca.fit: a prcomp object, a PCA of genotypes.

  5. pca.tab: a data table of PC axis scores, with columns $POP, the designated populations, $SAMPLE, the sample ID, $[axis], where each [axes] is a column for a different PC axis.

  6. snp.contrib: a data table of SNP contributions to LD axes, with columns, $LOCUS, the SNP locus, $LD[x], the individual LD axes, with [x] denoting the axis number.

If method=='loo_cv' or method=='train_test', the list contains:

  1. $tab: a data table of predictions for tested samples, with columns, $POP, the designated population, SAMPLE, the sample ID, and $POP.PRED, the predicted population. Note, that for method=='loo_cv', as samples with be present, but for method=='train_test', only the samples retained for the testing set will be present.

  2. $global: a single numeric, the global correct assignment rate.

  3. $pairs.long: a long-format data table, of pairwise population correct assignment rates, with columns, $POP, the designated population, $POP.PRED, the predicted population, and $ASSIGN, the assignment rate. Note, the correct assignment rate are those instances where values in POP==POP.PRED.

  4. $pairs.wide: a wide-format data table of pairwise population assignment rates, with columns, $POP, the designated population, and $[pop], the predicted populations, where each possible predicted population is a [pop] column. The cell contents are the assignment rate, with correct assignment rates on the diagonal.

References

Jombart et al. (2010) BMC Genetics. DOI: 10.1186/1471-2156-11-94 Patterson et al. (2006) PLoS Genetics. DOI: 10.1371/journal.pgen.0020190 Thia (2022) Mol. Ecol. DOI: 10.1111/1755-0998.13706

Examples

library(genomalicious)

data(data_Genos)

### Fit the DAPC with the first 3 PC axes as predictors
DAPC.fit <- dapc_fit(dat=data_Genos, pcPreds=3, method='fit')

# Table of LD and PC axis scores
DAPC.fit$da.tab
DAPC.fit$pca.tab

# The lda and prcomp objects
DAPC.fit$da.fit
DAPC.fit$pca.fit

# The contributions of SNP to the LD axes
DAPC.fit$snp.contrib

# The posterior probabilities
DAPC.fit$da.prob

### Leave-one out cross-validation with 2 cores
DAPC.loo <- dapc_fit(data_Genos, method='loo_cv', pcPreds=3, numCores=2)

# Predictions
DAPC.loo$tab

# Global correct assignment rate
DAPC.loo$global

# Pairwise assignment rates in long-format data table
DAPC.loo$pairs.long

# Pairwise correct assignment rates from long-format data table
DAPC.loo$pairs.long[POP==POP.PRED]

# Pairwise assignment rates in wide-format data table
DAPC.loo$pairs.wide

#### Training-testing partitioning with 80% used as trianing
DAPC.tt <- dapc_fit(data_Genos, method='train_test', pcPreds=3, trainProp=0.8)

# Pairwise assignment rates in wide-format data table
DAPC.tt$pairs.wide


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.