dapc_infer: Conduct an inference of _k_ prior to DAPC.


dapc_infer    R Documentation

Conduct an inference of k prior to DAPC.

Description

Takes a long-format data table of genotypes and assists in a preliminary inference of k, the effective number of populations. Inference of k is facilitated by examining the PCA screeplot and by K-means clustering under different parameter combinations.

Usage

dapc_infer(
  dat,
  scaling = "covar",
  sampCol = "SAMPLE",
  locusCol = "LOCUS",
  genoCol = "GT",
  kTest = 1:10L,
  pTest,
  screeMax = 20L,
  plotLook = "ggplot"
)

Arguments

dat

Data table: A long-format data table, e.g., as imported by vcf2DT. Genotypes can be coded as '/' separated characters (e.g. '0/0', '0/1', '1/1'), or as integer counts of the Alt allele (e.g. 0, 1, 2). Must contain the following columns:

  1. The sampled individuals (see param sampCol).

  2. The locus ID (see param locusCol).

  3. The genotype column (see param genoCol).
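The expected input layout can be illustrated with a small toy table. This is a minimal sketch in base R: the sample names, locus names, and the helper gt_to_count are hypothetical and not part of genomalicious, and a base data.frame is used for portability (it would presumably need converting, e.g. with data.table::as.data.table, before being passed to dapc_infer).

```r
# Hypothetical toy data in long format: one row per sample-by-locus genotype.
dat_long <- data.frame(
  SAMPLE = rep(c("ind1", "ind2", "ind3"), each = 2),
  LOCUS  = rep(c("locA", "locB"), times = 3),
  GT     = c("0/0", "0/1", "0/1", "1/1", "1/1", "0/0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert '/' separated genotypes into counts of the
# Alt allele (0, 1, or 2) by summing the two allele codes.
gt_to_count <- function(gt) {
  vapply(
    strsplit(gt, "/", fixed = TRUE),
    function(alleles) sum(as.integer(alleles)),
    integer(1)
  )
}

# Both codings describe the same genotypes
dat_long$GT_INT <- gt_to_count(dat_long$GT)
dat_long
```

Either the GT or the GT_INT column in this sketch matches one of the two genotype codings described above; which column is used would be set through genoCol.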

scaling

Character: How should the data (loci) be scaled? Set to 'covar' to centre to mean = 0 without adjusting the variance, i.e. PCA on a covariance matrix. Set to 'corr' to scale to mean = 0 and variance = 1, i.e. PCA on a correlation matrix. Set to 'patterson' to use the Patterson et al. (2006) normalisation. Set to 'none' if you do not want to do any scaling before PCA.
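The four scaling options can be sketched with base R's prcomp on a hypothetical matrix of Alt allele counts. This is an illustration of the general techniques only, not the package's internal code. The 'patterson' step assumes the normalisation from Patterson et al. (2006), in which each centred locus is divided by sqrt(p * (1 - p)), with p estimated as half the mean allele count; how genomalicious estimates p may differ.

```r
set.seed(1)
# Hypothetical genotype matrix: 20 samples (rows) x 5 loci (columns),
# coded as counts of the Alt allele.
G <- matrix(rbinom(100, size = 2, prob = 0.3), nrow = 20, ncol = 5)

# 'covar': centre each locus, leave the variance unchanged
pca_covar <- prcomp(G, center = TRUE, scale. = FALSE)

# 'corr': centre each locus and scale it to unit variance
pca_corr <- prcomp(G, center = TRUE, scale. = TRUE)

# 'patterson' (assumed form): centre each locus, then divide by
# sqrt(p * (1 - p)), where p is the estimated Alt allele frequency
mu <- colMeans(G)
p_hat <- mu / 2
G_patt <- sweep(sweep(G, 2, mu, "-"), 2, sqrt(p_hat * (1 - p_hat)), "/")
pca_patt <- prcomp(G_patt, center = FALSE, scale. = FALSE)

# 'none': no centring or scaling before PCA
pca_none <- prcomp(G, center = FALSE, scale. = FALSE)

# Eigenvalues (variance per PC axis) under each option
rbind(
  covar     = pca_covar$sdev^2,
  corr      = pca_corr$sdev^2,
  patterson = pca_patt$sdev^2,
  none      = pca_none$sdev^2
)
```

Under 'corr' the eigenvalues sum to the number of loci, whereas under 'covar' they sum to the total variance across loci, so loci with higher variance carry more weight.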

sampCol

Character: The column name with the sampled individual information. Default is 'SAMPLE'.

locusCol

Character: The column name with the locus information. Default is 'LOCUS'.

genoCol

Character: The column name with the genotype information. Default is 'GT'.

kTest

Integer: A vector of the values of k (number of populations) to test. Default is 1:10.

pTest

Integer: A vector of the numbers of PC axes (p) to use when fitting K-means.

screeMax

Integer: The maximum number of PC axes to plot in the screeplot.

plotLook

Character: The look of the plot. Default = 'ggplot', the typical gray background with gridlines produced by ggplot2. Alternatively, when set to 'classic', produces a base R style plot.

Details

DAPC (discriminant analysis of principal components) was made popular in the population genetics/molecular ecology community following Jombart et al.'s (2010) paper. The method uses a discriminant analysis (DA) to model the genetic differences among populations using PC axes of genotypes as predictors.

The choice of the number of PC axes to use as predictors of genetic differences among populations should be determined using the k-1 criterion described in Thia (2022). This criterion is based on the findings of Patterson et al. (2006) that only the leading k-1 PC axes of a genotype dataset capture biologically meaningful structure. Users can use the function genomalicious::dapc_infer to examine eigenvalue screeplots and perform K-means clustering with different parameters to infer the number of biologically informative PC axes.

Users should examine both the screeplot of eigenvalues and the different K-means plots produced. The screeplot typically exhibits a break in the scree around the putative k. Different parameterisations of K-means clustering should also converge on a similar conclusion. Users may also find it useful to visualise scatterplots of the PC axes, e.g., using pca_genos.
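The workflow described above (a screeplot, then K-means across combinations of k and p) can be sketched in base R on simulated data. This is a rough illustration, not the code used by dapc_infer. The simulated genotypes, the object names, and the BIC-style score n * log(WSS / n) + k * log(n) are all assumptions made for this example; the package's own BIC calculation may differ.

```r
set.seed(42)
# Hypothetical data: 3 groups of 20 samples genotyped at 50 loci, with
# group-specific Alt allele frequencies.
n_per <- 20
n_loci <- 50
freqs <- matrix(runif(3 * n_loci, 0.1, 0.9), nrow = 3)
G <- do.call(rbind, lapply(1:3, function(g) {
  matrix(
    rbinom(n_per * n_loci, size = 2, prob = rep(freqs[g, ], each = n_per)),
    nrow = n_per
  )
}))

# PCA on the covariance matrix and a screeplot of the leading eigenvalues
pca <- prcomp(G, center = TRUE, scale. = FALSE)
eig <- pca$sdev^2
barplot(eig[1:10], names.arg = 1:10, xlab = "PC axis", ylab = "Eigenvalue")

# K-means for every combination of k and p, scored with an assumed
# BIC-style statistic (lower is better)
n <- nrow(pca$x)
res <- expand.grid(k = 1:6, p = c(2, 10))
res$bic <- mapply(function(k, p) {
  fit <- kmeans(pca$x[, 1:p, drop = FALSE], centers = k, nstart = 10)
  n * log(fit$tot.withinss / n) + k * log(n)
}, res$k, res$p)
res
```

With three simulated groups, one would look for a break after the second eigenvalue (k - 1 = 2) and for the score to drop sharply up to k = 3 at each value of p.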

This function can also be used to determine populations de novo if the user has no a priori expectation of the number of populations and the designation of individuals. See Miller et al. (2020) and Thia (2022) for the distinction between, and importance of, a priori vs. de novo population designations. The function returns all K-means solutions for all parameter combinations. After inspecting the screeplot and the K-means solutions, the desired K-means fit can be extracted from the returned object to obtain de novo population designations for downstream analysis, e.g., with genomalicious::dapc_fit.
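Extracting de novo designations from a chosen fit might look like the sketch below. Because the fits are described as outputs from kmeans, a stand-in object is built here with base R so the example runs without the package. The object mockK, the simulated PC scores, and the POP labels are hypothetical, and it is an assumption that the real fits carry sample names on their cluster vector.

```r
set.seed(7)
# Stand-in PC scores for 30 samples on 3 axes (hypothetical data with
# three well separated groups).
pcs <- rbind(
  matrix(rnorm(30, mean = -5), ncol = 3),
  matrix(rnorm(30, mean = 0), ncol = 3),
  matrix(rnorm(30, mean = 5), ncol = 3)
)
rownames(pcs) <- paste0("ind", seq_len(nrow(pcs)))

# Mock object mimicking the documented list structure: $fit holds kmeans
# outputs named by parameter combination.
mockK <- list(
  fit = list(`k=3,p=3` = kmeans(pcs, centers = 3, nstart = 10))
)

# Extract the chosen fit and tabulate de novo population designations
fit_k3p3 <- mockK$fit$`k=3,p=3`
pops <- data.frame(
  SAMPLE = names(fit_k3p3$cluster),
  POP = paste0("Pop", fit_k3p3$cluster),
  stringsAsFactors = FALSE
)
table(pops$POP)
```

A table like pops could then be joined back onto the genotype data by sample to supply population labels for downstream analysis.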

Value

Returns a list: $tab is a data table of the k and p values examined and their associated BIC values. $fit contains the individual outputs from kmeans for each combination of parameters fitted. $plot is a ggplot object.

References

Jombart et al. (2010) BMC Genetics. DOI: 10.1186/1471-2156-11-94

Miller et al. (2020) Heredity. DOI: 10.1038/s41437-020-0348-2

Patterson et al. (2006) PLoS Genetics. DOI: 10.1371/journal.pgen.0020190

Thia (2022) Mol. Ecol. Resour. DOI: 10.1111/1755-0998.13706

Examples

library(genomalicious)

data(data_Genos)

# Test k = 1 to 10 with 3, 10, 20, and 40 PC axes, plotting just the first 10
# eigenvalues from the PCA, with a ggplot flavour.
inferK <- dapc_infer(
   data_Genos,
   kTest=1:10L,
   pTest=c(3,10,20,40),
   screeMax=10L,
   plotLook='ggplot'
)

# Tabulated statistics
inferK$tab

# The K-means clustering results for k=3 fitted with p=3 PC axes
inferK$fit$`k=3,p=3`

# The plot
inferK$plot


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.