

fstat_calc    R Documentation

Calculate F-statistics from genotypes or allele frequencies (counts)

Description

Takes genotypes or allele frequencies in a long-format data table and calculates Weir & Cockerham's F-statistics (Weir & Cockerham, 1984). Permutations can be used to test the statistical significance of F-statistics in genotype data sets. Multiallelic data are supported. See Details for more information.

Usage

fstat_calc(
  dat,
  type,
  method,
  fstatVec = NULL,
  popCol = "POP",
  sampCol = "SAMPLE",
  locusCol = "LOCUS",
  genoCol = "GT",
  countCol = "COUNTS",
  indsCol = "INDS",
  permute = FALSE,
  keepLocus = TRUE,
  numPerms = 100,
  numCores = 1
)

Arguments

dat

Data table: For genotype data, a long-format data table of genotypes, coded as '/' separated alleles ('0/0', '0/1', '1/1'). For allele frequency data, a long-format data table of allele counts.

Columns required for both genotypes and allele frequencies:

  1. The population ID (see param popCol).

  2. The locus ID (see param locusCol).

Columns required only for genotypes:

  1. The sample ID (see param sampCol).

  2. The genotypes (see param genoCol).

Columns required only for allele frequencies:

  1. The allelic count column (see param countCol).

  2. The number of individuals used to obtain the allele frequency estimate (see param indsCol).
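As a minimal sketch of the two input layouts described above, the toy tables below use the default column names shown in Usage. The populations, samples, loci, and values are invented for illustration, and the data.table package is assumed to be installed.

```r
library(data.table)

# Toy genotype table: one row per sample per locus, alleles separated by '/'.
geno_dat <- data.table(
  POP    = rep(c("Pop1", "Pop2"), each = 4),
  SAMPLE = rep(c("Ind1", "Ind2", "Ind3", "Ind4"), each = 2),
  LOCUS  = rep(c("Locus1", "Locus2"), times = 4),
  GT     = c("0/0", "0/1", "0/1", "1/1", "1/1", "0/1", "0/0", "0/0")
)

# Toy allele count table: one row per population per locus.
# COUNTS is 'Ref,Alt'; INDS is the number of individuals behind the estimate.
freq_dat <- data.table(
  POP    = rep(c("Pop1", "Pop2"), each = 2),
  LOCUS  = rep(c("Locus1", "Locus2"), times = 2),
  COUNTS = c("30,10", "5,35", "12,28", "20,20"),
  INDS   = c(20, 20, 20, 20)
)

# Check that the required columns for each data type are present.
stopifnot(all(c("POP", "SAMPLE", "LOCUS", "GT") %in% names(geno_dat)))
stopifnot(all(c("POP", "LOCUS", "COUNTS", "INDS") %in% names(freq_dat)))
```

If your own tables use different column names, they can be passed through popCol, sampCol, locusCol, genoCol, countCol, and indsCol rather than renaming the columns.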

type

Character: One of 'genos' or 'freqs', to calculate F-statistics from genotype or allele frequency data, respectively.

method

Character: One of 'global' or 'pairwise' for global or pairwise F-statistics, respectively.

fstatVec

Character: A vector of F-statistics to calculate. Only applicable to genotype data, type=='genos'. Must include at least one of "FST", "FIS", or "FIT". Default is NULL.

popCol

Character: The column name with the population information. Default is 'POP'.

sampCol

Character: The column name with the sampled individual information. Default is 'SAMPLE'.

locusCol

Character: The column name with the locus information. Default is 'LOCUS'.

genoCol

Character: The column name with the genotype information. Default is 'GT'.

countCol

Character: The column name with the allele count information. Default is 'COUNTS'. Counts for each allele need to be separated with a comma, starting with the Ref allele, followed by each subsequent Alt allele. E.g., '0,25', or '5,7,10', for a locus with 2 alleles and 3 alleles, respectively. Alleles within a locus must be coded at the same positions in the character string across all populations.
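To illustrate the comma-separated count format, the base R sketch below parses toy count strings for one triallelic locus into a matrix of counts and allele frequencies. The population names and counts are invented, and this is only one way to read the format, not necessarily how fstat_calc parses it internally.

```r
# Toy allele count strings for one locus in three populations (invented values).
counts <- c(Pop1 = "5,7,10", Pop2 = "0,12,10", Pop3 = "11,11,0")

# Split each string on commas and convert the pieces to integers.
count_list <- lapply(strsplit(counts, ",", fixed = TRUE), as.integer)

# Every population must report the same number of alleles, in the same order.
stopifnot(length(unique(lengths(count_list))) == 1)

# Rows are populations, columns are alleles (Ref first, then each Alt).
count_mat <- do.call(rbind, count_list)
colnames(count_mat) <- c("Ref", paste0("Alt", seq_len(ncol(count_mat) - 1)))

# Convert counts to within-population allele frequencies.
freq_mat <- count_mat / rowSums(count_mat)
print(freq_mat)
```

An allele that is absent from a population still needs a '0' placeholder, as in Pop2 and Pop3 above, so that positions line up across populations.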

indsCol

Character: The column name with the number of individuals contributing to the allele frequency estimate. Default is 'INDS'.

permute

Logical: Should permutations be performed to test the statistical significance of F-statistics? Default is FALSE. Can only be performed on genotype data, type=='genos'.

keepLocus

Logical: Should locus-specific estimates of F-statistics be kept? Default is TRUE. Dropping locus-specific estimates substantially reduces memory use and the size of the returned list.

numPerms

Integer: The number of permutations to perform. Default is 100.

numCores

Integer: The number of cores to use when running permutations. Default is 1.

Details

With genotype data, the F-statistics FST, FIS, and FIT can be calculated. Only FST can be calculated from allele frequency data.

F-statistics from genotype data are calculated from the variance components 'a', 'b', and 'c', which have been standardised for observed heterozygosity. FST from allele frequency data uses an estimate of the expected heterozygosity.
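For orientation, the sketch below shows the standard Weir & Cockerham (1984) ratios that turn the variance components into F-statistics, with multi-locus estimates taken as ratios of sums across loci. The helper name wc_fstats and the component values are hypothetical, and the package's internal calculation may differ in detail.

```r
# Hypothetical helper: combine per-locus variance components into F-statistics.
# va = among-population component ('a'),
# vb = among individuals within populations ('b'),
# vc = within individuals ('c').
wc_fstats <- function(va, vb, vc) {
  sa <- sum(va)
  sb <- sum(vb)
  sc <- sum(vc)
  c(
    FST = sa / (sa + sb + sc),
    FIS = 1 - sc / (sb + sc),
    FIT = 1 - sc / (sa + sb + sc)
  )
}

# Toy variance components for two loci (invented values).
res <- wc_fstats(
  va = c(0.02, 0.01),
  vb = c(0.01, 0.03),
  vc = c(0.17, 0.16)
)
print(res)
```

The three statistics are linked by (1 - FIT) = (1 - FIS)(1 - FST), which is a quick sanity check on any set of estimates.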

Permutation tests for genotype data involve randomly shuffling individuals among populations, recalculating F-statistics, and testing the hypothesis that the permuted F-statistic > observed F-statistic. The p-value is the proportion of permutations for which this expression is TRUE. That is, if no permuted values are greater than the observed value, p=0. Likewise, if all the permuted values are greater than the observed value, p=1.
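The permutation logic can be sketched in base R as follows. To keep the example short, it uses a stand-in statistic (the absolute difference in mean Alt allele dosage between two populations) rather than Weir & Cockerham's estimator; the data, the function toy_stat, and the variable names are all invented for illustration.

```r
set.seed(1)

# Toy data: Alt allele dosage (0, 1, or 2) for 20 individuals in two populations.
pop <- rep(c("Pop1", "Pop2"), each = 10)
alt <- c(rbinom(10, size = 2, prob = 0.2), rbinom(10, size = 2, prob = 0.6))

# Stand-in statistic (NOT Weir & Cockerham's estimator):
# absolute difference in mean Alt dosage between the two populations.
toy_stat <- function(values, groups) {
  means <- tapply(values, groups, mean)
  abs(means[[2]] - means[[1]])
}

observed <- toy_stat(alt, pop)

# Shuffle individuals among populations and recompute the statistic each time.
num_perms <- 100
permuted <- replicate(num_perms, toy_stat(alt, sample(pop)))

# p-value: the proportion of permutations where permuted > observed.
p_val <- mean(permuted > observed)
print(p_val)
```

Because the comparison is a strict inequality, permuted values that tie with the observed value do not count towards the p-value, and the smallest non-zero p-value that can be resolved is 1/numPerms.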

Value

A list is returned with three indexes.

The first index is $genome, the genome-wide F-statistics. If global estimates were requested, method=='global', then this is just a single row: the estimates across all populations. If pairwise estimates were requested, method=='pairwise', then the columns $POP1 and $POP2 identify the two populations tested.

The second index is $locus, the locus-specific F-statistics. When method=='global', this is a data table with a $LOCUS column and the global estimates at each locus. If pairwise population estimates were requested, method=='pairwise', then the columns $POP1 and $POP2 identify the two populations tested.

The third index is $permute, the permutation results. This index will be NULL when frequencies are used, i.e., type=='freqs', and will only contain data if type=='genos' and permute==TRUE. $permute is itself a list, with two subindexes:

  1. $fstat: The permuted F-statistics. If method=='global', then this will simply be a single row of global estimates. If method=='pairwise', then this will be a data table with columns $POP1, $POP2, and a column for each F-statistic.

  2. $pval: The permuted p-values. This is a long-format data table. If method=='global', then there are two columns: $STAT, which contains the F-statistic; and $PVAL, which contains the global permuted p-value. If method=='pairwise', then there will be two additional columns, $POP1 and $POP2.

References

Weir & Cockerham (1984) Evolution. DOI: 10.1111/j.1558-5646.1984.tb05657.x

Weir et al. (2002) Annals of Human Genetics. DOI: 10.1146/annurev.genet.36

Examples

library(genomalicious)

data(data_Genos)
data(data_PoolFreqs)
data(data_PoolInfo)

# Set genotypes as characters
data_Genos$GT %>% head
data_Genos[, GT:=genoscore_converter(GT)]
data_Genos$GT %>% head

# Set allele counts and individuals in pool-seq data
data_PoolFreqs %>% head
data_PoolInfo %>% head

data_PoolFreqs[, COUNTS:=paste(RO,AO,sep=',')]

data_PoolFreqs$INDS <- data_PoolInfo$INDS[
  match(data_PoolFreqs$POOL, data_PoolInfo$POOL)
]

head(data_PoolFreqs)

# Genotypes and global F-statistics
geno_global_f <- fstat_calc(
  dat=data_Genos,
  type='genos', method='global', fstatVec=c('FST','FIS','FIT'),
  popCol='POP', sampCol='SAMPLE',
  locusCol='LOCUS', genoCol='GT',
  permute=FALSE
)

# Genotypes and pairwise F-statistics
geno_pair_f <- fstat_calc(
  dat=data_Genos,
  type='genos', method='pairwise', fstatVec=c('FST','FIS','FIT'),
  popCol='POP', sampCol='SAMPLE',
  locusCol='LOCUS', genoCol='GT',
  permute=FALSE
)

# Allele frequencies (from counts) and global FST
freqs_global_f <- fstat_calc(
  dat=data_PoolFreqs,
  type='freqs', method='global', fstatVec=NULL,
  popCol='POP', locusCol='LOCUS',
  countCol='COUNTS', indsCol='INDS',
  permute=FALSE
)

# Allele frequencies (from counts) and pairwise FST
freqs_pair_f <- fstat_calc(
  dat=data_PoolFreqs,
  type='freqs', method='pairwise', fstatVec=NULL,
  popCol='POP', locusCol='LOCUS',
  countCol='COUNTS', indsCol='INDS',
  permute=FALSE
)


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.