allele_freqs_DT: Generate an allele frequency data table

View source: R/allele_freqs_DT.R

allele_freqs_DT	R Documentation

Generate an allele frequency data table

Description

Takes a data.table of genotypes or allele counts and calculates the allele frequency for each allele. Can be used for multiallelic datasets.

Usage

allele_freqs_DT(
  dat,
  type,
  sampCol = "SAMPLE",
  popCol = "POP",
  locusCol = "LOCUS",
  genoCol = "GT",
  countCol = "COUNTS",
  indsCol = "INDS"
)

Arguments

dat

Data.table: Long-format data table of variants, e.g., as read in with genomalicious::vcf2DT.

type

Character: The data type, one of "genos" for individual genotype data, or "counts" for allele counts in populations.

sampCol

Character: The column with sample ID information. Default is "SAMPLE". Only needed if type=="genos".

popCol

Character: The column with population ID information. Default is "POP".

locusCol

Character: The column with locus ID information. Default is "LOCUS".

genoCol

Character: The column with genotype information. Default is "GT". Only needed if type=="genos". Genotypes must be in character format, with alleles separated by the delimiter "/". For example, "0/1" is one copy of the Ref allele and one copy of Alt allele 1; "2/2" is two copies of Alt allele 2.

countCol

Character: The column with allele count information for all alleles. For example, in pool-seq of populations, the number of read counts for each allele. Default is "COUNTS". Only needed if type=="counts". Counts should be separated by commas, with the Ref allele first. E.g., "20,60,4" would indicate 20, 60, and 4 counts of the Ref allele, Alt allele 1, and Alt allele 2, respectively.

indsCol

Character: The column with the number of sampled individuals per population. Default is "INDS".
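
The two input formats described for genoCol and countCol can be illustrated with a short base R sketch. The values below are hypothetical toy data, and the parsing shown is only an illustration of the formats, not necessarily how allele_freqs_DT handles them internally.

```r
# Hypothetical genotypes for three individuals at one locus (genoCol format).
gt <- c("0/1", "2/2", "0/0")

# Split each genotype on "/" and tabulate allele copies across individuals.
alleles <- unlist(strsplit(gt, "/", fixed = TRUE))
geno_allele_counts <- table(factor(alleles, levels = c("0", "1", "2")))

# Hypothetical pooled counts for one population at one locus (countCol format).
counts_str <- "20,60,4"

# Split on "," with the Ref allele first, then label alleles 0, 1, 2.
pool_counts <- as.integer(strsplit(counts_str, ",", fixed = TRUE)[[1]])
names(pool_counts) <- seq_along(pool_counts) - 1

print(geno_allele_counts)
print(pool_counts)
```

Here allele 0 is the Ref allele and alleles 1 and 2 are the first and second Alt alleles, matching the conventions described above.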

Details

This function assumes no missing values. For type=="genos", all sampled individuals must have a genotype value for each locus. For type=="counts", all sampled populations must have count data for each locus. You could impute missing genotypes for individuals, or drop loci with missing data for individual or population datasets.
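
One way to drop loci with missing data before calling the function is sketched below with data.table. The toy table and the object names (toy, bad_loci, toy_complete) are hypothetical, and the sketch assumes missing genotypes are encoded as NA; other encodings (for example "./." from a VCF) would need to be converted first.

```r
library(data.table)

# Hypothetical toy genotype data using the default column names.
toy <- data.table(
  SAMPLE = rep(c("ind1", "ind2"), each = 2),
  POP = "pop1",
  LOCUS = rep(c("locA", "locB"), times = 2),
  GT = c("0/1", "0/0", NA, "1/1")
)

# Loci with at least one missing genotype (assumes missing values are NA).
bad_loci <- toy[is.na(GT), unique(LOCUS)]

# Keep only loci that are complete across all individuals.
toy_complete <- toy[!LOCUS %in% bad_loci]

print(toy_complete)
```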

Note that when type=="counts", the allele frequencies are calculated as the proportion of counts per allele relative to the total number of observed counts at a locus, not from the number of individuals. However, the function will still align the number of sequenced individuals in each population against the counts.
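
As a worked illustration of count-based frequencies, the sketch below uses the "20,60,4" example from the countCol argument. It shows only the arithmetic of counts divided by total counts, on hypothetical values, and is not taken from the function's source.

```r
# Counts for the Ref allele (0), Alt allele 1, and Alt allele 2.
counts <- c(`0` = 20, `1` = 60, `2` = 4)

# Frequency of each allele is its count over the total counts at the locus.
freqs <- counts / sum(counts)

# 20/84, 60/84 and 4/84, i.e. roughly 0.238, 0.714 and 0.048.
print(round(freqs, 3))
```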

Value

Returns a long format data table with the following columns:

  1. $POP, the population ID column.

  2. $LOCUS, the locus ID column.

  3. $ALLELE, the allele ID column (0 is Ref, and Alt alleles are numbered 1 to n).

  4. $COUNTS, the number of observations of the allele: the number of individuals for genotype data, or the number of counts (e.g., reads) for population count data.

  5. $INDS, the number of individuals sampled per population.

  6. $FREQ, the estimated allele frequency.

  7. $HET, the proportion of heterozygotes, calculated directly from genotype data, or estimated as the expected heterozygosity from population allele frequencies. Assumes diploid organisms.
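
The two meanings of $HET can be sketched in base R as follows. The values are hypothetical, and the expected heterozygosity uses the standard diploid formula, one minus the sum of squared allele frequencies; whether allele_freqs_DT applies any additional correction (for example for sample size) is not stated here, so treat this as an assumption.

```r
# Observed heterozygosity from hypothetical genotypes at one locus.
gt <- c("0/1", "0/0", "1/1", "0/1")
allele_pairs <- strsplit(gt, "/", fixed = TRUE)

# An individual is heterozygous if its two alleles differ.
is_het <- vapply(allele_pairs, function(a) a[1] != a[2], logical(1))
het_obs <- mean(is_het)

# Expected heterozygosity from hypothetical population allele frequencies.
p <- c(0.25, 0.75)
het_exp <- 1 - sum(p^2)

print(het_obs)  # 0.5
print(het_exp)  # 0.375
```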

Examples

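library(genomalicious)
# copy() and := come from data.table, and %>% from magrittr; they are
# attached explicitly here in case they are not already loaded.
library(data.table)
library(magrittr)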

# Import biallelic SNPs as genotypes or population counts
data(data_Genos)
data(data_PoolFreqs)

# On genotypes, convert the $GT values (coded 0, 1, 2) into
# "/"-separated character genotypes.
dat_gt <- data_Genos %>%
  copy %>%
  .[, GT:=as.character(GT)] %>%
  .[GT==0, GT:='0/0'] %>%
  .[GT==1, GT:='0/1'] %>%
  .[GT==2, GT:='1/1']

print(dat_gt)

allele_freqs_DT(dat=dat_gt, type='genos')

# On counts, need to make a $COUNTS column, and add in 30 individuals
# per locus per population in a new $INDS column.
dat_counts <- data_PoolFreqs %>%
  copy %>%
  .[, COUNTS:=paste(RO,AO,sep=',')] %>%
  .[, INDS:=30]

print(dat_counts)

allele_freqs_DT(dat=dat_counts, type='counts')

j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.