View source: R/allele_freqs_DT.R
allele_freqs_DT | R Documentation |
Takes a data.table of genotypes or allele counts and calculates the allele frequency for each allele. Can be used for multiallelic datasets.
allele_freqs_DT(
dat,
type,
sampCol = "SAMPLE",
popCol = "POP",
locusCol = "LOCUS",
genoCol = "GT",
countCol = "COUNTS",
indsCol = "INDS"
)
dat |
Data.table: Long-format data table of variants, e.g., as read in
with |
type |
Character: One of two modes: "genos" for individual genotype data, or "counts" for allele counts in populations. |
sampCol |
Character: The column with sample ID information. Default is "SAMPLE".
Only needed if |
popCol |
Character: The column with population ID information. Default is "POP". |
locusCol |
Character: The column with locus ID information. Default is "LOCUS". |
genoCol |
Character: The column with genotype information. Default is "GT".
Only needed if |
countCol |
Character: The column with allele count information for all alleles.
For example, in pool-seq of populations, the number of read counts for each allele.
Default is "COUNTS". Only needed if |
indsCol |
Character: The column with the number of sampled individuals per population. Default is "INDS". |
This function assumes no missing values. For type=="genos", all
sampled individuals must have a genotype value for each locus.
For type=="counts", all sampled populations must have count data for each
locus. You could impute missing genotypes for individuals, or drop loci with
missing data from individual or population datasets.
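Dropping loci with missing data, as suggested above, can be sketched in a few lines. This is a minimal illustration in base R on a made-up long-format table; the object names (geno, bad_loci, geno_complete) are hypothetical, and only the column names follow the function's defaults.

```r
# Hypothetical long-format genotype table: two individuals, two loci.
geno <- data.frame(
  SAMPLE = rep(c("ind1", "ind2"), each = 2),
  LOCUS  = rep(c("loc1", "loc2"), times = 2),
  GT     = c("0/0", "0/1", "1/1", NA),
  stringsAsFactors = FALSE
)

# Loci with at least one missing genotype.
bad_loci <- unique(geno$LOCUS[is.na(geno$GT)])

# Keep only the loci that are complete across all individuals.
geno_complete <- geno[!geno$LOCUS %in% bad_loci, ]
print(geno_complete)
```

This only catches genotypes recorded as NA. If a missing genotype is represented by an absent row instead, the number of rows per locus would also need to be compared against the number of sampled individuals.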
Note, when type=="counts", the allele frequencies are based on the
proportion of counts per allele relative to the total number of observed counts
at a locus. However, this function will align the total sample number of
sequenced individuals against the counts.
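The count-based frequency described above is each allele's count divided by the total counts at the locus. The sketch below shows that arithmetic in base R on invented read counts; it is not the package's internal code, and the object names (counts, count_list, freq_list) are hypothetical. The comma-separated "Ref,Alt" layout is taken from the example on this page.

```r
# Hypothetical pooled read counts for one population at two loci,
# stored as comma-separated strings in "Ref,Alt" order.
counts <- c(locus_1 = "18,12", locus_2 = "25,5")

# Split each string into a numeric vector of per-allele counts.
count_list <- lapply(strsplit(counts, ",", fixed = TRUE), as.numeric)

# Frequency of each allele = its count / total counts at that locus.
freq_list <- lapply(count_list, function(x) x / sum(x))
print(freq_list)
```

Because the denominator is the number of observed counts, the frequencies at a locus sum to 1 regardless of how many individuals were pooled.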
Returns a long format data table with the following columns:
$POP, the population ID column.
$LOCUS, the locus ID column.
$ALLELE, the allele ID column (0 is Ref, and the Alt alleles are numbered from 1 to n).
$COUNTS, the number of observations of the allele: the number of
individuals for genotype data, or the number of counts (e.g., reads) for
population count data.
$INDS, the number of individuals sampled per population.
$FREQ, the estimated allele frequency.
$HET, the proportion of heterozygotes, calculated directly from
genotype data, or estimated as the expected heterozygosity from population
allele frequencies. Assumes diploid organisms.
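To make the two routes to $HET concrete, the sketch below computes observed and expected heterozygosity for a single multiallelic locus in base R. It rests on two assumptions that this page does not confirm: that genotype-based frequencies are allele copies divided by the total number of copies in diploids, and that expected heterozygosity is 1 minus the sum of squared allele frequencies, with no small-sample correction. All object names are hypothetical.

```r
# Hypothetical diploid genotypes at one locus with three alleles (0, 1, 2).
gt <- c("0/0", "0/1", "0/1", "1/1", "0/2")

# Split each genotype into its two alleles.
alleles <- strsplit(gt, "/", fixed = TRUE)

# Count allele copies and convert to frequencies (assumed: copies / total copies).
allele_counts <- table(unlist(alleles))
freq <- allele_counts / sum(allele_counts)

# Observed heterozygosity: proportion of individuals carrying two different alleles.
is_het <- vapply(alleles, function(a) a[1] != a[2], logical(1))
obs_het <- mean(is_het)

# Expected heterozygosity (assumed formula): 1 - sum of squared frequencies.
exp_het <- 1 - sum(freq^2)

print(freq)
print(c(observed = obs_het, expected = exp_het))
```

With pooled count data there are no individual genotypes, so only the expected value can be estimated, which matches the description of $HET above.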
library(genomalicious)
# Import biallelic SNPs as genotypes or population counts
data(data_Genos)
data(data_PoolFreqs)
# On genotypes, convert the $GT values to characters.
dat_gt <- data_Genos %>%
  copy %>%
  .[, GT:=as.character(GT)] %>%
  .[GT==0, GT:='0/0'] %>%
  .[GT==1, GT:='0/1'] %>%
  .[GT==2, GT:='1/1']
print(dat_gt)
allele_freqs_DT(dat=dat_gt, type='genos')
# On counts, need to make a $COUNTS column, and add in 30 individuals
# per locus per population in a new $INDS column.
dat_counts <- data_PoolFreqs %>%
  copy %>%
  .[, COUNTS:=paste(RO,AO,sep=',')] %>%
  .[, INDS:=30]
print(dat_counts)
allele_freqs_DT(dat=dat_counts, type='counts')