dadi_inputs: Generate dadi input from genotype or allele frequency data

View source: R/dadi_inputs.R

dadi_inputs    R Documentation

Generate dadi input from genotype or allele frequency data

Description

Creates an input file for the program dadi, described in Gutenkunst et al. (2009). The input is biallelic genotypes or allele frequencies at SNP loci in a long-format data table.

Usage

dadi_inputs(
  dat,
  type,
  sampCol = "SAMPLE",
  popCol = "POP",
  locusCol = "LOCUS",
  refCol = "REF",
  altCol = "ALT",
  genoCol = "GT",
  freqCol = "FREQ",
  indsCol = "INDS",
  freqMethod = "probs",
  popSub = NULL,
  popLevels = NULL
)

Arguments

dat

Data table: A long-format data table of biallelic genotypes, coded as '/' separated alleles ('0/0', '0/1', '1/1'), or counts of the Alt alleles (0, 1, 2, respectively). Alternatively, a long-format data table of allele frequencies. Columns required for both genotypes and allele frequencies:

  1. The population ID (see param popCol).

  2. The locus ID (see param locusCol).

  3. The reference allele (see param refCol).

  4. The alternate allele (see param altCol).

Columns required only for genotypes:

  1. The sample ID (see param sampCol).

  2. The genotypes (see param genoCol).

Columns required only for allele frequencies:

  1. The allele frequencies (see param freqCol).

  2. The number of individuals used to obtain the allele frequency estimate (see param indsCol).
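
As a rough illustration of the two layouts described above, the following sketch builds toy tables with the default column names. The sample, population, and locus values are invented for illustration, and base data frames are used only to keep the sketch self-contained; since dat is described as a data table, they may need converting (for example with data.table::as.data.table()) before use.

```r
# Toy genotype data: one row per sample-by-locus combination.
# All values are hypothetical and only illustrate the expected columns.
genos <- data.frame(
  SAMPLE = c("Ind1", "Ind1", "Ind2", "Ind2"),
  POP    = c("Pop1", "Pop1", "Pop2", "Pop2"),
  LOCUS  = c("Locus1", "Locus2", "Locus1", "Locus2"),
  REF    = c("A", "C", "A", "C"),
  ALT    = c("G", "T", "G", "T"),
  GT     = c("0/0", "0/1", "1/1", "0/1"),
  stringsAsFactors = FALSE
)

# Toy allele frequency data: one row per population-by-locus combination.
# FREQ holds the Ref allele frequency, INDS the number of pooled individuals.
freqs <- data.frame(
  POP   = c("Pop1", "Pop1", "Pop2", "Pop2"),
  LOCUS = c("Locus1", "Locus2", "Locus1", "Locus2"),
  REF   = c("A", "C", "A", "C"),
  ALT   = c("G", "T", "G", "T"),
  FREQ  = c(0.82, 0.50, 0.10, 0.95),
  INDS  = c(20, 20, 30, 30),
  stringsAsFactors = FALSE
)
```
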

type

Character: One of 'genos' or 'freqs', to generate the dadi input from genotype or allele frequency data, respectively.

sampCol

Character: Sample ID. Default = 'SAMPLE'.

popCol

Character: Population ID. Default = 'POP'.

locusCol

Character: Locus ID. Default = 'LOCUS'.

refCol

Character: Reference allele. Default = 'REF'.

altCol

Character: Alternate allele. Default = 'ALT'.

genoCol

Character: The genotype. Default = 'GT'.

freqCol

Character: The reference allele frequency. Default = 'FREQ'.

indsCol

Character: The number of individuals per population pool. Default = 'INDS'.

freqMethod

Character: The method to estimate the SFS from allele frequency data. Either 'probs' or 'counts'. Default = 'probs'. Only applicable when type=='freqs'. See Details for parameterisation.

popSub

Character: The populations to subset out of popCol. Default = NULL.

popLevels

Character: An optional vector of the population IDs used to manually specify the first and second population order. Default = NULL.

Details

Because pool-seq provides estimates of allele frequencies, not direct observations of allele counts, we have to infer the SFS from the allele frequencies. How the allele counts are inferred is determined by the argument freqMethod.

When freqMethod=='counts', the expected allele counts are rounded to the nearest integer (e.g. 1.5 becomes 2, and 1.4 becomes 1), relative to the number of chromosomes. The Ref allele count is calculated first, and the Alt allele count is the remainder. For instance, if 20 diploid individuals were pooled and the Ref allele frequency was 0.82, then of the 40 haploid chromosomes, 33 (32.8 rounded up) would be expected to carry the Ref allele, whilst 7 (40 - 33) would be expected to carry the Alt allele. NOTE: if the estimated Ref allele count is < 1 but > 0, it will always be rounded to 1. This method will produce a consistent SFS, but note that extremely low Ref allele frequencies will have a tendency to produce counts of 1.
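
The rounding arithmetic for the 'counts' method can be sketched as below. The helper ref_alt_counts() and its ploidy argument are hypothetical, written only to mirror the description; it is not the package's internal code. It also assumes rounding behaves like R's round(), which sends exact halves to the nearest even number.

```r
# Hypothetical helper (not part of genomalicious): convert one Ref allele
# frequency into rounded Ref and Alt allele counts.
ref_alt_counts <- function(freq, inds, ploidy = 2) {
  chroms <- inds * ploidy          # number of haploid chromosomes in the pool
  expected_ref <- freq * chroms    # expected (non-integer) Ref allele count
  ref <- round(expected_ref)       # round to the nearest integer
  # Per the NOTE: an expected Ref count between 0 and 1 is set to 1
  if (expected_ref > 0 && expected_ref < 1) {
    ref <- 1
  }
  c(ref = ref, alt = chroms - ref) # the Alt count is the remainder
}

# Worked example from the text: 20 diploids, Ref frequency of 0.82
ref_alt_counts(freq = 0.82, inds = 20)   # ref = 33, alt = 7

# A very low Ref frequency: 0.01 * 40 = 0.4, which is set to 1
ref_alt_counts(freq = 0.01, inds = 20)   # ref = 1, alt = 39
```
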

When freqMethod=='probs', the default, the allele counts are derived from a binomial draw using R's rbinom() function. Again, if the Ref allele frequency from 20 pooled diploids was 0.82, then the Ref allele count would be drawn with the call: rbinom(n=1, size=40, prob=0.82), and the Alt allele count would be 40 minus this number. This method will not produce consistently reproducible SFSs due to the random nature of the draws. However, it does avoid potentially biasing the SFS through rounding when allele frequencies are low.
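
A minimal sketch of the binomial draw for the same worked example follows. Setting a seed with set.seed() is one way to make a draw repeatable within a session; whether seeding before a call to dadi_inputs() fully fixes its output is an assumption, not something the documentation states.

```r
# Worked example from the text: 20 pooled diploids, Ref frequency of 0.82
chroms <- 20 * 2   # 40 haploid chromosomes

# Fix the random seed so that this particular draw can be repeated
set.seed(42)
ref <- rbinom(n = 1, size = chroms, prob = 0.82)   # drawn Ref allele count
alt <- chroms - ref                                # Alt count is the remainder

# Repeating the draw with the same seed gives the same Ref allele count
set.seed(42)
ref_again <- rbinom(n = 1, size = chroms, prob = 0.82)
identical(ref, ref_again)
```
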

Value

Returns a data table in the dadi input format.

References

Gutenkunst et al. (2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics 5(10): e1000695.

Examples

library(genomalicious)

data(data_Genos)
data(data_PoolFreqs)
data(data_PoolInfo)

### Make the dadi input from genotype data
dadi_inputs(dat=data_Genos, type='genos', popSub=c('Pop1', 'Pop2'))

### Make the dadi input from allele frequency data
colnames(data_PoolFreqs)

# We need to add the INDS column from data_PoolInfo to data_PoolFreqs.
# left_join() comes from the dplyr package.
library(dplyr)
newFreqData <- left_join(data_PoolFreqs, data_PoolInfo)
colnames(newFreqData)

# Make the dadi input using the 'probs' method
dadi_inputs(dat=newFreqData, type='freqs', freqMethod='probs')


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.