filter_missing_loci: Filter missing data by loci

View source: R/filter_missing_loci.R

filter_missing_loci    R Documentation

Filter missing data by loci

Description

Parses a data table of genotypes or allele frequencies and returns the names of the loci that conform to a desired missing data threshold.

Usage

filter_missing_loci(
  dat,
  missing,
  type = "genos",
  method = "samples",
  sampCol = "SAMPLE",
  locusCol = "LOCUS",
  popCol = "POP",
  genoCol = "GT",
  freqCol = "FREQ"
)

Arguments

dat

Data table: It must contain the columns:

  1. The sample ID (see param sampCol), for genotype datasets only.

  2. The locus ID (see param locusCol).

  3. The population ID (see param popCol).

  4. The genotypes (see param genoCol), or the allele frequencies (see param freqCol).

missing

Numeric: The maximum proportion of missing data permitted at a locus, a value between 0 and 1.

type

Character: Is dat a data table of genotypes ('genos') or a data table of allele frequencies ('freqs')? Default = 'genos'.

method

Character: The method by which missingness filtering is performed. Only valid when filtering is performed on genotypes (type=='genos'). One of 'samples' or 'pops'. Default = 'samples'. For 'samples', missingness at each locus is calculated across all sampled individuals, irrespective of their populations. If the missingness across samples is greater than the threshold, the locus is discarded. For 'pops', missingness at a locus is calculated per population. If any population has missingness above the threshold at a locus, that locus is removed.
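The difference between the two methods can be illustrated with a small base R sketch. This is not the package's implementation; the toy table, the threshold, and the object names (miss_samples, keep_pops, and so on) are assumptions made for illustration, with column names following the documented defaults.

```r
# Toy long-format genotype table (hypothetical data, default column names).
dat <- data.frame(
  SAMPLE = rep(c("s1", "s2", "s3", "s4"), times = 2),
  POP    = rep(c("A", "A", "B", "B"), times = 2),
  LOCUS  = rep(c("loc1", "loc2"), each = 4),
  GT     = c("0/0", "0/1", "1/1", "0/0",   # loc1: no missing genotypes
             NA,    "0/1", "1/1", "0/0"),  # loc2: one missing genotype, in pop A
  stringsAsFactors = FALSE
)
missing <- 0.25

# 'samples'-style: proportion of NA genotypes per locus, pooled over all samples.
miss_samples <- tapply(is.na(dat$GT), dat$LOCUS, mean)
keep_samples <- names(miss_samples)[miss_samples <= missing]

# 'pops'-style: proportion of NA per locus-by-population cell; a locus is kept
# only if every population is at or below the threshold.
miss_pops <- tapply(is.na(dat$GT), list(dat$LOCUS, dat$POP), mean)
keep_pops <- rownames(miss_pops)[apply(miss_pops <= missing, 1, all)]

print(keep_samples)
print(keep_pops)
```

In this toy case loc2 has 25% missingness overall but 50% within population A, so it passes the pooled check and fails the per-population check. The 'pops' approach is therefore the stricter of the two when missing data are concentrated in one population.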

sampCol

Character: The column name with the sampled individual information. Default = 'SAMPLE'. Only needed when type=='genos'.

locusCol

Character: The column name with the locus information. Default = 'LOCUS'.

popCol

Character: The column name with population information. Default = 'POP'.

genoCol

Character: The column name with the genotype information. Missing genotypes are encoded with an NA. Default = 'GT'. Only needed when type=='genos'.

freqCol

Character: The column name with the allele frequency information. Missing frequencies are encoded with an NA. Default = 'FREQ'. Only needed when type=='freqs'.

Details

Note that missing data values are assumed to have already been encoded as NA. If this is not done in advance, this function will not produce the expected results.
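As a minimal sketch of that preparation step, the base R snippet below recodes placeholder strings to NA before any filtering. The placeholder codes ("./." and the empty string) and the toy table are assumptions; substitute whatever codes your own data use.

```r
# Hypothetical raw genotypes where missing calls are written as "./." or "".
raw <- data.frame(
  SAMPLE = c("s1", "s2", "s3"),
  LOCUS  = c("loc1", "loc1", "loc1"),
  POP    = c("A", "A", "B"),
  GT     = c("0/1", "./.", ""),
  stringsAsFactors = FALSE
)

# Codes treated as missing here are an assumption; adjust them to your data.
missing_codes <- c("./.", "")
raw$GT[raw$GT %in% missing_codes] <- NA

# Count the genotypes that are now recognised as missing.
n_missing <- sum(is.na(raw$GT))
print(n_missing)
```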

If type=='genos', then the output depends on how you specify the method argument. If type=='freqs', then there is just one output: those loci with missing data not exceeding the missing threshold.

Value

Returns a character vector of locus names from the locus column of dat (dat$LOCUS by default) that conform to the missingness threshold (missingness <= the value of missing).
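One common use of the returned vector is to subset the original table to the retained loci. The sketch below assumes a toy table and a hard-coded good_loci vector standing in for the function's return value; both are hypothetical.

```r
# Toy table with three loci (hypothetical data, default column names).
dat <- data.frame(
  SAMPLE = rep(c("s1", "s2"), times = 3),
  LOCUS  = rep(c("loc1", "loc2", "loc3"), each = 2),
  GT     = c("0/0", "0/1", NA, NA, "1/1", "0/1"),
  stringsAsFactors = FALSE
)

# Stand-in for the character vector returned by filter_missing_loci();
# hard-coded here so the example does not depend on the package.
good_loci <- c("loc1", "loc3")

# Keep only the rows belonging to retained loci.
dat_filt <- dat[dat$LOCUS %in% good_loci, ]
print(unique(dat_filt$LOCUS))
```

If dat is a data.table, the equivalent subset can be written as dat[LOCUS %in% good_loci].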

Examples

library(genomalicious)

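# Copy the example genotype data so the original object is not modified
simMiss <- data_Genos %>% copy()

# Randomly set 10% of genotypes to NA to simulate missing data
simMiss$GT[sample(1:nrow(simMiss), 0.1*nrow(simMiss), replace=FALSE)] <- NA

# Loci with missingness <= 0.10 across all samples (default method = "samples")
filter_missing_loci(simMiss, 0.10)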


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.