filter_missing_units: Filter missing data by units (sampled individuals or...

View source: R/filter_missing_units.R

filter_missing_units    R Documentation

Filter missing data by units (sampled individuals or populations)

Description

Parses a data table of genotypes or allele frequencies and returns the names of the units (sampled individuals or populations) that conform to a desired missing data threshold.

Usage

filter_missing_units(
  dat,
  missing,
  type = "genos",
  method = "samples",
  sampCol = "SAMPLE",
  locusCol = "LOCUS",
  popCol = "POP",
  genoCol = "GT",
  freqCol = "FREQ"
)

Arguments

dat

Data table: It must contain the columns:

  1. The individual sample ID (see param sampCol), for genotype datasets only.

  2. The locus ID (see param locusCol).

  3. The population ID (see param popCol).

  4. The genotypes (see param genoCol), or the allele frequencies (see param freqCol).

missing

Numeric: The missing data threshold, that is, the maximum proportion of missing data permitted for a unit (sampled individual or population). A value between 0 and 1.

type

Character: Is dat a data table of genotypes ('genos') or a data table of allele frequencies ('freqs')? Default = 'genos'.

method

Character: The method by which missingness filtering is performed. Only valid when filtering is performed on genotypes (type=='genos'). One of 'samples' or 'pops'. Default = 'samples'. For 'samples', missingness is calculated across all loci for each sampled individual. Any individual with missingness greater than the threshold will be discarded. For 'pops', a mean missingness is calculated for each population, across all sampled individuals within that population and across all loci. Populations with missingness greater than the threshold will be discarded.

sampCol

Character: The column name with the sampled individual information. Default = 'SAMPLE'. Only needed when type=='genos'.

locusCol

Character: The column name with the locus information. Default = 'LOCUS'.

popCol

Character: The column name with population information. Default = 'POP'.

genoCol

Character: The column name with the genotype information. Missing genotypes are encoded with an NA. Default = 'GT'. Only needed when type=='genos'.

freqCol

Character: The column name with the allele frequency information. Missing frequencies are encoded with an NA. Default = 'FREQ'. Only needed when type=='freqs'.

Details

Note that missing data values are assumed to have already been encoded as NA. If this has not been done in advance, this function will not produce the expected results.
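
As a minimal sketch of that preparation step, the base R snippet below recodes a placeholder string to NA in a toy long-format table. The './.' placeholder and the toy sample, locus and population names are assumptions made for illustration; substitute whatever code your own data uses for missing genotypes.

```r
# Toy long-format genotype table. The './.' missing code and all
# sample, locus and population names are hypothetical.
dat <- data.frame(
  SAMPLE = c("Ind1", "Ind1", "Ind2", "Ind2"),
  LOCUS  = c("Loc1", "Loc2", "Loc1", "Loc2"),
  POP    = c("PopA", "PopA", "PopA", "PopA"),
  GT     = c("0/1", "./.", "./.", "1/1"),
  stringsAsFactors = FALSE
)

# Recode the placeholder to NA; %in% is used so that any values that
# are already NA are left untouched.
dat$GT[dat$GT %in% "./."] <- NA

# Count the genotypes now flagged as missing
n_missing <- sum(is.na(dat$GT))
```

After this step, the table can be passed as dat with missing genotypes visible to any NA-based calculation.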

If type=='genos', the output depends on how you specify the method argument: sampled individuals for method=='samples', or populations for method=='pops'. If type=='freqs', there is just one output: those populations with missing data at or below the missing threshold.
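
The two genotype methods can be illustrated with a short base R sketch. This is one reading of the description given for the method argument, not the package's internal code, and it assumes a balanced table in which every individual has one row per locus. The toy data and the threshold of 0.25 are hypothetical.

```r
# Toy genotype table: 4 individuals x 4 loci, 2 populations (hypothetical)
dat <- data.frame(
  SAMPLE = rep(c("Ind1", "Ind2", "Ind3", "Ind4"), each = 4),
  LOCUS  = rep(c("Loc1", "Loc2", "Loc3", "Loc4"), times = 4),
  POP    = rep(c("PopA", "PopB"), each = 8),
  GT     = c("0/0", "0/1", "1/1", "0/0",   # Ind1: no missing genotypes
             "0/1", NA,    "0/0", "1/1",   # Ind2: 1 of 4 missing
             NA,    NA,    "0/1", "0/0",   # Ind3: 2 of 4 missing
             "1/1", "0/0", "0/1", NA),     # Ind4: 1 of 4 missing
  stringsAsFactors = FALSE
)
missing <- 0.25

# 'samples': proportion of missing genotypes across loci, per individual
samp_miss <- tapply(is.na(dat$GT), dat$SAMPLE, mean)

# 'pops': proportion of missing genotypes across all individuals and
# loci within each population (equal to the mean of the per-individual
# values when the table is balanced)
pop_miss <- tapply(is.na(dat$GT), dat$POP, mean)

# Keep units whose missingness is at or below the threshold
keep_samples <- names(samp_miss)[samp_miss <= missing]
keep_pops    <- names(pop_miss)[pop_miss <= missing]
```

Note that a population can be discarded under 'pops' even when some of its individuals would pass under 'samples', because the population value pools all of its individuals.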

Value

Returns a character vector of sample names in dat[[sampCol]] or population names in dat[[popCol]] that conform to the missingness threshold (missingness <= the value of missing).
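
Because the return value is a vector of names rather than a filtered table, a typical next step is to subset the original data by it. The sketch below shows this in base R; the keep vector is a hypothetical stand-in for the function's output, and the toy table is made up for illustration.

```r
# Toy genotype table with three individuals (hypothetical)
dat <- data.frame(
  SAMPLE = rep(c("Ind1", "Ind2", "Ind3"), each = 2),
  LOCUS  = rep(c("Loc1", "Loc2"), times = 3),
  POP    = rep(c("PopA", "PopA", "PopB"), each = 2),
  GT     = c("0/0", "0/1", NA, NA, "1/1", "0/1"),
  stringsAsFactors = FALSE
)

# Stand-in for the character vector returned when method = 'samples'
keep <- c("Ind1", "Ind3")

# Retain only the rows belonging to the retained individuals
dat_filt <- dat[dat$SAMPLE %in% keep, ]
```

For method='pops' or type=='freqs', the same pattern applies with the population column in place of the sample column.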

Examples

library(genomalicious)

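# Copy the example genotype data so the original is not modified
simMiss <- data_Genos %>% copy()
# Set a random 10% of genotypes to NA to simulate missing data
simMiss$GT[sample(1:nrow(simMiss), 0.1*nrow(simMiss), replace=FALSE)] <- NA

# Individuals with missingness <= 0.10
filter_missing_units(simMiss, missing=0.10)
# Populations with mean missingness <= 0.10
filter_missing_units(simMiss, missing=0.10, method='pops')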


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.