Parses a data table of genotypes or allele frequencies and returns the loci that conform to a desired missing data threshold.


  type = "genos",
  method = "samples",
  sampCol = "SAMPLE",
  locusCol = "LOCUS",
  popCol = "POP",
  genoCol = "GT",
  freqCol = "FREQ"



Data table: It must contain the columns:

  1. The sample ID (see param sampCol), for genotype datasets only.

  2. The locus ID (see param locusCol).

  3. The population ID (see param popCol).

  4. The genotypes (see param genoCol), or the allele frequencies (see param freqCol).


Numeric: The maximum proportion of missing data permitted at a locus, a value between 0 and 1.


Character: Is dat a data table of genotypes ('genos') or a data table of allele frequencies ('freqs')? Default = 'genos'.


Character: The method by which missingness filtering is performed. Only valid when filtering is performed on genotypes (type=='genos'). One of 'samples' or 'pops'. Default = 'samples'. For 'samples', missingness at each locus is calculated as the proportion of missing genotypes across all sampled individuals (irrespective of their populations). If this proportion is greater than the threshold, the locus is discarded. For 'pops', missingness at a locus is calculated separately for each population. If any population has missingness above the threshold at a locus, that locus is removed.
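The difference between the two methods can be illustrated with a small base R sketch. This is not the function's internal code; the toy table, the threshold of 0.5, and the object names (genos, miss_samples, miss_pops) are assumptions made for illustration, and only the column names follow the documented defaults.

```r
# Toy long-format genotype table (hypothetical data).
genos <- data.frame(
  SAMPLE = rep(c("s1", "s2", "s3", "s4"), times = 2),
  POP    = rep(c("A", "A", "B", "B"), times = 2),
  LOCUS  = rep(c("loc1", "loc2"), each = 4),
  GT     = c("0/0", "0/1", "1/1", "0/1",   # loc1: no missing genotypes
             NA,    NA,    "0/0", "0/1"),  # loc2: both pop A samples missing
  stringsAsFactors = FALSE
)
threshold <- 0.5

# 'samples': proportion of NA genotypes per locus, pooled over all individuals.
miss_samples <- tapply(is.na(genos$GT), genos$LOCUS, mean)
keep_samples <- names(miss_samples)[miss_samples <= threshold]

# 'pops': proportion of NA genotypes per locus within each population.
# A locus is kept only if every population is at or below the threshold.
miss_pops <- tapply(is.na(genos$GT), list(genos$LOCUS, genos$POP), mean)
keep_pops <- rownames(miss_pops)[apply(miss_pops <= threshold, 1, all)]

keep_samples  # "loc1" "loc2": loc2 is 50% missing overall, which passes
keep_pops     # "loc1": loc2 is 100% missing in population A, which fails
```

In this toy case the same locus passes under 'samples' but fails under 'pops', because missingness concentrated in one population is diluted when all individuals are pooled.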


Character: The column name with the sampled individual information. Default = 'SAMPLE'. Only needed when type=='genos'.


Character: The column name with the locus information. Default = 'LOCUS'.


Character: The column name with population information. Default = 'POP'.


Character: The column name with the genotype information. Missing genotypes are encoded with an NA. Default = 'GT'. Only needed when type=='genos'.


Character: The column name with the allele frequency information. Missing frequencies are encoded with an NA. Default = 'FREQ'. Only needed when type=='freqs'.


Note that missing data values are assumed to have already been encoded as NA. If this is not done in advance, this function will not produce the expected results.
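As a minimal sketch of that preparation step, the base R snippet below recodes missing genotype strings to NA. The codes "./." and "" are assumptions about how an upstream pipeline might write missing calls, and the table itself is hypothetical; substitute whatever codes your data actually uses.

```r
# Hypothetical raw genotypes where missing calls are written as "./." or "".
raw <- data.frame(
  SAMPLE = c("s1", "s2", "s3", "s4"),
  LOCUS  = "loc1",
  POP    = c("A", "A", "B", "B"),
  GT     = c("0/1", "./.", "", "1/1"),
  stringsAsFactors = FALSE
)

# Assumed missing-data codes; adjust these to match your own data.
missing_codes <- c("./.", "")
raw$GT[raw$GT %in% missing_codes] <- NA

# Count how many genotypes are now recognised as missing.
sum(is.na(raw$GT))
```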

If type=='genos', then your output will depend on how you specify the method argument. If type=='freqs', then there is just one output: those loci with missing data no greater than the missing threshold.


Returns a character vector of locus names from the locus column of dat (see param locusCol) that conform to the missingness threshold (missingness <= the value of missing).



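library(genomalicious)
data(data_Genos)
# Copy the example genotypes and set a random 10% of genotype calls to NA
simMiss <- data_Genos %>% copy()
simMiss$GT[sample(seq_len(nrow(simMiss)), floor(0.1*nrow(simMiss)), replace=FALSE)] <- NA

# Keep loci with at most 10% missing genotypes across all samples
filter_missing_loci(simMiss, 0.10)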

