filter_snps: Filter SNPs in snpRdata objects.
In hemstrow/snpR: Whole-Genome Analysis Tools for Use with Single Nucleotide Polymorphism Data

filter_snps

R Documentation

Description

filter_snps filters snpRdata objects to remove SNPs or individuals that fail user-defined thresholds for several statistics. Because this function removes all previously calculated statistics from the snpRdata object, it should usually be the first step in an analysis. See Details for the available filters.

Usage

filter_snps(
  x,
  maf = FALSE,
  mac = 0,
  mgc = 0,
  hf_hets = FALSE,
  hwe = FALSE,
  fwe_method = "none",
  hwe_excess_side = "both",
  singletons = FALSE,
  min_ind = FALSE,
  min_loci = FALSE,
  inds_first = FALSE,
  remove_garbage = FALSE,
  re_run = "partial",
  maf_facets = NULL,
  hwe_facets = NULL,
  non_poly = TRUE,
  bi_al = TRUE,
  LD_prune_sigma = FALSE,
  LD_prune_step = LD_prune_sigma,
  LD_prune_r = FALSE,
  LD_prune_method = "CLD",
  LD_prune_facet = NULL,
  LD_prune_par = FALSE,
  LD_prune_use_ME = FALSE,
  LD_prune_ME_sigma = 1e-04,
  verbose = TRUE
)

Arguments

`x`	snpRdata object.
`maf`	numeric between 0 and 1 or FALSE, default FALSE. Minimum acceptable minor allele frequency. Either maf or mac filtering is allowed, not both.
`mac`	integer, default 0, between 0 and the total number of individuals minus 1. Loci with this many or fewer copies of the minor allele will be removed. Either maf or mac filtering is allowed, not both. `mac = 1` is therefore a singleton filter.
`mgc`	integer, default 0, between 0 and half the total number of individuals. Loci where the minor allele is present in this many or fewer individuals will be removed. Either mac or mgc filtering is allowed, not both. `mgc = 1` is analogous to a singleton filter, but will also remove loci with a single homozygous minor individual.
`hf_hets`	numeric between 0 and 1 or FALSE, default FALSE. Maximum acceptable heterozygote frequency.
`hwe`	numeric between 0 and 1 or FALSE, default FALSE. Minimum acceptable HWE p-value.
`fwe_method`	character, default "none". Option to use for Family-Wise Error rate correction for HWE filtering. If requested, only SNPs with p-values below the alpha provided to the `hwe` argument after FWE correction will be removed. See `p.adjust` for information on method options.
`hwe_excess_side`	character, default "both". Options: "heterozygote": only loci with a heterozygote excess are removed; "homozygote": only loci with a homozygote excess are removed; "both": loci with either a heterozygote or homozygote excess are removed.
`singletons`	logical, default FALSE. Deprecated; use `mac = 1` to remove singletons instead. If TRUE, removes singletons (loci where there is only a single minor allele). If population sizes are reasonably high, this is more or less redundant if `maf` is also set.
`min_ind`	numeric between 0 and 1 or FALSE, default FALSE. Minimum proportion of individuals in which a locus must be sequenced.
`min_loci`	numeric between 0 and 1 or FALSE, default FALSE. Minimum proportion of SNPs at which an individual must be genotyped.
`inds_first`	logical, default FALSE. If TRUE and filtering of individuals for missing data was requested, individuals are filtered before loci. Otherwise loci are filtered first.
`remove_garbage`	numeric between 0 and 1 or FALSE, default FALSE. Optionally do a filter to remove very poorly sequenced individuals and loci jointly before applying other filters. This can be used to reduce biases caused by very bad loci/individuals prior to full filtering. This number should be lower than either the `min_ind` or `min_loci` parameters if those arguments are used and should generally be quite permissive to remove only truly bad loci or individuals.
`re_run`	character or FALSE, default "partial". When individuals are removed via min_loci, some SNPs that initially passed filtering may now violate some filters. SNP filters can be re-run automatically via one of two methods: "partial": re-filters for non-polymorphic loci (non_poly) only, if that filter was requested initially; "full": re-runs the full filtering scheme (save for min_loci). Note: if `inds_first = TRUE`, all re-run options other than FALSE will re-run the individual filter for missing data again after loci filtering.
`maf_facets`	character or NULL, default NULL. Defines a sample facet over which the minor allele frequency can be checked. SNPs will only fail the maf filter if they fail in every level of every provided facet.
`hwe_facets`	character or NULL, default NULL. Defines a sample facet over which the hwe filter can be checked. SNPs will fail the hwe filter if they fail in any level of any provided facet.
`non_poly`	logical, default TRUE. If TRUE, non-polymorphic loci will be removed.
`bi_al`	logical, default TRUE. If TRUE, loci with more than two alleles will be removed. Note that this is mostly an internal argument and should rarely be used directly, since import.snpR.data and other snpRdata object creation functions all pass SNPs through this filter because many snpR functions will fail to work if there are more than two alleles at a locus.
`LD_prune_sigma`	numeric or FALSE, default FALSE. If LD pruning is requested, the window size in kb across which to consider pairs of loci for LD. Full windows span two times `LD_prune_sigma`.
`LD_prune_step`	numeric, default equal to `LD_prune_sigma`. Step size between LD windows, in kb. The default ensures that each SNP is in exactly two windows, allowing for smoother filtering.
`LD_prune_r`	numeric, default FALSE. The LD filter cutoff value. Pairs above this value will be greedily pruned. Must be between 0 and 1 if LD pruning is conducted.
`LD_prune_method`	character, default "CLD". The method to use for LD calculations and pruning. The options are "CLD", "Dprime", and "rsq". See `calc_pairwise_ld` for details.
`LD_prune_facet`	character, default NULL. Defines a facet over which LD pruning is conducted. Windows will be calculated within any SNP parts of the facet, and LD values will be calculated only within levels of any sample parts. For example, 'chr.pop' would calculate LD values across windows within each chromosome, then filter SNPs above `LD_prune_r` in any population.
`LD_prune_par`	numeric, default FALSE. If numeric, LD calculation will be conducted in parallel using `LD_prune_par` threads.
`LD_prune_use_ME`	logical, default FALSE. If TRUE, uses Minimization-Expectation to do LD calculations for all `LD_prune_method` options other than "CLD". This can be very slow. See `calc_pairwise_ld` for details.
`LD_prune_ME_sigma`	numeric, default 0.0001. The cutoff value for the difference in haplotype frequencies to use when conducting ME. See `calc_pairwise_ld` for details.
`verbose`	Logical, default TRUE. If TRUE, some progress updates and filtering notes will be printed to the console.
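
The interaction between `hwe` and `fwe_method` can be illustrated with base R's `p.adjust`. The following is a minimal sketch using made-up p-values rather than snpR output. It assumes, following the argument description above, that loci are dropped when their adjusted p-value falls below the `hwe` threshold; the package's internal implementation may differ.

```r
# Hypothetical HWE p-values for five loci (illustrative values only).
hwe_p <- c(locus1 = 0.001, locus2 = 0.020, locus3 = 0.040,
           locus4 = 0.300, locus5 = 0.900)
alpha <- 0.05  # the value that would be passed to `hwe`

# No correction (as with fwe_method = "none"): compare raw p-values to alpha.
removed_raw <- names(hwe_p)[hwe_p < alpha]

# Holm correction: adjust the p-values first, then compare to alpha.
adj_p <- p.adjust(hwe_p, method = "holm")
removed_holm <- names(adj_p)[adj_p < alpha]

print(removed_raw)
print(removed_holm)
```

With these values the uncorrected filter would remove three loci, while the Holm-corrected filter would remove only the locus with the smallest p-value.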

Details

Possible filters:

maf, minor allele frequency: removes SNPs where the minor allele frequency is too low. Can check for minor allele frequencies below the provided threshold either globally or within each population individually.
hf_hets, high observed heterozygosity: removes SNPs where the observed heterozygosity is too high.
min_ind, minimum individuals: removes SNPs that were genotyped in too few individuals.
min_loci, minimum loci: removes individuals sequenced at too few loci.
non_poly, non-polymorphic SNPs: removes SNPs that are not polymorphic (not true SNPs).
bi_al, non-biallelic SNPs: removes SNPs that have more than two observed alleles. This is mostly an internal argument, since the various snpRdata import options use it automatically to prevent downstream errors in other snpR functions.
remove_garbage: Quickly removes any loci or samples that are jointly poorly genotyped prior to other filtering. This can be useful if some individuals or loci are of poor enough quality that they will bias other filters.
LD_prune_sigma: Sliding-window based Linkage Disequilibrium filtering to remove non-independent loci.
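
The difference between the `maf`, `mac`, and `mgc` thresholds can be seen on a small toy example. This base R sketch assumes diploid genotypes coded as allele counts (0, 1, 2, with NA for missing calls), which is an illustrative encoding and not snpR's internal data format; all object names are hypothetical.

```r
# Toy genotypes: rows are loci, columns are individuals. Illustrative only.
geno <- rbind(
  snp1 = c(0, 0, 0, 1, 0, 0),   # one heterozygote: a singleton
  snp2 = c(0, 2, 0, 0, 0, 0),   # one homozygote for the rarer allele
  snp3 = c(1, 1, 2, 0, NA, 1),  # common variant with one missing call
  snp4 = c(0, 0, 0, 0, 0, 0)    # non-polymorphic
)

n_called  <- rowSums(!is.na(geno))
alt_count <- rowSums(geno, na.rm = TRUE)

# Minor allele count: the smaller of the two allele counts at each locus.
mac <- pmin(alt_count, 2 * n_called - alt_count)
# Minor allele frequency: minor allele count over the number of called alleles.
maf <- mac / (2 * n_called)
# Minor genotype count: number of individuals carrying the minor allele.
alt_is_minor <- alt_count <= n_called
mgc <- ifelse(alt_is_minor,
              rowSums(geno > 0, na.rm = TRUE),
              rowSums(geno < 2, na.rm = TRUE))

# Loci at or below the threshold are removed, so the survivors are:
pass_mac1 <- rownames(geno)[mac > 1]
pass_mgc1 <- rownames(geno)[mgc > 1]

print(data.frame(mac = mac, maf = round(maf, 3), mgc = mgc))
print(pass_mac1)
print(pass_mgc1)
```

Here a `mac`-style threshold of 1 drops the singleton and the non-polymorphic locus, while an `mgc`-style threshold of 1 additionally drops the locus whose minor allele occurs only in a single homozygous individual.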

Value

A snpRdata object in the same format as the input, with SNPs and individuals not passing the filters removed.

Filter order

Note that filtering out poorly sequenced individuals creates a possible conflict with the loci filters, since some loci may no longer pass filters after individuals are removed. For example, if a few individuals in one population carry the only instances of a rare minor allele that still passes the maf threshold, removing those individuals may cause the locus to no longer be polymorphic in the sample.

To counter this, the re_run argument can be used to pass the data through a second filtering step after individuals are removed. By default, the "partial" re-run option is used, which re-runs only the non-polymorphic filter (if it was originally set). The "full" option re-runs all set filters. Note that re-running any of these filters may cause individuals to fail the individual filter after loci removal, and so subsequent tertiary re-running of the individual filters, followed by the loci filters, and so on, could be justified. This is not done automatically. However, if the inds_first option is selected to filter poorly sequenced individuals prior to applying loci filters, any re-run option will re-run the individual filters after the loci filters have been applied.

Filter order (SNPs vs. samples first) can be controlled with the inds_first argument. If individuals are filtered first, poorly sequenced individuals will be re-filtered following locus removal if re-running is requested.
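
The conflict described in this section, and the effect of a "partial" re-run, can be sketched in base R on toy data. The encoding (allele counts with NA for missing calls) and the helper `is_poly` are assumptions made for illustration and are not part of snpR.

```r
# Toy genotypes (rows = loci, columns = individuals). Illustrative only.
geno <- rbind(
  snp1 = c(0, 0, 0, 0, 1),    # minor allele carried only by ind5
  snp2 = c(0, 1, 2, 1, NA),
  snp3 = c(1, 0, 1, 2, NA),
  snp4 = c(2, 1, 0, 1, NA)
)
colnames(geno) <- paste0("ind", 1:5)

# Hypothetical helper: TRUE for loci where both alleles are observed.
is_poly <- function(g) {
  apply(g, 1, function(x) {
    x <- x[!is.na(x)]
    length(x) > 0 && any(x > 0) && any(x < 2)
  })
}

# Step 1: loci filter (non-polymorphic loci removed). All four loci pass.
pass1 <- geno[is_poly(geno), , drop = FALSE]

# Step 2: individual filter, keeping individuals genotyped at >= 50% of loci.
prop_called <- colMeans(!is.na(pass1))
pass2 <- pass1[, prop_called >= 0.5, drop = FALSE]

# Step 3: "partial"-style re-run of the non-polymorphic filter only.
pass3 <- pass2[is_poly(pass2), , drop = FALSE]

print(rownames(pass1))
print(colnames(pass2))
print(rownames(pass3))
```

Removing the poorly genotyped individual in step 2 leaves snp1 monomorphic, so it is only caught because the non-polymorphic filter is applied a second time.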

LD pruning

LD pruning can be conducted using any of the methods described in calc_pairwise_ld using the LD_prune_ argument family. Loci are removed greedily: starting from the first locus pair with elevated LD, the more poorly sequenced locus is removed. If both loci are equally well sequenced, the second locus (by position) is removed. If any elevated locus pairs remain, the next pair is addressed the same way until no pairs remain.
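
The greedy rule described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: the LD matrix, the call rates, and the function `greedy_prune` are all hypothetical, and windowing is ignored so that every pair is treated as falling within one window.

```r
# Hypothetical pairwise LD values for four loci ordered by position.
snps <- paste0("snp", 1:4)
ld <- matrix(0, 4, 4, dimnames = list(snps, snps))
ld["snp1", "snp2"] <- 0.9
ld["snp1", "snp3"] <- 0.7
ld["snp2", "snp3"] <- 0.8
ld["snp3", "snp4"] <- 0.1

# Hypothetical proportion of individuals genotyped at each locus.
call_rate <- c(snp1 = 0.95, snp2 = 0.80, snp3 = 0.95, snp4 = 0.90)

greedy_prune <- function(ld, call_rate, r) {
  keep <- rownames(ld)
  repeat {
    sub <- ld[keep, keep, drop = FALSE]
    sub[lower.tri(sub, diag = TRUE)] <- 0   # consider each pair only once
    high <- which(sub > r, arr.ind = TRUE)
    if (nrow(high) == 0) break
    # First elevated pair: earliest first locus, then earliest second locus.
    high <- high[order(high[, "row"], high[, "col"]), , drop = FALSE]
    first  <- keep[high[1, "row"]]
    second <- keep[high[1, "col"]]
    # Drop the more poorly genotyped locus; on a tie, drop the second one.
    to_drop <- if (call_rate[first] < call_rate[second]) first else second
    keep <- setdiff(keep, to_drop)
  }
  keep
}

kept <- greedy_prune(ld, call_rate, r = 0.5)
print(kept)
```

In this toy case snp2 is dropped first because it is less well genotyped than snp1, and snp3 is dropped next because it ties with snp1 on call rate and comes second by position.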

Facet-based filters (within-group filtering)

Via the maf_facets argument, the minor allele frequency filter can be applied either across all samples only or within each level of a supplied sample-specific facet as well as across the entire dataset. In the latter case, any SNP that passes the maf filter in any facet level is considered to pass the filter. This option is most useful when population sizes are very different, there are many populations, and/or allele frequencies differ strongly between populations, since alleles that are common in one population of interest might otherwise be filtered out.

The hwe_facets argument behaves in the opposite way: loci will be removed if they fail the provided hwe filter in any facet level. In both cases, facets should be provided as described in Facets_in_snpR.
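
The opposite pass/fail rules for `maf_facets` and `hwe_facets` can be written out explicitly. The sketch below uses made-up per-population values and hypothetical object names, and for simplicity considers only the facet levels, not the additional whole-dataset check mentioned above.

```r
# Hypothetical per-population minor allele frequencies (rows = loci).
maf_by_pop <- rbind(
  snp1 = c(popA = 0.20, popB = 0.01),
  snp2 = c(popA = 0.02, popB = 0.01),
  snp3 = c(popA = 0.30, popB = 0.25)
)
# Hypothetical per-population HWE p-values for the same loci.
hwe_by_pop <- rbind(
  snp1 = c(popA = 0.400, popB = 0.60),
  snp2 = c(popA = 0.500, popB = 0.70),
  snp3 = c(popA = 0.001, popB = 0.80)
)
maf_min <- 0.05  # threshold that would be passed to `maf`
hwe_min <- 0.01  # threshold that would be passed to `hwe`

# maf facets: a locus fails only if it is below the threshold in EVERY level.
fails_maf <- apply(maf_by_pop < maf_min, 1, all)
# hwe facets: a locus fails if it is below the threshold in ANY level.
fails_hwe <- apply(hwe_by_pop < hwe_min, 1, any)

kept <- rownames(maf_by_pop)[!fails_maf & !fails_hwe]
print(fails_maf)
print(fails_hwe)
print(kept)
```

Only snp1 survives here: snp2 is rare in both populations, and snp3 deviates from HWE in one population even though it is common in both.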

LD filtering can likewise be conducted within facets. SNP-level facets dictate the levels within which LD values are calculated (chromosome, etc.), and loci will be removed if they display elevated LD within sample-level facets.

Author(s)

William Hemstrom

Examples

# Filter with a minor allele frequency of 0.05, maximum heterozygote 
# frequency of 0.55, 50% minimum individuals, and at least 75% of loci 
# sequenced per individual.
filter_snps(stickSNPs, maf = 0.05, hf_hets = 0.55, 
            min_ind = 0.5, min_loci = 0.75)

# The same filters, but with minor allele frequency considered per-population
# and a full re-run of loci filters after individual removal.
filter_snps(stickSNPs, maf = 0.05, hf_hets = 0.55, min_ind = 0.5, 
            min_loci = 0.75, re_run = "full", maf_facets = "pop")

hemstrow/snpR documentation built on July 5, 2025, 4:38 a.m.