View source: R/utility_functions.R
filter_snps | R Documentation |
filter_snps
filters snpRdata objects to remove SNPs or individuals
which fail to pass user defined thresholds for several statistics. Since this
function removes all calculated statistics, etc. from the snpRdata object,
this should usually be the first step in an analysis. See details for filters.
filter_snps(
x,
maf = FALSE,
mac = 0,
mgc = 0,
hf_hets = FALSE,
hwe = FALSE,
fwe_method = "none",
hwe_excess_side = "both",
singletons = FALSE,
min_ind = FALSE,
min_loci = FALSE,
inds_first = FALSE,
remove_garbage = FALSE,
re_run = "partial",
maf_facets = NULL,
hwe_facets = NULL,
non_poly = TRUE,
bi_al = TRUE,
LD_prune_sigma = FALSE,
LD_prune_step = LD_prune_sigma,
LD_prune_r = FALSE,
LD_prune_method = "CLD",
LD_prune_facet = NULL,
LD_prune_par = FALSE,
LD_prune_use_ME = FALSE,
LD_prune_ME_sigma = 1e-04,
verbose = TRUE
)
x |
snpRdata object. |
maf |
numeric between 0 and 1 or FALSE, default FALSE. Minimum acceptable minor allele frequency. Either maf or mac filtering is allowed, not both. |
mac |
integer, between 0 and the total number of individuals minus 1.
Loci with this many or fewer minor alleles will be removed.
Either maf or mac filtering is allowed, not both. |
mgc |
integer, between 0 and the total number of individuals/2. Loci
where the minor allele is present in this many or fewer
individuals will be removed. Either mac or mgc filtering is allowed, not
both. |
hf_hets |
numeric between 0 and 1 or FALSE, default FALSE. Maximum acceptable heterozygote frequency. |
hwe |
numeric between 0 and 1 or FALSE, default FALSE. Minimum acceptable HWE p-value. |
fwe_method |
character, default "none". Option to use for Family-Wise
Error rate correction for HWE filtering. If requested, only SNPs with
p-values below the alpha provided to the |
hwe_excess_side |
character, default "both". Options:
|
singletons |
logical, default FALSE. Deprecated, use |
min_ind |
numeric between 0 and 1 or FALSE, default FALSE. Minimum proportion of individuals in which a locus must be sequenced. |
min_loci |
numeric between 0 and 1 or FALSE, default FALSE. Minimum proportion of SNPs at which an individual must be genotyped. |
inds_first |
logical, default FALSE. If TRUE and filtering of individuals for missing data is requested, individuals will be filtered before loci. Otherwise loci are filtered first. |
remove_garbage |
numeric between 0 and 1 or FALSE, default FALSE. Optionally
do a filter to remove very poorly sequenced individuals and loci
jointly before applying other filters. This can be used to reduce biases
caused by very bad loci/individuals prior to full filtering. This number
should be lower than either the |
re_run |
character or FALSE, default "partial". When individuals are removed via min_loci, it is possible that some SNPs that initially passed filtering steps will now violate some filters. SNP filters can be re-run automatically via several methods:
Note: if |
maf_facets |
character or NULL, default NULL. Defines a sample facet over which the minor allele frequency can be checked. SNPs will only fail the maf filter if they fail in every level of every provided facet. |
hwe_facets |
character or NULL, default NULL. Defines a sample facet over which the hwe filter can be checked. SNPs will fail the hwe filter if they fail in any level of any provided facet. |
non_poly |
logical, default TRUE. If TRUE, non-polymorphic loci will be removed. |
bi_al |
logical, default TRUE. If TRUE, loci with more than two alleles will be removed. Note that this is mostly an internal argument and should rarely be used directly, since import.snpR.data and other snpRdata object creation functions all pass SNPs through this filter because many snpR functions will fail to work if there are more than two alleles at a locus. |
LD_prune_sigma |
numeric or FALSE, default FALSE. If LD pruning, the
window size in kb across which to consider LD loci pairs. Windows will be
equal to two times |
LD_prune_step |
numeric, default equal to |
LD_prune_r |
numeric, default FALSE. The LD filter cutoff value. Pairs above this value will be greedily pruned. Must be between 0 and 1 if LD pruning is conducted. |
LD_prune_method |
character, default "CLD". The method to use for LD calculations and pruning. The options are:
See
|
LD_prune_facet |
character, default NULL. Defines a facet over which
LD can be checked. Windows will be calculated within any
SNP parts of the facet; LD values will be calculated only within levels of any
sample parts. For example 'chr.pop' would calculate LD values across
windows, then filter SNPs above |
LD_prune_par |
numeric, default FALSE. If numeric, LD calculation will be
conducted in parallel using |
LD_prune_use_ME |
logical, default FALSE. If TRUE, uses
Minimization-Expectation to do LD calculations for all
|
LD_prune_ME_sigma |
numeric, default 0.0001. The cutoff value for difference
in haplotype frequencies to use when conducting ME. See
|
verbose |
Logical, default TRUE. If TRUE, some progress updates and filtering notes will be printed to the console. |
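The difference between the maf, mac, and mgc arguments described above can be illustrated with a minimal base R sketch. This is not the snpR implementation: the toy genotype matrix and every object name below are assumptions made for illustration, and the "at or below the threshold is removed" rule simply follows the argument descriptions.

```r
# Toy genotypes (illustrative only): rows are loci, columns are individuals.
# Values count copies of one arbitrarily chosen allele (0, 1, or 2); NA = missing.
geno <- rbind(
  snp1 = c(0, 0, 0, 0, 1, 0),   # a single heterozygote
  snp2 = c(0, 2, 0, 0, 0, 0),   # a single homozygote for the rare allele
  snp3 = c(1, 1, 2, 0, 0, NA)   # a more common allele, one missing genotype
)

# Minor allele count: copies of the rarer allele among called genotypes.
allele_copies <- rowSums(geno, na.rm = TRUE)
total_alleles <- 2 * rowSums(!is.na(geno))
mac_obs <- pmin(allele_copies, total_alleles - allele_copies)

# Minor allele frequency follows directly from the count.
maf_obs <- mac_obs / total_alleles

# Minor genotype count: individuals carrying at least one minor allele copy.
minor_is_counted <- allele_copies <= total_alleles - allele_copies
carriers_counted <- rowSums(geno > 0, na.rm = TRUE)
carriers_other   <- rowSums(geno < 2, na.rm = TRUE)
mgc_obs <- ifelse(minor_is_counted, carriers_counted, carriers_other)

# With a threshold of 1, loci at or below the threshold are removed.
keep_mac <- mac_obs > 1
keep_mgc <- mgc_obs > 1
```

In this toy data snp2 survives a mac threshold of 1 (two minor allele copies) but not an mgc threshold of 1, since both copies sit in a single individual.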
Possible filters:
maf, minor allele frequency: removes SNPs where the minor allele frequency is too low. Can check for mafs below the provided threshold either globally or within each population individually.
hf_hets, high observed heterozygosity: removes SNPs where the observed heterozygosity is too high.
min_ind, minimum individuals: removes SNPs that were genotyped in too few individuals.
min_loci, minimum loci: removes individuals sequenced at too few loci.
non_poly, non-polymorphic SNPs: removes SNPs that are not polymorphic (not true SNPs).
bi_al, non-biallelic SNPs: removes SNPs that have more than two observed alleles. This is mostly an internal argument, since the various snpRdata import options use it automatically to prevent downstream errors in other snpR functions.
remove_garbage: Quickly removes any loci or samples that are jointly poorly genotyped prior to other filtering. This can be useful if some individuals or loci are of poor enough quality that they will bias other filters.
LD_prune_sigma: Sliding-window based Linkage Disequilibrium filtering to remove non-independent loci.
A data.frame in the same format as the input, with SNPs and individuals not passing the filters removed.
Note that filtering out poorly sequenced individuals creates a possible conflict with the loci filters, since after individuals are removed, some loci may no longer pass filters. For example, if a portion of individuals in one population all carry the only instances of a rare minor allele that still passes the maf threshold, removing those individuals may cause the locus to no longer be polymorphic in the sample.
To counter this, the re_run argument can be used to pass the data through a second filtering step after individuals are removed. By default, the "partial" re-run option is used, which re-runs only the non-polymorphic filter (if it was originally set). The "full" option re-runs all set filters. Note that re-running any of these filters may cause individuals to fail the individual filter after loci removal, so a tertiary re-run of the individual filters, followed by the loci filters, and so on, could be justified. This is not done automatically here. However, if the inds_first option is selected to filter poorly sequenced individuals prior to applying loci filters, any re-run option will re-run the individual filters after the loci filters have been applied.
Filter order (snps vs samples first) can be controlled with the inds_first
argument. If individuals are filtered first, poorly sequenced individuals
will be re-filtered following locus removal if re-running is requested.
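The conflict between individual and loci filters, and what a "partial"-style re-run addresses, can be seen in a small base R sketch. The toy genotypes, the dropped individuals, and the helper is_polymorphic are all hypothetical and are not part of snpR.

```r
# Toy genotypes (copies of the minor allele); rows are loci, columns individuals.
geno <- rbind(
  snp1 = c(1, 1, 0, 0, 0, 0, 0, 0),   # minor allele carried only by ind1, ind2
  snp2 = c(0, 1, 1, 0, 2, 0, 1, 0)
)
colnames(geno) <- paste0("ind", 1:8)

# Hypothetical helper: TRUE where more than one allele is observed at a locus.
is_polymorphic <- function(g) {
  copies <- rowSums(g, na.rm = TRUE)
  copies > 0 & copies < 2 * rowSums(!is.na(g))
}

poly_before <- is_polymorphic(geno)

# Suppose ind1 and ind2 fail the individual filter and are dropped.
geno_kept <- geno[, !(colnames(geno) %in% c("ind1", "ind2")), drop = FALSE]
poly_after <- is_polymorphic(geno_kept)

# A "partial"-style re-run: re-apply only the non-polymorphic filter.
geno_final <- geno_kept[poly_after, , drop = FALSE]
```

Here snp1 has a minor allele frequency of 2/16 before individuals are removed, so it would pass a maf threshold of 0.05, yet it is monomorphic among the six remaining individuals and is only caught by the re-run.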
LD pruning can be conducted using any of the methods described in calc_pairwise_ld using the LD_prune_ argument family.
Loci are removed greedily: starting from the first locus pair with elevated
LD, the more poorly sequenced locus is removed. If both loci are equally well
sequenced, the second locus (by position) is removed. If any elevated locus
pairs remain, the next pair is addressed the same way until no pairs remain.
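The greedy rule described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the snpR code: greedy_ld_prune is a hypothetical helper, it takes a precomputed pairwise LD matrix rather than working in sliding windows, and it visits pairs in positional order.

```r
# Hypothetical helper, not part of snpR.
# r:       symmetric matrix of pairwise LD values, loci ordered by position.
# n_typed: number of individuals genotyped at each locus.
# cutoff:  pairs with LD above this value are treated as elevated.
greedy_ld_prune <- function(r, n_typed, cutoff) {
  keep <- rep(TRUE, nrow(r))
  for (i in seq_len(nrow(r) - 1)) {
    for (j in (i + 1):nrow(r)) {
      if (!keep[i]) break                              # locus i already removed
      if (!keep[j]) next                               # locus j already removed
      if (is.na(r[i, j]) || r[i, j] <= cutoff) next    # pair is not elevated
      # Remove the more poorly sequenced locus; on a tie remove the second one.
      if (n_typed[i] < n_typed[j]) keep[i] <- FALSE else keep[j] <- FALSE
    }
  }
  keep
}

# Toy LD matrix for four loci: pairs (1, 2) and (2, 3) are in elevated LD.
r <- matrix(0, 4, 4)
r[upper.tri(r)] <- c(0.9, 0.2, 0.85, 0.1, 0.1, 0.3)
r <- r + t(r)
diag(r) <- 1

kept <- greedy_ld_prune(r, n_typed = c(100, 90, 95, 100), cutoff = 0.8)
```

In this example locus 2 is removed when the first pair is handled, which also resolves the second elevated pair, so loci 1, 3, and 4 are retained.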
Via the maf_facets argument, this function can filter by minor allele frequencies either in all samples or in each level of a supplied sample-specific facet as well as the entire dataset. In the latter case, any SNPs that pass the maf filter in any facet level are considered to pass the filter. This option should be used when population sizes are very different, there are many populations, and/or allele frequencies differ strongly between populations, so that alleles of interest that are common in one population might otherwise be filtered out.
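The "pass in any facet level" logic for minor allele frequencies can be sketched in base R. The frequencies, the facet levels popA and popB, and the inclusive comparison against the threshold are assumptions for illustration and are not taken from snpR.

```r
# Toy minor allele frequencies per level of a hypothetical "pop" facet
# (rows: loci, columns: facet levels), plus the frequency across all samples.
maf_by_pop <- rbind(
  snp1 = c(popA = 0.00, popB = 0.30),   # common in popB only
  snp2 = c(popA = 0.01, popB = 0.02),   # rare everywhere
  snp3 = c(popA = 0.20, popB = 0.25)    # common everywhere
)
maf_overall <- c(snp1 = 0.03, snp2 = 0.015, snp3 = 0.22)
threshold <- 0.05

# Global filtering alone: compare the overall frequency to the threshold.
pass_global <- maf_overall >= threshold

# Facet-aware filtering: a locus passes if it passes overall or in any level.
pass_any <- pass_global | apply(maf_by_pop >= threshold, 1, any)
```

In this toy data snp1 fails the global check but is retained by the facet-aware check because it is common in popB, which is the situation the facet option is meant to protect against.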
The hwe_facets argument is the inverse of this: loci will be removed if they fail the provided hwe filter in any facet level. In both cases, facets should be provided as described in Facets_in_snpR.
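The "fail in any facet level" rule for HWE can be illustrated in base R. This sketch assumes a simple one-degree-of-freedom chi-square test on genotype counts, which may differ from the test snpR actually uses; hwe_p, the counts, and the threshold are hypothetical.

```r
# Hypothetical helper: chi-square HWE p-value from genotype counts at one locus.
# Monomorphic loci (expected counts of zero) are not handled in this sketch.
hwe_p <- function(n_AA, n_Aa, n_aa) {
  n <- n_AA + n_Aa + n_aa
  p <- (2 * n_AA + n_Aa) / (2 * n)                       # frequency of allele A
  expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)     # HWE expectations
  stat <- sum((c(n_AA, n_Aa, n_aa) - expected)^2 / expected)
  pchisq(stat, df = 1, lower.tail = FALSE)
}

# One locus, tested separately in two levels of a hypothetical "pop" facet.
p_popA <- hwe_p(25, 50, 25)   # matches HWE expectations exactly
p_popB <- hwe_p(45, 10, 45)   # strong heterozygote deficit

# The locus fails the filter if it fails in any facet level.
hwe_threshold <- 1e-6
fails_any <- any(c(p_popA, p_popB) < hwe_threshold)
```

The locus is removed here even though it is in perfect equilibrium in popA, because it fails in popB.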
LD filtering can likewise be conducted within facets. SNP-level facets dictate the levels within which LD values are calculated (chromosome, etc.), and loci will be removed if they display elevated LD within sample-level facets.
William Hemstrom
# Filter with a minor allele frequency of 0.05, maximum heterozygote
# frequency of 0.55, 50% minimum individuals, and at least 75% of loci
# sequenced per individual.
filter_snps(stickSNPs, maf = 0.05, hf_hets = 0.55,
min_ind = 0.5, min_loci = 0.75)
# The same filters, but with minor allele frequency considered per-population
# and a full re-run of loci filters after individual removal.
filter_snps(stickSNPs, maf = 0.05, hf_hets = 0.55, min_ind = 0.5,
min_loci = 0.75, re_run = "full", maf_facets = "pop")