filter_ma | R Documentation |
The Allele Frequency is the relative frequency of an allele for a markers/SNP. Calculations usually involve classifying the alleles into Major/Minor, Reference (REF)/Alternate (ALT) and/or Ancestral or dirived allele. radiator provides a filter based of Minor Allele Count, frequency and depth function. Remove/blacklist markers based on Minor/Alternate Allele Count (MAC), Frequency (MAF) or Depth of coverage (MAD). Use it to remove noise, sequencing errors or low marker polymorphism. Some analysis performs better with the full spectrum of allele frequency, consequently it's important to understand the limite of using the thresholds to avoid generating biases.
Filter targets: Marker's alternate allele(s)
Statistics: count, frequency or depth of the allele per markers. This filter as NO concept of strata/populations. It is computed globally.
filter_ma(
data,
interactive.filter = TRUE,
ma.stats = "mac",
filter.ma = NULL,
calibrate.alleles = "depth",
filename = NULL,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
data |
(4 options) A file or object generated by radiator:
How to get GDS and tidy data ?
Look into |
interactive.filter |
(optional, logical) Do you want the filtering session to
be interactive. Figures of distribution are shown before asking for filtering
thresholds.
Default: |
ma.stats |
(optional, character) The statistic of the alternate (minor)
allele to keep a SNP (see details). Options are: |
filter.ma |
(optional, integer or double) The threshold to
blacklist Minor Allele (see details). e.g. with |
calibrate.alleles |
(character) When ancestral allele
information is not available, REF / ALT alleles can be calibrated using
alleles count or depth (sequencing coverage for REF and ALT alleles).
Removing individuals will impact REF/ALT alleles. Ideally,
it should be based on observed read depth, when this information is available
for both alleles. By selecting |
filename |
(optional, character) Write to folder the MAF filtered
tidy dataset. The name will be appended |
parallel.core |
(optional) The number of core used for parallel
execution during import.
Default: |
verbose |
(optional, logical) When |
... |
(optional) Advance mode that allows to pass further arguments for fine-tuning the function. Also used for legacy arguments (see details or special section) |
With interactive.filter = FALSE
, a list in the global environment,
with 7 objects:
$tidy.filtered.mac
$whitelist.markers
$blacklist.markers
$mac.data
$filters.parameters
With interactive.filter = TRUE
, a list with 4 additionnal objects are generated.
$distribution.mac.global
$distribution.mac.local
$mac.global.summary
$mac.helper.table
mac.helper.table:
First and second variables (in columns) represents POP_ID and sample size (n).
The last 2 rows are the local MAF (suggested based on the lowest pop value) and the TOTAL/GLOBAL observations.
Columns starting with ALT are the variable corresponding to the number of alternative (ALT) allele (ranging from 1 to 20). The observations in the ALT allele variable columns are the local (for the pop) and global (last row) MAF of your dataset. e.g. ALT_3 can potentially represent 3 heterozygote individuals with the ALT allele or 1 homozygote individuals for the ALT allele and 1 heterozygote individual. And so on...
Using count or frequency to remove a SNPs ? The preferred choice in radiator as changed from frequency to count, because we think the filtering should not alter the spectrum and this is only achieved if the same criteria is applied for each SNP.
Even small differences in missing data between RADseq markers generates differences in MAF frequency thresholds applied.
Example with a datset consisting of N = 36 individuals and 3 SNPs with varying level of missing genotypes:
SNP number : number samples genotypes : REF/ALT counts
SNP1 : 36 : 69/3
SNP2 : 30 : 65/3
SNP3 : 24 : 45/3
Each SNPs have the same alternate allele count, corresponding to
2 individuals with the polymorphism: 1 homozygote + 1 heterozygote.
Applying a MAF threshold of
0.05 would mean that SNP3 would be blacklisted
(24 * 2 * 0.05 = 2.4 alt alleles required to pass
).
Using count instead of frequency allows each RADseq markers, with varying missing data, to be treated equally.
dots-dots-dots ... allows to pass several arguments for fine-tuning the function:
filter.common.markers
(optional, logical).
Default: filter.common.markers = FALSE
,
Documented in filter_common_markers
.
filter.monomorphic
(logical, optional) Should the monomorphic
markers present in the dataset be filtered out ?
Default: filter.monomorphic = TRUE
.
Documented in filter_monomorphic
.
path.folder
: to write ouput in a specific path
(used internally in radiator).
Default: path.folder = getwd()
.
If the supplied directory doesn't exist, it's created.
To help choose a threshold for the local and global MAF use the interactive version.
2 steps in the interactive version:
Step 1. Visualization and helper table.
Step 2. Filtering markers based on MAC, MAF or MAD
Thanks to Charles Perrier and Jeremy Gaudin for very useful comments on previous version of this function.
Thierry Gosselin thierrygosselin@icloud.com
Good reading: Linck, E., Battey, C. (2019). Minor allele frequency thresholds strongly affect population structure inference with genomic data sets. Molecular Ecology Resources 19(3), 639-647. https://dx.doi.org/10.1111/1755-0998.12995
## Not run:
# The minumum
mac <- radiator::filter_ma(data = turtle.tidy.data)
# This will use the default: interactive version,
# a list is created and to view the filtered tidy data:
mac.tidy.data <- mac$tidy.filtered.mac
# No user interaction
# Using filter.ma = 4, is a good practice to remove mostly sequencing errors
# and assembly artifacts, because it requires the markers to be genotyped in
# 4 heterozygote individuals or 2 homozygote individuals for the alternate
# allele or 2 heterozygote individuals and 1 homozygote individual for the
# alternate allele. Overall, never less than 2 indiduals are required.
mac <- radiator::filter_ma(
data = turtle.gds, # using gds object
ma.stats = "mac",
filter.ma = 4,
filename = "turtle.mac")
# This will remove monomorphic markers (by default filter.monomorphic = TRUE)
# The filtered data will be written in the directory under the name:
# turtle.maf.rad
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.