filter_het | R Documentation |
Observed Heterozygosity based filtering.
The filter arguments of filter_het
allows you to test rapidly
if departure from realistic expectations of heterozygosity statistics
are a problem in downstream analysis.
Highlight outliers individual's observed heterozygosity for a quick diagnostic of mixed samples or poor polymorphism discovery. The statistic is also contrasted with missing data to help differentiate problems. (also computed in detect_mixed_genomes)
The observed heterozygosity in the dataset: an assembly artefact, a genotyping problem, a problem of population groupings or a reliable signal of biological polymorphism? Detect assembly artifact or genotyping problem (e.g. under/over-splitting loci) by looking at marker's observed heterozygosity statistics by population or overall.
Use haplotype or snp level statistics. When the haplotype approach is selected, consistencies of heterozygosity statistics along the read are highlighted.
Interactive approach help by visualizing the data before making a decision on thresholds.
filter_het(
interactive.filter = TRUE,
data,
strata = NULL,
ind.heterozygosity.threshold = NULL,
het.approach = c("SNP", "overall"),
het.threshold = 1,
het.dif.threshold = 1,
outlier.pop.threshold = 1,
helper.tables = FALSE,
coverage.info = FALSE,
filename = NULL,
parallel.core = parallel::detectCores() - 1,
...
)
interactive.filter |
(optional, logical) Do you want the filtering session to
be interactive. With the default, the user is asked to see figures of
distribution before making decisions for filtering with heterozygosity statistics.
Default: |
data |
14 options for input (diploid data only): VCFs (SNPs or Haplotypes,
to make the vcf population ready),
plink (tped, bed), stacks haplotype file, genind (library(adegenet)),
genlight (library(adegenet)), gtypes (library(strataG)), genepop, DArT,
and a data frame in long/tidy or wide format. To verify that radiator detect
your file format use DArT and VCF data: radiator was not meant to generate alleles and genotypes if you are using a VCF file with no genotype (only genotype likelihood: GL or PL). Neither is radiator able to magically generate a genind object from a SilicoDArT dataset. Please look at the first few lines of your dataset to understand it's limit before asking raditor to convert or filter your dataset. |
strata |
(optional)
The strata file is a tab delimited file with a minimum of 2 columns headers:
|
ind.heterozygosity.threshold |
(string, double, optional) Blacklist individuals based on observed heterozygosity (averaged across markers). The string contains 2 thresholds values (min and max). The values are proportions (0 to 1), where 0 turns off the min threshold and 1 turns off the max threshold. Individuals with mean observed heterozygosity higher (>) or lower (<) than the thresholds will be blacklisted. Default: |
het.approach |
(Character string). First value, |
het.threshold |
Number Biallelic markers usually max 0.5. But departure
from this value is common. The higher the proportion threshold, the more relaxed
the filter is. With default there is no filtering.
Default: |
het.dif.threshold |
Number (0 - 1). For |
outlier.pop.threshold |
(integer, optional) Useful to incorporate problematic
populations dragging down polymorphism discovery, but still wanted for analysis.
Use this threshold to allow variance in the number of populations passing
the thresholds described above.
e.g. with |
helper.tables |
(logical) Output tables that show
the number of markers blacklisted or whitelisted based on a series of
automatic thresholds to guide decisions. When |
coverage.info |
(optional, logical) Use
|
filename |
(optional) The function uses |
parallel.core |
(optional) The number of core used for parallel
execution during import.
Default: |
... |
(optional) To pass further arguments for fine-tuning the function. |
Interactive version
There are 4 steps in the interactive version to visualize and filter the data based heterozygosity statistics:
Step 1. Individual's observed heterozygosity: outliers that might represent mixed samples Step 2. Blacklist outliers based on a proportion threshold of mean observed heterozygosity Step 3. Observed heterozygosity statistics per populations and overall Step 4: Blacklist markers based on observed heterozygosity
outlier.pop.threshold
If your a regular radiator user, you've seen the pop.num.threshold
.
outlier.pop.threshold
, is different and requires more thinking,
because the number of populations genotyped potentially vary across markers,
which makes the use of pop.num.threshold
less optimal.
e.g. If only 4 populations out of 10 are genotyped, for a marker you want
to keep, using pop.num.threshold = 0.6
will lead to unwanted results
and inconsistensis.
It's easier to tolerate outliers with this new approach:
outlier.pop.threshold = 2
.
Individual observed heterozygosity (averaged across markers): To help discard an individual based on his observed heterozygosity (averaged across markers), use the manhanttan plot to:
contrast the individual with population and overall samples.
visualize the impact of missigness information (based on population or overall number of markers) and the individual observed heterozygosity. The larger the point, the more missing genotypes.
Outlier above average:
potentially represent two samples mixed together (action: blacklist), or...
a sample with more sequecing effort (point size small): did you merge your replicates fq files ? (action : keep and monitor)
a sample with poor sequencing effort (point size large) where the genotyped markers are all heterozygotes, verify this with missingness (action: discard)
In all cases, if there is no bias in the population sequencing effort, the size of the point will usually be "average" based on the population or overall number of markers.
You can visualize individual observed heterozygosity, choose thresholds and
then visualize, choose thresholds and filter markers based on observed
heterozygosity in one run with: radiator filter_het
.
Outlier below average:
A point with a size larger than the population or overall average (= lots of missing): the poor polymorphism discovery of the sample is probably the result of bad DNA quality, a bias in sequencing effort, etc. (action: blacklist)
A point with a size that looks average (not much missing): this sample requires more attention (action: blacklist) and conduct more tests. e.g. for biallelic data, look for coverage imbalance between ALT/REF allele. At this point you need to distinguish between an artifact of poor polymorphism discovery or a biological reason (highly inbred individual, etc.).
The function returns inside the global environment a list with
15 objects, the objects names are found by using names(your.list.name)
:
filtered tidy data frame: $tidy.filtered.het
whitelist of markers:$whitelist.markers
the strata:$strata
the filters parameters used:$filters.parameters
the individual's heterozigosity:$individual.heterozigosity
the heterozygosity statistics per populations and overall:$heterozygosity.statistics
the blacklisted individuals based on the individual's heterozigosity:$blacklist.ind.het
a list containing the helper tables:$helper.table.het
the boxplot of individual observed heterozygosity:$individual.heterozygosity.boxplot
the manhattan plot of individual heterozygosity:$individual.heterozygosity.manhattan.plot
the boxplot of observed heterozygosity averaged across markers and pop:$markers.pop.heterozygosity.boxplot
the density plot of observed heterozygosity averaged across markers and pop:$markers.pop.heterozygosity.density.plot
the manhattan plot of observed heterozygosity averaged across markers and pop:$markers.pop.heterozygosity.manhattan.plot
the relationship between the number of SNPs on a locus and locus observed heterozygosity statistics:$snp.per.locus.het.plot
the relationship between locus coverage (read depth) and locus observed heterozygosity statistics:$het.cov.plot
whitelist of markers:$whitelist.markers
whitelist of markers:$whitelist.markers
In the working directory, output is conditional to interactive.filter
argument
plot_density_distribution_het
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.