View source: R/detect_duplicate_genomes.R
detect_duplicate_genomes | R Documentation |
The function provides two methods to highlight potential duplicate individuals:
distance between individuals and/or
pairwise genome similarity
detect_duplicate_genomes(
data,
interactive.filter = TRUE,
detect.duplicate.genomes = TRUE,
dup.threshold = 0,
distance.method = "manhattan",
genome = FALSE,
threshold.common.markers = NULL,
blacklist.duplicates = FALSE,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
data |
(4 options) A file or object generated by radiator:
How to get GDS and tidy data ?
Look into |
interactive.filter |
(optional, logical) Should the filtering session be interactive?
Distribution figures are shown before the function asks for filtering
thresholds.
Default: |
detect.duplicate.genomes |
(optional, logical) For use inside radiator pipelines.
Default: |
dup.threshold |
Default: |
distance.method |
(character) Depending on input data, 2 different methods are used (give similar results):
Using |
genome |
(logical) Computes pairwise genome similarity in parallel.
The proportion of shared genotypes is averaged across the markers shared
in each pairwise comparison. This method makes filtering easier because the
threshold is more intuitive with the plots produced, but it takes much longer
to run, even in parallel, so it is better to run it overnight.
Default: |
threshold.common.markers |
(double, optional) When using the
pairwise genome similarity approach ( |
blacklist.duplicates |
(optional, logical)
With |
parallel.core |
(optional) The number of cores used for parallel
execution during import.
Default: |
verbose |
(optional, logical) When |
... |
(optional) Advanced mode that allows passing further arguments to fine-tune the function. Also used for legacy arguments (see details or special section) |
Strategically, run the default first (distance.method, no genome).
distance.method argument is fast, but...
You don't know if the observed comparison (close or distant) is influenced by missing values or by the number of markers in common between the pair compared. This needs to be considered. Be suspicious of a distant outlier in a pairwise comparison within the same pop and, similarly, of a close outlier in a pairwise comparison between different pops.
If there is no outlier, don't bother running the function again with genome = TRUE.
Relative distance
The relative distance is the distance normalized for your dataset (not calculated by strata). For each individual, it's the distance divided by the maximum distance observed, so it ranges from 0 to 1: the closer to 0, the more similar; the closer to 1, the more distant.
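The normalization described above can be sketched in base R. This is only an illustration under stated assumptions, not radiator's internal code: the genotype matrix is toy data coded as alternate allele counts (0/1/2), and the object names (geno, d, rel) are hypothetical.

```r
# Toy genotype matrix (hypothetical data): rows = individuals, columns = markers,
# values = number of alternate alleles. ind2 is an exact copy of ind1.
geno <- matrix(
  c(0, 1, 2, 0, 1,
    0, 1, 2, 0, 1,
    2, 0, 1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"), paste0("snp", 1:5))
)

# Pairwise Manhattan distance between individuals (base R, not radiator).
d <- as.matrix(stats::dist(geno, method = "manhattan"))

# Relative distance: divide by the maximum distance observed in the whole
# dataset (not by strata), so values fall between 0 and 1.
rel <- d / max(d)
rel
```

In this toy example the duplicated pair (ind1, ind2) has a relative distance of 0, while both comparisons involving ind3 sit at 1.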
genome = TRUE
The function will run slower, but...
If you see outliers with the first run, take the time to run the function
with genome = TRUE, because this option is much better at detecting
duplicated individuals and it also shows the impact of missingness
or the number of shared markers between comparisons.
Your outlier duo could well be the result of one of the individuals having an extremely low number of genotypes...
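To make the idea concrete, here is a minimal base R sketch of a pairwise genome similarity computed only on the markers genotyped in both individuals. It is an assumption-laden illustration, not the package's implementation: the data are toy genotypes, pairwise_similarity() is a hypothetical helper, and the MARKERS_COMMON column name is assumed (ID1, ID2 and PROP_IDENTICAL follow the names used in the examples).

```r
# Toy genotypes (hypothetical data); NA = missing genotype.
geno <- rbind(
  ind1 = c("A/A", "A/T", "T/T", NA,    "A/A"),
  ind2 = c("A/A", "A/T", "T/T", "A/T", NA),
  ind3 = c("T/T", "A/T", NA,    "A/A", "A/T")
)

# Hypothetical helper: proportion of identical genotypes over shared markers.
pairwise_similarity <- function(geno, id1, id2) {
  g1 <- geno[id1, ]
  g2 <- geno[id2, ]
  shared <- !is.na(g1) & !is.na(g2)  # markers genotyped in both individuals
  data.frame(
    ID1 = id1,
    ID2 = id2,
    MARKERS_COMMON = sum(shared),
    PROP_IDENTICAL = mean(g1[shared] == g2[shared])
  )
}

# All pairwise comparisons between individuals.
pairs <- utils::combn(rownames(geno), 2)
res <- do.call(
  rbind,
  lapply(seq_len(ncol(pairs)), function(i) {
    pairwise_similarity(geno, pairs[1, i], pairs[2, i])
  })
)
res
```

Reporting the number of shared markers next to the proportion is what lets you judge whether a high or low similarity rests on many genotypes or on only a handful.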
A list that can contain the following objects:
$distance: results of the distance method.
$distance.stats: summary statistics of the distance method.
$pairwise.genome.similarity: results of the genome method.
$genome.stats: summary statistics of the genome method.
$violin.plot.distance: violin plot showing the distribution of pairwise distances.
$manhattan.plot.distance: same info, different visual, with a manhattan plot.
$violin.plot.genome: violin plot showing the distribution of pairwise genome similarities.
$manhattan.plot.genome: same info, different visual, with a manhattan plot.
$blacklist.id.similar: blacklisted duplicates.
Saved in the working directory: individuals.pairwise.dist.tsv, individuals.pairwise.distance.stats.tsv, individuals.pairwise.genome.similarity.tsv, individuals.pairwise.genome.stats.tsv, blackliste.id.similar.tsv, blacklist.pairs.threshold.tsv
Thierry Gosselin thierrygosselin@icloud.com
## Not run:
# First run and simplest way (if you have the tidy tibble):
dup <- radiator::detect_duplicate_genomes(data = "wombat_tidy.rad")
# This will use by default:
# distance.method = "manhattan"
# genome = FALSE
# parallel.core = all my CPUs - 1
# If you need a tidy tibble: use one of radiator's tidy_ functions or
# radiator::tidy_genomic_data
# To view the manhattan plot:
dup$manhattan.plot.distance
# to view the data stats
dup.data.stats <- dup$distance.stats
# to view the data
dup.data <- dup$distance
# Based on the look of the distribution in both the manhattan and violin plots,
# I can filter the dataset to highlight potential duplicates.
# To run the distance method with euclidean distance instead of the default
# manhattan, and also carry out the second analysis (the genome method):
library(magrittr) # provides the %>% pipe used below
dup <- radiator::tidy_genomic_data(
data = wombat_tidy_object,
strata = "wombat.strata.tsv",
vcf.metadata = FALSE
) %>%
radiator::detect_duplicate_genomes(
data = .,
distance.method = "euclidean",
genome = TRUE
)
# to view the results of the genome method
dup.data <- dup$pairwise.genome.similarity
# Based on the look of the distribution in both the manhattan and violin plots,
# I can filter the dataset to keep pairs with more than 98% identical genotypes,
# to highlight potential duplicates:
dup.filtered <- dplyr::filter(.data = dup.data, PROP_IDENTICAL > 0.98)
# Get the list of duplicates id
dup.list.names <- data.frame(INDIVIDUALS = unique(c(dup.filtered$ID1, dup.filtered$ID2)))
## End(Not run)