This function assesses the genetic similarity among sequences within each taxon. It takes user-defined thresholds (one threshold per taxonomic level) and warns about sequences that differ markedly from the others, based on their median distance to them. Sequences in the reference database must be aligned.
refdb_check_seq_homogeneity(x, levels, min_n_seq = 3)
x: a reference database (sequences must be aligned).
levels: a named vector of genetic similarity thresholds.
Names must correspond to taxonomic levels (taxonomic fields)
and values must lie in the interval [0, 1].
For example, to assess homogeneity at 5 percent (within species) and
10 percent (within genus): c(species = 0.05, genus = 0.1)
min_n_seq: the minimum number of sequences for a taxon to be tested.
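As a small illustration of the constraints on the thresholds vector, the base R sketch below builds one and checks that it is fully named and that every value lies in [0, 1]. The helper check_levels is hypothetical and not part of refdb; it only restates the requirements described above.

```r
# Hypothetical helper (not part of refdb): checks that a thresholds vector
# is numeric, fully named, and that every value lies in [0, 1].
check_levels <- function(levels) {
  stopifnot(is.numeric(levels))
  nms <- names(levels)
  if (is.null(nms) || any(is.na(nms)) || any(nms == "")) {
    stop("every threshold must be named after a taxonomic level")
  }
  if (any(is.na(levels)) || any(levels < 0 | levels > 1)) {
    stop("thresholds must lie in the interval [0, 1]")
  }
  invisible(TRUE)
}

# 5 percent within species, 10 percent within genus.
levels <- c(species = 0.05, genus = 0.1)
check_levels(levels)
```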
For every tested taxonomic level, the algorithm
checks all sequences in every taxon
(for which the total number of sequences meets the minimum set by min_n_seq).
In each taxon, the pairwise distance matrix among all the sequences
belonging to this taxon is computed. A sequence is tagged as suspicious
and returned by the function
if its median genetic distance from the other sequences is higher than
the threshold set by the user (the levels argument).
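The per-taxon check described above can be sketched in base R as follows. This is an illustration only, not the package's implementation: the sequences are made-up toy data, and the distance used here (the proportion of differing sites, or p-distance) is an assumption, since the text does not say which genetic distance refdb computes.

```r
# Toy aligned sequences for a single taxon (made-up data, equal length).
seqs <- c(
  s1 = "ACGTACGTACGTACGTACGT",
  s2 = "ACGTACGTACGTACGTACGT",
  s3 = "ACGTACGTACGTACGTACGA",
  s4 = "TTGAACGGCAGTTCGAACTT"
)

# Assumption: distance = proportion of differing sites (p-distance).
p_dist <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  mean(a != b)
}

# Pairwise distance matrix among all sequences of the taxon.
n <- length(seqs)
d <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- p_dist(seqs[[i]], seqs[[j]])
  }
}

# Median distance from each sequence to the others (diagonal excluded).
med <- vapply(seq_len(n), function(i) median(d[i, -i]), numeric(1))
names(med) <- names(seqs)

# Flag sequences whose median distance exceeds the threshold.
threshold <- 0.1
suspicious <- names(med)[med > threshold]
print(suspicious)
```

With this toy data only s4 is flagged: its median distance to the other sequences is 0.45, while the other three stay at 0.05.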
A data frame reporting suspicious sequences whose median distance to other sequences of the same taxon is greater than the specified threshold. The first column, "level_threshold_homogeneity", indicates the lowest taxonomic level for which the threshold has been exceeded, and the second column, "value_threshold_homogeneity", gives the computed median distance.
lib <- read.csv(system.file("extdata", "homogeneity.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_check_seq_homogeneity(lib, levels = c(species = 0.05, genus = 0.1))