refdb_check_seq_homogeneity: Check for genetic homogeneity of taxa

View source: R/refdb_checks.R

refdb_check_seq_homogeneity    R Documentation

Check for genetic homogeneity of taxa


This function assesses the genetic similarity among sequences within each taxon. It takes user-defined thresholds (one per taxonomic level) and warns about sequences that differ markedly from the others, based on their median distance. Sequences in the reference database must be aligned.


refdb_check_seq_homogeneity(x, levels, min_n_seq = 3)



x: a reference database (sequences must be aligned).


levels: a named vector of genetic similarity thresholds. Names must correspond to taxonomic levels (taxonomic fields) and values must lie in the interval [0, 1]. For example, to assess homogeneity at 5 percent (within species) and 10 percent (within genus): levels = c(species = 0.05, genus = 0.1)


min_n_seq: the minimum number of sequences for a taxon to be tested.
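
As a rough illustration of the constraints on the levels argument (a named numeric vector with values in [0, 1]), the sketch below validates such a vector before it is passed on. The helper check_levels is hypothetical and is not part of refdb; the function may perform its own checks differently.

```r
# Hypothetical helper (not part of refdb): validate a `levels` vector
# against the constraints stated in the argument description.
check_levels <- function(levels) {
  stopifnot(
    is.numeric(levels),             # thresholds are numeric
    !is.null(names(levels)),        # the vector must be named
    all(nzchar(names(levels))),     # every element needs a non-empty name
    all(levels >= 0 & levels <= 1)  # values must lie in [0, 1]
  )
  invisible(TRUE)
}

# The example vector from the argument description passes the checks
levels <- c(species = 0.05, genus = 0.1)
check_levels(levels)
```

Note that this sketch cannot verify that the names match taxonomic fields actually defined in the reference database; that depends on the data.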


For every tested taxonomic level, the algorithm checks all sequences in every taxon for which the total number of sequences is > min_n_seq. In each taxon, the pairwise distance matrix among all sequences belonging to that taxon is computed. A sequence is tagged as suspicious and returned by the function if its median genetic distance from the other sequences is higher than the threshold set by the user (levels argument).
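
The median-distance rule described above can be sketched in base R for a single taxon. This is a minimal illustration under stated assumptions, not the refdb implementation: the helpers p_distance and flag_suspicious are hypothetical, the distance used here is the plain proportion of differing sites (the distance measure used by the package is not stated in this page), and the sequences are made-up toy data.

```r
# Hypothetical helper: proportion of differing sites between two
# aligned sequences of equal length (an assumed distance measure).
p_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  mean(a != b)
}

# Hypothetical helper: for the sequences of one taxon, compute the
# pairwise distance matrix, then flag each sequence whose median
# distance to the other sequences is higher than `threshold`.
flag_suspicious <- function(seqs, threshold) {
  n <- length(seqs)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- p_distance(seqs[i], seqs[j])
    }
  }
  diag(d) <- NA  # exclude the distance of a sequence to itself
  med <- apply(d, 1, median, na.rm = TRUE)
  data.frame(sequence = seqs,
             median_distance = med,
             suspicious = med > threshold)
}

# Toy aligned sequences (illustrative data only)
seqs <- c("ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTACGTAC")
res <- flag_suspicious(seqs, threshold = 0.2)
print(res)
```

With these toy sequences, only the last one has a median distance to the others (0.3) above the 0.2 threshold, so it is the only one flagged.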


A data frame reporting suspicious sequences whose median distance to other sequences of the same taxon is greater than the specified threshold. The first column, "level_threshold_homogeneity", indicates the lowest taxonomic level for which the threshold has been exceeded, and the second column, "value_threshold_homogeneity", gives the computed median distance.


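# Load the example dataset shipped with refdb
lib <- read.csv(system.file("extdata", "homogeneity.csv", package = "refdb"))
# Associate the BOLD column names with refdb fields
lib <- refdb_set_fields_BOLD(lib)
# Check homogeneity at 5 percent (within species) and 10 percent (within genus)
refdb_check_seq_homogeneity(lib, levels = c(species = 0.05, genus = 0.1))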

refdb documentation built on Sept. 22, 2022, 5:07 p.m.