classify_absolute: Removes the most outlying sequences in a bin until the...


View source: R/classify_mislabelling.R

Description

Thresholds are expressed as the probability that any given letter is an error. Because the distance between two sequences reflects errors in either sequence, the thresholds are doubled before they are applied to the distances between the sequences.

Usage

classify_absolute(bin, threshold = 0.01, start_threshold = 0.02,
  max_sequences = 100, max_iterations = 1000)

Arguments

bin

The input bin as a single DNAStringSet.

threshold

Outlier sequences are removed from the bin until the maximum distance between any two sequences drops below this threshold.

start_threshold

Only start the classification if the maximum distance between any two sequences in the bin is greater than this.

max_sequences

The maximum number of sequences to use for the computation of the distance matrix. If more sequences than this are present, this many sequences are selected at random and the classification algorithm is run on them.
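
The arguments above can be read together as one iterative procedure. The following is a minimal base-R sketch of that reading, not the MotifBinner implementation. It assumes aligned sequences given as plain character strings of equal length (rather than a DNAStringSet), a distance defined as the proportion of mismatching positions, and "most outlying" meaning the largest mean distance to the other retained sequences. The names `pairwise_distance`, `distance_matrix` and `sketch_classify_absolute`, and the returned list, are hypothetical.

```r
# Illustrative sketch only; the real classify_absolute may differ in every detail.

# Proportion of positions at which two equal-length sequences differ.
pairwise_distance <- function(a, b) {
  mean(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Symmetric matrix of pairwise distances for a character vector of sequences.
distance_matrix <- function(seqs) {
  n <- length(seqs)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- pairwise_distance(seqs[i], seqs[j])
    }
  }
  d
}

# Hypothetical helper mirroring the documented arguments.
sketch_classify_absolute <- function(seqs, threshold = 0.01,
                                     start_threshold = 0.02,
                                     max_sequences = 100,
                                     max_iterations = 1000) {
  # Subsample when the bin holds more than max_sequences sequences.
  if (length(seqs) > max_sequences) {
    seqs <- sample(seqs, max_sequences)
  }
  keep <- rep(TRUE, length(seqs))
  d <- distance_matrix(seqs)

  # Thresholds are per-letter error probabilities, so they are doubled
  # before being compared with pairwise distances.
  if (max(d) <= 2 * start_threshold) {
    return(list(kept = seqs, removed = character(0)))
  }

  iterations <- 0
  while (sum(keep) > 2 && iterations < max_iterations &&
         max(d[keep, keep]) >= 2 * threshold) {
    # Assumption: the most outlying sequence is the retained one with the
    # largest mean distance to the other retained sequences.
    mean_dist <- rowSums(d[, keep, drop = FALSE]) / (sum(keep) - 1)
    mean_dist[!keep] <- -Inf
    keep[which.max(mean_dist)] <- FALSE
    iterations <- iterations + 1
  }
  list(kept = seqs[keep], removed = seqs[!keep])
}

# Usage: five identical reads plus one read that differs at several positions.
base <- paste(rep("ACGT", 25), collapse = "")
outlier <- base
substr(outlier, 1, 10) <- "TTTTTTTTTT"
result <- sketch_classify_absolute(c(rep(base, 5), outlier))
length(result$kept)
length(result$removed)
```

The loop guard on `max_iterations` is a guess at what that argument is for, since it appears in the usage but is not described.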

Details

The rationale for doubling the threshold is that a sequence has a read error if there is an error in one of its bases, while the distance between two sequences is 1 if there is a read error in any of the bases of either of the two sequences under consideration. Obviously it's more subtle than this; see the benchmarking document for more details.
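
As a rough worked version of this argument, assume (this model is not stated in the package documentation) that each base is misread independently with probability $p$ and that any misread makes the two sequences differ at that position. The probability that a given position differs between two reads of the same template is then:

```latex
\[
  P(\text{position differs})
    = 1 - (1 - p)^2
    = 2p - p^2
    \approx 2p \qquad \text{for small } p .
\]
```

With the default `threshold = 0.01`, this gives $2p - p^2 = 0.0199$, which is close to the doubled value of $0.02$.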

Note that as the thresholds get bigger, the behaviour will get strange. This is because the doubling assumes that the likelihood of the same mutation occurring in both sequences is very low. As the error rate gets much higher, that assumption becomes invalid and strange things happen. This will not be fixed, since this library is not designed to work in an environment where the error rates are high.
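
To get a feel for how quickly the doubling breaks down, the sketch below compares the doubled threshold with a modelled mismatch probability. The model is an assumption and not taken from the package: errors are independent with per-base probability `p`, and an erroneous base is replaced uniformly by one of the three other bases, so two simultaneous errors at the same position agree with probability 1/3. The function names are hypothetical.

```r
# Doubled per-base error probability, as used for the thresholds.
approx_mismatch <- function(p) 2 * p

# Modelled probability that two reads of the same template differ at a
# position: exactly one read is wrong, or both are wrong in different ways.
exact_mismatch <- function(p) 2 * p * (1 - p) + (2 / 3) * p^2

p <- c(0.001, 0.01, 0.05, 0.2)
comparison <- data.frame(
  p = p,
  doubled = approx_mismatch(p),
  modelled = exact_mismatch(p)
)
# Relative overestimate made by simply doubling p.
comparison$relative_gap <-
  (comparison$doubled - comparison$modelled) / comparison$modelled
print(comparison)
```

Under these assumptions the gap is negligible at error rates near the default thresholds and grows steadily as `p` increases.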


philliplab/MotifBinner documentation built on Sept. 2, 2020, 11:41 a.m.