classify_absolute: Removes the most outlying sequences in a bin until the...
In philliplab/MotifBinner: Bin reads by Motif

Thresholds are expressed as the probability that any given letter is an error. The thresholds are therefore doubled before they are compared with the distances between the sequences.

classify_absolute(bin, threshold = 0.01, start_threshold = 0.02, max_sequences = 100, max_iterations = 1000)

`bin`	The input bin as a single DNAStringSet.
`threshold`	Outlier sequences are removed from the bin until the maximum distance between any two sequences drops below this threshold.
`start_threshold`	Only start the classification if the maximum distance between any two sequences in the bin is greater than this.
`max_sequences`	The maximum number of sequences to use for the computation of the distance matrix. If more sequences than this are present, then this many sequences are randomly selected and the classification algorithm is run on them.
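The removal loop described by these arguments can be sketched in base R. This is an illustration only, not the package's implementation: `remove_outliers_sketch` is a hypothetical helper, it takes a precomputed distance matrix rather than a DNAStringSet, it skips the `max_sequences` subsampling, and it assumes that the "most outlying" sequence is the one with the largest summed distance to the others.

```r
# Hypothetical helper, not part of MotifBinner: iteratively drop the most
# outlying row/column of a pairwise distance matrix `d`.
remove_outliers_sketch <- function(d, threshold = 0.01, start_threshold = 0.02,
                                   max_iterations = 1000) {
  keep <- seq_len(nrow(d))
  # Thresholds are per-letter error probabilities, so they are doubled
  # before being compared with pairwise distances.
  if (max(d) <= 2 * start_threshold) {
    return(keep)
  }
  iterations <- 0
  while (length(keep) > 2 &&
         max(d[keep, keep]) >= 2 * threshold &&
         iterations < max_iterations) {
    # Assumption: "most outlying" means largest summed distance to the rest.
    totals <- rowSums(d[keep, keep])
    keep <- keep[-which.max(totals)]
    iterations <- iterations + 1
  }
  keep
}

# Toy distances (proportion of differing letters) for four sequences;
# sequence 4 is far from the other three.
d <- matrix(c(0.00, 0.01, 0.01, 0.20,
              0.01, 0.00, 0.01, 0.20,
              0.01, 0.01, 0.00, 0.20,
              0.20, 0.20, 0.20, 0.00), nrow = 4, byrow = TRUE)
kept <- remove_outliers_sketch(d)
```

With the default thresholds, the maximum distance of 0.20 exceeds the doubled `start_threshold` (0.04), so the loop runs; it stops once the maximum distance among the retained sequences is below the doubled `threshold` (0.02).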

The rationale for doubling the threshold is that a sequence has a read error if there is an error in any one of its bases, while two sequences differ if there is a read error in any of the bases of either of the two sequences under consideration. It is more subtle than this; see the benchmarking document for more details.
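A small simulation can make the doubling argument concrete. It is a sketch under stated assumptions, not code from the package: two reads are derived from the same random template, each letter is independently replaced with probability `p`, and replacements are uniform over the three other letters. The helper `add_errors` and all variable names are hypothetical.

```r
set.seed(1)
p <- 0.01        # per-letter error probability
n <- 200000      # number of simulated positions
letters4 <- c("A", "C", "G", "T")
template <- sample(letters4, n, replace = TRUE)

# Hypothetical helper: introduce substitution errors with probability p.
add_errors <- function(x, p) {
  hit <- runif(length(x)) < p
  idx <- match(x, letters4)
  # Shifting by 1, 2 or 3 (mod 4) always yields a different letter.
  shift <- sample(1:3, length(x), replace = TRUE)
  idx[hit] <- (idx[hit] - 1 + shift[hit]) %% 4 + 1
  letters4[idx]
}

read1 <- add_errors(template, p)
read2 <- add_errors(template, p)

error_rate_read1 <- mean(read1 != template)  # close to p
distance <- mean(read1 != read2)             # close to 2 * p
```

Under these assumptions the per-read error rate stays near `p`, while the proportion of positions at which the two reads differ is near `2 * p`, which is why a per-letter threshold is doubled before it is compared with a pairwise distance.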

Note that as the thresholds get bigger, the behaviour will get strange. This is because the method assumes that the likelihood of the same mutation occurring in both sequences is very low. As the error rate gets much higher, that assumption becomes invalid and strange things happen. This will not be fixed, since this library is not designed to work in an environment where the error rates are high.
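To give a rough sense of how quickly the doubling approximation degrades, the expected per-position distance between two reads can be compared with the doubled error rate. The formula below is an assumption made for illustration (independent errors, uniform over the three other letters); it is not taken from the package or its benchmarking document, and `expected_distance` is a hypothetical name.

```r
# Assumption: errors are independent, uniform substitutions among the
# 3 other letters. Two reads differ at a position if exactly one is wrong,
# or if both are wrong but with different letters.
expected_distance <- function(p) 2 * p * (1 - p) + p^2 * (2 / 3)

p <- c(0.01, 0.05, 0.2, 0.4)
comparison <- data.frame(
  error_rate = p,
  doubled    = 2 * p,
  expected   = expected_distance(p),
  ratio      = expected_distance(p) / (2 * p)
)
```

Under this model the ratio equals `1 - 2 * p / 3`, so the doubled threshold is nearly exact at an error rate of 0.01 but overstates the expected distance by more than a quarter at 0.4.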

philliplab/MotifBinner documentation built on Sept. 2, 2020, 11:41 a.m.