classify_bin_infovar_balance: Finds the outliers in a bin by balancing the amount of...


View source: R/classify_mislabelling.R

Description

This approach removes outliers from a bin iteratively. At each step, the sequence with the highest average distance to all other sequences in the bin is selected as the candidate for removal. The ratio of the resulting reduction in the average distance between the sequences to the reduction in the information available is then computed, where the number of DNA sequences in the bin is taken as the measure of the amount of information. The process continues until either all data has been removed or removing the next sequence(s) would push the ratio below the threshold. In the latter case the process stops before that removal, so the ratio never goes under the threshold.
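The loop described above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: `infovar_balance_sketch` is a hypothetical name, it takes a precomputed symmetric distance matrix rather than a DNAStringSet, it removes one sequence per step, it averages over the full matrix including the zero diagonal, and it stops once a single sequence remains because no pairwise distance is left. All of these are assumptions.

```r
# Sketch only: hypothetical function, not part of MotifBinner.
# dmat: symmetric numeric distance matrix; threshold: minimum acceptable ratio.
infovar_balance_sketch <- function(dmat, threshold) {
  n_orig <- nrow(dmat)
  orig_mean <- mean(dmat)          # assumption: mean over the whole matrix
  keep <- seq_len(n_orig)
  removed <- integer(0)
  # If all sequences are identical there is no distance to reduce.
  if (orig_mean == 0) return(list(kept = keep, removed = removed))
  while (length(keep) > 1) {
    prev <- dmat[keep, keep, drop = FALSE]
    # Candidate outlier: highest average distance to the other sequences.
    worst <- which.max(rowMeans(prev))
    cand <- prev[-worst, -worst, drop = FALSE]
    # Both reductions are expressed relative to the original input.
    dist_reduction <- (mean(prev) - mean(cand)) / orig_mean
    info_reduction <- 1 / n_orig
    # Stop before the removal that would take the ratio under the threshold.
    if (dist_reduction / info_reduction < threshold) break
    removed <- c(removed, keep[worst])
    keep <- keep[-worst]
  }
  list(kept = keep, removed = removed)
}

# Toy distances: sequences 1, 2 and 4 identical, 3 one base away, 5 far away.
dmat <- matrix(c(0, 0, 1, 0, 5,
                 0, 0, 1, 0, 5,
                 1, 1, 0, 1, 6,
                 0, 0, 1, 0, 5,
                 5, 5, 6, 5, 0), nrow = 5, byrow = TRUE)
res <- infovar_balance_sketch(dmat, threshold = 1)
```

With this toy matrix and a threshold of 1, only the fifth sequence is removed: dropping the third sequence afterwards would give a ratio just under 1, so the loop stops before that step.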

Usage

classify_bin_infovar_balance(bin, threshold, start_threshold = 0,
  max_sequences = 100)

Arguments

bin

The input bin as a single DNAStringSet.

threshold

Outlier sequences are removed from the bin until the distance / information ratio drops below this threshold.

start_threshold

Only start the classification if the maximum distance in the sample is greater than this number of bases, normalized to the length of the sequences. A value of 0.01 means that the procedure will only start if the maximum distance between two sequences is greater than 1 when the sequences are exactly 100 bases long. TODO: further normalize for the number of sequences.
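The arithmetic behind start_threshold can be illustrated with a small check. The helper `should_start` is hypothetical and only mirrors the description above; how the package actually measures the maximum distance and the sequence length is not shown here.

```r
# Hypothetical helper: TRUE if the length-normalized maximum distance
# exceeds start_threshold (strictly greater, as described above).
should_start <- function(max_dist, seq_length, start_threshold = 0) {
  (max_dist / seq_length) > start_threshold
}

# Sequences of exactly 100 bases with start_threshold = 0.01:
one_mismatch <- should_start(max_dist = 1, seq_length = 100, start_threshold = 0.01)
two_mismatch <- should_start(max_dist = 2, seq_length = 100, start_threshold = 0.01)
```

Here a maximum distance of 1 does not trigger the procedure, while a maximum distance of 2 does.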

max_sequences

The maximum number of sequences to use for the computation of the distance matrix. If more sequences than this are present, then this many sequences are selected at random and the classification algorithm is run on them.
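A minimal sketch of the subsampling step, assuming sampling without replacement and using a plain character vector in place of a DNAStringSet. The name `subsample_bin` is hypothetical, and the package may handle the sequences that were not selected differently.

```r
# Hypothetical helper: cap the number of sequences used for the distance matrix.
subsample_bin <- function(seqs, max_sequences = 100) {
  if (length(seqs) <= max_sequences) {
    return(seqs)                       # small bins are used as they are
  }
  # Draw max_sequences distinct positions at random.
  seqs[sample(length(seqs), max_sequences)]
}

big_bin <- rep("ACGTACGT", 250)
small_bin <- rep("ACGTACGT", 40)
n_big <- length(subsample_bin(big_bin, max_sequences = 100))
n_small <- length(subsample_bin(small_bin, max_sequences = 100))
```

Capping the input keeps the distance matrix, which grows quadratically with the number of sequences, to at most max_sequences by max_sequences entries.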

Details

Both the reduction in the average distance to all other sequences and the reduction in the amount of information available are computed as a percentage relative to the original input data. The formula for the average reduction in distances is (mean(new_dmat) - mean(prev_dmat))/mean(orig_dmat) where new_dmat is the new distance matrix constructed from removing the next sequence(s), prev_dmat is the distance matrix constructed in the previous step and orig_dmat is the distance matrix constructed on the original bin passed to the function. The percentage reduction in information available is the number of sequences that will be removed in this step over the number of sequences in the input data set.
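As a worked example of these two quantities, the sketch below evaluates them for the first step on five toy sequences. It assumes a Hamming distance on equal-length strings and a mean taken over the full matrix including the diagonal; `hamming_dmat` is a hypothetical helper, and the distance measure used by the package may differ.

```r
# Hypothetical helper: pairwise Hamming distances for equal-length strings.
hamming_dmat <- function(seqs) {
  chars <- strsplit(seqs, "")
  n <- length(seqs)
  dmat <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      dmat[i, j] <- sum(chars[[i]] != chars[[j]])
    }
  }
  dmat
}

seqs <- c("AAAAAAAAAA", "AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAAAA", "CCCCCAAAAA")
orig_dmat <- hamming_dmat(seqs)
prev_dmat <- orig_dmat                      # first step: previous == original

# Candidate for removal: highest average distance to the other sequences.
worst <- which.max(rowMeans(prev_dmat))
new_dmat <- prev_dmat[-worst, -worst, drop = FALSE]

# Formula from the text; the value is negative when the mean distance shrinks.
dist_change <- (mean(new_dmat) - mean(prev_dmat)) / mean(orig_dmat)

# One sequence removed out of the sequences in the input.
info_change <- 1 / length(seqs)
```

In this example the fifth sequence is the candidate, the mean distance falls from 1.92 to 0.375, and one fifth of the information is given up.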


philliplab/MotifBinner documentation built on Sept. 2, 2020, 11:41 a.m.