library(MotifBinner)
Given a bin of sequences that are theoretically from only a single virus molecule, remove all of those that are actually from a different molecule.
This is accomplished by looking at the distances between each of the sequences and removing the outliers.
The difficult parts are deciding how to compute the distances efficiently and setting the thresholds that start and stop the process.
This process is further complicated by several concerns: 1) bin sizes vary from 1 to several hundred; 2) we have to deal with indels; 3) the process must be completely automated; 4) the process must be very computationally efficient.
We need to test the system using bins with known answers.
As always, the design of the data structure for the test data is important. Use a structure like this:
list('test1' = list('in' = DNAStringSet(...), 'out' = DNAStringSet(...)), 'test2' = ...)
The data is available in the package:
test_dat <- get_mislabel_test_data()
Construct a 'bin' from this data by putting the 'src' and 'out' data together, then check that only the 'out' data is removed and all the 'src' data is kept.
Basic code to run the tests is available from the package:
score_classification()
A number of metrics must be considered when looking at the accuracy of classification.
Keep it basic.
Consider these standard classification metrics:
Sensitivity: number of true 'in' classifications / (total size of the true 'in' population)
Specificity: number of true 'out' classifications / (total size of the true 'out' population)
The goal is to maximize both simultaneously.
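As a concrete illustration, both rates can be computed from two logical vectors of true and predicted labels. This is a minimal base-R sketch; `classification_metrics` is a hypothetical helper name, not part of MotifBinner:

```r
# truth: TRUE if the read genuinely belongs to the bin (the 'in' population)
# kept:  TRUE if the classifier kept the read
# Hypothetical helper, for illustration only.
classification_metrics <- function(truth, kept) {
  c(sensitivity = sum(kept & truth) / sum(truth),     # true 'in' rate
    specificity = sum(!kept & !truth) / sum(!truth))  # true 'out' rate
}
```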
Now, in addition to these two, there are other metrics of interest that we can derive based on our knowledge of the system.
We know that the only source of errors should be the sequencing process. We have access to data about the error rates of the sequencing process. We can use this to make a statement like:
The sequencing process is accurate to such a degree that no two reads of the same molecule should differ by more than one base per 100 bases. Build a metric around this information.
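One way to turn that statement into a check (a sketch only, not the package's implementation): assuming the reads have already been aligned to a common length, flag any read whose mismatch rate against the bin's consensus exceeds 1 per 100 bases. `flag_excess_mismatches` is a hypothetical name, and indels are ignored here:

```r
# Flag reads whose per-base mismatch rate against the consensus exceeds
# the sequencing error bound (default: 1 mismatch per 100 bases).
# Assumes equal-length, pre-aligned character strings; indels are not
# handled in this sketch.
flag_excess_mismatches <- function(reads, consensus, max_rate = 0.01) {
  n_mismatch <- function(a, b)
    sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
  counts <- vapply(reads, n_mismatch, integer(1), b = consensus)
  unname(counts / nchar(consensus) > max_rate)
}
```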
Also track the time it took to classify the reads in the bin.
The Euclidean distance of (sensitivity, specificity) from (1, 1) will be reported as 'combo' for each test. If the average of this column over all test datasets is 0, then we have a perfect classifier. This is the metric of interest to be on the lookout for.
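That distance is straightforward to compute; `combo_metric` below is a hypothetical helper name used only to make the formula explicit:

```r
# 'combo': Euclidean distance of (sensitivity, specificity) from the
# perfect classifier at (1, 1); 0 means a perfect classifier.
combo_metric <- function(sensitivity, specificity) {
  sqrt((1 - sensitivity)^2 + (1 - specificity)^2)
}
```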
Design and implement a whole set of strategies and then benchmark them to find the best ones.
This strategy keeps all the data.
Use the random strategy but set the parameter 'n' to 0, so that 0 percent of the data will be randomly removed.
kable(score_all_classifications(test_dat, 'random', params = list(n=0)), digits = 2)
This strategy removes 40% of the sample at random
kable(score_all_classifications(test_dat, 'random', params = list(n=0.4)), digits = 2)
This strategy keeps removing the most outlying sequence as long as it leads to a percentage reduction in variance that is x times larger than the percentage of information that was discarded.
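The loop behind that description might look like the following sketch. This is an assumption about the algorithm, not MotifBinner's actual `infovar_balance` code: it operates on a precomputed pairwise distance matrix and treats the mean pairwise distance as the 'variance':

```r
# Sketch of the infovar balance idea (assumptions: 'variance' is the
# mean pairwise distance, the most outlying sequence is the row with
# the largest distance sum, and discarding one of k sequences removes
# 1/k of the information).
infovar_balance_sketch <- function(dmat, threshold = 1) {
  keep <- seq_len(nrow(dmat))
  repeat {
    if (length(keep) < 3) break
    v <- mean(dmat[keep, keep])
    worst <- keep[which.max(rowSums(dmat[keep, keep]))]
    cand <- setdiff(keep, worst)
    v_new <- mean(dmat[cand, cand])
    pct_var_drop  <- (v - v_new) / v
    pct_info_drop <- 1 / length(keep)
    if (v > 0 && pct_var_drop >= threshold * pct_info_drop) {
      keep <- cand   # dropping the outlier pays for itself; continue
    } else break     # not enough variance reduction; stop
  }
  keep
}
```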
kable(score_all_classifications(test_dat, 'infovar_balance', params = list(threshold = 1)), digits = 2)
kable(score_all_classifications(test_dat, 'infovar_balance', params = list(threshold = 2)), digits = 2)
kable(score_all_classifications(test_dat, 'infovar_balance', params = list(threshold = 3)), digits = 2)
kable(score_all_classifications(test_dat, 'most_frequent', params = list()), digits = 2)
To be devised