Description Usage Arguments Details Value Note See Also Examples
The function demultiplex
takes a set of reads that start with a barcode
and assigns those reads to a reference barcode while possibly correcting
errors.
The correct metric should be used, with metric = "hamming"
to correct
substitution errors and metric = "seqlev"
to correct insertion,
deletion, and substitution errors.
1 | demultiplex(reads, barcodes, metric=c("hamming","seqlev","levenshtein","phaseshift"), cost_sub = 1, cost_indel = 1)
|
reads |
The reads coming from your sequencing machines that start
with a barcode. For |
barcodes |
The reference barcodes that you used during library preparation and that you want to correct in your reads. |
metric |
The distance metric to be used to assign reads to reference barcodes. |
cost_sub |
The cost weight given to a substitution. |
cost_indel |
The cost weight given to insertions and deletions. |
Reads are matched to their correct reference barcodes by calculating the distances between each read and each reference barcode. The reference barcode with the smallest distance to the read is assumed to be the correct original barcode of that read.
For metric = "hamming"
, only the first n
(with n
being the
length of the reference barcodes) bases of the read are used for these
comparisons and no bases afterwards. Reads with fewer than n
bases
cannot be matched.
For metric = "seqlev"
, the whole read is compared with the reference
barcodes. The Sequence Levenshtein distance was especially developed for
barcodes in DNA context and can cope with ambiguities that stem from changes
to the length of the barcode.
The Levenshtein distance (metric = "levenshtein"
) is largely undefined
in DNA context and should be avoided. The Levenshtein distance only works if
the length both of the reference barcode and the barcode in the read is known.
With possible insertions and deletions, this becomes an unknown. For this
reason, we always calculate the Levenshtein distance between the whole read and
the whole reference barcode without coping with potential side effects.
A vector of reference barcodes of the same length as the input reads. Each reference barcode is the corrected version of the input barcode.
Do not try to correct errors in barcodes that were not systematically constructed for such a correction. To create such a barcode set, have a look into function create.dnabarcodes
.
create.dnabarcodes
, analyse.barcodes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | # Define some barcodes and inserts
barcodes <- c("AGGT", "TTCC", "CTGA", "GCAA")
insert <- 'ACGCAGGTTGCATATTTTAGGAAGTGAGGAGGAGGCACGGGCTCGAGCTGCGGCTGGGTCTGGGGCGCGG'
# Choose and mutate a couple of thousand barcodes
used_barcodes <- sample(barcodes,10000,replace=TRUE)
mutated_barcodes <- unlist(lapply(strsplit(used_barcodes,""), function(x) { pos <- sample(1:length(x),1); x[pos] <- sample(c("C","G","A","T"),1); return(paste(x,collapse='')) } ))
show(setequal(mutated_barcodes, used_barcodes)) # FALSE
# Construct reads (= barcodes + insert)
reads <- paste(mutated_barcodes, insert, sep='')
# Demultiplex
demultiplexed <- demultiplex(reads,barcodes,metric="hamming")
# Show correctness
show(setequal(demultiplexed, used_barcodes)) # TRUE
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.