bc_extract | R Documentation |
bc_extract identifies the barcodes (and UMI) from the sequences using
regular expressions. The pattern and pattern_type arguments are required:
they provide the barcode (and UMI) pattern and its location within the
sequences.
bc_extract(
x,
pattern = "",
sample_name = NULL,
metadata = NULL,
maxLDist = 0,
pattern_type = c(barcode = 1),
costs = list(sub = 1, ins = 99, del = 99),
ordered = TRUE
)
## S4 method for signature 'data.frame'
bc_extract(
x,
pattern = "",
sample_name = NULL,
maxLDist = 0,
pattern_type = c(barcode = 1),
costs = list(sub = 1, ins = 99, del = 99),
ordered = TRUE
)
## S4 method for signature 'ShortReadQ'
bc_extract(
x,
pattern = "",
sample_name = NULL,
maxLDist = 0,
pattern_type = c(barcode = 1),
costs = list(sub = 1, ins = 99, del = 99),
ordered = TRUE
)
## S4 method for signature 'DNAStringSet'
bc_extract(
x,
pattern = "",
sample_name = NULL,
maxLDist = 0,
pattern_type = c(barcode = 1),
costs = list(sub = 1, ins = 99, del = 99),
ordered = TRUE
)
## S4 method for signature 'integer'
bc_extract(
x,
pattern = "",
sample_name = NULL,
maxLDist = 0,
pattern_type = c(barcode = 1),
costs = list(sub = 1, ins = 99, del = 99),
ordered = TRUE
)
## S4 method for signature 'character'
bc_extract(
x,
pattern = "",
sample_name = NULL,
metadata = NULL,
maxLDist = 0,
pattern_type = c(barcode = 1),
costs = list(sub = 1, ins = 99, del = 99),
ordered = TRUE
)
## S4 method for signature 'list'
bc_extract(
x,
pattern = "",
sample_name = NULL,
metadata = NULL,
maxLDist = 0,
pattern_type = c(barcode = 1),
costs = list(sub = 1, ins = 99, del = 99),
ordered = TRUE
)
x |
A single FASTQ file, ShortReadQ, DNAStringSet, data.frame, or named integer vector, or a list of any of these. |
pattern |
A string, or a string vector with one element per file, specifying the regular expression with capture groups. The capture groups match the barcode (and UMI). |
sample_name |
A string vector, applicable when |
metadata |
A |
maxLDist |
An integer. The maximum mismatch threshold for barcode
matching. When maxLDist is 0, the |
pattern_type |
A vector. It defines the barcode (and UMI) capture group. See Details. |
costs |
A named list, applicable when maxLDist > 0, specifying the
weight of each mismatch event while extracting the barcodes. The list
element names have to be |
ordered |
A logical value. If TRUE, the returned barcodes (UMI-barcode tags) are sorted by read count. |
The pattern argument is a regular expression in which the capture
operation () identifies the barcode or UMI. The pattern_type argument
annotates the capture groups, denoting which group is the UMI and which
is the barcode. In the example:
([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC
|---------| starts with 3 base pairs UMI.
           |----------| constant sequence in the backbone.
                       |-------| flexible barcode sequences.
                                |---------| 3' constant sequence.
In the UMI part [ACGT]{3}, [ACGT] means the base can be any one of
"A", "C", "G", or "T", and {3} means it repeats 3 times.
In the barcode pattern [ACGT]+, the + denotes one or more bases, each
being A, C, G, or T.
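The capture groups described above can be tried out in base R before running bc_extract. The following is a minimal sketch, not the package's internal implementation: it uses regexec() and regmatches() from base R on one read taken from the examples, and the way it applies pattern_type (a named vector of capture-group indices) is an assumption about how the labels map to groups.

```r
# Base-R illustration only; bc_extract() may use a different matching routine.
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
read <- "ACTTCGATCGATCGAAAAGATCGATCGATC"

# regexec() returns the full match followed by one entry per capture group.
m <- regmatches(read, regexec(pattern, read))[[1]]

# Label the capture groups the way pattern_type does (assumed mapping):
# group 1 is the UMI, group 2 is the barcode.
pattern_type <- c(UMI = 1, barcode = 2)
captured <- setNames(m[-1][pattern_type], names(pattern_type))

umi <- captured[["UMI"]]
barcode <- captured[["barcode"]]
print(captured)
```

Here the first group captures the 3-base UMI and the second captures the variable-length barcode between the two constant sequences.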
This function returns a BarcodeObj object if the input is a list or a
vector of FASTQ files; otherwise it returns a data.frame. In the latter
case the data.frame has the columns:
umi_seq (optional): UMI sequence, present when a UMI is specified in the
pattern and pattern_type arguments.
barcode_seq: barcode sequence.
count: number of reads.
fq_file <- system.file("extdata", "simple.fq", package="CellBarcode")
library(CellBarcode)
library(ShortRead) # provides readFastq(), sread() and tables()
# barcode from fastq file
bc_extract(fq_file, pattern = "AAAAA(.*)CCCCC")
# barcode from ShortReadQ object
sr <- readFastq(fq_file) # ShortReadQ
bc_extract(sr, pattern = "AAAAA(.*)CCCCC")
# barcode from DNAStringSet object
ds <- sread(sr) # DNAStringSet
bc_extract(ds, pattern = "AAAAA(.*)CCCCC")
# barcode from integer vector
iv <- tables(ds, n = Inf)$top # integer vector
bc_extract(iv, pattern = "AAAAA(.*)CCCCC")
# barcode from data.frame
df <- data.frame(seq = names(iv), freq = as.integer(iv)) # data.frame
bc_extract(df, pattern = "AAAAA(.*)CCCCC")
# barcode from list of DNAStringSet
l <- list(sample1 = ds, sample2 = ds) # list
bc_extract(l, pattern = "AAAAA(.*)CCCCC")
# Extract UMI and barcode
d1 <- data.frame(
    seq = c(
        "ACTTCGATCGATCGAAAAGATCGATCGATC",
        "AATTCGATCGATCGAAGAGATCGATCGATC",
        "CCTTCGATCGATCGAAGAAGATCGATCGATC",
        "TTTTCGATCGATCGAAAAGATCGATCGATC",
        "AAATCGATCGATCGAAGAGATCGATCGATC",
        "CCCTCGATCGATCGAAGAAGATCGATCGATC",
        "GGGTCGATCGATCGAAAAGATCGATCGATC",
        "GGATCGATCGATCGAAGAGATCGATCGATC",
        "ACTTCGATCGATCGAACAAGATCGATCGATC",
        "GGTTCGATCGATCGACGAGATCGATCGATC",
        "GCGTCCATCGATCGAAGAAGATCGATCGATC"
    ),
    freq = c(
        30, 60, 9, 10, 14, 5, 10, 30, 6, 4, 6
    )
)
# barcode backbone with UMI and barcode
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
bc_extract(
    list(test = d1),
    pattern,
    sample_name = c("test"),
    pattern_type = c(UMI = 1, barcode = 2))
###