bc_extract: Extract barcode from sequences

bc_extractR Documentation

Extract barcode from sequences

Description

bc_extract identifies the barcodes (and UMI) from the sequences using regular expressions. pattern and pattern_type arguments are necessary, which provides the barcode (and UMI) pattern and their location within the sequences.

Usage

bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'data.frame'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'ShortReadQ'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'DNAStringSet'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'integer'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'character'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

## S4 method for signature 'list'
bc_extract(
  x,
  pattern = "",
  sample_name = NULL,
  metadata = NULL,
  maxLDist = 0,
  pattern_type = c(barcode = 1),
  costs = list(sub = 1, ins = 99, del = 99),
  ordered = TRUE
)

Arguments

x

A single or a list of fastq files, ShortReadQ, DNAStringSet, data.frame, or named integer.

pattern

A string or a string vector with the same number of files, specifying the regular expression with capture. It matches the barcode (and UMI) with capture pattern.

sample_name

A string vector, applicable when x is a list or fastq file vector. This argument specifies the sample names. If not provided, the function will look for sample names in the rownames of metadata, the fastqfile name or the list names.

metadata

A data.frame with sample names as the row names, with each metadata record by column, specifying the sample characteristics.

maxLDist

An integer. The minimum mismatch threshold for barcode matching, when maxLDist is 0, the str_match is invoked for barcode matching which is faster, otherwise aregexec is invoked and the costs parameters can be used to specify the weight of the distance calculation.

pattern_type

A vector. It defines the barcode (and UMI) capture group. See Details.

costs

A named list, applicable when maxLDist > 0, specifying the weight of each mismatch event while extracting the barcodes. The list element name have to be sub (substitution), ins (insertion) and del (deletion). The default value is list(sub = 1, ins = 99, del = 99). See aregexec for more detailed information.

ordered

A logical value. If the value is true, the return barcodes (UMI-barcode tags) are sorted by the read counts.

Details

The pattern argument is a regular expression, the capture operation () identifying the barcode or UMI. pattern_type argument annotates capture, denoting the UMI or the barcode captured pattern. In the example:

([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC
|---------| starts with 3 base pairs UMI.
           |----------| constant sequence in the backbone.
                       |-------| flexible barcode sequences.
                                |---------| 3' constant sequence.

In UMI part [ACGT]{3}, [ACGT] means it can be one of the "A", "C", "G" and "T", and {3} means it repeats 3 times. In the barcode pattern [ACGT]+, the + denotes that there is at least one of the A or C or G or T.

Value

This function returns a BarcodeObj object if the input is a list or a vector of Fastq files, otherwise it returns a data.frame. In the later case the data.frame has columns:

  1. umi_seq (optional): UMI sequence, applicable when there is UMI in 'pattern' and 'pattern_type' argument.

  2. barcode_seq: barcode sequence.

  3. count: reads number.

Examples

fq_file <- system.file("extdata", "simple.fq", package="CellBarcode")

library(ShortRead)

# barcode from fastq file
bc_extract(fq_file, pattern = "AAAAA(.*)CCCCC")

# barcode from ShortReadQ object
sr <- readFastq(fq_file)  # ShortReadQ
bc_extract(sr, pattern = "AAAAA(.*)CCCCC")

# barcode from DNAStringSet object
ds <- sread(sr)  # DNAStringSet
bc_extract(ds, pattern = "AAAAA(.*)CCCCC")

# barcode from integer vector
iv <- tables(ds, n = Inf)$top # integer vector
bc_extract(iv, pattern = "AAAAA(.*)CCCCC")

# barcode from data.frame 
df <- data.frame(seq = names(iv), freq = as.integer(iv)) # data.frame
bc_extract(df, pattern = "AAAAA(.*)CCCCC")

# barcode from list of DNAStringSet
l <- list(sample1 = ds, sample2 = ds) # list
bc_extract(l, pattern = "AAAAA(.*)CCCCC")

# Extract UMI and barcode
d1 <- data.frame(
    seq = c(
        "ACTTCGATCGATCGAAAAGATCGATCGATC",
        "AATTCGATCGATCGAAGAGATCGATCGATC",
        "CCTTCGATCGATCGAAGAAGATCGATCGATC",
        "TTTTCGATCGATCGAAAAGATCGATCGATC",
        "AAATCGATCGATCGAAGAGATCGATCGATC",
        "CCCTCGATCGATCGAAGAAGATCGATCGATC",
        "GGGTCGATCGATCGAAAAGATCGATCGATC",
        "GGATCGATCGATCGAAGAGATCGATCGATC",
        "ACTTCGATCGATCGAACAAGATCGATCGATC",
        "GGTTCGATCGATCGACGAGATCGATCGATC",
        "GCGTCCATCGATCGAAGAAGATCGATCGATC"
        ),
    freq = c(
        30, 60, 9, 10, 14, 5, 10, 30, 6, 4 , 6
    )
  ) 
# barcode backbone with UMI and barcode
pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
bc_extract(
    list(test = d1), 
    pattern, 
    sample_name=c("test"), 
    pattern_type=c(UMI=1, barcode=2))

###

wenjie1991/CellBarcode documentation built on April 17, 2024, 4:40 a.m.