bc_cure_cluster: Clean barcodes by editing distance
In wenjie1991/CellBarocde: Cellular DNA Barcode Analysis toolkit

bc_cure_cluster

R Documentation

Clean barcodes by editing distance

Description

bc_cure_cluster performs clustering of barcodes by editing distance, and remove the minority barcodes with a similar sequence. This function is only applicable for the BarcodeObj object with a cleanBc slot. The barcodes with a smaller reads count will be removed.

Usage

bc_cure_cluster(
  barcodeObj,
  dist_threshold = 1,
  depth_fold_threshold = 1,
  dist_method = "hamm",
  cluster_method = "greedy",
  count_threshold = 1e+09,
  dist_costs = list(replace = 1, insert = 1, delete = 1)
)

## S4 method for signature 'BarcodeObj'
bc_cure_cluster(
  barcodeObj,
  dist_threshold = 1,
  depth_fold_threshold = 1,
  dist_method = "hamm",
  cluster_method = "greedy",
  count_threshold = 1e+07,
  dist_costs = list(replace = 1, insert = 1, delete = 1)
)

Arguments

`barcodeObj`	A BarcodeObj object.
`dist_threshold`	A single integer, or vector of integers with the length of sample number, specifying the editing distance threshold for defining two similar barcode sequences. If the input is a vector, each value in the vector relates to one sample according to its order in `BarcodeObj` object. The sequences with editing distance equal to or less than the threshold will be considered similar barcodes.
`depth_fold_threshold`	A single numeric or vector of numeric with the length of sample number, specifying the depth fold change threshold of removing the similar minority barcode. The majority of barcodes should have at least `depth_fold_threshold` times of reads of the similar minotiry one, to remove the minority similar barcode. (TODO: more precise description)
`dist_method`	A character string, specifying the editing distance used for evaluating barcode similarity. It can be "hamm" for Hamming distance or "leven" for Levenshtein distance.
`cluster_method`	A character string specifying the algorithm used to perform the clustering of barcodes. Currently only "greedy" is available, in this case, The most and the least abundant barcode will be used for comparing, the least abundant barcode is preferentially removed.
`count_threshold`	An integer, read depth threshold to consider a barcode as a true barcode. If a barcode with a count higher than this threshold it will not be removed, even if the barcode is similar to a more abundant one. Default is 1e9.
`dist_costs`	A list, the cost of the events of distance algorithm, applicable when Levenshtein distance is applied. The names of vector have to be `insert`, `delete` and `replace`, specifying the weight of insertion, deletion, and replacement events respectively. The default cost for each event is 1.

Value

A BarcodeObj object with cleanBc slot updated.

Examples

data(bc_obj)

d1 <- data.frame(
    seq = c(
        "ACTTCGATCGATCGAAAAGATCGATCGATC",
        "AATTCGATCGATCGAAGAGATCGATCGATC",
        "CCTTCGATCGATCGAAGAAGATCGATCGATC",
        "TTTTCGATCGATCGAAAAGATCGATCGATC",
        "AAATCGATCGATCGAAGAGATCGATCGATC",
        "CCCTCGATCGATCGAAGAAGATCGATCGATC",
        "GGGTCGATCGATCGAAAAGATCGATCGATC",
        "GGATCGATCGATCGAAGAGATCGATCGATC",
        "ACTTCGATCGATCGAACAAGATCGATCGATC",
        "GGTTCGATCGATCGACGAGATCGATCGATC",
        "GCGTCCATCGATCGAAGAAGATCGATCGATC"
        ),
    freq = c(
        30, 60, 9, 10, 14, 5, 10, 30, 6, 4 , 6
        )
    )

pattern <- "([ACTG]{3})TCGATCGATCGA([ACTG]+)ATCGATCGATC"
bc_obj <- bc_extract(list(test = d1), pattern, sample_name=c("test"), 
    pattern_type=c(UMI=1, barcode=2))

# Remove barcodes with depth < 5
(bc_cured <- bc_cure_depth(bc_obj, depth=5))

# Do the clustering, remove the less abundant barcodes
# one by hamming distance <= 1 
bc_cure_cluster(bc_cured, dist_threshold = 1)

# Levenshtein distance <= 1
bc_cure_cluster(bc_cured, dist_threshold = 2, dist_method = "leven",
    dist_costs = list("insert" = 2, "replace" = 1, "delete" = 2))

###

wenjie1991/CellBarocde documentation built on June 1, 2025, 11:17 p.m.