processingRawData: Data processing

View source: R/raw_data_processing.R

processingRawData R Documentation

Data processing

Description

Reads the corresponding FASTA/FASTQ file(s), extracts the defined barcode constructs, and counts them. Optionally, a Phred-score-based quality filter is applied and the results are saved to a csv file.
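The extract-and-count idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy reads, the short backbone, and the variable names (pattern, result, wobble) are hypothetical, and matching is exact (no mismatches allowed).

```r
# Hypothetical toy reads; in practice these would come from a FASTA/FASTQ file.
reads <- c("ACTAACGATTCTT", "ACTAACGATTCTT", "ACTGGCGACCCTT",
           "TTTTTTTTTTTTT", "ACTAACGATTCTT")

# Toy backbone (assumption): 'N' marks a variable ("wobble") position.
bc_backbone <- "ACTNNCGANNCTT"

# Turn the backbone into a regular expression: each N matches one nucleotide.
pattern <- gsub("N", "[ACGT]", bc_backbone, fixed = TRUE)

# Keep the first exact match per read; reads without a match are dropped.
matches <- regmatches(reads, regexpr(pattern, reads))

# Count identical barcode constructs and drop those below min_reads.
min_reads <- 2
counts <- table(matches)
kept <- counts[counts >= min_reads]
result <- data.frame(barcode = names(kept),
                     read_count = as.integer(kept),
                     stringsAsFactors = FALSE)

# Optionally strip the fixed backbone and keep only the wobble positions.
n_pos <- which(strsplit(bc_backbone, "")[[1]] == "N")
wobble <- vapply(strsplit(result$barcode, ""),
                 function(ch) paste(ch[n_pos], collapse = ""),
                 character(1))

print(result)
print(wobble)
```

The sketch mirrors the roles of the bc_backbone, min_reads, and wobble_extraction arguments only conceptually; the actual function additionally handles mismatches, quality filtering, and file input/output.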

Usage

processingRawData(
  file_name,
  source_dir,
  results_dir = NULL,
  mismatch = 0,
  indels = FALSE,
  label = "",
  bc_backbone,
  bc_backbone_label = NULL,
  min_score = 30,
  min_reads = 2,
  save_it = TRUE,
  seqLogo = FALSE,
  cpus = 1,
  strategy = "sequential",
  full_output = FALSE,
  wobble_extraction = TRUE,
  dist_measure = "hamming"
)

Arguments

file_name

a character string or a character vector, containing the file name(s).

source_dir

a character string which contains the path to the source files.

results_dir

a character string which contains the path to the results directory. If no value is assigned, source_dir will also be used as results_dir.

mismatch

a non-negative integer value (default 0); greater values indicate the number of mismatches allowed when identifying the barcode constructs.

indels

a logical value. If TRUE, the chosen number of mismatches will be interpreted as edit distance, allowing for insertions and deletions as well (currently under construction).

label

a character string which serves as a label for every kind of created output file.

bc_backbone

a character string describing the barcode design; variable positions have to be marked with the letter 'N'. If the sequenced reads should only be clustered, bc_backbone expects the string "none", and the mismatch parameter is then interpreted as the maximum dissimilarity at which two reads are clustered together.

bc_backbone_label

an optional character vector of barcode backbone names serving as additional identifiers within file names and BCdat labels. If not provided, plain numbers are used instead.

min_score

a positive integer value. All fastq sequences with an average score smaller than min_score will be excluded; if min_score = 0, no quality score filtering is applied.

min_reads

a positive integer value. All extracted barcode sequences with a read count smaller than min_reads will be excluded from the results.

save_it

a logical value. If TRUE, the raw data will be saved as a csv-file.

seqLogo

a logical value. If TRUE, the sequence logo of the entire NGS file will be generated and saved.

cpus

an integer value, indicating the number of available cpus.

strategy

Because the future package is used for parallelisation, a strategy has to be stated. The default is "sequential" for cpus = 1 and "multisession" for cpus > 1. For further information, see the R documentation of future::plan().

full_output

a logical value. If TRUE, additional output files will be generated.

wobble_extraction

a logical value. If TRUE, single reads will be stripped of the backbone and only the "wobble" positions will be left.

dist_measure

a character value. If bc_backbone = "none", single reads will be clustered based on a distance measure. Available distance methods are optimal string alignment ("osa"), Levenshtein ("lv"), Damerau-Levenshtein ("dl"), Hamming ("hamming"), longest common substring ("lcs"), q-gram ("qgram"), cosine ("cosine"), Jaccard ("jaccard"), Jaro-Winkler ("jw"), and distance based on soundex encoding ("soundex"). For more detailed information, see the stringdist function of the stringdist package.
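To illustrate how a distance measure and a maximum dissimilarity can group reads (the bc_backbone = "none" case with dist_measure = "hamming"), here is a minimal base-R sketch. It is an assumption-laden illustration, not the package's clustering routine: the hamming() helper and toy reads are hypothetical, and single-linkage hierarchical clustering is swapped in simply because it is available in base R.

```r
# Hypothetical helper: Hamming distance between two equal-length strings.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Toy reads (assumption); all have the same length, as Hamming requires.
reads <- c("AAAA", "AAAT", "GGGG", "GGGC", "CCCC")

# Pairwise distance matrix.
idx <- seq_along(reads)
d <- outer(idx, idx,
           Vectorize(function(i, j) hamming(reads[i], reads[j])))
dimnames(d) <- list(reads, reads)

# Group reads whose dissimilarity does not exceed max_dist
# (plays the role of the mismatch argument in this sketch).
max_dist <- 1
tree <- hclust(as.dist(d), method = "single")
clusters <- cutree(tree, h = max_dist)

print(d)
print(clusters)
```

With single linkage, two reads end up in the same cluster if they are connected by a chain of reads that each differ by at most max_dist; other linkage choices would group more conservatively.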

Value

a BCdat object which includes the read counts, the barcode sequences, the results directory, and the searched barcode backbone.

Examples

## Not run: 
bc_backbone <- "ACTNNCGANNCTTNNCGANNCTTNNGGANNCTANNACTNNCGANNCTTNNCGANNCTTNNGGANNCTANNACTNNCGANN"

source_dir <- system.file("extdata", package = "genBaRcode")

BC_dat <- processingRawData(file_name = "test_data.fastq.gz", source_dir = source_dir,
          results_dir = "/my/test/directory/", mismatch = 2, label = "test",
          bc_backbone = bc_backbone, min_score = 30, indels = FALSE, min_reads = 2,
          save_it = FALSE, seqLogo = FALSE)

## End(Not run)
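
The min_score filter described under Arguments can be illustrated without the package. The following base-R sketch computes the average Phred score of a quality string and keeps reads whose average is at least min_score. It assumes Phred+33 encoding, and the mean_phred() helper and the toy quality strings are hypothetical; the package's own filtering may differ in detail.

```r
# Hypothetical helper: mean Phred score of one FASTQ quality string,
# assuming Phred+33 encoding (ASCII code minus 33).
mean_phred <- function(qual, offset = 33L) {
  mean(utf8ToInt(qual) - offset)
}

# Toy quality strings (assumption): 'I' encodes Q40, '#' encodes Q2.
quals <- c(good = "IIIIIIII", poor = "########", mixed = "IIII####")

scores <- vapply(quals, mean_phred, numeric(1))

# Reads with an average score smaller than min_score are excluded.
min_score <- 30
keep <- scores >= min_score

print(scores)
print(names(scores)[keep])
```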

genBaRcode documentation built on March 31, 2023, 11:02 p.m.