count_barcodes: Count barcodes
In andrewGhazi/malacoda: Bayesian Analysis of High-Throughput Genomic Assays

Description Usage Arguments Details Value Note Examples

Given a mapping between barcodes and alleles and a directory of MPRA sequencing results, count the abundance of each barcode and return a data frame of counts.

count_barcodes(
  barcode_allele_df,
  fastq_dir,
  temp_dir,
  keep_temp = FALSE,
  bc_start,
  bc_end,
  quality_cutoff = 30,
  n_cores = 1,
  verbose = TRUE,
  decode_barcode_set = NULL
)

`barcode_allele_df`	a data frame giving a mapping between barcodes and alleles
`fastq_dir`	a character string giving the path to a directory of MPRA FASTQ results
`temp_dir`	a character string giving the path to a directory to use for temporary intermediate files
`keep_temp`	a logical indicating whether to avoid discarding the temporary intermediate files
`bc_start`	an integer indicating the position in the read where the barcode starts
`bc_end`	an integer indicating the position in the read where the barcode ends
`quality_cutoff`	an integer indicating the quality cutoff to be used
`n_cores`	an integer giving the number of cores to use in parallel
`verbose`	logical indicating whether or not to print progress messages
`decode_barcode_set`	an optional character string giving the path to the freebarcodes barcode set .txt file to be used to decode error-correctable barcodes

The barcode_allele_df should have two columns: bc_id giving a unique identifier to each barcode and barcode which gives the barcode. The bc_id should ideally be descriptive and denote which allele of which SNP it is associated with. tidyr::unite can be convenient for preparing this input.

The quality cutoff applies to ALL bases in the barcode. That is, if every base in the barcode is not at or above the given cutoff, the read is discarded. The default cutoff, 30, is fairly aggressive, so if you're losing too many reads you can try lowering it.

The output returns a column of counts for each FASTQ in the input directory. The column names are taken from the filenames of the fastqs without the filetype extensions. The usual MPRA practice is to have one FASTQ per transfection as well as multiple replicates taken from the plasmid library – if this is not the case in your experiment you'll need to do some form of experiment-structure-appropriate aggregation before proceeding to downstream analysis.

Unmatched reads are also written out to files included in the temporary intermediates under seq_only/. These can be helpful if you want to regain reads that would otherwise be discarded by using the error-correctable barcodes from the freebarcodes package available through mpradesigntools. If you want to use these make sure to set keep_temp = TRUE.

a data frame of counts by file in the input directory

This function requires an installation of FASTX-Toolkit by the Hannon Lab (it's easy to install with apt-get) and sed (which comes as part of any standard Unix-like OS).

The temporary intermediates are comparable in size to the input FASTQ files, so make sure you have enough disk space available.

This function can also make use of the freebarcodes package through the use of the decode_barcode_set argument. This allows the user to recover some barcode reads that are otherwise lost to sequencing error. See the link above for installation/algorithmic details. The decode_barcode_set argument needs to be the full path to the actual barcode set .txt file (typically in the barcodes/ directory where freebarcodes is installed). Note that this requires the MPRA to have been designed with one of these barcode sets in the first place, which is easily possible through the mpradesigntools package.

## Not run: 
fastq_dir_example = tempdir()

# V This writes out an example directory of fastqs to a temporary directory
mapply(function(x,y){write.table(x, file = y, quote = FALSE, col.names = FALSE, row.names = FALSE)},
fastq_examples,
paste0(fastq_dir_example, '/', names(fastq_examples), '.fastq'))

tmp_dir = paste0(fastq_dir_example, '/count_temp_', format(Sys.time(), "%Y_%m_%d_%H_%M_%S"))

# The data frame of counts this returns will be mostly zeroes because
# the example fastqs included with this package are very small.

count_barcodes(barcode_allele_df = barcode_by_id,
fastq_dir = fastq_dir_example,
temp_dir = tmp_dir,
bc_start = 1, # The barcode starts at the first base read
bc_end = 10, # and ends at the 14th
quality_cutoff = 30,
n_cores = 1,
keep_temp = FALSE)
## End(Not run)