Codec: Compression/Decompression of Character Vectors
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences


Compresses character vectors into raw vectors, or decompresses raw vectors into character vectors using a variety of codecs.

Codec(x,
      compression,
      compressRepeats = FALSE,
      processors = 1)

`x`	Either a character vector to be compressed, or a list of raw vectors to be decompressed.
`compression`	The type of compression algorithm to use when `x` is a character vector. This should be (an unambiguous abbreviation of) one of `"nbit"` (for nucleotides), `"qbit"` (for quality scores), `"gzip"`, `"bzip2"`, or `"xz"`. If `compression` is `"nbit"` or `"qbit"` then a second method can be provided for cases when `x` is incompressible. Decompression type is determined automatically. (See details section below.)
`compressRepeats`	Logical specifying whether to compress exact repeats and reverse complement repeats in a character vector input (`x`). Only applicable when `compression` is `"nbit"`. Repeat compression in long DNA sequences generally improves compression by about 2% but takes roughly three times as long to compress.
`processors`	The number of processors to use, or `NULL` to automatically detect and use all available processors.

`Codec` compresses and decompresses character vectors using different algorithms. The `"nbit"` and `"qbit"` methods are tailored specifically to nucleotides and quality scores, respectively. These two methods store the data as plain text ("ASCII" format) when it is incompressible. In such cases, a second compression method can be given to use in lieu of plain text. For example, `compression = c("nbit", "gzip")` will use `"gzip"` compression when `"nbit"` compression is inappropriate.
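As a rough illustration of the fallback described above, the sketch below compresses one nucleotide string and one string of ordinary text, first with `"nbit"` alone and then with `"gzip"` as the second method. The input strings are made up for this example, and no claim is made about the sizes that result; the only property checked is that both variants decompress back to the original input.

```r
library(DECIPHER)

# Illustrative inputs (not from the package): one DNA string, one free-text string
inputs <- c(
  paste(rep("ACGTTGCAAGGCTTAA", 20), collapse = ""),
  "Free text such as this sentence is not a nucleotide sequence."
)

# "nbit" alone: incompressible input is stored as plain text
plain <- Codec(inputs, compression = "nbit")

# "nbit" with "gzip" as the second method for incompressible input
fallback <- Codec(inputs, compression = c("nbit", "gzip"))

# Compare the number of bytes stored per string under each setting
lengths(plain)
lengths(fallback)

# Both variants should decompress to the original strings
stopifnot(all(Codec(plain) == inputs))
stopifnot(all(Codec(fallback) == inputs))
```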

When performing the reverse operation, decompression, the type of compression is automatically detected based on the unique signature ("magic number") added by each compression algorithm.
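To make the idea of a "magic number" concrete, here is a minimal sketch in base R that guesses a codec from the leading bytes of a raw vector. It does not show how `Codec` detects its own formats: `guess_codec` is a hypothetical helper written for this example, base R's `memCompress` stands in as the compressor, and the gzip and xz samples are hand-built header bytes using the standard file signatures for those formats.

```r
# Hypothetical helper: guess the codec from the first bytes of a raw vector.
guess_codec <- function(r) {
  xz_magic <- as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00))  # xz file signature
  bz_magic <- charToRaw("BZh")                               # bzip2 signature
  gz_magic <- as.raw(c(0x1f, 0x8b))                          # gzip file signature
  if (length(r) >= 6 && identical(r[1:6], xz_magic)) {
    "xz"
  } else if (length(r) >= 3 && identical(r[1:3], bz_magic)) {
    "bzip2"
  } else if (length(r) >= 2 && identical(r[1:2], gz_magic)) {
    "gzip"
  } else {
    "other"
  }
}

# Compress a repetitive payload with bzip2 using base R
payload <- charToRaw(paste(rep("ACGTTGCAAGGCTTAA", 50), collapse = ""))
bz <- memCompress(payload, type = "bzip2")

rawToChar(bz[1:3])  # the bzip2 signature, "BZh"
guess_codec(bz)     # "bzip2"

# Hand-built headers (signature bytes only, not valid compressed streams)
guess_codec(as.raw(c(0x1f, 0x8b, 0x08)))                    # "gzip"
guess_codec(as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)))  # "xz"
guess_codec(payload)                                        # "other"

# Use the detected type to decompress without being told the codec
restored <- memDecompress(bz, type = guess_codec(bz))
stopifnot(identical(restored, payload))
```

Because the signature travels with the compressed bytes, the caller does not need to record which method was used, which is why `Codec` needs no `compression` argument when decompressing.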

If `x` is a character vector to be compressed, the output is a list of raw vectors, one per character string. If `x` is a list of raw vectors to be decompressed, the output is a character vector with one string per list element.
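The shape of the return value can be checked directly. The sketch below uses three randomly generated nucleotide strings (made up for this example rather than taken from the package) and inspects the result of each direction.

```r
library(DECIPHER)

# Three random 200-nucleotide strings (illustrative data only)
set.seed(1)
seqs <- replicate(3, paste(sample(c("A", "C", "G", "T"), 200, replace = TRUE),
                           collapse = ""))

# Compression: a list with one raw vector per input string
packed <- Codec(seqs, compression = "nbit")
stopifnot(is.list(packed), length(packed) == length(seqs))
stopifnot(all(vapply(packed, is.raw, logical(1))))
lengths(packed)  # bytes stored for each string

# Decompression: a character vector with one string per list element
unpacked <- Codec(packed)
stopifnot(is.character(unpacked), length(unpacked) == length(packed))
stopifnot(all(unpacked == seqs))
```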

Erik Wright eswright@pitt.edu

library(DECIPHER) # also attaches Biostrings, which provides readDNAStringSet
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")
dna <- as.character(readDNAStringSet(fas)) # aligned sequences
object.size(dna)

# compression
system.time(x <- Codec(dna, compression="nbit"))
object.size(x)/sum(nchar(dna)) # bytes per position

system.time(g <- Codec(dna, compression="gzip"))
object.size(g)/sum(nchar(dna)) # bytes per position

# decompression
system.time(y <- Codec(x))
stopifnot(dna==y)

system.time(z <- Codec(g))
stopifnot(dna==z)