Encode short nucleotide sequences into integers with a 2-bit encoding.

encodeSequences(sequences)
`sequences`: A character vector of short nucleotide sequences, e.g., UMIs or cell barcodes.

Each pair of bits encodes one nucleotide: 00 is A, 01 is C, 10 is G and 11 is T. The 3'-most nucleotide occupies the two least significant bits, and any unused higher bits are set to zero. Thus, the sequence “CGGACT” is converted to the binary form:

01 10 10 00 01 11
... which corresponds to the integer 1671.
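The arithmetic behind this example can be checked with a minimal plain-R sketch. The helper `encode_one` is hypothetical and is not part of the package; it only illustrates the bit layout described above and assumes the input contains nothing but A, C, G and T.

```r
# Hypothetical helper for illustration only (not the package's implementation).
# Encodes a single sequence, 2 bits per nucleotide, with the 3'-most
# nucleotide ending up in the least significant bits.
encode_one <- function(seq) {
    codes <- c(A = 0L, C = 1L, G = 2L, T = 3L)
    nts <- strsplit(seq, "")[[1]]
    out <- 0L
    for (nt in nts) {
        # Shift the running value left by 2 bits, then add the next code.
        out <- out * 4L + codes[[nt]]
    }
    out
}

encode_one("CGGACT")  # 1671
```

Reading left to right, C, G, G, A, C, T contribute 1024 + 512 + 128 + 0 + 4 + 3 = 1671.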

Because R uses 32-bit signed integers, no element of `sequences` can be more than 15 nt long; a longer sequence would cause integer overflow.
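The 15 nt limit can be verified with a short sketch. The helper `max_code` is hypothetical; it computes the largest value an n-nucleotide sequence can encode (all T's) in double precision, so the check itself cannot overflow.

```r
# Hypothetical helper for illustration only.
# Largest encoded value for a sequence of n nucleotides (all "T"),
# computed as a double to avoid overflowing during the check.
max_code <- function(n) 4^n - 1

max_code(15) <= .Machine$integer.max  # TRUE:  1073741823 fits
max_code(16) <= .Machine$integer.max  # FALSE: 4294967295 does not fit
```

A 15 nt sequence needs 30 bits, which fits below the largest R integer (2147483647), whereas 16 nt would need 32 bits.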

An integer vector containing the encoded sequences.

Aaron Lun

10X Genomics (2017). Molecule info. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/molecule_info

encodeSequences("CGGACT")
