encodeSequences: Encode nucleotide sequences

Description

Encode short nucleotide sequences into integers with a 2-bit encoding.

Usage

encodeSequences(sequences)

Arguments

sequences

A character vector of short nucleotide sequences, e.g., UMIs or cell barcodes.

Details

Each pair of bits encodes a nucleotide: 00 is A, 01 is C, 10 is G and 11 is T. The least significant byte contains the 3'-most nucleotides, and any unused higher-order bits are set to zero. Thus, the sequence “CGGACT” is converted to the binary form:

    01 10 10 00 01 11

... which corresponds to the integer 1671.
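
The arithmetic above can be reproduced with a short base-R sketch. The helper name encode_2bit_sketch is hypothetical and is not part of DropletUtils; it only illustrates the encoding described here, assumes upper-case A/C/G/T input, and is not the package's actual implementation.

```r
# Hypothetical helper (not part of DropletUtils): a base-R sketch of the
# 2-bit encoding, assuming sequences contain only upper-case A, C, G and T.
encode_2bit_sketch <- function(sequences) {
    codes <- c(A = 0L, C = 1L, G = 2L, T = 3L)
    vapply(sequences, function(s) {
        bases <- strsplit(s, "", fixed = TRUE)[[1]]
        # Refuse input that is too long or contains other characters.
        stopifnot(length(bases) <= 15L, all(bases %in% names(codes)))
        value <- 0L
        for (b in bases) {
            # Multiplying by 4 shifts left by two bits; the new base
            # then occupies the two least significant bits.
            value <- value * 4L + codes[[b]]
        }
        value
    }, integer(1), USE.NAMES = FALSE)
}

encode_2bit_sketch("CGGACT")
```

Reading the sequence from 5' to 3' and shifting at each step leaves the 3'-most nucleotide in the lowest two bits, which matches the layout described above.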

Because R uses 32-bit signed integers, no element of sequences can be more than 15 nt long; longer sequences will cause integer overflow.
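
The 15 nt limit can be checked directly: a 15-mer needs 30 bits and a 16-mer needs 32, while R's integers hold at most 31 value bits. The following is a minimal sketch; the variable names and the example barcodes are illustrative only.

```r
# Largest encoded value for a given length is the all-T sequence, 4^n - 1.
# Computed as doubles so that the comparison itself cannot overflow.
max_15nt <- 4^15 - 1
max_16nt <- 4^16 - 1

fits_15 <- max_15nt <= .Machine$integer.max  # 15 nt fits in an R integer
fits_16 <- max_16nt <= .Machine$integer.max  # 16 nt does not

# A simple guard before encoding: flag any sequence longer than 15 nt.
barcodes <- c("CGGACT", "ACGTACGTACGTACGT")  # illustrative input
too_long <- nchar(barcodes) > 15L
```

Here fits_15 is TRUE and fits_16 is FALSE, and too_long marks only the second (16 nt) barcode.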

Value

An integer vector containing the encoded sequences.

Author(s)

Aaron Lun

References

10X Genomics (2017). Molecule info. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/molecule_info

Examples

encodeSequences("CGGACT")

MarioniLab/DropletUtils documentation built on Dec. 13, 2024, 6:13 a.m.