
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(tidysq)

Sequences in sq objects are compressed to take up less storage space. To achieve that, sq objects store an alphabet attribute that serves as a dictionary of possible symbols. This attribute can be accessed by its namesake function:

sq_dna <- sq(c("CTGAATGCAGT", "ATGCCGT", "CAGACCATT"))
alphabet(sq_dna)

Manually assigning a different alphabet is strongly discouraged, as it may result in undesirable behavior.
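The dictionary idea behind this compression can be illustrated in base R. The sketch below is only a toy model and does not reproduce how tidysq packs sequences internally; all variable names are hypothetical, and the bit count assumes that one extra code is reserved for a missing value.

```r
# Illustration only: a toy dictionary encoding, not tidysq's internal format.
toy_alphabet <- c("A", "C", "G", "T", "-")
letters_in_seq <- strsplit("CTGAATGCAGT", "")[[1]]

# Each letter is replaced by its zero-based position in the alphabet.
codes <- match(letters_in_seq, toy_alphabet) - 1L

# Assumption: one additional code is reserved for a missing value,
# so 6 distinct codes fit in 3 bits per letter.
bits_per_letter <- ceiling(log2(length(toy_alphabet) + 1))

# Decoding looks the positions up in the same alphabet.
decoded <- paste(toy_alphabet[codes + 1L], collapse = "")
```

Because decoding depends entirely on the stored alphabet, swapping the alphabet by hand changes the meaning of every stored code.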

Standard alphabets

Alphabets can be divided into standard and non-standard types. Both groups behave similarly, but standard alphabets offer additional functionality thanks to their biological interpretation.

Standard alphabets can be subdivided into basic and extended alphabets, and the two groups are closely linked. Each standard alphabet corresponds to an sq type: an sq object of that type has this alphabet as the value of its alphabet attribute.

Basic alphabets

There are three predefined basic alphabets: for DNA, RNA and amino acid sequences. They consist of all letter codes used for bases of the given type, as well as the gap letter "-" and (in the amino acid case) the stop letter "*". Alphabets are stored as character vectors with an added sq_alphabet class that provides additional methods. For instance, the letters of the amino acid alphabet can be listed by calling get_standard_alphabet("ami_bsc").

The translate() operation requires a basic DNA or RNA alphabet.

Extended alphabets

For each basic alphabet there is an extended counterpart. These three extended alphabets contain all letters from the respective basic ones and, additionally, ambiguous letters (that is, letters that mean "X-or-Y-or-Z base", where X, Y and Z are chosen from the corresponding basic alphabet).
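To make the "X-or-Y-or-Z" meaning concrete, the standard IUPAC nucleotide ambiguity codes can be written as a lookup table. This is a sketch assuming that the extended nucleotide alphabets follow the IUPAC convention; the table iupac_dna and the helper may_be() are hypothetical names and not part of tidysq.

```r
# Standard IUPAC nucleotide ambiguity codes (DNA letters shown).
iupac_dna <- list(
  R = c("A", "G"), Y = c("C", "T"), S = c("C", "G"), W = c("A", "T"),
  K = c("G", "T"), M = c("A", "C"),
  B = c("C", "G", "T"), D = c("A", "G", "T"),
  H = c("A", "C", "T"), V = c("A", "C", "G"),
  N = c("A", "C", "G", "T")
)

# Hypothetical helper: can ambiguous letter `code` stand for base `base`?
may_be <- function(code, base) base %in% iupac_dna[[code]]

may_be("R", "G")  # R means "A or G"
may_be("Y", "G")  # Y means "C or T"
```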

Both basic and extended alphabets can be obtained with the get_standard_alphabet() function. It interprets the type argument loosely, so the user does not have to remember the exact type name (although consistent naming helps code readability):

get_standard_alphabet("ami_ext")
get_standard_alphabet("rna_bsc")
get_standard_alphabet("DNA extended")

Removing ambiguous elements

When an sq object has an extended type, it can be converted to the basic one with the remove_ambiguous() function. It works by removing either whole sequences in which an ambiguous element is present or just the ambiguous elements themselves, depending on the value of the by_letter parameter. In the example below N is such an element:

sq_rna <- sq(c("UCGGNNCAGNN", "AUUCGGUGA", "CNCUUANNNCNU"))
sq_rna
remove_ambiguous(sq_rna)
remove_ambiguous(sq_rna, by_letter = TRUE)
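The difference between the two modes can be imitated on plain character vectors. The sketch below is a base R analogy rather than tidysq's implementation; it assumes that the basic RNA alphabet is A, C, G, U and "-", and it models a removed sequence as an empty string so that the vector keeps its length.

```r
# Base R analogy of the two removal modes; not tidysq's implementation.
seqs <- c("UCGGNNCAGNN", "AUUCGGUGA", "CNCUUANNNCNU")

# Assumption: anything outside A, C, G, U and "-" counts as ambiguous.
ambiguous_pattern <- "[^ACGU-]"

# Sequence-wise mode: a sequence with any ambiguous letter is emptied.
has_ambiguous <- grepl(ambiguous_pattern, seqs)
whole_sequence <- ifelse(has_ambiguous, "", seqs)

# Letter-wise mode: only the ambiguous letters are dropped.
letter_wise <- gsub(ambiguous_pattern, "", seqs)
```

Letter-wise removal keeps more data but shortens the affected sequences, which matters whenever positions have to stay aligned.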

Should the user wish to keep the original lengths of sequences unchanged, it is more appropriate to use the substitute_letters() function instead. The most obvious replacement is the "-" gap letter, present in all standard alphabets:

substitute_letters(sq_rna, c(N = "-"))

Note, however, that the returned object has the atp type instead of a standard one. Handling this is covered in the section on type manipulation.

Non-standard alphabets

The non-standard alphabet group consists of two types: untyped (unt) and atypical (atp). The former results from not specifying an alphabet when no standard alphabet contains all letters appearing in the sequences. The latter, on the other hand, is used whenever the user specifies the alphabet explicitly. The difference is best shown with calls to the sq() constructor:

sq(c("PFN&I&VO*&P", "&IO*&PVO"))
sq(c("PFN&I&VO*&P", "&IO*&PVO"),
   alphabet = c("F", "I", "N", "O", "P", "V", "&", "*"))

As with standard alphabets, atypical ones can also contain more letters than actually appear in the sequences:

sq(c("PFN&I&VO*&P", "&IO*&PVO"),
   alphabet = c("E", "F", "I", "N", "O", "P", "Q", "V", "&", "*", ":"))

Multicharacter alphabets

The main use of atypical alphabets is to allow the user to handle data with multicharacter letters. For example, amino acid sequences are sometimes described using three-character codes. These can be handled as shown below (although in practice all codes should be specified, not only a handful):

sq_multichar <- sq(c("TyrGlyArgArgAsp", "AspGlyArgGly", "CysGluGlyTyrProArg"),
                   alphabet = c("Arg", "Asp", "Cys", "Glu", "Gly", "Pro", "Tyr"))
sq_multichar

These letters are treated as a whole, meaning that they are indivisible. This can be observed in a letter replacement operation:

substitute_letters(sq_multichar, c(Arg = "X", Glu = "His", Pro = "X"))
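The indivisibility of multicharacter letters can be sketched in base R by first splitting a string into whole codes and only then replacing them. This is an illustration under the assumption that the codes are matched greedily from a known list; the variable names are hypothetical and the code does not mirror tidysq's parser.

```r
# Base R illustration of whole-letter replacement; not tidysq's parser.
codes <- c("Arg", "Asp", "Cys", "Glu", "Gly", "Pro", "Tyr")
raw <- "CysGluGlyTyrProArg"

# Split the string into the known multicharacter letters.
pattern <- paste(codes, collapse = "|")
tokens <- regmatches(raw, gregexpr(pattern, raw))[[1]]

# Replace whole letters only; "G" inside "Gly" is never touched on its own.
encoding <- c(Arg = "X", Glu = "His", Pro = "X")
hit <- tokens %in% names(encoding)
tokens[hit] <- encoding[tokens[hit]]

result <- paste(tokens, collapse = "")
```

Note that a replacement letter may itself be multicharacter ("His") or shorter than the original ("X"), since substitution works on letters rather than on characters.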

Type manipulation

As shown in the previous sections, substitute_letters() returns an sq object of atp type. If this type is not suitable, the user can call the typify() function, which creates a new sq object with the desired type (backticks are necessary when the substituted letter is not a valid variable name):

sq_unt <- sq(c("UCGG&&CAG&&", "AUUCGGUGA", "C&CUUA&&&C&U"))
sq_sub <- substitute_letters(sq_unt, c(`&` = "-"))
sq_sub
typify(sq_sub, "rna_bsc")

However, typify() has one requirement: the sq object being typified must not contain any letters that are absent from the target alphabet. For instance, the following call will not work:

typify(sq_sub, "dna_bsc")

The user is not left to guess whether a sequence has invalid letters. The find_invalid_letters() function returns a list of character vectors, where each vector contains the invalid letters of the corresponding sequence:

find_invalid_letters(sq_sub, "dna_bsc")
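The check behind this kind of validation amounts to a per-sequence set difference. The sketch below shows that idea on plain strings; find_invalid_sketch() is a hypothetical helper, not the package function, and the target letters are assumed to be those of the basic DNA alphabet (A, C, G, T and "-").

```r
# Hypothetical base R helper illustrating the idea; not tidysq's function.
find_invalid_sketch <- function(seqs, target_alphabet) {
  lapply(strsplit(seqs, ""), function(letters_in_seq) {
    # Letters present in the sequence but absent from the target alphabet.
    setdiff(letters_in_seq, target_alphabet)
  })
}

# The RNA sequences from above after "&" was replaced with "-".
substituted <- c("UCGG--CAG--", "AUUCGGUGA", "C-CUUA---C-U")
dna_basic <- c("A", "C", "G", "T", "-")  # assumed basic DNA letters

invalid <- find_invalid_sketch(substituted, dna_basic)
```

Every sequence reports "U" here, which explains why converting these sequences to the basic DNA type fails while converting them to the basic RNA type succeeds.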

All invalid letters have to be substituted before the object is passed to typify(). A more involved call that replaces all ambiguous letters with the "-" gap letter can be constructed as follows:

ambiguous_letters <- setdiff(
  get_standard_alphabet("rna_ext"),
  get_standard_alphabet("rna_bsc")
)
encoding <- rep("-", length(ambiguous_letters))
names(encoding) <- ambiguous_letters
encoding

sq_rna_sub <- substitute_letters(sq_rna, encoding)
typify(sq_rna_sub, "rna_bsc")