alphabet: Get alphabet of given sq object.

View source: R/alphabet.R

alphabet    R Documentation



Description

Returns the alphabet attribute of an object.





Arguments

An object to extract the alphabet from.


Details

Each sq object has an alphabet associated with it. An alphabet is the set of letters that can appear in the sequences contained in the object. It is mostly stored as a character vector, where each element represents one letter.

sq objects of type ami, dna or rna have fixed, predefined alphabets. In other words, if two sq objects have exactly the same type (ami_bsc, dna_ext, rna_bsc or any other combination), they are guaranteed to have the same alphabet.

The alphabets for these types are listed below:



  • dna_bsc - ACGT-

  • dna_ext - ACGTWSMKRYBDHVN-

  • rna_bsc - ACGU-

  • rna_ext - ACGUWSMKRYBDHVN-
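The guarantee described above can be checked with a short sketch. It assumes the tidysq package is installed and that sq() accepts a type name such as "dna_bsc" through its alphabet argument; the sequences themselves are made up for illustration.

```r
library(tidysq)

# Two dna_bsc objects built from different (illustrative) sequences
sq_a <- sq(c("ACGT", "TTGA"), alphabet = "dna_bsc")
sq_b <- sq(c("GGGCC"), alphabet = "dna_bsc")

# Both report the predefined dna_bsc alphabet,
# regardless of which letters actually occur in the sequences
alphabet(sq_a)
alphabet(sq_b)

# The predefined alphabet can also be requested directly by type name
get_standard_alphabet("dna_bsc")
```

Note that sq_b never contains A or T, yet its alphabet is expected to be the full dna_bsc set, because the alphabet is tied to the type rather than to the data.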

Other types of sq objects are allowed to have different alphabets. Furthermore, having an alphabet identical to one of those above does not automatically mean that the sequence is of that type. For example, there might be an atp sq object whose alphabet is identical to the ami_bsc alphabet. To set the type, one should use the typify or `sq_type<-` function.
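A minimal sketch of the distinction between alphabet and type, assuming that passing an explicit character vector of letters to sq() yields an atp object and that typify() takes the destination type as its second argument. The letters and sequences are illustrative only.

```r
library(tidysq)

# Letters are given explicitly, so the object is treated as atp
# even though every letter also belongs to the ami_bsc alphabet
sq_atp <- sq(c("ACDE", "EDCA"), alphabet = c("A", "C", "D", "E"))
sq_type(sq_atp)

# The type has to be set on purpose; typify() returns a retyped object
sq_ami <- typify(sq_atp, "ami_bsc")
sq_type(sq_ami)
alphabet(sq_ami)
```

After typify(), the object carries the predefined ami_bsc alphabet instead of the four letters supplied at construction.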

The unt and atp types exist because, although there is a standard format for fasta files, some files contain symbols that do not match the standard. Thanks to these types, tidysq can import files with customized alphabets. Moreover, the user may want to group amino acids with similar properties (e.g., for machine learning) and replace the standard alphabet with symbols for whole groups. For details, see read_fasta, sq and substitute_letters.

Important note: letters in atp alphabets may consist of more than one character. This functionality is provided to handle situations like post-translational modifications (e.g., using "mA" to indicate methylated alanine).
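Multi-character letters can be sketched as follows. Apart from "mA", the letter names ("mY", "nbA", "nsA") are hypothetical labels chosen for illustration, and the sketch assumes that get_sq_lengths() counts letters rather than characters.

```r
library(tidysq)

# Each alphabet element is one letter, even if it spans several characters
sq_mod <- sq(c("mAmYmY", "nbAnsAmA"),
             alphabet = c("mA", "mY", "nbA", "nsA"))

# The alphabet keeps the multi-character letters intact
alphabet(sq_mod)

# Lengths are counted in letters, so "mAmYmY" has 3 elements, not 6
get_sq_lengths(sq_mod)
```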

Important note: alphabets of atp and unt sq objects are case sensitive. Thus, both lowercase and uppercase characters can appear in their alphabets simultaneously, and they are treated as different letters. Alphabets of dna, rna and ami types are always uppercase, and all functions convert other parameters to uppercase when working with dna, rna or ami objects. For example, the %has% operator converts lowercase letters to uppercase when searching for motifs in a dna, rna or ami object.
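The uppercase conversion for standard types can be illustrated with %has%. This is a sketch with made-up sequences; it relies only on the behaviour stated above, namely that motifs are converted to uppercase for dna, rna and ami objects.

```r
library(tidysq)

sq_dna <- sq(c("ACGTTG", "GGGACA"), alphabet = "dna_bsc")

# A lowercase motif is converted to uppercase before searching,
# so both calls should give the same logical vector
lower_hit <- sq_dna %has% "ttg"
upper_hit <- sq_dna %has% "TTG"

lower_hit
upper_hit
```

For atp and unt objects no such conversion takes place, so "ttg" and "TTG" would be different motifs there.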

Important note: an alphabet can contain at most 30 letters. The user cannot read fasta files or construct sq objects from character vectors whose sequences contain more than 30 distinct characters (unless creating ami, dna or rna objects with the ignore_case parameter set to TRUE).


Value

A character vector of letters of the alphabet.

See Also

sq class

Functions from alphabet module: get_standard_alphabet()

michbur/tidysq documentation built on April 1, 2022, 5:18 p.m.