under construction

devtools::load_all(".")
library(gcatbase)

Sequence, tuple and codon representations

Note: a tuple or a codon is considered to be a special case of a nucleotide sequence.

Creation of tuples from an alphabet (all_tuples)

The function all_tuples creates all tuples. It expects the tuple size tsize and an alphabet $\Sigma$.

(rextendr does not have parameter names, does it?)

sigma = c("+", "-") # Alphabet of two letters
tuples = all_tuples(tsize = 2, alphabet = sigma)
print(tuples)

Codons can be created with the all_tuples this way. By default, the alphabet are DNA bases.

codons = all_tuples(3)
print(codons)

The alphabet of tuples or a sequence (alphabet)

alphabet determines for a sequence or a vector of sequences the underlying alphabet including its type. The type type can be "RNA" or "DNA" or "unkown".

alphabet1 = alphabet("AUGC") # one sequence
print(alphabet1$letters)
print(alphabet1$type)
print(alphabet1)
alphabet2 = alphabet(c("ATG", "CTT", "GAG", "CCG")) # vector
print(alphabet2)
sample.code <- gcatbase::code(c("ATG", "CTT", "GAG", "CCG"), id="sample.code")
alphabet2 = alphabet(sample.code) # vector
print(alphabet2)

Normalization of nucleotide sequences (normalize)

There are many ways to represent nucleotide sequences and codons or tuples in general:

gcatbase offers functions to convert the representation. The default representation is DNA in upper-case.

normalize converts a sequence or a vector of sequences (e.g. codons) to a specified representation.

rna_codons = normalize(codons, RNA = TRUE, lowercase = TRUE)
print(rna_codons)

Of course, we can only normalize a sequence as well:

seq = "ATCATGCCCACGGCGCGGAGGATAGCGAGCAGGT"
seqrna = normalize(seq, RNA = TRUE) # Uppercase is default
print(seqrna)

Codes

Create codes (code)

Codes are created with the code function. If the tuples passed to code are not unique they are made unique.

# Non unique tuple set:
X = code(tuples = c("ATG", "GCG", "ATG"), id = "A code")

This will return an object of type gcat.code. The attributes are:

Note that the code is now unique. The elements are sorted.

print(X)

In particular for codons the summary function is available.

print(summary(X))

It is also possible to create a code from a single sequence that is decomposed into tuples by means of the split function. split is explained in more detail below.

Random codes (random_code)

It is also possible to create random codes.

Xr = random_code(size = 10, tsize = 4)
print(Xr)

Persisting codes in a text file (write_codes and read_codes)

Codes can be written to and read from a file.

write_codes(filename = "codes.txt", codes = list(X, Xr), header = "# My codes file")

This will create a file codes.txt with the following content:

s = readLines("codes.txt")
cat(s[1], "\n", s[2], "\n", s[3])
codes = read_codes("codes.txt")
print(codes)

Manipulation and analysis of sequences, tuples and codons

Tuples from a sequence (split, tuples_usage)

print(seq)
t = gcatbase::split(seq, 3)
print(t)

Next we want to calculate the tuple usage. tuples_usage creates a data frame.

u = tuples_freq(t)
print(head(u, 10))

Make a code from the sequence. Note that code makes the tuples unique:

X2 = code(t, "Code from seq.")
print(X2)

split can also be used to create a code from a sequence.

X3 = code(split("AUGUGAGC", tsize = 3), id = "Code from sequence")
print(X3)

Another argument for split is sep. It can only be used when tsize is ommited. The sequence is splitted according to the value. This is useful when codons (or tuples) are copied and pasted from a text.

X5 = code(split("AUG,UUG,GCG", sep = ","), id = "Code from separator")
print(X5)

Shift tuples (shift)

X2s = code(shift(X2, 1), "shifted code")
print(X2s)

Reverse and complementary tuples (compl, rev_compl)

The complementary tuples are for instance:

print(t)
print(compl(t))

The anticodons of the tuples in t are:

print(rev_compl(t))

Classify sequence according to a code (classify, code_usage)

h = classify(seq, X)
print(h)

From this we can derive the code usage $u$ (in %)

u = length(h[h == 1]) / length(h)
print(u)

or use the convenient function code_usage:

u = code_usage(seq, X)
print(u)

Amino acids (amino_acids)

We can translate a vector of codons into amino acids:

print(amino_acids(t))

Also, we can ask what amino acids are used by a code. Here the Vertebrate Mitochondrial Code (transl_table = 2) is used:

print(amino_acids(X, numcode = 2)) # generic for amino_acids(X$tuples)

Furthermore, the data frame aa_prop contains some important chemical properties of amino acids as published in Haig, D. and Hurst, L.D. (1991).

print(aa_prop)

The columns are according to the paper:

Operations over lists (flatten)

Sometimes we need to perform some operations over entries in a Fasta file. For instance, we want to count the tuples in a Fasta record. First, the file is read and converted into GCAT's default representation:

fasta = seqinr::read.fasta("../inst/extdata/ena-sars.fasta", as.string = TRUE, forceDNAtolower = FALSE)

A typical question is the frequencies of all codons in the Fasta file. The following code calculates the frequencies:

# Generic iteration
ntuples = sapply(fasta, function(f) {
  seq = as.character(f)
  gcatbase::split(seq) # individual operation
})
tuples = unlist(ntuples)

# Now the analysis:
freq = tuples_freq(tuples)
print(head(freq))

Except of the individual operation gcatbase::split(seq) the iteration is generic and can be made available as a pattern.

This pattern is implemented in GCAT's flatten function. The first paramter is the list (here the Fasta file) and the second the function (operator) to be applied (here the split function).

codons = gcatbase::flatten(fasta, gcatbase::split)
freq = tuples_freq(codons)
print(head(freq))

Of course, this is also possible for quadrupels in frame 1.

op = function(seq) {
  seq1 = truncate(seq, 1) # Frame 1
  gcatbase::split(seq1, 4) # Make quadruples.
}

tuples4 = gcatbase::flatten(fasta, op)
freq = tuples_freq(tuples4)
print(head(freq))

Appendix

Also possible for a code and compatible to rextendr.

X = Code$new(c("AUG"), id = "Foo")
X$tuples
X$tsize
X$alphabet

References

Haig, D. and Hurst, L.D. (1991) ‘A quantitative measure of error minimization in the genetic code’, Journal of Molecular Evolution, 33(5), pp. 412–417. doi:10.1007/BF02103132.



informatik-mannheim/gcat-base documentation built on Nov. 7, 2023, 7:18 a.m.