In informatik-mannheim/gcat-base: Genetic Code Analysis Toolkit for R - GCAT Base Package (gcatbase)

under construction

devtools::load_all(".")

library(gcatbase)

Sequence, tuple and codon representations

Note: a tuple or a codon is considered to be a special case of a nucleotide sequence.

Creation of tuples from an alphabet (`all_tuples`)

The function all_tuples creates all tuples. It expects the tuple size tsize and an alphabet $\Sigma$.

(rextendr does not have parameter names, does it?)

sigma = c("+", "-") # Alphabet of two letters
tuples = all_tuples(tsize = 2, alphabet = sigma)
print(tuples)

Codons can be created with the all_tuples this way. By default, the alphabet are DNA bases.

codons = all_tuples(3)
print(codons)

The alphabet of tuples or a sequence (`alphabet`)

alphabet determines for a sequence or a vector of sequences the underlying alphabet including its type. The type type can be "RNA" or "DNA" or "unkown".

alphabet1 = alphabet("AUGC") # one sequence
print(alphabet1$letters)
print(alphabet1$type)
print(alphabet1)

alphabet2 = alphabet(c("ATG", "CTT", "GAG", "CCG")) # vector
print(alphabet2)

sample.code <- gcatbase::code(c("ATG", "CTT", "GAG", "CCG"), id="sample.code")
alphabet2 = alphabet(sample.code) # vector
print(alphabet2)

Normalization of nucleotide sequences (`normalize`)

There are many ways to represent nucleotide sequences and codons or tuples in general:

DNA or RNA?
letters in lower- or upper-case?
Sequences as string or as vector of characters?

gcatbase offers functions to convert the representation. The default representation is DNA in upper-case.

normalize converts a sequence or a vector of sequences (e.g. codons) to a specified representation.

rna_codons = normalize(codons, RNA = TRUE, lowercase = TRUE)
print(rna_codons)

Of course, we can only normalize a sequence as well:

seq = "ATCATGCCCACGGCGCGGAGGATAGCGAGCAGGT"
seqrna = normalize(seq, RNA = TRUE) # Uppercase is default
print(seqrna)

Codes

Create codes (`code`)

Codes are created with the code function. If the tuples passed to code are not unique they are made unique.

# Non unique tuple set:
X = code(tuples = c("ATG", "GCG", "ATG"), id = "A code")

This will return an object of type gcat.code. The attributes are:

id: a short identifier for this code.
tuples: unique vector of tuples.
tsize: tuple size (e.g. 3 if codons)

Note that the code is now unique. The elements are sorted.

print(X)

In particular for codons the summary function is available.

print(summary(X))

It is also possible to create a code from a single sequence that is decomposed into tuples by means of the split function. split is explained in more detail below.

Random codes (`random_code`)

It is also possible to create random codes.

Xr = random_code(size = 10, tsize = 4)
print(Xr)

Persisting codes in a text file (`write_codes` and `read_codes`)

Codes can be written to and read from a file.

write_codes(filename = "codes.txt", codes = list(X, Xr), header = "# My codes file")

This will create a file codes.txt with the following content:

s = readLines("codes.txt")
cat(s[1], "\n", s[2], "\n", s[3])

codes = read_codes("codes.txt")
print(codes)

Manipulation and analysis of sequences, tuples and codons

Tuples from a sequence (`split`, `tuples_usage`)

print(seq)
t = gcatbase::split(seq, 3)
print(t)

Next we want to calculate the tuple usage. tuples_usage creates a data frame.

u = tuples_freq(t)
print(head(u, 10))

Make a code from the sequence. Note that code makes the tuples unique:

X2 = code(t, "Code from seq.")
print(X2)

split can also be used to create a code from a sequence.

X3 = code(split("AUGUGAGC", tsize = 3), id = "Code from sequence")
print(X3)

Another argument for split is sep. It can only be used when tsize is ommited. The sequence is splitted according to the value. This is useful when codons (or tuples) are copied and pasted from a text.

X5 = code(split("AUG,UUG,GCG", sep = ","), id = "Code from separator")
print(X5)

Shift tuples (`shift`)

X2s = code(shift(X2, 1), "shifted code")
print(X2s)

Reverse and complementary tuples (`compl`, `rev_compl`)

The complementary tuples are for instance:

print(t)
print(compl(t))

The anticodons of the tuples in t are:

print(rev_compl(t))

Classify sequence according to a code (`classify`, `code_usage`)

h = classify(seq, X)
print(h)

From this we can derive the code usage $u$ (in %)

u = length(h[h == 1]) / length(h)
print(u)

or use the convenient function code_usage:

u = code_usage(seq, X)
print(u)

Amino acids (`amino_acids`)

We can translate a vector of codons into amino acids:

print(amino_acids(t))

Also, we can ask what amino acids are used by a code. Here the Vertebrate Mitochondrial Code (transl_table = 2) is used:

print(amino_acids(X, numcode = 2)) # generic for amino_acids(X$tuples)

Furthermore, the data frame aa_prop contains some important chemical properties of amino acids as published in Haig, D. and Hurst, L.D. (1991).

print(aa_prop)

The columns are according to the paper:

Polar "Woese et al.'s (1966) polar requirement is the slope of the line that results when $$\log(1 - RF)/RF$$ for free amino acids is plotted against the log mole fraction of water in pyridine solvent."
Hydropathy "Kyte and Doolittle's (1982) hydropathy is based on water-vapor transfer free energies, the interior-exterior distribution of amino acid side-chains, and the subjective judgment of the authors"
Volume "Grantham's (1974) molecular volume of side chains is the residue volume minus a constant peptide volume. "
Isoelectric "The values of isoelectric points were taken from Alff-Steinberger (1969)."

Operations over lists (`flatten`)

Sometimes we need to perform some operations over entries in a Fasta file. For instance, we want to count the tuples in a Fasta record. First, the file is read and converted into GCAT's default representation:

fasta = seqinr::read.fasta("../inst/extdata/ena-sars.fasta", as.string = TRUE, forceDNAtolower = FALSE)

A typical question is the frequencies of all codons in the Fasta file. The following code calculates the frequencies:

# Generic iteration
ntuples = sapply(fasta, function(f) {
  seq = as.character(f)
  gcatbase::split(seq) # individual operation
})
tuples = unlist(ntuples)

# Now the analysis:
freq = tuples_freq(tuples)
print(head(freq))

Except of the individual operation gcatbase::split(seq) the iteration is generic and can be made available as a pattern.

This pattern is implemented in GCAT's flatten function. The first paramter is the list (here the Fasta file) and the second the function (operator) to be applied (here the split function).

codons = gcatbase::flatten(fasta, gcatbase::split)
freq = tuples_freq(codons)
print(head(freq))

Of course, this is also possible for quadrupels in frame 1.

op = function(seq) {
  seq1 = truncate(seq, 1) # Frame 1
  gcatbase::split(seq1, 4) # Make quadruples.
}

tuples4 = gcatbase::flatten(fasta, op)
freq = tuples_freq(tuples4)
print(head(freq))

Appendix

Also possible for a code and compatible to rextendr.

X = Code$new(c("AUG"), id = "Foo")
X$tuples
X$tsize
X$alphabet

References

Haig, D. and Hurst, L.D. (1991) ‘A quantitative measure of error minimization in the genetic code’, Journal of Molecular Evolution, 33(5), pp. 412–417. doi:10.1007/BF02103132.

informatik-mannheim/gcat-base documentation built on Nov. 7, 2023, 7:18 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

informatik-mannheim/gcat-base
Genetic Code Analysis Toolkit for R - GCAT Base Package (gcatbase)

In informatik-mannheim/gcat-base: Genetic Code Analysis Toolkit for R - GCAT Base Package (gcatbase)

Sequence, tuple and codon representations

Creation of tuples from an alphabet (`all_tuples`)

The alphabet of tuples or a sequence (`alphabet`)

Normalization of nucleotide sequences (`normalize`)

Codes

Create codes (`code`)

Random codes (`random_code`)

Persisting codes in a text file (`write_codes` and `read_codes`)

Manipulation and analysis of sequences, tuples and codons

Tuples from a sequence (`split`, `tuples_usage`)

Shift tuples (`shift`)

Reverse and complementary tuples (`compl`, `rev_compl`)

Classify sequence according to a code (`classify`, `code_usage`)

Amino acids (`amino_acids`)

Operations over lists (`flatten`)

Appendix

References

R Package Documentation

Browse R Packages

We want your feedback!

informatik-mannheim/gcat-base Genetic Code Analysis Toolkit for R - GCAT Base Package (gcatbase)

In informatik-mannheim/gcat-base: Genetic Code Analysis Toolkit for R - GCAT Base Package (gcatbase)

Sequence, tuple and codon representations

Creation of tuples from an alphabet (all_tuples)

The alphabet of tuples or a sequence (alphabet)

Normalization of nucleotide sequences (normalize)

Codes

Create codes (code)

Random codes (random_code)

Persisting codes in a text file (write_codes and read_codes)

Manipulation and analysis of sequences, tuples and codons

Tuples from a sequence (split, tuples_usage)

Shift tuples (shift)

Reverse and complementary tuples (compl, rev_compl)

Classify sequence according to a code (classify, code_usage)

Amino acids (amino_acids)

Operations over lists (flatten)

Appendix

References

R Package Documentation

Browse R Packages

We want your feedback!

informatik-mannheim/gcat-base
Genetic Code Analysis Toolkit for R - GCAT Base Package (gcatbase)

Creation of tuples from an alphabet (`all_tuples`)

The alphabet of tuples or a sequence (`alphabet`)

Normalization of nucleotide sequences (`normalize`)

Create codes (`code`)

Random codes (`random_code`)

Persisting codes in a text file (`write_codes` and `read_codes`)

Tuples from a sequence (`split`, `tuples_usage`)

Shift tuples (`shift`)

Reverse and complementary tuples (`compl`, `rev_compl`)

Classify sequence according to a code (`classify`, `code_usage`)

Amino acids (`amino_acids`)

Operations over lists (`flatten`)