under construction
devtools::load_all(".")
library(gcatbase)
Note: a tuple or a codon is considered to be a special case of a nucleotide sequence.
all_tuples
)The function all_tuples
creates all tuples. It expects the tuple size tsize
and an alphabet $\Sigma$.
(rextendr
does not have parameter names, does it?)
sigma = c("+", "-") # Alphabet of two letters tuples = all_tuples(tsize = 2, alphabet = sigma) print(tuples)
Codons can be created with the all_tuples
this way. By default, the alphabet are DNA bases.
codons = all_tuples(3) print(codons)
alphabet
)alphabet
determines for a sequence or a vector of sequences the underlying alphabet including its type.
The type type
can be "RNA" or "DNA" or "unkown".
alphabet1 = alphabet("AUGC") # one sequence print(alphabet1$letters) print(alphabet1$type) print(alphabet1)
alphabet2 = alphabet(c("ATG", "CTT", "GAG", "CCG")) # vector print(alphabet2)
sample.code <- gcatbase::code(c("ATG", "CTT", "GAG", "CCG"), id="sample.code") alphabet2 = alphabet(sample.code) # vector print(alphabet2)
normalize
)There are many ways to represent nucleotide sequences and codons or tuples in general:
gcatbase
offers functions to convert the representation. The default representation is DNA in upper-case.
normalize
converts a sequence or a vector of sequences (e.g. codons) to a specified representation.
rna_codons = normalize(codons, RNA = TRUE, lowercase = TRUE) print(rna_codons)
Of course, we can only normalize a sequence as well:
seq = "ATCATGCCCACGGCGCGGAGGATAGCGAGCAGGT" seqrna = normalize(seq, RNA = TRUE) # Uppercase is default print(seqrna)
code
)Codes are created with the code
function. If the tuples passed to code
are not unique they are made unique.
# Non unique tuple set: X = code(tuples = c("ATG", "GCG", "ATG"), id = "A code")
This will return an object of type gcat.code
. The attributes are:
id
: a short identifier for this code.tuples
: unique vector of tuples.tsize
: tuple size (e.g. 3 if codons)Note that the code is now unique. The elements are sorted.
print(X)
In particular for codons the summary
function is available.
print(summary(X))
It is also possible to create a code from a single sequence that is decomposed into tuples by means of the split
function. split
is explained in more detail below.
random_code
)It is also possible to create random codes.
Xr = random_code(size = 10, tsize = 4) print(Xr)
write_codes
and read_codes
)Codes can be written to and read from a file.
write_codes(filename = "codes.txt", codes = list(X, Xr), header = "# My codes file")
This will create a file codes.txt
with the following content:
s = readLines("codes.txt") cat(s[1], "\n", s[2], "\n", s[3])
codes = read_codes("codes.txt") print(codes)
split
, tuples_usage
)print(seq) t = gcatbase::split(seq, 3) print(t)
Next we want to calculate the tuple usage. tuples_usage
creates a data frame.
u = tuples_freq(t) print(head(u, 10))
Make a code from the sequence. Note that code
makes the tuples unique:
X2 = code(t, "Code from seq.") print(X2)
split
can also be used to create a code from a sequence.
X3 = code(split("AUGUGAGC", tsize = 3), id = "Code from sequence") print(X3)
Another argument for split
is sep
. It can only be used when tsize
is ommited. The sequence is splitted according to the value. This is useful when codons (or tuples) are copied and pasted from a text.
X5 = code(split("AUG,UUG,GCG", sep = ","), id = "Code from separator") print(X5)
shift
)X2s = code(shift(X2, 1), "shifted code") print(X2s)
compl
, rev_compl
)The complementary tuples are for instance:
print(t) print(compl(t))
The anticodons of the tuples in t
are:
print(rev_compl(t))
classify
, code_usage
)h = classify(seq, X) print(h)
From this we can derive the code usage $u$ (in %)
u = length(h[h == 1]) / length(h) print(u)
or use the convenient function code_usage
:
u = code_usage(seq, X) print(u)
amino_acids
)We can translate a vector of codons into amino acids:
print(amino_acids(t))
Also, we can ask what amino acids are used by a code. Here the Vertebrate Mitochondrial Code (transl_table = 2) is used:
print(amino_acids(X, numcode = 2)) # generic for amino_acids(X$tuples)
Furthermore, the data frame aa_prop
contains some important chemical properties of amino acids as published in Haig, D. and Hurst, L.D. (1991).
print(aa_prop)
The columns are according to the paper:
Polar
"Woese et al.'s (1966) polar requirement is the slope of the line that results when $$\log(1 -
RF)/RF$$ for free amino acids is plotted against the log mole fraction of water in pyridine solvent."Hydropathy
"Kyte and Doolittle's (1982) hydropathy is based on water-vapor transfer free energies, the interior-exterior distribution of amino acid side-chains, and the subjective judgment of the authors"Volume
"Grantham's (1974) molecular volume of side chains is the residue volume minus a constant peptide volume. "Isoelectric
"The values of isoelectric points were taken from Alff-Steinberger (1969)."flatten
)Sometimes we need to perform some operations over entries in a Fasta file. For instance, we want to count the tuples in a Fasta record. First, the file is read and converted into GCAT's default representation:
fasta = seqinr::read.fasta("../inst/extdata/ena-sars.fasta", as.string = TRUE, forceDNAtolower = FALSE)
A typical question is the frequencies of all codons in the Fasta file. The following code calculates the frequencies:
# Generic iteration ntuples = sapply(fasta, function(f) { seq = as.character(f) gcatbase::split(seq) # individual operation }) tuples = unlist(ntuples) # Now the analysis: freq = tuples_freq(tuples) print(head(freq))
Except of the individual operation gcatbase::split(seq)
the iteration is generic and can be made available as a pattern.
This pattern is implemented in GCAT's flatten
function. The first paramter is the list (here the Fasta file) and the second the function (operator) to be applied (here the split function).
codons = gcatbase::flatten(fasta, gcatbase::split) freq = tuples_freq(codons) print(head(freq))
Of course, this is also possible for quadrupels in frame 1.
op = function(seq) { seq1 = truncate(seq, 1) # Frame 1 gcatbase::split(seq1, 4) # Make quadruples. } tuples4 = gcatbase::flatten(fasta, op) freq = tuples_freq(tuples4) print(head(freq))
Also possible for a code and compatible to rextendr
.
X = Code$new(c("AUG"), id = "Foo") X$tuples X$tsize X$alphabet
Haig, D. and Hurst, L.D. (1991) ‘A quantitative measure of error minimization in the genetic code’, Journal of Molecular Evolution, 33(5), pp. 412–417. doi:10.1007/BF02103132.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.