knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(reversetranslate)

With the increase in synthetic biology applications researchers need to be able derive DNA nucleotides from a pre-specified amino acid sequence. This package is designed to assist in such applications.

Novelty

We believe this package is novel due to the following characteristics:

  1. Provides an R package solution to solve a common bioinformatic challenge. Although current reverse translational implementations exist, they are either limited to Graphical User Interfaces (GUIs) which are often non-customizable and not conducive to batch processing, or they are implemented in non-R programming languages [@Cowley2017; @Stothard2000; @jason_stajich_2016_164997; @pepsyn]. R is one of the most common bioinformatic programming languages and a robust and familiar suite of nucleotide processing functions already exists [@R-Biostrings; @biocmanager; @r]. By providing a R reverse translation function, users can continue to apply the existing R package infrastructure and take advantage of familiar data processing procedures.

  2. Allows user to provide their own Codon Frequency Table so codon selection can accurately mimic the desired translation system. Helper functions are provided so users can easily parse one of the 855,412 Codon Usage Tables the High-performance Integrated Virtual Environment-Codon Usage Tables (HIVE-CUTs) database provides [@Athey2017].

  3. Allows the user to select from one of three biologically relevant translation models (e.g. proportional, equal, or gc_biased) depending on the users intended use case.

  4. Allows the user to filter to avoid the inclusion of low frequency codons which might assist in expression efficiency.

Load Amino Acid Sequence(s) for Translation

The user must supply the amino acid sequence you want to reverse translate. The sequence must exist as a character vector and the individual amino acids must be written in a way that match the amino acids listed in your selected Codon Frequency Table. If you wish to reverse translate multiple amino acid sequences under the same conditions store them all in a single list and pass them to reverse_translate using one of the iterative functions described in more detail below.

Download Protein Amino Acid Sequence

An excellent source of protein amino acid sequences is the UniProtKB/Swiss-Prot database [@uniprot]. To reduce redundancy UniProtKB reports a canonical amino acid sequence based on the following criteria:

  1. It is the most prevalent.
  2. It is the most similar to orthologous sequences found in other species.
  3. By virtue of its length or amino acid composition, it allows the clearest description of domains, isoforms, polymorphisms, post-translational modifications, etc.
  4. In the absence of any information, we choose the longest sequence.

Protein and nucleotide strings are often stored in a .FASTA file format. You can learn more about FASTA format here. Below we will read in the amino acid sequence for the human protein Nucleotide-binding oligomerization domain-containing protein 2 (NOD2) using both base R and the very helpful Bioconductor Biostrings package [@R-Biostrings].

download.file(url = "https://www.uniprot.org/uniprot/Q9HC29.fasta", 
              destfile = "NOD2.FASTA")

Read FASTA with Base R Functions

nod2_txt <- utils::read.delim("NOD2.FASTA", comment.char = ">",
                       stringsAsFactors = FALSE, header = FALSE)

nod2_txt <- nod2_txt$V1 %>%
  paste0(collapse = "")

nod2_txt

Read FASTA Biostrings

# Install Bioconductor Installer and Biostrings Pakcage if Needed
 if (!requireNamespace("BiocManager", quietly = TRUE)) {
       install.packages("BiocManager")
 }
 if (!requireNamespace("Biostrings", quietly = TRUE)) {
       BiocManager::install("Biostrings", version = "3.8")
 }
library(Biostrings, quietly = TRUE)

# read with Biostrings
nod2_bs <- Biostrings::readAAStringSet("NOD2.FASTA")

nod2_bs

Codon Frequency Table (CFT)

The reverse_translate function requires a properly formatted Codon Frequency Table (CFT). The user can either supply their own CFT or use one of the provided commonly desired package options. A helper function build_hive_codon_tbl is provided which allows the user to take advantage of the organism specific Codon Usage Tables provided by the High-performance Integrated Virtual Environment-Codon Usage Tables (HIVE-CUTs) database [@Athey2017].

Format

A CFT is a data.frame with 3 columns and n rows equal to the number of unique codons. In the case of the standard 3 nucleotide genetic code, 64 unique codons are available. The data.frame must contain the following three columns. Additional columns are allowed and capitalization in column names does not matter.

Supplied Codon Frequency Tables

The package supplies two CFT which are automatically accessible to the user after loading the package. The CFTs were built using the build_hive_codon_tbl function off Codon Usage Tables from the HIVE-CUTs database.

Homo Sapiens

#  Built from HIVE Homo sapiens (9606) Codon Usage Table

head(hsapien_tbl)

Escheria Coli

# Built from HIVE Escherichia coli (562) Codon Usage Table

head(ecoli_tbl)

Reverse Translation

The primary function in this package is the reverse_translation function, which assists a user in generating a nucleotide sequence from a supplied amino acid sequence. Users can select between three models of reverse translation depending on their needs described below.

Proportional Model

Reverse Translating a Single Amino Acid

For demonstration purposes, I have created a made up amino acid CFT containing three amino acids, "X", "Y" and "Z", called minimal_freq_tbl which you will have access to once the package is loaded.

minimal_freq_tbl

If we were to repeatedly reverse translate the amino acid "X" with the above CFT under the proportional model, we would expect the proportion of AAA codons to be approximately equal the proportion of GGG codons and for them be twice as large as the proportion of CCC codons. We can simulate this to confirm.

set.seed(1234)
codons <- replicate(10, {
  reverse_translate(amino_acid_seq = "X", codon_tbl = minimal_freq_tbl,
                    limit = 0, model = "proportional")
})

table(codons)

As expected the number of AAA codons selected is similar to the number of GGG codons selected. Variation exists because each replication is an independent simulation with the given probabilities of choosing a specific codon. If interested, you can increase the number of replicates, or alter the given proportions, and evaluate how that changes the model.

Reverse Translating a Peptide Sequence

In most cases you will not want to reverse translate a single amino acid, but rather a peptide sequence. Lets take the below minimal_aa_seq as an example.

minimal_aa_seq

Using the same code as before we can repeatedly reverse translate our test amino acid sequence.

set.seed(1234)
codon_string <- replicate(10, {
  reverse_translate(amino_acid_seq = minimal_aa_seq, codon_tbl = minimal_freq_tbl,
                    limit = 0 , model = "proportional")
})

We can now evaluate the count of each possible codon using the Biostrings package and see again that our AAA and GGG codons are roughly equal and are twice as present as our CCC codons.

Biostrings::trinucleotideFrequency(Biostrings::DNAStringSet(codon_string), 
                                   step = 3, simplify.as="collapsed")

Equal Model

In some instances users might want the chance a given codon is selected to be random. This is similar to the circumstance when each possible codon has an equal proportion of being selected. We can test to ensure our model is correct by again simulating reverse translating the amino acid "X", but this time under the "equal" model.

set.seed(1234)
codons <- replicate(10, {
  reverse_translate(amino_acid_seq = "X", codon_tbl = minimal_freq_tbl,
                    limit = 0, model = "equal")
})

table(codons)

As expected we see the frequency of three possible amino acids are roughly equal.

GC Biased

Depending on the intended application, it might be beneficial to select codons with low GC content. The "gc_biased" model, ranks each redundant codon by the proportion of GC content, and then selects the codon with the lowest proportion of GC. If two or more codons have the same amount of GC content, a proportional model is used to select between the remaining options.

For example, if we want to reverse translate the amino acid Z in a "gc_biased" fashion, the function will first rank the available codons by GC content.

minimal_freq_tbl

In this instance the ATG and AAC codons have the lowest proportion of GC content of the 4 possible options. Now that the two low proportion codons are selected, a proportional model is run. We will again confirm this through simulating the reverse translation of the amino acid "Z" 10 times and counting the returned codons.

set.seed(1234)
codons <- replicate(10, {
  reverse_translate(amino_acid_seq = "Z", codon_tbl = minimal_freq_tbl,
                    limit = 0, model = "gc_biased")
})

table(codons)

A more real world example would be reverse translating the 1040 amino acid long NOD2 protein using the provided Homo sapiens Codon Frequency Table.

We will compare the nucleotide GC content of the reverse translation under both the proportional and GC biased model.

set.seed(1234)
nod2_nt_prop <- reverse_translate(amino_acid_seq = nod2_txt, codon_tbl = hsapien_tbl, limit = 0,
                  model = "proportional")

# Proportional Model
alphabetFrequency(DNAString(nod2_nt_prop), baseOnly = TRUE, as.prob = TRUE)


set.seed(1234)
nod2_nt_low_gc <- reverse_translate(amino_acid_seq = nod2_txt, codon_tbl = hsapien_tbl, limit = 0,
                  model = "gc_biased")
# GC Biased Model
alphabetFrequency(DNAString(nod2_nt_low_gc), baseOnly = TRUE, as.prob = TRUE)

Running Reverse Translation on Multiple Amino Acid Sequences

Using the base::lapply functions, users can easily reverse translate a list of multiple amino acid sequences.

# create list of two NOD2 amino acid seqeunces
multiple_nod2 <- list(nod2_txt, nod2_txt)
str(multiple_nod2)

# reverse translate each sequence
nod2_nt_list <- base::lapply(multiple_nod2, reverse_translate, codon_tbl = hsapien_tbl, limit = 0,
                  model = "proportional")

str(nod2_nt_list)

# use lapply again to count the base count for each 
base::lapply(DNAStringSetList(nod2_nt_list), alphabetFrequency, baseOnly = TRUE)

References



greg-botwin/reversetranslate documentation built on Aug. 24, 2019, 4:04 a.m.