codon_usage_psite | R Documentation |
This function computes empirical codon usage indexes based on the frequency of ribosome P-sites, A-sites or E-sites associated with in-frame P-sites falling in the coding sequence. For each sample it computes 64 triplet-specific codon usage indexes ranging from 0 to 1, optionally normalized for the frequency of the corresponding triplet within the CDSs. This function also allows comparing either two sets of codon usage indexes from different samples, or the set of codon usage indexes computed for one sample and 64 triplet-specific values provided by the user. Multiple samples and replicates can be handled.
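As a rough illustration of the idea described above, the following base R sketch builds frequency-normalized indexes in the 0 to 1 range from toy numbers. The counts, the codon frequencies and the scaling step (division by the maximum) are assumptions made for the example; the exact scaling applied by the function is not stated here.

```r
# Hypothetical in-frame P-site counts for three codons (toy values)
psite_counts <- c(AAA = 120, GCT = 300, TAA = 5)

# Hypothetical number of occurrences of each codon within the CDSs (toy values)
cds_freq <- c(AAA = 400, GCT = 1500, TAA = 10)

# Normalize the counts for the codon frequency in coding sequences
normalized <- psite_counts / cds_freq

# Scale to the 0-1 range (assumption: divide by the largest value)
usage_index <- normalized / max(normalized)

print(round(usage_index, 2))
```

In this toy case the rare stop codon TAA ends up with the highest index even though it has the fewest P-sites, which is the effect mentioned in the note on frequency_normalization.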
codon_usage_psite(
data,
annotation,
sample,
multisamples = "average",
plot_style = "split",
site = "psite",
frequency_normalization = TRUE,
contrast_sample = NULL,
codon_values = NULL,
fastapath = NULL,
fasta_genome = TRUE,
refseq_sep = NULL,
bsgenome = NULL,
gtfpath = NULL,
txdb = NULL,
dataSource = NA,
organism = NA,
transcripts = NULL,
length_range = NULL,
cl = 100,
include_stop_codons = TRUE,
label_scatter = FALSE,
label_number = 64,
label_aminoacid = FALSE
)
data |
Either list of data tables or GRangesList object from
|
annotation |
Data table as generated by |
sample |
Either character string, character string vector or named list
of character string(s)/character string vector(s) specifying the name of
the sample(s) and replicate(s) of interest. If a list is provided, each
element of the list is considered as an independent sample associated with
one or multiple replicates. Multiple samples and replicates are handled
and organized according to |
multisamples |
Either "average" or "independent". It specifies how to
handle multiple samples and replicates stored in
|
plot_style |
Either "split" or "facet". It specifies how to organize and display multiple bar plots:
|
site |
Either "psite", "asite" or "esite". It specifies whether the empirical codon usage indexes should be based on ribosome P-sites ("psite"), A-sites ("asite") or E-sites ("esite"). Default is "psite". |
frequency_normalization |
Logical value whether to normalize the
64 codon usage indexes for the corresponding codon frequencies in coding
sequences. Default is TRUE.
Note: the low frequency of the three stop codons often leads to codon usage
indexes higher for these triplets than for the others. To discard the stop
codons from the plots and avoid potential biases in the interpretation of
the data, see |
contrast_sample |
Either character string or character string vector. It
specifies the sample(s) (if any) to be considered for the comparison of
codon usage indexes between:
exactly two elements from those listed either in
|
codon_values |
Data table containing 64 triplet-specific values to be
compared with the empirical codon usage indexes computed for the sample
specified in |
fastapath |
Character string specifying the FASTA file used in the
alignment step, including its path, name and extension. This file can
contain reference nucleotide sequences either of genome assemblies
(chromosome sequences) or of transcripts (see |
fasta_genome |
Logical value whether the FASTA file specified by
|
refseq_sep |
Character specifying the separator between reference
sequences' name and additional information to discard, stored in the
headers of the FASTA file specified by |
bsgenome |
Character string specifying the BSgenome data package with
the genome sequences to be loaded. If not already present in the system, it
is automatically installed through the biocLite.R script (check the list of
available BSgenome data packages by running the
|
gtfpath |
Character string specifying the location of a GTF file,
including its path, name and extension. Please make sure the GTF file and
the sequences specified by |
txdb |
Character string specifying the TxDb annotation package to be
loaded. If not already present in the system, it is automatically installed
through the biocLite.R script (see the list of available TxDb annotation
packages). Please make sure the TxDb
annotation package and the sequences specified by |
dataSource |
Optional character string describing the origin of the GTF
data file. This parameter is considered only if |
organism |
Optional character string reporting the genus and species of
the organism of the GTF data file. This parameter is considered only if
|
transcripts |
Character string vector listing the name of transcripts to be included in the analysis. Default is NULL i.e. all transcripts are used. Please note: transcripts without annotated CDS and transcripts whose coding sequence length is not divisible by 3 are automatically discarded. |
length_range |
Integer or integer vector for restricting the plot to a
chosen range of read lengths. Default is NULL, i.e. all read lengths are
used. If specified, this parameter prevails over |
cl |
Integer value in [1,100] specifying a confidence level for restricting the plot to an automatically-defined range of read lengths. The new range is computed from the most frequent read lengths, which account for cl% of the sample, and is defined by discarding the (100-cl)% of read lengths falling in the tails of the read length distribution. If multiple samples are analysed, a single range of read lengths is computed such that at least cl% of all samples is represented. Default is 100. |
include_stop_codons |
Logical value whether to include the three stop codons in the plots. Default is TRUE. |
label_scatter |
Logical value whether to label the dots in the scatter
plot. Each dot is labeled using either the nucleotide sequence of the codon
or the corresponding amino acid symbol (see |
label_number |
Integer value in [1,64] specifying how many dots in the
scatter plot should be labeled. Dots farthest from the confidence interval
of the regression line are automatically identified and labeled. Default is
64, i.e. all dots are labeled. This parameter is considered only if
|
label_aminoacid |
Logical value whether to use amino acid symbols to
label the dots of the scatter plot. Default is FALSE i.e. codon nucleotide
sequences are used instead. This parameter is considered only if
|
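The read length restriction controlled by cl can be pictured with the base R sketch below. It assumes that the discarded (100-cl)% is split equally between the two tails and uses simulated read lengths; it illustrates the description of cl and is not the function's actual implementation.

```r
# Simulated read lengths (toy data, for illustration only)
set.seed(1)
read_lengths <- sample(25:35, 1000, replace = TRUE,
                       prob = dnorm(25:35, mean = 30, sd = 2))

cl <- 95  # confidence level, as in the cl parameter

# Assumption: the discarded (100 - cl)% is split equally between the two tails
tail_prob <- (1 - cl / 100) / 2
lower <- floor(quantile(read_lengths, tail_prob))
upper <- ceiling(quantile(read_lengths, 1 - tail_prob))
length_range <- lower:upper

# Reads whose length falls inside the automatically defined range
kept <- read_lengths[read_lengths %in% length_range]
print(range(length_range))
```

With cl = 100 both tail probabilities are zero, so the range spans every observed read length and nothing is discarded.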
riboWaltz only works for read alignments based on transcript coordinates. This choice reflects the main purpose of RiboSeq assays, which is to study translational events through the isolation and sequencing of ribosome protected fragments. Most reads from RiboSeq are expected to map on mRNAs and not on introns and intergenic regions. BAM files based on transcript coordinates can be generated in two ways: i) aligning directly against transcript sequences; ii) aligning against sequences of genome assemblies, i.e. standard chromosome sequences, thus requiring the outputs to be translated into transcript coordinates. The first option can be easily handled by many aligners (e.g. Bowtie), given a reference FASTA file where each sequence represents a transcript, from the beginning of the 5' UTR to the end of the 3' UTR. The second procedure is based on reference FASTA files where each sequence represents a chromosome, usually coupled with comprehensive gene annotation files (GTF or GFF). The STAR aligner, with its option --quantMode TranscriptomeSAM (see Chapter 6 of its manual), is an example of a tool providing such a feature.
List containing: one or more ggplot object(s) and the data table with the corresponding x- and y-axis values ("plot_dt"); an additional data table with raw, normalized and scaled number of P-sites associated with the 64 triplets for each sample contributing to the plot ("count_dt").
## data(reads_list)
## data(mm81cdna)
##
## ## Generate fake samples and replicates
## for(i in 2:6){
## samp_name <- paste0("Samp", i)
## set.seed(i)
## reads_list[[samp_name]] <- reads_list[["Samp1"]][sample(.N, 5000)]
## }
##
## ## Compute and add p-site details
## psite_offset <- psite(reads_list, flanking = 6, extremity = "auto")
## reads_psite_list <- psite_info(reads_list, psite_offset)
##
## ## Define the list of samples and replicate to use as input
## input_samples <- list("S1" = c("Samp1", "Samp2"),
## "S2" = c("Samp3", "Samp4", "Samp5"),
## "S3" = c("Samp6"))
##
## ## Generate bar plots, alignment on transcript sequences
## example_cu_barplot <- codon_usage_psite(reads_psite_list, mm81cdna,
## sample = input_samples,
## multisamples = "average",
## plot_style = "facet",
## fastapath = "path/to/transcriptome/FASTA/file",
## fasta_genome = FALSE)
##
## ## Generate bar plots, alignment on chromosome sequences
## example_cu_barplot <- codon_usage_psite(reads_psite_list, mm81cdna,
## sample = input_samples,
## multisamples = "average",
## plot_style = "facet",
## fastapath = "path/to/chromosome/FASTA/file",
## fasta_genome = TRUE)
##
## ## Generate scatterplot comparing two samples
## example_cu_scatter <- codon_usage_psite(reads_psite_list, mm81cdna,
## sample = input_samples,
## contrast_sample = c("S1", "S2"),
## fastapath = "path/to/transcriptome/FASTA/file",
## fasta_genome = FALSE,
## frequency_normalization = FALSE)