codon_usage_psite: Empirical codon usage indexes.

View source: R/codon_usage.R

codon_usage_psiteR Documentation

Empirical codon usage indexes.

Description

This function computes empirical codon usage indexes based on either the ribosome P-sites, A-site or E-site frequency associated with in-frame P-sites falling in the coding sequence. For each sample it computes 64 triplet-specific codon usage indexes ranging from 0 to 1 and optionally normalized for the frequency of the corresponding triplet within the CDSs. This function also allows to compare either two sets of codon usage indexes from different samples or the set of codon usage indexes computed for one sample and 64 triplet-specific values provided by the user. Multiple samples and replicates can be handled.

Usage

codon_usage_psite(
  data,
  annotation,
  sample,
  multisamples = "average",
  plot_style = "split",
  site = "psite",
  frequency_normalization = TRUE,
  contrast_sample = NULL,
  codon_values = NULL,
  fastapath = NULL,
  fasta_genome = TRUE,
  refseq_sep = NULL,
  bsgenome = NULL,
  gtfpath = NULL,
  txdb = NULL,
  dataSource = NA,
  organism = NA,
  transcripts = NULL,
  length_range = NULL,
  cl = 100,
  include_stop_codons = TRUE,
  label_scatter = FALSE,
  label_number = 64,
  label_aminoacid = FALSE
)

Arguments

data

Either list of data tables or GRangesList object from psite_info. Each data table may or may not include one or more columns among p_site_codon, a_site_codon and e_site_codon reporting the three nucleotides covered by the P-site, A-site and E-site, respectively. These columns can be previously generated by the psite_info function. Otherwise, the column of interest must be specified by site and it is automatically generated starting from a FASTA file or a BSgenome data package.

annotation

Data table as generated by create_annotation.

sample

Either character string, character string vector or named list of character string(s)/character string vector(s) specifying the name of the sample(s) and replicate(s) of interest. If a list is provided, each element of the list is considered as an independent sample associated with one ore multiple replicates. Multiple samples and replicates are handled and organized according to multisamples and plot_style.

multisamples

Either "average" or "independent". It specifies how to handle multiple samples and replicates stored in sample:

  • if sample is a character string vector and multisamples is set to "average", the elements of the vector are considered as replicates of one sample.

  • if sample is a character string vector and multisamples is set to "independent", each element of the vector is analysed independently of the others.

  • if sample is a list, multisamples must be set to "average". Each element of the list is analysed independently of the others, its replicates averaged and its name reported in the plot. Note: when this parameter is set to "average" the bar plot associated with each sample displays the triplet-specific mean signal computed across the replicates and the corresponding standard error is also reported. Scatter plots (see contrast_sample) display the mean signal but not the standard error to streamline the plot. Default is "average".

plot_style

Either "split" or "facet". It specifies how to organize and display multiple bar plots:

  • "split": one bar plot for each sample is returned as an independent ggplot object. In each plot the bars are arranged in increasing order according to the codon usage indexes.

  • "facet": the bar plots are placed one below the other, in independent boxes. The bars of all plots are arranged following the alphabetical order of the codons. Note: this parameter is considered only if contrast_sample is not specified and thus bar plots are generated. Default is "split".

site

Either "psite, "asite", "esite". It specifies if the empirical codon usage indexes should be based on ribosome P-sites ("psite"), A-sites ("asite") or E-sites ("esite"). Default is "psite".

frequency_normalization

Logical value whether to normalize the 64 codon usage indexes for the corresponding codon frequencies in coding sequences. Default is TRUE. Note: the low frequency of the three stop codons often leads to codon usage indexes higher for these triplets than for the others. To discard the stop codons from the plots and avoid potential biases in the interpretation of the data, see include_stop_codons.

contrast_sample

Either character string or character string vector. It specifies the sample(s) (if any) to be considered for the comparison of codon usage indexes between: exactly two elements from those listed either in sample (if sample is a character string or a character string vector) or in names(sample) (if sample is a list).

  • one sample and 64 triplet-specific values: contrast_sample must be one character string from those listed either in sample (if sample is a character string or a character string vector) or in names(sample) (if sample is a list). Moreover, codon_values must be provided. Note: when this parameter is specified the function generates a scatter plot instead of one or multiple bar plots, it performs a linear regression and computes the Pearson correlation coefficient. See also label_scatter, label_number and label_aminoacid. Default is "NULL".

codon_values

Data table containing 64 triplet-specific values to be compared with the empirical codon usage indexes computed for the sample specified in contrast_sample. The data table must contain the DNA or RNA nucleotide sequence of the 64 codons and the corresponding values arranged in two columns named codon and value, respectively. Additional columns are ignored. This parameter is considered only if contrast_sample is specified. Default is NULL.

fastapath

Character string specifying the FASTA file used in the alignment step, including its path, name and extension. This file can contain reference nucleotide sequences either of genome asseblies (chromosome sequences) or of transcripts (see Details and fasta_genome). Please make sure the sequences derive from the same release of the annotation file used in the create_annotation function. Note: either fastapath or bsgenome is required to compute the frequency of each codon along sequences, used as normalization factors, even if data includes one or more columns among p_site_codon, a_site_codon and e_site_codon. Default is NULL.

fasta_genome

Logical value whether the FASTA file specified by fastapath contains nucleotide sequences of genome asseblies (chromosome sequences). If TRUE (the default), an annotation object is required (see gtfpath and txdb). FALSE implies nucleotide sequences of transcripts are provided instead.

refseq_sep

Character specifying the separator between reference sequences' name and additional information to discard, stored in the headers of the FASTA file specified by fastapath (if any). It might be required for matching the reference sequences' identifiers reported in the input list of data tables. All characters before the first occurrence of the specified separator are kept. Default is NULL i.e. no string splitting is performed.

bsgenome

Character string specifying the BSgenome data package with the genome sequences to be loaded. If not already present in the system, it is automatically installed through the biocLite.R script (check the list of available BSgenome data packages by running the available.genomes function of the BSgenome package). This parameter must be coupled with an annotation object (see gtfpath and txdb). Please make sure the sequences included in the specified BSgenome data pakage are in agreement with the sequences used in the alignment step. Note: either fastapath or bsgenome is required to compute the frequency of each codon along sequences, used as normalization factors, even if data includes one or more columns among p_site_codon, a_site_codon and e_site_codon. Default is NULL.

gtfpath

Character string specifying the location of a GTF file, including its path, name and extension. Please make sure the GTF file and the sequences specified by fastapath or bsgenome derive from the same release. Note that either gtfpath or txdb is required if and only if nucleotide sequences of genome assemblies (chromosome sequences) are provided (see fastapath or bsgenome). Default is NULL.

txdb

Character string specifying the TxDb annotation package to be loaded. If not already present in the system, it is automatically installed through the biocLite.R script (check here the list of available TxDb annotation packages). Please make sure the TxDb annotation package and the sequences specified by fastapath or bsgenome derive from the same release. Note that either gtfpath or txdb is required if and only if nucleotide sequences of genome assemblies (chromosome sequences) are provided (see fastapath or bsgenome). Default is NULL.

dataSource

Optional character string describing the origin of the GTF data file. This parameter is considered only if gtfpath is specified. For more information about this parameter please refer to the description of dataSource of the makeTxDbFromGFF function included in the GenomicFeatures package.

organism

Optional character string reporting the genus and species of the organism of the GTF data file. This parameter is considered only if gtfpath is specified. For more information about this parameter please refer to the description of organism of the makeTxDbFromGFF function included in the GenomicFeatures package.

transcripts

Character string vector listing the name of transcripts to be included in the analysis. Default is NULL i.e. all transcripts are used. Please note: transcripts without annotated CDS and transcripts whose coding sequence length is not divisible by 3 are automatically discarded.

length_range

Integer or integer vector for restricting the plot to a chosen range of read lengths. Default is NULL, i.e. all read lengths are used. If specified, this parameter prevails over cl.

cl

Integer value in 1,100 specifying a confidence level for restricting the plot to an automatically-defined range of read lengths. The new range is computed according to the most frequent read lengths, which accounts for the cl% of the sample and is defined by discarding the (100-cl)% of read lengths falling in the tails of the read lengths distribution. If multiple samples are analysed, a single range of read lengths is computed such that at least the cl% of all samples is represented. Default is 100.

include_stop_codons

Logical value whether to include the three stop codons in the plots. Default is TRUE.

label_scatter

Logical value whether to label the dots in the scatter plot. Each dot is labeled using either the nucleotide sequence of the codon or the corresponding amino acid symbol (see label_aminoacid). This parameter is considered only if contrast_sample is specified. Default is FALSE.

label_number

Integer value in 1,64 specifying how many dots in the scatter plot should be labeled. Dots farthest from the confident interval of the regression line are automatically identified and labeled. Default is 64 i.e. all dots are labeled. This parameter is considered only if label_scatter is TRUE.

label_aminoacid

Logical value whether to use amino acid symbols to label the dots of the scatter plot. Default is FALSE i.e. codon nucleotide sequences are used instead. This parameter is considered only if label_scatter is TRUE.

Details

riboWaltz only works for read alignments based on transcript coordinates. This choice is due to the main purpose of RiboSeq assays to study translational events through the isolation and sequencing of ribosome protected fragments. Most reads from RiboSeq are supposed to map on mRNAs and not on introns and intergenic regions. BAM based on transcript coordinates can be generated in two ways: i) aligning directly against transcript sequences; ii) aligning against sequences of genome assemblies i.e. standard chromosome sequences, thus requiring the outputs to be translated in transcript coordinates. The first option can be easily handled by many aligners (e.g. Bowtie), given a reference FASTA file where each sequence represents a transcript, from the beginning of the 5' UTR to the end of the 3' UTR. The second procedure is based on reference FASTA files where each sequence represents a chromosome, usually coupled with comprehensive gene annotation files (GTF or GFF). The STAR aligner, with its option –quantMode TranscriptomeSAM (see Chapter 6 of its manual), is an example of tool providing such a feature.

Value

List containing: one or more ggplot object(s) and the data table with the corresponding x- and y-axis values ("plot_dt"); an additional data table with raw, normalized and scaled number of P-sites associated with the 64 triplets for each sample contributing to the plot ("count_dt").

Examples

## data(reads_list)
## data(mm81cdna)
##
## ## Generate fake samples and replicates
## for(i in 2:6){
##   samp_name <- paste0("Samp", i)
##   set.seed(i)
##   reads_list[[samp_name]] <- reads_list[["Samp1"]][sample(.N, 5000)]
## }
##
## ## Compute and add p-site details
## psite_offset <- psite(reads_list, flanking = 6, extremity = "auto")
## reads_psite_list <- psite_info(reads_list, psite_offset)
##
## ## Define the list of samples and replicate to use as input
## input_samples <- list("S1" = c("Samp1", "Samp2"),
##                       "S2" = c("Samp3", "Samp4", "Samp5"),
##                       "S3" = c("Samp6"))
##
## Generate bar plots, alignment on transcript sequences
## example_cu_barplot <- codon_usage_psite(reads_psite_list, mm81cdna,
##                                        sample = input_samples,
##                                        multisamples = "average",
##                                        plot_style = "facet",
##                                        fastapath = "path/to/transcriptome/FASTA/file",
##                                        fasta_genome = FALSE)
##
## Generate bar plots,  alignment on chromosome sequences
## example_cu_barplot <- codon_usage_psite(reads_psite_list, mm81cdna,
##                                        sample = input_samples,
##                                        multisamples = "average",
##                                        plot_style = "facet",
##                                        fastapath = "path/to/chromosome/FASTA/file",
##                                        fasta_genome = TRUE) 
##
## Generate scatterplot comparing two samples
## example_cu_barplot <- codon_usage_psite(reads_psite_list, mm81cdna,
##                                        sample = input_samples,
##                                        contrast_sample = c("S1", "S2"),
##                                        fastapath = "path/to/transcriptome/FASTA/file",
##                                        fasta_genome = FALSE,
##                                        frequency_normalization = FALSE)

LabTranslationalArchitectomics/riboWaltz documentation built on Jan. 17, 2024, 12:18 p.m.