codon_usage: Codon usage
In JokingHero/ORFik: Open Reading Frames in Genomics

Description

For each amino acid / codon pair, analyse the coverage and compute a range of features, for both A-sites and P-sites (input reads must be P-sites for now). This function takes inspiration from the codonDT paper and returns, among other things, the negative binomial estimates, along with many additional features.

Usage

codon_usage(
  reads,
  cds,
  mrna,
  faFile,
  filter_table,
  filter_cds_mod3 = TRUE,
  min_counts_cds_filter = max(min(quantile(filter_table, 0.5), 1000), 1000),
  with_A_sites = TRUE,
  aligned_position = "center",
  code = GENETIC_CODE
)

Arguments

`reads`	either a single library (GRanges, GAlignments, GAlignmentPairs), or a list of libraries returned from `outputLibs(df)` with p-sites. If a list, it must have names corresponding to the library names.
`cds`	a GRangesList
`mrna`	a GRangesList
`faFile`	a FaFile from genome
`filter_table`	a matrix / vector of length equal to cds
`filter_cds_mod3`	logical, default TRUE. Remove all ORFs whose length is not a multiple of 3. This speeds up the computation considerably and usually removes malformed ORFs you would not want anyway.
`min_counts_cds_filter`	numeric, default: `max(min(quantile(filter_table, 0.50), 1000), 1000)`. Minimum number of counts, taken from the 'filter_table' argument, that a CDS must have to be kept.
`with_A_sites`	logical, default TRUE. Not used yet; will also return A-site scores.
`aligned_position`	which positions are used to calculate per-codon coverage. Default: "center", meaning that positions -1, 0, 1 are taken. Alternative: "left", in which case positions 0, 1, 2 are taken.
`code`	a named character vector of size 64. Default: GENETIC_CODE. Change if organism does not use the standard code.
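
The filter arguments and `aligned_position` can be illustrated with a minimal base R sketch on toy numbers. This is not ORFik's internal code: the vectors `cds_width` and `counts` and the helper `codon_window` are hypothetical, and treating the threshold as "at least" is an assumption.

```r
# Toy stand-ins (hypothetical): total CDS width and read count per CDS.
cds_width <- c(tx1 = 300, tx2 = 301, tx3 = 150, tx4 = 600)
counts    <- c(tx1 = 2500, tx2 = 4000, tx3 = 40, tx4 = 1200)

# filter_cds_mod3: keep only CDSs whose width is a multiple of 3.
is_mod3 <- cds_width %% 3 == 0

# Default expression shown under Usage. min(x, 1000) can never exceed 1000,
# so the outer max() makes the expression evaluate to 1000 for any input.
threshold <- max(min(quantile(counts, 0.5), 1000), 1000)

# min_counts_cds_filter: keep CDSs with at least `threshold` counts
# (assumption: the comparison is inclusive).
keep <- is_mod3 & counts >= threshold
names(cds_width)[keep]

# aligned_position: relative positions used per codon, as documented above.
codon_window <- function(aligned_position = c("center", "left")) {
  aligned_position <- match.arg(aligned_position)
  switch(aligned_position, center = -1:1, left = 0:2)
}
codon_window("center")
codon_window("left")
```

Since the default threshold is a constant in practice, passing an explicit `min_counts_cds_filter` (as the example at the end of this page does) is the way to adapt it to low-coverage data.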

Details

The primary column to use is "mean_txNorm"; this is the fair normalized score.

Value

A data.table with one row per AA:codon combination. All values are given per library, per site (A or P), and per codon type (start, internal, stop). Rows are sorted by the mean_txNorm_percentage column of the first library in the set. The columns are:

variable (character) : Library name
seq (character) : Amino acid:codon. For start codons the amino acid is "#", and for stop codons it is "*". So for human there will be both #:ATG (the start sites) and M:ATG (internal ATGs)
sum (integer) : total counts per seq
sum_txNorm (integer) : total counts per seq normalized per tx
var (numeric) : variance of total counts per seq
N (integer) : total number of genes with this codon, per type (start, stop, internal codon)
N.total (integer) : total number of codons over all genes, per type (start, stop, internal codon)
mean_txNorm (numeric) : The default output to use: the fair codon usage, normalized at both gene and genome level for codon and read counts
mean_txNorm_percentage : Percentage transform of mean_txNorm
dispersion : (mean^2) / (var - mean)
dispersion_txNorm : (mean_txNorm^2) / (var_txNorm - mean_txNorm)
alpha (numeric) : Dirichlet alpha method-of-moments (MOM) estimator. It summarizes the mean and variance of a probability in a single value: the lower the value, the higher the variance; the mean is determined by the relative value between samples
relative_to_max_score (integer) : Max-scaled percentage, i.e. the ratio mean_txNorm_percentage / max(mean_txNorm_percentage) expressed as a percentage
type (factor(character)) : "P" or "A"
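
The derived columns can be traced with a small base R sketch on toy numbers. The `dispersion` line follows the formula stated in the list, which has the form of a method-of-moments size estimate for a negative binomial. The percentage and max-scaling lines are assumptions about what "percentage transform" and "max scaled" mean here, and all vectors are hypothetical.

```r
# Hypothetical per-codon summaries for three codons of one library.
mean_counts <- c("K:AAA" = 12, "K:AAG" = 20, "E:GAA" = 8)
var_counts  <- c("K:AAA" = 60, "K:AAG" = 120, "E:GAA" = 24)

# dispersion as documented: (mean^2) / (var - mean).
# Only meaningful when var > mean, i.e. for overdispersed counts.
dispersion <- mean_counts^2 / (var_counts - mean_counts)

# Assumed percentage transform: each score as a share of the column total.
mean_txNorm <- c("K:AAA" = 0.3, "K:AAG" = 0.5, "E:GAA" = 0.2)
mean_txNorm_percentage <- 100 * mean_txNorm / sum(mean_txNorm)

# Assumed max scaling: percentage relative to the highest-scoring codon,
# so the top codon gets 100.
relative_to_max_score <-
  100 * mean_txNorm_percentage / max(mean_txNorm_percentage)

round(dispersion, 2)
round(relative_to_max_score, 1)
```

When the variance is equal to or smaller than the mean, the denominator is zero or negative, so the dispersion value for such a codon is best treated with caution.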

References

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7196831/

Examples

library(ORFik)
df <- ORFik.template.experiment()[9:10,] # Subset to 2 Ribo-seq libs

## For single library
reads <- fimport(filepath(df[1,], "pshifted"))
cds <- loadRegion(df, "cds", filterTranscripts(df))
mrna <- loadRegion(df, "mrna", names(cds))
filter_table <- assay(countTable(df, type = "summarized")[names(cds)])
faFile <- findFa(df)
res <- codon_usage(reads, cds, mrna, faFile = faFile,
             filter_table = filter_table, min_counts_cds_filter = 10)
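
Once a result is available, a typical next step is to rank codons by the primary score. The sketch below uses a mock data.table carrying only a few of the documented columns, so it runs without ORFik. The library name `RFP_WT_r1` and all values are invented for illustration; with a real result the same subsetting would apply to `res`, assuming it carries the columns listed under Value.

```r
library(data.table)

# Mock result with a subset of the documented columns (values are made up).
res_mock <- data.table(
  variable    = "RFP_WT_r1",
  seq         = c("K:AAA", "K:AAG", "#:ATG", "K:AAA"),
  mean_txNorm = c(0.8, 1.4, 2.1, 0.6),
  type        = factor(c("P", "P", "P", "A"))
)

# Keep P-site rows and rank codons by mean_txNorm, highest first.
p_sites <- res_mock[type == "P"][order(-mean_txNorm)]
p_sites$seq

# Compare the two lysine codons at the P site.
p_sites[grepl("^K:", seq), .(seq, mean_txNorm)]
```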

JokingHero/ORFik documentation built on June 9, 2025, 8:46 p.m.