codon_usage_exp: Codon analysis for ORFik experiment

codon_usage_exp

R Documentation

Codon analysis for ORFik experiment

Description

Analyse coverage per amino acid (AA) / codon and return a multitude of features, for both A-sites and P-sites (input reads must be P-sites for now). This function takes inspiration from the codonDT paper and, among other things, returns the negative binomial estimates, along with many other features.

Usage

codon_usage_exp(
  df,
  reads,
  cds = loadRegion(df, "cds", filterTranscripts(df)),
  mrna = loadRegion(df, "mrna", names(cds)),
  filter_cds_mod3 = TRUE,
  filter_table = assay(countTable(df, type = "summarized")[names(cds)]),
  faFile = df@fafile,
  min_counts_cds_filter = max(min(quantile(filter_table, 0.5), 1000), 1000),
  with_A_sites = TRUE,
  code = GENETIC_CODE,
  aligned_position = "center"
)

Arguments

`df`	an ORFik `experiment`
`reads`	either a single library (GRanges, GAlignments, GAlignmentPairs), or a list of libraries returned from `outputLibs(df)` with P-sites. If a list, it must have names corresponding to the library names.
`cds`	a GRangesList, the coding sequences, default: `loadRegion(df, "cds", filterTranscripts(df))`, longest isoform per gene.
`mrna`	a GRangesList, the full mRNA sequences (matching by names the cds sequences), default: `loadRegion(df, "mrna", names(cds))`.
`filter_cds_mod3`	logical, default TRUE. Remove all ORFs whose length is not a multiple of 3 (not mod3). This speeds up the computation considerably and usually removes malformed ORFs you would not want anyway.
`filter_table`	a numeric (integer) matrix, where rownames are the names of the full set of mRNA transcripts. It is subset to the CDSs you use, and those CDSs are then filtered from this table by the 'min_counts_cds_filter' argument.
`faFile`	`FaFile`, BSgenome, fasta/index file path or an ORFik `experiment`. This file is usually used to find the transcript sequences from some GRangesList.
`min_counts_cds_filter`	numeric, default: `max(min(quantile(filter_table, 0.50), 1000), 1000)`, which evaluates to 1000. Minimum number of counts from the 'filter_table' argument.
`with_A_sites`	logical, default TRUE. Not used yet; will also return A-site scores.
`code`	a named character vector of size 64. Default: GENETIC_CODE. Change if organism does not use the standard code.
`aligned_position`	which positions are used to calculate per-codon coverage. Default: "center", meaning positions -1, 0, 1 are used. Alternative: "left", meaning positions 0, 1, 2 are used.
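As a rough illustration of how the filtering and alignment arguments behave, the sketch below uses base R on toy data only. The object names (`cds_widths`, `toy_counts`, `offsets`) and the use of row sums for the count filter are assumptions made for illustration, not ORFik internals.

```r
# Toy data (hypothetical): CDS lengths in nucleotides and a count table
cds_widths <- c(tx1 = 300, tx2 = 301, tx3 = 900)
toy_counts <- matrix(c(50, 5, 2000, 70, 1, 1500), ncol = 2,
                     dimnames = list(names(cds_widths), c("lib1", "lib2")))

# filter_cds_mod3: keep only CDSs whose length is a multiple of 3
is_mod3 <- cds_widths %% 3 == 0
kept <- names(cds_widths)[is_mod3]

# min_counts_cds_filter: assumed here to apply to per-transcript row sums
min_counts <- 10
passes <- rowSums(toy_counts[kept, , drop = FALSE]) >= min_counts
kept <- kept[passes]

# The default threshold expression from Usage, evaluated on the toy table.
# min(x, 1000) can never exceed 1000, so the outer max() returns 1000.
default_min <- max(min(quantile(toy_counts, 0.5), 1000), 1000)

# aligned_position: offsets relative to the read position, as documented
offsets <- list(center = -1:1, left = 0:2)
```

In this toy case `tx2` is dropped by the mod3 filter, and the explicit threshold of 10 mirrors the `min_counts_cds_filter = 10` used in the Examples for small test data.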

Details

The primary column to use is "mean_txNorm"; this is the fair, normalized score.

Value

A data.table with one row per AA:codon. All values are given per library, per site (A or P) and per codon type (start, internal, stop). The table is sorted by the mean_txNorm_percentage column of the first library in the set. The columns are:

variable (character) : Library name
seq (character) : Amino acid:codon. For start codons the amino acid is "#", and for stop codons it is "*". So for human, there will be both #:ATG (the start sites) and M:ATG (internal ATGs)
sum (integer) : total counts per seq
sum_txNorm (integer) : total counts per seq normalized per tx
var (numeric) : variance of total counts per seq
N (integer) : total number of genes with this codon, per type (start, stop, internal codon)
N.total (integer) : total number of codons over all genes, per type (start, stop, internal codon)
mean_txNorm (numeric) : The recommended default output: the fair codon usage, normalized at both gene and genome level for codon and read counts
mean_txNorm_percentage : Percentage transform of mean_txNorm
dispersion : (mean^2) / (var - mean)
dispersion_txNorm : (mean_txNorm^2) / (var_txNorm - mean_txNorm)
alpha (numeric) : Dirichlet alpha method-of-moments (MOM) estimator (think of it as the mean and variance of a probability combined in one value: the lower the value, the higher the variance; the mean is decided by the relative value between samples)
relative_to_max_score (integer) : Max-scaled percentage of mean_txNorm_percentage, i.e. the ratio mean_txNorm_percentage / max(mean_txNorm_percentage) expressed as a percentage
type (factor(character)) : "P" or "A"
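To make the formulas in the list above concrete, here is a small base R sketch on made-up numbers. The vector `counts`, the toy `mean_txNorm` values, and the reading of the percentage transform as a share of the column total are assumptions for illustration; ORFik's internal computation may differ.

```r
# Hypothetical per-gene counts for one codon (overdispersed: var > mean)
counts <- c(2, 0, 7, 1, 10, 4)
m <- mean(counts)  # arithmetic mean of the counts
v <- var(counts)   # sample variance of the counts

# Method-of-moments dispersion as given above: mean^2 / (var - mean)
dispersion <- m^2 / (v - m)

# Toy mean_txNorm values for three AA:codon entries (made up)
mean_txNorm <- c("K:AAA" = 0.5, "K:AAG" = 1.5, "#:ATG" = 2.0)

# Percentage transform, assumed here to be the share of the column total
mean_txNorm_percentage <- 100 * mean_txNorm / sum(mean_txNorm)

# Max-scaled percentage, following the relative_to_max_score description
relative_to_max_score <-
  100 * mean_txNorm_percentage / max(mean_txNorm_percentage)
```

Note that the dispersion expression is only positive and finite when the variance exceeds the mean, so an estimate for a codon with var <= mean should be read with care.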

References

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7196831/

Examples

df <- ORFik.template.experiment()[9:10,] # Subset to 2 Ribo-seq libs
## For single library
res <- codon_usage_exp(df, fimport(filepath(df[1,], "pshifted")),
                 min_counts_cds_filter = 10)
# mean_txNorm is the advised scoring column
# codon_usage_plot(res, res$mean_txNorm)
# Default for plot function is the percentage scaled version of mean_txNorm
# codon_usage_plot(res) # This gives check error
## For multiple libs
res2 <- codon_usage_exp(df, outputLibs(df, type = "pshifted", output.mode = "list"),
                 min_counts_cds_filter = 10)
# codon_usage_plot(res2)

JokingHero/ORFik documentation built on June 9, 2025, 8:46 p.m.