detect_ribo_orfs: Detect ORFs by Ribosome profiling data
In Roleren/ORFik: Open Reading Frames in Genomics

detect_ribo_orfs

R Documentation

Detect ORFs by Ribosome profiling data

Description

Finding all ORFs: 1. Find all ORFs in mRNA using ORFik findORFs, with defined parameters.
To create the candidate ORFs (all ORFs returned):
Steps (candidate set):
Define a candidate search set by these 3 rules:
1.a Allowed ORF type: uORF, NTE, etc (only keep these in candidate list)
1.b Must have at least x reads over whole orf (default 10 reads)
1.c Must have at least x reads over start site (default 3 reads)
The total list is defined by these names, and saved according to allowed ORF type/types.
To create the prediction status (TRUE/FALSE) per candidate
Steps (prediction status)
(UP_NT is a 20nt window upstream of ORF, that stops 2NT before ORF starts) :
1. ORF mean reads per NT > (UP_NT mean reads per NT * 1.3)
2. ORFScore > 2.5
3. TIS total reads + 3 > ORF median reads per NT
4. Given expression above, a TRUE prediction is defined with the AND operatior: 1. & 2. & 3.

In code that is:
predicted <- (orfs_cov_stats$mean > upstream_cov_stats$mean*1.3) & orfs_cov_stats$ORFScores > 2.5 & ((reads_start[candidates] + 3) > orfs_cov_stats$median)

Usage

detect_ribo_orfs(
  df,
  out_folder,
  ORF_categories_to_keep,
  prefix_result = paste(c(ORF_categories_to_keep, gsub(" ", "_", organism(df))), collapse
    = "_"),
  mrna = loadRegion(df, "mrna"),
  cds = loadRegion(df, "cds"),
  libraries = outputLibs(df, type = "pshifted", output.mode = "envirlist"),
  orf_candidate_ranges = findORFs(seqs = txSeqsFromFa(mrna, df, TRUE), longestORF =
    longestORF, startCodon = startCodon, stopCodon = stopCodon, minimumLength =
    minimumLength),
  orfs_gr = categorize_and_filter_ORFs(orf_candidate_ranges, ORF_categories_to_keep, cds,
    mrna),
  export_metrics_table = TRUE,
  longestORF = FALSE,
  startCodon = startDefinition(1),
  stopCodon = stopDefinition(1),
  minimumLength = 0,
  minimum_reads_ORF = 10,
  minimum_reads_start = 3
)

Arguments

`df`	an ORFik `experiment`
`out_folder`	Directory to save files
`ORF_categories_to_keep`	options, any subset of: `c("uORF", "uoORF", "annotated", "NTE", "NTT", "internal", "doORF", "dORF", "a_error")`. uORF : Upstream ORFs (Starting in 5' UTR), not overlapping CDS uoORF : Upstream ORFs (Starting in 5' UTR), overlapping CDS annotated : The defined CDS for that transcript NTE : 5' Start codon extension of annotated CDS NTT : 5' Start codon truncation of annotated CDS CTE : 3' stop codon extension of annotated CDS, i.e. readthrough CTT : 5' Start codon truncation of annotated CDS, original cds was defined with readthrough internal : Starting inside CDS, ending before CDS ends doORF : Downstream ORFs (Ending in 3' UTR), overlapping CDS dORF : Downstream ORFs (Ending in 3' UTR), not overlapping CDS a_error : Any ORF detect not in the above categories
`prefix_result`	the prefix name of output files to out_folder. Default: `paste(c(ORF_categories_to_keep, gsub(" ", "_", organism(df))), collapse = "_")`
`mrna`	= `loadRegion(df, "mrna")`
`cds`	= `loadRegion(df, "cds")`
`libraries`	the ribo-seq libraries loaded into R as list, default: `outputLibs(df, type = "pshifted", output.mode = "envirlist")`
`orf_candidate_ranges`	IRangesList, = `findORFs(seqs = txSeqsFromFa(mrna, df, TRUE), longestORF = longestORF, startCodon = startCodon, stopCodon = stopCodon, minimumLength = minimumLength)`
`orfs_gr`	= categorize_and_filter_ORFs(orf_candidate_ranges, ORF_categories_to_keep, cds, mrna). The GRangesList set of ORFs to actually search.
`export_metrics_table`	logical, default TRUE. Export table of statistics to file with suffix: "_prediction_table.rds"
`longestORF`	(logical) Default TRUE. Keep only the longest ORF per unique stopcodon: (seqname, strand, stopcodon) combination, Note: Not longest per transcript! You can also use function `longestORFs` after creation of ORFs for same result.
`startCodon`	(character vector) Possible START codons to search for. Check `startDefinition` for helper function. Note that it is case sensitive, so "atg" would give 0 hits for a sequence with only capital "ATG" ORFs.
`stopCodon`	(character vector) Possible STOP codons to search for. Check `stopDefinition` for helper function. Note that it is case sensitive, so "tga" would give 0 hits for a sequence with only capital "TGA" ORFs.
`minimumLength`	(integer) Default is 0. Which is START + STOP = 6 bp. Minimum length of ORF, without counting 3bps for START and STOP codons. For example minimumLength = 8 will result in size of ORFs to be at least START + 8*3 (bp) + STOP = 30 bases. Use this param to restrict search.
`minimum_reads_ORF`	numeric, default 10, orf removed if less reads overlap whole orf
`minimum_reads_start`	numeric, default 3, orf removed if less reads overlap start

Value

invisible(NULL), all ORF results saved to disc

Examples

# Pre requisites
# 1. Create ORFik experiment
#  ORFik::create.experiment(...)
# 2. Create ORFik optimized annotation:
# makeTxdbFromGenome(gtf = ORFik:::getGtfPathFromTxdb(df), genome = df@fafile,
#                      organism = organism(df), optimize = TRUE)
# 3. There must exist pshifted reads, either as default files, or in a relative folder called
# "./pshifted/". See ?shiftFootprintsByExperiment
# EXAMPLE:
df <- ORFik.template.experiment()
df <- df[df$libtype == "RFP",][c(1,2),]
result_folder <- riboORFsFolder(df, tempdir())
results <- detect_ribo_orfs(df, result_folder, c("uORF", "uoORF", "annotated", "NTE"))

# Load results of annotated ORFs
table <- riboORFs(df[1,], type = "table", result_folder)
table # See all statistics
sum(table$predicted) # How many were predicted as Ribo-seq ORFs
# Load 2 results
table <- riboORFs(df[1:2,], type = "table", result_folder)
table # See all statistics
sum(table$predicted) # How many were predicted as Ribo-seq ORFs

# Load GRangesList
candidates_gr <- riboORFs(df[1,], type = "ranges_candidates", result_folder)
prediction <- riboORFs(df[1,], type = "predictions", result_folder)

predicted_gr <- riboORFs(df[1:2,], type = "ranges_predictions", result_folder)
identical(predicted_gr[[1]], candidates_gr[[1]][prediction[[1]]])
## Inspect predictions in RiboCrypt
# library(RiboCrypt)
# Inspect Predicted
view <- predicted_gr[[1]][1]
#multiOmicsPlot_ORFikExp(view, df, view, leader_extension = 100, trailer_extension = 100)
# Inspect not predicted
view <- candidates_gr[[1]][!prediction[[1]]][1]
#multiOmicsPlot_ORFikExp(view, df, view, leader_extension = 100, trailer_extension = 100)

Roleren/ORFik documentation built on April 12, 2025, 5:31 a.m.