detect_ribo_orfs: Detect ORFs by Ribosome profiling data

detect_ribo_orfsR Documentation

Detect ORFs by Ribosome profiling data

Description

Finding all ORFs: 1. Find all ORFs in mRNA using ORFik findORFs, with defined parameters.
To create the candidate ORFs (all ORFs returned):
Steps (candidate set):
Define a candidate search set by these 3 rules:
1.a Allowed ORF type: uORF, NTE, etc (only keep these in candidate list)
1.b Must have at least x reads over whole orf (default 10 reads)
1.c Must have at least x reads over start site (default 3 reads)
The total list is defined by these names, and saved according to allowed ORF type/types.
To create the prediction status (TRUE/FALSE) per candidate
Steps (prediction status)
(UP_NT is a 20nt window upstream of ORF, that stops 2NT before ORF starts) :
1. ORF mean reads per NT > (UP_NT mean reads per NT * 1.3)
2. ORFScore > 2.5
3. TIS total reads + 3 > ORF median reads per NT
4. Given expression above, a TRUE prediction is defined with the AND operatior: 1. & 2. & 3.

In code that is:
predicted <- (orfs_cov_stats$mean > upstream_cov_stats$mean*1.3) & orfs_cov_stats$ORFScores > 2.5 & ((reads_start[candidates] + 3) > orfs_cov_stats$median)

Usage

detect_ribo_orfs(
  df,
  out_folder,
  ORF_categories_to_keep,
  prefix_result = paste(c(ORF_categories_to_keep, gsub(" ", "_", organism(df))), collapse
    = "_"),
  mrna = loadRegion(df, "mrna"),
  cds = loadRegion(df, "cds"),
  libraries = outputLibs(df, type = "pshifted", output = "envirlist"),
  orf_candidate_ranges = findORFs(seqs = txSeqsFromFa(mrna, df, TRUE), longestORF =
    longestORF, startCodon = startCodon, stopCodon = stopCodon, minimumLength =
    minimumLength),
  export_metrics_table = TRUE,
  longestORF = FALSE,
  startCodon = startDefinition(1),
  stopCodon = stopDefinition(1),
  minimumLength = 0,
  minimum_reads_ORF = 10,
  minimum_reads_start = 3
)

Arguments

df

an ORFik experiment

out_folder

Directory to save files

ORF_categories_to_keep

options, any subset of: c("uORF", "uoORF", "annotated", "NTE", "NTT", "internal", "doORF", "dORF", "a_error").

  • uORF: Upstream ORFs (Starting in 5' UTR), not overlapping CDS

  • uoORF: Upstream ORFs (Starting in 5' UTR), overlapping CDS

  • annotated: The defined CDS for that transcript

  • NTE: 5' Start codon extension of annotated CDS

  • NTT: 5' Start codon truncation of annotated CDS

  • internal: Starting inside CDS, ending before CDS ends

  • doORF: Downstream ORFs (Ending in 3' UTR), overlapping CDS

  • dORF: Downstream ORFs (Ending in 3' UTR), not overlapping CDS

  • a_error: Any ORF detect not in the above categories

prefix_result

the prefix name of output files to out_folder. Default: paste(c(ORF_categories_to_keep, gsub(" ", "_", organism(df))), collapse = "_")

mrna

= loadRegion(df, "mrna")

cds

= loadRegion(df, "cds")

libraries

the ribo-seq libraries loaded into R as list, default: outputLibs(df, type = "pshifted", output = "envirlist")

orf_candidate_ranges

IRangesList, = findORFs(seqs = txSeqsFromFa(mrna, df, TRUE), longestORF = longestORF, startCodon = startCodon, stopCodon = stopCodon, minimumLength = minimumLength)

export_metrics_table

logical, default TRUE. Export table of statistics to file with suffix: "_prediction_table.rds"

longestORF

(logical) Default TRUE. Keep only the longest ORF per unique stopcodon: (seqname, strand, stopcodon) combination, Note: Not longest per transcript! You can also use function longestORFs after creation of ORFs for same result.

startCodon

(character vector) Possible START codons to search for. Check startDefinition for helper function. Note that it is case sensitive, so "atg" would give 0 hits for a sequence with only capital "ATG" ORFs.

stopCodon

(character vector) Possible STOP codons to search for. Check stopDefinition for helper function. Note that it is case sensitive, so "tga" would give 0 hits for a sequence with only capital "TGA" ORFs.

minimumLength

(integer) Default is 0. Which is START + STOP = 6 bp. Minimum length of ORF, without counting 3bps for START and STOP codons. For example minimumLength = 8 will result in size of ORFs to be at least START + 8*3 (bp) + STOP = 30 bases. Use this param to restrict search.

minimum_reads_ORF

numeric, default 10, orf removed if less reads overlap whole orf

minimum_reads_start

numeric, default 3, orf removed if less reads overlap start

Value

invisible(NULL), all ORF results saved to disc

Examples

# Pre requisites
# 1. Create ORFik experiment
#  ORFik::create.experiment(...)
# 2. Create ORFik optimized annotation:
# makeTxdbFromGenome(gtf = ORFik:::getGtfPathFromTxdb(df), genome = df@fafile,
#                      organism = organism(df), optimize = TRUE)
# 3. There must exist pshifted reads, either as default files, or in a relative folder called
# "./pshifted/". See ?shiftFootprintsByExperiment
# EXAMPLE:
df <- ORFik.template.experiment()
df <- df[df$libtype == "RFP",][c(1,2),]
result_folder <- riboORFsFolder(df, tempdir())
results <- detect_ribo_orfs(df, result_folder, c("uORF", "uoORF", "annotated", "NTE"))

# Load results of annotated ORFs
table <- riboORFs(df[1,], type = "table", result_folder)
table # See all statistics
sum(table$predicted) # How many were predicted as Ribo-seq ORFs
# Load 2 results
table <- riboORFs(df[1:2,], type = "table", result_folder)
table # See all statistics
sum(table$predicted) # How many were predicted as Ribo-seq ORFs

# Load GRangesList
candidates_gr <- riboORFs(df[1,], type = "ranges_candidates", result_folder)
prediction <- riboORFs(df[1,], type = "predictions", result_folder)

predicted_gr <- riboORFs(df[1:2,], type = "ranges_predictions", result_folder)
identical(predicted_gr[[1]], candidates_gr[[1]][prediction[[1]]])
## Inspect predictions in RiboCrypt
# library(RiboCrypt)
# Inspect Predicted
view <- predicted_gr[[1]][1]
#multiOmicsPlot_ORFikExp(view, df, view, leader_extension = 100, trailer_extension = 100)
# Inspect not predicted
view <- candidates_gr[[1]][!prediction[[1]]][1]
#multiOmicsPlot_ORFikExp(view, df, view, leader_extension = 100, trailer_extension = 100)

Roleren/ORFik documentation built on April 25, 2024, 8:41 p.m.