find_uORFome: Run whole uORFomePipe prediction
In Roleren/uORFomePipe: uORF prediction in R

find_uORFome

R Documentation

Run whole uORFomePipe prediction

Description

Steps:
1. Make the directory structure for ORF finding, create the database, assign variables and validate input data.
2. Find CAGE transcripts
3. Find uORFs
4. Create database
5. Fill database with NGS and sequence features
6. Train the random forest model
7. Predict on uORFs
8. Get analysis plots

NOTE: If the run crashes, the next run continues from the point where it stopped, so delete the mainPath folder if you want a fresh rerun.
Also, do not change the working directory after the run has started, as this might make the program crash.
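
As the note above says, a fresh rerun requires deleting the mainPath folder. A minimal base R sketch of that cleanup follows; the folder name and the placeholder file are hypothetical, and a temporary directory stands in for a real results folder.

```r
# Hypothetical results folder; a temporary directory stands in for the real mainPath
mainPath <- file.path(tempdir(), "uORFome_demo")
dir.create(mainPath, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(mainPath, "placeholder.txt"))  # pretend output from an earlier run

# Delete the folder and everything in it, so the next run starts from step 1
if (dir.exists(mainPath)) {
  unlink(mainPath, recursive = TRUE)
}

dir.exists(mainPath)
```

Passing mainPath as an absolute path may also make a run less sensitive to the working directory, though that is an assumption and not something this page states.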

Usage

find_uORFome(
  mainPath,
  organism = organism(df.rfp),
  df.rfp,
  df.rna,
  df.cage,
  startCodons = "ATG|CTG|TTG|GTG|AAG|AGG|ACG|ATC|ATA|ATT",
  stopCodons = "TAA|TAG|TGA",
  mode = "uORF",
  requiredActiveCds = 30,
  max.artificial.length = 100,
  startCodons.cds.allowed = startCodons,
  stopCodons.cds.allowed = stopCodons,
  biomart = "ensembl",
  features = c("countRFP", "disengagementScores", "entropyRFP", "floss", "fpkmRFP",
    "ioScore", "ORFScores", "RRS", "RSS", "startCodonCoverage", "startRegionCoverage",
    "startRegionRelative"),
  BPPARAM = bpparam()
)

Arguments

`mainPath`	folder for uORFome to put results
`organism`	scientific name of organism, like Homo sapiens, Danio rerio, etc.
`df.rfp`	ORFik experiment of Ribo-seq
`df.rna`	ORFik experiment of RNA-seq, set to NULL if you don't have RNA-seq
`df.cage`	ORFik experiment of CAGE, set to NULL if you don't have CAGE.
`startCodons`	default "ATG|CTG|TTG|GTG|AAG|AGG|ACG|ATC|ATA|ATT", set to "ATG|CTG|TTG|GTG" for a more certain set.
`stopCodons`	default "TAA|TAG|TGA"
`mode`	character, default: "uORF". Alternative: "aCDS". Whether to predict on uORFs or on artificial CDSs. If "aCDS", the pipeline runs twice, once for whole-length CDSs and once for truncated CDSs, to validate that the model works for short ORFs. "CDS" is an option to predict on whole CDSs.
`requiredActiveCds`	numeric, default 30. How many CDSs are required to be detected as active. This is the minimum size of the positive training set; the run aborts if the number of active CDSs is not bigger than this number.
`max.artificial.length`	integer, default: 100. Only applies if mode = "aCDS", so most users can ignore it. When creating artificial ORFs from CDSs, this sets how large the largest ORFs can be. This number is 1/6 of the maximum size of the ORFs (max size 600 if max.artificial.length is 100). A random size is sampled from 6 up to that maximum; if max.artificial.length is 2, you can get artificial ORFs of size 6, 9 or 12, that is 6, 6 + (3x1) or 6 + (3x2).
`startCodons.cds.allowed`	character, default same as the startCodons argument. Which start codons the CDSs you train on are allowed to have.
`stopCodons.cds.allowed`	character, default same as the stopCodons argument. Which stop codons the CDSs you train on are allowed to have.
`biomart`	default "ensembl". Used to get gene symbols and GO terms for uORF genes. It is detected automatically from the organism name in the Ensembl database. Set to NULL if you don't want to check gene symbols and GO terms.
`features`	features to train the model on, any of the features created during ORFik::computeFeatures, default: `c("countRFP", "disengagementScores", "entropyRFP", "floss", "fpkmRFP", "ioScore", "ORFScores", "RRS", "RSS", "startCodonCoverage", "startRegionCoverage", "startRegionRelative")`
`BPPARAM`	An instance of a `BiocParallelParam` class, e.g., `MulticoreParam`, `SnowParam`, `DoparParam`.
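
The startCodons and stopCodons arguments are single strings with codons separated by "|". The base R sketch below compares the default start codon set with the stricter set suggested above. The helper split_codons is hypothetical, written only for this illustration, and is not part of uORFomePipe.

```r
# Default start codon set, copied from the Usage section
startCodons <- "ATG|CTG|TTG|GTG|AAG|AGG|ACG|ATC|ATA|ATT"
# Stricter set suggested in the startCodons description
strictStartCodons <- "ATG|CTG|TTG|GTG"

# Hypothetical helper: split on the literal "|" to get one codon per element
split_codons <- function(x) strsplit(x, "|", fixed = TRUE)[[1]]

default_set <- split_codons(startCodons)
strict_set <- split_codons(strictStartCodons)

length(default_set)                # number of codons in the default set
all(strict_set %in% default_set)   # is the strict set a subset of the default?
setdiff(default_set, strict_set)   # codons dropped by the strict set
```

The stricter string can then be passed as startCodons = strictStartCodons in the call to find_uORFome.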

Value

The prediction as a data.table with 3 columns: prediction (0 or 1), p0 (probability of a negative prediction) and p1 (probability of a positive prediction). Only one of p0 and p1 can be > 0.5, and that value decides whether the prediction is 0 or 1.
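
The relation between the three columns can be illustrated with a small base R sketch. The probabilities are made up, the column names are assumed from the description above, and p0 and p1 are assumed to sum to 1. A plain data.frame stands in for the data.table the function returns.

```r
# Made-up probabilities of a positive prediction, for illustration only
p1 <- c(0.91, 0.12, 0.55)
# Assumption: the two class probabilities sum to 1
p0 <- 1 - p1

result <- data.frame(
  prediction = as.integer(p1 > 0.5),  # 1 when p1 is the larger probability
  p0 = p0,
  p1 = p1
)

# Keep only the rows predicted as positive
positives <- result[result$prediction == 1L, ]
nrow(positives)
```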

Examples

mainPath <- "~/bio/results/uORFome_Zebrafish"
# df.rfp <- read.experiment("path/to/rfp.csv")
# df.rna <- read.experiment("path/to/rna.csv") # Not required
# df.cage <- read.experiment("path/to/CAGE.csv") # Not required
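# Name the experiment arguments: the second positional argument is organism, not df.rfp
# find_uORFome(mainPath, df.rfp = df.rfp, df.rna = df.rna, df.cage = df.cage)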

Roleren/uORFomePipe documentation built on Jan. 14, 2024, 5:11 a.m.