analyzeNovelIsoformORF: Prediction of Isoform Open Reading Frames.

View source: R/analyze_ORF.R

analyzeNovelIsoformORFR Documentation

Prediction of Isoform Open Reading Frames.

Description

For the subset of isoforms not already annotated with ORFs this function predicts the most likely Open Reading Frame (ORF) and the NMD sensitivity. This function is made to help annotate isoforms if you have performed (guided) de-novo isoform reconstruction (isoform deconvolution) and is supposed to be used after addORFfromGTF have been used to annotate the known transcript. If you did not do an isoform reconstruction there is no need to run this function as CDS will (read are supposed to) already be annotated by importRdata().

Usage

analyzeNovelIsoformORF(
    ### Core arguments
    switchAnalyzeRlist,
    analysisAllIsoformsWithoutORF, # also analyse all those annoatated as without CDS in ref annottaion
    genomeObject = NULL,

    ### Advanced argument
    minORFlength = 100,
    orfMethod = 'longest.AnnotatedWhenPossible',
    PTCDistance = 50,
    startCodons = "ATG",
    stopCodons = c("TAA", "TAG", "TGA"),
    showProgress = TRUE,
    quiet = FALSE
)

Arguments

switchAnalyzeRlist

A switchAnalyzeRlist object.

analysisAllIsoformsWithoutORF

A logic indicating whether to also analyse isoforms annotated as having no ORF by the addORFfromGTF function.

genomeObject

A BSgenome object uses as reference genome (e.g. 'Hsapiens' for Homo sapiens). Only necessary if transcript sequences were not already added (via the 'isoformNtFasta' argument in importRdata() or the extractSequence function).

minORFlength

The minimum size (in nucleotides) an ORF must be to be considered (and reported). Please note that we recommend using CPAT to predict coding potential instead of this cutoff - it is simply implemented as a pre-filter, see analyzeCPAT. Default is 100 nucleotides, which >97.5% of Gencode coding isoforms in both human and mouse have.

orfMethod

A string indicating which of the 5 available ORF identification methods should be used. The methods are:

  • longest.AnnotatedWhenPossible : A merge between "longestAnnotated" and "longest" (see below). For all isoforms where CDS start positions from known isoform overlap, only these CDS starts are considered and the longest ORF is annotated (similar to "longestAnnotated"). All isoforms without any overlapping CDS start sites they will be analysed with the "longest" approach.

  • longest : Identifies the longest ORF in the transcript (after filtering via minORFlength). This approach is similar to what the CPAT tool uses in it's analysis of coding potential.

  • mostUpstream : Identifies the most upstream ORF in the transcript (after filtering via minORFlength).

  • longestAnnotated : Identifies the longest ORF (after filtering via minORFlength) downstream of an annotated translation start site (which are supplied via the cds argument).

  • mostUpstreamAnnoated : Identifies the ORF (after filtering via minORFlength) downstream of the most upstream overlapping annotated translation start site (supplied via the cds argument).

Default is longest.AnnotatedWhenPossible.

PTCDistance

A numeric giving the maximal allowed premature termination codon-distance: The minimum distance (number of nucleotides) from the STOP codon to the final exon-exon junction. If the distance from the STOP to the final exon-exon junction is larger than this the isoform to be marked as NMD-sensitive. Default is 50.

startCodons

A vector of strings indicating the start codons identified in the DNA sequence. Default is 'ATG' (corresponding to the RNA-sequence AUG).

stopCodons

A vector of strings indicating the stop codons identified in the DNA sequence. Default is c("TAA", "TAG", "TGA").

showProgress

A logic indicating whether to make a progress bar (if TRUE) or not (if FALSE). Defaults is TRUE.

quiet

A logic indicating whether to avoid printing progress messages (incl. progress bar). Default is FALSE

Details

This is a specialized function which wraps analyzeORF(). First it extract ORF start sites already annotated ORFs in the switchAnalyzeRlist. Then it analyses all isoforms in the switchAnalyzeRlist not alreay annotated with and ORF (note the analysisAllIsoformsWithoutORF argument) using analyzeORF supplying the ORF start sites to the cds argument.

Value

For the isoforms analysed the ORF information in the switchAnalyzeRlist given as input updated and returned. See the the details section of the analyzeORF documentation for full description.

Author(s)

Kristoffer Vitting-Seerup

References

  • This function : Vitting-Seerup et al. The Landscape of Isoform Switches in Human Cancers. Mol. Cancer Res. (2017).

  • Information about NMD : Weischenfeldt J, et al: Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome Biol. 2012, 13:R35.

See Also

addORFfromGTF

Examples

### Load data
data("exampleSwitchListIntermediary")

### Select random isoforms to remove ORF annotation for
exampleSwitchListIntermediary$orfAnalysis$orf_origin <- 'Annotation'
nToRemove <- 25
rowsToModify <- sample(which( !is.na( exampleSwitchListIntermediary$orfAnalysis$orfTransciptStart)), nToRemove)

### Remove ORF annoations
colsToModify <- which( ! colnames(exampleSwitchListIntermediary$orfAnalysis) %in% c('isoform_id','orf_origin'))
exampleSwitchListIntermediary$orfAnalysis[
    rowsToModify,
    colsToModify
] <- NA
exampleSwitchListIntermediary$orfAnalysis$orf_origin[rowsToModify] <- 'not_annotated_yet'

### Predict ORF of missing isoforms using the ORF in other isoforms
tmp <- analyzeNovelIsoformORF(
    switchAnalyzeRlist = exampleSwitchListIntermediary,
    analysisAllIsoformsWithoutORF = TRUE
)

kvittingseerup/IsoformSwitchAnalyzeR documentation built on June 28, 2024, 5:41 p.m.