findUORFs: Find upstream ORFs from transcript annotation

Description Usage Arguments Details Value See Also Examples

View source: R/find_ORFs.R

Description

Procedure: 1. Create a new search space starting with the 5' UTRs. 2. Redefine TSS with CAGE if wanted. 3. Add the whole of CDS to search space to allow uORFs going into cds. 4. find ORFs on that search space. 5. Filter out wrongly found uORFs, if CDS is included. The CDS, alternative CDS, uORFs starting within the CDS etc.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
findUORFs(
  fiveUTRs,
  fa,
  startCodon = startDefinition(1),
  stopCodon = stopDefinition(1),
  longestORF = TRUE,
  minimumLength = 0,
  cds = NULL,
  cage = NULL,
  extension = 1000,
  filterValue = 1,
  restrictUpstreamToTx = FALSE,
  removeUnused = FALSE
)

Arguments

fiveUTRs

(GRangesList) The 5' leaders or full transcript sequences

fa

a FaFile. With fasta sequences corresponding to fiveUTR annotation. Usually loaded from the genome of an organism with fa = ORFik:::findFa("path/to/fasta/genome")

startCodon

(character vector) Possible START codons to search for. Check startDefinition for helper function.

stopCodon

(character vector) Possible STOP codons to search for. Check stopDefinition for helper function.

longestORF

(logical) Default TRUE. Keep only the longest ORF per unique stopcodon: (seqname, strand, stopcodon) combination, Note: Not longest per transcript! You can also use function longestORFs after creation of ORFs for same result.

minimumLength

(integer) Default is 0. Which is START + STOP = 6 bp. Minimum length of ORF, without counting 3bps for START and STOP codons. For example minimumLength = 8 will result in size of ORFs to be at least START + 8*3 (bp) + STOP = 30 bases. Use this param to restrict search.

cds

(GRangesList) CDS of relative fiveUTRs, applicable only if you want to extend 5' leaders downstream of CDS's, to allow upstream ORFs that can overlap into CDS's.

cage

Either a filePath for the CageSeq file as .bed .bam or .wig, with possible compressions (".gzip", ".gz", ".bgz"), or already loaded CageSeq peak data as GRanges or GAlignment. NOTE: If it is a .bam file, it will add a score column by running: convertToOneBasedRanges(cage, method = "5prime", addScoreColumn = TRUE) The score column is then number of replicates of read, if score column is something else, like read length, set the score column to NULL first.

extension

The maximum number of basses upstream of the TSS to search for CageSeq peak.

filterValue

The minimum number of reads on cage position, for it to be counted as possible new tss. (represented in score column in CageSeq data) If you already filtered, set it to 0.

restrictUpstreamToTx

a logical (FALSE). If TRUE: restrict leaders to not extend closer than 5 bases from closest upstream leader, set this to TRUE.

removeUnused

logical (FALSE), if False: (standard is to set them to original annotation), If TRUE: remove leaders that did not have any cage support.

Details

From default a filtering process is done to remove "fake" uORFs, but only if cds is included, since uORFs that stop on the stop codon on the CDS is not a uORF, but an alternative cds by definition, etc.

Value

A GRangesList of uORFs, 1 granges list element per uORF.

See Also

Other findORFs: findMapORFs(), findORFsFasta(), findORFs(), startDefinition(), stopDefinition()

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
# Load annotation
txdbFile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
                         package = "GenomicFeatures")
## Not run: 
 txdb <- loadTxdb(txdbFile)
 fiveUTRs <- loadRegion(txdb, "leaders")
 cds <- loadRegion(txdb, "cds")
 if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
   # Normally you would not use a BSgenome, but some custom fasta-
   # annotation you  have for your species
   findUORFs(fiveUTRs, BSgenome.Hsapiens.UCSC.hg19::Hsapiens, "ATG",
             cds = cds)
 }

## End(Not run)

ORFik documentation built on March 27, 2021, 6 p.m.