calcORFScore: ORFScore calculation

View source: R/orfscore.R

calcORFScore    R Documentation

ORFScore calculation

Description

ORFScore was first defined in Bazzini et al., 2014 (PMID: 24705786) and is used to discover novel open reading frames (ORFs) or to rank ORFs showing active translation. Given an ORF, read counts for the three frames are calculated. A chi-squared test statistic is then computed by comparing the read counts with an equal null distribution, p = c(1/3, 1/3, 1/3). The ORFScore is log2(1 + test statistic). The sign of the ORFScore is positive if the counts in the target frame (frame 1 by default) are larger than the counts in the other two frames, and negative otherwise.
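The calculation described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name orf_score_sketch is hypothetical, and it assumes "larger than the other two frames" means larger than each of them individually.

```r
# Hypothetical helper for illustration; not part of RiboSeeker.
# counts: read counts for frames 1, 2 and 3 of a single ORF.
orf_score_sketch <- function(counts, targetFrame = 1,
                             probNULL = c(1/3, 1/3, 1/3)) {
  total <- sum(counts)
  if (total == 0) {
    return(NA_real_)  # no reads in any frame: score is undefined
  }
  expected <- total * probNULL
  # Chi-squared goodness-of-fit statistic against the null distribution
  # (assumes every entry of probNULL is > 0).
  stat <- sum((counts - expected)^2 / expected)
  # Assumption: positive only when the target frame exceeds each other frame.
  sgn <- if (all(counts[targetFrame] > counts[-targetFrame])) 1 else -1
  sgn * log2(1 + stat)
}

orf_score_sketch(c(100, 0, 0))   # all reads in the target frame: positive
orf_score_sketch(c(10, 50, 10))  # reads concentrated in frame 2: negative
```

With the default null, stats::chisq.test(counts, p = probNULL)$statistic should give the same unsigned test statistic for non-zero totals.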

Usage

calcORFScore(
  bam,
  orfGRL,
  frameOrder = c(1, 2, 3),
  targetFrame = 1,
  ignoreStrand = TRUE,
  probNULL = c(1/3, 1/3, 1/3)
)

Arguments

bam

A GRanges or GAlignments object of reads. The reads should already be size selected and shifted; see the shiftReads function for how to shift reads. For each read, only the 5'-most position is used. (Required).

orfGRL

A GRangesList object of ORFs. We recommend assigning a unique name to each ORF using names(orfGRL). The following modifications are applied to orfGRL:
1. If the names of orfGRL are NULL, each element is renamed "orf_1", "orf_2", etc.
2. Strands marked as "*" are replaced with "+".
3. Elements with multiple chromosomes or strands (one ORF on multiple chromosomes or on different strands) are removed.
4. Elements whose ORF length is not divisible by 3 are removed.
5. MOST IMPORTANTLY, if an ORF is on the positive strand, its ranges are sorted by coordinates (seqnames, start, end) in ascending order. Otherwise, they are sorted by coordinates (seqnames, end, start) in descending order. The purpose is to match the behavior of the cdsBy function in the GenomicFeatures package.
(Required).
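The sorting and length rules above can be illustrated on a plain data.frame. This is a rough sketch under stated assumptions, not the package's code: the helper names and the example coordinates are made up, and a real orfGRL would be a GRangesList rather than a data.frame.

```r
# Hypothetical helpers for illustration; not part of RiboSeeker.
# 'exons' holds the ranges of ONE ORF with columns start, end, strand.
sort_orf_exons_sketch <- function(exons) {
  strand <- unique(exons$strand)
  stopifnot(length(strand) == 1)            # an ORF must sit on one strand
  if (strand == "-") {
    idx <- order(-exons$end, -exons$start)  # descending by (end, start)
  } else {
    idx <- order(exons$start, exons$end)    # ascending by (start, end)
  }
  exons[idx, , drop = FALSE]
}

# TRUE when the total ORF length is a multiple of 3.
orf_length_ok_sketch <- function(exons) {
  sum(exons$end - exons$start + 1) %% 3 == 0
}

# Made-up two-exon ORF on the minus strand (51 + 39 = 90 positions).
minusOrf <- data.frame(start = c(100, 300), end = c(150, 338), strand = "-")
sortedOrf <- sort_orf_exons_sketch(minusOrf)  # exon starting at 300 comes first
orf_length_ok_sketch(minusOrf)
```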

frameOrder

A numeric vector of length 3 giving the frame of each position in each ORF. By default, the first position in each ORF is frame 1, the second position is frame 2, and the third position is frame 3. The pattern then repeats (e.g. the 4th position is frame 1, the 5th is frame 2, the 6th is frame 3, and so on). (Default: c(1, 2, 3)).
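As a rough illustration of how frameOrder maps positions to frames, the sketch below labels each position by repeating frameOrder along the ORF and then sums reads per frame. The helper name and the per-position counts are hypothetical, and the result is indexed by frame number (1, 2, 3), which may differ from how the package orders its output columns.

```r
# Hypothetical helper for illustration; not part of RiboSeeker.
# positionCounts: reads (5'-most position) at each ORF position, 5' to 3'.
frame_counts_sketch <- function(positionCounts, frameOrder = c(1, 2, 3)) {
  # Repeat the frame pattern along the ORF: position 1, 2, 3, 4, ...
  positionFrame <- rep(frameOrder, length.out = length(positionCounts))
  # Total reads falling on positions labelled frame 1, 2 and 3.
  vapply(1:3,
         function(f) sum(positionCounts[positionFrame == f]),
         numeric(1))
}

# Made-up counts for a 9-position ORF.
positionCounts <- c(5, 0, 1, 7, 1, 0, 0, 0, 2)
frame_counts_sketch(positionCounts)                         # 12, 1, 3
frame_counts_sketch(positionCounts, frameOrder = c(3, 1, 2))  # 1, 3, 12
```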

targetFrame

A numeric value indicating which frame is expected to have higher read counts. By default, frame 1 is expected to have higher read counts than frames 2 and 3. (Default: 1).

ignoreStrand

A logical value indicating whether to ignore strand when matching reads to ORFs. If FALSE, reads and ORFs must be on the same strand. (Default: TRUE).

probNULL

A numeric vector of length 3 showing the null distribution of the read counts in the three frames of an ORF. Must be non-negative and sum up to 1. By default, an equal null distribution is used in the chi-squared test. (Default: c(1/3, 1/3, 1/3)).

Value

A data.frame with 9 columns:
1. Column 1 is the ORF ID (orfId), either user specified in orfGRL or internally generated.
2. Columns 2 to 4 are the read counts for the three frames, in the order specified by frameOrder (e.g. frame1Count, frame2Count, and frame3Count).
3. Columns 5 to 7 are the fractions of positions with positive counts for the three frames, in the order specified by frameOrder (e.g. frame1PosPct, frame2PosPct, and frame3PosPct). For example, if an ORF has 30 positions (10 positions for each frame), and 8 positions in frame 1, 1 position in frame 2, and 0 positions in frame 3 have positive counts, then columns 5 to 7 are 0.8, 0.1, and 0. These three columns help filter out ORFs that have a high ORFScore but whose reads fall on very few positions in the target frame. For example, suppose an ORF has 300 positions, frame 1 has 100 reads, frames 2 and 3 have 0 reads, and all 100 frame 1 reads are located at the same position. In this case, the ORFScore will be large (if frame 1 is the target frame), but frame1PosPct is small (only 0.01). Such an ORF is more likely to be a false positive.
4. Columns 8 and 9 are the raw ORFScore (rawORFScore, the test statistic with sign) and the final ORFScore (ORFScore, log2(1 + test statistic) with sign). If the read counts for all three frames are zero, both the raw and the final ORFScore are set to NA.
The data.frame is sorted by ORFScore in descending order.
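The sketch below reproduces the 300-position example and shows one possible way to filter results on the positive-position fraction. It is illustrative only: pos_pct_sketch is a hypothetical helper, the result table is made-up data that uses the documented column names, and the thresholds (ORFScore > 5, frame1PosPct >= 0.1) are arbitrary choices rather than recommendations from the package.

```r
# Hypothetical helper for illustration; not part of RiboSeeker.
# Fraction of positions with at least one read, for frames 1, 2 and 3.
pos_pct_sketch <- function(positionCounts, frameOrder = c(1, 2, 3)) {
  positionFrame <- rep(frameOrder, length.out = length(positionCounts))
  vapply(1:3,
         function(f) mean(positionCounts[positionFrame == f] > 0),
         numeric(1))
}

# 300 positions, all 100 reads stacked on the first (frame 1) position.
stacked <- c(100, rep(0, 299))
stackedPct <- pos_pct_sketch(stacked)  # frame 1: 1 of 100 positions covered

# Made-up result table with a subset of the documented columns.
res <- data.frame(
  orfId        = c("orf_1", "orf_2", "orf_3"),
  frame1PosPct = c(0.01, 0.45, 0),
  ORFScore     = c(7.65, 6.20, NA)
)

# Keep high-scoring ORFs whose target-frame reads cover enough positions.
kept <- res[!is.na(res$ORFScore) &
              res$ORFScore > 5 &
              res$frame1PosPct >= 0.1, ]
```

Here only "orf_2" is kept: "orf_1" scores highly but its reads sit on too few positions, and "orf_3" has no reads.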


nzhang89/RiboSeeker documentation built on April 15, 2022, 10:18 a.m.