find_novel_exons: Novel exon prediction
In khembach/DISCERNS: DISCovery of unannotated Exons from RNa-seq Splice junction reads

Description Usage Arguments Details Value Examples

Predict novel exons based on the SJ.out.tab file from STAR and/or a BAM file. The function has three different prediction modes:

Cassette exons
Novel exons from a single novel splice junction
Novel exons from reads/read pairs with two novel splice junctions.

The cassette exon prediction is always performed and the second and third mode can be turned on/off with the paramters single_sj and read_based.

find_novel_exons(
  sj_filename,
  annotation,
  min_unique = 1,
  gzipped = FALSE,
  verbose = TRUE,
  bam,
  single_sj = TRUE,
  read_based = TRUE,
  yield_size = 2e+05,
  lib_type = "PE",
  stranded = "reverse",
  read_length = 101,
  overhang_min = 12,
  min_intron_size = 21,
  cores = 1,
  tile_width = 1e+07
)

`sj_filename`	Path to SJ.out.tab file.
`annotation`	List with exon and intron annotation as GRanges. Created with `prepare_annotation()`.
`min_unique`	Integer scalar. Minimal number of reads required to map over a splice junction. Novel splice junctions with less reads are excluded from the analysis. All predicted novel exons with less than `min_unique` reads are discared.
`gzipped`	A logical scalar. Is the input file gzipped? Default FALSE.
`verbose`	Print messages about the progress.
`bam`	Character string. The path to the BAM file.
`single_sj`	A logical scalar. Should single novel splice junctions be used for prediction? Requires a BAM file.
`read_based`	A logical scalar. Should novel exons be predicted from reads or read-pairs with two novel splice junctions? Requires a BAM file.
`yield_size`	Integer scalar. Read the BAM file in chunks of this size.
`lib_type`	Character string. Type of the sequencing library: either "SE" (single-end) or "PE" (paired-end). Default "PE".
`stranded`	Character string. Strand type of the sequencing protocol: "unstranded" for unstranded protocols; "forward" or "reverse" for stranded protocols. In a "forward" protocol, the first read in a pair (or sinlge-end reads) comes from the forward (sense) strand and in a "reverse" protocol from the reverse (antisense) strand. See the Salmon documentation for an explanation of the different fragment library types. Default "reverse".
`read_length`	Integer scalar. Length of your reads in bps. Default 101.
`overhang_min`	Integer scalar. Minimum overhang length for splice junctions on both sides as defined by the `--outSJfilterOverhangMin` parameter of STAR. Use the minimum of the values for canonical splice junctions (value (2) to (4)). You do not have to set this parameter if you used the default values from STAR. Default 12.
`min_intron_size`	Integer scalar. Minimum intron size (`--alignIntronMin` parameter of STAR). You do not have to set this parameter if you used the default values from STAR. Default 21.
`cores`	Integer scalar. Number of cores to use. Default 1.
`tile_width`	Integer scalar. The genome will be partitioned into tiles of this size. The reads in the `bam` file within each tile will be consecutively imported (or in parallel if `cores` > 1). Default 1e7.

The three different prediction modes are explained in more detail below:

Novel cassette exons are predicted from pairs of novel splice junctions (SJ) that are located within an annotated intron and share the start and end coordinates of the intron:

X---------X   annotated intron
x---x         novel SJ
      x---x   novel SJ
X---NNN---X   predicted cassette exon (N)

First, the novel SJs are filtered: Only SJs that are located within an annotated intron and that share their start or end coordinates with the intron are retained. All introns that share both their start and end with a novel SJ are tested for cassette exons. If the two novel SJs within an intron do not overlap and are on the same strand, a novel cassette exon is predicted.

The second mode (parameter single_sj) is based on single novel SJs as input. The novel SJs can be divided in three different cases:

Novel SJs that touch an annotated exon with their start (5' end).
Novel SJs that touch annotated exons with their end (3' end).
Novel SJs that touch an annoated exon on both ends (5' and 3' end).

The three cases can be illustrated as follows:

Start touches annotation

AAAA            annotated exon
   J---J        novel SJ
 xxx---xx----x  read
       NN----X  novel exon and second SJ

End touches annotation

           AAA annotated exon
      J----J   novel SJ
xx---xx----xx  read
 X---NN        novel exon and second SJ

Both ends touch annotation

       AAAAA  annotated exon
AAAA          annotated exon
   J---J      novel junction
  NN---xxxxx  possible transcript with novel exon NN at the 5' of the SJ
xxxx---NN     possible transcript with novel exon NN at the 3' of the SJ

For case 1 and 2, the function reads the BAM file and takes all reads that support the novel SJ and have a second SJ. The coordinates of the second SJ, and thus the coordinates of the novel exon are determined from the reads. In case 3, the function searches for reads with two novel SJs that support a possible novel exon on either end of the SJ. If no reads are found, the function checks if the novel exon could be terminal, i.e. the first or last exon in a transcript. If yes, the novel exon is not connected to an annotated exon at its start/end and thus the start/end coordinate of the novel exon cannot be determined clearly. As an approximation, the function takes the boundaries of the read with the longest mapping to the novel exon. The third prediction mode (parameter read_based) only uses reads and the annotation as input. Novel exons are predicted from novel combinations of already annotated SJs. The SJ pairs are defined by a single read with two SJs or by read pairs where each read spans one junction. Each of the SJ pairs are compared with the annotated SJs per transcript and already annotated SJ pairs are removed. For paired-end reads, there is an additional requirement: The distance between the end of the first read in a pair and the start of the second read has to be < 2*(readlength-minOverhang) + minIntronSize. Here, readlength is the length of the reads, minOverhang is the minmal required read overhang over a SJ of the alignment tool and minIntronSize is the minimal required intron length of the alignment tool. For example, paired-end reads with a lenght of 101 nts and a minimal overhang of 6 and a minimal intron length of 21 allow a distance of at most 211 nucleotides between the two SJs: 2*(101-6) + 21 = 211. If the distance between the two SJs exceeds the limit, it cannot be guaranteed that the junctions are connected to the same exon.

Novel exons are then predicted from novel combinations of the identified SJ pairs. A novel exon is only predicted if it is contained within the boundaries of an annotated gene. This prevents the prediction of false positive exons because of wrongly mapped reads.

Data.frame with the coordinates of the identified novel exons. Each row in the data.frame is a predicted novel exon. The columns are the chromosome (seqnames), the end of the upstream exon (lend) in the transcript, the start and end of the novel exon (start and end), the start of the downstream exon (rstart) in the transcript and the strand (strand). The next three columns are the number of reads supporting each of the two splice junctions that define the novel exon: unique_left is the number of reads supporting the SJ from lend to start and unique_right is the number of supporting reads for the SJ from end to rstart. min_reads is the minimum of the two. The last column is the ID (ID) of the novel exon, i.e. a number between 1 and the total number of predictions.

sj <- system.file("extdata", "selected.SJ.out.tab",
                  package = "DISCERNS", mustWork = TRUE)
bam <- system.file("extdata", "selected.bam",
                   package = "DISCERNS", mustWork = TRUE)

## prepare annotation
gtf <- system.file("extdata", "selected.gtf", package = "DISCERNS",
                   mustWork = TRUE)
anno <- prepare_annotation(gtf)

## predict novel exons
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam)

## predict only cassette exons
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 single_sj = FALSE, read_based = FALSE)

## only predict exons from novel splice junctions
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, read_based = FALSE)

## Only consider novel splice junctions with at least ten supporting reads
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 10,
                 bam = bam)

## turn verbose off
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, verbose = FALSE)

## increase chunk size for BAM file reading if lots of memory is available
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, yield_size = 1000000)

## Stranded paired-end reads where the first read comes from the reverse
## strand, e.g., Illumina TruSeq stranded mRNA protocol. This is the default.
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, lib_type = "PE", stranded = "reverse")

## Unstranded single-end reads
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, lib_type = "SE", stranded = "unstranded")
                 
## If STAR was run with parameter --outSJfilterOverhangMin (e.g.
## --outSJfilterOverhangMin 30 6 6 6), specify the minimal junction overhang
## with parameter overhang_min: 
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, overhang_min = 6)

khembach/DISCERNS documentation built on June 23, 2020, 3:35 p.m.

khembach/DISCERNS index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

khembach/DISCERNS
DISCovery of unannotated Exons from RNa-seq Splice junction reads

find_novel_exons: Novel exon prediction
In khembach/DISCERNS: DISCovery of unannotated Exons from RNa-seq Splice junction reads

Description

Usage

Arguments

Details

Value

Examples

Related to find_novel_exons in khembach/DISCERNS...

R Package Documentation

Browse R Packages

We want your feedback!

khembach/DISCERNS DISCovery of unannotated Exons from RNa-seq Splice junction reads

find_novel_exons: Novel exon prediction In khembach/DISCERNS: DISCovery of unannotated Exons from RNa-seq Splice junction reads

Description

Usage

Arguments

Details

Value

Examples

Related to find_novel_exons in khembach/DISCERNS...

R Package Documentation

Browse R Packages

We want your feedback!

khembach/DISCERNS
DISCovery of unannotated Exons from RNa-seq Splice junction reads

find_novel_exons: Novel exon prediction
In khembach/DISCERNS: DISCovery of unannotated Exons from RNa-seq Splice junction reads