find_novel_exons: Novel exon prediction

Description Usage Arguments Details Value Examples

View source: R/find_novel_exons.R

Description

Predict novel exons based on the SJ.out.tab file from STAR and/or a BAM file. The function has three different prediction modes:

  1. Cassette exons

  2. Novel exons from a single novel splice junction

  3. Novel exons from reads/read pairs with two novel splice junctions.

The cassette exon prediction is always performed and the second and third mode can be turned on/off with the paramters single_sj and read_based.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
find_novel_exons(
  sj_filename,
  annotation,
  min_unique = 1,
  gzipped = FALSE,
  verbose = TRUE,
  bam,
  single_sj = TRUE,
  read_based = TRUE,
  yield_size = 2e+05,
  lib_type = "PE",
  stranded = "reverse",
  read_length = 101,
  overhang_min = 12,
  min_intron_size = 21,
  cores = 1,
  tile_width = 1e+07
)

Arguments

sj_filename

Path to SJ.out.tab file.

annotation

List with exon and intron annotation as GRanges. Created with prepare_annotation().

min_unique

Integer scalar. Minimal number of reads required to map over a splice junction. Novel splice junctions with less reads are excluded from the analysis. All predicted novel exons with less than min_unique reads are discared.

gzipped

A logical scalar. Is the input file gzipped? Default FALSE.

verbose

Print messages about the progress.

bam

Character string. The path to the BAM file.

single_sj

A logical scalar. Should single novel splice junctions be used for prediction? Requires a BAM file.

read_based

A logical scalar. Should novel exons be predicted from reads or read-pairs with two novel splice junctions? Requires a BAM file.

yield_size

Integer scalar. Read the BAM file in chunks of this size.

lib_type

Character string. Type of the sequencing library: either "SE" (single-end) or "PE" (paired-end). Default "PE".

stranded

Character string. Strand type of the sequencing protocol: "unstranded" for unstranded protocols; "forward" or "reverse" for stranded protocols. In a "forward" protocol, the first read in a pair (or sinlge-end reads) comes from the forward (sense) strand and in a "reverse" protocol from the reverse (antisense) strand. See the Salmon documentation for an explanation of the different fragment library types. Default "reverse".

read_length

Integer scalar. Length of your reads in bps. Default 101.

overhang_min

Integer scalar. Minimum overhang length for splice junctions on both sides as defined by the --outSJfilterOverhangMin parameter of STAR. Use the minimum of the values for canonical splice junctions (value (2) to (4)). You do not have to set this parameter if you used the default values from STAR. Default 12.

min_intron_size

Integer scalar. Minimum intron size (--alignIntronMin parameter of STAR). You do not have to set this parameter if you used the default values from STAR. Default 21.

cores

Integer scalar. Number of cores to use. Default 1.

tile_width

Integer scalar. The genome will be partitioned into tiles of this size. The reads in the bam file within each tile will be consecutively imported (or in parallel if cores > 1). Default 1e7.

Details

The three different prediction modes are explained in more detail below:

Novel cassette exons are predicted from pairs of novel splice junctions (SJ) that are located within an annotated intron and share the start and end coordinates of the intron:

1
2
3
4
X---------X   annotated intron
x---x         novel SJ
      x---x   novel SJ
X---NNN---X   predicted cassette exon (N)

First, the novel SJs are filtered: Only SJs that are located within an annotated intron and that share their start or end coordinates with the intron are retained. All introns that share both their start and end with a novel SJ are tested for cassette exons. If the two novel SJs within an intron do not overlap and are on the same strand, a novel cassette exon is predicted.

The second mode (parameter single_sj) is based on single novel SJs as input. The novel SJs can be divided in three different cases:

  1. Novel SJs that touch an annotated exon with their start (5' end).

  2. Novel SJs that touch annotated exons with their end (3' end).

  3. Novel SJs that touch an annoated exon on both ends (5' and 3' end).

The three cases can be illustrated as follows:

  1. Start touches annotation

    AAAA            annotated exon
       J---J        novel SJ
     xxx---xx----x  read
           NN----X  novel exon and second SJ
       
  2. End touches annotation

               AAA annotated exon
          J----J   novel SJ
    xx---xx----xx  read
     X---NN        novel exon and second SJ
       
  3. Both ends touch annotation

           AAAAA  annotated exon
    AAAA          annotated exon
       J---J      novel junction
      NN---xxxxx  possible transcript with novel exon NN at the 5' of the SJ
    xxxx---NN     possible transcript with novel exon NN at the 3' of the SJ
       

For case 1 and 2, the function reads the BAM file and takes all reads that support the novel SJ and have a second SJ. The coordinates of the second SJ, and thus the coordinates of the novel exon are determined from the reads. In case 3, the function searches for reads with two novel SJs that support a possible novel exon on either end of the SJ. If no reads are found, the function checks if the novel exon could be terminal, i.e. the first or last exon in a transcript. If yes, the novel exon is not connected to an annotated exon at its start/end and thus the start/end coordinate of the novel exon cannot be determined clearly. As an approximation, the function takes the boundaries of the read with the longest mapping to the novel exon. The third prediction mode (parameter read_based) only uses reads and the annotation as input. Novel exons are predicted from novel combinations of already annotated SJs. The SJ pairs are defined by a single read with two SJs or by read pairs where each read spans one junction. Each of the SJ pairs are compared with the annotated SJs per transcript and already annotated SJ pairs are removed. For paired-end reads, there is an additional requirement: The distance between the end of the first read in a pair and the start of the second read has to be < 2*(readlength-minOverhang) + minIntronSize. Here, readlength is the length of the reads, minOverhang is the minmal required read overhang over a SJ of the alignment tool and minIntronSize is the minimal required intron length of the alignment tool. For example, paired-end reads with a lenght of 101 nts and a minimal overhang of 6 and a minimal intron length of 21 allow a distance of at most 211 nucleotides between the two SJs: 2*(101-6) + 21 = 211. If the distance between the two SJs exceeds the limit, it cannot be guaranteed that the junctions are connected to the same exon.

Novel exons are then predicted from novel combinations of the identified SJ pairs. A novel exon is only predicted if it is contained within the boundaries of an annotated gene. This prevents the prediction of false positive exons because of wrongly mapped reads.

Value

Data.frame with the coordinates of the identified novel exons. Each row in the data.frame is a predicted novel exon. The columns are the chromosome (seqnames), the end of the upstream exon (lend) in the transcript, the start and end of the novel exon (start and end), the start of the downstream exon (rstart) in the transcript and the strand (strand). The next three columns are the number of reads supporting each of the two splice junctions that define the novel exon: unique_left is the number of reads supporting the SJ from lend to start and unique_right is the number of supporting reads for the SJ from end to rstart. min_reads is the minimum of the two. The last column is the ID (ID) of the novel exon, i.e. a number between 1 and the total number of predictions.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
sj <- system.file("extdata", "selected.SJ.out.tab",
                  package = "DISCERNS", mustWork = TRUE)
bam <- system.file("extdata", "selected.bam",
                   package = "DISCERNS", mustWork = TRUE)

## prepare annotation
gtf <- system.file("extdata", "selected.gtf", package = "DISCERNS",
                   mustWork = TRUE)
anno <- prepare_annotation(gtf)

## predict novel exons
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam)

## predict only cassette exons
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 single_sj = FALSE, read_based = FALSE)

## only predict exons from novel splice junctions
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, read_based = FALSE)

## Only consider novel splice junctions with at least ten supporting reads
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 10,
                 bam = bam)

## turn verbose off
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, verbose = FALSE)

## increase chunk size for BAM file reading if lots of memory is available
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, yield_size = 1000000)

## Stranded paired-end reads where the first read comes from the reverse
## strand, e.g., Illumina TruSeq stranded mRNA protocol. This is the default.
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, lib_type = "PE", stranded = "reverse")

## Unstranded single-end reads
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, lib_type = "SE", stranded = "unstranded")
                 
## If STAR was run with parameter --outSJfilterOverhangMin (e.g.
## --outSJfilterOverhangMin 30 6 6 6), specify the minimal junction overhang
## with parameter overhang_min: 
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
                 bam = bam, overhang_min = 6)
                 

khembach/DISCERNS documentation built on June 23, 2020, 3:35 p.m.