Description Usage Arguments Details Value Examples
View source: R/find_novel_exons.R
Predict novel exons based on the SJ.out.tab file from STAR and/or a BAM file. The function has three different prediction modes:
Cassette exons
Novel exons from a single novel splice junction
Novel exons from reads/read pairs with two novel splice junctions.
The cassette exon prediction is always performed and the second and third
mode can be turned on/off with the paramters single_sj
and read_based
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | find_novel_exons(
sj_filename,
annotation,
min_unique = 1,
gzipped = FALSE,
verbose = TRUE,
bam,
single_sj = TRUE,
read_based = TRUE,
yield_size = 2e+05,
lib_type = "PE",
stranded = "reverse",
read_length = 101,
overhang_min = 12,
min_intron_size = 21,
cores = 1,
tile_width = 1e+07
)
|
sj_filename |
Path to SJ.out.tab file. |
annotation |
List with exon and intron annotation as GRanges. Created
with |
min_unique |
Integer scalar. Minimal number of reads required to map over
a splice junction. Novel splice junctions with less reads are excluded from
the analysis. All predicted novel exons with less than |
gzipped |
A logical scalar. Is the input file gzipped? Default FALSE. |
verbose |
Print messages about the progress. |
bam |
Character string. The path to the BAM file. |
single_sj |
A logical scalar. Should single novel splice junctions be used for prediction? Requires a BAM file. |
read_based |
A logical scalar. Should novel exons be predicted from reads or read-pairs with two novel splice junctions? Requires a BAM file. |
yield_size |
Integer scalar. Read the BAM file in chunks of this size. |
lib_type |
Character string. Type of the sequencing library: either "SE" (single-end) or "PE" (paired-end). Default "PE". |
stranded |
Character string. Strand type of the sequencing protocol: "unstranded" for unstranded protocols; "forward" or "reverse" for stranded protocols. In a "forward" protocol, the first read in a pair (or sinlge-end reads) comes from the forward (sense) strand and in a "reverse" protocol from the reverse (antisense) strand. See the Salmon documentation for an explanation of the different fragment library types. Default "reverse". |
read_length |
Integer scalar. Length of your reads in bps. Default 101. |
overhang_min |
Integer scalar. Minimum overhang length for splice
junctions on both sides as defined by the |
min_intron_size |
Integer scalar. Minimum intron size
( |
cores |
Integer scalar. Number of cores to use. Default 1. |
tile_width |
Integer scalar. The genome will be partitioned into tiles
of this size. The reads in the |
The three different prediction modes are explained in more detail below:
Novel cassette exons are predicted from pairs of novel splice junctions (SJ) that are located within an annotated intron and share the start and end coordinates of the intron:
1 2 3 4 | X---------X annotated intron
x---x novel SJ
x---x novel SJ
X---NNN---X predicted cassette exon (N)
|
First, the novel SJs are filtered: Only SJs that are located within an annotated intron and that share their start or end coordinates with the intron are retained. All introns that share both their start and end with a novel SJ are tested for cassette exons. If the two novel SJs within an intron do not overlap and are on the same strand, a novel cassette exon is predicted.
The second mode (parameter single_sj
) is based on single novel SJs as input.
The novel SJs can be divided in three different cases:
Novel SJs that touch an annotated exon with their start (5' end).
Novel SJs that touch annotated exons with their end (3' end).
Novel SJs that touch an annoated exon on both ends (5' and 3' end).
The three cases can be illustrated as follows:
Start touches annotation
AAAA annotated exon J---J novel SJ xxx---xx----x read NN----X novel exon and second SJ
End touches annotation
AAA annotated exon J----J novel SJ xx---xx----xx read X---NN novel exon and second SJ
Both ends touch annotation
AAAAA annotated exon AAAA annotated exon J---J novel junction NN---xxxxx possible transcript with novel exon NN at the 5' of the SJ xxxx---NN possible transcript with novel exon NN at the 3' of the SJ
For case 1 and 2, the function reads the BAM file and takes all reads that
support the novel SJ and have a second SJ. The coordinates of the second SJ,
and thus the coordinates of the novel exon are determined from the reads. In
case 3, the function searches for reads with two novel SJs that support a
possible novel exon on either end of the SJ. If no reads are found, the
function checks if the novel exon could be terminal, i.e. the first or last
exon in a transcript. If yes, the novel exon is not connected to an annotated
exon at its start/end and thus the start/end coordinate of the novel exon
cannot be determined clearly. As an approximation, the function takes the
boundaries of the read with the longest mapping to the novel exon.
The third prediction mode (parameter read_based
) only uses reads and the
annotation as input. Novel exons are predicted from novel combinations of
already annotated SJs. The SJ pairs are defined by a single read with two SJs
or by read pairs where each read spans one junction. Each of the SJ pairs are
compared with the annotated SJs per transcript and already annotated SJ pairs
are removed.
For paired-end reads, there is an additional requirement: The distance between
the end of the first read in a pair and the start of the second read has to be
< 2*(readlength-minOverhang) + minIntronSize
. Here, readlength
is the
length of the reads, minOverhang
is the minmal required read overhang over a
SJ of the alignment tool and minIntronSize
is the minimal required intron
length of the alignment tool. For example, paired-end reads with a lenght of
101 nts and a minimal overhang of 6 and a minimal intron length of 21 allow a
distance of at most 211 nucleotides between the two SJs: 2*(101-6) + 21 = 211.
If the distance between the two SJs exceeds the limit, it cannot be guaranteed
that the junctions are connected to the same exon.
Novel exons are then predicted from novel combinations of the identified SJ pairs. A novel exon is only predicted if it is contained within the boundaries of an annotated gene. This prevents the prediction of false positive exons because of wrongly mapped reads.
Data.frame with the coordinates of the identified novel exons. Each
row in the data.frame is a predicted novel exon. The columns are the
chromosome (seqnames
), the end of the upstream exon (lend
) in the
transcript, the start and end of the novel exon (start
and end
), the
start of the downstream exon (rstart
) in the transcript and the strand
(strand
). The next three columns are the number of reads supporting each
of the two splice junctions that define the novel exon: unique_left
is the
number of reads supporting the SJ from lend
to start
and unique_right
is the number of supporting reads for the SJ from end
to rstart
.
min_reads
is the minimum of the two. The last column is the ID (ID
) of
the novel exon, i.e. a number between 1 and the total number of predictions.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | sj <- system.file("extdata", "selected.SJ.out.tab",
package = "DISCERNS", mustWork = TRUE)
bam <- system.file("extdata", "selected.bam",
package = "DISCERNS", mustWork = TRUE)
## prepare annotation
gtf <- system.file("extdata", "selected.gtf", package = "DISCERNS",
mustWork = TRUE)
anno <- prepare_annotation(gtf)
## predict novel exons
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
bam = bam)
## predict only cassette exons
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
single_sj = FALSE, read_based = FALSE)
## only predict exons from novel splice junctions
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
bam = bam, read_based = FALSE)
## Only consider novel splice junctions with at least ten supporting reads
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 10,
bam = bam)
## turn verbose off
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
bam = bam, verbose = FALSE)
## increase chunk size for BAM file reading if lots of memory is available
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
bam = bam, yield_size = 1000000)
## Stranded paired-end reads where the first read comes from the reverse
## strand, e.g., Illumina TruSeq stranded mRNA protocol. This is the default.
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
bam = bam, lib_type = "PE", stranded = "reverse")
## Unstranded single-end reads
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
bam = bam, lib_type = "SE", stranded = "unstranded")
## If STAR was run with parameter --outSJfilterOverhangMin (e.g.
## --outSJfilterOverhangMin 30 6 6 6), specify the minimal junction overhang
## with parameter overhang_min:
find_novel_exons(sj_filename = sj, annotation = anno, min_unique = 1,
bam = bam, overhang_min = 6)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.