collateData | R Documentation |
collateData()
creates a dataset from a collection of processBAM
output files belonging to an experiment.
collateData(
Experiment,
reference_path,
output_path,
IRMode = c("SpliceOver", "SpliceMax"),
packageCOVfiles = FALSE,
novelSplicing = FALSE,
forceStrandAgnostic = FALSE,
novelSplicing_minSamples = 3,
novelSplicing_countThreshold = 10,
novelSplicing_minSamplesAboveThreshold = 1,
novelSplicing_requireOneAnnotatedSJ = TRUE,
novelSplicing_useTJ = TRUE,
overwrite = FALSE,
n_threads = 1,
lowMemoryMode = TRUE
)
Experiment |
(Required) A 2 or 3 column data frame, ideally generated by
findSpliceWizOutput or findSamples.
The first column designates the sample names, and the second column
contains the path to the processBAM output file (of type
|
reference_path |
(Required) The path to the reference generated by Build-Reference-methods |
output_path |
(Required) The path to contain the output files for the collated dataset |
IRMode |
(default |
packageCOVfiles |
(default |
novelSplicing |
(default FALSE) Whether collateData will use
novel junction reads detected in samples to infer novel splice variants.
All tandem split reads (those bridging two consecutive splice junctions)
are used, as well as novel split reads that satisfy abundance criteria
(see |
forceStrandAgnostic |
(default |
novelSplicing_minSamples |
(default 3) Novel junctions are included in building the novel reference if the number of samples with non-zero counts exceeds this number. |
novelSplicing_countThreshold |
(default 10) Threshold of split-reads across
novel junctions; used in conjunction with
|
novelSplicing_minSamplesAboveThreshold |
(default 1) Novel junctions are included in building the novel reference if the number of samples in which novel junction reads are above a pre-defined threshold (see novelSplicing_countThreshold) exceeds this number. |
novelSplicing_requireOneAnnotatedSJ |
(default |
novelSplicing_useTJ |
(default |
overwrite |
(default |
n_threads |
(default |
lowMemoryMode |
(default |
In Windows, collateData runs using only 1 thread, as BiocParallel's MulticoreParam is not supported.
It is assumed that all sample processBAM outputs were generated using the same reference.
The combination of junction counts and IR quantification from processBAM is used to calculate percentage spliced in (PSI) of alternative splice events, and intron retention ratios (IR-ratio) of retained introns. QC information is also collated. Data is organised in an H5 file and FST files for memory- and processor-efficient downstream access using makeSE.
The original IRFinder algorithm (see the following wiki) uses SpliceMax
to estimate abundance of spliced transcripts.
This calculates the number of mapped splice events
that share the boundary coordinate of either the left or right flanking
exon (SpliceLeft, SpliceRight), estimating splice abundance as the larger
of the two values.
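The SpliceMax calculation described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the `junctions` table, its coordinates and counts, and the variable names are all invented for the example.

```r
# Toy junction table (hypothetical data): one row per splice junction.
junctions <- data.frame(
  start = c(100, 100, 150),
  end   = c(200, 300, 200),
  count = c(20, 5, 8)
)

# Intron of interest, defined by its two boundary coordinates.
intron_start <- 100
intron_end   <- 200

# SpliceLeft: reads from junctions sharing the intron's left boundary.
splice_left <- sum(junctions$count[junctions$start == intron_start])

# SpliceRight: reads from junctions sharing the intron's right boundary.
splice_right <- sum(junctions$count[junctions$end == intron_end])

# SpliceMax: the larger of the two values.
splice_max <- max(splice_left, splice_right)
```

Here two junctions share the left boundary and two share the right boundary, and the estimate is whichever side carries more reads.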
SpliceWiz proposes a new algorithm, SpliceOver,
to account for the possibility that the major isoform shares neither
boundary, but arises from either of the flanking exon clusters. Exon
clusters are contiguous regions covered by exons from any transcript
(except those designated as retained_intron or sense_intronic),
and are separated by
obligate intronic regions (genomic regions that are introns for all
transcripts). For introns that are internal to a single exon cluster
(i.e. akin to "known-exon" introns from IRFinder), SpliceOver
uses GenomicRanges::findOverlaps to sum all splice reads that overlap
the same genomic region as the intron of interest.
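The idea behind SpliceOver can be illustrated with a small base R sketch. It uses a plain interval comparison as a stand-in for GenomicRanges::findOverlaps and assumes that any overlap counts; the exact overlap settings used by SpliceWiz (strand handling, minimum overlap) are not stated here, and the data and variable names are hypothetical.

```r
# Toy junction table (hypothetical data). The most abundant junction
# (120-180) shares neither boundary with the intron of interest.
junctions <- data.frame(
  start = c(100, 120, 400),
  end   = c(200, 180, 500),
  count = c(20, 30, 7)
)

# Intron of interest.
intron <- c(start = 100, end = 200)

# A junction overlaps the intron if the two closed intervals intersect
# (assumption: any overlap counts).
overlaps <- junctions$start <= intron[["end"]] &
  junctions$end >= intron[["start"]]

# SpliceOver-style estimate: sum of all overlapping splice reads.
splice_over <- sum(junctions$count[overlaps])
```

In this toy case a boundary-matching approach would only see the 20 reads of the 100-200 junction, whereas the overlap-based sum also includes the 30 reads of the 120-180 junction.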
Detection of novel ASEs: When novelSplicing is set to TRUE,
novel junctions (split reads across unannotated junctions from samples
of the dataset being collated) are used in conjunction with the reference
to compile a list of novel ASEs. To avoid being overwhelmed by a large
number of false positive novel junctions (often due to mis-alignments),
a simple filtering strategy is used. Novel junctions are included
only if they occur in a minimum number of samples (default 3),
or if the number of split reads of a novel junction is above a pre-defined
threshold (default 10) in a certain number of samples (default 1). These
parameters can be set using novelSplicing_minSamples,
novelSplicing_countThreshold
and novelSplicing_minSamplesAboveThreshold
respectively.
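The filtering strategy above can be sketched on a toy count matrix. This is an illustration under stated assumptions rather than the package's code: the counts are invented, the variable names are hypothetical, and the comparisons use "at least" for the sample counts, whereas the argument descriptions say "exceeds", so behaviour exactly at the boundary may differ in SpliceWiz.

```r
# Toy matrix of novel junction read counts (hypothetical data):
# rows are novel junctions, columns are samples.
counts <- matrix(
  c( 1,  2, 1, 3,
    15, 12, 0, 0,
     2,  0, 0, 0,
     0,  9, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("junction", LETTERS[1:4]), paste0("sample", 1:4))
)

# Values mirroring the documented defaults.
min_samples                 <- 3   # novelSplicing_minSamples
count_threshold             <- 10  # novelSplicing_countThreshold
min_samples_above_threshold <- 1   # novelSplicing_minSamplesAboveThreshold

# Number of samples in which each junction has any reads.
n_nonzero <- rowSums(counts > 0)

# Number of samples in which each junction is above the count threshold.
n_above <- rowSums(counts > count_threshold)

# Keep a junction if either criterion is met (assumption: ">=" comparisons).
keep <- n_nonzero >= min_samples | n_above >= min_samples_above_threshold
retained <- names(keep)[keep]
```

With these toy values, junctionA is retained because it is seen in many samples, junctionB because it is abundant in some samples, and the other two are dropped.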
collateData() writes to the directory given by output_path.
This output directory is portable (i.e. it can be moved to a different
location after running collateData() before running makeSE), but
individual files within the output folder should not be moved.
Also, the processBAM and collateData output folders should be copied to
the same destination and their relative paths preserved. Otherwise, the
locations of the "COV" files will not be recorded in the collated data and
will have to be re-assigned using covfile(se)<-. See makeSE.
processBAM, makeSE
buildRef(
reference_path = file.path(tempdir(), "Reference"),
fasta = chrZ_genome(),
gtf = chrZ_gtf()
)
bams <- SpliceWiz_example_bams()
processBAM(bams$path, bams$sample,
reference_path = file.path(tempdir(), "Reference"),
output_path = file.path(tempdir(), "SpliceWiz_Output")
)
expr <- findSpliceWizOutput(file.path(tempdir(), "SpliceWiz_Output"))
collateData(expr,
reference_path = file.path(tempdir(), "Reference"),
output_path = file.path(tempdir(), "Collated_output")
)
# Enable novel splicing:
collateData(expr,
reference_path = file.path(tempdir(), "Reference"),
output_path = file.path(tempdir(), "Collated_output"),
novelSplicing = TRUE
)