View source: R/CollateData.R

collateData R Documentation

Collates a dataset from (processBAM) output files of individual samples

Description

collateData() creates a dataset from a collection of processBAM output files belonging to an experiment.

Usage

collateData(
  Experiment,
  reference_path,
  output_path,
  IRMode = c("SpliceOver", "SpliceMax"),
  packageCOVfiles = FALSE,
  novelSplicing = FALSE,
  forceStrandAgnostic = FALSE,
  novelSplicing_minSamples = 3,
  novelSplicing_countThreshold = 10,
  novelSplicing_minSamplesAboveThreshold = 1,
  novelSplicing_requireOneAnnotatedSJ = TRUE,
  novelSplicing_useTJ = TRUE,
  overwrite = FALSE,
  n_threads = 1,
  lowMemoryMode = TRUE
)

Arguments

Experiment

(Required) A 2- or 3-column data frame, ideally generated by findSpliceWizOutput or findSamples. The first column designates the sample names, and the second column contains the paths to the processBAM output files (of type sample.txt.gz). Optionally, a third column contains the coverage files (of type sample.cov) of the corresponding samples. NB: all other columns are ignored.

reference_path

(Required) The path to the reference generated by Build-Reference-methods.

output_path

(Required) The path to contain the output files for the collated dataset.

IRMode

(default SpliceOver) The algorithm used to calculate 'splice abundance' in IR quantification. Valid options are SpliceOver and SpliceMax. See Details.

packageCOVfiles

(default FALSE) Whether COV files should be copied over to the NxtSE object. This is useful if one wishes to transfer the NxtSE folder to a collaborator, who can then open the NxtSE object with valid COV file paths.

novelSplicing

(default FALSE) Whether collateData will use novel junction reads detected in samples to infer novel splice variants. All tandem split reads (those bridging two consecutive splice junctions), as well as novel split reads that satisfy abundance criteria (see novelSplicing_minSamples, novelSplicing_minSamplesAboveThreshold, and novelSplicing_countThreshold), are used to synthesise a dataset-specific SpliceWiz reference. See Details.

forceStrandAgnostic

(default FALSE) In poorly prepared stranded libraries, it may be better to quantify in unstranded mode. Set this to TRUE if your stranded libraries may be contaminated with unstranded reads.

novelSplicing_minSamples

(default 3) Novel junctions are included in building the novel reference if the number of samples with non-zero counts exceeds this number.

novelSplicing_countThreshold

(default 10) Threshold of split reads across novel junctions; used in conjunction with novelSplicing_minSamplesAboveThreshold.

novelSplicing_minSamplesAboveThreshold

(default 1) Novel junctions are included in building the novel reference if the number of samples in which novel junction reads are above the threshold (novelSplicing_countThreshold) exceeds this number.

novelSplicing_requireOneAnnotatedSJ

(default TRUE) The default requires novel junctions to have one annotated splice site. If this is disabled, collateData will include novel junctions where neither splice site is annotated.

novelSplicing_useTJ

(default TRUE) For novel splicing, should SpliceWiz use reads with 2 or more junctions to find novel exons? Ignored if novelSplicing is set to FALSE.

overwrite

(default FALSE) If collateData() has previously been run using the same set of samples, it will not be overwritten unless this is set to TRUE.

n_threads

(default 1) The number of threads to use. If you run out of memory, try lowering the number of threads.

lowMemoryMode

(default TRUE) collateData() will perform optimizations to conserve memory if this is set to TRUE. Otherwise, it will prioritise performance.
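
The Experiment argument described above is normally produced by findSpliceWizOutput, but the layout it expects can also be assembled by hand. The following is a minimal base-R sketch of such a table. The sample names, folder and column names are hypothetical and only for illustration; per the description above, it is the column order that matters.

```r
# Hypothetical sample names and output folder, for illustration only
out_dir <- file.path(tempdir(), "SpliceWiz_Output")
samples <- c("sampleA", "sampleB", "sampleC")

experiment <- data.frame(
    sample = samples,                                          # column 1: sample names
    path = file.path(out_dir, paste0(samples, ".txt.gz")),     # column 2: processBAM output
    cov_file = file.path(out_dir, paste0(samples, ".cov")),    # column 3 (optional): COV files
    stringsAsFactors = FALSE
)

# Basic sanity checks before handing the table to collateData()
stopifnot(
    ncol(experiment) >= 2,
    !anyDuplicated(experiment$sample),
    all(grepl("\\.txt\\.gz$", experiment$path))
)
```

A table built this way only describes where the files should be; the files themselves must already exist from a processBAM run.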

Details

On Windows, collateData runs using only 1 thread, as BiocParallel's MulticoreParam is not supported.

It is assumed that all sample processBAM outputs were generated using the same reference.

The combination of junction counts and IR quantification from processBAM is used to calculate the percentage spliced in (PSI) of alternative splice events and the intron retention ratios (IR-ratio) of retained introns. QC information is also collated. Data is organised in an H5 file and FST files for memory- and processor-efficient downstream access using makeSE.
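
As a rough illustration of the two quantities named above, the sketch below uses their commonly cited definitions: PSI as included / (included + excluded), and IR-ratio as intron depth / (intron depth + splice abundance). These formulas and all the numbers are assumptions for illustration; the exact normalisation performed by collateData is not described in this section.

```r
# Hypothetical inclusion / exclusion counts for one splice event in three samples
included <- c(30, 10, 0)
excluded <- c(10, 30, 20)
psi <- included / (included + excluded)

# Hypothetical intron depth and splice abundance for one intron in three samples
intron_depth <- c(5, 20, 0)
splice_abundance <- c(95, 20, 50)
ir_ratio <- intron_depth / (intron_depth + splice_abundance)

print(data.frame(psi = psi, ir_ratio = ir_ratio))
```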

The original IRFinder algorithm (see the IRFinder wiki) uses SpliceMax to estimate the abundance of spliced transcripts. This calculates the number of mapped splice events that share the boundary coordinate of either the left or right flanking exon (SpliceLeft, SpliceRight), estimating splice abundance as the larger of the two values.
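
The SpliceMax rule can be sketched in a few lines of base R. The per-intron counts below are hypothetical, and the variable names splice_left and splice_right are illustrative stand-ins for SpliceLeft and SpliceRight.

```r
# Hypothetical counts of splice events sharing the left / right exon boundary
splice_left  <- c(intron1 = 40, intron2 = 12, intron3 = 0)
splice_right <- c(intron1 = 35, intron2 = 50, intron3 = 0)

# SpliceMax: the larger of the two values, taken per intron
splice_max <- pmax(splice_left, splice_right)
print(splice_max)
```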

SpliceWiz proposes a new algorithm, SpliceOver, to account for the possibility that the major isoform shares neither boundary, but arises from either of the flanking exon clusters. Exon clusters are contiguous regions covered by exons from any transcript (except those designated as retained_intron or sense_intronic), and are separated by obligate intronic regions (genomic regions that are introns for all transcripts). For introns that are internal to a single exon cluster (i.e. akin to "known-exon" introns from IRFinder), SpliceOver uses GenomicRanges::findOverlaps to sum all splice reads that overlap the same genomic region as the intron of interest.
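
The overlap-and-sum idea behind SpliceOver, for an intron internal to a single exon cluster, can be sketched as follows. This is a simplified base-R stand-in for GenomicRanges::findOverlaps, not the package's implementation: it assumes all junctions lie on the same chromosome and strand, treats coordinates as closed intervals, and uses hypothetical coordinates and counts.

```r
# Hypothetical junctions (start, end, split-read count), same chromosome and strand
junctions <- data.frame(
    start = c(100, 100, 150, 400),
    end   = c(200, 260, 300, 500),
    count = c(20, 5, 8, 30)
)

# Hypothetical intron of interest
intron <- c(start = 120, end = 250)

# Two closed intervals overlap if each one starts no later than the other ends
overlaps <- junctions$start <= intron[["end"]] &
    junctions$end >= intron[["start"]]

# SpliceOver-style abundance: sum of split reads over all overlapping junctions
splice_over <- sum(junctions$count[overlaps])
print(splice_over)
```

In this toy case the first three junctions overlap the intron and the fourth does not, so junctions that share neither intron boundary still contribute to the estimate.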

Detection of novel ASEs: When novelSplicing is set to TRUE, novel junctions (split reads across unannotated junctions from samples of the dataset being collated) are used in conjunction with the reference to compile a list of novel ASEs. To avoid being overwhelmed by a large number of false positive novel junctions (often due to mis-alignments), a simple filtering strategy is used. A novel junction is included only if it occurs in a minimum number of samples (default 3), or if its number of split reads is above a pre-defined threshold (default 10) in a certain number of samples (default 1). These parameters can be set using novelSplicing_minSamples, novelSplicing_countThreshold and novelSplicing_minSamplesAboveThreshold respectively.
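
The filtering strategy described above can be sketched on a small junction-by-sample count matrix. This is an illustration under stated assumptions, not the package's code: the counts are hypothetical, sample-number limits are read as "at least" (the argument descriptions say "exceeds", so the exact boundary behaviour is an assumption), and "above threshold" is read as strictly greater than the threshold.

```r
# Hypothetical split-read counts: rows are novel junctions, columns are samples
counts <- matrix(
    c(1, 2, 1, 0,     # junction A: low counts, but present in 3 samples
      0, 0, 15, 0,    # junction B: 15 reads in a single sample
      0, 3, 0, 0),    # junction C: satisfies neither criterion
    nrow = 3, byrow = TRUE,
    dimnames = list(c("A", "B", "C"), paste0("s", 1:4))
)

# Defaults taken from the argument list above
min_samples <- 3          # novelSplicing_minSamples
count_threshold <- 10     # novelSplicing_countThreshold
min_samples_above <- 1    # novelSplicing_minSamplesAboveThreshold

n_nonzero <- rowSums(counts > 0)                # samples with any reads
n_above <- rowSums(counts > count_threshold)    # samples above the threshold

# Keep a junction if either criterion is met (boundary handling is an assumption)
keep <- n_nonzero >= min_samples | n_above >= min_samples_above
print(keep)
```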

Value

collateData() writes to the directory given by output_path. This output directory is portable (i.e. it can be moved to a different location after running collateData() before running makeSE), but individual files within the output folder should not be moved.

Also, the processBAM and collateData output folders should be copied to the same destination and their relative paths preserved. Otherwise, the locations of the "COV" files will not be recorded in the collated data and will have to be re-assigned using covfile(se)<- (see makeSE).
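
One way to move both folders while preserving their relative paths is to copy them side by side into the same destination. The sketch below uses only base R and hypothetical folder and file names; the placeholder files exist purely so that the example is self-contained.

```r
# Hypothetical project layout holding both output folders
src <- file.path(tempdir(), "project")
dir.create(file.path(src, "SpliceWiz_Output"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(src, "Collated_output"), recursive = TRUE, showWarnings = FALSE)

# Placeholder files standing in for real outputs
file.create(file.path(src, "SpliceWiz_Output", "sampleA.cov"))
file.create(file.path(src, "Collated_output", "placeholder.txt"))

# Copy both folders into one destination so their relative paths are unchanged
dest <- file.path(tempdir(), "shared")
dir.create(dest, showWarnings = FALSE)
copied <- file.copy(
    from = file.path(src, c("SpliceWiz_Output", "Collated_output")),
    to = dest,
    recursive = TRUE
)
```

Whole folders are copied rather than individual files, in line with the note above that files within the output folder should not be moved on their own.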

See Also

processBAM, makeSE

Examples

library(SpliceWiz)

# Build a reference from the example chrZ genome and annotation
buildRef(
    reference_path = file.path(tempdir(), "Reference"),
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)

# Process the example BAM files against that reference
bams <- SpliceWiz_example_bams()
processBAM(bams$path, bams$sample,
  reference_path = file.path(tempdir(), "Reference"),
  output_path = file.path(tempdir(), "SpliceWiz_Output")
)

# Locate the processBAM output files and collate them into a dataset
expr <- findSpliceWizOutput(file.path(tempdir(), "SpliceWiz_Output"))
collateData(expr,
  reference_path = file.path(tempdir(), "Reference"),
  output_path = file.path(tempdir(), "Collated_output")
)

# Enable novel splicing:

collateData(expr,
  reference_path = file.path(tempdir(), "Reference"),
  output_path = file.path(tempdir(), "Collated_output"),
  novelSplicing = TRUE
)


alexchwong/SpliceWiz documentation built on Oct. 15, 2024, 10:12 a.m.