CollateData: Processes data from IRFinder output

CollateData    R Documentation

Processes data from IRFinder output

Description

CollateData unifies a list of IRFinder output files belonging to an experiment.

Usage

CollateData(
  Experiment,
  reference_path,
  output_path,
  IRMode = c("SpliceOverMax", "SpliceMax"),
  overwrite = FALSE,
  n_threads = 1,
  samples_per_block = 16
)

Arguments

Experiment

(Required) A 2- or 3-column data frame, ideally generated by Find_IRFinder_Output or Find_Samples. The first column designates the sample names, and the second column contains the paths to the IRFinder output files (of type sample.txt.gz). Optionally, a third column contains the paths to the coverage files (of type sample.cov) of the corresponding samples. NB: all other columns are ignored.
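When the helper functions are not used, a table of this shape can be assembled by hand. The following is a minimal sketch only: the sample names, the directory and the column names are hypothetical, and it assumes (as described above) that columns are identified by position.

```r
# Hypothetical sample names and file locations, for illustration only
irf_dir <- file.path(tempdir(), "IRFinder_output")
samples <- c("sample_A", "sample_B")

# Column 1: sample names; column 2: IRFinder output; column 3: coverage files.
# The column names are arbitrary labels chosen for this sketch.
experiment <- data.frame(
    sample = samples,
    path = file.path(irf_dir, paste0(samples, ".txt.gz")),
    cov_file = file.path(irf_dir, paste0(samples, ".cov")),
    stringsAsFactors = FALSE
)

# Basic sanity checks before passing the table to CollateData()
stopifnot(
    ncol(experiment) >= 2,
    !anyDuplicated(experiment$sample),
    all(grepl("\\.txt\\.gz$", experiment$path))
)
print(experiment)
```

The files named here do not exist; in practice the paths should point to real IRFinder output.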

reference_path

(Required) The path to the reference generated by BuildReference.

output_path

(Required) The path to the directory that will contain the output files of this function.

IRMode

(default SpliceOverMax) The algorithm used to calculate 'splice abundance' in IR quantification. Valid options are SpliceOverMax and SpliceMax. See Details.

overwrite

(default FALSE) If CollateData() has previously been run on the same set of samples, the existing output is not overwritten unless this is set to TRUE.

n_threads

(default 1) The number of threads to use. On low-memory systems, reduce n_threads and samples_per_block.

samples_per_block

(default 16) The maximum number of samples to process per thread. Setting this to a lower value may help on memory-constrained systems.
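As a rough illustration of what a per-block cap means, the sketch below splits sample indices into blocks of at most samples_per_block. This is not NxtIRF's internal code; the sample count and variable names are assumptions made for the example.

```r
# Hypothetical experiment size, for illustration only
n_samples <- 40
samples_per_block <- 16

# Assign each sample index to a block holding at most samples_per_block samples
block_id <- ceiling(seq_len(n_samples) / samples_per_block)
blocks <- split(seq_len(n_samples), block_id)

# Number of samples in each block
block_sizes <- lengths(blocks)
print(block_sizes)
```

A smaller samples_per_block yields more, smaller blocks, so fewer samples are held in memory at any one time.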

Details

All sample IRFinder outputs must be generated using the same reference.

Junction counts and IR quantification from IRFinder are combined to calculate percent spliced in (PSI) of alternative splice events and percent intron retention (PIR) of retained introns. QC information is also extracted. Data is organised in an H5 file and FST files for memory- and processor-efficient downstream access using MakeSE.

The original IRFinder algorithm (see the IRFinder wiki) uses SpliceMax to estimate the abundance of spliced transcripts. It counts the mapped splice events that share the boundary coordinate of either the left or the right flanking exon (SpliceLeft and SpliceRight, respectively) and estimates splice abundance as the larger of the two values.
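The SpliceMax rule described above amounts to an element-wise maximum. A minimal sketch with made-up counts follows; the data frame and its values are hypothetical and do not come from real IRFinder output.

```r
# Made-up junction counts for three introns, for illustration only
introns <- data.frame(
    intron = c("intron_1", "intron_2", "intron_3"),
    SpliceLeft = c(120, 35, 0),
    SpliceRight = c(98, 60, 4)
)

# SpliceMax: the larger of the left and right splice counts, per intron
introns$SpliceMax <- pmax(introns$SpliceLeft, introns$SpliceRight)
print(introns)
```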

NxtIRF proposes a new algorithm, SpliceOverMax, to account for the possibility that the major isoform shares neither boundary but arises from either of the flanking "exon islands". Exon islands are contiguous regions covered by exons from any transcript (except those designated as retained_intron or sense_intronic), and are separated by obligate intronic regions (genomic regions that are introns in all transcripts). For introns that are internal to a single exon island (akin to "known-exon" introns from IRFinder), SpliceOverMax uses GenomicRanges::findOverlaps to sum all splice reads that overlap the genomic region of the intron of interest.
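The overlap-summing step can be pictured with a toy example. The sketch below uses a plain base R interval test as a stand-in for GenomicRanges::findOverlaps, on a single chromosome and strand; the coordinates, counts and variable names are invented, and it is not the package's implementation.

```r
# Toy splice junctions (made-up coordinates and read counts)
junctions <- data.frame(
    start = c(100, 100, 150, 400),
    end = c(200, 300, 250, 500),
    count = c(10, 5, 3, 7)
)

# Hypothetical intron of interest
intron_start <- 120
intron_end <- 220

# Two closed intervals overlap when each starts before the other ends
overlaps <- junctions$start <= intron_end & junctions$end >= intron_start

# Sum the splice reads of every junction that overlaps the intron
splice_over <- sum(junctions$count[overlaps])
print(splice_over)
```

Here the first three junctions overlap the intron and the fourth does not. With real data, a range-based overlap query also accounts for chromosome and strand, which this sketch ignores.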

Value

CollateData() writes to the directory given by output_path. This output directory is portable (i.e. it can be moved to a different location after running CollateData() before running MakeSE), but individual files within the output folder should not be moved.

Also, the IRFinder and CollateData output folders should be copied to the same destination with their relative paths preserved. Otherwise, the locations of the "COV" files will not be recorded in the collated data and will have to be re-assigned using covfile(se)<-. See MakeSE.

See Also

IRFinder, MakeSE

Examples

# Build a reference from the example chrZ genome and annotation
BuildReference(
    reference_path = file.path(tempdir(), "Reference"),
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)

# Run IRFinder on the example BAM files
bams <- NxtIRF_example_bams()
IRFinder(bams$path, bams$sample,
  reference_path = file.path(tempdir(), "Reference"),
  output_path = file.path(tempdir(), "IRFinder_output")
)

# Locate the IRFinder output files and collate them into one dataset
expr <- Find_IRFinder_Output(file.path(tempdir(), "IRFinder_output"))
CollateData(expr,
  reference_path = file.path(tempdir(), "Reference"),
  output_path = file.path(tempdir(), "NxtIRF_output")
)

alexchwong/NxtIRFcore documentation built on Oct. 31, 2022, 9:14 a.m.