tallyRanges: Tallying function with a 'GRanges' interface.


View source: R/tally.in.ranges.r

Description

Functions for tallying BAM files in genomic intervals provided as GRanges objects. Special versions of the function exist for writing the results directly to an HDF5 tally file and for computation on a cluster.

Usage

tallyRanges(bamfiles, ranges, reference, q = 25, ncycles = 10,
            max.depth = 1e+06)
tallyRangesToFile(tallyFile, study, bamfiles, ranges, reference,
                  samples = NULL, q = 25, ncycles = 0, max.depth = 1e6)
tallyRangesBatch(tallyFile, study, bamfiles, ranges, reference, q = 25,
                 ncycles = 10, max.depth = 1e6, regID = "Tally",
                 res = list("ncpus" = 2, "memory" = 24000,
                            "queue" = "research-rh6"),
                 written = c(), wrfile = "written.jobs.RDa",
                 waitTime = Inf)

Arguments

bamfiles

Character vector giving the locations of the BAM files to be tallied.

ranges

A GRanges object describing the ranges that tallies shall be generated in, e.g. the result of a call to binGenome or a set of exon or gene annotations provided by a TxDb object.

reference

BSgenome object describing the reference genome that the alignments were made against.

samples

The indices (within the HDF5 datasets) corresponding to the samples that the data represents. You can use this option to write sub-sets of samples from a cohort.

q

Read alignment quality cut-off.

ncycles

Number of cycles from the front and back of the reads that should be considered unreliable for mismatch detection.

max.depth

Maximum depth of coverage to consider.

tallyFile

Filename of the HDF5 tally file that the data shall be written to.

study

The location within the HDF5 file that corresponds to the HDF5-group representing the study we are working on.

regID

Identifier for a BatchJobs registry which will be used to store and organise the cluster jobs used for parallelisation of the work.

res

Resource list specifying the compute resources to be requested for each of the cluster jobs.

written

Numerical vector giving the IDs of jobs whose results have already been written to the tally file; this can be used to resume writing after a crash.

wrfile

Filename of a file in which to store the IDs of already-written jobs; can be used to resume writing after a crash.

waitTime

How long the function shall wait for cluster jobs to finish before giving up. The default is to wait forever.

Details

tallyRanges returns the tallies corresponding to the specified ranges; tallyRangesToFile performs the same task but writes the results directly to the tally file. tallyRangesBatch uses the BatchJobs package to set up cluster jobs for tallying and then collects and writes the results of those jobs to the tally file. It is important to have a properly configured cluster (including a .BatchJobs.R file as well as a template file); see the BatchJobs documentation for that information.
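As an illustration of the direct-writing variant, a call to tallyRangesToFile might look like the following sketch. The tally file path, study group, and sample indices here are hypothetical placeholders; a tally file with the matching group structure must already exist (e.g. one prepared with the package's tally-file setup functions).

```r
## Sketch only: "example.tally.hdf5" and "/ExampleStudy" are hypothetical
## placeholders for an existing tally file and its study group.
tallyRangesToFile(
  tallyFile = "example.tally.hdf5",
  study     = "/ExampleStudy",
  bamfiles  = bamFiles,   # character vector of BAM file paths
  ranges    = dnmt3a,     # GRanges of target intervals
  reference = Hsapiens,   # BSgenome the reads were aligned against
  samples   = 1:2         # HDF5 sample indices to write (subset of cohort)
)
```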

Value

For tallyRanges the return value is a list of lists, where the top level corresponds to the ranges provided as input to the function and each element is a list of datasets in a format that can be written directly to an HDF5 file using the writeToTallyFile function. The other two functions perform the writing directly.
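A sketch of the two-step workflow of tallying in memory and then writing the result out with writeToTallyFile is shown below. The file name and study group are hypothetical placeholders, and the exact argument names of writeToTallyFile should be checked against its own help page.

```r
## Tally three ranges in memory, then write the per-range dataset lists
## to a pre-existing tally file. File path and study group are
## hypothetical placeholders, not files shipped with the package.
theData <- tallyRanges(bamFiles, ranges = dnmt3a[1:3], reference = Hsapiens)
writeToTallyFile(theData, file = "example.tally.hdf5",
                 study = "/ExampleStudy", ranges = dnmt3a[1:3])
```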

Author(s)

Paul Theodor Pyl

Examples

suppressPackageStartupMessages(library("h5vc"))
suppressPackageStartupMessages(library("rhdf5"))
files <- list.files( system.file("extdata", package = "h5vcData"), "Pt.*bam$" )
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
suppressPackageStartupMessages(require(BSgenome.Hsapiens.NCBI.GRCh38))
suppressPackageStartupMessages(require(GenomicRanges))
dnmt3a <- read.table(system.file("extdata", "dnmt3a.txt", package = "h5vcData"), header=TRUE, stringsAsFactors = FALSE)
dnmt3a <- with( dnmt3a, GRanges(seqname, ranges = IRanges(start = start, end = end)))
dnmt3a <- reduce(dnmt3a)
require(BiocParallel)
register(MulticoreParam())
theData <- tallyRanges( bamFiles, ranges = dnmt3a[1:3], reference = Hsapiens )
str(theData)

h5vc documentation built on Nov. 8, 2020, 4:56 p.m.