subsampleGRanges: Randomly subsample reads from GRanges dataset
In BRGenomics: Tools for the Efficient Analysis of High-Resolution Genomics Data

Description Usage Arguments Value Use with normalized readcounts Author(s) Examples

Random subsampling is performed on reads, not on ranges. Readcounts should be given as a metadata field (usually "score"). The function can also subsample ranges directly if field = NULL, although in that case the base sample function can be used instead.

subsampleGRanges(
  dataset.gr,
  n = NULL,
  prop = NULL,
  field = "score",
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

`dataset.gr`	A GRanges object in which signal (e.g. readcounts) is contained within metadata, or a list of such GRanges objects.
`n, prop`	Either the number of reads to subsample (`n`), or the proportion of total signal to subsample (`prop`). Either `n` or `prop` can be given, but not both. If `dataset.gr` is a list, or if `length(field) > 1`, users can supply a vector or list of `n` or `prop` values to match the individual datasets, but care should be taken to ensure that a value is given for each and every dataset.
`field`	The metadata field of `dataset.gr` that contains readcounts for each position. If each range represents a single read, set `field = NULL`. If multiple fields are given, and `dataset.gr` is not a list, then `dataset.gr` will be treated as a multiplexed GRanges, and each field will be treated as an independent dataset. See `mergeGRangesData`.
`expand_ranges`	Logical indicating if ranges in `dataset.gr` should be treated as descriptions of single molecules (`FALSE`), or if ranges should be treated as representing multiple adjacent positions with the same signal (`TRUE`). See `getCountsByRegions`.
`ncores`	Number of cores to use for computations. Multiple cores are only used when `dataset.gr` is a list, or if `length(field) > 1`.
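The two interpretations selected by `expand_ranges` can be illustrated with plain vectors. The following is a minimal base R sketch based only on the argument description above, not on the package's code; the vectors `widths` and `scores` are hypothetical toy data standing in for the widths and scores of a GRanges object.

```r
# Toy data (hypothetical): three ranges with their widths and scores
widths <- c(1L, 3L, 2L)
scores <- c(4, 2, 1)

# expand_ranges = FALSE: each range describes single molecules,
# so the total signal is simply the sum of the scores
total_reads_unexpanded <- sum(scores)

# expand_ranges = TRUE: each range covers `width` adjacent positions that
# all carry the same signal, so the score counts once per position
per_position_signal <- rep(scores, times = widths)
total_reads_expanded <- sum(per_position_signal)
```

Under these assumptions, the pool of reads available for subsampling differs between the two settings whenever any range is wider than one base.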

A GRanges object identical in format to dataset.gr, but containing a random subset of its data. If field != NULL, the length of the output cannot be known a priori, but the sum of its score can.
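Why the summed score is predictable while the number of ranges is not can be seen in a small base R sketch of read-level subsampling. This is an illustration of the idea under stated assumptions, not the package's implementation; `scores` and `n` are hypothetical toy values.

```r
set.seed(1)  # for a reproducible illustration

# Toy data (hypothetical): readcounts at five positions
scores <- c(5L, 0L, 12L, 1L, 2L)
n <- 4L  # number of reads to keep

# Expand to one entry per read, labelled by the position it came from
read_positions <- rep(seq_along(scores), times = scores)

# Draw n reads without replacement, then count the reads kept per position
kept <- sample(read_positions, size = n)
sub_scores <- tabulate(kept, nbins = length(scores))

# Positions left with zero reads would drop out of the result
sub_positions <- which(sub_scores > 0)
```

Here `sum(sub_scores)` always equals `n`, whereas `length(sub_positions)` depends on which reads happen to be drawn.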

If the metadata field contains normalized readcounts, an attempt will be made to infer the normalization factor based on the lowest signal value found in the specified field.
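One way to picture this inference is the base R sketch below. It assumes that the lowest non-zero value corresponds to exactly one read; that assumption, the variable names, and the toy data are all hypothetical and are not taken from the package's code.

```r
# Toy data (hypothetical): raw counts scaled by a normalization factor
raw_counts <- c(3, 1, 7, 2)
norm_scores <- raw_counts * 0.25

# Assume the lowest non-zero value corresponds to a single read
unit <- min(norm_scores[norm_scores > 0])

# Dividing by that unit recovers integer readcounts
inferred_counts <- round(norm_scores / unit)
```

If no position in the data holds exactly one read, an inference of this kind would underestimate the true counts, so it is worth checking the result on normalized data.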

Mike DeBerardine

data("PROseq") # load included PROseq data

#--------------------------------------------------#
# Sample 10% of the reads of a GRanges with signal coverage
#--------------------------------------------------#

ps_sample <- subsampleGRanges(PROseq, prop = 0.1)

# cannot predict number of ranges (positions) that will be sampled
length(PROseq)
length(ps_sample)

# the sampled score sums to 1/10th of the original total
sum(score(PROseq))
sum(score(ps_sample))

#--------------------------------------------------#
# Sample 10% of ranges (e.g. if each range represents one read)
#--------------------------------------------------#

ps_sample <- subsampleGRanges(PROseq, prop = 0.1, field = NULL)

length(PROseq)
length(ps_sample)

# Alternatively
ps_sample <- sample(PROseq, 0.1 * length(PROseq))
length(ps_sample)