bootstrap-signal-by-position: Bootstrapping Mean Signal by Position for Metaplotting
In BRGenomics: Tools for the Efficient Analysis of High-Resolution Genomics Data

Description Usage Arguments Value Author(s) See Also Examples

These functions perform bootstrap subsampling of mean readcounts at different positions within regions of interest (metaSubsample), or, in the more general case of metaSubsampleMatrix, column means of a matrix are bootstrapped by sampling the rows. Mean signal counts can be calculated at base-pair resolution, or over larger bins.

metaSubsample(
  dataset.gr,
  regions.gr,
  binsize = 1,
  first.output.xval = 1,
  sample.name = deparse(substitute(dataset.gr)),
  n.iter = 1000,
  prop.sample = 0.1,
  lower = 0.125,
  upper = 0.875,
  field = "score",
  NF = NULL,
  remove.empty = FALSE,
  blacklist = NULL,
  zero_blacklisted = FALSE,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

metaSubsampleMatrix(
  counts.mat,
  binsize = 1,
  first.output.xval = 1,
  sample.name = NULL,
  n.iter = 1000,
  prop.sample = 0.1,
  lower = 0.125,
  upper = 0.875,
  NF = 1,
  remove.empty = FALSE,
  ncores = getOption("mc.cores", 2L)
)

`dataset.gr`	A GRanges object in which signal is contained in metadata (typically in the `"score"` field), or a list of such GRanges objects.
`regions.gr`	A GRanges object containing intervals over which to metaplot. All ranges must have the same width.
`binsize`	The size of bin (in basepairs, or number of columns for `metaSubsampleMatrix`) to use for counting signal. Especially important for counting signal over large or sparse regions.
`first.output.xval`	The relative start position of the first bin, e.g. if `regions.gr` begins at 50 bases upstream of the TSS, set `first.output.xval = -50`. This number only affects the x-values that are returned, which are provided as a convenience.
`sample.name`	Defaults to the name of the input dataset. This is included in the output as a convenience, as it allows row-binding outputs from different samples. If `length(field) > 1` and the default `sample.name` is left, the sample names will be inferred from the field names.
`n.iter`	Number of random subsampling iterations to perform. Default is `1000`.
`prop.sample`	The proportion of the ranges in `regions.gr` (e.g. the proportion of genes) or the proportion of rows in `counts.mat` to sample in each iteration. The default is `0.1` (10 percent).
`lower, upper`	The lower and upper quantiles of subsampled signal means to return. The defaults, `0.125` and `0.875` (i.e. the 12.5th and 85.5th percentiles) return a 75 percent confidence interval about the bootstrapped mean.
`field`	One or more metadata fields of `dataset.gr` to be counted.
`NF`	An optional normalization factor by which to multiply the counts. If given, `length(NF)` must be equal to `length(field)`.
`remove.empty`	A logical indicating whether regions (`metaSubsample`) or rows (`metaSubsampleMatrix`) without signal should be removed from the analysis. Not recommended if using multiple fields, as the gene lists will no longer be equivalent.
`blacklist`	An optional GRanges object containing regions that should be excluded from signal counting.
`zero_blacklisted`	When set to `FALSE` (the default), signal at blacklisted sites is ignored from computations. If set to `TRUE`, signal at blacklisted sites will be treated as equal to zero. For bootstrapping, the default behavior of ignoring signal at blacklisted sites should almost always be kept.
`expand_ranges`	Logical indicating if ranges in `dataset.gr` should be treated as descriptions of single molecules (`FALSE`), or if ranges should be treated as representing multiple adjacent positions with the same signal (`TRUE`). See `getCountsByRegions`.
`ncores`	Number of cores to use for computations.
`counts.mat`	A matrix over which to bootstrap column means by subsampling its rows. Typically, a matrix of readcounts with rows for genes and columns for positions within those genes.

A dataframe containing x-values, means, lower quantiles, upper quantiles, and the sample name (as a convenience for row-binding multiple of these dataframes). If a list of GRanges is given as input, or if multiple fields are given, a single, combined dataframe is returned containing data for all fields/datasets.

Mike DeBerardine

getCountsByPositions

data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts

# for each transcript, use promoter-proximal region from TSS to +100
pr <- promoters(txs_dm6_chr4, 0, 100)

#--------------------------------------------------#
# Bootstrap average signal in each 5 bp bin across all transcripts,
# and get confidence bands for middle 30% of bootstrapped means
#--------------------------------------------------#

set.seed(11)
df <- metaSubsample(PROseq, pr, binsize = 5,
                    lower = 0.35, upper = 0.65,
                    ncores = 1)
df[1:10, ]

#--------------------------------------------------#
# Plot bootstrapped means with confidence intervals
#--------------------------------------------------#

plot(mean ~ x, df, type = "l", main = "PROseq Signal",
     ylab = "Mean + 30% CI", xlab = "Distance from TSS")
polygon(c(df$x, rev(df$x)), c(df$lower, rev(df$upper)),
        col = adjustcolor("black", 0.1), border = FALSE)


#==================================================#
# Using a matrix as input
#==================================================#

# generate a matrix of counts in each region
countsmat <- getCountsByPositions(PROseq, pr)
dim(countsmat)

#--------------------------------------------------#
# bootstrap average signal in 10 bp bins across all transcripts
#--------------------------------------------------#

set.seed(11)
 df <- metaSubsampleMatrix(countsmat, binsize = 10,
                           sample.name = "PROseq",
                           ncores = 1)
df[1:10, ]

#--------------------------------------------------#
# the same, using a normalization factor, and changing the x-values
#--------------------------------------------------#

set.seed(11)
df <- metaSubsampleMatrix(countsmat, binsize = 10,
                          first.output.xval = 0, NF = 0.75,
                          sample.name = "PROseq", ncores = 1)
df[1:10, ]