bootstrap-signal-by-position: Bootstrapping Mean Signal by Position for Metaplotting

bootstrap-signal-by-positionR Documentation

Bootstrapping Mean Signal by Position for Metaplotting

Description

These functions perform bootstrap subsampling of mean readcounts at different positions within regions of interest (metaSubsample), or, in the more general case of metaSubsampleMatrix, column means of a matrix are bootstrapped by sampling the rows. Mean signal counts can be calculated at base-pair resolution, or over larger bins.

Usage

metaSubsample(
  dataset.gr,
  regions.gr,
  binsize = 1L,
  first.output.xval = 1L,
  sample.name = deparse(substitute(dataset.gr)),
  n.iter = 1000L,
  prop.sample = 0.1,
  lower = 0.125,
  upper = 0.875,
  field = "score",
  NF = NULL,
  remove.empty = FALSE,
  blacklist = NULL,
  zero_blacklisted = FALSE,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

metaSubsampleMatrix(
  counts.mat,
  binsize = 1L,
  first.output.xval = 1L,
  sample.name = NULL,
  n.iter = 1000L,
  prop.sample = 0.1,
  lower = 0.125,
  upper = 0.875,
  NF = 1L,
  remove.empty = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

dataset.gr

A GRanges object in which signal is contained in metadata (typically in the "score" field), or a list of such GRanges objects.

regions.gr

A GRanges object containing intervals over which to metaplot. All ranges must have the same width.

binsize

The size of bin (in basepairs, or number of columns for metaSubsampleMatrix) to use for counting signal. Especially important for counting signal over large or sparse regions.

first.output.xval

The relative start position of the first bin, e.g. if regions.gr begins at 50 bases upstream of the TSS, set first.output.xval = -50. This number only affects the x-values that are returned, which are provided as a convenience.

sample.name

Defaults to the name of the input dataset. This is included in the output as a convenience, as it allows row-binding outputs from different samples. If length(field) > 1 and the default sample.name is left, the sample names will be inferred from the field names.

n.iter

Number of random subsampling iterations to perform. Default is 1000.

prop.sample

The proportion of the ranges in regions.gr (e.g. the proportion of genes) or the proportion of rows in counts.mat to sample in each iteration. The default is 0.1 (10 percent).

lower, upper

The lower and upper quantiles of subsampled signal means to return. The defaults, 0.125 and 0.875 (i.e. the 12.5th and 85.5th percentiles) return a 75 percent confidence interval about the bootstrapped mean.

field

One or more metadata fields of dataset.gr to be counted.

NF

An optional normalization factor by which to multiply the counts. If given, length(NF) must be equal to length(field).

remove.empty

A logical indicating whether regions (metaSubsample) or rows (metaSubsampleMatrix) without signal should be removed from the analysis. Not recommended if using multiple fields, as the gene lists will no longer be equivalent.

blacklist

An optional GRanges object containing regions that should be excluded from signal counting.

zero_blacklisted

When set to FALSE (the default), signal at blacklisted sites is ignored from computations. If set to TRUE, signal at blacklisted sites will be treated as equal to zero. For bootstrapping, the default behavior of ignoring signal at blacklisted sites should almost always be kept.

expand_ranges

Logical indicating if ranges in dataset.gr should be treated as descriptions of single molecules (FALSE), or if ranges should be treated as representing multiple adjacent positions with the same signal (TRUE). See getCountsByRegions.

ncores

Number of cores to use for computations.

counts.mat

A matrix over which to bootstrap column means by subsampling its rows. Typically, a matrix of readcounts with rows for genes and columns for positions within those genes.

Value

A dataframe containing x-values, means, lower quantiles, upper quantiles, and the sample name (as a convenience for row-binding multiple of these dataframes). If a list of GRanges is given as input, or if multiple fields are given, a single, combined dataframe is returned containing data for all fields/datasets.

Author(s)

Mike DeBerardine

See Also

getCountsByPositions

Examples

data("PROseq") # import included PROseq data
data("txs_dm6_chr4") # import included transcripts

# for each transcript, use promoter-proximal region from TSS to +100
pr <- promoters(txs_dm6_chr4, 0, 100)

#--------------------------------------------------#
# Bootstrap average signal in each 5 bp bin across all transcripts,
# and get confidence bands for middle 30% of bootstrapped means
#--------------------------------------------------#

set.seed(11)
df <- metaSubsample(PROseq, pr, binsize = 5,
                    lower = 0.35, upper = 0.65,
                    ncores = 1)
df[1:10, ]

#--------------------------------------------------#
# Plot bootstrapped means with confidence intervals
#--------------------------------------------------#

plot(mean ~ x, df, type = "l", main = "PROseq Signal",
     ylab = "Mean + 30% CI", xlab = "Distance from TSS")
polygon(c(df$x, rev(df$x)), c(df$lower, rev(df$upper)),
        col = adjustcolor("black", 0.1), border = FALSE)


#==================================================#
# Using a matrix as input
#==================================================#

# generate a matrix of counts in each region
countsmat <- getCountsByPositions(PROseq, pr)
dim(countsmat)

#--------------------------------------------------#
# bootstrap average signal in 10 bp bins across all transcripts
#--------------------------------------------------#

set.seed(11)
 df <- metaSubsampleMatrix(countsmat, binsize = 10,
                           sample.name = "PROseq",
                           ncores = 1)
df[1:10, ]

#--------------------------------------------------#
# the same, using a normalization factor, and changing the x-values
#--------------------------------------------------#

set.seed(11)
df <- metaSubsampleMatrix(countsmat, binsize = 10,
                          first.output.xval = 0, NF = 0.75,
                          sample.name = "PROseq", ncores = 1)
df[1:10, ]

mdeber/BRGenomics documentation built on Aug. 3, 2024, 3:43 a.m.