getCountsByPositions: Get signal counts at each position within regions of interest

View source: R/signal_counting.R

getCountsByPositions    R Documentation

Description

Get the sum of the signal in dataset.gr that overlaps each position within each range in regions.gr. If binning is used (i.e. positions are wider than 1 bp), any function can be used to summarize the signal overlapping each bin. For a description of the critical difference between expand_ranges = FALSE and expand_ranges = TRUE, see getCountsByRegions.

Usage

getCountsByPositions(
  dataset.gr,
  regions.gr,
  binsize = 1L,
  FUN = sum,
  simplify.multi.widths = c("error", "list", "pad 0", "pad NA"),
  field = "score",
  NF = NULL,
  blacklist = NULL,
  NA_blacklisted = FALSE,
  melt = FALSE,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

dataset.gr

A GRanges object in which signal is contained in metadata (typically in the "score" field), or a named list of such GRanges objects.

regions.gr

A GRanges object containing regions of interest.

binsize

Size of bins (in bp) to use for counting within each range of regions.gr. Note that counts will not be length-normalized.

FUN

If binsize > 1, the function used to aggregate the signal within each bin. By default, the signal is summed, but any function operating on a numeric vector can be used.
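As a rough illustration of what binning with different summary functions means, the base R sketch below bins a made-up per-base signal vector into 10 bp bins. It is not the package's implementation; the `signal` vector and the `bin_index` helper are hypothetical and only mirror the idea described for binsize and FUN.

```r
# Hypothetical per-base signal for one 20 bp region (not real data)
signal <- c(0, 2, 0, 0, 1, 0, 0, 0, 3, 0,
            0, 0, 0, 5, 0, 0, 1, 0, 0, 2)
binsize <- 10L

# Assign each position to a bin: positions 1-10 -> bin 1, 11-20 -> bin 2
bin_index <- ceiling(seq_along(signal) / binsize)

# Aggregate the signal within each bin using two different functions
bin_sums  <- as.vector(tapply(signal, bin_index, sum))
bin_means <- as.vector(tapply(signal, bin_index, mean))

print(bin_sums)   # 6 8
print(bin_means)  # 0.6 0.8
```

Summed bins grow with the bin width, which is why summed counts are not length-normalized; a function such as mean gives a per-base value instead.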

simplify.multi.widths

A string indicating the output format if the ranges in regions.gr have variable widths. By default, an error is raised. See "Use of multi-width regions of interest" below.

field

The metadata field of dataset.gr to be counted. If length(field) > 1, the output is a list whose elements contain the output generated for each field. If field is not found in names(mcols(dataset.gr)), all fields found in dataset.gr are used by default.

NF

An optional normalization factor by which to multiply the counts. If given, length(NF) must be equal to length(field).

blacklist

An optional GRanges object containing regions that should be excluded from signal counting.

NA_blacklisted

A logical indicating if NA values should be returned for blacklisted regions. By default, signal in the blacklisted sites is ignored, i.e. the reads are excluded. If NA_blacklisted = TRUE, those positions are set to NA in the final output.
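The difference between the two blacklist behaviors can be sketched in base R on a made-up signal vector. This is only a conceptual illustration, assuming that excluded reads leave zero signal at blacklisted positions; the `signal` and `blacklisted` vectors are hypothetical and the code does not call the package.

```r
# Hypothetical per-base signal and a mask marking blacklisted positions
signal      <- c(2, 0, 7, 9, 1, 0)
blacklisted <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)

# Default behavior (assumed): blacklisted signal is dropped, leaving zeros
counts_ignored <- ifelse(blacklisted, 0, signal)

# NA_blacklisted = TRUE analogue: blacklisted positions are marked as NA
counts_na <- ifelse(blacklisted, NA, signal)

print(counts_ignored)  # 2 0 0 0 1 0
print(counts_na)       # 2 0 NA NA 1 0

# Downstream summaries must then handle the NA values explicitly
mean(counts_ignored)          # zeros count toward the mean
mean(counts_na, na.rm = TRUE) # blacklisted positions are skipped
```

Marking positions as NA keeps "no data" distinguishable from "no signal", at the cost of requiring na.rm = TRUE (or similar) in later calculations.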

melt

A logical indicating if the count matrices should be melted. If set to TRUE, a dataframe is returned containing columns for "region", "position", and "signal". If dataset.gr is a list of multiple GRanges, or if length(field) > 1, a single dataframe is returned with an additional column "sample" that holds the individual sample names. If used with multi-width regions.gr, the resulting dataframe will only contain positions that are found within each respective region.
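To make the melted layout concrete, the base R sketch below converts a small made-up count matrix into a long-format dataframe with the column names described above. It is an illustration only; the exact row order and column types produced by the package are not confirmed here, and `countsmat` holds hypothetical values.

```r
# Hypothetical count matrix: 2 regions (rows) x 3 positions (columns)
countsmat <- matrix(c(0, 1, 4,
                      2, 0, 3),
                    nrow = 2, byrow = TRUE)

# Long format: one row per (region, position) pair
melted <- data.frame(
  region   = as.vector(row(countsmat)),
  position = as.vector(col(countsmat)),
  signal   = as.vector(countsmat)
)

# Sort by region, then by position within each region
melted <- melted[order(melted$region, melted$position), ]
rownames(melted) <- NULL

print(melted)
```

The long format is convenient for plotting libraries and grouped summaries, since each observation carries its own region and position labels.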

expand_ranges

Logical indicating if ranges in dataset.gr should be treated as descriptions of single molecules (FALSE), or if ranges should be treated as representing multiple adjacent positions with the same signal (TRUE). See getCountsByRegions.

ncores

Number of cores to use. Multiple cores will only be used if dataset.gr is a list of multiple datasets, or if length(field) > 1.

Value

If the widths of all ranges in regions.gr are equal, a matrix is returned that contains a row for each region of interest, and a column for each position (each base if binsize = 1) within each region. If dataset.gr is a list, a parallel list is returned containing a matrix for each input dataset.

Use of multi-width regions of interest

If the input regions.gr contains ranges of varying widths, setting simplify.multi.widths = "list" will output a list of variable-length vectors, with each vector corresponding to an individual input region. If simplify.multi.widths = "pad 0" or "pad NA", the output is a matrix containing a row for each range in regions.gr, but the number of columns is determined by the largest range in regions.gr. For each region of interest, columns that correspond to positions outside of the input range are set, depending on the argument, to 0 or NA.
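The padding options can be pictured with a short base R sketch that turns variable-length vectors into a matrix. This is not the package's code: `counts_list` and `pad_to_matrix` are hypothetical, and the sketch assumes padding is applied after the last position of each shorter region.

```r
# Hypothetical per-position counts for three regions of widths 3, 5, and 2
counts_list <- list(c(1, 0, 2), c(0, 0, 4, 1, 0), c(3, 1))

# Hypothetical helper: pad each vector up to the width of the longest one
pad_to_matrix <- function(x, pad = NA_real_) {
  max_width <- max(lengths(x))
  t(vapply(x, function(v) c(v, rep(pad, max_width - length(v))),
           numeric(max_width)))
}

padded_na <- pad_to_matrix(counts_list, pad = NA_real_)  # like "pad NA"
padded_0  <- pad_to_matrix(counts_list, pad = 0)         # like "pad 0"

print(padded_na)
print(padded_0)

# The choice affects column summaries: zeros are averaged in, NAs can be skipped
colMeans(padded_0)
colMeans(padded_na, na.rm = TRUE)
```

Padding with NA keeps positions outside a region from being mistaken for positions with zero signal, while padding with 0 yields a matrix that works with functions lacking NA handling.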

Author(s)

Mike DeBerardine

See Also

getCountsByRegions, metaSubsample

Examples

library(BRGenomics) # attach the package providing the function and example data
data("PROseq") # load included PROseq data
data("txs_dm6_chr4") # load included transcripts

#--------------------------------------------------#
# counts from 0 to 50 bp after the TSS
#--------------------------------------------------#

txs_pr <- promoters(txs_dm6_chr4, 0, 50) # first 50 bases
countsmat <- getCountsByPositions(PROseq, txs_pr)
countsmat[10:15, 41:50] # show only 41-50 bp after TSS

#--------------------------------------------------#
# redo with 10 bp bins from 0 to 100
#--------------------------------------------------#

# column 5 (41-50 bp) holds the row sums of the values shown above

txs_pr <- promoters(txs_dm6_chr4, 0, 100)
countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10)
countsmat[10:15, ]

#--------------------------------------------------#
# same as the above, but with the average signal in each bin
#--------------------------------------------------#

countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10, FUN = mean)
countsmat[10:15, ]

#--------------------------------------------------#
# standard deviation of signal in each bin
#--------------------------------------------------#

countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10, FUN = sd)
round(countsmat[10:15, ], 1)

mdeber/BRGenomics documentation built on Aug. 3, 2024, 3:43 a.m.