getCountsByPositions: Get signal counts at each position within regions of interest
In mdeber/BRGenomics: Tools for the Efficient Analysis of High-Resolution Genomics Data

getCountsByPositions

R Documentation

Get signal counts at each position within regions of interest

Description

Get the sum of the signal in dataset.gr that overlaps each position within each range in regions.gr. If binning is used (i.e. positions are wider than 1 bp), any function can be used to summarize the signal overlapping each bin. For a description of the critical difference between expand_ranges = FALSE and expand_ranges = TRUE, see getCountsByRegions.

Usage

getCountsByPositions(
  dataset.gr,
  regions.gr,
  binsize = 1L,
  FUN = sum,
  simplify.multi.widths = c("error", "list", "pad 0", "pad NA"),
  field = "score",
  NF = NULL,
  blacklist = NULL,
  NA_blacklisted = FALSE,
  melt = FALSE,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

`dataset.gr`	A GRanges object in which signal is contained in metadata (typically in the "score" field), or a named list of such GRanges objects.
`regions.gr`	A GRanges object containing regions of interest.
`binsize`	Size of bins (in bp) to use for counting within each range of `regions.gr`. Note that counts will not be length-normalized.
`FUN`	If `binsize > 1`, the function used to aggregate the signal within each bin. By default, the signal is summed, but any function operating on a numeric vector can be used.
`simplify.multi.widths`	A string indicating the output format if the ranges in `regions.gr` have variable widths. By default, an error is returned. See details below.
`field`	The metadata field of `dataset.gr` to be counted. If `length(field) > 1`, the output is a list whose elements contain the output for generated each field. If `field` not found in `names(mcols(dataset.gr))`, will default to using all fields found in `dataset.gr`.
`NF`	An optional normalization factor by which to multiply the counts. If given, `length(NF)` must be equal to `length(field)`.
`blacklist`	An optional GRanges object containing regions that should be excluded from signal counting.
`NA_blacklisted`	A logical indicating if NA values should be returned for blacklisted regions. By default, signal in the blacklisted sites is ignored, i.e. the reads are excluded. If `NA_blacklisted = TRUE`, those positions are set to `NA` in the final output.
`melt`	A logical indicating if the count matrices should be melted. If set to `TRUE`, a dataframe is returned in containing columns for "region", "position", and "signal". If `dataset.gr` is a list of multiple GRanges, or if `length(field) > 1`, a single dataframe is returned, which contains an additional column "sample", which contains individual sample names. If used with multi-width `regions.gr`, the resulting dataframe will only contain positions that are found within each respective region.
`expand_ranges`	Logical indicating if ranges in `dataset.gr` should be treated as descriptions of single molecules (`FALSE`), or if ranges should be treated as representing multiple adjacent positions with the same signal (`TRUE`). See `getCountsByRegions`.
`ncores`	Multiple cores will only be used if `dataset.gr` is a list of multiple datasets, or if `length(field) > 1`.

Value

If the widths of all ranges in regions.gr are equal, a matrix is returned that contains a row for each region of interest, and a column for each position (each base if binsize = 1) within each region. If dataset.gr is a list, a parallel list is returned containing a matrix for each input dataset.

Use of multi-width regions of interest

If the input regions.gr contains ranges of varying widths, setting simplify.multi.widths = "list" will output a list of variable-length vectors, with each vector corresponding to an individual input region. If simplify.multi.widths = "pad 0" or "pad NA", the output is a matrix containing a row for each range in regions.gr, but the number of columns is determined by the largest range in regions.gr. For each region of interest, columns that correspond to positions outside of the input range are set, depending on the argument, to 0 or NA.

Author(s)

Mike DeBerardine

Examples

data("PROseq") # load included PROseq data
data("txs_dm6_chr4") # load included transcripts

#--------------------------------------------------#
# counts from 0 to 50 bp after the TSS
#--------------------------------------------------#

txs_pr <- promoters(txs_dm6_chr4, 0, 50) # first 50 bases
countsmat <- getCountsByPositions(PROseq, txs_pr)
countsmat[10:15, 41:50] # show only 41-50 bp after TSS

#--------------------------------------------------#
# redo with 10 bp bins from 0 to 100
#--------------------------------------------------#

# column 5 is sums of rows shown above

txs_pr <- promoters(txs_dm6_chr4, 0, 100)
countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10)
countsmat[10:15, ]

#--------------------------------------------------#
# same as the above, but with the average signal in each bin
#--------------------------------------------------#

countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10, FUN = mean)
countsmat[10:15, ]

#--------------------------------------------------#
# standard deviation of signal in each bin
#--------------------------------------------------#

countsmat <- getCountsByPositions(PROseq, txs_pr, binsize = 10, FUN = sd)
round(countsmat[10:15, ], 1)

mdeber/BRGenomics documentation built on Aug. 3, 2024, 3:43 a.m.