getCountsByRegions: Get signal counts in regions of interest
In mdeber/BRGenomics: Tools for the Efficient Analysis of High-Resolution Genomics Data

getCountsByRegions

R Documentation

Get signal counts in regions of interest

Description

Get the sum of the signal in dataset.gr that overlaps each range in regions.gr. If expand_regions = FALSE, getCountsByRegions is written to calculate readcounts overlapping each region, while expand_regions = TRUE will calculate "coverage signal" (see details below).

Usage

getCountsByRegions(
  dataset.gr,
  regions.gr,
  field = "score",
  NF = NULL,
  blacklist = NULL,
  melt = FALSE,
  region_names = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

`dataset.gr`	A GRanges object in which signal is contained in metadata (typically in the "score" field), or a named list of such GRanges objects. If a list is given, a dataframe is returned containing the counts in each region for each dataset.
`regions.gr`	A GRanges object containing regions of interest.
`field`	The metadata field of `dataset.gr` to be counted. If `length(field) > 1`, a dataframe is returned containing the counts for each region in each field. If `field` not found in `names(mcols(dataset.gr))`, will default to using all fields found in `dataset.gr`.
`NF`	An optional normalization factor by which to multiply the counts. If given, `length(NF)` must be equal to `length(field)`.
`blacklist`	An optional GRanges object containing regions that should be excluded from signal counting.
`melt`	If `melt = TRUE`, a dataframe is returned containing a column for regions and another column for signal. If multiple datasets are given (if `dataset.gr` is a list or if `length(field) > 1`), the output dataframe is melted to contain a third column indicating the sample names. (See section on return values below).
`region_names`	If `melt = TRUE`, an optional vector of names for the regions in `regions.gr`. If left as `NULL`, indices of `regions.gr` are used instead.
`expand_ranges`	Logical indicating if ranges in `dataset.gr` should be treated as descriptions of single molecules (`FALSE`), or if ranges should be treated as representing multiple adjacent positions with the same signal (`TRUE`). If the ranges in `dataset.gr` do not all have a width of 1, this option has a substantial effect on the results returned. (See details).
`ncores`	Multiple cores will only be used if `dataset.gr` is a list of multiple datasets, or if `length(field) > 1`.

Value

An atomic vector the same length as regions.gr containing the sum of the signal overlapping each range of regions.gr. If dataset.gr is a list of multiple GRanges, or if length(field) > 1, a dataframe is returned. If melt = FALSE (the default), dataframes have a column for each dataset and a row for each region. If melt = TRUE, dataframes contain one column to indicate regions (either by their indices, or by region_names, if given), another column to indicate signal, and a third column containing the sample name (unless dataset.gr is a single GRanges object).

`expand_ranges = FALSE`

In this configuration, getCountsByRegions is designed to work with data in which each range represents one type of molecule, whether it's a single base (e.g. the 5' ends, 3' ends, or centers of reads) or entire reads (i.e. paired 5' and 3' ends of reads).

This is in contrast to standard run-length compressed GRanges object, as imported using rtracklayer::import.bw, in which a single range can represent multiple contiguous positions that share the same signal information.

As an example, a range of covering 10 bp with a score of 2 is treated as 2 reads (each spanning the same 10 bases), not 20 reads.

`expand_ranges = TRUE`

In this configuration, this function assumes that ranges in dataset.gr that cover multiple bases are compressed representations of multiple adjacent positions that contain the same signal. This type of representation is typical of "coverage" objects, including bedGraphs and bigWigs generated by many command line utilities, but not bigWigs as they are imported by BRGenomics::import_bigWig.

As an example, a range covering 10 bp with a score of 2 is treated as representing 20 signal counts, i.e. there are 10 adjacent positions that each contain a signal of 2.

If the data truly represents basepair-resolution coverage, the "coverage signal" is equivalent to readcounts. However, users should consider how they interpret results from whole-read coverage, as the "coverage signal" is determined by both the read counts as well as read lengths.

Author(s)

Mike DeBerardine

Examples

data("PROseq") # load included PROseq data
data("txs_dm6_chr4") # load included transcripts

counts <- getCountsByRegions(PROseq, txs_dm6_chr4)

length(txs_dm6_chr4)
length(counts)
head(counts)

# Assign as metadata to the transcript GRanges
txs_dm6_chr4$PROseq <- counts

txs_dm6_chr4[1:6]

mdeber/BRGenomics documentation built on Aug. 3, 2024, 3:43 a.m.