getCountsByRegions: Get signal counts in regions of interest

View source: R/signal_counting.R

getCountsByRegionsR Documentation

Get signal counts in regions of interest

Description

Get the sum of the signal in dataset.gr that overlaps each range in regions.gr. If expand_regions = FALSE, getCountsByRegions is written to calculate readcounts overlapping each region, while expand_regions = TRUE will calculate "coverage signal" (see details below).

Usage

getCountsByRegions(
  dataset.gr,
  regions.gr,
  field = "score",
  NF = NULL,
  blacklist = NULL,
  melt = FALSE,
  region_names = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

dataset.gr

A GRanges object in which signal is contained in metadata (typically in the "score" field), or a named list of such GRanges objects. If a list is given, a dataframe is returned containing the counts in each region for each dataset.

regions.gr

A GRanges object containing regions of interest.

field

The metadata field of dataset.gr to be counted. If length(field) > 1, a dataframe is returned containing the counts for each region in each field. If field not found in names(mcols(dataset.gr)), will default to using all fields found in dataset.gr.

NF

An optional normalization factor by which to multiply the counts. If given, length(NF) must be equal to length(field).

blacklist

An optional GRanges object containing regions that should be excluded from signal counting.

melt

If melt = TRUE, a dataframe is returned containing a column for regions and another column for signal. If multiple datasets are given (if dataset.gr is a list or if length(field) > 1), the output dataframe is melted to contain a third column indicating the sample names. (See section on return values below).

region_names

If melt = TRUE, an optional vector of names for the regions in regions.gr. If left as NULL, indices of regions.gr are used instead.

expand_ranges

Logical indicating if ranges in dataset.gr should be treated as descriptions of single molecules (FALSE), or if ranges should be treated as representing multiple adjacent positions with the same signal (TRUE). If the ranges in dataset.gr do not all have a width of 1, this option has a substantial effect on the results returned. (See details).

ncores

Multiple cores will only be used if dataset.gr is a list of multiple datasets, or if length(field) > 1.

Value

An atomic vector the same length as regions.gr containing the sum of the signal overlapping each range of regions.gr. If dataset.gr is a list of multiple GRanges, or if length(field) > 1, a dataframe is returned. If melt = FALSE (the default), dataframes have a column for each dataset and a row for each region. If melt = TRUE, dataframes contain one column to indicate regions (either by their indices, or by region_names, if given), another column to indicate signal, and a third column containing the sample name (unless dataset.gr is a single GRanges object).

expand_ranges = FALSE

In this configuration, getCountsByRegions is designed to work with data in which each range represents one type of molecule, whether it's a single base (e.g. the 5' ends, 3' ends, or centers of reads) or entire reads (i.e. paired 5' and 3' ends of reads).

This is in contrast to standard run-length compressed GRanges object, as imported using rtracklayer::import.bw, in which a single range can represent multiple contiguous positions that share the same signal information.

As an example, a range of covering 10 bp with a score of 2 is treated as 2 reads (each spanning the same 10 bases), not 20 reads.

expand_ranges = TRUE

In this configuration, this function assumes that ranges in dataset.gr that cover multiple bases are compressed representations of multiple adjacent positions that contain the same signal. This type of representation is typical of "coverage" objects, including bedGraphs and bigWigs generated by many command line utilities, but not bigWigs as they are imported by BRGenomics::import_bigWig.

As an example, a range covering 10 bp with a score of 2 is treated as representing 20 signal counts, i.e. there are 10 adjacent positions that each contain a signal of 2.

If the data truly represents basepair-resolution coverage, the "coverage signal" is equivalent to readcounts. However, users should consider how they interpret results from whole-read coverage, as the "coverage signal" is determined by both the read counts as well as read lengths.

Author(s)

Mike DeBerardine

See Also

getCountsByPositions

Examples

data("PROseq") # load included PROseq data
data("txs_dm6_chr4") # load included transcripts

counts <- getCountsByRegions(PROseq, txs_dm6_chr4)

length(txs_dm6_chr4)
length(counts)
head(counts)

# Assign as metadata to the transcript GRanges
txs_dm6_chr4$PROseq <- counts

txs_dm6_chr4[1:6]

mdeber/BRGenomics documentation built on Aug. 3, 2024, 3:43 a.m.