getMetaRegionProfile: Create a "meta-region" profile

View source: R/COCOA.R

getMetaRegionProfileR Documentation

Create a "meta-region" profile

Description

This profile can show enrichment of genomic signals with high feature contribution scores in the region set but not in the surrounding genome, suggesting that variation is linked specifically to that region set.

Usage

getMetaRegionProfile(
  signal,
  signalCoord,
  regionSet,
  signalCol = c("PC1", "PC2"),
  signalCoordType = "default",
  binNum = 21,
  verbose = TRUE,
  aggrMethod = "default",
  absVal = TRUE
)

Arguments

signal

Matrix of feature contribution scores (the contribution of each epigenetic feature to each target variable). One named column for each target variable. One row for each original epigenetic feature (should be same order as original data/signalCoord). For (an unsupervised) example, if PCA was done on epigenetic data and the goal was to find region sets associated with the principal components, you could use the x$rotation output of prcomp(epigenetic data) as the feature contribution scores/'signal' parameter.

signalCoord

A GRanges object or data frame with coordinates for the genomic signal/original epigenetic data. Coordinates should be in the same order as the original data and the feature contribution scores (each item/row in signalCoord corresponds to a row in signal). If a data.frame, must have chr and start columns (optionally can have end column, depending on the epigenetic data type).

regionSet

A genomic ranges (GRanges) object with regions corresponding to the same biological annotation.

signalCol

A character vector with the names of the sample variables of interest/target variables (e.g. PCs or sample phenotypes).

signalCoordType

Character. Can be "default", "singleBase", or "multiBase". This describes whether the coordinates for 'signal' ('signalCoord') are each a single base (e.g. as for DNA methylation) or a region/multiple bases (e.g. as for chromatin accessibility). Different scoring options are available for each type of data. If "default" is given, the type of coordinates will be detected automatically. For "default", if each coordinate start value equals the coordinate end value (all(start(signalCoord) == end(signalCoord))), "singleBase" will be used. Otherwise, "multiBase" will be used.

binNum

Number of bins to split each region into when making the aggregate profile. More bins will give a higher resolution but perhaps more noisy profile.

verbose

A "logical" object. Whether progress of the function should be shown. One bar indicates the region set is completed.

aggrMethod

character. A character object with the aggregation method. Similar to aggregateSignalGRList 'scoringMetric' parameter. There are different methods available for signalCoordType="singleBase" vs signalCoordType="multiBase". For "singleBase", the available methods are "regionMean", "regionMedian", "simpleMean", and "simpleMedian". The default method is "regionMean". For "multiBase", the methods are "proportionWeightedMean", "simpleMean", and "simpleMedian". The default is "proportionWeightedMean". "regionMean" is a weighted average of the signal, weighted by region (absolute value of signal if absVal=TRUE). First the signal is averaged within each regionSet region, then all the regions are averaged. With "regionMean" method, be cautious in interpretation for region sets with low number of regions that overlap signalCoord. The "regionMedian" method is the same as "regionMean" but the median is taken at each step instead of the mean. The "simpleMean" method is just the unweighted average of all (absolute) signal values that overlap the given region set. For multiBase data, this includes signal regions that overlap a regionSet region at all (1 base overlap or more) and the signal for each overlapping region is given the same weight for the average regardless of how much it overlaps. The "simpleMedian" method is the same as "simpleMean" but takes the median instead of the mean. "proportionWeightedMean" is a weighted average of all signalCoord regions that overlap with regionSet regions. For each signalCoord region that overlaps with a regionSet region, we calculate what proportion of the regionSet region is covered. Then this proportion is used to weight the signal value when calculating the mean. The denominator of the mean is the sum of all the proportion overlaps.

absVal

Logical. If TRUE, take the absolute value of values in signal. Choose TRUE if you think there may be some genomic loci in a region set that will increase and others will decrease (if there may be anticorrelation between regions in a region set). Choose FALSE if you expect regions in a given region set to all change in the same direction (all be positively correlated with each other).

Details

All regions in a given region set are combined into a single aggregate profile. Regions in 'regionSet' should be expanded on each side to include a wider area of the genome around the regions of interest (see example and vignettes). To make the profile, first we optionally take the absolute value of 'signal' ('absVal' parameter). Then each expanded regionSet region is split into 'binNum' bins. The corresponding bins from each region (e.g. all bin1's, all bin2's, etc.) are grouped. All overlapping values from 'signal' are aggregated in each bin group according to the 'aggrMethod' parameter to get a meta-region profile. Since DNA strand information is not considered, the profile is averaged symmetrically around the center. A peak in the middle of this profile suggests that variability is specific to the region set of interest and is not a product of the surrounding genome. A region set can still be significant even if it does not have a peak. For example, some histone modification region sets may be in large genomic blocks and not show a peak, despite having variation across samples.

Value

A data.frame with the binned meta-region profile, one row per bin. columns: binID and one column for each target variable in signalCol. The function will return NULL if there is no overlap between signalCoord and any of the bin groups that come from regionSet (e.g. none of the bin1's overlapped signalCoord, NULL returned).

Examples

data("brcaATACCoord1")
data("brcaATACData1")
data("esr1_chr1")
featureContributionScores <- prcomp(t(brcaATACData1))$rotation
esr1_chr1_expanded <- resize(esr1_chr1, 12000, fix="center")
mrProfile <- getMetaRegionProfile(signal=featureContributionScores,
                                  signalCoord=brcaATACCoord1,
                                  regionSet=esr1_chr1_expanded,
                                  signalCol=c("PC1", "PC2"),
                                  binNum=21)

databio/COCOA documentation built on Sept. 1, 2023, 5:50 p.m.