aggregateSignalGRList: Score many region sets


aggregateSignalGRList R Documentation

Score many region sets

Description

This function scores each region set for each target variable named in 'signalCol', using the method given by the 'scoringMetric' parameter. From these scores, you can determine which region sets in a region set database (given by 'GRList') are most associated with the target variables. See the vignette "Introduction to Coordinate Covariation Analysis" for help interpreting your results.

Usage

aggregateSignalGRList(
  signal,
  signalCoord,
  GRList,
  signalCol = c("PC1", "PC2"),
  signalCoordType = "default",
  scoringMetric = "default",
  verbose = TRUE,
  absVal = TRUE,
  olList = NULL,
  pOlapList = NULL,
  returnCovInfo = TRUE
)

Arguments

signal

Matrix of feature contribution scores (the contribution of each epigenetic feature to each target variable). One named column for each target variable. One row for each original epigenetic feature (rows should be in the same order as the original data/signalCoord). As an unsupervised example, if PCA was done on epigenetic data and the goal was to find region sets associated with the principal components, you could use the x$rotation output of prcomp(epigenetic data) as the feature contribution scores/'signal' parameter.

signalCoord

A GRanges object or data frame with coordinates for the genomic signal/original epigenetic data. Coordinates should be in the same order as the original data and the feature contribution scores (each item/row in signalCoord corresponds to a row in signal). If a data.frame, it must have chr and start columns (and can optionally have an end column, depending on the epigenetic data type).

GRList

GRangesList object. Each list item is a distinct region set to test (region set: regions that correspond to the same biological annotation). The region set database must be from the same reference genome as the coordinates for the actual data/samples (signalCoord).

signalCol

A character vector with the names of the sample variables of interest/target variables (e.g. PCs or sample phenotypes).

signalCoordType

Character. Can be "default", "singleBase", or "multiBase". This describes whether the coordinates for 'signal' ('signalCoord') are each a single base (e.g. as for DNA methylation) or a region/multiple bases (e.g. as for chromatin accessibility). Different scoring options are available for each type of data. If "default" is given, the type of coordinates will be detected automatically. For "default", if each coordinate start value equals the coordinate end value (all(start(signalCoord) == end(signalCoord))), "singleBase" will be used. Otherwise, "multiBase" will be used.
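The "default" detection rule described above can be sketched as follows. This is a minimal illustration, assuming the GenomicRanges package is installed; detectCoordType is a hypothetical helper written for this example and is not part of COCOA.

```r
library(GenomicRanges)

# Hypothetical helper (not a COCOA function): applies the documented rule that
# coordinates whose start equals their end everywhere are treated as "singleBase".
detectCoordType <- function(signalCoord) {
  if (all(start(signalCoord) == end(signalCoord))) "singleBase" else "multiBase"
}

# Toy single-base coordinates (e.g. CpG positions) and toy multi-base regions
cpgCoord  <- GRanges("chr1", IRanges(start = c(100, 250, 900), width = 1))
peakCoord <- GRanges("chr1", IRanges(start = c(100, 500), end = c(300, 800)))

detectCoordType(cpgCoord)   # "singleBase"
detectCoordType(peakCoord)  # "multiBase"
```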

scoringMetric

A character object with the scoring metric. Different methods are available for signalCoordType="singleBase" and signalCoordType="multiBase". For "singleBase", the available methods are "regionMean", "regionMedian", "simpleMean", and "simpleMedian"; the default is "regionMean". For "multiBase", the available methods are "proportionWeightedMean", "simpleMean", and "simpleMedian"; the default is "proportionWeightedMean".

"regionMean" is a weighted average of the signal, weighted by region (the absolute value of the signal is used if absVal=TRUE). First the signal is averaged within each regionSet region, then the region averages are averaged. With the "regionMean" method, interpret results cautiously for region sets in which only a small number of regions overlap signalCoord. The "regionMedian" method is the same as "regionMean" but takes the median at each step instead of the mean.

The "simpleMean" method is the unweighted average of all (absolute) signal values that overlap the given region set. For multiBase data, this includes every signal region that overlaps a regionSet region at all (by 1 base or more), and each overlapping signal region is given the same weight in the average regardless of how much it overlaps. The "simpleMedian" method is the same as "simpleMean" but takes the median instead of the mean.

"proportionWeightedMean" is a weighted average of all signalCoord regions that overlap regionSet regions. For each signalCoord region that overlaps a regionSet region, the proportion of the regionSet region that is covered is calculated. This proportion is then used to weight the signal value when calculating the mean. The denominator of the mean is the sum of all the proportion overlaps.
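To make the arithmetic behind these metrics concrete, here is a toy sketch in base R. It only mirrors the descriptions above and is not COCOA's internal code; all values and variable names (absSignal, regionID, peakSignal, propOverlap) are made up for illustration.

```r
# Toy single-base data: absolute signal for five features and the index of the
# regionSet region each feature overlaps (values are invented for illustration).
absSignal <- c(0.2, 0.4, 0.6, 1.0, 3.0)
regionID  <- c(1, 1, 1, 2, 3)

# "simpleMean"/"simpleMedian": pool all overlapping features, ignoring regions
simpleMean   <- mean(absSignal)
simpleMedian <- median(absSignal)

# "regionMean"/"regionMedian": summarize within each region, then across regions
regionMean   <- mean(tapply(absSignal, regionID, mean))
regionMedian <- median(tapply(absSignal, regionID, median))

# Toy multi-base data: signal for two overlapping signalCoord regions and the
# proportion of the regionSet region that each one covers.
peakSignal  <- c(0.5, 2.0)
propOverlap <- c(1.0, 0.25)

# "proportionWeightedMean": weight by proportion, divide by the sum of proportions
propWeightedMean <- sum(peakSignal * propOverlap) / sum(propOverlap)
```

In this toy case the region with three features counts once in regionMean but three times in simpleMean, which is why the two metrics differ (regionMean is about 1.47, simpleMean is 1.04).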

verbose

Logical. Whether to show the progress of the function. Each bar printed indicates that one region set has been completed.

absVal

Logical. If TRUE, take the absolute value of values in signal. Choose TRUE if you think some genomic loci in a region set may increase while others decrease (that is, if there may be anticorrelation between regions in a region set). Choose FALSE if you expect the regions in a given region set to all change in the same direction (all be positively correlated with each other).
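A small base R illustration of why this choice matters: when features in a region set have contribution scores of opposite sign, a signed average can cancel toward zero even though each feature contributes strongly. The loadings vector is invented for this example.

```r
# Toy feature contribution scores for one region set (invented values);
# half are positive and half negative, as with anticorrelated regions.
loadings <- c(0.5, -0.4, 0.6, -0.5)

signedScore   <- mean(loadings)       # analogous to absVal = FALSE; signs cancel
absoluteScore <- mean(abs(loadings))  # analogous to absVal = TRUE; magnitudes kept
```

Here the signed average is 0.05 while the average of absolute values is 0.5.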

olList

List. Each list item should be a "SortedByQueryHits" object (output of the findOverlaps function). Each hits object should have the overlap information between signalCoord and one item of GRList (one unique region set). The region sets from GRList must be the "subject" in findOverlaps and signalCoord must be the "query", e.g. findOverlaps(subject=regionSet, query=signalCoord). Providing this information can greatly improve permutation speed since the overlaps will not have to be calculated for each permutation. The "runCOCOAPerm" function calculates this information only once, internally, so it does not have to be provided when using that function. When using this parameter, signalCoord, genomicSignal, and each region set must be in the same order as they were when olList was created. Otherwise, the wrong genomic loci will be referenced (e.g. if epigenetic features were filtered out of genomicSignal after olList was created).
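One way such a list might be built is sketched below, assuming the GenomicRanges package is installed. The coordinates and region sets are toy objects created for this example; only findOverlaps and its query/subject roles come from the description above.

```r
library(GenomicRanges)

# Toy signal coordinates and a toy region set database (invented for illustration)
signalCoord <- GRanges("chr1", IRanges(start = c(100, 500, 2000),
                                       end   = c(300, 800, 2100)))
GRList <- GRangesList(
  GRanges("chr1", IRanges(start = c(250, 1900), end = c(600, 2050))),
  GRanges("chr1", IRanges(start = 5000, end = 6000))
)

# One hits object per region set: signalCoord is the query, the region set is
# the subject, in the same order as the items of GRList.
olList <- lapply(seq_along(GRList), function(i) {
  findOverlaps(query = signalCoord, subject = GRList[[i]])
})
```

The resulting list has one item per region set and could then be supplied through the olList argument, provided signalCoord and GRList are not reordered or filtered afterwards.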

pOlapList

List. This parameter is only used if the scoring metric is "proportionWeightedMean" and olList is also provided as an argument. Each item of the list should be a vector that contains the proportion overlap between signalCoord and the regions from one region set (one item of GRList). Specifically, each value should be the proportion of the region set region that is overlapped by a signalCoord region. The proportion overlap values should be in the same order as the overlaps given by olList for the corresponding region set.
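The proportion overlaps could be derived from the overlap hits roughly as follows. This is a sketch under the description above, assuming the GenomicRanges package is installed; the toy objects are invented, and COCOA may compute these values differently internally.

```r
library(GenomicRanges)

# Toy signal coordinates and region set database (invented for illustration)
signalCoord <- GRanges("chr1", IRanges(start = c(100, 500, 2000),
                                       end   = c(300, 800, 2100)))
GRList <- GRangesList(
  GRanges("chr1", IRanges(start = c(250, 1900), end = c(600, 2050))),
  GRanges("chr1", IRanges(start = 5000, end = 6000))
)

pOlapList <- lapply(seq_along(GRList), function(i) {
  regionSet <- GRList[[i]]
  hits <- findOverlaps(query = signalCoord, subject = regionSet)
  sigHit <- signalCoord[queryHits(hits)]   # signal region of each overlap
  rsHit  <- regionSet[subjectHits(hits)]   # region set region of each overlap
  # Width of the shared interval for each overlapping pair
  olWidth <- pmin(end(sigHit), end(rsHit)) - pmax(start(sigHit), start(rsHit)) + 1
  # Proportion of the region set region covered, in the same order as the hits
  olWidth / width(rsHit)
})
```

Each vector has one value per overlap, in the order of the corresponding hits object, which is the ordering the parameter description requires.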

returnCovInfo

Logical. If TRUE, the following coverage and region set information will be calculated and included in the function output: regionSetCoverage, signalCoverage, totalRegionNumber, and meanRegionSize. For the "proportionWeightedMean" scoring method, sumProportionOverlap will also be calculated.

Value

Data.frame of results, with one row for each region set. It has one column for each item of signalCol, named as given in signalCol; these columns hold the region set's score for each target variable. The other columns are: signalCoverage (formerly cytosine_coverage), the number of epigenetic features that overlapped regionSet at all; regionSetCoverage, the number of regions from regionSet that overlapped any of the epigenetic features; totalRegionNumber, the number of regions in regionSet; and meanRegionSize, the average size in base pairs of the regions in regionSet (based on all regions in regionSet, not just the ones that overlap). For "multiBase" data, if the "proportionWeightedMean" scoring metric is used, the output also has a sumProportionOverlap column. With this scoring method, the proportion overlap between each signalCoord region and each overlapping regionSet region is calculated; sumProportionOverlap is the sum of all those proportion overlaps and is another way, in addition to regionSetCoverage, to quantify coverage of regionSet.
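As a rough illustration of working with a result of this shape, the base R sketch below ranks region sets by their PC1 score after dropping those with very low coverage. The data frame is a hand-made stand-in with invented values; the rsName column and the minCoverage threshold are assumptions added for this example and are not part of the documented output.

```r
# Hand-made stand-in for the returned data.frame (all values invented).
# rsName is a hypothetical label column added only for readability.
rsScores <- data.frame(
  rsName            = c("setA", "setB", "setC"),
  PC1               = c(0.012, 0.030, 0.021),
  PC2               = c(0.018, 0.009, 0.015),
  signalCoverage    = c(40, 3, 25),
  regionSetCoverage = c(30, 2, 20),
  totalRegionNumber = c(100, 50, 80),
  meanRegionSize    = c(500, 450, 600)
)

# Arbitrary illustrative threshold: ignore region sets with few overlapping regions
minCoverage <- 5
kept <- rsScores[rsScores$regionSetCoverage >= minCoverage, ]

# Highest PC1 score first
ranked <- kept[order(kept$PC1, decreasing = TRUE), ]
ranked$rsName  # "setC" "setA"
```

Filtering on regionSetCoverage before ranking reflects the caution, noted for "regionMean", about region sets with few regions overlapping signalCoord.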

Examples

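library(COCOA)          # example data sets and aggregateSignalGRList
library(GenomicRanges)  # GRangesList

# Load the example signal coordinates, signal matrix, and two region sets
data("brcaATACCoord1")
data("brcaATACData1")
data("esr1_chr1")
data("nrf1_chr1")

# PCA loadings serve as the feature contribution scores (one column per PC)
featureContributionScores <- prcomp(t(brcaATACData1))$rotation

# Region set database to score
GRList <- GRangesList(esr1_chr1, nrf1_chr1)

# Score each region set for PC1 and PC2
rsScores <- aggregateSignalGRList(signal = featureContributionScores,
                                  signalCoord = brcaATACCoord1,
                                  GRList = GRList,
                                  signalCol = c("PC1", "PC2"),
                                  scoringMetric = "default")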


databio/COCOA documentation built on Jan. 19, 2025, 8:28 a.m.