runCOCOA: Do COCOA with many region sets

View source: R/COCOA.R

Description

This function will give each region set a score for each PC in 'PCsToAnnotate' based on the 'scoringMetric' parameter. Based on these scores, you can determine which region sets out of a region set database (given by GRList) are most associated with the top PCs. See the vignette "Introduction to Coordinate Covariation Analysis" for help interpreting your results.

Usage

runCOCOA(loadingMat, signalCoord, GRList, PCsToAnnotate = c("PC1",
  "PC2"), scoringMetric = "regionMean", verbose = TRUE)

Arguments

loadingMat

matrix of loadings (the coefficients of the linear combination that defines each PC). One named column for each PC. One row for each original dimension/variable (should be same order as original data/signalCoord). The x$rotation output of prcomp().
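
As a minimal sketch of where such a matrix could come from, assuming the PCA was run with base R's prcomp() on a samples-by-features matrix. The toy data and the name methylData are illustrative assumptions, not part of the package.

```r
# Toy data: 6 samples (rows) x 4 genomic positions (columns); values are made up.
set.seed(1)
methylData <- matrix(runif(24), nrow = 6,
                     dimnames = list(paste0("sample", 1:6),
                                     paste0("cpg", 1:4)))

# prcomp() treats rows as observations and columns as variables.
pca <- prcomp(methylData, center = TRUE, scale. = FALSE)

# The rotation matrix has one row per original variable and one named
# column per PC ("PC1", "PC2", ...), the layout described for loadingMat.
loadingMat <- pca$rotation
dim(loadingMat)
colnames(loadingMat)
```

The row order of pca$rotation follows the column order of the input matrix, so the coordinates passed as signalCoord would need to follow that same order.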

signalCoord

a GRanges object or data.frame with coordinates for the genomic signal/original data (e.g., DNA methylation) included in the PCA. Coordinates should be in the same order as the original data and the loadings (each item/row in signalCoord corresponds to a row in loadingMat). If a data.frame, it must have chr and start columns. If an end column is included, start and end should be the same. The start coordinate will be used for calculations.
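
A small sketch of the data.frame form, using base R only. The coordinates and loading values are made up, and the consistency check at the end is a suggestion rather than something the package is known to require.

```r
# Hypothetical coordinates for four single-base positions (values made up).
signalCoord <- data.frame(
  chr   = c("chr1", "chr1", "chr1", "chr2"),
  start = c(1000L, 1500L, 90000L, 2000L),
  stringsAsFactors = FALSE
)

# Stand-in loading matrix with one row per coordinate, in the same order.
loadingMat <- matrix(c(0.5, -0.5,  0.5, 0.5,
                       0.1,  0.7, -0.7, 0.1),
                     ncol = 2, dimnames = list(NULL, c("PC1", "PC2")))

# Basic consistency checks before passing both objects on:
# required columns are present and row counts match.
stopifnot(all(c("chr", "start") %in% colnames(signalCoord)),
          nrow(signalCoord) == nrow(loadingMat))
```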

GRList

GRangesList object. Each list item is a distinct region set to test (region set: regions that correspond to the same biological annotation). The region set database. Must be from the same reference genome as the coordinates for the actual data/samples (signalCoord).

PCsToAnnotate

A character vector with the principal components to include, e.g., c("PC1", "PC2"). These should be column names of loadingMat.

scoringMetric

A character object with the scoring metric. "regionMean" (recommended) is a weighted average of the absolute value of the loadings with no normalization: loadings are first averaged within each region, then the region averages are averaged. With the "regionMean" score, interpret results cautiously for region sets in which only a small number of regions overlap signalCoord. The "simpleMean" method is the unweighted average of all absolute loadings that overlap the given region set. The Wilcoxon rank sum test ("rankSum") is also supported, but it is skewed toward ranking large region sets highly and is significantly slower than the "regionMean" method. For the "rankSum" method, the absolute loadings that overlap the given region set are taken as one group and all the loadings that do not overlap the region set are taken as the other group. The p-value is then given as the score. It is a one-sided test, with the alternative hypothesis that the loadings in the region set will be greater than the loadings not in the region set.
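
The three metrics can be illustrated on toy numbers with base R. This is a sketch of the arithmetic described above, not the package's internal code; the loadings and the region assignments (regionID) are invented for the illustration.

```r
# Toy absolute loadings for one PC at eight positions (values made up).
absLoad <- c(0.9, 0.8, 0.7, 0.1, 0.2, 0.15, 0.3, 0.25)

# Hypothetical overlap assignment: which region of the region set each
# position falls in (NA = position overlaps no region in the set).
regionID <- c("r1", "r1", "r1", "r2", NA, NA, NA, NA)
inSet <- !is.na(regionID)

# "simpleMean": unweighted average over all overlapping positions.
simpleMean <- mean(absLoad[inSet])

# "regionMean": average within each region first, then across regions.
perRegion  <- tapply(absLoad[inSet], regionID[inSet], mean)
regionMean <- mean(perRegion)

# "rankSum": one-sided Wilcoxon rank sum test, in-set vs. not-in-set;
# the p-value plays the role of the score.
rankSumP <- wilcox.test(absLoad[inSet], absLoad[!inSet],
                        alternative = "greater")$p.value
```

With these toy values the two averages differ (0.625 for the simple mean, 0.45 for the region mean) because region r1 contributes three positions to the simple mean but counts only once in the region mean.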

verbose

A "logical" object. Whether the progress of the function should be shown; one bar indicates that a region set has been completed.

Value

data.frame of results, one row for each region set. There is one column for each PC in PCsToAnnotate, holding the score for that PC for a given region set (the specific score depends on the "scoringMetric" parameter). Rows will be in the same order as the region sets in GRList. The "cytosine_coverage" column has the number of cytosines that overlapped with the given region set (or, in the general case, the number of coordinates from signalCoord that overlapped the region set). The "region_coverage" column has the number of regions that overlapped any coordinates from signalCoord. The "total_region_number" column has the total number of regions. The "mean_region_size" column has the average region size (average of all regions, not just those that overlap a cytosine).
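
One way to work with a result of this shape is to filter on coverage and then rank by a PC's score. The sketch below uses a hand-built data.frame with the column names listed above; the values, the row names (setA, setB, setC), and the coverage cutoff are all hypothetical.

```r
# Hypothetical result table shaped like the one described above (values made up).
rsScores <- data.frame(
  PC1 = c(0.012, 0.034, 0.020),
  PC2 = c(0.030, 0.010, 0.025),
  cytosine_coverage   = c(120L, 15L, 300L),
  region_coverage     = c(40L, 3L, 95L),
  total_region_number = c(60L, 50L, 110L),
  mean_region_size    = c(450, 800, 520),
  row.names = c("setA", "setB", "setC")
)

# Drop region sets with very few overlapping regions; the cutoff of 5
# is an arbitrary illustration, not a package recommendation.
kept <- rsScores[rsScores$region_coverage >= 5, ]

# Rank the remaining region sets by PC1 score, highest first.
# (With "rankSum" the score is a p-value, so smaller values would rank first.)
ranked <- kept[order(kept$PC1, decreasing = TRUE), ]
rownames(ranked)
```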

Examples

data("brcaMCoord1")
data("brcaLoadings1")
data("esr1_chr1")
# Score a single region set (wrapped in a GRangesList) against PC1 and PC2.
rsScores <- runCOCOA(loadingMat = brcaLoadings1,
                     signalCoord = brcaMCoord1,
                     GRList = GRangesList(esr1_chr1),
                     PCsToAnnotate = c("PC1", "PC2"),
                     scoringMetric = "regionMean")

databio/PCRSA documentation built on Dec. 7, 2018, 8:57 a.m.