In JasonHackney/GSDecon: Deconvolution of cellular proportions in expression data

decon

R Documentation

Identification of tissue-specific surrogate variables using pre-specified gene sets

Description

Estimate surrogate variables in a data matrix from cell type- or tissue-specific gene sets. The surrogate variables are determined by the `deconComponents` method.

Usage

  decon(object, model = NULL, geneSets, doPerm = TRUE, nPerm = 249, 
      pvalueCutoff = 0.01, nComp = 1, trim = FALSE, seed = NULL, ...)

Arguments

`object`	An object of class `matrix`, `ExpressionSet`, `CountDataSet`, `DESeqDataSet`, or `DGEList`, specifying the sample expression values. For `matrix` objects, each column should represent a sample and each row a feature, and it is assumed that fitting the data with the `lmFit` function from `limma` would be appropriate. Count data are first transformed to log2-scale data using the `voom` function from the `limma` package.
`model`	A formula or an n x k design matrix specifying the model of interest (see `model.matrix`). The default value is NULL. In that case, for `CountDataSet` or `DESeqDataSet` objects, the design of the object is used to create a model matrix using the `model.matrix` function. If the object is an `ExpressionSet`, a model matrix is created using an intercept-only model.
`geneSets`	An object of class `DeconGeneSetCollection`, a `list` of character vectors, or an incidence matrix with dimensions g x m, where the columns represent genes, and the rows represent gene sets, with a 1 where a gene is in a gene set, otherwise 0. For `list`, the values should correspond to the row names of the expression data. For `DeconGeneSetCollection`, the `geneIds` should correspond to row names of the expression data.
`doPerm`	A boolean value of length 1, specifying whether permutation testing for significance of gene sets should be performed. See below for details on the permutation testing and its interpretation.
`nPerm`	A numeric value of length 1, specifying how many permutations should be performed.
`pvalueCutoff`	A numeric value of length 1, generally between 1/nPerm and 1, specifying the significance level at which gene sets are considered informative in the dataset.
`nComp`	A numeric value of length 1, specifying how many components to test for in each gene set. For well-formed gene sets, this should be set to 1, indicating that the gene set should have one major set of correlated genes. If more components than `nComp` are found to be significant, a warning is thrown, as the gene set is likely not well specified.
`trim`	Logical. Should gene sets be trimmed before summarization? If TRUE, only genes with an average pairwise correlation coefficient > 0.1 are included in the decon algorithm. Otherwise, the whole gene set is used.
`seed`	A seed to set for random number generator used in the permutation. Setting this will allow for reproducible p-values to be generated for the gene sets.
`...`	Currently not used, but may be used in the future.
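
The trimming criterion described for `trim` can be illustrated with a short base R sketch. This is not the package's code: `trimGeneSet` is a hypothetical helper, and it assumes that "average pairwise correlation" means the mean Pearson correlation of a gene with every other gene in the set.

```r
# Hypothetical helper, not part of GSDecon: keep the genes whose average
# pairwise correlation with the other genes in the set exceeds a cutoff.
trimGeneSet <- function(exprMat, cutoff = 0.1) {
    # exprMat: features in rows, samples in columns
    corMat <- cor(t(exprMat))            # gene-by-gene correlation matrix
    diag(corMat) <- NA                   # ignore each gene's self-correlation
    avgCor <- rowMeans(corMat, na.rm = TRUE)
    exprMat[avgCor > cutoff, , drop = FALSE]
}

set.seed(1)
signal <- rnorm(20)
# Five simulated marker genes that track a shared signal, plus two
# simulated genes that follow the opposite pattern
exprMat <- rbind(
    t(sapply(1:5, function(i) signal + rnorm(20, sd = 0.3))),
    t(sapply(1:2, function(i) -signal + rnorm(20, sd = 0.3)))
)
rownames(exprMat) <- c(paste0("marker", 1:5), "offTarget1", "offTarget2")

trimmed <- trimGeneSet(exprMat)
rownames(trimmed)
```

In this simulated set, the two genes that run against the shared signal have a negative average correlation and are dropped, while the five markers are retained.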

Details

decon attempts to identify gene sets that are significantly informative in the residuals matrix for a given linear model. This is done through a gene-wise permutation strategy. For each permutation, the first nComp eigenvalues are compared to the eigenvalues of the original gene set expression matrix. An empirical p-value is calculated by finding how many random eigenvalues are greater than the observed eigenvalues.
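
The permutation strategy above can be sketched in base R. This is only an illustration under stated assumptions, not the implementation used by `decon`: `permEigenTest` is a hypothetical helper, "gene-wise permutation" is taken to mean shuffling each gene's values across samples independently, the eigenvalues are taken from the row-centered gene set matrix, and the empirical p-value uses a +1 correction.

```r
# Hypothetical helper, not GSDecon's implementation.
permEigenTest <- function(setMat, nPerm = 249, nComp = 1, seed = NULL) {
    if (!is.null(seed)) set.seed(seed)
    # Leading eigenvalues of the row-centered matrix (squared singular values)
    eigenvalues <- function(m) {
        centered <- m - rowMeans(m)
        svd(centered, nu = 0, nv = 0)$d[seq_len(nComp)]^2
    }
    observed <- eigenvalues(setMat)
    # Assumption: shuffle each gene across samples independently, which
    # breaks the correlation between genes but keeps each gene's variance
    permuted <- replicate(nPerm, {
        shuffled <- t(apply(setMat, 1, sample))
        eigenvalues(shuffled)
    })
    permuted <- matrix(permuted, nrow = nComp)
    # Assumption: empirical p-value with a +1 correction, so the smallest
    # attainable value is 1 / (nPerm + 1)
    (rowSums(permuted >= observed) + 1) / (nPerm + 1)
}

set.seed(42)
signal <- rnorm(30)
# Eight simulated genes sharing one expression pattern
corSet <- t(sapply(1:8, function(i) signal + rnorm(30, sd = 0.5)))
pCorrelated <- permEigenTest(corSet, nPerm = 249, seed = 1)
pCorrelated
```

Passing `seed` makes the shuffles, and therefore the p-values, reproducible, which mirrors the purpose of the `seed` argument described above.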

For gene sets that are considered significant (permutation p-value less than the supplied `pvalueCutoff`), an eigengene is calculated for each significant eigenvalue by the method described for `deconComponents`. The first eigengene typically represents the relative amount of that cell or tissue type in the mixed sample.
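
As a rough illustration of what an eigengene is, the sketch below takes the leading right singular vectors of a row-centered gene set matrix. The helper `eigengene` is hypothetical; the method used by `deconComponents` may differ in centering, scaling, and sign handling.

```r
# Hypothetical helper; deconComponents may differ in scaling and sign.
eigengene <- function(setMat, nComp = 1) {
    # setMat: genes of one gene set in rows, samples in columns
    centered <- setMat - rowMeans(setMat)
    decomp <- svd(centered, nu = nComp, nv = nComp)
    eg <- decomp$v[, seq_len(nComp), drop = FALSE]
    # The sign of a singular vector is arbitrary, so orient each eigengene
    # to correlate positively with the mean expression of the set
    avgExpr <- colMeans(setMat)
    for (k in seq_len(nComp)) {
        if (cor(eg[, k], avgExpr) < 0) eg[, k] <- -eg[, k]
    }
    rownames(eg) <- colnames(setMat)
    eg
}

set.seed(42)
signal <- rnorm(30)   # simulated abundance of one cell type per sample
setMat <- t(sapply(1:8, function(i) signal + rnorm(30, sd = 0.5)))
eg <- eigengene(setMat)
cor(eg[, 1], signal)
```

In this simulation the first eigengene closely tracks the simulated abundance, which is the sense in which it can stand in for the relative amount of a cell or tissue type.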

Eigengenes beyond the first can be somewhat difficult to interpret, and by default are not looked for. However, looking for significance of the second (or third) eigenvalue can be informative about the relative consistency within a gene set. Ideally, the majority of the variance of the gene set would be explained by the first eigenvector. If there is a large amount of variance explained by the second eigenvector, this suggests that your gene set is identifying two separate expression patterns in the data set of interest.
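
One way to check how much of a gene set's variance each eigenvector explains is sketched below on simulated data. The helper `varExplained` is hypothetical and is not provided by the package; it computes the share of each squared singular value of the row-centered matrix.

```r
# Hypothetical helper: fraction of variance explained by each component.
varExplained <- function(setMat) {
    centered <- setMat - rowMeans(setMat)
    d2 <- svd(centered, nu = 0, nv = 0)$d^2
    d2 / sum(d2)
}

set.seed(10)
patternA <- rnorm(30)
patternB <- rnorm(30)
# A consistent set: all eight genes follow pattern A
cleanSet <- t(sapply(1:8, function(i) patternA + rnorm(30, sd = 0.5)))
# A mixed set: four genes follow pattern A and four follow pattern B
mixedSet <- rbind(
    t(sapply(1:4, function(i) patternA + rnorm(30, sd = 0.5))),
    t(sapply(1:4, function(i) patternB + rnorm(30, sd = 0.5)))
)

veClean <- varExplained(cleanSet)
veMixed <- varExplained(mixedSet)
round(veClean[1:2], 2)
round(veMixed[1:2], 2)
```

For the consistent set the first component dominates, while the mixed set shows a much larger share for the second component, the situation in which testing with `nComp = 2` would be informative.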

Value

An object of class DeconResults with the following slots:

`pvalueCutoff`	A single numeric value giving the significance cutoff at which gene sets are considered informative
`pvalues`	A numeric vector with an entry for each gene set that has a p-value less than `pvalueCutoff`
`eigengenes`	A numeric matrix with one column for each significant gene set and one row for each sample in the expression data provided
`nComp`	A numeric vector of length 1, giving the number of significant components
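
Because the `eigengenes` slot has one row per sample, it can be appended to a design matrix as additional covariates. The sketch below uses a simulated stand-in matrix rather than a real `DeconResults` object; the sample names, gene set names, and `treatment` factor are made up for illustration.

```r
set.seed(7)
nSamples <- 12
# Stand-in for the `eigengenes` slot: one row per sample, one column per
# significant gene set (the names here are hypothetical)
eigengenes <- matrix(rnorm(nSamples * 2), nrow = nSamples,
    dimnames = list(paste0("sample", 1:nSamples), c("Tcell", "Bcell")))

# Hypothetical sample annotation with a two-level treatment factor
pheno <- data.frame(treatment = factor(rep(c("ctrl", "trt"), each = 6)),
    row.names = rownames(eigengenes))

# Model of interest plus the eigengenes as surrogate variables
design <- cbind(model.matrix(~ treatment, data = pheno), eigengenes)
colnames(design)
```

With a real result, the stand-in matrix would be replaced by the contents of the `eigengenes` slot of the returned object.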

Author(s)

J.A. Hackney

Examples

    ## Not run: 
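        library(GEOquery)
        library(hgu133plus2.db)
        library(GSDecon)
        
        ## Gene set collection built by the package constructor
        deconGSC <- DeconGeneSetCollection()
        ## Download the example dataset and move it to the log2 scale
        GSE11058 <- getGEO("GSE11058")[[1]]
        exprs(GSE11058) <- log2(exprs(GSE11058))
        annotation(GSE11058) <- "hgu133plus2"
        
        ## Map the gene set identifiers to hgu133plus2 probe IDs so that
        ## they match the row names of the expression data
        deconU133GSC <- mapIdentifiers(deconGSC, AnnotationIdentifier(),
            revmap(hgu133plus2ENTREZID))
        ## Estimate surrogate variables under an intercept-only model
        deconResults <- decon(GSE11058, ~1, deconU133GSC)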
    
## End(Not run)

JasonHackney/GSDecon documentation built on Aug. 6, 2022, 8:36 a.m.