R/tsea_mGSEA.R
In signatureSearch: Environment for Gene Expression Searching Combined with Functional Enrichment Analysis

Documented in tsea_mGSEA

#' The \code{tsea_mGSEA} function performs a Modified Gene Set Enrichment 
#' Analysis (mGSEA) that supports test sets (e.g. genes or protein IDs) with 
#' duplications. The duplication support is 
#' achieved by a weighting method for duplicated items, where the weighting is 
#' proportional to the frequency of the items in the test set.
#' 
#' The original GSEA method proposed by Subramanian et at., 2005 uses 
#' predefined gene sets \eqn{S} defined by functional annotation systems 
#' such as GO and KEGG. The goal is to determine whether the genes in \eqn{S}
#' are randomly distributed throughout a ranked test gene list \eqn{L} 
#' (e.g. all genes ranked by log2 fold changes) or enriched at the top or 
#' bottom of the test list. This is expressed by an 
#' Enrichment Score (\eqn{ES}) reflecting the degree to which a set \eqn{S} 
#' is overrepresented at the extremes of \eqn{L}. 
#' 
#' For TSEA, the query is a target protein set where duplicated entries need to 
#' be maintained. To perform GSEA with duplication support, here referred to as 
#' mGSEA, the target set is transformed to a score ranked target list 
#' \eqn{L_tar} of all targets provided by the 
#' corresponding annotation system. For each target in the query target set, 
#' its frequency is divided by the number of targets in the target set, 
#' which is the weight of that target. 
#' For targets present in the annotation system but absent in the 
#' target set, their scores are set to 0. Thus, every target in the annotation 
#' system will be assigned a score and then sorted decreasingly to obtain
#' \eqn{L_tar}.
#' 
#' In case of TSEA, the original GSEA method cannot be used directly since a 
#' large portion of zeros exists in \eqn{L_tar}. If the scores of the genes in 
#' set \eqn{S} are all zeros, \eqn{N_R} (sum of scores of genes in set 
#' \eqn{S}) will be zero, which cannot be used as the denominator. 
#' In this case, \eqn{ES} is set to -1. If only some genes in set \eqn{S}
#' have scores of zeros then \eqn{N_R} is set to a larger number to decrease 
#' the weight of the genes in \eqn{S} that have non-zero scores.
#'  
#' The reason for this modification is that if only one gene in gene set 
#' \eqn{S} has a non-zero score and this gene ranks high in \eqn{L_tar}, 
#' the weight of this gene will be 1 resulting in an \eqn{ES(S)} close to 1. 
#' Thus, the original GSEA method will score the gene set \eqn{S} as 
#' significantly enriched. However, this is undesirable because in this 
#' example only one gene is shared among the target set and the gene set 
#' \eqn{S}. Therefore, giving small weights (lowest non-zero score in \eqn{L_tar}) 
#' to genes in \eqn{S} that have zero scores could decrease the weight of the
#' genes in \eqn{S} that have non-zero scores, thereby decreasing the false 
#' positive rate. To favor truly enriched functional categories (gene set \eqn{S}) 
#' at the top of \eqn{L_tar}, only gene sets with positive \eqn{ES} are selected.
#' @section Column description:
#' The TSEA results (including \code{tsea_mGSEA}) stored in the \code{feaResult}
#' object can be returned with the \code{result} method in tabular format, 
#' here \code{tibble}. The columns of this \code{tibble} are described below.
#' \itemize{
#'     \item enrichmentScore: ES from the GSEA algorithm 
#'     (Subramanian et al., 2005). The score is calculated by walking down the 
#'     gene list L, increasing a running-sum statistic when we encounter a gene 
#'     in S and decreasing when it is not. The magnitude of the increment 
#'     depends on the gene scores. The ES is the maximum deviation from zero 
#'     encountered in the random walk. It corresponds to a weighted 
#'     Kolmogorov-Smirnov-like statistic.
#'     \item NES: Normalized enrichment score. The positive and negative 
#'     enrichment scores are normalized separately by permutating the 
#'     composition of the gene list L \code{nPerm} times, and dividing the 
#'     enrichment score by the mean of the permutation ES with the same sign.
#'     \item pvalue: The nominal p-value of the ES is calculated using a 
#'     permutation test. Specifically, the composition of the gene list L is 
#'     permuted and the ES of the gene set is recomputed for the permutated 
#'     data generating a null distribution for the ES. The p-value of the 
#'     observed ES is then calculated relative to this null distribution.
#'     \item leadingEdge: Genes in the gene set S (functional category) that 
#'     appear in the ranked list L at, or before, the point where the running 
#'     sum reaches its maximum deviation from zero. It can be interpreted as 
#'     the core of a gene set that accounts for the enrichment signal.
#'     \item ledge_rank: Ranks of genes in 'leadingEdge' in gene list L.
#' }
#' Additional columns are described under the 'result' slot of the
#' \code{\link{feaResult}} object.
#' 
#' @title Target Set Enrichment Analysis (TSEA) with mGSEA Algorithm
#' @param drugs character vector containing drug identifiers used for functional
#' enrichment testing. This can be the top ranking drugs from a GESS result. 
#' Internally, drug test sets are translated to the corresponding target protein
#' test sets based on the drug-target annotations provided under the 
#' \code{dt_anno} argument.
#' @param type one of `GO`, `KEGG` or `Reactome`
#' @param ont character(1). If type is `GO`, assign \code{ont} (ontology) one of
#' `BP`,`MF`, `CC` or `ALL`. If type is `KEGG` or `Reactome`, \code{ont} is ignored.
#' @param nPerm integer defining the number of permutation iterations for 
#' calculating p-values
#' @param exponent integer value used as exponent in GSEA algorithm. It defines
#' the weight of the items in the item set \eqn{S}.
#' @param pAdjustMethod p-value adjustment method, 
#' one of 'holm', 'hochberg', 'hommel', 'bonferroni', 'BH', 'BY', 'fdr'
#' @param pvalueCutoff double, p-value cutoff
#' @param minGSSize integer, minimum size of each gene set in annotation system
#' @param maxGSSize integer, maximum size of each gene set in annotation system
#' @param verbose TRUE or FALSE, print message or not
#' @param dt_anno drug-target annotation source. Currently, one of 'DrugBank',
#' 'CLUE', 'STITCH' or 'all'. If 'dt_anno' is 'all', the targets from the
#' DrugBank, CLUE and STITCH databases will be combined. Usually, it is
#' recommended to set the 'dt_anno' to 'all' since it provides the most
#' complete drug-target annotations. Choosing a single
#' annotation source results in sparser drug-target annotations 
#' (particularly CLUE), and thus less complete enrichment results.
#' @param readable TRUE or FALSE, it applies when type is `KEGG` or `Reactome`
#' indicating whether to convert gene Entrez ids to gene Symbols in the 'itemID' 
#' column in the result table.
#' @return \code{\link{feaResult}} object, the result table contains the
#' enriched functional categories (e.g. GO terms or KEGG pathways) ranked by 
#' the corresponding enrichment statistic.
#' @seealso \code{\link{feaResult}}, \code{\link{fea}}
#' @references 
#' GSEA algorithm: 
#' Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., 
#' Gillette, M. A., Mesirov, J. P. (2005). Gene set enrichment analysis: a 
#' knowledge-based approach for interpreting genome-wide expression profiles. 
#' Proceedings of the National Academy of Sciences of the United States of
#' America, 102(43), 15545-15550. URL: https://doi.org/10.1073/pnas.0506580102
#' @examples 
#' data(drugs10)
#' ## GO annotation system
#' #res1 <- tsea_mGSEA(drugs=drugs10, type="GO", ont="MF", exponent=1, 
#' #                   nPerm=1000, pvalueCutoff=1, minGSSize=5)
#' #result(res1)
#  ## KEGG annotation system
#' #res2 <- tsea_mGSEA(drugs=drugs10, type="KEGG", exponent=1, 
#' #                   nPerm=100, pvalueCutoff=1, minGSSize=5)
#' #result(res2)
#' ## Reactome annotation system
#' #res3 <- tsea_mGSEA(drugs=drugs10, type="Reactome", pvalueCutoff=1)
#' #result(res3)
#' @export
tsea_mGSEA <- function(drugs, type="GO", ont="MF", nPerm=1000, exponent=1, 
                       pAdjustMethod="BH", pvalueCutoff=0.05,
                       minGSSize=5, maxGSSize=500, verbose=FALSE,
                       dt_anno="all", readable=FALSE){
  if(!any(type %in% c("GO", "KEGG", "Reactome"))){
    stop('"type" argument needs to be one of "GO", "KEGG" or "Reactome"')
  }
  drugs <- unique(tolower(drugs))
  targets <- get_targets(drugs, database = dt_anno)
  gnset <- na.omit(unlist(lapply(targets$t_gn_sym, function(i) 
    unlist(strsplit(as.character(i), split = "; ")))))
  # give scores to gnset
  tar_tab <- table(gnset)
  tar_dup <- as.numeric(tar_tab); names(tar_dup) <- names(tar_tab)
  tar_weight <- sort(tar_dup/sum(tar_dup), decreasing = TRUE)
  
  if(type=="GO"){
    # Get universe genes in GO annotation system
    universe <- univ_go
    tar_diff <- setdiff(universe, gnset)
    tar_diff_weight <- rep(0, length(tar_diff))
    names(tar_diff_weight) <- tar_diff
    tar_total_weight <- c(tar_weight, tar_diff_weight)
    gsego <- gseGO2(geneList = tar_total_weight, OrgDb = "org.Hs.eg.db", 
                    ont = ont, keyType = "SYMBOL", nPerm = nPerm, 
                    minGSSize = minGSSize, maxGSSize = maxGSSize, 
                    exponent = exponent, nproc=1, verbose=verbose, 
                    pvalueCutoff = pvalueCutoff, pAdjustMethod=pAdjustMethod)
    if(is.null(gsego))
        return(NULL)
    drugs(gsego) <- drugs
    return(gsego)
  }
  
  # convert gnset symbol to entrez
  OrgDb <- load_OrgDb("org.Hs.eg.db")
  gnset_map <- suppressMessages(AnnotationDbi::select(OrgDb, keys = gnset, 
                          keytype = "SYMBOL", columns = "ENTREZID"))
  gnset_entrez <- as.character(na.omit(gnset_map$ENTREZID))
  # give scores to gnset_entrez
  tar_tab <- table(gnset_entrez)
  tar_dup <- as.numeric(tar_tab); names(tar_dup) <- names(tar_tab)
  tar_weight <- sort(tar_dup/sum(tar_dup), decreasing = TRUE)
  
  if(type=="KEGG"){
    # Get universe genes in KEGG annotation system
    KEGG_DATA <- prepare_KEGG(species="hsa", "KEGG", keyType="kegg")
    keggterms <- get("PATHID2EXTID", KEGG_DATA)
    universe <- unique(unlist(keggterms))
    
    tar_diff <- setdiff(universe, gnset_entrez)
    tar_diff_weight <- rep(0, length(tar_diff))
    names(tar_diff_weight) <- tar_diff
    tar_total_weight <- c(tar_weight, tar_diff_weight)
    
    gse_res <- gseKEGG2(geneList=tar_total_weight, organism='hsa', keyType='kegg',
                        nPerm=nPerm, exponent=exponent,
                        minGSSize = minGSSize, maxGSSize=maxGSSize, 
                        pvalueCutoff=pvalueCutoff, pAdjustMethod = pAdjustMethod,
                        verbose = verbose, readable=readable)
  }
  
  if(type=="Reactome"){
    # Get universe genes in Reactome annotation system
    Reactome_DATA <- get_Reactome_DATA(organism="human")
    raterms <- get("PATHID2EXTID", Reactome_DATA)
    universe <- unique(unlist(raterms))
    
    tar_diff <- setdiff(universe, gnset_entrez)
    tar_diff_weight <- rep(0, length(tar_diff))
    names(tar_diff_weight) <- tar_diff
    tar_total_weight <- c(tar_weight, tar_diff_weight)
    
    gse_res <- gseReactome(geneList=tar_total_weight, organism='human', 
                           exponent=exponent, nPerm=nPerm,
                           minGSSize = minGSSize, maxGSSize=maxGSSize, 
                           pvalueCutoff=pvalueCutoff, pAdjustMethod= pAdjustMethod,
                           verbose = verbose, readable=readable)
  }
  
  if(!is.null(gse_res))
    drugs(gse_res) <- drugs
  return(gse_res)
}

Any scripts or data that you put into this service are public.

signatureSearch documentation built on April 16, 2021, 6 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

signatureSearch
Environment for Gene Expression Searching Combined with Functional Enrichment Analysis

R/tsea_mGSEA.R
In signatureSearch: Environment for Gene Expression Searching Combined with Functional Enrichment Analysis

Defines functions tsea_mGSEA

Documented in tsea_mGSEA

Try the signatureSearch package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

signatureSearch Environment for Gene Expression Searching Combined with Functional Enrichment Analysis

R/tsea_mGSEA.R In signatureSearch: Environment for Gene Expression Searching Combined with Functional Enrichment Analysis

Defines functions tsea_mGSEA

Documented in tsea_mGSEA

Try the signatureSearch package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

signatureSearch
Environment for Gene Expression Searching Combined with Functional Enrichment Analysis

R/tsea_mGSEA.R
In signatureSearch: Environment for Gene Expression Searching Combined with Functional Enrichment Analysis