R/WebGestaltR.R

Defines functions WebGestaltR

Documented in WebGestaltR

#' Comprehensive R function for the enrichment analysis
#'
#' Main function for enrichment analysis
#'
#' WebGestaltR function can perform three enrichment analyses:
#' ORA (Over-Representation Analysis) and GSEA (Gene Set Enrichment Analysis).and
#' NTA (Network Topology Analysis). Based on the user-uploaded gene list or gene list
#' with scores, WebGestaltR function will first map the gene list to the entrez gene
#' ids and then summarize the gene list based on the GO (Gene Ontology) Slim. After
#' performing the enrichment analysis, WebGestaltR function also returns a user-friendly
#' HTML report containing GO Slim summary and the enrichment analysis result. If functional
#' categories have DAG (directed acyclic graph) structure or genes in the functional
#' categories have network structure, those relationship can also be visualized in the
#' report.
#'
#' @param enrichMethod Enrichment methods: \code{ORA}, \code{GSEA} or \code{NTA}.
#' @param organism Currently, WebGestaltR supports 12 organisms. Users can use the function
#'   \code{listOrganism} to check available organisms. Users can also input \code{others} to
#'   perform the enrichment analysis for other organisms not supported by WebGestaltR. For
#'   other organisms, users need to provide the functional categories, interesting list and
#'   reference list (for ORA method). Because WebGestaltR does not perform the ID mapping for
#'   the other organisms, the above data should have the same ID type.
#' @param enrichDatabase The functional categories for the enrichment analysis. Users can use
#'   the function \code{listGeneSet} to check the available functional databases for the
#'   selected organism. Multiple databases in a vector are supported for ORA and GSEA.
#' @param enrichDatabaseFile Users can provide one or more GMT files as the functional
#'   category for enrichment analysis. The extension of the file should be \code{gmt} and the
#'   first column of the file is the category ID, the second one is the external link for the
#'   category. Genes annotated to the category are from the third column. All columns are
#'   separated by tabs. The GMT files will be combined with \code{enrichDatabase}.
#' @param enrichDatabaseType The ID type of the genes in the \code{enrichDatabaseFile}.
#'   If users set \code{organism} as \code{others}, users do not need to set this ID type because
#'   WebGestaltR will not perform ID mapping for other organisms. The supported ID types of
#'   WebGestaltR for the selected organism can be found by the function \code{listIdType}.
#' @param enrichDatabaseDescriptionFile Users can also provide description files for the custom
#'   \code{enrichDatabaseFile}. The extension of the description file should be \code{des}. The
#'   description file contains two columns: the first column is the category ID that should be
#'   exactly the same as the category ID in the custom \code{enrichDatabaseFile} and the second
#'   column is the description of the category. All columns are separated by tabs.
#' @param interestGeneFile If \code{enrichMethod} is \code{ORA} or \code{NTA}, the extension of
#'   the \code{interestGeneFile} should be \code{txt} and the file can only contain one column:
#'   the interesting gene list. If \code{enrichMethod} is \code{GSEA}, the extension of the
#'   \code{interestGeneFile} should be \code{rnk} and the file should contain two columns
#'   separated by tab: the gene list and the corresponding scores.
#' @param interestGene Users can also use an R object as the input. If \code{enrichMethod} is
#'   \code{ORA} or \code{NTA}, \code{interestGene} should be an R \code{vector} object
#'   containing the interesting gene list. If \code{enrichMethod} is \code{GSEA},
#'   \code{interestGene} should be an R \code{data.frame} object containing two columns: the
#'   gene list and the corresponding scores.
#' @param interestGeneType The ID type of the interesting gene list. The supported ID types of
#'   WebGestaltR for the selected organism can be found by the function \code{listIdType}. If
#'   the \code{organism} is \code{others}, users do not need to set this parameter.
#' @param interestGeneNames The names of the id lists for multiomics data.
#' @param collapseMethod The method to collapse duplicate IDs with scores. \code{mean},
#'   \code{median}, \code{min} and \code{max} represent the mean, median, minimum and maximum
#'   of scores for the duplicate IDs.
#' @param referenceGeneFile For the ORA method, the users need to upload the reference gene
#'   list. The extension of the \code{referenceGeneFile} should be \code{txt} and the file can
#'   only contain one column: the reference gene list.
#' @param referenceGene For the ORA method, users can also use an R object as the reference
#'   gene list. \code{referenceGene} should be an R \code{vector} object containing the
#'   reference gene list.
#' @param referenceGeneType The ID type of the reference gene list. The supported ID types
#'   of WebGestaltR for the selected organism can be found by the function \code{listIdType}.
#'   If the \code{organism} is \code{others}, users do not need to set this parameter.
#' @param referenceSet Users can directly select the reference set from existing platforms in
#'   WebGestaltR and do not need to provide the reference set through \code{referenceGeneFile}.
#'   All existing platforms supported in WebGestaltR can be found by the function
#'   \code{listReferenceSet}. If \code{referenceGeneFile} and \code{refereneceGene} are
#'   \code{NULL}, WebGestaltR will use the \code{referenceSet} as the reference gene set.
#'   Otherwise, WebGestaltR will use the user supplied reference set for enrichment analysis.
#' @param minNum WebGestaltR will exclude the categories with the number of annotated genes
#'   less than \code{minNum} for enrichment analysis. The default is \code{10}.
#' @param maxNum WebGestaltR will exclude the categories with the number of annotated genes
#'   larger than \code{maxNum} for enrichment analysis. The default is \code{500}.
#' @param sigMethod Two methods of significance are available in WebGestaltR: \code{fdr} and
#'   \code{top}. \code{fdr} means the enriched categories are identified based on the FDR and
#'   \code{top} means all categories are ranked based on FDR and then select top categories
#'   as the enriched categories. The default is \code{fdr}.
#' @param fdrMethod For the ORA method, WebGestaltR supports five FDR methods: \code{holm},
#'   \code{hochberg}, \code{hommel}, \code{bonferroni}, \code{BH} and \code{BY}. The default
#'   is \code{BH}.
#' @param fdrThr The significant threshold for the \code{fdr} method. The default is \code{0.05}.
#' @param topThr The threshold for the \code{top} method. The default is \code{10}.
#' @param reportNum The number of enriched categories visualized in the final report. The default
#'   is \code{20}. A larger \code{reportNum} may be slow to render in the report.
#' @param perNum The number of permutations for the GSEA method. The default is \code{1000}.
#' @param gseaP The exponential scaling factor of the phenotype score. The default is \code{1}.
#'   When p=0, ES reduces to standard K-S statistics (See original paper for more details).
#' @param isOutput If \code{isOutput} is TRUE, WebGestaltR will create a folder named by
#'   the \code{projectName} and save the results in the folder. Otherwise, WebGestaltR will
#'   only return an R \code{data.frame} object containing the enrichment results. If
#'   hundreds of gene list need to be analyzed simultaneously, it is better to set
#'   \code{isOutput} to \code{FALSE}. The default is \code{TRUE}.
#' @param outputDirectory The output directory for the results.
#' @param projectName The name of the project. If \code{projectName} is \code{NULL},
#'   WebGestaltR will use time stamp as the project name.
#' @param dagColor If \code{dagColor} is \code{binary}, the significant terms in the DAG
#'   structure will be colored by steel blue for ORA method or steel blue (positive related)
#'   and dark orange (negative related) for GSEA method. If \code{dagColor} is \code{continous},
#'   the significant terms in the DAG structure will be colored by the color gradient based on
#'   corresponding FDRs.
#' @param saveRawGseaResult Whether the raw result from GSEA is saved as a RDS file, which can be
#'   used for plotting. Defaults to \code{FALSE}. The list includes
#'   \describe{
#'     \item{Enrichment_Results}{A data frame of GSEA results with statistics}
#'     \item{Running_Sums}{A matrix of running sum of scores for each gene set}
#'     \item{Items_in_Set}{A list with ranks of genes for each gene set}
#'  }
#' @param gseaPlotFormat The graphic format of GSEA enrichment plots. Either \code{svg},
#'   \code{png}, or \code{c("png", "svg")} (default).
#' @param setCoverNum The number of expected gene sets after set cover to reduce redundancy.
#'   It could get fewer sets if the coverage reaches 100\%. The default is \code{10}.
#' @param networkConstructionMethod Netowrk construction method for NTA. Either
#'   \code{Network_Retrieval_Prioritization} or \code{Network_Expansion}. Network Retrieval &
#'   Prioritization first uses random walk analysis to calculate random walk probabilities
#'   for the input seeds, then identifies the relationships among the seeds in the selected
#'   network and returns a retrieval sub-network. The seeds with the top random walk
#'   probabilities are highlighted in the sub-network. Network Expansion first uses random
#'   walk analysis to rank all genes in the selected network based on their network
#'   proximity to the input seeds and then return an expanded sub-network in which nodes
#'   are the input seeds and their top ranking neighbors and edges represent their
#'   relationships.
#' @param neighborNum The number of neighbors to include in NTA Network Expansion method.
#' @param highlightType The type of nodes to highlight in the NTA Network Expansion method,
#'   either \code{Seeds} or \code{Neighbors}.
#' @param highlightSeedNum The number of top input seeds to highlight in NTA Network Retrieval
#'   & Prioritizaiton method.
#' @param nThreads The number of cores to use for GSEA and set cover, and in batch function.
#' @param cache A directory to save data cache for reuse. Defaults to \code{NULL} and disabled.
#' @param hostName The server URL for accessing data. Mostly for development purposes.
#' @param useWeightedSetCover Use weighted set cover for ORA. Defaults to \code{TRUE}.
#' @param useAffinityPropagation Use affinity propagation for ORA. Defaults to \code{FALSE}.
#' @param usekMedoid Use k-medoid for ORA. Defaults to \code{TRUE}.
#' @param kMedoid_k The number of clusters for k-medoid. Defaults to \code{25}.
#' @param ... In batch function, passes parameters to WebGestaltR function.
#'   Also handles backward compatibility for some parameters in old versions.
#'
#' @return The WebGestaltR function returns a data frame containing the enrichment analysis
#'   result and also outputs an user-friendly HTML report if \code{isOutput} is \code{TRUE}.
#'   The columns in the data frame depend on the \code{enrichMethod} and they are the following:
#'   \describe{
#'     \item{geneSet}{ID of the gene set.}
#'     \item{description}{Description of the gene set if available.}
#'     \item{link}{Link to the data source.}
#'     \item{size}{The number of genes in the set after filtering by \code{minNum} and \code{maxNum}.}
#'     \item{overlap}{The number of mapped input genes that are annotated in the gene set.}
#'     \item{expect}{Expected number of input genes that are annotated in the gene set.}
#'     \item{enrichmentRatio}{Enrichment ratio, overlap / expect.}
#'     \item{enrichmentScore}{Enrichment score, the maximum running sum of scores for the ranked list.}
#'     \item{normalizedEnrichmentScore}{Normalized enrichment score, normalized against the average
#'       enrichment score of all permutations.}
#'     \item{leadingEdgeNum}{Number of genes/phosphosites in the leading edge.}
#'     \item{pValue}{P-value from hypergeometric test for ORA. For GSEA, please refer to its original
#'       publication or online at \url{https://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm}.}
#'     \item{FDR}{Corrected P-value for mulilple testing with \code{fdrMethod} for ORA.}
#'     \item{overlapId}{The gene/phosphosite IDs of \code{overlap} for ORA (entrez gene IDs or
#'       phosphosite sequence).}
#'     \item{leadingEdgeId}{Genes/phosphosites in the leading edge in entrez gene ID or
#'       phosphosite sequence.}
#'     \item{userId}{The gene/phosphosite IDs of \code{overlap} for ORA or \code{leadingEdgeId}
#'       for GSEA in User input IDs.}
#'     \item{plotPath}{Path of the GSEA enrichment plot.}
#'     \item{database}{Name of the source database if multiple enrichment databases are given.}
#'     \item{goId}{In NTA, like \code{geneSet}, the enriched GO terms of genes in the
#'       returned subnetwork.}
#'     \item{interestGene}{In NTA, the gene IDs in the subnetwork with 0/1 annotations indicating
#'       if it is from user input.}
#'   }
#'
#' @export
#'
#' @examples
#' \dontrun{
#' ####### ORA example #########
#' geneFile <- system.file("extdata", "interestingGenes.txt", package = "WebGestaltR")
#' refFile <- system.file("extdata", "referenceGenes.txt", package = "WebGestaltR")
#' outputDirectory <- getwd()
#' enrichResult <- WebGestaltR(
#'   enrichMethod = "ORA", organism = "hsapiens",
#'   enrichDatabase = "pathway_KEGG", interestGeneFile = geneFile,
#'   interestGeneType = "genesymbol", referenceGeneFile = refFile,
#'   referenceGeneType = "genesymbol", isOutput = TRUE,
#'   outputDirectory = outputDirectory, projectName = NULL
#' )
#'
#' ####### GSEA example #########
#' rankFile <- system.file("extdata", "GeneRankList.rnk", package = "WebGestaltR")
#' outputDirectory <- getwd()
#' enrichResult <- WebGestaltR(
#'   enrichMethod = "GSEA", organism = "hsapiens",
#'   enrichDatabase = "pathway_KEGG", interestGeneFile = rankFile,
#'   interestGeneType = "genesymbol", sigMethod = "top", topThr = 10, minNum = 5,
#'   outputDirectory = outputDirectory
#' )
#'
#' ####### NTA example #########
#' enrichResult <- WebGestaltR(
#'   enrichMethod = "NTA", organism = "hsapiens",
#'   enrichDatabase = "network_PPI_BIOGRID", interestGeneFile = geneFile,
#'   interestGeneType = "genesymbol", sigMethod = "top", topThr = 10,
#'   outputDirectory = getwd(), highlightSeedNum = 10,
#'   networkConstructionMethod = "Network_Retrieval_Prioritization"
#' )
#' }
#'
WebGestaltR <- function(enrichMethod = "ORA", organism = "hsapiens", enrichDatabase = NULL, enrichDatabaseFile = NULL, enrichDatabaseType = NULL,
                        enrichDatabaseDescriptionFile = NULL, interestGeneFile = NULL, interestGene = NULL, interestGeneType = NULL,
                        interestGeneNames = NULL, collapseMethod = "mean", referenceGeneFile = NULL, referenceGene = NULL, referenceGeneType = NULL,
                        referenceSet = NULL, minNum = 10, maxNum = 500, sigMethod = "fdr", fdrMethod = "BH", fdrThr = 0.05, topThr = 10, reportNum = 20,
                        perNum = 1000, gseaP = 1, isOutput = TRUE, outputDirectory = getwd(), projectName = NULL, dagColor = "continuous",
                        saveRawGseaResult = FALSE, gseaPlotFormat = c("png", "svg"), setCoverNum = 10, networkConstructionMethod = NULL,
                        neighborNum = 10, highlightType = "Seeds", highlightSeedNum = 10, nThreads = 1, cache = NULL,
                        hostName = "https://www.webgestalt.org/", useWeightedSetCover = FALSE, useAffinityPropagation = FALSE,
                        usekMedoid = TRUE, kMedoid_k = 25, ...) {
  extraArgs <- list(...)
  if ("keepGSEAFolder" %in% names(extraArgs) | "keepGseaFolder" %in% names(extraArgs)) {
    warning("Parameter keepGSEAFolder is obsolete.\n")
  }
  if ("is.output" %in% names(extraArgs)) {
    isOutput <- extraArgs$is.output
    warning("Parameter is.output is deprecated and changed to isOutput!\n")
    warning("Column names of the result data frame are modified.")
  }
  if ("methodType" %in% names(extraArgs)) {
    warning("Parameter methodType is obsolete.\n")
  }
  if ("lNum" %in% names(extraArgs)) {
    warning("Parameter lNum is obsolete.\n")
  }
  if ("dNum" %in% names(extraArgs)) {
    warning("Parameter dNum is deprecated and changed to reportNum.\n")
    reportNum <- extraArgs$dNum
  }

  if (!is.null(cache)) {
    cat("Use cache data if available.\n")
  }

  if (!is.null(enrichDatabase)) {
    if (length(enrichDatabase) > 1) {
      enrichDatabase <- unlist(sapply(enrichDatabase, function(x) {
        return(get_gmt_file(hostName, interestGeneType, x, organism, cache))
      }))
    } else {
      enrichDatabase <- get_gmt_file(hostName, interestGeneType, enrichDatabase, organism, cache)
    }
  }

  ## TODO: add para test for NTA
  errorTest <- parameterErrorMessage(enrichMethod = enrichMethod, organism = organism, collapseMethod = collapseMethod, minNum = minNum, maxNum = maxNum, fdrMethod = fdrMethod, sigMethod = sigMethod, fdrThr = fdrThr, topThr = topThr, reportNum = reportNum, perNum = perNum, isOutput = isOutput, outputDirectory = outputDirectory, dagColor = dagColor, hostName = hostName, cache = cache)
  if (!is.null(errorTest)) {
    return(errorTest)
  }

  if (is.null(projectName)) {
    projectName <- as.character(as.integer(Sys.time()))
  }
  projectName <- sanitizeFileName(projectName) # use for GOSlim summary file name, convert punct to _
  if (enrichMethod == "ORA") {
    enrichR <- WebGestaltROra(organism = organism, enrichDatabase = enrichDatabase, enrichDatabaseFile = enrichDatabaseFile, enrichDatabaseType = enrichDatabaseType, enrichDatabaseDescriptionFile = enrichDatabaseDescriptionFile, interestGeneFile = interestGeneFile, interestGene = interestGene, interestGeneType = interestGeneType, collapseMethod = collapseMethod, referenceGeneFile = referenceGeneFile, referenceGene = referenceGene, referenceGeneType = referenceGeneType, referenceSet = referenceSet, minNum = minNum, maxNum = maxNum, fdrMethod = fdrMethod, sigMethod = sigMethod, fdrThr = fdrThr, topThr = topThr, reportNum = reportNum, setCoverNum = setCoverNum, isOutput = isOutput, outputDirectory = outputDirectory, projectName = projectName, dagColor = dagColor, nThreads = nThreads, cache = cache, hostName = hostName, useWeightedSetCover = useWeightedSetCover, useAffinityPropagation = useAffinityPropagation, usekMedoid = usekMedoid, kMedoid_k = kMedoid_k)
  } else if (enrichMethod == "GSEA") {
    enrichR <- WebGestaltRGsea(organism = organism, enrichDatabase = enrichDatabase, enrichDatabaseFile = enrichDatabaseFile, enrichDatabaseType = enrichDatabaseType, enrichDatabaseDescriptionFile = enrichDatabaseDescriptionFile, interestGeneFile = interestGeneFile, interestGene = interestGene, interestGeneType = interestGeneType, collapseMethod = collapseMethod, minNum = minNum, maxNum = maxNum, fdrMethod = fdrMethod, sigMethod = sigMethod, fdrThr = fdrThr, topThr = topThr, reportNum = reportNum, setCoverNum = setCoverNum, perNum = perNum, p = gseaP, isOutput = isOutput, outputDirectory = outputDirectory, projectName = projectName, dagColor = dagColor, saveRawGseaResult = saveRawGseaResult, plotFormat = gseaPlotFormat, nThreads = nThreads, cache = cache, hostName = hostName, useWeightedSetCover = useWeightedSetCover, useAffinityPropagation = useAffinityPropagation, usekMedoid = usekMedoid, kMedoid_k = kMedoid_k)
  } else if (enrichMethod == "NTA") {
    enrichR <- WebGestaltRNta(organism = organism, network = enrichDatabase, method = networkConstructionMethod, neighborNum = neighborNum, highlightSeedNum = highlightSeedNum, inputSeed = interestGene, inputSeedFile = interestGeneFile, interestGeneType = interestGeneType, sigMethod = sigMethod, fdrThr = fdrThr, topThr = topThr, outputDirectory = outputDirectory, projectName = projectName, highlightType = highlightType, cache = cache, hostName = hostName)
  }
  return(enrichR)
}
bingzhang16/WebGestaltR documentation built on March 9, 2024, 4:10 p.m.