R/ActivePathways.r
In ActivePathways: Integrative Pathway Enrichment Analysis of Multivariate Omics Data

Documented in ActivePathways columnSignificance enrichmentAnalysis

#' ActivePathways
#'
#' @param scores A numerical matrix of p-values where each row is a gene and
#'   each column represents an omics dataset (evidence). Rownames correspond to the genes 
#'   and colnames to the datasets. All values must be 0<=p<=1. We recommend converting 
#'   missing values to ones. 
#' @param gmt A GMT object to be used for enrichment analysis. If a filename, a
#'   GMT object will be read from the file.
#' @param background A character vector of gene names to be used as a
#'   statistical background. By default, the background is all genes that appear
#'   in \code{gmt}.
#' @param geneset_filter A numeric vector of length two giving the lower and 
#'   upper limits for the size of the annotated geneset to pathways in gmt.
#'   Pathways with a geneset shorter than \code{geneset_filter[1]} or longer
#'   than \code{geneset_filter[2]} will be removed. Set either value to NA
#'   to not enforce a minimum or maximum value, or set \code{geneset_filter} to 
#'   \code{NULL} to skip filtering.
#' @param cutoff A maximum merged p-value for a gene to be used for analysis.
#'   Any genes with merged, unadjusted \code{p > significant} will be discarded 
#'   before testing.
#' @param significant Significance cutoff for selecting enriched pathways. Pathways with
#'   \code{adjusted_p_val <= significant} will be selected as results.
#' @param merge_method Statistical method to merge p-values. See section on Merging P-Values
#' @param correction_method Statistical method to correct p-values. See
#'   \code{\link[stats]{p.adjust}} for details.
#' @param cytoscape_file_tag The directory and/or file prefix to which the output files
#'   for generating enrichment maps should be written. If NA, files will not be written. 
#' @param color_palette Color palette from RColorBrewer::brewer.pal to color each
#'   column in the scores matrix. If NULL grDevices::rainbow is used by default.
#' @param custom_colors A character vector of custom colors for each column in the scores matrix.
#' @param color_integrated_only A character vector of length 1 specifying the color of the 
#'   "combined" pathway contribution.
#' @param scores_direction A numerical matrix of log2 transformed fold-change values where each row is a
#'   gene and each column represents a dataset (evidence). Rownames correspond to the genes
#'   and colnames to the datasets. We recommend converting missing values to zero. 
#'   Must contain the same dimensions as the scores parameter. Datasets without directional information should be set to 0.
#' @param constraints_vector A numerical vector of +1 or -1 values corresponding to the user-defined
#'   directional relationship between columns in scores_direction. Datasets without directional information should
#'   be set to 0.
#'
#' @return A data.table of terms (enriched pathways) containing the following columns:
#'   \describe{
#'     \item{term_id}{The database ID of the term}
#'     \item{term_name}{The full name of the term}
#'     \item{adjusted_p_val}{The associated p-value, adjusted for multiple testing}
#'     \item{term_size}{The number of genes annotated to the term}
#'     \item{overlap}{A character vector of the genes enriched in the term}
#'     \item{evidence}{Columns of \code{scores} (i.e., omics datasets) that contributed 
#'          individually to the enrichment of the term. Each input column is evaluated 
#'          separately for enrichments and added to the evidence if the term is found.}
#'   }
#'
#' @section Merging P-values:
#' To obtain a single p-value for each gene across the multiple omics datasets considered, 
#' the p-values in \code{scores} #' are merged row-wise using a data fusion approach of p-value merging. 
#' The eight available methods are:
#' \describe{
#'  \item{Fisher}{Fisher's method assumes p-values are uniformly
#'  distributed and performs a chi-squared test on the statistic sum(-2 log(p)).
#'  This method is most appropriate when the columns in \code{scores} are
#'  independent.}
#'  \item{Fisher_directional}{Fisher's method modification that allows for 
#'  directional information to be incorporated with the \code{scores_direction}
#'  and \code{constraints_vector} parameters.}
#'  \item{Brown}{Brown's method extends Fisher's method by accounting for the
#'  covariance in the columns of \code{scores}. It is more appropriate when the
#'  tests of significance used to create the columns in \code{scores} are not
#'  necessarily independent. The Brown's method is therefore recommended for 
#'  many omics integration approaches.}
#'  \item{DPM}{DPM extends Brown's method by incorporating directional information
#'  using the \code{scores_direction} and \code{constraints_vector} parameters.}
#'  \item{Stouffer}{Stouffer's method assumes p-values are uniformly distributed
#'  and transforms p-values into a Z-score using the cumulative distribution function of a
#'  standard normal distribution. This method is appropriate when the columns in \code{scores}
#'   are independent.}
#'  \item{Stouffer_directional}{Stouffer's method modification that allows for 
#'  directional information to be incorporated with the \code{scores_direction}
#'  and \code{constraints_vector} parameters.}
#'  \item{Strube}{Strube's method extends Stouffer's method by accounting for the 
#'  covariance in the columns of \code{scores}.}
#'  \item{Strube_directional}{Strube's method modification that allows for 
#'  directional information to be incorporated with the \code{scores_direction}
#'  and \code{constraints_vector} parameters.}
#' }
#'
#' @section Cytoscape:
#'   To visualize and interpret enriched pathways, ActivePathways provides an option
#'   to further analyse results as enrichment maps in the Cytoscape software. 
#'   If \code{!is.na(cytoscape_file_tag)}, four files will be written that can be used 
#'   to build enrichment maps. This requires the EnrichmentMap and enhancedGraphics apps.
#'
#' The four files written are:
#'   \describe{
#'     \item{pathways.txt}{A list of significant terms and the
#'     associated p-value. Only terms with \code{adjusted_p_val <= significant} are
#'     written to this file.}
#'     \item{subgroups.txt}{A matrix indicating whether the significant terms (pathways)
#'     were also found to be significant when considering only one column from
#'     \code{scores}. A one indicates that term was found to be significant 
#' 			when only p-values in that column were used to select genes.}
#'     \item{pathways.gmt}{A Shortened version of the supplied GMT
#'     file, containing only the significantly enriched terms in pathways.txt. }
#'     \item{legend.pdf}{A legend with colours matching contributions
#'     from columns in \code{scores}.}
#'   }
#'
#'   How to use: Create an enrichment map in Cytoscape with the file of terms
#'   (pathways.txt) and the shortened gmt file
#'   (pathways.gmt). Upload the subgroups file (subgroups.txt) as a table
#'   using the menu File > Import > Table from File. To paint nodes according 
#'   to the type of supporting evidence, use the 'style'
#'   panel, set image/Chart1 to use the column `instruct` and the passthrough
#'   mapping type. Make sure the app enhancedGraphics is installed. 
#'   Lastly, use the file legend.pdf as a reference for colors in the enrichment map.
#'
#' @examples
#'     fname_scores <- system.file("extdata", "Adenocarcinoma_scores_subset.tsv", 
#'          package = "ActivePathways")
#'     fname_GMT = system.file("extdata", "hsapiens_REAC_subset.gmt",
#'          package = "ActivePathways")
#'
#'     dat <- as.matrix(read.table(fname_scores, header = TRUE, row.names = 'Gene'))
#'     dat[is.na(dat)] <- 1
#'
#'     ActivePathways(dat, fname_GMT)
#'
#' @import data.table
#'
#' @export

ActivePathways <-  function(scores, gmt, background = makeBackground(gmt),
                            geneset_filter = c(5, 1000), cutoff = 0.1, significant = 0.05,
                            merge_method = c("Fisher", "Fisher_directional", "Brown", "DPM", "Stouffer",
                                            "Stouffer_directional", "Strube", "Strube_directional"),
                            correction_method = c("holm", "fdr", "hochberg", "hommel",
                                                  "bonferroni", "BH", "BY", "none"),
                            cytoscape_file_tag = NA, color_palette = NULL, custom_colors = NULL, 
                            color_integrated_only = "#FFFFF0", scores_direction = NULL, 
                            constraints_vector = NULL) {
   
   merge_method <- match.arg(merge_method)
   correction_method <- match.arg(correction_method)
   
   ##### Validation #####
   # scores
   if (!(is.matrix(scores) && is.numeric(scores))) stop("scores must be a numeric matrix")
   if (any(is.na(scores))) stop("scores cannot contain missing values, we recommend replacing NA with 1 or removing")
   if (any(scores < 0) || any(scores > 1)) stop("All values in scores must be in [0,1]")
   if (any(duplicated(rownames(scores)))) stop("scores matrix contains duplicated genes - rownames must be unique")
   
   # scores_direction and constraints_vector
   if (xor(!is.null(scores_direction),!is.null(constraints_vector))) stop("Both scores_direction and constraints_vector must be provided")
   if (!is.null(scores_direction) && !is.null(constraints_vector)){
      if (!(is.numeric(constraints_vector) && is.vector(constraints_vector))) stop("constraints_vector must be a numeric vector")
      if (any(!constraints_vector %in% c(1,-1,0))) stop("constraints_vector must contain the values: 1, -1 or 0")
      if (!(is.matrix(scores_direction) && is.numeric(scores_direction))) stop("scores_direction must be a numeric matrix")
      if (any(is.na(scores_direction))) stop("scores_direction cannot contain missing values, we recommend replacing NA with 0 or removing")
      if (nrow(scores) != nrow(scores_direction)) stop("scores and scores_direction must have the same number of rows")
      if (any(!rownames(scores_direction) %in% rownames(scores))) stop ("scores_direction gene names must match scores genes")
      if (any(rownames(scores) != rownames(scores_direction))) stop("scores genes should be in the same order as scores_direction genes")
      if (is.null(colnames(scores)) || is.null(colnames(scores_direction))) stop("column names must be provided to scores and scores_direction")
      if (any(!colnames(scores_direction) %in% colnames(scores))) stop("scores_direction column names must match scores column names")
      if (length(constraints_vector) != length(colnames(scores_direction))) stop("constraints_vector should have the same number of entries as columns in scores_direction")
      if (merge_method %in% c("Fisher","Brown","Stouffer","Strube")) stop("Only DPM, Fisher_directional, Stouffer_directional, and Strube_directional methods support directional integration")
      if (any(constraints_vector %in% 0) &&  !all(scores_direction[,constraints_vector %in% 0] == 0)) 
         stop("scores_direction entries must be set to 0's for columns that do not contain directional information")
      if (!is.null(names(constraints_vector))){
         if (!all.equal(names(constraints_vector), colnames(scores_direction), colnames(scores)) == TRUE){
            stop("the constraints_vector entries should match the order of scores and scores_direction columns")
         }}}
   
   # cutoff and significant
   stopifnot(length(cutoff) == 1)
   stopifnot(is.numeric(cutoff))
   if (cutoff < 0 || cutoff > 1) stop("cutoff must be a value in [0,1]")
   stopifnot(length(significant) == 1)
   stopifnot(is.numeric(significant))
   if (significant < 0 || significant > 1) stop("significant must be a value in [0,1]")
   
   # gmt
   if (!is.GMT(gmt)) gmt <- read.GMT(gmt)
   if (length(gmt) == 0) stop("No pathways in gmt made the geneset_filter")
   if (!(is.character(background) && is.vector(background))) {
      stop("background must be a character vector")
   } 
   
   # geneset_filter
   if (!is.null(geneset_filter)) {
      if (!(is.numeric(geneset_filter) && is.vector(geneset_filter))) {
         stop("geneset_filter must be a numeric vector")
      }
      if (length(geneset_filter) != 2) stop("geneset_filter must be length 2")
      if (!is.numeric(geneset_filter)) stop("geneset_filter must be numeric")
      if (any(geneset_filter < 0, na.rm=TRUE)) stop("geneset_filter limits must be positive")
   }
   
   # custom_colors
   if (!is.null(custom_colors)){
      if(!(is.character(custom_colors) && is.vector(custom_colors))){
         stop("colors must be provided as a character vector")   
      } 
      if(length(colnames(scores)) != length(custom_colors)) stop("incorrect number of colors is provided")
   }
   if (!is.null(custom_colors) & !is.null(color_palette)){
      stop("Both custom_colors and color_palette are provided. Specify only one of these parameters for node coloring.")
   }
   
   if (!is.null(names(custom_colors))){
      if (!all(names(custom_colors) %in% colnames(scores))){
         stop("names() of the custom colors vector should match the scores column names")
      }
   }
   
   # color_palette
   if (!is.null(color_palette)){
      if (!(color_palette %in% rownames(RColorBrewer::brewer.pal.info))) stop("palette must be from the RColorBrewer package")
   }
   
   # color_integrated_only
   if(!(is.character(color_integrated_only) && is.vector(color_integrated_only))){
      stop("color must be provided as a character vector")   
   } 
   if(1 != length(color_integrated_only)) stop("only a single color must be specified")
   
   # contribution
   contribution <- TRUE
   if (ncol(scores) == 1) {
      contribution <- FALSE
      message("scores matrix contains only one column. Column contributions will not be calculated")
   }
   
   ##### filtering and sorting ####
   
   # Remove any genes not found in the background
   orig_length <- nrow(scores)
   scores <- scores[rownames(scores) %in% background, , drop=FALSE]
   if(!is.null(scores_direction)){
      scores_direction <- scores_direction[rownames(scores_direction) %in% background, , drop=FALSE]
   }
   if (nrow(scores) == 0) {
      stop("scores does not contain any genes in the background")
   }
   if (nrow(scores) < orig_length) {
      message(paste(orig_length - nrow(scores), "rows were removed from scores",
                    "because they are not found in the background"))
   }
   
   
   # Filter the GMT
   if (!all(background %in% unique(unlist(sapply(gmt, "[", c(3)))))){
      background_genes <- lapply(sapply(gmt, "[", c(3)), intersect, background)
      background_genes <- background_genes[lapply(background_genes,length) > 0]
      gmt <- gmt[names(sapply(gmt,"[",c(3))) %in% names(background_genes)]
      for (i in 1:length(gmt)) {
         gmt[[i]]$genes <- background_genes[[i]]
      }
   }
   
   if(!is.null(geneset_filter)) {
      orig_length <- length(gmt)
      if (!is.na(geneset_filter[1])) {
         gmt <- Filter(function(x) length(x$genes) >= geneset_filter[1], gmt)
      }
      if (!is.na(geneset_filter[2])) {
         gmt <- Filter(function(x) length(x$genes) <= geneset_filter[2], gmt)
      }
      if (length(gmt) == 0) stop("No pathways in gmt made the geneset_filter")
      if (length(gmt) < orig_length) {
         message(paste(orig_length - length(gmt), "terms were removed from gmt", 
                       "because they did not make the geneset_filter"))
      }
   }
   
   # merge p-values to get a single score for each gene and remove any genes
   # that don't make the cutoff
   merged_scores <- merge_p_values(scores, merge_method, scores_direction, constraints_vector)
   merged_scores <- merged_scores[merged_scores <= cutoff]
   
   if (length(merged_scores) == 0) stop("No genes made the cutoff")
   
   # Sort genes by p-value
   ordered_scores <- names(merged_scores)[order(merged_scores)]
   
   ##### enrichmentAnalysis and column contribution #####
   
   res <- enrichmentAnalysis(ordered_scores, gmt, background)
   adjusted_p <- stats::p.adjust(res$adjusted_p_val, method = correction_method)
   res[, "adjusted_p_val" := adjusted_p]
   
   significant_indeces <- which(res$adjusted_p_val <= significant)
   if (length(significant_indeces) == 0) {
      warning("No significant terms were found")
      return()
   }
   
   if (contribution) {
      sig_cols <- columnSignificance(scores, gmt, background, cutoff,
                                     significant, correction_method, res$adjusted_p_val)
      res <- cbind(res, sig_cols[, -1])
   } else {
      sig_cols <- NULL
   }
   
   # if significant result were found and cytoscape file tag exists
   # proceed with writing files in the working directory
   if (length(significant_indeces) > 0 & !is.na(cytoscape_file_tag)) {
      prepareCytoscape(res[significant_indeces, c("term_id", "term_name", "adjusted_p_val")],
                       gmt[significant_indeces], 
                       cytoscape_file_tag,
                       sig_cols[significant_indeces,], color_palette, custom_colors, color_integrated_only)
   }
   
   res[significant_indeces]
}


#' Perform pathway enrichment analysis on an ordered list of genes
#'
#' @param genelist character vector of gene names, in decreasing order
#'   of significance
#' @param gmt GMT object
#' @param background character vector of gene names. List of all genes being used
#'   as a statistical background
#'
#' @return a data.table of terms with the following columns:
#'   \describe{
#'     \item{term_id}{The id of the term}
#'     \item{term_name}{The full name of the term}
#'     \item{adjusted_p_val}{The associated p-value adjusted for multiple testing}
#'     \item{term_size}{The number of genes annotated to the term}
#'     \item{overlap}{A character vector of the genes that overlap between the
#'        term and the query}
#'   }
#' @keywords internal
enrichmentAnalysis <- function(genelist, gmt, background) {
   dt <- data.table(term_id=names(gmt))
   
   for (i in 1:length(gmt)) {
      term <- gmt[[i]]
      tmp <- orderedHypergeometric(genelist, background, term$genes)
      overlap <- genelist[1:tmp$ind]
      overlap <- overlap[overlap %in% term$genes]
      if (length(overlap) == 0) overlap <- c()
      set(dt, i, 'term_name', term$name)
      set(dt, i, 'adjusted_p_val', tmp$p_val)
      set(dt, i, 'term_size', length(term$genes))
      set(dt, i, 'overlap', list(list(overlap)))
   }
   dt
}

#' Determine which terms are found to be significant using each column
#' individually. 
#'
#' @inheritParams ActivePathways
#' @param pvals p-value for the pathways calculated by ActivePathways
#'
#' @return a data.table with columns 'term_id' and a column for each column
#' in \code{scores}, indicating whether each term (pathway) was found to be
#' significant or not when considering only that column. For each term, 
#' either report the list of related genes if that term was significant, or NA if not. 

columnSignificance <- function(scores, gmt, background, cutoff, significant, correction_method, pvals) {
   dt <- data.table(term_id=names(gmt), evidence=NA)
   for (col in colnames(scores)) {
      col_scores <- scores[, col, drop=TRUE]
      col_scores <- col_scores[col_scores <= cutoff]
      col_scores <- names(col_scores)[order(col_scores)]
      
      res <- enrichmentAnalysis(col_scores, gmt, background)
      set(res, i = NULL, "adjusted_p_val", stats::p.adjust(res$adjusted_p_val, correction_method))
      set(res, i = which(res$adjusted_p_val > significant), "overlap", list(list(NA)))
      set(dt, i=NULL, col, res$overlap)
   }
   
   ev_names = colnames(dt[,-1:-2])
   set_evidence <- function(x) {
      ev <- ev_names[!is.na(dt[x, -1:-2])]
      if(length(ev) == 0) {
         if (pvals[x] <= significant) {
            ev <- 'combined'
         } else {
            ev <- 'none'
         }
      }
      ev
   }
   evidence <- lapply(1:nrow(dt), set_evidence)
   
   set(dt, i=NULL, "evidence", evidence)
   colnames(dt)[-1:-2] = paste0("Genes_", colnames(dt)[-1:-2])
   
   dt
}

#' Export the results from ActivePathways as a comma-separated values (CSV) file. 
#'
#' @param res the data.table object with ActivePathways results.
#' @param file_name location and name of the CSV file to write to.
#' @export
#'
#' @examples
#'     fname_scores <- system.file("extdata", "Adenocarcinoma_scores_subset.tsv", 
#'          package = "ActivePathways")
#'     fname_GMT = system.file("extdata", "hsapiens_REAC_subset.gmt",
#'          package = "ActivePathways")
#'
#'     dat <- as.matrix(read.table(fname_scores, header = TRUE, row.names = 'Gene'))
#'     dat[is.na(dat)] <- 1
#'
#'     res <- ActivePathways(dat, fname_GMT)
#'\donttest{
#'     export_as_CSV(res, "results_ActivePathways.csv")
#'}
export_as_CSV = function (res, file_name) {
   overlap_index <- which(grepl("overlap", colnames(res), fixed=TRUE))
   dataset_indices <- which(grepl("Genes_", colnames(res), fixed=TRUE))
   for (i in c(overlap_index, dataset_indices)){
      res[[i]] <- sapply(res[[i]], function(x) paste(x, collapse = "|"))
   }
   data.table::fwrite(res, file_name)	
}
Any scripts or data that you put into this service are public.
ActivePathways documentation built on Sept. 11, 2024, 8:45 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
ActivePathways
Integrative Pathway Enrichment Analysis of Multivariate Omics Data

R/ActivePathways.r
In ActivePathways: Integrative Pathway Enrichment Analysis of Multivariate Omics Data

Defines functions columnSignificance enrichmentAnalysis ActivePathways

Documented in ActivePathways columnSignificance enrichmentAnalysis

Try the ActivePathways package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

ActivePathways Integrative Pathway Enrichment Analysis of Multivariate Omics Data

R/ActivePathways.r In ActivePathways: Integrative Pathway Enrichment Analysis of Multivariate Omics Data

Defines functions columnSignificance enrichmentAnalysis ActivePathways

Documented in ActivePathways columnSignificance enrichmentAnalysis

Try the ActivePathways package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

ActivePathways
Integrative Pathway Enrichment Analysis of Multivariate Omics Data

R/ActivePathways.r
In ActivePathways: Integrative Pathway Enrichment Analysis of Multivariate Omics Data