R/gess_lincs.R

Defines functions randQuerySets rand_query_ES .lincsScores gess_lincs

Documented in gess_lincs rand_query_ES

#' @title GESS Methods
#' @rdname gess
#' @description
#' LINCS search method implements the Gene Expression Signature Search (GESS) from
#' Subramanian et al, 2017, here referred to as LINCS. The method uses as
#' query the two label sets of the most up- and down-regulated genes from a
#' genome-wide expression experiment, while the reference database is composed
#' of differential gene expression values (e.g. LFC or z-scores). Note, the
#' related CMAP method uses here ranks instead.
#' @details
#' Subramanian et al. (2017) introduced a more complex GESS algorithm,
#' here referred to as LINCS. While related to CMAP, there are several important
#' differences among the two approaches. First, LINCS weights the query genes
#' based on the corresponding differential expression scores of the GESs in the
#' reference database (e.g. LFC or z-scores). Thus, the reference database used
#' by LINCS needs to store the actual score values rather than their ranks.
#' Another relevant difference is that the LINCS algorithm uses a bi-directional
#' weighted Kolmogorov-Smirnov enrichment statistic (ES) as similarity metric.
#' @section Column description:
#' Descriptions of the columns in GESS result tables.
#' \itemize{
#'   \item pert: character, perturbagen (e.g. drugs) in the reference
#'   database. The treatment/column names of the reference database are
#'   organized as \code{pert__cell__trt_cp} format. The \code{pert} column in
#'   GESS result table contains what stored under the \code{pert} slot of the
#'   column names.
#'   \item cell: character, acronym of cell type
#'   \item type: character, perturbation type. In the CMAP/LINCS
#'   databases provided by \code{signatureSearchData}, the perturbation types
#'   are currently treatments with drug-like compounds (trt_cp). If required,
#'   users can build custom signature database with other types of
#'   perturbagens (e.g., gene knockdown or over-expression events) with the
#'   provided \code{\link{build_custom_db}} function.
#'   \item trend: character, up or down when the reference signature is
#'   positively or negatively connected with the query signature,
#'   respectively.
#'   \item N_upset: integer, number of genes in the query up set
#'   \item N_downset: integer, number of genes in the query down set
#'   \item WTCS: Weighted Connectivity Score, a bi-directional Enrichment
#'   Score for an up/down query set. If the ES values of an up set and a down
#'   set are of different signs, then WTCS is (ESup-ESdown)/2, otherwise,
#'   it is 0. WTCS values range from -1 to 1. They are positive or negative
#'   for signatures that are positively or inversely related, respectively,
#'   and close to zero for signatures that are unrelated.
#'   \item WTCS_Pval: Nominal p-value of WTCS computed by comparing WTCS
#'   against a null distribution of WTCS values obtained from a large number
#'   of random queries (e.g. 1000).
#'   \item WTCS_FDR: False discovery rate of WTCS_Pval.
#'   \item NCS: Normalized Connectivity Score. To make connectivity scores
#'   comparable across cell types and perturbation types,
#'   the scores are normalized. Given a vector of WTCS
#'   values w resulting from a query, the values are normalized within each
#'   cell line c and perturbagen type t to obtain NCS by dividing the WTCS
#'   value with the signed mean of the WTCS values within
#'   the subset of the signatures in the reference database corresponding to c
#'   and t.
#'   \item Tau: Enrichment score standardized for a given database.
#'   The Tau score compares an observed NCS to a large set of NCS values that
#'   have been pre-computed for a specific reference database. The query results
#'   are scored with Tau as a standardized measure ranging from 100 to -100.
#'   A Tau of 90 indicates that only 10% of reference perturbations exhibit
#'   stronger connectivity to the query. This way one can make more meaningful
#'   comparisons across query results.
#'
#'   Note, there are NAs in the Tau score column, the reason is that the number
#'   of signatures in \emph{Qref} that match the cell line of signature \emph{r}
#'   (the \code{TauRefSize} column in the GESS result) is less than 500,
#'   Tau will be set as NA since it is redeemed as there are not large enough
#'   samples for computing meaningful Tau scores.
#'
#'   \item TauRefSize: Size of reference perturbations for computing Tau.
#'   \item NCSct: NCS summarized across cell types. Given a vector of NCS values
#'   for perturbagen p, relative to query q, across all cell lines c in which p
#'   was profiled, a cell-summarized connectivity score is obtained using a
#'   maximum quantile statistic. It compares the 67 and 33 quantiles of
#'   NCSp,c and retains whichever is of higher absolute magnitude.
#'   \item cor_score: Correlation coefficient based on the method defined in
#'   the \code{gess_cor} function.
#'   \item raw_score: bi-directional enrichment score (Kolmogorov-Smirnov
#'   statistic) of up and down set in the query signature
#'   \item scaled_score: raw_score scaled to values from 1 to -1 by
#'   dividing the positive and negative scores with the maximum positive score
#'   and the absolute value of the minimum negative score, respectively.
#'   \item effect: Scaled bi-directional enrichment score corresponding to
#'   the scaled_score under the CMAP result.
#'   \item nSet: number of genes in the GES in the reference
#'   database (gene sets) after setting the higher and lower cutoff.
#'   \item nFound: number of genes in the GESs of the reference
#'   database (gene sets) that are also present in the query GES.
#'   \item signed: whether gene sets in the reference database have signs,
#'   representing up and down regulated genes when computing scores.
#'   \item pval: p-value of the Fisher's exact test.
#'   \item padj: p-value adjusted for multiple hypothesis testing using
#'   R's p.adjust function with the Benjamini & Hochberg (BH) method.
#'   \item effect: z-score based on the standard normal distribution.
#'   \item LOR: Log Odds Ratio.
#'   \item t_gn_sym: character, symbol of the gene encoding the
#'   corresponding drug target protein
#'   \item MOAss: character, compound MOA annotation from \code{signatureSearch}
#'   package
#'   \item PCIDss: character, compound PubChem CID annotation from
#'   \code{signatureSearch} package
#' }
#'
#' @param qSig \code{\link{qSig}} object defining the query signature including
#' the GESS method (should be 'LINCS') and the path to the reference database.
#' For details see help of \code{qSig} and \code{qSig-class}.
#' @param tau TRUE or FALSE, whether to compute the tau score. Note, TRUE is
#' only meaningful when the full LINCS database is searched, since accurate Tau
#' score calculation depends on the usage of the exact same database their
#' background values are based on.
#' @param sortby sort the GESS result table based on one of the following
#' statistics: `WTCS`, `NCS`, `Tau`, `NCSct` or `NA`
#' @param chunk_size number of database entries to process per iteration to
#' limit memory usage of search.
#' @param ref_trts character vector. If users want to search against a subset
#' of the reference database, they could set ref_trts as a character vector
#' representing column names (treatments) of the subsetted refdb.
#' @param workers integer(1) number of workers for searching the reference
#' database parallelly, default is 1.
#' @param method One of 'spearman' (default), 'kendall', or 'pearson',
#' indicating which correlation coefficient to use.
#'
#' @param higher The 'upper' threshold. If not 'NULL', genes with a score
#' larger than or equal to 'higher' will be included in the gene set with
#' sign +1. At least one of 'lower' and 'higher' must be specified.
#'
#' \code{higher} argument need to be set as \code{1} if the \code{refdb} in
#' \code{qSig} is path to the HDF5 file that were converted from the gmt file.
#'
#' @param lower The lower threshold. If not 'NULL', genes with a score smaller
#' than or equal 'lower' will be included in the gene set with sign -1.
#' At least one of 'lower' and 'higher' must be specified.
#'
#' \code{lower} argument need to be set as \code{NULL} if the \code{refdb} in
#' \code{qSig} is path to the HDF5 file that were converted from the gmt file.
#'
#' @param padj numeric(1), cutoff of adjusted p-value or false discovery rate
#' (FDR) of defining DEGs that is less than or equal to 'padj'. The 'padj'
#' argument is valid only if the reference HDF5 file contains the p-value
#' matrix stored in the dataset named as 'padj'.
#' @param addAnnotations Logical value.  If \code{TRUE} adds drug annotations to results.
#' 
#' @param GeneType A character value of either "reference" or a combination of 
#' "best inferred", "landmark" or "inferred" indicating which reference gene set 
#' query genes should be filtered against. While "reference" filters query genes 
#' against the reference database, "best inferred", "landmark" or 
#' "inferred" filter genes against LINCS gene spaces. 
#' 
#' @inheritParams addGESSannot
#'
#' 
#' @return \code{\link{gessResult}} object, the result table contains the
#' search results for each perturbagen in the reference database ranked by
#' their signature similarity to the query.
#' @import SummarizedExperiment
#' @seealso \code{\link{qSig}}, \code{\link{gessResult}},
#'          \code{\link{addGESSannot}}
#' @references For detailed description of the LINCS method and scores,
#' please refer to: Subramanian, A., Narayan, R., Corsello, S. M., Peck, D. D.,
#' Natoli, T. E., Lu, X., Golub, T. R. (2017). A Next Generation
#' Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell,
#' 171 (6), 1437-1452.e17. URL: https://doi.org/10.1016/j.cell.2017.10.049
#'
#' For detailed description of the CMap method, please refer to:
#' Lamb, J., Crawford, E. D., Peck, D., Modell, J. W., Blat, I. C.,
#' Wrobel, M. J., Golub, T. R. (2006). The Connectivity Map:
#' using gene-expression signatures to connect small molecules, genes, and
#' disease. Science, 313 (5795), 1929-1935.
#' URL: https://doi.org/10.1126/science.1132939
#'
#' Sandmann, T., Kummerfeld, S. K., Gentleman, R., & Bourgon, R.
#' (2014). gCMAP: user-friendly connectivity mapping with R. Bioinformatics ,
#' 30 (1), 127-128. URL: https://doi.org/10.1093/bioinformatics/btt592
#'
#' Graham J. G. Upton. 1992. Fisher's Exact Test. J. R. Stat. Soc. Ser. A
#' Stat. Soc. 155 (3). [Wiley, Royal Statistical Society]: 395-402.
#' URL: http://www.jstor.org/stable/2982890
#' @examples
#'
#' ############### LINCS method #############
#' # qsig_lincs <- qSig(query=list(
#' #                      upset=c("230", "5357", "2015", "2542", "1759"),
#' #                      downset=c("22864", "9338", "54793", "10384", "27000")),
#' #                    gess_method="LINCS", refdb=db_path)
#' # lincs <- gess_lincs(qsig_lincs, sortby="NCS", tau=FALSE)
#' # result(lincs)
#' @export
gess_lincs <- function(qSig, tau=FALSE, sortby="NCS",
                       chunk_size=5000, ref_trts=NULL, workers=1,
                       cmp_annot_tb=NULL, by="pert", cmp_name_col="pert",
                       GeneType = "reference",
                       addAnnotations = TRUE){
  if(!is(qSig, "qSig")) stop("The 'qSig' should be an object of 'qSig' class")
  if(gm(qSig) != "LINCS"){
    stop(paste("The 'gess_method' slot of 'qSig' should be 'LINCS'",
               "if using 'gess_lincs' function"))
  }
  upset <- qr(qSig)$upset
  downset <- qr(qSig)$downset
  db_path <- determine_refdb(refdb(qSig))
  res <- lincsEnrich(db_path, upset=upset, downset=downset,
                     tau=tau, sortby=sortby, chunk_size=chunk_size,
                     ref_trts=ref_trts, workers=workers, GeneType2=GeneType)
  # add compound annotations
  if(addAnnotations == TRUE){
  res <- addGESSannot(res, refdb(qSig), cmp_annot_tb =cmp_annot_tb[,!colnames(cmp_annot_tb) %in% "t_gn_sym"], by, cmp_name_col)
  } else {
    res <- as_tibble(res)
  }

  x <- gessResult(result = res,
                  query = qr(qSig),
                  gess_method = gm(qSig),
                  refdb = refdb(qSig))
  return(x)
}


lincsEnrich <- function (db_path, upset, downset, sortby = "NCS", type = 1, 
                         output = "all", tau = FALSE, minTauRefSize = 500, chunk_size = 5000, 
                         ref_trts = NULL, workers = 2, GeneType2 = "reference") {
  mycolnames <- c("WTCS", "NCS", "Tau", "NCSct", "N_upset", "N_downset", NA)
  if (!any(mycolnames %in% sortby)) 
    stop("Unsupported value assinged to sortby.")
  full_mat <- HDF5Array(db_path, "assay")
  rownames(full_mat) <- as.character(HDF5Array(db_path, "rownames"))
  colnames(full_mat) <- as.character(HDF5Array(db_path, "colnames"))
  if (!is.null(ref_trts)) {
    trts_valid <- trts_check(ref_trts, colnames(full_mat))
    full_mat <- full_mat[, trts_valid] }
  full_dim <- dim(full_mat)
  full_grid <- colAutoGrid(full_mat, ncol = min(chunk_size, ncol(full_mat)))
  nblock <- length(full_grid)
  ESout <- unlist(bplapply(seq_len(nblock), function(b) {
    ref_block <- read_block(full_mat, full_grid[[as.integer(b)]])
    mat <- ref_block
    if (length(upset) > 0 & length(downset) > 0) {
      ESup <- apply(mat, 2, function(x) .enrichScore(sigvec = sort(x,decreasing = TRUE), Q = upset, type = type, GeneType3 = GeneType2))
      ESdown <- apply(mat, 2, function(x) .enrichScore(sigvec = sort(x,decreasing = TRUE), Q = downset, type = type, GeneType3 = GeneType2))
      ESout1 <- ifelse(sign(ESup) != sign(ESdown), (ESup -ESdown)/2, 0)
    } else if (length(upset) > 0 & length(downset) == 0) {
      ESup <- apply(mat, 2, function(x) .enrichScore(sigvec = sort(x,decreasing = TRUE), Q = upset, type = type, GeneType3 = GeneType2))
      ESout1 <- ESup
    } else if (length(upset) == 0 & length(downset) > 0) {
      ESdown <- apply(mat, 2, function(x) .enrichScore(sigvec = sort(x,decreasing = TRUE), Q = downset, type = type, GeneType3 = GeneType2))
      ESout1 <- -ESdown
    }
  }, BPPARAM = MulticoreParam(workers = workers)))
  if (output == "esonly") {
    return(ESout)
  }
  if (output == "all") {
    resultDF <- .lincsScores(esout = ESout, upset = upset, downset = downset, minTauRefSize = minTauRefSize, tau = tau)
  }
  if (!is.na(sortby)) {
    resultDF <- resultDF[order(abs(resultDF[, sortby]), decreasing = TRUE), ]
  }
  else {
    resultDF <- resultDF
  }
  row.names(resultDF) <- NULL
  return(resultDF)
}

#' @importFrom utils read.delim
#' @importFrom stats quantile
.lincsScores <- function(esout, upset, downset, minTauRefSize, tau=FALSE) {
    ## P-value and FDR for WTCS based on ESnull from random queries where
    ## p-value = sum(ESrand > ES_obs)/Nrand

    # download ES_NULL.txt from AnnotationHub
    WTCSnull <- validLoad("EH3234")
    WTCSnull[WTCSnull[, "Freq"]==0, "Freq"] <- 1
    # Add pseudo count of 1 where Freq is zero
    myrounding <- max(nchar(as.character(WTCSnull[,"WTCS"]))) - 3
    # Three because of dot and minus sign
    es_round <- round(as.numeric(esout), myrounding)
    # Assures same rounding used for WTCSnull computation
    WTCS_pval <- vapply(es_round, function(x) {
      sum(WTCSnull[abs(WTCSnull[,"WTCS"]) > abs(x),"Freq"])/sum(WTCSnull[,"Freq"])
      }, FUN.VALUE = numeric(1))
    WTCS_fdr <- p.adjust(WTCS_pval, "fdr")
    ## Normalized connectivity score (NCS)
    grouping <- paste(gsub("^.*?__", "", names(esout)),
                      as.character(ifelse(esout > 0, "up", "down")), sep="__")
    es_na <- as.numeric(esout)
    es_na[es_na == 0] <- NA
    # eliminates zeros from mean calculation; zeros have high impact on NCS
    # values due to their high frequency
    groupmean <- tapply(es_na, grouping, mean, na.rm=TRUE)
    groupmean[is.na(groupmean)] <- 0
    # In case groups contain only zeros, NA/NaN values are introduced in mean
    # calculation which are reset to zeros here
    groupmean[groupmean==0] <- 10^-12
    # Set zeros (can be from non NAs) to small value to avoid devision by zero
    ncs <- as.numeric(esout) / abs(groupmean[grouping])
    # without abs() sign of neg values would switch to pos
    ## Tau calculation requires reference NCS lookup DB
    ## performs: sign(ncs_query) * 100/N sum(abs(ncs_ref) < abs(ncs_query))
    if(tau){
      # download taurefList.rds
      taurefList9264 <- validLoad("EH3233")

      ncs_query <- ncs; names(ncs_query) <- names(esout)
      queryDB_refDB_match <-
          unique(unlist(lapply(taurefList9264, rownames))) %in% names(ncs_query)
      if(!all(queryDB_refDB_match)) warning(
          paste0("QueryDB and tauRefDB differ by ",
          round(100 * sum(!queryDB_refDB_match)/length(queryDB_refDB_match),1),
          "% of their entries.",
          " Accurate tau computation requires close to 0% divergence. \n"))
      ncs_query_list <- split(ncs_query,
                              factor(gsub("^.*?__", "", names(ncs_query))))
      tau_score <- lapply(names(ncs_query_list), function(x) {
        tmpDF <- taurefList9264[[x]]
        ncs_query_match <- names(ncs_query_list[[x]])[names(ncs_query_list[[x]])
                                                      %in% rownames(tmpDF)]

        if(length(ncs_query_match)>0) {
          tmpDF <- tmpDF[ncs_query_match, , drop=FALSE]
          #### subset to the same length ##### rounded as in ref db
          sign(ncs_query_list[[x]][names(ncs_query_list[[x]]) %in% rownames(tmpDF)]) * 100/ncol(tmpDF) *
            rowSums(abs(tmpDF) < abs(round(ncs_query_list[[x]][names(ncs_query_list[[x]]) %in% rownames(tmpDF)], 2)))
        } else {
          NULL
        }
      })
      tau_score <- unlist(tau_score)
      tau_score <- tau_score[names(ncs_query)]
      tauRefSize <- vapply(taurefList9264, ncol,
                  FUN.VALUE = integer(1))[gsub("^.*?__", "", names(tau_score))]
      tau_score[tauRefSize < minTauRefSize] <- NA
      ## Add by YD
      rm(taurefList9264); gc()
    }
    ## Summary across cell lines (NCSct)
    ctgrouping <- gsub("__.*__", "__", names(esout))
    qmax <- tapply(ncs, ctgrouping, function(x) {
      q <- quantile(x, probs=c(0.33, 0.67))
      ifelse(abs(q[2]) >= abs(q[1]), q[2], q[1])
    })
    qmax <- qmax[ctgrouping]
    ## Organize result in data.frame
    new <- as.data.frame(t(vapply(seq_along(esout), function(i)
      unlist(strsplit(as.character(names(esout)[i]), "__")),
      FUN.VALUE = character(3))), stringsAsFactors=FALSE)
    colnames(new) <- c("pert", "cell", "type")

    if(tau){
      resultDF <- data.frame(
        new,
        trend = as.character(ifelse(esout > 0, "up", "down")),
        WTCS = as.numeric(esout),
        WTCS_Pval = WTCS_pval,
        WTCS_FDR = WTCS_fdr,
        NCS = ncs,
        Tau = tau_score,
        TauRefSize=tauRefSize,
        NCSct = qmax,
        N_upset = length(upset),
        N_downset = length(downset), stringsAsFactors = FALSE)
    } else {
      resultDF <- data.frame(
        new,
        trend = as.character(ifelse(esout > 0, "up", "down")),
        WTCS = as.numeric(esout),
        WTCS_Pval = WTCS_pval,
        WTCS_FDR = WTCS_fdr,
        NCS = ncs,
        NCSct = qmax,
        N_upset = length(upset),
        N_downset = length(downset), stringsAsFactors = FALSE)
    }
    row.names(resultDF) <- NULL
    return(resultDF)
}


## Define enrichment function according to Subramanian et al, 2005
## Note: query corresponds to gene set, here Q.
#' @importFrom readr read_tsv
.enrichScore <- function (sigvec, Q, type, GeneType3 = "reference"){
  if(GeneType3[1] == "reference"){ 
    sigvec2 <- sigvec
  } else{
    Lspace <- suppressMessages(read_tsv(system.file("extdata", "LINCSGeneSpaceSub.txt", package="signatureSearch")))
    Lt <- as.character(Lspace[Lspace$Type %in% GeneType3,]$`Entrez ID`)
    sigvec2 <- sigvec[names(sigvec) %in% Lt]
  }
  L <- names(sigvec2)
  R <- as.numeric(sigvec2)
  N <- length(L)
  NH <- length(Q)
  Ns <- N - NH
  hit_index <- as.numeric(L %in% Q)
  miss_index <- 1 - hit_index
  R <- abs(R^type)
  NR <- sum(R[hit_index == 1]) #### Adjusted here , na.rm = TRUE
  if (NR == 0)
    return(0)
  ESvec <- cumsum((hit_index * R * 1/NR) - (miss_index * 1/Ns))
  ES <- ESvec[which.max(abs(ESvec))]
  return(ES)
}

#' Function computes null distribution of Weighted Connectivity Scores (WTCS)
#' used by the LINCS GESS method for computing nominal P-values.
#'
#' @title Generate WTCS Null Distribution with Random Queries
#' @param h5file character(1), path to the HDF5 file representing the
#' reference database
#' @param N_queries number of random queries
#' @param dest path to the output file (e.g. "ES_NULL.txt")
#' @param write Logical value indicating if results should be written to dest.
#' @return File with path assigned to \code{dest}
#' @importFrom utils write.table
#' @examples
#' db_path = system.file("extdata", "sample_db.h5", package="signatureSearch")
#' rand <- rand_query_ES(h5file=db_path, N_queries=5, dest="ES_NULL.txt", write=FALSE)
#' unlink("ES_NULL.txt")
#' @seealso \code{\link{gess_lincs}}
#' @references
#' Subramanian, A., Narayan, R., Corsello, S. M., Peck, D. D., Natoli, T. E.,
#' Lu, X., Golub, T. R. (2017). A Next Generation Connectivity Map: L1000
#' Platform and the First 1,000,000 Profiles. Cell, 171 (6), 1437-1452.e17.
#' URL: https://doi.org/10.1016/j.cell.2017.10.049
#' @export

rand_query_ES <- function(h5file, N_queries=1000, dest, write = TRUE) {
  ## Create list of random queries
  idnames <- drop(h5read(h5file, "rownames"))
  query_list <- randQuerySets(id_names=idnames, N_queries=N_queries,
                              set_length=150)
  ## Define vapply function
  f <- function(x, query_list, h5file) {
      esout <- lincsEnrich(h5file, upset=query_list[[x]]$up,
                           downset=query_list[[x]]$down, sortby=NA,
                           output="esonly", type=1)
      names(esout) <- drop(h5read(h5file, "colnames"))
      # message("Random query ", sprintf("%04d", x),
      #         " has been searched against reference database")
      wtcs <- esout
  }
  myMA <- vapply(seq(along=query_list), f, query_list, h5file,
                 FUN.VALUE=double(length(drop(h5read(h5file, "colnames")))))
  colnames(myMA) <- names(query_list)
  ## Collect results in frequency table with 3 diget accuracy
  esMA <- data.frame(WTCS=as.character(round(rev(seq(-1, 1, by=0.001)), 3)),
                     Freq=0, stringsAsFactors=FALSE)
  freq <- table(round(as.numeric(as.matrix(myMA),3),3))
  ## processes entire myMA data.frame
  freq <- freq[as.character(esMA[,1])]
  freq[is.na(freq)] <- 0
  esMA[,"Freq"] <- as.numeric(esMA[,"Freq"]) + as.numeric(freq)
  if(write){
  write.table(esMA, file=dest, quote=FALSE, row.names=FALSE, sep="\t")
  } 
  return(esMA)
}

randQuerySets <- function(id_names, N_queries, set_length=150) {
    randset_names <- paste0("randset_", sprintf("%09d", seq_len(N_queries)))
    rand_query_list <- lapply(randset_names, function(x) {
        id_list <- sample(id_names, 2 * set_length)
        split(id_list, rep(c("up", "down"), each=set_length))
    })
    names(rand_query_list) <- randset_names
    return(rand_query_list)
}
girke-lab/signatureSearch documentation built on Feb. 21, 2024, 8:32 a.m.