PANTS: Pathway analysis via network smoothing

Documented in pants

#' Pathway analysis via network smoothing (Pants)
#' 
#' Pants algorithm to test if scores of features (i.e. analytes such as a gene, protein, or metabolite)
#' in a pathway or those connected to the pathway in an interaction network are greater than randomized ones.
#' Allows for testing group differences (\code{\link[ezlimma]{limma_contrasts}}) with \code{contrast.v} & \code{design};
#' correlation (\code{\link[ezlimma]{limma_cor}}) with \code{design}; or 
#' mediation (\code{\link[ezlimma]{hitman}}) with \code{exposure} & \code{covariates}.
#' 
#' @param Gmat Binary feature (e.g. gene) by pathway inclusion matrix, indicating which features are in which pathways.
#' @param phenotype Vector of sample characteristics (correlation: numeric; contrasts: character). 
#' Should be same length as \code{ncol(object)}.
#' @param type Type of \pkg{ezlimma} analysis per feauture; must be one of"contrasts" 
#' (\code{\link[ezlimma]{limma_contrasts}}), "correlation" (\code{\link[ezlimma]{limma_cor}}), 
#' or "mediation" (\code{\link[ezlimma]{hitman}}). You can specify just the initial letter.
#' @param exposure Numeric vector or matrix of exposures. Ignored if \code{type!="mediation"}.
#' @param ker Laplacian kernel matrix representing the interaction network.
#' @param score_fcn Function that transforms the t-statistics from the contrasts into a non-negative value. 
#' Its input must be a vector of same length as number of elements in \code{contrast.v} (usually one). 
#' Its output must be a non-negative scalar. Ignored if \code{hitman} is \code{TRUE}.
#' @param annot.df Table of feature annotations that are appended to feature statistics.
#' @param ntop Number of top features that most impact a pathway to include.
#' @param nperm Number of sample permutations to evaluate significance of pathways.
#' @param ret.pwy.dfs Logical; return list of data frames written out to CSVs? 
#' @param ret.null.mats Logical; return matrices with null distributions for features and pathways?
#' @param ncores Integer. If > 1, number of cores to use for parallel computing. 
#' You can detect how many are available for your system using \code{\link[parallel]{detectCores}}.
#' @param seed Integer seed to set for reproducility.
#' @inheritParams ezlimma::limma_contrasts
#' @inheritParams ezlimma::limma_cor
#' @inheritParams ezlimma::roast_contrasts
#' @inheritParams ezlimma::hitman
#' @details Without \code{mediation}, \code{phenotype}'s are permuted, since this properly permutes the \code{object} to 
#' \code{phenotype} mapping. \code{object} could be equivalently permuted. 
#' With \code{mediation},  because \code{object} is tested for its association to both \code{phenotype} 
#' and \code{exposure}, \code{colnames(object)} are permuted, which offers more available permutations.
#' 
#' Scores for features in the kernel but not in the data are assigned a score of zero by default for sparsity.
#' Scores for features and pathways are compared to null scores, which are generated by permuting the columns of 
#' \code{object} and rerunning the algorithm. These are the stats returned in \code{feature.stats}.
#' 
#' For \code{\link[parallel]{makeCluster}}, the cluster \code{type} depends on the OS, which is tested in the body
#' of the function using \code{.Platform$OS.type}.
#' 
#' If \code{!is.na(name)}, an Excel file with "_pants.xlsx" appended to \code{name} gets written out with links to CSVs 
#' containing the statistics and annotation of features most affecting the pathway's score. The annotation (and possibly 
#' other statistics) are from \code{annot.df}. Additionally, the CSVs contain whether each 
#' feature is in the pathway, and an \code{impact} column describing the impact of each feature on the pathway's 
#' score. Since a pathway's score is calculated in \code{pants}, \code{impact} uses the feature statistics calculated 
#' in \code{pants} by comparing to permutation. The feature statistics from \pkg{ezlimma} and those from 
#' \code{pants} are nearly identical, though; the main difference is that \code{pants} feature significances are limited 
#' by the number of permutations, so they flatten near the extreme. The features with the largest magnitude impact score
#' are selected and can be visualized with \code{ezlimmaplot::plot_pwy}. These features may increase or decrease a 
#' pathway's score.
#' 
#' @return List of at least two data frames:
#' \describe{
#'    \item{\code{pwy.stats}}{A data frame with columns 
#'    \describe{
#'    \item{\code{nfeatures}}{number of features in the pathway.}
#'    \item{\code{score}}{only returned if \code{ret.null.mats} is \code{TRUE}; 
#'    pathway score (larger is more significant) to compare to \code{null.pwy.mat}}
#'    \item{\code{z}}{pathway permutation z-score (larger is more significant)}
#'    \item{\code{p}}{pathway permutation p-value}
#'    \item{\code{FDR}}{pathway FDR calculated from p-values with \code{p.adjust(p, method="BH")}}
#'    }}
#'    
#'    \item{\code{feature.stats}}{A data frame with columns
#'    \describe{
#'    \item{\code{score}}{without \code{mediation}, feature's score from applying \code{score_fcn} to moderated t-statistics;
#'    or with \code{mediation}, parametric z-score from \code{hitman}'s mediation p-value.}
#'    \item{\code{z}}{feature non-parametric z-score (larger is more significant) from comparing \code{score} vs. 
#'    this feature's scores in permutations (before smoothing)}
#'    \item{\code{p}}{feature's non-parametric permutation p-value}
#'    \item{\code{FDR}}{feature's non-parametric FDR from permutation \code{p}}
#'    }}
#'    
#'    if \code{ret.pwy.dfs} is \code{TRUE}:
#'    \item{\code{pwy.dfs}}{List of data frames written out to CSVs}
#'    
#'    And if \code{ret.null.mats} is \code{TRUE}:
#'    \item{\code{null.feature.mat}}{Matrix with features as rows and permutations as columns, where each element represents
#'    the score of that feature in that permutation}
#'    \item{\code{null.pwy.mat}}{Matrix with pathways as rows and permutations as columns, where each element represents
#'    the score of that pathway in that permutation}
#'    \item{\code{sample.perms}}{Matrix with samples as rows and permutations as columns, where each element represents
#'    the index of the sample simulated to represent the sample in the row in that permutation}
#'  }
#' @examples
#' # A workflow is described in the vignette; instructions to view the vignette are in the README.
#' @importFrom utils methods
#' @importFrom zeallot %<-%
#' @export

pants <- function(object, Gmat, phenotype=NULL, type=c("contrasts", "correlation", "mediation"),
                  contrast.v=NULL, design=NULL, exposure=NULL, covariates=NULL, 
                  ker=NULL, annot.df=NULL, ntop=25, score_fcn=abs, nperm=10^4-1, 
                  ret.pwy.dfs=FALSE, ret.null.mats=FALSE, min.nfeats=3, ncores=1, name=NA, seed=0){
  type <- match.arg(type)
  if (is.null(ker)){
    ker <- diag_kernel(object.rownames = rownames(object), Gmat.rownames = rownames(Gmat))
  }
  stopifnot(length(intersect(rownames(ker), rownames(object))) > 0, any(rownames(Gmat) %in% colnames(ker)), 
            colnames(object)==names(phenotype), is.null(annot.df) || any(rownames(annot.df) %in% rownames(object)),
            is.numeric(ncores), is.numeric(min.nfeats), is.numeric(ntop))
  
  c(Gmat, nfeats.per.pwy) %<-% subset_gmat(object=object, Gmat=Gmat, min.nfeats=min.nfeats)

  if (type=="mediation"){
    if (ncol(as.matrix(exposure))==1){
      stopifnot(colnames(object)==names(exposure))
    } else {
      stopifnot(colnames(object)==rownames(exposure))
    }
    hm <- ezlimma::hitman(E=exposure, M=object, Y=phenotype, covariates=covariates)
    # transform to one-sided z-score
    score.v <- stats::qnorm(p=hm[rownames(object), "EMY.p"], lower.tail = FALSE)
  } else {
    score.v <- score_features(object=object, phenotype=phenotype, type=type, contrast.v=contrast.v,
                              design=design, score_fcn=score_fcn)
  }
  
  if (ncores > 1){
    # PSOCK works without parallel::clusterExport
    cl.type <- ifelse(.Platform$OS.type=="windows", "PSOCK", "FORK")
    cl <- parallel::makeCluster(spec=ncores, type=cl.type)
  }
  
  set.seed(seed)
  try({
    if (type=="mediation"){
      sample.perms <- ezpermutations(xx=colnames(object), nperm=nperm)
      med.fun <- function(cn.perm){
        # permute object
        # must set permuted names to NULL st limma_contrasts doesn't complain that they clash with colnames(object)
        object.tmp <- object[,cn.perm]
        # to avoid names error in stopifnot
        colnames(object.tmp) <- colnames(object)
        # permuted hm result
        hm.tmp <- ezlimma::hitman(E=exposure, M=object.tmp, Y=phenotype, covariates=covariates)
        # return z-scores from permuted object
        stats::qnorm(p=hm.tmp[rownames(object), "EMY.p"], lower.tail = FALSE)
      }
      score.mat <- sapply_by_ncores(ncores=ncores, X=sample.perms, FUN=med.fun, cl=cl)
    } else {
      # perms <- lapply(seq_len(nperm), function(i) sample.int(length(phenotype)))
      sample.perms <- ezpermutations(xx=phenotype, nperm=nperm)
      contr.fun <- function(ph.perm){
        # must set permuted names to NULL st limma_contrasts doesn't complain thay they clash with colnames(object)
        # pheno.tmp <- stats::setNames(ph.perm, nm=NULL)
        score_features(object=object, phenotype=ph.perm, type=type, contrast.v=contrast.v, 
                       design=design, score_fcn=score_fcn)
      }
      score.mat <- sapply_by_ncores(ncores=ncores, X=sample.perms, FUN=contr.fun, cl=cl) 
    }
  }) # end try
  if (ncores > 1) parallel::stopCluster(cl=cl)
  dimnames(score.mat) <- list(rownames(object), paste0("perm", 1:ncol(score.mat)))
  
  # could use zeallot, but would still need temporary score.mat.comb
  # match_mats is fast
  mm <- match_mats(score.mat = cbind(v=score.v, score.mat), ker=ker, Gmat=Gmat)
  score.mat <- mm$score.mat[,-1]; score.v <- mm$score.mat[,1]; ker <- mm$ker; Gmat <- mm$Gmat
  rm(mm) #to save memory
  if (any(names(score.v) != rownames(ker) | rownames(ker) != rownames(Gmat))){
    stop("Internal function match_mats failed.", call. = FALSE)
  }

  # feature p-values (for plotting)
  # features in object & in kernel
  feature.stats <- data.frame(score = score.v, 
    matrix(NA, nrow=length(score.v), ncol=3, dimnames=list(rownames(score.mat), c("z", "p", "FDR"))))
  # need to coerce score.mat to matrix to prevent rowSums error
  feature.stats[,c("z", "p")] <- p_ecdf(eval.v=score.v, score.mat = as.matrix(score.mat), alternative = "greater")
  feature.stats[,"FDR"] <- stats::p.adjust(feature.stats[,"p"], method="BH")

  # need to compare to pwys, sometimes runs out of memory
  # use scores, not z-scores, since used to calc pwy.v/nfeats.per.pwy
  pwy.v <- (score.v %*% ker %*% Gmat)[1,]
  # null pwy scores
  pwy.mat <- as.matrix(Matrix::t(Matrix::crossprod(score.mat, ker) %*% Gmat))

  # feat.score.avg=pwy.v/nfeats.per.pwy is confusing and not helpful enough to include
  pwy.stats <- data.frame(nfeatures=nfeats.per.pwy, score=pwy.v, z=NA, p=NA)
  rownames(pwy.stats) <- colnames(Gmat)
  pwy.stats[,c("z", "p")] <- p_ecdf(eval.v=pwy.v, score.mat=pwy.mat, alternative = "greater")
  pwy.stats$FDR <- stats::p.adjust(pwy.stats$p, method="BH")
  pwy.stats <- pwy.stats[order(pwy.stats$p, -pwy.stats$nfeatures),]

  # return res
  res <- list(pwy.stats=pwy.stats, feature.stats=feature.stats)
  if (ret.null.mats){
    res$null.feature.mat <- as.matrix(score.mat)
    res$null.pwy.mat <- as.matrix(pwy.mat)
    res$sample.perms <- simplify2array(sample.perms)
    dimnames(res$sample.perms) <- list(colnames(object), dimnames(score.mat)[[2]])
  } else {
    # only include "score" if ret.null.mats
    res$pwy.stats <- res$pwy.stats[,setdiff(colnames(res$pwy.stats), "score")]
  }
  
  # compute impact & write xlsx file with links
  if (is.null(annot.df)){
    feat.df <- feature.stats
  } else {
    feat.df <- data.frame(feature.stats, annot.df[rownames(feature.stats),])
  }
  if (!is.na(name)){
    if (type=="mediation") name <- paste0(name, "_pants_hitman") else name <- paste0(name, "_pants")
  }
  wpx <- write_pants_xl(zscores=feature.stats[, "z", drop=FALSE], pwy.tab=res$pwy.stats, feat.tab=feat.df, 
                        Gmat=Gmat, ker=ker, name=name, ntop=ntop)
  if (ret.pwy.dfs) res <- c(res, pwy.dfs=list(wpx$pwy.csvs))
  return(res)
}