R/testDA_GLMM.R

Defines functions testDA_GLMM

Documented in testDA_GLMM

#' Test for differential abundance: method 'diffcyt-DA-GLMM'
#' 
#' Calculate tests for differential abundance of cell populations using method
#' 'diffcyt-DA-GLMM'
#' 
#' Calculates tests for differential abundance of clusters, using generalized linear mixed
#' models (GLMMs).
#' 
#' This methodology was originally developed and described by Nowicka et al. (2017),
#' \emph{F1000Research}, and has been modified here to make use of high-resolution
#' clustering to enable investigation of rare cell populations. Note that unlike the
#' original method by Nowicka et al., we do not attempt to manually merge clusters into
#' canonical cell populations. Instead, results are reported at the high-resolution
#' cluster level, and the interpretation of significant differential clusters is left to
#' the user via visualizations such as heatmaps (see the package vignette for an example).
#' 
#' This method fits generalized linear mixed models (GLMMs) for each cluster, and
#' calculates differential tests separately for each cluster. The response variables in
#' the models are the cluster cell counts, which are assumed to follow a binomial
#' distribution. There is one model per cluster. We also include a filtering step to
#' remove clusters with very small numbers of cells, to improve statistical power.
#' 
#' For more details on the statistical methodology, see Nowicka et al. (2017),
#' \emph{F1000Research} (section 'Differential cell population abundance'.)
#' 
#' The experimental design must be specified using a model formula, which can be created
#' with \code{\link{createFormula}}. Flexible experimental designs are possible, including
#' blocking (e.g. paired designs), batch effects, and continuous covariates. Blocking
#' variables can be included as either random intercept terms or fixed effect terms (see
#' \code{\link{createFormula}}). For paired designs, we recommend using random intercept
#' terms to improve statistical power; see Nowicka et al. (2017), \emph{F1000Research} for
#' details. Batch effects and continuous covariates should be included as fixed effects.
#' In addition, we include random intercept terms for each sample to account for
#' overdispersion typically seen in high-dimensional cytometry count data. The
#' sample-level random intercept terms are known as 'observation-level random effects'
#' (OLREs); see Nowicka et al. (2017), \emph{F1000Research} for more details.
#' 
#' The contrast matrix specifying the contrast of interest can be created with
#' \code{\link{createContrast}}. See \code{\link{createContrast}} for more details.
#' 
#' Filtering: Clusters are kept for differential testing if they have at least
#' \code{min_cells} cells in at least \code{min_samples} samples. This removes clusters
#' with very low cell counts across conditions, to improve power.
#' 
#' Normalization: Optional normalization factors can be included to adjust for composition
#' effects in the cluster cell counts per sample. For example, in an extreme case, if
#' several additional clusters are present in only one condition, while all other clusters
#' are approximately equally abundant between conditions, then simply normalizing by the
#' total number of cells per sample will create a false positive differential abundance
#' signal for the non-differential clusters. (For a detailed explanation in the context of
#' RNA sequencing gene expression, see Robinson and Oshlack, 2010.) Normalization factors
#' can be calculated automatically using the 'trimmed mean of M-values' (TMM) method
#' (Robinson and Oshlack, 2010), implemented in the \code{edgeR} package (see also the
#' \code{edgeR} User's Guide for details). Alternatively, a vector of values can be
#' provided (the values should multiply to 1).
#' 
#' 
#' @param d_counts \code{\link{SummarizedExperiment}} object containing cluster cell
#'   counts, from \code{\link{calcCounts}}.
#' 
#' @param formula Model formula object, created with \code{\link{createFormula}}. This
#'   should be a list containing three elements: \code{formula}, \code{data}, and
#'   \code{random_terms}: the model formula, data frame of corresponding variables, and
#'   variable indicating whether the model formula contains any random effect terms. See
#'   \code{\link{createFormula}} for details.
#' 
#' @param contrast Contrast matrix, created with \code{\link{createContrast}}. See
#'   \code{\link{createContrast}} for details.
#' 
#' @param min_cells Filtering parameter. Default = 3. Clusters are kept for differential
#'   testing if they have at least \code{min_cells} cells in at least \code{min_samples}
#'   samples.
#' 
#' @param min_samples Filtering parameter. Default = \code{number of samples / 2}, which
#'   is appropriate for two-group comparisons (of equal size). Clusters are kept for
#'   differential testing if they have at least \code{min_cells} cells in at least
#'   \code{min_samples} samples.
#' 
#' @param normalize Whether to include optional normalization factors to adjust for
#'   composition effects (see details). Default = FALSE.
#' 
#' @param norm_factors Normalization factors to use, if \code{normalize = TRUE}. Default =
#'   \code{"TMM"}, in which case normalization factors are calculated automatically using
#'   the 'trimmed mean of M-values' (TMM) method from the \code{edgeR} package.
#'   Alternatively, a vector of values can be provided (the values should multiply to 1).
#' 
#' 
#' @return Returns a new \code{\link{SummarizedExperiment}} object, with differential test
#'   results stored in the \code{rowData} slot. Results include raw p-values
#'   (\code{p_val}) and adjusted p-values (\code{p_adj}), which can be used to rank
#'   clusters by evidence for differential abundance. The results can be accessed with the
#'   \code{\link{rowData}} accessor function.
#' 
#' 
#' @importFrom SummarizedExperiment assays rowData 'rowData<-' colData 'colData<-'
#' @importFrom edgeR calcNormFactors
#' @importFrom lme4 glmer
#' @importFrom multcomp glht
#' @importFrom methods as is
#' @importFrom stats p.adjust
#' 
#' @export
#' 
#' @examples
#' # For a complete workflow example demonstrating each step in the 'diffcyt' pipeline, 
#' # see the package vignette.
#' 
#' # Function to create random data (one sample)
#' d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) {
#'   d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor
#'   colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol))
#'   d
#' }
#' 
#' # Create random data (without differential signal)
#' set.seed(123)
#' d_input <- list(
#'   sample1 = d_random(), 
#'   sample2 = d_random(), 
#'   sample3 = d_random(), 
#'   sample4 = d_random()
#' )
#' 
#' # Add differential abundance (DA) signal
#' ix_DA <- 801:900
#' ix_cols_type <- 1:10
#' d_input[[3]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10)
#' d_input[[4]][ix_DA, ix_cols_type] <- d_random(n = 1000, mean = 2, ncol = 10)
#' 
#' experiment_info <- data.frame(
#'   sample_id = factor(paste0("sample", 1:4)), 
#'   group_id = factor(c("group1", "group1", "group2", "group2")), 
#'   stringsAsFactors = FALSE
#' )
#' 
#' marker_info <- data.frame(
#'   channel_name = paste0("channel", sprintf("%03d", 1:20)), 
#'   marker_name = paste0("marker", sprintf("%02d", 1:20)), 
#'   marker_class = factor(c(rep("type", 10), rep("state", 10)), 
#'                         levels = c("type", "state", "none")), 
#'   stringsAsFactors = FALSE
#' )
#' 
#' # Prepare data
#' d_se <- prepareData(d_input, experiment_info, marker_info)
#' 
#' # Transform data
#' d_se <- transformData(d_se)
#' 
#' # Generate clusters
#' d_se <- generateClusters(d_se)
#' 
#' # Calculate counts
#' d_counts <- calcCounts(d_se)
#' 
#' # Create model formula
#' formula <- createFormula(experiment_info, cols_fixed = "group_id", cols_random = "sample_id")
#' 
#' # Create contrast matrix
#' contrast <- createContrast(c(0, 1))
#' 
#' # Test for differential abundance (DA) of clusters
#' res_DA <- testDA_GLMM(d_counts, formula, contrast)
#' 
testDA_GLMM <- function(d_counts, formula, contrast, 
                        min_cells = 3, min_samples = NULL, 
                        normalize = FALSE, norm_factors = "TMM") {
  
  if (is.null(min_samples)) {
    min_samples <- ncol(d_counts) / 2
  }
  
  counts <- assays(d_counts)[["counts"]]
  cluster_id <- rowData(d_counts)$cluster_id
  
  # filtering: keep clusters with at least 'min_cells' cells in at least 'min_samples' samples
  tf <- counts >= min_cells
  ix_keep <- apply(tf, 1, function(r) sum(r) >= min_samples)
  
  counts <- counts[ix_keep, , drop = FALSE]
  cluster_id <- cluster_id[ix_keep]
  
  # total cell counts per sample (after filtering) (for weights in model fitting)
  
  # normalization factors
  if (normalize & norm_factors == "TMM") {
    norm_factors <- calcNormFactors(counts, method = "TMM")
  }
  
  if (normalize) {
    n_cells_smp <- colSums(counts) / norm_factors
  } else {
    n_cells_smp <- colSums(counts)
  }
  
  # GLMM testing pipeline
  
  # transpose contrast matrix if created with 'createContrast' (required by 'glht')
  if (ncol(contrast) == 1 & nrow(contrast) > 1) {
    contrast <- t(contrast)
  }
  
  # fit models: separate model for each cluster
  
  p_vals <- rep(NA, length(cluster_id))
  
  for (i in seq_along(cluster_id)) {
    p_vals[i] <- tryCatch({
      # data for cluster i
      # note: divide by total number of cells per sample (after filtering) to get
      # proportions instead of counts
      y <- counts[i, ] / n_cells_smp
      data_i <- cbind(y, n_cells_smp, formula$data)
      # fit model
      # note: provide proportions (y) together with weights for total number of cells per
      # sample (n_cells_smp); this is equivalent to providing counts
      fit <- glmer(formula$formula, data = data_i, family = "binomial", weights = n_cells_smp)
      # test contrast
      test <- glht(fit, contrast)
      # return p-value
      summary(test)$test$pvalues
      # return NA as p-value if there is an error
    }, error = function(e) NA)
  }
  
  # adjusted p-values (false discovery rates)
  p_adj <- p.adjust(p_vals, method = "fdr")
  
  stopifnot(length(p_vals) == length(p_adj))
  
  # return results in 'rowData' of new 'SummarizedExperiment' object
  
  out <- data.frame(p_val = p_vals, p_adj = p_adj, stringsAsFactors = FALSE)
  
  # fill in any missing rows (filtered clusters) with NAs
  row_data <- as.data.frame(matrix(as.numeric(NA), nrow = nlevels(cluster_id), ncol = ncol(out)))
  colnames(row_data) <- colnames(out)
  cluster_id_nm <- as.numeric(cluster_id)
  row_data[cluster_id_nm, ] <- out
  
  row_data <- cbind(cluster_id = rowData(d_counts)$cluster_id, row_data)
  
  res <- d_counts
  
  rowData(res) <- row_data
  
  # return normalization factors in 'metadata'
  if (normalize) {
    metadata(res)$norm_factors <- norm_factors
  }
  
  res
}
lmweber/diffcyt documentation built on March 16, 2021, 4:43 p.m.