knitr::opts_chunk$set(
  tidy = FALSE, cache = FALSE, dev = "png", collapse = TRUE, comment = "#>",
  message = FALSE, error = FALSE, warning = TRUE
)
options(scipen = 1, digits = 2)

Default Workflow

DECAL (Differential Expression analysis of Clonal Alterations Local effects) provide you with tools to conduct differential expression analysis of single-cell perturbations experiments to potentially interacting genes. Similar to other differential expression analysis tools, it models gene expression using a Negative Binomial (or Gamma-Poisson) regression, modeling each gene UMI count by the cell alteration status and cell total count.

Features

Installation

To install decal package current version, open your R terminal and type:

## Install remotes if not available with
## install.install.packages("remotes")
remotes::install_github("mauranolab/decal")

Using decal

The package main function (decal) runs the whole statistical analysis by fitting a Negative Binomial (or Gamma-Poisson) regression to each gene and alteration pair specified and evaluate the perturbation statistical significance for a single-cell perturbation experiment. You can run decal as follow:

decal(perturbations, count, clone)

The three parameters required are the following:

decal returns a deep copy of the perturbations table with the following additional columns:

Quick Start

Below we include a example of decal analysis on a simulated dataset included with our package.

library(decal)

data("sim_decal")
perturbations <- sim_decal$perturbations
count <- sim_decal$count
clone <- sim_decal$clone

res <- decal(perturbations, count, clone)
head(res)

Exploring decal results

Now let's explore decal results. Admitting a 5% false discovery rate (FDR), we can select significantly differentiate perturbations by picking those with p_adjusted < 0.05. Below we present the top 5 interactions with highest reducing and increasing log fold-change effect (as measured by lfc).

sig <- subset(res, p_adjusted < 0.05)
sig <- sig[order(sig$lfc), ]
## select first 5 and last 5 since`sig` is ordered by lfc
pick <- c(1:5, nrow(sig) - 1:5)
sig[pick,]

To visualize the perturbation effect, let's plot the top 3 interactions increasing and reducing gene expression.

## compute a normalize size-factor
sf <- mean(colSums(count)) / colSums(count)
## add log10(x+1) transformation function
log10p <- function(x) log10(x + 1)

## pick tables first 3 and last 3
pick <- c(1:3, nrow(sig) - 1:3)

par(mfrow = c(2, 3))
for (i in pick) {
  ## Determinate perturbed and unperturbed cells
  x <- colnames(count) %in% clone[[sig$clone[i]]]
  ## Calculate normalized expression count
  y <- count[sig$gene[i],] * sf
  ## plot jitter
  title <- paste0(sig$clone[i], "+", sig$gene[i],"\nLFC: ", round(sig$lfc[i], 2))
  plot(
    x + runif(length(x), -.3, .3), log10p(y), main = title,
    axes = FALSE, xlab = "", ylab = "Normalize UMI count",
    xlim = c(-.5, 1.5), ylim = c(0.9, 1.1) * range(log10p(y))
  )
  axis(1, at = 0:1, labels = c("perturbed", "unperturbed"), las = 2)
  axis(2, at = log10p(c(0, 2, 5, 10, 20, 50, 100)), labels = c(0, 2, 5, 10, 20, 50, 100))
  ## add 95% error bar
  ymed <- c(sig$mu[i], sig$xb[i])
  ylow <- qnbinom(0.025, size = sig$theta[i], mu = ymed)
  yupr <- qnbinom(0.975, size = sig$theta[i], mu = ymed)
  points(c(0, 1), log10p(ymed), col = "firebrick", cex = 2, pch = 16)
  arrows(
    x0 = c(0, 1), y0 = log10p(ylow), x1 = c(0, 1), y1 = log10p(yupr),
    code = 3, angle = 90, col = "firebrick", length = .1, lwd = 2
  )
}

Evaluating decal results

In our sim_decal simulated dataset, we introduced some real expression perturbations as indicated by expected_lfc column in perturbations table, such as a 0 expected_lfc indicate no change was applied. We can evaluate our results using the confusion matrix below:

conf_mat <- table(real = res$expected_lfc != 0, decal = res$p_adjusted < 0.05)
round(prop.table(conf_mat), 3)

Based on this confusion matrix, our decal model presented:

These results show great metrics with recall, accuracy, precision and F1 score above 95% and a false discovery ratio within our admitted 5%.

Next, let's explore how well decal estimated the real log2 fold-change (LFC) applied. Illustrated below is the decal estimated LFC (lfc column in res) distribution for each of the real perturbations values applied (expected_lfc column).

lfc_histogram <- function(lfc, fn = function(x) { x$count / sum(x$count) }) {
  hist_breaks <- c(-Inf, seq(-4, 4, length.out = 50), Inf)
  hist_counts <- lapply(lfc, hist, breaks = hist_breaks, plot = FALSE)
  hist_ymax <- max(sapply(hist_counts, fn))
  plot(c(-4, 4), c(0, hist_ymax), type = "n",
    xlab = "Estimated LFC", ylab = "Proportion of interactions")
  for(i in seq_along(hist_counts)) {
    lines(hist_counts[[i]]$mids, fn(hist_counts[[i]]), col = i, type = "s")
  }
  legend("topleft", title = "Real LFC", legend = names(hist_counts),
    col = seq_along(hist_counts), bty = "n", lty = 1, lwd = 2, ncol = 2)
}

lfc_histogram(split(res$lfc, res$expected_lfc))

Finally, we can measure decal LFC estimative Mean Squared Error (MSE) and Rooted MSE (RMSE) using the code below. The results indicate that decal estimates presented high accuracy and precision with RMSE below 1 LFC for all perturbations effects applied.

err <- (res$lfc - res$expected_lfc)**2
err <- split(err, res$expected_lfc)
mse <- sapply(err, mean, na.rm = TRUE)

data.frame(
  N = sapply(err, length),
  MSE = mse, RMSE = sqrt(mse)
)

Exploring decal arguments

As mentioned before, decal function requires three main parameters:

  1. perturbation, a table specifying the clone population and gene to evaluate differential expression.
  2. count, a UMI count matrix where each column correspond to a cell and each row a gene or feature.
  3. clone, a list of cells specifying each clone populations composition.

In the next sections, we will further illustrate and give alternative formats for (1) and (3).

Electing potential gene perturbations

Since decal was designed to explore specific gene perturbations, it requires you to specify clone and gene (or features) pairs to evaluate expression change among the cells within the clone in regards to all other cells. For example, in our original study after we localized our alternations and evaluated their impact to all genes within 250 kbp.

In sim_decal$perturbations data.frame (see snippet below) the clone and gene pairs are specified by the clone and gene columns, which can be an id (as character vector) or index (as integer vector). In the case these elements are specified as an index, they identify the ith element of sim_decal$clone and row position in sim_decal$count, respectively.

head(sim_decal$perturbations)

By default decal uses the clone and gene columns, but you can specify a different column using gene_col and clone_col parameters like in the example below.

alt_pert <- data.frame(gene_ix = 101:110, clone_ix = 1:10)
decal(alt_pert, count, clone,
  gene_col = "gene_ix", clone_col = "clone_ix")

Defining your experiment clonal structure

After identifying your cells' clonal populations you can specify it decal (clone parameter) as a R list where each element is clone represented by a collection of ids (as character vector) or index (as integer vector) indicating the set of cells belonging to the clone. When the cells are specified by index, they correspond to the column position in your count matrix. See below a example from sim_decal$clone

sim_decal$clone[1:2]

Reducing noise and increasing power

To improve your findings, decal includes some filtering arguments to avoid testing interactions with low statistical power:

Interactions that doesn't met all of these requirements are skipped and no differential expression analysis is conducted, eventhough they are no removed from the results. By default, decal applies the following minimal filters min_x = 1; min_n = 2; and min_mu = 0.05.

Understanding decal model

Statistical model

The decal statistical model is based on the observations that UMI counts of a particular gene approximates a Poisson or Negative Binomial distribution in a single-cell RNA-seq experiment [@svensson_droplet_2020; @townes_feature_2019]. Thus, we proposed to model the observed counts as follow:

$$ Y_{gc} \sim NB(\mu_{gci}, \theta_g) (#eq:model) $$

where $Y_{cg}$ is the UMI count for gene g and cell c modeled using a negative binomial distribution with expected count $\mu_{cgi}$ and a gene-specific dispersion parameter $\theta_g$.

$$ log(\mu_{gci}) = log(D_c) + \beta_g + \beta_x X_{ci} (#eq:modelMu) $$

The $log(\mu_{cgi})$ (or log expected count) is proportional to the perturbation effect ($\beta_x$) of a specific clone i indicated by $X_{ci}$ (where $X_{ci} == 1$ if cell c belongs to the perturbed clone or $X_{ci} == 0$ otherwise) and offseted by the log cell total UMI count ($D_c = \sum_g Y_{cg}$).

The parameter $\theta_g$ defines the model dispersion where smaller values produce a wider distribution and higher values produce a tighter distribution that coincides with a Poisson distribution. Such as the model variance is defined as follow:

$$ \begin{aligned} Var(Y_{cg}) & = E[(Y_{cg} - \mu_{cgi})^2] \ & = \mu_{cgi} + \mu_{cgi}^2/\theta_g \ \lim_{\theta\to\infty} NB(\mu, \theta) & = Pois(\mu) \end{aligned} (#eq:variance) $$

This modeling approach is similar to previously published models such as edgeR [@robinson_edger_2010; @mccarthy_differential_2012], DESeq2 [@anders_differential_2010; @love_moderated_2014], and glmGamPoi [@ahlmann-eltze_glmgampoi_2021]. But it differs in two points: (i) to estimate $\theta_g$ we adopted a regularized estimation strategy to make it robust to sampling noise in low expression genes; and (ii) instead of fitting the model to all genes, it fits and measure specific perturbation to gene.

Estimating dispersion

To estimate and regularize our $\theta_g$, we adopted the strategy described by @hafemeister_normalization_2019. This strategy consists in first selecting about 2000 genes, estimating their expect count ($mu_{cg}$) using a naive Poisson regression offseted by the log cell total count, such as:

$$ \begin{aligned} Y_{cg} & \sim Pois(mu_{cg}) \ log(mu_{cg}) & = \beta_g + log(D_c) \end{aligned} (#eq:naiveTheta) $$

Next, we use $mu_{cg}$ to make a naive estimate of $\theta_g$ using a maximum likelihood estimator (MASS::theta_ml function). The resulting naive estimate was finally regularized by fitting it to a kernel smooth regression as a function of the average gene count ($\sum_g Y_{cg} / N$) to produce our final estimate of $\theta_g$. Illustrate below is a scatterplot of each naive estimate, in red our final regularized estimate and in black the real dispersion applied when simulating sim_decal dataset.

raw_theta <- attr(res, "raw_theta")
mu <- rowMeans(count)
mu_breaks <- c(0.01, 0.1, 1, 2, 5, 10, 20, 50, 100)

plot(log10(mu), raw_theta,
  xlab = "Average expression (mu)", xaxt = "n",
  ylab = expression(theta), ylim = c(0, 150))
abline(h = 100)
lines(log10(res$mu[order(res$mu)]), res$theta[order(res$mu)], col = 2, lwd = 3)
axis(1, at = log10(mu_breaks), labels = mu_breaks)
legend(
  "topleft", c("Naive Estimate", "Regularized Estimate", "Real"),
  pch = c(1, NA, NA), lty = c(NA, 1, 1), col = c(1, 2, 1)
)

As shown above, our regularized estimates approximates the real dispersion values and control some of the excessive fluctuation presented by our naive estimate.

Assumptions and Caveats

It is worth notice that decal statiscal model implies some assumptions and caveats:

  1. The UMI count distribution is independent between genes, which is usually true for single-cell experiments due to large quantity of genes and small proportion of counts each of them represent.
  2. The effect of individual clones are not enough to affect the unperturbed estimates for other clones. For such assumption to hold, the number of perturbated cells of any clone must be much smaller than the total number of cells available (n1 << N). This allow us to explore each clone independently, since it makes the effect of other clones to our unperturbed estimates negletable.
  3. The cells investigates must represent an uniform population differing only by the induced alterations. This can safely be assumed when the experiment is conducted in a uniform cell culture.

Simulating

Since many factor can affect your experiment, we included to decal a way to simulate a dataset under our model assumptions (see equation \@ref(eq:model)). These tools will allow you to estimate your experiment power and sensitivity under different conditions you may find.

decal simulation suite is compose of various functions to generate a count matrix (sim_count and sim_count_from_data), randomly assign cells to different clones (sim_clone and sim_clone_range), and generate a full experiment dataset (sim_experiment and sim_experiment_from_data). Here we will use sim_experiment and sim_experiment_from_data to generate our examples, but the other functions behave similarly.

Evaluating our model under null hypothesis

Under the null hypothesis, we expect that our perturbations produce no expression change to nearby genes. Admitting a FDR of 5% and that decal statistical model holds, we would expect that the resulting z-score present a normal distribution centered at 0, p-value have a uniform distribution and about 5% of the tests below our significance threshold.

Given an existing dataset, we can evaluate if our model holds by randomly recombining clone and gene pairs.

rnd <- unique(data.frame(
  clone = sample(perturbations$clone),
  gene = sample(perturbations$gene)
))
## remove previous clone + gene pairs
rnd <- rnd[!paste(rnd$gene, rnd$clone) %in% paste(perturbations$gene, perturbations$clone),]
rnd_res <- decal(rnd, count, clone)
rnd_res <- subset(rnd_res, !is.na(rnd_res$pvalue))

par(mfrow = c(1, 3))
hist(rnd_res$z, xlab = "Z-score", main = "")
hist(rnd_res$pvalue, xlab = "Observed Pvalue", main = "")
hist(rnd_res$p_adjusted, xlab = "Observed Qvalue", main = "")

mean(rnd_res$pvalue < 0.05)
mean(rnd_res$p_adjusted < 0.05)

As expected, this experiment z-score presented the expected z-score and p-value distribution and the ratio of tests with p-value below the significance threshold (r mean(rnd_res$pvalue < 0.05)) within the expected 5% at random.

We can conduct a similar experiment producing a random count matrix based on your experiment counts. sim_count_from_data will simulate a UMI count matrix with the same dimensions as your experiment with cells total UMI count and gene average expression similar to your reference count matrix.

rnd_count <- sim_count_from_data(count)

rnd_res <- decal(perturbations, rnd_count, clone)
rnd_res <- subset(rnd_res, !is.na(rnd_res$pvalue))

par(mfrow = c(1, 3))
hist(rnd_res$z, xlab = "Z-score", main = "")
hist(rnd_res$pvalue, xlab = "Observed Pvalue", main = "")
hist(rnd_res$p_adjusted, xlab = "Observed Qvalue", main = "")

Estimating your experiment power

Another usufull application of decal simulation suite is conducting a power analysis to decide how much should you sequence or to define your tests filters. In the example below, we generated an experiment using sim_experiment_from_data which generates a random count matrix, a clone assignment list, and a perturbation table. The generated count matrix is based on your experiment matrix with the same dimensions, similar cells total count and gene average expression. The clone list is composed of 100 clones with 5 to 20 cells each. Perturbation table indicates pairs of clone and genes and the log2 fold-change applied, where for each clone a gene was randomly to apply a effect indicated by lfc. In this case, for each clone, 180 genes were selected and listed in pwr_dat$perturbations and among them 20 were perturbed by a log2 fold-change of -2, -1, 1, and 2.

pwr_lfc <- c(rep(c(-2, -1, 1, 2), each = 20), rep(0, 100))
pwr_dat <- sim_experiment_from_data(
  count, lfc = pwr_lfc, nclones = 100, min_n = 5, max_n = 20
)
pwr_res <- decal(pwr_dat$perturbations, pwr_dat$count, pwr_dat$clone)
pwr_res <- subset(pwr_res, !is.na(pvalue))

Adminitting a 5% FDR, we can estimated the ratio of the truly perturbated gene and clone pairs were detected by decal (left plot) and the clone size frequency (right plot). For all log2 fold-change applied here, decal obtained a power close to 80% that indicate que performance of the model.

par(mfrow = c(1, 2))
barplot(
  sapply(split(
    pwr_res$p_adjusted[pwr_res$expected_lfc != 0] < 0.05,
    pwr_res$expected_lfc[pwr_res$expected_lfc != 0]
  ), mean),
  xlab = "LFC", ylab = "Power")
abline(h = 0.8, lty = 2, col = "firebrick")
hist(sapply(pwr_dat$clone, length), main = "", xlab = "Number of perturbed cells")

References

Session info

sessionInfo()

```{css, echo=FALSE} p.abstract{ font-size: 24px; font-family: inherit; text-align: center; font-weight: bold; }

div.abstract{ margin: auto; width: 90%; padding-bottom: 15px; } ```



mauranolab/decal documentation built on Dec. 21, 2021, 3:50 p.m.