```
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(2)
```

Fit Gamma-Poisson Generalized Linear Models Reliably.

The core design aims of `glmGamPoi` are:

- Fit the Gamma-Poisson models on arbitrarily large or small datasets
- Be faster than alternative methods, such as `DESeq2` or `edgeR`
- Calculate exact or approximate results based on user preference
- Support in-memory or on-disk data
- Follow established conventions around tools for RNA-seq analysis
- Present a simple user interface
- Avoid unnecessary dependencies
- Make integration into other tools easy

You can install the release version of `r BiocStyle::Biocpkg("glmGamPoi")` from Bioconductor:

```
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("glmGamPoi")
```

For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi", "GitHub")` repo.

Load the glmGamPoi package

```
library(glmGamPoi)
```

To fit a single Gamma-Poisson GLM do:

```
# overdispersion = 1/size
counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)
# design = ~ 1 means that an intercept-only model is fit
fit <- glm_gp(counts, design = ~ 1)
fit

# Internally fit is just a list:
as.list(fit)[1:2]
```

The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion.
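To make the overdispersion parameter concrete: with the parameterization used in the `rnbinom()` call above (overdispersion = 1/size), a Gamma-Poisson variable with mean μ and overdispersion θ has variance μ + θμ². The following base-R sketch checks this on simulated data. The variable names are illustrative only and not part of `glmGamPoi`, and the method-of-moments estimate is a rough check, not the estimate that `glm_gp()` computes.

```r
# Simulate many Gamma-Poisson counts with known mean and overdispersion
sim_mu <- 5
sim_overdispersion <- 0.7
sim_counts <- rnbinom(n = 1e5, mu = sim_mu, size = 1 / sim_overdispersion)

# Theoretical variance: mu + overdispersion * mu^2
theoretical_var <- sim_mu + sim_overdispersion * sim_mu^2
empirical_var <- var(sim_counts)

# Method-of-moments estimate of the overdispersion (illustration only)
mom_overdispersion <- (empirical_var - mean(sim_counts)) / mean(sim_counts)^2

c(theoretical = theoretical_var, empirical = empirical_var)
mom_overdispersion
```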

Fitting repeated Gamma-Poisson GLMs for each gene of a single-cell dataset is just as easy:

I will first load an example dataset using the `TENxPBMCData` package. The dataset has 33,000 genes and 4,340 cells. It takes roughly 1.5 minutes to fit the Gamma-Poisson model on the full dataset. For demonstration purposes, I will subset the dataset to 300 genes, but keep the 4,340 cells:

```
library(SummarizedExperiment)
library(DelayedMatrixStats)
```

```
# The full dataset with 33,000 genes and 4340 cells
# The first time this is run, it will download the data
pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k")
# I want genes where at least some counts are non-zero
non_empty_rows <- which(rowSums2(assay(pbmcs)) > 0)
pbmcs_subset <- pbmcs[sample(non_empty_rows, 300), ]
pbmcs_subset
```

I call `glm_gp()` to fit one GLM for each gene and force the calculation to happen in memory.

```
fit <- glm_gp(pbmcs_subset, on_disk = FALSE)
summary(fit)
```

I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2")` and `r BiocStyle::Biocpkg("edgeR")`. Both are classical methods for analyzing RNA-seq datasets and have been around for almost 10 years. Note that both tools can do a lot more than just fit the Gamma-Poisson model, so this benchmark only serves to give a general impression of the performance.

```
# Explicitly realize count matrix in memory so that it is a fair comparison
pbmcs_subset <- as.matrix(assay(pbmcs_subset))
model_matrix <- matrix(1, nrow = ncol(pbmcs_subset))

bench::mark(
  glmGamPoi_in_memory = {
    glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
  }, glmGamPoi_on_disk = {
    glm_gp(pbmcs_subset, design = model_matrix, on_disk = TRUE)
  }, DESeq2 = suppressMessages({
    dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset,
                        colData = data.frame(name = seq_len(4340)),
                        design = ~ 1)
    dds <- DESeq2::estimateSizeFactors(dds, "poscounts")
    dds <- DESeq2::estimateDispersions(dds, quiet = TRUE)
    dds <- DESeq2::nbinomWaldTest(dds, minmu = 1e-6)
  }), edgeR = {
    edgeR_data <- edgeR::DGEList(pbmcs_subset)
    edgeR_data <- edgeR::calcNormFactors(edgeR_data)
    edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix)
    edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix)
  }, check = FALSE, min_iterations = 3
)
```

On this dataset, `glmGamPoi` is more than 5 times faster than `edgeR` and more than 18 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. By avoiding duplicate calculations, it is tuned for datasets typically encountered in single-cell RNA-seq, with many samples and many small counts.
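One way to avoid duplicate calculations when most counts are small is to evaluate each distinct count value only once and weight it by its frequency. The sketch below illustrates that idea for the simplest case, an intercept-only model with a common mean. It uses only base R, the function names `loglik_naive()` and `loglik_tabulated()` are hypothetical, and it is an illustration of the principle rather than the package's actual implementation.

```r
# Gamma-Poisson log-likelihood, evaluated once per observation
loglik_naive <- function(y, mu, overdispersion) {
  sum(dnbinom(y, mu = mu, size = 1 / overdispersion, log = TRUE))
}

# Same log-likelihood, evaluated once per distinct count value
loglik_tabulated <- function(y, mu, overdispersion) {
  freq <- table(y)
  values <- as.numeric(names(freq))
  sum(as.numeric(freq) *
        dnbinom(values, mu = mu, size = 1 / overdispersion, log = TRUE))
}

# Many cells with mostly small counts: only a handful of distinct values
toy_y <- rnbinom(n = 4340, mu = 0.5, size = 1 / 0.7)
length(unique(toy_y))

ll_naive <- loglik_naive(toy_y, mu = 0.5, overdispersion = 0.7)
ll_tabulated <- loglik_tabulated(toy_y, mu = 0.5, overdispersion = 0.7)
all.equal(ll_naive, ll_tabulated)
```

Both functions return the same value, but the tabulated version calls `dnbinom()` only as many times as there are distinct counts.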

To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but the overdispersion estimates from `glmGamPoi` are more reliable:

```
# Results with my method
fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)

# DESeq2
dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset,
                        colData = data.frame(name = seq_len(4340)),
                        design = ~ 1)
sizeFactors(dds) <- fit$size_factors
dds <- DESeq2::estimateDispersions(dds, quiet = TRUE)
dds <- DESeq2::nbinomWaldTest(dds, minmu = 1e-6)

# edgeR
edgeR_data <- edgeR::DGEList(pbmcs_subset, lib.size = fit$size_factors)
edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix)
edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix)
```

```
par(mfrow = c(2, 4), cex.main = 2, cex.lab = 1.5)
plot(fit$Beta[,1], coef(dds)[,1] / log2(exp(1)), pch = 16,
     main = "Beta Coefficients", xlab = "glmGamPoi", ylab = "DESeq2")
abline(0,1)
plot(fit$Beta[,1], edgeR_fit$unshrunk.coefficients[,1], pch = 16,
     main = "Beta Coefficients", xlab = "glmGamPoi", ylab = "edgeR")
abline(0,1)

plot(fit$Mu[,1], assay(dds, "mu")[,1], pch = 16, log="xy",
     main = "Gene Mean", xlab = "glmGamPoi", ylab = "DESeq2")
abline(0,1)
plot(fit$Mu[,1], edgeR_fit$fitted.values[,1], pch = 16, log="xy",
     main = "Gene Mean", xlab = "glmGamPoi", ylab = "edgeR")
abline(0,1)

plot(fit$overdispersions, rowData(dds)$dispGeneEst, pch = 16, log="xy",
     main = "Overdispersion", xlab = "glmGamPoi", ylab = "DESeq2")
abline(0,1)
plot(fit$overdispersions, edgeR_fit$dispersion, pch = 16, log="xy",
     main = "Overdispersion", xlab = "glmGamPoi", ylab = "edgeR")
abline(0,1)
```

I am comparing the gene-wise estimates of the coefficients from all three methods. Points on the diagonal line are identical. The inferred Beta coefficients and gene means agree well between the methods; however, the overdispersion estimates differ quite a bit. `DESeq2` has problems estimating most of the overdispersions and sets them to `1e-8`. `edgeR` only approximates the overdispersions, which explains the variation around the overdispersions calculated with `glmGamPoi`.
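A note on the plotting code: the `DESeq2` coefficients are divided by `log2(exp(1))` because `DESeq2` reports coefficients on the log2 scale, whereas the `glmGamPoi` and `edgeR` coefficients compared here are on the natural-log scale. A minimal base-R check of that conversion, using a made-up mean of 5:

```r
# A coefficient for a mean of 5, expressed on the log2 scale
toy_mean <- 5
beta_log2 <- log2(toy_mean)

# Dividing by log2(e) converts log2 units to natural-log units
beta_ln <- beta_log2 / log2(exp(1))

all.equal(beta_ln, log(toy_mean))
```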

The method scales linearly with the number of rows and columns in the dataset. For example, fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro takes ~1 minute in memory and a little over 4 minutes on disk. Fitting the `pbmc68k` dataset (17x the size) takes ~73 minutes (17x the time) on disk.

`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes:

```
# Create random categorical assignment to demonstrate DE
group <- sample(c("Group1", "Group2"), size = ncol(pbmcs_subset), replace = TRUE)

# Fit model with group vector as design
fit <- glm_gp(pbmcs_subset, design = group)
# Compare against model without group
res <- test_de(fit, reduced_design = ~ 1)
# Look at first 6 genes
head(res)
```

The p-values agree well with the ones that `edgeR` calculates. This is because `glmGamPoi` uses the same framework of quasi-likelihood ratio tests that was invented by `edgeR` and is described in Lund et al. (2012).

```
model_matrix <- model.matrix(~ group, data = data.frame(group = group))
edgeR_data <- edgeR::DGEList(pbmcs_subset)
edgeR_data <- edgeR::calcNormFactors(edgeR_data)
edgeR_data <- edgeR::estimateDisp(edgeR_data, design = model_matrix)
edgeR_fit <- edgeR::glmQLFit(edgeR_data, design = model_matrix)
edgeR_test <- edgeR::glmQLFTest(edgeR_fit, coef = 2)
edgeR_res <- edgeR::topTags(edgeR_test, sort.by = "none", n = nrow(pbmcs_subset))
```

```
par(cex.main = 2, cex.lab = 1.5)
plot(res$pval, edgeR_res$table$PValue, pch = 16, log = "xy",
     main = "p-values", xlab = "glmGamPoi", ylab = "edgeR")
abline(0,1)
```
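For intuition about what a quasi-likelihood ratio test does, the following base-R sketch compares a full and a reduced model for a single simulated gene with an F-test on the deviance difference. It uses `glm()` with the `quasipoisson` family as a stand-in, which is an assumption made purely for illustration: `glmGamPoi` and `edgeR` fit Gamma-Poisson models and share information across genes, which this single-gene sketch does not reproduce. All variable names are hypothetical.

```r
# Simulated counts for one gene and a balanced two-group assignment
toy_counts_ql <- rnbinom(n = 200, mu = 5, size = 1 / 0.7)
toy_group <- rep(c("Group1", "Group2"), each = 100)

# Full model with a group effect, reduced model with an intercept only
toy_full <- glm(toy_counts_ql ~ toy_group, family = quasipoisson())
toy_reduced <- glm(toy_counts_ql ~ 1, family = quasipoisson())

# F-test on the deviance difference, scaled by the estimated dispersion
toy_ql_test <- anova(toy_reduced, toy_full, test = "F")
toy_p_value <- toy_ql_test[2, "Pr(>F)"]
toy_p_value
```

Because the simulated gene has no true group effect, the resulting p-value should usually be unremarkable.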

Be very careful how you interpret the p-values of a single-cell experiment. Cells that come from one individual are not independent replicates. That means that you cannot turn your RNA-seq experiment with 3 treated and 3 control samples into a 3000 vs 3000 experiment by measuring 1000 cells per sample. The actual units of replication are still the 3 samples in each condition.

Nonetheless, single-cell data is valuable because it allows you to compare the effect of a treatment on specific cell types. The simplest way to do such a test is called pseudobulk. This means that the data is subset to the cells of a specific cell type. Then the counts of cells from the same sample are combined to form a "pseudobulk" sample. The `test_de()` function of glmGamPoi supports this feature directly through the `pseudobulk_by` and `subset_to` parameters:

```
# say we have cell type labels for each cell and know from which sample they come originally
sample_labels <- rep(paste0("sample_", 1:6), length = ncol(pbmcs_subset))
cell_type_labels <- sample(c("T-cells", "B-cells", "Macrophages"), ncol(pbmcs_subset), replace = TRUE)

test_de(fit, contrast = Group1 - Group2,
        pseudobulk_by = sample_labels,
        subset_to = cell_type_labels == "T-cells",
        n_max = 4, sort_by = pval, decreasing = FALSE)
```
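The aggregation step behind pseudobulk, combining the counts of cells from the same sample, can be sketched in base R with `rowsum()`, which sums rows by group. The toy matrix and labels below are made up for illustration, and this is not a description of how `test_de()` is implemented internally.

```r
# Toy count matrix: 3 genes x 6 cells
toy_counts <- matrix(1:18, nrow = 3,
                     dimnames = list(paste0("gene_", 1:3),
                                     paste0("cell_", 1:6)))
# Cells alternate between two samples
toy_samples <- rep(c("sample_1", "sample_2"), times = 3)

# rowsum() aggregates rows, so transpose to aggregate cells (columns)
pseudobulk <- t(rowsum(t(toy_counts), group = toy_samples))
pseudobulk
```

The result has one column per sample instead of one column per cell, so the number of columns matches the number of actual replicates.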

```
sessionInfo()
```
