test_de: Test for Differential Expression

View source: R/test_de.R

test_deR Documentation

Test for Differential Expression

Description

Conduct a quasi-likelihood ratio test for a Gamma-Poisson fit.

Usage

test_de(
  fit,
  contrast,
  reduced_design = NULL,
  full_design = fit$model_matrix,
  subset_to = NULL,
  pseudobulk_by = NULL,
  pval_adjust_method = "BH",
  sort_by = NULL,
  decreasing = FALSE,
  n_max = Inf,
  max_lfc = 10,
  verbose = FALSE
)

Arguments

fit

object of class glmGamPoi. Usually the result of calling glm_gp(data, ...)

contrast

The contrast to test. Can be a single column name (quoted or as a string) that is removed from the full model matrix of fit. Or a complex contrast comparing two or more columns: e.g. A - B, "A - 3 * B", (A + B) / 2 - C etc. For complicated experimental design that involved nested conditions, you specify the condition level to compare using the cond() helper function.
Only one of contrast or reduced_design must be specified.

reduced_design

a specification of the reduced design used as a comparison to see what how much better fit describes the data. Analogous to the design parameter in glm_gp(), it can be either a formula, a model.matrix(), or a vector.
Only one of contrast or reduced_design must be specified.

full_design

option to specify an alternative full_design that can differ from fit$model_matrix. Can be a formula or a matrix. Default: fit$model_matrix

subset_to

a vector with the same length as ncol(fit$data) or an expression that evaluates to such a vector. The expression can reference columns from colData(fit$data). A typical use case in single cell analysis would be to subset to a specific cell type (e.g. subset_to = cell_type == "T-cells"). Note that if this argument is set a new the model for the full_design is re-fit.
Default: NULL which means that the data is not subset.

pseudobulk_by

DEPRECTATED, please use the pseudobulk function instead.
A vector with the same length as ncol(fit$data) that is used to split the columns into different groups (calls split()). pseudobulk_by can also be an expression that evaluates to a vector. The expression can reference columns from colData(fit$data).
The counts are summed across the groups to create "pseudobulk" samples. This is typically used in single cell analysis if the cells come from different samples to get a proper estimate of the differences. This is particularly powerful in combination with the subset_to parameter to analyze differences between samples for subgroups of cells. Note that this does a fresh fit for both the full and the reduced design. Default: NULL which means that the data is not aggregated.

pval_adjust_method

one of the p-value adjustment method from p.adjust.methods. Default: "BH".

sort_by

the name of the column or an expression used to sort the result. If sort_by is NULL the table is not sorted. Default: NULL

decreasing

boolean to decide if the result is sorted increasing or decreasing order. Default: FALSE.

n_max

the maximum number of rows to return. Default: Inf which means that all rows are returned

max_lfc

set the maximum absolute log fold change that is returned. Large log fold changes occur for lowly expressed genes because the ratio of two small numbers can be impractically large. For example, limiting the range of log fold changes can clarify the patterns in a volcano plot. Default: 10 which corresponds to a thousand-fold (2^10) increase in expression.

verbose

a boolean that indicates if information about the individual steps are printed while fitting the GLM. Default: FALSE.

Details

The cond() helper function simplifies the specification of a contrast for complex experimental designs. Instead of working out which combination of coefficients corresponds to a research question, you can simply specify the two conditions that you want to compare.

You can only call the cond function inside the contrast argument. The arguments are the selected factor levels for each covariate. To compare two conditions, simply subtract the two cond calls. Internally, the package calls model.matrix using the provided values and the original formula from the fit to produce a vector. Subtracting two of these vectors produces a contrast vector. Missing covariates are filled with the first factor level or zero for numerical covariates.

Value

a data.frame with the following columns

name

the rownames of the input data

pval

the p-values of the quasi-likelihood ratio test

adj_pval

the adjusted p-values returned from p.adjust()

f_statistic

the F-statistic: F = (Dev_full - Dev_red) / (df_1 * disp_ql-shrunken)

df1

the degrees of freedom of the test: ncol(design) - ncol(reduced_design)

df2

the degrees of freedom of the fit: ncol(data) - ncol(design) + df_0

lfc

the log2-fold change. If the alternative model is specified by reduced_design will be NA.

References

  • Lund, S. P., Nettleton, D., McCarthy, D. J., & Smyth, G. K. (2012). Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology, 11(5). https://doi.org/10.1515/1544-6115.1826.

See Also

glm_gp()

Examples

 # Make Data
 Y <- matrix(rnbinom(n = 30 * 100, mu = 4, size = 0.3), nrow = 30, ncol = 100)
 annot <- data.frame(mouse = sample(LETTERS[1:6], size = 100, replace = TRUE),
        celltype = sample(c("Tcell", "Bcell", "Macrophages"), size = 100, replace = TRUE),
        cont1 = rnorm(100), cont2 = rnorm(100, mean = 30))
 annot$condition <- ifelse(annot$mouse %in% c("A", "B", "C"), "ctrl", "treated")
 head(annot)
 se <- SummarizedExperiment::SummarizedExperiment(Y, colData = annot)

 # Fit model
 fit <- glm_gp(se, design = ~ condition + celltype + cont1 + cont2)
 # Test with reduced design
 res <- test_de(fit, reduced_design = ~ celltype + cont1 + cont2)
 head(res)
 # Test with contrast argument, the results are identical
 res2 <- test_de(fit, contrast = conditiontreated)
 head(res2)
 # Test with explicit specification of the conditions
 # The results are still identical
 res3 <- test_de(fit, contrast = cond(condition = "treated", celltype = "Bcell") -
                                    cond(condition = "ctrl", celltype = "Bcell"))
 head(res3)


 # The column names of fit$Beta are valid variables in the contrast argument
 colnames(fit$Beta)
 # You can also have more complex contrasts:
 # the following compares cont1 vs cont2:
 test_de(fit, cont1 - cont2, n_max = 4)
 # You can also sort the output
 test_de(fit, cont1 - cont2, n_max = 4,
         sort_by = "pval")
 test_de(fit, cont1 - cont2, n_max = 4,
         sort_by = - abs(f_statistic))

 # If the data has multiple samples, it is a good
 # idea to aggregate the cell counts by samples.
 # This is called forming a "pseudobulk".
 se_reduced <- pseudobulk(se, group_by = vars(mouse, condition, celltype),
                          cont1 = mean(cont1), cont2 = min(cont2))
 fit_reduced <- glm_gp(se_reduced, design = ~ condition + celltype)
 test_de(fit_reduced, contrast = "conditiontreated", n_max = 4)
 test_de(fit_reduced, contrast = cond(condition = "treated", celltype = "Macrophages") -
                                      cond(condition = "ctrl", celltype = "Macrophages"),
         n_max = 4)



const-ae/glmGamPoi documentation built on Feb. 13, 2024, 1:35 a.m.