knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)

Purpose

The simulations here are meant to be portable, toy-examples of eSVD2, with the intent to demonstrate that an installation of eSVD2 was successful.

A successful installation of eSVD2 is required for this. See the last section of this README of all the system/package dependencies used when creating these simulations. Both simulations should complete in less than 2 minutes.

Overview

Broadly speaking, eSVD-DE (the name of the method, which is implemented in the eSVD2 package) is a pipeline of 6 different function calls. The procedure starts with primarily 3 inputs,

library(eSVD2)
library(Seurat)

Simulating data

We create a simple dataset that has no differentially expressed genes via the eSVD2::generate_null() function.

set.seed(10)
res <- eSVD2::generate_null()
covariates <- res$covariates
metadata_individual <- res$metadata_individual
obs_mat <- res$obs_mat

If you have Seurat also installed, you can run the following lines as well,

set.seed(10)
meta_data <- data.frame(covariates[, c("CC", "Sex", "Age")])
meta_data$Individual <- metadata_individual
seurat_obj <- Seurat::CreateSeuratObject(counts = Matrix::t(obs_mat), meta.data = meta_data)
Seurat::VariableFeatures(seurat_obj) <- rownames(seurat_obj)
seurat_obj <- Seurat::NormalizeData(seurat_obj)
seurat_obj <- Seurat::ScaleData(seurat_obj)
seurat_obj <- Seurat::RunPCA(seurat_obj, verbose = F)
seurat_obj <- Seurat::RunUMAP(seurat_obj, dims = 1:5)

Seurat::DimPlot(seurat_obj, reduction = "umap", group.by = "CC")

The separation between case and control cells is purely determined by the individual covariates of Age and Sex, not the case-control status.

Using eSVD-DE

Empirically, we have found it useful to set omitted_variables as the Log_UMI and the batch variables. (Here, there are no batch variables.) This means that when eSVD2 reparameterizes the variables, it does not fiddle with the coefficient for Log_UMI. We found this beneficial since the coefficient for specifically the covariate Log_UMI bears a special meaning (as it impacts the normalization of the sequencing depth most directly), and we do not wish to "mix" covariates correlated with Log_UMI too early. (Overall, this is not a strict recommendation, however, we have found this practice to be beneficial in almost all instances we have used eSVD2, so we ourselves have not tried deviating from this convention.)

This typically looks like the following.

set.seed(10)
eSVD_obj <- eSVD2::initialize_esvd(
  dat = obs_mat,
  covariates = covariates,
  metadata_individual = metadata_individual,
  case_control_variable = "CC",
  bool_intercept = T,
  k = 5,
  lambda = 0.1,
  verbose = 0
)
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Init",
  omitted_variables = "Log_UMI"
)

names(eSVD_obj)

Note that in the last fit of eSVD2::opt_esvd(), we do not set any omitted_variables. This is our typical recommendation -- it is good to allow all coefficients to be reparameterized right before the optimization is all done.

set.seed(10)
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = setdiff(colnames(eSVD_obj$covariates), "CC"),
  tol = 1e-6,
  verbose = 0,
  fit_name = "fit_First",
  fit_previous = "fit_Init"
)
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_First",
  omitted_variables = "Log_UMI"
)

set.seed(10)
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = NULL,
  tol = 1e-6,
  verbose = 0,
  fit_name = "fit_Second",
  fit_previous = "fit_First"
)

eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Second",
  omitted_variables = NULL
)

names(eSVD_obj)
set.seed(10)
eSVD_obj <- eSVD2::estimate_nuisance(
  input_obj = eSVD_obj,
  bool_covariates_as_library = TRUE,
  bool_library_includes_interept = TRUE,
  bool_use_log = FALSE,
  min_val = 1e-4,
  verbose = 0
)

names(eSVD_obj)
set.seed(10)
eSVD_obj <- eSVD2::compute_posterior(
  input_obj = eSVD_obj,
  bool_adjust_covariates = FALSE,
  alpha_max = 100,
  bool_covariates_as_library = TRUE,
  bool_stabilize_underdispersion = TRUE,
  library_min = 1,
  pseudocount = 0
)

names(eSVD_obj)
set.seed(10)
eSVD_obj <- eSVD2::compute_test_statistic(input_obj = eSVD_obj, verbose = 0)

names(eSVD_obj)
set.seed(10)
eSVD_obj <- eSVD2::compute_pvalue(input_obj = eSVD_obj)

names(eSVD_obj)

To visualize the p-values, we can plot the theoretical quantiles against the observed quantiles. We see that despite the UMAP above showing strong differences between the case and control cells, the p-values are all uniformly distributed. This shows that we do not inflate the Type-1 error.

pvalue_vec <- 10 ^ (-eSVD_obj$pvalue_list$log10pvalue)
graphics::plot(
  x = sort(pvalue_vec),
  y = seq(0, 1, length.out = length(pvalue_vec)),
  asp = TRUE,
  pch = 16,
  xlab = "Observed quantile",
  ylab = "Theoretical quantile"
)
graphics::lines(
  x = c(0, 1),
  y = c(0, 1),
  col = "red",
  lty = 2,
  lwd = 2
)

Setup

The following shows the suggested package versions that the developer (GitHub username: linnykos) used when developing the eSVD2 package.

devtools::session_info()


linnykos/eSVD2 documentation built on July 17, 2024, 12:01 a.m.