knitr::opts_chunk$set(echo = TRUE)

For this vignette, we will be using a dataset that can be downloaded from https://www.dropbox.com/scl/fi/csyz62obxx8xj5gl5sgp4/asd_seurat.RData?rlkey=d7dlhqjesunal52u8z2iomjsg&dl=0. This file is called asd_seurat.RData, and contains the following variables:

Notably, this is not the version of the dataset used in the eSVD-DE paper. Instead, the intent of this vignette to show you how to perform an eSVD-DE analysis. That is, while the processed dataset in the paper is different from this vignette's processed dataset, the procedure we showcase is here is intended as the typical usage of our method.

If you wish to see how this data was preprocessed starting from publicly available data (or you cannot download the .RData file), please see https://linnykos.github.io/eSVD2/articles/asd-preprocess.html.

Relevant citation:

Velmeshev, D., Schirmer, L., Jung, D., Haeussler, M., Perez, Y., Mayer, S., ... & Kriegstein, A. R. (2019). Single-cell genomics identifies cell type–specific molecular changes in autism. Science, 364(6441), 685-689.
library(Seurat)
library(SeuratObject)
library(EnhancedVolcano)

set.seed(10)
date_of_run <- Sys.time()
session_info <- devtools::session_info()

Loading the data

We first load the data. We can use the table() function to roughly see how many cells each donor contributes and how many donors were classified to be ASD or Control.

load("asd_seurat.RData")
asd
table(asd$individual, asd$diagnosis)
An object of class Seurat 
6599 features across 13302 samples within 1 assay 
Active assay: RNA (6599 features, 6599 variable features)
 2 layers present: counts, data
            ASD Control
indiv_1823    0      72
indiv_4341    0     733
indiv_4849  428       0
indiv_4899  349       0
indiv_5144  202       0
indiv_5163    0     145
indiv_5242    0      81
indiv_5278  944       0
indiv_5294  449       0
indiv_5387    0    1129
indiv_5391    0     237
indiv_5403   69       0
indiv_5408    0     162
indiv_5419  327       0
indiv_5531  631       0
indiv_5538    0     391
indiv_5554    0     233
indiv_5565  881       0
indiv_5577    0     215
indiv_5841  451       0
indiv_5864  275       0
indiv_5879    0     278
indiv_5893    0     976
indiv_5936    0     193
indiv_5939 1016       0
indiv_5945  422       0
indiv_5958    0     826
indiv_5976    0     106
indiv_5978  264       0
indiv_6032    0     289
indiv_6033  528       0

Using eSVD-DE

We will be testing for differential expression in the mean between the 15 donors with ASD and the 16 control donors, among the 6599 genes.

gene_vec <- Seurat::VariableFeatures(asd[["RNA"]])
mat <- SeuratObject::LayerData(asd,
                               layer = "counts",
                               assay = "RNA",
                               features = gene_vec)
mat <- Matrix::t(mat)
covariate_df <- asd@meta.data[, c("individual",
                                  "region",
                                  "age",
                                  "sex",
                                  "diagnosis",
                                  "Seqbatch",
                                  "Capbatch")]

factor_variables <- c("region", "diagnosis", "sex", "Seqbatch", "Capbatch")
for (variable in factor_variables) {
  covariate_df[, variable] <- factor(covariate_df[, variable], levels = names(sort(table(covariate_df[, variable]), decreasing = T)))
}

covariate_df[, "diagnosis"] <- factor(covariate_df[, "diagnosis"], levels = c("Control", "ASD"))

covariates <- eSVD2::format_covariates(
  dat = mat,
  covariate_df = covariate_df,
  rescale_numeric_variables = "age"
)

Empirically, we have found it useful to set omitted_variables as the Log_UMI and the batch variables. This means that when eSVD2 reparameterizes the variables, it does not fiddle with the coefficient for Log_UMI. We found this beneficial since the coefficient for specifically the covariate Log_UMI bears a special meaning (as it impacts the normalization of the sequencing depth most directly), and we do not wish to "mix" covariates correlated with Log_UMI too early. We have also found it useful to keep the batch variables separate since some datasets we have analyzed have donor covariates that are very correlated with the batch variable, and keeping these two aspects (i.e., the biological effect of donor covariates, and the artifical effect of the sequencing batches) seperate to be beneficial. (Overall, this is not a strict recommendation, however, we have found this practice to be beneficial in almost all instances we have used eSVD2, so we ourselves have not tried deviating from this convention.)

On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), this took roughly 7 minutes.

time_start1 <- Sys.time()
eSVD_obj <- eSVD2::initialize_esvd(
  dat = mat,
  covariates = covariates[, -grep("individual", colnames(covariates))],
  metadata_individual = covariate_df[, "individual"],
  case_control_variable = "diagnosis_ASD",
  bool_intercept = TRUE,
  k = 30,
  lambda = 0.1,
  metadata_case_control = covariates[, "diagnosis_ASD"],
  verbose = 1
)
time_end1 <- Sys.time()

omitted_variables <- colnames(eSVD_obj$covariates)[c(grep("Seqbatch", colnames(eSVD_obj$covariates)), grep("Capbatch", colnames(eSVD_obj$covariates)))]
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Init",
  omitted_variables = c("Log_UMI", omitted_variables)
)

Note that in the last fit of eSVD2::opt_esvd(), we do not set any omitted_variables. This is our typical recommendation -- it is good to allow all coefficients to be reparameterized right before the optimization is all done.

On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), for this vignette's dataset, the first fit (i.e., call of eSVD2::opt_esvd()) took roughly 7 minutes (for 11 iterations). The second fit took roughly 4 minutes (for 15 iterations).

time_start2 <- Sys.time()
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = setdiff(colnames(eSVD_obj$covariates), "diagnosis_ASD"),
  tol = 1e-6,
  verbose = 1,
  fit_name = "fit_First",
  fit_previous = "fit_Init"
)
time_end2 <- Sys.time()
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_First",
  omitted_variables = c("Log_UMI", omitted_variables)
)

time_start3 <- Sys.time()
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = NULL,
  tol = 1e-6,
  verbose = 1,
  fit_name = "fit_Second",
  fit_previous = "fit_First"
)
time_end3 <- Sys.time()
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Second",
  omitted_variables = omitted_variables
)

On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), this took roughly 9 minutes.

time_start4 <- Sys.time()
eSVD_obj <- eSVD2::estimate_nuisance(
  input_obj = eSVD_obj,
  bool_covariates_as_library = TRUE,
  bool_library_includes_interept = TRUE,
  bool_use_log = FALSE,
  verbose = 1
)
time_end4 <- Sys.time()
eSVD_obj <- eSVD2::compute_posterior(
  input_obj = eSVD_obj,
  bool_adjust_covariates = FALSE,
  alpha_max = 2 * max(mat@x),
  bool_covariates_as_library = TRUE,
  bool_stabilize_underdispersion = TRUE,
  library_min = 0.1,
  pseudocount = 0
)
eSVD_obj <- eSVD2::compute_test_statistic(input_obj = eSVD_obj, verbose = 1)
eSVD_obj <- eSVD2::compute_pvalue(input_obj = eSVD_obj)

save(eSVD_obj, session_info, date_of_run,
     file = "asd_esvd.RData")

Plotting the p-values

To visualize the p-values, we can plot a volcano plot where the x-axis is our estimated log fold change (based on the global case donor mean, where each donor's specific mean is first computed, minus the global control donor mean) and the y-axis the negative log 10 p-value. The dotted horizontal line denotes the FDR cutoff.

To give this volcano plot some biological interpretation, we show where the genes found in Gandal et al. (see https://www.nature.com/articles/s41586-022-05377-7) on the volcano plot as well. That study was based on the bulk sequencing of brain regions, where possibly many cell types are involved in the DEG analysis. Nonetheless, we would expect a good number of DEGs from Gandal et al. to be highly significant in our analysis, which is what our volcano plot shows.

pvalue_vec <- 10 ^ (-eSVD_obj$pvalue_list$log10pvalue)
tmp_df <- data.frame(
  gene = names(eSVD_obj$pvalue_list$log10pvalue),
  lfc = log2(eSVD_obj$case_mean) - log2(eSVD_obj$control_mean),
  pvalue = pvalue_vec
)
pCutoff <- max(pvalue_vec[which(eSVD_obj$pvalue_list$fdr_vec <= 0.05)])

data("gandal_df", package = "eSVD2")
gandal_genes <- gandal_df[which(gandal_df[, "WholeCortex_ASD_FDR"] <= 0.05), "external_gene_name"]
gandal_genes <- intersect(gandal_genes, tmp_df$gene)

tmp_df2 <- tmp_df[c(which(!tmp_df$gene %in% gandal_genes),
                    which(tmp_df$gene %in% gandal_genes)), ]

colCustom <- sapply(tmp_df2$gene, function(gene) {
  ifelse(gene %in% gandal_genes, "coral", "navy")
})
names(colCustom)[colCustom == "coral"] <- "Gandal et al."
names(colCustom)[colCustom == "navy"] <- "others"

# png(
#   "volcano.png",
#   width = 1500,
#   height = 2000,
#   res = 300,
#   units = "px"
# )
EnhancedVolcano::EnhancedVolcano(
  tmp_df2,
  lab = tmp_df2$gene,
  selectLab = gandal_genes,
  colCustom = colCustom,
  x = "lfc",
  y = "pvalue",
  ylim = c(0, max(eSVD_obj$pvalue_list$log10pvalue)),
  labSize = 3,
  pCutoff = pCutoff,
  FCcutoff = 0,
  drawConnectors = TRUE
)
# graphics.off()
knitr::include_graphics("https://github.com/linnykos/eSVD2/blob/master/vignettes/volcano.png?raw=true")

Alternatively, you can make the volcano plot that specifically highlights the housekeeping genes, which can act as a rough "negative control". That is, we would not expect many of the housekeeping genes to be significantly different between case and control donors. This is what we see in our volcano plot.

data("housekeeping_df", package = "eSVD2")
hk_genes <- housekeeping_df[, "Gene"]
hk_genes <- intersect(hk_genes, tmp_df$gene)

tmp_df2 <- tmp_df[c(which(!tmp_df$gene %in% hk_genes),
                    which(tmp_df$gene %in% hk_genes)), ]

colCustom <- sapply(tmp_df2$gene, function(gene) {
  ifelse(gene %in% hk_genes, "darkolivegreen1", "navy")
})
names(colCustom)[colCustom == "darkolivegreen1"] <- "Housekeeping genes"
names(colCustom)[colCustom == "navy"] <- "others"

# png(
#   "volcano_hk.png",
#   width = 1500,
#   height = 2000,
#   res = 300,
#   units = "px"
# )
EnhancedVolcano::EnhancedVolcano(
  tmp_df2,
  lab = tmp_df2$gene,
  selectLab = hk_genes,
  colCustom = colCustom,
  x = "lfc",
  y = "pvalue",
  ylim = c(0, max(eSVD_obj$pvalue_list$log10pvalue)),
  labSize = 3,
  pCutoff = pCutoff,
  FCcutoff = 0,
  drawConnectors = TRUE
)
# graphics.off()
knitr::include_graphics("https://github.com/linnykos/eSVD2/blob/master/vignettes/volcano_hk.png?raw=true")

Setup

The following shows the suggested package versions that the developer (GitHub username: linnykos) used when developing the eSVD2 package.

devtools::session_info()


linnykos/eSVD2 documentation built on July 17, 2024, 12:01 a.m.