knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi = 150, out.width="100%", crop = NULL )
In this vignette, we will demonstrate how to introduce co-expression between features in an already-simulated scRNA-seq dataset. For this, we will first explain how we originally simulated single-cell data in the acorde manuscript using the SymSim R package, and then describe how to introduce and visualize co-expression patterns in the previously simulated scRNA-seq expression matrix.
If required, the SymSim R package can be installed
from GitHub as follows (note usage of build_vignettes = TRUE to be able to
access package documentation):
devtools::install_github("YosefLab/SymSim", build_vignettes = TRUE)
Load the SymSim package, acorde and other required dependencies:
suppressPackageStartupMessages({ library(SymSim) library(tidyverse) library(acorde) library(scater) })
First, we simulated true counts for 8 cell types, which requires the generation of a tree object detailing the relationships between the cell types (in this case, discrete cell types):
tree <- pbtree(n = 7, type = "discrete") plotTree(tree)
SymSim implements a two-step simulation, in which true counts are first simulated to then add technology-specific noise and user-defined biasess that are normally found on scRNA-seq data. For the first step, we run SymSim with the following parameters to generate a dataset containing 1000 total cells and 8000 genes:
# true counts true_counts <- SimulateTrueCounts(ncells_total = 1000, ngenes = 8000, min_popsize = 100, i_minpop = 1, nevf = 10, n_de_evf = 9, evf_type = "discrete", phyla = tree, vary = "s", Sigma = 0.25, gene_effect_prob = 0.5, bimod = 0.4, prop_hge = 0.03, mean_hge = 5, randseed = 123)
Next, we run the simulation of observed counts as follows:
observed_counts <- True2ObservedCounts(true_counts = true_counts$counts, meta_cell = true_counts$cell_meta, protocol = "nonUMI", alpha_mean = 0.1, alpha_sd = 0.005, lenslope = 0, gene_len = rep(1000, nrow(true_counts$counts)), depth_mean = 4e6, depth_sd = 1e4)
Here, we selected "nonUMI" as the simulated protocol to mimic the properties
of the Smart-seq2 dataset used in our study. For a detailed description of the
rest of the parameters, please see the SymSim documentation.
Finally, we created a SingleCellExperiment object for further processing, including
a Principal Component Analysis (PCA) to better characterize the data structure and the
variability between cells and cell types:
# feature and cell IDs as metadata rownames(observed_counts$counts) <- paste0("Feature", seq(1, 8000)) colnames(observed_counts$counts) <- paste0("Cell", seq(1, 1000)) colData <- tibble(Cell = colnames(observed_counts$counts), Group = paste0("Group", observed_counts$cell_meta$pop)) # create SCE symsim_sce <- SingleCellExperiment( assays = list(counts = observed_counts$counts, logcounts = log2(observed_counts$counts+1)), colData = colData)
Our co-expression simulation method relies on breaking the cell-type connectivity between features to rearrange them. To perform this rearrangement and build new, synthetic features, we require the user to provide cross-cell type expression patterns. These patterns are qualitative, i.e. just indicate low or high expression in a given cell type.
If you started at this point, or you want to reproduce the results in our manuscript, load the already-simulated SymSim data object stored within this package:
# load data data("symsim_sce") # view object symsim_sce # cell type IDs and composition symsim_sce$Group %>% table() # plot PCA using scater function plotPCA(symsim_sce, colour_by = "Group")
Next, it is required that users build an expression pattern table. This must have a structure where feature-level expression patterns are defined row-wise, meaning that cell-types are situated in the columns. Note that the number and order of cell types must be the same as in the simulated dataset. Each row in the \code{dataframe} should then include a \code{TRUE} value whenever high expression in a given cell type is desired, and \code{FALSE} to select low or no expression in that cell type. Here is an example of how to generate this structure:
# create cluster patterns patterns <- tibble(one.a = c(TRUE, rep(FALSE, 7)), one.b = one.a[sample(seq_along(one.a))], one.c = one.b[sample(seq_along(one.b))], one.d = one.c[sample(seq_along(one.c))], one.e = one.d[sample(seq_along(one.d))], two.a = c(rep(TRUE, 2), rep(FALSE, 6)), two.b = two.a[sample(seq_along(two.a))], two.c = two.b[sample(seq_along(two.b))], two.d = two.c[sample(seq_along(two.c))], two.e = two.d[sample(seq_along(two.d))], three.a = c(rep(TRUE, 3), rep(FALSE, 5)), three.b = three.a[sample(seq_along(three.a))], three.c = three.b[sample(seq_along(three.b))], three.d = three.c[sample(seq_along(three.c))], three.e = three.d[sample(seq_along(three.d))]) %>% t %>% as_tibble() # show the results patterns
In this example, we have 15 expression patterns, each corresponding to one gene cluster. The first pattern, for instance, where only the \code{Group1} cell type has a \code{TRUE} value, corresponds to genes where there is Group1-specific expression. Conversely, the last pattern will contain genes that are highly expressed in cell types \code{Group3}, \code{Group6} and \code{Group8}.
Now, we are ready to run the \code{simulate_coexpression} function in acorde to generate the selected co-expression patterns in the already-simulated dataset:
coexpr_results <- simulate_coexpression(symsim_sce, feature_no = 1400, patterns, cluster_size = 200) names(coexpr_results)
As a result, we obtain a named list including the expression matrix and the feature IDs that go into each of the 15 clusters.
In this function, \code{feature_no} corresponds to the number of highly and lowly expressed features that are to be selected to produce clusters, i.e. a total of \code{feature_no * 2} features. In this example, we selected to generate 200-feature clusters. Given that our pattern matrix has 15 patterns, we expect to generate a new expression matrix with 3000 synthetically co-expressed features:
# show expression matrix coexpr_results$sim_matrix # show summary of clusters map_int(coexpr_results$sim_clusters, length)
Now, we can use some of the \code{acorde} visualization functions to make sure that our data contains the specified patterns:
# scale the matrix to enhance visualization coexpr.scaled <- scale_isoforms(coexpr_results$sim_matrix, isoform_col = "feature") # create cell-to-cell-type ID table ct <- colData(symsim_sce) %>% as_tibble %>% rename(cell = "Cell", cell_type = "Group") # compute average-by-cell type cluster patterns cluster_patterns <- map(coexpr_results$sim_clusters, ~calculate_cluster_profile(coexpr.scaled, isoform_ids = ., id_table = ct, isoform_col = "feature")) # plot patterns library(cowplot) theme_set(theme_cowplot()) pattern_plots <- map(cluster_patterns, plot_cluster_profile, ct_labels = seq(1, 8)) plot_grid(plotlist = pattern_plots, labels = seq(1, 15), ncol = 3)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.