knitr::opts_chunk$set(tidy = FALSE, cache = TRUE, dev = "png", message = FALSE, error = FALSE, warning = TRUE)
suppressPackageStartupMessages(library(scPNMF)) suppressPackageStartupMessages(library(SingleCellExperiment)) suppressPackageStartupMessages(library(umap)) suppressPackageStartupMessages(library(ggplot2))
scPNMF is a method to facilitate gene selection for targeted gene profiling by learning a sparse gene encoding of single cells. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types, with a small number, e.g., < 200 gene. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to help the prediction of cell types in the new data.
In this manual, we will demonstrate how to use r Githubpkg("JSB-UCLA/scPNMF")
package to select informative genes, and project datasets with limited gene numbers (e.g., targeted gene profiling) into a low-dimensional space.
Here we use zheng4
dataset as an example. The zheng4
dataset is stored as an SingleCellExperiment
object. Unlike other single-cell methods, scPNMF DO NOT require normalization (e.g., normalized by cell library size). The input data is simply the raw log-count matrix, where rows are genes and columns are cells.
data(zheng4, package = "scPNMF") Input_zheng4 <- logcounts(zheng4)
The input data contains r dim(Input_zheng4)[1]
genes and r dim(Input_zheng4)[2]
cells.
The first step is to run the Projective Nonnegative Matrix Factorization (PNMF). Briefly, PNMF takes advantages from both PCA and NMF (as its name!). Here is a table comparing PNMF with PCA and NMF.
Method|Optimization Problem|Non-negativity|Sparsity|Mutually Exclusiveness|New Data Projection ------|--------------------|--------------|--------|----------------------|------------------- PNMF | $\min\limits_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|\ s.t.\ \mathbf{W}\geq 0$ | Yes | Very high | Very high | Yes PCA | $\min\limits_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|\ s.t.\ \mathbf{W}^T\mathbf{W} = \mathbf{I}$ | No | Low | Low | Yes NMF | $\min\limits_{\mathbf{W}, \mathbf{H}} \|\mathbf{X} - \mathbf{W}\mathbf{H}\|\ s.t.\ \mathbf{W},\mathbf{H}\geq 0$ | Yes | High | High | No
We run the PNMF algorithm. $K$, the number of low dimension, is a key parameter that needs to be pre-specified by users. We proposed a metric $dev.ortho$ (normalized difference between $W^TW$ and $I$) in Section S1.1 of our scPNMF paper to select $K$, which is implemented in K_selection()
function. However, for the purpose of saving time as well as getting a stable result, we recommend the users to directly set $K=20$ on a typical scRNA-seq data. Here we use $K=15$ for demonstration purpose.
res_pnmf <- scPNMF::PNMFfun(X = Input_zheng4, K = 15, method="EucDist", tol=1e-4, maxIter=1000, verboseN = TRUE) W <- res_pnmf$Weight S <- res_pnmf$Score
The output is a list containing two matrices: the weight matrix (projection matrix) $\mathbf{W}$ and the score matrix (low-dimensional space) $\mathbf{S} = \mathbf{W}^T\mathbf{X}$. The two output matrices are very similar as those from PCA.
The second main step of scPNMF is to select informative bases among the $K$ bases found by PNMF in the last step(i.e., columns of $\mathbf{W}$ and rows of $\mathbf{S}$). The main goal is to remove unwanted variations of cells (e.g., variations irrelevant to cell types). There are three main strategies: functional annotations (optional); data-driven basis selection: correlations with cell library sizes and multimodality tests. We have designed functions for each strategy, and finally we show how to perform basis selection with regard to all the strategies by a combined function.
Due to the nice interpretability of PNMF, each basis briefly represents a funtional gene cluster. Users can perform Gene Ontology (GO) analysis to annoate each basis, and decide which bases they believe are useless and remove irrelevant bases. This is an OPTIONAL step since it requires prior knowledge from users. Here we demonstrate functional annotations on the first five bases by setting dim_use = 1:5
.
res_annotation <- scPNMF::basisAnnotate(W = W, dim_use = 1:5, id_type = "ENSEMBL", return_fig = TRUE) plot(res_annotation$p_comp)
The default basis selection procedure in scPNMF is purely data-driven. We have two basic assumptions: 1. A basis which is highly correlated with cell library size (here, total log-counts) should be removed since researchers usually treat cell library size as an unwanted variation; 2. A basis which shows strong multi-modal pattern should be kept since it indicates cell sub-population (e.g., cell type) structure.
The following function will perform the basis check and users can inspect the checking result by setting return_fig = TRUE
.
res_test <- scPNMF::basisTest(S = S, X = Input_zheng4, return_fig = TRUE, ncol = 5, mc.cores = 1)
The following wrapper function will select bases automatically.
W_select <- scPNMF::basisSelect(W = W, S = S, X=Input_zheng4, toTest = TRUE, toAnnotate = FALSE, mc.cores = 1) colnames(W_select)
scPNMF can work as a dimensionality reduction method. Empirically, the result is similar to PCA. Here we use UMAP plot to visualize the reduced space. The cells (dots) are colored by the originally given cell type labels.
set.seed(123) S_select <- t(Input_zheng4) %*% W_select cell_type <- colData(zheng4)$phenoid umap_select <- data.frame(umap(S_select)$layout, cell_type) colnames(umap_select) <- c("UMAP1", "UMAP2", "cell_type") ggplot(umap_select, aes(x=UMAP1, y=UMAP2)) + geom_point(size=0.1, alpha=1, aes(color=cell_type)) + ggtitle("UMAP Plot") + theme_bw() + theme(aspect.ratio=1, plot.title = element_text(face = "bold", size=12, hjust = 0.5), legend.position = "bottom") + guides(color = guide_legend(override.aes = list(size=5)))
The most important application of scPNMF is to select informative genes (often called highly-variable genes) in an unsupervised way. Compared to other gene selection methods, scPNMF works well with a small informative gene number budget. Here we set the pre-specified gene number as $M = 100$.
ig <- getInfoGene(W_select, M = 100, by_basis = FALSE, return_trunW = TRUE, dim_use = NULL) print(ig$InfoGene)
The returned object ig
contains a informative gene vector and a new projection matrix $\mathbf{W}_{S,(M)}$ which only relies on $M$ informative genes.
$\mathbf{W}_{S,(M)}$ can be used to project new data. Since it only uses $M$ genes, this projection works for new datasets with only $M$ genes (e.g., single-cell targeted gene profiling). Here we illustrate how the projection works with a subset of the original zheng4
data that only keeps $M$ genes to mimic the structure of new targeted gene profiling data.
S_new <- getProjection(Input_zheng4[ig$InfoGene, ], ig$trunW)
We still use a UMAP plot to visualize the low-dimensional space with $M=100$ genes.
set.seed(456) cell_type <- colData(zheng4)$phenoid umap_new <- data.frame(umap(S_new)$layout, cell_type) colnames(umap_new) <- c("UMAP1", "UMAP2", "cell_type") ggplot(umap_new, aes(x=UMAP1, y=UMAP2)) + geom_point(size=0.1, alpha=1, aes(color=cell_type)) + ggtitle("UMAP Plot on Newly Projected Data") + theme_bw() + theme(aspect.ratio=1, plot.title = element_text(face = "bold", size=12, hjust = 0.5), legend.position = "bottom") + guides(color = guide_legend(override.aes = list(size=5)))
If you meet problems when using scPNMF, please report it on issues. For more questions, email us!
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.