In JSB-UCLA/scPNMF: scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

knitr::opts_chunk$set(tidy = FALSE, cache = TRUE, dev = "png",
                      message = FALSE, error = FALSE, warning = TRUE)

Introduction

suppressPackageStartupMessages(library(scPNMF))
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(umap))
suppressPackageStartupMessages(library(ggplot2))

scPNMF is a method to facilitate gene selection for targeted gene profiling by learning a sparse gene encoding of single cells. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types, with a small number, e.g., < 200 gene. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to help the prediction of cell types in the new data.

In this manual, we will demonstrate how to use r Githubpkg("JSB-UCLA/scPNMF") package to select informative genes, and project datasets with limited gene numbers (e.g., targeted gene profiling) into a low-dimensional space.

Setting Up the Data

Here we use zheng4 dataset as an example. The zheng4 dataset is stored as an SingleCellExperiment object. Unlike other single-cell methods, scPNMF DO NOT require normalization (e.g., normalized by cell library size). The input data is simply the raw log-count matrix, where rows are genes and columns are cells.

data(zheng4, package = "scPNMF")
Input_zheng4 <- logcounts(zheng4)

The input data contains r dim(Input_zheng4)[1] genes and r dim(Input_zheng4)[2] cells.

Step I: PNMF

The first step is to run the Projective Nonnegative Matrix Factorization (PNMF). Briefly, PNMF takes advantages from both PCA and NMF (as its name!). Here is a table comparing PNMF with PCA and NMF.

Method|Optimization Problem|Non-negativity|Sparsity|Mutually Exclusiveness|New Data Projection ------|--------------------|--------------|--------|----------------------|------------------- PNMF | $\min\limits_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|\ s.t.\ \mathbf{W}\geq 0$ | Yes | Very high | Very high | Yes PCA | $\min\limits_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|\ s.t.\ \mathbf{W}^T\mathbf{W} = \mathbf{I}$ | No | Low | Low | Yes NMF | $\min\limits_{\mathbf{W}, \mathbf{H}} \|\mathbf{X} - \mathbf{W}\mathbf{H}\|\ s.t.\ \mathbf{W},\mathbf{H}\geq 0$ | Yes | High | High | No

We run the PNMF algorithm. $K$, the number of low dimension, is a key parameter that needs to be pre-specified by users. We proposed a metric $dev.ortho$ (normalized difference between $W^TW$ and $I$) in Section S1.1 of our scPNMF paper to select $K$, which is implemented in K_selection() function. However, for the purpose of saving time as well as getting a stable result, we recommend the users to directly set $K=20$ on a typical scRNA-seq data. Here we use $K=15$ for demonstration purpose.

res_pnmf <- scPNMF::PNMFfun(X = Input_zheng4,
                            K = 15, method="EucDist", tol=1e-4, maxIter=1000, verboseN = TRUE)
W <- res_pnmf$Weight
S <- res_pnmf$Score

The output is a list containing two matrices: the weight matrix (projection matrix) $\mathbf{W}$ and the score matrix (low-dimensional space) $\mathbf{S} = \mathbf{W}^T\mathbf{X}$. The two output matrices are very similar as those from PCA.

Step II: Basis Selection

The second main step of scPNMF is to select informative bases among the $K$ bases found by PNMF in the last step(i.e., columns of $\mathbf{W}$ and rows of $\mathbf{S}$). The main goal is to remove unwanted variations of cells (e.g., variations irrelevant to cell types). There are three main strategies: functional annotations (optional); data-driven basis selection: correlations with cell library sizes and multimodality tests. We have designed functions for each strategy, and finally we show how to perform basis selection with regard to all the strategies by a combined function.

Functional Annotations

Due to the nice interpretability of PNMF, each basis briefly represents a funtional gene cluster. Users can perform Gene Ontology (GO) analysis to annoate each basis, and decide which bases they believe are useless and remove irrelevant bases. This is an OPTIONAL step since it requires prior knowledge from users. Here we demonstrate functional annotations on the first five bases by setting dim_use = 1:5.

res_annotation <- scPNMF::basisAnnotate(W = W, 
                                        dim_use = 1:5,
                                        id_type = "ENSEMBL",
                                        return_fig = TRUE)
plot(res_annotation$p_comp)

Data-driven Selection (Correlations with Cell Library Sizes and Multimodality Tests)

The default basis selection procedure in scPNMF is purely data-driven. We have two basic assumptions: 1. A basis which is highly correlated with cell library size (here, total log-counts) should be removed since researchers usually treat cell library size as an unwanted variation; 2. A basis which shows strong multi-modal pattern should be kept since it indicates cell sub-population (e.g., cell type) structure.

The following function will perform the basis check and users can inspect the checking result by setting return_fig = TRUE.

res_test <- scPNMF::basisTest(S = S, X = Input_zheng4,
                              return_fig = TRUE, ncol = 5,
                              mc.cores = 1)

Combined Basis Selection Function

The following wrapper function will select bases automatically.

W_select <- scPNMF::basisSelect(W = W, S = S,
                                X=Input_zheng4, toTest = TRUE, toAnnotate = FALSE, mc.cores = 1)
colnames(W_select)

Application I: Dimensionality Reduction

scPNMF can work as a dimensionality reduction method. Empirically, the result is similar to PCA. Here we use UMAP plot to visualize the reduced space. The cells (dots) are colored by the originally given cell type labels.

set.seed(123)
S_select <- t(Input_zheng4) %*% W_select
cell_type <- colData(zheng4)$phenoid
umap_select <- data.frame(umap(S_select)$layout, cell_type)
colnames(umap_select) <- c("UMAP1", "UMAP2", "cell_type")

ggplot(umap_select, aes(x=UMAP1, y=UMAP2)) + geom_point(size=0.1, alpha=1, aes(color=cell_type)) +
  ggtitle("UMAP Plot") + theme_bw() +
  theme(aspect.ratio=1, plot.title = element_text(face = "bold", size=12, hjust = 0.5), legend.position = "bottom") +
  guides(color = guide_legend(override.aes = list(size=5)))

Application II: Informative Gene Selection

The most important application of scPNMF is to select informative genes (often called highly-variable genes) in an unsupervised way. Compared to other gene selection methods, scPNMF works well with a small informative gene number budget. Here we set the pre-specified gene number as $M = 100$.

ig <- getInfoGene(W_select, M = 100, by_basis = FALSE, return_trunW = TRUE, dim_use = NULL)
print(ig$InfoGene)

The returned object ig contains a informative gene vector and a new projection matrix $\mathbf{W}_{S,(M)}$ which only relies on $M$ informative genes.

Application III: Data Projection

$\mathbf{W}_{S,(M)}$ can be used to project new data. Since it only uses $M$ genes, this projection works for new datasets with only $M$ genes (e.g., single-cell targeted gene profiling). Here we illustrate how the projection works with a subset of the original zheng4 data that only keeps $M$ genes to mimic the structure of new targeted gene profiling data.

S_new <- getProjection(Input_zheng4[ig$InfoGene, ], ig$trunW)

We still use a UMAP plot to visualize the low-dimensional space with $M=100$ genes.

set.seed(456)
cell_type <- colData(zheng4)$phenoid
umap_new <- data.frame(umap(S_new)$layout, cell_type)
colnames(umap_new) <- c("UMAP1", "UMAP2", "cell_type")

ggplot(umap_new, aes(x=UMAP1, y=UMAP2)) + geom_point(size=0.1, alpha=1, aes(color=cell_type)) +
  ggtitle("UMAP Plot on Newly Projected Data") + theme_bw() +
  theme(aspect.ratio=1, plot.title = element_text(face = "bold", size=12, hjust = 0.5), legend.position = "bottom") +
  guides(color = guide_legend(override.aes = list(size=5)))

Getting help

If you meet problems when using scPNMF, please report it on issues. For more questions, email us!

Session information

sessionInfo()

References

JSB-UCLA/scPNMF documentation built on July 19, 2023, 10:28 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

JSB-UCLA/scPNMF
scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

In JSB-UCLA/scPNMF: scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

Introduction

Setting Up the Data

Step I: PNMF

Step II: Basis Selection

Functional Annotations

Data-driven Selection (Correlations with Cell Library Sizes and Multimodality Tests)

Combined Basis Selection Function

Application I: Dimensionality Reduction

Application II: Informative Gene Selection

Application III: Data Projection

Getting help

Session information

References

R Package Documentation

Browse R Packages

We want your feedback!

JSB-UCLA/scPNMF scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

In JSB-UCLA/scPNMF: scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

Introduction

Setting Up the Data

Step I: PNMF

Step II: Basis Selection

Functional Annotations

Data-driven Selection (Correlations with Cell Library Sizes and Multimodality Tests)

Combined Basis Selection Function

Application I: Dimensionality Reduction

Application II: Informative Gene Selection

Application III: Data Projection

Getting help

Session information

References

R Package Documentation

Browse R Packages

We want your feedback!

JSB-UCLA/scPNMF
scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling