The ccfindR (Cancer Clone findeR) package [@woo_etal] contains implementations and utilities for analyzing single-cell RNA-sequencing data, including quality control, unsupervised clustering for discovery of cell types, and visualization of the outcomes. It is especially suitable for the analysis of transcript-count data utilizing unique molecular identifiers (UMIs), e.g., data derived from the 10x Genomics platform. In these data sets, RNA counts are non-negative integers, enabling clustering using non-negative matrix factorization (NMF) [@lee_seung].
Input data are UMI counts in the form of a matrix with genetic features ("genes") in rows and cells (tagged by barcodes) in columns, produced by read alignment and counting pipelines. The count matrix and associated gene and cell annotation files are bundled into a main object of class scNMFSet, which extends the SingleCellExperiment class [http://dx.doi.org/10.18129/B9.bioc.SingleCellExperiment].
Quality control for both cells and genes can be performed via filtering steps based on UMI counts and the variance of expression, respectively. The NMF factorization is first performed for multiple values of the rank (the reduced dimension of the factorization) to find its most likely value. A production run for the chosen rank then leads to factor matrices, allowing the user to identify and visualize genes representative of clusters and to assign cells to clusters.
The NMF approach offers a means to identify cell subtypes and classify individual cells into these clusters based on expression counts. In contrast to alternatives such as principal component analysis [@hastie_etal], NMF leverages the non-negative nature of count data and factorizes the data matrix $\sf X$ into two factor matrices $\sf W$ and $\sf H$ [@lee_seung]:
\begin{equation} \sf{X} \sim {\sf W}{\sf H}. \end{equation}
If $\sf X$ is a $p\times n$ matrix ($p$ genes and $n$ cells), the basis matrix $\sf W$ and the coefficient matrix $\sf H$ have dimensions $p \times r$ and $r \times n$, respectively, where the rank $r$ is a relatively small integer. A statistical-inference interpretation of NMF views $X_{ij}$ as a realization of a Poisson distribution with the mean for each matrix element given by $({\sf WH})_{ij}\equiv \Lambda_{ij}$, or
\begin{equation} \Pr(x_{ij})=\frac{e^{-\Lambda_{ij}}{\Lambda_{ij}}^{x_{ij}}} {\Gamma(1+x_{ij})}. \end{equation}
The maximum likelihood inference of the latter is then achieved by maximizing
\begin{equation} L = \sum_{ij} \left(X_{ij} \ln \frac{\Lambda_{ij}}{X_{ij}}- \Lambda_{ij}+X_{ij}\right). \end{equation}
The Kullback-Leibler measure of the distance between $\sf X$ and $\sf \Lambda$, which is minimized, is equal to $-L$. Lee and Seung's update rule [@lee_seung] solves this optimization task iteratively.
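As an illustration of these update rules, a minimal R sketch is given below; the toy matrix, the rank, and the fixed number of iterations are arbitrary choices, and ccfindR's factorize() (introduced later) is the production implementation.

```r
# Illustrative sketch of the Lee-Seung multiplicative updates for the
# Poisson/KL objective above; not the package's internal implementation
set.seed(1)
X <- matrix(rpois(200, lambda = 2), nrow = 10)   # toy data: 10 genes x 20 cells
r <- 3                                           # rank
W <- matrix(runif(nrow(X) * r), nrow(X), r)      # random initialization
H <- matrix(runif(r * ncol(X)), r, ncol(X))
for (step in 1:200) {
  H <- H * sweep(t(W) %*% (X / (W %*% H)), 1, colSums(W), "/")   # update H
  W <- W * sweep((X / (W %*% H)) %*% t(H), 2, rowSums(H), "/")   # update W
}
Lambda <- W %*% H
L <- sum(ifelse(X > 0, X * log(Lambda / X), 0) - Lambda + X)     # log likelihood L
```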
While this classical iterative update algorithm for finding the basis and coefficient factors of the count matrix is also included, the main workhorse in ccfindR is the variational Bayesian inference algorithm proposed by Cemgil [@cemgil]. The key features distinguishing ccfindR [@woo_etal] from other existing implementations -- NMF for generic data [@gaujoux_seoighe] and NMFEM for single-cell analysis [@zhu_etal] -- are the Bayesian determination of the factorization rank and the construction of trees relating clusters found at different ranks, both described below.
In particular, a traditional way (in maximum likelihood inference) to determine the rank is to evaluate factorization quality measures (and optionally compare them with those from randomized data). The Bayesian formulation of NMF instead incorporates priors for the factor matrix elements of $\sf W$ and $\sf H$, modeled by gamma distributions. Inference can be combined with hyperparameter updates to optimize the marginal likelihood (ML; the conditional probability of the data given hyperparameters and rank), which provides a statistically well-controlled means to determine the optimal rank describing the data.
For large rank values, the clusters identified can be challenging to interpret. To facilitate biological interpretation, we provide a procedure in which the cluster assignment of cells is repeated for multiple rank values, typically ranging from 2 to the optimal rank, and a phylogenetic tree connecting different clusters at neighboring rank values is constructed. This tree gives an overview of the different types of cells present in the system, viewed at varying resolution.
We illustrate a typical workflow with a single-cell count data set generated from peripheral blood mononuclear cell (PBMC) data [https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/]. The particular data set used below was created by sampling from five purified immune cell subsets.
To install the package, do the following:
```r
BiocManager::install('ccfindR')
```
After installation, load the package by
```r
library(ccfindR)
```
The input data can be a simple matrix:
```r
# A toy matrix for count data
set.seed(1)
mat <- matrix(rpois(n = 80, lambda = 2), nrow = 4, ncol = 20)
ABC <- LETTERS[1:4]
abc <- letters[1:20]
rownames(mat) <- ABC
colnames(mat) <- abc
```
The main S4 object containing data and subsequent analysis outcomes is of class scNMFSet, created by

```r
# create scNMFSet object
sc <- scNMFSet(count = mat)
```
This class extends the SingleCellExperiment class, adding extra slots for storing factorization outcomes. In particular, the assays, rowData, and colData slots of the SingleCellExperiment class are used to store the RNA count matrix, gene, and cell annotation data frames, respectively. In the simplest initialization above, the named argument count is used as the count matrix and is equivalent to
```r
# create scNMFSet object
sc <- scNMFSet(assays = list(counts = mat))
```
See the SingleCellExperiment documentation for more details of these main slots. For instance, row and column names can be stored by

```r
# set row and column names
suppressMessages(library(S4Vectors))
genes <- DataFrame(ABC)
rownames(genes) <- ABC
cells <- DataFrame(abc)
rownames(cells) <- abc
sc <- scNMFSet(count = mat, rowData = genes, colData = cells)
sc
```
Alternatively, a sparse matrix format (of class dgCMatrix) can be used. A MatrixMarket format file can be read directly:

```r
# read sparse matrix
dir <- system.file('extdata', package = 'ccfindR')
mat <- Matrix::readMM(paste0(dir, '/matrix.mtx'))
rownames(mat) <- 1:nrow(mat)
colnames(mat) <- 1:ncol(mat)
sc <- scNMFSet(count = mat, rowData = DataFrame(1:nrow(mat)),
               colData = DataFrame(1:ncol(mat)))
sc
```
The number of rows in assays$counts must match the number of rows in rowData, and the number of columns in assays$counts must match the number of rows in colData.
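These requirements can be checked directly with the accessors used throughout this vignette; a minimal sanity check on the object constructed above might read:

```r
# Sanity check: annotation dimensions must be consistent with the count matrix
stopifnot(nrow(counts(sc)) == nrow(rowData(sc)),
          ncol(counts(sc)) == nrow(colData(sc)))
```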
The gene and barcode meta-data and count files resulting from 10x Genomics' Cell Ranger pipeline (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) can also be read:
```r
# read 10x files
sc <- read_10x(dir = dir, count = 'matrix.mtx', genes = 'genes.tsv',
               barcodes = 'barcodes.tsv')
sc
```
The parameter dir is the directory containing the files. The filenames above are the defaults and can be omitted. The function returns an scNMFSet object. By default, any row or column consisting entirely of zeros in counts, and the corresponding elements in the rowData and colData slots, will be removed. This feature can be turned off by remove.zeros = FALSE.
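As a brief illustration (assuming, as stated above, that remove.zeros is accepted by read_10x), the same data could be read with the default filenames while keeping empty rows and columns:

```r
# Equivalent call relying on the default filenames; keep all-zero rows/columns
sc_all <- read_10x(dir = dir, remove.zeros = FALSE)
```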
For quality control, cells and genes can be filtered manually using the normal subsetting syntax of R: the slots in the object sc are accessed and edited using accessors and subsetting rules; see SingleCellExperiment:

```r
# slots and subsetting
counts(sc)[1:7, 1:3]
head(rowData(sc))
head(colData(sc))
sc2 <- sc[1:20, 1:70]     # subsetting of object
sc2 <- remove_zeros(sc2)  # remove empty rows/columns
sc2
```
We provide two streamlined functions, one for cell filtering and one for gene filtering, illustrated below:
```r
sc <- filter_cells(sc, umi.min = 10^2.6, umi.max = 10^3.4)
```

```r
markers <- c('CD4', 'CD8A', 'CD8B', 'CD19', 'CD3G', 'CD3D', 'CD3Z', 'CD14')
sc0 <- filter_genes(sc, markers = markers, vmr.min = 1.5,
                    min.cells.expressed = 50, rescue.genes = FALSE)
```
The function filter_cells() plots a histogram of UMI counts for all cells when called without threshold parameters (Fig. \@ref(fig:cells)). This plot can be used to set suitable thresholds, umi.min and umi.max; cells with UMI counts outside this range will be filtered out. The function filter_genes() displays a scatter plot of the total number of cells with nonzero counts and the VMR (variance-to-mean ratio) for each gene (Fig. \@ref(fig:genes)). In both plots, selected cells and genes are shown in red. Note that the thresholds in the above example are overly stringent, which is intended to speed up the subsequent illustrative runs.
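In practice, one might first call filter_cells() without thresholds, as described above, to inspect the histogram before choosing the cutoffs; a minimal sketch:

```r
# Called without threshold parameters, filter_cells() simply plots the
# UMI-count histogram, which helps in choosing umi.min and umi.max
filter_cells(sc)
```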
A list of pre-selected marker genes can be provided to help identify clusters via the markers parameter in filter_genes(). Here, we use a set of classical PBMC marker genes (shown in orange).
Gene filtering can also be augmented by scanning for genes whose count distributions among cells are non-trivial: most genes have their maximum at zero count, while some have one or more distinct peaks at nonzero count values. Such peaks may signify the existence of groups of cells in which the genes are expressed in a distinguishable fashion. With rescue.genes = TRUE, the selection made by filter_genes() is the union of the threshold-based group and the genes with such nonzero-count modes:
```r
sc_rescue <- filter_genes(sc, markers = markers, vmr.min = 1.5,
                          min.cells.expressed = 50, rescue.genes = TRUE,
                          progress.bar = FALSE)
```
This "gene rescue" scan will take some time and a progress bar is
displayed if progress.bar = TRUE
.
For subsequent analysis, we will use the latter selection and also name rows with gene symbols:
```r
rownames(sc_rescue) <- rowData(sc_rescue)[, 2]
sc <- sc_rescue
```
The main function for maximum likelihood NMF on a count matrix is factorize(). It performs a series of iterative updates to the matrices $\sf W$ and $\sf H$.
Since the global optimum of the likelihood function is not directly accessible, computational inference relies on local maxima, which depend on the initialization. We adopt a randomized initialization scheme, in which the factor matrix elements are drawn from uniform distributions. To make the inference reproducible, one can set the random number seed by set.seed(seed), where seed is a positive integer, prior to calling factorize(). Updates continue until convergence is reached, defined either by the fractional change in likelihood being smaller than Tol (criterion = 'likelihood') or by a set number (ncnn.step) of steps during which the connectivity matrix remains unchanged (criterion = 'connectivity'). The connectivity matrix $\sf C$ is a symmetric $n\times n$ matrix with elements $C_{jl}=1$ if cells $j$ and $l$ belong to the same cluster and $0$ otherwise. Cluster membership is determined dynamically by finding, for each cell indexed by $j$, the row index $k$ for which the coefficient matrix element $H_{kj}$ is maximal.
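For illustration only, the connectivity matrix defined above can be written down directly from a coefficient matrix; this sketch is not the package's internal implementation:

```r
# Sketch of the connectivity matrix: cells j and l are connected
# (C[j,l] = 1) when the same row of H is maximal for both
connectivity <- function(H) {
  membership <- apply(H, 2, which.max)           # cluster index for each cell
  1 * outer(membership, membership, FUN = "==")  # n x n matrix of 0s and 1s
}
```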
During iteration with verbose = 3, the step number, the log likelihood per matrix element, and the number of entries in the upper-diagonal part of $\sf C$ that changed from the previous step are printed:
```r
set.seed(1)
sc <- factorize(sc, ranks = 3, nrun = 1, ncnn.step = 1,
                criterion = 'connectivity', verbose = 3)
```
The function factorize() returns the same object sc with extra slots: ranks (the rank value for which factorization was performed), basis (a list containing the basis matrix $\sf W$), coeff (a list containing the coefficient matrix $\sf H$), and measure (a data frame containing the factorization quality measures; see below). The criterion used to stop the iteration is either connectivity (no changes to the connectivity matrix for ncnn.step steps) or likelihood (changes in likelihood smaller than Tol).
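These slots can be inspected with the accessor functions used later in this vignette; for the single-rank run above, for example:

```r
# Inspect the stored results of the rank-3 run above
dim(basis(sc)[[1]])   # p x 3 basis matrix W
dim(coeff(sc)[[1]])   # 3 x n coefficient matrix H
measure(sc)
```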
To reduce the dependence of the final estimates for $\sf W$ and $\sf H$ on the initial guess, inference needs to be repeated with many different initializations:
```r
sc <- factorize(sc, ranks = 3, nrun = 5, verbose = 2)
```
After each run, the likelihood and dispersion are printed, and the global maximum of the likelihood as well as the corresponding matrices $\sf W$ and $\sf H$ are stored. The dispersion $\rho$ is a scalar measure of how close the elements of the consistency matrix $\sf{\bar C}\equiv {\rm Mean}({\sf C})$, where ${\sf C}$ is the connectivity matrix and the mean is taken over multiple runs, are to the binary values $0,1$:
$$ \rho=\frac{4}{n^2}\sum_{jl}\left({\bar C}_{jl}-1/2\right)^2.$$
Note in the output above that $\rho$ decays from 1 as the number of runs increases and then stabilizes. This degree of convergence of $\rho$ is a good indication of the adequacy of nrun. The cophenetic correlation is the correlation between the distance $1-{\sf{\bar C}}$ and the height matrix of hierarchical clustering [@brunet_etal].
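For illustration, the dispersion defined above, and a plausible form of the cophenetic correlation, can be computed from a consistency matrix as follows; the average-linkage choice in the sketch is an assumption, not necessarily what ccfindR uses internally:

```r
# Sketch: dispersion of a consistency matrix Cbar (mean of connectivity
# matrices over runs); values near 1 indicate bimodal (0/1) entries
dispersion <- function(Cbar) {
  4 / nrow(Cbar)^2 * sum((Cbar - 0.5)^2)
}
# Sketch: cophenetic correlation from hierarchical clustering of 1 - Cbar;
# the "average" linkage here is an assumption
cophenetic_corr <- function(Cbar, method = "average") {
  d <- stats::as.dist(1 - Cbar)
  cor(c(d), c(stats::cophenetic(stats::hclust(d, method = method))))
}
```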
To discover clusters of cells, the reduced dimensionality of the factorization, or the rank $r$, must be estimated. The examples above used a single rank value. If the parameter ranks is a vector, the set of inferences will be repeated for each rank value:
```r
sc <- factorize(sc, ranks = seq(3,7), nrun = 5, verbose = 1,
                progress.bar = FALSE)
```
Note that the nrun parameter above is set to a small value for illustration. In a real application, typical values of nrun would be larger. The progress bar shown by default under verbose = 1 for the overall nrun runs is turned off above. It can be set to TRUE here (and below) to monitor the progress.
After factorization, the measure slot has been filled:

```r
measure(sc)
```
These measures can be plotted (Fig. \@ref(fig:measure)):

```{R measure, fig.large=TRUE, fig.cap='Factorization quality measures as functions of the rank. Dispersion measures the degree of bimodality in consistency matrix. Cophenetic correlation measures the degree of agreement between consistency matrix and hierarchical clustering.'}
plot(sc)
```
## Bayesian NMF

The maximum likelihood-based inference must rely on quality measures to choose the optimal rank. Bayesian NMF allows for the statistical comparison of different models, namely those with different ranks. The quantity compared is the log probability (ML or "evidence") of the data conditional on models (defined by rank and hyperparameters). The main function for Bayesian factorization is `vb_factorize()`:

```r
sb <- sc_rescue
set.seed(2)
sb <- vb_factorize(sb, ranks = 3, verbose = 3, Tol = 2e-4,
                   hyper.update.n0 = 5)
```
The iteration maximizes the log ML (per matrix element) and terminates when its fractional change becomes smaller than Tol. The option criterion = 'connectivity' can also be used. By default, hyperparameters of the priors are also updated after hyper.update.n0 steps.
As in maximum likelihood, multiple ranks can be specified:
```r
sb <- vb_factorize(sb, ranks = seq(2,7), nrun = 5, verbose = 1, Tol = 1e-4,
                   progress.bar = FALSE)
```
With nrun larger than 1, multiple inferences will be performed for each rank with different initial conditions, and the solution with the highest ML will be chosen. The object after a vb_factorize run will have its measure slot filled:

```r
measure(sb)
```
For smaller sample sizes under larger rank values, columns of the basis matrices may turn out to be uniform, which signifies that the corresponding cluster is redundant. By default (unif.stop = TRUE), if such a uniform column is found in the basis matrix, the rank scan will terminate with a warning. The last column of measure, named nunif, counts the number of such uniform columns found when run with unif.stop = FALSE.
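As a rough illustration of what a "uniform" column means, one could flag basis-matrix columns whose entries barely vary; the coefficient-of-variation threshold below is arbitrary and the package's internal criterion may differ:

```r
# Count basis-matrix columns that are (nearly) uniform, i.e., whose entries
# vary little around their mean; threshold is an arbitrary illustration
W5 <- basis(sb)[ranks(sb) == 5][[1]]
n_uniform <- sum(apply(W5, 2, function(w) stats::sd(w) / mean(w) < 1e-3))
n_uniform
```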
Plotting the object displays the log ML as a function of rank (Fig. \@ref(fig:lml)):
```r
plot(sb)
```
The optimal rank is estimated from the rank-evidence profile by:
```r
optimal_rank(sb)
```
The heterogeneity class (type I or II) distinguishes cases where there is a clear and finite optimal rank (type I) from those where the evidence asymptotically reaches a maximal level (type II) [@woo_etal].
The rank scan above using Bayesian inference correctly identifies $r=5$ as the optimal rank. The fit results for each rank, from either maximum likelihood or Bayesian inference, are stored in sb@basis and sb@coeff. Both are lists of matrices with length equal to the number of rank values scanned. One can access them by, e.g.,

```r
ranks(sb)
head(basis(sb)[ranks(sb) == 5][[1]])  # basis matrix W for rank 5
```
Heatmaps of the $\sf W$ and $\sf H$ matrices are displayed by feature_map() and cell_map(), respectively (Figs. \@ref(fig:sb)-\@ref(fig:sc)):
```r
feature_map(sb, markers = markers, rank = 5, max.per.cluster = 4,
            gene.name = rowData(sb)[,2], cexRow = 0.7)
```
In addition to the marker gene list provided as a parameter, representative groups of genes for the clusters are selected by the "max" scheme [@carmona-saez_etal]: genes are sorted for each cluster by decreasing magnitude of the basis matrix elements, and among the top members of each list, those for which the magnitude is the actual maximum over all clusters are chosen. Based on the marker-metagene map in Fig. \@ref(fig:sb), we rename clusters 1-5 as follows:

```r
cell_type <- c('B_cell','CD8+_T','CD4+_T','Monocytes','NK')
colnames(basis(sb)[ranks(sb) == 5][[1]]) <- cell_type
rownames(coeff(sb)[ranks(sb) == 5][[1]]) <- cell_type
```
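A minimal sketch of the "max" selection scheme described above, operating on a basis matrix with genes in rows and clusters in columns; the selection used internally by feature_map() may differ in detail:

```r
# For each cluster, take the top genes by basis-matrix magnitude and keep
# only those whose maximum across clusters falls in that cluster
max_scheme <- function(W, n.top = 4) {
  lapply(seq_len(ncol(W)), function(k) {
    top <- order(W[, k], decreasing = TRUE)[seq_len(n.top)]   # top genes in cluster k
    top[apply(W[top, , drop = FALSE], 1, which.max) == k]     # keep if max is in k
  })
}
# Row indices of representative genes for each of the five clusters
max_scheme(basis(sb)[ranks(sb) == 5][[1]], n.top = 4)
```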
```r
cell_map(sb, rank = 5)
```
In visualize_clusters(), each column of the $\sf H$ matrix is used to assign cells into clusters, and inter- and intra-cluster separations are visualized using the t-SNE algorithm [@van_der_maaten_hinton], via the Rtsne() function of the Rtsne package. A barplot of cluster cell counts is also displayed (Fig. \@ref(fig:tsne)):
```r
visualize_clusters(sb, rank = 5, cex = 0.7)
```
It is useful to extract hierarchical relationships among the clusters identified. This feature requires a series of inference outcomes for an uninterrupted range of rank values, e.g., from 2 to 7:
```r
tree <- build_tree(sb, rmax = 5)
tree <- rename_tips(tree, rank = 5, tip.labels = cell_type)
plot_tree(tree, cex = 0.8, show.node.label = TRUE)
```
The build_tree() function returns a list containing the tree. The second command above renames the labels of the terminal nodes with our cell type labels.
In Fig. \@ref(fig:tree), the relative distance between clusters can be seen to be consistent with the tSNE plot in Fig. \@ref(fig:tsne).