scNMF: single cell non-negative matrix factorization toolkit

scNMF is a toolkit for:

See the vignettes folder for a fast and gentle introduction to scNMF and a vignette to reproduce figures in the scNMF manuscript.

NMF cross-validation for optimal rank determination

scNMF introduces a new method for cross-validation based on the robustness of NMF models on a bipartition of the input. Specifically,

  1. the input matrix is split into halves by either rows or columns,
  2. NMF is run on both halves,
  3. factors in both models are paired based on bipartite matching on a cosine similarity graph,
  4. the angular distance between the two models is computed as the mean cosine distance over all matched factor pairs.
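
The four steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the scNMF implementation: the helper names (`simple_nmf`, `cosine_sim`, `greedy_match`, `bipartition_cv`) are hypothetical, the NMF solver is a plain multiplicative-update routine, and a greedy pairing stands in for a true optimal bipartite matching to avoid extra dependencies.

```r
# Hypothetical helpers for illustration only; none of these are scNMF functions.

# Basic NMF by multiplicative updates, so that A is approximated by W %*% H.
simple_nmf <- function(A, k, n_iter = 200, seed = 1) {
  set.seed(seed)
  W <- matrix(runif(nrow(A) * k), nrow(A), k)
  H <- matrix(runif(k * ncol(A)), k, ncol(A))
  eps <- 1e-9  # guards against division by zero
  for (i in seq_len(n_iter)) {
    H <- H * crossprod(W, A) / (crossprod(W) %*% H + eps)
    W <- W * tcrossprod(A, H) / (W %*% tcrossprod(H) + eps)
  }
  list(W = W, H = H)
}

# Cosine similarity between every column of X and every column of Y.
cosine_sim <- function(X, Y) {
  Xn <- sweep(X, 2, sqrt(colSums(X^2)), "/")
  Yn <- sweep(Y, 2, sqrt(colSums(Y^2)), "/")
  crossprod(Xn, Yn)
}

# Greedy one-to-one pairing: repeatedly take the most similar unmatched pair.
# Assumption: a simplification of optimal bipartite matching.
greedy_match <- function(S) {
  k <- nrow(S)
  matched <- numeric(k)
  for (i in seq_len(k)) {
    idx <- which(S == max(S), arr.ind = TRUE)[1, ]
    matched[i] <- S[idx[1], idx[2]]
    S[idx[1], ] <- -Inf  # remove the matched row
    S[, idx[2]] <- -Inf  # remove the matched column
  }
  matched
}

# Steps 1-4: split by columns, factor both halves, match factors in W,
# and return the mean cosine distance of the matched pairs.
bipartition_cv <- function(A, k, seed = 1) {
  set.seed(seed)
  cols <- sample(ncol(A))
  half <- floor(ncol(A) / 2)
  m1 <- simple_nmf(A[, cols[1:half]], k, seed = seed)
  m2 <- simple_nmf(A[, cols[(half + 1):ncol(A)]], k, seed = seed + 1)
  sims <- greedy_match(cosine_sim(m1$W, m2$W))
  mean(1 - sims)
}

# Usage on a small simulated non-negative matrix of rank 3.
set.seed(42)
A <- matrix(runif(200 * 3), 200, 3) %*% matrix(runif(3 * 100), 3, 100)
d <- sapply(2:5, function(k) bipartition_cv(A, k))
print(round(d, 4))
```

Because the split here is by columns, both models share the same rows, so the columns of `W` are the factors being compared. Splitting by rows would compare the rows of `H` instead.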

Here scNMF::nnmf.cv is run on several simulated datasets, and then scNMF::canyon.plot is used to visualize the results. The left dataset

R code to reproduce this figure in 1 minute

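library(scNMF)
library(NMF)      # syntheticNMF()
library(ggplot2)  # ggtitle(), labs(), theme(), ggsave()
library(cowplot)  # plot_grid(), ggdraw(), draw_label(), get_legend()
library(Seurat)   # NoLegend()

# simulate a noisy 5000 x 500 matrix of rank 10
syn <- syntheticNMF(5000, 10, 500, seed = 123, noise = TRUE)
# cross-validate ranks 3 to 25 with five random starts each, splitting by columns
cv <- nnmf.cv(syn, byrow = FALSE, k = seq(3,25,1), n.starts = 5, ribbon.confidence = 0.99)
p1 <- canyon.plot(cv)                   # one line per random start
p2 <- canyon.plot(cv, collapse = TRUE)  # average of all random starts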

wrapper <- function(x, ...) {paste(strwrap(x, ...), collapse = "\n")}

canyonplot <- plot_grid(
    ggdraw() + draw_label("NMF cross-validation on a simulated dataset of rank 10\nusing a measure of angular similarity between models", size = 14),
    plot_grid(
        p1 + NoLegend() + ggtitle("all random starts"),
        get_legend(p1),
        p2 + ggtitle("average of all random starts"),
        ncol = 3,
        rel_widths = c(1,0.1,1),
        labels = c("A","","B")
    ) + labs(caption  = wrapper("Result of `scNMF::nnmf.cv` plotted by `scNMF::canyon.plot` with (A) `collapse = FALSE` or (B) `collapse = TRUE`. Ribbon represents a 99% confidence interval about the five random starts based on a linear loess fit. The mean model angle is the sum of the angles of the factors, where the factors are matched to achieve the minimum possible overall model angle.", width = 160)) +
        theme(plot.caption = element_text(hjust = 0, size = 10)), 
    nrow = 2,
    rel_heights = c(0.2,1)
)

ggsave("canyonplot.png", plot = canyonplot, width = 10, height = 5, units = "in", dpi = "retina")

NMF cross-validation on a synthetic dataset

NMF on compressed transcriptional spaces is not robust

NMF cross-validation on the PBMC3k dataset (stopping criteria: min.dist = 0.01, min.cells = 5)

NMF cross-validation on the PBMC3k dataset (stopping criteria: min.cells = 5)

NMF cross-validation on the bmcite dataset (stopping criteria: min.cells = 10)

Similar results were obtained for the moca7k dataset and the entire MOCAE13.5 dataset.

The "smart split" maximizes signal redundancy between the two halves of the input, which in theory gives the objective function the greatest possible statistical power for measuring robustness. Note that smart split captured all of the information that the other five runs captured collectively, and with far less volatility in the signal. This means that "k-fold" cross-validation needs to be run only once with smart split.



zdebruine/scNMF documentation built on Jan. 1, 2021, 1:50 p.m.