nnlm.cv: Cross-validation for NMF


Description

NMF cross-validation for rank determination, based on the angle between factorizations of the two halves of a bipartitioned dataset

Usage

nnlm.cv(
  A,
  byrow = TRUE,
  k = seq(from = 5, to = 20, by = 2),
  max.iter = 1000,
  rel.tol = 0.001,
  n.threads = 0,
  verbose = 1,
  trace = 5,
  seed = 123,
  n.starts = 1,
  alpha = c(0, 0, 0.5),
  beta = c(0, 0, 0.5),
  return.models = FALSE,
  smart.split = TRUE,
  smart.split.block.size = 200,
  reduction = "dclus",
  dist.method = "cosine"
)

Arguments

A

A matrix to be factorized (i.e. result from average.expression) or a Seurat object with cluster centers in a dimensional reduction slot. If sparse, will be coerced to dense format. If the entire data should be used from the Seurat object, specify reduction = NULL.

byrow

Bipartition by rows rather than columns (default TRUE)

k

Array of integer ranks (default seq(from = 5, to = 20, by = 2))

max.iter

Maximum number of alternating NNLS solves (default 1000)

rel.tol

Stopping criterion for each NNMF run, defined as the relative change in error between two successive iterations: |e2-e1|/avg(e1,e2) (default 1e-3). A looser tolerance such as 1e-2 may be useful for faster, coarse-grained preliminary analysis of large datasets, while small datasets may benefit from a tighter tolerance such as 1e-4.
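As a small illustration of the formula above, the relative change between two successive error values can be computed in base R. The helper name rel_change is hypothetical and is not part of scNMF; it only restates |e2-e1|/avg(e1,e2).

```r
# Hypothetical helper (not part of scNMF): relative change between two
# successive error values, |e2 - e1| / avg(e1, e2).
rel_change <- function(e1, e2) {
  abs(e2 - e1) / mean(c(e1, e2))
}

rel.tol <- 1e-3
rel_change(0.5000, 0.4990) < rel.tol  # FALSE: relative change is about 2e-3
rel_change(0.5000, 0.4999) < rel.tol  # TRUE: relative change is about 2e-4
```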

n.threads

Number of threads/CPUs to use (default is 0, for all cores)

verbose

0 = no tracking, 1 = progress bars for each n.starts, 2 = message for each factorization, 3 = all the details for each factorization

trace

An integer specifying the interval, in ANLS NNMF iterations, at which the mean squared error (MSE) should be calculated and checked for convergence against rel.tol. To check error every iteration, specify 1. To avoid checking error entirely, specify trace > max.iter. The default of 5 is generally an efficient and effective value. For particularly sparse or heterogeneous datasets that require hundreds of ANLS iterations, setting trace to 10 or 20 may speed up the calculation slightly.
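The interaction between trace, max.iter and rel.tol can be sketched with a toy loop. This is not scNMF code: run_with_trace and err_at are hypothetical names, and the decaying error curve is a stand-in for the MSE of a real factorization.

```r
# Toy illustration (not scNMF code): the error is only evaluated on
# iterations that are multiples of `trace`.
run_with_trace <- function(max.iter = 1000, rel.tol = 1e-3, trace = 5) {
  err_at <- function(i) 1 + 10 * 0.9^i  # stand-in for the MSE after iteration i
  prev <- NULL
  n_checks <- 0
  for (i in seq_len(max.iter)) {
    # ... one alternating update would happen here ...
    if (i %% trace == 0) {
      cur <- err_at(i)
      n_checks <- n_checks + 1
      if (!is.null(prev) &&
          abs(cur - prev) / mean(c(cur, prev)) < rel.tol) {
        return(list(iterations = i, checks = n_checks, converged = TRUE))
      }
      prev <- cur
    }
  }
  list(iterations = max.iter, checks = n_checks, converged = FALSE)
}

res <- run_with_trace(trace = 5)
res$iterations %% 5 == 0                          # TRUE: stops on a multiple of trace
run_with_trace(max.iter = 20, trace = 21)$checks  # 0: error is never checked
```

A larger trace means fewer error evaluations, at the cost of possibly running a few iterations past the point where the tolerance was first met.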

seed

Random seed for reproducibility.

n.starts

Number of random starts, each run at all given values of k for a unique set of indices (default 1)

return.models

Boolean, should W and H matrices be returned for each run (default FALSE). W and H matrices can take up significant memory in large cross-validation experiments.

smart.split

Boolean, whether to use smart.split to determine indices if n.starts = 1. Smart split maximizes the signal redundancy between the bipartition of the dataset to achieve optimal cross-validation results. Generally, a single run of smart.split is as informative as multiple runs on random subsets. TRUE by default.

smart.split.block.size

Integer, default 200. Smaller is faster; larger achieves better separation of redundant features. Block size sets how many features bipartite matching is run on at a time; the bipartite graph solver is the rate-limiting component. When block size is small, the similarity of matched features will be lower. When block size is large, the similarity of matched features will be higher and the cross-validation result may be better.

reduction

If Seurat object is provided, specify a reduction to use feature loadings (i.e. cluster centers), otherwise specify NULL to use the entire counts matrix from the default assay ("dclus" by default).

dist.method

"cosine" (default) or "bhjattacharyya" (alternative) for computing distances between clusters and a similarity graph. In exceptionally sparse datasets, bhjattacharyya distance can outperform cosine distance.

Details

nnlm.cv splits the dataset into non-overlapping halves by either row or column and runs NMF on both halves at each rank in k. Factors in the two NMF models are matched one-to-one by cosine similarity, and the angle between the models is calculated as the mean of the angles between matched factors. The rank k with the minimum mean angle is the rank at which the latent space is most robust.
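The matching and angle computation described above can be sketched in base R for two factor matrices whose columns are factors. This is a minimal sketch under stated assumptions: cosine_sim and mean_model_angle are hypothetical helper names, and the greedy one-to-one matching is a simplification chosen for brevity; the matching algorithm actually used by scNMF may differ.

```r
# Sketch only: greedy one-to-one matching is an assumption made for brevity;
# the matching algorithm used by scNMF may differ.
cosine_sim <- function(W1, W2) {
  # Columns are factors; scale each column to unit length.
  n1 <- sweep(W1, 2, sqrt(colSums(W1^2)), "/")
  n2 <- sweep(W2, 2, sqrt(colSums(W2^2)), "/")
  crossprod(n1, n2)  # k x k matrix of cosine similarities
}

mean_model_angle <- function(W1, W2) {
  sim <- cosine_sim(W1, W2)
  k <- ncol(sim)
  matched <- numeric(k)
  for (i in seq_len(k)) {
    # Take the most similar remaining pair, then retire its row and column.
    best <- which(sim == max(sim, na.rm = TRUE), arr.ind = TRUE)[1, ]
    matched[i] <- sim[best[1], best[2]]
    sim[best[1], ] <- NA
    sim[, best[2]] <- NA
  }
  # Angle (radians) between each matched pair of factors, then the mean.
  mean(acos(pmin(pmax(matched, -1), 1)))
}

set.seed(1)
W1 <- matrix(runif(50 * 3), nrow = 50)  # toy factors standing in for one half
W2 <- W1[, c(3, 1, 2)]                  # the same factors in a different order
mean_model_angle(W1, W2)                # close to 0: identical up to ordering
```

In a real run the two matrices would come from factorizing the two halves of the data at the same rank, and the mean angle would be compared across ranks.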

This cross-validation procedure can be run multiple times on permutations of the dataset, but if only a single run is requested (n.starts = 1), a semi-random "smart split" is applied, which maximizes signal redundancy between the two partitions of the dataset. Generally, a single run with smart.split is sufficient to determine the optimal rank k and captures most of the information that would be learned from multiple starts on entirely random partitions. The scNMF::canyon.plot function is useful for visualizing the results of nnlm.cv to determine the optimal rank k or to tune the cross-validation procedure. After the optimal rank has been determined, scNMF::nnmf may be run at that rank.

Subsetting: For large datasets, nnlm.cv may often be run on a subset of the data if signal redundancy is sufficient. However, if there is insufficient signal redundancy, nnlm.cv may not reveal any "canyon" or local minimum.

Value

A list with cross-validation info, most easily visualized by running scNMF::canyon.plot on the result. The list includes a tall-format data frame of factor angles (factor.angle, with columns "k", "factor.angle", "seed") and a tall-format data frame of model angles (model.angle, with columns "k", "model.angle", "seed"). If models were requested, it also includes a list of models and matched factors within a list of starts.
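As a rough sketch of how the model.angle data frame might be summarized without plotting, the rank with the smallest mean angle can be picked in base R. The values below are made up for illustration and only mimic the documented column names; cv, mean_by_k and best_k are hypothetical names, and a real result may contain additional elements.

```r
# Made-up values in the documented shape of `model.angle`; not real output.
cv <- list(
  model.angle = data.frame(
    k = rep(c(5, 7, 9, 11), times = 2),
    model.angle = c(0.42, 0.31, 0.25, 0.38, 0.45, 0.29, 0.27, 0.40),
    seed = rep(c(123, 124), each = 4)
  )
)

# Average over starts (seeds) for each k, then pick the k with the smallest mean angle.
mean_by_k <- aggregate(model.angle ~ k, data = cv$model.angle, FUN = mean)
best_k <- mean_by_k$k[which.min(mean_by_k$model.angle)]
best_k  # 9 for these made-up values
```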


zdebruine/scNMF documentation built on Jan. 1, 2021, 1:50 p.m.