nmf: Running NMF algorithms

Description Usage Arguments Details Value Optimized C++ vs. plain R Algorithms Seeding methods Methods (by generic) References See Also Examples

Description

The function nmf is a S4 generic defines the main interface to run NMF algorithms within the framework defined in package NMF. It has many methods that facilitates applying, developing and testing NMF algorithms.

The package vignette vignette('NMF') contains an introduction to the interface, through a sample data analysis.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
nmf(x, rank, method, ...)

## S4 method for signature 'data.frame,ANY,ANY'
nmf(x, rank, method, ...)

## S4 method for signature 'mMatrix,numeric,'NULL''
nmf(x, rank, method, seed = NULL, model = NULL, ...)

## S4 method for signature 'mMatrix,numeric,list'
nmf(x, rank, method, ..., .parameters = list())

## S4 method for signature 'mMatrix,numeric,character'
nmf(x, rank, method, ...)

## S4 method for signature 'mMatrix,numeric,'function''
nmf(
  x,
  rank,
  method,
  seed,
  model = "NMFstd",
  ...,
  name,
  objective = "euclidean",
  mixed = FALSE
)

## S4 method for signature 'mMatrix,NMF,ANY'
nmf(x, rank, method, seed, ...)

## S4 method for signature 'mMatrix,'NULL',ANY'
nmf(x, rank, method, seed, ...)

## S4 method for signature 'mMatrix,missing,ANY'
nmf(x, rank, method, ...)

## S4 method for signature 'mMatrix,numeric,missing'
nmf(x, rank, method, ...)

## S4 method for signature 'mMatrix,matrix,ANY'
nmf(x, rank, method, seed, model = list(), ...)

## S4 method for signature 'mMatrix,data.frame,ANY'
nmf(x, rank, method, ...)

## S4 method for signature 'formula,ANY,ANY'
nmf(x, rank, method, ..., model = NULL)

## S4 method for signature 'mMatrix,numeric,NMFStrategy'
nmf(
  x,
  rank,
  method,
  seed = nmf.getOption("default.seed"),
  rng = NULL,
  nrun = if (length(rank) > 1L) 30 else 1,
  model = NULL,
  .options = list(),
  .pbackend = nmf.getOption("pbackend"),
  .callback = NULL,
  .tmpdir = getwd(),
  ...
)

Arguments

x

target data to fit, i.e. a matrix-like object

rank

specification of the factorization rank. It is usually a single numeric value, but other type of values are possible (e.g. matrix), for which specific methods are implemented. See for example methods nmf,matrix,matrix,ANY.

If rank is a numeric vector with more than one element, e.g. a range of ranks, then nmf performs the estimation procedure described in nmfEstimateRank.

method

specification of the NMF algorithm. The most common way of specifying the algorithm is to pass the access key (i.e. a character string) of an algorithm stored in the package's dedicated registry, but methods exists that handle other types of values, such as function or list object. See their descriptions in section Methods.

If method is missing the algorithm to use is obtained from the option nmf.getOption('default.algorithm'), unless it can be infer from the type of NMF model to fit, if this later is available from other arguments. Factory fresh default value is ‘brunet’, which corresponds to the standard NMF algorithm from Brunet et al. (2004) (see section Algorithms).

Cases where the algorithm is inferred from the call are when an NMF model is passed in arguments rank or seed (see description for nmf,matrix,numeric,NULL in section Methods).

...

extra arguments to allow extension of the generic. Arguments that are not used in the chain of internal calls to nmf methods are passed to the function that effectively implements the algorithm that fits an NMF model on x.

seed

specification of the starting point or seeding method, which will compute a starting point, usually using data from the target matrix in order to provide a good guess.

The seeding method may be specified in the following way:

a character string:

giving the name of a registered seeding method. The corresponding method will be called to compute the starting point.

Available methods can be listed via nmfSeed(). See its dedicated documentation for details on each available registered methods (nmfSeed).

a list:

giving the name of a registered seeding method and, optionally, extra parameters to pass to it.

a single numeric:

that is used to seed the random number generator, before generating a random starting point.

Note that when performing multiple runs, the L'Ecuyer's RNG is used in order to produce a sequence of random streams, that is used in way that ensures that parallel computation are fully reproducible.

an object that inherits from NMF:

it should contain the data of an initialised NMF model, i.e. it must contain valid basis and mixture coefficient matrices, directly usable by the algorithm's workhorse function.

a function:

that computes the starting point. It must have signature (object="NMF", target="matrix", ...) and return an object that inherits from class NMF. It is recommended to use argument object as a template for the returned object, by only updating the basis and coefficient matrices, using basis<- and coef<- respectively.

model

specification of the type of NMF model to use.

It is used to instantiate the object that inherits from class NMF, that will be passed to the seeding method. The following values are supported:

  • NULL, the default model associated to the NMF algorithm is instantiated and ... is looked-up for arguments with names that correspond to slots in the model class, which are passed to the function nmfModel to instantiate the model. Arguments in ... that do not correspond to slots are passed to the algorithm.

  • a single character string, that is the name of the NMF model class to be instantiate. In this case, arguments in ... are handled in the same way as when model is NULL.

  • a list that contains named values that are passed to the function nmfModel to instantiate the model. In this case, ... is not looked-up at all, and passed entirely to the algorithm. This means that all necessary model parameters must be specified in model.

Argument/slot conflicts: In the case a parameter of the algorithm has the same name as a model slot, then model MUST be a list – possibly empty –, if one wants this parameter to be effectively passed to the algorithm.

If a variable appears in both arguments model and ..., the former will be used to initialise the NMF model, the latter will be passed to the NMF algorithm. See code examples for an illustration of this situation.

.parameters

list of method-specific parameters. Its elements must have names matching a single method listed in method, and be lists of named values that are passed to the corresponding method.

name

name associated with the NMF algorithm implemented by the function method (only used when method is a function).

objective

specification of the objective function associated with the algorithm implemented by the function method (only used when method is a function).

It may be either 'euclidean' or 'KL' for specifying the euclidean distance (Frobenius norm) or the Kullback-Leibler divergence respectively, or a function with signature (x="NMF", y="matrix", ...) that computes the objective value for an NMF model x on a target matrix y, i.e. the residuals between the target matrix and its NMF estimate. Any extra argument may be specified, e.g. function(x, y, alpha, beta=2, ...).

mixed

a logical that indicates if the algorithm implemented by the function method support mixed-sign target matrices, i.e. that may contain negative values (only used when method is a function).

rng

rng specification for the run(s). This argument should be used to set the the RNG seed, while still specifying the seeding method argument seed.

nrun

number of runs to perform. It specifies the number of runs to perform. By default only one run is performed, except if rank is a numeric vector with more than one element, in which case a default of 30 runs per value of the rank are performed, allowing the computation of a consensus matrix that is used in selecting the appropriate rank (see consensus).

When using a random seeding method, multiple runs are generally required to achieve stability and avoid bad local minima.

.options

this argument is used to set runtime options.

It can be a list containing named options with their values, or, in the case only boolean/integer options need to be set, a character string that specifies which options are turned on/off or their value, in a unix-like command line argument way.

The string must be composed of characters that correspond to a given option (see mapping below), and modifiers '+' and '-' that toggle options on and off respectively. E.g. .options='tv' will toggle on options track and verbose, while .options='t-v' will toggle on option track and toggle off option verbose.

Modifiers '+' and '-' apply to all option character found after them: t-vp+k means track=TRUE, verbose=parallel=FALSE, and keep.all=TRUE. The default behaviour is to assume that .options starts with a '+'.

for options that accept integer values, the value may be appended to the option's character e.g. 'p4' for asking for 4 processors or 'v3' for showing verbosity message up to level 3.

The following options are available (the characters after “-” are those to use to encode .options as a string):

debug - d

Toggle debug mode (default: FALSE). Like option verbose but with more information displayed.

keep.all - k

used when performing multiple runs (nrun>1): if TRUE, all factorizations are saved and returned (default: FALSE). Otherwise only the factorization achieving the minimum residuals is returned.

parallel - p

this option is useful on multicore *nix or Mac machine only, when performing multiple runs (nrun > 1) (default: TRUE). If TRUE, the runs are performed using the parallel foreach backend defined in argument .pbackend. If this is set to 'mc' or 'par' then nmf tries to perform the runs using multiple cores with package link[doParallel]{doParallel} – which therefore needs to be installed.

If equal to an integer, then nmf tries to perform the computation on the specified number of processors. When passing options as a string the number is appended to the option's character e.g. 'p4' for asking for 4 processors.

If FALSE, then the computation is performed sequentially using the base function sapply.

Unlike option 'P' (capital 'P'), if the computation cannot be performed in parallel, then it will still be carried on sequentially.

IMPORTANT NOTE FOR MAC OS X USERS: The parallel computation is based on the doMC and multicore packages, so the same care should be taken as stated in the vignette of doMC: “it is not safe to use doMC from R.app on Mac OS X. Instead, you should use doMC from a terminal session, starting R from the command line.”

parallel.required - P

Same as p, but an error is thrown if the computation cannot be performed in parallel or with the specified number of processors.

shared.memory - m

toggle usage of shared memory (requires the synchronicity package). Default is as defined by nmf.getOption('shared.memory').

restore.seed - r

deprecated option since version 0.5.99. Will throw a warning if used.

simplifyCB - S

toggle simplification of the callback results. Default is TRUE

track - t

enables error tracking (default: FALSE). If TRUE, the returned object's slot residuals contains the trajectory of the objective values, which can be retrieved via residuals(res, track=TRUE) This tracking functionality is available for all built-in algorithms.

verbose - v

Toggle verbosity (default: FALSE). If TRUE, messages about the configuration and the state of the current run(s) are displayed. The level of verbosity may be specified with an integer value, the greater the level the more messages are displayed. Value FALSE means no messages are displayed, while value TRUE is equivalent to verbosity level 1.

.pbackend

specification of the foreach parallel backend to register and/or use when running in parallel mode. See options p and P in argument .options for how to enable this mode. Note that any backend that is internally registered is cleaned-up on exit, so that the calling foreach environment should not be affected by a call to nmf – except when .pbackend=NULL.

Currently it accepts the following values:

‘par’

use the backend(s) defined by the package doParallel;

a numeric value

use the specified number of cores with doParallel backend;

‘seq’

use the foreach sequential backend doSEQ;

NULL

use currently registered backend;

NA

do not compute using a foreach loop – and therefore not in parallel – but rather use a call to standard sapply. This is useful for when developing/debugging NMF algorithms, as foreach loop handling may sometime get in the way.

Note that this is equivalent to using .options='-p' or .options='p0', but takes precedence over any option specified in .options: e.g. nmf(..., .options='P10', .pbackend=NA) performs all runs sequentially using sapply. Use nmf.options(pbackend=NA) to completely disable foreach/parallel computations for all subsequent nmf calls.

‘mc’

identical to ‘par’ and defined to ensure backward compatibility.

.callback

Used when option keep.all=FALSE (default). It allows to pass a callback function that is called after each run when performing multiple runs (i.e. with nrun>1). This is useful for example if one is also interested in saving summary measures or process the result of each NMF fit before it gets discarded. After each run, the callback function is called with two arguments, the NMFfit object that as just been fitted and the run number: .callback(res, i). For convenience, a function that takes only one argument or has signature (x, ...) can still be passed in .callback. It is wrapped internally into a dummy function with two arguments, only the first of which is passed to the actual callback function (see example with summary).

The call is wrapped into a tryCatch so that callback errors do not stop the whole computation (see below).

The results of the different calls to the callback function are stored in a miscellaneous slot accessible using the method $ for NMFfit objects: res$.callback. By default nmf tries to simplify the list of callback result using sapply, unless option 'simplifyCB' is FASE.

If no error occurs res$.callback contains the list of values that resulted from the calling the callback function –, ordered as the fits. If any error occurs in one of the callback calls, then the whole computation is not stopped, but the error message is stored in res$.callback, in place of the result.

See the examples for sample code.

.tmpdir

path to the directory where a temporary directory is created to store intermediate results. This is only relevant for multi-runs performed using a foreach backend (including the sequential backend 'doSEQ').

Details

The nmf function has multiple methods that compose a very flexible interface allowing to:

The workhorse method is nmf,matrix,numeric,NMFStrategy, which is eventually called by all other methods. The other methods provides convenient ways of specifying the NMF algorithm(s), the factorization rank, or the seed to be used. Some allow to directly run NMF algorithms on different types of objects, such as data.frame or ExpressionSet objects.

Value

The returned value depends on the run mode:

Single run:

An object of class NMFfit.

Multiple runs, single method:

When nrun > 1 and method is not list, this method returns an object of class NMFfitX.

Multiple runs, multiple methods:

When nrun > 1 and method is a list, this method returns an object of class NMFList.

Optimized C++ vs. plain R

Lee and Seung's multiplicative updates are used by several NMF algorithms. To improve speed and memory usage, a C++ implementation of the specific matrix products is used whenever possible. It directly computes the updates for each entry in the updated matrix, instead of using multiple standard matrix multiplication.

The algorithms that benefit from this optimization are: 'brunet', 'lee', 'nsNMF' and 'offset'. % and 'lnmf' However there still exists plain R versions for these methods, which implement the updates as standard matrix products. These are accessible by adding the prefix '.R#' to their name: '.R#brunet', '.R#lee', '.R#nsNMF' and '.R#offset'.

Algorithms

All algorithms are accessible by their respective access key as listed below. The following algorithms are available:

‘brunet’

Standard NMF, based on the Kullback-Leibler divergence, from Brunet et al. (2004). It uses simple multiplicative updates from Lee and Seung (2001), enhanced to avoid numerical underflow.

Default stopping criterion: invariance of the connectivity matrix (see nmf.stop.connectivity).

‘lee’

Standard NMF based on the Euclidean distance from Lee and Seung (2001). It uses simple multiplicative updates.

Default stopping criterion: invariance of the connectivity matrix (see nmf.stop.connectivity).

ls-nmf

Least-Square NMF from Wang et al. (2006). It uses modified versions of Lee and Seung's multiplicative updates for the Euclidean distance, which incorporates weights on each entry of the target matrix, e.g. to reflect measurement uncertainty.

Default stopping criterion: stationarity of the objective function (see nmf.stop.stationary).

‘nsNMF’

Nonsmooth NMF from Pascual-Montano et al. (2006). It uses a modified version of Lee and Seung's multiplicative updates for the Kullback-Leibler divergence Lee and Seung (2001), to fit a extension of the standard NMF model, that includes an intermediate smoothing matrix, meant meant to produce sparser factors.

Default stopping criterion: invariance of the connectivity matrix (see nmf.stop.connectivity).

‘offset’

NMF with offset from Badea (2008). It uses a modified version of Lee and Seung's multiplicative updates for Euclidean distance Lee and Seung (2001), to fit an NMF model that includes an intercept, meant to capture a common baseline and shared patterns, in order to produce cleaner basis components.

Default stopping criterion: invariance of the connectivity matrix (see nmf.stop.connectivity).

‘pe-nmf’

Pattern-Expression NMF from Zhang2008. It uses multiplicative updates to minimize an objective function based on the Euclidean distance, that is regularized for effective expression of patterns with basis vectors.

Default stopping criterion: stationarity of the objective function (see nmf.stop.stationary).

‘snmf/r’, ‘snmf/l’

Alternating Least Square (ALS) approach from Kim and Park (2007). It applies the nonnegative least-squares algorithm from Van Benthem and Keenan (2004) (i.e. fast combinatorial nonnegative least-squares for multiple right-hand), to estimate the basis and coefficient matrices alternatively (see fcnnls). It minimises an Euclidean-based objective function, that is regularized to favour sparse basis matrices (for ‘snmf/l’) or sparse coefficient matrices (for ‘snmf/r’).

Stopping criterion: built-in within the internal workhorse function nmf_snmf, based on the KKT optimality conditions.

Seeding methods

The purpose of seeding methods is to compute initial values for the factor matrices in a given NMF model. This initial guess will be used as a starting point by the chosen NMF algorithm.

The seeding method to use in combination with the algorithm can be passed to interface nmf through argument seed. The seeding seeding methods available in registry are listed by the function nmfSeed (see list therein).

Detailed examples of how to specify the seeding method and its parameters can be found in the Examples section of this man page and in the package's vignette.

Methods (by generic)

References

Brunet J, Tamayo P, Golub TR, Mesirov JP (2004). “Metagenes and molecular pattern discovery using matrix factorization.” _Proceedings of the National Academy of Sciences of the United States of America_, *101*(12), 4164-9. ISSN 0027-8424, doi: 10.1073/pnas.0308531101 (URL: https://doi.org/10.1073/pnas.0308531101).

Brunet J, Tamayo P, Golub TR, Mesirov JP (2004). “Metagenes and molecular pattern discovery using matrix factorization.” _Proceedings of the National Academy of Sciences of the United States of America_, *101*(12), 4164-9. ISSN 0027-8424, doi: 10.1073/pnas.0308531101 (URL: https://doi.org/10.1073/pnas.0308531101).

Lee DD, Seung H (2001). “Algorithms for non-negative matrix factorization.” _Advances in neural information processing systems_. <URL: http://scholar.google.com/scholar?q=intitle:Algorithms+for+non-negative+matrix+factorization\#0>.

Wang G, Kossenkov AV, Ochs MF (2006). “LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates.” _BMC bioinformatics_, *7*, 175. ISSN 1471-2105, doi: 10.1186/1471-2105-7-175 (URL: https://doi.org/10.1186/1471-2105-7-175).

Pascual-Montano A, Carazo JM, Kochi K, Lehmann D, Pascual-marqui RD (2006). “Nonsmooth nonnegative matrix factorization (nsNMF).” _IEEE Trans. Pattern Anal. Mach. Intell_, *28*, 403-415.

Badea L (2008). “Extracting gene expression profiles common to colon and pancreatic adenocarcinoma using simultaneous nonnegative matrix factorization.” _Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing_, *290*, 267-78. ISSN 1793-5091, <URL: http://www.ncbi.nlm.nih.gov/pubmed/18229692>.

Kim H, Park H (2007). “Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis.” _Bioinformatics (Oxford, England)_, *23*(12), 1495-502. ISSN 1460-2059, doi: 10.1093/bioinformatics/btm134 (URL: https://doi.org/10.1093/bioinformatics/btm134).

Van Benthem MH, Keenan MR (2004). “Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems.” _Journal of Chemometrics_, *18*(10), 441-450. ISSN 0886-9383, doi: 10.1002/cem.889 (URL: https://doi.org/10.1002/cem.889).

See Also

nmfAlgorithm

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# Only basic calls are presented in this manpage.
# Many more examples are provided in the demo file nmf.R
## Not run: 
demo('nmf')

## End(Not run)

# random data
x <- rmatrix(20,10)

# run default algorithm with rank 2
res <- nmf(x, 2)

# specify the algorithm
res <- nmf(x, 2, 'lee')

# get verbose message on what is going on
res <- nmf(x, 2, .options='v') 
## Not run:  
# more messages
res <- nmf(x, 2, .options='v2')
# even more
res <- nmf(x, 2, .options='v3')
# and so on ... 

## End(Not run)
 

renozao/NMF documentation built on June 14, 2020, 9:35 p.m.