knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette demonstrates basic usage of the RcppML::nmf
function and visualization of the results.
Install the RcppML R package from CRAN or the development version from GitHub.
```{R, eval = FALSE} install.packages('RcppML') # install CRAN version
## What is NMF? Non-negative Matrix Factorization (NMF) finds additive signals in non-negative data in terms of the features and samples associated with those signals. NMF gives an approximation of an input matrix as the cross-product of two low-rank submatrices: $$A = wdh$$ Here, $A$ is the input matrix, $w$ is a tall matrix of features in rows and factors in columns, and $h$ is a wide matrix of factors in rows and samples in columns. `RcppML::nmf` introduces one more important component into this system, a scaling diagonal, $d$. This scaling diagonal provides: * consistent factor scalings throughout model fitting * robustness across random restarts * symmetry in factorization of symmetric matrices * a means for convex L1 regularization ## Running NMF Run NMF on the `iris` dataset. We need to specify a rank (`k`) for the factorization, and will also specify the `seed` for random initialization for reproducibility: ```{R, message = FALSE, warning = FALSE, results = "hide"} library(RcppML) library(Matrix) library(ggplot2) library(cowplot) data(iris) model <- nmf(iris[,1:4], k = 3, seed = 1)
model
The result of RcppML::nmf
is an S34object of class nmf
. The nmf
class has many useful methods:
```{R, warning = FALSE} methods(class = "nmf")
One of these useful methods is `summary` (which in turn has a `plot` method): ```{R} species_stats <- summary(model, group_by = iris$Species) species_stats
```{R, fig.height = 2.5, fig.width = 3} plot(species_stats, stat = "sum")
Notice how NMF factors capture variable information among iris species. The `biplot` method for NMF (see `?biplot.nmf` for details) can compare the weights of different features or samples in two factors: ```{R, fig.height = 3, fig.width = 4} biplot(model, factors = c(1, 2), group_by = iris$Species)
NMF is randomly initialized, thus results may be slightly different every time. To run NMF many times, set multiple seeds, and the best model will be returned.
Here we run 10 factorizations at a higher tolerance, and the best model is returned:
```{R, results = "hide"} model2 <- nmf(iris[,1:4], k = 3, seed = 1:10, tol = 1e-5)
```{R} # MSE of model from single random initialization evaluate(model, iris[,1:4]) # MSE of best model among 10 random restarts evaluate(model2, iris[,1:4])
The second model is better.
Sparse factors contain only a few non-zero values and make it easy to identify features or samples that are important.
L1/LASSO regularization is the best method for introducing sparsity into a linear model.
```{R, results = "hide"} data(movielens) ratings <- movielens$ratings model_L1 <- nmf(ratings, k = 7, L1 = 0.1, seed = 123, mask_zeros = TRUE)
```{R} sparsity(model_L1)
The sparsity
S3 method for class nmf
makes it easy to compute the sparsity of factors, as done above.
Note that mask_zeros = TRUE
in the example above. This is because zero-valued ratings are missing, and thus should not be considered during factorization.
In the above example, we regularized both $w$ and $h$, however each model can also be regularized separately:
```{R, results = "hide"} model_no_L1 <- nmf(ratings, k = 7, L1 = 0, seed = 123, mask_zeros = TRUE) model_L1_h <- nmf(ratings, k = 7, L1 = c(0, 0.1), seed = 123, mask_zeros = TRUE) model_L1_w <- nmf(ratings, k = 7, L1 = c(0.1, 0), seed = 123, mask_zeros = TRUE)
df <- rbind(sparsity(model_no_L1), sparsity(model_L1_h), sparsity(model_L1_w), sparsity(model_L1)) df$side <- c(rep("none", 14), rep("h only", 14), rep("w only", 14), rep("both", 14)) df$side <- factor(df$side, levels = unique(df$side))
```{R, fig.height = 3, fig.width = 4} ggplot(df, aes(x = side, y = sparsity, color = model)) + geom_boxplot(outlier.shape = NA, width = 0.6) + geom_point(position = position_jitterdodge()) + theme_classic() + labs(x = "Regularized side of model", y = "sparsity of model factors")
Note how each side of the model is regularized independently.
L1 regularization does not significantly affect model loss:
# L1 = 0 evaluate(model_no_L1, movielens$ratings, mask = "zeros") # L1 = 0.1 evaluate(model_L1, movielens$ratings, mask = "zeros")
L1 regularization also does not significantly affect model information at low penalties. Here we measure the cost of bipartite matching between two models on a cosine distance matrix for L1 = 0
, L1 = 0.01
, and L1 = 0.1
:
```{R, results = "hide"} model_low_L1 <- nmf(movielens$ratings, k = 5, L1 = 0.01, seed = 123)
```{R} # cost of bipartite matching: L1 = 0 vs. L1 = 0.01 bipartiteMatch(1 - cosine(model_no_L1$w, model_low_L1$w))$cost / 10 # cost of bipartite matching: L1 = 0 vs. L1 = 0.1 bipartiteMatch(1 - cosine(model_no_L1$w, model_L1$w))$cost / 10
These cosine angles (range 0 to 1) are very small -- in other words, the models are very similar.
See ?RcppML::cosine
for details on very fast computation of cosine similarity.
In the above code, we computed cosine distance by subtracting cosine similarity from 1, matched on this cost matrix, and divided by 10 to find the mean cosine distance between matched factors. In both cases, factors correspond well.
Thus, regularized RcppML::nmf
increases factor sparsity without significantly affecting the loss or information content of the model.
NMF models learned on some samples can be projected to other samples, a common routine in recommendation systems or transfer learning.
For instance, we may train a model on some samples and then use it to predict the values in other samples. For instance, in this dataset we predict what bird species are likely to be encountered in a grid of land given information about just a fraction of the species.
```{R, results = "hide", fig.width = 6, fig.height = 3} data(hawaiibirds) A <- hawaiibirds$counts
test_grids <- sample(1:ncol(A), ncol(A) / 5) test_species <- sample(1:nrow(A), nrow(A) * 0.5)
mask <- matrix(0, nrow(A), ncol(A)) mask[test_species, test_grids] <- 1 mask <- as(mask, "dgCMatrix")
model <- nmf(A, k = 15, mask = mask, tol = 1e-6, seed = 123)
df <- rbind( data.frame( "observed" = as.vector(A[test_species, test_grids]), "predicted" = as.vector(prod(model)[test_species, test_grids]), "set" = "test" ), data.frame( "observed" = as.vector(A[-test_species, -test_grids]), "predicted" = as.vector(prod(model)[-test_species, -test_grids]), "set" = "train" ) )
ggplot(df, aes(observed, predicted, color = set)) + theme_classic() + theme(aspect.ratio = 1) + scale_y_continuous(expand = c(0, 0), limits = c(0, 1), trans = "sqrt") + scale_x_continuous(expand = c(0, 0), limits = c(0, 1), trans = "sqrt") + geom_point(size = 0.5) + facet_grid(cols = vars(set)) + theme(legend.position = "none")
## Cross-validation for rank determination Cross-validation can assist in finding a reasonable factorization rank. Here we determine the optimal rank for the `aml` dataset using cross-validation across three random replicates: ```{R} data(aml) cv_data <- crossValidate(aml$data, k = 1:10, reps = 3) head(cv_data)
Use the S4 plot
method for the nmfCrossValidation
class to visualize:
plot(cv_data)
The optimal rank is around k = 8
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.