In hdsu-bioquant/ButchR: Non-Negative Matrix Factorization Suite

library(BiocStyle)
library(ButchR)
library(knitr)
library(ComplexHeatmap)
library(viridis)
library(tidyverse)

Introduction {#introduction}

NMF (nonnegative matrix factorization) is a matrix decomposition method. A description of the algorithm and it's implementation can be found e.g. in [@Lee_article1999]. In 2003, Brunet et al. applied NMF to gene expression data [@Brunet_article2003]. In 2010, r CRANpkg("NMF"), an R package implementing several NMF solvers was published [@Gaujoux_article2010]. NMF basically solves the problem as illustrated in the following figure (Image taken from https://en.wikipedia.org/wiki/Non-negative_matrix_factorization):

NMF

Here, $V$ is an input matrix with dimensions $n \times m$. It is decomposed into two matrices $W$ of dimension $n \times l$ and $H$ of dimension $l \times m$, which when multiplied approximate the original matrix $V$. $l$ is a free parameter in NMF, it is called the factorization rank. If we call the columns of $W$ \emph{signatures}, then $l$ corresponds to the number of signatures. The decomposition thus leads to a reduction in complexity if $l < n$, i.e. if the number of signatures is smaller than the number of features, as indicated in the above figure.

In 2015, Mejia-Roa et al. introduced an implementation of an NMF-solver in CUDA, which lead to significant reduction of computation times by making use of massive parallelisation on GPUs [@Mejia_article2015]. Other implementations of NMF-solvers on GPUs exist.

It is the pupose of the package ButchR described here to provide wrapper functions in R to these NMF-solvers in TensorFlow. Massive parallelisation not only leads to faster algorithms, but also makes the benefits of NMF accessible to much bigger matrices. Furthermore, functions for estimation of the optimal factorization rank and post-hoc feature selection are provided.

The ButchR package {#ButchR_package}

The matrix decomposition results are stored in an S4 object called ButchR_NMF. ButchR provides functions to access the best factorzation after $n$ initailization $W$ and $H$ matrices for a given factorzation rank.

A crucial step in data analysis with NMF is the determination of the optimal factorization rank, i.e. the number of columns of the matrix $W$ or equivalently the number of rows of the matrix $H$. No consensus method for an automatic evaluation of the optimal factorization rank has been found to date. Instead, the decomposition is usually performed iteratively over a range of possible factorization ranks and different quality measures are computed for every tested factorization ranks. Many quality measures have been proposed:

The Frobenius reconstruction error, i.e. the Frobenius norm of the residuals of the decomposition: $||W \cdot H - V||_{F}$
Criteria to assess the stability of the decomposition:
The cophenetic correlation coefficient
An Amari type distance
Silhouette values over clusters of patterns extracted iteratively at the same factorization rank

The package ButchR provides a function to visualize all factorization metrics.

Example: leukemia data

Preparations

Load the example data

data(leukemia)

Now we are ready to start an NMF analysis.

NMF analysis

Call wrapper function

The wrapper function for the NMF solvers in the ButchR package is run_NMF_tensor. It is called as follows:

k_min <- 2
k_max <- 4

leukemia_nmf_exp <- run_NMF_tensor(X = leukemia$matrix,
                                   ranks = k_min:k_max,
                                   method = "NMF",
                                   n_initializations = 10, 
                                   extract_features = TRUE)

Depending on the choice of parameters (dimensions of the input matrix, number of iterations), this step may take some time. Note that the algorithm updates the user about the progress in the iterations.

normalize W matrix

To make the features in the $W$ matrix comparable, the factorization is normalized to make all columns of $W$ sum 1.

leukemia_nmf_exp <- normalizeW(leukemia_nmf_exp)

Several functions to access the results are available:

`HMatrix`

Returns the matrix H for the optimal decomposition (i.e. the one with the minimal residual) for a specific factorization rank k. The number of rows of the matrix H corresponds to the chosen factorization rank.

leukemia_Hk2 <- HMatrix(leukemia_nmf_exp, k = 2)
class(leukemia_Hk2)
dim(leukemia_Hk2)
kable(leukemia_Hk2[, 1:5])

If no value for k is supplied, the function returns a list of matrices, one for every factorization rank.

leukemia_Hlist <- HMatrix(leukemia_nmf_exp)
class(leukemia_Hlist)
length(leukemia_Hlist)
kable(leukemia_Hlist$k2[, 1:5])

`WMatrix`

Returns the matrix W for the optimal decomposition (i.e. the one with the minimal residual) for a specific factorization rank k. The number of columns of the matrix W corresponds to the chosen factorization rank.

leukemia_Wk2 <- WMatrix(leukemia_nmf_exp, k = 2)
class(leukemia_Wk2)
dim(leukemia_Wk2)
kable(as.data.frame(leukemia_Wk2[1:5, ]))

If no value for k is supplied, the function returns a list of matrices, one for every factorization rank.

leukemia_Wlist <- WMatrix(leukemia_nmf_exp)
class(leukemia_Wlist)
length(leukemia_Wlist)
kable(as.data.frame(leukemia_Wlist$k2[1:5, ]))

`FrobError`

Returns a data frame with as many columns as there are iterated factorization ranks and as many rows as there are iterations per factorization rank.

kable(FrobError(leukemia_nmf_exp))

Determine the optimal factorization rank

In NMF, Several methods have been described to assess the optimal factorization rank. The ButchR package implements some of them:

Frobenius error: The most important information about the many iterated d ecompositions is the norm of the residual. In NMF this is often called the Frobenius error, as the Frobenius norm may be used.
Alexandrov Criterion: In [@Alex2013] an approach is described in which a modified silhouette criterion is used to estimate the stability across iteration steps for one fixed factorization rank k.
Cophenetic correlation coefficient
Amari distance

The values of the computed factorization metrics can be accessed with OptKStats:

kable(OptKStats(leukemia_nmf_exp))

These quality measures can be displayed together:

Generate plots to estimate optimal k

gg_plotKStats(leukemia_nmf_exp)

Visualize the matrix H (exposures)

The matrices H may be visualized as heatmaps. We can define a meta information object and annotate meta data:

heat_anno <- HeatmapAnnotation(df = leukemia$annotation[, c("ALL_AML", "Type")],
                               col = list(ALL_AML = c("ALL" = "grey80", 
                                                      "AML" = "grey20"),
                                          Type = c("-" = "white",
                                                   "B-cell" = "grey80",
                                                   "T-cell" = "grey20")))

And now display the matrices H with meta data annotation:

for(ki in k_min:k_max) {
  cat("\n")
  cat("  \n#### H matrix for k=",  ki, "   \n  ")
  #plot H matrix
  tmp_hmatrix <- HMatrix(leukemia_nmf_exp, k = ki)
  h_heatmap <- Heatmap(tmp_hmatrix,
                       col = viridis(100),
                       name = "Exposure",
                       clustering_distance_columns = 'pearson',
                       show_column_dend = TRUE,
                       top_annotation = heat_anno,
                       show_column_names = FALSE,
                       show_row_names = FALSE,
                       cluster_rows = FALSE)
  print(h_heatmap)
}

Feature selection

Row K-means to determine signature specific features

### Find representative regions.
# Get W for best K
leukemia_features <- SignatureSpecificFeatures(leukemia_nmf_exp,
                                               k = 4, 
                                               return_all_features = TRUE)
colnames(leukemia_features) <- paste0("Sign.", 1:4)
kable(head(leukemia_features))

Feature visualization

# List of signature specific features
# leukemia_specific <- SignatureSpecificFeatures(leukemia_nmf_exp,
#                                                k = 4, 
#                                                return_all_features = FALSE)


leukemia_specific <- rownames(leukemia_features)[rowSums(leukemia_features) == 1]
leukemia_Wspecific <- WMatrix(leukemia_nmf_exp, k = 4)[leukemia_specific, ]
colnames(leukemia_Wspecific) <- paste0("Sign.", 1:4)

# normalize exposure score in W matrix across rows
leukemia_Wspecific <- leukemia_Wspecific/matrixStats::rowMaxs(leukemia_Wspecific)

# Display selected features on W matrix
w_heatmap <- Heatmap(leukemia_Wspecific,
                     col = inferno(100),
                     name = "W matrix",
                     clustering_distance_columns = 'pearson',
                     show_column_dend = TRUE,
                     show_column_names = TRUE,
                     show_row_names = FALSE,
                     cluster_rows = TRUE,
                     cluster_columns = FALSE)
w_heatmap

Sys.sleep(60)

References

hdsu-bioquant/ButchR documentation built on Jan. 28, 2023, 6:06 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

hdsu-bioquant/ButchR
Non-Negative Matrix Factorization Suite

In hdsu-bioquant/ButchR: Non-Negative Matrix Factorization Suite

Introduction {#introduction}

The ButchR package {#ButchR_package}

Example: leukemia data

NMF analysis

Call wrapper function

normalize W matrix

`HMatrix`

`WMatrix`

`FrobError`

Determine the optimal factorization rank

Generate plots to estimate optimal k

Visualize the matrix H (exposures)

Feature selection

Row K-means to determine signature specific features

Feature visualization

References

R Package Documentation

Browse R Packages

We want your feedback!

hdsu-bioquant/ButchR Non-Negative Matrix Factorization Suite

In hdsu-bioquant/ButchR: Non-Negative Matrix Factorization Suite

Introduction {#introduction}

The ButchR package {#ButchR_package}

Example: leukemia data

NMF analysis

Call wrapper function

normalize W matrix

HMatrix

WMatrix

FrobError

Determine the optimal factorization rank

Generate plots to estimate optimal k

Visualize the matrix H (exposures)

Feature selection

Row K-means to determine signature specific features

Feature visualization

References

R Package Documentation

Browse R Packages

We want your feedback!

hdsu-bioquant/ButchR
Non-Negative Matrix Factorization Suite

`HMatrix`

`WMatrix`

`FrobError`