LSMF: Least squares matrix factorization

Documented in lsmf

#' Least squares matrix factorization
#'
#' @description Fast and versatile sparse matrix factorization with one- or two-sided least squares constraints, L1/L2/angular regularization, distributional weighting, binary zero-masking, parallelization, and multiple initialization options
#'
#' @details
#' LSMF runs matrix factorization using alternating least squares with an efficient least squares coordinate descent solver.
#' Matrix operations make use of the Armadillo linear algebra library in a parallelized C++ backend.
#' Sparse matrix support for non-zero-masked factorizations means the entire matrix does not need to be transferred into memory.
#'
#' Basic usage is documented below. For details see the vignette and the references section. To report issues or seek case-by-case optimization, open a GitHub issue.
#'
#' ---------------------------------------
#'
#' NON-NEGATIVE MATRIX FACTORIZATION
#' \itemize{
#'    \item Set \code{lower.limits = 0} and specify at minimum \code{A} and \code{k}
#' }
#' When to use: data with meaningfully non-negative counts
#'
#' ---------------------------------------
#'
#' ONE-SIDED NMF (OSNMF-R or OSNMF-L)
#' \itemize{
#'    \item Set \code{lower.limits = c(1, 0)} for OSNMF-left or \code{lower.limits = c(0, 1)} for OSNMF-right
#' }
#' When to use: clustering, graphing
#'
#' ---------------------------------------
#'
#' SYMMETRIC NMF
#' \itemize{
#'    \item Set \code{alpha.L2 = c(1, 0)} and any other parameters for W and H equal to the same value for W and H. Do not specify a value for alpha.elasticnet
#'    \item An L2 penalty of 1 causes W to converge towards H such that the slope between factors in W and H is 1. As W converges towards H, H counter-converges towards the values which give the lowest MSE
#'    \item The lower \code{tol} is set (i.e. 1e-5), the better the linearity between W and H
#'    \item If signal is random or very fuzzy, the linear fit may never converge
#'    \item If needed, calculate mean(W, t(H)) as an approximation of the "perfect" factorization
#' }
#' When to use: Whenever factorizing a symmetric dataset, always set alpha.L2 = c(1, 0). The only exception that occurs to mind is OSNMF/R for clustering on H in a rank-2 factorization.
#'
#' ---------------------------------------
#'
#' ZERO-MASKED NMF
#' \itemize{
#'     \item Set \code{mask.zeros = TRUE}
#'     \item If any of \code{weights != NULL, sample.weights != NULL, feature.weights != NULL}, the sparse matrix computation advantage is lost even though zeros will be assigned a weight of 0 (completely masked). 
#'     \item While ZMNMF is significantly faster, distribution-based weighting should be prefered in most use cases.
#' }
#' When to use: Imputation on zero-inflated data that is naturally dense, extremely sparse sparse and robust signals
#'
#' ---------------------------------------
#'
#' LASSO REGULARIZATION
#' \itemize{
#'    \item Set \code{alpha.L1 = c([0,n], [0,n])} for W, H. Typically a value > 0 and < 1
#'    \item Sparsity is enforced on the model as a whole (W, H, or both), not on individual factors
#'    \item Over-penalization readily encourages a few dense factors and many extremely sparse factors
#'    \item Cross-validation against an experimental objective is important.
#'    \item Some samples or features may receive entirely all zero coefficients.
#' }
#' When to use:
#' \itemize{
#'    \item Extract robust factors from data with high dropout and noisy signal, where robustness is not achieved with default MF
#'    \item Introduce sparsity in rank-1 or rank-2 factorization for soft clustering
#' }
#' 
#' ---------------------------------------
#'
#' RIDGE REGULARIZATION
#' \itemize{
#'    \item Set \code{alpha.L2 = c([0,n],[0,n])} for W, H. Typically a value > 0 and < 1.
#'    \item In matrix factorization, ridge regularization encourages the distribution of values in W and H to overlap.
#'    \item Optimal results are achieved when one of W or H, generally the first to be initialized (W by default) is set to 1 and the other is set to 0 (i.e. \code{alpha.L2 = c(1, 0)})
#' }
#' When to use (given \code{alpha.L2 = c(1, 0)} when \code{start.W = TRUE}):
#' \itemize{
#'    \item Symmetric data should almost always be factorized with \code{alpha.L2 = c(1, 0)}
#'    \item Learning factor models across independent experiments which can be compared without the need for normalization or transformation (i.e. the contributions of sample weights and feature weights are known to be 1:1)
#' }
#' 
#' ---------------------------------------
#'
#' RANK-1 MATRIX FACTORIZATION
#' \itemize{
#'    \item Set \code{k = 1}
#'    \item Note that non-negative count data will never give negative values in "W" or "H" even if \code{lower.limits = c(-Inf, -Inf)}.
#'    \item Over- or under-determined rank-k on W and rank-1 on H factorizations may be generated by first solving a rank-k factorization, and then a projection for H with k = 1 using \code{LSMF::ls.project}.
#' }
#' When to use: Divisive clustering, diffusion coordinate assignment, trajectory mapping
#'
#'
#' ---------------------------------------
#'
#' Methods in LSMF are built on top of a computational framework for ANNLS NMF refined and published by Xihui "Eric" Lin in the NNLM R package.
#' LSMF specifically uses alternating non-negative least squares with coordinate descent least squares (SCD method) against a mean squared error loss (MSE loss).
#'
#' @param A matrix of "features x samples" as "rows x columns" of or coercible to sparse dgCMatrix
#' @param k rank
#' @param lower.limits Lower limits for coefficients. Specify NULL, a single numeric, or an array of two for c(W, H), default of c(0, 0) is equivalent to NMF
#' @param upper.limits Upper limits for coefficients. Specify NULL, a single numeric or an array of two for c(W, H)
#' @param constraint.type List of two for list(W,H) with list components of values c("continuous", "bernoulli", "bernoulli.signed", or a multinomial array of values to constrain to). Default is list("continuous", "continuous"). If not "continuous", "lower.limits" and "upper.limits" are disregarded.
#' @param alpha.L1 lasso regularization on W and H, a single numeric or an array of two for c(W, H) typically in the range [0,1]
#' @param alpha.L2 ridge regularization on W and H, a single numeric or an array of two for c(W, H) typically in the range [0,1]
#' @param alpha.angular angular regularization on W and H, a single numeric or an array of two for c(W, H) in the range [0,1]
#' @param alpha.elasticnet elastic net mixing parameter for ((1-alpha) * L2 + alpha * L1) regularization on W and H, array of two for c(W, H). Default is c(NULL, NULL). Supercedes alpha.L1, alpha.L2
#' @param alpha.L0 cardinality of non-zero coefficients in each row of W or column of H, specify a single nummeric or an array of two for c(W, H), default c(k, k) results in no penalization
#' @param mask.zeros regard zeros as missing/masked values
#' @param weights matrix of same dimensions as "A" of or coercible to dgCMatrix with weights for model learning in the range [0,1] where 0 = completely masked and 1 = completely weighted
#' @param sample.weights vector of length ncol(A) with values in the range [0,1]
#' @param feature.weights vector of length ncrow(A) with values in the range [0,1]
#' @param n.threads number of threads/CPUs to use, or leave 0 or NULL to let OpenMP decide
#' @param init initialization for W and H, one of c("random", "nndsvd", "nndsvda", "nndsvdar") or a list of two matrices of class "matrix" as initializations for list(W, H). See \code{LSMF::nndsvd} and \code{LSMF::rand.init} for methodological details. Random initialization handles proper initialization for \code{constraint.type} other than "continuous" while nndsvd methods cannot
#' @param seed seed for init = "random" initialization (default NULL)
#' @param start.W should alternating least squares start with updating W, as is default, or with H
#' @param tol stopping criterion based on relative error between two successive iterations: |e2-e1|/mean(e1,e2)
#' @param scd.tol stopping criterion for sequential coordinate descent least squares solver between two successive iterations
#' @param maxit permitted number of alternating least squares iterations if "tol" criteria remains unsatisfied
#' @param scd.maxit permitted number of sequential coordinate descent iterations if "scd.tol" criteria remains unsatisfied
#' @param trace check error for convergence every trace iterations (checking error is costly). To not check error at all and simply execute \code{maxit} iterations, set trace to NULL or 0
#' @param adaptive.constraint gradually impose constraints? applies if \code{constraint.type != "continuous"}
#' @param adaptive.bernoulli.steps applies only if \code{constraint.type = list("continuous", "bernoulli")}, how many factorizations to perform on multinomial distributions between c(0.5, 0.5) and c(0, 1). For example, 1 step corresponds to factorizing with a multinomial distribution on H of c(0.25,0.75), 4 steps corresponds to factorizing with a multinomial distribution on H of c(0.4, 0.6) ... to (0.1, 0.9)
#' @param adaptive.alpha gradually introduce alpha penalties? (does not apply to \code{alpha.L2})
#' @param adaptive.tol \code{tol} for convergence in each adaptive update step. Also applies to \code{adaptive.constraint}
#' @param adaptive.maxit \code{maxit} for each adaptive update step
#' @param adaptive.trace \code{trace} for each adaptive update step. To not check error in adaptive updating steps and simply execute \code{maxit} iterations, set \code{adaptive.trace = adaptive.maxit}
#' @param adaptive.L1.step double, for stepwise introduction of L1 penalty, how much penalty to increment with each update
#' @param adaptive.angular.step double, for stepwise introduction of angular penalty, how much penalty to increment with each update
#' @param adaptive.L0.step integer, for stepwise introduction of L0 penalty, how many more values to set to 0 in each update
#' @param return.sparse convert W and/or H to dgCMatrix. Boolean or array of boolean for c(W, H)
#' @param verbose boolean
#' @param show.warnings Display warnings from checks for improper or questionable usage of input parameters.
#' @return
#' A list of:
#' 	\itemize{
#'    \item W               : feature coefficients, class "matrix"
#'    \item H               : sample coefficients, class "matrix"
#'    \item W.init          : initialization used for W
#'    \item H.init          : initialization used for H
#'    \item weights         : weights used (if applicable)
#'    \item mse             : mean squared error with applicable penalties and weights
#'    \item rel.err         : error of last iteration relative to previous iteration
#'    \item call            : list of interpreted parameter values
#'  }
#' @references
#' 
#' Lin, X., and Boutros, P.C. (2020). "Optimization and expansion of non-negative matrix factorization." BMC Bioinformatics.
#' 
#' Kuang, D., Yun, S., and Park, H. (2015). "SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering." Journal of Global Optimization.
#'
#' Lee and Seung
#'
#' Alternating least squares factorization
#'
#' @author Zach DeBruine, \email{zach.debruine@@vai.org}
#' @seealso \code{\link{ls.project}}, \code{\link{mse}}
#' 
#' @examples 
#' \dontrun{
#' data(moca7k)
#' # run lsmf with default parameters
#' model <- lsmf(moca7k, k = 10, trace = 5)
#' 
#' # with L1 regularization on W and a larger tolerance
#' model <- lsmf(moca7k, k = 10, alpha = c(0,0,0.5), trace = 5, tol = 1e-2)
#'
#' # prepare a similarity of first 1000 samples
#' #   in moca7k dataset, and induce 75% dropout
#' library(Matrix)
#' data(moca7k)
#' m.init <- as(sparse.cos(moca7k[,1:1000]), "dgCMatrix")
#' m <- m.init * rsparsematrix(1000, 1000, 0.25, symmetric = TRUE, rand.x = NULL)
#' diag(m) <- 1
#' 
#' # run lsmf.sym with default parameters
#' model <- lsmf.sym(m, k = 10)
#' 
#' # compare similarity between W.mean and W
#' mean(diag(cor(model$W, model$W.mean)))
#' # [1] 0.994232
#' 
#' # plot W against t(H)
#' plot(model$W, t(model$H))
#' # plot looks pretty good, but notice shift of low weighted values towards W, 
#' #  this is because W is initialized first.
#' #  lowering the tolerance removes that shift, for instance,
#' model2 <- lsmf.sym(m, k = 10, tol = 1e-6)
#' plot(model2$W, t(model2$H))
#' # basically a perfect fit!
#' 
#' # let's factorize it again and see if we get the same result, 
#' #   note that factor orderings may be scrambled
#' model3 <- lsmf.sym(m, k = 10, tol = 1e-6)
#' model3.reordered <- match.factors(model2$W.mean, model3$W.mean)$m.aligned
#' # plot first matched pair
#' plot(model3.reordered, model2$W.mean)
#' # in general there is decent agreement, but multiple factorizations 
#' #   would be useful in finding a robust minima
#' # note that some factors are highly robust while others are not very robust
#' plot(diag(cor(model3.reordered, model2$W.mean)))
#' 
#' # take for instance the best fit vector or the worst fit vector and compare them
#' best.pair <- which.max(diag(cor(model3.reordered, model2$W.mean)))
#' worst.pair <- which.min(diag(cor(model3.reordered, model2$W.mean)))
#' plot(model3.reordered[, best.pair], model2$W.mean[, best.pair])
#' plot(model3.reordered[, worst.pair], model2$W.mean[, worst.pair])
#' 
#' # ZERO-MASKING
#' # Examine zero-masking for imputing signal by comparing "m" to "m.init",
#' #  since signal dropout in "m" is 75%
#' model.orig <- lsmf.sym(m.init, k = 10, tol = 1e-6)
#' model.zm <- lsmf.sym(m, k = 10, tol = 1e-6, mask.zeros = TRUE)
#' model.nzm <- lsmf.sym(m, k = 10, tol = 1e-6, mask.zeros = FALSE)
#' 
#' # compare model.zm and model.nzm to model.orig, after matching
#' W.orig <- model.orig$W
#' W.zm <- match.factors(W.orig, model.zm$W)$m.aligned
#' W.nzm <- match.factors(W.orig, model.nzm$W)$m.aligned
#' plot(W.orig, W.zm)
#' plot(W.orig, W.nzm)
#' # zero-masking imputes signal dropout excellently!
#' }
#' @export
lsmf <- function(A,
  k,
  lower.limits = c(0, 0),
  upper.limits = c(NULL, NULL),
  constraint.type = list("continuous", "continuous"),
  alpha.L1 = c(0, 0),
  alpha.L2 = c(0, 0),
  alpha.angular = c(0, 0),
  alpha.elasticnet = c(NULL, NULL),
  alpha.L0 = c(k, k),
  mask.zeros = FALSE,
  weights = NULL,
  sample.weights = NULL,
  feature.weights = NULL,
  n.threads = 0,
  init = "random",
  seed = 123,
  start.W = TRUE,
  tol = 1e-4,
  scd.tol = 1e-8,
  maxit = 1e4,
  scd.maxit = 100,
  trace = 1,
  adaptive.constraint = TRUE,
  adaptive.bernoulli.steps = 1,
  adaptive.alpha = TRUE,
  adaptive.tol = 1e-2,
  adaptive.maxit = 10,
  adaptive.trace = 3,
  adaptive.L1.step = 0.1,
  adaptive.angular.step = 0.1,
  adaptive.L0.step = 1,
  return.sparse = c(FALSE, FALSE),
  verbose = TRUE,
  show.warnings = TRUE) {

  # ***************************************************************************** CHECKS, VARIABLE INITIALIZATIONS, AND DATATYPE CONVERSIONS
  require(Matrix)

  # check input matrix
  if (class(A)[1] != "dgCMatrix") {
    A <- as(A, "dgCMatrix")
    stopifnot("input matrix A is not a dgCMatrix or coercible to a dgCMatrix" = class(A)[1] == "dgCMatrix")
  }

  # check weights
  if (is.null(weights)) {
    use.weights <- FALSE
  } else {
    if (class(weights)[1] != "dgCMatrix") {
      weights <- as(weights, "dgCMatrix")
      stopifnot("weights matrix is not a dgCMatrix or coercible to a dgCMatrix" = class(weights[1] == "dgCMatrix"))
    }
    use.weights <- TRUE
    if (max(weights) > 1 || min(weights < 0)) {
      if (show.warnings) warning("All values of weights were not in the range of 0 to 1. Setting weights to NULL and proceeding with unweighted factorization")
      use.weights <- FALSE
    }
    if (nrow(A) != nrow(weights) || ncol(A) != ncol(weights)) {
      if (show.warnings) warning("Dimensions of weights is not equal to dimensions of A. Setting weights to NULL and proceeding with unweighted factorization")
      use.weights <- FALSE
    }
  }

  # apply sample.weights
  if (!is.null(sample.weights)) {
    if (sample.weights != ncol(A)) {
      if (show.warnings) warning("sample.weights != ncol(A), proceeding without weighting by samples")
    } else {
      if (use.weights == FALSE) {
        weights <- matrix(1, nrow = nrow(A), ncol = ncol(A))
      }
      for (i in 1:ncol(weights))
        weights[, i] <- weights[, i] * sample.weights[i]
      use.weights <- TRUE
    }
  }

  # apply feature.weights
  if (!is.null(feature.weights)) {
    if (feature.weights != nrow(A)) {
      if (show.warnings) warning("feature.weights != nrow(A), proceeding without weighting by features")
    } else {
      if (use.weights == FALSE) {
        weights <- matrix(1, nrow = nrow(A), ncol = ncol(A))
      }
      for (i in 1:nrow(weights))
        weights[i,] <- weights[i,] * feature.weights[i]
      use.weights <- TRUE
    }
  }

  if (!use.weights) {
    # simulate a small sparse matrix as a space filler to pass along in the Rcpp functions.
    # these weights will never be used
    weights <- rsparsematrix(10, 10, 0.1)
  } else if (mask.zeros == TRUE) {
    # Set weights for all zero values in A to 0 in the weights matrix and reindex weights
    weights[A == 0] <- 0
    weights <- as(weights, "dgCMatrix")
  }
  use.weights <- as.integer(use.weights)

  # check k
  k <- as.integer(k)
  if (k >= min(dim(A))) {
    if (show.warnings) warning("k is >= the minimum dimensions of the input. Do not use without proper regularization.")
  }

  # check n.threads
  if (is.null(n.threads)) n.threads <- 0
  n.threads <- n.threads[1]
  if (n.threads < 0) n.threads <- 0
  n.threads <- as.integer(n.threads[1])

  # check lower.limits
  if (is.null(lower.limits))
    lower.limits <- rep(-1e10, 2)
  if (length(lower.limits) == 1) {
    lower.limits <- rep(lower.limits, 2)
  } else if (length(lower.limits) > 2) {
    if (show.warnings) warning("length of lower.limits is greater than length two. Proceeding with lower.limits = c(0,0)")
    lower.limits <- c(0, 0)
  }
  lower.limits <- sapply(lower.limits, function(x) ifelse(is.null(x), -1e10, as.double(x)))

  # check upper.limits
  if (is.null(upper.limits))
    upper.limits <- rep(1e10, 2)
  if (length(upper.limits) == 1) {
    upper.limits <- rep(upper.limits, 2)
  } else if (length(upper.limits) > 2) {
    if (show.warnings) warning("length of upper.limits is greater than length two. Proceeding with upper.limits = c(1e6,1e6)")
    upper.limits <- c(1e6, 1e6)
  }
  upper.limits <- sapply(upper.limits, function(x) ifelse(is.null(x), -1e10, as.double(x)))

  if (upper.limits[1] < lower.limits[1]) {
    if (show.warnings) warning("upper.limits[1] < lower.limits[1], swapping upper.limits[1] and lower.limits[1]")
    lower.lim <- lower.limits[1]
    upper.limits[1] <- lower.limits[1]
    lower.limits[1] <- lower.lim
  }
  if (upper.limits[2] < lower.limits[2]) {
    if (show.warnings) warning("upper.limits[2] < lower.limits[2], swapping upper.limits[2] and lower.limits[2]")
    lower.lim <- lower.limits[2]
    upper.limits[2] <- lower.limits[2]
    lower.limits[2] <- lower.lim
  }

  # check constraint type
  multinomial.values <- list(c(0, 1), c(0, 1))
  if (length(constraint.type) == 1) {
    constraint.type = list(constraint.type, constraint.type)
  }
  for (i in 1:2) {
    if (constraint.type[[i]][1] == "continuous") {
      constraint.type[[i]] <- as.integer(1)
    } else if (constraint.type[[i]][1] == "bernoulli") {
      constraint.type[[i]] <- as.integer(2)
    } else if (constraint.type[[i]][1] == "bernoulli.signed") {
      constraint.type[[i]] <- as.integer(3)
    } else {
      # multinomial constraint distribution?
      if (!is.vector(constraint.type[[i]])) {
        if (show.warnings) warning("constraint.type was not a list containing an array or one of valid values. Setting invalid constraint.type to 'continuous'.")
        constraint.type[[i]] <- as.integer(1)
      } else {
        multinomial.values[[i]] <- sapply(constraint.type[[i]], as.double)
        constraint.type[[i]] <- as.integer(4)
      }
    }
  }
  constraint.type <- unlist(constraint.type)
  multinomial.values1 <- sapply(as.vector(multinomial.values)[[1]], as.double)
  multinomial.values2 <- sapply(as.vector(multinomial.values)[[2]], as.double)

  # check alpha.L1
  if (length(alpha.L1) == 1) {
    alpha.L1 <- rep(alpha.L1, 2)
  }
  stopifnot("length of alpha.L1 is not equal to two" = length(alpha.L1) == 2)
  if (min(alpha.L1) < 0) {
    if (show.warnings) warning("min(alpha.L1) < 0. Proceeding with alpha.L1 <- c(0, 0)")
    alpha.L1 <- c(0, 0)
  }
  alpha.L1 <- sapply(alpha.L1, as.double)

  # check alpha.L2
  if (length(alpha.L2) == 1) {
    alpha.L2 <- rep(alpha.L2, 2)
  }
  stopifnot("length of alpha.L2 is not equal to two" = length(alpha.L2) == 2)
  if (min(alpha.L2) < 0) {
    if (show.warnings) warning("min(alpha.L2) < 0. Proceeding with alpha.L2 <- c(0, 0)")
    alpha.L2 <- c(0, 0)
  }
  alpha.L2 <- sapply(alpha.L2, as.double)

  # check alpha.angular
  if (length(alpha.angular) == 1) {
    alpha.angular <- rep(alpha.angular, 2)
  }
  stopifnot("length of alpha.angular is not equal to two" = length(alpha.angular) == 2)
  if (min(alpha.angular) < 0) {
    if (show.warnings) warning("min(alpha.angular) < 0. Proceeding with alpha.angular = c(0, 0)")
    alpha.angular <- c(0, 0)
  }
  if (max(alpha.angular) > 1) {
    if (show.warnings) warning("max(alpha.angular) > 1. Proceeding with alpha.angular = c(0, 0)")
    alpha.angular <- c(0, 0)
  }
  alpha.angular <- sapply(alpha.angular, as.double)

  # checking alpha.elasticnet
  if (length(alpha.elasticnet) == 1) {
    alpha.elasticnet <- rep(alpha.elasticnet, 2)
  }
  if (!is.null(alpha.elasticnet)) {
    stopifnot("length of alpha.elasticnet is not equal to two" = length(alpha.elasticnet) == 2)
    if (min(alpha.elasticnet) < 0 || max(alpha.elasticnet) > 1) {
      if (show.warnings) warning("alpha.elasticnet must either be NULL or a value in the range from 0 to 1. Proceeding with alpha.elasticnet = c(NULL, NULL)")
    }
    if (!is.null(alpha.elasticnet[1])) {
      if (alpha.L2[1] > 0 || alpha.L1[1] > 0)
        if (show.warnings) warning("alpha.elasticnet[1] was not NULL and alpha.L2[1] or alpha.L1[1] was > 0. Specified value for alpha.elasticnet is used. Remember that elasticnet = (1-alpha)/2 * L2 + alpha * L1.")
      alpha.L2[1] <- as.double(1 - alpha.elasticnet[1]) # note that division by 2 is already implemented in the C code
      alpha.L1[1] <- as.double(alpha.elasticnet[1])
    }
    if (!is.null(alpha.elasticnet[2])) {
      if (alpha.L2[2] > 0 || alpha.L1[2] > 0)
        if (show.warnings) warning("alpha.elasticnet[2] was not NULL and alpha.L2[2] or alpha.L1[2] was > 0. Specified value for alpha.elasticnet is used. Remember that elasticnet = (1-alpha)/2 * L2 + alpha * L1.")
      alpha.L2[2] <- as.double(1 - alpha.elasticnet[2]) # note that division by 2 is already implemented in the C code
      alpha.L1[2] <- as.double(alpha.elasticnet[2])
    }
  }

  # check alpha.L0
  if (length(alpha.L0) == 1) {
    alpha.L0 <- rep(alpha.L0, 2)
  }
  alpha.L0 <- sapply(alpha.L0, function(x) {
    if (x > k) {
      if (show.warnings) warning("alpha.L0 was greater than k. alpha.L0 is set to k")
      return(as.integer(k))
    } else if (x < 1) {
      if (show.warnings) warning("alpha.L0 was set to a value less than 1. alpha.L0 is set to 1")
      return(as.integer(1))
    } else {
      return(as.integer(x))
    }
  })

  # check initializations and generate initial models
  if (is.list(init) && length(init) == 2) {
    W.init <- as.matrix(init[[1]])
    H.init <- as.matrix(init[[1]])
    if (nrow(W.init) != nrow(A) || ncol(W.init) != k || ncol(H.init) != ncol(A) || nrow(H.init) != k) {
      if (show.warnings) warning("Provided matrices for W.init and H.init were not of correct dimensions. Proceeding with random initialization.")
      init <- "random"
    }
    if (!is.numeric(W.init) || !is.numeric(H.init)) {
      W.init <- as.numeric(W.init)
      H.init <- as.numeric(H.init)
      if (show.warnings) warning("Provided matrices for W.init and H.init were not numeric. They have been coerced by as.numeric.")
    }
    if (sum(W.init[W.init == 0]) > 0 || sum(H.init[H.init == 0]) > 0) {
      if (show.warnings) warning("Zeros were detected in provided W.init and/or H.init matrices. Initializing with zeros causes dead links in the factorization. Understand the theoretical implications before using. It is recommended to use zero-weighting for features or samples instead.")
    }
    if (abs(mean(W.init) * mean(H.init) - mean(A@x)) > 2 * sd(A@x)) {
      if (show.warnings) warning("Provided W.init and H.init matrices were used, but the mean(W.init) * mean(H.init) was more than two standard deviations away from the mean(non-zero values in A). W.init and H.init should be scaled so that W * H yields a distribution similar to non-zero values in A.")
    }
  } else if (is.list(init)) {
    if (show.warnings) warning("A list was provided for init but was not of length two. Proceeding with random initialization.")
    init <- "random"
  }

  if (!is.list(init)) {
    if (init %in% c("nndsvd", "nndsvda", "nndsvdar")) {
      mod <- nndsvd(A, k, init)
      W.init <- mod$W
      H.init <- mod$H
    }
    if (init %in% c("random") == FALSE) {
      if (show.warnings) warning("init was not one of valid valids. Proceeding with random initialization.")
      init <- "random"
    }
    if (init == "random") {
      bounds <- c(0, 1000)
      if (constraint.type[[1]] == 1) {
        W.method = "random"
      } else if (constraint.type[[1]] == 2 || constraint.type[[1]] == 3) {
        W.method = "bounded"
        bounds = c(0.4, 0.6)
      } else if (constraint.type[[1]] == 4) {
        W.method = "bounded"
        vals <- multinomial.values1[multinomial.values1 > 0]
        if (length(vals > 2)) {
          bounds = c(min(vals), max(vals))
        } else {
          W.method = "random"
        }
      }
      if (constraint.type[[2]] == 1) {
        H.method = "random"
      } else if (constraint.type[[2]] == 2 || constraint.type[[2]] == 3) {
        H.method = "bounded"
        bounds = c(0.4, 0.6)
      } else if (constraint.type[[2]] == 4) {
        H.method = "bounded"
        vals <- multinomial.values2[multinomial.values2 > 0]
        if (length(vals > 2)) {
          bounds = c(min(vals), max(vals))
        } else {
          H.method = "random"
        }
      }
      mod <- rand.init(A, k, W.method = W.method, H.method = H.method, bounds = bounds, n.sd = 1, seed = seed)
      W.init <- mod$W
      H.init <- mod$H
    }
  }
  W.init <- t(W.init)
  W.initial <- W.init
  H.initial <- H.init

  # check maxit
  maxit <- maxit[1]
  if (maxit < 1) {
    maxit <- 10
    if (show.warnings) warning("maxit was less than 1. Reset to 10 and proceeded with factorization.")
  }
  if (maxit < 5 && show.warnings)
    warning("maxit is less than 5. This is a very low value and satisfactory convergence will likely not be achieved within this number of iterations. Consider increasing to at least 10")
  maxit <- as.integer(maxit)

  # check adaptive.maxit
  adaptive.maxit <- adaptive.maxit[1]
  if (adaptive.maxit < 1) {
    adaptive.maxit <- 1
    if (show.warnings) warning("adaptive.maxit was less than one. Reset to 1 and proceeded with factorization.")
  }
  if (adaptive.maxit > maxit && show.warnings)
    warning("adaptive.maxit is greater than maxit. Generally maxit should be greater than adaptive.maxit. Consider switching around.")
  adaptive.maxit <- as.integer(adaptive.maxit)

  # check trace
  trace <- trace[1]
  if (trace == 0 || is.null(trace))
    trace <- maxit + 1
  if (trace < 1) trace <- 1
  trace <- as.integer(trace)

  # check adaptive.trace
  adaptive.trace <- adaptive.trace[1]
  if (adaptive.trace == 0 || is.null(adaptive.trace))
    adaptive.trace <- adaptive.maxit + 1
  if (adaptive.trace < 1) adaptive.trace <- 1
  adaptive.trace <- as.integer(trace)

  # check mask.zeros
  mask.zeros <- as.integer(mask.zeros[1])
  if (mask.zeros != 1) {
    if (mask.zeros != 0) {
      if (show.warnings) warning("mask.zeros was not set to TRUE/FALSE, 0/1. Proceeding with mask.zeros = FALSE/0")
      mask.zeros <- 0
    }
  }

  # check start.W
  if (!is.logical(start.W)) {
    if (show.warnings) warning("start.W was not a TRUE/FALSE value. Setting start.W = TRUE")
    start.W <- TRUE
  }
  start.W <- as.integer(start.W)

  # check scd.tol
  if (scd.tol < 1e-9) {
    if (show.warnings) warning("scd.tol is a very small number. Generally this should be set between 1e-6 and 1e-9. A small value may result in unnecessarily precise refinement of individual alternating NNLS iterations and cause al onger runtime.")
  } else if (scd.tol > 1e-6) {
    if (show.warnings) warning("scd.tol is a very large number. Generally this should be set between 1e-6 and 1e-9. A large value may result in poor solutions, requiring more alternating NNLS iterations and longer runtime.")
  }
  scd.tol <- as.double(scd.tol)

  # check scd.maxit
  scd.maxit <- as.double(scd.maxit)

  # check tol
  tol <- as.double(tol)

  # check adaptive.tol
  adaptive.tol <- as.double(adaptive.tol)
  if (adaptive.tol < tol && show.warnings)
    warning("adaptive.tol is less than tol. Generally tol should be less than adaptive.tol. Consider switching around.")

  # check adaptive.constraint
  if (adaptive.constraint) {
    if (sum(constraint.type) == 2)
      adaptive.constraint <- FALSE
  }

  # check adaptive.alpha
  if (adaptive.alpha) {
    if (sum(c(alpha.L1, alpha.angular)) == 0 && alpha.L0 == c(k, k)) {
      adaptive.alpha <- FALSE
    }
  }

  # set adaptive.verbose
  ifelse(adaptive.trace >= adaptive.maxit, adaptive.verbose <- 0, adaptive.verbose <- 1)
  if (verbose == 0) adaptive.verbose <- 0

  # ***************************************************************************** ADAPTIVE MATRIX FACTORIZATION

  # if adaptive.constraint = TRUE, factorize without the constraint first and use the resulting models as initialization for the final model
  if (adaptive.constraint) {
    it.lower.limits <- lower.limits
    it.upper.limits <- upper.limits
    for (i in 1:2) {
      if (constraint.type[[i]] == 2) {
        # if bernoulli constraint, and no steps are requested, factorize wih upper.limits = 1, lower.limits = 0
        it.lower.limits[i] <- 0
        it.upper.limits[i] <- 1
      } else if (constraint.type[[i]] == 3) {
        # if bernoulli signed constraint, factorize with upper.limits = 1, lower.limits = -1
        it.lower.limits[i] <- 0
        it.upper.limits[i] <- -1
      } else if (constraint.type[[i]] == 4) {
        # if multinomial constraint, factorize with upper.limits = max(multinomial.values[[1]]), min = min(multinomial.values[[1]])
        it.lower.limits[i] <- min(multinomial.values[[i]])
        it.upper.limits[i] <- max(multinomial.values[[i]])
      }
    }

    if (verbose > 0)
      message(paste0("Factorizing initial model without constraints or L1/L0/angular regularization, and with upper.limits = c(", upper.limits[1], ", ", upper.limits[2], ") and lower.limits = c(", lower.limits[1], ", ", lower.limits[2], ")"))

    mod <- cls_mf(A, k, W.init, H.init, n.threads, adaptive.verbose, it.lower.limits, it.upper.limits, c(0L, 0L), alpha.L2, c(0L, 0L), c(k, k), c(1L, 1L), multinomial.values1, multinomial.values2, weights, use.weights, adaptive.trace, mask.zeros, adaptive.tol, scd.tol, adaptive.maxit, scd.maxit, start.W)
    W.init <- mod$W
    H.init <- mod$H

    # if multiple steps are requested for bernoulli transformation on H, start with c(0.4, 0.6) and traverse a path to c(0, 1)
    if (constraint.type[[1]] == 1 && constraint.type[[2]] == 2 && adaptive.bernoulli.steps > 0) {
      step.size <- 0.5 / adaptive.bernoulli.steps
      for (j in 1:adaptive.bernoulli.steps) {
        step.lower <- 0.5 - step.size * j
        step.upper <- 0.5 + step.size * j
        it.multinomial.values <- c(step.lower, step.upper)
        if (verbose > 0)
          message(paste0("Factorizing model with adaptive bernoulli constraints using a multinomial distribution constraint on H of c(", step.lower, ", ", step.upper, ")"))
        mod <- cls_mf(A, k, W.init, H.init, n.threads, adaptive.verbose, lower.limits, upper.limits, c(0L, 0L), alpha.L2, c(0L, 0L), c(k, k), c(1L, 4L), multinomial.values1, it.multinomial.values, weights, use.weights, adaptive.trace, mask.zeros, adaptive.tol, scd.tol, adaptive.maxit, scd.maxit, start.W)
        W.init <- mod$W
        H.init <- mod$H
      }
    }
  }

  # if adaptive.alpha = TRUE, factorize without L0/angular/L1 first then add in regularizations incrementally
  if (adaptive.alpha) {

    # first apply L1 regularization
    if (sum(alpha.L1) > 0) {
      # L1 regularization is added incrementally adaptive.L1.step at a time, where adaptive.L1.step is a double, starting without any penalization
      # start by adding in any applicable L1 regularization on W
      if (alpha.L1[1] > 0) {
        it.alpha.L1.W <- 0
        while (it.alpha.L1.W < alpha.L1[1]) {
          if (verbose > 0)
            message(paste0("Factoring model with alpha.L1 = c(", it.alpha.L1.W, ", 0) and any applicable constraints"))
          mod <- cls_mf(A, k, W.init, H.init, n.threads, adaptive.verbose, lower.limits, upper.limits, c(as.double(it.alpha.L1.W), 0L), alpha.L2, c(0L, 0L), c(k, k), constraint.type, multinomial.values1, multinomial.values2, weights, use.weights, adaptive.trace, mask.zeros, adaptive.tol, scd.tol, adaptive.maxit, scd.maxit, start.W)
          W.init <- mod$W
          H.init <- mod$H
          it.alpha.L1.W = it.alpha.L1.W + adaptive.L1.step
        }
      }
      # now add in any applicable L1 regularization for H
      if (alpha.L1[2] > 0) {
        it.alpha.L1.H <- 0
        while (it.alpha.L1.H < alpha.L1[2]) {
          if (verbose > 0)
            message(paste0("Factoring model with alpha.L1 = c(", alpha.L1[1], ", ", it.alpha.L1.H, ") and any applicable constraints"))
          mod <- cls_mf(A, k, W.init, H.init, n.threads, adaptive.verbose, lower.limits, upper.limits, c(alpha.L1[1], as.double(it.alpha.L1.H)), alpha.L2, c(0L, 0L), c(k, k), constraint.type, multinomial.values1, multinomial.values2, weights, use.weights, adaptive.trace, mask.zeros, adaptive.tol, scd.tol, adaptive.maxit, scd.maxit, start.W)
          W.init <- mod$W
          H.init <- mod$H
          it.alpha.L1.H = it.alpha.L1.H + adaptive.L1.step
        }
      }
    }

    # next apply angular regularization
    if (sum(alpha.angular) > 0) {
      # alpha.angular regularization is added incrementally adaptive.angular.step at a time, where adaptive.angular.step is a double, starting without any penalization
      # start by adding in any applicable angular regularization on W
      if (alpha.angular[1] > 0) {
        it.alpha.angular.W <- 0
        while (it.alpha.angular.W < alpha.angular[1]) {
          if (verbose > 0)
            message(paste0("Factoring model with alpha.angular = c(", it.alpha.angular.W, ", 0) and any applicable alpha.L1 or alpha.L2 regularization and other constraints"))
          mod <- cls_mf(A, k, W.init, H.init, n.threads, adaptive.verbose, lower.limits, upper.limits, alpha.L1, alpha.L2, c(as.double(it.alpha.angular.W), 0L), c(k, k), constraint.type, multinomial.values1, multinomial.values2, weights, use.weights, adaptive.trace, mask.zeros, adaptive.tol, scd.tol, adaptive.maxit, scd.maxit, start.W)
          W.init <- mod$W
          H.init <- mod$H
          it.alpha.angular.W = it.alpha.angular.W + adaptive.angular.step
        }
      }
      # now add in any applicable angular regularization for H
      if (alpha.angular[2] > 0) {
        it.alpha.angular.H <- 0
        while (it.alpha.angular.H < alpha.angular[2]) {
          if (verbose > 0)
            message(paste0("Factoring model with alpha.angular = c(", alpha.angular[1], ", ", it.alpha.angular.H, ") and any applicable alpha.L1 or alpha.L2 regularization and other constraints"))
          mod <- cls_mf(A, k, W.init, H.init, n.threads, adaptive.verbose, lower.limits, upper.limits, alpha.L1, alpha.L2, c(alpha.angular[1], as.double(it.alpha.angular.H)), c(k, k), constraint.type, multinomial.values1, multinomial.values2, weights, use.weights, adaptive.trace, mask.zeros, adaptive.tol, scd.tol, adaptive.maxit, scd.maxit, start.W)
          W.init <- mod$W
          H.init <- mod$H
          it.alpha.angular.H = it.alpha.angular.H + adaptive.angular.step
        }
      }
    }

    # lastly, apply L0 regularization
    if (alpha.L0[1] < k || alpha.L0[2] < k) {
      # alpha.L0 regularization is enforced incrementally adaptive.L0.step at a time, where adaptive.L0.step is an integer, starting without any penalization (alpha.L0 = k)
      # start by adding in any applicable regularization on W
      if (alpha.L0[1] < k) {
        it.alpha.L0.W <- k
        while (it.alpha.L0.W > alpha.L0[1]) {
          if (verbose > 0)
            message(paste0("Factoring model with alpha.L0 = c(", it.alpha.L0.W, ", ", k, ") and any applicable alpha.L1, alpha.angular, or alpha.L2 regularization and other constraints"))
          mod <- cls_mf(A, k, W.init, H.init, n.threads, adaptive.verbose, lower.limits, upper.limits, alpha.L1, alpha.L2, alpha.angular, c(as.double(it.alpha.L0.W), k), constraint.type, multinomial.values1, multinomial.values2, weights, use.weights, adaptive.trace, mask.zeros, adaptive.tol, scd.tol, adaptive.maxit, scd.maxit, start.W)
          W.init <- mod$W
          H.init <- mod$H
          it.alpha.L0.W = it.alpha.L0.W - adaptive.L0.step
        }
      }
      # now add in any applicable L0 regularization for H
      if (alpha.L0[2] < k) {
        it.alpha.L0.H <- k
        while (it.alpha.L0.H > alpha.L0[2]) {
          if (verbose > 0)
            message(paste0("Factoring model with alpha.L0 = c(", alpha.L0[1], ", ", it.alpha.L0.H, ") and any applicable alpha.L1, alpha.angular, or alpha.L2 regularization and other constraints"))
          mod <- cls_mf(A, k, W.init, H.init, n.threads, adaptive.verbose, lower.limits, upper.limits, alpha.L1, alpha.L2, alpha.angular, c(alpha.L0[1], as.double(it.alpha.L0.H)), constraint.type, multinomial.values1, multinomial.values2, weights, use.weights, adaptive.trace, mask.zeros, adaptive.tol, scd.tol, adaptive.maxit, scd.maxit, start.W)
          W.init <- mod$W
          H.init <- mod$H
          it.alpha.L0.H = it.alpha.L0.H - adaptive.L0.step
        }
      }
    }
  }

  # ***************************************************************************** FINAL MODEL MATRIX FACTORIZATION
  # If adaptive.constraint and adaptive.alpha are FALSE, then this is the only model that will be run

  if (adaptive.constraint == TRUE || adaptive.alpha == TRUE) {
    if (verbose > 0)
      message("Factorizing final model")
  }

  res <- cls_mf(A, k, W.init, H.init, n.threads, verbose, lower.limits, upper.limits, alpha.L1, alpha.L2, alpha.angular, alpha.L0, constraint.type, multinomial.values1, multinomial.values2, weights, use.weights, trace, mask.zeros, tol, scd.tol, maxit, scd.maxit, start.W)

  # ***************************************************************************** POST-PROCESSING

  # if the result didn't converge within the permitted number of iterations, throw a warning
  if (res$rel.err > tol)
    if (show.warnings) warning("target tol was not reached within maxit iterations")

  # add back names
  rownames(res$W) <- rownames(A)
  colnames(res$H) <- colnames(A)
  colnames(res$W) <- rownames(res$H) <- paste0("nmf_", 1:k)

  # log call of interpreted parameters (non-matrix items only)
  res$call <- list(
    "k" = k,
    "maxit" = maxit,
    "tol" = tol,
    "n.threads" = n.threads,
    "verbose" = verbose,
    "lower.limits" = lower.limits,
    "upper.limits" = upper.limits,
    "alpha.L1" = alpha.L1,
    "alpha.L2" = alpha.L2,
    "alpha.angular" = alpha.angular,
    "alpha.elasticnet" = alpha.elasticnet,
    "alpha.L0" = alpha.L0,
    "trace" = trace,
    "mask.zeros" = mask.zeros,
    "start.W" = start.W,
    "scd.tol" = scd.tol,
    "scd.maxit" = scd.maxit,
    "multinomial.values" = multinomial.values,
    "constraint.type" = constraint.type,
    "use.weights" = use.weights,
    "seed" = seed,
    "return.sparse" = return.sparse,
    "show.warnings" = show.warnings,
    "adaptive.constraint" = adaptive.constraint,
    "adaptive.bernoulli.steps" = adaptive.bernoulli.steps,
    "adaptive.alpha" = adaptive.alpha,
    "adaptive.tol" = adaptive.tol,
    "adaptive.maxit" = adaptive.maxit,
    "adaptive.trace" = adaptive.trace,
    "adaptive.L1.step" = adaptive.L1.step,
    "adaptive.angular.step" = adaptive.angular.step,
    "adaptive.L0.step" = adaptive.L0.step
  )

  if (length(return.sparse) == 1 && return.sparse == TRUE) {
    res$W <- as(res$W, "dgCMatrix")
    res$H <- as(res$H, "dgCMatrix")
  } else if (length(return.sparse) == 2) {
    if (return.sparse[[1]])
      res$W <- as(res$W, "dgCMatrix")
    if (return.sparse[[2]])
      res$H <- as(res$H, "dgCMatrix")
  }

  res$W.init = W.initial
  res$H.init = H.initial

  if (use.weights) {
    res$weights <- weights
  } else {
    res$weights <- as(matrix(), "dgCMatrix")
  }

  return(res)
}