imputeX: Impute missing entries in 'X' data

View source: R/impute.R

imputeX    R Documentation

Description

Replace 'NA'/'NaN' values in new 'X' data with the model's predictions, computed from that same 'X' data and, optionally, from 'U' data.

Note: this function will not perform any internal re-indexing of the data. If the 'X' to which the model was fit was a 'data.frame', the numbering of the items will be under 'model$info$item_mapping'. There is also a function predict_new, which will let the model do the appropriate re-indexing.
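As a minimal sketch of what manual re-indexing could look like, the snippet below arranges new triplet data into a dense matrix whose columns follow a given item order. It uses base R only. The vector 'item_mapping' here is a hypothetical stand-in for 'model$info$item_mapping', and the sketch assumes that this field lists the item IDs in the column order used by the model; the names 'new_df', 'users' and 'X_new' are likewise made up for illustration.

```r
# Hypothetical stand-in for model$info$item_mapping
# (assumed: item IDs in the column order used by the model)
item_mapping <- c("itemC", "itemA", "itemB")

# New observations in triplet form (illustrative data)
new_df <- data.frame(
    user  = c("u1", "u1", "u2"),
    item  = c("itemA", "itemC", "itemB"),
    value = c(4, 2, 5),
    stringsAsFactors = FALSE
)

# Dense matrix with one row per new user and one column per model item
users <- unique(new_df$user)
X_new <- matrix(NA_real_, nrow = length(users), ncol = length(item_mapping),
                dimnames = list(users, item_mapping))

# Place each value at (row of its user, column of its item in the model order)
X_new[cbind(match(new_df$user, users),
            match(new_df$item, item_mapping))] <- new_df$value
print(X_new)
```

Entries left as 'NA' are the ones that would then be imputed.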

Usage

imputeX(
  model,
  X,
  weight = NULL,
  U = NULL,
  U_bin = NULL,
  nthreads = model$info$nthreads
)

Arguments

model

A collective matrix factorization model as output by function CMF. This functionality is not available for the other model classes.

X

New 'X' data with missing values which will be imputed. Must be passed as a dense matrix from base R (class 'matrix').

weight

Associated observation weights for entries in 'X'. If passed, must have the same shape as 'X'.

U

New 'U' data, with rows matching to those of 'X'. Can be passed in the following formats:

  • A sparse COO/triplets matrix, either from package 'Matrix' (class 'dgTMatrix'), or from package 'SparseM' (class 'matrix.coo').

  • A sparse matrix in CSR format, either from package 'Matrix' (class 'dgRMatrix'), or from package 'SparseM' (class 'matrix.csr'). Passing the input as CSR is faster than passing it as COO, since COO input will be converted internally.

  • A sparse row vector from package 'Matrix' (class 'dsparseVector').

  • A dense matrix from base R (class 'matrix'), with missing entries set as 'NA'/'NaN'.

  • A dense row vector from base R (class 'numeric').

  • A 'data.frame'.
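For illustration, several of the formats listed above can be built with package 'Matrix' and base R as in the sketch below. The object names and values are made up, and the two-row shape assumes an 'X' with two rows.

```r
library(Matrix)

# Dense base R matrix, with missing entries set as NA
U_dense <- matrix(c(1, NA, 0, 2, 0, 3), nrow = 2)

# Sparse COO/triplets matrix (class 'dgTMatrix')
U_coo <- sparseMatrix(i = c(1, 2, 2), j = c(1, 2, 3), x = c(1, 2, 3),
                      dims = c(2, 3), repr = "T")

# Sparse CSR matrix (class 'dgRMatrix'), converted from the COO one
U_csr <- as(U_coo, "RsparseMatrix")

# Sparse row vector (class 'dsparseVector') for a single row of 'U'
u_row <- sparseVector(x = c(1.5, 2.5), i = c(1, 3), length = 3)

print(class(U_coo))
print(class(U_csr))
print(class(u_row))
```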

U_bin

New binary columns of 'U' (rows matching to those of 'X'). Must be passed as a dense matrix from base R or as a 'data.frame'.

nthreads

Number of parallel threads to use.

Details

If using the matrix factorization model as a general missing-value imputer, it's recommended to:

  • Fit a model without user biases.

  • Set a lower regularization for the item biases than for the matrices.

  • Tune the regularization parameter(s) carefully.

In general, matrix factorization works better for imputing selected entries of sparse-and-wide matrices. For dense matrices, the method is unlikely to give better results than mean/median imputation; it is nevertheless provided for experimentation purposes.
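One possible way to tune the regularization, and to check whether the model actually beats mean imputation, is to hide a fraction of the observed entries and measure the error on them. The sketch below is not part of the package: 'holdout_rmse' and 'mean_imputer' are hypothetical helpers written in base R, and the mean imputer is used only so that the snippet runs on its own.

```r
# Hide a fraction of the observed entries, impute them, and
# return the RMSE on the hidden entries (hypothetical helper)
holdout_rmse <- function(X, impute_fn, frac = 0.1, seed = 1) {
    set.seed(seed)
    observed <- which(!is.na(X))
    held_out <- sample(observed, size = max(1L, floor(frac * length(observed))))
    X_masked <- X
    X_masked[held_out] <- NA
    X_filled <- impute_fn(X_masked)
    sqrt(mean((X[held_out] - X_filled[held_out])^2))
}

# Baseline imputer: fill each missing entry with its column mean
mean_imputer <- function(X) {
    col_means <- colMeans(X, na.rm = TRUE)
    idx <- which(is.na(X), arr.ind = TRUE)
    X[idx] <- col_means[idx[, "col"]]
    X
}

set.seed(123)
X_demo <- matrix(rnorm(200), nrow = 40)
X_demo[sample(length(X_demo), 20)] <- NA
rmse <- holdout_rmse(X_demo, mean_imputer)
cat(sprintf("Hold-out RMSE for mean imputation: %f\n", rmse))

# A model-based imputer could be plugged in the same way, e.g. for a
# candidate regularization value 'lam':
#   function(Xm) imputeX(CMF(Xm, k = 3, lambda = lam, user_bias = FALSE,
#                            verbose = FALSE), Xm)
```

Comparing the hold-out RMSE across several candidate values, and against the mean-imputation baseline, gives a rough indication of which setting to keep.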

Value

The 'X' matrix with its missing values imputed according to the model predictions.

Examples

library(cmfrec)

### Simplest example
SeqMat <- matrix(1:50, nrow=10)
SeqMat[2,1] <- NaN
SeqMat[8,3] <- NaN
m <- CMF(SeqMat, k=1, lambda=1e-10, nthreads=1L, verbose=FALSE)
imputeX(m, SeqMat)


### Better example with multivariate normal data
if (require("MASS")) {
    ### Generate random data, set some values as NA
    set.seed(1)
    n_rows <- 1000
    n_cols <- 5
    mu <- rnorm(n_cols)
    S <- matrix(rnorm(n_cols^2), nrow = n_cols)
    S <- t(S) %*% S
    X <- MASS::mvrnorm(n_rows, mu, S)
    X_na <- X
    values_NA <- matrix(runif(n_rows*n_cols) < .15, nrow=n_rows)
    X_na[values_NA] <- NaN
    
    ### In the event that any column is fully missing,
    ### drop it from the reference data and the mask as well
    cols_remove <- colSums(is.na(X_na)) == n_rows
    if (any(cols_remove)) {
        X <- X[, !cols_remove, drop=FALSE]
        X_na <- X_na[, !cols_remove, drop=FALSE]
        values_NA <- values_NA[, !cols_remove, drop=FALSE]
    }
    
    ### Impute missing values with model
    ### (no user biases; lower regularization for biases than for matrices)
    model <- CMF(X_na, k=3, lambda=c(0,0,1,1,1,1),
                 user_bias=FALSE,
                 verbose=FALSE, nthreads=1L)
    X_imputed <- imputeX(model, X_na)
    cat(sprintf("RMSE for imputed values w/model: %f\n",
                sqrt(mean((X[values_NA] - X_imputed[values_NA])^2))))
    
    ### Compare against simple mean imputation
    X_means <- colMeans(X_na, na.rm=TRUE)
    X_imp_mean <- X_na
    for (cl in seq_len(ncol(X_na)))
        X_imp_mean[values_NA[,cl], cl] <- X_means[cl]
    cat(sprintf("RMSE for imputed values w/means: %f\n",
                sqrt(mean((X[values_NA] - X_imp_mean[values_NA])^2))))
}

cmfrec documentation built on April 11, 2023, 6 p.m.