imputeX | R Documentation |
Replace 'NA'/'NaN' values in new 'X' data according to the model predictions, given that same 'X' data and optionally 'U' data.
Note: this function does not perform any internal re-indexing of the data. If the 'X' to which the model was fit was a 'data.frame', the numbering of the items is stored under 'model$info$item_mapping'. The function predict_new is also available, and lets the model do the appropriate re-indexing.
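Because no re-indexing is done internally, the columns of a new 'X' matrix have to be put in the model's item order by the caller. The following is a minimal sketch in base R. It assumes that the item mapping is a vector of item IDs ordered by internal column number; the vector 'item_mapping' below is a hypothetical stand-in for 'model$info$item_mapping', and the item names are made up for illustration.

```r
# Hypothetical stand-in for model$info$item_mapping: item IDs in the
# order of the model's internal column numbering (assumed layout).
item_mapping <- c("item_c", "item_a", "item_b")

# New data whose columns are named by item ID, in arbitrary order.
X_new <- matrix(c(1, NA, 3,
                  4, 5, NA),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("item_a", "item_b", "item_c")))

# Reorder the columns so that column j corresponds to item_mapping[j].
col_order <- match(item_mapping, colnames(X_new))
stopifnot(!anyNA(col_order))  # every model item must be present
X_aligned <- X_new[, col_order, drop = FALSE]
```

The aligned matrix 'X_aligned' is what would then be passed as 'X'. When the IDs are not known in advance, predict_new may be the simpler route.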
imputeX(model, X, weight = NULL, U = NULL, U_bin = NULL, nthreads = model$info$nthreads)
model |
A collective matrix factorization model as output by function CMF. This functionality is not available for the other model classes. |
X |
New 'X' data with missing values which will be imputed. Must be passed as a dense matrix from base R (class 'matrix'). |
weight |
Associated observation weights for entries in 'X'. If passed, must have the same shape as 'X'. |
U |
New 'U' data, with rows matching to those of 'X'. Can be passed in the following formats:
|
U_bin |
New binary columns of 'U' (rows matching to those of 'X'). Must be passed as a dense matrix from base R or as a 'data.frame'. |
nthreads |
Number of parallel threads to use. |
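As an illustration of the shape requirements above, the following sketch prepares 'X' and 'weight' in base R. The data and the weighting scheme are made up, and the final call is left commented out because it needs a fitted CMF object (named 'model' here); whether weights are appropriate at imputation time likely depends on how that model was fit.

```r
# Illustrative data: a data.frame with missing entries.
df <- data.frame(a = c(1.5, NA, 3.0), b = c(NA, 2.0, 4.5))

# 'X' must be a dense base-R matrix, so convert explicitly.
X_new <- as.matrix(df)
storage.mode(X_new) <- "double"

# Hypothetical weights: 1 everywhere, with row 3 trusted twice as much.
W <- matrix(1, nrow = nrow(X_new), ncol = ncol(X_new))
W[3, ] <- 2

# 'weight' must have the same shape as 'X'.
stopifnot(is.matrix(X_new), identical(dim(W), dim(X_new)))
# imputeX(model, X_new, weight = W)  # 'model' would be a fitted CMF object
```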
If using the matrix factorization model as a general missing-value imputer, it's recommended to:
Fit a model without user biases.
Set a lower regularization for the item biases than for the matrices.
Tune the regularization parameter(s) very well.
In general, matrix factorization works better for imputing selected entries of sparse-and-wide matrices. For dense matrices, the method is unlikely to give better results than mean/median imputation, but it is nevertheless provided for experimentation purposes.
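One way to act on the advice about tuning and about comparing against mean/median imputation is to hide some observed entries, impute them, and measure the error on the hidden entries only. The sketch below is a generic hold-out helper in base R, not part of cmfrec: 'holdout_rmse', 'impute_col_means' and 'lam' are hypothetical names introduced for illustration, and the cmfrec call is left commented out.

```r
# Hide a random fraction of the observed entries, impute, and score the
# imputations on the hidden entries only.
holdout_rmse <- function(X, impute_fn, frac = 0.1, seed = 1) {
  set.seed(seed)
  observed <- which(!is.na(X))
  hidden <- sample(observed, size = max(1, floor(frac * length(observed))))
  X_masked <- X
  X_masked[hidden] <- NA
  X_hat <- impute_fn(X_masked)
  sqrt(mean((X[hidden] - X_hat[hidden])^2))
}

# Baseline imputer: replace each missing entry with its column mean.
impute_col_means <- function(X) {
  means <- colMeans(X, na.rm = TRUE)
  idx <- which(is.na(X), arr.ind = TRUE)
  X[idx] <- means[idx[, "col"]]
  X
}

set.seed(123)
X_demo <- matrix(rnorm(200), nrow = 40)
X_demo[sample(length(X_demo), 20)] <- NA
baseline <- holdout_rmse(X_demo, impute_col_means)

# With cmfrec loaded, a candidate regularization 'lam' (hypothetical name)
# could be scored the same way:
# holdout_rmse(X_demo, function(X) imputeX(
#   CMF(X, k = 3, lambda = lam, user_bias = FALSE, verbose = FALSE), X))
```

Scoring several candidate values of 'lam' this way and comparing them against 'baseline' gives a rough indication of whether the factorization model is worth using on a given matrix.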
The 'X' matrix with its missing values imputed according to the model predictions.
library(cmfrec)

### Simplest example
SeqMat <- matrix(1:50, nrow=10)
SeqMat[2,1] <- NaN
SeqMat[8,3] <- NaN
m <- CMF(SeqMat, k=1, lambda=1e-10, nthreads=1L, verbose=FALSE)
imputeX(m, SeqMat)

### Better example with multivariate normal data
if (require("MASS")) {
    ### Generate random data, set some values as NA
    set.seed(1)
    n_rows <- 1000
    n_cols <- 5
    mu <- rnorm(n_cols)
    S <- matrix(rnorm(n_cols^2), nrow = n_cols)
    S <- t(S) %*% S
    X <- MASS::mvrnorm(n_rows, mu, S)
    X_na <- X
    values_NA <- matrix(runif(n_rows*n_cols) < .15, nrow=n_rows)
    X_na[values_NA] <- NaN

    ### In the event that any column is fully missing
    if (any(colSums(is.na(X_na)) == n_rows)) {
        cols_remove <- colSums(is.na(X_na)) == n_rows
        X_na <- X_na[, !cols_remove, drop=FALSE]
        values_NA <- values_NA[, !cols_remove, drop=FALSE]
    }

    ### Impute missing values with model
    model <- CMF(X_na, k=3, lambda=c(0,0,1,1,1,1),
                 user_bias=FALSE, verbose=FALSE, nthreads=1L)
    X_imputed <- imputeX(model, X_na)
    cat(sprintf("RMSE for imputed values w/model: %f\n",
                sqrt(mean((X[values_NA] - X_imputed[values_NA])^2))))

    ### Compare against simple mean imputation
    X_means <- apply(X_na, 2, mean, na.rm=TRUE)
    X_imp_mean <- X_na
    for (cl in 1:n_cols)
        X_imp_mean[values_NA[,cl], cl] <- X_means[cl]
    cat(sprintf("RMSE for imputed values w/means: %f\n",
                sqrt(mean((X[values_NA] - X_imp_mean[values_NA])^2))))
}