fit: Matrix Factorization Models

fit_models    R Documentation

Matrix Factorization Models

Description

Models for collective matrix factorization (also known as multi-view or multi-way). These models try to approximate a matrix 'X' as the product of two lower-rank matrices 'A' and 'B' (that is: X ~ A*t(B)) by finding the values of 'A' and 'B' that minimize the squared error w.r.t. 'X', optionally aided with side information matrices 'U' and 'I' about rows and columns of 'X'.
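As a minimal base-R sketch of the objective described above, the squared error of X ~ A*t(B) is evaluated only over the non-missing entries of 'X'. The matrices here are random placeholders rather than fitted values, and the helper name 'squared_error_observed' is hypothetical (it is not a function of this package):

```r
set.seed(1)
n_users <- 5; n_items <- 4; k <- 2

# Toy 'X' with missing entries (NA = unknown interaction)
X <- matrix(sample(1:5, n_users * n_items, replace = TRUE), nrow = n_users)
X[sample(length(X), 6)] <- NA

# Placeholder low-rank factors: 'A' is users x k, 'B' is items x k
A <- matrix(rnorm(n_users * k), nrow = n_users)
B <- matrix(rnorm(n_items * k), nrow = n_items)

# Hypothetical helper: squared error of X ~ A %*% t(B),
# summed over the observed (non-NA) entries only
squared_error_observed <- function(X, A, B) {
  X_hat <- A %*% t(B)
  sum((X - X_hat)^2, na.rm = TRUE)
}

err <- squared_error_observed(X, A, B)
```

A fitting procedure would search for the 'A' and 'B' that make this quantity (plus regularization) as small as possible.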

The package documentation is built with recommendation systems in mind, for which it assumes that 'X' is a sparse matrix in which rows represent users, columns represent items, and the non-missing values denote interactions such as movie ratings from users to items. The idea behind it is to recommend the missing entries in 'X' that have the highest predicted value according to the approximation. For other domains, take any mention of users as rows and any mention of items as columns (e.g. when used for topic modeling, the "users" are documents and the "items" are word occurrences).

In the 'CMF' model (main functionality of the package and most flexible model type), the 'A' and 'B' matrices are also used to jointly factorize the side information matrices - that is: U ~ A*t(C), I ~ B*t(D), sharing the same components or latent factors for two factorizations. Informally, this means that the obtained factors now need to explain both the interactions data and the attributes data, making them generalize better to the non-present entries of 'X' and to new data.

In 'CMF' and the other non-implicit models, the 'X' data is by default centered beforehand by subtracting its mean (see parameter 'center'), and the model can optionally add user and item biases (which are model parameters, not pre-estimated).

The model might optionally generate so-called implicit features from the same 'X' data, by factorizing binary matrices which tell which entries in 'X' are present, i.e.: Ix ~ A*t(Bi), t(Ix) ~ B*Ai, where Ix is an indicator matrix which is treated as full (no unknown values).
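The indicator matrix mentioned above can be illustrated in base R as follows. This is only a sketch of what 'Ix' looks like for a small dense 'X'; how the package builds and factorizes it internally is not shown here:

```r
# Toy 'X' with missing entries
X <- matrix(c( 5, NA, 3,
              NA, NA, 1), nrow = 2, byrow = TRUE)

# 'Ix' marks which entries of 'X' are present (1) or absent (0);
# unlike 'X', it has no unknown values
Ix <- (!is.na(X)) * 1

# The transpose is what gets factorized from the item side: t(Ix) ~ B*Ai
Ix_t <- t(Ix)
```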

The 'CMF_implicit' model extends the collective factorization idea to the implicit-feedback case, based on reference [3]. While in 'CMF' the values of 'X' are taken at face value and the objective is to minimize squared error over the non-missing entries, in the implicit-feedback variants the matrix 'X' is assumed to be binary (all entries are zero or one, with no unknown values), with the positive entries (those which are not missing in the data) having a weight determined by 'X'.

'CMF' is intended for explicit feedback data (e.g. movie ratings, which contain both likes and dislikes), whereas 'CMF_implicit' is intended for implicit feedback data (e.g. number of times each user watched each movie/series, which do not contain dislikes and the values are treated as confidence scores).

The 'MostPopular' model is a simpler heuristic implemented for comparison purposes which is equivalent to either 'CMF' or 'CMF_implicit' with 'k=0' plus user/item biases. If a personalized model is not able to beat this heuristic under the evaluation metrics of interest, chances are that such personalized model needs better tuning.

The 'ContentBased' model offers a different alternative in which the latent factors are determined directly from the user/item attributes (which are no longer optional) - that is: A = U*C, B = I*D, optionally adding per-column intercepts, and is aimed at cold-start predictions (such a model is extremely unlikely to perform better for new users in the presence of interactions data). For this model, the package provides functionality for making predictions about potential new entries in 'X' which involve both new rows and new columns at the same time. Unlike the others, it does not offer an implicit-feedback variant.

The 'OMF_explicit' model extends the 'ContentBased' by adding a free offset determined for each user and item according to 'X' data alone - that is: Am = A + U*C, Bm = B + I*D, X ~ Am*t(Bm), and 'OMF_implicit' extends the idea to the implicit-feedback case.
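The two formulations above can be sketched in base R with random placeholder matrices (nothing here is fitted, and the dimensions are arbitrary choices for illustration):

```r
set.seed(1)
n_users <- 6; n_items <- 5   # rows and columns of 'X'
p <- 3; q <- 2               # number of user and item attributes
k <- 2                       # number of latent factors

U <- matrix(rnorm(n_users * p), nrow = n_users)  # user attributes
I <- matrix(rnorm(n_items * q), nrow = n_items)  # item attributes
C <- matrix(rnorm(p * k), nrow = p)              # attribute-to-factor maps
D <- matrix(rnorm(q * k), nrow = q)

# 'ContentBased': factors are determined from the attributes alone
A_cb <- U %*% C
B_cb <- I %*% D
X_hat_cb <- A_cb %*% t(B_cb)

# 'OMF_explicit': free offsets 'A' and 'B' are added on top
A <- matrix(rnorm(n_users * k), nrow = n_users)
B <- matrix(rnorm(n_items * k), nrow = n_items)
Am <- A + U %*% C
Bm <- B + I %*% D
X_hat_omf <- Am %*% t(Bm)

# Cold-start under the content-based form: a new user only needs attributes
u_new <- matrix(rnorm(p), nrow = 1)
scores_new_user <- (u_new %*% C) %*% t(B_cb)
```

Because the content-based factors depend only on 'U' and 'I', the last two lines show why predictions for rows that were never part of 'X' are possible in that formulation.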

Note that 'ContentBased' is equivalent to 'OMF_explicit' with 'k=0', 'k_main=0' and 'k_sec>0' (see documentation for details about these parameters). For a different formulation in which user factors are determined directly for item attributes (and same for items with user attributes), it's also possible to use 'OMF_explicit' with 'k=0' while passing 'k_sec' and 'k_main'.

('OMF_explicit' and 'OMF_implicit' were only implemented for research purposes for cold-start recommendations in cases in which there is side info about users but not about items or vice-versa - it is not recommended to rely on them.)

Some extra considerations about the parameters here:

  • By default, the terms in the optimization objective are not scaled by the number of entries (see parameter 'scale_lam'), thus hyperparameters such as 'lambda' will require more tuning than in other software and will require trying a wider range of values.

  • The regularization applied to the matrices is the same for all users and for all items.

  • The default hyperparameters are not geared towards speed - for faster fitting times, use 'method="als"', 'use_cg=TRUE', 'finalize_chol=FALSE', 'precompute_for_predictions=FALSE', 'verbose=FALSE', and pass 'X' as a matrix (either sparse or dense).

  • The default hyperparameters are also very different from those in other software - for example, for 'CMF_implicit', in order to match the hyperparameters of the Python package 'implicit', one would have to use 'k=100', 'lambda=0.01', 'niter=15', 'use_cg=TRUE', 'finalize_chol=FALSE', and use single-precision floating point numbers (not supported in the R version of this package).
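The speed-oriented settings listed above could be collected and passed as in the following sketch. The toy data, the value 'k=5' and the use of a single thread are illustrative assumptions, and the call only runs if the package (assumed here to be installed under the name 'cmfrec') is available:

```r
set.seed(1)

# Toy dense 'X' (20 users x 10 items) with missing entries as NA
X <- matrix(as.numeric(sample(c(NA, 1:5), 200, replace = TRUE)), nrow = 20)

# Settings suggested in the text for faster fitting times
fast_args <- list(
  method = "als",
  use_cg = TRUE,
  finalize_chol = FALSE,
  precompute_for_predictions = FALSE,
  verbose = FALSE
)

# Only attempt the fit when the package is installed
if (requireNamespace("cmfrec", quietly = TRUE)) {
  model <- do.call(
    cmfrec::CMF,
    c(list(X = X, k = 5L, nthreads = 1L), fast_args)
  )
}
```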

Usage

CMF(
  X,
  U = NULL,
  I = NULL,
  U_bin = NULL,
  I_bin = NULL,
  weight = NULL,
  k = 40L,
  lambda = 10,
  method = "als",
  use_cg = TRUE,
  user_bias = TRUE,
  item_bias = TRUE,
  center = TRUE,
  add_implicit_features = FALSE,
  scale_lam = FALSE,
  scale_lam_sideinfo = FALSE,
  scale_bias_const = FALSE,
  k_user = 0L,
  k_item = 0L,
  k_main = 0L,
  w_main = 1,
  w_user = 1,
  w_item = 1,
  w_implicit = 0.5,
  l1_lambda = 0,
  center_U = TRUE,
  center_I = TRUE,
  maxiter = 800L,
  niter = 10L,
  parallelize = "separate",
  corr_pairs = 4L,
  max_cg_steps = 3L,
  precondition_cg = FALSE,
  finalize_chol = TRUE,
  NA_as_zero = FALSE,
  NA_as_zero_user = FALSE,
  NA_as_zero_item = FALSE,
  nonneg = FALSE,
  nonneg_C = FALSE,
  nonneg_D = FALSE,
  max_cd_steps = 100L,
  precompute_for_predictions = TRUE,
  include_all_X = TRUE,
  verbose = TRUE,
  print_every = 10L,
  handle_interrupt = TRUE,
  seed = 1L,
  nthreads = parallel::detectCores()
)

CMF_implicit(
  X,
  U = NULL,
  I = NULL,
  k = 40L,
  lambda = 1,
  alpha = 1,
  use_cg = TRUE,
  k_user = 0L,
  k_item = 0L,
  k_main = 0L,
  w_main = 1,
  w_user = 1,
  w_item = 1,
  l1_lambda = 0,
  center_U = TRUE,
  center_I = TRUE,
  niter = 10L,
  max_cg_steps = 3L,
  precondition_cg = FALSE,
  finalize_chol = FALSE,
  NA_as_zero_user = FALSE,
  NA_as_zero_item = FALSE,
  nonneg = FALSE,
  nonneg_C = FALSE,
  nonneg_D = FALSE,
  max_cd_steps = 100L,
  apply_log_transf = FALSE,
  precompute_for_predictions = TRUE,
  verbose = TRUE,
  handle_interrupt = TRUE,
  seed = 1L,
  nthreads = parallel::detectCores()
)

MostPopular(
  X,
  weight = NULL,
  implicit = FALSE,
  center = TRUE,
  user_bias = ifelse(implicit, FALSE, TRUE),
  lambda = 10,
  alpha = 1,
  NA_as_zero = FALSE,
  apply_log_transf = FALSE,
  nonneg = FALSE,
  scale_lam = FALSE,
  scale_bias_const = FALSE
)

ContentBased(
  X,
  U,
  I,
  weight = NULL,
  k = 20L,
  lambda = 100,
  user_bias = FALSE,
  item_bias = FALSE,
  add_intercepts = TRUE,
  maxiter = 3000L,
  corr_pairs = 3L,
  parallelize = "separate",
  verbose = TRUE,
  print_every = 100L,
  handle_interrupt = TRUE,
  start_with_ALS = TRUE,
  seed = 1L,
  nthreads = parallel::detectCores()
)

OMF_explicit(
  X,
  U = NULL,
  I = NULL,
  weight = NULL,
  k = 50L,
  lambda = 10,
  method = "lbfgs",
  use_cg = TRUE,
  precondition_cg = FALSE,
  user_bias = TRUE,
  item_bias = TRUE,
  center = TRUE,
  k_sec = 0L,
  k_main = 0L,
  add_intercepts = TRUE,
  w_user = 1,
  w_item = 1,
  maxiter = 10000L,
  niter = 10L,
  parallelize = "separate",
  corr_pairs = 7L,
  max_cg_steps = 3L,
  finalize_chol = TRUE,
  NA_as_zero = FALSE,
  verbose = TRUE,
  print_every = 100L,
  handle_interrupt = TRUE,
  seed = 1L,
  nthreads = parallel::detectCores()
)

OMF_implicit(
  X,
  U = NULL,
  I = NULL,
  k = 50L,
  lambda = 1,
  alpha = 1,
  use_cg = TRUE,
  precondition_cg = FALSE,
  add_intercepts = TRUE,
  niter = 10L,
  apply_log_transf = FALSE,
  max_cg_steps = 3L,
  finalize_chol = FALSE,
  verbose = FALSE,
  handle_interrupt = TRUE,
  seed = 1L,
  nthreads = parallel::detectCores()
)

Arguments

X

The main matrix with interactions data to factorize (e.g. movie ratings by users, bag-of-words representations of texts, etc.). The package is built with recommender systems in mind, and will assume that 'X' is a matrix in which users are rows, items are columns, and values denote interactions between a given user and item. Can be passed in the following formats:

  • A 'data.frame' representing triplets, in which there should be one row for each present or non-missing interaction, with the first column denoting the user/row ID, the second column the item/column ID, and the third column the value (e.g. movie rating). If passed in this format, the user and item IDs will be reindexed internally, and the side information matrices should have row names matching to those IDs. If there are observation weights, these should be the fourth column.

  • A sparse matrix in COO/triplets format, either from package 'Matrix' (class 'dgTMatrix') or from package 'SparseM' (class 'matrix.coo').

  • A dense matrix from base R (class 'matrix'), with missing values set as 'NA'/'NaN'.

If using the package 'softImpute', objects of type 'incomplete' from that package can be converted to 'Matrix' objects through e.g. 'as(X, "TsparseMatrix")'. Sparse matrices can be created through e.g. 'Matrix::sparseMatrix(..., repr="T")'.
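For instance, a triplets table can be turned into a COO sparse matrix along these lines (a small sketch with made-up ratings; the column names are arbitrary):

```r
library(Matrix)

# Made-up triplets: one row per observed interaction
ratings <- data.frame(
  user  = c(1L, 1L, 2L, 3L),
  item  = c(1L, 3L, 2L, 3L),
  value = c(5, 3, 4, 1)
)

# repr = "T" requests the triplets (COO) representation
X_coo <- sparseMatrix(
  i = ratings$user,
  j = ratings$item,
  x = ratings$value,
  dims = c(3L, 3L),
  repr = "T"
)

class(X_coo)
```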

It is recommended for faster fitting times to pass the 'X' data as a matrix (either sparse or dense) as then it will avoid internal reindexes.

Note that, generally, it's possible to pass partially disjoint sets of users/items between the different matrices (e.g. it's possible for both the 'X' and 'U' matrices to have rows that the other doesn't have). If any of the inputs has fewer rows/columns than the other(s) (e.g. 'U' has more rows than 'X', or 'I' has more rows than there are columns in 'X'), it will assume that the remaining rows/columns have only missing values. However, when having partially disjoint inputs, the order of the rows/columns matters for speed for the 'CMF' and 'CMF_implicit' models under the ALS method, as it might run faster when the 'U'/'I' inputs that do not have matching rows/columns in 'X' have those unmatched rows/columns at the end (last rows/columns) and the 'X' input is shorter. See also the parameter 'include_all_X' for info about predicting with mismatched 'X'.

If passed as sparse/triplets, the non-missing values should not contain any 'NA'/'NaN's.

U

User attributes information. Can be passed in the following formats:

  • A 'matrix', with rows corresponding to rows of 'X' and columns to user attributes. For the 'CMF' and 'CMF_implicit' models, missing values are supported and should be set to 'NA'/'NaN'.

  • A 'data.frame' with the same format as above.

  • A sparse matrix in COO/triplets format, either from package 'Matrix' (class 'dgTMatrix') or from package 'SparseM' (class 'matrix.coo'). Same as above, rows correspond to rows of 'X' and columns to user attributes. If passed as sparse, the non-missing values cannot contain 'NA'/'NaN' - see parameter 'NA_as_zero_user' for how the missing entries are interpreted. Sparse side info is not supported for 'OMF_implicit', nor for 'OMF_explicit' with 'method="als"'.

If 'X' is a 'data.frame', should be either a 'data.frame' or 'matrix', containing row names matching to the first column of 'X' (which denotes the user/row IDs of the non-zero entries). If 'U' is sparse, 'X' should be passed as sparse or dense matrix (not a 'data.frame').

Note that, if 'U' is a 'matrix' or 'data.frame', it should have the same number of rows as 'X' in the 'ContentBased', 'OMF_explicit', and 'OMF_implicit' models.

Be aware that 'CMF' and 'CMF_implicit' tend to perform better with dense and not-too-wide user/item attributes.

I

Item attributes information. Can be passed in the following formats:

  • A 'matrix', with rows corresponding to columns of 'X' and columns to item attributes. For the 'CMF' and 'CMF_implicit' models, missing values are supported and should be set to 'NA'/'NaN'.

  • A 'data.frame' with the same format as above.

  • A sparse matrix in COO/triplets format, either from package 'Matrix' (class 'dgTMatrix') or from package 'SparseM' (class 'matrix.coo'). Same as above, rows correspond to columns of 'X' and columns to item attributes. If passed as sparse, the non-missing values cannot contain 'NA'/'NaN' - see parameter 'NA_as_zero_item' for how the missing entries are interpreted. Sparse side info is not supported for 'OMF_implicit', nor for 'OMF_explicit' with 'method="als"'.

If 'X' is a 'data.frame', should be either a 'data.frame' or 'matrix', containing row names matching to the second column of 'X' (which denotes the item/column IDs of the non-zero entries). If 'I' is sparse, 'X' should be passed as sparse or dense matrix (not a 'data.frame').

Note that, if 'I' is a 'matrix' or 'data.frame', it should have the same number of rows as there are columns in 'X' in the 'ContentBased', 'OMF_explicit', and 'OMF_implicit' models.

Be aware that 'CMF' and 'CMF_implicit' tend to perform better with dense and not-too-wide user/item attributes.

U_bin

User binary columns/attributes (all values should be zero, one, or missing), for which a sigmoid transformation will be applied on the predicted values. If 'X' is a 'data.frame', should also be a 'data.frame', with row names matching to the first column of 'X' (which denotes the user/row IDs of the non-zero entries). Cannot be passed as a sparse matrix. Note that 'U' and 'U_bin' are not mutually exclusive.

Only supported with 'method="lbfgs"'.

I_bin

Item binary columns/attributes (all values should be zero, one, or missing), for which a sigmoid transformation will be applied on the predicted values. If 'X' is a 'data.frame', should also be a 'data.frame', with row names matching to the second column of 'X' (which denotes the item/column IDs of the non-zero entries). Cannot be passed as a sparse matrix. Note that 'I' and 'I_bin' are not mutually exclusive.

Only supported with 'method="lbfgs"'.

weight

(Optional and not recommended) Observation weights for entries in 'X'. Must have the same shape as 'X' - that is, if 'X' is a sparse matrix, 'weight' must be a vector with the same number of non-zero entries as 'X', and if 'X' is a dense matrix, 'weight' must also be a dense matrix. Alternatively, if 'X' is a sparse COO matrix, 'weight' may also be passed as a sparse COO matrix in the same format, but it will not check whether the indices match between the two. If 'X' is a 'data.frame', should be passed instead as its fourth column.

Cannot have missing values.

This is only supported for the explicit-feedback models, as the implicit-feedback ones determine the weights through 'X'.

k

Number of latent factors to use (dimensionality of the low-rank factorization) - these will be shared between the factorization of the 'X' matrix and the side info matrices in the 'CMF' and 'CMF_implicit' models, and will be determined jointly by interactions and side info in the 'OMF_explicit' and 'OMF_implicit' models. Additional non-shared components can also be specified through 'k_user', 'k_item', and 'k_main' (also 'k_sec' for 'OMF_explicit').

Typical values are 30 to 100.

lambda

Regularization parameter to apply on the squared L2 norms of the matrices. Some models ('CMF', 'CMF_implicit', 'ContentBased', and 'OMF_explicit' with the L-BFGS method) can use different regularization for each matrix, in which case it should be an array with 6 entries (regardless of the model), corresponding, in this order, to: 'user_bias', 'item_bias', 'A', 'B', 'C', 'D'. Note that the default value for 'lambda' here is much higher than in other software, and that the loss/objective function is not divided by the number of entries anywhere, so this parameter needs good tuning. For example, a good value for the MovieLens10M would be 'lambda=35' (or 'lambda=0.05' with 'scale_lam=TRUE'), whereas for the LastFM-360K, a good value would be 'lambda=5'.
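When using a different regularization for each matrix, the 6-entry array could be assembled as in this sketch. The numeric values are arbitrary placeholders rather than recommendations, and the names are only used here to make the required order explicit before being dropped:

```r
# Order required by the text: user_bias, item_bias, A, B, C, D
lambda_order <- c("user_bias", "item_bias", "A", "B", "C", "D")

# Placeholder values, written in any order by name
lambda_named <- c(
  A = 10, B = 10,
  C = 20, D = 20,
  user_bias = 5, item_bias = 5
)

# Reorder to the documented order and drop the names
lambda_per_matrix <- unname(lambda_named[lambda_order])
```

The resulting vector would then be passed as 'lambda=lambda_per_matrix' to a model that supports per-matrix regularization.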

Typical values are 0.01 to 100, with the implicit-feedback models requiring less regularization.

method

Optimization method used to fit the model. If passing 'lbfgs', will fit it through a gradient-based approach using an L-BFGS optimizer, and if passing 'als', will fit it through the ALS (alternating least-squares) method. L-BFGS is typically a much slower and a much less memory efficient method compared to 'als', but tends to reach better local optima and allows some variations of the problem which ALS doesn't, such as applying sigmoid transformations for binary side information.

Note that not all models allow choosing the optimizer:

  • 'CMF_implicit' and 'OMF_implicit' can only be fitted through the ALS method.

  • 'ContentBased' can only be fitted through the L-BFGS method.

  • 'MostPopular' can only use an ALS-like procedure, but which will ignore parameters such as 'niter'.

  • Models with non-negativity constraints can only be fitted through the ALS method, and the matrices to which the constraints apply can only be determined through a coordinate descent procedure (which will ignore what is passed to 'use_cg' and 'finalize_chol').

  • Models with L1 regularization can only be fitted through the ALS method, and the sub-problems are solved through a coordinate-descent procedure.

use_cg

In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative than the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations ('niter') to reach a good optimum. In general, better results are achieved with 'use_cg=FALSE' for the explicit-feedback models. Note that, if using this method, calculations after fitting which involve new data such as factors, might produce slightly different results from the factors obtained inside the fitted model with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use 'finalize_chol=TRUE'. Even if passing 'TRUE' here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data. This option is not available when using L1 regularization and/or non-negativity constraints. Ignored when using the L-BFGS method.

user_bias

Whether to add user/row biases (intercepts) to the model. If using it for purposes other than recommender systems, including them is usually not suggested.

item_bias

Whether to add item/column biases (intercepts) to the model. Be aware that using item biases with low regularization for them will tend to favor items with high average ratings regardless of the number of ratings the item has received.

center

Whether to center the "X" data by subtracting the mean value. For recommender systems, it's highly recommended to pass 'TRUE' here, the more so if the model has user and/or item biases.

For 'MostPopular', if passing 'implicit=TRUE', this option will be ignored (assumed 'FALSE').

add_implicit_features

Whether to automatically add so-called implicit features from the data, as in reference [5] and similar. If using this for recommender systems with small amounts of data, it's recommended to pass 'TRUE' here.

scale_lam

Whether to scale (increase) the regularization parameter for each row of the model matrices (A, B, C, D) according to the number of non-missing entries in the data for that particular row, as proposed in reference [7]. For the A and B matrices, the regularization will only be scaled according to the number of non-missing entries in 'X' (see also the 'scale_lam_sideinfo' parameter). Note that, when using the options 'NA_as_zero_*', all entries are considered to be non-missing. If passing 'TRUE' here, the optimal value for 'lambda' will be much smaller (and likely below 0.1). This option tends to give better results, but requires more hyperparameter tuning. Only supported for the ALS method.

For the 'MostPopular' model, this is not supported when passing 'implicit=TRUE', and it is not recommended to use for it, as it will tend to recommend items which have a single user interaction with the maximum possible value (e.g. 5-star movies from only 1 user).

When generating factors based on side information alone, if passing 'scale_lam_sideinfo', will regularize assuming there was one observation present. Be aware that using this option without 'scale_lam_sideinfo=TRUE' can lead to bad cold-start recommendations as it will set a very small regularization for users who have no 'X' data.

Warning: in smaller datasets, using this option can result in top-N recommendations having mostly items with very few interactions (see parameter 'scale_bias_const').
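To give an intuition for the scaling, the following base-R sketch computes a per-row regularization for a small dense 'X'. It assumes that scaling means multiplying 'lambda' by the number of non-missing entries of each row or column; that exact form is an assumption made for illustration and is not stated in this documentation:

```r
X <- matrix(c( 5, NA,  3, NA,
              NA, NA,  1, NA,
               4,  2,  5,  1), nrow = 3, byrow = TRUE)
lambda <- 0.05

# Number of observed entries per user (row) and per item (column)
n_per_user <- rowSums(!is.na(X))
n_per_item <- colSums(!is.na(X))

# Assumed scaling: regularization grows with the number of observations
lambda_users <- lambda * n_per_user  # one value per row of 'A'
lambda_items <- lambda * n_per_item  # one value per row of 'B'
```

Under this assumption, users and items with many interactions are regularized more strongly, which is why a much smaller base 'lambda' is needed when this option is enabled.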

scale_lam_sideinfo

Whether to scale (increase) the regularization parameter for each row of the "A" and "B" matrices according to the number of non-missing entries in both 'X' and the side info matrices 'U' and 'I'. If passing 'TRUE' here, 'scale_lam' will also be assumed to be 'TRUE'.

scale_bias_const

When passing 'scale_lam=TRUE' and 'user_bias=TRUE' or 'item_bias=TRUE', whether to apply the same scaling to the regularization of the biases to all users and items, according to the average number of non-missing entries rather than to the number of entries for each specific user/item.

While this tends to result in worse RMSE, it tends to make the top-N recommendations less likely to select items with only a few interactions from only a few users.

Ignored when passing 'scale_lam=FALSE' or not using user/item biases.

k_user

Number of factors in the factorizing 'A' and 'C' matrices which will be used only for the 'U' and 'U_bin' matrices, while being ignored for the 'X' matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by 'k'.

k_item

Number of factors in the factorizing 'B' and 'D' matrices which will be used only for the 'I' and 'I_bin' matrices, while being ignored for the 'X' matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by 'k'.

k_main

For the 'CMF' and 'CMF_implicit' models, this denotes the number of factors in the factorizing 'A' and 'B' matrices which will be used only for the 'X' matrix, while being ignored for the 'U', 'U_bin', 'I', and 'I_bin' matrices. For the 'OMF_explicit' model, this denotes the number of factors which are determined without the user/item side information. These will be the last factors of the matrices once the model is fit. Will be counted in addition to those already set by 'k'.

w_main

Weight in the optimization objective for the errors in the factorization of the 'X' matrix.

w_user

For the 'CMF' and 'CMF_implicit' models, this denotes the weight in the optimization objective for the errors in the factorization of the 'U' and 'U_bin' matrices. For the 'OMF_explicit' model, this denotes the multiplier for the effect of the user attributes in the final factor matrices.

Ignored when passing neither 'U' nor 'U_bin'.

w_item

For the 'CMF' and 'CMF_implicit' models, this denotes the weight in the optimization objective for the errors in the factorization of the 'I' and 'I_bin' matrices. For the 'OMF_explicit' model, this denotes the multiplier for the effect of the item attributes in the final factor matrices.

Ignored when passing neither 'I' nor 'I_bin'.

w_implicit

Weight in the optimization objective for the errors in the factorizations of the implicit 'X' matrices. Note that, depending on the sparsity of the data, the sum of errors from these factorizations might be much larger than for the original 'X' and a smaller value will perform better. It is recommended to tune this parameter carefully. Ignored when passing 'add_implicit_features=FALSE'.

l1_lambda

Regularization parameter to apply to the L1 norm of the model matrices. Can also pass different values for each matrix (see 'lambda' for details). Note that, when adding L1 regularization, the model will be fit through a coordinate descent procedure, which is significantly slower than the Cholesky method with L2 regularization. Only supported with the ALS method.

Not recommended.

center_U

Whether to center the 'U' matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using 'NA_as_zero_user=TRUE'.

center_I

Whether to center the 'I' matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using 'NA_as_zero_item=TRUE'.

maxiter

Maximum L-BFGS iterations to perform. The procedure will halt if it has not converged after this number of updates. Note that the 'CMF' model is likely to require fewer iterations to converge compared to other models, whereas the 'ContentBased' model, which optimizes a highly non-linear function, will require more iterations and benefits from using more correction pairs. Using higher regularization values might also decrease the number of required iterations. Pass zero for no L-BFGS iterations limit. If the procedure is spending hundreds of iterations without any significant decrease in the loss function or gradient norm, it's highly likely that the regularization is too low.

Ignored when using the ALS method.

niter

Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Ignored when using the L-BFGS method.

Typical values are 6 to 30.

parallelize

How to parallelize gradient calculations when using more than one thread with 'method="lbfgs"'. Passing 'separate' will iterate over the data twice - first by rows and then by columns, letting each thread calculate results for each row and column, whereas passing 'single' will iterate over the data only once, and then sum the obtained results from each thread. Passing 'separate' is much more memory-efficient and less prone to irreproducibility of random seeds, but might be slower for typical use-cases. Ignored when passing 'nthreads=1', or when using the ALS method, or when compiling without OpenMP support.

corr_pairs

Number of correction pairs to use for the L-BFGS optimization routine. Recommended values are between 3 and 7. Note that higher values translate into higher memory requirements. Ignored when using the ALS method.

max_cg_steps

Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing 'use_cg=FALSE' or using the L-BFGS method.

precondition_cg

Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors.

Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by 'max_cg_steps') at each iteration regardless of whether it has reached the optimum already.

Ignored when passing 'use_cg=FALSE' or 'method="lbfgs"'.

finalize_chol

When passing 'use_cg=TRUE' and using the ALS method, whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the resulting factors inside the model object and calls to factors or similar with the same data.

NA_as_zero

Whether to take missing entries in the 'X' matrix as zeros (only when the 'X' matrix is passed as a sparse matrix or as a 'data.frame') instead of ignoring them. This is a different model from the implicit-feedback version with weighted entries, and it's a much faster model to fit. Note that passing 'TRUE' will affect the results of the functions factors and factors_single (as it will assume zeros instead of missing). It is possible to obtain equivalent results to the implicit-feedback model if passing 'TRUE' here, and then passing an 'X' with all values set to one and weights corresponding to the actual values of 'X' multiplied by 'alpha', plus 1 ('W := 1 + alpha*X' to imitate the implicit-feedback model). If passing this option, be aware that the defaults are also to perform mean centering and add user/item biases, which might be undesirable to have together with this option. For the OMF_explicit model, this option will only affect the data to which the model is fit, while being always assumed 'FALSE' for new data (e.g. when calling 'factors').
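The input transformation described above ('W := 1 + alpha*X' with all values of 'X' set to one) can be sketched in base R as follows. The counts are made up, and the column names are arbitrary; only the column order (user, item, value, weight) follows the 'data.frame' format described for 'X':

```r
alpha <- 1

# Made-up implicit-feedback counts in triplets format
counts <- data.frame(
  user  = c("u1", "u1", "u2", "u3"),
  item  = c("i1", "i2", "i2", "i3"),
  value = c(3, 1, 7, 2)
)

# All observed values become 1, and the original counts become weights
X_imitation <- data.frame(
  user   = counts$user,
  item   = counts$item,
  value  = 1,
  weight = 1 + alpha * counts$value  # fourth column = observation weights
)
```

Such a table would then be passed as 'X' together with 'NA_as_zero=TRUE'; given the remark about defaults above, one would likely also want to review the 'center', 'user_bias' and 'item_bias' settings.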

NA_as_zero_user

Whether to take missing entries in the 'U' matrix as zeros (only when the 'U' matrix is passed as a sparse matrix) instead of ignoring them. Note that passing 'TRUE' will affect the results of the functions factors and factors_single if no data is passed there (as it will assume zeros instead of missing). This option is always assumed 'TRUE' for the 'ContentBased', 'OMF_explicit', and 'OMF_implicit' models.

NA_as_zero_item

Whether to take missing entries in the 'I' matrix as zeros (only when the 'I' matrix is passed as a sparse matrix) instead of ignoring them. This option is always assumed 'TRUE' for the 'ContentBased', 'OMF_explicit', and 'OMF_implicit' models.

nonneg

Whether to constrain the 'A' and 'B' matrices to be non-negative. In order for this to work correctly, the 'X' input data must also be non-negative. This constraint will also be applied to the 'Ai' and 'Bi' matrices if passing 'add_implicit_features=TRUE'.

Important: be aware that the default options are to perform mean centering and to add user and item biases, which might be undesirable and hinder performance when having non-negativity constraints (especially mean centering).

This option is not available when using the L-BFGS method. Note that, when determining non-negative factors, it will always use a coordinate descent method, regardless of the value passed for 'use_cg' and 'finalize_chol'.

When used for recommender systems, one usually wants to pass 'FALSE' here.

For better results, do not use centering alongside this option, and use a higher regularization coupled with more iterations.

nonneg_C

Whether to constrain the 'C' matrix to be non-negative. In order for this to work correctly, the 'U' input data must also be non-negative.

Note: by default, the 'U' data will be centered by columns, which doesn't play well with non-negativity constraints. One will likely want to pass 'center_U=FALSE' along with this.

nonneg_D

Whether to constrain the 'D' matrix to be non-negative. In order for this to work correctly, the 'I' input data must also be non-negative.

Note: by default, the 'I' data will be centered by columns, which doesn't play well with non-negativity constraints. One will likely want to pass 'center_I=FALSE' along with this.

max_cd_steps

Maximum number of coordinate descent updates to perform per iteration. Pass zero for no limit. The procedure will only use coordinate descent updates when having L1 regularization and/or non-negativity constraints. This number should usually be larger than 'k'.

precompute_for_predictions

Whether to precompute some of the matrices that are used when making predictions from the model. If 'FALSE', it will take longer to generate predictions or top-N lists, but will use less memory and will be faster to fit the model. If passing 'FALSE', these matrices can be computed later on demand through the function precompute.for.predictions.

Note that for 'ContentBased', 'OMF_explicit', and 'OMF_implicit', this parameter will always be assumed to be 'TRUE', due to requiring the original matrices for the pre-computations.

include_all_X

When passing an input 'X' which has fewer columns than rows in 'I', whether to still make calculations about the items which are in 'I' but not in 'X'. This has three effects: (a) the topN functionality may recommend such items, (b) the precomputed matrices will be less usable as they will include all such items, (c) it will be possible to pass 'X' data to the new factors or topN functions that include such columns (rows of 'I'). This option is ignored when using 'NA_as_zero', and is only relevant for the 'CMF' model as all the other models will have the equivalent of 'TRUE' here.

verbose

Whether to print informational messages about the optimization routine used to fit the model. Be aware that, if passing 'FALSE' and using the L-BFGS method, the optimization routine will not respond to interrupt signals.

print_every

Print L-BFGS convergence messages every n-iterations. Ignored when not using the L-BFGS method.

handle_interrupt

When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing 'TRUE'), or raise an interrupt exception without producing a fitted model object (when passing 'FALSE').

seed

Seed to use for random number generation. If passing 'NULL', will draw a non-reproducible random integer to use as seed.

nthreads

Number of parallel threads to use. Note that, the more threads that are used, the higher the memory consumption.

alpha

Weighting parameter for the non-zero entries in the implicit-feedback model. See [3] for details. Note that, while the authors' suggestion for this value is 40, other software such as the Python package 'implicit' uses a value of 1, whereas Spark uses a value of 0.01 by default, and values higher than 10 are unlikely to improve results. If the data has very high values, it might even be beneficial to put a very low value here - for example, for the LastFM-360K dataset, values below 1 might give better results.

apply_log_transf

Whether to apply a logarithm transformation on the values of 'X' (i.e. 'X := log(X)').

implicit

(Only selectable for the 'MostPopular' model) Whether to use the implicit-feedback model, in which the 'X' matrix is assumed to have only binary entries, each of them having a weight in the loss function given by the observed user-item interactions and other parameters.

add_intercepts

(Only for 'ContentBased', 'OMF_explicit', 'OMF_implicit') Whether to add intercepts/biases to the user/item attribute matrices.

start_with_ALS

(Only for 'ContentBased') Whether to determine the initial coefficients through an ALS procedure. This might help to speed up the procedure by starting closer to an optimum. This option is not available when the side information is passed as sparse matrices.

k_sec

(Only for 'OMF_explicit') Number of factors in the factorizing matrices which are determined exclusively from user/item attributes. These will be at the beginning of the 'C' and 'D' matrices once the model is fit. If there are no attributes for a given matrix (user/item), then that matrix will have an extra 'k_sec' factors (e.g. if passing user side info but not item side info, then the 'B' matrix will have an extra 'k_sec' factors). Will be counted in addition to those already set by 'k'. Not supported when using 'method="als"'.

For a different model having only 'k_sec' with 'k=0' and 'k_main=0', see the 'ContentBased' model.

Details

In more detail, the models predict the values of 'X' as follows:

  • 'CMF': X ~ A * t(B) + μ + bias_u + bias_i, where μ is the global mean for the non-missing entries in 'X', and bias_u, bias_i are the user and item biases (column and row vector, respectively). In addition, the other matrices are predicted as U ~ A*t(C) + μ_U and I ~ B*t(D) + μ_I, where μ_U, μ_I are the column means from the side info matrices, which are determined as a simple average with no regularization (these are row vectors), and if having binary variables, also U_bin ~ sigm(A*t(C_bin)) and I_bin ~ sigm(B*t(D_bin)), where 'sigm' is a sigmoid function ( sigm(x) = 1/(1+exp(-x))). Under the options 'NA_as_zero_*', the mean(s) for that matrix are not added into the model for simplicity. For the implicit features option, the other matrices are predicted simply as Ix ~ A*t(Bi), t(Ix) ~ B*t(Ai).

    If using 'k_user', 'k_item', 'k_main', then for 'A', only columns '1' through 'k+k_user' are used in the approximation of 'U', and only columns 'k_user+1' through 'k_user+k+k_main' are used for the approximation of 'X' (similar thing for 'B' with 'k_item'). The implicit factors matrices (Ai, Bi) always use the same components/factors as 'X'.

    Be aware that the functions for determining new factors will by default omit the bias term in the output.

  • 'CMF_implicit': X ~ A * t(B), while 'U' and 'I' remain the same as for 'CMF', and the ordering of the non-shared factors is the same. Note that there is no mean centering or user/item biases in the implicit-feedback model, but if desired, the 'CMF' model can be made to mimic 'CMF_implicit' while still accommodating for mean centering and biases.

  • 'MostPopular': X ~ μ + bias_u + bias_i (when using 'implicit=FALSE') or X ~ bias_i (when using 'implicit=TRUE').

  • 'ContentBased': X ~ Am * t(Bm), where Am = U*C + C_bias and Bm = I * D + D_bias - the C_bias, D_bias are per-column/factor intercepts (these are row vectors).

  • 'OMF_explicit': X ~ Am * t(Bm) + μ + bias_u + bias_i, where Am = w_user * U * C + C_bias + A and Bm = w_item * (I * D + D_bias) + B. If passing 'k_sec' and/or 'k_main', then columns '1' through 'k_sec' of Am, Bm are determined as those same columns from A, B, while U*C + C_bias, I*D + D_bias will be shorter by 'k_sec' columns (alternatively, can be thought of as having those columns artificially set to zeros), and columns 'k_sec+k+1' through 'k_sec+k+k_main' of Am, Bm are determined as those last 'k_main' columns of U*C + C_bias, I*D + D_bias, while A, B will be shorter by 'k_main' columns (alternatively, can be thought of as having those columns artificially set to zeros). If one of U or I is missing, then the corresponding A or B matrix will be extended by 'k_sec' columns (which will not be zeros) and the corresponding prediction matrix (Am, Bm) will be equivalent to that matrix (which was the free offset in the presence of side information).

  • 'OMF_implicit': X ~ Am * t(Bm), with Am, Bm remaining the same as for 'OMF_explicit'.
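As a rough illustration of the 'CMF' and 'MostPopular' formulas above, the following base R sketch evaluates them on small random stand-ins. All variable names and sizes here are hypothetical, the matrices are laid out with one row per user/item (unlike the transposed storage under 'model$matrices'), and this is not how the package computes predictions internally.

```r
### Illustrative only: random stand-ins for fitted matrices
### (hypothetical names and sizes, not taken from a fitted model)
set.seed(1)
m <- 4L; n <- 3L; k <- 2L
A <- matrix(rnorm(m * k), nrow = m)   # user factors, one row per user
B <- matrix(rnorm(n * k), nrow = n)   # item factors, one row per item
mu <- 3.5                             # global mean of the non-missing 'X'
bias_u <- rnorm(m)                    # user biases (column vector)
bias_i <- rnorm(n)                    # item biases (row vector)

### 'CMF': X ~ A * t(B) + mu + bias_u + bias_i
X_hat <- A %*% t(B) + mu + outer(bias_u, bias_i, "+")

### The same value for a single entry, e.g. user 2 and item 3
x_23 <- sum(A[2, ] * B[3, ]) + mu + bias_u[2] + bias_i[3]

### 'MostPopular' with 'implicit=FALSE': same formula without the factors
X_pop <- mu + outer(bias_u, bias_i, "+")

### Sigmoid applied to binary side information columns
sigm <- function(x) 1 / (1 + exp(-x))
```

For actual predictions, the package's own functions (such as predict and topN) remain the recommended route.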

When calling the prediction functions, new data is always transposed or deep copied before being passed to the underlying C functions - as such, for the 'ContentBased' model, it might be faster to use the matrices directly instead (all these matrices will be under 'model$matrices', but will be transposed).

The precomputed matrices, when they are square, will contain only the lower triangle, as they are symmetric. For 'CMF' and 'CMF_implicit', one might also see variations of a new matrix called 'Be' (extended 'B' matrix), which is from reference [1] and defined as Be = [[0, Bs, Bm], [Ca, Cs, 0]] , where Bs are columns 'k_item+1' through 'k_item+k' from 'B', Bm are columns 'k_item+k+1' through 'k_item+k+k_main' from 'B', Ca are columns '1' through 'k_user' from 'C', and Cs are columns 'k_user+1' through 'k_user+k' from 'C'. This matrix is used for the closed-form solution of a given vector of 'A' in the functions for predicting on new data (see reference [1] for details or if you would like to use your own solver with the fitted matrices from this package), as long as there are no binary columns to which to apply a transformation, in which case it will always solve them with the L-BFGS method.

When using user biases, the precomputed matrices will have an extra column, which is derived by adding an extra column to 'B' (at the end) consisting of all ones (this is how the user biases are calculated).
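The block structure of 'Be' described above can be sketched in base R as follows. This is a simplified illustration under several assumptions: the matrices are random stand-ins stored with one row per item/attribute, the data is taken as already centered, there are no biases and no per-matrix weights, and 'lambda' is a hypothetical L2 regularization value. The package's own solver handles more cases than this.

```r
### Illustrative only: random stand-ins, hypothetical names and sizes
set.seed(1)
n <- 6L; p <- 4L                 # number of items, number of user attributes
k_user <- 1L; k <- 2L; k_item <- 1L; k_main <- 1L
lambda <- 0.1                    # assumed L2 regularization
B <- matrix(rnorm(n * (k_item + k + k_main)), nrow = n)
C <- matrix(rnorm(p * (k_user + k)), nrow = p)

### Column blocks as defined in the text
Bs <- B[, k_item + seq_len(k), drop = FALSE]
Bm <- B[, k_item + k + seq_len(k_main), drop = FALSE]
Ca <- C[, seq_len(k_user), drop = FALSE]
Cs <- C[, k_user + seq_len(k), drop = FALSE]

### Be = [[0, Bs, Bm], [Ca, Cs, 0]]
Be <- rbind(
    cbind(matrix(0, n, k_user), Bs, Bm),
    cbind(Ca, Cs, matrix(0, p, k_main))
)

### New user: ratings with missing entries plus fully-observed attributes
x_new <- c(1.0, NA, -0.5, NA, 0.25, NA)
u_new <- rnorm(p)
rows   <- c(which(!is.na(x_new)), n + seq_len(p))   # rows of 'Be' with data
y      <- c(x_new, u_new)[rows]
Be_obs <- Be[rows, , drop = FALSE]

### Ridge-regularized least squares for this user's row of 'A'
a_new <- solve(crossprod(Be_obs) + lambda * diag(ncol(Be_obs)),
               crossprod(Be_obs, y))
```

Only the rows of 'Be' that correspond to observed entries enter the system, which is why the missing ratings are dropped before solving.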

For the implicit-feedback models, the weights of the positive entries (defined as the non-missing entries in 'X') will be given by W = 1 + α * X.
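A minimal base R sketch of this weighting, assuming a small dense toy matrix in which zeros stand for missing entries and assuming, as in reference [3], that missing entries end up with a weight of one. The names 'P', 'W' and 'loss' are hypothetical, and regularization is left out.

```r
### Illustrative only: toy interaction counts (0 = missing entry)
set.seed(1)
X <- matrix(c(3, 0, 1,
              0, 2, 0), nrow = 2, byrow = TRUE)
alpha <- 1
k <- 2L
A <- matrix(rnorm(nrow(X) * k), nrow = nrow(X))   # stand-in user factors
B <- matrix(rnorm(ncol(X) * k), nrow = ncol(X))   # stand-in item factors

P <- (X > 0) * 1        # binary matrix that gets factorized
W <- 1 + alpha * X      # weights: 1 + alpha*x for positives, 1 elsewhere

### Weighted squared error (without the regularization term)
loss <- sum(W * (P - A %*% t(B))^2)
```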

For the 'OMF' models, the 'ALS' method will first find a solution for the equivalent 'CMF' problem with no side information, and will then try to predict the resulting matrices given the user/item attributes, assigning the residuals as the free offsets. While this might sound reasonable, in practice it tends to give rather different results than when fit through the L-BFGS method. Strictly speaking, the regularization parameter in this case is applied to the Am, Bm matrices, and the prediction functions for new data will offer an option 'exact' for determining whether to apply the regularization to the A, B matrices instead.

For reproducibility, the initializations of the model matrices (always initialized as '~ Normal(0, 1)') can be controlled through 'set.seed', but if using parallelizations, there are potential sources of irreproducibility of random seeds due to parallelized aggregations and/or BLAS function calls, which is especially problematic for the L-BFGS method with 'parallelize="single"'.

In order to further avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it's necessary to sort it beforehand by columns/items and also pass the data with item indices sorted beforehand to the prediction functions. The package does not perform any indices sorting or de-duplication of entries of sparse matrices.
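One possible way of preparing triplet (COO) data along these lines is sketched below in base R. The vectors 'i', 'j', 'x' are hypothetical, keeping the first occurrence of a duplicated entry is an arbitrary choice made for the example, and the secondary ordering by row is an assumption rather than a documented requirement.

```r
### Illustrative only: triplets (row, column, value) with one duplicate
i <- c(2L, 1L, 2L, 1L, 2L)     # user indices
j <- c(3L, 3L, 1L, 1L, 3L)     # item indices
x <- c(5, 4, 3, 2, 9)          # values; the last entry repeats (2, 3)

### De-duplicate, keeping the first occurrence of each (row, column) pair
keep <- !duplicated(cbind(i, j))
i <- i[keep]; j <- j[keep]; x <- x[keep]

### Sort by column/item first, then by row/user
ord <- order(j, i)
i <- i[ord]; j <- j[ord]; x <- x[ord]
```

The sorted vectors can then be used to build the sparse matrix that is passed to the fitting and prediction functions.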

Value

Returns a model object (class named just like the function that produced it, plus general class 'cmfrec') on which methods such as topN and factors can be called. The returned object will have the following fields:

  • 'info': will contain the hyperparameters, problem dimensions, and other information such as the number of threads, as passed to the function that produced the model. The number of threads ('nthreads') might be modified after-the-fact. If 'X' is a 'data.frame', will also contain the re-indexing of users and items under 'user_mapping' and 'item_mapping', respectively. For the L-BFGS method, will also contain the number of function evaluations ('nfev') and number of updates ('nupd') that were performed.

  • 'matrices': will contain the fitted model matrices (see section 'Description' for the naming and for details on what they represent), but note that they will be transposed (due to R's column-major representation of matrices) and it is recommended to use the package's prediction functionality instead of taking the matrices directly.

  • 'precomputed': will contain some pre-computed calculations based on the model matrices which might help speed up predictions on new data.

Evaluating models

Metrics for implicit-feedback recommendations or model quality can be calculated using the recometrics package.

Performance tips

It is recommended to have the RhpcBLASctl package installed for better performance - if available, will be used to control the number of internal BLAS threads before entering a multi-threaded region, in order to avoid oversubscription of threads. This can become an issue when using OpenBLAS if it is the 'pthreads' variant.

This package relies heavily on BLAS and LAPACK functions. For better performance, it is recommended to use an optimized backend for them, such as MKL or OpenBLAS.

In Windows, the easiest way of getting MKL is to use Microsoft's MRAN distribution of R, while OpenBLAS can be obtained by following this tutorial (no new R installation required).

In Linux, these can be installed through the system's package manager. In Debian and Debian-based distributions such as Ubuntu, the default BLAS and LAPACK can be configured through the alternatives system (see the Debian docs or this post for MKL).

By default, in a regular x86-64 CPU, R will compile all packages with generic options '-msse2' and '-O2', which misses lots of performance optimizations, and in particular, 'cmfrec' will not be able to achieve its maximum performance with them.

It is recommended to use compilation options '-O3', '-march=native', '-fno-math-errno', '-fno-trapping-math', and '-std=c99' or '-std=gnu99'. These can be activated in multiple ways:

  • (On Linux) Creating an empty text file '~/.R/Makevars' and adding this line there: 'CFLAGS += -O3 -march=native -fno-math-errno -fno-trapping-math' (plus an empty line at the end), then installing the usual way with 'install.packages("cmfrec")'.

  • Installing 'cmfrec' from source, but modifying the 'Makevars' file (it has lines that can be uncommented in order to enable these optimizations).

  • Modifying the global 'Makeconf' file. This file defines the default compilation options for all R packages, so be careful with it: editing this global file is not recommended, and editing the local user 'Makevars' should be preferred instead. In Debian, this file will typically be under '/etc/R/', but this can vary in other operating systems. In this file, replace all occurrences of '-O2' with '-O3', and all occurrences of '-msse2' with '-march=native -fno-math-errno -fno-trapping-math' (e.g. open it in some text editor or in RStudio and use the 'Replace All' functionality).

References

  1. Cortes, David. "Cold-start recommendations in Collective Matrix Factorization." arXiv preprint arXiv:1809.00366 (2018).

  2. Singh, Ajit P., and Geoffrey J. Gordon. "Relational learning via collective matrix factorization." Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 2008.

  3. Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback datasets." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.

  4. Takacs, Gabor, Istvan Pilaszy, and Domonkos Tikk. "Applications of the conjugate gradient method for implicit feedback collaborative filtering." Proceedings of the fifth ACM conference on Recommender systems. 2011.

  5. Rendle, Steffen, Li Zhang, and Yehuda Koren. "On the difficulty of evaluating baselines: A study on recommender systems." arXiv preprint arXiv:1905.01395 (2019).

  6. Franc, Vojtech, Vaclav Hlavac, and Mirko Navara. "Sequential coordinate-wise algorithm for the non-negative least squares problem." International Conference on Computer Analysis of Images and Patterns. Springer, Berlin, Heidelberg, 2005.

  7. Zhou, Yunhong, et al. "Large-scale parallel collaborative filtering for the netflix prize." International conference on algorithmic applications in management. Springer, Berlin, Heidelberg, 2008.

Examples

### See the package vignette for an extended version of this example

library(cmfrec)
if (require("recommenderlab") && require("MatrixExtra")) {
    ### Load the ML100K dataset (movie ratings)
    ### (users are rows, items are columns)
    data("MovieLense")
    X <- as.coo.matrix(MovieLense@data)

    ### Will add basic side information about the users
    U <- MovieLenseUser
    U$id      <- NULL
    U$zipcode <- NULL
    U <- model.matrix(~.-1, data=U)

    ### Will additionally use the item genres as side info
    I <- MovieLenseMeta
    I$title <- NULL
    I$year  <- NULL
    I$url   <- NULL
    I <- as.coo.matrix(I)

    ### Fit a factorization model
    ### (it's recommended to change the hyperparameters
    ###  and use multiple threads)
    model <- CMF(X=X, U=U, I=I, k=10L, niter=5L,
                 NA_as_zero_item=TRUE,
                 verbose=FALSE, nthreads=1L)

    ### Predict rating for entries X[1,3], X[2,5], X[10,9]
    ### (first ID is the user, second is the movie)
    predict(model, user=c(1,2,10), item=c(3,5,9))

    ### Recommend top-5 for user ID = 10
    ### (Note that 'MatrixExtra' makes this return a 'sparseVector')
    seen_by_user <- MovieLense@data[10, , drop=TRUE]@i
    rec <- topN(model, user=10, n=5, exclude=seen_by_user)
    rec

    ### Print them in a more understandable format
    movie_names <- colnames(X)
    n_ratings <- colSums(as.csc.matrix(X, binary=TRUE))
    avg_ratings <- colSums(as.csc.matrix(X)) / n_ratings
    print_recommended <- function(rec, txt) {
        cat(txt, ":\n",
            paste(paste(seq_along(rec), ". ", sep=""),
                  movie_names[rec],
                  " - Avg rating:", round(avg_ratings[rec], 2),
                  ", #ratings: ", n_ratings[rec],
                  collapse="\n", sep=""),
            "\n", sep="")
    }
    print_recommended(rec, "Recommended for user_id=10")


    ### Recommend assuming it is a new user,
    ### based on its data (ratings + side info)
    x_user <- X[10, , drop=TRUE] ## <- this is a 'sparseVector'
    u_user <- U[10, ]
    rec_new <- topN_new(model, n=5, X=x_user, U=u_user, exclude=seen_by_user)
    cat("lists are identical: ", identical(rec_new, rec), "\n")

    ### Recommend based on side information alone
    ### (a.k.a. cold-start recommendation)
    rec_cold <- topN_new(model, n=5, U=u_user)
    print_recommended(rec_cold, "Recommended based on side info")

    ### Obtain factors for the user
    factors_user <- model$matrices$A[, 10, drop=TRUE]

    ### Re-calculate them based on the data
    factors_new <- factors_single(model, X=x_user, U=u_user)

    ### Should be very close, but due to numerical precision,
    ### might not be exactly equal (see section 'Details')
    cat("diff: ", factors_user - factors_new, "\n")

    ### Can also calculate them in batch
    ### (slicing is provided by package "MatrixExtra")
    Xslice <- as.csr.matrix(X)[1:10, , drop=FALSE]
    Uslice <- U[1:10, , drop=FALSE]
    factors_multiple <- factors(model, X=Xslice, U=Uslice)
    cat("diff: ", factors_multiple[10, , drop=TRUE] - factors_new, "\n")

    ### Can make cold-start predictions, e.g.
    ### predict how would users [1,2,3] rate a new item,
    ### given its side information (here it's item ID = 5)
    predict_new_items(model, user=c(1,2,3), item=c(1,1,1), I=I[5, ])
}

cmfrec documentation built on Nov. 26, 2022, 5:05 p.m.