recosystem
is an R wrapper of the LIBMF
library developed by
Yu-Chin Juan, Wei-Sheng Chin, Yong Zhuang, Bo-Wen Yuan, Meng-Yuan Yang,
and Chih-Jen Lin (https://www.csie.ntu.edu.tw/~cjlin/libmf/),
an open source library for recommender system using parallel marix
factorization. [@LIBMF]
LIBMF
is a high-performance C++ library for large scale matrix factorization.
LIBMF
itself is a parallelized library, meaning that
users can take advantage of multicore CPUs to speed up the computation.
It also utilizes some advanced CPU features to further improve the performance.
[@LIBMF]
recosystem
is a wrapper of LIBMF
, hence it inherits most of the features
of LIBMF
, and additionally provides a number of user-friendly R functions to
simplify data processing and model building. Also, unlike most other R packages
for statistical modeling that store the whole dataset and model object in
memory, LIBMF
(and hence recosystem
) can significantly reduce memory use,
for instance the constructed model that contains information for prediction
can be stored in the hard disk, and output result can also be directly
written into a file rather than be kept in memory.
The main task of recommender system is to predict unknown entries in the rating matrix based on observed values, as is shown in the table below:
| | item_1 | item_2 | item_3 | ... | item_n | |--------|--------|--------|--------|-----|--------| | user_1 | 2 | 3 | ?? | ... | 5 | | user_2 | ?? | 4 | 3 | ... | ?? | | user_3 | 3 | 2 | ?? | ... | 3 | | ... | ... | ... | ... | ... | | | user_m | 1 | ?? | 5 | ... | 4 |
Each cell with number in it is the rating given by some user on a specific item, while those marked with question marks are unknown ratings that need to be predicted. In some other literatures, this problem may be named collaborative filtering, matrix completion, matrix recovery, etc.
A popular technique to solve the recommender system problem is the matrix factorization method. The idea is to approximate the whole rating matrix $R_{m\times n}$ by the product of two matrices of lower dimensions, $P_{n\times k}$ and $Q_{n\times k}$, such that
$$R\approx PQ'$$
Let $p_u$ be the $u$-th row of $P$, and $q_v$ be the $v$-th row of $Q$, then the rating given by user $u$ on item $v$ would be predicted as $p_u q'_v$.
A typical solution for $P$ and $Q$ is given by the following optimization problem [@FPSG2015; @LRSG]:
$$\min_{P,Q} \sum_{(u,v)\in R} \left[f(p_u,q_v;r_{u,v})+\mu_P||p_u||_1+\mu_Q||q_v||_1+\frac{\lambda_P}{2} ||p_u||_2^2+\frac{\lambda_Q}{2} ||q_v||_2^2\right]$$
where $(u,v)$ are locations of observed entries in $R$, $r_{u,v}$ is the observed rating, $f$ is the loss function, and $\mu_P,\mu_Q,\lambda_P,\lambda_Q$ are penalty parameters to avoid overfitting.
The process of solving the matrices $P$ and $Q$ is referred to as
model training, and the selection of penalty parameters is called
parameter tuning. In recosystem
, we provide convenient functions for
these two tasks, and additionally have functions for model exporting
(outputing $P$ and $Q$ matrices) and prediction.
Each step in the recommender system involves data input and output, as the table below shows:
| Step | Input | Output | |------------------|-------------------|----------------------------------| | Model training | Training data set | -- | | Parameter tuning | Training data set | -- | | Exporting model | -- | User matrix $P$, item matrix $Q$ | | Prediction | Testing data set | Predicted values |
Data may have different formats and types of storage, for example the input
data set may be saved in a file or stored as R objects, and users may want
the output results to be directly written into file or to be returned as R
objects for further processing. In recosystem
, we use two classes,
DataSource
and Output
, to handle data input and output in a unified way.
An object of class DataSource
specifies the source of a data set (either
training or testing), which can be created by the following two functions:
data_file()
: Specifies a data set from a file in the hard diskdata_memory()
: Specifies a data set from R objectsdata_matrix()
: Specifies a data set from a sparse matrixAnd an object of class Output
describes how the result should be output,
typically returned by the functions below:
out_file()
: Result should be saved to a fileout_memory()
: Result should be returned as R objectsout_nothing()
: Nothing should be outputMore data source formats and output options may be supported in the future along with the development of this package.
The data file for training set needs to be arranged in sparse matrix triplet form, i.e., each line in the file contains three numbers
user_index item_index rating
User index and item index may start with either 0 or 1, and this can be
specified by the index1
parameter in data_file()
and data_memory()
.
For example, with index1 = FALSE
, the training data file for the rating matrix
in the beginning of this article may look like
0 0 2 0 1 3 1 1 4 1 2 3 2 0 3 2 1 2 ...
From version 0.4 recosystem
supports two special types of matrix factorization:
the binary matrix factorization (BMF), and the one-class matrix factorization (OCMF).
BMF requires ratings to take value from {-1, 1}
, and OCMF requires all the ratings to be positive.
Testing data file is similar to training data, but since the ratings in
testing data are usually unknown, the rating
entry in testing data file
can be omitted, or can be replaced by any placeholder such as 0
or ?
.
The testing data file for the same rating matrix would be
0 2 1 0 2 2 ...
Example data files are contained in the <recosystem>/dat
(or <recosystem>/inst/dat
, for source package) directory.
The usage of recosystem
is quite simple, mainly consisting of the following steps:
Reco()
.$tune()
method to select best tuning parameters
along a set of candidate values.$train()
method. A number of parameters
can be set inside the function, possibly coming from the result of $tune()
.$output()
, i.e. write the factorization matrices
$P$ and $Q$ into files or return them as R objects.$predict()
method to compute predicted values.Below is an example on some simulated data:
library(recosystem) set.seed(123) # This is a randomized algorithm train_set = data_file(system.file("dat", "smalltrain.txt", package = "recosystem")) test_set = data_file(system.file("dat", "smalltest.txt", package = "recosystem")) r = Reco() opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2), costp_l1 = 0, costq_l1 = 0, nthread = 1, niter = 10)) opts r$train(train_set, opts = c(opts$min, nthread = 1, niter = 20)) ## Write predictions to file pred_file = tempfile() r$predict(test_set, out_file(pred_file)) print(scan(pred_file, n = 10)) ## Or, directly return an R vector pred_rvec = r$predict(test_set, out_memory()) head(pred_rvec, 10)
Detailed help document for each function is available in topics
?recosystem::Reco
, ?recosystem::tune
, ?recosystem::train
,
?recosystem::output
and ?recosystem::predict
.
To build recosystem
from source, one needs a C++ compiler that supports
the C++11 standard.
Also, there are some flags in file src/Makevars
(src/Makevars.win
for Windows system) that may have influential
effect on performance. It is strongly suggested to set proper flags
according to your type of CPU before compiling the package, in order to
achieve the best performance:
Makevars
provides generic options that should apply to most
CPUs.PKG_CPPFLAGS += -DUSESSE PKG_CXXFLAGS += -msse3
PKG_CPPFLAGS += -DUSEAVX PKG_CXXFLAGS += -mavx
After editing the Makevars
file, run R CMD INSTALL recosystem
on
the package source directory to install recosystem
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.