N. Matloff, UC Davis
In the collaborative filtering approach to recommender systems modeling, a very simple but common model for the rating user i gives to item j is
Yij = μ + ui + vj + εij
where
μ is the overall mean rating over all users and items
ui is the propensity of user i to rate items liberally or harshly
vj is the propensity of item j to be rated liberally or harshly
εij is an error term, incorporating all other factors
Taken as random variables as i and j vary over all users and items, ui, vj, and εij are assumed independent, each with mean 0.
The form of the above model suggests using linear model software, e.g.
library(dslabs)
data(movielens)   # the MovieLens ratings data
ml <- movielens
ml <- ml[,c(5,1,6)]   # keep userId, movieId, rating
ml$userId <- as.factor(ml$userId)
ml$movieId <- as.factor(ml$movieId)
lm(rating ~ .,data=ml)   # regress rating on user and movie dummies
At first glance, this seems like a questionable idea. In this version of the MovieLens data, there are 671 users and 9066 movies, so lm() will generate nearly 10,000 dummy variables. With only 100,000 data points (which, moreover, are not independent), we run a real risk of overfitting. Worse, the code is quite long-running (over 2 hours in the run I tried on an ordinary PC).
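To check those counts for yourself (a quick sketch, assuming the ml data frame built above):
nlevels(ml$userId)    # 671 distinct users
nlevels(ml$movieId)   # 9066 distinct movies
nrow(ml)              # roughly 100,000 ratings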
But it turns out there is a simple, fast, closed-form solution, both for this model and for some more advanced versions featuring interaction terms.
Estimating μ is easy. From its definition, we take our estimate to be
Y.. = Σi Σj Yij / n
where n is the total number of data points.
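In R, with the ml data frame from the code above, this is just the grand mean of the rating column (a minimal sketch; the name muhat is an arbitrary choice):
muhat <- mean(ml$rating)   # estimate of mu, i.e. Y..
muhat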
Write the above model in population form:
Y = μ + U + V + ε
Now consider user i. Taking the expectation of Y conditional on the user being i, and using the fact that V and ε have mean 0 regardless of the user, we have
E(Y | user = i) = μ + ui
The natural estimate of the LHS is the mean of user i's ratings,
Yi. = Σj Yij / Ni
where Ni is the number of items rated by user i and the sum is over those items.
Our estimate for ui is then
Yi. - Y..
A similar derivation yields our estimate for vj,
Y.j - Y..
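Putting the pieces together, here is a sketch of the closed-form computation in base R, again assuming the ml data frame constructed earlier; the object names muhat, uhat, and vhat are illustrative choices, not from any package:
muhat <- mean(ml$rating)                           # Y.., estimate of mu (as above)
uhat <- tapply(ml$rating,ml$userId,mean) - muhat   # Yi. - Y.., estimates of the ui
vhat <- tapply(ml$rating,ml$movieId,mean) - muhat  # Y.j - Y.., estimates of the vj
# predicted rating for the user/movie pair in the first row of the data
i <- as.character(ml$userId[1])
j <- as.character(ml$movieId[1])
muhat + uhat[i] + vhat[j]
This runs in seconds, in contrast to the hours needed by the lm() call above.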
(under construction)