inst/RecSysLinModels.md

Linear Models in Recommender Systems

N. Matloff, UC Davis

Overview

In the collaborative filtering approach to recommender systems modeling, a very simple but common model for the rating user i gives to item j is

Yij = μ + ui + vj + εij

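where μ is the overall mean rating, ui and vj are the effects of user i and item j (their tendencies to give or receive ratings above or below average), and εij is a mean-zero error term.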

The form of the above model suggests using linear model software, e.g.

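library(dslabs)
data(movielens)
ml <- movielens
# keep only the user ID, movie ID and rating columns
ml <- ml[,c("userId","movieId","rating")]
# treat the IDs as categorical, so that lm() creates dummy variables
ml$userId <- as.factor(ml$userId)
ml$movieId <- as.factor(ml$movieId)
lm(rating ~ .,data=ml)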

At first glance, this seems like a questionable idea. In this version of the MovieLens data, there are 671 users and 9066 movies, so lm() generates nearly 10,000 dummy variables. With only 100,000 data points, which moreover are not independent, we run a real risk of overfitting. Worse, the code is quite long-running (over 2 hours in the run I tried on an ordinary PC).
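To make the dummy-variable count concrete, here is a minimal sketch on a tiny made-up data set. The data frame toy and its values are hypothetical, chosen only for illustration; the column names mirror those of ml above.

```r
# Toy ratings data (hypothetical values, for illustration only)
toy <- data.frame(
  userId  = factor(c(1, 1, 1, 2, 2, 3)),
  movieId = factor(c(10, 20, 30, 10, 30, 20)),
  rating  = c(5, 4, 3, 2, 1, 3)
)

# Design matrix that lm() would build for rating ~ .
X <- model.matrix(rating ~ ., data = toy)

# One intercept column plus (number of levels - 1) dummies per factor
nDummies <- (nlevels(toy$userId) - 1) + (nlevels(toy$movieId) - 1)
ncol(X) == 1 + nDummies

# The same count for the figures quoted above: 671 users, 9066 movies
(671 - 1) + (9066 - 1)
```

With 3 users and 3 movies the design matrix has only 5 columns; the same arithmetic applied to 671 users and 9066 movies gives the "nearly 10,000" figure.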

But it turns out there is a simple, fast, closed-form solution, both for this model and for some more advanced versions featuring interaction terms.

Analysis: Noninteractive model

Estimating μ is easy. From its definition, we take our estimate to be

Y.. = Σi Σj Yij / n

where n is the total number of data points.
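As a quick sketch of this estimate, the overall mean can be computed directly from the rating column. The data frame toy is a hypothetical stand-in for ml, with made-up values.

```r
# Toy ratings data (hypothetical values, for illustration only)
toy <- data.frame(
  userId  = factor(c(1, 1, 1, 2, 2, 3)),
  movieId = factor(c(10, 20, 30, 10, 30, 20)),
  rating  = c(5, 4, 3, 2, 1, 3)
)

n <- nrow(toy)                 # total number of data points
muHat <- sum(toy$rating) / n   # Y.., identical to mean(toy$rating)
muHat
```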

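Write the above model in population form:

Y = μ + U + V + ε

Here U and V are the effects of a randomly chosen user and item, and ε is the error term. Now consider user i, taking expectation conditioned on U = ui and treating V and ε as having mean 0 given the user:

E(Y | U = ui) = μ + ui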

The natural estimate of the LHS is

Yi. = Σj Yij / Ni

where Ni is the number of items rated by user i.

Our estimate for ui is then

Yi. - Y..
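The user-effect estimates amount to per-user means minus the overall mean. A minimal sketch follows, again on a hypothetical data frame toy with made-up values; the names muHat, userMeans, uHat and Ni are illustrative, not taken from any package.

```r
# Toy ratings data (hypothetical values, for illustration only)
toy <- data.frame(
  userId  = factor(c(1, 1, 1, 2, 2, 3)),
  movieId = factor(c(10, 20, 30, 10, 30, 20)),
  rating  = c(5, 4, 3, 2, 1, 3)
)

muHat <- mean(toy$rating)                        # Y..

# Split the ratings by user, then summarize each group
byUser <- split(toy$rating, toy$userId)
Ni <- sapply(byUser, length)                     # number of items rated by each user
userMeans <- sapply(byUser, mean)                # Yi.
uHat <- userMeans - muHat                        # estimated user effects

uHat
sum(Ni * uHat)   # the Ni-weighted effects sum to zero, up to rounding
```

The last line is a sanity check rather than part of the estimator: since Y.. is the Ni-weighted average of the user means, the weighted estimated effects must cancel.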

A similar derivation, with Y.j denoting the average of the ratings given to item j, yields our estimate for vj,

Y.j - Y..
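Putting the three estimates together, a predicted rating can be sketched by plugging them into the model. This is only an illustration under stated assumptions: the data frame toy and its values are made up, and predictRating() is a hypothetical helper, not a regtools function.

```r
# Toy ratings data (hypothetical values, for illustration only)
toy <- data.frame(
  userId  = factor(c(1, 1, 1, 2, 2, 3)),
  movieId = factor(c(10, 20, 30, 10, 30, 20)),
  rating  = c(5, 4, 3, 2, 1, 3)
)

muHat <- mean(toy$rating)                                    # Y..
uHat <- sapply(split(toy$rating, toy$userId), mean) - muHat  # Yi. - Y..
vHat <- sapply(split(toy$rating, toy$movieId), mean) - muHat # Y.j - Y..

# Hypothetical helper: plug the estimates into mu + u_i + v_j
predictRating <- function(user, movie) {
  muHat + uHat[[as.character(user)]] + vHat[[as.character(movie)]]
}

# User 3 has not rated movie 10 in the toy data
predictRating(3, 10)
```

No model fitting is involved here, only group means, which is why this approach is fast even when there are thousands of users and items.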

(under construction)



matloff/regtools documentation built on Oct. 23, 2024, 2:58 a.m.