A Matrix-Factorization-Based Recommendation System


Description

Personalized recommendation systems are popular business applications of machine learning. Imagine we are given a set of ratings from Netflix users on movies they have seen (ratings range from one star to five stars). How can we build a personalized rating predictor for a given user on unseen movies?

Let $U$ be the set of users and $I$ the set of items; the rating predictor can then be expressed as $f: U \times I \rightarrow \mathcal{R}$. We represent the ratings with a matrix $M$ of size $m \times n$, where the $m$ rows correspond to users and the $n$ columns to movies. For example, $M_{12,34} = 5$ means that user 12 gave movie 34 a 5-star rating. However, this rating matrix is very sparse, since nobody can watch more than a small portion of the movies on the market. In practice, only about 1% of the cells in the rating matrix are observed on average, while the other 99% are missing. The objective of our predictor is to estimate those missing values, reflecting each user's preferences learned from the available ratings.
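To make the setup concrete, here is a toy rating matrix in MATLAB (the numbers are made up purely for illustration). We follow the same convention as the code later in this post: a 0 marks a missing rating.

% Toy 4-user x 5-movie rating matrix; 0 means the rating is missing.
M = [5 0 0 1 0;
     0 4 0 0 2;
     3 0 0 5 0;
     0 0 2 0 0];
observedFraction = nnz(M > 0) / numel(M)   % 7 of 20 cells observed = 0.35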

Matrix Factorization

Intuitively, we can assume the matrix is low-rank, as there are only a few factors (e.g., genre, director, main actor/actress, etc.) that determine the preferences. Let $r$ be the number of factors. We define the user profile $U \in \mathcal{R}^{m \times r}$ and the movie profile $V \in \mathcal{R}^{n \times r}$. Then we estimate the rating given by user $u$ on movie $v$ with the inner product of two length-$r$ vectors:
$$M_{u,v} \approx \sum_{k=1}^{r} U_{u,k} V_{v,k}$$
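Equivalently, stacking all users and movies at once, the whole rating matrix is approximated by the product of the two profile matrices, which is exactly the form the MATLAB code below will use:
$$M \approx U V^T$$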
We use the sum of squared errors (SSE) as our metric, which gives the objective function we need to minimize:

$$ E(U,V) = \sum_{(u,v) \in M} (M_{u,v} - U_{u}^T V_v)^2 = \sum_{(u,v) \in M} (M_{u,v} - \sum_{k=1}^r U_{u,k} V_{v,k})^2 $$
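As a quick sanity check, this objective takes only a couple of lines of MATLAB to evaluate. The values below are toy numbers (not from any real data set); the key point is that the mask $(M > 0)$ keeps only the observed ratings in the error:

M = [5 0 3;
     0 4 0];              % toy 2-user x 3-movie ratings, 0 = missing
r = 2;
U = rand(2, r);           % toy user profiles
V = rand(3, r);           % toy movie profiles
mask = (M > 0);           % only observed ratings enter the error
E = sumsqr((M - U*V') .* mask)   % E(U,V) over the observed entries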

This looks similar to the loss function of linear regression:

$$E(\theta) = \sum_{i=1}^m (Y^i - \theta^T x^i)^2 = \sum_{i=1}^m (Y^i - \sum_{k=1}^r \theta_k x_k^i)^2$$

where $m$ is the size of the training data. Comparing the two functions, we see that linear regression applies the same $\theta$ to every input $x^i$, whereas in the matrix factorization we are proposing, a different row of $U$ is used depending on the user $u$.

Gradient Descent

Since there is no closed-form solution to our objective function, we use gradient descent:

$$U_{u,k} \leftarrow U_{u,k} - \mu \frac{\partial E(U,V)}{\partial U_{u,k}}$$

where $\mu$ is the learning rate. Since $U_{u,k}$ appears only in the error terms of ratings given by user $u$, the partial derivative is

$$ \frac{\partial E(U,V)}{\partial U_{u,k}} = -2 \sum_{v|(u,v)\in M}\Big(M_{u,v} - \sum_{k'=1}^r U_{u,k'} V_{v,k'}\Big) V_{v,k} $$

and similarly for $V$, so the gradient descent updates become:

$$ U_{u,k} \leftarrow U_{u,k} + 2\mu \sum_{v|(u,v)\in M} \Big(M_{u,v} - \sum_{k'=1}^r U_{u,k'} V_{v,k'}\Big) V_{v,k} $$
$$ V_{v,k} \leftarrow V_{v,k} + 2\mu \sum_{u|(u,v)\in M} \Big(M_{u,v} - \sum_{k'=1}^r U_{u,k'} V_{v,k'}\Big) U_{u,k} $$
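Before adding regularization, these element-wise updates can be vectorized so that all rows of $U$ and $V$ are updated at once. Continuing the toy example above, a minimal sketch (with a made-up learning rate) looks like this; the mask again restricts the error to observed entries:

mu = 1e-3;                                       % toy learning rate
for it = 1:100
    U = U + 2 * mu * ((M - U*V') .* mask) * V;   % update all user profiles at once
    V = V + 2 * mu * ((M - U*V') .* mask)' * U;  % update all movie profiles at once
end

Note that $V$ is updated with the already-updated $U$, which is also how the full implementation below behaves.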

Regularization

We can penalize large values in $U$ and $V$ to avoid overfitting by adding regularization terms. The regularized objective function is:

$$E(U,V) = \sum_{(u,v) \in M} \Big(M_{u,v} - \sum_{k=1}^r U_{u,k} V_{v,k}\Big)^2 + \lambda \sum_{u,k} U_{u,k}^2 + \lambda \sum_{v,k} V_{v,k}^2$$

where $\lambda$ is a parameter that controls the importance of the regularization terms. We can rewrite the gradient accordingly (taking the user profile as an example):

$$\frac{\partial E(U,V)}{\partial U_{u,k}} = -2 \sum_{v|(u,v)\in M}\Big(M_{u,v} - \sum_{k'=1}^r U_{u,k'} V_{v,k'}\Big) V_{v,k} + 2\lambda U_{u,k}$$

so we have:
$$ U_{u,k} \leftarrow U_{u,k} + 2\mu \sum_{v|(u,v)\in M} \Big(M_{u,v} - \sum_{k'=1}^r U_{u,k'} V_{v,k'}\Big) V_{v,k} - 2\mu\lambda U_{u,k}$$
and
$$ V_{v,k} \leftarrow V_{v,k} + 2\mu \sum_{u|(u,v)\in M} \Big(M_{u,v} - \sum_{k'=1}^r U_{u,k'} V_{v,k'}\Big) U_{u,k} - 2\mu\lambda V_{v,k}$$
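Stacking the element-wise updates into matrix form, with $\odot$ denoting element-wise multiplication and $\mathbb{1}[M > 0]$ the indicator mask over observed entries, gives exactly the updates used in the MATLAB code below:

$$ U \leftarrow U + 2\mu \big((M - UV^T) \odot \mathbb{1}[M > 0]\big) V - 2\mu\lambda U $$
$$ V \leftarrow V + 2\mu \big((M - UV^T) \odot \mathbb{1}[M > 0]\big)^T U - 2\mu\lambda V $$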

Code in Matlab

Let's use the data set movie_data.mat as an example. The file contains a training set and a test set, each of size $943 \times 1682$, i.e. 943 users and 1682 movies. Each entry is either 0 (missing) or a rating from 1 to 5. The implementation of myRecommender.m is quite straightforward:

myRecommender.m
function [U, V] = myRecommender(rateMatrix, lowRank)
    % Parameters
    maxIter = 5e3;        % Choose your own.
    learningRate = 1e-4;  % Choose your own.
    regularizer = 1e-2;   % Choose your own.

    % Random initialization:
    [n1, n2] = size(rateMatrix);
    U = rand(n1, lowRank) / lowRank;
    V = rand(n2, lowRank) / lowRank;

    % Gradient Descent:
    %% Change of the reconstruction error (to check convergence) and the iteration count are used to stop the loop
    deltaRecoError = 1e-4 * nnz(rateMatrix > 0);
    recoError = Inf;
    newRecoError = 5 * 5 * numel(rateMatrix);
    it = 0;

    %% Update U and V
    while abs(newRecoError - recoError) > deltaRecoError && it < maxIter
        recoError = newRecoError;
        % Note that even if sum_k U(u,k)*V(v,k) is non-zero for a missing
        % (zero) entry M(u,v), it must not contribute to the error, since we
        % only want U*V' to match M on the observed ratings.
        U = U + 2 * learningRate * ((rateMatrix - U * V') .* (rateMatrix > 0)) * V ...
              - 2 * learningRate * regularizer * U;
        V = V + 2 * learningRate * ((rateMatrix - U * V') .* (rateMatrix > 0))' * U ...
              - 2 * learningRate * regularizer * V;
        % Update the objective function
        newRecoError = sumsqr((rateMatrix - U * V') .* (rateMatrix > 0)) ...
                       + regularizer * sumsqr(U) + regularizer * sumsqr(V);
        it = it + 1;
    end
end

We can then calculate the root-mean-square error (RMSE) on the training set and the test set, respectively, to measure performance.
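Concretely, writing $\Omega$ for the set of observed entries of the corresponding (training or test) matrix, the RMSE computed below is:

$$ \text{RMSE} = \sqrt{\frac{1}{|\Omega|} \sum_{(u,v)\in\Omega} \Big(M_{u,v} - \sum_{k=1}^r U_{u,k} V_{v,k}\Big)^2} $$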

main.m
clear;

% Load data:
load('movie_data');
rateMatrix = train;
testMatrix = test;

% Global SVD Test:
lowRank = [1, 3, 5];
for l = 1:size(lowRank, 2)
    tic;
    [U, V] = myRecommender(rateMatrix, lowRank(l));
    logTime = toc;

    trainRMSE = norm((U*V' - rateMatrix) .* (rateMatrix > 0), 'fro') / sqrt(nnz(rateMatrix > 0));
    testRMSE = norm((U*V' - testMatrix) .* (testMatrix > 0), 'fro') / sqrt(nnz(testMatrix > 0));

    fprintf('SVD-%d\t%.4f\t%.4f\t%.2f\n', lowRank(l), trainRMSE, testRMSE, logTime);
end

We take 1, 3, and 5 as example low ranks. The output with the chosen hyper-parameters $(\mu, \lambda)$, printed as rank, training RMSE, test RMSE, and running time in seconds, is:

SVD-1	0.9253	0.9537	3.12
SVD-3	0.8678	0.9348	8.77
SVD-5	0.8209	0.9388	10.37

All RMSEs are below 1, which is fairly small compared to the rating range of 1 to 5. The results are in fact pretty good considering that we are predicting the roughly 99% missing entries from only the 1% of ratings that are observed.


Author: Chujie Chen