Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence

Data & Analytics

The present document can't read!
Please download to view
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
  • Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence Dawen Liang Columbia University/Netflix Jaan Altosaar Laurent Charlin David Blei
  • A simple trick to boost the performance of your recommender system without using any additional data Dawen Liang Columbia University/Netflix Jaan Altosaar Laurent Charlin David Blei
  • • User-item interaction is commonly encoded in a user-by-item matrix • In the form of (user, item, preference) triplets • Matrix factorization is the standard method to infer latent user preferences Motivation Items Us er s ? ?
  • • Alternatively we can model item co- occurrence across users • Analogy: modeling a set of documents (users) as a bag of co-occurring words (items): e.g., “Pluto” and “planet” Motivation : … ,{ } ,{ } ,{ } , , ,{ }
  • Can we combine these two views in a single model? YES
  • Items Us er s ? ? ≈ User latent factors θ Item latent factors β K # ItemsK # us er s * Click matrix Y “Collaborative filtering for implicit feedback datasets”, Y. Hu, Y. Koren, C. Volinsky, ICDM 08. Lmf = X u,i cui(yui � ✓>u �i)2
  • • Skip-gram word2vec • Learn a low-dimensional word embedding in a continuous space • Predict context words given the current word Word embedding
  • Item embedding • Skip-gram word2vec • Learn a low-dimensional word embedding in a continuous space • Predict context words given the current word We can embed item se quence s in the same fa shion
  • Levy & Goldberg show that skip-gram word2vec is implicitly factorizing (some variation of) the pointwise mutual information (PMI) matrix “Neural Word Embedding as Implicit Matrix Factorization”, Levy & Goldberg, NIPS 14. co-occurrence patterns for rare items (items not consumed by many users). Finally, we discuss future extensions of this model such as taking advantage of user co-occurrence information in addition to item co-occurrence. 2. RELATED WORK The user and item factors learned through matrix factor- ization are usually regularized with their ` 2 -norm. It is also standard to further regularize the factors with side infor- mation [21]. Side information is information, in addition to the user-item preference data, about users, items, or both users and items (e.g., item metadata or user demographics) that may be indicative of preferences. It is customary to either inform the factors by conditioning on the side infor- mation or to constrain the factors by having them generate the side information (in which case the side information is modeled as a random variable). Representative examples of each case are given in [1] and [23] respectively. Recent work also proposes an item regularizer that can generate the words of the reviews associated with an item [2]. The generative model uses a neural-net based language model similar to word embeddings. The distinguishing feature of our work is that the regularization comes from a (determinis- tic) transformation of the original user-item preference data instead of additional data. Similar idea has also been applied to music and product recommendations [11, 7] by explicitly conditioning on the item contents. This kind of regularization through incorporation of side information can alternatively be viewed as enforcing more complex structure in the prior of the latent factors. For example, in Ranganath et al. [16], they find that imposing a deep exponential family prior on the matrix factorization model, which implicitly conditions on consumption counts in a non-linear fashion, can be helpful in reducing the e↵ect of extremely popular (or extremely unpopular) items on held- out recommendation tasks. This is analogous to our findings with the CoFactor model. Guàrdia-Sebaoun et al. [5] observe that user trajectories over items can be modeled much like sequences of words. In detail, they use a word embedding model [13] to learn item representations. User embeddings are then inferred as to predict the next item in the trajectory (given the previous item). User and item embeddings are finally used as covariates in a subsequent regression model of user-item preferences. Learning item embeddings for recommendations shares commonalities with our work. The di↵erence is that we treat the items consumed by each user as exchangeable (i.e., we do not assume that the trajectories are given, rather we assume that we are given an unordered set of items for each user). Additionally we show that in our model jointly learning all parameters yields higher performance. 3. THE COFACTOR MODEL In this section, we first introduce the two building blocks of the CoFactor model: matrix factorization (MF) and item embedding. Then we describe the CoFactor model and how to perform model inference. Matrix factorization. MF is standard in collaborative filtering [9]. Given the sparse user-item interaction matrix Y 2 RU⇥I from U users and I items, MF decomposes it into the product of user and item latent factors denoted ✓u 2 RK (u = 1, . . . , U) and �i 2 RK (i = 1, . . . , I), respectively. In this paper, our focus is on implicit feedback data [8, 15]. However, we would like to point out that this is not a limiting aspect of our CoFactor model – it can be readily extended to the explicit feedback setting. In the matrix factorization for implicit feedback model [8], the following objective is optimized: L mf = X u,i cui(yui � ✓>u �i)2 + �✓ X u k✓uk2 2 + �� X i k�ik2 2 . In this MF objective, the scaling parameter cui is a hyper- parameter that is normally set to be cy=1 > cy=0. It can be tuned to balance the unobserved ratings (y = 0) from the observed ratings (y = 1) in the click matrix. �✓ and �� are regularization parameters. The solution to this objective L mf can be interpreted as the maximum a posteriori estimate for the following probabilistic model: ✓u ⇠ N (0,��1✓ IK) �i ⇠ N (0,��1� IK) yui ⇠ N (✓>u �i, c�1ui ). We choose to introduce MF from an optimization perspective since our contribution results in a model which does not have a clear generative interpretation. Word embedding. Word embedding models (e.g., word2vec [13]) have gained success in many natural lan- guage processing tasks. Given a sequence of words, these models attempt to embed each word into a low-dimensional (relative to the vocabulary size) continuous space. In the skip-gram model, the objective is to predict context words— surrounding words within a fixed window—given the current word. Stochastic gradient descent (SGD) with negative sam- pling is normally used to train a word embedding model (see Mikolov et al. [13] for more details). Levy and Goldberg [10] show that word2vec with a neg- ative sampling value of k can be interpreted as implicitly factorizing the pointwise mutual information (PMI) matrix shifted by log k. PMI between a word i and its context word j is defined as: PMI(i, j) = log P (i, j) P (i)P (j) Empirically, it is estimated as: PMI(i, j) = log #(i, j) ·D #(i) ·#(j) . Here #(i, j) is the number of times word j appears in the context of word i. D is the total number of word-context pairs. #(i) = P j #(i, j) and #(j) = P i #(i, j). After making the connection between word2vec and matrix factorization, Levy and Goldberg [10] further proposed to perform word embedding by spectral dimensionality reduc- tion (e.g., singular value decomposition) on shifted positive PMI (SPPMI) matrix: SPPMI(i, j) = max � max{PMI(i, j), 0}� log k, 0 This is attractive since it does not require learning rate and hyperparamter tuning as in SGD. In our CoFactor model, we will follow the same approach to decompose the SPPMI matrix. Item embedding. Users consume items sequentially. Sequences of items are analogous to sequences of words, so current word/item context word/item Co-occurrence matrix • PMI(“Pluto”, “planet”) > PMI(“Pluto”, “RecSys”)
  • Jointly factorize both the click matrix and co-occurrence PMI matrix with a shared item representation/embedding CoFactor
  • • Item representation must account for both user- item interactions and item-item co-occurrence • Alternative interpretation: regularizing the traditional MF objective with item embeddings learned by factorizing the item co-occurrence matrix L co = X u,i cui(yui � ✓>u �i)2 + X mij 6=0 (mij � �>i �j � wi � cj)2 Matrix factorization Item embedding Shared item representation/embedding
  • Problem/application-specific • Define context as the entire user click history • #(i, j) is the number of users who clicked on both item i and item j • Do not require any additional information beyond standard MF model How to define “co-occur”
  • • Data preparation: 70/20/10 train/test/validation • Make sure train/validation do not overlap in time with test • Metrics: [email protected], 50, [email protected], [email protected] Empirical study ArXiv ML-20M TasteProfile # of users 25,057 111,148 221,830 # of items 63,003 11,711 22,781 # interactions 1.8M 8.2M 14.0M % interactions 0.12% 0.63% 0.29% with timestamps yes yes no Table 1: Attributes of datasets after preprocessing. Inter- actions are non-zero entries (listening counts, watches, and clicks). % interactions refers to the density of the user-item interaction matrix (Y ). For datasets with timestamps, we ensure there is no overlap in time between the training and test sets. why jointly factoring both the user click matrix and item co-occurrence matrix boosts the performance by exploring the model fits. • We also demonstrate the importance of joint learning by comparing with a model that does word2vec for item embedding followed by MF in a two-step process. 4.1 Datasets Throughout this study we use three medium- to large- scale user-item consumption datasets from various domains: 1) scientific articles data from the arXiv2; 2) users’ movie viewing data from MovieLens [6]; 3) the taste profile subset of the million song dataset [3]. In more details: ArXiv: user-paper clicks derived from log data collected in the first half of 2012 from the arXiv pre-print server. The data are binarized (multiple clicks by the same user on a single paper are considered to be a single click). We preprocess the data to ensure that all users and items have a minimum of ten clicks. MovieLens-20M (ML-20M): user-movie ratings col- lected from a movie recommendation service. We follow the standard procedure of binarizing explicit data by only keeping the ratings of four or higher and interpret them as implicit feedback. We only keep users who have watched at least five movies. TasteProfile: user-song play counts collected by the mu- sic intelligence company Echo Nest (now owned by Spotify).3 We binarize the play counts and interpret them as implicit preference data. We further preprocess the dataset by only keeping users with at least 20 songs in their listening history and songs that are listened to by at least 50 users. To create the training/validation/test splits for the datasets with timestamps, we order all the user-item interactions by time and take the first 80% as training data, from which we randomly selected 10% as the validation set. For the remain- ing 20% of the data, we only keep the users and items that appear in the training and validation sets to obtain the test set. For TasteProfile, for which we do not have timestamps, we randomly split the observed user-item interactions into training/validation/test sets with 70/10/20 proportions. The final dimensions of all the datasets after preprocessing are summarized in Table 1. 4.2 Metrics As is typical in evaluating implicit feedback recommen- 2http://arxiv.org 3http://the.echonest.com dation challenges, we turn to the ranking-based metrics: [email protected] , truncated normalized discounted cumulative gain ([email protected]), and mean average precision ([email protected]). For each user, all the metrics compare the predicted rank of (unobserved) items with their true rank. While [email protected] considers all items ranked within the first M to be equivalent, [email protected] and [email protected] use a monotonically increasing discount to emphasize the importance of higher ranks versus lower ones. Formally, define ⇡ as a permutation over all the items, 1{·} is the indicator function, and u(⇡(i)) returns 1 if user u has consumed item ⇡(i). In our case the model learns to predict ranking ⇡ for each user by sorting the predicted preference ✓>u �i for i = 1, . . . , I. [email protected] for user u is defined as [email protected](u,⇡) := MX i=1 1{u(⇡(i)) = 1} min(M, PI i0 1{u(⇡(i0)) = 1}) . The expression in the denominator evaluates to the minimum between M and the number of items consumed by user u. In this way, [email protected] is normalized to have a maximum of 1. This corresponds to successfully retrieving all the relevant items in top M of the list. [email protected] for user u is defined as [email protected](u,⇡) := MX i=1 21{u(⇡(i))=1} � 1 log(i+ 1) Normalized DCG (NDCG) is the DCG normalized to the 0–1 range where a one signifies a perfect ranking. Mean average precision ([email protected]) calculates the mean of users’ average precision (AP). The average precision [email protected] for a user u is: [email protected](u,⇡) := MX i=1 [email protected](u,⇡) min(i, PI i0 1{u(⇡(i0)) = 1}) 4.3 Experiment protocols We compare the CoFactor model with the weighted matrix factorization (WMF) model [8]. Given the similarity between the objective of CoFactor L co and that of WMF L mf , we can attribute the improvement to the regularization imposed by the co-occurrence SPPMI matrix. In all our experiments, the dimension of the latent space K is set to 100. The CoFactor model is trained following the inference algorithm described in Section 3.1. We monitor the convergence of the algorithm using [email protected] on the validation set for both CoFactor and WMF. Both CoFactor and WMF require hyperparameter tun- ing. We first select the best hyperparameters for WMF (the weight cy=0 and cy=1, the regularization parameters � = �✓ = ��) based on their performance on the valida- tion set. For CoFactor, we then keep the same cy=1/cy=0 ratio and regularization parameters, and grid search for the best relative scale ` 2 {0.01, 0.05, 0.1, . . . , 1, 5, 10}. Note that when scaling cui, we also scale �✓ and �� . The selected relative scale indicates the importance of regularization with co-occurrence information. If a large scaling parameter cui is selected, it means the MF part of the model can be ef- fective at recommendation on its own without significant help from the information provided by the co-occurrence matrix (as the scaling parameter goes to infinity, CoFactor is reduced to WMF). On the other hand, a smaller scaling
  • Quantitative results ArXiv ML-20M TasteProfile WMF CoFactor WMF CoFactor WMF CoFactor [email protected] 0.063 0.067 0.133 0.145 0.198 0.208 [email protected] 0.108 0.110 0.165 0.177 0.286 0.300 [email protected] 0.076 0.079 0.160 0.172 0.257 0.268 [email protected] 0.019 0.021 0.047 0.055 0.103 0.111 Table 2: Comparison between the widely-used weighted matrix factorization (WMF) model [8] and our CoFactor model. CoFactor significantly outperforms WMF on all the datasets across all metrics. The improvement is most pronounced on the movie watching (ML-20M) and music listening (TasteProfile) datasets. parameter indicates that the model benefits from account- ing for co-occurrence patterns in the observed user behavior data. We also grid search for the negative sampling values k 2 {1, 2, 5, 10, 50} which e↵ectively modulate how much to shift the empirically estimated PMI matrix. 4.4 Analyzing the CoFactor model fits Table 2 summarizes the quantitative results. Each metric is averaged across all users in the test set. As we can see, CoFactor outperforms WMF [8] on all datasets and across all metrics. The improvement is very clear for the MovieLens (ML-20M) and music (TasteProfile) data. Note that both models make use of the same data and optimize similar objec- tives, with CoFactor benefiting from an extra co-occurrence regularization term. Exploratory analysis When does CoFactor do better/worse? Figure 1 shows the breakdown of [email protected] by user activity in the train- ing set for all three datasets. Although details vary across datasets, the CoFactor model is able to consistently improve recommendation performance for users who have only con- sumed a small amount of items. This is understandable since the standard MF model is known to have the limitation of not being able to accurately infer inactive users’ preferences, while CoFactor explicitly makes use of the additional sig- nal from item co-occurrence patterns to help learn better latent representations, even when user-item interaction data is scarce. For active users (the rightmost panel of each plot), WMF does worse than CoFactor on ArXiv, better on ML- 20M, and roughly the same on TasteProfile. However, since active users are the minority (most recommendation datasets have a long tail), the standard error is also bigger. To understand when CoFactor makes better recommen- dation, we take a look at how it ranks relatively rare items di↵erently from WMF. Figure 2 shows the histogram of ranks for movies with less than 10 viewings in the train- ing set (which consists of 3,239 movies out of the 11,711 total movies). We show four randomly selected users on the MovieLens ML-20M data (the general trend is consistent across the entire dataset, and we observe a similar patten for TasteProfile as well). We can see that WMF mostly rank rare movies in the middle of its recommendations, which is expected because the collaborative filtering model is driven by item popularity. On the other hand, CoFactor can push these rare movies both to the top and bottom of its recom- mendations. We did not observe as clear of a pattern on the ArXiv data. We conjecture that this is due to the fact that ArXiv dataset is less popularity-biased than ML-20M and TasteProfile: the mean (median) of users who consumed an (a) ArXiv (b) ML-20M (c) TasteProfile Figure 1: Average normalized discounted cumulative gain ([email protected]) breakdown for both CoFactor and weighted matrix factorization (WMF) [8] based on user activity on di↵erent datasets. Error bars correspond to one standard error. There is some variation across datasets, but CoFactor is able to consistently improve recommendation performance for users who have only consumed a small amount of items. • We get better results by simply re-using the data • Item co-occurrence is in principle available to MF model, but MF model (bi-linear) has limited modeling capacity to make use of it
  • < 50 ≥ 50, < 100 ≥ 100, < 150 ≥ 150, < 500 ≥ 500 1umber of songs Whe user hDs lisWeneG Wo 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 A ve rD ge 1 D C G @ 10 0 CoFDFWor W0F User activity: Low High We observe similar trend for other datasets as well
  • Toy Story (24659) Fight Club (18728) Kill Bill: Vol. 1 (8728) Mouchette (32) Army of Shadows (L'armée des ombres) (96) User’s watch history The Silence of the Lambs (37217) Pulp Fiction (37445) Finding Nemo (9290) Atalante L’ (90) Diary of a Country Priest (Journal d'un curé de campagne) (68) Top recommendation by CoFactor Rain Man (11862) Pulp Fiction (37445) Finding Nemo (9290) The Godfather: Part II (15325) That Obscure Object of Desire (Cet obscur objet du désir) (300) Top recommendation by WMF number of users who watched the movie in the training set
  • How important is joint learning? item is 597 (48) for ML-20M and 441 (211) for TasteProfile, while only 21 (12) for ArXiv. Figure 2: The histogram of ranks from four randomly se- lected users for movies with less than 10 watches in the Movie- Lens ML-20M training set for both CoFactor and weighted matrix factorization (WMF) [9]. The extra regularization in CoFactor allows it to push these rare items further up and down its recommendation list for each user. In Figure 3, we compare CoFactor and WMF for a par- ticular user who has watched many popular movies (e.g., “Toy Story” and “Fight Club”), as well as some rare French movies (e.g. “Mouchette” and “Army of Shadows”). We find that WMF highly ranks popular movies and places them on top of its recommendations. Even when it recommends a French movie (“That Obscure Object of Desire”) to this user, it is a relatively popular one (as indicated by the num- ber of users who have watched it). In contrast, CoFactor will better balance between the popular movies and relevant (but relatively rare) French movies in its recommendation. This makes sense because rare French movies co-occur in the training set, and this is captured by the SPPMI matrix. We find similar exploratory examples in TasteProfile. This explains the performance gap between CoFactor and WMF among active users on ML-20M, as CoFactor could potentially rank popular movies lower than WMF. For active users, they are more likely to have watched popular movies in both the training and test sets. When we look at the top 100 users where WMF does better than CoFactor in terms of [email protected], we find that the performance di↵erence is never caused by WMF recommending a rare movie in the test set that CoFactor fails to recommend. How important is joint learning? One of the main contributions of the proposed CoFactor model is incorpo- rating both MF and item embedding in a joint learning objective. This “end-to-end” approach has proven e↵ective in many machine learning models. To demonstrate the advan- tage of joint learning, we conduct an alternative experiment on ArXiv where we perform the learning in a two-stage pro- cess. First, we train a skip-gram word2vec model on the click data, treating users’ entire click histories as the context—this is equivalent to factorizing the SPPMI matrix in CoFactor model. We then fix these learned item embeddings from WMF CoFactor word2vec + reg [email protected] 0.063 0.067 0.052 [email protected] 0.108 0.110 0.095 [email protected] 0.076 0.079 0.065 [email protected] 0.019 0.021 0.016 Table 3: Comparison between joint learning (CoFactor) and learning from a separate two-stage (word2vec + reg) process on ArXiv. Even though they make similar modeling assumptions, CoFactor provides superior performance. word2vec as the latent factors ˇi in the MF model, and learn user latent factors ✓u. Learning ✓u in this way is the same as one iteration of CoFactor, which is e↵ectively doing a weighted ridge regression. We evaluate the model learned from this two-stage process (word2vec + reg) and report the quantitative results in Ta- ble 3. (The results for WMF and CoFactor are copied from Table 2 for comparison.) The performance of the two-stage model is much worse. This is understandable because the item embeddings learned from word2vec are certainly not as well-suited for recommendation as the item latent factors learned from MF. Using this item embedding in the second stage to infer user preferences will likely lead to inferior per- formance. On the other hand, CoFactor is capable of finding the proper balance between MF and item embeddings to get the best of both worlds. 5. DISCUSSION It is interesting that CoFactor improves over WMF by re-using the preference data in a di↵erent encoding. The item co-occurrence information is, in principle, available from the user-item interaction data to MF. However, as a bi- linear model with limited capacity, MF cannot uncover such information. On the other hand, highly non-linear models, such as deep neural networks, have shown limited success at the task of preference prediction [21, 17]. Our approach is somewhat in between, where we have used a deterministic non-linear transformation to re-encode the input. A natural question is whether it is always better to in- corporate item co-occurrence information. This depends on the problem and specific data. However, as mentioned in Section 4.3, as the relative scale on the weight cui goes to infinity, CoFactor is e↵ectively reduced to WMF. Therefore, as long as there is additional information that is useful in the co-occurrence matrix, in principle, the performance of CoFactor is lower bounded by that of WMF. 5.1 Possible extensions to the model It is straightforward to extend our CoFactor model to explicitly handle sessions of user-item consumption data by constructing a session-based co-occurrence matrix, as discussed in Section 2. This is likely to yield improved performance in some settings, such as in purchase behavior or music listening data, where the notion of a ‘shopping cart’ or ‘playlist’ induce session-based patterns. Adding user-user co-occurrence to our model to regularize user latent factors is another natural extension, which we leave to future work. We anticipate this extension to yield further improvements, especially for users in the long tail. The type of regularization we have added to MF models can
  • Extension • User-user co-occurrence • Higher-order co-occurrence patterns • Add the same type of item-item co- occurrence regularization in other collaborative filtering methods, e.g., BPR, factorization machine, or SLIM
  • Conclusion • We present CoFactor model: • Jointly factorize both user-item click matrix and item-item co-occurrence matrix • Motivated by the recent success of word embedding models (e.g., word2vec) • Explore the results both quantitatively and qualitatively to investigate the pros/cons Source code available: https://github.com/dawenl/cofactor https://github.com/dawenl/cofactor
  • Thank you • We present CoFactor model: • Jointly factorize both user-item click matrix and item-item co-occurrence matrix • Motivated by the recent success of word embedding models (e.g., word2vec) • Explore the results both quantitatively and qualitatively to investigate the pros/cons Source code available: https://github.com/dawenl/cofactor https://github.com/dawenl/cofactor