Title: Learning User Preferences
1. Learning User Preferences
Jason Rennie, MIT CSAIL (jrennie@gmail.com)
Advisor: Tommi Jaakkola
2. Information Extraction
- Informal communication: e-mail, mailing lists, bulletin boards
- Issues:
  - Context switching
  - Abbreviations, shortened forms
  - Variable punctuation, formatting, grammar
3. Thesis Advertisement / Outline
- The thesis is not an end-to-end IE system
- We address some IE problems:
  - Identifying & Resolving Named Entities
  - Tracking Context
  - Learning User Preferences
4. Identifying Named Entities
- "Rialto is now open until 11pm"
- Facts/opinions are usually about a named entity
- Tools typically rely on punctuation, capitalization, formatting, grammar
- We developed a criterion to identify topic-oriented words using occurrence statistics
(Rennie & Jaakkola, SIGIR 2005)
5. Resolving Named Entities
- "They're now open until 11pm"
- What does "they" refer to?
- Clustering:
  - Group noun phrases that co-refer
  - McCallum & Wellner (2005)
  - Excellent for proper nouns
- Our contribution: better modeling of non-proper nouns (incl. pronouns)
6. Tracking Context
- "The Swordfish was fabulous"
- An indirect comment on the restaurant
- The restaurant is identified by context
- Use word statistics to find topic switches
- Contribution: a new sentence clustering algorithm
7. Learning User Preferences
- Examples:
  - "I loved Rialto last night."
  - "Overall, Oleana was worth the money"
  - "Radius wasn't bad, but wasn't great"
  - "Om was purely pretentious"
- Issues:
  - Translate text into a partial ordering or rating
  - Predict unobserved ratings
8. Preference Problems
- Single user with item features
- Multi-user, no features
  - a.k.a. Collaborative Filtering
9. Single User, Item Features
[Diagram: observed ratings]
10. Single User, Item Features
[Diagram: unknown ("?") preference scores]
11. Many Users, No Features
[Diagram: features, weights, preference scores, ratings]
12. Collaborative Filtering
- Possible goals:
  - Predict missing entries
  - Cluster users or items
- Applications:
  - Movies, books
  - Genetic interaction
  - Network routing
  - Sports performance
[Diagram: users × items rating matrix]
13. Outline
- Single User, Features
  - Loss functions, convexity, large margin
  - Loss function for ratings
- Many Users, No Features
  - Feature selection, rank, SVD
  - Regularization: tie together multiple tasks
  - Optimization: scale to large problems
- Extensions
14. This Talk: Contributions
- Implementation and systematic evaluation of loss functions for single-user prediction
- Scaling multi-user regularization to large problems (thousands of users/items)
- Analysis of optimization
- Extensions:
  - Hybrid: features + multiple users
  - Observation model, multiple ratings
15. Rating Classification
- n ordered classes
- Learn a weight vector and thresholds
[Figure: items with ratings 1-3 projected onto the weight vector w, separated by thresholds]
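A threshold model like this typically turns the score w·x into a rating by counting crossed thresholds. A sketch of that prediction rule, with the threshold symbols θ_j assumed rather than taken from the slide:

```latex
% Ordered thresholds \theta_1 \le \dots \le \theta_{n-1} divide the score
% line into n rating bins; predict by counting crossed thresholds:
\hat{y}(x) = 1 + \sum_{j=1}^{n-1} \mathbf{1}\left[ w^\top x > \theta_j \right]
```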
16. Loss Functions
[Plot: 0-1, Hinge, Logistic, Smooth Hinge and Modified Least Squares losses as functions of the margin agreement]
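These are standard margin-based penalties; a minimal numpy sketch of plausible forms, written as functions of the margin agreement z (the exact expressions in the talk may differ slightly):

```python
import numpy as np

def loss_01(z):
    # 0-1 loss: 1 on a margin violation, 0 otherwise (non-convex)
    return (z <= 0).astype(float)

def hinge(z):
    # hinge: linear penalty for z < 1, zero beyond the margin
    return np.maximum(0.0, 1.0 - z)

def logistic(z):
    # logistic: smooth, strictly positive everywhere
    return np.log1p(np.exp(-z))

def smooth_hinge(z):
    # quadratic on 0 < z < 1, linear for z <= 0, zero for z >= 1
    return np.where(z >= 1.0, 0.0,
           np.where(z <= 0.0, 0.5 - z, 0.5 * (1.0 - z) ** 2))

def modified_least_squares(z):
    # squared hinge
    return np.maximum(0.0, 1.0 - z) ** 2

z = np.linspace(-2.0, 2.0, 9)
print(hinge(z))
print(smooth_hinge(z))
```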
17. Convexity
- Convex function ⇒ no local minima
- A set is convex if every line segment between its points lies within the set
18. Convexity of Loss Functions
- 0-1 loss is not convex
  - Local minima; sensitive to small changes
- Convex bound:
  - Large-margin solution with regularization
  - Stronger guarantees
19. Proportional Odds
- McCullagh introduced the original rating model
- Linear interaction: weights · features
- Thresholds
- Maximum likelihood estimation
(McCullagh, 1980)
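McCullagh's model is usually written as a cumulative logit; a hedged rendering (sign conventions vary between texts):

```latex
% Proportional odds: one weight vector w shared across all rating levels,
% plus ordered thresholds; fit by maximum likelihood.
\log \frac{P(y \le j \mid x)}{P(y > j \mid x)} = \theta_j - w^\top x,
\qquad \theta_1 \le \theta_2 \le \dots \le \theta_{n-1}
```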
20. Immediate-Thresholds
(Shashua & Levin, 2003)
21. Some Errors are Better than Others
22. Not a Bound on the Absolute Difference
[Figure: rating scale 1-5]
23. All-Thresholds Loss
(Srebro, Rennie & Jaakkola, NIPS 2004)
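As I read the construction, all-thresholds loss charges a margin penalty at every threshold, not just the two adjacent to the true rating, so larger rating errors incur larger loss. A sketch with hypothetical thresholds (function and variable names are mine):

```python
import numpy as np

def hinge(m):
    return np.maximum(0.0, 1.0 - m)

def all_thresholds_loss(z, y, thetas, f=hinge):
    """Loss for score z and true rating y in {1, ..., len(thetas) + 1}."""
    total = 0.0
    for j, theta in enumerate(thetas, start=1):
        if j < y:
            total += f(z - theta)   # score should lie above this threshold
        else:
            total += f(theta - z)   # score should lie below this threshold
    return total

thetas = [-1.0, 0.0, 1.0]           # hypothetical thresholds for 4 rating levels
print(all_thresholds_loss(z=-2.0, y=4, thetas=thetas))  # off by 3 levels: loss 9.0
print(all_thresholds_loss(z=2.5, y=4, thetas=thetas))   # well inside correct bin: 0.0
```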
24. Experiments

Loss            Multi-Class   Imm-Thresh   All-Thresh   p-value
MLS             .7486         .7491        .6700        1.7e-18
Hinge           .7433         .7628        .6702        6.6e-17
Logistic        .7490         .7248        .6623        7.3e-22
Least Squares   1.3368        --           --           --

(Rennie & Srebro, IJCAI 2005)
25. Many Users, No Features
[Diagram: features, weights, preference scores, ratings]
26. Background: Lp-norms
- L0: number of non-zero entries; ‖⟨0, 2, 0, 3, 4⟩‖₀ = 3
- L1: sum of absolute values; ‖⟨2, −2, 1⟩‖₁ = 5
- L2: Euclidean length; ‖⟨1, −1⟩‖₂ = √2
- In general: ‖v‖ₚ = (Σᵢ |vᵢ|ᵖ)^(1/p)
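The slide's three examples, checked with numpy (np.linalg.norm accepts a general p via ord):

```python
import numpy as np

print(np.count_nonzero([0, 2, 0, 3, 4]))   # L0: 3 non-zero entries
print(np.abs([2, -2, 1]).sum())            # L1: 5
print(np.linalg.norm([1, -1]))             # L2: 1.4142... = sqrt(2)
print(np.linalg.norm([1, -1], ord=3))      # general Lp with p = 3
```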
27. Background: Feature Selection
- Objective = Loss + Regularization
- L1 regularization (encourages sparsity)
- Squared L2 regularization
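In symbols, the objective the slide alludes to is the usual penalized form; the tradeoff parameter λ is assumed notation, not taken from the slide:

```latex
J(w) = \mathrm{Loss}(w) + \lambda R(w), \qquad
R(w) = \|w\|_1 \;\;\text{(drives weights to exactly zero)}
\quad\text{or}\quad R(w) = \|w\|_2^2
```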
28. Singular Value Decomposition
- X = USVᵀ
- U, V orthogonal (rotations)
- S diagonal, non-negative
- The eigenvalues of XXᵀ = USVᵀVSUᵀ = US²Uᵀ are the squared singular values of X
- Rank = ‖s‖₀, the number of non-zero singular values
- SVD is used to obtain the least-squares low-rank approximation
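A quick numpy illustration of these facts, including the least-squares low-rank approximation obtained by truncating the SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
print(np.count_nonzero(s > 1e-10))                 # rank = # non-zero singular values

k = 2                                              # keep the top-k singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation
print(np.linalg.norm(X - X_k))                     # equals sqrt(sum(s[k:] ** 2))
print(np.sqrt(np.sum(s[k:] ** 2)))
```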
29. Low-Rank Matrix Factorization
[Diagram: X (rank k) ≈ U·Vᵀ]
- Sum-squared loss, fully observed Y: use SVD to find the global optimum
- Classification error loss, partially observed Y: non-convex, no explicit solution
30. Low-Rank: a Non-Convex Set
31. Trace Norm Regularization
(Fazel et al., 2001)
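For reference, the trace norm (also called the nuclear norm) is the sum of the singular values, and Fazel et al. use it as a convex surrogate for rank:

```latex
\|X\|_{\Sigma} = \sum_i \sigma_i(X)
```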
32. Many Users, No Features
[Diagram: features (V), weights (U), preference scores (X), ratings (Y)]
33. Max-Margin Matrix Factorization
- Trace norm regularization
- All-thresholds loss
- Convex function of X and θ
- Low rank in X
(Srebro, Rennie & Jaakkola, NIPS 2004)
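Putting the slide's two ingredients together, the objective presumably has the shape below; the tradeoff constant c and the set S of observed entries are assumed notation, not copied from the slide:

```latex
\min_{X, \theta} \; \|X\|_{\Sigma} \;+\; c \sum_{(i,j) \in S}
  \mathrm{loss}\bigl( X_{ij}, Y_{ij}; \theta \bigr)
```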
34. Properties of the Trace Norm
- The factorization U√S, V√S minimizes both quantities
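The "both quantities" refers to the two Frobenius terms in the variational characterization of the trace norm; with X = USVᵀ, the choice A = U√S, B = V√S attains the minimum:

```latex
\|X\|_{\Sigma} = \min_{X = A B^\top} \tfrac{1}{2}
  \left( \|A\|_F^2 + \|B\|_F^2 \right)
```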
35. Factorized Optimization
- Factorized objective (a tight bound)
- Gradient descent: O(n³) per round
- Stationary points exist, but no local minima
(Rennie & Srebro, ICML 2005)
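A minimal sketch of gradient descent on a factorized objective, using squared loss on a fully observed Y as a stand-in for the all-thresholds loss actually used in the thesis; k, lam and step are assumed values:

```python
import numpy as np

# J(A, B) = sum((Y - A B^T)^2) + (lam / 2) * (||A||_F^2 + ||B||_F^2)
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 15))
k, lam, step = 5, 0.1, 1e-3
A = rng.normal(scale=0.1, size=(20, k))
B = rng.normal(scale=0.1, size=(15, k))

for _ in range(500):
    R = A @ B.T - Y            # residual matrix
    A, B = (A - step * (2 * R @ B + lam * A),
            B - step * (2 * R.T @ A + lam * B))

print(np.linalg.norm(A @ B.T - Y))   # should shrink toward the rank-k floor
```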
36Collaborative Prediction Results
size, sparsity EachMovie 36656x1648, 96 EachMovie 36656x1648, 96 MovieLens 6040x3952, 96 MovieLens 6040x3952, 96
Algorithm Weak Error Strong Error Weak Error Strong Error
URP .8596 .8859 .6946 .7104
Attitude .8787 .8845 .6912 .7000
MMMF .8548 .8439 .6650 .6725
URP Attitude Marlin, 2004
MMMF Rennie Srebro, 2005
37. Extensions
- Multi-user features
- Observation model:
  - Predict which restaurants a user will rate, and
  - The rating she will give
- Multiple ratings per user/restaurant
  - E.g. food, service and décor ratings
- SVD parameterization
38. Multi-User Features
- Feature parameters (V):
  - Some are fixed
  - Some are learned
- Learn weights (U) for all features
- The fixed part of V does not affect regularization
39. Observation Model
- Common assumption: ratings are observed at random
- But restaurant selection depends on geography, popularity, price, food style
- Remove this bias by modeling the observation process
40. Observation Model
- Model observation as binary classification
- Add a binary classification loss
- Tie together the rating and observation models:
X = U_X V_Xᵀ,   W = U_W V_Wᵀ
41. Multiple Ratings
- Users may provide multiple ratings
  - Service, décor, food
- Add the corresponding loss functions together
- Stack parameter matrices for regularization
42. SVD Parameterization
- Too many parameters: (UA)(A⁻¹V) is another factorization of X
- Alternative: U, S, V with U, V orthogonal and S diagonal
- Advantages:
  - Not over-parameterized
  - Exact objective (not a bound)
  - No stationary points
43. Summary
- Loss function for ratings
- Regularization for multiple users
- Scaled MMMF to large problems (e.g. > 1000×1000)
- Trace norm is widely applicable
- Extensions
Code: http://people.csail.mit.edu/jrennie/matlab
44. Thanks!
- Helen, for supporting me for 7.5 years!
- Tommi Jaakkola, for answering all my questions and directing me to the end!
- Mike Collins and Tommy Poggio for additional guidance.
- Nati Srebro & John Barnett for endless valuable discussions and ideas.
- Amir Globerson, David Sontag, Luis Ortiz, Luis Perez-Breva, Alan Qi, Patrycja Missiuro & all past members of Tommi's reading group for paper discussions, conference trips and feedback on my talks.
- Many, many others who have helped me along the way!
45. Low-Rank Optimization