Title: Beyond Algorithms: A Human-Centered Evaluation of Recommender Systems
1. Beyond Algorithms: A Human-Centered Evaluation of Recommender Systems
- Kirsten Swearingen, Rashmi Sinha
- SIMS 213, UC Berkeley
- April 11, 2002
2. Overview
- Introduction to recommender systems
- Motivation for project
- Method and findings: User Study 1
- Method and findings: User Study 2
- Discussion and design recommendations
- Limitations of study
- Future work
3. In the news: A bet on humans vs. machines
- Ray Kurzweil maintains that a computer (i.e., a machine intelligence) will pass the Turing test by 2029. Mitchell Kapor believes this will not happen.
- In a 1950 paper, Alan Turing described his concept of the Turing Test, in which one or more human judges interview computers and human foils using terminals (so that the judges won't be prejudiced against the computers for lacking a human appearance).
- If the human judges are unable to reliably unmask the computers (as impostor humans), then the computer is considered to have demonstrated human-level intelligence.
4. Recommender systems are a technological proxy for a social process
Which one should I read?
Recommendations from Online Systems
Recommendations from friends
5. Basic interaction paradigm of recommender systems
Input (ratings of books): "I recently enjoyed Snow Crash, Seabiscuit, The Soloist, and Love in a Cold Climate. What should I read next?"
Output (recommendations): "Books you might enjoy are..."
6. Approaches: Back End
- Content-based recommendations
  - Rely on metadata describing items (a scoring sketch follows below)
  - "You like action-adventure movies and movies starring Meryl Streep."
- Collaborative filtering
  - Rely on correlations between individual ratings
  - "You like most of the same movies Joe and Carol like, so you might like these other movies they liked."
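Since this slide contrasts the two back-end approaches, here is a minimal sketch of content-based scoring in Python (the item names, metadata tags, and function are hypothetical illustrations, not any system described in this talk):

    # Content-based sketch: score candidates by how many metadata tags
    # they share with items the user already liked.
    def content_score(liked_items, candidate, metadata):
        liked_tags = set()
        for item in liked_items:
            liked_tags.update(metadata[item])
        return len(liked_tags & metadata[candidate])

    metadata = {
        "Movie A": {"action-adventure", "Meryl Streep"},
        "Movie B": {"action-adventure"},
        "Movie C": {"romance"},
    }
    liked = ["Movie A"]
    ranked = sorted(["Movie B", "Movie C"],
                    key=lambda c: content_score(liked, c, metadata),
                    reverse=True)
    print(ranked)  # "Movie B" first: it shares the action-adventure tag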
7. Collaborative Filtering Algorithms Depend Upon Correlations
- Meg and David: correlation .52
- Meg and Amy: correlation -.67
- Meg and Joe: correlation .23
Recommendations for Meg: Books 7 and 8 (a code sketch follows)
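A minimal sketch of this kind of user-user collaborative filtering (the toy ratings below are hypothetical, not the data behind the correlations above): compute Pearson correlations between users over co-rated items, then score unseen items by correlation-weighted neighbor ratings.

    # User-user collaborative filtering sketch (toy, hypothetical data).
    import math

    def pearson(a, b):
        """Pearson correlation over the items both users rated."""
        common = sorted(set(a) & set(b))
        if len(common) < 2:
            return 0.0
        xs = [a[i] for i in common]
        ys = [b[i] for i in common]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                        * sum((y - my) ** 2 for y in ys))
        return num / den if den else 0.0

    ratings = {
        "Meg":   {"Book 1": 5, "Book 2": 4, "Book 3": 1},
        "David": {"Book 1": 4, "Book 2": 5, "Book 3": 2, "Book 7": 5, "Book 8": 4},
        "Amy":   {"Book 1": 1, "Book 2": 2, "Book 3": 5, "Book 9": 5},
    }

    # Score items Meg hasn't rated by correlation-weighted neighbor ratings.
    scores = {}
    for user, theirs in ratings.items():
        if user == "Meg":
            continue
        w = pearson(ratings["Meg"], theirs)
        for item, r in theirs.items():
            if item not in ratings["Meg"]:
                scores[item] = scores.get(item, 0.0) + w * r

    print(sorted(scores, key=scores.get, reverse=True))  # Books 7 and 8 rank highest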
8. Approaches: Front End
- Implicit rating (by browsing, clicking, or purchasing)
- Explicit rating, differing in the:
  - Number of items to rate
  - Rating scale used
  - Number of items recommended
  - Amount of personal information required
  - Opportunity for feedback on recommended items
9. Amazon's Recommendation Process
Input: One artist/author name
Output: List of recommendations
Opportunity to explore / refine
10. Sleeper's Recommendation Process
Input: Ratings of 10 books (the same 10 for all users), on a continuous scale
Output: Displays 1 book at a time, with degree of confidence in the prediction
(System designed by Ken Goldberg, UC Berkeley)
11. Song Explorer's Recommendation Process
Input: 20 ratings
Output: List of songs/albums
12. I know what you'll read next summer (Amazon, Barnes & Noble)...
- what movies you should watch (Reel, RatingZone, Amazon)
- what music you should listen to (CDNow, Mubu, Gigabeat)
- what websites you should visit (Alexa)
- what jokes you will like (Jester)
- and who you should date (Yenta)
13. The recommendation process from the user's perspective
Diagram: User inputs preferences (time and effort to input; privacy concerns) -> receives recommendations -> reviews them (time and effort to review) and decides if he/she will sample a recommendation.
In the end, a user benefits only if the recommendations turn out to be good ones.
14. What Users Want
Diagram: RECOMMENDATIONS should be good, new to me, and engaging; the PROCESS should be fast and easy.
To succeed, collaborative filtering recommender
systems need a LOT of motivated regular users.
15. Issues with Recommender Systems
- Cold-start problem
- Latency problem
- Unusual users
- Privacy concerns
- Scalability
- Speed of transaction
- User interface
16. Motivation for Project
- General need: plenty of research on rec system back ends, little on interfaces
- Personal interest
  - Kirsten: designing Reading Tree
  - Rashmi: interested in community-oriented sites
17. Our Project: Beyond Algorithms Only -- An HCI Perspective on Recommender Systems
- Compare the social recommendation process to online recommender systems
- Understand the factors that go into an effective recommendation by studying user interaction with systems
18. Stages of Project
- Study 1
  - Began as a class project for SIMS 271: a user study of 6 book and movie systems
  - Focused on the humans vs. recommenders comparison
- Study 2
  - User study comparing 5 music recommender systems
  - Focused on identifying factors that contribute to system success
19. General Methodology
- Not an experiment, but designed like one. Conducted in a lab environment.
- Broad overview to start with, then zeroed in on some systems
- Meshing of quantitative and qualitative methods (one informing the other)
- Pre-test, pre-test, pre-test
- User motivation ascertained before the study
- Within-subjects design used wherever possible
- Multiple small studies, rather than 1 big study
20. General Methodology
- Comprehensive data collection: observation, behavior logging with time-stamps, questionnaires, post-test interviews.
- The Slim Logger: a simple, Excel-based tool for recording timed observations (a rough analogue is sketched below).
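The Slim Logger itself was Excel-based; purely as an illustrative analogue (the file name and fields are hypothetical), the same kind of time-stamped observation log can be kept with a few lines of Python:

    import csv
    from datetime import datetime

    LOG_FILE = "observations.csv"  # hypothetical log file

    def log_observation(participant, note):
        """Append one time-stamped observation to the session log."""
        with open(LOG_FILE, "a", newline="") as f:
            csv.writer(f).writerow([datetime.now().isoformat(), participant, note])

    log_observation("P07", "hesitated before rating item 3")
    log_observation("P07", "said the recommendations looked familiar")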
21. Study 1: The Human vs. Recommenders Death Match
22. 3 Book Systems
Amazon Books
Rating Zone
Sleeper
23. 3 Movie Systems
Amazon Movies
Movie Critic
Reel
24. 3 Friends Per Person
Participants were asked to choose friends who
knew their tastes in books or movies.
25. Method
- 19 participants, aged 18 to 34 years
- For each of 3 online systems:
  - Registered at the site
  - Rated items
  - Reviewed and evaluated the recommendation set
  - Completed a questionnaire
- Also reviewed and evaluated sets of recommendations from 3 friends each
26. Defining Success
- Good recommendations (precision): items the user felt interested in
- Useful recommendations: the subset of good recommendations that the user felt interested in and had not read/viewed yet (a computation sketch follows below)
Diagram: ALL GOOD RECOMMENDATIONS divided into USEFUL (new to user) and previously experienced.
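A minimal sketch of computing these two measures under the definitions above (the judgments below are hypothetical):

    # Each judgment: (item, user interested?, previously read/viewed?)
    judgments = [
        ("Book A", True,  True),   # good, but already experienced
        ("Book B", True,  False),  # good and new -> useful
        ("Book C", False, False),  # not interesting
        ("Book D", True,  False),  # good and new -> useful
    ]

    good   = [item for item, liked, seen in judgments if liked]
    useful = [item for item, liked, seen in judgments if liked and not seen]

    precision  = len(good) / len(judgments)    # 3/4 = 0.75
    usefulness = len(useful) / len(judgments)  # 2/4 = 0.50
    print(precision, usefulness)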
27. Comparing Human Recommenders to RS: Good and Useful Recommendations
Chart: Percentage of good vs. useful recommendations (0-100%), for books (Amazon (15), Sleeper (10), RatingZone (8), Friends (9)) and movies (Amazon (15), Reel (5-10), MovieCritic (20), Friends (9)); (x) = number of recommendations; RS average shown for comparison.
28. However, users like online RS.
This result was supported by post-test interviews.
29. Why systems over friends?
- "Suggested a number of things I hadn't heard of, interesting matches."
- "It was like going to Cody's, looking at that table up front for new and interesting books."
- "Systems can pull from a large database; no one person knows about all the movies I might like."
30. Recommender systems broaden horizons...
- ...while friends mostly recommend familiar items.
31. Which of the systems did users prefer?
Chart: Yes/No responses by system, for books and movies.
- Sleeper and Amazon Books averaged the highest ratings
- Split opinions on Reel and MovieCritic
32. Why did some systems provide useful recommendations but leave users unsatisfied?
- RatingZone
- MovieCritic
- Reel
33. Searching for Reasons
- Previously Liked Items and Adequate Item Description are correlated with Usefulness ratings.
- Time to Receive Recommendations and No. of Items to Rate: not important!
34. A Question of Trust
- Post-test interviews showed that users trusted systems if they had already sampled (and enjoyed) some recommendations
- Positive experiences lead to trust
- Negative experiences with recommended items lead to mistrust of the system
Diagram: ALL GOOD RECOMMENDATIONS divided into USEFUL (new to user) and TRUST-GENERATING (previously experienced).
35. A Question of Trust
Chart: Results by system, for books and movies.
The difference between Amazon and Sleeper highlights the fact that there are different kinds of "good" recommender systems.
36. Adequate Item Description: The RatingZone Story
0% of Version 1 users and 60% of Version 2 users found the item description adequate.
An adequate item description and links to other sources of information about the item were crucial factors in users being convinced by a recommendation.
37. System Transparency
- Do users think they understand why an item was recommended?
Users mentioned this factor in post-test interviews; in Study 2, we explored it in greater detail.
38. Study 2: Music Recommenders
Amazon, CDNow, MediaUnbound, MoodLogic, and
SongExplorer
39. Method
- 12 participants
- Very similar to the Study 1 method:
  - Registered at site
  - Rated items
  - Reviewed and evaluated recommendation set
  - Completed questionnaire
- Focused on music systems only (to eliminate domain differences)
- Participants listened to clips and evaluated recommended items (this was not possible with the book and movie systems)
40. Findings: Effect of Familiarity
- Familiar recommendations were liked more than unfamiliar ones for all five systems
41. Transparency Again
User perception that they understand why an item was recommended:
- Transparent recommendations were liked more than non-transparent ones for all five systems
42. Side note: once trust is established, transparency may become less important
- The serious-minded, 65-year-old father of one of my friends uses NetFlix (DVDs)
- Based on the items he had rented, he received this recommendation, and ordered the film!
- His comment: "They think I'll like it, and they have done pretty well in the past, so I'll take a chance."
Recommended film's description: "New Wave teens/20-somethings search for love on New Year's Eve 1981 in this episodic comedy."
43. 2 Models of Recommender System Success
- Recommendations from Amazon received the highest liking rating in Study 1 (for books and movies) and the second highest in Study 2 (music)
- Recommendations from MediaUnbound outperformed Amazon's in Study 2 (music)
- Both systems were well liked but differed dramatically in interaction style
44. Amazon's Bare-Bones Recommendation Process
45. MediaUnbound's long, extended (35 questions) recommendation process
Genre selection
46. Setting level of familiarity
Rating some songs
Feedback at every stage
47. Setting system expectations
More feedback about the user's tastes
48. Users Find MediaUnbound's Recommendations More Useful
Also, most users preferred MediaUnbound over
Amazon
But whose recommendations would they buy?
49. Users Express More Interest in Buying Amazon's Recommendations
50. Different System Strengths
- Amazon
  - Safe, conservative approach to recommendations
  - Recommendations are familiar; few new items
  - Users find the system logic transparent
  - Users don't feel like they learned anything new
- MediaUnbound
  - Verifies with the user how familiar they want recommendations to be
  - Long input process seems to generate trust
  - Recommendations are often new, but well liked
51. Discussion and Design Recommendations
52. Justify Your Recommendations
- Adequate item information: provide enough detail about the item for the user to make a choice
- System transparency: generate (at least some) recommendations which are clearly linked to the rated items
- Explanation: explain why the item was recommended (see the sketch after this list)
- Community ratings: provide links to ratings/reviews by other users; if possible, present a numerical summary of ratings
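As a minimal sketch of the transparency and explanation ideas above (the data layout and rating threshold are hypothetical, not any studied system's interface):

    # Attach a "because you rated..." justification to each recommendation.
    def explain(rec, user_ratings, related_items, threshold=4):
        """List the user's highly rated items linked to this recommendation."""
        reasons = [i for i in related_items.get(rec, [])
                   if user_ratings.get(i, 0) >= threshold]
        return f"{rec} (because you rated: {', '.join(reasons)})" if reasons else rec

    user_ratings = {"Snow Crash": 5, "Seabiscuit": 4}
    related_items = {"Cryptonomicon": ["Snow Crash"], "War Horse": ["Seabiscuit"]}
    for rec in ["Cryptonomicon", "War Horse"]:
        print(explain(rec, user_ratings, related_items))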
53. Accuracy vs. Less Input
- Don't sacrifice accuracy for the sake of generating quick recommendations. Users don't mind rating more items to receive quality recommendations.
- Multi-level recommendations: users can initially use the system by providing a single rating, and are offered subsequent opportunities to refine the recommendations (a sketch follows this list).
- Provide a happy medium between too little input (leading to low accuracy) and too much input (leading to user impatience).
- Unlike with search engines, users are not willing to try again and again.
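One way to read the multi-level idea, as a sketch (the recommend() placeholder and console interaction are hypothetical):

    # Multi-level flow: start from a single rating, refine on demand.
    def recommend(ratings, k=3):
        """Placeholder recommender: returns k dummy suggestions."""
        return [f"Suggestion {i} (based on {len(ratings)} ratings)"
                for i in range(1, k + 1)]

    ratings = {"Snow Crash": 5}  # level 1: one rating is enough to start
    print(recommend(ratings))
    while input("Refine? (y/n) ").strip().lower() == "y":
        item = input("Item to rate: ")
        ratings[item] = int(input("Rating 1-5: "))  # each refinement adds data
        print(recommend(ratings))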
54. Include New, Unexpected Items
- Users like recommender systems because they provide information about new, unexpected items.
- The list of recommended items should include new items which the user might not learn about in any other way.
- The list could also include some unexpected items (e.g., from other topics/genres) which users might not have thought of themselves.
55. Trust-Generating Items
- Users (especially first-time users) need to develop trust in the system.
- Trust in the system is enhanced by the presence of items that the user has already enjoyed.
- Including some very popular items (which have probably been experienced previously) in the initial recommendation set might be one way to achieve this.
56. The Right Mix of Items
- Transparent items: at least some items for which the user can see the clear link between the items he/she rated and the recommendation.
- Unexpected items: some unexpected items, whose purpose is to allow users to broaden their horizons.
- New items: some items which are new / just released.
- Trust-generating items: a few very popular ones, in which the system has high confidence.
Question: should these be presented as a sorted list, an unsorted list, or different categories of recommendations? (A sketch of assembling such a mix follows.)
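A minimal sketch of assembling such a mixed list (the category pools and quota values are hypothetical):

    # Draw a fixed quota of items from each category to build the final list.
    def mix_recommendations(pools, quotas):
        mixed = []
        for category, quota in quotas.items():
            mixed.extend(pools.get(category, [])[:quota])
        return mixed

    pools = {
        "transparent":      ["Item T1", "Item T2", "Item T3"],
        "unexpected":       ["Item U1", "Item U2"],
        "new":              ["Item N1"],
        "trust_generating": ["Item P1"],  # very popular, high confidence
    }
    quotas = {"transparent": 2, "unexpected": 1, "new": 1, "trust_generating": 1}
    print(mix_recommendations(pools, quotas))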
57. Verify the Degree of Familiarity the User Wants
This can help produce the right mix of items for each user.
58. What kind of system do you want to design?
- One to sell as many items as possible?
- Or one to help users explore and expand their tastes?
- The 2 goals are often contradictory, at least in the short term.
- It is important for the system designer to keep the goals in mind while designing the system.
59. Limitations of Study
- Simulated a first-time visit; did not allow the system to learn user preferences over time.
- Fairly homogeneous group of subjects; no novice users.
- Study 1: the source of recommendations was known to subjects, which might have biased results towards friends.
60. The Recommender Community Responds Favorably to Our Work
- DELOS-NSF 2001 Workshop on Personalization and Recommender Systems in Digital Libraries (Dublin)
- SIGIR 2001 Workshop on Recommender Systems (New Orleans)
61. Future Work
- Develop a model to describe interfaces for music discovery
- Build our own system and manipulate the interface to more fully test our hypotheses
- Administer a "Turing test" of music recommenders:
  - Compare systems, friends, and experts
  - Anonymize the source of recommendations
62. Thanks to...
- Rashmi Sinha
- Marti Hearst
- All the user study participants (you know who you
are)