Title: Turning Down the Noise in the Blogosphere
1Turning Down the Noise in the Blogosphere
- Khalid El-Arini, Gaurav Veda, Dafna Shahaf,
Carlos Guestrin
2
- Millions of blog posts are published every day
- Some stories become disproportionately popular
- Hard to find information you care about
3Our Goal: Coverage
- Turn down the noise in the blogosphere
- Select a small set of posts that covers the most
important stories
January 17, 2009
5Our Goal: Personalization
- Tailor post selection to user tastes
Posts selected without personalization
"But I like sports! I want articles like ..."
After personalization based on Zidane's feedback
6Main Contributions
- Formalize notion of covering the blogosphere
- Near-optimal solution for post selection
- Learn a personalized coverage function
- No-regret algorithm for learning user preferences
using limited feedback
- Evaluate on real blog data
- Conduct user studies and compare against
7Approach Overview
[Diagram: Blogosphere → Feature Extraction → Coverage Function → Post Selection]
8Document Features
- Low level
- Words, noun phrases, named entities
- e.g., Obama, China, peanut butter
- High level
- e.g., Topics from a topic model
- Topic = probability distribution over words
Inauguration Topic
National Security Topic
9Coverage
- $\mathrm{cover}_d(f)$ = amount by which document d covers feature f
- $\mathrm{cover}_A(f)$ = amount by which a set of posts A covers feature f
[Diagram: posts linked to the features they cover]
10Simple Coverage: MAX-COVER
- Find k posts that cover the most features
- $\mathrm{cover}_A(f) = 1$ if at least one post in A contains feature f
Problems with MAX-COVER
- Feature Significance in Document
- Feature Significance in Corpus
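The binary coverage used by MAX-COVER can be sketched in a few lines. This is a minimal illustration, not the authors' code: the toy posts, the feature names, and the helper names `binary_cover` and `covered_count` are all hypothetical.

```python
def binary_cover(selected_posts, feature):
    """MAX-COVER coverage: 1 if any selected post contains the feature, else 0."""
    return int(any(feature in post for post in selected_posts))


def covered_count(selected_posts, all_features):
    """Number of distinct features covered by the selected posts."""
    return sum(binary_cover(selected_posts, f) for f in all_features)


# Hypothetical toy posts, each represented as a set of features
posts = {
    "p1": {"Obama", "China"},
    "p2": {"Obama", "Washington"},
    "p3": {"peanut butter"},
}
all_features = set().union(*posts.values())

selection = [posts["p1"], posts["p3"]]
num_covered = covered_count(selection, all_features)
```

Under this definition a single passing mention counts as full coverage, and every feature counts equally, which is what the two problems listed above refer to.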
11Feature Significance in Document
- Solution: Define a probabilistic coverage function
- $\mathrm{cover}_d(f) = P(\text{feature } f \mid \text{post } d)$
- e.g., with topics as features: $P(\text{post } d \text{ is about topic } f)$
- Example: for a post that is not really about Washington, $\mathrm{cover}_d(\text{Washington}) = 0.01$
12Feature Significance in Corpus
- Some features are more important
- Want to cover the important features
- Solution:
- Associate a weight $w_f$ with each feature f
- e.g., frequency of feature in corpus
- Cover an important feature using multiple posts
[Figure: example features Barack Obama and Carlos Guestrin]
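As a rough sketch of the "frequency of feature in corpus" option, the snippet below weights each feature by the fraction of posts that mention it. The toy corpus and the function name `feature_weights` are assumptions made for illustration; the slides do not specify how the frequency is normalized.

```python
from collections import Counter


def feature_weights(corpus):
    """Weight each feature by the fraction of posts that contain it."""
    # set(post) ensures a feature is counted at most once per post
    counts = Counter(f for post in corpus for f in set(post))
    n = len(corpus)
    return {f: c / n for f, c in counts.items()}


# Hypothetical corpus: each post is a list of extracted features
corpus = [
    ["Barack Obama", "inauguration"],
    ["Barack Obama", "China"],
    ["Barack Obama", "Carlos Guestrin"],
    ["China"],
]
weights = feature_weights(corpus)
```

In this toy corpus a frequently mentioned feature such as "Barack Obama" receives a larger weight than a rarely mentioned one such as "Carlos Guestrin".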
13Incremental Coverage
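- $\mathrm{cover}_A(f)$ = probability that at least one post in set A covers feature f
- Post $d_1$: "Obama: Tight noose on Bin Laden as good as capture", $\mathrm{cover}_{d_1}(f) = 0.5$
- Post $d_2$: "What Obama's win means for China", $\mathrm{cover}_{d_2}(f) = 0.4$
- $\mathrm{cover}_{\{d_1, d_2\}}(f) = 1 - P(\text{neither } d_1 \text{ nor } d_2 \text{ covers } f) = 1 - (1 - 0.5)(1 - 0.4) = 0.7$
- $\mathrm{cover}_{d_1}(f) < 0.7 < \mathrm{cover}_{d_1}(f) + \mathrm{cover}_{d_2}(f)$
- Gain due to covering using multiple posts
- Diminishing returns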
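The set-level coverage can be checked numerically. The sketch below treats posts as covering a feature independently, so the set covers the feature unless every post fails to. The function name `cover_set` and the third post with coverage 0.4 are hypothetical additions for illustration.

```python
from math import prod


def cover_set(per_post_cover):
    """Probability that at least one post covers the feature.

    per_post_cover: per-post coverage values for a single feature,
    treated as independent probabilities.
    """
    return 1.0 - prod(1.0 - c for c in per_post_cover)


one_post = cover_set([0.5])
two_posts = cover_set([0.5, 0.4])          # the example from the slide
three_posts = cover_set([0.5, 0.4, 0.4])   # hypothetical third post

gain_second = two_posts - one_post     # gain from adding the second post
gain_third = three_posts - two_posts   # smaller gain from adding a third
```

Each additional post still adds coverage, but less than the one before it, which is the diminishing-returns behavior named above.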
14Post Selection Optimization
- Want to select a set of posts A that maximizes $F(A) = \sum_f w_f \, \mathrm{cover}_A(f)$
- This function is submodular
- Exact maximization is NP-hard
- Greedy algorithm leads to a $(1 - 1/e) \approx 63\%$ approximation, i.e., a near-optimal solution
- We use CELF (Leskovec et al., 2007)
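A lazy greedy selection in the spirit of CELF might look like the sketch below. It is not the authors' implementation: it assumes the objective is the weighted sum over features of the probability that at least one selected post covers the feature, and the toy posts, weights, and function names are hypothetical. Laziness relies on submodularity: a marginal gain computed in an earlier round is an upper bound on the current gain, so only the top candidate needs re-evaluation.

```python
import heapq
from math import prod


def objective(selected, posts, weights):
    """F(A) = sum_f w_f * (1 - prod_{d in A} (1 - cover_d(f)))."""
    total = 0.0
    for f, w in weights.items():
        miss = prod(1.0 - posts[d].get(f, 0.0) for d in selected)
        total += w * (1.0 - miss)
    return total


def lazy_greedy(posts, weights, k):
    """Pick up to k posts greedily, re-evaluating stale marginal gains lazily."""
    selected = []
    current = 0.0
    # Heap entries: (-gain, post id, size of the selection when gain was computed)
    heap = [(-objective([d], posts, weights), d, 0) for d in posts]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, d, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is up to date and beats every other upper bound: take it
            selected.append(d)
            current += -neg_gain
        else:
            # Stale gain: recompute against the current selection and re-queue
            gain = objective(selected + [d], posts, weights) - current
            heapq.heappush(heap, (-gain, d, len(selected)))
    return selected, current


# Hypothetical toy data: cover_d(f) per post, and corpus weights w_f
posts = {
    "d1": {"Obama": 0.5, "China": 0.1},
    "d2": {"Obama": 0.4, "China": 0.6},
    "d3": {"Obama": 0.5},
    "d4": {"sports": 0.9},
}
weights = {"Obama": 0.6, "China": 0.3, "sports": 0.1}

chosen, value = lazy_greedy(posts, weights, k=2)
```

On this toy input the second round re-evaluates only the candidates whose stale upper bounds could still win, rather than recomputing the gain of every remaining post.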
15Approach Overview
[Diagram: Blogosphere → Feature Extraction → Coverage Function → Post Selection (submodular function optimization)]
16Evaluating Coverage
- Evaluate on real blog data from Spinn3r
- 2-week period in January 2009
- 200K posts per day (after pre-processing)
- Two variants of our algorithm
- TDN+LDA: high-level features (Latent Dirichlet Allocation topics)
- TDN+NE: low-level features
- User study involving 27 subjects to evaluate topicality and redundancy
17Topicality User Study
[Figure: study interface with two panels, Reference Stories and Post for evaluation]
Post for evaluation:
Downed jet lifted from ice-laden Hudson River
NEW YORK (AP) - The airliner that was piloted to a safe emergency landing in the Hudson ...
Is this post topical? i.e., is it related to any of the major stories of the day?
18Results: Topicality
- TDN+NE: named entities and common noun phrases as features
- TDN+LDA: LDA topics as features
- We do as well as Google and Yahoo!
19Evaluation: Redundancy
- Israel unilaterally halts fire as rockets persist
- Downed jet lifted from ice-laden Hudson River
- Israeli-trained Gaza doctor loses three daughters and niece to IDF tank shell
- ...
Is this post redundant with respect to any of the previous posts?
20Results: Redundancy
- Google performs poorly
- We do as well as Yahoo!
21Results: Coverage
- Google: good topicality, high redundancy
- Yahoo!: performs well on both, but uses rich features
- CTR, search trends, user voting, etc.
[Chart: TDN+LDA and TDN+NE compared on Topicality and Redundancy]
- We do as well as Yahoo! using only text-based features
22Results: January 22, 2009
23Personalization
- People have varied interests
- Our Goal: Learn a personalized coverage function
using limited user feedback
Barack Obama
Britney Spears
24Approach Overview
[Diagram: Blogosphere → Feature Extraction → Coverage Function → Post Selection; Personalization: Personalized coverage fn. → Pers. Post Selection]
25Modeling User Preferences
- $\pi_f$ represents preference for feature f
- Want to learn preference $\pi$ over the features
[Figure: importance of feature in corpus, with example preferences $\pi_1, \dots, \pi_5$ for a politico and for a sports fan]
26Learning User Preferences
- Multiplicative Weights Update
[Figure: learned preferences before any feedback, after 1 day of personalization, and after 2 days of personalization]
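A generic multiplicative weights step is sketched below to make the idea concrete. The slides do not give the exact update rule, so everything here is an assumption: the exponential form of the update, the learning rate `eta`, the normalization to sum to 1, and the toy features. It should be read as an illustration of the technique, not as the TDN update itself.

```python
import math


def update_preferences(pi, post_cover, liked, eta=0.5):
    """One multiplicative-weights step on the preference vector pi.

    pi         : dict feature -> current preference weight
    post_cover : dict feature -> coverage of that feature by the rated post
    liked      : True if the user liked the post, False if they disliked it
    eta        : learning rate (hypothetical value)
    """
    sign = 1.0 if liked else -1.0
    # Scale each weight up or down in proportion to how much the post covers it
    updated = {
        f: w * math.exp(eta * sign * post_cover.get(f, 0.0))
        for f, w in pi.items()
    }
    # Renormalize so the preferences sum to 1 (an assumption of this sketch)
    total = sum(updated.values())
    return {f: w / total for f, w in updated.items()}


# Start with uniform preferences over two hypothetical features
pi = {"politics": 0.5, "sports": 0.5}
pi = update_preferences(pi, {"sports": 0.9}, liked=True)
pi = update_preferences(pi, {"politics": 0.8}, liked=False)
```

After one liked sports post and one disliked politics post, the weight on "sports" exceeds the weight on "politics", mirroring the shift shown across the days of personalization.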
27No-Regret Learning
Theorem: For TDN,
$\mathrm{avg}(\text{optimal fixed } \pi) - \mathrm{avg}(\pi \text{ learned using TDN}) \to 0$
i.e., we achieve no-regret
Given the user ratings in advance, compare with the optimal fixed $\pi$
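One common way to write a no-regret guarantee formally is sketched below. The symbols are assumptions introduced for illustration and do not come from the slides: $T$ is the number of days, $\pi^{(t)}$ is the learned preference vector in use on day $t$, $\pi^{*}$ is the best fixed preference vector chosen with all user ratings known in advance, and $r_t(\cdot)$ is a per-day measure of how well the selected posts match the user's ratings.

```latex
\[
\frac{1}{T}\sum_{t=1}^{T} r_t(\pi^{*})
\;-\;
\frac{1}{T}\sum_{t=1}^{T} r_t\bigl(\pi^{(t)}\bigr)
\;\longrightarrow\; 0
\quad \text{as } T \to \infty,
\qquad
\pi^{*} = \arg\max_{\pi} \sum_{t=1}^{T} r_t(\pi).
\]
```

Read this way, the average performance of the learned preferences approaches that of the best fixed preferences as more feedback arrives.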
28Approach Overview
[Diagram: Blogosphere → Feature Extraction → Personalized coverage fn. → Pers. Post Selection (submodular function optimization); User feedback → Personalization (online learning)]
29Simulating a Sports Fan
- Simulated user likes all posts from Fan House (a sports blog)
- Personalization Ratio = Personalized Objective / Unpersonalized Objective
[Plot: personalization ratio vs. days of sports personalization for Dead Spin (sports blog), Fan House (sports blog), and Huffington Post (politics blog), with the unpersonalized level marked]
30Personalizing for India
- Like all posts about India
- Dislike everything else
- After 5 epochs
- 1. India keeps up pressure on Pakistan over Mumbai
- After 10 epochs
- 1. Pakistan's shift alarms the U.S.
- 3. India among 20 most dangerous places in world
- After 15 epochs
- 1. 26/11 effect: Pak delegation gets cold vibes
- 3. Pakistan flaunts its all-weather ties with China
- 4. Benjamin Button gets 13 Oscar nominations (mentions Slumdog Millionaire)
- 8. Miliband was not off-message, he toed the UK line on Kashmir
31Personalization User Study
- Generate personalized posts
- Obtain user ratings
- Generate posts without using feedback
- Obtain user ratings
32Personalization Evaluation
[Chart: user ratings for personalized vs. unpersonalized posts; higher is better]
- Users like personalized posts more than unpersonalized posts
33Summary
- Formalized covering the blogosphere
- Near-optimal optimization algorithm
- Learned a personalized coverage function
- No-regret learning algorithm
- Evaluated on real blog data
- Coverage: using only post content, we perform as well as other techniques that use richer features
- Successfully tailor post selection to user preferences
www.TurnDownTheNoise.com