Turning Down the Noise in the Blogosphere - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Turning Down the Noise in the Blogosphere
  • Khalid El-Arini, Gaurav Veda, Dafna Shahaf,
    Carlos Guestrin

2
  • Millions of blog posts published every day
  • Some stories become disproportionately popular
  • Hard to find information you care about

3
Our Goal: Coverage
  • Turn down the noise in the blogosphere
  • Select a small set of posts that covers the most
    important stories

January 17, 2009
4
Our Goal: Coverage
  • Turn down the noise in the blogosphere
  • Select a small set of posts that covers the most
    important stories

5
Our Goal: Personalization
  • Tailor post selection to user tastes

Posts selected without personalization
"But, I like sports! I want articles like ..."
Posts selected after personalization based on Zidane's feedback
6
Main Contributions
  • Formalize notion of covering the blogosphere
  • Near-optimal solution for post selection
  • Learn a personalized coverage function
  • No-regret algorithm for learning user preferences
    using limited feedback
  • Evaluate on real blog data
  • Conduct user studies and compare against
    Google and Yahoo!

7
Approach Overview
Blogosphere
Coverage Function
Post Selection
Feature Extraction

8
Document Features
  • Low level
  • Words, noun phrases, named entities
  • e.g., Obama, China, peanut butter
  • High level
  • e.g., Topics from a topic model
  • A topic is a probability distribution over words

[Figure: word distributions for an Inauguration topic and a National Security topic]
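A topic in this sense is just a probability distribution over its vocabulary. A minimal sketch (the topic label and word probabilities below are invented for illustration):

```python
# Hypothetical sketch: a high-level feature is a topic, i.e., a probability
# distribution over words. These word probabilities are made up.
inauguration_topic = {"obama": 0.30, "inauguration": 0.25, "speech": 0.20,
                      "washington": 0.15, "crowd": 0.10}

# Probabilities over the topic's vocabulary sum to 1.
assert abs(sum(inauguration_topic.values()) - 1.0) < 1e-9

# The most probable words characterize the topic.
top = sorted(inauguration_topic, key=inauguration_topic.get, reverse=True)[:3]
print(top)  # ['obama', 'inauguration', 'speech']
```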
9
Coverage
  • coverd(f): amount by which document d covers feature f
  • coverA(f): amount by which the set of posts A covers feature f

[Figure: bipartite diagram linking posts to the features they cover]
10
Simple Coverage: MAX-COVER
  • Find k posts that cover the most features
  • coverA(f) = 1 if at least one post in A contains f

Problems with MAX-COVER
  • Feature significance in the document
  • Feature significance in the corpus

[Example snippet: "... at George Mason University in Fairfax, Va."]
11
Feature Significance in Document
  • Solution: define a probabilistic coverage function
  • coverd(f) = P(feature f | post d)
  • e.g., with topics as features, coverd(f) = P(post d is about topic f)

[Example: a post that mentions Washington but is not really about Washington:
coverd(Washington) = 0.01]
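The probabilistic coverage function can be sketched as a lookup into a post's feature distribution. The 0.01 value for Washington mirrors the slide's example; the other topic labels and probabilities are invented:

```python
# Sketch of probabilistic coverage: coverd(f) = P(feature f | post d).
# With LDA topics as features, this is the post's topic distribution.
# The topics and probabilities below are illustrative.

def cover_d(topic_dist, feature):
    """Amount by which document d covers feature f: P(f | d)."""
    return topic_dist.get(feature, 0.0)

# A post that mentions Washington in passing but is really about education.
post = {"education": 0.80, "politics": 0.19, "Washington": 0.01}

print(cover_d(post, "Washington"))  # 0.01: not really about Washington
print(cover_d(post, "education"))   # 0.8
```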
12
Feature Significance in Corpus
  • Some features are more important than others
  • Want to cover the important features
  • Solution
  • Associate a weight wf with each feature f
  • e.g., frequency of feature in corpus
  • Cover an important feature using multiple posts

[Figure: feature frequencies in the corpus, e.g., Barack Obama (frequent) vs. Carlos Guestrin (rare)]
13
Incremental Coverage
  • coverA(f): probability that at least one post in set A covers feature f
  • Example, with f = Obama:
  • d1: "Obama: Tight noose on Bin Laden as good as capture", coverd1(f) = 0.5
  • d2: "What Obama's win means for China", coverd2(f) = 0.4
  • coverA(f) = 1 - P(neither d1 nor d2 covers f)
  •           = 1 - (1 - 0.5)(1 - 0.4)
  •           = 0.7
  • coverd1(f) = 0.5 < 0.7 < coverd1(f) + coverd2(f) = 0.9
  • Gain due to covering f using multiple posts, with diminishing returns
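The incremental (noisy-OR) coverage above can be sketched directly, using the slide's numbers, coverd1(Obama) = 0.5 and coverd2(Obama) = 0.4:

```python
# Noisy-OR set coverage: the probability that at least one post in a set A
# covers feature f, given each post's individual coverage probability.

def cover_A(single_post_covers):
    """P(at least one post covers f) = 1 - prod(1 - coverd(f))."""
    p_none = 1.0
    for p in single_post_covers:
        p_none *= (1.0 - p)
    return 1.0 - p_none

both = cover_A([0.5, 0.4])
print(round(both, 3))  # 0.7 = 1 - (1 - 0.5) * (1 - 0.4)

# Diminishing returns: the second post adds only 0.2 beyond the first
# post's 0.5, less than its standalone coverage of 0.4.
print(round(both - cover_A([0.5]), 3))  # 0.2
```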
14
Post Selection Optimization
  • Want to select a set of posts A that maximizes the total weighted
    coverage F(A) = Σf wf coverA(f)
  • This function is submodular
  • Exact maximization is NP-hard
  • Greedy algorithm leads to a (1 - 1/e) ≈ 63% approximation, i.e., a
    near-optimal solution
  • We use CELF (Leskovec et al., 2007)
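The selection step can be sketched as plain greedy maximization of the weighted noisy-OR coverage objective. The posts, per-post coverage probabilities, and feature weights below are invented; CELF is a lazy-evaluation acceleration of this same loop and returns the same answer:

```python
# Greedy selection for the submodular objective
# F(A) = sum_f wf * coverA(f), with coverA(f) = 1 - prod_{d in A}(1 - coverd(f)).

def F(A, posts, weights):
    """Total weighted noisy-OR coverage of the post set A."""
    total = 0.0
    for f, w in weights.items():
        p_none = 1.0
        for d in A:
            p_none *= (1.0 - posts[d].get(f, 0.0))
        total += w * (1.0 - p_none)
    return total

def greedy_select(posts, weights, k):
    """Repeatedly add the post with the largest marginal gain."""
    A = []
    for _ in range(k):
        best = max((d for d in posts if d not in A),
                   key=lambda d: F(A + [d], posts, weights))
        A.append(best)
    return A

posts = {
    "obama_noose": {"Obama": 0.5, "BinLaden": 0.6},
    "obama_china": {"Obama": 0.4, "China": 0.7},
    "hudson_jet":  {"Hudson": 0.8},
}
weights = {"Obama": 3.0, "BinLaden": 1.0, "China": 1.0, "Hudson": 2.0}

# Diminishing returns on Obama make the Hudson post the better second pick.
print(greedy_select(posts, weights, 2))  # ['obama_noose', 'hudson_jet']
```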

15
Approach Overview
Blogosphere
Post Selection
Coverage Function
Feature Extraction
Submodular function optimization
16
Evaluating Coverage
  • Evaluate on real blog data from Spinn3r
  • 2-week period in January 2009
  • 200K posts per day (after pre-processing)
  • Two variants of our algorithm:
  • TDN-LDA: high-level features (Latent Dirichlet Allocation topics)
  • TDN-NE: low-level features (named entities)
  • User study involving 27 subjects to evaluate topicality and redundancy
17
Topicality User Study
[Post for evaluation: "Downed jet lifted from ice-laden Hudson River.
NEW YORK (AP) - The airliner that was piloted to a safe emergency landing
in the Hudson ..."]
Is this post topical? I.e., is it related to any of the major stories of
the day (the reference stories)?
18
Results: Topicality
[Figure: topicality scores with named entities and common noun phrases as
features, and with LDA topics as features]
We do as well as Google and Yahoo!
19
Evaluation: Redundancy
  • Israel unilaterally halts fire as rockets persist
  • Downed jet lifted from ice-laden Hudson River
  • Israeli-trained Gaza doctor loses three daughters
    and niece to IDF tank shell
  • ...

Is this post redundant with respect to any
of the previous posts?
20
Results: Redundancy
Google performs poorly; we do as well as Yahoo!
21
Results: Coverage
  • Google: good topicality, but high redundancy
  • Yahoo!: performs well on both, but uses rich features
  • CTR, search trends, user voting, etc.

[Figure: topicality vs. redundancy for TDN-LDA, TDN-NE, Google, and Yahoo!]
We do as well as Yahoo! using only text-based features
22
Results: January 22, 2009
23
Personalization
  • People have varied interests
  • Our Goal: Learn a personalized coverage function
    using limited user feedback

[Figure: different users have different interests, e.g., Barack Obama vs.
Britney Spears]
24
Approach Overview
Blogosphere
Post Selection
Coverage Function
Pers. Post Selection
Feature Extraction
Personalized coverage Fn.
Personalization
25
Modeling User Preferences
  • πf represents the user's preference for feature f
  • Want to learn the preference vector π over the features

[Figure: preference weights π1 ... π5 scaling the importance of each
feature in the corpus; π for a politico vs. π for a sports fan]
26
Learning User Preferences
  • Multiplicative Weights Update

[Figure: selected posts before any feedback, after 1 day of
personalization, and after 2 days of personalization]
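A minimal multiplicative-weights sketch (the features, learning rate beta, and the binary feedback rule are hypothetical; TDN's actual update is driven by user ratings of the selected posts):

```python
# Hypothetical multiplicative weights update for the preference vector pi
# over features. Preferences of features absent from liked posts shrink by
# a factor beta; the vector is then renormalized.

def mw_update(pi, feedback, beta=0.5):
    """feedback[f] = 1 if feature f appeared in liked posts (no loss),
    otherwise the feature incurs loss 1 and is down-weighted by beta."""
    new = {f: p * beta ** (1 - feedback.get(f, 0)) for f, p in pi.items()}
    z = sum(new.values())
    return {f: p / z for f, p in new.items()}

# Uniform prior over three made-up features.
pi = {"sports": 1/3, "politics": 1/3, "celebrity": 1/3}

# A sports fan's feedback: liked posts contained the sports feature.
for _ in range(2):
    pi = mw_update(pi, {"sports": 1})

print({f: round(p, 3) for f, p in pi.items()})
# {'sports': 0.667, 'politics': 0.167, 'celebrity': 0.167}
```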
27
No-Regret Learning
Theorem: For TDN, avg(objective under π learned using TDN) -
avg(objective under optimal fixed π) → 0
i.e., we achieve no-regret
The optimal fixed π is the best single π chosen given all the user
ratings in advance
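Written out in standard online-learning notation (a reconstruction; the symbols T, A_t, and F are mine: T rounds of feedback, A_t the posts selected at round t, and F the personalized coverage objective):

```latex
\frac{1}{T}\sum_{t=1}^{T} F_{\pi_t}(A_t)
\;-\;
\frac{1}{T}\sum_{t=1}^{T} F_{\pi^*}(A_t^*)
\;\longrightarrow\; 0
\quad \text{as } T \to \infty,
```

where $\pi_t$ is the preference vector learned by round $t$ and $\pi^*$ is the optimal fixed preference vector chosen with all user ratings known in advance.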
28
Approach Overview
Blogosphere
Submodular function optimization
Pers. Post Selection
Personalized coverage fn.
Feature Extraction
User feedback
Personalization
Online learning
29
Simulating a Sports Fan
  • Simulated user "likes" all posts from Fan House (a sports blog)
  • Personalization ratio = personalized objective / unpersonalized objective

[Figure: personalization ratio over days of sports personalization for
Fan House (sports blog), Dead Spin (sports blog), and Huffington Post
(politics blog), compared with the unpersonalized baseline]
30
Personalizing for India
  • Like all posts about India
  • Dislike everything else
  • After 5 epochs
  • 1. India keeps up pressure on Pakistan over
    Mumbai
  • After 10 epochs
  • 1. Pakistan's shift alarms the U.S.
  • 3. India among 20 most dangerous places in world
  • After 15 epochs
  • 1. 26/11 effect: Pak delegation gets cold vibes
  • 3. Pakistan flaunts its all-weather ties with
    China
  • 4. Benjamin Button gets 13 Oscar nominations
    (mentions Slumdog Millionaire)
  • 8. Miliband was not off-message, he toed the UK
    line on Kashmir

31
Personalization User Study
  • Generate personalized posts
  • Obtain user ratings
  • Generate posts without using feedback
  • Obtain user ratings


32
Personalization Evaluation
[Figure: average user ratings for personalized vs. unpersonalized posts;
higher is better]

Users like personalized posts more than unpersonalized posts
33
Summary
  • Formalized covering the blogosphere
  • Near-optimal optimization algorithm
  • Learned a personalized coverage function
  • No-regret learning algorithm
  • Evaluated on real blog data
  • Using only post content, we perform as well as
    other techniques that use richer features
  • Successfully tailor post selection to user
    preferences

www.TurnDownTheNoise.com