Blog Mining - PowerPoint PPT Presentation

Provided by: rong7

Title: Blog Mining


1
Blog Mining
  • Rong Jin

2
Blog Data Mining
  • Blogspace analysis
  • Blog opinion extraction and retrieval

3
Blogspace
  • Blog: web pages organized as chronological sequences
  • Analysis of blogspace
  • Temporal perspective: evolve over time
  • Spatial perspective: congregate according to
    interests and demographics

4
Blogger Demographic Distribution
[Figure: map of blogger distribution; labeled states: Michigan, California]
Original source: www.livejournal.com
5
Blogger Age Distribution
Original data source: www.livejournal.com
6
Blogger Clusters
  • Cluster bloggers into 300 distilled interest
    clusters

7
Blogger Connectivity
  • On average, each blogger names 14 other bloggers
    as friends.
  • 80% of friendships are mutual
  • Clustering coefficient
  • The chance that two of my friends are themselves
    friends
  • About 20%, which indicates tight clusters (why?)
  • E.g., consider 1 million bloggers: what is the
    probability for any two bloggers to be friends?
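
The question on this slide can be worked through numerically. The sketch below assumes friends are picked uniformly at random, which gives a baseline probability of roughly 14 in a million, far below a clustering coefficient of about 20%. The toy friendship graph, the blogger names, and the function names are made up for illustration, and friendships are treated as mutual.

```python
from itertools import combinations

def random_friend_probability(num_bloggers, avg_friends):
    """Chance that two given bloggers are friends if friends were chosen uniformly at random."""
    return avg_friends / (num_bloggers - 1)

def clustering_coefficient(friends):
    """Average, over bloggers with at least two friends, of the fraction
    of their friend pairs that are themselves friends."""
    per_blogger = []
    for blogger, nbrs in friends.items():
        if len(nbrs) < 2:
            continue  # no friend pairs to check
        pairs = list(combinations(sorted(nbrs), 2))
        linked = sum(1 for a, b in pairs if b in friends.get(a, set()))
        per_blogger.append(linked / len(pairs))
    return sum(per_blogger) / len(per_blogger) if per_blogger else 0.0

# Hypothetical toy graph (mutual friendships only).
toy = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}

baseline = random_friend_probability(1_000_000, 14)
print(f"random baseline: {baseline:.2e}")
print(f"toy clustering coefficient: {clustering_coefficient(toy):.3f}")
```

The gap between the random baseline and the observed coefficient is what suggests that friendships are not formed at random.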

8
Blogger Connectivity (contd)
[Figure: values 13, 45, 55; labels not recoverable]
  • Tight clusters due to commonalities

9
Evolution of Blogspace
  • Bursty activities
  • Interesting topic arises → many responses →
    becomes prominent → recedes
  • How can we quantitatively describe bursty
    activities?
  • First, how do we identify the topics and online
    communities?
  • Linkage patterns among blog entries
  • Community: a set of blogs linking together and
    discussing
  • Evolution of community topics

10
Evolution of Blogspace
  • How to identify the bursty topics?
  • Identify the bursty link patterns

Rapid increase.
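
One simple way to flag a rapid increase in linking activity is to compare each day's link count with a trailing average. The sketch below is only a threshold heuristic, not the burst-detection method used in the work behind these slides. The function name, the parameters `window`, `factor`, and `min_count`, and the sample counts are all assumptions.

```python
def find_bursts(daily_counts, window=7, factor=3.0, min_count=5):
    """Return indices of days whose count exceeds `factor` times the
    mean of the preceding `window` days (a simple heuristic)."""
    bursts = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        baseline = sum(history) / window
        count = daily_counts[day]
        # Require a minimum volume so tiny counts never register as bursts.
        if count >= min_count and count > factor * max(baseline, 1.0):
            bursts.append(day)
    return bursts

# Hypothetical number of new links per day within one community.
links_per_day = [2, 3, 2, 1, 3, 2, 2, 3, 15, 22, 9, 4, 3, 2]
print(find_bursts(links_per_day))  # prints [8, 9]
```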
11
Structure of Blogspace
  • Distribution of blogs over time
  • Link structure among permalink docs
  • Spam blogs (splogs)

12
Dataset: Blog06 Test Collection
  • Created by University of Glasgow
  • Blogs
  • XML feeds describing recent postings
  • 30% of feeds do not include full content
  • Do not include comments
  • HTML permalink documents
  • Monitor 100,649 blogs of varying quality from
    12/2005 to 02/2006
  • Top blogs (70,701): blogs of high quality
  • Spam blogs (17,969)
  • Gibberish, plagiarised content, and advertisements
  • Fake blogs that create a link farm to fool
    search engine ranking
  • Blogs of general interest
  • To introduce variety.

13
Sample Document
14
Collection Statistics
15
Collection Statistics (Contd)
Distribution of posts over dates (why does it
behave cyclically?)
16
Spam Blogs vs. Normal Blogs
Splogs have a much larger number of posts than
normal blogs
[Figure: distribution of posts over hours for (a) spam blogs and (b) normal blogs]
17
Spam Blogs vs. Normal Blogs (contd)
  • No obvious difference between spam and normal
    blogs in their usage of offensive words
  • Offensive word list supplied by a major British
    broadcaster
  • However, there is a clear difference in the usage
    of content words between spam blogs and normal
    blogs

18
Link Structure: Normal Blogs
  • Power law for inlinks and outlinks of permalink docs

19
Link Structure: Spam Blogs
20
Blog Opinion Retrieval
  • Blogs are unlike news articles
  • Opinionated in nature: many are written for
    self-expression
  • Opinion-oriented user information needs
  • Many blog queries are person names, both
    celebrities and unknown, and the underlying users'
    information needs seem to be of an opinion- or
    perspective-finding nature, rather than
    fact-finding
  • Different genres
  • Specific topic
  • Multiple topics
  • Personal life

21
Blog Opinion Retrieval
  • Started in TREC 2006
  • Locate posts that express an opinion about a
    given target.
  • What do people think about X?
  • What are the targets?
  • Named entities (a person, location, or
    organization)
  • Concepts (e.g., a type of technology, a product
    name, or an event)
  • Application
  • Uncover the public sentiment towards a given
    entity (the target)
  • Track consumer-generated content, brand
    monitoring, and, more generally, media analysis.

22
Blog Opinion Retrieval Example
  • Target: skype

An opinionated post
An unopinionated post
23
Opinion Retrieval Topics
  • 50 queries selected from a donated collection of
    queries sent to commercial blog search engines

24
Opinion Retrieval Approaches
  • Two-stage process
  • Retrieve relevant blogs
  • Classify opinionated blogs
  • Retrieve relevant blogs
  • off-the-shelf retrieval models (e.g., language
    models, vector space, tf.idf weighting)

25
Opinion Retrieval Approaches (contd)
  • Classify opinionated blogs
  • Dictionary-based approaches
  • Lists of terms and their semantic orientation
    values
  • Rank documents based on the frequency of
    opinion-bearing words from those lists
  • Text categorization approaches
  • Limited success, possibly because of the difference
    between training data and the actual opinionated
    content in blog posts.
  • Shallow linguistic approaches
  • Frequency of pronouns or adjectives as indicators
  • Limited success
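
The dictionary-based approach can be illustrated with a minimal sketch: score each retrieved post by how much of its text consists of terms from an opinion lexicon, then rank by that score. The lexicon entries, their orientation values, the scoring formula, and the sample posts below are all hypothetical. Real systems use much larger term lists.

```python
import re

# Hypothetical toy lexicon: term -> semantic orientation value (assumed values).
LEXICON = {"love": 1.0, "great": 0.8, "awful": -1.0, "hate": -0.9, "disappointing": -0.7}

def opinion_score(text, lexicon=LEXICON):
    """Sum of absolute orientation values of the tokens, normalized by post length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(abs(lexicon.get(tok, 0.0)) for tok in tokens) / len(tokens)

def rank_by_opinion(posts, lexicon=LEXICON):
    """Order posts from most to least opinionated according to the lexicon."""
    return sorted(posts, key=lambda p: opinion_score(p, lexicon), reverse=True)

# Hypothetical posts already retrieved as relevant to the target "skype".
posts = [
    "Skype released version 2.0 for Windows today.",
    "I love Skype, the call quality is great.",
    "Skype is awful and I hate the new interface.",
]
for post in rank_by_opinion(posts):
    print(f"{opinion_score(post):.3f}  {post}")
```

Using absolute values means the score measures how opinionated a post is, not whether the opinion is positive or negative.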

26
Opinion Retrieval Assessment
Relevance judgment
  • -1: Not judged
  • 0: Not relevant
  • 1: Relevant

Opinion judgment
  • 2: Negative opinion
  • 3: Mixed opinion
  • 4: Positive opinion
27
Opinion Retrieval Evaluation
  • Mean Average Precision (MAP): the most important
    measure
  • R-precision (R-prec)
  • Binary Preference (bPref)
  • Precision at 10 documents (P_at_10)
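
For reference, three of these measures can be computed from a ranked list and a set of relevant documents as in the sketch below. It uses the standard textbook definitions with binary relevance. bPref is omitted for brevity, and the function names and document IDs are made up.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return sum(1 for doc in ranked[:r] if doc in relevant) / r if r else 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical ranking for one topic; "d8" is relevant but never retrieved.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
print(f"AP     = {average_precision(ranked, relevant):.4f}")
print(f"R-prec = {r_precision(ranked, relevant):.4f}")
print(f"P@5    = {precision_at_k(ranked, relevant, k=5):.4f}")
```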

28
Opinion Retrieval Results
  • T: topic, D: description, N: narrative

29
Opinion Retrieval Relevance Results
30
Opinion-finding vs. Topic-relevance
  • High topic relevance → high accuracy in finding
    opinionated blog posts

31
How Do Splogs Affect Opinion Retrieval?
  • Spam is an important issue in the blogosphere

  • Spam is not a major issue in opinion retrieval

32
Polarity
  • Equal chance to retrieve positive and negative
    opinions

33
Analysis across Topics
  • High performance topics
  • Named entities, e.g., Heineken, netflix, Ann
    Coulter
  • Low performance topics
  • High-level concepts, e.g., cholesterol,
    Business Intelligence Resources

34
Embarrassing Performance
  • Performance by simple document retrieval
  • Best performance by simple document retrieval

35
Information Propagation with Blogspace
  • Characterize information propagation in two
    dimensions
  • Topics
  • Chatter: long-term, internally driven (i.e.,
    subtopics are determined by the authors)
  • Spikes: short-term, externally driven (i.e.,
    subtopics are decided by real-world events)
  • Individuals
  • Four categories of posting behavior
  • Based on the spread of infectious diseases

36
Modeling Topics
  • How to identify and track topics?
  • Topic detection and tracking (TDT)
  • Strategies
  • Recurring sequences of words as topics
  • Common phrases, e.g., "I don't think I will"
  • Entities defined in the TAP ontology
  • 3700 distinct ones, most of them appear only a
    few times
  • Proper nouns
  • 11K, half of them with fewer than 10 occurrences
  • Term frequency ratio (tfcidf)
  • 20K terms (tf(i) > 10, ratio > 3)

37
Examples of Selected Words
38
Topic Patterns
  • Just spike
  • inactive → very active → inactive
  • Spiky Chatter
  • very sensitive to external world events
  • Multiple spikes
  • Chatter
  • Discussion on a modest level

39
Spiky Chatter
  • Spikes arise at the subtopic level, driven by
    real-world events
  • Identify subtopics (x) given the target topic (t)
  • Support: co-occurrence of x and t
  • Conditional probability
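
A minimal sketch of the two statistics follows. It assumes each post is reduced to a set of terms, that support is the number of posts containing both x and t, and that the conditional probability is estimated from post counts. The slide does not say which direction of conditioning is used, so both are computed. The function name and the toy posts are hypothetical.

```python
def subtopic_stats(posts, topic, candidate):
    """posts: list of term sets. Returns co-occurrence support and both
    conditional probabilities estimated from post counts."""
    with_topic = sum(1 for p in posts if topic in p)
    with_candidate = sum(1 for p in posts if candidate in p)
    both = sum(1 for p in posts if topic in p and candidate in p)
    return {
        "support": both,
        "p_candidate_given_topic": both / with_topic if with_topic else 0.0,
        "p_topic_given_candidate": both / with_candidate if with_candidate else 0.0,
    }

# Hypothetical posts, each reduced to its set of terms.
posts = [
    {"microsoft", "xbox", "launch"},
    {"microsoft", "xbox"},
    {"microsoft", "windows"},
    {"microsoft", "office"},
    {"xbox", "game"},
    {"apple", "ipod"},
]
stats = subtopic_stats(posts, topic="microsoft", candidate="xbox")
print(stats)
```

A candidate with high support and a high probability of appearing together with t is a plausible subtopic. The thresholds to apply are left open here.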

40
Spiky Chatter (contd)
  • Confirm that the spikes are caused by the
    subtopics

41
Modeling Individuals
  • Uncover the path of topics through the
    individuals who make up blogspace

The number of posts by individual bloggers
follows Zipf's law
42
Life Cycle of Posts
43
Associate Users with Post Life Cycle
  • Small numbers of users involved in the regions of
    RampUp and RampDown
  • Many more users involved in the regions of
    Mid-High and Cycle

44
Propagation Model
  • How is blog a affected by a topic raised in
    blog b?
  • Independent Cascade model (random walk)
  • A directed graph
  • Each node is a blogger
  • Edge (u, w) is associated with a copy probability
  • When u writes an article at time t, each node w
    that has an arc from u to w writes an article
    about the topic at time t + 1 with the copy
    probability of edge (u, w)
  • Probability that u reads w's blog

45
Example of Network
46
Propagation Model Procedure
  • Start: u writes about a certain topic on a given day
  • First, v reads the topic from node u with
    probability r_{u,v}, after a delay that follows an
    exponential distribution
  • Then, with the copy probability of the edge, v
    will choose to write about it.
  • If v reads the topic and chooses not to copy it,
    then v will never copy that topic from u
  • A single opportunity for a topic to propagate
    along any given edge.
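
The procedure above can be simulated directly. The sketch below is a simplified version under stated assumptions: each edge carries a read probability and a copy probability, delays are exponential with an assumed rate `delay_rate`, and each blogger writes about the topic at most once, so every edge gets exactly one chance to propagate it. The graph, the probabilities, and the function name are hypothetical.

```python
import heapq
import random

def simulate_cascade(edges, source, rng, delay_rate=1.0):
    """edges: {u: [(v, read_prob, copy_prob), ...]}.
    Returns {blogger: time at which they first wrote about the topic}."""
    wrote_at = {}
    pending = [(0.0, source)]  # (time, blogger) write events, earliest first
    while pending:
        t, u = heapq.heappop(pending)
        if u in wrote_at:
            continue  # u already wrote about the topic at an earlier time
        wrote_at[u] = t
        # u writes only once, so each outgoing edge is tried exactly once.
        for v, read_prob, copy_prob in edges.get(u, []):
            if rng.random() >= read_prob:
                continue  # v never reads the topic from u
            delay = rng.expovariate(delay_rate)  # exponential reading delay
            if rng.random() < copy_prob:
                heapq.heappush(pending, (t + delay, v))
    return wrote_at

# Hypothetical network: high read probabilities, low copy probabilities.
network = {
    "u": [("v", 0.9, 0.1), ("w", 0.8, 0.05)],
    "v": [("w", 0.7, 0.1), ("x", 0.9, 0.2)],
    "w": [("x", 0.6, 0.1)],
}
rng = random.Random(0)
runs = [simulate_cascade(network, "u", rng) for _ in range(10_000)]
print("average number of bloggers writing:", sum(len(r) for r in runs) / len(runs))
```

Repeating the simulation many times, as in the last lines, gives a Monte Carlo estimate of how far a topic tends to spread from a given starting blogger.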

47
Copy Prob. vs. Read Prob.
[Figure panels: read probability, copy probability]
  • Very low copying probability