Blog Mining - PowerPoint PPT Presentation

About This Presentation

Title:

Blog Mining

Slides: 48

Provided by: rong7

Transcript and Presenter's Notes

Title: Blog Mining

1
Blog Mining

Rong Jin

2
Blog Data Mining

Blogspace analysis
Blog opinion extraction and retrieval

3
Blogspace

Blog: web pages in chronological sequence
Analysis of blogspace
Temporal perspective: evolves over time
Spatial perspective: bloggers congregate according to
interests and demographics

4
Bloggers Demographical Distribution
Michigan
California
Original source: www.livejournal.com
5
Blogger Age Distribution
Original data source: www.livejournal.com
6
Blogger Clusters

Cluster bloggers into 300 distilled interest
clusters

7
Blogger Connectivity

On average, each blogger names 14 other bloggers
as friends.
80% of friendships are mutual
Clustering coefficient
The chance that two of my friends are themselves
friends
About 20%, which indicates tight clusters (why?)
E.g., consider 1 million bloggers: what is the
probability for any two bloggers to be friends?
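
A rough worked example for the question above. It assumes, purely as a baseline and not as a claim from the slides, that friends are picked uniformly at random, so the chance that one given blogger names another is about 14 / (N - 1). The function names and the toy friendship graph are illustrative.

```python
from itertools import combinations


def random_friend_probability(num_bloggers, avg_friends):
    """Chance that a given blogger names another given blogger as a friend,
    if friends were picked uniformly at random (baseline assumption)."""
    return avg_friends / (num_bloggers - 1)


def clustering_coefficient(friends, node):
    """Fraction of pairs of `node`'s friends that are themselves friends.

    `friends` maps each blogger to a set of (undirected) friends.
    """
    pairs = list(combinations(sorted(friends[node]), 2))
    if not pairs:
        return 0.0
    linked = sum(1 for a, b in pairs if b in friends[a])
    return linked / len(pairs)


# Random baseline for 1 million bloggers with 14 friends each.
p = random_friend_probability(1_000_000, 14)

# Toy graph (hypothetical data): "a" and "b" are friends, "c" is isolated.
toy = {
    "me": {"a", "b", "c"},
    "a": {"me", "b"},
    "b": {"me", "a"},
    "c": {"me"},
}
cc = clustering_coefficient(toy, "me")
print(f"random baseline: {p:.2e}, toy clustering coefficient: {cc:.2f}")
```

Under this baseline the probability is on the order of 1e-5, so a clustering coefficient near 20% is far higher than chance, which is what the "tight clusters" remark points at.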

8
Blogger Connectivity (contd)
[Figure values: 13, 45, 55]

Tight clusters due to commonalities

9
Evolution of Blogspace

Bursty activities
Interesting topic arises → many responses →
becomes prominent → recedes
How can we quantitatively describe bursty
activities?
First, how do we identify the topics and online
communities?
Linkage patterns among blog entries
Community: a set of blogs linking together and
discussing
Evolution of community topics

10
Evolution of Blogspace

How to identify the bursty topics?
Identify the bursty link patterns

Rapid increase.
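
The slides do not spell out how a "rapid increase" is detected. As a minimal sketch, the simple threshold heuristic below (swapped in for illustration, not the method behind the slide) flags a day as bursty when its link count exceeds a multiple of the trailing average. The `window` and `factor` parameters and the sample counts are assumptions.

```python
def find_bursts(daily_counts, window=3, factor=2.0):
    """Return indices of days whose count exceeds `factor` times the
    mean of the previous `window` days."""
    bursts = []
    for i in range(window, len(daily_counts)):
        baseline = sum(daily_counts[i - window:i]) / window
        if baseline > 0 and daily_counts[i] > factor * baseline:
            bursts.append(i)
    return bursts


# Hypothetical daily link counts for one community.
counts = [4, 5, 4, 5, 20, 30, 12, 6, 5]
bursts = find_bursts(counts)
print(bursts)
```

Days 4 and 5 are flagged here; the later days are not, because the trailing average has already absorbed the spike.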
11
Structure of Blogspace

Distribution of blogs over time
Link structure among permalink docs
Spam blogs (splogs)

12
Dataset: Blog06 Test Collection

Created by the University of Glasgow
Blogs
XML feeds describing recent postings
30% of feeds do not include full content
Comments are not included
HTML permalink documents
Monitored 100,649 blogs of varying quality from
12/2005 to 02/2006
Top blogs (70,701): blogs of high quality
Spam blogs (17,969)
Gibberish, plagiarised content, and advertisements
Fake blogs that create a link farm to fool
search engine ranking
Blogs of general interest
To introduce variety.

13
Sample Document
14
Collection Statistics
15
Collection Statistics (Contd)
Distribution of posts over dates (why does it
behave cyclically?)
16
Spam Blogs vs. Normal Blogs
Splogs have a much larger number of posts than
normal blogs
[Figure: distribution of posts over hours for
(a) spam blogs and (b) normal blogs]
17
Spam Blogs vs. Normal Blogs (contd)

No obvious difference between spam and normal
blogs in their usage of offensive words
(offensive word list supplied by a major British
broadcaster)
However, there is a clear difference in the usage
of content words between spam blogs and normal
blogs

18
Link Structure: Normal Blogs

Power law for the inlinks and outlinks of permalink docs

19
Link Structure: Spam Blogs
20
Blog Opinion Retrieval

Blogs are unlike news articles
Opinionated: mainly for self-expression
Opinion-oriented user information needs
Many blog queries are person names, both
celebrities and unknowns, and the underlying user
information needs seem to be of an opinion- or
perspective-finding nature, rather than
fact-finding
Different genres
Specific topic
Multiple topics
Personal life

21
Blog Opinion Retrieval

Started in TREC 2006
Locate posts that express an opinion about a
given target.
What do people think about X?
What are the targets?
Named entities (a person, location, or
organization)
Concepts (e.g., a type of technology, a product
name, or an event)
Application
Uncover the public sentiment towards a given
entity (the target)
Track consumer-generated content, brand
monitoring, and, more generally, media analysis.

22
Blog Opinion Retrieval Example

Target: skype

An opinionated post
An unopinionated post
23
Opinion Retrieval Topics

50 queries selected from a donated collection of
queries sent to commercial blog search engines

24
Opinion Retrieval Approaches

Two-stage process
Retrieve relevant blogs
Classify opinionated blogs
Retrieve relevant blogs
off-the-shelf retrieval models (e.g., language
models, vector space, tf.idf weighting)
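
A minimal sketch of the first stage (retrieve relevant blogs), assuming a plain tf-idf sum over query terms. It is one of the off-the-shelf options the slide lists, not the specific system any TREC participant used. The `tfidf_rank` helper and the three sample posts are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(query, docs):
    """Rank documents for a query by the summed tf-idf weight of the
    query terms; documents with a zero score are dropped."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                score += tf[term] * math.log(n_docs / df[term])
        scores.append((score, idx))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in scores if score > 0]


docs = [
    "skype call quality is great",
    "my cat sat on the keyboard",
    "skype skype crashed again today",
]
ranking = tfidf_rank("skype", docs)
print(ranking)
```

The third post ranks first because it mentions the target twice; the second post is dropped because it never mentions it.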

25
Opinion Retrieval Approaches (contd)

Classify opinionated blogs
Dictionary-based approaches
Lists of terms and their semantic orientation
values
Rank documents based on the frequency of semantic
words
Text categorization approaches
Limited success, possibly because of the difference
between the training data and the actual opinionated
content in blog posts.
Shallow linguistic approaches
Frequency of pronouns or adjectives as indicators
Limited success
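
A minimal sketch of the dictionary-based approach for the second stage, assuming the score is simply the fraction of tokens found in an opinion lexicon. The four-word `LEXICON`, its orientation values, and the sample posts are hypothetical; real systems use much larger term lists.

```python
# Hypothetical toy lexicon: term -> semantic orientation value.
LEXICON = {"great": 1.0, "love": 1.0, "terrible": -1.0, "hate": -1.0}


def opinion_score(text, lexicon=LEXICON):
    """Fraction of tokens that are opinion-bearing lexicon terms."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in lexicon)
    return hits / len(tokens)


def rank_by_opinion(posts):
    """Order post indices from most to least opinionated."""
    return sorted(range(len(posts)), key=lambda i: -opinion_score(posts[i]))


posts = [
    "skype released version two today",
    "i love skype but hate the new layout",
]
order = rank_by_opinion(posts)
print(order)
```

The score counts positive and negative terms alike, since the task is to find opinionated posts regardless of polarity.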

26
Opinion Retrieval Assessment

Relevance judgment
-1 Not judged
0 Not relevant
1 Relevant

Opinion judgment
2 Negative opinion
3 Mixed opinion
4 Positive opinion
27
Opinion Retrieval Evaluation

Mean Average Precision (MAP)
The most important measure
R-precision (R-prec)
Binary Preference (bPref)
Precision at 10 documents (P_at_10)
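
A minimal sketch of three of the measures listed above, using their standard textbook definitions (bPref is left out to keep the example short). MAP is the mean of `average_precision` over all topics. The document IDs and the relevant set are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """`runs` is a list of (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}  # d6 is relevant but never retrieved
ap = average_precision(ranked, relevant)
rp = r_precision(ranked, relevant)
p10 = precision_at_k(ranked, relevant)
print(round(ap, 3), round(rp, 3), p10)
```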

28
Opinion Retrieval Results

T: topic, D: description, N: narrative

29
Opinion Retrieval Relevance Results
30
Opinion-finding vs. Topic-relevance

High topic relevance → high accuracy in finding
opinionated blog posts

31
How Do Splogs Affect Opinion Retrieval?

Spam is an important issue in the blogosphere

Spam is not a major issue in opinion retrieval

32
Polarity

Equal chance to retrieve positive and negative
opinions

33
Analysis across Topics

High performance topics
Named entities, e.g., Heineken, netflix, Ann
Coulter
Low performance topics
high-level concepts, e.g., cholesterol,
Business Intelligence Resources

34
Embarrassing Performance

Performance by simple document retrieval

Best performance by simple document retrieval

35
Information Propagation with Blogspace

Characterize information propagation in two
dimensions
Topics
Chatter: long-term, internally driven (i.e.,
subtopics are determined by the authors)
Spikes: short-term, externally driven (i.e.,
subtopics are decided by real-world events)
Individuals
Four categories of posting behavior
Based on the spread of infectious diseases

36
Modeling Topics

How to identify and track topics?
Topic detection and tracking (TDT)
Strategies
Recurring sequences of words as topics
Common phrases, e.g., "I don't think I will"
Entities defined in the TAP ontology
3,700 distinct ones, most of which appear only a
few times
Proper nouns
11K, half of them occurring fewer than 10 times
Term frequency ratio (tfcidf)
20K terms (tf(i) > 10, ratio > 3)
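
The slide does not give the tfcidf formula. The sketch below assumes it is the ratio of a term's frequency on day i to its average frequency over all earlier days, and that the two numbers on the slide (10 and 3) are lower bounds on tf(i) and on the ratio. Both are assumptions, and the sample series are hypothetical.

```python
def tfcidf(daily_tf, day):
    """Ratio of a term's frequency on `day` to its mean frequency over
    all earlier days (assumed definition, not taken from the slides)."""
    history = daily_tf[:day]
    if not history or sum(history) == 0:
        return 0.0
    return daily_tf[day] / (sum(history) / len(history))


def is_topic_term(daily_tf, min_tf=10, min_ratio=3.0):
    """True if on some day the term is both frequent and well above
    its own historical average."""
    return any(
        daily_tf[d] > min_tf and tfcidf(daily_tf, d) > min_ratio
        for d in range(1, len(daily_tf))
    )


steady = [12, 11, 13, 12, 12]  # frequent but flat: not selected
spiking = [2, 3, 2, 40, 5]     # jumps on day 3: selected
print(is_topic_term(steady), is_topic_term(spiking))
```

Comparing a term against its own history, rather than against other terms, is what lets rare-but-suddenly-popular words through while filtering out common ones.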

37
Examples of Selected Words
38
Topic Patterns

Just spike
inactive → very active → inactive
Spiky Chatter
very sensitive to external world events
Multiple spikes
Chatter
Discussion on a modest level

39
Spiky Chatter

Level of subtopics arises due to the real-world
event
Identify subtopics (x) given the target topic (t)
Support: co-occurrence
Conditional probability
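
A minimal sketch of the two statistics for a candidate subtopic x and target topic t. It assumes support is the share of posts about t that also mention x, and that the conditional probability is P(t | x). The slide does not state either definition, so both readings, the helper name, and the toy posts are assumptions.

```python
def subtopic_stats(posts, topic, candidate):
    """Return (support, P(topic | candidate)) over posts given as term sets."""
    with_topic = [p for p in posts if topic in p]
    with_candidate = [p for p in posts if candidate in p]
    both = [p for p in with_topic if candidate in p]
    support = len(both) / len(with_topic) if with_topic else 0.0
    cond_prob = len(both) / len(with_candidate) if with_candidate else 0.0
    return support, cond_prob


# Hypothetical posts, each reduced to its set of terms.
posts = [
    {"microsoft", "longhorn"},
    {"microsoft", "xbox"},
    {"microsoft", "longhorn", "beta"},
    {"xbox", "halo"},
]
support, cond = subtopic_stats(posts, "microsoft", "longhorn")
print(round(support, 3), cond)
```

A good subtopic scores high on both: it shows up in a sizeable share of the topic's posts and rarely appears without the topic.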

40
Spiky Chatter (contd)

Confirm that the spikes are caused by the
subtopics

41
Modeling Individuals

Uncover the path of topics through the
individuals who make up blogspace

The number of posts by individual bloggers
follows Zipf's law
42
Life Cycle of Posts
43
Associate Users with Post Life Cycle

Small numbers of users involved in the regions of
RampUp and RampDown
Many more users involved in the regions of
Mid-High and Cycle

44
Propagation Model

How is blog a affected by the topic raised in
blog b?
Independent Cascade model (random walk)
A directed graph
Each node is a blogger
Edge (u, w) is associated with a copy probability
When u writes an article at time t, each node w
that has an arc from u to w writes an article
about the topic at time t + 1 with that copy
probability
Probability that u reads w's blog

45
Example of Network
46
Propagation Model Procedure

Start: u writes about a certain topic on a given day
First, v reads the topic from node u with
probability r_{u,v}, after a delay that follows an
exponential distribution
Then, with the copy probability of edge (u, v), v
chooses to write about it.
If v reads the topic and chooses not to copy it,
then v will never copy that topic from u
A single opportunity for a topic to propagate
along any given edge.
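
The procedure above can be sketched as a small event-driven simulation. This is an illustration under stated assumptions, not the authors' implementation: the edge tuple layout, the `mean_delay` parameter, the `horizon` cutoff, and the toy network are all hypothetical.

```python
import heapq
import random


def simulate_cascade(edges, seed_node, rng, horizon=30.0):
    """Simulate one topic spreading through a blogger graph.

    edges: dict u -> list of (v, read_prob, copy_prob, mean_delay).
    Returns a dict blogger -> time at which they wrote about the topic.
    """
    wrote_at = {}
    events = [(0.0, seed_node)]
    while events:
        t, u = heapq.heappop(events)
        if u in wrote_at or t > horizon:
            continue  # already wrote about the topic, or too late
        wrote_at[u] = t
        # u is expanded only once, so each edge gets a single opportunity.
        for v, read_prob, copy_prob, mean_delay in edges.get(u, []):
            if rng.random() >= read_prob:
                continue  # v never reads u's post
            if rng.random() >= copy_prob:
                continue  # v reads it but chooses not to copy
            delay = rng.expovariate(1.0 / mean_delay)  # exponential delay
            heapq.heappush(events, (t + delay, v))
    return wrote_at


# Toy network: v always reads and copies, w never reads, x never copies.
edges = {
    "u": [("v", 1.0, 1.0, 1.0), ("w", 0.0, 1.0, 1.0)],
    "v": [("x", 1.0, 0.0, 1.0)],
}
result = simulate_cascade(edges, "u", random.Random(0))
print(sorted(result))
```

Processing events in time order means that if a blogger could pick up the topic from several neighbors, the earliest arrival wins and later ones are ignored.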

47
Copy Prob. vs. Read Prob.
[Figure: read probability and copy probability]