Title: Blog Mining
1Blog Mining
2Blog Data Mining
- Blogspace analysis
- Blog opinion extraction and retrieval
3Blogspace
- Blog: web pages in chronological sequence
- Analysis of blogspace
- Temporal perspective: evolves over time
- Spatial perspective: bloggers congregate according to interests and demographics
4Bloggers Demographical Distribution
Michigan
California
Original source www.livejournal.com
5Blogger Age Distribution
Original data source www.livejournal.com
6Blogger Clusters
- Cluster bloggers into 300 distilled interest clusters
7Blogger Connectivity
- On average, each blogger names 14 other bloggers as friends
- 80% of friendships are mutual
- Clustering coefficient
- The chance that two of my friends are themselves friends
- About 20% -> tight clusters (why?)
- E.g., consider 1 million bloggers: what is the probability for any two bloggers to be friends?
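The two quantities on this slide can be illustrated with a small sketch. The toy graph and the function name `local_clustering` are hypothetical, and the random baseline assumes each blogger picks 14 friends uniformly at random, which is a simplification rather than anything measured in the source data.

```python
from itertools import combinations

def local_clustering(friends, node):
    """Fraction of pairs of node's friends that are themselves friends."""
    pairs = list(combinations(sorted(friends[node]), 2))
    if not pairs:
        return 0.0
    linked = sum(1 for a, b in pairs if b in friends[a])
    return linked / len(pairs)

# Hypothetical toy friendship graph (undirected, stored as adjacency sets).
toy_friends = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}
ann_coefficient = local_clustering(toy_friends, "ann")

# Baseline (assumption): with n bloggers each naming 14 friends uniformly at
# random, two given bloggers are friends with probability about 14 / (n - 1).
n_bloggers = 1_000_000
avg_friends = 14
random_friend_probability = avg_friends / (n_bloggers - 1)
```

Under this baseline the chance of a random friendship is on the order of 0.001%, so a clustering coefficient of about 20% is far higher than chance would give.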
8Blogger Connectivity (contd)
13
45
55
- Tight clusters due to commonalities
9Evolution of Blogspace
- Bursty activities
- Interesting topic arises -> many responses -> becomes prominent -> recedes
- How can we quantitatively describe bursty activities?
- First, how do we identify the topics and online communities?
- Linkage patterns among blog entries
- Community: set of blogs linking together and discussing
- Evolution of community topics
10Evolution of Blogspace
- How to identify the bursty topics?
- Identify the bursty link patterns
Rapid increase.
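One simple way to flag a rapid increase in linking is to compare each day's link count with a trailing average. This is a simplified stand-in, not necessarily the burst model used in the original study; `find_bursts`, the 7-day window, and the factor of 3 are all illustrative assumptions.

```python
def find_bursts(daily_counts, window=7, factor=3.0):
    """Return indices of days whose count exceeds factor x the trailing mean."""
    bursts = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        baseline = sum(history) / window
        # Floor the baseline at 1 so near-silent periods do not trigger bursts.
        if daily_counts[day] > factor * max(baseline, 1.0):
            bursts.append(day)
    return bursts

# Hypothetical daily counts of new links inside one community.
link_counts = [2, 3, 2, 4, 3, 2, 3, 30, 28, 5, 3]
burst_days = find_bursts(link_counts)
```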
11Structure of Blogspace
- Distribution of blogs over time
- Link structure among permalink docs
- Spam blogs (splog)
12Dataset Blog06 Test Collection
- Created by University of Glasgow
- Blogs
- XML feeds describing recent postings
- 30% of feeds do not include full content
- Comments are not included
- HTML permalink documents
- Monitored 100,649 blogs of varying quality from 12/2005 to 02/2006
- Top blogs (70,701): blogs of high quality
- Spam blogs (17,969)
- Gibberish, plagiarised content, and advertisements
- Fake blogs that create a link farm to fool search engine ranking
- Blogs of general interest
- To introduce variety
13Sample Document
14Collection Statistics
15Collection Statistics (Contd)
Distribution of posts over date (why does it behave cyclically?)
16Spam Blogs vs. Normal Blogs
Splogs have a much larger number of posts than normal blogs
Figure: Distribution of posts over hours for (a) spam blogs and (b) normal blogs
17Spam Blogs vs. Normal Blogs (contd)
- No obvious difference between spam and normal blogs in their usage of offensive words
- Offensive word list supplied by a major British broadcaster
- However, there is a clear difference in the usage of content words between spam blogs and normal blogs
18Link Structure Normal Blogs
- Power law for inlinks and outlinks of permalink docs
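A power law means the number of documents with k links falls off roughly as k^(-alpha), which appears as a straight line on a log-log plot. The sketch below estimates alpha by least squares on the log-log histogram. This is a crude estimator chosen for brevity, and the histogram is synthetic, not taken from the Blog06 collection.

```python
import math

def fit_power_law_exponent(degree_counts):
    """Negated least-squares slope of log(count) against log(degree)."""
    points = [(math.log(k), math.log(c))
              for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx

# Synthetic inlink histogram: documents with k inlinks = 1000 * k^-2.
toy_histogram = {k: 1000.0 * k ** -2.0 for k in (1, 2, 4, 8, 16, 32)}
alpha = fit_power_law_exponent(toy_histogram)
```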
19Link Structure Spam Blogs
20Blog Opinion Retrieval
- Blogs are unlike news articles
- Opinionated: many are written for self-expression
- Opinion-oriented user information needs
- Many blog queries are person names, both celebrities and unknown, and the users' underlying information needs seem to be of an opinion- or perspective-finding nature, rather than fact-finding
- Different genres
- Specific topic
- Multiple topics
- Personal life
21Blog Opinion Retrieval
- Started in TREC 2006
- Locate posts that express an opinion about a given target
- What do people think about X?
- What are the targets?
- Named entities (a person, location, or organization)
- Concepts (e.g., a type of technology, a product name, or an event)
- Application
- Uncover the public sentiment towards a given entity (the target)
- Track consumer-generated content, brand monitoring, and, more generally, media analysis
22Blog Opinion Retrieval Example
An opinionated post
An unopinionated post
23Opinion Retrieval Topics
- 50 queries selected from a donated collection of
queries sent to commercial blog search engines
24Opinion Retrieval Approaches
- Two-stage process
- Retrieve relevant blogs
- Classify opinionated blogs
- Retrieve relevant blogs
- off-the-shelf retrieval models (e.g., language
models, vector space, tf.idf weighting)
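As a rough illustration of the first stage, the sketch below ranks posts for a query by a basic tf.idf score. The function `tfidf_rank`, the smoothed idf formula, and the toy posts are assumptions for illustration; real systems would use a full retrieval engine.

```python
import math
from collections import Counter

def tfidf_rank(query, documents):
    """Rank document ids by summed tf * idf over the query terms."""
    tokenised = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenised.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                # Smoothed idf (an assumption; many variants exist).
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += counts[term] * idf
        scores[doc_id] = score
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical posts.
toy_posts = {
    "p1": "netflix raised prices again and i am not happy",
    "p2": "my cat slept all day",
    "p3": "netflix netflix everywhere netflix is great",
}
ranking = tfidf_rank("netflix", toy_posts)
```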
25Opinion Retrieval Approaches (contd)
- Classify opinionated blogs
- Dictionary-based approaches
- Lists of terms and their semantic orientation values
- Rank documents based on the frequency of semantic words
- Text categorization approaches
- Limited success, possibly because of the difference between training data and the actual opinionated content in blog posts
- Shallow linguistic approaches
- Frequency of pronouns or adjectives as indicators
- Limited success
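A dictionary-based second stage could look roughly like the sketch below: posts already judged relevant are re-ranked by the share of their tokens found in an opinion lexicon. The lexicon, the scoring rule, and the names `opinion_score` and `rerank_by_opinion` are hypothetical and not taken from any particular TREC system.

```python
# Hypothetical lexicon: term -> semantic orientation value (sign = polarity).
toy_lexicon = {"great": 1.0, "love": 1.0, "awful": -1.0, "hate": -1.0}

def opinion_score(text, lexicon):
    """Share of tokens that carry any semantic orientation."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in lexicon)
    return hits / len(tokens)

def rerank_by_opinion(relevant_posts, lexicon):
    """Order post ids from most to least opinionated."""
    return sorted(relevant_posts,
                  key=lambda pid: opinion_score(relevant_posts[pid], lexicon),
                  reverse=True)

# Hypothetical output of the relevance stage.
stage_one = {
    "p1": "the show airs on monday",
    "p2": "i love this show it is great",
}
opinion_ranking = rerank_by_opinion(stage_one, toy_lexicon)
```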
26Opinion Retrieval Assessment
Relevance judgment
- -1: Not judged
- 0: Not relevant
- 1: Relevant
Opinion judgment
- 2: Negative opinion
- 3: Mixed opinion
- 4: Positive opinion
27Opinion Retrieval Evaluation
- Mean Average Precision (MAP)
- The most important
- R-precision (R-prec)
- Binary Preference (bPref)
- Precision at 10 documents (P_at_10)
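Three of these measures can be computed from a ranked list and a set of relevant document ids, as in the sketch below. The function names and toy data are illustrative; official TREC scores come from the trec_eval tool, whose handling of edge cases may differ.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Average of per-topic average precision; runs = [(ranked, relevant), ...]."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical results for two topics.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
map_score = mean_average_precision([(ranked, relevant), (["d2", "d1"], {"d1"})])
```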
28Opinion Retrieval Results
- T: title, D: description, N: narrative
29Opinion Retrieval Relevance Results
30Opinion-finding vs. Topic-relevance
- High topic relevance -> high accuracy in finding opinionated blog posts
31How Splogs Affect Opinion Retrieval?
- Spam is an important issue in the blogosphere
5
- Spam is not a major issue in opinion retrieval
32Polarity
- Equal chance to retrieve positive and negative
opinions
33Analysis across Topics
- High performance topics
- Named entities, e.g., Heineken, Netflix, Ann Coulter
- Low performance topics
- High-level concepts, e.g., cholesterol, Business Intelligence Resources
34Embarrassing Performance
- Performance by simple document retrieval
- Best performance by simple document retrieval
35Information Propagation with Blogspace
- Characterize information propagation in two dimensions
- Topics
- Chatter: long-term, internally driven (i.e., subtopics are determined by the authors)
- Spikes: short-term, externally driven (i.e., subtopics are decided by real-world events)
- Individuals
- Four categories of posting behavior
- Based on the spread of infectious diseases
36Modeling Topics
- How to identify and track topics?
- Topic detection and tracking (TDT)
- Strategies
- Recurring sequences of words as topics
- Common phrases, e.g., "I don't think I will"
- Entities defined in the TAP ontology
- 3700 distinct ones, most of them appear only a few times
- Proper nouns
- 11K, half of them ? 10
- Term frequency ratio (tfcidf)
- 20K terms (thresholds: tf(i) 10, ratio 3)
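The exact ratio formula is not preserved on the slide. The sketch below assumes the ratio compares a term's frequency on day i with its average frequency over all earlier days, and it treats the thresholds 10 and 3 as lower bounds. Both choices, and the function names, are assumptions made for illustration.

```python
def frequency_ratio(daily_tf, day):
    """tf on `day` divided by the mean tf over all earlier days (assumed form)."""
    if day == 0:
        return 0.0
    earlier_mean = sum(daily_tf[:day]) / day
    if earlier_mean == 0:
        return float("inf")
    return daily_tf[day] / earlier_mean

def select_terms(term_series, min_tf=10, min_ratio=3.0):
    """Keep terms that, on some day, exceed both thresholds."""
    selected = set()
    for term, series in term_series.items():
        for day in range(1, len(series)):
            if series[day] > min_tf and frequency_ratio(series, day) > min_ratio:
                selected.add(term)
                break
    return selected

# Hypothetical daily term frequencies.
toy_series = {
    "election": [2, 3, 1, 40, 12],        # sudden jump on day 3
    "the": [900, 910, 905, 899, 902],     # frequent but flat
    "obscure": [0, 0, 1, 4, 0],           # jumps, but too rare
}
topic_terms = select_terms(toy_series)
```

A filter of this shape drops both very common words, whose frequency is high but stable, and rare words, whose jumps rest on too few occurrences.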
37Examples of Selected Words
38Topic Patterns
- Just spike
- inactive -> very active -> inactive
- Spiky Chatter
- very sensitive to external world events
- Multiple spikes
- Chatter
- Discussion on a modest level
39Spiky Chatter
- Subtopics arise due to real-world events
- Identify subtopics (x) given the target topic (t)
- Support: co-occurrence
- Conditional probability
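Both statistics can be computed by counting posts, as sketched below. Posts are represented as sets of terms. Reading the conditional probability as P(x | t), the chance a post on topic t also mentions x, is an assumption, since the slide does not state the direction of conditioning.

```python
def subtopic_stats(posts, topic, candidate):
    """Return (support, P(candidate | topic)) over posts given as sets of terms."""
    with_topic = [p for p in posts if topic in p]
    # Support: number of posts in which topic and candidate co-occur.
    support = sum(1 for p in with_topic if candidate in p)
    conditional = support / len(with_topic) if with_topic else 0.0
    return support, conditional

# Hypothetical posts.
toy_term_sets = [
    {"microsoft", "longhorn"},
    {"microsoft", "longhorn", "beta"},
    {"microsoft", "xbox"},
    {"linux", "kernel"},
]
support, conditional = subtopic_stats(toy_term_sets, "microsoft", "longhorn")
```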
40Spiky Chatter (contd)
- Confirm that the spikes are caused by the
subtopics
41Modeling Individuals
- Uncover the path of topics through the
individuals who make up blogspace
The number of posts by individual bloggers follows Zipf's law
42Life Cycle of Posts
43Associate Users with Post Life Cycle
- Small numbers of users involved in the regions of RampUp and RampDown
- Many more users involved in the regions of Mid-High and Cycle
44Propagation Model
- How is blog a affected by the topic raised in blog b?
- Independent Cascade model (random walk)
- A directed graph
- Each node is a blogger
- Edge (u, w) is associated with a copy probability
- When u writes an article at time t, each node w that has an arc from u to w writes an article about the topic at time t + 1 with the copy probability of that edge
- Probability that u reads w's blog
45Example of Network
46Propagation Model Procedure
- Start: u wrote about a certain topic on a given day
- First, v reads the topic from node u with probability r_{u,v}, after a delay that follows an exponential distribution
- Then, with the copy probability of the edge, v will choose to write about it
- If v reads the topic and chooses not to copy it, then v will never copy that topic from u
- A single opportunity for a topic to propagate along any given edge
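The procedure above can be simulated with a small event queue, as in this sketch. The edge format, the function name `simulate_cascade`, the `mean_delay` parameter, and the toy network are all assumptions. The probabilities are set to 0 or 1 only so the outcome is predictable.

```python
import heapq
import random

def simulate_cascade(edges, seed_node, rng, mean_delay=1.0):
    """edges: {u: [(v, read_prob, copy_prob), ...]}. Returns {node: time written}."""
    written = {}
    events = [(0.0, seed_node)]
    while events:
        time, u = heapq.heappop(events)
        if u in written:
            continue  # u already wrote about the topic at an earlier time
        written[u] = time
        # Each node is expanded once, so each edge gets a single opportunity.
        for v, read_prob, copy_prob in edges.get(u, []):
            if v in written:
                continue
            # v must first read the topic, then choose to copy it.
            if rng.random() < read_prob and rng.random() < copy_prob:
                delay = rng.expovariate(1.0 / mean_delay)  # exponential delay
                heapq.heappush(events, (time + delay, v))
    return written

# Hypothetical network: v always reads and copies from u, w reads but never
# copies, and x never reads v's blog.
toy_edges = {
    "u": [("v", 1.0, 1.0), ("w", 1.0, 0.0)],
    "v": [("x", 0.0, 1.0)],
}
cascade = simulate_cascade(toy_edges, "u", random.Random(0))
```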
47Copy Prob. vs. Read Prob.
read probability
copy probability
- Very low copying probability