Title: Blog Mining
1Blog Mining: Market Research made easy?
Bettina Berendt, K.U.Leuven, www.berendt.de
2About me ...
Computer Science
Information Systems Computer Science /
Cognitive Science Artificial Intelligence
Business Science Economics
3Motivation / Executive summary
4Agenda
From online buzz to text mining
Text mining first steps
Text mining going deeper
Closing the loop from blogs to blogs
The representativeness challenge
So ...
6What's a blog?
- A (more or less) frequently updated publication on the Web, sorted in (usually reverse) chronological order of the constituent blog posts.
- The content may reflect any interests, including personal, journalistic or corporate.
- Usually textual, but multimedia forms exist (photoblog, vblog, ...)
7Blogs and other social media (Web 2.0)
Annotation platforms (e.g., del.icio.us)
Wikis (e.g., Wikipedia)
Microblogging (e.g., Twitter)
Social network sites (e.g., MySpace)
Blogs (e.g., Livejournal, Huffington Post)
Sharing / linking by: hyperlinks, comments, blogroll, trackback links
8Blogs and other social media, and their activity focus
Creating content
Organizing content
Communication
Annotation platforms (e.g., del.icio.us)
Wikis (e.g., Wikipedia)
Microblogging (e.g., Twitter)
Social network sites (e.g., MySpace)
Blogs (e.g., Livejournal, Huffington Post)
Sharing / linking by: hyperlinks, comments, blogroll, trackback links
(Self-)profiling, Meeting people
Publication Expression
9Blogs and other social media, and some of their origins in older media
www.dmoz.org
Computer-supported cooperative work
Bookmarks
Chatrooms
Annotation platforms (e.g., del.icio.us)
Wikis (e.g., Wikipedia)
Microblogging (e.g., Twitter)
Social network sites (e.g., MySpace)
Blogs (e.g., Livejournal, Huffington Post)
Sharing / linking by: hyperlinks, comments, blogroll, trackback links
Dating sites
Diaries
(Often political) journalism
PR press releases
10Blogs: Publication bordering on communication
Creating content
Organizing content
Communication
Annotation platforms (e.g., del.icio.us)
Wikis (e.g., Wikipedia)
Microblogging (e.g., Twitter)
Social network sites (e.g., MySpace)
Blogs (e.g., Livejournal, Huffington Post)
Sharing / linking by: hyperlinks, comments, blogroll, trackback links
(Self-)profiling, Meeting people
Publication Expression
11What's market research?
- Identification, collection, analysis, and dissemination of information
- for the purpose of assisting management in decision making related to the identification and solution of problems and opportunities in marketing
12Traditional methods of (consumer) market research
- Based on questioning
  - Focus groups, surveys, questionnaires, ...
- Based on observations
  - Ethnographic studies: observe social phenomena in their natural setting; observations can occur cross-sectionally or longitudinally; examples include product-use analysis and computer cookie traces.
  - Experimental techniques: create a quasi-artificial environment to try to control spurious factors, then manipulate at least one of the variables; examples include purchase laboratories and test markets.
13Agenda
From online buzz to text mining
Text mining first steps
Text mining going deeper
Closing the loop from blogs to blogs
The representativeness challenge
So ...
14Capturing online buzz: Bursty communication activities
15Comparing search volume, news and blogs
16The idea of text mining ...
- ... is to go beyond frequency-counting
- ... is to go beyond the search-for-documents framework
- ... is to find patterns (of meaning) within and across documents
- (yes, there is text mining behind some of the things the above tools do!)
17The steps of text mining
- Application understanding
- Corpus generation
- Data understanding
- Text preprocessing
- Search for patterns / modelling
- Topical analysis
- Sentiment analysis / opinion mining
- Evaluation
- Deployment
18Agenda
From online buzz to text mining
Text mining first steps
Text mining going deeper
Closing the loop from blogs to blogs
The representativeness challenge
So ...
19Application understanding and corpus generation
- What is the question?
- What is the context?
- What could be interesting sources, and where can they be found?
  - Crawl
  - Use a search engine and/or archive
    - Google blogs search
    - Technorati
    - Blogdigger
    - ...
20Preprocessing (1)
- Data cleaning
  - Goal: get clean ASCII text
  - Remove HTML markup, pictures, advertisements, ...
  - Automate this: wrapper induction
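The markup-removal part of data cleaning can be sketched with Python's standard library. This is only a minimal illustration: it strips tags and skips script/style content, and it is not wrapper induction (which learns page-specific extraction rules). The class and function names are made up for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    """Return the text of an HTML fragment with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


page = "<div><script>ads()</script><h1>GREAT Camera</h1><p>The pictures are amazing.</p></div>"
print(clean_html(page))
```

Removing advertisements and navigation text usually needs site-specific rules on top of this, which is where wrapper induction comes in.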
21Preprocessing (2)
- Further text preprocessing
  - Goal: get processable lexical / syntactical units
  - Tokenize (find word boundaries)
  - Lemmatize / stem
    - e.g., buyers, buyer → buyer / buyer, buying, ... → buy
  - Remove stopwords
  - Find Named Entities (people, places, companies, ...) → filtering
  - Resolve polysemy and homonymy: word sense disambiguation; synonym unification
  - Part-of-speech tagging → filtering of nouns, verbs, adjectives, ...
  - ...
- Most steps are optional and application-dependent!
- Many steps are language-dependent; coverage of non-English varies
- Free and/or open-source tools or Web APIs exist for most steps
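A toy version of the tokenize / stem / stopword steps, for illustration only. The stopword list and the suffix-stripping rule are assumptions made for this sketch; a real project would use an established stemmer or lemmatizer for the language at hand.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "this", "to", "in"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]


print(preprocess("The buyers are buying this camera"))
```

Here "buyers" and "buying" both collapse to "buy", which mirrors the stemming example on the slide.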
22Preprocessing (3)
- Creation of text representation
  - Goal: a representation that the modelling algorithm can work on
  - Most common forms: a text as
    - a set or (more usually) bag of words / vector-space representation: term-document matrix with weights reflecting occurrence, importance, ...
    - a sequence of words
    - a tree (parse trees)
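The bag-of-words form can be made concrete with a small term-document matrix. The sketch below uses TF-IDF as the weight, which is one common choice among several; the function name and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, one TF-IDF row per document)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocab]
        matrix.append(row)
    return vocab, matrix


docs = [["camera", "picture", "picture"], ["camera", "battery"]]
vocab, matrix = term_document_matrix(docs)
print(vocab)
for row in matrix:
    print([round(w, 3) for w in row])
```

A term that occurs in every document ("camera" here) gets weight 0, while a term concentrated in one document gets a high weight.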
23Recall: text data pre-processing ...
24An important part of preprocessing: Named-entity recognition (1)
25An important part of preprocessing: Named-entity recognition (2)
- Technique: lexica, heuristic rules, syntax parsing
  - Re-use lexica and/or develop your own
  - Configurable tools such as GATE
- A challenge: multi-document named-entity recognition
  - See proposal in Subašić & Berendt (Proc. ICDM 2008)
26The simplest form of content analysis is based on NER
Berendt, Schlegel and Koch. In Zerfaß et al. (Eds.), Kommunikation, Partizipation und Wirkungen im Social Web, 2008
27Agenda
From online buzz to text mining
Text mining first steps
Text mining going deeper
Closing the loop from blogs to blogs
The representativeness challenge
So ...
28More about named entities: co-occurrence
Source: discussion boards; similar to blogs, but (more) clearly communication-related
Feldman et al., Proc. ICDM 2007
29Co-occurrence of brands and attributes
Feldman et al., Proc. ICDM 2007
31Capturing online buzz: Bursty communication activities
32Comparing search volume, news and blogs
33More advanced text modelling: Summarization of time-indexed documents
- Google Trends, Blogpulse etc. associate documents / document sets with bursts
- But this means the user has to read the documents!
- Can we do better and create a concise summary of what was discussed in that period?
- Can we allow the user to ask for as much detail as s/he is interested in?
34Yes, with STORIES
(Subašić & Berendt, Proc. ICDM 2008)
35Story elements
- content-bearing words
  - the 150 top-TF words without stopwords
36Story stages: co-occurrence in a window
- "mother" and "suspect" co-occur
  - in a window of size 6 (all words)
  - in a window of size 2 (non-stopwords only)
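A minimal sketch of the window test. It assumes that "window of size n" means both words fall within n consecutive tokens (position difference smaller than n); the example sentence and the stopword list are invented for illustration.

```python
def cooccur_in_window(tokens, w1, w2, window):
    """True if w1 and w2 both occur within `window` consecutive tokens."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) < window for i in pos1 for j in pos2)


# Hypothetical sentence and stopword list, for illustration only.
tokens = "the mother is now also a suspect".split()
stopwords = {"the", "is", "now", "also", "a"}
content_tokens = [t for t in tokens if t not in stopwords]

print(cooccur_in_window(tokens, "mother", "suspect", 6))          # all words
print(cooccur_in_window(content_tokens, "mother", "suspect", 2))  # non-stopwords only
```

Filtering stopwords first lets a much smaller window capture the same pair, which is the contrast the slide draws between size 6 and size 2.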
37Salient story elements
- Identify content-bearing terms (e.g., the 150 top-TF.IDF terms over the whole corpus)
- Split whole corpus T by atomic time period (e.g., week)
- For each time period (atomic or moving-average):
  - Compute the weights for corpus t for this period
  - Weight:
    - Support of co-occurrence of 2 content-bearing terms w1, w2 in t =
      (# articles from t containing both w1, w2 in window) / (# all articles in t)
  - Threshold:
    - Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
    - Time-relevance TR of co-occurrence(w1, w2) =
      support(co-occurrence(w1, w2)) in t / support(co-occurrence(w1, w2)) in T ≥ θ2 (e.g., 2)
  - Thresholds are set dynamically and interactively by the user
- Story elements: relationships = all these edges
- Story basics: terms = all nodes connected by these edges
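The support and time-relevance definitions above can be sketched as follows. This is an illustrative reading, not the STORIES implementation: documents are token lists, the occurrence count for the first threshold is taken at article level, and all function and parameter names are assumptions.

```python
from itertools import combinations


def cooccur(tokens, w1, w2, window):
    """True if w1 and w2 both occur within `window` consecutive tokens."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) < window for i in pos1 for j in pos2)


def support(docs, w1, w2, window):
    """Share of articles in which w1 and w2 co-occur within the window."""
    if not docs:
        return 0.0
    return sum(cooccur(d, w1, w2, window) for d in docs) / len(docs)


def salient_edges(period_docs, all_docs, terms, window, theta1, theta2):
    """Edges (term pairs) passing both thresholds, mapped to their time relevance."""
    edges = {}
    for a, b in combinations(sorted(terms), 2):
        count = sum(cooccur(d, a, b, window) for d in period_docs)
        if count < theta1:
            continue  # too few co-occurrences in this period
        overall = support(all_docs, a, b, window)
        tr = support(period_docs, a, b, window) / overall
        if tr >= theta2:
            edges[(a, b)] = tr
    return edges


week = [["mother", "suspect"], ["mother", "suspect"]]
rest = [["police", "search"], ["police", "search"]]
print(salient_edges(week, week + rest, {"mother", "suspect", "police"}, 2, 2, 2))
```

The nodes touched by the returned edges would then be the story basics for that period.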
38Salient story stages, and story evolution
- Story stage: the story graph made of basics and elements in t
- Story evolution: how story stages evolve over the t in T
39An event: a missing child
40A central figure emerges in the police
investigations
41Uncovering more details
42Uncovering more details
43An eventless time
44The story and the underlying documents
45Navigating between documents: relating different source types to one another
(Berendt & Trümper, in press)
46A simple form of opinion mining: Feature-based Summary (Hu and Liu, Proc. SIGKDD'04)
Source: product reviews; similar to blogs, but (more) clearly product-related
- GREAT Camera., Jun 3, 2004
- Reviewer: jprice174 from Atlanta, Ga.
- I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital.
- The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out.
- ...
- Feature1: picture
  - Positive: 12
    - The pictures coming out of this camera are amazing.
    - Overall this is a good camera with a really good picture clarity.
    - ...
  - Negative: 2
    - The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
    - Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.
- Feature2: battery life
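The shape of such a feature-based summary can be sketched in a few lines. This is a toy illustration and not Hu and Liu's method: the opinion-word lists are invented, features are limited to single words, and a sentence's polarity is simply the balance of positive and negative opinion words it contains.

```python
import re

# Toy opinion lexicon (an assumption made for this sketch).
POSITIVE = {"amazing", "great", "good"}
NEGATIVE = {"hazy", "blurry", "bad"}


def summarize(sentences, features):
    """Group sentences by mentioned feature and by crude sentence polarity."""
    summary = {f: {"positive": [], "negative": []} for f in features}
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        for f in features:
            if f in words or f + "s" in words:  # singular or simple plural
                if score > 0:
                    summary[f]["positive"].append(sentence)
                elif score < 0:
                    summary[f]["negative"].append(sentence)
    return summary


reviews = [
    "The pictures coming out of this camera are amazing.",
    "Overall this is a good camera with a really good picture clarity.",
    "The pictures come out hazy if your hands shake even for a moment.",
]
result = summarize(reviews, ["picture"])
print(len(result["picture"]["positive"]), len(result["picture"]["negative"]))
```

Multi-word features such as "battery life" and negation ("not good") would need extra handling that this sketch leaves out.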
47Agenda
From online buzz to text mining
Text mining first steps
Text mining going deeper
Closing the loop from blogs to blogs
The representativeness challenge
So ...
48An application: Crisis PR. Step 1: Use blogs to observe public discussions
- Detect products about which there is controversial discussion
  - sentiment mining from text
  - and/or
  - use the structure of blogs (e.g., structure of blog post comments; Mishne & Glance, Proc. WWW 2006)
  - and/or
  - discussion in the mainstream media (may be later though)
49An application: Crisis PR. Step 2: Use blogs to communicate facts and own concerns
- Example: Dell's exploding laptops: product recall and aftermath
- Dell launched a blog at that time (much maligned at first, but they learned ...)
- Evaluation of all English-language consumer commentary on the Web before and after (methodology based on Reichheld 1996, The Loyalty Effect)
Market Sentinel, 2007
50Agenda
From online buzz to text mining
Text mining first steps
Text mining going deeper
Closing the loop from blogs to blogs
The representativeness challenge
So ...
51First ...
- ... the imperfect nature of automatic text analysis presents a challenge
- But human inter-rater agreement on various aspects of texts also tends to be rather low!
52Some findings
- Only a fraction of the population blog (8% of adult Internet users; Pew Internet & American Life Project, July 2006)
- Most blogs are personal
  - US survey (Business Week, July 2006)
- German-language blogosphere is less mature, esp. less politicized, than the US blogosphere (Berendt, Schlegel & Koch, 2008)
  - This includes fewer mentions of companies
- But what about those personal blogs ...?
53What makes people happy?
54Happiness in blogosphere
55- Well kids, I had an awesome birthday thanks to you. :D Just wanted to so thank you for coming and thanks for the gifts and junk. :) I have many pictures and I will post them later. hearts
current mood
What are the characteristic words of these two moods?
Mihalcea, R. & Liu, H. (2006). In Proc. AAAI Spring Symposium CAAW. Slides based on Rada Mihalcea's presentation.
56Data, data preparation and learning
- LiveJournal.com: optional mood annotation
  - 10,000 blogs
  - 5,000 happy entries / 5,000 sad entries
  - average size: 175 words / entry
  - post-processing: remove SGML tags, tokenization, part-of-speech tagging
- quality of automatic mood separation
  - naïve Bayes text classifier
  - five-fold cross validation
  - Accuracy: 79.13% (>> 50% baseline)
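The evaluation setup named above (a naïve Bayes text classifier scored with five-fold cross-validation) can be sketched from scratch. This is a generic multinomial naïve Bayes with add-one smoothing on invented toy entries; it does not reproduce the study's data, features, or accuracy.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Multinomial naive Bayes counts. docs: list of token lists."""
    class_counts = Counter(labels)
    word_counts = {}
    for tokens, label in zip(docs, labels):
        word_counts.setdefault(label, Counter()).update(tokens)
    vocab = {t for counts in word_counts.values() for t in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n / total_docs)
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                score += math.log((counts[t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=5):
    """Mean accuracy over k folds; fold i holds out items i, i+k, i+2k, ..."""
    accuracies = []
    for i in range(k):
        test_idx = set(range(i, len(docs), k))
        train = [(d, l) for j, (d, l) in enumerate(zip(docs, labels)) if j not in test_idx]
        model = train_nb([d for d, _ in train], [l for _, l in train])
        correct = sum(predict_nb(model, docs[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Invented toy entries, for illustration only.
docs = [["yay", "birthday", "awesome"], ["cried", "lonely", "tears"]] * 5
labels = ["happy", "sad"] * 5
print(cross_validate(docs, labels, k=5))
```

With balanced classes, as in the 5,000 / 5,000 split on the slide, a classifier that always guesses one mood scores 50%, which is the baseline the reported accuracy is compared against.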
57Results: Corpus-derived happiness factors
- yay 86.67
- shopping 79.56
- awesome 79.71
- birthday 78.37
- lovely 77.39
- concert 74.85
- cool 73.72
- cute 73.20
- lunch 73.02
- books 73.02
- goodbye 18.81
- hurt 17.39
- tears 14.35
- cried 11.39
- upset 11.12
- sad 11.11
- cry 10.56
- died 10.07
- lonely 9.50
- crying 5.50
58Agenda
From online buzz to text mining
Text mining first steps
Text mining going deeper
Closing the loop from blogs to blogs
The representativeness challenge
So ...
59Conclusion
- A brief glance was given: (semi-)automated forms of text analysis, applied to blogs, can produce useful insights
- Only briefly mentioned today: it can profit from the simultaneous analysis of link structures and/or tags
- It is the only way of analysing large-scale corpora; mining methods are improving continuously
- However, machine language understanding is not human language understanding
- Also, representativeness is questionable
- → Need to combine blog mining with other forms of market research!
  - NB: These forms include Web usage mining, query mining, ...
- But mining remains exploratory analysis
60Thank you!
62Outlook
- Evaluation
  - Usually lacking from temporal text mining!
  - Information retrieval quality: How to find the/a ground truth?
  - (Subašić & Berendt, in press): encouraging results
- What are the implications of genre (e.g., news vs. scientific)?
  - Narrative vs. declarative?
  - Different register (including vocabulary choices)?
  - One story vs. multiple story lines?
THANKS !
63Our interest in THE WEB
- What's out there?
- How do people use it?
- How can we make it more useful? (→ information literacy)
64Our interest in (STORY) EVOLUTION
- What happened?
  - Presidential elections
  - Crime stories
  - How did Saddam become an al-Qaida member?
  - What did genes do before they transmitted information?
- Link to anyone who likes the Web programming
- Link to other theories of evolution?
65Understanding: A continuous interplay between different representations
- Saddam is friend of US; reasons: text 1, text 2, text 3, ...
- <something happens>
- Saddam is foe of US; reasons: text 10, text 11, ...
66The application problem
What happened?
67... very related to a typical bibliometric problem
What happened?
What happened?
68Ingredients of a solution and the STORIES
approach
Agenda
The problem
Demonstration
Evaluation
Differences between application areas?!
69A case study
http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/05/22/nmaddy122.xml
70The story unfolds: new actors enter the stage (and old ones change their roles)
Manual!
71Basic observation: A story is about relational statements
Manual!
Robert Murat suspect
Kate McCann suspect
72Solution approach 1: Find latent topics
- temporal development only by comparative statics
- no drill down possible
- no fine-grained relational information
- → lacks structure
Tool: Blaž Fortuna, http://docatlas.ijs.sii
73Solution approach 2: Temporal latent topics
- no fine-grained relational information
- themes are fixed by the algorithm
- no drill down possible
- → no combination of machine and human intelligence
Mei & Zhai, PKDD 2005
74The ETP3 problem
- Evolutionary theme patterns: discovery, summary and exploration
- identify topical sub-structure in a set (generally, a time-indexed stream) of documents constrained by being about a common topic
- show how these substructures emerge, change, and disappear (and maybe re-appear) over time
- give users intuitive and interactive interfaces for exploring the topic landscape and the underlying documents; use machine-generated summarization only as a starting point!
75Ingredients of a solution
Document / text pre-processing
- Template recognition
- Multi-document named entities
- Stopword removal, lemmatization
Interaction approach
- Graphs (and layout)
- Comparative statics or morphing
- Drill-down: uncovering relations
- Links to documents (in progress)
Document summarization strategy
- no topics, but salient concepts and relations
- time window, word-span window
Selection approach for concepts
- concepts: words or named entities
- salient concept: high TF, involved in a salient relation, time-indexed
ETP3
Similarity measure to determine relations
STORIES
Burstiness measure
- time relevance,
- a temporal co-occurrence lift
76Data collection and preprocessing
- Articles from Google News 05/2007 - 11/2007 for search term "madeleine mccann"
  - (there was a Google problem in the December archive)
- Only English-language articles
- For each month, the first 100 hits
- Of these, all that were freely available → 477 documents
- Preprocessing
  - HTML cleaning
  - tokenization
  - stopword removal
79Salient story elements
- Split whole corpus T by week (week 17: 30 Apr, until week 44: 12 Nov)
- For each week:
  - Compute the weights for corpus t for this week
  - Weight:
    - Support of co-occurrence of 2 content-bearing words w1, w2 in t =
      (# articles from t containing both w1, w2 in window) / (# all articles in t)
  - Threshold:
    - Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
    - Time-relevance TR of co-occurrence(w1, w2) =
      support(co-occurrence(w1, w2)) in t / support(co-occurrence(w1, w2)) in T ≥ θ2 (e.g., 2)
  - Rank by TR, for each week identify top 2
- Story elements: peak words = all elements of these top 2 pairs (38)
80Salient story stages, and story evolution
- Story stage: co-occurrences of peak words in t
  - For each week t: aggregate over t-2, t-1, t → moving average
- Story evolution: how story stages evolve over the t in T
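One way to read the moving-average step is to average each edge's weekly weight over weeks t-2, t-1 and t. The sketch below does that; whether the original aggregates weights or pools the underlying documents is not stated on the slide, so this is an assumption, as are the names and the handling of the first two weeks (which average over fewer weeks).

```python
def moving_average(weekly_weights, span=3):
    """weekly_weights: chronological list of dicts mapping edge -> weight.

    Returns one dict per week with each edge's mean weight over the last
    `span` weeks (fewer at the start); a missing edge counts as weight 0.
    """
    smoothed = []
    for i in range(len(weekly_weights)):
        window = weekly_weights[max(0, i - span + 1): i + 1]
        edges = set().union(*window)  # all edges seen in the window
        smoothed.append(
            {e: sum(w.get(e, 0.0) for w in window) / len(window) for e in edges}
        )
    return smoothed


# Edge labels are plain strings here; in practice they would be word pairs.
weeks = [{"a": 3.0}, {"a": 0.0, "b": 3.0}, {"b": 3.0}]
for stage in moving_average(weeks):
    print(sorted(stage.items()))
```

Smoothing like this keeps an edge visible for a short while after its peak week, so consecutive story stages change gradually.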
81Demonstration
84Blogs and other social media: Where tagging (= adding keywords) is most prominent
Bookmarks
Annotation platforms (e.g., del.icio.us)
Reader tags
Usenet
Blogs (e.g., Livejournal, Huffington Post)
Sharing / linking by: hyperlinks, comments, blogroll, trackback links
Author tags
Diaries
(Often political) journalism
PR press releases
86Traditional methods of market research
- Based on questioning
  - Qualitative MR: generally used for exploratory purposes; small number of respondents; not generalizable to the whole population; statistical significance and confidence not calculated; examples include focus groups, in-depth interviews, and projective techniques
  - Quantitative MR: generally used to draw conclusions; tests a specific hypothesis; uses random sampling techniques so as to infer from the sample to the population; involves a large number of respondents; examples include surveys and questionnaires. Techniques include choice modelling, maximum difference preference scaling, and covariance analysis.
- Based on observations
  - Ethnographic studies: observe social phenomena in their natural setting; observations can occur cross-sectionally or longitudinally; examples include product-use analysis and computer cookie traces.
  - Experimental techniques: create a quasi-artificial environment to try to control spurious factors, then manipulate at least one of the variables; examples include purchase laboratories and test markets.
87Lexicon dependence ... (does not work for the Clintons either!! Family relations must be stated in the text)
89Who does market research?
- Companies
- Journalists
- Politicians
- ...
91Agenda
From online buzz to text mining
Text mining first steps
Text mining going deeper
Closing the loop from blogs to blogs
The representativeness challenge
So ...
92Capturing online buzz: pattern type 1, bursty
93Comparing search volume, news and blogs
94Capturing online buzz: pattern type 1 (in news and blogs) / type 2 (in search): smooth trend
95Capturing online buzz: pattern type 2 (in news and search) / type 3 (in blogs): cyclic
97The steps of text mining (e.g., for blogs
analysis)
- Application understanding
- Corpus generation
- Data understanding
- Text preprocessing
- Search for patterns / modelling
- Topical analysis
- Sentiment analysis / opinion mining
- Evaluation
- Deployment
98What's market research?
- a form of business research
- identification, collection, analysis, and dissemination of information
- for the purpose of assisting management in decision making related to the identification and solution of problems and opportunities in marketing.
- two categories: consumer market research and business-to-business (B2B) market research
- Consumer market research
  - aims to understand the behaviours and preferences of consumers in a market-based economy, as well as the effects and comparative success of marketing campaigns.