Blog Mining - PowerPoint PPT Presentation

Title: Blog Mining


1
Blog Mining: Market Research made easy?
Bettina Berendt, K.U.Leuven, www.berendt.de
2
About me ...
Computer Science
Information Systems Computer Science /
Cognitive Science Artificial Intelligence
Business Science Economics
3
Motivation / Executive summary
4
Agenda
  • Concepts
  • From online buzz to text mining
  • Text mining: first steps
  • Text mining: going deeper
  • Closing the loop: from blogs to blogs
  • The representativeness challenge
  • So ...
6
What's a blog?
  • a (more or less) frequently updated publication
    on the Web, sorted in (usually reverse)
    chronological order of the constituent blog
    posts.
  • The content may reflect any interests including
    personal, journalistic or corporate.
  • Usually textual, but multimedia forms exist
    (photoblog, vblog, ...)

7
Blogs and other social media (Web 2.0)
Annotation platforms (e.g., del.icio.us)
Wikis (e.g., Wikipedia)
Microblogging (e.g., Twitter)
Social network sites (e.g., MySpace)
Blogs (e.g., LiveJournal, Huffington Post)
Sharing / linking by: hyperlinks, comments,
blogroll, trackback links
8
Blogs and other social media, and their activity
focus
Creating content
Organizing content
Communication
Annotation platforms (e.g., del.icio.us)
Wikis (e.g., Wikipedia)
Microblogging (e.g., Twitter)
Social network sites (e.g., MySpace)
Blogs (e.g., LiveJournal, Huffington Post)
Sharing / linking by: hyperlinks, comments,
blogroll, trackback links
(Self-)profiling, Meeting people
Publication, Expression
9
Blogs and other social media, and some of their
origins in older media
www.dmoz.org
Computer-supported cooperative work
Bookmarks
Chatrooms
Annotation platforms (e.g., del.icio.us)
Wikis (e.g., Wikipedia)
Microblogging (e.g., Twitter)
Social network sites (e.g., MySpace)
Blogs (e.g., LiveJournal, Huffington Post)
Sharing / linking by: hyperlinks, comments,
blogroll, trackback links
Dating sites
Diaries
(Often political) journalism
PR press releases
10
Blogs: Publication bordering on communication
Creating content
Organizing content
Communication
Annotation platforms (e.g., del.icio.us)
Wikis (e.g., Wikipedia)
Microblogging (e.g., Twitter)
Social network sites (e.g., MySpace)
Blogs (e.g., LiveJournal, Huffington Post)
Sharing / linking by: hyperlinks, comments,
blogroll, trackback links
(Self-)profiling, Meeting people
Publication, Expression
11
What's market research?
  • identification, collection, analysis, and
    dissemination of information
  • for the purpose of assisting management in
    decision making related to the identification and
    solution of problems and opportunities in
    marketing

12
Traditional methods of (consumer) market research
  • Based on questioning
    • Focus groups, surveys, questionnaires, ...
  • Based on observations
    • Ethnographic studies: observe social phenomena
      in their natural setting; observations can occur
      cross-sectionally or longitudinally; examples
      include product-use analysis and computer cookie
      traces.
    • Experimental techniques: create a
      quasi-artificial environment to try to control
      spurious factors, then manipulate at least one
      of the variables; examples include purchase
      laboratories and test markets.

14
Capturing online buzz: Bursty communication
activities
15
Comparing search volume, news and blogs
16
The idea of text mining ...
  • ... is to go beyond frequency-counting
  • ... is to go beyond the search-for-documents
    framework
  • ... is to find patterns (of meaning) within and
    across documents
  • (yes, there is text mining behind some of the
    things the above tools do!)

17
The steps of text mining
  • Application understanding
  • Corpus generation
  • Data understanding
  • Text preprocessing
  • Search for patterns / modelling
    • Topical analysis
    • Sentiment analysis / opinion mining
  • Evaluation
  • Deployment

19
Application understanding, Corpus generation
  • What is the question?
  • What is the context?
  • What could be interesting sources, and where can
    they be found?
    • Crawl
    • Use a search engine and/or archive
      • Google Blog Search
      • Technorati
      • Blogdigger
      • ...

20
Preprocessing (1)
  • Data cleaning
    • Goal: get clean ASCII text
    • Remove HTML markup, pictures, advertisements, ...
    • Automate this: wrapper induction
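
The data-cleaning goal above (markup in, clean ASCII text out) can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the tooling used in the talk: the class and function names are made up for this example, and removing advertisements or learning page templates (wrapper induction) is not attempted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join text chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def clean_html(html):
    """Return plain ASCII text; non-ASCII characters are simply dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text().encode("ascii", errors="ignore").decode("ascii")


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>My post</h1><p>Great <b>camera</b>!</p>"
    "<script>track()</script></body></html>"
)
print(clean_html(sample))
```

Joining chunks with spaces keeps words from adjacent elements apart, at the price of an occasional stray space before punctuation.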

21
Preprocessing (2)
  • Further text preprocessing
    • Goal: get processable lexical / syntactical units
    • Tokenize (find word boundaries)
    • Lemmatize / stem
      • e.g., buyers, buyer → buyer / buyer, buying, ... →
        buy
    • Remove stopwords
    • Find named entities (people, places, companies,
      ...); filtering
    • Resolve polysemy and homonymy: word sense
      disambiguation; synonym unification
    • Part-of-speech tagging; filtering of nouns,
      verbs, adjectives, ...
    • ...
  • Most steps are optional and application-dependent!
  • Many steps are language-dependent; coverage of
    non-English varies
  • Free and/or open-source tools or Web APIs exist
    for most steps
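
Three of the steps above (tokenizing, stopword removal, stemming) can be sketched in a few lines. The stopword list and the suffix rules are toy assumptions for illustration only; real stemmers and lemmatizers are considerably more careful.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "this", "to", "with"}


def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def crude_stem(token):
    """Strip one of a few English suffixes if at least 3 letters remain."""
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    tokens = tokenize(text)
    content = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in content]


print(preprocess("The buyers are buying this camera"))
```

Here "buyers" and "buying" both collapse to "buy", which is the stemming behaviour from the example above; a lemmatizer would instead map "buyers" to "buyer".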

22
Preprocessing (3)
  • Creation of text representation
    • Goal: a representation that the modelling
      algorithm can work on
    • Most common forms: a text as
      • a set or (more usually) bag of words /
        vector-space representation: term-document matrix
        with weights reflecting occurrence, importance,
        ...
      • a sequence of words
      • a tree (parse trees)
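
A bag-of-words term-document matrix can be sketched as follows. The weighting used here (raw term frequency times log(N / document frequency)) is one common TF.IDF variant chosen for illustration; the slides do not commit to a particular weighting, and the example documents are invented.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, rows of TF.IDF weights)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in docs for t in set(doc))
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = raw term frequency * log(N / df); 0 for absent terms.
        matrix.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, matrix


docs = [
    ["camera", "picture", "great", "picture"],
    ["camera", "battery", "weak"],
    ["battery", "life", "great"],
]
vocab, matrix = term_document_matrix(docs)
print(vocab)
for row in matrix:
    print([round(w, 2) for w in row])
```

Each row is one document, each column one term; word order is lost, which is exactly what distinguishes this form from the sequence and tree representations.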

23
Recall text data pre-processing ...
24
An important part of preprocessing: Named-entity
recognition (1)
25
An important part of preprocessing: Named-entity
recognition (2)
  • Technique: lexica, heuristic rules, syntax
    parsing
  • Re-use lexica and/or develop your own
    • configurable tools such as GATE
  • A challenge: multi-document named-entity
    recognition
    • See proposal in Subašić & Berendt (Proc. ICDM
      2008)
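
The lexicon-based part of the technique can be illustrated with a plain dictionary lookup. This is only a sketch: the lexicon entries and the function name are hypothetical, and it covers neither heuristic rules, syntax parsing, nor the multi-document case.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to entity types.
LEXICON = {
    "dell": "COMPANY",
    "nikon": "COMPANY",
    "italy": "PLACE",
    "michelle obama": "PERSON",
}


def find_entities(text, lexicon=LEXICON):
    """Return (surface form, type, start offset) for every lexicon match."""
    hits = []
    for name, entity_type in lexicon.items():
        # Whole-word, case-insensitive match of the lexicon entry.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.group(0), entity_type, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


for hit in find_entities("I left my Nikon at home when I went to Italy."):
    print(hit)
```

A pure lexicon only finds what it already contains, which is one reason such tools let users plug in and extend their own lexica.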

26
The simplest form of content analysis is based on
NER
Berendt, Schlegel and Koch. In Zerfaß et al.
(Eds.), Kommunikation, Partizipation und
Wirkungen im Social Web, 2008
28
More about named entities: co-occurrence
Source: Discussion boards → similar to blogs, but
(more) clearly communication-related
Feldman et al., Proc. ICDM 2007
29
Co-occurrence of brands and attributes
Feldman et al., Proc. ICDM 2007
33
More advanced text modelling: Summarization of
time-indexed documents
  • Recall: Michelle Obama
  • Google Trends, Blogpulse etc. associate documents
    / document sets with bursts
  • But this means the user has to read the
    documents!
  • Can we do better and create a concise summary of
    what was discussed in that period?
  • Can we allow the user to ask as much detail as
    s/he is interested in?

34
Yes: with STORIES
(Subašić & Berendt, Proc. ICDM 2008)
35
Story elements
  • content-bearing words
  • the 150 top-TF words without stopwords

36
Story stages: co-occurrence in a window
  • mother and suspect co-occur
    • in a window of size 6 (all words)
    • in a window of size 2 (non-stopwords only)
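
A window check of this kind can be sketched as below. The exact window definition is an assumption: here two words co-occur in a window of size n if both fall inside some run of n consecutive tokens. The example sentence and stopword set are invented.

```python
def cooccur_in_window(tokens, w1, w2, window):
    """True if w1 and w2 both appear within some run of `window` consecutive tokens."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    # Two positions fit into one run of `window` tokens iff they differ by < window.
    return any(abs(i - j) < window for i in pos1 for j in pos2)


tokens = "the mother is now a suspect".split()
stopwords = {"the", "is", "now", "a"}  # toy list, for illustration only
content = [t for t in tokens if t not in stopwords]

print(cooccur_in_window(tokens, "mother", "suspect", 6))   # all words
print(cooccur_in_window(content, "mother", "suspect", 2))  # non-stopwords only
```

Dropping stopwords first lets a much smaller window capture the same pair, which is why the two window sizes above differ.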

37
Salient story elements
  • Identify content-bearing terms (e.g., 150
    top-TF.IDF over whole corpus)
  • Split whole corpus T by atomic time period (e.g.,
    week)
  • For each time period (atomic or moving-average)
    • Compute the weights for corpus t for this period
    • Weight
      • Support of co-occurrence of 2 content-bearing
        terms w1, w2 in t
      • (# articles from t containing both w1, w2 in
        window) / (# all articles in t)
    • Threshold
      • Number of occurrences of co-occurrence(w1, w2) in
        t ≥ θ1 (e.g., 5)
      • Time-relevance TR of co-occurrence(w1, w2) =
        support(co-occurrence(w1, w2)) in t /
        support(co-occurrence(w1, w2)) in T ≥ θ2 (e.g.,
        2)
    • Thresholds are set dynamically and interactively by
      the user
  • Story elements (relationships): all these edges
  • Story basics (terms): all nodes connected by
    these edges
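
The weighting and thresholding steps can be sketched as follows. This is an illustration under stated assumptions, not the STORIES implementation: function names are invented, the "number of occurrences" is read as the number of articles in t containing the co-occurrence, and the window test treats two words as co-occurring when their positions differ by less than the window size.

```python
from itertools import combinations


def cooccurs(tokens, w1, w2, window):
    """True if w1 and w2 appear fewer than `window` positions apart."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) < window for i in pos1 for j in pos2)


def count_and_support(articles, w1, w2, window):
    """Number and share of articles in which w1 and w2 co-occur in a window."""
    n = sum(1 for tokens in articles if cooccurs(tokens, w1, w2, window))
    return n, (n / len(articles) if articles else 0.0)


def salient_edges(period, corpus, terms, window=6, theta1=5, theta2=2.0):
    """Map each edge (w1, w2) that passes both thresholds to its time relevance.

    period: articles (token lists) of one time period t.
    corpus: all articles of the whole corpus T.
    """
    edges = {}
    for w1, w2 in combinations(sorted(terms), 2):
        n_t, support_t = count_and_support(period, w1, w2, window)
        _, support_T = count_and_support(corpus, w1, w2, window)
        if n_t < theta1 or support_T == 0:
            continue
        time_relevance = support_t / support_T
        if time_relevance >= theta2:
            edges[(w1, w2)] = time_relevance
    return edges


week = [["mother", "now", "suspect"], ["police", "name", "mother", "suspect"]]
others = [["police", "search", "continues"]] * 6
corpus = week + others
edges = salient_edges(week, corpus, {"mother", "suspect", "police"}, theta1=2)
nodes = {word for edge in edges for word in edge}
print(edges, nodes)
```

The returned edges play the role of the relationships and the set of their endpoints that of the terms in the list above; in an interactive tool, `theta1` and `theta2` would be the values the user adjusts.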

38
Salient story stages, and story evolution
  1. Story stage: the story graph made of basics and
    elements in t
  2. Story evolution: how story stages evolve over
    the t in T

39
An event: a missing child
40
A central figure emerges in the police
investigations
41
Uncovering more details
42
Uncovering more details
43
An eventless time
44
The story and the underlying documents
45
Navigating between documents relating different
source types to one another
(Berendt & Trümper, in press)
46
A simple form of opinion mining: Feature-based
Summary (Hu and Liu, Proc. SIGKDD'04)
Source: Product reviews → similar to blogs, but
(more) clearly product-related
  • GREAT Camera., Jun 3, 2004
  • Reviewer: jprice174 from Atlanta, Ga.
  • I did a lot of research last year before I
    bought this camera... It kinda hurt to leave
    behind my beloved nikon 35mm SLR, but I was going
    to Italy, and I needed something smaller, and
    digital.
  • The pictures coming out of this camera are
    amazing. The 'auto' feature takes great pictures
    most of the time. And with digital, you're not
    wasting film if the picture doesn't come out.
  • ...
  • Feature1: picture
    • Positive: 12
      • The pictures coming out of this camera are
        amazing.
      • Overall this is a good camera with a really good
        picture clarity.
    • Negative: 2
      • The pictures come out hazy if your hands shake
        even for a moment during the entire process of
        taking a picture.
      • Focusing on a display rack about 20 feet away in
        a brightly lit room during day time, pictures
        produced by this camera were blurry and in a
        shade of orange.
  • Feature2: battery life
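
The shape of such a feature-based summary can be sketched with a fixed feature list and a tiny opinion lexicon. Both are hypothetical inputs chosen for this example; this is not the method of Hu and Liu, who mine features and opinion words from the reviews themselves.

```python
import re
from collections import defaultdict

# Hypothetical inputs: known product features and a toy opinion lexicon.
FEATURES = ["picture", "battery"]
POSITIVE = {"amazing", "good", "great"}
NEGATIVE = {"hazy", "blurry", "poor"}


def feature_summary(sentences):
    """Group opinionated sentences by the feature they mention and by polarity."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        # Polarity = sign of (distinct positive words - distinct negative words).
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        if score == 0:
            continue
        polarity = "positive" if score > 0 else "negative"
        for feature in FEATURES:
            # Match the feature word itself or the word plus a trailing "s".
            if feature in words or feature + "s" in words:
                summary[feature][polarity].append(sentence)
    return summary


reviews = [
    "The pictures coming out of this camera are amazing.",
    "Overall this is a good camera with a really good picture clarity.",
    "The pictures come out hazy if your hands shake even for a moment.",
]
summary = feature_summary(reviews)
for feature, groups in summary.items():
    print(feature, len(groups["positive"]), len(groups["negative"]))
```

The per-feature counts and example sentences correspond to the "Positive" and "Negative" groups in the summary above.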

48
An application: Crisis PR. Step 1: Use blogs to
observe public discussions
  • Detect products about which there is
    controversial discussion
    • sentiment mining from text
    • and/or
    • use the structure of blogs (e.g., structure of
      blog post comments; Mishne & Glance, Proc. WWW
      2006)
    • and/or
    • discussion in the mainstream media (may be later
      though)

49
An application: Crisis PR. Step 2: Use blogs to
communicate facts & own concerns
  • Example: Dell's exploding laptops: product
    recall and aftermath
  • Dell launched a blog at that time (much maligned
    at first, but they learned ...)
  • Evaluation of all English-language consumer
    commentary on the Web before and after
    (methodology based on Reichheld 1996, The Loyalty
    Effect)

Market Sentinel, 2007
51
First ...
  • ... the imperfect nature of automatic text
    analysis presents a challenge
  • But human inter-rater agreement on various
    aspects of texts also tends to be rather low!

52
Some findings
  • Only a fraction of the population blog (8% of
    adult Internet users; Pew Internet & American
    Life Project, July 2006)
  • Most blogs are personal
  • US survey (Business Week, July 2006)
  • German-language blogosphere is less mature,
    esp. less politicized, than the US blogosphere
    (Berendt, Schlegel & Koch, 2008)
  • This includes fewer mentions of companies
  • But what about those personal blogs ...?

53
What makes people happy?
54
Happiness in blogosphere
55
  • Well kids, I had an awesome birthday thanks to
    you. D Just wanted to so thank you for coming
    and thanks for the gifts and junk. ) I have many
    pictures and I will post them later. hearts

current mood
What are the characteristic words of these two
moods?
Mihalcea, R. & Liu, H. (2006). In Proc. AAAI
Spring Symposium CAAW. Slides based on Rada
Mihalcea's presentation.
56
Data, data preparation and learning
  • LiveJournal.com: optional mood annotation
  • 10,000 blogs
    • 5,000 happy entries / 5,000 sad entries
    • average size: 175 words / entry
    • post-processing: remove SGML tags, tokenization,
      part-of-speech tagging
  • quality of automatic mood separation
    • Naïve Bayes text classifier
    • five-fold cross validation
    • Accuracy: 79.13% (>> 50% baseline)
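
The evaluation setup (a Naïve Bayes text classifier scored by five-fold cross validation) can be sketched in plain Python. This is a textbook multinomial variant with add-one smoothing on invented toy data, not the classifier or the corpus from the study; the fold assignment is also an assumption.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Multinomial Naive Bayes with add-one smoothing. docs: token lists."""
    classes = sorted(set(labels))
    vocab = {t for doc in docs for t in doc}
    counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    total = {c: sum(counts[c].values()) for c in classes}
    return classes, vocab, counts, prior, total


def predict_nb(model, doc):
    classes, vocab, counts, prior, total = model

    def log_posterior(c):
        # Words never seen in training are ignored.
        return prior[c] + sum(
            math.log((counts[c][t] + 1) / (total[c] + len(vocab)))
            for t in doc if t in vocab
        )

    return max(classes, key=log_posterior)


def cross_validate(docs, labels, k=5):
    """Mean accuracy over k folds (fold i holds out items i, i+k, i+2k, ...)."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train = [(d, l) for i, (d, l) in enumerate(zip(docs, labels))
                 if i not in test_idx]
        model = train_nb([d for d, _ in train], [l for _, l in train])
        correct = sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


happy = [["yay", "birthday"], ["yay", "shopping"], ["yay", "concert"],
         ["yay", "lunch"], ["yay", "books"]]
sad = [["cry", "goodbye"], ["cry", "hurt"], ["cry", "tears"],
       ["cry", "lonely"], ["cry", "upset"]]
# Interleave the classes so that every fold holds out one entry of each mood.
docs = [d for pair in zip(happy, sad) for d in pair]
labels = ["happy", "sad"] * 5
print(cross_validate(docs, labels, k=5))
```

The toy entries are separable by construction, so the score says nothing about real blog data; the point is only the train/hold-out mechanics behind an accuracy figure like the one above.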

57
Results: Corpus-derived happiness factors
  • yay 86.67
  • shopping 79.56
  • awesome 79.71
  • birthday 78.37
  • lovely 77.39
  • concert 74.85
  • cool 73.72
  • cute 73.20
  • lunch 73.02
  • books 73.02

  • goodbye 18.81
  • hurt 17.39
  • tears 14.35
  • cried 11.39
  • upset 11.12
  • sad 11.11
  • cry 10.56
  • died 10.07
  • lonely 9.50
  • crying 5.50
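
The slides do not define the happiness factor. Assuming it is the percentage of a word's occurrences that fall in happy entries (an assumption made for this sketch), scores of this kind could be computed as follows; the `min_count` cut-off and the toy entries are likewise invented.

```python
from collections import Counter


def happiness_factors(happy_docs, sad_docs, min_count=2):
    """Percentage of each word's occurrences that fall in happy entries."""
    happy = Counter(t for doc in happy_docs for t in doc)
    sad = Counter(t for doc in sad_docs for t in doc)
    factors = {}
    for word in set(happy) | set(sad):
        total = happy[word] + sad[word]
        # Skip rare words, whose ratio would be unreliable.
        if total >= min_count:
            factors[word] = 100.0 * happy[word] / total
    return factors


happy_docs = [["yay", "lunch"], ["yay"], ["lunch"]]
sad_docs = [["cry", "lunch"], ["cry"]]
factors = happiness_factors(happy_docs, sad_docs)
for word, score in sorted(factors.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 2))
```

Under this reading, a score near 100 marks a word used almost only in happy entries and a score near 0 one used almost only in sad entries.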
59
Conclusion
  • A brief glance was given: (semi-)automated forms of
    text analysis, applied to blogs, can produce
    useful insights
  • Only briefly mentioned today: it can profit
    from the simultaneous analysis of link structures
    and/or tags
  • It is the only way of analysing large-scale
    corpora; mining methods are improving
    continuously
  • However, machine language understanding is not
    human language understanding
  • Also, representativeness is questionable
  • → Need to combine blog mining with other forms of
    market research!
  • NB: These forms include Web usage mining, query
    mining, ...
  • But mining remains exploratory analysis

60
Thank you!
?
61
  • End

62
Outlook
  • Evaluation
    • Usually lacking from temporal text mining!
    • Information retrieval quality: how to find the/a
      ground truth?
    • (Subašić & Berendt, in press): encouraging results
  • What are the implications of genre (e.g., news
    vs. scientific)?
    • Narrative vs. declarative?
    • Different register (including vocabulary
      choices)?
    • One story vs. multiple story lines?

THANKS !
63
Our interest in THE WEB
  • What's out there?
  • How do people use it?
  • How can we make it more useful? (→ information
    literacy)

64
Our interest in (STORY) EVOLUTION
  • What happened?
  • Presidential elections
  • Crime stories
  • How did Saddam become an al-Qaida member?
  • What did genes do before they transmitted
    information?
  • Link to anyone who likes the Web programming
  • Link to other theories of evolution?

65
Understanding: A continuous interplay between
different representations
  • Saddam is friend of US; reasons: text 1, text 2,
    text 3, ...
  • <something happens>
  • Saddam is foe of US; reasons: text 10, text 11, ...

66
The application problem
What happened?
67
... very related to a typical bibliometric problem
What happened?
What happened?
68
Ingredients of a solution and the STORIES
approach
Agenda
The problem
Demonstration
Evaluation
Differences between application areas?!
69
A case study
http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/05/22/nmaddy122.xml
70
The story unfolds new actors enter the stage
(and old ones change their roles)
Manual!
71
Basic observation: A story is about relational
statements
Manual!
Robert Murat - suspect
Kate McCann - suspect
72
Solution approach 1: Find latent topics
  • temporal development only by comparative statics
  • no drill down possible
  • no fine-grained relational information
  • → lacks structure

Tool: Blaž Fortuna, http://docatlas.ijs.si
73
Solution approach 2: Temporal latent topics
  • no fine-grained relational information
  • themes are fixed by the algorithm
  • no drill down possible
  • → no combination of machine and human intelligence

Mei & Zhai, PKDD 2005
74
The ETP3 problem
  • Evolutionary theme patterns: discovery, summary
    and exploration
  • identify topical sub-structure in a set
    (generally, a time-indexed stream) of documents
    constrained by being about a common topic
  • show how these substructures emerge, change, and
    disappear (and maybe re-appear) over time
  • give users intuitive and interactive interfaces
    for exploring the topic landscape and the
    underlying documents; use machine-generated
    summarization only as a starting point!

75
Ingredients of a solution
Document / text pre-processing
  • Template recognition
  • Multi-document named entities
  • Stopword removal, lemmatization

Interaction approach
  • Graphs (and layout)
  • Comparative statics or morphing
  • Drill-down: uncovering relations
  • Links to documents (in progress)

Document summarization strategy
  • no topics, but salient concepts and relations
  • time window and word-span window

Selection approach for concepts
  • concepts: words or named entities
  • salient concept: high TF and involved in a
    salient relation, time-indexed

Similarity measure to determine relations
  • bursty co-occurrence

Burstiness measure
  • time relevance, a temporal co-occurrence lift

ETP3
STORIES

76
Data collection and preprocessing
  • Articles from Google News, 05/2007 to 11/2007, for
    search term "madeleine mccann"
    • (there was a Google problem in the December
      archive)
  • Only English-language articles
  • For each month, the first 100 hits
  • Of these, all that were freely available → 477
    documents
  • Preprocessing
    • HTML cleaning
    • tokenization
    • stopword removal

79
Salient story elements
  • Split whole corpus T by week (17: 30 Apr until
    44: 12 Nov)
  • For each week
    • Compute the weights for corpus t for this week
    • Weight
      • Support of co-occurrence of 2 content-bearing
        words w1, w2 in t
      • (# articles from t containing both w1, w2 in
        window) / (# all articles in t)
    • Threshold
      • Number of occurrences of co-occurrence(w1, w2) in
        t ≥ θ1 (e.g., 5)
      • Time-relevance TR of co-occurrence(w1, w2) =
        support(co-occurrence(w1, w2)) in t /
        support(co-occurrence(w1, w2)) in T ≥ θ2 (e.g.,
        2)
    • Rank by TR, for each week identify top 2
  • Story elements: peak words, i.e. all elements of
    these top 2 pairs (38)

80
Salient story stages, and story evolution
  • Story stage: co-occurrences of peak words in t
  • For each week t: aggregate over t-2, t-1, t →
    moving average
  • Story evolution: how story stages evolve over
    the t in T
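
The moving-average aggregation over t-2, t-1, t can be sketched as below. The input format (one `Counter` of word-pair counts per week) and the handling of the first two weeks, which simply average over the weeks available so far, are assumptions of this example.

```python
from collections import Counter


def moving_average_stages(weekly_counts, span=3):
    """Average each pair's weekly co-occurrence count over the last `span` weeks.

    weekly_counts: list of Counter objects, one per week, keyed by word pair.
    Early weeks average over the weeks available so far.
    """
    stages = []
    for t in range(len(weekly_counts)):
        window = weekly_counts[max(0, t - span + 1): t + 1]
        summed = Counter()
        for counts in window:
            summed.update(counts)
        stages.append({pair: n / len(window) for pair, n in summed.items()})
    return stages


weeks = [
    Counter({("mother", "suspect"): 3}),
    Counter(),
    Counter({("mother", "suspect"): 6, ("police", "search"): 3}),
]
for t, stage in enumerate(moving_average_stages(weeks)):
    print(t, stage)
```

Smoothing keeps a pair visible through a quiet week, so the sequence of stages changes gradually rather than flickering from week to week.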

81
Demonstration
84
Blogs and other social media: Where tagging (=
adding keywords) is most prominent
Bookmarks
Annotation platforms (e.g., del.icio.us)
Reader tags
Usenet
Blogs (e.g., LiveJournal, Huffington Post)
Sharing / linking by: hyperlinks, comments,
blogroll, trackback links
Author tags
Diaries
(Often political) journalism
PR press releases
86
Traditional methods of market research
  • Based on questioning
    • Qualitative MR: generally used for exploratory
      purposes; small number of respondents; not
      generalizable to the whole population;
      statistical significance and confidence not
      calculated; examples include focus groups,
      in-depth interviews, and projective techniques
    • Quantitative MR: generally used to draw
      conclusions; tests a specific hypothesis; uses
      random sampling techniques so as to infer from
      the sample to the population; involves a large
      number of respondents; examples include surveys
      and questionnaires. Techniques include choice
      modelling, maximum difference preference scaling,
      and covariance analysis.
  • Based on observations
    • Ethnographic studies: observe social phenomena
      in their natural setting; observations can occur
      cross-sectionally or longitudinally; examples
      include product-use analysis and computer cookie
      traces.
    • Experimental techniques: create a
      quasi-artificial environment to try to control
      spurious factors, then manipulate at least one
      of the variables; examples include purchase
      laboratories and test markets.

87
Lexicon dependence ... (does not work for the
Clintons either!! Family relations must be stated
in the text)
88
?
89
Who does market research?
  • Companies
  • Journalists
  • Politicians
  • ...

92
Capturing online buzz
Pattern type 1: bursty
93
Comparing search volume, news and blogs
94
Capturing online buzz: pattern type 1 (in news
and blogs) / type 2 (in search): smooth trend
95
Capturing online buzz: pattern type 2 (in news
and search) / type 3 (in blogs): cyclic

97
The steps of text mining (e.g., for blogs
analysis)
  • Application understanding
  • Corpus generation
  • Data understanding
  • Text preprocessing
  • Search for patterns / modelling
    • Topical analysis
    • Sentiment analysis / opinion mining
  • Evaluation
  • Deployment

98
What's market research?
  • a form of business research
  • identification, collection, analysis, and
    dissemination of information
  • for the purpose of assisting management in
    decision making related to the identification and
    solution of problems and opportunities in
    marketing.
  • two categories: consumer market research and
    business-to-business (B2B) market research
  • Consumer market research
    • understanding the behaviours and preferences of
      consumers in a market-based economy, and aims to
      understand the effects and comparative success of
      marketing campaigns.