1. Advanced databases. Inferring new knowledge from data(bases): Text Mining II
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://people.cs.kuleuven.be/bettina.berendt/teaching
Last update: 28 December 2011
2. Agenda
- Some advanced forms of text mining (index7.ppt, pp. 32-47)
- Recall: the importance of business and data understanding (BU and DU) for knowledge discovery
- The Twitter Study and its questions: BU and DU
- Relations between texts
- Content analysis as a method for generating ground-truth annotations
5. Motivation for association-rule learning/mining: store layout (Amazon, earlier Wal-Mart, ...)
Where to put spaghetti, butter?
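The store-layout question can be made concrete with the two standard association-rule measures, support and confidence. The following is only a minimal sketch: the baskets are invented for illustration, and the function names `support` and `confidence` are hypothetical helpers, not part of the lecture material.

```python
# Hypothetical market baskets, invented purely for illustration.
baskets = [
    {"spaghetti", "butter", "tomato sauce"},
    {"spaghetti", "tomato sauce"},
    {"butter", "bread"},
    {"spaghetti", "butter"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    antecedent_support = support(antecedent, baskets)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs in the baskets")
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / antecedent_support


print(support({"spaghetti", "butter"}, baskets))
print(confidence({"spaghetti"}, {"butter"}, baskets))
```

A rule such as spaghetti -> butter with high support and confidence would be one (simplified) argument for placing the two products near each other.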
6. What makes people happy?
8. News and social media, in particular tweets
9. Recall: CRISP-DM
- CRISP-DM: CRoss Industry Standard Process for Data Mining
- A data mining process model that describes commonly used approaches that expert data miners use to tackle problems.
10. Business understanding
11. Data understanding
13. What are the relations between these text (parts)?
14. Or these?
15. A list of possible (and interesting) text relations in the News/Blogs/Tweets domain (relation: tweet -> news article)
- Repetition (could be more interesting if repeated: repetition/retweet -> repetition weights?)
- Repetition of the headline
- Pointing to interesting links (difficult to identify? Need to process the link; it might involve redirection)
- Pointing to the article
- Anything becomes more important if it is retweeted (endorsement?)
- That may depend on WHO (re)tweets it, measured e.g. by number of followers
- Comment
- Reference to an event or topic via a hashtag (Obama election). Hashtags can be used to identify a topic that might also be present in news articles (NAs) (being-about-the-same-topic) -> learn from the words around the texts, and from co-occurring hashtags
- Use SentiStrength to determine whether a text has a positive or negative relationship with a tweet (endorsement / criticism)
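Several of the relations listed above rest on surface cues that can be read directly off the tweet text: links, hashtags, retweet markers, and word overlap with a headline. The following is a minimal sketch under stated assumptions. The regular expressions are deliberately simplistic, the example tweet and headline are invented, and `tweet_features` and `headline_overlap` are hypothetical helper names rather than part of the Twitter study.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
RETWEET_RE = re.compile(r"\bRT\b")


def tweet_features(tweet):
    """Extract simple surface cues from the raw tweet text."""
    return {
        "links": URL_RE.findall(tweet),
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(tweet)],
        "mentions": MENTION_RE.findall(tweet),
        "is_retweet": bool(RETWEET_RE.search(tweet)),
    }


def tokens(text):
    """Lower-cased word set, with URLs removed first."""
    without_urls = URL_RE.sub(" ", text)
    return set(re.findall(r"[a-z0-9]+", without_urls.lower()))


def headline_overlap(tweet, headline):
    """Share of headline words that also occur in the tweet (0.0 to 1.0)."""
    headline_tokens = tokens(headline)
    if not headline_tokens:
        return 0.0
    return len(headline_tokens & tokens(tweet)) / len(headline_tokens)


# Invented example, not taken from the study's corpus.
tweet = "RT @newsdesk: Oil spill reaches coast http://example.org/a1 #oilspill"
headline = "Oil spill reaches coast"
print(tweet_features(tweet))
print(headline_overlap(tweet, headline))
```

A high overlap value is only a hint at "repetition of the headline"; whether it counts as such remains a decision for the coding scheme.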
17. What is Content Analysis?
- A form of textual analysis (usually)
- Categorizes chunks of text according to Code
- Blend of qualitative and quantitative
Schwandt, Thomas A. Dictionary of Qualitative Inquiry. 2nd ed. Sage Publications: Thousand Oaks, CA, 2001.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6). http://www.geocities.com/licinius/washington/contentanalysis.ppt
18. Rough History - 1
- Classical Content Analysis
  - Used as early as the 1930s in military intelligence
  - Analyzed items such as communist propaganda and military speeches for themes
  - Created matrices searching for the number of occurrences of particular words/phrases
Roberts, C.W. "Content Analysis." International Encyclopedia of the Social and Behavioral Sciences. Elsevier: Amsterdam, 2001.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6). http://www.geocities.com/licinius/washington/contentanalysis.ppt
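The occurrence matrices of classical content analysis can be illustrated with a few lines of code. This is only a sketch: the texts and phrases are invented, `occurrence_matrix` is a hypothetical helper name, and the normalization (lower-casing, dropping punctuation) is an assumption that ignores sentence boundaries.

```python
import re


def occurrence_matrix(texts, phrases):
    """Return one row per text and one column per phrase, holding counts."""
    matrix = []
    for text in texts:
        # Normalize to lower-case words separated by single spaces.
        normalized = " ".join(re.findall(r"[a-z0-9']+", text.lower()))
        row = []
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
            row.append(len(re.findall(pattern, normalized)))
        matrix.append(row)
    return matrix


# Invented example texts and phrases.
texts = [
    "The enemy will fall. The enemy is weak.",
    "Peace and victory for the people.",
]
phrases = ["enemy", "victory", "the people"]

for text, row in zip(texts, occurrence_matrix(texts, phrases)):
    print(row, text)
```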
19. Rough History - 2
- (New) Content Analysis
  - Moved into social science research
  - Used to study trends in media and politics, and provides a method for analyzing open-ended questions
  - Can include visual documents as well as texts
  - More of a focus on phrasal/categorical entities than on simple word counting
("New" is my own terminology; it is more generally referred to simply as Content Analysis.)
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6). http://www.geocities.com/licinius/washington/contentanalysis.ppt
20. Procedure
- Identify a corpus of texts and a sample population
- Determine the unit of analysis
- Find themes (inductive or deductive)
- Build a codebook
- Mark the texts
- Analyze the codes from the texts quantitatively
Denzin, Norman K. Handbook of Qualitative Research. Sage Publications: Thousand Oaks, CA, 2000.
From Eric S. Riley (n.d.) Content Analysis (pp. 3-6). http://www.geocities.com/licinius/washington/contentanalysis.ppt
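The last step of the procedure, analyzing the codes quantitatively, can start with a plain frequency table. The sketch below assumes that the marked texts are available as (unit id, code) pairs; the unit ids, the codes, and the function name `code_frequencies` are all invented for illustration.

```python
from collections import Counter


def code_frequencies(coded_units):
    """Map each code to (absolute count, relative share), most frequent first."""
    counts = Counter(code for _unit_id, code in coded_units)
    total = sum(counts.values())
    return {code: (n, n / total) for code, n in counts.most_common()}


# Invented coding result: one code per unit of analysis.
coded_units = [
    ("t1", "endorsement"),
    ("t2", "criticism"),
    ("t3", "endorsement"),
    ("t4", "summary"),
]

for code, (count, share) in code_frequencies(coded_units).items():
    print(f"{code}: {count} ({share:.0%})")
```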
21. Coding
- Analyzing the archived content. Includes:
  - 1. Identifying units of analysis (e.g., individual user posts, game characters)
  - 2. Creating a codebook
  - 3. Creating coding sheets (may be electronic now)
  - 4. Training, coding, intercoder reliability assessment, etc.
From Paul Skalski (n.d.) Content Analysis of Interactive Media. http://academic.csuohio.edu/kneuendorf/c63309/Interactive09.ppt, p. 10
22. Examples
- To be shown in class
- Overview of an example from the Social Web: see http://academic.csuohio.edu/kneuendorf/c63309/Interactive09.ppt, pp. 11ff.
- Further resources include:
  - A detailed example of a codebook for Content Analysis of Stories about Protest Events: http://www.ssc.wisc.edu/oliver/PROTESTS/ArticleCopies/codebook2000.htm
  - More examples of codebooks and coding schemes: http://academic.csuohio.edu/kneuendorf/content/hcoding/hcindex.htm
23. Thus
24. A first project plan (for HWs 4 and 6): HW 4
- PHASE Data understanding / initial data collection of the class attribute
  - In different teams:
    - come up with different possible relations between texts
    - find a small number of examples
      - NB: Sampling strategy?
    - develop a codebook and coding scheme
    - have several coders code a larger number of examples
      - NB: Sampling strategy?
    - measure inter-rater agreement
      - http://en.wikipedia.org/wiki/Krippendorff%27s_Alpha
- PHASE "Pause": revisit the literature and re-evaluate it (not really a CRISP-DM phase)
  - Compare your results!
  - In the light of all this, revisit (as an example from the literature) the SentiStrength coding procedure and discuss it critically
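For the "measure inter-rater agreement" step, Krippendorff's alpha for nominal codes can be computed from a coincidence matrix. What follows is a minimal sketch of the nominal-data case only, assuming each unit is given as the list of codes its coders assigned (with `None` for a missing judgement). The function name `krippendorff_alpha_nominal` and the example codings are invented; for real analyses an established statistics package is the safer choice.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: the codes assigned to one unit by the
    coders who coded it. None marks a missing judgement.
    """
    coincidences = Counter()
    for codes in units:
        codes = [c for c in codes if c is not None]
        m = len(codes)
        if m < 2:
            continue  # units with fewer than two judgements are not pairable
        # Every ordered pair of judgements from different coders counts 1/(m-1).
        for a, b in permutations(codes, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined: fewer than two categories in use")
    return 1 - (n - 1) * observed / expected


# Invented example: four tweets, each coded by two coders.
codings = [["R1", "R1"], ["R1", "R4"], ["R4", "R4"], ["R4", "R4"]]
print(krippendorff_alpha_nominal(codings))
```

An alpha of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance. The two coders in the invented example disagree on one of four tweets, giving an alpha of about 0.53.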
25. A first project plan (for HWs 4 and 6): HW 6
- PHASE Data preparation
  - You may skip most of this phase. Take the data as prepared by Ilija!
- PHASE Modelling
  - Understand / develop (depending on time and interest) formal measures of such text relations
  - Calculate the measures for the corpora
  - Calculate the accuracy of classification
- PHASE Evaluation
  - Do an error analysis. Be critical with yourself, the results, and their meaning for the initial question :-)
- PHASE Deployment
  - Produce final report
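The accuracy calculation and a first step of the error analysis could look like the sketch below. It assumes that the manually coded relation labels serve as ground truth and that some classifier has produced one predicted label per tweet. The label sequences are invented, and `accuracy` and `most_common_errors` are hypothetical helper names.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Share of items whose predicted label equals the ground-truth label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def most_common_errors(gold, predicted, k=3):
    """The k most frequent (gold label, predicted label) confusions."""
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return errors.most_common(k)


# Invented labels for five tweets.
gold = ["R1", "R2", "R9", "R4", "R2"]
predicted = ["R1", "R1", "R9", "R6", "R1"]

print(accuracy(gold, predicted))
print(most_common_errors(gold, predicted))
```

Listing the most frequent confusions gives the error analysis a starting point: they show which relations the measure cannot tell apart, which may in turn point back to overlapping categories in the codebook.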
26. First round of relations
- R1: summary
- R2: repetition of headline
- R3: (the tweet is a) link (to the article)
- R4: (the tweet is a) link to another article on the same topic
- R5: comment on the article's content
- R6: comment on a topic related to the article
- R7: comment on the article
  - (note: only if there is a link to the article!)
- R8: hashtag-about-topic
- R9: endorsement of the article
- R10: endorsement of its content
- R11: criticism of the article
- R12: criticism of its content
27. Problems/observations
- Non-English tweets
  - TODO: language classification or different story selection
- Overlapping categories: headline repetition + link to article (this is likely to happen on sites that have an automatic tweet generator)
  - TODO: new category
- Hashtag missing (sometimes; only in the Oil Spill data?)
- Comment on a tweet that commented on an article (in these tweets, there is a syntactic indicator of retweeting: RT) AND most retweets are comments on the retweeted text
  - TODO: new category "indirect comment"
- Use the article as an argument
  - TODO: new category: link works as repetition; simplify to category "link to the article" (a suggestion to someone to read it)
- @ = answer, RT = spread in your own network; both may contain commenting (but an answer presupposes the recipient knows what this is about) -> exclude answer tweets?!
28. Problems/observations (2)
- Overlapping (see above)
- Found an instance of only-link (not yet clear to what; some links don't work)
- Headline sentence (as far as the 140 characters allow) + link
- No relation: topic too big (Iraq war vs. Iraqi economy)
29. Second round of relations: the manually annotated tweet is a ... of some news article / other text
- R1: Summary with link
- R2: Headline with link
- R3: Summary or headline without link
- R4: Endorsement with link
- R5: Endorsement without link
- R6: Criticism with link
- R7: Criticism without link
- R8: Otherwise emotionally charged text
- R9: Just a link
- R10: Comment on another tweet (rule: always involves RT or @)
- R11: Enriching another tweet (rule: always involves RT)
- Rule: if there is a link, try to check it to see whether the tweet text repeats the headline
- R12: OTHER
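The syntactic rules attached to the second-round relations (with or without link, RT, @) can narrow down the candidate relations before a human coder decides. The sketch below treats the rules as necessary conditions only and is an assumption-laden illustration. `candidate_relations` is a hypothetical helper, "just a link" is read as "the tweet contains nothing but a link", and the headline check is left out because it would require fetching the linked article.

```python
import re

URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"\bRT\b")

ALL_RELATIONS = {f"R{i}" for i in range(1, 13)}
NEED_LINK = {"R1", "R2", "R4", "R6", "R9"}  # relations defined "with link"
NEED_NO_LINK = {"R3", "R5", "R7"}           # relations defined "without link"


def candidate_relations(tweet):
    """Relations that the stated syntactic rules do not exclude."""
    has_link = bool(URL_RE.search(tweet))
    has_rt = bool(RETWEET_RE.search(tweet))
    has_at = "@" in tweet
    text_without_links = URL_RE.sub("", tweet).strip()

    if has_link and not text_without_links:
        return {"R9"}  # nothing but a link

    candidates = set(ALL_RELATIONS)
    candidates -= NEED_NO_LINK if has_link else NEED_LINK
    candidates.discard("R9")  # there is text, so it is not "just a link"
    if not (has_rt or has_at):
        candidates.discard("R10")  # rule: always involves RT or @
    if not has_rt:
        candidates.discard("R11")  # rule: always involves RT
    return candidates


# Invented example tweets.
print(sorted(candidate_relations("http://example.org/a1")))
print(sorted(candidate_relations("Oil spill reaches coast http://example.org/a1")))
print(sorted(candidate_relations("RT @anna: totally agree")))
```

The function only excludes relations. Choosing among the remaining candidates (for example, endorsement versus criticism) is still a matter for the coders.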
30. Outlook
- Some advanced forms of text mining (index7.ppt, pp. 32-47)
- Recall: the importance of business and data understanding (BU and DU) for knowledge discovery
- The Twitter Study and its questions: BU and DU
- Relations between texts
- Content analysis as a method for generating ground-truth annotations
- Notes about language modelling and about inference on/with/for the Semantic Web
31. References / background reading
- Stemler, Steve (2001). An overview of content analysis. Practical Assessment, Research & Evaluation, 7(17). http://PAREonline.net/getvn.asp?v=7&n=17
  - This describes, among other things, the classic book in the field:
  - Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Newbury Park, CA: Sage.
- The CRISP-DM manual can be found at http://www.spss.ch/upload/1107356429_CrispDM1.0.pdf
- Our Twitter study:
  - Subašić, I. & Berendt, B. (2011). Peddling or Creating? Investigating the Role of Twitter in News Reporting. In Proceedings of ECIR 2011 (207-213). Berlin etc.: Springer. LNCS 6611.
  - http://people.cs.kuleuven.be/bettina.berendt/Papers/subasic_berendt_2011.pdf