1
Last update: 28 December 2011
Advanced databases: Inferring new knowledge
from data(bases): Text Mining II
Bettina Berendt
Katholieke Universiteit Leuven, Department of
Computer Science
http://people.cs.kuleuven.be/bettina.berendt/teaching
2
Agenda
  • Some advanced forms of text mining (index7.ppt,
    pp. 32-47)
  • Recall: The importance of business and data
    understanding (BU & DU) for knowledge discovery
  • The Twitter Study and its questions: BU & DU
  • Relations between texts
  • Content analysis as a method for
    generating ground-truth annotations
5
Motivation for association-rule learning/mining:
store layout (Amazon, earlier Wal-Mart, ...)
Where to put spaghetti, butter?
6
What makes people happy?
8
News and social media, in particular tweets
9
Recall: CRISP-DM
  • CRISP-DM: CRoss Industry Standard Process for Data Mining
  • A data mining process model that describes
    approaches commonly used by expert data miners
    to tackle problems.

10
Business understanding
11
Data understanding
13
What are the relations between these text (parts)?
14
Or these?
15
A list of possible (and interesting) text
relations in the News/Blogs/Tweets domain
(relation: tweet -> news article)
  • Repetition (could be more interesting if repeated:
    repetition / retweet -> repetition weights?)
  • Repetition of the headline
  • Pointing to interesting links (difficult to
    identify? Need to process the link; there might be
    redirection)
  • Pointing to the article
  • Anything becomes more important if it's
    retweeted (endorsement?)
  • That may depend on WHO (re)tweets it, measured
    e.g. by number of followers
  • Comment
  • Reference to an event or topic via a hashtag (Obama
    election): hashtags can be used to identify a
    topic that might also be present in news articles (NAs)
    (being-about-the-same-topic) -> learn from the
    words around the texts, and from co-occurring hashtags
  • Use SentiStrength to determine whether a text has a
    positive or negative relationship with a tweet
    (endorsement / criticism)

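One of the relations listed above, "repetition of the headline", can be approximated with a simple word-overlap measure. The following is a minimal sketch, assuming Jaccard similarity over lowercase word tokens as the measure; this choice, the function names and the example texts are illustrative assumptions and do not come from the slides.

```python
import re

def tokens(text):
    """Lowercase word tokens; URLs are removed so links do not count as overlap."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return set(re.findall(r"[a-z0-9']+", text))

def jaccard(a, b):
    """Jaccard similarity of the token sets of two texts (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Made-up example texts, for illustration only.
headline = "Oil spill reaches the coast"
tweet = "Oil spill reaches the coast http://example.org/a1"
print(jaccard(tweet, headline))
```

A high score suggests the tweet mostly repeats the headline; any threshold for "repetition" would have to be chosen against hand-coded examples.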
17
What is Content Analysis?
  • Usually a form of textual analysis
  • Categorizes chunks of text according to a code
  • Blend of qualitative and quantitative

Schwandt, Thomas A. Dictionary of Qualitative
Inquiry. 2nd ed. Sage Publications: Thousand
Oaks, CA, 2001.
From Eric S. Riley (n.d.) Content Analysis (pp.
3-6).
http://www.geocities.com/licinius/washington/contentanalysis.ppt
18
Rough History - 1
  • Classical Content Analysis
  • Used as early as the 1930s in military
    intelligence
  • Analyzed items such as communist propaganda and
    military speeches for themes
  • Created matrices recording the number of
    occurrences of particular words/phrases

Roberts, C.W. "Content Analysis." International
Encyclopedia of the Social and Behavioral
Sciences. Elsevier: Amsterdam, 2001.
From Eric S. Riley (n.d.) Content Analysis (pp.
3-6).
http://www.geocities.com/licinius/washington/contentanalysis.ppt
19
Rough History - 2
  • (New) Content Analysis
  • Moved into Social Science Research
  • Used to study trends in media and politics, and
    provides a method for analyzing open-ended questions
  • Can include visual documents as well as texts
  • More of a focus on phrasal/categorical entities
    than simple word counting

"(New) Content Analysis" is my own terminology; it is
more generally referred to as simply Content Analysis
From Eric S. Riley (n.d.) Content Analysis (pp.
3-6).
http://www.geocities.com/licinius/washington/contentanalysis.ppt
20
Procedure
  1. Identify a corpus of texts and a sample population
  2. Determine the unit of analysis
  3. Find themes (inductive or deductive)
  4. Build a codebook
  5. Mark the texts
  6. Analyze the codes from the texts quantitatively

Denzin, Norman K. Handbook of Qualitative
Research. Sage Publications: Thousand Oaks, CA,
2000.
From Eric S. Riley (n.d.) Content Analysis (pp.
3-6).
http://www.geocities.com/licinius/washington/contentanalysis.ppt
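The last step of the procedure, analyzing the codes quantitatively, can start with plain frequency counts. The following is a minimal sketch; the unit ids, the code names and the function name are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical coded units: (unit id, code assigned by a coder).
coded_units = [
    ("t1", "comment"),
    ("t2", "link"),
    ("t3", "comment"),
    ("t4", "endorsement"),
]

def code_frequencies(units):
    """Return {code: (count, relative frequency)} for a list of (id, code) pairs."""
    counts = Counter(code for _, code in units)
    total = sum(counts.values())
    return {code: (n, n / total) for code, n in counts.items()}

freqs = code_frequencies(coded_units)
for code, (n, share) in sorted(freqs.items()):
    print(f"{code}: {n} ({share:.0%})")
```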
21
Coding
  • Analyzing the archived content. Includes:
  • 1. Identifying units of analysis (e.g.,
    individual user posts, game characters)
  • 2. Creating a codebook
  • 3. Creating coding sheets (may be electronic now)
  • 4. Training, coding, intercoder reliability
    assessment, etc.

From Paul Skalski (n.d.) Content Analysis of
Interactive Media.
http://academic.csuohio.edu/kneuendorf/c63309/Interactive09.ppt, p. 10
22
Examples
  • To be shown in class
  • Overview of an example from the Social Web: see
    http://academic.csuohio.edu/kneuendorf/c63309/Interactive09.ppt,
    pp. 11ff.
  • Further resources include:
  • A detailed example of a codebook for Content
    Analysis of Stories about Protest Events:
    http://www.ssc.wisc.edu/oliver/PROTESTS/ArticleCopies/codebook2000.htm
  • More examples of codebooks and coding schemes:
    http://academic.csuohio.edu/kneuendorf/content/hcoding/hcindex.htm

23
Thus
24
A first project plan (for HWs 4 and 6): HW 4
  • PHASE: Data understanding / initial data
    collection of the class attribute
  • In different teams:
  • Come up with different possible relations between
    texts
  • Find a small number of examples
  • NB: Sampling strategy?
  • Develop a codebook and coding scheme
  • Have several coders code a larger number of
    examples
  • NB: Sampling strategy?
  • Measure inter-rater agreement
  • http://en.wikipedia.org/wiki/Krippendorff%27s_Alpha
  • PHASE: Pause. Revisit the literature and
    re-evaluate it (not really a CRISP-DM phase)
  • Compare your results!
  • In the light of all this, revisit (as an example
    from the literature) the SentiStrength coding
    procedure and discuss it critically

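For the "measure inter-rater agreement" step, the following is a minimal sketch of Krippendorff's alpha for nominal codes, computed from the coincidence matrix. The function name, the input layout (one list of labels per coded unit, `None` for a missing code) and the example labels are assumptions made for illustration; an established statistics package may be preferable for real analyses.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list per coded unit, holding the labels given by the coders
    (None = this coder did not code the unit). Returns alpha for nominal data."""
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # a unit coded by fewer than two coders cannot be paired
        # Every ordered pair of labels from two different coders counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # undefined when at most one category was used
    return 1.0 - (n - 1) * observed / expected

# Hypothetical codes from three coders for four tweets.
coded = [
    ["comment", "comment", "comment"],
    ["link", "link", None],
    ["comment", "endorsement", "comment"],
    ["endorsement", "endorsement", "endorsement"],
]
print(krippendorff_alpha_nominal(coded))
```

A value of 1 means perfect agreement and a value near 0 means agreement at chance level, so the coders' disagreements point at codebook entries that need sharper definitions.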
25
A first project plan (for HWs 4 and 6): HW 6
  • PHASE: Data preparation
  • You may skip most of this phase. Take the data as
    prepared by Ilija!
  • PHASE: Modelling
  • Understand / develop (depending on time and
    interest) formal measures of such text relations
  • Calculate the measures for the corpora
  • Calculate the accuracy of classification
  • PHASE: Evaluation
  • Do an error analysis. Be critical with yourself,
    the results, and their meaning for the initial
    question :-)
  • PHASE: Deployment
  • Produce final report

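The modelling and evaluation steps above ask for classification accuracy and an error analysis. The following is a minimal sketch of both, comparing hand-coded ("gold") relation labels with predicted ones; the labels and function names are hypothetical.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Share of items whose predicted label equals the gold (hand-coded) label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def confusions(gold, predicted):
    """Count (gold, predicted) pairs; pairs with differing labels are the errors."""
    return Counter(zip(gold, predicted))

# Hypothetical labels for four tweets.
gold = ["comment", "link", "comment", "endorsement"]
pred = ["comment", "link", "endorsement", "endorsement"]

print(accuracy(gold, pred))
for (g, p), n in confusions(gold, pred).items():
    if g != p:
        print(f"gold={g} predicted={p}: {n}")
```

Listing the mismatched pairs is a starting point for the error analysis: it shows which relations the measures confuse, which accuracy alone hides.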
26
First round of relations
  • R1: summary
  • R2: repetition of headline
  • R3: (the tweet is a) link (to the article)
  • R4: (the tweet is a) link to another article on
    the same topic
  • R5: comment on the article's content
  • R6: comment on a topic related to the article
  • R7: comment on the article
  • (note: only if there is a link to the article!)
  • R8: hashtag-about-topic
  • R9: endorsement of the article
  • R10: endorsement of its content
  • R11: criticism of the article
  • R12: criticism of its content

27
Problems/observations
  • Non-English tweets
  • TODO: language classification or different story
    selection
  • Overlapping categories: headline repetition +
    link to article (this is likely to happen on
    sites that have an automatic tweet generator)
  • TODO: new category
  • Hashtag missing (sometimes; only in the Oil
    Spill data?)
  • Comment on a tweet that commented on an article
    (in these tweets, there is a syntactic indicator
    of retweeting: RT) AND most retweets are comments
    on the retweeted text
  • TODO: new category "indirect comment"
  • Use the article as an argument
  • TODO: new category; "link" works as repetition;
    simplify to category "link to the article" (a
    suggestion to someone to read it)
  • @ = answer, RT = spread in your own network; both
    may contain commenting (but an answer presupposes
    the recipient knows what this is about) -> exclude
    answer tweets?!

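The syntactic indicators mentioned above (RT, @, links, hashtags) can be pulled out of a tweet with a few regular expressions. The following is a minimal sketch; the patterns are simplifying assumptions rather than Twitter's own tokenization rules, and the example tweet is made up.

```python
import re

URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"\bRT\b")            # the plain-text retweet marker
MENTION_RE = re.compile(r"(?<!\w)@\w+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")

def tweet_indicators(text):
    """Extract the surface indicators the coding rules refer to."""
    without_urls = URL_RE.sub(" ", text)  # avoid matching '#' or '@' inside URLs
    return {
        "is_retweet": bool(RT_RE.search(without_urls)),
        "mentions": MENTION_RE.findall(without_urls),
        "hashtags": HASHTAG_RE.findall(without_urls),
        "links": URL_RE.findall(text),
        "starts_with_mention": text.lstrip().startswith("@"),
    }

# Made-up example tweet.
example = "RT @newsdesk: Oil spill reaches the coast http://example.org/a1 #oilspill"
indicators = tweet_indicators(example)
print(indicators)
```

The `starts_with_mention` flag is one possible way to spot "answer" tweets, should they be excluded as suggested above.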
28
Problems/observations (2)
  • Overlapping (see above)
  • Found an instance of only-link (not yet clear to
    what; some links don't work)
  • Headline sentence (as far as the 140 characters
    allow) + link
  • No relation: topic too big (Iraq war vs. Iraqi
    economy)

29
Second round of relations: the manually
annotated tweet is a ___ of some news article /
other text
  • R1: Summary w/ link
  • R2: Headline w/ link
  • R3: Summary or headline w/o link
  • R4: Endorsement w/ link
  • R5: Endorsement w/o link
  • R6: Criticism w/ link
  • R7: Criticism w/o link
  • R8: Otherwise emotionally charged text
  • R9: Just a link
  • R10: Comment on another tweet (rule: always
    involves RT or @)
  • R11: Enriching another tweet (rule: always
    involves RT)
  • Rule: if there is a link, try to check it to see
    whether the tweet text repeats the headline
  • R12: OTHER

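The rules attached to the second-round categories can narrow down which codes a coder (or a classifier) needs to consider for a given tweet. The following is a minimal sketch under one possible reading of those rules; the function name, the boolean inputs and the treatment of R8 and R9 are assumptions, not part of the codebook.

```python
def candidate_relations(has_link, has_rt, has_mention, has_other_text):
    """Return the second-round relation codes that the stated rules leave possible."""
    if has_link and not has_other_text:
        return ["R9"]  # assumption: "just a link" means a link and nothing else
    candidates = []
    if has_link:
        candidates += ["R1", "R2", "R4", "R6"]  # the "w/ link" categories
    else:
        candidates += ["R3", "R5", "R7"]        # the "w/o link" categories
    candidates.append("R8")   # assumption: no link rule is stated for R8
    if has_rt or has_mention:
        candidates.append("R10")  # rule: always involves RT or @
    if has_rt:
        candidates.append("R11")  # rule: always involves RT
    candidates.append("R12")  # OTHER always remains possible
    return candidates

print(candidate_relations(has_link=True, has_rt=False,
                          has_mention=True, has_other_text=True))
```

Deciding among the remaining candidates (for example summary versus endorsement) still requires reading the tweet and, where there is a link, checking the linked headline.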
30
Outlook
  • Some advanced forms of text mining (index7.ppt,
    pp. 32-47)
  • Recall: The importance of business and data
    understanding (BU & DU) for knowledge discovery
  • The Twitter Study and its questions: BU & DU
  • Relations between texts
  • Content analysis as a method for
    generating ground-truth annotations
  • Notes about language modelling and about
    inference on/with/for the Semantic Web
31
References / background reading
  • Stemler, Steve (2001). An overview of content
    analysis. Practical Assessment, Research &
    Evaluation, 7(17).
    http://PAREonline.net/getvn.asp?v=7&n=17
  • This describes, among other things, the classic
    book in the field:
  • Krippendorff, K. (1980). Content Analysis: An
    Introduction to Its Methodology. Newbury Park,
    CA: Sage.
  • The CRISP-DM manual can be found at
    http://www.spss.ch/upload/1107356429_CrispDM1.0.pdf
  • Our Twitter study:
  • Subašić, I., & Berendt, B. (2011). Peddling or
    Creating? Investigating the Role of Twitter in
    News Reporting. In Proceedings of ECIR 2011
    (pp. 207-213). Berlin etc.: Springer. LNCS 6611.
  • http://people.cs.kuleuven.be/bettina.berendt/Papers/subasic_berendt_2011.pdf