Carmen Banea, Rada Mihalcea - PowerPoint PPT Presentation

Slides: 19

Provided by: Carm185

Learn more at: http://www.lrec-conf.org

Transcript and Presenter's Notes

1
A Bootstrapping Method for Building Subjectivity
Lexicons for Languages with Scarce Resources

Carmen Banea, Rada Mihalcea
University of North Texas
carmenb_at_unt.edu, rada_at_cs.unt.edu

Janyce Wiebe
University of Pittsburgh
wiebe_at_cs.pitt.edu
2
Subjectivity analysis

Subjectivity analysis (opinions and sentiments)
Used in a wide variety of applications
Tracking sentiment timelines in news (Lloyd et al., 2005)
Review classification (Turney, 2002; Pang et al., 2002)
Mining opinions from product reviews (Hu and Liu, 2004)
Expressive text-to-speech synthesis (Alm et al., 2005)
Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006)
Question answering (Yu and Hatzivassiloglou, 2003)
Much work on subjectivity analysis has focused on English
Work on other languages includes Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)

3
Proportion of Languages on the Web
Source: internetworldstats.com, updated November 30, 2007
4
Objective

Develop a method for subjectivity analysis that
Requires few electronic resources
Can be easily ported to a new language
Applicable to the large number of languages that
have scarce electronic resources

5
Related Work

Tools that rely on manually or semi-automatically constructed lexicons
(Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006)
These lexicons enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text
These tools assume the availability of
Advanced language processing tools: syntactic parsers (Wiebe, 2000), information extraction (Riloff and Wiebe, 2003)
Broad-coverage rich lexical resources: WordNet (Esuli and Sebastiani, 2006)
Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity
We address the task of acquiring a subjectivity
lexicon
We rely on fewer, smaller-scale resources

6
Our Method

Based on bootstrapping
Requires
A small seed set of subjective entries
One or more electronic dictionaries
A small training corpus (approx. 500,000 words)
Experiments focused on Romanian
Applicable to other languages as well
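
The ingredients listed above suggest a simple expand-then-filter loop. The following is a minimal sketch in Python, not the authors' implementation. It assumes a hypothetical `dictionary` that maps each word to related words found in its dictionary entry, and a hypothetical `similarity` function standing in for the corpus-based measure. The toy data and all names are illustrative. The defaults (threshold 0.5, five iterations) match the settings reported later in the slides.

```python
from typing import Callable, Dict, Iterable, List, Set


def bootstrap(
    seeds: Iterable[str],
    dictionary: Dict[str, List[str]],
    similarity: Callable[[str], float],
    threshold: float = 0.5,
    iterations: int = 5,
) -> Set[str]:
    """Grow a lexicon from seeds by repeated expand-then-filter passes."""
    lexicon = set(seeds)
    frontier = set(seeds)  # words whose dictionary entries are expanded next
    for _ in range(iterations):
        # Expansion: gather related words from the entries of frontier words.
        candidates = {
            related
            for word in frontier
            for related in dictionary.get(word, [])
        } - lexicon
        # Filtering: keep only candidates similar enough to the seeds.
        accepted = {c for c in candidates if similarity(c) > threshold}
        if not accepted:
            break  # nothing new survived the filter
        lexicon |= accepted
        frontier = accepted
    return lexicon


# Toy, hypothetical data (not from the presentation).
toy_dictionary = {
    "dulce": ["placut", "dulceag", "zahar"],
    "placut": ["agreabil"],
}
toy_scores = {"placut": 0.8, "dulceag": 0.7, "zahar": 0.2, "agreabil": 0.6}

lexicon = bootstrap(["dulce"], toy_dictionary, lambda w: toy_scores.get(w, 0.0))
```

In this sketch only the candidates accepted in one pass are expanded in the next, so a rejected word never contributes further candidates.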

7
Bootstrapping Process
8
Seed Set
Category: sample entries (with their English translation)
Noun: blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb: iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective: frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb: posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)

60 seeds, sampled evenly from verbs, nouns, adjectives and adverbs
Manually selected
Seed sources
11th-grade curriculum for Romanian Language and Literature
Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)

9
Expansion

Romanian dictionary: http://www.dexonline.ro
Dictionaries for other languages are also
available, or can be obtained from paper
dictionaries through OCR

10
Filtering

Candidates are filtered based on a measure of
similarity with the original seeds
We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993)
After each iteration, only candidates with an LSA
score higher than a given threshold are selected
for further expansion
Example
Seed: dulce (sweet)
Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
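
As a rough illustration of how an LSA-style score can drive this filter, the sketch below builds a term-by-document count matrix, truncates its SVD with NumPy, and compares term vectors by cosine. Everything here is an assumption made for illustration: the tiny English corpus, the latent dimension `k = 2`, and the choice to score a candidate by its highest cosine with any seed (the slides do not say how scores against several seeds are combined).

```python
import numpy as np

# Toy, hypothetical corpus: each string is one "document".
docs = [
    "sweet pleasant lovely taste",
    "pleasant lovely sweet smile",
    "sweet pleasant day",
    "tax report budget figure",
    "budget figure tax office",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-by-document count matrix.
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for w in doc.split():
        counts[index[w], j] += 1

# LSA: truncated SVD; rows of U_k * S_k are term vectors in the latent space.
k = 2
U, S, _ = np.linalg.svd(counts, full_matrices=False)
term_vectors = U[:, :k] * S[:k]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def seed_similarity(candidate, seeds):
    """Highest cosine between the candidate and any seed (assumed aggregation)."""
    vec = term_vectors[index[candidate]]
    return max(cosine(vec, term_vectors[index[s]]) for s in seeds)


threshold = 0.5
seeds = ["sweet"]
candidates = ["pleasant", "budget"]
kept = [c for c in candidates if seed_similarity(c, seeds) > threshold]
```

With this toy corpus, "pleasant" shares its contexts with the seed and passes the filter, while "budget" occurs only in unrelated documents and is dropped.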

11
Filtering

Several iterations of the bootstrapping process
will result in a subjectivity lexicon consisting
of a ranked list of candidates in decreasing
order of similarity to the original seeds
A variable filtering threshold can be applied to this ranked list to keep only the entries most similar to the seeds, yielding a purer lexicon
Filtering parameters
Similarity threshold
Number of iterations
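
The ranking and the variable threshold described on this slide can be sketched as follows. This is an illustrative example only: the scores are invented, and the function names `rank_lexicon` and `restrict` are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_lexicon(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort entries by decreasing similarity to the seeds."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def restrict(ranked: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep only entries whose score exceeds the threshold."""
    return [word for word, score in ranked if score > threshold]


# Hypothetical scores, for illustration only.
scores = {"placut": 0.82, "dulceag": 0.64, "cu gust dulce": 0.55, "zahar": 0.31}
ranked = rank_lexicon(scores)
loose = restrict(ranked, 0.5)   # larger lexicon
strict = restrict(ranked, 0.7)  # smaller lexicon, closer to the seeds
```

Raising the threshold trades lexicon size for closeness to the seeds, which is the trade-off the two filtering parameters control.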

12
Lexicon Acquisition
13
Evaluation

Rule-based classifier of subjectivity (Riloff and Wiebe, 2003)
Subjective sentence: three or more subjective entries
Objective sentence: two or fewer subjective entries
Gold standard data set
(Mihalcea, Banea and Wiebe, 2007)
504 sentences from five SemCor documents (manually translated into Romanian)
Labeled by two annotators
Agreement (all): 83% (κ = 0.67)
Agreement (uncertain removed): 89% (κ = 0.77)
Baseline: 54% (all subjective)
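
The classification rule on this slide (three or more subjective entries means subjective, otherwise objective) can be sketched in a few lines of Python. The sketch makes simplifying assumptions that the slides do not specify: whitespace tokenization, lowercase exact matching, single-word lexicon entries, and no lemmatization. The toy lexicon reuses a few seed words, and the token strings are illustrative rather than real Romanian sentences.

```python
from typing import Iterable, Set


def count_subjective_entries(tokens: Iterable[str], lexicon: Set[str]) -> int:
    """Count tokens that appear in the subjectivity lexicon."""
    return sum(1 for token in tokens if token.lower() in lexicon)


def classify(sentence: str, lexicon: Set[str]) -> str:
    """Label a sentence with the three-or-more rule from the slide."""
    hits = count_subjective_entries(sentence.split(), lexicon)
    return "subjective" if hits >= 3 else "objective"


# Toy lexicon built from a few of the seed words.
toy_lexicon = {"frumos", "dulce", "fericit", "iubi"}

label_a = classify("frumos dulce fericit", toy_lexicon)  # three lexicon hits
label_b = classify("frumos dulce casa", toy_lexicon)     # two lexicon hits
```

A practical version for a morphologically rich language such as Romanian would likely need lemmatization or stemming before lookup; that step is left out here.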

14
Number of Iterations
F-measure for the bootstrapping subjectivity
lexicon over 5 iterations and an LSA threshold of
0.5
15
Similarity Threshold
F-measure for the fifth bootstrapping iteration
for varying LSA scores
16
Comparison

The bootstrapping rule-based classifier uses a subjectivity lexicon of 3,913 entries, obtained through 5 iterations with a similarity threshold of 0.5

17
Conclusions

Our bootstrapping method uses few electronic
resources
A small seed set
One or more dictionaries
A small corpus of half a million words
A large subjectivity lexicon of approx. 4000
entries was extracted
Using an unsupervised rule-based classifier, a
subjectivity F-measure of 66.20 and an overall
F-measure of 61.69 can be achieved
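
For reference, the F-measure for one class is the harmonic mean of that class's precision and recall. The sketch below computes it from gold and predicted labels, assuming the balanced (F1) form. The label names and the toy data are hypothetical, and the slides do not state how the overall score is aggregated, so that part is not shown.

```python
from typing import Sequence


def f_measure(gold: Sequence[str], predicted: Sequence[str], label: str) -> float:
    """Balanced F-measure (F1) of `label`, given parallel label sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels, for illustration only.
gold = ["subj", "subj", "subj", "obj", "obj"]
pred = ["subj", "subj", "obj", "subj", "obj"]
score = f_measure(gold, pred, "subj")  # precision 2/3, recall 2/3
```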

18
Questions?
