Title: Carmen Banea, Rada Mihalcea
1. A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources
- Carmen Banea, Rada Mihalcea
  - University of North Texas
  - carmenb_at_unt.edu, rada_at_cs.unt.edu
- Janyce Wiebe
  - University of Pittsburgh
  - wiebe_at_cs.pitt.edu
2. Subjectivity analysis
- Subjectivity analysis (opinions and sentiments)
- Used in a wide variety of applications
  - Tracking sentiment timelines in news (Lloyd et al., 2005)
  - Review classification (Turney, 2002; Pang et al., 2002)
  - Mining opinions from product reviews (Hu and Liu, 2004)
  - Expressive text-to-speech synthesis (Alm et al., 2005)
  - Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006)
  - Question answering (Yu and Hatzivassiloglou, 2003)
- Much work on subjectivity analysis has focused on English
  - Work on other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)
3. Proportion of Languages on the Web
Source: internetworldstats.com, updated November 30, 2007
4. Objective
- Develop a method for subjectivity analysis that
  - Requires few electronic resources
  - Can be easily ported to a new language
- Applicable to the large number of languages that have scarce electronic resources
5. Related Work
- Tools that rely on manually or semi-automatically constructed lexicons
  - Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006
  - Enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text
- These tools assume the availability of
  - Advanced language processing tools
    - Syntactic parsers (Wiebe, 2000), information extraction (Riloff and Wiebe, 2003)
  - Broad-coverage rich lexical resources
    - WordNet (Esuli and Sebastiani, 2006)
- Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity
  - We address the task of acquiring a subjectivity lexicon
  - We rely on fewer, smaller-scale resources
6. Our Method
- Based on bootstrapping
- Requires
  - A small seed set of subjective entries
  - One/multiple electronic dictionaries
  - A small training corpus (approx. 500,000 words)
- Experiments focused on Romanian
  - Applicable to other languages as well
7. Bootstrapping Process
8. Seed Set
Category   Sample entries (with their English translation)
Noun       blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb       iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective  frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb     posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)
- 60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs
- Manually selected
- Seed sources
  - 11th-grade curriculum for Romanian Language and Literature
  - Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)
9. Expansion
- Romanian dictionary: http://www.dexonline.ro
- Dictionaries for other languages are also available, or can be obtained from paper dictionaries through OCR
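As a rough illustration of the expansion step, the sketch below looks up each seed in a dictionary and collects the related words as new candidates. The in-memory `TOY_DICTIONARY`, its entries, and the `expand` helper are hypothetical stand-ins; this is not the dexonline.ro interface, and real entries would have to be extracted from the dictionary itself.

```python
# Toy stand-in for an electronic dictionary: headword -> related words
# (for example synonyms or words from the definition). Illustrative entries only.
TOY_DICTIONARY = {
    "dulce": ["cu gust dulce", "placut", "dulceag"],
    "frumos": ["placut"],
}

def expand(seeds, dictionary):
    """Return words related to the seeds, skipping seeds and duplicates."""
    known = set(seeds)
    candidates = []
    for seed in seeds:
        for word in dictionary.get(seed, []):
            if word not in known:
                known.add(word)
                candidates.append(word)
    return candidates

# "urat" has no entry in the toy dictionary, so it contributes nothing.
candidates = expand(["dulce", "frumos", "urat"], TOY_DICTIONARY)
print(candidates)
```

Candidates are kept in lookup order and each word appears once, even when several seeds lead to it.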
10. Filtering
- Candidates are filtered based on a measure of similarity with the original seeds
  - We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993)
  - After each iteration, only candidates with an LSA score higher than a given threshold are selected for further expansion
- Example
  - Seed: dulce (sweet)
  - Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
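The filtering step can be sketched as a similarity cutoff. In the snippet below, the three-dimensional vectors are made-up placeholders for what an LSA model would produce, the word "zahar" is an invented negative example, and taking the maximum cosine similarity over the seeds is an assumption about how similarity to the seed set is aggregated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_candidates(seed_vectors, candidate_vectors, threshold):
    """Keep candidates whose best similarity to any seed exceeds the threshold."""
    kept = {}
    for word, vec in candidate_vectors.items():
        score = max((cosine(vec, s) for s in seed_vectors.values()), default=0.0)
        if score > threshold:
            kept[word] = score
    return kept

# Made-up vectors standing in for LSA representations (assumption).
seed_vectors = {"dulce": [0.9, 0.1, 0.0]}
candidate_vectors = {
    "placut": [0.8, 0.3, 0.1],
    "dulceag": [0.7, 0.2, 0.0],
    "zahar": [0.1, 0.1, 0.9],
}

kept = filter_candidates(seed_vectors, candidate_vectors, threshold=0.5)
print(sorted(kept))
```

With these placeholder vectors, the two candidates pointing in roughly the same direction as the seed pass the 0.5 cutoff and the third does not.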
11. Filtering
- Several iterations of the bootstrapping process will result in a subjectivity lexicon consisting of a ranked list of candidates in decreasing order of similarity to the original seeds
- A variable filtering threshold can be used to further restrict the similarity for a purer lexicon
- Filtering parameters
  - Similarity threshold
  - Number of iterations
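Putting expansion and filtering together, one possible shape of the bootstrapping loop is sketched below. The function name `bootstrap`, the two callables, and the toy data are assumptions for illustration: a hand-written dictionary and fixed scores replace the real dictionary and the LSA model, and the sketch returns only newly acquired words, not the seeds.

```python
def bootstrap(seeds, related_words, similarity, threshold=0.5, iterations=5):
    """Expand seeds iteratively, keeping candidates above the threshold.

    related_words: callable, word -> list of related words (dictionary lookup)
    similarity:    callable, word -> similarity score against the original seeds
    Returns (word, score) pairs in decreasing order of score.
    """
    lexicon = {}             # accepted candidates and their scores
    frontier = list(seeds)   # words to expand in the current iteration
    seen = set(seeds)
    for _ in range(iterations):
        next_frontier = []
        for word in frontier:
            for cand in related_words(word):
                if cand in seen:
                    continue
                seen.add(cand)
                score = similarity(cand)
                if score > threshold:
                    lexicon[cand] = score
                    next_frontier.append(cand)  # only accepted words are expanded
        if not next_frontier:
            break
        frontier = next_frontier
    return sorted(lexicon.items(), key=lambda item: item[1], reverse=True)

# Toy resources (assumptions): entries and scores are invented.
toy_dict = {"dulce": ["placut", "dulceag", "zahar"], "placut": ["agreabil"]}
toy_scores = {"placut": 0.8, "dulceag": 0.7, "zahar": 0.2, "agreabil": 0.6}

ranked = bootstrap(
    ["dulce"],
    related_words=lambda w: toy_dict.get(w, []),
    similarity=lambda w: toy_scores.get(w, 0.0),
    threshold=0.5,
    iterations=5,
)
print(ranked)
```

The two filtering parameters appear as `threshold` and `iterations`; raising the threshold or lowering the iteration count yields a smaller lexicon.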
12. Lexicon Acquisition
13. Evaluation
- Rule-based classifier of subjectivity
  - (Riloff and Wiebe, 2003)
  - Subjective sentence: three or more subjective entries
  - Objective sentence: two subjective entries or fewer
- Gold standard data set
  - (Mihalcea, Banea and Wiebe, 2007)
  - 504 sentences from five SemCor documents (manually translated into Romanian)
  - Labeled by two annotators
    - Agreement (all): 83% (κ = 0.67)
    - Agreement (uncertain removed): 89% (κ = 0.77)
  - Baseline: 54% (all subjective)
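The counting rule of the classifier can be sketched in a few lines. The `classify` helper, the tiny lexicon, and the token lists are illustrative assumptions; the sketch counts matching tokens and ignores tokenization, inflection, and multi-word entries, which a real system would need to handle.

```python
def classify(sentence_tokens, lexicon, min_entries=3):
    """Label a tokenized sentence by counting the lexicon entries it contains."""
    hits = sum(1 for token in sentence_tokens if token.lower() in lexicon)
    return "subjective" if hits >= min_entries else "objective"

# Tiny illustrative lexicon built from seed examples.
lexicon = {"iubi", "frumos", "fericit", "dulce"}

label_a = classify(["frumos", "fericit", "dulce"], lexicon)   # three entries
label_b = classify(["un", "idiot", "fericit"], lexicon)       # one entry
print(label_a, label_b)
```

The threshold is exposed as `min_entries` so the rule of three or more entries can be varied without touching the function body.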
14. Number of Iterations
F-measure for the bootstrapping subjectivity lexicon over 5 iterations and an LSA threshold of 0.5
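For reference, the sketch below computes precision, recall, and F-measure for one class, assuming the F-measure here is the usual balanced harmonic mean of precision and recall. The `f_measure` helper and the gold and predicted labels are invented for illustration and are not taken from the evaluation data.

```python
def f_measure(gold, predicted, target):
    """Precision, recall and balanced F-measure for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented labels, for illustration only.
gold = ["subj", "subj", "obj", "obj", "subj"]
pred = ["subj", "obj", "obj", "subj", "subj"]

precision, recall, f1 = f_measure(gold, pred, "subj")
print(round(precision, 2), round(recall, 2), round(f1, 2))
```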
15. Similarity Threshold
F-measure for the fifth bootstrapping iteration for varying LSA scores
16. Comparison
- The bootstrapping rule-based classifier uses a 3,913-entry subjectivity lexicon obtained through 5 iterations and a similarity threshold of 0.5
17. Conclusions
- Our bootstrapping method uses few electronic resources
  - A small seed set
  - One/multiple dictionaries
  - A small corpus of half a million words
- A large subjectivity lexicon of approx. 4000 entries was extracted
- Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20 and an overall F-measure of 61.69 can be achieved
18. Questions?