Carmen Banea, Rada Mihalcea

Slides: 19
Learn more at: http://www.lrec-conf.org

1
A Bootstrapping Method for Building Subjectivity
Lexicons for Languages with Scarce Resources
  • Carmen Banea, Rada Mihalcea
  • University of North Texas
  • carmenb_at_unt.edu, rada_at_cs.unt.edu

  • Janyce Wiebe
  • University of Pittsburgh
  • wiebe_at_cs.pitt.edu
2
Subjectivity analysis
  • Subjectivity analysis (opinions and sentiments)
  • Used in a wide variety of applications
  • Tracking sentiment timelines in news (Lloyd et
    al., 2005)
  • Review classification (Turney, 2002; Pang et al.,
    2002)
  • Mining opinions from product reviews (Hu and Liu,
    2004)
  • Expressive text-to-speech synthesis (Alm et al.,
    2005)
  • Text semantic analysis (Wiebe and Mihalcea, 2006;
    Esuli and Sebastiani, 2006)
  • Question answering (Yu and Hatzivassiloglou,
    2003)
  • Much work on subjectivity analysis has focused on
    English
  • Japanese (Takamura et al., 2006), Chinese (Hu et
    al., 2005), German (Kim and Hovy, 2006)

3
Proportion of Languages on the Web
Source: internetworldstats.com, updated November 30, 2007
4
Objective
  • Develop a method for subjectivity analysis that
  • Requires few electronic resources
  • Can be easily ported to a new language
  • Applicable to the large number of languages that
    have scarce electronic resources

5
Related Work
  • Tools that rely on manually or semi-automatically
    constructed lexicons
  • Yu and Hatzivassiloglou, 2003; Riloff and Wiebe,
    2003; Kim and Hovy, 2006
  • Enable efficient rule-based subjectivity and
    sentiment classifiers that rely on the presence
    of lexicon entries in text
  • These tools assume the availability of
  • advanced language processing tools
  • Syntactic parsers (Wiebe, 2000), Information
    extraction (Riloff and Wiebe, 2003)
  • broad-coverage rich lexical resources
  • WordNet (Esuli and Sebastiani, 2006)
  • Our approach relates most closely to the method
    of (Turney, 2002) for the construction of
    lexicons annotated for polarity
  • We address the task of acquiring a subjectivity
    lexicon
  • We rely on fewer, smaller-scale resources

6
Our Method
  • Based on bootstrapping
  • Requires
  • A small seed set of subjective entries
  • One or more electronic dictionaries
  • A small training corpus (approx. 500,000 words)
  • Experiments focused on Romanian
  • Applicable to other languages as well
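The loop implied by these requirements can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name bootstrap_lexicon, the lookup callback (standing in for a dictionary query) and the similarity_to_seeds callback (standing in for an LSA score) are hypothetical, and the toy dictionary and scores are invented for the example.

```python
from typing import Callable, Dict, Iterable, List


def bootstrap_lexicon(
    seeds: Iterable[str],
    lookup: Callable[[str], List[str]],
    similarity_to_seeds: Callable[[str], float],
    threshold: float = 0.5,
    iterations: int = 5,
) -> Dict[str, float]:
    """Grow a lexicon from seeds by repeated expansion and filtering."""
    # Seeds enter the lexicon with a nominal top score (an assumption).
    lexicon = {seed: 1.0 for seed in seeds}
    frontier = set(lexicon)
    for _ in range(iterations):
        # Expansion: collect dictionary candidates for the current frontier.
        candidates = set()
        for word in frontier:
            candidates.update(lookup(word))
        candidates.difference_update(lexicon)  # skip words already accepted
        # Filtering: keep only candidates scoring above the threshold.
        accepted = {}
        for cand in candidates:
            score = similarity_to_seeds(cand)
            if score > threshold:
                accepted[cand] = score
        if not accepted:
            break  # nothing new to expand from
        lexicon.update(accepted)
        frontier = set(accepted)  # only new entries are expanded next
    return lexicon


# Toy data, invented for illustration only.
toy_dictionary = {"dulce": ["placut", "dulceag"], "placut": ["agreabil"]}
toy_scores = {"placut": 0.8, "dulceag": 0.6, "agreabil": 0.4}

lexicon = bootstrap_lexicon(
    seeds=["dulce"],
    lookup=lambda w: toy_dictionary.get(w, []),
    similarity_to_seeds=lambda w: toy_scores.get(w, 0.0),
)
print(sorted(lexicon))
```

Expanding only the newly accepted entries in each round is one plausible reading of "selected for further expansion"; the slides do not spell out this detail.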

7
Bootstrapping Process
8
Seed Set
Category   Sample entries (with their English translation)
Noun       blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb       iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective  frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb     posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)
  • 60 seeds, evenhandedly sampled from verbs, nouns,
    adjectives and adverbs.
  • Manually selected
  • Seed sources
  • 11th-grade curriculum for Romanian Language and
    Literature
  • Translations of instances appearing in the
    OpinionFinder strong subjective lexicon (Wiebe
    and Riloff, 2005)

9
Expansion
  • Romanian dictionary: http://www.dexonline.ro
  • Dictionaries for other languages are also
    available, or can be obtained from paper
    dictionaries through OCR

10
Filtering
  • Candidates are filtered based on a measure of
    similarity with the original seeds
  • We use Latent Semantic Analysis (LSA) (Dumais et
    al., 1988) trained on the SemCor corpus (Miller
    et al., 1993)
  • After each iteration, only candidates with an LSA
    score higher than a given threshold are selected
    for further expansion
  • Example
  • Seed: dulce (sweet)
  • Candidate synonyms: cu gust dulce
    (sweet-tasting), placut (pleasant), dulceag
    (quasi-sweet)
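A rough sketch of this filtering step is given below. It assumes that each word already has a vector (for instance from an LSA model, which is not built here) and that "similarity with the original seeds" means the highest cosine similarity to any seed; the slides do not specify how the per-seed scores are combined, so that choice, the function names, and the two-dimensional toy vectors are all illustrative assumptions.

```python
import math
from typing import Dict, Iterable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_candidates(
    candidates: Iterable[str],
    seeds: Iterable[str],
    vectors: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Keep candidates whose best similarity to any seed exceeds the threshold."""
    seed_vectors = [vectors[s] for s in seeds if s in vectors]
    kept = {}
    for cand in candidates:
        if cand not in vectors:
            continue  # no vector available, so the candidate cannot be scored
        score = max((cosine(vectors[cand], sv) for sv in seed_vectors), default=0.0)
        if score > threshold:
            kept[cand] = score
    return kept


# Invented 2-d vectors; real LSA vectors would have many more dimensions.
toy_vectors = {
    "dulce": [1.0, 0.0],
    "placut": [0.8, 0.6],
    "dulceag": [0.6, 0.8],
    "cu gust dulce": [0.0, 1.0],
}
kept = filter_candidates(
    ["cu gust dulce", "placut", "dulceag"], ["dulce"], toy_vectors
)
print(sorted(kept))
```

With these invented vectors, two of the three candidates pass the 0.5 threshold; the outcome for real data depends entirely on the trained model.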

11
Filtering
  • Several iterations of the bootstrapping process
    will result in a subjectivity lexicon consisting
    of a ranked list of candidates in decreasing
    order of similarity to the original seeds
  • A variable filtering threshold can be used to
    further restrict the similarity, yielding a
    purer lexicon
  • Filtering parameters
  • Similarity threshold
  • Number of iterations
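The ranked list and the variable threshold can be illustrated with a short sketch. The helper names rank_by_similarity and restrict, and the scores below, are hypothetical and only show the mechanics of sorting by similarity and then cutting at a stricter threshold.

```python
from typing import Dict, List, Tuple


def rank_by_similarity(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort entries by decreasing similarity, breaking ties alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def restrict(ranked: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep only the words whose similarity exceeds the threshold."""
    return [word for word, score in ranked if score > threshold]


# Invented scores for illustration only.
toy_scores = {"dulceag": 0.6, "agreabil": 0.55, "placut": 0.8}

ranked = rank_by_similarity(toy_scores)
loose = restrict(ranked, 0.5)   # larger, noisier lexicon
strict = restrict(ranked, 0.7)  # smaller, purer lexicon
print(loose, strict)
```

Because the list is already ranked, raising the threshold simply truncates it, so different lexicon sizes can be compared without rerunning the bootstrapping.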

12
Lexicon Acquisition
13
Evaluation
  • Rule-based classifier of subjectivity
  • (Riloff and Wiebe, 2003)
  • Subjective sentence: three or more subjective
    entries
  • Objective sentence: two or fewer subjective
    entries
  • Gold standard data set
  • (Mihalcea, Banea and Wiebe, 2007)
  • 504 sentences from five SemCor documents
    (manually translated in Romanian)
  • Labeled by two annotators
  • Agreement (all): 83% (κ = 0.67)
  • Agreement (uncertain removed): 89% (κ = 0.77)
  • Baseline: 54% (all subjective)
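The counting rule stated above can be sketched as follows. This is an illustrative approximation only: the tokenization (lowercased word characters), the handling of single-word entries only, and the names count_subjective_entries and classify are assumptions, and the inputs are token strings built from the seed examples rather than real Romanian sentences.

```python
import re
from typing import Set


def count_subjective_entries(sentence: str, lexicon: Set[str]) -> int:
    """Count tokens of the sentence that appear in the lexicon."""
    tokens = re.findall(r"\w+", sentence.lower())
    return sum(1 for token in tokens if token in lexicon)


def classify(sentence: str, lexicon: Set[str], min_entries: int = 3) -> str:
    """Label a sentence subjective if it has at least min_entries lexicon hits."""
    hits = count_subjective_entries(sentence, lexicon)
    return "subjective" if hits >= min_entries else "objective"


# A tiny lexicon taken from the seed examples; the inputs are not real sentences.
toy_lexicon = {"frumos", "dulce", "fericit", "iubi"}

print(classify("frumos dulce fericit", toy_lexicon))
print(classify("frumos dulce masa", toy_lexicon))
```

Exposing the cutoff as min_entries makes it easy to try other settings, although the slides report results only for the three-entry rule.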

14
Number of Iterations
F-measure for the bootstrapping subjectivity
lexicon over 5 iterations and an LSA threshold of
0.5
15
Similarity Threshold
F-measure for the fifth bootstrapping iteration
for varying LSA scores
16
Comparison
  • The bootstrapping rule-based classifier uses a
    3,913-entry subjectivity lexicon obtained
    through 5 iterations and a similarity threshold
    of 0.5

17
Conclusions
  • Our bootstrapping method uses few electronic
    resources
  • A small seed set
  • One or more dictionaries
  • A small corpus of half a million words
  • A large subjectivity lexicon of approx. 4000
    entries was extracted
  • Using an unsupervised rule-based classifier, a
    subjectivity F-measure of 66.20 and an overall
    F-measure of 61.69 can be achieved

18
Questions?