Title: CS 124/LINGUIST 180: From Languages to Information
1CS 124/LINGUIST 180 From Languages to
Information
- Dan Jurafsky
- Lecture 5: Sentiment Analysis
IP notice: many slides for today are from Janyce
Wiebe, plus some from Marti Hearst and Marta Tatu
2Sentiment, Style, Identity Classification
- Identity
- Authorship identification
- Age/gender identification
- Sentiment Analysis
- Movies: is a review positive or negative?
- Products (new MacBook Pro)
- Sentiments over time (is anger increasing or
decreasing?)
- Politics (is this editorial left or right?)
- Prediction (election outcomes, market trends).
Will stock go up after this news report?
- Style/Emotion
- Is this conversation (or blog) friendly,
aggressive, polite, flirtatious?
3Part I: Author Identification (Stylometry)
4Author Identification
- Also called Stylometry in the humanities
- An example of a Classification Problem
- Classifiers
- Decide which of N buckets to put an item in
- (Some classifiers allow for multiple buckets)
5The Disputed Federalist Papers
- In 1787-1788, Jay, Madison, and Hamilton wrote a
series of anonymous essays to convince the voters
of New York to ratify the new U.S. Constitution.
- Scholars agree that
- 5 were authored by Jay
- 51 were authored by Hamilton
- 14 were authored by Madison
- 3 were written jointly by Hamilton and Madison
- 12 remain in dispute: Hamilton or Madison?
6Author identification
- Federalist papers
- In 1963 Mosteller and Wallace solved the problem
- They identified function words as good candidates
for authorship analysis
- Using statistical inference they concluded the
author was Madison
- Since then, other statistical techniques have
supported this conclusion.
7Function vs. Content Words
High rates for "by" favor M, low rates favor H
High rates for "from" favor M, low rates say little
High rates for "to" favor H, low rates favor M
(M = Madison, H = Hamilton)
8Function vs. Content Words
No consistent pattern for "war"
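A minimal sketch of how word rates like these might be computed. It assumes rates are measured as occurrences per 1,000 words; the unit, the tokenizer, and the function name `rate_per_thousand` are illustrative choices, not details given on the slides.

```python
import re
from collections import Counter

def rate_per_thousand(text, words):
    """Return occurrences of each word in `words` per 1,000 tokens of `text`."""
    # Assumption: a simple lowercase alphabetic tokenizer is good enough here.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: 1000.0 * counts[w] / total for w in words}

# Toy sentence for illustration only; it is not from the Federalist Papers.
sample = "The power granted by the people to the states is derived from the people."
rates = rate_per_thousand(sample, ["by", "from", "to", "war"])
print(rates)
```

Comparing these rates across essays of known authorship is what lets a function word such as "by" or "to" act as a feature, while a content word such as "war" depends on the topic rather than the author.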
9Federalist Papers Problem
Fung, The Disputed Federalist Papers: SVM Feature
Selection via Concave Minimization, ACM TAPIA'03
10Part II: Political Sentiment
- Two examples of classifiers
- Using words as features
- And a Naïve Bayes or SVM classifier
- To make a binary decision
- About the political stance of a text
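A minimal sketch of the setup described above: words as features, a Naïve Bayes classifier, and a binary decision. The training sentences, the labels "A" and "B", and the add-one smoothing are assumptions made for illustration; they are not taken from the studies on the following slides.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns (label counts, word counts, vocab)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return label_counts, word_counts, vocab

def classify_nb(words, label_counts, word_counts, vocab):
    """Pick the label with the highest log posterior, using add-one smoothing."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data with two stances, "A" and "B".
train = [
    ("we must cut taxes now".split(), "A"),
    ("cut spending and cut taxes".split(), "A"),
    ("we must fund schools".split(), "B"),
    ("fund public schools and clinics".split(), "B"),
]
model = train_nb(train)
print(classify_nb("cut taxes".split(), *model))
print(classify_nb("fund schools".split(), *model))
```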
11Part II: Political Sentiment
- Lin, Wilson, Wiebe, Hauptmann (2006)
- Bitterlemon.com
- A website designed to contribute to mutual
understanding between Palestinians and Israelis
through the open exchange of ideas
- Can we label a text as expressing the Israeli or
the Palestinian perspective?
- "The inadvertent killing by Israeli forces of
Palestinian civilians, usually in the course of
shooting at Palestinian terrorists, is
considered no different at the moral and ethical
level than the deliberate targeting of Israeli
civilians by Palestinian suicide bombers."
- "In the first weeks of the Intifada, for example,
Palestinian public protests and civilian
demonstrations were answered brutally by Israel,
which killed tens of unarmed protesters."
12Lin et al on Political Perspective
- 594 articles from 2001-2005
- Naïve Bayes classifier
- Accuracy: 89-99%
13Naïve Bayes: Top 20 words
- Palestinian
- palestinian, israel, state, politics, peace,
international, people, settle, occupation,
sharon, right, govern, two, secure, end,
conflict, process, side, negotiate
- Israeli
- israel, palestinian, state, settle, sharon,
peace, arafat, arab, politics, two, process,
secure, conflict, lead, america, agree, right,
gaza, govern
14Thomas, Pang, Lee: Get out the vote: Determining
support or opposition from Congressional
floor-debate transcripts
- Goal: label a speech as pro or con a bill
- Data: transcripts of all debates in the House of
Representatives in 2005
- From the GovTrack (http://govtrack.us) website
- Each speech segment (sequence of uninterrupted
utterances by a speaker)
- Labeled by the vote (yea or nay) cast
- Labeled by an SVM classifier, using all word
unigrams as features
15Results
- Majority baseline: 58.37%
- #("support") - #("oppos"): 62.67%
- SVM classifier: 66.05%
- Add network of agreements: 70.81%
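The second row above reads as a simple word-count baseline: compare how many words contain the stem "support" with how many contain the stem "oppos". A minimal sketch under that reading follows; the prefix matching, the threshold of zero, and the tie-breaking rule are assumptions, not details from the slide.

```python
import re

def support_oppose_baseline(speech):
    """Label a speech 'yea' if 'support...' words are at least as frequent as 'oppos...' words."""
    tokens = re.findall(r"[a-z]+", speech.lower())
    n_support = sum(t.startswith("support") for t in tokens)
    n_oppose = sum(t.startswith("oppos") for t in tokens)
    # Assumption: ties (including speeches with neither stem) go to 'yea'.
    return "yea" if n_support - n_oppose >= 0 else "nay"

# Hypothetical speech fragments for illustration.
print(support_oppose_baseline("I rise in strong support of this bill"))
print(support_oppose_baseline("I oppose this bill and urge opposition"))
```

A baseline like this uses two hand-picked stems, so the gap to the SVM row shows how much the remaining unigrams contribute.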
16Part III: How to choose sentiment vocabulary
- Key task: Vocabulary
- The previous work used all words
- Can we do better by focusing on a subset of words?
- How to find words, phrases, patterns that express
sentiment or polarity?
17Words
- Adjectives
- positive: honest, important, mature, large, patient
- Ron Paul is the only honest man in Washington.
- Kitchell's writing is unbelievably mature and is
only likely to get better.
- To humour me my patient father agrees yet again
to my choice of film
18Words
- Adjectives
- negative: harmful, hypocritical, inefficient,
insecure
- It was a macabre and hypocritical circus.
- Why are they being so inefficient?
- subjective: curious, peculiar, odd, likely,
probably
Slide from Janyce Wiebe
19Other parts of speech
- Verbs
- positive: praise, love
- negative: blame, criticize
- Nouns
- positive: pleasure, enjoyment
- negative: pain, criticism
20Phrases
- Phrases containing adjectives and adverbs
- positive: high intelligence, low cost
- negative: little variation, many troubles
21Identifying polarity words
- Assume that contexts are coherent: conjoined
adjectives tend to share polarity
- Expected: fair and legitimate, corrupt and brutal
- Unexpected: *fair and brutal, *corrupt and legitimate
22Hatzivassiloglou & McKeown 1997: Predicting the
semantic orientation of adjectives
- Step 1
- From a 21-million-word WSJ corpus
- For every adjective with frequency > 20
- Label for polarity
- Total of 1336 adjectives
- 657 positive
- 679 negative
23Hatzivassiloglou & McKeown 1997
- Step 2: Extract all conjoined adjectives
- nice and comfortable
- nice and scenic
Slide adapted from Janyce Wiebe
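A minimal sketch of extracting conjoined adjectives from text. It assumes a known adjective list rather than a tagger or parser, and it also collects "but" conjunctions; both choices, and the function name `conjoined_pairs`, are illustrative assumptions.

```python
import re

def conjoined_pairs(text, adjectives):
    """Find 'ADJ and ADJ' and 'ADJ but ADJ' triples among a known adjective set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pairs = []
    # Slide a three-token window over the text.
    for left, conj, right in zip(tokens, tokens[1:], tokens[2:]):
        if conj in ("and", "but") and left in adjectives and right in adjectives:
            pairs.append((left, conj, right))
    return pairs

# Hypothetical adjective list and text for illustration.
adjs = {"nice", "comfortable", "scenic", "expensive"}
text = "The hotel was nice and comfortable. The route was nice and scenic but expensive."
pairs = conjoined_pairs(text, adjs)
print(pairs)
```

Pairs joined by "and" are candidates for the same orientation, while "but" hints at opposite orientation.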
24Hatzivassiloglou & McKeown 1997
- Step 3: A supervised learning algorithm builds a graph
of adjectives linked by the same or different
semantic orientation
(Figure: graph linking the adjectives scenic, nice, terrible, painful, handsome, fun, expensive, comfortable)
25Hatzivassiloglou & McKeown 1997
- Step 4: A clustering algorithm partitions the
adjectives into two subsets
(Figure: the adjectives slow, scenic, nice, terrible, handsome, painful, fun, expensive, comfortable split into two clusters)
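A simplified sketch of the graph-then-partition idea. It is not the method from the paper: the supervised link predictor is replaced by hand-labeled links, and the clustering algorithm is replaced by breadth-first label propagation. The link list and the function name `partition` are hypothetical.

```python
from collections import defaultdict, deque

def partition(links):
    """links: list of (adj1, adj2, same), where same=True means same orientation.
    Returns a dict mapping each adjective to group 0 or 1 (groups are unnamed)."""
    graph = defaultdict(list)
    for a, b, same in links:
        graph[a].append((b, same))
        graph[b].append((a, same))
    group = {}
    for start in graph:
        if start in group:
            continue
        group[start] = 0  # arbitrary seed for each connected component
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor, same in graph[node]:
                # First assignment wins; conflicting links are ignored here.
                if neighbor not in group:
                    group[neighbor] = group[node] if same else 1 - group[node]
                    queue.append(neighbor)
    return group

# Hypothetical same/different links over adjectives like those in the figure.
links = [
    ("nice", "comfortable", True), ("nice", "scenic", True),
    ("nice", "fun", True), ("handsome", "nice", True),
    ("terrible", "painful", True), ("expensive", "terrible", True),
    ("nice", "terrible", False), ("scenic", "expensive", False),
]
groups = partition(links)
print(groups)
```

The partition alone does not say which subset is positive; deciding that needs extra information beyond the links.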
26Hatzivassiloglou & McKeown 1997
27Turney (2002): Thumbs Up or Thumbs Down?
Semantic Orientation Applied to Unsupervised
Classification of Reviews
- Input: review
- Identify phrases that contain adjectives or
adverbs by using a part-of-speech tagger
- Estimate the semantic orientation of each phrase
- Assign a class to the given review based on the
average semantic orientation of its phrases
- Output: classification (thumbs up or thumbs down)
28Turney Step 1
- Extract all two-word phrases including an
adjective
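A minimal sketch of this step, assuming the text has already been tagged with Penn Treebank part-of-speech tags (the tagger itself is not shown). It keeps any adjacent word pair in which one word is an adjective, which is looser than a set of hand-written tag patterns would be; the function name is hypothetical.

```python
def adjective_bigrams(tagged):
    """tagged: list of (word, tag) pairs with Penn Treebank tags.
    Returns two-word phrases in which at least one word is an adjective (JJ, JJR, JJS)."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1.startswith("JJ") or t2.startswith("JJ"):
            phrases.append(f"{w1} {w2}")
    return phrases

# Hypothetical pre-tagged input for illustration.
tagged = [("a", "DT"), ("very", "RB"), ("nice", "JJ"), ("hotel", "NN"), ("overall", "RB")]
phrases = adjective_bigrams(tagged)
print(phrases)
```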
29Turney Step 2
- Estimate the semantic orientation of the
extracted phrases using Pointwise Mutual
Information
30Pointwise Mutual Information
- Mutual information: between 2 random variables X
and Y
- Pointwise mutual information: a measure of how
often two events x and y occur together, compared with
what we would expect if they were independent
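For reference, the standard textbook definitions of the two quantities named above, written with base-2 logarithms (the choice of base is an assumption; it only changes the unit):

```latex
I(X;Y) = \sum_{x}\sum_{y} P(x,y)\,\log_2\frac{P(x,y)}{P(x)\,P(y)}

\mathrm{PMI}(x,y) = \log_2\frac{P(x,y)}{P(x)\,P(y)}
```

Mutual information is the expected value of PMI over all pairs (x, y). PMI is 0 when x and y are independent, positive when they co-occur more often than chance, and negative when they co-occur less often.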
31Weighting Mutual Information
- Pointwise mutual information: a measure of how
often two events x and y occur together, compared with
what we would expect if they were independent
- PMI between two words: how much more often they
occur together than we would expect if they were
independent
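A minimal sketch of estimating PMI between two words from raw counts, assuming probabilities are estimated as relative frequencies over some fixed set of contexts (for example, windows or documents). The counts below are made up for illustration.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI in bits, from raw counts over `total` observed contexts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: each word appears in 100 of 10,000 contexts.
# Independence would predict 1 shared context; here they share 10.
together = pmi(10, 100, 100, 10_000)
chance = pmi(1, 100, 100, 10_000)
print(together, chance)
```

With these counts the words co-occur ten times more often than chance, so `together` is log2(10), while `chance` is approximately zero.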
32Turney Step 2
- Semantic Orientation of a phrase defined as
SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
- Estimate PMI by issuing queries to a search
engine (AltaVista, 350 million pages)
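A minimal sketch of estimating semantic orientation from hit counts, taking SO as the PMI of the phrase with "excellent" minus its PMI with "poor". The phrase's own hit count cancels in that difference, leaving a ratio of four counts. The small smoothing constant and all counts below are assumptions for illustration; no real search engine is queried.

```python
import math

def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Estimate SO(phrase) in bits from hit counts.
    hits_near_*: hits for the phrase co-occurring with each reference word.
    hits_excellent, hits_poor: hits for the reference words alone."""
    # Smoothing avoids division by zero when a phrase never co-occurs.
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)

# Hypothetical hit counts for two phrases.
so_positive = semantic_orientation(80, 10, 1000, 1000)
so_negative = semantic_orientation(10, 80, 1000, 1000)
print(so_positive, so_negative)
```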
33Turney Step 3
- Calculate average semantic orientation of phrases
in review
- Positive: thumbs up (recommended)
- Negative: thumbs down (not recommended)
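A minimal sketch of this final step, assuming each phrase in the review already has a semantic orientation score. The label strings and the handling of an average of exactly zero are assumptions.

```python
def classify_review(phrase_orientations):
    """Return 'recommended' if the mean semantic orientation is positive."""
    if not phrase_orientations:
        raise ValueError("review has no extracted phrases")
    average = sum(phrase_orientations) / len(phrase_orientations)
    # Assumption: an average of exactly zero counts as 'not recommended'.
    return "recommended" if average > 0 else "not recommended"

# Hypothetical per-phrase scores for two reviews.
print(classify_review([1.5, -0.5, 0.8]))
print(classify_review([-2.0, 0.3]))
```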
34Experiments
- 410 reviews from Epinions
- 170 (41%) not recommended
- 240 (59%) recommended
- Average phrases per review: 26
- Baseline accuracy: 59%
35Discussion
- What makes movie reviews hard to classify?
- Sentiment can be subtle
- Perfume review in "Perfumes: The Guide"
- "If you are reading this because it is your
darling fragrance, please wear it at home
exclusively, and tape the windows shut."
- "She runs the gamut of emotions from A to B"
- (Dorothy Parker on Katharine Hepburn)
- Order effects
- "This film should be brilliant. It sounds like a
great plot, the actors are first grade, and the
supporting cast is good as well, and Stallone is
attempting to deliver a good performance.
However, it can't hold up."
36Patterns
- Lexico-syntactic patterns (Riloff & Wiebe 2003)
- way with <np>: to ever let China use force to
have its way with
- expense of <np>: at the expense of the world's
security and stability
- underlined <dobj>: Jiang's subdued tone
underlined his desire to avoid disputes
37Labeling Conversations for Style
- Dan Jurafsky, Rajesh Ranganath, Dan McFarland
38Extraction of Social Meaning/Emotion/Style from
Speech and Text
- Detection of student uncertainty in tutoring
- Forbes-Riley et al. (2008)
- Emotion detection (annoyance)
- Ang et al. (2002)
- Detection of deception
- Newman et al. (2003)
- Detection of charisma
- Rosenberg and Hirschberg (2005)
- Speaker stress, trauma
- Rude et al. (2004), Pennebaker and Lay (2002)
39Our task: Detect Interactional Style
- Given speech and text from a conversation
- Can we tell if a speaker is
- Awkward?
- Flirtatious?
- Friendly?
- Dataset
- 1000 4-minute speed-dates
- Each subject rated their partner for these styles
- The following segment has been lightly
signal-processed
40Features: Prosodic and Lexical
- In a regularized logistic regression classifier
- Pitch: min, max, mean, std, range
- Amplitude: min, max, mean, std
- Duration of turn
- Number of words
- Use of past tense
- Use of "you"
- Use of "we"
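A minimal sketch of computing a few of the features listed above. It assumes the pitch track is already available as a list of values in Hz and the turn as plain text; pitch extraction, amplitude, duration, and past-tense detection are left out, and the function names are hypothetical.

```python
import re
import statistics

def pitch_features(pitch_hz):
    """Summarize a pitch track (Hz) as min, max, mean, std, and range."""
    lo, hi = min(pitch_hz), max(pitch_hz)
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.mean(pitch_hz),
        "std": statistics.pstdev(pitch_hz),  # population standard deviation
        "range": hi - lo,
    }

def lexical_features(text):
    """Count words and uses of 'you' and 'we' in a speaker's text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {"n_words": len(tokens), "you": tokens.count("you"), "we": tokens.count("we")}

# Hypothetical pitch samples and transcript text.
prosodic = pitch_features([100.0, 120.0, 140.0])
lexical = lexical_features("You know, we should go. Do you agree?")
print(prosodic, lexical)
```

Each dictionary would supply a handful of columns in the feature vector passed to the classifier.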
41Features: Discourse
- Number of Backchannels
- Uh-huh. Yeah. Right. Oh, okay.
- Number of Appreciations
- Wow. That's true. Oh, great!
- Number of Questions
- Amount of Laughter
- Total number of turns
- Number of disfluencies
- Amount of overlapped speech
- NTRI
- Wait, say it again, do you have another one?
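A minimal sketch of counting some of these discourse features from a list of one speaker's turns. The backchannel list comes from the examples above; treating a turn ending in "?" as a question and marking laughter as "[laughter]" are assumed transcript conventions, not details from the slides.

```python
# Backchannel forms taken from the slide's examples.
BACKCHANNELS = {"uh-huh", "yeah", "right", "okay", "oh, okay"}

def discourse_features(turns):
    """turns: list of utterance strings by one speaker."""
    n_backchannels = n_questions = n_laughs = 0
    for turn in turns:
        text = turn.strip()
        # A turn counts as a backchannel only if it is nothing but the cue.
        if text.lower().rstrip(".!?") in BACKCHANNELS:
            n_backchannels += 1
        if text.endswith("?"):
            n_questions += 1
        n_laughs += text.lower().count("[laughter]")
    return {
        "backchannels": n_backchannels,
        "questions": n_questions,
        "laughter": n_laughs,
        "turns": len(turns),
    }

# Hypothetical turns for illustration.
feats = discourse_features([
    "Uh-huh.", "Where are you from?", "Yeah.",
    "That is so funny [laughter]", "Right.",
])
print(feats)
```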
42Results so far
43Regression weights
44Regression weights
45Good predictors, across both genders
- Awkward speaker: slow, lower pitched, stilted
talk
- Flirtatious speaker: greater laughter, more
questions, and referring to the past.
- Friendly speaker: greater laughter.
46Gender differences
- Flirtation
- Women raise pitch, men drop pitch
- Women: laughter is highest weighted
- Women: quiet talk
- Men: expanded pitch range
47Flirtation versus friendliness
- Men
- Flirtation
- more questions, "you", lower pitch, greater
pitch range
- Friendliness
- Higher pitch, decreased pitch range
- Women
- Flirtation
- more questions, "you", quiet speech, higher pitch
- Friendliness
- no questions, loud speech, more pitch variation
48Summary on Sentiment and Style
- Function words are a good cue to identity
- All words work well for some tasks
- Finding subsets of words may help in other tasks
- Other features may also help
- Questions
- Length of sentences
- Speech features