Spoken Language Processing Lab

Provided by: juliahir
1
Spoken Language Processing Lab
Who we are: Julia Hirschberg, Stefan Benus, Fadi
Biadsy, Frank Enos, Agus Gravano, Jackson
Liscombe, Sameer Maskey, Andrew Rosenberg

Lab: The Speech Lab, CEPSR 7LW3-A
2
Prosody, Emotion and Speaker State
  • A speaker's emotional state represents important
    and useful information
      • To recognize (e.g. anger/frustration in IVR
        systems)
      • To generate (e.g. any emotion for games)
  • Many studies have shown that prosody helps to
    convey/identify classic emotions (anger,
    happiness, ...) with some accuracy
  • Can prosody also signal other types of speaker
    state?
      • In a tutoring domain (confidence vs. uncertainty)
      • Charisma
      • Deception

3
LDC Emotional Speech Corpus
  • happy
  • sad
  • angry
  • confident
  • frustrated
  • friendly
  • interested
  • anxious
  • bored
  • encouraging

4
Identifying Confidence vs. Uncertainty (Liscombe)
  • The ITSpoke Corpus: physics tutoring, collected at
    U. Pittsburgh by Diane Litman and students
  • 17 students, 1 tutor
  • 130 human/human dialogues
  • 7000 student turns (mean length 2.5 sec)
  • Hand-labeled for confidence, uncertainty, anger,
    frustration

5
A Certain Example
6
An Uncertain Example
7
pr01_sess00_prob58
8
Direct Modeling of Prosodic Features
  • Automatically extracted acoustic/prosodic features
      • Pitch, energy, speaking rate, unit duration (hand
        labeled), pausal duration within and preceding
        unit of analysis, filled pauses (hand labeled)
  • Units
      • Entire turns
      • Breath groups
  • Context: same features from prior turn(s)
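
The "context" idea above can be illustrated with a small sketch. It assumes each turn is already a fixed-length list of numeric features. The function name `with_context`, the zero padding for turns with no predecessor, and the default of one prior turn are illustrative choices, not details from the slides.

```python
def with_context(turns, n_prior=1, pad=0.0):
    """Append the feature vectors of the n_prior preceding turns to each turn.

    turns: list of equal-length feature vectors, in dialogue order.
    Turns near the start of the dialogue are padded with `pad`.
    """
    width = len(turns[0])
    rows = []
    for i, current in enumerate(turns):
        row = list(current)
        for back in range(1, n_prior + 1):
            j = i - back
            # Use the earlier turn's features, or padding if none exists.
            row.extend(turns[j] if j >= 0 else [pad] * width)
        rows.append(row)
    return rows


# Made-up vectors, e.g. [mean pitch, speaking rate] per turn.
turns = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for row in with_context(turns):
    print(row)
```

Whether "prior turn" means the previous turn by the same speaker or the previous turn in the dialogue is not stated on the slide. The sketch uses plain dialogue order.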

9
Classifying Uncertainty
  • Human-Human Corpus
  • AdaBoost (C4.5), 90/10 split
  • Classes: Uncertain vs. Certain vs. Neutral
  • Results

Features            Accuracy (%)
Baseline            66
Acoustic-prosodic   75
Contextual          76
Breath-groups       77
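
A rough sketch of this kind of experiment follows: a majority-class baseline compared with a boosted classifier on a 90/10 split. Several things here are assumptions rather than details from the slides. One-level decision stumps stand in for C4.5 trees, the multi-class SAMME variant of AdaBoost is used, the baseline is taken to be the majority class, and the two-dimensional data is synthetic rather than drawn from the corpus.

```python
import math
import random
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


def fit_stump(X, y, w, classes):
    """Find the one-feature threshold rule with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = dict.fromkeys(classes, 0.0)
            right = dict.fromkeys(classes, 0.0)
            for x, label, wi in zip(X, y, w):
                (left if x[f] <= t else right)[label] += wi
            lo, hi = max(classes, key=left.get), max(classes, key=right.get)
            err = 1.0 - left[lo] - right[hi]  # weights sum to 1
            if best is None or err < best[0]:
                best = (err, f, t, lo, hi)
    return best


def stump_predict(stump, x):
    _, f, t, lo, hi = stump
    return lo if x[f] <= t else hi


def adaboost(X, y, rounds=10):
    """Multi-class AdaBoost (SAMME variant) over decision stumps."""
    classes = sorted(set(y))
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w, classes)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = math.log((1 - err) / err) + math.log(len(classes) - 1)
        if alpha <= 0:  # weak learner is no better than chance
            break
        ensemble.append((alpha, stump))
        # Up-weight the examples this stump got wrong, then renormalize.
        w = [wi * math.exp(alpha) if stump_predict(stump, x) != label else wi
             for x, label, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble, classes


def predict(ensemble, classes, x):
    votes = dict.fromkeys(classes, 0.0)
    for alpha, stump in ensemble:
        votes[stump_predict(stump, x)] += alpha
    return max(classes, key=votes.get)


# Synthetic stand-in for per-turn features; not data from the corpus.
random.seed(0)
CENTRES = {"neutral": (0.0, 0.0), "certain": (2.0, 0.0), "uncertain": (0.0, 2.0)}
labels = random.choices(list(CENTRES), weights=[6, 2, 2], k=200)
data = [([random.gauss(cx, 0.7), random.gauss(cy, 0.7)], label)
        for label in labels for cx, cy in [CENTRES[label]]]
random.shuffle(data)
cut = int(0.9 * len(data))  # 90/10 train/test split
X_train, y_train = zip(*data[:cut])
X_test, y_test = zip(*data[cut:])

ensemble, classes = adaboost(X_train, y_train)
accuracy = sum(predict(ensemble, classes, x) == y
               for x, y in zip(X_test, y_test)) / len(X_test)
print(f"baseline {majority_baseline(y_train, y_test):.2f}, boosted {accuracy:.2f}")
```

The numbers this prints describe only the synthetic data and say nothing about the corpus results in the table.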
10
Charismatic Speech (Rosenberg, Biadsy)
  • What is charisma?
      • The ability to attract and retain followers by
        virtue of personality as opposed to tradition or
        laws. (Weber 47)
      • E.g. JFK, Hitler, Castro, Martin Luther King
  • Why study it?
      • Identify new leaders early
      • Help people improve their public speaking
      • Produce more compelling TTS
  • What makes leaders charismatic?
  • Can prosody help us identify charisma?

12
Method
  • Data: 45 speech segments (2-10 s each), 5 each
    from 9 candidates for the Democratic nomination
    for president
  • 2 charismatic, 2 not charismatic
  • Topics: greeting, reasons for running, tax cuts,
    postwar Iraq, healthcare
  • 13 subjects rated each segment on a Likert scale
    (1-5) for 26 questions
  • Correlation of lexical and acoustic/prosodic
    features with mean charisma ratings
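
The last step, correlating a feature with mean ratings, might look like the sketch below. The slide does not say which correlation coefficient was used, so Pearson's r is an assumption here. The ratings and feature values are invented for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented data: one row per segment, one column per rater (1-5 Likert).
ratings = [[4, 5, 4], [2, 1, 2], [3, 3, 4], [5, 4, 5]]
mean_f0 = [180.0, 120.0, 150.0, 200.0]  # made-up per-segment feature, Hz

mean_rating = [mean(segment) for segment in ratings]
r = pearson_r(mean_f0, mean_rating)
print(f"r = {r:.3f}")
```

Repeating this for every feature gives one coefficient per feature, which can then be ranked or tested for significance.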

13
Acoustic/Prosodic and Lexical Features
  • Min, max, mean, stdev F0
  • Raw and normalized by speaker
  • Min, max, mean, stdev intensity
  • Speaking rate (syls/sec)
  • Mean and stdev of normalized F0 and intensity
    across phrases
  • Duration (secs)
  • Length (words, syls)
  • Number of intonational, intermediate, and
    internal phrases
  • Mean words per intermediate and intonational
    phrase
  • Mean syllables/word
  • 1st, 2nd, 3rd person pronoun density
  • Function to content word ratio
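
A few of these features are simple enough to sketch. The code below assumes F0 values are already available as a list of numbers in Hz and that the transcript is a list of word tokens. Z-scoring against all of a speaker's F0 values is one plausible reading of "normalized by speaker", not a detail confirmed by the slides. The pronoun list is a small illustrative set.

```python
import statistics

# Illustrative first-person pronoun list (an assumption, not from the slides).
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}


def f0_stats(f0_values):
    """Min, max, mean and standard deviation of F0 over one segment."""
    return {
        "min": min(f0_values),
        "max": max(f0_values),
        "mean": statistics.mean(f0_values),
        "stdev": statistics.pstdev(f0_values),
    }


def normalize_by_speaker(f0_values, speaker_f0):
    """Z-score a segment's F0 values against all F0 values of its speaker.

    Requires speaker_f0 to contain at least two distinct values.
    """
    mu = statistics.mean(speaker_f0)
    sigma = statistics.pstdev(speaker_f0)
    return [(value - mu) / sigma for value in f0_values]


def pronoun_density(words, pronouns=FIRST_PERSON):
    """Fraction of tokens that belong to the given pronoun set."""
    return sum(word.lower() in pronouns for word in words) / len(words)


print(f0_stats([100.0, 120.0]))
print(normalize_by_speaker([120.0], [100.0, 120.0]))
print(pronoun_density("I think we can win".split()))
```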

14
What makes speech charismatic?
  • More content
      • Length in secs, words, syllables, and phrases
  • Use of polysyllabic words
      • Lexical complexity (mean syllables per word)
  • Use of more first-person pronouns
      • First-person pronoun density
  • Higher and more dynamic raw F0
      • Min, max, mean, std. dev. of F0 over male
        speakers
  • Greater intensity
      • Mean intensity

15
  • Higher in a speaker's pitch range
      • Mean normalized F0
  • Faster speaking rate
      • Syllables per second
  • Greater variation in F0 and intensity across
    phrases
      • Std. dev. of normalized phrase F0 and intensity
  • But... what about cultural differences?
  • Next
      • Swedish ratings of American tokens
      • Palestinian Arabs' ratings of Arabic tokens

16
Acoustic/Prosodic and Lexical Cues to Deception
(Enos)
  • Deception evokes emotion in deceivers (Ekman
    85-92)
      • Fear of discovery: higher pitch, faster, louder,
        pauses, disfluencies, indirect speech
      • Elation at successful deceiving ("duping delight"):
        higher pitch, faster, louder, greater elaboration
  • Detecting cues to these emotions may also
    identify deception
  • Can prosody help us identify deceptive speakers?

17
Columbia/SRI/Colorado Corpus
  • 15.2 hrs. of interviews, 7 hrs. of subject speech
  • Lexically transcribed, automatically aligned
  • Labeling conditions: Global / Local
  • Segmentation (LT/LL)
      • slash units (5709/3782)
      • phrases (11,612/7108)
      • turns (2230/1573)
  • Acoustic/prosodic features extracted from ASR
    output; lexical and discourse features also
    extracted

18
Sample Features
  • Duration features
      • Phone / Vowel / Syllable Durations
      • Normalized by Phone/Vowel Means, Speaker
  • Speaking rate features (vowels/time)
  • Pause features (cf. Benus et al. 2006)
      • Speech to pause ratio, number of long pauses
      • Maximum pause length
  • Energy features (RMS energy)
  • Pitch features
      • Pitch stylization (Sonmez et al.)
      • LTM model of F0 to estimate speaker range
      • Pitch ranges, slopes, locations of interest
  • Spectral tilt features
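
The pause features listed above can be derived from word-level time alignments. The sketch below assumes the alignment is a list of (start, end) times in seconds, in temporal order. The 0.5 s threshold for a "long" pause is a placeholder, since the slides give no value.

```python
def pause_features(word_times, long_pause=0.5):
    """Compute pause features from (start, end) word alignments in seconds.

    A pause is any positive gap between the end of one word and the
    start of the next. `long_pause` is a hypothetical threshold.
    """
    speech = sum(end - start for start, end in word_times)
    pauses = [next_start - prev_end
              for (_, prev_end), (next_start, _) in zip(word_times, word_times[1:])
              if next_start > prev_end]
    total_pause = sum(pauses)
    return {
        "speech_to_pause_ratio": speech / total_pause if total_pause else float("inf"),
        "num_long_pauses": sum(p >= long_pause for p in pauses),
        "max_pause": max(pauses, default=0.0),
    }


# Made-up alignment for four words.
words = [(0.0, 0.5), (0.5, 1.0), (2.0, 2.5), (2.75, 3.0)]
print(pause_features(words))
```

A segment with no pauses returns an infinite speech-to-pause ratio here. A real feature extractor would probably cap or otherwise encode that case.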

20
Speech summarization in Broadcast News
  • Problem: How do we summarize text and speech
    documents together?
  • Recognition Errors
  • Named Entities
  • Misrecognized rare terms
  • Error propagation in the processing pipeline of
    ASR transcripts
  • Ex.: Sentence boundary -> Turn boundary -> Speaker
    Roles -> Summarization
  • Solution: Combining lexical and acoustic
    information in one framework

21
Current Approach
  • Use acoustic/prosodic features to compute
    acoustic significance of sentences
  • Remove disfluencies from ASR transcripts
  • Compute ASR confidence for sentences
  • Cluster text and speech transcripts together
  • Use acoustic scores as additional weights
  • Word or Phrase level acoustic significance
  • Emphasized George Bush vs. non-emphasized
    George Bush
  • Use Broadcast News structure in summarization
  • Headlines, Soundbites, Interviews, Weather
    report, Sports section may be useful for certain
    questions: opinion, attribution, disaster
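
The slides do not say how acoustic scores enter the sentence ranking. Purely as an illustration, the sketch below blends a lexical score with an acoustic significance score and scales the result by ASR confidence. The `Sentence` fields, the blending formula, and the weight of 0.3 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    lexical: float         # lexical salience in [0, 1] (hypothetical)
    acoustic: float        # acoustic significance in [0, 1] (hypothetical)
    asr_confidence: float  # recognizer confidence in [0, 1]


def combined_score(sentence, acoustic_weight=0.3):
    """Blend lexical and acoustic scores, discounted by ASR confidence."""
    blended = ((1 - acoustic_weight) * sentence.lexical
               + acoustic_weight * sentence.acoustic)
    return blended * sentence.asr_confidence


def summarize(sentences, k=2, acoustic_weight=0.3):
    """Pick the k best-scoring sentences and return them in original order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: combined_score(sentences[i], acoustic_weight),
                    reverse=True)
    return [sentences[i].text for i in sorted(ranked[:k])]


story = [
    Sentence("First sentence.", lexical=0.8, acoustic=0.2, asr_confidence=0.9),
    Sentence("Second sentence.", lexical=0.5, acoustic=0.9, asr_confidence=1.0),
    Sentence("Third sentence.", lexical=0.9, acoustic=0.9, asr_confidence=0.3),
]
print(summarize(story))
```

In this toy input the third sentence scores well on both lexical and acoustic grounds but is dropped because its recognition confidence is low.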

22
Spoken Dialogue Systems
  • Discourse phenomena in dialogue
      • Turn-taking
      • Given/new information
      • Cue phrases
      • Entrainment
  • The GAMES corpus
      • 12 sessions of dialogue
      • 12.2 h
      • Annotations: orthographic, turns, cue phrases,
        ToBI, question form and function

25
Translating Prosody: Mandarin/English (Rosenberg)
  • Prosodic variation is the last thing we learn
  • How do speakers convey suprasegmental information
    in different languages?
  • To translate, first identify
      • Automatic Identification of Prosodic Events
      • Pitch Accents and Phrase Boundaries
  • What are the correspondences?
      • Discourse structure
      • Intonational contours
      • Information status
      • Emotion