1
Prosody in Recognition/Understanding
2
Prosody in ASR Today
  • Little success in improving ASR transcription
  • More promise in non-traditional ASR-related
    tasks
  • Improving rejection
  • Shrinking search space
  • Automatic segmentation
  • Identifying salient words
  • Disambiguating speech/dialogue acts
  • Prosody in ASR understanding

3
Overview
  • Recognizing communicative problems
  • ASR errors
  • User corrections
  • Identifying speech acts
  • Locating topic boundaries for topic tracking and
    audio browsing
  • Recognizing speaker emotion

4
But... Systems Have Trouble Knowing When They've
Made a Mistake
  • Hard for humans to correct system misconceptions
    (Krahmer et al 99)
  • User: I want to go to Boston.
  • System: What day do you want to go to Baltimore?
  • Easier answering explicit requests for
    confirmation or responding to ASR rejections
  • System: Did you say you want to go to Baltimore?
  • System: I'm sorry. I didn't understand you. Could
    you please repeat your utterance?
  • But constant confirmation or over-cautious
    rejection lengthens dialogue and decreases user
    satisfaction

5
And Systems Have Trouble Recognizing User
Corrections
  • Probability of recognition failures increases
    after a misrecognition (Levow 98)
  • Corrections of system errors often
    hyperarticulated (louder, slower, more internal
    pauses, exaggerated pronunciation) → more ASR
    error (Wade et al 92, Oviatt et al 96, Swerts &
    Ostendorf 97, Levow 98, Bell & Gustafson 99)

6
Can Prosodic Information Help Systems Perform
Better?
  • If errors occur where speaker turns are
    prosodically marked...
  • Can we recognize turns that will be misrecognized
    by examining their prosody?
  • Can we modify our dialogue and recognition
    strategies to handle corrections more
    appropriately?

7
Approach
  • Collect corpus from interactive voice response
    system
  • Identify speaker turns
  • incorrectly recognized
  • where speakers are first aware of the error
  • that correct misrecognitions
  • Identify prosodic features of turns in each
    category and compare to other turns
  • Use Machine Learning techniques to train a
    classifier to make these distinctions
    automatically

[Diagram: misrecognition → aware site → correction]
8
Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System.
This is TOOT. How may I help you? User: Hello.
I would like trains from Philadelphia to New York
leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to? User:
New York.
[Slide labels the turns above: misrecognition, correction, aware site]
9
TOOT Dialogues
  • Collected to study effects of differences in
    dialogue strategy on user performance and
    satisfaction (Litman & Pan 99)
  • type of initiative (system, user, mixed)
  • type of confirmation (explicit, implicit, none)
  • adaptability condition
  • Subjects
  • 39 summer students
  • 16/23 (F/M)
  • 20/19 (native speaker/non)

10
  • Platform: combined over-the-phone ASR and TTS
    (Kamm et al 97) with web access to train
    information
  • Task: find train information for 4 scenarios
  • Corpus for current study
  • 2328 speaker turns
  • 52 dialogues
  • Misrecognitions
  • Overall word accuracy: 61%
  • Overall concept accuracy (CA): 71%
  • I want to go to Boston from Philadelphia (2
    domain concepts)
  • recognized as I want to go to Boston (one
    concept): 50% CA (see the sketch below)

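As a reading aid for the CA figure above, here is a minimal sketch of concept accuracy as the fraction of reference domain concepts that survive recognition; the function name and the concept representation are hypothetical, not taken from the TOOT system:

def concept_accuracy(reference_concepts, recognized_concepts):
    """Fraction of reference domain concepts present in the ASR hypothesis."""
    if not reference_concepts:
        return 1.0
    hits = sum(1 for c in reference_concepts if c in recognized_concepts)
    return hits / len(reference_concepts)

# "I want to go to Boston from Philadelphia" carries 2 domain concepts;
# recognizing only "I want to go to Boston" keeps 1 of them -> CA = 0.5.
print(concept_accuracy({"dest=Boston", "orig=Philadelphia"}, {"dest=Boston"}))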
11
A Successful Dialogue
  • S: Hi, this is AT&T Amtrak schedule system. This
    is TOOT. How may I help you?
  • U: I would like a train to New York City from
    Philadelphia on Sunday at ten thirty P M.
  • S: I heard you say "go to New York City from
    Philadelphia on Sunday approximately at ten
    thirty p.m." Do you want me to find the trains
    from Philadelphia to New York City on Sunday
    approximately at ten thirty p. m. now?
  • U: Yes.
  • S: I am going to get the train schedule for
    you....

12
An Unsuccessful Dialogue
  • S: Hi, this is AT&T Amtrak schedule system. This
    is TOOT. How may I help you?
  • U: I'd like to find a train from Ch- to Chicago
    from Baltimore.
  • S: Sorry, I can't understand you. Please repeat
    your utterance.
  • U: I'd like to find a train from Baltimore to
    Chicago.
  • S: I heard you say "leave today around 5 thirty
    in the morning." Which city do you want to go
    to?
  • U: I would like to leave Saturday at eight in the
    morning.

13
S: You can say the arrival city name, such as
"New York City." U: Baltimore to Chicago. S: You
can say the arrival city name, such as "New York
City." U: Baltimore to Chicago. (I hate it.) U:
Train from Baltimore ... S: ...I heard you s... U:
to Chicago.... Hello? S: You can say the name of
your departure city, such as "New York City." U:
Baltimore.
14
U: Bal-ti-more.... S: You can say... U:
Baltimore. S: Do you want to continue and find a
new set of trains? U: Yes. S: You can say "yes"
or "I am done here" to find a new set of
trains. U: Yes. S: Thank you for using AT&T
Amtrak train time table system. See you next
time. U: I nev-
15
Are Misrecognitions, Aware Turns, Corrections
Measurably Different from Other Turns?
  • For each type of turn
  • For each speaker, for each prosodic feature,
    calculate mean values for e.g. all correctly
    recognized speaker turns and for all incorrectly
    recognized turns
  • Perform paired t-tests on these speaker pairs of
    means (e.g., for each speaker, pairing mean
    values for correctly and incorrectly recognized
    turns)

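A minimal sketch of the per-speaker paired t-test procedure described above, assuming turn-level data in a pandas table; the column names are invented for illustration:

import pandas as pd
from scipy.stats import ttest_rel

def paired_feature_test(turns: pd.DataFrame, feature: str):
    """Paired t-test over per-speaker means of one prosodic feature."""
    # One mean per speaker over correctly recognized turns ...
    ok = turns[~turns["misrecognized"]].groupby("speaker")[feature].mean()
    # ... and one per speaker over misrecognized turns.
    bad = turns[turns["misrecognized"]].groupby("speaker")[feature].mean()
    # Pair the means by speaker; drop speakers lacking either turn type.
    speakers = ok.index.intersection(bad.index)
    return ttest_rel(ok[speakers], bad[speakers])

# t, p = paired_feature_test(turns, "f0_max")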
16
Prosodic Features Examined per Turn
  • Raw prosodic/acoustic features
  • f0 maximum and mean (pitch excursion/range)
  • rms maximum and mean (amplitude)
  • total duration
  • duration of preceding silence
  • amount of silence within turn
  • speaking rate (estimated from syllables of
    recognized string per second)
  • Normalized versions of each feature (compared to
    first turn in task, to previous turn in task, Z
    scores)

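A sketch of the three normalization schemes listed above, applied to one raw feature for one speaker and task; the data layout is an assumption:

import numpy as np

def normalize_feature(values):
    """values: one raw feature value per turn, in task order."""
    v = np.asarray(values, dtype=float)
    by_first = v / v[0]                   # relative to first turn in the task
    by_prev = v[1:] / v[:-1]              # relative to the previous turn
    z = (v - v.mean()) / v.std()          # z-scores over the speaker's turns
    return by_first, by_prev, z

# Invented per-turn f0 maxima (Hz) for one speaker:
print(normalize_feature([210.0, 225.0, 260.0, 240.0]))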
17
Distinguishing Correct Recognitions from
Misrecognitions (NAACL 00)
  • Misrecognitions differ prosodically from correct
    recognitions in
  • F0 maximum (higher)
  • RMS maximum (louder)
  • turn duration (longer)
  • preceding pause (longer)
  • speaking rate (slower)
  • Effect holds up across speakers and even when
    hyperarticulated turns are excluded

18
WER-Based Results
Misrecognitions are higher in pitch, louder, and
longer, with more preceding pause and less internal
silence
19
Does Hyperarticulation Lead to ASR Error?
  • In TOOT corpus
  • 24.1% of turns (perceived as) hyperarticulated
  • Hyperarticulated turns are recognized more poorly
    (59.5% WER) than non-hyperarticulated turns
    (32.8%)
  • More misrecognized turns are hyperarticulated
    (36.5%) than correctly recognized turns (16.0%)
  • But ... the same results hold without
    hyperarticulated turns

20
Predicting Turn Types Using Machine Learning
  • Ripper (Cohen 96) automatically induces rule
    sets for predicting turn types
  • greedy search guided by measure of information
    gain
  • input: vectors of feature values
  • output: ordered rules for predicting dependent
    variable and X-validated scores for each ruleset
  • Independent variables
  • all prosodic features, raw and normalized
  • experimental conditions (initiative type,
    confirmation style, adaptability, subject, task)
  • gender, native/non-native status
  • ASR recognized string, grammar, and acoustic
    confidence score

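Ripper itself is not bundled with common Python ML toolkits; as a runnable stand-in, this sketch trains a small decision tree on invented turn features and prints its rules. A faithful replication would use an ordered rule learner and the full feature set above:

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented feature vectors: [ASR confidence, normalized duration, tempo].
X = [[-2.1, 1.3, 0.9],
     [-4.5, 2.0, 0.4],
     [-1.0, 0.8, 1.1],
     [-3.9, 1.7, 0.3]]
y = ["T", "F", "T", "F"]  # T = correctly recognized, F = misrecognized

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(clf, feature_names=["conf", "dur", "tempo"]))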
21
ML Results: WER-defined Misrecognition
22
Best Rule-Set for Predicting WER
Using prosody, ASR confidence, ASR string, ASR grammar:
if (conf < -2.85) and (duration > 1.27) then F
if (conf < -4.34) then F
if (tempo < .81) then F
if (conf < -4.09) then F
if (conf < -2.46) and (str contains "help") then F
if (conf < -2.47) and (ppau > .77) and (tempo < .25) then F
if (str contains "nope") then F
if (dur > 1.71) and (tempo < 1.76) then F
else T
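Transcribed into Python as a reading aid, with thresholds exactly as above; the slide's separate "duration" and "dur" features are kept as distinct parameters, since they may be different variants (e.g. raw vs. normalized):

def predict_recognized(conf, duration, dur, tempo, ppau, asr_str):
    """Ordered rule set above; False (F) = predicted misrecognition."""
    if conf < -2.85 and duration > 1.27: return False
    if conf < -4.34: return False
    if tempo < 0.81: return False
    if conf < -4.09: return False
    if conf < -2.46 and "help" in asr_str: return False
    if conf < -2.47 and ppau > 0.77 and tempo < 0.25: return False
    if "nope" in asr_str: return False
    if dur > 1.71 and tempo < 1.76: return False
    return True  # T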
23
Analyses of Awares and Corrections
  • Awares
  • Shorter, somewhat louder, with less internal
    silence compared to other turns
  • Poorly recognized (49.9% misrecognized vs. 34.6%)
  • ML results: 30.4% baseline (!aware); mean error
    12.2% (+/- .61)
  • Corrections
  • longer, louder, higher in pitch excursion, longer
    preceding pause, less internal silence
  • ML results: 30% baseline; mean error 21.48%
    (+/- 0.68)

24
Turn Types
S: Hi. This is AT&T Amtrak Schedule System.
This is TOOT. How may I help you? U: Hello. I
would like trains from Philadelphia to New York
leaving on Sunday at ten thirty in the evening.
S: Which city do you want to go to? U: New
York.
[Slide highlights the aware site in the dialogue above]
25
Awareness Sites
  • Shorter, somewhat louder, with less internal
    silence compared to other turns
  • Poorly recognized (49.9% misrecognized vs. 34.6%)

26
ML Rules for Aware Prediction
  • 30.4% baseline (!aware); average error 12.2%
    (+/- .61)
  • T ← preconf < -4.06, pretpo < 2.65, ppau > 0.25
  • T ← preconf < -3.59, prerej = T
  • T ← preconf < -2.85, predur > 1.04, tponm2 > 1.04,
    preppau > 0.57, pretpo < 2.18
  • T ← preconf < -3.78, pmnsyls > 4.04
  • T ← preconf < -2.75, prestr contains "help"
  • T ← pregram = universal, pprewords > 2
  • T ← preconf < -2.60, predur > 1.04,
    zerosnm1 < 1.06, prermsav > 370.65
  • T ← pretpo < 0.13
  • T ← predur > 1.27, pretpo < 2.36, prermsav > 245.36
  • T ← pretpo < 0.80, pmntpo < 1.75, ppretponm2 < 1.39
  • default: F

27
TOOT Corrections (ICSLP 00b)
TOOT: Hi. This is AT&T Amtrak Schedule System.
This is TOOT. How may I help you? User: Hello.
I would like trains from Philadelphia to New York
leaving on Sunday at ten thirty in the evening.
TOOT: Which city do you want to go to? User: New
York.
[Slide highlights the correction in the dialogue above]
28
Serious Problem for Spoken Dialogue Systems
  • 29% of turns in our corpus are corrections
  • 52% of corrections are hyperarticulated but only
    12% of other turns
  • Corrections are misrecognized at least twice as
    often as non-corrections (60% vs. 31%)
  • But corrections are no more likely to be rejected
    than non-corrections (9% vs. 8%)
  • Are corrections also measurably distinct from
    non-corrections?

29
Prosodic Indicators of Corrections
  • Corrections differ from other turns prosodically:
    longer, louder, higher in pitch excursion,
    longer preceding pause, less internal silence
  • ML results
  • Baseline: 30% error
  • norm'd prosody + non-prosody: 18.45% (+/- 0.78)
  • automatic: 21.48% (+/- 0.68)

30
Prosodic Indicators of Corrections
  • Corrections differ from other turns prosodically:
    longer, louder, higher in pitch excursion,
    longer preceding pause, less internal silence

31
ML Rules for Correction Prediction
  • Baseline: 30% error (predict not correction)
  • norm'd prosody + non-prosody: 18.45% (+/- 0.78)
  • automatic: 21.48% (+/- 0.68)
  • T ← gram = universal, f0max > 0.96, dur > 6.55
  • T ← gram = universal, zeros > 0.57, asr < -2.95
  • T ← gram = universal, f0max < 1.98, dur < 1.10,
    tempo > 1.21, zeros > 0.71
  • T ← dur > 0.76, asr < -2.97, strat = UsrNoConf
  • T ← dur > 2.28, ppau < 0.86
  • T ← rmsav > 1.11, strat = MixedImplicit,
    gram = cityname, f0max > 0.70
  • default: FALSE

32
Corrections in Context
  • Similar in prosodic features but
  • What about their form and content?
  • How do system behaviors affect the corrections
    users produce?
  • What sorts of corrections are most and least
    effective?
  • When users correct the same mistake more than
    once, do they vary their strategy in productive
    ways?

33
User Correction Behavior
  • Correction classes
  • omits and repetitions lead to fewer
    misrecognitions than adds and paraphrases
  • Turns that correct rejections are more likely to
    be repetitions, while turns correcting
    misrecognitions are more likely to be omits

34
Role of System Strategy
Would you use again (1-5): SE (3.5), MI (2.6),
UNC (1.7); Satisfaction (0-40): SE (31.25), MI
(24.00), UNC (22.10)
35
  • Type of correction is sensitive to strategy
  • users are much more likely to exactly repeat their
    misrecognized utterance in a system-initiative
    environment
  • much more likely to correct by omitting
    information if no system confirmation than with
    explicit confirmation
  • omits used more in MixedImplicit and
    UserNoConfirm conditions
  • Restarts unlikely to be recognized (77%
    misrecognized) and skewed in distribution
  • 31% of corrections are restarts in MI and UNC
  • None for SE, where initial turns are well
    recognized
  • It doesn't pay to start over!

36
Correction Chains
S: You can say the arrival city name, such as
"New York City." U: Baltimore to Chicago. (I hate
it.) U: Train from Baltimore ... S: ...I heard you
s... U: to Chicago.... Hello? S: You can say the
name of your departure city, such as "New York
City." U: Baltimore. Bal-ti-more.... S: You can
say... U: Baltimore.
37
Effects of Distance from Error on User Corrections
  • Corrections farther from the original error in
    terms of chain position are
  • higher in f0 (max, mean), lower in rms (max, mean),
    longer in duration (sec, words), slower, with more
    preceding pause, and lower in CA (and word accuracy
    for the latter)
  • A puzzle: corrections more distant from the
    immediately preceding error are more likely to be
    recognized
  • But similar in f0, rms, and duration
  • Not a difference among strategies

38
Future Research
  • Hypothesis: We can improve system recognition and
    error recovery by
  • Analyzing user input differently to identify
    higher level features of turns systems are likely
    to misrecognize -- and turns speakers produce to
    correct their errors
  • Anticipating user responses to system errors in
    the context of different strategies and targeting
    special error recovery procedures
  • Next: combining our predictors and an
    over-the-phone interface to SCANMail, our
    voicemail browsing and retrieval system

39
Overview
  • Recognizing communicative problems
  • ASR errors
  • User corrections
  • Disfluencies and self-repairs
  • Identifying speech acts
  • Locating topic boundaries for topic tracking and
    audio browsing
  • Recognizing speaker emotion

40
Disfluencies: Problems and Possibilities for
Dialogue Systems
  • Disfluencies abound in spontaneous speech
  • every 4.6s in radio call-in (Blackmer & Mitton
    91)
  • many more in audio-only conditions (Kasl & Mahl
    65, Oviatt 95)
  • Hard to define and often unperceived by listeners
  • hesitation: Ch- change strategy.
  • filled pause: Um Baltimore.
  • self-repair: Ba- uh Chicago.

41
Recognizing Disfluency
  • Hard to detect in ASR systems and often leads to
    recognition error
  • Ch- change strategy. → to D C D C today ten
    fifteen.
  • Um Baltimore. → From Baltimore ten.
  • Ba- uh Chicago. → For Boston Chicago.
  • ASR systems usually try to treat disfluencies as
    garbage, or assume that, e.g., repairs can be
    identified from a (correct) transcription
  • But this may lose valuable meta-level information

42
What Might Disfluencies Convey?
  • Speaker's cognitive load?
  • Difficulty of domain (Clark & Brennan 91)
  • Difficulty of task
  • Inter-speaker co-ordination?
  • Turn-taking (Boomer 65, Goodwin 81, Shriberg
    96)
  • Difficulties in accessing information (Brennan &
    Williams 95, Bortfeld & Brennan 97)
  • Speaker error (Brennan & Schober, to appear)
  • Discourse structure (Swerts 98)
  • Effects of familiarity, age, gender? (Bortfeld et
    al 99)

43
Can Prosodic Cues Help in Detection?
  • Filled pauses
  • Shriberg 94: longer duration and falling f0
    contour
  • Self-repairs
  • Levelt & Cutler 83: increase in prominence on
    repair
  • Hindle 83: edit signal hypothesis
  • Blackmer & Mitton 91: aberrant phones
  • Howell & Young 91: increase in prior pause and
    f0 before repair (the latter facilitated perception)
  • Bear et al 92: increase in internal pausal
    duration and f0 of repair

44
(No Transcript)
45
RIM Model of Self-Repairs (Nakatani & Hirschberg
94)
  • ATIS corpus
  • 6414 turns with 346 (5.4%) repairs from 122
    speakers
  • hand-labeled for repairs and prosodic features
  • Findings
  • Reparanda: 73% end in fragments, 30% in
    glottalization, co-articulatory gestures
  • DI (disfluency interval): pausal duration differs
    significantly from fluent boundaries; small
    increase in f0 and amplitude
  • Repairs: offsets occur at phrase boundaries;
    phrasing differences
  • CART prediction: 86% precision, 91% recall
  • Duration of interval, presence of fragment, pause
    filler, p.o.s., lexical matching across DI

46
Overview
  • Recognizing communicative problems
  • ASR errors
  • User corrections
  • Disfluencies and self-repairs
  • Identifying speech acts
  • Locating topic boundaries for topic tracking and
    audio browsing
  • Recognizing speaker emotion

47
Interpreting Speech/Dialogue Acts
  • What function(s) is a speaker turn serving?
  • Same phrase can perform different speech acts
  • Yes: acknowledgment, acceptance, question, ...
  • Different phrases can perform the same speech act
  • Yes, Right, Okay, Certainly, ...
  • Can prosody distinguish differences/identify
    commonalities?
  • Okay
  • Contours distinguish different uses (Hockey 91)
  • Contours + context distinguish different uses
    (Kowtko 96)

48
Automatic Speech/Dialogue Act Recognition
  • S/DA recognition important for
  • Turn recognition (which grammar to use when)
  • Turn disambiguation, e.g.
  • S: What city do you want to go to?
  • U1: Boston. (reply)
  • U2: Pardon? (request for information)
  • S: Do you want to go to Boston?
  • U1: Boston. (confirmation)
  • U2: Boston? (question)

49
Current Approaches
  • Statistical modeling to recognize phrase
    boundaries and accent from acoustic evidence for
    Verbmobil (Nöth et al 99)
  • Prosodic boundaries provide potential DA
    boundaries
  • Most frequently accented words (salient words) in
    training corpus + p.o.s. improve key-word
    selection for identifying DAs (see the sketch
    below)
  • ACCEPT (ok, all right, marvellous, Friday, free)
  • SUGGEST (Monday, Friday, Thursday, Wednesday,
    Saturday)
  • Improvements in DA identification over
    non-prosody approaches (also cf Shriberg et al
    98 on Switchboard, Taylor et al 98 on Map Task)

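A minimal sketch of the salient-word idea: rank the words most often accented in turns of a given dialogue act, filtered by part of speech. The data layout and the tag set are invented for illustration:

from collections import Counter

def salient_words(turns, da, keep_pos=("UH", "JJ", "NN", "NNP"), top=5):
    """turns: iterable of (da_label, [(word, pos, accented), ...])."""
    counts = Counter(
        word.lower()
        for label, words in turns if label == da
        for word, pos, accented in words
        if accented and pos in keep_pos
    )
    return [word for word, _ in counts.most_common(top)]

# e.g. salient_words(corpus, "ACCEPT") might yield ["ok", "all", "right", ...]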
50
  • Key features
  • f0 (range better than accent or phrasing),
    duration, energy, rate (Shriberg et al 98)
  • But little improvement of ASR accuracy
  • Importance of DA coding scheme
  • Some DAs more usefully disambiguated than others
  • Some coding schemes more disambiguable than others

51
Overview
  • Recognizing communicative problems
  • ASR errors
  • User corrections
  • Disfluencies and self-repairs
  • Identifying speech acts
  • Locating topic boundaries for topic tracking and
    audio browsing
  • Recognizing speaker emotion

52
Prosodic Correlates of Discourse/Topic Structure
  • Pitch range
  • Lehiste 75, Brown et al 83, Silverman 86,
    Avesani & Vayra 88, Ayers 92, Swerts et al 92,
    Grosz & Hirschberg 92, Swerts & Ostendorf 95,
    Hirschberg & Nakatani 96
  • Preceding pause
  • Lehiste 79, Chafe 80, Brown et al 83,
    Silverman 86, Woodbury 87, Avesani & Vayra 88,
    Grosz & Hirschberg 92, Passonneau & Litman 93,
    Hirschberg & Nakatani 96

53
  • Rate
  • Butterworth 75, Lehiste 80, Grosz &
    Hirschberg 92, Hirschberg & Nakatani 96
  • Amplitude
  • Brown et al 83, Grosz & Hirschberg 92,
    Hirschberg & Nakatani 96
  • Contour
  • Brown et al 83, Woodbury 87, Swerts et al 92

54
Automatic Topic Segmentation
  • Important for audio browsing and retrieval tasks
  • Broadcast News (NIST TREC SDR track)
  • Topic Detection and Tracking (NIST/DARPA TDT)
  • Customer care call recordings, focus groups
  • Most work relies on lexical information (Hearst 94,
    Reynar 98, Beeferman et al 99)

55
Prosodic Cues to Segmentation
  • Paratones: intonational paragraphs (Brown et al
    80, Nakatani & Hirschberg 95)
  • Recent results (Shriberg et al 00) show prosodic
    cues perform as well as or better than text-based
    cues at topic segmentation -- and generalize
    better?
  • Goal: identify sentence and topic boundaries at
    ASR-defined word boundaries
  • Procedure (see the sketch below)
  • CART decision trees provided boundary predictions
  • an HMM combined these with lexical boundary
    predictions

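A simplified, runnable sketch of the combination step: interpolate per-boundary prosodic (CART) and lexical (LM) posteriors in the log domain and threshold the result. The actual system used an HMM over the boundary sequence, which this omits; all numbers and names are invented:

import math

def combine_posteriors(p_prosody, p_lexical, w=0.5, threshold=0.5):
    """Log-linear interpolation of two boundary posteriors per site."""
    scores = [
        math.exp(w * math.log(pp) + (1.0 - w) * math.log(pl))
        for pp, pl in zip(p_prosody, p_lexical)
    ]
    return [s > threshold for s in scores]

# Four candidate boundaries; prints [True, False, True, False].
print(combine_posteriors([0.9, 0.2, 0.7, 0.1], [0.8, 0.3, 0.4, 0.2]))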
56
  • Features
  • Pause at boundary (raw and normalized by speaker)
  • Pause at word before boundary
  • Normalized phone and rhyme duration
  • F0 (smoothed/stylized): reset, range, slope and
    continuity
  • Voice quality (halving/doubling estimates as
    correlates of creak or glottalization)
  • Speaker change, time from start of turn, turns
    in conversation and gender
  • Topic segmentation results (BN only)
  • Prosody alone better than LM; combined improves
    significantly
  • Useful features: pause at boundary, f0 range,
    turn/no turn, gender, time in turn

57
Features Examined
  • For each potential boundary location
  • Pause at boundary (raw and normalized by speaker)
  • Pause at word before boundary (is this a new
    turn or part of continuous speech segment?)
  • Phone and rhyme duration (normalized by inherent
    duration) (phrase-final lengthening?)
  • F0 (smoothed and stylized): reset, range
    (topline, baseline), slope and continuity
  • Voice quality (halving/doubling estimates as
    correlates of creak or glottalization)
  • Speaker change, time from start of turn, turns
    in conversation and gender
  • Trained/tested on Switchboard and Broadcast News

58
  • Sentence segmentation results
  • Prosody better than LM for BN, but worse (on
    transcription) and the same (on recognition
    output) for SB; all better than chance
  • Useful features for BN: pause at boundary,
    turn/no turn, f0 difference across boundary,
    rhyme duration
  • Useful features for SB: phone/rhyme duration
    before boundary, pause at boundary, turn/no turn,
    pause at preceding word boundary, time in turn
  • Topic segmentation results (BN only)
  • Useful features: pause at boundary, f0 range,
    turn/no turn, gender, time in turn
  • Prosody alone better than LM; combined improves
    significantly

59
SCAN
60
SCAN demo
61
[Architecture diagram: SCANMail HUB/DB connecting the
POP-3, ASR, Email, AUDIX, Information Extraction,
Caller Id, and IR servers to the client]
62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
Overview
  • Recognizing communicative problems
  • ASR errors
  • User corrections
  • Disfluencies and self-repairs
  • Identifying speech acts
  • Locating topic boundaries for topic tracking and
    audio browsing
  • Recognizing speaker emotion

66
Identifying Emotion
  • Human perception (Cahn 88, Murray & Arnott 93)
  • Automatic identification (Nöth et al 99)

67
Future of Prosody in Recognition/Understanding
  • Finding more non-traditional aspects of
    recognition/understanding where prosody can be
    useful
  • Finding better ways to map linguistic information
    (e.g. accent) into objective acoustic measures
  • Finding applications where prosodic information
    makes a difference