Predicting ASR Performance Using Prosodic Cues

1
Predicting ASR Performance Using Prosodic Cues

Diane J. Litman
Julia B. Hirschberg
Marc Swerts
2
Problem
  • ASR is imperfect and prone to error
  • Users
    • Have to try to correct system misunderstandings
    • Are frustrated by unnecessary confirmations and
      rejections
  • Recognize errors
    • How likely is the recognition to be correct?
  • Handle errors
    • Verifying user input
    • Reprompting for new input
    • Changing interaction strategy
    • Switching to human attendant

3
ASR performance depends on
  • Speaker
    • Gender
    • Age
    • Native versus non-native
  • Speaking style
    • Hyperarticulated speech
      • Occurs when people try to correct the system
      • Decreases performance
  • Deviation of new speech from training data

Prosodic cues
4
Tasks
  • Compare correctly and incorrectly recognized
    speaker turns with regard to prosodic features
  • How useful are prosodic features for predicting
    recognition errors?
    • On their own
    • Combined with other information

5
Design
  • TOOT corpus: spoken dialogue system for train
    information via telephone
  • Subjects: 39 students
    • 20 native / 19 non-native speakers
    • 16 female / 23 male
  • 1994 user turns, 152 dialogues
  • Concept accuracy (CA), manually labeled
    • 1 if ASR correctly recognized all task-related
      information in the turn (time, departure, arrival
      cities), otherwise the percentage of correct recognitions
    • 1410 of 1975 turns with CA = 1 (mean 71%)
  • Word error rate (WER)
    • Manual transcription compared to recognized
      string
    • 961 of 1975 turns with WER = 0 (mean 49%, mean per turn 47%)
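
The two measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scoring procedure used for TOOT: WER is computed here as word-level edit distance divided by the reference length, and CA as the fraction of reference concepts recognized correctly. The function names and the dictionary representation of concepts are hypothetical.

```python
from typing import Dict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def concept_accuracy(reference: Dict[str, str], recognized: Dict[str, str]) -> float:
    """Fraction of reference concepts recognized correctly (1.0 if all are).

    Assumption: concepts are stored as a name -> value mapping,
    e.g. {"departure": "boston"}.
    """
    if not reference:
        raise ValueError("reference concepts must not be empty")
    correct = sum(1 for name, value in reference.items()
                  if recognized.get(name) == value)
    return correct / len(reference)
```

For example, scoring the made-up recognition "leave from newark" against the transcription "leave from new york" counts one substitution and one deletion over four reference words.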

6
Distinguishing Correct from Incorrect Recognitions
  • Prosodic features
    • F0 mean, max
    • Energy (rms) mean, max
    • Total duration
    • Prior pause
    • Speaking rate (syllables per second)
    • Percentage of silence
  • Scoring results for each speaker separately and
    combining results with a paired t-test
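
A minimal sketch of this per-speaker comparison, assuming each turn is stored as a (speaker, misrecognized, feature value) tuple. The helper names are hypothetical and the code only computes the t statistic; a p-value would need the t distribution with n - 1 degrees of freedom (for instance from a statistics package).

```python
import math
from collections import defaultdict
from statistics import mean, stdev
from typing import Iterable, List, Tuple


def per_speaker_means(
    turns: Iterable[Tuple[str, bool, float]]
) -> Tuple[List[float], List[float]]:
    """Return paired per-speaker means: (misrecognized, correctly recognized).

    Speakers lacking either kind of turn are skipped, since they cannot
    contribute a pair.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for speaker, misrecognized, value in turns:
        groups[speaker][misrecognized].append(value)
    mis_means, ok_means = [], []
    for speaker in sorted(groups):
        values = groups[speaker]
        if values[True] and values[False]:
            mis_means.append(mean(values[True]))
            ok_means.append(mean(values[False]))
    return mis_means, ok_means


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """Paired t statistic: mean difference over its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))
```

A positive statistic means the feature tends to be higher in a speaker's misrecognized turns than in the same speaker's correctly recognized turns.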

7
Results
  • Misrecognized turns
    • Higher F0 max, higher rms energy
    • Longer durations
    • Longer preceding pauses
    • WER-defined: slightly higher mean F0 and rms max,
      lower percentage of silence
    • CA-defined: lower tempo (?)
  • Results derived from raw values for all prosodic
    features are similar to results derived from
    normalized values (normalized to first or
    preceding turn, or converted to z scores)
  • Speaker-relative differences of prosodic values
    are important (versus differences from some
    speaker-independent range)
  • Cf. Table 1, 2 in paper
  • The features found are associated with
    hyperarticulated speech, but excluding
    hyperarticulated turns led to the same results
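
The three normalizations mentioned on this slide could look roughly as follows for one speaker's turns in dialogue order. This is a sketch under assumptions: the slide does not say whether normalization divides by or subtracts the reference turn, so a ratio is assumed here, and the z scores use the sample standard deviation. All function names are hypothetical.

```python
from statistics import mean, stdev
from typing import List


def normalize_by_first(values: List[float]) -> List[float]:
    """Express every turn relative to the speaker's first turn (ratio assumed)."""
    if not values or values[0] == 0:
        raise ValueError("need a non-zero first-turn value")
    return [v / values[0] for v in values]


def normalize_by_preceding(values: List[float]) -> List[float]:
    """Express each turn relative to the turn before it (ratio assumed).

    The first turn has no predecessor, so the result is one item shorter.
    """
    if any(prev == 0 for prev in values[:-1]):
        raise ValueError("cannot divide by a zero-valued preceding turn")
    return [curr / prev for prev, curr in zip(values, values[1:])]


def z_scores(values: List[float]) -> List[float]:
    """Convert one speaker's values to z scores (sample standard deviation)."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    center, spread = mean(values), stdev(values)
    if spread == 0:
        raise ValueError("values are constant; z scores are undefined")
    return [(v - center) / spread for v in values]
```

All three variants remove a speaker's overall level from the feature, which is why they capture speaker-relative rather than absolute differences.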

8
Predicting Misrecognitions Using Machine Learning
  • Machine learning program RIPPER (Cohen, 1996)
  • Input
    • Classes to be learned (i.e. correct/incorrect
      recognition)
    • Set of feature names and possible values
      • Prosodic features
      • ASR grammar
      • ASR confidence
      • ASR string
      • Experiment conditions (subject, gender, native
        speaker, task number, dialogue strategy)
    • Training data
  • Output: classification model for predicting the
    class
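
RIPPER produces its classification model as an ordered list of if-then rules with a default class. The sketch below shows how such a rule list is applied to a turn and how its error rate is measured. The rules and thresholds are invented for illustration only; they are not rules learned from TOOT, and the feature names and units are assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Turn = Dict[str, float]
Rule = Tuple[str, Callable[[Turn], bool]]

# Hypothetical rules with made-up thresholds, NOT output of RIPPER.
HYPOTHETICAL_RULES: List[Rule] = [
    ("misrecognized", lambda t: t["confidence"] < -3.0 and t["duration"] > 1.5),
    ("misrecognized", lambda t: t["tempo"] < 2.0),
]
DEFAULT_CLASS = "correct"


def classify(turn: Turn,
             rules: Sequence[Rule] = HYPOTHETICAL_RULES,
             default: str = DEFAULT_CLASS) -> str:
    """Return the class of the first rule that fires, else the default class."""
    for label, condition in rules:
        if condition(turn):
            return label
    return default


def error_rate(turns: Sequence[Turn], labels: Sequence[str]) -> float:
    """Fraction of turns whose predicted class differs from the true label."""
    if len(turns) != len(labels) or not turns:
        raise ValueError("need equally long, non-empty turns and labels")
    wrong = sum(1 for turn, label in zip(turns, labels)
                if classify(turn) != label)
    return wrong / len(turns)
```

Because the rules are tried in order, a turn is labeled by the first rule whose conditions all hold; turns matching no rule fall through to the default class.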

9
Results for WER-defined misrecognitions
  • Comparing the performance of classification
    models derived from different feature sets
  • ASR confidence score (22.23% error) is worse than ASR
    confidence plus prosodic features (10.99%) and worse
    than prosodic features alone (12.76%)
  • ASR confidence plus ASR grammar (17.77%) is worse
    than prosodic features
  • ASR string is the best single feature; it is available
    to ASR systems but not used so far!
  • Unnormalized prosodic features are better than
    normalized ones (-> ranges of optimal recognition)

10
Classification rules
  • Duration, confidence score, string, tempo,
    silence, F0 are used
  • Not the same features as those shown to be
    significant in the statistical analyses
  • No features from experimental conditions were used

11
Results for CA-defined misrecognitions
  • Features which predicted WER-defined
    misrecognition were less successful in predicting
    CA-defined misrecognition
  • Predictive power of prosodic over ASR features
    decreases, even though ASR confidence scores should
    predict WER rather than CA!
  • Prosodic features still improve predictive power
    of ASR confidence scores (13.52% vs. 11.34% error, but
    not significant at the 95% confidence level)
  • For CA-defined misrecognitions, adding prosodic
    features yields only minor improvements

12
Summary
  • Adding prosodic information to ASR-available
    information improves prediction of misrecognition
    based on WER
  • CA-based improvements are smaller
  • Future research will focus on the reasons for
    these differences
  • Why is prosody so useful for indicating
    misrecognition?
    • Direct or indirect correlation?
    • Longer utterances may provide more chance for
      error than shorter ones

13
W99 corpus
  • Spoken dialog system used for conference
    registration
  • 3000 utterances obtained by experimental testing
    as well as by non-experimental data collection
  • W99 uses newer and more robust technology than
    TOOT
  • Only WER-based misrecognitions computable

14
Results for W99
15
Results for W99
  • When normalizing values by the preceding turn, only
    duration, F0 maximum and tempo remain significant
  • Absolute deviation from some particular range is
    associated with recognition failures, rather than
    relative differences in prosodic values

16
W99 vs. TOOT
  • Both
    • Prosody plus ASR confidence is better than
      confidence alone
    • Multiple prosodic features outperform single ones
  • Differences
    • Best single feature: W99 silence, TOOT duration
    • TOOT: best-performing feature set included prosody;
      W99: it did not
    • TOOT: prosody alone is better than ASR confidence;
      W99: worse (but not significant)
  • Conclusion: the better the ASR model, the smaller
    the advantage of using prosodic features

17
Goat phenomenon
  • Some speakers are recognized worse than others
  • Goats
    • Higher overall F0 maxima
    • Lower tempo
  • Normalizing by the first turn: F0 no longer
    significant -> goats have higher ranges
  • Goats' speech does not exhibit exactly the
    features that characterize misrecognized turns!