1
Ideas in Confidence Annotation
  • Arthur Chan

2
Three papers for today
  • Frank Wessel et al., Using Word Probabilities as Confidence Measures
  • http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
  • Timothy Hazen et al., Recognition Confidence Scoring for Use in Speech Understanding Systems
  • http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
  • Dan Bohus and Alex Rudnicky, A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems
  • http://www.cs.cmu.edu/dbohus/docs/dbohus_interspeech05.pdf

3
Application of Confidence Annotation
  • Provides the system a decision on whether the ASR output can be trusted.
  • Possible response strategies:
  • Reject the sentence altogether.
  • Confirm with the user again.
  • Both, e.g. a bi-threshold system.
  • Detection of OOV, e.g.
  • If the ASR vocabulary doesn't include the OOV word:
  • "What is the focus for paramus park new jersey"
  • "What is the forecast for paris park new jersey"
  • "Paramus" is OOV, so the system should not be confident about the phoneme transcription.
  • Improve speech recognition performance
  • Why? In general, the posterior should be used instead of the likelihood.
  • Does it help? At the 2-5% relative level.

4
How this seminar proceeds
  • For each idea, 3 papers were studied.
  • Only the most representative became the suggested reading.
  • Results will be quoted from different papers.

5
Preliminary
  • Mathematical Foundation
  • Neyman-Pearson Theorem (NPT)
  • Consequence of NPT
  • In general, likelihood ratio test is the most
    powerful test to decide which one of the two
    distributions is in force.
  • H1: Distribution A is in force.
  • H2: Distribution B is in force.
  • Compute
  • F(H1)/F(H2) ≷ T
  • In speech recognition,
  • H1 could be the speech model, H2 could be the
    non-speech (garbage) model.
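A minimal sketch of the likelihood ratio test, with two made-up Gaussian densities standing in for the speech (H1) and garbage (H2) models:

```python
import math

def likelihood_ratio_test(x, pdf_h1, pdf_h2, threshold):
    """Decide between two hypotheses by comparing the likelihood ratio
    F(H1)/F(H2) against a threshold T (Neyman-Pearson)."""
    ratio = pdf_h1(x) / pdf_h2(x)
    return "H1" if ratio > threshold else "H2"

def gaussian(mean, var):
    """1-D Gaussian density as a callable."""
    return lambda x: math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

speech = gaussian(0.0, 1.0)   # hypothetical speech model (H1)
garbage = gaussian(3.0, 1.0)  # hypothetical non-speech/garbage model (H2)

print(likelihood_ratio_test(0.5, speech, garbage, 1.0))  # prints H1
```

Raising the threshold T trades false acceptances for false rejections, which is exactly the ROC trade-off mentioned later.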

6
Idea 1 Belief of a single ASR feature
  • 3 studied papers
  • (suggested) Frank Wessel et al., Using Word Probabilities as Confidence Measures
  • http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
  • Stephen Cox and Richard Rose, Confidence Measures for the Switchboard Database
  • http://www.ece.mcgill.ca/rose/papers/cox_rose_icassp96.pdf
  • Thomas Kemp and Thomas Schaaf, Estimating Confidence using Word Lattices
  • http://overcite.lcs.mit.edu/cache/papers/cs/1116/httpzSzzSzwww.is.cs.cmu.eduzSzwwwadmzSzpaperszSzspeechzSzEUROSPEECH97zSzEUROSPEECH97-thomas.pdf/kemp97estimating.pdf
  • Paper chosen because:
  • It has the clearest math, in minute detail,
  • though it is less motivating than Cox's paper.

7
Origins of Confidence Measure in speech
recognition
  • Formulation of speech recognition:
  • P(W|A) = P(A|W) P(W) / P(A)
  • In decoding, P(A) is ignored because it is a common term:
  • W* = argmax_W P(A|W) P(W)
  • Problem
  • P(A,W) is just a relative measure.
  • P(W|A) is the true measure of how probable a word is given the features.

8
In reality
  • P(A) can only be approximated.
  • By the law of total probability:
  • P(A) = Σ_W P(A,W)
  • N-best lists and word lattices are therefore used.
  • Other ideas: filler/garbage/general speech models → keyword-spotter tricks
  • A threshold on the ratio needs to be found.
  • The ROC curve always needs to be manually interpreted.
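As a sketch, P(A) can be approximated by summing joint scores over an N-best list; the hypothesis strings and log scores below are made up:

```python
import math

def posteriors_from_nbest(joint_log_scores):
    """Approximate P(W|A) = P(A,W) / sum_W' P(A,W'), where P(A) is
    approximated by the sum over the N-best hypotheses only.
    joint_log_scores: dict mapping hypothesis -> log P(A,W)."""
    # log-sum-exp for numerical stability
    m = max(joint_log_scores.values())
    log_total = m + math.log(sum(math.exp(s - m) for s in joint_log_scores.values()))
    return {w: math.exp(s - log_total) for w, s in joint_log_scores.items()}

# Hypothetical N-best joint scores in the log domain
nbest = {"what is the forecast": -10.0,
         "what is the focus": -11.5,
         "what is the four cast": -13.0}
posteriors = posteriors_from_nbest(nbest)
```

The approximation only sums over the hypotheses the recognizer kept, so posteriors computed this way are slightly overestimated relative to the true P(W|A).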

9
Things that people are not confident about
  • All sorts of things:
  • Frame
  • Frame likelihood ratio
  • Phone
  • Phone likelihood ratio
  • Word
  • Posterior probability → a kind of likelihood ratio too
  • Word likelihood ratio
  • Sentence
  • Likelihood

10
General Observation from Literature
  • Word-level confidence performs the best (CER)
  • The word-lattice method is slightly more general.
  • This part of the presentation will focus on the word-lattice-based method.

11
Word posterior probability: the authors' definition
  • W_a: word hypothesis preceding w
  • W_e: word hypothesis succeeding w

12
Computation with lattice
  • Only the hypotheses included in the lattice need to be computed.
  • An alpha-beta type of computation can be used.
  • Similar to the forward-backward algorithm

13
Forward probability
  • For an end time t,
  • Read: the total posterior probability of paths ending at t that are identical to h
  • Recursive formula

14
Backward probability
  • For a begin time t
  • One LM score is missing from the definition; it is added back later in the computation
  • Recursion

15
Posterior Computation
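The alpha-beta computation described on the preceding slides can be sketched on a toy lattice. The node ids, words and log scores below are hypothetical, and links are assumed to be topologically ordered:

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def word_link_posteriors(links, start, end):
    """Forward-backward over a word lattice.
    links: (from_node, to_node, word, log_score) tuples in topological
    order; log_score already combines the scaled AM and LM scores.
    Returns the posterior P(link | A) of every link."""
    alpha = {start: 0.0}                       # forward log probabilities
    for u, v, w, s in links:
        c = alpha[u] + s
        alpha[v] = logsumexp([alpha[v], c]) if v in alpha else c
    beta = {end: 0.0}                          # backward log probabilities
    for u, v, w, s in reversed(links):
        c = beta[v] + s
        beta[u] = logsumexp([beta[u], c]) if u in beta else c
    total = alpha[end]                         # log P(A) over the whole lattice
    return {(u, v, w): math.exp(alpha[u] + s + beta[v] - total)
            for u, v, w, s in links}

# Toy 4-node lattice with two competing first and last words
links = [(0, 1, "what", -1.0), (0, 1, "watt", -2.0),
         (1, 2, "is", -0.5),
         (2, 3, "forecast", -1.0), (2, 3, "focus", -1.5)]
post = word_link_posteriors(links, start=0, end=3)
```

Competing links between the same nodes get posteriors that sum to one, while a link shared by every path ("is") gets posterior 1.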
16
Practical Implementation
  • According to the authors:
  • Posteriors found using the above formula have poorer discriminative capability.
  • Timing from the recognizer is fuzzy.
  • Segments with 30% overlap are then used.
  • Acoustic score and language score:
  • Both are scaled.
  • AM scaled by a number equal to 1
  • LM scaled by a number larger than 1
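A small illustration of why the scaling matters: weighting the LM log score before normalizing changes how the posterior mass is distributed over hypotheses. All scores and the scale value below are made up:

```python
import math

def softmax(log_scores):
    """Normalize log scores into posteriors (stable shift by the max)."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical per-hypothesis log scores, split into AM and LM parts
am = [-100.0, -102.0, -105.0]
lm = [-5.0, -4.0, -6.0]

unscaled = softmax([a + l for a, l in zip(am, lm)])
# AM kept at scale 1, LM weighted by a number larger than 1, as in the slide
lm_scale = 3.0
scaled = softmax([a + lm_scale * l for a, l in zip(am, lm)])
```

With the larger LM weight, the LM's preference can change both the sharpness and the ranking of the posteriors; in practice the scale is tuned on held-out data.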

17
Experimental Results
  • The confidence error rate (CER) is computed.
  • Definition of CER:
  • (incorrectly assigned tags) / (all tags)
  • The threshold is optimized on a cross-validation set.
  • Compared to the baseline:
  • (Insertions + Deletions) / Number of recognized words
  • Results: a relative improvement of 14-18%
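The CER computation can be sketched as follows; the function names and the example posteriors are illustrative:

```python
def tag_by_threshold(posteriors, threshold):
    """Tag a word as 'correct' when its posterior exceeds the threshold;
    the threshold itself is tuned on a cross-validation set."""
    return [p > threshold for p in posteriors]

def confidence_error_rate(tags, truth):
    """CER = (# incorrectly assigned tags) / (# tags).
    tags:  confidence tags, True = word tagged as correct
    truth: reference,       True = word actually correct"""
    wrong = sum(t != g for t, g in zip(tags, truth))
    return wrong / len(tags)
```

For example, tagging three words with posteriors 0.9, 0.4, 0.8 at threshold 0.5 and comparing against a reference where only the first is correct gives a CER of 1/3.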

18
Summary
  • Word-based posterior probability is one effective
    way to compute confidence.
  • In practice, AM and LM scores need to be scaled
    appropriately.
  • Further reading.
  • Frank Soong et al, Generalized Word Posterior
    Probability (GWPP) For Measuring Reliability of
    Recognized Words

19
Idea 2 Belief in multiple ASR features
  • Background
  • A single ASR feature is not the best.
  • Multiple features can be combined to improve results.
  • The combination can be done by a machine-learning algorithm.

20
Reviewed papers
  • (Suggested) Timothy Hazen et al., Recognition Confidence Scoring for Use in Speech Understanding Systems
  • http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
  • Zhang et al.,
  • http://www.cs.cmu.edu/rongz/eurospeech_2001_1.pdf
  • A survey: http://fife.speech.cs.cmu.edu/Courses/11716/2000/Word_Confidence_Annotation.ps
  • Chase et al., Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition
  • http://www.cs.cmu.edu/afs/cs/user/lindaq/mosaic/ca.ps
  • Paper chosen because:
  • it is more recent.
  • The combination method is motivated by speech recognition.

21
General structure of papers in Idea 2
  • 10-30 features from the acoustic model are listed.
  • A combination scheme is chosen.
  • Usually it is based on a machine-learning method, e.g.
  • Decision tree
  • Neural network
  • Support vector machine
  • Fisher linear separator
  • Or any super-duper ML method.

22
Outline
  • Motivation of the paper
  • Decide whether OOV words exist.
  • Mark potentially mis-recognized words.
  • What the authors try to do
  • Decide whether an utterance should be accepted.
  • 3 different levels of features
  • Phonetic-level scoring
  • Never used
  • Utterance-level scoring
  • 15 features
  • Word-level scoring
  • 10 features

23
Phone-Level Scoring
  • From the authors:
  • Several works in the past have already shown that phone and frame scores are unlikely to help.
  • However, phone scores will be used to generate word-level and sentence-level scores.
  • Scores are normalized by a catch-all model.
  • In other words, a garbage model is used to approximate P(A).
  • Normalized scores are always used.

24
Utterance Level Features (the boring group)
  • 1, 1st best hypothesis total score
  • (AM + LM + PM)
  • 2, 1st best hypothesis average (word) score
  • The avg. score per word.
  • 3, 1st best hypothesis total LM score
  • 4, 1st best hypothesis avg. LM score
  • 5, 1st best hypothesis total AM score
  • 6, 1st best hypothesis avg. AM score
  • 7, Difference in total score between 1st best hyp and 2nd best hyp
  • 8, Difference in LM score between 1st best hyp and 2nd best hyp
  • 9, Difference in AM score between 1st best hyp and 2nd best hyp
  • 14, Number of N-best hypotheses
  • 15, Number of words in the 1st best hyp.

25
Utterance Level Features (the interesting group)
  • N-best purity
  • The N-best purity for a hypothesized word is the fraction of N-best hypotheses in which that particular hypothesized word appears in the same location in the sentence.
  • Or:
  • agreement / total
  • Similar to ROVER voting on the N-best list.
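N-best purity as defined above can be sketched directly; the toy N-best list is made up:

```python
def nbest_purity(word, position, nbest):
    """Fraction of N-best hypotheses in which the hypothesized word
    appears at the same position in the sentence.
    nbest: list of hypotheses, each a list of words."""
    matches = sum(1 for hyp in nbest
                  if position < len(hyp) and hyp[position] == word)
    return matches / len(nbest)

nbest = [["show", "me", "flights"],
         ["show", "me", "fights"],
         ["show", "the", "flights"]]
# "show" at position 0 appears in all 3 hypotheses; "me" at position 1 in 2 of 3
```

A word the recognizer keeps changing its mind about across the N-best list gets low purity, which is why this feature correlates with recognition errors.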

26
Utterance Level Features (the interesting group)
(cont.)
  • 10, 1st best hypothesis avg. N-best purity
  • 11, 1st best hypothesis high N-best purity
  • The fraction of words in the top-choice hypothesis which have an N-best purity greater than one half.
  • 12, Average N-best purity
  • 13, High N-best purity

27
Word Level Feature
  • 1, Mean acoustic score → the mean of the log likelihood
  • 2, Mean acoustic likelihood score → the mean of the likelihood (not the log likelihood)
  • 3, Minimum acoustic score
  • 4, Standard deviation of the acoustic score
  • 5, Mean difference from max score
  • The average log-likelihood ratio between the acoustic scores of the best path and those from phoneme recognition.
  • 6, Mean catch-all score
  • 7, Number of acoustic observations
  • 8, N-best purity
  • 9, Number of N-best hypotheses
  • 10, Utterance score

28
Classifier Training
  • Linear separator
  • Input: features
  • Output: (correct, incorrect) labels
  • Training process:
  • 1, Fisher linear discriminant analysis is used to produce the first version of the separator.
  • 2, A hill-climbing algorithm is used to minimize the classification error.
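The two-step training can be sketched as follows. The data is synthetic, and the simple 1-D search over the bias is only a stand-in for the paper's hill-climbing refinement:

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Fisher linear discriminant: w = Sw^-1 (mu_pos - mu_neg),
    where Sw is the within-class scatter matrix."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.cov(X_pos.T) * (len(X_pos) - 1) + np.cov(X_neg.T) * (len(X_neg) - 1)
    return np.linalg.solve(Sw, mu_p - mu_n)

def hill_climb_bias(w, X, y, steps=200):
    """Greedy 1-D search over the decision bias to minimize the
    classification error of the projected scores."""
    scores = X @ w
    best_b, best_err = 0.0, np.mean((scores > 0) != y)
    for b in np.linspace(scores.min(), scores.max(), steps):
        err = np.mean((scores > b) != y)
        if err < best_err:
            best_b, best_err = b, err
    return best_b, best_err

# Synthetic 2-feature data: "correct" words vs "incorrect" words
X_pos = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.0], [3.5, 3.5]])
X_neg = np.array([[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5], [0.2, 0.1]])
w = fisher_direction(X_pos, X_neg)
X = np.vstack([X_pos, X_neg])
y = np.array([True] * 4 + [False] * 4)
bias, err = hill_climb_bias(w, X, y)
```

The paper additionally adjusts the weights themselves by hill climbing; this sketch only tunes the bias, which is already enough to separate the toy data.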

29
Results (Word-Level)
30
Discussion: Is there any meaning in the combination method?
  • IMO, yes,
  • provided that the breakdown of each feature's contribution to the reduction of CER is given.
  • E.g. the goodies in other papers:
  • In Timothy et al., N-best purity is the most useful.
  • In Lin, LM jitter is the first question that provides the most gain.
  • In Rong, back-off mode and parsing score provide significant improvement.
  • Also, Timothy et al. is special because the optimization of the combination is also MCE-trained.
  • So, how things are combined matters too.

31
Summary
  • 25 features were used in this paper:
  • 15 at the utterance level
  • 10 at the word level
  • N-best purity was found to be the most helpful.
  • Both simple linear-separator training and minimum classification error training were used.
  • That explains the huge relative reduction in error.

32
Idea 3, Believe information other than ASR
  • ASR output has certain limitations.
  • When applied in different applications,
  • ASR confidence needs to be modified/combined with application-specific information.

33
Reviewed Papers
  • Dialogue systems
  • (Suggested) Dan Bohus, A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems
  • http://www.cs.cmu.edu/dbohus/docs/dbohus_interspeech05.pdf
  • Sameer Pradhan, Wayne Ward, Estimating Semantic Confidence for Spoken Dialogue Systems
  • http://oak.colorado.edu/spradhan/publications/semantic-confidence.pdf
  • CALL
  • Simon Ho, Brian Mak, Joint Estimation of Thresholds in a Bi-threshold Verification Problem
  • http://www.cs.ust.hk/mak/PDF/eurospeech2003-bithreshold.pdf
  • Paper chosen because:
  • It is the most recent.
  • Representative from a dialogue-system standpoint.

34
Big Picture of this type of papers
  • Use features external to ASR as confidence features.
  • Dialogue context
  • Use a cost external to the ASR error rate as the optimization criterion.
  • Cost of misunderstanding
  • 10 FA/FR
  • As most often commented:
  • It usually makes more sense than relying only on ASR features.
  • The quality of the features also depends on the ASR scores.

35
Overview of the paper
  • Motivation
  • Recognition errors significantly affect the quality and success of the interaction (for the dialogue system).
  • The rejection threshold introduces a trade-off between
  • the number of misunderstandings and
  • the number of false rejections.

36
Incorrect and Correct Transfer of Concepts
  • An alternative formulation by the authors:
  • The user tries to convey concepts to the system.
  • If the confidence is below the threshold:
  • The system rejects the utterance and no concept is transferred.
  • If the confidence is above the threshold:
  • The system accepts some correct concepts,
  • but also accepts some wrong concepts.

37
Questions the authors want to answer
  • "Given the existence of this tradeoff, what is the optimal value for the rejection threshold?"
  • "this tradeoff":
  • the trade-off between correctly and incorrectly transferred concepts.

38
Logistic regression
  • A generalized linear model in which
  • g
  • is the link function,
  • which can be log, logit, identity or reciprocal.
  • http://userwww.sfsu.edu/efc/classes/biol710/Glz/Generalized%20Linear%20Models.htm

39
Logistic regression (cont.)
  • Usually used with
  • categorical or non-continuous dependent variables,
  • or when the relationship itself is actually not linear.
  • Also used for combining features in ASR.
  • See Siu, Improved Estimation, Evaluation and Applications of Confidence Measures for Speech Recognition
  • and, generally, BBN systems.

40
Impact of Incorrect and Correct Concept to the
task success
  • Logit(TS) = 0.21 + 2.14·CTC − 4.12·ITC
  • The effect of ITC on the odds is nearly twice that of CTC.
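Evaluating the fitted model is straightforward. The sketch below uses the slide's coefficients, with the sign of the ITC coefficient inferred (incorrectly transferred concepts should lower the odds of task success):

```python
import math

def task_success_prob(ctc, itc, b0=0.21, b_ctc=2.14, b_itc=-4.12):
    """Predicted probability of task success from the logistic model
    logit(TS) = b0 + b_ctc*CTC + b_itc*ITC.
    The negative sign on b_itc is an assumption, not taken from the slide."""
    logit = b0 + b_ctc * ctc + b_itc * itc
    return 1.0 / (1.0 + math.exp(-logit))
```

With these coefficients, one correctly transferred concept pushes the success probability above one half, while one incorrect transfer pushes it well below.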

41
The procedure
  • Identify a set of variables A, B, … involved in the rejection tradeoff (e.g. CTC and ITC).
  • Choose a global dialogue performance metric P to optimize for (e.g. task success).
  • Fit a model m which relates the trade-off variables to the chosen global dialogue performance metric: P ← m(A, B).
  • Find the threshold which maximizes the performance:
  • th* = argmax P = argmax m(A(th), B(th))
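The last step of the procedure can be sketched as a simple sweep. The trade-off curves and the linearized performance model below are hypothetical, not taken from the paper:

```python
def optimal_threshold(thresholds, ctc_at, itc_at, performance):
    """Pick the threshold th maximizing P = m(CTC(th), ITC(th)),
    where CTC/ITC are measured on held-out data at each threshold."""
    return max(thresholds, key=lambda th: performance(ctc_at(th), itc_at(th)))

# Hypothetical monotone trade-off curves: raising the threshold rejects
# more, losing correct transfers (CTC) but losing incorrect ones (ITC) faster.
ctc_at = lambda th: 1.0 - th
itc_at = lambda th: 0.8 * (1.0 - th) ** 2
performance = lambda ctc, itc: 2.14 * ctc - 4.12 * itc  # linearized stand-in

best = optimal_threshold([0.0, 0.25, 0.5, 0.75, 1.0], ctc_at, itc_at, performance)
```

Because ITC falls off faster than CTC under these made-up curves, the sweep settles on a fairly aggressive rejection threshold.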

42
Data
  • RoomLine system
  • Baseline: fixed rejection threshold of 0.3
  • Each participant attempted
  • a maximum of 10 scenario-based interactions.
  • 71 states in the dialogue system
  • In general

43
Rejection Optimization
  • The 71 states are manually clustered into 3 types:
  • Open request: the system asks open questions.
  • "How may I help you?"
  • Request (bool): the system asks a yes/no question.
  • "Do you want a reservation for this room?"
  • Request (non-bool): the system requests an answer with more than 2 possible values.
  • "Starting at what time do you need the room?"
  • Costs are then optimized for each individual state.

44
Results
45
Summary of the paper
  • A principled idea for dialogue systems.
  • Logistic regression is used to optimize the rejection threshold.
  • A neat paper.
  • Several clever points:
  • Logistic regression
  • Using an external metric in the paper

46
Discussion
  • 3 different types of ideas in confidence annotation
  • Questions:
  • Which idea should we use?
  • Can the ideas be combined?

47
Goodies in Idea 1
  • Word posterior probability and LM jitter were found to be very useful in different papers.
  • Word posterior probability is a generalization of many techniques in the field.
  • LM jitter can be generalized to other parameters in the decoder as well.
  • Utterance scores help word scores.

48
Goodies in Idea 2
  • Combination always helps.
  • Combination in the ML sense and the DT sense each gives a chunk of gain.
  • Combination methods:
  • A generalized linear model is easy to interpret and principled.
  • A linear separator can be easily trained in the ML and DT senses.
  • Neural networks and SVMs come with a standard goodie: general non-linear modeling.

49
Goodies in Idea 3
  • Every type of application has its own concerns, which are more important than WER.
  • Researchers should take the liberty to optimize for them instead of relying on ASR alone.

50
Conclusion
  • For an ASR-based system:
  • Ideas 1 and 2 are wins.
  • For an application built on an ASR-based system:
  • Ideas 1, 2 and 3 would be the most helpful.