Title: Ideas in Confidence Annotation
1. Ideas in Confidence Annotation
2. Three papers for today
- Frank Wessel et al., Using Word Probabilities as Confidence Measures
  - http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
- Timothy Hazen et al., Recognition Confidence Scoring for Use in Speech Understanding Systems
  - http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
- Dan Bohus and Alex Rudnicky, A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems
  - http://www.cs.cmu.edu/dbohus/docs/dbohus_interspeech05.pdf
3. Application of Confidence Annotation
- Gives the system a basis for deciding whether the ASR output can be trusted.
- Possible response strategies
  - Reject the sentence altogether.
  - Confirm with the user again.
  - Both, e.g. a bi-threshold system.
- Detection of OOV, e.g.
  - If the word is not in the ASR vocabulary (OOV):
    - "What is the focus for Paramus Park New Jersey"
    - "What is the forecast for Paris Park New Jersey"
  - "Paramus" is OOV, so the system should not be confident about the phone transcription.
- Improve speech recognition performance
  - Why? In general, the posterior should be used instead of the likelihood.
  - Does it help? About 2-5% relative.
4. How this seminar proceeds
- For each idea, 3 papers were studied.
  - Only the most representative one became the suggested reading.
- Results will be quoted from different papers.
5. Preliminaries
- Mathematical foundation
  - Neyman-Pearson Theorem (NPT)
- Consequence of NPT
  - In general, the likelihood ratio test is the most powerful test for deciding which of two distributions is in force.
    - H1: Distribution A is in force.
    - H2: Distribution B is in force.
  - Compute F(H1)/F(H2) and compare it against a threshold T (see the sketch below).
- In speech recognition,
  - H1 could be the speech model, H2 could be the non-speech (garbage) model.
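A minimal sketch of such a likelihood ratio test, assuming two toy 1-D Gaussian hypotheses; the models, parameters, and threshold are illustrative and not taken from any of the papers:

# Neyman-Pearson style likelihood ratio test on a single scalar observation.
import math

def gaussian_pdf(x, mean, std):
    """Likelihood of x under a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def likelihood_ratio_test(x, T=1.0):
    # H1: "speech" model, H2: "non-speech / garbage" model (toy parameters).
    f_h1 = gaussian_pdf(x, mean=1.0, std=1.0)
    f_h2 = gaussian_pdf(x, mean=-1.0, std=1.0)
    ratio = f_h1 / f_h2
    return "accept H1" if ratio > T else "accept H2"

print(likelihood_ratio_test(0.7))  # ratio > 1 -> accept H1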
6. Idea 1: Belief in a single ASR feature
- 3 studied papers
  - (Suggested) Frank Wessel et al., Using Word Probabilities as Confidence Measures
    - http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
  - Stephen Cox and Richard Rose, Confidence Measures for the Switchboard Database
    - http://www.ece.mcgill.ca/rose/papers/cox_rose_icassp96.pdf
  - Thomas Kemp and Thomas Schaaf, Estimating Confidence using Word Lattices
    - http://overcite.lcs.mit.edu/cache/papers/cs/1116/httpzSzzSzwww.is.cs.cmu.eduzSzwwwadmzSzpaperszSzspeechzSzEUROSPEECH97zSzEUROSPEECH97-thomas.pdf/kemp97estimating.pdf
- Paper chosen because
  - it has the clearest math, in minute detail,
  - though it is less motivating than Cox's paper.
7. Origins of Confidence Measures in speech recognition
- Formulation of speech recognition
  - P(W|A) = P(A|W) P(W) / P(A)
  - In decoding, P(A) is ignored because it is a common term:
    - W* = argmax_W P(A|W) P(W)
- Problem
  - P(A,W) is only a relative measure.
  - P(W|A) is the true measure of how probable a word is given the features.
8. In reality
- P(A) can only be approximated.
  - By the law of total probability: P(A) = sum of P(A,W) over all W
  - N-best lists and word lattices are therefore used (see the sketch below).
  - Other ideas: filler/garbage/general speech models -> keyword-spotter tricks
- A threshold on the ratio needs to be found.
  - The ROC curve always needs to be interpreted manually.
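A minimal sketch of approximating P(A) from an N-best list and turning joint scores into posteriors; the hypotheses and log scores below are made-up numbers for illustration only:

import math

# log P(A,W) = log P(A|W) + log P(W) for each N-best hypothesis (toy values)
nbest_log_joint = {
    "what is the forecast for paris park new jersey": -120.3,
    "what is the focus for paramus park new jersey": -121.0,
    "what is the forecast for paris park new york": -124.8,
}

# log P(A) ~= log sum_W exp(log P(A,W)), restricted to the N-best list
max_log = max(nbest_log_joint.values())
log_p_a = max_log + math.log(sum(math.exp(s - max_log) for s in nbest_log_joint.values()))

# Posterior of each hypothesis = P(A,W) / P(A)
posteriors = {w: math.exp(s - log_p_a) for w, s in nbest_log_joint.items()}
for w, p in posteriors.items():
    print(f"{p:.3f}  {w}")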
9. Things that people are not confident about
- All sorts of things
  - Frame
    - Frame likelihood ratio
  - Phone
    - Phone likelihood ratio
  - Word
    - Posterior probability -> a kind of likelihood ratio too
    - Word likelihood ratio
  - Sentence
    - Likelihood
10. General Observations from the Literature
- Word-level confidence performs the best (in CER).
- The word-lattice method is slightly more general.
- This part of the presentation will focus on the word-lattice-based method.
11. Word posterior probability: the authors' definition
- W_a: word hypotheses preceding w
- W_e: word hypotheses succeeding w
- (a reconstruction of the definition follows below)
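A sketch of the intended definition, assuming the usual notation in which [w; \tau, t] is word w hypothesized with start time \tau and end time t, and x_1^T is the acoustic observation sequence:

p([w; \tau, t] \mid x_1^T)
  = \frac{\sum_{W_a} \sum_{W_e} p\big(x_1^T,\; W_a \, [w; \tau, t] \, W_e\big)}{p\big(x_1^T\big)}

That is, the posterior of a word hypothesis is the summed joint probability of all sentence hypotheses containing it, normalized by the total probability of the acoustics.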
12. Computation with a lattice
- Only the hypotheses included in the lattice need to be computed.
- An alpha-beta type of computation can be used.
  - Similar to the forward-backward algorithm.
13. Forward probability
- For an end time t,
  - read: the total probability of all partial hypotheses ending at t that are identical to h.
- Recursive formula
14. Backward probability
- For a begin time t
  - One LM score is missing in the definition; it is added back later in the computation.
- Recursion
15. Posterior Computation
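A minimal sketch of the overall recipe on a tiny hand-made lattice: a forward (alpha) pass, a backward (beta) pass, and an edge posterior formed from both. The edge scores stand in for scaled AM+LM log scores; this illustrates the general alpha-beta scheme, not the exact recursions of the paper:

import math
from collections import defaultdict

# Edges: (from_node, to_node, word, log_score). Node 0 is start, node 3 is end.
edges = [
    (0, 1, "what", -1.0),
    (0, 1, "watt", -2.0),
    (1, 2, "is", -0.5),
    (2, 3, "the", -0.3),
]

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

start, end = 0, 3
nodes = {n for e in edges for n in (e[0], e[1])}

# Forward (alpha): total log score of all paths from the start to each node.
alpha = defaultdict(lambda: float("-inf"), {start: 0.0})
for n in sorted(nodes):  # topological order for this toy lattice
    incoming = [alpha[f] + s for (f, t, w, s) in edges if t == n]
    if incoming:
        alpha[n] = logsumexp(incoming)

# Backward (beta): total log score of all paths from each node to the end.
beta = defaultdict(lambda: float("-inf"), {end: 0.0})
for n in sorted(nodes, reverse=True):
    outgoing = [beta[t] + s for (f, t, w, s) in edges if f == n]
    if outgoing:
        beta[n] = logsumexp(outgoing)

# Posterior of each word edge = alpha(from) + score + beta(to) - total.
total = alpha[end]
for (f, t, w, s) in edges:
    print(w, math.exp(alpha[f] + s + beta[t] - total))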
16. Practical Implementation
- According to the authors,
  - posteriors found using the above formula have poorer discriminative capability,
  - because timing from the recognizer is fuzzy.
  - Segments with 30% overlap are then used.
- Acoustic score and language score
  - Both are scaled (see the sketch below):
    - AM scaled by a factor equal to 1
    - LM scaled by a factor larger than 1
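A minimal sketch of the score scaling applied before the posterior computation; the LM scale value is a placeholder (the slide only says it is larger than 1) and would be tuned on held-out data in practice:

# Scaled log-score combination used as the edge score in the lattice pass.
AM_SCALE = 1.0
LM_SCALE = 8.0  # illustrative value, larger than 1

def scaled_edge_score(am_log_score: float, lm_log_score: float) -> float:
    """Combined log score fed into the forward-backward computation."""
    return AM_SCALE * am_log_score + LM_SCALE * lm_log_score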
17. Experimental Results
- Confidence error rate (CER) is computed (a small sketch follows below).
  - Definition of CER:
    - (incorrectly assigned tags) / (total number of tags)
- The threshold is optimized on a cross-validation set.
- Compared to the baseline:
  - (Insertions + Substitutions) / Number of recognized words
- Results: roughly 14-18% relative improvement.
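A minimal sketch of computing CER from per-word confidence scores and a correctness reference; the scores, labels, and threshold are toy values:

def confidence_error_rate(confidences, word_is_correct, threshold):
    """Fraction of words whose accept/reject tag disagrees with the reference."""
    errors = 0
    for score, correct in zip(confidences, word_is_correct):
        accepted = score >= threshold
        if accepted != correct:
            errors += 1
    return errors / len(confidences)

# Toy example: three words, one tagging mistake -> CER = 1/3
print(confidence_error_rate([0.9, 0.4, 0.3], [True, True, False], threshold=0.5))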
18. Summary
- Word-based posterior probability is one effective way to compute confidence.
- In practice, AM and LM scores need to be scaled appropriately.
- Further reading:
  - Frank Soong et al., Generalized Word Posterior Probability (GWPP) for Measuring Reliability of Recognized Words
19. Idea 2: Belief in multiple ASR features
- Background
  - A single ASR feature is not the best.
  - Multiple features can be combined to improve results.
  - The combination can be done by a machine-learning algorithm.
20. Reviewed papers
- (Suggested) Timothy Hazen et al., Recognition Confidence Scoring for Use in Speech Understanding Systems
  - http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
- Zhang et al.
  - http://www.cs.cmu.edu/rongz/eurospeech_2001_1.pdf
  - A survey: http://fife.speech.cs.cmu.edu/Courses/11716/2000/Word_Confidence_Annotation.ps
- Chase et al., Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition
  - http://www.cs.cmu.edu/afs/cs/user/lindaq/mosaic/ca.ps
- Paper chosen because
  - it is more recent,
  - the combination method is motivated by speech recognition.
21. General structure of papers in Idea 2
- 10-30 features from the acoustic model are listed.
- A combination scheme is chosen.
  - Usually it is based on a machine-learning method, e.g.
    - Decision tree
    - Neural network
    - Support vector machine
    - Fisher linear separator
    - or any super-duper ML method.
22. Outline
- Motivation of the paper
  - Decide whether OOVs exist.
  - Mark potentially mis-recognized words.
- What the author tries to do
  - Decide whether an utterance should be accepted.
- 3 different levels of features
  - Phonetic-level scoring
    - Never used on its own
  - Utterance-level scoring
    - 15 features
  - Word-level scoring
    - 10 features
23. Phone-Level Scoring
- From the author:
  - Several works in the past have already shown that phone and frame scores are unlikely to help.
  - However, phone scores will be used to generate word-level and sentence-level scores.
- Scores are normalized by a catch-all model.
  - In other words, a garbage model is used to approximate p(A).
  - Normalized scores are always used (see the sketch below).
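A minimal sketch of catch-all normalization for a phone score; the log-likelihood values are placeholders:

def normalized_phone_score(log_p_a_given_phone: float, log_p_a_catch_all: float) -> float:
    """Log likelihood ratio of the phone model against the catch-all (garbage) model."""
    return log_p_a_given_phone - log_p_a_catch_all

# A score near or above 0 means the phone model explains the frames at least
# as well as the catch-all model does.
print(normalized_phone_score(-42.0, -45.5))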
24. Utterance-Level Features (the boring group)
- 1. 1st-best hypothesis total score (AM + LM + PM)
- 2. 1st-best hypothesis average (word) score
  - the avg. score per word
- 3. 1st-best hypothesis total LM score
- 4. 1st-best hypothesis avg. LM score
- 5. 1st-best hypothesis total AM score
- 6. 1st-best hypothesis avg. AM score
- 7. Difference in total score between the 1st-best and 2nd-best hypotheses
- 8. Difference in LM score between the 1st-best and 2nd-best hypotheses
- 9. Difference in AM score between the 1st-best and 2nd-best hypotheses
- 14. Number of N-best hypotheses
- 15. Number of words in the 1st-best hypothesis
25. Utterance-Level Features (the interesting group)
- N-best purity
  - The N-best purity for a hypothesized word is the fraction of N-best hypotheses in which that particular hypothesized word appears in the same location in the sentence.
  - Or: agreement / total
  - Similar to ROVER voting on the N-best list (see the sketch below).
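A minimal sketch of N-best purity for one hypothesized word, assuming the hypotheses are already aligned so that "position" is simply the word index (a simplification of what a real system would do):

def nbest_purity(word: str, position: int, nbest: list[list[str]]) -> float:
    """Fraction of N-best hypotheses containing `word` at `position`."""
    agreement = sum(
        1 for hyp in nbest
        if position < len(hyp) and hyp[position] == word
    )
    return agreement / len(nbest)

nbest = [
    ["show", "me", "flights", "to", "boston"],
    ["show", "me", "flights", "to", "austin"],
    ["show", "me", "flight", "to", "boston"],
]
print(nbest_purity("boston", 4, nbest))  # 2/3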
26. Utterance-Level Features (the interesting group, cont.)
- 10. 1st-best hypothesis avg. N-best purity
- 11. 1st-best hypothesis high N-best purity
  - the fraction of words in the top-choice hypothesis which have an N-best purity greater than one half
- 12. Average N-best purity
- 13. High N-best purity
27. Word-Level Features
- 1. Mean acoustic score -> the mean of the log likelihood
- 2. Mean acoustic likelihood score -> the mean of the likelihood (not the log likelihood)
- 3. Minimum acoustic score
- 4. Standard deviation of the acoustic score
- 5. Mean difference from the max score
  - the average log likelihood ratio between the acoustic scores of the best path and those from phoneme recognition
- 6. Mean catch-all score
- 7. Number of acoustic observations
- 8. N-best purity
- 9. Number of N-best hypotheses
- 10. Utterance score
28. Classifier Training
- Linear separator
  - Input: features
  - Output: a (correct, incorrect) decision
- Training process (see the sketch below)
  - 1. Fisher linear discriminant analysis is used to produce the first version of the separator.
  - 2. A hill-climbing algorithm is used to minimize the classification error.
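A minimal sketch of this two-stage recipe on synthetic data: Fisher LDA provides the initial linear separator, then a simple hill-climbing pass adjusts the decision threshold to reduce the raw classification error. The features, dimensions, and hill-climbing details are assumptions for illustration, not the paper's setup:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(1.0, 1.0, (200, 5))])
y = np.array([0] * 200 + [1] * 200)  # 0 = incorrect word, 1 = correct word

# Stage 1: Fisher linear discriminant gives weights w and bias b.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
w, b = lda.coef_[0], float(lda.intercept_[0])

def error_rate(bias):
    pred = (X @ w + bias > 0).astype(int)
    return np.mean(pred != y)

# Stage 2: hill-climb the bias to directly minimize classification error.
step, best = 0.5, error_rate(b)
while step > 1e-3:
    for cand in (b + step, b - step):
        if error_rate(cand) < best:
            b, best = cand, error_rate(cand)
            break
    else:
        step /= 2  # no improvement in either direction -> shrink the step

print(f"classification error after hill climbing: {best:.3f}")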
29. Results (Word-Level)
30. Discussion: Is there any meaning in the combination method?
- IMO, yes,
  - provided that the breakdown of each feature's contribution to the reduction of CER is provided.
- E.g. the goodies in other papers:
  - In Hazen et al., N-best purity is the most useful.
  - In Lin's paper, LM jitter is the first question that provides the most gain.
  - In Rong's paper, backoff mode and parsing score provide significant improvement.
- Also, Hazen et al. is special because the optimization of the combination is also MCE-trained.
- So, how things are combined matters too.
31. Summary
- 25 features were used in this paper:
  - 15 at the utterance level
  - 10 at the word level
- N-best purity was found to be the most helpful.
- Both simple linear separator training and minimum classification error training were used.
  - That explains the large relative reduction in error.
32. Idea 3: Believe information other than ASR
- ASR output has certain limitations.
- When applied in different applications,
  - ASR confidence needs to be modified or combined with application-specific information.
33. Reviewed Papers
- Dialogue systems
  - (Suggested) Dan Bohus, A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems
    - http://www.cs.cmu.edu/dbohus/docs/dbohus_interspeech05.pdf
  - Sameer Pradhan, Wayne Ward, Estimating Semantic Confidence for Spoken Dialogue Systems
    - http://oak.colorado.edu/spradhan/publications/semantic-confidence.pdf
- CALL
  - Simon Ho, Brian Mak, Joint Estimation of Thresholds in a Bi-threshold Verification Problem
    - http://www.cs.ust.hk/mak/PDF/eurospeech2003-bithreshold.pdf
- Paper chosen because
  - it is the most recent,
  - it is representative from a dialogue-system standpoint.
34. Big picture of this type of paper
- Use features external to the ASR as confidence features.
  - e.g. dialogue context
- Use a cost external to the ASR error rate as the optimization criterion.
  - e.g. the cost of misunderstanding
  - 10 FA/FR
- As most commented:
  - It usually makes more sense than relying only on ASR features.
  - But the quality of the features also depends on the ASR scores.
35. Overview of the paper
- Motivation
  - Recognition errors significantly affect the quality and success of the interaction (for the dialogue system).
  - The rejection threshold introduces a trade-off between
    - the number of misunderstandings and
    - the number of false rejections.
36. Incorrect and Correct Transfer of Concepts
- An alternative formulation by the authors
  - The user tries to convey concepts to the system.
  - If the confidence is below the threshold,
    - the system rejects the utterance and no concept is transferred.
  - If the confidence is above the threshold,
    - the system accepts some correct concepts,
    - but also accepts some wrong concepts.
37. Questions the authors want to answer
- "Given the existence of this tradeoff, what is the optimal value for the rejection threshold?"
- "this tradeoff"
  - the trade-off between correctly and incorrectly transferred concepts
38. Logistic regression
- A generalized linear model in which
  - g is the link function,
  - which could be log, logit, identity, or reciprocal (see the sketch below).
- http://userwww.sfsu.edu/efc/classes/biol710/Glz/Generalized%20Linear%20Models.htm
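A sketch of the general form being referred to, with logistic regression as the special case where the link g is the logit; the symbols (y for the response, x_i for the predictors, beta_i for the coefficients) are the usual conventions rather than the slide's own notation:

g(\mathrm{E}[y]) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
\text{logistic regression: } g(p) = \operatorname{logit}(p) = \log \frac{p}{1 - p}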
39. Logistic regression (cont.)
- Usually used when
  - the dependent variable is categorical or non-continuous,
  - or the relationship itself is not linear.
- Also used for combining features in ASR
  - See Siu, Improved Estimation, Evaluation and Applications of Confidence Measures for Speech Recognition
  - and BBN systems in general.
40. Impact of Incorrect and Correct Concept Transfer on Task Success
- logit(TS) = 0.21 + 2.14 * CTC - 4.12 * ITC
- The effect of one ITC on the odds is nearly 2 times that of one CTC.
41. The procedure
- Identify a set of variables A, B, ... involved in the rejection tradeoff (e.g. CTC and ITC).
- Choose a global dialogue performance metric P to optimize for (e.g. task success).
- Fit a model m which relates the trade-off variables to the chosen global dialogue performance metric: P <- m(A, B).
- Find the threshold which maximizes the performance (see the sketch below):
  - th = argmax P = argmax m(A(th), B(th))
42. Data
- RoomLine system
- Baseline: fixed rejection threshold of 0.3
- Each participant attempted
  - a maximum of 10 scenario-based interactions.
- 71 states in the dialogue system
- In general
43. Rejection Optimization
- The 71 states are manually clustered into 3 types:
  - Open request: the system asks an open question
    - "How may I help you?"
  - Request (bool): the system asks a yes/no question
    - "Do you want a reservation for this room?"
  - Request (non-bool): the system requests an answer with more than 2 possible values
    - "Starting at what time do you need the room?"
- Costs are then optimized for individual states.
44. Results
45. Summary of the paper
- A principled idea for dialogue systems.
- Logistic regression is used to optimize the rejection threshold.
- A neat paper, with several clever points:
  - logistic regression,
  - using an external metric.
46. Discussion
- 3 different types of ideas in confidence annotation
- Questions
  - Which idea should we use?
  - Could the ideas be combined?
47. Goodies in Idea 1
- Word posterior probability and LM jitter were found to be very useful in different papers.
- Word posterior probability is a generalization of many techniques in the field.
- LM jitter could be generalized to other parameters in the decoder as well.
- Utterance scores help word scores.
48. Goodies in Idea 2
- Combination always helps.
- Combination in the ML sense and the DT sense each gives a chunk of gain.
- Combination methods
  - A generalized linear model is easy to interpret and principled.
  - A linear separator can easily be trained in the ML and DT sense.
  - Neural networks and SVMs come with the standard goodie: general non-linear modeling.
49. Goodies in Idea 3
- Every type of application has its own concerns, which are more important than WER.
- Researchers should take the liberty to optimize for them instead of relying only on ASR.
50. Conclusion
- For an ASR-based system,
  - Ideas 1 and 2 are wins.
- For an application built on an ASR-based system,
  - Ideas 1, 2, and 3 would be the most helpful.