Title: Ideas in Confidence Annotation
1. Ideas in Confidence Annotation
2. Three papers for today
- Frank Wessel et al., Using Word Probabilities as Confidence Measures
  - http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
- Timothy Hazen et al., Recognition Confidence Scoring for Use in Speech Understanding Systems
  - http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
- Dan Bohus and Alex Rudnicky, A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems
  - http://www.cs.cmu.edu/dbohus/docs/dbohus_interspeech05.pdf
3. Application of Confidence Annotation
- Gives the system a basis for deciding whether the ASR output can be trusted.
- Possible response strategies
  - Reject the sentence altogether.
  - Confirm with the user again.
  - Both, e.g. a bi-threshold system.
- Detection of OOV, e.g.
  - If the word is not in the ASR vocabulary (OOV):
    - "What is the focus for Paramus Park New Jersey"
    - "What is the forecast for Paris Park New Jersey"
  - "Paramus" is OOV, so the system should not be confident about the phone transcription.
- Improve speech recognition performance
  - Why? In general, the posterior should be used instead of the likelihood.
  - Does it help? About 2-5% relative.
4. How this seminar proceeds
- For each idea, 3 papers were studied.
  - Only the most representative one became the suggested reading.
- Results will be quoted from different papers.
5. Preliminaries
- Mathematical foundation
  - Neyman-Pearson Theorem (NPT)
- Consequence of NPT
  - In general, the likelihood ratio test is the most powerful test for deciding which of two distributions is in force.
    - H1: Distribution A is in force.
    - H2: Distribution B is in force.
  - Compute F(H1)/F(H2) and compare it against a threshold T (see the sketch below).
- In speech recognition,
  - H1 could be the speech model, H2 could be the non-speech (garbage) model.
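A minimal sketch of such a likelihood ratio test, assuming two toy 1-D Gaussian hypotheses; the models, parameters, and threshold are illustrative and not taken from any of the papers:

# Neyman-Pearson style likelihood ratio test on a single scalar observation.
import math

def gaussian_pdf(x, mean, std):
    """Likelihood of x under a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def likelihood_ratio_test(x, T=1.0):
    # H1: "speech" model, H2: "non-speech / garbage" model (toy parameters).
    f_h1 = gaussian_pdf(x, mean=1.0, std=1.0)
    f_h2 = gaussian_pdf(x, mean=-1.0, std=1.0)
    ratio = f_h1 / f_h2
    return "accept H1" if ratio > T else "accept H2"

print(likelihood_ratio_test(0.7))  # ratio > 1 -> accept H1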
6. Idea 1: Belief in a single ASR feature
- 3 studied papers
  - (Suggested) Frank Wessel et al., Using Word Probabilities as Confidence Measures
    - http://www-i6.informatik.rwth-aachen.de/PostScript/InterneArbeiten/Wessel_Word_Probabilities_ConfMeas_ICASSP1998.ps
  - Stephen Cox and Richard Rose, Confidence Measures for the Switchboard Database
    - http://www.ece.mcgill.ca/rose/papers/cox_rose_icassp96.pdf
  - Thomas Kemp and Thomas Schaaf, Estimating Confidence using Word Lattices
    - http://overcite.lcs.mit.edu/cache/papers/cs/1116/httpzSzzSzwww.is.cs.cmu.eduzSzwwwadmzSzpaperszSzspeechzSzEUROSPEECH97zSzEUROSPEECH97-thomas.pdf/kemp97estimating.pdf
- Paper chosen because
  - it has the clearest math, in minute detail,
  - though it is less motivating than Cox's paper.
7. Origins of Confidence Measures in speech recognition
- Formulation of speech recognition
  - P(W|A) = P(A|W) P(W) / P(A)
  - In decoding, P(A) is ignored because it is a common term:
    - W* = argmax_W P(A|W) P(W)
- Problem
  - P(A,W) is only a relative measure.
  - P(W|A) is the true measure of how probable a word is given the features.
8. In reality
- P(A) can only be approximated.
  - By the law of total probability: P(A) = sum of P(A,W) over all W
  - N-best lists and word lattices are therefore used (see the sketch below).
  - Other ideas: filler/garbage/general speech models -> keyword-spotter tricks
- A threshold on the ratio needs to be found.
  - The ROC curve always needs to be interpreted manually.
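A minimal sketch of approximating P(A) from an N-best list and turning joint scores into posteriors; the hypotheses and log scores below are made-up numbers for illustration only:

import math

# log P(A,W) = log P(A|W) + log P(W) for each N-best hypothesis (toy values)
nbest_log_joint = {
    "what is the forecast for paris park new jersey": -120.3,
    "what is the focus for paramus park new jersey": -121.0,
    "what is the forecast for paris park new york": -124.8,
}

# log P(A) ~= log sum_W exp(log P(A,W)), restricted to the N-best list
max_log = max(nbest_log_joint.values())
log_p_a = max_log + math.log(sum(math.exp(s - max_log) for s in nbest_log_joint.values()))

# Posterior of each hypothesis = P(A,W) / P(A)
posteriors = {w: math.exp(s - log_p_a) for w, s in nbest_log_joint.items()}
for w, p in posteriors.items():
    print(f"{p:.3f}  {w}")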
9. Things that people are not confident about
- All sorts of things
  - Frame
    - Frame likelihood ratio
  - Phone
    - Phone likelihood ratio
  - Word
    - Posterior probability -> a kind of likelihood ratio too
    - Word likelihood ratio
  - Sentence
    - Likelihood
10. General Observations from the Literature
- Word-level confidence performs the best (in CER).
- The word-lattice method is slightly more general.
- This part of the presentation will focus on the word-lattice-based method.
11. Word posterior probability: the authors' definition
- W_a: word hypotheses preceding w
- W_e: word hypotheses succeeding w
- (a reconstruction of the definition follows below)
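A sketch of the intended definition, assuming the usual notation in which [w; \tau, t] is word w hypothesized with start time \tau and end time t, and x_1^T is the acoustic observation sequence:

p([w; \tau, t] \mid x_1^T)
  = \frac{\sum_{W_a} \sum_{W_e} p\big(x_1^T,\; W_a \, [w; \tau, t] \, W_e\big)}{p\big(x_1^T\big)}

That is, the posterior of a word hypothesis is the summed joint probability of all sentence hypotheses containing it, normalized by the total probability of the acoustics.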
12. Computation with a lattice
- Only the hypotheses included in the lattice need to be computed.
- An alpha-beta type of computation can be used.
  - Similar to the forward-backward algorithm.
13. Forward probability
- For an end time t,
  - read: the total probability of all partial hypotheses ending at t that are identical to h.
- Recursive formula
14. Backward probability
- For a begin time t
  - One LM score is missing in the definition; it is added back later in the computation.
- Recursion
15. Posterior Computation
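A minimal sketch of the overall recipe on a tiny hand-made lattice: a forward (alpha) pass, a backward (beta) pass, and an edge posterior formed from both. The edge scores stand in for scaled AM+LM log scores; this illustrates the general alpha-beta scheme, not the exact recursions of the paper:

import math
from collections import defaultdict

# Edges: (from_node, to_node, word, log_score). Node 0 is start, node 3 is end.
edges = [
    (0, 1, "what", -1.0),
    (0, 1, "watt", -2.0),
    (1, 2, "is", -0.5),
    (2, 3, "the", -0.3),
]

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

start, end = 0, 3
nodes = {n for e in edges for n in (e[0], e[1])}

# Forward (alpha): total log score of all paths from the start to each node.
alpha = defaultdict(lambda: float("-inf"), {start: 0.0})
for n in sorted(nodes):  # topological order for this toy lattice
    incoming = [alpha[f] + s for (f, t, w, s) in edges if t == n]
    if incoming:
        alpha[n] = logsumexp(incoming)

# Backward (beta): total log score of all paths from each node to the end.
beta = defaultdict(lambda: float("-inf"), {end: 0.0})
for n in sorted(nodes, reverse=True):
    outgoing = [beta[t] + s for (f, t, w, s) in edges if f == n]
    if outgoing:
        beta[n] = logsumexp(outgoing)

# Posterior of each word edge = alpha(from) + score + beta(to) - total.
total = alpha[end]
for (f, t, w, s) in edges:
    print(w, math.exp(alpha[f] + s + beta[t] - total))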
16. Practical Implementation
- According to the authors,
  - posteriors found using the above formula have poorer discriminative capability,
  - because timing from the recognizer is fuzzy.
  - Segments with 30% overlap are then used.
- Acoustic score and language score
  - Both are scaled (see the sketch below):
    - AM scaled by a factor equal to 1
    - LM scaled by a factor larger than 1
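A minimal sketch of the score scaling applied before the posterior computation; the LM scale value is a placeholder (the slide only says it is larger than 1) and would be tuned on held-out data in practice:

# Scaled log-score combination used as the edge score in the lattice pass.
AM_SCALE = 1.0
LM_SCALE = 8.0  # illustrative value, larger than 1

def scaled_edge_score(am_log_score: float, lm_log_score: float) -> float:
    """Combined log score fed into the forward-backward computation."""
    return AM_SCALE * am_log_score + LM_SCALE * lm_log_score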
17. Experimental Results
- Confidence error rate (CER) is computed (a small sketch follows below).
  - Definition of CER:
    - (incorrectly assigned tags) / (total number of tags)
- The threshold is optimized on a cross-validation set.
- Compared to the baseline:
  - (Insertions + Substitutions) / Number of recognized words
- Results: roughly 14-18% relative improvement.
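A minimal sketch of computing CER from per-word confidence scores and a correctness reference; the scores, labels, and threshold are toy values:

def confidence_error_rate(confidences, word_is_correct, threshold):
    """Fraction of words whose accept/reject tag disagrees with the reference."""
    errors = 0
    for score, correct in zip(confidences, word_is_correct):
        accepted = score >= threshold
        if accepted != correct:
            errors += 1
    return errors / len(confidences)

# Toy example: three words, one tagging mistake -> CER = 1/3
print(confidence_error_rate([0.9, 0.4, 0.3], [True, True, False], threshold=0.5))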
18. Summary
- Word-based posterior probability is one effective way to compute confidence.
- In practice, AM and LM scores need to be scaled appropriately.
- Further reading:
  - Frank Soong et al., Generalized Word Posterior Probability (GWPP) for Measuring Reliability of Recognized Words
19. Idea 2: Belief in multiple ASR features
- Background
  - A single ASR feature is not the best.
  - Multiple features can be combined to improve results.
  - The combination can be done by a machine-learning algorithm.
20. Reviewed papers
- (Suggested) Timothy Hazen et al., Recognition Confidence Scoring for Use in Speech Understanding Systems
  - http://groups.csail.mit.edu/sls/publications/2000/asr2000.pdf
- Zhang et al.
  - http://www.cs.cmu.edu/rongz/eurospeech_2001_1.pdf
  - A survey: http://fife.speech.cs.cmu.edu/Courses/11716/2000/Word_Confidence_Annotation.ps
- Chase et al., Word and Acoustic Confidence Annotation for Large Vocabulary Speech Recognition
  - http://www.cs.cmu.edu/afs/cs/user/lindaq/mosaic/ca.ps
- Paper chosen because
  - it is more recent,
  - the combination method is motivated by speech recognition.
21. General structure of papers in Idea 2
- 10-30 features from the acoustic model are listed.
- A combination scheme is chosen.
  - Usually it is based on a machine-learning method, e.g.
    - Decision tree
    - Neural network
    - Support vector machine
    - Fisher linear separator
    - or any super-duper ML method.
22. Outline
- Motivation of the paper
  - Decide whether OOVs exist.
  - Mark potentially mis-recognized words.
- What the author tries to do
  - Decide whether an utterance should be accepted.
- 3 different levels of features
  - Phonetic-level scoring
    - Never used on its own
  - Utterance-level scoring
    - 15 features
  - Word-level scoring
    - 10 features
23. Phone-Level Scoring
- From the author:
  - Several works in the past have already shown that phone and frame scores are unlikely to help.
  - However, phone scores will be used to generate word-level and sentence-level scores.
- Scores are normalized by a catch-all model.
  - In other words, a garbage model is used to approximate p(A).
  - Normalized scores are always used (see the sketch below).
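A minimal sketch of catch-all normalization for a phone score; the log-likelihood values are placeholders:

def normalized_phone_score(log_p_a_given_phone: float, log_p_a_catch_all: float) -> float:
    """Log likelihood ratio of the phone model against the catch-all (garbage) model."""
    return log_p_a_given_phone - log_p_a_catch_all

# A score near or above 0 means the phone model explains the frames at least
# as well as the catch-all model does.
print(normalized_phone_score(-42.0, -45.5))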
24. Utterance-Level Features (the boring group)
- 1. 1st-best hypothesis total score (AM + LM + PM)
- 2. 1st-best hypothesis average (word) score
  - the avg. score per word
- 3. 1st-best hypothesis total LM score
- 4. 1st-best hypothesis avg. LM score
- 5. 1st-best hypothesis total AM score
- 6. 1st-best hypothesis avg. AM score
- 7. Difference in total score between the 1st-best and 2nd-best hypotheses
- 8. Difference in LM score between the 1st-best and 2nd-best hypotheses
- 9. Difference in AM score between the 1st-best and 2nd-best hypotheses
- 14. Number of N-best hypotheses
- 15. Number of words in the 1st-best hypothesis
25. Utterance-Level Features (the interesting group)
- N-best purity
  - The N-best purity for a hypothesized word is the fraction of N-best hypotheses in which that particular hypothesized word appears in the same location in the sentence.
  - Or: agreement / total
  - Similar to ROVER voting on the N-best list (see the sketch below).
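A minimal sketch of N-best purity for one hypothesized word, assuming the hypotheses are already aligned so that "position" is simply the word index (a simplification of what a real system would do):

def nbest_purity(word: str, position: int, nbest: list[list[str]]) -> float:
    """Fraction of N-best hypotheses containing `word` at `position`."""
    agreement = sum(
        1 for hyp in nbest
        if position < len(hyp) and hyp[position] == word
    )
    return agreement / len(nbest)

nbest = [
    ["show", "me", "flights", "to", "boston"],
    ["show", "me", "flights", "to", "austin"],
    ["show", "me", "flight", "to", "boston"],
]
print(nbest_purity("boston", 4, nbest))  # 2/3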
26. Utterance-Level Features (the interesting group, cont.)
- 10. 1st-best hypothesis avg. N-best purity
- 11. 1st-best hypothesis high N-best purity
  - the fraction of words in the top-choice hypothesis which have an N-best purity greater than one half
- 12. Average N-best purity
- 13. High N-best purity
27. Word-Level Features
- 1. Mean acoustic score -> the mean of the log likelihood
- 2. Mean acoustic likelihood score -> the mean of the likelihood (not the log likelihood)
- 3. Minimum acoustic score
- 4. Standard deviation of the acoustic score
- 5. Mean difference from the max score
  - the average log likelihood ratio between the acoustic scores of the best path and those from phoneme recognition
- 6. Mean catch-all score
- 7. Number of acoustic observations
- 8. N-best purity
- 9. Number of N-best hypotheses
- 10. Utterance score
28. Classifier Training
- Linear separator
  - Input: features
  - Output: a (correct, incorrect) decision
- Training process (see the sketch below)
  - 1. Fisher linear discriminant analysis is used to produce the first version of the separator.
  - 2. A hill-climbing algorithm is used to minimize the classification error.
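A minimal sketch of this two-stage recipe on synthetic data: Fisher LDA provides the initial linear separator, then a simple hill-climbing pass adjusts the decision threshold to reduce the raw classification error. The features, dimensions, and hill-climbing details are assumptions for illustration, not the paper's setup:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(1.0, 1.0, (200, 5))])
y = np.array([0] * 200 + [1] * 200)  # 0 = incorrect word, 1 = correct word

# Stage 1: Fisher linear discriminant gives weights w and bias b.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
w, b = lda.coef_[0], float(lda.intercept_[0])

def error_rate(bias):
    pred = (X @ w + bias > 0).astype(int)
    return np.mean(pred != y)

# Stage 2: hill-climb the bias to directly minimize classification error.
step, best = 0.5, error_rate(b)
while step > 1e-3:
    for cand in (b + step, b - step):
        if error_rate(cand) < best:
            b, best = cand, error_rate(cand)
            break
    else:
        step /= 2  # no improvement in either direction -> shrink the step

print(f"classification error after hill climbing: {best:.3f}")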
29. Results (Word-Level)
30. Discussion: Is there any meaning in the combination method?
- IMO, yes,
  - provided that the breakdown of each feature's contribution to the reduction of CER is provided.
- E.g. the goodies in other papers:
  - In Hazen et al., N-best purity is the most useful.
  - In Lin's paper, LM jitter is the first question that provides the most gain.
  - In Rong's paper, backoff mode and parsing score provide significant improvement.
- Also, Hazen et al. is special because the optimization of the combination is also MCE-trained.
- So, how things are combined matters too.
31. Summary
- 25 features were used in this paper:
  - 15 at the utterance level
  - 10 at the word level
- N-best purity was found to be the most helpful.
- Both simple linear separator training and minimum classification error training were used.
  - That explains the large relative reduction in error.
32. Idea 3: Believe information other than ASR
- ASR output has certain limitations.
- When applied in different applications,
  - ASR confidence needs to be modified or combined with application-specific information.
33. Reviewed Papers
- Dialogue systems
  - (Suggested) Dan Bohus, A Principled Approach for Rejection Threshold Optimization in Spoken Dialogue Systems
    - http://www.cs.cmu.edu/dbohus/docs/dbohus_interspeech05.pdf
  - Sameer Pradhan, Wayne Ward, Estimating Semantic Confidence for Spoken Dialogue Systems
    - http://oak.colorado.edu/spradhan/publications/semantic-confidence.pdf
- CALL
  - Simon Ho, Brian Mak, Joint Estimation of Thresholds in a Bi-threshold Verification Problem
    - http://www.cs.ust.hk/mak/PDF/eurospeech2003-bithreshold.pdf
- Paper chosen because
  - it is the most recent,
  - it is representative from a dialogue-system standpoint.
34. Big picture of this type of paper
- Use features external to the ASR as confidence features.
  - e.g. dialogue context
- Use a cost external to the ASR error rate as the optimization criterion.
  - e.g. the cost of misunderstanding
  - 10 FA/FR
- As most commented:
  - It usually makes more sense than relying only on ASR features.
  - But the quality of the features also depends on the ASR scores.
35. Overview of the paper
- Motivation
  - Recognition errors significantly affect the quality and success of the interaction (for the dialogue system).
  - The rejection threshold introduces a trade-off between
    - the number of misunderstandings and
    - the number of false rejections.
36. Incorrect and Correct Transfer of Concepts
- An alternative formulation by the authors
  - The user tries to convey concepts to the system.
  - If the confidence is below the threshold,
    - the system rejects the utterance and no concept is transferred.
  - If the confidence is above the threshold,
    - the system accepts some correct concepts,
    - but also accepts some wrong concepts.
37. Questions the authors want to answer
- "Given the existence of this tradeoff, what is the optimal value for the rejection threshold?"
- "this tradeoff"
  - the trade-off between correctly and incorrectly transferred concepts
38. Logistic regression
- A generalized linear model in which
  - g is the link function,
  - which could be log, logit, identity, or reciprocal (see the sketch below).
- http://userwww.sfsu.edu/efc/classes/biol710/Glz/Generalized%20Linear%20Models.htm
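A sketch of the general form being referred to, with logistic regression as the special case where the link g is the logit; the symbols (y for the response, x_i for the predictors, beta_i for the coefficients) are the usual conventions rather than the slide's own notation:

g(\mathrm{E}[y]) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
\text{logistic regression: } g(p) = \operatorname{logit}(p) = \log \frac{p}{1 - p}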
39. Logistic regression (cont.)
- Usually used when
  - the dependent variable is categorical or non-continuous,
  - or the relationship itself is not linear.
- Also used for combining features in ASR
  - See Siu, Improved Estimation, Evaluation and Applications of Confidence Measures for Speech Recognition
  - and BBN systems in general.
40. Impact of Incorrect and Correct Concept Transfer on Task Success
- logit(TS) = 0.21 + 2.14 * CTC - 4.12 * ITC
- The effect of one ITC on the odds is nearly 2 times that of one CTC.
41. The procedure
- Identify a set of variables A, B, ... involved in the rejection tradeoff (e.g. CTC and ITC).
- Choose a global dialogue performance metric P to optimize for (e.g. task success).
- Fit a model m which relates the trade-off variables to the chosen global dialogue performance metric: P <- m(A, B).
- Find the threshold which maximizes the performance (see the sketch below):
  - th = argmax P = argmax m(A(th), B(th))
42. Data
- RoomLine system
- Baseline: fixed rejection threshold of 0.3
- Each participant attempted
  - a maximum of 10 scenario-based interactions.
- 71 states in the dialogue system
- In general
43. Rejection Optimization
- The 71 states are manually clustered into 3 types:
  - Open request: the system asks an open question
    - "How may I help you?"
  - Request (bool): the system asks a yes/no question
    - "Do you want a reservation for this room?"
  - Request (non-bool): the system requests an answer with more than 2 possible values
    - "Starting at what time do you need the room?"
- Costs are then optimized for individual states.
44. Results
45. Summary of the paper
- A principled idea for dialogue systems.
- Logistic regression is used to optimize the rejection threshold.
- A neat paper, with several clever points:
  - logistic regression,
  - using an external metric.
46. Discussion
- 3 different types of ideas in confidence annotation
- Questions
  - Which idea should we use?
  - Could the ideas be combined?
47. Goodies in Idea 1
- Word posterior probability and LM jitter were found to be very useful in different papers.
- Word posterior probability is a generalization of many techniques in the field.
- LM jitter could be generalized to other parameters in the decoder as well.
- Utterance scores help word scores.
48. Goodies in Idea 2
- Combination always helps.
- Combination in the ML sense and the DT sense each gives a chunk of gain.
- Combination methods
  - A generalized linear model is easy to interpret and principled.
  - A linear separator can easily be trained in the ML and DT sense.
  - Neural networks and SVMs come with the standard goodie: general non-linear modeling.
49. Goodies in Idea 3
- Every type of application has its own concerns, which are more important than WER.
- Researchers should take the liberty to optimize for them instead of relying only on ASR.
50. Conclusion
- For an ASR-based system,
  - Ideas 1 and 2 are wins.
- For an application built on an ASR-based system,
  - Ideas 1, 2, and 3 would be the most helpful.