1
HUMAN AND SYSTEMS ENGINEERING
Confidence Measures Based on Word Posteriors and Word Graphs
Sridhar Raghavan (graduate research assistant) and Joseph Picone (Human and Systems Engineering professor), Electrical and Computer Engineering
URL: www.isip.msstate.edu/publications/seminars/msstate/2005/confidence/
2
Abstract
  • Confidence measures using word posteriors.
  • There is a strong need to estimate the confidence of a word hypothesis
    in LVCSR systems, because in conventional Viterbi decoding the objective
    function minimizes the sentence error rate rather than the word error rate.
  • A good estimate of this confidence is the word posterior probability.
  • The word posteriors can be computed from a word graph.
  • A forward-backward algorithm can be used to compute the word posteriors.

3
  • Foundation

The equation for computing the posterior probability of a word is given by
F. Wessel: the posterior probability of a word hypothesis is the sum of the
posterior probabilities of all lattice paths of which the word is a part
(Lidia Mangu et al., "Finding consensus in speech recognition: word error
minimization and other applications of confusion networks").
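Stated in LaTeX (notation assumed here: [w; s, t] is a hypothesis of word w
with start time s and end time t, X the acoustic observations, and Q ranges
over lattice paths), this definition can be written as:

\[
P([w; s, t] \mid X) \;=\; \sum_{Q \,\ni\, [w; s, t]} P(Q \mid X)
\]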
4
  • Foundation continued

We cannot compute the posterior probability directly, so we decompose it into
a likelihood and a prior using Bayes' rule.

Consider a node N in the word graph: there are 6 different ways to reach N and
2 different ways to leave it, so we need the forward probability as well as
the backward probability to determine the probability of passing through N.
This is where the forward-backward algorithm comes into the picture. The
numerator is computed using the forward-backward algorithm; the denominator
is simply a by-product of the same forward-backward computation.
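Under the same assumed notation (x_1^T the acoustic observation sequence and
Q a lattice path), a sketch of that Bayes decomposition is:

\[
P([w; s, t] \mid x_1^T)
  \;=\; \frac{\displaystyle\sum_{Q \,\ni\, [w; s, t]} p(x_1^T \mid Q)\, P(Q)}
             {\displaystyle\sum_{Q'} p(x_1^T \mid Q')\, P(Q')}
\]

The numerator sums the acoustic likelihoods and language model priors over
every path through the word; the denominator sums over all paths in the
lattice.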
5
  • Scaling

Scaling is used to flatten the posterior distribution so that it is not
dominated by the best path (G. Evermann). Experimentally it has been
determined that 1/(language model scale factor) is a good exponent for
scaling down the acoustic model score. The acoustic score is therefore
scaled down using the language model scale factor as follows.
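Written out, assuming λ denotes the language model scale factor:

\[
\tilde{p}(x \mid w) \;=\; p(x \mid w)^{1/\lambda}
\qquad\text{or, in the log domain,}\qquad
\log \tilde{p}(x \mid w) \;=\; \frac{\log p(x \mid w)}{\lambda}
\]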
6
  • How to combine word-posteriors?
  • The word posteriors corresponding to the same word can be combined in
    order to obtain a better confidence estimate. There are several ways to
    do this; some of them are as follows.
  • Sum up the posteriors of similar words that fall within the same time
    frame, or choose the maximum posterior value among the similar words in
    the same time frame (F. Wessel, R. Schlüter, K. Macherey, H. Ney,
    "Confidence Measures for Large Vocabulary Continuous Speech
    Recognition"); a minimal sketch of this sum/max rule follows this list.
  • Build a confusion network, where the entire lattice is mapped onto a
    single linear graph, i.e. one where the links pass through all the nodes
    in the same order.
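The following Python sketch illustrates the sum/max rule from the first item.
The (word, start, end, posterior) tuple layout and the helper name are
assumptions made for illustration, not code from the seminar:

from collections import defaultdict

def combine_posteriors(hypotheses, frame, use_max=False):
    """Combine posteriors of identical words that span the given time frame.

    `hypotheses` is assumed to be a list of (word, start, end, posterior)
    tuples read off the word graph; this layout is illustrative only.
    """
    combined = defaultdict(float)
    for word, start, end, posterior in hypotheses:
        if start <= frame <= end:                  # word covers this frame
            if use_max:
                combined[word] = max(combined[word], posterior)
            else:
                combined[word] += posterior        # sum rule
    return dict(combined)

# Hypothetical competitors active around frame 120
hyps = [("sense", 100, 130, 0.20), ("sentence", 95, 140, 0.55),
        ("sense", 105, 125, 0.10)]
print(combine_posteriors(hyps, 120))  # sums the two "sense" hypotheses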

[Figure: a Full Lattice Network over words such as sil, this, is, the, a,
quest, guest, test, sense and sentence, and the corresponding Confusion
Network.]
Note: the redundant silence edges can be fused together in the full lattice
network before computing the forward-backward probabilities. This saves a lot
of computation if there are many silence edges in the lattice.
7
  • Some challenges during posterior rescoring!

The word posteriors are not a very good estimate of confidence when the WER
on the data is very poor. This is described by G. Evermann and P. C. Woodland
in "Large Vocabulary Decoding and Confidence Estimation using Word Posterior
Probabilities". The reason is that the posteriors are overestimated: the
words in the lattice are not the full set of possible words, and when the WER
is poor the lattice contains many wrong hypotheses. In such a case the depth
of the lattice becomes a critical factor in determining the effectiveness of
the confidence measure. The paper cites two techniques to address this
problem:
  1. A decision-tree based technique.
  2. A neural-network based technique.
Different confidence measure techniques are judged on a metric known as
normalized cross entropy (NCE).
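For reference, NCE is commonly defined as follows (a standard formulation,
not quoted from the slide), where n is the number of hypothesized words,
n_c the number of correct ones, and \hat{p}_w the confidence assigned to
word w:

\[
\mathrm{NCE} \;=\; \frac{H_{\max} \;+\; \sum_{w\ \mathrm{correct}} \log_2 \hat{p}_w \;+\; \sum_{w\ \mathrm{incorrect}} \log_2\!\bigl(1-\hat{p}_w\bigr)}{H_{\max}},
\qquad
H_{\max} \;=\; -\,n_c \log_2 \frac{n_c}{n} \;-\; (n-n_c)\log_2 \frac{n-n_c}{n}
\]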
8
  • How can we compute the word posterior from a
    word graph?

The word posterior probability is computed by considering the word's acoustic
score, language model score, and its position and history in a particular
path through the word graph. An example of a word graph is given below; note
that the nodes hold the start/stop times and the links hold the word labels,
language model score and acoustic score.
[Figure: example word graph, with words such as "quest" on the links.]
9
  • Example

Let us consider the example shown below. The values on the links are the
likelihoods. Some nodes are outlined in red to signify that they occur at the
same time.
10
  • Forward-backward algorithm

We will use a forward-backward type algorithm to determine the link
probabilities. The general equations used to compute the alphas and betas for
an HMM can be found in any speech text book.

Computing alphas. Step 1: Initialization. In a conventional HMM
forward-backward algorithm we would initialize the alphas from the initial
state probabilities and the emission probabilities. We need a slightly
modified version of this for processing a word graph: the emission
probability becomes the acoustic score, and in our implementation we simply
initialize the first alpha value with a constant. Since we work in the log
domain, we assign the first alpha value as 0.
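For comparison, the textbook HMM initialization (with \pi_i the initial state
probabilities and b_i the emission densities) and the word-graph variant
described above are:

\[
\alpha_1(i) \;=\; \pi_i\, b_i(o_1), \quad 1 \le i \le N
\qquad\text{versus}\qquad
\alpha(\text{first node}) \;=\; 1 \;\;(\log\alpha = 0)
\]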
11
  • Forward-backward algorithm (continued)

The α for the first node is 1.
Step 2: Induction. The alpha values computed in the previous step are used to
compute the alphas for the succeeding nodes. Note: unlike in HMMs, where we
move from left to right at fixed intervals of time, here we move from one
node to the next based on node indexes, which are time aligned.
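The textbook induction step, and a sketch of the analogous word-graph
recursion (where the sum runs over the links entering node n, and the scaled
acoustic score and language model probability of the word on each link play
the roles of b_j and a_{ij}):

\[
\alpha_t(j) \;=\; \Bigl[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Bigr]\, b_j(o_t),
\qquad
\alpha(n) \;=\; \sum_{(v \to n)} \alpha(v)\; P_{\mathrm{LM}}(w_{v \to n})\; P_{\mathrm{AC}}(w_{v \to n})^{1/\lambda}
\]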
12
  • Forward-backward algorithm (continued)

Let us look at the computation of the alphas starting from node 2; the alpha
for node 1 was initialized to 1 in the previous step.
[Figure: fragment of the example word graph showing the α values computed for
nodes 2, 3 and 4 (values such as α = 0.005, α = 0.005025 and α = 1.675E-05)
from link likelihoods such as 2/6, 3/6 and 4/6 on the "Sil", "this" and "is"
links.]
The alpha calculation continues in this manner for all the remaining nodes.
The forward-backward calculation on word graphs is similar to the calculation
used on HMMs, but in word graphs the transition matrix is populated by the
language model probabilities and the emission probability corresponds to the
acoustic score. A minimal sketch of such a pass is given below.
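The following Python sketch assumes the word graph is given as a list of
(from_node, to_node, word, probability) links whose probability already
combines the scaled acoustic and language model scores, and whose node
indexes are time ordered; the link values below are made up for illustration:

from collections import defaultdict

# Tiny illustrative word graph; the probabilities are stand-ins for the
# (scaled) acoustic score times the language model score on each link.
links = [
    (1, 2, "sil",  3/6),
    (1, 3, "this", 2/6),
    (2, 3, "this", 4/6),
    (3, 4, "is",   3/6),
]

def forward_backward(links, first_node, last_node):
    alpha = defaultdict(float)
    beta = defaultdict(float)
    alpha[first_node] = 1.0                 # initialization: alpha of first node is 1
    for u, v, _, p in sorted(links):        # induction, left to right over time-ordered nodes
        alpha[v] += alpha[u] * p
    beta[last_node] = 1.0                   # backward initialization
    for u, v, _, p in sorted(links, reverse=True):
        beta[u] += beta[v] * p              # induction, right to left
    return alpha, beta

alpha, beta = forward_backward(links, first_node=1, last_node=4)
print(alpha[4], beta[1])   # both equal the total probability of all paths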
13
  • Forward-backward algorithm (continued)

Once we have computed the alphas using the forward algorithm, we begin the
beta computation using the backward algorithm. The backward algorithm is
similar to the forward algorithm, but we start from the last node and proceed
from right to left.
Step 1: Initialization.
Step 2: Induction.
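The textbook form of those two steps for an HMM with N states is given below;
on the word graph, the β of the last node is set to 1 and the recursion runs
right to left over the links, with the acoustic and language model scores in
place of b_j and a_{ij}:

\[
\beta_T(i) \;=\; 1,
\qquad
\beta_t(i) \;=\; \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)
\]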
14
  • Forward-backward algorithm (continued)

Let us look at the computation of the beta values from node 14 backwards.
[Figure: fragment of the example word graph around nodes 11-15 ("Sil",
"sense", "sentence" links with likelihoods such as 1/6, 4/6 and 5/6), showing
β values such as β = 1, β = 0.00833, β = 0.001667, β = 5.55E-5 and
β = 1.66E-5.]
15
  • Forward-backward algorithm (continued)

Node 11: in a similar manner we obtain the beta values for all the nodes back
to node 1. The alpha for the last node should be the same as the beta for the
first node.
We can now compute the probabilities on the links (between two nodes). Let us
call this link probability Γ. Then Γ(t-1, t) is computed as the product
α(t-1) · a_ij · β(t). These values give the un-normalized posterior
probabilities of the word on the link, considering all possible paths through
the link.
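In symbols, with a_{ij} the transition (language model) score on the link
from node t-1 to node t, and with normalization by the total lattice
probability (which, as noted above, is available as the alpha of the last
node, equivalently the beta of the first node):

\[
\Gamma(t-1, t) \;=\; \alpha(t-1)\; a_{ij}\; \beta(t),
\qquad
P_{\mathrm{link}} \;=\; \frac{\alpha(t-1)\; a_{ij}\; \beta(t)}{\sum_{\text{all paths } Q} p(Q)}
\]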
16
  • Word graph showing the computed alphas and betas

This word graph shows every node with its corresponding alpha and beta value.
[Figure: word graph annotated with α and β at every node, with values such as
α = 1.675E-7, β = 4.61E-11 and α = 2.79E-10, β = 2.766E-8.]
The assumption here is that the probability of occurrence of any word is
0.01, i.e. a loop grammar with 100 words.
17
  • Link probabilities calculated from alphas and
    betas

The following word graph shows the links with their corresponding link
posterior probabilities, normalized by the sum over all paths.
[Figure: word graph with link posteriors such as Γ = 4.98E-03.]
By choosing the links with the maximum posterior probability we can be
certain that we have included the most probable words in the final sequence.
18
  • Some Alternate approaches

The normalization of the posteriors is done by dividing each value by the sum
of the posterior probabilities of all the paths in the lattice. This example
suffers from the fact that the lattice is not deep enough; hence
normalization may push the values of some of the links very close to 1. This
phenomenon is explained in the paper by G. Evermann and P. C. Woodland.

The paper by F. Wessel et al. ("Confidence Measures for Large Vocabulary
Continuous Speech Recognition") describes alternate techniques to compute the
posterior. The drawback of the approach described above is that the lattice
has to be very deep to accommodate sufficient links at the same time instant.
To overcome this problem we can use a soft time margin instead of a hard
margin, which is achieved by considering words that overlap to a certain
degree. But by doing this, the author notes, the normalization no longer
works, since the probabilities are not summed within the same time frame and
hence do not sum to one. The author therefore suggests an approach where the
posteriors are computed frame by frame so that normalization is possible. In
the end it was found that the frame-by-frame normalization did not perform
significantly better than the overlapping-time-marks approach.
19
  • Logarithmic computations

Instead of using the probabilities described above directly, we can work with
their logarithms so that the multiplications are converted to additions. We
can directly use the acoustic and language model scores from the word graphs.
To add two values in the log domain we use the following log trick:
log(x + y) = log(x) + log(1 + y/x)
The logarithmic alphas and betas computed in this way are shown below.
[Figure: the example word graph annotated with log-domain α and β values at
every node (for instance α = 0, β = -3.1438 at the first node and
α = -3.1442, β = 0 at the last node, with intermediate values such as
α = -0.6930, β = -2.4598) and log link scores such as -0.1823, -0.4054,
-0.6931, -1.0986 and -1.7917.]
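A small Python sketch of that log trick (math.log1p and the argument swap are
implementation conveniences assumed here, not taken from the slides):

import math

def log_add(log_x, log_y):
    """Return log(x + y) given log(x) and log(y), using
    log(x + y) = log(x) + log(1 + y/x); swapping so the larger value comes
    first keeps the exponential from overflowing."""
    if log_y > log_x:
        log_x, log_y = log_y, log_x
    return log_x + math.log1p(math.exp(log_y - log_x))

# Accumulating two log-domain alpha contributions instead of adding probabilities
print(log_add(-1.3862, -1.3861))   # approximately -0.693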
20
  • Logarithmic posterior probabilities

[Figure: the word graph with log-domain link posteriors such as p = -1.0982.]
From the lattice we can obtain the best word sequence by picking the words
with the highest posterior probability as we traverse from node to node.
21
  • References
  • F. Wessel, R. Schlüter, K. Macherey, H. Ney, "Confidence Measures for
    Large Vocabulary Continuous Speech Recognition," IEEE Transactions on
    Speech and Audio Processing, vol. 9, no. 3, pp. 288-298, March 2001.
  • F. Wessel, K. Macherey, and R. Schlüter, "Using Word Probabilities as
    Confidence Measures," Proc. ICASSP'97.
  • G. Evermann and P. C. Woodland, "Large Vocabulary Decoding and Confidence
    Estimation using Word Posterior Probabilities," Proc. ICASSP 2000,
    pp. 2366-2369, Istanbul.
  • X. Huang, A. Acero, and H. W. Hon, Spoken Language Processing - A Guide
    to Theory, Algorithm, and System Development, Prentice Hall,
    ISBN 0-13-022616-5, 2001.
  • J. Deller, et al., Discrete-Time Processing of Speech Signals, MacMillan
    Publishing Co., ISBN 0-7803-5386-2, 2000.