Hidden Markov Models: Probabilistic Reasoning Over Time

Slides: 56

Provided by: classesCs


1
Hidden Markov Models: Probabilistic Reasoning Over Time

Artificial Intelligence
CMSC 25000
February 26, 2008

2
Agenda

Hidden Markov Models
Uncertain observation
Temporal Context
Recognition: Viterbi
Training the model: Baum-Welch
Speech Recognition
Framing the problem: Sounds to Sense
Speech Recognition as Modern AI

3
Modelling Processes over Time

Infer underlying state sequence from observed sequence
Issue: New state depends on preceding states
Analyzing sequences
Problem 1: Possibly unbounded probability tables
Observation × State × Time
Solution 1: Assume stationary process
Rules governing the process are the same at all times
Problem 2: Possibly unbounded parents
Markov assumption: Only consider finite history
Common: 1st- or 2nd-order Markov, i.e. depend on the last one or two states

4
Hidden Markov Models (HMMs)

An HMM is
1) A set of states
2) A set of transition probabilities
Where a_ij is the probability of transition q_i -> q_j
3) Observation probabilities
The probability of observing o_t in state i
4) An initial probability distribution over states
The probability of starting in state i
5) A set of accepting states

5
Three Problems for HMMs

Find the probability of an observation sequence
given a model
Forward algorithm
Find the most likely path through a model given
an observed sequence
Viterbi algorithm (decoding)
Find the most likely model (parameters) given an
observed sequence
Baum-Welch (EM) algorithm

6
Bins and Balls Example

Assume there are two bins filled with red and
blue balls. Behind a curtain, someone selects a
bin and then draws a ball from it (and replaces
it). They then select either the same bin or the
other one and then select another ball
(Example due to J. Martin)

7
Bins and Balls Example
[Figure: two-state transition diagram. Bin 1 -> Bin 1: 0.6, Bin 1 -> Bin 2: 0.4, Bin 2 -> Bin 1: 0.3, Bin 2 -> Bin 2: 0.7]
8
Bins and Balls

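Initial probabilities (π): Bin 1 0.9, Bin 2 0.1

A (transition probabilities; row = from, column = to)
       Bin 1  Bin 2
Bin 1  0.6    0.4
Bin 2  0.3    0.7

B (observation probabilities; column = bin)
       Bin 1  Bin 2
Red    0.7    0.4
Blue   0.3    0.6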
9
Bins and Balls

Assume the observation sequence
Blue Blue Red (BBR)
Both bins have Red and Blue
Any state sequence could produce observations
However, NOT equally likely
Big difference in start probabilities
Observation depends on state
State depends on prior state

10
Bins and Balls

Blue Blue Red

1 1 1  (0.9×0.3)(0.6×0.3)(0.6×0.7) = 0.0204
1 1 2  (0.9×0.3)(0.6×0.3)(0.4×0.4) = 0.0078
1 2 1  (0.9×0.3)(0.4×0.6)(0.3×0.7) = 0.0136
1 2 2  (0.9×0.3)(0.4×0.6)(0.7×0.4) = 0.0181
2 1 1  (0.1×0.6)(0.3×0.3)(0.6×0.7) = 0.0023
2 1 2  (0.1×0.6)(0.3×0.3)(0.4×0.4) = 0.0009
2 2 1  (0.1×0.6)(0.7×0.6)(0.3×0.7) = 0.0053
2 2 2  (0.1×0.6)(0.7×0.6)(0.7×0.4) = 0.0071
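
The brute-force calculation for Blue Blue Red can be checked with a minimal Python sketch. It assumes states are indexed 0 = Bin 1 and 1 = Bin 2, and that the initial, transition, and observation values are the ones in the Bins and Balls tables; the names `pi`, `A`, `B`, `path_probability`, and `enumerate_paths` are illustrative. Every factor is read directly from those tables, so for a path such as 2 1 1 the second draw uses P(Blue | Bin 1) = 0.3.

```python
import itertools

# Parameters from the Bins and Balls tables (0 = Bin 1, 1 = Bin 2).
pi = [0.9, 0.1]                      # initial probabilities
A = [[0.6, 0.4], [0.3, 0.7]]         # A[i][j] = P(next bin j | current bin i)
B = {"R": [0.7, 0.4], "B": [0.3, 0.6]}  # B[colour][bin] = P(colour | bin)

def path_probability(path, obs):
    """Joint probability of one state path and the observation sequence."""
    p = pi[path[0]] * B[obs[0]][path[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[obs[t]][path[t]]
    return p

def enumerate_paths(obs, n_states=2):
    """Score every possible state path (N^T of them)."""
    paths = itertools.product(range(n_states), repeat=len(obs))
    return {path: path_probability(path, obs) for path in paths}

scores = enumerate_paths("BBR")
total = sum(scores.values())          # probability of the observations
best = max(scores, key=scores.get)    # most likely state path

for path, p in scores.items():
    print([s + 1 for s in path], round(p, 4))
print("P(BBR) =", round(total, 4), "best path =", [s + 1 for s in best])
```

Under these assumptions the eight path probabilities sum to about 0.0754, and the most likely path is 1 1 1 with probability about 0.0204.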
11
Answers and Issues

Here, to compute the probability of the observed sequence:
Just add up all the state sequence probabilities
To find the most likely state sequence:
Just pick the sequence with the highest value
Problem: Computing all paths is expensive
O(2T × N^T)
Solution: Dynamic programming
Sweep across all states at each time step
Summing (Problem 1) or maximizing (Problem 2)

12
Forward Probability
$$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$
Where α is the forward probability, t is the time in the utterance, i, j are states in the HMM, a_ij is the transition probability, b_j(o_t) is the probability of observing o_t in state j, N is the highest-numbered state, and T is the last time.
13
Forward Algorithm

Idea: matrix where each cell forward[t,j] represents the probability of being in state j after seeing the first t observations.
Each cell expresses the probability
forward[t,j] = P(o_1, o_2, ..., o_t, q_t = j | w)
q_t = j means "the t-th state in the sequence of states is state j."
Compute the probability by summing over extensions of all paths leading to the current cell.
An extension of a path from a state i at time t-1 to state j at t is computed by multiplying together:
i. the previous path probability from the previous cell, forward[t-1,i]
ii. the transition probability a_ij from previous state i to current state j
iii. the observation likelihood b_j(o_t) that current state j matches observation symbol o_t

14
Forward Algorithm

Function Forward(observations of length T, state-graph) returns forward probability
  num-states <- num-of-states(state-graph)
  Create path prob matrix forward[num-states+2, T+2]
  forward[0,0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- forward[s,t] * a[s,s'] * b_s'(o_t)
        forward[s',t+1] <- forward[s',t+1] + new-score
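
A compact Python sketch of the same sweep, applied to the Bins and Balls model. It is one possible rendering, not the slide's exact procedure: instead of a dummy start state it assumes an explicit initial distribution `pi`, and it indexes states 0 = Bin 1, 1 = Bin 2. The names `forward`, `pi`, `A`, and `B` are illustrative.

```python
# Parameters from the Bins and Balls tables (0 = Bin 1, 1 = Bin 2).
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]            # A[i][j] = P(state j | state i)
B = {"R": [0.7, 0.4], "B": [0.3, 0.6]}  # B[symbol][j] = P(symbol | state j)

def forward(obs, pi, A, B):
    """Return the forward matrix alpha and P(obs | model).

    alpha[t][j] is the probability of seeing obs[0..t] and being in
    state j at time t.
    """
    n = len(pi)
    alpha = [[pi[j] * B[obs[0]][j] for j in range(n)]]
    for symbol in obs[1:]:
        prev = alpha[-1]
        # Sum over all predecessors i, then weight by the emission.
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[symbol][j]
            for j in range(n)
        ])
    return alpha, sum(alpha[-1])

alpha, prob = forward("BBR", pi, A, B)
print(round(prob, 5))
```

For Blue Blue Red this gives about 0.0754 using N² multiply-adds per time step, rather than scoring all N^T paths separately.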

15
Viterbi Algorithm

Find BEST sequence given signal
Best: P(sequence | signal)
Take HMM & observation sequence
=> seq (prob)
Dynamic programming solution
Record most probable path ending at a state i
Then most probable path from i to end
O(bMn)

16
Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create path prob matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if ((viterbi[s',t+1] = 0) or (viterbi[s',t+1] < new-score)) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  Backtrace from highest prob state in final column of viterbi and return path
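
The pseudocode can be sketched in Python as follows. This is an illustrative version that assumes an explicit initial distribution `pi` in place of a dummy start state, with states indexed 0 = Bin 1 and 1 = Bin 2; `viterbi`, `pi`, `A`, and `B` are names chosen for the example.

```python
# Parameters from the Bins and Balls tables (0 = Bin 1, 1 = Bin 2).
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]            # A[i][j] = P(state j | state i)
B = {"R": [0.7, 0.4], "B": [0.3, 0.6]}  # B[symbol][j] = P(symbol | state j)

def viterbi(obs, pi, A, B):
    """Return the most probable state path and its probability."""
    n = len(pi)
    delta = [[pi[j] * B[obs[0]][j] for j in range(n)]]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for symbol in obs[1:]:
        prev = delta[-1]
        row, ptr = [], []
        for j in range(n):
            # Maximize over predecessors instead of summing.
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            row.append(prev[best_i] * A[best_i][j] * B[symbol][j])
            ptr.append(best_i)
        delta.append(row)
        back.append(ptr)
    # Backtrace from the highest-probability state in the final column.
    last = max(range(n), key=lambda j: delta[-1][j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[-1][last]

best_path, best_prob = viterbi("BBR", pi, A, B)
print([s + 1 for s in best_path], round(best_prob, 4))
```

For Blue Blue Red the recovered path is Bin 1, Bin 1, Bin 1 with probability about 0.0204, the same answer as picking the largest entry from the full enumeration.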
17
Learning HMMs

Issue: Where do the probabilities come from?
Solution: Learn from data
Trains transition (a_ij) and emission (b_j) probabilities
Typically assume structure
Baum-Welch, aka forward-backward algorithm
Iteratively estimate counts of transitions/emitted
Get estimated probabilities by forward computation
Divide probability mass over contributing paths

18
Learning HMMs

Issue: Where do the probabilities come from?
Supervised/manual construction
Solution: Learn from data
Trains transition (a_ij), emission (b_j), and initial (π) probabilities
Typically assume state structure is given
Unsupervised
Baum-Welch, aka forward-backward algorithm
Iteratively estimate counts of transitions/emitted
Get estimated probabilities by forward computation
Divide probability mass over contributing paths

19
Manual Construction

Manually labeled data
Observation sequences, aligned to ground truth state sequences
Compute (relative) frequencies of state transitions
Compute frequencies of observations/state
Compute frequencies of initial states
Bootstrapping: iterate tag, correct, re-estimate, tag.
Problem:
Labeled data is expensive, hard/impossible to obtain, may be inadequate to fully estimate
Sparseness problems
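
Estimation from labeled data amounts to counting and normalizing. The sketch below illustrates this with a tiny, made-up pair of labeled sequences; the data, the function name `estimate_from_labeled`, and the dictionary layout are all assumptions for illustration. No smoothing is applied, so unseen events get probability 0, which is the sparseness problem noted above.

```python
from collections import Counter

def normalize(counts):
    """Turn a dict of counts into relative frequencies."""
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

def estimate_from_labeled(sequences, states, symbols):
    """sequences: list of (state_seq, obs_seq) pairs of equal length."""
    init, trans, emit = Counter(), Counter(), Counter()
    for state_seq, obs_seq in sequences:
        init[state_seq[0]] += 1
        for s, o in zip(state_seq, obs_seq):
            emit[(s, o)] += 1
        for s, s_next in zip(state_seq, state_seq[1:]):
            trans[(s, s_next)] += 1
    pi = normalize({s: init[s] for s in states})
    A = {s: normalize({t: trans[(s, t)] for t in states}) for s in states}
    B = {s: normalize({o: emit[(s, o)] for o in symbols}) for s in states}
    return pi, A, B

# Hypothetical hand-labeled data: (bins visited, colours drawn).
labeled = [
    (["Bin1", "Bin1", "Bin2"], ["B", "B", "R"]),
    (["Bin2", "Bin2", "Bin1"], ["B", "R", "R"]),
]
pi_hat, A_hat, B_hat = estimate_from_labeled(
    labeled, states=["Bin1", "Bin2"], symbols=["R", "B"]
)
print(pi_hat, A_hat, B_hat, sep="\n")
```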

20
Less Supervised Learning

Re-estimation from unlabeled data
Baum-Welch aka forward-backward algorithm
Assume representative collection of data
E.g. recorded speech, gene sequences, etc
Assign initial probabilities
Or estimate from very small labeled sample
Compute state sequences given the data
I.e. use forward algorithm
Update transition, emission, initial probabilities

21
Updating Probabilities

Intuition
Observations identify state sequences
Adjust probability of transitions/emissions
Make closer to those consistent with observed
Increase P(Observations | Model)
Functionally
For each state i, what proportion of transitions
from state i go to state j
For each state i, what proportion of observations
match O?
How often is state i the initial state?

22
Estimating Transitions

Consider updating transition a_ij
Compute probability of all paths using a_ij
Compute probability of all paths through i (with and without i->j)

[Figure: states i and j joined by the transition i->j]
23
Forward Probability
$$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$
Where α is the forward probability, t is the time in the utterance, i, j are states in the HMM, a_ij is the transition probability, b_j(o_t) is the probability of observing o_t in state j, N is the highest-numbered state, and T is the last time.
24
Backward Probability
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
Where β is the backward probability, t is the time in the sequence, i, j are states in the HMM, a_ij is the transition probability, b_j(o_t) is the probability of observing o_t in state j, N is the final state, and T is the last time.
25
Re-estimating

Estimate transitions from i->j
Estimate observations in j
Estimate initial i
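
One re-estimation pass can be sketched in Python using the standard Baum-Welch updates, which are assumed here to be what the three estimates refer to: the new a_ij is the expected number of i->j transitions divided by the expected number of transitions out of i, the new b_j(k) is the expected time spent in j while observing k divided by the expected time spent in j, and the new initial probability is the expected occupancy of i at the first time step. The sketch handles a single observation sequence, does no scaling (so it suits short sequences only), and uses illustrative names throughout.

```python
# Starting parameters from the Bins and Balls tables (0 = Bin 1, 1 = Bin 2).
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]
B = {"R": [0.7, 0.4], "B": [0.3, 0.6]}  # B[symbol][state]

def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[j] * B[obs[0]][j] for j in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[o][j]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    n = len(A)
    beta = [[1.0] * n]  # beta at the last time step
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[o][j] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One EM iteration; returns updated (pi, A, B) and the old likelihood."""
    n, T = len(pi), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    lik = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of taking transition i->j between t and t+1.
    xi = [[[alpha[t][i] * A[i][j] * B[obs[t + 1]][j] * beta[t + 1][j] / lik
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(n)] for i in range(n)]
    new_B = {k: [sum(g[j] for g, o in zip(gamma, obs) if o == k)
                 / sum(g[j] for g in gamma) for j in range(n)] for k in B}
    return new_pi, new_A, new_B, lik

new_pi, new_A, new_B, old_lik = baum_welch_step("BBR", pi, A, B)
new_lik = sum(forward("BBR", new_pi, new_A, new_B)[-1])
print(round(old_lik, 5), "->", round(new_lik, 5))
```

Each pass should leave P(Observations | Model) no lower than before, which is the property that makes iterating the update worthwhile.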

26
Speech Recognition

Goal
Given an acoustic signal, identify the sequence
of words that produced it
Speech understanding goal
Given an acoustic signal, identify the meaning
intended by the speaker
Issues
Ambiguity: many possible pronunciations
Uncertainty: what signal, what word/sense produced this sound sequence

27
Decomposing Speech Recognition

Q1: What speech sounds were uttered?
Human languages: 40-50 phones
Basic sound units: b, m, k, ax, ey, ... (ARPAbet)
Distinctions categorical to speakers
Acoustically continuous
Part of knowledge of language
Build per-language inventory
Could we learn these?

28
Decomposing Speech Recognition

Q2: What words produced these sounds?
Look up sound sequences in dictionary
Problem 1: Homophones
Two words, same sounds: too, two
Problem 2: Segmentation
No space between words in continuous speech
"I scream"/"ice cream", "Wreck a nice beach"/"Recognize speech"
Q3: What meaning produced these words?
NLP (But that's not all!)

30
Signal Processing

Goal: Convert impulses from microphone into a representation that
is compact
encodes features relevant for speech recognition
Compactness: Step 1
Sampling rate: how often to look at the data
8 kHz, 16 kHz (44.1 kHz = CD quality)
Quantization factor: how much precision
8-bit, 16-bit (encoding: μ-law, linear)

31
(A Little More) Signal Processing

Compactness: Feature identification
Capture mid-length speech phenomena
Typically frames of 10 ms (80 samples at 8 kHz)
Overlapping
Vector of features, e.g. energy at some frequency
Vector quantization
n-feature vectors: n-dimensional space
Divide into m regions (e.g. 256)
All vectors in a region get the same label, e.g. C256
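
Vector quantization can be sketched as a nearest-neighbour lookup against a codebook. The example assumes the regions are defined by codebook centroids and that Euclidean distance is used; the three-entry codebook and the two-dimensional frames are made up for illustration (a real system would use something like 256 entries over many more features). It also checks the frame arithmetic: a 10 ms frame at 8 kHz holds 80 samples.

```python
import math

SAMPLE_RATE_HZ = 8000
FRAME_MS = 10
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 80 samples

def quantize(frames, codebook):
    """Label each feature vector with its nearest codebook entry (C1, C2, ...)."""
    labels = []
    for vec in frames:
        nearest = min(range(len(codebook)),
                      key=lambda k: math.dist(vec, codebook[k]))
        labels.append(f"C{nearest + 1}")
    return labels

# Hypothetical 2-feature codebook and frames.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (4.0, 6.0)]
labels = quantize(frames, codebook)
print(samples_per_frame, labels)
```

After this step each frame is a discrete symbol, which is what lets an HMM with a finite observation table model the signal.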

32
Speech Recognition Model

Question: Given signal, what words?
Problem: uncertainty
Capture of sound by microphone, how phones produce sounds, which words make phones, etc.
Solution: Probabilistic model
P(words | signal)
= P(signal | words) P(words) / P(signal)
Idea: Maximize P(signal | words) P(words)
P(signal | words): acoustic model; P(words): language model
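
Because P(signal) is the same for every candidate word sequence, decoding only needs to compare P(signal | words) P(words). A minimal sketch of that comparison follows; the two candidates come from the segmentation example in this deck, but the probability values are invented purely for illustration. Log probabilities are used so the products become sums.

```python
import math

def decode(candidates, acoustic, language):
    """Pick the word sequence maximizing P(signal | words) * P(words)."""
    return max(
        candidates,
        key=lambda w: math.log(acoustic[w]) + math.log(language[w]),
    )

# Hypothetical scores: the acoustic model slightly prefers the wrong
# segmentation, but the language model strongly prefers the right one.
acoustic = {"recognize speech": 0.4, "wreck a nice beach": 0.5}
language = {"recognize speech": 1e-4, "wreck a nice beach": 1e-7}

best_words = decode(list(acoustic), acoustic, language)
print(best_words)
```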

33
Language Model

Idea: some utterances are more probable
Standard solution: n-gram model
Typically tri-gram: P(w_i | w_i-1, w_i-2)
Collect training data
Smooth with bi- & uni-grams to handle sparseness
Product over words in utterance
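
One common way to smooth a trigram with bigrams and unigrams is linear interpolation, sketched below. The slides do not say which smoothing scheme is meant, so interpolation is an assumption here, as are the weights `lambdas`, the toy corpus, and the function names.

```python
from collections import Counter

def train(tokens):
    """Collect unigram, bigram, and trigram counts from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interpolated_prob(w, prev2, prev1, counts, lambdas=(0.6, 0.3, 0.1)):
    """P(w | prev2, prev1) as a weighted mix of tri-, bi-, and unigram estimates."""
    uni, bi, tri = counts
    p_tri = tri[(prev2, prev1, w)] / bi[(prev2, prev1)] if bi[(prev2, prev1)] else 0.0
    p_bi = bi[(prev1, w)] / uni[prev1] if uni[prev1] else 0.0
    p_uni = uni[w] / sum(uni.values())
    l3, l2, l1 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

tokens = "i scream for ice cream i scream".split()
counts = train(tokens)

p_seen = interpolated_prob("for", "i", "scream", counts)
p_unseen = interpolated_prob("cream", "for", "scream", counts)
print(round(p_seen, 4), round(p_unseen, 4))
```

The unseen trigram still receives a small non-zero probability from the unigram term, which is the point of backing the trigram with lower-order estimates.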

34
Acoustic Model

P(signal | words)
words -> phones, phones -> vector quantization
Words -> phones
Pronunciation dictionary lookup
Multiple pronunciations?
Probability distribution
Dialect variation: tomato
Coarticulation
Product along path

[Figure: pronunciation network with arc probabilities 0.5, 0.5, 0.5, 0.2, 0.5, 0.8]
35
Pronunciation Example

Observations 0/1

36
Acoustic Model

P(signal | phones)
Problem: Phones can be pronounced differently
Speaker differences, speaking rate, microphone
Phones may not even appear, different contexts
Observation sequence is uncertain
Solution: Hidden Markov Models
1) Hidden => Observations uncertain
2) Probability of word sequences => State transition probabilities
3) 1st-order Markov => use 1 prior state

37
Acoustic Model

3-state phone model for m
Use Hidden Markov Model (HMM)
Probability of sequence: sum of prob of paths

[Figure: 3-state phone HMM for m.
Transition probabilities: 0.3, 0.9, 0.4 and 0.7, 0.1, 0.6
Observation probabilities: C3 0.3, C5 0.1, C6 0.4; C1 0.5, C3 0.2, C4 0.1; C2 0.2, C4 0.7, C6 0.5]
38
ASR Training

Models to train
Language model: typically tri-gram
Observation likelihoods: B
Transition probabilities: A
Pronunciation lexicon: sub-phone, word
Training materials
Speech files with word transcription
Large text corpus
Small phonetically transcribed speech corpus

39
Training

Language model
Uses large text corpus to train n-grams
500 M words
Pronunciation model
HMM state graph
Manual coding from dictionary
Expand to triphone context and sub-phone models

40
HMM Training

Training the observations
E.g. Gaussian: set uniform initial mean/variance
Train based on contents of small (e.g. 4 hr) phonetically labeled speech set (e.g. Switchboard)
Training A & B
Forward-Backward algorithm training

41
Does it work?

Yes
99% on isolated single digits
95% on restricted short utterances (air travel)
89% professional news broadcast
No
77% Conversational English
67% Conversational Mandarin (CER)
55% Meetings
?? Noisy cocktail parties

42
N-grams

Perspective
Some sequences (words/chars) are more likely than
others
Given sequence, can guess most likely next
Used in
Speech recognition
Spelling correction
Augmentative communication
Other NL applications

43
Probabilistic Language Generation

Coin-flipping models
A sentence is generated by a randomized algorithm
The generator can be in one of several states
Flip coins to choose the next state.
Flip other coins to decide which letter or word
to output
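
The coin-flipping generator can be sketched as a character-level model in Python: the current state is the last `order` characters, and the "coin flip" is a random draw from the characters that followed that state in some training text. The training string, the function names, and the fixed random seed are illustrative choices, not part of the original slides.

```python
import random
from collections import defaultdict

def build_model(text, order):
    """Map each length-`order` context to the characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, order, length, seed_text, rng):
    """Extend seed_text one random character at a time, up to `length`."""
    out = seed_text
    while len(out) < length:
        context = out[len(out) - order:]
        followers = model.get(context)
        if not followers:  # context never seen with a continuation: stop
            break
        out += rng.choice(followers)  # the "coin flip"
    return out

text = "THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER"
model = build_model(text, order=2)
sample = generate(model, 2, 40, "TH", random.Random(0))
print(sample)
```

Raising `order` makes the output look more like the training text, since each draw is conditioned on a longer history.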

44
Shannon's Generated Language

1. Zero-order approximation
XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD
QPAAMKBZAACIBZLHJQD
2. First-order approximation
OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI
ALHENHTTPA OOBTTVA NAH RBL
3. Second-order approximation
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY
ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN
ANDY TOBE SEACE CTISBE

45
Shannon's Word Models

1. First-order approximation
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME
CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE
TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE
MESSAGE HAD BE THESE
2. Second-order approximation
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE
TIME OF WHO EVER TOLD THE PROBLEM FOR AN
UNEXPECTED

46
Corpus Counts