Hidden Markov Models: Probabilistic Reasoning Over Time
1
Hidden Markov Models: Probabilistic Reasoning Over Time
  • Artificial Intelligence
  • CMSC 25000
  • February 26, 2008

2
Agenda
  • Hidden Markov Models
  • Uncertain observation
  • Temporal Context
  • Recognition: Viterbi
  • Training the model: Baum-Welch
  • Speech Recognition
  • Framing the problem: Sounds to Sense
  • Speech Recognition as Modern AI

3
Modelling Processes over Time
  • Infer an underlying state sequence from what is observed
  • Issue: New states depend on preceding states
  • Analyzing sequences
  • Problem 1: Possibly unbounded probability tables
  • Observation × State × Time
  • Solution 1: Assume a stationary process
  • Rules governing the process are the same at all times
  • Problem 2: Possibly unbounded parents
  • Markov assumption: Only consider a finite history
  • Common: 1st- or 2nd-order Markov (depend on the last state or two)

4
Hidden Markov Models (HMMs)
  • An HMM consists of:
  • 1) A set of states
  • 2) A set of transition probabilities
  • where aij is the probability of the transition qi → qj
  • 3) Observation probabilities
  • the probability of observing ot in state i
  • 4) An initial probability distribution over states
  • the probability of starting in state i
  • 5) A set of accepting (final) states
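As a concrete aside (not on the original slide), the five components above map directly onto a small data structure; a minimal Python sketch, with illustrative field names:

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class HMM:
        states: List[str]                     # 1) set of states
        trans: Dict[Tuple[str, str], float]   # 2) aij = P(qi -> qj)
        emit: Dict[Tuple[str, str], float]    # 3) bi(ot) = P(ot | state i)
        init: Dict[str, float]                # 4) initial distribution over states
        final: List[str]                      # 5) accepting (final) states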

5
Three Problems for HMMs
  • Find the probability of an observation sequence
    given a model
  • Forward algorithm
  • Find the most likely path through a model given
    an observed sequence
  • Viterbi algorithm (decoding)
  • Find the most likely model (parameters) given an
    observed sequence
  • Baum-Welch (EM) algorithm

6
Bins and Balls Example
  • Assume there are two bins filled with red and
    blue balls. Behind a curtain, someone selects a
    bin and then draws a ball from it (and replaces
    it). They then select either the same bin or the
    other one and draw another ball.
  • (Example due to J. Martin)

7
Bins and Balls Example
[Figure: two-state transition diagram. Bin 1 → Bin 1: 0.6, Bin 1 → Bin 2: 0.4,
Bin 2 → Bin 2: 0.7, Bin 2 → Bin 1: 0.3]
8
Bins and Balls
  • π: Bin 1: 0.9, Bin 2: 0.1
  • A (transition probabilities):
           Bin 1   Bin 2
    Bin 1   0.6     0.4
    Bin 2   0.3     0.7
  • B (observation probabilities):
           Bin 1   Bin 2
    Red     0.7     0.4
    Blue    0.3     0.6
9
Bins and Balls
  • Assume the observation sequence
  • Blue Blue Red (BBR)
  • Both bins have Red and Blue
  • Any state sequence could have produced the observations
  • However, they are NOT equally likely
  • Big difference in start probabilities
  • Observation depends on state
  • State depends on prior state

10
Bins and Balls
  • Blue Blue Red

1 1 1: (0.9 × 0.3)(0.6 × 0.3)(0.6 × 0.7) = 0.0204
1 1 2: (0.9 × 0.3)(0.6 × 0.3)(0.4 × 0.4) = 0.0078
1 2 1: (0.9 × 0.3)(0.4 × 0.6)(0.3 × 0.7) = 0.0136
1 2 2: (0.9 × 0.3)(0.4 × 0.6)(0.7 × 0.4) = 0.0181
2 1 1: (0.1 × 0.6)(0.3 × 0.3)(0.6 × 0.7) = 0.0023
2 1 2: (0.1 × 0.6)(0.3 × 0.3)(0.4 × 0.4) = 0.0009
2 2 1: (0.1 × 0.6)(0.7 × 0.6)(0.3 × 0.7) = 0.0053
2 2 2: (0.1 × 0.6)(0.7 × 0.6)(0.7 × 0.4) = 0.0071
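As an illustration (not part of the original slides), a few lines of Python reproduce this table by brute-force enumeration; summing the eight products answers Problem 1, and taking the maximum answers Problem 2. Variable names are illustrative.

    from itertools import product

    pi = {1: 0.9, 2: 0.1}                                     # initial probabilities
    A = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.3, (2, 2): 0.7}  # transitions
    B = {(1, 'R'): 0.7, (1, 'B'): 0.3, (2, 'R'): 0.4, (2, 'B'): 0.6}  # emissions
    obs = ['B', 'B', 'R']                                     # Blue Blue Red

    def path_prob(path):
        # P(path, obs) = (pi * emission), then (transition * emission) per step
        p = pi[path[0]] * B[(path[0], obs[0])]
        for t in range(1, len(obs)):
            p *= A[(path[t - 1], path[t])] * B[(path[t], obs[t])]
        return p

    paths = list(product([1, 2], repeat=len(obs)))
    print(sum(path_prob(p) for p in paths))     # Problem 1: ~0.0754
    print(max(paths, key=path_prob))            # Problem 2: (1, 1, 1), prob ~0.0204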
11
Answers and Issues
  • Here, to compute the probability of the observed sequence:
  • Just add up all the state-sequence probabilities
  • To find the most likely state sequence:
  • Just pick the sequence with the highest value
  • Problem: Computing all paths is expensive
  • 2T·N^T operations
  • Solution: Dynamic programming
  • Sweep across all states at each time step
  • Summing (Problem 1) or Maximizing (Problem 2)

12
Forward Probability
αt(j) = Σi=1..N αt-1(i) · aij · bj(ot)
where α is the forward probability, t is the time in the utterance, i and j are states
in the HMM, aij is the transition probability, bj(ot) is the probability of observing
ot in state j, N is the number of states, and T is the last time step.
13
Forward Algorithm
  • Idea: a matrix where each cell forward[t, j] represents the probability of being
    in state j after seeing the first t observations.
  • Each cell expresses the probability forward[t, j] = P(o1, o2, ..., ot, qt = j | λ)
  • qt = j means "the t-th state in the sequence of states is state j"
  • Compute the probability by summing over extensions of all paths leading to the
    current cell.
  • An extension of a path from a state i at time t-1 to state j at t is computed by
    multiplying together: (i) the previous path probability from the previous cell,
    forward[t-1, i]; (ii) the transition probability aij from previous state i to
    current state j; and (iii) the observation likelihood bj(ot) that current state j
    matches observation symbol ot.

14
Forward Algorithm
  • Function Forward(observations of length T, state-graph) returns observation-probability
  • num-states ← num-of-states(state-graph)
  • Create a path probability matrix forward[num-states+2, T+2]
  • forward[0, 0] ← 1.0
  • for each time step t from 0 to T do
  • for each state s from 0 to num-states do
  • for each transition s' from s in state-graph
  • new-score ← forward[s, t] × a[s, s'] × bs'(ot)
  • forward[s', t+1] ← forward[s', t+1] + new-score
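A runnable sketch of the same recursion in Python/NumPy (not from the slides; states are indexed 0..N-1 instead of using the pseudocode's extra start/end rows, and names are illustrative):

    import numpy as np

    def forward(obs, pi, A, B):
        """P(obs | model): pi is (N,), A is (N, N), B is (N, M), obs holds symbol indices."""
        N, T = len(pi), len(obs)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                 # initialization
        for t in range(1, T):
            # alpha[t, j] = sum_i alpha[t-1, i] * A[i, j] * B[j, obs[t]]
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        return alpha[-1].sum()

    # Bins-and-balls check: states (Bin 1, Bin 2), observations (Red = 0, Blue = 1)
    pi = np.array([0.9, 0.1])
    A = np.array([[0.6, 0.4], [0.3, 0.7]])
    B = np.array([[0.7, 0.3], [0.4, 0.6]])
    print(forward([1, 1, 0], pi, A, B))              # ~0.0754, matching the enumeration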

15
Viterbi Algorithm
  • Find the BEST sequence given the signal
  • Best: P(sequence | signal)
  • Take an HMM + an observation sequence
  • => sequence (and its probability)
  • Dynamic programming solution
  • Record the most probable path ending at each state i
  • Then the most probable path from i to the end
  • O(bMn)

16
Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states ← num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0, 0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score ← viterbi[s, t] × a[s, s'] × bs'(ot)
        if ((viterbi[s', t+1] = 0) or (viterbi[s', t+1] < new-score)) then
          viterbi[s', t+1] ← new-score
          back-pointer[s', t+1] ← s
  Backtrace from the highest-probability state in the final column of viterbi and
  return that path
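A matching Python/NumPy sketch of Viterbi (same indexing assumptions as the forward sketch above), returning the best state sequence together with its probability:

    import numpy as np

    def viterbi(obs, pi, A, B):
        """Most likely state sequence for obs, plus its joint probability."""
        N, T = len(pi), len(obs)
        v = np.zeros((T, N))                    # v[t, j]: best path prob ending in j at t
        back = np.zeros((T, N), dtype=int)      # back-pointers
        v[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = v[t - 1][:, None] * A      # scores[i, j] = v[t-1, i] * A[i, j]
            back[t] = scores.argmax(axis=0)
            v[t] = scores.max(axis=0) * B[:, obs[t]]
        path = [int(v[-1].argmax())]            # backtrace from the best final state
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return list(reversed(path)), v[-1].max()

    pi = np.array([0.9, 0.1])
    A = np.array([[0.6, 0.4], [0.3, 0.7]])
    B = np.array([[0.7, 0.3], [0.4, 0.6]])
    print(viterbi([1, 1, 0], pi, A, B))         # ([0, 0, 0], ~0.0204): Bin 1 three times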
17
Learning HMMs
  • Issue: Where do the probabilities come from?
  • Solution: Learn from data
  • Trains transition (aij) and emission (bj) probabilities
  • Typically assume the structure is given
  • Baum-Welch, aka the forward-backward algorithm
  • Iteratively estimate counts of transitions and emitted observations
  • Get estimated probabilities by forward computation
  • Divide probability mass over contributing paths

18
Learning HMMs
  • Issue: Where do the probabilities come from?
  • Supervised / manual construction
  • Solution: Learn from data
  • Trains transition (aij), emission (bj), and initial (πi) probabilities
  • Typically assume the state structure is given
  • Unsupervised:
  • Baum-Welch, aka the forward-backward algorithm
  • Iteratively estimate counts of transitions and emitted observations
  • Get estimated probabilities by forward computation
  • Divide probability mass over contributing paths

19
Manual Construction
  • Manually labeled data
  • Observation sequences, aligned to
  • Ground truth state sequences
  • Compute (relative) frequencies of state transitions
  • Compute frequencies of observations per state
  • Compute frequencies of initial states
  • Bootstrapping: iterate tag, correct, re-estimate, tag, ...
  • Problem:
  • Labeled data is expensive, hard or impossible to obtain, and may be inadequate to
    fully estimate the model
  • Sparseness problems
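A minimal Python sketch of this supervised counting (hypothetical helper; it assumes each observation sequence is aligned with a ground-truth state sequence):

    from collections import Counter

    def estimate_hmm(state_seqs, obs_seqs):
        """Relative-frequency estimates of initial, transition and emission probabilities."""
        init, trans, emit = Counter(), Counter(), Counter()
        for states, obs in zip(state_seqs, obs_seqs):
            init[states[0]] += 1
            trans.update(zip(states, states[1:]))    # (previous, current) state pairs
            emit.update(zip(states, obs))            # (state, observation) pairs
        out_of = Counter(i for (i, _j) in trans.elements())
        in_state = Counter(s for (s, _o) in emit.elements())
        pi = {s: c / len(state_seqs) for s, c in init.items()}
        A = {(i, j): c / out_of[i] for (i, j), c in trans.items()}
        B = {(s, o): c / in_state[s] for (s, o), c in emit.items()}
        return pi, A, B

    # Toy labeled data: two bin sequences with the ball colours they produced
    pi, A, B = estimate_hmm([[1, 1, 2], [2, 2, 2]], [['B', 'B', 'R'], ['B', 'R', 'R']])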

20
Less Supervised Learning
  • Re-estimation from unlabeled data
  • Baum-Welch, aka the forward-backward algorithm
  • Assume a representative collection of data
  • E.g. recorded speech, gene sequences, etc.
  • Assign initial probabilities
  • Or estimate them from a very small labeled sample
  • Compute state sequence probabilities given the data
  • I.e. use the forward algorithm
  • Update transition, emission, and initial probabilities

21
Updating Probabilities
  • Intuition:
  • Observations identify state sequences
  • Adjust the probabilities of transitions and emissions
  • Make them closer to those consistent with what was observed
  • Increase P(Observations | Model)
  • Functionally:
  • For each state i, what proportion of transitions from state i go to state j?
  • For each state i, what proportion of observations match O?
  • How often is state i the initial state?

22
Estimating Transitions
  • Consider updating transition aij
  • Compute the probability of all paths using aij
  • Compute the probability of all paths through i (with and without i → j)

[Figure: paths entering state i, with and without the transition on to state j]
23
Forward Probability
αt(j) = Σi=1..N αt-1(i) · aij · bj(ot)
where α is the forward probability, t is the time in the utterance, i and j are states
in the HMM, aij is the transition probability, bj(ot) is the probability of observing
ot in state j, N is the number of states, and T is the last time step.
24
Backward Probability
βt(i) = Σj=1..N aij · bj(ot+1) · βt+1(j)
where β is the backward probability, t is the time in the sequence, i and j are states
in the HMM, aij is the transition probability, bj(ot+1) is the probability of observing
ot+1 in state j, N is the number of states, and T is the last time step.
25
Re-estimating
  • Estimate transitions i → j
  • Estimate observations in state j
  • Estimate initial state probabilities πi
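For reference, the standard Baum-Welch (forward-backward) re-estimation formulas behind these three bullets, written in terms of the forward probabilities α and backward probabilities β defined above:

    \gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}
    \qquad
    \xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}

    \hat{\pi}_i = \gamma_1(i)
    \qquad
    \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
    \qquad
    \hat{b}_j(v_k) = \frac{\sum_{t\,:\,o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}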

26
Speech Recognition
  • Goal:
  • Given an acoustic signal, identify the sequence of words that produced it
  • Speech understanding goal:
  • Given an acoustic signal, identify the meaning intended by the speaker
  • Issues:
  • Ambiguity: many possible pronunciations, ...
  • Uncertainty: what signal, and what word/sense, produced this sound sequence

27
Decomposing Speech Recognition
  • Q1: What speech sounds were uttered?
  • Human languages: 40-50 phones
  • Basic sound units: b, m, k, ax, ey, ... (ARPAbet)
  • Distinctions categorical to speakers
  • Acoustically continuous
  • Part of knowledge of language
  • Build per-language inventory
  • Could we learn these?

28
Decomposing Speech Recognition
  • Q2: What words produced these sounds?
  • Look up sound sequences in a dictionary
  • Problem 1: Homophones
  • Two words, same sounds: too, two
  • Problem 2: Segmentation
  • No spaces between words in continuous speech
  • I scream / ice cream, Wreck a nice beach / Recognize speech
  • Q3: What meaning produced these words?
  • NLP (But that's not all!)

29
(No Transcript)
30
Signal Processing
  • Goal: Convert impulses from the microphone into a representation that
  • is compact
  • encodes features relevant for speech recognition
  • Compactness, step 1:
  • Sampling rate: how often we look at the data
  • 8 kHz, 16 kHz (44.1 kHz: CD quality)
  • Quantization factor: how much precision
  • 8-bit, 16-bit (encoding: µ-law, linear)

31
(A Little More) Signal Processing
  • Compactness: Feature identification
  • Capture mid-length speech phenomena
  • Typically frames of 10 ms (80 samples)
  • Overlapping
  • Vector of features, e.g. energy at some frequency
  • Vector quantization:
  • n-feature vectors => n-dimensional space
  • Divide into m regions (e.g. 256)
  • All vectors in a region get the same label, e.g. C256
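A toy NumPy sketch of the vector-quantization step (illustrative only; a real front end trains the codebook, e.g. with k-means, rather than drawing it at random):

    import numpy as np

    def quantize(frames, codebook):
        """Map each feature vector to the index of its nearest codebook vector.
        frames: (num_frames, n_features); codebook: (m, n_features), e.g. m = 256."""
        # squared Euclidean distance from every frame to every codebook entry
        d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)                  # one symbol in 0..m-1 per frame

    rng = np.random.default_rng(0)
    frames = rng.normal(size=(100, 13))          # e.g. 100 frames of 13 features each
    codebook = rng.normal(size=(256, 13))        # m = 256 regions
    symbols = quantize(frames, codebook)         # discrete observation sequence for the HMM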

32
Speech Recognition Model
  • Question: Given the signal, what words?
  • Problem: uncertainty
  • Capture of sound by the microphone, how phones produce sounds, which words make
    phones, etc.
  • Solution: Probabilistic model
  • P(words | signal) = P(signal | words) P(words) / P(signal)
  • Idea: Maximize P(signal | words) P(words)
  • P(signal | words): acoustic model; P(words): language model
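In equation form (the standard noisy-channel formulation; P(signal) can be dropped because it does not depend on the candidate word sequence W):

    \hat{W} = \operatorname*{argmax}_{W} P(W \mid O)
            = \operatorname*{argmax}_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
            = \operatorname*{argmax}_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}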

33
Language Model
  • Idea: some utterances are more probable than others
  • Standard solution: n-gram model
  • Typically trigram: P(wi | wi-1, wi-2)
  • Collect training data
  • Smooth with bi- and uni-grams to handle sparseness
  • Product over the words in the utterance

34
Acoustic Model
  • P(signal | words)
  • words → phones, phones → vector quantization
  • Words → phones:
  • Pronunciation dictionary lookup
  • Multiple pronunciations?
  • Probability distribution
  • Dialect variation: tomato
  • Coarticulation
  • Product along the path

[Figure: pronunciation network with branch probabilities 0.5/0.5, 0.5/0.5, and 0.2/0.8]
35
Pronunciation Example
  • Observations: 0/1

36
Acoustic Model
  • P(signal | phones)
  • Problem: Phones can be pronounced differently
  • Speaker differences, speaking rate, microphone
  • Phones may not even appear; different contexts
  • The observation sequence is uncertain
  • Solution: Hidden Markov Models
  • 1) Hidden => observations are uncertain
  • 2) Probability of word sequences => state transition probabilities
  • 3) 1st-order Markov => use 1 prior state

37
Acoustic Model
  • 3-state phone model for [m]
  • Use a Hidden Markov Model (HMM)
  • Probability of a sequence = sum of the probabilities of its paths

[Figure: 3-state phone model. Transition probabilities: self-loops 0.3, 0.9, 0.4 with
exit probabilities 0.7, 0.1, 0.6. Observation probabilities over VQ codebook symbols:
state 1: C1 0.5, C2 0.2, C3 0.3; state 2: C3 0.2, C4 0.7, C5 0.1;
state 3: C4 0.1, C6 0.4, C6 0.5]
38
ASR Training
  • Models to train:
  • Language model: typically trigram
  • Observation likelihoods: B
  • Transition probabilities: A
  • Pronunciation lexicon: sub-phone, word
  • Training materials:
  • Speech files + word transcriptions
  • Large text corpus
  • Small phonetically transcribed speech corpus

39
Training
  • Language model
  • Uses large text corpus to train n-grams
  • 500 M words
  • Pronunciation model
  • HMM state graph
  • Manual coding from dictionary
  • Expand to triphone context and sub-phone models

40
HMM Training
  • Training the observations:
  • E.g. Gaussians: set uniform initial means/variances
  • Train based on the contents of a small (e.g. 4-hour) phonetically labeled speech
    set (e.g. Switchboard)
  • Training A and B:
  • Forward-backward algorithm training

41
Does it work?
  • Yes:
  • 99% on isolated single digits
  • 95% on restricted short utterances (air travel)
  • 89% on professional news broadcasts
  • No:
  • 77% conversational English
  • 67% conversational Mandarin (CER)
  • 55% meetings
  • ?? noisy cocktail parties

42
N-grams
  • Perspective:
  • Some sequences (of words or characters) are more likely than others
  • Given a sequence, we can guess the most likely next item
  • Used in:
  • Speech recognition
  • Spelling correction
  • Augmentative communication
  • Other NL applications

43
Probabilistic Language Generation
  • Coin-flipping models
  • A sentence is generated by a randomized algorithm
  • The generator can be in one of several states
  • Flip coins to choose the next state.
  • Flip other coins to decide which letter or word
    to output

44
Shannon's Generated Language
  • 1. Zero-order approximation
  • XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD
    QPAAMKBZAACIBZLHJQD
  • 2. First-order approximation
  • OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI
    ALHENHTTPA OOBTTVA NAH RBL
  • 3. Second-order approximation
  • ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY
    ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN
    ANDY TOBE SEACE CTISBE

45
Shannon's Word Models
  • 1. First-order approximation
  • REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME
    CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE
    TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE
    MESSAGE HAD BE THESE
  • 2. Second-order approximation
  • THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
    WRITER THAT THE CHARACTER OF THIS POINT IS
    THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE
    TIME OF WHO EVER TOLD THE PROBLEM FOR AN
    UNEXPECTED

46
Corpus Counts
  • Estimate probabilities by counts in large
    collections of text/speech
  • Issues:
  • Wordforms (surface) vs. lemmas (roots)
  • Case? Punctuation? Disfluency?
  • Types (distinct words) vs. tokens (total)

47
Basic N-grams
  • Most trivial: 1/(number of tokens), too simple!
  • Standard: unigram frequency
  • word occurrences / total corpus size
  • E.g. the: 0.07, rabbit: 0.00001
  • Too simple: no context!
  • Conditional probabilities of word sequences

48
Markov Assumptions
  • Exact computation requires too much data
  • Approximate the probability given all prior words
  • Assume a finite history
  • Bigram: probability of a word given the 1 previous word
  • First-order Markov
  • Trigram: probability of a word given the 2 previous words
  • N-gram approximation

Bigram sequence: P(w1, ..., wn) ≈ Πk P(wk | wk-1)
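A minimal Python sketch of bigram estimation by relative frequency, i.e. the count of a sequence divided by the count of its prefix as on the next slide (names are illustrative; smoothing is omitted):

    from collections import Counter

    def bigram_model(sentences):
        """MLE bigram probabilities P(wk | wk-1) from tokenized sentences."""
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ['<s>'] + sent + ['</s>']
            unigrams.update(tokens[:-1])               # prefix counts
            bigrams.update(zip(tokens, tokens[1:]))    # sequence counts
        return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

    P = bigram_model([['the', 'rabbit', 'ran'], ['the', 'dog', 'ran']])
    print(P[('the', 'rabbit')])    # 0.5: 'rabbit' follows 'the' in 1 of 2 cases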
49
Issues
  • Relative frequency:
  • Typically compute the count of the sequence
  • Divide by the count of its prefix
  • Corpus sensitivity:
  • Shakespeare vs. Wall Street Journal
  • Very unnatural
  • N-grams:
  • Unigrams: little; bigrams: collocations; trigrams: phrases

50
Evaluating n-gram models
  • Entropy and perplexity
  • Information-theoretic measures
  • Measure the information in a grammar, or its fit to data
  • Conceptually, a lower bound on the bits needed to encode
  • Entropy: H(X), where X is a random variable and p its probability function
  • E.g. 8 things: numbering them as a code => 3 bits per transmission
  • Alternatively: short codes for high-probability items, longer codes for low-probability ones
  • Can reduce the average code length
  • Perplexity:
  • Weighted average of the number of choices
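The definitions behind these bullets (standard information-theoretic formulas):

    H(X) = -\sum_{x} p(x) \log_2 p(x)
    \qquad
    \mathrm{Perplexity}(W) = 2^{H(W)} = P(w_1 \dots w_N)^{-1/N}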

51
Computing Entropy
  • Picking horses (Cover and Thomas)
  • Send a message to identify the horse, 1 of 8
  • If all horses are equally likely, p(i) = 1/8
  • Some horses are more likely:
  • 1: ½, 2: ¼, 3: 1/8, 4: 1/16, 5,6,7,8: 1/64 each
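Working the numbers through: an optimal variable-length code needs 2 bits per message on average, versus 3 bits in the uniform case:

    H = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8
        + \tfrac{1}{16}\log_2 16 + 4 \cdot \tfrac{1}{64}\log_2 64
      = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{3}{8} + \tfrac{1}{4} + \tfrac{3}{8}
      = 2 \text{ bits}
    \qquad\text{vs.}\qquad
    H_{\text{uniform}} = \log_2 8 = 3 \text{ bits}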

52
Entropy of a Sequence
  • Basic sequence
  • Entropy of a language: defined over infinite-length sequences
  • Assume it is stationary and ergodic

53
Cross-Entropy
  • Comparing models:
  • The actual distribution is unknown
  • Use a simplified model to estimate it
  • A closer match will have lower cross-entropy
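In symbols (the standard definition; the per-word approximation over a long test sample is what the perplexity comparison on the next slide relies on):

    H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1 \dots w_n} p(w_1 \dots w_n) \log_2 m(w_1 \dots w_n)
            \;\approx\; -\frac{1}{n} \log_2 m(w_1 \dots w_n),
    \qquad H(p) \le H(p, m)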

54
Perplexity Model Comparison
  • Compare models with different history lengths
  • Train models:
  • 38 million words of Wall Street Journal text
  • Compute perplexity on a held-out test set
  • 1.5 million words (20K-word vocabulary, smoothed)

  N-gram order   Perplexity
  Unigram        962
  Bigram         170
  Trigram        109

55
Entropy of English
  • Shannon's experiment:
  • Subjects guess strings of letters; count the guesses
  • Entropy of the guess sequence = entropy of the letter sequence
  • 1.3 bits (restricted text)
  • Build a stochastic model on text and compute:
  • Brown et al. computed a trigram model on a varied corpus
  • Compute the (per-character) entropy of the model
  • 1.75 bits

56
Speech Recognition as Modern AI
  • Draws on a wide range of AI techniques
  • Knowledge representation and manipulation
  • Optimal search: Viterbi decoding
  • Machine learning:
  • Baum-Welch for HMMs
  • Nearest neighbor and k-means clustering for signal identification
  • Probabilistic reasoning / Bayes' rule:
  • Manage uncertainty in signal, phone, and word mappings
  • Enables real-world applications