Title: Articulatory Feature-Based Speech Recognition
1. Articulatory Feature-Based Speech Recognition
Team update, August 9, 2006
2. WS06 AFSR update, in brief
- Audio-only (SVitchboard) work
  - Comparison of AF-based observation models in phone-based systems (Ozgur, Arthur, Simon)
  - Implementation of AF-based pronunciation models (Chris, Arthur, Nash, Lisa, Bronwyn)
- Audio-visual (CUAVE) work
  - Implementation and testing of several phoneme-viseme and AF-based models (Partha, Ozgur, Mark, Karen)
- Other
  - Tying of articulatory states (Partha, Chris)
  - Generation of forced feature alignments (Ari, Steve)
3. Observation modeling
- Comparison of observation models in a phone-based system
- [Diagrams: three observation models (fully generative, hybrid, tandem), each with a phoneState variable over articulatory feature variables (dg1, pl1, ..., rd). Fully generative: the features generate PLPs. Hybrid: each feature receives virtual evidence from an MLP, p(f|o), or p(o|f) = p(f|o) p(o) / p(f). Tandem: phoneState generates PLPs appended with KLT-transformed log MLP outputs.]
4. (Mostly) observation modeling: Hybrid models (Simon)
- Deterministic phoneState-to-feature mapping
  - p(dg1 | phoneState) = 1 if dg1 is phoneState's canonical value
- Non-deterministic mapping
  - p(dg1 | phoneState) = learned distribution (see the sketch below)
- Hybrid + PLP
- [Diagrams: phoneState over feature variables (dg1, pl1, ..., rd), each with a virtual-evidence observation (o_dg1, o_pl1, ..., o_rd); in the hybrid + PLP variant, phoneState also generates PLPs.]
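As a concrete illustration of the two mappings above, here is a minimal numpy sketch (not the actual GMTK setup) contrasting a deterministic CPT, which puts probability 1 on each phoneState's canonical feature value, with a dense CPT of the kind that would be learned; the phone and feature value names are hypothetical placeholders.

```python
# Illustrative sketch: deterministic vs. learned CPTs for p(dg1 | phoneState).
# Phone names and feature values are placeholders, not the actual inventory.
import numpy as np

phone_states = ["t_1", "t_2", "s_1"]          # hypothetical phoneState values
dg1_values = ["closure", "critical", "wide"]  # hypothetical dg1 (degree) values

# Deterministic mapping: all mass on the phone's canonical feature value.
canonical = {"t_1": "closure", "t_2": "closure", "s_1": "critical"}
det_cpt = np.zeros((len(phone_states), len(dg1_values)))
for i, ph in enumerate(phone_states):
    det_cpt[i, dg1_values.index(canonical[ph])] = 1.0

# Non-deterministic mapping: a dense CPT that would be learned (e.g. by EM);
# here it is just initialized near the canonical values and renormalized.
rng = np.random.default_rng(0)
learned_cpt = det_cpt + 0.1 * rng.random(det_cpt.shape)
learned_cpt /= learned_cpt.sum(axis=1, keepdims=True)

print(det_cpt)
print(learned_cpt.round(2))
```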
5. Hybrid observation models (Simon)
- Requires tuning relative weights of different MLPs
- Non-deterministic model is slow to train with dense CPTs. Instead:
- Recipe 1
  - Train on 1000 utterances for 2 iterations
  - Make the dense CPTs more sparse by zeroing all entries less than 0.1 (sketched below)
  - Using these parameters, run the genetic triangulation script to find a fast triangulation, given this particular sparsity of the DCPTs
  - Starting from these parameters, train to 0.5 tolerance (takes 8 iterations) on the full training set
  - Find a decoding graph triangulation using the final trained parameters
- Recipe 2
  - Using a faster triangulation, train the model with fully dense CPTs
  - Make the CPTs sparse by zeroing all entries less than 0.1
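The sparsification step in both recipes is simple enough to sketch; the following numpy function is an illustration of the idea (zero CPT entries below 0.1 and renormalize rows), not the GMTK tooling or file format.

```python
# Sketch of CPT sparsification: zero all entries below a threshold,
# then renormalize each row so it remains a valid distribution.
import numpy as np

def sparsify_cpt(cpt, threshold=0.1):
    """Zero small entries of a row-stochastic CPT and renormalize rows."""
    sparse = np.where(cpt < threshold, 0.0, cpt)
    # Guard against rows entirely below threshold: keep their largest entry.
    empty = sparse.sum(axis=1) == 0
    sparse[empty, cpt[empty].argmax(axis=1)] = 1.0
    return sparse / sparse.sum(axis=1, keepdims=True)

cpt = np.array([[0.05, 0.60, 0.35],
                [0.08, 0.09, 0.83]])
print(sparsify_cpt(cpt))
```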
6. Tandem models (Ozgur, Arthur)
- [Diagrams: partially factored model vs. fully factored model (not yet implemented), involving a phoneState variable, per-feature variables (pl1, rd, dg1, ...), PLPs, KLT-transformed log MLP outputs, and the log outputs of separate MLPs.]
7. Tandem observation models (Ozgur, Arthur)
- Fisher: MLPs trained on 1776 hours of Fisher data
- SVB: MLPs trained on SVitchboard data only
- Phone MLP: tandem system using a phone MLP classifier (trained on SVitchboard) instead of feature MLPs
8. Reminder: phone-based models (Ozgur, Chris)
- [DBN diagram unrolled over frame 0, frame i, and the last frame]
- Variable names and values:
  - word: one, two, ...
  - wordTransition: 0, 1
  - subWordState: 0, 1, 2, ...
  - stateTransition: 0, 1
  - phoneState: w1, w2, w3, s1, s2, s3, ...
  - observation
- (Note: missing pronunciation variants)
9. Pronunciation modeling (Arthur, Chris, Nash, Lisa, Bronwyn)
- [Diagram: word and wordTransition variables drive two feature streams, L and T, each with its own wordTransition, subWordState, stateTransition, and phoneState variables; the streams are coupled by an async variable.]
- (Differences from the actual model: 3rd feature stream, pronunciation variants, word transition bookkeeping)
10. Pronunciation models (Chris, Arthur)
- Fisher: MLPs trained on 1776 hours of Fisher data
- SVB: MLPs trained on SVitchboard data only
- Phone MLP: tandem system using a phone MLP classifier (trained on SVitchboard) instead of feature MLPs
11. Summary of selected experiments
- (Note: some models still being tuned)
12. Audio-only experiments: ongoing
- State tying for AF-based models (Chris)
- Factored tandem models (Arthur)
- Combining LTG-based pronunciation models with hybrid observation models (Simon, Steve)
- Articulatory substitution modeling (Bronwyn)
- Part-of-speech dependent asynchrony (Lisa)
- Cross-word asynchrony (Nash)
13. Audio-visual models (Partha, Ozgur, Mark, Karen)
- [Diagrams: synchronous phoneme-viseme model (one phoneState generating both obsA and obsV) alongside audio-only/video-only models (phoneState generating a single obs).]
14. Asynchronous phoneme-viseme model with an asynchrony variable
- Analogous to AF-based model with asynchrony variables
- [Diagram: audio stream (subWordStateA, phoneStateA, obsA) and video stream (subWordStateV, phoneStateV, obsV) coupled by an async variable.]
15. Phoneme-viseme model with coupled HMM-based asynchrony
- [Diagram: audio stream (subWordStateA, stateTransitionA, phoneStateA, obsA) and video stream (subWordStateV, stateTransitionV, phoneStateV, obsV).]
16. AF-based model with asynchrony variables
- [Diagram: feature streams L (subWordStateL, phoneStateL) and T (subWordStateT, phoneStateT) coupled by an async variable, generating obsA and obsV.]
17. CUAVE experimental setup
- Training on clean data; number of Gaussians tuned on clean dev set
- Audio/video weights tuned on noise-specific dev sets (weighting sketched below)
- Uniform language model
- Decoding constrained to 10-word utterances (avoids language model scale/penalty tuning)
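For reference, the audio/video weighting being tuned here is typically an exponent weight on the per-stream scores; a minimal sketch under that assumption (the function name and weight values are illustrative, not the workshop scripts):

```python
# Illustrative stream weighting: combine per-frame audio and video
# log-likelihoods with an exponent weight w, i.e. w*logP_A + (1-w)*logP_V.
import numpy as np

def combine_streams(loglik_audio, loglik_video, audio_weight):
    """Weighted per-frame log score for an audio-visual state."""
    return audio_weight * loglik_audio + (1.0 - audio_weight) * loglik_video

# Tuning would sweep the weight on a noise-specific dev set and keep the
# value with the lowest word error rate.
loglik_audio = np.array([-4.2, -3.9, -5.1])
loglik_video = np.array([-6.0, -5.5, -4.8])
for w in (0.3, 0.5, 0.7):
    print(w, combine_streams(loglik_audio, loglik_video, w))
```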
18. CUAVE selected development set results
19. Audio-visual experiments: ongoing
- AF models with CHMM-based asynchrony (Mark)
- State tying for AF models (Partha)
- Cross-word asynchrony (Partha)
- Multi-rate modeling (Ozgur)
- Stream weighting by framewise rejection modeling (Ozgur)
20. Other ongoing work
- Generation of forced feature alignments and analysis tool (Ari)
- Embedded training of MLPs (Simon, Ari, Joe Frankel (thanks!))
- Analysis of recognition outputs (Lisa)
- Structure learning (Steve)
21. Summary
- Tandem observation models are the most promising so far
  - But also the simplest... much work to be done on the other models
- Monofeat and hybrid models approaching monophone models in performance
- Main challenges for the new models
  - Speed/memory → tuning of triangulations and pruning parameters
  - Tuning parameters differ widely across models → lots of cross-validation decoding runs
  - For asynchronous structures, low-occupancy states → tying
- In the next week
  - Wrap-up of ongoing experiments
  - Testing on final test sets
  - Analysis of decoding outputs
  - Combination of the most promising directions
24. Project outline: Multistream AF-based models
25. Project outline: Asynchrony modeling
- Coupled HMMs and variations
26. Project outline: Asynchrony modeling (2)
- Instantaneous asynchrony constraints over feature subsets
- Single asynchrony constraint over all features
- Also: cross-word asynchrony, context-dependent asynchrony
27. Project outline: Reduction/substitution modeling
- Similar to pronunciation modeling in phone-based recognition, but on a per-stream/per-frame basis
- Context-independent vs. context-dependent feature substitutions
- What kind of context?
  - Phonetic
  - Articulatory
  - Higher-level: speaker, speaking rate, dialect, ...
28. Manual feature transcriptions (Xuemin Chi, Lisa Lavoie, Karen)
- Purpose: testing of AF classifiers and automatic alignments
- Main transcription guidelines
  - Should contain enough information for the speaker to reproduce the acoustics (up to 20 ms shifts in boundaries)
  - Should correspond to what we would like our AF classifiers to detect
29. Manual feature transcriptions (Xuemin Chi, Lisa Lavoie, Karen)
- Details
  - 2 transcribers: a phonetician and a PhD student in the speech group
  - 78 SVitchboard utterances
  - 9 utterances from the Switchboard Transcription Project for comparison
  - Multipass transcription using WaveSurfer (KTH)
    - 1st pass: phone-feature hybrid
    - 2nd pass: all-feature
    - 3rd pass: discussion, error correction
- Transcription speed
  - 623 x RT for 1st pass
  - 335 x RT for 2nd pass
- Why a phone-feature hybrid in the 1st pass?
  - In a preliminary test, > 2x slower to do all-feature transcription in the 1st pass
  - Transcribers found the all-feature format very tedious
30. Manual feature transcriptions: analysis (Nash, Lisa, Ari)
- How does the multipass strategy affect agreement?
- How well do transcribers agree?
- How does agreement compare with phonetic
transcriptions in STP?
31. Models implemented so far...
- Phone-based models
- Non-classifier-based AF models
- Tandem models
- Hybrid model
32. Phone-based models: monophone, word-internal triphone (Ozgur, Chris)
- [DBN diagram unrolled over frame 0, frame i, and the last frame]
- Variable names and values:
  - word: one, two, ...
  - wordTransition: 0, 1
  - subWordState: 0, 1, 2, ...
  - stateTransition: 0, 1
  - phoneState: w1, w2, w3, s1, s2, s3, ...
  - observation
- (Note: missing pronunciation variants)
33. What are hybrid models?
- Conventional HMMs generate observations via a likelihood p(O | state) or p(O | class) using a mixture of Gaussians
- Hybrid models use another classifier (typically an MLP) to obtain the posterior P(class | O)
- Dividing by the prior gives a (scaled) likelihood, which can be used directly in the HMM: no Gaussians required (see the sketch below)
- Advantages of hybrid models include
  - Can easily train the classifier discriminatively
  - Once trained, MLPs compute P(class | O) relatively fast
  - MLPs can use a long window of acoustic input frames
  - MLPs don't require the input feature distribution to have diagonal covariance (e.g. can use filterbank outputs from computational auditory scene analysis front-ends)
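A minimal sketch of that posterior-to-likelihood conversion; the array shapes and prior values are illustrative, and this is not the workshop's actual decoder code.

```python
# Turn frame-level MLP posteriors P(class | O) into scaled log-likelihoods
# by subtracting log class priors: log P(O | class) + const = log P(class | O) - log P(class).
import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """posteriors: (frames, classes) MLP outputs; priors: (classes,)."""
    post = np.clip(posteriors, floor, 1.0)
    return np.log(post) - np.log(priors)

posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])  # e.g. class frequencies in training data
print(scaled_log_likelihoods(posteriors, priors))
```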
34. Configurations
- Standard hybrid
  - Train an MLP to classify phonemes, frame by frame
  - Decode the MLP output using simple HMMs (transition probabilities are easily derived from phone duration statistics; they don't even need to be trained)
- Standard tandem
  - Instead of using the MLP output to directly obtain the likelihood, use it as a feature vector, after some transformations (e.g. taking logs) and dimensionality reduction (sketched below)
  - Append the resulting features to standard features, e.g. PLPs or MFCCs
  - Use this vector as the observation for a standard HMM with a mixture-of-Gaussians observation model
  - Currently used in state-of-the-art systems such as SRI's
- Our configuration
  - Use ANNs to classify articulatory features instead of phones
  - 8 MLPs, classifying pl1, dg1, etc. frame by frame
35. In other news...
- gmtkTie
- Linking pronunciation and observation modeling
- Structure learning
- Other ongoing/planned work
36. gmtkTie (Simon)
- General parameter clustering and tying tool for GMTK
- Currently the most developed parts
  - Decision-tree clustering of Gaussians, using the same technique as HTK
  - Bottom-up agglomerative clustering (sketched below)
- gmtkTie is more general than HTK
  - HTK asks questions about previous/next phone identity
  - HTK clusters states only within the same phone
  - gmtkTie can ask user-supplied questions about user-supplied features: no assumptions about states, triphones, or anything else
  - gmtkTie clusters user-defined groups of parameters, not just states
  - gmtkTie can compute cluster sizes and centroids in 101 different ways (approx.)
- Will be stress-tested on various observation models using Gaussians
- Can tie based on the values of any variables in the graph, not just the phone state (e.g. feature values)
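For illustration, a minimal bottom-up agglomerative clustering over Gaussian mean vectors, assuming Euclidean centroid distance and a target cluster count; gmtkTie itself offers many more cluster-size and centroid options than this sketch.

```python
# Greedy bottom-up agglomerative clustering of Gaussian means.
import numpy as np

def agglomerate(means, n_clusters):
    """Repeatedly merge the two clusters with the closest centroids."""
    clusters = [[i] for i in range(len(means))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(means[clusters[a]].mean(axis=0)
                                   - means[clusters[b]].mean(axis=0))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters

means = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.2]])
print(agglomerate(means, 2))  # -> [[0, 1], [2, 3]]
```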
37. Linking pronunciation and observation models
- The pronunciation model generates, from the words, three feature streams: L, T, G
- The observation model starts with 8 features (pl1, dg1, etc.) and generates acoustics (possibly via MLP posteriors or tandem features)
- How to link L, T, G with the 8 features?
  - Deterministic mapping
  - Learned mapping
    - Dense conditional probability table (CPT)
    - Sparse CPT
  - Link them by choosing dependencies discriminatively?
38. Structure learning (Steve, Simon)
- In various parts of our models, we may want to learn which dependencies to include and which to omit
  - Between elements of the observation vector (similar to Bilmes 2001)
  - Between L, T, G and pl1, dg1, ...
  - Among pl1, dg1, ... (similar to King et al.)
- May not need all the dependencies that we think are necessary for a correct generative model
- Currently investigating Bilmes' EAR measure: I(X;Y | C) - I(X;Y) (sketched below)
  - EAR is formulated for X, Y observed; we plan to use forced alignment to generate observed values for hidden variables
- So far, computed EAR between pairs of individual values of dg1, pl1, etc.
  - Suggests adding links to a model with tandem observations
- Also computed EAR between pairs of features
  - Less promising results; needs further investigation
- Will also compute EAR on tandem observation features; these arcs can be implemented in GMTK as dlinks
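A plug-in estimate of the EAR measure from discrete samples (e.g. forced-aligned feature values) can be sketched as follows; this is an illustrative implementation, not the scripts used at the workshop.

```python
# EAR(X,Y | C) = I(X;Y | C) - I(X;Y), estimated from co-occurrence counts.
import numpy as np
from collections import Counter

def mutual_info(pairs):
    """Plug-in estimate of I(X;Y) from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * np.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def ear(x, y, c):
    """EAR measure from equal-length discrete sequences x, y and class c."""
    n = len(x)
    i_xy = mutual_info(list(zip(x, y)))
    i_xy_given_c = 0.0
    for c_val in set(c):
        subset = [(xi, yi) for xi, yi, ci in zip(x, y, c) if ci == c_val]
        i_xy_given_c += (len(subset) / n) * mutual_info(subset)
    return i_xy_given_c - i_xy

# Toy usage with made-up discrete sequences:
x = [0, 0, 1, 1, 0, 1]
y = [0, 1, 1, 0, 0, 1]
c = [0, 0, 0, 1, 1, 1]
print(ear(x, y, c))
```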
39. AVICAR: front-end tuning (Mark)
- Current status: confirming results, then implementing audio-visual structures
40. AVICAR: front-end tuning (Mark)
41. CUAVE: implemented models (Partha)
- Synchronous model now running
- In progress: Monofeat model with two observation vectors
- [Diagram: phoneState generating obsA and obsV.]
42. Summary
- gmtkTie ready for prime-time
- All basic models running
- Tandem result encouraging
- Experiments taking longer than expected
- Too many choices of experiments!
43. Definitions: pronunciation and observation modeling
44. Types of observation model
- The observation model can be one of:
  - 8 hidden random variables (one per feature), each obtaining virtual evidence (VE) from the ANN
  - 8 hidden RVs that generate tandem features using their own Gaussians
    - Tandem features for one RV use only the ANN for that feature, plus PLPs
  - A single tandem observation generated using Gaussians from a monophone/triphone HMM state
    - Tandem feature vector obtained by concatenating the 8 ANN output vectors plus PLPs
    - Compare this to a standard tandem feature vector derived using a phone-classifying ANN
  - Non-hybrid/tandem model: PLP observations generated using Gaussians