Title: Toward ASR Without Viterbi Search: A Prototype System for the SPINE Evaluation
1- Toward ASR Without Viterbi Search: A Prototype System for the SPINE Evaluation
John-Paul Hosom
Center for Spoken Language Understanding (CSLU)
OGI School of Science & Engineering (OGI)
Oregon Health & Science University (OHSU)
2- Scope of Project, Performance
- Goal: The goal of the ROAR RFI is to challenge the architecture, science, and dogma of the current ASR research community, and to look for high-risk, high-yield alternative ideas.
- The current work challenges standard approaches by combining frame-based and segment-based approaches to ASR.
- Development of this prototype began in Spring 2001.
- The system is still incomplete; major simplifications were made in the implementation in order to develop a prototype for the SPINE evaluation.
- Evaluated on the SPINE2 task (error rates in %):
    Sub    Del    Ins    WER
    49.0   26.6   13.7   89.3
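The reported WER is the sum of the three error rates. A minimal check, assuming each rate is a percentage of the number of reference words (the slide does not state the units explicitly; the function name is illustrative):

```python
def word_error_rate(sub, dele, ins):
    """Return WER as the sum of substitution, deletion, and insertion rates.

    Assumes all three rates are percentages of the number of reference words.
    """
    return sub + dele + ins


# Rates reported for the SPINE2 evaluation above.
wer = word_error_rate(49.0, 26.6, 13.7)
print(round(wer, 1))
```

Because insertions are counted against the reference length, WER computed this way can exceed 100% in general.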
SPINE: Toward ASR without Viterbi Search (Hosom & van Santen, OGI/CSLU)
3- Motivations
- The assumption of independence between frames is usually not valid.
- Frame-based systems implicitly assume that all frames contribute equally to the final result: there is no weighting of information in the time domain, yet perceptual studies indicate greater perceptual relevance of transition regions.
- Transition probabilities in an HMM model average durations; speech that is faster or slower than average is not classified as robustly.
- Duration is sometimes the single best cue for distinguishing voiced from unvoiced consonants, yet duration is not accounted for in frame-based classification of phonetic likelihoods.
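The point about HMM transition probabilities can be made concrete. For a single HMM state with self-transition probability $a$ (a symbol introduced here for illustration, not taken from the slides), the number of frames $d$ spent in the state follows a geometric distribution:

```latex
P(d) = a^{d-1}\,(1 - a), \qquad d = 1, 2, \dots, \qquad \mathbb{E}[d] = \frac{1}{1 - a}
```

A single parameter therefore fixes the mean duration, and for one state the probability can only decay as $d$ grows, which is a poor match for speech that is consistently faster or slower than the training average.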
4- Proposed Approach
- Classification
  - At each frame, blindly hypothesize a large number of segments beginning at that frame but with different end times.
  - Perform segment-based phonetic classification, using specific information about phonetic boundaries and duration.
  - Keep only those segments that have sufficient likelihood.
- Dynamic-Programming Search
  - Standard Viterbi search combines two sources of stochastic information.
  - Instead, combine duration and phoneme identity in a single classification step.
  - Then, use a more general dynamic-programming search.
- The combination of frame- and segment-based ASR is called fragment-based ASR.
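The classification stage above can be sketched roughly as follows. This is an illustration under stated assumptions, not the prototype's implementation: the `classify` callback, the `max_len` limit on segment length, and the `threshold` value are all hypothetical.

```python
def hypothesize_segments(num_frames, max_len, classify, threshold):
    """Enumerate candidate segments and keep the sufficiently likely ones.

    classify(start, end) is a hypothetical callback returning a
    (label, likelihood) pair for frames start..end-1.
    """
    kept = []
    for start in range(num_frames):
        # Try every end time up to max_len frames past the start.
        last_end = min(start + max_len, num_frames)
        for end in range(start + 1, last_end + 1):
            label, likelihood = classify(start, end)
            if likelihood >= threshold:
                kept.append((start, end, label, likelihood))
    return kept


def toy_classify(start, end):
    # Toy stand-in for the segment classifier: favors 3-frame segments.
    duration = end - start
    return ("aa", 1.0 / (1 + abs(duration - 3)))


segments = hypothesize_segments(num_frames=5, max_len=4,
                                classify=toy_classify, threshold=0.9)
print(segments)
```

Because the classifier sees the whole candidate segment, its duration is available as evidence at classification time, which is the property the slide emphasizes.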
5- Proposed Approach
6- Features
- Simple energy-based multi-band features are used instead of PLP or MFCC.
- At each frame, the feature set consists of:
  - 1 value of log energy of the entire frame
  - 10 values of log energy in 10 Mel-spaced bands, relative to the total log energy of the entire frame
  - 11 values of change in energy, relative to a longer 250-msec window (intensity discrimination)
- Features for classifying a segment are taken from a number of frames, together with a single measure of the absolute duration of the segment.
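The per-frame feature set (1 + 10 + 11 = 22 values) could be assembled along these lines. The slides do not say how the change in energy is measured against the 250-msec window; this sketch assumes it is the current log energy minus the mean log energy over the window, computed for the total energy and each of the 10 bands. The function name and input layout are hypothetical.

```python
import math


def frame_features(band_energies, window_band_energies):
    """Build a 22-value feature vector for one frame.

    band_energies: 10 positive linear energies in Mel-spaced bands.
    window_band_energies: list of 10-band energy lists for the frames in the
        longer (250-msec) context window. How the window is used is an
        assumption: change = current log energy - mean window log energy.
    """
    def log_energies(bands):
        total = math.log(sum(bands))
        return [total] + [math.log(b) for b in bands]

    current = log_energies(band_energies)
    total_log = current[0]
    # 10 band log energies expressed relative to the frame's total log energy.
    relative_bands = [b - total_log for b in current[1:]]

    # Mean log energy (total + 10 bands) over the context window.
    window = [log_energies(bands) for bands in window_band_energies]
    window_mean = [sum(col) / len(window) for col in zip(*window)]
    # 11 change-in-energy values: total plus each band.
    deltas = [c - m for c, m in zip(current, window_mean)]

    return [total_log] + relative_bands + deltas


frame = [1.0 + i for i in range(10)]
context = [frame, [2.0 * e for e in frame]]
features = frame_features(frame, context)
print(len(features))
```

Expressing band energies relative to the frame total separates overall loudness (the first value) from spectral shape (the next ten).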
8- Dynamic Programming Search
- Consider each segment, moving forward in time:
  - If the segment is at the beginning of the utterance and is a valid word beginning, add it to the list of hypotheses.
  - If the segment beginning connects with an existing hypothesis end, both in time and in vocabulary, add this segment to the existing hypothesis, creating a new hypothesis.
  - If the segment is a word beginning and connects with a hypothesis containing a word ending, add LM information.
  - The search proceeds over all segments, adding each segment to a hypothesis or rejecting the segment.
- At the end of the search, the most likely word sequence is obtained from the hypothesis with the highest score.
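The search rules above can be sketched as follows. This is a simplified illustration, not the prototype's code: it assumes one pronunciation per word, log-domain scores, segments with `end > start`, and a hypothetical `lm_score(history, word)` callback. It also keeps every hypothesis, whereas a practical system would need pruning.

```python
from collections import defaultdict


def fragment_search(segments, lexicon, lm_score, num_frames):
    """Return (score, words) for the best word sequence covering all frames.

    segments: (start, end, phoneme, log_likelihood) tuples.
    lexicon: word -> tuple of phonemes (one pronunciation per word here).
    lm_score(history, word): hypothetical callback returning a log probability.
    """
    # Hypotheses are indexed by the frame at which they end. Each one is
    # (score, finished_words, current_word, phonemes_matched).
    ending_at = defaultdict(list)

    def start_word(score, history, seg_phone, seg_ll, end):
        # A word can begin only on a segment matching its first phoneme.
        for word, pron in lexicon.items():
            if pron[0] == seg_phone:
                new_score = score + seg_ll + lm_score(history, word)
                ending_at[end].append((new_score, history, word, 1))

    # Segments with earlier start times are processed first, so every
    # hypothesis that could end at a segment's start already exists.
    for start, end, phone, ll in sorted(segments):
        if start == 0:
            start_word(0.0, (), phone, ll, end)
        for score, history, word, matched in list(ending_at[start]):
            pron = lexicon[word]
            if matched < len(pron):
                # Mid-word: the segment must be the next phoneme of the word.
                if pron[matched] == phone:
                    ending_at[end].append(
                        (score + ll, history, word, matched + 1))
            else:
                # Word ending: the segment may begin a new word (adds LM score).
                start_word(score, history + (word,), phone, ll, end)

    # Only hypotheses that end on a complete word at the last frame are valid.
    finished = [(score, history + (word,))
                for score, history, word, matched in ending_at[num_frames]
                if matched == len(lexicon[word])]
    return max(finished) if finished else None


lexicon = {"go": ("g", "ow"), "no": ("n", "ow")}
segments = [(0, 2, "g", -1.0), (0, 2, "n", -2.0), (2, 5, "ow", -0.5),
            (5, 7, "n", -0.3), (7, 10, "ow", -0.4)]
best = fragment_search(segments, lexicon, lambda history, word: -1.0, 10)
print(best)
```

A segment that connects to no hypothesis is simply never attached, which corresponds to rejecting it.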
9- Training
- For the SPINE2 task, the training set included SPINE1 training, SPINE1 evaluation, and SPINE2 training data.
- It is important to have accurate time-aligned phonetic transcriptions.
- Automatic phonetic segmentation was used to generate time-aligned transcriptions; phonemes with unrealistic durations were excluded from training.
- An ANN classifier was trained with 397 inputs, 200 hidden nodes, and 88 outputs. The outputs were 44 phonemes and 44 near-miss phonemes.
- A context-independent phoneme set was used due to limited development time.
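To give a sense of the classifier's size, the sketch below counts trainable parameters for the stated layer sizes. It assumes a fully connected feed-forward network with one bias per hidden and output unit, which the slides do not confirm.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


# Layer sizes from the slide: inputs, hidden nodes, outputs.
sizes = [397, 200, 88]
print(mlp_parameter_count(sizes))

# 88 outputs = 44 phonemes + 44 near-miss phonemes.
assert sizes[-1] == 44 + 44
```

As an aside, 397 = 18 x 22 + 1 would be consistent with 18 frames of 22 features each plus the single duration value, but the slides do not state the number of frames per segment, so this decomposition is only a guess.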
10- Language Model, Dictionary
- The CMU-provided SPINE2 trigram language model was used.
- The dictionary contained 1396 words from the training set, excluding uncommon DRT words found in SPINE1. Improvements in the execution time (currently 73xRT, i.e. 73 times real time) will allow larger vocabulary sizes.
- Word pronunciations were obtained from the CMU dictionary and NIST text-to-phone software, and all pronunciations were hand-checked.
- As phoneme insertions and deletions are not as well accounted for in this prototype as in HMMs, attention was paid to detailed word modeling with a number of optional and alternative phonemes.
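One way to represent word models with optional and alternative phonemes is to list the choices at each position and expand them into explicit variants. The notation below (a tuple of alternatives per position, with `None` marking an optional phoneme) and the example entry are hypothetical, not taken from the SPINE dictionary.

```python
from itertools import product


def expand_pronunciation(slots):
    """Expand a pronunciation with optional/alternative phonemes.

    slots: list of tuples; each tuple lists the alternatives for one position.
    None inside a tuple marks the position as optional (hypothetical notation).
    """
    variants = []
    for choice in product(*slots):
        variants.append(tuple(p for p in choice if p is not None))
    return variants


# Illustrative entry (not from the actual dictionary):
# "and" with alternative vowels and an optional final /d/.
and_slots = [("ae", "ax"), ("n",), ("d", None)]
for variant in expand_pronunciation(and_slots):
    print(" ".join(variant))
```

The number of variants grows multiplicatively with the number of optional or alternative positions, which is one reason vocabulary size and search time are coupled.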
11- Future Work
- The prototype system was developed over a 6-month period.
- There are a large number of areas for improvement:
  (1) context-dependent phonemes (or diphone-based units)
  (2) more accurately labeled training data
  (3) improved use of duration information in classification
  (4) improved efficiency of search, allowing larger vocabulary
  (5) accounting for phonemic insertions and deletions
  (6) a forward-backward type of training procedure
  (7) improved speech/non-speech segmentation
  (8) improved feature set
  (9) use of near-miss phoneme information
  (10) improved evidence combination
  (11) optimization of parameters
  (12) debugging (still memory leaks!), etc.
12- Conclusion
- Key Ideas
  - Fragment-based ASR allows better duration modeling (including contextual and prosodic effects) than standard HMMs.
  - It provides the ability to focus on perceptually important phonetic transition regions during classification.
  - Performing segmentation and classification simultaneously allows the full power of the classifier to be used in segmentation.
- Prototype system: demonstrates that there are no fundamental roadblocks to developing a complete system.