Toward ASR Without Viterbi Search: A Prototype System for the SPINE Evaluation

SPINE: Toward ASR without Viterbi Search (Hosom & van Santen, OGI/CSLU)

1
Toward ASR Without Viterbi Search: A Prototype System for the SPINE Evaluation
John-Paul Hosom
Center for Spoken Language Understanding (CSLU)
OGI School of Science & Engineering (OGI)
Oregon Health & Science University (OHSU)
2

Scope of Project, Performance
Goal: The goal of the ROAR RFI is to challenge
the architecture, science, and dogma of the
current ASR research community, and to look for
high-risk, high-yield alternative ideas.
The current work challenges standard approaches
by combining frame-based and segment-based
approaches to ASR.
Development of this prototype began in Spring
2001
System is still incomplete; major simplifications
were made in the implementation in order to
develop a prototype for the SPINE evaluation.
Evaluated on SPINE2 task (all values in percent):
Sub    Del    Ins    WER
49.0   26.6   13.7   89.3
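
As a quick check on the figures above, the word error rate is the sum of the substitution, deletion, and insertion rates, each taken as a percentage of the reference words. A minimal sketch (the helper name is an assumption, not part of the SPINE scoring tools):

```python
def word_error_rate(sub, dele, ins):
    """Return WER as the sum of substitution, deletion, and insertion rates (percent)."""
    return sub + dele + ins

# Rates reported for the SPINE2 evaluation on this slide.
wer = word_error_rate(49.0, 26.6, 13.7)
print(f"WER = {wer:.1f}%")  # WER = 89.3%
```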

SPINE: Toward ASR without Viterbi Search (Hosom &
van Santen, OGI/CSLU)
3

Motivations
Assumption of independence between frames is
usually not valid.
Implicit assumption that all frames contribute
equally to the final result: no weighting of
information in the time domain, yet perceptual
studies indicate greater perceptual relevance of
transition regions.
Transition probabilities in an HMM model average
durations; speech that is faster or slower than
average is not classified as robustly.
Duration is sometimes the single best cue for
distinguishing voiced from unvoiced consonants,
yet duration is not accounted for in frame-based
classification of phonetic likelihoods.
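
The point about HMM durations can be made concrete. For a single HMM state with self-loop probability a, the probability of staying exactly d frames is a^(d-1) (1 - a): a geometric distribution with mean 1/(1 - a), whose most likely duration is always one frame. The sketch below illustrates this; the function names and the value a = 0.9 are illustrative assumptions, not taken from the slides.

```python
def state_duration_prob(a, d):
    """P(staying exactly d frames) in one HMM state with self-loop probability a."""
    return (a ** (d - 1)) * (1.0 - a)

def expected_duration(a):
    """Mean of the geometric duration distribution, in frames."""
    return 1.0 / (1.0 - a)

a = 0.9  # assumed self-loop probability, for illustration only
probs = [state_duration_prob(a, d) for d in range(1, 6)]
print(expected_duration(a))  # about 10 frames on average
print(probs)                 # monotonically decreasing: d = 1 is always most likely
```

The transition probability fixes only the average; the shape of the distribution cannot favor, say, a 10-frame segment over a 1-frame one.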

4

Proposed Approach
Classification
- At each frame, blindly hypothesize a large
  number of segments beginning at that frame but
  with different end times.
- Perform segment-based phonetic classification,
  using specific information about phonetic
  boundaries and duration.
- Keep only those segments that have sufficient
  likelihood.
Dynamic-Programming Search
- Standard Viterbi search combines two sources of
  stochastic information.
- Instead, combine duration and phoneme identity
  in a single classification step.
- Then, use a more general dynamic-programming
  search.
Combination of frame- and segment-based ASR is
called fragment-based ASR.
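
The classification stage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: `classify` is a hypothetical segment classifier, and the maximum segment length, threshold, and toy data are invented for the example.

```python
def hypothesize_segments(num_frames, max_len, classify, threshold):
    """Enumerate segments starting at every frame and keep the likely ones.

    classify(start, end) is a hypothetical segment classifier returning a
    dict {phoneme: probability}; end is exclusive.
    """
    kept = []
    for start in range(num_frames):
        # Same start frame, many different end times.
        for end in range(start + 1, min(start + max_len, num_frames) + 1):
            for phoneme, prob in classify(start, end).items():
                if prob >= threshold:  # keep only sufficiently likely segments
                    kept.append((start, end, phoneme, prob))
    return kept

def toy_classifier(start, end):
    # Stand-in classifier: pretends frames 0-2 are /s/ and frames 3-5 are /iy/.
    truth = {(0, 3): "s", (3, 6): "iy"}
    if (start, end) in truth:
        return {truth[(start, end)]: 0.9}
    return {"s": 0.05, "iy": 0.05}

segments = hypothesize_segments(6, 4, toy_classifier, 0.5)
print(segments)  # [(0, 3, 's', 0.9), (3, 6, 'iy', 0.9)]
```

Because the classifier sees a whole candidate segment, boundary and duration information can enter the score directly, which is the point of the proposal.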

5
Proposed Approach
6

Features
Simple energy-based multi-band features used
instead of PLP or MFCC.
At each frame, the feature set consists of:
- 1 value of log energy of the entire frame
- 10 values of log energy in 10 Mel-spaced bands,
  relative to the total log energy of the entire
  frame
- 11 values of change in energy, relative to a
  longer 250-msec window (intensity
  discrimination)
Features for classifying a segment are taken from
a number of frames, together with a single measure
of the absolute duration of the segment.
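
One possible reading of this feature set is sketched below. Several details are assumptions rather than facts from the slides: the Mel-band energies are taken as given inputs, the frame rate is assumed to be 10 msec (so 250 msec is 25 frames), the "change in energy" is interpreted as the current log energy minus its mean over the surrounding window, and all function names are hypothetical.

```python
import math

EPS = 1e-10  # avoids log(0)

def frame_features(band_energies, t, window_frames=25):
    """Return 22 features for frame t from per-frame lists of 10 band energies."""
    frame = band_energies[t]
    total_log = math.log(sum(frame) + EPS)              # 1 value
    band_logs = [math.log(e + EPS) for e in frame]
    relative_bands = [b - total_log for b in band_logs]  # 10 values

    # Change in each of the 11 log energies relative to a longer window.
    half = window_frames // 2
    context = band_energies[max(0, t - half): t + half + 1]
    current = [total_log] + band_logs
    deltas = []
    for i in range(len(current)):
        if i == 0:
            values = [math.log(sum(f) + EPS) for f in context]
        else:
            values = [math.log(f[i - 1] + EPS) for f in context]
        deltas.append(current[i] - sum(values) / len(values))  # 11 values
    return [total_log] + relative_bands + deltas

def segment_vector(frame_vectors, duration_msec):
    """Concatenate per-frame features and append the absolute segment duration."""
    return [v for fv in frame_vectors for v in fv] + [float(duration_msec)]

frames = [[1.0] * 10 for _ in range(40)]  # toy constant-energy signal
feats = frame_features(frames, 20)
vec = segment_vector([frame_features(frames, t) for t in range(10, 28)], 180.0)
print(len(feats), len(vec))  # 22 397
```

The slides do not say how many frames are sampled per segment. Eighteen is used here only because 18 x 22 + 1 = 397 matches the classifier input count quoted on the Training slide.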

7

Features

8

Dynamic Programming Search
Consider each segment, moving forward in time:
- If the segment is at the beginning of the
  utterance, and the segment is a valid word
  beginning, add it to the list of hypotheses.
- If the segment beginning connects with an
  existing hypothesis end, both in time and in
  vocabulary, add this segment to the existing
  hypothesis, creating a new hypothesis.
- If the segment is a word beginning and connects
  with a hypothesis containing a word ending, add
  LM information.
- Search proceeds over all segments, adding each
  segment to a hypothesis or rejecting the
  segment.
At the end of the search, the most likely word
sequence is obtained from the hypothesis with the
highest score.
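
The search steps above can be sketched as a simplified dynamic-programming pass over segments. This is an illustration under assumptions, not the prototype's implementation: the two-word lexicon, the flat `lm_logp` score, and all names are hypothetical; there is no pruning; and the LM score is added when a word is completed, a simplification of the slide, which adds it at word beginnings.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "start end phoneme logp")
# partial holds the phonemes of the word currently being built.
Hyp = namedtuple("Hyp", "end words partial score")

LEXICON = {"go": ("g", "ow"), "no": ("n", "ow")}  # hypothetical toy dictionary

def is_word_prefix(phones):
    return any(pron[:len(phones)] == phones for pron in LEXICON.values())

def completed_words(phones):
    return [w for w, pron in LEXICON.items() if pron == phones]

def lm_logp(history, word):
    return -1.0  # hypothetical flat language-model score

def search(segments, utt_end):
    # The empty hypothesis ending at time 0 stands for the utterance start.
    hyps = [Hyp(end=0, words=(), partial=(), score=0.0)]
    for seg in sorted(segments, key=lambda s: (s.start, s.end)):
        extended = []
        for h in hyps:
            if h.end != seg.start:          # must connect in time
                continue
            phones = h.partial + (seg.phoneme,)
            if not is_word_prefix(phones):  # must connect in vocabulary
                continue
            score = h.score + seg.logp
            extended.append(Hyp(seg.end, h.words, phones, score))
            for word in completed_words(phones):
                extended.append(Hyp(seg.end, h.words + (word,), (),
                                    score + lm_logp(h.words, word)))
        hyps.extend(extended)               # segments matching nothing are rejected
    finished = [h for h in hyps if h.end == utt_end and not h.partial]
    return max(finished, key=lambda h: h.score) if finished else None

segs = [Segment(0, 10, "g", -0.1), Segment(0, 10, "n", -2.0),
        Segment(10, 30, "ow", -0.2), Segment(0, 30, "ow", -5.0)]
best = search(segs, 30)
print(best.words)  # ('go',)
```

Sorting by start time guarantees that every hypothesis a segment could attach to already exists when that segment is considered, provided all segments have positive duration.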

9

Training
For SPINE2 task, training set included SPINE1
training, SPINE1 evaluation, and SPINE2 training
data.
Important to have accurate time-aligned phonetic
transcriptions.
Automatic phonetic segmentation used to generate
time-aligned transcriptions; phonemes with
unrealistic durations excluded from training.
Trained an ANN classifier with 397 inputs, 200
hidden nodes, and 88 outputs. Outputs were 44
phonemes and 44 near-miss phonemes.
Context-independent phoneme set used due to
limited development time.
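
To give a sense of the classifier's size, the sketch below builds a network with the layer sizes quoted above and runs one forward pass. Only the three sizes come from the slide. The sigmoid hidden units, softmax outputs, random initialization, and all names are assumptions made for illustration.

```python
import math
import random

N_IN, N_HID, N_OUT = 397, 200, 88  # sizes quoted on the slide

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def affine(weights, biases, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def forward(x, hidden, output):
    h = [1.0 / (1.0 + math.exp(-z)) for z in affine(*hidden, x)]  # assumed sigmoid
    z = affine(*output, h)
    m = max(z)                                                     # stable softmax
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
hidden = init_layer(N_IN, N_HID, rng)
output = init_layer(N_HID, N_OUT, rng)
posteriors = forward([0.0] * N_IN, hidden, output)

n_params = N_IN * N_HID + N_HID + N_HID * N_OUT + N_OUT
print(len(posteriors), n_params)  # 88 97288
```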

10

Language Model, Dictionary
CMU-provided SPINE2 trigram language model was
used
Dictionary contained 1396 words from the training
set, excluding uncommon DRT words found in
SPINE1. Improvements in the execution time
(currently 73 times real time) will allow larger
vocabulary sizes.
Word pronunciations obtained from the CMU
dictionary and NIST text-to-phone software, and
all pronunciations were hand-checked.
As phoneme insertions and deletions are not as
well accounted for in this prototype as in HMMs,
attention was paid to detailed word modeling with
a number of optional and alternative phonemes.
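
One way to represent a word model with optional and alternative phonemes is a list of slots that is expanded into concrete pronunciations. The notation, the example word, and its phoneme symbols below are hypothetical and not taken from the prototype's dictionary.

```python
from itertools import product

def expand_pronunciation(slots):
    """Expand a slot-based word model into all concrete pronunciations.

    Each slot is a tuple of alternative phonemes for one position;
    None in a slot means the phoneme may be omitted.
    """
    variants = []
    for choice in product(*slots):
        variants.append(tuple(p for p in choice if p is not None))
    return variants

# Hypothetical entry: full or reduced vowel, then /n/, then an optional /d/.
AND = [("ae", "ax"), ("n",), ("d", None)]
variants = expand_pronunciation(AND)
for v in variants:
    print(" ".join(v))
# ae n d
# ae n
# ax n d
# ax n
```

Listing the variants explicitly lets a search that cannot insert or delete phonemes on its own still match reduced or elided pronunciations.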

11

Future Work
The prototype system was developed over a 6-month
period.
A large number of areas for improvement:
(1) context-dependent phonemes (or diphone-based
    units)
(2) more accurately labeled training data
(3) improved use of duration information in
    classification
(4) improved efficiency of search, allowing
    larger vocabulary
(5) accounting for phonemic insertions and
    deletions
(6) a forward-backward type of training procedure
(7) improved speech/non-speech segmentation
(8) improved feature set
(9) use of near-miss phoneme information
(10) improved evidence combination
(11) optimization of parameters
(12) debugging (still memory leaks!), etc.

12

Conclusion
Key Ideas
- Fragment-based ASR allows better duration
  modeling (including contextual and prosodic
  effects) than standard HMMs.
- Ability to focus on perceptually important
  phonetic transition regions during
  classification.
- Segmentation and classification are performed
  simultaneously, which allows the full power of
  the classifier to be used in segmentation.
Prototype system: demonstrates that there are no
fundamental roadblocks to developing a complete
system.
