Toward ASR Without Viterbi Search

1
Toward ASR Without Viterbi Search: A Prototype System for the SPINE Evaluation
John-Paul Hosom
Center for Spoken Language Understanding (CSLU)
OGI School of Science & Engineering (OGI)
Oregon Health & Science University (OHSU)
2
  • Scope of Project, Performance
  • Goal: The goal of the ROAR RFI is to challenge
    the architecture, science, and dogma of the
    current ASR research community, and to look for
    high-risk, high-yield alternative ideas
  • The current work challenges standard approaches
    by combining frame-based and segment-based
    approaches to ASR.
  • Development of this prototype began in Spring
    2001
  • System is still incomplete: major
    simplifications were made in the implementation in order to
    develop a prototype for the SPINE evaluation.
  • Evaluated on SPINE2 task (%):
      Sub    Del    Ins    WER
      49.0   26.6   13.7   89.3
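
The WER figure is the sum of the three error types. A minimal check of that arithmetic, assuming the values are percentages of the reference word count (the usual scoring convention; the function name is illustrative only):

```python
def word_error_rate(sub: float, dele: float, ins: float) -> float:
    """WER as the sum of substitution, deletion, and insertion rates (all in %)."""
    return sub + dele + ins


# Figures reported on the slide for the SPINE2 task.
spine2_wer = word_error_rate(sub=49.0, dele=26.6, ins=13.7)
print(round(spine2_wer, 1))
```

Substitutions account for more than half of the total, which is consistent with the slide's remark that the system is still incomplete.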

SPINE: Toward ASR without Viterbi Search (Hosom & van Santen, OGI/CSLU)
3
  • Motivations
  • The assumption of independence between frames is
    usually not valid.
  • Implicit assumption that all frames contribute
    equally to the final result: no weighting of
    information in the time domain, yet
    perceptual studies indicate more perceptual
    relevance of transition regions.
  • Transition probabilities in an HMM model average
    durations; speech that is faster or slower than
    average is not classified as robustly.
  • Duration is sometimes the single best cue for
    distinguishing voiced from unvoiced consonants,
    yet duration is not accounted for in frame-based
    classification of phonetic likelihoods.
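
The point about HMM transition probabilities can be made concrete. With a fixed self-loop probability, the time spent in a state follows a geometric distribution, so the model only encodes a mean duration. The sketch below is a generic illustration of that property, not code from the prototype; the self-loop value 0.9 is an arbitrary assumption.

```python
def hmm_state_duration_pmf(self_loop: float, frames: int) -> float:
    """Probability of staying in one HMM state for exactly `frames` frames."""
    return self_loop ** (frames - 1) * (1.0 - self_loop)


def hmm_state_mean_duration(self_loop: float) -> float:
    """Mean of the geometric duration distribution, in frames."""
    return 1.0 / (1.0 - self_loop)


a = 0.9  # hypothetical self-loop probability
mean_frames = hmm_state_mean_duration(a)
for d in (1, 10, 30):
    print(d, hmm_state_duration_pmf(a, d))
```

The probability falls monotonically with duration, so a one-frame stay is always the most likely outcome regardless of the phoneme, and unusually long or short realizations are penalized only through this fixed decay.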

4
  • Proposed Approach
  • Classification
    - At each frame, blindly hypothesize a large number of segments
      beginning at that frame but with different end times.
    - Perform segment-based phonetic classification, using specific
      information about phonetic boundaries and duration.
    - Keep only those segments that have sufficient likelihood.
  • Dynamic-Programming Search
    - Standard Viterbi search combines two sources of stochastic
      information.
    - Instead, combine duration and phoneme identity in a single
      classification step.
    - Then, use a more general dynamic-programming search.
  • The combination of frame- and segment-based ASR is
    called fragment-based ASR.
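
The classification step can be sketched as a double loop over start frames and end times. This is a minimal illustration under assumptions, not the prototype's code: `classify`, `max_len`, and `threshold` are hypothetical names, and the toy classifier stands in for the real segment classifier.

```python
from typing import Callable, Dict, List, Tuple

# (start frame, end frame, phoneme, probability)
SegmentHyp = Tuple[int, int, str, float]


def hypothesize_segments(
    num_frames: int,
    max_len: int,
    classify: Callable[[int, int], Dict[str, float]],
    threshold: float,
) -> List[SegmentHyp]:
    """Try every (start, end) pair up to max_len frames; keep likely phonemes."""
    kept: List[SegmentHyp] = []
    for start in range(num_frames):
        last_end = min(start + max_len, num_frames)
        for end in range(start + 1, last_end + 1):
            # The classifier sees the whole candidate segment, so duration
            # and boundary information can influence the phoneme scores.
            for phoneme, prob in classify(start, end).items():
                if prob >= threshold:
                    kept.append((start, end, phoneme, prob))
    return kept


def toy_classifier(start: int, end: int) -> Dict[str, float]:
    """Stand-in classifier: confident only about one segment."""
    if (start, end) == (0, 3):
        return {"aa": 0.9, "s": 0.1}
    return {"aa": 0.2, "s": 0.2}


segments = hypothesize_segments(
    num_frames=5, max_len=4, classify=toy_classifier, threshold=0.5
)
print(segments)
```

Only the surviving segments are passed on to the search, which is why the threshold directly trades accuracy against search cost.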

5
Proposed Approach
6
  • Features
  • Simple energy-based multi-band features used
    instead of PLP or MFCC
  • At each frame, the feature set consists of:
    - 1 value of log energy of the entire frame
    - 10 values of log energy in 10 Mel-spaced bands, relative
      to the total log energy of the entire frame
    - 11 values of change in energy, relative to a longer
      250-msec window (intensity discrimination)
  • Features for classifying a segment are taken from a
    number of frames, together with a single measure of
    the absolute duration of the segment.
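
A rough sketch of the 22 per-frame values (1 + 10 + 11) follows. Several details are assumptions rather than facts from the slides: the 10-msec frame step, the window being centered on the frame, "change" meaning the current log energy minus the window mean, and the input already being per-frame energies in 10 Mel-spaced bands.

```python
import math
from typing import List

FRAME_STEP_MS = 10  # assumption: frame step is not stated on the slide
WINDOW_MS = 250     # window length given on the slide


def log_energies(frame: List[float]) -> List[float]:
    """Total log energy followed by the log energy of each band (energies > 0)."""
    return [math.log(sum(frame))] + [math.log(e) for e in frame]


def frame_features(band_energies: List[List[float]], t: int) -> List[float]:
    """Return 22 features for frame t from per-frame energies in 10 bands."""
    current = log_energies(band_energies[t])
    total_log = current[0]
    # 10 band values expressed relative to the total log energy of the frame.
    relative_bands = [b - total_log for b in current[1:]]

    # 11 change values: total and band log energies minus their mean over
    # a window of about 250 msec around the frame (assumed definition).
    half = WINDOW_MS // FRAME_STEP_MS // 2
    lo, hi = max(0, t - half), min(len(band_energies), t + half + 1)
    window = [log_energies(f) for f in band_energies[lo:hi]]
    window_mean = [sum(col) / len(window) for col in zip(*window)]
    change = [c - m for c, m in zip(current, window_mean)]

    return [total_log] + relative_bands + change


steady = [[2.0] * 10 for _ in range(30)]  # toy input: constant energy
features = frame_features(steady, 15)
print(len(features))
```

For a constant signal the 11 change values are all zero, which is the intended behavior of a measure meant to highlight transitions. The slides do not say how many frames feed the segment classifier; 18 frames of 22 values plus one duration value would give 18 × 22 + 1 = 397, which matches the classifier input count on the Training slide, but that reading is only a guess.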

7
  • Features

8
  • Dynamic Programming Search
  • Consider each segment, moving forward in time:
    - If the segment is at the beginning of the utterance and is a
      valid word beginning, add it to the list of hypotheses.
    - If the segment beginning connects with an existing hypothesis
      end, both in time and in vocabulary, add this segment to the
      existing hypothesis, creating a new hypothesis.
    - If the segment is a word beginning and connects with a
      hypothesis containing a word ending, add LM information.
    - Search proceeds over all segments, adding each segment to a
      hypothesis or rejecting the segment.
  • At end of search, the most likely word sequence is
    obtained from the hypothesis with the highest score.
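
The search steps above can be sketched as follows. This is a simplified illustration under assumptions, not the prototype's implementation: all names are hypothetical, no pruning is done, each pronunciation maps to one word, and the LM score is added when a word is completed rather than when the next word begins (the totals are the same).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    start: int
    end: int
    phoneme: str
    log_prob: float


@dataclass(frozen=True)
class Hypothesis:
    end: int                   # frame where the hypothesis currently ends
    words: Tuple[str, ...]     # completed words
    partial: Tuple[str, ...]   # phonemes of the word in progress
    score: float


def search(
    segments: List[Segment],
    lexicon: Dict[str, Tuple[str, ...]],
    lm_log_prob: Callable[[Tuple[str, ...], str], float],
    utterance_end: int,
) -> Optional[Hypothesis]:
    # Every phoneme prefix of a pronunciation is "valid in vocabulary".
    prefixes = {p[:i] for p in lexicon.values() for i in range(1, len(p) + 1)}
    word_of = {pron: word for word, pron in lexicon.items()}
    hyps: List[Hypothesis] = []
    for seg in sorted(segments, key=lambda s: (s.start, s.end)):
        extended: List[Hypothesis] = []
        # Utterance-initial segment that is a valid word beginning.
        if seg.start == 0 and (seg.phoneme,) in prefixes:
            extended.append(Hypothesis(seg.end, (), (seg.phoneme,), seg.log_prob))
        # Segment connects to an existing hypothesis in time and vocabulary.
        for h in hyps:
            partial = h.partial + (seg.phoneme,)
            if h.end == seg.start and partial in prefixes:
                extended.append(
                    Hypothesis(seg.end, h.words, partial, h.score + seg.log_prob)
                )
        for h in extended:
            hyps.append(h)
            if h.partial in word_of:  # word ending reached: add LM information
                word = word_of[h.partial]
                lm = lm_log_prob(h.words, word)
                hyps.append(Hypothesis(h.end, h.words + (word,), (), h.score + lm))
        # A segment that extends nothing is simply rejected.
    complete = [h for h in hyps if h.end == utterance_end and h.words and not h.partial]
    return max(complete, key=lambda h: h.score) if complete else None


toy_lexicon = {"cat": ("k", "ae", "t"), "cad": ("k", "ae", "d")}
toy_segments = [
    Segment(0, 3, "k", -0.1), Segment(3, 7, "ae", -0.2),
    Segment(7, 10, "t", -0.3), Segment(7, 10, "d", -1.0),
    Segment(2, 7, "s", -0.1),  # connects to nothing, so it is rejected
]
best = search(toy_segments, toy_lexicon, lambda history, word: 0.0, utterance_end=10)
print(best.words if best else None)
```

Because segments are processed in order of start time, every hypothesis that a segment could attach to already exists when the segment is considered. A practical system would also need pruning, since this sketch keeps every hypothesis it creates.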

9
  • Training
  • For SPINE2 task, training set included SPINE1
    training, SPINE1 evaluation, and SPINE2 training
    data.
  • Important to have accurate time-aligned phonetic
    transcriptions.
  • Automatic phonetic segmentation used to generate
    time-aligned transcriptions; phonemes with
    unrealistic durations excluded from training.
  • Training of an ANN classifier with 397 inputs, 200
    hidden nodes, and 88 outputs. Outputs were 44
    phonemes and 44 near-miss phonemes.
  • Context-independent phoneme set used due to
    limited development time.
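
To give a feel for the size of the classifier, here is a forward pass through a network with the layer sizes from the slide. Everything other than the sizes is an assumption made for illustration: a single hidden layer with sigmoid units, bias terms, a softmax over all 88 outputs, and random untrained weights.

```python
import math
import random

N_INPUTS, N_HIDDEN, N_OUTPUTS = 397, 200, 88  # sizes given on the slide


def init_layer(n_in: int, n_out: int, rng: random.Random):
    """Small random weights and zero biases (placeholder for trained values)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(weights, biases, x):
    """Weighted sum plus bias for every unit in a layer."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def forward(x, hidden, output):
    """Sigmoid hidden layer followed by a softmax output layer (both assumed)."""
    h = [1.0 / (1.0 + math.exp(-z)) for z in affine(*hidden, x)]
    z = affine(*output, h)
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
hidden = init_layer(N_INPUTS, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_OUTPUTS, rng)
posteriors = forward([0.0] * N_INPUTS, hidden, output)

# Weights plus biases, under the one-hidden-layer assumption.
n_params = N_INPUTS * N_HIDDEN + N_HIDDEN + N_HIDDEN * N_OUTPUTS + N_OUTPUTS
print(len(posteriors), n_params)
```

Under these assumptions the network has 97,288 trainable parameters, which is small enough to train on the combined SPINE1 and SPINE2 data described above.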

10
  • Language Model, Dictionary
  • CMU-provided SPINE2 trigram language model was
    used
  • Dictionary contained 1396 words from the training
    set, excluding uncommon DRT words found in
    SPINE1. Improvements in the execution time
    (currently 73 × real time) will allow larger vocabulary
    sizes.
  • Word pronunciations obtained from the CMU
    dictionary and NIST text-to-phone software, and
    all pronunciations were hand-checked.
  • As phoneme insertions and deletions are not as
    well accounted for in this prototype as in HMMs,
    attention was paid to detailed word modeling with
    a number of optional and alternative phonemes.
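
Word models with optional and alternative phonemes can be expanded into explicit pronunciation variants. The slot notation below is hypothetical (the slides do not show the dictionary format): each slot lists the alternatives for one position, and `None` marks a phoneme that may be omitted.

```python
from itertools import product
from typing import List, Optional, Tuple


def expand_pronunciations(
    slots: List[List[Optional[str]]],
) -> List[Tuple[str, ...]]:
    """Enumerate every pronunciation allowed by a list of phoneme slots."""
    variants = set()
    for choice in product(*slots):
        # Drop the None entries, which stand for an omitted optional phoneme.
        variants.add(tuple(p for p in choice if p is not None))
    return sorted(variants)


# Illustrative entry for "and": alternative vowel, optional final "d".
and_slots = [["ae", "ih"], ["n"], ["d", None]]
and_variants = expand_pronunciations(and_slots)
for pron in and_variants:
    print(" ".join(pron))
```

Listing the variants explicitly lets a search that does not model insertions and deletions still match reduced pronunciations, at the cost of a larger dictionary.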

11
  • Future Work
  • The prototype system was developed over a 6-month
    period.
  • A large number of areas for improvement:
    (1) context-dependent phonemes (or diphone-based units)
    (2) more accurately labeled training data
    (3) improved use of duration information in classification
    (4) improved efficiency of search, allowing larger vocabulary
    (5) accounting for phonemic insertions and deletions
    (6) a forward-backward type of training procedure
    (7) improved speech/non-speech segmentation
    (8) improved feature set
    (9) use of near-miss phoneme information
    (10) improved evidence combination
    (11) optimization of parameters
    (12) debugging (still memory leaks!), etc.

12
  • Conclusion
  • Key Ideas
    - Fragment-based ASR allows better duration modeling (including
      contextual and prosodic effects) than standard HMMs.
    - Ability to focus on perceptually important phonetic
      transition regions during classification.
    - Segmentation and classification are performed simultaneously,
      which allows the full power of the classifier in segmentation.
  • Prototype system: demonstrates that there are no
    fundamental roadblocks to developing a complete
    system.
