Title: CMU Sphinx Speech Recognition Engine
1. CMU Sphinx Speech Recognition Engine
- Reporter: Chun-Feng Liao
- NCCU Dept. of Computer Science
- Intelligent Media Lab
2. Purposes of this Project
- Find out how an efficient speech recognition engine can be implemented.
- Examine the source code of Sphinx-II to find out the role and function of each component.
- Read key chapters of Dr. Mosur K. Ravishankar's thesis as a reference.
- Some demo programs will be given during the oral presentation.
3. Presentation Agenda
- Project Summary / Agenda / Goal. (In English)
- Introduction.
- Basics of Speech Recognition.
- Architecture of CMU Sphinx.
- Acoustic Model and HMM.
- Language Model.
- Java Platform Issues.
- Demo
- Conclusion.
4. Voice Technologies
- In the mid- to late 1990s, personal computers started to become powerful enough to support ASR.
- The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
5. Basics of Speech Recognition
6. Speech Recognition
- Capturing speech (analog) signals.
- Digitizing the sound waves, converting them to basic language units, or phonemes.
- Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right).
7. Speech Recognition Process Flow
- Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
8. Recognition Process Flow Summary
- Step 1: User Input
  - The system captures the user's voice in the form of an analog acoustic signal.
- Step 2: Digitization
  - Digitize the analog acoustic signal.
- Step 3: Phonetic Breakdown
  - Break the signal into phonemes.
9. Recognition Process Flow Summary (2)
- Step 4: Statistical Modeling
  - Map phonemes to their phonetic representations using a statistical model.
- Step 5: Matching
  - According to the grammar, phonetic representation, and dictionary, the system returns an n-best list (i.e., a word plus a confidence score).
- Grammar: the set of words or phrases that constrains the range of input or output in the voice application.
- Dictionary: the table mapping phonetic representations to words (e.g., thu, thee → the).
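The matching step above produces an n-best list of (word, confidence) pairs. A minimal sketch in Java of what such a result looks like (the class names and the scores are made-up illustrative values, not actual Sphinx output):

```java
import java.util.*;

public class NBestList {
    // A hypothesis pairs a word with a confidence score, as in Step 5 above.
    record Hypothesis(String word, double confidence) {}

    // Return the n highest-confidence hypotheses, best first.
    static List<Hypothesis> nBest(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .map(e -> new Hypothesis(e.getKey(), e.getValue()))
                .sorted(Comparator.comparingDouble((Hypothesis h) -> -h.confidence()))
                .limit(n)
                .toList();
    }

    public static void main(String[] args) {
        // Homophones like "write" and "right": language context decides the ranking.
        Map<String, Double> scores = Map.of("write", 0.35, "right", 0.60, "rite", 0.05);
        System.out.println(nBest(scores, 2)); // best hypothesis printed first
    }
}
```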
10. Architecture of CMU Sphinx
11. Introduction to CMU Sphinx
- A speech recognition system developed at Carnegie Mellon University.
- Consists of a set of libraries:
  - Core speech recognition functions.
  - Low-level audio capture.
- Continuous speech decoding.
- Speaker-independent.
12. Brief History of CMU Sphinx
- Sphinx-I (1987)
  - The first speaker-independent, high-performance ASR system in the world.
  - Written in C by Kai-Fu Lee (later at Microsoft Research Asia).
- Sphinx-II (1992)
  - Written in C by Xuedong Huang (later at Microsoft, working on Speech.NET).
  - 5-state HMM / N-gram LM.
  - (Reportedly, CMU Sphinx technology later contributed to the Microsoft Speech SDK.)
13. Brief History of CMU Sphinx (2)
- Sphinx 3 (1996)
  - Built by Eric Thayer and Mosur Ravishankar.
  - Slower than Sphinx-II, but the design is more flexible.
- Sphinx 4 (originally Sphinx 3j)
  - Refactored from Sphinx 3.
  - Fully implemented in Java.
  - Not finished yet.
14. Components of CMU Sphinx
15. Front End
- libsphinx2fe.lib / libsphinx2ad.lib
- Low-level audio access
- Continuous Listening and Silence Filtering
- Front End API overview.
16. Knowledge Base
- The data that drives the decoder.
- Three sets of data:
- Acoustic Model.
- Language Model.
- Lexicon (Dictionary).
17. Acoustic Model
- /model/hmm/6k
- A database of statistical models.
- Each statistical model represents a phoneme.
- Acoustic models are trained by analyzing large amounts of speech data.
18. HMM in the Acoustic Model
- HMMs represent each unit of speech in the acoustic model.
- A typical HMM uses 3-5 states to model a phoneme.
- Each state of an HMM is represented by a set of Gaussian mixture density functions.
- Sphinx2 default phone set.
19. Gaussian Mixtures
- Refer to the textbook, p. 33, eq. 38.
- Gaussian mixtures represent each state in an HMM.
- Each set of Gaussian mixtures is called a senone.
- HMMs can share senones.
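The textbook equation cited above is not reproduced on the slide; the standard form of a Gaussian mixture output density for HMM state j (using conventional symbols, which are assumed here) is:

```latex
b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\,
  \mathcal{N}\!\left(\mathbf{o}_t;\ \boldsymbol{\mu}_{jm},\ \boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{M} c_{jm} = 1,\quad c_{jm} \ge 0
```

where o_t is the feature vector at time t, and each of the M mixture components of state j has weight c_jm, mean vector μ_jm, and covariance matrix Σ_jm. A shared set of such mixtures is what the slide calls a senone.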
20. (No transcript)
21. Language Model
- Describes what is likely to be spoken in a particular context.
- Word transitions are defined in terms of transition probabilities.
- Helps to constrain the search space.
- See examples of LMs.
22. N-gram Language Model
- The probability of word N depends on words N-1, N-2, ...
- Bigrams and trigrams are most commonly used.
- Used for large-vocabulary applications such as dictation.
- Typically trained on a very large corpus (millions of words).
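The bigram case above can be sketched in a few lines of Java (a toy corpus and plain maximum-likelihood counts, with no smoothing; real dictation models are trained on millions of words):

```java
public class BigramSketch {
    // P(next | prev) = count(prev next) / count(prev), estimated from a corpus.
    static double bigramProb(String[] words, String prev, String next) {
        int prevCount = 0, pairCount = 0;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(prev)) {
                prevCount++;
                if (i + 1 < words.length && words[i + 1].equals(next)) pairCount++;
            }
        }
        return prevCount == 0 ? 0.0 : (double) pairCount / prevCount;
    }

    public static void main(String[] args) {
        String[] corpus = "the cat sat on the mat the cat ran".split(" ");
        // "the" occurs 3 times and is followed by "cat" twice, so P(cat | the) = 2/3.
        System.out.println(bigramProb(corpus, "the", "cat"));
    }
}
```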
23. Decoder
- Selects the next set of likely states.
- Scores incoming features against these states.
- Drops low-scoring states.
- Generates results.
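The four steps above roughly describe a Viterbi-style beam search. A minimal sketch (a toy two-state HMM with made-up parameters; Sphinx's actual decoder is far more elaborate):

```java
public class BeamDecoderSketch {
    // One Viterbi pass with beam pruning: a[i][j] = transition probability,
    // b[j][k] = probability of observation k in state j, init[j] = start probability.
    static double decode(double[][] a, double[][] b, int[] obs,
                         double[] init, double beam) {
        int n = a.length;
        double[] score = new double[n];
        for (int j = 0; j < n; j++) score[j] = init[j] * b[j][obs[0]];
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {          // select likely next states
                double best = 0;
                for (int i = 0; i < n; i++)
                    best = Math.max(best, score[i] * a[i][j]);
                next[j] = best * b[j][obs[t]];     // score incoming feature
            }
            double top = 0;
            for (double s : next) top = Math.max(top, s);
            for (int j = 0; j < n; j++)            // drop low-scoring states
                if (next[j] < top * beam) next[j] = 0;
            score = next;
        }
        double best = 0;
        for (double s : score) best = Math.max(best, s);
        return best;                               // likelihood of the best path
    }

    public static void main(String[] args) {
        double[][] a = {{0.7, 0.3}, {0.0, 1.0}};   // left-to-right toy HMM
        double[][] b = {{0.9, 0.1}, {0.2, 0.8}};
        int[] obs = {0, 1};
        System.out.println(decode(a, b, obs, new double[]{1.0, 0.0}, 1e-3));
    }
}
```

Scores shrink multiplicatively, which is why real decoders work in log space; the beam keeps only states within a fixed factor of the current best.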
24. Speech on the Java Platform
25. Sun Java Speech API
- First released on October 26, 1998.
- The Java Speech API allows Java applications to incorporate speech technology into their user interfaces.
- Defines a cross-platform API to support command-and-control recognizers, dictation systems, and speech synthesizers.
26. Implementations of the Java Speech API
- Open Source
- FreeTTS / CMU Sphinx4.
- IBM Speech for Java.
- Cloud Garden.
- LH TTS for Java Speech API.
- Conversa Web 3.0.
27. FreeTTS
- Fully implemented in Java.
- Based upon Flite 1.1, a small run-time speech synthesis engine developed at CMU.
- Partial support for JSAPI 1.0.
- Speech Recognition functions.
- JSML.
28. Sphinx 4 (Sphinx 3j)
- Fully implemented in Java.
- Speed is equal to or faster than Sphinx 3.
- The acoustic model and language model are under construction.
- Source code is available via CVS (but you cannot run any applications without the models!).

For example, to check out Sphinx4, you can use the following command:

cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4
29. Java Platform Issues
- GC makes managing data much easier.
- Native engines typically optimize inner loops for the CPU; this can't be done on the Java platform.
- Native engines arrange data to optimize cache hits; this can't really be done either.
30. DEMO
- Sphinx-II batch mode.
- Sphinx-II live mode.
- Sphinx-II client/server mode.
- A simple FreeTTS application.
- (Java-based) TTS vs. (C-based) SR.
- Motion Planner with FreeTTS, using Java Web Start. (This is the GRA course final project.)
31. Summary
- Sphinx is an open-source speech recognition system developed at CMU.
- The Front End, Knowledge Base, and Decoder form the core of an SR system.
- The FE receives and processes the speech signal.
- The Knowledge Base provides data for the Decoder.
- The Decoder searches the states and returns the results.
- Speech recognition is a challenging problem for the Java platform.
32. References
- Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU, 1996.
- Mosur K. Ravishankar and Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001.
- Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.
33. References (on-line)
- On-line documents of the Java Speech API
  - http://java.sun.com/products/java-media/speech/
- On-line documents of FreeTTS
  - http://freetts.sourceforge.net/docs/
- On-line documents of Sphinx-II
  - http://www.speech.cs.cmu.edu/sphinx/
34. Q & A