Title: CMU Sphinx Speech Recognition Engine
1. CMU Sphinx Speech Recognition Engine
- Reporter: Chun-Feng Liao
- NCCU Dept. of Computer Science
- Intelligent Media Lab
2. Purposes of this Project
- Find out how an efficient speech recognition engine can be implemented.
- Examine the source code of Sphinx-II to find out the role and function of each component.
- Read key chapters of Dr. Mosur K. Ravishankar's thesis as a reference.
- Some demo programs will be given during the oral presentation.
3. Presentation Agenda
- Project Summary / Agenda / Goal. (In English)
- Introduction.
- Basics of Speech Recognition.
- Architecture of CMU Sphinx.
- Acoustic Model and HMM.
- Language Model.
- Java Platform Issues.
- Demo
- Conclusion.
4. Voice Technologies
- In the mid- to late 1990s, personal computers started to become powerful enough to support ASR.
- The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
5. Basics of Speech Recognition
6. Speech Recognition
- Capturing speech (analog) signals.
- Digitizing the sound waves, converting them to basic language units, or phonemes.
- Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right).
7. Speech Recognition Process Flow
- Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
8. Recognition Process Flow Summary
- Step 1: User Input
  - The system captures the user's voice in the form of an analog acoustic signal.
- Step 2: Digitization
  - Digitize the analog acoustic signal.
- Step 3: Phonetic Breakdown
  - Break the signal into phonemes.
9. Recognition Process Flow Summary (2)
- Step 4: Statistical Modeling
  - Map phonemes to their phonetic representations using a statistical model.
- Step 5: Matching
  - According to the grammar, phonetic representation, and dictionary, the system returns an n-best list (i.e., a word plus a confidence score).
- Grammar: the set of words or phrases that constrains the range of input or output in the voice application.
- Dictionary: the table mapping phonetic representations to words (e.g., thu, thee → the).
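The matching step above produces an n-best list of (word, confidence) pairs. A minimal sketch in Java of what such a result looks like (the class names and the scores are made-up illustrative values, not actual Sphinx output):

```java
import java.util.*;

public class NBestList {
    // A hypothesis pairs a word with a confidence score, as in Step 5 above.
    record Hypothesis(String word, double confidence) {}

    // Return the n highest-confidence hypotheses, best first.
    static List<Hypothesis> nBest(Map<String, Double> scores, int n) {
        return scores.entrySet().stream()
                .map(e -> new Hypothesis(e.getKey(), e.getValue()))
                .sorted(Comparator.comparingDouble((Hypothesis h) -> -h.confidence()))
                .limit(n)
                .toList();
    }

    public static void main(String[] args) {
        // Homophones like "write" and "right": language context decides the ranking.
        Map<String, Double> scores = Map.of("write", 0.35, "right", 0.60, "rite", 0.05);
        System.out.println(nBest(scores, 2)); // best hypothesis printed first
    }
}
```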
10. Architecture of CMU Sphinx
11. Introduction to CMU Sphinx
- A speech recognition system developed at Carnegie Mellon University.
- Consists of a set of libraries:
  - Core speech recognition functions.
  - Low-level audio capture.
- Continuous speech decoding.
- Speaker-independent.
12. Brief History of CMU Sphinx
- Sphinx-I (1987)
  - The first speaker-independent, high-performance ASR system in the world.
  - Written in C by Kai-Fu Lee (later at Microsoft Research Asia).
- Sphinx-II (1992)
  - Written in C by Xuedong Huang (later at Microsoft, working on Speech.NET).
  - 5-state HMM / N-gram LM.
  - (Reportedly, CMU Sphinx technology later contributed to the Microsoft Speech SDK.)
13. Brief History of CMU Sphinx (2)
- Sphinx 3 (1996)
  - Built by Eric Thayer and Mosur Ravishankar.
  - Slower than Sphinx-II, but the design is more flexible.
- Sphinx 4 (originally Sphinx 3j)
  - Refactored from Sphinx 3.
  - Fully implemented in Java.
  - Not finished yet.
14. Components of CMU Sphinx
15. Front End
- libsphinx2fe.lib / libsphinx2ad.lib
- Low-level audio access
- Continuous Listening and Silence Filtering
- Front End API overview.
16. Knowledge Base
- The data that drives the decoder.
- Three sets of data:
- Acoustic Model.
- Language Model.
- Lexicon (Dictionary).
17. Acoustic Model
- /model/hmm/6k
- A database of statistical models.
- Each statistical model represents a phoneme.
- Acoustic models are trained by analyzing large amounts of speech data.
18. HMM in the Acoustic Model
- HMMs represent each unit of speech in the acoustic model.
- A typical HMM uses 3-5 states to model a phoneme.
- Each state of an HMM is represented by a set of Gaussian mixture density functions.
- Sphinx2 default phone set.
19. Gaussian Mixtures
- Refer to the textbook, p. 33, eq. 38.
- Gaussian mixtures represent each state in an HMM.
- Each set of Gaussian mixtures is called a senone.
- HMMs can share senones.
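The textbook equation cited above is not reproduced on the slide; the standard form of a Gaussian mixture output density for HMM state j (using conventional symbols, which are assumed here) is:

```latex
b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\,
  \mathcal{N}\!\left(\mathbf{o}_t;\ \boldsymbol{\mu}_{jm},\ \boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{M} c_{jm} = 1,\quad c_{jm} \ge 0
```

where o_t is the feature vector at time t, and each of the M mixture components of state j has weight c_jm, mean vector μ_jm, and covariance matrix Σ_jm. A shared set of such mixtures is what the slide calls a senone.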
20. (No transcript)
21. Language Model
- Describes what is likely to be spoken in a particular context.
- Word transitions are defined in terms of transition probabilities.
- Helps to constrain the search space.
- See examples of LMs.
22. N-gram Language Model
- The probability of word N depends on words N-1, N-2, ...
- Bigrams and trigrams are most commonly used.
- Used for large-vocabulary applications such as dictation.
- Typically trained on a very large corpus (millions of words).
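The bigram case above can be sketched in a few lines of Java (a toy corpus and plain maximum-likelihood counts, with no smoothing; real dictation models are trained on millions of words):

```java
public class BigramSketch {
    // P(next | prev) = count(prev next) / count(prev), estimated from a corpus.
    static double bigramProb(String[] words, String prev, String next) {
        int prevCount = 0, pairCount = 0;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(prev)) {
                prevCount++;
                if (i + 1 < words.length && words[i + 1].equals(next)) pairCount++;
            }
        }
        return prevCount == 0 ? 0.0 : (double) pairCount / prevCount;
    }

    public static void main(String[] args) {
        String[] corpus = "the cat sat on the mat the cat ran".split(" ");
        // "the" occurs 3 times and is followed by "cat" twice, so P(cat | the) = 2/3.
        System.out.println(bigramProb(corpus, "the", "cat"));
    }
}
```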
23. Decoder
- Selects the next set of likely states.
- Scores incoming features against these states.
- Drops low-scoring states.
- Generates results.
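The four steps above roughly describe a Viterbi-style beam search. A minimal sketch (a toy two-state HMM with made-up parameters; Sphinx's actual decoder is far more elaborate):

```java
public class BeamDecoderSketch {
    // One Viterbi pass with beam pruning: a[i][j] = transition probability,
    // b[j][k] = probability of observation k in state j, init[j] = start probability.
    static double decode(double[][] a, double[][] b, int[] obs,
                         double[] init, double beam) {
        int n = a.length;
        double[] score = new double[n];
        for (int j = 0; j < n; j++) score[j] = init[j] * b[j][obs[0]];
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {          // select likely next states
                double best = 0;
                for (int i = 0; i < n; i++)
                    best = Math.max(best, score[i] * a[i][j]);
                next[j] = best * b[j][obs[t]];     // score incoming feature
            }
            double top = 0;
            for (double s : next) top = Math.max(top, s);
            for (int j = 0; j < n; j++)            // drop low-scoring states
                if (next[j] < top * beam) next[j] = 0;
            score = next;
        }
        double best = 0;
        for (double s : score) best = Math.max(best, s);
        return best;                               // likelihood of the best path
    }

    public static void main(String[] args) {
        double[][] a = {{0.7, 0.3}, {0.0, 1.0}};   // left-to-right toy HMM
        double[][] b = {{0.9, 0.1}, {0.2, 0.8}};
        int[] obs = {0, 1};
        System.out.println(decode(a, b, obs, new double[]{1.0, 0.0}, 1e-3));
    }
}
```

Scores shrink multiplicatively, which is why real decoders work in log space; the beam keeps only states within a fixed factor of the current best.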
24. Speech on the Java Platform
25. Sun Java Speech API
- First released on October 26, 1998.
- The Java Speech API allows Java applications to incorporate speech technology into their user interfaces.
- Defines a cross-platform API to support command-and-control recognizers, dictation systems, and speech synthesizers.
26. Implementations of the Java Speech API
- Open Source
- FreeTTS / CMU Sphinx4.
- IBM Speech for Java.
- Cloud Garden.
- LH TTS for Java Speech API.
- Conversa Web 3.0.
27. FreeTTS
- Fully implemented in Java.
- Based upon Flite 1.1, a small run-time speech synthesis engine developed at CMU.
- Partial support for JSAPI 1.0.
- Speech Recognition functions.
- JSML.
28. Sphinx 4 (Sphinx 3j)
- Fully implemented in Java.
- Speed is equal to or faster than Sphinx 3.
- The acoustic model and language model are under construction.
- Source code is available via CVS (but you cannot run any applications without the models!).

For example, to check out Sphinx4, you can use the following command:

cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx co sphinx4
29. Java Platform Issues
- GC makes managing data much easier.
- Native engines typically optimize inner loops for the CPU; this can't be done on the Java platform.
- Native engines arrange data to optimize cache hits; this can't really be done either.
30. DEMO
- Sphinx-II batch mode.
- Sphinx-II live mode.
- Sphinx-II client/server mode.
- A simple FreeTTS application.
- (Java-based) TTS vs. (C-based) SR.
- Motion Planner with FreeTTS, using Java Web Start. (This is the GRA course final project.)
31. Summary
- Sphinx is an open-source speech recognition system developed at CMU.
- The Front End, Knowledge Base, and Decoder form the core of an SR system.
- The FE receives and processes the speech signal.
- The Knowledge Base provides data for the Decoder.
- The Decoder searches the states and returns the results.
- Speech recognition is a challenging problem for the Java platform.
32. References
- Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU, 1996.
- Mosur K. Ravishankar and Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001.
- Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.
33. References (on-line)
- On-line documents of the Java Speech API
  - http://java.sun.com/products/java-media/speech/
- On-line documents of FreeTTS
  - http://freetts.sourceforge.net/docs/
- On-line documents of Sphinx-II
  - http://www.speech.cs.cmu.edu/sphinx/
34. Q & A