Title: Speech Technologies and VoiceXML
1Speech Technologies and VoiceXML
- try
- Department of Computer Science
- National Cheng-Chi University
2Reference
- 1. Bob Edgar (2001), The VoiceXML Handbook, NY: CMP Books.
- 2. Dave Raggett (2001), Getting Started with VoiceXML 2.0, W3C.
- 3. Sun Microsystems (1998), Java Speech Grammar Format Specification v1.0, Sun Microsystems.
- 4. Chetan Sharma and Jeff Kunins (2002), VoiceXML: Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0, Wiley.
- 5. Brian Eberman, Jerry Carter, Darren Meyer, David Goddeau (2002), Building VoiceXML Browsers with OpenVXI, NY: ACM Press.
3Reference
- 6. Microsoft (2002), Speech Technology Overview, http://www.microsoft.com/speech/evaluation/techover/
- 7. VoiceGenie Technologies Inc. (2001), White Paper: Speaking Freely About The VoiceGenie VoiceXML Gateway and the VoiceXML Interpreter, VoiceGenie Technologies Inc.
- 8. W3C (2002), VoiceXML Specification v2.0, W3C.
- 9. Chun-Feng Liao (2002), Basics of Speech Recognition, NCCU Computer Center.
4Presentation Agenda
- Voice technology background
- ASR/TTS
- Voice browsing with VoiceXML
- VoiceXML architecture
- Implementations of VoiceXML Platform
- VoiceXML document structure
- Bringing Voice Technologies into Virtual
Environment
5Voice Technologies
- In the mid- to late 1990s, personal computers
started to become powerful enough to support ASR.
- The two key underlying technologies behind these
advances are speech recognition (SR) and
text-to-speech synthesis (TTS).
6Classification of Voice Application
- Basic interactive voice response (IVR)
- Computer: For stock quotes, press 1. For
trading, press 2.
- Human: (presses DTMF 1)
- Basic speech ASR
- C: Say the stock name for a price quote.
- H: Lucent Technologies
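The first two classes above can be sketched in a few lines of Python. This is only an illustration: the menu table, the service names, and both function names are hypothetical and do not come from the slides, and a real IVR platform would receive key presses and recognition results from its telephony layer.

```python
from typing import Optional, Set

# Hypothetical menu table for the "press 1 / press 2" prompt.
MENU = {"1": "stock_quotes", "2": "trading"}


def route_dtmf(digit: str) -> str:
    """Basic IVR: map a single DTMF key press to a service."""
    # Unknown keys fall back to replaying the menu (an assumed convention).
    return MENU.get(digit, "repeat_menu")


def route_speech(utterance: str, stock_names: Set[str]) -> Optional[str]:
    """Basic speech ASR: accept only an exact name from a fixed list."""
    name = utterance.strip().lower()
    return name if name in stock_names else None
```

The difference between the two classes is only the input channel: a key press selects from a fixed menu, while basic speech ASR selects from a fixed vocabulary.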
7Classification of Voice Application
- Advanced speech ASR
- C: Stock Services, how may I help you?
- H: Uh, what's Lucent trading at?
- Near-natural language ASR
- C: How may I help you?
- H: Um, yeah, I'd like to get the current price
of Lucent Technologies.
- C: Lucent is up two at sixty eight and a half.
- H: OK. I want to buy one hundred shares at
market price.
- C: ...
8Speech Recognition
- Capturing speech (analog) signals
- Digitizing the sound waves, converting them to
basic language units or phonemes.
- Constructing words from phonemes, and
contextually analyzing the words to ensure
correct spelling for words that sound alike (such
as write and right).
9Speech Recognition Process Flow
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
10Speech Recognition Process Flow
- Step 1: User Input
- The system captures the user's voice in the form of
an analog acoustic signal.
- Step 2: Digitization
- Digitize the analog acoustic signal.
- Step 3: Phonetic Breakdown
- Break the signal into phonemes.
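Step 2 can be illustrated with a short Python sketch that samples a continuous signal and quantizes it to 16-bit integers. The 8 kHz rate (a typical telephone-quality rate), the 440 Hz test tone, and the function names are assumptions made for illustration; the slides do not specify them. Phonetic breakdown (Step 3) is not covered here.

```python
import math

SAMPLE_RATE_HZ = 8000   # assumed telephone-quality sampling rate
MAX_INT16 = 32767       # largest positive 16-bit PCM value


def digitize(signal, duration_s, sample_rate=SAMPLE_RATE_HZ):
    """Sample signal(t) (values in [-1, 1]) and quantize to 16-bit ints."""
    n_samples = int(duration_s * sample_rate)
    samples = []
    for i in range(n_samples):
        value = signal(i / sample_rate)          # sampling in time
        value = max(-1.0, min(1.0, value))       # clip out-of-range input
        samples.append(round(value * MAX_INT16)) # quantization in amplitude
    return samples


def tone(t):
    """Stand-in for the analog input: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)


pcm = digitize(tone, duration_s=0.01)
```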
11Speech Recognition Process Flow
- Step 4: Statistical Modeling
- Map phonemes to their phonetic representation
using a statistical model (e.g., HMM).
- Step 5: Matching
- Based on the grammar, the phonetic representation,
and the dictionary, the system returns an n-best
list (i.e., each entry is a word plus a confidence score).
- Grammar: the set of words or phrases that constrains
the range of input or output in the voice
application.
- Dictionary: the mapping table between phonetic
representations and words (e.g., thu, thee -> the).
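The matching step can be sketched as a dictionary lookup followed by a grammar filter and a ranking by confidence. Everything in this Python sketch is a toy assumption: the phonetic spellings, the scores, and the function name are made up for illustration, and a real recognizer would take its hypotheses from the statistical model in Step 4.

```python
# Toy dictionary: phonetic representation -> word.
# "thu" and "thee" both map to "the", as in the slide's example.
DICTIONARY = {"thu": "the", "thee": "the", "kwoht": "quote", "boht": "boat"}

# Toy grammar: the only words this application accepts.
GRAMMAR = {"the", "quote", "price"}


def n_best(hypotheses, grammar=GRAMMAR, dictionary=DICTIONARY, n=3):
    """Return up to n (word, confidence) pairs, best first.

    hypotheses is a list of (phonetic_representation, score) pairs.
    """
    best = {}
    for phonetic, score in hypotheses:
        word = dictionary.get(phonetic)
        if word is None or word not in grammar:
            continue  # unknown pronunciation, or word outside the grammar
        # Keep the highest score when several pronunciations give one word.
        best[word] = max(score, best.get(word, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


result = n_best([("kwoht", 0.82), ("boht", 0.75), ("thu", 0.40), ("thee", 0.55)])
```

Here "boat" is dropped even though it scores well, because the grammar does not allow it; this is how a grammar narrows the range of accepted input.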
12Speech Synthesis
- Speech synthesis, or text-to-speech, is the
process of converting text into spoken language.
- Breaking down the words into phonemes.
- Analyzing the text for special handling, such as
numbers and currency amounts.
- Generating the digital audio for playback.
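The "special handling" step is often called text normalization. The Python sketch below expands small integers and dollar amounts into words before they would be handed to a synthesizer. It is a simplified illustration under stated assumptions: it only handles values from 0 to 999 and US dollar amounts, and the function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def _currency(match):
    """Expand a matched amount such as $68.50 into spoken words."""
    dollars = int(match.group(1))
    cents = int(match.group(2) or 0)
    spoken = f"{number_to_words(dollars)} {'dollar' if dollars == 1 else 'dollars'}"
    if cents:
        spoken += f" and {number_to_words(cents)} {'cent' if cents == 1 else 'cents'}"
    return spoken


def normalize(text: str) -> str:
    """Expand dollar amounts first, then any remaining small integers."""
    text = re.sub(r"\$(\d{1,3})(?:\.(\d{2}))?(?!\d)", _currency, text)
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)


spoken = normalize("Buy 100 shares at $68.50")
```

Amounts outside the supported range, such as $1000, are left unchanged by this sketch.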
13Speech Synthesis
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
14Pervasive Computing Model
- E-business has changed from the client-server model
to the web-centric model.
- Once connected to the Internet, one can get any
information one wants. But people want a more
convenient way to connect to the Internet.
- Lou Gerstner, CEO of IBM: the pervasive computing model
is a billion people interacting with a million
e-businesses with a trillion devices interconnected.
16Voice Browsing
- VoiceXML instead of HTML
- A voice browser instead of an ordinary web
browser
- Phone instead of PC
17Show: A Scenario of Using VoiceXML
18VoiceXML Overview
- A language for specifying voice dialogs.
- Voice dialogs use audio prompts and
text-to-speech (TTS) for output; touch-tone keys
(DTMF) and automatic speech recognition (ASR) for
input.
- Main input/output device (initially) is the
phone.
- Leverages the Internet for application
development and delivery.
- Standard language enables portability (unifies
dialog control languages).
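To make the overview concrete, the Python sketch below embeds a small VoiceXML 2.0 menu and reads its choices back with the standard library XML parser, roughly as a voice browser would before prompting the caller. The prompt wording and the target file names (quotes.vxml, trading.vxml) are hypothetical; the `menu`, `prompt`, and `choice` elements and the `dtmf` and `next` attributes are taken from the VoiceXML 2.0 specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# A minimal menu: each choice can be selected by key press or by speech.
DOCUMENT = """<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu>
    <prompt>For stock quotes, press 1 or say quotes.
      For trading, press 2 or say trading.</prompt>
    <choice dtmf="1" next="quotes.vxml">quotes</choice>
    <choice dtmf="2" next="trading.vxml">trading</choice>
  </menu>
</vxml>
"""


def menu_choices(root):
    """Map each DTMF key to (spoken phrase, next document)."""
    ns = {"v": VXML_NS}
    return {
        choice.get("dtmf"): (choice.text.strip(), choice.get("next"))
        for choice in root.findall("v:menu/v:choice", ns)
    }


root = ET.fromstring(DOCUMENT)
choices = menu_choices(root)
```

The same document drives both input modes: the `dtmf` attribute covers touch-tone input, and the text of each `choice` is what the caller may say.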
19History of VoiceXML
Source: VoiceXML Forum (http://www.voicexml.org)
20Making use of mature Internet Technologies
- Leverage existing web application development
tools.
- Leverage existing web infrastructure for
application delivery.
- Clean separation of service logic from user
interaction.
21VoiceXML Platform Architecture
22VoiceXML Platform Architecture-1
- Telephone and telephone network: connects the caller's
telephone with the telephony server
- VoiceXML Gateway
- Voice Browser
- Audio input: speech recognition (ASR), touch-tone
(DTMF), audio recording
- Audio output: audio playback, speech synthesis
(TTS)
- Interface, call controls
23VoiceXML Platform Architecture-2
- VoiceXML Documents
- Dialog and flow control
- Client-side scripting (ECMAScript)
- Speech Recognition grammar
- Speech Synthesis pronunciation control
- Document servers (web servers)
- Serve static VoiceXML documents or audio files.
- Application servers
- Generate VoiceXML documents dynamically.
- Server-side application logic
- Connect to Database, or database interface
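The application server role can be sketched as a function that builds a VoiceXML document from data at request time. The demo later in this deck uses ASP.NET; the sketch below uses Python purely for illustration, and the function name, form id, and prompt wording are hypothetical. In a real deployment the stock name and price would come from the database layer listed above.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def render_quote_vxml(stock: str, price: str) -> str:
    """Build a one-prompt VoiceXML 2.0 document announcing a stock price."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "quote"})
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    # ElementTree escapes characters such as "&" when serializing.
    prompt.text = f"{stock} is trading at {price}."
    return ET.tostring(vxml, encoding="unicode")


document = render_quote_vxml("AT&T", "68.5")
```

Building the document through an XML API rather than string concatenation keeps the output well-formed even when the data contains markup characters.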
24Voice Gateway
25VoiceXML Gateway (detail)
26Implementations of VoiceXML Gateways
- In Taiwan
- Yes Mobile
- Chunghwa Telecom Laboratories
- eWings Technologies, Inc
- Free
- IBM Voice Server SDK
- Open Source
- CMU OpenVXI
27DEMO: How to Write and Run VoiceXML Applications?
28DEMO: Generate VoiceXML Documents
Dynamically Using ASP.NET
29VoiceXML Document Structure
30A Simple VoiceXML Document
31DEMO: VoiceXML/HTML Comparison
32Bringing Voice Technologies to 3D Virtual
Environment
33Related Research
- Raymond L. Smith III and Stephen D. Roberts
- Using voice input commands to operate
simulation-animation.
- The efficiency issues of ASR/TTS are taken into
account.
- Satoru, Osamu, Katunobu, Takashi, Tomoyoshi, Hideki, Shotaro,
Takio and Katsuhiko
- Create a 3D virtual user who can speak with the user
via speaker and microphone.
- The virtual user has the ability to learn words and
recognize human faces.
34We can do more...
- Speak to many users who are moving in the virtual
environment.
- The system is built in a distributed environment (i.e.,
the web).
- Make use of XML technology (VoiceXML/SALT).
35Problems to Solve
- Voice/animation synchronization.
- Protocol integration.
- ASR/TTS integration and its performance issues.
- Virtual user autonomy.
- Voice propagation range issues.
36System Design Prototype
37Summary
- Speech is the most natural way for humans to
communicate; thus it will become an important
modality in HCI.
- VoiceXML has revolutionized the development and deployment
of speech recognition telephony applications.
- Adding speech facilities to 3D virtual
environments will make the UI friendlier and enable
multi-modal input/output.
- My research interest on this topic will focus on
voice-animation synchronization and enabling SR/TTS
in distributed 3D virtual environments.
38Q & A