Speech Technologies and VoiceXML


1
Speech Technologies and VoiceXML
  • try
  • Department of Computer Science
  • National Cheng-Chi University

2
References
  • [1] Bob Edgar (2001), The VoiceXML Handbook,
    NY: CMP Books.
  • [2] Dave Raggett (2001), Getting Started with
    VoiceXML 2.0, W3C.
  • [3] Sun Microsystems (1998), Java Speech Grammar
    Format Specification v1.0, Sun Microsystems.
  • [4] Chetan Sharma and Jeff Kunins (2002),
    VoiceXML: Strategies and Techniques for Effective
    Voice Application Development with VoiceXML 2.0,
    Wiley.
  • [5] Brian Eberman, Jerry Carter, Darren Meyer,
    David Goddeau (2002), Building VoiceXML Browsers
    with OpenVXI, NY: ACM Press.

3
References
  • [6] Microsoft (2002), Speech Technology Overview,
    http://www.microsoft.com/speech/evaluation/techover/
  • [7] VoiceGenie Technologies Inc. (2001), White
    Paper: Speaking Freely About the VoiceGenie
    VoiceXML Gateway and the VoiceXML Interpreter,
    VoiceGenie Technologies Inc.
  • [8] W3C (2002), VoiceXML Specification v2.0, W3C.
  • [9] Chun-Feng Liao (2002), Basics of Speech
    Recognition, NCCU Computer Center.

4
Presentation Agenda
  • Background of voice technologies
  • ASR/TTS
  • Voice browsing with VoiceXML
  • VoiceXML architecture
  • Implementations of VoiceXML platforms
  • VoiceXML document structure
  • Bringing voice technologies into virtual
    environments

5
Voice Technologies
  • In the mid- to late 1990s, personal computers
    started to become powerful enough to support
    automatic speech recognition (ASR).
  • The two key underlying technologies behind these
    advances are speech recognition (SR) and
    text-to-speech synthesis (TTS).

6
Classification of Voice Applications
  • Basic interactive voice response (IVR)
    • Computer: For stock quotes, press 1. For
      trading, press 2.
    • Human: (presses DTMF 1)
  • Basic speech ASR
    • C: Say the stock name for a price quote.
    • H: Lucent Technologies
7
Classification of Voice Applications
  • Advanced speech ASR
    • C: Stock Services, how may I help you?
    • H: Uh, what's Lucent trading at?
  • Near-natural language ASR
    • C: How may I help you?
    • H: Um, yeah, I'd like to get the current price
      of Lucent Technologies.
    • C: Lucent is up two at sixty eight and a half.
    • H: OK. I want to buy one hundred shares at
      market price.
    • C: ...

8
Speech Recognition
  • Capturing speech (analog) signals.
  • Digitizing the sound waves and converting them
    to basic language units, or phonemes.
  • Constructing words from phonemes, and
    contextually analyzing the words to ensure
    correct spelling for words that sound alike (such
    as "write" and "right").

9
Speech Recognition Process Flow
Source: Microsoft Speech.NET Home
(http://www.microsoft.com/speech/)
10
Speech Recognition Process Flow
  • Step 1: User Input
    • The system captures the user's voice in the
      form of an analog acoustic signal.
  • Step 2: Digitization
    • The analog acoustic signal is digitized.
  • Step 3: Phonetic Breakdown
    • The signal is broken into phonemes.
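
Step 2 (digitization) can be illustrated with a minimal Python sketch. It samples a continuous signal at fixed intervals and quantizes each sample to a signed integer. The 8 kHz rate and 8-bit depth are assumptions chosen because they are typical of telephone audio; the function name `digitize` and the 440 Hz test tone are hypothetical and not taken from the slides.

```python
import math

def digitize(signal, duration_s, sample_rate_hz=8000, bits=8):
    """Sample a continuous signal and quantize it to signed integers."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8-bit signed samples
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                # sampling: discrete time points
        amplitude = max(-1.0, min(1.0, signal(t)))   # clip to [-1, 1]
        samples.append(round(amplitude * levels))    # quantization
    return samples

def tone(t):
    """Stand-in for the analog input: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)

pcm = digitize(tone, duration_s=0.01)         # 10 ms of audio
print(len(pcm), pcm[:5])
```

A real recognizer would receive these samples from a sound card or telephony board; the later steps (phonetic breakdown, statistical modeling) operate on the digitized samples rather than on the analog signal.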

11
Speech Recognition Process Flow
  • Step 4: Statistical Modeling
    • Phonemes are mapped to their phonetic
      representation using a statistical model (e.g.,
      a hidden Markov model, HMM).
  • Step 5: Matching
    • Using the grammar, the phonetic representation,
      and the dictionary, the system returns an
      n-best list (i.e., candidate words, each with a
      confidence score).
  • Grammar: the set of words or phrases that
    constrains the range of input or output in the
    voice application.
  • Dictionary: the table that maps phonetic
    representations to words (e.g., "thu", "thee" →
    "the").
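
The matching step can be sketched in Python as follows. This is only an illustration of how a dictionary and a grammar might combine to produce an n-best list; the phonetic spellings, the scores, and the function name `n_best` are hypothetical, and real engines score hypotheses with acoustic and language models rather than fixed numbers.

```python
# Hypothetical dictionary: phonetic representation -> candidate words.
DICTIONARY = {
    "thu": ["the"],
    "thee": ["the"],
    "rayt": ["right", "write"],   # homophones share one pronunciation
}

# Hypothetical grammar: the words the application accepts at this point.
GRAMMAR = {"the", "right"}

def n_best(hypotheses, dictionary, grammar, n=3):
    """Return up to n (word, confidence) pairs, best first."""
    best = {}
    for phonetic, score in hypotheses:
        for word in dictionary.get(phonetic, []):
            if word in grammar:                       # grammar constrains output
                best[word] = max(score, best.get(word, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Phonetic hypotheses with made-up confidence scores.
hypotheses = [("thu", 0.62), ("thee", 0.81), ("rayt", 0.40)]
result = n_best(hypotheses, DICTIONARY, GRAMMAR)
print(result)
```

Here "write" is dropped because the grammar does not allow it, which is how a grammar narrows the set of words the recognizer may return.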

12
Speech Synthesis
  • Speech Synthesis, or text-to-speech, is the
    process of converting text into spoken language.
  • Breaking down the words into phonemes
  • Analyzing the text for items that need special
    handling, such as numbers and currency amounts.
  • Generating the digital audio for playback.
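
The special handling of numbers and currency amounts is often called text normalization. A minimal Python sketch, assuming English output, whole numbers from 0 to 999, and dollar amounts only (all names here are hypothetical, not part of any TTS engine's API):

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")

def _unit(count, singular):
    """Spell out a count with a unit, e.g. 'fifty cents' or 'one dollar'."""
    return f"{number_to_words(count)} {singular}{'' if count == 1 else 's'}"

def _currency(match):
    dollars, cents = int(match.group(1)), int(match.group(2) or 0)
    words = _unit(dollars, "dollar")
    if cents:
        words += " and " + _unit(cents, "cent")
    return words

def normalize(text):
    """Replace dollar amounts, then remaining whole numbers, with words."""
    text = re.sub(r"\$(\d{1,3})(?:\.(\d{2}))?(?!\d)", _currency, text)
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)

print(normalize("Buy 100 shares at $68.50"))
```

Numbers outside the supported range are left unchanged; a production normalizer would also cover dates, abbreviations, and larger values before the text is broken down into phonemes.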

13
Speech Synthesis
Source: Microsoft Speech.NET Home
(http://www.microsoft.com/speech/)
14
Pervasive Computing Model
  • E-business has moved from a client-server model
    to a web-centric model.
  • Once connected to the Internet, people can get
    any information they want, but they want more
    convenient ways to connect to it.
  • Lou Gerstner, CEO of IBM: the pervasive computing
    model is a billion people interacting with a
    million e-businesses with a trillion devices
    interconnected.

15
(No Transcript)
16
Voice Browsing
  • VoiceXML instead of HTML
  • A voice browser instead of an ordinary web
    browser
  • Phone instead of PC.

17
Show: A Scenario of Using VoiceXML
18
VoiceXML Overview
  • A language for specifying voice dialogs.
  • Voice dialogs use audio prompts and
    text-to-speech (TTS) for output; touch-tone keys
    (DTMF) and automatic speech recognition (ASR) for
    input.
  • Main input/output device (initially) is the
    phone.
  • Leverages the Internet for application
    development and delivery.
  • A standard language enables portability (it
    unifies dialog control languages).
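
To make the input and output modes concrete, the Python sketch below embeds a small VoiceXML 2.0 document and checks that it is well-formed XML. The document itself is an illustrative assumption, not one taken from the presentation: a `<menu>` speaks a prompt via TTS, and each `<choice>` can be selected either by a touch-tone key (`dtmf`) or by speaking its text.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Illustrative VoiceXML document (an assumption, not from the slides).
DOCUMENT = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <prompt>
      For stock quotes, press 1 or say quotes.
      For trading, press 2 or say trading.
    </prompt>
    <choice dtmf="1" next="#quotes">quotes</choice>
    <choice dtmf="2" next="#trading">trading</choice>
  </menu>
  <form id="quotes">
    <block><prompt>You chose stock quotes.</prompt></block>
  </form>
  <form id="trading">
    <block><prompt>You chose trading.</prompt></block>
  </form>
</vxml>
"""

# Parsing fails with an exception if the document is not well-formed.
root = ET.fromstring(DOCUMENT)

# Map each touch-tone key to the dialog it leads to.
choices = {c.get("dtmf"): c.get("next")
           for c in root.iter(f"{{{VXML_NS}}}choice")}
print(choices)
```

The Python code only inspects the document; actually running the dialog requires a voice browser on a VoiceXML platform.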

19
History of VoiceXML
Source: VoiceXML Forum (http://www.voicexml.org)
20
Making use of mature Internet Technologies
  • Leverage existing web application development
    tools.
  • Leverage existing web infrastructure for
    application delivery.
  • Clean separation of service logic from user
    interaction.

21
VoiceXML Platform Architecture
22
VoiceXML Platform Architecture-1
  • Telephone and telephone network: connects the
    caller's telephone to the telephony server.
  • VoiceXML gateway
  • Voice browser
  • Audio input: speech recognition (ASR), touch-tone
    (DTMF), audio recording.
  • Audio output: audio playback, speech synthesis
    (TTS).
  • Interface, call controls

23
VoiceXML Platform Architecture-2
  • VoiceXML documents
    • Dialog and flow control
    • Client-side scripting (ECMAScript)
    • Speech recognition grammar
    • Speech synthesis pronunciation control
  • Document servers (web servers)
    • Serve static VoiceXML documents or audio files.
  • Application servers
    • Generate VoiceXML documents dynamically.
    • Server-side application logic
    • Connect to a database or database interface
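
Dynamic generation on an application server can be sketched as below. Python's standard library stands in for whatever server-side technology is actually used; the function name `build_quote_document` and the sample company name and price are hypothetical. Building the document with an XML library rather than string concatenation means special characters in the data are escaped automatically.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

def build_quote_document(company, price_text):
    """Return a VoiceXML document that speaks one stock quote."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "quote"})
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    # In a real application these values would come from a database lookup.
    prompt.text = f"{company} is trading at {price_text}."
    return ET.tostring(vxml, encoding="unicode")

document = build_quote_document("R&D Corp", "sixty eight and a half")
print(document)
```

The voice browser fetches this output over HTTP exactly as it would fetch a static file, which is what lets service logic stay on the server while the VoiceXML document carries only the user interaction.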

24
Voice Gateway
25
VoiceXML Gateway(detail)

26
Implementations of VoiceXML Gateways
  • In Taiwan
    • Yes Mobile
    • Chunghwa Telecom Laboratories
    • eWings Technologies, Inc.
  • Free
    • IBM Voice Server SDK
  • Open source
    • CMU OpenVXI

27
DEMO: How to Write and Run VoiceXML Applications
28
DEMO: Generate VoiceXML Documents Dynamically
Using ASP.NET
29
VoiceXML Document Structure.
30
A Simple VoiceXML Document
31
DEMO: VoiceXML/HTML Comparison
32
Bringing Voice Technologies to 3D Virtual
Environments
33
Related Research
  • Raymond L. Smith III and Stephen D. Roberts
    • Use voice input commands to operate
      simulation animation.
    • The efficiency issues of ASR/TTS are taken into
      account.
  • Satoru, Osamu, Katunobu, Takashi, Tomoyoshi,
    Hideki, Shotaro, Takio, and Katsuhiko
    • Create a 3D virtual user that can converse with
      the real user via speaker and microphone.
    • The virtual user can learn words and recognize
      human faces.

34
We can do more...
  • Speak to many users who are moving in a virtual
    environment.
  • Build the system in a distributed environment
    (i.e., the web).
  • Make use of XML technology (VoiceXML/SALT).

35
Problems to Solve
  • Voice/animation synchronization.
  • Protocol integration.
  • ASR/TTS integration and its performance issues.
  • Virtual user autonomy.
  • Voice propagation range issues.

36
System Design Prototype
37
Summary
  • Speech is the most natural way for humans to
    communicate, so it will become an important
    modality in human-computer interaction (HCI).
  • VoiceXML has revolutionized the development and
    deployment of speech recognition telephony
    applications.
  • Adding speech facilities to 3D virtual
    environments will make the UI friendlier and
    enable multi-modal input/output.
  • My research on this topic will focus on
    voice-animation synchronization and on enabling
    SR/TTS in distributed 3D virtual environments.

38
Q & A