2
Architecture of the Speech-Enabled Web

Dr. Dave Burke, CTO, Voxpilot

3
Contents

Discuss core architectural principles behind the
World Wide Web and how they have been adopted by
the speech and media server industry
Introduce the important specifications powering
the Speech-Enabled Web
Look at how Speech-Enabled Web technology is
evolving

4
Background

Speech technologies are undergoing a shift from
closed, proprietary systems to an open-standards,
Web-centric model
Natural progression to a Web model:
Forums for standardising APIs and protocols
Solutions to distribution, scalability, security
Information sources readily available
Rich tools and skilled developers

5
Web Architectural Principles

The World Wide Web is a global information space
of interrelated resources
Three core concepts underlying the architecture
of the Web
Identification
Interaction
Representation

6
Web Architectural Principles

Identification refers to the URI mechanism by
which resources are uniquely identified
Interaction refers to protocols which define the
syntax and semantics of messages exchanged by
agents over a network (e.g. HTTP)
Representation refers to the formats in which
data are encoded (e.g. HTML)

7
Web Architectural Principles
The concepts of identification, interaction, and
representation are orthogonal
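
As a rough illustration of what orthogonality means here, the following Python sketch keeps the three concepts in separate fields, so that one can change while the other two stay fixed. The `Exchange` class and the example.com URI are hypothetical names made up for this sketch; they do not come from the presentation.

```python
from dataclasses import dataclass, replace
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Exchange:
    uri: str         # identification: which resource
    protocol: str    # interaction: how agents exchange messages
    media_type: str  # representation: how the data is encoded


# A resource served as a VoiceXML document over HTTP (hypothetical URI).
voice_view = Exchange(
    uri="http://example.com/welcome",
    protocol="HTTP",
    media_type="application/voicexml+xml",
)

# Changing only the representation leaves identification and interaction alone.
html_view = replace(voice_view, media_type="text/html")

print(urlsplit(voice_view.uri).scheme)
print(html_view.uri == voice_view.uri, html_view.protocol == voice_view.protocol)
```

The same identifier and protocol can carry a different format, which is the property that later lets speech formats and protocols evolve independently.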
8
Speech-Enabled Web

Existing Speech-Enabled Web technology typically
reuses existing back-end infrastructure or
results in the deployment of new Web
infrastructure

9
Speech-Related Standards

Two standards bodies are primarily responsible
for evolving the Web and the Internet:
World Wide Web Consortium (W3C)
Internet Engineering Task Force (IETF)
W3C is responsible for Web standards such as
XHTML, CSS, XML, XSL
IETF is the protocol engineering arm responsible
for protocols such as HTTP, FTP, SMTP

10
Speech-Related Standards

Representation: W3C Speech Interface Framework
Interaction: IETF HTTP, IETF MRCP, IETF SIP
Identification: HTTP and SIP URIs

11
W3C Speech Interface Framework

Family of specifications
Voice eXtensible Markup Language (VoiceXML)
Speech Recognition Grammar Specification (SRGS)
Speech Synthesis Markup Language (SSML)
Semantic Interpretation for Speech Recognition
(SISR)
Call Control eXtensible Markup Language (CCXML)
Markup-based languages for creating rich
human-computer dialogs and for managing call
control
Can work together or independently

12
W3C Speech Interface Framework
W3C Speech Interface Framework specifications
constitute standard representation formats for
the Speech-Enabled Web
13
W3C Speech Interface Framework

VoiceXML is a W3C Recommendation which allows
application authors to script directed dialogs
and mixed initiative dialogs
Advantages
Replaces complex, proprietary application APIs
with a portable, open standard
Flexible, easy-to-use language which abstracts
the complexities of the platform
Exploits the Web architecture model
Supports DTMF, speech, and video

14
W3C Speech Interface Framework

Simple VoiceXML example
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>
      Hello World!
    </block>
  </form>
</vxml>

15
W3C Speech Interface Framework

Speech Synthesis Markup Language (SSML) is a W3C
Recommendation for assisting in the generation of
speech in Web applications
Gives the underlying speech synthesiser hints on
how to render text as speech
Allows flexible prosodic control such as changing
volume, rate, and contour of the synthesised
speech
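
To make the prosodic controls concrete, here is a small Python sketch that builds an SSML `<prosody>` element with volume, rate, and contour attributes. The helper name `prosody_ssml` and the sample sentence are assumptions for illustration, and the namespace declaration is left out for brevity, as in the slide examples; how a given synthesiser honours these hints is engine-dependent.

```python
import xml.etree.ElementTree as ET


def prosody_ssml(text: str, volume: str, rate: str, contour: str) -> str:
    """Wrap text in an SSML <prosody> element inside <speak>."""
    speak = ET.Element("speak", {"version": "1.0"})
    prosody = ET.SubElement(
        speak, "prosody", {"volume": volume, "rate": rate, "contour": contour}
    )
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


# Louder, slower, with pitch rising at the start and falling at the end.
doc = prosody_ssml(
    "Your balance is low.", "loud", "slow", "(0%,+10Hz) (100%,-10Hz)"
)
print(doc)
```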

16
W3C Speech Interface Framework

Simple SSML example
<?xml version="1.0"?>
<speak version="1.0">
  Would you like
  <emphasis> debit </emphasis> or
  <emphasis> credit </emphasis>
</speak>

17
W3C Speech Interface Framework

Speech Recognition Grammar Specification (SRGS)
is a W3C Recommendation for specifying speech
recognition grammars
SRGS defines the words or phrases a speech
recogniser may recognise
Grammars constrain speech recognition input to
improve recognition performance and accuracy
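
The effect of constraining input can be sketched with a toy example in Python. This is only an approximation: a real recogniser applies the grammar during decoding rather than as a filter afterwards, and the phrase set, scores, and function name below are invented for illustration.

```python
from typing import List, Optional, Tuple

# Toy grammar: the only phrases the application is prepared to accept.
GRAMMAR = {"yes please", "no thank you"}


def best_in_grammar(hypotheses: List[Tuple[str, float]]) -> Optional[str]:
    """Return the highest-scoring hypothesis that the grammar allows."""
    allowed = [(phrase, score) for phrase, score in hypotheses if phrase in GRAMMAR]
    if not allowed:
        return None  # nothing matched the grammar
    return max(allowed, key=lambda pair: pair[1])[0]


# Made-up hypotheses: the top-scoring one is not in the grammar.
n_best = [("yes peas", 0.61), ("yes please", 0.58), ("guess please", 0.20)]
print(best_in_grammar(n_best))
```

Even though an out-of-grammar phrase scores highest here, the grammar steers the result to a phrase the application can act on.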

18
W3C Speech Interface Framework

Simple SRGS example
<?xml version="1.0"?>
<grammar version="1.0">
  <one-of>
    <item> yes please </item>
    <item> no thank you </item>
  </one-of>
</grammar>

19
W3C Speech Interface Framework

Semantic Interpretation for Speech Recognition
(SISR) is a specification for extracting the
semantics or meanings of a raw utterance
SISR is used inside SRGS grammars to annotate the
meaning of the matched words
Used to implement Natural Language Understanding

20
W3C Speech Interface Framework

Simple SISR example
<?xml version="1.0"?>
<grammar version="1.0"
         tag-format="semantics/1.0-literals">
  <one-of>
    <item> yes </item>
    <item> sure <tag>yes</tag> </item>
    <item> aye <tag>yes</tag> </item>
  </one-of>
</grammar>
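
A minimal Python sketch of how such literal tags map different words onto one meaning is given below. It is not a conforming SISR processor: it handles only single-phrase items like those in the example, the `interpret` function is a hypothetical helper, and falling back to the matched text when an item has no tag is a simplification made for this sketch.

```python
import xml.etree.ElementTree as ET
from typing import Optional

GRAMMAR = """<?xml version="1.0"?>
<grammar version="1.0" tag-format="semantics/1.0-literals">
  <one-of>
    <item> yes </item>
    <item> sure <tag>yes</tag> </item>
    <item> aye <tag>yes</tag> </item>
  </one-of>
</grammar>"""


def interpret(utterance: str) -> Optional[str]:
    """Return the semantic literal for an utterance, or None if no item matches."""
    spoken = utterance.strip().lower()
    for item in ET.fromstring(GRAMMAR).iter("item"):
        phrase = (item.text or "").strip()
        if phrase == spoken:
            tag = item.find("tag")
            # Use the tag literal when present, otherwise the matched text.
            return tag.text if tag is not None else phrase
    return None


for word in ("yes", "sure", "aye"):
    print(word, "->", interpret(word))
```

All three spoken forms yield the same value, so the application logic only has to test for one result.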

21
W3C Speech Interface Framework

Call Control XML (CCXML) provides call control
support for VoiceXML and other dialog languages
Supports
Multi-party conferencing
Multi-call handling and control
Asynchronous event handling
Call control protocol independence

22
W3C Speech Interface Framework

Simple CCXML example
<?xml version="1.0"?>
<ccxml version="1.0">
  <eventprocessor>
    <transition event="connection.connected">
      <dialogstart uri="helloworld.vxml"/>
    </transition>
  </eventprocessor>
</ccxml>

23
IETF Interaction Protocols

Three important IETF protocols powering the
Speech-Enabled Web
Hyper Text Transfer Protocol (HTTP)
Session Initiation Protocol (SIP)
Media Resource Control Protocol (MRCP)
HTTP, SIP, and MRCP are common interaction
protocols employed in the Speech-Enabled Web

24
IETF Interaction Protocols

HTTP is an open protocol designed for
distributed, collaborative, hypermedia
information systems
A lightweight, request/response protocol that
enables a robust and scalable distribution of
resources within the Web
Speech applications use HTTP for fetching and
transporting resources such as VoiceXML
documents, SRGS grammars, and audio files

25
IETF Interaction Protocols

HTTP affords the application developer the
ability to deploy his/her application remotely
from the platform provider
HTTP employs the http and https URI schemes for
identification of resources
Speech-Enabled Web Technology inherits resource
discovery, load-balancing, and failover solutions
from HTTP

28
IETF Interaction Protocols

Simple HTTP example

Request from the VoiceXML Browser to the Webserver:

GET /application.vxml HTTP/1.1
Host: webserver1.voxpilot.com

Response from the Webserver to the VoiceXML Browser:

HTTP/1.1 200 OK
Date: Wed, 25 May 2005 12:00:00 GMT
Content-Type: application/voicexml+xml
Content-Length: 128

<?xml version="1.0"?>
<vxml version="2.0"> . . .
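
The exchange above can be sketched in Python without touching the network: one function formats the request and another splits a raw response into its parts. Both function names are hypothetical, and the response string is a shortened stand-in for illustration, not captured traffic.

```python
def build_get(host: str, path: str) -> str:
    """Format a minimal HTTP/1.1 GET request."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"


def parse_response(raw: str):
    """Split a raw HTTP response into status code, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


request = build_get("webserver1.voxpilot.com", "/application.vxml")

# Shortened stand-in for the server's reply.
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/voicexml+xml\r\n"
    "\r\n"
    '<?xml version="1.0"?><vxml version="2.0"></vxml>'
)
status, headers, body = parse_response(response)
print(status, headers["content-type"])
```

A production VoiceXML browser would use a full HTTP client library instead, which also handles caching, redirects, and TLS.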
29
IETF Interaction Protocols

SIP is an open IP signalling protocol for
audio/video telephony, conferencing, presence,
and instant messaging
SIP is often called a rendezvous protocol
Gaining rapid adoption as the signalling protocol
of choice: the 3GPP has selected it for powering
the IP Multimedia Subsystem (IMS) architecture

30
IETF Interaction Protocols

SIP is a popular protocol for providing the
telephony interface to VoiceXML and CCXML servers
The sip and sips URI schemes are used for
identification of VoiceXML and CCXML resources

37
IETF Interaction Protocols

Simple SIP example

Message sequence between the SIP Phone and the
VoiceXML Browser:
INVITE
200 OK
ACK
media (both directions)
BYE
200 OK
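
The call flow can be modelled as a small state machine, sketched below in Python. The state names are informal labels chosen for this sketch rather than terms from the SIP specification, and the table covers only the messages shown in this example.

```python
# (current state, message) -> next state, for the simple flow in this example.
TRANSITIONS = {
    ("idle", "INVITE"): "inviting",
    ("inviting", "200 OK"): "answered",
    ("answered", "ACK"): "established",      # media flows in this state
    ("established", "BYE"): "terminating",
    ("terminating", "200 OK"): "terminated",
}


def run(messages):
    """Replay a message sequence and return the final call state."""
    state = "idle"
    for message in messages:
        key = (state, message)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {message!r} in state {state!r}")
        state = TRANSITIONS[key]
    return state


print(run(["INVITE", "200 OK", "ACK"]))
print(run(["INVITE", "200 OK", "ACK", "BYE", "200 OK"]))
```

Note that the same "200 OK" text means different things depending on the state, which is why the table is keyed on the pair of state and message.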
38
IETF Interaction Protocols

MRCP is an open protocol for controlling
network-based media resources such as speech
recognisers and speech synthesisers
Problem statement
Different markets have different preferred speech
engine vendors
Speech engine APIs are complex, diverse and
moving targets, often changing per version!
Platform integrators need to maintain
integrations to multiple vendors

39
IETF Interaction Protocols

MRCP delivers a standard protocol that alleviates
the integration burden for everyone
Win-win situation: speech vendors concentrate on
the speech engine, platform vendors concentrate
on the platform
MRCP is being widely adopted by leading speech
vendors

40
IETF Interaction Protocols

MRCP employs SIP to establish media and control
sessions to speech recognisers and from speech
synthesisers
MRCP is a text-based control protocol (inspired
by HTTP) and provides hooks to control media
resources and to receive progress notifications
By leveraging SIP, MRCP inherits resource
discovery, load-balancing, and failover solutions

45
IETF Interaction Protocols

Simple MRCP example

Message sequence between the VoiceXML Browser and
the Speech Recognizer:
RECOGNIZE
200 IN-PROGRESS
START-OF-SPEECH
RECOGNITION-COMPLETE
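
From the browser's side, the progress of a RECOGNIZE request can be followed by tracking the request state implied by each message that comes back. The Python sketch below assumes that START-OF-SPEECH leaves the request in progress and that RECOGNITION-COMPLETE finishes it; the initial label "SENT" and the function name are hypothetical.

```python
# Request state implied by each message the recogniser sends back.
# "200 IN-PROGRESS" is the response to RECOGNIZE; the other two are events.
REQUEST_STATE = {
    "200 IN-PROGRESS": "IN-PROGRESS",
    "START-OF-SPEECH": "IN-PROGRESS",
    "RECOGNITION-COMPLETE": "COMPLETE",
}


def track(messages_from_recognizer):
    """Return the sequence of request states after RECOGNIZE has been sent."""
    state = "SENT"  # hypothetical label for "request sent, no reply yet"
    history = [state]
    for message in messages_from_recognizer:
        if state == "COMPLETE":
            raise ValueError(f"{message!r} received after completion")
        state = REQUEST_STATE[message]
        history.append(state)
    return history


print(track(["200 IN-PROGRESS", "START-OF-SPEECH", "RECOGNITION-COMPLETE"]))
```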
46
Putting It All Together

Orthogonality allows new speech standards to be
created and evolved in parallel to each other

47
Putting It All Together

Web and Internet standards greatly alleviate the
hurdles of closed, proprietary interfaces and
APIs
Creating applications no longer requires
specialised professional services
Existing Web infrastructure and skills can be
leveraged
Scalability, robustness, security, resource
discovery solutions are inherited for free

48
A Glimpse Of The (Near) Future

Video and VoiceXML
Multimodal Interaction
IP Multimedia Subsystem (IMS)

49
A Glimpse Of The (Near) Future - Video

New media features, driven by improved terminal
and network capabilities, are enabling new and
exciting video applications and multimedia
services
Current 3G networks can support video through
the 3G-324M circuit-switched protocol

50
A Glimpse Of The (Near) Future - Video

VoiceXML 2.0 can be extended to support video
with virtually no modifications
Multimedia vs. multimodal: video is just another,
independent channel in a human-computer dialog
Reuse the <audio>, <record>, and <submit>
elements in VoiceXML for video

51
A Glimpse Of The (Near) Future - Video

Video and VoiceXML Examples
<prompt>
  Message 1, received yesterday at 10:45 pm.
  <audio src="http://192.168.1.1/msg01.3gp"/>
  End of message.
</prompt>

<record name="videomsg" beep="true"
        type="video/3gpp">
  <prompt timeout="5s">
    Record a video message after the beep.
  </prompt>
</record>

<submit next="save_message.pl"
        enctype="multipart/form-data"
        method="post" namelist="videomsg"/>

52
A Glimpse Of The (Near) Future - Multimodal

Multimodal is about combining multiple modes of
interaction simultaneously

53
A Glimpse Of The (Near) Future - Multimodal

Multimodal extends input/output
Speech
Graphical / Textual
Ink
Gestures
Multiple devices
Mobile
Desktop
Kiosks

54
A Glimpse Of The (Near) Future - Multimodal

Design-time vs. run-time challenges
Design time: combine representation languages,
e.g. XHTML+VoiceXML (X+V) and SALT
Run time: a more difficult problem, being worked
on by the W3C Multimodal Interaction group

55
A Glimpse Of The (Near) Future - IMS

IP Multimedia Subsystem (IMS) is a complete
network architecture introduced by the 3GPP
Focused on convergence of cellular and Internet
technologies on an all-IP network delivering
voice, data, content, video
Gaining momentum as the next-generation network
architecture positioned for wireline and wireless
convergence

56
A Glimpse Of The (Near) Future - IMS

IMS architecture, which is based on SIP, fits
seamlessly with Speech-Enabled Web technology
Speech-Enabled Web technologies offer a more
efficient, standard, and portable paradigm for
service creation than SIP servlets / CGI
CCXML Browsers and VoiceXML Browsers map cleanly
to IMS functional elements

57
A Glimpse Of The (Near) Future - IMS

CCXML is a SIP Application Server
VoiceXML is a Media Resource Function Controller

[Diagram: UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, with links labelled SIP and RTP]
58
A Glimpse Of The (Near) Future - IMS

Different physical packaging options

[Diagram: UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, with links labelled SIP and RTP and a box labelled Physical MRF]
59
A Glimpse Of The (Near) Future - IMS

Different physical packaging options

[Diagram: UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, with links labelled SIP and RTP and a box labelled Physical MRF]
61
Summary

Discussed WWW architecture principles and how
they've been applied to speech technology
Reviewed important technical standards being
employed in Speech-Enabled Web technology
Took a brief look at new developments relevant to
Speech-Enabled Web technology

62
Summary
Thank you!
