Title: Architecture of the Speech-Enabled Web
2Architecture of the Speech-Enabled Web
- Dr. Dave Burke, CTO, Voxpilot
3Contents
- Discuss the core architectural principles behind the World Wide Web and how they have been adopted by the speech and media server industry
- Introduce the important specifications powering the Speech-Enabled Web
- Look at how Speech-Enabled Web technology is evolving
4Background
- Speech technologies are undergoing a shift from closed, proprietary systems to an open-standard, Web-centric model
- Natural progression to a Web model:
  - Forums for standardising APIs and protocols
  - Solutions to distribution, scalability, security
  - Information sources readily available
  - Rich tools and skilled developers
5Web Architectural Principles
- The World Wide Web is a global information space of interrelated resources
- Three core concepts underlie the architecture of the Web:
  - Identification
  - Interaction
  - Representation
6Web Architectural Principles
- Identification refers to the URI mechanism by which resources are uniquely identified
- Interaction refers to the protocols that define the syntax and semantics of messages exchanged by agents over a network (e.g. HTTP)
- Representation refers to the formats in which data are encoded (e.g. HTML)
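A minimal Python sketch of how the three concepts separate for a single resource. It is illustrative only: the `describe` helper is a hypothetical name, the URI reuses the host name from the HTTP slides, and treating the URI scheme as the interaction protocol is a simplifying assumption.

```python
from urllib.parse import urlsplit

def describe(uri: str, media_type: str) -> dict:
    """Separate identification, interaction, and representation (hypothetical helper)."""
    parts = urlsplit(uri)
    return {
        "identification": uri,         # the URI names the resource
        "interaction": parts.scheme,   # assumption: the scheme implies the protocol
        "representation": media_type,  # the format the data is encoded in
    }

info = describe("http://webserver1.voxpilot.com/application.vxml",
                "application/voicexml+xml")
print(info)
```

Changing any one of the three values leaves the other two untouched, which is the sense in which the concepts are independent.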
7Web Architectural Principles
The concepts of identification, interaction, and
representation are orthogonal
8Speech-Enabled Web
- Speech-Enabled Web technology typically either reuses existing back-end infrastructure or results in the deployment of new Web infrastructure
9Speech-Related Standards
- Two standards bodies are primarily responsible for evolving the Web and the Internet:
  - World Wide Web Consortium (W3C)
  - Internet Engineering Task Force (IETF)
- W3C is responsible for Web standards such as XHTML, CSS, XML, XSL
- IETF is the protocol engineering arm, responsible for protocols such as HTTP, FTP, SMTP
10Speech-Related Standards
- Representation
  - W3C Speech Interface Framework
- Interaction
  - IETF HTTP
  - IETF MRCP
  - IETF SIP
- Identification
  - HTTP and SIP URIs
11W3C Speech Interface Framework
- Family of specifications:
  - Voice eXtensible Markup Language (VoiceXML)
  - Speech Recognition Grammar Specification (SRGS)
  - Speech Synthesis Markup Language (SSML)
  - Semantic Interpretation for Speech Recognition (SISR)
  - Call Control eXtensible Markup Language (CCXML)
- Markup-based languages for creating rich human-computer dialogs and for managing call control
- Can work together or independently
12W3C Speech Interface Framework
W3C Speech Interface Framework specifications
constitute standard representation formats for
the Speech-Enabled Web
13W3C Speech Interface Framework
- VoiceXML is a W3C Recommendation that allows application authors to script directed dialogs and mixed-initiative dialogs
- Advantages:
  - Replaces complex, proprietary application APIs with a portable, open standard
  - Flexible, easy-to-use language which abstracts the complexities of the platform
  - Exploits the Web architecture model
  - Supports DTMF, speech, and video
14W3C Speech Interface Framework
- Simple VoiceXML example:

    <?xml version="1.0"?>
    <vxml version="2.0">
      <form>
        <block>
          Hello World!
        </block>
      </form>
    </vxml>
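To illustrate that a VoiceXML document is ordinary XML that any Web toolchain can process, the following Python sketch parses the same markup with the standard library and collects the text of each `<block>`. The `prompts` helper is a hypothetical name; a real VoiceXML browser would hand this text to a speech synthesiser rather than print it.

```python
import xml.etree.ElementTree as ET

VXML = """<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>
      Hello World!
    </block>
  </form>
</vxml>"""

def prompts(document: str) -> list[str]:
    """Return the whitespace-normalised text of every <block> (hypothetical helper)."""
    root = ET.fromstring(document)
    return [" ".join(block.text.split())
            for block in root.iter("block") if block.text]

print(prompts(VXML))
```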
15W3C Speech Interface Framework
- Speech Synthesis Markup Language (SSML) is a W3C Recommendation for assisting in the generation of speech in Web applications
- Gives the underlying speech synthesiser hints on how to render text as speech
- Allows flexible prosodic control, such as changing the volume, rate, and contour of the synthesised speech
16W3C Speech Interface Framework
- Simple SSML example:

    <?xml version="1.0"?>
    <speak version="1.0">
      Would you like
      <emphasis> debit </emphasis> or
      <emphasis> credit </emphasis>
    </speak>
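Because SSML is plain XML, an application can generate it on the fly. A minimal Python sketch, assuming a hypothetical `choice_prompt` helper; like the slide, it omits the namespace and language attributes that a complete SSML document would carry.

```python
import xml.etree.ElementTree as ET

def choice_prompt(lead: str, first: str, second: str) -> str:
    """Build a two-option SSML prompt with both options emphasised (hypothetical helper)."""
    speak = ET.Element("speak", version="1.0")
    speak.text = lead + " "
    option_a = ET.SubElement(speak, "emphasis")
    option_a.text = first
    option_a.tail = " or "      # text between the two <emphasis> elements
    option_b = ET.SubElement(speak, "emphasis")
    option_b.text = second
    return ET.tostring(speak, encoding="unicode")

ssml = choice_prompt("Would you like", "debit", "credit")
print(ssml)
```

Building the markup through an XML library rather than string concatenation means characters such as `&` in the prompt text are escaped automatically.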
17W3C Speech Interface Framework
- Speech Recognition Grammar Specification (SRGS) is a W3C Recommendation for specifying speech recognition grammars
- SRGS defines the words or phrases a speech recogniser may recognise
- Grammars constrain speech recognition input to improve recognition performance and accuracy
18W3C Speech Interface Framework
- Simple SRGS example:

    <?xml version="1.0"?>
    <grammar version="1.0">
      <one-of>
        <item> yes please </item>
        <item> no thank you </item>
      </one-of>
    </grammar>
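The constraining role of a grammar can be sketched in Python by reading the `<item>` alternatives and checking whether a text utterance is among them. This is only an approximation under stated assumptions: a real recogniser scores audio against the grammar rather than comparing strings, and `allowed_phrases` and `in_grammar` are hypothetical names.

```python
import xml.etree.ElementTree as ET

GRAMMAR = """<?xml version="1.0"?>
<grammar version="1.0">
  <one-of>
    <item> yes please </item>
    <item> no thank you </item>
  </one-of>
</grammar>"""

def allowed_phrases(grammar: str) -> set[str]:
    """Collect the normalised phrases listed as <item> alternatives."""
    root = ET.fromstring(grammar)
    return {" ".join(item.text.split()).lower()
            for item in root.iter("item") if item.text}

def in_grammar(utterance: str, grammar: str) -> bool:
    """True if the utterance matches one of the grammar's phrases exactly."""
    return " ".join(utterance.split()).lower() in allowed_phrases(grammar)

print(in_grammar("Yes please", GRAMMAR), in_grammar("maybe", GRAMMAR))
```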
19W3C Speech Interface Framework
- Semantic Interpretation for Speech Recognition (SISR) is a specification for extracting the semantics, or meaning, of a raw utterance
- SISR is used inside SRGS grammars to annotate the meaning of the matched words
- Used to implement Natural Language Understanding
20W3C Speech Interface Framework
- Simple SISR example:

    <?xml version="1.0"?>
    <grammar version="1.0"
             tag-format="semantics/1.0-literals">
      <one-of>
        <item> yes </item>
        <item> sure <tag>yes</tag> </item>
        <item> aye <tag>yes</tag> </item>
      </one-of>
    </grammar>
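The effect of the `<tag>` annotations can be sketched in Python: several spoken forms collapse to one semantic value. The `interpret` helper is hypothetical, and falling back to the matched words when an item has no tag is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET

GRAMMAR = """<?xml version="1.0"?>
<grammar version="1.0" tag-format="semantics/1.0-literals">
  <one-of>
    <item> yes </item>
    <item> sure <tag>yes</tag> </item>
    <item> aye <tag>yes</tag> </item>
  </one-of>
</grammar>"""

def interpret(utterance: str, grammar: str):
    """Return the semantic literal for a matched utterance, or None if no item matches."""
    root = ET.fromstring(grammar)
    spoken = utterance.strip().lower()
    for item in root.iter("item"):
        words = (item.text or "").strip()
        if words == spoken:
            tag = item.find("tag")
            # Assumption: with no <tag>, the matched words are the result.
            return tag.text if tag is not None else words
    return None

print(interpret("aye", GRAMMAR))
```

The application then only needs to handle the value `yes`, regardless of which phrase the caller used.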
21W3C Speech Interface Framework
- Call Control XML (CCXML) provides call control support for VoiceXML and other dialog languages
- Supports:
  - Multi-party conferencing
  - Multi-call handling and control
  - Asynchronous event handling
  - Call control protocol independence
22W3C Speech Interface Framework
- Simple CCXML example:

    <?xml version="1.0"?>
    <ccxml version="1.0">
      <eventprocessor>
        <transition event="connection.connected">
          <dialogstart uri="helloworld.vxml"/>
        </transition>
      </eventprocessor>
    </ccxml>
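CCXML's event-driven model can be sketched in Python by looking up which dialogs a given event would start. The sketch uses the element and attribute names exactly as they appear on the slide; `dialogs_for` is a hypothetical helper, and a real CCXML browser would also evaluate conditions and state, which are ignored here.

```python
import xml.etree.ElementTree as ET

CCXML = """<?xml version="1.0"?>
<ccxml version="1.0">
  <eventprocessor>
    <transition event="connection.connected">
      <dialogstart uri="helloworld.vxml"/>
    </transition>
  </eventprocessor>
</ccxml>"""

def dialogs_for(event: str, document: str) -> list[str]:
    """List the dialog URIs started by transitions that match the event name."""
    root = ET.fromstring(document)
    uris = []
    for transition in root.iter("transition"):
        if transition.get("event") == event:
            uris.extend(d.get("uri") for d in transition.iter("dialogstart"))
    return uris

print(dialogs_for("connection.connected", CCXML))
```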
23IETF Interaction Protocols
- Three important IETF protocols power the Speech-Enabled Web:
  - Hypertext Transfer Protocol (HTTP)
  - Session Initiation Protocol (SIP)
  - Media Resource Control Protocol (MRCP)
- HTTP, SIP, and MRCP are common interaction protocols employed in the Speech-Enabled Web
24IETF Interaction Protocols
- HTTP is an open protocol designed for distributed, collaborative, hypermedia information systems
- A lightweight, request/response protocol that enables robust and scalable distribution of resources within the Web
- Speech applications use HTTP for fetching and transporting resources such as VoiceXML documents, SRGS grammars, and audio files
25IETF Interaction Protocols
- HTTP allows application developers to deploy their applications remotely from the platform provider
- HTTP employs the http and https URI schemes for identification of resources
- Speech-Enabled Web technology inherits resource discovery, load-balancing, and failover solutions from HTTP
28IETF Interaction Protocols
VoiceXML Browser to Webserver:

    GET /application.vxml HTTP/1.1
    Host: webserver1.voxpilot.com

Webserver to VoiceXML Browser:

    HTTP/1.1 200 OK
    Date: Tue, 25 May 2005 12:00:00 GMT
    Content-Type: application/voicexml+xml
    Content-Length: 128

    <?xml version="1.0"?>
    <vxml version="2.0"> . . .
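The request/response exchange can be sketched in Python without touching the network. The functions `build_request` and `parse_response` are hypothetical helpers, and the response is assembled locally with a shortened body (so its Content-Length differs from the slide's) rather than fetched from a real server.

```python
def build_request(host: str, path: str) -> str:
    """Format the GET request a VoiceXML browser sends for a document."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"

def parse_response(raw: str):
    """Split a response into status code, lower-cased headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, _, header_block = head.partition("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_block.split("\r\n"):
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body

# Illustrative response; the body is a shortened stand-in for the document.
BODY = '<?xml version="1.0"?>\n<vxml version="2.0"><form><block>Hello World!</block></form></vxml>'
RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/voicexml+xml\r\n"
    f"Content-Length: {len(BODY.encode('utf-8'))}\r\n"
    "\r\n" + BODY
)

request = build_request("webserver1.voxpilot.com", "/application.vxml")
status, headers, body = parse_response(RESPONSE)
print(status, headers["content-type"])
```

The Content-Type header is what tells the browser that the body is a VoiceXML representation, independently of the URI used to identify it.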
29IETF Interaction Protocols
- SIP is an open IP signalling protocol for audio/video telephony, conferencing, presence, and instant messaging
- SIP is often called a rendezvous protocol
- Gaining rapid adoption as the signalling protocol of choice: the 3GPP has selected it to power the IP Multimedia Subsystem (IMS) architecture
30IETF Interaction Protocols
- SIP is a popular protocol for providing the telephony interface to VoiceXML and CCXML servers
- The sip and sips URI schemes are used for identification of VoiceXML and CCXML resources
37IETF Interaction Protocols
Message sequence between the SIP Phone and the VoiceXML Browser:
  1. INVITE
  2. 200 OK
  3. ACK
  4. media (in both directions)
  5. BYE
  6. 200 OK
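The call flow above can be read as a small state machine. The Python sketch below is a simplification under stated assumptions: the state names are invented for illustration, only the message order from the slide is modelled, and real SIP adds provisional responses, retransmissions, and error cases.

```python
# Hypothetical call states; the transitions mirror the message order on the slide.
TRANSITIONS = {
    ("idle", "INVITE"): "inviting",
    ("inviting", "200 OK"): "answered",
    ("answered", "ACK"): "established",   # media flows while in this state
    ("established", "BYE"): "terminating",
    ("terminating", "200 OK"): "idle",
}

def run_call(messages):
    """Walk the message sequence and return the final call state."""
    state = "idle"
    for message in messages:
        key = (state, message)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {message!r} in state {state!r}")
        state = TRANSITIONS[key]
    return state

print(run_call(["INVITE", "200 OK", "ACK"]))
```

Note that the same `200 OK` message means different things depending on the state, which is why the table is keyed on the pair of state and message.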
38IETF Interaction Protocols
- MRCP is an open protocol for controlling network-based media resources such as speech recognisers and speech synthesisers
- Problem statement:
  - Different markets have different preferred speech engine vendors
  - Speech engine APIs are complex, diverse, and moving targets, often changing with each version
  - Platform integrators need to maintain integrations with multiple vendors
39IETF Interaction Protocols
- MRCP delivers a standard protocol that alleviates the integration burden for everyone
- Win-win situation: speech vendors concentrate on the speech engine, platform vendors concentrate on the platform
- MRCP is being widely adopted by leading speech vendors
40IETF Interaction Protocols
- MRCP employs SIP to establish media and control sessions with speech recognisers and speech synthesisers
- MRCP is a text-based control protocol (inspired by HTTP) that provides hooks to control media resources and to receive progress notifications
- By leveraging SIP, MRCP inherits resource discovery, load-balancing, and failover solutions
45IETF Interaction Protocols
Message sequence between the VoiceXML Browser and the Speech Recognizer:
  1. RECOGNIZE
  2. 200 IN-PROGRESS
  3. START-OF-SPEECH
  4. RECOGNITION-COMPLETE
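The recognition exchange combines one request, one response, and two progress notifications. A rough Python sketch of how a client might track the request's state follows; the message names are taken from the slide, while the `PENDING` and `COMPLETE` labels and the `track_recognition` helper are assumptions made for illustration, not the protocol's wire format.

```python
def track_recognition(messages):
    """Return the request state after each message in a RECOGNIZE exchange."""
    state = None
    history = []
    for message in messages:
        if message == "RECOGNIZE" and state is None:
            state = "PENDING"        # assumed label: sent, not yet acknowledged
        elif message == "200 IN-PROGRESS" and state == "PENDING":
            state = "IN-PROGRESS"    # recogniser accepted the request
        elif message == "START-OF-SPEECH" and state == "IN-PROGRESS":
            state = "IN-PROGRESS"    # caller began speaking; request still running
        elif message == "RECOGNITION-COMPLETE" and state == "IN-PROGRESS":
            state = "COMPLETE"       # assumed label: final result delivered
        else:
            raise ValueError(f"unexpected {message!r} in state {state!r}")
        history.append(state)
    return history

print(track_recognition(
    ["RECOGNIZE", "200 IN-PROGRESS", "START-OF-SPEECH", "RECOGNITION-COMPLETE"]))
```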
46Putting It All Together
- Orthogonality allows new speech standards to be created and evolved in parallel with each other
47Putting It All Together
- Web and Internet standards greatly lower the hurdles posed by closed, proprietary interfaces and APIs
- Creating applications no longer requires specialised professional services
- Existing Web infrastructure and skills can be leveraged
- Scalability, robustness, security, and resource discovery solutions are inherited for free
48A Glimpse Of The (Near) Future
- Video and VoiceXML
- Multimodal Interaction
- IP Multimedia Subsystem (IMS)
49A Glimpse Of The (Near) Future - Video
- New media features, driven by improved terminal and network capabilities, are enabling new and exciting video applications and multimedia services
- Current 3G networks can support video through the 3G-324M circuit-switched protocol
50A Glimpse Of The (Near) Future - Video
- VoiceXML 2.0 can be extended to support video with virtually no modifications
- Multimedia vs. multimodal: video is another, independent channel in a human-computer dialog
- Reuse the <audio>, <record>, and <submit> elements in VoiceXML for video
51A Glimpse Of The (Near) Future - Video
- Video and VoiceXML examples:

    <prompt>
      Message 1, received yesterday at 10:45 pm.
      <audio src="http://192.168.1.1/msg01.3gp"/>
      End of message.
    </prompt>

    <record name="videomsg" beep="true" type="video/3gpp">
      <prompt timeout="5s">
        Record a video message after the beep.
      </prompt>
    </record>

    <submit next="save_message.pl" enctype="multipart/form-data"
            method="post" namelist="videomsg"/>
52A Glimpse Of The (Near) Future - Multimodal
- Multimodal is about combining multiple modes of
interaction simultaneously
53A Glimpse Of The (Near) Future - Multimodal
- Multimodal extends input/output
- Speech
- Graphical / Textual
- Ink
- Gestures
- Multiple devices
- Mobile
- Desktop
- Kiosks
54A Glimpse Of The (Near) Future - Multimodal
- Design-time vs. run-time challenges
- Design time: combine representation languages, e.g.
  - XHTML+VoiceXML (X+V)
  - SALT
- Run time: a more difficult problem, being worked on by the W3C Multimodal Interaction group
55A Glimpse Of The (Near) Future - IMS
- IP Multimedia Subsystem (IMS) is a complete network architecture introduced by the 3GPP
- Focused on the convergence of cellular and Internet technologies on an all-IP network delivering voice, data, content, and video
- Gaining momentum as the next-generation network architecture, positioned for wireline/wireless convergence
56A Glimpse Of The (Near) Future - IMS
- The IMS architecture, which is based on SIP, fits seamlessly with Speech-Enabled Web technology
- Speech-Enabled Web technologies offer a more efficient, standard, and portable paradigm for service creation than SIP servlets / CGI
- CCXML Browsers and VoiceXML Browsers map cleanly to IMS functional elements
57A Glimpse Of The (Near) Future - IMS
- CCXML is a SIP Application Server; VoiceXML is a Media Resource Function Controller
[Diagram: IMS network elements UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, with links labelled SIP and RTP.]
58A Glimpse Of The (Near) Future - IMS
- Different physical packaging options
[Diagram: UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, with links labelled SIP and RTP and a "Physical MRF" boundary showing one packaging option.]
59A Glimpse Of The (Near) Future - IMS
- Different physical packaging options
[Diagram: UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, with links labelled SIP and RTP and a "Physical MRF" boundary showing another packaging option.]
60A Glimpse Of The (Near) Future - IMS
- Different physical packaging options
[Diagram: UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, with links labelled SIP and RTP and a "Physical MRF" boundary showing a further packaging option.]
61Summary
- Discussed WWW architecture principles and how they've been applied to speech technology
- Reviewed important technical standards being employed in Speech-Enabled Web technology
- Took a brief look at new developments relevant to Speech-Enabled Web technology
62Summary
Thank you!