Architecture of the Speech-Enabled Web

2
Architecture of the Speech-Enabled Web
  • Dr. Dave Burke, CTO, Voxpilot

3
Contents
  • Discuss core architectural principles behind the
    World Wide Web and how they have been adopted by
    the speech and media server industry
  • Introduce the important specifications powering
    the Speech-Enabled Web
  • Look at how Speech-Enabled Web technology is
    evolving

4
Background
  • Speech technologies are undergoing a shift from
    closed, proprietary systems to an open-standards,
    Web-centric model
  • Natural progression to a Web model
  • Forums for standardising APIs and protocols
  • Solutions to distribution, scalability, security
  • Information sources readily available
  • Rich tools and skilled developers

5
Web Architectural Principles
  • The World Wide Web is a global information space
    of interrelated resources
  • Three core concepts underlying the architecture
    of the Web
  • Identification
  • Interaction
  • Representation

6
Web Architectural Principles
  • Identification refers to the URI mechanism by
    which resources are uniquely identified
  • Interaction refers to protocols which define the
    syntax and semantics of messages exchanged by
    agents over a network (e.g. HTTP)
  • Representation refers to the formats in which
    data are encoded (e.g. HTML)
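
As a rough illustration of how identification stays separate from interaction, the sketch below reads only the scheme of a URI and looks up a protocol for it. The host names and the PROTOCOL_FOR_SCHEME table are illustrative assumptions, not part of the presentation.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table (an assumption for this sketch): the URI scheme
# identifies the resource, the protocol is how an agent interacts with it.
PROTOCOL_FOR_SCHEME = {
    "http": "HTTP",
    "https": "HTTP over TLS",
    "sip": "SIP",
    "sips": "SIP over TLS",
}

def describe(uri: str) -> tuple[str, str]:
    """Return (scheme, interaction protocol) for a URI."""
    scheme = urlsplit(uri).scheme
    return scheme, PROTOCOL_FOR_SCHEME.get(scheme, "unknown")

# Placeholder URIs; neither host comes from the slides.
print(describe("http://www.example.com/application.vxml"))
print(describe("sip:dialog@example.com"))
```

The representation (for example a VoiceXML document) does not appear in the lookup at all, which is the point of keeping the three concepts orthogonal.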

7
Web Architectural Principles
The concepts of identification, interaction, and
representation are orthogonal
8
Speech-Enabled Web
  • Existing Speech-Enabled Web technology typically
    reuses existing back-end infrastructure or
    results in the deployment of new Web
    infrastructure

9
Speech-Related Standards
  • Two standards bodies are primarily responsible
    for evolving the Web and Internet
  • World Wide Web Consortium (W3C)
  • Internet Engineering Task Force (IETF)
  • W3C is responsible for Web standards such as
    XHTML, CSS, XML, XSL
  • IETF is the protocol engineering arm responsible
    for protocols such as HTTP, FTP, SMTP

10
Speech-Related Standards
  • Representation
      • W3C Speech Interface Framework
  • Interaction
      • IETF HTTP
      • IETF MRCP
      • IETF SIP
  • Identification
      • HTTP and SIP URIs

11
W3C Speech Interface Framework
  • Family of specifications
  • Voice eXtensible Markup Language (VoiceXML)
  • Speech Recognition Grammar Specification (SRGS)
  • Speech Synthesis Markup Language (SSML)
  • Semantic Interpretation for Speech Recognition
    (SISR)
  • Call Control eXtensible Markup Language (CCXML)
  • Markup based languages for creating rich
    human-computer dialogs and for managing call
    control
  • Can work together or independently

12
W3C Speech Interface Framework
W3C Speech Interface Framework specifications
constitute standard representation formats for
the Speech-Enabled Web
13
W3C Speech Interface Framework
  • VoiceXML is a W3C Recommendation which allows
    application authors to script directed dialogs
    and mixed initiative dialogs
  • Advantages
  • Replaces complex, proprietary application APIs
    with a portable, open standard
  • Flexible, easy-to-use language which abstracts
    the complexities of the platform
  • Exploits the Web architecture model
  • Supports DTMF, speech, and video

14
W3C Speech Interface Framework
  • Simple VoiceXML example
      <?xml version="1.0"?>
      <vxml version="2.0">
        <form>
          <block>
            Hello World!
          </block>
        </form>
      </vxml>
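
Because VoiceXML is ordinary XML, generic Web tooling can read it. The following is a minimal sketch, assuming Python's standard XML parser and a hypothetical helper named block_texts; it only extracts the text a VoiceXML browser would speak and does not interpret the dialog.

```python
import xml.etree.ElementTree as ET

# The Hello World document from the slide, as a plain string.
VXML = """<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>"""

def block_texts(document: str) -> list[str]:
    """Collect the text of every <block> element in document order."""
    root = ET.fromstring(document)
    return [(block.text or "").strip() for block in root.iter("block")]

print(block_texts(VXML))
```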

15
W3C Speech Interface Framework
  • Speech Synthesis Markup Language (SSML) is a W3C
    Recommendation for assisting in the generation of
    speech in Web applications
  • Gives the underlying speech synthesiser hints on
    how to render text as speech
  • Allows flexible prosodic control such as changing
    volume, rate, and contour of the synthesised
    speech

16
W3C Speech Interface Framework
  • Simple SSML example
      <?xml version="1.0"?>
      <speak version="1.0">
        Would you like
        <emphasis> debit </emphasis> or
        <emphasis> credit </emphasis>
      </speak>
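
Prosodic control of the kind mentioned earlier can be layered on the same prompt. The sketch below generates it programmatically; it assumes SSML's prosody element with its rate and volume attributes, and the helper name emphasised_choice and the chosen attribute values are illustrative only.

```python
import xml.etree.ElementTree as ET

def emphasised_choice(first: str, second: str) -> str:
    """Build an SSML prompt that emphasises two alternatives."""
    speak = ET.Element("speak", version="1.0")
    # <prosody> carries the rate and volume hints; the values are illustrative.
    prosody = ET.SubElement(speak, "prosody", rate="slow", volume="loud")
    prosody.text = "Would you like "
    first_el = ET.SubElement(prosody, "emphasis")
    first_el.text = first
    first_el.tail = " or "
    second_el = ET.SubElement(prosody, "emphasis")
    second_el.text = second
    return ET.tostring(speak, encoding="unicode")

print(emphasised_choice("debit", "credit"))
```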

17
W3C Speech Interface Framework
  • Speech Recognition Grammar Specification (SRGS)
    is a W3C Recommendation for specifying speech
    recognition grammars
  • SRGS defines the words or phrases a speech
    recogniser may recognise
  • Grammars constrain speech recognition input to
    improve recognition performance and accuracy

18
W3C Speech Interface Framework
  • Simple SRGS example
      <?xml version="1.0"?>
      <grammar version="1.0">
        <one-of>
          <item> yes please </item>
          <item> no thank you </item>
        </one-of>
      </grammar>
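
To make the idea of constraining input concrete, here is a toy sketch that treats the grammar as a list of allowed phrases and checks a text utterance against it. It is not a speech recogniser; the helper names allowed_phrases and in_grammar are hypothetical, and only the flat one-of/item structure from the slide is handled.

```python
import xml.etree.ElementTree as ET

# The grammar from the slide, with its closing tag in place.
GRAMMAR = """<?xml version="1.0"?>
<grammar version="1.0">
  <one-of>
    <item> yes please </item>
    <item> no thank you </item>
  </one-of>
</grammar>"""

def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so phrases compare reliably."""
    return " ".join(text.split()).lower()

def allowed_phrases(grammar: str) -> set[str]:
    """Return every phrase listed in an <item> element."""
    root = ET.fromstring(grammar)
    return {normalise(item.text or "") for item in root.iter("item")}

def in_grammar(utterance: str, grammar: str = GRAMMAR) -> bool:
    """True if the utterance matches one of the grammar's phrases."""
    return normalise(utterance) in allowed_phrases(grammar)

print(in_grammar("Yes please"), in_grammar("maybe later"))
```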

19
W3C Speech Interface Framework
  • Semantic Interpretation for Speech Recognition
    (SISR) is a specification for extracting the
    semantics or meanings of a raw utterance
  • SISR is used inside SRGS grammars to annotate the
    meaning of the matched words
  • Used to implement Natural Language Understanding

20
W3C Speech Interface Framework
  • Simple SISR example
      <?xml version="1.0"?>
      <grammar version="1.0"
               tag-format="semantics/1.0-literals">
        <one-of>
          <item> yes </item>
          <item> sure <tag>yes</tag> </item>
          <item> aye <tag>yes</tag> </item>
        </one-of>
      </grammar>
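
The effect of the tags can be sketched as a lookup from what was said to what was meant. This toy example assumes that an item without a tag yields its own text as the result; the function name interpret is hypothetical and only the flat structure shown on the slide is handled.

```python
import xml.etree.ElementTree as ET
from typing import Optional

GRAMMAR = """<?xml version="1.0"?>
<grammar version="1.0" tag-format="semantics/1.0-literals">
  <one-of>
    <item> yes </item>
    <item> sure <tag>yes</tag> </item>
    <item> aye <tag>yes</tag> </item>
  </one-of>
</grammar>"""

def interpret(utterance: str, grammar: str = GRAMMAR) -> Optional[str]:
    """Map a matched phrase to its semantic literal, or None if no match."""
    spoken = utterance.strip().lower()
    root = ET.fromstring(grammar)
    for item in root.iter("item"):
        phrase = (item.text or "").strip()
        if phrase == spoken:
            tag = item.find("tag")
            # Assumption for this sketch: an untagged item means its own text.
            return tag.text.strip() if tag is not None else phrase
    return None

print(interpret("aye"), interpret("sure"), interpret("nope"))
```

Three different utterances collapse to the single value "yes", so the application logic only has to test for one result.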

21
W3C Speech Interface Framework
  • Call Control XML (CCXML) provides call control
    support for VoiceXML and other dialog languages
  • Supports
  • Multi-party conferencing
  • Multi-call handling and control
  • Asynchronous event handling
  • Call control protocol independence

22
W3C Speech Interface Framework
  • Simple CCXML example
      <?xml version="1.0"?>
      <ccxml version="1.0">
        <eventprocessor>
          <transition event="connection.connected">
            <dialogstart uri="helloworld.vxml"/>
          </transition>
        </eventprocessor>
      </ccxml>
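
The event-driven model can be sketched as a lookup from an incoming event name to the dialogs that should start. This is a toy dispatcher, not a CCXML interpreter; the function name dialogs_for is hypothetical, and the uri attribute is copied from the slide rather than checked against the specification.

```python
import xml.etree.ElementTree as ET

CCXML = """<?xml version="1.0"?>
<ccxml version="1.0">
  <eventprocessor>
    <transition event="connection.connected">
      <dialogstart uri="helloworld.vxml"/>
    </transition>
  </eventprocessor>
</ccxml>"""

def dialogs_for(event_name: str, document: str = CCXML) -> list[str]:
    """Return the dialog URIs started by transitions matching the event."""
    root = ET.fromstring(document)
    started: list[str] = []
    for transition in root.iter("transition"):
        if transition.get("event") == event_name:
            started.extend(d.get("uri") for d in transition.iter("dialogstart"))
    return started

print(dialogs_for("connection.connected"))
```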

23
IETF Interaction Protocols
  • Three important IETF protocols powering the
    Speech-Enabled Web
  • Hyper Text Transfer Protocol (HTTP)
  • Session Initiation Protocol (SIP)
  • Media Resource Control Protocol (MRCP)
  • HTTP, SIP, and MRCP are common interaction
    protocols employed in the Speech-Enabled Web

24
IETF Interaction Protocols
  • HTTP is an open protocol designed for
    distributed, collaborative, hypermedia
    information systems
  • A lightweight, request/response protocol that
    enables a robust and scalable distribution of
    resources within the Web
  • Speech applications use HTTP for fetching and
    transporting resources such as VoiceXML
    documents, SRGS grammars, and audio files

25
IETF Interaction Protocols
  • HTTP affords the application developer the
    ability to deploy his/her application remotely
    from the platform provider
  • HTTP employs the http and https URI schemes for
    identification of resources
  • Speech-Enabled Web Technology inherits resource
    discovery, load-balancing, and failover solutions
    from HTTP

28
IETF Interaction Protocols
  • Simple HTTP example

VoiceXML Browser → Webserver:
    GET /application.vxml HTTP/1.1
    Host: webserver1.voxpilot.com

Webserver → VoiceXML Browser:
    HTTP/1.1 200 OK
    Date: Wed, 25 May 2005 12:00:00 GMT
    Content-Type: application/voicexml+xml
    Content-Length: 128

    <?xml version="1.0"?>
    <vxml version="2.0"> . . .
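
The request and response in this exchange can be sketched without a network connection. The helpers build_get and parse_response below are hypothetical names, and the canned response body is a placeholder rather than the 128-byte document from the slide.

```python
def build_get(path: str, host: str) -> bytes:
    """Serialise a minimal HTTP/1.1 GET request."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")

def parse_response(raw: bytes) -> tuple[int, dict[str, str], bytes]:
    """Split a raw response into status code, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers: dict[str, str] = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body

# Canned response for illustration; the body is a placeholder document.
BODY = b'<?xml version="1.0"?><vxml version="2.0"></vxml>'
RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/voicexml+xml\r\n"
    b"Content-Length: " + str(len(BODY)).encode("ascii") + b"\r\n\r\n" + BODY
)

print(build_get("/application.vxml", "webserver1.voxpilot.com"))
print(parse_response(RESPONSE)[:2])
```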
29
IETF Interaction Protocols
  • SIP is an open IP signalling protocol for
    audio/video telephony, conferencing, presence,
    and instant messaging
  • SIP is often called a rendezvous protocol
  • Gaining rapid adoption as the signalling protocol
    of choice: the 3GPP has selected it for powering
    the IP Multimedia Subsystem (IMS) architecture

30
IETF Interaction Protocols
  • SIP is a popular protocol for providing the
    telephony interface to VoiceXML and CCXML servers
  • The sip and sips URI schemes are used for
    identification of VoiceXML and CCXML resources

37
IETF Interaction Protocols
  • Simple SIP example

[Sequence diagram between SIP Phone and VoiceXML Browser, in order: INVITE, 200 OK, ACK, media (two flows), BYE, 200 OK]
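
The call lifecycle in this sequence can be modelled as a small state machine. The state names and the run_call helper are assumptions made for this sketch; real SIP involves provisional responses, retransmissions, and many more cases.

```python
# (current state, message) -> next state; a toy model of the slide's exchange.
TRANSITIONS = {
    ("idle", "INVITE"): "inviting",
    ("inviting", "200 OK"): "answered",
    ("answered", "ACK"): "established",  # media flows in this state
    ("established", "BYE"): "terminating",
    ("terminating", "200 OK"): "terminated",
}

def run_call(messages: list[str]) -> str:
    """Feed messages through the state machine and return the final state."""
    state = "idle"
    for message in messages:
        key = (state, message)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {message!r} in state {state!r}")
        state = TRANSITIONS[key]
    return state

print(run_call(["INVITE", "200 OK", "ACK"]))
print(run_call(["INVITE", "200 OK", "ACK", "BYE", "200 OK"]))
```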
38
IETF Interaction Protocols
  • MRCP is an open protocol for controlling
    network-based media resources such as speech
    recognisers and speech synthesisers
  • Problem statement
  • Different markets have different preferred speech
    engine vendors
  • Speech engine APIs are complex, diverse and
    moving targets, often changing per version!
  • Platform integrators need to maintain
    integrations to multiple vendors

39
IETF Interaction Protocols
  • MRCP delivers a standard protocol that alleviates
    the integration burden for everyone
  • Win-win situation: speech vendors concentrate on
    the speech engine, platform vendors concentrate
    on the platform
  • MRCP is being widely adopted by leading speech
    vendors

40
IETF Interaction Protocols
  • MRCP employs SIP to establish media and control
    sessions with speech recognisers and speech
    synthesisers
  • MRCP is a text-based control protocol (inspired
    by HTTP) and provides hooks to control media
    resources and to receive progress notifications
  • By leveraging SIP, MRCP inherits resource
    discovery, load-balancing, and failover solutions

45
IETF Interaction Protocols
  • Simple MRCP example

[Sequence diagram between VoiceXML Browser and Speech Recognizer, in order: RECOGNIZE, 200 IN-PROGRESS, START-OF-SPEECH, RECOGNITION-COMPLETE]
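
A client following this sequence has to track one outstanding request while progress notifications arrive. The class below is a toy tracker, not an MRCP client: the name RecognitionRequest, the PENDING and FAILED states, and the method names are assumptions for illustration, and only the messages named on the slide are handled.

```python
from dataclasses import dataclass, field

@dataclass
class RecognitionRequest:
    """Toy tracker for one RECOGNIZE request (hypothetical, for illustration)."""
    state: str = "PENDING"
    events: list[str] = field(default_factory=list)

    def on_response(self, status_code: int, request_state: str) -> None:
        # Treat any 2xx status as acceptance of the request.
        self.state = request_state if 200 <= status_code < 300 else "FAILED"

    def on_event(self, name: str) -> None:
        # Record every notification; the final one completes the request.
        self.events.append(name)
        if name == "RECOGNITION-COMPLETE":
            self.state = "COMPLETE"

request = RecognitionRequest()
request.on_response(200, "IN-PROGRESS")
request.on_event("START-OF-SPEECH")
request.on_event("RECOGNITION-COMPLETE")
print(request.state, request.events)
```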
46
Putting It All Together
  • Orthogonality allows new speech standards to be
    created and evolved in parallel to each other

47
Putting It All Together
  • Web and Internet standards greatly alleviate the
    hurdles of closed, proprietary interfaces and
    APIs
  • Creating applications no longer requires
    specialised professional services
  • Existing Web infrastructure and skills can be
    leveraged
  • Scalability, robustness, security, resource
    discovery solutions are inherited for free

48
A Glimpse Of The (Near) Future
  • Video and VoiceXML
  • Multimodal Interaction
  • IP Multimedia Subsystem (IMS)

49
A Glimpse Of The (Near) Future - Video
  • New media features, driven by improved terminal
    and network capabilities, are enabling new and
    exciting video applications and multimedia
    services
  • Current 3G networks can support video through
    the 3G-324M circuit switched protocol

50
A Glimpse Of The (Near) Future - Video
  • VoiceXML 2.0 can be extended to support video
    with virtually no modifications
  • Multimedia vs. multimodal: video is another,
    independent channel in a human-computer dialog
  • Reuse the <audio>, <record>, and <submit>
    elements in VoiceXML for video

51
A Glimpse Of The (Near) Future - Video
  • Video and VoiceXML Examples
      <prompt>
        Message 1, received yesterday at 10:45 pm.
        <audio src="http://192.168.1.1/msg01.3gp"/>
        End of message.
      </prompt>

      <record name="videomsg" beep="true"
              type="video/3gpp">
        <prompt timeout="5s">
          Record a video message after the beep.
        </prompt>
      </record>

      <submit next="save_message.pl"
              enctype="multipart/form-data"
              method="post" namelist="videomsg"/>

52
A Glimpse Of The (Near) Future - Multimodal
  • Multimodal is about combining multiple modes of
    interaction simultaneously

53
A Glimpse Of The (Near) Future - Multimodal
  • Multimodal extends input/output
  • Speech
  • Graphical / Textual
  • Ink
  • Gestures
  • Multiple devices
  • Mobile
  • Desktop
  • Kiosks

54
A Glimpse Of The (Near) Future - Multimodal
  • Design Time vs. Run-Time challenges
  • Design time: combine representation languages,
    e.g.
  • XHTML + VoiceXML (X+V)
  • SALT
  • Run time: a more difficult problem, being worked
    on by the W3C Multimodal Interaction group

55
A Glimpse Of The (Near) Future - IMS
  • IP Multimedia Subsystem (IMS) is a complete
    network architecture introduced by the 3GPP
  • Focused on convergence of cellular and Internet
    technologies on an all-IP network delivering
    voice, data, content, video
  • Gaining momentum as the next generation network
    architecture positioned for wireline/wireless
    convergence

56
A Glimpse Of The (Near) Future - IMS
  • IMS architecture, which is based on SIP, fits
    seamlessly with Speech-Enabled Web technology
  • Speech-Enabled Web technologies offer a more
    efficient, standard, and portable paradigm for
    service creation than SIP servlets / CGI
  • CCXML Browsers and VoiceXML Browsers map cleanly
    to IMS functional elements

57
A Glimpse Of The (Near) Future - IMS
  • CCXML is a SIP Application Server; VoiceXML is a
    Media Resource Function Controller

[Diagram: the IMS elements UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, linked by SIP and RTP]
58
A Glimpse Of The (Near) Future - IMS
  • Different physical packaging options

[Diagram: the IMS elements UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, linked by SIP and RTP, with a Physical MRF label]
59
A Glimpse Of The (Near) Future - IMS
  • Different physical packaging options

[Diagram: the IMS elements UE, P-CSCF, S-CSCF, CCXML AS / MRFC, VoiceXML MRFC, Mixer MRFP, and DTMF/Speech MRFP, linked by SIP and RTP, with a Physical MRF label in a different position]
61
Summary
  • Discussed WWW architecture principles and how
    they have been applied to speech technology
  • Reviewed important technical standards being
    employed in Speech-Enabled Web technology
  • Took a brief look at new developments relevant to
    Speech-Enabled Web technology

62
Summary
Thank you!