Spoken Dialogue Systems

About This Presentation

Title:

Spoken Dialogue Systems

Description:

Spoken Dialogue Systems Bob Carpenter and Jennifer Chu-Carroll June 20, 1999 Tutorial Overview: Data Flow Speech and Audio Processing Signal processing: Convert the ... – PowerPoint PPT presentation

Slides: 125

Provided by: csColumb4

Learn more at: http://www.cs.columbia.edu

Transcript and Presenter's Notes

1
Spoken Dialogue Systems

Bob Carpenter and Jennifer Chu-Carroll
June 20, 1999

2
Tutorial Overview: Data Flow
[Figure: data-flow diagram. Pipeline: Signal Processing, Speech Recognition, Semantic Interpretation, Discourse Interpretation, Dialogue Management, Response Generation, Speech Synthesis. Components are grouped into Part I and Part II.]
3
Speech and Audio Processing

Signal processing: convert the audio wave into a sequence of feature vectors
Speech recognition: decode the sequence of feature vectors into a sequence of words
Semantic interpretation: determine the meaning of the recognized words
Speech synthesis: generate synthetic speech from a marked-up word string

4
Tutorial Overview Outline

Part I
Signal processing
Speech recognition
acoustic modeling
language modeling
decoding
Semantic interpretation
Speech synthesis

Part II
Discourse and dialogue
Discourse interpretation
Dialogue management
Response generation
Dialogue evaluation
Data collection

5
Acoustic Waves

Human speech generates a wave, like a loudspeaker moving
A wave for the words "speech lab" looks like
[Figure: waveform of "speech lab", segmented as s p ee ch l a b, with a close-up of the "l" to "a" transition]
Graphs from Simon Arnfield's web tutorial on speech, Sheffield: http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/
6
Acoustic Sampling

10 ms frame (ms = millisecond = 1/1000 second)
25 ms window around frame to smooth signal processing
[Figure: overlapping 25 ms windows advancing in 10 ms steps along the wave]
Result: Acoustic Feature Vectors a1, a2, a3, ...
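
The framing step can be sketched as follows. This is a minimal illustration, not the tutorial's implementation: the function name frame_signal is hypothetical, and each window is assumed to start at its frame position rather than being centered on it.

```python
def frame_signal(samples, sample_rate, frame_shift_ms=10, window_ms=25):
    """Split a sample sequence into overlapping analysis windows.

    One window is produced every frame_shift_ms; each window spans window_ms.
    Assumption: windows start at the frame position (not centered on it).
    """
    shift = int(sample_rate * frame_shift_ms / 1000)   # samples between frame starts
    width = int(sample_rate * window_ms / 1000)        # samples per window
    frames = []
    start = 0
    while start + width <= len(samples):
        frames.append(samples[start:start + width])
        start += shift
    return frames


# One second of (silent) telephone audio at 8 kHz.
one_second = [0.0] * 8000
frames = frame_signal(one_second, 8000)
```

At 8 kHz a 10 ms shift is 80 samples and a 25 ms window is 200 samples, so consecutive windows overlap by 120 samples.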
7
Spectral Analysis

Frequency gives pitch; amplitude gives volume
sampling at 8 kHz phone, 16 kHz mic (kHz = 1000 cycles/sec)
Fourier transform of wave yields a spectrogram
darkness indicates energy at each frequency
hundreds to thousands of frequency samples
[Figure: waveform (amplitude) and spectrogram (frequency) for "speech lab", segmented as s p ee ch l a b]
8
Acoustic Features: Mel Scale Filterbank

Derive Mel Scale Filterbank coefficients
Mel scale
models non-linearity of human audio perception
mel(f) = 2595 log10(1 + f / 700)
roughly linear to 1000 Hz and then logarithmic
Filterbank
collapses large number of FFT parameters by filtering with 20 triangular filters spaced on mel scale
[Figure: triangular filters along the frequency axis producing coefficients m1, m2, m3, m4, m5, m6, ...]
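
A small sketch of the mel formula and of how filter centers might be spaced evenly on the mel scale. The helper names and the 0 to 4000 Hz range (half of the 8 kHz telephone sampling rate) are assumptions made for illustration.

```python
import math


def hz_to_mel(f):
    """mel(f) = 2595 log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def filter_centers(num_filters=20, low_hz=0.0, high_hz=4000.0):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_filters)]


centers = filter_centers()
```

Because the spacing is even in mel, the centers are close together at low frequencies and spread out at high frequencies.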
9
Cepstral Coefficients

Cepstral Transform is a discrete cosine transform of log filterbank amplitudes
Result is 12 Mel Frequency Cepstral Coefficients (MFCC)
Almost independent (unlike mel filterbank)
Use Delta (velocity / first derivative) and Delta2 (acceleration / second derivative) of MFCC (+ 24 features)
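
The cepstral transform and the dynamic features can be sketched like this. It is an illustrative sketch only: the function names are hypothetical, the DCT is an unnormalized DCT-II, and the deltas use a simple central difference over neighboring frames, whereas real front ends typically use a regression over a wider window.

```python
import math


def dct_cepstrum(log_energies, num_ceps=12):
    """Unnormalized DCT-II of log filterbank amplitudes, coefficients 1..num_ceps."""
    n = len(log_energies)
    return [
        sum(e * math.cos(math.pi * k * (j + 0.5) / n)
            for j, e in enumerate(log_energies))
        for k in range(1, num_ceps + 1)
    ]


def deltas(frames):
    """Central difference across time; edge frames reuse themselves as neighbor."""
    out = []
    last = len(frames) - 1
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(mfcc_frames):
    """Append Delta and Delta2 to each 12-dim MFCC frame (12 + 24 = 36 features)."""
    d1 = deltas(mfcc_frames)
    d2 = deltas(d1)
    return [m + a + b for m, a, b in zip(mfcc_frames, d1, d2)]
```

A flat log spectrum has no cepstral structure, so all coefficients from 1 upward come out as zero.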

10
Additional Signal Processing

Pre-emphasis prior to Fourier transform to boost high-frequency energy
Liftering to re-scale cepstral coefficients
Channel Adaptation to deal with line and microphone characteristics (example: cepstral mean normalization)
Echo Cancellation to remove background noise (including speech generated from the synthesizer)
Adding a Total (log) Energy feature (+/- normalization)
End-pointing to detect signal start and stop

12
Properties of Recognizers

Speaker Independent vs. Speaker Dependent
Large Vocabulary (2K-200K words) vs. Limited
Vocabulary (2-200)
Continuous vs. Discrete
Speech Recognition vs. Speech Verification
Real Time vs. multiples of real time
Spontaneous Speech vs. Read Speech
Noisy Environment vs. Quiet Environment
High Resolution Microphone vs. Telephone vs.
Cellphone
Adapt to speaker vs. non-adaptive
Low vs. High Latency
With online incremental results vs. final results

13
The Speech Recognition Problem

Bayes' Law
P(a,b) = P(a | b) P(b) = P(b | a) P(a)
Joint probability of a and b = probability of b times the probability of a given b
The Recognition Problem
Find most likely sequence w of words given the sequence of acoustic observation vectors a
Use Bayes' law to create a generative model
ArgMax_w P(w | a) = ArgMax_w P(a | w) P(w) / P(a)
= ArgMax_w P(a | w) P(w)
Acoustic Model: P(a | w)
Language Model: P(w)
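
A toy illustration of the decision rule. The candidate strings and all probabilities below are invented for the example; P(a) is dropped because it is the same for every candidate.

```python
def recognize(candidates, acoustic_likelihood, language_prob):
    """Pick the word sequence w maximizing P(a | w) * P(w)."""
    return max(candidates, key=lambda w: acoustic_likelihood[w] * language_prob[w])


# Hypothetical scores for one fixed acoustic input a.
P_A_GIVEN_W = {"recognize speech": 0.4, "wreck a nice beach": 0.5}
P_W = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}

best = recognize(list(P_W), P_A_GIVEN_W, P_W)
```

Here the acoustically better candidate loses because the language model considers it far less likely.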

15
Hidden Markov Models (HMMs)

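HMMs provide generative acoustic models P(a | w)
probabilistic, non-deterministic finite-state automaton
state n generates feature vectors with density P_n(.)
transitions from state j to n are probabilistic: P_{j,n}
[Figure: three-state left-to-right HMM with self-loops P_{1,1}, P_{2,2}, P_{3,3}, forward transitions P_{1,2}, P_{2,3}, P_{3,4}, and output densities P_1(.), P_2(.), P_3(.)]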
16
HMMs: Single Gaussian Distribution

Outgoing likelihoods
Feature vector a generated by normal density (Gaussian) with mean μ and covariance matrix Σ

17
HMMs: Gaussian Mixtures

To account for variable pronunciations
Each state generates acoustic vectors according to a linear combination of m Gaussian models, weighted by λ_m

[Figure: three-component mixture model in two dimensions]
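
A sketch of how a state's output likelihood could be evaluated for a single Gaussian and for a mixture. It assumes diagonal covariance matrices (a common simplification that the slides do not state) and works in the log domain; the function names are hypothetical.

```python
import math


def log_gaussian(a, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector a."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(a, mean, var)
    )


def log_mixture(a, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp for stability)."""
    logs = [
        math.log(w) + log_gaussian(a, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))
```

The mixture weights are assumed to be positive and to sum to one, so the mixture is itself a proper density.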
18
Acoustic Modeling with HMMs

Train HMMs to represent subword units
Units are typically segmental; may vary in granularity
phonological (40 for English)
phonetic (60 for English)
context-dependent triphones (14,000 for English): model temporal and spectral transitions between phones
silence and noise are usually additional symbols
Standard architecture is three successive states per phone

19
Pronunciation Modeling

Needed for speech recognition and synthesis
Maps orthographic representation of words to
sequence(s) of phones
Dictionary doesn't cover language due to
open classes
names
inflectional and derivational morphology
Pronunciation variation can be modeled with multiple pronunciations and/or acoustic mixtures
If multiple pronunciations are given, estimate
likelihoods
Use rules (e.g. assimilation, devoicing,
flapping), or statistical transducers

20
Lexical HMMs

Create compound HMM for each lexical entry by
concatenating the phones making up the
pronunciation
Example of HMM for "lab" (following "speech" for cross-word triphone)
Multiple pronunciations can be weighted by likelihood into compound HMM for a word
(Tri)phone models are independent parts of word models

triphone: ch-l+a  l-a+b  a-b
phone:    l       a      b
21
HMM Training: Baum-Welch Re-estimation

Determines the probabilities for the acoustic HMM models
Bootstraps from initial model
hand aligned data, previous models or flat start
Allows embedded training of whole utterances
transcribe utterance to words w1,...,wk and generate a compound HMM by concatenating compound HMMs for words m1,...,mk
calculate acoustic vectors a1,...,an
Iteratively converges to a new estimate
Re-estimates all paths because states are hidden
Provides a maximum likelihood estimate
model that assigns training data the highest
likelihood

23
Probabilistic Language Modeling: History

Assigns probability P(w) to word sequence w = w1, w2, ..., wk
Bayes' Law provides a history-based model
P(w1, w2, ..., wk) = P(w1) P(w2 | w1) P(w3 | w1,w2) ... P(wk | w1,...,wk-1)
Cluster histories to reduce number of parameters

24
N-gram Language Modeling

n-gram assumption clusters based on last n-1 words
P(wj | w1,...,wj-1) = P(wj | wj-n+1,...,wj-2,wj-1)
unigrams: P(wj)
bigrams: P(wj | wj-1)
trigrams: P(wj | wj-2,wj-1)
Trigrams often interpolated with bigram and unigram
P(wj | wj-2,wj-1) = λ3 F(wj | wj-2,wj-1) + λ2 F(wj | wj-1) + λ1 F(wj)
the λi typically estimated by maximum likelihood estimation on held out data (F(..) are relative frequencies)
many other interpolations exist (another standard is a non-linear backoff)
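
A minimal sketch of linear interpolation of trigram, bigram and unigram relative frequencies. The λ values below are fixed by hand purely for illustration (the slide describes estimating them on held-out data), and the function names are hypothetical.

```python
from collections import Counter


def train_counts(tokens):
    """Unigram, bigram and trigram counts from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def interpolated_trigram(w1, w2, w3, counts, lambdas=(0.1, 0.3, 0.6)):
    """P(w3 | w1, w2) as a weighted sum of relative frequencies F(..)."""
    uni, bi, tri = counts
    l1, l2, l3 = lambdas                      # unigram, bigram, trigram weights
    total = sum(uni.values())
    f_uni = uni[w3] / total if total else 0.0
    f_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    f_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * f_uni + l2 * f_bi + l3 * f_tri


counts = train_counts("a b c a b d".split())
p_seen = interpolated_trigram("a", "b", "c", counts)
p_unseen = interpolated_trigram("c", "b", "d", counts)
```

The second query uses a trigram that never occurred in the toy corpus; the bigram and unigram terms still give it a non-zero probability.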

25
Extended Probabilistic Language Modeling

Histories can include some indication of semantic
topic
latent-semantic indexing (vector-based
information retrieval model)
topic-spotting and blending of topic-specific
models
dialogue-state specific language models
Language models can adapt over time
recent history updates model through
re-estimation or blending
often done by boosting estimates for seen words
(triggers)
new words and/or pronunciations can be added
Can estimate category tags (syntactic and/or semantic)
Joint word/category model P(word1:tag1, ..., wordk:tagk)
example: P(word:tag | History) ≈ P(word | tag) P(tag | History)

26
Finite State Language Modeling

Write a finite-state task grammar (with non-recursive CFG)
Simple Java Speech API example (from user's guide)
public <Command> = [<Polite>] <Action> <Object> (and <Object>)*;
<Action> = open | close | delete;
<Object> = the window | the file;
<Polite> = please;
Typically assume that all transitions are equi-probable
Technology used in most current applications
Can put semantic actions in the grammar

27
Information Theory: Perplexity

Perplexity is a standard measure of recognition complexity given a language model
Perplexity measures the conditional likelihood of a corpus, given a language model P(.)
Roughly the number of equi-probable choices per word
Typically computed by taking logs and applying history-based Bayesian decomposition
But lower perplexity doesn't guarantee better recognition
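
Perplexity computed via logs can be sketched as below. The input is assumed to be the list of per-word conditional probabilities P(wj | history) that some language model assigned to a test corpus; the function name is hypothetical.

```python
import math


def perplexity(word_probs):
    """2 ** (average negative log2 probability per word)."""
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2.0 ** (-log_sum / len(word_probs))


# A model that always offers four equally likely next words.
pp = perplexity([0.25] * 8)
```

This matches the "number of equi-probable choices per word" reading: four equally likely choices give a perplexity of four.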

28
Zipf's Law

Lexical frequency is inversely proportional to rank
Frequency(n) = frequency of n-th most frequent word
Zipf's Law: Frequency(Rank) ≈ Frequency(1) / Rank
Thus log Frequency(Rank) ≈ log Frequency(1) - log Rank

From G.R. Turner's web site on Zipf's law: http://www.btinternet.com/g.r.turner/ZipfDoc.htm
29
Vocabulary Acquisition

IBM personal E-mail corpus of PDB (by R.L.
Mercer)
static coverage is given by most frequent n words
dynamic coverage is most recent n words

30
Language HMMs

Can take HMMs for each word and combine into a
single HMM for the whole language (allows
cross-word models)
Result is usually too large to expand statically
in memory
A two word example is given by
[Figure: language HMM over word1 and word2, with entry probabilities P(word1), P(word2) and transitions P(word1 | word1), P(word2 | word1), P(word1 | word2), P(word2 | word2)]
32
HMM Decoding

Decoding Problem is finding best word sequence
ArgMax_{w1,...,wm} P(w1,...,wm | a1,...,an)
Words w1...wm are fully determined by sequences of states
Many state sequences produce the same words
The Viterbi assumption
the word sequence derived from the most likely path will be the most likely word sequence (as would be computed over all paths)

f_i(s) = max_r f_{i-1}(r) P_{r,s} P_s(a_i)
max_r: max over previous states r
f_{i-1}(r): likelihood previous state is r
P_{r,s}: transition probability from r to s
P_s(a_i): acoustics for state s for input a_i
33
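Visualizing Viterbi Decoding: The Trellis
[Figure: trellis with states s1, s2, ..., sk on the vertical axis and time t_{i-1}, t_i, t_{i+1} (inputs a_{i-1}, a_i, a_{i+1}) on the horizontal axis. Each node carries a score f_{i-1}(s), f_i(s) or f_{i+1}(s); arcs carry transition probabilities such as P_{1,1}, P_{1,2}, P_{2,1}, P_{1,k}, P_{k,1}; the best path is highlighted.]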
34
Viterbi Search: Dynamic Programming Token Passing

Algorithm
Initialize all states with a token with a null history and the likelihood that it's a start state
For each frame ak
  For each token t in state s with probability P(t), history H
    For each state r
      Add new token to r with probability P(t) P_{s,r} P_r(ak), and history s.H
Time synchronous from left to right
Allows incremental results to be evaluated
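
The token-passing loop can be sketched in Python as follows. This is a simplified illustration under several assumptions: outputs are discrete symbols rather than feature-vector densities, only the best token per state is kept after each frame, the first frame's output probability is folded into initialization, and the toy model at the bottom is invented.

```python
import math


def viterbi_token_passing(observations, states, start_p, trans_p, emit_p):
    """Return (best state sequence, log probability) for a discrete-output HMM."""
    # Initialize: one token per possible start state, scored on the first frame.
    tokens = {}
    first = observations[0]
    for s in states:
        p = start_p.get(s, 0.0) * emit_p[s].get(first, 0.0)
        if p > 0.0:
            tokens[s] = (math.log(p), [s])

    # Time-synchronous pass over the remaining frames.
    for obs in observations[1:]:
        new_tokens = {}
        for s, (score, history) in tokens.items():
            for r in states:
                p = trans_p[s].get(r, 0.0) * emit_p[r].get(obs, 0.0)
                if p == 0.0:
                    continue
                candidate = score + math.log(p)
                # Keep only the best token arriving in state r.
                if r not in new_tokens or candidate > new_tokens[r][0]:
                    new_tokens[r] = (candidate, history + [r])
        tokens = new_tokens

    best_score, best_history = max(tokens.values(), key=lambda t: t[0])
    return best_history, best_score


# Invented two-state model for illustration.
START = {"s1": 0.6, "s2": 0.4}
TRANS = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
EMIT = {"s1": {"x": 0.5, "y": 0.4, "z": 0.1}, "s2": {"x": 0.1, "y": 0.3, "z": 0.6}}

path, logp = viterbi_token_passing(["x", "y", "z"], ["s1", "s2"], START, TRANS, EMIT)
```

Working with log scores avoids underflow on long utterances; products of probabilities become sums.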

35
Pruning the Search Space

Entire search space for Viterbi search is much
too large
Solution is to prune tokens for paths whose score
is too low
Typical method is to use
histogram: only keep at most n total hypotheses
beam: only keep hypotheses whose score is a fraction of best score
Need to balance a small n and tight beam (to limit search) against minimal search error (good hypotheses falling off beam)
HMM densities are usually scaled differently than the discrete likelihoods from the language model
typical solution: boost language model's dynamic range, using P(w)^n P(a | w), usually with n around 15
Often include penalty for each word to favor hypotheses with fewer words
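
A sketch of the score combination and the two pruning criteria in the log domain. In log scores, "a fraction of the best score" becomes "within a fixed distance of the best score". The function names and all parameter values are illustrative assumptions.

```python
def combined_log_score(acoustic_logp, lm_logp, num_words,
                       lm_scale=15.0, word_penalty=0.0):
    """log[P(w)^n P(a | w)] minus a per-word penalty."""
    return acoustic_logp + lm_scale * lm_logp - word_penalty * num_words


def prune(hypotheses, beam_width, max_hyps):
    """Apply beam pruning, then histogram pruning.

    hypotheses: list of (log_score, payload) pairs.
    beam_width: keep hypotheses with log_score >= best - beam_width.
    max_hyps:   keep at most this many hypotheses overall.
    """
    if not hypotheses:
        return []
    best = max(score for score, _ in hypotheses)
    in_beam = [h for h in hypotheses if h[0] >= best - beam_width]
    in_beam.sort(key=lambda h: h[0], reverse=True)
    return in_beam[:max_hyps]


survivors = prune([(-10, "a"), (-12, "b"), (-30, "c"), (-11, "d")],
                  beam_width=5, max_hyps=2)
```

Hypothesis "c" falls outside the beam, and "b" is dropped by the histogram limit of two.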

36
N-best Hypotheses and Word Graphs

Keep multiple tokens and return n-best
paths/scores
p1 flights from Boston today
p2 flights from Austin today
p3 flights for Boston to pay
p4 lights for Boston to pay
Can produce a packed word graph (a.k.a. lattice)
likelihoods of paths in lattice should equal
likelihood for n-best

37
Search-based Decoding

A* search
Compute all initial hypotheses and place in
priority queue
For best hypothesis in queue
extend by one observation, compute next state
score(s) and place into the queue
Scoring now compares derivations of different
lengths
would like to, but can't, compute cost to complete until all data is seen
instead, estimate with simple normalization for
length
usually prune with beam and/or histogram
constraints
Easy to include unbounded amounts of history
because no collapsing of histories as in dynamic
programming n-gram
Also known as stack decoder (priority queue is
stack)

38
Multiple Pass Decoding

Perform multiple passes, applying successively
more fine-grained language models
Can much more easily go beyond finite state or
n-gram
Can use for Viterbi or stack decoding
Can use word graph as an efficient interface
Can compute likelihood to complete hypotheses
after each pass and use in next round to tighten
beam search
First pass can even be a free phone decoder
without a word-based language model

39
Measuring Recognition Accuracy

Word Error Rate
Example scoring
actual utterance: four six seven nine three three seven
recognizer: four oh six seven five three seven
errors: one insert ("oh"), one subst ("nine" recognized as "five"), one delete ("three")
WER = (1 + 1 + 1)/7 = 43%
Would like to study concept accuracy
typically count only errors on content words
application dependent
ignore case marking (singular, plural, etc.)
For word/concept spotting applications
recall: percentage of target words (concepts) found
precision: percentage of hypothesized words (concepts) in target
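
Word error rate is commonly computed with a minimum edit distance alignment between the reference and the recognizer output. The sketch below assumes whitespace tokenization and equal cost for insertions, deletions and substitutions; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """(insertions + deletions + substitutions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("four six seven nine three three seven",
                      "four oh six seven five three seven")
```

For the example on the slide this gives three errors over seven reference words.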

40
Empirical Recognition Accuracies

Cambridge HTK, 1997: multipass HMM w. lattice rescoring
Top Performer in ARPA's HUB-4 Broadcast News Task
65,000 word vocabulary; Out of Vocabulary: 0.5%
Perplexities
word bigram: 240 (6.9 million bigrams)
backoff trigram of 1000 categories: 238 (803K bi, 7.1G tri)
word trigram: 159 (8.4 million trigrams)
word 4-gram: 147 (8.6 million 4-grams)
word 4-gram + category trigram: 137
Word Error Rates
clean, read speech: 9.4%
clean, spontaneous speech: 15.2%
low fidelity speech: 19.5%

41
Empirical Recognition Accuracies (cont'd)

Lucent 1998, single pass HMM
Typical of real-time telephony performance (low fidelity)
3,000 word vocabulary; Out of Vocabulary: 1.5%
Blended models from customer/operator and customer/system
Perplexities   customer/op       customer/system
bigram         105.8 (27,200)    32.1 (12,808)
trigram        99.5 (68,500)     24.4 (25,700)
Word Error Rate: 23%
Content Term (single, pair, triple of words) Precision/Recall
one-word terms: 93.7 / 88.4
two-word terms: 96.9 / 85.4
three-word terms: 98.5 / 84.3

42
Confidence Scoring and Rejection

Alternative to standard acoustic density scoring
compute HMM acoustic score for word(s) in usual
way
baseline score for an anti-model
compute hypothesis ratio (Word Score / Baseline
Score)
test hypothesis ratio vs. threshold
Can be applied to
free word spotting (given pronunciations)
(word-by-word) acoustic confidence scoring for
later processing
verbal information verification
existing info: name, address, social security number
password

44
Semantic Interpretation: Word Strings

Content is just words
System: What is your address?
User: fourteen eleven main street
Can also do concept extraction / keyword(s) spotting
User: My address is fourteen eleven main street
Applications
template filling
directory services
information retrieval

45
Semantic Interpretation: Pattern-Based

Simple (typically regular) patterns specify content
ATIS (Air Travel Information System) Task
System: What are your travel plans?
User: On Monday, I'm going from Boston to San Francisco.
Content: DATE=Monday, ORIGIN=Boston, DESTINATION=SFO
Can combine content-extraction and language modeling
but can be too restrictive as a language model
Java Speech API (curly brackets show semantic actions)
public <command> = <action> [<object>] [<polite>];
<action> = open {OP} | close {CL} | move {MV};
<object> = [<this_that_etc>] window | door;
<this_that_etc> = a | the | this | that | the current;
<polite> = please | kindly;
Can be generated and updated on the fly (e.g. Web Apps)
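
Pattern-based content extraction for the travel example might look like the sketch below. The regular expressions and slot names are illustrative assumptions, not the patterns of any actual ATIS system, and the mapping from city name to airport code (San Francisco to SFO) is left out.

```python
import re

# One hypothetical pattern per slot; each captures the slot value in group 1.
PATTERNS = {
    "DATE": re.compile(
        r"\bon (monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
        re.IGNORECASE),
    "ORIGIN": re.compile(r"\bfrom ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "DESTINATION": re.compile(r"\bto ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
}


def extract_content(utterance):
    """Fill every slot whose pattern matches somewhere in the utterance."""
    content = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            content[slot] = match.group(1)
    return content


content = extract_content("On Monday, I'm going from Boston to San Francisco.")
```

Slots whose pattern does not match are simply left unfilled, which is what lets this style of interpretation tolerate partial or disfluent input.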

46
Semantic Interpretation: Parsing

In general case, have to uncover who did what to whom
System: What would you like me to do next?
User: Put the block in the box on Platform 1. [ambiguous]
System: How can I help you?
User: Where is A Bug's Life playing in Summit?
Requires some kind of parsing to produce relations
Who did what to whom: ?(where(present(in(Summit, play(BugsLife)))))
This kind of representation often used for machine translation
Often transferred to flatter frame-based representation
Utterance type: where-question
Movie: A Bug's Life
Town: Summit

47
Robustness and Partiality

Controlled Speech
limited task vocabulary; limited task grammar
Spontaneous Speech
Can have high out-of-vocabulary (OOV) rate
Includes restarts, word fragments, omissions,
phrase fragments, disagreements, and other
disfluencies
Contains much grammatical variation
Causes high word error-rate in recognizer
Parsing is often partial, allowing
omission
parsing fragments

49
Recorded Prompts

The simplest (and most common) solution is to
record prompts spoken by a (trained) human
Produces human quality voice
Limited by number of prompts that can be recorded
Can be extended by limited cut-and-paste or
template filling

50
Speech Synthesis

Rule-based Synthesis
Uses linguistic rules (+/- training) to generate features
Example: DECtalk
Concatenative Synthesis
Record basic inventory of sounds
Retrieve appropriate sequence of units at run
time
Concatenate and adjust durations and pitch
Waveform synthesis

51
Diphone and Polyphone Synthesis

Phone sequences capture co-articulation
Cut speech in positions that minimize context
contamination
Need single phones, diphones and sometimes
triphones
Reduce number collected by
phonotactic constraints
collapsing in cases of no co-articulation
Data Collection Methods
Collect data from a single (professional) speaker
Select text with maximal coverage (typically with
greedy algorithm), or
Record minimal pairs in desired contexts (real
words or nonsense)

52
Duration Modeling

Must generate segments with the appropriate
duration
Segmental Identity
/ai/ in "like" twice as long as /I/ in "lick"
Surrounding Segments
vowels longer following voiced fricatives than
voiceless stops
Syllable Stress
onsets and nuclei of stressed syllables longer
than in unstressed
Word importance
word accent with major pitch movement lengthens
Location of Syllable in Word
word ending longer than word starting longer than
word internal
Location of the Syllable in the Phrase
phrase final syllables longer than same syllable
in other positions

53
Intonation Tone Sequence Models

Functional Information can be encoded via tones
given/new information (information status)
contrastive stress
phrasal boundaries (clause structure)
dialogue act (statement/question/command)
Tone Sequence Models
F0 contours generated from phonologically
distinctive tones/pitch accents which are locally
independent
generate a sequence of tonal targets and fit with
signal processing

54
Intonation for Function

ToBI (Tones and Break Indices) system is one example
Pitch Accent * (H*, L*, H*+L, H+L*, L*+H, L+H*)
Phrase Accent - (H-, L-)
Boundary Tone % (H%, L%)
Intonational Phrase: <Pitch Accent>+ <Phrase Accent> <Boundary Tone>

[Figure: statement vs. question example]
Source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998
55
Text Markup for Synthesis

Bell Labs TTS Markup
r(0.9) LH(0.8) Humpty LH(0.8) Dumpty
r(0.85) L(0.5) sat on a H(1.2) wall.
Tones: Tone(Prominence)
Speaking Rate: r(Rate), and pauses
Top Line (highest pitch), Reference Line (reference pitch), Base Line (lowest pitch)
SABLE is an emerging standard extending SGML
http://www.cstr.ed.ac.uk/projects/sable.html
marks emphasis(), break(), pitch(base/mid/range,), rate(), volume(), semanticMode(date/time/email/URL/...), speaker(age,sex)
Implemented in Festival Synthesizer (free for research, etc.)
http://www.cstr.ed.ac.uk/projects/festival.html

56
Intonation in Bell Labs TTS

Generate a sequence of F0 targets for synthesis
Example
We were away a year ago.
phones: w E w R w A y E r g O
Default declarative intonation: (H*) H* L- L%
question: L* H- H%

Source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998
57
Signal Processing for Speech Synthesis

Diphones recorded in one context must be
generated in other contexts
Features are extracted from recorded units
Signal processing manipulates features to smooth
boundaries where units are concatenated
Signal processing modifies signal via
interpolation
intonation
duration

58
The Source-Filter Model of Synthesis

Model of features to be extracted and fitted
Excitation or Voicing Source(s) to model sound
source
standard wave of glottal pulses for voiced sounds
randomly varying noise for unvoiced sounds
modification of airflow due to lips, etc.
high frequency (F0 rate), quasi-periodic, choppy
modeled with vector of glottal waveform patterns
in voiced regions
Acoustic Filter(s)
shapes the frequency character of vocal tract and
radiation character at the lips
relatively slow (samples around 5ms suffice) and
stationary
modeled with LPC (linear predictive coding)

59
Barge-in

Technique to allow speaker to interrupt the system's speech
Combined processing of input signal and output
signal
Signal detector runs looking for speech start and
endpoints
tests a generic speech model against noise model
typically cancels echoes created by outgoing
speech
If speech is detected
Any synthesized or recorded speech is cancelled
Recognition begins and continues until end point
is detected

60
Speech Application Programming Interfaces

Abstract from recognition/synthesis engines
Recognizer and synthesizer loading
Acoustic and grammar model loading (dynamic
updates)
Recognition
online
n-best or lattice
Synthesis
markup
barge in
Acoustic control
telephony interface
microphone/speaker interface

61
Speech API Examples

SAPI: Microsoft Speech API (rec + synth)
communicates through COM objects
instances: most systems implement all or some of this (Dragon, IBM, Lucent, L&H, etc.)
JSAPI: Java Speech API (rec + synth)
communicates through Java events (like GUI)
concurrency through threads
instances: IBM ViaVoice (rec), L&H (synth)
(J)HAPI: (Java) HTK API (recognition)
communicates through C or Java port of C interface
e.g. Entropic's Cambridge Research Lab's HMM Tool Kit (HTK)
Galaxy (rec + synth)
communicates through a production system scripting language
MIT System, ported by MITRE for DARPA Communicator

63
Discourse & Dialogue Processing

Discourse interpretation
Understand what the user really intends by
interpreting utterances in context
Dialogue management
Determine system goals in response to user
utterances based on user intention
Response generation
Generate natural language utterances to achieve
the selected goals

64
Discourse Interpretation

Goal: understand what the user really intends
Example: Can you move it?
What does "it" refer to?
Is the utterance intended as a simple yes-no query or a request to perform an action?
Issues addressed
Reference resolution
Intention recognition
Interpret user utterances in context

65
Reference Resolution
U: Where is A Bug's Life playing in Summit?
S: A Bug's Life is playing at the Summit theater.
U: When is it playing there?
S: It's playing at 2pm, 5pm, and 8pm.
U: I'd like 1 adult and 2 children for the first show. How much would that cost?

Knowledge sources
Domain knowledge
Discourse knowledge
World knowledge

66
Reference Resolution In Theory

Focus stacks
Maintain recent objects in stack
Select objects that satisfy semantic/pragmatic
constraints starting from top of stack
May take into account discourse structure
Centering
Backward-looking center (Cb): object connecting the current sentence with the previous sentence
Forward-looking centers (Cf): potential Cb of the next sentence
Rule-based filtering & ranking of objects for pronoun resolution

67
Reference Resolution In Practice

Non-existent: does not allow the use of anaphoric references
Allows only simple references
utilizes the focus stack reference resolution mechanism
does not take into account discourse structure information
Example

U: Where is A Bug's Life playing in Summit?
[Focus stack: Summit, A Bug's Life]
68

S: A Bug's Life is playing at the Summit theater.
[Focus stack: Summit theater, A Bug's Life]
69
U: When is it playing there?
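
The focus stack mechanism in this example can be sketched as below. The class, the semantic types ("movie", "place") and the pronoun-to-type table are assumptions for illustration; a real system would apply richer semantic and pragmatic constraints.

```python
class FocusStack:
    """Recently mentioned objects, most recent on top."""

    def __init__(self):
        self._items = []  # list of (name, kind); top of stack is the last element

    def push(self, name, kind):
        # Re-mentioning an object moves it to the top instead of duplicating it.
        self._items = [(n, k) for n, k in self._items if n != name]
        self._items.append((name, kind))

    def resolve(self, kind):
        """Return the topmost object satisfying the semantic constraint."""
        for name, k in reversed(self._items):
            if k == kind:
                return name
        return None


# Hypothetical constraint table: which kind of object each pro-form may refer to.
PRONOUN_KIND = {"it": "movie", "there": "place"}

stack = FocusStack()
# U: Where is A Bug's Life playing in Summit?
stack.push("A Bug's Life", "movie")
stack.push("Summit", "place")
# S: A Bug's Life is playing at the Summit theater.
stack.push("A Bug's Life", "movie")
stack.push("Summit theater", "place")
# U: When is it playing there?
referent_it = stack.resolve(PRONOUN_KIND["it"])
referent_there = stack.resolve(PRONOUN_KIND["there"])
```

Because the search starts from the top of the stack, "there" picks up the more recently mentioned Summit theater rather than the town of Summit.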
70
Intention Recognition
B: I have to wash my hair.
71

A: Would you like to go to the hairdresser?

B's utterance should be interpreted as an acceptance of A's proposal.

72

A: What's that smell around here?

B's utterance should be interpreted as an answer to A's question.

73

A: Would you be interested in going out to dinner tonight?

B's utterance should be interpreted as a rejection of A's proposal.

74
Intention Recognition (Cont'd)

Goal: to recognize the intent of each user utterance as one (or more) of a set of dialogue acts based on context
Sample dialogue actions
Switchboard DAMSL
Conventional-closing
Statement-(non-)opinion
Agree/Accept
Acknowledgment
Yes-No-Question/Yes-Answer
Non-verbal
Abandoned
On-going standardization efforts (Discourse
Resource Initiative)

Verbmobil
Greet/Thank/Bye
Suggest
Accept/Reject
Confirm
Clarify-Query/Answer
Give-Reason
Deliberate

75
Intention Recognition In Theory

Knowledge sources
Overall dialogue goals
Orthographic features, e.g.
punctuation
cue words/phrases: "but", "furthermore", "so"
transcribed words: "would you please", "I want to"
Dialogue history, i.e., previous dialogue act types
Dialogue structure, e.g.
subdialogue boundaries
dialogue topic changes
Prosodic features of utterance: duration, pause, F0, speaking rate

76
Intention Recognition: In Theory (Cont'd)

Finite-state dialogue grammar
e.g. [Figure: finite-state network with arcs labeled S / Greet, U / Question, S / Answer, U / Question, U / Bye]
Plan-based discourse understanding
Recipes: templates for performing actions
Inference rules: to construct plausible plans
Empirical methods
Probabilistic dialogue act classifiers: HMMs
Rule-based dialogue act recognition: CART, Transformation-based learning
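
A finite-state dialogue grammar over speaker/act pairs can be sketched as a transition table. The topology below is a guess built from the arc labels on the slide (the original figure's exact states and arcs are not recoverable), and the state names are invented.

```python
# (state, "speaker/act") -> next state. Hypothetical topology.
GRAMMAR = {
    ("start", "S/Greet"): "open",
    ("open", "U/Question"): "asked",
    ("asked", "S/Answer"): "open",
    ("open", "U/Bye"): "end",
}


def accepts(acts, start="start", final="end"):
    """True if the sequence of dialogue acts is a path from start to final."""
    state = start
    for act in acts:
        state = GRAMMAR.get((state, act))
        if state is None:
            return False
    return state == final


def expected_acts(state):
    """Dialogue acts the grammar allows next; usable to constrain recognition."""
    return sorted(act for (s, act) in GRAMMAR if s == state)
```

For example, accepts(["S/Greet", "U/Question", "S/Answer", "U/Bye"]) holds, while a user question before the system greeting is rejected.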
77
Intention Recognition In Practice

Makes assumptions about (high-level) task-specific intentions, e.g.,
Call routing: giving destination information
ATIS: requesting flight information
Movie information system: movie showtime or theater playlist information
Does not allow user-initiated complex dialogue acts, e.g. confirmation, clarification, or indirect responses

S1: What's your account number?
U1: Is that the number on my ATM card?
S2: Would you like to transfer $1,500 from savings to checking?
U2: If I have enough in savings.
78
Intention Recognition: In Practice (Cont'd)

User utterances can play one of two roles
Identify one of a set of possible task intentions
Provide necessary information for performing a
task
Based on either keywords in an utterance or its
syntactic/semantic representation
Maps keywords or representations to intentions
using
Template matching
Probabilistic model
Vector-based similarity measures

79
Intention Recognition: Example
U: What time is A Bug's Life playing at the Summit theater?

Using keyword extraction and vector-based similarity measures
Intention: Ask-Reference _time
Movie: A Bug's Life
Theater: the Summit quadplex
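
Vector-based intention matching can be sketched with bag-of-words vectors and cosine similarity. The intention labels and their keyword lists below are invented for illustration; a deployed system would derive the vectors from training data.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical intentions, each represented by a small keyword vector.
INTENT_KEYWORDS = {
    "Ask-Showtime": "what time when showtime playing start",
    "Ask-Playlist": "which what movies films showing theater playlist",
}


def classify(utterance):
    """Return the intention whose keyword vector is most similar to the utterance."""
    vec = Counter(re.findall(r"[a-z']+", utterance.lower()))
    scores = {
        intent: cosine(vec, Counter(words.split()))
        for intent, words in INTENT_KEYWORDS.items()
    }
    return max(scores, key=scores.get)


intent = classify("What time is A Bug's Life playing at the Summit theater?")
```

Slot values such as the movie and theater names would be extracted separately, for instance by keyword spotting against a database of known titles.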

81
Dialogue Management: Motivating Examples

Dialogue 1

S: Would you like movie showtime or theater playlist information?
U: Movie showtime.
S: What movie do you want showtime information about?
U: Saving Private Ryan.
S: At what theater do you want to see Saving Private Ryan?
U: Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater.
82
DM: Motivating Examples (Cont'd)

Dialogue 2

S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater, but it's playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.
83
DM: Motivating Examples (Cont'd)

Dialogue 3

S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is playing at the Fairmont theater at 6:00 and 8:30.
U: I wanted to know about the Paramount theater, not the Fairmont theater.
S: Saving Private Ryan is not playing at the Paramount theater, but it's playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.
84
Comparison of Sample Dialogues

Dialogue 1
System-initiative
Implicit confirmation
Merely informs user of failed query
Mechanical
Least efficient

Dialogue 2
Mixed-initiative
No confirmation
Suggests alternative when query fails
More natural
Most efficient

Dialogue 3
Mixed-initiative
No confirmation
Suggests alternative when query fails
More natural
Moderately efficient

85
Dialogue Management

Goal: determine what to accomplish in response to user utterances, e.g.
Answer user question
Solicit further information
Confirm/Clarify user utterance
Notify invalid query
Notify invalid query and suggest alternative
Interface between user/language processing
components and system knowledge base
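
Goal selection of this kind can be sketched for the movie showtime example. The goal labels, the function name and the tiny showtime table are hypothetical; the table stands in for the system knowledge base.

```python
# Hypothetical knowledge base: (movie, theater) -> showtimes.
SHOWTIMES = {
    ("Saving Private Ryan", "Madison"): ["3:00", "5:30", "8:00", "10:30"],
}


def select_goal(movie, theater, showtimes):
    """Choose the system's next goal from what is known so far."""
    if movie is None:
        return ("solicit", "movie")
    if theater is None:
        return ("solicit", "theater")
    times = showtimes.get((movie, theater))
    if times:
        return ("answer", times)
    # Query failed: look for the same movie at other theaters.
    alternatives = {t: ts for (m, t), ts in showtimes.items() if m == movie}
    if alternatives:
        return ("notify_and_suggest", alternatives)
    return ("notify_invalid", None)
```

With this table, asking for Saving Private Ryan at the Paramount yields the "notify and suggest alternative" goal seen in Dialogue 2, pointing the user to the Madison theater.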

86
Dialogue Management (Cont'd)

Main design issues
Functionality: how much should the system do?
Process: how should the system do them?
Affected by
Task complexity: how hard the task is
Dialogue complexity: what dialogue phenomena are allowed
Affects
robustness
naturalness
perceived intelligence

87
Task Complexity

Application dependent
Examples
[Figure: tasks placed along a scale from Simple to Complex: Weather Information, ATIS, Call Routing, Travel Planning, Automatic Banking, University Course Advisement]

Directly affects
Types and quantity of system knowledge
Complexity of system's reasoning abilities

88
Dialogue Complexity

Determines what can be talked about
The task only
Subdialogues: e.g., clarification, confirmation
The dialogue itself: meta-dialogues
Could you hold on for a minute?
What was that click? Did you hear it?
Determines who can talk about them
System only
User only
Both participants

89
Dialogue Management Functionality

Determines the set of possible goals that the
system may select at each turn
At the task level, dictated by task complexity
At the dialogue level, determined by system
designer in terms of dialogue complexity
Are subdialogues allowed?
Are meta-dialogues allowed?
Only by the system, by the user, or by both
agents?

90
DM Functionality: In Theory

Task complexity: moderate to complex
Travel planning
University course advisement
Dialogue complexity
System/user-initiated complex subdialogues
Embedded negotiation subdialogues
Expressions of doubt
Meta-dialogues
Multiple dialogue threads

91
DM Functionality: In Practice

Task complexity: simple to moderate
Call routing
Weather information query
Train schedule inquiry
Dialogue complexity
About task only
Limited system-initiated subdialogues

92
Dialogue Management Process

Determines how the system will go about selecting
among the possible goals
At the dialogue level, determined by system
designer in terms of initiative strategies
System-initiative: system always has control,
user only responds to system questions
User-initiative: user always has control, system
passively answers user questions
Mixed-initiative: control switches between system
and user using fixed rules
Variable-initiative: control switches between
system and user dynamically based on participant
roles, dialogue history, etc.

93
DM Process: In Theory

Initiative strategies
Mixed-initiative
Variable-initiative
Mechanisms for modeling initiative
Planning and reasoning
Theorem proving
Belief modeling
Knowledge sources for modeling initiative
System beliefs, user beliefs, and mutual beliefs
System domain knowledge
Dialogue history
User preferences

94
DM Process: In Practice

Initiative strategies
User-initiative
System-initiative
Mixed-initiative
Variable-initiative
Mechanisms for modeling initiative
System and mixed-initiative: finite-state
machines
Variable-initiative: evidential model for
computing initiative
Knowledge sources
Dialogue history: e.g., user fails to make
progress in task
Participant roles: advisor/advisee vs.
collaborators
Features of current utterance: e.g., ambiguous
utterance, underspecified utterance
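
A minimal sketch of the finite-state approach to system-initiative dialogue, assuming a hypothetical two-slot movie task. The state names, prompts, and the `run_dialogue` helper are illustrative and are not taken from the tutorial.

```python
# Hypothetical finite-state machine for a system-initiative dialogue.
# Each state asks one question; the user's answer fills one slot and
# the machine moves to a fixed next state.
STATES = {
    # state: (system prompt, slot filled by the answer, next state)
    "ask_movie": ("Which movie would you like?", "movie", "ask_theater"),
    "ask_theater": ("Which theater?", "theater", "done"),
}


def run_dialogue(user_answers):
    """Step through the states, consuming one user answer per state."""
    state = "ask_movie"
    slots = {}
    transcript = []
    answers = iter(user_answers)
    while state != "done":
        prompt, slot, next_state = STATES[state]
        transcript.append("S: " + prompt)
        answer = next(answers)
        transcript.append("U: " + answer)
        slots[slot] = answer
        state = next_state
    return slots, transcript


slots, transcript = run_dialogue(["Antz", "Fairmont"])
```

Because the system always asks and the user only answers, the transition table is fixed in advance. A mixed-initiative variant would need extra transitions for user questions.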

95
Dialogue Management Summary

In theory
Focuses on developing algorithms for addressing
difficult dialogue phenomena in complex tasks
In practice
Focuses on developing robust, real-time dialogue
strategies for handling common dialogue phenomena
in simpler tasks

96
Dialogue Management Example

System for providing movie showtime and theater
playlist information
Functionality
Task: provide simple movie information
Dialogue: allow system-initiated subdialogues
Process
Variable-initiative

97
Dialogue Management Example (Cont'd)

<U has task and dialogue initiative>
U: When is Antz playing at the Fairmont theater?
<S has dialogue initiative (low score for theater)>
<goal: confirm Paramount theater>
S: Did you say the Paramount theater?
U: No, the Fairmont theater.
<S has no initiative>
<goal: answer user question>
S: Antz is playing at the Fairmont theater at 7pm.
U: When is it playing at the Summit theater?
<S has task and dialogue initiative>
<goal: notify failure and suggest alternative>
S: Antz is not playing at the Summit theater, but
it's playing at the AMC theater in Summit at 8pm.
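
The goal choices in this example can be approximated with a few hand-written rules. The sketch below is a simple rule-based stand-in, not the evidential model for computing initiative. The showtime table, the 0.5 confidence threshold, and the `select_goal` function are assumptions made for illustration.

```python
# Hypothetical showtime table; movie and theater names follow the example.
SHOWTIMES = {
    ("Antz", "Fairmont"): "7pm",
    ("Antz", "AMC in Summit"): "8pm",
}
CONFIDENCE_THRESHOLD = 0.5  # assumed cutoff for a "low score"


def select_goal(movie, theater, theater_score):
    """Return (goal, detail) for one user question about showtimes."""
    if theater_score < CONFIDENCE_THRESHOLD:
        # Low recognition score: take dialogue initiative and confirm.
        return ("confirm", theater)
    if (movie, theater) in SHOWTIMES:
        # Query succeeded: no initiative needed, just answer.
        return ("answer", SHOWTIMES[(movie, theater)])
    # Query failed: take task and dialogue initiative and look for
    # another theater showing the same movie.
    alternatives = sorted(t for (m, t) in SHOWTIMES if m == movie)
    if alternatives:
        return ("notify_and_suggest", alternatives[0])
    return ("notify_failure", None)


first = select_goal("Antz", "Paramount", 0.3)   # misrecognized theater
second = select_goal("Antz", "Fairmont", 0.9)
third = select_goal("Antz", "Summit", 0.9)
```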
99
Response Generation
U: Is Saving Private Ryan playing at the Chatham
cinema?
100

S: No, it's not.

S provides elliptical response

101

S: No, Saving Private Ryan is not playing
at the Chatham cinema.

S provides full response (which provides
grounding information)

102

S: No, Saving Private Ryan is not playing
at the Chatham cinema; the theater's under
renovation.

S provides full response and supporting evidence

103
Response Generation (Cont'd)

Goal: generate natural language utterances to
achieve goal(s) selected by the dialogue manager
Issues
Content selection: determining what to say
Surface realization: determining how to say it
Generation gap: discrepancy between the actual
output of the content selection process and the
expected input of the surface realization process

104
Content Selection

Goal: determine the propositional content of
utterances to achieve goal(s)
Examples
Antz is not playing at the Maplewood theater.
(Nucleus)
Would you like the suite? (Nucleus)
Can you get the groceries from the car?
(Nucleus)

105

the theater's under renovation. (evidence)
(Satellite)
It's the same price as the regular room.
(motivation) (Satellite)
The key is on the dryer. (enablement) (Satellite)

106
Content Selection: In Theory

Knowledge sources
Domain knowledge base
User beliefs
User model: user characteristics, preferences,
etc.
Dialogue history
Content selection mechanisms
Schemas: patterns of predicates
Rule-based generation
Plan-based generation
Recipes: templates for performing actions
Planner to construct plans for given goal
Case-based reasoning

107
Content Selection: In Practice

Knowledge sources
Domain knowledge base
Dialogue history
Pre-determined content selection strategies
Nucleus only, no satellite information
Nucleus + fixed satellite
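
The two pre-determined strategies can be sketched as a lookup that attaches at most one satellite to the nucleus. The goal names, the `FIXED_SATELLITE` table, and the `select_content` function are hypothetical. The relation labels (evidence, motivation, enablement) follow the content selection examples.

```python
# Hypothetical mapping from a goal to the one kind of satellite that
# always accompanies its nucleus ("nucleus + fixed satellite").
FIXED_SATELLITE = {
    "notify_failure": "evidence",
    "offer": "motivation",
    "request": "enablement",
}


def select_content(goal, nucleus, knowledge, with_satellite=True):
    """Return the list of (role, proposition) pairs to convey for a goal."""
    content = [("nucleus", nucleus)]
    relation = FIXED_SATELLITE.get(goal)
    # Add the satellite only if the strategy asks for one and the
    # knowledge base actually holds a proposition for that relation.
    if with_satellite and relation in knowledge:
        content.append((relation, knowledge[relation]))
    return content


knowledge = {"evidence": "the theater is under renovation"}
nucleus = "Antz is not playing at the Maplewood theater"
full = select_content("notify_failure", nucleus, knowledge)
bare = select_content("notify_failure", nucleus, knowledge, with_satellite=False)
```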

108
Surface Realization

Goal: determine how the selected content will be
conveyed by natural language utterances
Examples
Antz is showing (shown) at the Maplewood theater.
The Maplewood theater is showing Antz.
It is at the Maplewood theater that Antz is
shown.
Antz, that's what's being shown at the Maplewood
theater.
Issues
Clausal structure construction
Lexical selection

109
Surface Realization: In Theory

Typical surface generator requires as input
Semantic representation to be realized
Clausal structure for generated utterance
Surface realization component utilizes a grammar
to generate utterance that conveys the given
semantic representation

110
Surface Realization: In Practice

Canned utterances
Pre-determined utterances for goals, e.g.
Greetings: Hello, this is the ABC bank's
operator.
Repeat: Could you please repeat your request?
Facilitates pre-recorded prompts for speech
output
Template-based generation
Templates for goals, e.g.
Notification: Your call is being transferred to
X.
Inform: A, B, C, D, and E are playing at the F
theater.
Clarify: Did you say X or Y?
Needs cut-and-paste of pre-recorded segments or
full TTS system
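
Template-based generation can be sketched with Python string formatting. The template wording follows the slide. The slot names (`destination`, `movies`, `theater`, `x`, `y`) and the helper functions are assumptions made for illustration.

```python
# Templates keyed by goal; the wording follows the examples above,
# the slot names are assumed.
TEMPLATES = {
    "notification": "Your call is being transferred to {destination}.",
    "inform": "{movies} are playing at the {theater} theater.",
    "clarify": "Did you say {x} or {y}?",
}


def join_list(items):
    """Join items as 'A, B, and C' (or 'A and B' for two items)."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def realize(goal, **slots):
    """Fill the template for a goal with the given slot values."""
    return TEMPLATES[goal].format(**slots)


inform = realize(
    "inform", movies=join_list(["Antz", "Mulan", "Blade"]), theater="Fairmont"
)
clarify = realize("clarify", x="Paramount", y="Fairmont")
```

The "inform" template assumes at least two movies, since a single title would need "is" rather than "are". Agreement problems like this are a known limitation of fixed templates.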

112
Dialogue Evaluation

Goal: determine how well a dialogue system
performs
Main difficulties
No strict right or wrong answers
Difficult to determine what features make a
dialogue system better than another
Difficult to select metrics that contribute to
the overall goodness of the system
Difficult to determine how the metrics compensate
for one another
Expensive to collect new data for evaluating
incremental improvement of systems

113
Dialogue Evaluation (Cont'd)

System-initiative, explicit confirmation
better task success rate
lower WER
longer dialogues
fewer recovery subdialogues
less natural

Mixed-initiative, no confirmation
lower task success rate
higher WER
shorter dialogues
more recovery subdialogues
more natural

114
Dialogue Evaluation Paradigms

Evaluating the end result only
Reference answers
Evaluating both the end result and the process
toward it
Evaluation metrics
Performance functions

115
Evaluation Paradigms: Reference Answers

Evaluates the task success rate only
Suitable for query-answering systems for which a
correct answer can be defined for each query
For each query
Obtain answer from dialogue system
Compare with reference answer
Score system performance
Advantage: simple
Disadvantage: ignores many other important
factors that contribute to quality of dialogue
systems
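
The per-query procedure can be sketched as follows, assuming answers are compared by exact match. The function name and the sample queries are hypothetical.

```python
def task_success_rate(system_answers, reference_answers):
    """Fraction of queries whose system answer equals the reference answer.

    Both arguments map a query to an answer; a query the system did not
    answer counts as incorrect.
    """
    if not reference_answers:
        raise ValueError("no reference answers given")
    correct = sum(
        1
        for query, reference in reference_answers.items()
        if system_answers.get(query) == reference
    )
    return correct / len(reference_answers)


reference = {"Antz at Fairmont": "7pm", "Antz at AMC in Summit": "8pm"}
system = {"Antz at Fairmont": "7pm", "Antz at AMC in Summit": "9pm"}
rate = task_success_rate(system, reference)
```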

116
Evaluation Paradigms: Evaluation Metrics

Different metrics for evaluating different
components of a dialogue system
Speech recognizer: word error rate / word
accuracy
Understanding component: attribute value matrix
Dialogue manager
appropriateness of system responses
error recovery abilities
Overall system
task success
average number of turns
elapsed time
turn correction ratio
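
Word error rate is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal sketch, assuming whitespace tokenization; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing
            insertion = current[j - 1] + 1  # extra hypothesis word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("when is antz playing", "when is ants playing")
```

Because insertions are counted, the rate can exceed 1.0 when the hypothesis is much longer than the reference.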

117
Paradigms: Evaluation Metrics (Cont'd)

Advantage
Takes into account the process toward completing
the task
Limitations
Difficult to determine how different metrics
compensate for one another
Metrics may not be independent of one another
Does not generalize across domains and tasks

118
Paradigms: Performance Functions

PARADISE (Walker et al.) derives performance
functions using both task-based and
dialogue-based metrics
User satisfaction
Maximize task success
Minimize costs
Efficiency measures: e.g., number of utterances,
elapsed time
Qualitative measures: e.g., repair ratio,
inappropriate utterance ratio
Performance function derivation
Obtain user satisfaction ratings (questionnaire)
Obtain values for various metrics (automatic or
manual)
Apply multiple linear regression to derive a
function relating user satisfaction and various
cost factors, e.g.,
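
The regression step can be sketched with the standard library alone. This is an ordinary least-squares fit via the normal equations, with each metric normalized to z-scores first; the normalization, the metric names, and the ratings below are assumptions of this sketch and are invented for illustration, not results from any study.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Rescale values to zero mean and unit standard deviation.

    Raises ZeroDivisionError if all values are identical.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_performance_function(metrics, satisfaction):
    """Least-squares weights relating normalized metrics to satisfaction.

    metrics: dict mapping metric name -> list of per-dialogue values.
    satisfaction: list of per-dialogue user satisfaction ratings.
    Returns a dict with an "intercept" entry plus one weight per metric.
    """
    names = list(metrics)
    columns = [[1.0] * len(satisfaction)]
    columns += [z_normalize(metrics[name]) for name in names]
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * y for p, y in zip(ci, satisfaction)) for ci in columns]
    return dict(zip(["intercept"] + names, solve(xtx, xty)))


# Invented data for five dialogues, for illustration only.
metrics = {"task_success": [1, 0, 1, 1, 0], "num_turns": [10, 20, 12, 8, 25]}
satisfaction = [4.5, 2.0, 4.0, 5.0, 1.5]
weights = fit_performance_function(metrics, satisfaction)
```

With normalized inputs, the size of each weight indicates how strongly that metric is associated with user satisfaction in the sample. A negative weight on a cost such as the number of turns means longer dialogues were rated lower.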

119
Paradigms: Performance Functions (Cont'd)

Advantages
Allows for comparison of dialogue systems
performing different tasks
Specifies relative contributions of cost factors
to overall performance
Can be used to make predictions about future
versions of the dialogue system
Disadvantages
Data collection cost for deriving performance
function is high
Cost for deriving performance function for
multiple systems to draw general conclusions is
high

121
Data Collection: Wizard of Oz Paradigm

Setup for initial data collection
User communicates with system through telephone
(speech) or keyboard (text)
System is actually a human, typically given
instructions on how to behave like a system
Users are typically given tasks to perform in the
target domain
Subjects are the users and the system can be
played by one person
Dialogues between system and user are recorded
and transcribed
Setup for intermediate system evaluation
Use actual running system, with wizard
supervision
Wizard can override undesirable system behavior,
e.g., correct ASR errors

122
Data Collection: Wizard of Oz (Cont'd)

Features of collected data
Typically much less complex than actual
human-human dialogues performing the same tasks
Captures how humans behave when they talk to
computers
Captures variations among different subjects in
both language and approach when performing the
