Spoken Dialogue Systems - PowerPoint PPT Presentation

Tutorial by Bob Carpenter and Jennifer Chu-Carroll, June 20, 1999.


Transcript and Presenter's Notes

Title: Spoken Dialogue Systems


1
Spoken Dialogue Systems
  • Bob Carpenter and Jennifer Chu-Carroll
  • June 20, 1999

2
Tutorial Overview: Data Flow
[Figure: data-flow diagram. Part I: Signal Processing, Speech Recognition, Semantic Interpretation, Speech Synthesis. Part II: Discourse Interpretation, Dialogue Management, Response Generation.]
3
Speech and Audio Processing
  • Signal processing
  • Convert the audio wave into a sequence of feature
    vectors
  • Speech recognition
  • Decode the sequence of feature vectors into a
    sequence of words
  • Semantic interpretation
  • Determine the meaning of the recognized words
  • Speech synthesis
  • Generate synthetic speech from a marked-up word
    string

4
Tutorial Overview: Outline
  • Part I
    • Signal processing
    • Speech recognition
      • acoustic modeling
      • language modeling
      • decoding
    • Semantic interpretation
    • Speech synthesis
  • Part II
    • Discourse and dialogue
      • Discourse interpretation
      • Dialogue management
      • Response generation
    • Dialogue evaluation
    • Data collection

5
Acoustic Waves
  • Human speech generates a wave
  • like a loudspeaker moving
  • A wave for the words "speech lab" looks like

[Figure: waveform of "speech lab" segmented into s, p, ee, ch, l, a, b, with the "l" to "a" transition marked]
Graphs from Simon Arnfield's web tutorial on speech, Sheffield: http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/
6
Acoustic Sampling
  • 10 ms frame (ms = millisecond = 1/1000 second)
  • 25 ms window around frame to smooth signal
    processing

[Figure: overlapping 25 ms windows advancing in 10 ms steps. Result: acoustic feature vectors a1, a2, a3, ...]
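
The framing step above can be sketched as follows. This is a minimal illustration, not the tutorial's implementation; the function name `frame_signal` and its parameters are assumptions made for the example.

```python
def frame_signal(samples, sample_rate, frame_shift_ms=10, window_ms=25):
    """Cut a sample list into overlapping windows, one window per frame shift."""
    shift = int(sample_rate * frame_shift_ms / 1000)   # samples per 10 ms step
    width = int(sample_rate * window_ms / 1000)        # samples per 25 ms window
    frames = []
    start = 0
    while start + width <= len(samples):
        frames.append(samples[start:start + width])
        start += shift
    return frames

# One second of silence at the 8 kHz telephone rate (illustrative input)
frames = frame_signal([0.0] * 8000, 8000)
```

Each window would then be passed to spectral analysis to produce one feature vector per frame.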
7
Spectral Analysis
  • Frequency gives pitch; amplitude gives volume
  • sampling at 8 kHz phone, 16 kHz mic (kHz = 1000
    cycles/sec)
  • Fourier transform of wave yields a spectrogram
  • darkness indicates energy at each frequency
  • hundreds to thousands of frequency samples

[Figure: waveform (amplitude) and spectrogram (frequency) of "speech lab", segmented into s, p, ee, ch, l, a, b]
8
Acoustic Features: Mel Scale Filterbank
  • Derive Mel Scale Filterbank coefficients
  • Mel scale
  • models non-linearity of human audio perception
  • mel(f) = 2595 log10(1 + f / 700)
  • roughly linear to 1000 Hz and then logarithmic
  • Filterbank
  • collapses large number of FFT parameters by
    filtering with 20 triangular filters spaced on
    mel scale

[Figure: triangular filters along the frequency axis producing coefficients m1, m2, m3, m4, m5, m6, ...]
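
A small sketch of the mel formula above and of how filter centers might be spaced evenly on the mel scale. The helper names and the 0 to 4000 Hz range (half of the 8 kHz telephone sampling rate) are assumptions for illustration only.

```python
import math

def hz_to_mel(f):
    """mel(f) = 2595 log10(1 + f / 700), as given on the slide."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filter_centers(num_filters=20, low_hz=0.0, high_hz=4000.0):
    """Center frequencies (Hz) of filters equally spaced in mel."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_filters)]

centers = filter_centers()
```

Because the spacing is uniform in mel, the centers end up close together at low frequencies and farther apart at high frequencies.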
9
Cepstral Coefficients
  • Cepstral Transform is a discrete cosine transform
    of log filterbank amplitudes
  • Result is 12 Mel Frequency Cepstral Coefficients
    (MFCC)
  • Almost independent (unlike mel filterbank)
  • Use Delta (velocity / first derivative) and
    Delta2 (acceleration / second derivative) of MFCC
    (+ 24 features)
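
The cepstral transform and delta features can be sketched as below. This is a simplified illustration: the DCT is unnormalized, and the delta is a plain two-point difference with edge replication, which is an assumption (production front ends typically use a regression over several frames).

```python
import math

def cepstra(log_fbank, num_ceps=12):
    """Discrete cosine transform (type II, unnormalized) of log filterbank values."""
    n = len(log_fbank)
    return [
        sum(log_fbank[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
        for k in range(1, num_ceps + 1)
    ]

def deltas(frames):
    """First difference over time: (next - previous) / 2, replicating the edges."""
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

flat = cepstra([1.0] * 20)                 # flat spectrum: all 12 coefficients near zero
vel = deltas([[0.0], [1.0], [2.0]])        # velocity of a toy one-dimensional feature
```

Applying `deltas` to the MFCCs and then again to the result gives the 12 + 12 + 12 features described above.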

10
Additional Signal Processing
  • Pre-emphasis prior to Fourier transform to boost
    high-frequency energy
  • Liftering to re-scale cepstral coefficients
  • Channel Adaptation to deal with line and
    microphone characteristics (example cepstral
    mean normalization)
  • Echo Cancellation to remove background noise
    (including speech generated from the synthesizer)
  • Adding a Total (log) Energy feature (+/-
    normalization)
  • End-pointing to detect signal start and stop

12
Properties of Recognizers
  • Speaker Independent vs. Speaker Dependent
  • Large Vocabulary (2K-200K words) vs. Limited
    Vocabulary (2-200)
  • Continuous vs. Discrete
  • Speech Recognition vs. Speech Verification
  • Real Time vs. multiples of real time
  • Spontaneous Speech vs. Read Speech
  • Noisy Environment vs. Quiet Environment
  • High Resolution Microphone vs. Telephone vs.
    Cellphone
  • Adapt to speaker vs. non-adaptive
  • Low vs. High Latency
  • With online incremental results vs. final results

13
The Speech Recognition Problem
  • Bayes Law
  • P(a,b) = P(a|b) P(b) = P(b|a) P(a)
  • Joint probability of a and b = probability of b
    times the probability of a given b
  • The Recognition Problem
  • Find most likely sequence w of words given the
    sequence of acoustic observation vectors a
  • Use Bayes law to create a generative model
  • ArgMax_w P(w|a) = ArgMax_w P(a|w) P(w) / P(a)
  •                 = ArgMax_w P(a|w) P(w)
  • Acoustic Model: P(a|w)
  • Language Model: P(w)
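
A toy illustration of the decision rule above: P(a) is the same for every candidate w, so it can be dropped from the maximization. The candidate strings and all probabilities below are made-up numbers for the example.

```python
def recognize(acoustic_likelihood, language_prior):
    """Return the w maximizing P(a|w) * P(w)."""
    return max(language_prior, key=lambda w: acoustic_likelihood[w] * language_prior[w])

# Hypothetical scores for a single acoustic observation sequence a
p_a_given_w = {"recognize speech": 0.0010, "wreck a nice beach": 0.0012}
p_w = {"recognize speech": 0.0200, "wreck a nice beach": 0.0001}

best = recognize(p_a_given_w, p_w)
```

Here the acoustic model slightly prefers the second candidate, but the language model prior outweighs it.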

15
Hidden Markov Models (HMMs)
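  • HMMs provide generative acoustic models P(a|w)
  • probabilistic, non-deterministic finite-state
    automaton
  • state n generates feature vectors with density
    P_n(.)
  • transitions from state j to n are probabilistic,
    with probability P_{j,n}

[Figure: three-state left-to-right HMM; states 1 to 3 have output densities P_1(.), P_2(.), P_3(.), self-loops P_{1,1}, P_{2,2}, P_{3,3}, and forward transitions P_{1,2}, P_{2,3}, P_{3,4}]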
16
HMMs: Single Gaussian Distribution
  • Outgoing likelihoods
  • Feature vector a generated by normal density
    (Gaussian) with mean η and covariance matrix Σ

17
HMMs: Gaussian Mixtures
  • To account for variable pronunciations
  • Each state generates acoustic vectors according
    to a linear combination of m Gaussian models,
    weighted by λ_m

[Figure: three-component mixture model in two dimensions]
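
A sketch of a state's output density as a weighted sum of Gaussians. It assumes diagonal covariance matrices (one variance per dimension) to keep the example short; the slides describe full covariance matrices, and the function names are hypothetical.

```python
import math

def diag_gaussian(x, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def mixture_density(x, weights, means, variances):
    """Linear combination of component densities; weights should sum to 1."""
    return sum(w * diag_gaussian(x, m, v) for w, m, v in zip(weights, means, variances))

# Two-component mixture in two dimensions (illustrative parameters)
p = mixture_density(
    [0.0, 0.0],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [3.0, 3.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
```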
18
Acoustic Modeling with HMMs
  • Train HMMs to represent subword units
  • Units are typically segmental and may vary in
    granularity
  • phonological (40 for English)
  • phonetic (60 for English)
  • context-dependent triphones (14,000 for
    English), which model temporal and spectral
    transitions between phones
  • silence and noise are usually additional symbols
  • Standard architecture is three successive states
    per phone

19
Pronunciation Modeling
  • Needed for speech recognition and synthesis
  • Maps orthographic representation of words to
    sequence(s) of phones
  • Dictionary doesn't cover language due to
  • open classes
  • names
  • inflectional and derivational morphology
  • Pronunciation variation can be modeled with
    multiple pronunciation and/or acoustic mixtures
  • If multiple pronunciations are given, estimate
    likelihoods
  • Use rules (e.g. assimilation, devoicing,
    flapping), or statistical transducers

20
Lexical HMMs
  • Create compound HMM for each lexical entry by
    concatenating the phones making up the
    pronunciation
  • example of HMM for "lab" (following "speech", for the
    cross-word triphone)
  • Multiple pronunciations can be weighted by
    likelihood into compound HMM for a word
  • (Tri)phone models are independent parts of word
    models

[Figure: compound HMM for "lab"]
phone:     l       a       b
triphone:  ch-l+a  l-a+b   a-b
21
HMM Training Baum-Welch Re-estimation
  • Determines the probabilities for the acoustic HMM
    models
  • Bootstraps from initial model
  • hand aligned data, previous models or flat start
  • Allows embedded training of whole utterances
  • transcribe utterance to words w_1,...,w_k and
    generate a compound HMM by concatenating compound
    HMMs for words m_1,...,m_k
  • calculate acoustic vectors a_1,...,a_n
  • Iteratively converges to a new estimate
  • Re-estimates all paths because states are hidden
  • Provides a maximum likelihood estimate
  • model that assigns training data the highest
    likelihood

23
Probabilistic Language Modeling: History
  • Assigns probability P(w) to word sequence
    w = w_1, w_2, ..., w_k
  • Bayes Law provides a history-based model
  • P(w_1, w_2, ..., w_k)
  • = P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) ...
    P(w_k|w_1,...,w_{k-1})
  • Cluster histories to reduce number of parameters

24
N-gram Language Modeling
  • n-gram assumption clusters histories based on the
    last n-1 words
  • P(w_j | w_1,...,w_{j-1}) ≈ P(w_j | w_{j-n+1},...,w_{j-2},w_{j-1})
  • unigrams: P(w_j)
  • bigrams: P(w_j | w_{j-1})
  • trigrams: P(w_j | w_{j-2},w_{j-1})
  • Trigrams often interpolated with bigram and
    unigram
  • the interpolation weights λ_i are typically
    estimated by maximum likelihood estimation on
    held-out data (F(..) are relative frequencies)
  • many other interpolations exist (another standard
    is a non-linear backoff)
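
The interpolation formula did not survive in the transcript. A common form, assumed here, is P(w3|w1,w2) = λ3 F(w3|w1,w2) + λ2 F(w3|w2) + λ1 F(w3), with F the relative frequencies. The sketch below uses fixed, made-up weights rather than weights estimated on held-out data, and a toy corpus.

```python
from collections import Counter

def train(tokens):
    """Count unigrams, bigrams and trigrams in a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interpolated_prob(w1, w2, w3, counts, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram and trigram relative frequencies."""
    uni, bi, tri = counts
    l1, l2, l3 = lambdas
    f1 = uni[w3] / sum(uni.values())
    f2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    f3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * f1 + l2 * f2 + l3 * f3

corpus = "flights from boston to austin flights from austin to boston".split()
counts = train(corpus)
p_seen = interpolated_prob("flights", "from", "boston", counts)
p_unseen_history = interpolated_prob("x", "y", "boston", counts)
```

The second call shows the point of interpolating: an unseen history still receives a non-zero probability from the unigram term.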

25
Extended Probabilistic Language Modeling
  • Histories can include some indication of semantic
    topic
  • latent-semantic indexing (vector-based
    information retrieval model)
  • topic-spotting and blending of topic-specific
    models
  • dialogue-state specific language models
  • Language models can adapt over time
  • recent history updates model through
    re-estimation or blending
  • often done by boosting estimates for seen words
    (triggers)
  • new words and/or pronunciations can be added
  • Can estimate category tags (syntactic and/or
    semantic)
  • Joint word/category model P(word_1:tag_1,...,word_k:tag_k)
  • example: P(word:tag | History) ≈ P(word | tag)
    P(tag | History)

26
Finite State Language Modeling
  • Write a finite-state task grammar (with
    non-recursive CFG)
  • Simple Java Speech API example (from user's
    guide)
  • public <Command> = [<Polite>] <Action>
    <Object> (and <Object>)*;
  • <Action> = open | close | delete;
  • <Object> = the window | the file;
  • <Polite> = please;
  • Typically assume that all transitions are
    equi-probable
  • Technology used in most current applications
  • Can put semantic actions in the grammar

27
Information Theory Perplexity
  • Perplexity is standard model of recognition
    complexity given a language model
  • Perplexity measures the conditional likelihood of
    a corpus, given a language model P(.)
  • Roughly the number of equi-probable choices per
    word
  • Typically computed by taking logs and applying
    history-based Bayesian decomposition
  • But lower perplexity doesn't guarantee better
    recognition
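
A minimal sketch of the perplexity computation described above: take logs of the per-word conditional probabilities from the history-based decomposition, average, and exponentiate. The input format (a plain list of per-word probabilities) is an assumption made for the example.

```python
import math

def perplexity(word_probs):
    """word_probs[j] is P(w_j | history) for each word of the test corpus."""
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2.0 ** (-log_sum / len(word_probs))

uniform = perplexity([0.1] * 50)     # ten equi-probable choices per word
mixed = perplexity([0.5, 0.125])
```

With every word assigned probability 1/10 the perplexity is 10, matching the reading of perplexity as the number of equi-probable choices per word.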

28
Zipf's Law
  • Lexical frequency is inversely proportional to
    rank
  • Frequency(n) = frequency of the n-th most frequent
    word
  • Zipf's Law: Frequency(Rank) = Frequency(1)/Rank
  • Thus log Frequency(Rank) = log Frequency(1) - log Rank

From G.R. Turner's web site on Zipf's law: http://www.btinternet.com/g.r.turner/ZipfDoc.htm
29
Vocabulary Acquisition
  • IBM personal E-mail corpus of PDB (by R.L.
    Mercer)
  • static coverage is given by most frequent n words
  • dynamic coverage is most recent n words

30
Language HMMs
  • Can take HMMs for each word and combine into a
    single HMM for the whole language (allows
    cross-word models)
  • Result is usually too large to expand statically
    in memory
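  • A two-word example is given by

[Figure: two-word language HMM with word models word1 and word2, entry probabilities P(word1) and P(word2), and transitions P(word1|word1), P(word2|word1), P(word1|word2), P(word2|word2)]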
32
HMM Decoding
  • Decoding problem is finding the best word sequence
  • ArgMax over w_1,...,w_m of P(w_1,...,w_m | a_1,...,a_n)
  • Words w_1...w_m are fully determined by sequences of
    states
  • Many state sequences produce the same words
  • The Viterbi assumption
  • the word sequence derived from the most likely
    path will be the most likely word sequence (as
    would be computed over all paths)
  • Viterbi recursion: φ_i(s) = max_r [ φ_{i-1}(r) P_{r,s} P_s(a_i) ]
  • max_r: max over previous states r
  • φ_{i-1}(r): likelihood previous state is r
  • P_{r,s}: transition probability from r to s
  • P_s(a_i): acoustics for state s for input a_i
33
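Visualizing Viterbi Decoding: The Trellis
[Figure: Viterbi trellis. Columns are the inputs a_{i-1}, a_i, a_{i+1} at times t_{i-1}, t_i, t_{i+1}; rows are the states s1, s2, ..., sk, each with scores φ_{i-1}(s), φ_i(s), φ_{i+1}(s); arcs between columns carry transition probabilities such as P_{1,1}, P_{1,2}, P_{2,1}, P_{1,k}, P_{k,1}; the best path is highlighted.]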
34
Viterbi Search: Dynamic Programming Token Passing
  • Algorithm
  • Initialize all states with a token with a null
    history and the likelihood that it's a start
    state
  • For each frame a_k
  • For each token t in state s with probability
    P(t), history H
  • For each state r
  • Add new token to r with probability P(t) P_{s,r}
    P_r(a_k), and history s.H
  • Time synchronous from left to right
  • Allows incremental results to be evaluated
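
A compact sketch of token passing over a toy two-state model. It keeps only the best token per state at each frame (the dynamic programming step), and, unlike the slide's s.H notation, it stores the current state inside the history for simplicity. The model, its probabilities and the discrete "x"/"y" observations are all made up for illustration.

```python
def viterbi(frames, states, start_p, trans_p, emit_p):
    """Return (probability, state path) of the single best path."""
    # One token per state: (probability, history with most recent state first)
    tokens = {s: (start_p[s] * emit_p[s](frames[0]), (s,)) for s in states}
    for a in frames[1:]:
        new_tokens = {}
        for r in states:
            # Best incoming token for state r
            prob, hist = max(
                ((p * trans_p[s].get(r, 0.0), h) for s, (p, h) in tokens.items()),
                key=lambda t: t[0],
            )
            new_tokens[r] = (prob * emit_p[r](a), (r,) + hist)
        tokens = new_tokens
    prob, hist = max(tokens.values(), key=lambda t: t[0])
    return prob, list(reversed(hist))

states = ["A", "B"]
start_p = {"A": 1.0, "B": 0.0}
trans_p = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
emit_p = {
    "A": lambda a: {"x": 0.9, "y": 0.1}[a],
    "B": lambda a: {"x": 0.2, "y": 0.8}[a],
}
prob, path = viterbi(["x", "x", "y", "y"], states, start_p, trans_p, emit_p)
```

Because the loop is time synchronous, the best token after any frame is available as an incremental result.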

35
Pruning the Search Space
  • Entire search space for Viterbi search is much
    too large
  • Solution is to prune tokens for paths whose score
    is too low
  • Typical methods are
  • histogram: only keep at most n total hypotheses
  • beam: only keep hypotheses whose score is within a
    fraction of the best score
  • Need to balance small n and tight beam to limit
    search against minimal search error (good hypotheses
    falling off beam)
  • HMM densities are usually scaled differently than
    the discrete likelihoods from the language model
  • typical solution: boost the language model's dynamic
    range, using P(w)^n P(a|w), usually with n around
    15
  • Often include penalty for each word to favor
    hypotheses with fewer words
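
The pruning and score-combination ideas above can be sketched as follows. The function names, the default thresholds and the (score, words) representation are assumptions for the example; in the log domain, raising P(w) to the power n becomes multiplying the language model log probability by n.

```python
def prune(hypotheses, max_hyps=3, beam=0.01):
    """Keep hypotheses scoring at least beam * best, then at most max_hyps of them.

    hypotheses is a list of (score, words) pairs with scores as probabilities.
    """
    best = max(score for score, _ in hypotheses)
    in_beam = [h for h in hypotheses if h[0] >= best * beam]
    return sorted(in_beam, key=lambda h: h[0], reverse=True)[:max_hyps]

def combined_log_score(log_p_acoustic, log_p_lm, num_words, lm_scale=15.0, word_penalty=0.0):
    """log of P(w)^n P(a|w), minus an optional per-word penalty."""
    return log_p_acoustic + lm_scale * log_p_lm - word_penalty * num_words

hyps = [(0.5, "a"), (0.4, "b"), (0.3, "c"), (0.2, "d"), (0.001, "e")]
histogram_pruned = prune(hyps, max_hyps=3, beam=0.01)
beam_pruned = prune(hyps, max_hyps=10, beam=0.01)
```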

36
N-best Hypotheses and Word Graphs
  • Keep multiple tokens and return n-best
    paths/scores
  • p1 flights from Boston today
  • p2 flights from Austin today
  • p3 flights for Boston to pay
  • p4 lights for Boston to pay
  • Can produce a packed word graph (a.k.a. lattice)
  • likelihoods of paths in lattice should equal
    likelihood for n-best

37
Search-based Decoding
  • A* search
  • Compute all initial hypotheses and place in
    priority queue
  • For best hypothesis in queue
  • extend by one observation, compute next state
    score(s) and place into the queue
  • Scoring now compares derivations of different
    lengths
  • would like to, but can't, compute cost to complete
    until all data is seen
  • instead, estimate with simple normalization for
    length
  • usually prune with beam and/or histogram
    constraints
  • Easy to include unbounded amounts of history
    because no collapsing of histories as in dynamic
    programming n-gram
  • Also known as stack decoder (priority queue is
    stack)

38
Multiple Pass Decoding
  • Perform multiple passes, applying successively
    more fine-grained language models
  • Can much more easily go beyond finite state or
    n-gram
  • Can use for Viterbi or stack decoding
  • Can use word graph as an efficient interface
  • Can compute likelihood to complete hypotheses
    after each pass and use in next round to tighten
    beam search
  • First pass can even be a free phone decoder
    without a word-based language model

39
Measuring Recognition Accuracy
  • Word Error Rate
  • Example scoring
      actual utterance:  four  --  six  seven  nine  three  three  seven
      recognizer:        four  oh  six  seven  five  three  --     seven
      error:                   ins             sub          del
  • WER = (1 + 1 + 1)/7 = 43%
  • Would like to study concept accuracy
  • typically count only errors on content words
    (application dependent)
  • ignore case marking (singular, plural, etc.)
  • For word/concept spotting applications
  • recall: percentage of target words (concepts)
    found
  • precision: percentage of hypothesized words
    (concepts) in target
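
Word error rate is usually computed from a minimum edit distance alignment between the reference and the recognizer output. The sketch below is one standard way to do that (uniform costs for insertion, deletion and substitution are assumed), applied to the example from the slide.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitute, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[-1][-1] / len(ref)

wer = word_error_rate(
    "four six seven nine three three seven",
    "four oh six seven five three seven",
)
```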

40
Empirical Recognition Accuracies
  • Cambridge HTK, 1997: multipass HMM with lattice
    rescoring
  • Top performer in ARPA's HUB-4 Broadcast News
    task
  • 65,000-word vocabulary; out-of-vocabulary rate 0.5%
  • Perplexities
  • word bigram: 240 (6.9 million bigrams)
  • backoff trigram of 1000 categories: 238 (803K bi, 7.1G tri)
  • word trigram: 159 (8.4 million trigrams)
  • word 4-gram: 147 (8.6 million 4-grams)
  • word 4-gram + category trigram: 137
  • Word error rates
  • clean, read speech: 9.4%
  • clean, spontaneous speech: 15.2%
  • low fidelity speech: 19.5%

41
Empirical Recognition Accuracies (contd)
  • Lucent 1998, single pass HMM
  • Typical of real-time telephony performance (low
    fidelity)
  • 3,000-word vocabulary; out-of-vocabulary rate 1.5%
  • Blended models from customer/operator and
    customer/system
  • Perplexities
                 customer/op       customer/system
      bigram     105.8 (27,200)    32.1 (12,808)
      trigram     99.5 (68,500)    24.4 (25,700)
  • Word error rate: 23%
  • Content term (single, pair, triple of words)
    precision/recall
  • one-word terms: 93.7 / 88.4
  • two-word terms: 96.9 / 85.4
  • three-word terms: 98.5 / 84.3

42
Confidence Scoring and Rejection
  • Alternative to standard acoustic density scoring
  • compute HMM acoustic score for word(s) in usual
    way
  • baseline score for an anti-model
  • compute hypothesis ratio (Word Score / Baseline
    Score)
  • test hypothesis ratio vs. threshold
  • Can be applied to
  • free word spotting (given pronunciations)
  • (word-by-word) acoustic confidence scoring for
    later processing
  • verbal information verification
  • existing info name, address, social security
    number
  • password

44
Semantic Interpretation: Word Strings
  • Content is just words
  • System: What is your address?
  • User: fourteen eleven main street
  • Can also do concept extraction / keyword(s)
    spotting
  • User: My address is fourteen eleven main
    street
  • Applications
  • template filling
  • directory services
  • information retrieval

45
Semantic Interpretation: Pattern-Based
  • Simple (typically regular) patterns specify
    content
  • ATIS (Air Travel Information System) Task
  • System: What are your travel plans?
  • User: On Monday, I'm going from Boston
    to San Francisco.
  • Content: DATE=Monday, ORIGIN=Boston,
    DESTINATION=SFO
  • Can combine content-extraction and language
    modeling
  • but can be too restrictive as a language model
  • Java Speech API (curly brackets show semantic
    actions)
  • public <command> = <action> [<object>]
    [<polite>];
  • <action> = open {OP} | close {CL} | move {MV};
  • <object> = [<this_that_etc>] window | door;
  • <this_that_etc> = a | the | this | that | the current;
  • <polite> = please | kindly;
  • Can be generated and updated on the fly (e.g., Web
    apps)
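
A sketch of pattern-based content extraction for the travel example above, using a single regular expression. The pattern, the `AIRPORTS` lookup table and the `extract` function are hypothetical and cover only this one sentence shape; a deployed system would need many patterns.

```python
import re

# Hypothetical pattern: "on <DATE> ... from <ORIGIN> to <DESTINATION>"
TRAVEL = re.compile(
    r"on (?P<DATE>\w+),? .*?from (?P<ORIGIN>[\w ]+?) to (?P<DESTINATION>[\w ]+?)[.?!]?$",
    re.IGNORECASE,
)
AIRPORTS = {"san francisco": "SFO"}  # hypothetical city-to-code table

def extract(utterance):
    """Return a slot dictionary, or None when the pattern does not match."""
    match = TRAVEL.search(utterance.strip())
    if match is None:
        return None
    slots = match.groupdict()
    destination = slots["DESTINATION"]
    slots["DESTINATION"] = AIRPORTS.get(destination.lower(), destination)
    return slots

content = extract("On Monday, I'm going from Boston to San Francisco.")
```

An utterance that does not fit the pattern returns `None`, which illustrates why such patterns can be too restrictive when also used as the language model.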

46
Semantic Interpretation: Parsing
  • In general case, have to uncover who did what to
    whom
  • System: What would you like me to do next?
  • User: Put the block in the box on Platform 1.
    (ambiguous)
  • System: How can I help you?
  • User: Where is A Bug's Life playing in Summit?
  • Requires some kind of parsing to produce
    relations
  • Who did what to whom:
    ?(where(present(in(Summit, play(BugsLife)))))
  • This kind of representation often used for
    machine translation
  • Often transferred to flatter frame-based
    representation
  • Utterance type: where-question
  • Movie: A Bug's Life
  • Town: Summit

47
Robustness and Partiality
  • Controlled Speech
  • limited task vocabulary; limited task grammar
  • Spontaneous Speech
  • Can have high out-of-vocabulary (OOV) rate
  • Includes restarts, word fragments, omissions,
    phrase fragments, disagreements, and other
    disfluencies
  • Contains much grammatical variation
  • Causes high word error-rate in recognizer
  • Parsing is often partial, allowing
  • omission
  • parsing fragments

49
Recorded Prompts
  • The simplest (and most common) solution is to
    record prompts spoken by a (trained) human
  • Produces human quality voice
  • Limited by number of prompts that can be recorded
  • Can be extended by limited cut-and-paste or
    template filling

50
Speech Synthesis
  • Rule-based Synthesis
  • Uses linguistic rules (+/- training) to generate
    features
  • Example DECTalk
  • Concatenative Synthesis
  • Record basic inventory of sounds
  • Retrieve appropriate sequence of units at run
    time
  • Concatenate and adjust durations and pitch
  • Waveform synthesis

51
Diphone and Polyphone Synthesis
  • Phone sequences capture co-articulation
  • Cut speech in positions that minimize context
    contamination
  • Need single phones, diphones and sometimes
    triphones
  • Reduce number collected by
  • phonotactic constraints
  • collapsing in cases of no co-articulation
  • Data Collection Methods
  • Collect data from a single (professional) speaker
  • Select text with maximal coverage (typically with
    greedy algorithm), or
  • Record minimal pairs in desired contexts (real
    words or nonsense)

52
Duration Modeling
  • Must generate segments with the appropriate
    duration
  • Segmental Identity
  • /ai/ in "like" is twice as long as /I/ in "lick"
  • Surrounding Segments
  • vowels longer following voiced fricatives than
    voiceless stops
  • Syllable Stress
  • onsets and nuclei of stressed syllables longer
    than in unstressed
  • Word importance
  • word accent with major pitch movement lengthens
  • Location of Syllable in Word
  • word ending longer than word starting longer than
    word internal
  • Location of the Syllable in the Phrase
  • phrase final syllables longer than same syllable
    in other positions

53
Intonation Tone Sequence Models
  • Functional Information can be encoded via tones
  • given/new information (information status)
  • contrastive stress
  • phrasal boundaries (clause structure)
  • dialogue act (statement/question/command)
  • Tone Sequence Models
  • F0 contours generated from phonologically
    distinctive tones/pitch accents which are locally
    independent
  • generate a sequence of tonal targets and fit with
    signal processing

54
Intonation for Function
  • ToBI (Tone and Break Index) system is one
    example
  • Pitch Accent * (H*, L*, H*+L, H+L*, L*+H,
    L+H*)
  • Phrase Accent - (H-, L-)
  • Boundary Tone % (H%, L%)
  • Intonational Phrase
  • <Pitch Accent>+ <Phrase
    Accent> <Boundary Tone>

[Figure: statement vs. question example]
Source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998
55
Text Markup for Synthesis
  • Bell Labs TTS Markup
  • r(0.9) LH(0.8) Humpty LH(0.8) Dumpty
    r(0.85) L(0.5) sat on a H(1.2) wall.
  • Tones: Tone(Prominence)
  • Speaking rate: r(Rate), and pauses
  • Top Line (highest pitch), Reference Line
    (reference pitch), Base Line (lowest pitch)
  • SABLE is an emerging standard extending SGML:
    http://www.cstr.ed.ac.uk/projects/sable.html
  • marks emphasis(), break(), pitch(base/mid/range,...),
    rate(), volume(), semanticMode(date/time/email/URL/...),
    speaker(age,sex)
  • Implemented in Festival Synthesizer (free for
    research, etc.):
    http://www.cstr.ed.ac.uk/projects/festival.html

56
Intonation in Bell Labs TTS
  • Generate a sequence of F0 targets for synthesis
  • Example
  • We were away a year ago.
  • phones w E w R w A y E r g O
  • Default Declarative intonation (H) H L- L
    question L H- H

source Multilingual Text-to-Speech Synthesis, R.
Sproat, ed., Kluwer, 1998
57
Signal Processing for Speech Synthesis
  • Diphones recorded in one context must be
    generated in other contexts
  • Features are extracted from recorded units
  • Signal processing manipulates features to smooth
    boundaries where units are concatenated
  • Signal processing modifies signal via
    interpolation
  • intonation
  • duration

58
The Source-Filter Model of Synthesis
  • Model of features to be extracted and fitted
  • Excitation or Voicing Source(s) to model sound
    source
  • standard wave of glottal pulses for voiced sounds
  • randomly varying noise for unvoiced sounds
  • modification of airflow due to lips, etc.
  • high frequency (F0 rate), quasi-periodic, choppy
  • modeled with vector of glottal waveform patterns
    in voiced regions
  • Acoustic Filter(s)
  • shapes the frequency character of vocal tract and
    radiation character at the lips
  • relatively slow (samples around 5ms suffice) and
    stationary
  • modeled with LPC (linear predictive coding)

59
Barge-in
  • Technique to allow speaker to interrupt the
    system's speech
  • Combined processing of input signal and output
    signal
  • Signal detector runs looking for speech start and
    endpoints
  • tests a generic speech model against noise model
  • typically cancels echoes created by outgoing
    speech
  • If speech is detected
  • Any synthesized or recorded speech is cancelled
  • Recognition begins and continues until end point
    is detected

60
Speech Application Programming Interfaces
  • Abstract from recognition/synthesis engines
  • Recognizer and synthesizer loading
  • Acoustic and grammar model loading (dynamic
    updates)
  • Recognition
  • online
  • n-best or lattice
  • Synthesis
  • markup
  • barge in
  • Acoustic control
  • telephony interface
  • microphone/speaker interface

61
Speech API Examples
  • SAPI: Microsoft Speech API (rec + synth)
  • communicates through COM objects
  • instances: most systems implement all or some of
    this (Dragon, IBM, Lucent, L&H, etc.)
  • JSAPI: Java Speech API (rec + synth)
  • communicates through Java events (like GUI)
  • concurrency through threads
  • instances: IBM ViaVoice (rec), L&H (synth)
  • (J)HAPI: (Java) HTK API (recognition)
  • communicates through C or Java port of C
    interface
  • e.g., Entropic's Cambridge Research Lab's HMM Tool
    Kit (HTK)
  • Galaxy (rec + synth)
  • communicates through a production system
    scripting language
  • MIT system, ported by MITRE for DARPA Communicator

63
Discourse & Dialogue Processing
  • Discourse interpretation
  • Understand what the user really intends by
    interpreting utterances in context
  • Dialogue management
  • Determine system goals in response to user
    utterances based on user intention
  • Response generation
  • Generate natural language utterances to achieve
    the selected goals

64
Discourse Interpretation
  • Goal: understand what the user really intends
  • Example: "Can you move it?"
  • What does "it" refer to?
  • Is the utterance intended as a simple yes-no
    query or a request to perform an action?
  • Issues addressed
  • Reference resolution
  • Intention recognition
  • Interpret user utterances in context

65
Reference Resolution
U: Where is A Bug's Life playing in Summit?
S: A Bug's Life is playing at the Summit theater.
U: When is it playing there?
S: It's playing at 2pm, 5pm, and 8pm.
U: I'd like 1 adult and 2 children for the first show. How much would that cost?
  • Knowledge sources
  • Domain knowledge
  • Discourse knowledge
  • World knowledge

66
Reference Resolution In Theory
  • Focus stacks
  • Maintain recent objects in stack
  • Select objects that satisfy semantic/pragmatic
    constraints starting from top of stack
  • May take into account discourse structure
  • Centering
  • Backward-looking center (Cb): object connecting
    the current sentence with the previous sentence
  • Forward-looking centers (Cf): potential Cb of the
    next sentence
  • Rule-based filtering and ranking of objects for
    pronoun resolution

67
Reference Resolution In Practice
  • Non-existent: does not allow the use of anaphoric
    references
  • Allows only simple references
  • utilizes the focus stack reference resolution
    mechanism
  • does not take into account discourse structure
    information
  • Example

U: Where is A Bug's Life playing in Summit?
[Focus stack: Summit; A Bug's Life]
68

S: A Bug's Life is playing at the Summit theater.
[Focus stack: Summit theater; A Bug's Life]
69
U: When is it playing there?
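
A minimal sketch of focus-stack reference resolution for the exchange above. The entity kinds ("movie", "place") and the pronoun-to-kind table stand in for the semantic and pragmatic constraints and are assumptions for the example; discourse structure is ignored, as in the simple systems described.

```python
class FocusStack:
    """Recently mentioned objects, most recent on top."""

    def __init__(self):
        self._items = []  # (name, kind) pairs, most recent last

    def push(self, name, kind):
        # Re-mentioning an object moves it to the top of the stack
        self._items = [(n, k) for n, k in self._items if n != name]
        self._items.append((name, kind))

    def resolve(self, kind):
        """Return the most recent object satisfying the constraint, or None."""
        for name, k in reversed(self._items):
            if k == kind:
                return name
        return None

PRONOUN_KINDS = {"it": "movie", "there": "place"}  # hypothetical constraints

stack = FocusStack()
stack.push("Summit", "place")                 # U: Where is A Bug's Life playing in Summit?
stack.push("A Bug's Life", "movie")
stack.push("A Bug's Life", "movie")           # S: A Bug's Life is playing at the Summit theater.
stack.push("Summit theater", "place")

it = stack.resolve(PRONOUN_KINDS["it"])       # U: When is it playing there?
there = stack.resolve(PRONOUN_KINDS["there"])
```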
70
Intention Recognition
B: I have to wash my hair.
71

A: Would you like to go to the hairdresser?
  • B's utterance should be interpreted as an
    acceptance of A's proposal.

72

A: What's that smell around here?
  • B's utterance should be interpreted as an answer
    to A's question.

73

A: Would you be interested in going out to
dinner tonight?
  • B's utterance should be interpreted as a
    rejection of A's proposal.

74
Intention Recognition (Contd)
  • Goal: to recognize the intent of each user
    utterance as one (or more) of a set of dialogue
    acts based on context
  • Sample dialogue actions
  • Switchboard DAMSL
  • Conventional-closing
  • Statement-(non-)opinion
  • Agree/Accept
  • Acknowledgment
  • Yes-No-Question/Yes-Answer
  • Non-verbal
  • Abandoned
  • On-going standardization efforts (Discourse
    Resource Initiative)
  • Verbmobil
  • Greet/Thank/Bye
  • Suggest
  • Accept/Reject
  • Confirm
  • Clarify-Query/Answer
  • Give-Reason
  • Deliberate

75
Intention Recognition In Theory
  • Knowledge sources
  • Overall dialogue goals
  • Orthographic features, e.g.
  • punctuation
  • cue words/phrases: "but", "furthermore", "so"
  • transcribed words: "would you please", "I want
    to"
  • Dialogue history, i.e., previous dialogue act
    types
  • Dialogue structure, e.g.
  • subdialogue boundaries
  • dialogue topic changes
  • Prosodic features of utterance: duration, pause,
    F0, speaking rate

76
Intention Recognition In Theory (Contd)
  • Finite-state dialogue grammar
  • e.g. [Figure: finite-state dialogue grammar with arcs S/Greet, U/Question, S/Answer, U/Question, U/Bye]
  • Plan-based discourse understanding
  • Recipes: templates for performing actions
  • Inference rules to construct plausible plans
  • Empirical methods
  • Probabilistic dialogue act classifiers: HMMs
  • Rule-based dialogue act recognition: CART,
    Transformation-based learning
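
A finite-state dialogue grammar can be written as a transition table over (speaker, dialogue act) pairs. The original diagram is lost, so the state names and topology below are one plausible reading of its arc labels (greet, then question/answer pairs, then bye) and should be treated as an assumption.

```python
# (state, (speaker, act)) -> next state; state names are hypothetical
TRANSITIONS = {
    ("start", ("S", "Greet")): "greeted",
    ("greeted", ("U", "Question")): "asked",
    ("asked", ("S", "Answer")): "answered",
    ("answered", ("U", "Question")): "asked",
    ("answered", ("U", "Bye")): "end",
}

def accepts(acts):
    """True if the sequence of (speaker, act) pairs is licensed by the grammar."""
    state = "start"
    for act in acts:
        state = TRANSITIONS.get((state, act))
        if state is None:
            return False
    return state == "end"

ok = accepts([
    ("S", "Greet"), ("U", "Question"), ("S", "Answer"),
    ("U", "Question"), ("S", "Answer"), ("U", "Bye"),
])
bad = accepts([("S", "Greet"), ("U", "Bye")])
```

In a recognizer, the acts leaving the current state constrain which dialogue act the next utterance can be.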
77
Intention Recognition In Practice
  • Makes assumptions about (high-level)
    task-specific intentions, e.g.,
  • Call routing: giving destination information
  • ATIS: requesting flight information
  • Movie information system: movie showtime or
    theater playlist information
  • Does not allow user-initiated complex dialogue
    acts, e.g. confirmation, clarification, or
    indirect responses

S1: What's your account number?
U1: Is that the number on my ATM card?
S2: Would you like to transfer $1,500 from savings to checking?
U2: If I have enough in savings.
78
Intention Recognition In Practice (Contd)
  • User utterances can play one of two roles
  • Identify one of a set of possible task intentions
  • Provide necessary information for performing a
    task
  • Based on either keywords in an utterance or its
    syntactic/semantic representation
  • Maps keywords or representations to intentions
    using
  • Template matching
  • Probabilistic model
  • Vector-based similarity measures

79
Intention Recognition Example
U: What time is A Bug's Life playing at the
Summit theater?
  • Using keyword extraction and vector-based
    similarity measures
  • Intention: Ask-Reference _time
  • Movie: A Bug's Life
  • Theater: the Summit quadplex
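
A sketch of vector-based intention recognition: the utterance and each intention are represented as bag-of-words vectors and compared by cosine similarity. The intention labels and keyword lists are invented for the example and are not taken from any system described in the tutorial.

```python
import math
from collections import Counter

# Hypothetical keyword profiles, one per task intention
INTENT_KEYWORDS = {
    "ask_time": "what time when showtime playing",
    "ask_theater": "where which theater playlist showing",
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(utterance):
    """Return the intention whose keyword vector is closest to the utterance."""
    vec = Counter(utterance.lower().replace("?", "").split())
    return max(
        INTENT_KEYWORDS,
        key=lambda name: cosine(vec, Counter(INTENT_KEYWORDS[name].split())),
    )

intent = classify("What time is A Bug's Life playing at the Summit theater?")
```

Slot values such as the movie and theater would be filled separately by keyword extraction.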

81
Dialogue Management: Motivating Examples
  • Dialogue 1

S: Would you like movie showtime or theater playlist information?
U: Movie showtime.
S: What movie do you want showtime information about?
U: Saving Private Ryan.
S: At what theater do you want to see Saving Private Ryan?
U: Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater.
82
DM: Motivating Examples (Cont'd)
  • Dialogue 2

S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater, but it's
   playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.
83
DM: Motivating Examples (Cont'd)
  • Dialogue 3

S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is playing at the Fairmont theater at 6:00 and 8:30.
U: I wanted to know about the Paramount theater, not the Fairmont theater.
S: Saving Private Ryan is not playing at the Paramount theater, but it's
   playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.
84
Comparison of Sample Dialogues
  • Dialogue 1
  • System-initiative
  • Implicit confirmation
  • Merely informs user of failed query
  • Mechanical
  • Least efficient
  • Dialogue 2
  • Mixed-initiative
  • No confirmation
  • Suggests alternative when query fails
  • More natural
  • Most efficient
  • Dialogue 3
  • Mixed-initiative
  • No confirmation
  • Suggests alternative when query fails
  • More natural
  • Moderately efficient

85
Dialogue Management
  • Goal: determine what to accomplish in response to
    user utterances, e.g.
  • Answer user question
  • Solicit further information
  • Confirm/Clarify user utterance
  • Notify invalid query
  • Notify invalid query and suggest alternative
  • Interface between user/language processing
    components and system knowledge base
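The goals listed above can be read as the output of a selection step that sits between the interpreted user request and the knowledge base. The sketch below assumes a toy showtime table, a single recognizer confidence score, and a fixed confirmation threshold; all three are hypothetical.

```python
# Hypothetical showtime database: (movie, theater) -> list of showtimes.
SHOWTIMES = {
    ("Saving Private Ryan", "Madison"): ["3:00", "5:30", "8:00", "10:30"],
    ("Saving Private Ryan", "Fairmont"): ["6:00", "8:30"],
}
CONFIRM_THRESHOLD = 0.5  # assumed cutoff on recognizer confidence


def select_goal(movie, theater, confidence):
    """Pick one of the dialogue manager goals for a showtime query."""
    # Solicit further information when a required slot is missing.
    if movie is None or theater is None:
        return ("solicit", "movie" if movie is None else "theater")
    # Confirm/clarify when the recognizer was unsure.
    if confidence < CONFIRM_THRESHOLD:
        return ("confirm", theater)
    # Answer the question when the query succeeds.
    if (movie, theater) in SHOWTIMES:
        return ("answer", SHOWTIMES[(movie, theater)])
    # Otherwise notify, suggesting other theaters if the movie plays anywhere.
    alternatives = sorted(t for (m, t) in SHOWTIMES if m == movie)
    if alternatives:
        return ("notify_and_suggest", alternatives)
    return ("notify", None)
```

Dropping the `notify_and_suggest` branch gives the behavior of Dialogue 1; keeping it gives the behavior of Dialogue 2.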

86
Dialogue Management (Cont'd)
  • Main design issues
  • Functionality: how much should the system do?
  • Process: how should the system do it?
  • Affected by
  • Task complexity: how hard the task is
  • Dialogue complexity: what dialogue phenomena are
    allowed
  • Affects
  • robustness
  • naturalness
  • perceived intelligence

87
Task Complexity
  • Application dependent
  • Examples

[Figure: a scale running from Simple to Complex, with example
tasks Weather Information, ATIS, Call Routing, Travel Planning,
Automatic Banking, and University Course Advisement]
  • Directly affects
  • Types and quantity of system knowledge
  • Complexity of system's reasoning abilities

88
Dialogue Complexity
  • Determines what can be talked about
  • The task only
  • Subdialogues, e.g., clarification, confirmation
  • The dialogue itself: meta-dialogues
  • Could you hold on for a minute?
  • What was that click? Did you hear it?
  • Determines who can talk about them
  • System only
  • User only
  • Both participants

89
Dialogue Management Functionality
  • Determines the set of possible goals that the
    system may select at each turn
  • At the task level, dictated by task complexity
  • At the dialogue level, determined by system
    designer in terms of dialogue complexity
  • Are subdialogues allowed?
  • Are meta-dialogues allowed?
  • Only by the system, by the user, or by both
    agents?

90
DM Functionality: In Theory
  • Task complexity: moderate to complex
  • Travel planning
  • University course advisement
  • Dialogue complexity
  • System/user-initiated complex subdialogues
  • Embedded negotiation subdialogues
  • Expressions of doubt
  • Meta-dialogues
  • Multiple dialogue threads

91
DM Functionality: In Practice
  • Task complexity: simple to moderate
  • Call routing
  • Weather information query
  • Train schedule inquiry
  • Dialogue complexity
  • About task only
  • Limited system-initiated subdialogues

92
Dialogue Management Process
  • Determines how the system will go about selecting
    among the possible goals
  • At the dialogue level, determined by system
    designer in terms of initiative strategies
  • System-initiative: system always has control,
    user only responds to system questions
  • User-initiative: user always has control, system
    passively answers user questions
  • Mixed-initiative: control switches between system
    and user using fixed rules
  • Variable-initiative: control switches between
    system and user dynamically based on participant
    roles, dialogue history, etc.

93
DM Process In Theory
  • Initiative strategies
  • Mixed-initiative
  • Variable-initiative
  • Mechanisms for modeling initiative
  • Planning and reasoning
  • Theorem proving
  • Belief modeling
  • Knowledge sources for modeling initiative
  • System beliefs, user beliefs, and mutual beliefs
  • System domain knowledge
  • Dialogue history
  • User preferences

94
DM Process In Practice
  • Initiative strategies
  • User-initiative
  • System-initiative
  • Mixed-initiative
  • Variable-initiative
  • Mechanisms for modeling initiative
  • System and mixed-initiative: finite-state
    machines
  • Variable-initiative: evidential model for
    computing initiative
  • Knowledge sources
  • Dialogue history, e.g. user fails to make
    progress in task
  • Participant roles: advisor/advisee vs.
    collaborators
  • Features of current utterance, e.g. ambiguous
    utterance, underspecified utterance
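A system-initiative finite-state machine can be sketched as a chain of states, each of which asks one question and fills one slot from the user's reply. The state names and slot names below are hypothetical, and the prompts are adapted from Dialogue 1; a mixed-initiative variant would add transitions that skip states whose slots the user has already filled.

```python
# Hypothetical states: name -> (slot to fill, system prompt, next state).
STATES = {
    "ask_type": ("info_type",
                 "Would you like movie showtime or theater playlist information?",
                 "ask_movie"),
    "ask_movie": ("movie",
                  "What movie do you want showtime information about?",
                  "ask_theater"),
    "ask_theater": ("theater",
                    "At what theater do you want to see it?",
                    "done"),
}


def run_system_initiative(user_replies):
    """Drive the dialogue: the system asks, and each user reply fills one slot."""
    state = "ask_type"
    slots = {}
    prompts = []
    replies = iter(user_replies)
    while state != "done":
        slot, prompt, next_state = STATES[state]
        prompts.append(prompt)
        reply = next(replies, None)
        if reply is None:  # no more user input; stay in the current state
            break
        slots[slot] = reply
        state = next_state
    return state, slots, prompts
```

Because the user can only answer the question just asked, the machine never has to decide who holds the initiative, which is what makes this design robust and also mechanical.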

95
Dialogue Management Summary
  • In theory
  • Focuses on developing algorithms for addressing
    difficult dialogue phenomena in complex tasks
  • In practice
  • Focuses on developing robust, real-time dialogue
    strategies for handling common dialogue phenomena
    in simpler tasks

96
Dialogue Management Example
  • System for providing movie showtime and theater
    playlist information
  • Functionality
  • Task: provide simple movie information
  • Dialogue: allow system-initiated subdialogues
  • Process
  • Variable-initiative

97
Dialogue Management: Example (Cont'd)
<U has task and dialogue initiative>
U: When is Antz playing at the Fairmont theater?
<S has dialogue initiative (low score for theater)>
<goal: confirm Paramount theater>
S: Did you say the Paramount theater?
U: No, the Fairmont theater.
<S has no initiative>
<goal: answer user question>
S: Antz is playing at the Fairmont theater at 7pm.
U: When is it playing at the Summit theater?
<S has task and dialogue initiative>
<goal: notify failure and suggest alternative>
S: Antz is not playing at the Summit theater, but it's playing at the
   AMC theater in Summit at 8pm.
99
Response Generation
U: Is Saving Private Ryan playing at the Chatham
cinema?
100

S: No, it's not.
  • S provides elliptical response

101

S: No, Saving Private Ryan is not playing
at the Chatham cinema.
  • S provides full response (which provides
    grounding information)

102

S: No, Saving Private Ryan is not playing
at the Chatham cinema; the theater's under
renovation.
  • S provides full response and supporting evidence

103
Response Generation (Cont'd)
  • Goal: generate natural language utterances to
    achieve goal(s) selected by the dialogue manager
  • Issues
  • Content selection: determining what to say
  • Surface realization: determining how to say it
  • Generation gap: discrepancy between the actual
    output of the content selection process and the
    expected input of the surface realization process

104
Content Selection
  • Goal: determine the propositional content of
    utterances to achieve goal(s)
  • Examples
  • Antz is not playing at the Maplewood theater.
    (Nucleus)
  • Would you like the suite? (Nucleus)
  • Can you get the groceries from the car?
    (Nucleus)

105
  • ...the theater's under renovation.
    (Satellite: evidence)
  • It's the same price as the regular room.
    (Satellite: motivation)
  • The key is on the dryer. (Satellite: enablement)

106
Content Selection In Theory
  • Knowledge sources
  • Domain knowledge base
  • User beliefs
  • User model: user characteristics, preferences,
    etc.
  • Dialogue history
  • Content selection mechanisms
  • Schemas: patterns of predicates
  • Rule-based generation
  • Plan-based generation
  • Recipes: templates for performing actions
  • Planner to construct plans for given goal
  • Case-based reasoning

107
Content Selection In Practice
  • Knowledge sources
  • Domain knowledge base
  • Dialogue history
  • Pre-determined content selection strategies
  • Nucleus only, no satellite information
  • Nucleus + fixed satellite

108
Surface Realization
  • Goal: determine how the selected content will be
    conveyed by natural language utterances
  • Examples
  • Antz is showing (shown) at the Maplewood theater.
  • The Maplewood theater is showing Antz.
  • It is at the Maplewood theater that Antz is
    shown.
  • Antz, that's what's being shown at the Maplewood
    theater.
  • Issues
  • Clausal structure construction
  • Lexical selection

109
Surface Realization In Theory
  • Typical surface generator requires as input
  • Semantic representation to be realized
  • Clausal structure for generated utterance
  • Surface realization component utilizes a grammar
    to generate utterance that conveys the given
    semantic representation

110
Surface Realization In Practice
  • Canned utterances
  • Pre-determined utterances for goals, e.g.
  • Greetings: Hello, this is the ABC bank's
    operator.
  • Repeat: Could you please repeat your request?
  • Facilitates pre-recorded prompts for speech
    output
  • Template-based generation
  • Templates for goals, e.g.
  • Notification: Your call is being transferred to
    X.
  • Inform: A, B, C, D, and E are playing at the F
    theater.
  • Clarify: Did you say X or Y?
  • Needs cut-and-paste of pre-recorded segments or
    full TTS system
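Template-based generation along these lines can be sketched with ordinary string templates. The template wording follows the slide; the goal keys, slot names, and helper functions are hypothetical.

```python
# Hypothetical templates keyed by dialogue goal; wording follows the slide.
TEMPLATES = {
    "notification": "Your call is being transferred to {destination}.",
    "inform": "{movies} {verb} playing at the {theater} theater.",
    "clarify": "Did you say {first} or {second}?",
}


def join_list(items):
    """Render a non-empty list such as ['A', 'B', 'C'] as 'A, B, and C'."""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def realize(goal, **slots):
    """Fill the template for a goal with the given slot values."""
    return TEMPLATES[goal].format(**slots)


def inform(movies, theater):
    """Realize an Inform goal, choosing verb agreement from the list length."""
    verb = "is" if len(movies) == 1 else "are"
    return realize("inform", movies=join_list(movies), verb=verb, theater=theater)
```

The fixed parts of each template are what could be pre-recorded; the slot fillers are where cut-and-paste of recorded segments or a TTS system is needed.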

112
Dialogue Evaluation
  • Goal: determine how well a dialogue system
    performs
  • Main difficulties
  • No strict right or wrong answers
  • Difficult to determine what features make a
    dialogue system better than another
  • Difficult to select metrics that contribute to
    the overall goodness of the system
  • Difficult to determine how the metrics compensate
    for one another
  • Expensive to collect new data for evaluating
    incremental improvement of systems

113
Dialogue Evaluation (Cont'd)
  • System-initiative, explicit confirmation
  • better task success rate
  • lower WER
  • longer dialogues
  • fewer recovery subdialogues
  • less natural
  • Mixed-initiative, no confirmation
  • lower task success rate
  • higher WER
  • shorter dialogues
  • more recovery subdialogues
  • more natural

114
Dialogue Evaluation Paradigms
  • Evaluating the end result only
  • Reference answers
  • Evaluating both the end result and the process
    toward it
  • Evaluation metrics
  • Performance functions

115
Evaluation Paradigms Reference Answers
  • Evaluates the task success rate only
  • Suitable for query-answering systems for which a
    correct answer can be defined for each query
  • For each query
  • Obtain answer from dialogue system
  • Compare with reference answer
  • Score system performance
  • Advantage: simple
  • Disadvantage: ignores many other important
    factors that contribute to quality of dialogue
    systems
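The reference-answer procedure amounts to a per-query comparison followed by an average. A minimal sketch, assuming answers are plain strings keyed by a query ID and that matching is exact after case and whitespace normalization (both are assumptions; real evaluations often need looser matching):

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(answer.lower().split())


def task_success_rate(system_answers, reference_answers):
    """Fraction of queries whose system answer matches the reference answer.

    Both arguments map a query ID to an answer string; a query the system
    did not answer counts as a failure.
    """
    if not reference_answers:
        raise ValueError("no reference answers to score against")
    correct = sum(
        1
        for query, reference in reference_answers.items()
        if normalize(system_answers.get(query, "")) == normalize(reference)
    )
    return correct / len(reference_answers)
```

The score says nothing about how many turns or corrections each answer took, which is the disadvantage noted above.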

116
Evaluation Paradigms Evaluation Metrics
  • Different metrics for evaluating different
    components of a dialogue system
  • Speech recognizer: word error rate / word
    accuracy
  • Understanding component: attribute value matrix
  • Dialogue manager
  • appropriateness of system responses
  • error recovery abilities
  • Overall system
  • task success
  • average number of turns
  • elapsed time
  • turn correction ratio
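Of the metrics above, word error rate has a standard definition: substitutions, deletions, and insertions divided by the number of reference words, computed with a word-level edit distance. A minimal sketch, assuming whitespace tokenization and no text normalization:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.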

117
Paradigms: Evaluation Metrics (Cont'd)
  • Advantage
  • Takes into account the process toward completing
    the task
  • Limitations
  • Difficult to determine how different metrics
    compensate for one another
  • Metrics may not be independent of one another
  • Does not generalize across domains and tasks

118
Paradigms: Performance Functions
  • PARADISE (Walker et al.) derives performance
    functions using both task-based and
    dialogue-based metrics
  • User satisfaction
  • Maximize task success
  • Minimize costs
  • Efficiency measures, e.g., number of utterances,
    elapsed time
  • Qualitative measures, e.g., repair ratio,
    inappropriate utterance ratio
  • Performance function derivation
  • Obtain user satisfaction ratings (questionnaire)
  • Obtain values for various metrics (automatic or
    manual)
  • Apply multiple linear regression to derive a
    function relating user satisfaction and various
    cost factors, e.g.,
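The slide's own example function did not survive extraction. As an illustration of the derivation step only, the sketch below fits a linear function from per-dialogue metrics to user satisfaction by ordinary least squares. The function names and data layout are hypothetical, and the factor normalization that PARADISE applies before regression is left out here.

```python
def fit_performance_function(metric_rows, satisfaction):
    """Least-squares fit of satisfaction ~ w0 + w1*m1 + w2*m2 + ...

    metric_rows: one list of metric values per dialogue.
    satisfaction: one user satisfaction rating per dialogue.
    Returns [w0, w1, w2, ...], solved from the normal equations.
    """
    X = [[1.0] + [float(v) for v in row] for row in metric_rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, satisfaction))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < 1e-12:
            raise ValueError("metrics are collinear; cannot fit a unique function")
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def predict(weights, metrics):
    """Apply a fitted performance function to one dialogue's metrics."""
    return weights[0] + sum(w * m for w, m in zip(weights[1:], metrics))
```

The sign and size of each fitted weight indicate how much that factor contributes to satisfaction; for instance, a negative weight on number of turns would mean longer dialogues are rated lower.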

119
Paradigms: Performance Functions (Cont'd)
  • Advantages
  • Allows for comparison of dialogue systems
    performing different tasks
  • Specifies relative contributions of cost factors
    to overall performance
  • Can be used to make predictions about future
    versions of the dialogue system
  • Disadvantages
  • Data collection cost for deriving performance
    function is high
  • Cost for deriving performance function for
    multiple systems to draw general conclusions is
    high

121
Data Collection Wizard of Oz Paradigm
  • Setup for initial data collection
  • User communicates with system through telephone
    (speech) or keyboard (text)
  • System is actually a human, typically given
    instructions on how to behave like a system
  • Users are typically given tasks to perform in the
    target domain
  • Subjects are the users and the system can be
    played by one person
  • Dialogues between system and user are recorded
    and transcribed
  • Setup for intermediate system evaluation
  • Use actual running system, with wizard
    supervision
  • Wizard can override undesirable system behavior,
    e.g., correct ASR errors

122
Data Collection: Wizard of Oz (Cont'd)
  • Features of collected data
  • Typically much less complex than actual
    human-human dialogues performing the same tasks
  • Captures how humans behave when they talk to
    computers
  • Captures variations among different subjects in
    both language and approach when performing the