Title: Stochastic Language Generation: Analysis and Evaluation
1. Stochastic Language Generation: Analysis and Evaluation
- Alice Oh
- aliceo@cs.cmu.edu
- May 26, 2000
2. Acknowledgements
- Advisor: Alex Rudnicky
- Thesis Readers: Eric Nyberg, Kevin Lenzo
- Voice for Evaluation: Kevin Lenzo
- Discussions: John Lafferty, Roni Rosenfeld, Wei Xu
- Transcription: Tina Bennett
- Evaluation Participants
3. Outline
- Natural Language Generation
- Stochastic NLG
- Evaluation
- Evaluation of NLG in Dialog Systems
- Analysis of stochastic NLG
- Evaluation of stochastic NLG
4. Natural Language Generation
- Natural Language Understanding (NLU): text → semantic (syntactic) representation
- Natural Language Generation (NLG): semantic (syntactic) representation → text
- NLG in Communicator: Dialog Manager → input frame → NLG → text → TTS → speech
5. Stochastic Natural Language Generation
6. Stochastic NLG: Problem Statement
- Problem: build a generation engine for a dialog system that can combine the advantages, as well as overcome the difficulties, of the two dominant approaches (template-based generation and grammar rule-based NLG)
- Our Approach: design a corpus-driven stochastic generation engine that takes advantage of the characteristics of task-oriented conversational systems. Some of those characteristics:
  - Spoken utterances are much shorter in length
  - There are well-defined subtopics within the task, so the language can be selectively modeled
7. Stochastic NLG: Overview
- Language Model: an n-gram language model of domain experts' language, built from a corpus of travel reservation dialogs
- Generation: given an utterance class, randomly generates a set of candidate utterances based on the LM distributions (a minimal sketch follows this list)
- Scoring: based on a set of heuristics, scores the candidates and picks the best one
- Slot Filling: substitutes slots in the utterance with the appropriate values from the input frame
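To make the pipeline concrete, here is a minimal sketch of the train-and-sample core, assuming a per-class bigram model stored as nested count dictionaries. The names (train_bigram, sample_utterance) are hypothetical illustrations, not the actual Communicator code, and the real system uses higher-order n-grams:

    import random
    from collections import defaultdict

    BOS, EOS = "<s>", "</s>"

    def train_bigram(utterances):
        # Count word bigrams over the sub-corpus for one utterance class.
        counts = defaultdict(lambda: defaultdict(int))
        for utt in utterances:
            words = [BOS] + utt.split() + [EOS]
            for prev, cur in zip(words, words[1:]):
                counts[prev][cur] += 1
        return counts

    def sample_utterance(counts, max_words=30):
        # Random walk over the bigram distribution to produce one candidate.
        word, out = BOS, []
        while len(out) < max_words:
            nexts, freqs = zip(*counts[word].items())
            word = random.choices(nexts, weights=freqs)[0]
            if word == EOS:
                break
            out.append(word)
        return " ".join(out)

    # Example: a tiny query_depart_time sub-corpus
    corpus = [
        "what time do you want to depart depart_city",
        "what time on depart_date would you like to depart",
        "what time would you like to leave",
    ]
    lm = train_bigram(corpus)
    print(sample_utterance(lm))   # e.g. "what time would you like to depart"

Because the walk can cross between corpus utterances at shared words, it can produce word sequences that never occurred verbatim in the corpus (see the example on slide 9).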
8. Stochastic NLG: Overview (Diagram)
[Diagram] The dialog manager sends an input frame (act: query; content: depart_time; depart_date: 20000501). Generation, driven by language models built from tagged corpora, produces candidate utterances ("What time on depart_date?", "At what time would you be leaving depart_city?"). Scoring picks the best utterance ("What time on depart_date?"), slot filling produces the complete utterance ("What time on Mon, May 8th?"), and the result is sent to TTS.
9. Example
- Utterances in corpus:
  - What time do you want to depart depart_city?
  - What time on depart_date would you like to depart?
  - What time would you like to leave?
  - What time do you want to depart on depart_date?
- Output (different from corpus):
  - What time would you like to depart?
  - What time on depart_date would you like to depart depart_city?
  - What time on depart_date would you like to depart on depart_date?
10. Stochastic NLG in Communicator
- Hybrid system (a dispatch sketch follows this list):
  - canned expressions: "Hello. Welcome to the CMU Communicator."
  - templates: "I'm sorry, I must have the wrong cities for your trip. You cannot travel from depart_city to arrive_city."
  - stochastic generation: "Okay. I have a nonstop on airline departing depart_city at depart_time arriving into arrive_city at arrive_time."
- Language models:
  - built from CMU dialogs (39 dialogs, 970 utterances, 12852 words)
  - inform_flight sub-corpus: 43 utterances, 699 words (55 unique words)
  - query_depart_time sub-corpus: 37 utterances, 382 words (24 unique words)
- Time to generate: 75 msec on average
11. Analysis and Evaluation
12. Evaluation of NLG in Dialog Systems
- Analysis of the technique:
  - coverage
  - quality of output utterances
- Evaluation of NLG within the overall system:
  - user satisfaction with system utterances
  - overall impression of the interaction
  - effect of system prompts on user utterances
- Evaluation of NLG with respect to task completion
13. Analysis (Diagram)
[Diagram] A single input frame is fed to a parameterized NLG engine, instantiated as Stochastic NLG_1 through Stochastic NLG_k; their outputs (Output_1 ... Output_k) are then compared in the analysis.
14. Analysis: Experiment
- Setup: web-based survey
  - 2 dialog acts (inform_flight, query_depart_time)
  - 5 groups per dialog act
  - 8 output sentences (unigram through 8-gram) per group
- Subjects: 12 subjects not familiar with NLG or the Communicator task
- Parameter: n in n-gram (1-8)
- Task:
  - for each sentence, mark Acceptable or Unacceptable
  - for each group, pick the best sentence (the one with the lowest number)
15. Analysis: Results
[Chart] Average best sentence: 4 (mean 3.75)
16. Analysis: Discussion
- Text vs. speech:
  - one subject marked as unacceptable an exact duplicate of a sentence in the corpus: "And departing Pittsburgh at what time?"
  - how would subjects judge travel agents' language on paper? On the phone?
- Grammar:
  - subjects looked for near-perfect grammar in acceptable sentences: "I have a nonstop on United Airlines departing Seattle at six twenty a.m., arrives Los Angeles at eight fifty three a.m."
  - If the system can do better than humans, should it?
17. Evaluation (Diagram)
[Diagram] The same dialogs are run through batch-mode generation twice: stochastic NLG produces dialogs with Output_S, and template NLG produces dialogs with Output_T. The two sets then undergo comparative evaluation.
18. Experiment 1
- Setup: hardcopies of dialog transcripts
  - 7 sets of dialogs
  - one set of output each from template NLG and stochastic NLG
  - 49 pairs of sentences to compare
- Subjects: 7 subjects in LTI (not in Communicator)
- Task: for each pair of sentences, mark which sentence is better
- Results:
  - weak preference for stochastic NLG (p = 0.18)
  - 5 out of 7 subjects preferred stochastic NLG
19. Experiment 2
- Setup:
  - webpage with links to .wav files of recorded dialogs
  - second webpage with the transcripts of those dialogs
  - 3 sets of dialogs
  - one set of output each from template NLG and stochastic NLG
- Criteria for choosing dialogs:
  - calls from outside of the Communicator group
  - successfully completed itinerary
  - used a part of the dialog (e.g., one leg of the trip)
  - contains at least 5 utterances generated by stochastic NLG
20. Experiment 2 (cont'd)
- Subjects: 20 subjects not familiar with NLG or the Communicator task
- Task:
  - recorded dialogs: listen to each pair of dialogs, then answer 4 questions (natural, understandable, like better, prefer to use)
  - transcripts: for each pair of sentences, pick the more natural sentence
21. Experiment 2: Results
- Listening to the recorded dialogs (a sign-test sketch follows this slide):
  - by question (< 0: stochastic, > 0: templates)
    - natural: mean -0.02, p-value 0.43
    - understand: mean 0.08, p-value 0.14
    - like better: mean 0.12, p-value 0.13
    - prefer to use: mean -0.02, p-value 0.43
  - by subject:
    - 11 subjects preferred S, 8 preferred T, 1 neutral
    - 6 statistically significant (3 S, 3 T) by sign test (at p = 0.05)
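For reference, a minimal sketch of the per-subject sign test assumed above: an exact two-sided binomial computation over the non-tied preferences. The function name and interface are hypothetical, and the study's exact procedure may have differed in details such as tie handling:

    from math import comb

    def sign_test(n_prefer_s, n_prefer_t):
        # Exact two-sided sign-test p-value, ignoring tied judgments.
        n = n_prefer_s + n_prefer_t
        k = min(n_prefer_s, n_prefer_t)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # e.g., a subject preferring stochastic output on 9 of 10 judged pairs
    print(sign_test(9, 1))   # ~0.021, significant at p = 0.05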
22. Experiment 2: Results
- Reading the transcripts:
  - 11 subjects preferred S, 7 preferred T, 3 neutral
  - 2 statistically significant (both S) by t-test
- Correlation between recorded dialogs and transcripts: r = 0.009
23. Experiment 2: Discussion
- No correlation between subjects' judgments on recorded dialogs and transcripts
- No significant difference in the results
- Possible reasons:
  - subjects made judgments based on voice quality
    - not following instructions (!)
    - 15 out of 20 subjects mentioned voice quality in comments
    - use a TTS voice?
  - some other factors (e.g., length of utterance, lexicon) should have been controlled
  - there really is no significant difference in the quality of output
24. Experiment 2: Comments
- "Although B sounded more natural, it was slower (too slow)."
- "Voice A slurred his words a little at the end of one or two sentences -- it sounded sloppy."
- "Voice seems to care."
- "Neither. I prefer talking to real people rather than computers."
- "Both. Really no preference. The dialogues, while different, were both pleasant in my opinion."
25. Conclusion
- Stochastic NLG does at least as well as hand-crafted templates
- Evaluation of NLG (especially in dialog systems) is hard:
  - getting subjects to do the right thing
  - designing experiments that control as many independent variables as possible
  - teasing apart TTS from NLG
- More research is needed:
  - judging human travel agents' language
  - another round of experiments!
26. Extra Slides
27. Current Approaches
- Traditional (rule-based) NLG:
  - hand-crafted generation grammar rules and other knowledge
  - input: a very richly specified set of semantic and syntactic features
  - example (from a Nitrogen demo website, http://www.isi.edu/natural-language/projects/nitrogen/):

      (h / possible<latent
        domain (h2 / obligatory<necessary
          domain (e / eat,take in
            agent you
            patient (c / poulet))))

    "You may have to eat chicken"
- Template-based NLG:
  - simple to build
  - input: a dialog act and/or a set of slot-value pairs
28. Stochastic NLG can also be thought of as a way to automatically build templates from a corpus
- If you set n equal to a large enough number, most utterances generated by LM-NLG will be exact duplicates of the utterances in the corpus.
29. Stochastic NLG: Corpora
- Human-human dialogs in travel reservations (CMU-Leah, SRI-ATIS/American Express dialogs)
30. Tagging
- CMU corpus: tagged manually
- SRI corpus: tagged semi-automatically using trigram language models built from the CMU corpus (a sketch follows)
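A hedged sketch of one way such LM-based tagging could work: score each untagged utterance under every class's model and assign the best-scoring class. This is an illustration under stated assumptions (add-one-smoothed bigrams reusing the defaultdict counts from the train_bigram sketch on slide 7, rather than the trigram models actually used); the real tagging procedure may differ:

    import math

    def log_prob(utt, counts, vocab_size):
        # Add-one-smoothed bigram log-probability of one utterance.
        words = ["<s>"] + utt.split() + ["</s>"]
        lp = 0.0
        for prev, cur in zip(words, words[1:]):
            total = sum(counts[prev].values())
            lp += math.log((counts[prev][cur] + 1) / (total + vocab_size))
        return lp

    def classify(utt, class_lms, vocab_size):
        # Tag an utterance with the class whose LM scores it highest.
        return max(class_lms,
                   key=lambda c: log_prob(utt, class_lms[c], vocab_size))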
31. Tags
- Utterance classes (29):
  query_arrive_city, query_arrive_date, query_arrive_time, query_confirm,
  query_depart_date, query_depart_time, query_pay_by_card,
  query_preferred_airport, query_return_date, query_return_time,
  hotel_car_info, hotel_hotel_chain, hotel_hotel_info, hotel_need_car,
  hotel_need_hotel, hotel_where, inform_airport, inform_confirm_utterance,
  inform_epilogue, inform_flight, inform_flight_another,
  inform_flight_earlier, inform_flight_earliest, inform_flight_later,
  inform_flight_latest, inform_not_avail, inform_num_flights, inform_price,
  other
- Attributes (24):
  airline, am, arrive_airport, arrive_city, arrive_date, arrive_time,
  car_company, car_price, connect_airline, connect_airport, connect_city,
  depart_airport, depart_city, depart_date, depart_time, depart_tod,
  flight_num, hotel, hotel_city, hotel_price, name, num_flights, pm, price
32. Stochastic NLG: Generation
- Given an utterance class, randomly generates a set of candidate utterances based on the LM distributions
- Generation stops when an utterance has a penalty score of 0 or the maximum number of iterations (50) has been reached (a sketch of this loop follows)
- Average generation time: 75 msec for Communicator dialogs
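A minimal sketch of that control loop, under the same assumptions as the earlier sketches: sample_utterance() (slide 7 sketch) draws one candidate from a class-conditioned LM, and score() is a penalty function like the one sketched on the next slide. Names and structure are illustrative, not the actual implementation:

    MAX_ITERATIONS = 50

    def generate(utt_class, class_lms, frame, score):
        # Keep sampling candidates until one scores a zero penalty,
        # or the iteration cap is reached; return the best seen so far.
        best, best_penalty = None, float("inf")
        for _ in range(MAX_ITERATIONS):
            candidate = sample_utterance(class_lms[utt_class])
            penalty = score(candidate, frame)
            if penalty < best_penalty:
                best, best_penalty = candidate, penalty
            if best_penalty == 0:
                break   # perfect candidate: stop generating early
        return best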
33. Stochastic NLG: Scoring
- Assign penalty scores for:
  - unusual length of utterance (thresholds for too-long and too-short)
  - a slot in the generated utterance with an invalid (or no) value in the input frame
  - a new and required attribute in the input frame that's missing from the generated utterance
  - repeated slots in the generated utterance
- Pick the utterance with the lowest penalty (or stop generating at an utterance with 0 penalty); a sketch follows this list
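A hedged sketch of a penalty function covering those four heuristics. The weights, length thresholds, slot set, and the simplification that every frame attribute is required are all assumptions for illustration; the actual heuristics and values are not specified here:

    # A few of the attribute tags from slide 31, for illustration.
    ALL_SLOT_NAMES = {"airline", "depart_city", "depart_date", "depart_time",
                      "arrive_city", "arrive_time"}

    def score(utterance, frame, min_len=4, max_len=25):
        # frame: attribute -> value mapping for this dialog act.
        words = utterance.split()
        slots = [w for w in words if w in ALL_SLOT_NAMES]
        penalty = 0.0
        # 1. unusual length of utterance
        if not (min_len <= len(words) <= max_len):
            penalty += 1.0
        # 2. slot with an invalid (or no) value in the input frame
        penalty += 2.0 * sum(1 for s in slots if not frame.get(s))
        # 3. attribute in the frame missing from the utterance
        #    (simplification: treat every frame attribute as required)
        penalty += 2.0 * sum(1 for a in frame if a not in slots)
        # 4. repeated slots in the utterance
        penalty += 1.0 * (len(slots) - len(set(slots)))
        return penalty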
34. Stochastic NLG: Slot Filling
- Substitute slots in the utterance with the appropriate values from the input frame (a sketch follows)
- Example:
  - What time do you need to arrive in arrive_city?
  - What time do you need to arrive in New York?
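A minimal sketch of the substitution step; the regex-based fill_slots() helper is a hypothetical illustration, not the actual code:

    import re

    def fill_slots(utterance, frame):
        # Replace each slot token (e.g. arrive_city) with its frame value;
        # tokens without a frame entry are left unchanged.
        return re.sub(r"\w+",
                      lambda m: str(frame.get(m.group(0), m.group(0))),
                      utterance)

    frame = {"arrive_city": "New York"}
    print(fill_slots("What time do you need to arrive in arrive_city?", frame))
    # -> What time do you need to arrive in New York?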
35. Stochastic NLG: Advantages
- corpus-driven
- easy to build (minimal knowledge engineering)
- fast prototyping
- minimal input (speech act, slot values)
- natural output
- leverages data-collecting/tagging effort
36. Stochastic NLG: Shortcomings
- What might sound natural for a human speaker (imperfect grammar, intentional omission of words, etc.) may sound awkward (or wrong) coming from the system.
- It is difficult to define utterance boundaries and utterance classes; some utterances in the corpus may be a conjunction of more than one utterance class.
- Factors other than the utterance class may affect the words (e.g., discourse history).
- Some sophistication built into traditional NLG engines is not available (e.g., aggregation, anaphorization).
37. Open Issues
- How big a corpus do we need?
- How much of it needs manual tagging?
- How does the n in n-gram affect the output?
- What happens to the output when two different human speakers are modeled in one model?
- Can we replace scoring with a search algorithm?
38. Evaluation
- Must be able to evaluate generation independently of the rest of the dialog system
- Comparative evaluation using dialog transcripts:
  - need more subjects
  - 8-10 dialogs, with system output generated batch-mode by two different engines
- Evaluation of human travel agent utterances:
  - Do users rate them well?
  - Is it good enough to model human utterances?
39. Preliminary Evaluation
- Batch-mode generation using two systems; comparative evaluation of output by human subjects
- User preferences (49 utterances total):
- Weak preference for stochastic NLG (p = 0.18)
  subject    stochastic    templates    difference
  1              41             8            33
  2              34            15            19
  3              17            32           -15
  4              32            17            15
  5              30            17            13
  6              27            19             8
  7               8            41           -33
  average        27            21.29          5.71