Title: Stochastic Language Generation: Analysis and Evaluation
1. Stochastic Language Generation: Analysis and Evaluation
- Alice Oh
- aliceo@cs.cmu.edu
- May 26, 2000
2. Acknowledgements
- Advisor: Alex Rudnicky
- Thesis Readers: Eric Nyberg, Kevin Lenzo
- Voice for Evaluation: Kevin Lenzo
- Discussions: John Lafferty, Roni Rosenfeld, Wei Xu
- Transcription: Tina Bennett
- Evaluation Participants
3. Outline
- Natural Language Generation
- Stochastic NLG
- Evaluation
- Evaluation of NLG in Dialog Systems
- Analysis of stochastic NLG
- Evaluation of stochastic NLG
4. Natural Language Generation
- Natural Language Understanding (NLU): text → semantic (syntactic) representation
- Natural Language Generation (NLG): semantic (syntactic) representation → text
- NLG in Communicator: Dialog Manager → input frame → NLG → text → TTS → speech
5. Stochastic Natural Language Generation
6. Stochastic NLG: Problem Statement
- Problem: build a generation engine for a dialog system that can combine the advantages, as well as overcome the difficulties, of the two dominant approaches (template-based generation and grammar rule-based NLG)
- Our Approach: design a corpus-driven stochastic generation engine that takes advantage of the characteristics of task-oriented conversational systems. Some of those characteristics:
  - Spoken utterances are much shorter in length
  - There are well-defined subtopics within the task, so the language can be selectively modeled
7. Stochastic NLG: Overview
- Language Model: an n-gram language model of domain experts' language, built from a corpus of travel reservation dialogs
- Generation: given an utterance class, randomly generates a set of candidate utterances based on the LM distributions (a minimal sketch follows this list)
- Scoring: based on a set of heuristics, scores the candidates and picks the best one
- Slot Filling: substitutes slots in the utterance with the appropriate values from the input frame
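To make the pipeline concrete, here is a minimal sketch of the train-and-sample core, assuming a per-class bigram model stored as nested count dictionaries. The names (train_bigram, sample_utterance) are hypothetical illustrations, not the actual Communicator code, and the real system uses higher-order n-grams:

    import random
    from collections import defaultdict

    BOS, EOS = "<s>", "</s>"

    def train_bigram(utterances):
        # Count word bigrams over the sub-corpus for one utterance class.
        counts = defaultdict(lambda: defaultdict(int))
        for utt in utterances:
            words = [BOS] + utt.split() + [EOS]
            for prev, cur in zip(words, words[1:]):
                counts[prev][cur] += 1
        return counts

    def sample_utterance(counts, max_words=30):
        # Random walk over the bigram distribution to produce one candidate.
        word, out = BOS, []
        while len(out) < max_words:
            nexts, freqs = zip(*counts[word].items())
            word = random.choices(nexts, weights=freqs)[0]
            if word == EOS:
                break
            out.append(word)
        return " ".join(out)

    # Example: a tiny query_depart_time sub-corpus
    corpus = [
        "what time do you want to depart depart_city",
        "what time on depart_date would you like to depart",
        "what time would you like to leave",
    ]
    lm = train_bigram(corpus)
    print(sample_utterance(lm))   # e.g. "what time would you like to depart"

Because the walk can cross between corpus utterances at shared words, it can produce word sequences that never occurred verbatim in the corpus (see the example on slide 9).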
8. Stochastic NLG: Overview (Diagram)
[Diagram] The dialog manager sends an input frame (act: query; content: depart_time; depart_date: 20000501). Generation, driven by language models built from tagged corpora, produces candidate utterances ("What time on depart_date?", "At what time would you be leaving depart_city?"). Scoring picks the best utterance ("What time on depart_date?"), slot filling produces the complete utterance ("What time on Mon, May 8th?"), and the result is sent to TTS.
9. Example
- Utterances in corpus:
  - What time do you want to depart depart_city?
  - What time on depart_date would you like to depart?
  - What time would you like to leave?
  - What time do you want to depart on depart_date?
- Output (different from corpus):
  - What time would you like to depart?
  - What time on depart_date would you like to depart depart_city?
  - What time on depart_date would you like to depart on depart_date?
10. Stochastic NLG in Communicator
- Hybrid system (a dispatch sketch follows this list):
  - canned expressions: "Hello. Welcome to the CMU Communicator."
  - templates: "I'm sorry, I must have the wrong cities for your trip. You cannot travel from depart_city to arrive_city."
  - stochastic generation: "Okay. I have a nonstop on airline departing depart_city at depart_time arriving into arrive_city at arrive_time."
- Language models:
  - built from CMU dialogs (39 dialogs, 970 utterances, 12852 words)
  - inform_flight sub-corpus: 43 utterances, 699 words (55 unique words)
  - query_depart_time sub-corpus: 37 utterances, 382 words (24 unique words)
- Time to generate: 75 msec on average
11. Analysis and Evaluation
12. Evaluation of NLG in Dialog Systems
- Analysis of the technique:
  - coverage
  - quality of output utterances
- Evaluation of NLG within the overall system:
  - user satisfaction with system utterances
  - overall impression of the interaction
  - effect of system prompts on user utterances
- Evaluation of NLG with respect to task completion
13. Analysis (Diagram)
[Diagram] A single input frame is fed to a parameterized NLG engine, instantiated as Stochastic NLG_1 through Stochastic NLG_k; their outputs (Output_1 ... Output_k) are then compared in the analysis.
14. Analysis: Experiment
- Setup: web-based survey
  - 2 dialog acts (inform_flight, query_depart_time)
  - 5 groups per dialog act
  - 8 output sentences (unigram through 8-gram) per group
- Subjects: 12 subjects not familiar with NLG or the Communicator task
- Parameter: n in n-gram (1-8)
- Task:
  - for each sentence, mark Acceptable or Unacceptable
  - for each group, pick the best sentence (the one with the lowest number)
15. Analysis: Results
[Chart] Average best sentence: 4 (mean 3.75)
16. Analysis: Discussion
- Text vs. speech:
  - one subject marked as unacceptable an exact duplicate of a sentence in the corpus: "And departing Pittsburgh at what time?"
  - how would subjects judge travel agents' language on paper? On the phone?
- Grammar:
  - subjects looked for near-perfect grammar in acceptable sentences: "I have a nonstop on United Airlines departing Seattle at six twenty a.m., arrives Los Angeles at eight fifty three a.m."
  - If the system can do better than humans, should it?
17. Evaluation (Diagram)
[Diagram] The same dialogs are run through batch-mode generation twice: stochastic NLG produces dialogs with Output_S, and template NLG produces dialogs with Output_T. The two sets then undergo comparative evaluation.
18. Experiment 1
- Setup: hardcopies of dialog transcripts
  - 7 sets of dialogs
  - one set of output each from template NLG and stochastic NLG
  - 49 pairs of sentences to compare
- Subjects: 7 subjects in LTI (not in Communicator)
- Task: for each pair of sentences, mark which sentence is better
- Results:
  - weak preference for stochastic NLG (p = 0.18)
  - 5 out of 7 subjects preferred stochastic NLG
19. Experiment 2
- Setup:
  - webpage with links to .wav files of recorded dialogs
  - second webpage with the transcripts of those dialogs
  - 3 sets of dialogs
  - one set of output each from template NLG and stochastic NLG
- Criteria for choosing dialogs:
  - calls from outside of the Communicator group
  - successfully completed itinerary
  - used a part of the dialog (e.g., one leg of the trip)
  - contains at least 5 utterances generated by stochastic NLG
20. Experiment 2 (cont'd)
- Subjects: 20 subjects not familiar with NLG or the Communicator task
- Task:
  - recorded dialogs: listen to each pair of dialogs, then answer 4 questions (natural, understandable, like better, prefer to use)
  - transcripts: for each pair of sentences, pick the more natural sentence
21. Experiment 2: Results
- Listening to the recorded dialogs (a sign-test sketch follows this slide):
  - by question (< 0: stochastic, > 0: templates)
    - natural: mean -0.02, p-value 0.43
    - understand: mean 0.08, p-value 0.14
    - like better: mean 0.12, p-value 0.13
    - prefer to use: mean -0.02, p-value 0.43
  - by subject:
    - 11 subjects preferred S, 8 preferred T, 1 neutral
    - 6 statistically significant (3 S, 3 T) by sign test (at p = 0.05)
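For reference, a minimal sketch of the per-subject sign test assumed above: an exact two-sided binomial computation over the non-tied preferences. The function name and interface are hypothetical, and the study's exact procedure may have differed in details such as tie handling:

    from math import comb

    def sign_test(n_prefer_s, n_prefer_t):
        # Exact two-sided sign-test p-value, ignoring tied judgments.
        n = n_prefer_s + n_prefer_t
        k = min(n_prefer_s, n_prefer_t)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # e.g., a subject preferring stochastic output on 9 of 10 judged pairs
    print(sign_test(9, 1))   # ~0.021, significant at p = 0.05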
22. Experiment 2: Results
- Reading the transcripts:
  - 11 subjects preferred S, 7 preferred T, 3 neutral
  - 2 statistically significant (both S) by t-test
- Correlation between recorded dialogs and transcripts: r = 0.009
23. Experiment 2: Discussion
- No correlation between subjects' judgments on recorded dialogs and transcripts
- No significant difference in the results
- Possible reasons:
  - subjects made judgments based on voice quality
    - not following instructions (!)
    - 15 out of 20 subjects mentioned voice quality in comments
    - use a TTS voice?
  - some other factors (e.g., length of utterance, lexicon) should have been controlled
  - there really is no significant difference in the quality of output
24. Experiment 2: Comments
- "Although B sounded more natural, it was slower (too slow)."
- "Voice A slurred his words a little at the end of one or two sentences -- it sounded sloppy."
- "Voice seems to care."
- "Neither. I prefer talking to real people rather than computers."
- "Both. Really no preference. The dialogues, while different, were both pleasant in my opinion."
25. Conclusion
- Stochastic NLG does at least as well as hand-crafted templates
- Evaluation of NLG (especially in dialog systems) is hard:
  - getting subjects to do the right thing
  - designing experiments that control as many independent variables as possible
  - teasing apart TTS from NLG
- More research is needed:
  - judging human travel agents' language
  - another round of experiments!
26. Extra Slides
27. Current Approaches
- Traditional (rule-based) NLG:
  - hand-crafted generation grammar rules and other knowledge
  - input: a very richly specified set of semantic and syntactic features
  - example (from a Nitrogen demo website, http://www.isi.edu/natural-language/projects/nitrogen/):

      (h / possible<latent
        domain (h2 / obligatory<necessary
          domain (e / eat,take in
            agent you
            patient (c / poulet))))

    "You may have to eat chicken"
- Template-based NLG:
  - simple to build
  - input: a dialog act and/or a set of slot-value pairs
28. Stochastic NLG can also be thought of as a way to automatically build templates from a corpus
- If you set n equal to a large enough number, most utterances generated by LM-NLG will be exact duplicates of the utterances in the corpus.
29. Stochastic NLG: Corpora
- Human-human dialogs in travel reservations (CMU-Leah, SRI-ATIS/American Express dialogs)
30. Tagging
- CMU corpus: tagged manually
- SRI corpus: tagged semi-automatically using trigram language models built from the CMU corpus (a sketch follows)
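A hedged sketch of one way such LM-based tagging could work: score each untagged utterance under every class's model and assign the best-scoring class. This is an illustration under stated assumptions (add-one-smoothed bigrams reusing the defaultdict counts from the train_bigram sketch on slide 7, rather than the trigram models actually used); the real tagging procedure may differ:

    import math

    def log_prob(utt, counts, vocab_size):
        # Add-one-smoothed bigram log-probability of one utterance.
        words = ["<s>"] + utt.split() + ["</s>"]
        lp = 0.0
        for prev, cur in zip(words, words[1:]):
            total = sum(counts[prev].values())
            lp += math.log((counts[prev][cur] + 1) / (total + vocab_size))
        return lp

    def classify(utt, class_lms, vocab_size):
        # Tag an utterance with the class whose LM scores it highest.
        return max(class_lms,
                   key=lambda c: log_prob(utt, class_lms[c], vocab_size))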
31. Tags
- Utterance classes (29):
  query_arrive_city, query_arrive_date, query_arrive_time, query_confirm,
  query_depart_date, query_depart_time, query_pay_by_card,
  query_preferred_airport, query_return_date, query_return_time,
  hotel_car_info, hotel_hotel_chain, hotel_hotel_info, hotel_need_car,
  hotel_need_hotel, hotel_where, inform_airport, inform_confirm_utterance,
  inform_epilogue, inform_flight, inform_flight_another,
  inform_flight_earlier, inform_flight_earliest, inform_flight_later,
  inform_flight_latest, inform_not_avail, inform_num_flights, inform_price,
  other
- Attributes (24):
  airline, am, arrive_airport, arrive_city, arrive_date, arrive_time,
  car_company, car_price, connect_airline, connect_airport, connect_city,
  depart_airport, depart_city, depart_date, depart_time, depart_tod,
  flight_num, hotel, hotel_city, hotel_price, name, num_flights, pm, price
32. Stochastic NLG: Generation
- Given an utterance class, randomly generates a set of candidate utterances based on the LM distributions
- Generation stops when an utterance has a penalty score of 0 or the maximum number of iterations (50) has been reached (a sketch of this loop follows)
- Average generation time: 75 msec for Communicator dialogs
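A minimal sketch of that control loop, under the same assumptions as the earlier sketches: sample_utterance() (slide 7 sketch) draws one candidate from a class-conditioned LM, and score() is a penalty function like the one sketched on the next slide. Names and structure are illustrative, not the actual implementation:

    MAX_ITERATIONS = 50

    def generate(utt_class, class_lms, frame, score):
        # Keep sampling candidates until one scores a zero penalty,
        # or the iteration cap is reached; return the best seen so far.
        best, best_penalty = None, float("inf")
        for _ in range(MAX_ITERATIONS):
            candidate = sample_utterance(class_lms[utt_class])
            penalty = score(candidate, frame)
            if penalty < best_penalty:
                best, best_penalty = candidate, penalty
            if best_penalty == 0:
                break   # perfect candidate: stop generating early
        return best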
33. Stochastic NLG: Scoring
- Assign penalty scores for:
  - unusual length of utterance (thresholds for too-long and too-short)
  - a slot in the generated utterance with an invalid (or no) value in the input frame
  - a new and required attribute in the input frame that's missing from the generated utterance
  - repeated slots in the generated utterance
- Pick the utterance with the lowest penalty (or stop generating at an utterance with 0 penalty); a sketch follows this list
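A hedged sketch of a penalty function covering those four heuristics. The weights, length thresholds, slot set, and the simplification that every frame attribute is required are all assumptions for illustration; the actual heuristics and values are not specified here:

    # A few of the attribute tags from slide 31, for illustration.
    ALL_SLOT_NAMES = {"airline", "depart_city", "depart_date", "depart_time",
                      "arrive_city", "arrive_time"}

    def score(utterance, frame, min_len=4, max_len=25):
        # frame: attribute -> value mapping for this dialog act.
        words = utterance.split()
        slots = [w for w in words if w in ALL_SLOT_NAMES]
        penalty = 0.0
        # 1. unusual length of utterance
        if not (min_len <= len(words) <= max_len):
            penalty += 1.0
        # 2. slot with an invalid (or no) value in the input frame
        penalty += 2.0 * sum(1 for s in slots if not frame.get(s))
        # 3. attribute in the frame missing from the utterance
        #    (simplification: treat every frame attribute as required)
        penalty += 2.0 * sum(1 for a in frame if a not in slots)
        # 4. repeated slots in the utterance
        penalty += 1.0 * (len(slots) - len(set(slots)))
        return penalty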
34. Stochastic NLG: Slot Filling
- Substitute slots in the utterance with the appropriate values from the input frame (a sketch follows)
- Example:
  - What time do you need to arrive in arrive_city?
  - What time do you need to arrive in New York?
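A minimal sketch of the substitution step; the regex-based fill_slots() helper is a hypothetical illustration, not the actual code:

    import re

    def fill_slots(utterance, frame):
        # Replace each slot token (e.g. arrive_city) with its frame value;
        # tokens without a frame entry are left unchanged.
        return re.sub(r"\w+",
                      lambda m: str(frame.get(m.group(0), m.group(0))),
                      utterance)

    frame = {"arrive_city": "New York"}
    print(fill_slots("What time do you need to arrive in arrive_city?", frame))
    # -> What time do you need to arrive in New York?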
35. Stochastic NLG: Advantages
- corpus-driven
- easy to build (minimal knowledge engineering)
- fast prototyping
- minimal input (speech act, slot values)
- natural output
- leverages data-collecting/tagging effort
36. Stochastic NLG: Shortcomings
- What might sound natural for a human speaker (imperfect grammar, intentional omission of words, etc.) may sound awkward (or wrong) coming from the system.
- It is difficult to define utterance boundaries and utterance classes; some utterances in the corpus may be a conjunction of more than one utterance class.
- Factors other than the utterance class may affect the words (e.g., discourse history).
- Some sophistication built into traditional NLG engines is not available (e.g., aggregation, anaphorization).
37. Open Issues
- How big a corpus do we need?
- How much of it needs manual tagging?
- How does the n in n-gram affect the output?
- What happens to the output when two different human speakers are modeled in one model?
- Can we replace scoring with a search algorithm?
38. Evaluation
- Must be able to evaluate generation independently of the rest of the dialog system
- Comparative evaluation using dialog transcripts:
  - need more subjects
  - 8-10 dialogs, with system output generated batch-mode by two different engines
- Evaluation of human travel agent utterances:
  - Do users rate them well?
  - Is it good enough to model human utterances?
39. Preliminary Evaluation
- Batch-mode generation using two systems; comparative evaluation of output by human subjects
- User preferences (49 utterances total):
- Weak preference for stochastic NLG (p = 0.18)
  subject    stochastic    templates    difference
  1              41             8            33
  2              34            15            19
  3              17            32           -15
  4              32            17            15
  5              30            17            13
  6              27            19             8
  7               8            41           -33
  average        27            21.29          5.71