1
Stochastic Language Generation: Analysis and Evaluation
  • Alice Oh
  • aliceo@cs.cmu.edu
  • May 26, 2000

2
Acknowledgements
  • Advisor: Alex Rudnicky
  • Thesis Readers: Eric Nyberg, Kevin Lenzo
  • Voice for Evaluation: Kevin Lenzo
  • Discussions: John Lafferty, Roni Rosenfeld, Wei Xu
  • Transcription: Tina Bennett
  • Evaluation Participants

3
Outline
  • Natural Language Generation
  • Stochastic NLG
  • Evaluation
  • Evaluation of NLG in Dialog Systems
  • Analysis of stochastic NLG
  • Evaluation of stochastic NLG

4
Natural Language Generation
  • Natural Language Understanding (NLU)
  • Natural Language Generation (NLG)
  • NLG in Communicator

  NLU: Text → Semantic (Syntactic) Representation
  NLG: Semantic (Syntactic) Representation → Text
  In Communicator: Dialog Manager → Input Frame → NLG → Text → TTS → Speech
5
Stochastic Natural Language Generation
6
Stochastic NLG: Problem Statement
  • Problem: build a generation engine for a dialog
    system that can combine the advantages, as well
    as overcome the difficulties, of the two dominant
    approaches (template-based generation and
    grammar rule-based NLG)
  • Our Approach: design a corpus-driven stochastic
    generation engine that takes advantage of the
    characteristics of task-oriented conversational
    systems, among them:
  • Spoken utterances are much shorter in length
  • There are well-defined subtopics within the task,
    so the language can be selectively modeled

7
Stochastic NLG: Overview
  • Language Model: an n-gram language model of a
    domain expert's language, built from a corpus of
    travel reservation dialogs
  • Generation: given an utterance class, randomly
    generates a set of candidate utterances based on
    the LM distributions
  • Scoring: based on a set of heuristics, scores the
    candidates and picks the best one
  • Slot filling: substitutes slots in the utterance
    with the appropriate values from the input frame

8
Stochastic NLG: Overview (diagram)
  Dialog Manager → Input Frame → Generation (using
  Language Models built from Tagged Corpora) →
  Candidate Utterances → Scoring → Best Utterance →
  Slot Filling → Complete Utterance → TTS
  • Input Frame: act: query, content: depart_time,
    depart_date: 20000501
  • Candidate Utterances: What time on depart_date?
    At what time would you be leaving depart_city?
  • Best Utterance: What time on depart_date?
  • Complete Utterance: What time on Mon, May 8th?
9
Example
  • Utterances in Corpus
  • What time do you want to depart depart_city?
  • What time on depart_date would you like to
    depart?
  • What time would you like to leave?
  • What time do you want to depart on depart_date?
  • Output (different from corpus)
  • What time would you like to depart?
  • What time on depart_date would you like to
    depart depart_city?
  • What time on depart_date would you like to
    depart on depart_date?

10
Stochastic NLG in Communicator
  • Hybrid System
  • canned expressions: "Hello. Welcome to the CMU
    Communicator."
  • templates: "I'm sorry, I must have the wrong
    cities for your trip. You cannot travel from
    depart_city to arrive_city."
  • stochastic generation: "Okay. I have a nonstop
    on airline departing depart_city at
    depart_time arriving into arrive_city at
    arrive_time."
  • Language Models
  • built from CMU dialogs (39 dialogs, 970
    utterances, 12852 words)
  • inform_flight sub-corpus: 43 utts, 699 words (55
    unique words)
  • query_depart_time sub-corpus: 37 utts, 382 words
    (24 unique words)
  • Time to Generate: 75 msec on average
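The three-way hybrid can be sketched as a simple dispatch. This is a minimal illustration, not the actual Communicator code; the table contents, function names, and `{slot}` marker syntax are all assumptions.

```python
# Sketch of a hybrid realizer: canned strings, slot-filled templates,
# and a stochastic fallback. CANNED/TEMPLATES contents are illustrative.
CANNED = {
    "greeting": "Hello. Welcome to the CMU Communicator.",
}
TEMPLATES = {
    "bad_cities": ("I'm sorry, I must have the wrong cities for your trip. "
                   "You cannot travel from {depart_city} to {arrive_city}."),
}

def realize(utt_class, frame):
    if utt_class in CANNED:                       # canned expressions
        return CANNED[utt_class]
    if utt_class in TEMPLATES:                    # template + slot values
        return TEMPLATES[utt_class].format(**frame)
    return stochastic_generate(utt_class, frame)  # fall through to LM engine

def stochastic_generate(utt_class, frame):
    # Stub standing in for the n-gram engine described on later slides.
    return "Okay."
```

Only utterance classes with no canned or template entry reach the stochastic engine, which matches the division of labor listed above.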

11
Analysis and Evaluation
12
Evaluation of NLG in Dialog Systems
  • Analysis of technique
  • coverage
  • quality of output utterances
  • Evaluation of NLG within the overall system
  • user satisfaction of system utterances
  • overall impression of the interaction
  • effect of system prompt on user utterances
  • Evaluation of NLG with respect to task completion

13
Analysis (diagram)
  Input Frame → Stochastic NLG_1 → Output_1
  Input Frame → Stochastic NLG_2 → Output_2
  . . .
  Input Frame → Stochastic NLG_k → Output_k
  (Input → Parameterized NLG → Analysis of the k outputs)
14
Analysis: Experiment
  • Setup: Web-based survey
  • 2 dialog acts (inform_flight, query_depart_time)
  • 5 groups per dialog act
  • 8 output sentences (unigram - 8-gram) per group
  • Subjects: 12 subjects not familiar with NLG or
    the Communicator task
  • Parameter: n in n-gram (1-8)
  • Task:
  • for each sentence, mark Acceptable or Unacceptable
  • for each group, pick the best sentence (the one
    with the lowest number)

15
Analysis: Results
  • Average best sentence: 4 / 3.75 (results chart
    not reproduced in this transcript)
16
Analysis: Discussion
  • Text vs. Speech
  • a subject marked as unacceptable an exact
    duplicate of a sentence in the corpus:
  • "And departing Pittsburgh at what time?"
  • how would they judge travel agents' language on
    paper? On the phone?
  • Grammar
  • subjects looked for near-perfect grammar in
    acceptable sentences:
  • "I have a nonstop on United Airlines departing
    Seattle at six twenty a.m., arrives Los Angeles
    at eight fifty three a.m."
  • If the system can do better than humans, should
    it?

17
Evaluation (diagram)
  Dialogs → Stochastic NLG → Dialogs with Output_S
  Dialogs → Template NLG → Dialogs with Output_T
  Batch-mode Generation → Transcription →
  Comparative Evaluation

18
Experiment 1
  • Setup: Hardcopies of dialog transcripts
  • 7 sets of dialogs
  • one set: output from template NLG & stochastic
    NLG
  • 49 pairs of sentences to compare
  • Subjects: 7 subjects in LTI (not in Communicator)
  • Task: for each pair of sentences, mark which
    sentence is better
  • Results:
  • Weak preference for Stochastic NLG (p = 0.18)
  • 5 out of 7 subjects preferred Stochastic NLG

19
Experiment 2
  • Setup:
  • Webpage with links to .wav files of recorded
    dialogs
  • Second webpage with the transcripts of those
    dialogs
  • 3 sets of dialogs
  • one set: output from template NLG & stochastic
    NLG
  • Criteria for Choosing Dialogs:
  • calls from outside of the Communicator group
  • successfully completed itinerary
  • used a part of the dialog (e.g., one leg of the
    trip)
  • contains at least 5 utterances generated by
    stochastic NLG

20
Experiment 2 (cont'd)
  • Subjects: 20 subjects not familiar with NLG or
    the Communicator task
  • Task:
  • recorded dialogs: listen to each pair of dialogs,
    then answer 4 questions (natural, understandable,
    like better, prefer to use)
  • transcripts: for each pair of sentences, pick the
    more natural sentence

21
Experiment 2: Results
  • Listening to the recorded dialogs
  • by question (< 0: stochastic, > 0: templates)
  • natural: mean -0.02, p-value 0.43
  • understand: mean 0.08, p-value 0.14
  • like better: mean 0.12, p-value 0.13
  • prefer to use: mean -0.02, p-value 0.43
  • by subject
  • 11 subjects preferring S, 8 preferring T, 1
    neutral
  • 6 statistically significant (3 S, 3 T) by sign
    test (at p = 0.05)

22
Experiment 2: Results
  • Reading the transcripts
  • 11 subjects preferring S, 7 preferring T, 3
    neutral
  • 2 statistically significant (both S) by t-test
  • Correlation between Recorded Dialogs &
    Transcripts: r = 0.009

23
Experiment 2: Discussion
  • No correlation between subjects' judgments on
    recorded dialogs & transcripts
  • No significant difference in the results
  • Possible reasons:
  • Subjects made judgments based on voice quality
  • not following instructions (!)
  • 15 out of 20 subjects mentioned voice quality in
    comments
  • use a TTS voice?
  • Some other factors (e.g., length of utterance,
    lexicon) should have been controlled
  • There really is no significant difference in the
    quality of output

24
Experiment 2: Comments
  • "Although B sounded more natural, it was slower
    (too slow)"
  • "Voice A slurred his words a little at the end of
    one or two sentences--it sounded sloppy"
  • "Voice seems to care"
  • "Neither. I prefer talking to real people rather
    than computers"
  • "Both. Really no preference. The dialogues,
    while different, were both pleasant in my
    opinion"

25
Conclusion
  • Stochastic NLG does at least as well as
    hand-crafted templates
  • Evaluation of NLG (esp. in dialog systems) is
    hard:
  • getting subjects to do the right thing
  • designing experiments that control as many
    independent variables as possible
  • teasing apart TTS from NLG
  • More research is needed:
  • judging human travel agents' language
  • another round of experiments!

26
Extra Slides
27
Current Approaches
  • Traditional (rule-based) NLG
  • hand-crafted generation grammar rules and other
    knowledge
  • input: a very richly specified set of semantic
    and syntactic features
  • Example (from a Nitrogen demo website,
    http://www.isi.edu/natural-language/projects/nitrogen/;
    the < and : notation below is reconstructed from
    the garbled transcript):
    (h / possible<latent
      :domain (h2 / obligatory<necessary
        :domain (e / eat,take in
          :agent you
          :patient (c / poulet))))
  • "You may have to eat chicken"
  • Template-based NLG
  • simple to build
  • input: a dialog act, and/or a set of slot-value
    pairs

28
Stochastic NLG can also be thought of as a way to
automatically build templates from a corpus
  • If you set n equal to a large enough number, most
    utterances generated by LM-NLG will be exact
    duplicates of the utterances in the corpus.
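This can be seen with a toy model. The sketch below (all function names and the tiny two-sentence corpus are illustrative, not the thesis code) builds an n-gram table with n larger than any context the corpus sentences share, so a random walk can only replay corpus sentences verbatim:

```python
import random
from collections import defaultdict

def build_lm(corpus, n):
    """Map each (n-1)-word context to the list of observed next words."""
    model = defaultdict(list)
    for sent in corpus:
        toks = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(len(toks) - n + 1):
            model[tuple(toks[i:i + n - 1])].append(toks[i + n - 1])
    return model

def sample(model, n, rng):
    """Random walk over the n-gram table from the start context."""
    ctx, out = ("<s>",) * (n - 1), []
    while True:
        nxt = rng.choice(model[ctx])
        if nxt == "</s>":
            return " ".join(out)
        out.append(nxt)
        ctx = ctx[1:] + (nxt,)

corpus = ["what time would you like to leave",
          "what time do you want to depart depart_city"]
lm = build_lm(corpus, n=8)  # n exceeds every shared prefix
rng = random.Random(0)
outputs = {sample(lm, 8, rng) for _ in range(20)}
assert outputs <= set(corpus)  # every output duplicates a corpus sentence
```

With a smaller n (say 2 or 3), the same sampler recombines fragments across sentences, which is exactly the novel-output behavior shown on slide 9.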

29
Stochastic NLG: Corpora
  • Human-human dialogs in travel reservations
    (CMU-Leah, SRI-ATIS/American Express dialogs)

30
Tagging
  • CMU corpus: tagged manually
  • SRI corpus: tagged semi-automatically using
    trigram language models built from the CMU corpus

31
Tags
  • Utterance classes (29):
    query_arrive_city, query_arrive_time,
    query_arrive_time, query_confirm,
    query_depart_date, query_depart_time,
    query_pay_by_card, query_preferred_airport,
    query_return_date, query_return_time,
    hotel_car_info, hotel_hotel_chain,
    hotel_hotel_info, hotel_need_car,
    hotel_need_hotel, hotel_where,
    inform_airport, inform_confirm_utterance,
    inform_epilogue, inform_flight,
    inform_flight_another, inform_flight_earlier,
    inform_flight_earliest, inform_flight_later,
    inform_flight_latest, inform_not_avail,
    inform_num_flights, inform_price, other
  • Attributes (24):
    airline, am, arrive_airport, arrive_city,
    arrive_date, arrive_time, car_company, car_price,
    connect_airline, connect_airport, connect_city,
    depart_airport, depart_city, depart_date,
    depart_time, depart_tod, flight_num, hotel,
    hotel_city, hotel_price, name, num_flights, pm,
    price

32
Stochastic NLG: Generation
  • Given an utterance class, randomly generates a
    set of candidate utterances based on the LM
    distributions
  • Generation stops when an utterance has a penalty
    score of 0 or the maximum number of iterations
    (50) has been reached
  • Average generation time: 75 msec for Communicator
    dialogs
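The generate-and-test loop can be sketched as follows. This is a sketch, not the thesis implementation: the sampler and penalty function are supplied by the caller, and all names are made up.

```python
import random

MAX_ITER = 50  # from the slide: at most 50 candidates per utterance

def generate_best(sample_utterance, penalty, rng=None):
    """Sample candidate utterances until one has penalty 0 or MAX_ITER
    is reached; return the lowest-penalty candidate seen."""
    rng = rng or random.Random()
    best, best_pen = None, float("inf")
    for _ in range(MAX_ITER):
        cand = sample_utterance(rng)   # one random walk over the class LM
        pen = penalty(cand)            # heuristic score (lower is better)
        if pen < best_pen:
            best, best_pen = cand, pen
        if pen == 0:                   # early exit on an acceptable candidate
            break
    return best
```

In the real engine, `sample_utterance` would draw words from the utterance class's n-gram distribution and `penalty` would apply the heuristics listed on the following slide.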

33
Stochastic NLG: Scoring
  • Assign various penalty scores for:
  • unusual length of utterance (thresholds for
    too-long and too-short)
  • a slot in the generated utterance with an invalid
    (or no) value in the input frame
  • a new and required attribute in the input
    frame that's missing from the generated utterance
  • repeated slots in the generated utterance
  • Pick the utterance with the lowest penalty (or
    stop generating at an utterance with 0 penalty)
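A minimal version of these penalties could look like this. The length thresholds, the weight of 1 per violation, and the `{name}` slot-marker syntax are all assumptions for illustration:

```python
def penalty(utterance, frame, min_len=3, max_len=15):
    """Heuristic penalty score for a candidate; 0 means acceptable.
    Slots are written as {name} tokens (an assumed convention)."""
    toks = utterance.split()
    slots = [t[1:-1] for t in toks if t.startswith("{") and t.endswith("}")]
    score = 0
    if not (min_len <= len(toks) <= max_len):         # unusual length
        score += 1
    score += sum(1 for s in slots if s not in frame)  # slot with no value
    score += sum(1 for a in frame if a not in slots)  # required attr missing
    score += len(slots) - len(set(slots))             # repeated slots
    return score
```

For example, a candidate that repeats depart_date and mentions a slot absent from the input frame accumulates penalty from two of the four heuristics.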

34
Stochastic NLG: Slot Filling
  • Substitute slots in the utterance with the
    appropriate values from the input frame
  • Example:
  • What time do you need to arrive in arrive_city?
  • What time do you need to arrive in New York?
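Slot filling is a straightforward substitution; a sketch using `{name}` slot markers (the marker syntax is an assumption, since the transcript shows slot names without delimiters):

```python
import re

def fill_slots(utterance, frame):
    """Replace each {slot} marker with its value from the input frame."""
    return re.sub(r"\{(\w+)\}", lambda m: str(frame[m.group(1)]), utterance)
```

Applied to the example above, `fill_slots("What time do you need to arrive in {arrive_city}?", {"arrive_city": "New York"})` returns "What time do you need to arrive in New York?".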

35
Stochastic NLG: Advantages
  • corpus-driven
  • easy to build (minimal knowledge engineering)
  • fast prototyping
  • minimal input (speech act, slot values)
  • natural output
  • leverages data-collecting/tagging effort

36
Stochastic NLG: Shortcomings
  • What might sound natural (imperfect grammar,
    intentional omission of words, etc.) for a human
    speaker may sound awkward (or wrong) for the
    system.
  • It is difficult to define utterance boundaries
    and utterance classes. Some utterances in the
    corpus may be a conjunction of more than one
    utterance class.
  • Factors other than the utterance class may affect
    the words (e.g., discourse history).
  • Some sophistication built into traditional NLG
    engines is not available (e.g., aggregation,
    anaphorization).

37
Open Issues
  • How big of a corpus do we need?
  • How much of it needs manual tagging?
  • How does the n in n-gram affect the output?
  • What happens to output when two different human
    speakers are modeled in one model?
  • Can we replace scoring with a search algorithm?

38
Evaluation
  • Must be able to evaluate generation independently
    of the rest of the dialog system
  • Comparative evaluation using dialog transcripts:
  • need more subjects
  • 8-10 dialogs, system output generated batch-mode
    by two different engines
  • Evaluation of human travel agent utterances:
  • Do users rate them well?
  • Is it good enough to model human utterances?

39
Preliminary Evaluation
  • Batch-mode generation using two systems,
    comparative evaluation of output by human
    subjects
  • User Preferences (49 utterances total)
  • Weak preference for Stochastic NLG (p = 0.18)

  subject   stochastic   templates   difference
  1         41           8            33
  2         34           15           19
  3         17           32          -15
  4         32           17           15
  5         30           17           13
  6         27           19            8
  7         8            41          -33
  average   27           21.29         5.71