InfoMagnets: Making Sense of Corpus Data

Transcript
1
InfoMagnets: Making Sense of Corpus Data
  • Jaime Arguello
  • Language Technologies Institute

2
Topic Segmentation: Helping InfoMagnets Make
Sense of Corpus Data
  • Jaime Arguello
  • Language Technologies Institute

3
Outline
  • InfoMagnets
  • Applications
  • Topic Segmentation
  • Evaluation of 3 Algorithms
  • Results
  • Conclusions
  • Q/A

4
InfoMagnets
5
InfoMagnets: Applications
  • Behavioral Research
  • Two publishable results (submitted to CHI)
  • CycleTalk Project, LTI
  • Netscan Group, HCII
  • Conversational Interfaces
  • TuTalk (Gweon et al., 2005)
  • Guides authoring using pre-processed human-human
    sample conversations
  • Corpus organization makes authoring
    conversational agents less intimidating.
  • Rosé, Pai, and Arguello (2005)

6
Pre-processing Dialogue
[Diagram: (1) transcribed conversations are divided into topic chunks
by topic segmentation; (2) the chunks are grouped by topic clustering
into topic clusters A, B, and C.]
7
Topic Segmentation
  • Preprocessing step for InfoMagnets
  • Important computational linguistics problem!
  • Previous Work
  • Marti Hearst's TextTiling (1994)
  • Beeferman, Berger, and Lafferty (1997)
  • Barzilay and Lee (2004), NAACL best paper award!
  • Many others
  • But we are segmenting dialogue

8
Topic Segmentation of Dialogue
  • Dialogue is Different
  • Very little training data
  • Linguistic Phenomena
  • Ellipsis
  • Telegraphic content
  • And, most importantly:

Coherence in dialogue is organized around a
shared task, and not around a single flow of
information!
9
Coherence Defined Over Shared Task
Multiple topic shifts occur in regions with no
intersection of content words
10
Evaluation of 3 Algorithms
  • 22 student-tutor pairs
  • Thermodynamics
  • Conversation via chat interface
  • One coder
  • Results reported in terms of Pk (see the sketch below)
  • Beeferman, Berger, and Lafferty (1999)
  • Significance tests: two-tailed t-tests
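
A minimal sketch of Pk, assuming the standard definition of the
metric; the representation (segment-length lists) and the code are
ours, not the talk's:

```python
# Minimal sketch of the Pk segmentation metric: the probability that
# two positions k apart are classified differently (same topic vs.
# different topics) by the reference and hypothesis segmentations.
# Segmentations are represented as lists of segment lengths.

def pk(reference, hypothesis, k=None):
    def to_labels(seg_lengths):
        # Expand [3, 2] -> [0, 0, 0, 1, 1]: one segment id per utterance
        labels = []
        for seg_id, length in enumerate(seg_lengths):
            labels.extend([seg_id] * length)
        return labels

    ref, hyp = to_labels(reference), to_labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same dialogue"
    if k is None:
        # Common convention: half the average reference segment length
        k = max(1, len(ref) // (2 * len(reference)))

    errors = 0
    trials = len(ref) - k
    for i in range(trials):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
    return errors / trials

# Example: reference topics of 5 and 5 utterances; the hypothesis
# places the boundary one utterance too early.
print(pk([5, 5], [4, 6]))  # 0.25: a small but nonzero error
```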

11
3 Baselines
  • NONE: no topic boundaries
  • ALL: every utterance marks a topic boundary
  • EVEN: every 13th utterance marks a topic boundary
  • avg. topic length is 13 utterances (see the sketch below)
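
For concreteness, the three baselines as segment-length lists; the
representation and function names are ours, chosen so the outputs plug
directly into a Pk scorer like the sketch above:

```python
# The three degenerate baselines over a dialogue of n utterances.

def none_baseline(n):
    return [n]                    # a single segment: no boundaries

def all_baseline(n):
    return [1] * n                # a boundary after every utterance

def even_baseline(n, width=13):   # 13 = average topic length in the corpus
    segments = [width] * (n // width)
    if n % width:
        segments.append(n % width)
    return segments

# e.g., even_baseline(30) -> [13, 13, 4]
```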

12
1st Attempt: TextTiling
(Hearst, 1997)
  • Slide two adjacent windows down the text
  • Calculate cosine correlation at each step
  • Use correlation values to calculate depth scores
  • Depth values higher than a threshold correspond
    to topic shifts (see the sketch below)

[Diagram: two adjacent sliding windows, w1 and w2]
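
A compact sketch of the TextTiling idea, not Hearst's exact algorithm:
utterance-level windows, cosine similarity at each gap, and a
hill-climbing depth score; the window size and threshold below are
illustrative choices, not the talk's settings:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttiling(utterances, w=10, threshold=0.15):
    tokens = [u.lower().split() for u in utterances]
    # Similarity between the w utterances before and after each gap
    sims = []
    for gap in range(w, len(tokens) - w + 1):
        left = Counter(t for u in tokens[gap - w:gap] for t in u)
        right = Counter(t for u in tokens[gap:gap + w] for t in u)
        sims.append(cosine(left, right))

    boundaries = []
    for i in range(len(sims)):
        # Depth: climb to the nearest peak on each side of the valley
        l = i
        while l > 0 and sims[l - 1] >= sims[l]:
            l -= 1
        r = i
        while r < len(sims) - 1 and sims[r + 1] >= sims[r]:
            r += 1
        depth = (sims[l] - sims[i]) + (sims[r] - sims[i])
        if depth > threshold:
            boundaries.append(i + w)  # map gap index back to utterance index
    return boundaries
```
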
13
TextTiling Results
  • TextTiling performs worse than baselines
  • Difference not statistically significant
  • Why doesn't it work?

14
TextTiling Results
  • Topic boundaries set heuristically where
    correlation is 0
  • Bad results, but still valuable!

15
2nd Attempt: Barzilay and Lee (2004)
  • Cluster utterances
  • Treat each cluster as a state
  • Construct an HMM
  • Emissions: state-specific language models
  • Transitions: based on location and
    cluster membership of the utterances
  • Viterbi re-estimation until convergence
    (see the sketch below)
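
A deliberately simplified sketch of the content-model idea, with our
assumptions made explicit: smoothed unigram LMs in place of the
paper's bigram models, a flat switch probability in place of
cluster-based transition estimates, initial clusters taken as given,
and no re-estimation loop:

```python
# Each utterance cluster becomes an HMM state with its own language
# model; a topic boundary is placed wherever the Viterbi state changes.
import math
from collections import Counter

def state_lms(utterances, clusters, alpha=0.1):
    """One smoothed unigram LM per cluster id (the paper uses bigrams).
    `utterances` are token lists; `clusters` is a parallel list of ids."""
    vocab = {t for u in utterances for t in u}
    lms = {}
    for c in set(clusters):
        counts = Counter(t for u, cl in zip(utterances, clusters)
                         if cl == c for t in u)
        total = sum(counts.values())
        lms[c] = {w: (counts[w] + alpha) / (total + alpha * len(vocab))
                  for w in vocab}
    return lms

def viterbi_segment(utterances, lms, p_switch=0.2):
    """Returns indices of utterances that start a new topic segment.
    Assumes the utterances share the vocabulary the LMs were built on."""
    states = sorted(lms)
    log_stay = math.log(1.0 - p_switch)
    log_switch = math.log(p_switch / max(1, len(states) - 1))

    def emit(s, utt):
        return sum(math.log(lms[s][t]) for t in utt)

    # Forward pass: best score and backpointer per state per utterance.
    trellis = [{s: (emit(s, utterances[0]), None) for s in states}]
    for utt in utterances[1:]:
        prev, row = trellis[-1], {}
        for s in states:
            best = max(states,
                       key=lambda p: prev[p][0] +
                       (log_stay if p == s else log_switch))
            row[s] = (prev[best][0] +
                      (log_stay if best == s else log_switch) +
                      emit(s, utt), best)
        trellis.append(row)

    # Backtrace the best state path.
    s = max(states, key=lambda x: trellis[-1][x][0])
    path = [s]
    for row in reversed(trellis[1:]):
        s = row[s][1]
        path.append(s)
    path.reverse()

    # A boundary wherever consecutive utterances change state.
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```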

16
BL Results
  • BL is statistically better than TT, but not better
    than the degenerate baselines

17
BL Results
  • Topic boundaries too fine-grained
  • Fixed expressions (ok, yeah, sure, ...)
  • Remember: cohesion is based on the shared task
  • Are the state-specific language models sufficiently
    different?

18
Adding Dialogue Dynamics
  • Dialogue Act coding scheme
  • Developed for discourse analysis of human-tutor
    dialogues
  • 4 main dimensions
  • Action
  • Depth
  • Focus
  • Control
  • Dialogue Exchange (Sinclair and Coulthard, 1975)

19
3rd Attempt: Cross-Dimensional Learning
  • X-dimensional learning (Donmez et al., 2004)
  • Use estimated labels on some dimensions to learn
    other dimensions
  • 3 types of features (see the sketch below)
  • Text (discourse cues)
  • Lexical coherence (binary)
  • Dialogue act labels
  • 10-fold cross-validation
  • Topic boundaries learned from estimated labels, not
    hand-coded ones!
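
An illustrative sketch of how such boundary features might be
assembled and scored; the cue lexicon, feature names, dialogue-act
values, and classifier choice are our assumptions, not the talk's
implementation:

```python
# Feature construction for topic-boundary classification; y = 1 marks
# an utterance that begins a new topic segment.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

CUES = {"ok", "so", "now", "next", "anyway"}  # hypothetical cue lexicon

def boundary_features(utterance, prev_utterance, acts):
    """One feature dict per utterance: discourse cues, a binary
    lexical-coherence flag, and estimated dialogue-act labels."""
    toks = set(utterance.lower().split())
    prev = set(prev_utterance.lower().split())
    feats = {"cue_" + c: 1 for c in CUES & toks}
    feats["lexical_coherence"] = int(bool(toks & prev))
    for dim in ("action", "depth", "focus", "control"):
        feats[dim + "=" + acts[dim]] = 1  # estimated, not hand-coded
    return feats

# Toy usage with made-up dialogue-act values.
X = [boundary_features("ok so now consider the turbine", "",
                       {"action": "request", "depth": "shallow",
                        "focus": "new", "control": "tutor"}),
     boundary_features("the turbine extracts work from the steam",
                       "ok so now consider the turbine",
                       {"action": "inform", "depth": "deep",
                        "focus": "same", "control": "tutor"})]
y = [1, 0]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
# With the full corpus: scores = cross_val_score(model, X, y, cv=10)
```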

20
X-Dimensional Learning Results
  • X-DIM is statistically better than TT, the degenerate
    baselines, AND BL!

21
Statistically Significant Improvement
22
Future Directions
  • Merge cross-dimensional learning (w/ dialogue act
    features) with the BL content-modeling HMM approach.
  • Explore other work in topic segmentation of
    dialogue

23
Summary
  • Introduction to InfoMagnets
  • Applications
  • Need for topic segmentation
  • Evaluation of other algorithms
  • Novel algorithm using X-dimensional learning
    w/ statistically significant improvement

24
Q/A
  • Thank you!