InfoMagnets: Making Sense of Corpus Data

Transcript
1
InfoMagnets: Making Sense of Corpus Data
  • Jaime Arguello
  • Language Technologies Institute

2
Topic Segmentation: Helping InfoMagnets Make
Sense of Corpus Data
  • Jaime Arguello
  • Language Technologies Institute

3
Outline
  • InfoMagnets
  • Applications
  • Topic Segmentation
  • Evaluation of 3 Algorithms
  • Results
  • Conclusions
  • Q/A

4
InfoMagnets
5
InfoMagnets: Applications
  • Behavioral Research
  • Two publishable results (submitted to CHI)
  • CycleTalk Project, LTI
  • Netscan Group, HCII
  • Conversational Interfaces
  • TuTalk (Gweon et al., 2005)
  • Guides authoring using pre-processed human-human
    sample conversations
  • Corpus organization makes authoring
    conversational agents less intimidating.
  • Rosé, Pai, and Arguello (2005)

6
Pre-processing Dialogue
[Diagram: (1) transcribed conversations are divided into topic chunks
by topic segmentation; (2) the chunks are grouped by topic clustering
into topic clusters A, B, and C.]
7
Topic Segmentation
  • Preprocessing step for InfoMagnets
  • Important computational linguistics problem!
  • Previous Work
  • Marti Hearst's TextTiling (1994)
  • Beeferman, Berger, and Lafferty (1997)
  • Barzilay and Lee (2004), NAACL best paper award!
  • Many others
  • But we are segmenting dialogue

8
Topic Segmentation of Dialogue
  • Dialogue is Different
  • Very little training data
  • Linguistic Phenomena
  • Ellipsis
  • Telegraphic content
  • And, most importantly:

Coherence in dialogue is organized around a
shared task, and not around a single flow of
information!
9
Coherence Defined Over Shared Task
Multiple topic shifts occur in regions with no
intersection of content words
10
Evaluation of 3 Algorithms
  • 22 student-tutor pairs
  • Thermodynamics
  • Conversation via chat interface
  • One coder
  • Results reported in terms of Pk (see the sketch below)
  • Beeferman, Berger, and Lafferty (1999)
  • Significance tests: two-tailed t-tests
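
A minimal sketch of Pk, assuming the standard definition of the
metric; the representation (segment-length lists) and the code are
ours, not the talk's:

```python
# Minimal sketch of the Pk segmentation metric: the probability that
# two positions k apart are classified differently (same topic vs.
# different topics) by the reference and hypothesis segmentations.
# Segmentations are represented as lists of segment lengths.

def pk(reference, hypothesis, k=None):
    def to_labels(seg_lengths):
        # Expand [3, 2] -> [0, 0, 0, 1, 1]: one segment id per utterance
        labels = []
        for seg_id, length in enumerate(seg_lengths):
            labels.extend([seg_id] * length)
        return labels

    ref, hyp = to_labels(reference), to_labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same dialogue"
    if k is None:
        # Common convention: half the average reference segment length
        k = max(1, len(ref) // (2 * len(reference)))

    errors = 0
    trials = len(ref) - k
    for i in range(trials):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
    return errors / trials

# Example: reference topics of 5 and 5 utterances; the hypothesis
# places the boundary one utterance too early.
print(pk([5, 5], [4, 6]))  # 0.25: a small but nonzero error
```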

11
3 Baselines
  • NONE: no topic boundaries
  • ALL: every utterance marks a topic boundary
  • EVEN: every 13th utterance marks a topic boundary
  • avg. topic length is 13 utterances (see the sketch below)
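
For concreteness, the three baselines as segment-length lists; the
representation and function names are ours, chosen so the outputs plug
directly into a Pk scorer like the sketch above:

```python
# The three degenerate baselines over a dialogue of n utterances.

def none_baseline(n):
    return [n]                    # a single segment: no boundaries

def all_baseline(n):
    return [1] * n                # a boundary after every utterance

def even_baseline(n, width=13):   # 13 = average topic length in the corpus
    segments = [width] * (n // width)
    if n % width:
        segments.append(n % width)
    return segments

# e.g., even_baseline(30) -> [13, 13, 4]
```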

12
1st Attempt: TextTiling
(Hearst, 1997)
  • Slide two adjacent windows down the text
  • Calculate cosine correlation at each step
  • Use correlation values to calculate depth scores
  • Depth values higher than a threshold correspond
    to topic shifts (see the sketch below)

[Diagram: two adjacent sliding windows, w1 and w2]
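
A compact sketch of the TextTiling idea, not Hearst's exact algorithm:
utterance-level windows, cosine similarity at each gap, and a
hill-climbing depth score; the window size and threshold below are
illustrative choices, not the talk's settings:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttiling(utterances, w=10, threshold=0.15):
    tokens = [u.lower().split() for u in utterances]
    # Similarity between the w utterances before and after each gap
    sims = []
    for gap in range(w, len(tokens) - w + 1):
        left = Counter(t for u in tokens[gap - w:gap] for t in u)
        right = Counter(t for u in tokens[gap:gap + w] for t in u)
        sims.append(cosine(left, right))

    boundaries = []
    for i in range(len(sims)):
        # Depth: climb to the nearest peak on each side of the valley
        l = i
        while l > 0 and sims[l - 1] >= sims[l]:
            l -= 1
        r = i
        while r < len(sims) - 1 and sims[r + 1] >= sims[r]:
            r += 1
        depth = (sims[l] - sims[i]) + (sims[r] - sims[i])
        if depth > threshold:
            boundaries.append(i + w)  # map gap index back to utterance index
    return boundaries
```
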
13
TextTiling Results
  • TextTiling performs worse than baselines
  • Difference not statistically significant
  • Why doesn't it work?

14
TextTiling Results
  • Topic boundaries set heuristically where
    correlation is 0
  • Bad results, but still valuable!

15
2nd Attempt: Barzilay and Lee (2004)
  • Cluster utterances
  • Treat each cluster as a state
  • Construct an HMM
  • Emissions: state-specific language models
  • Transitions: based on location and
    cluster membership of the utterances
  • Viterbi re-estimation until convergence
    (see the sketch below)
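
A deliberately simplified sketch of the content-model idea, with our
assumptions made explicit: smoothed unigram LMs in place of the
paper's bigram models, a flat switch probability in place of
cluster-based transition estimates, initial clusters taken as given,
and no re-estimation loop:

```python
# Each utterance cluster becomes an HMM state with its own language
# model; a topic boundary is placed wherever the Viterbi state changes.
import math
from collections import Counter

def state_lms(utterances, clusters, alpha=0.1):
    """One smoothed unigram LM per cluster id (the paper uses bigrams).
    `utterances` are token lists; `clusters` is a parallel list of ids."""
    vocab = {t for u in utterances for t in u}
    lms = {}
    for c in set(clusters):
        counts = Counter(t for u, cl in zip(utterances, clusters)
                         if cl == c for t in u)
        total = sum(counts.values())
        lms[c] = {w: (counts[w] + alpha) / (total + alpha * len(vocab))
                  for w in vocab}
    return lms

def viterbi_segment(utterances, lms, p_switch=0.2):
    """Returns indices of utterances that start a new topic segment.
    Assumes the utterances share the vocabulary the LMs were built on."""
    states = sorted(lms)
    log_stay = math.log(1.0 - p_switch)
    log_switch = math.log(p_switch / max(1, len(states) - 1))

    def emit(s, utt):
        return sum(math.log(lms[s][t]) for t in utt)

    # Forward pass: best score and backpointer per state per utterance.
    trellis = [{s: (emit(s, utterances[0]), None) for s in states}]
    for utt in utterances[1:]:
        prev, row = trellis[-1], {}
        for s in states:
            best = max(states,
                       key=lambda p: prev[p][0] +
                       (log_stay if p == s else log_switch))
            row[s] = (prev[best][0] +
                      (log_stay if best == s else log_switch) +
                      emit(s, utt), best)
        trellis.append(row)

    # Backtrace the best state path.
    s = max(states, key=lambda x: trellis[-1][x][0])
    path = [s]
    for row in reversed(trellis[1:]):
        s = row[s][1]
        path.append(s)
    path.reverse()

    # A boundary wherever consecutive utterances change state.
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```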

16
BL Results
  • BL is statistically better than TT, but not better
    than the degenerate baselines

17
BL Results
  • Topic boundaries too fine-grained
  • Fixed expressions (ok, yeah, sure, ...)
  • Remember: cohesion is based on the shared task
  • Are the state-specific language models sufficiently
    different?

18
Adding Dialogue Dynamics
  • Dialogue Act coding scheme
  • Developed for discourse analysis of human-tutor
    dialogues
  • 4 main dimensions
  • Action
  • Depth
  • Focus
  • Control
  • Dialogue Exchange (Sinclair and Coulthard, 1975)

19
3rd Attempt: Cross-Dimensional Learning
  • X-dimensional learning (Donmez et al., 2004)
  • Use estimated labels on some dimensions to learn
    other dimensions
  • 3 types of features (see the sketch below)
  • Text (discourse cues)
  • Lexical coherence (binary)
  • Dialogue act labels
  • 10-fold cross-validation
  • Topic boundaries learned from estimated labels, not
    hand-coded ones!
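
An illustrative sketch of how such boundary features might be
assembled and scored; the cue lexicon, feature names, dialogue-act
values, and classifier choice are our assumptions, not the talk's
implementation:

```python
# Feature construction for topic-boundary classification; y = 1 marks
# an utterance that begins a new topic segment.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

CUES = {"ok", "so", "now", "next", "anyway"}  # hypothetical cue lexicon

def boundary_features(utterance, prev_utterance, acts):
    """One feature dict per utterance: discourse cues, a binary
    lexical-coherence flag, and estimated dialogue-act labels."""
    toks = set(utterance.lower().split())
    prev = set(prev_utterance.lower().split())
    feats = {"cue_" + c: 1 for c in CUES & toks}
    feats["lexical_coherence"] = int(bool(toks & prev))
    for dim in ("action", "depth", "focus", "control"):
        feats[dim + "=" + acts[dim]] = 1  # estimated, not hand-coded
    return feats

# Toy usage with made-up dialogue-act values.
X = [boundary_features("ok so now consider the turbine", "",
                       {"action": "request", "depth": "shallow",
                        "focus": "new", "control": "tutor"}),
     boundary_features("the turbine extracts work from the steam",
                       "ok so now consider the turbine",
                       {"action": "inform", "depth": "deep",
                        "focus": "same", "control": "tutor"})]
y = [1, 0]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
# With the full corpus: scores = cross_val_score(model, X, y, cv=10)
```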

20
X-Dimensional Learning Results
  • X-DIM is statistically better than TT, the degenerate
    baselines, AND BL!

21
Statistically Significant Improvement
22
Future Directions
  • Merge cross-dimensional learning (w/ dialogue act
    features) with the BL content-modeling HMM approach.
  • Explore other work in topic segmentation of
    dialogue

23
Summary
  • Introduction to InfoMagnets
  • Applications
  • Need for topic segmentation
  • Evaluation of other algorithms
  • Novel algorithm using X-dimensional learning
    w/ statistically significant improvement

24
Q/A
  • Thank you!