Title: InfoMagnets: Making Sense of Corpus Data
1. InfoMagnets: Making Sense of Corpus Data
- Jaime Arguello
- Language Technologies Institute
2. Topic Segmentation: Helping InfoMagnets Make Sense of Corpus Data
- Jaime Arguello
- Language Technologies Institute
3. Outline
- InfoMagnets
- Applications
- Topic Segmentation
- Evaluation of 3 Algorithms
- Results
- Conclusions
- Q/A
4. InfoMagnets
5. InfoMagnets Applications
- Behavioral Research
- 2 publishable results (submitted to CHI)
- CycleTalk Project, LTI
- Netscan Group, HCII
- Conversational Interfaces
- TuTalk (Gweon et al., 2005)
- Guide authoring using pre-processed human-human sample conversations
- Corpus organization makes authoring conversational agents less intimidating
- Rose, Pai, Arguello (2005)
6. Pre-processing Dialogue
[Diagram: transcribed conversations pass through (1) Topic Segmentation and then (2) Topic Clustering, yielding topic chunks grouped into clusters A, B, C]
7. Topic Segmentation
- Preprocess for InfoMagnets
- Important computational linguistics problem!
- Previous Work
- Marti Hearst's TextTiling (1994)
- Beeferman, Berger, and Lafferty (1997)
- Barzilay and Lee (2004), NAACL best paper award!
- Many others
- But we are segmenting dialogue
8. Topic Segmentation of Dialogue
- Dialogue is Different
- Very little training data
- Linguistic Phenomena
- Ellipsis
- Telegraphic content
- And, most importantly:
Coherence in dialogue is organized around a shared task, and not around a single flow of information!
9. Coherence Defined Over Shared Task
- Multiple topic shifts occur in regions with no intersection of content words
10. Evaluation of 3 Algorithms
- 22 student-tutor pairs
- Thermodynamics
- Conversation via chat interface
- One coder
- Results shown in terms of Pk
- Beeferman et al., 1999
- Significance tests: 2-tailed t-tests
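The Pk metric can be sketched in a few lines. This is a minimal implementation, not the evaluation code from the talk; the boundary-string encoding and the convention of setting the window to half the mean reference segment length are assumptions based on common practice.

```python
def pk(reference, hypothesis, k=None):
    """Pk (Beeferman et al., 1999): probability that two positions k
    apart are classified inconsistently (same vs. different segment)
    by the hypothesis relative to the reference.

    reference/hypothesis are boundary strings like "0010001", where
    '1' marks a topic boundary after that utterance (an assumed encoding)."""
    n = len(reference)
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(1, n // (2 * (reference.count('1') + 1)))
    errors = 0
    for i in range(n - k):
        same_ref = '1' not in reference[i:i + k]   # window crosses no true boundary?
        same_hyp = '1' not in hypothesis[i:i + k]  # window crosses no predicted boundary?
        errors += same_ref != same_hyp
    return errors / (n - k)
```

Lower Pk is better: a perfect segmentation scores 0, and a maximally wrong one approaches 1.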
11. 3 Baselines
- NONE: no topic boundaries
- ALL: every utterance marks a topic boundary
- EVEN: every 13th utterance marks a topic boundary
- avg topic length: 13 utterances
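The three degenerate baselines are simple to state as boundary strings ('1' marks a boundary after an utterance; the encoding itself is an assumption, not from the slides):

```python
def baseline_none(n):
    # NONE: no topic boundaries at all
    return '0' * n

def baseline_all(n):
    # ALL: every utterance marks a topic boundary
    return '1' * n

def baseline_even(n, period=13):
    # EVEN: a boundary every `period` utterances
    # (13 = the corpus's average topic length in utterances)
    return ''.join('1' if (i + 1) % period == 0 else '0' for i in range(n))
```

Because Pk penalizes both missed and spurious boundaries, NONE and ALL bound the two failure modes, and EVEN shows what knowing only the average topic length buys you.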
12. 1st Attempt: TextTiling (Hearst, 1997)
- Slide two adjacent windows (w1, w2) down the text
- Calculate cosine correlation at each step
- Use correlation values to calculate depth scores
- Depth values higher than a threshold correspond to topic shifts
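The steps above can be sketched as follows. This is a simplified rendering of the TextTiling idea, not Hearst's full algorithm: it uses whole utterances as window units, skips her smoothing and boundary-collapsing passes, and the mean-minus-half-σ cutoff follows her liberal threshold.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two bag-of-words Counters
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def texttiling(utterances, w=3):
    bags = [Counter(u.lower().split()) for u in utterances]
    # cosine similarity of the two adjacent w-utterance windows at each gap
    gaps = list(range(w, len(bags) - w + 1))
    sims = []
    for g in gaps:
        left = sum((bags[i] for i in range(g - w, g)), Counter())
        right = sum((bags[i] for i in range(g, g + w)), Counter())
        sims.append(cosine(left, right))
    # depth score: how far each similarity valley sits below the peaks around it
    depths = [(max(sims[:i + 1]) - s) + (max(sims[i:]) - s)
              for i, s in enumerate(sims)]
    # boundary wherever depth exceeds mean depth minus half a standard deviation
    mean = sum(depths) / len(depths)
    sd = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    return [g for g, d in zip(gaps, depths) if d > mean - sd / 2]
```

On dialogue this breaks down exactly as the next slides show: adjacent windows can share almost no vocabulary even within one topic, so the similarity signal is dominated by noise.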
13. TextTiling Results
- TextTiling performs worse than baselines
- Difference not statistically significant
- Why doesn't it work?
14. TextTiling Results
- Topic boundary set heuristically where correlation is 0
- Bad results, but still valuable!
15. 2nd Attempt: Barzilay and Lee (2004)
- Cluster utterances
- Treat each cluster as a state
- Construct an HMM
- Emissions: state-specific language models
- Transitions: based on location and cluster-membership of the utterances
- Viterbi re-estimation until convergence
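The decoding step of this content-model idea can be sketched as below. This is a one-pass simplification under stated assumptions: the real Barzilay-and-Lee model re-estimates the clusters via EM and uses bigram language models, whereas here the clusters are given, the emission models are add-one-smoothed unigram LMs, and transitions are reduced to a hand-set penalty for switching states.

```python
import math
from collections import Counter

def viterbi_segment(utterances, clusters, switch_penalty=2.0):
    """Each cluster acts as an HMM state with its own unigram LM;
    Viterbi finds the best state sequence, and a topic boundary is
    placed wherever the decoded state changes."""
    vocab = {w for u in utterances for w in u.split()}
    V = len(vocab)
    # state-specific unigram LMs with add-one smoothing
    lms = []
    for members in clusters:  # clusters: lists of utterance indices
        counts = Counter(w for i in members for w in utterances[i].split())
        total = sum(counts.values())
        lms.append(lambda w, c=counts, t=total: math.log((c[w] + 1) / (t + V)))

    def emit(state, utt):
        return sum(lms[state](w) for w in utt.split())

    K = len(clusters)
    # Viterbi: uniform transitions except a fixed penalty for changing state
    prev = [emit(s, utterances[0]) for s in range(K)]
    back = []
    for utt in utterances[1:]:
        cur, ptr = [], []
        for s in range(K):
            best = max(range(K),
                       key=lambda r: prev[r] - (switch_penalty if r != s else 0))
            cur.append(prev[best] - (switch_penalty if best != s else 0)
                       + emit(s, utt))
            ptr.append(best)
        prev, back = cur, back + [ptr]
    # backtrace the best state sequence
    state = max(range(K), key=lambda s: prev[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```

The switch penalty plays the role the learned transition probabilities play in the full model: without it, noisy emissions produce exactly the over-fine boundaries the next slide complains about.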
16. BL Results
- BL statistically better than TT, but not better than the degenerate algorithms
17. BL Results
- Topic boundaries too fine-grained
- Fixed expressions (ok, yeah, sure, ...)
- Remember: cohesion is based on the shared task
- Are the state-based language models sufficiently different?
18. Adding Dialogue Dynamics
- Dialogue Act coding scheme
- Developed for discourse analysis of human-tutor dialogues
- 4 main dimensions:
- Action
- Depth
- Focus
- Control
- Dialogue Exchange (Sinclair and Coulthard, 1975)
19. 3rd Attempt: Cross-Dimensional Learning
- X-dimensional learning (Donmez et al., 2004)
- Use estimated labels on some dimensions to learn other dimensions
- 3 types of features:
- Text (discourse cues)
- Lexical coherence (binary)
- Dialogue Act labels
- 10-fold cross-validation
- Topic boundaries learned on estimated labels, not hand-coded ones!
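The three feature types could feed a per-utterance boundary classifier along these lines. The feature names, the cue-word list, and the dialogue-act dictionary shape are all illustrative assumptions; the slides do not specify the actual feature set.

```python
def boundary_features(prev_utt, utt, prev_acts, acts,
                      cue_words=("ok", "so", "now", "alright")):
    """Hypothetical feature vector for deciding whether `utt` opens a
    new topic segment.  `prev_acts`/`acts` hold estimated dialogue-act
    labels for the dimensions (Action, Depth, Focus, Control)."""
    prev_words = set(prev_utt.lower().split())
    words = set(utt.lower().split())
    return {
        # text feature: does the utterance open with a discourse cue?
        "starts_with_cue": any(utt.lower().startswith(c) for c in cue_words),
        # lexical coherence (binary): any word overlap with the previous utterance?
        "lexically_coherent": bool(prev_words & words),
        # dialogue-act features drawn from the other (estimated) dimensions
        "control_shift": prev_acts.get("Control") != acts.get("Control"),
        "action": acts.get("Action"),
    }
```

The key point of the slide survives in this sketch: the dialogue-act inputs are themselves classifier estimates from the other dimensions, not hand-coded gold labels.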
20. X-Dimensional Learning Results
- X-DIM statistically better than TT, the degenerate algorithms, AND BL!
21. Statistically Significant Improvement
22. Future Directions
- Merge cross-dimensional learning (with dialogue act features) with the BL content-modeling HMM approach
- Explore other work in topic segmentation of dialogue
23. Summary
- Introduction to InfoMagnets
- Applications
- Need for topic segmentation
- Evaluation of other algorithms
- Novel algorithm using X-dimensional learning, with a statistically significant improvement
24. Q/A