Data-driven Approaches for Information Structure Identification

Slides: 31

Provided by: oanapos

Transcript and Presenter's Notes

1
Data-driven Approaches for Information Structure
Identification

HLT/EMNLP Oct 6, 2005

2
Information Structure (IS)

Division of the sentence into two parts:
One part links the sentence to the discourse
The other part advances the discourse (brings new information)
Example:
Rob needs to talk things out, and he certainly isn't going to do that with Dick or Barry.
So, he talks to HIMSELF instead.
Not the same as the given/new distinction.

3
Why / Where is IS important?

Realization of IS
Prosody (English)
Word order variation (Czech, German)
Morphology (Japanese)
Applications
Text-to-speech systems
Natural Language Generation
Machine Translation

4
Outline

The Prague School Approach of IS
Theory of Topic-Focus Articulation (TFA)
Annotation of TFA
Automatic Extraction of topic focus
Experimental setup
Results
Error analysis

5
Topic-Focus Articulation

Topic: what the sentence is about
Focus: the information asserted about the topic
SentenceSem = Focus(Topic)
Defined in terms of the Contextually-Bound / Non-Bound distinction

6
Contextually Bound / Non-Bound

Operational criterion based on the question-answer test
CB (Contextually-Bound) items are:
Weak and zero pronouns
Items in the answer which reproduce expressions present (or associated with those present) in the question
NB (Non-Bound) items are:
The item corresponding to the wh-word
Strong/stressed pronouns

7
Example CB vs. NB

Rob needs to talk things out, and he certainly isn't going to do that with Dick or Barry.
So, he talks to HIMSELF instead.

So, whom does he talk to instead?
8
TFA Theoretical Definition

Focus:
The main verb (V) and any child of V (with its subordinated sub-tree) iff they are NB.
If V and all its children are CB, then the NB items subordinated to the children (with their subordinated sub-trees).
Topic:
All items not belonging to the Focus (cf. 1)
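
The definition above can be sketched in code. This is a minimal illustration, not the authors' implementation: the `Node` class, the function names, and the choice to search recursively for the nearest NB descendants in the second clause are all assumptions made for the example. The tree is the one for "So, he talks to HIMSELF instead."

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tecto-node: a word form, a CB/NB label, and its children."""
    form: str
    bound: str  # "CB" or "NB"
    children: list = field(default_factory=list)


def subtree(node):
    """Yield the node followed by everything subordinated to it."""
    yield node
    for child in node.children:
        yield from subtree(child)


def nb_below(node):
    """Collect the nearest NB descendants of a CB node, with their sub-trees."""
    found = []
    for child in node.children:
        if child.bound == "NB":
            found.extend(subtree(child))
        else:
            found.extend(nb_below(child))
    return found


def focus(verb):
    """Return the Focus nodes for a sentence rooted at its main verb."""
    result = []
    if verb.bound == "NB":
        result.append(verb)
    for child in verb.children:
        if child.bound == "NB":
            result.extend(subtree(child))
    if result:
        return result
    # V and all its children are CB: look for NB items further down.
    for child in verb.children:
        result.extend(nb_below(child))
    return result


sentence = Node("talks to", "CB", [
    Node("So", "CB"),
    Node("he", "CB"),
    Node("HIMSELF", "NB"),
    Node("instead", "CB"),
])

focus_nodes = focus(sentence)
# Topic is everything that is not in the Focus (compared by identity).
topic_nodes = [n for n in subtree(sentence)
               if not any(n is f for f in focus_nodes)]
focus_forms = [n.form for n in focus_nodes]
topic_forms = [n.form for n in topic_nodes]
```

Here `focus_forms` contains only "HIMSELF", and every other node falls into the Topic.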

9
Example Topic Focus

So, he talks to HIMSELF instead.

[Figure: dependency tree rooted at "talks to" (CB), with dependents "So" (CB), "he" (CB), "HIMSELF" (NB), and "instead" (CB).]
10
Roadmap

The Prague School Approach of IS
Theory of Topic-Focus Articulation (TFA)
Annotation of TFA
Automatic Extraction of topic focus
Experimental setup
Results
Error analysis

11
Prague Dependency Treebank

Three layers of annotation:
Morphological
Analytical
  Syntactic trees containing each token of the surface form (incl. punctuation marks)
  Main syntactic functions: SUBJ, OBJ, ...
Tectogrammatical: deep structure of the sentence
  Only autosemantic words
  Recovered words (deleted on the surface)
  Detailed classification of functors: PAT, ACT, ADDR, ...
  Topic-Focus Articulation (TFA)

12
Annotation of TFA

Marked on all nodes of the tectogrammatical layer (50K sentences / 632K tecto-nodes).
Three classes: t, c, f
t: non-contrastive CB nodes
c: contrastive CB nodes
f: NB nodes
Guidelines use the notions of:
surface order
intonation center (IC)
systemic (canonical) order of dependents

13
Annotators' Agreement (in %)
Veselá, Havelka & Hajičová, LREC 2004
14
Roadmap

The Prague School Approach of IS
Theory of Topic-Focus Articulation (TFA)
Annotation of TFA
Automatic Extraction of topic focus
Experimental setup
Results
Error analysis

15
Extraction of topic focus

Goal: label each tecto-node with t(opic) or f(ocus)
Steps:
1. Rule-based system (used as 2nd baseline)
2. Build models for different classifiers:
C4.5
MaxEnt
Ripper
3. Error analysis

16
Rule-based system 1/2
17
Rule-based system 2/2
18
Machine Learning Models

Three different techniques:
Decision trees (C4.5)
Rule induction (RIPPER)
Maximum Entropy (MaxEnt)
Two classes of features:
Basic features (attributes from the treebank)
Derived features (inspired by the annotation guidelines)

19
Basic Features

nodetype -> complex, atom, ...
functor -> ACT, PAT, ATT, ...
coref -> true, false
coreftype -> text, gram, NA
afun -> Sbj, Obj, Pred, ...
SUBPOS -> P1, PD, PE, NN, ...
-- 23. The grammatemes: sempos, verbmod, aspect, ...

20
Derived Features

is_rightmost_dependent_of_the_verb
is_rightside_dependent_of_the_verb
is_rightside_dependent
is_embedded_attribute
has_repeated_lemma
is_in_canonical_order
is_weak_pronoun
is_indexical_expression
is_pronoun_with_general_meaning
is_strong_pronoun_with_no_prep
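
A few of these derived features can be sketched as simple predicates over a dependency node. The slides do not give the exact definitions, so the dictionary layout (`lemma`, `position`, `is_verb`), the function signatures, and the idea that `has_repeated_lemma` checks lemmas from the preceding text are assumptions made for illustration only.

```python
def is_rightside_dependent_of_the_verb(node, parent):
    """True if the parent is a verb and the node follows it on the surface."""
    return bool(parent is not None
                and parent["is_verb"]
                and node["position"] > parent["position"])


def is_rightmost_dependent_of_the_verb(node, parent, siblings):
    """True if the node is the last child (in surface order) of a verbal parent."""
    if parent is None or not parent["is_verb"]:
        return False
    return node["position"] == max(s["position"] for s in siblings)


def has_repeated_lemma(node, previous_lemmas):
    """True if the node's lemma already occurred in the preceding text (assumed scope)."""
    return node["lemma"] in previous_lemmas


# "So, he talks to HIMSELF instead." with hypothetical surface positions.
verb = {"lemma": "talk", "position": 3, "is_verb": True}
children = [
    {"lemma": "so", "position": 1, "is_verb": False},
    {"lemma": "he", "position": 2, "is_verb": False},
    {"lemma": "himself", "position": 5, "is_verb": False},
    {"lemma": "instead", "position": 6, "is_verb": False},
]
# Lemmas from the preceding sentence of the running example.
previous_lemmas = {"Rob", "need", "talk", "thing", "he", "Dick", "Barry"}

features = {
    child["lemma"]: {
        "is_rightside_dependent_of_the_verb":
            is_rightside_dependent_of_the_verb(child, verb),
        "is_rightmost_dependent_of_the_verb":
            is_rightmost_dependent_of_the_verb(child, verb, children),
        "has_repeated_lemma":
            has_repeated_lemma(child, previous_lemmas),
    }
    for child in children
}
```

Each node ends up with a small boolean feature vector that a classifier such as a decision tree could consume alongside the basic treebank attributes.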

21
Data statistics

3,168 files
49,442 sentences
Instances: 621,991
Training set: 494,756 (78.3%)
Development set: 66,711 (10.5%)
Test set: 70,323 (11.2%)
TFA distribution on the training set:
63.11% f
36.89% t

22
Evaluation

Metric: correctly classified instances
Baseline: assigns the class that has the most instances (f)
Second baseline: the rule-based system
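
The first baseline and the metric can be sketched as follows. This is a toy illustration with made-up label lists, not the treebank data; the function name `majority_baseline` is hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels, eval_labels):
    """Pick the most frequent training label and score it on the evaluation labels."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in eval_labels if label == majority)
    return majority, correct / len(eval_labels)


# Toy data only: roughly mirrors an f-heavy label distribution.
train_labels = ["f"] * 63 + ["t"] * 37
eval_labels = ["f", "t", "f", "f"]

majority_label, accuracy = majority_baseline(train_labels, eval_labels)
```

On this toy input the baseline always predicts f and gets three of the four evaluation instances right.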

23
Results
24
Error analysis
New contexts in development data: 2,043 (2,125 instances)
25
Naïve Predictor
New contexts in test data -> f
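
One way to read this slide is as a lookup-table predictor. The sketch below assumes that a "context" is the tuple of feature values of an instance and that the predictor returns the most frequent training label for a seen context; only the back-off to f for new contexts is stated on the slide, and all names here are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive(instances):
    """Map each context seen in training to its most frequent label."""
    counts = defaultdict(Counter)
    for context, label in instances:
        counts[context][label] += 1
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def predict(model, context, default="f"):
    """Look up a context; contexts never seen in training get the default label f."""
    return model.get(context, default)


# Toy training instances: (context, label), where a context is a feature tuple.
training = [
    (("ACT", "Sbj", True), "t"),
    (("ACT", "Sbj", True), "t"),
    (("ACT", "Sbj", True), "f"),
    (("PAT", "Obj", False), "f"),
]
model = train_naive(training)

seen = predict(model, ("ACT", "Sbj", True))
unseen = predict(model, ("ADDR", "Obj", False))
```

The seen context gets its majority label t, while the unseen context falls back to f.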
26
Naïve Predictor evaluation
χ² = 26.3, p < 0.001
χ² = 30.7, p < 0.001
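
As a quick sanity check on the significance levels, a chi-square statistic can be converted to a p-value. The slide does not say which test was used or how many degrees of freedom it had; the sketch below assumes one degree of freedom, for which the upper-tail probability has the closed form erfc(sqrt(x / 2)).

```python
import math


def chi2_p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic, assuming 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# The two statistics reported on the slide.
p_first = chi2_p_value_df1(26.3)
p_second = chi2_p_value_df1(30.7)
```

Under that assumption both values fall well below 0.001, consistent with the slide.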
27
Learning curves
[Figure: learning curves for C4.5, the naïve predictor, MaxEnt, and RIPPER.]
28
Conclusions

Information Structure can be recovered using mostly syntactic features.
Further improvement is more likely to come from introducing more features than from providing more annotated data.
Future research: transferring IS from Czech to English through word alignment.