Title: Data-driven Approaches for Information Structure Identification
1Data-driven Approaches for Information Structure
Identification
2Information Structure (IS)
- Division of the sentence into two parts
- Links the sentence to the discourse
- Advances the discourse (brings new information)
- Example:
- Rob needs to talk things out, and he certainly
isn't going to do that with Dick or Barry.
- So, he talks to HIMSELF instead.
- Not the given/new distinction.
3Why / Where is IS important?
- Realization of IS
- Prosody (English)
- Word order variation (Czech, German)
- Morphology (Japanese)
- Applications
- Text-to-speech systems
- Natural Language Generation
- Machine Translation
4Outline
- The Prague School Approach of IS
- Theory of Topic-Focus Articulation (TFA)
- Annotation of TFA
- Automatic Extraction of topic/focus
- Experimental setup
- Results
- Error analysis
5Topic-Focus Articulation
- Topic: what the sentence is about
- Focus: information asserted about the topic
- SentenceSem = Focus(Topic)
- Defined considering
- the Contextually-Bound/Non-Bound distinction
6Contextually Bound / Non-Bound
- Operational criterion based on question-answer
test
- CB (Contextually-Bound) are:
- Weak and zero pronouns
- Items in the answer which reproduce expressions
present (or associated to those present) in the
question
- NB (Non-Bound) are:
- The item corresponding to the wh-word
- Strong/stressed pronouns
7Example: CB vs. NB
- Rob needs to talk things out, and he certainly
isn't going to do that with Dick or Barry.
- So, he talks to HIMSELF instead.
- Question: So, whom does he talk to instead?
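The question-answer test can be approximated in a few lines. The sketch below is a simplification, not the annotation procedure itself: it assumes lemmas and pronoun strength are already known, it treats an answer item as CB only when its lemma literally occurs in the question (items merely "associated" with question expressions are not handled), and the names `Token` and `label_cb_nb` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    weak_pronoun: bool = False    # weak or zero pronoun (assumed to be given)
    strong_pronoun: bool = False  # strong/stressed pronoun (assumed to be given)


def label_cb_nb(question_lemmas, answer):
    """Return (form, 'CB' | 'NB') pairs for the tokens of the answer."""
    seen = {lemma.lower() for lemma in question_lemmas}
    labels = []
    for tok in answer:
        if tok.strong_pronoun:
            tag = "NB"   # strong/stressed pronouns are NB
        elif tok.weak_pronoun or tok.lemma.lower() in seen:
            tag = "CB"   # weak pronouns and items repeated from the question
        else:
            tag = "NB"   # e.g. the item that answers the wh-word
        labels.append((tok.form, tag))
    return labels


# Lemmatised question: "So, whom does he talk to instead?"
question = ["so", "who", "do", "he", "talk", "to", "instead"]
answer = [
    Token("So", "so"),
    Token("he", "he", weak_pronoun=True),
    Token("talks", "talk"),
    Token("to", "to"),
    Token("HIMSELF", "himself", strong_pronoun=True),
    Token("instead", "instead"),
]
result = label_cb_nb(question, answer)
```

With these inputs only HIMSELF comes out as NB; every other token of the answer is CB.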
8TFA Theoretical Definition
- Focus
- The main verb (V) and any child of V (and the
subordinated sub-tree) iff they are NB.
- If V and all its children are CB, then the NB
items subordinated to the children (and the
subordinated sub-trees).
- Topic
- All items not belonging to Focus cf. 1
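One way to read this definition as a procedure is sketched below. It assumes a dependency tree whose nodes already carry a CB/NB flag; `Node`, `subtree`, `nearest_nb` and `split_topic_focus` are hypothetical names. The second rule is interpreted here as "descend below the CB children until an NB node is found, then take that node with its whole sub-tree", which is one possible reading of the definition.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    cb: bool                      # True = contextually bound, False = non-bound
    children: list = field(default_factory=list)


def subtree(node):
    """The node itself plus everything it dominates."""
    nodes = [node]
    for child in node.children:
        nodes.extend(subtree(child))
    return nodes


def nearest_nb(node):
    """NB nodes strictly below `node`, each taken with its whole sub-tree."""
    found = []
    for child in node.children:
        if not child.cb:
            found.extend(subtree(child))
        else:
            found.extend(nearest_nb(child))
    return found


def split_topic_focus(verb):
    """Return (topic, focus) as lists of nodes for the tree rooted in `verb`."""
    focus = []
    if not verb.cb:                       # rule 1: V itself, if NB
        focus.append(verb)
    for child in verb.children:           # rule 1: NB children with sub-trees
        if not child.cb:
            focus.extend(subtree(child))
    if not focus:                         # rule 2: V and all its children are CB
        for child in verb.children:
            focus.extend(nearest_nb(child))
    focus_ids = {id(n) for n in focus}
    topic = [n for n in subtree(verb) if id(n) not in focus_ids]
    return topic, focus


# "So, he talks to HIMSELF instead." with an assumed flat tree under the verb.
verb = Node("talks to", cb=True, children=[
    Node("So", cb=True),
    Node("he", cb=True),
    Node("HIMSELF", cb=False),
    Node("instead", cb=True),
])
topic, focus = split_topic_focus(verb)
```

For this tree the focus is the single node HIMSELF and the topic is everything else.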
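9Example: Topic and Focus
- So, he talks to HIMSELF instead.
[Figure: dependency tree with the nodes "talks to", "he", "instead", "So" and "HIMSELF", each labeled CB or NB (four CB labels, one NB label)]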
10Roadmap
- The Prague School Approach of IS
- Theory of Topic-Focus Articulation (TFA)
- Annotation of TFA
- Automatic Extraction of topic/focus
- Experimental setup
- Results
- Error analysis
11Prague Dependency Treebank
- Three layers of annotation
- Morphological
- Analytical
- Syntactic trees containing each token of the
surface form (incl. punctuation marks)
- Main syntactic functions: SUBJ, OBJ, ...
- Tectogrammatical: deep structure of the sentence
- Only autosemantic words
- Recovered words (deleted on the surface)
- Detailed classification of functors: PAT, ACT,
ADDR, ...
- Topic-Focus Articulation (TFA)
12Annotation of TFA
- Marked on all nodes from the tectogrammatical
layer (50K sentences / 632K tecto-nodes).
- Three classes: t, c, f
- t: non-contrastive CB nodes
- c: contrastive CB nodes
- f: NB nodes
- Guidelines use notions of
- surface order
- intonation center (IC)
- systemic (canonical) order of dependents
13Annotators' Agreement (in %)
Veselá, Havelka & Hajičová, LREC 2004
14Roadmap
- The Prague School Approach of IS
- Theory of Topic-Focus Articulation (TFA)
- Annotation of TFA
- Automatic Extraction of topic/focus
- Experimental setup
- Results
- Error analysis
15Extraction of topic/focus
- Goal: label each tecto-node with t(opic) or
f(ocus)
- Steps:
- 1. Rule-based system (used as 2nd baseline)
- 2. Build models for different classifiers
- C4.5
- MaxEnt
- Ripper
- 3. Error Analysis
16Rule-based system 1/2
17Rule-based system 2/2
18Machine Learning Models
- Three different techniques
- Decision trees (C4.5)
- Rule induction (RIPPER)
- Maximum Entropy (MaxEnt)
- Use 2 classes of features
- Basic features (attributes from the treebank)
- Derived features (inspired by the annotation
guidelines)
19Basic Features
- nodetype -> complex, atom, ...
- functor -> ACT, PAT, ATT, ...
- coref -> true, false
- coreftype -> text, gram, NA
- afun -> Sbj, Obj, Pred, ...
- SUBPOS -> P1, PD, PE, NN, ...
- -- 23. The grammatemes: sempos, verbmod, aspect, ...
20Derived Features
- is_rightmost_dependent_of_the_verb
- is_rightside_dependent_of_the_verb
- is_rightside_dependent
- is_embedded_attribute
- has_repeated_lemma
- is_in_canonical_order
- is_weak_pronoun
- is_indexical_expression
- is_pronoun_with_general_meaning
- is_strong_pronoun_with_no_prep
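A few of these derived features can be illustrated with a short sketch. The definitions below are guesses based only on the feature names, not the definitions used in the experiments, and `TNode` and `derived_features` are hypothetical names. The sketch assumes each node knows its surface position and its parent, and that the lemmas of the preceding context are available as a set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TNode:
    lemma: str
    order: int                          # surface position in the sentence
    is_verb: bool = False
    parent: Optional["TNode"] = None


def derived_features(node, sentence, previous_lemmas):
    """Compute a handful of boolean features for one node (assumed definitions)."""
    parent = node.parent
    right_of_parent = parent is not None and node.order > parent.order
    verb_parent = parent is not None and parent.is_verb
    siblings = [n for n in sentence if parent is not None and n.parent is parent]
    rightmost = right_of_parent and node.order == max(n.order for n in siblings)
    return {
        "is_rightside_dependent": right_of_parent,
        "is_rightside_dependent_of_the_verb": right_of_parent and verb_parent,
        "is_rightmost_dependent_of_the_verb": rightmost and verb_parent,
        "has_repeated_lemma": node.lemma in previous_lemmas,
    }


# Toy nodes for "he talks to HIMSELF instead" (tree shape is an assumption).
verb = TNode("talk", order=3, is_verb=True)
he = TNode("he", order=2, parent=verb)
himself = TNode("himself", order=4, parent=verb)
instead = TNode("instead", order=5, parent=verb)
sentence = [verb, he, himself, instead]
previous = {"Rob", "need", "talk", "he", "Dick", "Barry"}

feats_instead = derived_features(instead, sentence, previous)
feats_he = derived_features(he, sentence, previous)
```

Each node then yields a dictionary of boolean values that can be appended to the basic treebank attributes before training a classifier.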
21Data statistics
- 3,168 files
- 49,442 sentences
- Instances: 621,991
- Training set: 494,756 (78.3%)
- Development set: 66,711 (10.5%)
- Test set: 70,323 (11.2%)
- TFA distribution on the training set:
- 63.11% f
- 36.89% t
22Evaluation
- Metric: correctly classified instances
- Baseline: assigns the class that has the most
instances (f)
- Second baseline: the rule-based system
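The metric and the first baseline are simple enough to sketch directly. The code below uses toy label lists, not the treebank data; `accuracy` and `majority_baseline` are hypothetical helper names.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Share of instances whose predicted label matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def majority_baseline(train_labels):
    """Return a predictor that always answers with the most frequent training label."""
    label, _ = Counter(train_labels).most_common(1)[0]
    return lambda instances: [label] * len(instances)


# Toy training labels, roughly mirroring a 63% f / 37% t distribution.
train = ["f"] * 63 + ["t"] * 37
predict = majority_baseline(train)

dev_gold = ["f", "t", "f", "f", "t"]
score = accuracy(dev_gold, predict(dev_gold))
```

On data with the same class distribution as the training set, such a baseline scores as high as the share of the majority class, which is the figure any learned model has to beat.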
23Results
24Error analysis
New contexts in development data: 2,043 (2,125
instances)
25Naïve Predictor
New contexts in test data -> f
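A lookup-style predictor of this kind could look as follows. This is only a sketch under assumptions: the slides do not define a "context", so it is taken here to be any hashable tuple of feature values, and the predictor is assumed to return the label seen most often with that context in training. The only behaviour taken from the slide is the fallback to f for contexts never seen in training. `train_naive` and `predict_naive` are hypothetical names.

```python
from collections import Counter, defaultdict


def train_naive(instances):
    """instances: iterable of (context, label); a context is any hashable value."""
    counts = defaultdict(Counter)
    for context, label in instances:
        counts[context][label] += 1
    # Keep only the most frequent label for each context seen in training.
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def predict_naive(table, context, default="f"):
    """Known contexts get their stored label; new contexts fall back to f."""
    return table.get(context, default)


# Toy training data; the feature tuples are illustrative only.
training = [
    (("ACT", "weak_pronoun"), "t"),
    (("ACT", "weak_pronoun"), "t"),
    (("ACT", "weak_pronoun"), "f"),
    (("PAT", "noun"), "f"),
]
table = train_naive(training)
seen = predict_naive(table, ("ACT", "weak_pronoun"))
unseen = predict_naive(table, ("ADDR", "noun"))
```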
26Naïve Predictor evaluation
χ² = 26.3, p < 0.001
χ² = 30.7, p < 0.001
27Learning curves
[Figure: learning curves for C4.5, Naïve predictor, MaxEnt and RIPPER]
28Conclusions
- Information Structure can be recovered using
mostly syntactic features.
- Further improvement is more likely to come from
introducing more features than from providing more
annotated data.
- Future research: transferring IS from Czech to
English through word alignment.
29Thank you!
30More examples: Topic and Focus
[Figure: further example trees with nodes labeled CB or NB; one label reads "Proxy foci"]