Title: Chunking
1Chunking
2Chunking
- Chunking is an efficient and robust method for identifying short phrases in text (chunks).
- Chunks are non-overlapping spans of text, containing
  - a head word (noun, proper name, adjective, ...)
  - adjacent modifiers and function words (adjective, determiner, preposition, ...)
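The definition of chunks as non-overlapping spans built from a head word plus adjacent modifiers can be illustrated with a minimal sketch. The tag names (`DET`, `ADJ`, `NOUN`, `PROPN`) and the function `noun_chunks` are assumptions made for this illustration, not part of any system described on the slides.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag names are assumed for illustration

MODIFIER_TAGS = {"DET", "ADJ"}    # adjacent modifiers and function words
HEAD_TAGS = {"NOUN", "PROPN"}     # possible head words of a noun chunk

def noun_chunks(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) spans, end exclusive."""
    spans = []
    i, n = 0, len(tokens)
    while i < n:
        start = i
        # Consume the modifiers that precede a potential head.
        while i < n and tokens[i][1] in MODIFIER_TAGS:
            i += 1
        if i < n and tokens[i][1] in HEAD_TAGS:
            i += 1                      # include the head word
            spans.append((start, i))
        elif i == start:
            i += 1                      # token is outside any noun chunk
        # otherwise: modifiers without a head, already skipped
    return spans

tagged = [("I", "PRON"), ("read", "VERB"), ("it", "PRON"), ("a", "DET"),
          ("chunk", "NOUN"), ("at", "PREP"), ("a", "DET"), ("time", "NOUN")]
print(noun_chunks(tagged))  # [(3, 5), (6, 8)]
```

Because the scan moves strictly left to right and never revisits a token, the resulting spans cannot overlap.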
3Example
- I begin with an intuition: when I read a sentence, I read it a chunk at a time.
4Motivation
- Locate information
  - Extract noun phrases: indexing
  - Extract NPs and verbs: information extraction
- Ignore information
  - Acquisition of subcategorization information
  - gave NP, gave up NP in NP, gave NP up, gave NP NP, gave NP to NP
5Definition of Chunks (Abney93)
- major head: a content word not between a function word f and the word selected by f
- root: highest node with the major head as semantic head
- chunk: maximal string containing a major head, dominated by the root, not contained in another chunk
6Problem: empty categories
- The underlying grammar must assume
  - null determiners: "Ø poor people" forms a noun chunk
  - and empty nouns: "the poor Ø" forms a noun chunk
7Problem: Center-Embedding
- Function word and semantic head may be separated by other noun chunks.
  - der/den mehrere Milliarden Euro hohen Schäden (the damages amounting to several billion euros)
  - in John's house
- Base noun chunks may be ungrammatical.
  - die im Alter nachlassenden Kräfte (the strength diminishing in old age)
8Problems with Coordination
- if chunks may be multi-headed
- if conjunctions are excluded from chunks
9System Overview
10Base Noun Chunks vs. Full Noun Chunks
- Base noun chunk: maximal string dominated by the root, containing a noun as major head (Abney 1993)
- Full noun chunk: part of an NP between the determiner and the (first) head noun (Schmid and Schulte im Walde 2000)
  - includes names: the discoverer Christopher Columbus
  - but not coordinated NPs: parts of Scotland and Northern Ireland
  - and not appositions: Christopher Columbus, the famous discoverer,
11Negative Definition of Full Noun Chunks
- NP/PP stripped of
  - adverbials at the front and
  - PPs and relative clauses at the back (Brants, 1999)
- coordinations and appositions
  - ? parts ? of Scotland and Northern Ireland
- pre- and postnominal genitives
  - Marias Version der Geschichte (Maria's version of the story)
- measure phrases
  - 20 Dollar Strafe (a 20-dollar fine)
12Disambiguation in the recognition of full noun chunks
- POS ambiguities are resolved by the POS tagger.
- PP attachment ambiguities are kept underspecified (with respect to the positive and negative definitions of full noun chunk).
- All other ambiguities are resolved using the longest-match criterion (Abney, 1993).
  - Chunks should be as long as possible.
  - But approaches that deal with them explicitly, by underspecification, or by non-monotonicity exist.
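The longest-match criterion can be sketched as follows: at each position, every pattern is tried and the longest one that matches wins, after which the scan continues behind the match. The pattern inventory `PATTERNS` and the tag names are hypothetical and chosen only for illustration. A real chunker would compile such patterns into a finite-state transducer instead of looping over them.

```python
from typing import List, Sequence, Tuple

# Hypothetical chunk patterns over POS tags (assumed for illustration).
PATTERNS: List[Tuple[str, ...]] = [
    ("NOUN",),
    ("DET", "NOUN"),
    ("ADJ", "NOUN"),
    ("DET", "ADJ", "NOUN"),
]

def longest_match_chunks(tags: Sequence[str]) -> List[Tuple[int, int]]:
    """Deterministic left-to-right scan that prefers the longest pattern."""
    spans = []
    i = 0
    while i < len(tags):
        best = 0
        for pattern in PATTERNS:
            k = len(pattern)
            # Keep the pattern only if it matches here and is longer than the best so far.
            if k > best and tuple(tags[i:i + k]) == pattern:
                best = k
        if best:
            spans.append((i, i + best))
            i += best          # continue after the chunk: no backtracking
        else:
            i += 1             # no chunk starts here
    return spans

print(longest_match_chunks(["DET", "ADJ", "NOUN", "VERB", "NOUN"]))
# [(0, 3), (4, 5)]
```

Since the scan never backtracks, each token is examined a bounded number of times, which keeps the procedure deterministic and linear in the sentence length for a fixed pattern set.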
13Recognizing Full Noun Chunks: explicit representation of ambiguities
- used in previous work on full noun chunking (Brants99, Schmid and Schulte im Walde00, Kermes and Evert02)
- drawback: requires search
  - Parser is not deterministic any longer.
  - Linear complexity is lost.
14Recognizing Full Noun Chunks: dealing with ambiguities by non-monotonic cascades
- A method retaining determinism and linear complexity (Schiehlen02)
  - recognize base noun chunks that could form the beginning, middle, or end of a full noun chunk
  - discard those noun chunks (monotonicity lost!)
  - re-apply the original noun chunk transducer
15Recognizing Recursive NPs by Non-Monotonic Cascades
16Recognizing Recursive NPs by Underspecification
- die/0 Ende/1 der/2 Woche/3 geplanten/4 Treffen/5 (the meetings planned for the end of the week)
- Underspecified representation (à la Spranger05)
  - ⟨NP, {0,1,2,3,4,5}, {1,3,5}, {1,3,5}⟩
- Disambiguation algorithm
- Problem: needs preprocessing of the set of NPs
17Case Checking
18Three Approaches to Agreement Checking in FS Parsers
- add agreement information to POS tags and compile the grammar out (drawback: explosion of the transition table)
- postpone the agreement check until after chunk recognition (Abney, 1997)
- interleave agreement checking with chunking (Neumann et al., 2000); problems with subcategorizing multi-word expressions
  - um Gottes willen (for God's sake)
  - um takes accusative, um ... willen takes genitive!
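The drawback of the first approach (compiling agreement into the POS tags) can be made concrete with a small counting sketch. It assumes four cases, three genders and two numbers, and treats every combination as a separate tag suffix, which is a simplification. The function `compile_out` and the tag names are hypothetical.

```python
from itertools import product

CASES = ("nom", "acc", "dat", "gen")
GENDERS = ("masc", "fem", "neut")
NUMBERS = ("sg", "pl")

def compile_out(rule):
    """Expand one agreement-sensitive rule into one variant per feature combination."""
    variants = []
    for case, gender, number in product(CASES, GENDERS, NUMBERS):
        suffix = f".{case}.{gender}.{number}"
        # All tags in the rule must carry the same agreement features.
        variants.append(tuple(tag + suffix for tag in rule))
    return variants

variants = compile_out(("DET", "ADJ", "NOUN"))
print(len(variants))  # 24
print(variants[0])    # ('DET.nom.masc.sg', 'ADJ.nom.masc.sg', 'NOUN.nom.masc.sg')
```

A single rule thus turns into 24 rules, and every further agreement-sensitive rule multiplies in the same way, which is where the growth of the transition table comes from.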
19Online Agreement Checking
- errors avoided
  - genitives (case mismatch)
    - in John's house
  - conjunction attachment (case mismatch)
    - das Leben von Schauspielern und Zirkusleuten
    - the life(nom/acc) of actors and circus people(dat)
  - adjacent NPs (adjective declension)
    - diese beiden ähnliche Erfolge
    - those two(weak) similar(strong) successes
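The conjunction attachment example can be sketched as an intersection of case sets: a conjunct may only attach to a noun whose possible cases overlap with its own. The mini-lexicon `LEXICON` and the helper names are hypothetical. The case sets follow the glosses on the slide.

```python
from typing import Dict, FrozenSet, List

Cases = FrozenSet[str]

# Hypothetical mini-lexicon; case sets follow the glosses on the slide.
LEXICON: Dict[str, Cases] = {
    "das Leben": frozenset({"nom", "acc"}),
    "Schauspielern": frozenset({"dat"}),
    "Zirkusleuten": frozenset({"dat"}),
}

def agree(*case_sets: Cases) -> Cases:
    """Intersect case sets; an empty result signals a case mismatch."""
    return frozenset.intersection(*case_sets)

def attachment_sites(conjunct: str, candidates: List[str]) -> List[str]:
    """Keep only the candidates whose case is compatible with the conjunct."""
    return [c for c in candidates if agree(LEXICON[c], LEXICON[conjunct])]

print(attachment_sites("Zirkusleuten", ["das Leben", "Schauspielern"]))
# ['Schauspielern']
```

Attaching "und Zirkusleuten" to "das Leben" is ruled out because the intersection of {nom, acc} and {dat} is empty, so only the coordination with "Schauspielern" survives.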
20Online Agreement Checking
- Some grammar errors become visible only with agreement checking.
- N coordination is missing.
  - die nachlassenden Kräfte (the diminishing strength)
  - die Verletzungen und nachlassenden Kräfte (the injuries and diminishing strength): no noun chunk!
21Experiment by Schiehlen02
- Writing a finite-state grammar is worth the effort: the FS method performs better than the statistical method.
- The noun chunker is not very good at determining POS tags.
- Online agreement checking improves performance.
- Shortest match is better than longest match for conjunction attachment.
22Case checking Readings
23Underspecification Context Variables
- Udo/0 1aNPnom,1bNPakk
- kennt/1
- eine/2
- nette/4
- Frau/5 1aNPakk,1bNPnom
- aus/6
- Rio/7 ADJ 1A5
- ADJ 1A1
24References
- Schiehlen02. Experiments in German Noun Chunking. COLING 2002, Taipei, August 27th, 2002.
- Schiehlen03. A Cascaded Finite-State Parser for German. EACL 2003, Budapest, April 17th, 2003.
- Spranger05. Combining deterministic processing with ambiguity awareness: the case of German quantifying noun groups. PhD Thesis. University of Stuttgart.