Title: Chunk Parsing
2. Chunk Parsing
- Also called chunking, light parsing, or partial parsing.
- Method: assigns some additional structure to the input beyond tagging.
- Used when full parsing is not feasible or not desirable.
- Because of the expense of full parsing, often treated as a stop-gap solution.
3. Chunk Parsing
- No rich hierarchy, as in parsing.
- Usually one layer above tagging.
- The process (a minimal sketch follows):
  - Tokenize
  - Tag
  - Chunk
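- Below is a minimal sketch of this pipeline using NLTK (the toolkit behind Bird and Loper's examples); the NP grammar is illustrative, not from the slides, and it assumes the standard NLTK tokenizer and tagger models have been downloaded.

    # Tokenize -> tag -> chunk: one layer of structure above tagging.
    import nltk

    sentence = "The cow in the barn ate"
    tokens = nltk.word_tokenize(sentence)   # 1. tokenize
    tagged = nltk.pos_tag(tokens)           # 2. tag
    # An illustrative NP rule: optional determiner, adjectives, noun.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
    print(chunker.parse(tagged))            # 3. chunk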
4. Chunk Parsing
- Like tokenizing and tagging in a few respects:
  - Can skip over material in the input.
  - Often finite-state (or finite-state-like) methods are used (applied over tags).
  - Often application-specific (i.e., the chunks tagged have uses for particular applications).
5. Chunk Parsing
- Chief motivations: to find data or to ignore data.
- Example from Bird and Loper: find the argument structures for the verb give.
- Can discover significant grammatical structures before developing a grammar (a sketch follows the list):
  - gave NP
  - gave up NP in NP
  - gave NP up
  - gave NP help
  - gave NP to NP
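- As a hedged sketch of that idea: given chunked output, the frames for gave fall out of simple pattern collection. The chunked sentences below are hand-made stand-ins for real corpus output.

    # Collect the patterns that follow "gave" in chunked text.
    chunked = [
        ["gave", ("NP", "the book")],
        ["gave", "up", ("NP", "the fight"), "in", ("NP", "the end")],
        ["gave", ("NP", "the dog"), "up"],
        ["gave", ("NP", "them"), "to", ("NP", "the library")],
    ]

    frames = set()
    for sent in chunked:
        # Reduce each chunk to its label; keep unchunked words as-is.
        frames.add(" ".join(t if isinstance(t, str) else t[0] for t in sent))

    for frame in sorted(frames):
        print(frame)   # gave NP / gave NP up / gave NP to NP / gave up NP in NP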
6. Chunk Parsing
- Like parsing, except:
  - It is not exhaustive, and doesn't pretend to be.
  - Structures and data can be skipped when not convenient or not desired.
  - Structures of fixed depth are produced.
- Nested structures are typical in parsing:
  - [S [NP The cow [PP in [NP the barn]]] ate]
- Not in chunking:
  - [NP The cow] in [NP the barn] ate
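- The contrast can be made concrete with NLTK's Tree notation (purely illustrative):

    from nltk import Tree

    # Parsing: recursive, arbitrarily deep structure.
    nested = Tree.fromstring("(S (NP The cow (PP in (NP the barn))) ate)")
    # Chunking: fixed depth; chunks sit one layer above the words.
    flat = Tree.fromstring("(S (NP The cow) in (NP the barn) ate)")

    print(nested.height())   # deeper (here 5)
    print(flat.height())     # shallow (here 3)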
7. Chunk Parsing
- Finds contiguous, non-overlapping spans of related text and groups them into chunks.
- Because contiguity is given, finite-state methods can be adapted to chunking (sketched below).
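- A sketch of that adaptation: because chunks are contiguous and flat, a chunker can be a regular expression over the tag sequence alone. The tag string below is hand-supplied.

    import re

    tags = "DT NN IN DT NN VBD"        # "The cow in the barn ate"
    np = re.compile(r"DT (JJ )*NN")    # a finite-state NP pattern over tags

    for m in np.finditer(tags):
        print(m.span(), m.group())     # two flat NP chunks, nothing nested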
8. Longest Match
- Abney (1995) discusses the longest-match heuristic:
  - One automaton for each phrasal category.
  - Start the automata at position i (where i = 0 initially).
  - The winner is the automaton with the longest match.
9. Longest Match
- He took chunk rules from the PTB:
  - NP → D N
  - NP → D Adj N
  - VP → V
- Encoded each rule as an automaton (sketched below).
- Stored the longest matching pattern (the winner).
- If no match for a given word, skipped it (in other words, didn't chunk it).
- Results: precision .92, recall .88.
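- A sketch of the heuristic with these rules, written as regexes over tag sequences rather than explicit automata; the skip-on-no-match behavior is as described above.

    import re

    RULES = [
        ("NP", re.compile(r"D( Adj)* N")),   # covers NP -> D N and NP -> D Adj N
        ("VP", re.compile(r"V")),
    ]

    def chunk(tags):
        chunks, i = [], 0
        while i < len(tags):
            best = None
            for label, pat in RULES:
                m = pat.match(" ".join(tags[i:]))
                if m and (best is None or m.end() > best[1]):
                    best = (label, m.end(), m.group())   # longest match wins
            if best:
                n = len(best[2].split())
                chunks.append((best[0], tags[i:i + n]))
                i += n                  # advance past the winner
            else:
                i += 1                  # no automaton matched: skip the word
        return chunks

    print(chunk(["D", "Adj", "N", "P", "D", "N", "V"]))
    # [('NP', ['D', 'Adj', 'N']), ('NP', ['D', 'N']), ('VP', ['V'])]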
10. An Application
- Data-Driven Linguistics Ontology Development (NSF BCE-0411348).
- One focus: locate linguistically annotated (read: tagged) text and extract linguistically relevant terms from it.
- Attempt to discover the meaning of the terms.
- Intended to build out the content of the ontology (GOLD).
- Focus on Interlinear Glossed Text (IGT).
11. An Application
- Interlinear Glossed Text (IGT), an example:

  (1) Afisi   a-na-ph-a        nsomba
      hyenas  SP-PST-kill-ASP  fish
      'The hyenas killed the fish.' (Baker 1988:254)
12. An Application
- More examples (one way to represent such data in code follows):

  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
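- One convenient representation of an IGT instance is as aligned tiers; the field names below are my own, not a community standard.

    import re

    igt = {
        "text":  ["yerexa-n", "p'at'uhan-e", "bats-ets"],
        "gloss": ["child-NOM", "window-ACC", "open-AOR.3SG"],
        "trans": "The child opened the window.",
    }

    # The gloss tier aligns word-for-word with the text tier, so each
    # gram (NOM, ACC, AOR, 3SG) can be tied to its host word.
    for word, gloss in zip(igt["text"], igt["gloss"]):
        grams = re.split(r"[-.]", gloss)[1:]
        print(word, "->", grams)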
13. An Application
  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
- Problem: how do we discover the meaning of the linguistically salient terms, such as NOM, ACC, AOR, and 3SG?
- Perhaps we can discover the meanings by examining the contexts in which they occur.
- POS can be a context.
- Problem: POS tags are rarely used in IGT.
- How do you assign POS tags to a language you know nothing about?
- IGT gives us aligned text for free!
14. An Application
  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
          DT  NN    VBP    DT  NN
- IGT gives us aligned text for free!
- POS tag the English translation.
- Align it with the glosses and the language data (a sketch follows).
- That helps. We now know that NOM and ACC attach to nouns, not verbs (nominal inflections).
- And AOR and 3SG attach to verbs (verbal inflections).
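- A sketch of that projection step; the word alignment below is hand-supplied for the example, since deriving gloss-to-translation alignments automatically takes more work.

    # English translation, POS tagged as on the slide.
    translation = [("The", "DT"), ("child", "NN"), ("opened", "VBP"),
                   ("the", "DT"), ("window", "NN")]
    gloss = ["child-NOM", "window-ACC", "open-AOR.3SG"]

    # gloss index -> translation index (content words only)
    alignment = {0: 1, 1: 4, 2: 2}

    for g_i, t_i in alignment.items():
        word, tag = translation[t_i]
        print(gloss[g_i], "<-", tag)
    # NOM and ACC land on NN hosts (nominal); AOR and 3SG on a verb (VBP).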
15. An Application
  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
          DT  NN    VBP    DT  NN
- In the LaPolla example, we know that NOM does not attach to nouns, but to verbs. It must be some other kind of NOM.
16. An Application
  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
          DT  NN    VBP    DT  NN
- How we tagged (a sketch follows):
  - Globally applied the most frequent tags (a "stupid tagger").
  - Repaired tags where context dictated a change (e.g., TO preceding race → VB).
  - Technique similar to Brill (1995).
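- A sketch of the two-step tagging; the tiny lexicon and the single repair rule are illustrative only.

    MOST_FREQUENT = {"the": "DT", "child": "NN", "race": "NN",
                     "to": "TO", "opened": "VBD"}

    def stupid_tag(words):
        # "Stupid tagger": every word gets its globally most frequent tag.
        return [(w, MOST_FREQUENT.get(w.lower(), "NN")) for w in words]

    def repair(tagged):
        # Contextual repair in the spirit of Brill (1995):
        # after TO, "race" is a verb, not a noun.
        out = list(tagged)
        for i in range(1, len(out)):
            word, tag = out[i]
            if out[i - 1][1] == "TO" and word == "race" and tag == "NN":
                out[i] = (word, "VB")
        return out

    print(repair(stupid_tag(["to", "race"])))   # [('to', 'TO'), ('race', 'VB')]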
17. An Application
  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
          DT  NN    VBP    DT  NN
- But can we get more information about NOM, ACC, etc.?
- Can chunking tell us something more about these terms?
- Yes!
18. An Application
  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
          DT  NN    VBP    DT  NN
- Chunk phrases, mainly NPs.
- The relationship (in simple sentences) between NPs and verbs tells us something about the verb's arguments (Bird and Loper 2005).
- We can tap this information to discover more about the linguistic tags.
19. An Application
  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
          DT  NN    VBP    DT  NN
         [NP The child] [VP opened] [NP the window]
- Apply Abney's (1995) longest-match heuristic to get as many chunks as possible (especially NPs).
- Leverage English canonical SVO (NVN) order to identify simple argument structures (sketched below).
- Use these to discover more information about the terms.
- Thus...
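- A sketch of reading argument structure off the chunked translation; the chunks and the gloss lookup table below are hand-supplied from the example.

    chunks = [("NP", ["The", "child"]), ("VP", ["opened"]),
              ("NP", ["the", "window"])]
    gloss = {"child": "NOM", "window": "ACC"}   # from the alignment step

    # Canonical English SVO: an NP VP NP sequence gives us the slots.
    if [label for label, _ in chunks] == ["NP", "VP", "NP"]:
        subj, _, obj = chunks
        print("subject gram:", gloss.get(subj[1][-1].lower()))   # NOM
        print("object gram:",  gloss.get(obj[1][-1].lower()))    # ACC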
20. An Application
  (4) a. yerexa-n   p'at'uhan-e  bats-ets
         child-NOM  window-ACC   open-AOR.3SG
         'The child opened the window.' (Megerdoomian ??)
          DT  NN    VBP    DT  NN
         [NP The child] [VP opened] [NP the window]
- We know that:
  - NOM attaches to subject NPs → may be a case marker indicating subject.
  - ACC attaches to object NPs → may be a case marker indicating object.
21. An Application
- What we do next: look at co-occurrence relations (clustering) of:
  - terms with terms
  - host categories with terms
- To determine more information about the terms.
- Done by building feature vectors of the various linguistic grammatical terms (grams) representing their contexts (a sketch follows).
- And measuring relative distances between these vectors (in particular, for terms we know).
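- A sketch of the gram-space idea; the context features and counts below are made up for illustration.

    import math

    # Each gram is a vector of context counts: host POS categories and
    # co-occurring grams.
    CONTEXTS = ["host=NN", "host=VB", "cooc=ACC", "cooc=3SG"]
    vectors = {
        "NOM": [12, 1, 9, 0],   # mostly nominal hosts, co-occurs with ACC
        "ACC": [10, 0, 9, 0],
        "AOR": [0, 8, 1, 7],    # verbal hosts, co-occurs with 3SG
    }

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms

    # A known gram's nearest neighbors suggest what an unknown gram means.
    print(cosine(vectors["NOM"], vectors["ACC"]))   # high: similar contexts
    print(cosine(vectors["NOM"], vectors["AOR"]))   # low: different contexts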
22. Linguistic Gram Space