Title: Data-Oriented Parsing
1 Data-Oriented Parsing
- Remko Scha
- Institute for Logic, Language and Computation
- University of Amsterdam
2 A limitation of PCFGs
These two analyses always have the same probability!
3 A more powerful model: Data-Oriented Parsing (DOP)
- Memory-based approach to syntactic parsing and disambiguation.
- Basic idea: use the subtrees from a syntactically annotated corpus directly as a stochastic grammar.
4 Language processing by analogy
- Cf. Bloomfield, Hockett, Paul, Saussure, Jespersen, and many others.
- "To attribute the creative aspect of language use to 'analogy' or 'grammatical patterns' is to use these terms in a completely metaphorical way, with no clear sense and with no relation to the technical usage of linguistic theory." Chomsky (1966)
5 Data-Oriented Parsing (DOP)
- Background: PCFGs
- Question: What are the statistically significant units of language?
6 Data-Oriented Parsing (DOP)
- Background: PCFGs
- Question: What are the statistically significant units of language?
- Answer: We don't know.
7 Data-Oriented Parsing (DOP)
- Background: PCFGs
- Question: What are the statistically significant units of language?
- Answer: We don't know.
- Include CFG-rules, lexicalized rules, sentences, phrases, sentences and phrases with 1 or 2 constituents left out, ...
8 Data-Oriented Parsing (DOP)
- Memory-based approach to syntactic parsing and disambiguation.
- Basic idea: use the subtrees from a syntactically annotated corpus directly as a stochastic grammar (a sketch of fragment extraction follows below).
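The "basic idea" can be made concrete with a small sketch of fragment extraction. The tree encoding and the one-tree toy treebank below are illustrative assumptions, not the corpus used in these slides; the point is only that every node contributes all fragments obtained by either cutting off or expanding each of its daughters.

```python
from itertools import product

# A tree is (label, [children]); a word (leaf) is just a string.
# Toy treebank with a single tree for "Peter killed the bear"
# (an illustrative stand-in, not the slides' corpus).
treebank = [
    ("S", [("NP", ["Peter"]),
           ("VP", [("V", ["killed"]),
                   ("NP", [("det", ["the"]), ("N", ["bear"])])])]),
]

def fragments(node):
    """All DOP fragments rooted at this node: at every daughter we either
    cut (leaving a bare nonterminal as a substitution site) or include one
    of the daughter's own fragments."""
    label, children = node
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])                   # words are always kept
        else:
            cut = (child[0], [])                      # open substitution site
            options.append([cut] + fragments(child))  # cut it, or expand it
    return [(label, list(choice)) for choice in product(*options)]

def all_fragments(treebank):
    """Every fragment of every node of every corpus tree.
    (For real treebanks this set grows exponentially with tree size.)"""
    collected = []
    def walk(node):
        if isinstance(node, str):
            return
        collected.extend(fragments(node))
        for child in node[1]:
            walk(child)
    for tree in treebank:
        walk(tree)
    return collected

print(len(all_fragments(treebank)))  # number of fragments in the toy corpus
```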
9 Data-Oriented Parsing (DOP)
- Simplest version: DOP1 (Bod, 1992).
- An annotated corpus defines a Stochastic Tree Substitution Grammar.
10 Data-Oriented Parsing (DOP)
- Simplest version: DOP1 (Bod, 1992).
- An annotated corpus defines a Stochastic Tree Substitution Grammar.
- (Slides adapted from Guy De Pauw, University of Antwerp)
13 Treebank
14 Generating "Peter killed the bear."
Note: one parse can have many derivations! (For example, the whole tree can come from a single large corpus fragment, or be composed from smaller fragments such as an S fragment with open NP and VP substitution sites plus separate NP and VP subtrees.)
15 An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Probability of a Derivation
- Product of the Probabilities of the Subtrees
16 An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Probability of a Derivation
- Product of the Probabilities of the Subtrees
- Probability of a Parse
- Sum of the Probabilities of its Derivations
17 Example derivation for "Van Utrecht naar
Leiden."
18 Probability of substituting a subtree t_i on a node: the number of occurrences of the subtree t_i, divided by the total number of occurrences of subtrees t with the same root node label as t_i:
   P(t_i) = |t_i| / Σ_{t : root(t) = root(t_i)} |t|
Probability of a derivation t_1 ∘ ... ∘ t_n: the product of the probabilities of the substitutions that it involves:
   P(t_1 ∘ ... ∘ t_n) = Π_i |t_i| / Σ_{t : root(t) = root(t_i)} |t|
Probability of a parse tree: the sum of the probabilities of all derivations of that parse tree:
   P(T) = Σ_i Π_j |t_ij| / Σ_{t : root(t) = root(t_ij)} |t|
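The three definitions above translate directly into code. The fragment encoding and the counts below are toy assumptions made up for illustration; only the three probability functions mirror the DOP1 formulas.

```python
from collections import Counter
from math import prod

# Toy fragment counts.  Keys are fragment identifiers whose root label is
# the part before "->"; this encoding and the numbers are purely
# illustrative, not taken from the slides.
counts = Counter({
    "S -> NP VP":         4,
    "S -> NP [VP V NP]":  1,
    "NP -> Peter":        2,
    "NP -> det N":        3,
    "VP -> V NP":         3,
})

def root(frag):
    """Root node label of a fragment in the toy encoding above."""
    return frag.split(" ->")[0]

# Denominator of P(t): total count of fragments sharing a root label.
root_totals = Counter()
for frag, n in counts.items():
    root_totals[root(frag)] += n

def p_fragment(frag):
    """P(t) = |t| / sum of |t'| over fragments t' with root(t') = root(t)."""
    return counts[frag] / root_totals[root(frag)]

def p_derivation(frags):
    """Probability of a derivation t1 ∘ ... ∘ tn: product of the P(ti)."""
    return prod(p_fragment(t) for t in frags)

def p_parse(derivations):
    """Probability of a parse tree: sum over all of its derivations."""
    return sum(p_derivation(d) for d in derivations)

# Two toy derivations that would yield the same parse tree.
print(p_parse([
    ["S -> NP VP", "NP -> Peter", "VP -> V NP", "NP -> det N"],
    ["S -> NP [VP V NP]", "NP -> Peter", "NP -> det N"],
]))
```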
19 An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Probability of a Derivation
- Product of the Probabilities of the Subtrees
- Probability of a Parse
- Sum of the Probabilities of its Derivations
- Disambiguation Choose the Most Probable
Parse-tree
20 An annotated corpus defines a Stochastic Tree
Substitution Grammar
21 An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Q. Does this work?
- A. Yes. Experiments on a small fragment of the
ATIS corpus gave very good results. (Bod's
dissertation, 1995.)
22 An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Q. Do we really need all fragments?
23 An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Q. Do we really need all fragments?
- A. Experiments on the ATIS corpus: How do restrictions on the fragments influence parse accuracy?
24 Experiments on a small subset of the ATIS corpus

Parse accuracy (in %) as a function of the maximum number of lexical items and the maximum tree-depth of the fragments:

max tree-depth \ max words:   1    2    3    4    6    8   unlimited
            1                47   47
            2                65   68   68   68
            3                74   76   79   79   79   79   79
            4                75   79   81   83   83   83   83
            5                77   80   83   83   83   85   84
            6                75   80   83   83   83   87   84
25 Beyond DOP1
- Computational issues
- Linguistic issues
- Statistical issues
26 Computational issues, Part 1: the good news
- TSG parsing can be based on the techniques of CFG parsing, and inherits some of their properties.
- Semi-ring algorithms are applicable for many useful purposes.
27 Computational issues, Part 1: the good news
- Semi-ring algorithms are applicable for many useful purposes. In time O(n^3) in the sentence length, we can:
- Build a parse forest.
- Compute the Most Probable Derivation.
- Select a random parse.
- Compute a Monte-Carlo estimation of the Most Probable Parse (sketched below).
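The Monte-Carlo estimation mentioned in the last bullet can be sketched as follows, assuming some sampler that draws one random derivation from the parse forest and returns the parse it yields. The sampler itself is not shown and the names are illustrative.

```python
from collections import Counter
import random

def most_probable_parse(sample_parse, n_samples=10_000):
    """Monte-Carlo estimate of the Most Probable Parse: repeatedly sample a
    derivation from the forest, record which parse it yields, and return
    the parse that comes up most often (with its relative frequency)."""
    tally = Counter(sample_parse() for _ in range(n_samples))
    parse, hits = tally.most_common(1)[0]
    return parse, hits / n_samples

# Toy stand-in for a forest sampler: two parses whose summed derivation
# probabilities are 0.7 and 0.3.
toy_sampler = lambda: "parse_A" if random.random() < 0.7 else "parse_B"
print(most_probable_parse(toy_sampler))  # roughly ("parse_A", 0.7)
```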
28 Computational issues, Part 2: the bad news
- Computing the Most Probable Parse is NP-complete (Sima'an). (Not a semi-ring algorithm.)
- The grammar gets very large.
29 Computational issues, Part 3: Solutions
- Non-probabilistic DOP: choose the shortest derivation (De Pauw, 1997; more recently, good results by Bod on the WSJ corpus). A sketch follows below.
- Compress the fragment set (using Minimum Description Length; Van der Werff, 2004).
- Rig the probability assignments so that the Most Probable Derivation becomes applicable.
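The shortest-derivation criterion from the first bullet can be sketched as follows; the data layout (each candidate parse paired with the fragment lists of its derivations) is an illustrative assumption.

```python
def shortest_derivation_parse(candidates):
    """Non-probabilistic DOP: among the candidate parses, pick the one
    whose best derivation is built from the fewest corpus fragments.
    `candidates` maps each parse to a list of its derivations, each
    derivation being a list of fragments.  Ties are left to the caller
    (e.g. broken by fragment frequencies)."""
    return min(candidates,
               key=lambda parse: min(len(d) for d in candidates[parse]))

# Toy example: parse_B can be built from two large corpus fragments,
# while parse_A needs four smaller ones, so parse_B is preferred.
candidates = {
    "parse_A": [["f1", "f2", "f3", "f4"]],
    "parse_B": [["big_S_fragment", "NP_fragment"],
                ["f1", "f2", "f3", "f4", "f5"]],
}
print(shortest_derivation_parse(candidates))  # parse_B
```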
31 - Linguistic issues
- Part 1: Future work
32 - Linguistic issues
- Part 1: Future work
- Scha (1990), about an imagined future DOP algorithm: It will be especially interesting to find out how such an algorithm can deal with complex syntactic phenomena such as "long distance movement". It is quite possible that an optimal matching algorithm does not operate exclusively on constructions which occur explicitly in the surface structure; perhaps "transformations" (in the classical Chomskyan sense) play a role in the parsing process.
33 - Transformations
- "John likes Mary."
- "Mary is liked by John."
- "Does John like Mary?"
- "Who does John like?"
- "Who do you think John likes?"
- "Mary is the girl I think John likes."
34 - Transformations
- Wh-movement, Passivization, Topicalization, Fronting, Scrambling, ...?
- Move-Alpha?
35 - Linguistic issues
- Part 2: Current work on more powerful models
- Kaplan & Bod: LFG-DOP (based on Lexical-Functional Grammar)
- Hoogweg: TIG-DOP (based on Tree-Insertion Grammar; cf. Tree-Adjoining Grammar)
- Sima'an: the Tree-Gram Model (Markov processes on sister nodes, conditioned on lexical heads)
36 - Statistical issues
- DOP1 has strange properties: the largest trees in the corpus completely dominate the statistics.
- Maximum Likelihood Estimation completely overfits the corpus.
37 - Statistical issues
- Solutions:
- "Sima'an heuristics": constraints on tree-depth and the number of terminals and non-terminals.
- Bonnema et al.: treat every corpus tree as the representation of a set of derivations.
- Smoothing an overfitting estimation (Sima'an, Buratto).
- Held-out estimation (Zollmann).
38 - Part II
- The Big Picture
- The Problem of Perception
39 - The Problem of Perception
- E.g. Visual Gestalt Perception
40 The Data-Oriented World View
- All of perception and cognition may be usefully analyzed from a data-oriented point of view.
- All interpretive processes are based on detecting similarities and analogies with concrete past experiences.
41 The Data-Oriented World View
- All interpretive processes are based on detecting similarities and analogies with concrete past experiences.
- E.g.:
- Visual Perception
- Music Perception
- Lexical Semantics
- Concept Formation.
42 The Data-Oriented Perspective on Lexical Semantics and Concept Formation
- A concept = the extensional set of its previously experienced instances.
- Classifying new input under an existing concept = judging the input's similarity to these instances.
- Against:
- Explicit definitions
- Prototypes
43 The Data-Oriented Perspective on Lexical Semantics and Concept Formation
- A concept = the extensional set of its previously experienced instances.
- Classifying new input under an existing concept = judging the input's similarity to these instances (a minimal sketch follows below).
- Against:
- Explicit definitions
- Prototypes
- Learning
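In machine-learning terms, this data-oriented view of concepts amounts to memory-based (nearest-neighbour) classification. The sketch below is illustrative only: the feature vectors, the dot-product similarity, and the exemplars are all made-up assumptions, not anything from the slides.

```python
def classify(new_item, memory, similarity, k=3):
    """Memory-based classification: rank the stored instances by their
    similarity to the new input and return the concept label that is most
    frequent among the k most similar exemplars."""
    ranked = sorted(memory, key=lambda pair: similarity(new_item, pair[0]),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return max(set(top_labels), key=top_labels.count)

# Toy exemplars: feature vectors paired with the concept they instantiate,
# and a dot product as the (made-up) similarity measure.
memory = [((1.0, 0.9), "bird"), ((0.9, 1.0), "bird"), ((0.1, 0.2), "fish")]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
print(classify((0.8, 0.8), memory, dot))  # -> "bird"
```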
44 - Part II
- Data-Oriented Parsing
46 The Data-Oriented Perspective on Perlocutionary
Effect
- "The effect of a lecture depends on the habits of
the listener, because we expect the language to
which we are accustomed." - Aristotle, Metaphysics II 12,13
47 Data-Oriented Parsing as a cognitive model
[Figure: parse tree for "every woman loves a man", with S dominating NP (det "every", N "woman") and VP (V "loves", NP (det "a", N "man"))]
48 Data-Oriented Parsing as a cognitive model
[Figure: the same parse tree for "every woman loves a man", with S dominating NP (det "every", N "woman") and VP (V "loves", NP (det "a", N "man"))]