Title: Data-Oriented Parsing
1Data-Oriented Parsing
- Remko Scha
- Institute for Logic, Language and Computation
- University of Amsterdam
2- Overview
- The Big Picture (cognitive motivation)
- A simple Data-Oriented Parsing model
- Extended DOP models
- Psycholinguistics revisited
- Statistical considerations
3- Data-Oriented Parsing
- The Big Picture
4- Data-Oriented Parsing
- The Big Picture
- (1) The key to understanding cognition is
- understanding perception.
5- Data-Oriented Parsing
- The Big Picture
- (1) The key to understanding cognition is
- understanding visual Gestalt perception.
6- Data-Oriented Parsing
- The Big Picture
- (1) The key to understanding cognition is
- understanding visual Gestalt perception.
- Conjecture: Language processing and "thinking" involve a metaphorical use of our Gestalt perception capability.
- R. Scha, "Wat is het medium van het denken?" ["What is the medium of thinking?"] In: M.B. In 't Veld & R. de Groot (eds.), Beelddenken en begripsdenken: een paradox? Utrecht: Agiel, 2005.
7- Data-Oriented Parsing
- The Big Picture
- (1) The key to understanding cognition is
- understanding visual Gestalt perception.
- (2) All perceptual processes are based on
detecting similarities and analogies with
concrete past experiences.
8The Data-Oriented World View
- All interpretive processes are based on detecting
similarities and analogies with concrete past
experiences. - E.g.
- Visual Perception
- Music Perception
- Lexical Semantics
- Concept Formation.
9 E.g.: The Data-Oriented Perspective on Lexical Semantics and Concept Formation.
- A concept = the extensional set of its previously experienced instances.
- Classifying new input under an existing concept = judging the input's similarity to these instances.
- Against:
- Explicit definitions
- Prototypes
10 The Data-Oriented Perspective on Lexical Semantics and Concept Formation.
- A concept = the extensional set of its previously experienced instances.
- Classifying new input under an existing concept = judging the input's similarity to these instances.
- Against:
- Explicit definitions
- Prototypes
- Learning
11- Part II
- Data-Oriented Parsing
12- Data-Oriented Parsing
- Processing new input utterances in terms of their
similarities and analogies with previously
experienced utterances.
13 Language processing by analogy: was proposed already by "Bloomfield, Hockett, Paul, Saussure, Jespersen, and many others". But: "To attribute the creative aspect of language use to 'analogy' or 'grammatical patterns' is to use these terms in a completely metaphorical way, with no clear sense and with no relation to the technical usage of linguistic theory." (Chomsky, 1966)
14 Challenge: To work out a formally precise notion of "language processing by analogy".
15 Challenge: To work out a formally precise notion of "language processing by analogy". A first step, Data-Oriented Parsing: remember all utterances with their syntactic tree-structures; analyse new input by recombining fragments of these tree-structures.
16 Data-Oriented Parsing
- Memory-based approach to syntactic parsing and disambiguation.
- Basic idea: use the subtrees from a syntactically annotated corpus directly as a stochastic grammar (see the sketch below).
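As a concrete illustration of this basic idea, here is a minimal Python sketch (mine, not from the talk) of how an annotated corpus can be turned into a DOP fragment collection: for every internal node, every way of cutting off non-terminal descendants yields a fragment with open substitution sites. The Node class and the one-tree toy corpus are illustrative assumptions.

```python
from collections import Counter
from itertools import product

class Node:
    """A tree node: a label plus a (possibly empty) tuple of children."""
    def __init__(self, label, children=()):
        self.label, self.children = label, tuple(children)
    def __repr__(self):
        if not self.children:
            return self.label
        return "(%s %s)" % (self.label, " ".join(map(repr, self.children)))

def fragments(node):
    """All DOP fragments rooted at an internal node: each non-terminal child is
    either cut off (leaving an open substitution site) or expanded further."""
    options_per_child = []
    for child in node.children:
        if child.children:                                  # non-terminal child
            options_per_child.append([Node(child.label)] + fragments(child))
        else:                                               # terminal (a word): always kept
            options_per_child.append([child])
    return [Node(node.label, combo) for combo in product(*options_per_child)]

def all_fragments(tree):
    """Fragments rooted at every internal node of a corpus tree."""
    out, stack = [], [tree]
    while stack:
        n = stack.pop()
        if n.children:
            out.extend(fragments(n))
            stack.extend(n.children)
    return out

# A one-tree toy "annotated corpus" for "Peter killed the bear".
corpus = [Node("S", [Node("NP", [Node("Peter")]),
                     Node("VP", [Node("V", [Node("killed")]),
                                 Node("NP", [Node("det", [Node("the")]),
                                             Node("N", [Node("bear")])])])])]

# The fragment collection: counts per fragment, usable as a stochastic TSG.
fragment_counts = Counter(repr(f) for t in corpus for f in all_fragments(t))
print(len(fragment_counts), "distinct fragments")
```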
17Data-Oriented Parsing (DOP)
- Simplest version: DOP1 (Bod, 1992).
- Annotated corpus defines Stochastic Tree
Substitution Grammar
18Data-Oriented Parsing (DOP)
- Simplest version: DOP1 (Bod, 1992).
- Annotated corpus defines Stochastic Tree
Substitution Grammar
- (Slides adapted from Guy De Pauw, University of Antwerp)
21Fragment Collection
22 Generating "Peter killed the bear."
Note: one parse has many derivations!
23An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Probability of a Derivation
- Product of the Probabilities of the Subtrees
24An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Probability of a Derivation
- Product of the Probabilities of the Subtrees
- Probability of a Parse
- Sum of the Probabilities of its Derivations
25 Example derivation for "Van Utrecht naar
Leiden."
26 Probability of substituting a subtree t_i on a node = the number of occurrences of t_i, divided by the total number of occurrences of subtrees t with the same root node label as t_i:

$P(t_i) = \frac{|t_i|}{\sum_{t:\, root(t)=root(t_i)} |t|}$

Probability of a derivation t_1 ∘ ... ∘ t_n = the product of the probabilities of the substitutions that it involves:

$P(t_1 \circ \ldots \circ t_n) = \prod_i \frac{|t_i|}{\sum_{t:\, root(t)=root(t_i)} |t|}$

Probability of a parse-tree = the sum of the probabilities of all derivations of that parse-tree:

$P(T) = \sum_j \prod_i \frac{|t_{ij}|}{\sum_{t:\, root(t)=root(t_{ij})} |t|}$
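The sketch below works these three definitions through on a toy table of fragment counts (the fragments, counts and derivations are made up for illustration; they are not from the talk or from any corpus).

```python
from math import prod
from collections import Counter

# Toy fragment counts, keyed by (root label, a readable name for the fragment).
counts = Counter({
    ("S",  "S -> NP VP"):                   2,
    ("S",  "S -> NP (VP (V killed) NP)"):   1,
    ("NP", "NP -> Peter"):                  1,
    ("NP", "NP -> (det the) (N bear)"):     2,
    ("VP", "VP -> (V killed) NP"):          1,
})

root_totals = Counter()
for (root, _), c in counts.items():
    root_totals[root] += c

def p_subtree(frag):
    """P(t_i): count of t_i divided by total count of fragments with the same root label."""
    return counts[frag] / root_totals[frag[0]]

def p_derivation(frags):
    """Probability of a derivation t_1 ∘ ... ∘ t_n: product of its subtree probabilities."""
    return prod(p_subtree(f) for f in frags)

def p_parse(derivations):
    """Probability of a parse tree: sum over all of its derivations."""
    return sum(p_derivation(d) for d in derivations)

# Two different derivations yielding the same (toy) parse tree.
d1 = [("S", "S -> NP (VP (V killed) NP)"), ("NP", "NP -> Peter"),
      ("NP", "NP -> (det the) (N bear)")]
d2 = [("S", "S -> NP VP"), ("NP", "NP -> Peter"),
      ("VP", "VP -> (V killed) NP"), ("NP", "NP -> (det the) (N bear)")]
print(p_parse([d1, d2]))   # 2/27 + 4/27 ≈ 0.222
```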
27An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Probability of a Derivation
- Product of the Probabilities of the Subtrees
- Probability of a Parse
- Sum of the Probabilities of its Derivations
- Disambiguation: choose the Most Probable Parse-tree
28An annotated corpus defines a Stochastic Tree
Substitution Grammar
29An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Q. Does this work?
- A. Yes. Experiments on a small fragment of the
ATIS corpus gave very good results. (Bod's
dissertation, 1995.)
30An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Q. Do we really need all fragments?
31An annotated corpus defines a Stochastic Tree
Substitution Grammar
- Q. Do we really need all fragments?
- A. Experiments on the ATIS corpus
32 Experiments on a small subset of the ATIS corpus

                max words →   1    2    3    4    6    8   unlimited
max tree-depth ↓
      1                      47   47
      2                      65   68   68   68
      3                      74   76   79   79   79   79   79
      4                      75   79   81   83   83   83   83
      5                      77   80   83   83   83   85   84
      6                      75   80   83   83   83   87   84

Parse accuracy (in %) as a function of the maximum number of lexical items and the maximum tree-depth of the fragments.
33Beyond DOP1
- Computational issues
- Linguistic issues
- Psycholinguistic issues
- Statistical issues
34 Computational issues, part 1: the good news
- TSG parsing can be based on the techniques of CFG-parsing, and inherits some of their properties.
- Semi-ring algorithms are applicable for many useful purposes.
35 Computational issues, part 1: the good news
- Semi-ring algorithms are applicable for many useful purposes. In time O(n³) in the sentence length, we can:
- Build a parse-forest.
- Compute the Most Probable Derivation.
- Select a random parse.
- Compute a Monte-Carlo estimation of the Most Probable Parse (see the sketch below).
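The Monte-Carlo estimation mentioned in the last bullet can be sketched in a few lines: draw random derivations from the parse forest in proportion to their probability, and let the parse tree that comes out most often stand in for the Most Probable Parse. In this sketch, sample_derivation and tree_of are hypothetical helpers standing in for the forest sampler and for reading the parse tree off a derivation.

```python
from collections import Counter

def monte_carlo_mpp(sample_derivation, tree_of, n_samples=1000):
    """Estimate the Most Probable Parse by sampling derivations from the forest."""
    votes = Counter()
    for _ in range(n_samples):
        derivation = sample_derivation()   # drawn with probability proportional to P(derivation)
        votes[tree_of(derivation)] += 1    # many different derivations yield the same parse
    tree, hits = votes.most_common(1)[0]
    return tree, hits / n_samples          # the winning parse and its estimated probability
```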
36 Computational issues, part 2: the bad news
- Computing the Most Probable Parse is NP-complete (Sima'an). (Not a semi-ring algorithm.)
- The grammar gets very large.
37 Computational issues, part 3: solutions
- Non-probabilistic DOP: choose the shortest derivation. (De Pauw, 1997; more recently, good results by Bod on the WSJ corpus.)
- Compress the fragment-set. (Use Minimum Description Length; Van der Werff, 2004.)
- Rig the probability assignments so that the Most Probable Derivation becomes applicable.
39 More powerful models
- Kaplan & Bod: LFG-DOP (based on Lexical-Functional Grammar)
- Hoogweg: TIG-DOP (based on Tree-Insertion Grammar; cf. Tree-Adjoining Grammar)
- Sima'an: the Tree-Gram model (Markov processes on sister-nodes, conditioned on lexical heads)
40 - Linguistic issues: future work
41 - Linguistic issues: future work
- Scha (1990), about an imagined future DOP
algorithm - It will be especially interesting to find out how
such an algorithm can deal with complex syntactic
phenomena such as "long distance movement". It is
quite possible that an optimal matching algorithm
does not operate exclusively on constructions
which occur explicitly in the surface-structure;
perhaps "transformations" (in the classical
Chomskyan sense) play a role in the parsing
process.
42- Transformations
- "John likes Mary."
- "Mary is liked by John."
- "Does John like Mary?"
- "Who does John like?"
- "Who do you think John likes?"
- "Mary is the girl I think John likes."
43- Transformations
- Wh-movement, Passivization, Topicalization,
Fronting, Scrambling, ...? - Move-α?
44 Psycholinguistics Revisited
45Psycholinguistic Considerations
- DOP is a performance model
- DOP defines syntactic probabilities of sentences
and their analyses - (against the background of a weak, overgenerating
competence grammar the definition of all
formally possible sentence annotations).
46Psycholinguistic Considerations
- Does DOP account for performance phenomena?
47Psycholinguistic Considerations
- Probabilistic Disambiguation
- Psychological experiments consistently show that
disambiguation preferences correlate with
occurrence frequencies.
48Psycholinguistic Considerations
- The "Garden Path" Phenomenon
- "The horse raced past the barn "
49Psycholinguistic Considerations
- The "Garden Path" Phenomenon
- "The horse raced past the barn fell."
50Psycholinguistic Considerations
- The "Garden Path" Phenomenon
- "The horse raced past the barn fell."
- Plausible model: Incremental version of DOP
- Analysis with very high probability kills
analyses with low probability.
51Psycholinguistic Considerations
- Utterance Generation
- Cf. Kempen et al. (Leyden University)
- (Non-probabilistic) generation mechanism which
combines tree fragments at random.
52Psycholinguistic Considerations
- Grammaticality Judgements
- Cf. Stich: priming of grammaticality judgements.
- Plausible model: DOP with a "recency effect".
53Psycholinguistic Considerations
- Integration with semantics
- Cf. "Compositional Semantics" (Montague).
- Assume a semantically annotated corpus; cf. Van den Berg et al.
- Factoring in the probabilities of semantic subcategories: cf. Bonnema.
54Psycholinguistic Considerations
- Language dynamics
- Grammar as an "emergent phenomenon": its development to be explained in terms of underlying, more detailed, possibly incommensurable phenomena.
55Psycholinguistic Considerations
- Dynamics
- E.g. Physics
- Thermodynamics: describes the relations between
temperature, pressure, volume and entropy (in
equilibrium situations). - Statistical thermodynamics explains this in terms
of movements of molecules. (And movements of
molecules also account for non-equilibrium
situations.) - E.g. Biology
- Theory of Evolution
56- "Doesn't every science live on this paradoxical
slope to which it is doomed by the evanescence of
its object in the very process of its
apprehension, and by the pitiless reversal this
dead object exerts on it?" - Baudrillard, 1983
57Psycholinguistic Considerations
- Language Acquisition
- Q. How does a child get its first corpus?
58Psycholinguistic Considerations
- Language Acquisition
- Q. How does a child get its first corpus?
- A. By bootstrapping pragmatic/semantic
structures.
59Psycholinguistic Considerations
- Language Acquisition
- Rule-based models which bootstrap the syntactic
structures from perceived semantic relations - Suggested by Schlesinger (1971, 1975)
- Implemented by Chang & Maia (2001)
- Data-oriented version of this
- Described by De Kreek (2003)
60Psycholinguistic Considerations
- Language Change
- The data-oriented approach allows for gradual
changes in parsing and generation preferences. - It allows language change within a lifetime.
(Language change does not depend on
misunderstandings between successive generations.)
61Psychological Considerations
- Perception Revisited
- How to generalize DOP to visual and musical
perception?
62Psychological Considerations
- Perception Revisited
- How to generalize DOP to visual and musical
perception? - How to represent visual and musical Gestalts in a
formal way? - How to generalize DOP to arbitrary algebras?
63 Data-Oriented Parsing: Statistical Issues
64 - Statistical problems
- DOP1: Relative Frequency Estimation on the fragment set.
- Bonnema et al. (1999):
- The DOP1 estimator has strange properties: the largest trees in the corpus completely dominate the statistics.
- Maximum Likelihood Estimation is not a viable alternative: MLE completely overfits the corpus.
65In DOP1, the largest trees in the corpus
completely dominate the statistics.
The above treebank contains 7 fragments with root
label S, each with probability 1/7. For the
input string 'ab', parse (a) will thus receive probability 3/7; parse (d) will receive probability 4/7.
66In DOP1, the largest trees in the corpus
completely dominate the statistics.
Assume the above treebank, with equiprobable initial rules S → X and S → A. The input string 'ab' will be analysed as a constituent of category X, because of the relative improbability of the fragments from (b).
67In DOP1, the largest trees in the corpus
completely dominate the statistics.
Assume a treebank with 999 binary trees of depth
five and 1 tree of depth six. Now 99.8% of the
probability mass will go to fragments from the
only tree of depth six.
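A rough sanity check of the 99.8% figure, under simplifying assumptions of my own: every tree is a complete binary tree, "depth" counts levels of non-terminal nodes, and probability mass is taken to be proportional to fragment counts (which is roughly what relative-frequency estimation does, per root label).

```python
from functools import lru_cache

@lru_cache(None)
def fragments_rooted(depth):
    """Number of fragments rooted at the top of a complete binary tree of this depth."""
    if depth == 1:
        return 1                                   # a pre-terminal node with its word
    # each of the two children is either cut off (a substitution site)
    # or replaced by any fragment rooted there
    return (1 + fragments_rooted(depth - 1)) ** 2

def total_fragments(depth):
    """Fragments rooted anywhere in a complete binary tree of this depth."""
    if depth == 1:
        return 1
    return fragments_rooted(depth) + 2 * total_fragments(depth - 1)

big = total_fragments(6)                 # the single depth-six tree
small = 999 * total_fragments(5)         # the 999 depth-five trees
print(round(big / (big + small), 4))     # ~0.998: the depth-six tree dominates
```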
68In DOP1, the largest trees in the corpus
completely dominate the statistics.
- "Solution"
- Heuristic constraints on tree-depth and number of terminals and non-terminals. E.g., Sima'an (1999), as sketched below:
- Maximum number of substitution sites (leaf non-terminals): 2.
- Maximum number of lexical items: 9.
- Maximum number of consecutive lexical items: 3.
- Maximum tree-depth: 4.
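A sketch of such a filter, under notational assumptions of my own: a fragment is a nested tuple (label, child, ...), a bare uppercase label is an open substitution site, and a lowercase string is a word.

```python
def is_word(leaf):
    return isinstance(leaf, str) and leaf.islower()

def frontier(frag):
    """The fragment's leaves, left to right (words and open substitution sites)."""
    if isinstance(frag, str):
        return [frag]
    label, *children = frag
    return [label] if not children else [x for c in children for x in frontier(c)]

def depth(frag):
    """Tree-depth of a fragment; a bare word or substitution site counts as depth 0."""
    if isinstance(frag, str):
        return 0
    label, *children = frag
    return 0 if not children else 1 + max(depth(c) for c in children)

def longest_word_run(leaves):
    """Longest run of consecutive lexical items on the frontier."""
    best = run = 0
    for leaf in leaves:
        run = run + 1 if is_word(leaf) else 0
        best = max(best, run)
    return best

def keep(frag, max_sites=2, max_words=9, max_run=3, max_depth=4):
    """Sima'an (1999)-style heuristic filter on a single fragment."""
    leaves = frontier(frag)
    n_words = sum(is_word(l) for l in leaves)
    n_sites = len(leaves) - n_words
    return (n_sites <= max_sites and n_words <= max_words
            and longest_word_run(leaves) <= max_run and depth(frag) <= max_depth)

# e.g. keep(("S", ("NP",), ("VP", ("V", "killed"), ("NP",))))  ->  True
```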
69 In DOP1, the largest trees in the corpus completely dominate the statistics. Needed: a different estimator.
- Not a solution: Maximum Likelihood Estimation.
- MLE completely overfits the corpus: the DOP grammar which maximizes the chance of generating the treebank assigns the following probabilities:
- to every full corpus tree: its relative frequency in the corpus
- to every other fragment: zero
(Bonnema & Scha, 2003)
70 In DOP1, the largest trees in the corpus completely dominate the statistics. Needed: a different estimator.
- Bonnema et al. (1999): treat every full corpus-tree as the representation of a set of derivations.
- If we assume a uniform probability distribution over this set of derivations, we arrive at the following "weighted relative frequency estimate": a fragment τ with N(τ) non-root non-terminal nodes receives probability
- $P(\tau) = 2^{-N(\tau)} \cdot F(\tau)$
71 In DOP1, the largest trees in the corpus completely dominate the statistics. Needed: a different estimator.
- Bonnema et al. (1999): treat every full corpus-tree as the representation of a set of derivations.
- If we assume a uniform probability distribution over this set of derivations, we arrive at the following "weighted relative frequency estimate": a fragment τ with N(τ) non-root non-terminal nodes receives probability
- $P(\tau) = 2^{-N(\tau)} \cdot F(\tau)$
- Sub-optimal assumption!
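A sketch of this weighted relative frequency estimate, using the same nested-tuple fragment notation as in the earlier sketches; reading F(τ) as the fragment's relative frequency is my interpretation, and in a full DOP model the resulting weights would still be normalised per root label.

```python
from collections import Counter

def nonroot_nonterminals(frag, is_root=True):
    """N(τ): the fragment's non-terminal nodes, not counting its root."""
    if isinstance(frag, str):
        return 0                                   # a word
    label, *children = frag
    own = 0 if is_root else 1
    return own + sum(nonroot_nonterminals(c, is_root=False) for c in children)

def weighted_relative_frequency(fragment_counts):
    """P(τ) = 2^(-N(τ)) · F(τ) for every fragment in a Counter of corpus frequencies."""
    total = sum(fragment_counts.values())
    return {frag: (2 ** -nonroot_nonterminals(frag)) * count / total
            for frag, count in fragment_counts.items()}
```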
72 In DOP1, the largest trees in the corpus completely dominate the statistics. Needed: a different estimator.
- Solutions:
- Smoothing an overfitting estimation (Sima'an, Buratto).
- Held-out estimation (Zollmann).
73 - Smoothing
- Good-Turing estimation: estimating the probability of unseen events on the basis of the number of observed once-occurring events, twice-occurring events, etc. (see the sketch below).
- Back-off: cf. the sparse-data problem with trigram models; estimate the probabilities of unseen trigrams on the basis of the probabilities of their constituent bigrams and unigrams.
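For reference, a textbook-style sketch of the Good-Turing idea in the first bullet (not a DOP-specific recipe): the probability mass reserved for unseen events is estimated from the once-seen events, and a count of r is re-estimated from how many event types were seen r and r+1 times.

```python
from collections import Counter

def good_turing(counts):
    """counts: Counter of observed events -> (adjusted counts, probability mass for unseen events)."""
    N = sum(counts.values())
    n_r = Counter(counts.values())          # N_r: number of event types observed exactly r times
    p_unseen = n_r[1] / N                   # total mass reserved for unseen events = N_1 / N
    adjusted = {}
    for event, r in counts.items():
        if n_r[r + 1]:
            adjusted[event] = (r + 1) * n_r[r + 1] / n_r[r]   # r* = (r+1) * N_{r+1} / N_r
        else:
            adjusted[event] = r             # no N_{r+1} data: leave the raw count
    return adjusted, p_unseen
```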
74- Held-out estimation
- Get the fragment set from one part of the corpus
and the probabilities from another part. Use ten
different splits and take the average.
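A sketch of this held-out scheme; extract_fragments and estimate_probs are hypothetical stand-ins for the DOP fragment extractor and probability estimator, and the 50/50 split is an assumption.

```python
import random
from collections import defaultdict

def held_out_estimate(treebank, extract_fragments, estimate_probs, n_splits=10, seed=0):
    """Fragments from one half of the treebank, probabilities from the other; average over splits."""
    rng = random.Random(seed)
    sums = defaultdict(float)
    for _ in range(n_splits):
        trees = list(treebank)
        rng.shuffle(trees)
        half = len(trees) // 2
        inventory = extract_fragments(trees[:half])       # fragment set from one part
        probs = estimate_probs(inventory, trees[half:])   # probabilities from the other part
        for frag, p in probs.items():
            sums[frag] += p
    return {frag: total / n_splits for frag, total in sums.items()}
```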
76The Data-Oriented Perspective on Perlocutionary
Effect
- "The effect of a lecture depends on the habits of
the listener, because we expect the language to
which we are accustomed." - Aristotle, Metaphysics II 12,13
77 Data-Oriented Parsing as a cognitive model
[Figure: tree fragments for "Every woman loves a man" — S, VP and NP subtrees built from det ("every", "a"), N ("woman", "man") and V ("loves").]
78 Data-Oriented Parsing as a cognitive model
[Figure repeated from the previous slide.]