Title: An evaluation measure after all
1 An evaluation measure after all?
- Janet Dean Fodor
- The Graduate Center, CUNY
- San Sebastian, June 2006
2 Credit where it's due
- Joint work with William G. Sakas (Hunter College & CUNY Graduate Center)
- CUNY graduate students: David Brizan, Carrie Crowther, Arthur Hoskey, Xuân-Nga Kam, Iglika Stoyneshka, Lidiya Tornyova
- This is one part of the CUNY-CoLAG project (Computational Language Acquisition Group)
3 Agenda
- Chapter 1, Aspects of the Theory of Syntax (1965)
- A program for modeling language acquisition
- Why have we not fulfilled it?
- Valuable studies of what children know when, but still no viable psycho-computational models.
4 Aspects Chapter 1
- Let us consider what is involved in the construction of an acquisition model for language.
- Representation of input signal and derivations
- The class of possible grammars
- A method for selecting one on the basis of a child's primary linguistic data: an evaluation measure.
- An actual acquisition model must have a strategy for finding hypotheses, e.g., grammars that exceed a certain value (in terms of the evaluation measure).
5 From creating rules to setting parameters
- The Chapter 1 program for modeling acquisition was never fulfilled. Too many possible grammars; no plausible EM.
- Shift to parameter theory (Chomsky 1981): languages differ only in their lexicons and the values of a small finite number of parameters, e.g., null subject.
- Now a finite number of possible grammars.
- Input sentences trigger the parameter values, so the learner knows which parameter values to adopt to license each input sentence.
- Incremental (memoryless) learning: choose the next grammar hypothesis on the basis of the current input sentence only. (A minimal sketch of such a loop follows below.)
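The incremental loop can be made concrete with a minimal Python sketch; this illustrates the general memoryless scheme only, not any particular model discussed in these slides. The helpers can_parse and choose_new_grammar are hypothetical placeholders.

    def incremental_learner(input_stream, initial_grammar, can_parse, choose_new_grammar):
        # The learner's only persistent state is its current grammar hypothesis.
        grammar = initial_grammar
        for sentence in input_stream:          # no memory of past sentences
            if not can_parse(grammar, sentence):
                # Revise only on the basis of the current input sentence.
                grammar = choose_new_grammar(grammar, sentence)
        return grammar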
6 But ambiguous triggers (Clark 1989)
- Pat expects Sue to win. How is case licensed on Sue?
- ECM: the matrix verb governs the lower subject. Or SCM: non-finite Infl assigns case to its subject.
- What can Pat sing? Why is the object in initial position?
- WH-movement or Scrambling
- Over-optimistic: for each parameter, an unambiguous trigger, innately specified. Realistically, it would be masked by other differences between languages. (Clark 1989; Gibson & Wexler 1994)
- The null subject parameter is not typical!
7 So parameter setting needs EM too
- Ambiguity of triggers → no unique grammar to choose for an input sentence. A pool of candidates.
- So either the switch-setting mechanism contains overrides, or the switches are set only after a choice is made between candidates.
- Either way, a decision must be made: EM.
- So not automatic triggering, and not error-free. Parameter settings must be revisable. (Non-deterministic; important below.)
8 EM must include the Subset Principle
- Poverty of the Stimulus (POS)
- POPS: poverty of the positive stimulus. E.g., little if any exposure to parasitic gaps.
- So grammar formation can't be purely data-driven. (Kam et al.) Whatever data are available must entail the rest.
- PONS: extreme poverty of the negative stimulus. I.e., little if any info about what is ungrammatical.
- So all grammar choices must be conservative.
- Subset Principle: if one language properly contains another, both compatible with the input, EM must favor the latter (the subset language). (A sketch of this filter follows below.)
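A toy Python illustration of the Subset Principle as a filter on candidate grammars. For the sketch, languages are idealized as finite sets of sentences and language_of as a lookup; both are simplifying assumptions, not part of any model described here.

    def subset_principle_choice(candidates, language_of, sentence):
        """Keep only candidates compatible with the input that have no
        compatible competitor generating a proper subset language."""
        compatible = [g for g in candidates if sentence in language_of(g)]
        safe = []
        for g in compatible:
            lg = language_of(g)
            # '<' on Python sets tests for a proper subset.
            has_smaller = any(language_of(h) < lg for h in compatible if h is not g)
            if not has_smaller:
                safe.append(g)   # no subset competitor; SP permits this choice
        return safe              # EM may rank these further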
9 Aspects illuminates current learning models
- Whether learning is rule-based or parameter-based:
- On hearing a sentence that my current grammar does not license, what is the pool of possible grammars to switch to?
- The obviously right answer: all and only the grammars compatible with this input sentence.
- Preferably ranked by EM, so that all learners follow the same route and favor the same generalizations.
- Necessarily ranked by EM, wherever there are subset/superset choices.
10 Recent models of parameter setting
- Recent learning models do not reflect this picture at all.
- They have no analogue of triggering.
- They have no way to compute the set of candidate grammars.
- There is no way to apply an evaluation metric, including SP.
- Four examples follow.
11 Candidate grammar pool in recent models
- Any grammar identical to Gcurrent except: add one transformation to convert the wrongly generated word string into the observed word string, or delete a transformation, selected at random, regardless of whether the string is then licensed. (Wexler & Culicover 1980)
- Any grammar that differs from Gcurrent with respect to one parameter value. (Then check whether it licenses the string; adopt it only if it does.) (Gibson & Wexler 1994; sketched below)
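A rough Python sketch of the one-parameter-change procedure just attributed to Gibson & Wexler (1994). It assumes grammars are vectors of binary parameter values and that can_parse is an oracle for whether a grammar licenses a sentence; both are assumptions for illustration.

    import random

    def one_parameter_step(grammar, sentence, can_parse):
        if can_parse(grammar, sentence):
            return grammar                     # nothing to learn from this input
        candidate = list(grammar)
        i = random.randrange(len(candidate))   # pick one parameter at random
        candidate[i] = 1 - candidate[i]        # flip just that parameter's value
        if can_parse(candidate, sentence):     # adopt only if it licenses the input
            return candidate
        return grammar                         # otherwise retain the current grammar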
12 Candidate grammars in recent models
- From the set of all grammars, a batch that are being simultaneously evaluated against successive sentences. (Breed the fitter ones. Then evaluate a batch of the offspring. And again.) (Clark 1992)
- A grammar selected from among all grammars, with probability based on how well each of its parameter values has performed in the past. (If it parses the new input, upgrade the weights of its p-values; if it fails, downgrade them.) (Yang 2000; sketched below)
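A simplified Python sketch of the reward/penalty idea just described, in the spirit of Yang (2000) but not his exact update rule. Each binary parameter i carries a weight weights[i], the probability of choosing value 1; can_parse is again an assumed oracle, and the learning rate is an arbitrary choice for the sketch.

    import random

    def variational_step(weights, sentence, can_parse, rate=0.05):
        # Sample one grammar according to the current parameter weights.
        grammar = [1 if random.random() < w else 0 for w in weights]
        success = can_parse(grammar, sentence)
        new_weights = []
        for w, v in zip(weights, grammar):
            if success:        # upgrade the sampled value of every parameter
                w = w + rate * (1 - w) if v == 1 else w - rate * w
            else:              # downgrade the sampled value of every parameter
                w = w - rate * w if v == 1 else w + rate * (1 - w)
            new_weights.append(w)
        return new_weights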
13 The CUNY model (STL)
- Retain as much of triggering as possible: as in triggering, the input sentence should tell the learner which parameters could be reset to license the sentence.
- E.g., What can Pat sing? → WH-movement or Scrambling
- Who is he looking at? → Wh-movement without pied-piping
- We call this parametric decoding.
- But no one knows how to do it, fully and feasibly.
14 There's no good substitute for decoding
- Trial-and-error selection of a grammar hypothesis without reference to the input sentence is inefficient. Our simulation studies confirm this: many input sentences go by with nothing learned from them, because an unsuccessful hypothesis was tested.
- Models that assign a success score to grammars don't waste inputs, but have to cover the search space. Without decoding, these are also inefficient.
- Compare decoding: the target grammar is one of these, one of these, and one of these. Intersect. Apply EM, SP. (Sketched below.)
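An idealized Python sketch of "decode, intersect, apply EM/SP". The function grammars_compatible_with stands in for full parametric decoding, which the following slides argue is not feasible in practice; languages are again treated as finite sets of sentences for illustration.

    def learn_by_decoding(sentences, all_grammars, grammars_compatible_with, language_of):
        candidates = set(all_grammars)
        for s in sentences:
            # "The target grammar is one of these, one of these..." -- intersect.
            candidates &= grammars_compatible_with(s)
        # Subset Principle: discard any grammar with a surviving proper-subset competitor.
        return [g for g in candidates
                if not any(language_of(h) < language_of(g)
                           for h in candidates if h is not g)]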
15 Only partial decoding is feasible
- The sentence parsing routines (innate) can do parametric decoding: when the current grammar fails to parse the input sentence, draw on additional parameter values to complete the parse. Adopt those values.
- But for full decoding, the parse would have to be parallel: compute every parse tree for the sentence.
- Unrealistic! Even adults can't do full parallel parsing.
- With serial parsing (= compute just one structure for the sentence), only partial decoding.
- Partial decoding is not fail-safe. An unattended language might be a subset of the adopted language.
16 Summary (so far) on decoding
- The Chapter 1 blueprint for a model of acquisition from primary linguistic data requires exhaustive decoding: the set of all grammars compatible with the sentence, so that EM can select the best one.
- How to do this was not specified in Chapter 1.
- All we can realistically assume is partial decoding. And that looks to be as useless for SP as none at all.
17 What to do?
- First, a detour. Reconsider a traditional (clunky) old approach to EM: enumeration.
- It needs a new twist to make it psychologically acceptable.
- But then it solves the problem of how to apply SP: locate candidates in EM order; adopt the first that works.
- It also solves another dire problem: the fact that SP can itself cause learning failures.
- Partial decoding fits profitably into this model.
18 Traditional solution: enumeration (Gold)
- Assume an innate ordering of all grammars/languages, with subsets preceding supersets, plus other EM rankings.
- The learning algorithm must test grammars in that sequence, moving to the next one only when the preceding grammars have been disconfirmed. (Sketched below.)
- By doing so the learner automatically respects SP.
- But psycholinguistically absurd! 30 parameters → a billion grammars. Learning by enumeration within a reasonable time bound is likely to be intractable (Pinker 1979).
- No role for decoding at all; just trial-and-error again.
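A minimal Python sketch of learning by enumeration in Gold's sense, as just described. It assumes the target grammar occurs somewhere in the innate ordering and that a grammar is disconfirmed exactly when it fails to license an input sentence.

    def enumeration_learner(sentences, ordered_grammars, can_parse):
        index = 0                                   # start with the first grammar
        for sentence in sentences:
            # Abandon the current grammar only when an input disconfirms it,
            # then move forward through the fixed (subsets-first) ordering.
            while not can_parse(ordered_grammars[index], sentence):
                index += 1
        return ordered_grammars[index]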
19 Re-thinking enumeration
- Twist the ordering of all possible grammars into a lattice, representing the subset-superset relations between them. (Strictly, a poset; sketched below.)
- For our 3,072 languages: 31,504 subset relations.
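A Python sketch of how such a subset-superset poset can be computed over a finite domain of languages, each idealized as a set of sentences (as in the CoLAG domain cited above); this representation is an assumption for illustration only.

    def subset_relations(languages):
        """languages: dict mapping grammar id -> set of sentences.
        Returns all pairs (sub, sup) with L(sub) a proper subset of L(sup)."""
        relations = set()
        for a, la in languages.items():
            for b, lb in languages.items():
                if a != b and la < lb:     # proper-subset test on sentence sets
                    relations.add((a, b))
        return relations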
20 Approx. 10% of the CoLAG grammar lattice
[Figure: fragment of the lattice, with supersets above and subsets below]
21 How a learner could use the lattice
- At the bottom of the lattice are languages with no proper subsets. We call these "smallest languages."
- They are the only legitimate hypotheses when learning begins.
- As smallest languages are tried and disconfirmed by input, they are erased from the lattice.
- So the pool of legitimate hypotheses at the bottom edge changes as learning proceeds. But all respect SP.
- Smallest languages might be tested by trial-and-error, or more efficiently by partial decoding, using the parser. (See the sketch below.)
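A Python sketch of the lattice-based learner just outlined, built on the subset_relations poset above. It keeps to the slide's description: only surviving "smallest" languages are legitimate hypotheses, and a disconfirmed grammar is erased so the bottom edge moves upward. The candidate-testing step stands in for either trial-and-error or partial decoding.

    def smallest_languages(live, relations):
        """Grammars in `live` with no surviving proper subset."""
        return {g for g in live
                if not any(sub in live and sup == g for (sub, sup) in relations)}

    def lattice_learner(sentences, languages, relations):
        live = set(languages)                      # grammars not yet erased
        current = None
        for sentence in sentences:
            if current is not None and sentence in languages[current]:
                continue                           # current hypothesis still works
            if current is not None:
                live.discard(current)              # erase the disconfirmed grammar
            frontier = smallest_languages(live, relations)
            # Test bottom-edge candidates against the current input sentence.
            compatible = [g for g in frontier if sentence in languages[g]]
            current = compatible[0] if compatible else current
        return current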
22 Enumeration by lattice: pros and cons
- Safe but efficient. No need to crawl through all languages between Lcurrent and Ltarget; only through all subsets of Ltarget.
- Erasing grammars is like phonological learning, where all possible distinctions at birth are whittled down by exposure to the target language.
- Keeping track of disconfirmed grammars by erasure does not add to memory load.
- To check: can we successfully integrate the "smallest languages" restriction into the decoding process?
23 Now: the lattice solves another bizarre problem
- Parameter setting implies incremental learning.
- Incremental learning is considered plausible/desirable because it requires no memory for past inputs or past grammar hypotheses. (In contrast to "little linguist" models.)
- But SP and incremental learning are incompatible. SP becomes over-conservative. It causes undershoot errors, which can prevent convergence on the target.
- SP demands selection of the least inclusive language compatible with the current input sentence: an absurdly small language, lacking constructions previously acquired. E.g., It's bedtime. → no topicalization, extraposition, passive, tag questions, ...
24 Learning failures without SP and with SP
- The culprit (again) is ambiguity of triggers, a fact about the natural language domain.
- How could the learning mechanism cope?
- The erasure of grammars from the lattice blocks excessive retrenchment. No more undershoot errors.
- But are we really born with this grammar lattice in our heads? Could it all be physics? Could it be projected? Current evidence suggests it can't.
25 To wrap up
- Starting with Chomsky's Chapter 1 blueprint (idealization to instantaneous acquisition), attempt to build a process model of syntactic parameter setting.
- The aim is computational rigor plus psychological verisimilitude. Impose strict limits on resources: memory, computation. (Pinker 1979)
- We have looked two nasty problems in the eye, concerning the choice among grammars all compatible with the input, to find out what a solution would have to be like.
- After 40 years, a suggestion. The lattice model is an idea about a direction towards a possible solution.