Title: Unsupervised Learning of Syntactic Structure
1. Unsupervised Learning of Syntactic Structure
- Christopher Manning
- (work with Dan Klein and Teg Grenager)
- Depts of CS and Linguistics
- Stanford University
2. Probabilistic models of language
- Everyone knows that language is variable (Sapir 1921)
- Probabilistic models give precise descriptions of a variable, uncertain world
- The choice for language isn't a dichotomy between rules and neural networks
- Probabilistic models can be used over rich linguistic representations
- They support inference and learning
- There's not much evidence of a poverty of the stimulus preventing them being used
3. Poverty of the stimulus?
- There's a lot of stimulus: Baayen estimates children hear 200 million words by adulthood
- c. 1,000,000 sentences a year
- Who says seeing something 500 times a year isn't enough?
- Much more sign of other poverties:
- Poverty of imagination
- Poverty of model building
- Poverty of understanding of (machine) learning among linguists
- At any rate, an obvious way to investigate this question is empirically.
4. Linguistics: Is it time for a change?
- The corollary of noting that all the ideas of Chomsky through Minimalism are presaged in LSLT is that all these ideas are 50 years old.
- In the 1950s, we didn't know much about cognitive science
- In the 1950s, we didn't know much about human or, especially, machine learning
- It's time to make a few changes
- Others are making them; if linguistics doesn't, it might be ignored.
5. Gold (1967)
- Gold: no superfinite class of languages (e.g., regular or context-free languages) is learnable without negative examples.
- Conditions: a nearly arbitrary sequence of examples; the only constraint is that no sentence may be withheld from the learner indefinitely.
- Still regularly cited as bedrock for innatist linguistics
- Responses suggested by Gold:
- Subtle, covert negative evidence → some recent claims
- Innate knowledge shrinks the language class → Chomsky
- The assumption about presentation of examples is too general → e.g., a probabilistic language model
6. Horning (1969)
- If texts are generated by a stochastic process, how often something occurs can drive language acquisition
- As time goes by without seeing something, we have evidence that it either doesn't happen or is very rare
- Implicit negative evidence: you can't withhold common stuff
- Horning: stochastic context-free languages are learnable from only positive examples.
- But Horning's proof is enumerative, rather than providing a plausible grammar learning method
- (See, e.g., Rohde and Plaut 1999 for discussion)
- Here we provide two case studies:
- Phrase structure (and dependencies) learning
- Learning semantic roles underlying surface syntax
7. Case study 1: Grammar induction (learning phrase structure)
- Start with raw text, learn syntactic structure
- Some have argued that learning syntax from positive data alone is impossible:
- Gold, 1967: non-identifiability in the limit
- Chomsky, 1980: the poverty of the stimulus
- Many others have felt it should be possible:
- Lari and Young, 1990
- Carroll and Charniak, 1992
- Alex Clark, 2001
- Mark Paskin, 2001
- ... but it is a hard problem
8. Idea: Lexical Affinity Models
- Words select other words on syntactic grounds
- Link up pairs with high mutual information
- Yuret, 1998: greedy linkage
- Paskin, 2001: iterative re-estimation with EM
- Evaluation: compare linked pairs to a gold standard
congress narrowly passed the amended bill
Method          Accuracy
Paskin, 2001    39.7
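A minimal Python sketch of the mutual-information linking idea (an illustration only, not Yuret's or Paskin's exact algorithm; keeping the top-scoring pairs per sentence is a simplifying assumption):

    from collections import Counter
    from itertools import combinations
    from math import log

    def pmi_links(sentences, links_per_sentence=3):
        """Score word pairs by pointwise mutual information estimated from the
        corpus, then greedily keep the highest-scoring pairs in each sentence."""
        word_counts = Counter()
        pair_counts = Counter()
        for sent in sentences:
            word_counts.update(sent)
            pair_counts.update(frozenset(p) for p in combinations(sent, 2) if p[0] != p[1])
        n_words = sum(word_counts.values())
        n_pairs = sum(pair_counts.values())

        def pmi(w1, w2):
            p_joint = pair_counts[frozenset((w1, w2))] / n_pairs
            return log(p_joint / ((word_counts[w1] / n_words) * (word_counts[w2] / n_words)))

        linked = []
        for sent in sentences:
            pairs = [p for p in combinations(sent, 2) if p[0] != p[1]]
            pairs.sort(key=lambda p: pmi(*p), reverse=True)
            linked.append(pairs[:links_per_sentence])
        return linked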
9. Problem: Non-Syntactic Affinity
- Mutual information between words does not necessarily indicate syntactic selection.
congress narrowly passed the amended bill
expect brushbacks but no beanballs
a new year begins in new york
10. Idea: Word Classes
- Individual words like congress are entwined with semantic facts about the world.
- Syntactic classes, like NOUN and ADVERB, are bleached of word-specific semantics.
- Automatic word classes are more likely to look like DAYS-OF-WEEK or PERSON-NAME.
- We could build dependency models over word classes (cf. Carroll and Charniak, 1992)
NOUN     ADVERB    VERB    DET  PARTICIPLE  NOUN
congress narrowly  passed  the  amended     bill
11. Problems: Word Class Models
- Issues:
- Too simple a model: doesn't work much better even when supervised
- No representation of valence (number of arguments)
Method                      Accuracy
Random                      41.7
Carroll and Charniak, 92    44.7
Adjacent Words              53.2
NOUN  NOUN    VERB
stock prices  fell
12. Bias: Using more sophisticated dependency representations
[Figure: a head selecting an argument (head → arg)]
Method             Accuracy
Adjacent Words     55.9
Our Model (DMV)    63.6
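The dependency model with valence (DMV) lets each head decide whether to stop or take another argument in each direction, conditioned on whether it has taken one there already. A rough Python sketch of that generative story (the parameter tables and the flat tree encoding here are simplifying assumptions, not the published model's exact form):

    from math import log

    def dmv_log_prob(tree, params):
        """Score a dependency analysis under DMV-style parameters:
        params['stop'][(head, direction, adjacent)] = P(STOP | head, dir, adj)
        params['choose'][(head, direction)][arg]    = P(arg | head, dir)
        `tree` maps each head to its (left_args, right_args), in outward order."""
        logp = 0.0
        for head, (left_args, right_args) in tree.items():
            for direction, args in (("left", left_args), ("right", right_args)):
                for i, arg in enumerate(args):
                    adjacent = (i == 0)  # no argument taken yet in this direction
                    logp += log(1.0 - params["stop"][(head, direction, adjacent)])
                    logp += log(params["choose"][(head, direction)][arg])
                logp += log(params["stop"][(head, direction, len(args) == 0)])
        return logp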
13. Constituency Parsing
- Model still doesn't directly use constituency
- Constituency structure gives boundaries (useful for Machine Translation and Information Extraction)
14. Idea: Learn PCFGs with EM
- Classic experiments on learning Probabilistic CFGs with Expectation-Maximization (Lari and Young, 1990)
- Full binary grammar over n symbols
- Parse randomly at first
- Re-estimate rule probabilities off parses
- Repeat
- Their conclusion: it doesn't work at all!
X1, X2, ..., Xn
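A schematic Python version of that procedure, with the inside-outside E-step left as a stub (the helper names here are placeholders, not code from Lari and Young):

    import random
    from collections import defaultdict

    def normalize(probs):
        """Renormalize rule probabilities so each left-hand side sums to 1."""
        totals = defaultdict(float)
        for (lhs, _, _), p in probs.items():
            totals[lhs] += p
        return {rule: p / totals[rule[0]] for rule, p in probs.items()}

    def expected_rule_counts(sentence, probs):
        """Placeholder for the inside-outside algorithm, which computes the
        expected number of times each rule is used in parses of the sentence."""
        raise NotImplementedError

    def pcfg_em(sentences, symbols, iterations=20):
        """EM loop in the spirit of Lari and Young (1990): a full binary grammar
        over nonterminals X1..Xn, random initial weights, then repeated
        expected-count collection and relative-frequency re-estimation."""
        rules = [(a, b, c) for a in symbols for b in symbols for c in symbols]
        probs = normalize({rule: random.random() for rule in rules})
        for _ in range(iterations):
            counts = defaultdict(float)
            for sent in sentences:                      # E-step
                for rule, c in expected_rule_counts(sent, probs).items():
                    counts[rule] += c
            probs = normalize(dict(counts))             # M-step
        return probs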
15. Other Approaches
- Some earlier work in learning constituency:
- Adriaans, 99: Language grammars aren't general PCFGs
- Clark, 01: Mutual-information filters detect constituents, then an MDL-guided search assembles them
- van Zaanen, 00: Finds low edit-distance sentence pairs and extracts their differences
- GB/Minimalism: no results that I've seen
- Evaluation: fraction of nodes in gold trees correctly posited in proposed trees (unlabeled recall)
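The evaluation metric just described, sketched in Python over trees represented as sets of (start, end) spans per sentence:

    def unlabeled_recall(gold_trees, proposed_trees):
        """Fraction of (unlabeled) bracket spans in the gold trees that also
        appear in the proposed trees."""
        matched = total = 0
        for gold, proposed in zip(gold_trees, proposed_trees):
            matched += len(gold & proposed)
            total += len(gold)
        return matched / total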
16. Right-Branching Baseline
- English trees tend to be right-branching, not balanced
- A simple (English-specific) baseline is to choose the right-branching structure for each sentence
they were unwilling to agree to new terms
Method           Unlabeled recall
van Zaanen, 00   35.6
Right-Branch     46.4
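A sketch of the baseline: for an n-word sentence, the right-branching analysis is just the set of suffix spans.

    def right_branching_spans(sentence):
        """Right-branching bracketing of a sentence of n words:
        the spans (0, n), (1, n), ..., (n-2, n)."""
        n = len(sentence)
        return {(i, n) for i in range(n - 1)}

    # e.g. right_branching_spans("they were unwilling to agree to new terms".split())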
17. Inspiration: Distributional Clustering
◊ the president said that the downturn was over ◊
president / governor
the / a
said / reported
Finch and Chater 92, Schütze 93, many others
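A small Python sketch of the distributional idea: represent each word by counts of its (left neighbor, right neighbor) contexts and compare words by cosine similarity; clustering these vectors (e.g. with k-means) yields classes like president/governor or said/reported. The function names are illustrative, not from the cited papers.

    from collections import Counter, defaultdict
    import math

    def context_vectors(sentences, boundary="<S>"):
        """Map each word to a Counter over its (left word, right word) contexts."""
        vectors = defaultdict(Counter)
        for sent in sentences:
            padded = [boundary] + sent + [boundary]
            for i in range(1, len(padded) - 1):
                vectors[padded[i]][(padded[i - 1], padded[i + 1])] += 1
        return vectors

    def cosine(u, v):
        """Cosine similarity between two context Counters."""
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
        return dot / (norm(u) * norm(v)) if u and v else 0.0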
18. Idea: Distributional Syntax?
- Can we use distributional clustering for learning syntax?
◊ factory payrolls fell in september ◊
19. Problem: Identifying Constituents
Distributional classes are easy to find:
the final vote    two decades     most people
decided to        took most of    go with
of the            with a          without many
in the end        on time         for now
the final         the initial     two of the
20. A Nested Distributional Model
- We'd like a model that:
- Ties spans to linear contexts (like distributional clustering)
- Considers only proper tree structures (like a PCFG model)
- Has no symmetries to break (like a dependency model)
21. Constituent-Context Model (CCM)
◊ factory payrolls fell in september ◊
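A rough sketch of how the CCM scores a bracketing: every span of the sentence is labeled constituent or distituent, and both its yield and its linear context are generated conditioned on that label. The probability tables here are hypothetical placeholders.

    from math import log

    def ccm_log_score(words, brackets, p_span, p_context, boundary="<>"):
        """Sum log P(yield | label) + log P(context | label) over all spans,
        where label is 'constituent' if the span is in `brackets`, else
        'distituent'. p_span[label][yield] and p_context[label][(l, r)] are
        assumed probability tables."""
        padded = [boundary] + list(words) + [boundary]
        n = len(words)
        logp = 0.0
        for i in range(n):
            for j in range(i + 1, n + 1):
                label = "constituent" if (i, j) in brackets else "distituent"
                yield_ = tuple(words[i:j])
                context = (padded[i], padded[j + 1])
                logp += log(p_span[label][yield_]) + log(p_context[label][context])
        return logp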
22. Initialization: A little UG?
- Tree Uniform
- Split Uniform
23. Results: Constituency
[Figure: an example CCM parse alongside the Treebank parse]
24. A combination model
- What we've got:
- Two models of syntactic structure
- Each was the first to break its baseline
- Can we combine them?
- Yes, using a product model
- Which we've also used for supervised parsing (Klein and Manning 2003)
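The combination itself is simple: a joint structure is scored by the product of the two models' probabilities (the sum of their log scores), so a parse must look plausible to both the constituency and the dependency model. Assuming ccm_score and dmv_score are the earlier sketches already bound to their parameter tables:

    def combined_log_score(ccm_score, dmv_score, words, brackets, dep_tree):
        """Product-of-models combination: add the two log probabilities."""
        return ccm_score(words, brackets) + dmv_score(dep_tree)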
25. Combining the two models (Klein and Manning, ACL 2004)
- Supervised PCFG constituency recall is at 92.8
- Qualitative improvements:
- Subject-verb groups gone, modifier placement improved
[Charts: Dependency Evaluation and Constituency Evaluation]
26. Crosslinguistic applicability of the learning algorithm
[Chart: Constituency Evaluation across languages]
27. Most Common Errors (English)
- Overproposed Constituents
- Crossing Constituents
28. What Has Been Accomplished?
- Unsupervised learning
- Constituency structure
- Dependency structure
- Constituency recall
- Why it works
- Combination of simple models
- Representations designed for unsupervised learning
29. Case study 2: Unsupervised learning of linking to semantic roles
Mary opened the door with a key.
30. Semantic Role Labeling
- Supervised learning:
- Gildea & Jurafsky (2000)
- Pradhan et al. (2005)
- Punyakanok et al. (2005)
- Xue & Palmer (2004)
- Haghighi et al. (2005)
- etc.
- Unsupervised learning?
Mary opened the door with a key.
31. Why Unsupervised Might Work
Instances of open
The bottle opened easily.
She opened the door with a key.
The door opened as they approached.
Fortunately, the corkscrew opened the bottle.
He opened the bottle with a corkscrew.
She opened the bottle very carefully.
This key opens the door of the cottage.
He opened the door.
32. Why Unsupervised Might Work
Instances of open (same instances as on the previous slide)
33. Inputs and Outputs

Observed Data (verb: give), as (syntactic relation, head word, semantic role):
  subj   it/PRP     ?
  subj   bill/NN    ?
  obj1   power/NN   ?
  subj   they/PRP   ?
  obj1   stake/NN   ?
  subj   bill/NN    ?
  obj1   them/PRP   ?
  obj2   right/NN   ?

Learned Model (verb: give):
  Roles:
    0: it, he, bill, they, that, ...
    1: power, right, stake, ...
    2: them, it, him, dept., ...
  Linkings:
    0→subj, 1→obj2, 2→obj1    0.46
    0→subj, 1→obj1, 2→to      0.19
    0→subj, 1→obj1            0.05
34. Probabilistic Model
A deeper market plunge today could give them their first test.
  v (verb):     give
  l (linking):  0→subj, 1→obj2, 2→obj1
  o:            0→subj, M→np, 1→obj2, 2→obj1
35. Linking Model
- The space of possible linkings is large (combinatorial in |S| and |R|), where S is the set of syntactic relations and R is the set of semantic roles
- We're going to decompose the linking generation into a sequence of derivation steps
- Advantages:
- Smaller parameter space
- Admits a natural informative prior over linkings
36. Linking Construction Example
[Figure: ways to add role 1, ways to add role 2, then a deterministic function]
37. Learning Problem
- Given a set of observed verb instances, what are the most likely model parameters?
- A good application for EM!
- M-step:
- Trivial computation
- E-step:
- We need to compute conditional distributions over possible role vectors for each instance
- Given the syntactic relations, only a few role vectors yield possible linkings!
- Exact E-step is possible
[Figure: graphical model over v, l, o and, for each of the verb's dependents, r, s, w]
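A very rough Python sketch of the EM idea above, on a deliberately simplified model: each instance is just its tuple of syntactic relations, the hidden structure is an assignment of distinct roles to those relations, and the learned parameters are P(relation | role). This illustrates the E-step/M-step shape only, not the model in the paper.

    from collections import defaultdict
    from itertools import permutations
    from math import prod

    def role_linking_em(instances, roles, iterations=20):
        """instances: iterable of relation tuples, e.g. ('subj', 'obj1').
        roles: list of role labels, e.g. [0, 1, 2]."""
        params = defaultdict(lambda: 1.0)  # optimistic uniform start
        for _ in range(iterations):
            counts = defaultdict(float)
            for relations in instances:
                # E-step: enumerate the role vectors consistent with the relations.
                candidates = list(permutations(roles, len(relations)))
                if not candidates:
                    continue
                scores = [prod(params[(role, rel)] for role, rel in zip(cand, relations))
                          for cand in candidates]
                z = sum(scores)
                for cand, s in zip(candidates, scores):
                    for role, rel in zip(cand, relations):
                        counts[(role, rel)] += s / z
            # M-step: relative-frequency re-estimation per role.
            totals = defaultdict(float)
            for (role, rel), c in counts.items():
                totals[role] += c
            params = defaultdict(float, {k: c / totals[k[0]] for k, c in counts.items()})
        return dict(params)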
38. Datasets and SRL Evaluation
- Training sets:
- Penn Treebank: 1M words of WSJ, annotated by humans with parse trees and semantic roles
- BLLIP: 30M words of WSJ, parsed by the Charniak parser
- Gigaword: 1700M words of newswire, parsed by the Stanford parser
- Development set: Propbank WSJ section 24
- 911 verb types, 2820 verb instances
- Test set: Propbank WSJ section 23
- 1067 verb types, 4315 verb instances
- SRL evaluation:
- Learned role names don't correspond to Propbank names: greedy mapping of learned roles to gold roles
- However, we do map learned adjunct roles to ARGM
- Coarse roles only (don't distinguish types of adjuncts)
- Baseline: subj→arg0, dobj→arg1, iobj→arg2, rest→argm
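A sketch of the greedy mapping used for evaluation, assuming we have collected (learned role, gold role) co-occurrence pairs on the development data (the exact tie-breaking in the real evaluation may differ):

    from collections import Counter

    def greedy_role_mapping(pairs):
        """Map each learned role to the gold role it co-occurs with most often,
        taking the largest co-occurrence counts first."""
        cooc = Counter(pairs)
        mapping = {}
        for (learned, gold), _ in cooc.most_common():
            if learned not in mapping:
                mapping[learned] = gold
        return mapping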
39. Results (Coarse Roles, Sec. 23): Grenager and Manning 2005
[Chart: classification accuracy on coarse roles; bars at 85.6, 89.7, 93.4, and 98.3, with a comparison to Toutanova 2005]
40. Improved Verbs

Verb: give
  Linkings:
    0→subj, 1→obj2, 2→obj1    0.46
    0→subj, 1→obj1, 2→to      0.19
    0→subj, 1→obj1            0.05
  Roles:
    0: it, he, bill, they, that, ...
    1: power, right, stake, ...
    2: them, it, him, dept., ...

Verb: pay
  Linkings:
    0→subj, 1→obj1            0.32
    0→subj, 1→obj1, 2→for     0.21
    0→subj                    0.07
    0→subj, 1→obj1, 2→to      0.05
    0→subj, 1→obj2, 2→obj1    0.05
  Roles:
    0: it, they, company, he, ...
    1: ..., bill, price, tax
    2: stake, gov., share, amt., ...
41. Conclusions
- We can learn a model of verb linking behavior without labeled data
- It learns many of the linkings that linguists propose for each verb
- While labeling accuracy is below supervised systems, it is well above informed baselines
- Good results require careful design of the model structure
- While this remains to be done, composing the two models discussed here would give the basis of a mapping from plain text to a basic semantic representation of the text.
42. Language Acquisition Devices
- Highly successful
- Results still unclear
43. Grammar Induction: What hasn't been accomplished?
- Learning languages where the primary syntax is in the morphology (noun case, verb agreement) rather than linear order
- What is the deal with Chinese?
- Chinese has far fewer function words
- Chinese has far fewer pro-forms
- (It's not morphology: Chinese has less of that too)
- Extending from syntactic to semantic induction
44. Semantic role learning discussion
- Clustering of roles for pay is clearly wrong:
- pay to X vs. pay for Y
- must have X ≠ Y
- Sparsity of head word observations:
- Inter-verb clustering?
- Use of external resources?
- No modeling of alternative senses of verbs:
- leave Mary with the gift vs. leave Mary alone
45. Worsened Verbs

Verb: close
  Linkings:
    0→subj, 2→in, 3→at              0.24
    0→subj, 3→at                    0.18
    0→subj, 2→in                    0.11
    0→subj, 1→obj1, 2→in, 3→at      0.10
  Roles:
    0: share, stock, it, market, ...
    1: yesterday, unchanged, ...
    2: trading, exchange, volume
    3: ..., cent, high, penny, ...

Verb: leave
  Linkings:
    0→subj, 1→obj1    0.57
    0→subj            0.18
    0→subj, 1→cl1     0.12
  Roles:
    0: he, they, it, I, that, this, ...
    1: company, it, office, ...
    2: stake, alley, impression

He left Mary with the gift. vs. He left Mary alone.
In national trading, SFE shares closed yesterday at 31.25 cents a share, up 6.25 cents.
46. Desiderata: Practical Learnability
- To be practically learnable, models should:
- Be as simple as possible
- Make symmetries self-breaking whenever possible
- Avoid hidden structures which are not directly coupled to surface phenomena