Title: Unsupervised Learning of Syntactic Structure
1. Unsupervised Learning of Syntactic Structure
- Christopher Manning
- (work with Dan Klein and Teg Grenager)
- Depts of CS and Linguistics
- Stanford University
2. Probabilistic models of language
- Everyone knows that language is variable (Sapir 1921)
- Probabilistic models give precise descriptions of a variable, uncertain world
- The choice for language isn't a dichotomy between rules and neural networks
- Probabilistic models can be used over rich linguistic representations
- They support inference and learning
- There's not much evidence of a poverty of the stimulus preventing them being used
3. Poverty of the stimulus?
- There's a lot of stimulus: Baayen estimates children hear 200 million words by adulthood
- c. 1,000,000 sentences a year
- Who says seeing something 500 times a year isn't enough?
- Much more sign of other poverties:
- Poverty of imagination
- Poverty of model building
- Poverty of understanding of (machine) learning among linguists
- At any rate, an obvious way to investigate this question is empirically.
4. Linguistics: Is it time for a change?
- The corollary of noting that all the ideas of Chomsky through Minimalism are presaged in LSLT is that all these ideas are 50 years old.
- In the 1950s, we didn't know much about cognitive science
- In the 1950s, we didn't know much about human or, especially, machine learning
- It's time to make a few changes
- Others are making them; if linguistics doesn't, it might be ignored.
5. Gold (1967)
- Gold: no superfinite class of languages (e.g., regular or context-free languages) is learnable without negative examples.
- Conditions: a nearly arbitrary sequence of examples; the only constraint is that no sentence may be withheld from the learner indefinitely.
- Still regularly cited as bedrock for innatist linguistics
- Responses suggested by Gold:
- Subtle, covert negative evidence → some recent claims
- Innate knowledge shrinks the language class → Chomsky
- The assumption about presentation of examples is too general → e.g., a probabilistic language model
6. Horning (1969)
- If texts are generated by a stochastic process, how often something occurs can drive language acquisition
- As time goes by without seeing something, we have evidence that it either doesn't happen or is very rare
- Implicit negative evidence: you can't withhold common stuff
- Horning: stochastic context-free languages are learnable from only positive examples.
- But Horning's proof is enumerative, rather than providing a plausible grammar learning method
- (See, e.g., Rohde and Plaut 1999 for discussion)
- Here we provide two case studies:
- Phrase structure (and dependencies) learning
- Learning semantic roles underlying surface syntax
7. Case study 1: Grammar induction (learning phrase structure)
- Start with raw text, learn syntactic structure
- Some have argued that learning syntax from positive data alone is impossible:
- Gold, 1967: non-identifiability in the limit
- Chomsky, 1980: the poverty of the stimulus
- Many others have felt it should be possible:
- Lari and Young, 1990
- Carroll and Charniak, 1992
- Alex Clark, 2001
- Mark Paskin, 2001
- ... but it is a hard problem
8. Idea: Lexical Affinity Models
- Words select other words on syntactic grounds
- Link up pairs with high mutual information
- Yuret, 1998: greedy linkage
- Paskin, 2001: iterative re-estimation with EM
- Evaluation: compare linked pairs to a gold standard
congress narrowly passed the amended bill
Method          Accuracy
Paskin, 2001    39.7
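A minimal Python sketch of the mutual-information linking idea (an illustration only, not Yuret's or Paskin's exact algorithm; keeping the top-scoring pairs per sentence is a simplifying assumption):

    from collections import Counter
    from itertools import combinations
    from math import log

    def pmi_links(sentences, links_per_sentence=3):
        """Score word pairs by pointwise mutual information estimated from the
        corpus, then greedily keep the highest-scoring pairs in each sentence."""
        word_counts = Counter()
        pair_counts = Counter()
        for sent in sentences:
            word_counts.update(sent)
            pair_counts.update(frozenset(p) for p in combinations(sent, 2) if p[0] != p[1])
        n_words = sum(word_counts.values())
        n_pairs = sum(pair_counts.values())

        def pmi(w1, w2):
            p_joint = pair_counts[frozenset((w1, w2))] / n_pairs
            return log(p_joint / ((word_counts[w1] / n_words) * (word_counts[w2] / n_words)))

        linked = []
        for sent in sentences:
            pairs = [p for p in combinations(sent, 2) if p[0] != p[1]]
            pairs.sort(key=lambda p: pmi(*p), reverse=True)
            linked.append(pairs[:links_per_sentence])
        return linked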
9. Problem: Non-Syntactic Affinity
- Mutual information between words does not necessarily indicate syntactic selection.
congress narrowly passed the amended bill
expect brushbacks but no beanballs
a new year begins in new york
10. Idea: Word Classes
- Individual words like congress are entwined with semantic facts about the world.
- Syntactic classes, like NOUN and ADVERB, are bleached of word-specific semantics.
- Automatic word classes are more likely to look like DAYS-OF-WEEK or PERSON-NAME.
- We could build dependency models over word classes (cf. Carroll and Charniak, 1992)
NOUN     ADVERB    VERB    DET  PARTICIPLE  NOUN
congress narrowly  passed  the  amended     bill
11. Problems: Word Class Models
- Issues:
- Too simple a model: doesn't work much better even when supervised
- No representation of valence (number of arguments)
Method                      Accuracy
Random                      41.7
Carroll and Charniak, 92    44.7
Adjacent Words              53.2
NOUN  NOUN    VERB
stock prices  fell
12. Bias: Using more sophisticated dependency representations
[Figure: a head selecting an argument (head → arg)]
Method             Accuracy
Adjacent Words     55.9
Our Model (DMV)    63.6
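The dependency model with valence (DMV) lets each head decide whether to stop or take another argument in each direction, conditioned on whether it has taken one there already. A rough Python sketch of that generative story (the parameter tables and the flat tree encoding here are simplifying assumptions, not the published model's exact form):

    from math import log

    def dmv_log_prob(tree, params):
        """Score a dependency analysis under DMV-style parameters:
        params['stop'][(head, direction, adjacent)] = P(STOP | head, dir, adj)
        params['choose'][(head, direction)][arg]    = P(arg | head, dir)
        `tree` maps each head to its (left_args, right_args), in outward order."""
        logp = 0.0
        for head, (left_args, right_args) in tree.items():
            for direction, args in (("left", left_args), ("right", right_args)):
                for i, arg in enumerate(args):
                    adjacent = (i == 0)  # no argument taken yet in this direction
                    logp += log(1.0 - params["stop"][(head, direction, adjacent)])
                    logp += log(params["choose"][(head, direction)][arg])
                logp += log(params["stop"][(head, direction, len(args) == 0)])
        return logp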
13. Constituency Parsing
- Model still doesn't directly use constituency
- Constituency structure gives boundaries (useful for Machine Translation and Information Extraction)
14. Idea: Learn PCFGs with EM
- Classic experiments on learning Probabilistic CFGs with Expectation-Maximization (Lari and Young, 1990)
- Full binary grammar over n symbols
- Parse randomly at first
- Re-estimate rule probabilities off parses
- Repeat
- Their conclusion: it doesn't work at all!
X1, X2, ..., Xn
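A schematic Python version of that procedure, with the inside-outside E-step left as a stub (the helper names here are placeholders, not code from Lari and Young):

    import random
    from collections import defaultdict

    def normalize(probs):
        """Renormalize rule probabilities so each left-hand side sums to 1."""
        totals = defaultdict(float)
        for (lhs, _, _), p in probs.items():
            totals[lhs] += p
        return {rule: p / totals[rule[0]] for rule, p in probs.items()}

    def expected_rule_counts(sentence, probs):
        """Placeholder for the inside-outside algorithm, which computes the
        expected number of times each rule is used in parses of the sentence."""
        raise NotImplementedError

    def pcfg_em(sentences, symbols, iterations=20):
        """EM loop in the spirit of Lari and Young (1990): a full binary grammar
        over nonterminals X1..Xn, random initial weights, then repeated
        expected-count collection and relative-frequency re-estimation."""
        rules = [(a, b, c) for a in symbols for b in symbols for c in symbols]
        probs = normalize({rule: random.random() for rule in rules})
        for _ in range(iterations):
            counts = defaultdict(float)
            for sent in sentences:                      # E-step
                for rule, c in expected_rule_counts(sent, probs).items():
                    counts[rule] += c
            probs = normalize(dict(counts))             # M-step
        return probs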
15. Other Approaches
- Some earlier work in learning constituency:
- Adriaans, 99: Language grammars aren't general PCFGs
- Clark, 01: Mutual-information filters detect constituents, then an MDL-guided search assembles them
- van Zaanen, 00: Finds low edit-distance sentence pairs and extracts their differences
- GB/Minimalism: no results that I've seen
- Evaluation: fraction of nodes in gold trees correctly posited in proposed trees (unlabeled recall)
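The evaluation metric just described, sketched in Python over trees represented as sets of (start, end) spans per sentence:

    def unlabeled_recall(gold_trees, proposed_trees):
        """Fraction of (unlabeled) bracket spans in the gold trees that also
        appear in the proposed trees."""
        matched = total = 0
        for gold, proposed in zip(gold_trees, proposed_trees):
            matched += len(gold & proposed)
            total += len(gold)
        return matched / total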
16. Right-Branching Baseline
- English trees tend to be right-branching, not balanced
- A simple (English-specific) baseline is to choose the right-branching structure for each sentence
they were unwilling to agree to new terms
Method           Unlabeled recall
van Zaanen, 00   35.6
Right-Branch     46.4
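A sketch of the baseline: for an n-word sentence, the right-branching analysis is just the set of suffix spans.

    def right_branching_spans(sentence):
        """Right-branching bracketing of a sentence of n words:
        the spans (0, n), (1, n), ..., (n-2, n)."""
        n = len(sentence)
        return {(i, n) for i in range(n - 1)}

    # e.g. right_branching_spans("they were unwilling to agree to new terms".split())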
17. Inspiration: Distributional Clustering
◊ the president said that the downturn was over ◊
president / governor
the / a
said / reported
Finch and Chater 92, Schütze 93, many others
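A small Python sketch of the distributional idea: represent each word by counts of its (left neighbor, right neighbor) contexts and compare words by cosine similarity; clustering these vectors (e.g. with k-means) yields classes like president/governor or said/reported. The function names are illustrative, not from the cited papers.

    from collections import Counter, defaultdict
    import math

    def context_vectors(sentences, boundary="<S>"):
        """Map each word to a Counter over its (left word, right word) contexts."""
        vectors = defaultdict(Counter)
        for sent in sentences:
            padded = [boundary] + sent + [boundary]
            for i in range(1, len(padded) - 1):
                vectors[padded[i]][(padded[i - 1], padded[i + 1])] += 1
        return vectors

    def cosine(u, v):
        """Cosine similarity between two context Counters."""
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
        return dot / (norm(u) * norm(v)) if u and v else 0.0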
18. Idea: Distributional Syntax?
- Can we use distributional clustering for learning syntax?
◊ factory payrolls fell in september ◊
19. Problem: Identifying Constituents
Distributional classes are easy to find:
the final vote    two decades     most people
decided to        took most of    go with
of the            with a          without many
in the end        on time         for now
the final         the initial     two of the
20. A Nested Distributional Model
- We'd like a model that:
- Ties spans to linear contexts (like distributional clustering)
- Considers only proper tree structures (like a PCFG model)
- Has no symmetries to break (like a dependency model)
21. Constituent-Context Model (CCM)
◊ factory payrolls fell in september ◊
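A rough sketch of how the CCM scores a bracketing: every span of the sentence is labeled constituent or distituent, and both its yield and its linear context are generated conditioned on that label. The probability tables here are hypothetical placeholders.

    from math import log

    def ccm_log_score(words, brackets, p_span, p_context, boundary="<>"):
        """Sum log P(yield | label) + log P(context | label) over all spans,
        where label is 'constituent' if the span is in `brackets`, else
        'distituent'. p_span[label][yield] and p_context[label][(l, r)] are
        assumed probability tables."""
        padded = [boundary] + list(words) + [boundary]
        n = len(words)
        logp = 0.0
        for i in range(n):
            for j in range(i + 1, n + 1):
                label = "constituent" if (i, j) in brackets else "distituent"
                yield_ = tuple(words[i:j])
                context = (padded[i], padded[j + 1])
                logp += log(p_span[label][yield_]) + log(p_context[label][context])
        return logp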
22. Initialization: A little UG?
- Tree Uniform
- Split Uniform
23. Results: Constituency
[Figure: an example CCM parse alongside the Treebank parse]
24. A combination model
- What we've got:
- Two models of syntactic structure
- Each was the first to break its baseline
- Can we combine them?
- Yes, using a product model
- Which we've also used for supervised parsing (Klein and Manning 2003)
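The combination itself is simple: a joint structure is scored by the product of the two models' probabilities (the sum of their log scores), so a parse must look plausible to both the constituency and the dependency model. Assuming ccm_score and dmv_score are the earlier sketches already bound to their parameter tables:

    def combined_log_score(ccm_score, dmv_score, words, brackets, dep_tree):
        """Product-of-models combination: add the two log probabilities."""
        return ccm_score(words, brackets) + dmv_score(dep_tree)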
25. Combining the two models (Klein and Manning, ACL 2004)
- Supervised PCFG constituency recall is at 92.8
- Qualitative improvements:
- Subject-verb groups gone, modifier placement improved
[Charts: Dependency Evaluation and Constituency Evaluation]
26. Crosslinguistic applicability of the learning algorithm
[Chart: Constituency Evaluation across languages]
27. Most Common Errors (English)
- Overproposed Constituents
- Crossing Constituents
28. What Has Been Accomplished?
- Unsupervised learning
- Constituency structure
- Dependency structure
- Constituency recall
- Why it works
- Combination of simple models
- Representations designed for unsupervised learning
29. Case study 2: Unsupervised learning of linking to semantic roles
Mary opened the door with a key.
30. Semantic Role Labeling
- Supervised learning:
- Gildea & Jurafsky (2000)
- Pradhan et al. (2005)
- Punyakanok et al. (2005)
- Xue & Palmer (2004)
- Haghighi et al. (2005)
- etc.
- Unsupervised learning?
Mary opened the door with a key.
31. Why Unsupervised Might Work
Instances of open
The bottle opened easily.
She opened the door with a key.
The door opened as they approached.
Fortunately, the corkscrew opened the bottle.
He opened the bottle with a corkscrew.
She opened the bottle very carefully.
This key opens the door of the cottage.
He opened the door.
32. Why Unsupervised Might Work
Instances of open (same instances as on the previous slide)
33. Inputs and Outputs

Observed Data (verb: give), as (syntactic relation, head word, semantic role):
  subj   it/PRP     ?
  subj   bill/NN    ?
  obj1   power/NN   ?
  subj   they/PRP   ?
  obj1   stake/NN   ?
  subj   bill/NN    ?
  obj1   them/PRP   ?
  obj2   right/NN   ?

Learned Model (verb: give):
  Roles:
    0: it, he, bill, they, that, ...
    1: power, right, stake, ...
    2: them, it, him, dept., ...
  Linkings:
    0→subj, 1→obj2, 2→obj1    0.46
    0→subj, 1→obj1, 2→to      0.19
    0→subj, 1→obj1            0.05
34. Probabilistic Model
A deeper market plunge today could give them their first test.
  v (verb):     give
  l (linking):  0→subj, 1→obj2, 2→obj1
  o:            0→subj, M→np, 1→obj2, 2→obj1
35. Linking Model
- The space of possible linkings is large (combinatorial in |S| and |R|), where S is the set of syntactic relations and R is the set of semantic roles
- We're going to decompose the linking generation into a sequence of derivation steps
- Advantages:
- Smaller parameter space
- Admits a natural informative prior over linkings
36. Linking Construction Example
[Figure: ways to add role 1, ways to add role 2, then a deterministic function]
37. Learning Problem
- Given a set of observed verb instances, what are the most likely model parameters?
- A good application for EM!
- M-step:
- Trivial computation
- E-step:
- We need to compute conditional distributions over possible role vectors for each instance
- Given the syntactic relations, only a few role vectors yield possible linkings!
- Exact E-step is possible
[Figure: graphical model over v, l, o and, for each of the verb's dependents, r, s, w]
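A very rough Python sketch of the EM idea above, on a deliberately simplified model: each instance is just its tuple of syntactic relations, the hidden structure is an assignment of distinct roles to those relations, and the learned parameters are P(relation | role). This illustrates the E-step/M-step shape only, not the model in the paper.

    from collections import defaultdict
    from itertools import permutations
    from math import prod

    def role_linking_em(instances, roles, iterations=20):
        """instances: iterable of relation tuples, e.g. ('subj', 'obj1').
        roles: list of role labels, e.g. [0, 1, 2]."""
        params = defaultdict(lambda: 1.0)  # optimistic uniform start
        for _ in range(iterations):
            counts = defaultdict(float)
            for relations in instances:
                # E-step: enumerate the role vectors consistent with the relations.
                candidates = list(permutations(roles, len(relations)))
                if not candidates:
                    continue
                scores = [prod(params[(role, rel)] for role, rel in zip(cand, relations))
                          for cand in candidates]
                z = sum(scores)
                for cand, s in zip(candidates, scores):
                    for role, rel in zip(cand, relations):
                        counts[(role, rel)] += s / z
            # M-step: relative-frequency re-estimation per role.
            totals = defaultdict(float)
            for (role, rel), c in counts.items():
                totals[role] += c
            params = defaultdict(float, {k: c / totals[k[0]] for k, c in counts.items()})
        return dict(params)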
38. Datasets and SRL Evaluation
- Training sets:
- Penn Treebank: 1M words of WSJ, annotated by humans with parse trees and semantic roles
- BLLIP: 30M words of WSJ, parsed by the Charniak parser
- Gigaword: 1700M words of newswire, parsed by the Stanford parser
- Development set: Propbank WSJ section 24
- 911 verb types, 2820 verb instances
- Test set: Propbank WSJ section 23
- 1067 verb types, 4315 verb instances
- SRL evaluation:
- Learned role names don't correspond to Propbank names: greedy mapping of learned roles to gold roles
- However, we do map learned adjunct roles to ARGM
- Coarse roles only (don't distinguish types of adjuncts)
- Baseline: subj→arg0, dobj→arg1, iobj→arg2, rest→argm
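A sketch of the greedy mapping used for evaluation, assuming we have collected (learned role, gold role) co-occurrence pairs on the development data (the exact tie-breaking in the real evaluation may differ):

    from collections import Counter

    def greedy_role_mapping(pairs):
        """Map each learned role to the gold role it co-occurs with most often,
        taking the largest co-occurrence counts first."""
        cooc = Counter(pairs)
        mapping = {}
        for (learned, gold), _ in cooc.most_common():
            if learned not in mapping:
                mapping[learned] = gold
        return mapping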
39. Results (Coarse Roles, Sec. 23): Grenager and Manning 2005
[Chart: classification accuracy on coarse roles; bars at 85.6, 89.7, 93.4, and 98.3, with a comparison to Toutanova 2005]
40. Improved Verbs

Verb: give
  Linkings:
    0→subj, 1→obj2, 2→obj1    0.46
    0→subj, 1→obj1, 2→to      0.19
    0→subj, 1→obj1            0.05
  Roles:
    0: it, he, bill, they, that, ...
    1: power, right, stake, ...
    2: them, it, him, dept., ...

Verb: pay
  Linkings:
    0→subj, 1→obj1            0.32
    0→subj, 1→obj1, 2→for     0.21
    0→subj                    0.07
    0→subj, 1→obj1, 2→to      0.05
    0→subj, 1→obj2, 2→obj1    0.05
  Roles:
    0: it, they, company, he, ...
    1: ..., bill, price, tax
    2: stake, gov., share, amt., ...
41. Conclusions
- We can learn a model of verb linking behavior without labeled data
- It learns many of the linkings that linguists propose for each verb
- While labeling accuracy is below supervised systems, it is well above informed baselines
- Good results require careful design of the model structure
- While this remains to be done, composing the two models discussed here would give the basis of a mapping from plain text to a basic semantic representation of the text.
42. Language Acquisition Devices
- Highly successful
- Results still unclear
43. Grammar Induction: What hasn't been accomplished?
- Learning languages where the primary syntax is in the morphology (noun case, verb agreement) rather than linear order
- What is the deal with Chinese?
- Chinese has far fewer function words
- Chinese has far fewer pro-forms
- (It's not morphology: Chinese has less of that too)
- Extending from syntactic to semantic induction
44. Semantic role learning discussion
- Clustering of roles for pay is clearly wrong:
- pay to X vs. pay for Y
- must have X ≠ Y
- Sparsity of head word observations:
- Inter-verb clustering?
- Use of external resources?
- No modeling of alternative senses of verbs:
- leave Mary with the gift vs. leave Mary alone
45. Worsened Verbs

Verb: close
  Linkings:
    0→subj, 2→in, 3→at              0.24
    0→subj, 3→at                    0.18
    0→subj, 2→in                    0.11
    0→subj, 1→obj1, 2→in, 3→at      0.10
  Roles:
    0: share, stock, it, market, ...
    1: yesterday, unchanged, ...
    2: trading, exchange, volume
    3: ..., cent, high, penny, ...

Verb: leave
  Linkings:
    0→subj, 1→obj1    0.57
    0→subj            0.18
    0→subj, 1→cl1     0.12
  Roles:
    0: he, they, it, I, that, this, ...
    1: company, it, office, ...
    2: stake, alley, impression

He left Mary with the gift. vs. He left Mary alone.
In national trading, SFE shares closed yesterday at 31.25 cents a share, up 6.25 cents.
46. Desiderata: Practical Learnability
- To be practically learnable, models should:
- Be as simple as possible
- Make symmetries self-breaking whenever possible
- Avoid hidden structures which are not directly coupled to surface phenomena