Unsupervised Learning of Syntactic Structure - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Unsupervised Learning of Syntactic Structure
  • Christopher Manning
  • (work with Dan Klein and Teg Grenager)
  • Depts of CS and Linguistics
  • Stanford University

2
Probabilistic models of language
  • Everyone knows that language is variable
  • Sapir (1921)
  • Probabilistic models give precise descriptions of
    a variable, uncertain world
  • The choice for language isn't a dichotomy between
    rules and neural networks
  • Probabilistic models can be used over rich
    linguistic representations
  • They support inference and learning
  • There's not much evidence of a poverty of the
    stimulus preventing them from being used

3
Poverty of the stimulus?
  • There's a lot of stimulus: Baayen estimates that
    children hear 200 million words by adulthood
  • c. 1,000,000 sentences a year
  • Who says seeing something 500 times a year isn't
    enough?
  • Much more sign of other poverties
  • Poverty of imagination
  • Poverty of model building
  • Poverty of understanding of (machine) learning
    among linguists
  • At any rate, an obvious way to investigate this
    question is empirically.

4
Linguistics: Is it time for a change?
  • The corollary of noting that all the ideas of
    Chomsky through Minimalism are presaged in LSLT
    is that all these ideas are 50 years old.
  • In the 1950s, we didn't know much about cognitive
    science
  • In the 1950s, we didn't know much about human or,
    especially, machine learning
  • It's time to make a few changes
  • Others are making them; if linguistics doesn't,
    it might be ignored.

5
Gold (1967)
  • Gold: no superfinite class of languages (e.g.,
    regular or context-free languages, etc.) is
    learnable without negative examples.
  • Conditions: a nearly arbitrary sequence of
    examples; the only constraint is that no sentence
    may be withheld from the learner indefinitely.
  • Still regularly cited as bedrock for innatist
    linguistics
  • Responses suggested by Gold:
  • Subtle, covert negative evidence → some recent
    claims
  • Innate knowledge shrinks language class → Chomsky
  • Assumption about presentation of examples is too
    general → e.g., probabilistic language model

6
Horning (1969)
  • If texts are generated by a stochastic process,
    how often something occurs can drive language
    acquisition
  • As time goes by without seeing something, we have
    evidence that it either doesn't happen or is very
    rare
  • Implicit negative evidence: you can't withhold
    common stuff
  • Horning: stochastic context-free languages are
    learnable from only positive examples.
  • But Horning's proof is enumerative, rather than
    providing a plausible grammar learning method
  • (See, e.g., Rohde and Plaut 1999 for discussion)
  • Here we provide two case studies
  • Phrase structure (and dependencies) learning
  • Learning semantic roles underlying surface syntax

7
1. Grammar induction: Learning phrase structure
  • Start with raw text, learn syntactic structure
  • Some have argued that learning syntax from
    positive data alone is impossible
  • Gold, 1967: Non-identifiability in the limit
  • Chomsky, 1980: The poverty of the stimulus
  • Many others have felt it should be possible
  • Lari and Young, 1990
  • Carroll and Charniak, 1992
  • Alex Clark, 2001
  • Mark Paskin, 2001
  • but it is a hard problem

8
Idea: Lexical Affinity Models
  • Words select other words on syntactic grounds
  • Link up pairs with high mutual information
  • Yuret, 1998: Greedy linkage (see the sketch below)
  • Paskin, 2001: Iterative re-estimation with EM
  • Evaluation: compare linked pairs to a gold
    standard

congress narrowly passed the amended bill

Method          Accuracy
Paskin, 2001    39.7
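
As a concrete illustration of the lexical-affinity idea, here is a minimal sketch in the spirit of Yuret 1998, but not his algorithm: it ignores the no-crossing constraint, and the function names are illustrative. It estimates pointwise mutual information from same-sentence co-occurrence and then links word pairs greedily.

# Sketch only: PMI from same-sentence co-occurrence, then greedy linkage.
import math
from collections import Counter
from itertools import combinations

def make_affinity(sentences):
    """Return a PMI function over word pairs, estimated from co-occurrence
    within the same sentence (a crude stand-in for a real dependency model)."""
    word_counts, pair_counts = Counter(), Counter()
    n_words, n_pairs = 0, 0
    for sent in sentences:
        n_words += len(sent)
        word_counts.update(sent)
        for w1, w2 in combinations(sent, 2):
            pair_counts[frozenset((w1, w2))] += 1
            n_pairs += 1
    def pmi(w1, w2):
        joint = pair_counts[frozenset((w1, w2))]
        if joint == 0:
            return float("-inf")
        p_pair = joint / n_pairs
        p1, p2 = word_counts[w1] / n_words, word_counts[w2] / n_words
        return math.log(p_pair / (p1 * p2))
    return pmi

def greedy_links(sent, pmi):
    """Greedily accept the highest-PMI pairs until every word is linked."""
    links, linked = [], set()
    pairs = sorted(combinations(range(len(sent)), 2),
                   key=lambda ij: pmi(sent[ij[0]], sent[ij[1]]),
                   reverse=True)
    for i, j in pairs:
        if i not in linked or j not in linked:
            links.append((i, j))
            linked.update((i, j))
    return links
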
9
Problem: Non-Syntactic Affinity
  • Mutual information between words does not
    necessarily indicate syntactic selection.

congress narrowly passed the amended bill
expect brushbacks but no beanballs
a new year begins in new york
10
Idea: Word Classes
  • Individual words like congress are entwined with
    semantic facts about the world.
  • Syntactic classes, like NOUN and ADVERB are
    bleached of word-specific semantics.
  • Automatic word classes are more likely to look
    like DAYS-OF-WEEK or PERSON-NAME.
  • We could build dependency models over word
    classes (cf. Carroll and Charniak, 1992); see the
    sketch below.

NOUN      ADVERB    VERB    DET  PARTICIPLE  NOUN
congress  narrowly  passed  the  amended     bill
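
A minimal sketch of what a dependency model over word classes could look like, assuming class-tagged (head_class, arg_class, direction) triples read off candidate parses are already available; the function name and triple format are illustrative, not Carroll and Charniak's formulation.

# Illustrative sketch only (not Carroll and Charniak's model): estimate a
# class-to-class dependency model P(argument class | head class, direction).
from collections import Counter, defaultdict

def class_dependency_model(triples):
    counts = defaultdict(Counter)
    for head_class, arg_class, direction in triples:   # direction: "left" or "right"
        counts[head_class, direction][arg_class] += 1
    return {key: {arg: c / sum(ctr.values()) for arg, c in ctr.items()}
            for key, ctr in counts.items()}

# e.g. model = class_dependency_model([("VERB", "NOUN", "left"), ("VERB", "ADVERB", "left")])
#      model["VERB", "left"]["NOUN"]  ->  P(NOUN | VERB head, left attachment) = 0.5
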
11
Problems: Word Class Models
  • Issues
  • Too simple a model: doesn't work much better even
    when supervised
  • No representation of valence (number of
    arguments)

Method                      Accuracy
Random                      41.7
Carroll and Charniak, 92    44.7
Adjacent Words              53.2

NOUN   NOUN     VERB
stock  prices   fell
12
Bias: Using more sophisticated dependency
representations

[Figure: head-argument dependency representation]

Method            Accuracy
Adjacent Words    55.9
Our Model (DMV)   63.6
13
Constituency Parsing
  • Model still doesn't directly use constituency
  • Constituency structure gives boundaries

Machine Translation
Information Extraction
14
Idea: Learn PCFGs with EM
  • Classic experiments on learning Probabilistic
    CFGs with Expectation-Maximization [Lari and
    Young, 1990]
  • Full binary grammar over n symbols
  • Parse randomly at first
  • Re-estimate rule probabilities off parses
  • Repeat (see the sketch below)
  • Their conclusion: it doesn't work at all!

X1, X2, ..., Xn
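
The "parse, re-estimate, repeat" recipe above can be sketched as code. The following is a hard-EM (Viterbi EM) toy version rather than Lari and Young's inside-outside re-estimation: rule scores start random ("parse randomly at first"), each sentence is re-parsed with CKY, and binary rule probabilities are re-estimated from the resulting parses. The function names, the choice of X0 as root, the add-one smoothing, and the fact that lexical rules are never re-estimated are all simplifying assumptions.

import math
import random
from collections import defaultdict

def viterbi(sent, binr, lexr, n):
    """CKY Viterbi parse; returns the binary rules (a, b, c) used in the best tree."""
    L = len(sent)
    best, back = {}, {}
    for i, w in enumerate(sent):
        for a in range(n):
            best[i, i + 1, a] = math.log(lexr[a][w])
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span
            for a in range(n):
                for k in range(i + 1, j):
                    for b in range(n):
                        for c in range(n):
                            s = (math.log(binr[a][b, c])
                                 + best[i, k, b] + best[k, j, c])
                            if s > best.get((i, j, a), float("-inf")):
                                best[i, j, a], back[i, j, a] = s, (k, b, c)
    used = []
    def walk(i, j, a):                 # read off the rules of the best parse
        if j - i > 1:
            k, b, c = back[i, j, a]
            used.append((a, b, c))
            walk(i, k, b)
            walk(k, j, c)
    walk(0, L, 0)                      # assume X0 is the root symbol
    return used

def hard_em(corpus, n, iters=10):
    """'Parse randomly at first' (random rule scores), then re-estimate and repeat."""
    binr = [defaultdict(lambda: random.uniform(0.1, 1.0)) for _ in range(n)]  # Xa -> Xb Xc
    lexr = [defaultdict(lambda: random.uniform(0.1, 1.0)) for _ in range(n)]  # Xa -> word
    for _ in range(iters):
        counts = defaultdict(float)
        for sent in corpus:
            for rule in viterbi(sent, binr, lexr, n):
                counts[rule] += 1.0
        for a in range(n):             # re-estimate binary rule probabilities off parses
            total = sum(counts[a, b, c] for b in range(n) for c in range(n)) + n * n
            for b in range(n):
                for c in range(n):
                    binr[a][b, c] = (counts[a, b, c] + 1.0) / total   # add-one smoothing
    return binr, lexr
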
15
Other Approaches
  • Some earlier work in learning constituency
  • Adriaans, 99: Language grammars aren't general
    PCFGs
  • Clark, 01: Mutual-information filters detect
    constituents, then an MDL-guided search assembles
    them
  • van Zaanen, 00: Finds low edit-distance
    sentence pairs and extracts their differences
  • GB/Minimalism: No results that I've seen
  • Evaluation: fraction of nodes in gold trees
    correctly posited in proposed trees (unlabeled
    recall)

16
Right-Branching Baseline
  • English trees tend to be right-branching, not
    balanced
  • A simple (English-specific) baseline is to choose
    the right-branching structure for each sentence

they were unwilling to agree to new terms
Method           Unlabeled Recall
van Zaanen, 00   35.6
Right-Branch     46.4
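
A minimal sketch of the baseline and the evaluation measure described above: the strictly right-branching tree over L words posits exactly the spans (i, L), and unlabeled recall is the fraction of gold constituent spans that the proposed parse also posits. The names and the gold spans in the usage example are hypothetical.

def right_branching_brackets(n_words):
    """Spans posited by the strictly right-branching tree over n words."""
    return {(i, n_words) for i in range(n_words - 1)}

def unlabeled_recall(proposed, gold):
    """Fraction of gold constituent spans also posited by the proposed parse."""
    return len(gold & proposed) / len(gold) if gold else 1.0

# Usage on "they were unwilling to agree to new terms" (8 words); the gold spans
# here are hypothetical, for illustration only.
proposed = right_branching_brackets(8)
gold = {(0, 8), (1, 8), (2, 3), (4, 8), (6, 8)}
print(unlabeled_recall(proposed, gold))   # 4 of the 5 hypothetical gold spans -> 0.8
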
17
Inspiration: Distributional Clustering

◊ the president said that the downturn was over ◊

president ↔ governor    the ↔ a    said ↔ reported

Finch and Chater 92, Schütze 93, many others
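
A minimal sketch of the distributional idea (illustrative, not the method of Finch and Chater or Schütze): represent each word by counts of its left and right neighbours and compare words by cosine similarity; a real system would then feed these vectors to a clustering step.

import math
from collections import Counter, defaultdict

def context_vectors(sentences):
    vecs = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]               # sentence-boundary markers
        for i in range(1, len(padded) - 1):
            vecs[padded[i]][("L", padded[i - 1])] += 1   # left neighbour
            vecs[padded[i]][("R", padded[i + 1])] += 1   # right neighbour
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def most_similar(word, vecs):
    """The distributionally closest other word, e.g. 'president' vs. 'governor'."""
    return max((w for w in vecs if w != word), key=lambda w: cosine(vecs[word], vecs[w]))
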
18
Idea: Distributional Syntax?
  • Can we use distributional clustering for learning
    syntax?

◊ factory payrolls fell in september ◊
19
Problem: Identifying Constituents
Distributional classes are easy to find:

the final vote   two decades    most people
decided to       took most of   go with
of the           with a         without many
in the end       on time        for now
the final        the initial    two of the
20
A Nested Distributional Model
  • We'd like a model that:
  • Ties spans to linear contexts (like
    distributional clustering)
  • Considers only proper tree structures (like a
    PCFG model)
  • Has no symmetries to break (like a dependency
    model)

21
Constituent-Context Model (CCM)
  • P(S, T) (see the sketch below)

◊ factory payrolls fell in september ◊
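
A simplified sketch of the quantity the CCM manipulates, under illustrative assumptions (hand-supplied probability tables and a small floor for unseen events; the real model learns the tables with EM): every span of the sentence contributes the probability of its yield and of its linear context, conditioned on whether the bracketing T marks it as a constituent.

def all_spans(n_words):
    return [(i, j) for i in range(n_words) for j in range(i + 1, n_words + 1)]

def ccm_score(sent, bracketing, p_yield, p_context):
    """p_yield[c] and p_context[c] are dicts for c in {True, False}
    (constituent vs. distituent); unseen events get a small floor probability."""
    score = 1.0
    padded = ["<s>"] + sent + ["</s>"]
    for i, j in all_spans(len(sent)):
        is_const = (i, j) in bracketing
        yield_ = tuple(sent[i:j])
        context = (padded[i], padded[j + 1])      # the words just outside the span
        score *= (p_yield[is_const].get(yield_, 1e-6)
                  * p_context[is_const].get(context, 1e-6))
    return score
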
22
Initialization: A little UG?
Tree Uniform
Split Uniform
23
Results: Constituency

[Figure: CCM parse vs. treebank parse of the same sentence]
24
A combination model
  • What we've got:
  • Two models of syntactic structure
  • Each was the first to break its baseline
  • Can we combine them?
  • Yes, using a product model (see the sketch below)
  • Which we've also used for supervised parsing
    (Klein and Manning 2003)
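
In the simplest reading, a product model scores an analysis by multiplying the probabilities the component models assign to it; the sketch below ranks candidate analyses that way, with illustrative names and caller-supplied scoring functions. The actual Klein and Manning 2004 combination couples the CCM and DMV inside a shared dynamic program, so this conveys only the intuition.

def product_model_best(candidates, ccm_prob, dmv_prob):
    """candidates: iterable of (bracketing, dependency_tree) pairs describing the
    same sentence; return the pair that both models jointly like best."""
    return max(candidates, key=lambda cand: ccm_prob(cand[0]) * dmv_prob(cand[1]))
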

25
Combining the two models: Klein and Manning ACL 2004
Dependency Evaluation
  • Supervised PCFG constituency recall is at 92.8
  • Qualitative improvements
  • Subject-verb groups gone, modifier placement
    improved

Constituency Evaluation
26
Crosslinguistic applicability of the learning
algorithm
Constituency Evaluation
27
Most Common Errors: English
Overproposed Constituents
Crossing Constituents
28
What Has Been Accomplished?
  • Unsupervised learning
  • Constituency structure
  • Dependency structure
  • Constituency recall
  • Why it works
  • Combination of simple models
  • Representations designed for unsupervised learning

29
2. Unsupervised learning of linking to semantic
roles
Mary opened the door with a key.
30
Semantic Role Labeling
  • Supervised learning
  • Gildea & Jurafsky (2000)
  • Pradhan et al. (2005)
  • Punyakanok et al. (2005)
  • Xue & Palmer (2004)
  • Haghighi et al. (2005)
  • etc.
  • Unsupervised learning?

Mary opened the door with a key.
31
Why Unsupervised Might Work
Instances of open
The bottle opened easily.
She opened the door with a key.
The door opened as they approached.
Fortunately, the corkscrew opened the bottle.
He opened the bottle with a corkscrew.
She opened the bottle very carefully.
This key opens the door of the cottage.
He opened the door.
33
Inputs and Outputs
Observed Data (verb: give)

Syntactic Relation   Head word   Semantic Role
subj                 it/PRP      ?
subj                 bill/NN     ?
obj1                 power/NN    ?
subj                 they/PRP    ?
obj1                 stake/NN    ?
subj                 bill/NN     ?
obj1                 them/PRP    ?
obj2                 right/NN    ?

Learned Model (verb: give)

Roles
0: it, he, bill, they, that, ...
1: power, right, stake, ...
2: them, it, him, dept., ...

Linkings                  Probability
0→subj, 1→obj2, 2→obj1    0.46
0→subj, 1→obj1, 2→to      0.19
0→subj, 1→obj1            0.05
34
Probabilistic Model
A deeper market plunge today could give them
their first test.
v: give
l: 0→subj, 1→obj2, 2→obj1
o: 0→subj, M_np, 1→obj2, 2→obj1
35
Linking Model
  • The space of possible linkings is large, where S
    is the set of syntactic relations and R is the
    set of semantic roles
  • We're going to decompose the linking generation
    into a sequence of derivation steps
  • Advantages
  • Smaller parameter space
  • Admits a natural informative prior over linkings

36
Linking Construction Example
Ways to add role 1
Ways to add role 2
Deterministic function
37
Learning Problem
  • Given a set of observed verb instances, what are
    the most likely model parameters?
  • A good application for EM!
  • M-step
  • Trivial computation
  • E-Step
  • We need to compute conditional distributions over
    possible role vectors for each instance
  • Given the syntactic relations, only a few role
    vectors yield possible linkings!
  • Exact E-step is possible

[Figure: graphical model with variables v, l, o, r, s, w; plate over the verb's dependents]
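
A greatly simplified EM skeleton for role induction, not Grenager and Manning's linking model: each dependent's role is scored by P(role | verb, relation) times P(head word | role), the E-step enumerates candidate role vectors exactly (roles within one instance are assumed distinct), and the M-step normalises the expected counts. All names and the choice of K = 3 roles are illustrative assumptions.

import random
from collections import defaultdict
from itertools import permutations

K = 3  # number of semantic roles; an illustrative choice

def candidate_role_vectors(n_deps):
    """All ways to assign distinct roles to the dependents (a simplification)."""
    return list(permutations(range(K), n_deps))

def em(instances, iters=20):
    """instances: list of (verb, [(syntactic_relation, head_word), ...])."""
    # random initialisation breaks the symmetry between role labels
    p_role = defaultdict(lambda: random.uniform(0.5, 1.5) / K)  # P(role | verb, relation)
    p_word = defaultdict(lambda: 1e-3)                          # P(word | role)
    for _ in range(iters):
        role_counts, word_counts = defaultdict(float), defaultdict(float)
        for verb, deps in instances:
            # E-step: exact posterior over role vectors, by enumeration
            scored = []
            for rv in candidate_role_vectors(len(deps)):
                s = 1.0
                for (rel, word), role in zip(deps, rv):
                    s *= p_role[verb, rel, role] * p_word[role, word]
                scored.append((rv, s))
            z = sum(s for _, s in scored)
            if z == 0.0:
                continue
            for rv, s in scored:
                for (rel, word), role in zip(deps, rv):
                    role_counts[verb, rel, role] += s / z
                    word_counts[role, word] += s / z
        # M-step: trivial normalisation of the expected counts
        for (verb, rel, role), c in list(role_counts.items()):
            z = sum(role_counts.get((verb, rel, k), 0.0) for k in range(K))
            p_role[verb, rel, role] = c / z
        for (role, word), c in list(word_counts.items()):
            z = sum(v for (k, _), v in word_counts.items() if k == role)
            p_word[role, word] = c / z
    return p_role, p_word
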
38
Datasets and SRL Evaluation
  • Training sets
  • Penn Treebank - 1M words WSJ, annotated by humans
    with parse trees and semantic roles
  • BLLIP - 30M words WSJ, parsed by the Charniak
    parser
  • Gigaword - 1700M words newswire, Stanford parser
  • Development set: Propbank WSJ section 24
  • 911 verb types, 2820 verb instances
  • Test set: Propbank WSJ section 23
  • 1067 verb types, 4315 verb instances
  • SRL Evaluation
  • Learned role names don't correspond to Propbank
    names, so we use a greedy mapping of learned
    roles to gold roles
  • However, we do map learned adjunct roles to ARGM
  • Coarse roles only (don't distinguish types of
    adjuncts)
  • Baseline: subj→arg0, dobj→arg1, iobj→arg2,
    rest→argm (see the sketch below)
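
A minimal sketch of the two evaluation devices just described: the syntactic baseline mapping, and one possible greedy, one-to-one mapping of learned role ids to gold Propbank labels by co-occurrence counts (the paper's exact mapping procedure may differ; the names here are illustrative).

from collections import Counter

def baseline_role(syntactic_relation):
    """Baseline from the slide: subj -> arg0, dobj -> arg1, iobj -> arg2, rest -> argm."""
    return {"subj": "arg0", "dobj": "arg1", "iobj": "arg2"}.get(syntactic_relation, "argm")

def greedy_role_mapping(pairs):
    """pairs: (learned_role, gold_role) for every evaluated argument.
    Repeatedly map the learned role and gold role that co-occur most often,
    removing both from further consideration."""
    counts = Counter(pairs)
    mapping, used_gold = {}, set()
    for (learned, gold), _ in counts.most_common():
        if learned not in mapping and gold not in used_gold:
            mapping[learned] = gold
            used_gold.add(gold)
    return mapping
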

39
Results (Coarse Roles, Sec. 23): Grenager and Manning 2005

[Chart: classification accuracy on coarse roles; values shown include 85.6,
89.7, 93.4, and 98.3, with Toutanova 2005 listed as a comparison system]
40
Improved Verbs
Verb: give

Linkings                  Probability
0→subj, 1→obj2, 2→obj1    0.46
0→subj, 1→obj1, 2→to      0.19
0→subj, 1→obj1            0.05

Roles
0: it, he, bill, they, that, ...
1: power, right, stake, ...
2: them, it, him, dept., ...

Verb: pay

Linkings                  Probability
0→subj, 1→obj1            0.32
0→subj, 1→obj1, 2→for     0.21
0→subj                    0.07
0→subj, 1→obj1, 2→to      0.05
0→subj, 1→obj2, 2→obj1    0.05

Roles
0: it, they, company, he, ...
1: ..., bill, price, tax
2: stake, gov., share, amt., ...
41
Conclusions
  • We can learn a model of verb linking behavior
    without labeled data
  • It learns many of the linkings that linguists
    propose for each verb
  • While labeling accuracy is below supervised
    systems, it is well above informed baselines
  • Good results require careful design of model
    structure
  • While this remains to be done, composing the two
    models discussed here would give the basis of a
    mapping from plain text to a basic semantic
    representation of the text.

42
Language Acquisition Devices
Highly successful
Results still unclear
43
Grammar Induction: What hasn't been accomplished?
  • Learning languages where the primary syntax is in
    the morphology (noun case, verb agreement) rather
    than linear order
  • What is the deal with Chinese?
  • Chinese has many fewer function words
  • Chinese has many fewer pro-forms
  • (It's not morphology; Chinese has less of that
    too)
  • Extending from syntactic to semantic induction

44
Semantic role learning discussion
  • Clustering of roles for pay is clearly wrong
  • pay to X vs. pay for Y
  • must have X ≠ Y
  • Sparsity of head word observations
  • Inter-verb clustering?
  • Use of external resources?
  • No modeling of alternative senses of verbs
  • leave Mary with the gift vs. leave Mary alone

45
Worsened Verbs
Verb: close

Linkings                        Probability
0→subj, 2→in, 3→at              0.24
0→subj, 3→at                    0.18
0→subj, 2→in                    0.11
0→subj, 1→obj1, 2→in, 3→at      0.10

Roles
0: share, stock, it, market, ...
1: yesterday, unchanged, ...
2: trading, exchange, volume
3: ..., cent, high, penny, ...

In national trading, SFE shares closed yesterday at 31.25 cents a share,
up 6.25 cents.

Verb: leave

Linkings                        Probability
0→subj, 1→obj1                  0.57
0→subj                          0.18
0→subj, 1→cl1                   0.12

Roles
0: he, they, it, I, that, this, ...
1: company, it, office, ...
2: stake, alley, impression

He left Mary with the gift. vs. He left Mary alone.
46
Desiderata: Practical Learnability
  • To be practically learnable, models should:
  • Be as simple as possible
  • Make symmetries self-breaking whenever possible
  • Avoid hidden structures which are not directly
    coupled to surface phenomena