Probabilistic CFG with Latent Annotations - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Probabilistic CFG with Latent Annotations


1
Probabilistic CFG with Latent Annotations
  • Takuya Matsuzaki
  • Yusuke Miyao
  • Jun'ichi Tsujii
  • University of Tokyo

2
Motivation: Independence assumption in PCFG models

(Diagram: a Treebank is converted into a Treebank-PCFG under the
independence assumption)
3
Wrong independence assumption

(Figure: two NPs in a tree, one a short phrase / bare pronoun, the other a
long phrase)
  • A single symbol for all NPs
  • The difference between Sbj-NP and Obj-NP is
    not captured by the model

Subject-NPs and object-NPs have different
properties
4
Label annotation approach (e.g., Johnson 98,
Charniak 99, Collins 99, Klein & Manning 03)
Treebank
Annotating labels with features: parent
labels, head words, ...
Annotated-PCFG
5
Label annotation approach (e.g., Johnson 98,
Charniak 99, Collins 99, Klein & Manning 03)
Annotated-PCFG

6
Natural Questions
  • What types of features are effective?
  • How should we combine the features?
  • How many features suffice?
  • Previous approach: manual feature
    selection
  • Our approach: automatic induction of
    features

7
Our Approach
Treebank
- Annotation of labels with latent variables
- Induction of their values using an EM algorithm
PCFG with Latent Annotations

8
Our Approach

A rule with different assignments has different
rule probabilities
Different assignments to the latent variables →
different features
9
Outline
  • → Model Definition
  • Parameter Estimation
  • Parsing Algorithms
  • Experiments

10
Model Definition (1/4): PCFG-LA model
  • PCFG-LA (PCFG with Latent Annotations) is
  • a generative model of parse trees, and
  • a latent variable model
  • Observed data: CFG-style parse trees
  • Complete data: parse trees with (latent)
    annotations

Observed tree
Complete tree
11
Model Definition (2/4) Generation of a Tree
Generation of a complete tree T[x]: successive
applications of annotated PCFG rules
T[x],  x = (2, 1, 3)
12
Model Definition (2/4) Generation of a Tree
P(T[x]),  x = (2, 1, 3)
13
Model Definition (2/4) Generation of a Tree
P(T[x]) = P(S_2)
14
Model Definition (2/4) Generation of a Tree
P(T[x]) = P(S_2) × P(S_2 → NP_1 VP_3)
15
Model Definition (2/4) Generation of a Tree
P(T[x]) = P(S_2) × P(S_2 → NP_1 VP_3) × P(NP_1 → He)
16
Model Definition (2/4) Generation of a Tree
P(T[x]) = P(S_2) × P(S_2 → NP_1 VP_3) × P(NP_1 → He) × P(VP_3 → kicked it)
17
Model Definition (3/4): Components of PCFG-LA
  • Backbone CFG: a simple treebank CFG
  • Parameters: rule probs. for each rule with ALL
    assignments: P(S_1 → NP_1 VP_1), P(S_1 → NP_1
    VP_2), P(S_1 → NP_2 VP_1), P(S_1 → NP_2
    VP_2), P(S_2 → NP_1 VP_1), P(S_2 → NP_1
    VP_2), ...
  • Domain of latent variables
  • A finite set H = {1, 2, 3, ..., N}
  • |H| is chosen before training
  • |H| = 16 → reasonable performance
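
(A worked count, for illustration: with |H| = 2 a single backbone rule
S → NP VP expands into 2 × 2 × 2 = 8 annotated rules, the eight
probabilities listed above; with |H| = 16 it expands into 16^3 = 4096.)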

18
Model Definition (4/4)
Probability of a complete tree:
  P(T[x]) = P(annotated root symbol) × Π (probs. of the annotated rules used in T[x])
Probability of an observed tree:
  P(T) = Σ_x P(T[x])
  (the sum over all possible assignments to the latent variables)
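
A minimal, runnable sketch of these two definitions (not the authors' code):
it computes P(T) for the toy observed tree (S (NP He) (VP kicked it)) by
enumerating all |H|^3 annotation assignments. The grammar symbols follow the
slides, but every probability value below is made up purely for illustration.

    from itertools import product

    H = (1, 2)                      # domain of latent annotations, |H| = 2

    root_prob = {1: 0.4, 2: 0.6}    # P(S_a) for the root symbol S

    # Annotated rule probabilities P(A_a -> B_b C_c) and P(A_a -> word);
    # toy values for illustration only.
    binary_prob = {
        ("S", 1, "NP", 1, "VP", 1): 0.40, ("S", 1, "NP", 1, "VP", 2): 0.20,
        ("S", 1, "NP", 2, "VP", 1): 0.30, ("S", 1, "NP", 2, "VP", 2): 0.10,
        ("S", 2, "NP", 1, "VP", 1): 0.10, ("S", 2, "NP", 1, "VP", 2): 0.30,
        ("S", 2, "NP", 2, "VP", 1): 0.20, ("S", 2, "NP", 2, "VP", 2): 0.40,
    }
    lex_prob = {
        ("NP", 1, "He"): 0.7, ("NP", 2, "He"): 0.5,
        ("VP", 1, "kicked it"): 0.6, ("VP", 2, "kicked it"): 0.3,
    }

    def complete_tree_prob(x_s, x_np, x_vp):
        """P(T[x]) for the tree (S (NP He) (VP kicked it)), annotations x."""
        return (root_prob[x_s]
                * binary_prob[("S", x_s, "NP", x_np, "VP", x_vp)]
                * lex_prob[("NP", x_np, "He")]
                * lex_prob[("VP", x_vp, "kicked it")])

    # P(T) = sum over all |H|^3 assignments to the latent variables.
    p_observed = sum(complete_tree_prob(*x) for x in product(H, repeat=3))
    print(p_observed)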
19
Outline
  • Model definition
  • → Parameter estimation
  • Parsing
  • Experiments

20
Parameter Estimation (1/2)
  • Training data: a set of parse trees (treebank)
  • Algorithm: an EM algorithm similar to the
    Baum-Welch algorithm

(Example tree with latent annotations: S_x1 → NP_x2 VP_x3, NP_x2 → N_x4,
VP_x3 → V_x5 N_x6, over the words He kicked it)
21
Parameter Estimation (1/2)
  • Training data: a set of parse trees (treebank)
  • Algorithm: an EM algorithm similar to the
    Baum-Welch algorithm

(Example tree with latent annotations: S_x1 → NP_x2 VP_x3, NP_x2 → N_x4,
VP_x3 → V_x5 N_x6, over the words He kicked it)
22
Parameter Estimation (1/2)
  • Training data: a set of parse trees (treebank)
  • Algorithm: an EM algorithm similar to the
    Baum-Welch algorithm

23
Parameter Estimation (1/2)
(Figure: forward and backward probabilities over the latent annotations,
analogous to the forward-backward computation for an HMM with states
x1 x2 x3 x4 and observations w1 w2 w3 w4; a minimal sketch of one E-step on
a single tree follows below)
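
The sketch below (again not the authors' implementation) spells out one
E-step on a single fixed tree, in the Baum-Welch style the slides describe:
a bottom-up ("backward") pass sums out the annotations below each node, a
top-down ("forward") pass sums out the annotations outside it, and their
product gives expected counts of annotated rules. It reuses the toy tables
H, root_prob, binary_prob, lex_prob from the sketch after slide 18; trees
are (label, children) pairs where children is either a word string or a
pair of child nodes.

    from collections import defaultdict

    def backward(node, b):
        """Bottom-up pass: b[id(node)][a] = P(yield below node | annotation a)."""
        label, children = node
        if isinstance(children, str):                     # preterminal -> word
            b[id(node)] = {a: lex_prob.get((label, a, children), 0.0) for a in H}
            return
        left, right = children
        backward(left, b); backward(right, b)
        bl, br = b[id(left)], b[id(right)]
        b[id(node)] = {a: sum(binary_prob.get((label, a, left[0], x, right[0], y), 0.0)
                              * bl[x] * br[y] for x in H for y in H)
                       for a in H}

    def forward(node, f, b, counts):
        """Top-down pass: f[id(node)][a] = outside prob. of node with annotation a;
        accumulates unnormalized expected counts of annotated rules."""
        label, children = node
        if isinstance(children, str):
            for a in H:
                counts[(label, a, children)] += (
                    f[id(node)][a] * lex_prob.get((label, a, children), 0.0))
            return
        left, right = children
        bl, br = b[id(left)], b[id(right)]
        f[id(left)] = {x: 0.0 for x in H}
        f[id(right)] = {y: 0.0 for y in H}
        for a in H:
            for x in H:
                for y in H:
                    p = binary_prob.get((label, a, left[0], x, right[0], y), 0.0)
                    counts[(label, a, left[0], x, right[0], y)] += f[id(node)][a] * p * bl[x] * br[y]
                    f[id(left)][x] += f[id(node)][a] * p * br[y]
                    f[id(right)][y] += f[id(node)][a] * p * bl[x]
        forward(left, f, b, counts)
        forward(right, f, b, counts)

    def e_step(tree):
        """Expected counts of annotated rules in one tree (to be renormalized
        per annotated left-hand side in the M-step)."""
        b, f, counts = {}, {}, defaultdict(float)
        backward(tree, b)
        f[id(tree)] = {a: root_prob[a] for a in H}        # root outside probs
        forward(tree, f, b, counts)
        p_tree = sum(root_prob[a] * b[id(tree)][a] for a in H)
        return {r: c / p_tree for r, c in counts.items()}

    # Usage: e_step(("S", (("NP", "He"), ("VP", "kicked it"))))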
24
Parameter Estimation (2/2): Comparison with the I-O
algorithm
  • Estimation of PCFG-LA
  • Training data: parsed sentences → parse trees
    are given
  • Inside-Outside algorithm
  • Training data: raw sentences → tree structures
    are unknown

25
Outline
  • Model definition
  • Parameter estimation
  • → Parsing Algorithms
  • Experiments

26
Parsing with PCFG-LA
  • We want the most probable observable tree

    T_max = argmax_T P(T | w) = argmax_T P(T)      (1)

  • However, an observable tree corresponds to exponentially
    many complete trees
    → |H|^n complete trees
  • We cannot use DP like the F-B algo. to solve
    (1) (NP-hard)

27
Parsing by Approximation
  • Method 1: Reranking of PCFG N-best parses
  • Do N-best parsing using a PCFG, and
  • Select the best tree among the N candidates
    (a minimal sketch follows this list)
  • Method 2: Viterbi complete tree
  • Select the most probable complete tree and
    discard the annotation part
  • Method 3: Viterbi search with an approximate
    distribution
  • Details are in the next slides
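
A minimal sketch of Method 1 (not the authors' code), reusing backward() and
the toy tables from the earlier sketches: score each of the N candidate
trees by its exact PCFG-LA probability P(T) and keep the best one. Method 2
would instead replace the sums in backward() with max operations over the
annotations and keep backpointers.

    def tree_prob(tree):
        """P(T): sum over root annotations of P(root) times the inside prob."""
        b = {}
        backward(tree, b)                  # bottom-up pass defined earlier
        return sum(root_prob[a] * b[id(tree)][a] for a in H)

    def rerank(nbest_trees):
        """Method 1: pick the N-best candidate with the highest P(T)."""
        return max(nbest_trees, key=tree_prob)

    # Usage with a single toy candidate:
    print(rerank([("S", (("NP", "He"), ("VP", "kicked it")))]))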

28
Method 3: On-the-fly approximation by a simpler
model
  • Input: a sentence w = w1 w2 ...
  • Parse w with the backbone CFG and obtain a packed
    parse forest F
  • Break down F and make a PCFG-like distribution
    Q(T|w) using the fragments
  • Obtain argmax_T Q(T|w) using Viterbi search

29
Method 3: The design of Q(T|w)
  • Q(T|w): a product of parameters λ_i

→ Decoding is easy
λ_i's are determined so that the KL distance from
Q(T|w) to P(T|w) is minimized
Approximation:
P(T|w) is a sum of products of parameters
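
To illustrate why decoding with the product-form Q is easy, here is a small,
self-contained sketch (not the authors' code): each choice point in a packed
forest carries one local selection probability, so the best tree under Q is
found by bottom-up dynamic programming. The forest, node names, and
probabilities are invented; real forest nodes would be (label, span) pairs.

    from functools import lru_cache

    # A toy packed forest for "he saw the man with a telescope": each node maps
    # to its alternatives, an alternative being (local selection prob., children).
    forest = {
        "S":      ((1.0, ("NP_sbj", "VP")),),
        "NP_sbj": ((1.0, ("he",)),),
        "VP":     ((0.7, ("V", "NP_obj", "PP")),   # [saw [the man] [with a telescope]]
                   (0.3, ("V", "NP_big"))),        # [saw [the man with a telescope]]
        "V":      ((1.0, ("saw",)),),
        "NP_obj": ((1.0, ("the", "man")),),
        "PP":     ((1.0, ("with", "a", "telescope")),),
        "NP_big": ((1.0, ("the", "man", "with", "a", "telescope")),),
    }

    @lru_cache(maxsize=None)
    def best(node):
        """(best Q score, best subtree) for a forest node, by bottom-up DP."""
        if node not in forest:                     # terminal word
            return 1.0, node
        results = []
        for q, kids in forest[node]:
            score, subtrees = q, []
            for k in kids:
                s, t = best(k)
                score *= s
                subtrees.append(t)
            results.append((score, (node, tuple(subtrees))))
        return max(results, key=lambda r: r[0])

    print(best("S"))    # score 0.7: the analysis that attaches the PP to the VP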
30
Outline
  • Model definition
  • Parameter estimation
  • Parsing
  • → Experiments

31
Experiment
  • Data: Penn WSJ Corpus
  • Extraction of backbone CFG and estimation:
    Sections 02-21
  • Dev. set: Section 22
  • Test set: Section 23
  • Three experiments
  • Size of models (|H|) vs. parsing performance
  • Comparison of 3 approximation methods
  • Results on Section 23

32
Setting: Backbone CFG
  • Function tags (-SBJ, -LOC, etc.) are removed
  • No feature annotations
  • 4 types of binarization
  • LEFT
  • RIGHT
  • PARENT-CENTER
  • HEAD-CENTER

(Figure: example binarized trees for PARENT-CENTER and RIGHT)
33
Size of H vs. Performance
(Plot: parsing performance for models with |H| = 1, 2, 4, 8, 16)
34
Comparison of Approximations
(Plot: comparison of the approximation methods, with N-best sizes N = 100
and N = 300)
35
Results on Section 23
The same level of performance as
Klein & Manning's extensively annotated
unlexicalized PCFG; several points lower than
the lexicalized PCFG parsers
→ The amount of learned features matches K&M's
PCFG
→ Some types of lexical information are not
captured
36
Conclusion
  • Automatic induction of features by the PCFG-LA
  • Several points lower than lexicalized parsers,
    but promising results (86 F1-score)
  • On-the-fly approximation of a complex model by a
    simpler model for parsing
  • Better performance than other straightforward
    methods
  • Further applications to models of a similar
    type: single output ↔ many derivations
  • Mixture of parsers: latent variables correspond
    to component parsers
  • Data-Oriented Parsing
  • Projection of head-lexicalized parses to
    dependency structures


(Suggestion by an anonymous reviewer. Thank you.)
37
Thank you!
38
(No Transcript)
39
Parameter Estimation (2/3)
  • How can we calculate the sum of |H|^N terms?
    P(T) = Σ_x1 Σ_x2 Σ_x3 ... P(T[x])
  • Forward-Backward algo. (just as in HMMs)

(Figure: forward and backward probabilities on the example tree)
40
Parameter Estimation (3/3)
  • Intuition
  • Soft assignments of values to latent variables,
    or
  • Soft clustering of non-terminal nodes (Klein &
    Manning, 03)

41
NP-hardness of PCFG-LA Parsing
  • A similar situation: the DOP model
  • One parse tree ↔ exponentially many derivations
  • Similar (unfortunate) results
  • Obtaining a most probable tree in a DOP model is
    NP-hard (Sima'an, 02)
  • Obtaining a most probable tree in a PCFG-LA is
    also NP-hard (we prove it by using Sima'an's
    result)

42
Why can't we use Dynamic Programming?
  • To avoid the wrong Markov assumption,
  • PCFG-LA expands rules horizontally, while
  • DOP expands rules vertically

43
Why can't we use Dynamic Programming?
  • As a result, every node remembers its
    great-grandmother node in both models

(Figure: an annotated tree fragment (S_a1, NP_a2, ...) and a DOP tree
fragment (S, NP, VP, N, V, NP) over the word kicked)
44
(No Transcript)
45
Method 3: Local selection probs.
Q(T|w) = a product of local selection
probabilities

(Figure: a packed forest with two choice points whose local selection
probabilities are {0.3, 0.7} and {0.6, 0.4}; the example trees have
Q(T|w) = 0.3 × 0.6, Q(T|w) = 0.7 × 0.6, and Q(T|w) = 0.7 × 0.4)

46
Method 3: Minimization of KL(Q||P)
Minimization of the KL divergence yields simple,
closed-form solutions for the local probabilities.
→ See the paper for details
47
More about Method 3 (3/4)
Minimization of KL(P||Q) yields simple,
closed-form results for the local probabilities.

(Figure: a VP node in the packed forest with two alternative fragments,
weighted by local probabilities λ1 and λ2, over daughters V, N, ADV)
48
More about Method 3 (3/4)
Minimization of KL(P||Q) yields simple,
closed-form results for the local probabilities.

(Figure: the same VP node; each local probability λ_i is given by a
probability P(· | w) of the corresponding fragment)
49
More about Method 3 (4/4)
Calculation of these marginal probs:
Inside-Outside algo. on the packed forest