Title: Overview of PCFGs and the inside/outside algorithms
1 Overview of PCFGs and the inside/outside algorithms
Based on the book by E. Charniak, 1993
2 Outline
- Shift/Reduce and Chart Parsing
- PCFGs
- Inside/Outside
- inside
- outside
- use for EM training.
3 Shift-Reduce Parsing
- Top-down parsing
- start at the non-terminal symbols and recurse downward, trying to find a parse of a given sentence.
- can get stuck in loops (left-recursion).
- Bottom-up parsing
- start at the terminal level and build up non-terminal constituents as we go along.
- shift-reduce parsing (used often for machine languages, which are designed to be as unambiguous as possible) is quite useful here.
- find sequences of terminals that match the right-hand side of a CFG production, and reduce them to non-terminals. Do the same for sequences of terminals/non-terminals, reducing them to non-terminals. A minimal sketch of this loop appears below.
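To make the shift/reduce moves concrete, here is a minimal recognizer in Python (my own sketch, not from Charniak); it explores reduce and shift choices with backtracking, which is also how the ambiguity problems on the following slides are handled. The toy grammar and all identifiers are illustrative assumptions.

  # Each rule is (lhs, rhs); the toy arithmetic grammar is an assumption.
  RULES = [
      ("E", ("E", "+", "T")),
      ("E", ("T",)),
      ("T", ("T", "*", "F")),
      ("T", ("F",)),
      ("F", ("id",)),
  ]

  def shift_reduce(tokens, rules, start="E"):
      # Return True if tokens can be reduced to the start symbol.
      def search(stack, rest):
          # success: all input consumed and reduced to the start symbol
          if not rest and stack == (start,):
              return True
          # try every possible reduce: a rule whose RHS matches the stack top
          for lhs, rhs in rules:
              n = len(rhs)
              if n <= len(stack) and stack[-n:] == rhs:
                  if search(stack[:-n] + (lhs,), rest):
                      return True
          # try a shift: move the next input token onto the stack
          if rest and search(stack + (rest[0],), rest[1:]):
              return True
          # dead end: backtrack
          return False
      return search(tuple(), tuple(tokens))

  print(shift_reduce("id + id * id".split(), RULES))   # True
  print(shift_reduce("id + * id".split(), RULES))      # False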
4 Example: Arithmetic
5 (No transcript)
6 Shift-Reduce Parsing
- This is great for compiler writers (grammars are often unambiguous, and can sometimes be designed to be so).
- Problem: ambiguity. One rule might suggest a shift while another might suggest a reduce.
- S → if E then S | if E then S else S
- On stack: if E then S
- Input: else
- Should we shift (getting the 2nd rule) or reduce (getting the 1st rule)?
- Can get reduce-reduce conflicts as well, given two production rules of the form A → γ and B → γ (the same right-hand side).
- Solution: backtracking (a general search method)
- in ambiguous situations, choose one option (using a heuristic)
- whenever we get to a situation where we can't parse, we backtrack to the location of the ambiguity.
7 Chart Parsing vs. Shift-Reduce
- In shift-reduce parsing, computation is wasted if back-tracking occurs:
- the long NP in "Big brown funny dogs are hungry" remains parsed while we search for the next constituent, but it becomes undone if a category below it on the stack needs to be reparsed.
- Chart parsing avoids reparsing constituents that have already been found grammatical, by storing all grammatical substrings for the duration of the parse.
8 Chart Parsing
- Uses three main data structures:
- 1) a key list: the list of the next constituents to insert into the chart (e.g., terminals or non-terminals);
- 2) a chart: a triangular table that keeps track of the starting position (x-axis) and length (y-axis) of each constituent;
- 3) a set of edges: things that mark positions on the chart, the rules/productions involved, and how far along in each rule the parse currently is (it is this last piece that helps avoid reparsing); a minimal edge record is sketched below.
- We can use chart parsing for producing (and summing over) parses for PCFGs and for producing inside/outside probabilities. But what are these?
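A minimal sketch of such an edge record in Python (the field names are my own assumptions, not Charniak's); the dot position records how far along the production the parse is, and completed edges are exactly the constituents that never need to be reparsed.

  from dataclasses import dataclass
  from typing import Optional, Tuple

  @dataclass(frozen=True)
  class Edge:
      lhs: str                  # non-terminal being built, e.g. "NP"
      rhs: Tuple[str, ...]      # right-hand side of the production
      dot: int                  # how far along the rule the parse is
      start: int                # index of the first word covered
      end: int                  # index just past the last word covered

      def is_complete(self) -> bool:
          # a complete edge is a found constituent, stored for the
          # whole parse so it is never re-derived
          return self.dot == len(self.rhs)

      def next_needed(self) -> Optional[str]:
          # the category this (incomplete) edge is still looking for
          return None if self.is_complete() else self.rhs[self.dot]

The chart itself can then simply be a set of such edges, indexed by (start, end).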
9 PCFGs
- A CFG with a P in front.
- A (typically inherently ambiguous) grammar with a probability associated with each production rule for a non-terminal symbol; a toy example is sketched below.
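For concreteness, here is a tiny invented PCFG written as a Python dictionary (this grammar is an illustrative assumption, not the one on the slides); note that the probabilities of all rules sharing a left-hand side sum to one.

  # P(lhs -> rhs); for each lhs, the rule probabilities sum to 1
  TOY_PCFG = {
      ("S",   ("NP", "VP")): 1.0,
      ("NP",  ("Det", "N")): 0.6,
      ("NP",  ("N",)):       0.4,
      ("VP",  ("V", "NP")):  0.7,
      ("VP",  ("V",)):       0.3,
      ("Det", ("the",)):     1.0,
      ("N",   ("dog",)):     0.5,
      ("N",   ("cat",)):     0.5,
      ("V",   ("sees",)):    1.0,
  }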
10 PCFGs
- Example parse (see the parse-tree figure, whose rules are annotated with the probabilities 0.8, 0.3, 0.2, 0.4, 0.4, 0.05, 0.45, 0.4, 0.5).
- The probability of this parse is the product of the probabilities of the rules used:
- 0.8 × 0.2 × 0.4 × 0.05 × 0.45 × 0.3 × 0.4 × 0.4 × 0.5 ≈ 3.5 × 10^-5
11 PCFGs
- This gives probabilities for a string of words w1n, where t1n varies over all possible parse trees for the word string w1n (see the reconstruction below).
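The formula itself is an image in the original; presumably it is the standard sum over parses, which in LaTeX is

  P(w_{1n} \mid G) \;=\; \sum_{t_{1n}} P(w_{1n}, t_{1n} \mid G) \;=\; \sum_{t_{1n}\,:\,\mathrm{yield}(t_{1n}) = w_{1n}} P(t_{1n} \mid G)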
12 Domination Notation
- A non-terminal dominates the terminal symbols beneath it.
- So in general, N^j_{k,l} means that the jth non-terminal (N^j) dominates the terminals starting at position k and ending at position l (inclusive).
13 Probability Rules
- For a non-terminal dominating the terminals starting at k and ending at l, we get the rule probabilities shown (figure),
- for any k, l and any m, n, q with k ≤ m < n ≤ q < l,
- where X, Y, ..., Z can correspond to any valid set of terminals or non-terminals that partition the terminals from k to l.
14 Chomsky Hierarchy (reminder)
15 Regular grammar (reminder)
- Generates the regular languages (e.g., regular expressions without bells/whistles, FSAs, etc.)
- A → a, where A is a non-terminal and a is a terminal.
- A → aB, where A and B are non-terminals and a is a terminal.
- A → ε, where A is a non-terminal.
- Example: a^n b^m, m, n > 0, can be generated by
- S → aA
- A → aA
- A → bB
- B → bB
- B → ε
16 Context-Free (reminder)
- Generates the context-free languages.
- The left-hand side of a production rule can be only one non-terminal symbol.
- Every rule is of the form
- A → α
- where α is a string of terminals and/or non-terminals.
- Called context free since the non-terminal A can always be replaced by α regardless of the context in which A occurs.
17 Context-Free (reminder)
- Example: generate a^n b^n, n ≥ 0
- A → aAb
- A → ε
- The above language can't be generated by a regular grammar (the language is not regular).
- Often used for compilers.
- No probabilities (yet).
18 Context-Sensitive (reminder)
- Generates the context-sensitive languages.
- Generated by rules of the form
- αAβ → αγβ
- A → ε
- where A is a non-terminal, α is any string of terminals/non-terminals, β is any string of terminals/non-terminals, and γ is any non-empty string of terminals/non-terminals.
19 Context-Sensitive (reminder)
- Example: a^n b^n c^n, n ≥ 1
- A → abc
- A → aABc
- cB → Bc
- bB → bb
- Example 2:
- a^n, n prime
20 Type-0 (unrestricted) grammar
- Generates all languages that can be recognized by a Turing machine (i.e., ones for which the TM can say yes and then halt).
- Known as the recursively enumerable languages.
- More general than context-sensitive.
- Example (I think):
- a^n b^m, n prime, m the prime following n
- aabbb, aaabbbbb, aaaaabbbbbbb, etc.
21 Independence Assumptions
- PCFGs are CFGs with probabilities, but in addition there are statistical independence assumptions made about the random events.
- Probabilities are unaffected by certain parts of the tree, given certain other non-terminals.
22 Example
- The following parse tree gives the equation below, where
- B_{1,3} is outside C_{4,5}
- A_{1,5} is above C_{4,5}
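The equation itself is an image in the original; given the independence assumptions just stated, it is presumably of the form

  P(w_{4,5} \mid C_{4,5}, A_{1,5}, B_{1,3}) \;=\; P(w_{4,5} \mid C_{4,5})

i.e., once we know that C dominates words 4 through 5, nothing above or outside C changes the probability of those words.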
23 Chomsky vs. Conditional Independence
- The fact that the grammar is context-free means only that certain production rules can occur in the grammar specification. Context-freeness determines the set of possible trees, not their probabilities.
- This alone does not guarantee that the probabilities are uninfluenced by parses outside a given non-terminal. This is where we need the conditional independence assumptions over probabilistic events.
- The events are parse events, or just dominance events. E.g., without the independence assumptions, p(w4, w5 | C_{4,5}, B_{2,3}) might not be the same as p(w4, w5 | C_{4,5}, B_{1,3}), or could even change depending on how we parse some other set of constituents.
24 Issues
- We want to be able to compute p(w_{1n} | G).
- But the number of parses of a given sentence can grow exponentially in the length of the sentence, for a given grammar, so naïve summing won't work.
25 Problems to Solve
- 1) Compute the probability of a sentence for a given grammar, p(w_{1n} | G).
- 2) Find the most likely parse for a given sentence.
- 3) Train the grammar probabilities in some good way (e.g., maximum likelihood).
- These can all be solved with the inside-outside algorithm.
26 A Note on comparisons to HMMs
- The text says that HMMs and probabilistic regular grammars (PRGs) assign different probabilities to strings.
- A PRG is such that summing the probabilities of all sentences of all lengths should give one.
- An HMM is such that summing the probabilities of all sentences of a given length T should give one.
- But HMMs have an implicit conditional on T. The HMM probability is shown below.
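The HMM probability referred to here is an image in the original; presumably it is the usual joint over a length-T state sequence, which implicitly fixes T:

  P(w_{1T} \mid T) \;=\; \sum_{x_{1T}} \prod_{t=1}^{T} P(x_t \mid x_{t-1})\, P(w_t \mid x_t)

where x_{1T} ranges over state sequences of length T and P(x_1 | x_0) denotes the initial-state distribution.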
27 A Note on comparisons to HMMs
- But HMMs can be defined to have an explicit length distribution.
- This can be done implicitly in an HMM by having a start and a stop state, thus defining P(T).
- We can alternatively make P(T) explicitly equal to the PRG probability obtained by summing over all sentences of length T.
28 Example PRG and its parses
29 Forward/backward in HMMs
- Forward (alpha) recursion
- Backward (beta) recursion
- Probability of a sentence
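The recursions themselves are images in the original; the standard forms, in the usual HMM notation (states i, j; transition probabilities a_{ij}; emission probabilities b_j(w_t); initial distribution \pi), are

  \alpha_t(j) \;=\; \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_j(w_t), \qquad \alpha_1(j) = \pi_j\, b_j(w_1)

  \beta_t(i) \;=\; \sum_j a_{ij}\, b_j(w_{t+1})\, \beta_{t+1}(j), \qquad \beta_T(i) = 1

  P(w_{1T}) \;=\; \sum_i \alpha_T(i) \;=\; \sum_i \alpha_t(i)\,\beta_t(i) \quad \text{for any } t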
30 Association of α/β with outside/inside
- For a PRG, the backward (β) probability is like the stuff below a given non-terminal, or perhaps inside the non-terminal.
- For a PRG, the forward (α) probability is like the stuff above a given non-terminal, or perhaps outside the non-terminal.
31 Inside/Outside (β/α) probability defs.
- Inside probability: the stuff that is inside (or that is dominated by) a given non-terminal, corresponding to the terminals in the range k through l.
- Outside probability: the stuff that is outside the terminal range k through l.
32 Inside/Outside probability defs.
- (Figure: the region outside N^j vs. the region inside N^j.)
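Written out in the standard notation (the slide equations are images, so this is a reconstruction), with β the inside probability and α the outside probability:

  \beta_j(k,l) \;=\; P(w_{kl} \mid N^j_{k,l}, G)

  \alpha_j(k,l) \;=\; P(w_{1(k-1)},\; N^j_{k,l},\; w_{(l+1)n} \mid G)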
33 Inside/Outside probability
- As in HMMs, we can get sentence probabilities from the inside and outside probabilities of any non-terminal.
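In particular, the sentence probability is presumably read off the inside probability of the start symbol N^1 spanning the whole string:

  P(w_{1n} \mid G) \;=\; \beta_1(1, n)

An outside-based expression for the same quantity appears on slide 37.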
34 Inside probability computation
- Assume Chomsky normal form.
- For the more general case, see today's handout from the Charniak text; one needs to use details about the edges in the chart.
- Base case (to then go from a terminal on up the parse tree): the inside probability of a non-terminal spanning a single word.
- Next case: consider the range k through l, and all non-terminals that could generate w_{kl}. Since the grammar is in Chomsky normal form, a non-terminal expands into two non-terminals N^p and N^q, and we must consider all possible ways of splitting the range k through l so that these two non-terminals jointly dominate it (see below).
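A reconstruction of the base case and recursion in the standard notation (the slide equations are images):

  \beta_j(k,k) \;=\; P(N^j \to w_k)

  \beta_j(k,l) \;=\; \sum_{p,q} \sum_{m=k}^{l-1} P(N^j \to N^p N^q)\, \beta_p(k,m)\, \beta_q(m+1,l) \qquad (k < l)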
35 Inside probability computation
- Sum over all pairs of non-terminals (p, q) and over the location of the split, m.
- (Derivation steps: by the chain rule of probability; by PCFG conditional independence; rename.)
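A minimal Python sketch of this bottom-up computation (my own code, with an assumed grammar representation, not Charniak's):

  from collections import defaultdict

  def inside(words, binary_rules, lexical_rules):
      # binary_rules:  {(j, p, q): P(N^j -> N^p N^q)}
      # lexical_rules: {(j, word): P(N^j -> word)}
      # returns beta[(j, k, l)] = P(w_k..w_l | N^j spans k..l), 1-indexed
      n = len(words)
      beta = defaultdict(float)
      # base case: spans of length one
      for k, w in enumerate(words, start=1):
          for (j, word), p in lexical_rules.items():
              if word == w:
                  beta[(j, k, k)] = p
      # longer spans, shortest first
      for span in range(2, n + 1):
          for k in range(1, n - span + 2):
              l = k + span - 1
              for (j, p_sym, q_sym), rule_p in binary_rules.items():
                  for m in range(k, l):  # split point between the two children
                      beta[(j, k, l)] += (rule_p
                                          * beta[(p_sym, k, m)]
                                          * beta[(q_sym, m + 1, l)])
      return beta

  # sentence probability: beta[(start_symbol, 1, len(words))]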
36 Inside probability computation
- m = number of distinct non-terminal symbols (so up to m^3 binary rules)
- n = length of the string
- Computational complexity is O(n^3 m^3).
37 Outside probability
- The outside probability can also be used to compute the word-string probability, as follows (see the identity below).
- How do we get this?
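Presumably the identity referred to is the standard one, valid for any word position k:

  P(w_{1n} \mid G) \;=\; \sum_j \alpha_j(k,k)\, P(N^j \to w_k)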
38 Outside probability computation
- Note again: this is true for any k at all.
- So how do we compute the outside probabilities?
- Inside probabilities are calculated bottom-up, and (not surprisingly) outside probabilities are calculated top-down, so we start at the top, un-marginalizing over all non-terminals j.
- (Derivation steps: chain rule; conditional independence and rename.)
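Matching the derivation-step labels above, the derivation of the identity is presumably (for any fixed k):

  P(w_{1n}) \;=\; \sum_j P(w_{1(k-1)},\; N^j_{k,k},\; w_{(k+1)n},\; w_k)   [un-marginalize over j]

  \phantom{P(w_{1n})} \;=\; \sum_j P(w_{1(k-1)},\; N^j_{k,k},\; w_{(k+1)n})\; P(w_k \mid w_{1(k-1)}, N^j_{k,k}, w_{(k+1)n})   [chain rule]

  \phantom{P(w_{1n})} \;=\; \sum_j \alpha_j(k,k)\, P(N^j \to w_k)   [conditional independence and rename]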
39 Outside probability computation
- Starting at the top (the base case).
- Next, to calculate the outside probability for N^j_{k,l} (and given that we've got Chomsky normal form), there are two possibilities for the higher constituent it could have come from: N^j is either on the left or on the right of its sibling.
- These two events are exclusive and exhaustive, so this means we'll need to sum the probabilities of the two cases (see the recursion below).
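A reconstruction of the base case and the two-case recursion in the standard notation (an assumption; N^f is the parent and N^g the sibling):

  \alpha_1(1,n) = 1, \qquad \alpha_j(1,n) = 0 \ \ \text{for } j \neq 1

  \alpha_j(k,l) \;=\; \sum_{f,g} \Big[ \sum_{m=l+1}^{n} \alpha_f(k,m)\, P(N^f \to N^j N^g)\, \beta_g(l+1,m) \;+\; \sum_{m=1}^{k-1} \alpha_f(m,l)\, P(N^f \to N^g N^j)\, \beta_g(m,k-1) \Big]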
40 Outside probability computation
41 Outside probability computation
42 Outside probability computation
- m = number of distinct non-terminal symbols
- n = length of the string
- Computational complexity is O(n^3 m^3).
43 Other notes
- We can form products of inside and outside values (but they have a slightly different meaning than in HMMs).
- Here the sum over non-terminals gives the probability of the word sequence together with the event that some non-terminal constituent spans k through l (different from the HMM case, where the analogous sum gives just the probability of the observations). We get the expression below.
- (This follows, again, from conditional independence.)
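Presumably the expression is

  \alpha_j(k,l)\, \beta_j(k,l) \;=\; P(w_{1n},\; N^j_{k,l} \mid G)

  \sum_j \alpha_j(k,l)\, \beta_j(k,l) \;=\; P(w_{1n},\; \text{some constituent spans } k..l \mid G)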
44 Training PCFG probabilities
- The inside/outside procedure can be used to compute expected sufficient statistics (e.g., posteriors over production rules) so that we can use them in an EM iterative update procedure.
- This is very useful when we have a database of text but don't have a database of observed parses for it (i.e., when we don't have a treebank). We can use EM with inside/outside to find a maximum-likelihood solution for the probability values.
- All we need are equations that give us the expected counts.
- Note: the book's notation mentions ratios of actual counts C, but they really are expected counts E[count], which are sums of probabilities, as in normal EM. Recall that the expected value of an indicator is its probability:
- E[1_{event}] = Pr(event)
45 Training PCFG probabilities
- First, we need the definition of the re-estimated (posterior) probability of a production as a ratio of expected counts.
- Next, we need the expected counts themselves (a reconstruction is given below).
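A plausible reconstruction of these equations in the standard inside-outside form, for a single training sentence w_{1n} (the slides' exact notation is not in the transcript):

  \hat{P}(N^j \to N^p N^q) \;=\; \frac{E[\mathrm{count}(N^j \to N^p N^q)]}{E[\mathrm{count}(N^j \text{ used})]}

  E[\mathrm{count}(N^j \to N^p N^q)] \;=\; \frac{1}{P(w_{1n})} \sum_{1 \le k \le m < l \le n} \alpha_j(k,l)\, P(N^j \to N^p N^q)\, \beta_p(k,m)\, \beta_q(m+1,l)

  E[\mathrm{count}(N^j \text{ used})] \;=\; \frac{1}{P(w_{1n})} \sum_{1 \le k \le l \le n} \alpha_j(k,l)\, \beta_j(k,l)

With multiple sentences, the expected counts are summed over the corpus before taking the ratio.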
46 Training PCFG probabilities
- Lastly, we need the expected counts for the terminal (lexical) rules, i.e., that non-terminal N^i produced vocabulary word w^j (see below).
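Presumably of the form

  E[\mathrm{count}(N^i \to w^j)] \;=\; \frac{1}{P(w_{1n})} \sum_{k\,:\,w_k = w^j} \alpha_i(k,k)\, P(N^i \to w^j)

and the re-estimated lexical probability is again this divided by E[count(N^i used)].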
47 Summary
- PCFGs are powerful.
- There are good training algorithms for them.
- But are they enough? Context-sensitive grammars were really developed for natural (written and spoken) language, but the algorithms for them are not efficient.