Title: Overview of PCFGs and the inside/outside algorithms
1 Overview of PCFGs and the inside/outside algorithms
Based on the book by E. Charniak, 1993
2 Outline
- Shift/Reduce and Chart Parsing
- PCFGs
- Inside/Outside
- inside
- outside
- use for EM training.
3 Shift-Reduce Parsing
- Top-down parsing
- start at the non-terminal symbols and recurse downward, trying to find a parse of a given sentence.
- can get stuck in loops (left-recursion).
- Bottom-up parsing
- start at the terminal level and build up non-terminal constituents as we go along.
- shift-reduce parsing (used often for machine languages, which are designed to be as unambiguous as possible) is quite useful here.
- find sequences of terminals that match the right-hand side of a CFG production, and reduce them to non-terminals. Do the same for sequences of terminals/non-terminals, reducing them to non-terminals. A minimal sketch of this loop appears below.
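To make the shift/reduce moves concrete, here is a minimal recognizer in Python (my own sketch, not from Charniak); it explores reduce and shift choices with backtracking, which is also how the ambiguity problems on the following slides are handled. The toy grammar and all identifiers are illustrative assumptions.

  # Each rule is (lhs, rhs); the toy arithmetic grammar is an assumption.
  RULES = [
      ("E", ("E", "+", "T")),
      ("E", ("T",)),
      ("T", ("T", "*", "F")),
      ("T", ("F",)),
      ("F", ("id",)),
  ]

  def shift_reduce(tokens, rules, start="E"):
      # Return True if tokens can be reduced to the start symbol.
      def search(stack, rest):
          # success: all input consumed and reduced to the start symbol
          if not rest and stack == (start,):
              return True
          # try every possible reduce: a rule whose RHS matches the stack top
          for lhs, rhs in rules:
              n = len(rhs)
              if n <= len(stack) and stack[-n:] == rhs:
                  if search(stack[:-n] + (lhs,), rest):
                      return True
          # try a shift: move the next input token onto the stack
          if rest and search(stack + (rest[0],), rest[1:]):
              return True
          # dead end: backtrack
          return False
      return search(tuple(), tuple(tokens))

  print(shift_reduce("id + id * id".split(), RULES))   # True
  print(shift_reduce("id + * id".split(), RULES))      # False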
4 Example: Arithmetic
5 (No transcript)
6 Shift-Reduce Parsing
- This is great for compiler writers (grammars are often unambiguous, and can sometimes be designed to be so).
- Problem: ambiguity. One rule might suggest a shift while another might suggest a reduce.
- S → if E then S | if E then S else S
- On stack: if E then S
- Input: else
- Should we shift (getting the 2nd rule) or reduce (getting the 1st rule)?
- Can get reduce-reduce conflicts as well, given two production rules of the form A → γ and B → γ (the same right-hand side).
- Solution: backtracking (a general search method)
- in ambiguous situations, choose one option (using a heuristic)
- whenever we get to a situation where we can't parse, we backtrack to the location of the ambiguity.
7 Chart Parsing vs. Shift-Reduce
- In shift-reduce parsing, computation is wasted if back-tracking occurs:
- the long NP in "Big brown funny dogs are hungry" remains parsed while we search for the next constituent, but it becomes undone if a category below it on the stack needs to be reparsed.
- Chart parsing avoids reparsing constituents that have already been found grammatical, by storing all grammatical substrings for the duration of the parse.
8 Chart Parsing
- Uses three main data structures:
- 1) a key list: the list of the next constituents to insert into the chart (e.g., terminals or non-terminals);
- 2) a chart: a triangular table that keeps track of the starting position (x-axis) and length (y-axis) of each constituent;
- 3) a set of edges: things that mark positions on the chart, the rules/productions involved, and how far along in each rule the parse currently is (it is this last piece that helps avoid reparsing); a minimal edge record is sketched below.
- We can use chart parsing for producing (and summing over) parses for PCFGs and for producing inside/outside probabilities. But what are these?
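A minimal sketch of such an edge record in Python (the field names are my own assumptions, not Charniak's); the dot position records how far along the production the parse is, and completed edges are exactly the constituents that never need to be reparsed.

  from dataclasses import dataclass
  from typing import Optional, Tuple

  @dataclass(frozen=True)
  class Edge:
      lhs: str                  # non-terminal being built, e.g. "NP"
      rhs: Tuple[str, ...]      # right-hand side of the production
      dot: int                  # how far along the rule the parse is
      start: int                # index of the first word covered
      end: int                  # index just past the last word covered

      def is_complete(self) -> bool:
          # a complete edge is a found constituent, stored for the
          # whole parse so it is never re-derived
          return self.dot == len(self.rhs)

      def next_needed(self) -> Optional[str]:
          # the category this (incomplete) edge is still looking for
          return None if self.is_complete() else self.rhs[self.dot]

The chart itself can then simply be a set of such edges, indexed by (start, end).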
9 PCFGs
- A CFG with a P in front.
- A (typically inherently ambiguous) grammar with a probability associated with each production rule for a non-terminal symbol; a toy example is sketched below.
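For concreteness, here is a tiny invented PCFG written as a Python dictionary (this grammar is an illustrative assumption, not the one on the slides); note that the probabilities of all rules sharing a left-hand side sum to one.

  # P(lhs -> rhs); for each lhs, the rule probabilities sum to 1
  TOY_PCFG = {
      ("S",   ("NP", "VP")): 1.0,
      ("NP",  ("Det", "N")): 0.6,
      ("NP",  ("N",)):       0.4,
      ("VP",  ("V", "NP")):  0.7,
      ("VP",  ("V",)):       0.3,
      ("Det", ("the",)):     1.0,
      ("N",   ("dog",)):     0.5,
      ("N",   ("cat",)):     0.5,
      ("V",   ("sees",)):    1.0,
  }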
10 PCFGs
- Example parse (see the parse-tree figure, whose rules are annotated with the probabilities 0.8, 0.3, 0.2, 0.4, 0.4, 0.05, 0.45, 0.4, 0.5).
- The probability of this parse is the product of the probabilities of the rules used:
- 0.8 × 0.2 × 0.4 × 0.05 × 0.45 × 0.3 × 0.4 × 0.4 × 0.5 ≈ 3.5 × 10^-5
11 PCFGs
- This gives probabilities for a string of words w1n, where t1n varies over all possible parse trees for the word string w1n (see the reconstruction below).
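The formula itself is an image in the original; presumably it is the standard sum over parses, which in LaTeX is

  P(w_{1n} \mid G) \;=\; \sum_{t_{1n}} P(w_{1n}, t_{1n} \mid G) \;=\; \sum_{t_{1n}\,:\,\mathrm{yield}(t_{1n}) = w_{1n}} P(t_{1n} \mid G)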
12 Domination Notation
- A non-terminal dominates the terminal symbols beneath it.
- So in general, N^j_{k,l} means that the jth non-terminal (N^j) dominates the terminals starting at position k and ending at position l (inclusive).
13 Probability Rules
- For a non-terminal dominating the terminals starting at k and ending at l, we get the rule probabilities shown (figure),
- for any k, l and any m, n, q with k ≤ m < n ≤ q < l,
- where X, Y, ..., Z can correspond to any valid set of terminals or non-terminals that partition the terminals from k to l.
14 Chomsky Hierarchy (reminder)
15 Regular grammar (reminder)
- Generates the regular languages (e.g., regular expressions without bells/whistles, FSAs, etc.)
- A → a, where A is a non-terminal and a is a terminal.
- A → aB, where A and B are non-terminals and a is a terminal.
- A → ε, where A is a non-terminal.
- Example: a^n b^m, m, n > 0, can be generated by
- S → aA
- A → aA
- A → bB
- B → bB
- B → ε
16 Context-Free (reminder)
- Generates the context-free languages.
- The left-hand side of a production rule can be only one non-terminal symbol.
- Every rule is of the form
- A → α
- where α is a string of terminals and/or non-terminals.
- Called context free since the non-terminal A can always be replaced by α regardless of the context in which A occurs.
17 Context-Free (reminder)
- Example: generate a^n b^n, n ≥ 0
- A → aAb
- A → ε
- The above language can't be generated by a regular grammar (the language is not regular).
- Often used for compilers.
- No probabilities (yet).
18 Context-Sensitive (reminder)
- Generates the context-sensitive languages.
- Generated by rules of the form
- αAβ → αγβ
- A → ε
- where A is a non-terminal, α is any string of terminals/non-terminals, β is any string of terminals/non-terminals, and γ is any non-empty string of terminals/non-terminals.
19 Context-Sensitive (reminder)
- Example: a^n b^n c^n, n ≥ 1
- A → abc
- A → aABc
- cB → Bc
- bB → bb
- Example 2:
- a^n, n prime
20 Type-0 (unrestricted) grammar
- Generates all languages that can be recognized by a Turing machine (i.e., ones for which the TM can say yes and then halt).
- Known as the recursively enumerable languages.
- More general than context-sensitive.
- Example (I think):
- a^n b^m, n prime, m the prime following n
- aabbb, aaabbbbb, aaaaabbbbbbb, etc.
21 Independence Assumptions
- PCFGs are CFGs with probabilities, but in addition there are statistical independence assumptions made about the random events.
- Probabilities are unaffected by certain parts of the tree, given certain other non-terminals.
22 Example
- The following parse tree gives the equation below, where
- B_{1,3} is outside C_{4,5}
- A_{1,5} is above C_{4,5}
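The equation itself is an image in the original; given the independence assumptions just stated, it is presumably of the form

  P(w_{4,5} \mid C_{4,5}, A_{1,5}, B_{1,3}) \;=\; P(w_{4,5} \mid C_{4,5})

i.e., once we know that C dominates words 4 through 5, nothing above or outside C changes the probability of those words.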
23 Chomsky vs. Conditional Independence
- The fact that the grammar is context-free means only that certain production rules can occur in the grammar specification. Context-freeness determines the set of possible trees, not their probabilities.
- This alone does not guarantee that the probabilities are uninfluenced by parses outside a given non-terminal. This is where we need the conditional independence assumptions over probabilistic events.
- The events are parse events, or just dominance events. E.g., without the independence assumptions, p(w4, w5 | C_{4,5}, B_{2,3}) might not be the same as p(w4, w5 | C_{4,5}, B_{1,3}), or could even change depending on how we parse some other set of constituents.
24 Issues
- We want to be able to compute p(w_{1n} | G).
- But the number of parses of a given sentence can grow exponentially in the length of the sentence, for a given grammar, so naïve summing won't work.
25 Problems to Solve
- 1) Compute the probability of a sentence for a given grammar, p(w_{1n} | G).
- 2) Find the most likely parse for a given sentence.
- 3) Train the grammar probabilities in some good way (e.g., maximum likelihood).
- These can all be solved with the inside-outside algorithm.
26 A Note on comparisons to HMMs
- The text says that HMMs and probabilistic regular grammars (PRGs) assign different probabilities to strings.
- A PRG is such that summing the probabilities of all sentences of all lengths should give one.
- An HMM is such that summing the probabilities of all sentences of a given length T should give one.
- But HMMs have an implicit conditional on T. The HMM probability is shown below.
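The HMM probability referred to here is an image in the original; presumably it is the usual joint over a length-T state sequence, which implicitly fixes T:

  P(w_{1T} \mid T) \;=\; \sum_{x_{1T}} \prod_{t=1}^{T} P(x_t \mid x_{t-1})\, P(w_t \mid x_t)

where x_{1T} ranges over state sequences of length T and P(x_1 | x_0) denotes the initial-state distribution.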
27 A Note on comparisons to HMMs
- But HMMs can be defined to have an explicit length distribution.
- This can be done implicitly in an HMM by having a start and a stop state, thus defining P(T).
- We can alternatively make P(T) explicitly equal to the PRG probability obtained by summing over all sentences of length T.
28 Example PRG and its parses
29 Forward/backward in HMMs
- Forward (alpha) recursion
- Backward (beta) recursion
- Probability of a sentence
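The recursions themselves are images in the original; the standard forms, in the usual HMM notation (states i, j; transition probabilities a_{ij}; emission probabilities b_j(w_t); initial distribution \pi), are

  \alpha_t(j) \;=\; \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_j(w_t), \qquad \alpha_1(j) = \pi_j\, b_j(w_1)

  \beta_t(i) \;=\; \sum_j a_{ij}\, b_j(w_{t+1})\, \beta_{t+1}(j), \qquad \beta_T(i) = 1

  P(w_{1T}) \;=\; \sum_i \alpha_T(i) \;=\; \sum_i \alpha_t(i)\,\beta_t(i) \quad \text{for any } t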
30 Association of α/β with outside/inside
- For a PRG, the backward (β) probability is like the stuff below a given non-terminal, or perhaps inside the non-terminal.
- For a PRG, the forward (α) probability is like the stuff above a given non-terminal, or perhaps outside the non-terminal.
31 Inside/Outside (β/α) probability defs.
- Inside probability: the stuff that is inside (or that is dominated by) a given non-terminal, corresponding to the terminals in the range k through l.
- Outside probability: the stuff that is outside the terminal range k through l.
32 Inside/Outside probability defs.
- (Figure: the region outside N^j vs. the region inside N^j.)
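Written out in the standard notation (the slide equations are images, so this is a reconstruction), with β the inside probability and α the outside probability:

  \beta_j(k,l) \;=\; P(w_{kl} \mid N^j_{k,l}, G)

  \alpha_j(k,l) \;=\; P(w_{1(k-1)},\; N^j_{k,l},\; w_{(l+1)n} \mid G)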
33 Inside/Outside probability
- As in HMMs, we can get sentence probabilities from the inside and outside probabilities of any non-terminal.
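In particular, the sentence probability is presumably read off the inside probability of the start symbol N^1 spanning the whole string:

  P(w_{1n} \mid G) \;=\; \beta_1(1, n)

An outside-based expression for the same quantity appears on slide 37.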
34 Inside probability computation
- Assume Chomsky normal form.
- For the more general case, see today's handout from the Charniak text; one needs to use details about the edges in the chart.
- Base case (to then go from a terminal on up the parse tree): the inside probability of a non-terminal spanning a single word.
- Next case: consider the range k through l, and all non-terminals that could generate w_{kl}. Since the grammar is in Chomsky normal form, a non-terminal expands into two non-terminals N^p and N^q, and we must consider all possible ways of splitting the range k through l so that these two non-terminals jointly dominate it (see below).
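A reconstruction of the base case and recursion in the standard notation (the slide equations are images):

  \beta_j(k,k) \;=\; P(N^j \to w_k)

  \beta_j(k,l) \;=\; \sum_{p,q} \sum_{m=k}^{l-1} P(N^j \to N^p N^q)\, \beta_p(k,m)\, \beta_q(m+1,l) \qquad (k < l)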
35 Inside probability computation
- Sum over all pairs of non-terminals (p, q) and over the location of the split, m.
- (Derivation steps: by the chain rule of probability; by PCFG conditional independence; rename.)
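A minimal Python sketch of this bottom-up computation (my own code, with an assumed grammar representation, not Charniak's):

  from collections import defaultdict

  def inside(words, binary_rules, lexical_rules):
      # binary_rules:  {(j, p, q): P(N^j -> N^p N^q)}
      # lexical_rules: {(j, word): P(N^j -> word)}
      # returns beta[(j, k, l)] = P(w_k..w_l | N^j spans k..l), 1-indexed
      n = len(words)
      beta = defaultdict(float)
      # base case: spans of length one
      for k, w in enumerate(words, start=1):
          for (j, word), p in lexical_rules.items():
              if word == w:
                  beta[(j, k, k)] = p
      # longer spans, shortest first
      for span in range(2, n + 1):
          for k in range(1, n - span + 2):
              l = k + span - 1
              for (j, p_sym, q_sym), rule_p in binary_rules.items():
                  for m in range(k, l):  # split point between the two children
                      beta[(j, k, l)] += (rule_p
                                          * beta[(p_sym, k, m)]
                                          * beta[(q_sym, m + 1, l)])
      return beta

  # sentence probability: beta[(start_symbol, 1, len(words))]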
36 Inside probability computation
- m = number of distinct non-terminal symbols (so up to m^3 binary rules)
- n = length of the string
- Computational complexity is O(n^3 m^3).
37 Outside probability
- The outside probability can also be used to compute the word-string probability, as follows (see the identity below).
- How do we get this?
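Presumably the identity referred to is the standard one, valid for any word position k:

  P(w_{1n} \mid G) \;=\; \sum_j \alpha_j(k,k)\, P(N^j \to w_k)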
38 Outside probability computation
- Note again: this is true for any k at all.
- So how do we compute the outside probabilities?
- Inside probabilities are calculated bottom-up, and (not surprisingly) outside probabilities are calculated top-down, so we start at the top, un-marginalizing over all non-terminals j.
- (Derivation steps: chain rule; conditional independence and rename.)
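Matching the derivation-step labels above, the derivation of the identity is presumably (for any fixed k):

  P(w_{1n}) \;=\; \sum_j P(w_{1(k-1)},\; N^j_{k,k},\; w_{(k+1)n},\; w_k)   [un-marginalize over j]

  \phantom{P(w_{1n})} \;=\; \sum_j P(w_{1(k-1)},\; N^j_{k,k},\; w_{(k+1)n})\; P(w_k \mid w_{1(k-1)}, N^j_{k,k}, w_{(k+1)n})   [chain rule]

  \phantom{P(w_{1n})} \;=\; \sum_j \alpha_j(k,k)\, P(N^j \to w_k)   [conditional independence and rename]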
39 Outside probability computation
- Starting at the top (the base case).
- Next, to calculate the outside probability for N^j_{k,l} (and given that we've got Chomsky normal form), there are two possibilities for the higher constituent it could have come from: N^j is either on the left or on the right of its sibling.
- These two events are exclusive and exhaustive, so this means we'll need to sum the probabilities of the two cases (see the recursion below).
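A reconstruction of the base case and the two-case recursion in the standard notation (an assumption; N^f is the parent and N^g the sibling):

  \alpha_1(1,n) = 1, \qquad \alpha_j(1,n) = 0 \ \ \text{for } j \neq 1

  \alpha_j(k,l) \;=\; \sum_{f,g} \Big[ \sum_{m=l+1}^{n} \alpha_f(k,m)\, P(N^f \to N^j N^g)\, \beta_g(l+1,m) \;+\; \sum_{m=1}^{k-1} \alpha_f(m,l)\, P(N^f \to N^g N^j)\, \beta_g(m,k-1) \Big]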
40 Outside probability computation
41 Outside probability computation
42 Outside probability computation
- m = number of distinct non-terminal symbols
- n = length of the string
- Computational complexity is O(n^3 m^3).
43 Other notes
- We can form products of inside and outside values (but they have a slightly different meaning than in HMMs).
- Here the sum over non-terminals gives the probability of the word sequence together with the event that some non-terminal constituent spans k through l (different from the HMM case, where the analogous sum gives just the probability of the observations). We get the expression below.
- (This follows, again, from conditional independence.)
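Presumably the expression is

  \alpha_j(k,l)\, \beta_j(k,l) \;=\; P(w_{1n},\; N^j_{k,l} \mid G)

  \sum_j \alpha_j(k,l)\, \beta_j(k,l) \;=\; P(w_{1n},\; \text{some constituent spans } k..l \mid G)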
44 Training PCFG probabilities
- The inside/outside procedure can be used to compute expected sufficient statistics (e.g., posteriors over production rules) so that we can use them in an EM iterative update procedure.
- This is very useful when we have a database of text but don't have a database of observed parses for it (i.e., when we don't have a treebank). We can use EM with inside/outside to find a maximum-likelihood solution for the probability values.
- All we need are equations that give us the expected counts.
- Note: the book's notation mentions ratios of actual counts C, but they really are expected counts E[count], which are sums of probabilities, as in normal EM. Recall that the expected value of an indicator is its probability:
- E[1_{event}] = Pr(event)
45 Training PCFG probabilities
- First, we need the definition of the re-estimated (posterior) probability of a production as a ratio of expected counts.
- Next, we need the expected counts themselves (a reconstruction is given below).
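A plausible reconstruction of these equations in the standard inside-outside form, for a single training sentence w_{1n} (the slides' exact notation is not in the transcript):

  \hat{P}(N^j \to N^p N^q) \;=\; \frac{E[\mathrm{count}(N^j \to N^p N^q)]}{E[\mathrm{count}(N^j \text{ used})]}

  E[\mathrm{count}(N^j \to N^p N^q)] \;=\; \frac{1}{P(w_{1n})} \sum_{1 \le k \le m < l \le n} \alpha_j(k,l)\, P(N^j \to N^p N^q)\, \beta_p(k,m)\, \beta_q(m+1,l)

  E[\mathrm{count}(N^j \text{ used})] \;=\; \frac{1}{P(w_{1n})} \sum_{1 \le k \le l \le n} \alpha_j(k,l)\, \beta_j(k,l)

With multiple sentences, the expected counts are summed over the corpus before taking the ratio.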
46 Training PCFG probabilities
- Lastly, we need the expected counts for the terminal (lexical) rules, i.e., that non-terminal N^i produced vocabulary word w^j (see below).
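Presumably of the form

  E[\mathrm{count}(N^i \to w^j)] \;=\; \frac{1}{P(w_{1n})} \sum_{k\,:\,w_k = w^j} \alpha_i(k,k)\, P(N^i \to w^j)

and the re-estimated lexical probability is again this divided by E[count(N^i used)].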
47 Summary
- PCFGs are powerful.
- There are good training algorithms for them.
- But are they enough? Context-sensitive grammars were really developed for natural (written and spoken) language, but the algorithms for them are not efficient.