Probabilistic - PowerPoint PPT Presentation

1 / 21
About This Presentation
Title:

Probabilistic

Description:

Some ambiguous sentences could not be disambiguated, and we would like to know ... For example, TWA flights is ambiguous between being two separate NPs (cf. ... – PowerPoint PPT presentation

Number of Views:23
Avg rating:3.0/5.0
Slides: 22
Provided by: inderje
Category:

less

Transcript and Presenter's Notes

Title: Probabilistic


1
  • Probabilistic
  • Parsing
  • and
  • Treebanks
  • LING 362
  • Intro to NLP

2
Motivation and Outline
  • Previously, we used CFGs to parse with, but
  • Some ambiguous sentences could not be
    disambiguated, and we would like to know the most
    likely parse
  • How do we get such grammars? Do we write them
    ourselves? Maybe we could use a corpus
  • Where were going
  • Probabilistic Context-Free Grammars (PCFGs)
  • Lexicalized PCFGs
  • Dependency Grammars

3
Statistical Parsing
  • Basic idea
  • Start with a treebank
  • a collection of sentences with syntactic
    annotation, i.e., already-parsed sentences
  • Examine which parse trees occur frequently
  • Extract grammar rules corresponding to those
    parse trees, estimating the probability of the
    grammar rule based on its frequency
  • That is, well have a CFG augmented with
    probabilities

4
Probabilistic Context-Free Grammars (PCFGs)
  • Definition of a CFG
  • Set of non-terminals (N)
  • Set of terminals (T)
  • Set of rules/productions (P), of the form ? ? ß
  • Designated start symbol (S)
  • Definition of a PCFG
  • Same as a CFG, but with one more function, D
  • D assigns probabilities to each rule in P

5
Probabilities
  • The function D gives probabilities for a
    non-terminal A to be expanded to a sequence ß.
  • Written as P(A ? ß)
  • or as P(A ? ßA)
  • The idea is that, given A as the mother
    non-terminal (LHS), what is the likelihood that ß
    is the correct RHS
  • Note that Si (A ? ßi A) 1
  • For example, we would augment a CFG with these
    probabilities
  • P(S ? NP VP S) .80
  • P(S ? Aux NP VP S) .15
  • P(S ? VP S) .05

6
Estimating Probabilities using a Treebank
  • Given a corpus of sentences annotated with
    syntactic annotation (e.g., the Penn Treebank)
  • Consider all parse trees
  • (1) Each time you have a rule of the form A?ß
    applied in a parse tree, increment a counter for
    that rule
  • (2) Also count the number of times A is on the
    left hand side of a rule
  • Divide (1) by (2)
  • P(A?ßA) Count(A?ß)/Count(A)
  • If you dont have annotated data, parse the
    corpus (as well describe next) and estimate the
    probabilities which are then used to re-parse.

7
Using Probabilities to Parse
  • P(T) probability of a particular parse tree
  • P(T) ?n?T p(r(n))
  • i.e., the product of the probabilities of all
    the rules r used to expand each node n in the
    parse tree
  • Example given the probabilities on p. 449,
    compute the probability of the parse tree on the
    right

8
Computing probabilities
  • We have the following rules and probabilities
    (adapted from Figure 12.1)
  • S ? VP .05
  • VP ? V NP .40
  • NP ? Det N .20
  • V ? book .30
  • Det ? that .05
  • N ? flight .25
  • P(T) P(S?VP)P(VP?V NP)P(N?flight)
  • .05.40.20.30.05.25 .000015, or 1.5 x 10-5

9
Using probabilities
  • So, the probability for that parse is 0.000015.
    Whats the big deal?
  • Probabilities are useful for comparing with other
    probabilities
  • Whereas we couldnt decide between two parses
    using a regular CFG, we now can.
  • For example, TWA flights is ambiguous between
    being two separate NPs (cf. I gave NP John NP
    money) or one NP
  • A book TWA flights
  • B book TWA flights
  • Probabilities allows us to choose choice B (see
    figure 12.2)

10
Obtaining the best parse
  • Call the best parse T(S), where S is your
    sentence
  • Get the tree which has the highest probability,
    i.e.
  • T(S) argmaxT?parse-trees(S) P(T)
  • Can use the Cocke-Younger-Kasami (CYK) algorithm
    to calculate best parse
  • CYK is a form of dynamic programming
  • CYK is a chart parser, like the Earley parser

11
The CYK algorithm
  • Base case
  • Add words to the chart
  • Store P(A ? w_i) for every category A in the
    chart
  • Recursive case ? makes this dynamic programming
    because we only calculate B and C once
  • Rules must be of the form A ? BC, i.e., exactly
    two items on the RHS (we call this Chomsky Normal
    Form (CNF))
  • Get the probability for A at this node by
    multiplying the probabilities for B and for C by
    P(A ? BC)
  • P(B)P(C)P(A ? BC)
  • For a given A, only keep the maximum probability
    (again, this is dynamic programming)

12
Problems with PCFGs
  • Its still only a CFG, so dependencies on non-CFG
    info not captured
  • e.g., Pronouns are more likely to be subjects
    than objects
  • P(NP?Pronoun) NPsubj gtgt P(NP?Pronoun) NP
    obj
  • Ignores lexical information (statistics), which
    is usually crucial for disambiguation
  • (T1) America sent 250,000 soldiers into
    Iraq
  • (T2) America sent 250,000 soldiers into Iraq
  • send with into-PP always attached high (T2) in
    PTB!
  • To handle lexical information, well turn to
    lexicalized PCFGs

13
Lexicalized Grammars
  • Remember how we talked about head information
    being passed up in a syntactic analysis?
  • e.g., VPhead 1 ? Vhead 1 NP
  • Well, if you follow this down all the way to the
    bottom of a tree, you wind up with a head word
  • In some sense, we can say that Book that flight
    is not just an S, but an S rooted in book
  • Thus, book is the headword of the whole sentence
  • By adding headword information to nonterminals,
    we wind up with a lexicalized grammar

14
Lexicalized PCFGs
  • Lexicalized Parse Trees
  • Each PCFG rule in a tree is augmented to identify
    one RHS constituent to be the head daughter
  • The headword for a node is set to the head word
    of its head daughter

book
book
flight
flight
15
Incorporating Head Proabilities Wrong Way
  • Simply adding headword w to node wont work
  • So, the node A becomes Aw
  • e.g., P(Aw?ßA) Count(Aw?ß)/Count(A)
  • The probabilities are too small, i.e., we dont
    have a big enough corpus to calculate these
    probabilities
  • VP(dumped) ? VBD(dumped) NP(sacks) PP(into)
    3x10-10
  • VP(dumped) ? VBD(dumped) NP(cats) PP(into)
    8x10-11
  • These probabilities are tiny, and others will
    never occur

16
Incorporating head probabilities Right way
  • Previously, we conditioned on the mother node
    (A)
  • P(A?ßA)
  • Now, we can condition on the mother node and the
    headword of A (h(A))
  • P(A?ßA, h(A))
  • Were no longer conditioning on simply the mother
    category A, but on the mother category when h(A)
    is the head
  • e.g., P(VP?VBD NP PP VP, dumped)

17
Calculating rule probabilities
  • Well write the probability more generally as
  • P(r(n) n, h(n))
  • where n node, r rule, and h headword
  • We calculate this by comparing how many times the
    rule occurs with h(n) as the headword versus how
    many times the mother/headword combination appear
    in total
  • P(VP ? VBD NP PP VP, dumped)
  • C(VP(dumped) ? VBD NP PP)/ Sß C(VP(dumped) ?
    ß)

18
Adding info about word-word dependencies
  • We want to take into account one other factor
    the probability of being a head word (in a given
    context)
  • P(h(n)word )
  • We condition this probability on two things 1.
    the category of the node (n), and 2. the headword
    of the mother (h(m(n)))
  • P(h(n)word n, h(m(n))), shortened as P(h(n)
    n, h(m(n)))
  • P(sacks NP, dumped)
  • What were really doing is factoring in how words
    relate to each other
  • We will call this a dependency relation later
    sacks is dependent on dumped, in this case

19
Putting it all together
  • See p. 459 for an example lexicalized parse tree
    for
  • workers dumped sacks into a bin
  • For rules r, category n, head h, mother m
  • P(T) ?n?T
  • p(r(n) n, h(n))
  • e.g., P(VP? VBD NP PP VP, dumped)
  • subcategorization info
  • p(h(n) n, h(m(n)))
  • e.g. P(sacks NP, dumped)
  • dependency info between words

20
Dependency Grammar
  • Capturing relations between words (e.g. dumped
    and sacks) is moving in the direction of
    dependency grammar (DG)
  • In DG, there is no such thing as constituency
  • The structure of a sentence is purely the binary
    relations between words
  • John loves Mary is represented as
  • LOVE ? JOHN
  • LOVE ? MARY
  • where A ? B means that B depends on A

21
Evaluating Parser Output
  • Dependency relations are also useful for
    comparing parser output to a treebank
  • Traditional measures of parser accuracy
  • Labeled bracketing precision
  • correct constituents in parse/ constituents in
    parse
  • Labeled bracketing recall
  • correct constituents in parse/ (correct)
    constituents in treebank parse
  • There are known problems with these measures, so
    people are trying to use dependency-based
    measures instead
  • How many dependency relations did the parse get
    correct?
Write a Comment
User Comments (0)
About PowerShow.com