Grammar induction by Bayesian model averaging - PowerPoint PPT Presentation

1
Grammar induction by Bayesian model averaging
  • Guy Lebanon
  • LARG meeting
  • May 2001
  • Based on Andreas Stolcke's thesis, UC Berkeley, 1994

2
Why automatic grammar induction (AGI)?
  • Enables using domain-dependent grammars without
    expert intervention.
  • Enables using person-dependent grammars without
    expert intervention.
  • Can be used on different languages (without a
    linguist familiar with the particular language).
  • A process of grammar induction with expert
    guidance may be more accurate than a human-written
    grammar, since computers are more adept than
    humans at analyzing large corpora.

3
Why statistical approaches to AGI?
  • In practice, languages are not purely logical
    structures.
  • Sentences as actually spoken are often not
    precisely grammatical, and expanding the grammar
    to cover them leads to an explosion of grammar
    rules.
  • A large grammar will lead to many parses of the
    same sentence. Clearly, some parses are more
    accurate than others. Statistical approaches make
    it possible to include a large set of grammar
    rules while assigning a probability to each
    parse.
  • Statistics offers known optimality conditions and
    optimization procedures.

4
Some Bayesian statistics
  • For each grammar M (a set of rules together with
    their rule probabilities), a prior probability
    p(M) is assigned. This value may represent an
    expert's opinion about how likely the grammar is.
  • Upon introduction of a training set X (an
    unlabeled corpus), the model posterior is
    computed by Bayes' law.
  • Either the grammar that maximizes the posterior
    is kept (as the best grammar), or the set of all
    grammars and their posteriors is kept (Bayesian
    model averaging, which is better).
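
Written out, using the slide's notation (M for a grammar, X for the
training corpus), the Bayes-law posterior and the model-averaging
prediction that motivates keeping all grammars and their posteriors:

  p(M \mid X) = \frac{p(X \mid M)\, p(M)}{p(X)},
  \qquad p(X) = \sum_{M'} p(X \mid M')\, p(M')

  p(x_{\mathrm{new}} \mid X) = \sum_{M} p(x_{\mathrm{new}} \mid M)\, p(M \mid X)
  \quad \text{(prediction by Bayesian model averaging)}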

5
Priors for CF grammars
  • The prior of a grammar p(M) is split into two
    parts: a prior over the rules (the grammar
    structure) and a prior over the rule
    probabilities.
  • The structure component is taken to introduce a
    bias towards short grammars (fewer rules). One
    way of doing that, though still heuristic, is
    minimum description length (MDL).
  • The prior for the rule probabilities is taken to
    be a uniform Dirichlet prior, which has the
    effect of smoothing low rule-usage counts.
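
One way to write this two-part prior. The symbols M_S for the rule set
(structure), \theta_M for the rule probabilities, N for the set of
non-terminals, and k_A for the number of productions of non-terminal A
are notational assumptions made here, not taken from the slide:

  p(M) = p(M_S)\, p(\theta_M \mid M_S)

  p(M_S) \propto e^{-\,l(M_S)}
  \quad \text{(MDL-style bias toward grammars with short description length } l\text{)}

  p(\theta_M \mid M_S) = \prod_{A \in N} \mathrm{Dir}(\theta_A \mid 1, \dots, 1)
  \quad \text{(a uniform Dirichlet over each non-terminal's } k_A \text{ production probabilities)}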

6
Grammar posterior
  • It is too hard to maximize the posterior over
    both the rules and the probabilities. Instead,
    the search maximizes the posterior of the rules
    only, with the rule probabilities integrated out
    and the likelihood of each sentence x approximated
    by its Viterbi derivation V(x).
  • The resulting integral has a closed-form
    solution.
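
A sketch of the quantity being maximized, under the notation assumed
above (A ranges over non-terminals, A -> \lambda over productions,
c(A \to \lambda) is the number of times a rule is used in the Viterbi
derivations of the corpus):

  p(M_S \mid X) \propto p(M_S) \int p(\theta \mid M_S)\, p(X \mid M_S, \theta)\, d\theta

  p(X \mid M_S, \theta) \approx \prod_{x \in X} p(V(x) \mid M_S, \theta)
  = \prod_{A \to \lambda} \theta_{A \to \lambda}^{\,c(A \to \lambda)}

With the uniform Dirichlet prior from the previous slide, the integral
is the standard Dirichlet-multinomial result and closes to ratios of
Gamma functions:

  \int \prod_{A \to \lambda} \theta_{A \to \lambda}^{\,c(A \to \lambda)}\; p(\theta \mid M_S)\, d\theta
  = \prod_{A \in N} \frac{\Gamma(k_A)\, \prod_{\lambda} \Gamma\big(c(A \to \lambda) + 1\big)}{\Gamma\!\big(k_A + \sum_{\lambda} c(A \to \lambda)\big)}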

7
Maximizing the posterior
  • Even though computing an approximation to the
    posterior is possible in closed form, coming up
    with a grammar that maximizes it is still a hard
    problem.
  • A. Stolcke: start with many rules, then apply
    greedy merging operations to maximize the
    posterior.
  • Model merging was applied to hidden Markov
    models, probabilistic context-free grammars, and
    probabilistic attribute grammars (PCFGs with
    semantic features tied to non-terminals).

8
A concrete example: PCFG
  • A specific PCFG consists of a list of rules s and
    a set of production probabilities.
  • For a given s, it is possible to learn the
    production probabilities with EM. Coming up with
    an optimal s is still an open problem; Stolcke's
    model merging is an attempt to tackle it.
  • Given a corpus (a set of sentences), an initial
    set of rules is constructed directly from the
    sentences (see the sketch below).
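
A minimal sketch of the initial-rule construction. It assumes the
simple incorporation scheme of one top-level rule per sentence plus one
lexical non-terminal per word type; the function name and the "N_"
naming convention are illustrative choices, not Stolcke's.

  def initial_grammar(corpus):
      """Build the initial rule set from a corpus of tokenized sentences.

      Returns a sorted list of (lhs, rhs) rules, where rhs is a tuple of
      symbols; words are covered by lexical non-terminals.
      """
      rules = set()
      for sentence in corpus:
          # One lexical non-terminal per word type: N_word -> word
          pre_terminals = tuple("N_" + w for w in sentence)
          for nt, w in zip(pre_terminals, sentence):
              rules.add((nt, (w,)))
          # One top-level rule covering the whole sentence: S -> N_w1 ... N_wn
          rules.add(("S", pre_terminals))
      return sorted(rules)

  # Example: a toy corpus yields rules such as S -> N_the N_dog N_barks
  corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
  for lhs, rhs in initial_grammar(corpus):
      print(lhs, "->", " ".join(rhs))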

9
Merging operators
  • Non-terminal merging: replace two existing
    non-terminals X1, X2 with a single new
    non-terminal Y (all occurrences of X1 and X2 are
    rewritten as Y).
  • Non-terminal chunking: given an ordered sequence
    of non-terminals X1 ... Xk, create a new
    non-terminal Y that expands to X1 ... Xk, and
    replace occurrences of X1 ... Xk on right-hand
    sides with Y.
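
A rough sketch of the two operators, continuing the (lhs, rhs) rule
representation from the sketch above. The function names and the
set-based deduplication of rules are assumptions for illustration.

  def merge_nonterminals(rules, x1, x2, new_nt):
      """Replace every occurrence of non-terminals x1 and x2 by new_nt."""
      def rename(sym):
          return new_nt if sym in (x1, x2) else sym
      merged = {(rename(lhs), tuple(rename(s) for s in rhs)) for lhs, rhs in rules}
      return sorted(merged)

  def chunk_sequence(rules, seq, new_nt):
      """Add new_nt -> seq and replace the sequence seq on right-hand sides."""
      assert len(seq) >= 1
      def replace(rhs):
          rhs, target, out, i = list(rhs), list(seq), [], 0
          while i < len(rhs):
              if rhs[i:i + len(target)] == target:
                  out.append(new_nt)       # collapse the matched sequence
                  i += len(target)
              else:
                  out.append(rhs[i])
                  i += 1
          return tuple(out)
      chunked = {(lhs, replace(rhs)) for lhs, rhs in rules}
      chunked.add((new_nt, tuple(seq)))
      return sorted(chunked)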

10
PCFG priors
  • Prior for the rule probabilities: the uniform
    Dirichlet prior described above.
  • Prior for the rules: based on description length,
    which is defined separately for non-lexical rules
    (those that don't produce a terminal symbol) and
    for lexical rules (those that produce a terminal
    symbol).
  • The prior was taken to be either exponentially
    decreasing or Poisson in the description length.
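
A plausible form of such a description-length prior. The specific
formulas below are assumptions stated only to make the idea concrete
(N is the non-terminal set, \Sigma the terminal alphabet, M_S the rule
set); they are not reproduced from the slide:

  l(A \to B_1 \dots B_k) = (k + 1)\, \log_2 |N|
  \quad \text{(non-lexical rule: } k+1 \text{ non-terminal symbols)}

  l(A \to a) = \log_2 |N| + \log_2 |\Sigma|
  \quad \text{(lexical rule: one non-terminal, one terminal)}

  p(M_S) \propto \prod_{r \in M_S} e^{-\,l(r)}
  \quad \text{or, alternatively, each } l(r) \text{ drawn from a Poisson distribution.}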
12
Search strategy
  • Start with the initial rules.
  • Try applying all possible merge operations. For
    each resulting grammar compute the posterior, and
    choose the merge that yields the highest
    posterior.
  • Search strategies (see the sketch below):
  • Best-first search
  • Best-first search with look-ahead
  • Beam search
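
A minimal sketch of the best-first strategy described above. The
helpers all_merges(grammar), which enumerates the grammars reachable by
one merge or chunk operation, and log_posterior(grammar, corpus) are
placeholder names passed in as parameters, not Stolcke's API.

  def best_first_merge(grammar, corpus, all_merges, log_posterior):
      """Greedy best-first search: repeatedly apply the single best-scoring
      merge until no merge improves the (approximate) log posterior."""
      best_score = log_posterior(grammar, corpus)
      while True:
          candidates = [(log_posterior(g, corpus), g) for g in all_merges(grammar)]
          if not candidates:
              return grammar
          score, g = max(candidates, key=lambda pair: pair[0])
          if score <= best_score:            # no improvement: local maximum reached
              return grammar
          grammar, best_score = g, score     # accept the best merge and continue

Best-first with look-ahead and beam search generalize this loop by
scoring short sequences of merges, or by keeping the top few grammars
at each step instead of only one.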

13
Now some examples