Bayesian Networks in Language Modeling - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Bayesian Networks in Language Modeling


1
Bayesian Networks in Language Modeling
Leon Peshkin, MIT
http://www.ai.mit.edu/pesha
2
Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 29 Feb 2004 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165

Time: Tuesday Mar 09, 10am
Place: TTI Conference Room
If you want to meet with the speaker, please send email to Cecilia Salazar <cecilia.salazar@intel.com>.

"Dynamic Bayesian Nets for Language Modeling"
Dr. Leon Peshkin, MIT CSAIL

Statistical methods in NLP exclude linguistically plausible models due to the prohibitive complexity of inference in such models. Dynamic Bayesian networks (DBNs) offer an elegant way to integrate various aspects of language in one model. Many existing algorithms developed for learning and inference in DBNs are applicable to probabilistic language modeling.
3
(The same announcement email repeated; this copy is addressed to pesha@ai.mit.edu and gives Place: I-9 Conference Room.)
4
(The same announcement email, with the Speaker field highlighted.)
5
(The same announcement email, with the Speaker and Topic fields highlighted.)
6
(The same announcement email, with the Speaker, Topic, and Location fields highlighted.)
7
(The same announcement email, with the Speaker, Topic, Time, and Location fields highlighted.)
8
(The same announcement email, with the Speaker and Time fields highlighted.)
9
Benchmark dataset
[Freitag 98]
  • CMU Seminar Announcement data set
  • 485 documents
  • 80% training, 20% testing
  • Extract four fields: Speaker, S-time, Location, E-time
10
Applications unlimited
  • Terrorist events (MUC)
  • Product Descriptions (ShopBot)
  • Restaurant Guides (STALKER)
  • Job Advertisement (RAPIER)
  • Executive Succession (WHISK)
  • Molecular biology (MEDLINE)

11
Many Formats
  • Free Text
  • Natural language processing
  • Structured Text
  • Textual information in a database
  • File following a predefined and strict format
  • Semistructured Text
  • Ungrammatical
  • Telegraphic
  • Web Documents

12
Great Many Systems
  • AutoSlog 1993
  • Liep 1995
  • Palka 1995
  • Hasten 1995
  • Crystal 1995
  • WebFoot 1997
  • WHISK 1999
  • RAPIER 1999
  • SRV 1998
  • Stalker 1995
  • WIEN 1997
  • Mealy 1998
  • SNOW-IE 2001
  • LP2 2001

13
Great Many Tools!
  • Lemmatizer: morpha [Yarowsky]
  • PoS taggers:
  • MaxEnt [Ratnaparkhi]
  • TnT [Brants]
  • [Brill]
  • LTChunk [Mikheev]
  • Syntactic Chunker: Sundance [Riloff]
  • Parser: [Charniak]

http://www.cis.upenn.edu/adwait/penntools.html
14
Features
15
Features
16
Features
17
Approaches
  • Create a set of rules and classifiers
  • Prune it using the training set
  • Markov Models

18
Authorship Resolution
  • Markov studied the distribution of vowels and consonants among the
    initial 20,000 letters of Eugene Onegin by Pushkin.

Markov, A. An example of a statistical investigation of the text of Eugene
Onegin illustrating the dependence between samples in chains. Bulletin de
l'Academie Imperiale des Sciences de St. Petersbourg, ser. VI, T. X, No. 3,
pages 153-162 (in Russian), 1913.
19
Bayesian Nets in One Slide
  • BNs represent structure in the joint distribution
  • Inference is computation of conditionals, e.g.
  • Pr(t2 = LastName | w1 = Doctor) = 0.4
  • Pr(w2 = prescribes | w1 = Doctor) = 0.01
  • Learning is counting (a minimal sketch follows below)

(Diagram: a two-slice network over tag nodes t1, t2 and word nodes w1, w2, with w1 observed as "Doctor".)
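A minimal sketch of the "learning is counting" idea (not from the slides; the toy tagged corpus and the tag names below are invented for illustration). Conditional probabilities like Pr(tag | word) are just ratios of counts:

# Estimate Pr(tag | word) from a toy tagged corpus by relative frequencies.
from collections import Counter, defaultdict

tagged = [("doctor", "title"), ("peshkin", "lastname"), ("doctor", "title"),
          ("presents", "verb"), ("doctor", "noun")]

counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word][tag] += 1            # joint counts n(word, tag)

def p_tag_given_word(tag, word):
    total = sum(counts[word].values())   # marginal count n(word)
    return counts[word][tag] / total if total else 0.0

print(p_tag_given_word("title", "doctor"))   # 2/3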
20
Sequence Tagging
doctor peshkin presents in room 620 CEP
21-26
Sequence Tagging
doctor peshkin presents in room 620 CEP
(Tag lattice: each token can be labeled speaker, location, or none; slides 21-26 animate candidate labelings over this lattice.)
27
Markov model
(Diagram: a chain of tags t1, t2, t3, ..., tn generating the text w1, w2, w3, ..., wn; the arrows carry lexical probabilities (tag to word) and contextual probabilities, i.e. transition probabilities modeled as n-grams (trigrams). The factorization is written out below.)
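The slide does not write the factorization out; in standard notation, a trigram tag model with lexical emissions corresponds to

P(w_{1:n}, t_{1:n}) \;=\; \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}, t_{i-2})

where the first factor carries the lexical probabilities and the second the contextual (trigram) probabilities.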
28
Markov model
  • Training: the model parameters are estimated from annotated corpora using
    relative frequencies (or via EM)
  • Tagging: the tag sequence that maximizes the likelihood is computed by
    dynamic programming, the Viterbi algorithm (see the sketch below)
  • Severe sparseness problem when extending the n-gram scope
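A minimal Viterbi sketch for a bigram tagger (a simplification of the trigram model above; the tiny tag set, vocabulary, and probabilities are invented for illustration rather than estimated from a corpus):

# Viterbi decoding for a toy bigram HMM tagger.
import math

tags = ["noun", "verb"]
trans = {("<s>", "noun"): 0.7, ("<s>", "verb"): 0.3,
         ("noun", "noun"): 0.3, ("noun", "verb"): 0.7,
         ("verb", "noun"): 0.6, ("verb", "verb"): 0.4}
emit = {("noun", "doctor"): 0.5, ("verb", "doctor"): 0.1,
        ("noun", "presents"): 0.2, ("verb", "presents"): 0.6}

def viterbi(words):
    # delta[t] = best log-probability of any tag sequence ending in tag t
    delta = {t: math.log(trans[("<s>", t)]) + math.log(emit.get((t, words[0]), 1e-6))
             for t in tags}
    back = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] + math.log(trans[(p, t)]))
            new_delta[t] = (delta[best_prev] + math.log(trans[(best_prev, t)])
                            + math.log(emit.get((t, w), 1e-6)))
            pointers[t] = best_prev
        delta, back = new_delta, back + [pointers]
    # follow back-pointers from the best final tag
    best = max(tags, key=lambda t: delta[t])
    seq = [best]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))

print(viterbi(["doctor", "presents"]))   # ['noun', 'verb']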

29
HMM example
[Freitag & McCallum 99]
  • Fixed topology captures limited context:
  • prefix states before and suffix states after the target state
(Diagram: a background state, prefix states pre1-pre4, the target speaker state, and suffix states suf1-suf4, each annotated with its 5 most probable tokens, e.g. "\n", ".", "who", "speaker", "Dr", "Professor", "unknown".)
30
BIEN
31
Conditional Probability
Pr(Current Tag | Last Target)
(Diagram: the conditional probability table linking the Current Tag node to the Last Target node.)
32
Evaluation Metrics for IE
  • Precision (P)
  • Recall (R)
  • F-measure (standard definitions below)
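The slide's formulas are not preserved in the transcript; the standard IE definitions are

P = \frac{\text{correct slots extracted}}{\text{slots extracted}}, \qquad
R = \frac{\text{correct slots extracted}}{\text{slots in the answer key}}, \qquad
F_1 = \frac{2PR}{P + R}

with F_1 being the usual harmonic-mean instance of the F-measure.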

33
Comparison
[Peshkin & Pfeffer 2003]
34
Cheating ?
Date: Tue, 9 Mar 2004 11:10:40 -0500 (EST)
From: rprasad@bbn.com
To: sl-seminar-ext@bbn.com
Subject: SDP Speech and Language Seminar

SCIENCE DEVELOPMENT PROGRAM AT BBN
Speech and Language Seminar Series

TITLE: Dynamic Bayesian Nets for Language Modeling
Speaker: Dr. Leon Peshkin, MIT AI Lab
Date: Thursday, March 18, at 11 a.m.
Place: BBN Technologies, 50 Moulton Street, Cambridge, MA, Room 2/267

ABSTRACT: Statistical methods in NLP exclude linguistically plausible models due to the prohibitive complexity of inference in such models. Dynamic Bayesian networks (DBNs) offer an elegant way to integrate various aspects of
35
New corpus
Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 13 Feb 2004 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165

This Tuesday March 9th, Leon Peshkin is in town. He will tell us about Dynamic Bayesian nets for language modeling. Talk starts at 10am, Michael Ryerson, main building of Intel, Santa Clara.
36
New corpus
37
Ambiguous Features
38
Ambiguous Features
39
Ambiguous Features
40
Ambiguous Features
Professors Rob Banks and Will Gates give us a
short POS demo.
41
Part-of-Speech Tagging
doctor steals presents in christmas hall
42-47
Part-of-Speech Tagging
doctor steals presents in christmas hall
(Tag lattice: candidate parts of speech for each token include noun, p.noun, verb, adjective, and prep; slides 42-47 animate candidate labelings over this lattice.)
48
Part-of-Speech Tagging
doctor steals presents in christmas hall
(A disambiguated path through the tag lattice is highlighted.)
  • Correct disambiguation takes clues
  • Contextual information: what is the sequence of words and tags
  • Morphological information: the suffix of the word is -s
49
U. Penn PoS tagset
50
Linguistic Data Consortium (LDC)
www.ldc.upenn.edu
  • WSJ: 2,400 files, 1,086,250 words
  • Brown corpus: 1,016,277 words
  • popular lore
  • belles lettres, biography, memoirs, etc.
  • general fiction
  • mystery and detective fiction
  • science fiction
  • adventure and western fiction
  • romance and love story
  • humor

51
Why Build Another Tagger?
  • 3.4%  Transformation-based [Brill, 95]
  • 3.3%  Markov Models, TnT [Brants, 00]
  • 4.3%  Conditional Random Field [Lafferty et al., 01]
  • Maximum Entropy Model:
  • 3.5%  WSJ corpus [Ratnaparkhi, 96]
  • 2.6%  LOB corpus [Van Halteren et al., 98]
  • 5%  Bi-gram, 10%  Uni-gram [Charniak, 93]

52
Out of Vocabulary
[Lafferty & McCallum 02]
  • ???  Transformation-based
  • 25%  Markov Models
  • 28%  Conditional Random Field
  • 27%  Maximum Entropy Model

Per sentence: 70% wrong
53
Why Build Another Tagger?
  • 3.4%  Transformation-based [Brill, 95], 93,000-word lexicon
  • 3.3%  Markov Models, TnT [Brants, 00], 45,000-word lexicon
  • 2.9% unknown words
  • 11% unknown words, 5.5% error
  • Maximum Entropy Model:
  • 3.5%  WSJ corpus [Ratnaparkhi, 96], 25,000-word lexicon

54
Lewis Carroll
  • 'T was brillig, and the slithy toves
  • Did gyre and gimble in the wabe
  • All mimsy were the borogoves,
  • And the mome raths outgrabe.
  • "Beware the Jabberwock, my son!
  • The jaws that bite, the claws that catch!
  • Beware the Jubjub bird, and shun
  • The frumious Bandersnatch!"

55
Lewis Carroll
'T/RPR was/VerbPT brillig/Adject ,/, and/Conj the/Det slithy/Adject toves/NNS
Did/VerbPT gyre/VB and/Conj gimble/VB in/IN the/Det wabe/NN
All/Det mimsy/Adject were/VerbPT the/Det borogoves/NNS ,/,
And/Conj the/Det mome/Adject raths/NNS outgrabe/VerbPT ./.
56
Good Tagger
  • Generalization to novel corpora (English not WSJ)
  • Modest vocabulary
  • Easy customization
  • Integrated into larger NLP system

57
Debate in the literature
[Lafferty et al. 2002] [Manning & Schutze 1999] [Klein & Manning 2002] [Brants 2000]
  • What kind of model to use ?
  • How to train it ? (Joint vs conditional
    likelihood, etc.)
  • What kind of features to consider ?

58
Decompose MaxEnt
  • What kind of model to use ?
  • How to train it ? (Joint vs conditional
    likelihood, etc.)
  • What kind of features to consider ?

59
Close look at 117,558 features
[Ratnaparkhi 96, 97]
  • 2     Does the token contain a capital letter?
  • 2     Does the token contain a hyphen?
  • 2     Does the token contain a number?
  • 3600  Frequent prefixes, up to 4 letters long
  • 2900  Frequent suffixes, up to 4 letters long
  • 6800, 3600, 3800, 3000, 3000  Context features over the word window
    w-2 w-1 w w+1 w+2 and the tag context t-3 t-2 t-1 t t+1 t+2
    (a sketch of such feature extraction follows below)
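A minimal sketch of Ratnaparkhi-style orthographic features for a single token (the helper name and exact templates are approximations of the slide's feature classes, with prefixes and suffixes capped at 4 characters as the slide describes):

# Orthographic features for one token: capitalization, hyphen, digit, affixes.
def token_features(word):
    feats = {
        "has_capital": any(c.isupper() for c in word),
        "has_hyphen": "-" in word,
        "has_number": any(c.isdigit() for c in word),
    }
    for k in range(1, min(4, len(word)) + 1):
        feats[f"prefix_{k}"] = word[:k]
        feats[f"suffix_{k}"] = word[-k:]
    return feats

print(token_features("Mid-1970s"))
# {'has_capital': True, 'has_hyphen': True, 'has_number': True,
#  'prefix_1': 'M', ..., 'suffix_4': '970s'}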
60
Naïve DBN
[Peshkin & Savova 2004]
Bouquet of factored morphology features
61
Naïve DBN
(Diagram: each time slice of the naive DBN has nodes tk, sk, pk, wk, nk, hk, ck.)
62
DBN 2
Morphology features and some context
63
Even closer look at features
[Ratnaparkhi 96, 97]
Among the 2900 suffixes:
  • 31 hyphens: -off, -on, -out, -up
  • 400 numbers: -47, .014, 1970
  • 100 capital letters and bi-grams
Among the 3600 prefixes:
  • 84 hyphens: co-, ex-, in-, mid-
  • 533 numbers
  • 1500 capitalized words and letters
684 entries are identical in the suffix and prefix lists
500 entries are common to the prefix list and the vocabulary (there are five!)
400 entries are common to the suffix list and the word vocabulary
64
Empirical results - WSJ
65
Empirical results - WSJ
66
Empirical results - WSJ
67
Empirical results - WSJ
68
Empirical results (three figures per corpus; the column labels were not preserved in the transcript)
  WSJ:                    3.6   9.4   51.7
  Brown:                  7.7   21.9  69.2
  Jabberwocky:            11.7  23.4  65.2
  Seminar Announcements:  16.3  22.7  79.0
69
Combined DBN
70
Combined DBN
71
Future Work
  • Complex data: cancel, reschedule, multi-slot
  • Relational data [Roth & Yih 02]
  • Shallow parsing, keyword extraction, parsing
  • Semi-supervised learning [M. Collins 03]
  • Automated feature selection
  • Structure learning [Friedman et al. 01]
  • DBN MDP

72
Message
  • Set of features over model
  • Factored features help
  • Linguistically justified features
  • Generalization to novel corpora
  • Co-training and integration
  • Approximate inference, rich models
  • New application domain for DBNs

73
Acknowledgments
  • I have benefited from technical interactions with many people, including:
  • Eugene Charniak, Michael Collins, Mark Johnson, Avi Pfeffer, Leslie Kaelbling,
  • Jason Eisner, Andrew McCallum, Kevin Murphy, Stuart Shieber

74
Transformation-Based
[Brill 1995]
The/DT horse/NN will/MD win/VB the/DT race/? tomorrow/RB
The/DT horse/NN will/MD race/? tomorrow/RB
Step 1: assign known tags, draw ambiguous ones at random (NN vs VB)
75
Transformation-Based
The/DT horse/NN will/MD win/VB the/DT race/NN tomorrow/RB
The/DT horse/NN will/MD race/NN tomorrow/RB
Context window: w-3 w-2 w-1 w w+1 w+2, t-3 t-2 t-1 t t+1 t+2
76
Transformation-Based
The/DT horse/NN will/MD win/VB the/DT race/NN tomorrow/RB
The/DT horse/NN will/MD race/NN->VB tomorrow/RB
Step 2: apply transformation rules (430 rules), e.g. turn NN into VB after
tag MD (a sketch follows below)
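A minimal sketch of applying one such transformation rule (the rule, initial tags, and helper name are illustrative, not Brill's actual rule list):

# Apply one Brill-style rule: change tag NN to VB when the previous tag is MD.
def apply_rule(tags, from_tag, to_tag, prev_tag):
    return [to_tag if t == from_tag and i > 0 and tags[i - 1] == prev_tag else t
            for i, t in enumerate(tags)]

words = ["The", "horse", "will", "race", "tomorrow"]
tags = ["DT", "NN", "MD", "NN", "RB"]          # initial (most-frequent / random) tags
tags = apply_rule(tags, "NN", "VB", "MD")      # rule: NN -> VB after MD
print(list(zip(words, tags)))
# [('The', 'DT'), ('horse', 'NN'), ('will', 'MD'), ('race', 'VB'), ('tomorrow', 'RB')]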
77
Maximum Entropy Principle
  • Many problems in NLP can be reformulated as statistical classification
    problems, in which the task is to estimate the probability of class t
    occurring with context w, i.e. p(t,w).
  • Large text corpora usually contain some information about the
    co-occurrence of w's and t's, but never enough to completely specify
    p(t,w) for all possible (t,w) pairs, since the words w are typically
    sparse.
  • Maximum entropy models offer a clean way to combine diverse pieces of
    sparse contextual evidence about the w's and t's to reliably estimate
    a probability model p(t,w) (the model form is sketched below).
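The model form itself is not spelled out on these slides; the standard conditional maximum-entropy (log-linear) model, written multiplicatively with one weight \alpha_j per feature f_j as on the following slides, is

p(t \mid h) \;=\; \frac{1}{Z(h)} \prod_{j} \alpha_j^{\,f_j(h,t)},
\qquad
Z(h) \;=\; \sum_{t'} \prod_{j} \alpha_j^{\,f_j(h,t')}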

78
Maximum Entropy Principle
[Jaynes 57, Good 63]
The correct distribution p(t,w) is that which maximizes entropy, or uncertainty, subject to the constraints which represent the evidence, i.e. the facts known to the experimenter. This is the only unbiased assignment we can make; to use any other would amount to an arbitrary assumption of information which by hypothesis we do not have.
79
Maximum Entropy Principle
[Della Pietra et al. 97]
  • Each parameter αj corresponds to exactly one feature fj and can be viewed
    as a weight for that feature
  • Estimation of the model parameters: the Generalized Iterative Scaling (GIS)
    algorithm [Darroch & Ratcliff 72], or Improved Iterative Scaling (IIS)

80
Maximum Entropy Tagging
[Ratnaparkhi 97, 98]
GIS allows one to compute the probability model p(h,t); its update is sketched below.
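For reference (not shown on the slide), the standard GIS update rescales each weight by the ratio of the empirical to the model expectation of its feature, where C is an upper bound on the total feature count per event:

\alpha_j^{(n+1)} \;=\; \alpha_j^{(n)} \left( \frac{E_{\tilde p}[f_j]}{E_{p^{(n)}}[f_j]} \right)^{1/C}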
81
Combined DBN
82
Artificial Language Processing
Dr. Leon Peshkin, Harvard University
83
Cheating ?
Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 29 Feb 2003 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165

Time: Tuesday Mar 09, 10am
Place: Conference room
If you want to meet with the speaker, please send email to Cecilia Salazar <cecilia.salazar@intel.com>.

TITLE: "Dynamic Bayesian Nets for Language Modeling"
WHO: Dr. Leon Peshkin, MIT CSAIL