Title: Bayesian Networks in Language Modeling
1 Bayesian Networks in Language Modeling
Leon Peshkin, MIT
http://www.ai.mit.edu/pesha
2 Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 29 Feb 2004 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165
Time: Tuesday Mar 09, 10am
Place: TTI Conference Room
If you want to meet with the speaker, please send email to Cecilia Salazar <cecilia.salazar@intel.com>.
"Dynamic Bayesian Nets for Language Modeling"
Dr. Leon Peshkin, MIT CSAIL
Statistical methods in NLP exclude linguistically plausible models due to the prohibitive complexity of inference in such models. Dynamic Bayesian networks (DBNs) offer an elegant way to integrate various aspects of language in one model. Many existing algorithms developed for learning and inference in DBNs are applicable to probabilistic language modeling.
3-8 (The announcement above is repeated on slides 3-8 in slightly varying forms, with different recipient addresses and rooms, each time highlighting the fields to be extracted: Speaker, Topic, Time, Location.)
9 Benchmark dataset
Freitag 98
- CMU Seminar Announcement data set
- 485 documents
- 80% training, 20% testing
- Extract: Speaker, S-time (start time), Location, E-time (end time)
10 Applications: unlimited
- Terrorist events (MUC)
- Product Descriptions (ShopBot)
- Restaurant Guides (STALKER)
- Job Advertisement (RAPIER)
- Executive Succession (WHISK)
- Molecular biology (MEDLINE)
11 Many Formats
- Free Text
  - Natural language processing
- Structured Text
  - Textual information in a database
  - File following a predefined and strict format
- Semistructured Text
  - Ungrammatical
  - Telegraphic
  - Web Documents
12 Great Many Systems
- AutoSlog 1993
- Liep 1995
- Palka 1995
- Hasten 1995
- Crystal 1995
- WebFoot 1997
- WHISK 1999
- RAPIER 1999
- SRV 1998
- Stalker 1995
- WIEN 1997
- Mealy 1998
- SNOW-IE 2001
- LP2 2001
13 Great Many Tools!
- Lemmatizer: morpha (Yarowsky)
- PoS taggers:
  - MaxEnt (Ratnaparkhi)
  - TnT (Brants)
  - Brill
  - LTChunk (Mikheev)
- Syntactic Chunker: Sundance (Riloff)
- Parser: Charniak
http://www.cis.upenn.edu/adwait/penntools.html
14-16 Features
17 Approaches
- Create a set of rules and classifiers
- Prune it using the training set
- Markov Models
18 Authorship Resolution
- Markov studied the distribution of vowels and consonants among the initial 20,000 letters of Eugene Onegin by Pushkin.
Markov A., An example of a statistical investigation of the text of Eugene Onegin illustrating the dependence between samples in chain. Bulletin de l'Academie Imperiale des Sciences de St. Petersbourg, pages 153-162, ser. VI, T. X, No. 3, 1913 (in Russian).
19 Bayesian Nets in One Slide
- BNs represent structure in a joint distribution
- Inference is computation of conditionals
  - Pr(t2 = LastName | w1 = Doctor) = 0.4
  - Pr(w2 = prescribes | w1 = Doctor) = 0.01
- Learning is counting
(Figure: a two-slice network with tag nodes t1, t2 and word nodes w1, w2; the observed word is "Doctor".)
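To make "learning is counting" concrete, here is a minimal sketch (mine, not from the talk) that estimates conditionals like the two above from a toy tagged corpus; the corpus and the counts are made up for illustration:

from collections import Counter

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("Doctor", "Title"), ("Peshkin", "LastName"), ("presents", "Verb")],
    [("Doctor", "Title"), ("Smith", "LastName"), ("speaks", "Verb")],
    [("The", "Det"), ("doctor", "Noun"), ("prescribes", "Verb")],
]

# Learning is counting: tally how often tag t2 / word w2 follows word w1.
next_tag = Counter()   # counts of (w1, t2)
next_word = Counter()  # counts of (w1, w2)
w1_total = Counter()   # occurrences of w1 that have a successor

for sent in corpus:
    for (w1, _), (w2, t2) in zip(sent, sent[1:]):
        next_tag[(w1, t2)] += 1
        next_word[(w1, w2)] += 1
        w1_total[w1] += 1

# Inference is computation of conditionals, here by simple relative frequency.
def p_tag_given_prev_word(t2, w1):
    return next_tag[(w1, t2)] / w1_total[w1] if w1_total[w1] else 0.0

def p_word_given_prev_word(w2, w1):
    return next_word[(w1, w2)] / w1_total[w1] if w1_total[w1] else 0.0

print(p_tag_given_prev_word("LastName", "Doctor"))    # 1.0 on this toy corpus
print(p_word_given_prev_word("prescribes", "Doctor")) # 0.0 on this toy corpus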
20 Sequence Tagging
doctor peshkin presents in room 620 CEP
21-26 Sequence Tagging
(Slides 21-26 repeat the sentence with a lattice of candidate tags: each token can be labeled speaker, location, or none, and the lattice is built up across the slides.)
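To make the task concrete, a minimal sketch (mine, with a plausible gold labeling for the example sentence, not taken from the slides) of the per-token representation a sequence tagger works with:

words = "doctor peshkin presents in room 620 CEP".split()
# One label per token; here "doctor peshkin" is taken to be the speaker and
# "room 620 CEP" the location (an assumed gold labeling for illustration).
labels = ["speaker", "speaker", "none", "none", "location", "location", "location"]

for w, y in zip(words, labels):
    print(f"{w:10s} {y}")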
27 Markov model
(Figure: an HMM with tag nodes t1 ... tn over word nodes w1 ... wn.)
- Lexical probabilities
- Contextual probabilities
- Transition probabilities: n-grams (trigrams)
28 Markov model
- Training: the model parameters are estimated from annotated corpora using relative frequencies (or EM)
- Tagging: the tag sequence that maximizes likelihood is computed by dynamic programming, the Viterbi algorithm (see the sketch below)
- Severe sparseness problem when extending the n-gram scope
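As a concrete illustration of the tagging step, here is a minimal Viterbi sketch (mine, not the talk's code) for a bigram HMM whose tables are assumed to have been estimated by counting as described above:

def viterbi(words, tags, log_init, log_trans, log_emit):
    """Most likely tag sequence under a bigram HMM (all scores are log probabilities).

    log_init[t]            : log P(t) for the first tag
    log_trans[(t_prev, t)] : log P(t | t_prev)
    log_emit[(t, w)]       : log P(w | t)
    Missing entries count as log(0).
    """
    NEG_INF = float("-inf")
    # delta[t]: best score of any tag sequence for the words seen so far that ends in t
    delta = {t: log_init.get(t, NEG_INF) + log_emit.get((t, words[0]), NEG_INF)
             for t in tags}
    backptrs = []
    for w in words[1:]:
        prev, delta, back = delta, {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda tp: prev[tp] + log_trans.get((tp, t), NEG_INF))
            delta[t] = (prev[best_prev] + log_trans.get((best_prev, t), NEG_INF)
                        + log_emit.get((t, w), NEG_INF))
            back[t] = best_prev
        backptrs.append(back)
    # Follow back-pointers from the best final tag.
    t = max(delta, key=delta.get)
    path = [t]
    for back in reversed(backptrs):
        t = back[t]
        path.append(t)
    return list(reversed(path))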
29 HMM example
Freitag & McCallum 99
- Fixed topology captures limited context: prefix states before and suffix states after the target state.
(Figure: an HMM with a background state, four prefix states, a speaker target state, and four suffix states, each annotated with its five most probable tokens, e.g. "who", "speaker", "Dr", "Professor", "seminar", newlines and punctuation.)
30 BIEN
31 Conditional Probability
Pr(Current Tag | Last Target)
(Figure: the conditional probability table over Current Tag given Last Target.)
32 Evaluation Metrics for IE
- Precision (P)
- Recall (R)
- F-measure
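The standard definitions, spelled out here for reference (TP = true positives, FP = false positives, FN = false negatives):

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 * P * R / (P + R)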
33 Comparison
Peshkin & Pfeffer 2003
34 Cheating?
Date: Tue, 9 Mar 2004 11:10:40 -0500 (EST)
From: rprasad@bbn.com
To: sl-seminar-ext@bbn.com
Subject: SDP Speech and Language Seminar
SCIENCE DEVELOPMENT PROGRAM AT BBN
Speech and Language Seminar Series
TITLE: Dynamic Bayesian Nets for Language Modeling
Speaker: Dr. Leon Peshkin, MIT AI Lab
Date: Thursday, March 18, at 11 a.m.
Place: BBN Technologies, 50 Moulton Street, Cambridge, MA, Room 2/267
ABSTRACT: Statistical methods in NLP exclude linguistically plausible models due to the prohibitive complexity of inference in such models. Dynamic Bayesian networks (DBNs) offer an elegant way to integrate various aspects of
35 New corpus
Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 13 Feb 2004 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165
This Tuesday March 9th, Leon Peshkin is in town. He will tell us about Dynamic Bayesian nets for language modeling. Talk starts at 10am, Michael Ryerson, main building of Intel, Santa Clara.
40 Ambiguous Features
Professors Rob Banks and Will Gates give us a short POS demo.
41 Part-of-Speech Tagging
doctor steals presents in christmas hall
42-47 Part-of-Speech Tagging
(Slides 42-47 repeat the sentence with a lattice of candidate tags for each token, drawn from noun, proper noun, verb, adjective, and preposition, built up across the slides.)
48 Part-of-Speech Tagging
doctor steals presents in christmas hall
(Figure: the single resolved tag above each word.)
- Correct disambiguation takes clues:
  - Contextual information: what is the sequence of words and tags
  - Morphological information: the suffix of the word is -s
49 U. Penn PoS tagset
50 Linguistic Data Consortium (LDC)
www.ldc.upenn.edu
- WSJ: 2,400 files, 1,086,250 words
- Brown corpus: 1,016,277 words
- popular lore
- belles lettres, biography, memoirs, etc.
- general fiction
- mystery and detective fiction
- science fiction
- adventure and western fiction
- romance and love story
- humor
51 Why Build Another Tagger?
- 3.4% error: Transformation-based (Brill, 95)
- 3.3% error: Markov Models, TnT (Brants, 00)
- 4.3% error: Conditional Random Field (Lafferty et al., 01)
- Maximum Entropy Model:
  - 3.5% error on WSJ corpus (Ratnaparkhi, 96)
  - 2.6% error on LOB corpus (Van Halteren et al., 98)
- 5% Bi-gram, 10% Uni-gram (Charniak, 93)
52 Out of Vocabulary
Lafferty & McCallum 02
- ??? Transformation-based
- 25% Markov Models
- 28% Conditional Random Field
- 27% Maximum Entropy Model
Per sentence: 70% wrong
53 Why Build Another Tagger?
- 3.4% error: Transformation-based (Brill, 95), 93,000-word lexicon
- 3.3% error: Markov Models, TnT (Brants, 00), 45,000-word lexicon
  - 2.9% unknown words
  - 11% unknown words: 5.5% error
- Maximum Entropy Model:
  - 3.5% error on WSJ corpus (Ratnaparkhi, 96), 25,000-word lexicon
54 Lewis Carroll
- 'T was brillig, and the slithy toves
- Did gyre and gimble in the wabe
- All mimsy were the borogoves,
- And the mome raths outgrabe.
- "Beware the Jabberwock, my son!
- The jaws that bite, the claws that catch!
- Beware the Jubjub bird, and shun
- The frumious Bandersnatch!"
55 Lewis Carroll
'T was brillig , and the slithy toves
RPR VerbPT Adject , Conj Det Adject NNS
Did gyre and gimble in the wabe
VerbPT VB Conj VB IN Det NN
All mimsy were the borogoves ,
Det Adject VerbPT Det NNS ,
And the mome raths outgrabe .
Conj Det Adject NNS VerbPT .
56 Good Tagger
- Generalization to novel corpora (English not WSJ)
- Modest vocabulary
- Easy customization
- Integrated into larger NLP system
57 Debate in the literature
Lafferty et al. 2002; Manning & Schutze 1999; Klein & Manning 2002; Brants 2000
- What kind of model to use?
- How to train it? (Joint vs. conditional likelihood, etc.)
- What kind of features to consider?
58 Decompose MaxEnt
- What kind of model to use?
- How to train it? (Joint vs. conditional likelihood, etc.)
- What kind of features to consider?
59 Close look at 117,558 features
Ratnaparkhi 96, 97
- Does the token contain a capital letter? (2)
- Does the token contain a hyphen? (2)
- Does the token contain a number? (2)
- Frequent prefixes, up to 4 letters long (3,600)
- Frequent suffixes, up to 4 letters long (2,900)
- Word-window features w-2 w-1 w w+1 w+2 (6,800; 3,600; 3,800; 3,000; 3,000) and tag context t-3 t-2 t-1 t t+1 t+2
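A minimal sketch (mine, not Ratnaparkhi's code) of extracting this kind of feature set for one token position; the feature names are made up for illustration:

def token_features(words, tags, i, max_affix=4):
    """Ratnaparkhi-style features for position i (illustrative names)."""
    w = words[i]
    feats = {
        "has_capital": any(c.isupper() for c in w),
        "has_hyphen": "-" in w,
        "has_number": any(c.isdigit() for c in w),
        "w": w,
        "w-1": words[i - 1] if i > 0 else "<BOS>",
        "w-2": words[i - 2] if i > 1 else "<BOS>",
        "w+1": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "w+2": words[i + 2] if i + 2 < len(words) else "<EOS>",
        "t-1": tags[i - 1] if i > 0 else "<BOS>",
        "t-2 t-1": " ".join(t for t in tags[max(0, i - 2):i] if t) or "<BOS>",
    }
    # Frequent prefixes and suffixes, up to 4 letters long.
    for k in range(1, min(max_affix, len(w)) + 1):
        feats[f"prefix={w[:k]}"] = True
        feats[f"suffix={w[-k:]}"] = True
    return feats

# Example call:
# token_features("doctor steals presents in christmas hall".split(),
#                ["noun", "verb", None, None, None, None], 2)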
60 Naïve DBN
Peshkin & Savova 2004
Bouquet of factored morphology features
61 Naïve DBN
(Figure: one slice of the network with per-token nodes tk, sk, pk, wk, nk, hk, ck.)
62 DBN 2
Morphology features and some context
63 Even closer look at features
Ratnaparkhi 96, 97
Among 2,900 suffixes: 31 hyphens (-off, -on, -out, -up), 400 numbers (-47, .014, 1970), 100 capital letters and bi-grams.
Among 3,600 prefixes: 84 hyphens (co-, ex-, in-, mid-), 533 numbers, 1,500 capitalized words and letters.
684 entries are identical in the suffix and prefix lists; 500 entries are common to the prefix list and the vocabulary (there are five!); 400 entries are common to the suffix list and the word vocabulary.
64-67 Empirical results - WSJ
68 Empirical results
WSJ: 3.6, 9.4, 51.7
Brown: 7.7, 21.9, 69.2
Jabberwocky: 11.7, 23.4, 65.2
Seminar Announcements: 16.3, 22.7, 79.0
69 Combined DBN
70 Combined DBN
71 Future Work
- Complex data: cancel, reschedule, multi-slot
- Relational data (Roth & Yih 02)
- Shallow parsing, keyword extraction, parsing
- Semi-supervised learning (M. Collins 03)
- Automated feature selection
- Structure learning (Friedman et al. 01)
- DBN + MDP
72 Message
- Set of features over model
- Factored features help
- Linguistically justified features
- Generalization to novel corpora
- Co-training and integration
- Approximate inference, rich models
- New application domain for DBNs
73 Acknowledgments
I have benefited from technical interactions with many people, including:
- Eugene Charniak, Michael Collins, Mark Johnson, Avi Pfeffer, Leslie Kaelbling
- Jason Eisner, Andrew McCallum, Kevin Murphy, Stuart Shieber
74 Transformation-Based
Brill 1995
The horse will win the race tomorrow
DT NN MD VB DT ? RB
The horse will race tomorrow
DT NN MD ? RB
Step 1: assign known tags; for the ambiguous word "race" draw at random between NN and VB.
75 Transformation-Based
The horse will win the race tomorrow
DT NN MD VB DT NN RB
The horse will race tomorrow
DT NN MD NN RB
Feature window: w-3 w-2 w-1 w w+1 w+2 and t-3 t-2 t-1 t t+1 t+2
76 Transformation-Based
The horse will win the race tomorrow
DT NN MD VB DT NN RB
The horse will race tomorrow
DT NN MD NN RB  ->  DT NN MD VB RB
Step 2: apply transformation rules (430 rules), e.g. turn NN into VB after tag MD.
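A minimal sketch (mine, not Brill's implementation) of applying one such contextual transformation to an initial tagging:

def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Turn from_tag into to_tag whenever the previous tag is prev_tag."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

words = "The horse will race tomorrow".split()
tags = ["DT", "NN", "MD", "NN", "RB"]        # initial (most-frequent) tags
tags = apply_rule(tags, "NN", "VB", "MD")    # rule: NN -> VB after tag MD
print(list(zip(words, tags)))
# [('The', 'DT'), ('horse', 'NN'), ('will', 'MD'), ('race', 'VB'), ('tomorrow', 'RB')]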
77 Maximum Entropy Principle
- Many problems in NLP can be re-formulated as statistical classification problems, in which the task is to estimate the probability of class t occurring with context w, i.e. p(t, w).
- Large text corpora usually contain some information about the co-occurrence of w's and t's, but never enough to completely specify p(t, w) for all possible (t, w) pairs, since the words w are typically sparse.
- Maximum entropy models offer a clean way to combine diverse pieces of sparse contextual evidence about the w's and t's to reliably estimate a probability model p(w, t).
78 Maximum Entropy Principle
Jaynes 57; Good 63
The correct distribution p(t, w) is the one that maximizes entropy, or uncertainty, subject to the constraints which represent evidence, i.e. the facts known to the experimenter. This is the only unbiased assignment we can make; to use any other would amount to an arbitrary assumption of information which, by hypothesis, we do not have.
79 Maximum Entropy Principle
Della Pietra et al. 97
- Each parameter a_j corresponds to exactly one feature f_j and can be viewed as a weight for that feature.
- Estimation of the model parameters: Generalized Iterative Scaling (GIS) algorithm (Darroch & Ratcliff 72); Improved Iterative Scaling (IIS).
80 Maximum Entropy Tagging
Ratnaparkhi 97, 98
GIS allows us to calculate the probability model p(h, t).
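For reference, the usual conditional form of such a model, with history h, tag t, one weight a_j per binary feature f_j, and normalizer Z(h); this is the standard formulation rather than a reproduction of the slide:

p(t | h) = (1 / Z(h)) * prod_j a_j^{f_j(h, t)},   where  Z(h) = sum_{t'} prod_j a_j^{f_j(h, t')}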
81 Combined DBN
82 Artificial Language Processing
Dr. Leon Peshkin, Harvard University
83 Cheating?
Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 29 Feb 2003 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165
Time: Tuesday Mar 09, 10am
Place: Conference room
If you want to meet with the speaker, please send email to Cecilia Salazar <cecilia.salazar@intel.com>.
TITLE: "Dynamic Bayesian Nets for Language Modeling"
WHO: Dr. Leon Peshkin, MIT CSAIL