Title: Bayesian Networks in Language Modeling
1 Bayesian Networks in Language Modeling
Leon Peshkin, MIT
http://www.ai.mit.edu/pesha
2 Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 29 Feb 2004 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165
Time: Tuesday Mar 09, 10am
Place: TTI Conference Room
If you want to meet with the speaker, please send email to Cecilia Salazar <cecilia.salazar@intel.com>.
"Dynamic Bayesian Nets for Language Modeling"
Dr. Leon Peshkin, MIT CSAIL
Statistical methods in NLP exclude linguistically plausible models due to the prohibitive complexity of inference in such models. Dynamic Bayesian networks (DBNs) offer an elegant way to integrate various aspects of language in one model. Many existing algorithms developed for learning and inference in DBNs are applicable to probabilistic language modeling.
3-8 (The announcement above is repeated on slides 3-8 in slightly varying forms, with different recipient addresses and rooms, each time highlighting the fields to be extracted: Speaker, Topic, Time, Location.)
9 Benchmark dataset
Freitag 98
- CMU Seminar Announcement data set
- 485 documents
- 80% training, 20% testing
- Extract: Speaker, S-time (start time), Location, E-time (end time)
10 Applications: unlimited
- Terrorist events (MUC)
- Product Descriptions (ShopBot)
- Restaurant Guides (STALKER)
- Job Advertisement (RAPIER)
- Executive Succession (WHISK)
- Molecular biology (MEDLINE)
11 Many Formats
- Free Text
  - Natural language processing
- Structured Text
  - Textual information in a database
  - File following a predefined and strict format
- Semistructured Text
  - Ungrammatical
  - Telegraphic
  - Web Documents
12 Great Many Systems
- AutoSlog 1993
- Liep 1995
- Palka 1995
- Hasten 1995
- Crystal 1995
- WebFoot 1997
- WHISK 1999
- RAPIER 1999
- SRV 1998
- Stalker 1995
- WIEN 1997
- Mealy 1998
- SNOW-IE 2001
- LP2 2001
13 Great Many Tools!
- Lemmatizer: morpha (Yarowsky)
- PoS taggers:
  - MaxEnt (Ratnaparkhi)
  - TnT (Brants)
  - Brill
  - LTChunk (Mikheev)
- Syntactic Chunker: Sundance (Riloff)
- Parser: Charniak
http://www.cis.upenn.edu/adwait/penntools.html
14-16 Features
17 Approaches
- Create a set of rules and classifiers
- Prune it using the training set
- Markov Models
18 Authorship Resolution
- Markov studied the distribution of vowels and consonants among the initial 20,000 letters of Eugene Onegin by Pushkin.
Markov A., An example of a statistical investigation of the text of Eugene Onegin illustrating the dependence between samples in chain. Bulletin de l'Academie Imperiale des Sciences de St. Petersbourg, pages 153-162, ser. VI, T. X, No. 3, 1913 (in Russian).
19 Bayesian Nets in One Slide
- BNs represent structure in a joint distribution
- Inference is computation of conditionals
  - Pr(t2 = LastName | w1 = Doctor) = 0.4
  - Pr(w2 = prescribes | w1 = Doctor) = 0.01
- Learning is counting
(Figure: a two-slice network with tag nodes t1, t2 and word nodes w1, w2; the observed word is "Doctor".)
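To make "learning is counting" concrete, here is a minimal sketch (mine, not from the talk) that estimates conditionals like the two above from a toy tagged corpus; the corpus and the counts are made up for illustration:

from collections import Counter

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("Doctor", "Title"), ("Peshkin", "LastName"), ("presents", "Verb")],
    [("Doctor", "Title"), ("Smith", "LastName"), ("speaks", "Verb")],
    [("The", "Det"), ("doctor", "Noun"), ("prescribes", "Verb")],
]

# Learning is counting: tally how often tag t2 / word w2 follows word w1.
next_tag = Counter()   # counts of (w1, t2)
next_word = Counter()  # counts of (w1, w2)
w1_total = Counter()   # occurrences of w1 that have a successor

for sent in corpus:
    for (w1, _), (w2, t2) in zip(sent, sent[1:]):
        next_tag[(w1, t2)] += 1
        next_word[(w1, w2)] += 1
        w1_total[w1] += 1

# Inference is computation of conditionals, here by simple relative frequency.
def p_tag_given_prev_word(t2, w1):
    return next_tag[(w1, t2)] / w1_total[w1] if w1_total[w1] else 0.0

def p_word_given_prev_word(w2, w1):
    return next_word[(w1, w2)] / w1_total[w1] if w1_total[w1] else 0.0

print(p_tag_given_prev_word("LastName", "Doctor"))    # 1.0 on this toy corpus
print(p_word_given_prev_word("prescribes", "Doctor")) # 0.0 on this toy corpus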
20 Sequence Tagging
doctor peshkin presents in room 620 CEP
21-26 Sequence Tagging
(Slides 21-26 repeat the sentence with a lattice of candidate tags: each token can be labeled speaker, location, or none, and the lattice is built up across the slides.)
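To make the task concrete, a minimal sketch (mine, with a plausible gold labeling for the example sentence, not taken from the slides) of the per-token representation a sequence tagger works with:

words = "doctor peshkin presents in room 620 CEP".split()
# One label per token; here "doctor peshkin" is taken to be the speaker and
# "room 620 CEP" the location (an assumed gold labeling for illustration).
labels = ["speaker", "speaker", "none", "none", "location", "location", "location"]

for w, y in zip(words, labels):
    print(f"{w:10s} {y}")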
27 Markov model
(Figure: an HMM with tag nodes t1 ... tn over word nodes w1 ... wn.)
- Lexical probabilities
- Contextual probabilities
- Transition probabilities: n-grams (trigrams)
28 Markov model
- Training: the model parameters are estimated from annotated corpora using relative frequencies (or EM)
- Tagging: the tag sequence that maximizes likelihood is computed by dynamic programming, the Viterbi algorithm (see the sketch below)
- Severe sparseness problem when extending the n-gram scope
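As a concrete illustration of the tagging step, here is a minimal Viterbi sketch (mine, not the talk's code) for a bigram HMM whose tables are assumed to have been estimated by counting as described above:

def viterbi(words, tags, log_init, log_trans, log_emit):
    """Most likely tag sequence under a bigram HMM (all scores are log probabilities).

    log_init[t]            : log P(t) for the first tag
    log_trans[(t_prev, t)] : log P(t | t_prev)
    log_emit[(t, w)]       : log P(w | t)
    Missing entries count as log(0).
    """
    NEG_INF = float("-inf")
    # delta[t]: best score of any tag sequence for the words seen so far that ends in t
    delta = {t: log_init.get(t, NEG_INF) + log_emit.get((t, words[0]), NEG_INF)
             for t in tags}
    backptrs = []
    for w in words[1:]:
        prev, delta, back = delta, {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda tp: prev[tp] + log_trans.get((tp, t), NEG_INF))
            delta[t] = (prev[best_prev] + log_trans.get((best_prev, t), NEG_INF)
                        + log_emit.get((t, w), NEG_INF))
            back[t] = best_prev
        backptrs.append(back)
    # Follow back-pointers from the best final tag.
    t = max(delta, key=delta.get)
    path = [t]
    for back in reversed(backptrs):
        t = back[t]
        path.append(t)
    return list(reversed(path))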
29 HMM example
Freitag & McCallum 99
- Fixed topology captures limited context: prefix states before and suffix states after the target state.
(Figure: an HMM with a background state, four prefix states, a speaker target state, and four suffix states, each annotated with its five most probable tokens, e.g. "who", "speaker", "Dr", "Professor", "seminar", newlines and punctuation.)
30 BIEN
31 Conditional Probability
Pr(Current Tag | Last Target)
(Figure: the conditional probability table over Current Tag given Last Target.)
32 Evaluation Metrics for IE
- Precision (P)
- Recall (R)
- F-measure
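The standard definitions, spelled out here for reference (TP = true positives, FP = false positives, FN = false negatives):

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 * P * R / (P + R)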
33 Comparison
Peshkin & Pfeffer 2003
34 Cheating?
Date: Tue, 9 Mar 2004 11:10:40 -0500 (EST)
From: rprasad@bbn.com
To: sl-seminar-ext@bbn.com
Subject: SDP Speech and Language Seminar
SCIENCE DEVELOPMENT PROGRAM AT BBN
Speech and Language Seminar Series
TITLE: Dynamic Bayesian Nets for Language Modeling
Speaker: Dr. Leon Peshkin, MIT AI Lab
Date: Thursday, March 18, at 11 a.m.
Place: BBN Technologies, 50 Moulton Street, Cambridge, MA, Room 2/267
ABSTRACT: Statistical methods in NLP exclude linguistically plausible models due to the prohibitive complexity of inference in such models. Dynamic Bayesian networks (DBNs) offer an elegant way to integrate various aspects of
35 New corpus
Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 13 Feb 2004 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165
This Tuesday March 9th, Leon Peshkin is in town. He will tell us about Dynamic Bayesian nets for language modeling. Talk starts at 10am, Michael Ryerson, main building of Intel, Santa Clara.
40 Ambiguous Features
Professors Rob Banks and Will Gates give us a short POS demo.
41 Part-of-Speech Tagging
doctor steals presents in christmas hall
42-47 Part-of-Speech Tagging
(Slides 42-47 repeat the sentence with a lattice of candidate tags for each token, drawn from noun, proper noun, verb, adjective, and preposition, built up across the slides.)
48 Part-of-Speech Tagging
doctor steals presents in christmas hall
(Figure: the single resolved tag above each word.)
- Correct disambiguation takes clues:
  - Contextual information: what is the sequence of words and tags
  - Morphological information: the suffix of the word is -s
49 U. Penn PoS tagset
50 Linguistic Data Consortium (LDC)
www.ldc.upenn.edu
- WSJ: 2,400 files, 1,086,250 words
- Brown corpus: 1,016,277 words
- popular lore
- belles lettres, biography, memoirs, etc.
- general fiction
- mystery and detective fiction
- science fiction
- adventure and western fiction
- romance and love story
- humor
51 Why Build Another Tagger?
- 3.4% error: Transformation-based (Brill, 95)
- 3.3% error: Markov Models, TnT (Brants, 00)
- 4.3% error: Conditional Random Field (Lafferty et al., 01)
- Maximum Entropy Model:
  - 3.5% error on WSJ corpus (Ratnaparkhi, 96)
  - 2.6% error on LOB corpus (Van Halteren et al., 98)
- 5% Bi-gram, 10% Uni-gram (Charniak, 93)
52 Out of Vocabulary
Lafferty & McCallum 02
- ??? Transformation-based
- 25% Markov Models
- 28% Conditional Random Field
- 27% Maximum Entropy Model
Per sentence: 70% wrong
53 Why Build Another Tagger?
- 3.4% error: Transformation-based (Brill, 95), 93,000-word lexicon
- 3.3% error: Markov Models, TnT (Brants, 00), 45,000-word lexicon
  - 2.9% unknown words
  - 11% unknown words: 5.5% error
- Maximum Entropy Model:
  - 3.5% error on WSJ corpus (Ratnaparkhi, 96), 25,000-word lexicon
54 Lewis Carroll
- 'T was brillig, and the slithy toves
- Did gyre and gimble in the wabe
- All mimsy were the borogoves,
- And the mome raths outgrabe.
- "Beware the Jabberwock, my son!
- The jaws that bite, the claws that catch!
- Beware the Jubjub bird, and shun
- The frumious Bandersnatch!"
55 Lewis Carroll
'T was brillig , and the slithy toves
RPR VerbPT Adject , Conj Det Adject NNS
Did gyre and gimble in the wabe
VerbPT VB Conj VB IN Det NN
All mimsy were the borogoves ,
Det Adject VerbPT Det NNS ,
And the mome raths outgrabe .
Conj Det Adject NNS VerbPT .
56 Good Tagger
- Generalization to novel corpora (English not WSJ)
- Modest vocabulary
- Easy customization
- Integrated into larger NLP system
57 Debate in the literature
Lafferty et al. 2002; Manning & Schutze 1999; Klein & Manning 2002; Brants 2000
- What kind of model to use?
- How to train it? (Joint vs. conditional likelihood, etc.)
- What kind of features to consider?
58 Decompose MaxEnt
- What kind of model to use?
- How to train it? (Joint vs. conditional likelihood, etc.)
- What kind of features to consider?
59 Close look at 117,558 features
Ratnaparkhi 96, 97
- Does the token contain a capital letter? (2)
- Does the token contain a hyphen? (2)
- Does the token contain a number? (2)
- Frequent prefixes, up to 4 letters long (3,600)
- Frequent suffixes, up to 4 letters long (2,900)
- Word-window features w-2 w-1 w w+1 w+2 (6,800; 3,600; 3,800; 3,000; 3,000) and tag context t-3 t-2 t-1 t t+1 t+2
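A minimal sketch (mine, not Ratnaparkhi's code) of extracting this kind of feature set for one token position; the feature names are made up for illustration:

def token_features(words, tags, i, max_affix=4):
    """Ratnaparkhi-style features for position i (illustrative names)."""
    w = words[i]
    feats = {
        "has_capital": any(c.isupper() for c in w),
        "has_hyphen": "-" in w,
        "has_number": any(c.isdigit() for c in w),
        "w": w,
        "w-1": words[i - 1] if i > 0 else "<BOS>",
        "w-2": words[i - 2] if i > 1 else "<BOS>",
        "w+1": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "w+2": words[i + 2] if i + 2 < len(words) else "<EOS>",
        "t-1": tags[i - 1] if i > 0 else "<BOS>",
        "t-2 t-1": " ".join(t for t in tags[max(0, i - 2):i] if t) or "<BOS>",
    }
    # Frequent prefixes and suffixes, up to 4 letters long.
    for k in range(1, min(max_affix, len(w)) + 1):
        feats[f"prefix={w[:k]}"] = True
        feats[f"suffix={w[-k:]}"] = True
    return feats

# Example call:
# token_features("doctor steals presents in christmas hall".split(),
#                ["noun", "verb", None, None, None, None], 2)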
60 Naïve DBN
Peshkin & Savova 2004
Bouquet of factored morphology features
61 Naïve DBN
(Figure: one slice of the network with per-token nodes tk, sk, pk, wk, nk, hk, ck.)
62 DBN 2
Morphology features and some context
63 Even closer look at features
Ratnaparkhi 96, 97
Among 2,900 suffixes: 31 hyphens (-off, -on, -out, -up), 400 numbers (-47, .014, 1970), 100 capital letters and bi-grams.
Among 3,600 prefixes: 84 hyphens (co-, ex-, in-, mid-), 533 numbers, 1,500 capitalized words and letters.
684 entries are identical in the suffix and prefix lists; 500 entries are common to the prefix list and the vocabulary (there are five!); 400 entries are common to the suffix list and the word vocabulary.
64-67 Empirical results - WSJ
68 Empirical results
WSJ: 3.6, 9.4, 51.7
Brown: 7.7, 21.9, 69.2
Jabberwocky: 11.7, 23.4, 65.2
Seminar Announcements: 16.3, 22.7, 79.0
69 Combined DBN
70 Combined DBN
71 Future Work
- Complex data: cancel, reschedule, multi-slot
- Relational data (Roth & Yih 02)
- Shallow parsing, keyword extraction, parsing
- Semi-supervised learning (M. Collins 03)
- Automated feature selection
- Structure learning (Friedman et al. 01)
- DBN + MDP
72 Message
- Set of features over model
- Factored features help
- Linguistically justified features
- Generalization to novel corpora
- Co-training and integration
- Approximate inference, rich models
- New application domain for DBNs
73 Acknowledgments
I have benefited from technical interactions with many people, including:
- Eugene Charniak, Michael Collins, Mark Johnson, Avi Pfeffer, Leslie Kaelbling
- Jason Eisner, Andrew McCallum, Kevin Murphy, Stuart Shieber
74 Transformation-Based
Brill 1995
The horse will win the race tomorrow
DT NN MD VB DT ? RB
The horse will race tomorrow
DT NN MD ? RB
Step 1: assign known tags; for the ambiguous word "race" draw at random between NN and VB.
75 Transformation-Based
The horse will win the race tomorrow
DT NN MD VB DT NN RB
The horse will race tomorrow
DT NN MD NN RB
Feature window: w-3 w-2 w-1 w w+1 w+2 and t-3 t-2 t-1 t t+1 t+2
76 Transformation-Based
The horse will win the race tomorrow
DT NN MD VB DT NN RB
The horse will race tomorrow
DT NN MD NN RB  ->  DT NN MD VB RB
Step 2: apply transformation rules (430 rules), e.g. turn NN into VB after tag MD.
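A minimal sketch (mine, not Brill's implementation) of applying one such contextual transformation to an initial tagging:

def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Turn from_tag into to_tag whenever the previous tag is prev_tag."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

words = "The horse will race tomorrow".split()
tags = ["DT", "NN", "MD", "NN", "RB"]        # initial (most-frequent) tags
tags = apply_rule(tags, "NN", "VB", "MD")    # rule: NN -> VB after tag MD
print(list(zip(words, tags)))
# [('The', 'DT'), ('horse', 'NN'), ('will', 'MD'), ('race', 'VB'), ('tomorrow', 'RB')]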
77 Maximum Entropy Principle
- Many problems in NLP can be re-formulated as statistical classification problems, in which the task is to estimate the probability of class t occurring with context w, i.e. p(t, w).
- Large text corpora usually contain some information about the co-occurrence of w's and t's, but never enough to completely specify p(t, w) for all possible (t, w) pairs, since the words w are typically sparse.
- Maximum entropy models offer a clean way to combine diverse pieces of sparse contextual evidence about the w's and t's to reliably estimate a probability model p(w, t).
78 Maximum Entropy Principle
Jaynes 57; Good 63
The correct distribution p(t, w) is the one that maximizes entropy, or uncertainty, subject to the constraints which represent evidence, i.e. the facts known to the experimenter. This is the only unbiased assignment we can make; to use any other would amount to an arbitrary assumption of information which, by hypothesis, we do not have.
79 Maximum Entropy Principle
Della Pietra et al. 97
- Each parameter a_j corresponds to exactly one feature f_j and can be viewed as a weight for that feature.
- Estimation of the model parameters: Generalized Iterative Scaling (GIS) algorithm (Darroch & Ratcliff 72); Improved Iterative Scaling (IIS).
80 Maximum Entropy Tagging
Ratnaparkhi 97, 98
GIS allows us to calculate the probability model p(h, t).
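For reference, the usual conditional form of such a model, with history h, tag t, one weight a_j per binary feature f_j, and normalizer Z(h); this is the standard formulation rather than a reproduction of the slide:

p(t | h) = (1 / Z(h)) * prod_j a_j^{f_j(h, t)},   where  Z(h) = sum_{t'} prod_j a_j^{f_j(h, t')}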
81 Combined DBN
82 Artificial Language Processing
Dr. Leon Peshkin, Harvard University
83 Cheating?
Return-Path: <cecilia.salazar@intel.com>
X-Original-To: pesha@eecs.harvard.edu
To: "NLP Group" <nlp@cs.brown.edu>, <colloquium@cs.cmu.edu>
Cc: "Bradski, Gary" <gary.bradski@intel.com>
Subject: Talk by Leon Peshkin "Dynamic Bayesian Nets for Language Modeling"
Date: Fri, 29 Feb 2003 12:36:52 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165
Time: Tuesday Mar 09, 10am
Place: Conference room
If you want to meet with the speaker, please send email to Cecilia Salazar <cecilia.salazar@intel.com>.
TITLE: "Dynamic Bayesian Nets for Language Modeling"
WHO: Dr. Leon Peshkin, MIT CSAIL