Title: Graphical Models - Learning
Graphical Models - Learning - Advanced AI, WS 06/07
Based on:
- J. A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, TR-97-021, U.C. Berkeley, April 1998
- G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley & Sons, Inc., 1997
- D. Koller, course CS-228 handouts, Stanford University, 2001
- N. Friedman and D. Koller, NIPS '99 tutorial
Parameter Estimation
Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel
Albert-Ludwigs University Freiburg, Germany
Outline
- Introduction
- Reminder: probability theory
- Basics of Bayesian Networks
- Modeling Bayesian networks
- Inference (VE, junction tree)
- Excursus: Markov networks
- Learning Bayesian networks
- Relational Models
What is Learning?
- Agents are said to learn if they improve their performance over time based on experience.
- "The problem of understanding intelligence is said to be the greatest problem in science today and the problem for this century, as deciphering the genetic code was for the second half of the last one ... the problem of learning represents a gateway to understanding intelligence in man and machines." -- Tomaso Poggio and Steve Smale, 2003
Why bother with learning?
- Bottleneck of knowledge acquisition
- Expensive, difficult
- Normally, no expert is around
- Data is cheap!
- Huge amounts of data available, e.g.
- Clinical tests
- Web mining, e.g. log files
- ...
Why Learning Bayesian Networks?
- Conditional independencies and the graphical language capture the structure of many real-world distributions
- Graph structure provides much insight into the domain
- Allows knowledge discovery
- Learned model can be used for many tasks
- Supports all the features of probabilistic learning
- Model selection criteria
- Dealing with missing data and hidden variables
Learning With Bayesian Networks
[Figure: data and prior information are the inputs to learning.]
What does the data look like?
Complete data set: rows X1, ..., XM are data cases; columns A1, ..., A6 are attributes/variables.
      A1     A2     A3     A4     A5     A6
X1    true   true   false  true   false  false
X2    false  true   true   true   false  false
...   ...    ...    ...    ...    ...    ...
XM    true   false  false  false  true   true
What does the data look like?
Incomplete data set: "?" marks a missing value; entire variables may also be hidden/latent.
A1     A2     A3     A4     A5     A6
true   true   ?      true   false  false
?      true   ?      ?      false  false
...    ...    ...    ...    ...    ...
true   false  ?      false  true   ?
- Real-world data: the states of some random variables are missing
- E.g., medical diagnosis: not all patients are subject to all tests
- Parameter reduction, e.g. clustering, ...
Hidden Variable: Examples
[Figure: two Bayesian networks over the variables X1, X2, X3 and Y1, Y2, Y3; the second adds a hidden variable H.]
Hidden Variable: Examples
- Clustering
[Figure: naive Bayes clustering model with a hidden Cluster variable (the cluster assignment) and observed attributes X1, ..., Xn as its children.]
- Hidden variables also appear in clustering
- AutoClass model
- Hidden variable assigns class labels
- Observed attributes are independent given the class
Slides due to Eamonn Keogh
[Figures: an iterative clustering example. Iteration 1: the cluster means are randomly assigned; iterations 2, 5 and 25 show the assignment converging.]
Slides due to Eamonn Keogh
What is a natural grouping among these objects? Clustering is subjective.
[Figure: the same set of characters grouped in different ways: Simpson's family, school employees, females, males.]
Learning With Bayesian Networks
- Fixed structure, fully observed data: the easiest problem, it reduces to counting
- Fixed structure, partially observed data: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks
- Fixed variables (structure to be selected), fully observed data: selection of arcs; new domain with no domain expert; data mining
- Hidden variables, partially observed data: encompasses two difficult subproblems; only Structural EM is known; scientific discovery
Parameter Estimation
- Let D = {d_1, ..., d_N} be a set of data cases over m RVs
- Each d_i is called a data case
- iid assumption: all data cases are independently sampled from identical distributions
Goal: find the parameters of the CPDs which match the data best (see the sketch below)
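As a compact restatement (a sketch using the notation introduced above, with θ collecting all CPD parameters), the iid assumption turns the likelihood of the data into a product over data cases:

```latex
D = \{d_1, \dots, d_N\}, \qquad
L(\theta : D) \;=\; P(D \mid \theta) \;=\; \prod_{n=1}^{N} P(d_n \mid \theta)
```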
Maximum Likelihood - Parameter Estimation
- What does "best matching" mean?
- Find the parameters which have most likely produced the data
- MAP parameters: the parameters that are most probable given the data
- The term P(data) does not depend on the parameters, so it can be dropped
- If, in addition, all parameters are a priori equally likely, the MAP parameters coincide with the ML parameters
- Find ML parameters: maximize the likelihood of the parameters given the data
- Equivalently, maximize the log-likelihood of the parameters given the data (see the sketch below)
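The formulas behind these bullets were lost in extraction; a standard sketch of what they express (D is the data, θ the CPD parameters):

```latex
\theta_{\mathrm{MAP}} \;=\; \arg\max_{\theta} P(\theta \mid D)
\;=\; \arg\max_{\theta} \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
\;\overset{P(\theta)\ \text{uniform}}{=}\;
\arg\max_{\theta} P(D \mid \theta) \;=\; \theta_{\mathrm{ML}},
\qquad
\theta_{\mathrm{ML}} \;=\; \arg\max_{\theta} L(\theta : D) \;=\; \arg\max_{\theta} \log L(\theta : D)
```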
Maximum Likelihood
- This is one of the most commonly used estimators in statistics
- Intuitively appealing
- Consistent: the estimate converges to the best possible value as the number of examples grows
- Asymptotically efficient: the estimate is as close to the true value as possible given a particular training set
Learning With Bayesian Networks
Road map as above; next case: fixed structure, fully observed data.
Known Structure, Complete Data
Example data over (E, B, A): <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
- Network structure is specified
- Learner needs to estimate parameters
- Data does not contain missing values
ML Parameter Estimation
A1     A2     A3     A4     A5     A6
true   true   false  true   false  false
false  true   true   true   false  false
...    ...    ...    ...    ...    ...
true   false  false  false  true   true
- The likelihood is a product over the data cases (iid assumption)
- Each case factorizes over the variables (BN semantics)
- Only the local parameters of the family of Aj are involved in each factor
- Hence each factor can be maximized individually: decomposability of the likelihood (see the sketch below)
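The stepwise derivation these slides animate is missing from the extracted text; a sketch of the standard decomposition (a_j^(n) and pa_j^(n) denote the value of A_j and of its parents in data case n, and θ_j the parameters of A_j's CPD):

```latex
L(\theta : D) \;=\; P(D \mid \theta)
\;\overset{\text{iid}}{=}\; \prod_{n=1}^{N} P(d_n \mid \theta)
\;\overset{\text{BN}}{=}\; \prod_{n=1}^{N} \prod_{j=1}^{m} P\big(a_j^{(n)} \mid pa_j^{(n)}, \theta_j\big)
\;=\; \prod_{j=1}^{m} \underbrace{\prod_{n=1}^{N} P\big(a_j^{(n)} \mid pa_j^{(n)}, \theta_j\big)}_{\text{local likelihood of } A_j}
```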
Decomposability of Likelihood
- If the data set is complete (no question marks),
- we can maximize each local likelihood function independently, and
- then combine the solutions to get an MLE solution.
- This decomposition of the global problem into independent, local sub-problems allows efficient solutions to the MLE problem.
Likelihood for Multinomials
- Random variable V with values 1, ..., K
- L(θ : D) = ∏_k θ_k^N_k with the constraint ∑_k θ_k = 1
- where N_k is the count of state k in the data
- This constraint implies that the choice of θ_i influences the choice of θ_j (i ≠ j)
Likelihood for Binomials (2 states only)
- Use the constraint θ_1 + θ_2 = 1
- Compute the partial derivative of the (log-)likelihood
- Set the partial derivative to zero
- => MLE is θ_1 = N_1 / (N_1 + N_2)
- In general, for multinomials (> 2 states), the MLE is θ_k = N_k / ∑_l N_l
(derivation sketched below)
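The derivation itself is missing from the extracted slides; a sketch of the standard computation:

```latex
\log L(\theta) = N_1 \log \theta_1 + N_2 \log (1-\theta_1)
\;\Rightarrow\;
\frac{\partial \log L}{\partial \theta_1} = \frac{N_1}{\theta_1} - \frac{N_2}{1-\theta_1} = 0
\;\Rightarrow\;
\hat{\theta}_1 = \frac{N_1}{N_1 + N_2},
\qquad
\hat{\theta}_k = \frac{N_k}{\sum_{l=1}^{K} N_l} \ \ \text{(general multinomial case)}
```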
Likelihood for Conditional Multinomials
- A separate multinomial θ_{.|pa} for each joint state pa of the parents of V
- The local likelihood factorizes over the parent states pa
- MLE: relative frequencies within each parent state (see the sketch below)
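A sketch of the missing formulas, with N_{k,pa} the number of cases in which V = k and Pa(V) = pa:

```latex
L(\theta_V : D) \;=\; \prod_{pa} \prod_{k=1}^{K} \theta_{k \mid pa}^{\,N_{k,pa}}
\qquad\Rightarrow\qquad
\hat{\theta}_{k \mid pa} \;=\; \frac{N_{k,pa}}{\sum_{l=1}^{K} N_{l,pa}}
```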
Learning With Bayesian Networks
Road map as above; next case: fixed structure, partially observed data.
Known Structure, Incomplete Data
Example data over (E, B, A): <Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, ..., <?,Y,Y>
- Network structure is specified
- Data contains missing values
- Need to consider assignments to missing values
EM Idea
- In the case of complete data, ML parameter estimation is easy: simply counting (1 iteration)
- Incomplete data?
- Complete the data (imputation)
- most probable?, average?, ... value
- Count
- Iterate
EM Idea: complete the data
[Figure: a small network over the two variables A and B.]
Incomplete data:
A      B
true   true
true   ?
false  true
true   false
false  ?
EM Idea: complete the data
Complete the data: a case with a missing value contributes fractionally to each possible completion, according to the current parameters.
Completed data:
A      B           N
true   true        1.0
true   ? -> true   0.5
true   ? -> false  0.5
false  true        1.0
true   false       1.0
false  ? -> true   0.5
false  ? -> false  0.5
Expected counts:
A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5
EM Idea: complete the data
Maximize: set the parameters to the ML estimates computed from the expected counts; then iterate, i.e. complete the data again with the updated parameters.
EM Idea: complete the data
After re-completing the data with the updated parameters, the expected counts change:
A      B      N
true   true   1.5
true   false  1.5
false  true   1.75
false  false  0.25
Maximize again.
EM Idea: complete the data
After the next iteration:
A      B      N
true   true   1.5
true   false  1.5
false  true   1.875
false  false  0.125
Maximize, and keep iterating.
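A minimal runnable sketch of this toy example (not taken from the slides; it assumes the structure A -> B, missing values only in B, and a uniform initialisation P(B=true | A) = 0.5, which together reproduce the expected counts 1.5/0.5, 1.75/0.25 and 1.875/0.125 shown above):

```python
# Toy "complete the data" EM loop for the two-variable example A -> B.
# Assumptions (hypothetical, chosen to match the numbers on the slides):
# only B has missing values, and P(B=true | A=a) starts at 0.5 for both a.

data = [("true", "true"), ("true", None), ("false", "true"),
        ("true", "false"), ("false", None)]  # None marks a missing value of B

p_b_given_a = {"true": 0.5, "false": 0.5}  # current estimate of P(B=true | A=a)

for iteration in range(1, 4):
    # E-step: expected counts N(A=a, B=b) via fractional completion of the data
    counts = {(a, b): 0.0 for a in ("true", "false") for b in ("true", "false")}
    for a, b in data:
        if b is None:  # split the case according to the current parameters
            counts[(a, "true")] += p_b_given_a[a]
            counts[(a, "false")] += 1.0 - p_b_given_a[a]
        else:
            counts[(a, b)] += 1.0

    # M-step: maximum-likelihood estimates from the expected counts
    for a in ("true", "false"):
        total = counts[(a, "true")] + counts[(a, "false")]
        p_b_given_a[a] = counts[(a, "true")] / total

    print(f"iteration {iteration}: counts={counts}, P(B=true|A)={p_b_given_a}")

# P(A) is not shown: A is always observed, so its MLE is plain counting.
```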
Complete-data likelihood
A1     A2     A3     A4     A5     A6
true   true   ?      true   false  false
?      true   ?      ?      false  false
...    ...    ...    ...    ...    ...
true   false  ?      false  true   ?
- Incomplete-data likelihood: the likelihood of the observed data only
- Assume complete data exists: the observed data together with the missing/hidden part defines the complete-data likelihood (see the sketch below)
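The formulas are missing from the extracted slide; a standard sketch, writing D for the observed data and H for the unobserved part:

```latex
\text{incomplete-data likelihood: } L(\theta : D) = P(D \mid \theta) = \sum_{H} P(D, H \mid \theta),
\qquad
\text{complete-data likelihood: } L_c(\theta : D, H) = P(D, H \mid \theta)
```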
EM Algorithm - Abstract
EM Algorithm - Principle
[Figure: the incomplete-data likelihood P(Y | θ) plotted as a function of θ, together with the auxiliary function constructed at the current point.]
Expectation Maximization (EM): construct a new function based on the current point (one which behaves well).
Property: the maximum of the new function has a better score than the current point.
(A sketch of this auxiliary function follows below.)
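In the standard formulation (not spelled out in the extracted slides), the constructed function is the expected complete-data log-likelihood, with θ^t the current parameters and H the unobserved part of the data:

```latex
Q(\theta \mid \theta^{t}) \;=\; \mathbb{E}_{H \sim P(H \mid D, \theta^{t})}\big[\log P(D, H \mid \theta)\big],
\qquad
\theta^{t+1} \;=\; \arg\max_{\theta} Q(\theta \mid \theta^{t})
```

Maximizing Q over θ is guaranteed not to decrease the incomplete-data likelihood, which is exactly the monotonicity property stated a few slides below.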
EM for Multinomials
- Random variable V with values 1, ..., K
- E[N_k] is the expected count of state k in the data
- MLE from the expected counts (see the sketch below)
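A sketch of the missing formulas (θ^t are the current parameters used to complete the data):

```latex
\mathbb{E}[N_k] \;=\; \sum_{n=1}^{N} P\big(V = k \mid d_n, \theta^{t}\big),
\qquad
\hat{\theta}_k \;=\; \frac{\mathbb{E}[N_k]}{\sum_{l=1}^{K} \mathbb{E}[N_l]}
```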
EM for Conditional Multinomials
- A separate multinomial for each joint state pa of the parents of V
- Expected counts are computed from the posterior over V and its parents in each data case
- MLE from the expected counts (see the sketch below)
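Analogously to the unconditional case (again a sketch of the missing formulas):

```latex
\mathbb{E}[N_{k,pa}] \;=\; \sum_{n=1}^{N} P\big(V = k, \mathrm{Pa}(V) = pa \mid d_n, \theta^{t}\big),
\qquad
\hat{\theta}_{k \mid pa} \;=\; \frac{\mathbb{E}[N_{k,pa}]}{\sum_{l=1}^{K} \mathbb{E}[N_{l,pa}]}
```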
Learning Parameters: incomplete data
- Non-decomposable likelihood (missing values, hidden nodes)
- Start from initial parameters; they define the current model
- Expectation: complete the data using the current model
- Maximization: update the parameters (ML, MAP)
- EM algorithm: iterate until convergence
Learning Parameters: incomplete data
1. Initialize parameters
2. Compute pseudo counts (expected counts) for each variable
3. Set parameters to the (completed-data) ML estimates
4. If not converged, go back to step 2
Monotonicity
- (Dempster, Laird, Rubin 1977) The incomplete-data likelihood function is not decreased after an EM iteration
- For (discrete) Bayesian networks: for any initial, non-uniform value, the EM algorithm converges to a (local or global) maximum
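In symbols, the monotonicity statement reads:

```latex
L(\theta^{t+1} : D) \;\geq\; L(\theta^{t} : D) \qquad \text{for every EM iteration } t
```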
LL on training set (Alarm)
[Figure: log-likelihood on the training set. Experiment by Bauer, Koller and Singer, UAI '97.]
Parameter values (Alarm)
[Figure: parameter values. Experiment by Bauer, Koller and Singer, UAI '97.]
EM in Practice
- Initial parameters
- Random parameter setting
- Best guess from another source
- Stopping criteria
- Small change in likelihood of data
- Small change in parameter values
- Avoiding bad local maxima
- Multiple restarts
- Early pruning of unpromising ones
- Speed up
- Various methods to speed up convergence
Gradient Ascent
- Main result: a closed form for the gradient of the log-likelihood (see the sketch below)
- Requires the same BN inference computations as EM
- Pros
- Flexible
- Closely related to methods in neural network training
- Cons
- Need to project the gradient onto the space of legal parameters
- To get reasonable convergence we need to combine it with smart optimization techniques
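The "main result" formula itself is missing from the extracted slide; the standard result for table-CPD Bayesian networks (e.g. Binder, Koller, Russell and Kanazawa, 1997), which it presumably refers to, is:

```latex
\frac{\partial \log P(D \mid \theta)}{\partial \theta_{k \mid pa}}
\;=\; \sum_{n=1}^{N} \frac{P\big(V = k, \mathrm{Pa}(V) = pa \mid d_n, \theta\big)}{\theta_{k \mid pa}}
```

Each term needs exactly the posterior family marginals that the E-step of EM also computes, which is why the same BN inference machinery can be reused.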
Parameter Estimation: Summary
- Parameter estimation is a basic task for learning with Bayesian networks
- Due to missing values: non-linear optimization
- EM, gradient ascent, ...
- EM for multinomial random variables
- Fully observed data: counting
- Partially observed data: pseudo (expected) counts
- Junction tree to perform the many required inference calls efficiently