Graphical Models - Learning -
1
Graphical Models - Learning -
Advanced I WS 06/07
Based on:
• J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, U.C. Berkeley, April 1998
• G. J. McLachlan and T. Krishnan, "The EM Algorithm and Extensions", John Wiley & Sons, Inc., 1997
• D. Koller, course CS-228 handouts, Stanford University, 2001
• N. Friedman and D. Koller's NIPS '99 tutorial
Parameter Estimation
  • Wolfram Burgard, Luc De Raedt, Kristian
    Kersting, Bernhard Nebel

Albert-Ludwigs University Freiburg, Germany
2
Outline
• Introduction
• Reminder: Probability theory
• Basics of Bayesian Networks
• Modeling Bayesian networks
• Inference (VE, Junction tree)
• Excursus: Markov Networks
• Learning Bayesian networks
• Relational Models

3
What is Learning?
  • Agents are said to learn if they improve their
    performance over time based on experience.

"The problem of understanding intelligence is said to be the greatest problem in science today and 'the' problem for this century, as deciphering the genetic code was for the second half of the last one; the problem of learning represents a gateway to understanding intelligence in man and machines." -- Tomaso Poggio and Steve Smale, 2003
4
Why bother with learning?
• Bottleneck of knowledge acquisition
• Expensive, difficult
• Normally, no expert is around
• Data is cheap!
• Huge amounts of data available, e.g.
• Clinical tests
• Web mining, e.g. log files
• ...
5
Why Learning Bayesian Networks?
• Conditional independencies and a graphical language capture the structure of many real-world distributions
• The graph structure provides much insight into the domain
• Allows knowledge discovery
• The learned model can be used for many tasks
• Supports all the features of probabilistic learning
• Model selection criteria
• Dealing with missing data and hidden variables

6
Learning With Bayesian Networks
Data + Prior Info
7
Learning With Bayesian Networks
Data + Prior Info
8
What does the data look like?
complete data set (attributes/variables A1, ..., A6; data cases X1, ..., XM):

      A1     A2     A3     A4     A5     A6
X1    true   true   false  true   false  false
X2    false  true   true   true   false  false
...   ...    ...    ...    ...    ...    ...
XM    true   false  false  false  true   true
9
What does the data look like?
incomplete data set:

A1     A2     A3     A4     A5     A6
true   true   ?      true   false  false
?      true   ?      ?      false  false
...    ...    ...    ...    ...    ...
true   false  ?      false  true   ?

• Real-world data: states of some random variables are missing
• E.g. medical diagnosis: not all patients are subject to all tests
• Parameter reduction, e.g. clustering, ...
10
What does the data look like?
incomplete data set (the "?" entries are missing values):

A1     A2     A3     A4     A5     A6
true   true   ?      true   false  false
?      true   ?      ?      false  false
...    ...    ...    ...    ...    ...
true   false  ?      false  true   ?

• Real-world data: states of some random variables are missing
• E.g. medical diagnosis: not all patients are subject to all tests
• Parameter reduction, e.g. clustering, ...
11
What does the data look like?
incomplete data set; a variable whose value is never observed is called hidden or latent:

A1     A2     A3     A4     A5     A6
true   true   ?      true   false  false
?      true   ?      ?      false  false
...    ...    ...    ...    ...    ...
true   false  ?      false  true   ?

• Real-world data: states of some random variables are missing
• E.g. medical diagnosis: not all patients are subject to all tests
• Parameter reduction, e.g. clustering, ...
12
Hidden variable Examples
• Parameter reduction

[Figure: two Bayesian network structures over X1, X2, X3, Y1, Y2, Y3 and a hidden variable H, illustrating parameter reduction]

For example, with binary variables, a node with three parents needs a CPT with 2^3 = 8 rows, while a node with a single parent needs only 2, so introducing a hidden variable between two layers can drastically reduce the number of parameters.
13
Hidden variable Examples
1. Clustering

[Figure: naive Bayes / AutoClass clustering model: a hidden Cluster variable (the cluster assignment) with the observed attributes X1, ..., Xn as its children]

• Hidden variables also appear in clustering
• AutoClass model:
• Hidden variable assigns class labels
• Observed attributes are independent given the class
14
Slides due to Eamonn Keogh
15
Slides due to Eamonn Keogh
Iteration 1: the cluster means are randomly assigned
16
Slides due to Eamonn Keogh
Iteration 2
17
Slides due to Eamonn Keogh
Iteration 5
18
Slides due to Eamonn Keogh
Iteration 25
19
Slides due to Eamonn Keogh
What is a natural grouping among these objects?
20
Slides due to Eamonn Keogh
What is a natural grouping among these objects?
Clustering is subjective: the same objects can be grouped as Simpson's family, as males and females, or as school employees.
21
Learning With Bayesian Networks

• Fixed structure, fully observed: easiest problem, just counting
• Fixed structure, partially observed: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks
• Fixed variables (structure to be selected), fully observed: selection of arcs; new domain with no domain expert; data mining
• Fixed variables, partially observed: encompasses the difficult subproblems; only Structural EM is known; scientific discovery

[Figure: example two-node networks over A and B with unknown parameters (marked "?"), one with an additional hidden variable H]
22
Parameter Estimation
• Let D be a set of data cases over m random variables (RVs)
• Each element of D is called a data case
• iid assumption: all data cases are independently sampled from identical distributions

Find the parameters of the CPDs which match the data best
23
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?
24
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?

Find the parameters which most likely produced the data
25
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?
• MAP parameters
26
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?
• MAP parameters
• Data is equally likely for all parameters
27
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?
• MAP parameters
• Data is equally likely for all parameters
• All parameters are a priori equally likely
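A sketch of how these two observations reduce MAP estimation to maximum likelihood (the symbols θ for the parameters and D for the data are assumed notation):

\[
\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta)\, P(\theta)}{P(D)} .
\]

Since P(D) does not depend on θ, and a uniform prior makes P(θ) a constant, this is the same as \(\arg\max_{\theta} P(D \mid \theta)\), i.e. the maximum likelihood parameters.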
28
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?

Find ML parameters
29
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?

Find ML parameters
Likelihood of the parameters given the data
30
Maximum Likelihood - Parameter Estimation
• What does "best matching" mean?

Find ML parameters
Likelihood of the parameters given the data
Log-likelihood of the parameters given the data
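Written out under the iid assumption from the Parameter Estimation slide (the notation D = {d_1, ..., d_N} for the data cases is assumed):

\[
L(\theta : D) = P(D \mid \theta) = \prod_{l=1}^{N} P(d_l \mid \theta), \qquad
LL(\theta : D) = \log L(\theta : D) = \sum_{l=1}^{N} \log P(d_l \mid \theta),
\]
\[
\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta : D) = \arg\max_{\theta} LL(\theta : D).
\]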
31
Maximum Likelihood
• This is one of the most commonly used estimators in statistics
• Intuitively appealing
• Consistent: the estimate converges to the best possible value as the number of examples grows
• Asymptotically efficient: the estimate is as close to the true value as possible given a particular training set
32
Learning With Bayesian Networks

• Fixed structure, fully observed: easiest problem, just counting
• Fixed structure, partially observed: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks
• Fixed variables (structure to be selected), fully observed: selection of arcs; new domain with no domain expert; data mining
• Fixed variables, partially observed: encompasses the difficult subproblems; only Structural EM is known; scientific discovery

[Figure: example two-node networks over A and B with unknown parameters (marked "?"), one with an additional hidden variable H]
33
Known Structure, Complete Data
E, B, A:
<Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
  • Network structure is specified
  • Learner needs to estimate parameters
  • Data does not contain missing values

34
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
35
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
(iid)
36
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
(iid)
37
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
(iid)
(BN semantics)
38
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
(iid)
(BN semantics)
39
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
(iid)
(BN semantics)
40
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
(iid)
(BN semantics)
Only local parameters of family of Aj involved
41
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
(iid)
(BN semantics)
Only local parameters of family of Aj involved
Each factor can be maximized individually!
42
ML Parameter Estimation
A1 A2 A3 A4 A5 A6
true true false true false false
false true true true false false
... ... ... ... ... ...
true false false false true true
(iid)
Decomposability of the likelihood
(BN semantics)
Only local parameters of family of Aj involved
Each factor can be maximized individually!
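The chain of steps marked (iid) and (BN semantics) above, written out as a sketch (assumed notation: d_l is the l-th data case, A_j[l] and Pa_j[l] are the values of attribute A_j and of its parents in that case, and θ_j are the parameters of A_j's CPD):

\[
LL(\theta : D) = \sum_{l=1}^{N} \log P(d_l \mid \theta)
= \sum_{l=1}^{N} \sum_{j=1}^{m} \log P\big(A_j[l] \mid Pa_j[l], \theta_j\big)
= \sum_{j=1}^{m} \Big[ \sum_{l=1}^{N} \log P\big(A_j[l] \mid Pa_j[l], \theta_j\big) \Big].
\]

The first equality uses the iid assumption, the second the BN factorization, and each bracketed term involves only the local parameters θ_j, which is why each factor can be maximized individually.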
43
Decomposability of Likelihood
• If the data set is complete (no question marks), we can maximize each local likelihood function independently, and then combine the solutions to get an MLE solution
• This decomposes the global problem into independent, local sub-problems, which allows efficient solutions to the MLE problem

44
Likelihood for Multinomials
• Random variable V with values 1, ..., K
• where N_k is the count of state k in the data

This constraint (that the θ_k sum to one) implies that the choice of θ_i influences the choice of θ_j (i ≠ j)
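The likelihood and the constraint, written out as a sketch (θ_k = P(V = k) is assumed notation):

\[
L(\theta : D) = \prod_{k=1}^{K} \theta_k^{\,N_k}, \qquad \sum_{k=1}^{K} \theta_k = 1, \quad \theta_k \ge 0 .
\]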
45
Likelihood for Binomials (2 states only)
• Compute the partial derivative, using the constraint θ_1 + θ_2 = 1
• Set the partial derivative to zero
=> this gives the MLE
46
Likelihood for Binomials (2 states only)
• Compute the partial derivative, using the constraint θ_1 + θ_2 = 1
• Set the partial derivative to zero
=> this gives the MLE

In general, for multinomials (> 2 states), the MLE takes the form sketched below.
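A sketch of the two-state derivation and of the general formula (standard results; N_1, N_2 are the state counts and θ_1 = P(V = 1)):

\[
\log L(\theta) = N_1 \log \theta_1 + N_2 \log (1 - \theta_1), \qquad
\frac{\partial \log L}{\partial \theta_1} = \frac{N_1}{\theta_1} - \frac{N_2}{1 - \theta_1} = 0
\;\Longrightarrow\; \hat{\theta}_1 = \frac{N_1}{N_1 + N_2} .
\]

For K > 2 states the same argument (with a Lagrange multiplier for the sum-to-one constraint) gives

\[
\hat{\theta}_k = \frac{N_k}{\sum_{j=1}^{K} N_j} .
\]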
47
Likelihood for Conditional Multinomials
• One multinomial distribution for each joint state pa of the parents of V
• MLE (sketched below)
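A sketch of the formula, with N_{k,pa} denoting the number of data cases in which V = k and the parents take the joint state pa (notation assumed):

\[
\hat{\theta}_{k \mid pa} = \frac{N_{k,pa}}{\sum_{j=1}^{K} N_{j,pa}} .
\]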

48
Learning With Bayesian Networks

• Fixed structure, fully observed: easiest problem, just counting
• Fixed structure, partially observed: numerical, nonlinear optimization; multiple calls to BN inference; difficult for large networks
• Fixed variables (structure to be selected), fully observed: selection of arcs; new domain with no domain expert; data mining
• Fixed variables, partially observed: encompasses the difficult subproblems; only Structural EM is known; scientific discovery

[Figure: example two-node networks over A and B with unknown parameters (marked "?"), one with an additional hidden variable H]
49
Known Structure, Incomplete Data
E, B, A:
<Y,?,N>, <Y,N,?>, <N,N,Y>, <N,Y,Y>, ..., <?,Y,Y>
  • Network structure is specified
  • Data contains missing values
  • Need to consider assignments to missing values

50
EM Idea
• In the case of complete data, ML parameter estimation is easy: simply counting (1 iteration)
• Incomplete data?
• Complete the data (imputation): most probable?, average?, ... value
• Count
• Iterate
51
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
52
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
complete data
expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5
53
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
complete data (incomplete cases split into weighted completions):

A      B          N
true   true       1.0
true   ? → true   0.5
true   ? → false  0.5
false  true       1.0
true   false      1.0
false  ? → true   0.5
false  ? → false  0.5

expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5

(The 0.5/0.5 split of each incomplete case corresponds to the current model assigning probability 0.5 to each completion of the missing B.)
54
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
complete data
expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5

maximize
55
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
complete data
expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.5
false  false  0.5

maximize
iterate
56
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
57
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
complete data
expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.75
false  false  0.25

maximize
iterate
58
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
59
EM Idea complete the data
A
B
incomplete data
A B
true true
true ?
false true
true false
false ?
complete data
expected counts:

A      B      N
true   true   1.5
true   false  1.5
false  true   1.875
false  false  0.125

maximize
iterate
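The iteration above can be reproduced with a small script. The following is a minimal sketch, assuming (the slides do not state this explicitly) a two-node network A -> B in which A is always observed, only P(B | A) is re-estimated, and the initial parameters set P(B = true | A) = 0.5 for both values of A; under those assumptions it prints exactly the expected counts 1.5 / 1.5 / 1.5 / 0.5, then 1.5 / 1.5 / 1.75 / 0.25, then 1.5 / 1.5 / 1.875 / 0.125 shown in the last three tables.

    # Minimal EM sketch for the incomplete-data example on the preceding slides.
    # Assumptions: structure A -> B, A always observed, initial P(B=true | A=a) = 0.5.
    data = [(True, True), (True, None), (False, True), (True, False), (False, None)]
    p_b_given_a = {True: 0.5, False: 0.5}  # current estimate of P(B=true | A=a)

    for step in range(3):
        # E-step: expected counts E[N(A=a, B=b)]; a missing B is split into
        # fractional counts according to the current P(B | A=a).
        counts = {(a, b): 0.0 for a in (True, False) for b in (True, False)}
        for a, b in data:
            if b is None:
                counts[(a, True)] += p_b_given_a[a]
                counts[(a, False)] += 1.0 - p_b_given_a[a]
            else:
                counts[(a, b)] += 1.0
        # M-step: re-estimate P(B=true | A=a) from the expected counts.
        for a in (True, False):
            p_b_given_a[a] = counts[(a, True)] / (counts[(a, True)] + counts[(a, False)])
        print(step + 1, counts, p_b_given_a)

Under these assumptions the expected count for (A = false, B = true) climbs from 1.5 towards 2 and P(B = true | A = false) towards 1, the fixed point of the update p <- (1 + p) / 2.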
60
Complete-data likelihood
A1 A2 A3 A4 A5 A6
true true ? true false false
? true ? ? false false
... ... ... ... ... ...
true false ? false true ?
incomplete-data likelihood

Assume complete data exists, with complete-data likelihood
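A sketch of the two likelihoods in the assumed notation (o_l is the observed part of the l-th case, h_l its missing part):

\[
L(\theta : D) = \prod_{l=1}^{N} P(o_l \mid \theta) = \prod_{l=1}^{N} \sum_{h} P(o_l, h \mid \theta)
\qquad \text{(incomplete-data likelihood)}
\]
\[
L_c(\theta) = \prod_{l=1}^{N} P(o_l, h_l \mid \theta)
\qquad \text{(complete-data likelihood)}
\]

The sum over the missing values is what prevents the incomplete-data likelihood from decomposing into local terms.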
61
EM Algorithm - Abstract
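In the notation used above (θ^t is the current parameter estimate and L_c the complete-data likelihood; both assumed), the abstract scheme is usually written as:

E-step:
\[
Q(\theta \mid \theta^{t}) = E\big[ \log L_c(\theta) \mid \text{observed data}, \theta^{t} \big]
\]
M-step:
\[
\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^{t}) .
\]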
62
EM Algorithm - Principle
[Figure: P(Y | θ) plotted as a function of θ, with the current point and the auxiliary function constructed at it]

Expectation Maximization (EM): construct a new function based on the current point (which behaves well).
Property: the maximum of the new function has a better score than the current point.
63
EM Algorithm - Principle
[Figure: P(Y | θ) plotted as a function of θ, with the current point and the auxiliary function constructed at it]

Expectation Maximization (EM): construct a new function based on the current point (which behaves well).
Property: the maximum of the new function has a better score than the current point.
64
EM Algorithm - Principle
[Figure: P(Y | θ) plotted as a function of θ, with the current point and the auxiliary function constructed at it]

Expectation Maximization (EM): construct a new function based on the current point (which behaves well).
Property: the maximum of the new function has a better score than the current point.
65
EM Algorithm - Principle
[Figure: P(Y | θ) plotted as a function of θ, with the current point and the auxiliary function constructed at it]

Expectation Maximization (EM): construct a new function based on the current point (which behaves well).
Property: the maximum of the new function has a better score than the current point.
66
EM for Multinomials
• Random variable V with values 1, ..., K
• where E[N_k] is the expected count of state k in the data (see the sketch below)
• MLE
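Written out as a sketch (o_l denotes the observed part of case l and θ^t the current parameters; notation assumed):

\[
E[N_k] = \sum_{l=1}^{N} P(V = k \mid o_l, \theta^{t}), \qquad
\theta_k^{t+1} = \frac{E[N_k]}{\sum_{j=1}^{K} E[N_j]} .
\]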

67
EM for Conditional Multinomials
• One multinomial distribution for each joint state pa of the parents of V
• MLE (sketched below)
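The conditional version, in the same assumed notation:

\[
E[N_{k,pa}] = \sum_{l=1}^{N} P\big(V = k,\; Pa(V) = pa \mid o_l, \theta^{t}\big), \qquad
\theta_{k \mid pa}^{t+1} = \frac{E[N_{k,pa}]}{\sum_{j=1}^{K} E[N_{j,pa}]} .
\]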

68
Learning Parameters incomplete data
Non-decomposable likelihood (missing values, hidden nodes)
Initial parameters
Current model
69
Learning Parameters incomplete data
Non-decomposable likelihood (missing values, hidden nodes)
Initial parameters
Expectation
Current model
70
Learning Parameters incomplete data
Non-decomposable likelihood (missing values, hidden nodes)
Initial parameters
Expectation
Current model
Maximization
Update parameters (ML, MAP)
71
Learning Parameters incomplete data
Non-decomposable likelihood (missing values, hidden nodes)
Initial parameters
Expectation
Current model
Maximization
Update parameters (ML, MAP)
EM algorithm: iterate until convergence
72
Learning Parameters incomplete data
1. Initialize the parameters
2. Compute expected (pseudo) counts for each variable
3. Set the parameters to the (completed-data) ML estimates
4. If not converged, go back to step 2
73
Monotonicity
• (Dempster, Laird and Rubin, 1977) The incomplete-data likelihood function is not decreased after an EM iteration
• For (discrete) Bayesian networks: for any initial, non-uniform value the EM algorithm converges to a (local or global) maximum
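A sketch of the standard argument behind this guarantee, in the notation of the abstract EM slide above:

\[
\log L(\theta^{t+1}) - \log L(\theta^{t}) \;\ge\; Q(\theta^{t+1} \mid \theta^{t}) - Q(\theta^{t} \mid \theta^{t}) \;\ge\; 0 ,
\]

where the first inequality follows from Jensen's inequality and the second holds because θ^{t+1} maximizes \(Q(\cdot \mid \theta^{t})\).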

74
LL on training set (Alarm)
Experiment by Bauer, Koller and Singer UAI97
75
Parameter value (Alarm)
Experiment by Bauer, Koller and Singer UAI97
76
EM in Practice
  • Initial parameters
  • Random parameters setting
  • Best guess from other source
  • Stopping criteria
  • Small change in likelihood of data
  • Small change in parameter values
  • Avoiding bad local maxima
  • Multiple restarts
  • Early pruning of unpromising ones
  • Speed up
  • various methods to speed convergence

77
Gradient Ascent
• Main result (cf. the sketch after this list)
• Requires the same BN inference computations as EM
  • Pros
  • Flexible
  • Closely related to methods in neural network
    training
  • Cons
  • Need to project gradient onto space of legal
    parameters
  • To get reasonable convergence we need to combine
    with smart optimization techniques
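For discrete Bayesian networks, a standard gradient expression (stated here as an assumed sketch, with θ_{x|pa} denoting the CPT entry P(X = x | Pa(X) = pa)) shows why the same inference computations as in EM are needed:

\[
\frac{\partial\, LL(\theta : D)}{\partial\, \theta_{x \mid pa}}
= \sum_{l=1}^{N} \frac{P\big(X = x,\; Pa(X) = pa \mid d_l, \theta\big)}{\theta_{x \mid pa}} ,
\]

and each numerator is a posterior family marginal, exactly the quantity computed in the E-step of EM.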

78
Parameter Estimation Summary
• Parameter estimation is a basic task for learning with Bayesian networks
• Due to missing values it becomes a non-linear optimization problem
• EM, gradient ascent, ...
• EM for multinomial random variables
• Fully observed data: counting
• Partially observed data: expected (pseudo) counts
• Junction tree to perform the multiple inference calls