Title: Generative Models
Generative Models
Statistical Inference
- Assume female heights follow a Gaussian distribution N(μf, σf) and male heights follow a Gaussian distribution N(μm, σm)
- Given an individual of height 1.67 m, compare Pr(male | 1.67 m) with Pr(female | 1.67 m)
Probabilistic Models for Classification Problems
- Apply statistical inference methods:
- Given training examples
- Assume a parametric model
- Learn the model parameters θ from the training examples using the maximum likelihood approach
- The class of a new instance is predicted by choosing the class with the highest posterior probability p(y | x; θ)
Maximum Likelihood Estimation (MLE)
- Given training examples, compute the log-likelihood of the data
- Find the parameters θ that maximize the log-likelihood
- In many cases, the expression for the log-likelihood has no closed-form maximizer, and MLE therefore requires numerical optimization (see the sketch below)
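As an illustration, a minimal numerical MLE sketch for a one-dimensional Gaussian (the data and initial guess are made up; for a Gaussian a closed form exists, but the numerical route generalizes):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up 1-D data: heights drawn from a Gaussian the code pretends not to know.
rng = np.random.default_rng(0)
heights = rng.normal(loc=1.70, scale=0.07, size=100)

def neg_log_likelihood(params, x):
    """Negative Gaussian log-likelihood; minimizing it performs MLE."""
    mu, log_sigma = params              # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(heights,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
# For a Gaussian the closed-form MLE (sample mean / std) should agree:
print(mu_hat, sigma_hat, heights.mean(), heights.std())
```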
Generative Models
- Most probability distributions are joint distributions (i.e., p(x; θ)), not conditional distributions (i.e., p(y | x; θ))
- Using Bayes' rule:
- p(y | x; θ) ∝ p(x | y; θ) · p(y; θ)
Generative Models (contd)
- Treatment of p(x | y; θ):
- Let y ∈ Y = {1, 2, …, c}
- Allocate a separate set of parameters for each class: θ → {θ1, θ2, …, θc}
- p(x | y; θ) = p(x; θy)
- Data in different classes have different input patterns
Generative Models (contd)
- Parameter space:
- Parameters for the class-conditional distributions: θ1, θ2, …, θc
- Class priors: p(y = 1), p(y = 2), …, p(y = c)
- Learn the parameters from training examples using MLE:
- Compute the log-likelihood
- Search for the optimal parameters by maximizing the log-likelihood
Example
- Task: predict the gender of individuals based on their heights
- Given:
- 100 height examples of women
- 100 height examples of men
- Assume the heights of women and men follow different Gaussian distributions
Example (contd)
- Gaussian distribution for each class
- Parameter space:
- Gaussian distribution for males: (μm, σm)
- Gaussian distribution for females: (μf, σf)
- Class priors: pm = p(y = male), pf = p(y = female)
Example (contd)
- Learn a Gaussian generative model
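A minimal sketch of this fitting step on hypothetical height samples (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical height samples in meters, 100 per class (all numbers invented).
rng = np.random.default_rng(1)
male = rng.normal(1.75, 0.07, size=100)
female = rng.normal(1.62, 0.06, size=100)

# MLE for each class-conditional Gaussian: sample mean and (biased) std.
mu_m, sigma_m = male.mean(), male.std()
mu_f, sigma_f = female.mean(), female.std()

# Class priors from class frequencies (100/200 each here).
p_m = len(male) / (len(male) + len(female))
p_f = 1.0 - p_m
print(mu_m, sigma_m, mu_f, sigma_f, p_m)
```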
Example (contd)
- Predict the gender of an individual given his/her height
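A sketch of the prediction rule via Bayes' rule; the fitted values below are assumed, not estimates from real data:

```python
from scipy.stats import norm

# Assumed fitted parameters (illustrative values only).
mu_m, sigma_m, p_m = 1.75, 0.07, 0.5
mu_f, sigma_f, p_f = 1.62, 0.06, 0.5

def predict_gender(h):
    """Pick the class with the larger posterior: p(y|h) is proportional to p(h|y) p(y)."""
    score_m = norm.pdf(h, loc=mu_m, scale=sigma_m) * p_m
    score_f = norm.pdf(h, loc=mu_f, scale=sigma_f) * p_f
    return "male" if score_m > score_f else "female"

print(predict_gender(1.67))
```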
Decision Boundary
- Decision boundary h*:
- Predict female when h < h*
- Predict male when h > h*
- Predict at random when h = h*
- Where is the decision boundary?
- It depends on the ratio pm/pf (see the sketch below)
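One way to locate h* is to find where the two prior-weighted class densities cross; a sketch with the same illustrative parameters as above:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Same illustrative parameters as above.
mu_m, sigma_m, p_m = 1.75, 0.07, 0.5
mu_f, sigma_f, p_f = 1.62, 0.06, 0.5

# h* is where the prior-weighted class densities are equal.
f = lambda h: p_m * norm.pdf(h, mu_m, sigma_m) - p_f * norm.pdf(h, mu_f, sigma_f)
h_star = brentq(f, mu_f, mu_m)   # the sign of f flips between the two class means
print(h_star)
# Increasing p_m / p_f moves h* toward the female mean: more heights get called male.
```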
Gaussian Generative Model (II)
- Inputs contain multiple features
- Example:
- Task: predict whether an individual is overweight based on his/her salary and the number of hours spent watching TV
- Input: (s = salary, h = hours spent watching TV)
- Output: +1 (overweight), -1 (normal)
Multi-variate Gaussian Distribution
- Density: p(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(x - μ)ᵀ Σ⁻¹ (x - μ) / 2)
- MLE estimates: μ = (1/N) ∑i xi and Σ = (1/N) ∑i (xi - μ)(xi - μ)ᵀ
Properties of Covariance Matrix
- What if the number of data points N is smaller than the dimension d?
- What about vᵀΣv for any vector v? It is always ≥ 0
- Σ is a positive semi-definite matrix
- Number of distinct elements in Σ: d(d + 1)/2, since Σ is symmetric
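A quick numerical check of these properties (dimensions and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))        # N = 200 points in d = 5 dimensions
Sigma = np.cov(X, rowvar=False)      # 5 x 5 sample covariance

# Positive semi-definite: all eigenvalues >= 0 (up to round-off),
# equivalently v^T Sigma v >= 0 for every vector v.
print(np.linalg.eigvalsh(Sigma))

# Sigma is symmetric, so it has d(d + 1)/2 distinct entries.
d = Sigma.shape[0]
print(d * (d + 1) // 2)              # 15
```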
Gaussian Generative Model (II)
- Joint distribution p(s, h) over salary (s) and hours spent watching TV (h)
Multi-variate Gaussian Generative Model
- Input with multiple features
- A multi-variate Gaussian distribution for each class
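A minimal sketch of such a model for the (salary, TV hours) example; the class means, covariances, and test point are invented:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Invented 2-D inputs (salary in dollars, TV hours per day) for the two classes.
rng = np.random.default_rng(3)
X_pos = rng.multivariate_normal([40_000, 5.0], [[1e8, 50.0], [50.0, 4.0]], size=100)
X_neg = rng.multivariate_normal([60_000, 2.0], [[1e8, 10.0], [10.0, 1.0]], size=100)

# One multivariate Gaussian per class, fitted by MLE, plus class priors.
models, priors = {}, {}
for label, X in [(+1, X_pos), (-1, X_neg)]:
    models[label] = multivariate_normal(X.mean(axis=0), np.cov(X, rowvar=False))
    priors[label] = len(X) / (len(X_pos) + len(X_neg))

def predict(x):
    """Class with the largest p(x|y) p(y)."""
    return max(models, key=lambda y: models[y].pdf(x) * priors[y])

print(predict([45_000, 4.0]))
```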
Improve the Multivariate Gaussian Model
- How could we improve the model's predictions of overweight?
- Allow multiple modes for each class
- Introduce more attributes of individuals:
- Location
- Occupation
- Number of children
- House
- Age
Problems with Using the Multi-variate Gaussian Generative Model
- Σ is a d×d matrix containing d(d + 1)/2 independent variables:
- d = 100: the number of variables in Σ is 5,050
- d = 1000: the number of variables in Σ is 500,500
- → A large parameter space
- Σ can be singular:
- If N < d, Σ is rank-deficient
- If two features are linearly correlated → Σ⁻¹ does not exist
Problems with Using the Multi-variate Gaussian Generative Model
- Fixes (sketched below):
- Diagonalize Σ: the feature independence assumption (Naïve Bayes assumption)
- Smooth the covariance matrix
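Both fixes in code, as a sketch; lam is a hypothetical smoothing weight that would be tuned in practice:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 10))        # few samples: Sigma can be ill-conditioned
Sigma = np.cov(X, rowvar=False)

# Fix 1: diagonalize, keeping only per-feature variances (the Naive Bayes assumption).
Sigma_diag = np.diag(np.diag(Sigma))

# Fix 2: smooth by shrinking toward the identity so Sigma stays well-conditioned
# (lam is a hypothetical smoothing weight to tune).
lam = 0.1
Sigma_smooth = (1 - lam) * Sigma + lam * np.eye(Sigma.shape[0])
print(np.linalg.cond(Sigma), np.linalg.cond(Sigma_smooth))
```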
Overfitting Issue
- Complex model vs. insufficient training data
- Example:
- Consider a classification problem with multiple inputs:
- 100 input features
- 5 classes
- 1000 training examples
- Total number of parameters for a full Gaussian model:
- 5 class priors → 5 parameters
- 5 means → 500 parameters
- 5 covariance matrices → 5 × 5,050 = 25,250 parameters
- 25,755 parameters ≫ 1000 training examples → insufficient training data
Model Complexity vs. Data
- (figure: prediction error as model complexity and training-data size vary)
Naïve Bayes Model
- In general, for any generative model, we have to estimate p(x | y; θ)
- For x in a high-dimensional space, this probability is hard to estimate
- In the Naïve Bayes model, we approximate p(x | y; θ) ≈ ∏_j p(x_j | y; θ), i.e., the features are treated as conditionally independent given the class
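As an illustration of the factorized form, a minimal Gaussian Naïve Bayes sketch on synthetic data (all values invented; equal class priors assumed and dropped from the score):

```python
import numpy as np
from scipy.stats import norm

# Synthetic 3-feature data for two classes (all values invented).
rng = np.random.default_rng(5)
X_pos = rng.normal([2.0, 0.0, 1.0], 1.0, size=(100, 3))
X_neg = rng.normal([0.0, 1.0, 0.0], 1.0, size=(100, 3))

# Per class, fit an independent 1-D Gaussian to each feature.
params = {+1: (X_pos.mean(axis=0), X_pos.std(axis=0)),
          -1: (X_neg.mean(axis=0), X_neg.std(axis=0))}

def log_score(x, y):
    """log p(x|y) approximated by sum_j log p(x_j|y); equal priors omitted."""
    mu, sigma = params[y]
    return norm.logpdf(x, mu, sigma).sum()

x = np.array([1.5, 0.2, 0.8])
print(max(params, key=lambda y: log_score(x, y)))
```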
Text Categorization
- Learn to classify text into predefined categories
- Input x: a document
- Represented by a vector of word counts
- Example: (president, 10), (bush, 2), (election, 5), …
- Output y: whether the document is political or not
- +1 for a political document, -1 for a non-political document
Text Classification
- A generative model for text classification (TC)
- Parameter space:
- p(+) and p(-)
- p(doc | +; θ) and p(doc | -; θ)
- It is difficult to estimate p(doc | +; θ) and p(doc | -; θ):
- Typical vocabulary size: 100,000
- Each document is a vector of 100,000 attributes!
- Too many words in a document
- A Naïve Bayes approach
Text Classification
- A Naïve Bayes approach: for a document d, approximate p(d | y; θ) ≈ ∏_w p(w | y)^n(w, d), where n(w, d) is the number of times word w occurs in d
Text Classification
- The original parameter space:
- p(+) and p(-)
- p(doc | +; θ), p(doc | -; θ)
- Parameter space after the Naïve Bayes simplification:
- p(+) and p(-)
- p(w1 | +), p(w2 | +), …, p(wn | +)
- p(w1 | -), p(w2 | -), …, p(wn | -)
Text Classification
- Learning parameters from training examples:
- Each document is represented by its word counts and its class label
- Learn the parameters using maximum likelihood estimation
Text Classification
- The optimal solution that maximizes the likelihood of the training data: p(w | +) is the relative frequency of word w among all word occurrences in class-+ documents (and similarly for p(w | -)); the class priors are the class frequencies
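A sketch of this solution on a tiny invented corpus (words, counts, and labels are all made up); note how a word unseen in one class drives that class's likelihood to zero, which previews the problem discussed below:

```python
from collections import Counter
import math

# Tiny invented corpus: (word-count dict, label) pairs.
docs = [({"president": 10, "bush": 2, "election": 5}, +1),
        ({"game": 7, "score": 3}, -1),
        ({"election": 4, "vote": 6}, +1)]

# MLE: p(w|y) = count of w in class y / total word count in class y.
counts, labels = {+1: Counter(), -1: Counter()}, Counter()
for doc, y in docs:
    counts[y].update(doc)
    labels[y] += 1

def log_posterior(doc, y):
    total = sum(counts[y].values())
    lp = math.log(labels[y] / len(docs))          # log class prior
    for w, n in doc.items():
        if counts[y][w] == 0:
            return float("-inf")                  # unseen word: MLE gives p(w|y) = 0
        lp += n * math.log(counts[y][w] / total)
    return lp

print(max((+1, -1), key=lambda y: log_posterior({"election": 2, "vote": 1}, y)))
```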
Text Classification
- An example: the Twenty Newsgroups dataset
Text Classification
- Any problems with the Naïve Bayes text classifier?
- Unseen words:
- If word w is unseen in all training documents, what is the consequence?
- If word w is unseen only in the documents of one class, what is the consequence?
- Related to the overfitting problem
- Any suggestions?
- Solution: word-class approach
- Introduce word classes T = {t1, t2, …, tm}
- Compute p(ti | +), p(ti | -)
- When w is unseen, replace p(w | y) with p(ti | y), where ti is the word class of w
- Alternatively, introduce a prior for the word probabilities (see the smoothing sketch below)
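A minimal sketch of the prior-based fix, using add-α (Laplace) smoothing; counts, vocabulary, and α are all illustrative:

```python
from collections import Counter
import math

# Invented class word counts and vocabulary.
counts = {+1: Counter({"election": 14, "vote": 6}),
          -1: Counter({"game": 7, "score": 3})}
vocab = {"election", "vote", "game", "score", "president"}

def smoothed_log_prob(w, y, alpha=1.0):
    """Add-alpha (Laplace) smoothing: a uniform Dirichlet prior on word
    probabilities keeps unseen words from zeroing out a class."""
    total = sum(counts[y].values())
    return math.log((counts[y][w] + alpha) / (total + alpha * len(vocab)))

print(smoothed_log_prob("president", +1))   # finite even though "president" is unseen
```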
Naïve Bayes Model
- This independence assumption is a terrible approximation of p(x | y; θ)
Naïve Bayes Model
- Why use the Naïve Bayes model?
- We are essentially interested in p(y | x; θ), not p(x | y; θ)
Naïve Bayes Model
- The key quantity for the prediction model is not p(x | y; θ) itself, but the ratio p(x | y = +1; θ) / p(x | y = -1; θ)
- Although the Naïve Bayes model does a poor job of estimating p(x | y; θ), it does a reasonably good job of estimating the ratio
The Ratio of Likelihood for Binary Classes
- Assume that both classes share the same variance
- Under this assumption, the Gaussian generative model is a linear model (derivation below)
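The derivation behind this claim, sketched in LaTeX (μ+ and μ- denote the two class means, Σ the shared covariance):

```latex
\log\frac{p(y=+1\mid x)}{p(y=-1\mid x)}
  = \log\frac{p(x\mid y=+1)\,p(y=+1)}{p(x\mid y=-1)\,p(y=-1)}
  % the quadratic terms x^\top \Sigma^{-1} x cancel because \Sigma is shared
  = (\mu_{+}-\mu_{-})^\top \Sigma^{-1} x
    - \tfrac{1}{2}\left(\mu_{+}^\top \Sigma^{-1}\mu_{+}
                      - \mu_{-}^\top \Sigma^{-1}\mu_{-}\right)
    + \log\frac{p(y=+1)}{p(y=-1)}
  = w^\top x + b .
```

Since the log-odds are linear in x, the decision boundary (where the log-odds equal zero) is a hyperplane.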
Linear Decision Boundary
- Gaussian generative models (with shared covariance) end up finding a linear decision boundary
- Why not directly estimate the decision boundary?