Title: Generative Models
Generative Models
Statistical Inference
- Assume female heights follow a Gaussian distribution N(μf, σf) and male heights follow a Gaussian distribution N(μm, σm)
- Given an individual of height 1.67 m, compare Pr(male | 1.67 m) with Pr(female | 1.67 m)
Probabilistic Models for Classification Problems
- Apply statistical inference methods:
- Given training examples
- Assume a parametric model
- Learn the model parameters θ from the training examples using the maximum likelihood approach
- The class of a new instance is predicted by choosing the class with the highest posterior probability p(y | x; θ)
Maximum Likelihood Estimation (MLE)
- Given training examples, compute the log-likelihood of the data
- Find the parameters θ that maximize the log-likelihood
- In many cases, the expression for the log-likelihood has no closed-form maximizer, and MLE therefore requires numerical optimization (see the sketch below)
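As an illustration, a minimal numerical MLE sketch for a one-dimensional Gaussian (the data and initial guess are made up; for a Gaussian a closed form exists, but the numerical route generalizes):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up 1-D data: heights drawn from a Gaussian the code pretends not to know.
rng = np.random.default_rng(0)
heights = rng.normal(loc=1.70, scale=0.07, size=100)

def neg_log_likelihood(params, x):
    """Negative Gaussian log-likelihood; minimizing it performs MLE."""
    mu, log_sigma = params              # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(heights,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
# For a Gaussian the closed-form MLE (sample mean / std) should agree:
print(mu_hat, sigma_hat, heights.mean(), heights.std())
```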
Generative Models
- Most probability distributions are joint distributions (i.e., p(x; θ)), not conditional distributions (i.e., p(y | x; θ))
- Using Bayes' rule:
- p(y | x; θ) ∝ p(x | y; θ) · p(y; θ)
Generative Models (contd)
- Treatment of p(x | y; θ):
- Let y ∈ Y = {1, 2, …, c}
- Allocate a separate set of parameters for each class: θ → {θ1, θ2, …, θc}
- p(x | y; θ) = p(x; θy)
- Data in different classes have different input patterns
Generative Models (contd)
- Parameter space:
- Parameters for the class-conditional distributions: θ1, θ2, …, θc
- Class priors: p(y = 1), p(y = 2), …, p(y = c)
- Learn the parameters from training examples using MLE:
- Compute the log-likelihood
- Search for the optimal parameters by maximizing the log-likelihood
Example
- Task: predict the gender of individuals based on their heights
- Given:
- 100 height examples of women
- 100 height examples of men
- Assume the heights of women and men follow different Gaussian distributions
Example (contd)
- Gaussian distribution for each class
- Parameter space:
- Gaussian distribution for males: (μm, σm)
- Gaussian distribution for females: (μf, σf)
- Class priors: pm = p(y = male), pf = p(y = female)
Example (contd)
- Learn a Gaussian generative model
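A minimal sketch of this fitting step on hypothetical height samples (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical height samples in meters, 100 per class (all numbers invented).
rng = np.random.default_rng(1)
male = rng.normal(1.75, 0.07, size=100)
female = rng.normal(1.62, 0.06, size=100)

# MLE for each class-conditional Gaussian: sample mean and (biased) std.
mu_m, sigma_m = male.mean(), male.std()
mu_f, sigma_f = female.mean(), female.std()

# Class priors from class frequencies (100/200 each here).
p_m = len(male) / (len(male) + len(female))
p_f = 1.0 - p_m
print(mu_m, sigma_m, mu_f, sigma_f, p_m)
```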
Example (contd)
- Predict the gender of an individual given his/her height
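A sketch of the prediction rule via Bayes' rule; the fitted values below are assumed, not estimates from real data:

```python
from scipy.stats import norm

# Assumed fitted parameters (illustrative values only).
mu_m, sigma_m, p_m = 1.75, 0.07, 0.5
mu_f, sigma_f, p_f = 1.62, 0.06, 0.5

def predict_gender(h):
    """Pick the class with the larger posterior: p(y|h) is proportional to p(h|y) p(y)."""
    score_m = norm.pdf(h, loc=mu_m, scale=sigma_m) * p_m
    score_f = norm.pdf(h, loc=mu_f, scale=sigma_f) * p_f
    return "male" if score_m > score_f else "female"

print(predict_gender(1.67))
```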
Decision Boundary
- Decision boundary h*:
- Predict female when h < h*
- Predict male when h > h*
- Predict at random when h = h*
- Where is the decision boundary?
- It depends on the ratio pm/pf (see the sketch below)
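One way to locate h* is to find where the two prior-weighted class densities cross; a sketch with the same illustrative parameters as above:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Same illustrative parameters as above.
mu_m, sigma_m, p_m = 1.75, 0.07, 0.5
mu_f, sigma_f, p_f = 1.62, 0.06, 0.5

# h* is where the prior-weighted class densities are equal.
f = lambda h: p_m * norm.pdf(h, mu_m, sigma_m) - p_f * norm.pdf(h, mu_f, sigma_f)
h_star = brentq(f, mu_f, mu_m)   # the sign of f flips between the two class means
print(h_star)
# Increasing p_m / p_f moves h* toward the female mean: more heights get called male.
```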
Gaussian Generative Model (II)
- Inputs contain multiple features
- Example:
- Task: predict whether an individual is overweight based on his/her salary and the number of hours spent watching TV
- Input: (s = salary, h = hours spent watching TV)
- Output: +1 (overweight), -1 (normal)
Multi-variate Gaussian Distribution
- Density: p(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(x - μ)ᵀ Σ⁻¹ (x - μ) / 2)
- MLE estimates: μ = (1/N) ∑i xi and Σ = (1/N) ∑i (xi - μ)(xi - μ)ᵀ
Properties of Covariance Matrix
- What if the number of data points N is smaller than the dimension d?
- What about vᵀΣv for any vector v? It is always ≥ 0
- Σ is a positive semi-definite matrix
- Number of distinct elements in Σ: d(d + 1)/2, since Σ is symmetric
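A quick numerical check of these properties (dimensions and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))        # N = 200 points in d = 5 dimensions
Sigma = np.cov(X, rowvar=False)      # 5 x 5 sample covariance

# Positive semi-definite: all eigenvalues >= 0 (up to round-off),
# equivalently v^T Sigma v >= 0 for every vector v.
print(np.linalg.eigvalsh(Sigma))

# Sigma is symmetric, so it has d(d + 1)/2 distinct entries.
d = Sigma.shape[0]
print(d * (d + 1) // 2)              # 15
```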
Gaussian Generative Model (II)
- Joint distribution p(s, h) over salary (s) and hours spent watching TV (h)
Multi-variate Gaussian Generative Model
- Input with multiple features
- A multi-variate Gaussian distribution for each class
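A minimal sketch of such a model for the (salary, TV hours) example; the class means, covariances, and test point are invented:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Invented 2-D inputs (salary in dollars, TV hours per day) for the two classes.
rng = np.random.default_rng(3)
X_pos = rng.multivariate_normal([40_000, 5.0], [[1e8, 50.0], [50.0, 4.0]], size=100)
X_neg = rng.multivariate_normal([60_000, 2.0], [[1e8, 10.0], [10.0, 1.0]], size=100)

# One multivariate Gaussian per class, fitted by MLE, plus class priors.
models, priors = {}, {}
for label, X in [(+1, X_pos), (-1, X_neg)]:
    models[label] = multivariate_normal(X.mean(axis=0), np.cov(X, rowvar=False))
    priors[label] = len(X) / (len(X_pos) + len(X_neg))

def predict(x):
    """Class with the largest p(x|y) p(y)."""
    return max(models, key=lambda y: models[y].pdf(x) * priors[y])

print(predict([45_000, 4.0]))
```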
Improve the Multivariate Gaussian Model
- How could we improve the model's predictions of overweight?
- Allow multiple modes for each class
- Introduce more attributes of individuals:
- Location
- Occupation
- Number of children
- House
- Age
Problems with Using the Multi-variate Gaussian Generative Model
- Σ is a d×d matrix containing d(d + 1)/2 independent variables:
- d = 100: the number of variables in Σ is 5,050
- d = 1000: the number of variables in Σ is 500,500
- → A large parameter space
- Σ can be singular:
- If N < d, Σ is rank-deficient
- If two features are linearly correlated → Σ⁻¹ does not exist
Problems with Using the Multi-variate Gaussian Generative Model
- Fixes (sketched below):
- Diagonalize Σ: the feature independence assumption (Naïve Bayes assumption)
- Smooth the covariance matrix
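Both fixes in code, as a sketch; lam is a hypothetical smoothing weight that would be tuned in practice:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 10))        # few samples: Sigma can be ill-conditioned
Sigma = np.cov(X, rowvar=False)

# Fix 1: diagonalize, keeping only per-feature variances (the Naive Bayes assumption).
Sigma_diag = np.diag(np.diag(Sigma))

# Fix 2: smooth by shrinking toward the identity so Sigma stays well-conditioned
# (lam is a hypothetical smoothing weight to tune).
lam = 0.1
Sigma_smooth = (1 - lam) * Sigma + lam * np.eye(Sigma.shape[0])
print(np.linalg.cond(Sigma), np.linalg.cond(Sigma_smooth))
```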
Overfitting Issue
- Complex model vs. insufficient training data
- Example:
- Consider a classification problem with multiple inputs:
- 100 input features
- 5 classes
- 1000 training examples
- Total number of parameters for a full Gaussian model:
- 5 class priors → 5 parameters
- 5 means → 500 parameters
- 5 covariance matrices → 5 × 5,050 = 25,250 parameters
- 25,755 parameters ≫ 1000 training examples → insufficient training data
Model Complexity vs. Data
- (figure: prediction error as model complexity and training-data size vary)
Naïve Bayes Model
- In general, for any generative model, we have to estimate p(x | y; θ)
- For x in a high-dimensional space, this probability is hard to estimate
- In the Naïve Bayes model, we approximate p(x | y; θ) ≈ ∏_j p(x_j | y; θ), i.e., the features are treated as conditionally independent given the class
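As an illustration of the factorized form, a minimal Gaussian Naïve Bayes sketch on synthetic data (all values invented; equal class priors assumed and dropped from the score):

```python
import numpy as np
from scipy.stats import norm

# Synthetic 3-feature data for two classes (all values invented).
rng = np.random.default_rng(5)
X_pos = rng.normal([2.0, 0.0, 1.0], 1.0, size=(100, 3))
X_neg = rng.normal([0.0, 1.0, 0.0], 1.0, size=(100, 3))

# Per class, fit an independent 1-D Gaussian to each feature.
params = {+1: (X_pos.mean(axis=0), X_pos.std(axis=0)),
          -1: (X_neg.mean(axis=0), X_neg.std(axis=0))}

def log_score(x, y):
    """log p(x|y) approximated by sum_j log p(x_j|y); equal priors omitted."""
    mu, sigma = params[y]
    return norm.logpdf(x, mu, sigma).sum()

x = np.array([1.5, 0.2, 0.8])
print(max(params, key=lambda y: log_score(x, y)))
```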
Text Categorization
- Learn to classify text into predefined categories
- Input x: a document
- Represented by a vector of word counts
- Example: (president, 10), (bush, 2), (election, 5), …
- Output y: whether the document is political or not
- +1 for a political document, -1 for a non-political document
Text Classification
- A generative model for text classification (TC)
- Parameter space:
- p(+) and p(-)
- p(doc | +; θ) and p(doc | -; θ)
- It is difficult to estimate p(doc | +; θ) and p(doc | -; θ):
- Typical vocabulary size: 100,000
- Each document is a vector of 100,000 attributes!
- Too many words in a document
- A Naïve Bayes approach
Text Classification
- A Naïve Bayes approach: for a document d, approximate p(d | y; θ) ≈ ∏_w p(w | y)^n(w, d), where n(w, d) is the number of times word w occurs in d
Text Classification
- The original parameter space:
- p(+) and p(-)
- p(doc | +; θ), p(doc | -; θ)
- Parameter space after the Naïve Bayes simplification:
- p(+) and p(-)
- p(w1 | +), p(w2 | +), …, p(wn | +)
- p(w1 | -), p(w2 | -), …, p(wn | -)
Text Classification
- Learning parameters from training examples:
- Each document is represented by its word counts and its class label
- Learn the parameters using maximum likelihood estimation
Text Classification
- The optimal solution that maximizes the likelihood of the training data: p(w | +) is the relative frequency of word w among all word occurrences in class-+ documents (and similarly for p(w | -)); the class priors are the class frequencies
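A sketch of this solution on a tiny invented corpus (words, counts, and labels are all made up); note how a word unseen in one class drives that class's likelihood to zero, which previews the problem discussed below:

```python
from collections import Counter
import math

# Tiny invented corpus: (word-count dict, label) pairs.
docs = [({"president": 10, "bush": 2, "election": 5}, +1),
        ({"game": 7, "score": 3}, -1),
        ({"election": 4, "vote": 6}, +1)]

# MLE: p(w|y) = count of w in class y / total word count in class y.
counts, labels = {+1: Counter(), -1: Counter()}, Counter()
for doc, y in docs:
    counts[y].update(doc)
    labels[y] += 1

def log_posterior(doc, y):
    total = sum(counts[y].values())
    lp = math.log(labels[y] / len(docs))          # log class prior
    for w, n in doc.items():
        if counts[y][w] == 0:
            return float("-inf")                  # unseen word: MLE gives p(w|y) = 0
        lp += n * math.log(counts[y][w] / total)
    return lp

print(max((+1, -1), key=lambda y: log_posterior({"election": 2, "vote": 1}, y)))
```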
Text Classification
- An example: the Twenty Newsgroups dataset
Text Classification
- Any problems with the Naïve Bayes text classifier?
- Unseen words:
- If word w is unseen in all training documents, what is the consequence?
- If word w is unseen only in the documents of one class, what is the consequence?
- Related to the overfitting problem
- Any suggestions?
- Solution: word-class approach
- Introduce word classes T = {t1, t2, …, tm}
- Compute p(ti | +), p(ti | -)
- When w is unseen, replace p(w | y) with p(ti | y), where ti is the word class of w
- Alternatively, introduce a prior for the word probabilities (see the smoothing sketch below)
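A minimal sketch of the prior-based fix, using add-α (Laplace) smoothing; counts, vocabulary, and α are all illustrative:

```python
from collections import Counter
import math

# Invented class word counts and vocabulary.
counts = {+1: Counter({"election": 14, "vote": 6}),
          -1: Counter({"game": 7, "score": 3})}
vocab = {"election", "vote", "game", "score", "president"}

def smoothed_log_prob(w, y, alpha=1.0):
    """Add-alpha (Laplace) smoothing: a uniform Dirichlet prior on word
    probabilities keeps unseen words from zeroing out a class."""
    total = sum(counts[y].values())
    return math.log((counts[y][w] + alpha) / (total + alpha * len(vocab)))

print(smoothed_log_prob("president", +1))   # finite even though "president" is unseen
```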
Naïve Bayes Model
- This independence assumption is a terrible approximation of p(x | y; θ)
Naïve Bayes Model
- Why use the Naïve Bayes model?
- We are essentially interested in p(y | x; θ), not p(x | y; θ)
Naïve Bayes Model
- The key quantity for the prediction model is not p(x | y; θ) itself, but the ratio p(x | y = +1; θ) / p(x | y = -1; θ)
- Although the Naïve Bayes model does a poor job of estimating p(x | y; θ), it does a reasonably good job of estimating the ratio
The Ratio of Likelihood for Binary Classes
- Assume that both classes share the same variance
- Under this assumption, the Gaussian generative model is a linear model (derivation below)
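The derivation behind this claim, sketched in LaTeX (μ+ and μ- denote the two class means, Σ the shared covariance):

```latex
\log\frac{p(y=+1\mid x)}{p(y=-1\mid x)}
  = \log\frac{p(x\mid y=+1)\,p(y=+1)}{p(x\mid y=-1)\,p(y=-1)}
  % the quadratic terms x^\top \Sigma^{-1} x cancel because \Sigma is shared
  = (\mu_{+}-\mu_{-})^\top \Sigma^{-1} x
    - \tfrac{1}{2}\left(\mu_{+}^\top \Sigma^{-1}\mu_{+}
                      - \mu_{-}^\top \Sigma^{-1}\mu_{-}\right)
    + \log\frac{p(y=+1)}{p(y=-1)}
  = w^\top x + b .
```

Since the log-odds are linear in x, the decision boundary (where the log-odds equal zero) is a hyperplane.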
Linear Decision Boundary
- Gaussian generative models (with shared covariance) end up finding a linear decision boundary
- Why not directly estimate the decision boundary?