Title: Parameter Estimation
1. Parameter Estimation
- Shyh-Kang Jeng
- Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University
2. Typical Classification Problem
- Rarely know the complete probabilistic structure of the problem
- Have vague, general knowledge
- Have a number of design samples or training data as representatives of the patterns to be classified
- Find some way to use this information to design or train the classifier
3. Estimating Probabilities
- Not difficult to estimate the prior probabilities
- Hard to estimate the class-conditional densities
- The number of available samples always seems too small
- Especially serious when the dimensionality is large
4. Estimating Parameters
- Many problems permit us to parameterize the conditional densities
- Simplifies the problem from one of estimating an unknown function to one of estimating a set of parameters
- e.g., the mean vector and covariance matrix of a multivariate normal distribution
5. Maximum-Likelihood Estimation
- Views the parameters as quantities whose values are fixed but unknown
- The best estimate is the one that maximizes the probability of obtaining the samples actually observed
- Nearly always has good convergence properties as the number of samples increases
- Often simpler than alternative methods (a sketch follows this list)
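For concreteness, a minimal sketch of ML estimation for the multivariate Gaussian case treated on slides 13-15: the ML estimates are the sample mean and the 1/n sample covariance. The function name and the toy data below are illustrative assumptions, not from the slides.

```python
import numpy as np

def ml_gaussian(samples):
    """ML estimates of a multivariate Gaussian's mean and covariance.

    samples: array of shape (n, d) holding n i.i.d. draws from p(x | theta).
    """
    n = samples.shape[0]
    mu_hat = samples.mean(axis=0)        # ML mean = sample mean
    diff = samples - mu_hat
    sigma_hat = diff.T @ diff / n        # ML covariance uses 1/n, not 1/(n - 1)
    return mu_hat, sigma_hat

# Toy usage with synthetic data: the estimates converge to the true
# parameters as the number of samples increases.
rng = np.random.default_rng(0)
x = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=500)
mu_hat, sigma_hat = ml_gaussian(x)
```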
6. I.I.D. Random Variables
- Separate the data into D1, . . ., Dc
- Samples in Dj are drawn independently according to p(x | ωj)
- Such samples are independent and identically distributed (i.i.d.) random variables
- Assume p(x | ωj) has a known parametric form and is determined uniquely by a parameter vector θj, i.e., p(x | ωj) = p(x | ωj, θj)
7. Simplification Assumptions
- Samples in Di give no information about θj if i ≠ j
- Can work with each class separately
- Have c separate problems of the same form (a per-class sketch follows this list)
- Use a set D of i.i.d. samples drawn from p(x | θ) to estimate the unknown parameter vector θ
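A brief sketch of this per-class separation, under the same Gaussian assumption as the previous sketch; the label array and function name are illustrative assumptions.

```python
import numpy as np

def ml_per_class(samples, labels):
    """Estimate (mu_j, Sigma_j) independently for each class omega_j.

    Samples in D_i with i != j give no information about theta_j,
    so the c-class problem splits into c single-class ML problems.
    """
    estimates = {}
    for j in np.unique(labels):
        D_j = samples[labels == j]                     # i.i.d. samples from p(x | omega_j)
        mu_j = D_j.mean(axis=0)
        diff = D_j - mu_j
        estimates[j] = (mu_j, diff.T @ diff / len(D_j))
    return estimates
```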
8. Maximum-Likelihood Estimate
9. Maximum-Likelihood Estimation
10. A Note
- The likelihood p(D | θ), viewed as a function of θ, is not a probability density function of θ
- Its area over the θ-domain has no significance
- The likelihood p(D | θ) can be regarded as the probability of D for a given θ
11. Analytical Approach
12. MAP Estimators
13. Gaussian Case: Unknown μ
14. Univariate Gaussian Case: Unknown μ and σ²
15. Multivariate Gaussian Case: Unknown μ and Σ
16. Bias, Absolutely Unbiased, and Asymptotically Unbiased
17. Model Error
- For a reliable model, the ML classifier can give excellent results
- If the model is wrong, the ML classifier cannot achieve the best results, even within the assumed set of models
18. Bayesian Estimation (Bayesian Learning)
- The answers obtained are in general nearly identical to those given by maximum likelihood
- Basic conceptual difference:
- The parameter vector θ is a random variable
- Use the training data to convert a distribution on this variable into a posterior probability density
19. Central Problem
20. Parameter Distribution
- Assume p(x) has a known parametric form with a parameter vector θ of unknown value
- Thus, p(x | θ) is completely known
- Information about θ prior to observing the samples is contained in a known prior density p(θ)
- The observations convert p(θ) into the posterior density p(θ | D)
- p(θ | D) should be sharply peaked about the true value of θ (a conjugate-update sketch follows this list)
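As a concrete illustration of this conversion for the univariate Gaussian case with known variance treated on the following slides: a Gaussian prior N(μ0, σ0²) on the unknown mean yields a Gaussian posterior. The function name and toy numbers below are assumptions of the sketch.

```python
import numpy as np

def posterior_mean_params(x, mu0, sigma0_sq, sigma_sq):
    """Posterior p(mu | D) for a univariate Gaussian with known variance
    sigma_sq and Gaussian prior N(mu0, sigma0_sq) on the unknown mean.
    Returns the posterior mean mu_n and variance sigma_n_sq."""
    n = len(x)
    xbar = np.mean(x)
    mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = (sigma0_sq * sigma_sq) / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

# As n grows, sigma_n_sq -> 0, so p(mu | D) peaks sharply about the true mean.
rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=200)
print(posterior_mean_params(data, mu0=0.0, sigma0_sq=10.0, sigma_sq=1.0))
```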
21. Parameter Distribution
22. Univariate Gaussian Case: p(μ | D)
23. Reproducing Density
24. Bayesian Learning
25. Dogmatism
26. Univariate Gaussian Case: p(x | D)
27. Multivariate Gaussian Case
28. Multivariate Gaussian Case
29. Multivariate Bayesian Learning
30. General Bayesian Estimation
31. Recursive Bayesian Learning
32. Example 1: Recursive Bayes Learning
33. Example 1: Recursive Bayes Learning
34. Example 1: Bayes vs. ML
35. Identifiability
- If p(x | θ) is identifiable:
- The sequence of posterior densities p(θ | D^n) converges to a delta function
- Only one θ causes p(x | θ) to fit the data
- On some occasions, more than one value of θ may yield the same p(x | θ)
- p(θ | D^n) will then peak near all values of θ that explain the data
- The ambiguity is erased in the integration for p(x | D^n), which converges to p(x) whether or not p(x | θ) is identifiable (a recursive-update sketch follows this list)
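The recursive relation behind slides 31-34, p(θ | D^n) ∝ p(x_n | θ) p(θ | D^(n-1)), can be sketched on a discretized grid. The uniform model p(x | θ) = U(0, θ), the prior range, and the sample values are illustrative assumptions, not taken from Example 1 itself.

```python
import numpy as np

# Grid-based recursive Bayes learning: fold in one sample at a time.
theta_grid = np.linspace(0.01, 10.0, 1000)
posterior = np.ones_like(theta_grid)
posterior /= posterior.sum()            # flat prior, discretized on the grid

def update(posterior, x):
    """One recursive step: multiply by the likelihood of the new sample."""
    likelihood = np.where(theta_grid >= x, 1.0 / theta_grid, 0.0)  # U(0, theta)
    posterior = posterior * likelihood
    return posterior / posterior.sum()  # renormalize on the grid

for x_n in [4.0, 6.0, 6.0, 7.0, 3.0]:   # samples arrive one at a time
    posterior = update(posterior, x_n)
# With more samples the posterior sharpens around the theta values that
# explain the data, as described under identifiability above.
```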
36. ML vs. Bayes Methods
- Computational complexity
- Interpretability
- Confidence in the prior information
- Form of the underlying distribution p(x | θ)
- The results differ when p(θ | D) is broad or asymmetric around the estimated θ
- Bayes methods would exploit such information, whereas ML would not
37. Classification Errors
- Bayes or indistinguishability error
- Model error
- Estimation error
- Parameters are estimated from a finite sample
- Vanishes in the limit of infinite training data (ML and Bayes would then have the same total classification error)
38. Invariance and Non-informative Priors
- Guidance in creating priors
- Invariance
- Translation invariance
- Scale invariance
- Non-informative with respect to an invariance
- Much better than accommodating an arbitrary transformation in a MAP estimator
- Of great use in Bayesian estimation
39. Gibbs Algorithm
40. Sufficient Statistics
- Statistic
- Any function of the samples
- Sufficient statistic s of the samples D
- s contains all information relevant to estimating some parameter θ
- Definition: p(D | s, θ) is independent of θ
- If θ can be regarded as a random variable, this amounts to requiring p(θ | s, D) = p(θ | s)
41. Factorization Theorem
- A statistic s is sufficient for θ if and only if P(D | θ) can be written as the product
- P(D | θ) = g(s, θ) h(D)
- for some functions g(·, ·) and h(·) (a Gaussian sketch follows this list)
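A small sketch for the multivariate Gaussian example on the next slide: the sample mean and scatter matrix carry all the information in D needed to estimate (μ, Σ), so the ML estimates can be written as a function of the statistic alone. Function names are illustrative assumptions.

```python
import numpy as np

def sufficient_stats_gaussian(samples):
    """s = (n, sample mean, scatter matrix) computed from D."""
    n = samples.shape[0]
    mean = samples.mean(axis=0)
    diff = samples - mean
    return n, mean, diff.T @ diff

def ml_from_stats(n, mean, scatter):
    """The ML estimates depend on D only through the sufficient statistic."""
    return mean, scatter / n
```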
42. Example: Multivariate Gaussian
43. Proof of the Factorization Theorem: The "Only If" Part
44. Proof of the Factorization Theorem: The "If" Part
45. Kernel Density
- The factoring of P(D | θ) into g(s, θ) h(D) is not unique
- If f(s) is any function, then g'(s, θ) = f(s) g(s, θ) and h'(D) = h(D) / f(s) are equivalent factors
- The ambiguity is removed by defining the kernel density, which is invariant to such scaling
46. Example: Multivariate Gaussian
47. Kernel Density and Parameter Estimation
- Maximum likelihood
- Maximization of g(s, θ)
- Bayesian
- If the prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ | D) is approximately the same as the kernel density
- If p(x | θ) is identifiable, g(s, θ) peaks sharply at some value, and p(θ) is continuous and non-zero there, then p(θ | D) approaches the kernel density
48. Sufficient Statistics for the Exponential Family
49. Error Rate and Dimensionality
50. Accuracy and Dimensionality
51. Effects of Additional Features
- In practice, beyond a certain point, the inclusion of additional features leads to worse rather than better performance
- Sources of difficulty:
- Wrong models
- The number of design or training samples is finite, so the distributions are not estimated accurately
52. Computational Complexity for Maximum-Likelihood Estimation
53. Computational Complexity for Classification
54. Approaches for Inadequate Samples
- Reduce the dimensionality
- Redesign the feature extractor
- Select an appropriate subset of the features
- Combine the existing features
- Pool the available data by assuming all classes share the same covariance matrix
- Look for a better estimate of Σ
- Use a Bayesian estimate with a diagonal Σ0
- Threshold the sample covariance matrix
- Assume statistical independence
55. Shrinkage (Regularized Discriminant Analysis)
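One common shrinkage form, given here as a hedged sketch rather than the exact estimator on this slide: blend the sample covariance with a scaled identity so the estimate stays invertible and well conditioned when samples are scarce. The parameter name beta and the blend weights are assumptions of the sketch.

```python
import numpy as np

def shrink_covariance(samples, beta=0.1):
    """Shrink the ML covariance toward (tr(Sigma)/d) * I with weight beta."""
    n, d = samples.shape
    diff = samples - samples.mean(axis=0)
    sigma = diff.T @ diff / n                      # ML (sample) covariance
    return (1.0 - beta) * sigma + beta * (np.trace(sigma) / d) * np.eye(d)
```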
56. Concept of Overfitting
57. Best Representative Point
58. Projection Along a Line
59. Best Projection to a Line Through the Sample Mean
60. Best Representative Direction
61. Principal Component Analysis (PCA)
62. Concept of the Fisher Linear Discriminant
63. Fisher Linear Discriminant Analysis
64. Fisher Linear Discriminant Analysis
65. Fisher Linear Discriminant Analysis
66. Fisher Linear Discriminant Analysis for the Multivariate Normal
67. Concept of Multidimensional Discriminant Analysis
68. Multiple Discriminant Analysis
69. Multiple Discriminant Analysis
70. Multiple Discriminant Analysis
71. Multiple Discriminant Analysis
72. Expectation-Maximization (EM)
- Finds the maximum-likelihood estimate of the parameters of an underlying distribution
- from a given data set when the data are incomplete or have missing values
- Two main applications:
- When the data indeed have missing values
- When optimizing the likelihood function is analytically intractable, but the likelihood can be simplified by assuming the existence of (and values for) additional but missing (or hidden) parameters (an illustrative sketch appears after the algorithm below)
73. Expectation-Maximization (EM)
- Full sample D = {x1, . . ., xn}
- xk = {xkg, xkb}, with the good (observed) features in xkg and the bad (missing) features in xkb
- Separate the individual features into Dg and Db
- D is the union of Dg and Db
- Form the function Q(θ; θ^i) = E_Db[ ln p(Dg, Db; θ) | Dg; θ^i ]
74. Expectation-Maximization (EM)
- begin initialize θ^0, T, i ← 0
-   do i ← i + 1
-     E step: compute Q(θ; θ^i)
-     M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
-   until Q(θ^(i+1); θ^i) − Q(θ^i; θ^(i−1)) ≤ T
-   return θ̂ ← θ^(i+1)
- end
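As an illustration of the hidden-variable use of EM named on slide 72, here is a sketch for a two-component 1D Gaussian mixture, where the component labels play the role of the missing data. It stands in for, and does not reproduce, the 2D example on the following slides; the initialization and toy data are assumptions of the sketch.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1D Gaussian mixture (illustrative sketch)."""
    # crude initialization (an assumption of this sketch)
    pi, mu, var = 0.5, np.array([x.min(), x.max()]), np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E step: responsibility r_k = P(component 1 | x_k, current parameters)
        p1 = pi * np.exp(-(x - mu[0]) ** 2 / (2 * var[0])) / np.sqrt(var[0])
        p2 = (1 - pi) * np.exp(-(x - mu[1]) ** 2 / (2 * var[1])) / np.sqrt(var[1])
        r = p1 / (p1 + p2)
        # M step: parameter values that maximize the expected log-likelihood
        pi = r.mean()
        mu = np.array([np.sum(r * x) / r.sum(),
                       np.sum((1 - r) * x) / (1 - r).sum()])
        var = np.array([np.sum(r * (x - mu[0]) ** 2) / r.sum(),
                        np.sum((1 - r) * (x - mu[1]) ** 2) / (1 - r).sum()])
    return pi, mu, var

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(data))
```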
75. Expectation-Maximization (EM)
76. Example: 2D Model
77. Example: 2D Model
78. Example: 2D Model
79. Example: 2D Model
80. Generalized Expectation-Maximization (GEM)
- Instead of maximizing Q(θ; θ^i), we find some θ^(i+1) such that
- Q(θ^(i+1); θ^i) > Q(θ^i; θ^i)
- GEM is also guaranteed to converge
- Convergence will not be as rapid
- Offers great freedom to choose computationally simpler steps
- e.g., using the maximum-likelihood values of the unknown features, if they lead to a greater likelihood
81. Hidden Markov Model (HMM)
- Used for problems that involve making a series of decisions
- e.g., speech or gesture recognition
- Problems in which the state at time t is influenced directly by the state at t − 1
- Further reference:
- L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.
82. First-Order Markov Models
83. First-Order Hidden Markov Models
84. Hidden Markov Model Probabilities
85. Hidden Markov Model Computation
- Evaluation problem
- Given aij and bjk, determine P(V^T | θ) (a forward-pass sketch follows this list)
- Decoding problem
- Given V^T, determine the most likely sequence of hidden states that leads to V^T
- Learning problem
- Given training observations of visible symbols and the coarse structure of the model, but not the probabilities, determine aij and bjk
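A minimal sketch of the forward recursion for the evaluation problem, α_j(t) = b_jk(v_t) Σ_i α_i(t − 1) a_ij with P(V^T | θ) = Σ_j α_j(T). The transition matrix A, emission matrix B, initial distribution pi, and the observation sequence are illustrative assumptions, not values from the slides.

```python
import numpy as np

def hmm_forward(A, B, pi, obs):
    """A: (c, c) transition probs a_ij; B: (c, m) emission probs b_jk;
    pi: (c,) initial state probs; obs: indices of visible symbols.
    Returns P(V^T | theta)."""
    alpha = pi * B[:, obs[0]]                 # alpha_j(1)
    for v_t in obs[1:]:
        alpha = (alpha @ A) * B[:, v_t]       # alpha_j(t) recursion
    return alpha.sum()

A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
pi = np.array([0.6, 0.4])
print(hmm_forward(A, B, pi, obs=[0, 1, 1, 0]))
```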
86. Evaluation
87. HMM Forward
88. HMM Forward and Trellis
89. HMM Forward
90. HMM Backward
91. HMM Backward
92. Example 3: Hidden Markov Model
93. Example 3: Hidden Markov Model
94. Example 3: Hidden Markov Model
95. Left-to-Right Models for Speech
96. HMM Decoding
97. Problem of Local Optimization
- This decoding algorithm depends only on the single previous time step, not on the full sequence
- It does not guarantee that the resulting path is indeed allowable (a sketch follows this list)
98. HMM Decoding
99. Example 4: HMM Decoding
100. Forward-Backward Algorithm
- Determines the model parameters aij and bjk from an ensemble of training samples
- An instance of a generalized expectation-maximization algorithm
- There is no known method for finding the optimal or most likely set of parameters from the data
101. Probability of Transition
102. Improved Estimate for aij
103. Improved Estimate for bjk
104. Forward-Backward Algorithm (Baum-Welch Algorithm)
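To close, a sketch of a single Baum-Welch re-estimation step for aij in the spirit of slides 101-104: forward and backward passes give the probability of a transition from i to j at step t, and the improved aij is a ratio of expected transition counts. The single-sequence setup and variable names are assumptions of this sketch.

```python
import numpy as np

def baum_welch_step_A(A, B, pi, obs):
    """One re-estimation of the transition matrix from a single sequence."""
    c, T = A.shape[0], len(obs)
    # forward pass: alpha[t, j]
    alpha = np.zeros((T, c))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # backward pass: beta[t, i]
    beta = np.zeros((T, c))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                              # P(V^T | theta)
    # expected transition counts: sum over t of gamma_ij(t)
    counts = np.zeros((c, c))
    for t in range(T - 1):
        counts += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / p_obs
    # improved a_ij: normalize each row of the expected counts
    return counts / counts.sum(axis=1, keepdims=True)
```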