Title: Parameter Estimation
1. Parameter Estimation
- Shyh-Kang Jeng
- Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University
2. Typical Classification Problem
- Rarely know the complete probabilistic structure of the problem
- Have vague, general knowledge
- Have a number of design samples or training data as representatives of the patterns to be classified
- Find some way to use this information to design or train the classifier
3. Estimating Probabilities
- Not difficult to estimate the prior probabilities
- Hard to estimate the class-conditional densities
- The number of available samples always seems too small
- Especially serious when the dimensionality is large
4. Estimating Parameters
- Many problems permit us to parameterize the conditional densities
- Simplifies the problem from one of estimating an unknown function to one of estimating a set of parameters
- e.g., the mean vector and covariance matrix of a multivariate normal distribution
5. Maximum-Likelihood Estimation
- Views the parameters as quantities whose values are fixed but unknown
- The best estimate is the one that maximizes the probability of obtaining the samples actually observed
- Nearly always has good convergence properties as the number of samples increases
- Often simpler than alternative methods (a sketch follows this list)
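For concreteness, a minimal sketch of ML estimation for the multivariate Gaussian case treated on slides 13-15: the ML estimates are the sample mean and the 1/n sample covariance. The function name and the toy data below are illustrative assumptions, not from the slides.

```python
import numpy as np

def ml_gaussian(samples):
    """ML estimates of a multivariate Gaussian's mean and covariance.

    samples: array of shape (n, d) holding n i.i.d. draws from p(x | theta).
    """
    n = samples.shape[0]
    mu_hat = samples.mean(axis=0)        # ML mean = sample mean
    diff = samples - mu_hat
    sigma_hat = diff.T @ diff / n        # ML covariance uses 1/n, not 1/(n - 1)
    return mu_hat, sigma_hat

# Toy usage with synthetic data: the estimates converge to the true
# parameters as the number of samples increases.
rng = np.random.default_rng(0)
x = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=500)
mu_hat, sigma_hat = ml_gaussian(x)
```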
6. I.I.D. Random Variables
- Separate the data into D1, . . ., Dc
- Samples in Dj are drawn independently according to p(x | ωj)
- Such samples are independent and identically distributed (i.i.d.) random variables
- Assume p(x | ωj) has a known parametric form and is determined uniquely by a parameter vector θj, i.e., p(x | ωj) = p(x | ωj, θj)
7. Simplification Assumptions
- Samples in Di give no information about θj if i ≠ j
- Can work with each class separately
- Have c separate problems of the same form (a per-class sketch follows this list)
- Use a set D of i.i.d. samples drawn from p(x | θ) to estimate the unknown parameter vector θ
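A brief sketch of this per-class separation, under the same Gaussian assumption as the previous sketch; the label array and function name are illustrative assumptions.

```python
import numpy as np

def ml_per_class(samples, labels):
    """Estimate (mu_j, Sigma_j) independently for each class omega_j.

    Samples in D_i with i != j give no information about theta_j,
    so the c-class problem splits into c single-class ML problems.
    """
    estimates = {}
    for j in np.unique(labels):
        D_j = samples[labels == j]                     # i.i.d. samples from p(x | omega_j)
        mu_j = D_j.mean(axis=0)
        diff = D_j - mu_j
        estimates[j] = (mu_j, diff.T @ diff / len(D_j))
    return estimates
```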
8. Maximum-Likelihood Estimate
9. Maximum-Likelihood Estimation
10. A Note
- The likelihood p(D | θ), viewed as a function of θ, is not a probability density function of θ
- Its area over the θ-domain has no significance
- The likelihood p(D | θ) can be regarded as the probability of D for a given θ
11. Analytical Approach
12. MAP Estimators
13. Gaussian Case: Unknown μ
14. Univariate Gaussian Case: Unknown μ and σ²
15. Multivariate Gaussian Case: Unknown μ and Σ
16. Bias, Absolutely Unbiased, and Asymptotically Unbiased
17. Model Error
- For a reliable model, the ML classifier can give excellent results
- If the model is wrong, the ML classifier cannot achieve the best results, even within the assumed set of models
18. Bayesian Estimation (Bayesian Learning)
- The answers obtained are in general nearly identical to those given by maximum likelihood
- Basic conceptual difference:
- The parameter vector θ is a random variable
- Use the training data to convert a distribution on this variable into a posterior probability density
19. Central Problem
20. Parameter Distribution
- Assume p(x) has a known parametric form with a parameter vector θ of unknown value
- Thus, p(x | θ) is completely known
- Information about θ prior to observing the samples is contained in a known prior density p(θ)
- The observations convert p(θ) into the posterior density p(θ | D)
- p(θ | D) should be sharply peaked about the true value of θ (a conjugate-update sketch follows this list)
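As a concrete illustration of this conversion for the univariate Gaussian case with known variance treated on the following slides: a Gaussian prior N(μ0, σ0²) on the unknown mean yields a Gaussian posterior. The function name and toy numbers below are assumptions of the sketch.

```python
import numpy as np

def posterior_mean_params(x, mu0, sigma0_sq, sigma_sq):
    """Posterior p(mu | D) for a univariate Gaussian with known variance
    sigma_sq and Gaussian prior N(mu0, sigma0_sq) on the unknown mean.
    Returns the posterior mean mu_n and variance sigma_n_sq."""
    n = len(x)
    xbar = np.mean(x)
    mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = (sigma0_sq * sigma_sq) / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

# As n grows, sigma_n_sq -> 0, so p(mu | D) peaks sharply about the true mean.
rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=200)
print(posterior_mean_params(data, mu0=0.0, sigma0_sq=10.0, sigma_sq=1.0))
```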
21. Parameter Distribution
22. Univariate Gaussian Case: p(μ | D)
23. Reproducing Density
24. Bayesian Learning
25. Dogmatism
26. Univariate Gaussian Case: p(x | D)
27. Multivariate Gaussian Case
28. Multivariate Gaussian Case
29. Multivariate Bayesian Learning
30. General Bayesian Estimation
31. Recursive Bayesian Learning
32. Example 1: Recursive Bayes Learning
33. Example 1: Recursive Bayes Learning
34. Example 1: Bayes vs. ML
35. Identifiability
- If p(x | θ) is identifiable:
- The sequence of posterior densities p(θ | D^n) converges to a delta function
- Only one θ causes p(x | θ) to fit the data
- On some occasions, more than one value of θ may yield the same p(x | θ)
- p(θ | D^n) will then peak near all values of θ that explain the data
- The ambiguity is erased in the integration for p(x | D^n), which converges to p(x) whether or not p(x | θ) is identifiable (a recursive-update sketch follows this list)
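The recursive relation behind slides 31-34, p(θ | D^n) ∝ p(x_n | θ) p(θ | D^(n-1)), can be sketched on a discretized grid. The uniform model p(x | θ) = U(0, θ), the prior range, and the sample values are illustrative assumptions, not taken from Example 1 itself.

```python
import numpy as np

# Grid-based recursive Bayes learning: fold in one sample at a time.
theta_grid = np.linspace(0.01, 10.0, 1000)
posterior = np.ones_like(theta_grid)
posterior /= posterior.sum()            # flat prior, discretized on the grid

def update(posterior, x):
    """One recursive step: multiply by the likelihood of the new sample."""
    likelihood = np.where(theta_grid >= x, 1.0 / theta_grid, 0.0)  # U(0, theta)
    posterior = posterior * likelihood
    return posterior / posterior.sum()  # renormalize on the grid

for x_n in [4.0, 6.0, 6.0, 7.0, 3.0]:   # samples arrive one at a time
    posterior = update(posterior, x_n)
# With more samples the posterior sharpens around the theta values that
# explain the data, as described under identifiability above.
```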
36. ML vs. Bayes Methods
- Computational complexity
- Interpretability
- Confidence in the prior information
- Form of the underlying distribution p(x | θ)
- The results differ when p(θ | D) is broad or asymmetric around the estimated θ
- Bayes methods would exploit such information, whereas ML would not
37. Classification Errors
- Bayes or indistinguishability error
- Model error
- Estimation error
- Parameters are estimated from a finite sample
- Vanishes in the limit of infinite training data (ML and Bayes would then have the same total classification error)
38. Invariance and Non-informative Priors
- Guidance in creating priors
- Invariance
- Translation invariance
- Scale invariance
- Non-informative with respect to an invariance
- Much better than accommodating an arbitrary transformation in a MAP estimator
- Of great use in Bayesian estimation
39. Gibbs Algorithm
40. Sufficient Statistics
- Statistic
- Any function of the samples
- Sufficient statistic s of the samples D
- s contains all information relevant to estimating some parameter θ
- Definition: p(D | s, θ) is independent of θ
- If θ can be regarded as a random variable, this amounts to requiring p(θ | s, D) = p(θ | s)
41. Factorization Theorem
- A statistic s is sufficient for θ if and only if P(D | θ) can be written as the product
- P(D | θ) = g(s, θ) h(D)
- for some functions g(·, ·) and h(·) (a Gaussian sketch follows this list)
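A small sketch for the multivariate Gaussian example on the next slide: the sample mean and scatter matrix carry all the information in D needed to estimate (μ, Σ), so the ML estimates can be written as a function of the statistic alone. Function names are illustrative assumptions.

```python
import numpy as np

def sufficient_stats_gaussian(samples):
    """s = (n, sample mean, scatter matrix) computed from D."""
    n = samples.shape[0]
    mean = samples.mean(axis=0)
    diff = samples - mean
    return n, mean, diff.T @ diff

def ml_from_stats(n, mean, scatter):
    """The ML estimates depend on D only through the sufficient statistic."""
    return mean, scatter / n
```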
42. Example: Multivariate Gaussian
43. Proof of the Factorization Theorem: The "Only If" Part
44. Proof of the Factorization Theorem: The "If" Part
45. Kernel Density
- The factoring of P(D | θ) into g(s, θ) h(D) is not unique
- If f(s) is any function, then g'(s, θ) = f(s) g(s, θ) and h'(D) = h(D) / f(s) are equivalent factors
- The ambiguity is removed by defining the kernel density, which is invariant to such scaling
46. Example: Multivariate Gaussian
47. Kernel Density and Parameter Estimation
- Maximum likelihood
- Maximization of g(s, θ)
- Bayesian
- If the prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ | D) is approximately the same as the kernel density
- If p(x | θ) is identifiable, g(s, θ) peaks sharply at some value, and p(θ) is continuous and non-zero there, then p(θ | D) approaches the kernel density
48. Sufficient Statistics for the Exponential Family
49. Error Rate and Dimensionality
50. Accuracy and Dimensionality
51. Effects of Additional Features
- In practice, beyond a certain point, the inclusion of additional features leads to worse rather than better performance
- Sources of difficulty:
- Wrong models
- The number of design or training samples is finite, so the distributions are not estimated accurately
52. Computational Complexity for Maximum-Likelihood Estimation
53. Computational Complexity for Classification
54. Approaches for Inadequate Samples
- Reduce the dimensionality
- Redesign the feature extractor
- Select an appropriate subset of the features
- Combine the existing features
- Pool the available data by assuming all classes share the same covariance matrix
- Look for a better estimate of Σ
- Use a Bayesian estimate with a diagonal Σ0
- Threshold the sample covariance matrix
- Assume statistical independence
55. Shrinkage (Regularized Discriminant Analysis)
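One common shrinkage form, given here as a hedged sketch rather than the exact estimator on this slide: blend the sample covariance with a scaled identity so the estimate stays invertible and well conditioned when samples are scarce. The parameter name beta and the blend weights are assumptions of the sketch.

```python
import numpy as np

def shrink_covariance(samples, beta=0.1):
    """Shrink the ML covariance toward (tr(Sigma)/d) * I with weight beta."""
    n, d = samples.shape
    diff = samples - samples.mean(axis=0)
    sigma = diff.T @ diff / n                      # ML (sample) covariance
    return (1.0 - beta) * sigma + beta * (np.trace(sigma) / d) * np.eye(d)
```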
56. Concept of Overfitting
57. Best Representative Point
58. Projection Along a Line
59. Best Projection to a Line Through the Sample Mean
60. Best Representative Direction
61. Principal Component Analysis (PCA)
62. Concept of the Fisher Linear Discriminant
63. Fisher Linear Discriminant Analysis
64. Fisher Linear Discriminant Analysis
65. Fisher Linear Discriminant Analysis
66. Fisher Linear Discriminant Analysis for the Multivariate Normal
67. Concept of Multidimensional Discriminant Analysis
68. Multiple Discriminant Analysis
69. Multiple Discriminant Analysis
70. Multiple Discriminant Analysis
71. Multiple Discriminant Analysis
72. Expectation-Maximization (EM)
- Finds the maximum-likelihood estimate of the parameters of an underlying distribution
- from a given data set when the data are incomplete or have missing values
- Two main applications:
- When the data indeed have missing values
- When optimizing the likelihood function is analytically intractable, but the likelihood can be simplified by assuming the existence of (and values for) additional but missing (or hidden) parameters (an illustrative sketch appears after the algorithm below)
73. Expectation-Maximization (EM)
- Full sample D = {x1, . . ., xn}
- xk = {xkg, xkb}, with the good (observed) features in xkg and the bad (missing) features in xkb
- Separate the individual features into Dg and Db
- D is the union of Dg and Db
- Form the function Q(θ; θ^i) = E_Db[ ln p(Dg, Db; θ) | Dg; θ^i ]
74. Expectation-Maximization (EM)
- begin initialize θ^0, T, i ← 0
-   do i ← i + 1
-     E step: compute Q(θ; θ^i)
-     M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
-   until Q(θ^(i+1); θ^i) − Q(θ^i; θ^(i−1)) ≤ T
-   return θ̂ ← θ^(i+1)
- end
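As an illustration of the hidden-variable use of EM named on slide 72, here is a sketch for a two-component 1D Gaussian mixture, where the component labels play the role of the missing data. It stands in for, and does not reproduce, the 2D example on the following slides; the initialization and toy data are assumptions of the sketch.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1D Gaussian mixture (illustrative sketch)."""
    # crude initialization (an assumption of this sketch)
    pi, mu, var = 0.5, np.array([x.min(), x.max()]), np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E step: responsibility r_k = P(component 1 | x_k, current parameters)
        p1 = pi * np.exp(-(x - mu[0]) ** 2 / (2 * var[0])) / np.sqrt(var[0])
        p2 = (1 - pi) * np.exp(-(x - mu[1]) ** 2 / (2 * var[1])) / np.sqrt(var[1])
        r = p1 / (p1 + p2)
        # M step: parameter values that maximize the expected log-likelihood
        pi = r.mean()
        mu = np.array([np.sum(r * x) / r.sum(),
                       np.sum((1 - r) * x) / (1 - r).sum()])
        var = np.array([np.sum(r * (x - mu[0]) ** 2) / r.sum(),
                        np.sum((1 - r) * (x - mu[1]) ** 2) / (1 - r).sum()])
    return pi, mu, var

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(data))
```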
75. Expectation-Maximization (EM)
76. Example: 2D Model
77. Example: 2D Model
78. Example: 2D Model
79. Example: 2D Model
80. Generalized Expectation-Maximization (GEM)
- Instead of maximizing Q(θ; θ^i), we find some θ^(i+1) such that
- Q(θ^(i+1); θ^i) > Q(θ^i; θ^i)
- GEM is also guaranteed to converge
- Convergence will not be as rapid
- Offers great freedom to choose computationally simpler steps
- e.g., using the maximum-likelihood values of the unknown features, if they lead to a greater likelihood
81. Hidden Markov Model (HMM)
- Used for problems that involve making a series of decisions
- e.g., speech or gesture recognition
- Problems in which the state at time t is influenced directly by the state at t − 1
- Further reference:
- L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.
82. First-Order Markov Models
83. First-Order Hidden Markov Models
84. Hidden Markov Model Probabilities
85. Hidden Markov Model Computation
- Evaluation problem
- Given aij and bjk, determine P(V^T | θ) (a forward-pass sketch follows this list)
- Decoding problem
- Given V^T, determine the most likely sequence of hidden states that leads to V^T
- Learning problem
- Given training observations of visible symbols and the coarse structure of the model, but not the probabilities, determine aij and bjk
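A minimal sketch of the forward recursion for the evaluation problem, α_j(t) = b_jk(v_t) Σ_i α_i(t − 1) a_ij with P(V^T | θ) = Σ_j α_j(T). The transition matrix A, emission matrix B, initial distribution pi, and the observation sequence are illustrative assumptions, not values from the slides.

```python
import numpy as np

def hmm_forward(A, B, pi, obs):
    """A: (c, c) transition probs a_ij; B: (c, m) emission probs b_jk;
    pi: (c,) initial state probs; obs: indices of visible symbols.
    Returns P(V^T | theta)."""
    alpha = pi * B[:, obs[0]]                 # alpha_j(1)
    for v_t in obs[1:]:
        alpha = (alpha @ A) * B[:, v_t]       # alpha_j(t) recursion
    return alpha.sum()

A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
pi = np.array([0.6, 0.4])
print(hmm_forward(A, B, pi, obs=[0, 1, 1, 0]))
```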
86. Evaluation
87. HMM Forward
88. HMM Forward and Trellis
89. HMM Forward
90. HMM Backward
91. HMM Backward
92. Example 3: Hidden Markov Model
93. Example 3: Hidden Markov Model
94. Example 3: Hidden Markov Model
95. Left-to-Right Models for Speech
96. HMM Decoding
97. Problem of Local Optimization
- This decoding algorithm depends only on the single previous time step, not on the full sequence
- It does not guarantee that the resulting path is indeed allowable (a sketch follows this list)
98. HMM Decoding
99. Example 4: HMM Decoding
100. Forward-Backward Algorithm
- Determines the model parameters aij and bjk from an ensemble of training samples
- An instance of a generalized expectation-maximization algorithm
- There is no known method for finding the optimal or most likely set of parameters from the data
101. Probability of Transition
102. Improved Estimate for aij
103. Improved Estimate for bjk
104. Forward-Backward Algorithm (Baum-Welch Algorithm)
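To close, a sketch of a single Baum-Welch re-estimation step for aij in the spirit of slides 101-104: forward and backward passes give the probability of a transition from i to j at step t, and the improved aij is a ratio of expected transition counts. The single-sequence setup and variable names are assumptions of this sketch.

```python
import numpy as np

def baum_welch_step_A(A, B, pi, obs):
    """One re-estimation of the transition matrix from a single sequence."""
    c, T = A.shape[0], len(obs)
    # forward pass: alpha[t, j]
    alpha = np.zeros((T, c))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # backward pass: beta[t, i]
    beta = np.zeros((T, c))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                              # P(V^T | theta)
    # expected transition counts: sum over t of gamma_ij(t)
    counts = np.zeros((c, c))
    for t in range(T - 1):
        counts += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / p_obs
    # improved a_ij: normalize each row of the expected counts
    return counts / counts.sum(axis=1, keepdims=True)
```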