Title: Prototype Classification Methods
1. Prototype Classification Methods
- Fu Chang
- Institute of Information Science
- Academia Sinica
- 2788-3799 ext. 1819
- fchang_at_iis.sinica.edu.tw
2. Types of Prototype Methods
- Crisp model (K-means, KM)
- Prototypes are centers of non-overlapping clusters
- Fuzzy model (Fuzzy c-means, FCM)
- Prototypes are weighted averages of all samples
- Gaussian Mixture model (GM)
- Prototypes are components of a mixture of distributions
- Linear Discriminant Analysis (LDA)
- Prototypes are projected sample means
- K-nearest neighbor classifier (K-NN)
- Learning vector quantization (LVQ)
3. Prototypes through Clustering
- Given the number k of prototypes, find k clusters whose centers are the prototypes
- Commonality
- Use an iterative algorithm aimed at decreasing an objective function
- May converge to local minima
- The number k, as well as an initial solution, must be specified
4. Clustering Objectives
- The aim of the iterative algorithm is to decrease the value of an objective function
- Notation
- Samples x1, x2, ..., xn
- Prototypes p1, p2, ..., pk
- L2-distance between a sample and a prototype
5. Objectives (cntd)
- Crisp objective
- Fuzzy objective
- Gaussian mixture objective
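The slides' formulas are not preserved; standard forms of the three objectives, in the notation of the later slides (the first two are minimized, the third, a log-likelihood, is maximized), are:

```latex
J_{\mathrm{KM}} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - p_i \rVert^2 ,\qquad
J_{\mathrm{FCM}} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{\,m}\, \lVert x_j - p_i \rVert^2 ,\qquad
J_{\mathrm{GM}} = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i\, p_i(x_j \mid \theta_i) .
```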
6. K-Means Clustering
7. The Algorithm
- Initialize k seed prototypes p1, p2, ..., pk
- Grouping
- Assign samples to their nearest prototypes
- Form non-overlapping clusters out of these samples
- Centering
- Centers of the clusters become the new prototypes
- Repeat the grouping and centering steps until convergence
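A minimal NumPy sketch of this grouping/centering loop (the function name k_means and the initialization by sampling training points are illustrative assumptions, not from the slides):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate the grouping and centering steps until convergence."""
    rng = np.random.default_rng(seed)
    # Initialize k seed prototypes by picking distinct training samples
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Grouping: assign each sample to its nearest prototype (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centering: the centers of the clusters become the new prototypes
        new_prototypes = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else prototypes[i]
            for i in range(k)
        ])
        if np.allclose(new_prototypes, prototypes):
            break  # prototypes no longer move, so the objective cannot decrease further
        prototypes = new_prototypes
    return prototypes, labels
```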
8. Justification
- Grouping
- Assigning samples to their nearest prototypes helps to decrease the objective
- Centering
- Also helps to decrease the objective, because of the inequality below, in which equality holds only when the reference point is the cluster mean
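The slide's inequality is not preserved; a standard statement of the fact the argument relies on, with ȳ the mean of the yi and z an arbitrary point, is:

```latex
\sum_{i=1}^{m} \lVert y_i - \bar{y} \rVert^2 \;\le\; \sum_{i=1}^{m} \lVert y_i - z \rVert^2
\quad\text{for any } z, \qquad \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i ,
```

with equality only if z = ȳ.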
9. Exercise
- Prove that for any group of vectors yi, the inequality stated above is always true
- Prove that equality holds only when z equals the mean of the yi
- Use this fact to prove that the centering step helps to decrease the objective function
10. Fuzzy c-Means Clustering
11. Crisp vs. Fuzzy Membership
- Membership matrix U of size c x n
- uij is the grade of membership of sample j with respect to prototype i
- Crisp membership
- Fuzzy membership
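The membership constraints, in their standard form (the slide's own equations are not preserved), are:

```latex
\text{Crisp: } u_{ij} \in \{0, 1\}, \qquad
\text{Fuzzy: } u_{ij} \in [0, 1], \qquad
\sum_{i=1}^{c} u_{ij} = 1 \;\; \text{for each sample } j .
```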
12. Fuzzy c-means (FCM)
- The objective function of FCM is given below
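The slide's formula is not preserved; the standard FCM objective, with fuzzifier m > 1, is:

```latex
J_{\mathrm{FCM}}(U, p_1, \ldots, p_c) \;=\; \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{\,m}\, \lVert x_j - p_i \rVert^2 , \qquad m > 1 .
```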
13. FCM (Cntd)
- Introducing a Lagrange multiplier λ for the constraint that the memberships of each sample sum to one, we rewrite the objective function as shown below
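A common form of the constrained objective (with one multiplier λj per sample, an assumption since the slide's equation is not preserved) is:

```latex
J \;=\; \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{\,m}\, \lVert x_j - p_i \rVert^2
\;+\; \sum_{j=1}^{n} \lambda_j \Bigl( \sum_{i=1}^{c} u_{ij} - 1 \Bigr) .
```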
14. FCM (Cntd)
- Setting the partial derivatives to zero, we obtain
15. FCM (Cntd)
- From the 2nd equation, we obtain
- From this fact and the 1st equation, we obtain
16. FCM (Cntd)
17. FCM (Cntd)
- Together with the 2nd equation, we obtain the updating rule for uij
18. FCM (Cntd)
- On the other hand, setting the derivative of J with respect to pi to zero, we obtain
19. FCM (Cntd)
- It follows that
- Finally, we obtain the update rule for pi, shown below together with the updating rule for uij
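The resulting update rules, in their standard form (the slides' own equations are not preserved), are:

```latex
u_{ij} \;=\; \Biggl[ \sum_{l=1}^{c} \biggl( \frac{\lVert x_j - p_i \rVert}{\lVert x_j - p_l \rVert} \biggr)^{2/(m-1)} \Biggr]^{-1},
\qquad
p_i \;=\; \frac{\sum_{j=1}^{n} u_{ij}^{\,m}\, x_j}{\sum_{j=1}^{n} u_{ij}^{\,m}} .
```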
20. FCM (Cntd)
21. K-means vs. Fuzzy c-means
[Figure: sample points]
22. K-means vs. Fuzzy c-means
[Figure: clustering results of K-means vs. fuzzy c-means]
23. Expectation-Maximization (EM) Algorithm
24. What Is Given
- Observed data X = {x1, x2, ..., xn}, each sample drawn independently from a mixture of probability distributions with the density given below
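The density and the constraints referred to by "where", written for M mixture components with parameters Θ = (α1, ..., αM, θ1, ..., θM) (the notation of the Bilmes tutorial cited in the references), are:

```latex
p(x \mid \Theta) \;=\; \sum_{i=1}^{M} \alpha_i\, p_i(x \mid \theta_i),
\qquad \alpha_i \ge 0, \quad \sum_{i=1}^{M} \alpha_i = 1 .
```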
25. Incomplete vs. Complete Data
- The incomplete-data log-likelihood (first expression below) is difficult to optimize
- The complete-data log-likelihood (second expression below) can be handled much more easily, where H is the set of hidden random variables
- How do we compute the distribution of H?
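In standard form (the slide's equations are not preserved), with hj denoting the hidden component label of sample xj:

```latex
\log L(\Theta \mid X) = \sum_{j=1}^{n} \log \sum_{i=1}^{M} \alpha_i\, p_i(x_j \mid \theta_i),
\qquad
\log L(\Theta \mid X, H) = \sum_{j=1}^{n} \log \bigl( \alpha_{h_j}\, p_{h_j}(x_j \mid \theta_{h_j}) \bigr) .
```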
26. EM Algorithm
- E-step: find the expected value of the complete-data log-likelihood, where the expectation is taken using the current estimate of the parameters
- M-step: update the estimate by maximizing this expected value
- Repeat the process until convergence
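With Θg denoting the current (guessed) estimate, the two steps take the standard form:

```latex
\text{E-step: } \; Q(\Theta, \Theta^{g}) = E\bigl[ \log p(X, H \mid \Theta) \;\big|\; X, \Theta^{g} \bigr],
\qquad
\text{M-step: } \; \Theta^{\mathrm{new}} = \arg\max_{\Theta} Q(\Theta, \Theta^{g}) .
```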
27. E-M Steps
28. Justification
- The expected value is a lower bound of the log-likelihood
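A standard statement of this bound, and of the decomposition that the next slide refers to as (1), is (for any distribution q(h) over the hidden variables):

```latex
\log p(X \mid \Theta) = \log \sum_{h} p(X, h \mid \Theta)
\;\ge\; \sum_{h} q(h) \log \frac{p(X, h \mid \Theta)}{q(h)}
= \sum_{h} q(h) \log \frac{p(h \mid X, \Theta)}{q(h)} + \log p(X \mid \Theta) , \quad (1)
```

with equality exactly when q(h) = p(h | X, Θ).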
29. Justification (Cntd)
- The maximum of the lower bound equals the log-likelihood
- The first term of (1) is the relative entropy of q(h) with respect to the posterior distribution of h
- The second term is a quantity that does not depend on h
- We obtain the maximum of (1) if the relative entropy becomes zero, i.e., if q(h) is chosen to be the posterior
- With this choice, the first term becomes zero and (1) achieves its upper bound, which is the log-likelihood
30. Details of EM Algorithm
- Let Θg denote the guessed values of the parameters
- For the given Θg, we can compute the quantity below
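The computable quantity here is, presumably, the posterior probability of each mixture component given a sample; in the notation of the Bilmes tutorial it reads:

```latex
p(i \mid x_j, \Theta^{g}) \;=\; \frac{\alpha_i^{g}\, p_i(x_j \mid \theta_i^{g})}{\sum_{l=1}^{M} \alpha_l^{g}\, p_l(x_j \mid \theta_l^{g})} .
```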
31. Details (Cntd)
- We then consider the expected value of the complete-data log-likelihood
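For the mixture model this expected value takes the standard form:

```latex
Q(\Theta, \Theta^{g}) \;=\; \sum_{i=1}^{M} \sum_{j=1}^{n} \log\bigl( \alpha_i\, p_i(x_j \mid \theta_i) \bigr)\, p(i \mid x_j, \Theta^{g}) .
```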
32. Details (Cntd)
- Lagrangian and partial derivative equation
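For the mixing weights, the Lagrangian with the constraint that the αi sum to one, and its partial derivative (presumably the equation the next slide calls (2)), take the standard form:

```latex
\sum_{i=1}^{M} \sum_{j=1}^{n} \log(\alpha_i)\, p(i \mid x_j, \Theta^{g})
+ \lambda \Bigl( \sum_{i=1}^{M} \alpha_i - 1 \Bigr),
\qquad
\sum_{j=1}^{n} \frac{p(i \mid x_j, \Theta^{g})}{\alpha_i} + \lambda = 0 . \quad (2)
```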
33. Details (Cntd)
- From (2), we derive that λ = -n and the update rule for αi shown below
- Based on these values, we can derive the optimal θi; of the expected value, only the part shown below involves θi
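In standard form, the resulting update and the θ-dependent part of the expected value are:

```latex
\alpha_i^{\mathrm{new}} \;=\; \frac{1}{n} \sum_{j=1}^{n} p(i \mid x_j, \Theta^{g}),
\qquad
\sum_{i=1}^{M} \sum_{j=1}^{n} \log\bigl( p_i(x_j \mid \theta_i) \bigr)\, p(i \mid x_j, \Theta^{g}) .
```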
34. Exercise
- Deduce from the partial derivative equation above that λ = -n, and obtain the update rule for αi
35. Gaussian Mixtures
- The Gaussian density is given by the expression below
- For Gaussian mixtures, each component density pi(x | θi) is a Gaussian with parameters θi = (μi, Σi)
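For d-dimensional data the Gaussian density is:

```latex
p(x \mid \mu, \Sigma) \;=\; \frac{1}{(2\pi)^{d/2}\, \lvert \Sigma \rvert^{1/2}}
\exp\!\Bigl( -\tfrac{1}{2}\, (x - \mu)^{\mathsf T} \Sigma^{-1} (x - \mu) \Bigr) .
```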
36. Gaussian Mixtures (Cntd)
- Partial derivative of the expected value with respect to the mean μi
- Setting this to zero, we obtain the update below
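In standard form, the resulting mean update is:

```latex
\mu_i^{\mathrm{new}} \;=\; \frac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})\, x_j}{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})} .
```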
37. Gaussian Mixtures (Cntd)
- Taking the derivative of the expected value with respect to the covariance matrix Σi
- and setting it to zero, we get the update below
- (many details are omitted)
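In standard form, the covariance update is:

```latex
\Sigma_i^{\mathrm{new}} \;=\; \frac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})\, (x_j - \mu_i^{\mathrm{new}})(x_j - \mu_i^{\mathrm{new}})^{\mathsf T}}{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})} .
```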
38. Gaussian Mixtures (Cntd)
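Pulling the E- and M-step formulas together, here is an illustrative NumPy sketch (the function name gmm_em, the initialization scheme, and the small regularization term are assumptions, not from the slides):

```python
import numpy as np

def gmm_em(X, M, n_iter=50, seed=0):
    """EM for a Gaussian mixture: alternate posterior (E) and parameter (M) updates."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    alphas = np.full(M, 1.0 / M)                    # mixing weights
    mus = X[rng.choice(n, size=M, replace=False)]   # component means
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    for _ in range(n_iter):
        # E-step: posterior p(i | x_j, Theta^g) for every component and sample
        resp = np.empty((n, M))
        for i in range(M):
            diff = X - mus[i]
            inv = np.linalg.inv(sigmas[i])
            expo = -0.5 * np.sum(diff @ inv * diff, axis=1)
            norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigmas[i]))
            resp[:, i] = alphas[i] * np.exp(expo) / norm
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted-average updates for alpha, mu, Sigma
        Ni = resp.sum(axis=0)
        alphas = Ni / n
        mus = (resp.T @ X) / Ni[:, None]
        for i in range(M):
            diff = X - mus[i]
            sigmas[i] = (resp[:, i, None] * diff).T @ diff / Ni[i] + 1e-6 * np.eye(d)
    return alphas, mus, sigmas
```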
39. Linear Discriminant Analysis (LDA)
40. Illustration
41. Definitions
- Given
- Samples x1, x2, ..., xn
- Classes: ni of them are of class i, i = 1, 2, ..., c
- Definition
- Sample mean for class i
- Scatter matrix for class i
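In standard form (summing over the samples x that belong to class i):

```latex
m_i \;=\; \frac{1}{n_i} \sum_{x \in \text{class } i} x,
\qquad
S_i \;=\; \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^{\mathsf T} .
```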
42. Scatter Matrices
- Total scatter matrix
- Within-class scatter matrix
- Between-class scatter matrix
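With m the overall sample mean, these are (in standard form):

```latex
S_T \;=\; \sum_{j=1}^{n} (x_j - m)(x_j - m)^{\mathsf T},
\qquad
S_W \;=\; \sum_{i=1}^{c} S_i,
\qquad
S_B \;=\; \sum_{i=1}^{c} n_i\, (m_i - m)(m_i - m)^{\mathsf T},
```

and they satisfy S_T = S_W + S_B.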
43. Multiple Discriminant Analysis
- We seek vectors wi, i = 1, 2, ..., c-1
- and project the samples x to the (c-1)-dimensional space y = WTx
- The criterion for W = (w1, w2, ..., wc-1) is given below
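The standard criterion (ratio of determinants of the projected scatter matrices) is:

```latex
J(W) \;=\; \frac{\lvert W^{\mathsf T} S_B\, W \rvert}{\lvert W^{\mathsf T} S_W\, W \rvert} .
```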
44. Multiple Discriminant Analysis (Cntd)
- Consider the Lagrangian
- Take the partial derivative
- Setting the derivative to zero, we obtain
45. Multiple Discriminant Analysis (Cntd)
- Find the roots of the characteristic function as eigenvalues
- and then solve the equation below for wi, for the largest c-1 eigenvalues
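The characteristic function and the generalized eigenvalue problem to be solved are:

```latex
\det\bigl( S_B - \lambda\, S_W \bigr) = 0,
\qquad
S_B\, w_i = \lambda_i\, S_W\, w_i .
```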
46. LDA Prototypes
- The prototype of each class is the mean of the projected samples of that class, where the projection is through the matrix W
- In the testing phase
- All test samples are projected through the same optimal W
- The nearest prototype is the winner
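A minimal sketch of this train/test procedure (illustrative code, not from the slides; the scipy-based generalized eigensolver and the small regularization added to SW are assumptions):

```python
import numpy as np
from scipy.linalg import eigh

def lda_prototypes(X, y):
    """Fit the LDA projection W and the per-class prototypes (projected class means)."""
    classes = np.unique(y)
    d = X.shape[1]
    m = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    means = {}
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        means[c] = mc
        Sw += (Xc - mc).T @ (Xc - mc)             # within-class scatter
        Sb += len(Xc) * np.outer(mc - m, mc - m)  # between-class scatter
    # Generalized eigenproblem S_B w = lambda S_W w; keep the top c-1 eigenvectors
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    W = evecs[:, np.argsort(evals)[::-1][:len(classes) - 1]]
    prototypes = {c: means[c] @ W for c in classes}
    return W, prototypes

def lda_classify(x, W, prototypes):
    """Project a test sample through W and return the class of the nearest prototype."""
    z = x @ W
    return min(prototypes, key=lambda c: np.linalg.norm(z - prototypes[c]))
```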
47. K-Nearest Neighbor (K-NN) Classifier
48. K-NN Classifier
- For each test sample x, find the nearest K training samples and classify x according to the vote among the K neighbors
- The asymptotic error rate P satisfies the bound below, where P* denotes the Bayes error and c the number of classes
- This shows that the error rate is at most twice the Bayes error
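The slide's expressions are not preserved; the standard asymptotic bound (Cover and Hart, stated for the nearest-neighbor rule) that yields the twice-the-Bayes-error statement is:

```latex
P^{*} \;\le\; P \;\le\; P^{*} \Bigl( 2 - \frac{c}{c-1}\, P^{*} \Bigr) \;\le\; 2 P^{*} .
```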
49. Learning Vector Quantization (LVQ)
50. LVQ Algorithm
- Initialize R prototypes for each class: m1(k), m2(k), ..., mR(k), where k = 1, 2, ..., K
- Sample a training sample x and find the nearest prototype mj(k) to x
- If x and mj(k) match in class type, move the prototype toward x (first update below)
- Otherwise, move the prototype away from x (second update below)
- Repeat step 2, decreasing the learning rate ε at each iteration
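A standard form of the LVQ1 updates, with learning rate ε, is:

```latex
m_j(k) \;\leftarrow\; m_j(k) + \varepsilon\, \bigl( x - m_j(k) \bigr) \quad \text{if the classes match},
\qquad
m_j(k) \;\leftarrow\; m_j(k) - \varepsilon\, \bigl( x - m_j(k) \bigr) \quad \text{otherwise}.
```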
51. References
- F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, John Wiley & Sons, 1999.
- J. A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, www.cs.berkeley.edu/daf/appsem/WordsAndPictures/Papers/bilmes98gentle.pdf
- T. P. Minka, Expectation-Maximization as Lower Bound Maximization, www.stat.cmu.edu/minka/papers/em.html
- R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2001.
- T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.