Title: Unsupervised Learning: Clustering
Some material adapted from slides by Andrew Moore, CMU. Visit http://www.autonlab.org/tutorials/ for Andrew's repository of Data Mining tutorials.
Unsupervised Learning
- Supervised learning uses labeled data pairs (x, y) to learn a function f : X → Y.
- But what if we don't have labels?
  - No labels: unsupervised learning
  - Only some points are labeled: semi-supervised learning
    - Labels may be expensive to obtain, so we only get a few.
- Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.
Clustering Data
K-Means Clustering
- K-Means(k, data):
  - Randomly choose k cluster center locations (centroids).
  - Loop until convergence:
    - Assign each point to the cluster of the closest centroid.
    - Re-estimate the cluster centroids based on the data assigned to each.
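A minimal sketch of this loop in Python, assuming NumPy, Euclidean distance, and a fixed iteration cap; the function and argument names are illustrative rather than taken from the slides.

```python
import numpy as np

def k_means(k, data, max_iters=100, seed=0):
    """Minimal k-means sketch: data is an (n_points, n_dims) array."""
    rng = np.random.default_rng(seed)
    # Randomly choose k data points as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels
```

Because of the random initialization, different runs can return different clusterings, which is the sensitivity discussed under "Problems with K-Means" below.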
7K-Means Animation
Example generated by Andrew Moore using Dan
Pellegs super-duper fast K-means system Dan
Pelleg and Andrew Moore. Accelerating Exact
k-means Algorithms with Geometric
Reasoning. Proc. Conference on Knowledge
Discovery in Databases 1999.
Problems with K-Means
- Very sensitive to the initial points.
  - Do many runs of k-means, each with different initial centroids.
  - Seed the centroids using a better method than random (e.g., farthest-first sampling), as in the sketch after this list.
- Must manually choose k.
  - Learn the optimal k for the clustering. (Note that this requires a performance measure.)
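A minimal sketch of farthest-first seeding, again assuming NumPy; the greedy max-min rule used here is one common reading of "farthest-first sampling", and the function name is illustrative.

```python
import numpy as np

def farthest_first_centroids(k, data, seed=0):
    """Greedy farthest-first seeding: start from a random point, then
    repeatedly add the point farthest from all centroids chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Distance from each point to its nearest already-chosen centroid.
        dists = np.min(
            np.linalg.norm(data[:, None, :] - np.array(centroids)[None, :, :], axis=2),
            axis=1,
        )
        centroids.append(data[dists.argmax()])
    return np.array(centroids)
```

These seeds can then replace the random initialization in the k_means sketch above.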
Problems with K-Means
- How do you tell it which clustering you want?
  - Constrained clustering techniques
Learning Bayes Nets
Some material adapted from lecture notes by Lise Getoor and Ron Parr. Adapted from slides by Tim Finin and Marie desJardins.
Learning Bayesian networks
- Given a training set of data D
- Find the network B that best matches D
  - model selection
  - parameter estimation
[Figure: the training data D is fed to an inducer, which outputs the learned network B]
Parameter estimation
- Assume known structure
- Goal: estimate BN parameters Θ
  - the entries in the local probability models, P(X | Parents(X))
- A parameterization Θ is good if it is likely to generate the observed data
- Maximum Likelihood Estimation (MLE) principle: choose Θ so as to maximize the likelihood of the i.i.d. samples x[1], ..., x[M] in D,
  L(Θ : D) = Π_m P(x[m] | Θ)
Parameter estimation II
- The likelihood decomposes according to the structure of the network
  - → we get a separate estimation task for each parameter
- The MLE (maximum likelihood estimate) solution: for each value x of a node X and each instantiation u of Parents(X),
  θ(x | u) = N(x, u) / N(u)
  - We just need to collect the counts N(x, u) and N(u) (the sufficient statistics) for every combination of parents and children observed in the data
- MLE is equivalent to an assumption of a uniform prior over parameter values
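A minimal sketch of collecting those counts for one node, assuming each data point is a dict mapping variable names to values; the function name and data format are illustrative.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Estimate P(child | parents) by counting: theta(x | u) = N(x, u) / N(u).

    data: iterable of dicts, e.g. {"Burglary": True, "Earthquake": False, "Alarm": True}
    """
    joint = Counter()        # N(x, u): counts of (child value, parent values)
    parent_only = Counter()  # N(u): counts of the parent values alone
    for record in data:
        u = tuple(record[p] for p in parents)
        x = record[child]
        joint[(x, u)] += 1
        parent_only[u] += 1
    # theta[(x, u)] is the MLE of P(child = x | parents = u)
    return {(x, u): n / parent_only[u] for (x, u), n in joint.items()}
```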
Sufficient statistics: Example
- Why are the counts sufficient?
[Figure: Bayesian network over Moon-phase, Light-level, Earthquake, Burglary, and Alarm]
  θ(A | E, B) = N(A, E, B) / N(E, B)
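Using the hedged mle_cpt sketch above, the formula on this slide corresponds to estimating the Alarm node's CPT from exactly those counts (the dataset variable here is assumed):

```python
# theta(A | E, B) = N(A, E, B) / N(E, B), via the mle_cpt sketch above
alarm_cpt = mle_cpt(data, child="Alarm", parents=["Earthquake", "Burglary"])
```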
Model selection
- Goal: select the best network structure, given the data
- Input:
  - Training data
  - Scoring function
- Output:
  - A network that maximizes the score
Structure selection: Scoring
- Bayesian: define a prior over parameters and structure
  - we get a balance between model complexity and fit to the data as a byproduct
- Score(G : D) = log P(G | D) ∝ log [P(D | G) P(G)]
  - P(D | G) is the marginal likelihood; it just comes from our parameter estimates
  - P(G) is the prior on structure; it can be any measure we want, typically a function of the network complexity
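A minimal sketch of a score with this shape, assuming the caller supplies a log marginal likelihood routine; the edge-count prior used here is just one common complexity penalty, not necessarily the one intended in the slides, and all names are illustrative.

```python
def structure_score(parents, data, log_marginal_likelihood, complexity_weight=1.0):
    """Score(G : D) proportional to log P(D | G) + log P(G).

    parents: dict mapping each node to its list of parents (the structure G).
    log_marginal_likelihood: caller-supplied function of (parents, data).
    The structure prior below penalizes edge count, a simple complexity measure.
    """
    num_edges = sum(len(ps) for ps in parents.values())
    log_prior = -complexity_weight * num_edges  # log P(G), up to an additive constant
    return log_marginal_likelihood(parents, data) + log_prior
```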
Heuristic search
Exploiting decomposability
Variations on a theme
- Known structure, fully observable: only need to do parameter estimation
- Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
- Known structure, missing values: use expectation maximization (EM) to estimate parameters
- Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
- Unknown structure, hidden variables: too hard to solve!
Handling missing data
- Suppose that in some cases we observe earthquake, alarm, light-level, and moon-phase, but not burglary
- Should we throw that data away?
- Idea: guess the missing values based on the other data
[Figure: Bayesian network over Moon-phase, Light-level, Earthquake, Burglary, and Alarm]
EM (expectation maximization)
- Guess probabilities for nodes with missing values (e.g., based on other observations)
- Compute the probability distribution over the missing values, given our guess
- Update the probabilities based on the guessed values
- Repeat until convergence
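A minimal EM sketch for a toy two-node case (Burglary → Alarm, with Burglary sometimes missing), assuming binary variables; the setup, function name, and initialization are illustrative, not a general-purpose implementation.

```python
def em_burglary_alarm(records, iters=50):
    """EM sketch for a toy network Burglary -> Alarm.

    records: list of (b, a) pairs; a is True/False, b is True/False or None
    when Burglary was not observed. Estimates P(B) and P(A | B) using
    expected (fractional) counts in place of the missing values.
    """
    p_b = 0.5                              # initial guess for P(B = true)
    p_a_given_b = {True: 0.5, False: 0.5}  # initial guess for P(A = true | B)

    for _ in range(iters):
        n_given_b = {True: 0.0, False: 0.0}   # expected count of B = b
        n_a_and_b = {True: 0.0, False: 0.0}   # expected count of A = true and B = b

        for b, a in records:
            if b is None:
                # E-step: P(B = true | A = a) under the current parameters
                num = p_b * (p_a_given_b[True] if a else 1 - p_a_given_b[True])
                den = num + (1 - p_b) * (p_a_given_b[False] if a else 1 - p_a_given_b[False])
                w = num / den
            else:
                w = 1.0 if b else 0.0
            for b_val, weight in ((True, w), (False, 1.0 - w)):
                n_given_b[b_val] += weight
                if a:
                    n_a_and_b[b_val] += weight

        # M-step: re-estimate the parameters from the expected counts
        p_b = n_given_b[True] / len(records)
        p_a_given_b = {b_val: n_a_and_b[b_val] / max(n_given_b[b_val], 1e-9)
                       for b_val in (True, False)}

    return p_b, p_a_given_b
```

Each pass corresponds to the bullets above: the E-step fills in a distribution over the missing Burglary values given the current guess, and the M-step re-estimates the parameters as if those soft values had been observed.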
EM example
- Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27
- We estimate the CPTs based on the rest of the data
- We then estimate P(Burglary) for November 27 from those CPTs
- Now we recompute the CPTs as if that estimated value had been observed
- Repeat until convergence!
[Figure: Bayesian network over Earthquake, Burglary, and Alarm]