Unsupervised Learning: Clustering
Provided by: EricE78. Learn more at: https://cs.brynmawr.edu

Transcript and Presenter's Notes



1
Unsupervised Learning: Clustering
Some material adapted from slides by Andrew Moore, CMU. Visit http://www.autonlab.org/tutorials/ for Andrew's repository of Data Mining tutorials.
2
Unsupervised Learning
  • Supervised learning used labeled data pairs (x, y) to learn a function f : X → Y.
  • But what if we don't have labels?
  • No labels = unsupervised learning.
  • Only some points are labeled = semi-supervised learning.
  • Labels may be expensive to obtain, so we only get a few.
  • Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.

3
Clustering Data
4
K-Means Clustering
  • K-Means(k, data):
    • Randomly choose k cluster center locations (centroids).
    • Loop until convergence:
      • Assign each point to the cluster of the closest centroid.
      • Re-estimate the cluster centroids based on the data assigned to each.
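A minimal sketch of the loop described on this slide, in Python with NumPy. The random seeding and the convergence test (centroids stop moving) are assumptions, since the slide does not spell them out.

import numpy as np

def k_means(k, data, max_iters=100, tol=1e-6, seed=None):
    """Minimal k-means; data is an (n_points, n_dims) NumPy array."""
    rng = np.random.default_rng(seed)
    # Randomly choose k cluster center locations (centroids).
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate the cluster centroids from the points assigned to each.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # converged: centroids stopped moving
            break
    return centroids, labels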

5
K-Means Clustering
  • K-Means(k, data):
    • Randomly choose k cluster center locations (centroids).
    • Loop until convergence:
      • Assign each point to the cluster of the closest centroid.
      • Re-estimate the cluster centroids based on the data assigned to each.

6
K-Means Clustering
  • K-Means(k, data):
    • Randomly choose k cluster center locations (centroids).
    • Loop until convergence:
      • Assign each point to the cluster of the closest centroid.
      • Re-estimate the cluster centroids based on the data assigned to each.

7
K-Means Animation
Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system. Dan Pelleg and Andrew Moore, "Accelerating Exact k-means Algorithms with Geometric Reasoning," Proc. Conference on Knowledge Discovery in Databases, 1999.
8
Problems with K-Means
  • Very sensitive to the initial points.
    • Do many runs of k-means, each with different initial centroids.
    • Seed the centroids using a better method than random (e.g., farthest-first sampling).
  • Must manually choose k.
    • Learn the optimal k for the clustering. (Note that this requires a performance measure.)
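A sketch of the farthest-first seeding idea mentioned above, assuming Euclidean distance: the first centroid is a random data point, and each subsequent centroid is the point farthest from the centroids chosen so far. The function and parameter names are illustrative.

import numpy as np

def farthest_first_centroids(k, data, seed=None):
    """Seed k centroids by farthest-first traversal over the data points."""
    rng = np.random.default_rng(seed)
    centroids = [data[rng.integers(len(data))]]  # first centroid: a random point
    while len(centroids) < k:
        # Distance from every point to its nearest already-chosen centroid.
        dists = np.linalg.norm(
            data[:, None, :] - np.array(centroids)[None, :, :], axis=2
        ).min(axis=1)
        # Next centroid: the point farthest from all chosen centroids.
        centroids.append(data[dists.argmax()])
    return np.array(centroids)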

9
Problems with K-Means
  • How do you tell it which clustering you want?
    • Constrained clustering techniques

10
Learning Bayes Nets
Some material adapted from lecture notes by Lise
Getoor and Ron Parr
Adapted from slides by Tim Finin and Marie
desJardins.
11
Learning Bayesian networks
  • Given a training set D
  • Find the network B that best matches D
    • model selection
    • parameter estimation
  [Diagram: training data D is fed to an inducer, which outputs a Bayesian network B]
12
Parameter estimation
  • Assume known structure
  • Goal: estimate BN parameters θ
    • entries in the local probability models, P(X | Parents(X))
  • A parameterization θ is good if it is likely to generate the observed data:
    L(θ : D) = P(D | θ) = ∏_m P(d[m] | θ)   (i.i.d. samples)
  • Maximum Likelihood Estimation (MLE) principle: choose θ so as to maximize L
13
Parameter estimation II
  • The likelihood decomposes according to the structure of the network
    ⇒ we get a separate estimation task for each parameter
  • The MLE (maximum likelihood estimate) solution:
    • for each value x of a node X and each instantiation u of Parents(X):
      θ_{x|u} = N(x, u) / N(u)
    • We just need to collect the counts N(·) (the sufficient statistics) for every combination of parents and children observed in the data
  • MLE is equivalent to an assumption of a uniform prior over parameter values
14
Sufficient statistics: Example
  • Why are the counts sufficient?
  [Network diagram: Moon-phase, Light-level, Earthquake, Burglary, Alarm, with Earthquake and Burglary as the parents of Alarm]
  θ_{A | E, B} = N(A, E, B) / N(E, B)
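A small sketch of collecting those counts and turning them into MLE CPT entries, assuming the data is given as a list of dicts mapping variable names to values (the names and the tiny data set are illustrative):

from collections import Counter

def mle_cpt(data, child, parents):
    """Estimate P(child | parents) by counting: theta_{x|u} = N(x, u) / N(u)."""
    joint = Counter()   # N(x, u): child value together with a parent instantiation
    parent = Counter()  # N(u): parent instantiation alone
    for row in data:
        u = tuple(row[p] for p in parents)
        joint[(row[child], u)] += 1
        parent[u] += 1
    return {(x, u): n / parent[u] for (x, u), n in joint.items()}

# Tiny illustration: P(Alarm | Earthquake, Burglary) from three complete observations.
rows = [
    {"Earthquake": 0, "Burglary": 1, "Alarm": 1},
    {"Earthquake": 0, "Burglary": 1, "Alarm": 1},
    {"Earthquake": 0, "Burglary": 0, "Alarm": 0},
]
print(mle_cpt(rows, "Alarm", ["Earthquake", "Burglary"]))
# {(1, (0, 1)): 1.0, (0, (0, 0)): 1.0}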
15
Model selection
  • Goal: Select the best network structure, given the data
  • Input:
    • Training data
    • Scoring function
  • Output:
    • A network that maximizes the score

16
Structure selection: Scoring
  • Bayesian: prior over parameters and structure
    • we get a balance between model complexity and fit to the data as a byproduct
  • Score(G : D) = log P(G | D) ∝ log [P(D | G) P(G)]
    • The marginal likelihood P(D | G) just comes from our parameter estimates
    • The prior on structure P(G) can be any measure we want; typically a function of the network complexity
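As one illustration of such a score, here is a rough BIC-style sketch: the log-likelihood term decomposes over families using the same counts as the MLE above, and the complexity penalty plays the role the structure prior plays on the slide. BIC is only one common approximation to the Bayesian score, and all names here are illustrative.

import math
from collections import Counter

def bic_score(data, structure, arities):
    """structure: {node: list of parents}; arities: {node: number of discrete values}."""
    n = len(data)
    score = 0.0
    for child, parents in structure.items():
        joint, parent = Counter(), Counter()
        for row in data:
            u = tuple(row[p] for p in parents)
            joint[(row[child], u)] += 1
            parent[u] += 1
        # Family log-likelihood under MLE parameters (decomposes over families).
        score += sum(c * math.log(c / parent[u]) for (x, u), c in joint.items())
        # Complexity penalty: number of free parameters in this family.
        q = math.prod(arities[p] for p in parents)
        score -= 0.5 * math.log(n) * (arities[child] - 1) * q
    return score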
17
Heuristic search
18
Exploiting decomposability
19
Variations on a theme
  • Known structure, fully observable: only need to do parameter estimation
  • Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
  • Known structure, missing values: use expectation maximization (EM) to estimate parameters
  • Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
  • Unknown structure, hidden variables: too hard to solve!

20
Handling missing data
  • Suppose that in some cases, we observe earthquake, alarm, light-level, and moon-phase, but not burglary
  • Should we throw that data away?
  • Idea: Guess the missing values based on the other data
  [Network diagram: Moon-phase, Light-level, Earthquake, Burglary, Alarm]
21
EM (expectation maximization)
  • Guess probabilities for nodes with missing values
    (e.g., based on other observations)
  • Compute the probability distribution over the
    missing values, given our guess
  • Update the probabilities based on the guessed
    values
  • Repeat until convergence

22
EM example
  • Suppose we have observed Earthquake and Alarm but
    not Burglary for an observation on November 27
  • We estimate the CPTs based on the rest of the
    data
  • We then estimate P(Burglary) for November 27 from
    those CPTs
  • Now we recompute the CPTs as if that estimated
    value had been observed
  • Repeat until convergence!

  [Network diagram: Earthquake and Burglary as parents of Alarm]
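A toy sketch of that loop for a single missing Burglary variable, in the spirit of the last two slides: guess the parameters, compute each missing value's posterior (the E-step), re-estimate the CPTs from the resulting expected counts (the M-step), and repeat. The initial parameter values, variable names, and data layout are assumptions, not something the slides specify.

def em_burglary(data, iters=20):
    """Toy EM for the fragment Burglary -> Alarm <- Earthquake.
    Each row has 0/1 values for "E" and "A"; "B" is 0/1, or None when unobserved."""
    # Initial parameter guesses (an assumption; the slides do not specify them).
    p_b, p_e = 0.5, 0.5                                   # P(B=1), P(E=1)
    p_a = {(b, e): 0.5 for b in (0, 1) for e in (0, 1)}   # P(A=1 | B=b, E=e)
    for _ in range(iters):
        # E-step: expected value of B for each row (its posterior when unobserved).
        weights = []
        for row in data:
            if row["B"] is not None:
                w = float(row["B"])
            else:
                def lik(b):
                    pa = p_a[(b, row["E"])]
                    prior = p_b if b == 1 else 1 - p_b
                    return prior * (pa if row["A"] == 1 else 1 - pa)
                w = lik(1) / (lik(1) + lik(0))
            weights.append(w)
        # M-step: re-estimate the CPTs from expected counts.
        n = len(data)
        p_b = sum(weights) / n
        p_e = sum(row["E"] for row in data) / n
        for b in (0, 1):
            for e in (0, 1):
                num = den = 0.0
                for row, w in zip(data, weights):
                    wb = w if b == 1 else 1 - w           # expected count of B=b
                    if row["E"] == e:
                        den += wb
                        num += wb * row["A"]
                p_a[(b, e)] = num / den if den > 0 else 0.5
    return p_b, p_e, p_a

# Example: the second observation is missing Burglary, as on the slide.
rows = [{"E": 0, "A": 1, "B": 1}, {"E": 1, "A": 1, "B": None}, {"E": 0, "A": 0, "B": 0}]
print(em_burglary(rows))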