Title: Unsupervised Approaches
- An alternative approach: clustering
- Clustering algorithms organize and classify data.
- They are useful for data compression.
- Clustering partitions a data set into smaller subsets based on the 'similarity' of the examples.
- An example: K-means clustering
- Overview (an iterative process):
- 1. Determine a starting point, i.e. the number of clusters and the initial cluster points.
- 2. Determine a membership matrix for the data points, i.e. which cluster point each data point is closest to.
- 3. Evaluate the cost function.
- 4. Update the position of the cluster points.
- 5. Go back to step 2.
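The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the slides' own implementation: the function names, the `tol` and `max_iter` parameters, and the choice of the sum of absolute coordinate differences as the distance are all assumptions made for the example. The starting cluster points are supplied by the caller, and the loop stops once the cost no longer drops by at least `tol`.

```python
def city_block(p, q):
    """Sum of absolute coordinate differences (an assumed distance measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def k_means(points, centres, dist=city_block, tol=1e-6, max_iter=100):
    """Sketch of the iterative loop; returns (centres, labels, cost)."""
    # Step 1: the caller fixes the number of clusters and their starting points.
    centres = [tuple(c) for c in centres]
    k = len(centres)
    prev_cost = float("inf")
    labels, cost = [], 0.0
    for _ in range(max_iter):
        # Step 2: membership, stored as the index of the closest cluster point.
        labels = [min(range(k), key=lambda i: dist(p, centres[i])) for p in points]
        # Step 3: cost is the summed distance of each point to its cluster point.
        cost = sum(dist(p, centres[i]) for p, i in zip(points, labels))
        if prev_cost - cost < tol:
            break  # the cost has stopped improving
        prev_cost = cost
        # Step 4: move each cluster point to the average of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old position
                centres[i] = tuple(sum(c) / len(members) for c in zip(*members))
        # Step 5: go back to step 2.
    return centres, labels, cost


# Hypothetical data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
final_centres, final_labels, final_cost = k_means(data, [(0.2, 0.2), (0.8, 0.8)])
print(final_labels)
```

Storing memberships as a list of cluster indices is equivalent to a 0/1 membership matrix with exactly one 1 per data point.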
- The cost function is to be minimised. For each cluster, it simply sums the distance between each data point and the appropriate cluster point.
- Updating the cluster position involves evaluating the average of the data points within a cluster, and moving the cluster point to this position.
- An example
- Selecting two cluster points, c1 and c2
- Evaluating the membership matrix
The membership of point xj in cluster i is 1 when dist(xj, ci) < dist(xj, ck) for every other cluster point ck; otherwise it is 0.
Any suitable distance measure can be used, e.g. Euclidean or Hamming distance.
We will use a Hamming distance measure, defined for n-dimensional vectors (or points).
By this measure x1 is closest to cluster point 2; therefore x1 is a member of cluster 2.
[Figure: the data points grouped into cluster 1 and cluster 2]
- Evaluate the cost function (J):
$J = \sum_{i=1}^{n} J_i$, where there are n cluster points.
For each cluster point ci:
$J_i = \sum_{j=1}^{m} \mathrm{dist}(x_j, c_i)$, where there are m data points within the cluster,
i.e. simply sum the distances between each point in the cluster and the cluster point itself.
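The cost evaluation can be sketched as a per-cluster sum followed by a total. The function name, the sample points, and the use of the sum of absolute coordinate differences as the distance are assumptions made for illustration; they are not the data from the slides.

```python
def city_block(p, q):
    """Sum of absolute coordinate differences (assumed distance measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def cost_function(points, centres, labels, dist=city_block):
    """Return ([J_1, ..., J_n], J), where labels[j] is the cluster index of point j."""
    per_cluster = [0.0] * len(centres)
    for p, i in zip(points, labels):
        # Add this point's distance to the cost of its own cluster only.
        per_cluster[i] += dist(p, centres[i])
    return per_cluster, sum(per_cluster)


# Hypothetical example: two clusters of two points each.
pts = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
cps = [(0.0, 0.5), (4.5, 4.0)]
per_cluster_cost, total_cost = cost_function(pts, cps, [0, 0, 1, 1])
print(per_cluster_cost, total_cost)
```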
- Using a Hamming distance measure, the per-cluster costs come to 2.8 and 3.05,
thus J = 2.8 + 3.05 = 5.85
- 4. Update cluster points
i.e. move each cluster point to the average position of the data points in that cluster.
For cluster 1 (members are x2, x3, x4, x6, x7):
average (1st component) = (0.5 + 0.55 + 0.4 + 0.3 + 0.5)/5 = 2.25/5 = 0.45
average (2nd component) = (0.4 + 0.35 + 0.2 + 0.8 + 0.7)/5 = 2.45/5 = 0.49
so cluster point 1 moves to (0.45, 0.49)
For cluster 2 (members are x1, x5, x8, x9):
average (1st component) = (0.6 + 0.7 + 0.6 + 0.75)/4 = 2.65/4 ≈ 0.66
average (2nd component) = (0.3 + 0.8 + 0.9 + 0.55)/4 = 2.55/4 ≈ 0.64
so cluster point 2 moves to (0.66, 0.64)
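The update arithmetic can be checked with a short script. The coordinates below are read off the averaging sums in the slides, pairing the k-th first component with the k-th second component in the order the members are listed; the `centroid` helper name is hypothetical.

```python
def centroid(members):
    """Component-wise average of a list of equal-length points."""
    count = len(members)
    return tuple(sum(component) / count for component in zip(*members))


# Members in the order given on the slides: x2, x3, x4, x6, x7.
cluster_1 = [(0.5, 0.4), (0.55, 0.35), (0.4, 0.2), (0.3, 0.8), (0.5, 0.7)]
# Members in the order given on the slides: x1, x5, x8, x9.
cluster_2 = [(0.6, 0.3), (0.7, 0.8), (0.6, 0.9), (0.75, 0.55)]

new_c1 = centroid(cluster_1)
new_c2 = centroid(cluster_2)
print(new_c1, new_c2)
```

Up to floating-point rounding, this gives (0.45, 0.49) for cluster point 1 and (0.6625, 0.6375) for cluster point 2, which the slides round to (0.66, 0.64).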
- Re-evaluate the membership matrix, then re-evaluate the cost function and shift the cluster points again.
- How is the algorithm terminated?
- Normally a threshold is set, and either:
- the algorithm terminates as soon as the cost function reaches (or falls below) the threshold, or
- the algorithm terminates when the improvement (decrease) in the cost function over the previous iteration falls below the threshold.
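Both stopping rules can be expressed as a small check on the history of cost values. The function name, the `mode` argument, and the sample cost values are hypothetical, chosen only to illustrate the two criteria.

```python
def should_stop(cost_history, threshold, mode="absolute"):
    """Decide whether to stop, given the cost J after each iteration (latest last)."""
    if not cost_history:
        return False
    if mode == "absolute":
        # Rule 1: the cost itself has reached or dropped below the threshold.
        return cost_history[-1] <= threshold
    if mode == "improvement":
        # Rule 2: the latest decrease in cost is smaller than the threshold.
        if len(cost_history) < 2:
            return False
        return cost_history[-2] - cost_history[-1] < threshold
    raise ValueError("mode must be 'absolute' or 'improvement'")


# Hypothetical cost values from successive iterations.
history = [6.0, 4.0, 3.9]
print(should_stop(history, 4.5))                 # rule 1 with threshold 4.5
print(should_stop(history, 0.5, "improvement"))  # rule 2 with threshold 0.5
```

The absolute rule needs a threshold suited to the scale of the data, whereas the improvement rule only needs a sense of how small a change is no longer worth another iteration.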