Title: Dendrograms for Data Mining
1Eamonn Keogh
Dendrograms for Data Mining
2What is Clustering?
Also called unsupervised learning. Statisticians sometimes call it classification, psychologists call it sorting, and people in marketing call it segmentation.
- Informally, finding natural groupings or relationships among objects.
3What is a natural grouping among these objects?
4What is a natural grouping among these objects?
Clustering is subjective
Simpson's Family
Males
Females
School Employees
5Two Types of Clustering
- Partitional algorithms: Construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using some criterion.
Partitional
Hierarchical
6What is Similarity?
Webster's Dictionary: "The quality or state of being similar; likeness; resemblance; as, a similarity of features."
Similarity is hard to define, but "we know it when we see it". The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
7Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1,O2).
(Figure: the objects "Peter" and "Piotr", with example distance values 0.23, 3, and 342.7.)
8A Useful Tool for Summarizing Similarity
Measurements
Introducing the dendrogram. Related terms: cladogram, phylogenetic tree, phylogram.
The similarity between two objects in a dendrogram is represented by the height of the lowest internal node they share: the lower that node, the more similar the objects.
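To make the "lowest shared internal node" idea concrete, here is a minimal sketch. The tree encoding (nested tuples of `(height, left, right)`) and the heights themselves are assumptions made up for illustration; they are not taken from any figure in these slides.

```python
# Hypothetical toy dendrogram. An internal node is (height, left, right);
# a leaf is just a name. All heights are invented for illustration.
tree = (0.9,
        (0.2, "Chimp", "Human"),
        (0.5, "Gibbon", "Orang"))


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def shared_height(node, a, b):
    """Height of the lowest internal node whose subtree holds both a and b."""
    if a == b or isinstance(node, str) or not {a, b} <= leaves(node):
        raise ValueError("a and b must be two different leaves of the tree")
    height, left, right = node
    for child in (left, right):
        if {a, b} <= leaves(child):
            # Both leaves sit in the same subtree, so a lower node is shared.
            return shared_height(child, a, b)
    return height  # a and b split here: this is their lowest shared node


print(shared_height(tree, "Chimp", "Human"))   # low node: very similar
print(shared_height(tree, "Chimp", "Gibbon"))  # only the root is shared
```

In this toy tree, Chimp and Human share a node at height 0.2, while Chimp and Gibbon only meet at the root.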
9There is only one dataset that can be perfectly
clustered using a hierarchy
(Bovine:0.69395,(Spider Monkey:0.390,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939)
10Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo's hierarchy is manually created; we will focus on the automatic creation of hierarchies in data mining.
(Figure: part of Yahoo's hierarchy. Top level: Business & Economy. Next level: B2B, Finance, Shopping, Jobs. Lowest level shown: Aerospace, Agriculture, Banking, Bonds, Animals, Apparel, Career, Workspace.)
11A Demonstration of Hierarchical Clustering using
String Edit Distance
Pedro (Portuguese): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian alternative), Petr (Czech), Pyotr (Russian)
Cristovao (Portuguese): Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English)
Miguel (Portuguese): Michalis (Greek), Michael (English), Mick (Irish!)
(Figure: dendrogram of the names. Leaf order: Piotr, Pyotr, Peka, Peter, Piero, Pietro, Pierre, Petros, Peadar, Mick, Pedro, Peder, Miguel, Krystof, Michael, Michalis, Crisdean, Cristobal, Cristovao, Christoph, Kristoffer, Cristoforo, Christophe, Christopher.)
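The string edit distance used in this demonstration can be sketched as below. This is the standard Levenshtein formulation with unit costs for insertion, deletion, and substitution; that the slides use exactly these costs is an assumption.

```python
def edit_distance(s, t):
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn s into t (unit costs)."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Peter", "Piotr"))  # 3
print(edit_distance("Piotr", "Pyotr"))  # 1
```

Under this measure Piotr and Pyotr differ by a single substitution, which is consistent with them sitting next to each other in the dendrogram.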
12Pedro (Portuguese/Spanish): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian alternative), Petr (Czech), Pyotr (Russian)
(Figure: dendrogram of these names. Leaf order: Piotr, Peka, Peter, Pedro, Piero, Pyotr, Peder, Pietro, Pierre, Petros, Peadar.)
13- Hierarchical clustering can sometimes show patterns that are meaningless or spurious.
- For example, in this clustering, the tight grouping of Australia, Anguilla, St. Helena, etc. is meaningful, since all of these are current or former British colonies or territories.
- However, the tight grouping of Niger and India is completely spurious; there is no connection between the two.
14- The flag of Niger is orange over white over green, with an orange disc on the central white stripe, symbolizing the sun. The orange stands for the Sahara desert, which borders Niger to the north. Green stands for the grassy plains of the south and west and for the River Niger which sustains them. It also stands for fraternity and hope. White generally symbolizes purity and hope.
- The Indian flag is a horizontal tricolor in equal proportion of deep saffron on the top, white in the middle, and dark green at the bottom. In the center of the white band, there is a wheel in navy blue to indicate the Dharma Chakra, the wheel of law in the Sarnath Lion Capital. This center symbol, or the 'CHAKRA', is a symbol dating back to the 2nd century BC. The saffron stands for courage and sacrifice; the white, for purity and truth; the green, for growth and auspiciousness.
15We can look at the dendrogram to determine the "correct" number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear-cut, unfortunately.)
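One way to turn "highly separated subtrees" into a number is to cut the dendrogram inside the largest gap between consecutive merge heights. The sketch below is only that heuristic, not a method stated in the slides; the function name and the example heights are hypothetical.

```python
def suggest_num_clusters(merge_heights):
    """Suggest a cluster count by cutting the dendrogram inside the largest
    gap between consecutive merge heights. A heuristic, not a guarantee."""
    heights = sorted(merge_heights)
    n_items = len(heights) + 1  # n items are joined by n - 1 merges
    gaps = [heights[k] - heights[k - 1] for k in range(1, len(heights))]
    if not gaps:
        return 1
    merges_below_cut = gaps.index(max(gaps)) + 1
    return n_items - merges_below_cut  # each merge removes one cluster


# Hypothetical heights: four tight merges, then one very late merge.
print(suggest_num_clusters([0.1, 0.15, 0.2, 0.25, 2.0]))  # 2
```

When the merge heights are spread evenly, the largest gap is not meaningful and the suggestion should not be trusted.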
16One potential use of a dendrogram is to detect outliers.
The single isolated branch is suggestive of a data point that is very different from all others.
Outlier
17(How-to) Hierarchical Clustering
Since we cannot test all possible trees, we will have to use a heuristic search of the space of possible trees. We could do this bottom-up or top-down.
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
- The number of dendrograms with n leaves is $\frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

Number of Leaves    Number of Possible Dendrograms
2                   1
3                   3
4                   15
5                   105
...                 ...
10                  34,459,425
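The counts in the table can be checked directly from the formula (2n-3)! / (2^(n-2) (n-2)!). A minimal sketch follows; the function name is ours.

```python
from math import factorial


def num_dendrograms(n):
    """Number of dendrograms with n leaves:
    (2n - 3)! / (2^(n - 2) * (n - 2)!)."""
    if n < 2:
        raise ValueError("the formula is defined for n >= 2 leaves")
    # The quotient is always a whole number, so integer division is exact.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))
```

The value 34,459,425 is reached at only 10 leaves, which is why exhaustive search over trees is not an option.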
18We begin with a distance matrix which contains
the distances between every pair of objects in
our database.
D( , ) = 8    D( , ) = 1
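A distance matrix of this kind can be built from any pairwise distance function. The sketch below assumes the distance is symmetric and that D(x, x) = 0; the helper name and the toy one-dimensional objects are hypothetical.

```python
def distance_matrix(objects, dist):
    """Pairwise distances as a list of lists.
    Assumes dist is symmetric and dist(x, x) == 0."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each pair once and mirror it across the diagonal.
            matrix[i][j] = matrix[j][i] = dist(objects[i], objects[j])
    return matrix


# Toy objects: points on a line, compared by absolute difference.
points = [1.0, 2.0, 9.0]
for row in distance_matrix(points, lambda a, b: abs(a - b)):
    print(row)
```

Only the upper triangle is computed, so n objects need n(n-1)/2 distance evaluations.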
19Bottom-Up (agglomerative): Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best
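The "consider all possible merges, choose the best" loop can be sketched as follows. This is a naive illustration on one-dimensional points, and it assumes that the best pair is the one with the smallest distance between its closest members; that is one possible choice, not necessarily the one used to draw these slides.

```python
def agglomerate(points):
    """Naive bottom-up clustering of 1-D points.
    Returns the merges in the order they happen, as
    (left cluster, right cluster, distance) tuples."""
    clusters = [(p,) for p in points]  # start: each item in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Consider all possible merges...
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # ...and choose the best.
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


for left, right, d in agglomerate([1.0, 2.0, 9.0, 10.5]):
    print(left, "+", right, "at distance", d)
```

The recorded merge distances are exactly the heights at which the corresponding internal nodes would be drawn in the dendrogram.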
23We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is not obvious.
- Single linkage (nearest neighbor): In this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in the different clusters.
- Complete linkage (furthest neighbor): In this method, the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
- Group average linkage: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
- Ward's linkage: In this method, we try to minimize the variance of the merged clusters.
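The four linkage rules above can be written as small functions. This is a sketch under assumptions: clusters are plain lists, `dist` is any pairwise distance, and Ward's rule is shown in its usual form as the increase in within-cluster sum of squares caused by a merge, restricted here to one-dimensional numeric data.

```python
def single_linkage(A, B, dist):
    """Distance between the two closest objects, one from each cluster."""
    return min(dist(a, b) for a in A for b in B)


def complete_linkage(A, B, dist):
    """Distance between the two furthest objects, one from each cluster."""
    return max(dist(a, b) for a in A for b in B)


def average_linkage(A, B, dist):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))


def ward_cost(A, B):
    """Increase in within-cluster sum of squares if A and B are merged
    (1-D numeric data only in this sketch)."""
    mean_a = sum(A) / len(A)
    mean_b = sum(B) / len(B)
    return len(A) * len(B) / (len(A) + len(B)) * (mean_a - mean_b) ** 2


def absolute_difference(x, y):
    return abs(x - y)


A, B = [1.0, 2.0], [9.0, 10.0]
print(single_linkage(A, B, absolute_difference))    # 7.0
print(complete_linkage(A, B, absolute_difference))  # 9.0
print(average_linkage(A, B, absolute_difference))   # 8.0
print(ward_cost(A, B))                              # 64.0
```

The first three rules need only pairwise distances, whereas the Ward cost as written needs cluster means, so it applies to numeric data rather than to arbitrary objects such as strings.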
24Single linkage
Average linkage
Ward's linkage
25- Summary of Hierarchical Clustering Methods
- No need to specify the number of clusters in advance.
- The hierarchical nature maps nicely onto human intuition for some domains.
- They do not scale well: time complexity is at least O(n^2), where n is the total number of objects.
- Like any heuristic search algorithm, they can get stuck in local optima.
- Interpretation of results is (very) subjective.
26Johnson WE, Eizirik E, Pecon-Slattery J, et al. (January 2006). "The late Miocene radiation of modern Felidae: a genetic assessment". Science.
27Irish/Welsh split: must be before 300 AD. Archaic Irish inscriptions date back to the 5th century AD; divergence must have occurred well before this time.
How do we know the dates? If we can get dates, even just upper/lower bounds, for some events, we can interpolate to the rest of the tree.
Gray, R.D. and Atkinson, Q.D., Language tree divergence times support the Anatolian theory of Indo-European origin.
28Do Trees Make Sense for non-Biological Objects?
Gibbon
Sumatran Orangutan
Orangutan
Gorilla
Human
Pygmy Chimp
Chimpanzee
Armenian borrowed so many words from Iranian languages that it was at first considered a branch of the Indo-Iranian languages, and was not recognized as an independent branch of the Indo-European languages for many decades.
- The answer is yes.
- There is a growing body of theoretical and empirical results suggesting that phylogenetic methods work for cultural artifacts.
- Does horizontal transmission invalidate cultural phylogenies? Greenhill, Currie & Gray.
- Branching, blending, and the evolution of cultural similarities and differences among human populations. Collard, Shennan & Tehrani.
"...results show that trees constructed with Bayesian phylogenetic methods are robust to realistic levels of borrowing"
29One trick to test the applicability of phylogenetic methods outside of biology is to test on datasets for which you know the right answer by other methods.
Canadian Football is historically derived from the ancestor of rugby, but today closely resembles the American versions of the game. In this branch of the tree, geography has trumped deeper phylogenetic history.
Here the results are very good, but not perfect.
Gray, RD, Greenhill, SJ, Ross, RM (2007). The
Pleasures and Perils of Darwinizing Culture (with
phylogenies). Biological Theory, 2(4)
31Why would we want to use trees for human
artifacts?
- Because trees are powerful in biology.
- They make predictions.
- Pacific Yew produces taxol, which treats some cancers, but it is expensive. Its nearest relative, the European Yew, was also found to produce taxol.
- They tell us the order of events.
- Which came first, classic geometric spider webs, or messy cobwebs?
- They tell us about:
- Homelands: where did it come from?
- Dates: when did it happen?
- Rates of change
- Ancestral states
32- They tell us the order of events
- Which came first, classic geometric orb webs, or
messy cobwebs?
An orb web is shaped like a circle with spokes.
A cobweb is a tangled mass of fluffy silk that catches insects.
33Most parsimonious cladogram for Turkmen textile
designs.
Example features: lobed gul; lobed gul: birds; lobed gul: clovers (1 = presence, 0 = absence)
Salor:
1 0 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 1 0 1 1 0 1 0
1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 1 1 0 1 1
1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1
0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0
Data is coded as a 90-dimensional binary vector. However, this is arbitrary in two ways: why these 90 features, and does this carpet really have lobed gul clovers?
J. Tehrani, M. Collard / Journal of Anthropological Archaeology 21 (2002) 443-463
34- I. Location of maximum blade width
- 1. Proximal quarter
- 2. Secondmost proximal quarter
- 3. Secondmost distal quarter
- 4. Distal quarter
- II. Base shape
- 1. Arc-shaped
- 2. Normal curve
- 3. Triangular
- 4. Folsomoid
- III. Basal indentation ratio
- 1. No basal indentation
- 2. 0.90-0.99 (shallow)
- 3. 0.80-0.89 (deep)
- IV. Constriction ratio
- 1. 1.00
- 2. 0.90-0.99
- 3. 0.80-0.89
- 4. 0.70-0.79
21225212
Data is coded as an 8-dimensional integer vector.
Cladistics Is Useful for Reconstructing Archaeological Phylogenies: Palaeoindian Points from the Southeastern United States. O'Brien
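Once artifacts are coded as character-state vectors like the one above, two artifacts can be compared character by character. The sketch below counts mismatching character states; note that this simple count is swapped in for illustration and is not the parsimony analysis used in cladistics. The second code string is hypothetical, and single-digit states are assumed.

```python
def encode(code):
    """Turn a string of single-digit character states into integers,
    e.g. "21225212" -> [2, 1, 2, 2, 5, 2, 1, 2]."""
    return [int(ch) for ch in code]


def mismatches(u, v):
    """Number of characters on which two coded artifacts differ."""
    if len(u) != len(v):
        raise ValueError("vectors must code the same characters")
    return sum(1 for a, b in zip(u, v) if a != b)


point_a = encode("21225212")  # the vector shown above
point_b = encode("21235212")  # hypothetical second artifact
print(mismatches(point_a, point_b))  # 1
```

The same function works unchanged on binary presence/absence vectors, where it is the Hamming distance.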
35An Inverted use of Dendrograms
- Suppose you have a new distance measure A, and you want to claim it is better than the old method B.
- You could run some classification experiments and report some numbers.
- A gets 95%, B gets 90%.
- But this is not very forceful, and it does not tell you when you win or lose.
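For reference, the "classification experiment" baseline mentioned above might look like the following leave-one-out nearest-neighbor sketch. The data, labels, and both distance measures are hypothetical and chosen only so that the two measures disagree.

```python
def loo_1nn_accuracy(items, labels, dist):
    """Leave-one-out accuracy of a one-nearest-neighbor classifier."""
    correct = 0
    for i, x in enumerate(items):
        others = [j for j in range(len(items)) if j != i]
        nearest = min(others, key=lambda j: dist(x, items[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(items)


# Hypothetical objects with two attributes each.
items = [(1, 9), (2, 1), (8, 2), (9, 8)]
labels = ["a", "a", "b", "b"]


def measure_a(p, q):
    return abs(p[0] - q[0])  # looks only at the first attribute


def measure_b(p, q):
    return abs(p[1] - q[1])  # looks only at the second attribute


print("A:", loo_1nn_accuracy(items, labels, measure_a))
print("B:", loo_1nn_accuracy(items, labels, measure_b))
```

The output is one number per measure, which is exactly the limitation noted above: it says who wins, but not on which objects or why.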
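36(Figure: dendrograms of twelve objects, numbered 1 to 12, under measures B and A, each marked "One Second". Leaf order under B: 3, 4, 2, 1, 8, 10, 11, 5, 12, 9, 6, 7. Leaf order under A: 11, 12, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, with the groups Grylloidea and Tettigonioidea labeled. Species shown: Cyrtoxipha columbiana, Neocurtilla hexadactyla, Conocephalus nemoralis, Aglaothorax ovata, Amblycorypha huasteca, Hispanogryllus nesion.)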
37(Figure: results for B and A.)
38Note that the algorithm has no access to color information, just texture.
Dictionnaire D'Histoire Naturelle by Charles d'Orbigny, 1849.
39The algorithm can handle very subtle differences.
"Ornaments from the hand-press period. The particularity of this period is the use of blocks of wood to print the ornaments in the books. The specialist historians want to record the individual instances of ornament occurrence and to identify the individual blocks. This identification work is very useful to date the books and to authenticate outputs from some printing-houses and authors. Indeed, numerous editions published in the past centuries do not reveal their true origin on the title page. The blocks could be re-used to print several books, be exchanged between the printing-houses, or be duplicated in the case of damage." Mathieu Delalandre
40Indexing and Mining Rock Art
Rock art is found on every continent except Antarctica. To date, computer science has had little impact on the analysis of rock art. Australia may have 100 million examples.
A decade ago, Walt et al. summed up the state of petroglyph research by noting, "Complete-site and cross-site research thus remains impossible, incomplete, or impressionistic."
41If we assume that we have high-quality binary images of rock art, then we can do clustering, classification, indexing, and motif discovery.
Atlatls
Anthropomorphs
One challenge is designing distance measures. For example, we would like to find two figures similar, even though one is solid and one is hollow.
Bighorn Sheep
Zhu, Wang, Keogh, Lee (2009). Augmenting the
Generalized Hough Transform to Enable the Mining
of Petroglyphs. SIGKDD 2009
42Eamonn Keogh, Computer Science & Engineering Department, University of California, Riverside, eamonn_at_cs.ucr.edu