Dendrograms for Data Mining - PowerPoint PPT Presentation

Learn more at: http://www.cs.ucr.edu
1
Eamonn Keogh
Dendrograms for Data Mining
2
What is Clustering?
Also called unsupervised learning; sometimes
called classification by statisticians,
sorting by psychologists, and segmentation by
people in marketing.
  • Informally, finding natural groupings or
    relationships among objects.

3
What is a natural grouping among these objects?
4
What is a natural grouping among these objects?
Clustering is subjective
Simpson's Family
Males
Females
School Employees
5
Two Types of Clustering
  • Partitional algorithms: Construct various
    partitions and then evaluate them by some
    criterion.
  • Hierarchical algorithms: Create a hierarchical
    decomposition of the set of objects using some
    criterion.

Partitional
Hierarchical
6
What is Similarity?
Webster's Dictionary: "The quality or state of
being similar; likeness; resemblance; as, a
similarity of features."
Similarity is hard to define, but "we know it
when we see it." The real meaning of similarity
is a philosophical question. We will take a more
pragmatic approach.
7
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the
universe of possible objects. The distance
(dissimilarity) between O1 and O2 is a real
number denoted by D(O1,O2).
[Figure: candidate values for D(Peter, Piotr): 0.23, 3, 342.7]
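One concrete choice of D for strings is edit distance, which a later slide uses for the name-clustering demonstration. The sketch below is a standard Levenshtein implementation with unit costs for insertion, deletion, and substitution; the function name `edit_distance` is illustrative, and the unit costs are an assumption, since the slides do not state which costs were used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Peter", "Piotr"))  # 3
```

Under these unit costs D(Peter, Piotr) is 3, one of the candidate values listed on the slide.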
8
A Useful Tool for Summarizing Similarity
Measurements
Introducing the dendrogram (related terms:
cladogram, phylogenetic tree, phylogram).
The dissimilarity between two objects in a
dendrogram is represented as the height of the
lowest internal node they share: the lower that
node, the more similar the objects.
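To make the "lowest shared internal node" rule concrete, here is a minimal sketch that stores a dendrogram as nested tuples and looks up the height at which two leaves first join. The tuple layout `(height, left, right)`, the function names, and the heights in the example tree are all assumptions made for illustration; they are not taken from the slides.

```python
def leaves(node):
    """Return the set of leaf labels under a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def merge_height(node, a, b):
    """Height of the lowest internal node that contains both a and b."""
    if isinstance(node, str):
        raise ValueError("labels must be distinct leaves of the tree")
    height, left, right = node
    for child in (left, right):
        if {a, b} <= leaves(child):
            return merge_height(child, a, b)  # a lower shared node exists
    if {a, b} <= leaves(node):
        return height  # a and b sit on opposite sides of this node
    raise ValueError("labels must be distinct leaves of the tree")


# Hypothetical tree: internal nodes are (height, left, right), leaves are strings.
tree = (5.0,
        (1.0, "Piotr", "Pyotr"),
        (3.0, "Peter", (2.0, "Pedro", "Peder")))

print(merge_height(tree, "Piotr", "Pyotr"))  # 1.0
print(merge_height(tree, "Peter", "Pedro"))  # 3.0
print(merge_height(tree, "Piotr", "Peder"))  # 5.0
```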
9
There is only one dataset that can be perfectly
clustered using a hierarchy
(Bovine:0.69395,(Spider Monkey:0.390,
(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,
(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939)
10
Note that hierarchies are commonly used to
organize information, for example in a web
portal. Yahoo's hierarchy is manually created;
we will focus on automatic creation of
hierarchies in data mining.
[Figure: fragment of the Yahoo directory. Top level: Business & Economy. Second level: B2B, Finance, Shopping, Jobs. Third level: Aerospace, Agriculture, Banking, Bonds, Animals, Apparel, Career, Workspace]
11
A Demonstration of Hierarchical Clustering using
String Edit Distance
Pedro (Portuguese): Petros (Greek), Peter
(English), Piotr (Polish), Peadar (Irish),
Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr (Russian)
Cristovao (Portuguese): Christoph
(German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer
(Scandinavian), Krystof (Czech), Christopher
(English)
Miguel (Portuguese): Michalis (Greek),
Michael (English), Mick (Irish!)
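[Figure: dendrogram of the names under string edit distance. Leaf labels, in order: Piotr, Pyotr, Peka, Peter, Piero, Pietro, Pierre, Petros, Peadar, Mick, Pedro, Peder, Miguel, Krystof, Michael, Michalis, Crisdean, Cristobal, Cristovao, Christoph, Kristoffer, Cristoforo, Christophe, Christopher]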
12
Pedro (Portuguese/Spanish): Petros (Greek), Peter
(English), Piotr (Polish), Peadar (Irish),
Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian
Alternative), Petr (Czech), Pyotr (Russian)
[Figure: dendrogram of the Pedro variants. Leaf labels, in order: Piotr, Peka, Peter, Pedro, Piero, Pyotr, Peder, Pietro, Pierre, Petros, Peadar]
13
  • Hierarchical clustering can sometimes show
    patterns that are meaningless or spurious.
  • For example, in this clustering, the tight
    grouping of Australia, Anguilla, St. Helena, etc.
    is meaningful, since all of these are current or
    former British territories.
  • However, the tight grouping of Niger and India is
    completely spurious; there is no connection
    between the two.

14
  • The flag of Niger is orange over white over
    green, with an orange disc on the central white
    stripe, symbolizing the sun. The orange stands
    for the Sahara desert, which borders Niger to the
    north. Green stands for the grassy plains of the
    south and west and for the River Niger, which
    sustains them. It also stands for fraternity and
    hope. White generally symbolizes purity and hope.
  • The Indian flag is a horizontal tricolor in
    equal proportion of deep saffron on the top,
    white in the middle, and dark green at the bottom.
    In the center of the white band, there is a wheel
    in navy blue to indicate the Dharma Chakra, the
    wheel of law in the Sarnath Lion Capital. This
    center symbol, or the 'CHAKRA', is a symbol dating
    back to the 2nd century BC. The saffron stands for
    courage and sacrifice; the white, for purity and
    truth; the green, for growth and auspiciousness.

15
We can look at the dendrogram to determine the
"correct" number of clusters. In this case, the
two highly separated subtrees are highly
suggestive of two clusters. (Things are rarely
this clear-cut, unfortunately.)
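One way to mimic this visual judgment in code is to look for the largest jump between consecutive merge heights and cut the tree inside that gap. The sketch below is a heuristic under that assumption, not a method stated in the slides; the function name and the example heights are made up for illustration.

```python
def suggest_num_clusters(merge_heights):
    """Suggest a cluster count by cutting the dendrogram inside the
    largest gap between consecutive merge heights (a heuristic only)."""
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merges to compare gaps")
    n = len(heights) + 1  # n leaves produce n - 1 merges
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    i_best = max(range(len(gaps)), key=gaps.__getitem__)
    # Cutting above heights[i_best] keeps the first i_best + 1 merges,
    # which leaves n - (i_best + 1) clusters.
    return n - (i_best + 1)


# Hypothetical merge heights for 6 objects: four low merges, one very high one.
print(suggest_num_clusters([0.5, 0.7, 0.9, 1.0, 6.0]))  # 2
```

When no gap clearly dominates, the suggestion is not meaningful, which matches the caveat that things are rarely this clear-cut.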
16
One potential use of a dendrogram is to detect
outliers
The single isolated branch is suggestive of a
data point that is very different from all others.
[Figure: dendrogram with one isolated branch labeled "Outlier"]
17
(How-to) Hierarchical Clustering
Since we cannot test all possible trees, we will
have to perform a heuristic search of the space of
possible trees. We could do this in two ways.
Bottom-Up (agglomerative): Starting with each item
in its own cluster, find the best pair to merge
into a new cluster. Repeat until all clusters are
fused together.
Top-Down (divisive): Starting with all the data in
a single cluster, consider every possible way to
divide the cluster into two. Choose the best
division and recursively operate on both sides.
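To see why "every possible way to divide the cluster into two" is costly, the sketch below enumerates all two-way splits of a small set. A cluster of n distinct items has 2^(n-1) - 1 such splits. The function name is illustrative and the items are assumed to be distinct.

```python
from itertools import combinations


def bipartitions(items):
    """Yield every split of items into two non-empty groups, once each.
    Assumes the items are distinct."""
    first, rest = items[0], items[1:]
    # Pin the first item to the left group so that (A, B) and (B, A)
    # are not both produced.
    for r in range(len(rest)):  # r = how many other items join `first`
        for chosen in combinations(rest, r):
            left = [first, *chosen]
            right = [x for x in rest if x not in chosen]
            yield left, right


splits = list(bipartitions(["a", "b", "c", "d"]))
print(len(splits))  # 7, i.e. 2**(4 - 1) - 1
```

A divisive method that scored every split would have to evaluate an exponentially growing number of candidates at each level, which is one reason practical methods rely on heuristics.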
  • The number of dendrograms with n leaves is
    $\frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

    Number of leaves   Number of possible dendrograms
    2                  1
    3                  3
    4                  15
    5                  105
    ...
    10                 34,459,425
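The counts in the table can be checked directly from the formula (2n-3)! / (2^(n-2) (n-2)!). The short sketch below does so with exact integer arithmetic; the function name is illustrative.

```python
from math import factorial


def num_dendrograms(n: int) -> int:
    """Number of distinct dendrograms with n leaves:
    (2n - 3)! / (2**(n - 2) * (n - 2)!)."""
    if n < 2:
        raise ValueError("need at least two leaves")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))
# 2 1
# 3 3
# 4 15
# 5 105
# 10 34459425
```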

18
We begin with a distance matrix which contains
the distances between every pair of objects in
our database.
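[Figure: distance matrix over the example objects, with two entries called out: D( , ) = 8 and D( , ) = 1; the object images are not in the transcript]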
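A distance matrix can be built from any pairwise distance function. The sketch below assumes the objects are numeric points and uses Euclidean distance (`math.dist`) as a stand-in; the slides' own objects and distance values are images that did not survive in the transcript, so the points here are hypothetical.

```python
from math import dist


def distance_matrix(objects, d):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(objects)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # d is assumed symmetric, so each pair is computed once.
            D[i][j] = D[j][i] = d(objects[i], objects[j])
    return D


points = [(0, 0), (3, 4), (6, 8)]  # hypothetical 2-D objects
D = distance_matrix(points, dist)
for row in D:
    print(row)
```

Only n(n-1)/2 entries need to be computed, but that is still quadratic in the number of objects.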
19
Bottom-Up (agglomerative): Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

20
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

21
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

22
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

Consider all possible merges
Choose the best
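The repeated "consider all possible merges, choose the best" loop can be sketched as follows. This is a naive illustration, not the presenter's implementation: it takes a precomputed distance matrix, and the `linkage` parameter (an assumption of this sketch) reduces the pairwise distances between two clusters to a single number, with `min` giving single linkage and `max` giving complete linkage.

```python
def agglomerative(D, linkage=min):
    """Bottom-up clustering on a symmetric distance matrix D.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of original row indices.
    """
    clusters = [(i,) for i in range(len(D))]  # every item starts alone
    merges = []
    while len(clusters) > 1:
        # Consider all possible merges ...
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage([D[i][j] for i in clusters[x] for j in clusters[y]])
                if best is None or d < best[0]:
                    best = (d, x, y)
        # ... and choose the best one.
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = tuple(sorted(clusters[x] + clusters[y]))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


points = [0, 1, 5, 6, 20]  # hypothetical 1-D data
D = [[abs(p - q) for q in points] for p in points]
for a, b, d in agglomerative(D):
    print(a, b, d)
```

This version recomputes every inter-cluster distance on each pass, so it is far slower than necessary; it is meant only to mirror the steps on the slides.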

23
We know how to measure the distance between two
objects, but defining the distance between an
object and a cluster, or between two clusters,
is not obvious.
  • Single linkage (nearest neighbor): In this
    method, the distance between two clusters is
    determined by the distance between the two
    closest objects (nearest neighbors) in the
    different clusters.
  • Complete linkage (furthest neighbor): In this
    method, the distance between two clusters is
    determined by the greatest distance between any
    two objects in the different clusters (i.e., by
    the "furthest neighbors").
  • Group average linkage: In this method, the
    distance between two clusters is calculated as
    the average distance between all pairs of objects
    in the two different clusters.
  • Ward's linkage: In this method, we try to
    minimize the variance of the merged clusters.
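The four criteria above can be written as small functions over two clusters of points. The sketch assumes Euclidean points and uses `math.dist`; the function names are illustrative. For Ward's linkage it uses the common formulation in which the merge cost is the increase in within-cluster sum of squared distances, |A||B| / (|A| + |B|) times the squared distance between centroids; the slides only say "minimize the variance", so this exact formula is an assumption.

```python
from itertools import product
from math import dist


def single_linkage(A, B):
    """Distance between the two closest objects, one from each cluster."""
    return min(dist(a, b) for a, b in product(A, B))


def complete_linkage(A, B):
    """Distance between the two furthest objects, one from each cluster."""
    return max(dist(a, b) for a, b in product(A, B))


def average_linkage(A, B):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def ward_cost(A, B):
    """Increase in within-cluster sum of squares if A and B are merged."""
    scale = len(A) * len(B) / (len(A) + len(B))
    return scale * dist(centroid(A), centroid(B)) ** 2


A = [(0, 0), (0, 2)]  # hypothetical clusters
B = [(3, 0), (3, 2)]
print(single_linkage(A, B), complete_linkage(A, B),
      average_linkage(A, B), ward_cost(A, B))
```

For any pair of clusters, the single-linkage value is never larger than the average, and the average is never larger than the complete-linkage value.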

24
[Figure: dendrograms of the same data under single linkage, average linkage, and Ward's linkage]
25
  • Summary of Hierarchical Clustering Methods
  • No need to specify the number of clusters in
    advance.
  • The hierarchical nature maps nicely onto human
    intuition for some domains.
  • They do not scale well: time complexity of at
    least O(n^2), where n is the total number of
    objects.
  • Like any heuristic search algorithm, local
    optima are a problem.
  • Interpretation of results is (very) subjective.

26
Johnson WE, Eizirik E, Pecon-Slattery J, et al.
(January 2006). "The late Miocene radiation of
modern Felidae: a genetic assessment". Science.
27
Irish/Welsh split: must be before 300 AD. Archaic
Irish inscriptions date back to the 5th century
AD; divergence must have occurred well before
this time.
How do we know the dates? If we can get dates,
even upper/lower bounds, for some events, we can
interpolate to the rest of the tree.
Gray, R.D. and Atkinson, Q.D., "Language-tree
divergence times support the Anatolian theory of
Indo-European origin".
28
Do Trees Make Sense for non-Biological Objects?
[Figure: primate tree with leaves Gibbon, Sumatran Orangutan, Orangutan, Gorilla, Human, Pygmy Chimp, Chimpanzee]
Armenian borrowed so many words from Iranian
languages that it was at first considered a
branch of the Indo-Iranian languages, and was not
recognized as an independent branch of the
Indo-European languages for many decades.
  • The answer is "Yes".
  • There are increasing theoretical and empirical
    results to suggest that phylogenetic methods work
    for cultural artifacts.
  • "Does horizontal transmission invalidate cultural
    phylogenies?" Greenhill, Currie & Gray.
  • "Branching, blending, and the evolution of
    cultural similarities and differences among human
    populations." Collard, Shennan, Tehrani.

"...results show that trees constructed with
Bayesian phylogenetic methods are robust to
realistic levels of borrowing"
29
One trick to test the applicability of phylogenetic
methods outside of biology is to test them on
datasets for which you know the right answer by
other methods.
Canadian Football is historically derived from
the ancestor of rugby, but today closely
resembles the American versions of the game. In
this branch of the tree, geography has trumped
deeper phylogenetic history.
Here the results are very good, but not perfect.
Gray, RD, Greenhill, SJ, Ross, RM (2007). The
Pleasures and Perils of Darwinizing Culture (with
phylogenies). Biological Theory, 2(4).
31
Why would we want to use trees for human
artifacts?
  • Because trees are powerful in biology:
    • They make predictions.
      • Pacific Yew produces taxol, which treats some
        cancers, but it is expensive. Its nearest
        relative, the European Yew, was also found to
        produce taxol.
    • They tell us the order of events.
      • Which came first, classic geometric spider
        webs or messy cobwebs?
    • They tell us about:
      • Homelands: where did it come from?
      • Dates: when did it happen?
      • Rates of change
      • Ancestral states

32
  • They tell us the order of events
  • Which came first, classic geometric orb webs, or
    messy cobwebs?

An orb web is shaped like a circle with spokes.
A cobweb is a tangled mass of fluffy silk that
catches insects.
33
Most parsimonious cladogram for Turkmen textile
designs.
Example design features: lobed gul; lobed gul birds;
lobed gul clovers (1 = presence, 0 = absence)
Salor:
1 0 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 1 0 1 1 0 1 0
1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 1 1 0 1 1
1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1
0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0
Data is coded as a 90-dimensional binary vector.
However, this is arbitrary in two ways: why
these 90 features, and does this carpet really
have lobed gul clovers?
J. Tehrani, M. Collard / Journal of
Anthropological Archaeology 21 (2002) 443-463
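Once artifacts are coded as presence/absence vectors, one simple way to compare two codings is to count the positions where they disagree (Hamming distance). This is offered only as a sketch of a distance on such vectors; it is not the parsimony analysis used to build the cladogram, and the two short vectors below are hypothetical, not data from the paper.

```python
def hamming(u, v):
    """Number of positions at which two equal-length codings differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Hypothetical 8-feature codings of two textiles (1 = present, 0 = absent).
textile_a = [1, 0, 1, 1, 0, 1, 0, 1]
textile_b = [1, 1, 1, 0, 0, 1, 0, 0]
print(hamming(textile_a, textile_b))  # 3
```

The result depends entirely on which features were chosen and how they were scored, which is the arbitrariness the slide points out.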
34
  • I. Location of maximum blade width
    • 1. Proximal quarter
    • 2. Secondmost proximal quarter
    • 3. Secondmost distal quarter
    • 4. Distal quarter
  • II. Base shape
    • 1. Arc-shaped
    • 2. Normal curve
    • 3. Triangular
    • 4. Folsomoid
  • III. Basal indentation ratio
    • 1. No basal indentation
    • 2. 0.90-0.99 (shallow)
    • 3. 0.80-0.89 (deep)
  • IV. Constriction ratio
    • 1. 1.00
    • 2. 0.90-0.99
    • 3. 0.80-0.89
    • 4. 0.70-0.79

Example coding: 21225212
Data is coded as an 8-dimensional integer vector.
"Cladistics Is Useful for Reconstructing
Archaeological Phylogenies: Palaeoindian Points
from the Southeastern United States." O'Brien
35
An Inverted use of Dendrograms
  • Suppose you have a new distance measure A, and
    you want to claim it is better than the old
    method B.
  • You could run some classification experiments and
    report some numbers...
  • A gets 95%, B gets 90%.
  • But this is not very forceful, and it does not
    tell you when you win or lose.

36
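[Figure: the same 12 objects clustered under distance measures B and A, each panel with a "One Second" scale bar. Groups: Grylloidea and Tettigonioidea. Leaf order under B: 3, 4, 2, 1, 8, 10, 11, 5, 12, 9, 6, 7. Leaf order under A: 11, 12, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6. Species labels: Cyrtoxipha columbiana, Neocurtilla hexadactyla, Conocephalus nemoralis, Aglaothorax ovata, Amblycorypha huasteca, Hispanogryllus nesion]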
37
[Figure: panels labeled B and A]
38
Note that the algorithm has no access to color
information, just texture.
Dictionnaire D'Histoire Naturelle by Charles
Orbigny, 1849.
39
The algorithm can handle very subtle differences.
"Ornaments from the Hand-Press period. The
particularity of this period is the use of blocks
of wood to print the ornaments on the books. The
specialist historians want to record the
individual instances of ornament occurrence and
to identify the individual blocks. This
identification work is very useful to date the
books and to authenticate outputs from some
printing-houses and authors. Indeed, numerous
editions published in the past centuries do not
reveal on the title page their true origin. The
blocks could be re-used to print several books,
be exchanged between the printing-houses, or be
duplicated in the case of damage." (Mathieu
Delalandre)
40
Indexing and Mining Rock Art
Rock art is found on every continent except
Antarctica. To date, computer science has
had little impact on the analysis of rock art.
Australia may have 100 million examples.
A decade ago, Walt et al. summed up the state of
petroglyph research by noting, "Complete-site and
cross-site research thus remains impossible,
incomplete, or impressionistic."
41
If we assume that we have high-quality binary
images of rock art, then we can do clustering,
classification, indexing, and motif discovery.
[Figure: example petroglyph classes: Atlatls, Anthropomorphs, Bighorn Sheep]
One challenge is designing distance
measures. For example, we would like to find
two shapes [images not in transcript] similar,
even though one is solid and one is hollow.
Zhu, Wang, Keogh, Lee (2009). Augmenting the
Generalized Hough Transform to Enable the Mining
of Petroglyphs. SIGKDD 2009.
42
Eamonn Keogh, Computer Science & Engineering
Department, University of California,
Riverside. eamonn_at_cs.ucr.edu