Title: Data Clustering with Application to Relational Data
1. Data Clustering with Application to Relational Data
- Adam Anthony
- Ph.D. Candidate
- University of Maryland Baltimore County
- Adviser: Marie desJardins
2. Overview
- Clustering Tutorial
- Clustering discussion
- K-means Clustering
- ___-Link Clustering
- Probabilistic Clustering
- My work: Relational Data Clustering
- Relational Data Examples
- Sources of Information in Relational Data Clustering
- Fast approximate Relational Data Clustering
- Relation Selection
- Constraining Solutions in Relational Data Clustering
- Conclusion
3. What Is Data Clustering?
- Clustering: grouping objects into categories without outside input
- Quality of a clustering depends on an objective
- Which clustering is better?
- By rank
- By suit
- By color
- Combinations
4. Clustering: An Intelligence Perspective
- Why is clustering considered an intelligent activity?
- What are the categories?
- Squirrel, Marlin, Salmon, Mouse, Tuna, Bat
- How many faces?
- But there's more to it...
- aardvark, addax, alligator, alpaca, anteater,
antelope, aoudad, ape, argali, armadillo, ass,
baboon, badger, basilisk, bat, bear, beaver,
bighorn, bison, boar, budgerigar, buffalo, bull,
bunny, burro, camel, canary, capybara, cat,
chameleon, chamois, cheetah, chimpanzee,
chinchilla, chipmunk, civet, coati, colt, cony,
cougar, cow, coyote, crocodile, crow, deer,
dingo, doe, dog, donkey, dormouse, dromedary,
duckbill, dugong, eland, elephant, elk, ermine,
ewe, fawn, ferret, finch, fish, fox, frog,
gazelle, gemsbok, gila_monster, giraffe, gnu,
goat, gopher, gorilla, grizzly_bear, ground_hog,
guanaco, guinea_pig, hamster, hare, hartebeest,
hedgehog, hippopotamus, hog, horse, hyena, ibex,
iguana, impala, jackal, jaguar, jerboa, kangaroo,
kid, kinkajou, kitten, koala, koodoo, lamb,
lemur, leopard, lion, lizard, llama, lovebird,
lynx, mandrill, mare, marmoset, marten, mink,
mole, mongoose, monkey, moose, mountain_goat,
mouse, mule, musk_deer, musk_ox, muskrat,
mustang, mynah_bird, newt, ocelot, okapi,
opossum, orangutan, oryx, otter, ox, panda,
panther, parakeet, parrot, peccary, pig,
platypus, polar_bear, pony, porcupine, porpoise,
prairie_dog, pronghorn, puma, puppy, quagga,
rabbit
5. Clustering: An Agent's Perspective
- An agent has three short- and long-range binary sensors:
- Light (high/low)
- Heat (high/low)
- Damaged (yes/no)
- Clustering can be used to predict unknown values
- Recharge station (with fluorescent lightbulb)
- Candle (causes damage)
- How can clustering help this agent?
- Agent can predict and avoid damage using clustering
- Clustering can also filter out irrelevant information
- Add a noise sensor, but noise never causes damage
6. Formal Data Clustering
- Data clustering is dividing a set of data objects into groups such that there is a clear pattern (e.g., similarity to each other) for why objects are in the same cluster
- A clustering algorithm requires:
- A data set D
- A clustering description C
- A clustering objective Obj(C)
- An optimization method Opt(D) → C
- Obj measures the goodness of the best clustering C that Opt(D) can find
7. K-Means Clustering
- D: numeric d-dimensional data
- C: partitioning of data points into k clusters
- Obj(C): Root Mean Squared Error (RMSE)
- Average distance between each object and its cluster's mean value
- Optimization Method:
- Select k random objects as the initial means
- While RMSE_new < RMSE_old:
- Move each object to the cluster with the closest mean
- Recompute each cluster's mean
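The loop above is Lloyd's algorithm; a minimal sketch in Python (using convergence of assignments as the stopping test, and squared Euclidean distance as the assumed metric):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """K-means as outlined on the slide: pick k random objects as the
    initial means, then alternate reassignment and mean recomputation
    until no object changes cluster."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        # Move each object to the cluster with the closest mean.
        for i, p in enumerate(points):
            c = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            if c != assign[i]:
                assign[i], changed = c, True
        if not changed:
            break
        # Recompute each cluster's mean from its current members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                means[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, means
```

Note that, as the slides imply, the result depends on the random initial means; running several restarts and keeping the lowest-RMSE solution is the usual remedy.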
8. K-Means Demo
9. ___-Link Clustering
- Initialize each object in its own cluster
- Compute the cluster distance matrix M by the selected criterion (below)
- While there is more than one cluster:
- Join the clusters with the shortest distance
- Update M by the selected criterion
- Criterion for ___-link clustering:
- Single-link: use the distance of the closest objects between two clusters
- Complete-link: use the distance of the most distant objects between the two clusters
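The agglomerative loop above, with the blank filled by either criterion, can be sketched as follows (a naive O(n³) version that recomputes cluster distances directly rather than maintaining the matrix M; Euclidean distance is an assumption):

```python
def agglomerate(points, linkage="single"):
    """___-link clustering: start with singleton clusters and
    repeatedly join the closest pair, returning the merge order.
    `linkage` is "single" (closest objects) or "complete" (most
    distant objects)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):
        pair = [dist(points[i], points[j]) for i in c1 for j in c2]
        return min(pair) if linkage == "single" else max(pair)

    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the shortest distance.
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```

Cutting the merge sequence early (stopping at k clusters instead of one) yields a flat clustering.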
10. ___-Link Demo
- How can we measure the distance between these clusters?
- What is best for:
- Spherical data (above)?
- Chain-like data?
[Figure: complete-link distance vs. single-link distance between two clusters]
11. Probabilistic Clustering
- Objective: Obj_prob(C) = P(C | θ) · P(D | C, θ)
- There are many ways to optimize such a clustering, including Expectation Maximization and Simulated Annealing
- P(C | θ) is called the prior on C and lets us control the kinds of clusterings that are found
- Balanced-size clusters, lots of little clusters, a few big clusters, etc.
- P(D | C, θ) is where the interesting application-specific work is performed
12. Probabilistic Clustering with Simulated Annealing
- Use Maximum Likelihood Estimators for the parameters θ
- Use simulated annealing to find the optimal C:
- Start with a random C_0 and temperature T_0, then iterate:
- Perturb a small portion of C_i, store as C_{i+1}
- Re-estimate MLE(θ), given C_{i+1}
- Compute ΔL = Obj_prob(C_{i+1}) − Obj_prob(C_i)
- If ΔL > 0, or with probability e^{ΔL/T}, keep solution C_{i+1}
- Else, revert to solution C_i
- T_{i+1} = T_i / t_s  // t_s is a number slightly greater than 1
- Stop when there is little or no change between iterations
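The annealing loop above can be sketched generically; `perturb` and `objective` stand in for the model-specific pieces (perturbing C and evaluating Obj_prob), and the default temperature schedule values are illustrative, not from the talk:

```python
import math
import random

def anneal(initial_c, perturb, objective, t0=1.0, t_s=1.003, steps=5000, seed=0):
    """Simulated annealing per the slide: perturb the current solution,
    always keep improvements, keep worse moves with probability
    exp(dL / T), and cool by dividing T by t_s each step."""
    rng = random.Random(seed)
    c, t = initial_c, t0
    score = objective(c)
    best, best_score = c, score
    for _ in range(steps):
        cand = perturb(c, rng)           # perturb C_i to get C_{i+1}
        cand_score = objective(cand)
        d = cand_score - score           # dL = Obj(C_{i+1}) - Obj(C_i)
        if d > 0 or rng.random() < math.exp(d / t):
            c, score = cand, cand_score  # keep C_{i+1}
            if score > best_score:
                best, best_score = c, score
        else:
            pass                         # revert to C_i (c unchanged)
        t /= t_s                         # T_{i+1} = T_i / t_s
    return best, best_score
```

Early on (large T) the walk explores freely; as T shrinks, exp(dL/T) vanishes for downhill moves and the search becomes greedy.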
13. My Research: Clustering Relational Data
14. Relational Data
- Formally:
- A set of Object Domains
- Sets of instances from those domains
- Sets of relational tuples between instances
- Simplifications:
- Attribute Vectors
- (Attributed) graphs, when compatible
- In Practice:
- Relational Data refers only to data that requires the use of tuples
[Figure: example instances with attributes, e.g. Fred, M; Sally, F; Joe, M]
15. Some Relational Data Examples
- Domains: People, demographic attributes
- Relations: Friendship, Group
- Domains: Documents, words
- Relations: Directed cross-document references
- (Internet Movie Database)
- Domains: Actors, Directors, Movies, demographic attributes
- Relations: Worked-Together, Directed, Acted-In
16. Observation: Attributes and Relations Encode Unique Information
- Internet Movie Database subset: 508 actors
- 7 binary features: has_award, act_drama, act_comedy, experienced, gender, popularity, many_movies
- Ground-truth clustering: currently active actors and (semi-)retired actors
- Adjusted Rand Index for partition comparison (closeness of partition A to partition B, not a percentage)
- ARI between Features Only and Graph Only: 0.51
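For reference, the Adjusted Rand Index used above can be computed from the standard contingency-table formula (this is the textbook definition, not code from the talk):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two partitions given as label lists. 1.0 means the
    partitions are identical (up to relabeling); values near 0 mean
    chance-level agreement, and negative values are possible."""
    n = len(a)
    nij = Counter(zip(a, b))          # contingency-table cells n_ij
    ai, bj = Counter(a), Counter(b)   # row sums a_i, column sums b_j
    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(v, 2) for v in ai.values())
    sum_b = sum(comb(v, 2) for v in bj.values())
    expected = sum_a * sum_b / comb(n, 2)   # chance agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # e.g. both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The chance correction is what makes ARI a comparison score rather than a percentage, as the slide notes.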
17. Types of Similarity in Relational Data
- Attribute Similarity
- Can we sort Facebook users into categories based on demographic similarity?
- Structural (Relational) Similarity
- Can we categorize an actress based on the people she has worked with?
- Correlation Similarity (contribution)
- Given two blogs that are connected by reciprocal URLs, how likely are they to cover similar/different topics?
18. Modeling Different Similarities
- Hypothesis: A model that uses one or a combination of attribute, structural, and/or correlation similarity will be able to find non-trivial clusterings that contrast with what other models may find.
19. The Probabilistic Relational Clustering Framework (PRCF)
20. Probabilistic Relational Clustering Framework (Cont.)
- Prior probability of observing C
- Attribute Similarity Probability
- Structural and/or correlation similarity probability
- b_{C_i,C_j}: specifies the block of edges between clusters i and j
21. Improving Link Prediction with a Novel PRCF Model
- For the edge model, most researchers choose: [equation]
- I proposed a new edge model: [equation]
- Experiment:
- Withhold a fraction of edges from an artificial graph as a test set
- Remaining edges are the training set
- Learn several models with more and more training-set edges observed
- AUC: area under the ROC curve, a measure of classification performance
22. Current Work: Block Modularity (joint work with Michael Lombardi '10)
- Block b_ij: set of edges falling into the block between clusters i and j
- If some blocks are dense and the rest are sparse, we can generate a summary graph and state that objects in the same cluster have high structural similarity
23. Optimizing Block Modularity
- Starting with a random partition C:
- Iterate until convergence:
- Compute a new partition C' by assigning each object to the cluster that would increase the block modularity objective the most
- Let C = C'
- Preliminary results:
- Block modularity: 3-10 iterations on a graph with 100 vertices
- Simulated annealing with PRCF: 10,000 iterations
- Block modularity algorithm is significantly easier to program than any PRCF model and finds the same solution
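The greedy local-move loop above can be sketched generically. The actual block modularity score is defined in the thesis work; here `objective` is a stand-in callable over partitions, and the deterministic initial partition is an assumption for reproducibility:

```python
def greedy_partition(n_objects, k, objective, max_rounds=50):
    """Greedy optimization per the slide: repeatedly move each object
    to the cluster that increases the objective the most; stop when a
    full pass over the objects changes nothing."""
    part = [i % k for i in range(n_objects)]  # illustrative initial partition
    for _ in range(max_rounds):
        changed = False
        for i in range(n_objects):
            old = part[i]
            # Score every candidate cluster for object i, keep the best
            # (ties broken toward the higher cluster index).
            scores = []
            for c in range(k):
                part[i] = c
                scores.append((objective(part), c))
            best_score, best_c = max(scores)
            part[i] = best_c
            if best_c != old:
                changed = True
        if not changed:
            break
    return part
```

Each pass costs one objective evaluation per object per cluster, which is consistent with the slide's observation that convergence in a handful of passes is far cheaper than thousands of annealing iterations.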
24. Relation Selection
- New application of block modularity:
- Generate an initial clustering
- Use a randomized single-link clustering where the distance is measured by the fraction of common neighbors
- Measure the block modularity score for this clustering
- Average over several runs
- Experiment: take a graph with obvious clusters and rewire some edges
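The common-neighbor distance driving the initial clustering might look like the following; the exact normalization used in the talk is not specified, so this sketch assumes the Jaccard overlap of neighbor sets:

```python
def common_neighbor_distance(adj, u, v):
    """Distance between two vertices as one minus the fraction of
    common neighbors (Jaccard overlap). `adj` maps each vertex to its
    set of neighbors."""
    nu, nv = adj[u], adj[v]
    union = nu | nv
    if not union:
        return 1.0  # no neighbors at all: maximally distant
    return 1.0 - len(nu & nv) / len(union)
```

Vertices inside an "obvious" cluster share most neighbors and so land close together, which is why rewiring edges (the experiment on the slide) degrades the resulting block modularity score.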
25. Observation: Sometimes Relations Are Ambiguous
- Assume attributes can't identify CS/CHEM
- We know that they belong apart
- Solution: Constrained Clustering
[Figure: graph of CS, CHEM, and Bio nodes]
26. Constrained Relational Clustering (joint work with Paul Guseman '09)
- Add constraints: Must-Link and Cannot-Link
- Constrain the original algorithm (e.g., PRCF) so that no (or very few) constraints are violated
- Constrain Obj(C): penalize the score for broken constraints
- Constrain Opt(D): avoid solutions with broken constraints
[Figure: graph of CS, CHEM, and Bio nodes with must-link and cannot-link constraints]
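The "constrain Obj(C)" option above can be sketched as a wrapper that subtracts a penalty per violated pair; the `weight` knob is illustrative, not a value from the talk:

```python
def penalized_objective(objective, must_link, cannot_link, weight=10.0):
    """Wrap a clustering objective so that broken Must-Link and
    Cannot-Link constraints reduce the score. Pairs are (i, j) object
    indices; `partition` is a list of cluster labels."""
    def obj(partition):
        # A must-link pair is broken when its objects are separated;
        # a cannot-link pair is broken when they share a cluster.
        broken = sum(partition[a] != partition[b] for a, b in must_link)
        broken += sum(partition[a] == partition[b] for a, b in cannot_link)
        return objective(partition) - weight * broken
    return obj
```

A large `weight` approximates hard constraints, while a small one merely biases the search, matching the slide's "no (or very few) constraints are violated" framing.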
27. Future Work
- Relation Extraction/Adjustment
- Denser relations dominate solutions over sparser relations
- Degree Prediction
- Given a document's topic and proposed outlinks, what is the expected number of references to my blog?
- Abstract Relation Type Discovery
- Given a set of unlabeled edges, can they be split into distinct relation types?
28. Conclusion
- Combining attribute similarity and structural similarity boosts performance compared to using either individually
- New source of information: correlation similarity
- Improves link prediction performance
- Block Modularity: a fast (and simple) algorithm for optimizing block models
- Constrained Clustering will help to avoid ambiguous clustering scenarios
- Relation Selection: block modularity can help to quickly decide if a relation has high-quality structure