Title: Data Clustering with Application to Relational Data
1. Data Clustering with Application to Relational Data
- Adam Anthony
- Ph.D. Candidate
- University of Maryland Baltimore County
- Adviser: Marie desJardins
2. Overview
- Clustering Tutorial
- Clustering discussion
- K-means Clustering
- ___-Link Clustering
- Probabilistic Clustering
- My work: Relational Data Clustering
- Relational Data Examples
- Sources of Information in Relational Data Clustering
- Fast approximate Relational Data Clustering
- Relation Selection
- Constraining Solutions in Relational Data Clustering
- Conclusion
3. What Is Data Clustering?
- Clustering: grouping objects into categories without outside input
- Quality of a clustering depends on an objective
- Which clustering is better?
- By rank
- By suit
- By color
- Combinations
4. Clustering: An Intelligence Perspective
- Why is clustering considered an intelligent activity?
- What are the categories?
- Squirrel, Marlin, Salmon, Mouse, Tuna, Bat
- How many faces?
- But there's more to it...
- aardvark, addax, alligator, alpaca, anteater,
antelope, aoudad, ape, argali, armadillo, ass,
baboon, badger, basilisk, bat, bear, beaver,
bighorn, bison, boar, budgerigar, buffalo, bull,
bunny, burro, camel, canary, capybara, cat,
chameleon, chamois, cheetah, chimpanzee,
chinchilla, chipmunk, civet, coati, colt, cony,
cougar, cow, coyote, crocodile, crow, deer,
dingo, doe, dog, donkey, dormouse, dromedary,
duckbill, dugong, eland, elephant, elk, ermine,
ewe, fawn, ferret, finch, fish, fox, frog,
gazelle, gemsbok, gila_monster, giraffe, gnu,
goat, gopher, gorilla, grizzly_bear, ground_hog,
guanaco, guinea_pig, hamster, hare, hartebeest,
hedgehog, hippopotamus, hog, horse, hyena, ibex,
iguana, impala, jackal, jaguar, jerboa, kangaroo,
kid, kinkajou, kitten, koala, koodoo, lamb,
lemur, leopard, lion, lizard, llama, lovebird,
lynx, mandrill, mare, marmoset, marten, mink,
mole, mongoose, monkey, moose, mountain_goat,
mouse, mule, musk_deer, musk_ox, muskrat,
mustang, mynah_bird, newt, ocelot, okapi,
opossum, orangutan, oryx, otter, ox, panda,
panther, parakeet, parrot, peccary, pig,
platypus, polar_bear, pony, porcupine, porpoise,
prairie_dog, pronghorn, puma, puppy, quagga,
rabbit
5. Clustering: An Agent's Perspective
- An agent has three short- and long-range binary sensors:
- Light (high/low)
- Heat (high/low)
- Damaged (yes/no)
- Clustering can be used to predict unknown values
- Recharge station (with fluorescent lightbulb)
- Candle (causes damage)
- How can clustering help this agent?
- Agent can predict and avoid damage using clustering
- Clustering can also filter out irrelevant information
- Add a noise sensor, but noise never causes damage
6. Formal Data Clustering
- Data clustering is dividing a set of data objects into groups such that there is a clear pattern (e.g., similarity to each other) for why objects are in the same cluster
- A clustering algorithm requires:
- A data set D
- A clustering description C
- A clustering objective Obj(C)
- An optimization method Opt(D) → C
- Obj measures the goodness of the best clustering C that Opt(D) can find
7. K-Means Clustering
- D: numeric d-dimensional data
- C: partitioning of data points into k clusters
- Obj(C): Root Mean Squared Error (RMSE)
- Average distance between each object and its cluster's mean value
- Optimization Method:
- Select k random objects as the initial means
- While RMSE_new < RMSE_old:
- Move each object to the cluster with the closest mean
- Recompute each cluster's mean
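The loop above is Lloyd's algorithm; a minimal sketch in Python (using convergence of assignments as the stopping test, and squared Euclidean distance as the assumed metric):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """K-means as outlined on the slide: pick k random objects as the
    initial means, then alternate reassignment and mean recomputation
    until no object changes cluster."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        # Move each object to the cluster with the closest mean.
        for i, p in enumerate(points):
            c = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            if c != assign[i]:
                assign[i], changed = c, True
        if not changed:
            break
        # Recompute each cluster's mean from its current members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                means[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, means
```

Note that, as the slides imply, the result depends on the random initial means; running several restarts and keeping the lowest-RMSE solution is the usual remedy.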
8. K-Means Demo
9. ___-Link Clustering
- Initialize each object in its own cluster
- Compute the cluster distance matrix M by the selected criterion (below)
- While there is more than one cluster:
- Join the clusters with the shortest distance
- Update M by the selected criterion
- Criterion for ___-link clustering:
- Single-link: use the distance of the closest objects between two clusters
- Complete-link: use the distance of the most distant objects between the two clusters
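The agglomerative loop above, with the blank filled by either criterion, can be sketched as follows (a naive O(n³) version that recomputes cluster distances directly rather than maintaining the matrix M; Euclidean distance is an assumption):

```python
def agglomerate(points, linkage="single"):
    """___-link clustering: start with singleton clusters and
    repeatedly join the closest pair, returning the merge order.
    `linkage` is "single" (closest objects) or "complete" (most
    distant objects)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):
        pair = [dist(points[i], points[j]) for i in c1 for j in c2]
        return min(pair) if linkage == "single" else max(pair)

    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the shortest distance.
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```

Cutting the merge sequence early (stopping at k clusters instead of one) yields a flat clustering.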
10. ___-Link Demo
- How can we measure the distance between these clusters?
- What is best for:
- Spherical data (above)?
- Chain-like data?
[Figure: complete-link distance vs. single-link distance between two clusters]
11. Probabilistic Clustering
- Objective: Obj_prob(C) = P(C | θ) · P(D | C, θ)
- There are many ways to optimize such a clustering, including Expectation Maximization and Simulated Annealing
- P(C | θ) is called the prior on C and lets us control the kinds of clusterings that are found
- Balanced-size clusters, lots of little clusters, a few big clusters, etc.
- P(D | C, θ) is where the interesting application-specific work is performed
12. Probabilistic Clustering with Simulated Annealing
- Use Maximum Likelihood Estimators for the parameters θ
- Use simulated annealing to find the optimal C:
- Start with a random C_0 and temperature T_0, then iterate:
- Perturb a small portion of C_i, store as C_{i+1}
- Re-estimate MLE(θ), given C_{i+1}
- Compute ΔL = Obj_prob(C_{i+1}) − Obj_prob(C_i)
- If ΔL > 0, or with probability e^{ΔL/T}, keep solution C_{i+1}
- Else, revert to solution C_i
- T_{i+1} = T_i / t_s  // t_s is a number slightly greater than 1
- Stop when there is little or no change between iterations
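The annealing loop above can be sketched generically; `perturb` and `objective` stand in for the model-specific pieces (perturbing C and evaluating Obj_prob), and the default temperature schedule values are illustrative, not from the talk:

```python
import math
import random

def anneal(initial_c, perturb, objective, t0=1.0, t_s=1.003, steps=5000, seed=0):
    """Simulated annealing per the slide: perturb the current solution,
    always keep improvements, keep worse moves with probability
    exp(dL / T), and cool by dividing T by t_s each step."""
    rng = random.Random(seed)
    c, t = initial_c, t0
    score = objective(c)
    best, best_score = c, score
    for _ in range(steps):
        cand = perturb(c, rng)           # perturb C_i to get C_{i+1}
        cand_score = objective(cand)
        d = cand_score - score           # dL = Obj(C_{i+1}) - Obj(C_i)
        if d > 0 or rng.random() < math.exp(d / t):
            c, score = cand, cand_score  # keep C_{i+1}
            if score > best_score:
                best, best_score = c, score
        else:
            pass                         # revert to C_i (c unchanged)
        t /= t_s                         # T_{i+1} = T_i / t_s
    return best, best_score
```

Early on (large T) the walk explores freely; as T shrinks, exp(dL/T) vanishes for downhill moves and the search becomes greedy.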
13. My Research: Clustering Relational Data
14. Relational Data
- Formally:
- A set of Object Domains
- Sets of instances from those domains
- Sets of relational tuples between instances
- Simplifications:
- Attribute Vectors
- (Attributed) graphs, when compatible
- In Practice:
- Relational Data refers only to data that requires the use of tuples
[Figure: example instances with attributes, e.g. Fred, M; Sally, F; Joe, M]
15. Some Relational Data Examples
- Domains: People, demographic attributes
- Relations: Friendship, Group
- Domains: Documents, words
- Relations: Directed cross-document references
- (Internet Movie Database)
- Domains: Actors, Directors, Movies, demographic attributes
- Relations: Worked-Together, Directed, Acted-In
16. Observation: Attributes and Relations Encode Unique Information
- Internet Movie Database subset: 508 actors
- 7 binary features: has_award, act_drama, act_comedy, experienced, gender, popularity, many_movies
- Ground-truth clustering: currently active actors and (semi-)retired actors
- Adjusted Rand Index for partition comparison (closeness of partition A to partition B, not a percentage)
- ARI between Features Only and Graph Only: 0.51
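For reference, the Adjusted Rand Index used above can be computed from the standard contingency-table formula (this is the textbook definition, not code from the talk):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two partitions given as label lists. 1.0 means the
    partitions are identical (up to relabeling); values near 0 mean
    chance-level agreement, and negative values are possible."""
    n = len(a)
    nij = Counter(zip(a, b))          # contingency-table cells n_ij
    ai, bj = Counter(a), Counter(b)   # row sums a_i, column sums b_j
    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(v, 2) for v in ai.values())
    sum_b = sum(comb(v, 2) for v in bj.values())
    expected = sum_a * sum_b / comb(n, 2)   # chance agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # e.g. both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The chance correction is what makes ARI a comparison score rather than a percentage, as the slide notes.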
17. Types of Similarity in Relational Data
- Attribute Similarity
- Can we sort Facebook users into categories based on demographic similarity?
- Structural (Relational) Similarity
- Can we categorize an actress based on the people she has worked with?
- Correlation Similarity (contribution)
- Given two blogs that are connected by reciprocal URLs, how likely are they to cover similar/different topics?
18. Modeling Different Similarities
- Hypothesis: A model that uses one or a combination of attribute, structural, and/or correlation similarity will be able to find non-trivial clusterings that contrast with what other models may find.
19. The Probabilistic Relational Clustering Framework (PRCF)
20. Probabilistic Relational Clustering Framework (Cont.)
- Prior probability of observing C
- Attribute Similarity Probability
- Structural and/or correlation similarity probability
- b_{C_i,C_j}: specifies the block of edges between clusters i and j
21. Improving Link Prediction with a Novel PRCF Model
- For the edge model, most researchers choose: [equation]
- I proposed a new edge model: [equation]
- Experiment:
- Withhold a fraction of edges from an artificial graph as a test set
- Remaining edges are the training set
- Learn several models with more and more training-set edges observed
- AUC: area under the ROC curve, a measure of classification performance
22. Current Work: Block Modularity (joint work with Michael Lombardi '10)
- Block b_ij: set of edges falling into the block between clusters i and j
- If some blocks are dense and the rest are sparse, we can generate a summary graph and state that objects in the same cluster have high structural similarity
23. Optimizing Block Modularity
- Starting with a random partition C:
- Iterate until convergence:
- Compute a new partition C' by assigning each object to the cluster that would increase the block modularity objective the most
- Let C = C'
- Preliminary results:
- Block modularity: 3-10 iterations on a graph with 100 vertices
- Simulated annealing with PRCF: 10,000 iterations
- Block modularity algorithm is significantly easier to program than any PRCF model and finds the same solution
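The greedy local-move loop above can be sketched generically. The actual block modularity score is defined in the thesis work; here `objective` is a stand-in callable over partitions, and the deterministic initial partition is an assumption for reproducibility:

```python
def greedy_partition(n_objects, k, objective, max_rounds=50):
    """Greedy optimization per the slide: repeatedly move each object
    to the cluster that increases the objective the most; stop when a
    full pass over the objects changes nothing."""
    part = [i % k for i in range(n_objects)]  # illustrative initial partition
    for _ in range(max_rounds):
        changed = False
        for i in range(n_objects):
            old = part[i]
            # Score every candidate cluster for object i, keep the best
            # (ties broken toward the higher cluster index).
            scores = []
            for c in range(k):
                part[i] = c
                scores.append((objective(part), c))
            best_score, best_c = max(scores)
            part[i] = best_c
            if best_c != old:
                changed = True
        if not changed:
            break
    return part
```

Each pass costs one objective evaluation per object per cluster, which is consistent with the slide's observation that convergence in a handful of passes is far cheaper than thousands of annealing iterations.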
24. Relation Selection
- New application of block modularity:
- Generate an initial clustering
- Use a randomized single-link clustering where the distance is measured by the fraction of common neighbors
- Measure the block modularity score for this clustering
- Average over several runs
- Experiment: take a graph with obvious clusters and rewire some edges
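The common-neighbor distance driving the initial clustering might look like the following; the exact normalization used in the talk is not specified, so this sketch assumes the Jaccard overlap of neighbor sets:

```python
def common_neighbor_distance(adj, u, v):
    """Distance between two vertices as one minus the fraction of
    common neighbors (Jaccard overlap). `adj` maps each vertex to its
    set of neighbors."""
    nu, nv = adj[u], adj[v]
    union = nu | nv
    if not union:
        return 1.0  # no neighbors at all: maximally distant
    return 1.0 - len(nu & nv) / len(union)
```

Vertices inside an "obvious" cluster share most neighbors and so land close together, which is why rewiring edges (the experiment on the slide) degrades the resulting block modularity score.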
25. Observation: Sometimes Relations Are Ambiguous
- Assume attributes can't identify CS/CHEM
- We know that they belong apart
- Solution: Constrained Clustering
[Figure: graph of CS, CHEM, and Bio nodes]
26. Constrained Relational Clustering (joint work with Paul Guseman '09)
- Add constraints: Must-Link and Cannot-Link
- Constrain the original algorithm (e.g., PRCF) so that no (or very few) constraints are violated
- Constrain Obj(C): penalize the score for broken constraints
- Constrain Opt(D): avoid solutions with broken constraints
[Figure: graph of CS, CHEM, and Bio nodes with must-link and cannot-link constraints]
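The "constrain Obj(C)" option above can be sketched as a wrapper that subtracts a penalty per violated pair; the `weight` knob is illustrative, not a value from the talk:

```python
def penalized_objective(objective, must_link, cannot_link, weight=10.0):
    """Wrap a clustering objective so that broken Must-Link and
    Cannot-Link constraints reduce the score. Pairs are (i, j) object
    indices; `partition` is a list of cluster labels."""
    def obj(partition):
        # A must-link pair is broken when its objects are separated;
        # a cannot-link pair is broken when they share a cluster.
        broken = sum(partition[a] != partition[b] for a, b in must_link)
        broken += sum(partition[a] == partition[b] for a, b in cannot_link)
        return objective(partition) - weight * broken
    return obj
```

A large `weight` approximates hard constraints, while a small one merely biases the search, matching the slide's "no (or very few) constraints are violated" framing.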
27. Future Work
- Relation Extraction/Adjustment
- Denser relations dominate solutions over sparser relations
- Degree Prediction
- Given a document's topic and proposed outlinks, what is the expected number of references to my blog?
- Abstract Relation Type Discovery
- Given a set of unlabeled edges, can they be split into distinct relation types?
28. Conclusion
- Combining attribute similarity and structural similarity boosts performance compared to using either individually
- New source of information: correlation similarity
- Improves link prediction performance
- Block Modularity: a fast (and simple) algorithm for optimizing block models
- Constrained Clustering will help to avoid ambiguous clustering scenarios
- Relation Selection: block modularity can help to quickly decide if a relation has high-quality structure