Title: Data Mining in Streams and Graphs
1Data Mining in Streams and Graphs
2Outline
- Problem definition / Motivation
- Graphs and power laws
- Streams and forecasting
- Conclusions
3Motivation
- Data mining: find patterns (rules, outliers)
- What do real graphs look like?
- What do (numerical) streams look like?
4Joint work with
- Dr. Deepayan Chakrabarti (CMU/Yahoo R.L.)
5Graphs - why should we care?
- Internet Map [lumeta.com]
- Food Web [Martinez '91]
- Protein Interactions [genomebiology.com]
- Friendship Network [Moody '01]
6Graphs - why should we care?
- network of supply-chain companies
- network of companies board-of-directors members
- viral marketing
- web-log (blog) news propagation
- computer network security: email/IP traffic and anomaly detection
- ...
7Problem 1 - network and graph mining
- What does the Internet look like?
- What does the web look like?
- What constitutes a normal social network?
- What is normal/abnormal?
- which patterns/laws hold?
8Graph mining
9Laws and patterns
- NO!!
- Diameter
- in- and out- degree distributions
- other (surprising) patterns
10Laws: degree distributions
- Q: avg degree is 3 - what is the most probable degree?
[Plot: count vs. degree; degree 3 marked on the axis, curve marked "??"]
11Laws: degree distributions
- Q: avg degree is 3 - what is the most probable degree?
[Plot: x-axis is degree]
12Solution 1
[Plot: count vs. outdegree, Nov '97; exponent = slope, O = -2.15]
- The plot is linear in log-log scale [FFF'99]
- freq = degree^(-2.15)
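A minimal sketch of how such an exponent could be read off a degree histogram: fit a straight line to log(count) versus log(degree) by least squares and take the slope. The function name and the synthetic data are assumptions for illustration, not the fitting procedure used for the plot.

```python
import math

def fit_loglog_slope(degrees, counts):
    """Least-squares slope and intercept of log(count) against log(degree)."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic (made-up) histogram that follows count = 1000 * degree^(-2.15).
degrees = [1, 2, 4, 8, 16, 32]
counts = [1000.0 * d ** -2.15 for d in degrees]

slope, intercept = fit_loglog_slope(degrees, counts)
print("estimated exponent:", round(slope, 2))
```

On real data the tail of the histogram is noisy, so a plain least-squares fit like this is only a first look.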
13Solution
- Power law in the degree distribution [SIGCOMM'99]
[Plot: internet domains, with att.com and ibm.com labeled]
14But
- Q1: How about graphs from other domains?
- Q2: How about temporal evolution?
15The Peer-to-Peer Topology
[Jovanovic]
- Frequency versus degree
- Number of adjacent peers follows a power-law
16More power laws
- citation counts (citeseer.nj.nec.com, 6/2001)
[Plot: log(count) vs. log(citations); "Ullman" labeled]
17Swedish sex-web
Nodes: people (females, males); links: sexual relationships
[Source: Albert Laszlo Barabasi, http://www.nd.edu/networks/Publication%20Categories/04%20Talks/2005-norway-3hours.ppt]
[Liljeros et al., Nature 2001]
4781 Swedes, ages 18-74; 59% response rate.
19More power laws
- web hit counts [w/ A. Montgomery]
[Plot: Web Site Traffic; log(count) vs. log(in-degree); Zipf; "ebay" labeled]
20epinions.com
- who-trusts-whom [Richardson + Domingos, KDD 2001]
[Plot: count vs. (out) degree; a "trusts-2000-people" user is labeled]
21More Power laws
- Also hold for other web graphs [Barabasi], [Tomkins], with additional rules (bi-partite cores follow power laws)
22But
- Q1: How about graphs from other domains?
- Q2: How about temporal evolution?
23Time Evolution: rank R
[Plot: domain level]
The rank exponent has not changed!
24Any other pattern, over time?
25Time evolution
- with Jure Leskovec (CMU)
- and Jon Kleinberg (Cornell)
26Evolution of the Diameter
- Prior work on Power Law graphs hints at slowly growing diameter:
  - diameter ~ O(log N)
  - diameter ~ O(log log N)
- What is happening in real data?
27Evolution of the Diameter
- Prior work on Power Law graphs hints at slowly growing diameter:
  - diameter ~ O(log N)
  - diameter ~ O(log log N)
- What is happening in real data?
- Diameter shrinks over time
- As the network grows the distances between nodes slowly decrease
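A small sketch of what "diameter" measures and how it can shrink as a graph grows. It assumes the plain definition (longest shortest path between connected nodes), which may differ from the exact variant plotted on the following slides; the toy graph and helper names are made up for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Longest shortest-path length over all connected pairs of nodes."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

def add_edge(adj, u, v):
    """Insert an undirected edge, creating nodes as needed."""
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

graph = {}
for u, v in [(0, 1), (1, 2), (2, 3)]:   # path 0-1-2-3
    add_edge(graph, u, v)
before = diameter(graph)

# The graph grows: new node 4 links to both ends of the path.
add_edge(graph, 4, 0)
add_edge(graph, 4, 3)
after = diameter(graph)

print(before, after)
```

Here one extra node with two edges turns the path into a cycle, so the longest shortest path drops from 3 hops to 2.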
28Diameter: ArXiv citation graph
- Citations among physics papers
- 1992 - 2003
- One graph per year
[Plot: diameter vs. time (years)]
29Diameter: Autonomous Systems
- Graph of Internet
- One graph per day
- 1997 - 2000
[Plot: diameter vs. number of nodes]
30Diameter: Affiliation Network
- Graph of collaborations in physics: authors linked to papers
- 10 years of data
[Plot: diameter vs. time (years)]
31Diameter: Patents
- Patent citation network
- 25 years of data
[Plot: diameter vs. time (years)]
32Temporal Evolution of the Graphs
- N(t): nodes at time t
- E(t): edges at time t
- Suppose that
  - N(t+1) = 2 * N(t)
- Q: what is your guess for
  - E(t+1) =? 2 * E(t)
33Temporal Evolution of the Graphs
- N(t): nodes at time t
- E(t): edges at time t
- Suppose that
  - N(t+1) = 2 * N(t)
- Q: what is your guess for
  - E(t+1) =? 2 * E(t)
- A: over-doubled!
- But obeying the Densification Power Law
34Densification Physics Citations
- Citations among physics papers
- 2003
- 29,555 papers, 352,807 citations
[Plot: E(t) vs. N(t); exponent: ??]
35Densification Physics Citations
- Citations among physics papers
- 2003
- 29,555 papers, 352,807 citations
[Plot: E(t) vs. N(t); exponent 1.69]
36Densification Physics Citations
- Citations among physics papers
- 2003
- 29,555 papers, 352,807 citations
[Plot: E(t) vs. N(t); exponent 1.69; for reference, a tree has exponent 1]
37Densification Physics Citations
- Citations among physics papers
- 2003
- 29,555 papers, 352,807 citations
[Plot: E(t) vs. N(t); exponent 1.69; for reference, a clique has exponent 2]
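A short worked example of what the exponent implies, assuming the Densification Power Law has the form E(t) proportional to N(t)^a. Under that assumption, doubling the nodes multiplies the edges by 2^a. The helper name is hypothetical; the exponents are the three values shown on the plot.

```python
def edge_growth_factor(node_growth_factor, exponent):
    """Factor by which edges grow if E is proportional to N^exponent."""
    return node_growth_factor ** exponent

# Exponent 1 (tree-like), 1.69 (physics citations), 2 (clique-like).
factors = {
    "tree": edge_growth_factor(2.0, 1.0),
    "physics citations": edge_growth_factor(2.0, 1.69),
    "clique": edge_growth_factor(2.0, 2.0),
}

for name, factor in factors.items():
    print(f"{name}: doubling nodes multiplies edges by {factor:.2f}")
```

With exponent 1.69 the edge count a bit more than triples when the node count doubles, which is the "over-doubled" answer, and it stays between the tree and clique extremes.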
38Densification Patent Citations
- Citations among patents granted
- 1999
- 2.9 million nodes
- 16.5 million edges
- Each year is a datapoint
[Plot: E(t) vs. N(t); exponent 1.66]
39Densification Autonomous Systems
- Graph of Internet
- 2000
- 6,000 nodes
- 26,000 edges
- One graph per day
[Plot: E(t) vs. N(t); exponent 1.18]
40Densification Affiliation Network
- Authors linked to their publications
- 2002
- 60,000 nodes
  - 20,000 authors
  - 38,000 papers
- 133,000 edges
[Plot: E(t) vs. N(t); exponent 1.15]
41Outline
- Problem definition / Motivation
- Graphs and power laws
  - laws
  - algorithms: graph partitioning using MDL
- Streams and forecasting
- Conclusions
42Graph partitioning
- Documents x terms
- Customers x products
- Users x web-sites
- Q: HOW MANY PIECES?
43Graph partitioning
- Documents x terms
- Customers x products
- Users x web-sites
- Q: HOW MANY PIECES?
- A: MDL / compression
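A rough sketch of the MDL idea for choosing the number of pieces: score each candidate grouping of rows and columns by the bits needed to describe the matrix under it, and prefer the cheapest. The cost function below is a simplified stand-in chosen for illustration (block entropy plus a crude model cost), not the exact encoding used by the cross-associations method; all names and the toy matrix are assumptions.

```python
import math

def entropy_bits(p):
    """Bits per cell for a block whose fraction of ones is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def description_cost(matrix, row_groups, col_groups):
    """Approximate bits to describe a 0/1 matrix under a row/column grouping.

    Groups are assumed to be numbered 0..k-1 and 0..l-1.
    """
    k = max(row_groups) + 1
    l = max(col_groups) + 1
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            ones[row_groups[i]][col_groups[j]] += cell
            size[row_groups[i]][col_groups[j]] += 1
    cost = 0.0
    for a in range(k):
        for b in range(l):
            n = size[a][b]
            if n == 0:
                continue
            cost += n * entropy_bits(ones[a][b] / n)  # cells inside the block
            cost += math.log2(n + 1)                  # number of ones in the block
    # Crude model cost: which group each row and each column belongs to.
    cost += len(matrix) * math.log2(k) + len(matrix[0]) * math.log2(l)
    return cost

# Toy matrix with two obvious blocks (e.g. customers x products).
matrix = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3

one_piece = description_cost(matrix, [0] * 6, [0] * 6)
two_pieces = description_cost(matrix, [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1])
all_separate = description_cost(matrix, list(range(6)), list(range(6)))

print(round(one_piece, 1), round(two_pieces, 1), round(all_separate, 1))
```

Too few groups leave mixed blocks that are expensive to encode, and too many groups pay for the model itself, so the two-group split comes out cheapest on this toy input.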
44Cross-associations
[Figure panels: 2x2, 1x2]
45Cross-associations
[Figure panels: 3x4, 3x3, 2x3]
46Cross-associations
47Cross-associations
[Figure labels: missing edge, outlier edge]
48education
math
nuclear physics
bio
49Conclusions
- Real graphs obey some surprising patterns
- which can help us spot anomalies / outliers
- MDL helps partition a graph into natural groups
50Outline
- Problem definition / Motivation
- Graphs and power laws
- Streams and forecasting
- Conclusions
51Why care about streams?
- Sensor devices
- Temperature, weather measurements
- Road traffic data
- Geological observations
- Patient physiological data
- Embedded devices
- Network routers
- Intelligent (active) disks
52Co-evolving time sequences
- Joint work with
- Jimeng Sun (CMU)
- Spiros Papadimitriou (CMU/IBM)
- Dr. Yasushi Sakurai (NTT)
53Outline
- Problem definition / Motivation
- Graphs and power laws
- Streams and forecasting
  - single stream mining, forecasting
  - multiple-stream mining and summarization
- Conclusions
54Results - Synthetic data
[Plot panels: AWSOM, AR, Seasonal AR]
- Triangle pulse
- Mix (sine + square)
- AR captures wrong trend (or none)
- Seasonal AR estimation fails
55Results real data
- Automobile traffic
- Daily periodicity
- Rush-hour peaks
- Bursty noise at smaller time scales
56Results - Real data
- Sunspot intensity: slightly time-varying period
- AR captures wrong trend
- Seasonal ARIMA
  - wrong trend
  - needs human
57Outline
- Problem definition / Motivation
- Graphs and power laws
- Streams and forecasting
  - single stream mining, forecasting
  - multiple-stream mining and summarization
- Conclusions
58Motivation
[Figure: water distribution network; labels: sensors near leak, sensors away from leak, normal operation]
Hundreds of measurements, possibly correlated.
59Motivation
[Figure: water distribution network, chlorine concentrations; labels: sensors near leak, sensors away from leak, normal operation]
Hundreds of measurements, possibly correlated.
60Motivation
[Figure: water distribution network, chlorine concentrations; labels: sensors near leak, sensors away from leak, normal operation, major leak]
Hundreds of measurements, possibly correlated.
62Motivation
[Figure: actual measurements (n streams) and k hidden variable(s)]
Find hidden (latent) variables, to summarize the key trends
63Motivation
[Figure: chlorine concentrations, Phase 1 and Phase 2; actual measurements (n streams) and k hidden variable(s), k = 2]
Find hidden (latent) variables, to summarize the key trends
64Motivation
[Figure: chlorine concentrations, Phase 1, Phase 2 and Phase 3; actual measurements (n streams) and k hidden variable(s), k = 1]
Find hidden (latent) variables, to summarize the key trends
65Stream mining
- Solution: SPIRIT [VLDB'05]
- incremental, on-line PCA
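To give a feel for "incremental, on-line PCA", here is a minimal sketch that tracks a single hidden variable with one cheap update per incoming measurement vector, using an exponential forgetting factor. It is a simplified illustration and not a full implementation of SPIRIT; the function name, the forgetting value 0.96, and the synthetic sensor data are all assumptions.

```python
import math

def track_hidden_variable(stream, forgetting=0.96):
    """Track one hidden variable over a stream of n-dimensional measurements.

    Returns the final participation weights w and the hidden value per tick.
    """
    n = len(stream[0])
    w = [1.0] + [0.0] * (n - 1)   # start from the first unit vector
    energy = 1e-3                 # small positive start for the energy estimate
    hidden = []
    for x in stream:
        y = sum(wi * xi for wi, xi in zip(w, x))        # project x onto w
        energy = forgetting * energy + y * y            # decayed energy of y
        error = [xi - y * wi for xi, wi in zip(x, w)]   # reconstruction error
        w = [wi + (y / energy) * ei for wi, ei in zip(w, error)]
        hidden.append(y)
    return w, hidden

# Made-up data: three sensors that follow one common trend, scaled 1 : 2 : 2.
stream = []
for t in range(500):
    trend = 1.0 + 0.5 * math.sin(0.1 * t)
    stream.append([1.0 * trend, 2.0 * trend, 2.0 * trend])

w, hidden = track_hidden_variable(stream)
norm = math.sqrt(sum(wi * wi for wi in w))
cosine = (w[0] * 1.0 + w[1] * 2.0 + w[2] * 2.0) / (3.0 * norm)
print("alignment with the true direction:", round(cosine, 3))
```

Each update touches only the current measurement vector, so memory and time per tick stay proportional to the number of streams; no past data is stored.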
67SPIRIT
- also to monitor a data center (self-* storage project: www.pdl.cmu.edu/SelfStar/)
- see demo of SPIRIT at http://warsteiner.db.cs.cmu.edu (needs JVM plugin)
68Conclusions
- Graphs and streams pose fascinating problems
- MDL, PCA/SVD (wavelets): powerful tools
- self-similarity, fractals and power laws work when textbook methods fail!
69Other projects
- video data mining Pan Yang
- Virus propagation (Wang)
- Anomaly detection in network traffic (Wang,
Olston, )
70Books
- Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W.H. Freeman and Company, 1991. (Probably the BEST book on fractals!)
71Contact info
- christos_at_cs.cmu.edu
- www.cs.cmu.edu/christos
- Wean Hall 7107
- Ph x8.1457