Data Mining in Streams and Graphs

1
Data Mining in Streams and Graphs
  • Christos Faloutsos
  • CMU

2
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Streams and forecasting
  • Conclusions

3
Motivation
  • Data mining: find patterns (rules, outliers)
  • What do real graphs look like?
  • What do (numerical) streams look like?

4
Joint work with
  • Dr. Deepayan Chakrabarti (CMU/Yahoo R.L.)

5
Graphs - why should we care?
[Figure panels: Internet Map (lumeta.com); Food Web (Martinez '91); Protein Interactions (genomebiology.com); Friendship Network (Moody '01)]
6
Graphs - why should we care?
  • network of supply-chain companies
  • network of companies and board-of-directors members
  • viral marketing
  • web-log (blog) news propagation
  • computer network security: email/IP traffic and
    anomaly detection
  • ...

7
Problem 1 - network and graph mining
  • What does the Internet look like?
  • What does the web look like?
  • What constitutes a normal social network?
  • What is normal/abnormal?
  • which patterns/laws hold?

8
Graph mining
  • Are real graphs random?

9
Laws and patterns
  • NO!!
  • Diameter
  • in- and out- degree distributions
  • other (surprising) patterns

10
Laws: degree distributions
  • Q: avg degree is 3 - what is the most probable
    degree?

[Plot: count vs. degree, with the average degree 3 marked and the curve shown as "??"]
11
Laws: degree distributions
  • Q: avg degree is 3 - what is the most probable
    degree?

[Plot axis: degree]
12
Solution 1
[Plot: count vs. outdegree, Nov '97; exponent = slope, O = -2.15]
  • The plot is linear in log-log scale [FFF'99]
  • freq = degree^(-2.15)
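
A line in log-log scale means the exponent can be read off as a slope. The sketch below is one simple way to estimate it, by ordinary least squares on the logarithms. The function name and the synthetic data are illustrative assumptions, not taken from the slides, and plain least squares is only a rough estimator on real, noisy degree counts.

```python
import math

def loglog_slope(degrees, counts):
    """Least-squares slope of log(count) versus log(degree)."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic (made-up) data that follows count = 1000 * degree^(-2.15) exactly.
degrees = [1, 2, 4, 8, 16, 32]
counts = [1000 * d ** -2.15 for d in degrees]

slope = loglog_slope(degrees, counts)
print(f"estimated exponent: {slope:.2f}")
```

Because the synthetic points lie exactly on a power law, the fitted slope recovers the exponent used to generate them.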

13
Solution
  • Power law in the degree distribution [SIGCOMM'99]

[Plot: internet domains, with labeled points att.com and ibm.com]
14
But:
  • Q1: How about graphs from other domains?
  • Q2: How about temporal evolution?

15
The Peer-to-Peer Topology
[Jovanovic]
  • Frequency versus degree
  • Number of adjacent peers follows a power-law

16
More power laws
  • citation counts (citeseer.nj.nec.com, 6/2001)

[Plot: log(count) vs. log(citations); labeled point: Ullman]
17
Swedish sex-web
Nodes: people (females, males). Links: sexual relationships.
4781 Swedes, ages 18-74; 59% response rate.
Liljeros et al., Nature 2001
Source: Albert-Laszlo Barabasi, http://www.nd.edu/networks/Publication%20Categories/04%20Talks/2005-norway-3hours.ppt
19
More power laws
  • web hit counts [w/ A. Montgomery]

[Plot: Web Site Traffic, log(count) vs. log(in-degree); annotations: Zipf, ebay]
20
epinions.com
  • who-trusts-whom [Richardson & Domingos, KDD 2001]

[Plot: count vs. (out) degree; annotation: trusts-2000-people user]
21
More Power laws
  • Also hold for other web graphs [Barabasi],
    [Tomkins], with additional rules (bi-partite
    cores follow power laws)

22
But:
  • Q1: How about graphs from other domains?
  • Q2: How about temporal evolution?

23
Time Evolution: rank R
[Plot: domain level]
The rank exponent has not changed!
24
Any other pattern, over time?
25
Time evolution
  • with Jure Leskovec (CMU)
  • and Jon Kleinberg (Cornell)

26
Evolution of the Diameter
  • Prior work on Power Law graphs hints at slowly
    growing diameter
  • diameter O(log N)
  • diameter O(log log N)
  • What is happening in real data?

27
Evolution of the Diameter
  • Prior work on Power Law graphs hints at slowly
    growing diameter
  • diameter O(log N)
  • diameter O(log log N)
  • What is happening in real data?
  • Diameter shrinks over time
  • As the network grows the distances between nodes
    slowly decrease
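
To make "diameter" concrete, here is a minimal sketch that computes the exact diameter of a small undirected graph by running breadth-first search from every node. The toy graphs and function names are illustrative assumptions, not data from the talk. On graphs of realistic size an all-pairs search like this is too slow, so some sampling or approximation would be needed.

```python
from collections import deque

def make_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_distances(adj, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Longest shortest-path distance over all reachable pairs."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# Toy example: a 5-node path, then the same graph after one node
# and two edges are added (closing it into a 6-node cycle).
path = make_graph([(0, 1), (1, 2), (2, 3), (3, 4)])
grown = make_graph([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])

print(diameter(path), diameter(grown))
```

In this toy case the graph gains a node yet its diameter drops, because the new edges create a shortcut between the two ends of the path.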

28
Diameter: ArXiv citation graph
  • Citations among physics papers
  • 1992 - 2003
  • One graph per year

[Plot: diameter vs. time (years)]
29
Diameter: Autonomous Systems
  • Graph of Internet
  • One graph per day
  • 1997 - 2000

[Plot: diameter vs. number of nodes]
30
Diameter: Affiliation Network
  • Graph of collaborations in physics: authors
    linked to papers
  • 10 years of data

[Plot: diameter vs. time (years)]
31
Diameter: Patents
  • Patent citation network
  • 25 years of data

[Plot: diameter vs. time (years)]
32
Temporal Evolution of the Graphs
  • N(t): nodes at time t
  • E(t): edges at time t
  • Suppose that
  • N(t+1) = 2 * N(t)
  • Q: what is your guess for
  • E(t+1) =? 2 * E(t)

33
Temporal Evolution of the Graphs
  • N(t): nodes at time t
  • E(t): edges at time t
  • Suppose that
  • N(t+1) = 2 * N(t)
  • Q: what is your guess for
  • E(t+1) =? 2 * E(t)
  • A: over-doubled!
  • But obeying the Densification Power Law
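
A small sketch of what "over-doubled" means, assuming the Densification Power Law has the form E(t) = c * N(t)^a. The functional form with constant c, the function names, and the snapshot numbers are assumptions made for illustration. The exponent 1.69 is the value the slides report for physics citations, and exponents 1 and 2 are the tree and clique reference cases marked on those plots.

```python
import math

def edge_growth_factor(node_growth, exponent):
    """Factor by which E grows when N grows by node_growth, if E = c * N**exponent."""
    return node_growth ** exponent

def densification_exponent(n1, e1, n2, e2):
    """Exponent implied by two snapshots (n1, e1) and (n2, e2), assuming E = c * N**a."""
    return math.log(e2 / e1) / math.log(n2 / n1)

tree_like = edge_growth_factor(2, 1.0)    # exponent 1: edges exactly double
clique_like = edge_growth_factor(2, 2.0)  # exponent 2: edges quadruple
physics = edge_growth_factor(2, 1.69)     # exponent from the physics-citations slide

# Hypothetical snapshots (made-up counts) that obey the law with a = 1.69.
recovered = densification_exponent(1000, 5000.0, 2000, 5000.0 * physics)

print(tree_like, clique_like, round(physics, 2), round(recovered, 2))
```

With an exponent strictly between 1 and 2, doubling the nodes multiplies the edges by more than 2 but less than 4, which is the "over-doubled" answer on the slide.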

34
Densification: Physics Citations
  • Citations among physics papers
  • 2003
  • 29,555 papers, 352,807 citations

[Plot: E(t) vs. N(t); exponent shown as "??"]
35
Densification: Physics Citations
  • Citations among physics papers
  • 2003
  • 29,555 papers, 352,807 citations

[Plot: E(t) vs. N(t); exponent 1.69]
36
Densification: Physics Citations
  • Citations among physics papers
  • 2003
  • 29,555 papers, 352,807 citations

[Plot: E(t) vs. N(t); exponent 1.69; reference exponent 1: tree]
37
Densification: Physics Citations
  • Citations among physics papers
  • 2003
  • 29,555 papers, 352,807 citations

[Plot: E(t) vs. N(t); exponent 1.69; reference exponent 2: clique]
38
Densification: Patent Citations
  • Citations among patents granted
  • 1999
  • 2.9 million nodes
  • 16.5 million edges
  • Each year is a datapoint

[Plot: E(t) vs. N(t); exponent 1.66]
39
Densification: Autonomous Systems
  • Graph of Internet
  • 2000
  • 6,000 nodes
  • 26,000 edges
  • One graph per day

[Plot: E(t) vs. N(t); exponent 1.18]
40
Densification: Affiliation Network
  • Authors linked to their publications
  • 2002
  • 60,000 nodes
  • 20,000 authors
  • 38,000 papers
  • 133,000 edges

[Plot: E(t) vs. N(t); exponent 1.15]
41
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • laws
  • algorithms: graph partitioning using MDL
  • Streams and forecasting
  • Conclusions

42
Graph partitioning
  • Documents x terms
  • Customers x products
  • Users x web-sites
  • Q: HOW MANY PIECES?

43
Graph partitioning
  • Documents x terms
  • Customers x products
  • Users x web-sites
  • Q: HOW MANY PIECES?
  • A: MDL / compression
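
The idea of letting compression pick the number of pieces can be sketched with a simplified two-part code for a binary matrix: bits to say which group each row and column belongs to, bits to state how many ones each block holds, and entropy-coded bits for the block contents. The particular cost terms and names below are assumptions chosen for illustration and are not the exact encoding used in the cross-associations work.

```python
import math

def entropy_bits(p):
    """Binary entropy in bits; a block that is all 0s or all 1s costs nothing."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def description_length(matrix, row_groups, col_groups):
    """Simplified total cost in bits of a matrix under a row/column grouping."""
    k = max(row_groups) + 1
    l = max(col_groups) + 1
    size = [[0] * l for _ in range(k)]
    ones = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            size[row_groups[i]][col_groups[j]] += 1
            ones[row_groups[i]][col_groups[j]] += cell
    # Model cost: group label for every row and every column.
    bits = len(row_groups) * math.log2(k) + len(col_groups) * math.log2(l)
    for a in range(k):
        for b in range(l):
            n = size[a][b]
            if n == 0:
                continue
            bits += math.log2(n + 1)                  # number of ones in the block
            bits += n * entropy_bits(ones[a][b] / n)  # block contents
    return bits

# Toy "customers x products" matrix with two obvious groups.
matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
one_piece = description_length(matrix, [0, 0, 0, 0], [0, 0, 0, 0])
two_pieces = description_length(matrix, [0, 0, 1, 1], [0, 0, 1, 1])
four_pieces = description_length(matrix, [0, 1, 2, 3], [0, 1, 2, 3])

print(round(one_piece, 1), round(two_pieces, 1), round(four_pieces, 1))
```

On this toy matrix the two-group split is cheapest: one group leaves a mixed block that is expensive to encode, while four groups pay too much for labels and per-block counts.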

44
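Cross-associations
[Figure: panels labeled 2x2 and 1x2]
45
Cross-associations
[Figure: panels labeled 3x4, 3x3 and 2x3]
46
Cross-associations
47
Cross-associations
[Figure annotations: missing edge, outlier edge]
48
[Figure: groups labeled education, math, nuclear physics, bio]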
49
Conclusions
  • Real graphs obey some surprising patterns
  • which can help us spot anomalies / outliers
  • MDL helps partition a graph into natural groups

50
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Streams and forecasting
  • Conclusions

51
Why care about streams?
  • Sensor devices
  • Temperature, weather measurements
  • Road traffic data
  • Geological observations
  • Patient physiological data
  • Embedded devices
  • Network routers
  • Intelligent (active) disks

52
Co-evolving time sequences
  • Joint work with
  • Jimeng Sun (CMU)
  • Spiros Papadimitriou (CMU/IBM)
  • Dr. Yasushi Sakurai (NTT)

53
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Streams and forecasting
  • single stream mining forecasting
  • multiple-stream mining and summarization
  • Conclusions

54
Results - Synthetic data
[Plot panels: AWSOM, AR, Seasonal AR]
  • Triangle pulse
  • Mix (sine + square)
  • AR captures wrong trend (or none)
  • Seasonal AR estimation fails

55
Results - Real data
  • Automobile traffic
  • Daily periodicity
  • Rush-hour peaks
  • Bursty noise at smaller time scales

56
Results - Real data
  • Sunspot intensity: slightly time-varying period
  • AR captures wrong trend
  • Seasonal ARIMA
  • wrong trend; needs human
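
One reading of "needs human" is that a seasonal model has to be told the period by hand. The toy sketch below fits a one-coefficient autoregression at a chosen lag on a made-up, perfectly periodic series. It is far simpler than AR, seasonal ARIMA or AWSOM and does not reproduce the experiments on these slides. All names and data are illustrative assumptions.

```python
def fit_lag_coefficient(series, lag):
    """Least-squares coefficient a in the model x[t] = a * x[t - lag]."""
    num = sum(series[t] * series[t - lag] for t in range(lag, len(series)))
    den = sum(series[t - lag] ** 2 for t in range(lag, len(series)))
    return num / den if den else 0.0

def forecast(series, lag, coefficient, steps):
    """Roll the fitted model forward for the given number of steps."""
    history = list(series)
    for _ in range(steps):
        history.append(coefficient * history[-lag])
    return history[len(series):]

# Made-up signal with period 4.
series = [0.0, 1.0, 0.0, -1.0] * 6

a1 = fit_lag_coefficient(series, 1)  # short lag: coefficient comes out as zero
a4 = fit_lag_coefficient(series, 4)  # lag equal to the true period

print(forecast(series, 1, a1, 4))
print(forecast(series, 4, a4, 4))
```

With lag 1 the model predicts a flat line, while lag 4 reproduces the cycle. The catch is that someone had to know the period was 4, and a slightly time-varying period would break even that.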

57
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Streams and forecasting
  • single stream mining forecasting
  • multiple-stream mining and summarization
  • Conclusions

58
Motivation
[Figure: water distribution network under normal operation; sensors near leak and sensors away from leak]
Hundreds of measurements, possibly correlated.
59
Motivation
[Figure: chlorine concentrations in a water distribution network under normal operation; sensors near leak and sensors away from leak]
Hundreds of measurements, possibly correlated.
60
Motivation
[Figure: chlorine concentrations in a water distribution network, normal operation vs. major leak; sensors near leak and sensors away from leak]
Hundreds of measurements, possibly correlated.
62
Motivation
[Figure: actual measurements (n streams) and k hidden variable(s)]
Find hidden (latent) variables, to summarize
the key trends
63
Motivation
[Figure: chlorine concentrations, Phase 1 and Phase 2; actual measurements (n streams) and k = 2 hidden variable(s)]
Find hidden (latent) variables, to summarize
the key trends
64
Motivation
[Figure: chlorine concentrations, Phases 1 to 3; actual measurements (n streams) and k = 1 hidden variable(s)]
Find hidden (latent) variables, to summarize
the key trends
65
Stream mining
  • Solution: SPIRIT [VLDB'05]
  • incremental, on-line PCA
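
As a rough illustration of incremental, on-line PCA, the sketch below tracks a single principal direction of several streams one time tick at a time, without storing past data. It uses a common projection-tracking style of update with a forgetting factor. It is a simplified assumption-laden sketch rather than SPIRIT itself: it keeps k fixed at 1, and the names, forgetting value and synthetic streams are illustrative.

```python
import math

def track_direction(stream, forgetting=0.96):
    """Update one weight vector per tick; return it with the hidden variable values."""
    n = len(stream[0])
    w = [1.0] + [0.0] * (n - 1)  # arbitrary starting direction
    energy = 0.01                # small positive start avoids division by zero
    hidden = []
    for x in stream:
        y = sum(wi * xi for wi, xi in zip(w, x))        # project onto w
        energy = forgetting * energy + y * y            # decayed energy of y
        error = [xi - y * wi for wi, xi in zip(w, x)]   # what w fails to explain
        w = [wi + (y / energy) * ei for wi, ei in zip(w, error)]
        hidden.append(y)
    return w, hidden

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Three synthetic, noise-free streams that are all multiples of one hidden signal.
direction = [1 / 3, 2 / 3, 2 / 3]
stream = [[(math.sin(0.3 * t) + 2.0) * d for d in direction] for t in range(200)]

w, hidden = track_direction(stream)
print(round(cosine(w, direction), 4))
```

Here three measurements per tick collapse to one hidden value per tick, and the learned weight vector lines up with the direction used to generate the data.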

67
SPIRIT
  • also used to monitor a data center (Self-* storage
    project)
  • www.pdl.cmu.edu/SelfStar/
  • see demo of SPIRIT at
  • http://warsteiner.db.cs.cmu.edu
  • (needs JVM plugin)

68
Conclusions
  • Graphs and streams pose fascinating problems
  • MDL, PCA/SVD (wavelets): powerful tools
  • self-similarity, fractals and power laws work,
    when textbook methods fail!

69
Other projects
  • Video data mining (Pan, Yang)
  • Virus propagation (Wang)
  • Anomaly detection in network traffic (Wang,
    Olston, ...)

70
Books
  • Manfred Schroeder: Fractals, Chaos, Power Laws:
    Minutes from an Infinite Paradise. W.H. Freeman
    and Company, 1991. (Probably the BEST book on
    fractals!)

71
Contact info
  • christos_at_cs.cmu.edu
  • www.cs.cmu.edu/christos
  • Wean Hall 7107
  • Ph x8.1457