Title: Mining static and time-evolving graphs
1Mining static and time-evolving graphs
- Christos Faloutsos
- Carnegie Mellon University
2Overview
- Mining Static graphs
- CenterPiece Subgraphs (CePS)
- Fast RWR computation
- best-effort subgraph matching (in progress)
- Mining time-evolving graphs
- Tensors intrusion detection
- Sparse graphs
- Other topics
- Graph sampling
- Graph generators
3CePS
- w/ Hanghang Tong, KDD 2006
- htong@cs.cmu.edu
4Center-Piece Subgraph (CePS)
- Given: Q query nodes
- Find: the center-piece subgraph
- Input of CePS
- Q query nodes
- Budget b
- k: softAND coefficient
- Applications
- Social networks
- Law enforcement
5Challenges in CePS
- Q1: How to measure the importance?
- Q2: How to extract the connection subgraph?
- Q3: How to do it efficiently?
6An Illustrative Example
- Starting from node 1
- Move randomly to a neighbor
- With some probability p, return to node 1
- Score of node j: Prob(RW will finally stay at j)
[Figure: example graph with nodes 1-13]
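The walk on this slide (start at node 1, step to a random neighbor, return to node 1 with some probability p) can be sketched as a power iteration. This is a minimal illustration, not the authors' code: the 4-node path graph, the restart probability 0.15, and the name `rwr_scores` are assumptions made for the example.

```python
import numpy as np

def rwr_scores(adj, start, restart_prob=0.15, n_iter=200):
    """Steady-state probabilities of a random walk with restart.

    adj is a symmetric 0/1 adjacency matrix; start is the index of the
    node the walker returns to. restart_prob is an assumed value.
    """
    adj = np.asarray(adj, dtype=float)
    # Column-normalize: column i holds the step probabilities out of node i.
    trans = adj / adj.sum(axis=0, keepdims=True)
    restart = np.zeros(adj.shape[0])
    restart[start] = 1.0
    scores = restart.copy()
    for _ in range(n_iter):
        # Either follow an edge, or jump back to the start node.
        scores = (1 - restart_prob) * trans @ scores + restart_prob * restart
    return scores

# Hypothetical path graph 0 - 1 - 2 - 3, walker restarts at node 0.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
scores = rwr_scores(path, start=0)
```

The entries of `scores` sum to 1, and nodes further from the start node receive smaller values, which is the "Prob(RW will finally stay at j)" reading used on the slide.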
7Individual Score Calculation
Node      Q1      Q2      Q3
Node 1    0.5767  0.0088  0.0088
Node 2    0.1235  0.0076  0.0076
Node 3    0.0283  0.0283  0.0283
Node 4    0.0076  0.1235  0.0076
Node 5    0.0088  0.5767  0.0088
Node 6    0.0076  0.0076  0.1235
Node 7    0.0088  0.0088  0.5767
Node 8    0.0333  0.0024  0.1260
Node 9    0.1260  0.0024  0.0333
Node 10   0.1260  0.0333  0.0024
Node 11   0.0333  0.1260  0.0024
Node 12   0.0024  0.1260  0.0333
Node 13   0.0024  0.0333  0.1260
9AND: Combining Scores
- Q: How to combine scores?
- A: Multiply
- = probability that 3 random particles coincide on node j
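As a small worked example of the "multiply" rule: the score matrix below is made up for illustration (it is not the table from the talk), and the variable names are assumptions.

```python
import numpy as np

# Rows = nodes, columns = query nodes; entry [j, i] is the individual
# score of node j with respect to query i (values are made up).
individual = np.array([[0.50, 0.10, 0.10],
                       [0.20, 0.20, 0.20],
                       [0.05, 0.40, 0.30]])

# AND score of node j: product of its individual scores, i.e. the chance
# that three independent particles are all found on node j.
and_scores = individual.prod(axis=1)
best_node = int(and_scores.argmax())
```

Node 1 wins here even though node 0 has the single largest individual score: the product rewards nodes that are reasonably close to all query nodes.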
10K_SoftAnd: Combining Scores
- Generalization: SoftAND
- We want nodes close to k of the Q (k < Q) query nodes.
- Q: How to do that?
11K_SoftAnd: Combining Scores
- Generalization: SoftAND
- We want nodes close to k of the Q (k < Q) query nodes.
- Q: How to do that?
- A: Prob(at least k out of Q particles will meet each other at j)
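One way to evaluate "at least k out of Q" is a short dynamic program over the individual scores of a node. This sketch assumes the Q particles move independently; the function name `k_soft_and` is hypothetical and the code is not taken from the CePS paper.

```python
def k_soft_and(probs, k):
    """P(at least k of the independent events happen)."""
    # dist[m] = probability that exactly m of the events seen so far happened
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for m, mass in enumerate(dist):
            nxt[m] += mass * (1 - p)      # this particle is not at node j
            nxt[m + 1] += mass * p        # this particle is at node j
        dist = nxt
    return sum(dist[k:])

# Individual scores of one node for Q = 3 query nodes (made-up values).
node_scores = [0.5, 0.5, 0.5]
and_score = k_soft_and(node_scores, 3)    # k = Q: same as multiplying
soft_score = k_soft_and(node_scores, 2)   # 2_SoftAnd
or_score = k_soft_and(node_scores, 1)     # k = 1: behaves like OR
```

With k = Q the result equals the plain product (the AND query), and with k = 1 it equals one minus the probability that no particle is at the node (an OR query).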
12AND query vs. K_SoftAnd query
[Figure: scores for the AND query and the 2_SoftAnd query; scale x 1e-4]
13K_SoftAnd with k = 1: OR query
14Challenges in CePS
- Q1: How to measure the importance?
- Q2: How to extract the connection subgraph?
- Q3: How to do it efficiently?
15Extract Alg.
- Goal
- Maximize total scores and
- Appropriate connections
- How to: Extract Alg.
- Dynamic programming
- Greedy alg.
- Pick up a promising node
- Find the best path
[Figure: example graph with nodes 1-13]
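The greedy loop (pick up a promising node, then find a path to it) can be sketched as follows. This is a simplified stand-in, not the Extract algorithm from the paper: it connects each picked node with plain fewest-hop BFS paths instead of the score-maximizing dynamic program, and it treats the budget as the total number of nodes allowed. The names `bfs_path` and `greedy_extract` and the toy graph are assumptions.

```python
from collections import deque

def bfs_path(adj, src, dst):
    """Fewest-hop path from src to dst as a list of nodes, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in adj[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None

def greedy_extract(adj, scores, queries, budget):
    """Grow a subgraph from the query nodes, using at most `budget` nodes."""
    chosen = set(queries)
    # Visit candidates from the highest to the lowest combined score.
    for node in sorted(adj, key=lambda n: scores[n], reverse=True):
        if node in chosen:
            continue
        # Connect the candidate to every query node.
        paths = [bfs_path(adj, q, node) for q in queries]
        if any(p is None for p in paths):
            continue
        added = set().union(*paths) - chosen
        if len(chosen) + len(added) <= budget:
            chosen |= added
    return chosen

# Toy graph: a and b are the query nodes; c is the natural center-piece.
adj = {"a": ["c", "e"], "b": ["c", "e"], "c": ["a", "b", "d"],
       "d": ["c"], "e": ["a", "b"]}
scores = {"a": 1.0, "b": 1.0, "c": 0.6, "d": 0.3, "e": 0.1}
subgraph = greedy_extract(adj, scores, queries=["a", "b"], budget=3)
```

With a budget of three nodes the sketch keeps the two query nodes plus `c`, the highest-scoring node that connects them.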
16Challenges in CePS
- Q1: How to measure the importance?
- Q2: How to extract the connection subgraph?
- Q3: How to do it efficiently?
17Graph Partition: Efficiency Issue
- Straightforward way
- Solve Q linear systems
- Cost is linear in the number of edges
- Observation
- Skewed distribution
- How to
- Graph partition
18Even better
- We can correct for the deleted edges (Tong, ICDM06, best paper award)
- But let's omit the details
19Experimental Setup
- Dataset
- DBLP/authorship
- Author-Paper
- 315k nodes
- 1,800k edges
20Experimental Setup
- We want to check
- Do the goodness criteria make sense?
- Does the extract alg. capture most of the important nodes/edges?
- Efficiency
21Case Study: AND query
22Case Study: 2_SoftAnd query
[Figure: result subgraph with groups labeled "database" and "Statistic"]
23Evaluation of Extract Alg.
[Figure: node ratio vs. budget (b), for 2 and 3 query nodes]
24Running Time vs. Quality for Fast CePS
[Figure: quality vs. running time]
25Conclusion
- Q1: How to measure the importance?
- A1: RWR + K_SoftAnd
- Q2: How to find the connection subgraph?
- A2: Extract Alg.
- Q3: How to do it efficiently?
- A3: Graph Partition (Fast CePS)
- 90% quality
- 6:1 speedup
26Overview
- Mining Static graphs
- CenterPiece Subgraphs (CePS)
- Fast RWR computation
- best-effort subgraph matching (in progress)
- Mining time-evolving graphs
- Other topics
27Random walk with restart
Ranking vector for Node 4:
Node      Score
Node 1    0.13
Node 2    0.10
Node 3    0.13
Node 4    0.22
Node 5    0.13
Node 6    0.05
Node 7    0.05
Node 8    0.08
Node 9    0.04
Node 10   0.03
Node 11   0.04
Node 12   0.02
- Nearby nodes, higher scores
- More red, more relevant
28Computing RWR
[Figure: the ranking vector at step t+1 (n x 1) is computed from the adjacency matrix (n x n), the ranking vector at step t (n x 1), and the starting vector (n x 1)]
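In symbols, the update on this slide is usually written as below. The transcript keeps only the labels, so the weight c and the normalized adjacency matrix \tilde{W} are notation assumed here; read this as the standard form of random walk with restart rather than a copy of the slide. Here \mathbf{r} is the n x 1 ranking vector and \mathbf{e} the n x 1 starting vector.

```latex
\mathbf{r}^{(t+1)} = c\,\tilde{W}\,\mathbf{r}^{(t)} + (1 - c)\,\mathbf{e}
\quad\Longrightarrow\quad
\mathbf{r} = (1 - c)\,\bigl(I - c\,\tilde{W}\bigr)^{-1}\,\mathbf{e}
```

The right-hand form follows by setting \mathbf{r}^{(t+1)} = \mathbf{r}^{(t)} = \mathbf{r} and solving (I - c\tilde{W})\,\mathbf{r} = (1 - c)\,\mathbf{e}; it is where an n x n matrix inverse enters the computation.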
29Alternatives
- On-the-fly: precompute nothing -> slow
- Precompute everything -> O(N^2) space
30Alternatives
- On-the-fly: precompute nothing -> slow
- Precompute a little, and adjust on-the-fly
- Precompute everything -> O(N^2) space
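The two extremes can be put side by side in a few lines. This is a sketch under assumptions: the weight c = 0.9, the 4-node graph, and the function names are illustrative and do not come from the talk.

```python
import numpy as np

def rwr_on_the_fly(trans, start, c=0.9, n_iter=500):
    """Precompute nothing: iterate again for every query (slow per query)."""
    e = np.zeros(trans.shape[0])
    e[start] = 1.0
    r = e.copy()
    for _ in range(n_iter):
        r = c * trans @ r + (1 - c) * e
    return r

def precompute_all(trans, c=0.9):
    """Precompute everything: one dense n x n matrix, O(N^2) space."""
    n = trans.shape[0]
    return (1 - c) * np.linalg.inv(np.eye(n) - c * trans)

# Small example graph, column-normalized.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
trans = adj / adj.sum(axis=0, keepdims=True)

table = precompute_all(trans)        # done once, before any query
slow = rwr_on_the_fly(trans, 2)      # done per query
fast = table[:, 2]                   # a query is just a column lookup
```

Both routes give the same ranking vector; the trade-off is query time against the space needed to hold the dense table.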
31Computing RWR
[Figure: example graph with nodes 1-12]
32Computing RWR
[Figure: the same graph with nodes 1-12]
- Break into communities
33FastRWR
- Instead of ONE BIG (and dense) inverted matrix
34FastRWR
- Instead of ONE BIG (and dense) inverted matrix
- Several, smaller matrices, plus info about the bridges
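A sketch of how "several smaller matrices, plus info about the bridges" can reproduce the big inverse: invert each community separately, factor the cross-community edges with an SVD, and combine the two with the Sherman-Morrison-Woodbury identity. The function names, the weight c = 0.9, and the 6-node graph are assumptions, and the community inverses are stored in one dense array only for brevity; this is not the published implementation.

```python
import numpy as np

def fast_rwr_precompute(trans, blocks, c=0.9, rank=None):
    """Per-community inverses plus a low-rank term for the bridge edges."""
    within = np.zeros_like(trans)
    q1 = np.zeros_like(trans)
    for block in blocks:
        idx = np.ix_(block, block)
        within[idx] = trans[idx]
        # Small inverse that involves one community only.
        q1[idx] = np.linalg.inv(np.eye(len(block)) - c * trans[idx])
    # Bridge edges, factored as u @ diag(s) @ vt.
    u, s, vt = np.linalg.svd(trans - within)
    # rank must not exceed the number of non-zero singular values.
    keep = int((s > 1e-12).sum()) if rank is None else rank
    u, s, vt = u[:, :keep], s[:keep], vt[:keep, :]
    lam = np.linalg.inv(np.diag(1.0 / s) - c * vt @ q1 @ u)
    return q1, u, lam, vt

def fast_rwr_query(pre, start, c=0.9):
    """Ranking vector for one start node, using only the precomputed pieces."""
    q1, u, lam, vt = pre
    e = np.zeros(q1.shape[0])
    e[start] = 1.0
    q1e = q1 @ e
    return (1 - c) * (q1e + c * q1 @ (u @ (lam @ (vt @ q1e))))

# Two triangles (0,1,2) and (3,4,5) joined by the bridge 2-3.
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
trans = adj / adj.sum(axis=0, keepdims=True)

pre = fast_rwr_precompute(trans, blocks=[[0, 1, 2], [3, 4, 5]])
approx = fast_rwr_query(pre, start=0)
exact = (1 - 0.9) * np.linalg.inv(np.eye(6) - 0.9 * trans)[:, 0]
```

With all non-zero singular values kept, `approx` matches `exact` up to rounding; passing a smaller `rank` trades accuracy for less storage.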
37Query Time vs. Pre-Compute Time
[Figure: log query time vs. log pre-compute time]
- Quality: 90%
- On-line
- Up to 150x speedup
- Pre-computation
- Two orders of magnitude saving
38Query Time vs. Pre-Storage
[Figure: log query time vs. log storage]
- Quality: 90%
- On-line
- Up to 150x speedup
- Pre-storage
- Three orders of magnitude saving
39Conclusion
- FastRWR
- Good accuracy (90%)
- 150x speed-up in query time
- Orders of magnitude saving in pre-computation and storage
40Overview
- Mining Static graphs
- CenterPiece Subgraphs (CePS)
- Fast RWR computation
- best-effort subgraph matching (in progress)
- Mining time-evolving graphs
- Other topics
41Best-effort Sub-Graph Matching, on Attributed Graphs
42Best-effort problem dfn.
- Nodes have one (categorical) attribute
- Query: e.g., loop -> money laundering
[Figure: synthetic data]
43Best-effort problem dfn.
44(No Transcript)
45DBLP dataset
- Authorship graph
- Nodes: authors
- Edges: number of co-authored papers
- Attributes: conference and year
- 300k nodes, 1M edges
46Line Query
Footnote for results: red nodes are qualifying nodes; white nodes are intermediate nodes.
47Star-query
Results
Footnote for results: red nodes are qualifying nodes; white nodes are intermediate nodes.
48Loop-Query
49P.I.T. Terrorist Relations
- Nodes: terrorist relationships
- Attributes
- Family, Contact, Colleague, Congregate
- Edges: two relationships share a common person
- 1k nodes and 8k edges
50Star-Query
51Overview
- Mining Static graphs
- CenterPiece Subgraphs (CePS)
- Fast RWR computation
- best-effort subgraph matching (in progress)
- Mining time-evolving graphs
- Tensors intrusion detection
- Other tools (MDL)
- Other topics
52Tensors for time evolving graphs
- Jimeng Sun KDD06
- , SDM07
- CF, Kolda, Sun, SDM07 tutorial
53Social network analysis
- Static: find community structures
- Dynamic: monitor community structure evolution; spot abnormal individuals and abnormal time-stamps
54Network Forensics
- Directional network flows
- A large ISP with 100 POPs, each POP with 10Gbps link capacity (Hotnets2004)
- Task: identify abnormal traffic patterns and find out the cause
[Figure: normal traffic vs. abnormal traffic; axes: source, destination]
55Tensors - outline
- Motivation
- Main ideas
- Experiments
56Static case
[Figure: Location x Type matrix at time 0]
- For a timestamp, data can be modeled using a tensor (a matrix is a 2-mode tensor)
57Dynamic case: Tensor streams
[Figure: Location x Type matrix at time 0]
58Dynamic Data model: Tensor streams
[Figure: Location x Type matrices stacked along time; example entry (Jimeng's Desk, light)]
59Dynamic Data model: Tensor streams
[Figure: Location x Type matrices stacked along time; example entry (Jimeng's Desk, light)]
- Streams come with structure
- (time, location, sensor-modality)
- (time, host-id, measurement-type)
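To make the data model concrete, the sketch below turns (location, type, value) readings into one matrix per timestamp and stacks them into a 3-mode tensor. The location and type names, the readings, and the helper `snapshot` are all made up for illustration.

```python
import numpy as np

locations = ["desk", "window"]          # hypothetical sensor locations
types = ["light", "temp", "humid"]      # hypothetical measurement types

def snapshot(readings):
    """One timestamp -> one 2-mode tensor (matrix) of shape location x type."""
    mat = np.zeros((len(locations), len(types)))
    for loc, typ, value in readings:
        mat[locations.index(loc), types.index(typ)] = value
    return mat

stream = [
    [("desk", "light", 300.0), ("window", "temp", 21.5)],    # time 0
    [("desk", "light", 320.0), ("window", "humid", 40.0)],   # time 1
]

# Stacking the snapshots along a new first axis gives a
# (time, location, type) tensor.
tensor = np.stack([snapshot(r) for r in stream])
```

In a streaming setting the snapshots would arrive one at a time rather than being stacked up front; the stacked array is only meant to show the three modes.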
60What is the factor?
[Figure: 1st factor]
- Factor is a set of 1D summaries
61What is the factor?
[Figure: 1st factor]
- Factor is a set of 1D summaries
- Multi-linear approximation on all aspects
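A factor in this sense can be sketched as one unit vector per mode plus a scaling factor. The code below takes the leading singular vector of each unfolding of a 3-mode tensor, which is one simple way to obtain such summaries; it is an illustration under that assumption, not the tensor analysis used in the talk, and the names and values are made up.

```python
import numpy as np

def leading_vector(tensor, mode):
    """Leading left singular vector of the tensor unfolded along `mode`."""
    unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, 0]

def first_factor(tensor):
    """One 1-D summary per mode of a 3-mode tensor, plus its scaling factor."""
    vecs = [leading_vector(tensor, m) for m in range(3)]
    # Project the tensor onto the three summaries to get the scale.
    scale = np.einsum("ijk,i,j,k->", tensor, *vecs)
    return scale, vecs

# A tensor that is exactly one factor: time trend x location x type.
time_trend = np.array([1.0, 2.0, 3.0])
location = np.array([1.0, 1.0])
sensor_type = np.array([2.0, 0.5])
data = np.einsum("i,j,k->ijk", time_trend, location, sensor_type)

scale, vecs = first_factor(data)
# Multi-linear approximation built from the 1-D summaries.
recon = scale * np.einsum("i,j,k->ijk", *vecs)
```

Because `data` was built from a single factor, the reconstruction is exact; on real data the first factor would capture only the dominant trend.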
62Tensors - outline
- Motivation
- Main ideas
- Experiments
63WTA on real sensor data
[Figure: 1st factor over type, location and time; scaling factor 250]
- 1st factor consists of the main trends
- Daily periodicity on time
- Uniform on all locations
- Temp, Light and Volt are positively correlated with each other and negatively correlated with Humid
64WTA on real sensor data (cont.)
[Figure: 2nd factor over type, time and location; scaling factor 154]
- 2nd factor captures an atypical trend
- Uniform across all time
- Concentrated on 3 locations
- Mainly due to voltage
- Interpretation: two sensors have low battery, and the other one has high battery.
65Application 1: Multiway latent semantic indexing (LSI)
[Figure: authors x keyword tensors from 1990 to 2004, with projection matrices U_authors and U_keyword; a query (Philip Yu, Michael Stonebraker) is mapped to a pattern]
- Projection matrices specify the clusters
- Core tensors give cluster activation level
66Bibliographic data (DBLP)
- Papers from VLDB and KDD conferences
- Construct 2nd order tensors with yearly windows with (author, keywords)
- Each tensor: 4584 x 3741
- 11 timestamps (years)
67Multiway LSI
- DB, 1995. Authors: michael carey, michael stonebraker, h. jagadish, hector garcia-molina. Keywords: queri, parallel, optimization, concurr, objectorient
- DB, 2004. Authors: surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel. Keywords: distribut, systems, view, storage, servic, process, cache
- DM, 2004. Authors: jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal. Keywords: streams, pattern, support, cluster, index, gener, queri
- Two groups are correctly identified: Databases (DB) and Data mining (DM)
- People and concepts are drifting over time
68Application 2: Network Anomaly Detection
- Anomaly detection
- Reconstruction error driven
- Multiple resolution
- Data
- TCP flows collected at CMU backbone
- Raw data 500GB with compression
- Construct 3rd order tensors with hourly windows with (source, destination, port)
- 1200 timestamps (hours)
69Network anomaly detection
[Figure: reconstruction error over time; peaks labeled "scanners"]
- Identify when and where anomalies occurred.
- Prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin).
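Reconstruction-error-driven detection can be sketched on plain matrices: project each time window onto subspaces that describe normal traffic and flag windows that the projection cannot explain. For brevity the sketch uses 2-mode (source x destination) windows rather than 3rd order tensors; the bases, the traffic values, and the 0.1 threshold are assumptions chosen for the example.

```python
import numpy as np

def reconstruction_error(window, u_rows, u_cols):
    """Relative squared error after projecting onto the given subspaces."""
    core = u_rows.T @ window @ u_cols        # small core matrix
    approx = u_rows @ core @ u_cols.T        # back to the original size
    return np.linalg.norm(window - approx) ** 2 / np.linalg.norm(window) ** 2

# Orthonormal bases assumed to describe normal traffic.
u_src = np.array([[1.0], [1.0], [0.0], [0.0]]) / np.sqrt(2)
u_dst = np.array([[1.0], [0.0], [1.0], [0.0]]) / np.sqrt(2)

normal = 2.0 * (u_src @ u_dst.T)
scan = normal.copy()
scan[3, 3] = 5.0                             # traffic outside the usual pattern

windows = [normal, 3.0 * normal, scan]       # three consecutive time windows
errors = [reconstruction_error(w, u_src, u_dst) for w in windows]
flagged = [t for t, err in enumerate(errors) if err > 0.1]
```

The first two windows are reconstructed almost perfectly, while the third has a large error and is the only one flagged, which answers "when"; the entries with the largest residual would answer "where".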
70Computational cost
[Figure panels: 3rd order network tensor; 2nd order DBLP tensor]
- OTA is the offline tensor analysis
- Performance metric: CPU time (sec)
- Observations
- DTA and STA are orders of magnitude faster than OTA
- The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)
71Accuracy comparison
[Figure panels: 3rd order network tensor; 2nd order DBLP tensor]
- Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA to 20%
- Observation: DTA performs very close to OTA in both datasets; STA performs worse in DBLP due to the bigger changes.
72InteMon: intelligent monitoring system on large clusters
- VLDB06 demo
- Operating System Review 06
73System Architecture
74Case 1: Environmental Monitoring
[Figure: temperature and humidity over time]
- Abnormal dehumidification and reheating cycle is identified
75(No Transcript)
76Overview
- Mining Static graphs
- CenterPiece Subgraphs (CePS)
- Fast RWR computation
- best-effort subgraph matching (in progress)
- Mining time-evolving graphs
- Tensors intrusion detection
- Other tools (MDL)
- Other topics
77Parameter-free mining
- Using MDL, to
- Find natural communities
- natural cut-points
- (under submission)
78MDL mining on time-evolving graph (Enron emails)
79Overview
- Mining Static graphs
- Mining time-evolving graphs
- Other topics
- Graph sampling
- Graph generators
80(No Transcript)
81Overview
- Mining Static graphs
- CenterPiece Subgraphs (CePS)
- Fast RWR computation
- best-effort subgraph matching (in progress)
- Mining time-evolving graphs
- Tensors intrusion detection
- Sparse graphs
- Other topics
- Graph sampling
- Graph generators
82Realistic graph generation
- Kronecker graphs Leskovec, PKDD05
- Leskovec, under review
83Why fit graph models?
- Parameters tell us about the structure of a graph
- Extrapolation: given a graph today, how will it look in a year?
- Sampling: can I get a smaller graph with similar properties?
- Anonymization: instead of releasing the real graph (e.g., email network), we can release a synthetic version of it
84Experiments on real AS graph
[Figure panels: degree distribution; hop plot; network value; adjacency matrix eigenvalues]
85Intro to Kronecker graphs
86Problem Definition
- Given a growing graph with count of nodes N1, N2, ...
- Generate a realistic sequence of graphs that will obey all the patterns
- Idea: Self-similarity
- Leads to power laws
- Communities within communities
87Recursive Graph Generation
- There are many obvious (but wrong) ways
- Does not obey Densification Power Law
- Has increasing diameter
- Kronecker Product is exactly what we need
[Figure: initial graph and its recursive expansion]
88Kronecker Product: a Graph
[Figure: intermediate stage and the corresponding adjacency matrices]
89Kronecker Product: a Graph
- Continuing to multiply with G1, we obtain G4 and so on
[Figure: G4 adjacency matrix]
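Repeated Kronecker multiplication is a one-liner with `numpy.kron`. The 3-node initiator below (a path with self-loops) is an assumption chosen for illustration, as is the helper name `kronecker_power`.

```python
import numpy as np

def kronecker_power(g1, k):
    """Adjacency matrix G_k: G1 Kronecker-multiplied with itself k-1 times."""
    gk = g1
    for _ in range(k - 1):
        gk = np.kron(gk, g1)
    return gk

# Hypothetical initiator graph G1: 3 nodes in a path, with self-loops.
g1 = np.array([[1, 1, 0],
               [1, 1, 1],
               [0, 1, 1]])

g2 = kronecker_power(g1, 2)   # 9 x 9
g4 = kronecker_power(g1, 4)   # 81 x 81
```

Each step multiplies the number of nodes by 3 and the number of non-zero entries by 7, so the entry count grows as a power of the node count; the same block pattern also repeats at every scale, which is the self-similarity the slides rely on.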
90Conclusions
- Static graphs: Random Walks, CePS, best-effort sub-graph matching
- Dynamic graphs: Tensors (intrusion/change detection)
- Graph generation: Kronecker
91References
- Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan: Fast Random Walk with Restart and Its Applications. ICDM 2006, Hong Kong.
- Hanghang Tong, Christos Faloutsos: Center-Piece Subgraphs: Problem Definition and Fast Solutions. KDD 2006, Philadelphia, PA.
92References
- Jure Leskovec, Jon Kleinberg and Christos Faloutsos: Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD 2005, Chicago, IL ("Best Research Paper" award).
- Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos: Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. ECML/PKDD 2005, Porto, Portugal.
93References
- Jimeng Sun, Dacheng Tao, Christos Faloutsos: Beyond Streams and Graphs: Dynamic Tensor Analysis. KDD 2006, Philadelphia, PA.
- Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM 2007, Minneapolis, Minnesota, Apr 2007.
94Thank you!
- Contact info
- {christos, htong, jimeng, jure}@cs.cmu.edu
- www.cs.cmu.edu/~christos
- (w/ papers, datasets, code, etc.)