Mining static and time-evolving graphs

LLNL, Feb. '07. C. Faloutsos. School of Computer Science. Carnegie Mellon ...

1
Mining static and time-evolving graphs
  • Christos Faloutsos
  • Carnegie Mellon University

2
Overview
  • Mining Static graphs
  • CenterPiece Subgraphs (CePS)
  • Fast RWR computation
  • best-effort subgraph matching (in progress)
  • Mining time-evolving graphs
  • Tensors intrusion detection
  • Sparse graphs
  • Other topics
  • Graph sampling
  • Graph generators

3
CePS
  • w/ Hanghang Tong, KDD 2006
  • htong_at_cs.cmu.edu

4
Center-Piece Subgraph (CePS)
  • Given Q query nodes
  • Find Center-piece ( )
  • Input of CePS
  • Q query nodes
  • Budget b
  • k softAND coefficient
  • Applications
  • Social networks
  • Law enforcement

5
Challenges in CePS
  • Q1: How to measure the importance?
  • Q2: How to extract connection subgraph?
  • Q3: How to do it efficiently?

6
An Illustrating Example
  • Starting from node 1
  • Move randomly to a neighbor
  • With some probability p, return to node 1
  • Score of node j: Prob (RW will finally stay at j)
(Figure: example graph with nodes 1 to 13)
7
Individual Score Calculation
Node      Q1      Q2      Q3
Node 1    0.5767  0.0088  0.0088
Node 2    0.1235  0.0076  0.0076
Node 3    0.0283  0.0283  0.0283
Node 4    0.0076  0.1235  0.0076
Node 5    0.0088  0.5767  0.0088
Node 6    0.0076  0.0076  0.1235
Node 7    0.0088  0.0088  0.5767
Node 8    0.0333  0.0024  0.1260
Node 9    0.1260  0.0024  0.0333
Node 10   0.1260  0.0333  0.0024
Node 11   0.0333  0.1260  0.0024
Node 12   0.0024  0.1260  0.0333
Node 13   0.0024  0.0333  0.1260
9
AND: Combining Scores
  • Q: How to combine scores?
  • A: Multiply
  • = prob. that 3 random particles coincide on node j
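The multiply rule can be sketched in a few lines. This is a minimal illustration, assuming the per-query scores of the 13-node example (rows are nodes 1 to 13, columns are Q1 to Q3) and assuming the three random particles move independently. The names `scores` and `and_score` are illustrative and do not come from the talk.

```python
import numpy as np

# Individual scores from the 13-node example: rows are nodes 1..13,
# columns are the query nodes Q1, Q2, Q3.
scores = np.array([
    [0.5767, 0.0088, 0.0088],
    [0.1235, 0.0076, 0.0076],
    [0.0283, 0.0283, 0.0283],
    [0.0076, 0.1235, 0.0076],
    [0.0088, 0.5767, 0.0088],
    [0.0076, 0.0076, 0.1235],
    [0.0088, 0.0088, 0.5767],
    [0.0333, 0.0024, 0.1260],
    [0.1260, 0.0024, 0.0333],
    [0.1260, 0.0333, 0.0024],
    [0.0333, 0.1260, 0.0024],
    [0.0024, 0.1260, 0.0333],
    [0.0024, 0.0333, 0.1260],
])

def and_score(scores):
    """AND combination: multiply the per-query scores of each node.

    Assumes the walks started from the different query nodes are independent.
    """
    return np.prod(scores, axis=1)

combined = and_score(scores)
for node, value in enumerate(combined, start=1):
    print(f"Node {node:2d}: {value:.3e}")
```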

10
K_SoftAnd: Combining Scores
  • Generalization: SoftAND
  • We want nodes close to k of Q (k < Q) query nodes.
  • Q: How to do that?

11
K_SoftAnd: Combining Scores
  • Generalization: SoftAND
  • We want nodes close to k of Q (k < Q) query nodes.
  • Q: How to do that?
  • A: Prob(at least k-out-of-Q will meet each other
    at j)
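One way to compute "at least k out of Q" is a small dynamic program over the query nodes. This is a sketch under the assumption that the Q particles are independent; it is a generic probability calculation and not necessarily the exact formula used in the CePS paper. The function name `k_soft_and` is illustrative.

```python
import numpy as np

def k_soft_and(p, k):
    """Probability that at least k of the independent events occur.

    p[i] is the individual score of the node for query node i.
    """
    p = np.asarray(p, dtype=float)
    # dist[m] = probability that exactly m of the events seen so far occurred
    dist = np.zeros(len(p) + 1)
    dist[0] = 1.0
    for q in p:
        # Either the new event occurs (prob q) or it does not (prob 1 - q).
        dist[1:] = dist[1:] * (1.0 - q) + dist[:-1] * q
        dist[0] *= 1.0 - q
    return dist[k:].sum()

node3 = [0.0283, 0.0283, 0.0283]   # scores of node 3 in the 13-node example
print(k_soft_and(node3, 3))        # k = Q: same as the AND (multiply) rule
print(k_soft_and(node3, 2))        # 2_SoftAnd
print(k_soft_and(node3, 1))        # k = 1: same as an OR query
```

With k = Q the result reduces to the product of the scores, and with k = 1 it reduces to one minus the probability that no particle is at j.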

12
AND query vs. K_SoftAnd query
(Figure panels: AND query; 2_SoftAnd query; axis scale x 1e-4)

13
1_SoftAnd query = OR query
14
Challenges in CePS
  • Q1: How to measure the importance?
  • Q2: How to extract connection subgraph?
  • Q3: How to do it efficiently?

15
Extract Alg.
  • Goal
  • Maximize total scores and
  • Appropriate Connections
  • How to: Extract Alg.
  • Dynamic Programming
  • Greedy Alg.
  • Pick up promising node
  • Find best path
(Figure: example graph with nodes 1 to 13)
16
Challenges in CePS
  • Q1: How to measure the importance?
  • Q2: How to extract connection subgraph?
  • Q3: How to do it efficiently?

17
Graph Partition: Efficiency Issue
  • Straightforward way
  • Q linear systems
  • linear in # of edges
  • Observation
  • Skewed dist.
  • How to
  • Graph partition

18
Even better
  • We can correct for the deleted edges (Tong,
    ICDM06, best paper award)
  • But let's omit the details

19
Experimental Setup
  • Dataset
  • DBLP/authorship
  • Author-Paper
  • 315k nodes
  • 1,800k edges

20
Experimental Setup
  • We want to check
  • Do the goodness criteria make sense?
  • Does the extract alg. capture most of the important
    nodes/edges?
  • Efficiency

21
Case Study: AND query
22
2_SoftAnd query
(Figure labels: database, Statistic)
23
Evaluation of Extract Alg.
  • 20 nodes
  • 90% preserved
(Plot: node ratio vs. budget (b), for 2 and 3 query nodes)
24
Running Time vs. Quality for Fast CePS
  • 90% quality
  • 6:1 speedup
(Plot: quality vs. running time)
25
Conclusion
  • Q1: How to measure the importance?
  • A1: RWR + K_SoftAnd
  • Q2: How to find connection subgraph?
  • A2: Extract Alg.
  • Q3: How to do it efficiently?
  • A3: Graph Partition (Fast CePS)
  • 90% quality
  • 6:1 speedup

26
Overview
  • Mining Static graphs
  • CenterPiece Subgraphs (CePS)
  • Fast RWR computation
  • best-effort subgraph matching (in progress)
  • Mining time-evolving graphs
  • Other topics

27
Random walk with restart
  • Nearby nodes, higher scores
  • More red, more relevant
Ranking vector for Node 4
Node 1    0.13
Node 2    0.10
Node 3    0.13
Node 4    0.22
Node 5    0.13
Node 6    0.05
Node 7    0.05
Node 8    0.08
Node 9    0.04
Node 10   0.03
Node 11   0.04
Node 12   0.02
28
Computing RWR
(Equation figure: ranking vector @(t+1) [n x 1], adjacency matrix [n x n], ranking vector @t [n x 1], starting vector [n x 1])
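The update from the ranking vector at time t to the one at time t+1 can be sketched as a power iteration. This is a minimal sketch, assuming a column-normalized adjacency matrix and a restart probability `restart_prob` (the "some p to return" of the illustrating example). The function name, the default value 0.15 and the 5-node graph are assumptions for illustration, not values from the talk.

```python
import numpy as np

def rwr(adj, start, restart_prob=0.15, tol=1e-10, max_iter=1000):
    """Iterate r(t+1) = (1 - p) * W @ r(t) + p * e until it stops changing."""
    adj = np.asarray(adj, dtype=float)
    # Column-normalize the adjacency matrix (assumes every node has an edge).
    W = adj / adj.sum(axis=0, keepdims=True)
    e = np.zeros(adj.shape[0])
    e[start] = 1.0                      # starting vector, n x 1
    r = e.copy()                        # ranking vector, n x 1
    for _ in range(max_iter):
        r_next = (1.0 - restart_prob) * (W @ r) + restart_prob * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Hypothetical 5-node undirected graph: a triangle 0-1-2 with a tail 2-3-4.
adj = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
ranking = rwr(adj, start=0)
print(np.round(ranking, 4))
```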
29
Alternatives
  • On-the-fly: precompute nothing -> slow
  • Precompute everything -> O(N*N) space

30
Alternatives
  • On-the-fly: precompute nothing -> slow
  • Precompute a little, and adjust on-the-fly
  • Precompute everything -> O(N*N) space
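The "precompute everything" extreme can be sketched as one dense matrix inversion. This is a minimal sketch under the same assumptions as a column-normalized random walk with restart probability `restart_prob`; the function name and the small graph are hypothetical. It shows why the space cost is O(N*N): the result is a dense n x n matrix with one ranking vector per column.

```python
import numpy as np

def precompute_all(adj, restart_prob=0.15):
    """Return the dense n x n matrix whose column i is the RWR vector for node i.

    Solves r = (1 - p) * W @ r + p * e for every starting vector e at once.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    W = adj / adj.sum(axis=0, keepdims=True)   # assumes no isolated nodes
    return restart_prob * np.linalg.inv(np.eye(n) - (1.0 - restart_prob) * W)

# Hypothetical 5-node undirected graph: a triangle 0-1-2 with a tail 2-3-4.
adj = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
Q_all = precompute_all(adj)
ranking_for_node_0 = Q_all[:, 0]   # a query is now a column lookup
print(np.round(ranking_for_node_0, 4))
```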

31
Computing RWR
(Figure: example graph with nodes 1 to 12)
32
Computing RWR
  • Break into communities
(Figure: example graph with nodes 1 to 12)
33
FastRWR
  • Instead of ONE BIG (and dense) inverted matrix

34
FastRWR
  • Instead of ONE BIG (and dense) inverted matrix
  • Several, smaller matrices, plus info about the
    bridges
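One way to realize "several smaller matrices, plus info about the bridges" is to invert each community block separately and fold the bridge edges back in with the Woodbury matrix identity. The slides do not spell out the method, so the following is only a sketch under that assumption. The community labels are taken as given (the partitioning step is not shown), and all function names, the restart probability and the 6-node graph are hypothetical.

```python
import numpy as np

def fast_rwr_precompute(adj, labels, restart_prob=0.15, rank=None):
    """Per-community inverses plus a low-rank correction for bridge edges."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    n = adj.shape[0]
    c = 1.0 - restart_prob
    W = adj / adj.sum(axis=0, keepdims=True)     # column-normalized
    same = labels[:, None] == labels[None, :]
    W1 = np.where(same, W, 0.0)                  # within-community edges
    W2 = W - W1                                  # bridge edges
    # Invert each community block on its own. The result is block-diagonal;
    # it is stored densely here only to keep the sketch short.
    Q1_inv = np.zeros((n, n))
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        block = np.eye(len(idx)) - c * W1[np.ix_(idx, idx)]
        Q1_inv[np.ix_(idx, idx)] = np.linalg.inv(block)
    # Low-rank factorization of the bridges: W2 ~= U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(W2)
    keep = int((s > 1e-12).sum())
    if rank is not None:
        keep = min(keep, rank)
    U, s, Vt = U[:, :keep], s[:keep], Vt[:keep, :]
    # Small (keep x keep) matrix from the Woodbury identity.
    Lam = np.linalg.inv(np.diag(1.0 / s) - c * (Vt @ Q1_inv @ U))
    return Q1_inv, U, Lam, Vt, c

def fast_rwr_query(pre, start, restart_prob=0.15):
    """Ranking vector for one start node, using only the precomputed pieces."""
    Q1_inv, U, Lam, Vt, c = pre
    e = np.zeros(Q1_inv.shape[0])
    e[start] = 1.0
    base = Q1_inv @ e
    return restart_prob * (base + c * (Q1_inv @ (U @ (Lam @ (Vt @ base)))))

# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by the bridge 2-3.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
labels = [0, 0, 0, 1, 1, 1]

pre = fast_rwr_precompute(adj, labels)
ranking = fast_rwr_query(pre, start=0)
print(np.round(ranking, 4))
```

When all nonzero singular values of the bridge matrix are kept, the result matches the full inversion; passing a smaller `rank` trades accuracy for less storage.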

37
Query Time vs. Pre-Compute Time
  • Quality: 90%
  • On-line
  • Up to 150x speedup
  • Pre-computation
  • Two orders of magnitude saving
(Plot: log query time vs. log pre-compute time)
38
Query Time vs. Pre-Storage
  • Quality: 90%
  • On-line
  • Up to 150x speedup
  • Pre-storage
  • Three orders of magnitude saving
(Plot: log query time vs. log storage)
39
Conclusion
  • FastRWR
  • Good accuracy (90%)
  • 150x speed-up in query time
  • Orders of magnitude saving in pre-computation and storage

40
Overview
  • Mining Static graphs
  • CenterPiece Subgraphs (CePS)
  • Fast RWR computation
  • best-effort subgraph matching (in progress)
  • Mining time-evolving graphs
  • Other topics

41
Best-effort Sub-Graph Matching, on Attributed Graphs
42
Best-effort problem definition
  • Nodes have one (categorical) attribute
  • Query: e.g., loop -> money laundering

Synthetic data
43
Best-effort problem definition
  • Loop-Query
  • Results

44
  • Star-Query
  • Results

45
DBLP dataset
  • Authorship Graph
  • Nodes: authors
  • Edges: # of co-authored papers
  • Attributes: conference and year
  • 300k nodes, 1M edges

46
Line Query
  • Results

Footnote for results: red nodes are qualifying nodes; white nodes are intermediate nodes.
47
Star-query
Results
Footnote for results: red nodes are qualifying nodes; white nodes are intermediate nodes.
48
Loop-Query
  • Results

49
P.I.T. Terrorist Relations
  • Nodes: terrorist relationships
  • Attributes: Family, Contact, Colleague, Congregate
  • Edges: two relationships share a common person
  • 1k nodes and 8k edges

50
Star-Query
  • Results

51
Overview
  • Mining Static graphs
  • CenterPiece Subgraphs (CePS)
  • Fast RWR computation
  • best-effort subgraph matching (in progress)
  • Mining time-evolving graphs
  • Tensors intrusion detection
  • Other tools (MDL)
  • Other topics

52
Tensors for time evolving graphs
  • Jimeng Sun, KDD06
  • , SDM07
  • CF, Kolda, Sun, SDM07 tutorial

53
Social network analysis
  • Static: find community structures
  • Dynamic: monitor community structure evolution;
    spot abnormal individuals and abnormal time-stamps

54
Network Forensics
  • Directional network flows
  • A large ISP with 100 POPs, each POP 10Gbps link
    capacity (HotNets 2004)
  • Task: identify abnormal traffic patterns and find
    out the cause
(Figure: source vs. destination traffic, normal and abnormal)
55
Tensors - outline
  • Motivation
  • Main ideas
  • Experiments

56
Static case
  • For a timestamp, data can be modeled using a
    tensor (a matrix is a 2-mode tensor)
(Figure: Type x Location matrix at Time 0)
57
Dynamic case: Tensor streams
(Figure: Type x Location matrix at Time 0)
58
Dynamic Data model: Tensor streams
(Figure: Type x Location matrices over time; example entry (Jimeng's desk, light))
59
Dynamic Data model: Tensor streams
  • Streams come with structure
  • (time, location, sensor-modality)
  • (time, host-id, measurement-type)
(Figure: Type x Location matrices over time; example entry (Jimeng's desk, light))

60
What is the factor?
1st factor
  • Factor is a set of 1D summaries

61
What is the factor?
1st factor
  • Factor is a set of 1D summaries
  • Multi-linear approximation on all aspects
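To make "a set of 1D summaries" concrete, the sketch below fits one factor to a 3-mode tensor: one vector per mode plus a scaling factor, so that the tensor is approximated by their outer product. This is a generic rank-1 power iteration used only for illustration; it is not the DTA/STA/WTA algorithm of the talk, and the function name and toy data are assumptions.

```python
import numpy as np

def rank1_factor(X, n_iter=50):
    """Fit X ~= scale * (a outer b outer c) by alternating power iterations."""
    rng = np.random.default_rng(0)
    a, b, c = (rng.random(d) for d in X.shape)
    for _ in range(n_iter):
        # Update one mode's 1D summary while holding the other two fixed.
        a = np.einsum("ijk,j,k->i", X, b, c)
        a /= np.linalg.norm(a)
        b = np.einsum("ijk,i,k->j", X, a, c)
        b /= np.linalg.norm(b)
        c = np.einsum("ijk,i,j->k", X, a, b)
    scale = np.linalg.norm(c)
    return scale, a, b, c / scale

# Toy (location x type x time) tensor built from three known 1D profiles.
u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 1.0, 2.0, 2.0])
w = np.array([2.0, 1.0])
X = 5.0 * np.einsum("i,j,k->ijk", u, v, w)

scale, a, b, c = rank1_factor(X)
approx = scale * np.einsum("i,j,k->ijk", a, b, c)
print(scale, np.abs(X - approx).max())
```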

62
Tensors - outline
  • Motivation
  • Main ideas
  • Experiments

63
WTA on real sensor data
  • 1st factor (scaling factor 250) consists of the main trends
  • Daily periodicity on time
  • Uniform on all locations
  • Temp, Light and Volt are positively correlated
    while negatively correlated with Humid
(Plots: 1st factor along type, location, and time)

64
WTA on real sensor data (cont.)
  • 2nd factor (scaling factor 154) captures an atypical trend
  • Uniformly across all time
  • Concentrating on 3 locations
  • Mainly due to voltage
  • Interpretation: two sensors have low battery, and
    the other one has high battery.
(Plots: 2nd factor along type, time, and location)

65
Application 1: Multiway latent semantic indexing (LSI)
  • Projection matrices specify the clusters
  • Core tensors give cluster activation level
(Figure: authors x keyword tensors from 1990 to 2004, with projection matrices U_authors and U_keyword; labels: Philip Yu, Michael Stonebraker, Pattern, Query)

66
Bibliographic data (DBLP)
  • Papers from VLDB and KDD conferences
  • Construct 2nd order tensors with yearly windows
    with <author, keywords>
  • Each tensor: 4584 x 3741
  • 11 timestamps (years)

67
Multiway LSI
Group: DB, Year: 1995
  Authors: michael carey, michael stonebraker, h. jagadish, hector garcia-molina
  Keywords: queri, parallel, optimization, concurr, objectorient
Group: DB, Year: 2004
  Authors: surajit chaudhuri, mitch cherniack, michael stonebraker, ugur cetintemel
  Keywords: distribut, systems, view, storage, servic, process, cache
Group: DM, Year: 2004
  Authors: jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal
  Keywords: streams, pattern, support, cluster, index, gener, queri
  • Two groups are correctly identified: Databases
    and Data mining
  • People and concepts are drifting over time

68
Application 2: Network Anomaly Detection
  • Anomaly detection
  • Reconstruction error driven
  • Multiple resolution
  • Data
  • TCP flows collected at CMU backbone
  • Raw data 500GB with compression
  • Construct 3rd order tensors with hourly windows
    with <source, destination, port>
  • 1200 timestamps (hours)

69
Network anomaly detection
  • Identify when and where anomalies occurred.
  • The prominent difference between normal and abnormal
    ones is mainly due to unusual scanning activity
    (confirmed by the campus admin).
(Plot labels: error, scanners)
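Reconstruction-error-driven detection can be sketched on synthetic data. The sketch below makes several simplifying assumptions: it uses 2nd order tensors (source x destination matrices) instead of the 3rd order tensors of the talk, it fits the projection matrices offline over all windows rather than incrementally, and the traffic profiles, the injected scanner and the 3-sigma flagging rule are all hypothetical.

```python
import numpy as np

def fit_projections(tensors, rank):
    """Leading eigenvectors of the summed mode-wise covariance matrices."""
    C_rows = sum(X @ X.T for X in tensors)
    C_cols = sum(X.T @ X for X in tensors)
    # eigh returns eigenvalues in ascending order, so take the last columns.
    U = np.linalg.eigh(C_rows)[1][:, -rank:]
    V = np.linalg.eigh(C_cols)[1][:, -rank:]
    return U, V

def reconstruction_errors(tensors, U, V):
    """Relative squared error of each window after projecting onto (U, V)."""
    errs = []
    for X in tensors:
        core = U.T @ X @ V            # small core tensor
        approx = U @ core @ V.T       # back-projection
        errs.append(np.linalg.norm(X - approx) ** 2 / np.linalg.norm(X) ** 2)
    return np.array(errs)

rng = np.random.default_rng(0)
a = np.linspace(1.0, 2.0, 8)      # hypothetical per-source traffic profile
b = np.linspace(1.0, 2.0, 10)     # hypothetical per-destination profile
windows = [np.outer(a, b) + 0.01 * rng.standard_normal((8, 10))
           for _ in range(30)]
windows[12][3, :] += 3.0          # injected scanner: source 3 hits every destination

U, V = fit_projections(windows, rank=1)
errors = reconstruction_errors(windows, U, V)
flagged = np.flatnonzero(errors > errors.mean() + 3 * errors.std())
print(flagged)
```

The window with the injected scanner does not fit the low-rank pattern shared by the other windows, so its reconstruction error stands out.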

70
Computational cost
  • OTA is the offline tensor analysis
  • Performance metric: CPU time (sec)
  • Observations
  • DTA and STA are orders of magnitude faster than
    OTA
  • The slight upward trend in DBLP is due to the
    increasing number of papers each year (data
    become denser over time)
(Plots: 3rd order network tensor; 2nd order DBLP tensor)

71
Accuracy comparison
  • Performance metric: the ratio of reconstruction
    error between DTA/STA and OTA, fixing the error
    of OTA to 20%
  • Observation: DTA performs very close to OTA in
    both datasets; STA performs worse in DBLP due to
    the bigger changes.
(Plots: 3rd order network tensor; 2nd order DBLP tensor)

72
InteMon: intelligent monitoring system on large clusters
  • VLDB06 demo
  • Operating System Review 06

73
System Architecture
74
Case 1: Environmental Monitoring
  • Abnormal dehumidification and reheating cycle is
    identified
(Plots: temperature; humidity)

76
Overview
  • Mining Static graphs
  • CenterPiece Subgraphs (CePS)
  • Fast RWR computation
  • best-effort subgraph matching (in progress)
  • Mining time-evolving graphs
  • Tensors intrusion detection
  • Other tools (MDL)
  • Other topics

77
Parameter-free mining
  • Using MDL, to
  • Find natural communities
  • natural cut-points
  • (under submission)

78
MDL mining on time-evolving graph (Enron emails)
79
Overview
  • Mining Static graphs
  • Mining time-evolving graphs
  • Other topics
  • Graph sampling
  • Graph generators

81
Overview
  • Mining Static graphs
  • CenterPiece Subgraphs (CePS)
  • Fast RWR computation
  • best-effort subgraph matching (in progress)
  • Mining time-evolving graphs
  • Tensors intrusion detection
  • Sparse graphs
  • Other topics
  • Graph sampling
  • Graph generators

82
Realistic graph generation
  • Kronecker graphs: Leskovec, PKDD05
  • Leskovec, under review

83
Why fit graph models?
  • Parameters tell us about the structure of a graph
  • Extrapolation: given a graph today, how will it
    look in a year?
  • Sampling: can I get a smaller graph with similar
    properties?
  • Anonymization: instead of releasing the real graph
    (e.g., email network), we can release a synthetic
    version of it

84
Experiments on real AS graph
(Plots: degree distribution; hop plot; network value; adjacency matrix eigenvalues)
85
Intro to Kronecker graphs
86
Problem Definition
  • Given a growing graph with count of nodes N1, N2, ...
  • Generate a realistic sequence of graphs that will
    obey all the patterns
  • Idea: self-similarity
  • Leads to power laws
  • Communities within communities

87
Recursive Graph Generation
  • There are many obvious (but wrong) ways
  • Does not obey Densification Power Law
  • Has increasing diameter
  • Kronecker Product is exactly what we need
(Figure: initial graph and its recursive expansion)
88
Kronecker Product: a Graph
(Figure: adjacency matrices, including an intermediate stage)
89
Kronecker Product: a Graph
  • Continuing to multiply by G1, we obtain G4, and so
    on
(Figure: G4 adjacency matrix)
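Repeated Kronecker multiplication of an adjacency matrix can be sketched with `numpy.kron`. The 3-node initiator `G1` below (a chain with self-loops) is an assumption chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def kronecker_power(G1, k):
    """Adjacency matrix of the k-th Kronecker power of the initiator G1."""
    G1 = np.asarray(G1)
    Gk = G1.copy()
    for _ in range(k - 1):
        Gk = np.kron(Gk, G1)   # each entry of Gk is replaced by a copy of G1
    return Gk

# Hypothetical initiator: 3 nodes in a chain, with self-loops (7 nonzero entries).
G1 = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
])
G4 = kronecker_power(G1, 4)
print(G4.shape, int(G4.sum()))
```

Each multiplication by `G1` triples the node count and multiplies the number of nonzero entries by 7, so the number of edges grows as a power of the number of nodes.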
90
Conclusions
  • Static graphs: Random Walks, CePS,
    best-effort sub-graph matching
  • Dynamic graphs: Tensors (intrusion/change
    detection)
  • Graph generation: Kronecker

91
References
  • Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan.
    Fast Random Walk with Restart and Its
    Applications. ICDM 2006, Hong Kong.
  • Hanghang Tong, Christos Faloutsos. Center-Piece
    Subgraphs: Problem Definition and Fast Solutions.
    KDD 2006, Philadelphia, PA.

92
References
  • Jure Leskovec, Jon Kleinberg, and Christos
    Faloutsos. Graphs over Time: Densification Laws,
    Shrinking Diameters and Possible Explanations. KDD
    2005, Chicago, IL ("Best Research Paper" award).
  • Jure Leskovec, Deepayan Chakrabarti, Jon
    Kleinberg, Christos Faloutsos. Realistic,
    Mathematically Tractable Graph Generation and
    Evolution, Using Kronecker Multiplication.
    ECML/PKDD 2005, Porto, Portugal.

93
References
  • Jimeng Sun, Dacheng Tao, Christos Faloutsos.
    Beyond Streams and Graphs: Dynamic Tensor
    Analysis. KDD 2006, Philadelphia, PA.
  • Jimeng Sun, Yinglian Xie, Hui Zhang, Christos
    Faloutsos. Less is More: Compact Matrix
    Decomposition for Large Sparse Graphs. SDM,
    Minneapolis, Minnesota, Apr 2007.

94
Thank you!
  • Contact info
  • christos, htong, jimeng, jure <at> cs.cmu.edu
  • www. cs.cmu.edu /christos
  • (w/ papers, datasets, code, etc)