Title: Proximity on Large Graphs: definitions, fast solutions and applications
1Proximity on Large Graphs: definitions, fast solutions and applications
- Speaker Hanghang Tong
- Carnegie Mellon University
IBM T.J. Watson
2008-7-31
2Joint work with
Jia-Yu Pan (Google)
Christos Faloutsos (CMU)
Yehuda Koren (AT&T Labs)
IBM
Spiros Papadimitriou
Philip S. Yu
Huiming Qu
Hani Jamjoom
Tina Eliassi-Rad (LLNL)
Brian Gallagher (LLNL)
Kensuke Oonuma (Sony Corp.)
Yasushi Sakurai (NTT Labs)
3Graphs are everywhere!
4Graph Mining Big Picture
- Graph Level
  - Patterns
  - Laws
  - Generators
- Subgraph Level
  - Community
- Node Level
  - Association
  - Correlation
  - Causality
  - Proximity (We are here!)
[Figure: example graphs; node labels include Smith, Alan, Adam, John, Jones, Tom, Peter, Beck, Jack, Amy, Dan, Anna, Alice, Cell Phone]
5Proximity on Graph What?
a.k.a Relevance, Closeness, Similarity
6Proximity on Graphs Why?
- Link prediction Liben-Nowell, Tong
- Ranking Haveliwala, Chakrabarti
- Email Management Minkov
- Image caption Pan
- Neighborhood Formation Sun
- Conn. subgraph Faloutsos, Tong, Koren
- Pattern match Tong
- Collaborative Filtering Fouss
- Many more
Will return to this later
7Link Prediction
Q How to predict the existence of the link?
A Proximity! Liben-Nowell 2003
Prox. is effective for deleted and absent edges!
[Figure: two density histograms, axis labeled with Prox (i→j) and Prox (j→i): Prox. Hist. for a set of deleted links, and Prox. Hist. for a set of absent links]
8Neighborhood Search on graphs
Conference
Author
Q what is most related conference to ICDM?
A Proximity! Sun ICDM2005
9Automatic Image Caption
[Figure: graph with Region, Image, and Keyword nodes, plus the Test Image; keywords: Sea, Sun, Sky, Wave, Cat, Forest, Tiger, Grass]
Q How to assign keywords to the test image?
A Proximity! Pan 2004
10Center-Piece Subgraph(CePS)
Input
Output
CePS guy
CePS
Original Graph
Q How to find hub for the black nodes?
A Proximity! Tong KDD 2006
11Best-Effort Pattern Match
Input: Query Graph, Data Graph
Output: Matching Subgraph
Q How to find matching subgraph?
A Proximity! Tong KDD 2007 b
12Roadmap
- Motivation
- Part I Definitions
  - Basic RWR
  - Variants
  - Properties
  - Generalizations
- Part II Fast Solutions
- Part III Applications
- Conclusion
13Why not shortest path?
Some bad proximities
pizza delivery guy problem
multi-facet relationship
14Why not max. netflow?
Some bad proximities
No punishment on long paths
15What is a good Proximity?
- Multiple Connections
- Quality of connection
- Direct and indirect connections
- Length, Degree, Weight
16Random walk with restart
17Random walk with restart
Nearby nodes, higher scores
Ranking vector
More red, more relevant
18Why RWR is a good score?
adjacency matrix. c damping factor
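A minimal sketch of how such a ranking vector could be computed, assuming the common RWR formulation r = c W r + (1 - c) e, where W is the column-normalized adjacency matrix, e is the starting vector for the query node, and c is the damping factor mentioned on the slide. The function name `rwr_iterative`, the tolerance, and the small path graph are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def rwr_iterative(adj, query, c=0.9, tol=1e-10, max_iter=1000):
    """Random walk with restart by power iteration (illustrative sketch).

    adj   : (n, n) nonnegative matrix, adj[i, j] = weight of edge i -> j
    query : index of the restart node
    c     : damping factor (assumed: follow an edge w.p. c, restart w.p. 1 - c)
    """
    adj = np.asarray(adj, dtype=float)
    out_weight = adj.sum(axis=1)
    if np.any(out_weight == 0):
        raise ValueError("every node needs at least one outgoing edge")
    # Column j of W holds the transition probabilities out of node j.
    W = (adj / out_weight[:, None]).T
    e = np.zeros(adj.shape[0])
    e[query] = 1.0
    r = e.copy()
    for _ in range(max_iter):
        r_next = c * (W @ r) + (1.0 - c) * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Path graph 0 - 1 - 2 - 3, query node 0.
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]])
scores = rwr_iterative(path, query=0, c=0.5)
```

With this small damping factor the scores decrease with distance from the query node, which matches the "nearby nodes, higher scores" intuition; with c close to 1 the degree of a node starts to dominate.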
20Variant escape probability
- Define Random Walk (RW) on the graph
- Esc_Prob(CMU→IBM)
- Prob (starting at CMU, reaches IBM before returning to CMU)
the remaining graph
CMU
IBM
Esc_Prob = Pr (smile before cry)
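The definition above can be turned into a small linear system. The sketch below assumes a connected graph and a plain (no restart) random walk; the helper name `escape_probability` and the three-node example are hypothetical. It solves for v[k], the probability of reaching the target before the source when the walk is currently at k, and then takes one step out of the source.

```python
import numpy as np

def escape_probability(adj, src, dst):
    """Pr(walk starting at src reaches dst before returning to src)."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    P = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic transitions
    # Boundary conditions: v[src] = 0 (cry), v[dst] = 1 (smile).
    v = np.zeros(n)
    v[dst] = 1.0
    free = [k for k in range(n) if k not in (src, dst)]
    if free:
        # v is harmonic on the remaining nodes: v = P_ff v + P[free, dst].
        A = np.eye(len(free)) - P[np.ix_(free, free)]
        v[free] = np.linalg.solve(A, P[free, dst])
    # First step out of src, then continue according to v.
    return float(P[src] @ v)

# Path graph 0 - 1 - 2.
path3 = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])
esc_end_to_end = escape_probability(path3, 0, 2)
esc_mid_to_end = escape_probability(path3, 1, 0)
```

On an undirected graph the two directions differ only by the degree ratio of the endpoints, which is one way to see the link to effective conductance.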
21Other Variants
- Other measures by RWs
- Commute Time/Hitting Time Fouss
- SimRank Jeh
- Equivalence of Random Walks
- Electric Networks
- EC Doyle, SAEC Faloutsos, CFEC Koren
- String Systems
- Katz Katz, Huang, Schölkopf
- Matrix-Forest-based Alg Chebotarev
All are related to or similar to random walk with restart!
22Chaptering different measurements
Regularized Un-constrained Quad Opt.
Normalize
RWR
Katz
4 ssp decides 1 esc_prob
Esc_Prob Sink
Hitting Time/ Commute Time
relax
X out-degree
Harmonic Func. Constrained Quad Opt.
Effective Conductance
voltage position
String System
Physical Models
Mathematic Tools
24Property Monotonicity
A degree preserving! Koren KDD06, Tong KDD07 a, Tong SDM08
25Property Asymmetry Tong KDD07 a
What is Prox from A to B? What is Prox from B to A?
What is Prox between A and B?
26Asymmetry in un-directed graphs
- Hanghang's #1 employer is IBM
- The #1 employee of IBM is ...
Hanghang
IBM
So is love
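A small numerical illustration of this asymmetry, assuming RWR as the proximity measure and the closed form R = (1 - c) (I - c W)^(-1), whose column q is the ranking vector for query q. The helper `rwr_all_pairs` and the star graph are hypothetical choices for the sketch.

```python
import numpy as np

def rwr_all_pairs(adj, c=0.9):
    """R[i, q] = RWR score of node i for query q, i.e. Prox(q -> i)."""
    adj = np.asarray(adj, dtype=float)
    W = (adj / adj.sum(axis=1, keepdims=True)).T  # column-normalized
    n = adj.shape[0]
    return (1.0 - c) * np.linalg.inv(np.eye(n) - c * W)

# Undirected star: node 0 is the hub, nodes 1..4 are leaves.
star = np.zeros((5, 5))
star[0, 1:] = 1.0
star[1:, 0] = 1.0

R = rwr_all_pairs(star, c=0.9)
prox_leaf_to_hub = R[0, 1]  # query = leaf 1, scored node = hub
prox_hub_to_leaf = R[1, 0]  # query = hub, scored node = leaf 1
```

Even though the adjacency matrix is symmetric, the hub matters more to a leaf than any single leaf matters to the hub; in this sketch the two scores differ by exactly the degree ratio of the two nodes.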
28Group Proximity Tong KDD07 a
- Q How close are Accountants to SECs?
- A Prob (starting at any RED, reaches any GREEN
before touching any RED again)
29Proximity on Attributed Graphs Tong KDD07 b
What is the proximity from node 7 to 10?
If we know that
30A Augmented graphs Tong KDD07 b
31More on Generalizations
- Attributes on edges Chakrabarti KDD 06
- Proximity w/ Time
- Minkov, Tong SDM 2008, Tong CIKM 2008
- Proximity w/ Side Information Tong 2008
32Summary of Part I
- Goal Summarize multiple relationships
- Solutions
- Basic Random Walk with Restart
- Pan 2004, Sun 2006, Tong 2006
- Properties Asymmetry, monotonicity
- Koren 2006, Tong 2007, Tong 2008
- Variants Esc_Prob and many others.
- Faloutsos 2004, Koren 2006, Tong 2007
- Generalizations Group Prox, w/ Attr., w/ Time, w/ Side Information
- Chakrabarti 2006, Tong 2007, Tong 2008
34Roadmap
- Motivation
- Part I Definitions
- Part II Fast Solutions
  - B_Lin RWR
  - BB_Lin Skewed BGs
  - FastUpdate Time-Evolving
- Part III Applications
- Conclusion
35Preliminary Sherman-Morrison Lemma
36SM The block-form
37SM Lemma Applications
- RLS (Recursive least squares)
- and almost any algorithm in time series!
- Leave-one-out cross validation for LSR
- Kalman filtering
- Incremental matrix decomposition
- and all the fast solutions we will introduce!
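For reference, the rank-one form of the lemma states that (A + u v^T)^(-1) = A^(-1) - (A^(-1) u v^T A^(-1)) / (1 + v^T A^(-1) u), provided the denominator is nonzero. The sketch below checks this numerically; the function name `sherman_morrison_update` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Inverse of (A + u v^T) computed from A_inv = A^{-1} without re-inverting."""
    Au = A_inv @ u            # A^{-1} u
    vA = v @ A_inv            # v^T A^{-1}
    denom = 1.0 + v @ Au
    if abs(denom) < 1e-12:
        raise ValueError("A + u v^T is numerically singular")
    return A_inv - np.outer(Au, vA) / denom

rng = np.random.default_rng(0)
A = 5.0 * np.eye(5) + 0.5 * rng.normal(size=(5, 5))  # well-conditioned test matrix
u = 0.1 * rng.normal(size=5)
v = 0.1 * rng.normal(size=5)

updated = sherman_morrison_update(np.linalg.inv(A), u, v)
direct = np.linalg.inv(A + np.outer(u, v))
```

The update costs O(n^2) given the old inverse, compared with O(n^3) for inverting from scratch, which is why the lemma shows up in the incremental algorithms listed above.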
38Computing RWR
[Equation figure: Ranking vector (n x 1), Restart p, Adjacency matrix (n x n), Starting vector (n x 1)]
39Q Given query i, how to solve it?
[Equation figure: Query, Starting vector, Adjacency matrix, Ranking vector (unknown, marked ?)]
40OnTheFly
No pre-computation / light storage
Slow on-line response: O(mE)
41PreCompute
[Figure: 12-node example graph (nodes 1-12) and the pre-computed matrix; labels: R, c x Q, Q]
Haveliwala 2002
42PreCompute
Fast on-line response
Heavy pre-computation/storage cost: O(n^3) / O(n^2)
43Q How to Balance?
On-line
Off-line
44B_Lin Basic Idea Tong ICDM 2006
Find Community
Combine
Fix the remaining
45B_Lin details
Cross-community
46B_Lin details
47B_Lin summary
- Pre-Computational Stage
- Q
- A A few small, instead of ONE BIG, matrix inversions
- On-Line Stage
- Q Efficiently recover one column of Q
- A A few, instead of MANY, matrix-vector multiplications
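The two stages can be sketched as follows, assuming the structure described on the slides: split the normalized adjacency matrix W into within-community edges W1 and cross-community edges W2, invert I - c W1 block by block, approximate W2 by a low-rank SVD, and combine the pieces with the Sherman-Morrison (Woodbury) identity. The function names, the use of a dense SVD, and the given community labels are assumptions for illustration; the talk does not specify them.

```python
import numpy as np

def blin_precompute(W, labels, rank, c=0.9):
    """Pre-computation: a few small inversions plus one low-rank core matrix."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    n = W.shape[0]
    same = labels[:, None] == labels[None, :]
    W1 = np.where(same, W, 0.0)   # within-community part
    W2 = W - W1                   # cross-community part
    # (I - c W1)^{-1}, inverted one community at a time.
    Q1_inv = np.zeros((n, n))
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        block = np.eye(len(idx)) - c * W1[np.ix_(idx, idx)]
        Q1_inv[np.ix_(idx, idx)] = np.linalg.inv(block)
    # Low-rank approximation W2 ~= U diag(s) Vt (rank must not exceed rank(W2)).
    U, s, Vt = np.linalg.svd(W2)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]
    core = np.linalg.inv(np.diag(1.0 / s) - c * (Vt @ Q1_inv @ U))
    return Q1_inv, U, core, Vt

def blin_query(pre, query, c=0.9):
    """On-line stage: a few matrix-vector products recover one ranking vector."""
    Q1_inv, U, core, Vt = pre
    e = np.zeros(Q1_inv.shape[0])
    e[query] = 1.0
    r0 = Q1_inv @ e
    return (1.0 - c) * (r0 + c * (Q1_inv @ (U @ (core @ (Vt @ r0)))))

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2 - 3.
adj = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
W = (adj / adj.sum(axis=1, keepdims=True)).T
labels = [0, 0, 0, 1, 1, 1]

pre = blin_precompute(W, labels, rank=2, c=0.9)
approx = blin_query(pre, query=0, c=0.9)
exact = 0.1 * np.linalg.inv(np.eye(6) - 0.9 * W)[:, 0]
```

In this toy graph the cross-community part has rank 2, so keeping two singular values reproduces the exact ranking vector; on real graphs a smaller rank trades accuracy for speed.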
48Query Time vs. Pre-Compute Time
[Plot: Log Query Time vs. Log Pre-compute Time]
- Quality 90
- On-line
- Up to 150x speedup
- Pre-computation
- Two orders of magnitude saving
50RWR on Bipartite Graph
Author-Conf. Matrix (n authors x m Conferences)
Observation n >> m!
Examples
1. DBLP 400k aus, 3.5k confs
2. NetFlix 2.7M usrs, 18k mvs
51RWR on Skewed bipartite graphs
- Q Given query i, how to solve it?
[Equation figure: block adjacency matrix over n aus and m confs, with zero diagonal blocks and off-diagonal blocks Ar and Ac; the ranking vector is unknown (?)]
52BB_Lin Pre-Computation Tong ICDM 06
2-step RWR for Conferences
- Step 1
- Step 2
- Cost
- Examples
- NetFlix 1.5hr for pre-computation
- DBLP a few minutes
m conferences
All Conf-Conf Prox. Scores
n authors
54BB_Lin Pre-Computation Tong ICDM 06
2-step RWR for Conferences
- Step 1
- Step 2
- Cost
- Examples
- NetFlix 1.5hr for pre-computation
- DBLP a few minutes
All Conf-Conf Prox. Scores (m x m)
Ac/Ar E edges
55BB_Lin On-Line Stage
authors
Conferences
(Base) Case 1 - Conf - Conf
Read out !
56BB_Lin On-Line Stage
authors
Conferences
Case 2 - Au - Conf
1 matrix-vec!
57BB_Lin On-Line Stage
authors
Conferences
Case 3 - Au - Au
2 matrix-vec!
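The three on-line cases can be sketched with a block-matrix inverse. The code assumes a column-normalized bipartite walk with blocks named Ar (n x m) and Ac (m x n) after the slide labels, and an m x m core matrix (I - c^2 Ac Ar)^(-1) holding the conference-conference part; the exact normalization and the function names are assumptions, not taken from the talk.

```python
import numpy as np

def bblin_precompute(M, c=0.9):
    """M: (n, m) author-by-conference weights, n >> m. Returns Ar, Ac, core."""
    M = np.asarray(M, dtype=float)
    Ar = M / M.sum(axis=0, keepdims=True)        # (n, m): conference -> author step
    Ac = (M / M.sum(axis=1, keepdims=True)).T    # (m, n): author -> conference step
    core = np.linalg.inv(np.eye(M.shape[1]) - c * c * (Ac @ Ar))  # only m x m
    return Ar, Ac, core

def conf_to_conf(pre, conf, c=0.9):
    """Case 1: read a column of the core matrix."""
    return (1.0 - c) * pre[2][:, conf]

def author_to_conf(pre, author, c=0.9):
    """Case 2: one matrix-vector product."""
    _, Ac, core = pre
    return (1.0 - c) * c * (core @ Ac[:, author])

def author_to_author(pre, author, c=0.9):
    """Case 3: two matrix-vector products."""
    Ar, Ac, core = pre
    e = np.zeros(Ar.shape[0])
    e[author] = 1.0
    return (1.0 - c) * (e + c * c * (Ar @ (core @ Ac[:, author])))

# Toy example: 5 authors, 2 conferences.
M = np.array([[1, 0], [2, 1], [0, 3], [1, 1], [4, 0]], dtype=float)
n, m = M.shape
c = 0.9
pre = bblin_precompute(M, c)

# Reference: invert the full (n + m) x (n + m) system directly.
W_full = np.zeros((n + m, n + m))
W_full[:n, n:] = pre[0]
W_full[n:, :n] = pre[1]
full = (1.0 - c) * np.linalg.inv(np.eye(n + m) - c * W_full)
```

Only the m x m core is inverted and stored, which is where the saving comes from when n is much larger than m.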
58BB_Lin Examples
400k authors x 3.5k conf.s
2.7m user x 18k movies
60Challenges
- BB_Lin is good for skewed bipartite graphs
- for NetFlix (2.7M nodes and 100M edges)
- On-line cost for query: fraction of seconds
- w/ 1.5 hr pre-computation for m x m core matrix
- But what if the graph is evolving over time?
- New edges/nodes arrive; edge weights increase
- On-line cost: 1.5hr itself becomes a part of this!
61Q How to update the core matrix?
62Update the core matrix
X
M
Rank 2 update
63Update General Case
n authors
- E edges changed
- Involves n authors, m confs.
- Observation
m Conferences
64Update General Case
- Observation
- the rank of update is small!
- Real Example (DBLP Post)
- 1258 time steps
- E up to 20,000!
- min(n,m) < 132
- Our Algorithm
-
m Conferences
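A hedged sketch of why a single edge change gives a rank-2 update of the core matrix. It assumes a simplified core (I - c^2 M^T M)^(-1) in which M is already scaled by a fixed normalization so the inverse exists; the normalization used in the original work is not given on the slides, and the function names are hypothetical. Adding delta to one entry M[i, j] changes M^T M by a rank-2 matrix U V, so the Sherman-Morrison (Woodbury) identity only needs a 2 x 2 inversion.

```python
import numpy as np

def core_matrix(M, c=0.9):
    """Core matrix recomputed from scratch (baseline)."""
    return np.linalg.inv(np.eye(M.shape[1]) - c * c * (M.T @ M))

def fast_single_update(core, M, i, j, delta, c=0.9):
    """Core matrix after M[i, j] += delta, using the OLD M and OLD core."""
    m = M.shape[1]
    a = M[i].copy()                   # row i of the old M
    e_j = np.zeros(m)
    e_j[j] = 1.0
    # M_new^T M_new - M^T M = U @ V with U (m, 2) and V (2, m).
    U = np.column_stack([e_j, delta * a])
    V = np.vstack([delta * a + delta ** 2 * e_j, e_j])
    small = np.linalg.inv(np.eye(2) - c * c * (V @ core @ U))  # 2 x 2 only
    return core + c * c * (core @ U @ small @ V @ core)

rng = np.random.default_rng(0)
M = 0.2 * rng.random((6, 3))          # assumed pre-scaled weights
core = core_matrix(M)

M_new = M.copy()
M_new[4, 1] += 0.1
core_fast = fast_single_update(core, M, i=4, j=1, delta=0.1)
core_full = core_matrix(M_new)
```

A batch of edge changes can be handled the same way with a wider U and V, as long as the rank of the update stays small compared with m.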
65Fast-Single-Update
[Chart: log(Time) (Seconds) across Datasets; annotations: Our method, 176x speedup, 40x speedup]
66Fast-Batch-Update
[Two plots: Time (Seconds) vs. Min (n, m), and Time (Seconds) vs. E; curve labeled Our method]
15x speed-up on average!
67More on Fast Solutions
- FastAllDAP
- Simultaneously solve multiple linear systems
- Tong KDD 2007 a
- MT3
- Multiple-Resolution Analysis on Time Tong CIKM 2008
- Fast-ProSIN
- On-Line response for users' feedback Tong 2008
68Summary of Part II
- Goal Efficiently Solve Linear System(s)
- Sols.
- B_Lin one large linear system Tong ICDM06
- BB_Lin the intrinsic complexity is small Tong ICDM06
- FastUpdate dynamic linear system Tong SDM08
- FastAllDAP multiple linear systems Tong KDD07 a
- MT3 Tong CIKM 2008
- Fast-ProSIN Tong 2008
70Roadmap
- Motivation
- Part I Definitions
- Part II Fast Solutions
- Part III Applications
  - Link Prediction
  - Ranking Related Tasks
  - User Specific Patterns
  - Time Related Tasks
- Conclusion
71Link Prediction existence
Q How to predict the existence of the link?
A Proximity! Liben-Nowell 2003
Prox. is effective for deleted and absent edges!
[Figure: two density histograms, axis labeled with Prox (i→j) and Prox (j→i): Prox. Hist. for a set of deleted links, and Prox. Hist. for a set of absent links]
72Link Prediction direction Tong KDD 07 a
- Q Given the existence of the link, what is the direction of the link?
- A Compare prox(i→j) and prox(j→i)
[Plot: density of Prox (i→j) - Prox (j→i); annotation: >70]
73Beyond Link Prediction
- Collaborative Filtering Fouss
- Name Disambiguation
- Minkov SIGIR 06
- Anomaly Nodes/Edges
- a is abnormal if the neighborhood of a is so different
- Sun ICDM 2005
75Neighborhood Search on graphs
Conference
Author
Q what is most related conference to ICDM?
A Proximity! Sun ICDM2005
76NF example
77gCaP Automatic Image Caption
Sea
Sun
Sky
Wave
?
A Proximity! Pan KDD2004
78Region
Image
Test Image
Keyword
79Region
Image
Test Image
Grass, Forest, Cat, Tiger
Sea
Sun
Sky
Wave
Cat
Forest
Tiger
Grass
Keyword
80C-DEM Multi-Modal Query System for Drosophila Embryo Databases Fan VLDB 2008
82Center-Piece Subgraph(CePS)
Input
Output
CePS guy
CePS
Original Graph
Q How to find hub for the black nodes?
A Proximity! Tong KDD 2006
Red Max (Prox(A, Red) x Prox(B, Red) x Prox(C, Red))
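The product rule above can be sketched directly: compute a proximity score from each query node to every other node, multiply the scores per candidate, and pick the maximum. The sketch uses closed-form RWR as the proximity and excludes the query nodes themselves from the candidates; both choices, and the names `rwr_scores` and `center_piece`, are assumptions for illustration.

```python
import numpy as np

def rwr_scores(adj, c=0.9):
    """Column q holds the RWR scores of all nodes for query q."""
    adj = np.asarray(adj, dtype=float)
    W = (adj / adj.sum(axis=1, keepdims=True)).T
    n = adj.shape[0]
    return (1.0 - c) * np.linalg.inv(np.eye(n) - c * W)

def center_piece(adj, queries, c=0.9):
    """Node maximizing the product of proximities from all query nodes (AND query)."""
    R = rwr_scores(adj, c)
    combined = np.prod(R[:, queries], axis=1)
    combined[list(queries)] = -np.inf   # assumption: a query node is not its own hub
    return int(np.argmax(combined)), combined

# Node 0 connects the query nodes 1, 2, 3; node 4 hangs off node 1 only.
adj = np.zeros((5, 5))
for a, b in [(0, 1), (0, 2), (0, 3), (1, 4)]:
    adj[a, b] = adj[b, a] = 1.0

hub, combined = center_piece(adj, queries=[1, 2, 3])
```

Because the scores are multiplied, a candidate has to be close to every query node to rank highly; a node that is close to only one of them, such as node 4 here, is penalized.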
83CePS Example
84K_SoftAnd Relaxation of AND
Disconnected Communities
Noise
- Asking AND query? → No Answer!
85CePS 2 SoftAND
DB
Stat.
86Best-Effort Pattern Match
Input: Query Graph, Data Graph
Output: Matching Subgraph
Q How to find matching subgraph?
A Proximity! Tong KDD 2007 b
87G-Ray How to?
[Figure: data graph with four matching nodes marked (4, 7, 11, 12)]
Goodness = Prox (12, 4) x Prox (4, 12) x Prox (7, 4) x Prox (4, 7) x Prox (11, 7) x Prox (7, 11) x Prox (12, 11) x Prox (11, 12)
88Effectiveness star-query
Query
Result
89Effectiveness line-query
Query
Result
90Effectiveness loop-query
Query
Result
92Challenge
- Graphs are evolving over time!
- New nodes/edges show up
- Existing nodes/edges die out
- Edge weights change
Q How to Generalize everything?
A Track Proximity! Tong SDM 2008
93pTrack/cTrack Trend analysis on graph level
T. Sejnowski
Rank of Influential-ness
G.Hinton
C. Koch
M. Jordan
Year
94pTrack Philip S. Yu's Top-5 conferences up to each year
DBLP (Au. x Conf.)
- 400k aus
- 3.5k confs
- 20 yrs
Databases, Performance, Distributed Sys.
Databases, Data Mining
95KDD's Rank wrt. VLDB over years
[Plot: Prox. Rank vs. Year]
Data Mining and Databases are more and more relevant!
96cTrack 10 most influential authors in NIPS community up to each year
T. Sejnowski
M. Jordan
Author-paper bipartite graph from NIPS 1987-1999. 3k. 1740 papers, 2037 authors, spreading over 13 years
97T3 Understand Time in Complex Context Tong CIKM 2008
Time Cluster, rep. entities b7,b6, b8
Abnormal Time rep. entities b5,b4
Time Cluster rep. entities b3, b2, b1
Output
Input
98T3 Time-to-Time Proximity Matrix
99More Applications
- Clustering
- Proximity as input Ding KDD 2007
- Email management Minkov CEAS 06.
- Business Process Management Qu 2008
- ProSIN
- Listen to clients' comments Tong 2008
- TANGENT
- Broaden Users' Horizon Oonuma, Tong 2008
- Ghost Edge
- Within Network Classification Gallagher, Tong KDD08 b
100Applications
Computations
Use Proximity as Building block
Efficiently Solve Linear System(s)
MT3 Tong 2008
Fast-ProSIN Tong 2008
FastUpdate Tong 2008
FastAllDAP Tong 2007
BB_Lin Tong 2006
B_Lin Tong 2006
Weighted Multiple Relationship
Proximity On Graphs
Definitions
RWR Pan 2004, Sun 2006, Tong 2006
Properties Koren 2006, Tong 2007, 2008
Variants Faloutsos 2004, Koren 2006, Tong 2007
Generalizations Chakrabarti 2006, Tong 2007, 2008
101Take-home messages
- Proximity Definitions
- RWR
- and a lot of variants
- Computations
- Find out smoothness
- SM Lemma
- Applications
- Proximity as a building block
102References
- L. Page, S. Brin, R. Motwani, T. Winograd. (1998) The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Library.
- T.H. Haveliwala. (2002) Topic-Sensitive PageRank. In WWW, 517-526, 2002.
- J.Y. Pan, H.J. Yang, C. Faloutsos, P. Duygulu. (2004) Automatic multimedia cross-modal correlation discovery. In KDD, 653-658, 2004.
- C. Faloutsos, K. S. McCurley, A. Tomkins. (2004) Fast discovery of connection subgraphs. In KDD, 118-127, 2004.
- J. Sun, H. Qu, D. Chakrabarti, C. Faloutsos. (2005) Neighborhood Formation and Anomaly Detection in Bipartite Graphs. In ICDM, 418-425, 2005.
- W. Cohen. (2007) Graph Walks and Graphical Models. Draft.
103References
- P. Doyle, J. Snell. (1984) Random walks and electric networks, volume 22. Mathematical Association of America, New York.
- Y. Koren, S. C. North, and C. Volinsky. (2006) Measuring and extracting proximity in networks. In KDD, 245-255, 2006.
- A. Agarwal, S. Chakrabarti, S. Aggarwal. (2006) Learning to rank networked entities. In KDD, 14-23, 2006.
- S. Chakrabarti. (2007) Dynamic personalized pagerank in entity-relation graphs. In WWW, 571-580, 2007.
- F. Fouss, A. Pirotte, J.-M. Renders, M. Saerens. (2007) Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans. Knowl. Data Eng. 19(3), 355-369, 2007.
104References
- H. Tong, C. Faloutsos. (2006) Center-piece subgraphs: problem definition and fast solutions. In KDD, 404-413, 2006.
- H. Tong, C. Faloutsos, J.Y. Pan. (2006) Fast Random Walk with Restart and Its Applications. In ICDM, 613-622, 2006.
- H. Tong, Y. Koren, C. Faloutsos. (2007) Fast direction-aware proximity for graph mining. In KDD, 747-756, 2007.
- H. Tong, B. Gallagher, C. Faloutsos, T. Eliassi-Rad. (2007) Fast best-effort pattern matching in large attributed graphs. In KDD, 737-746, 2007.
- H. Tong, S. Papadimitriou, P.S. Yu, C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. To appear in SDM 2008.
105References
- B. Gallagher, H. Tong, T. Eliassi-Rad, C. Faloutsos. Using Ghost Edges for Classification in Sparsely Labeled Networks. KDD 2008.
- H. Tong, Y. Sakurai, T. Eliassi-Rad, and C. Faloutsos. Fast Mining of Complex Time-Stamped Events. CIKM 08.
- H. Tong, H. Qu, and H. Jamjoom. Measuring Proximity on Graphs with Side Information. Submitted.
- K. Oonuma, H. Tong, and C. Faloutsos. TANGENT: A Novel, Surprise-me, Recommendation Algorithm. Submitted.
106- Thank you!
- htong_at_cs.cmu.edu
- www.cs.cmu.edu/htong