Title: DOULION: Counting Triangles in Massive Graphs with a Coin
1 DOULION: Counting Triangles in Massive Graphs with a Coin
- Charalampos (Babis) Tsourakakis
- Carnegie Mellon University
- KDD '09, Paris
- Joint work with U Kang, Gary L. Miller, Christos Faloutsos
2 Outline
- Motivation
- Related Work
- Proposed Method
- Results
- Conclusion
- Extra
3 Why is Triangle Counting important?
- Clustering coefficient
- Transitivity ratio
- Social Network Analysis fact: "Friends of friends are friends" [WF94]
- [Figure: triangle on nodes A, B, C]
- Hidden Thematic Structure of the Web (Eckmann et al., PNAS [EM02])
- Motif Detection (e.g., [YPSB05])
- Web Spam Detection (Becchetti et al., KDD '08 [BBCG08])
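To make the first two bullets concrete, here is a minimal sketch of the transitivity ratio, using its usual definition as three times the number of triangles divided by the number of connected triples. The function name and the adjacency-dictionary input format are illustrative assumptions, not part of the talk.

```python
from itertools import combinations

def transitivity_ratio(adj):
    """adj: dict mapping node -> set of neighbors (undirected, no self-loops).

    Hypothetical helper for illustration only.
    """
    closed = 0   # connected triples whose two endpoints are also adjacent
    triples = 0  # connected triples (paths of length two), grouped by center
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    # Each triangle is seen once per center, so `closed` equals 3 * #triangles.
    return closed / triples if triples else 0.0

# Triangle A-B-C with a pendant node D attached to C.
example = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
ratio = transitivity_ratio(example)
```

In the example, five connected triples exist and three of them are closed, so the ratio is 3/5.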
4 Personal Motivation
- [CET08] Political Blogs
- [Figure: triangle counts expressed through the eigenvalues of the adjacency matrix and the i-th eigenvector]
- Keep only 3!
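The formulas on this slide did not survive extraction. The labels that remain match the standard spectral identities for triangle counts, stated here as a sketch under that assumption. In the notation below (introduced for illustration), A is the adjacency matrix of a simple undirected graph on n nodes, with eigenvalues \lambda_j and orthonormal eigenvectors u_j; \Delta(G) is the total number of triangles and \Delta_i the number of triangles through node i.

```latex
\begin{align*}
\Delta(G) &= \frac{1}{6}\operatorname{tr}(A^{3}) = \frac{1}{6}\sum_{j=1}^{n} \lambda_j^{3}, \\
\Delta_i &= \frac{1}{2}\,(A^{3})_{ii} = \frac{1}{2}\sum_{j=1}^{n} \lambda_j^{3}\, u_{ij}^{2},
\end{align*}
```

where u_{ij} is the i-th entry of the j-th eigenvector. Because the eigenvalues enter cubed, the few eigenvalues of largest magnitude dominate the sum, which is one way to read "Keep only 3!".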
6 Counting methods
- Sparse graphs
- Matrix multiplication: not practical
- M. Latapy, Theory and Experiments
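7 Naive Sampling
- r independent samples of three distinct vertices
- X = 1 if the sampled triple is a triangle (T3); X = 0 otherwise (T2, T1, T0), where Ti is the number of vertex triples with exactly i edges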
8 Naive Sampling
- r independent samples of three distinct vertices
- [Formula: bound on the number of samples r] with probability at least 1-δ
- Works
- Prohibitive for graphs with T3 = o(n^2), e.g., T3 = n^2/log n
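The naive scheme can be sketched as follows: draw r vertex triples uniformly at random, record the fraction that form triangles, and scale by the number of possible triples. The function name, argument layout, and `seed` parameter are assumptions made for this illustration.

```python
import random
from math import comb

def naive_triangle_estimate(nodes, edges, r, seed=None):
    """Estimate T3 from r uniformly sampled triples of distinct vertices."""
    rng = random.Random(seed)
    nodes = list(nodes)
    edge_set = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(r):
        a, b, c = rng.sample(nodes, 3)  # three distinct vertices
        if (frozenset((a, b)) in edge_set
                and frozenset((a, c)) in edge_set
                and frozenset((b, c)) in edge_set):
            hits += 1
    # Fraction of sampled triples that are triangles, times the number of triples.
    return hits / r * comb(len(nodes), 3)

# On the complete graph K5 every triple is a triangle.
k5_nodes = range(5)
k5_edges = [(u, v) for u in range(5) for v in range(u + 1, 5)]
k5_estimate = naive_triangle_estimate(k5_nodes, k5_edges, r=50, seed=1)
```

When triangles are rare among the roughly n^3/6 triples, almost every sample misses, which is why the slide calls the method prohibitive for graphs with few triangles.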
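9 Buriol, Frahling, Leonardi, Marchetti-Spaccamela, Sohler
- Sample uniformly at random an edge (i,j) and a node k in V \ {i,j}
- Check if edges (i,k) and (j,k) exist in E(G)
- [Figure: edge (i,j) and node k, with candidate edges (i,k) and (j,k) marked "?"]
- [Formula: number of samples]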
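The two steps above can be sketched as an in-memory estimator. This is not the authors' streaming implementation; it only illustrates the sampling idea. The scaling factor m(n-2)/3 follows because each triangle is hit by exactly three of the m(n-2) possible (edge, node) pairs. All names are illustrative assumptions.

```python
import random

def edge_node_estimate(nodes, edges, r, seed=None):
    """Estimate the triangle count from r samples of (edge, third node)."""
    rng = random.Random(seed)
    nodes = list(nodes)
    edges = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edges}
    n, m = len(nodes), len(edges)
    hits = 0
    for _ in range(r):
        i, j = rng.choice(edges)
        k = rng.choice(nodes)
        while k == i or k == j:  # rejection sampling for k in V \ {i, j}
            k = rng.choice(nodes)
        if frozenset((i, k)) in edge_set and frozenset((j, k)) in edge_set:
            hits += 1
    # E[hit] = 3 * T3 / (m * (n - 2)), so invert that relation.
    return hits / r * m * (n - 2) / 3

# On K4 every (edge, node) sample closes a triangle.
k4_edges = [(u, v) for u in range(4) for v in range(u + 1, 4)]
k4_estimate = edge_node_estimate(range(4), k4_edges, r=30, seed=2)
```

The sketch assumes the graph has at least three nodes and at least one edge.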
11 Our Sampling Approach
- G(V,E)
- Toss a coin for edge (i,j): HEADS! (i,j) survives and gets weight 1/p
12 Our Sampling Approach
- G(V,E)
- Toss a coin for edge (k,m): TAILS! (k,m) dies
13 Sampling approach
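The body of this slide was lost, so the following is a minimal sketch of the coin-tossing idea from the two preceding slides. Each edge survives independently with probability p, and because a triangle survives with probability p^3, the count on the sparsified graph is scaled by 1/p^3 (equivalently, each surviving edge carries weight 1/p). The function names and the simple exact counter are assumptions; any exact triangle counter could be substituted.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Exact triangle count of an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    closed = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed += 1
    return closed // 3  # each triangle is found once per vertex

def doulion(edges, p, seed=None):
    """Keep each edge with probability p, count triangles, rescale by 1/p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]  # HEADS: the edge survives
    return count_triangles(kept) / p ** 3

k6_edges = list(combinations(range(6), 2))
exact = doulion(k6_edges, p=1.0)            # no sparsification
approx = doulion(k6_edges, p=0.5, seed=7)   # one random estimate
```

With p = 1 nothing is removed and the estimate equals the exact count; smaller p trades accuracy for a smaller graph.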
14 Our Sampling Approach on Kn
- [Figure: Kn sparsified to Gn,0.5; panels: Initially, In Expectation, Weighted]
15 Mean and Variance
- Δ: number of triangles
- k: number of pairs of non-edge-disjoint triangles
- X: random variable, our estimate
- E[X] = Δ
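The variance formula did not survive extraction. The derivation below is a sketch using the standard indicator-variable argument. It assumes k counts unordered pairs of distinct triangles that share an edge, and it introduces \delta_i as the indicator that all three edges of triangle i survive.

```latex
\begin{align*}
X &= \frac{1}{p^{3}} \sum_{i=1}^{\Delta} \delta_i,
  \qquad \Pr[\delta_i = 1] = p^{3}, \\
\mathbb{E}[X] &= \frac{1}{p^{3}} \sum_{i=1}^{\Delta} p^{3} = \Delta, \\
\operatorname{Var}(X) &= \frac{1}{p^{6}} \Big( \sum_{i} \operatorname{Var}(\delta_i)
  + \sum_{i \neq j} \operatorname{Cov}(\delta_i, \delta_j) \Big)
  = \frac{\Delta\,(p^{3} - p^{6}) + 2k\,(p^{5} - p^{6})}{p^{6}}.
\end{align*}
```

Edge-disjoint triangles survive independently and contribute no covariance. Two distinct triangles that share an edge span five edges in total, so each such pair contributes p^5 - p^6, counted twice in the sum over ordered pairs.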
17 Doulion and NodeIterator
- Sparsify first and then use Node Iterator to count triangles.
- Node Iterator: consider each node and count how many edges exist among its neighbors.
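A minimal sketch of Node Iterator as described above. It also returns the number of neighbor pairs examined, a simple proxy for running time. The function name, the adjacency-dictionary input, and the work counter are illustrative assumptions.

```python
from itertools import combinations

def node_iterator(adj):
    """adj: dict mapping node -> set of neighbors (undirected, no self-loops).

    Returns (number of triangles, number of neighbor pairs checked).
    """
    closed = 0
    pair_checks = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            pair_checks += 1
            if b in adj[a]:  # edge among the neighbors of v
                closed += 1
    # Every triangle is counted once at each of its three vertices.
    return closed // 3, pair_checks

k4 = {v: set(range(4)) - {v} for v in range(4)}
triangles, work = node_iterator(k4)
```

The work is the sum over nodes of (degree choose 2), so high-degree nodes dominate the cost; sparsifying first shrinks every degree.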
18 Expected Speedup
- Expected speedup: 1/p^2
- Proof:
- Let R be the running time of Node Iterator after the sparsification
- Therefore, the expected speedup is 1/p^2
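The intermediate steps of the proof were lost in extraction. One way to fill them in is sketched below. It assumes the cost of Node Iterator is modeled as the number of neighbor pairs it examines, writes d_v for the degree of node v in G, and D_v for its degree after sparsification. The cost model and this notation are assumptions of the sketch.

```latex
\begin{align*}
R &= \sum_{v \in V} \binom{D_v}{2}, \qquad D_v \sim \operatorname{Bin}(d_v, p), \\
\mathbb{E}[R] &= \sum_{v \in V} \mathbb{E}\binom{D_v}{2}
  = p^{2} \sum_{v \in V} \binom{d_v}{2},
\end{align*}
```

since each pair of edges incident to v survives together with probability p^2. The cost on the original graph is \sum_v \binom{d_v}{2}, so dividing it by E[R] gives 1/p^2.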
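19 Some results (I)
- [Figure: result plots for two graphs, labeled "3M, 35M" and "400K, 2.1M"]
20 Some results (II)
- [Figure: result plots for two graphs, labeled "3.1M, 37M" and "3.6M, 42M"]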
22 Conclusions
- New sampling approach that counts triangles approximately.
- Basic analysis of the estimate (expectation, variance, expected speedup).
- Experiments on many real-world datasets, where we showed that for p = constant we get high-quality estimates and 1/p^2 constant speedups.
23 Question
- Can p be smaller than constant? How small can we afford p to be and at the same time guarantee concentration?
- Could, e.g., p be as small as 1/ ???
25 Approximate Triangle Counting
- Approximate Triangle Counting, arXiv preprint: http://arxiv.org/PS_cache/arxiv/pdf/0904/0904.3761v1.pdf
- C.E.T, M.N. Kolountzakis, G.L. Miller
26 Theorem [C.E.T, Kolountzakis, Miller 2009]
- Mildness: pick p = 1
- How to choose p?
- Concentration
27 Practitioner's Guide
- Wikipedia 2005: 1.6M nodes, 18.5M edges
- Pick p = 1/[denominator lost in extraction]; keep doubling until concentration
- [Figure annotations: "Concentration appears", "Concentration becomes stronger"]
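The doubling procedure can be sketched as below. The slide does not say how concentration is detected. This sketch assumes one possible test: run a few independent trials at the current p and stop when their relative spread falls under a tolerance. The names `choose_p`, `tol`, and `trials`, and the stopping rule itself, are hypothetical.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Exact triangle count of an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    closed = sum(1 for nbrs in adj.values()
                 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed // 3

def doulion(edges, p, rng):
    """One sparsify-and-count estimate with survival probability p."""
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3

def choose_p(edges, p_start, tol=0.05, trials=5, seed=None):
    """Double p until repeated estimates agree to within `tol` (assumed rule)."""
    rng = random.Random(seed)
    p = min(p_start, 1.0)
    while True:
        estimates = [doulion(edges, p, rng) for _ in range(trials)]
        mean = sum(estimates) / trials
        # Relative spread of the trials; undefined (treated as infinite) at mean 0.
        spread = (max(estimates) - min(estimates)) / mean if mean > 0 else float("inf")
        if spread <= tol or p >= 1.0:
            return p, mean
        p = min(2 * p, 1.0)

k30_edges = list(combinations(range(30), 2))
p_final, estimate = choose_p(k30_edges, p_start=1.0, tol=0.0)
p_loose, est_loose = choose_p(k30_edges, p_start=0.25, tol=0.5, seed=3)
```

Running several trials at each p multiplies the cost, so in practice the tolerance and trial count would need tuning against the 1/p^2 speedup.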
28 Bad Instances
- Remove edge (1,2)
- Remove any weighted edge (w sufficiently large)
29 Thanks!
- http://www.cs.cmu.edu/ctsourak/projects.html
- Code and datasets available
- graphminingtoolbox_at_gmail.com
- (HADOOP, MATLAB, JAVA implementations along with small real-world graphs; all datasets used are on the web)
- "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software environment and the complete set of instructions which generated the figures." Buckheit and Donoho [BD95]
30 References
- Efficient semi-streaming algorithms for local triangle counting in massive graphs. Becchetti, Boldi, Castillo, Gionis [BBCG08]
- Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast. Ye, Peyser, Spencer, Bader [YPSB05]
31 References
- Curvature of co-links uncovers hidden thematic layers in the World Wide Web. Eckmann, Moses [EM02]
32 References
- Fast Counting of Triangles in Large Real-World Networks: Algorithms and Laws. C. Tsourakakis [CET08]
- Wavelab and reproducible research. Buckheit, Donoho [BD95]
33 References
- Social Network Analysis: Methods and Applications. Wasserman, Faust [WF94]
- Counting triangles in data streams. Buriol, Frahling, Leonardi, Spaccamela, Sohler [BFLSS06]
34 Doulion