Big (graph) data analytics

Christos Faloutsos, CMU (IC '13)

1
Big (graph) data analytics
  • Christos Faloutsos
  • CMU

2
CONGRATULATIONS!
Welcome to CMU!
3
Outline
  • QA
  • Problem definition / Motivation
  • Graphs and power laws
  • Anomaly/fraud detection
  • Conclusions

4
QA
  • Are you recruiting? How many? A: Maybe, 1
  • How many do you have? A: 4 (5pdocs)
  • How frequently do you meet them? A: 1/week
  • What is your advising style? A: results
  • How do you feel about summer internships?
    A: Yes/Maybe (FB, MSR, IBM, ...)

6
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Patterns
  • Scalability and hadoop
  • Anomaly detection
  • Conclusions

7
Motivation
  • Data mining: find patterns (rules, outliers)
  • What do real graphs look like? Anomalies?
  • Virus/influence propagation
  • Time series / env. monitoring

8
Graphs - why should we care?
[Figure: Food Web [Martinez '91]]
[Figure: 1B users, 10-100B revenue]
[Figure: Internet Map [lumeta.com]]
10
Tensors: graphs on steroids
  • Tensors (multi-dimensional arrays)
  • Predicates (subject, verb, object) in knowledge
    base

Vagelis Papalexakis (CMU-CS), Tom Mitchell (CMU/CS-MLD)
[Figure: NELL (Never Ending Language Learner) data as a tensor with modes of size 26M, 26M, and 48M; 144M nonzeros. Example facts: "Eric Clapton plays guitar", "Barack Obama is the president of U.S."]
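A minimal sketch of how (subject, verb, object) predicates could be stored as a sparse 3-mode tensor in coordinate form. The function name `build_sparse_tensor` and the tiny triple list are illustrative assumptions; the slides do not show how the NELL tensor is actually encoded.

```python
from collections import defaultdict


def build_sparse_tensor(triples):
    """Index (subject, verb, object) strings and count each nonzero cell."""
    vocab = [dict(), dict(), dict()]  # one string -> integer index map per mode
    nonzeros = defaultdict(int)       # (i, j, k) -> number of occurrences
    for triple in triples:
        coords = tuple(
            vocab[mode].setdefault(term, len(vocab[mode]))
            for mode, term in enumerate(triple)
        )
        nonzeros[coords] += 1
    return vocab, dict(nonzeros)


# Hypothetical toy input; the real data has 144M nonzeros.
triples = [
    ("Eric Clapton", "plays", "guitar"),
    ("Barack Obama", "is the president of", "U.S."),
    ("Eric Clapton", "plays", "guitar"),
]
vocab, nonzeros = build_sparse_tensor(triples)
shape = tuple(len(v) for v in vocab)
print("shape:", shape, "nonzeros:", len(nonzeros))
```

Only the nonzero cells are stored, which is what makes a 26M x 26M x 48M tensor with 144M nonzeros representable at all.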
11
Concept Discovery
  • Concept Discovery in Knowledge Base

NP1: Internet, file, data
NP2: protocol, software, suite
13
NeuroSemantics
>200GB total
14
Experiments
  • GigaTensor solves 100x larger problem

[Figure: scalability plot over tensor dimensions (I), (J), (K); number of nonzeros = I / 50]
15
Problem 1 - network and graph mining
  • What does the Internet look like?
  • What does Facebook look like?
  • What is normal/abnormal?
  • Which patterns/laws hold?
  • To spot anomalies (rarities), we have to discover
    patterns
  • Large datasets reveal patterns/anomalies that may
    be invisible otherwise

16
Graph mining
  • Are real graphs random?

17
Laws and patterns
  • NO!!
  • Diameter
  • in- and out- degree distributions
  • other (surprising) patterns

18
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Patterns
  • Scalability and hadoop
  • Anomaly/Fraud detection
  • Conclusions

19
S1: degree distributions
  • Q: avg degree is 3 - what is the most probable
    degree?

[Plot: count vs. degree, with degree 3 marked on the x-axis]
21
Solution
[Plot: frequency vs. outdegree, log-log scale (Nov '97); exponent = slope, O = -2.15]
  • The plot is linear in log-log scale [FFF'99]
  • freq = degree ^ (-2.15)
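The linear log-log relationship can be checked with a small sketch like the one below. It fits a straight line to log(count) vs. log(degree) by least squares, which is a simple but crude estimator; the function names and the synthetic degree sequence are assumptions for illustration, not the data behind the plot.

```python
import numpy as np


def fit_loglog_slope(degrees):
    """Least-squares slope of log10(count) vs. log10(degree)."""
    values, counts = np.unique(np.asarray(degrees), return_counts=True)
    keep = values > 0  # log is undefined for degree 0
    slope, _intercept = np.polyfit(
        np.log10(values[keep]), np.log10(counts[keep]), 1
    )
    return slope


def most_probable_degree(degrees):
    """Degree value that occurs most often."""
    values, counts = np.unique(np.asarray(degrees), return_counts=True)
    return int(values[np.argmax(counts)])


# Synthetic sequence whose counts follow count ~ degree ^ (-2.15).
support = np.arange(1, 31)
target_counts = np.rint(100_000 * support.astype(float) ** -2.15).astype(int)
degrees = np.repeat(support, target_counts)

slope = fit_loglog_slope(degrees)
mode = most_probable_degree(degrees)
print("fitted slope:", round(float(slope), 2), "most probable degree:", mode)
```

Under a power law the most probable degree is the smallest one, regardless of the average, which is the point of the question on the previous slide.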

22
Solution S.2: Triangle Laws
  • Real social networks have a lot of triangles
  • Friends of friends are friends
  • Any patterns?

24
Triangle Law: S.2 [Tsourakakis ICDM 2008]
[Plots (SN, Reuters, Epinions): X-axis: degree; Y-axis: mean # triangles]
n friends -> n^1.6 triangles
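The quantity on the Y-axis can be computed on a small graph as follows. This is a straightforward in-memory sketch, assuming an undirected edge list; the function names are hypothetical, and this is not the scalable method used for the large graphs in the talk.

```python
from collections import defaultdict
from itertools import combinations


def triangles_per_node(edges):
    """Return adjacency sets and the number of triangles each node is in."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    tri = {}
    for node, nbrs in adj.items():
        # A triangle at `node` is a pair of its neighbors that are also linked.
        tri[node] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return adj, tri


def mean_triangles_by_degree(adj, tri):
    """Average triangle count over all nodes that share the same degree."""
    buckets = defaultdict(list)
    for node, nbrs in adj.items():
        buckets[len(nbrs)].append(tri[node])
    return {d: sum(ts) / len(ts) for d, ts in sorted(buckets.items())}


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
adj, tri = triangles_per_node(edges)
by_degree = mean_triangles_by_degree(adj, tri)
print(tri, by_degree)
```

Plotting `by_degree` on log-log axes for a real social network is what would expose the power-law relationship between degree and triangle count.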
26
Triangle counting for large graphs?
  • Anomalous nodes in Twitter (3 billion edges)
  • U Kang, Brendan Meeder, et al., PAKDD'11

29
And many more patterns
  • Diameter SHRINKS with size!
  • # nodes vs # edges (power law (!))
  • conn. components (power law, too)
  • Contact/phone-call duration (log-logistic)
  • Total node weight vs # edges (super-linear/power
    law)
  • ...

30
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Patterns and anomalies
  • Scalability and hadoop
  • Anomaly/fraud detection
  • Conclusions

31
Scalability
  • Google: > 450,000 processors in clusters of
    2000 processors each [Barroso, Dean, Hölzle,
    "Web Search for a Planet: The Google Cluster
    Architecture", IEEE Micro 2003]
  • Yahoo: 5PB of data [Fayyad, KDD'07]
  • Problem: machine failures, on a daily basis
  • How to parallelize data mining tasks, then?
  • A: map/reduce - hadoop (open-source clone)
    http://hadoop.apache.org/
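To make the map/reduce idea concrete, here is a single-machine sketch that simulates the map, shuffle, and reduce phases in memory and chains two jobs to get an out-degree histogram from an edge list. It does not use the Hadoop API; `run_map_reduce` and the mapper/reducer names are hypothetical helpers.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


def map_edge_to_source(edge):
    source, _target = edge
    yield source, 1  # one outgoing edge seen for `source`


def map_node_to_degree(item):
    _node, degree = item
    yield degree, 1  # one node seen with this degree


def sum_values(key, values):
    return key, sum(values)


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("a", "d")]

# Job 1: out-degree per node. Job 2: number of nodes per out-degree.
out_degree = run_map_reduce(edges, map_edge_to_source, sum_values)
histogram = run_map_reduce(out_degree.items(), map_node_to_degree, sum_values)
print(out_degree, histogram)
```

Because each mapper call sees one record and each reducer call sees one key, both phases can be spread over many machines, and a failed task can simply be rerun on another machine.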

32
details
[Diagram labels: User Program, Master (assign map, assign reduce), Input Data (on HDFS), local write, Reducer (remote read, sort)]
  • By default: 3-way replication
  • Late/dead machines: ignored, transparently (!)
33
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Patterns
  • Scalability and hadoop
  • Anomaly/Fraud detection
  • Conclusions

34
E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU [WWW'07]
37
E-bay Fraud detection - NetProbe
38
App-store fraud
  • Opinion Fraud Detection in Online Reviews using
    Network Effects
  • Leman Akoglu, Rishi Chandy, CF
  • ICWSM'13

39
Problem
  • Given:
  • user-product review network
  • review sign (+/-)
  • Classify:
  • objects into type-specific classes
  • users: honest / fraudster
  • products: good / bad
  • reviews: genuine / fake
  • No side data!
  • (e.g., timestamp, review text)

40
Formulation: BP
[Figure: user and product nodes with class labels (e.g., honest, bad), before and after BP]
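A rough sketch of how belief propagation could run on a signed user-product review graph, assuming BP here means loopy belief propagation with uniform priors. The compatibility values in `psi`, the parameter `eps`, and the toy review list are illustrative assumptions, not values taken from the slides.

```python
from collections import defaultdict

import numpy as np


def signed_bp(reviews, eps=0.1, iters=20):
    """Loopy BP; reviews is a list of (user, product, sign), sign in {+1, -1}.

    User states: 0 = honest, 1 = fraudster. Product states: 0 = good, 1 = bad.
    """
    # psi[sign][user_state, product_state]: assumed compatibilities, e.g. an
    # honest user's positive review most likely points at a good product.
    psi = {
        +1: np.array([[1 - eps, eps], [2 * eps, 1 - 2 * eps]]),
        -1: np.array([[eps, 1 - eps], [1 - 2 * eps, 2 * eps]]),
    }
    user_edges, prod_edges = defaultdict(list), defaultdict(list)
    for e, (u, p, _sign) in enumerate(reviews):
        user_edges[u].append(e)
        prod_edges[p].append(e)

    to_prod = [np.full(2, 0.5) for _ in reviews]  # user -> product messages
    to_user = [np.full(2, 0.5) for _ in reviews]  # product -> user messages

    def combine(messages, edge_ids, skip=None):
        """Normalized product of the messages on edge_ids, leaving out `skip`."""
        out = np.ones(2)
        for f in edge_ids:
            if f != skip:
                out = out * messages[f]
        return out / out.sum()

    for _ in range(iters):
        new_to_prod, new_to_user = [], []
        for e, (u, p, s) in enumerate(reviews):
            from_user = combine(to_user, user_edges[u], skip=e) @ psi[s]
            from_prod = psi[s] @ combine(to_prod, prod_edges[p], skip=e)
            new_to_prod.append(from_user / from_user.sum())
            new_to_user.append(from_prod / from_prod.sum())
        to_prod, to_user = new_to_prod, new_to_user

    users = {u: combine(to_user, es) for u, es in user_edges.items()}
    products = {p: combine(to_prod, es) for p, es in prod_edges.items()}
    return users, products


# Toy network: four users agree with each other, one user contradicts them.
reviews = [(f"h{i}", "A", +1) for i in range(1, 5)]
reviews += [(f"h{i}", "B", -1) for i in range(1, 5)]
reviews += [("f1", "A", -1), ("f1", "B", +1)]
users, products = signed_bp(reviews)
print("P(fraudster) for f1:", round(float(users["f1"][1]), 3))
```

Only the review signs and the graph structure are used, in line with the "no side data" setting of the problem slide.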
41
Top scorers
[Figure: top-scoring users and products; legend: + positive (4-5) rating, o negative (1-2) rating]
43
Fraud-bot member reviews
Same day activity!
Same developer!
Duplicated text!
44
Outline
  • Problem definition / Motivation
  • Graphs and power laws
  • Patterns and anomalies
  • Scalability and hadoop
  • Anomaly/fraud detection
  • Streams, spikes, environment, data center
    monitoring
  • Conclusions

45
Datacenter Monitoring & Management
Lei Li
  • Goal: save energy in data centers
  • US alone, 7.4B power consumption (2011)
  • Challenge:
  • 1TB per day
  • Complex cyber physical systems

46
Spike forecasting
Yasuko Matsubara
  • Forecast not only tail-part, but also rise-part!

[Figure: time series marking (1) first spike, (2) release date, (3) two weeks before release; the parts to forecast are marked "?"]
48
Environmental data
Temp. and pressure over time
Temperatures, April
Sao Paulo, Brazil
49
OVERALL CONCLUSIONS - high level
Graphs/ Social net
Databases, Map/reduce
Big data / analytics
Cyber-security
Data center monitoring
Fraud detection
Environmental data monitoring
Health db
50
Open research questions
  • Patterns/anomalies for time-evolving graphs (Call
    graph, 3M people x 6mo)
  • Patterns/anomalies given node attributes
  • Graph understanding / attribution
  • ...
  • How is the human brain wired?

51
All these projects
  • Require all three
  • Theory (e.g., eigenvalues, tensors, Kalman
    filters, wavelets)
  • Practice (e.g., PIG, hadoop, >120GB of data,
    often TB)
  • Domain knowledge (e.g., Navier Stokes,
    Volterra-Lotka, etc)

52
Contact info
  • www.cs.cmu.edu/~christos
  • GHC 8019
  • Ph: x8.1457
  • www.cs.cmu.edu/~christos/TALKS/13-ic/
  • FYI: Course 15-826, Tu-Th 1:30-3:00
  • and, again: WELCOME!