Title: Big (graph) data analytics
1Big (graph) data analytics
2CONGRATULATIONS!
Welcome to CMU!
3Outline
- QA
- Problem definition / Motivation
- Graphs and power laws
- Anomaly/fraud detection
- Conclusions
5QA
- Are you recruiting? How many? A: Maybe, 1
- How many do you have? A: 4 (5 postdocs)
- How frequently do you meet them? A: 1/week
- What is your advising style? A: results
- How do you feel about summer internships? A: Yes/Maybe (FB, MSR, IBM, ...)
6Outline
- Problem definition / Motivation
- Graphs and power laws
- Patterns
- Scalability and hadoop
- Anomaly detection
- Conclusions
7Motivation
- Data mining: find patterns (rules, outliers)
- What do real graphs look like? Anomalies?
- Virus/influence propagation
- Time series / environmental monitoring
9Graphs - why should we care?
Food Web [Martinez '91]
1B users; 10-100B revenue
Internet Map [lumeta.com]
10Tensors Graphs on steroids
- Tensors (multi-dimensional arrays)
- Predicates (subject, verb, object) in knowledge base
Vagelis Papalexakis (CMU-CS), Tom Mitchell (CMU/CS-MLD)
NELL (Never Ending Language Learner) data: tensor modes of size 26M, 26M and 48M; 144M nonzeros
Example predicates: "Eric Clapton plays guitar"; "Barack Obama is the president of U.S."
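A minimal sketch, in Python, of how (subject, verb, object) predicates could be stored as a sparse 3-mode tensor. The function name `build_sparse_tensor` and the dictionary-of-coordinates layout are illustrative assumptions, not the storage format used by NELL; only the two example predicates come from the slide.

```python
def build_sparse_tensor(triples):
    """Map (subject, verb, object) triples to integer coordinates.

    Returns the tensor shape and a dict {(i, j, k): count} holding
    only the nonzero entries (a simple coordinate format).
    """
    subj_ids, verb_ids, obj_ids = {}, {}, {}
    nonzeros = {}
    for s, v, o in triples:
        # setdefault assigns the next free index the first time a name is seen
        i = subj_ids.setdefault(s, len(subj_ids))
        j = verb_ids.setdefault(v, len(verb_ids))
        k = obj_ids.setdefault(o, len(obj_ids))
        nonzeros[(i, j, k)] = nonzeros.get((i, j, k), 0) + 1
    shape = (len(subj_ids), len(verb_ids), len(obj_ids))
    return shape, nonzeros


triples = [
    ("Eric Clapton", "plays", "guitar"),
    ("Barack Obama", "is the president of", "U.S."),
    ("Eric Clapton", "plays", "guitar"),  # a repeated observation
]
shape, nonzeros = build_sparse_tensor(triples)
print(shape, nonzeros)
```

Only the nonzero cells are stored, which is what makes a tensor with tens of millions of entries per mode and 144M nonzeros tractable at all.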
12Concept Discovery
- Concept Discovery in Knowledge Base
NP1: Internet, file, data
NP2: protocol, software, suite
13NeuroSemantics
>200GB total
14Experiments
- GigaTensor solves 100x larger problem
[Figure: scalability plot over tensor dimensions (I), (J), (K); number of nonzeros = I / 50]
15Problem 1 - network and graph mining
- What does the Internet look like?
- What does Facebook look like?
- What is normal/abnormal?
- Which patterns/laws hold?
- To spot anomalies (rarities), we have to discover patterns
- Large datasets reveal patterns/anomalies that may be invisible otherwise
16Graph mining
17Laws and patterns
- Are real graphs random? NO!!
- Diameter
- in- and out- degree distributions
- other (surprising) patterns
18Outline
- Problem definition / Motivation
- Graphs and power laws
- Patterns
- Scalability and hadoop
- Anomaly/Fraud detection
- Conclusions
19S1 degree distributions
- Q: avg degree is 3 - what is the most probable degree?
[Figure: count vs. degree, with degree 3 marked on the x-axis and the curve shown as "??"]
21Solution
[Figure: log-log plot of frequency vs. outdegree, Nov '97; exponent = slope, O = -2.15]
- The plot is linear in log-log scale [FFF'99]
- freq = degree ^ (-2.15)
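A minimal sketch of how such an exponent can be read off a degree sequence: count how many nodes have each degree, then fit a straight line to log(count) vs. log(degree). The function name `fit_loglog_slope` and the synthetic data are assumptions for illustration; a plain least-squares fit on the raw histogram is the simplest option, not necessarily the estimator used in the cited work.

```python
import numpy as np


def fit_loglog_slope(degrees):
    """Least-squares slope of log10(count) against log10(degree)."""
    values, counts = np.unique(np.asarray(degrees), return_counts=True)
    mask = values > 0  # log is undefined for degree 0
    slope, _intercept = np.polyfit(
        np.log10(values[mask]), np.log10(counts[mask]), 1
    )
    return slope


# Synthetic degree sequence whose histogram follows count = C * d^(-2.15)
ds = np.arange(1, 51).astype(float)
counts = np.round(1e6 * ds ** -2.15).astype(int)
degrees = np.repeat(ds.astype(int), counts)

slope = fit_loglog_slope(degrees)
print(round(slope, 2))
```

On real graphs the tail of the histogram is noisy, so a fit like this is usually only a first look.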
23Solution S.2 Triangle Laws
- Real social networks have a lot of triangles
- Friends of friends are friends
- Any patterns?
24Triangle Law S.2 [Tsourakakis ICDM 2008]
[Panel: Reuters]
X-axis: degree; Y-axis: mean triangles
n friends -> ???? triangles
25Triangle Law S.2 [Tsourakakis ICDM 2008]
[Panels: SN, Reuters, Epinions]
X-axis: degree; Y-axis: mean triangles
n friends -> n^1.6 triangles
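A minimal sketch of how the two axes of such a plot can be computed from an edge list: count the triangles each node participates in, then average per degree. The function names and the toy graph are assumptions for illustration; this brute-force version is only meant for small graphs, not billion-edge ones.

```python
from collections import defaultdict
from itertools import combinations


def triangles_per_node(edges):
    """Return (adjacency sets, {node: number of triangles through it})."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    tri = {}
    for node, nbrs in adj.items():
        # a triangle through `node` is a pair of its neighbors that are linked
        tri[node] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return adj, tri


def mean_triangles_by_degree(edges):
    """Return {degree: mean triangle count of nodes with that degree}."""
    adj, tri = triangles_per_node(edges)
    buckets = defaultdict(list)
    for node, nbrs in adj.items():
        buckets[len(nbrs)].append(tri[node])
    return {d: sum(t) / len(t) for d, t in sorted(buckets.items())}


# Toy graph: a 4-clique on nodes 0..3 plus a pendant node 4 attached to 0
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
result = mean_triangles_by_degree(edges)
print(result)  # {1: 0.0, 3: 3.0, 4: 3.0}
```

Plotting the resulting dictionary on log-log axes gives the degree vs. mean-triangles view described on the slide.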
26Triangle counting for large graphs?
- Anomalous nodes in Twitter (3 billion edges)
- U Kang, Brendan Meeder, , PAKDD11
29And many more patterns
- Diameter SHRINKS with size!
- nodes vs edges (power law (!))
- conn. components (power law, too)
- Contact/phone-call duration (log-logistic)
- Total node weight vs edges (super-linear/power law)
- ...
30Outline
- Problem definition / Motivation
- Graphs and power laws
- Patterns and anomalies
- Scalability and hadoop
- Anomaly/fraud detection
- Conclusions
31Scalability
- Google: > 450,000 processors in clusters of 2000 processors each [Barroso, Dean, Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro 2003]
- Yahoo: 5Pb of data [Fayyad, KDD'07]
- Problem: machine failures, on a daily basis
- How to parallelize data mining tasks, then?
- A: map/reduce - hadoop (open-source clone) http://hadoop.apache.org/
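A minimal in-memory sketch of the map/reduce programming model, applied to a graph task from this talk (computing node degrees from an edge list). The names `map_edge`, `reduce_degree` and `run_mapreduce` are hypothetical; this is plain Python that mimics the map, shuffle and reduce phases on one machine and does not use the Hadoop API.

```python
from collections import defaultdict


def map_edge(edge):
    """Map phase: emit (node, 1) for both endpoints of an edge."""
    u, v = edge
    yield (u, 1)
    yield (v, 1)


def reduce_degree(node, ones):
    """Reduce phase: sum the 1s emitted for a node to get its degree."""
    return (node, sum(ones))


def run_mapreduce(records, mapper, reducer):
    """Single-machine stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degree = run_mapreduce(edges, map_edge, reduce_degree)
print(degree)  # {'a': 2, 'b': 2, 'c': 3, 'd': 1}
```

Because each mapper call sees one record and each reducer call sees one key, the framework is free to spread both phases across machines and to rerun a piece on another machine if one fails.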
32details
[Figure: map/reduce data flow. Labels: User Program; Master (assign map, assign reduce); Input Data (on HDFS); local write; remote read, sort; Reducer]
- By default: 3-way replication
- Late/dead machines: ignored, transparently (!)
33Outline
- Problem definition / Motivation
- Graphs and power laws
- Patterns
- Scalability and hadoop
- Anomaly/Fraud detection
- Conclusions
34E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU [WWW'07]
37E-bay Fraud detection - NetProbe
38App-store fraud
- Opinion Fraud Detection in Online Reviews using Network Effects
- Leman Akoglu, Rishi Chandy, CF
- ICWSM'13
39Problem
- Given
  - user-product review network
  - review sign (+/-)
- Classify
  - objects into type-specific classes
  - users: honest / fraudster
  - products: good / bad
  - reviews: genuine / fake
- No side data!
  - (e.g., timestamp, review text)
40Formulation: BP
[Figure: user and product nodes with labels honest / bad; before and after]
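A minimal sketch of loopy belief propagation (BP) on a signed user-product review network, to make the formulation concrete. Everything specific here is an assumption for illustration: the toy reviews, the noise level `EPS`, the compatibility tables in `PSI`, the uniform priors, and the function name `run_bp`. It is not claimed to reproduce the potentials or schedule of the cited ICWSM'13 work.

```python
import numpy as np

# Hypothetical toy data: (user, product, sign); each pair appears once.
reviews = [
    ("A", "P1", +1), ("B", "P1", +1), ("C", "P1", +1), ("D", "P1", -1),
    ("A", "P2", -1), ("B", "P2", -1), ("C", "P2", -1), ("D", "P2", +1),
]

EPS = 0.1  # assumed noise level
# Rows: user state (0 = honest, 1 = fraudster).
# Columns: product state (0 = good, 1 = bad).
PSI = {
    +1: np.array([[1 - EPS, EPS], [2 * EPS, 1 - 2 * EPS]]),
    -1: np.array([[EPS, 1 - EPS], [1 - 2 * EPS, 2 * EPS]]),
}


def run_bp(reviews, n_iter=20):
    """Synchronous loopy BP with uniform priors; returns {node: belief}."""
    msgs, neighbors, sign = {}, {}, {}
    for u, p, s in reviews:
        for a, b in ((u, p), (p, u)):
            msgs[(a, b)] = np.array([0.5, 0.5])  # uniform initial messages
            neighbors.setdefault(a, []).append(b)
        sign[(u, p)] = s
    users = {u for u, _, _ in reviews}

    for _ in range(n_iter):
        new = {}
        for src, dst in msgs:
            # product of messages into src from all neighbors except dst
            incoming = np.ones(2)
            for nb in neighbors[src]:
                if nb != dst:
                    incoming = incoming * msgs[(nb, src)]
            if src in users:
                m = incoming @ PSI[sign[(src, dst)]]  # sum over user states
            else:
                m = PSI[sign[(dst, src)]] @ incoming  # sum over product states
            new[(src, dst)] = m / m.sum()
        msgs = new

    beliefs = {}
    for node, nbrs in neighbors.items():
        b = np.ones(2)
        for nb in nbrs:
            b = b * msgs[(nb, node)]
        beliefs[node] = b / b.sum()
    return beliefs


beliefs = run_bp(reviews)
for node in sorted(beliefs):
    print(node, np.round(beliefs[node], 3))
```

In this toy network user D disagrees with the other three users on both products, so D ends up with the highest fraudster belief (index 1) while P1 leans good and P2 leans bad. Only the review signs are used, with no timestamps or review text.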
41Top scorers
Users
Products
Legend: + positive (4-5) rating; o negative (1-2) rating
43Fraud-bot member reviews
Same day activity!
Same developer!
Duplicated text!
44Outline
- Problem definition / Motivation
- Graphs and power laws
- Patterns and anomalies
- Scalability and hadoop
- Anomaly/fraud detection
- Streams, spikes, environment, data center monitoring
- Conclusions
45Datacenter Monitoring & Management
Lei Li
- Goal: save energy in data centers
- US alone: $7.4B power consumption (2011)
- Challenge:
  - 1TB per day
  - Complex cyber-physical systems
46Spike forecasting
Yasuko Matsubara
- Forecast not only tail-part, but also rise-part!
[Figure: timeline marking (1) first spike, (2) release date, (3) two weeks before release]
48Environmental data
Temp. and pressure over time
Temperatures, April
Sao Paulo, Brazil
49OVERALL CONCLUSIONS high level
Graphs/ Social net
Databases, Map/reduce
Big data / analytics
Cyber-security
Data center monitoring
Fraud detection
Environmental data monitoring
Health db
50Open research questions
- Patterns/anomalies for time-evolving graphs (call graph, 3M people x 6mo)
- Patterns/anomalies given node attributes
- Graph understanding / attribution
- ...
- How is the human brain wired?
51All these projects
- Require all three:
- Theory (e.g., eigenvalues, tensors, Kalman filters, wavelets)
- Practice (e.g., PIG, hadoop, >120GB of data, often TB)
- Domain knowledge (e.g., Navier-Stokes, Volterra-Lotka, etc.)
52Contact info
- www.cs.cmu.edu/~christos
- GHC 8019
- Ph x8.1457
- www.cs.cmu.edu/~christos/TALKS/13-ic/
- FYI: Course 15-826, Tu-Th 1:30-3:00
- and, again: WELCOME!