On Mining Massive Dynamic Data - PowerPoint PPT Presentation

1
On Mining Massive Dynamic Data

Deepak Agarwal
Yahoo! Research
SF Bay Area Chapter, ACM
13th September, 2006

2
Background

PhD in Statistics, University of Connecticut, '01
Multi-level hierarchical Bayesian models to study
deforestation in Madagascar.
Working with massive data in GIS got me
interested in data mining
Joined Statistics and Data Mining at AT&T
Massive dynamic networks, monitoring massive
streams.
Intrigued by internet advertising, joined Y! in
2006

3
Yahoo! Research

Head: Prabhakar Raghavan, 2005
Search
Economics
Machine Learning
Statistics and Data Mining
http://research.yahoo.com/

4
Context

Massive amounts of data
Internet wave, telecommunications pose new
statistical challenges
Computer science: progress in managing data,
computing summary statistics efficiently
Dynamic nature of data
Ubiquitous; methods for static data not optimal
Methods for time series, point processes germane
Challenge: building scalable methods

5
Focus on three problems

Estimating delta through time
Monitoring massive streams
Mining massive dynamic social networks
Effective but lossy representation
Sequential sampling for learning
optimize the explore-exploit tradeoff
Different from fixed sample size design

6
Other research areas

Detecting hotspots in massive spatial data
Scan statistics (SODA 06, KDD 06)
Forecasting long-term and short-term
Applications at Yahoo!
Data squashing
Scaling down data to facilitate statistical
modeling (KDD 03)
Hierarchical Bayesian modeling
Shrinkage estimation for massive data
(KDD 02, SDM 04, ICDM 05)

7
Problem 1: Monitoring massive streams

Estimating delta through time: several applications.
Network monitoring: traffic volume in SNMP, etc.
Bio-surveillance: emergency room data, events on
websites
Spam detection: e-mail spam, web spam, etc.
Business intelligence: traffic patterns to
customer care centers; augments the usual
dashboard-type statistics

8
Challenges

Estimate accurate baseline model
Change detection with good sensitivity/specificity
Adjusting for multiple testing, global features
Adaptive procedure, easy to update
Incorporating correlations among series

9
Application

Question
Can we detect social disruption events in China
before they get reported in the mainstream media?
Our Answer
Probably yes, if we knew what data to use

10

Word patterns
11

We would like to thank Simon Urbanek for
providing the plot

12
How did we get the patterns?

Patterns emerged from retrospective analysis of
biological events (West Nile, SARS) foreseen as
potential indicators and warnings of social
disruption
Direct indicators (e.g., news reports of
outbreaks)
Indirect indicators (school and factory closings,
etc.)
Patterns selected manually by experts.
Contingency table obtained daily: websites (about
40) x patterns (about 25).

13
Data Collection, Reporting.

Crawler
Crawl a max of 1K pages per website
Parser
Each webpage parsed into sentences by a parser
Index
Converted to UTF-8 and indexed incrementally;
Lucene-powered indexing and search software
Anomalies reported in a newsletter form on a web
portal every morning.

14
(No Transcript)
15

The number of crawled pages varies, so monitor the
rate of occurrence per page for each pattern.
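
The normalization described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name, the pattern labels, and the counts are hypothetical, and returning None when no pages were crawled is an assumption made here so that a missing day stays visible instead of looking like a zero rate.

```python
def occurrence_rates(pattern_counts, pages_crawled):
    """Per-page occurrence rate for each pattern on one website and day.

    Returns None for every pattern when no pages were crawled, so that a
    missing day is not mistaken for a day with zero occurrences.
    """
    if pages_crawled <= 0:
        return {pattern: None for pattern in pattern_counts}
    return {
        pattern: count / pages_crawled
        for pattern, count in pattern_counts.items()
    }


# Hypothetical counts for one website on one day.
rates = occurrence_rates({"flu": 12, "quarantine": 3}, pages_crawled=600)
empty_day = occurrence_rates({"flu": 0}, pages_crawled=0)
```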

16
Notations and transformation.
17
40 days of initial data
18
No short term seasonal effects in the rates
19
Baseline model: state space approach
20
QQ-plots of standardized residuals to test the
conditional independence assumption in the
observation equation of the baseline model.
21
State Equation, model update
22
Estimating Variance components.
23
Estimating forgetting factors
24
Detecting anomalies, intervention strategy.

Q-charting: monitor the EWMA of normal scores of
p-values (Liu and Lambert, 2006).
CUSUM-based approach using Bayes factors (West and
Harrison, 1997; Gargallo and Salvador, 2003)
In this application: only detected spikes
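
A minimal sketch of the Q-charting idea follows. It is not the cited authors' procedure in detail. It assumes each p-value is the lower-tail predictive probability P(Y <= y), so a spike gives a p-value near 1 and a large positive normal score. The smoothing weight of 0.2 and the 3-sigma limit are illustrative choices, not values from the talk.

```python
from statistics import NormalDist


def ewma_of_normal_scores(p_values, weight=0.2):
    """EWMA path of z_t = Phi^{-1}(p_t), started at 0 (the baseline mean)."""
    std_normal = NormalDist()
    ewma = 0.0
    path = []
    for p in p_values:
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep the inverse CDF finite
        z = std_normal.inv_cdf(p)
        ewma = (1 - weight) * ewma + weight * z
        path.append(ewma)
    return path


def control_limit(weight=0.2, width=3.0):
    """Asymptotic limit for an EWMA of independent N(0, 1) scores."""
    return width * (weight / (2 - weight)) ** 0.5


# Hypothetical p-values: three ordinary days, then a run of extreme days.
p_values = [0.5, 0.4, 0.6, 0.999, 0.9995, 0.9999]
path = ewma_of_normal_scores(p_values)
limit = control_limit()
flagged = [t for t, value in enumerate(path) if value > limit]
```

Only the upper limit is checked here, in line with the remark that only spikes were detected in this application.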

25
Other Issues

What if a spike/drop is detected? Don't use the
data point in model updating; increase the prior
variance by a factor of c (c = 9 has worked well
for this application)
Missing data
Occurs when we download very few (or no) pages.
Draw an observation randomly from the predictive
distribution and proceed as usual.
Deleting uninformative series, adding new ones
Delete a series if its 95th percentile < 1/1000.
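
The rules on this slide can be sketched on top of a generic local-level state space model with a discount (forgetting) factor. The transcript does not include the talk's actual baseline equations, so the Gaussian model, the 3-standard-deviation spike rule, the parameter defaults, and all names below are assumptions for illustration only.

```python
import random


def dlm_step(m, C, y, V=1.0, discount=0.9, c=9.0, z_spike=3.0, rng=random):
    """One update of a local-level model; returns (mean, variance, status).

    m, C: posterior mean and variance of the level from the previous day.
    y:    today's observation, or None when the data are missing.
    """
    R = C / discount        # prior variance after forgetting
    Q = R + V               # one-step predictive variance
    status = "ok"
    if y is None:
        # Missing data: draw from the predictive distribution, then proceed.
        y = rng.gauss(m, Q ** 0.5)
        status = "imputed"
    z = (y - m) / Q ** 0.5  # standardized forecast error
    if status == "ok" and abs(z) > z_spike:
        # Spike or drop: skip the update and inflate the variance by c.
        return m, c * R, "anomaly"
    A = R / Q               # adaptive (Kalman) gain
    return m + A * (y - m), A * V, status


def is_uninformative(rates, threshold=1 / 1000):
    """Deletion rule: the 95th percentile (nearest rank) is below threshold."""
    ordered = sorted(rates)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integers
    return ordered[max(rank, 1) - 1] < threshold


normal_day = dlm_step(0.0, 1.0, 0.5, discount=0.5)
spike_day = dlm_step(0.0, 1.0, 10.0, discount=0.5)
missing_day = dlm_step(0.0, 1.0, None, rng=random.Random(0))
```

Returning the inflated prior variance as the new state variance is one way to read "increase the prior variance by a factor of c"; other readings are possible.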

26
Multiple testing: Bayesian approach.

Monitoring a large number of independent streams:
need to correct for multiple testing
Main idea
Derive empirical null based on observed
deviations
Flag interesting cases after adjusting for global
characteristics in the system
Bayesian approach: shrink residuals
Shrinkage automatically builds in a penalty for
conducting multiple tests (Scott and Berger, 2003)

27
Bayesian procedure.
28
Estimating hyperparameters

Large number of time series
Approximate likelihood via data squashing:
likelihood approximated by weighted likelihood
MAP estimate used as plug-in
Moderate number of time series
Fully Bayesian inference using Gauss-Hermite
quadrature

29
Distribution of normal scores on a randomly
selected day
30
(No Transcript)
31
(No Transcript)
32
Putting it all together

Build a baseline model; any model that provides a
p-value of the observed value relative to the
predictive distribution is appropriate. The state
space approach provides a rich class.
Declare anomalies after adjusting for multiple
testing. We use a Bayesian procedure, but other
approaches like FDR may be used.
Delete time series that are uninformative (based
on a user-defined criterion). Add new series to
the monitoring process.
For missing data, draw an observation randomly
from the predictive distribution. When an anomaly
occurs, make appropriate adjustments to maintain
the correct variance.
Update the baseline distribution with the arrival
of new data. The updating process should be quick
for large applications.
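
The talk uses a Bayesian shrinkage procedure for the multiple-testing step and mentions FDR only as an alternative. As an illustration of that alternative, here is a sketch of the standard Benjamini-Hochberg step-up rule applied to one day's p-values. It is not the speaker's method. The p-values are hypothetical and are assumed to be small for anomalous streams (upper-tail p-values).

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of streams flagged at false discovery rate `fdr`."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its line.
        if p_values[i] <= rank * fdr / n:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for five streams on one day.
flagged_streams = benjamini_hochberg([0.001, 0.8, 0.012, 0.3, 0.04])
```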

33
Dotted solid lines: days when reports appeared in
mainstream media. Dotted gray lines: days when our
system found spikes related to the reports that
appeared later.
34
Rough validation using actual media reports.

July 24th: mystery illness kills 17 people in
China; we noticed several spikes on July 17th and
18th.
Sept 29th and Dec 7th: on Sept 29th, news
reports of China carving out emergency plans to
fight bird flu and prevent it from spreading to
humans. On Dec 7th, a confirmed case of bird flu
in humans was reported.
We reported several spikes on Sept 12th and 14th,
Nov 2nd, 7th, 11th, and 16th, mostly for the
pattern influenza, flu, pneumonia, meningitis.
On Nov 21st, four big spikes on bf3.syd.com.cn
on influenza, flu, pneumonia, meningitis;
emergency, disaster, crisis; prevention; and
quarantine.

35
Ongoing work

Monitoring hierarchical streams
Applications at Yahoo!
Correlation structure induced partly by
hierarchy
Concise description of anomalies
ICDM 05 for contingency tables
ICDE 06 for hierarchical data (under submission)

36
Other applications

Monitoring emergency room visits by symptom and
location (JSM, 2005).
Monitoring calls to customer care centers to
augment usual slice-and-dice dashboard statistics
E.g., there was a 10% increase in hang-ups for
calls from Maryland (ICDM, 2005)

37
Relevant articles

Agarwal, D., Feng, J., Torres, V. (2006)
Monitoring massive streams simultaneously: A
holistic approach, Interface (refereed section)
Agarwal, D. (2005) Empirical Bayes approach to
detect anomalies in dynamic multidimensional
arrays, ICDM, Houston
Agarwal, D., DuMouchel, W., Goodall, G. (2005)
KFC: A Kalman filtering approach to detect
anomalies in massive contingency tables, JSM,
Minneapolis

38
Problem 2: Building Efficient Representation for
Massive Dynamic Networks
39
Context

Transactional data: time-stamped records of
interaction between pairs of entities,
e.g., telephone calls, credit card purchases,
e-mail exchanges, hyperlinks, etc.
Gives rise to a dynamic graph: nodes represent
transactors, edges represent interactions over
time.
Goal: find a lossy but efficient representation
No unique solution; depends on objective

40
Our application

Directed graph: phone calls on AT&T's network.
Massive: millions of nodes and edges
Dynamic: lose nodes and edges, get new ones
Heterogeneous: biz, res, cell, 800, etc.
Sparse: the 80/20 rule, power law
Incomplete: don't see all calls; miss calls
originating in competitors' networks, cell calls,
local calls, etc.

41
Applications

Fraud detection (fraudsters compromising 800
numbers, international numbers, etc.)
Marketing (viral marketing: market to people with
strong network influence)
Repetitive debtors (catching subscription fraud)
Key observations
Analysis at the local transactor level is useful;
global analysis not needed
Facilitate near-real-time applications

42
Representation

Create a local graph centered around each node;
it captures the node's interaction with the rest
of the network
Approximation
Graph at time t: union of local sub-graphs
Capturing dynamics of graph over time (Cortes et
al.)
Smoothing each local subgraph over time
Smoothed local subgraph around node X: community
of interest (COI) of X.

43
Smoothing

Based only on yesterday's data?
Too narrow
Union of all time periods?
Too broad
A moving average of the t most recent time
periods?
Better, but does not capture slow drifts well;
logistically difficult
Exponentially weighted moving average (EWMA)
G(t) = θ G(t-1) + (1 - θ) g(t)
Gives more weight to recent data
Easy to maintain and update
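
As a rough sketch, the EWMA recursion can be applied edge by edge to a graph stored as a dictionary of edge weights. The dictionary representation, the function name, and the example weights are assumptions made for illustration, and theta = 0.5 is chosen only to keep the arithmetic easy to follow.

```python
def ewma_update(G_prev, g_today, theta=0.85):
    """Edge-wise update G(t) = theta * G(t-1) + (1 - theta) * g(t).

    G_prev:  smoothed edge weights from the previous period.
    g_today: edge weights observed in the current period only.
    Edges missing from either graph are treated as having weight 0.
    """
    edges = set(G_prev) | set(g_today)
    return {
        edge: theta * G_prev.get(edge, 0.0)
        + (1 - theta) * g_today.get(edge, 0.0)
        for edge in edges
    }


# Hypothetical directed edges (caller, callee) with call-volume weights.
G = ewma_update(
    {("a", "b"): 1.0},
    {("a", "b"): 2.0, ("a", "c"): 4.0},
    theta=0.5,
)
```

Only the previous smoothed graph and today's graph are needed, which is what makes the EWMA easier to maintain than a moving average over many stored periods.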

44
Weighting past calls: choosing the forgetting
factor
Calls fade out over time. The larger θ is, the
longer the call has non-negligible weight.
Selecting θ: standard problem in time series.
Derive estimates from Kalman filter or
ARIMA (0,1,1). But what's our loss function here?
Graph provided by Chris Volinsky
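
To make the fade-out concrete, the EWMA recursion G(t) = θ G(t-1) + (1 - θ) g(t) can be unrolled to show the weight carried by a call made n periods ago. The cutoff w_min below is a hypothetical "negligible weight" level introduced only for this illustration, and the result assumes 0 < θ < 1.

```latex
\begin{align*}
G(t) &= \theta\,G(t-1) + (1-\theta)\,g(t) \\
     &= (1-\theta)\sum_{n=0}^{t-1} \theta^{n}\, g(t-n) + \theta^{t}\,G(0), \\
w_n  &= (1-\theta)\,\theta^{n}
        \quad\text{(weight of a call made $n$ periods ago)}, \\
w_n < w_{\min} &\iff n > \frac{\ln\!\big(w_{\min}/(1-\theta)\big)}{\ln\theta}.
\end{align*}
```

For example, with the illustrative values θ = 0.85 and w_min = 0.01, the bound is about 16.7, so a single call keeps a weight of at least 0.01 for 16 periods and drops below it at the 17th.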
45
Reducing redundancy

Only a small fraction of nodes have high degrees
Introduce a parameter k (positive integer)
COI of X is the smoothed subgraph centered around X
Top k called by X + other
Top k calling X + other
Weights on the edges are those derived from EWMA.
Still not done: this will lead to a gain of more
and more edges over time. Introduce a third
parameter ε such that an edge below this
threshold gets dropped.
Helps with noise reduction and storage savings.
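
The pruning rules above can be sketched as follows. The order of operations (drop edges below the threshold first, then keep the top k and pool the rest into an "other" node) is an assumption, as are the function name, the dictionary layout, and the example weights. The sketch also assumes no real node is named "other".

```python
def build_coi(outbound, inbound, k=9, eps=0.1):
    """Prune a node's smoothed edge weights into a community of interest.

    outbound: {callee: weight} for numbers called by X.
    inbound:  {caller: weight} for numbers calling X.
    """

    def top_k_plus_other(weights):
        # Drop edges whose smoothed weight has decayed below the threshold.
        kept = {node: w for node, w in weights.items() if w >= eps}
        ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
        coi = dict(ranked[:k])
        # Pool everything beyond the top k into a single aggregate node.
        overflow = sum(w for _, w in ranked[k:])
        if overflow > 0:
            coi["other"] = overflow
        return coi

    return {"out": top_k_plus_other(outbound), "in": top_k_plus_other(inbound)}


# Hypothetical smoothed weights for one node X.
coi = build_coi(
    {"a": 5.0, "b": 3.0, "c": 1.0, "d": 0.05},
    {"z": 2.0},
    k=2,
    eps=0.1,
)
```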

46
COI of X
(Figure: COI of X, showing X with "Other inbound" and "Other outbound" nodes)
47
How to select parameters?

Select L pre and post periods, maximize an
average similarity measure

48
Similarity functions
(Figure: COI in the pre-period vs. the post-period; labels TN1, TN2, p1, p2, pother)
49
Selecting theta and k
(Figure panels: Hellinger, Wdice)
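
The transcript does not preserve how the Hellinger and weighted Dice measures were defined for COIs. As a reference point only, here is the textbook Hellinger distance between two sets of edge weights, each normalized to sum to one. The normalization and the dictionary layout are assumptions. A similarity score could be taken as one minus this distance, but the talk's exact form may differ.

```python
from math import sqrt


def hellinger_distance(coi_a, coi_b):
    """Hellinger distance between two COIs given as {node: weight} dicts.

    Returns 0.0 for identical weight profiles and 1.0 for disjoint ones.
    """
    total_a = sum(coi_a.values())
    total_b = sum(coi_b.values())
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each COI needs positive total weight")
    nodes = set(coi_a) | set(coi_b)
    squared_gap = sum(
        (sqrt(coi_a.get(n, 0.0) / total_a) - sqrt(coi_b.get(n, 0.0) / total_b)) ** 2
        for n in nodes
    )
    return sqrt(squared_gap / 2)


# Hypothetical pre-period and post-period COIs for the same node.
same = hellinger_distance({"a": 2.0, "b": 1.0}, {"a": 4.0, "b": 2.0})
disjoint = hellinger_distance({"a": 1.0}, {"b": 1.0})
```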
50
Selecting epsilon.
51
Repetitive Debtor

Thomas Hanley
62 Rio Robles, San Jose
CA
Disconnected: non-payment

Deborah Hanley
62 Rio Robles, San Jose
CA
??

Key: calling patterns don't change
Compare new connections with fraudulent numbers
using a similarity function.
52
Validation with labels from experts
53
Enhancing COI: my friends' friends

Impute calls not seen on the network by exploiting
social structure
Other issues
Quantify social characteristics like tendency to
call, tendency to receive calls, and tendency to
return calls for each node.

54
Extended COI
(Figure: extended COI around X, with "other" nodes; labels d0, d1, d2, d3)
55
Approach

Extended COI -> social network representing the
node's interaction with the network
Developed a rich class of statistical models to
do inference.

56
Parameters

Node i
αi: expansiveness (tendency to call)
βi: attractiveness (tendency of being called)
Global parameters
?: density of COI (reduces with increasing
sparseness)
?: reciprocity of COI (tendency to return calls)
?s: caller-specific effect
?r: callee-specific effect
?: call-specific effect

57
(Figure: model diagram. Labels: Σαβ, hyperparameters, density, imputing tij's, coefficients for edge covariates, COI (wij, wji), i < j, covariates, k)
58
(No Transcript)
59
Example

COI with 117 nodes, 172 edges.
14 missing edges: local calls from 14 non-AT&T
local customers to the seed node.
One node attribute (biz/cell/res) gives rise to 9
edge attributes.
cell->biz, cell->cell, cell->res collapsed into
one block.
M1: uniform reciprocity; M2: differential
reciprocity
The latter gives a better fit; edge covariates were
statistically significant. Results in Table 2 of
the paper.
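
The count of 9 edge attributes comes from pairing the three node types as (caller type, callee type). A small sketch of that bookkeeping follows. The label "cell->*" for the collapsed block is a hypothetical name, and the resulting count of 7 blocks is an inference from the slide rather than a figure stated in the talk.

```python
from itertools import product

NODE_TYPES = ("biz", "cell", "res")

# Every ordered (caller type, callee type) pair is one edge attribute.
edge_types = [f"{src}->{dst}" for src, dst in product(NODE_TYPES, repeat=2)]


def block_of(edge_type):
    """Collapse all edges that originate from a cell number into one block."""
    return "cell->*" if edge_type.startswith("cell->") else edge_type


blocks = sorted({block_of(edge_type) for edge_type in edge_types})
```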

60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
Relevant Papers

S. Hill, D. Agarwal, R. Bell, C. Volinsky (2006)
Building an Effective Representation of Dynamic
Networks, Journal of Computational and Graphical
Statistics (to appear)
C. Cortes, D. Pregibon, C. Volinsky (2003)
Computational Methods for Dynamic Graphs,
Journal of Computational and Graphical
Statistics, 12, 950-970
D. Agarwal, D. Pregibon (2004) Enhancing
Communities of Interest using Bayesian Stochastic
Blockmodels, SIAM Data Mining Conference,
Orlando