Title: On Mining Massive Dynamic Data
1On Mining Massive Dynamic Data
- Deepak Agarwal
- Yahoo! Research
- SF Bay Area Chapter, ACM
- 13th September, 2006
2Background
- PhD in statistics, University of Connecticut, 2001
- Multi-level hierarchical Bayesian models to study deforestation in Madagascar.
- Working with massive data in GIS got me interested in data mining
- Joined Statistics and Data Mining at AT&T
- Massive dynamic networks; monitoring massive streams.
- Intrigued by internet advertising, joined Y! in 2006
3Yahoo! Research
- Head: Prabhakar Raghavan (2005)
- Search
- Economics
- Machine Learning
- Statistics and Data Mining
- http://research.yahoo.com/
4Context
- Massive amounts of data
- Internet wave, telecommunications pose new statistical challenges
- Computer science: progress in managing data, computing summary statistics efficiently
- Dynamic nature of data
- Ubiquitous; methods for static data not optimal
- Methods for time series, point processes germane
- Challenge: building scalable methods
5Focus on three problems
- Estimating delta through time
- Monitoring massive streams
- Mining massive dynamic social networks
- Effective but lossy representation
- Sequential sampling for learning
- optimize the explore-exploit tradeoff
- Different from fixed sample size design
6Other research areas
- Detecting hotspots in massive spatial data
- Scan statistics (SODA 06, KDD 06)
- Forecasting long-term and short-term
- Applications at Yahoo!
- Data squashing
- Scaling down data to facilitate statistical modeling (KDD 03)
- Hierarchical Bayesian modeling
- Shrinkage estimation for massive data
- (KDD 02, SDM 04, ICDM 05)
7Problem 1: Monitoring massive streams
- Estimating delta through time: several applications.
- Network monitoring: traffic volume in SNMP etc.
- Bio-surveillance: emergency room data, events on websites
- Spam detection: e-mail spam, web spam etc.
- Business intelligence: traffic pattern to customer care centers; augments usual dashboard-type statistics
8Challenges
- Estimate accurate baseline model
- Change detection with good sensitivity/specificity
- Adjusting for multiple testing and global features
- Adaptive procedure that is easy to update
- Incorporating correlations among series
9Application
- Question
- Can we detect social disruption events in China before they get reported in the mainstream media?
- Our Answer
- Probably yes, if we knew what data to use
10 Word patterns
11- We would like to thank Simon Urbanek for
providing the plot
12How did we get the patterns?
- Patterns emerged from retrospective analysis of biological events (West Nile, SARS) foreseen as potential indicators and warnings of social disruption
- Direct indicators (e.g., news reports of outbreaks)
- Indirect indicators (school and factory closings etc.)
- Patterns selected manually by experts.
- Contingency table obtained daily: websites (about 40) x patterns (about 25).
13Data Collection, Reporting.
- Crawler
- Crawl a max of 1K pages per website
- Parser
- Each webpage parsed into sentences by a parser
- Index
- Converted to UTF-8 and indexed incrementally, using Lucene-based indexing and search software
- Anomalies reported in newsletter form on a web portal every morning.
15- The number of crawled pages varies, so we monitor the rate of occurrence per page for each pattern.
16Notations and transformation.
17Forty days of initial data
18No short term seasonal effects in the rates
19Baseline model State space approach
20QQ-plots of standardized residuals to test the
conditional independence assumption in the
observation equation of the baseline model.
21State Equation, model update
22Estimating Variance components.
23Estimating forgetting factors
24Detecting anomalies, intervention strategy.
- Q-charting: monitor the EWMA of normal scores of p-values (Liu and Lambert, 2006).
- CUSUM-based approach using Bayes factor (West and Harrison, 1997; Gargallo and Salvador, 2003)
- In this application: only detected spikes
- What if a spike/drop is detected? Don't use the data point in model updating; increase the prior variance by a factor of c (c = 9 has worked well for this application)
- Missing data
- Occurs when we download very few (or no) pages.
- Draw an observation randomly from the predictive distribution and proceed as usual.
- Deleting uninformative series, adding new ones
- Delete a series if 95th percentile < 1/1000.
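A minimal sketch of these housekeeping rules, assuming a simple local-level (random walk plus noise) baseline with known variances; the talk's state space model is richer, and the names `State`, `step`, `obs_var`, `evol_var`, and `keep_series` are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class State:
    mean: float  # posterior mean of the latent level
    var: float   # posterior variance of the latent level


def step(state, y, obs_var, evol_var, is_anomaly=False, c=9.0, rng=random):
    """One daily update of an assumed local-level model.

    - Missing data (y is None): draw y from the one-step predictive distribution.
    - Anomaly: skip the data point and inflate the prior variance by a factor c.
    - Otherwise: standard Kalman update.
    """
    prior_mean = state.mean
    prior_var = state.var + evol_var
    if is_anomaly:
        return State(prior_mean, c * prior_var)
    if y is None:
        y = rng.gauss(prior_mean, math.sqrt(prior_var + obs_var))
    gain = prior_var / (prior_var + obs_var)
    return State(prior_mean + gain * (y - prior_mean), (1.0 - gain) * prior_var)


def keep_series(rates, threshold=1.0 / 1000.0):
    """Keep a series unless its 95th percentile (nearest rank) is below the threshold."""
    ordered = sorted(rates)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx] >= threshold


normal = step(State(0.0, 1.0), 2.0, obs_var=1.0, evol_var=0.0)
skipped = step(State(0.0, 1.0), 50.0, obs_var=1.0, evol_var=0.0, is_anomaly=True)
imputed = step(State(0.0, 1.0), None, obs_var=1.0, evol_var=0.0, rng=random.Random(0))
```

Inflating the variance after a flagged point lets the filter adapt quickly if the level has genuinely shifted, while the flagged value itself does not contaminate the baseline.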
26Multiple testing: Bayesian approach.
- Monitoring large number of independent streams
- Need to correct for multiple testing
- Main idea
- Derive empirical null based on observed deviations
- Flag interesting cases after adjusting for global characteristics in the system
- Bayesian approach: shrink residuals
- Shrinkage automatically builds in penalty for conducting multiple tests (Scott and Berger, 2003)
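To see why shrinkage acts as a multiplicity penalty, here is a deliberately simplified stand-in: a normal-normal empirical Bayes model with a method-of-moments estimate of the signal variance. It is not the procedure from the talk or from Scott and Berger; the function name `shrink_scores` and the model itself are assumptions made for illustration.

```python
def shrink_scores(z):
    """Shrink standardized residuals toward zero.

    Assumed model: z_i ~ N(mu_i, 1) and mu_i ~ N(0, tau2). The signal variance
    tau2 is estimated by moments as max(0, mean(z^2) - 1), and each score is
    multiplied by the posterior shrinkage factor tau2 / (1 + tau2).
    """
    n = len(z)
    tau2 = max(0.0, sum(v * v for v in z) / n - 1.0)
    factor = tau2 / (1.0 + tau2)
    return [factor * v for v in z], factor


# The same deviation of 4 is shrunk harder when it sits among more quiet streams.
few_streams, factor_few = shrink_scores([4.0] + [0.0] * 3)
many_streams, factor_many = shrink_scores([4.0] + [0.0] * 15)
```

Because the shrinkage factor is learned from all streams at once, a lone large deviation among many quiet streams is pulled toward zero more strongly, which is the sense in which the penalty is built in.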
27Bayesian procedure.
28Estimating hyperparameters
- Large number of time series
- Approximate likelihood data squashing
- likelihood approximated by weighted likelihood
- MAP estimate used as plug-in
- Moderate number of time series
- Fully Bayesian Inference using Gauss-Hermite
Quadrature
29Distribution of normal scores on a randomly
selected day
32Putting it all together
- Build a baseline model; any model that provides a p-value of the observed value relative to the predictive distribution is appropriate. The state space approach provides a rich class.
- Declare anomalies after adjusting for multiple testing. We use a Bayesian procedure, but other approaches like FDR may be used.
- Delete time series that are uninformative (based on user-defined criteria). Add new series to the monitoring process.
- For missing data, draw an observation randomly from the predictive distribution. When an anomaly occurs, make appropriate adjustments to maintain the correct variance.
- Update the baseline distribution with the arrival of new data. The updating process should be quick for large applications.
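As an example of the FDR alternative mentioned above, the following sketch applies the standard Benjamini-Hochberg step-up rule to one day's p-values. The talk itself uses a Bayesian procedure; the target rate `q` and the sample p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the stream is flagged at FDR level q.

    Step-up rule: sort the p-values, find the largest rank r with
    p_(r) <= q * r / m, and flag the r smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank
    flagged = [False] * m
    for i in order[:cutoff]:
        flagged[i] = True
    return flagged


# One p-value per monitored stream for a single day (made-up numbers).
flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], q=0.05)
```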
33Dotted solid lines: days when reports appeared in mainstream media. Dotted gray lines: days when our system found spikes related to the reports that appeared later.
34Rough validation using actual media reports.
- July 24th: mystery illness kills 17 people in China; we noticed several spikes on July 17th and 18th.
- Sept 29th and Dec 7th: on Sept 29th, news reports of China carving out emergency plans to fight bird flu and prevent it from spreading to humans. On Dec 7th, a confirmed case of bird flu in humans reported.
- We reported several spikes on Sept 12th and 14th, Nov 2nd, 7th, 11th, and 16th, mostly for the pattern "influenza, flu, pneumonia, meningitis".
- On Nov 21st, four big spikes on bf3.syd.com.cn on "influenza, flu, pneumonia, meningitis" and "emergency, disaster, crisis prevention and quarantine".
35Ongoing work
- Monitoring hierarchical streams
- Applications at Yahoo!
- Correlation structure induced partly by hierarchy
- Concise description of anomalies
- ICDM 05 for contingency tables
- ICDE 06 for hierarchical data, under submission
36Other applications
- Monitoring emergency room visits by symptom and location (JSM, 2005).
- Monitoring calls to customer care centers to augment usual slice-and-dice dashboard statistics
- E.g., there was a 10% increase in hang-ups for calls from Maryland (ICDM, 2005)
37Relevant articles
- Agarwal, D., Feng, J., Torres, V. (2006) Monitoring massive streams simultaneously: A holistic approach, Interface (refereed section)
- Agarwal, D. (2005) Empirical Bayes Approach to Detect Anomalies in Dynamic Multidimensional Arrays, ICDM, Houston
- Agarwal, D., DuMouchel, W., Goodall, G. (2005) KFC: A Kalman filtering approach to detect anomalies in massive contingency tables, JSM, Minneapolis
38Problem 2: Building Efficient Representation for Massive Dynamic Networks
39Context
- Transactional data: time-stamped records of interaction between pairs of entities,
- e.g., telephone calls, credit card purchases, e-mail exchanges, hyperlinks etc.
- Gives rise to a dynamic graph; nodes represent transactors, edges represent interactions over time.
- Goal: find a lossy but efficient representation
- No unique solution, depends on objective
40Our application
- Directed graph: phone calls on AT&T's network.
- Massive: millions of nodes and edges
- Dynamic: lose nodes and edges, get new ones
- Heterogeneous: biz, res, cell, 800 etc.
- Sparse: the 80/20 rule, power law
- Incomplete: don't see all calls; miss calls originating in competitors' networks, cell calls, local calls etc.
41Applications
- Fraud detection (fraudsters compromising 800 numbers, international numbers etc.)
- Marketing (viral marketing: market to people with strong network influence)
- Repetitive debtors (catching subscription fraud)
- Key observations
- Analysis at local transactor level useful; global not needed
- Facilitate near-real-time applications
42Representation
- Create local graph centered around each node
- Captures interaction with the rest of the network
- Approximation
- Graph at time t: union of local sub-graphs
- Capturing dynamics of graph over time (Cortes et al.)
- Smoothing each local subgraph over time
- Smoothed local subgraph around node X: community of interest (COI) of X.
43Smoothing
- Based only on yesterday's data?
- Too narrow
- Union of all time periods?
- Too broad
- A moving average of the t most recent time periods?
- Better, but does not capture slow drifts well; logistically difficult
- Exponentially weighted moving average (EWMA)
- $G(t) = \theta\,G(t-1) + (1-\theta)\,g(t)$
- Gives more weight to recent data
- Easy to maintain and update
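The EWMA recursion for a graph can be sketched as an edge-by-edge update. This is an illustration only: graphs are represented as plain dictionaries mapping a (caller, callee) pair to a weight, and the names `ewma_graph_update`, `G_prev`, and `g_today` are hypothetical.

```python
def ewma_graph_update(G_prev, g_today, theta=0.85):
    """Blend yesterday's smoothed graph with today's graph of observed calls.

    Each edge weight becomes theta * old + (1 - theta) * new, where a missing
    edge counts as weight 0. Only the previous smoothed graph has to be stored.
    """
    G_new = {}
    for edge in set(G_prev) | set(g_today):
        old = G_prev.get(edge, 0.0)
        new = g_today.get(edge, 0.0)
        G_new[edge] = theta * old + (1.0 - theta) * new
    return G_new


# Edge a->b was seen before but not today; edge a->c appears for the first time.
G = ewma_graph_update({("a", "b"): 4.0}, {("a", "c"): 2.0}, theta=0.5)
```

Old edges decay geometrically while new edges enter at a reduced weight, which is what makes the representation easy to maintain day by day.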
44Weighting past calls: choosing the forgetting factor
Calls fade out over time. The larger θ is, the longer a call has non-negligible weight.
Selecting θ: a standard problem in time series. Derive estimates from a Kalman filter or ARIMA(0,1,1). But what's our loss function here?
Graph provided by Chris Volinsky
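How quickly a call fades can be read off by unrolling the EWMA recursion. The derivation below is a worked illustration; the symbols d (age of a call in periods) and r (the fraction of its initial weight that is considered negligible) are introduced here and do not appear in the slides.

```latex
\begin{aligned}
G(t) &= \theta\,G(t-1) + (1-\theta)\,g(t) \\
     &= (1-\theta)\sum_{d=0}^{t-1} \theta^{d}\,g(t-d) \;+\; \theta^{t}\,G(0), \\[4pt]
\text{weight of a call made } d \text{ periods ago} &= (1-\theta)\,\theta^{d}, \\[4pt]
\theta^{d} < r \;&\Longleftrightarrow\; d > \frac{\ln r}{\ln \theta}
\qquad (0 < \theta < 1,\; 0 < r < 1).
\end{aligned}
```

For example, with θ = 0.9 a call drops to half of its initial weight after ln 0.5 / ln 0.9 ≈ 6.6 periods, whereas with θ = 0.5 this takes a single period.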
45Reducing redundancy
- Only a small fraction have high degrees
- Introduce a parameter k (positive integer)
- COI of X is smoothed subgraph centered around X
- Top k called by X + other
- Top k calling X + other
- Weights on the edges are those derived from EWMA.
- Still not done: this will lead to a gain of more and more edges over time. Introduce a third parameter ε such that an edge below this threshold gets dropped.
- Helps with noise reduction and storage savings.
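A small sketch of the top-k plus "other" construction with ε-pruning follows. It makes assumptions the slides do not settle: edges below ε are dropped outright before ranking, and everything beyond the top k is summed into a single aggregate node. The names `build_coi`, `OTHER`, `out_weights`, and `in_weights` are hypothetical.

```python
OTHER = "<other>"  # hypothetical label for the aggregate overflow node


def _top_k(weights, k, eps):
    """Drop edges lighter than eps, keep the k heaviest, pool the rest into OTHER."""
    kept = [(node, w) for node, w in weights.items() if w >= eps]
    kept.sort(key=lambda item: item[1], reverse=True)
    result = dict(kept[:k])
    overflow = sum(w for _, w in kept[k:])
    if overflow > 0:
        result[OTHER] = overflow
    return result


def build_coi(out_weights, in_weights, k=9, eps=0.1):
    """Build the COI of one node from its EWMA-smoothed edge weights.

    out_weights: {callee: weight} for numbers called by X.
    in_weights:  {caller: weight} for numbers calling X.
    """
    return {
        "outbound": _top_k(out_weights, k, eps),
        "inbound": _top_k(in_weights, k, eps),
    }


coi = build_coi({"a": 5.0, "b": 3.0, "c": 1.0, "d": 0.01}, {}, k=2, eps=0.1)
```

The result has at most k + 1 entries per direction, so storage per node is bounded no matter how many distinct numbers it has ever interacted with.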
46COI of X
[Figure: node X at the center, with aggregate "Other inbound" and "Other outbound" nodes]
47How to select parameters?
- Select L pre and post periods, maximize an
average similarity measure
48Similarity functions
[Figure: pre-period and post-period COIs; labels TN1, TN2, p1, p2, p_other]
49Selecting theta and k
[Figure: panels for the Hellinger and weighted Dice (Wdice) similarity measures]
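For readers unfamiliar with these measures, here is a hedged sketch using textbook definitions: the Hellinger affinity (Bhattacharyya coefficient) between normalized edge weights and the plain, unweighted Dice coefficient on neighbor sets. The exact forms used in the talk, in particular the weighted Dice variant, may differ; the function names are hypothetical.

```python
import math


def normalize(weights):
    """Scale edge weights so that they sum to 1 (empty input stays empty)."""
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()} if total > 0 else {}


def hellinger_affinity(coi_a, coi_b):
    """Sum of sqrt(p_n * q_n) over shared neighbors; 1 means identical profiles."""
    p, q = normalize(coi_a), normalize(coi_b)
    return sum(math.sqrt(p[n] * q[n]) for n in p.keys() & q.keys())


def dice(coi_a, coi_b):
    """Unweighted Dice coefficient: 2 |A and B| / (|A| + |B|) on neighbor sets."""
    a, b = set(coi_a), set(coi_b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


pre = {"TN1": 3.0, "TN2": 1.0}
post = {"TN1": 3.0, "TN2": 1.0}
same = hellinger_affinity(pre, post)
disjoint = hellinger_affinity(pre, {"TN9": 2.0})
half = dice({"a": 1.0, "b": 1.0}, {"b": 1.0, "c": 1.0})
```

Averaging such a score between a node's pre-period and post-period COIs over many nodes gives a criterion that can be compared across candidate values of θ and k.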
50Selecting epsilon.
51Repetitive Debtor
- Thomas Hanley
- 62 Rio Robles, San Jose, CA
- Disconnected: non-payment
- Deborah Hanley
- 62 Rio Robles, San Jose, CA
- ??
Key: calling patterns don't change.
Compare new connections with fraudulent numbers using a similarity function.
52Validation with labels from experts
53Enhancing COI: my friends' friends
- Impute calls not seen on the network, exploiting social structure
- Other issues
- Quantify social characteristics like tendency to call, tendency to receive calls, tendency to return calls for each node.
54Extended COI
[Figure: extended COI around X; labels d0, d1, d2, d3, other]
55Approach
- Extended COI -> social network representing the node's interaction with the network
- Developed a rich class of statistical models to do inference.
56Parameters
- Node i
- α_i: expansiveness (tendency to call)
- β_i: attractiveness (tendency of being called)
- Global parameters
- ?: density of COI (reduces with increasing sparseness)
- ?: reciprocity of COI (tendency to return calls)
- ?s: caller-specific effect
- ?r: callee-specific effect
- ?: call-specific effect
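To make the roles of expansiveness, attractiveness, density, and reciprocity concrete, the sketch below computes dyad probabilities under the classic Holland-Leinhardt p1 model for a binary directed graph. This is a stand-in chosen for illustration: the models in the talk add covariate effects and may be parameterized differently, and the argument names `density` and `reciprocity` are assumptions.

```python
import math


def dyad_probs(alpha_i, beta_i, alpha_j, beta_j, density, reciprocity):
    """Probabilities of the four states (y_ij, y_ji) of one dyad under p1.

    y_ij = 1 means i calls j. The log-odds of each state relative to the empty
    dyad add the density, the caller's expansiveness, and the callee's
    attractiveness per edge, plus a reciprocity bonus when both edges exist.
    """
    log_weights = {
        (0, 0): 0.0,
        (1, 0): density + alpha_i + beta_j,
        (0, 1): density + alpha_j + beta_i,
        (1, 1): 2.0 * density + alpha_i + beta_j + alpha_j + beta_i + reciprocity,
    }
    norm = sum(math.exp(v) for v in log_weights.values())
    return {state: math.exp(v) / norm for state, v in log_weights.items()}


neutral = dyad_probs(0.0, 0.0, 0.0, 0.0, density=0.0, reciprocity=0.0)
mutual = dyad_probs(0.0, 0.0, 0.0, 0.0, density=0.0, reciprocity=2.0)
sparse = dyad_probs(0.0, 0.0, 0.0, 0.0, density=-3.0, reciprocity=0.0)
```

A more negative density makes the empty dyad dominate, and a positive reciprocity raises the probability that a call is returned.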
57[Figure: graphical model. Labels: Σ_αβ, hyperparameters, density, coefficients for edge covariates, COI (w_ij, w_ji) for i < j, covariates, k, imputing the t_ij's]
59Example
- COI with 117 nodes, 172 edges.
- 14 missing edges: local calls from 14 non-AT&T local customers to seed node.
- One node attribute (biz/cell/res) gives rise to 9 edge attributes.
- cell->biz, cell->cell, cell->res collapsed into one block.
- M1: uniform reciprocity; M2: differential reciprocity
- Latter gives better fit; edge covariates were statistically significant. Results in Table 2 of paper.
63Relevant Papers
- S. Hill, D. Agarwal, R. Bell, C. Volinsky (2006) Building an Effective Representation of Dynamic Networks, Journal of Computational and Graphical Statistics (to appear)
- C. Cortes, D. Pregibon, C. Volinsky (2003) Computational Methods for Dynamic Graphs, Journal of Computational and Graphical Statistics, 12, 950-970
- D. Agarwal, D. Pregibon (2004) Enhancing Communities of Interest using Bayesian Stochastic Blockmodels, SIAM Data Mining Conference, Orlando