Title: Model Maintenance in Dynamic Environments
1. Model Maintenance in Dynamic Environments
- Venkatesh Ganti
- (Joint work with Raghu Ramakrishnan, Johannes
Gehrke, Mong Li Lee)
2. Mining Environment
- Data repository for analysis
- Data mining models
- Frequent itemsets
- Decision trees
- Clusters
-
- OLAP
- Aggregate queries
- Repository updated regularly
- Query workloads change
[Figure labels: Data Warehouse, Data Mining, OLAP]
3. Two Parts of this Talk
- Model maintenance: maintaining models under systematic data evolution [ICDE 00]
- Tuning samples: maintaining samples for approximate query answering with respect to changing query workloads [VLDB 00]
4. Systematic Block Evolution
- Data warehouses are updated with blocks of new data
- Block: a set of tuples appended simultaneously to the data warehouse
Result: a sequence of database snapshots
5. Model Maintenance Objective
- Allow selection of interesting time-varying subsets to be modeled
- Low response time to get the updated model
- Interesting classes of models
- Frequent itemsets (LITS)
- Clusters
- Decision trees (DT)
6. Subset Selection: Data Span
- Span of interest
- Everything until now: unrestricted window
- Recently collected: most recent window
- Unrestricted Window (UW)
- Model the entire database
[Figure: blocks D1, D2, D3 with the model M(D1+D2+D3)]
7. Data Span (contd.)
- Most Recent Window (MRW) of size w
- E.g., model data collected in the last 3 days
- Sliding window models
[Figure: blocks D1, D2, D3 with the model M(D1+D2+D3)]
8. Block Selection Sequence
- Maintain models on data collected on alternate days within the last 4 weeks
- Requires fine-grained selection
- Block selection sequence (BSS)
- A 0/1 sequence: a bit for each block in the data span
- 1: the block is selected for modeling
- 0: the block is not selected for modeling
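A minimal sketch of how a BSS picks blocks, assuming the blocks are held in a list ordered oldest first and the BSS is given as a string of 0/1 characters. The function name `select_blocks` and the block labels are illustrative and not taken from the talk.

```python
def select_blocks(blocks, bss):
    """Return the blocks whose BSS bit is 1.

    blocks: list of blocks, oldest first (each block can be any object).
    bss:    string of '0'/'1' characters, one per block.
    """
    if len(blocks) != len(bss):
        raise ValueError("the BSS needs exactly one bit per block")
    return [block for block, bit in zip(blocks, bss) if bit == "1"]


# Hypothetical example: five daily blocks, alternate days selected.
blocks = ["D1", "D2", "D3", "D4", "D5"]
print(select_blocks(blocks, "10101"))  # ['D1', 'D3', 'D5']
```

A model would then be built only on the returned blocks.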
9. BSS: UW
- A sequence of 0/1 bits, one for each block in the entire database
- E.g., select all blocks collected on alternate days
Blocks: D1 D2 D3 D4 D5
BSS:    1  0  1  0  1
10. BSS: MRW
- Two types of BSS w.r.t. MRW
- Window-independent
- Window-relative
- Model data collected on Mondays within the last 4 weeks
- BSS: (1000000)
[Figure: bits 1 00 1 00 1 ... aligned with blocks D1, D2-D7, D8, D9-D14; resulting model M(D1+D8)]
11. BSS: MRW (contd.)
- Window-relative BSS
- Model all data collected on alternate days from the start in a window of size 3
- BSS: 101
[Figure: bits 1 0 1 aligned with blocks D1, D2, D3 of the window]
Here, each successive subset is disjoint from its predecessor
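The two kinds of BSS under MRW can be contrasted with a small sketch. It assumes blocks are numbered 1, 2, ... in arrival order, that a window-independent BSS is a function from block number to bit, and that a window-relative BSS is a string of w bits tied to positions inside the window. All function names are hypothetical, and the Monday rule assumes one block per day with D1 falling on a Monday.

```python
def window_independent(t, w, bit_of_block):
    """Blocks selected at time t when each bit is tied to a block number."""
    first = max(1, t - w + 1)  # oldest block still inside the window
    return [i for i in range(first, t + 1) if bit_of_block(i) == 1]


def window_relative(t, w, bss):
    """Blocks selected at time t when bss has one bit per window position."""
    if len(bss) != w or t < w:
        raise ValueError("need a full window and exactly w bits")
    first = t - w + 1
    return [first + k for k, bit in enumerate(bss) if bit == "1"]


def monday_bit(i):
    # Assumption for illustration: one block per day, D1 is a Monday.
    return 1 if (i - 1) % 7 == 0 else 0


print(window_independent(14, 14, monday_bit))  # [1, 8]
print(window_relative(3, 3, "101"))            # [1, 3]
print(window_relative(4, 3, "101"))            # [2, 4]
```

With the window-relative sequence 101, the subsets at consecutive times share no block, which is the disjointness noted on the slide.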
12. Model Maintenance Enumeration
Model classes (columns): LITS, Clustering, DT
Data spans (rows): UW+BSS, MRW+BSS
For MRW, BSS includes both window-independent and window-relative block selection sequences.
13. Model Maintenance Algorithms
Model classes (columns): LITS, Clustering, DT
Data spans (rows): UW+BSS, MRW+BSS
Table entry: GEMM(A)
GEMM: GEneric Model Maintenance. An algorithm for any class of models that has an incremental maintenance algorithm A under tuple insertions.
14. Maintenance under Insertions
- Algorithm A
- Input: old dataset D, old model M(D), a block of tuples d appended to D
- Output: M(D+d) = A(D, d, M(D))
- Such algorithms exist for
- Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
- Clusters (BIRCH)
- Decision trees (BOAT)
- Note: We do NOT require A to handle deletions!
15. GEMM
- Input
- Data span (and window size for MRW)
- BSS
- A model-update algorithm A under tuple insertions (deletions not required)
- Output
- An efficient model maintenance algorithm
16. GEMM: MRW
- Assume BSS is a sequence of 1s and w = 3
- We already know parts of future windows
17. GEMM: MRW (contd.)
Idea: Start building models for future windows.
E.g., at T3, we maintain models on
<D1 D2 D3> (model required for window at T3)
<D2 D3> (partial model for window at T4)
<D3> (partial model for window at T5)
Models at T3: M<D1 D2 D3>, M<D2 D3>, M<D3>
Models at T4:
M<D2 D3 D4> (for window at T4; updated immediately)
M<D3 D4> (for window at T5; updated offline)
M<D4> (for window at T6; built offline)
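The idea of keeping partial models for future windows can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `Gemm`, the `selected` predicate (a window-independent BSS given as a function of the block number) and the toy insertion-only algorithm `count_update` (a `Counter` of item occurrences standing in for A) are all hypothetical.

```python
from collections import Counter


class Gemm:
    """Sketch of GEMM for a most recent window of w blocks (assumed design)."""

    def __init__(self, w, new_model, update, selected=lambda block_no: True):
        self.new_model = new_model  # () -> model of the empty dataset
        self.update = update        # stand-in for A: (model, block) -> model
        self.selected = selected    # window-independent BSS bit of block t
        self.t = 0                  # number of blocks appended so far
        # models[j] covers the selected blocks shared by the current window
        # and the window that will be current j blocks from now.
        self.models = [new_model() for _ in range(w)]

    def append(self, block):
        """Append block D_t and return the model for the window ending at t."""
        self.t += 1
        # The oldest model belonged to the window that just expired: drop it
        # and start an empty model for the window that begins with this block.
        self.models = self.models[1:] + [self.new_model()]
        if self.selected(self.t):
            # Only models[0] is needed right away; the other updates could
            # run offline. This sketch simply updates all of them eagerly.
            self.models = [self.update(m, block) for m in self.models]
        return self.models[0]


def count_update(model, block):
    """Toy insertion-only A: the model is a Counter of item occurrences."""
    return model + Counter(block)


gemm = Gemm(w=3, new_model=Counter, update=count_update)
for block in (["a"], ["b"], ["c"]):   # D1, D2, D3
    current = gemm.append(block)
print(current)               # model over <D1 D2 D3>
print(gemm.models[1:])       # partial models over <D2 D3> and <D3>
print(gemm.append(["d"]))    # D4 arrives: model over <D2 D3 D4>
```

No deletion is ever applied to a model: a block leaves the window only because the model that contained it is dropped. At most w models are alive at any time.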
18. GEMM: Arbitrary BSS
BSS: 1 0 1 0 1 ...
T3: Model on <1.D1 0.D2 1.D3>
T4: D4 is appended. Model on <0.D2 1.D3 0.D4>
T5: D5 is appended. Model on <1.D3 0.D4 1.D5>
[Figure: blocks D1, D2, D3, D4, D5]
Idea: We still know parts of future windows and the corresponding BSS for each of them.
E.g., at T3, we maintain models on
<1.D1 0.D2 1.D3> (model required at T3)
<0.D2 1.D3> and <1.D3> (identical)
19. GEMM: Resource Requirements
- Response time to new model
- Updating one model with the new block
- Other updates offline
- Depends on the incremental algorithm
- Space requirements
- At most w models
- Space required for a model is orders of magnitude
less than that for data!
20. Maintenance under Insertions
- Algorithm A
- Input: old dataset D, old model M(D), a block of tuples d appended to D
- Output: M(D+d) = A(D, d, M(D))
- Such algorithms exist for
- Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
- Clusters (BIRCH)
- Decision trees (BOAT)
- Note: We do NOT require A to handle deletions!
21. Frequent Itemset Models
- Set of customer transactions
- Frequent itemset: a set of items purchased together by many customers
Minimum frequency threshold 50%: {b}, {c}, {a,c} are frequent itemsets
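To make the definition concrete, here is a brute-force sketch that enumerates every candidate itemset and keeps those meeting a minimum support fraction. It is only an illustration of the definition, not one of the algorithms named in the talk, and it is exponential in the number of items. The transactions are hypothetical (the slide's original table is not available), and the name `frequent_itemsets` is an assumption.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support fraction} for itemsets meeting min_support.

    transactions: list of sets of items.
    min_support:  fraction in [0, 1], e.g. 0.5 for a 50% threshold.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Count transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[frozenset(candidate)] = count / n
    return result


# Hypothetical transactions, chosen only for illustration.
transactions = [{"a", "c"}, {"a", "c"}, {"b"}, {"b"}]
for itemset, support in frequent_itemsets(transactions, 0.5).items():
    print(sorted(itemset), support)
```

On this toy data the itemsets {a}, {b}, {c} and {a,c} each appear in half of the transactions, so all four meet the 50% threshold.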
22. Incremental Algorithm [FAAM97, TBAR97]
[Figure: old blocks D1, D2, D3 and new block D4]
- Input
- Old dataset
- Old set of frequent itemsets
- New block D4
- Steps
- Detect if new itemsets become frequent
- Count frequencies of a small number of itemsets
- Current algorithms scan (D1+D2+D3) completely
- Update model
23. ECUT: New Counting Algorithm
- Transformed data representation
- Within each block Di
- For each item x: sorted list of identifiers of the transactions containing x, TID-list(x)
TID-list(a) = {1}, TID-list(b) = {2,3}, TID-list(c) = {1,2}
Count({a,b}) = |TID-list(a) ∩ TID-list(b)|
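Counting by TID-list intersection can be sketched as below. This is an illustrative reading of the slide rather than the ECUT implementation itself: the helper names are hypothetical, and the three-transaction block is an assumption chosen so that it reproduces the TID-lists shown on the slide.

```python
def build_tid_lists(block):
    """Map each item to the sorted list of transaction ids containing it.

    block: list of (tid, items) pairs in increasing tid order.
    """
    tid_lists = {}
    for tid, items in block:
        for item in items:
            tid_lists.setdefault(item, []).append(tid)
    return tid_lists


def intersect_sorted(xs, ys):
    """Merge-style intersection of two sorted lists."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def count_itemset(tid_lists, itemset):
    """Support count of a non-empty itemset within one block."""
    items = list(itemset)
    common = tid_lists.get(items[0], [])
    for item in items[1:]:
        common = intersect_sorted(common, tid_lists.get(item, []))
    return len(common)


# Assumed block that yields TID-list(a)=[1], TID-list(b)=[2,3], TID-list(c)=[1,2].
block = [(1, {"a", "c"}), (2, {"b", "c"}), (3, {"b"})]
tid_lists = build_tid_lists(block)
print(count_itemset(tid_lists, ["a", "b"]))  # 0
print(count_itemset(tid_lists, ["a", "c"]))  # 1
```

Only the TID-lists of the items in the itemset are read, so the count does not require scanning the whole block.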
24. Experimental Comparison
25. Comparing Count Times
26. Summary of the First Part
Model classes (columns): LITS, Clustering, DT
Data spans (rows): UW+BSS, MRW+BSS
Table entry: GEMM
- Maintenance algorithms under tuple insertions
- Frequent itemsets
- ECUT, ECUT+
- Clusters
- BIRCH
- Decision Trees
- BOAT
27. Second Part of this Talk
- Model maintenance: maintaining models under systematic data evolution [ICDE 00]
- Maintaining samples with respect to changing query workloads [VLDB 00]
28. Random Samples for AQUA
[Figure: an aggregate query Q over R yields the exact answer; in the typical AQUA approach, Q over a uniform random sample of R yields an approximate answer]
- All tuples in R are assumed to be equally important while drawing S(R)
- In practice, queries exhibit locality
- Consequence: S(R) wastes precious real estate
29. Problem
- Given
- Relation R
- Workload W = {Q1, ..., Qn}
- Goal: Dynamically tune random sample of R w.r.t. W
- Model to be maintained: a simple random sample
30. ICICLES
- R(Q): set of tuples in R required to answer Q
- Random sample of R ∪ R(Q1) ∪ ... ∪ R(Qn)
- Tuples required often are more likely to be in SW(R)
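A rough sketch of the sampling idea, under assumptions that go beyond the slide: the union is read as a multiset union (so a tuple needed by many queries appears many times), R(Q) is approximated as the tuples of R satisfying a predicate, and the sample is drawn in one shot rather than tuned dynamically. The function name `icicle_sample` and the data are hypothetical.

```python
import random


def icicle_sample(relation, workload, sample_size, seed=None):
    """Uniform sample from the multiset R + R(Q1) + ... + R(Qn).

    relation: list of tuples (R).
    workload: list of predicates; R(Q) is taken to be the tuples of R that
              satisfy predicate Q (a simplifying assumption).
    """
    extended = list(relation)
    for query in workload:
        extended.extend(t for t in relation if query(t))
    rng = random.Random(seed)
    return rng.sample(extended, min(sample_size, len(extended)))


# Hypothetical data: 100 tuples, and 50 queries that all touch ids 0..9.
relation = [(i,) for i in range(100)]
workload = [lambda t: t[0] < 10] * 50
sample = icicle_sample(relation, workload, sample_size=200, seed=0)
hot = sum(1 for t in sample if t[0] < 10)
print(hot, "of", len(sample), "sampled entries are frequently queried tuples")
```

In this toy setup the extended multiset has 600 entries, of which only 90 are tuples no query needs, so a sample of 200 entries must contain at least 110 copies of the frequently queried tuples.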
31. Mail Order Dataset
32. Conclusions and Future Work
[Figure labels: static dataset vs. dynamic dataset; workload indifferent vs. workload sensitive]