Model Maintenance in Dynamic Environments - PowerPoint PPT Presentation

About This Presentation

Title:

Model Maintenance in Dynamic Environments

Description:

Input: old dataset D, old model M(D), a block of tuples d appended to D ... A model-update algorithm A under tuple insertions (deletions not required) Output ... – PowerPoint PPT presentation

Provided by: venkate

Learn more at: https://research.cs.wisc.edu

Transcript and Presenter's Notes

Title: Model Maintenance in Dynamic Environments

1
Model Maintenance in Dynamic Environments

Venkatesh Ganti
(Joint work with Raghu Ramakrishnan, Johannes
Gehrke, Mong Li Lee)

2
Mining Environment

Data repository for analysis
Data mining models
Frequent itemsets
Decision trees
Clusters
OLAP
Aggregate queries
Repository updated regularly
Query workloads change

[Figure labels: Data Mining, Data Warehouse, OLAP]
3
Two Parts of this Talk

Model maintenance: maintaining models under
systematic data evolution [ICDE 00]
Tuning samples: maintaining samples for
approximate query answering with respect to
changing query workloads [VLDB 00]

4
Systematic Block Evolution

Data warehouses are updated with blocks of new
data
Block: a set of tuples appended simultaneously to
the data warehouse
Result: a sequence of database snapshots
5
Model Maintenance: Objective

Allow selection of interesting time-varying
subsets to be modeled
Low response time to get the updated model
Interesting classes of models
Frequent itemsets (LITS)
Clusters
Decision trees (DT)

6
Subset Selection: Data Span

Span of interest
Everything until now: unrestricted window
Recently collected: most recent window

Unrestricted Window (UW)
Model the entire database

[Figure: blocks D1, D2, D3 with model M(D1 ∪ D2 ∪ D3)]
7
Data Span (contd.)

Most Recent Window (MRW) of size w
E.g., model data collected in the last 3 days
Sliding Windows Models

[Figure: blocks D1, D2, D3 with model M(D1 ∪ D2 ∪ D3)]
8
Block Selection Sequence

Maintain models on data collected on alternate
days within the last 4 weeks
Requires fine-grained selection
Block selection sequence (BSS)
A 0/1 sequence: one bit for each block in the data
span
1: the block is selected for modeling
0: the block is not selected for modeling

9
BSS UW

A sequence of 0/1 bits, one for each block in the
entire database
E.g., select all blocks collected on alternate
days

[Figure: BSS bits 1 0 1 0 1 over blocks D1 to D5]
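
Applying a BSS to a data span amounts to a bitwise filter over the blocks. A minimal Python sketch, assuming blocks are held in a list ordered oldest first; the function name `select_blocks` and the string block labels are illustrative and do not come from the slides.

```python
def select_blocks(blocks, bss):
    """Return the blocks whose BSS bit is 1.

    blocks: blocks in the data span, oldest first.
    bss:    0/1 bits, one per block (assumed encoding).
    """
    if len(blocks) != len(bss):
        raise ValueError("a BSS needs exactly one bit per block in the span")
    return [block for block, bit in zip(blocks, bss) if bit == 1]


# Alternate-day selection over five daily blocks.
blocks = ["D1", "D2", "D3", "D4", "D5"]
selected = select_blocks(blocks, [1, 0, 1, 0, 1])
print(selected)  # ['D1', 'D3', 'D5']
```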
10
BSS MRW

Two types of BSS w.r.t. MRW
Window-independent
Window-relative
Model data collected on Mondays within the last 4
weeks
BSS: (1000000)

[Figure: BSS bits 1 0...0 1 0...0 1 ... over blocks D1, D2-D7, D8, D9-D14; model M(D1 ∪ D8)]
11
BSS MRW (contd.)

Window-relative BSS
Model all data collected on alternate days from
the start in a window of size 3
BSS: 101

[Figure: BSS bits 1 0 1 over blocks D1, D2, D3]
Here, each successive subset is disjoint from its
predecessor
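
The difference between the two BSS types shows up once the window slides. The sketch below is one possible reading, assuming a window-independent bit is fixed per block while a window-relative bit is fixed per position inside the window; the function names are hypothetical.

```python
def window_independent(blocks, block_bits, w):
    """Most recent w blocks; every block keeps its own fixed bit."""
    recent = list(zip(blocks, block_bits))[-w:]
    return [block for block, bit in recent if bit == 1]


def window_relative(blocks, pattern):
    """Most recent len(pattern) blocks; bit i applies to position i in the window."""
    if len(blocks) < len(pattern):
        raise ValueError("fewer blocks than the window size")
    recent = blocks[-len(pattern):]
    return [block for block, bit in zip(recent, pattern) if bit == 1]


# Window-relative BSS 101 with window size 3.
at_t3 = window_relative(["D1", "D2", "D3"], [1, 0, 1])        # D1, D3
at_t4 = window_relative(["D1", "D2", "D3", "D4"], [1, 0, 1])  # D2, D4
print(at_t3, at_t4)
```

With the window-relative sequence 101, the subsets at T3 and T4 share no block, which matches the remark on the slide.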
12
Model Maintenance Enumeration
             LITS    Clustering    DT
UW + BSS
MRW + BSS    (includes both window-independent and
             window-relative block selection sequences)
13
Model Maintenance Algorithms
             LITS       Clustering    DT
UW + BSS     GEMM(A)    GEMM(A)       GEMM(A)
MRW + BSS    GEMM(A)    GEMM(A)       GEMM(A)
GEMM: GEneric Model Maintenance algorithm for any
class of models that has an incremental
maintenance algorithm A under tuple insertions
14
Maintenance under Insertions

Algorithm A
Input: old dataset D, old model M(D), a block of
tuples d appended to D
Output: M(D ∪ d) = A(D, d, M(D))
Such algorithms exist for
Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
Clusters (BIRCH)
Decision trees (BOAT)
Note: We do NOT require A to handle deletions!
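
To make the signature A(D, d, M(D)) concrete, here is a toy stand-in in Python. It is not ECUT, BIRCH, or BOAT: the "model" is assumed to be nothing more than per-item transaction counts, and the name `update_item_counts` is invented for illustration. The point is only that the update reads the new block and the old model, never the old data.

```python
from collections import Counter


def update_item_counts(old_data, new_block, old_model):
    """Toy A(D, d, M(D)): return M(D ∪ d) under insertions only.

    old_data is accepted to mirror the signature on the slide but is never read;
    the old model already summarizes it.
    """
    new_model = Counter(old_model)          # leave the old model untouched
    for transaction in new_block:
        new_model.update(set(transaction))  # count each item once per transaction
    return new_model


D = [{"a", "c"}, {"b", "c"}]
M_D = Counter({"a": 1, "b": 1, "c": 2})
d = [{"b"}]
M_Dd = update_item_counts(D, d, M_D)
print(M_Dd)
```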

15
GEMM

Input
Data span (and window size for MRW)
BSS
A model-update algorithm A under tuple insertions
(deletions not required)
Output
An efficient model maintenance algorithm

16
GEMM MRW

Assume BSS is a sequence of 1s and w = 3

We already know parts of future windows

17
GEMM MRW (contd.)
Idea: Start building models for future windows
E.g., at T3, we maintain models on
<D1 D2 D3> (model required for window at T3)
<D2 D3> (partial model for window at T4)
<D3> (partial model for window at T5)
Models at T3: M<D1 D2 D3>, M<D2 D3>, M<D3>
Models at T4:
M<D2 D3 D4> (for window at T4), updated immediately
M<D3 D4> (for window at T5), updated offline
M<D4> (for window at T6), updated offline
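
The bookkeeping on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the model is a toy `Counter` of item occurrences, `add_block` stands in for the insertion-only algorithm A, and the class name `GemmMRW` is made up. All updates happen in one loop here, whereas the slide updates only the first model immediately and the rest offline.

```python
from collections import Counter, deque


def add_block(model, block):
    """Toy insertion-only update A: add one block to a Counter model."""
    updated = Counter(model)
    for transaction in block:
        updated.update(set(transaction))
    return updated


class GemmMRW:
    """Keeps at most w models: one current window plus partial future windows."""

    def __init__(self, w, update=add_block, empty_model=Counter):
        self.w = w
        self.update = update
        self.empty_model = empty_model
        self.models = deque()  # models[0] covers the most blocks

    def append_block(self, block):
        if len(self.models) == self.w:
            self.models.popleft()                # its oldest block leaves the window
        self.models.append(self.empty_model())   # window that starts at this block
        self.models = deque(self.update(m, block) for m in self.models)
        return self.models[0]                    # model for the current window


D1, D2, D3, D4 = [{"a"}], [{"b"}], [{"a", "b"}], [{"c"}]
gemm = GemmMRW(w=3)
for block in (D1, D2, D3):
    at_t3 = gemm.append_block(block)
at_t4 = gemm.append_block(D4)
print(at_t3, at_t4)
```

No deletion is ever applied: the model on D1, D2, D3 is simply discarded at T4, and the model on D2, D3 (already built) absorbs D4.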
18
GEMM Arbitrary BSS
BSS: 1 0 1 0 1 ...
T3: Model on <1.D1 0.D2 1.D3>
T4: D4 is appended. Model on <0.D2 1.D3 0.D4>
T5: D5 is appended. Model on <1.D3 0.D4 1.D5>
Idea: We still know parts of future windows and
the corresponding BSS for each of them
E.g., at T3, we maintain models on
<1.D1 0.D2 1.D3> (model required at T3)
<0.D2 1.D3> and <1.D3> (identical)
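
A sketch of one GEMM step with an arbitrary window-independent BSS, again using a toy `Counter` model. The function name `gemm_step` and the list layout (oldest window first) are assumptions made for this example. A block whose bit is 0 still advances the windows but leaves every model unchanged.

```python
from collections import Counter


def gemm_step(models, block, bit, w):
    """Advance the window by one block and return the new list of models.

    models: Counters, oldest window first (assumed layout).
    bit:    BSS bit of the newly appended block.
    """
    models = models[1:] if len(models) == w else list(models)
    models.append(Counter())  # model for the window starting at this block
    if bit == 1:
        delta = Counter(item for t in block for item in set(t))
        models = [m + delta for m in models]
    return models


blocks = [[{"a"}], [{"b"}], [{"c"}], [{"d"}], [{"e"}]]  # D1 .. D5
bss = [1, 0, 1, 0, 1]

models = []
for block, bit in zip(blocks[:3], bss[:3]):
    models = gemm_step(models, block, bit, w=3)
print(models)  # models held at T3
```

At T3 the second and third models come out equal, which is the "identical" case on the slide: D2 carries bit 0, so <0.D2 1.D3> and <1.D3> describe the same data.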
19
GEMM Resource Requirements

Response time to new model
Updating one model with the new block
Other updates offline
Depends on the incremental algorithm
Space requirements
At most w models
Space required for a model is orders of magnitude
less than that for data!

20
Maintenance under Insertions

Algorithm A
Input: old dataset D, old model M(D), a block of
tuples d appended to D
Output: M(D ∪ d) = A(D, d, M(D))
Such algorithms exist for
Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
Clusters (BIRCH)
Decision trees (BOAT)
Note: We do NOT require A to handle deletions!

21
Frequent Itemset Models

Set of customer transactions
Frequent itemset: a set of items purchased
together by many customers

Minimum frequency threshold: 50%
{b}, {c}, {a,c} are frequent itemsets
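
The definition can be checked by brute force on a small example. The transactions below are hypothetical (the example table from the slide did not survive extraction), and `frequent_itemsets` is an illustrative name; enumerating every subset is only workable for tiny inputs.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets in at least min_support of the transactions."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for itemset in combinations(items, k):
                counts[itemset] = counts.get(itemset, 0) + 1
    threshold = min_support * len(transactions)
    return {s: c for s, c in counts.items() if c >= threshold}


# Hypothetical data, not the table from the slide.
transactions = [{"a", "c"}, {"b", "c"}, {"a", "c"}, {"b"}]
result = frequent_itemsets(transactions, min_support=0.5)
print(result)
```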
22
Incremental Algorithm [FAAM97, TBAR97]
[Figure: blocks D1, D2, D3 and new block D4]

Input
Old dataset
Old set of frequent itemsets
New block D4
Steps
Detect if new itemsets become frequent
Count frequencies of a small number of itemsets
Current algorithms scan D1 ∪ D2 ∪ D3 completely
Update model
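
The three steps can be sketched as follows. This is a simplified illustration and not the algorithm of the cited papers: it enumerates itemsets by brute force and all names are invented. It relies on one fact: an itemset that is infrequent in the old data can become frequent overall only if it is frequent within the new block, so only those few candidates need counting on the old data.

```python
from itertools import combinations


def count_itemsets(transactions):
    """Brute-force counts of every itemset contained in some transaction."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                counts[s] = counts.get(s, 0) + 1
    return counts


def incremental_update(old_data, old_frequent, new_block, min_support):
    """old_frequent: {itemset: count in old_data} for itemsets frequent in old_data."""
    total = len(old_data) + len(new_block)
    block_counts = count_itemsets(new_block)
    # Known frequent itemsets: add their counts from the new block only.
    updated = {s: c + block_counts.get(s, 0) for s, c in old_frequent.items()}
    # Detect candidates: frequent inside the new block, not in the old model.
    candidates = [s for s, c in block_counts.items()
                  if s not in old_frequent and c >= min_support * len(new_block)]
    # Count the candidates on the old data (the scan a faster counter would avoid).
    for s in candidates:
        in_old = sum(1 for t in old_data if set(s) <= set(t))
        updated[s] = in_old + block_counts[s]
    return {s: c for s, c in updated.items() if c >= min_support * total}


old_data = [{"a", "c"}, {"b", "c"}, {"a", "c"}, {"b"}]
old_frequent = {s: c for s, c in count_itemsets(old_data).items() if c >= 0.5 * len(old_data)}
new_block = [{"b", "c"}, {"b", "c"}]
new_frequent = incremental_update(old_data, old_frequent, new_block, 0.5)
print(new_frequent)
```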

23
ECUT: New Counting Algorithm

Transformed data representation
Within each block Di
Item x → sorted list of identifiers of transactions
containing x: TID-list(x)

TID-list(a) = {1}, TID-list(b) = {2,3}, TID-list(c) = {1,2}
Count({a,b}) = |TID-list(a) ∩ TID-list(b)|
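
Counting by TID-list intersection can be sketched with a merge-style walk over sorted lists. The TID-lists below are the ones on the slide; the function names and the dictionary layout are assumptions, and the itemset is assumed to be non-empty.

```python
def intersect_sorted(xs, ys):
    """Intersect two sorted TID-lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def count_itemset(tid_lists, itemset):
    """Count transactions in one block that contain every item of a non-empty itemset."""
    items = list(itemset)
    tids = tid_lists[items[0]]
    for item in items[1:]:
        tids = intersect_sorted(tids, tid_lists[item])
    return len(tids)


# TID-lists of a single block, as on the slide.
tid_lists = {"a": [1], "b": [2, 3], "c": [1, 2]}
print(count_itemset(tid_lists, ["a", "b"]))  # 0
print(count_itemset(tid_lists, ["a", "c"]))  # 1
```

Since the lists are kept per block, the count over several blocks would presumably be the sum of the per-block counts, and only the lists of the items in the itemset are read.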
24
Experimental Comparison
25
Comparing Count Times
26
Summary of the first part
             LITS    Clustering    DT
UW + BSS     GEMM    GEMM          GEMM
MRW + BSS    GEMM    GEMM          GEMM