Model Maintenance in Dynamic Environments

1
Model Maintenance in Dynamic Environments
  • Venkatesh Ganti
  • (Joint work with Raghu Ramakrishnan, Johannes
    Gehrke, Mong Li Lee)

2
Mining Environment
  • Data repository for analysis
  • Data mining models
  • Frequent itemsets
  • Decision trees
  • Clusters
  • OLAP
  • Aggregate queries
  • Repository updated regularly
  • Query workloads change

Figure labels: Data Warehouse, Data Mining, OLAP.
3
Two Parts of this Talk
  • Model maintenance: maintaining models under
    systematic data evolution (ICDE '00)
  • Tuning samples: maintaining samples for
    approximate query answering with respect to
    changing query workloads (VLDB '00)

4
Systematic Block Evolution
  • Data warehouses are updated with blocks of new
    data
  • Block: a set of tuples appended simultaneously to
    the data warehouse

Result: a sequence of database snapshots
5
Model Maintenance Objective
  • Allow selection of interesting time-varying
    subsets to be modeled
  • Low response time to get the updated model
  • Interesting classes of models
  • Frequent itemsets (LITS)
  • Clusters
  • Decision trees (DT)

6
Subset Selection: Data Span
  • Span of interest
  • Everything until now: unrestricted window
  • Recently collected: most recent window
  • Unrestricted Window (UW)
  • Model the entire database

Figure: blocks D1, D2, D3 and the model M(D1 ∪ D2 ∪ D3).
7
Data Span (contd.)
  • Most Recent Window (MRW) of size w
  • E.g., model data collected in the last 3 days
  • Sliding Windows Models

Figure: blocks D1, D2, D3 and the model M(D1 ∪ D2 ∪ D3).
8
Block Selection Sequence
  • Maintain models on data collected on alternate
    days within the last 4 weeks
  • Requires fine-grained selection
  • Block selection sequence (BSS)
  • A 0/1 sequence: one bit for each block in the
    data span
  • 1: the block is selected for modeling
  • 0: the block is not selected for modeling
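
A minimal Python sketch of applying a BSS to a data span. The function name `select_blocks` and the list-based representation of blocks and bits are assumptions made for illustration; they are not taken from the talk.

```python
def select_blocks(blocks, bss):
    """Return the blocks whose BSS bit is 1.

    blocks: list of blocks in the data span, oldest first.
    bss:    list of 0/1 bits, one per block (hypothetical representation).
    """
    if len(blocks) != len(bss):
        raise ValueError("the BSS needs exactly one bit per block")
    return [block for block, bit in zip(blocks, bss) if bit == 1]


# Alternate-day selection over five daily blocks.
blocks = ["D1", "D2", "D3", "D4", "D5"]
selected = select_blocks(blocks, [1, 0, 1, 0, 1])
print(selected)  # ['D1', 'D3', 'D5']
```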

9
BSS: UW
  • A sequence of 0/1 bits, one for each block in the
    entire database
  • E.g., select all blocks collected on alternate
    days

Figure: blocks D1, D2, D3, D4, D5 with BSS 1 0 1 0 1.
10
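BSS: MRW
  • Two types of BSS w.r.t. MRW
  • Window-independent
  • Window-relative
  • E.g. (window-independent): model data collected on
    Mondays within the last 4 weeks
  • BSS: (1000000), repeated every week

Figure: bits 1 0...0 1 0...0 1 ... over blocks D1, D2-D7, D8, D9-D14; the selected blocks give M(D1 ∪ D8).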
11
BSS: MRW (contd.)
  • Window-relative BSS
  • Model all data collected on alternate days from
    the start in a window of size 3
  • BSS: 101

Figure: window of blocks D1, D2, D3 with bits 1 0 1.
Here, each successive subset is disjoint from its
predecessor.
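
The difference between the two kinds of BSS under a most recent window can be sketched in Python as follows. Both function names, the 1-based block numbering, and the periodic encoding of the window-independent BSS are assumptions for illustration only.

```python
def window_independent(num_blocks, w, period_bss):
    """Bits are tied to blocks in the database; the window only clips them.

    Block i (0-based) always carries bit period_bss[i % len(period_bss)].
    Returns the 1-based numbers of the selected blocks in the current window.
    """
    start = max(0, num_blocks - w)
    return [i + 1 for i in range(start, num_blocks)
            if period_bss[i % len(period_bss)] == 1]


def window_relative(num_blocks, w, bss):
    """Bits are tied to positions inside the current window (bss has w bits)."""
    start = max(0, num_blocks - w)
    return [i + 1 for offset, i in enumerate(range(start, num_blocks))
            if bss[offset] == 1]


# Window of size 3, alternate blocks selected.
rel_t3 = window_relative(3, 3, [1, 0, 1])     # window D1 D2 D3
rel_t4 = window_relative(4, 3, [1, 0, 1])     # window D2 D3 D4
ind_t3 = window_independent(3, 3, [1, 0])     # window D1 D2 D3
ind_t4 = window_independent(4, 3, [1, 0])     # window D2 D3 D4
```

With the window-relative BSS the selection shifts with the window, so `rel_t3` and `rel_t4` share no block. With the window-independent BSS a block keeps its bit as the window slides.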
12
Model Maintenance Enumeration
Model classes: LITS, Clustering, DT
Data spans: UW + BSS, MRW + BSS
BSS includes both window-independent
and window-relative block selection sequences.
13
Model Maintenance Algorithms
Model classes: LITS, Clustering, DT
Data spans: UW + BSS, MRW + BSS
Algorithm: GEMM(A)
GEMM: GEneric Model Maintenance, an algorithm for any
class of models that has an incremental
maintenance algorithm A under tuple insertions
14
Maintenance under Insertions
  • Algorithm A
  • Input: old dataset D, old model M(D), a block of
    tuples d appended to D
  • Output: M(D ∪ d) = A(D, d, M(D))
  • Such algorithms exist for
  • Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
  • Clusters (BIRCH)
  • Decision trees (BOAT)
  • Note: we do NOT require A to handle deletions!
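
To make the contract of A concrete, here is a small Python sketch with a toy model. The model (per-item counts) and the names `build_model` and `update_model` are hypothetical stand-ins; they are far simpler than ECUT, BIRCH, or BOAT and only illustrate the shape M(D ∪ d) = A(D, d, M(D)).

```python
from collections import Counter


def build_model(dataset):
    """Hypothetical model M(D): per-item counts over all transactions in D."""
    model = Counter()
    for transaction in dataset:
        model.update(transaction)
    return model


def update_model(old_data, new_block, old_model):
    """Toy stand-in for A(D, d, M(D)): returns M(D ∪ d) without rescanning D.

    old_data is accepted only to mirror the slide's signature; this toy
    model never needs to read it. Insertions only, no deletions.
    """
    new_model = Counter(old_model)
    for transaction in new_block:
        new_model.update(transaction)
    return new_model


D = [{"a", "c"}, {"b", "c"}]
d = [{"b"}, {"a", "b"}]
incremental = update_model(D, d, build_model(D))
from_scratch = build_model(D + d)
```

The incremental result matches the model rebuilt from scratch, which is the property GEMM relies on.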

15
GEMM
  • Input
  • Data span (and window size for MRW)
  • BSS
  • A model-update algorithm A under tuple insertions
    (deletions not required)
  • Output
  • An efficient model maintenance algorithm

16
GEMM: MRW
  • Assume the BSS is a sequence of 1s and w = 3
  • We already know parts of future windows

17
GEMM: MRW (contd.)
Idea: start building models for future windows.
E.g., at T3, we maintain models on
  • <D1 D2 D3> (model required for window at T3)
  • <D2 D3> (partial model for window at T4)
  • <D3> (partial model for window at T5)
Models at T3: M<D1 D2 D3>, M<D2 D3>, M<D3>
Models at T4: M<D2 D3 D4> (for window at T4), M<D3 D4>
(for window at T5), M<D4> (for window at T6)
Figure labels: Immediate, Offline.
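
The bookkeeping described on this slide can be sketched in Python for the all-1s BSS. The class name `GemmMRW`, the toy insert-only update `insert_block`, and the use of item counts as the model are assumptions for illustration. The sketch also performs every update synchronously, whereas the talk distinguishes one immediate update from offline ones.

```python
from collections import Counter


def insert_block(model, block):
    """Toy insert-only update (stand-in for A): add the block's item counts."""
    new_model = Counter(model)
    new_model.update(block)
    return new_model


class GemmMRW:
    """Keeps at most w models: the current window plus partial future windows."""

    def __init__(self, w, update, empty_model):
        self.w = w
        self.update = update
        self.empty_model = empty_model
        self.models = []  # models[0] covers the oldest blocks

    def append_block(self, block):
        # The model of the expired window is dropped; no deletions are needed.
        if len(self.models) == self.w:
            self.models.pop(0)
        # Every surviving model absorbs the new block.
        self.models = [self.update(m, block) for m in self.models]
        # Start the partial model of the window that begins at this block.
        self.models.append(self.update(self.empty_model(), block))
        return self.models[0]  # model for the current window


gemm = GemmMRW(w=3, update=insert_block, empty_model=Counter)
blocks = {"D1": ["a"], "D2": ["b"], "D3": ["c"], "D4": ["d"]}
for name in ["D1", "D2", "D3"]:
    at_t3 = gemm.append_block(blocks[name])
at_t4 = gemm.append_block(blocks["D4"])
```

After D4 arrives, `gemm.models` holds the models on <D2 D3 D4>, <D3 D4>, and <D4>, in that order.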
18
GEMM: Arbitrary BSS
BSS: 1 0 1 0 1 ...
  • T3: model on <1.D1 0.D2 1.D3>
  • T4: D4 is appended; model on <0.D2 1.D3 0.D4>
  • T5: D5 is appended; model on <1.D3 0.D4 1.D5>
Figure: blocks D1 to D5.
Idea: we still know parts of future windows and
the corresponding BSS for each of them.
E.g., at T3, we maintain models on
  • <1.D1 0.D2 1.D3> (model required at T3)
  • <0.D2 1.D3> and <1.D3> (identical)
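
A Python sketch of the same bookkeeping when each block carries its own BSS bit, as in the example above. The function name `gemm_step`, the toy update `insert_block`, and the item-count model are assumptions for illustration; only the window-independent case from this slide is covered.

```python
from collections import Counter


def insert_block(model, block):
    """Toy insert-only update (stand-in for A): add the block's item counts."""
    new_model = Counter(model)
    new_model.update(block)
    return new_model


def gemm_step(models, block, bit, w, update, empty_model):
    """Advance the list of window models by one appended block.

    bit is the BSS bit of the new block: 1 means the block is absorbed by
    every maintained model, 0 means it is skipped by all of them.
    """
    if len(models) == w:
        models = models[1:]  # the oldest window has expired
    if bit == 1:
        models = [update(m, block) for m in models]
        models.append(update(empty_model(), block))
    else:
        models = list(models)
        models.append(empty_model())
    return models


models = []
stream = [(["a"], 1), (["b"], 0), (["c"], 1)]  # D1, D2, D3 with bits 1 0 1
for block, bit in stream:
    models = gemm_step(models, block, bit, 3, insert_block, Counter)
at_t3 = list(models)
models = gemm_step(models, ["d"], 0, 3, insert_block, Counter)  # D4, bit 0
```

At T3 the second and third models coincide because D2 is not selected, which matches the "identical" remark on the slide.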
19
GEMM Resource Requirements
  • Response time to new model
  • Updating one model with the new block
  • Other updates offline
  • Depends on the incremental algorithm
  • Space requirements
  • At most w models
  • Space required for a model is orders of magnitude
    less than that for data!

20
Maintenance under Insertions
  • Algorithm A
  • Input: old dataset D, old model M(D), a block of
    tuples d appended to D
  • Output: M(D ∪ d) = A(D, d, M(D))
  • Such algorithms exist for
  • Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
  • Clusters (BIRCH)
  • Decision trees (BOAT)
  • Note: we do NOT require A to handle deletions!

21
Frequent Itemset Models
  • Set of customer transactions
  • Frequent itemset: a set of items purchased
    together by many customers

Minimum frequency threshold: 50%. {b}, {c}, and {a, c}
are frequent itemsets.
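
The definition can be illustrated with a brute-force Python sketch. The transactions below are made up for this example (the slide's own transaction table did not survive in the transcript), and `frequent_itemsets` is a hypothetical helper, not one of the algorithms named in the talk.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset meeting the threshold.

    Brute force over all candidate itemsets; fine for a toy example only.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[frozenset(candidate)] = count
    return result


# Hypothetical customer transactions.
transactions = [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}, {"b"}]
frequent = frequent_itemsets(transactions, min_support=0.5)
```

In this made-up data, {a, c} appears in 2 of 4 transactions and therefore meets the 50% threshold, while {a, b} appears in only 1 and does not.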
22
Incremental Algorithm [FAAM97, TBAR97]
Figure: existing blocks D1, D2, D3 and new block D4.
  • Input
  • Old dataset
  • Old set of frequent itemsets
  • New block D4
  • Steps
  • Detect if new itemsets become frequent
  • Count frequencies of a small number of itemsets
  • Current algorithms scan (D1 ∪ D2 ∪ D3) completely
  • Update model

23
ECUT: New Counting Algorithm
  • Transformed data representation
  • Within each block Di
  • For each item x: a sorted list of the identifiers of
    the transactions containing x, TID-list(x)

TID-list(a): 1; TID-list(b): 2, 3; TID-list(c): 1, 2
Count(a, b) = |TID-list(a) ∩ TID-list(b)|
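
Counting by TID-list intersection can be sketched in Python as follows. The function names and the dictionary-per-block layout are assumptions for illustration; the TID-lists are the ones from the slide, treated as a single block.

```python
def intersect_sorted(xs, ys):
    """Intersect two sorted TID-lists with a linear merge."""
    i = j = 0
    common = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            common.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return common


def count_in_block(tid_lists, itemset):
    """Count the transactions of one block that contain every item."""
    items = list(itemset)
    if not items:
        return 0
    common = tid_lists.get(items[0], [])
    for item in items[1:]:
        common = intersect_sorted(common, tid_lists.get(item, []))
    return len(common)


def count_over_blocks(blocks, itemset):
    """Sum the per-block counts; only the TID-lists of the itemset are read."""
    return sum(count_in_block(tid_lists, itemset) for tid_lists in blocks)


block = {"a": [1], "b": [2, 3], "c": [1, 2]}
count_ab = count_over_blocks([block], ["a", "b"])
count_ac = count_over_blocks([block], ["a", "c"])
count_bc = count_over_blocks([block], ["b", "c"])
```

Items a and b never occur in the same transaction here, so `count_ab` is 0, while `count_ac` and `count_bc` are both 1.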
24
Experimental Comparison
25
Comparing Count Times
26
Summary of the first part
Model classes: LITS, Clustering, DT
Data spans: UW + BSS, MRW + BSS
Algorithm: GEMM
  • Maintenance algorithms under tuple insertions
  • Frequent itemsets
  • ECUT, ECUT+
  • Clusters
  • BIRCH
  • Decision Trees
  • BOAT

27
Second Part of this Talk
  • Model maintenance: maintaining models under
    systematic data evolution (ICDE '00)
  • Maintaining samples with respect to changing
    query workloads (VLDB '00)

28
Random Samples for AQUA
Figure: an aggregate query Q on relation R gives the exact answer; in the typical AQUA approach, Q on a uniform random sample of R gives an approximate answer.
  • All tuples in R are assumed to be equally
    important while drawing S(R)
  • In practice, queries exhibit locality
  • Consequence: S(R) wastes precious real estate

29
Problem
  • Given
  • Relation R
  • Workload W = {Q1, ..., Qn}
  • Goal: dynamically tune a random sample of R
    w.r.t. W
  • Model to be maintained: a simple random sample

30
ICICLES
  • R(Q): set of tuples in R required to answer Q
  • Random sample of R ∪ R(Q1) ∪ ... ∪ R(Qn)
  • Tuples required often are more likely to be in
    SW(R)
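
A rough Python sketch of the idea, under two assumptions that are not spelled out on the slide: the union is treated as a multiset union (a tuple required by k queries appears k + 1 times), and each query is modeled as a predicate over tuples. The function names are hypothetical, and the sketch rebuilds the extended relation from scratch instead of tuning the sample incrementally.

```python
import random
from collections import Counter


def extended_relation(relation, workload):
    """Multiset union of R with R(Q) for every query Q in the workload."""
    extended = list(relation)
    for query in workload:
        extended.extend(t for t in relation if query(t))
    return extended


def workload_tuned_sample(relation, workload, k, seed=None):
    """Draw k entries uniformly, without replacement, from the extended relation.

    A tuple with several copies in the extended relation may be drawn
    more than once.
    """
    rng = random.Random(seed)
    return rng.sample(extended_relation(relation, workload), k)


R = list(range(10))                           # hypothetical relation of 10 tuples
W = [lambda t: t < 3, lambda t: t < 2]        # hypothetical query predicates
copies = Counter(extended_relation(R, W))
sample = workload_tuned_sample(R, W, k=5, seed=0)
```

Tuples 0 and 1 have three copies each in the extended relation and tuple 2 has two, so they are more likely to be drawn than tuples the workload never touches.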

31
Mail Order Dataset
32
Conclusions and Future Work
Figure labels: static dataset, dynamic dataset; workload indifferent, workload sensitive.
33
  • Questions?