Title: Hierarchies in Data Mining
1 Hierarchies in Data Mining
- Raghu Ramakrishnan
- ramakris@yahoo-inc.com
- Chief Scientist for Audience and Cloud Computing
- Yahoo!
2 About this Talk
- Common theme: the multidimensional view of data
- Reveals patterns that emerge at coarser granularity
- Widely recognized, e.g., generalized association rules
- Helps handle imprecision
- Analyzing imprecise and aggregated data
- Helps handle data sparsity
- Even with massive datasets, sparsity is a challenge!
- Defines candidate space of subsets for exploratory mining
- Forecasting query results over future data
- Using predictive models as summaries
- Potentially, the space of mining experiments?
3 Driving Applications
- Business intelligence over combined text and relational data (joint with IBM)
- Burdick, Deshpande, Jayram, Vaithyanathan
- Analyzing mass spectra from ATOFMS (NSF ITR project with environmental chemists at UW and Carleton College)
- Chen, Chen, Huang, Musicant, Grossman, Schauer
- Goal-oriented anonymization of cancer data (NSF CyberTrust project)
- Chen, LeFevre, DeWitt, Shavlik, Hanrahan (Chief Epidemiologist, Wisconsin), Trentham-Dietz
- Analyzing network traffic data
- Chen, Yegneswaran, Barford
- Content optimization and ad serving
- Many people at Yahoo!
4 Background: The Multidimensional Data Model and Cube Space
5 Star Schema
- Fact table: SERVICE(pid, timeid, locid, repair)
- Dimension tables: TIME(timeid, date, week, year), PRODUCT(pid, pname, category, model), LOCATION(locid, country, region, state)
6 Dimension Hierarchies
- For each dimension, the set of values can be organized in a hierarchy
- PRODUCT: automobile -> category -> model
- TIME: year -> quarter -> month -> date, with week as a parallel rollup of date
- LOCATION: country -> region -> state
7 Multidimensional Data Model
- One fact table D(X, M)
- X = (X1, X2, ..., Xd): dimension attributes
- M = (M1, M2, ...): measure attributes
- Domain hierarchy for each dimension attribute
- Collection of domains: Hier(Xi) = (D_i^(1), ..., D_i^(k))
- The extended domain: E(Xi) = D_Xi^(1) ∪ ... ∪ D_Xi^(t)
- Value mapping function: γ_{D1->D2}(x)
- E.g., γ_{month->year}(12/2005) = 2005
- These mappings form the value hierarchy graph
- Stored as a dimension table attribute (e.g., week for a time value) or as conversion functions (e.g., month, quarter)
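To make the hierarchy and value-mapping definitions concrete, here is a minimal Python sketch; the "MM/YYYY" string encoding of month values and the helper names are illustrative assumptions, not part of the model above.

```python
# Minimal sketch of Hier(Time) and value-mapping functions gamma_{D1->D2}.
# The "MM/YYYY" month encoding is an assumption made for illustration.

TIME_HIERARCHY = ["date", "week", "month", "quarter", "year"]  # finest to coarsest

def month_to_year(month_value: str) -> str:
    # gamma_{month->year}("12/2005") == "2005"
    _, year = month_value.split("/")
    return year

def month_to_quarter(month_value: str) -> str:
    month, year = month_value.split("/")
    return f"Q{(int(month) - 1) // 3 + 1}/{year}"

assert month_to_year("12/2005") == "2005"
assert month_to_quarter("12/2005") == "Q4/2005"
```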
8 Multidimensional Data
- [Figure: a two-dimensional grid of facts p1-p4 over the AUTOMOBILE hierarchy (ALL -> Category {Truck, Sedan} -> Model {F150, Sierra, Civic, Camry}) and the LOCATION hierarchy (ALL -> Region {East, West} -> State {NY, MA, TX, CA}); the dimension attributes are organized in levels 1-3]
9 Cube Space
- Cube space: C = E(X1) × E(X2) × ... × E(Xd)
- Region: a hyper-rectangle in cube space
- c = (v1, v2, ..., vd), with vi ∈ E(Xi)
- E.g., c1 = (NY, Camry), c2 = (West, Sedan)
- Region granularity:
- gran(c) = (d1, d2, ..., dd), where di = Domain(c.vi)
- E.g., gran(c1) = (State, Model), gran(c2) = (Region, Category)
- Region coverage:
- coverage(c) = all facts in c
- Region set: all regions with the same granularity
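A small sketch of region coverage under these definitions; the facts, the ancestor table, and the rollup() helper are illustrative stand-ins for the dimension tables and conversion functions described earlier.

```python
# Sketch: coverage(c) = all facts whose rolled-up dimension values match
# region c at its granularity. All data below is illustrative.

FACTS = [
    {"Location": "NY", "Automobile": "Camry"},
    {"Location": "CA", "Automobile": "Civic"},
]

ANCESTORS = {  # leaf value -> its ancestor at each domain level
    "NY": {"State": "NY", "Region": "East", "ALL": "ALL"},
    "CA": {"State": "CA", "Region": "West", "ALL": "ALL"},
    "Camry": {"Model": "Camry", "Category": "Sedan", "ALL": "ALL"},
    "Civic": {"Model": "Civic", "Category": "Sedan", "ALL": "ALL"},
}

def rollup(value, level):
    return ANCESTORS[value][level]

def coverage(region, granularity, facts):
    return [f for f in facts
            if all(rollup(f[dim], level) == region[dim]
                   for dim, level in granularity.items())]

c2 = {"Location": "West", "Automobile": "Sedan"}      # c2 = (West, Sedan)
gran_c2 = {"Location": "Region", "Automobile": "Category"}
print(coverage(c2, gran_c2, FACTS))  # -> the CA/Civic fact
```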
10 OLAP Over Imprecise Data
- With Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan
- In VLDB 05 and 06; joint work with IBM Almaden
11 Imprecise Data
- [Figure: the same grid as slide 8, plus an imprecise fact p5 that spans the region (MA, Truck) rather than a single (State, Model) cell]
12 Querying Imprecise Facts
- Query: Auto = F150, Loc = MA, SUM(Repair) = ???
- How do we treat p5, which is known only to be (MA, Truck)?
- [Figure: p5 spans the (MA, F150) and (MA, Sierra) cells; p1-p4 are precise]
13 Allocation (1)
- [Figure: the imprecise fact p5 sits in the region (MA, Truck), overlapping the cells (MA, F150) and (MA, Sierra)]
14 Allocation (2)
- (Huh? Why 0.5 / 0.5? Hold on to that thought)
- [Figure: p5 is split into two weighted copies, one in cell (MA, F150) and one in (MA, Sierra), each with weight 0.5]
15 Allocation (3)
- Query: Auto = F150, Loc = MA, SUM(Repair) = 150
- Query the extended data model!
- [Figure: the weighted copies of p5 participate in the query alongside the precise facts p1-p4]
16 Allocation Policies
- The procedure for assigning allocation weights is referred to as an allocation policy
- Each allocation policy uses different information to assign allocation weights
- Key contributions:
- An appropriate characterization of the large space of allocation policies (VLDB 05)
- Efficient algorithms for allocation policies that take into account the correlations in the data (VLDB 06)
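As one concrete illustration, a count-based policy (in the spirit of the "Count" policy on a later slide) can be sketched as follows; the facts and cells are made up, and the uniform fallback mirrors the 0.5/0.5 split shown earlier.

```python
# Sketch of a count-based allocation policy: an imprecise fact is split
# across the cells it overlaps, with weight proportional to the number of
# precise facts already in each cell. All data here is illustrative.
from collections import Counter

precise = [("MA", "F150"), ("MA", "F150"), ("NY", "Sierra")]   # (state, model)
candidate_cells = [("MA", "F150"), ("MA", "Sierra")]  # completions of (MA, Truck)

counts = Counter(precise)
total = sum(counts[c] for c in candidate_cells)
if total > 0:
    weights = {c: counts[c] / total for c in candidate_cells}
else:
    # No precise evidence either way: fall back to uniform weights
    # (the 0.5 / 0.5 allocation seen on the earlier slide).
    weights = {c: 1 / len(candidate_cells) for c in candidate_cells}

print(weights)  # {('MA', 'F150'): 1.0, ('MA', 'Sierra'): 0.0}
```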
17 Motivating Example
- Query: COUNT
- We propose desiderata that enable an appropriate definition of query semantics for imprecise data
- [Figure: a COUNT query over the (MA/NY × F150/Sierra) cells, with the imprecise fact p5]
18 Desideratum I: Consistency
- Consistency specifies the relationship between answers to related queries on a fixed data set
- [Figure: queries at different granularities over the grid containing the precise facts p1-p3 and the imprecise fact p5]
19 Desideratum II: Faithfulness
- Faithfulness specifies the relationship between answers to a fixed query on related data sets
- [Figure: three data sets with progressively more imprecise placements of facts over the (MA/NY × F150/Sierra) cells]
20 Imprecise facts lead to many possible worlds [Kripke, 1963]
- [Figure: four possible worlds w1-w4, each assigning the imprecise facts among p1-p5 to specific cells]
21 Query Semantics
- Given all possible worlds together with their probabilities, queries are easily answered using expected values
- But the number of possible worlds is exponential!
- Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
- The size increase is linear in the number of (completions of) imprecise facts
- Queries operate over this extended version
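A minimal sketch of answering an aggregate query over the extended data model, assuming allocation weights have already been assigned; the rows and repair costs are made up (chosen so the query reproduces the 150 from the earlier slide).

```python
# Extended data model: one weighted row per (fact, possible completion).
# SUM over a cell becomes a weight-scaled sum, i.e., an expected value.
extended = [
    # (fact_id, state, model, allocation weight, repair cost)
    ("p3", "MA", "F150",   1.0, 100.0),  # precise fact
    ("p5", "MA", "F150",   0.5, 100.0),  # imprecise fact, completion 1
    ("p5", "MA", "Sierra", 0.5, 100.0),  # imprecise fact, completion 2
]

def weighted_sum(rows, state, model):
    return sum(w * measure for _, s, m, w, measure in rows
               if s == state and m == model)

print(weighted_sum(extended, "MA", "F150"))  # 150.0
```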
22 Storing Allocations using the Extended Data Model
- [Figure: the extended data model as a table of (fact, cell, weight) allocation rows for p1-p5]
23 Allocation Policy: Count
- [Figure: p5's weight is split between cells c1 = (MA, F150) and c2 = (MA, Sierra) in proportion to the counts of precise facts in each cell]
24 Allocation Policy: Measure
- [Figure: as on the previous slide, but p5's weight is split in proportion to an aggregate measure of the precise facts in each cell]
25 Allocation Policy Template
26 Allocation Graph
27 Example: Processing of Allocation Graph
- Step 1: Compute Qsum(r) for each imprecise fact's region r
- Step 2: Compute the allocation weight p(c, r) for each cell c
- [Figure: the allocation graph links the imprecise fact (MA, Truck) to precise cells such as Cell(MA, F150) and Cell(MA, Sierra), with example weights 2/3 and 1/3; other precise cells shown include Cell(MA, Civic), Cell(NY, F150), and Cell(NY, Sierra)]
28 Processing Allocation Graph
- What if the precise cells and imprecise facts do not fit into memory?
- We would need to scan the precise cells twice for each imprecise fact
- Instead, identify groups of imprecise facts that can be processed in the same scan
- The algorithm then processes these groups
- [Figure: imprecise facts p6-p14 with regions such as (MA, Sedan), (MA, Truck), (CA, ALL), (East, Truck), (West, Sedan), (ALL, Civic), (ALL, Sierra), (West, Civic), and (West, Sierra), grouped into components c1-c5 over precise cells like Cell(MA, Civic), Cell(MA, Sierra), Cell(NY, F150), Cell(CA, Civic), and Cell(CA, Sierra)]
29 Summary
- Consistency and faithfulness: desiderata for designing query semantics for imprecise data
- Allocation is the key to our framework
- Aggregation operators with appropriate guarantees of consistency and faithfulness
- Efficient algorithms for allocation policies
- Lots of recent work on uncertainty and probabilistic data processing
- Sensor data, errors, Bayesian inference
- VLDB 05 (semantics), 06 (implementation)
30 Dealing with Data Sparsity
- Deepak Agarwal, Andrei Broder, Deepayan Chakrabarti, Dejan Diklic, Vanja Josifovski, Mayssam Sayyadian
- Estimating Rates of Rare Events at Multiple Resolutions, KDD 2007
31 Motivating Application: Content Match Problem
- Problem: which ads are good on what pages?
- Pages: no control. Ads: can control
- First simplification: a (page, ad) pair is completely characterized by a set of high-dimensional features
- Naïve approach: experiment with all possible pairs several times and estimate CTR
- Of course, this doesn't work
- Most (ad, page) pairs have very few impressions, if any, and even fewer clicks
- Severe data sparsity
32 Estimation in the Tail
- Use an existing, well-understood hierarchy
- Categorize ads and webpages to leaves of the hierarchy
- CTR estimates of siblings are correlated
- The hierarchy allows us to aggregate data
- Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions
- Similar coarsening, different motivation: Mining Generalized Association Rules, Ramakrishnan Srikant and Rakesh Agrawal, VLDB 1995
33 System Overview
- Retrospective data: (URL, ad, isClicked)
- Crawl URLs (a sample of URLs)
- Classify pages and ads
- Rare-event estimation using the hierarchy
- Impute impressions, fix sampling bias
34 Sampling of Webpages
- Naïve strategy: sample at random from the set of URLs
- Sampling errors in impression volume AND click volume
- Instead, we propose:
- Crawling all URLs with at least one click, and
- a sample of the remaining URLs
- Variability is then only in impression volume
35 Imputation of Impression Volume
- Region node = (page node, ad node)
- Build a region hierarchy: the cross-product of the page hierarchy and the ad hierarchy
- [Figure: levels Z(0), ..., Z(i) of the region hierarchy, with leaf regions formed from page leaves and ad leaves]
36 Exploiting Taxonomy Structure
- Consider the bottom two levels of the taxonomy
- Each cell corresponds to a (page, ad)-class pair
- Key point: children under a parent node are alike, and expected to have similar CTRs (i.e., they form a cohesive block)
37 Imputation of Impression Volume
- For any level Z(i): impressions in cell (i, j) = n_ij + m_ij + x_ij
- Row constraint: each row sums to Σ_j n_ij + K·Σ_j m_ij
- The grand total sums to the total impressions (known)
- Column constraint: each column sums to the impressions on ads of that ad class
38 Imputation of Impression Volume
- Block constraint: the cells in each block sum to the block total
39 Imputing x_ij
- Iterative proportional fitting [Darroch, 1972]
- Initialize x_ij = n_ij + m_ij
- Top-down:
- Scale all x_ij in every block in Z(i+1) to sum to its parent in Z(i)
- Scale all x_ij in Z(i+1) to sum to the row totals
- Scale all x_ij in Z(i+1) to sum to the column totals
- Repeat for every level Z(i)
- Bottom-up: similar
- [Figure: a block in level Z(i+1) rolling up to its parent cell in Z(i), over page classes × ad classes]
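A compact sketch of the core IPF scaling step, assuming numpy: it alternately rescales one level's x_ij matrix to match target row and column totals, omitting the block/level structure described above.

```python
# Iterative proportional fitting on a single level: alternately scale rows
# and columns of a nonnegative matrix to match target marginals.
import numpy as np

def ipf(x, row_targets, col_targets, iters=100, eps=1e-12):
    x = x.astype(float).copy()
    for _ in range(iters):
        x *= (row_targets / np.maximum(x.sum(axis=1), eps))[:, None]
        x *= (col_targets / np.maximum(x.sum(axis=0), eps))[None, :]
    return x

x0 = np.ones((2, 2))  # illustrative start (the slides use x_ij = n_ij + m_ij)
x = ipf(x0, row_targets=np.array([3.0, 1.0]), col_targets=np.array([2.0, 2.0]))
print(x)  # rows sum to (3, 1); columns sum to (2, 2)
```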
40 Imputation Summary
- Given:
- n_ij (impressions in the clicked pool)
- m_ij (impressions in the sampled non-clicked pool)
- impressions on ads of each ad class in the ad hierarchy
- We get:
- Estimated impression volume Ñ_ij = n_ij + m_ij + x_ij in each region (i, j) of every level Z(·)
41 Dealing with Data Sparsity
- Deepak Agarwal, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, Joe Zachariah
- Real-time Content Optimization through Active User Feedback, NIPS 2008
42 Yahoo! Home Page Featured Box
- The top-center part of the Y! front page
- It has four tabs: Featured, Entertainment, Sports, and Video
43 Novel Aspects
- Classical: arms are assumed fixed over time
- We gain and lose arms over time
- Some theoretical work by Whittle in the 80s (operations research)
- Classical: the serving rule is updated after each pull
- We compute the optimal design in batch mode
- Classical: CTR is generally assumed stationary
- We have highly dynamic, non-stationary CTRs
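Purely to illustrate the non-stationarity point (this is not from the talk), one simple way to track a drifting CTR is an exponentially weighted batch update, where a forgetting factor makes recent batches dominate the estimate.

```python
# Illustrative sketch: discounted CTR tracking for a non-stationary arm.
# gamma is a forgetting factor; the batch feedback below is made up.

def update_ctr(ctr, clicks, views, gamma=0.9):
    """Blend the previous estimate with the latest batch's observed rate."""
    if views == 0:
        return ctr
    return gamma * ctr + (1 - gamma) * clicks / views

ctr = 0.05
for clicks, views in [(40, 1000), (90, 1000), (20, 1000)]:
    ctr = update_ctr(ctr, clicks, views)
print(round(ctr, 4))  # recent batches weigh more than old ones
```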
44 Bellwether Analysis: Global Aggregates from Local Regions
- With Beechung Chen, Jude Shavlik, and Pradeep Tamma
- In VLDB 06
45 Motivating Example
- A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie)
- By looking at features and profits of previous (similar) movies, we predict expected total profit (1-year US sales) for the new movie
- Wait a year and write a query! If you can't wait, stay awake...
- The most predictive features may be based on sales data gathered by releasing the new movie in many regions (different locations over different time periods)
- Example region-based features: 1st-week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
- Gathering this data has a cost (e.g., marketing expenses, waiting time)
- Problem statement: find the most predictive region features that can be obtained within a given cost budget
46 Key Ideas
- Large datasets are rarely labeled with the targets that we wish to learn to predict
- But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st-week sales in Peoria) and even targets (e.g., profit) for mining
- We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
- The central problem is to find data subsets ("bellwether regions") that lead to predictive features which can be gathered at low cost for a new case
47 Motivating Example
- A company wants to predict the first year's worldwide profit for a new item, using its historical database
- Database schema:
- The combination of the underlined attributes forms a key
48 A Straightforward Approach
- Build a regression model to predict item profit
- There is much room for accuracy improvement!
- By joining and aggregating tables in the historical database, we can create a training set with item-table features and a target
- An example regression model: Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense
49 Using Regional Features
- Example region: (1st week, HK)
- Regional features:
- Regional profit: the 1st-week profit in HK
- Regional ad expense: the 1st-week ad expense in HK
- A possibly more accurate model:
- Profit(1yr, All) = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit(1wk, HK) + β5·AdExpense(1wk, HK)
- Problem: which region should we use?
- The smallest region that improves the accuracy the most
- We give each candidate region a cost
- The most cost-effective region is the "bellwether" region
50 Basic Bellwether Problem
51 Basic Bellwether Problem
- Historical database DB; training item set I; candidate region set R
- E.g., regions of the form (1-n week, Location), using the location domain hierarchy
- Target-generation query τ_i(DB): returns the target value of item i ∈ I
- E.g., sum(Profit) for item i over region (1-52, All) of ProfitTable
- Feature-generation query φ_{i,r}(DB), for i ∈ I_r and r ∈ R
- I_r: the set of items in region r
- E.g., Category_i, RdExpense_i, Profit_i over (1-n, Loc), AdExpense_i over (1-n, Loc)
- Cost query κ_r(DB), r ∈ R: the cost of collecting data from r
- Predictive model h_r(x), r ∈ R, trained on {(φ_{i,r}(DB), τ_i(DB)) : i ∈ I_r}
- E.g., a linear regression model
52 Basic Bellwether Problem
- Features φ_{i,r}(DB): aggregates over the data records in region r, e.g., (1-2, USA)
- Target τ_i(DB): total profit in (1-52, All)
- For each region r, build a predictive model h_r(x), then choose the bellwether region such that:
- Coverage(r), the fraction of all items in the region, is at least the minimum coverage support
- Cost(r, DB) is at most the cost threshold
- Error(h_r) is minimized
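The selection logic amounts to a constrained search over candidate regions; a sketch follows, with cost(), coverage(), and train_and_error() standing in for the queries defined above.

```python
# Sketch of basic bellwether search: among regions meeting the coverage
# and cost constraints, return the one whose model has minimum error.
# The three callbacks are placeholders, not the paper's implementation.

def find_bellwether(regions, budget, min_coverage,
                    cost, coverage, train_and_error):
    best_region, best_error = None, float("inf")
    for r in regions:
        if cost(r) > budget or coverage(r) < min_coverage:
            continue  # infeasible region (cf. iceberg-cube pruning)
        err = train_and_error(r)  # fit h_r and measure its error
        if err < best_error:
            best_region, best_error = r, err
    return best_region, best_error
```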
53 Experiment on a Mail Order Dataset
- Error-vs-budget plot (RMSE: root mean square error)
- Bel Err: the error of the bellwether region found using a given budget
- Avg Err: the average error of all the cube regions with costs under a given budget
- Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget
- The bellwether region found: (1-8 month, MD)
54 Experiment on a Mail Order Dataset
- Uniqueness plot
- Y-axis: fraction of regions that are as good as the bellwether region
- I.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region
- We have 99% confidence that (1-8 month, MD) is a quite unusual bellwether region
55 Basic Bellwether Computation
- OLAP-style bellwether analysis
- Candidate regions: regions in a data cube
- Queries: OLAP-style aggregate queries, e.g., Sum(Profit) over a region
- Efficient computation:
- Use iceberg-cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01)
- Infeasible regions: regions with cost > B or coverage < C
- Share computation by generating the features and target values for all the feasible regions together
- Exploit distributive and algebraic aggregate functions
- Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
56 Subset Bellwether Problem
57 Subset-Based Bellwether Prediction
- Motivation: different subsets of items may have different bellwether regions
- E.g., the bellwether region for laptops may differ from the bellwether region for clothes
- Two approaches: bellwether trees and bellwether cubes
- [Figure: a bellwether tree and a bellwether cube over Category and R&D Expenses]
58 Bellwether Tree
- How to build a bellwether tree: similar to regression tree construction
- Starting from the root node, recursively split the current leaf node using the best split criterion
- A split criterion partitions a set of items into disjoint subsets
- Pick the split that reduces the error the most
- Stop splitting when the number of items in the current leaf node falls under a threshold value
- Prune the tree to avoid overfitting
- [Figure: an example tree with nodes numbered 1-9]
59 Bellwether Tree
- How to split a node
- Split criterion:
- Numeric split: A_k ≤ θ
- Categorical split on A_k
- (A_k is an item-table feature)
- Pick the best split criterion: the split that reduces the error the most
- Find the bellwether region for S; h is the bellwether model for S (total parent error)
- Find the bellwether region for each S_p; h_p is the bellwether model for S_p (total child error)
- (S is the set of items at the parent node, and S_p is the set of items at the p-th child node)
60 Problem of Naïve Tree Construction
- A naïve bellwether tree construction algorithm will scan the dataset n × m times
- n is the number of nodes
- m is the number of candidate split criteria
- For each node, it tries all candidate split criteria to find the best one, scanning the dataset m times
- Idea: extend the RainForest framework [Gehrke et al., 98]
- [Figure: an example tree with nodes numbered 1-9]
61 Efficient Tree Construction
- Idea: extend the RainForest framework [Gehrke et al., 98]
- Build the tree level by level
- Scan the entire dataset once per level and keep small sufficient statistics in memory (size O(n·s·c))
- Sufficient statistics for a split criterion: S_p and Error(h_p; S_p), for p = 1 to the number of children
- Split all the nodes at that level after the scan, based on the sufficient statistics
- Further improved by a hybrid algorithm
- [Figure: four scans build the tree level by level: scan 1 for node 1, scan 2 for nodes 2-3, scan 3 for nodes 4-7, scan 4 for nodes 8-9]
62 Bellwether Cube
- Cells are defined by subsets of items, e.g., Category × R&D Expenses; users can roll up and drill down
- The number in a cell is the error of the bellwether region for that subset of items
63 Problem of Naïve Cube Construction
- A naïve bellwether cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
- A basic bellwether search involves building a model for each candidate region
- I.e., for each cell, build a model for each candidate region
64 Efficient Cube Construction
- Idea: transform model construction into the computation of distributive or algebraic aggregate functions
- Let S1, ..., Sn partition S: S = S1 ∪ ... ∪ Sn and Si ∩ Sj = ∅
- Distributive function: f(S) = F(f(S1), ..., f(Sn))
- E.g., Count(S) = Sum(Count(S1), ..., Count(Sn))
- Algebraic function: f(S) = F(G(S1), ..., G(Sn)), where G(Si) returns a fixed-length vector of values
- E.g., Avg(S) = F(G(S1), ..., G(Sn)) with G(Si) = (Sum(Si), Count(Si)) and F((a1, b1), ..., (an, bn)) = Sum(ai) / Sum(bi)
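A tiny sketch of the algebraic case: Avg computed bottom-up from per-partition summaries, matching the G and F definitions above; the partitions are illustrative.

```python
# Avg(S) as an algebraic function: per-partition summaries G(Si), then a
# fixed combiner F, so higher-level cells never rescan the raw data.

def G(values):                      # summary of one partition
    return (sum(values), len(values))

def F(summaries):                   # combine summaries into Avg(S)
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]   # S1, S2, S3
assert F([G(p) for p in partitions]) == 3.5         # same as Avg over all of S
```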
65 Efficient Cube Construction
- Build models for the finest-grained cells
- For higher-level cells, use data-cube computation techniques to compute the aggregate functions
- For each finest-grained cell: build models to find the bellwether region
- For each higher-level cell: compute aggregate functions to find the bellwether region
66 Efficient Cube Construction
- Classification models: use the prediction cube execution framework [Chen et al., 05]
- Regression models (the weighted linear regression model builds on Chen-Dong-Han-Wah-Wang, VLDB 02):
- Having the sum of squared errors (SSE) for each candidate region is sufficient to find the bellwether region
- SSE(S) is an algebraic function, where S is a set of items:
- SSE(S) = q(g(S1), ..., g(Sn)), where S1, ..., Sn partition S
- g(Sk) = (Yk' Wk Yk, Xk' Wk Xk, Xk' Wk Yk)
- q((Ak, Bk, Ck), k = 1..n) = Σk Ak − (Σk Ck)' (Σk Bk)^(-1) (Σk Ck)
- where Yk is the vector of target values for set Sk of items, Xk is the matrix of features for Sk, and Wk is the weight matrix for Sk
67 Experimental Results
68 Experimental Results: Summary
- We have shown the existence of bellwether regions on a real mail-order dataset
- We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets
- We show our computation techniques improve efficiency by orders of magnitude
- We show our computation techniques scale linearly in the size of the dataset
69 Characteristics of Bellwether Trees and Cubes
- Results:
- Bellwether trees and cubes have better accuracy than basic bellwether search
- Increasing noise increases error
- Increasing concept complexity increases error
- Dataset generation: use a random tree to generate different bellwether regions for different subsets of items
- Parameters: noise; concept complexity (# of tree nodes)
- [Figure: accuracy curves at 15 nodes and noise level 0.5]
70 Efficiency Comparison
- [Figure: naïve computation methods vs. our computation techniques]
71 Scalability
72 Exploratory Mining: Prediction Cubes
- With Beechung Chen, Lei Chen, and Yi Lin
- In VLDB 05
73 The Idea
- Build OLAP data cubes in which cell values represent decision/prediction behavior
- In effect, build a tree for each cell/region in the cube; observe that this is not the same as a collection of trees used in an ensemble method!
- The idea is simple, but it leads to promising data mining tools
- Ultimate objective: exploratory analysis of the entire space of data mining choices
- Choice of algorithms, data conditioning parameters, ...
74 Example (1/7): Regular OLAP
- Z: dimensions; Y: measure
- Goal: look for patterns of unusually high numbers of applications
75 Example (2/7): Regular OLAP
- Goal: look for patterns of unusually high numbers of applications
- Drill down to finer regions
76 Example (3/7): Decision Analysis
- Goal: analyze a bank's loan decision process w.r.t. two dimensions: Location and Time
- Fact table D, with Z: dimensions, X: predictors, Y: class
77 Example (3/7): Decision Analysis
- Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
- Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, "Are the predictions of this classifier closely correlated with race?"
- Are there branches and times with decision making reminiscent of 1950s Alabama?
- This requires comparing classifiers trained using different subsets of data
78 Example (4/7): Prediction Cubes
- Build a model using data from USA in Dec., 1985
- Evaluate that model
- The measure in a cell can be:
- The accuracy of the model
- The predictiveness of Race, measured based on that model
- The similarity between that model and a given model
79 Example (5/7): Model Similarity
- Given: data table D, target model h0(X), and a test set Δ without labels
- Example finding: the loan decision process in USA during Dec 04 was similar to a given discriminatory decision model
80 Example (6/7): Predictiveness
- Given: data table D, attributes V, and a test set Δ without labels
- Build models h(X) and h(X \ V); the difference in their predictions on the test set measures the predictiveness of V
- At level (Country, Month): race was an important predictor of the loan approval decision in USA during Dec 04
- [Figure: data table D with Yes/No class labels feeding the two models, which are then applied to the test set Δ]
81 Example (7/7): Prediction Cube
- Cell value: predictiveness of Race
82 Efficient Computation
- Reduce prediction cube computation to data cube computation
- Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied
83 Bottom-Up Data Cube Computation
- Cell values: numbers of loan applications
84 Functions on Sets
- Bottom-up computable functions: functions that can be computed using only summary information
- Distributive function: f(X) = F(f(X1), ..., f(Xn)), where X = X1 ∪ ... ∪ Xn and Xi ∩ Xj = ∅
- E.g., Count(X) = Sum(Count(X1), ..., Count(Xn))
- Algebraic function: f(X) = F(G(X1), ..., G(Xn)), where G(Xi) returns a fixed-length vector of values
- E.g., Avg(X) = F(G(X1), ..., G(Xn)) with G(Xi) = (Sum(Xi), Count(Xi)) and F((s1, c1), ..., (sn, cn)) = Sum(si) / Sum(ci)
85 Scoring Function
- Represent a model as a function of sets
- Conceptually, a machine-learning model h(X; σ_Z(D)) is a scoring function Score(y, x; σ_Z(D)) that gives each class y a score on test example x
- h(x; σ_Z(D)) = argmax_y Score(y, x; σ_Z(D))
- Score(y, x; σ_Z(D)) ≈ p(y | x, σ_Z(D))
- σ_Z(D): the set of training examples (a cube subset of D)
86 Bottom-up Score Computation
- Key observations:
- Observation 1: Score(y, x; σ_Z(D)) is a function of the cube subset σ_Z(D); if it is distributive or algebraic, bottom-up data cube computation techniques can be directly applied
- Observation 2: having the scores for all the test examples and all the cells is sufficient to compute a prediction cube
- Scores -> predictions -> cell values
- The details depend on what each cell means (i.e., the type of prediction cube), but are straightforward
87 Machine-Learning Models
- Naïve Bayes: scoring function is algebraic
- Kernel-density-based classifier: scoring function is distributive
- Decision tree, random forest: neither distributive nor algebraic
- PBE: probability-based ensemble (new)
- Makes any machine-learning model distributive
- An approximation
88 Probability-Based Ensemble
- [Figure: the PBE version of a decision tree on (WA, 85) is an ensemble of decision trees built on the lowest-level cells, approximating a decision tree trained directly on (WA, 85)]
89 Probability-Based Ensemble
- Scoring function:
- h(y | x; b_i(D)): model h's estimate of p(y | x, b_i(D))
- g(b_i | x): a model that predicts the probability that x belongs to base subset b_i(D)
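A minimal sketch of this combination rule; the base models and membership probabilities below are made-up stand-ins for h(y | x; b_i(D)) and g(b_i | x).

```python
# PBE-style scoring: mix base-model class probabilities, weighted by the
# probability that x falls in each base subset. All values are illustrative.

def pbe_score(x, y, base_models, subset_membership):
    """base_models[i](x, y) ~ p(y | x, b_i(D));
    subset_membership(x) -> [g(b_i | x), ...], summing to 1."""
    return sum(g_i * h(x, y)
               for g_i, h in zip(subset_membership(x), base_models))

base = [lambda x, y: 0.9 if y == 1 else 0.1,   # model on base subset b_1
        lambda x, y: 0.2 if y == 1 else 0.8]   # model on base subset b_2
score = pbe_score(x=None, y=1, base_models=base,
                  subset_membership=lambda x: [0.7, 0.3])
print(score)  # 0.7 * 0.9 + 0.3 * 0.2 = 0.69
```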
90 Outline
- Motivating example
- Definition of prediction cubes
- Efficient prediction cube materialization
- Experimental results
- Conclusion
91 Experiments
- Quality of PBE on 8 UCI datasets
- The quality of the PBE version of a model is slightly worse (0 to 6%) than the quality of the model trained directly on the whole training data
- Efficiency of the bottom-up score computation technique
- Case study on demographic data
92 Efficiency of Bottom-up Score Computation
- Machine-learning models:
- J48: J48 decision tree
- RF: random forest
- NB: naïve Bayes
- KDC: kernel-density-based classifier
- Bottom-up method vs. exhaustive method
93 Synthetic Dataset
- Dimensions: Z1, Z2, and Z3
- [Figure: the decision rule varies with Z1 and Z2 vs. Z3]
94 Efficiency Comparison
- [Figure: execution time (sec) vs. # of records, using the exhaustive method and using bottom-up score computation]
95 Conclusions
96 Related Work: Building Models on OLAP Results
- Multi-dimensional regression [Chen, VLDB 02]
- Goal: detect changes of trends
- Build linear regression models for cube cells
- Step-by-step regression in stream cubes [Liu, PAKDD 03]
- Loglinear-based quasi cubes [Barbara, J. IIS 01]
- Use loglinear models to approximately compress dense regions of a data cube
- NetCube [Margaritis, VLDB 01]
- Build a Bayes net on the entire dataset to approximately answer count queries
97 Related Work (Contd.)
- Cubegrades [Imielinski, J. DMKD 02]
- Extend cubes with ideas from association rules
- How does the measure change when we roll up or drill down?
- Constrained gradients [Dong, VLDB 01]
- Find pairs of similar cell characteristics associated with big changes in measure
- User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
- Help users find the most informative unvisited regions in a data cube using the max-entropy principle
- Multi-structural DBs [Fagin et al., PODS 05, VLDB 05]
- Experiment databases: Towards an Improved Experimental Methodology in Machine Learning [Blockeel and Vanschoren, PKDD 2007]
98 Take-Home Messages
- A promising exploratory data analysis paradigm
- Can use models to identify interesting subsets
- Concentrate only on subsets in cube space
- These are meaningful, tractable subsets
- Precompute results and provide users with an interactive tool
- A simple way to plug something into cube-style analysis:
- Try to describe/approximate it by a distributive or algebraic function
99 Big Picture
- Why stop with decision behavior? This can apply to other kinds of analyses too
- Why stop at browsing? We can mine prediction cubes in their own right
- Exploratory analysis of the mining space
- Dimension attributes can be parameters related to the algorithm, data conditioning, etc.
- Tractable evaluation is a challenge:
- Large numbers of dimensions, real-valued dimension attributes, difficulties in compositional evaluation
- Active learning for experiment design, extending compositional methods
100 Conclusion
- Hierarchies are widely used, and a promising tool to help us deal with:
- Data sparsity
- Data imprecision and uncertainty
- Exploratory analysis
- Experiment planning and management
- The area is as yet under-appreciated
- There is lots of work on taxonomies and how to use them, but many novel ways of using them have not received enough attention