Title: Hierarchies in Data Mining
1 Hierarchies in Data Mining
- Raghu Ramakrishnan
- ramakris@yahoo-inc.com
- Chief Scientist for Audience and Cloud Computing
- Yahoo!
2 About this Talk
- Common theme: the multidimensional view of data
- Reveals patterns that emerge at coarser granularity
- Widely recognized, e.g., generalized association rules
- Helps handle imprecision
- Analyzing imprecise and aggregated data
- Helps handle data sparsity
- Even with massive datasets, sparsity is a challenge!
- Defines candidate space of subsets for exploratory mining
- Forecasting query results over future data
- Using predictive models as summaries
- Potentially, the space of mining experiments?
3 Driving Applications
- Business intelligence over combined text and relational data (joint with IBM)
- Burdick, Deshpande, Jayram, Vaithyanathan
- Analyzing mass spectra from ATOFMS (NSF ITR project with environmental chemists at UW and Carleton College)
- Chen, Chen, Huang, Musicant, Grossman, Schauer
- Goal-oriented anonymization of cancer data (NSF CyberTrust project)
- Chen, LeFevre, DeWitt, Shavlik, Hanrahan (Chief Epidemiologist, Wisconsin), Trentham-Dietz
- Analyzing network traffic data
- Chen, Yegneswaran, Barford
- Content optimization and ad serving
- Many people at Yahoo!
4 Background: The Multidimensional Data Model and Cube Space
5 Star Schema
- Fact table: SERVICE(pid, timeid, locid, repair)
- Dimension tables: TIME(timeid, date, week, year), PRODUCT(pid, pname, category, model), LOCATION(locid, country, region, state)
6 Dimension Hierarchies
- For each dimension, the set of values can be organized in a hierarchy
- PRODUCT: automobile -> category -> model
- TIME: year -> quarter -> month -> date, with week as a parallel rollup of date
- LOCATION: country -> region -> state
7 Multidimensional Data Model
- One fact table D(X, M)
- X = (X1, X2, ..., Xd): dimension attributes
- M = (M1, M2, ...): measure attributes
- Domain hierarchy for each dimension attribute
- Collection of domains: Hier(Xi) = (D_i^(1), ..., D_i^(k))
- The extended domain: E(Xi) = D_Xi^(1) ∪ ... ∪ D_Xi^(t)
- Value mapping function: γ_{D1->D2}(x)
- E.g., γ_{month->year}(12/2005) = 2005
- These mappings form the value hierarchy graph
- Stored as a dimension table attribute (e.g., week for a time value) or as conversion functions (e.g., month, quarter)
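To make the hierarchy and value-mapping definitions concrete, here is a minimal Python sketch; the "MM/YYYY" string encoding of month values and the helper names are illustrative assumptions, not part of the model above.

```python
# Minimal sketch of Hier(Time) and value-mapping functions gamma_{D1->D2}.
# The "MM/YYYY" month encoding is an assumption made for illustration.

TIME_HIERARCHY = ["date", "week", "month", "quarter", "year"]  # finest to coarsest

def month_to_year(month_value: str) -> str:
    # gamma_{month->year}("12/2005") == "2005"
    _, year = month_value.split("/")
    return year

def month_to_quarter(month_value: str) -> str:
    month, year = month_value.split("/")
    return f"Q{(int(month) - 1) // 3 + 1}/{year}"

assert month_to_year("12/2005") == "2005"
assert month_to_quarter("12/2005") == "Q4/2005"
```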
8 Multidimensional Data
- [Figure: a two-dimensional grid of facts p1-p4 over the AUTOMOBILE hierarchy (ALL -> Category {Truck, Sedan} -> Model {F150, Sierra, Civic, Camry}) and the LOCATION hierarchy (ALL -> Region {East, West} -> State {NY, MA, TX, CA}); the dimension attributes are organized in levels 1-3]
9 Cube Space
- Cube space: C = E(X1) × E(X2) × ... × E(Xd)
- Region: a hyper-rectangle in cube space
- c = (v1, v2, ..., vd), with vi ∈ E(Xi)
- E.g., c1 = (NY, Camry), c2 = (West, Sedan)
- Region granularity:
- gran(c) = (d1, d2, ..., dd), where di = Domain(c.vi)
- E.g., gran(c1) = (State, Model), gran(c2) = (Region, Category)
- Region coverage:
- coverage(c) = all facts in c
- Region set: all regions with the same granularity
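A small sketch of region coverage under these definitions; the facts, the ancestor table, and the rollup() helper are illustrative stand-ins for the dimension tables and conversion functions described earlier.

```python
# Sketch: coverage(c) = all facts whose rolled-up dimension values match
# region c at its granularity. All data below is illustrative.

FACTS = [
    {"Location": "NY", "Automobile": "Camry"},
    {"Location": "CA", "Automobile": "Civic"},
]

ANCESTORS = {  # leaf value -> its ancestor at each domain level
    "NY": {"State": "NY", "Region": "East", "ALL": "ALL"},
    "CA": {"State": "CA", "Region": "West", "ALL": "ALL"},
    "Camry": {"Model": "Camry", "Category": "Sedan", "ALL": "ALL"},
    "Civic": {"Model": "Civic", "Category": "Sedan", "ALL": "ALL"},
}

def rollup(value, level):
    return ANCESTORS[value][level]

def coverage(region, granularity, facts):
    return [f for f in facts
            if all(rollup(f[dim], level) == region[dim]
                   for dim, level in granularity.items())]

c2 = {"Location": "West", "Automobile": "Sedan"}      # c2 = (West, Sedan)
gran_c2 = {"Location": "Region", "Automobile": "Category"}
print(coverage(c2, gran_c2, FACTS))  # -> the CA/Civic fact
```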
10 OLAP Over Imprecise Data
- With Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan
- In VLDB 05 and 06; joint work with IBM Almaden
11 Imprecise Data
- [Figure: the same grid as slide 8, plus an imprecise fact p5 that spans the region (MA, Truck) rather than a single (State, Model) cell]
12 Querying Imprecise Facts
- Query: Auto = F150, Loc = MA, SUM(Repair) = ???
- How do we treat p5, which is known only to be (MA, Truck)?
- [Figure: p5 spans the (MA, F150) and (MA, Sierra) cells; p1-p4 are precise]
13 Allocation (1)
- [Figure: the imprecise fact p5 sits in the region (MA, Truck), overlapping the cells (MA, F150) and (MA, Sierra)]
14 Allocation (2)
- (Huh? Why 0.5 / 0.5? Hold on to that thought)
- [Figure: p5 is split into two weighted copies, one in cell (MA, F150) and one in (MA, Sierra), each with weight 0.5]
15 Allocation (3)
- Query: Auto = F150, Loc = MA, SUM(Repair) = 150
- Query the extended data model!
- [Figure: the weighted copies of p5 participate in the query alongside the precise facts p1-p4]
16 Allocation Policies
- The procedure for assigning allocation weights is referred to as an allocation policy
- Each allocation policy uses different information to assign allocation weights
- Key contributions:
- An appropriate characterization of the large space of allocation policies (VLDB 05)
- Efficient algorithms for allocation policies that take into account the correlations in the data (VLDB 06)
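As one concrete illustration, a count-based policy (in the spirit of the "Count" policy on a later slide) can be sketched as follows; the facts and cells are made up, and the uniform fallback mirrors the 0.5/0.5 split shown earlier.

```python
# Sketch of a count-based allocation policy: an imprecise fact is split
# across the cells it overlaps, with weight proportional to the number of
# precise facts already in each cell. All data here is illustrative.
from collections import Counter

precise = [("MA", "F150"), ("MA", "F150"), ("NY", "Sierra")]   # (state, model)
candidate_cells = [("MA", "F150"), ("MA", "Sierra")]  # completions of (MA, Truck)

counts = Counter(precise)
total = sum(counts[c] for c in candidate_cells)
if total > 0:
    weights = {c: counts[c] / total for c in candidate_cells}
else:
    # No precise evidence either way: fall back to uniform weights
    # (the 0.5 / 0.5 allocation seen on the earlier slide).
    weights = {c: 1 / len(candidate_cells) for c in candidate_cells}

print(weights)  # {('MA', 'F150'): 1.0, ('MA', 'Sierra'): 0.0}
```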
17 Motivating Example
- Query: COUNT
- We propose desiderata that enable an appropriate definition of query semantics for imprecise data
- [Figure: a COUNT query over the (MA/NY × F150/Sierra) cells, with the imprecise fact p5]
18 Desideratum I: Consistency
- Consistency specifies the relationship between answers to related queries on a fixed data set
- [Figure: queries at different granularities over the grid containing the precise facts p1-p3 and the imprecise fact p5]
19 Desideratum II: Faithfulness
- Faithfulness specifies the relationship between answers to a fixed query on related data sets
- [Figure: three data sets with progressively more imprecise placements of facts over the (MA/NY × F150/Sierra) cells]
20 Imprecise facts lead to many possible worlds [Kripke, 1963]
- [Figure: four possible worlds w1-w4, each assigning the imprecise facts among p1-p5 to specific cells]
21 Query Semantics
- Given all possible worlds together with their probabilities, queries are easily answered using expected values
- But the number of possible worlds is exponential!
- Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
- The size increase is linear in the number of (completions of) imprecise facts
- Queries operate over this extended version
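A minimal sketch of answering an aggregate query over the extended data model, assuming allocation weights have already been assigned; the rows and repair costs are made up (chosen so the query reproduces the 150 from the earlier slide).

```python
# Extended data model: one weighted row per (fact, possible completion).
# SUM over a cell becomes a weight-scaled sum, i.e., an expected value.
extended = [
    # (fact_id, state, model, allocation weight, repair cost)
    ("p3", "MA", "F150",   1.0, 100.0),  # precise fact
    ("p5", "MA", "F150",   0.5, 100.0),  # imprecise fact, completion 1
    ("p5", "MA", "Sierra", 0.5, 100.0),  # imprecise fact, completion 2
]

def weighted_sum(rows, state, model):
    return sum(w * measure for _, s, m, w, measure in rows
               if s == state and m == model)

print(weighted_sum(extended, "MA", "F150"))  # 150.0
```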
22 Storing Allocations using the Extended Data Model
- [Figure: the extended data model as a table of (fact, cell, weight) allocation rows for p1-p5]
23 Allocation Policy: Count
- [Figure: p5's weight is split between cells c1 = (MA, F150) and c2 = (MA, Sierra) in proportion to the counts of precise facts in each cell]
24 Allocation Policy: Measure
- [Figure: as on the previous slide, but p5's weight is split in proportion to an aggregate measure of the precise facts in each cell]
25 Allocation Policy Template
26 Allocation Graph
27 Example: Processing of Allocation Graph
- Step 1: Compute Qsum(r) for each imprecise fact's region r
- Step 2: Compute the allocation weight p(c, r) for each cell c
- [Figure: the allocation graph links the imprecise fact (MA, Truck) to precise cells such as Cell(MA, F150) and Cell(MA, Sierra), with example weights 2/3 and 1/3; other precise cells shown include Cell(MA, Civic), Cell(NY, F150), and Cell(NY, Sierra)]
28 Processing Allocation Graph
- What if the precise cells and imprecise facts do not fit into memory?
- We would need to scan the precise cells twice for each imprecise fact
- Instead, identify groups of imprecise facts that can be processed in the same scan
- The algorithm then processes these groups
- [Figure: imprecise facts p6-p14 with regions such as (MA, Sedan), (MA, Truck), (CA, ALL), (East, Truck), (West, Sedan), (ALL, Civic), (ALL, Sierra), (West, Civic), and (West, Sierra), grouped into components c1-c5 over precise cells like Cell(MA, Civic), Cell(MA, Sierra), Cell(NY, F150), Cell(CA, Civic), and Cell(CA, Sierra)]
29 Summary
- Consistency and faithfulness: desiderata for designing query semantics for imprecise data
- Allocation is the key to our framework
- Aggregation operators with appropriate guarantees of consistency and faithfulness
- Efficient algorithms for allocation policies
- Lots of recent work on uncertainty and probabilistic data processing
- Sensor data, errors, Bayesian inference
- VLDB 05 (semantics), 06 (implementation)
30 Dealing with Data Sparsity
- Deepak Agarwal, Andrei Broder, Deepayan Chakrabarti, Dejan Diklic, Vanja Josifovski, Mayssam Sayyadian
- Estimating Rates of Rare Events at Multiple Resolutions, KDD 2007
31 Motivating Application: Content Match Problem
- Problem: which ads are good on what pages?
- Pages: no control. Ads: can control
- First simplification: a (page, ad) pair is completely characterized by a set of high-dimensional features
- Naïve approach: experiment with all possible pairs several times and estimate CTR
- Of course, this doesn't work
- Most (ad, page) pairs have very few impressions, if any, and even fewer clicks
- Severe data sparsity
32 Estimation in the Tail
- Use an existing, well-understood hierarchy
- Categorize ads and webpages to leaves of the hierarchy
- CTR estimates of siblings are correlated
- The hierarchy allows us to aggregate data
- Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions
- Similar coarsening, different motivation: Mining Generalized Association Rules, Ramakrishnan Srikant and Rakesh Agrawal, VLDB 1995
33 System Overview
- Retrospective data: (URL, ad, isClicked)
- Crawl URLs (a sample of URLs)
- Classify pages and ads
- Rare-event estimation using the hierarchy
- Impute impressions, fix sampling bias
34 Sampling of Webpages
- Naïve strategy: sample at random from the set of URLs
- Sampling errors in impression volume AND click volume
- Instead, we propose:
- Crawling all URLs with at least one click, and
- a sample of the remaining URLs
- Variability is then only in impression volume
35 Imputation of Impression Volume
- Region node = (page node, ad node)
- Build a region hierarchy: the cross-product of the page hierarchy and the ad hierarchy
- [Figure: levels Z(0), ..., Z(i) of the region hierarchy, with leaf regions formed from page leaves and ad leaves]
36 Exploiting Taxonomy Structure
- Consider the bottom two levels of the taxonomy
- Each cell corresponds to a (page, ad)-class pair
- Key point: children under a parent node are alike, and expected to have similar CTRs (i.e., they form a cohesive block)
37 Imputation of Impression Volume
- For any level Z(i): impressions in cell (i, j) = n_ij + m_ij + x_ij
- Row constraint: each row sums to Σ_j n_ij + K·Σ_j m_ij
- The grand total sums to the total impressions (known)
- Column constraint: each column sums to the impressions on ads of that ad class
38 Imputation of Impression Volume
- Block constraint: the cells in each block sum to the block total
39 Imputing x_ij
- Iterative proportional fitting [Darroch, 1972]
- Initialize x_ij = n_ij + m_ij
- Top-down:
- Scale all x_ij in every block in Z(i+1) to sum to its parent in Z(i)
- Scale all x_ij in Z(i+1) to sum to the row totals
- Scale all x_ij in Z(i+1) to sum to the column totals
- Repeat for every level Z(i)
- Bottom-up: similar
- [Figure: a block in level Z(i+1) rolling up to its parent cell in Z(i), over page classes × ad classes]
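A compact sketch of the core IPF scaling step, assuming numpy: it alternately rescales one level's x_ij matrix to match target row and column totals, omitting the block/level structure described above.

```python
# Iterative proportional fitting on a single level: alternately scale rows
# and columns of a nonnegative matrix to match target marginals.
import numpy as np

def ipf(x, row_targets, col_targets, iters=100, eps=1e-12):
    x = x.astype(float).copy()
    for _ in range(iters):
        x *= (row_targets / np.maximum(x.sum(axis=1), eps))[:, None]
        x *= (col_targets / np.maximum(x.sum(axis=0), eps))[None, :]
    return x

x0 = np.ones((2, 2))  # illustrative start (the slides use x_ij = n_ij + m_ij)
x = ipf(x0, row_targets=np.array([3.0, 1.0]), col_targets=np.array([2.0, 2.0]))
print(x)  # rows sum to (3, 1); columns sum to (2, 2)
```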
40 Imputation Summary
- Given:
- n_ij (impressions in the clicked pool)
- m_ij (impressions in the sampled non-clicked pool)
- impressions on ads of each ad class in the ad hierarchy
- We get:
- Estimated impression volume Ñ_ij = n_ij + m_ij + x_ij in each region (i, j) of every level Z(·)
41 Dealing with Data Sparsity
- Deepak Agarwal, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, Joe Zachariah
- Real-time Content Optimization through Active User Feedback, NIPS 2008
42 Yahoo! Home Page Featured Box
- The top-center part of the Y! front page
- It has four tabs: Featured, Entertainment, Sports, and Video
43 Novel Aspects
- Classical: arms are assumed fixed over time
- We gain and lose arms over time
- Some theoretical work by Whittle in the 80s (operations research)
- Classical: the serving rule is updated after each pull
- We compute the optimal design in batch mode
- Classical: CTR is generally assumed stationary
- We have highly dynamic, non-stationary CTRs
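Purely to illustrate the non-stationarity point (this is not from the talk), one simple way to track a drifting CTR is an exponentially weighted batch update, where a forgetting factor makes recent batches dominate the estimate.

```python
# Illustrative sketch: discounted CTR tracking for a non-stationary arm.
# gamma is a forgetting factor; the batch feedback below is made up.

def update_ctr(ctr, clicks, views, gamma=0.9):
    """Blend the previous estimate with the latest batch's observed rate."""
    if views == 0:
        return ctr
    return gamma * ctr + (1 - gamma) * clicks / views

ctr = 0.05
for clicks, views in [(40, 1000), (90, 1000), (20, 1000)]:
    ctr = update_ctr(ctr, clicks, views)
print(round(ctr, 4))  # recent batches weigh more than old ones
```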
44 Bellwether Analysis: Global Aggregates from Local Regions
- With Beechung Chen, Jude Shavlik, and Pradeep Tamma
- In VLDB 06
45 Motivating Example
- A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie)
- By looking at features and profits of previous (similar) movies, we predict expected total profit (1-year US sales) for the new movie
- Wait a year and write a query! If you can't wait, stay awake...
- The most predictive features may be based on sales data gathered by releasing the new movie in many regions (different locations over different time periods)
- Example region-based features: 1st-week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
- Gathering this data has a cost (e.g., marketing expenses, waiting time)
- Problem statement: find the most predictive region features that can be obtained within a given cost budget
46 Key Ideas
- Large datasets are rarely labeled with the targets that we wish to learn to predict
- But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st-week sales in Peoria) and even targets (e.g., profit) for mining
- We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
- The central problem is to find data subsets ("bellwether regions") that lead to predictive features which can be gathered at low cost for a new case
47 Motivating Example
- A company wants to predict the first year's worldwide profit for a new item, using its historical database
- Database schema:
- The combination of the underlined attributes forms a key
48 A Straightforward Approach
- Build a regression model to predict item profit
- There is much room for accuracy improvement!
- By joining and aggregating tables in the historical database, we can create a training set with item-table features and a target
- An example regression model: Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense
49 Using Regional Features
- Example region: (1st week, HK)
- Regional features:
- Regional profit: the 1st-week profit in HK
- Regional ad expense: the 1st-week ad expense in HK
- A possibly more accurate model:
- Profit(1yr, All) = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit(1wk, HK) + β5·AdExpense(1wk, HK)
- Problem: which region should we use?
- The smallest region that improves the accuracy the most
- We give each candidate region a cost
- The most cost-effective region is the "bellwether" region
50 Basic Bellwether Problem
51 Basic Bellwether Problem
- Historical database DB; training item set I; candidate region set R
- E.g., regions of the form (1-n week, Location), using the location domain hierarchy
- Target-generation query τ_i(DB): returns the target value of item i ∈ I
- E.g., sum(Profit) for item i over region (1-52, All) of ProfitTable
- Feature-generation query φ_{i,r}(DB), for i ∈ I_r and r ∈ R
- I_r: the set of items in region r
- E.g., Category_i, RdExpense_i, Profit_i over (1-n, Loc), AdExpense_i over (1-n, Loc)
- Cost query κ_r(DB), r ∈ R: the cost of collecting data from r
- Predictive model h_r(x), r ∈ R, trained on {(φ_{i,r}(DB), τ_i(DB)) : i ∈ I_r}
- E.g., a linear regression model
52 Basic Bellwether Problem
- Features φ_{i,r}(DB): aggregates over the data records in region r, e.g., (1-2, USA)
- Target τ_i(DB): total profit in (1-52, All)
- For each region r, build a predictive model h_r(x), then choose the bellwether region such that:
- Coverage(r), the fraction of all items in the region, is at least the minimum coverage support
- Cost(r, DB) is at most the cost threshold
- Error(h_r) is minimized
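The selection logic amounts to a constrained search over candidate regions; a sketch follows, with cost(), coverage(), and train_and_error() standing in for the queries defined above.

```python
# Sketch of basic bellwether search: among regions meeting the coverage
# and cost constraints, return the one whose model has minimum error.
# The three callbacks are placeholders, not the paper's implementation.

def find_bellwether(regions, budget, min_coverage,
                    cost, coverage, train_and_error):
    best_region, best_error = None, float("inf")
    for r in regions:
        if cost(r) > budget or coverage(r) < min_coverage:
            continue  # infeasible region (cf. iceberg-cube pruning)
        err = train_and_error(r)  # fit h_r and measure its error
        if err < best_error:
            best_region, best_error = r, err
    return best_region, best_error
```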
53 Experiment on a Mail Order Dataset
- Error-vs-budget plot (RMSE: root mean square error)
- Bel Err: the error of the bellwether region found using a given budget
- Avg Err: the average error of all the cube regions with costs under a given budget
- Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget
- The bellwether region found: (1-8 month, MD)
54 Experiment on a Mail Order Dataset
- Uniqueness plot
- Y-axis: fraction of regions that are as good as the bellwether region
- I.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region
- We have 99% confidence that (1-8 month, MD) is a quite unusual bellwether region
55 Basic Bellwether Computation
- OLAP-style bellwether analysis
- Candidate regions: regions in a data cube
- Queries: OLAP-style aggregate queries, e.g., Sum(Profit) over a region
- Efficient computation:
- Use iceberg-cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01)
- Infeasible regions: regions with cost > B or coverage < C
- Share computation by generating the features and target values for all the feasible regions together
- Exploit distributive and algebraic aggregate functions
- Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
56 Subset Bellwether Problem
57 Subset-Based Bellwether Prediction
- Motivation: different subsets of items may have different bellwether regions
- E.g., the bellwether region for laptops may differ from the bellwether region for clothes
- Two approaches: bellwether trees and bellwether cubes
- [Figure: a bellwether tree and a bellwether cube over Category and R&D Expenses]
58 Bellwether Tree
- How to build a bellwether tree: similar to regression tree construction
- Starting from the root node, recursively split the current leaf node using the best split criterion
- A split criterion partitions a set of items into disjoint subsets
- Pick the split that reduces the error the most
- Stop splitting when the number of items in the current leaf node falls under a threshold value
- Prune the tree to avoid overfitting
- [Figure: an example tree with nodes numbered 1-9]
59 Bellwether Tree
- How to split a node
- Split criterion:
- Numeric split: A_k ≤ θ
- Categorical split on A_k
- (A_k is an item-table feature)
- Pick the best split criterion: the split that reduces the error the most
- Find the bellwether region for S; h is the bellwether model for S (total parent error)
- Find the bellwether region for each S_p; h_p is the bellwether model for S_p (total child error)
- (S is the set of items at the parent node, and S_p is the set of items at the p-th child node)
60 Problem of Naïve Tree Construction
- A naïve bellwether tree construction algorithm will scan the dataset n × m times
- n is the number of nodes
- m is the number of candidate split criteria
- For each node, it tries all candidate split criteria to find the best one, scanning the dataset m times
- Idea: extend the RainForest framework [Gehrke et al., 98]
- [Figure: an example tree with nodes numbered 1-9]
61 Efficient Tree Construction
- Idea: extend the RainForest framework [Gehrke et al., 98]
- Build the tree level by level
- Scan the entire dataset once per level and keep small sufficient statistics in memory (size O(n·s·c))
- Sufficient statistics for a split criterion: S_p and Error(h_p; S_p), for p = 1 to the number of children
- Split all the nodes at that level after the scan, based on the sufficient statistics
- Further improved by a hybrid algorithm
- [Figure: four scans build the tree level by level: scan 1 for node 1, scan 2 for nodes 2-3, scan 3 for nodes 4-7, scan 4 for nodes 8-9]
62 Bellwether Cube
- Cells are defined by subsets of items, e.g., Category × R&D Expenses; users can roll up and drill down
- The number in a cell is the error of the bellwether region for that subset of items
63 Problem of Naïve Cube Construction
- A naïve bellwether cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
- A basic bellwether search involves building a model for each candidate region
- I.e., for each cell, build a model for each candidate region
64 Efficient Cube Construction
- Idea: transform model construction into the computation of distributive or algebraic aggregate functions
- Let S1, ..., Sn partition S: S = S1 ∪ ... ∪ Sn and Si ∩ Sj = ∅
- Distributive function: f(S) = F(f(S1), ..., f(Sn))
- E.g., Count(S) = Sum(Count(S1), ..., Count(Sn))
- Algebraic function: f(S) = F(G(S1), ..., G(Sn)), where G(Si) returns a fixed-length vector of values
- E.g., Avg(S) = F(G(S1), ..., G(Sn)) with G(Si) = (Sum(Si), Count(Si)) and F((a1, b1), ..., (an, bn)) = Sum(ai) / Sum(bi)
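A tiny sketch of the algebraic case: Avg computed bottom-up from per-partition summaries, matching the G and F definitions above; the partitions are illustrative.

```python
# Avg(S) as an algebraic function: per-partition summaries G(Si), then a
# fixed combiner F, so higher-level cells never rescan the raw data.

def G(values):                      # summary of one partition
    return (sum(values), len(values))

def F(summaries):                   # combine summaries into Avg(S)
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]   # S1, S2, S3
assert F([G(p) for p in partitions]) == 3.5         # same as Avg over all of S
```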
65 Efficient Cube Construction
- Build models for the finest-grained cells
- For higher-level cells, use data-cube computation techniques to compute the aggregate functions
- For each finest-grained cell: build models to find the bellwether region
- For each higher-level cell: compute aggregate functions to find the bellwether region
66 Efficient Cube Construction
- Classification models: use the prediction cube execution framework [Chen et al., 05]
- Regression models (the weighted linear regression model builds on Chen-Dong-Han-Wah-Wang, VLDB 02):
- Having the sum of squared errors (SSE) for each candidate region is sufficient to find the bellwether region
- SSE(S) is an algebraic function, where S is a set of items:
- SSE(S) = q(g(S1), ..., g(Sn)), where S1, ..., Sn partition S
- g(Sk) = (Yk' Wk Yk, Xk' Wk Xk, Xk' Wk Yk)
- q((Ak, Bk, Ck), k = 1..n) = Σk Ak − (Σk Ck)' (Σk Bk)^(-1) (Σk Ck)
- where Yk is the vector of target values for set Sk of items, Xk is the matrix of features for Sk, and Wk is the weight matrix for Sk
67 Experimental Results
68 Experimental Results: Summary
- We have shown the existence of bellwether regions on a real mail-order dataset
- We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets
- We show our computation techniques improve efficiency by orders of magnitude
- We show our computation techniques scale linearly in the size of the dataset
69 Characteristics of Bellwether Trees and Cubes
- Results:
- Bellwether trees and cubes have better accuracy than basic bellwether search
- Increasing noise increases error
- Increasing concept complexity increases error
- Dataset generation: use a random tree to generate different bellwether regions for different subsets of items
- Parameters: noise; concept complexity (# of tree nodes)
- [Figure: accuracy curves at 15 nodes and noise level 0.5]
70 Efficiency Comparison
- [Figure: naïve computation methods vs. our computation techniques]
71 Scalability
72 Exploratory Mining: Prediction Cubes
- With Beechung Chen, Lei Chen, and Yi Lin
- In VLDB 05
73 The Idea
- Build OLAP data cubes in which cell values represent decision/prediction behavior
- In effect, build a tree for each cell/region in the cube; observe that this is not the same as a collection of trees used in an ensemble method!
- The idea is simple, but it leads to promising data mining tools
- Ultimate objective: exploratory analysis of the entire space of data mining choices
- Choice of algorithms, data conditioning parameters, ...
74 Example (1/7): Regular OLAP
- Z: dimensions; Y: measure
- Goal: look for patterns of unusually high numbers of applications
75 Example (2/7): Regular OLAP
- Goal: look for patterns of unusually high numbers of applications
- Drill down to finer regions
76 Example (3/7): Decision Analysis
- Goal: analyze a bank's loan decision process w.r.t. two dimensions: Location and Time
- Fact table D, with Z: dimensions, X: predictors, Y: class
77 Example (3/7): Decision Analysis
- Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
- Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, "Are the predictions of this classifier closely correlated with race?"
- Are there branches and times with decision making reminiscent of 1950s Alabama?
- This requires comparing classifiers trained using different subsets of data
78 Example (4/7): Prediction Cubes
- Build a model using data from USA in Dec., 1985
- Evaluate that model
- The measure in a cell can be:
- The accuracy of the model
- The predictiveness of Race, measured based on that model
- The similarity between that model and a given model
79 Example (5/7): Model Similarity
- Given: data table D, target model h0(X), and a test set Δ without labels
- Example finding: the loan decision process in USA during Dec 04 was similar to a given discriminatory decision model
80 Example (6/7): Predictiveness
- Given: data table D, attributes V, and a test set Δ without labels
- Build models h(X) and h(X \ V); the difference in their predictions on the test set measures the predictiveness of V
- At level (Country, Month): race was an important predictor of the loan approval decision in USA during Dec 04
- [Figure: data table D with Yes/No class labels feeding the two models, which are then applied to the test set Δ]
81 Example (7/7): Prediction Cube
- Cell value: predictiveness of Race
82 Efficient Computation
- Reduce prediction cube computation to data cube computation
- Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied
83 Bottom-Up Data Cube Computation
- Cell values: numbers of loan applications
84 Functions on Sets
- Bottom-up computable functions: functions that can be computed using only summary information
- Distributive function: f(X) = F(f(X1), ..., f(Xn)), where X = X1 ∪ ... ∪ Xn and Xi ∩ Xj = ∅
- E.g., Count(X) = Sum(Count(X1), ..., Count(Xn))
- Algebraic function: f(X) = F(G(X1), ..., G(Xn)), where G(Xi) returns a fixed-length vector of values
- E.g., Avg(X) = F(G(X1), ..., G(Xn)) with G(Xi) = (Sum(Xi), Count(Xi)) and F((s1, c1), ..., (sn, cn)) = Sum(si) / Sum(ci)
85 Scoring Function
- Represent a model as a function of sets
- Conceptually, a machine-learning model h(X; σ_Z(D)) is a scoring function Score(y, x; σ_Z(D)) that gives each class y a score on test example x
- h(x; σ_Z(D)) = argmax_y Score(y, x; σ_Z(D))
- Score(y, x; σ_Z(D)) ≈ p(y | x, σ_Z(D))
- σ_Z(D): the set of training examples (a cube subset of D)
86 Bottom-up Score Computation
- Key observations:
- Observation 1: Score(y, x; σ_Z(D)) is a function of the cube subset σ_Z(D); if it is distributive or algebraic, bottom-up data cube computation techniques can be directly applied
- Observation 2: having the scores for all the test examples and all the cells is sufficient to compute a prediction cube
- Scores -> predictions -> cell values
- The details depend on what each cell means (i.e., the type of prediction cube), but are straightforward
87 Machine-Learning Models
- Naïve Bayes: scoring function is algebraic
- Kernel-density-based classifier: scoring function is distributive
- Decision tree, random forest: neither distributive nor algebraic
- PBE: probability-based ensemble (new)
- Makes any machine-learning model distributive
- An approximation
88 Probability-Based Ensemble
- [Figure: the PBE version of a decision tree on (WA, 85) is an ensemble of decision trees built on the lowest-level cells, approximating a decision tree trained directly on (WA, 85)]
89 Probability-Based Ensemble
- Scoring function:
- h(y | x; b_i(D)): model h's estimate of p(y | x, b_i(D))
- g(b_i | x): a model that predicts the probability that x belongs to base subset b_i(D)
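A minimal sketch of this combination rule; the base models and membership probabilities below are made-up stand-ins for h(y | x; b_i(D)) and g(b_i | x).

```python
# PBE-style scoring: mix base-model class probabilities, weighted by the
# probability that x falls in each base subset. All values are illustrative.

def pbe_score(x, y, base_models, subset_membership):
    """base_models[i](x, y) ~ p(y | x, b_i(D));
    subset_membership(x) -> [g(b_i | x), ...], summing to 1."""
    return sum(g_i * h(x, y)
               for g_i, h in zip(subset_membership(x), base_models))

base = [lambda x, y: 0.9 if y == 1 else 0.1,   # model on base subset b_1
        lambda x, y: 0.2 if y == 1 else 0.8]   # model on base subset b_2
score = pbe_score(x=None, y=1, base_models=base,
                  subset_membership=lambda x: [0.7, 0.3])
print(score)  # 0.7 * 0.9 + 0.3 * 0.2 = 0.69
```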
90 Outline
- Motivating example
- Definition of prediction cubes
- Efficient prediction cube materialization
- Experimental results
- Conclusion
91 Experiments
- Quality of PBE on 8 UCI datasets
- The quality of the PBE version of a model is slightly worse (0 to 6%) than the quality of the model trained directly on the whole training data
- Efficiency of the bottom-up score computation technique
- Case study on demographic data
92 Efficiency of Bottom-up Score Computation
- Machine-learning models:
- J48: J48 decision tree
- RF: random forest
- NB: naïve Bayes
- KDC: kernel-density-based classifier
- Bottom-up method vs. exhaustive method
93 Synthetic Dataset
- Dimensions: Z1, Z2, and Z3
- [Figure: the decision rule varies with Z1 and Z2 vs. Z3]
94 Efficiency Comparison
- [Figure: execution time (sec) vs. # of records, using the exhaustive method and using bottom-up score computation]
95 Conclusions
96 Related Work: Building Models on OLAP Results
- Multi-dimensional regression [Chen, VLDB 02]
- Goal: detect changes of trends
- Build linear regression models for cube cells
- Step-by-step regression in stream cubes [Liu, PAKDD 03]
- Loglinear-based quasi cubes [Barbara, J. IIS 01]
- Use loglinear models to approximately compress dense regions of a data cube
- NetCube [Margaritis, VLDB 01]
- Build a Bayes net on the entire dataset to approximately answer count queries
97 Related Work (Contd.)
- Cubegrades [Imielinski, J. DMKD 02]
- Extend cubes with ideas from association rules
- How does the measure change when we roll up or drill down?
- Constrained gradients [Dong, VLDB 01]
- Find pairs of similar cell characteristics associated with big changes in measure
- User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
- Help users find the most informative unvisited regions in a data cube using the max-entropy principle
- Multi-structural DBs [Fagin et al., PODS 05, VLDB 05]
- Experiment databases: Towards an Improved Experimental Methodology in Machine Learning [Blockeel and Vanschoren, PKDD 2007]
98 Take-Home Messages
- A promising exploratory data analysis paradigm
- Can use models to identify interesting subsets
- Concentrate only on subsets in cube space
- These are meaningful, tractable subsets
- Precompute results and provide users with an interactive tool
- A simple way to plug something into cube-style analysis:
- Try to describe/approximate it by a distributive or algebraic function
99 Big Picture
- Why stop with decision behavior? This can apply to other kinds of analyses too
- Why stop at browsing? We can mine prediction cubes in their own right
- Exploratory analysis of the mining space
- Dimension attributes can be parameters related to the algorithm, data conditioning, etc.
- Tractable evaluation is a challenge:
- Large numbers of dimensions, real-valued dimension attributes, difficulties in compositional evaluation
- Active learning for experiment design, extending compositional methods
100 Conclusion
- Hierarchies are widely used, and a promising tool to help us deal with:
- Data sparsity
- Data imprecision and uncertainty
- Exploratory analysis
- Experiment planning and management
- The area is as yet under-appreciated
- There is lots of work on taxonomies and how to use them, but many novel ways of using them have not received enough attention