Title: Prediction Cubes
1Prediction Cubes
- Bee-Chung Chen, Lei Chen,
- Yi Lin and Raghu Ramakrishnan
- University of Wisconsin - Madison
2Subset Mining
- We want to find interesting subsets of the dataset
- Interestingness: Defined by the "model" built on a subset
- Cube space: A combination of dimension attribute values defines a candidate subset (just like regular OLAP)
- We want the measures to represent decision/prediction behavior
- Summarize a subset using the "model" built on it
- Big change from regular OLAP!
3The Idea
- Build OLAP data cubes in which cell values represent decision/prediction behavior
- In effect, build a tree for each cell/region in the cube; observe that this is not the same as a collection of trees used in an ensemble method!
- The idea is simple, but it leads to promising data mining tools
- Ultimate objective: Exploratory analysis of the entire space of "data mining choices"
- Choice of algorithms, data conditioning parameters
4Example (1/7) Regular OLAP
Z: Dimensions
Y: Measure
Goal: Look for patterns of unusually high numbers of applications
5Example (2/7) Regular OLAP
Goal: Look for patterns of unusually high numbers of applications
Z: Dimensions
Y: Measure
Finer regions
6Example (3/7) Decision Analysis
Goal: Analyze a bank's loan decision process w.r.t. two dimensions: Location and Time
Fact table D
Z: Dimensions
X: Predictors
Y: Class
7Example (3/7) Decision Analysis
- Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
- Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, "Are the predictions of this classifier closely correlated with race?"
- Are there branches and times with decision making reminiscent of 1950s Alabama?
- Requires comparison of classifiers trained using different subsets of data.
8Example (4/7) Prediction Cubes
- Build a model using data from USA in Dec., 1985
- Evaluate that model
- Measure in a cell:
- Accuracy of the model
- Predictiveness of Race, measured based on that model
- Similarity between that model and a given model
9Example (5/7) Model-Similarity
Given:
- Data table D
- Target model $h_0(X)$
- Test set $\Delta$ w/o labels
The loan decision process in USA during Dec 04 was similar to a discriminatory decision model
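A minimal sketch of how a model-similarity cell value could be computed, under stated assumptions: the toy learner `train_stump`, the helper names, the agreement-fraction definition of similarity, and all data values are illustrative and not taken from the slides. Each cell's model is trained on that cell's subset of the fact table and compared with the target model h0 on an unlabeled test set.

```python
from collections import Counter, defaultdict

def train_stump(rows):
    """Toy learner (illustrative only): pick the single predictor whose
    majority-vote table fits the training rows best."""
    default = Counter(y for _, y in rows).most_common(1)[0][0]
    best = None
    for feature in sorted(rows[0][0]):
        votes = defaultdict(Counter)
        for x, y in rows:
            votes[x[feature]][y] += 1
        table = {v: c.most_common(1)[0][0] for v, c in votes.items()}
        correct = sum(table[x[feature]] == y for x, y in rows)
        if best is None or correct > best[0]:
            best = (correct, feature, table)
    _, feature, table = best
    return lambda x: table.get(x[feature], default)

def cell_subset(D, country, month):
    """Select the training rows (x, y) that fall in one cube cell."""
    return [(x, y) for (c, m), x, y in D if c == country and m == month]

def similarity(h, h0, test_set):
    """Assumed measure: fraction of unlabeled test examples where h and h0 agree."""
    return sum(h(x) == h0(x) for x in test_set) / len(test_set)

# Fact table D: (dimension values Z, predictors X, class Y). Made-up rows.
D = [
    (("USA", "Dec"), {"race": "A", "income": "high"}, "yes"),
    (("USA", "Dec"), {"race": "A", "income": "low"}, "yes"),
    (("USA", "Dec"), {"race": "B", "income": "high"}, "no"),
    (("USA", "Dec"), {"race": "B", "income": "low"}, "no"),
    (("CAN", "Dec"), {"race": "A", "income": "high"}, "yes"),
    (("CAN", "Dec"), {"race": "B", "income": "high"}, "yes"),
    (("CAN", "Dec"), {"race": "A", "income": "low"}, "no"),
    (("CAN", "Dec"), {"race": "B", "income": "low"}, "no"),
]

# Target model h0: a discriminatory rule that looks only at race.
h0 = lambda x: "yes" if x["race"] == "A" else "no"
test_set = [{"race": r, "income": i} for r in "AB" for i in ("high", "low")]

cube = {
    cell: similarity(train_stump(cell_subset(D, *cell)), h0, test_set)
    for cell in [("USA", "Dec"), ("CAN", "Dec")]
}
```

In this toy data the USA cell's model agrees with h0 on every test example, while the CAN cell's model agrees on only half of them.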
10Example (6/7) Predictiveness
Given:
- Data table D
- Attributes V
- Test set $\Delta$ w/o labels
[Figure: Data table D; build models $h(X)$ and $h(X - V)$; Yes/No predictions of both models on test set $\Delta$; predictiveness of V. Level: [Country, Month]]
Race was an important predictor of loan approval decision in USA during Dec 04
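A hedged sketch of one way to read the predictiveness measure: train h(X) and h(X − V) on the same cell subset and report how often their predictions differ on the test set. The disagreement-fraction definition, the lookup-table learner `train_lookup`, and the data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def train_lookup(rows, attrs):
    """Toy learner (illustrative only): majority label for each
    combination of values of the attributes in attrs."""
    default = Counter(y for _, y in rows).most_common(1)[0][0]
    votes = defaultdict(Counter)
    for x, y in rows:
        votes[tuple(x[a] for a in attrs)][y] += 1
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lambda x: table.get(tuple(x[a] for a in attrs), default)

def predictiveness(rows, all_attrs, V, test_set):
    """Assumed measure: fraction of test examples on which h(X) and
    h(X - V) disagree."""
    full = train_lookup(rows, all_attrs)
    reduced = train_lookup(rows, [a for a in all_attrs if a not in V])
    return sum(full(x) != reduced(x) for x in test_set) / len(test_set)

ATTRS = ["race", "income"]

# Cell where the decision follows race (made-up rows).
cell_a = (
    [({"race": "A", "income": "high"}, "yes")] * 2
    + [({"race": "A", "income": "low"}, "yes")] * 2
    + [({"race": "B", "income": "high"}, "no"),
       ({"race": "B", "income": "low"}, "no")]
)
# Cell where the decision follows income (made-up rows).
cell_b = [
    ({"race": "A", "income": "high"}, "yes"),
    ({"race": "B", "income": "high"}, "yes"),
    ({"race": "A", "income": "low"}, "no"),
    ({"race": "B", "income": "low"}, "no"),
]
test_set = [{"race": r, "income": i} for r in "AB" for i in ("high", "low")]

pred_a = predictiveness(cell_a, ATTRS, ["race"], test_set)
pred_b = predictiveness(cell_b, ATTRS, ["race"], test_set)
```

Dropping race changes half of the predictions in the first cell and none in the second, so race is predictive only in the first.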
11Example (7/7) Prediction Cube
Cell value: Predictiveness of Race
12Efficient Computation
- Reduce prediction cube computation to data cube computation
- Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied
13Bottom-Up Data Cube Computation
Cell values: Numbers of loan applications
14Functions on Sets
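- Bottom-up computable functions: Functions that can be computed using only summary information
- Distributive function: $\alpha(X) = F(\alpha(X_1), \ldots, \alpha(X_n))$
- $X = X_1 \cup \cdots \cup X_n$ and $X_i \cap X_j = \emptyset$
- E.g., $\mathrm{Count}(X) = \mathrm{Sum}(\mathrm{Count}(X_1), \ldots, \mathrm{Count}(X_n))$
- Algebraic function: $\alpha(X) = F(G(X_1), \ldots, G(X_n))$
- $G(X_i)$ returns a length-fixed vector of values
- E.g., $\mathrm{Avg}(X) = F(G(X_1), \ldots, G(X_n))$
- $G(X_i) = [\mathrm{Sum}(X_i), \mathrm{Count}(X_i)]$
- $F([s_1, c_1], \ldots, [s_n, c_n]) = \mathrm{Sum}(s_i) / \mathrm{Sum}(c_i)$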
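The Count and Avg examples above can be sketched as follows. The cell keys, the loan amounts, and the helper names `g`, `f`, and `roll_up` are illustrative assumptions; the point is only that coarser cells are computed from the fixed-length summaries of finer cells, never from the raw records.

```python
from collections import defaultdict

# Base cells: (state, month) -> loan amounts. Made-up numbers.
base = {
    ("WI", "Jan"): [10.0, 20.0],
    ("WI", "Feb"): [30.0],
    ("IL", "Jan"): [40.0, 50.0, 60.0],
}

def g(values):
    """G(X_i): fixed-length summary (sum, count) of one base subset."""
    return (sum(values), len(values))

def f(summaries):
    """F: combine the child summaries into Avg of the union."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

def roll_up(summaries_by_cell, key_fn):
    """Merge base-cell summaries into coarser cells chosen by key_fn."""
    merged = defaultdict(lambda: (0.0, 0))
    for cell, (s, c) in summaries_by_cell.items():
        ms, mc = merged[key_fn(cell)]
        merged[key_fn(cell)] = (ms + s, mc + c)
    return dict(merged)

summaries = {cell: g(values) for cell, values in base.items()}

# Roll up from (state, month) to state using summaries only.
by_state = roll_up(summaries, lambda cell: cell[0])
avg_by_state = {state: s / c for state, (s, c) in by_state.items()}

# Distributive Count and algebraic Avg for the whole dataset.
count_all = sum(c for _, c in summaries.values())
avg_all = f(list(summaries.values()))
```

Count is distributive because the counts themselves are the summaries; Avg is algebraic because it needs the pair (sum, count) rather than the child averages.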
15Scoring Function
- Represent a model as a function of sets
- Conceptually, a machine-learning model $h(X; \sigma_Z(D))$ is a scoring function $\mathrm{Score}(y, x; \sigma_Z(D))$ that gives each class $y$ a score on test example $x$
- $h(x; \sigma_Z(D)) = \arg\max_y \mathrm{Score}(y, x; \sigma_Z(D))$
- $\mathrm{Score}(y, x; \sigma_Z(D)) \approx p(y \mid x, \sigma_Z(D))$
- $\sigma_Z(D)$: The set of training examples (a cube subset of D)
16Bottom-up Score Computation
- Key observations
- Observation 1: $\mathrm{Score}(y, x; \sigma_Z(D))$ is a function of cube subset $\sigma_Z(D)$; if it is distributive or algebraic, the data cube bottom-up technique can be directly applied
- Observation 2: Having the scores for all the test examples and all the cells is sufficient to compute a prediction cube
- Scores $\Rightarrow$ predictions $\Rightarrow$ cell values
- Details depend on what each cell means (i.e., type of prediction cubes), but straightforward
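The two observations can be illustrated with a small sketch, assuming a kernel-density-style score that sums a Gaussian kernel over the training examples of each class. The kernel, the one-dimensional predictor, the data, and the similarity-style cell value are assumptions for illustration. Scores are computed once per base cell, summed to get the scores of a coarser cell, and then turned into predictions and a cell value.

```python
import math

CLASSES = ("yes", "no")

def kernel(x, xi, bandwidth=1.0):
    """Gaussian kernel on a single numeric predictor (assumed)."""
    return math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2))

def scores(train, test_points):
    """Score(y, x; S): sum of kernel(x, xi) over examples (xi, yi) in S
    with yi == y, for every test point x."""
    return [
        {y: sum(kernel(x, xi) for xi, yi in train if yi == y) for y in CLASSES}
        for x in test_points
    ]

def add_scores(a, b):
    """Distributive combine step: add scores of two disjoint subsets."""
    return [{y: sa[y] + sb[y] for y in CLASSES} for sa, sb in zip(a, b)]

# Base cells: (state, month) -> training examples (x, y). Made-up values.
base = {
    ("WI", "Jan"): [(1.0, "yes"), (1.5, "yes")],
    ("WI", "Feb"): [(4.0, "no")],
    ("IL", "Jan"): [(4.5, "no"), (5.0, "no"), (1.2, "yes")],
}
test_points = [1.0, 3.0, 5.0]

# Step 1: scores for every base cell and every test example.
base_scores = {cell: scores(rows, test_points) for cell, rows in base.items()}

# Step 2: roll up to the state level by adding base-cell scores.
state_scores = {}
for (state, _month), s in base_scores.items():
    state_scores[state] = add_scores(state_scores[state], s) if state in state_scores else s

# Step 3: scores => predictions (argmax over classes).
predictions = {
    state: [max(CLASSES, key=row.get) for row in s]
    for state, s in state_scores.items()
}

# Step 4: predictions => cell values, here agreement with the
# predictions of a hypothetical target model on the same test points.
target_predictions = ["yes", "yes", "no"]
cell_values = {
    state: sum(p == t for p, t in zip(preds, target_predictions)) / len(preds)
    for state, preds in predictions.items()
}
```

The training data are read only in step 1; every coarser cell reuses the stored scores.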
17Machine-Learning Models
- Naïve Bayes
- Scoring function: algebraic
- Kernel-density-based classifier
- Scoring function: distributive
- Decision tree, random forest
- Neither distributive, nor algebraic
- PBE: Probability-based ensemble (new)
- To make any machine-learning model distributive
- Approximation
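To make the "algebraic" label for Naïve Bayes concrete, here is a minimal sketch under stated assumptions: G returns a bag of counts per base cell, counts from disjoint cells are added, and F turns the merged counts into a score. The add-one smoothing, the assumption of two values per attribute (`n_values=2`), the function names, and the data are illustrative choices, not the authors' formulation.

```python
from collections import Counter

CLASSES = ("yes", "no")

def g(rows):
    """G(X_i): counts of each class and of each (attribute, value, class).
    The vector has fixed length when attribute domains are finite."""
    counts = Counter()
    for x, y in rows:
        counts[("class", y)] += 1
        for attr, value in x.items():
            counts[(attr, value, y)] += 1
    return counts

def nb_score(counts, x, y, n_values=2):
    """F: Naive Bayes score of class y for example x from merged counts,
    with add-one smoothing (n_values = assumed domain size per attribute)."""
    n = sum(counts[("class", c)] for c in CLASSES)
    class_count = counts[("class", y)]
    score = (class_count + 1) / (n + len(CLASSES))
    for attr, value in x.items():
        score *= (counts[(attr, value, y)] + 1) / (class_count + n_values)
    return score

# Base cells with made-up training rows (x, y).
base = {
    ("WI", "Jan"): [({"race": "A", "income": "high"}, "yes"),
                    ({"race": "B", "income": "low"}, "no")],
    ("WI", "Feb"): [({"race": "A", "income": "low"}, "yes"),
                    ({"race": "B", "income": "high"}, "no")],
}

# Bottom-up: merge the per-cell count vectors instead of rescanning rows.
merged = sum((g(rows) for rows in base.values()), Counter())

x = {"race": "A", "income": "high"}
score_yes = nb_score(merged, x, "yes")
score_no = nb_score(merged, x, "no")
prediction = max(CLASSES, key=lambda y: nb_score(merged, x, y))
```

The merged counts are identical to the counts obtained by scanning the union of the two cells, which is what makes the score bottom-up computable.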
18Probability-Based Ensemble
[Figure: Decision tree on [WA, 85] vs. PBE version of decision tree on [WA, 85]; decision trees built on the lowest-level cells]
19Probability-Based Ensemble
- Scoring function
- $h(y \mid x; b_i(D))$: Model $h$'s estimation of $p(y \mid x, b_i(D))$
- $g(b_i \mid x)$: A model that predicts the probability that $x$ belongs to base subset $b_i(D)$
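The scoring formula itself did not survive in this transcript. The sketch below is therefore only one plausible reading, assuming the ensemble score of a cell is the sum, over its base subsets, of h(y | x; b_i(D)) weighted by g(b_i | x). The base-subset ids, the constant stand-in models, and the membership weights are hypothetical. Under that assumption the score is a plain sum over base subsets and hence distributive.

```python
import math

def pbe_score(y, x, base_ids, base_models, membership):
    """Assumed ensemble score: sum over base subsets b of
    h(y | x; b) * g(b | x). Not confirmed by the source."""
    return sum(base_models[b](x)[y] * membership(b, x) for b in base_ids)

# Hypothetical stand-ins for models trained on three lowest-level cells;
# each returns class-probability estimates for an example x.
base_models = {
    "b1": lambda x: {"yes": 0.9, "no": 0.1},
    "b2": lambda x: {"yes": 0.6, "no": 0.4},
    "b3": lambda x: {"yes": 0.2, "no": 0.8},
}

# Hypothetical g(b | x): probability that x belongs to base subset b.
weights = {"b1": 0.5, "b2": 0.25, "b3": 0.25}
membership = lambda b, x: weights[b]

x = {"income": "high"}
cell = ["b1", "b2", "b3"]  # a coarser cell made of three base subsets

score_yes = pbe_score("yes", x, cell, base_models, membership)
score_no = pbe_score("no", x, cell, base_models, membership)
prediction = "yes" if score_yes >= score_no else "no"

# Distributivity check: the score of the cell equals the sum of the
# scores of any partition of its base subsets.
partial = (pbe_score("yes", x, ["b1", "b2"], base_models, membership)
           + pbe_score("yes", x, ["b3"], base_models, membership))
```

Because each base subset contributes an independent term, the per-base-cell terms can be stored once and added for any coarser cell.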
20Outline
- Motivating example
- Definition of prediction cubes
- Efficient prediction cube materialization
- Experimental results
- Conclusion
21Experiments
- Quality of PBE on 8 UCI datasets
- The quality of the PBE version of a model is slightly worse (0 to 6%) than the quality of the model trained directly on the whole training data.
- Efficiency of the bottom-up score computation technique
- Case study on demographic data
22Efficiency of Bottom-up Score Computation
- Machine-learning models
- J48: J48 decision tree
- RF: Random forest
- NB: Naïve Bayes
- KDC: Kernel-density-based classifier
- Bottom-up method vs. exhaustive method
23Synthetic Dataset
- Dimensions: Z1, Z2 and Z3.
- Decision rule: [Figure: decision rule, with parts labelled "Z1 and Z2" and "Z3"]
24Efficiency Comparison
[Figure: Execution time (sec) vs. # of records, using the exhaustive method and using bottom-up score computation]
25Related Work Building models on OLAP Results
- Multi-dimensional regression [Chen, VLDB 02]
- Goal: Detect changes of trends
- Build linear regression models for cube cells
- Step-by-step regression in stream cubes [Liu, PAKDD 03]
- Loglinear-based quasi cubes [Barbara, J. IIS 01]
- Use loglinear model to approximately compress dense regions of a data cube
- NetCube [Margaritis, VLDB 01]
- Build a Bayes net on the entire dataset to approximately answer count queries
26Related Work (Contd.)
- Cubegrades [Imielinski, J. DMKD 02]
- Extend cubes with ideas from association rules
- How does the measure change when we roll up or drill down?
- Constrained gradients [Dong, VLDB 01]
- Find pairs of similar cell characteristics associated with big changes in measure
- User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
- Help users find the most informative unvisited regions in a data cube using max entropy principle
- Multi-Structural DBs [Fagin et al., PODS 05, VLDB 05]
27Take-Home Messages
- Promising exploratory data analysis paradigm
- Can use models to identify interesting subsets
- Concentrate only on subsets in cube space
- Those are meaningful subsets, tractable
- Precompute results and provide the users with an interactive tool
- A simple way to plug "something" into cube-style analysis
- Try to describe/approximate "something" by a distributive or algebraic function
28Big Picture
- Why stop with decision behavior? Can apply to other kinds of analyses too
- Why stop at browsing? Can mine prediction cubes in their own right
- Exploratory analysis of mining space
- Dimension attributes can be parameters related to algorithm, data conditioning, etc.
- Tractable evaluation is a challenge
- Large number of "dimensions", real-valued dimension attributes, difficulties in compositional evaluation
- Active learning for experiment design, extending compositional methods
29Community Information Management (CIM)
UI
Anhai Doan, University of Illinois at Urbana-Champaign
Raghu Ramakrishnan, University of Wisconsin-Madison
30Structured Web-Queries
- Example Queries
- How many alumni are top-10 faculty members?
- Wisconsin does very well, by the way
- Find trends in publications
- By topic, by conference, by alumni of schools
- Change tracking
- Alert me if my co-authors publish new papers or move to new jobs
- Information is extracted from text sources on the web, then queried
31Key Ideas
- Communities are ideally scoped chunks of the web for which to build enhanced portals
- Relative uniformity in content, interests
- Can exploit "people power" via mass collaboration, to augment extraction
- CIM platform: Facilitate collaborative creation and maintenance of community portals
- Extraction management
- Uncertainty, provenance, maintenance, compositional inference for refining extracted information
- Mass collaboration for extraction and integration
Watch for new DBWorld!
32Challenges
- User Interaction
- Declarative specification of background knowledge and user feedback
- Intelligent prompting for user input
- Explanation of results
33Challenges
- Extraction and Query Plans
- Starting from user input (ER schema, hints) and background knowledge (e.g., standard types, look-up tables), compile a query into an execution plan
- Must cover extraction, storage and indexing, and relational processing
- And maintenance!
- Algebra to represent such plans? Query optimizer?
- Handling uncertainty, constraints, conflicts,
multiple related sources, ranking, modular
architecture
34Challenges
- Managing extracted data
- Mapping between extracted metadata and source data
- Uncertainty of mapping
- Conflicts (in user input, background knowledge, or from multiple sources)
- Evolution over time