Title: Learning with Trees
1. Learning with Trees
Rob Nowak, University of Wisconsin-Madison
Collaborators: Rui Castro, Clay Scott, Rebecca Willett
www.ece.wisc.edu/nowak
Artwork: Piet Mondrian
2. Basic Problem: Partitioning
Many problems in statistical learning theory boil down to finding a good partition.
[Figure: a function and the corresponding partition]
3. Classification
Learning a classifier: build a decision rule based on labeled training data.
- Labeled training features
- Classification rule: a partition of feature space
4. Signal and Image Processing
Recover complex geometrical structure from noisy data.
[Figure: MRI data of a brain aneurysm and the extracted vascular network]
5. Partitioning Schemes
[Figure: support vector machine and image partitions]
6. Why Trees?
- Simplicity of design
- Interpretability
- Ease of implementation
- Good performance in practice
Trees are one of the most popular and widely used machine learning / data analysis tools.
- CART: Breiman, Friedman, Olshen, and Stone (1984), Classification and Regression Trees
- C4.5: Quinlan (1993), C4.5: Programs for Machine Learning
- JPEG 2000: image compression standard (2000), http://www.jpeg.org/jpeg2000/
7. Example: Gamma-Ray Burst Analysis
Compton Gamma-Ray Observatory, Burst and Transient Source Experiment (BATSE).
One burst (tens of seconds) emits as much energy as our entire Milky Way does in one hundred years!
[Figure: photon counts vs. time, showing a burst and its x-ray afterglow]
8. Trees and Partitions
[Figure: a tree and its corresponding coarse partition]
9. Estimation Using a Pruned Tree
Piecewise constant fits to the data on each piece of the partition provide a good estimate.
Each leaf corresponds to a sample f(t_i), i = 0, ..., N-1.
10. Gamma-Ray Burst 845
[Figure: piecewise linear fit on each cell; piecewise polynomial fit on each cell]
11. Recursive Partitions
12. Adapted Partition
13. Image Denoising
14. Decision (Classification) Trees
[Figure: Bayes decision boundary; labeled training data; complete partition; pruned partition]
Decision tree: majority vote at each leaf.
15. Classification
[Figure: ideal classifier; adapted partition; histogram. 256 cells in each partition.]
16. Image Partitions
[Figure: 1024 cells in each partition]
18. Image Coding
[Figure: JPEG at 0.125 bpp (non-adaptive partitioning) vs. JPEG 2000 at 0.125 bpp (adaptive partitioning)]
19. Probabilistic Framework
20. Prediction Problem
21. Challenge
22. Empirical Risk
23. Empirical Risk Minimization
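The usual definitions: given n i.i.d. training pairs (X_1, Y_1), ..., (X_n, Y_n), the empirical risk of a rule f counts its mistakes on the data, and empirical risk minimization selects the rule in the model class \mathcal{F} minimizing it:

\[
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ f(X_i) \neq Y_i \},
\qquad
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f).
\]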
24. Classification and Regression Trees
25. Classification and Regression Trees
[Figure: a binary tree with 0/1 class labels at its leaves]
26. Empirical Risk Minimization on Trees
27. Overfitting Problem
Coarse fits are crude but stable; fine fits are accurate on the training data but highly variable.
28. Bias/Variance Trade-off
- Coarse partition: large bias, small variance
- Fine partition: small bias, large variance
29. Estimation and Approximation Error
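The standard decomposition behind this trade-off: for a rule \hat{f}_n chosen from a class \mathcal{F}, the excess risk over the Bayes risk R^* splits as

\[
R(\hat{f}_n) - R^*
= \underbrace{\Big( R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \Big)}_{\text{estimation error (variance)}}
+ \underbrace{\Big( \inf_{f \in \mathcal{F}} R(f) - R^* \Big)}_{\text{approximation error (bias)}}.
\]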
30. Estimation Error in Regression
31. Estimation Error in Classification
32. Partition Complexity and Overfitting
[Figure: empirical risk vs. number of leaves]
33. Controlling Overfitting
34. Complexity Regularization
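Generically, complexity regularization selects the tree that minimizes empirical risk plus a complexity penalty (the specific penalties are developed in the slides that follow):

\[
\hat{T}_n = \arg\min_{T} \Big\{ \hat{R}_n(T) + \mathrm{pen}(T) \Big\}.
\]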
35. Per-Cell Variance Bounds: Regression
36. Per-Cell Variance Bounds: Classification
37. Variance Bounds
38. A Slightly Weaker Variance Bound
39. Complexity Regularization
40. Example: Image Denoising
This is a special case of wavelet denoising using the Haar wavelet basis.
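A minimal NumPy sketch of the idea, in one dimension (the function name, test signal, and threshold choice below are illustrative, not the exact experiment on the slide): transform to the Haar basis, hard-threshold the detail coefficients, and invert.

    import numpy as np

    def haar_denoise(signal, threshold):
        # Denoise a length-2^J signal by hard-thresholding its Haar
        # wavelet coefficients (piecewise-constant fits on a dyadic tree).
        x = np.asarray(signal, dtype=float).copy()
        n = len(x)
        details = []                      # detail coefficients, finest first
        while n > 1:                      # forward Haar transform
            avg = (x[0:n:2] + x[1:n:2]) / np.sqrt(2)
            det = (x[0:n:2] - x[1:n:2]) / np.sqrt(2)
            details.append(det)
            x[: n // 2] = avg
            n //= 2
        # Hard threshold: small details are treated as noise and zeroed.
        details = [d * (np.abs(d) > threshold) for d in details]
        for det in reversed(details):     # inverse transform, coarse to fine
            m = len(det)
            avg = x[:m].copy()
            x[0 : 2 * m : 2] = (avg + det) / np.sqrt(2)
            x[1 : 2 * m : 2] = (avg - det) / np.sqrt(2)
        return x

    # Illustrative use, with the universal threshold sigma * sqrt(2 log N).
    rng = np.random.default_rng(0)
    clean = np.repeat([0.0, 4.0, 1.0, 3.0], 256)
    noisy = clean + rng.normal(scale=0.5, size=clean.size)
    denoised = haar_denoise(noisy, 0.5 * np.sqrt(2 * np.log(clean.size)))

Hard thresholding in an orthonormal basis solves a per-coefficient penalized least-squares problem, which is why it is a special case of the complexity-regularized tree framework above.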
41. Theory of Complexity Regularization
42. Coffee Break!
43. Classification
44. Probabilistic Framework
45. Learning from Data
[Figure: training data with 0/1 class labels]
46. Approximation and Estimation
[Figure: approximation error contributes BIAS; model selection contributes VARIANCE]
47. Classifier Approximations
48. Approximation Error
The error is measured through the symmetric difference set between a candidate decision set and the Bayes decision set.
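Writing η(x) = P(Y = 1 | X = x), G^* for the Bayes decision set, and G for the set on which a candidate classifier predicts 1, the excess risk is an integral over the symmetric difference:

\[
R(G) - R(G^*) = \int_{G \,\Delta\, G^*} \lvert 2\eta(x) - 1 \rvert \, dP_X(x).
\]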
49. Approximation Error
Two notions of regularity matter: smoothness of the decision boundary, and smoothness of the risk functional near the transition.
50. Boundary Smoothness
51. Transition Smoothness
52. Transition Smoothness
53. Fundamental Limit to Learning
Mammen and Tsybakov (1999)
54. Related Work
55. Box-Counting Class
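Informally, a boundary belongs to a box-counting class when it cannot pass through too many cells of a regular partition: partitioning [0,1]^d into m^d hypercubes of side length 1/m, the boundary intersects at most

\[
C\, m^{d-1}
\]

of them, for a constant C and all m. (Lipschitz boundaries, for example, satisfy this.)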
56. Box-Counting Sub-Classes
57. Dyadic Decision Trees
[Figure: Bayes decision boundary; labeled training data; complete RDP; pruned RDP]
Dyadic decision tree: majority vote at each leaf.
Joint work with Clay Scott, 2004.
58. Dyadic Decision Trees
59. The Classifier Learning Problem
- Training data
- Model class
- Problem
60. Empirical Risk
61. Chernoff's Bound
62. Chernoff's Bound
The actual risk is probably not much larger than the empirical risk.
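For a single fixed classifier f, this is the familiar bound for sums of independent Bernoulli variables: for any ε > 0,

\[
\mathbb{P}\big( R(f) \ge \hat{R}_n(f) + \epsilon \big) \le e^{-2n\epsilon^2},
\]

or equivalently, with probability at least 1 - δ,

\[
R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log(1/\delta)}{2n}}.
\]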
63. Error Deviation Bounds
64. Uniform Deviation Bound
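Assigning each tree T a prefix codeword of length |c(T)| bits, so that \sum_T 2^{-|c(T)|} \le 1 by the Kraft inequality, the union bound turns the single-classifier bound into a uniform one: with probability at least 1 - δ, for every tree T simultaneously,

\[
R(T) \le \hat{R}_n(T) + \sqrt{\frac{|c(T)| \log 2 + \log(1/\delta)}{2n}}.
\]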
65. Setting Penalties
66. Setting Penalties
Prefix codes for trees; e.g., the codeword 0001001111 plus 6 bits for the leaf labels. See the sketch below.
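One concrete scheme of this kind (a sketch; the bit convention here is illustrative and need not match the codeword above): visit nodes in preorder, emit 0 for an internal node and 1 for a leaf, then append one bit per leaf for its class label, so a tree with k leaves costs 2k - 1 structure bits plus k label bits.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        label: int = 0                 # 0/1 class label (used at leaves)
        left: "Optional[Node]" = None
        right: "Optional[Node]" = None

    def structure(t: Node) -> str:
        # Preorder structure bits: '0' = internal node, '1' = leaf.
        if t.left is None:
            return "1"
        return "0" + structure(t.left) + structure(t.right)

    def codeword(t: Node) -> str:
        # Full prefix code: structure bits, then one label bit per leaf.
        labels = []
        def collect(u: Node) -> None:
            if u.left is None:
                labels.append(str(u.label))
            else:
                collect(u.left)
                collect(u.right)
        collect(t)
        return structure(t) + "".join(labels)

    # A stump with two leaves: 3 structure bits + 2 label bits.
    stump = Node(left=Node(label=0), right=Node(label=1))
    assert codeword(stump) == "01101"

No codeword is a prefix of another, so codelengths |c(T)| of this form satisfy the Kraft inequality and can serve directly as penalties.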
67. Uniform Deviation Bound
68. Decision Tree Selection
Compare with the oracle bound:
- Approximation error (bias)
- Estimation error (variance)
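Schematically (constants and the exact penalty as in the preceding slides), the selected tree then satisfies an oracle inequality of the form

\[
\mathbb{E}\big[ R(\hat{T}_n) \big] - R^*
\;\le\; \min_{T} \Big\{ \big( R(T) - R^* \big) + c \sqrt{\frac{|c(T)| \log 2 + \log n}{n}} \Big\},
\]

so the procedure automatically balances approximation error against estimation error.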
69. Rate of Convergence
BUT: why is this rate too slow?
70. Balanced vs. Unbalanced Trees
Same number of leaves: all T-leaf trees are equally favored, balanced or not.
71. Spatial Adaptation
- Local error
- Local empirical error
72. Relative Chernoff Bound
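The multiplicative (relative) form of Chernoff's bound: for a fixed classifier f with true risk R(f) and any γ ∈ (0, 1),

\[
\mathbb{P}\big( \hat{R}_n(f) \le (1-\gamma)\, R(f) \big) \le e^{-n R(f) \gamma^2 / 2}.
\]

Because the deviation scales with R(f) itself, cells with small local risk admit correspondingly small penalties, which is what enables the spatial adaptivity developed next.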
73. Designing Leaf Penalties
Prefix code construction for a single leaf: 00 = left branch, 01 = right branch, 11 = terminate, followed by one 0/1 bit for the class label. For example, 010001110 encodes the path right, left, right, then terminates with label 0.
74. Uniform Deviation Bound
Compare with the earlier uniform deviation bound.
75. Spatial Adaptivity
Key: local complexity is offset by small volumes!
76. Bound Comparison for an Unbalanced Tree
A tree with J leaves and depth J - 1.
77. Balanced vs. Unbalanced Trees
Same number of leaves.
78. Decision Tree Selection
Oracle bound:
- Approximation error
- Estimation error
79. Rate of Convergence
80. Computable Penalty
A computable penalty achieves the same rate of convergence.
81. Adapting to Dimension: Feature Rejection
[Figure: 0/1-labeled training data]
82. Adapting to Dimension: Data Manifold
83. Computational Issues
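For any penalty that adds up over leaves, the optimal pruning of a complete tree can be computed exactly by a bottom-up dynamic program, because the objective decomposes over subtrees. A minimal sketch (the names and the constant per-leaf penalty are illustrative; depth- or volume-dependent penalties fit the same recursion):

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Node:
        err: int                       # training errors if made a leaf
        left: "Optional[Node]" = None
        right: "Optional[Node]" = None

    def prune(t: Node, penalty: float) -> Tuple[float, Node]:
        # Return (cost, subtree) minimizing training errors + penalty per leaf.
        leaf_cost = t.err + penalty
        if t.left is None or t.right is None:
            return leaf_cost, Node(t.err)
        lcost, lsub = prune(t.left, penalty)
        rcost, rsub = prune(t.right, penalty)
        if lcost + rcost < leaf_cost:  # keeping the split pays off
            return lcost + rcost, Node(t.err, lsub, rsub)
        return leaf_cost, Node(t.err)  # otherwise collapse to a leaf

One pass over the complete tree suffices, so the selection step is linear in the number of nodes.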
84. DDTs in Action
85. Comparison to the State of the Art
[Table: DDT with cross-validation vs. benchmark methods]
Best results: (1) AdaBoost with RBF network, (2) Kernel Fisher Discriminant, (3) SVM with RBF kernel.
86. Application to Level Set Estimation
[Figure: elevation map of St. Louis; noisy data; penalty proportional to |T|; spatially adaptive penalty]
87. Conclusions and Future Work
Open problem
www.ece.wisc.edu/nowak
More info: www.ece.wisc.edu/nowak/ece901