Title: Additive Models, Trees, and Related Methods Part I
1. Additive Models, Trees, and Related Methods (Part I)
- Joy, Jie, Lucian
- Oct 22nd, 2002
2. Outline
- Tree-Based Methods
- CART (30 minutes): 1:30pm - 2:00pm
- HME (10 minutes): 2:00pm - 2:20pm
- PRIM (20 minutes): 2:20pm - 2:40pm
- Discussions (10 minutes): 2:40pm - 2:50pm
3. Tree-Based Methods
- Overview
- Principle: divide and conquer
- Variance will be increased
- Finesses the curse of dimensionality at the price of possibly mis-specifying the model
- Partition the feature space into a set of rectangles
- For simplicity, use recursive binary partitions
- Fit a simple model (e.g. a constant) in each rectangle
- Classification and Regression Trees (CART)
- Regression Trees
- Classification Trees
- Hierarchical Mixtures of Experts (HME)
4. CART
- An example (in the regression case)
5. How CART Sees An Elephant
It was six men of Indostan / To learning much inclined, / Who went to see the Elephant / (Though all of them were blind), / That each by observation / Might satisfy his mind.
-- "The Blind Men and the Elephant" by John Godfrey Saxe (1816-1887)
6. Basic Issues in Tree-based Methods
- How to grow a tree?
- How large should we grow the tree?
7. Regression Trees
- Partition the space into M regions R1, R2, ..., RM.
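A sketch of the corresponding model in standard CART notation (the fit in each region is a constant, chosen as the region average):

\[
  f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m),
  \qquad
  \hat{c}_m = \operatorname{ave}\left( y_i \mid x_i \in R_m \right).
\]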
8. Regression Trees - Growing the Tree
- The best partition minimizes the sum of squared errors
- Finding the global minimum is computationally infeasible
- Greedy algorithm: at each level, choose a splitting variable j and split point s (see the criterion below)
- The greedy algorithm makes the tree unstable
- Errors made at the upper levels are propagated to the lower levels
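A sketch of the greedy splitting criterion in the usual CART notation: with half-planes R1(j, s) = {X | Xj <= s} and R2(j, s) = {X | Xj > s}, seek

\[
  \min_{j,\, s} \left[
    \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2
    \;+\;
    \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2
  \right],
\]

where each inner minimum is attained by the average of the y_i in that region.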
9. Regression Trees - How Large Should We Grow the Tree?
- Trade-off between accuracy and generalization
- A very large tree overfits
- A small tree might not capture the structure
- Strategies
- 1. Split only when we can decrease the error (short-sighted; fails on e.g. XOR)
- 2. Cost-complexity pruning (preferred)
10. Regression Trees - Pruning
- Cost-complexity pruning
- Pruning: collapsing some internal nodes
- Cost-complexity criterion (see below)
- Choose the best alpha: weakest-link pruning
- Each time, collapse the internal node that adds the smallest error
- Choose the best tree from this sequence by cross-validation
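A sketch of the cost-complexity criterion in the usual notation (ESL Eq. 9.16): for a subtree T with |T| terminal nodes, where N_m is the number of observations in node m,

\[
  C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|,
  \qquad
  Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2 .
\]

Larger alpha penalizes tree size more heavily; alpha = 0 keeps the full tree.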
11. Classification Trees
- Classify the observations in node m to the majority class in the node
- p_mk is the proportion of observations of class k in node m
- Define an impurity measure for a node
- Misclassification error
- Cross-entropy
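The standard forms of these two impurity measures, written in the slide's notation (p_mk as above, k(m) the majority class in node m):

\[
  \text{Misclassification error: } \; \frac{1}{N_m} \sum_{i \in R_m} I\big(y_i \ne k(m)\big) = 1 - \hat{p}_{m\,k(m)},
  \qquad
  \text{Cross-entropy: } \; -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.
\]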
12. Classification Trees
- Gini index (also a well-known index for measuring income inequality)
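The usual form of the Gini index as a node impurity measure:

\[
  \sum_{k \ne k'} \hat{p}_{mk}\,\hat{p}_{mk'} \;=\; \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}).
\]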
13. Classification Trees
- Cross-entropy and Gini are more sensitive to changes in the node probabilities than the misclassification rate
- To grow the tree, use cross-entropy or Gini
- To prune the tree, use the misclassification rate (or any of the other measures)
14. Discussions on Tree-based Methods
- Categorical Predictors
- Problem: splitting a subtree t into tL and tR on a categorical predictor x with q possible values allows 2^(q-1) - 1 possible partitions!
- Theorem (Fisher 1958)
- There is an optimal partition B1, B2 of B that is contiguous in the ordering of the category means
- Order the predictor classes according to the mean of the outcome Y
- Intuition: treat the categorical predictor as if it were ordered (see the sketch below)
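A minimal sketch of that ordering trick for a squared-error regression split (the function name and interface are invented for illustration, not taken from the slides):

import numpy as np

def best_categorical_split(x_cat, y):
    """Illustrative sketch: find the best binary split of a categorical
    predictor for a regression node by ordering categories by mean(y)
    and scanning contiguous splits, instead of all 2^(q-1) - 1 subsets."""
    cats = np.unique(x_cat)
    # Order categories by the mean of the outcome within each category.
    ordered = sorted(cats, key=lambda c: y[x_cat == c].mean())

    best_sse, best_left = np.inf, None
    # Only q - 1 contiguous splits need to be checked.
    for i in range(1, len(ordered)):
        left = list(ordered[:i])
        mask = np.isin(x_cat, left)
        yl, yr = y[mask], y[~mask]
        sse = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_left = sse, set(left)
    return best_left, best_sse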
15. Discussions on Tree-based Methods
- The Loss Matrix
- The consequences of misclassification depend on the class
- Define a loss matrix L, with L_kk' the loss for classifying a class-k observation as class k'
- Modify the Gini index accordingly (see below)
- In a terminal node m, classify to the class with the smallest expected loss (see below)
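The standard loss-weighted versions (as in ESL, Section 9.2.4), in the slide's notation:

\[
  \text{Loss-weighted Gini: } \sum_{k \ne k'} L_{kk'}\,\hat{p}_{mk}\,\hat{p}_{mk'},
  \qquad
  k(m) = \arg\min_{k} \sum_{l} L_{lk}\,\hat{p}_{ml}.
\]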
16. Discussions on Trees
- Missing Predictor Values
- If we have enough training data: discard observations with missing values
- Fill in (impute) the missing value, e.g. with the mean of the known values
- Create a category called "missing"
- Surrogate variables (see the sketch below)
- Choose the primary predictor and split point
- The first surrogate predictor best mimics the split made by the primary predictor, the second does second best, and so on
- When sending observations down the tree, use the primary predictor first; if its value is missing, use the first surrogate; if that is also missing, use the second, and so on
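A minimal sketch of how an observation might be routed at a single node with surrogate splits (names and data are invented for illustration):

import math

def go_left(x, splits):
    """Illustrative sketch: decide the branch for one observation at a node.

    x maps predictor names to values (math.nan if missing); splits lists
    (variable, threshold) pairs with the primary split first, then the
    surrogates in decreasing order of agreement with the primary split.
    """
    for var, threshold in splits:
        value = x.get(var, math.nan)
        if not math.isnan(value):
            return value <= threshold  # first non-missing split decides
    return True  # everything missing: fall back to a default branch

# Example: primary split on 'age', surrogates on 'income', then 'tenure'.
node_splits = [("age", 40.0), ("income", 55000.0), ("tenure", 3.5)]
print(go_left({"age": math.nan, "income": 62000.0}, node_splits))  # surrogate decides: False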
17. Discussions on Trees
- Binary Splits?
- Question (Yan Liu): This question is about the limitation of multiway splits for building trees. It is said on page 273 that the problem with multiway splits is that they fragment the data too quickly, leaving insufficient data at the next level down. Can you give an intuitive explanation of why binary splits are preferred? In my understanding, one of the problems with multiway splits might be that it is hard to find the best attributes and split points; is that right?
- Answer: why are binary splits preferred?
- More standard framework to train
- "To be or not to be" is easier to decide
18. Discussions on Trees
- Linear Combination Splits
- Split the node based on a linear combination of the features (see the formula below)
- Improves the predictive power
- Hurts interpretability
- Instability of Trees
- Inherited from the hierarchical nature of the procedure
- Bagging (Section 8.7) can reduce the variance
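The usual form of a linear-combination split: the node is split on

\[
  \sum_{j} a_j X_j \le s,
\]

with the weights a_j and split point s optimized to minimize the relevant impurity criterion.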
19. Discussions on Trees
20. Discussions on Trees
- Figure: majority vote vs. average
21. Hierarchical Mixtures of Experts
- The gating networks provide a nested, soft partitioning of the input space
- The expert networks provide local regression surfaces within the partitions
- Both the mixture coefficients and the mixture components are generalized linear models (GLIMs)
22. Hierarchical Mixtures of Experts
- Expert node output
- Lower-level gate
- Lower-level gate output (formulas sketched below)
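A sketch of these quantities in the notation of ESL Section 9.5 and Jordan and Jacobs (1994), for a two-level HME with branching factor K and Gaussian linear experts for regression:

\[
  g_j(x) = \frac{e^{\gamma_j^T x}}{\sum_{k=1}^{K} e^{\gamma_k^T x}}
  \ \text{(top-level gate)},
  \qquad
  g_{l \mid j}(x) = \frac{e^{\gamma_{jl}^T x}}{\sum_{k=1}^{K} e^{\gamma_{jk}^T x}}
  \ \text{(lower-level gate)},
\]
\[
  \text{expert output: } \; y = \beta_{jl}^T x + \varepsilon, \quad \varepsilon \sim N(0, \sigma_{jl}^2).
\]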
23. Hierarchical Mixtures of Experts
- Likelihood of the training data (see below)
- Gradient-descent learning algorithm to update the gating weights U_ij
- Applying EM to HME for training
- Latent indicator variables z_i: which branch to go down
- See Jordan (1994) for details
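A sketch of the mixture density and log-likelihood that EM maximizes, in the same notation as above:

\[
  \Pr(y \mid x) = \sum_{j=1}^{K} g_j(x) \sum_{l=1}^{K} g_{l \mid j}(x)\, \Pr(y \mid x, \theta_{jl}),
  \qquad
  \ell(\theta) = \sum_{i=1}^{N} \log \Pr(y_i \mid x_i).
\]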
24. Hierarchical Mixtures of Experts -- EM
- Figure: each histogram displays the distribution of posterior probabilities across the training set at each node in the tree
25. Comparison
- All methods perform better than the linear model
- Backpropagation (BP) has the lowest relative error
- BP is hard to get to converge
- Setup: 16 experts for HME (four-level hierarchy); 16 basis functions for MARS (Jordan 94)
26. Model Selection for HME
- Structural parameters need to be decided:
- Number of levels
- Branching factor K of the tree
- Unlike CART, there are no good methods for finding the tree topology
27. Questions and Discussions
- CART
- Rong Jin: 1. According to Eqn. (9.16), a successful split in a large subtree is more valuable than one in a small subtree. Could you justify this?
- Rong Jin: The discussion of general regression and classification trees only considers partitioning the feature space with simple binary splits. Has any work been done on nonlinear partitions of the feature space?
- Rong Jin: Does it make any sense to do overlapping splits?
- Ben: Could you make the difference between using L_k,k' as a loss vs. as weights clearer? (p. 272)
28. Questions and Discussions
- Locality
- Rong Jin: Both tree models and kernel methods try to capture locality. For tree models, locality is created through the partition of the feature space, while a kernel method expresses locality through a special distance function. Please comment on these two methods' ability to express localized functions.
- Ben: Could you compare the tree methods introduced here with kNN and kernel methods?
29. Questions and Discussions
- Gini and other measures
- Yan: The classification (or regression) trees discussed in this chapter use several criteria to select the attribute and split points, such as misclassification error, the Gini index and cross-entropy. When should we use each of these criteria? Is entropy preferred over the other two? (Also, to clarify: does the Gini index refer to the gain ratio, and cross-entropy to information gain?)
- Jian Zhang: From the book we know that the Gini index has many nice properties, like a tight upper bound on the error, an interpretation as a training error rate under probabilistic classification, etc. Should we prefer it for classification tasks for those reasons?
- Weng-keen: How is the Gini index equal to the training error?
- Yan Jun: For tree-based methods, can you give an intuitive explanation of the Gini index measure?
- Ben: What does minimizing node impurity mean? Is it just to decrease the overall variance? Is there any implication for the bias? How does the usual bias-variance tradeoff play a role here?
30. Questions and Discussions
- HME
- Yan: In my understanding, the HME is like a neural network combined with Gaussian linear regression (or logistic regression), in the sense that the input of the neural network is the output of the Gaussian regression. Is my understanding right?
- Ben: For the 2-class case depicted in Fig. 9.13 (HME), why do we need two levels of mixtures?
- Ben: In Eq. 9.30, why are the two upper bounds of the summations the same K? -- Yes, both levels use the same branching factor K.
31. References
- Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.
- Breiman, L. (1984). Classification and Regression Trees. Wadsworth International Group.
- Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.
32. Hierarchical Mixtures of Experts
- The softmax function derives naturally from log-linear models and leads to convenient interpretations of the weights in terms of odds ratios. You could, however, use a variety of other nonnegative functions on the real line in place of the exp function. Or you could constrain the net inputs to the output units to be nonnegative and just divide by the sum -- that is called the Bradley-Terry-Luce model.
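A sketch of the contrast described above, writing eta_j for the net input to output unit j:

\[
  \text{softmax: } g_j = \frac{e^{\eta_j}}{\sum_k e^{\eta_k}},
  \qquad
  \text{Bradley-Terry-Luce: } g_j = \frac{\eta_j}{\sum_k \eta_k}, \quad \eta_j \ge 0.
\]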
33. Hierarchical Mixtures of Experts