Title: Additive Models, Trees, and Related Methods Part I
1. Additive Models, Trees, and Related Methods (Part I)
- Joy, Jie, Lucian
- Oct 22nd, 2002
2. Outline
- Tree-Based Methods
- CART (30 minutes): 1:30pm - 2:00pm
- HME (10 minutes): 2:00pm - 2:20pm
- PRIM (20 minutes): 2:20pm - 2:40pm
- Discussions (10 minutes): 2:40pm - 2:50pm
3. Tree-Based Methods
- Overview
- Principle: divide and conquer
- Variance will be increased
- Finesses the curse of dimensionality at the price of possibly mis-specifying the model
- Partition the feature space into a set of rectangles
- For simplicity, use recursive binary partitions
- Fit a simple model (e.g. a constant) in each rectangle
- Classification and Regression Trees (CART)
- Regression Trees
- Classification Trees
- Hierarchical Mixtures of Experts (HME)
4. CART
- An example (in the regression case)
5. How CART Sees An Elephant
It was six men of Indostan / To learning much inclined, / Who went to see the Elephant / (Though all of them were blind), / That each by observation / Might satisfy his mind.
-- "The Blind Men and the Elephant" by John Godfrey Saxe (1816-1887)
6. Basic Issues in Tree-based Methods
- How to grow a tree?
- How large should we grow the tree?
7. Regression Trees
- Partition the space into M regions R1, R2, ..., RM.
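A sketch of the corresponding model in standard CART notation (the fit in each region is a constant, chosen as the region average):

\[
  f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m),
  \qquad
  \hat{c}_m = \operatorname{ave}\left( y_i \mid x_i \in R_m \right).
\]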
8. Regression Trees - Growing the Tree
- The best partition minimizes the sum of squared errors
- Finding the global minimum is computationally infeasible
- Greedy algorithm: at each level, choose a splitting variable j and split point s (see the criterion below)
- The greedy algorithm makes the tree unstable
- Errors made at the upper levels are propagated to the lower levels
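A sketch of the greedy splitting criterion in the usual CART notation: with half-planes R1(j, s) = {X | Xj <= s} and R2(j, s) = {X | Xj > s}, seek

\[
  \min_{j,\, s} \left[
    \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2
    \;+\;
    \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2
  \right],
\]

where each inner minimum is attained by the average of the y_i in that region.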
9. Regression Trees - How Large Should We Grow the Tree?
- Trade-off between accuracy and generalization
- A very large tree overfits
- A small tree might not capture the structure
- Strategies
- 1. Split only when we can decrease the error (short-sighted; fails on e.g. XOR)
- 2. Cost-complexity pruning (preferred)
10. Regression Trees - Pruning
- Cost-complexity pruning
- Pruning: collapsing some internal nodes
- Cost-complexity criterion (see below)
- Choose the best alpha: weakest-link pruning
- Each time, collapse the internal node that adds the smallest error
- Choose the best tree from this sequence by cross-validation
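A sketch of the cost-complexity criterion in the usual notation (ESL Eq. 9.16): for a subtree T with |T| terminal nodes, where N_m is the number of observations in node m,

\[
  C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|,
  \qquad
  Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2 .
\]

Larger alpha penalizes tree size more heavily; alpha = 0 keeps the full tree.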
11. Classification Trees
- Classify the observations in node m to the majority class in the node
- p_mk is the proportion of observations of class k in node m
- Define an impurity measure for a node
- Misclassification error
- Cross-entropy
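The standard forms of these two impurity measures, written in the slide's notation (p_mk as above, k(m) the majority class in node m):

\[
  \text{Misclassification error: } \; \frac{1}{N_m} \sum_{i \in R_m} I\big(y_i \ne k(m)\big) = 1 - \hat{p}_{m\,k(m)},
  \qquad
  \text{Cross-entropy: } \; -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.
\]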
12. Classification Trees
- Gini index (also a well-known index for measuring income inequality)
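The usual form of the Gini index as a node impurity measure:

\[
  \sum_{k \ne k'} \hat{p}_{mk}\,\hat{p}_{mk'} \;=\; \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}).
\]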
13. Classification Trees
- Cross-entropy and Gini are more sensitive to changes in the node probabilities than the misclassification rate
- To grow the tree, use cross-entropy or Gini
- To prune the tree, use the misclassification rate (or any of the other measures)
14. Discussions on Tree-based Methods
- Categorical Predictors
- Problem: splitting a subtree t into tL and tR on a categorical predictor x with q possible values allows 2^(q-1) - 1 possible partitions!
- Theorem (Fisher 1958)
- There is an optimal partition B1, B2 of B that is contiguous in the ordering of the category means
- Order the predictor classes according to the mean of the outcome Y
- Intuition: treat the categorical predictor as if it were ordered (see the sketch below)
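A minimal sketch of that ordering trick for a squared-error regression split (the function name and interface are invented for illustration, not taken from the slides):

import numpy as np

def best_categorical_split(x_cat, y):
    """Illustrative sketch: find the best binary split of a categorical
    predictor for a regression node by ordering categories by mean(y)
    and scanning contiguous splits, instead of all 2^(q-1) - 1 subsets."""
    cats = np.unique(x_cat)
    # Order categories by the mean of the outcome within each category.
    ordered = sorted(cats, key=lambda c: y[x_cat == c].mean())

    best_sse, best_left = np.inf, None
    # Only q - 1 contiguous splits need to be checked.
    for i in range(1, len(ordered)):
        left = list(ordered[:i])
        mask = np.isin(x_cat, left)
        yl, yr = y[mask], y[~mask]
        sse = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_left = sse, set(left)
    return best_left, best_sse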
15. Discussions on Tree-based Methods
- The Loss Matrix
- The consequences of misclassification depend on the class
- Define a loss matrix L, with L_kk' the loss for classifying a class-k observation as class k'
- Modify the Gini index accordingly (see below)
- In a terminal node m, classify to the class with the smallest expected loss (see below)
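The standard loss-weighted versions (as in ESL, Section 9.2.4), in the slide's notation:

\[
  \text{Loss-weighted Gini: } \sum_{k \ne k'} L_{kk'}\,\hat{p}_{mk}\,\hat{p}_{mk'},
  \qquad
  k(m) = \arg\min_{k} \sum_{l} L_{lk}\,\hat{p}_{ml}.
\]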
16. Discussions on Trees
- Missing Predictor Values
- If we have enough training data: discard observations with missing values
- Fill in (impute) the missing value, e.g. with the mean of the known values
- Create a category called "missing"
- Surrogate variables (see the sketch below)
- Choose the primary predictor and split point
- The first surrogate predictor best mimics the split made by the primary predictor, the second does second best, and so on
- When sending observations down the tree, use the primary predictor first; if its value is missing, use the first surrogate; if that is also missing, use the second, and so on
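A minimal sketch of how an observation might be routed at a single node with surrogate splits (names and data are invented for illustration):

import math

def go_left(x, splits):
    """Illustrative sketch: decide the branch for one observation at a node.

    x maps predictor names to values (math.nan if missing); splits lists
    (variable, threshold) pairs with the primary split first, then the
    surrogates in decreasing order of agreement with the primary split.
    """
    for var, threshold in splits:
        value = x.get(var, math.nan)
        if not math.isnan(value):
            return value <= threshold  # first non-missing split decides
    return True  # everything missing: fall back to a default branch

# Example: primary split on 'age', surrogates on 'income', then 'tenure'.
node_splits = [("age", 40.0), ("income", 55000.0), ("tenure", 3.5)]
print(go_left({"age": math.nan, "income": 62000.0}, node_splits))  # surrogate decides: False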
17. Discussions on Trees
- Binary Splits?
- Question (Yan Liu): This question is about the limitation of multiway splits for building trees. It is said on page 273 that the problem with multiway splits is that they fragment the data too quickly, leaving insufficient data at the next level down. Can you give an intuitive explanation of why binary splits are preferred? In my understanding, one of the problems with multiway splits might be that it is hard to find the best attributes and split points; is that right?
- Answer: why are binary splits preferred?
- More standard framework to train
- "To be or not to be" is easier to decide
18. Discussions on Trees
- Linear Combination Splits
- Split the node based on a linear combination of the features (see the formula below)
- Improves the predictive power
- Hurts interpretability
- Instability of Trees
- Inherited from the hierarchical nature of the procedure
- Bagging (Section 8.7) can reduce the variance
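The usual form of a linear-combination split: the node is split on

\[
  \sum_{j} a_j X_j \le s,
\]

with the weights a_j and split point s optimized to minimize the relevant impurity criterion.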
19. Discussions on Trees
20. Discussions on Trees
- Figure: majority vote vs. average
21. Hierarchical Mixtures of Experts
- The gating networks provide a nested, soft partitioning of the input space
- The expert networks provide local regression surfaces within the partitions
- Both the mixture coefficients and the mixture components are generalized linear models (GLIMs)
22. Hierarchical Mixtures of Experts
- Expert node output
- Lower-level gate
- Lower-level gate output (formulas sketched below)
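A sketch of these quantities in the notation of ESL Section 9.5 and Jordan and Jacobs (1994), for a two-level HME with branching factor K and Gaussian linear experts for regression:

\[
  g_j(x) = \frac{e^{\gamma_j^T x}}{\sum_{k=1}^{K} e^{\gamma_k^T x}}
  \ \text{(top-level gate)},
  \qquad
  g_{l \mid j}(x) = \frac{e^{\gamma_{jl}^T x}}{\sum_{k=1}^{K} e^{\gamma_{jk}^T x}}
  \ \text{(lower-level gate)},
\]
\[
  \text{expert output: } \; y = \beta_{jl}^T x + \varepsilon, \quad \varepsilon \sim N(0, \sigma_{jl}^2).
\]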
23. Hierarchical Mixtures of Experts
- Likelihood of the training data (see below)
- Gradient-descent learning algorithm to update the gating weights U_ij
- Applying EM to HME for training
- Latent indicator variables z_i: which branch to go down
- See Jordan (1994) for details
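A sketch of the mixture density and log-likelihood that EM maximizes, in the same notation as above:

\[
  \Pr(y \mid x) = \sum_{j=1}^{K} g_j(x) \sum_{l=1}^{K} g_{l \mid j}(x)\, \Pr(y \mid x, \theta_{jl}),
  \qquad
  \ell(\theta) = \sum_{i=1}^{N} \log \Pr(y_i \mid x_i).
\]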
24. Hierarchical Mixtures of Experts -- EM
- Figure: each histogram displays the distribution of posterior probabilities across the training set at each node in the tree
25. Comparison
- All methods perform better than the linear model
- Backpropagation (BP) has the lowest relative error
- BP is hard to get to converge
- Setup: 16 experts for HME (four-level hierarchy); 16 basis functions for MARS (Jordan 94)
26. Model Selection for HME
- Structural parameters need to be decided:
- Number of levels
- Branching factor K of the tree
- Unlike CART, there are no good methods for finding the tree topology
27. Questions and Discussions
- CART
- Rong Jin: 1. According to Eqn. (9.16), a successful split in a large subtree is more valuable than one in a small subtree. Could you justify this?
- Rong Jin: The discussion of general regression and classification trees only considers partitioning the feature space with simple binary splits. Has any work been done on nonlinear partitions of the feature space?
- Rong Jin: Does it make any sense to do overlapping splits?
- Ben: Could you make the difference between using L_k,k' as a loss vs. as weights clearer? (p. 272)
28. Questions and Discussions
- Locality
- Rong Jin: Both tree models and kernel methods try to capture locality. For tree models, locality is created through the partition of the feature space, while a kernel method expresses locality through a special distance function. Please comment on these two methods' ability to express localized functions.
- Ben: Could you compare the tree methods introduced here with kNN and kernel methods?
29. Questions and Discussions
- Gini and other measures
- Yan: The classification (or regression) trees discussed in this chapter use several criteria to select the attribute and split points, such as misclassification error, the Gini index and cross-entropy. When should we use each of these criteria? Is entropy preferred over the other two? (Also, to clarify: does the Gini index refer to the gain ratio, and cross-entropy to information gain?)
- Jian Zhang: From the book we know that the Gini index has many nice properties, like a tight upper bound on the error, an interpretation as a training error rate under probabilistic classification, etc. Should we prefer it for classification tasks for those reasons?
- Weng-keen: How is the Gini index equal to the training error?
- Yan Jun: For tree-based methods, can you give an intuitive explanation of the Gini index measure?
- Ben: What does minimizing node impurity mean? Is it just to decrease the overall variance? Is there any implication for the bias? How does the usual bias-variance tradeoff play a role here?
30. Questions and Discussions
- HME
- Yan: In my understanding, the HME is like a neural network combined with Gaussian linear regression (or logistic regression), in the sense that the input of the neural network is the output of the Gaussian regression. Is my understanding right?
- Ben: For the 2-class case depicted in Fig. 9.13 (HME), why do we need two levels of mixtures?
- Ben: In Eq. 9.30, why are the two upper bounds of the summations the same K? -- Yes, both levels use the same branching factor K.
31. References
- Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.
- Breiman, L. (1984). Classification and Regression Trees. Wadsworth International Group.
- Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.
32. Hierarchical Mixtures of Experts
- The softmax function derives naturally from log-linear models and leads to convenient interpretations of the weights in terms of odds ratios. You could, however, use a variety of other nonnegative functions on the real line in place of the exp function. Or you could constrain the net inputs to the output units to be nonnegative and just divide by the sum -- that is called the Bradley-Terry-Luce model.
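A sketch of the contrast described above, writing eta_j for the net input to output unit j:

\[
  \text{softmax: } g_j = \frac{e^{\eta_j}}{\sum_k e^{\eta_k}},
  \qquad
  \text{Bradley-Terry-Luce: } g_j = \frac{\eta_j}{\sum_k \eta_k}, \quad \eta_j \ge 0.
\]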
33. Hierarchical Mixtures of Experts