1
Additive Models, Trees, and Related Methods (Part
I)
  • Joy, Jie, Lucian
  • Oct 22nd, 2002

2
Outline
  • Tree-Based Methods
  • CART (30 minutes): 1:30pm - 2:00pm
  • HME (10 minutes): 2:00pm - 2:20pm
  • PRIM (20 minutes): 2:20pm - 2:40pm
  • Discussions (10 minutes): 2:40pm - 2:50pm

3
Tree-Based Methods
  • Overview
  • Principle: divide and conquer
  • Variance is increased
  • Finesses the curse of dimensionality, at the
    price of possibly mis-specifying the model
  • Partition the feature space into a set of
    rectangles
  • For simplicity, use recursive binary partitions
  • Fit a simple model (e.g. a constant) in each
    rectangle
  • Classification and Regression Trees (CART)
  • Regression Trees
  • Classification Trees
  • Hierarchical Mixtures of Experts (HME)

4
CART
  • An example (in the regression case)

5
How CART Sees An Elephant
It was six men of Indostan To learning much
inclined, Who went to see the Elephant (Though
all of them were blind), That each by
observation Might satisfy his mind.
-- "The Blind Men and the Elephant" by John
Godfrey Saxe (1816-1887)
6
Basic Issues in Tree-based Methods
  • How to grow a tree?
  • How large should we grow the tree?

7
Regression Trees
  • Partition the space into M regions R1, R2, ...,
    RM, and fit a constant in each region (model
    below)
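A minimal statement of the fitted model, assuming the standard CART regression setup (notation follows ESL ch. 9, not reproduced from the slide itself):

  f(x) = \sum_{m=1}^{M} c_m I(x \in R_m),   with   \hat{c}_m = ave(y_i | x_i \in R_m),

i.e. the fit in each region is simply the average of the responses that fall into it.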

8
Regression Trees Grow the Tree
  • Find the partition that minimizes the sum of
    squared errors
  • Finding the global optimum is computationally
    infeasible
  • Greedy algorithm: at each step, choose the
    splitting variable j and split point s by the
    criterion below
  • The greedy algorithm makes the tree unstable
  • Errors made at the upper levels are propagated
    to the lower levels
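The greedy splitting criterion referred to above, in its standard form:

  \min_{j, s} [ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 ]

where R_1(j,s) = { x : x_j \le s } and R_2(j,s) = { x : x_j > s }; the inner minima are attained at the region means.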

9
Regression Trees: how large should we grow the
tree?
  • Trade-off between accuracy and generalization
  • A very large tree overfits
  • A small tree might not capture the structure
  • Strategies
  • 1. Split only when the split decreases the error
    (short-sighted; fails on XOR-like structure)
  • 2. Cost-complexity pruning (preferred)

10
Regression Tree - Pruning
  • Cost-complexity pruning
  • Pruning: collapsing some internal nodes
  • Cost-complexity criterion (see below)
  • Choose the best alpha via weakest-link pruning:
    each time, collapse the internal node that adds
    the smallest increase in error
  • Choose the best tree from the resulting sequence
    by cross-validation
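The cost-complexity criterion, in its standard form:

  C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|

where |T| is the number of terminal nodes of the subtree T, N_m is the number of observations in node m, Q_m(T) is the within-node squared error (1/N_m) \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2, and \alpha \ge 0 trades off tree size against goodness of fit.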

11
Classification Trees
  • Classify the observations in node m to the
    majority class in the node
  • p_mk is the proportion of observations of class
    k in node m
  • Define impurity measures for a node (see below)
  • Misclassification error
  • Cross-entropy
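The impurity measures named above, in standard notation:

  Misclassification error:  (1/N_m) \sum_{i \in R_m} I(y_i \ne k(m)) = 1 - \hat{p}_{m k(m)}
  Cross-entropy (deviance):  - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}

with k(m) = \arg\max_k \hat{p}_{mk} the majority class in node m.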

12
Classification Trees
  • Gini index (also a well-known measure of income
    inequality); see below
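In the same notation, the Gini index of node m is

  \sum_{k \ne k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}).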

13
Classification Trees
  • Cross-entropy and Gini are more sensitive to
    changes in the node class probabilities than the
    misclassification rate
  • To grow the tree, use cross-entropy or Gini
  • To prune the tree, use the misclassification
    rate (or any other of the measures)

14
Discussions on Tree-based Methods
  • Categorical Predictors
  • Problem: a split of node t into tL and tR based
    on a categorical predictor x with q possible
    values can be made in 2^(q-1) - 1 ways!
  • Theorem (Fisher 1958): there is an optimal
    partition (B1, B2) of the category set B that is
    obtained by ordering the predictor classes
    according to the mean of the outcome Y and
    splitting along that ordering
  • Intuition: treat the categorical predictor as if
    it were ordered (see the sketch below)
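A minimal Python sketch of this ordering trick (the function name, numpy usage, and toy data are illustrative assumptions, not from the slides): sort the categories by the mean of Y, then only the q-1 splits along that ordering need to be considered instead of all 2^(q-1) - 1 subsets.

  import numpy as np

  def ordered_category_splits(x_cat, y):
      """Order categories by mean(y); return the q-1 candidate binary splits."""
      cats = np.unique(x_cat)
      means = np.array([y[x_cat == c].mean() for c in cats])  # mean outcome per category
      order = cats[np.argsort(means)]                          # categories sorted by mean of Y
      # split k sends the first k ordered categories left and the rest right
      return [(set(order[:k]), set(order[k:])) for k in range(1, len(order))]

  # toy illustration with hypothetical data
  x = np.array(["a", "b", "c", "a", "b", "c", "c"])
  y = np.array([1.0, 3.0, 5.0, 1.2, 2.8, 5.5, 4.9])
  for left, right in ordered_category_splits(x, y):
      print(left, "|", right)

For squared-error regression (and two-class classification with the Gini index), one of these ordered splits is optimal among all 2^(q-1) - 1 partitions, which is the content of Fisher's result cited above.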

15
Discussions on Tree-based Methods
  • The Loss Matrix
  • The consequences of misclassification depend on
    the class
  • Define a loss matrix L with entries Lkk'
  • Modify the Gini index as shown below
  • In a terminal node m, classify to the class k
    given below
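The two formulas referred to above, in their standard forms (with L_kk' the loss incurred when a class-k observation is classified as class k'):

  Modified Gini index:  \sum_{k \ne k'} L_{kk'} \hat{p}_{mk} \hat{p}_{mk'}
  Classification rule:  k(m) = \arg\min_k \sum_{l} L_{lk} \hat{p}_{ml}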

16
Discussions on Trees
  • Missing Predictor Values
  • If we have enough training data: discard the
    observations with missing values
  • Fill in (impute) the missing value, e.g. with
    the mean of the known values
  • Create a new category called "missing"
  • Surrogate variables
  • Choose the primary predictor and split point
  • The first surrogate predictor best mimics the
    split by the primary predictor, the second does
    second best, and so on
  • When sending observations down the tree, use the
    primary predictor first. If its value is missing,
    use the first surrogate; if that is also missing,
    use the second, and so on.

17
Discussions on Trees
  • Binary Splits?
  • Question (Yan Liu)
  • This question is about the limitation of
    multiway splits for building trees. It is said on
    page 273 that the problem with multiway splits is
    that they fragment the data too quickly, leaving
    insufficient data at the next level down. Can you
    give an intuitive explanation of why binary
    splits are preferred? In my understanding, one of
    the problems with multiway splits might be that
    it is hard to find the best attributes and split
    points; is that right?
  • Answer: why are binary splits preferred?
  • A more standard framework to train
  • "To be or not to be" is easier to decide

18
Discussions on Trees
  • Linear Combination Splits
  • Split the node on a linear combination of the
    inputs (see the split form after this list)
  • Improves the predictive power
  • Hurts interpretability
  • Instability of Trees
  • Inherited from the hierarchical nature of the
    procedure
  • Bagging (section 8.7) can reduce the variance
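The split form referred to above is, in the standard formulation,

  \sum_j a_j X_j \le s,

with the weights a_j and the split point s optimized jointly to minimize the chosen impurity or error criterion.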

19
Discussions on Trees
20
Discussions on Trees
(Figure: bagging illustration. Labels: majority
vote; average.)
21
Hierarchical Mixture Experts
  • The gating networks provide a nested, soft
    partitioning of the input space
  • The expert networks provide local regression
    surfaces within the partitions
  • Both the mixture coefficients and the mixture
    components are generalized linear models (GLIMs)

22
Hierarchical Mixture Experts
  • Expert node output
  • Lower-level gate
  • Lower-level gate output (standard formulas
    below)
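In the standard HME notation (Jordan and Jacobs 1994), for a two-level tree with branching factor K:

  Top-level gate:    g_j(x, \gamma_j) = e^{\gamma_j^T x} / \sum_{k=1}^{K} e^{\gamma_k^T x},   j = 1, ..., K
  Lower-level gate:  g_{l|j}(x, \gamma_{jl}) = e^{\gamma_{jl}^T x} / \sum_{k=1}^{K} e^{\gamma_{jk}^T x},   l = 1, ..., K
  Mixture output:    Pr(y | x, \Psi) = \sum_{j=1}^{K} g_j(x, \gamma_j) \sum_{l=1}^{K} g_{l|j}(x, \gamma_{jl}) Pr(y | x, \theta_{jl}),

where Pr(y | x, \theta_{jl}) is the expert model (e.g. a Gaussian linear regression or a logistic regression) at leaf (j, l).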

23
Hierarchical Mixture Experts
  • Likelihood of the training data (see below)
  • Gradient descent learning algorithm to update
    the gating parameters Uij
  • Applying EM to HME for training
  • Latent indicator variables zi: which branch to
    go down
  • See Jordan and Jacobs (1994) for details
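In the same notation, the likelihood mentioned above is the log-likelihood of the training data,

  l(\Psi) = \sum_{i=1}^{N} \log Pr(y_i | x_i, \Psi),

maximized either by gradient ascent on the gating and expert parameters or by EM, with latent indicators z_i marking which branch each observation is assumed to have taken.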

24
Hierarchical Mixture Experts -- EM
Each histogram displays the distribution of
posterior probabilities across the training set
at each node in the tree
25
Comparison
  • All methods perform better than the linear model
  • BP (backpropagation) has the lowest relative
    error
  • BP is hard to get to converge

HME: a four-level hierarchy with 16 experts; MARS:
16 basis functions (Jordan and Jacobs, 1994)
26
Model Selection for HME
  • Structural parameters need to be decided:
  • Number of levels
  • Branching factor K of the tree
  • Unlike CART, there are no good methods for
    finding the tree topology

27
Questions and Discussions
  • CART
  • Rong Jin: 1. According to Eqn. (9.16), a
    successful split in a large subtree is more
    valuable than one in a small subtree. Could you
    justify this?
  • Rong Jin: The discussion of general regression
    and classification trees only considers
    partitioning the feature space in a simple binary
    way. Has any work been done along the lines of
    nonlinear partitions of the feature space?
  • Rong Jin: Does it make any sense to use
    overlapping splits?
  • Ben: Could you make the difference clearer
    between using L_k,k' as a loss vs. as weights?
    (p. 272)

28
Questions and Discussions
  • Locality
  • Rong Jin: Both tree models and kernel methods
    try to capture locality. For tree models, the
    locality is created through the partition of the
    feature space, while kernel methods express
    locality through a special distance function.
    Please comment on these two methods' ability to
    express localized functions.
  • Ben: Could you compare the tree methods
    introduced here with kNN and kernel methods?

29
Questions and Discussions
  • Gini and other measures
  • Yan: The classification (and regression) trees
    discussed in this chapter use several criteria to
    select the attributes and split points, such as
    misclassification error, the Gini index, and
    cross-entropy. When should we use each of these
    criteria? Is entropy preferred over the other
    two? (Also, to clarify: does the Gini index refer
    to gain ratio, and cross-entropy to information
    gain?)
  • Jian Zhang: From the book we know that the Gini
    index has many nice properties, like a tight
    upper bound on the error, an interpretation as a
    training error rate under probabilistic
    classification, etc. Should we prefer it for
    classification tasks for those reasons?
  • Weng-keen: How is the Gini index equal to the
    training error?
  • Yan Jun: For tree-based methods, can you give an
    intuitive explanation of the Gini index measure?
  • Ben: What does minimizing node impurity mean? Is
    it just to decrease the overall variance? Is
    there any implication for the bias? How does the
    usual bias-variance tradeoff play a role here?

30
Questions and Discussions
  • HME
  • Yan: In my understanding, HME is more like a
    neural network combined with Gaussian linear
    regression (or logistic regression), in the sense
    that the input of the neural network is the
    output of the Gaussian regression. Is my
    understanding right?
  • Ben: For the 2-class case depicted in Fig. 9.13
    (HME), why do we need two levels of mixtures (two
    layers)?
  • Ben: In Eq. 9.30, why are the two upper bounds
    of the summations both the same K? -- Yes

31
Reference
  • Fisher, W. D. (1958). On grouping for maximum
    homogeneity. Journal of the American Statistical
    Association, 53, 789-798.
  • Breiman, L., Friedman, J. H., Olshen, R. A., and
    Stone, C. J. (1984). Classification and
    Regression Trees. Wadsworth International Group.
  • Jordan, M. I., and Jacobs, R. A. (1994).
    Hierarchical mixtures of experts and the EM
    algorithm. Neural Computation, 6, 181-214.

32
Hierarchical Mixture Experts
  • The softmax function derives naturally from
    log-linear models and leads to convenient
    interpretations of the weights in terms of odds
    ratios. You could, however, use a variety of
    other nonnegative functions on the real line in
    place of the exp function. Or you could constrain
    the net inputs to the output units to be
    nonnegative, and just divide by the sum--that's
    called the Bradley-Terry-Luce model
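A small Python illustration of the two normalizations described in this note (the function names and the toy inputs are mine, not from the slides):

  import numpy as np

  def softmax(z):
      """Exponentiate, then normalize; shift by max(z) for numerical stability."""
      e = np.exp(z - z.max())
      return e / e.sum()

  def normalized_gate(w):
      """Bradley-Terry-Luce-style gate: nonnegative inputs divided by their sum."""
      w = np.asarray(w, dtype=float)
      assert (w >= 0).all(), "inputs must be nonnegative"
      return w / w.sum()

  z = np.array([1.0, 2.0, 0.5])
  print(softmax(z))                          # softmax gating weights
  print(normalized_gate([0.5, 2.0, 1.5]))    # normalized nonnegative inputs

Both produce a vector of nonnegative weights that sum to one, so either can serve as gating probabilities.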

33
Hierarchical Mixture Experts