Transcript and Presenter's Notes

Title: Boosting


1
Boosting
  • LING 572
  • Fei Xia
  • 02/01/06

2
Outline
  • Basic concepts
  • Theoretical validity
  • Case study
  • POS tagging
  • Summary

3
Basic concepts
4
Overview of boosting
  • Introduced by Schapire and Freund in the 1990s.
  • Boosting converts a weak learning algorithm
    into a strong one.
  • Main idea: combine many weak classifiers to
    produce a powerful committee.
  • Algorithms
  • AdaBoost (adaptive boosting)
  • Gentle AdaBoost
  • BrownBoost

5
Bagging
[Diagram: T classifiers f1, ..., fT, each trained by the learner ML on a random sample drawn with replacement from the training data; their outputs are combined into f.]
6
Boosting
[Diagram: f1 is trained by ML on the original training sample; each later classifier f2, ..., fT is trained on a reweighted sample; the T outputs are combined into f.]

7
Main ideas
  • Train a set of weak hypotheses h1, ..., hT.
  • The combined hypothesis H is a weighted majority
    vote of the T weak hypotheses.
  • Each hypothesis ht has a weight αt.
  • During the training, focus on the examples that
    are misclassified.
  • → At round t, example xi has the weight Dt(i).

8
Algorithm highlight
  • Training time: learn the pairs (h1, α1), ..., (hT, αT)
  • Test time: for x,
  • Call each classifier ht and calculate ht(x)
  • Calculate the sum Σt αt ht(x)
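A minimal Python sketch of this test-time combination, assuming each weak classifier is a function returning -1 or +1 and that alphas holds the corresponding weights (the names are illustrative, not from the slides):

    def adaboost_predict(x, hypotheses, alphas):
        # Weighted vote: sum alpha_t * h_t(x) over all weak classifiers,
        # then take the sign as the final label H(x).
        score = sum(a * h(x) for h, a in zip(hypotheses, alphas))
        return 1 if score >= 0 else -1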

9
Basic Setting
  • Binary classification problem
  • Training data: (x1, y1), ..., (xm, ym) with yi ∈ {-1, +1}
  • Dt(i): the weight of xi at round t; initially D1(i) = 1/m.
  • A learner L that finds a weak hypothesis ht: X → Y
    given the training set and Dt
  • The error of a weak hypothesis ht:
    εt = Σ_{i: ht(xi) ≠ yi} Dt(i)

10
The basic AdaBoost algorithm
  • For t = 1, ..., T
  • Train weak learner ht: X → {-1, +1} using the
    training data and Dt
  • Get the error rate εt = Σ_{i: ht(xi) ≠ yi} Dt(i)
  • Choose classifier weight αt = 1/2 ln((1 - εt) / εt)
  • Update the instance weights
    Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt,
    where Zt normalizes Dt+1 to sum to 1
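A compact sketch of this loop, assuming a weak_learner(data, D) helper that returns a classifier mapping x to -1 or +1 with error rate strictly between 0 and 0.5 (the helper and the data layout are assumptions, not part of the slides):

    import math

    def adaboost_train(data, weak_learner, T):
        # data: list of (x, y) pairs with y in {-1, +1}
        m = len(data)
        D = [1.0 / m] * m                                   # D_1(i) = 1/m
        hypotheses, alphas = [], []
        for t in range(T):
            h = weak_learner(data, D)                       # train h_t using D_t
            eps = sum(D[i] for i, (x, y) in enumerate(data) if h(x) != y)
            alpha = 0.5 * math.log((1 - eps) / eps)         # classifier weight
            # Reweight: shrink correctly classified examples, grow misclassified ones.
            D = [D[i] * math.exp(-alpha * y * h(x)) for i, (x, y) in enumerate(data)]
            Z = sum(D)                                      # normalization factor Z_t
            D = [d / Z for d in D]
            hypotheses.append(h)
            alphas.append(alpha)
        return hypotheses, alphas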

11
The new weights
When ht(xi) = yi (correctly classified): Dt+1(i) = Dt(i) e^(-αt) / Zt → the weight decreases
When ht(xi) ≠ yi (misclassified): Dt+1(i) = Dt(i) e^(αt) / Zt → the weight increases
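A small numeric illustration (the error rate 0.25 is a made-up value, not from the slides):

    import math

    eps = 0.25
    alpha = 0.5 * math.log((1 - eps) / eps)   # 0.5 * ln 3, about 0.55
    print(math.exp(-alpha))   # about 0.58: correctly classified weights shrink
    print(math.exp(alpha))    # about 1.73: misclassified weights grow (before dividing by Z_t)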
12
An example
13
Two iterations
[Figure: the example data and weights after the initial weighting, the 1st iteration, and the 2nd iteration.]
14
The general AdaBoost algorithm
15
The basic and general algorithms
  • In the basic algorithm, it can be proven that the
    choice αt = 1/2 ln((1 - εt) / εt) minimizes Zt.
  • The hypothesis weight αt is decided at round t.
  • Dt (the weight distribution of training examples)
    is updated at every round t.
  • Choice of weak learner
  • its error should be less than 0.5
  • Ex: decision tree (C4.5), decision stump

16
Experimental results (Freund and Schapire, 1996)
[Figure: error rates on a set of 27 benchmark problems.]
17
Theoretical validity
18
Training error of H(x)
Final hypothesis: H(x) = sign(Σt αt ht(x))
Training error is defined to be the fraction of training examples with H(xi) ≠ yi, i.e., (1/m) Σi 1[H(xi) ≠ yi]
It can be proved that the training error ≤ Πt Zt
19
Training error for basic algorithm
Let γt = 1/2 - εt (the edge of ht over random guessing)
Training error ≤ Πt 2√(εt(1 - εt)) = Πt √(1 - 4γt²) ≤ exp(-2 Σt γt²)
→ Training error drops exponentially fast.
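For instance, plugging a constant edge γt = 0.1 into the last bound gives exp(-0.02 T); a quick check of how fast it falls (the edge value is a made-up illustration):

    import math

    gamma = 0.1
    for T in (50, 100, 200):
        print(T, math.exp(-2 * T * gamma ** 2))   # about 0.37, 0.14, 0.02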
20
Generalization error (expected test error)
  • Generalization error, with high probability, is
    at most (training error) + Õ(√(Td / m))
  • T: the number of rounds of boosting
  • m: the size of the sample
  • d: VC-dimension of the base classifier space

21
Selecting weak hypotheses
  • Training error ≤ Πt Zt, where
    Zt = Σi Dt(i) exp(-αt yi ht(xi))
  • Choose ht that minimizes Zt.
  • See the case study for details.
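A sketch of this selection step over a pool of candidate hypotheses, reusing the basic-algorithm rule for αt (the candidate pool and data layout are assumptions):

    import math

    def z_value(h, data, D):
        # Z_t = sum_i D_t(i) * exp(-alpha_t * y_i * h(x_i)), with the basic alpha_t
        eps = sum(D[i] for i, (x, y) in enumerate(data) if h(x) != y)
        alpha = 0.5 * math.log((1 - eps) / eps)
        return sum(D[i] * math.exp(-alpha * y * h(x)) for i, (x, y) in enumerate(data))

    def select_hypothesis(candidates, data, D):
        # Pick the candidate weak hypothesis with the smallest Z_t.
        return min(candidates, key=lambda h: z_value(h, data, D))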

22
Multiclass boosting
23
Two ways
  • Converting a multiclass problem to a binary
    problem first
  • One-vs-all
  • All-pairs
  • ECOC
  • Extending boosting directly
  • AdaBoost.M1
  • AdaBoost.M2 → Prob 2 in Hw5

24
Case study
25
Overview (Abney, Schapire and Singer, 1999)
  • Boosting applied to Tagging and PP attachment
  • Issues
  • How to learn weak hypotheses?
  • How to deal with multi-class problems?
  • Local decision vs. globally best sequence

26
Weak hypotheses
  • In this paper, a weak hypothesis h simply tests a
    predicate (a.k.a. feature) F
  • h(x) = p1 if F(x) is true, h(x) = p0 otherwise
  • → h(x) = p_F(x)
  • Examples
  • POS tagging: F is PreviousWord = the
  • PP attachment: F is V = accused, N1 = president,
    P = of
  • Choosing a list of hypotheses → choosing a list
    of features.
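A sketch of such a predicate-based weak hypothesis; the dictionary feature representation and the confidence values p0, p1 are illustrative assumptions:

    def make_weak_hypothesis(predicate, p0, p1):
        # h(x) = p1 if the predicate (feature) fires on x, otherwise p0.
        return lambda x: p1 if predicate(x) else p0

    # Example for POS tagging: the predicate "previous word is 'the'".
    prev_word_is_the = lambda x: x.get("prev_word") == "the"
    h = make_weak_hypothesis(prev_word_is_the, p0=-0.2, p1=0.7)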

27
Finding weak hypotheses
  • The training error of the combined hypothesis is
    at most Πt Zt,
  • where Zt = Σi Dt(i) exp(-yi ht(xi))
    (the weight αt is absorbed into the confidence-rated ht)
  • → choose ht that minimizes Zt.
  • ht corresponds to a (Ft, p0, p1) tuple.

28
  • Schapire and Singer (1998) show that given a
    predicate F, Zt is minimized when
    p_b = 1/2 ln(W_+^b / W_-^b), for b ∈ {0, 1}

where W_s^b = Σ Dt(i) over the examples with F(xi) = b and yi = s,
and the resulting minimum is Zt = 2 Σ_b √(W_+^b W_-^b)
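A minimal sketch of this computation for a single predicate F, following the formulas above; the smoothing constant is an assumption to avoid division by zero on unseen blocks:

    import math

    def best_predictions(F, data, D, smooth=1e-6):
        # W[b][s]: total weight of examples with F(x) = b and label y = s.
        W = {0: {+1: 0.0, -1: 0.0}, 1: {+1: 0.0, -1: 0.0}}
        for i, (x, y) in enumerate(data):
            W[1 if F(x) else 0][y] += D[i]
        # Optimal predictions p_b and the resulting minimum Z_t.
        p = {b: 0.5 * math.log((W[b][+1] + smooth) / (W[b][-1] + smooth)) for b in (0, 1)}
        Z = 2 * sum(math.sqrt(W[b][+1] * W[b][-1]) for b in (0, 1))
        return p[0], p[1], Z

Running this over every candidate predicate and keeping the one with the smallest Z implements the selection described on the next slide.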
29
Finding weak hypotheses (cont)
  • For each F, calculate Zt
  • Choose the one with min Zt.

30
Boosting results on POS tagging
31
Sequential model
  • Sequential model: a Viterbi-style optimization to
    choose a globally best sequence of labels.

32
Previous results
33
Summary
34
Main ideas
  • Boosting combines many weak classifiers to
    produce a powerful committee.
  • The base learning algorithm only needs to be
    better than random guessing.
  • The instance weights are updated during training
    to put more emphasis on hard examples.

35
Strengths of AdaBoost
  • Theoretical validity: it comes with a set of
    theoretical guarantees (e.g., bounds on training
    error and test error).
  • It performs well on many tasks.
  • It can identify outliers, i.e., examples that are
    either mislabeled or that are inherently
    ambiguous and hard to categorize.

36
Weaknesses of AdaBoost
  • The actual performance of boosting depends on the
    data and the base learner.
  • Boosting seems to be especially susceptible to
    noise.
  • When the number of outliers is very large, the
    emphasis placed on the hard examples can hurt
    performance.
  • → Gentle AdaBoost, BrownBoost

37
Other properties
  • Simplicity (conceptual)
  • Efficiency at training
  • Efficiency at testing time
  • Handling multi-class
  • Interpretability

38
Bagging vs. Boosting (Freund and Schapire 1996)
  • Bagging always uses resampling rather than
    reweighting.
  • Bagging does not modify the weight distribution
    over examples or mislabels, but instead always
    uses the uniform distribution
  • In forming the final hypothesis, bagging gives
    equal weight to each of the weak hypotheses

39
Relation to other topics
  • Game theory
  • Linear programming
  • Bregman distances
  • Support-vector machines
  • Brownian motion
  • Logistic regression
  • Maximum-entropy methods such as iterative scaling.

40
Additional slides
41
Sources of Bias and Variance
  • Bias arises when the classifier cannot represent
    the true function; that is, the classifier
    underfits the data
  • Variance arises when the classifier overfits the
    data
  • There is often a tradeoff between bias and
    variance

42
Effect of Bagging
  • If the bootstrap replicate approximation were
    correct, then bagging would reduce variance
    without changing bias.
  • In practice, bagging can reduce both bias and
    variance
  • For high-bias classifiers, it can reduce bias
  • For high-variance classifiers, it can reduce
    variance

43
Effect of Boosting
  • In the early iterations, boosting is primarily a
    bias-reducing method
  • In later iterations, it appears to be primarily a
    variance-reducing method

44
How to choose αt for ht with range [-1, 1]?
  • Training error ≤ Πt Zt, with
    Zt = Σi Dt(i) exp(-αt yi ht(xi))
  • Choose αt that minimizes Zt.

45
Issues
  • Given ht, how to choose αt?
  • How to select ht?

46
How to choose αt when ht has range [-1, 1]?
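The answer is not spelled out in this transcript; the standard choice from Schapire and Singer (1998) is: with rt = Σi Dt(i) yi ht(xi), the bound on Zt is minimized by αt = 1/2 ln((1 + rt) / (1 - rt)). A minimal sketch, assuming the same data layout as before:

    import math

    def choose_alpha(h, data, D):
        # r_t = sum_i D_t(i) * y_i * h(x_i), with h(x) in [-1, 1]
        r = sum(D[i] * y * h(x) for i, (x, y) in enumerate(data))
        return 0.5 * math.log((1 + r) / (1 - r))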