Transcript and Presenter's Notes

Title: Boosting


1
Boosting
  • LING 572
  • Fei Xia
  • 02/01/06

2
Outline
  • Basic concepts
  • Theoretical validity
  • Case study
  • POS tagging
  • Summary

3
Basic concepts
4
Overview of boosting
  • Introduced by Schapire and Freund in the 1990s.
  • Boosting converts a weak learning algorithm
    into a strong one.
  • Main idea: combine many weak classifiers to
    produce a powerful committee.
  • Algorithms
  • AdaBoost (adaptive boosting)
  • Gentle AdaBoost
  • BrownBoost

5
Bagging
[Diagram: T classifiers f1, ..., fT, each trained by the learner ML on a random sample drawn with replacement from the training data; their outputs are combined into f.]
6
Boosting
[Diagram: f1 is trained by ML on the original training sample; each later classifier f2, ..., fT is trained on a reweighted sample; the T outputs are combined into f.]

7
Main ideas
  • Train a set of weak hypotheses h1, ..., hT.
  • The combined hypothesis H is a weighted majority
    vote of the T weak hypotheses.
  • Each hypothesis ht has a weight αt.
  • During the training, focus on the examples that
    are misclassified.
  • → At round t, example xi has the weight Dt(i).

8
Algorithm highlight
  • Training time: learn the pairs (h1, α1), ..., (hT, αT)
  • Test time: for x,
  • Call each classifier ht and calculate ht(x)
  • Calculate the sum Σt αt ht(x)
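A minimal Python sketch of this test-time combination, assuming each weak classifier is a function returning -1 or +1 and that alphas holds the corresponding weights (the names are illustrative, not from the slides):

    def adaboost_predict(x, hypotheses, alphas):
        # Weighted vote: sum alpha_t * h_t(x) over all weak classifiers,
        # then take the sign as the final label H(x).
        score = sum(a * h(x) for h, a in zip(hypotheses, alphas))
        return 1 if score >= 0 else -1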

9
Basic Setting
  • Binary classification problem
  • Training data: (x1, y1), ..., (xm, ym) with yi ∈ {-1, +1}
  • Dt(i): the weight of xi at round t; initially D1(i) = 1/m.
  • A learner L that finds a weak hypothesis ht: X → Y
    given the training set and Dt
  • The error of a weak hypothesis ht:
    εt = Σ_{i: ht(xi) ≠ yi} Dt(i)

10
The basic AdaBoost algorithm
  • For t = 1, ..., T
  • Train weak learner ht: X → {-1, +1} using the
    training data and Dt
  • Get the error rate εt = Σ_{i: ht(xi) ≠ yi} Dt(i)
  • Choose classifier weight αt = 1/2 ln((1 - εt) / εt)
  • Update the instance weights
    Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt,
    where Zt normalizes Dt+1 to sum to 1
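A compact sketch of this loop, assuming a weak_learner(data, D) helper that returns a classifier mapping x to -1 or +1 with error rate strictly between 0 and 0.5 (the helper and the data layout are assumptions, not part of the slides):

    import math

    def adaboost_train(data, weak_learner, T):
        # data: list of (x, y) pairs with y in {-1, +1}
        m = len(data)
        D = [1.0 / m] * m                                   # D_1(i) = 1/m
        hypotheses, alphas = [], []
        for t in range(T):
            h = weak_learner(data, D)                       # train h_t using D_t
            eps = sum(D[i] for i, (x, y) in enumerate(data) if h(x) != y)
            alpha = 0.5 * math.log((1 - eps) / eps)         # classifier weight
            # Reweight: shrink correctly classified examples, grow misclassified ones.
            D = [D[i] * math.exp(-alpha * y * h(x)) for i, (x, y) in enumerate(data)]
            Z = sum(D)                                      # normalization factor Z_t
            D = [d / Z for d in D]
            hypotheses.append(h)
            alphas.append(alpha)
        return hypotheses, alphas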

11
The new weights
When ht(xi) = yi (correctly classified): Dt+1(i) = Dt(i) e^(-αt) / Zt → the weight decreases
When ht(xi) ≠ yi (misclassified): Dt+1(i) = Dt(i) e^(αt) / Zt → the weight increases
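A small numeric illustration (the error rate 0.25 is a made-up value, not from the slides):

    import math

    eps = 0.25
    alpha = 0.5 * math.log((1 - eps) / eps)   # 0.5 * ln 3, about 0.55
    print(math.exp(-alpha))   # about 0.58: correctly classified weights shrink
    print(math.exp(alpha))    # about 1.73: misclassified weights grow (before dividing by Z_t)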
12
An example
13
Two iterations
[Figure: the example data and weights after the initial weighting, the 1st iteration, and the 2nd iteration.]
14
The general AdaBoost algorithm
15
The basic and general algorithms
  • In the basic algorithm, it can be proven that the
    choice αt = 1/2 ln((1 - εt) / εt) minimizes Zt.
  • The hypothesis weight αt is decided at round t.
  • Dt (the weight distribution of training examples)
    is updated at every round t.
  • Choice of weak learner
  • its error should be less than 0.5
  • Ex: decision tree (C4.5), decision stump

16
Experimental results (Freund and Schapire, 1996)
[Figure: error rates on a set of 27 benchmark problems.]
17
Theoretical validity
18
Training error of H(x)
Final hypothesis: H(x) = sign(Σt αt ht(x))
Training error is defined to be the fraction of training examples with H(xi) ≠ yi, i.e., (1/m) Σi 1[H(xi) ≠ yi]
It can be proved that the training error ≤ Πt Zt
19
Training error for basic algorithm
Let γt = 1/2 - εt (the edge of ht over random guessing)
Training error ≤ Πt 2√(εt(1 - εt)) = Πt √(1 - 4γt²) ≤ exp(-2 Σt γt²)
→ Training error drops exponentially fast.
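For instance, plugging a constant edge γt = 0.1 into the last bound gives exp(-0.02 T); a quick check of how fast it falls (the edge value is a made-up illustration):

    import math

    gamma = 0.1
    for T in (50, 100, 200):
        print(T, math.exp(-2 * T * gamma ** 2))   # about 0.37, 0.14, 0.02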
20
Generalization error (expected test error)
  • Generalization error, with high probability, is
    at most (training error) + Õ(√(Td / m))
  • T: the number of rounds of boosting
  • m: the size of the sample
  • d: VC-dimension of the base classifier space

21
Selecting weak hypotheses
  • Training error ≤ Πt Zt, where
    Zt = Σi Dt(i) exp(-αt yi ht(xi))
  • Choose ht that minimizes Zt.
  • See the case study for details.
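A sketch of this selection step over a pool of candidate hypotheses, reusing the basic-algorithm rule for αt (the candidate pool and data layout are assumptions):

    import math

    def z_value(h, data, D):
        # Z_t = sum_i D_t(i) * exp(-alpha_t * y_i * h(x_i)), with the basic alpha_t
        eps = sum(D[i] for i, (x, y) in enumerate(data) if h(x) != y)
        alpha = 0.5 * math.log((1 - eps) / eps)
        return sum(D[i] * math.exp(-alpha * y * h(x)) for i, (x, y) in enumerate(data))

    def select_hypothesis(candidates, data, D):
        # Pick the candidate weak hypothesis with the smallest Z_t.
        return min(candidates, key=lambda h: z_value(h, data, D))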

22
Multiclass boosting
23
Two ways
  • Converting a multiclass problem to a binary
    problem first
  • One-vs-all
  • All-pairs
  • ECOC
  • Extending boosting directly
  • AdaBoost.M1
  • AdaBoost.M2 → Prob 2 in Hw5

24
Case study
25
Overview (Abney, Schapire and Singer, 1999)
  • Boosting applied to Tagging and PP attachment
  • Issues
  • How to learn weak hypotheses?
  • How to deal with multi-class problems?
  • Local decision vs. globally best sequence

26
Weak hypotheses
  • In this paper, a weak hypothesis h simply tests a
    predicate (a.k.a. feature) F
  • h(x) = p1 if F(x) is true, h(x) = p0 otherwise
  • → h(x) = p_F(x)
  • Examples
  • POS tagging: F is PreviousWord = the
  • PP attachment: F is V = accused, N1 = president,
    P = of
  • Choosing a list of hypotheses → choosing a list
    of features.
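A sketch of such a predicate-based weak hypothesis; the dictionary feature representation and the confidence values p0, p1 are illustrative assumptions:

    def make_weak_hypothesis(predicate, p0, p1):
        # h(x) = p1 if the predicate (feature) fires on x, otherwise p0.
        return lambda x: p1 if predicate(x) else p0

    # Example for POS tagging: the predicate "previous word is 'the'".
    prev_word_is_the = lambda x: x.get("prev_word") == "the"
    h = make_weak_hypothesis(prev_word_is_the, p0=-0.2, p1=0.7)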

27
Finding weak hypotheses
  • The training error of the combined hypothesis is
    at most Πt Zt,
  • where Zt = Σi Dt(i) exp(-yi ht(xi))
    (the weight αt is absorbed into the confidence-rated ht)
  • → choose ht that minimizes Zt.
  • ht corresponds to a (Ft, p0, p1) tuple.

28
  • Schapire and Singer (1998) show that given a
    predicate F, Zt is minimized when
    p_b = 1/2 ln(W_+^b / W_-^b), for b ∈ {0, 1}

where W_s^b = Σ Dt(i) over the examples with F(xi) = b and yi = s,
and the resulting minimum is Zt = 2 Σ_b √(W_+^b W_-^b)
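A minimal sketch of this computation for a single predicate F, following the formulas above; the smoothing constant is an assumption to avoid division by zero on unseen blocks:

    import math

    def best_predictions(F, data, D, smooth=1e-6):
        # W[b][s]: total weight of examples with F(x) = b and label y = s.
        W = {0: {+1: 0.0, -1: 0.0}, 1: {+1: 0.0, -1: 0.0}}
        for i, (x, y) in enumerate(data):
            W[1 if F(x) else 0][y] += D[i]
        # Optimal predictions p_b and the resulting minimum Z_t.
        p = {b: 0.5 * math.log((W[b][+1] + smooth) / (W[b][-1] + smooth)) for b in (0, 1)}
        Z = 2 * sum(math.sqrt(W[b][+1] * W[b][-1]) for b in (0, 1))
        return p[0], p[1], Z

Running this over every candidate predicate and keeping the one with the smallest Z implements the selection described on the next slide.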
29
Finding weak hypotheses (cont)
  • For each F, calculate Zt
  • Choose the one with min Zt.

30
Boosting results on POS tagging
31
Sequential model
  • Sequential model: a Viterbi-style optimization to
    choose a globally best sequence of labels.

32
Previous results
33
Summary
34
Main ideas
  • Boosting combines many weak classifiers to
    produce a powerful committee.
  • The base learning algorithm only needs to be
    better than random guessing.
  • The instance weights are updated during training
    to put more emphasis on hard examples.

35
Strengths of AdaBoost
  • Theoretical validity: it comes with a set of
    theoretical guarantees (e.g., bounds on training
    error and test error).
  • It performs well on many tasks.
  • It can identify outliers, i.e., examples that are
    either mislabeled or that are inherently
    ambiguous and hard to categorize.

36
Weaknesses of AdaBoost
  • The actual performance of boosting depends on the
    data and the base learner.
  • Boosting seems to be especially susceptible to
    noise.
  • When the number of outliers is very large, the
    emphasis placed on the hard examples can hurt
    performance.
  • → Gentle AdaBoost, BrownBoost

37
Other properties
  • Simplicity (conceptual)
  • Efficiency at training
  • Efficiency at testing time
  • Handling multi-class
  • Interpretability

38
Bagging vs. Boosting (Freund and Schapire 1996)
  • Bagging always uses resampling rather than
    reweighting.
  • Bagging does not modify the weight distribution
    over examples or mislabels, but instead always
    uses the uniform distribution
  • In forming the final hypothesis, bagging gives
    equal weight to each of the weak hypotheses

39
Relation to other topics
  • Game theory
  • Linear programming
  • Bregman distances
  • Support-vector machines
  • Brownian motion
  • Logistic regression
  • Maximum-entropy methods such as iterative scaling.

40
Additional slides
41
Sources of Bias and Variance
  • Bias arises when the classifier cannot represent
    the true function; that is, the classifier
    underfits the data
  • Variance arises when the classifier overfits the
    data
  • There is often a tradeoff between bias and
    variance

42
Effect of Bagging
  • If the bootstrap replicate approximation were
    correct, then bagging would reduce variance
    without changing bias.
  • In practice, bagging can reduce both bias and
    variance
  • For high-bias classifiers, it can reduce bias
  • For high-variance classifiers, it can reduce
    variance

43
Effect of Boosting
  • In the early iterations, boosting is primarily a
    bias-reducing method
  • In later iterations, it appears to be primarily a
    variance-reducing method

44
How to choose αt for ht with range [-1, 1]?
  • Training error ≤ Πt Zt, with
    Zt = Σi Dt(i) exp(-αt yi ht(xi))
  • Choose αt that minimizes Zt.

45
Issues
  • Given ht, how to choose αt?
  • How to select ht?

46
How to choose αt when ht has range [-1, 1]?
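The answer is not spelled out in this transcript; the standard choice from Schapire and Singer (1998) is: with rt = Σi Dt(i) yi ht(xi), the bound on Zt is minimized by αt = 1/2 ln((1 + rt) / (1 - rt)). A minimal sketch, assuming the same data layout as before:

    import math

    def choose_alpha(h, data, D):
        # r_t = sum_i D_t(i) * y_i * h(x_i), with h(x) in [-1, 1]
        r = sum(D[i] * y * h(x) for i, (x, y) in enumerate(data))
        return 0.5 * math.log((1 + r) / (1 - r))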