Title: Boosting
- LING 572
- Fei Xia
- 02/01/06
- Basic concepts
- Theoretical validity
- Case study
- POS tagging
- Summary
3Basic concepts
4Overview of boosting
- Introduced by Schapire and Freund in the 1990s.
- Boosting converts a weak learning algorithm into a strong one.
- Main idea: combine many weak classifiers to produce a powerful committee.
- Algorithms
  - AdaBoost (adaptive boosting)
  - Gentle AdaBoost
  - BrownBoost
[Figure labels: Training Sample; Random sample with replacement; Weighted Sample]
7Main ideas
- Train a set of weak hypotheses $h_1, \ldots, h_T$.
- The combined hypothesis $H$ is a weighted majority vote of the $T$ weak hypotheses.
  - Each hypothesis $h_t$ has a weight $\alpha_t$.
- During training, focus on the examples that are misclassified.
  - → At round $t$, example $x_i$ has the weight $D_t(i)$.
8Algorithm highlight
- Training time: learn $(h_1, \alpha_1), \ldots, (h_t, \alpha_t), \ldots$
- Test time: for an input $x$,
  - Call each classifier $h_t$ and calculate $h_t(x)$.
  - Calculate the sum $\sum_t \alpha_t h_t(x)$.
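The test-time procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the `predict` function, the three threshold classifiers, and their weights are made up for the example, and a sum of exactly 0 is assumed to go to the +1 class.

```python
def predict(ensemble, x):
    """Weighted majority vote: the sign of sum_t alpha_t * h_t(x)."""
    score = sum(alpha * h(x) for h, alpha in ensemble)
    return 1 if score >= 0 else -1  # assumption: ties go to +1


# Hypothetical ensemble of (h_t, alpha_t) pairs on a real-valued input.
ensemble = [
    (lambda x: 1 if x > 0.0 else -1, 0.8),
    (lambda x: 1 if x > 2.0 else -1, 0.3),
    (lambda x: 1 if x > -1.0 else -1, 0.4),
]

label_a = predict(ensemble, 1.0)   # votes: +0.8 - 0.3 + 0.4
label_b = predict(ensemble, -0.5)  # votes: -0.8 - 0.3 + 0.4
```

A classifier with a large weight can outvote several classifiers with small weights, which is the difference from a plain majority vote.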
9Basic Setting
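- Binary classification problem
- Training data: $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$ and $y_i \in Y = \{-1, 1\}$
- $D_t(i)$: the weight of $x_i$ at round $t$. $D_1(i) = 1/m$.
- A learner $L$ that finds a weak hypothesis $h_t : X \to Y$ given the training set and $D_t$
- The error of a weak hypothesis $h_t$: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i : h_t(x_i) \neq y_i} D_t(i)$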
10The basic AdaBoost algorithm
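- For $t = 1, \ldots, T$:
  - Train weak learner $h_t : X \to \{-1, 1\}$ using training data and $D_t$.
  - Get the error rate $\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} D_t(i)$.
  - Choose classifier weight $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
  - Update the instance weights: $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ normalizes $D_{t+1}$ to sum to 1.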
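As a rough sketch of the loop above, the following Python code runs AdaBoost on one-dimensional data. It assumes the standard choices $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and $D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t y_i h_t(x_i))$. The threshold-stump weak learner, the function names, and the toy data are illustrative assumptions, not part of the slides.

```python
import math


def train_stump(xs, ys, weights):
    """Assumed weak learner: the threshold stump with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            h = lambda x, thr=thr, sign=sign: sign if x >= thr else -sign
            err = sum(w for x, y, w in zip(xs, ys, weights) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best  # (weighted error, hypothesis)


def adaboost(xs, ys, rounds):
    m = len(xs)
    weights = [1.0 / m] * m                      # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        err, h = train_stump(xs, ys, weights)
        if err >= 0.5:                           # no better than chance: stop
            break
        err = max(err, 1e-10)                    # guard against log of infinity
        alpha = 0.5 * math.log((1 - err) / err)  # classifier weight alpha_t
        ensemble.append((h, alpha))
        # Raise the weight of misclassified examples, lower the others.
        weights = [w * math.exp(-alpha * y * h(x))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)                         # normalizer Z_t
        weights = [w / z for w in weights]
    return ensemble


def classify(ensemble, x):
    return 1 if sum(a * h(x) for h, a in ensemble) >= 0 else -1


# Toy data (hypothetical): no single stump separates the two classes.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
```

On this toy set a single stump always misclassifies at least two points, but after three rounds the weighted vote classifies every training point correctly.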
11The new weights
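The content of this slide did not survive extraction. As a hedged reconstruction, assuming the standard AdaBoost update with $h_t(x) \in \{-1, 1\}$, the new weight of an example depends only on whether $h_t$ classified it correctly:

```latex
\[
D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times
\begin{cases}
e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\
e^{\alpha_t}  & \text{if } h_t(x_i) \neq y_i
\end{cases}
\]

% With alpha_t = (1/2) ln((1 - eps_t) / eps_t) and Z_t = 2 sqrt(eps_t (1 - eps_t)):
\[
D_{t+1}(i) =
\begin{cases}
\dfrac{D_t(i)}{2(1 - \epsilon_t)} & \text{if } h_t(x_i) = y_i \\[1.5ex]
\dfrac{D_t(i)}{2\epsilon_t}       & \text{if } h_t(x_i) \neq y_i
\end{cases}
\]
```

Under these assumptions the misclassified examples together carry exactly half of the total weight after the update, so $h_t$ is no better than chance on $D_{t+1}$ and the next weak learner must find something new.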
12An example
13Two iterations
Initial weights
1st iteration
2nd iteration
14The general AdaBoost algorithm
15The basic and general algorithms
- In the basic algorithm, it can be proven that
- The hypothesis weight $\alpha_t$ is decided at round $t$.
- $D_t$ (the weight distribution over training examples) is updated at every round $t$.
- Choice of weak learner
  - Its error should be less than 0.5.
  - Examples: decision tree (C4.5), decision stump
16Experiment results (Freund and Schapire, 1996)
Error rate on a set of 27 benchmark problems
17Theoretical validity
18Training error of H(x)
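Final hypothesis: $H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$
Training error is defined to be $\frac{1}{m} |\{i : H(x_i) \neq y_i\}|$
It can be proved that training error $\leq \prod_t Z_t$, where $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$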
19Training error for basic algorithm
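Training error $\leq \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left(-2 \sum_t \gamma_t^2\right)$, where $\gamma_t = \frac{1}{2} - \epsilon_t$
→ If every weak hypothesis is slightly better than random ($\gamma_t \geq \gamma > 0$), the training error drops exponentially fast in $T$.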
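To get a feel for how fast the bound shrinks, the sketch below evaluates the two standard upper bounds on the training error, $\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$ and $\exp(-2\sum_t \gamma_t^2)$ with $\gamma_t = \frac{1}{2} - \epsilon_t$, for a given list of per-round error rates. The function name and the error rates are hypothetical, chosen only for illustration.

```python
import math


def training_error_bounds(error_rates):
    """Return (product bound, exponential bound) for per-round errors eps_t."""
    product_bound = 1.0
    gamma_sq_sum = 0.0
    for eps in error_rates:
        product_bound *= 2.0 * math.sqrt(eps * (1.0 - eps))  # Z_t for this round
        gamma_sq_sum += (0.5 - eps) ** 2                     # gamma_t squared
    return product_bound, math.exp(-2.0 * gamma_sq_sum)


# Hypothetical weak learner that is always just 10 points better than chance.
bounds_50 = training_error_bounds([0.4] * 50)
bounds_100 = training_error_bounds([0.4] * 100)
```

Even with an error rate of 0.4 in every round, both bounds keep shrinking as rounds are added, and the product bound is never larger than the exponential one.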
20Generalization error (expected test error)
- Generalization error, with high probability, is at most $\hat{\Pr}[H(x) \neq y] + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$
  - $T$: the number of rounds of boosting
  - $m$: the size of the sample
  - $d$: VC-dimension of the base classifier space
21Selecting weak hypotheses
- Training error $\leq \prod_t Z_t$
- Choose $h_t$ that minimizes $Z_t$.
- See case study for details.
22Multiclass boosting
23Two ways
- Converting a multiclass problem to binary problems first
  - One-vs-all
  - All-pairs
- Extending boosting directly
  - AdaBoost.M1
  - AdaBoost.M2 → Prob 2 in Hw5
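The one-vs-all reduction listed above can be sketched as follows. This is a generic illustration, not the procedure from any particular paper: the helper names are hypothetical, and each class is assumed to have its own real-valued binary scorer (for example an unthresholded boosted sum $\sum_t \alpha_t h_t(x)$).

```python
def one_vs_all_labels(labels, target):
    """Relabel a multiclass problem as binary: +1 for `target`, -1 otherwise."""
    return [1 if y == target else -1 for y in labels]


def one_vs_all_predict(scorers, x):
    """Pick the class whose binary scorer gives x the highest score.

    `scorers` maps each class to a real-valued function trained on that
    class's relabeled data.
    """
    return max(scorers, key=lambda c: scorers[c](x))


tags = ["N", "V", "N", "D"]
binary_for_n = one_vs_all_labels(tags, "N")  # [1, -1, 1, -1]
```

One binary problem is trained per class, so a task with k classes needs k boosted classifiers, while all-pairs needs one per pair of classes.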
24Case study
25Overview (Abney, Schapire and Singer, 1999)
- Boosting applied to Tagging and PP attachment
- Issues
- How to learn weak hypotheses?
- How to deal with multi-class problems?
- Local decision vs. globally best sequence
26Weak hypotheses
- In this paper, a weak hypothesis $h$ simply tests a predicate (a.k.a. feature) $F$:
  - $h(x) = p_1$ if $F(x)$ is true, $h(x) = p_0$ otherwise
  - → $h(x) = p_{F(x)}$
- Examples
  - POS tagging: $F$ is PreviousWord=the
  - PP attachment: $F$ is V=accused, N1=president, P=of
- Choosing a list of hypotheses → choosing a list of features.
27Finding weak hypotheses
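- The training error of the combined hypothesis is at most $\prod_t Z_t$,
- where $Z_t = \sum_i D_t(i) \exp(-y_i h_t(x_i))$ (the weight $\alpha_t$ is folded into the real-valued predictions $p_0$, $p_1$)
- → choose $h_t$ that minimizes $Z_t$.
- $h_t$ corresponds to a $(F_t, p_0, p_1)$ tuple.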
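- Schapire and Singer (1998) show that, given a predicate $F$, $Z_t$ is minimized when $p_j = \frac{1}{2} \ln \frac{W_{+1}^j}{W_{-1}^j}$ for $j \in \{0, 1\}$, where $W_b^j = \sum_{i : F(x_i) = j,\, y_i = b} D_t(i)$. This gives $Z_t = 2 \sum_j \sqrt{W_{+1}^j W_{-1}^j}$.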
29Finding weak hypotheses (cont)
- For each $F$, calculate $Z_t$.
- Choose the one with the minimum $Z_t$.
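The search over predicates can be sketched as below, assuming the Schapire and Singer closed form $p_j = \frac{1}{2}\ln(W_{+1}^j / W_{-1}^j)$ and $Z_t = 2\sum_j \sqrt{W_{+1}^j W_{-1}^j}$, where $W_b^j$ is the total weight of examples with $F(x) = j$ and label $b$. The function names, the small smoothing constant (added only to avoid taking the log of zero), and the toy tagging data are assumptions made for this example.

```python
import math


def fit_predicate(predicate, examples, labels, weights, smooth=1e-6):
    """Return (Z, p0, p1) for one predicate F under the current weights D_t."""
    # w[j][b]: total weight of examples with F(x) = j and label b.
    w = {0: {1: 0.0, -1: 0.0}, 1: {1: 0.0, -1: 0.0}}
    for x, y, d in zip(examples, labels, weights):
        w[1 if predicate(x) else 0][y] += d
    # Smoothed predictions (assumption: smoothing keeps the log finite).
    p = {j: 0.5 * math.log((w[j][1] + smooth) / (w[j][-1] + smooth))
         for j in (0, 1)}
    # Z is computed from the unsmoothed weights.
    z = 2.0 * sum(math.sqrt(w[j][1] * w[j][-1]) for j in (0, 1))
    return z, p[0], p[1]


def best_predicate(predicates, examples, labels, weights):
    """Return (name, Z, p0, p1) for the predicate with the smallest Z."""
    scored = [(fit_predicate(f, examples, labels, weights), name)
              for name, f in predicates.items()]
    (z, p0, p1), name = min(scored, key=lambda item: item[0][0])
    return name, z, p0, p1


# Hypothetical binary task: is the current word a noun (+1) or not (-1)?
examples = [{"prev": w} for w in ["the", "the", "to", "a", "to", "the"]]
labels = [1, 1, -1, 1, -1, -1]
weights = [1.0 / 6] * 6
predicates = {
    "PreviousWord=the": lambda x: x["prev"] == "the",
    "PreviousWord=to": lambda x: x["prev"] == "to",
}
name, z, p0, p1 = best_predicate(predicates, examples, labels, weights)
```

Here "PreviousWord=to" wins because every example it fires on has the same label, which drives one of the two square-root terms in $Z_t$ to zero.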
30Boosting results on POS tagging?
31Sequential model
- Sequential model: a Viterbi-style optimization to choose a globally best sequence of labels.
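A Viterbi-style search of this kind can be sketched as follows. This is a generic dynamic program, not the exact model from the paper: it assumes each position contributes a score that depends only on the previous and current label, and the scoring tables below are made up to show a case where a greedy left-to-right choice is not globally best.

```python
def viterbi(n, labels, local_score):
    """Best label sequence of length n under sum_i local_score(i, prev, cur).

    `prev` is None at position 0.
    """
    if n == 0:
        return []
    # best[y] = (score, path) of the best sequence so far that ends in label y.
    best = {y: (local_score(0, None, y), [y]) for y in labels}
    for i in range(1, n):
        new_best = {}
        for y in labels:
            score, path = max(
                ((best[p][0] + local_score(i, p, y), best[p][1]) for p in labels),
                key=lambda item: item[0],
            )
            new_best[y] = (score, path + [y])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical per-position and transition scores for a two-word input.
emit = [{"N": 1.0, "V": 1.2}, {"N": 0.5, "V": 0.4}]
trans = {("N", "N"): 0.0, ("N", "V"): 1.0, ("V", "N"): 0.0, ("V", "V"): -1.0}


def score(i, prev, cur):
    return emit[i][cur] + (trans[(prev, cur)] if prev is not None else 0.0)


best_sequence = viterbi(2, ["N", "V"], score)
```

A purely local decision would pick "V" for the first word because its score there is higher, but the sequence starting with "N" has the higher total once the transition score is counted.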
32Previous results
34Main ideas
- Boosting combines many weak classifiers to produce a powerful committee.
- The base learning algorithms only need to be better than random.
- The instance weights are updated during training to put more emphasis on hard examples.
35Strengths of AdaBoost
- Theoretical validity: it comes with a set of theoretical guarantees (e.g., on training error and test error).
- It performs well on many tasks.
- It can identify outliers, i.e., examples that are either mislabeled or inherently ambiguous and hard to categorize.
36Weakness of AdaBoost
- The actual performance of boosting depends on the data and the base learner.
- Boosting seems to be especially susceptible to noise.
- When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance.
  - → Gentle AdaBoost, BrownBoost
37Other properties
- Simplicity (conceptual)
- Efficiency at training
- Efficiency at testing time
- Handling multi-class
- Interpretability
38Bagging vs. Boosting (Freund and Schapire 1996)
- Bagging always uses resampling rather than reweighting.
- Bagging does not modify the weight distribution over examples or mislabels, but instead always uses the uniform distribution.
- In forming the final hypothesis, bagging gives equal weight to each of the weak hypotheses.
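The contrast above can be made concrete with a small sketch. The helper names are hypothetical, hypotheses are assumed to output -1 or +1, and ties are assumed to go to +1.

```python
import random


def bootstrap_sample(data, rng):
    """Bagging: draw len(data) examples uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def bagging_vote(hypotheses, x):
    """Bagging: every hypothesis gets the same vote."""
    return 1 if sum(h(x) for h in hypotheses) >= 0 else -1


def boosting_vote(weighted_hypotheses, x):
    """Boosting: hypothesis h_t votes with its own weight alpha_t."""
    return 1 if sum(alpha * h(x) for h, alpha in weighted_hypotheses) >= 0 else -1


rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicate = bootstrap_sample(list(range(10)), rng)

# Three hypothetical constant hypotheses: one says +1, two say -1.
hs = [lambda x: 1, lambda x: -1, lambda x: -1]
equal_vote = bagging_vote(hs, None)
weighted_vote = boosting_vote([(hs[0], 2.0), (hs[1], 0.5), (hs[2], 0.5)], None)
```

With equal votes the two -1 hypotheses win, while with the weights shown the single heavily weighted hypothesis decides the outcome.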
39Relation to other topics
- Game theory
- Linear programming
- Bregman distances
- Support-vector machines
- Brownian motion
- Logistic regression
- Maximum-entropy methods such as iterative scaling.
40Additional slides
41Sources of Bias and Variance
- Bias arises when the classifier cannot represent the true function; that is, the classifier underfits the data.
- Variance arises when the classifier overfits the data.
- There is often a tradeoff between bias and variance.
42Effect of Bagging
- If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias.
- In practice, bagging can reduce both bias and variance.
  - For high-bias classifiers, it can reduce bias.
  - For high-variance classifiers, it can reduce variance.
43Effect of Boosting
- In the early iterations, boosting is primarily a bias-reducing method.
- In later iterations, it appears to be primarily a variance-reducing method.
44How to choose $\alpha_t$ for $h_t$ with range $[-1, 1]$?
- Training error $\leq \prod_t Z_t$
- Choose $\alpha_t$ that minimizes $Z_t$.
- Given $h_t$, how to choose $\alpha_t$?
- How to select $h_t$?
46How to choose $\alpha_t$ when $h_t$ has range $\{-1, 1\}$?
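The body of this slide did not survive extraction. For the case where $h_t$ takes only the values $-1$ and $+1$, the standard derivation (given here as a reconstruction, not as the slide's original text) minimizes $Z_t$ as a function of $\alpha$, using the fact that $y_i h_t(x_i)$ is $+1$ on correctly classified examples and $-1$ on the rest:

```latex
\begin{align*}
Z_t(\alpha) &= \sum_i D_t(i)\, e^{-\alpha y_i h_t(x_i)}
             = (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} \\
\frac{dZ_t}{d\alpha} &= -(1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
  \;\Longrightarrow\; \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} \\
Z_t(\alpha_t) &= 2 \sqrt{\epsilon_t (1 - \epsilon_t)}
\end{align*}
```

Since $Z_t(\alpha_t) < 1$ whenever $\epsilon_t \neq \frac{1}{2}$, each round multiplies the bound $\prod_t Z_t$ on the training error by a factor smaller than 1.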