Title: Introduction to Boosting
1 Introduction to Boosting
- Slides adapted from Wanxiang Che at HIT and from Robin Dhamankar. Many thanks!
2 Ideas
- Boosting is considered one of the most significant developments in machine learning.
- Finding many weak rules of thumb is easier than finding a single, highly accurate prediction rule.
- The key is in how the weak rules are combined.
15 Boosting (Algorithm)
- W(x) is the distribution of weights over the N training points, with Σ W(x_i) = 1.
- Initially assign uniform weights W_0(x) = 1/N for all x; set step k = 0.
- At each iteration k:
  - Find the best weak classifier C_k(x) using weights W_k(x).
  - Compute its error rate e_k and, based on a loss function, the weight a_k that classifier C_k receives in the final hypothesis.
  - For each x_i, update the weights based on e_k to get W_{k+1}(x_i).
- C_FINAL(x) = sign( Σ a_i C_i(x) )
16 Boosting (Algorithm)
19 Boosting As Additive Model
- The final prediction in boosting, f(x), can be expressed as an additive expansion of individual classifiers.
- The process is iterative and can be expressed as a forward stage-wise fit, written out below.
- Typically we try to minimize a loss function on the training examples.
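The slide's equations were part of the image; a standard way to write the additive expansion and the forward stage-wise step (notation follows Hastie, Tibshirani and Friedman, with b(x; γ) a weak learner parameterized by γ) is:

```latex
f(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m), \qquad
f_m(x) = f_{m-1}(x) + \beta_m \, b(x; \gamma_m),
\qquad
(\beta_m, \gamma_m) = \arg\min_{\beta,\,\gamma} \sum_{i=1}^{N}
  L\bigl(y_i,\; f_{m-1}(x_i) + \beta \, b(x_i; \gamma)\bigr).
```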
20 Boosting As Additive Model
- Simple case: squared-error loss.
- Forward stage-wise modeling then amounts to just fitting the residuals from the previous iteration (spelled out below).
- Squared-error loss is not robust for classification.
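With squared-error loss the stage-wise objective reduces to an ordinary residual fit; filling in the equation that was on the slide image:

```latex
L\bigl(y_i,\, f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\bigr)
 = \bigl(y_i - f_{m-1}(x_i) - \beta\, b(x_i;\gamma)\bigr)^2
 = \bigl(r_{im} - \beta\, b(x_i;\gamma)\bigr)^2,
```

where r_{im} = y_i - f_{m-1}(x_i) is the residual left by the previous iteration.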
21 Boosting As Additive Model
- AdaBoost for classification
- L(y, f(x)) = exp(-y f(x)), the exponential loss function
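Plugging the exponential loss into the stage-wise objective gives the problem the next slides solve (a sketch; G denotes the weak classifier and w_i^(m) the per-example weights):

```latex
(\beta_m, G_m)
 = \arg\min_{\beta, G} \sum_{i=1}^{N}
   \exp\bigl[-y_i\bigl(f_{m-1}(x_i) + \beta\, G(x_i)\bigr)\bigr]
 = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\bigl(-\beta\, y_i\, G(x_i)\bigr),
\qquad
w_i^{(m)} = \exp\bigl(-y_i\, f_{m-1}(x_i)\bigr).
```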
22 Boosting As Additive Model
First assume that β is constant, and minimize w.r.t. G (the expansion below spells this out).
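For fixed β > 0 the weighted criterion splits over correctly and incorrectly classified points, so the optimal G is the classifier with the smallest weighted training error (a standard derivation, filling in what the slide image showed):

```latex
\sum_i w_i^{(m)} e^{-\beta y_i G(x_i)}
 = e^{-\beta}\!\!\sum_{y_i = G(x_i)}\!\! w_i^{(m)}
 + e^{\beta}\!\!\sum_{y_i \ne G(x_i)}\!\! w_i^{(m)}
 = \bigl(e^{\beta} - e^{-\beta}\bigr)\sum_i w_i^{(m)} I\bigl(y_i \ne G(x_i)\bigr)
 + e^{-\beta}\sum_i w_i^{(m)},
```

so G_m = argmin_G Σ_i w_i^(m) I(y_i ≠ G(x_i)).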
23 Boosting As Additive Model
err_m = Σ_i w_i^(m) I(y_i ≠ G(x_i)) / Σ_i w_i^(m) is the training error on the weighted samples. The last equation tells us that in each iteration we must find a classifier that minimizes the training error on the weighted samples.
24 Boosting As Additive Model
Now that we have found G, we minimize w.r.t. β (see below).
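Setting the derivative of the fixed-G criterion to zero gives the familiar AdaBoost weight (sketching the step the slide image carried):

```latex
\frac{\partial}{\partial \beta}
\Bigl[\bigl(e^{\beta} - e^{-\beta}\bigr)\,\mathrm{err}_m + e^{-\beta}\Bigr] = 0
\quad\Longrightarrow\quad
\beta_m = \tfrac{1}{2}\,\log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}.
```

The a_k = log((1 - e_k)/e_k) used on the next slide is 2β_m; doubling every coefficient does not change the sign of the final vote.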
25 AdaBoost (Algorithm)
- W(x) is the distribution of weights over the N training points, with Σ W(x_i) = 1.
- Initially assign uniform weights W_0(x) = 1/N for all x.
- At each iteration k:
  - Find the best weak classifier C_k(x) using weights W_k(x).
  - Compute the error rate e_k as e_k = Σ_i W(x_i) I(y_i ≠ C_k(x_i)) / Σ_i W(x_i).
  - Weight classifier C_k's contribution to the final hypothesis: set a_k = log((1 - e_k)/e_k).
  - For each x_i: W_{k+1}(x_i) = W_k(x_i) · exp[ a_k · I(y_i ≠ C_k(x_i)) ].
- C_FINAL(x) = sign( Σ a_i C_i(x) )
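A minimal sketch of this algorithm in Python, using decision stumps as the weak classifiers C_k (function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively pick the decision stump (feature, threshold, sign)
    with the smallest weighted error under weights w; labels y are +1/-1."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost_train(X, y, n_rounds=50):
    """AdaBoost as on the slide: refit a stump on reweighted data each round."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # uniform initial weights, sum to 1
    ensemble = []                      # list of (a_k, stump) pairs
    for _ in range(n_rounds):
        stump = best_stump(X, y, w)
        miss = stump_predict(stump, X) != y
        e_k = np.sum(w * miss) / np.sum(w)
        if e_k >= 0.5:                 # weak learner no better than chance
            break
        a_k = np.log((1 - e_k) / max(e_k, 1e-12))
        w = w * np.exp(a_k * miss)     # up-weight the misclassified points
        w = w / np.sum(w)              # renormalize the distribution
        ensemble.append((a_k, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)
```

`adaboost_predict(adaboost_train(X, y), X)` then reproduces C_FINAL(x) = sign(Σ a_i C_i(x)); the loop stops early if the best stump is no better than chance, since a_k would not be positive.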
26 AdaBoost (Example)
Original training set: equal weights for all training samples.
Taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire.
27 AdaBoost (Example)
ROUND 1
28 AdaBoost (Example)
ROUND 2
29 AdaBoost (Example)
ROUND 3
30 AdaBoost (Example)
31 AdaBoost (Characteristics)
- Why the exponential loss function?
- Computational
  - Simple modular re-weighting.
  - The derivative is easy, so determining the optimal parameters is relatively easy.
- Statistical
  - In the two-label case it estimates one half the log-odds of P(Y=1|x), so we can use the sign as the classification rule (see below).
- Accuracy depends upon the number of iterations (how sensitive, we will see soon).
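The statistical point refers to the population minimizer of the exponential loss, a standard result stated here because the slide's formula was in the image:

```latex
f^{*}(x) = \arg\min_{f(x)} \; \mathbb{E}\bigl[\,e^{-Y f(x)} \mid x\,\bigr]
 = \frac{1}{2}\,\log\frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)},
```

so sign(f*(x)) is exactly the Bayes classification rule.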
32 Boosting performance
Decision stumps are very simple rules of thumb that test a condition on a single attribute. Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction. The misclassification rate of the boosting algorithm was plotted against the number of iterations performed (a sketch of such an experiment follows).
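A small sketch of that kind of experiment, assuming scikit-learn is available (the dataset and parameters are illustrative, not those behind the slide's plot):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# AdaBoostClassifier boosts depth-1 decision trees (stumps) by default.
clf = AdaBoostClassifier(n_estimators=400, random_state=0).fit(X, y)

# staged_predict yields the ensemble's predictions after each boosting round,
# so the misclassification rate can be plotted against the iteration count.
errors = [np.mean(pred != y) for pred in clf.staged_predict(X)]
print(errors[0], errors[-1])   # error after 1 round vs. after 400 rounds
```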
33 Boosting performance
Steep decrease in error
34 Boosting performance
- Pondering how many iterations would be sufficient.
- Observations
  - The first few (about 50) iterations increase the accuracy substantially, as seen in the steep decrease in the misclassification rate.
  - As iterations increase, the training error keeps decreasing, and the generalization error decreases as well.
35 Can Boosting do well if ...?
- Limited training data?
  - Probably not.
- Many missing values?
- Noise in the data?
- Individual classifiers not very accurate?
  - It could, if the individual classifiers have considerable mutual disagreement.
36 Application: Data Mining
- Challenges in real-world data mining problems:
  - Data has a large number of observations and a large number of variables on each observation.
  - Inputs are a mixture of various different kinds of variables.
  - Missing values, outliers, and variables with skewed distributions.
  - Results are to be obtained fast, and they should be interpretable.
- So off-the-shelf techniques are difficult to come up with.
- Boosted decision trees (AdaBoost or MART) come close to an off-the-shelf technique for data mining.
40 AT&T: May I help you?