Title: Introduction to Boosting
1 Introduction to Boosting
- Slides adapted from Wanxiang Che at HIT and from Robin Dhamankar. Many thanks!
2 Ideas
- Boosting is considered one of the most significant developments in machine learning.
- Finding many weak rules of thumb is easier than finding a single, highly accurate prediction rule.
- The key is in how the weak rules are combined.
15 Boosting (Algorithm)
- W(x) is the distribution of weights over the N training points, with Σ W(x_i) = 1.
- Initially assign uniform weights W_0(x) = 1/N for all x; set step k = 0.
- At each iteration k:
  - Find the best weak classifier C_k(x) using weights W_k(x).
  - Compute its error rate e_k and, based on a loss function, the weight a_k that classifier C_k receives in the final hypothesis.
  - For each x_i, update the weights based on e_k to get W_{k+1}(x_i).
- C_FINAL(x) = sign( Σ a_i C_i(x) )
16 Boosting (Algorithm)
19 Boosting As Additive Model
- The final prediction in boosting, f(x), can be expressed as an additive expansion of individual classifiers.
- The process is iterative and can be expressed as a forward stage-wise fit, written out below.
- Typically we try to minimize a loss function on the training examples.
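The slide's equations were part of the image; a standard way to write the additive expansion and the forward stage-wise step (notation follows Hastie, Tibshirani and Friedman, with b(x; γ) a weak learner parameterized by γ) is:

```latex
f(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m), \qquad
f_m(x) = f_{m-1}(x) + \beta_m \, b(x; \gamma_m),
\qquad
(\beta_m, \gamma_m) = \arg\min_{\beta,\,\gamma} \sum_{i=1}^{N}
  L\bigl(y_i,\; f_{m-1}(x_i) + \beta \, b(x_i; \gamma)\bigr).
```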
20 Boosting As Additive Model
- Simple case: squared-error loss.
- Forward stage-wise modeling then amounts to just fitting the residuals from the previous iteration (spelled out below).
- Squared-error loss is not robust for classification.
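With squared-error loss the stage-wise objective reduces to an ordinary residual fit; filling in the equation that was on the slide image:

```latex
L\bigl(y_i,\, f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\bigr)
 = \bigl(y_i - f_{m-1}(x_i) - \beta\, b(x_i;\gamma)\bigr)^2
 = \bigl(r_{im} - \beta\, b(x_i;\gamma)\bigr)^2,
```

where r_{im} = y_i - f_{m-1}(x_i) is the residual left by the previous iteration.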
21 Boosting As Additive Model
- AdaBoost for classification
- L(y, f(x)) = exp(-y f(x)), the exponential loss function
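Plugging the exponential loss into the stage-wise objective gives the problem the next slides solve (a sketch; G denotes the weak classifier and w_i^(m) the per-example weights):

```latex
(\beta_m, G_m)
 = \arg\min_{\beta, G} \sum_{i=1}^{N}
   \exp\bigl[-y_i\bigl(f_{m-1}(x_i) + \beta\, G(x_i)\bigr)\bigr]
 = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\bigl(-\beta\, y_i\, G(x_i)\bigr),
\qquad
w_i^{(m)} = \exp\bigl(-y_i\, f_{m-1}(x_i)\bigr).
```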
22 Boosting As Additive Model
First assume that β is constant, and minimize w.r.t. G (the expansion below spells this out).
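For fixed β > 0 the weighted criterion splits over correctly and incorrectly classified points, so the optimal G is the classifier with the smallest weighted training error (a standard derivation, filling in what the slide image showed):

```latex
\sum_i w_i^{(m)} e^{-\beta y_i G(x_i)}
 = e^{-\beta}\!\!\sum_{y_i = G(x_i)}\!\! w_i^{(m)}
 + e^{\beta}\!\!\sum_{y_i \ne G(x_i)}\!\! w_i^{(m)}
 = \bigl(e^{\beta} - e^{-\beta}\bigr)\sum_i w_i^{(m)} I\bigl(y_i \ne G(x_i)\bigr)
 + e^{-\beta}\sum_i w_i^{(m)},
```

so G_m = argmin_G Σ_i w_i^(m) I(y_i ≠ G(x_i)).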
23 Boosting As Additive Model
err_m = Σ_i w_i^(m) I(y_i ≠ G(x_i)) / Σ_i w_i^(m) is the training error on the weighted samples. The last equation tells us that in each iteration we must find a classifier that minimizes the training error on the weighted samples.
24 Boosting As Additive Model
Now that we have found G, we minimize w.r.t. β (see below).
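Setting the derivative of the fixed-G criterion to zero gives the familiar AdaBoost weight (sketching the step the slide image carried):

```latex
\frac{\partial}{\partial \beta}
\Bigl[\bigl(e^{\beta} - e^{-\beta}\bigr)\,\mathrm{err}_m + e^{-\beta}\Bigr] = 0
\quad\Longrightarrow\quad
\beta_m = \tfrac{1}{2}\,\log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}.
```

The a_k = log((1 - e_k)/e_k) used on the next slide is 2β_m; doubling every coefficient does not change the sign of the final vote.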
25 AdaBoost (Algorithm)
- W(x) is the distribution of weights over the N training points, with Σ W(x_i) = 1.
- Initially assign uniform weights W_0(x) = 1/N for all x.
- At each iteration k:
  - Find the best weak classifier C_k(x) using weights W_k(x).
  - Compute the error rate e_k as e_k = Σ_i W(x_i) I(y_i ≠ C_k(x_i)) / Σ_i W(x_i).
  - Weight classifier C_k's contribution to the final hypothesis: set a_k = log((1 - e_k)/e_k).
  - For each x_i: W_{k+1}(x_i) = W_k(x_i) · exp[ a_k · I(y_i ≠ C_k(x_i)) ].
- C_FINAL(x) = sign( Σ a_i C_i(x) )
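A minimal sketch of this algorithm in Python, using decision stumps as the weak classifiers C_k (function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively pick the decision stump (feature, threshold, sign)
    with the smallest weighted error under weights w; labels y are +1/-1."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost_train(X, y, n_rounds=50):
    """AdaBoost as on the slide: refit a stump on reweighted data each round."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # uniform initial weights, sum to 1
    ensemble = []                      # list of (a_k, stump) pairs
    for _ in range(n_rounds):
        stump = best_stump(X, y, w)
        miss = stump_predict(stump, X) != y
        e_k = np.sum(w * miss) / np.sum(w)
        if e_k >= 0.5:                 # weak learner no better than chance
            break
        a_k = np.log((1 - e_k) / max(e_k, 1e-12))
        w = w * np.exp(a_k * miss)     # up-weight the misclassified points
        w = w / np.sum(w)              # renormalize the distribution
        ensemble.append((a_k, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)
```

`adaboost_predict(adaboost_train(X, y), X)` then reproduces C_FINAL(x) = sign(Σ a_i C_i(x)); the loop stops early if the best stump is no better than chance, since a_k would not be positive.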
26 AdaBoost (Example)
Original training set: equal weights for all training samples.
Taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire.
27 AdaBoost (Example)
ROUND 1
28 AdaBoost (Example)
ROUND 2
29 AdaBoost (Example)
ROUND 3
30 AdaBoost (Example)
31 AdaBoost (Characteristics)
- Why the exponential loss function?
- Computational
  - Simple modular re-weighting.
  - The derivative is easy, so determining the optimal parameters is relatively easy.
- Statistical
  - In the two-label case it estimates one half the log-odds of P(Y=1|x), so we can use the sign as the classification rule (see below).
- Accuracy depends upon the number of iterations (how sensitive, we will see soon).
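The statistical point refers to the population minimizer of the exponential loss, a standard result stated here because the slide's formula was in the image:

```latex
f^{*}(x) = \arg\min_{f(x)} \; \mathbb{E}\bigl[\,e^{-Y f(x)} \mid x\,\bigr]
 = \frac{1}{2}\,\log\frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)},
```

so sign(f*(x)) is exactly the Bayes classification rule.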
32 Boosting performance
Decision stumps are very simple rules of thumb that test a condition on a single attribute. Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction. The misclassification rate of the boosting algorithm was plotted against the number of iterations performed (a sketch of such an experiment follows).
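A small sketch of that kind of experiment, assuming scikit-learn is available (the dataset and parameters are illustrative, not those behind the slide's plot):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# AdaBoostClassifier boosts depth-1 decision trees (stumps) by default.
clf = AdaBoostClassifier(n_estimators=400, random_state=0).fit(X, y)

# staged_predict yields the ensemble's predictions after each boosting round,
# so the misclassification rate can be plotted against the iteration count.
errors = [np.mean(pred != y) for pred in clf.staged_predict(X)]
print(errors[0], errors[-1])   # error after 1 round vs. after 400 rounds
```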
33 Boosting performance
Steep decrease in error
34 Boosting performance
- Pondering how many iterations would be sufficient.
- Observations
  - The first few (about 50) iterations increase the accuracy substantially, as seen in the steep decrease in the misclassification rate.
  - As iterations increase, the training error keeps decreasing, and the generalization error decreases as well.
35 Can Boosting do well if ...?
- Limited training data?
  - Probably not.
- Many missing values?
- Noise in the data?
- Individual classifiers not very accurate?
  - It could, if the individual classifiers have considerable mutual disagreement.
36 Application: Data Mining
- Challenges in real-world data mining problems:
  - Data has a large number of observations and a large number of variables on each observation.
  - Inputs are a mixture of various different kinds of variables.
  - Missing values, outliers, and variables with skewed distributions.
  - Results are to be obtained fast, and they should be interpretable.
- So off-the-shelf techniques are difficult to come up with.
- Boosted decision trees (AdaBoost or MART) come close to an off-the-shelf technique for data mining.
40 AT&T: May I help you?