Title: Experiments with a New Boosting Algorithm
1. Experiments with a New Boosting Algorithm
- Yoav Freund and Robert E. Schapire
Improved Boosting Algorithms Using Confidence-rated Predictions
- Robert E. Schapire and Yoram Singer
2. Neural Network Terminology
- Training Set
  - Data over which the network is trained
  - Many examples are fed into the network, on the order of 10,000
- Test Set
  - Independent data used to check the progress of the network
  - This data is not used for training
3. Basic Learning Algorithm: FindAttrTest
- The training dataset is created
  - Ex: the integers between 1 and 10
- A random subset of the entire training dataset is used to train the algorithm
  - Ex: 1, 3, 7, 10
- A threshold is picked
  - Ex: 9 (note that this is initialized to some number before training begins)
- The algorithm returns 1 if the test example is less than the threshold, or -1 if it is greater
  - Ex: 1, 1, 1, -1
- It also returns the threshold used (called the hypothesis)
  - Ex: 9
- The threshold is updated to minimize the error
  - Ex: error = Σ outputs = 1 + 1 + 1 - 1 = 2
  - Ex: new threshold = old threshold - 0.1 × error = 9 - 0.1 × 2 = 8.8
- Another subset of data is fed into the algorithm, and the process is repeated
- After a number of these repetitions, the final threshold (in this case roughly the mean of the dataset) is returned; this is called the final hypothesis
  - In the above example, it would return 5
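Below is a minimal sketch of the toy threshold learner described on this slide (not the exact FindAttrTest procedure from the paper; names such as ThresholdLearner and learning_rate are illustrative assumptions):

```python
# Toy threshold learner: predicts 1 below the threshold and -1 above it,
# and nudges the threshold by a small step proportional to the summed outputs.

class ThresholdLearner:
    def __init__(self, threshold=9.0, learning_rate=0.1):
        self.threshold = threshold
        self.learning_rate = learning_rate

    def predict(self, x):
        """Return 1 if x is below the threshold, otherwise -1."""
        return 1 if x < self.threshold else -1

    def train_step(self, subset):
        """Update the threshold using one random subset of the training data."""
        outputs = [self.predict(x) for x in subset]
        error = sum(outputs)                            # e.g. 1 + 1 + 1 - 1 = 2
        self.threshold -= self.learning_rate * error    # e.g. 9 - 0.1 * 2 = 8.8
        return self.threshold                           # the hypothesis for this round


# Example run matching the slide: data 1..10, subset {1, 3, 7, 10}, threshold 9
learner = ThresholdLearner(threshold=9.0)
print(learner.train_step([1, 3, 7, 10]))   # 8.8
```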
4. Basics of Bagging
- Multiple copies of the learning algorithm (FindAttrTest) are made
- Each copy is trained a number of times
- Each copy is trained on a set of m randomly picked examples from the entire training dataset
- The output classifier (threshold) that appears most often is the one selected as optimal
- In the example on the previous slide, the copies that predicted 5 would form the majority, so 5 would be selected as optimal
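A minimal, self-contained sketch of the bagging procedure above, reusing the toy threshold learner from the previous slide (function names and parameters such as train_threshold and n_copies are illustrative assumptions):

```python
import random
from collections import Counter

def train_threshold(data, m=4, rounds=50, threshold=9.0, lr=0.1):
    """Toy threshold learner from the previous slide: 1 below the threshold, -1 above."""
    for _ in range(rounds):
        subset = random.sample(data, m)                      # m randomly picked examples
        outputs = [1 if x < threshold else -1 for x in subset]
        threshold -= lr * sum(outputs)                       # nudge the threshold
    return round(threshold)

def bag(data, n_copies=10):
    """Bagging: train several copies and keep the threshold that appears most often."""
    thresholds = [train_threshold(data) for _ in range(n_copies)]
    return Counter(thresholds).most_common(1)[0][0]

print(bag(list(range(1, 11))))   # typically prints a value near 5
```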
5. Basics of Boosting
- A distribution Dt over the training examples is fed into the learning algorithm
- A hypothesis ht is returned from the weak learner
- Calculate the error of ht under the distribution
  - et = Σ Dt(i), summed over the examples i that ht misclassifies
- Update the distribution so that the examples that were classified incorrectly receive more weight on the next call to the algorithm and the examples classified correctly receive less
- Repeat these steps T times
- In the earlier example, boosting would give more weight to the numbers incorrectly classified by the algorithm in a previous run, thus finding the optimal hypothesis much faster than pure random selection
6. AdaBoost.M1
- Input: a sequence of m samples <(x1, y1), ..., (xm, ym)> with labels yi in Y = {1, ..., k}
  - xi is the input vector of data
  - yi is the correct output class for the input xi
- Example
  - Function we want to represent: F = sign(A)
  - xi consists of possible values for A taken from a large dataset S
    - x1 = 3
    - x2 = -2
    - Etc.
  - yi consists of the correct output for the given inputs
    - y1 = 1
    - y2 = -1
    - Etc.
7. AdaBoost.M1
- Initialize D1(i) = 1/m for all i
- Repeat the following procedure for t = 1, 2, ..., T
  - Call the learning algorithm and provide it with the distribution Dt(i)
  - Get back hypothesis ht, which maps X to Y
  - Calculate the error of ht
    - et = Σ Dt(i), summed over all i with ht(xi) ≠ yi (all incorrectly classified examples)
    - On the first round this is simply Nincorrect / m
  - If the error is bigger than 0.5, stop training
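The error calculation on this slide can be sketched in a few lines (illustrative names; D is the weight distribution over the m examples):

```python
def weighted_error(D, predictions, labels):
    """epsilon_t = sum of D(i) over the examples the hypothesis got wrong."""
    return sum(d for d, p, y in zip(D, predictions, labels) if p != y)

# Round 1: D(i) = 1/m, so the error is simply (number incorrect) / m
D = [1/4] * 4
print(weighted_error(D, [1, 1, 1, -1], [1, 1, -1, -1]))   # 0.25
```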
8. AdaBoost.M1
- Set βt = et / (1 - et)
- This sets the scaling factor for the weight of a correctly predicted example
  - Ex: et = 0.5 → βt = 1 (bad prediction, weight remains the same)
  - Ex: et = 0.3 → βt ≈ 0.43 (the algorithm is getting better, so reduce the weight)
  - Ex: et = 0.01 (near-perfect guess!) → βt ≈ 0.01 (the algorithm is nearly perfect on this input, so don't waste much more time training with it)
- Update the distribution Dt using the above scaling factor
  - Dt+1(i) = Dt(i) / Zt × βt if ht(xi) = yi (correct guess)
  - Dt+1(i) = Dt(i) / Zt if ht(xi) ≠ yi (incorrect guess)
  - where Zt is a normalization constant that keeps Dt+1 a distribution
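A minimal sketch of this update (illustrative names; beta shrinks the weight of the correctly classified examples and Z renormalizes so the weights again sum to 1):

```python
def update_distribution(D, predictions, labels, error):
    beta = error / (1 - error)                        # e.g. 0.3 / 0.7 ~= 0.43
    new_D = [d * beta if p == y else d                # shrink correct, keep incorrect
             for d, p, y in zip(D, predictions, labels)]
    Z = sum(new_D)                                    # normalization constant Z_t
    return [d / Z for d in new_D], beta

D, beta = update_distribution([0.25] * 4, [1, 1, 1, -1], [1, 1, -1, -1], error=0.25)
print(beta)   # 0.333... ; the misclassified example now carries the largest weight
print(D)
```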
9. AdaBoost.M1
- Output the final hypothesis
  - hfin(x) = argmax over y in Y of Σ log(1/βt), where the sum runs over the rounds t whose hypotheses predicted that label, ht(x) = y
- This is a weighted vote: the hypotheses with the lowest βt contribute the largest vote weights log(1/βt), so the label favored by the most accurate hypotheses is used as the final output
- Remember that a high βt means the algorithm was not very good at predicting the output on that round
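Putting the previous slides together, here is a compact sketch of AdaBoost.M1 on the toy binary problem F = sign(A). The weak learner is a simple weighted threshold stump standing in for FindAttrTest; all names are illustrative, not the paper's code.

```python
import math

def stump(X, y, D):
    """Pick the threshold and sign with the lowest weighted error."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x < thr else -sign for x in X]
            err = sum(d for d, p, yi in zip(D, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best                                        # (error, threshold, sign)

def adaboost_m1(X, y, T=10):
    m = len(X)
    D = [1.0 / m] * m                                  # D1(i) = 1/m
    hyps = []
    for _ in range(T):
        err, thr, sign = stump(X, y, D)
        if err > 0.5:                                  # stop if the weak learner fails
            break
        beta = max(err, 1e-10) / (1 - err)             # beta_t = e_t / (1 - e_t)
        pred = [sign if x < thr else -sign for x in X]
        D = [d * beta if p == yi else d for d, p, yi in zip(D, pred, y)]
        Z = sum(D)
        D = [d / Z for d in D]                         # renormalize into a distribution
        hyps.append((thr, sign, math.log(1 / beta)))   # vote weight log(1/beta_t)
    return hyps

def final_hypothesis(hyps, x):
    """Weighted vote: each round votes for its label with weight log(1/beta_t)."""
    votes = {}
    for thr, sign, w in hyps:
        label = sign if x < thr else -sign
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

X = [3, -2, 5, -1, 0.5, -4]
y = [1, -1, 1, -1, 1, -1]                              # y = sign(x)
model = adaboost_m1(X, y)
print([final_hypothesis(model, x) for x in X])         # should reproduce y
```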
10. AdaBoost.M1
- One major disadvantage of AdaBoost.M1 is its inability to handle errors larger than 0.5
- This is because the weighting factor becomes larger than 1 for errors greater than 0.5
- Weights can then grow very large for errors close to 1
- This would cause the algorithm to train on ONLY that one example, and to learn ONLY that one example
11. AdaBoost.M2
- AdaBoost.M2 was designed to overcome this difficulty
- The learning algorithm is expanded to output a vector rather than just a scalar
- Each output in the vector is a probability that class N matches the input
- Ex: number recognition
  - Input is a 7
  - Outputs would be high for 1 and 7 and medium to low for the other digits (since they appear similar)
  - e.g. 0.85, 0.6, 0.3, 0.7, 0.4, etc.
12. AdaBoost.M2
- Steps are similar to AdaBoost.M1, except for the error calculation and the distribution update
- Input: a sequence of m samples <(x1, y1), ..., (xm, ym)> with labels yi in Y = {1, ..., k}
- Uses a modified learning algorithm that returns a vector of probability outputs
  - Each element in the vector is the probability that the input is part of the class associated with that element
  - Ex
    - Input is any letter from the alphabet
    - The three output classes are A, B, C
    - The output for a particular input is the probability that the input is that class
    - Ex: the input is a handwritten B; the output is 0.4, 0.8, 0.2 for the three classes
- Input: an integer T, specifying the number of iterations to be performed
13. AdaBoost.M2
- Call the learning algorithm
- Get back the hypothesis vector
- Calculate the error (now called pseudo-loss for a vector output)
- Update the input distribution
- Output the best hypothesis after T repetitions
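For reference, a sketch of how the pseudo-loss, the distribution update, and the final hypothesis are usually written for AdaBoost.M2 (B denotes the set of "mislabel" pairs (i, y) with y ≠ yi; the exact notation in the paper may differ slightly):

```latex
% Pseudo-loss of hypothesis h_t under the mislabel distribution D_t
\epsilon_t = \tfrac{1}{2} \sum_{(i,y)\in B} D_t(i,y)\,\bigl(1 - h_t(x_i,y_i) + h_t(x_i,y)\bigr)

% Distribution update, with \beta_t = \epsilon_t / (1 - \epsilon_t)
D_{t+1}(i,y) = \frac{D_t(i,y)}{Z_t}\;\beta_t^{\,\frac{1}{2}\left(1 + h_t(x_i,y_i) - h_t(x_i,y)\right)}

% Final hypothesis: the label with the largest weighted score
h_{\mathrm{fin}}(x) = \arg\max_{y\in Y} \sum_{t=1}^{T} \Bigl(\log\frac{1}{\beta_t}\Bigr)\, h_t(x,y)
```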
14. Experiments
- A collection of machine learning datasets is available on the UC Irvine website
- These datasets were used to test the improved accuracy and speed of the AdaBoost algorithms
15. Results of Experiments
16. Results of Experiments
- Boosting vs. bagging
  - AdaBoost.M1 yielded a 55.1% increase in accuracy over just using FindAttrTest
  - Bagging using error yielded only an 8.4% increase in accuracy
  - Bagging using pseudo-loss still yielded only a 10.6% boost in accuracy
- AdaBoost.M2 was at least as good as AdaBoost.M1 in all trials, and in 9 of the 27 trials it yielded an incredible boost in accuracy
17. Problem
- What if an example can belong to multiple classes?
- AdaBoost.M1 and AdaBoost.M2 can only support output data that belongs to at most one class
- AdaBoost.MH, AdaBoost.MR, and AdaBoost.MO were created to solve this problem
- They also provide a confidence rating on how sure the algorithm is that its output is correct
18. AdaBoost
- Multiclass problems are ones where each example can belong to several classes
  - One such example is newspaper article classification, where each article can belong to multiple categories
- AdaBoost.MH uses Hamming loss, as well as updated learning algorithms, to increase accuracy on multiclass problems
  - This algorithm tries to predict all, and only, the correct class labels (it cannot just predict one label; it must predict them all)
  - The loss depends on how the predicted set of labels differs from the actual set of labels
- AdaBoost.MO uses output coding to improve the accuracy of multiclass problems
  - This algorithm maps each single class label into a coded output label
  - Each labeled example (x, y) is mapped (one-to-one) to a new example (x, λ(y))
  - This new label is called the output-coded label
- AdaBoost.MR uses ranking loss to improve classification accuracy
  - The goal is to find a hypothesis that ranks the output labels, in the hope that the highest-ranked label will be the correct one
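As an illustration of the Hamming loss mentioned above (a sketch with made-up label sets, not the paper's data): the loss counts the fraction of (example, label) pairs on which the predicted and true label sets disagree.

```python
def hamming_loss(true_sets, pred_sets, all_labels):
    """Fraction of (example, label) pairs where prediction and truth disagree."""
    disagreements = 0
    for truth, pred in zip(true_sets, pred_sets):
        for label in all_labels:
            if (label in truth) != (label in pred):
                disagreements += 1
    return disagreements / (len(true_sets) * len(all_labels))

labels = {"Domestic", "Financial", "Political"}
truth = [{"Financial"}, {"Domestic", "Political"}]
pred  = [{"Financial", "Political"}, {"Domestic"}]
print(hamming_loss(truth, pred, labels))   # 2 disagreements out of 6 pairs -> 0.333...
```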
19. Experiments
- Three algorithms were tested
  - Discrete-valued AdaBoost.MH
    - The output value for each class is 1 or -1
  - Real-valued AdaBoost.MH
    - The output value for each class can be any real number
  - Discrete-valued AdaBoost.MR
    - The output value for each class is 1 or -1
20. Experiment Results (UCI)
- [Figures: training-set error and test-set error curves for discrete AdaBoost.MH and real AdaBoost.MH]
- In some graphs the real AdaBoost.MH performed better, and in others the discrete AdaBoost.MH performed better
- In a few cases the algorithm was trained for too many rounds, and the test error started to rise after an optimal number of training rounds
21. Experiment: Newspaper Article Classification
- Newspaper articles are fed into the learning algorithms
- The classifier makes its decision based on the presence or absence of a phrase in the document
- Each article is classified into one and only one category
- The categories are
  - Domestic
  - Entertainment
  - Financial
  - International
  - Political
  - Washington
- The same functions as in the previous experiment were used to classify the output
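As an illustration of the phrase-based decisions described above, here is a hypothetical sketch (the phrase, scores, and article text are invented, not the paper's data): a single rule that checks whether one phrase occurs in the article and outputs a score per category.

```python
CATEGORIES = ["Domestic", "Entertainment", "Financial",
              "International", "Political", "Washington"]

def phrase_stump(article, phrase, scores_if_present, scores_if_absent):
    """Return one score per category based on presence/absence of the phrase."""
    present = phrase.lower() in article.lower()
    return scores_if_present if present else scores_if_absent

article = "The senate debated the new budget in Washington on Tuesday."
scores = phrase_stump(
    article,
    phrase="budget",
    scores_if_present=[-1, -1, 1, -1, 1, 1],   # favors Financial/Political/Washington
    scores_if_absent=[1, 1, -1, 1, -1, -1],
)
print(dict(zip(CATEGORIES, scores)))
```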
22. Experiment Results: Newspaper Article Classification
- In this scenario, the real-valued AdaBoost.MH greatly outperforms the other two methods
- Number of training rounds needed to reach a test accuracy of 40%:
  - Discrete AdaBoost.MR: 33,347 rounds
  - Discrete AdaBoost.MH: 16,938 rounds
  - Real AdaBoost.MH: 268 rounds