Title: Experiments with a New Boosting Algorithm
1. Experiments with a New Boosting Algorithm
- Yoav Freund and Robert E. Schapire
Improved Boosting Algorithms Using Confidence-rated Predictions
- Robert E. Schapire and Yoram Singer
2. Neural Network Terminology
- Training Set
  - Data over which the network is trained
  - Many examples are fed into the network, on the order of 10,000
- Test Set
  - Independent data used to check the progress of the network
  - This data is not used for training
3. Basic Learning Algorithm: FindAttrTest
- The training dataset is created
  - Ex: the integers between 1 and 10
- A random subset of the entire training dataset is used to train the algorithm
  - Ex: 1, 3, 7, 10
- A threshold is picked
  - Ex: 9 (note that this is initialized to some number before training begins)
- The algorithm returns 1 if the test example is less than the threshold, or -1 if it is greater
  - Ex: 1, 1, 1, -1
- It also returns the threshold used (called the hypothesis)
  - Ex: 9
- The threshold is updated to minimize the error
  - Ex: error = Σ outputs = 1 + 1 + 1 - 1 = 2
  - Ex: new threshold = old threshold - 0.1 × error = 9 - 0.1 × 2 = 8.8
- Another subset of data is fed into the algorithm, and the process is repeated
- After a number of these repetitions, the final threshold (in this case roughly the mean of the dataset) is returned; this is called the final hypothesis
  - In the above example, it would return 5
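Below is a minimal sketch of the toy threshold learner described on this slide (not the exact FindAttrTest procedure from the paper; names such as ThresholdLearner and learning_rate are illustrative assumptions):

```python
# Toy threshold learner: predicts 1 below the threshold and -1 above it,
# and nudges the threshold by a small step proportional to the summed outputs.

class ThresholdLearner:
    def __init__(self, threshold=9.0, learning_rate=0.1):
        self.threshold = threshold
        self.learning_rate = learning_rate

    def predict(self, x):
        """Return 1 if x is below the threshold, otherwise -1."""
        return 1 if x < self.threshold else -1

    def train_step(self, subset):
        """Update the threshold using one random subset of the training data."""
        outputs = [self.predict(x) for x in subset]
        error = sum(outputs)                            # e.g. 1 + 1 + 1 - 1 = 2
        self.threshold -= self.learning_rate * error    # e.g. 9 - 0.1 * 2 = 8.8
        return self.threshold                           # the hypothesis for this round


# Example run matching the slide: data 1..10, subset {1, 3, 7, 10}, threshold 9
learner = ThresholdLearner(threshold=9.0)
print(learner.train_step([1, 3, 7, 10]))   # 8.8
```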
4. Basics of Bagging
- Multiple copies of the learning algorithm (FindAttrTest) are made
- Each copy is trained a number of times
- Each copy is trained on a set of m randomly picked examples from the entire training dataset
- The output classifier (threshold) that appears most often is the one selected as optimal
- In the example on the previous slide, the copies that predicted 5 would form the majority, so 5 would be selected as optimal
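A minimal, self-contained sketch of the bagging procedure above, reusing the toy threshold learner from the previous slide (function names and parameters such as train_threshold and n_copies are illustrative assumptions):

```python
import random
from collections import Counter

def train_threshold(data, m=4, rounds=50, threshold=9.0, lr=0.1):
    """Toy threshold learner from the previous slide: 1 below the threshold, -1 above."""
    for _ in range(rounds):
        subset = random.sample(data, m)                      # m randomly picked examples
        outputs = [1 if x < threshold else -1 for x in subset]
        threshold -= lr * sum(outputs)                       # nudge the threshold
    return round(threshold)

def bag(data, n_copies=10):
    """Bagging: train several copies and keep the threshold that appears most often."""
    thresholds = [train_threshold(data) for _ in range(n_copies)]
    return Counter(thresholds).most_common(1)[0][0]

print(bag(list(range(1, 11))))   # typically prints a value near 5
```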
5. Basics of Boosting
- A distribution Dt over the training examples is fed into the learning algorithm
- A hypothesis ht is returned from the weak learner
- Calculate the error of ht under the distribution
  - et = Σ Dt(i), summed over the examples i that ht misclassifies
- Update the distribution so that the examples that were classified incorrectly receive more weight on the next call to the algorithm and the examples classified correctly receive less
- Repeat these steps T times
- In the earlier example, boosting would give more weight to the numbers incorrectly classified by the algorithm in a previous run, thus finding the optimal hypothesis much faster than pure random selection
6. AdaBoost.M1
- Input: a sequence of m samples <(x1, y1), ..., (xm, ym)> with labels yi in Y = {1, ..., k}
  - xi is the input vector of data
  - yi is the correct output class for the input xi
- Example
  - Function we want to represent: F = sign(A)
  - xi consists of possible values for A taken from a large dataset S
    - x1 = 3
    - x2 = -2
    - Etc.
  - yi consists of the correct output for the given inputs
    - y1 = 1
    - y2 = -1
    - Etc.
7. AdaBoost.M1
- Initialize D1(i) = 1/m for all i
- Repeat the following procedure for t = 1, 2, ..., T
  - Call the learning algorithm and provide it with the distribution Dt(i)
  - Get back hypothesis ht, which maps X to Y
  - Calculate the error of ht
    - et = Σ Dt(i), summed over all i with ht(xi) ≠ yi (all incorrectly classified examples)
    - On the first round this is simply Nincorrect / m
  - If the error is bigger than 0.5, stop training
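The error calculation on this slide can be sketched in a few lines (illustrative names; D is the weight distribution over the m examples):

```python
def weighted_error(D, predictions, labels):
    """epsilon_t = sum of D(i) over the examples the hypothesis got wrong."""
    return sum(d for d, p, y in zip(D, predictions, labels) if p != y)

# Round 1: D(i) = 1/m, so the error is simply (number incorrect) / m
D = [1/4] * 4
print(weighted_error(D, [1, 1, 1, -1], [1, 1, -1, -1]))   # 0.25
```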
8. AdaBoost.M1
- Set βt = et / (1 - et)
- This sets the scaling factor for the weight of a correctly predicted example
  - Ex: et = 0.5 → βt = 1 (bad prediction, weight remains the same)
  - Ex: et = 0.3 → βt ≈ 0.43 (the algorithm is getting better, so reduce the weight)
  - Ex: et = 0.01 (near-perfect guess!) → βt ≈ 0.01 (the algorithm is nearly perfect on this input, so don't waste much more time training with it)
- Update the distribution Dt using the above scaling factor
  - Dt+1(i) = Dt(i) / Zt × βt if ht(xi) = yi (correct guess)
  - Dt+1(i) = Dt(i) / Zt if ht(xi) ≠ yi (incorrect guess)
  - where Zt is a normalization constant that keeps Dt+1 a distribution
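A minimal sketch of this update (illustrative names; beta shrinks the weight of the correctly classified examples and Z renormalizes so the weights again sum to 1):

```python
def update_distribution(D, predictions, labels, error):
    beta = error / (1 - error)                        # e.g. 0.3 / 0.7 ~= 0.43
    new_D = [d * beta if p == y else d                # shrink correct, keep incorrect
             for d, p, y in zip(D, predictions, labels)]
    Z = sum(new_D)                                    # normalization constant Z_t
    return [d / Z for d in new_D], beta

D, beta = update_distribution([0.25] * 4, [1, 1, 1, -1], [1, 1, -1, -1], error=0.25)
print(beta)   # 0.333... ; the misclassified example now carries the largest weight
print(D)
```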
9. AdaBoost.M1
- Output the final hypothesis
  - hfin(x) = argmax over y in Y of Σ log(1/βt), where the sum runs over the rounds t whose hypotheses predicted that label, ht(x) = y
- This is a weighted vote: the hypotheses with the lowest βt contribute the largest vote weights log(1/βt), so the label favored by the most accurate hypotheses is used as the final output
- Remember that a high βt means the algorithm was not very good at predicting the output on that round
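Putting the previous slides together, here is a compact sketch of AdaBoost.M1 on the toy binary problem F = sign(A). The weak learner is a simple weighted threshold stump standing in for FindAttrTest; all names are illustrative, not the paper's code.

```python
import math

def stump(X, y, D):
    """Pick the threshold and sign with the lowest weighted error."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x < thr else -sign for x in X]
            err = sum(d for d, p, yi in zip(D, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best                                        # (error, threshold, sign)

def adaboost_m1(X, y, T=10):
    m = len(X)
    D = [1.0 / m] * m                                  # D1(i) = 1/m
    hyps = []
    for _ in range(T):
        err, thr, sign = stump(X, y, D)
        if err > 0.5:                                  # stop if the weak learner fails
            break
        beta = max(err, 1e-10) / (1 - err)             # beta_t = e_t / (1 - e_t)
        pred = [sign if x < thr else -sign for x in X]
        D = [d * beta if p == yi else d for d, p, yi in zip(D, pred, y)]
        Z = sum(D)
        D = [d / Z for d in D]                         # renormalize into a distribution
        hyps.append((thr, sign, math.log(1 / beta)))   # vote weight log(1/beta_t)
    return hyps

def final_hypothesis(hyps, x):
    """Weighted vote: each round votes for its label with weight log(1/beta_t)."""
    votes = {}
    for thr, sign, w in hyps:
        label = sign if x < thr else -sign
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

X = [3, -2, 5, -1, 0.5, -4]
y = [1, -1, 1, -1, 1, -1]                              # y = sign(x)
model = adaboost_m1(X, y)
print([final_hypothesis(model, x) for x in X])         # should reproduce y
```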
10. AdaBoost.M1
- One major disadvantage of AdaBoost.M1 is its inability to handle errors larger than 0.5
- This is because the weighting factor becomes larger than 1 for errors greater than 0.5
- Weights can then grow very large for errors close to 1
- This would cause the algorithm to train on ONLY that one example, and to learn ONLY that one example
11. AdaBoost.M2
- AdaBoost.M2 was designed to overcome this difficulty
- The learning algorithm is expanded to output a vector rather than just a scalar
- Each output in the vector is a probability that class N matches the input
- Ex: number recognition
  - Input is a 7
  - Outputs would be high for 1 and 7 and medium to low for the other digits (since they appear similar)
  - e.g. 0.85, 0.6, 0.3, 0.7, 0.4, etc.
12. AdaBoost.M2
- Steps are similar to AdaBoost.M1, except for the error calculation and the distribution update
- Input: a sequence of m samples <(x1, y1), ..., (xm, ym)> with labels yi in Y = {1, ..., k}
- Uses a modified learning algorithm that returns a vector of probability outputs
  - Each element in the vector is the probability that the input is part of the class associated with that element
  - Ex
    - Input is any letter from the alphabet
    - The three output classes are A, B, C
    - The output for a particular input is the probability that the input is that class
    - Ex: the input is a handwritten B; the output is 0.4, 0.8, 0.2 for the three classes
- Input: an integer T, specifying the number of iterations to be performed
13. AdaBoost.M2
- Call the learning algorithm
- Get back the hypothesis vector
- Calculate the error (now called pseudo-loss for a vector output)
- Update the input distribution
- Output the best hypothesis after T repetitions
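For reference, a sketch of how the pseudo-loss, the distribution update, and the final hypothesis are usually written for AdaBoost.M2 (B denotes the set of "mislabel" pairs (i, y) with y ≠ yi; the exact notation in the paper may differ slightly):

```latex
% Pseudo-loss of hypothesis h_t under the mislabel distribution D_t
\epsilon_t = \tfrac{1}{2} \sum_{(i,y)\in B} D_t(i,y)\,\bigl(1 - h_t(x_i,y_i) + h_t(x_i,y)\bigr)

% Distribution update, with \beta_t = \epsilon_t / (1 - \epsilon_t)
D_{t+1}(i,y) = \frac{D_t(i,y)}{Z_t}\;\beta_t^{\,\frac{1}{2}\left(1 + h_t(x_i,y_i) - h_t(x_i,y)\right)}

% Final hypothesis: the label with the largest weighted score
h_{\mathrm{fin}}(x) = \arg\max_{y\in Y} \sum_{t=1}^{T} \Bigl(\log\frac{1}{\beta_t}\Bigr)\, h_t(x,y)
```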
14. Experiments
- A collection of machine learning datasets is available on the UC Irvine website
- These datasets were used to test the improved accuracy and speed of the AdaBoost algorithms
15. Results of Experiments
16. Results of Experiments
- Boosting vs. bagging
  - AdaBoost.M1 yielded a 55.1% increase in accuracy over just using FindAttrTest
  - Bagging using error yielded only an 8.4% increase in accuracy
  - Bagging using pseudo-loss still yielded only a 10.6% boost in accuracy
- AdaBoost.M2 was at least as good as AdaBoost.M1 in all trials, and in 9 of the 27 trials it yielded an incredible boost in accuracy
17. Problem
- What if an example can belong to multiple classes?
- AdaBoost.M1 and AdaBoost.M2 can only support output data that belongs to at most one class
- AdaBoost.MH, AdaBoost.MR, and AdaBoost.MO were created to solve this problem
- They also provide a confidence rating on how sure the algorithm is that its output is correct
18. AdaBoost
- Multiclass problems are ones where each example can belong to several classes
  - One such example is newspaper article classification, where each article can belong to multiple categories
- AdaBoost.MH uses Hamming loss, as well as updated learning algorithms, to increase accuracy on multiclass problems
  - This algorithm tries to predict all, and only, the correct class labels (it cannot just predict one label; it must predict them all)
  - The loss depends on how the predicted set of labels differs from the actual set of labels
- AdaBoost.MO uses output coding to improve the accuracy of multiclass problems
  - This algorithm maps each single class label into a coded output label
  - Each labeled example (x, y) is mapped (one-to-one) to a new example (x, λ(y))
  - This new label is called the output-coded label
- AdaBoost.MR uses ranking loss to improve classification accuracy
  - The goal is to find a hypothesis that ranks the output labels, in the hope that the highest-ranked label will be the correct one
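As an illustration of the Hamming loss mentioned above (a sketch with made-up label sets, not the paper's data): the loss counts the fraction of (example, label) pairs on which the predicted and true label sets disagree.

```python
def hamming_loss(true_sets, pred_sets, all_labels):
    """Fraction of (example, label) pairs where prediction and truth disagree."""
    disagreements = 0
    for truth, pred in zip(true_sets, pred_sets):
        for label in all_labels:
            if (label in truth) != (label in pred):
                disagreements += 1
    return disagreements / (len(true_sets) * len(all_labels))

labels = {"Domestic", "Financial", "Political"}
truth = [{"Financial"}, {"Domestic", "Political"}]
pred  = [{"Financial", "Political"}, {"Domestic"}]
print(hamming_loss(truth, pred, labels))   # 2 disagreements out of 6 pairs -> 0.333...
```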
19. Experiments
- Three algorithms were tested
  - Discrete-valued AdaBoost.MH
    - The output value for each class is 1 or -1
  - Real-valued AdaBoost.MH
    - The output value for each class can be any real number
  - Discrete-valued AdaBoost.MR
    - The output value for each class is 1 or -1
20. Experiment Results (UCI)
- [Figures: training-set error and test-set error curves for discrete AdaBoost.MH and real AdaBoost.MH]
- In some graphs the real AdaBoost.MH performed better, and in others the discrete AdaBoost.MH performed better
- In a few cases the algorithm was trained for too many rounds, and the test error started to rise after an optimal number of training rounds
21. Experiment: Newspaper Article Classification
- Newspaper articles are fed into the learning algorithms
- The classifier makes its decision based on the presence or absence of a phrase in the document
- Each article is classified into one and only one category
- The categories are
  - Domestic
  - Entertainment
  - Financial
  - International
  - Political
  - Washington
- The same functions as in the previous experiment were used to classify the output
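As an illustration of the phrase-based decisions described above, here is a hypothetical sketch (the phrase, scores, and article text are invented, not the paper's data): a single rule that checks whether one phrase occurs in the article and outputs a score per category.

```python
CATEGORIES = ["Domestic", "Entertainment", "Financial",
              "International", "Political", "Washington"]

def phrase_stump(article, phrase, scores_if_present, scores_if_absent):
    """Return one score per category based on presence/absence of the phrase."""
    present = phrase.lower() in article.lower()
    return scores_if_present if present else scores_if_absent

article = "The senate debated the new budget in Washington on Tuesday."
scores = phrase_stump(
    article,
    phrase="budget",
    scores_if_present=[-1, -1, 1, -1, 1, 1],   # favors Financial/Political/Washington
    scores_if_absent=[1, 1, -1, 1, -1, -1],
)
print(dict(zip(CATEGORIES, scores)))
```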
22. Experiment Results: Newspaper Article Classification
- In this scenario, the real-valued AdaBoost.MH greatly outperforms the other two methods
- Number of training rounds needed to reach a test accuracy of 40%:
  - Discrete AdaBoost.MR: 33,347 rounds
  - Discrete AdaBoost.MH: 16,938 rounds
  - Real AdaBoost.MH: 268 rounds