1
Experiments with a New Boosting Algorithm
  • Yoav Freund and Robert E. Schapire

Improved Boosting Algorithms Using
Confidence-rated Predictions
Robert E. Schapire and Yoram Singer
2
Neural Network Terminology
  • Training Set
  • Data over which the neural network is trained
  • Many examples are fed into the network, on the
    order of 10,000
  • Training Test Set
  • Independent data used to check the progress of
    the network
  • This data is not used for training
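A minimal Python sketch of the train / test split described above (the array names, sizes, and labels are hypothetical, not taken from the slides):

import numpy as np

# Hypothetical dataset: 10,000 labeled examples (features X, labels y).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = np.sign(X[:, 0])

# Hold out an independent test set that is never used for training;
# it is only used to check the progress of the network.
split = 8_000
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]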

3
Basic Learning Algorithm
  • FindAttrTest
  • The training dataset is created
  • Ex: integers between 1 and 10
  • A random subset of the entire training dataset is
    used to train the algorithm
  • Ex: 1, 3, 7, 10
  • A threshold is picked
  • Ex: 9 (note: this is initialized to some number before training begins)
  • The algorithm returns 1 if the test example is
    less than the threshold or -1 if greater.
  • Ex: 1, 1, 1, -1
  • It also returns the threshold used (called the
    hypothesis)
  • Ex: 9
  • The threshold is updated to minimize the error
  • Ex: error = Σ(outputs) = 1 + 1 + 1 − 1 = 2
  • Ex: new threshold = old threshold − 0.1 × error = 9 − 0.1 × 2 = 8.8
  • Another subset of data is fed into the algorithm,
    and the process is repeated
  • After a number of these repetitions, the final
    threshold in this case is the mean of the data
    set
  • This is called the final hypothesis
  • In the above example, it would return 5
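A minimal Python sketch of this toy threshold learner, following the slide's running example (the function name, the 0.1 learning rate, and the round count are illustrative assumptions, not the paper's FindAttrTest implementation):

import random

def train_threshold(data, threshold=9.0, lr=0.1, rounds=50, subset_size=4):
    # Toy single-threshold learner in the spirit of the example above.
    for _ in range(rounds):
        subset = random.sample(data, subset_size)       # e.g. [1, 3, 7, 10]
        outputs = [1 if x < threshold else -1 for x in subset]
        error = sum(outputs)                            # e.g. 1 + 1 + 1 - 1 = 2
        threshold -= lr * error                         # e.g. 9 - 0.1 * 2 = 8.8
    return threshold                                    # the final hypothesis

data = list(range(1, 11))                               # integers between 1 and 10
print(train_threshold(data))                            # settles near the middle of the data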

4
Basics of Bagging
  • Multiple copies of the learning algorithm
    (FindAttrTest) are made
  • Each copy is trained a number of times
  • Each copy is trained on a set of m randomly
    picked examples from the entire training dataset
  • The output classifier (threshold) that appears
    the most often is the one that is selected as
    optimal
  • In the example on the previous slide, the threshold value that was
    returned most often (5) would be selected as optimal
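A hedged Python sketch of this bagging procedure (the stand-in weak learner, the number of copies, and the sample size m are illustrative assumptions):

import random
from collections import Counter

def train_on_sample(sample):
    # Stand-in weak learner: returns the mean of its sample, playing the
    # role of the threshold the learner would settle on for that subset.
    return sum(sample) / len(sample)

def bag(data, n_copies=25, m=6):
    # Train n_copies of the weak learner, each on m randomly picked
    # examples, and keep the output (rounded threshold) that appears most often.
    votes = Counter()
    for _ in range(n_copies):
        sample = [random.choice(data) for _ in range(m)]
        votes[round(train_on_sample(sample))] += 1
    return votes.most_common(1)[0][0]

data = list(range(1, 11))
print(bag(data))            # the most common rounded threshold, near the middle of the data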

5
Basics of Boosting
  • A distribution Dt is fed into the learning
    algorithm
  • A hypothesis ht is returned from the weak learner
  • Calculate the error of the hypothesis on that distribution
  • εt = Σ Dt(i), summed over the misclassified examples i
  • Update the distribution so that the examples that were classified
    incorrectly carry more weight in the data fed back to the algorithm,
    and examples classified correctly carry less
  • Repeat these steps T times
  • In the earlier example, Boosting would give more
    weight to numbers incorrectly classified by the
    algorithm in a previous run, thus finding the
    optimal hypothesis much faster than pure random
    selection
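A schematic of this loop in Python (weak_learn is any callable returning a hypothesis; the 0.5 down-weighting of correct examples is a placeholder, since the exact AdaBoost.M1 update appears on the following slides):

import numpy as np

def boost(X, y, weak_learn, T=10):
    # Generic boosting skeleton: repeatedly call the weak learner and shift
    # the distribution toward the examples it got wrong.
    m = len(X)
    D = np.full(m, 1.0 / m)                  # start with a uniform distribution
    hypotheses = []
    for _ in range(T):
        h = weak_learn(X, y, D)              # hypothesis h_t from the weak learner
        wrong = np.array([h(xi) != yi for xi, yi in zip(X, y)])
        error = D[wrong].sum()               # e_t = sum of D_t(i) over the mistakes
        D = np.where(wrong, D, D * 0.5)      # placeholder: reduce weight of correct examples
        D = D / D.sum()                      # renormalize so D stays a distribution
        hypotheses.append((h, error))
    return hypotheses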

6
AdaBoost.M1
  • Input: sequence of m samples ⟨(x1, y1), …, (xm, ym)⟩
    with labels yi ∈ Y = {1, …, k}
  • xi is the input vector of data
  • yi is the correct output class for the xi input
  • Example
  • Function we want to represent: F = sign(A)
  • xi consists of possible values for A taken from a
    large dataset S
  • x1 = 3
  • x2 = −2
  • Etc.
  • yi consists of the correct output for the given
    inputs
  • y1 = 1
  • y2 = −1
  • Etc.

7
AdaBoost.M1
  • Initialize D1(i) = 1/m for all i
  • Repeat the following procedure for t = 1, 2, …, T
  • Call the learning algorithm and provide it with
    the distribution Dt(i)
  • Get back hypothesis ht which maps X to Y
  • Calculate the error of ht
  • εt = Σ Dt(i), where the sum runs over all i with
    ht(xi) ≠ yi (the misclassified examples)
  • On the first round this is simply Nincorrect / m
  • If the error εt is greater than 0.5, stop training
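A small Python sketch of these two steps (the toy data and the weak hypothesis h are hypothetical):

import numpy as np

def weighted_error(h, X, y, D):
    # e_t = sum of D_t(i) over the examples i that h misclassifies.
    wrong = np.array([h(xi) != yi for xi, yi in zip(X, y)])
    return D[wrong].sum()

m = 4
D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m for all i
X, y = [3, -2, 5, 1], [1, -1, 1, 1]          # toy data for F = sign(A)
h = lambda x: 1 if x > 2 else -1             # a weak hypothesis that misses x = 1
print(weighted_error(h, X, y, D))            # first round: N_incorrect / m = 1/4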

8
AdaBoost.M1
  • Set βt = εt / (1 − εt)
  • This sets the scaling factor for the weight of a
    correctly predicted value
  • Ex: εt = 0.5 ⇒ βt = 1 (bad prediction, weight remains the same)
  • Ex: εt = 0.3 ⇒ βt ≈ 0.43 (the algorithm is getting better, so reduce the weight)
  • Ex: εt = 0.01 (near-perfect guess!) ⇒ βt ≈ 0.01 (the algorithm is nearly
    perfect on this input, so don't waste much more training time on it)
  • Update the distribution using the above scaling factor:
  • Dt+1(i) = (Dt(i) / Zt) × βt if ht(xi) = yi (correct guess)
  • Dt+1(i) = Dt(i) / Zt if ht(xi) ≠ yi (incorrect guess)
  • where Zt is a normalization constant chosen so that Dt+1 remains a distribution
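A short Python sketch of this scaling and update step (the example weights and the correct/incorrect pattern are made up for illustration):

import numpy as np

def update_distribution(D, beta, correct):
    # Multiply the weight of each correctly classified example by beta_t,
    # leave the misclassified ones unchanged, then renormalize (the 1/Z_t factor).
    D_new = D * np.where(correct, beta, 1.0)
    return D_new / D_new.sum()

epsilon = 0.3
beta = epsilon / (1.0 - epsilon)                  # beta_t = e_t / (1 - e_t), about 0.43
D = np.full(6, 1.0 / 6)
correct = np.array([True, True, False, True, False, True])
print(update_distribution(D, beta, correct))      # misclassified examples end up with more weight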
9
AdaBoost.M1
  • Output the final hypothesis:
  • hfin(x) = arg max over y in Y of Σ log(1/βt), where the sum is over
    the rounds t whose hypothesis predicted that label (ht(x) = y)
  • This equation says that each hypothesis votes for its predicted label
    with weight log(1/βt), so labels backed by low-error (low-βt)
    hypotheses win the vote
  • Remember that a high βt means that the hypothesis is not very good at
    predicting the output
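A minimal Python sketch of this weighted vote (the hypotheses and beta values are illustrative):

import numpy as np

def final_hypothesis(hypotheses, betas, x, labels):
    # h_fin(x): each round's hypothesis votes for its predicted label with
    # weight log(1 / beta_t); the label with the largest total vote wins.
    votes = {y: 0.0 for y in labels}
    for h, beta in zip(hypotheses, betas):
        votes[h(x)] += np.log(1.0 / beta)
    return max(votes, key=votes.get)

# Two low-error rounds voting 1 outweigh one lower-error round voting -1:
rounds = [lambda x: 1, lambda x: 1, lambda x: -1]
print(final_hypothesis(rounds, [0.43, 0.43, 0.25], x=0, labels=[1, -1]))   # -> 1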

10
AdaBoost.M1
  • One major disadvantage of AdaBoost.M1 is its
    inability to handle errors larger than 0.5
  • This is because the weighting factor becomes
    larger than 1 for errors greater than 0.5
  • This can drive some weights extremely high when the error is close to 1
  • This would cause the algorithm to train on only that one example, and
    learn only that one example
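  • Ex: an error of εt = 0.9 gives a scaling factor of βt = 0.9 / (1 − 0.9) = 9,
    so some weights are multiplied by 9 on every such round and the
    distribution quickly concentrates on a handful of examples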

11
AdaBoost.M2
  • AdaBoost.M2 was designed to overcome this
    difficulty
  • The learning algorithm is expanded to output a
    vector rather than just a scalar
  • Each element of the vector is the probability that class N matches the input
  • Ex: number recognition
  • Input is 7
  • Outputs would be high for 1 and 7 and medium to
    low for the other digits (since they appear
    similar)
  • 0.85, 0.6, 0.3, 0.7, 0.4, etc

12
AdaBoost.M2
  • Steps similar to AdaBoost.M1, except for error
    calculation and distribution update
  • Input: sequence of m samples ⟨(x1, y1), …, (xm, ym)⟩
    with labels yi ∈ Y = {1, …, k}
  • Uses modified learning algorithm
  • A vector of probability outputs
  • Each element in the vector is the probability
    that the input is part of the class associated
    with that element
  • Ex:
  • Input is any letter from the alphabet
  • Three output classes are A, B, C
  • Output for a particular input is the probability
    that the input is that class
  • Ex: input is a handwritten B; the output is (0.4, 0.8, 0.2)
    for the three classes
  • Integer T, specifying the number of iterations to
    be performed

13
AdaBoost.M2
  • Call learning algorithm
  • Get back hypothesis vector
  • Calculate error (now called pseudo-loss for a
    vector output)
  • Update input distribution
  • Output best hypothesis after T repetitions
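A hedged Python sketch of the pseudo-loss computation, following the standard AdaBoost.M2 definition (here h(x, y) is assumed to return a plausibility in [0, 1] for class y, and D is a weight indexed by each (example, incorrect label) pair):

def pseudo_loss(h, X, y_true, D, labels):
    # Pseudo-loss: 0.5 * sum over examples i and incorrect labels y of
    # D(i, y) * (1 - h(x_i, correct label) + h(x_i, y)).
    loss = 0.0
    for i, (x, yi) in enumerate(zip(X, y_true)):
        for label in labels:
            if label == yi:
                continue
            loss += D[(i, label)] * (1.0 - h(x, yi) + h(x, label))
    return 0.5 * loss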

14
Experiments
  • A collection of machine learning datasets is
    available on the UC Irvine website
  • These datasets were used to test the improved
    accuracy and speed of the AdaBoost algorithms

15
Results of Experiments
16
Results of Experiments
  • Boosting vs Bagging
  • AdaBoost.M1 yielded a 55.1% increase in accuracy
    over just using FindAttrTest
  • Bagging using error yielded only an 8.4% increase
    in accuracy
  • Bagging using pseudo-loss still yielded only a
    10.6% boost in accuracy
  • AdaBoost.M2 was at least as good as AdaBoost.M1
    in all trials, and in 9 of the 27 trials it
    yielded a dramatic boost in accuracy

17
Problem
  • What if a group of data can belong to multiple
    classes?
  • AdaBoost.M1 and AdaBoost.M2 only support output
    data that belongs to a single class
  • AdaBoost.MH, AdaBoost.MR, and AdaBoost.MO were
    created to solve these problems
  • They also provide a confidence rating on how sure
    the algorithm is that its output is correct

18
AdaBoost
  • Multiclass, multi-label problems are ones where each
    example can belong to several classes at once
  • One such example is newspaper article
    classification, where each article can belong to
    multiple categories
  • AdaBoost.MH uses Hamming loss, as well as updated
    learning algorithms, to increase accuracy in
    multiclass problems
  • This algorithm tries to predict all of the correct
    class labels, and only those labels (it cannot just
    predict one label; it must predict them all)
  • The pseudo-loss depends on how the predicted set
    of labels differs from the actual set of labels
  • AdaBoost.MO uses output coding to improve the
    accuracy of multiclass problems
  • This algorithm maps the single class label into a
    coded output label
  • Each labeled example (x, y) is mapped (one-to-one) to a new
    example (x, λ(y))
  • This new label is called the output coded label
  • AdaBoost.MR uses ranking loss to improve
    classification accuracy
  • The goal is to find a hypothesis that ranks the
    output labels, in the hope that the highest-ranked
    label will be the correct one
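A hedged sketch of the core reduction behind AdaBoost.MH: the multi-label problem is turned into one binary (+1 / -1) question per (example, label) pair, and Hamming loss counts the per-label mistakes (function names and data layout are illustrative assumptions):

def expand_to_pairs(X, label_sets, labels):
    # One binary question per (example, label) pair:
    # 'does this label belong to this example?' (+1 yes, -1 no).
    pairs = []
    for x, correct in zip(X, label_sets):
        for label in labels:
            target = 1 if label in correct else -1
            pairs.append(((x, label), target))
    return pairs

def hamming_loss(predict, pairs):
    # Fraction of (example, label) questions the predictor answers incorrectly.
    wrong = sum(1 for ((x, label), target) in pairs if predict(x, label) != target)
    return wrong / len(pairs)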

19
Experiments
  • Three algorithms tested
  • Discrete Valued AdaBoost.MH
  • Output value for each class is 1 or -1
  • Real Valued AdaBoost.MH
  • Output value for each class can be any real
    number
  • Discrete Valued AdaBoost.MR
  • Output value for each class is 1 or -1

20
Experiment Results (UCI)
[Figures: training set error and test set error curves for Discrete
AdaBoost.MH and Real AdaBoost.MH on the UCI datasets]
  • In some graphs the real AdaBoost.MH performed better, and in others
    the discrete AdaBoost.MH performed better
  • In a few cases the training set was overtrained, and the error
    started to rise again after an optimal number of training rounds
21
Experiment: Newspaper Article Classification
  • Newspaper articles are fed into the learning
    algorithms
  • Classifier makes decision based on the presence
    or absence of a phrase in the document
  • Each article is classified into one and only one
    category
  • The categories are
  • Domestic
  • Entertainment
  • Financial
  • International
  • Political
  • Washington
  • The same functions as in the previous experiment
    were used to classify the output
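A toy Python sketch of a weak classifier that decides based on the presence or absence of a phrase, as described above (the phrase and categories here are hypothetical examples):

def phrase_stump(phrase, label_if_present, label_if_absent):
    # Weak classifier: predict one category if the phrase occurs in the
    # article text and another category otherwise.
    def classify(article_text):
        return label_if_present if phrase in article_text.lower() else label_if_absent
    return classify

h = phrase_stump("box office", "Entertainment", "Domestic")
print(h("The weekend box office was led by ..."))        # -> Entertainment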

22
Experiment Results: Newspaper Article Classification
  • In this scenario, the Real Valued AdaBoost.MH
    greatly outperforms the other two methods
  • Number of training rounds to reach a test
    accuracy of 40%
  • Discrete AdaBoost.MR: 33,347 rounds
  • Discrete AdaBoost.MH: 16,938 rounds
  • Real AdaBoost.MH: 268 rounds