ECOC for Text Classification - PowerPoint PPT Presentation

Slides: 40
Provided by: Rayi151
Learn more at: http://www.cs.cmu.edu

1
Some Recent work
  • ECOC for Text Classification
  • Hybrids of EM & Co-Training (with Kamal Nigam)
  • Learning to build a monolingual corpus from the
    web (with Rosie Jones)
  • Effect of Smoothing on Naive Bayes for text
    classification (with Tong Zhang)
  • Hypertext Categorization using link and extracted
    information (with Sean Slattery & Yiming Yang)

2
Using Error-Correcting Codes For Text
Classification
  • Rayid Ghani
  • Center for Automated Learning & Discovery
  • Carnegie Mellon University

This presentation can be accessed at
http://www.cs.cmu.edu/rayid/talks/
3
Outline
  • Introduction to ECOC
  • Intuition & Motivation
  • Some Questions?
  • Experimental Results
  • Semi-Theoretical Model
  • Types of Codes
  • Drawbacks
  • Conclusions

4
Introduction
  • Decompose a multiclass classification problem
    into multiple binary problems
  • One-Per-Class Approach (moderately expensive)
  • All-Pairs (very expensive)
  • Distributed Output Code (efficient but what about
    performance?)
  • Error-Correcting Output Codes (?)

6
Is it a good idea?
  • Larger margin for error since errors can now be
    corrected
  • One-per-class is a code with minimum Hamming
    distance (HD) = 2
  • Distributed codes have low HD
  • The individual binary problems can be harder than
    before
  • Useless unless the number of classes is > 5
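
The claim that one-per-class is a code with minimum Hamming distance 2 can be checked directly. The sketch below assumes the usual representation of one-per-class as an m x m identity matrix; the helper names are illustrative and not from the talk. With distance 2, a single wrong bit can already leave the output equally close to two codewords.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def min_row_distance(code):
    """Smallest Hamming distance between any two distinct rows of a code."""
    return min(hamming(r, s) for r, s in combinations(code, 2))

def one_per_class(m):
    """m x m identity matrix: classifier i separates class i from the rest."""
    return [[int(i == j) for j in range(m)] for i in range(m)]

# Any two rows differ in exactly two positions, whatever the number of classes.
for m in (4, 6, 10):
    print(m, min_row_distance(one_per_class(m)))
```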

7
Training ECOC
     f1  f2  f3  f4  f5
A     0   0   1   1   0
B     1   0   1   0   0
C     0   1   1   1   0
D     0   1   0   0   1

  • Given m distinct classes

1. Create an m x n binary matrix M.
2. Each class is assigned ONE row of M.
3. Each column of the matrix divides the classes into TWO groups.
4. Train the base classifiers to learn the n binary problems.
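
The training steps can be sketched as a relabeling: each column of M turns the multiclass labels into one set of 0/1 targets, and a base classifier would then be trained on each set. This is a minimal sketch, not the talk's implementation. It assumes the slide's flattened matrix is read row by row for classes A to D, and the training labels are made up for illustration.

```python
# Code matrix M, one row per class (assumed row-by-row reading of the slide).
CODE = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

def binary_problems(labels, code):
    """Relabel multiclass labels into one 0/1 target list per code column."""
    n_bits = len(next(iter(code.values())))
    return [[code[y][j] for y in labels] for j in range(n_bits)]

labels = ["A", "C", "D", "B", "C"]  # hypothetical training labels
targets = binary_problems(labels, CODE)
for j, t in enumerate(targets, start=1):
    # Each line is the target vector that base classifier f_j would learn.
    print(f"f{j}: {t}")
```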
9
Testing ECOC
  • To test a new instance
  • Apply each of the n classifiers to the new
    instance
  • Combine the predictions to obtain a binary
    string (codeword) for the new point
  • Classify to the class with the nearest codeword
    (usually Hamming distance is used as the distance
    measure)
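
The decoding step can be sketched as a nearest-codeword search. The code matrix below assumes the flattened table on the slides is read row by row for classes A to D, and the bit string is the one shown for the new point X on a later slide. Ties are broken here by dictionary order, which is an arbitrary choice for the sketch.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits, code):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(code, key=lambda c: hamming(bits, code[c]))

# Assumed row-by-row reading of the code matrix from the slides.
CODE = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

x = [1, 1, 1, 1, 0]  # outputs of the five binary classifiers for X
distances = {c: hamming(x, w) for c, w in CODE.items()}
print(distances)
print(decode(x, CODE))
```

Under this reading of the matrix, X is one bit away from the codeword for C and at least two bits away from every other codeword.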

13
ECOC - Picture
     f1  f2  f3  f4  f5
A     0   0   1   1   0
B     1   0   1   0   0
C     0   1   1   1   0
D     0   1   0   0   1

X     1   1   1   1   0
14
  • Single classifier learns a complex boundary
    once
  • Ensemble learns a complex boundary multiple
    times
  • ECOC learns a simple boundary multiple times

15
Questions?
  • How well does it work?
  • How long should the code be?
  • Do we need a lot of training data?
  • What kind of codes can we use?
  • Are there intelligent ways of creating the code?

16
Previous Work
  • Combine with Boosting: ADABOOST.OC (Schapire,
    1997), (Guruswami & Sahai, 1999)
  • Local Learners (Ricci & Aha, 1997)
  • Text Classification (Berger, 1999)

17
Experimental Setup
  • Generate the code
  • BCH Codes
  • Choose a Base Learner
  • Naive Bayes Classifier as used in text
    classification tasks (McCallum & Nigam, 1998)

18
Dataset
  • Industry Sector Dataset
  • Consists of company web pages classified into 105
    economic sectors
  • Standard stoplist
  • No Stemming
  • Skip all MIME headers and HTML tags
  • Experimental approach similar to McCallum et al.
    (1998) for comparison purposes.

19
Results
ECOC is 88% accurate!
Classification Accuracies on five random 50-50
train-test splits of the Industry Sector dataset
with a vocabulary size of 10000.
20
Results
Industry Sector Data Set
Naïve Bayes   Shrinkage [1]   ME [2]   ME w/ Prior [3]   ECOC 63-bit
66.1          76              79       81.1              88.5

ECOC reduces the error of the Naïve Bayes
Classifier by 66%
  [1] (McCallum et al. 1998)  [2], [3] (Nigam et
    al. 1999)
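
The 66 figure is a relative reduction in error, not a difference in accuracy. The short check below derives it from the two accuracies in the table, assuming both are percentages.

```python
nb_acc, ecoc_acc = 66.1, 88.5            # accuracies from the table (percent)
nb_err, ecoc_err = 100 - nb_acc, 100 - ecoc_acc

# Relative error reduction: how much of the Naive Bayes error ECOC removes.
reduction = (nb_err - ecoc_err) / nb_err
print(f"error {nb_err:.1f} -> {ecoc_err:.1f}, reduction {reduction:.0%}")
```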

21
The Longer the Better!
Table 2: Average Classification Accuracy on 5
random 50-50 train-test splits of the Industry
Sector dataset with a vocabulary size of 10000
words selected using Information Gain.
  • Longer codes mean larger codeword separation
  • The minimum Hamming distance of a code C is the
    smallest distance between any pair of distinct
    codewords in C
  • If the minimum Hamming distance is h, then the code
    can correct ⌊(h-1)/2⌋ errors
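
As a quick check of the correction bound, the sketch below applies it to the minimum distances listed for the 15-, 31- and 63-bit codes in the Semi-Theoretical Model table later in the talk. The function name is illustrative.

```python
def max_correctable(h_min):
    """Number of bit errors a code with minimum Hamming distance h_min can correct."""
    return (h_min - 1) // 2

# (code length, minimum Hamming distance) pairs taken from the later table.
for n_bits, h_min in [(15, 5), (31, 11), (63, 31)]:
    print(f"{n_bits}-bit code, Hmin={h_min}: corrects up to {max_correctable(h_min)} errors")
```

The results match the Emax column of that table.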

22
Size Matters?
23
Size does NOT matter!
24
Semi-Theoretical Model
  • Model ECOC by a Binomial Distribution B(n,p)
  • n = length of the code
  • p = probability of each bit being classified
    incorrectly

25
Semi-Theoretical Model
  • Model ECOC by a Binomial Distribution B(n,p)
  • n = length of the code
  • p = probability of each bit being classified
    incorrectly

# of Bits   Hmin   Emax   Pave   Accuracy
15          5      2      .85    .59
15          5      2      .89    .80
15          5      2      .91    .84
31          11     5      .85    .67
31          11     5      .89    .91
31          11     5      .91    .94
63          31     15     .89    .99
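
One way to read the model is sketched below, under two assumptions that are not stated in the talk: the Pave column is treated as the average per-bit accuracy, so the per-bit error probability is 1 - Pave, and bit errors are independent. Predicted accuracy is then the probability that at most Emax of the n bits are wrong. The function name is illustrative.

```python
from math import comb

def model_accuracy(n_bits, e_max, p_bit_correct):
    """P(at most e_max of n_bits independent bits are wrong)."""
    q = 1.0 - p_bit_correct  # assumed per-bit error probability
    return sum(
        comb(n_bits, k) * q**k * (1.0 - q) ** (n_bits - k)
        for k in range(e_max + 1)
    )

# (number of bits, Emax, Pave) rows taken from the table above.
for n, e, p in [(15, 2, 0.85), (15, 2, 0.89), (31, 5, 0.89), (63, 15, 0.89)]:
    print(n, e, p, round(model_accuracy(n, e, p), 2))
```

Under these assumptions the values land in the same range as the Accuracy column without matching it exactly, and they show the same trend: at a fixed per-bit accuracy, longer codes with larger Emax give higher predicted accuracy.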
28
Talk.misc.religion, Comp.sys.ibm.hardware
Comp.os.windows, Comp.sys.ibm.hardware
Comp.os.windows, Talk.misc.religion
Comp.os.windows, Alt.atheism
Talk.misc.religion, Alt.atheism
Comp.sys.ibm.hardware, Alt.atheism
Alt.atheism
Talk.misc.religion
Comp.os.windows
Talk.misc.religion, Comp.sys.ibm.hardware, Comp.os.windows
Alt.atheism, Comp.sys.ibm.hardware, Comp.os.windows
Alt.atheism, Comp.sys.ibm.hardware, Talk.misc.religion
29
Types of Codes
  • Data-Independent
  • Data-Dependent

Code types: Hand-Constructed, Adaptive, Algebraic, Random
30
What is a Good Code?
  • Row Separation
  • Column Separation (Independence of errors for
    each binary classifier)
  • Efficiency (for long codes)
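
Row and column separation can both be measured as minimum Hamming distances, over the rows and over the columns of the code matrix. The sketch below does this for the small 4 x 5 illustration matrix from the earlier slides, assuming it is read row by row. The function names are illustrative.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(vectors):
    """Smallest Hamming distance between any two distinct vectors."""
    return min(hamming(u, v) for u, v in combinations(vectors, 2))

def separation(code):
    """Return (minimum row distance, minimum column distance) of a code."""
    columns = list(zip(*code))  # transpose: one tuple per binary problem
    return min_distance(code), min_distance(columns)

# Assumed row-by-row reading of the illustration matrix (classes A to D).
CODE = [
    [0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1],
]
row_sep, col_sep = separation(CODE)
print(row_sep, col_sep)
```

Both minima are 1 for this toy matrix, so it illustrates the mechanics rather than a good code: a well-separated code would have larger values on both measures.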

31
Choosing Codes
             Random                         Algebraic
Row Sep      On average (for long codes)    Guaranteed
Col Sep      On average (for long codes)    Can be guaranteed
Efficiency   No                             Yes
32
Experimental Results
Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate
15-Bit BCH      5            15           49           64           20.6
19-Bit Hybrid   5            18           15           69           22.3
15-Bit Random   2 (1.5)      13           42           60           24.1
33
Interesting Questions?
  • NB does not give good probability estimates;
    does using ECOC result in better estimates?
  • Assignment of codewords to classes?
  • Can Decoding be posed as a supervised learning
    task?

34
Drawbacks
  • Can be computationally expensive
  • Random Codes throw away the real-world nature of
    the data by picking random partitions to create
    artificial binary problems

35
Current Work
  • Combine ECOC with Co-Training to use unlabeled
    data
  • Automatically construct optimal / adaptive codes

36
Conclusion
  • Performs well on text classification tasks
  • Can be used when training data is sparse
  • Algebraic codes perform better than random codes
    for a given code length
  • Hand-constructed codes may not be the answer

37
Background
  • Co-training seems to be the way to go when there
    is (and maybe even when there isn't) a feature
    split in the data
  • Reported results on co-training only deal with
    very small (toy) problems, mostly binary
    classification tasks (Blum & Mitchell 98, Nigam &
    Ghani 2000)

38
Co-Training Challenge
  • Task: Apply cotraining to a 65-class dataset
    containing 130,000 training examples
  • Result: Cotraining fails!

39
Solution?
  • ECOC seems to work well when there are a large
    number of classes
  • ECOC decomposes a multiclass problem into
    several binary problems
  • Cotraining works well with binary problems

Combine ECOC and Cotrain
40
Algorithm
  • Learn each bit for ECOC using a cotrained
    classifier

41
Dataset (Job Descriptions)
  • 65 classes
  • 32000 examples
  • Two feature sets: Title and Description

43
Results
  • 10% train, 50% unlabeled, 40% test
  • NB: 40.3
  • ECOC: 48.9
  • EM: 30.83
  • CoTraining
  • ECOC-EM
  • ECOC-Cotrain
  • ECOC-CoEM