Title: Web Taxonomy Integration through Co-Bootstrapping
Web Taxonomy Integration through Co-Bootstrapping
- Dell Zhang
- National University of Singapore
- Wee Sun Lee
- National University of Singapore
- SIGIR 2004
Problem Statement
- Task: given websites organized under one web taxonomy (the source), place them into the categories of another web taxonomy (the master)
- Master taxonomy (e.g., Google), after integration:
  - Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home, EverQuest Addict, Warcraft III Clan
  - Games > Strategy: Shogun Total War, Warcraft III Clan
- Source taxonomy (e.g., Yahoo!):
  - Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
  - Games > Strategy: Shogun Total War
  - Games > Online: EverQuest Addict, Warcraft III Clan
  - Games > Single-Player: Warcraft III Clan
Possible Approach
- Train a classifier on the master categories, then classify the source sites
  - Train on Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
  - Train on Games > Strategy: Shogun Total War
  - Classify: EverQuest Addict, Warcraft III Clan
- Problem: this ignores the original Yahoo! categories
Another Approach (1/2)
- Use the original Yahoo! categories as additional evidence
- Advantage
  - the two taxonomies contain many similar categories
- Potential problem
  - the taxonomies have different structures
  - categories do not match exactly
Another Approach (2/2)
- Example: Crayon Shin-chan appears under different paths in the two directories
  - Yahoo!: Entertainment > Comics and Animation > Animation > Anime > Titles > Crayon Shin-chan
  - Google: Arts > Animation > Anime > Titles > C > Crayon Shin-chan
This Paper's Approach
- Weak learner (as opposed to Naïve Bayes)
- Boosting to combine weak hypotheses
- New idea: Co-Bootstrapping to exploit the source categories
Assumptions
- Multi-category data are reduced to single-category (binary) data, as in the sketch below
  - Totoro Fan: Cartoon > My Neighbor Totoro, Toys > My Neighbor Totoro
  - is converted into
  - (Totoro Fan, Cartoon > My Neighbor Totoro)
  - (Totoro Fan, Toys > My Neighbor Totoro)
- Hierarchies are ignored
  - Console > Sega and Console > Sega > Dreamcast are treated as unrelated
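A minimal sketch of this reduction, assuming a simple dict-of-lists input layout (the function name and data format are illustrative, not from the paper):

```python
def to_binary_examples(site_categories):
    """Expand {document: [category, ...]} into flat (document, category)
    pairs; each full category path is kept as one opaque label, so the
    hierarchy is ignored, matching the assumptions above."""
    return [(doc, cat) for doc, cats in site_categories.items() for cat in cats]

pairs = to_binary_examples({
    "Totoro Fan": ["Cartoon/My Neighbor Totoro", "Toys/My Neighbor Totoro"],
})
# -> [('Totoro Fan', 'Cartoon/My Neighbor Totoro'),
#     ('Totoro Fan', 'Toys/My Neighbor Totoro')]
```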
- Weak Learner
- Boosting
- Co-Bootstrapping
Weak Learner
- A type of classifier, similar in role to Naïve Bayes
- Outputs + (accept) or − (reject) for a (document, category) pair
- The term may be a word, an n-gram, ...
- After training, the weak learner produces a Weak Hypothesis (a term-based classifier)
Weak Hypothesis Example
- Does the site contain the term "Crayon Shin-chan"?
- If it does:
  - in Comics > Crayon Shin-chan
  - not in Education > Early Childhood
- If it does not:
  - not in Comics > Crayon Shin-chan
  - in Education > Early Childhood
Weak Learner Inputs (1/2)
- Training data are in the form (x1, y1), (x2, y2), ..., (xm, ym)
  - xi is a document
  - yi is a category
  - (xi, yi) means document xi is in category yi
- D(x, y) is a distribution over all combinations of xi and yj
  - D(xi, yj) indicates the importance of the pair (xi, yj)
- w is the term (found automatically)
Weak Learner Algorithm
- For each possible category y, compute four values: the total weight of training pairs with/without the term w that are/are not in y
  - $W_1^{+} = \sum_{i:\, w \in x_i,\ y_i = y} D(x_i, y)$
  - $W_1^{-} = \sum_{i:\, w \in x_i,\ y_i \neq y} D(x_i, y)$
  - $W_0^{+} = \sum_{i:\, w \notin x_i,\ y_i = y} D(x_i, y)$
  - $W_0^{-} = \sum_{i:\, w \notin x_i,\ y_i \neq y} D(x_i, y)$
- Note: a pair (xi, y) with greater D(xi, y) has more influence.
Weak Hypothesis h(x, y)
- Given an unclassified document x and a category y
- If x contains w, then $h(x, y) = \frac{1}{2}\ln\frac{W_1^{+}}{W_1^{-}}$
- Else, if x does not contain w, then $h(x, y) = \frac{1}{2}\ln\frac{W_0^{+}}{W_0^{-}}$
Weak Learner Comments
- If sign(h(x, y)) = +1, then x is predicted to be in y
- |h(x, y)| is the confidence of that prediction
- The term w is found as follows (see the sketch below)
  - repeatedly run the weak learner for all possible w
  - choose the run with the smallest value of $Z = 2\sum_{b \in \{0,1\}} \sqrt{W_b^{+} W_b^{-}}$ as the model
- Boosting minimizes the probability of h(x, y) having the wrong sign
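A hedged Python sketch of this weak learner; the slides omit implementation details, so the data layout, the EPS smoothing constant, and all names here are assumptions:

```python
import math

EPS = 1e-8  # smoothing to avoid log(0); an assumption, the slides omit it

def fit_weak_hypothesis(docs, labels, D, categories):
    """Term-based weak learner sketch.

    docs[i]   : set of terms in document x_i
    labels[i] : the single category y_i of x_i
    D[i][y]   : importance weight of the pair (x_i, y)
    Tries every term in the corpus and returns the weak hypothesis
    h(x, y) built from the term w with the smallest Z.
    """
    vocabulary = set().union(*docs)
    best = None
    for w in vocabulary:
        Z, table = 0.0, {}
        for y in categories:
            # Four weighted sums: term present/absent x document in/not in y
            W = {(b, c): EPS for b in (0, 1) for c in (+1, -1)}
            for i, x in enumerate(docs):
                b = 1 if w in x else 0
                c = +1 if labels[i] == y else -1
                W[(b, c)] += D[i][y]
            # Real-valued output per branch: 1/2 * ln(W_b^+ / W_b^-)
            table[y] = {b: 0.5 * math.log(W[(b, +1)] / W[(b, -1)]) for b in (0, 1)}
            Z += 2.0 * sum(math.sqrt(W[(b, +1)] * W[(b, -1)]) for b in (0, 1))
        if best is None or Z < best[0]:
            best = (Z, w, table)
    _, w, table = best

    def h(x, y):
        # sign(h) is the accept/reject decision, |h| the confidence
        return table[y][1 if w in x else 0]

    return h
```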
- Weak Learner
- Boosting
- Co-Bootstrapping
Boosting Idea
- Train the weak learner on different distributions Dt(x, y)
- After each run, adjust Dt(x, y) by putting more weight on the most often misclassified training data
- Output the final hypothesis as a linear combination of the weak hypotheses
Boosting Algorithm
- Given (x1, y1), (x2, y2), ..., (xm, ym), where xi ∈ X and yi ∈ Y
- Initialize $D_1(x_i, y) = 1/(mk)$
- for t = 1, ..., T do
  - Pass distribution Dt to the weak learner
  - Get weak hypothesis ht(x, y)
  - Choose $\alpha_t \in \mathbb{R}$
  - Update $D_{t+1}(x_i, y) = \frac{D_t(x_i, y)\,\exp(-\alpha_t\, Y_i[y]\, h_t(x_i, y))}{Z_t}$, where $Y_i[y] = +1$ if $y = y_i$ and $-1$ otherwise, and $Z_t$ is a normalization factor
- end for
- Output the final hypothesis $f(x, y) = \sum_{t=1}^{T} \alpha_t h_t(x, y)$
Boosting Algorithm Initialization
- Given (x1, y1), (x2, y2), ..., (xm, ym)
- Initialize D1(x, y) = 1/(mk)
  - k = total number of categories
  - i.e., the uniform distribution over all (document, category) pairs
Boosting Algorithm Loop
- for t = 1, ..., T do
  - Run the weak learner using distribution Dt
  - Get weak hypothesis ht(x, y)
  - For each possible pair (x, y) in the training data:
    - if ht(x, y) guesses incorrectly, increase D(x, y)
- end for
- Return the combined hypothesis (a sketch of the whole loop follows)
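A minimal AdaBoost.MH-style sketch of this loop, reusing fit_weak_hypothesis from the earlier sketch; fixing αt = 1, the usual choice for real-valued weak hypotheses, is an assumption:

```python
import math

def adaboost_mh(docs, labels, categories, T=10):
    """AdaBoost.MH-style sketch built on fit_weak_hypothesis (defined above)."""
    m, k = len(docs), len(categories)
    # D_1(x, y) = 1/(mk): uniform over all (document, category) pairs
    D = [{y: 1.0 / (m * k) for y in categories} for _ in range(m)]
    hypotheses = []
    for _ in range(T):
        h = fit_weak_hypothesis(docs, labels, D, categories)
        hypotheses.append(h)
        # Re-weight: pairs the hypothesis got wrong gain weight
        # (alpha_t = 1 is folded into the real-valued h)
        total = 0.0
        for i, x in enumerate(docs):
            for y in categories:
                sign = +1 if labels[i] == y else -1
                D[i][y] *= math.exp(-sign * h(x, y))
                total += D[i][y]
        for row in D:  # renormalize back to a distribution
            for y in categories:
                row[y] /= total

    # Final hypothesis: sum of weak hypotheses; predict the argmax category
    def f(x):
        return max(categories, key=lambda y: sum(h(x, y) for h in hypotheses))

    return f
```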
- Weak Learner
- Boosting
- Co-Bootstrapping
Co-Bootstrapping Idea
- We want to use the Yahoo! categories to increase classification accuracy
Recall Example Problem
- Source taxonomy (Yahoo!):
  - Games > Online: EverQuest Addict, Warcraft III Clan
  - Games > Single-Player: Warcraft III Clan
  - Games > Roleplaying: Final Fantasy Fan, Dragon Quest Home
  - Games > Strategy: Shogun Total War
Co-Bootstrapping Algorithm (1/4)
- 1. Run AdaBoost on Yahoo! sites
  - Get classifier Y1
- 2. Run AdaBoost on Google sites
  - Get classifier G1
- 3. Run Y1 on Google sites
  - Get predicted Yahoo! categories for the Google sites
- 4. Run G1 on Yahoo! sites
  - Get predicted Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (2/4)
- 5. Run AdaBoost on Yahoo! sites
  - Include the predicted Google category as a feature
  - Get classifier Y2
- 6. Run AdaBoost on Google sites
  - Include the predicted Yahoo! category as a feature
  - Get classifier G2
- 7. Run Y2 on the original Google sites
  - Get more accurate Yahoo! categories for the Google sites
- 8. Run G2 on the original Yahoo! sites
  - Get more accurate Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (3/4)
- 9. Run AdaBoost on Yahoo! sites
  - Include the predicted Google category as a feature
  - Get classifier Y3
- 10. Run AdaBoost on Google sites
  - Include the predicted Yahoo! category as a feature
  - Get classifier G3
- 11. Run Y3 on the original Google sites
  - Get even more accurate Yahoo! categories for the Google sites
- 12. Run G3 on the original Yahoo! sites
  - Get even more accurate Google categories for the Yahoo! sites
Co-Bootstrapping Algorithm (4/4)
- Repeat, repeat, and repeat
- Hopefully, the classification becomes more accurate after each iteration (see the sketch below)
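Putting steps 1-12 together, a hedged sketch of the whole loop built on the adaboost_mh sketch above; the "Y:"/"G:" prefixes that keep predicted-category pseudo-terms separate from real words are an assumption:

```python
def co_bootstrap(yahoo_docs, yahoo_labels, google_docs, google_labels,
                 yahoo_cats, google_cats, rounds=3):
    """Co-bootstrapping sketch.

    Each round trains one boosted classifier per taxonomy, injecting the
    other classifier's predicted category as an extra pseudo-term feature.
    """
    google_feats = [set() for _ in yahoo_docs]   # predicted Google categories
    yahoo_feats = [set() for _ in google_docs]   # predicted Yahoo! categories
    Y = G = None
    for _ in range(rounds):
        # Steps 1-2 / 5-6 / 9-10: train on each taxonomy's own sites,
        # augmented with the other taxonomy's predicted category (empty
        # in the first round, so round one is plain AdaBoost)
        yahoo_aug = [x | google_feats[i] for i, x in enumerate(yahoo_docs)]
        google_aug = [x | yahoo_feats[i] for i, x in enumerate(google_docs)]
        Y = adaboost_mh(yahoo_aug, yahoo_labels, yahoo_cats)
        G = adaboost_mh(google_aug, google_labels, google_cats)
        # Steps 3-4 / 7-8 / 11-12: re-predict the cross-taxonomy categories
        yahoo_feats = [{"Y:" + Y(x)} for x in google_aug]
        google_feats = [{"G:" + G(x)} for x in yahoo_aug]
    return Y, G
```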
Enhanced Naïve Bayes (Benchmark)
Enhanced Naïve Bayes (1/2)
- Given
  - a document x
  - the source category S of x
- Predict the master category C
- In NB: $\Pr[C \mid x] \propto \Pr[C] \prod_{w \in x} \Pr[w \mid C]^{n(x,w)}$
  - w: a word
  - n(x, w): number of occurrences of w in x
- In ENB: $\Pr[C \mid x, S] \propto \Pr[C \mid S] \prod_{w \in x} \Pr[w \mid C]^{n(x,w)}$
Enhanced Naïve Bayes (2/2)
- Plain NB uses the prior Pr[C]; ENB replaces it with Pr[C | S]
- Estimate $\Pr[C \mid S] \propto |C \cap S|^{\omega}$, where ω is a tuning exponent (sketch below)
- |C ∩ S| = number of docs in S that are classified into C by the plain NB classifier
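A hedged log-space sketch of ENB scoring under the reading above; the exact form of Pr[C | S] and the +1 smoothing are assumptions, since the slide's formula did not survive extraction:

```python
import math

def enb_score(x_counts, C, S_nb_predictions, cond, omega=1.0):
    """Score master category C for a document x from source category S.

    x_counts         : {word: n(x, w)} for the document x
    S_nb_predictions : plain-NB predicted master category for each document in S
    cond[C][w]       : Pr[w | C], assumed pre-smoothed
    omega            : the ENB tuning exponent (see the slide above)
    """
    overlap = sum(1 for c in S_nb_predictions if c == C)  # |C ∩ S|
    score = omega * math.log(overlap + 1)  # log Pr[C|S] up to a constant
    for w, n in x_counts.items():
        score += n * math.log(cond[C].get(w, 1e-9))
    return score

# Predict the master category maximizing the score:
# best = max(master_cats, key=lambda C: enb_score(x_counts, C, S_preds, cond))
```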
Datasets
Number of Categories/Dataset (1/2)
- Top-level categories only
Number of Categories/Dataset (2/2)
- Book
  - Horror
  - Science Fiction
  - Non-fiction
    - Biography
    - History
- Deeper categories are merged into their top-level category: Biography and History merge into Non-fiction
Number of Websites
Method (1/2)
- Classify Yahoo! Book websites into Google Book categories (G←Y)
- Find G∩Y for Book: the sites listed in both directories
- Hide the Google categories of the sites in G∩Y
  - G∩Y ⊆ Yahoo! Book, so each test site keeps its Yahoo! category
- Randomly take |G∩Y| sites from G−Y ⊆ Google Book
Method (2/2)
- For each dataset, do G←Y five times and Y←G five times
- macro F-score: calculate the F-score for each category, then average over all categories
- micro F-score: calculate the F-score on the entire dataset (see the sketch below)
  - recall = 100%?
- The paper doesn't say anything about multi-category ENB
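For reference, a small sketch of the two averaging schemes just described (the counts layout is illustrative):

```python
def f_score(tp, fp, fn):
    """Standard F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(counts):
    """counts: {category: (tp, fp, fn)}.

    Macro-averaged F: per-category F, then the unweighted mean.
    Micro-averaged F: pool the counts over all categories first.
    """
    macro = sum(f_score(*c) for c in counts.values()) / len(counts)
    tp, fp, fn = (sum(c[j] for c in counts.values()) for j in range(3))
    micro = f_score(tp, fp, fn)
    return macro, micro
```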
Results (1/3)
- Co-Bootstrapping-AdaBoost > AdaBoost
- (charts: macro-averaged and micro-averaged F scores per dataset)
Results (2/3)
- Co-Bootstrapping-AdaBoost iteratively improves over AdaBoost
- (chart: F score per iteration on the Book dataset)
Results (3/3)
- Co-Bootstrapping-AdaBoost > Enhanced Naïve Bayes
- (charts: macro-averaged and micro-averaged F scores per dataset)
Contribution
- Co-Bootstrapping improves Boosting performance
- Does not require tuning a parameter ω, as ENB does