Title: Web Taxonomy Integration through Co-Bootstrapping
1. Web Taxonomy Integration through Co-Bootstrapping
- Dell Zhang, Wee Sun Lee
- SIGIR 2004
2. Outline
- Introduction
- Problem Statement
- State-of-the-Art
- Our Approach
- Experiments
- Conclusion
3. Introduction
- Taxonomy
- A taxonomy, or directory or catalog, is a
division of a set of objects (e.g. documents)
into a set of categories.
Adopted from a slide of Srikant.
4. Introduction
- Taxonomy Integration
- Integrating objects from a source taxonomy N into
a master taxonomy M.
Adopted from a slide of Srikant.
5. Introduction
- Taxonomy Integration
- Integrating objects from a source taxonomy N into a master taxonomy M.
[Figure: a master taxonomy M with categories C1, C2, C3 containing objects a-f, and source objects w, x, y, z to be integrated.]
Adopted from a slide of Srikant.
6. Introduction
- Applications
- Today
- Web Marketplaces (e.g. Amazon)
- Web Portals (e.g., NCSTRL)
- Personal Bookmarks
- Organizational Resources
7. Introduction
- Applications
- Future: the Semantic Web
- Ontology Merging
- Ontology Mapping
- Content-based similarity between two concepts (categories).
- Doan, A., Madhavan, J., Domingos, P. and Halevy, A. Learning to Map between Ontologies on the Semantic Web. In Proceedings of WWW2002.
- Need to do taxonomy integration first.
8. Introduction
- Why Machine Learning?
- Correspondences between taxonomies are inevitably noisy and fuzzy.
- Arts/Music/Styles/ vs. Entertainment/Music/Genres/
- Computers_and_Internet/Software/Freeware vs. Computers/Open_Source/Software
- Manual taxonomy integration is tedious, error-prone, and unscalable.
9. Problem Statement
- Taxonomy Integration as a Classification Problem
- A master taxonomy M with a set of categories C1, C2, ..., Cm, each containing a set of objects.
- Classes: the master categories C1, C2, ..., Cm.
- Training examples: the objects in M.
- A source taxonomy N with a set of categories S1, S2, ..., Sn, each containing a set of objects.
- Test examples: the objects in N (see the sketch below).
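To make this framing concrete, here is a minimal Python sketch (the toy objects and category names are hypothetical, not the paper's data) of how the master categories become classes, M's objects become labeled training examples, and N's objects become the test examples:

    # Hypothetical toy data: each taxonomy maps category names to object sets.
    master = {"C1": {"a", "b"}, "C2": {"c", "d"}, "C3": {"e", "f"}}
    source = {"S1": {"w", "x"}, "S2": {"y", "z"}}

    # Classes are the master categories; training examples are M's objects,
    # each labeled (multi-label) with every master category containing it.
    classes = sorted(master)
    train = {}
    for cat, objs in master.items():
        for obj in objs:
            train.setdefault(obj, set()).add(cat)

    # Test examples are N's objects; their master categories must be predicted.
    test = sorted(set().union(*source.values()))
    print(classes)  # ['C1', 'C2', 'C3']
    print(test)     # ['w', 'x', 'y', 'z']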
10. Problem Statement
- Characteristics
- It's a multi-class, multi-label classification problem.
- There are usually more than two possible classes.
- One object may belong to more than one class.
- The test examples are already known to the machine learning algorithm.
- The test examples are already labeled with a set of categories that are not identical to, but relevant to, the set of categories to be assigned.
11. State-of-the-Art
- Conventional Learning Algorithms
- Do not exploit information in the source taxonomy.
- NB (Naïve Bayes)
- SVM (Support Vector Machine)
- Enhanced Learning Algorithms
- Exploit information in the source taxonomy to
build better classifiers for the master taxonomy.
12. State-of-the-Art
- Enhanced Learning Algorithms
- Agrawal, R. and Srikant, R. On Integrating Catalogs. In Proceedings of WWW2001.
- ENB (Enhanced Naïve Bayes)
13. State-of-the-Art
- Enhanced Learning Algorithms
- Sarawagi, S., Chakrabarti, S. and Godbole, S. Cross-Training: Learning Probabilistic Mappings between Topics. In Proceedings of KDD2003.
- EM2D (Expectation Maximization in 2 Dimensions)
- CT-SVM (Cross-Training SVMs)
14. State-of-the-Art
- Enhanced Learning Algorithms
- Zhang, D. and Lee, W.S. Web Taxonomy Integration using Support Vector Machines. In Proceedings of WWW2004.
- CS-TSVM (Cluster Shrinkage Transductive SVM)
15. Our Approach
- Motivation
- Potentially useful semantic relationships between a master category C and a source category S include:
- identical,
- mutual-exclusive,
- superset,
- subset,
- partially-overlapping.
16. Our Approach
- Motivation
- In addition, semantic relationships may involve multiple master and source categories.
- For example, a master category C may be a subset of the union of two source categories Sa and Sb, so if an object does not belong to either Sa or Sb, it cannot belong to C.
17. Our Approach
- Motivation
- Real-world semantic relationships are noisy and fuzzy, but they can still provide valuable information for classification.
- For example, knowing that most (80%) of the objects in a source category S belong to one master category Ca while the rest (20%) belong to another master category Cb is obviously helpful.
18. Our Approach
- Idea
- The difficulty is that the knowledge about those semantic relationships is not explicit but hidden in the data.
- If we had indicator functions for each category in N, we could take those indicator functions as features when learning the classifier for M. This allows us to exploit the semantic relationships among the categories of M and N without explicitly figuring out what those relationships are.
19. Our Approach
- Idea
- Specifically, for each object in M, we augment the set of conventional term-features FT with a set of source category-features FN = {fN1, ..., fNn}. The j-th source category-feature fNj of a given object x is a binary feature indicating whether x belongs to the j-th source category Sj (see the sketch below).
- In the same way, we can get a set of master category-features FM.
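As an illustration, a minimal sketch (hypothetical vocabulary and category names) of how an object's binary term-features FT are concatenated with its binary source category-features FN:

    # Build the augmented feature vector FT + FN for one object.
    def features(doc_terms, doc_source_cats, vocabulary, source_cats):
        term_feats = [1 if t in doc_terms else 0 for t in vocabulary]        # FT
        cat_feats = [1 if s in doc_source_cats else 0 for s in source_cats]  # FN
        return term_feats + cat_feats

    vocabulary = ["guitar", "jazz", "rock"]
    source_cats = ["S1", "S2"]
    print(features({"jazz", "guitar"}, {"S2"}, vocabulary, source_cats))
    # [1, 1, 0, 0, 1] -> three term-features followed by two category-features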
20. Our Approach
- Major Considerations (1/2)
- Q1: How can we train the classifier using two different kinds of features (term-features and category-features)?
- A1: Boosting.
- Inspired by:
- Cai, L. and Hofmann, T. Text Categorization by Boosting Automatically Extracted Concepts. In Proceedings of SIGIR2003.
21. Our Approach
- Boosting
- A boosting algorithm combines many weak hypotheses/learners (moderately accurate classification rules) into a highly accurate classifier.
- A boosting algorithm can utilize different kinds of weak hypotheses/learners for different kinds of features, and weight them optimally.
- For example, boosting enables us to use decision tree hypotheses/learners for the category-features and Naïve Bayes hypotheses/learners for the term-features.
22. Our Approach
- Boosting
- The most popular boosting algorithm is AdaBoost, introduced in 1995 by Freund and Schapire.
- Our work is based on its multi-class multi-label version, AdaBoost.MH.
- The weak hypotheses used in this paper are simple decision stumps, each of which corresponds to a binary term-feature or category-feature (see the sketch below).
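For concreteness, here is a compact toy reconstruction of AdaBoost.MH with real-valued decision stumps over binary features (a sketch following Schapire and Singer's formulation, not BoosTexter itself; the smoothing constant eps is an assumption added to keep the logarithms finite):

    import math

    def adaboost_mh(X, Y, k, rounds=10, eps=1e-10):
        # X: binary feature vectors; Y: sets of class indices; k: class count.
        n, d = len(X), len(X[0])
        sign = [[1 if l in Y[i] else -1 for l in range(k)] for i in range(n)]
        D = [[1.0 / (n * k)] * k for _ in range(n)]  # weights over (example, label)
        stumps = []
        for _ in range(rounds):
            best = None
            for j in range(d):  # candidate stump on each binary feature
                # W[b][l][s]: weight of pairs with feature value b, label l,
                # and true sign s (0 = negative, 1 = positive).
                W = [[[eps, eps] for _ in range(k)] for _ in range(2)]
                for i in range(n):
                    for l in range(k):
                        W[X[i][j]][l][1 if sign[i][l] > 0 else 0] += D[i][l]
                # Z is the normalizer achieved by the optimal stump outputs.
                Z = sum(2.0 * math.sqrt(W[b][l][0] * W[b][l][1])
                        for b in range(2) for l in range(k))
                if best is None or Z < best[0]:
                    c = [[0.5 * math.log(W[b][l][1] / W[b][l][0])
                          for l in range(k)] for b in range(2)]
                    best = (Z, j, c)
            Z, j, c = best
            stumps.append((j, c))
            for i in range(n):  # reweight: hard (example, label) pairs gain weight
                for l in range(k):
                    D[i][l] *= math.exp(-sign[i][l] * c[X[i][j]][l]) / Z

        def predict(x):  # labels whose combined score is positive
            return {l for l in range(k)
                    if sum(c[x[j]][l] for j, c in stumps) > 0}
        return predict

    # Toy usage: feature 0 indicates class 0, feature 1 indicates class 1.
    clf = adaboost_mh([[1, 0], [0, 1], [1, 1]], [{0}, {1}, {0, 1}], k=2)
    print(clf([1, 0]), clf([0, 1]))  # expected: {0} {1}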
23. Our Approach
- Major Considerations (2/2)
- Q2: How can we train the classifier when the values of the source category-features FN of the training examples are unknown?
- A2: Co-Bootstrapping.
- Inspired by:
- Blum, A. and Mitchell, T. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of COLT1998.
- It is named Cross-Training by Chakrabarti et al.
24. Our Approach
- Co-Bootstrapping
- Train two classifiers symmetrically: one for the master categories and the other for the source categories.
- These two classifiers collaborate to mutually bootstrap each other (sketched below).
25. Our Approach
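A pseudocode-style sketch of the procedure (the `train` and `predict` arguments are hypothetical stand-ins for any multi-class multi-label learner, e.g. the AdaBoost.MH sketch above adapted to emit binary category vectors):

    def co_bootstrap(M_docs, M_cats, N_docs, N_cats, train, predict, rounds=5):
        # M_docs / N_docs: term-feature vectors of the objects in M and N.
        # M_cats / N_cats: known binary category vectors of each object within
        # its own taxonomy; these double as the training labels.
        fN = [[0] * len(N_cats[0]) for _ in M_docs]  # source cats of M's objects
        fM = [[0] * len(M_cats[0]) for _ in N_docs]  # master cats of N's objects
        for _ in range(rounds):
            # Train each classifier on term-features plus the other taxonomy's
            # (currently predicted) category-features.
            clf_M = train([d + f for d, f in zip(M_docs, fN)], M_cats)
            clf_N = train([d + f for d, f in zip(N_docs, fM)], N_cats)
            # Each classifier then refreshes the other's category-features,
            # using the known category-features on the prediction side.
            fM = [predict(clf_M, d + f) for d, f in zip(N_docs, N_cats)]
            fN = [predict(clf_N, d + f) for d, f in zip(M_docs, M_cats)]
        return fM  # final master-category predictions for N's objects

With each round, the predicted category-features should become more reliable, so the two classifiers gradually bootstrap each other up, as described on the previous slide.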
26. Experiments
- Datasets
- Tasks
- Features
- Measures
- Settings
- Results
27. Experiments: Datasets

Dataset | Google | Yahoo
Book | /Top/Shopping/Publications/Books/ | /Business_and_Economy/Shopping_and_Services/Books/Bookstores/
Disease | /Top/Health/Conditions_and_Diseases/ | /Health/Diseases_and_Conditions/
Movie | /Top/Arts/Movies/Genres/ | /Entertainment/Movies_and_Film/Genres/
Music | /Top/Arts/Music/Styles/ | /Entertainment/Music/Genres/
News | /Top/News/By_Subject/ | /News_and_Media/
28. Experiments: Datasets
- Taxonomies
- Categories
- For example:
- Movie: Action, Comedy, Horror, etc.
- Objects
- Each object is treated as a text document composed of the title and annotation of its corresponding webpage.
29. Experiments: Datasets
- Number of categories
- The number of categories per object in these
datasets is 1.54 on average.
Dataset | Google | Yahoo
Book | 49 | 41
Disease | 30 | 51
Movie | 34 | 25
Music | 47 | 24
News | 27 | 34
30. Experiments: Datasets
- Number of objects
- The set of objects in G∩Y covers only a small portion (usually less than 10%) of the set of objects in Google or Yahoo alone, which suggests the great benefit of automatically integrating them.

Dataset | Google | Yahoo | G∪Y | G∩Y
Book | 10,842 | 11,268 | 21,111 | 999
Disease | 34,047 | 9,785 | 41,439 | 2,393
Movie | 36,787 | 14,366 | 49,744 | 1,409
Music | 76,420 | 24,518 | 95,971 | 4,967
News | 31,504 | 19,419 | 49,303 | 1,620
31. Experiments: Datasets
- Category Distribution
- Highly skewed
[Charts: category distributions of Google's and Yahoo's Book taxonomies]
32. Experiments: Tasks
- Two symmetric tasks for each dataset:
- G←Y (integrating objects from Yahoo into Google)
- Y←G (integrating objects from Google into Yahoo)
33. Experiments: Tasks
- Test data: G∩Y
- We do not need to manually label them because their categories in both taxonomies are known.
- Training data: randomly sampled subsets of G − Y (or Y − G)
- For G←Y tasks, we randomly sample n objects from the set G − Y as training examples, where n is the number of test examples; for Y←G tasks, we sample from Y − G similarly.
- For each task, we do such random sampling 5 times, and report the averaged performance.
34. Experiments: Tasks
- Training phase
- We hide the test examples' master categories but expose their source categories to the learning algorithm.
- Test phase
- We compare the hidden master categories of the
test examples with the predictions of the
learning algorithm.
35. Experiments: Features
- Term-Features
- A document is treated as a bag of words.
- Pre-processing: removal of stop-words and stemming.
- Each term corresponds to a binary feature whose value indicates the presence or absence of that term in the given document (see the sketch below).
- Category-Features: binary features indicating category membership, as defined above.
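A minimal sketch of this term-feature pipeline (the stop-word list and the crude suffix stemmer below are stand-ins; the paper does not specify which were used, and a real system would use e.g. the Porter stemmer):

    import re

    STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # assumed tiny subset

    def stem(word):
        # Toy stemmer: strips a few common suffixes.
        return re.sub(r"(ing|ed|es|s)$", "", word)

    def term_features(document, vocabulary):
        terms = {stem(w) for w in re.findall(r"[a-z]+", document.lower())
                 if w not in STOP_WORDS}
        # Binary vector: presence/absence of each vocabulary term.
        return [1 if t in terms else 0 for t in vocabulary]

    vocab = ["movi", "review", "horror"]
    print(term_features("Reviews of the latest horror movies", vocab))  # [1, 1, 1]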
36. Experiments: Measures
- For one category:
- F score (F1 measure), which is the harmonic average of precision (p) and recall (r): F1 = 2pr / (p + r).
37. Experiments: Measures
- For all categories:
- Macro-averaged F score (maF)
- Compute the F scores for the binary decisions on each individual category first, and then average them over categories.
- Micro-averaged F score (miF)
- Compute the F score globally over all the binary decisions (both are sketched below).
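A minimal sketch of both averages from per-category true positives, false positives, and false negatives (the toy counts are made up, chosen to contrast a common category with a rare one):

    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_micro_f1(counts):  # counts: list of (tp, fp, fn) per category
        maF = sum(f1(*c) for c in counts) / len(counts)
        tp, fp, fn = (sum(col) for col in zip(*counts))
        miF = f1(tp, fp, fn)  # F over the pooled binary decisions
        return maF, miF

    # A common category (tp=90) and a rare one (tp=1):
    print(macro_micro_f1([(90, 10, 10), (1, 4, 4)]))  # maF = 0.55, miF ~ 0.87

Note how the toy miF stays close to the common category's F score (0.9) while the maF is pulled down by the rare category, which matches the contrast drawn on the next slide.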
38. Experiments: Measures
- For all categories:
- The miF tends to be dominated by the classification performance on common categories, whereas the maF is more influenced by the classification performance on rare categories.
- Providing both kinds of scores is more informative than providing either alone, especially when the category distributions are highly skewed.
- Yang, Y. and Liu, X. A Re-examination of Text Categorization Methods. In Proceedings of SIGIR1999.
39. Experiments: Settings
- NB and ENB
- We use Lidstone's smoothing with parameter λ = 0.1 (sketched below).
- We run ENB with a series of exponentially increasing values of the parameter ω (0, 1, 3, 10, 30, 100, 300, 1000), and report the best experimental results.
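For reference, a one-function sketch of the Lidstone-smoothed term probability inside Naïve Bayes (the counts below are made up):

    def lidstone(term_count, total_count, vocab_size, lam=0.1):
        # P(term | category) with add-lambda smoothing;
        # lam = 1 would give Laplace smoothing, lam = 0 the raw estimate.
        return (term_count + lam) / (total_count + lam * vocab_size)

    # A term seen 3 times among 100 term occurrences in a category,
    # with a vocabulary of 5,000 distinct terms:
    print(lidstone(3, 100, 5000))  # 3.1 / 600 ~ 0.00517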
40. Experiments: Settings
- AB and CB-AB
- BoosTexter
- http://www.research.att.com/~schapire/BoosTexter/
- It implements AdaBoost on top of "decision stumps".
41. Experiments: Results
[Charts: macro-averaged and micro-averaged F scores]
42. Experiments: Results
[Charts: macro-averaged and micro-averaged F scores]
43. Experiments: Results
[Charts: macro-averaged and micro-averaged F scores]
44. Experiments: Results
- CB-AB iteratively improves AB
45. Experiments: Results
[Charts: macro-averaged and micro-averaged F scores]
46. Conclusion
- Main Contribution
- The CB-AB approach to taxonomy integration:
- It achieves multi-class multi-label classification.
- It is a discriminative learning algorithm.
- It enhances the AdaBoost algorithm by exploiting information in the source taxonomy.
- It does not require a tune set (a set of objects whose source categories and master categories are all known).
- It enables the use of different weak hypotheses/learners for term-features and category-features.
47. Conclusion

Aspect | Co-Training | Co-Bootstrapping
Classes | One set of classes. | Two sets of classes: (1) one set of source categories; (2) one set of master categories.
Features | Two disjoint sets of features, V1 and V2. | Two sets of features: (1) conventional features plus source category-features; (2) conventional features plus master category-features.
Assumption | V1 and V2 are compatible and uncorrelated (conditionally independent). | The source and master taxonomies have some semantic overlap, i.e., they are somewhat correlated.
48. Conclusion
- Future Work
- Theoretical analysis of the Co-Bootstrapping algorithm.
- Can we refine the Co-Bootstrapping algorithm to make it theoretically justified, like the Greedy-Agreement algorithm for Co-Training?
- Abney, S.P. Bootstrapping. In Proceedings of ACL2002.
49. Conclusion
- Future Work
- Empirical comparison with CS-TSVM, EM2D and CT-SVM.
- Exploiting the hierarchical structure of taxonomies.
- Incorporating commonsense knowledge and domain constraints.
- Extending to fully functional ontology mapping systems.
50. Questions, Comments, Suggestions?
51. Thank You