Title: Web Taxonomy Integration through Co-Bootstrapping
1. Web Taxonomy Integration through Co-Bootstrapping
- Dell Zhang, Wee Sun Lee
- SIGIR 2004
2. Outline
- Introduction
- Problem Statement
- State-of-the-Art
- Our Approach
- Experiments
- Conclusion
3. Introduction
- Taxonomy
- A taxonomy, or directory or catalog, is a
division of a set of objects (e.g. documents)
into a set of categories.
Adopted from a slide of Srikant.
4. Introduction
- Taxonomy Integration
- Integrating objects from a source taxonomy N into
a master taxonomy M.
Adopted from a slide of Srikant.
5. Introduction
- Taxonomy Integration
- Integrating objects from a source taxonomy N into a master taxonomy M.
[Figure: a master taxonomy M with categories C1, C2, C3 containing objects a-f, and source objects w, x, y, z to be integrated.]
Adopted from a slide of Srikant.
6. Introduction
- Applications
- Today
- Web Marketplaces (e.g. Amazon)
- Web Portals (e.g., NCSTRL)
- Personal Bookmarks
- Organizational Resources
7. Introduction
- Applications
- Future: the Semantic Web
- Ontology Merging
- Ontology Mapping
- Content-based similarity between two concepts (categories).
- Doan, A., Madhavan, J., Domingos, P. and Halevy, A. Learning to Map between Ontologies on the Semantic Web. In Proceedings of WWW2002.
- Need to do taxonomy integration first.
8. Introduction
- Why Machine Learning?
- Correspondences between taxonomies are inevitably noisy and fuzzy.
- Arts/Music/Styles/ vs. Entertainment/Music/Genres/
- Computers_and_Internet/Software/Freeware vs. Computers/Open_Source/Software
- Manual taxonomy integration is tedious, error-prone, and unscalable.
9. Problem Statement
- Taxonomy Integration as a Classification Problem
- A master taxonomy M with a set of categories C1, C2, ..., Cm, each containing a set of objects.
- Classes: the master categories C1, C2, ..., Cm.
- Training examples: the objects in M.
- A source taxonomy N with a set of categories S1, S2, ..., Sn, each containing a set of objects.
- Test examples: the objects in N (see the sketch below).
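To make this framing concrete, here is a minimal Python sketch (the toy objects and category names are hypothetical, not the paper's data) of how the master categories become classes, M's objects become labeled training examples, and N's objects become the test examples:

    # Hypothetical toy data: each taxonomy maps category names to object sets.
    master = {"C1": {"a", "b"}, "C2": {"c", "d"}, "C3": {"e", "f"}}
    source = {"S1": {"w", "x"}, "S2": {"y", "z"}}

    # Classes are the master categories; training examples are M's objects,
    # each labeled (multi-label) with every master category containing it.
    classes = sorted(master)
    train = {}
    for cat, objs in master.items():
        for obj in objs:
            train.setdefault(obj, set()).add(cat)

    # Test examples are N's objects; their master categories must be predicted.
    test = sorted(set().union(*source.values()))
    print(classes)  # ['C1', 'C2', 'C3']
    print(test)     # ['w', 'x', 'y', 'z']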
10. Problem Statement
- Characteristics
- It's a multi-class, multi-label classification problem.
- There are usually more than two possible classes.
- One object may belong to more than one class.
- The test examples are already known to the machine learning algorithm.
- The test examples are already labeled with a set of categories that are not identical to, but relevant to, the set of categories to be assigned.
11. State-of-the-Art
- Conventional Learning Algorithms
- Do not exploit information in the source taxonomy.
- NB (Naïve Bayes)
- SVM (Support Vector Machine)
- Enhanced Learning Algorithms
- Exploit information in the source taxonomy to
build better classifiers for the master taxonomy.
12. State-of-the-Art
- Enhanced Learning Algorithms
- Agrawal, R. and Srikant, R. On Integrating Catalogs. In Proceedings of WWW2001.
- ENB (Enhanced Naïve Bayes)
13. State-of-the-Art
- Enhanced Learning Algorithms
- Sarawagi, S., Chakrabarti, S. and Godbole, S. Cross-Training: Learning Probabilistic Mappings between Topics. In Proceedings of KDD2003.
- EM2D (Expectation Maximization in 2 Dimensions)
- CT-SVM (Cross-Training SVMs)
14. State-of-the-Art
- Enhanced Learning Algorithms
- Zhang, D. and Lee, W.S. Web Taxonomy Integration using Support Vector Machines. In Proceedings of WWW2004.
- CS-TSVM (Cluster Shrinkage Transductive SVM)
15. Our Approach
- Motivation
- Potentially useful semantic relationships between a master category C and a source category S include:
- identical,
- mutual-exclusive,
- superset,
- subset,
- partially-overlapping.
16. Our Approach
- Motivation
- In addition, semantic relationships may involve multiple master and source categories.
- For example, a master category C may be a subset of the union of two source categories Sa and Sb, so if an object does not belong to either Sa or Sb, it cannot belong to C.
17. Our Approach
- Motivation
- Real-world semantic relationships are noisy and fuzzy, but they can still provide valuable information for classification.
- For example, knowing that most (80%) of the objects in a source category S belong to one master category Ca while the rest (20%) belong to another master category Cb is obviously helpful.
18. Our Approach
- Idea
- The difficulty is that the knowledge about those semantic relationships is not explicit but hidden in the data.
- If we had indicator functions for each category in N, we could take those indicator functions as features when learning the classifier for M. This allows us to exploit the semantic relationships among the categories of M and N without explicitly figuring out what those relationships are.
19. Our Approach
- Idea
- Specifically, for each object in M, we augment the set of conventional term-features FT with a set of source category-features FN = {fN1, ..., fNn}. The j-th source category-feature fNj of a given object x is a binary feature indicating whether x belongs to the j-th source category Sj (see the sketch below).
- In the same way, we can get a set of master category-features FM.
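As an illustration, a minimal sketch (hypothetical vocabulary and category names) of how an object's binary term-features FT are concatenated with its binary source category-features FN:

    # Build the augmented feature vector FT + FN for one object.
    def features(doc_terms, doc_source_cats, vocabulary, source_cats):
        term_feats = [1 if t in doc_terms else 0 for t in vocabulary]        # FT
        cat_feats = [1 if s in doc_source_cats else 0 for s in source_cats]  # FN
        return term_feats + cat_feats

    vocabulary = ["guitar", "jazz", "rock"]
    source_cats = ["S1", "S2"]
    print(features({"jazz", "guitar"}, {"S2"}, vocabulary, source_cats))
    # [1, 1, 0, 0, 1] -> three term-features followed by two category-features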
20. Our Approach
- Major Considerations (1/2)
- Q1: How can we train the classifier using two different kinds of features (term-features and category-features)?
- A1: Boosting.
- Inspired by:
- Cai, L. and Hofmann, T. Text Categorization by Boosting Automatically Extracted Concepts. In Proceedings of SIGIR2003.
21. Our Approach
- Boosting
- A boosting algorithm combines many weak hypotheses/learners (moderately accurate classification rules) into a highly accurate classifier.
- A boosting algorithm can utilize different kinds of weak hypotheses/learners for different kinds of features, and weight them optimally.
- For example, boosting enables us to use decision tree hypotheses/learners for the category-features and Naïve Bayes hypotheses/learners for the term-features.
22. Our Approach
- Boosting
- The most popular boosting algorithm is AdaBoost, introduced in 1995 by Freund and Schapire.
- Our work is based on its multi-class multi-label version, AdaBoost.MH.
- The weak hypotheses used in this paper are simple decision stumps, each of which corresponds to a binary term-feature or category-feature (see the sketch below).
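For concreteness, here is a compact toy reconstruction of AdaBoost.MH with real-valued decision stumps over binary features (a sketch following Schapire and Singer's formulation, not BoosTexter itself; the smoothing constant eps is an assumption added to keep the logarithms finite):

    import math

    def adaboost_mh(X, Y, k, rounds=10, eps=1e-10):
        # X: binary feature vectors; Y: sets of class indices; k: class count.
        n, d = len(X), len(X[0])
        sign = [[1 if l in Y[i] else -1 for l in range(k)] for i in range(n)]
        D = [[1.0 / (n * k)] * k for _ in range(n)]  # weights over (example, label)
        stumps = []
        for _ in range(rounds):
            best = None
            for j in range(d):  # candidate stump on each binary feature
                # W[b][l][s]: weight of pairs with feature value b, label l,
                # and true sign s (0 = negative, 1 = positive).
                W = [[[eps, eps] for _ in range(k)] for _ in range(2)]
                for i in range(n):
                    for l in range(k):
                        W[X[i][j]][l][1 if sign[i][l] > 0 else 0] += D[i][l]
                # Z is the normalizer achieved by the optimal stump outputs.
                Z = sum(2.0 * math.sqrt(W[b][l][0] * W[b][l][1])
                        for b in range(2) for l in range(k))
                if best is None or Z < best[0]:
                    c = [[0.5 * math.log(W[b][l][1] / W[b][l][0])
                          for l in range(k)] for b in range(2)]
                    best = (Z, j, c)
            Z, j, c = best
            stumps.append((j, c))
            for i in range(n):  # reweight: hard (example, label) pairs gain weight
                for l in range(k):
                    D[i][l] *= math.exp(-sign[i][l] * c[X[i][j]][l]) / Z

        def predict(x):  # labels whose combined score is positive
            return {l for l in range(k)
                    if sum(c[x[j]][l] for j, c in stumps) > 0}
        return predict

    # Toy usage: feature 0 indicates class 0, feature 1 indicates class 1.
    clf = adaboost_mh([[1, 0], [0, 1], [1, 1]], [{0}, {1}, {0, 1}], k=2)
    print(clf([1, 0]), clf([0, 1]))  # expected: {0} {1}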
23. Our Approach
- Major Considerations (2/2)
- Q2: How can we train the classifier when the values of the source category-features FN of the training examples are unknown?
- A2: Co-Bootstrapping.
- Inspired by:
- Blum, A. and Mitchell, T. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of COLT1998.
- It is named Cross-Training by Chakrabarti et al.
24. Our Approach
- Co-Bootstrapping
- Train two classifiers symmetrically: one for the master categories and the other for the source categories.
- These two classifiers collaborate to mutually bootstrap each other (sketched below).
25. Our Approach
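A pseudocode-style sketch of the procedure (the `train` and `predict` arguments are hypothetical stand-ins for any multi-class multi-label learner, e.g. the AdaBoost.MH sketch above adapted to emit binary category vectors):

    def co_bootstrap(M_docs, M_cats, N_docs, N_cats, train, predict, rounds=5):
        # M_docs / N_docs: term-feature vectors of the objects in M and N.
        # M_cats / N_cats: known binary category vectors of each object within
        # its own taxonomy; these double as the training labels.
        fN = [[0] * len(N_cats[0]) for _ in M_docs]  # source cats of M's objects
        fM = [[0] * len(M_cats[0]) for _ in N_docs]  # master cats of N's objects
        for _ in range(rounds):
            # Train each classifier on term-features plus the other taxonomy's
            # (currently predicted) category-features.
            clf_M = train([d + f for d, f in zip(M_docs, fN)], M_cats)
            clf_N = train([d + f for d, f in zip(N_docs, fM)], N_cats)
            # Each classifier then refreshes the other's category-features,
            # using the known category-features on the prediction side.
            fM = [predict(clf_M, d + f) for d, f in zip(N_docs, N_cats)]
            fN = [predict(clf_N, d + f) for d, f in zip(M_docs, M_cats)]
        return fM  # final master-category predictions for N's objects

With each round, the predicted category-features should become more reliable, so the two classifiers gradually bootstrap each other up, as described on the previous slide.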
26. Experiments
- Datasets
- Tasks
- Features
- Measures
- Settings
- Results
27. Experiments: Datasets

Dataset | Google | Yahoo
Book | /Top/Shopping/Publications/Books/ | /Business_and_Economy/Shopping_and_Services/Books/Bookstores/
Disease | /Top/Health/Conditions_and_Diseases/ | /Health/Diseases_and_Conditions/
Movie | /Top/Arts/Movies/Genres/ | /Entertainment/Movies_and_Film/Genres/
Music | /Top/Arts/Music/Styles/ | /Entertainment/Music/Genres/
News | /Top/News/By_Subject/ | /News_and_Media/
28. Experiments: Datasets
- Taxonomies
- Categories
- For example:
- Movie: Action, Comedy, Horror, etc.
- Objects
- Each object is treated as a text document composed of the title and annotation of its corresponding webpage.
29. Experiments: Datasets
- Number of categories
- The number of categories per object in these
datasets is 1.54 on average.
Dataset | Google | Yahoo
Book | 49 | 41
Disease | 30 | 51
Movie | 34 | 25
Music | 47 | 24
News | 27 | 34
30. Experiments: Datasets
- Number of objects
- The set of objects in G∩Y covers only a small portion (usually less than 10%) of the set of objects in Google or Yahoo alone, which suggests the great benefit of automatically integrating them.

Dataset | Google | Yahoo | G∪Y | G∩Y
Book | 10,842 | 11,268 | 21,111 | 999
Disease | 34,047 | 9,785 | 41,439 | 2,393
Movie | 36,787 | 14,366 | 49,744 | 1,409
Music | 76,420 | 24,518 | 95,971 | 4,967
News | 31,504 | 19,419 | 49,303 | 1,620
31. Experiments: Datasets
- Category Distribution
- Highly skewed
[Charts: category distributions of Google's and Yahoo's Book taxonomies]
32. Experiments: Tasks
- Two symmetric tasks for each dataset:
- G←Y (integrating objects from Yahoo into Google)
- Y←G (integrating objects from Google into Yahoo)
33. Experiments: Tasks
- Test data: G∩Y
- We do not need to manually label them because their categories in both taxonomies are known.
- Training data: randomly sampled subsets of G − Y (or Y − G)
- For G←Y tasks, we randomly sample n objects from the set G − Y as training examples, where n is the number of test examples; for Y←G tasks, we sample from Y − G similarly.
- For each task, we do such random sampling 5 times, and report the averaged performance.
34. Experiments: Tasks
- Training phase
- We hide the test examples' master categories but expose their source categories to the learning algorithm.
- Test phase
- We compare the hidden master categories of the
test examples with the predictions of the
learning algorithm.
35. Experiments: Features
- Term-Features
- A document is treated as a bag of words.
- Pre-processing: removal of stop-words and stemming.
- Each term corresponds to a binary feature whose value indicates the presence or absence of that term in the given document (see the sketch below).
- Category-Features: binary features indicating category membership, as defined above.
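A minimal sketch of this term-feature pipeline (the stop-word list and the crude suffix stemmer below are stand-ins; the paper does not specify which were used, and a real system would use e.g. the Porter stemmer):

    import re

    STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # assumed tiny subset

    def stem(word):
        # Toy stemmer: strips a few common suffixes.
        return re.sub(r"(ing|ed|es|s)$", "", word)

    def term_features(document, vocabulary):
        terms = {stem(w) for w in re.findall(r"[a-z]+", document.lower())
                 if w not in STOP_WORDS}
        # Binary vector: presence/absence of each vocabulary term.
        return [1 if t in terms else 0 for t in vocabulary]

    vocab = ["movi", "review", "horror"]
    print(term_features("Reviews of the latest horror movies", vocab))  # [1, 1, 1]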
36. Experiments: Measures
- For one category:
- F score (F1 measure), which is the harmonic average of precision (p) and recall (r): F1 = 2pr / (p + r).
37. Experiments: Measures
- For all categories:
- Macro-averaged F score (maF)
- Compute the F scores for the binary decisions on each individual category first, and then average them over categories.
- Micro-averaged F score (miF)
- Compute the F score globally over all the binary decisions (both are sketched below).
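A minimal sketch of both averages from per-category true positives, false positives, and false negatives (the toy counts are made up, chosen to contrast a common category with a rare one):

    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_micro_f1(counts):  # counts: list of (tp, fp, fn) per category
        maF = sum(f1(*c) for c in counts) / len(counts)
        tp, fp, fn = (sum(col) for col in zip(*counts))
        miF = f1(tp, fp, fn)  # F over the pooled binary decisions
        return maF, miF

    # A common category (tp=90) and a rare one (tp=1):
    print(macro_micro_f1([(90, 10, 10), (1, 4, 4)]))  # maF = 0.55, miF ~ 0.87

Note how the toy miF stays close to the common category's F score (0.9) while the maF is pulled down by the rare category, which matches the contrast drawn on the next slide.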
38. Experiments: Measures
- For all categories:
- The miF tends to be dominated by the classification performance on common categories, whereas the maF is more influenced by the classification performance on rare categories.
- Providing both kinds of scores is more informative than providing either alone, especially when the category distributions are highly skewed.
- Yang, Y. and Liu, X. A Re-examination of Text Categorization Methods. In Proceedings of SIGIR1999.
39. Experiments: Settings
- NB and ENB
- We use Lidstone's smoothing with parameter λ = 0.1 (sketched below).
- We run ENB with a series of exponentially increasing values of the parameter ω (0, 1, 3, 10, 30, 100, 300, 1000), and report the best experimental results.
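For reference, a one-function sketch of the Lidstone-smoothed term probability inside Naïve Bayes (the counts below are made up):

    def lidstone(term_count, total_count, vocab_size, lam=0.1):
        # P(term | category) with add-lambda smoothing;
        # lam = 1 would give Laplace smoothing, lam = 0 the raw estimate.
        return (term_count + lam) / (total_count + lam * vocab_size)

    # A term seen 3 times among 100 term occurrences in a category,
    # with a vocabulary of 5,000 distinct terms:
    print(lidstone(3, 100, 5000))  # 3.1 / 600 ~ 0.00517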
40. Experiments: Settings
- AB and CB-AB
- BoosTexter
- http://www.research.att.com/~schapire/BoosTexter/
- It implements AdaBoost on top of "decision stumps".
41. Experiments: Results
[Charts: macro-averaged and micro-averaged F scores]
42. Experiments: Results
[Charts: macro-averaged and micro-averaged F scores]
43. Experiments: Results
[Charts: macro-averaged and micro-averaged F scores]
44. Experiments: Results
- CB-AB iteratively improves AB
45. Experiments: Results
[Charts: macro-averaged and micro-averaged F scores]
46. Conclusion
- Main Contribution
- The CB-AB approach to taxonomy integration:
- It achieves multi-class multi-label classification.
- It is a discriminative learning algorithm.
- It enhances the AdaBoost algorithm by exploiting information in the source taxonomy.
- It does not require a tune set (a set of objects whose source categories and master categories are all known).
- It enables the use of different weak hypotheses/learners for term-features and category-features.
47. Conclusion

Aspect | Co-Training | Co-Bootstrapping
Classes | One set of classes. | Two sets of classes: (1) one set of source categories; (2) one set of master categories.
Features | Two disjoint sets of features, V1 and V2. | Two sets of features: (1) conventional features plus source category-features; (2) conventional features plus master category-features.
Assumption | V1 and V2 are compatible and uncorrelated (conditionally independent). | The source and master taxonomies have some semantic overlap, i.e., they are somewhat correlated.
48. Conclusion
- Future Work
- Theoretical analysis of the Co-Bootstrapping algorithm.
- Can we refine the Co-Bootstrapping algorithm to make it theoretically justified, like the Greedy-Agreement algorithm for Co-Training?
- Abney, S.P. Bootstrapping. In Proceedings of ACL2002.
49. Conclusion
- Future Work
- Empirical comparison with CS-TSVM, EM2D and CT-SVM.
- Exploiting the hierarchical structure of taxonomies.
- Incorporating commonsense knowledge and domain constraints.
- Extending to fully functional ontology mapping systems.
50. Questions, Comments, Suggestions?
51. Thank You