Title: Organizing Structured Web Sources by Query Schemas: A Clustering Approach
1 Organizing Structured Web Sources by Query Schemas: A Clustering Approach
- Bin He
- Joint work with Tao Tao, Kevin Chen-Chuan Chang
- Univ. Illinois at Urbana-Champaign
2 Background: MetaQuerier, Large-Scale Integration of the deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
3 MetaQuerier: System architecture
- Front-end (Query Execution): Query Translation, Source Selection, Result Compilation
- Deep Web Repository: Unified Interfaces, Subject Domains, Query Capabilities, Query Interfaces
- Back-end (Semantics Discovery): Database Crawler, Interface Extraction, Source Organization, Schema Matching
- Tasks: find Web databases, query Web databases
4 In MetaQuerier, source organization clusters query interfaces into implicit domains
- Example domains: Airfares, Automobiles, Books
5 What are the representative features of query interfaces?
- Is query schema the feature we are looking for?
6 Query schemas are appropriate representatives of Web databases: distinctive property
[Figure: three panels (Airfares, Hotels, Movies), each plotting number of observations against attribute index]
- Each domain contains a dominant range of attributes, distinctive from other domains
- Some attributes are only observed in one domain (anchor attributes)
- For example, ISBN for Books, MPAA Rating for Movies
- Source organization becomes the clustering of
query schemas
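The anchor-attribute idea can be sketched in a few lines of Python. This is only an illustration: the domain labels and the two Movies schemas are hypothetical, and `anchor_attributes` is a name introduced here, not something from the talk.

```python
from collections import defaultdict

# Hypothetical labeled schemas; labels and the Movies schemas are assumed for illustration.
labeled_schemas = [
    ("Books", {"author", "title", "subject", "ISBN"}),
    ("Books", {"author", "title", "category", "publisher"}),
    ("Movies", {"title", "director", "MPAA rating"}),
    ("Movies", {"title", "actor", "category"}),
]

def anchor_attributes(labeled):
    """Map each domain to the attributes observed in that domain only."""
    domains_of = defaultdict(set)
    for domain, schema in labeled:
        for attr in schema:
            domains_of[attr].add(domain)
    anchors = defaultdict(set)
    for attr, domains in domains_of.items():
        if len(domains) == 1:  # seen in exactly one domain
            anchors[next(iter(domains))].add(attr)
    return dict(anchors)

anchors = anchor_attributes(labeled_schemas)
```

On this toy data, ISBN comes out as an anchor for Books and MPAA rating for Movies, while title and category are shared and therefore not anchors.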
7 Query schemas can be viewed as categorical data
- Query schemas as transactions
- S1: {author, title, subject, ISBN}
- S2: {author, title, category, publisher}
- S3: {make, model, price, zip code}
- S4: {manufacturer, model, price}
- S5: {from, to, departure date, return date, number of passengers}
- S6: {departure city, arrival city, number of adults, number of children}
- Thus, we can apply algorithms for clustering
categorical data
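A minimal sketch of the transaction view, assuming each schema is a Python set of attribute names. The helpers `to_binary_vector` and `shared` are illustrative names, not part of the original work.

```python
# Schemas S1..S4 from the slide, each as a set of attributes (a transaction).
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"author", "title", "category", "publisher"},
    "S3": {"make", "model", "price", "zip code"},
    "S4": {"manufacturer", "model", "price"},
}

# The vocabulary is the union of all attributes, in a fixed order.
vocabulary = sorted(set().union(*schemas.values()))

def to_binary_vector(schema, vocabulary):
    """Encode a schema as a 0/1 categorical vector over the vocabulary."""
    return [1 if attr in schema else 0 for attr in vocabulary]

vectors = {name: to_binary_vector(s, vocabulary) for name, s in schemas.items()}

def shared(a, b):
    """Attributes two schemas have in common."""
    return schemas[a] & schemas[b]
```

Here `shared("S1", "S2")` is `{"author", "title"}`, while `shared("S1", "S3")` is empty, which is the kind of overlap a categorical clustering algorithm exploits.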
8 Clustering categorical data: Objective function
- Clustering needs an objective function to evaluate the quality of clusters
- Existing objective functions
- Likelihood: model-based clustering (1998)
- Context Linkage: ROCK (2000)
- Entropy: COOLCAT (2002)
- In this paper, we propose a new objective function
- Model-Differentiation
9 Model-Differentiation: A new objective function for model-based clustering
- Assumption of model-based clustering
- Each cluster Ci has a generative model Mi that generates its data with probabilistic behavior
- What is a good clustering result? (our observation)
- Data in different clusters are very dissimilar
- Models of different clusters are very dissimilar
- A new objective function
- Maximize the dissimilarity of models
- To realize it, we need to answer three questions
- How to model the data?
- How to estimate the model, given data?
- How to measure the dissimilarity of models?
10 Modeling: Multinomial distribution
- Each attribute is an independent event
- A schema is generated by a series of samples drawn from M
- Model M over the vocabulary: author (P1), publisher (P2), title (P3), ISBN (P4), city (P5), price (P6), model (P7)
- A schema: {title, author, ISBN}
- Probability of the schema: P1 × P3 × P4
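A small sketch of the schema probability under this model. The numeric values for P1..P7 are hypothetical (the slide names the probabilities but gives no values). Like the slide's product, the sketch omits any multinomial coefficient.

```python
from math import prod

# Hypothetical values for P1..P7; they sum to 1 but are not from the talk.
model = {
    "author": 0.25,     # P1
    "publisher": 0.10,  # P2
    "title": 0.25,      # P3
    "ISBN": 0.15,       # P4
    "city": 0.05,       # P5
    "price": 0.10,      # P6
    "model": 0.10,      # P7
}

def schema_probability(schema, model):
    """Product of attribute probabilities, treating attributes as independent draws."""
    return prod(model[attr] for attr in schema)

p = schema_probability(["title", "author", "ISBN"], model)  # P3 * P1 * P4
```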
11 Model estimation: Given a set of data, how to estimate its model?
- Maximum likelihood estimation
- S1: {title, author, ISBN}, S2: {author, ISBN, publisher}
- S3: {author, title, price}, S4: {author, title, price}
- Vocabulary: {author, title, ISBN, price, publisher}
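For a multinomial distribution, the maximum likelihood estimate of each attribute's probability is its relative frequency among all observed attribute occurrences. The sketch below applies that standard estimate to S1..S4. Whether the original work adds smoothing or other adjustments is not stated here, so treat this as an assumption.

```python
from collections import Counter

# Schemas S1..S4 from the slide.
cluster = [
    {"title", "author", "ISBN"},
    {"author", "ISBN", "publisher"},
    {"author", "title", "price"},
    {"author", "title", "price"},
]

def estimate_model(schemas):
    """Maximum likelihood estimate: attribute count divided by total occurrences."""
    counts = Counter(attr for schema in schemas for attr in schema)
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()}

model = estimate_model(cluster)
```

With 12 attribute occurrences in total, author is estimated at 4/12, title at 3/12, ISBN and price at 2/12 each, and publisher at 1/12.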
12 Measuring the dissimilarity of models: Statistical hypothesis testing
- Multinomial distributions can be directly tested by χ² testing
- S1: {title, author, ISBN}, S2: {author, ISBN, price}, S3: {make, model, price}
1. Combining S1 and S2: compare M<1,2> with M3
2. Combining S1 and S3: compare M<1,3> with M2
3. Combining S2 and S3: compare M<2,3> with M1
[Figure: for each case, two plots of probability (Pro) against attributes (Attrs), one per model]
This inspires a hierarchical agglomerative clustering (HAC) algorithm
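One way to read the three candidate merges is to score each by the Pearson χ² statistic between the merged model and the remaining model, then pick the merge that leaves the two models most dissimilar. The sketch below does that on raw attribute counts. It is an illustration under that reading, not necessarily the exact similarity measure used in the work, and with counts this small the χ² approximation is not reliable.

```python
from collections import Counter
from itertools import combinations

schemas = {
    "S1": {"title", "author", "ISBN"},
    "S2": {"author", "ISBN", "price"},
    "S3": {"make", "model", "price"},
}

def counts(names):
    """Attribute counts of the cluster formed by the named schemas."""
    return Counter(attr for n in names for attr in schemas[n])

def chi_square(a, b):
    """Pearson chi-square statistic for the 2 x k table of attribute counts."""
    n_a, n_b = sum(a.values()), sum(b.values())
    total = n_a + n_b
    stat = 0.0
    for attr in set(a) | set(b):
        column = a[attr] + b[attr]
        for observed, row in ((a[attr], n_a), (b[attr], n_b)):
            expected = column * row / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Score each candidate merge by how different the merged model is from the rest.
scores = {}
for pair in combinations(sorted(schemas), 2):
    rest = [n for n in schemas if n not in pair]
    scores[pair] = chi_square(counts(pair), counts(rest))

best_merge = max(scores, key=scores.get)
```

On this example the statistic is largest when S1 and S2 are combined, which matches the intuition that the two book-like schemas belong together and S3 stands apart.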
13 Hypothesis testing needs sufficient observations: Pre-clustering to form small clusters
- Distinguishable: S1, with anchor attributes
- S1 and S2 should be in the same domain and thus pre-clustered
- How to decide whether an S is distinguishable?
[Figure labels: Sup(S1); any Si, Sj in Sup(S1); S1]
14 Post-classification: Handling loners
- Pipeline: pre-clustering, then model clustering, with loners handled separately
- Loners: too small for the χ² test after pre-clustering
- Loners are classified with a Naïve Bayesian classifier
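A minimal sketch of assigning a loner schema to an existing cluster with a Naïve Bayes classifier. The cluster contents, the loner, and the add-one smoothing are assumptions made for illustration. The slides do not say how unseen attributes are handled.

```python
from collections import Counter
from math import log

# Hypothetical clusters produced by the earlier steps.
clusters = {
    "Books": [{"author", "title", "ISBN"}, {"author", "title", "publisher"}],
    "Automobiles": [{"make", "model", "price"}, {"manufacturer", "model", "price"}],
}

vocabulary = set().union(*(s for group in clusters.values() for s in group))
total_schemas = sum(len(group) for group in clusters.values())

def log_score(schema, group):
    """log P(cluster) + sum of log P(attr | cluster), with add-one smoothing (assumed)."""
    counts = Counter(attr for s in group for attr in s)
    denom = sum(counts.values()) + len(vocabulary)
    prior = len(group) / total_schemas
    return log(prior) + sum(log((counts[a] + 1) / denom) for a in schema)

def classify(schema):
    """Return the name of the cluster with the highest Naive Bayes score."""
    return max(clusters, key=lambda name: log_score(schema, clusters[name]))

label = classify({"title", "author", "keyword"})  # "keyword" is unseen in both clusters
```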
15 Experiments
- Data
- Questions to answer
- Can schema clustering effectively organize Web databases?
- Can it build a domain hierarchy correctly?
16 We also try existing objective functions
- Three existing objective functions
- Likelihood: maximize likelihood
- Entropy: maximize entropy
- Context Linkage: minimize cross links
- To be fair, we keep pre-clustering and post-classification and change only the measure used in the clustering step
17 Effectiveness of Clustering
- 8 domains, 8 clusters
- Most Web databases are clustered correctly
- Quantitative analysis: conditional entropy (the smaller, the better)
- Model-Differentiation: 0.32
- Likelihood: 0.42
- Entropy: 0.38
- Context Linkage: 0.61
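For reference, conditional entropy of the true domain given the assigned cluster can be computed as below. This is the standard definition with base-2 logarithms. The slides do not state the log base or any normalization, so the numbers it produces are not directly comparable to those reported above.

```python
from collections import Counter, defaultdict
from math import log2

def conditional_entropy(true_domains, cluster_ids):
    """H(domain | cluster) in bits; 0 means every cluster is pure."""
    by_cluster = defaultdict(list)
    for domain, cluster in zip(true_domains, cluster_ids):
        by_cluster[cluster].append(domain)
    n = len(true_domains)
    h = 0.0
    for members in by_cluster.values():
        size = len(members)
        for count in Counter(members).values():
            p = count / size
            h -= (size / n) * p * log2(p)  # cluster weight times within-cluster entropy term
    return h

# Toy labels, assumed for illustration.
domains = ["Books", "Books", "Movies", "Movies"]
pure = conditional_entropy(domains, [0, 0, 1, 1])   # each cluster holds one domain
mixed = conditional_entropy(domains, [0, 1, 0, 1])  # each cluster is a 50/50 mix
```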
18 To build a domain hierarchy
- After obtaining 8 clusters, continue running the HAC algorithm to merge them
- The result is consistent with common sense: close concepts are merged first
19 Conclusions
- Cluster Web databases using their query schemas
- First work on clustering Web databases, not pages
- Query schemas are good representatives
- Essentially a problem of clustering categorical data
- A new objective function: Model-Differentiation
- Realized by statistical hypothesis testing
- Derives a different similarity measure for HAC
20 Thank You!