Organizing Structured Web Sources by Query Schemas: A Clustering Approach

1
Organizing Structured Web Sources by Query Schemas: A Clustering Approach

Bin He
Joint work with Tao Tao, Kevin Chen-Chuan Chang
Univ. Illinois at Urbana-Champaign

2
Background: MetaQuerier, large-scale integration of the deep Web
[Figure: Query, Result, MetaQuerier, the Deep Web]
3
MetaQuerier: System architecture
Front-end (Query Execution): Query Translation, Source Selection, Result Compilation
MetaQuerier functions: Query Web databases, Find Web databases
Deep Web Repository: Unified Interfaces, Subject Domains, Query Capabilities, Query Interfaces
Back-end (Semantics Discovery): Database Crawler, Interface Extraction, Source Organization, Schema Matching
4
In MetaQuerier, source organization means clustering query interfaces into implicit domains
Example domains: Airfares, Automobiles, Books
5
What is the representative feature of query interfaces?

Is the query schema the feature we are looking for?

6
Query schemas are appropriate representatives of Web databases: a distinctive property
[Figure: three panels (Airfares, Hotels, Movies), each plotting number of observations against attribute index]

Each domain has a dominant range of attributes, distinct from those of other domains
Some attributes are observed in only one domain (anchor attributes)
For example: ISBN for Books, MPAA Rating for Movies

Source organization becomes the clustering of
query schemas
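
The notion of an anchor attribute can be illustrated with a short Python sketch. It assumes the domain of each schema is already known, which is not the case in the clustering setting; the function name `anchor_attributes` and the toy schemas are hypothetical and serve only to make the definition concrete.

```python
from collections import defaultdict


def anchor_attributes(schemas_by_domain):
    """Map each domain to the attributes observed in that domain only."""
    domains_of = defaultdict(set)
    for domain, schemas in schemas_by_domain.items():
        for schema in schemas:
            for attr in schema:
                domains_of[attr].add(domain)

    anchors = {domain: set() for domain in schemas_by_domain}
    for attr, domains in domains_of.items():
        if len(domains) == 1:  # seen in exactly one domain
            anchors[next(iter(domains))].add(attr)
    return anchors


# Hypothetical toy data, not taken from the experiments.
schemas_by_domain = {
    "Books": [{"author", "title", "ISBN"}, {"title", "price"}],
    "Movies": [{"title", "MPAA rating"}, {"title", "director", "price"}],
}
anchors = anchor_attributes(schemas_by_domain)
```

In this toy data, title and price occur in both domains, so they are not anchors.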

7
Query schemas can be viewed as categorical data

Query schemas as transactions:
S1 = {author, title, subject, ISBN}
S2 = {author, title, category, publisher}
S3 = {make, model, price, zip code}
S4 = {manufacturer, model, price}
S5 = {from, to, departure date, return date, number of passengers}
S6 = {departure city, arrival city, number of adults, number of children}
Thus, we can apply algorithms for clustering
categorical data
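
A minimal sketch of this view in Python: each schema becomes a set of attribute names (a transaction), and the vocabulary is the union of all attributes. The helper `to_binary_row` is a hypothetical name; it shows the equivalent categorical encoding with one 0/1 feature per attribute.

```python
from collections import Counter

# The six example schemas, written as transactions (sets of attributes).
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"author", "title", "category", "publisher"},
    "S3": {"make", "model", "price", "zip code"},
    "S4": {"manufacturer", "model", "price"},
    "S5": {"from", "to", "departure date", "return date",
           "number of passengers"},
    "S6": {"departure city", "arrival city", "number of adults",
           "number of children"},
}

# Vocabulary: every attribute that appears in at least one schema.
vocabulary = sorted(set().union(*schemas.values()))

# How many schemas mention each attribute.
attribute_counts = Counter(a for s in schemas.values() for a in s)


def to_binary_row(schema, vocabulary):
    """Encode one schema as a 0/1 vector over the vocabulary."""
    return [1 if attr in schema else 0 for attr in vocabulary]


rows = {name: to_binary_row(s, vocabulary) for name, s in schemas.items()}
```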

8
Clustering categorical data: Objective function

Clustering needs an objective function to evaluate the quality of clusters
Existing objective functions:
Likelihood: model-based clustering (1998)
Context linkage: ROCK (2000)
Entropy: COOLCAT (2002)
In this paper, we propose a new objective function: Model-Differentiation

9
Model-Differentiation: A new objective function for model-based clustering

Assumption of model-based clustering:
Each cluster Ci has a generative model Mi that generates its data with probabilistic behavior
What is a good clustering result? (our observation)
If the data in different clusters are very dissimilar, then the models of different clusters are very dissimilar
This gives a new objective function: maximize the dissimilarity of models
To realize this, we need to answer three questions:
How to model the data?
How to estimate the model, given data?
How to measure the dissimilarity of models?

10
Modeling: Multinomial distribution

Each attribute is an independent event
A schema is generated by a series of samplings from the model M

Model M
Vocabulary: author ($P_1$), publisher ($P_2$), title ($P_3$), ISBN ($P_4$), city ($P_5$), price ($P_6$), model ($P_7$)
A schema: {title, author, ISBN}
Probability of the schema: $P_1 \cdot P_3 \cdot P_4$
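
The probability on this slide can be sketched in Python as a product of per-attribute probabilities. The numeric values in `model` are hypothetical (the slide gives only the symbols P1 to P7), and, following the slide, the sketch omits any multinomial coefficient.

```python
from math import prod


def schema_probability(model, schema):
    """Probability of a schema as the product of its attribute probabilities."""
    return prod(model[attr] for attr in schema)


# Hypothetical probabilities for P1..P7; they sum to 1.
model = {
    "author": 0.25,     # P1
    "publisher": 0.10,  # P2
    "title": 0.25,      # P3
    "ISBN": 0.15,       # P4
    "city": 0.05,       # P5
    "price": 0.10,      # P6
    "model": 0.10,      # P7
}

p = schema_probability(model, {"title", "author", "ISBN"})  # P1 * P3 * P4
```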
11
Model estimation: Given a set of data, how do we estimate its model?

Maximum likelihood estimation
S1 = {title, author, ISBN}, S2 = {author, ISBN, publisher}
S3 = {author, title, price}, S4 = {author, title, price}
Vocabulary: author, title, ISBN, price, publisher
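
A minimal sketch of maximum likelihood estimation for this example, assuming the estimate for an attribute is its number of occurrences divided by the total number of attribute occurrences in the cluster (the standard multinomial MLE). The function name `estimate_model` is hypothetical, and the exact estimator in the paper may include details not shown on the slide.

```python
from collections import Counter
from fractions import Fraction


def estimate_model(schemas):
    """Multinomial MLE: relative frequency of each attribute occurrence."""
    counts = Counter(attr for schema in schemas for attr in schema)
    total = sum(counts.values())
    return {attr: Fraction(n, total) for attr, n in counts.items()}


cluster = [
    {"title", "author", "ISBN"},      # S1
    {"author", "ISBN", "publisher"},  # S2
    {"author", "title", "price"},     # S3
    {"author", "title", "price"},     # S4
]
model = estimate_model(cluster)
# There are 12 attribute occurrences in total; author accounts for 4 of them.
```

Exact fractions are used so the estimates can be checked by hand: author 4/12, title 3/12, ISBN 2/12, price 2/12, publisher 1/12.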

12
Measuring the dissimilarity of models: Statistical hypothesis testing

Multinomial distributions can be directly tested by χ² testing

S1 = {title, author, ISBN}, S2 = {author, ISBN, price}, S3 = {make, model, price}
1. Combining S1 and S2: compare M<1,2> with M3
2. Combining S1 and S3: compare M<1,3> with M2
3. Combining S2 and S3: compare M<2,3> with M1
[Figure: for each option, the two models plotted as probability (Pro) over attributes (Attrs)]
This inspires a hierarchical agglomerative clustering (HAC) algorithm
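
The three merge options for S1, S2 and S3 can be compared with a short Python sketch. It assumes a Pearson χ² statistic on the 2 x k table of attribute counts (merged cluster versus remaining schema) as the dissimilarity measure; the exact test used in the paper may differ, and the function names are hypothetical. Only the statistic is computed, not a p-value, and with this few observations it is illustrative only.

```python
from collections import Counter
from itertools import combinations


def attribute_counts(schemas):
    """Count attribute occurrences over a list of schemas."""
    return Counter(attr for schema in schemas for attr in schema)


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for two rows of attribute counts."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    grand = total_a + total_b
    stat = 0.0
    for attr in set(counts_a) | set(counts_b):
        column = counts_a[attr] + counts_b[attr]
        for observed, row_total in ((counts_a[attr], total_a),
                                    (counts_b[attr], total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat


schemas = {
    "S1": {"title", "author", "ISBN"},
    "S2": {"author", "ISBN", "price"},
    "S3": {"make", "model", "price"},
}

# Score each option: merge one pair, compare it with the remaining schema.
scores = {}
for pair in combinations(sorted(schemas), 2):
    merged = attribute_counts([schemas[name] for name in pair])
    rest = attribute_counts([s for name, s in schemas.items()
                             if name not in pair])
    scores[pair] = chi_square(merged, rest)

# The option whose two resulting models are most dissimilar.
best = max(scores, key=scores.get)
```

Under these assumptions, combining S1 and S2 yields the largest statistic, which matches the intuition that the two book schemas belong together and the automobile schema stands apart.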
13
Hypothesis testing needs sufficient observations
Pre-clustering to form small clusters
[Figure: a distinguishable schema S1 with anchor attributes, and a schema S2]
S1 and S2 should be in the same domain and are thus pre-clustered
How to decide whether a schema S is distinguishable?
[Figure: S1, Sup(S1), and any Si, Sj in Sup(S1)]
14
Post-classification: Handling loners
[Figure: pipeline with the steps Separate, Pre-clustering, Model clustering]
Loners: too small for the χ² test after pre-clustering
Naïve Bayesian classification
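
One way to picture the post-classification step is a small naive Bayes classifier that assigns each loner to the most probable existing cluster. This is a sketch under stated assumptions, not the paper's implementation: the Laplace smoothing parameter `alpha`, the cluster priors based on cluster size, the rule of skipping attributes outside the vocabulary, and the names `train` and `classify` are all hypothetical choices made here.

```python
from collections import Counter
from math import log


def train(clusters, alpha=1.0):
    """Build one smoothed multinomial model per cluster (alpha is assumed)."""
    vocabulary = {a for schemas in clusters.values() for s in schemas for a in s}
    total_schemas = sum(len(schemas) for schemas in clusters.values())
    models = {}
    for name, schemas in clusters.items():
        counts = Counter(a for s in schemas for a in s)
        denom = sum(counts.values()) + alpha * len(vocabulary)
        models[name] = {
            "log_prior": log(len(schemas) / total_schemas),
            "log_prob": {a: log((counts[a] + alpha) / denom)
                         for a in vocabulary},
        }
    return models


def classify(loner, models):
    """Return the cluster with the highest posterior score for the loner."""
    def score(model):
        # Attributes never seen in any cluster are skipped (an assumption).
        return model["log_prior"] + sum(
            model["log_prob"][a] for a in loner if a in model["log_prob"])
    return max(models, key=lambda name: score(models[name]))


# Hypothetical clusters produced by the earlier steps.
clusters = {
    "Books": [{"author", "title", "ISBN"}, {"author", "title", "publisher"}],
    "Automobiles": [{"make", "model", "price"},
                    {"manufacturer", "model", "price"}],
}
models = train(clusters)
label = classify({"title", "ISBN", "price"}, models)
```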
15
Experiments

Data
Questions to answer
Can schema clustering effectively organize Web
databases?
Can it build a domain hierarchy correctly?

16
We also try existing objective functions

Three existing objective functions
Likelihood: maximize likelihood
Entropy: maximize entropy
Context Linkage: minimize cross links
To be fair, we keep pre-clustering and post-classification, and change only the measure used in the clustering step

17
Effectiveness of Clustering