Organizing Structured Web Sources by Query Schemas: A Clustering Approach


1
Organizing Structured Web Sources by Query
Schemas: A Clustering Approach
  • Bin He
  • Joint work with Tao Tao, Kevin Chen-Chuan Chang
  • Univ. Illinois at Urbana-Champaign

2
Background: MetaQuerier, Large-Scale Integration
of the deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
3
MetaQuerier: System architecture
  • Front-end (query Web databases): Query Execution
  • Query Translation, Source Selection, Result Compilation
  • Deep Web Repository
  • Unified Interfaces, Subject Domains, Query Capabilities, Query Interfaces
  • Back-end (find Web databases): Semantics Discovery
  • Database Crawler, Interface Extraction, Source Organization, Schema Matching
4
In MetaQuerier, source organization means clustering
query interfaces into implicit domains
[Figure: example domains Airfares, Automobiles, Books]
5
What is the representative feature of query
interfaces?
  • Is query schema the feature we are looking for?

6
Query schemas are appropriate representatives of
Web databases: a distinctive property
[Figure: three panels (Airfares, Hotels, Movies), each plotting number of observations against attribute index]
  • Each domain contains a dominant range of
    attributes, distinctive from other domains
  • Some attributes are only observed in one domain
    (anchor attributes)
  • For example: ISBN for Books, MPAA Rating for
    Movies, ...
  • Source organization becomes the clustering of
    query schemas

7
Query schemas can be viewed as categorical data
  • Query schemas as transactions
  • S1: {author, title, subject, ISBN}
  • S2: {author, title, category, publisher}
  • S3: {make, model, price, zip code}
  • S4: {manufacturer, model, price}
  • S5: {from, to, departure date, return date,
    number of passengers}
  • S6: {departure city, arrival city, number of
    adults, number of children}
  • Thus, we can apply algorithms for clustering
    categorical data
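
To make the transaction view concrete, here is a minimal sketch that stores the six example schemas as sets and derives a binary (categorical) row per schema. The names `schemas`, `vocabulary`, and `to_row` are illustrative and do not come from the slides.

```python
# Each query schema is treated as a transaction: an unordered set of attributes.
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"author", "title", "category", "publisher"},
    "S3": {"make", "model", "price", "zip code"},
    "S4": {"manufacturer", "model", "price"},
    "S5": {"from", "to", "departure date", "return date",
           "number of passengers"},
    "S6": {"departure city", "arrival city", "number of adults",
           "number of children"},
}

# The vocabulary is the union of all attributes seen in any schema.
vocabulary = sorted(set().union(*schemas.values()))


def to_row(schema):
    """Encode a schema as a 0/1 vector over the vocabulary."""
    return [1 if attr in schema else 0 for attr in vocabulary]


matrix = {name: to_row(attrs) for name, attrs in schemas.items()}

# Attributes shared by two schemas hint that they belong to the same domain.
shared_books = schemas["S1"] & schemas["S2"]
shared_autos = schemas["S3"] & schemas["S4"]
```

Any categorical clustering algorithm can then operate on `matrix` or directly on the sets.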

8
Clustering categorical data: Objective function
  • Clustering needs an objective function to
    evaluate the quality of clusters
  • Existing objective functions
  • Likelihood (model-based clustering, 1998)
  • Context Linkage (ROCK, 2000)
  • Entropy (COOLCAT, 2002)
  • In this paper, we propose a new objective
    function
  • Model-Differentiation

9
Model-Differentiation: A new objective function
for model-based clustering
  • Assumption of model-based clustering
  • Each cluster Ci has a generative model Mi to
    generate its data with probabilistic behavior
  • What is a good clustering result? (our
    observation)
  • data in different clusters are very
    dissimilar
  • models of different clusters are very dissimilar
  • a new objective function
  • maximize the dissimilarity of models
  • To realize, we need to answer three questions
  • How to model the data?
  • How to estimate the model, given data?
  • How to measure the dissimilarity of models?

10
Modeling: Multinomial distribution
  • Each attribute is an independent event
  • A schema is generated by a series of samples
    drawn from M

  • Model M, vocabulary: author (P1), publisher (P2), title
    (P3), ISBN (P4), city (P5), price (P6), model (P7)
  • A schema: {title, author, ISBN}
  • Probability: P1 × P3 × P4
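
As a small illustration of the product form on this slide, the sketch below scores a schema under a model M. The probability values are hypothetical, chosen only so that they sum to 1, and `schema_probability` is an illustrative name.

```python
import math

# Hypothetical model M over the slide's vocabulary (values are made up).
model_m = {
    "author": 0.25,     # P1
    "publisher": 0.10,  # P2
    "title": 0.25,      # P3
    "ISBN": 0.15,       # P4
    "city": 0.05,       # P5
    "price": 0.10,      # P6
    "model": 0.10,      # P7
}


def schema_probability(schema, model):
    """Product of attribute probabilities, treating attributes as independent."""
    return math.prod(model[attr] for attr in schema)


# Corresponds to P1 * P3 * P4 for the schema {title, author, ISBN}.
p = schema_probability({"title", "author", "ISBN"}, model_m)
```

This follows the slide's simplified product and leaves out the multinomial coefficient, which is constant for a given schema and so does not affect comparisons between models.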
11
Model estimation: Given a set of data, how to
estimate its model?
  • Maximum likelihood estimation
  • S1: {title, author, ISBN}, S2: {author,
    ISBN, publisher}
  • S3: {author, title, price}, S4:
    {author, title, price}
  • Vocabulary: {author, title, ISBN, price,
    publisher}
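
For a multinomial model, maximum likelihood estimation reduces to relative frequencies. A minimal sketch on the four schemas above, with `estimate_model` as an illustrative name:

```python
from collections import Counter

cluster = [
    {"title", "author", "ISBN"},      # S1
    {"author", "ISBN", "publisher"},  # S2
    {"author", "title", "price"},     # S3
    {"author", "title", "price"},     # S4
]


def estimate_model(schemas):
    """MLE for a multinomial: count of each attribute / total observations."""
    counts = Counter(attr for schema in schemas for attr in schema)
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()}


model = estimate_model(cluster)
```

Here `author` appears in all four schemas out of twelve attribute observations in total, so its estimate is 4/12.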

12
Measuring the dissimilarity of models:
Statistical hypothesis testing
  • Multinomial distributions can be directly tested
    by χ² testing

S1: {title, author, ISBN}, S2: {author, ISBN, price}, S3: {make, model, price}
1. Combining S1 and S2: models M<1,2> and M3
2. Combining S1 and S3: models M<1,3> and M2
3. Combining S2 and S3: models M<2,3> and M1
[Figure: for each combination, the probability (Pro) of each attribute (Attrs) under the two models]
This inspires a hierarchical agglomerative clustering
(HAC) algorithm
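
The three combinations can be compared numerically. The sketch below assumes the dissimilarity is the Pearson χ² statistic of a two-row contingency table of attribute counts; the slides do not give the exact statistic or any thresholds, so this is only one plausible reading. `attribute_counts` and `chi_square` are illustrative names.

```python
from collections import Counter


def attribute_counts(schemas):
    """Attribute frequencies over all schemas in one cluster."""
    return Counter(attr for schema in schemas for attr in schema)


def chi_square(a, b):
    """Pearson chi-square statistic for two clusters' attribute counts."""
    n_a, n_b = sum(a.values()), sum(b.values())
    n = n_a + n_b
    stat = 0.0
    for attr in set(a) | set(b):
        column_total = a[attr] + b[attr]
        for observed, row_total in ((a[attr], n_a), (b[attr], n_b)):
            expected = row_total * column_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


S1 = {"title", "author", "ISBN"}
S2 = {"author", "ISBN", "price"}
S3 = {"make", "model", "price"}

# Larger statistic = the two resulting models are more dissimilar.
candidates = {
    "S1+S2 vs S3": chi_square(attribute_counts([S1, S2]), attribute_counts([S3])),
    "S1+S3 vs S2": chi_square(attribute_counts([S1, S3]), attribute_counts([S2])),
    "S2+S3 vs S1": chi_square(attribute_counts([S2, S3]), attribute_counts([S1])),
}
best = max(candidates, key=candidates.get)
```

Under this assumption, combining S1 and S2 gives the most dissimilar pair of models, which matches the intuition that the two book schemas belong together.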
13
Hypothesis testing needs sufficient observations:
Pre-clustering to form small clusters
  • Distinguishable: S1 with anchor attributes
  • S1 and S2 should be in the same domain and thus
    pre-clustered
  • How to decide whether a schema S is distinguishable?
  • [Figure: S1 and Sup(S1), with a condition on any Si, Sj in Sup(S1)]
14
Post-classification: Handling loners
  • Loners: too small for the χ² test after pre-clustering
  • [Figure: pipeline with labels Pre-clustering, Separate, Model clustering, Naïve Bayesian]
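
The slide names a naïve Bayesian step for loners without giving details. A minimal sketch, assuming add-one smoothing and priors proportional to the number of attribute observations in each cluster; both are assumptions, as are the example clusters and the name `classify_loner`.

```python
import math
from collections import Counter


def classify_loner(loner, clusters):
    """Assign a loner schema to the cluster with the highest posterior score."""
    vocabulary = set(loner).union(*(c.keys() for c in clusters.values()))
    total = sum(sum(c.values()) for c in clusters.values())
    best_name, best_score = None, -math.inf
    for name, counts in clusters.items():
        size = sum(counts.values())
        # Assumed prior: share of attribute observations in this cluster.
        score = math.log(size / total)
        for attr in loner:
            # Assumed add-one smoothing so unseen attributes get nonzero mass.
            score += math.log((counts[attr] + 1) / (size + len(vocabulary)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical attribute counts for two already-formed clusters.
clusters = {
    "Books": Counter(author=4, title=3, ISBN=2, price=2, publisher=1),
    "Automobiles": Counter(make=3, model=3, price=2),
}

label_a = classify_loner({"title", "ISBN"}, clusters)
label_b = classify_loner({"make", "price"}, clusters)
```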
15
Experiments
  • Data
  • Questions to answer
  • Can schema clustering effectively organize Web
    databases?
  • Can it build a domain hierarchy correctly?

16
We also try existing objective functions
  • Three existing objective functions
  • Likelihood: maximize likelihood
  • Entropy: minimize entropy
  • Context Linkage: minimize cross links
  • To be fair, keep pre-clustering and
    post-classification, and change only the clustering
    step by using different measures

17
Effectiveness of Clustering
  • 8 domains, 8 clusters
  • Most Web databases are clustered correctly
  • Quantitative analysis: conditional entropy (the
    smaller, the better)
  • Model-Differentiation: 0.32
  • Likelihood: 0.42
  • Entropy: 0.38
  • Context Linkage: 0.61
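
For reference, conditional entropy can be computed as below. This sketch assumes the metric is the entropy of the true domain given the assigned cluster, with base-2 logarithms; the slides do not state the direction or the base. `conditional_entropy` and the sample data are illustrative.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(assignments):
    """H(domain | cluster) for a list of (cluster_id, true_domain) pairs."""
    by_cluster = defaultdict(Counter)
    for cluster, domain in assignments:
        by_cluster[cluster][domain] += 1
    n = len(assignments)
    h = 0.0
    for domains in by_cluster.values():
        size = sum(domains.values())
        for count in domains.values():
            p = count / size
            # Weight each cluster's entropy by the cluster's share of the data.
            h -= (size / n) * p * math.log2(p)
    return h


# Every cluster holds one domain: nothing left to guess.
pure = [(0, "Books"), (0, "Books"), (1, "Movies"), (1, "Movies")]
# One cluster split evenly between two domains: one bit of uncertainty.
mixed = [(0, "Books"), (0, "Movies"), (0, "Books"), (0, "Movies")]

h_pure = conditional_entropy(pure)
h_mixed = conditional_entropy(mixed)
```

A value of 0 means each cluster contains a single domain, which is why smaller is better.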

18
To build a domain hierarchy
  • After obtaining 8 clusters, continue running the HAC
    algorithm to merge them
  • The result is consistent with common sense: close
    concepts are merged first
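
A hierarchy of this kind can be sketched with a plain agglomerative loop. The merge criterion here (merge the pair of clusters with the smallest Pearson χ² statistic between their attribute counts) is an assumption, since the slides do not spell out the similarity measure; the cluster contents and the names `chi_square` and `build_hierarchy` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_square(a, b):
    """Pearson chi-square statistic for two clusters' attribute counts."""
    n_a, n_b = sum(a.values()), sum(b.values())
    n = n_a + n_b
    stat = 0.0
    for attr in set(a) | set(b):
        column_total = a[attr] + b[attr]
        for observed, row_total in ((a[attr], n_a), (b[attr], n_b)):
            expected = row_total * column_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def build_hierarchy(clusters):
    """Repeatedly merge the two least dissimilar clusters; return merge order."""
    clusters = dict(clusters)
    merges = []
    while len(clusters) > 1:
        x, y = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: chi_square(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters.pop(x) + clusters.pop(y)
        clusters[f"({x}+{y})"] = merged
        merges.append((x, y))
    return merges


# Hypothetical attribute counts for three clusters.
merges = build_hierarchy({
    "books1": Counter(author=3, title=3, ISBN=2),
    "books2": Counter(author=2, title=2, publisher=1),
    "autos": Counter(make=3, model=3, price=2),
})
```

In this toy input the two book-like clusters share most of their attributes, so they are merged before either joins the automobile cluster.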

19
Conclusions
  • Cluster Web databases using their query schemas
  • First work on clustering Web databases, not pages
  • Query schemas are good representatives
  • Essentially a problem of clustering categorical
    data
  • A new objective function: Model-Differentiation
  • Realized by statistical hypothesis testing
  • Derives a different similarity measure for HAC

20
Thank You!