Title: Statistical Schema Matching across Web Query Interfaces
1 Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003
- Bin He
- Joint work with Kevin Chen-Chuan Chang
2 Background: MetaQuerier, large-scale integration of the deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
3 Challenge: matching query interfaces (QIs)
[Figure: example query interfaces from the Book Domain and the Music Domain]
4 Traditional approaches of schema matching: Pairwise Attribute Correspondence
- Examples
  - LSD, Cupid
- Scale is a challenge
  - Only small scale
  - Large scale is a must for our task
- Scale is an opportunity
  - Holistic information is not exploited
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Pairwise Attribute Correspondence:
S1.author = S3.name
S1.subject = S2.category
5 Observation of large-scale sources: concerted complexity of QIs
- Deep Web sources are proliferating
  - 127,000 online deep Web sources (Deep Web survey, UIUC, 2003)
- Query Interfaces
  - Designed for human users (more understandable and consistent)
  - Concerted complexity
6 A hidden schema model exists?
[Figure: a statistical model M over a finite vocabulary generates QIs with different probabilities; instantiation probability P(QI1 | M)]
7 A hidden schema model exists?
- Our View (Hypothesis): a statistical model M over a finite vocabulary generates QIs with different probabilities; the instantiation probability of an interface QI1 is P(QI1 | M)
- Now the problem is: given the QIs, can we discover M (and P)?
8 A new approach: Hidden Model Discovery
- Scalability: large-scale matching
- Solvability: exploit statistical information
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Holistic Model Discovery:
[Figure: discovered model over the attributes category, author, name, subject, writer]
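As a rough illustration of what "exploit statistical information" could look like, the sketch below pools the three example schemas and counts attribute occurrences and co-occurrences. The variable names and the choice of co-occurrence counts as the statistic are assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter
from itertools import combinations

# The three example schemas from the slide.
schemas = {
    "S1": {"author", "title", "subject", "ISBN"},
    "S2": {"writer", "title", "category", "format"},
    "S3": {"name", "title", "keyword", "binding"},
}

# How often each attribute appears across all sources.
occurrences = Counter(attr for attrs in schemas.values() for attr in attrs)

# How often two attributes appear together in the same interface.
cooccurrences = Counter(
    frozenset(pair)
    for attrs in schemas.values()
    for pair in combinations(sorted(attrs), 2)
)

print(occurrences["title"])                             # 3
print(cooccurrences[frozenset({"author", "writer"})])   # 0
```

Attributes that are frequent on their own but never co-occur (such as author and writer here) are the kind of signal a holistic approach can use and a pairwise matcher cannot see.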
9 Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract Model structure M to solve a target question
[Figure labels: M, P(QI | M)]
10 MGS_SD: Specialize MGS for synonym discovery
- We believe MGS is generally applicable to a wide range of schema matching tasks
  - E.g., attribute grouping
- Our focus in this paper: discover synonym attributes
  - Author = Writer, Subject = Category
- No complex matching
  - (LastName, FirstName) = Author
- No hierarchical matching
  - Query interface as flat schema
11 Hypothesis Modeling 1: The Structure
- Goal: capture synonym relationship
- Two-level model structure
12 Hypothesis Modeling 2: Instantiation probability
1. Observing an attribute: P(author | M) = α1 × β1
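A minimal sketch of the two-level model, assuming α is the probability that a concept is used and β is the probability of picking a given attribute within that concept. Only P(author | M) = α1 × β1 comes from the slide. The formula for a whole interface (independent concepts, zero probability when two attributes of the same concept co-occur) is an assumption, and the model values and function names are hypothetical.

```python
# Hypothetical two-level model: each concept has a probability alpha and
# a distribution beta over its synonym attributes (betas sum to 1).
model = {
    "C1": (0.75, {"author": 0.5, "writer": 0.5}),      # made-up numbers
    "C2": (0.5, {"subject": 0.75, "category": 0.25}),  # made-up numbers
}

def attribute_probability(attr, model):
    """P(attr | M) = alpha * beta for the concept that contains attr."""
    for alpha, betas in model.values():
        if attr in betas:
            return alpha * betas[attr]
    return 0.0  # attribute is outside the model's vocabulary

def interface_probability(qi, model):
    """P(QI | M), assuming concepts are instantiated independently."""
    vocabulary = {a for _, betas in model.values() for a in betas}
    if not set(qi) <= vocabulary:
        return 0.0
    prob = 1.0
    for alpha, betas in model.values():
        chosen = [a for a in qi if a in betas]
        if len(chosen) > 1:
            return 0.0  # two synonyms together: assumed impossible
        if chosen:
            prob *= alpha * betas[chosen[0]]
        else:
            prob *= 1.0 - alpha  # concept not used by this interface
    return prob

print(attribute_probability("author", model))             # 0.375
print(interface_probability({"author", "writer"}, model)) # 0.0
```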
13 Hypothesis Generation
- Prune the space of model candidates
  - Generate M such that P(QI | M) > 0 for any observed QI
  - Mutual exclusion assumption
- Example
  - Observations: QI1 = {author, subject} and QI2 = {author, category}
  - Space of models: any set partition of {author, subject, category}
14 Hypothesis Generation
- Example (continued)
  - Model candidates after pruning
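The example above can be worked through by brute force: enumerate every set partition of the three attributes and keep only those in which no observed interface has two attributes in the same concept. This is a sketch under that reading of the mutual exclusion assumption; the function names are hypothetical, and full enumeration is only practical for tiny vocabularies because the number of partitions grows very quickly.

```python
def set_partitions(items):
    """Yield every partition of items as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition

def is_consistent(partition, observations):
    """True if no observed QI has two attributes in the same concept."""
    return all(
        len(qi & set(block)) <= 1
        for qi in observations
        for block in partition
    )

observations = [{"author", "subject"}, {"author", "category"}]
attributes = ["author", "subject", "category"]

all_models = list(set_partitions(attributes))
candidates = [p for p in all_models if is_consistent(p, observations)]

print(len(all_models))  # 5
for candidate in candidates:
    print(candidate)
```

Under these assumptions two of the five partitions survive: the one that keeps all three attributes separate, and the one that groups subject with category.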
15 Hypothesis Selection
- Rank the model candidates
- Intuition: select the model that generates the closest distribution to the observations
- Approach: statistical hypothesis testing
[Figure: estimated distributions over QIs (Est) for candidate models M1 and M4, compared with the observed distribution over QIs (Obr)]
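One way to make "closest distribution" concrete is a goodness-of-fit statistic. The sketch below assumes Pearson's chi-square statistic; the slide only says "statistical hypothesis testing". The counts, the two candidate distributions, and the names M_a and M_b are made up for illustration. A full test would also compare the statistic with a critical value for a chosen significance level, which is omitted here.

```python
import math

def chi_square(observed_counts, model_probs):
    """Pearson chi-square distance between observed counts and a model."""
    total = sum(observed_counts.values())
    stat = 0.0
    for qi in set(observed_counts) | set(model_probs):
        observed = observed_counts.get(qi, 0)
        expected = total * model_probs.get(qi, 0.0)
        if expected == 0.0:
            if observed > 0:
                return math.inf  # the model cannot generate an observed QI
            continue
        stat += (observed - expected) ** 2 / expected
    return stat

# Made-up counts for two observed interfaces (10 observations in total).
observed = {("author", "subject"): 6, ("author", "category"): 4}

# Made-up distributions predicted by two hypothetical candidates.
candidates = {
    "M_a": {("author", "subject"): 0.5, ("author", "category"): 0.5},
    "M_b": {("author", "subject"): 0.9, ("author", "category"): 0.1},
}

scores = {name: chi_square(observed, probs) for name, probs in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # M_a
```

A smaller statistic means the candidate's predicted counts are closer to the observations, so candidates can be ranked by it.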
16 Real World Data and Final Algorithm
- Hypothesis testing needs sufficient observations, while in the real world there are
  - Rare attributes, e.g., publisher, price
  - Rare interfaces
- Final Iterative Algorithm
  - Attribute Selection
  - Extract the common parts of model candidates of last iteration
  - Hypothesis Generation
  - Combine rare interfaces
  - Hypothesis Selection
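The slides do not spell out how attributes are selected or how rare interfaces are combined. The sketch below assumes simple frequency thresholds for both steps: attributes seen too rarely are dropped, and interfaces seen too rarely are pooled into a single count. The function names, thresholds, and sample data are hypothetical.

```python
from collections import Counter

def select_attributes(interfaces, min_occurrences):
    """Keep only attributes seen in at least min_occurrences interfaces."""
    counts = Counter(attr for qi in interfaces for attr in set(qi))
    frequent = {a for a, n in counts.items() if n >= min_occurrences}
    # Project each interface onto the frequent attributes; drop empty ones.
    projected = [frozenset(qi) & frequent for qi in interfaces]
    return [qi for qi in projected if qi]

def combine_rare_interfaces(interfaces, min_count):
    """Pool interfaces observed fewer than min_count times into one total."""
    counts = Counter(interfaces)
    frequent = {qi: n for qi, n in counts.items() if n >= min_count}
    rare_total = sum(n for n in counts.values() if n < min_count)
    return frequent, rare_total

# Made-up observations: publisher and price are rare attributes.
interfaces = (
    [{"author", "title"}] * 3
    + [{"author", "title", "publisher"}]
    + [{"title", "price"}]
)

selected = select_attributes(interfaces, min_occurrences=2)
frequent, rare_total = combine_rare_interfaces(selected, min_count=2)
print(frequent)    # {frozenset({'author', 'title'}): 4} (element order may vary)
print(rare_total)  # 1
```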
17 Case Study: Music and Movie Domains
- To have sufficient observations, handle only the attributes with at least 10 occurrences.
[Figure: M_music with concepts C1 to C5 over the attributes band, artist, song, album, title, label, format]
[Figure: M_movie with concepts C1 to C4 over the attributes star, artist, actor, genre, category, title, director]
18 Case Study: Book Domain
[Figure: M_book1 with concepts C1 to C6 over the attributes author, last name, first name, subject, category, publisher, title, isbn]
[Figure: M_book2 with concepts C1 to C6 over the attributes author, first name, last name, subject, category, publisher, title, isbn]
19 Promise, Limitation, Future Issues
- Promise
  - Use minimal light-weight information: attribute name
  - Effective with sufficient instances
  - Leverage challenge as opportunity
- Limitation
  - Need sufficient observations
  - Homonyms
- Future Issues
  - Complex matching: (Last Name, First Name) = Author
  - Efficient approximation algorithm
  - Incorporating other matching techniques
20 Thank You