Statistical Schema Matching across Web Query Interfaces

Title: Statistical Schema Matching across Web Query Interfaces

1
Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003

Bin He
Joint work with Kevin Chen-Chuan Chang

2
Background: MetaQuerier, large-scale integration of the deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
3
Challenge: matching query interfaces (QIs)
[Figure panels: Book Domain, Music Domain]
4
Traditional approaches to schema matching: Pairwise Attribute Correspondence

Examples: LSD, Cupid
Scale is a challenge
Only small scale
Large scale is a must for our task
Scale is an opportunity
Holistic information is not exploited

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Pairwise Attribute Correspondence
S1.author = S3.name
S1.subject = S2.category
5
Observation of large-scale sources: concerted complexity of QIs

Deep Web sources are proliferating
127,000 online deep Web sources (Deep Web survey, UIUC, 2003)
Query interfaces
Designed for human users (more understandable and consistent)
Concerted complexity

6
A hidden schema model exists?

Our View (Hypothesis)

Instantiation probability: P(QI1 | M)
[Figure labels: P, M, QI1, QIs]
Finite Vocabulary
Statistical Model
Generate QIs with different probabilities
7
A hidden schema model exists?

Our View (Hypothesis)
Now the problem is: given the QIs, can we discover M and P?
8
A new approach: Hidden Model Discovery

Scalability: large-scale matching
Solvability: exploit statistical information

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Holistic Model Discovery
[Figure labels: category, author, name, subject, writer]
9
Towards hidden model discovery: Statistical schema matching (MGS)
1. Define the abstract model structure M to solve a target question
P(QI | M)
10
MGS_SD: Specialize MGS for synonym discovery

We believe MGS is generally applicable to a wide range of schema matching tasks
E.g., attribute grouping
Our focus in this paper: discover synonym attributes
Author = Writer, Subject = Category
No complex matching
(LastName, FirstName) = Author
No hierarchical matching
Query interface as flat schema

11
Hypothesis Modeling: 1. The Structure

Goal: capture synonym relationships
Two-level model structure

12
Hypothesis Modeling: 2. Instantiation probability
1. Observing an attribute
P(author | M) = α1 × β1
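
As a rough illustration of the two-level structure and the attribute probability on this slide, the sketch below models each concept as a group of synonym attributes. It assumes that the first factor (alpha) is the probability that a concept is used and the second (beta) is the probability of choosing a particular attribute name within that concept. The class name, function name, and all numeric values are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Concept:
    # Assumption: probability that this concept appears in an interface.
    alpha: float
    # Assumption: probability of each synonym attribute, given the concept.
    betas: Dict[str, float]


def attribute_probability(model: List[Concept], attribute: str) -> float:
    """Return alpha * beta for the concept that owns `attribute`, else 0."""
    for concept in model:
        if attribute in concept.betas:
            return concept.alpha * concept.betas[attribute]
    return 0.0


# Hypothetical two-level model: concepts on top, synonym attributes below.
model = [
    Concept(alpha=0.9, betas={"author": 0.7, "writer": 0.3}),
    Concept(alpha=0.6, betas={"subject": 0.5, "category": 0.5}),
]

p_author = attribute_probability(model, "author")  # 0.9 * 0.7
```

Under these assumptions an attribute that belongs to no concept gets probability 0, which is what later makes some model candidates inconsistent with the observed interfaces.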
13
Hypothesis Generation

Prune the space of model candidates
Generate M such that P(QI | M) > 0 for any observed QI
Mutual exclusion assumption
Example
Observations: QI1 = {author, subject} and QI2 = {author, category}
Space of models: any set partition of {author, subject, category}

14
Hypothesis Generation

Model candidates after pruning
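
The pruning step in this example can be sketched in a few lines. The code enumerates every set partition of the three attributes and keeps only those in which no observed interface contains two attributes from the same group, which is one reading of the mutual exclusion assumption (synonyms do not co-occur in a single interface). The function names are hypothetical, and this is an illustrative sketch rather than the authors' implementation.

```python
def set_partitions(items):
    """Yield every partition of the list `items` as a list of groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a group of its own.
        yield [[first]] + partition


def is_consistent(partition, interfaces):
    """Mutual exclusion: no interface holds two attributes of one group."""
    for group in partition:
        for qi in interfaces:
            if len(set(group) & qi) > 1:
                return False
    return True


attributes = ["author", "subject", "category"]
observations = [{"author", "subject"}, {"author", "category"}]

all_models = list(set_partitions(attributes))
candidates = [p for p in all_models if is_consistent(p, observations)]
```

For this input there are five partitions in total, and two survive: the one that keeps all three attributes separate and the one that groups subject with category.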

15
Hypothesis Selection

Rank the model candidates
Intuition: select the model that generates the closest distribution to the observations
Approach: statistical hypothesis testing

[Figure: estimated distributions over QIs for models M1 and M4, shown against the observed distribution over QIs]
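
One way to make "closest distribution to the observations" concrete is a goodness-of-fit statistic. The slide only says "statistical hypothesis testing", so the use of Pearson's chi-square statistic below is an assumption, and the observed counts, model names, and model probabilities are invented purely for illustration.

```python
def chi_square(observed, expected_probs):
    """Pearson chi-square statistic of observed counts against a model.

    Assumes the model gives a non-zero probability to every schema it lists,
    and that it lists every observed schema.
    """
    total = sum(observed.values())
    statistic = 0.0
    for schema, prob in expected_probs.items():
        expected = prob * total  # expected count under the candidate model
        statistic += (observed.get(schema, 0) - expected) ** 2 / expected
    return statistic


# Hypothetical observation counts for two interface schemas.
observed = {("author", "subject"): 48, ("author", "category"): 52}

# Hypothetical distributions generated by two candidate models.
candidates = {
    "model_a": {("author", "subject"): 0.5, ("author", "category"): 0.5},
    "model_b": {("author", "subject"): 0.9, ("author", "category"): 0.1},
}

scores = {name: chi_square(observed, probs) for name, probs in candidates.items()}
best = min(scores, key=scores.get)  # smaller statistic means a closer fit
```

With these made-up numbers, model_a scores about 0.16 and model_b about 196, so model_a would be ranked first. A full test would also compare the statistic against a critical value for the appropriate degrees of freedom, which this sketch omits.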
16
Real World Data and Final Algorithm

Hypothesis testing needs sufficient observations, while in the real world there are:
Rare attributes
Rare interfaces, e.g., publisher, price
Final Iterative Algorithm

Attribute Selection
Extract the common parts of the model candidates of the last iteration
Hypothesis Generation
Combine rare interfaces
Hypothesis Selection
17
Case Study: Music and Movie Domains

To have sufficient observations, handle only the attributes with at least 10 occurrences.

[Figure: model Mmusic with concepts C1 to C5 over the attributes band, artist, song, album, title, label, format]
[Figure: model Mmovie with concepts C1 to C4 over the attributes star, artist, actor, genre, category, title, director]
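
The occurrence threshold mentioned on this slide could be applied as a simple pre-filter over the collected interfaces. The sketch below counts each attribute at most once per interface and keeps those seen at least 10 times. The function name and the sample data are hypothetical.

```python
from collections import Counter


def frequent_attributes(interfaces, min_occurrences=10):
    """Return the attributes that appear in at least `min_occurrences` interfaces."""
    counts = Counter(attr for qi in interfaces for attr in set(qi))
    return {attr for attr, n in counts.items() if n >= min_occurrences}


# Hypothetical sample: 12 interfaces with artist/title, 3 with band/label.
interfaces = [{"artist", "title"}] * 12 + [{"band", "label"}] * 3

kept = frequent_attributes(interfaces)
```

Here only artist and title pass the threshold, and band and label would be set aside as rare attributes.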
18
Case Study: Book Domain

Case of Hyponyms

[Figure: model Mbook1 with concepts C1 to C6 over the attributes author, last name, first name, subject, category, publisher, title, isbn]
[Figure: model Mbook2 with concepts C1 to C6 over the attributes author, first name, last name, subject, category, publisher, title, isbn]
19
Promise, Limitation, Future Issues