Statistical Schema Matching across Web Query Interfaces

1
Statistical Schema Matching across Web Query Interfaces
SIGMOD 2003
  • Bin He
  • Joint work with Kevin Chen-Chuan Chang

2
Background: MetaQuerier, Large-Scale Integration of the Deep Web
[Figure labels: Query, Result, MetaQuerier, The Deep Web]
3
Challenge: matching query interfaces (QIs)
[Figure panels: Book Domain, Music Domain]
4
Traditional approaches of schema matching: Pairwise Attribute Correspondence
  • Examples: LSD, Cupid
  • Scale is a challenge: these approaches handle only
    small-scale matching, but large scale is a must for
    our task
  • Scale is an opportunity: holistic information is
    not exploited

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Pairwise Attribute Correspondence
S1.author <-> S3.name
S1.subject <-> S2.category
5
Observation of large-scale sources: concerted complexity of QIs
  • Deep Web sources are proliferating: 127,000 online
    deep Web sources (Deep Web survey, UIUC, 2003)
  • Query interfaces are designed for human users
    (more understandable and consistent)
  • Concerted complexity

6
A hidden schema model exists?
  • Our View (Hypothesis)

[Figure: a hidden statistical model M over a finite vocabulary generates QIs with different probabilities; instantiation probability P(QI1|M)]
7
A hidden schema model exists?
  • Our View (Hypothesis)
  • Now the problem is: given QIs, can we discover M?

[Figure: a hidden statistical model M over a finite vocabulary generates QIs with different probabilities; instantiation probability P(QI1|M)]
8
A new approach: Hidden Model Discovery
  • Scalability: large-scale matching
  • Solvability: exploit statistical information

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
[Figure: Holistic Model Discovery over S1, S2, and S3; attribute labels shown: category, author, name, subject, writer]
9
Towards hidden model discovery: statistical schema matching (MGS)
1. Define the abstract model structure M to solve
a target question
P(QI|M)
10
MGS_SD: Specialize MGS for synonym discovery
  • We believe MGS is generally applicable to a wide
    range of schema matching tasks, e.g., attribute
    grouping
  • Our focus in this paper: discovering synonym
    attributes, e.g., Author = Writer, Subject = Category
  • No complex matching, e.g.,
    (LastName, FirstName) = Author
  • No hierarchical matching: a query interface is
    treated as a flat schema

11
Hypothesis Modeling: 1. The Structure
  • Goal: capture synonym relationships
  • Two-level model structure

12
Hypothesis Modeling: 2. Instantiation probability
1. Observing an attribute
P(author|M) = α1 × β1
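
A minimal sketch of how the two-level structure and the attribute probability on this slide might be represented. It assumes α is the probability of a concept and β the probability of an attribute within that concept, so that observing an attribute has probability α × β. The `Concept` class, the function name, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Concept:
    """One node of the upper level: a concept and its synonym attributes."""
    alpha: float             # assumed: probability that the concept appears
    betas: Dict[str, float]  # assumed: attribute -> probability within concept


def attribute_probability(model: List[Concept], attribute: str) -> float:
    """Return P(attribute | M) as alpha * beta, or 0.0 if M lacks the attribute."""
    for concept in model:
        if attribute in concept.betas:
            return concept.alpha * concept.betas[attribute]
    return 0.0


# Hypothetical model: two concepts, each grouping two synonym attributes.
model = [
    Concept(alpha=0.8, betas={"author": 0.75, "writer": 0.25}),
    Concept(alpha=0.5, betas={"subject": 0.5, "category": 0.5}),
]

p_author = attribute_probability(model, "author")  # 0.8 * 0.75
```

Grouping synonyms under one concept is what lets the lower-level β values express how often each name is chosen for the same underlying concept.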
13
Hypothesis Generation
  • Prune the space of model candidates
  • Generate M such that P(QI|M) > 0 for any observed
    QI
  • Mutual exclusion assumption
  • Example
  • Observations: QI1 = {author, subject} and
    QI2 = {author, category}
  • Space of models: any set partition of {author,
    subject, category}

14
Hypothesis Generation
  • Prune the space of model candidates
  • Generate M such that P(QI|M) > 0 for any observed
    QI
  • Mutual exclusion assumption
  • Example
  • Observations: QI1 = {author, subject} and
    QI2 = {author, category}
  • Space of models: any set partition of {author,
    subject, category}
  • Model candidates after pruning
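
The example on this slide can be worked through with a short sketch. It enumerates every set partition of the observed attributes and keeps only those in which no two attributes that co-occur in an observed interface share a concept, which is one reading of the mutual exclusion assumption. The function names are hypothetical, and the brute-force enumeration is for illustration only, not the presenters' algorithm.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a new block of its own.
        yield [[first]] + partition


def is_consistent(partition, interfaces):
    """True if no observed interface holds two attributes of the same block."""
    for block in partition:
        for a, b in combinations(block, 2):
            if any(a in qi and b in qi for qi in interfaces):
                return False
    return True


# Observations from the slide.
observations = [{"author", "subject"}, {"author", "category"}]
attributes = sorted(set().union(*observations))

all_partitions = list(set_partitions(attributes))
candidates = [p for p in all_partitions if is_consistent(p, observations)]

for candidate in candidates:
    print(candidate)
```

Of the five partitions of three attributes, two survive: one that groups subject with category, and one that keeps all three attributes in separate concepts.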

15
Hypothesis Selection
  • Rank the model candidates
  • Intuition: select the model that generates the
    closest distribution to the observations
  • Approach: statistical hypothesis testing

[Figure: estimated distributions over QIs for model candidates M1 and M4, compared with the observed distribution over QIs]
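
A minimal sketch of ranking candidates by how closely their estimated distribution matches the observations. The slide names only "statistical hypothesis testing"; the Pearson chi-square statistic used here is an assumption made for illustration. The interface counts, the probabilities, and the names `M_a` and `M_b` are hypothetical.

```python
import math


def chi_square_statistic(observed_counts, model_probs):
    """Pearson statistic: sum over interfaces of (observed - expected)^2 / expected."""
    total = sum(observed_counts.values())
    statistic = 0.0
    for qi in set(observed_counts) | set(model_probs):
        observed = observed_counts.get(qi, 0)
        expected = total * model_probs.get(qi, 0.0)
        if expected == 0:
            if observed > 0:
                # The model cannot generate an interface that was observed.
                return math.inf
            continue
        statistic += (observed - expected) ** 2 / expected
    return statistic


qi1 = frozenset({"author", "subject"})
qi2 = frozenset({"author", "category"})

# Hypothetical observation counts and estimated probabilities.
observed = {qi1: 6, qi2: 4}
estimated = {
    "M_a": {qi1: 0.5, qi2: 0.5},
    "M_b": {qi1: 0.9, qi2: 0.1},
}

scores = {name: chi_square_statistic(observed, probs)
          for name, probs in estimated.items()}
best = min(scores, key=scores.get)  # smaller statistic = closer distribution
```

A full test would also compare the statistic against a critical value for the appropriate degrees of freedom; the sketch stops at ranking.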
16
Real World Data and Final Algorithm
  • Hypothesis testing needs sufficient observations,
    while the real world has:
  • Rare attributes
  • Rare interfaces, e.g., publisher, price
  • Final Iterative Algorithm

[Figure: flowchart of the iterative algorithm; step labels: Attribute Selection, Extract the common parts of model candidates of last iteration, Hypothesis Generation, Combine rare interfaces, Hypothesis Selection]
17
Case Study: Music and Movie Domains
  • To have sufficient observations, we handle only
    attributes with at least 10 occurrences.
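
The occurrence threshold can be sketched as a simple frequency filter. Only the cutoff of 10 comes from the slide; the function names and the toy interfaces are hypothetical, and counting each attribute once per interface is an assumption.

```python
from collections import Counter


def select_frequent_attributes(interfaces, min_occurrences=10):
    """Keep attributes that appear in at least `min_occurrences` interfaces."""
    counts = Counter(attr for qi in interfaces for attr in set(qi))
    return {attr for attr, n in counts.items() if n >= min_occurrences}


def project(interfaces, kept):
    """Drop rare attributes from each interface; discard interfaces left empty."""
    projected = [set(qi) & kept for qi in interfaces]
    return [qi for qi in projected if qi]


# Hypothetical music-domain interfaces.
interfaces = (
    [{"title", "artist"}] * 12
    + [{"title", "band"}] * 10
    + [{"title", "composer"}] * 3
)

kept = select_frequent_attributes(interfaces)
filtered = project(interfaces, kept)
```

Here "composer" occurs only three times, so it is dropped before hypothesis generation while the interfaces that contained it are kept in reduced form.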

[Figure: discovered models for the two domains]
Mmusic (concepts C1 to C5): band, artist, song, album, title, label, format
Mmovie (concepts C1 to C4): star, artist, actor, genre, category, title, director
18
Case Study: Book Domain
  • Case of Hyponyms

[Figure: two discovered models for the book domain]
Mbook1 (concepts C1 to C6): author, last name, first name, subject, category, publisher, title, isbn
Mbook2 (concepts C1 to C6): author, first name, last name, subject, category, publisher, title, isbn
19
Promise, Limitation, Future Issues
  • Promise: uses only minimal, light-weight
    information (attribute names); effective with
    sufficient instances; turns the challenge of scale
    into an opportunity
  • Limitation: needs sufficient observations;
    homonyms
  • Future issues: complex matching, e.g.,
    (Last Name, First Name) = Author; an efficient
    approximation algorithm; incorporating other
    matching techniques

20
Thank You