Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan - PowerPoint PPT Presentation

1
eTuner: Tuning Schema Matching Software Using
Synthetic Scenarios
  • Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan
  • University of Illinois, USA
  • Arnon Rosenthal
  • MITRE Corp., USA

2
Main Points
  • Tuning matching systems: a long-standing problem that is becoming
    increasingly worse
  • We propose a principled solution that exploits synthetic
    input/output pairs; promising, though much work remains
  • The idea is applicable to other contexts

3
Schema Matching
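Schema 1
price     agent-name        address
120,000   George Bush       Crawford, TX
239,900   Hillary Clinton   New York City, NY
Schema 2
listed-price   contact-name   city      state
320K           Jane Brown     Seattle   WA
240K           Mike Smith     Miami     FL
Matches between Schema 1 and Schema 2
  • 1-1 match (e.g., price and listed-price)
  • complex match (e.g., address and the combination of city and state)
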
4
Schema Matching is Ubiquitous
  • Databases: data integration, model management, data translation,
    collaborative data sharing, keyword querying, schema/view
    integration, data warehousing, peer data management
  • AI: knowledge bases, ontology merging, information gathering
    agents, ...
  • Web: e-commerce, Deep Web, Semantic Web
  • eGovernment, bio-informatics, scientific data management

5
Current State of Affairs
  • Finding semantic mappings is now a key
    bottleneck!
  • largely done by hand; labor-intensive and error-prone
  • Numerous matching techniques have been developed
  • Databases: IBM Almaden, Microsoft Research, BYU, George Mason,
    U Leipzig, U Wisconsin, NCSU, U Illinois, Washington,
    Humboldt-Universität zu Berlin, ...
  • AI: Stanford, Karlsruhe University, NEC Japan, ...
  • Techniques are often synergistic, leading to
    multi-component matching
    architectures
  • each component employs a particular technique
  • final predictions combine those of the components

6
An Example: LSD (SIGMOD-01)
[Figure: the LSD matching pipeline applied to two schemas]
  • Schema 1: address, agent-name
    sample data: (Urbana, IL; James Smith), (Seattle, WA; Mike Doan)
  • Schema 2: area, contact-agent
    sample data: (Peoria, IL; (206) 634 9435), (Kent, WA; (617) 335 4243)
  • Matchers: Name Matcher, Naive Bayes Matcher
    (scores shown in the figure: 0.5, 0.3, 0.1)
  • Combiner output:
    area => (address, 0.7), (description, 0.3)
    contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
    comments => (address, 0.6), (desc, 0.4)
  • Constraint Enforcer: only one attribute of Schema 2 matches address
  • Match Selector output: area = address, contact-agent = agent-phone,
    ..., comments = desc

7
Multi-Component Matching Solutions
  • Developed in many recent works
  • e.g., [Doan et al., WebDB-00, SIGMOD-01], [Do & Rahm, VLDB-02],
    [Embley et al. 02], [Bernstein et al., SIGMOD Record-04],
    [Madhavan et al. 05]
  • Now commonly adopted, with industrial-strength systems
  • e.g., Protoplasm (MSR), COMA (Univ. of Leipzig)

[Figure: execution graphs of four matching systems (LSD, COMA, SF,
LSD-SF), built from components such as matchers (Matcher 1, ...,
Matcher n), combiners, and match selectors]
  • Such systems are very powerful ...
  • maximize accuracy; highly customizable to individual domains
  • ... but place a serious tuning burden on domain users

8
Tuning Schema Matching Systems
  • Given a particular matching situation
  • how to select the right components?
  • how to adjust the multitude of knobs?

[Figure: an execution graph assembled from a library of matching
components]
Library of matching components
  • Matchers: q-gram name matcher, decision tree matcher, Naïve Bayes
    matcher, TF/IDF name matcher, SVM matcher
  • Combiners: average, min, max, weighted sum
  • Constraint enforcers: A* search enforcer, relaxation labeler, ILP
  • Match selectors: threshold selector, bipartite graph selector
Knobs of decision tree matcher
  • characteristics of attributes
  • split measure
  • post-prune?
  • size of validation set
  • Untuned versions produce inferior accuracy,
    however ...
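
As a rough illustration of why these components expose knobs, the sketch below implements the four combiners and the threshold selector named on this slide. The function names, the weights, and the sample scores are hypothetical; this is not the code of any of the matching systems mentioned.

```python
def combine(scores, method="average", weights=None):
    """Combine the scores that several matchers assign to one candidate match."""
    if not scores:
        raise ValueError("need at least one matcher score")
    if method == "average":
        return sum(scores) / len(scores)
    if method == "min":
        return min(scores)
    if method == "max":
        return max(scores)
    if method == "weighted_sum":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weighted_sum needs one weight per score")
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError(f"unknown combiner: {method}")


def threshold_select(candidates, threshold):
    """Keep the candidate targets whose combined score reaches the threshold."""
    return {target for target, score in candidates.items() if score >= threshold}


# Hypothetical scores from two matchers for one attribute pair.
scores = [0.5, 0.3]
for method in ("average", "min", "max"):
    print(method, combine(scores, method))
print("weighted_sum", combine(scores, "weighted_sum", weights=[0.25, 0.75]))
print(threshold_select({"address": 0.7, "description": 0.3}, threshold=0.5))
```

The choice of combiner, the weights, and the threshold are each a knob, and each can change which matches are finally selected.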

9
... Tuning is Extremely Difficult
  • Large number of knobs (e.g., 8-29 in our experiments)
  • Wide variety of techniques: database, machine learning, IR,
    information theory, etc.
  • Complex interaction among components
  • Not clear how to compare the quality of knob configs
  • Matching systems are still tuned manually, by trial and error
  • Multi-component systems make tuning even worse

Developing efficient tuning techniques is crucial to making matching
systems attractive in practice

10
The eTuner Solution
  • Given schema S & matching system M, eTuner
  • tunes M to maximize the average accuracy of matching S with
    future schemas
  • incurs virtually no cost to the user
  • Key challenge 1: Evaluation
  • must search for the best knob config
  • how to compute the quality of any knob config C?
  • if we know the ground-truth matches for a representative
    workload W = {(S,T1), ..., (S,Tn)}, then we can use W to
    evaluate C
  • but often we have no such W
  • Key challenge 2: Search
  • how to efficiently evaluate the huge space of knob configs?
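
The evaluation step can be sketched as follows, assuming the quality of a knob config C is the average F1 of the matcher's output over a workload with known matches. The toy matcher, its `prefix_len` knob, and the sample schemas are hypothetical stand-ins, not part of eTuner.

```python
from statistics import mean


def f1_score(predicted, truth):
    """F1 of a predicted set of matches against the ground-truth set."""
    predicted, truth = set(predicted), set(truth)
    correct = len(predicted & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


def config_quality(matcher, config, schema_s, workload):
    """Average F1 of `matcher` under `config` over [(T_i, truth_i), ...]."""
    return mean(
        f1_score(matcher(schema_s, target, config), truth)
        for target, truth in workload
    )


def toy_matcher(schema_s, target, config):
    """Hypothetical matcher: pair names that share a prefix of a given length."""
    k = config["prefix_len"]
    return {(s, t) for s in schema_s for t in target if s[:k] == t[:k]}


schema_s = ["price", "agent_name", "address"]
workload = [
    (["pric", "agent", "addr"],
     {("price", "pric"), ("agent_name", "agent"), ("address", "addr")}),
    (["prc", "agnt_nm", "address"],
     {("price", "prc"), ("agent_name", "agnt_nm"), ("address", "address")}),
]

for prefix_len in (2, 4):
    quality = config_quality(toy_matcher, {"prefix_len": prefix_len}, schema_s, workload)
    print(prefix_len, round(quality, 3))
```

Comparing the printed qualities is what lets a search procedure rank one knob config against another.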

11
Key Idea: Generate Synthetic Input/Output Pairs
  • Need workload W = {(S,T1), (S,T2), ..., (S,Tn)}
  • To generate W
  • start with S
  • perturb S to generate T1
  • perturb S to generate T2
  • etc.
  • Know the perturbation => know the matches between S & Ti
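
A minimal sketch of this idea, assuming a schema is just a list of column names: each perturbation either drops a column or renames it, and the generator records the correspondence as it goes, so the ground-truth matches come for free. The drop probability and the two renaming rules are hypothetical choices made for illustration.

```python
import random


def perturb_schema(columns, rng, drop_prob=0.2):
    """Return (perturbed_columns, matches) for one synthetic schema T_i.

    `matches` holds (original, perturbed) pairs for every surviving column.
    """
    perturbed, matches = [], set()
    for col in columns:
        if rng.random() < drop_prob:
            continue  # dropped column: no match is recorded
        # Hypothetical renaming rules: abbreviate or upper-case the name.
        new_name = col[:4] if rng.random() < 0.5 else col.upper()
        perturbed.append(new_name)
        matches.add((col, new_name))
    rng.shuffle(perturbed)  # column order carries no information
    return perturbed, matches


def generate_workload(columns, n, seed=0):
    """Build W = [(T_1, matches_1), ..., (T_n, matches_n)] from schema S."""
    rng = random.Random(seed)
    return [perturb_schema(columns, rng) for _ in range(n)]


schema_s = ["price", "agent_name", "address"]
for target, matches in generate_workload(schema_s, n=3):
    print(target, sorted(matches))
```

Because every rename is logged when it is applied, no human has to label the matches between S and each Ti.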

12
Key Idea: Generate Synthetic Input/Output Pairs
[Figure: generating a synthetic workload from schema S]
  • Split S into V and U with disjoint data tuples
  • Perturb V to obtain V1, ..., Vn
  • perturb the number of tables
  • perturb the number of columns in each table
  • perturb column and table names
  • perturb data tuples in each table
  • Example: table EMPLOYEES in U, perturbed table EMPS in V1
  • O1: a set of semantic matches between V1 and U, e.g.,
    EMPS.emp-last = EMPLOYEES.last, EMPS.id = EMPLOYEES.id,
    EMPS.wage = EMPLOYEES.salary()

13
Examples of Perturbation Rules
  • Number of tables: merge two tables based on a join path; split a
    table into two
  • Structure of table: merge two columns (e.g., neighboring columns,
    or columns sharing a prefix/suffix such as last-name, first-name);
    drop a column; swap the location of two columns
  • Names of tables/columns: rules capture common name
    transformations (abbreviation to the first 3-4 characters,
    dropping all vowels, synonyms, dropping prefixes, adding the
    table name to the column name, etc.)
  • Data values: rules capture common format transformations (e.g.,
    12/4 => Dec 4); values are changed based on some distributions
    (e.g., Gaussian)

See paper for details
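
Four of the name-transformation rules listed on this slide can be sketched as small string functions. The function names, the separators, and the default abbreviation length are assumptions made for illustration; the actual rule set is described in the paper.

```python
VOWELS = set("aeiouAEIOU")


def abbreviate(name, length=4):
    """Abbreviate a name to its first `length` characters."""
    return name[:length]


def drop_vowels(name):
    """Drop all vowels from a name."""
    return "".join(ch for ch in name if ch not in VOWELS)


def drop_prefix(name, separator="-"):
    """Drop everything up to and including the first separator, if any."""
    _, sep, rest = name.partition(separator)
    return rest if sep and rest else name


def add_table_name(table, column, separator="."):
    """Prepend the table name to the column name."""
    return f"{table}{separator}{column}"


for rule in (abbreviate, drop_vowels, drop_prefix):
    print(rule.__name__, rule("emp-salary"))
print("add_table_name", add_table_name("EMPS", "id"))
```

Each rule is deterministic, so applying it to a column of S tells the generator exactly which perturbed name matches which original name.
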
14
The eTuner Architecture
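[Figure: eTuner architecture. The Workload Generator takes Schema S
and the Perturbation Rules and produces a Synthetic Workload
(U, O1, V1), (U, O2, V2), ..., (U, On, Vn). The Staged Tuner takes the
Synthetic Workload, Matching Tool M, and the Tuning Procedures and
outputs a Tuned Matching Tool M. One element of the figure is marked
(Optional).]
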
15
The Staged Tuner
[Figure: execution graph with four levels (Level 1 to Level 4) and an
arrow showing the tuning direction]
  • Tune sequentially, starting with the lowest-level components
  • Assume
  • execution graph has k levels, m nodes per level
  • each node can be assigned one of n components
  • each component has p knobs, each of which has q values
  • tuning examines n*p*q*k*m out of (n*p*q)^(k*m) knob configs
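
To make the savings concrete, the sketch below reads the slide's two counts as n*p*q*k*m (staged) and (n*p*q)^(k*m) (exhaustive), which is an assumption about the garbled formula, and adds a simplified greedy tuner that fixes one node at a time from the lowest level up. The node names, candidate values, and scoring function are hypothetical, and this is not the paper's exact procedure.

```python
def staged_count(n, p, q, k, m):
    """Configs examined when nodes are tuned one at a time."""
    return n * p * q * k * m


def exhaustive_count(n, p, q, k, m):
    """Configs in the full space, under the same reading of the slide."""
    return (n * p * q) ** (k * m)


def staged_tune(levels, score):
    """Greedy bottom-up tuning.

    levels: list of levels, lowest first; each level is a list of
            (node_name, candidate_settings) pairs.
    score:  maps {node_name: setting} to a number (higher is better).
    """
    config = {name: cands[0] for level in levels for name, cands in level}
    evaluations = 0
    for level in levels:
        for name, candidates in level:
            best_candidate, best_score = None, None
            for candidate in candidates:
                trial_score = score({**config, name: candidate})
                evaluations += 1
                if best_score is None or trial_score > best_score:
                    best_candidate, best_score = candidate, trial_score
            config[name] = best_candidate  # freeze this node, move on
    return config, evaluations


def toy_score(config):
    """Hypothetical quality function standing in for workload accuracy."""
    closeness = 1.0 - abs(config["matcher_threshold"] - 0.5)
    bonus = 0.5 if config["combiner"] == "average" else 0.0
    return closeness + bonus


levels = [
    [("matcher_threshold", [0.1, 0.5, 0.9])],
    [("combiner", ["min", "average", "max"])],
]
best, evaluations = staged_tune(levels, toy_score)
print(best, evaluations)
print(staged_count(3, 2, 4, 2, 2), exhaustive_count(3, 2, 4, 2, 2))
```

The greedy pass is cheap because earlier choices are never revisited; the trade-off is that it can miss the best config when knobs at different levels interact strongly.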

16
Empirical Evaluation
[Tables: Domains; Matching systems]
17
Matching Accuracy
[Figure: four bar charts of matching accuracy (y-axis from 0 to 0.9),
one per matching system (LSD, COMA, SF, LSD-SF), each over four
domains (Course, Inventory, Product, Real Estate). Tuning methods
compared: Off-the-shelf, Domain-independent, Domain-dependent,
Source-dependent, eTuner Automatic, eTuner Human-assisted]
eTuner achieves higher accuracy than current best
methods, at virtually no cost to the user
18
Cost of Using eTuner
  • You have a schema S and a matching system M
  • Option 1: a vendor supplies eTuner, which is hooked up with
    matching system M
  • Option 2: a vendor supplies a matching system M with eTuner
    bundled inside

19
Sensitivity Analysis
  • Adding perturbation rules
  • Exploiting prior match results (enriching the
    workload)

[Figure: two plots of accuracy (F1), with a series labeled Tuned LSD.
One plot has x-axis "Schemas in Synthetic Workload" (ticks 1, 10, 20,
25, 40, 50); the other has x-axis "Previous matches in collection"
(ticks 0, 22, 44, 66, 88)]

20
Summary: The eTuner Project @ Illinois
  • Tuning matching systems is crucial
  • long-standing problem, is getting worse
  • a next logical step in schema matching research
  • eTuner provides an automatic, principled solution
  • generates a synthetic workload, employs it to tune efficiently
  • incurs virtually no cost to human users
  • exploits user assistance whenever available
  • Extensive experiments over 4 domains with 4 systems
  • Future directions
  • find the optimal synthetic workload
  • apply to other matching scenarios
  • adapt ideas to scenarios beyond schema matching (see 3rd speaker)

21
Backup: User Assistance
  • S(phone1, phone2, ...)
  • Generate V by dropping phone2: V(phone1, ...)
  • Rename phone1 in V: V(x, ...)
  • Problem
  • x matches phone1, x does not match phone2
  • User
  • groups phone1 and phone2
  • so if x matches phone1, it will also match phone2
  • Intuition: tell the system not to bother trying to distinguish
    phone1 and phone2
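
One way to picture the effect of grouping, as a sketch under the assumption that user-supplied groups are simply sets of interchangeable attributes: when a prediction is scored, hitting any member of the true attribute's group counts as correct. The function name and data layout are hypothetical.

```python
def is_correct(predicted_target, true_target, groups=()):
    """True if the prediction hits the true attribute or one grouped with it."""
    if predicted_target == true_target:
        return True
    return any(predicted_target in g and true_target in g for g in groups)


groups = [{"phone1", "phone2"}]

# x was generated from phone1, but the matcher predicted phone2.
print(is_correct("phone2", "phone1"))          # without user assistance
print(is_correct("phone2", "phone1", groups))  # with phone1/phone2 grouped
```

With the group in place, the tuner no longer penalizes a config for confusing phone1 and phone2.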