Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan

Slides: 22

Learn more at: https://pages.cs.wisc.edu

1
eTuner: Tuning Schema Matching Software using Synthetic Scenarios

Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan
University of Illinois, USA
Arnon Rosenthal
MITRE Corp., USA

2
Main Points

Tuning matching systems
long standing problem
becomes increasingly worse
We propose a principled solution
exploits synthetic input/output pairs
promising, though much work remains
Idea applicable to other contexts

3
Schema Matching
Schema 1
price      agent-name        address
120,000    George Bush       Crawford, TX
239,900    Hillary Clinton   New York City, NY
Schema 2
listed-price   contact-name   city      state
320K           Jane Brown     Seattle   WA
240K           Mike Smith     Miami     FL
[Figure labels between the two schemas: 1-1 match, complex match]
4
Schema Matching is Ubiquitous

Databases
data integration, model management, data translation, collaborative data sharing,
keyword querying, schema/view integration, data warehousing, peer data management
AI
knowledge bases, ontology merging, information gathering agents, ...
Web
e-commerce, Deep Web, Semantic Web
eGovernment, bio-informatics, scientific data management

5
Current State of Affairs

Finding semantic mappings is now a key bottleneck!
largely done by hand, labor intensive & error prone
Numerous matching techniques have been developed
Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ...
AI: Stanford, Karlsruhe University, NEC Japan, ...
Techniques are often synergistic, leading to multi-component matching architectures
each component employs a particular technique
final predictions combine those of the components

6
An Example: LSD [SIGMOD-01]
[Figure: LSD matching pipeline with the components Name Matcher, Naive Bayes Matcher, Combiner, Constraint Enforcer, and Match Selector. Example matcher scores shown in the figure: 0.5, 0.3, 0.1.]
Schema 1: address, agent-name
  Urbana, IL     James Smith
  Seattle, WA    Mike Doan
Schema 2: area, contact-agent
  Peoria, IL     (206) 634 9435
  Kent, WA       (617) 335 4243
Combined predictions
  area => (address, 0.7), (description, 0.3)
  contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
  comments => (address, 0.6), (desc, 0.4)
Constraint: only one attribute of Schema 2 matches address
Selected matches
  area = address, contact-agent = agent-phone, ..., comments = desc
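
The multi-component idea on this slide (several matchers, a combiner, then a selector that respects a one-to-one constraint) can be illustrated with a minimal Python sketch. This is not LSD's implementation: the two toy matchers, the averaging combiner, the greedy selector, and the 0.3 threshold are all assumptions chosen for illustration, and the Naive Bayes matcher is replaced by a simple q-gram name matcher.

```python
def token_name_matcher(a, b):
    """Jaccard overlap of the hyphen- or space-separated tokens of two names."""
    ta = set(a.lower().replace("-", " ").split())
    tb = set(b.lower().replace("-", " ").split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def qgram_name_matcher(a, b, q=3):
    """Jaccard overlap of the character q-grams of two names."""
    def grams(s):
        s = s.lower()
        return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def average_combiner(scores):
    """Combine the matchers' scores for one attribute pair by averaging."""
    return sum(scores) / len(scores)


def match(schema1, schema2, matchers, threshold=0.3):
    """Greedy selector: best-scoring pairs first, each attribute used at most once."""
    scored = sorted(
        ((average_combiner([m(s, t) for m in matchers]), s, t)
         for s in schema1 for t in schema2),
        reverse=True,
    )
    matches, used = {}, set()
    for score, s, t in scored:
        if score >= threshold and s not in matches and t not in used:
            matches[s] = t
            used.add(t)
    return matches
```

Calling `match(["area", "contact-agent"], ["address", "agent-name", "agent-phone"], [token_name_matcher, qgram_name_matcher])` runs both matchers on every attribute pair. The threshold and the choice of combiner are exactly the kind of knobs the following slides are concerned with.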
7
Multi-Component Matching Solutions

Developed in many recent works
e.g., [Doan et al., WebDB-00, SIGMOD-01], [Do & Rahm, VLDB-02], [Embley et al. 02], [Bernstein et al., SIGMOD Record-04], [Madhavan et al. 05]
Now commonly adopted, with industrial-strength systems
e.g., Protoplasm [MSR], COMA [Univ. of Leipzig]

[Figure: execution graphs of LSD, COMA, SF, and LSD-SF, built from Matcher 1 ... Matcher n, Combiner, and Match selector components]

Such systems are very powerful ...
maximize accuracy; highly customizable to individual domains
... but place a serious tuning burden on domain users

8
Tuning Schema Matching Systems

Given a particular matching situation
how to select the right components?
how to adjust the multitude of knobs?

Library of matching components
Matchers: q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naïve Bayes matcher, SVM matcher
Combiners: average combiner, min combiner, max combiner, weighted sum combiner
Constraint enforcers: A* search enforcer, relaxation labeler, ILP
Match selectors: threshold selector, bipartite graph selector
Knobs of decision tree matcher: characteristics of attr., split measure, post-prune?, size of validation set
Execution graph

Untuned versions produce inferior accuracy,
however ...

9
... Tuning is Extremely Difficult

Large number of knobs
e.g., 8-29 in our experiments
Wide variety of techniques
database, machine learning, IR, information
theory, etc.
Complex interaction among components
Not clear how to compare the quality of knob
configs
Matching systems are still tuned manually, by
trial and error
Multiple component systems make tuning even worse

Developing efficient tuning techniques is crucial
to making matching systems attractive
in practice
10
The eTuner Solution

Given schema S & matching system M
tunes M to maximize average accuracy of matching S with future schemas
incurs virtually no cost to user
Key challenge 1: Evaluation
must search for best knob config
how to compute the quality of any knob config C?
if knowing ground-truth matches for a representative workload W = {(S,T1), ..., (S,Tn)}, then can use W to evaluate C
but often have no such W
Key challenge 2: Search
how to efficiently evaluate the huge space of knob configs?
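
To make the evaluation challenge concrete, here is a small Python sketch of scoring a knob config C against a workload with known matches, using average F1 as the accuracy measure. The function names, the `run_matcher(config, s, t)` callback, and the representation of matches as sets of attribute pairs are assumptions made for illustration, not eTuner's interface.

```python
def f1(predicted, truth):
    """F1 of a predicted set of matches against the ground-truth set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def config_quality(run_matcher, config, workload):
    """Average F1 of one knob config over a workload of (S, T, truth) triples."""
    scores = [f1(run_matcher(config, s, t), truth) for s, t, truth in workload]
    return sum(scores) / len(scores)


def best_config(run_matcher, configs, workload):
    """Exhaustive search: only feasible when the list of configs is small."""
    return max(configs, key=lambda c: config_quality(run_matcher, c, workload))
```

The exhaustive `best_config` is shown only to separate the two challenges: `config_quality` needs a workload with ground truth, and `best_config` does not scale to a large knob space.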

11
Key Idea: Generate Synthetic Input/Output Pairs

Need workload W = {(S,T1), (S,T2), ..., (S,Tn)}
To generate W
start with S
perturb S to generate T1
perturb S to generate T2
etc.
Know the perturbation => know matches between S & Ti
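
A rough Python sketch of this idea follows: each perturbation records which original column a new column came from, so the correct matches come for free. The schema is simplified to a flat list of column names, and the specific rules (drop one column, shuffle, abbreviate to four characters) and their probabilities are assumptions for illustration only.

```python
import random


def perturb(columns, rng):
    """Perturb a list of column names; return (new_columns, matches)."""
    kept = list(columns)
    if len(kept) > 1 and rng.random() < 0.5:
        kept.remove(rng.choice(kept))      # rule: drop a column
    rng.shuffle(kept)                      # rule: change column locations
    new_columns, matches = [], []
    for original in kept:
        short = original[:4]               # rule: abbreviate the name
        clash = short in new_columns or (short != original and short in columns)
        renamed = short if rng.random() < 0.5 and not clash else original
        new_columns.append(renamed)
        matches.append((renamed, original))  # ground truth is known by construction
    return new_columns, matches


def generate_workload(schema, n, seed=0):
    """Build W = [(S, T1, matches1), ..., (S, Tn, matchesn)] from S alone."""
    rng = random.Random(seed)
    workload = []
    for _ in range(n):
        perturbed, matches = perturb(schema, rng)
        workload.append((schema, perturbed, matches))
    return workload
```

For example, `generate_workload(["id", "last_name", "salary"], 10)` yields ten perturbed schemas, each paired with the list of (new name, original name) matches that a matching system should recover.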

12
Key Idea: Generate Synthetic Input/Output Pairs
[Figure: generating synthetic schemas V1, ..., Vn from S]
Split S into V and U with disjoint data tuples
Perturb V to obtain V1, ..., Vn
Perturb # of tables
Perturb # of columns in each table
Perturb column and table names
Perturb data tuples in each table
Example: table EMPLOYEES is perturbed into table EMPS
O1: a set of semantic matches between V1 and U
EMPS.emp-last = EMPLOYEES.last
EMPS.id = EMPLOYEES.id
EMPS.wage = EMPLOYEES.salary
13
Examples of Perturbation Rules

Number of tables
merge two tables based on a join path
split a table into two
Structure of table
merge two columns
e.g., neighboring columns, or sharing prefix/suffix (last-name, first-name)
drop a column
swap location of two columns
Names of tables/columns
rules capture common name transformations
abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding table name to column name, etc.
Data values
rules capture common format transformations, e.g., 12/4 => Dec 4
values are changed based on some distributions (e.g., Gaussian)

See paper for details
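
A few of the rules listed above are simple enough to sketch directly in Python. The function names, the hyphen separator, the month/day reading of "12/4", and the 10% relative standard deviation are assumptions made for this illustration; the paper's actual rule set may differ.

```python
import random

VOWELS = set("aeiouAEIOU")
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def abbreviate(name, length=4):
    """Abbreviate a name to its first few characters."""
    return name[:length]


def drop_vowels(name):
    """Drop all vowels from a name."""
    return "".join(ch for ch in name if ch not in VOWELS)


def drop_prefix(name, sep="-"):
    """Drop everything up to the first separator, if there is one."""
    head, found, tail = name.partition(sep)
    return tail if found and tail else name


def add_table_name(table, column, sep="-"):
    """Prepend the table name to a column name."""
    return f"{table}{sep}{column}"


def reformat_date(value):
    """Format transformation: '12/4' becomes 'Dec 4' (assumes month/day order)."""
    month, day = value.split("/")
    return f"{MONTHS[int(month) - 1]} {int(day)}"


def perturb_number(value, rng, relative_sigma=0.1):
    """Replace a numeric value by a Gaussian draw centered on it."""
    return rng.gauss(value, abs(value) * relative_sigma)
```

For instance, `drop_prefix("emp-last")` gives `"last"` and `reformat_date("12/4")` gives `"Dec 4"`. Because each rule is a plain function, a generator can record which rule produced which name and so keep track of the correct matches.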
14
The eTuner Architecture
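[Figure: the Workload Generator takes Schema S and Perturbation Rules and produces the Synthetic Workload (U, O1, V1), (U, O2, V2), ..., (U, On, Vn); the Staged Tuner takes the Synthetic Workload, Tuning Procedures, and Matching Tool M and outputs the Tuned Matching Tool M. One element of the figure is marked (Optional).]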
15
The Staged Tuner
[Figure: execution graph with Level 1 through Level 4 and an arrow labeled "Tuning direction"]

Tune sequentially starting with lowest-level components
Assume
execution graph has k levels, m nodes per level
each node can be assigned one of n components
each component has p knobs, each of which has q values
tuning examines (n·p·q·k·m) out of (n·p·q)^(k·m) knob configs
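
The gap between the two counts on this slide is easy to check numerically, and the sequential idea can be sketched as tuning one knob at a time while holding the others fixed. The Python below is a simplified illustration, not the staged tuner itself: `tune_sequentially`, its dictionary of knobs, and the `quality` callback are assumed names, and the counting functions simply evaluate the slide's two expressions.

```python
def staged_count(n, p, q, k, m):
    """Configs examined when tuning sequentially: n*p*q*k*m."""
    return n * p * q * k * m


def exhaustive_count(n, p, q, k, m):
    """Size of the full search space as given on the slide: (n*p*q)^(k*m)."""
    return (n * p * q) ** (k * m)


def tune_sequentially(knobs, quality):
    """Tune knobs one at a time, in the given order, keeping earlier choices fixed.

    knobs:   dict mapping knob name -> list of candidate values (tuning order)
    quality: function taking a full config dict and returning a score
    """
    config = {name: values[0] for name, values in knobs.items()}
    for name, values in knobs.items():
        config[name] = max(values, key=lambda v: quality({**config, name: v}))
    return config
```

With n = 2, p = 3, q = 4, k = 2, m = 3, `staged_count` gives 144 while `exhaustive_count` gives 191,102,976. The price of the sequential approach is that it can miss configurations where knobs only pay off in combination.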

16
Empirical Evaluation
Domains
Matching systems
17
Matching Accuracy
[Figure: four bar charts of matching accuracy (y-axis from 0 to 0.9), one each for LSD, COMA, SF, and LSD-SF, over the domains Course, Inventory, Product, and Real Estate. Legend: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTuner: Automatic, eTuner: Human-assisted.]
eTuner achieves higher accuracy than current best
methods, at virtually no cost to the user
18
Cost of Using eTuner

You have a schema S and a matching system M
Vendor supplies eTuner
will hook it up with matching system M
Vendor supplies a matching system M
bundles eTuner inside

19
Sensitivity Analysis

Adding perturbation rules
Exploiting prior match results (enriching the workload)

[Figure: two plots of Accuracy (F1) for Tuned LSD. One plots accuracy against Schemas in Synthetic Workload (x-axis ticks 1, 10, 20, 25, 40, 50); the other plots accuracy against Previous matches in collection (x-axis ticks 0, 22, 44, 66, 88).]
20
Summary: The eTuner Project @ Illinois