Title: Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan
1 eTuner: Tuning Schema Matching Software using Synthetic Scenarios
- Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan
- University of Illinois, USA
- Arnon Rosenthal
- MITRE Corp., USA
2 Main Points
- Tuning matching systems
- long-standing problem, becoming increasingly worse
- We propose a principled solution
- exploits synthetic input/output pairs
- promising, though much work remains
- Idea applicable to other contexts
3 Schema Matching
Schema 1
price    agent-name       address
120,000  George Bush      Crawford, TX
239,900  Hillary Clinton  New York City, NY
Schema 2
listed-price  contact-name  city     state
320K          Jane Brown    Seattle  WA
240K          Mike Smith    Miami    FL
[Figure: arrows between the two schemas, labeled "1-1 match" and "complex match"]
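The two match types can be made concrete with a minimal Python sketch. It assumes the figure's arrows pair price with listed-price and agent-name with contact-name (1-1 matches), and address with the concatenation of city and state (a complex match). These pairings, and the names `matches` and `to_schema1`, are illustrative assumptions rather than something the slide states in text.

```python
# Rows from Schema 2 on the slide.
schema2_rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown", "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith", "city": "Miami", "state": "FL"},
]

# Assumed matches: each Schema 1 attribute maps to a function over a Schema 2 row.
matches = {
    "price": lambda r: r["listed-price"],               # 1-1 match
    "agent-name": lambda r: r["contact-name"],          # 1-1 match
    "address": lambda r: f'{r["city"]}, {r["state"]}',  # complex match
}

def to_schema1(row):
    """Translate one Schema 2 row into Schema 1's attributes."""
    return {attr: fn(row) for attr, fn in matches.items()}

translated = [to_schema1(r) for r in schema2_rows]
print(translated[0])
```

A 1-1 match copies a single attribute, while a complex match combines several attributes into one.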
4 Schema Matching is Ubiquitous
- Databases
- data integration
- model management
- data translation
- collaborative data sharing
- keyword querying, schema/view integration
- data warehousing, peer data management
- AI
- knowledge bases, ontology merging, information gathering agents, ...
- Web
- e-commerce, Deep Web, Semantic Web
- eGovernment, bio-informatics, scientific data management
5 Current State of Affairs
- Finding semantic mappings is now a key bottleneck!
- largely done by hand, labor intensive, error prone
- Numerous matching techniques have been developed
- Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ...
- AI: Stanford, Karlsruhe University, NEC Japan, ...
- Techniques are often synergistic, leading to multi-component matching architectures
- each component employs a particular technique
- final predictions combine those of the components
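6 An Example: LSD [SIGMOD-01]
[Figure: LSD matching Schema 1 against Schema 2]
- Schema 1: address, agent-name; sample data: Urbana, IL / James Smith; Seattle, WA / Mike Doan
- Schema 2: area, contact-agent; sample data: Peoria, IL / (206) 634 9435; Kent, WA / (617) 335 4243
- Components: Name Matcher, Naive Bayes Matcher, Combiner, Constraint Enforcer, Match Selector
- Name tokens shown for the Name Matcher: "agent name", "contact agent"
- Example matcher scores shown: 0.5, 0.3, 0.1
- Combined predictions
- area => (address, 0.7), (description, 0.3)
- contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
- comments => (address, 0.6), (desc, 0.4)
- Constraint: only one attribute of Schema 2 matches address
- Selected matches: area = address, contact-agent = agent-phone, ..., comments = desc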
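The last two steps of this example (enforcing the constraint, then selecting matches) can be sketched in Python using the combined predictions from the slide. The greedy strategy and the function name `select_matches` are assumptions chosen for brevity. This is a stand-in for illustration, not LSD's actual constraint handler.

```python
# Combined predictions from the slide: Schema 2 attribute -> [(label, score)].
predictions = {
    "area": [("address", 0.7), ("description", 0.3)],
    "contact-agent": [("agent-phone", 0.7), ("agent-name", 0.3)],
    "comments": [("address", 0.6), ("desc", 0.4)],
}

def select_matches(predictions, unique_labels):
    """Greedy selection: visit candidates by descending score and let each
    label in unique_labels be assigned to at most one attribute."""
    candidates = sorted(
        ((score, attr, label)
         for attr, ranked in predictions.items()
         for label, score in ranked),
        key=lambda c: -c[0],
    )
    chosen, used = {}, set()
    for score, attr, label in candidates:
        if attr in chosen or label in used:
            continue  # attribute already matched, or unique label already taken
        chosen[attr] = label
        if label in unique_labels:
            used.add(label)
    return chosen

# Constraint from the slide: only one attribute of Schema 2 matches address.
result = select_matches(predictions, unique_labels={"address"})
print(result)
```

Because area claims address with the higher score (0.7 versus 0.6), comments falls back to desc, which agrees with the matches shown on the slide.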
7 Multi-Component Matching Solutions
- Developed in many recent works
- e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al.-02; Bernstein et al., SIGMOD Record-04; Madhavan et al.-05
- Now commonly adopted, with industrial-strength systems
- e.g., Protoplasm [MSR], COMA [Univ. of Leipzig]
[Figure: execution graphs of LSD, COMA, SF, and LSD-SF, built from Matcher 1 ... Matcher n, Combiner, and Match selector components]
- Such systems are very powerful ...
- maximize accuracy, highly customizable to individual domains
- ... but place a serious tuning burden on domain users
8 Tuning Schema Matching Systems
- Given a particular matching situation
- how to select the right components?
- how to adjust the multitude of knobs?
[Figure: library of matching components and an execution graph]
- Matchers: q-gram name matcher, decision tree matcher, Naïve Bayes matcher, TF/IDF name matcher, SVM matcher
- Combiners: average combiner, min combiner, max combiner, weighted sum combiner
- Constraint enforcers: A* search enforcer, relaxation labeler, ILP
- Match selectors: threshold selector, bipartite graph selector
- Knobs of decision tree matcher: characteristics of attributes, split measure, post-prune?, size of validation set
- Untuned versions produce inferior accuracy, however ...
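The idea of a knob configuration choosing a component can be sketched in Python using the four combiners named in the component library. The configuration layout (`config`), the weights, and the matcher scores below are hypothetical and only show how a knob setting changes the combined score.

```python
def average_combiner(scores):
    return sum(scores) / len(scores)

def weighted_sum_combiner(scores, weights):
    return sum(s * w for s, w in zip(scores, weights))

# Combiners that need no extra knobs.
COMBINERS = {"average": average_combiner, "min": min, "max": max}

def combine(scores, config):
    """Combine matcher scores according to a (hypothetical) knob configuration."""
    if config["combiner"] == "weighted_sum":
        return weighted_sum_combiner(scores, config["weights"])
    return COMBINERS[config["combiner"]](scores)

# Hypothetical scores from two matchers for one candidate attribute pair.
scores = [0.5, 0.1]

for config in (
    {"combiner": "average"},
    {"combiner": "min"},
    {"combiner": "max"},
    {"combiner": "weighted_sum", "weights": [0.75, 0.25]},
):
    print(config["combiner"], round(combine(scores, config), 3))
```

Even in this toy setting the combined score ranges from 0.1 to 0.5 depending on the knob, which hints at why untuned settings can hurt accuracy.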
9 ... Tuning is Extremely Difficult
- Large number of knobs
- e.g., 8-29 in our experiments
- Wide variety of techniques
- database, machine learning, IR, information theory, etc.
- Complex interaction among components
- Not clear how to compare the quality of knob configs
- Matching systems are still tuned manually, by trial and error
- Multiple component systems make tuning even worse
Developing efficient tuning techniques is crucial to making matching systems attractive in practice
10 The eTuner Solution
- Given schema S and matching system M
- tunes M to maximize average accuracy of matching S with future schemas
- incurs virtually no cost to user
- Key challenge 1: Evaluation
- must search for best knob config
- how to compute the quality of any knob config C?
- if ground-truth matches are known for a representative workload W = {(S,T1), ..., (S,Tn)}, then W can be used to evaluate C
- but often have no such W
- Key challenge 2: Search
- how to efficiently evaluate the huge space of knob configs?
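The evaluation challenge can be sketched in Python: given a workload with known ground-truth matches, score each knob config by its average accuracy and keep the best. F1 is used as the accuracy measure here. The toy matcher, its `prefix_len` knob, and the tiny workload are hypothetical stand-ins for a real matching system M.

```python
def f1(predicted, truth):
    """F1 of a predicted match set against the ground-truth match set."""
    correct = len(predicted & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)

def quality(config, workload, run_matcher):
    """Average F1 of the matcher under `config` over (S, T, truth) triples."""
    scores = [f1(run_matcher(config, s, t), truth) for s, t, truth in workload]
    return sum(scores) / len(scores)

def best_config(configs, workload, run_matcher):
    """Exhaustive search: return the config with the highest average F1."""
    return max(configs, key=lambda c: quality(c, workload, run_matcher))

def toy_matcher(config, s_cols, t_cols):
    """Hypothetical matcher: pair columns whose names share a prefix of length k."""
    k = config["prefix_len"]
    return {(a, b) for a in s_cols for b in t_cols if a[:k] == b[:k]}

workload = [
    (["salary", "state"], ["sal", "st"], {("salary", "sal"), ("state", "st")}),
]
configs = [{"prefix_len": 1}, {"prefix_len": 2}]
print(best_config(configs, workload, toy_matcher))
```

Exhaustive search like this only works for a handful of configs, which is why the search challenge on the slide matters.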
11 Key Idea: Generate Synthetic Input/Output Pairs
- Need workload W = {(S,T1), (S,T2), ..., (S,Tn)}
- To generate W
- start with S
- perturb S to generate T1
- perturb S to generate T2
- etc.
- Know the perturbation => know matches between S and Ti
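A minimal Python sketch of this idea follows. The three perturbations used (drop one column, reorder, abbreviate names to three characters) and the function names are assumptions for illustration. The point is that the generator records the correct matches as a by-product of perturbing S.

```python
import random

def abbreviate(name, k=3):
    """Hypothetical renaming rule: keep the first k characters."""
    return name[:k]

def perturb(schema, rng, rename=abbreviate):
    """Return (T, matches): a perturbed copy of `schema` plus the known matches."""
    dropped = rng.choice(schema)                  # drop one column
    kept = [c for c in schema if c != dropped]
    rng.shuffle(kept)                             # reorder the remaining columns
    renamed = {c: rename(c) for c in kept}        # rename every kept column
    t = [renamed[c] for c in kept]
    matches = {(c, renamed[c]) for c in kept}     # known because we did the renaming
    return t, matches

def make_workload(schema, n, seed=0):
    """Build W = [(S, T1, O1), ..., (S, Tn, On)] from a single schema S."""
    rng = random.Random(seed)
    return [(schema, *perturb(schema, rng)) for _ in range(n)]

S = ["id", "last", "salary", "phone"]
W = make_workload(S, n=3)
for _, t, matches in W:
    print(t, sorted(matches))
```

No human labeling is needed at any point, since each Ti arrives with its own ground truth.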
12 Key Idea: Generate Synthetic Input/Output Pairs
[Figure: generating synthetic schemas V1, ..., Vn from S; the example table EMPLOYEES is perturbed into EMPS]
- Split S into V and U with disjoint data tuples
- Perturb V to obtain V1, ..., Vn
- perturb number of tables
- perturb number of columns in each table
- perturb column and table names
- perturb data tuples in each table
- O1: a set of semantic matches between V1 and U
- EMPS.emp-last = EMPLOYEES.last
- EMPS.id = EMPLOYEES.id
- EMPS.wage = EMPLOYEES.salary
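The first step, splitting S into V and U with disjoint data tuples, can be sketched in Python as below. The even random split and the name `split_tuples` are assumptions, since the slide does not say how the tuples are divided.

```python
import random

def split_tuples(rows, seed=0):
    """Randomly split a table's rows into two disjoint halves V and U."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical EMPLOYEES rows: (id, last, salary).
employees = [(1, "Smith", 50000), (2, "Brown", 62000), (3, "Doan", 58000), (4, "Lee", 71000)]
V, U = split_tuples(employees)
print(V, U)
```

Keeping the tuples disjoint means a matcher cannot succeed merely by spotting identical rows in both schemas.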
13 Examples of Perturbation Rules
- Number of tables
- merge two tables based on a join path
- split a table into two
- Structure of table
- merge two columns
- e.g., neighboring columns, or columns sharing a prefix/suffix (last-name, first-name)
- drop a column
- swap location of two columns
- Names of tables/columns
- rules capture common name transformations
- abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding table name to column name, etc.
- Data values
- rules capture common format transformations, e.g., 12/4 => Dec 4
- values are changed based on some distributions (e.g., Gaussian)
See paper for details
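Several of the name and data-value rules listed above can be sketched as small Python functions. The exact behavior of each rule is an assumption for illustration: vowels are dropped only after the first letter, "12/4" is read as month/day, the separator in `add_table_name` is arbitrary, and the Gaussian noise is scaled to the value. The paper's actual rules may differ.

```python
import random

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def drop_vowels(name):
    """Keep the first character, drop vowels from the rest."""
    return name[0] + "".join(ch for ch in name[1:] if ch.lower() not in "aeiou")

def add_table_name(table, column):
    """Prefix a column name with its table name."""
    return f"{table}_{column}"

def reformat_date(value):
    """'12/4' -> 'Dec 4', assuming a month/day format."""
    month, day = (int(part) for part in value.split("/"))
    return f"{MONTHS[month - 1]} {day}"

def jitter(value, rng, rel_sigma=0.05):
    """Perturb a numeric value with Gaussian noise proportional to the value."""
    return value + rng.gauss(0, rel_sigma * abs(value))

rng = random.Random(0)
print(drop_vowels("salary"))          # slry
print(add_table_name("EMPS", "id"))   # EMPS_id
print(reformat_date("12/4"))          # Dec 4
print(jitter(50000, rng))
```

Each rule is deterministic apart from `jitter`, so the generator can always record which original column or value a perturbed one came from.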
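14 The eTuner Architecture
[Figure: eTuner architecture]
- Workload Generator: takes Schema S and Perturbation Rules, produces the Synthetic Workload (U, O1, V1), (U, O2, V2), ..., (U, On, Vn)
- Staged Tuner: takes the Synthetic Workload, Matching Tool M, and Tuning Procedures, produces Tuned Matching Tool M
- An "(Optional)" label also appears among the inputs in the figure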
15 The Staged Tuner
[Figure: execution graph with Level 1 (bottom) through Level 4 (top), with an arrow labeled "Tuning direction"]
- Tune sequentially starting with lowest-level components
- Assume
- execution graph has k levels, m nodes per level
- each node can be assigned one of n components
- each component has p knobs, each of which has q values
- tuning examines $npqkm$ out of $(npq)^{km}$ knob configs
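The gap between the two counts is easy to see with concrete numbers. The Python sketch below evaluates both expressions from the slide. The values chosen for n, p, q, k, and m are hypothetical.

```python
def staged_configs(n, p, q, k, m):
    """Knob configs examined by staged tuning, per the slide: n*p*q*k*m."""
    return n * p * q * k * m

def exhaustive_configs(n, p, q, k, m):
    """Size of the full space of knob configs, per the slide: (n*p*q)^(k*m)."""
    return (n * p * q) ** (k * m)

# Hypothetical sizes: 3 components per node, 5 knobs each, 4 values per knob,
# on an execution graph with 4 levels and 2 nodes per level.
n, p, q, k, m = 3, 5, 4, 4, 2
print(staged_configs(n, p, q, k, m))      # 480
print(exhaustive_configs(n, p, q, k, m))  # 167961600000000
```

Staged tuning grows linearly in the number of nodes (k times m), whereas the full space grows exponentially in it.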
16 Empirical Evaluation
[Tables: Domains; Matching systems]
17 Matching Accuracy
[Figure: four bar charts (LSD, COMA, SF, LSD-SF) of matching accuracy, y-axis 0 to 0.9, over the domains Course, Inventory, Product, and Real Estate. Series: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTuner Automatic, eTuner Human-assisted]
eTuner achieves higher accuracy than current best methods, at virtually no cost to the user
18 Cost of Using eTuner
- You have a schema S and a matching system M
- Vendor supplies eTuner
- will hook it up with matching system M
- Vendor supplies a matching system M
- bundles eTuner inside
19 Sensitivity Analysis
- Adding perturbation rules
- Exploiting prior match results (enriching the workload)
[Figure: two plots of Accuracy (F1) for Tuned LSD. One plots accuracy against "Schemas in Synthetic Workload" (x-axis ticks 1, 10, 20, 25, 40, 50); the other against "Previous matches in collection" (x-axis ticks 0, 22, 44, 66, 88)]
20 Summary: The eTuner Project @ Illinois
- Tuning matching systems is crucial
- long-standing problem that is getting worse
- a next logical step in schema matching research
- eTuner provides an automatic, principled solution
- generates a synthetic workload, employs it to tune efficiently
- incurs virtually no cost to human users
- exploits user assistance whenever available
- Extensive experiments over 4 domains with 4 systems
- Future directions
- find optimal synthetic workload
- apply to other matching scenarios
- adapt ideas to scenarios beyond schema matching (see 3rd speaker)
21 Backup: User Assistance
- S(phone1, phone2, ...)
- Generate V by dropping phone2: V(phone1, ...)
- Rename phone1 in V: V(x, ...)
- Problem
- x matches phone1, x does not match phone2
- User
- group phone1 and phone2
- so if x matches phone1, it will also match phone2
- Intuition: tell the system not to bother trying to distinguish phone1 and phone2
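One way to read the grouping idea is as an expansion of the ground truth: any match to one member of a user-defined group also counts as a match to the other members. The Python sketch below illustrates that reading. The function name `expand_with_groups` and the data layout are assumptions, not eTuner's actual interface.

```python
def expand_with_groups(truth, groups):
    """For every true match (x, a), also accept (x, b) for each b grouped with a."""
    expanded = set(truth)
    for x, a in truth:
        for group in groups:
            if a in group:
                expanded |= {(x, b) for b in group}
    return expanded

# Synthetic ground truth: x was produced by renaming phone1.
truth = {("x", "phone1")}
# User assistance: phone1 and phone2 need not be distinguished.
groups = [{"phone1", "phone2"}]

expanded = expand_with_groups(truth, groups)
print(sorted(expanded))
```

With the expanded truth, a matcher that predicts x = phone2 is no longer penalized during tuning.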