Title: Robert McCann
1 Mapping Maintenance for Data Integration Systems
- Robert McCann
- University of Illinois
- Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, AnHai Doan
- VLDB 2005
2 Data Integration Systems
[Figure: the query "Find homes under 300K" is posed against the mediated schema, which is mapped to source schemas 1, 2, and 3; wrappers sit over the sources, e.g., windermere.com and yahoo.com]
3 Mapping Maintenance is a Key Bottleneck
- Constructing mappings has proven difficult
- (see first speaker)
- but maintenance often quickly dominates cost
- E.g., Integrated Genome Database Project (Stein, 03)
- 12 genomic databases, each remodeled data twice per year
- System broke every two weeks, abandoned after 1 year
- E.g., Integration Project at Illinois
- Integrated 400 DB researcher homepages
- 2 system administrators, stopped after 3 months
Reducing maintenance costs is now crucial!
4 Problem Definition
[Figure: mediated schema, with the mapping to the source marked "?"]
price location beds baths
180,000 61801 2 2
260,000 98195 3 2
5 Example 1: Change Source Schema or Data
[Figure: wrapper over homeseekers.com]
6 Example 2: Change Presentation Format
- Display location as zipcode
[Figure: wrapper over homeseekers.com]
7 The MAVERIC Approach
- Suppose administrator wants to maintain mappings for 1 year
- 1. For a short initial period (e.g., 5 weeks)
- Administrator manually verifies each mapping
- MAVERIC probes the source to learn data characteristics
- 2. For remaining time (e.g., 47 weeks)
- MAVERIC probes the source to observe new data instances
- MAVERIC outputs an alarm if characteristics differ
- If an alarm, administrator repairs mappings
8 Example
Learned data characteristics
If beds < baths, output alarm
If average price < 100,000, output alarm
If layout of attributes changes, output alarm
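The learned characteristics above can be read as simple checks over the tuples a wrapper returns. A minimal Python sketch, assuming tuples arrive as dictionaries with numeric `price`, `beds`, and `baths` fields. The function name `check_characteristics` and the dictionary layout are illustrative assumptions, not MAVERIC's interface, and the layout check is left out:

```python
from statistics import mean

def check_characteristics(rows, min_avg_price=100_000):
    """Return a list of alarm messages for one batch of wrapper output.

    rows: list of dicts with numeric 'price', 'beds', 'baths' (assumed layout).
    """
    alarms = []
    # Rule from the slide: beds < baths is suspicious for this source.
    for i, row in enumerate(rows):
        if row["beds"] < row["baths"]:
            alarms.append(f"row {i}: beds < baths")
    # Rule from the slide: an average price below 100,000 is suspicious.
    if rows and mean(r["price"] for r in rows) < min_avg_price:
        alarms.append("average price < 100,000")
    return alarms

good = [
    {"price": 180_000, "beds": 2, "baths": 2},
    {"price": 260_000, "beds": 3, "baths": 2},
]
# Hypothetical broken mapping: zip codes land in the price column.
broken = [
    {"price": 61801, "beds": 2, "baths": 2},
    {"price": 98195, "beds": 3, "baths": 2},
]
print(check_characteristics(good))    # []
print(check_characteristics(broken))  # ['average price < 100,000']
```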
9 Contributions
- Develop core MAVERIC system
- An ensemble of sensors that exploit multiple characteristics of data
- A combiner that leverages the most effective sensors
- Significantly improve core system
- Generate synthetic data to improve training
- Leverage external data to improve training
- Employ filters to reduce false alarms
- Extensive evaluation over 114 sources in 6 domains
- Core MAVERIC outperforms related work, improving F-1 by 4-19%
- Enhancements further improve F-1 by 2-13%
10 Training the Core MAVERIC System
- Sensors learn internal profiles of data characteristics
- Combiner learns weight for each sensor
- Employ Winnow to learn weights
layout of attributes in HTML pages
price location beds / baths
avg value of price
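To make the weight-learning step concrete, here is a minimal sketch of the Winnow algorithm (the Winnow2 variant with multiplicative promotion and demotion). It assumes each sensor casts a binary vote per training example. How MAVERIC turns sensor scores into Winnow inputs is not stated on the slide, so the vote encoding, the function names, and the toy data are all assumptions:

```python
def winnow_train(examples, n_sensors, alpha=2.0, epochs=10):
    """Learn one weight per sensor with Winnow2.

    examples: list of (votes, label). votes is a list of 0/1 sensor votes
    (1 = this sensor thinks the mapping is broken); label is 1 if the
    mapping really was broken, else 0. Binary votes are an assumption.
    """
    weights = [1.0] * n_sensors
    threshold = n_sensors / 2  # a common choice for Winnow
    for _ in range(epochs):
        for votes, label in examples:
            total = sum(w * v for w, v in zip(weights, votes))
            predicted = 1 if total >= threshold else 0
            if predicted == label:
                continue
            # Promote voting sensors on a missed alarm, demote on a false alarm.
            factor = alpha if label == 1 else 1.0 / alpha
            weights = [w * factor if v else w for w, v in zip(weights, votes)]
    return weights, threshold

def winnow_predict(weights, threshold, votes):
    """Return 1 (alarm) if the weighted vote reaches the threshold."""
    return 1 if sum(w * v for w, v in zip(weights, votes)) >= threshold else 0

# Toy data: sensor 0 always agrees with the label, sensors 1 and 2 are noisy.
examples = [
    ([1, 0, 0], 1), ([0, 1, 1], 0), ([1, 1, 0], 1),
    ([0, 0, 1], 0), ([0, 1, 0], 0), ([1, 0, 1], 1),
]
weights, threshold = winnow_train(examples, n_sensors=3)
print(weights)  # [2.0, 0.5, 0.5]
```

The reliable sensor ends up with the largest weight, which is the sense in which a combiner can lean on the most effective sensors.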
11 Verifying with the Core MAVERIC System
- Sensors leverage internal profiles to output sensor scores
- Combiner combines scores based upon weights
alarm if combined score ?
score1
new avg price
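A sketch of how a sensor profile and a weighted combiner might fit together at verification time. The scoring formula, the weighted-average combiner, the 0.5 alarm threshold, and all names here are illustrative assumptions, not details taken from the talk:

```python
from statistics import mean, pstdev

class AvgPriceSensor:
    """Profiles the average price seen during training (illustrative only)."""

    def fit(self, snapshots):
        avgs = [mean(s) for s in snapshots]
        self.mu = mean(avgs)
        self.sigma = pstdev(avgs) or 1.0  # avoid division by zero
        return self

    def score(self, snapshot):
        # Score in (0, 1]: 1.0 when the new average matches the profile,
        # falling toward 0 as it drifts away (in standard deviations).
        z = abs(mean(snapshot) - self.mu) / self.sigma
        return 1.0 / (1.0 + z)

def combine(scores, weights):
    """Weighted average of sensor scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def should_alarm(scores, weights, threshold=0.5):
    # A low combined score means the new data looks unlike the training data.
    return combine(scores, weights) < threshold

sensor = AvgPriceSensor().fit([
    [180_000, 260_000], [200_000, 260_000], [170_000, 250_000],
])
ok_snapshot = [185_000, 255_000]
broken_snapshot = [61801, 98195]  # hypothetical: zip codes in the price column

# The second score (1.0) stands in for a hypothetical layout sensor.
weights = [2.0, 0.5]
print(should_alarm([sensor.score(ok_snapshot), 1.0], weights))      # False
print(should_alarm([sensor.score(broken_snapshot), 1.0], weights))  # True
```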
12 Improving Training via Perturbation
- Idea: expand training data by generating synthetic data
- Simulate natural source changes during training
- Source data changes, e.g., insert and delete tuples
- Presentation format changes, e.g., 29.99 becomes 29.99 USD
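[Figure: the wrapper over source S at time tn yields the original query results; the perturber applies a change, reapplies the wrapper, and tests the results, yielding perturbed results; original and perturbed results together form the training data for S]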
System practices ahead of time
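The two kinds of simulated change can be sketched as small functions over extracted tuples. This is a simplification under stated assumptions: it perturbs tuples directly rather than the source itself, "insert" is approximated by duplicating an existing tuple, and the function names and sample rows are hypothetical:

```python
import random

def perturb_data(rows, rng, n_delete=1, n_insert=1):
    """Simulate a source data change: delete some tuples, insert duplicates."""
    if not rows:
        return []
    rows = list(rows)
    # Never delete the last remaining tuple.
    for _ in range(min(n_delete, len(rows) - 1)):
        rows.pop(rng.randrange(len(rows)))
    for _ in range(n_insert):
        rows.insert(rng.randrange(len(rows) + 1), dict(rng.choice(rows)))
    return rows

def perturb_format(rows, attribute, suffix=" USD"):
    """Simulate a presentation change, e.g. 29.99 becomes 29.99 USD."""
    return [{**r, attribute: f"{r[attribute]}{suffix}"} for r in rows]

rows = [
    {"title": "A", "price": "29.99"},
    {"title": "B", "price": "14.50"},
    {"title": "C", "price": "8.00"},
]
rng = random.Random(0)  # fixed seed so runs are repeatable
training_data = [rows, perturb_data(rows, rng), perturb_format(rows, "price")]
print(training_data[2][0]["price"])  # 29.99 USD
```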
13 Example: Reformatting Price
original HTML (homeseekers.com): 185,000 Urbana, IL 3bed/2bath
perturbed HTML: 185,000 USD Urbana, IL 3bed/2bath
perturbed results (wrapper reapplied to the perturbed HTML):
price location beds baths
185,000 USD Urbana, IL 3 2
training data: original training example (from the original results) plus perturbed training example (?)
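The price-reformatting example can be traced end to end with a toy wrapper. The regular expression below is a hypothetical stand-in for a real wrapper, written only to handle listing lines shaped like the one on the slide, and `perturb_price` is likewise an assumed helper:

```python
import re

# Hypothetical wrapper for lines like "185,000 Urbana, IL 3bed/2bath".
LISTING = re.compile(
    r"^(?P<price>.+?) (?P<location>[A-Z][a-z]+, [A-Z]{2}) "
    r"(?P<beds>\d+)bed/(?P<baths>\d+)bath$"
)

def wrap(line):
    """Extract price, location, beds, baths from one listing line."""
    m = LISTING.match(line)
    return m.groupdict() if m else None

def perturb_price(line, suffix=" USD"):
    """Presentation change: append a currency suffix to the leading price."""
    return re.sub(r"^(\S+)", lambda m: m.group(1) + suffix, line, count=1)

original_html = "185,000 Urbana, IL 3bed/2bath"
perturbed_html = perturb_price(original_html)

print(perturbed_html)        # 185,000 USD Urbana, IL 3bed/2bath
print(wrap(original_html))   # price is '185,000'
print(wrap(perturbed_html))  # price is '185,000 USD'
```

Reapplying the wrapper to the perturbed line yields the tuple shown on the slide, which can then be added to the training data alongside the original.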
14 Additional Improvements
- Improve training by borrowing data from other sources
mediated schema
source schema
source schema
comments amount
category price
wrapper
wrapper
This 185,000 USD
house 185,000
S
S
- Reduce false alarms via filtering
- Web Search Engines
- price is 185,000 USD
- costs 185,000 USD
Other Sources
- Monetary Recognizers
- 185,000
- 185000.00
potentially corrupt attribute
price
price is valid
185,000 USD
amount
210 K
(see paper for details)
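One way to picture the filtering step is a recognizer that re-checks the values of a potentially corrupt attribute before an alarm is reported. The regular expression, the 0.9 cutoff, and the function names below are assumptions for illustration; the paper should be consulted for the filters actually used:

```python
import re

# Hypothetical recognizer for amounts such as "185,000", "185000.00",
# or "185,000 USD".
MONEY = re.compile(r"\$?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?(?: USD)?")

def is_monetary(value):
    """True if the whole value looks like a monetary amount."""
    return MONEY.fullmatch(value.strip()) is not None

def keep_alarm(values, recognizer, min_fraction=0.9):
    """Return True to keep the alarm, False to filter it out.

    The alarm is dropped when at least min_fraction of the suspicious
    attribute's values are still recognized as valid.
    """
    if not values:
        return True
    recognized = sum(1 for v in values if recognizer(v))
    return recognized / len(values) < min_fraction

print(keep_alarm(["185,000 USD", "199,500 USD"], is_monetary))  # False
print(keep_alarm(["Urbana, IL", "Seattle, WA"], is_monetary))   # True
```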
15 Empirical Evaluation
- Test verification ability over 114 sources in 6 domains
Domain | Number of Sources | Schema Size (Number of Attributes) | Probing Schedule | Snapshots with Correct Mappings | Snapshots with Broken Mappings
Flights | 19 | 8 | weekly for 10 weeks | 164 | 26
Books | 21 | 6 | weekly for 12 weeks | 210 | 42
Researchers | 60 | 4 | daily for 313 days | 12480 | 6274
Real Estate | 5 | 17 | 11 snapshots per source | 30 | 25
Inventory | 4 | 7 | 11 snapshots per source | 24 | 20
Courses | 5 | 11 | 11 snapshots per source | 30 | 25
16 Core MAVERIC Outperforms Prior Work
- Compare with recent system
- Lerman et al, Journal of AI Research 03
Domain | Lerman System P / R | Lerman System F-1 | Sensor Ensemble P / R | Sensor Ensemble F-1
Flights | 0.81 / 1.00 | 0.85 | 0.93 / 0.98 | 0.93
Books | 0.83 / 1.00 | 0.89 | 0.90 / 0.99 | 0.93
Researchers | 0.77 / 0.99 | 0.84 | 0.90 / 0.99 | 0.93
Real Estate | 0.45 / 0.90 | 0.63 | 0.80 / 0.82 | 0.82
Inventory | 0.52 / 0.89 | 0.67 | 0.75 / 0.90 | 0.77
Courses | 0.49 / 0.94 | 0.66 | 0.92 / 0.88 | 0.88
Achieve F-1 from 82-93%, an improvement of 4-19% in all domains
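For readers less familiar with the metrics, the sketch below computes precision (P), recall (R), and F-1 for a sequence of alarm decisions, assuming the standard definitions with "broken mapping" as the positive class. The counts are made up for illustration, and how per-source scores are aggregated into the per-domain figures above is not modeled:

```python
def precision_recall_f1(alarms, broken):
    """alarms, broken: parallel lists of booleans, one entry per snapshot."""
    tp = sum(a and b for a, b in zip(alarms, broken))      # correct alarms
    fp = sum(a and not b for a, b in zip(alarms, broken))  # false alarms
    fn = sum(b and not a for a, b in zip(alarms, broken))  # missed alarms
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical run over six snapshots.
alarms = [True, True, True, False, False, True]
broken = [True, True, False, False, True, True]
print(precision_recall_f1(alarms, broken))  # (0.75, 0.75, 0.75)
```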
17 Enhancements Boost Performance
- Progressively enhanced versions of MAVERIC
Each enhancement improved F-1 in at least 4 domains
18 Reasons for Mistakes
- Unrecognized instance formats
- E.g., trained over TIME with format 2:00 pm, source changed format to 14:00, output false alarm
- E.g., trained over DAYS with format M-W-F, source changed format to Mon Wed Fri, output false alarm
- Train with additional perturbations? Leverage more sources?
- Attributes with similar values
- E.g., trained with ORDER-DATE before SHIP-DATE, source reversed order, missed alarm on reversed values
- (ORDER-DATE 7/13/2004, SHIP-DATE 7/4/2004)
- Include additional domain constraints?
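As an illustration of the kind of domain constraint the last bullet asks about, the check below flags a tuple whose ship date precedes its order date. It is a sketch of one possible constraint, not something the talk reports implementing; the function name and the M/D/YYYY date format are assumptions based on the example values:

```python
from datetime import datetime

def violates_date_order(order_date, ship_date, fmt="%m/%d/%Y"):
    """True if an order is recorded as shipping before it was placed."""
    return datetime.strptime(ship_date, fmt) < datetime.strptime(order_date, fmt)

# Values from the slide: the two attributes appear to have been swapped.
print(violates_date_order("7/13/2004", "7/4/2004"))  # True
print(violates_date_order("7/4/2004", "7/13/2004"))  # False
```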
19 Related Work
- Schema matching
- Dhamankar et al, 04; He & Chang, 03; Kang & Naughton, 03; Rahm & Bernstein, 01; Doan, 01
- Quantify semantics to compute matching scores
- Activity monitoring
- Shavlik & Shavlik, 04; Lazarevic et al, 03; Stolfo et al, 01; Fawcett & Provost, 99; Allan et al, 98
- Profile normal behavior to detect notable events (e.g., intrusions)
- Mapping and wrapper maintenance
- Wrapper verification: Lerman et al, 03; Kushmerick, 00
- Mapping and wrapper repair: Velegrakis et al, 03; Meng et al, 03; Chidlovskii, 01
20 Conclusion & Future Work
- Developed MAVERIC to reduce maintenance costs
- An ensemble of sensors that exploit multiple characteristics of data
- Significantly improved core system
- Perturbation, multi-source training, and filtering
- Extensively evaluated over 114 sources in 6 domains
- Core outperformed related work, improving F-1 by 4-19%
- Enhancements further improved F-1 by 2-13%
- Future work
- Further improve and evaluate MAVERIC
- Develop a solution for repairing broken mappings