Title: Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach
1. Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach
- AnHai Doan
- Pedro Domingos
- Alon Halevy
2. Data Integration
3. Problem and Solution
- Problem
- Large-scale Data Integration Systems
- Bottleneck: Semantic Mappings
- 1-1 Mappings
- Solution
- Multi-strategy Learning
- Integrity Constraints
- XML Structure Learner
4. Learning Source Descriptions (LSD)
- Components
- Base learners
- Meta-learner
- Prediction converter
- Constraint handler
- Operations
- Training phase
- Matching phase
5. Learners
- Basic Learners
- Name Matcher (WHIRL)
- Content Matcher (WHIRL)
- Naïve Bayes Learner
- County-Name Recognizer
- XML Learner
- Meta-Learner (Stacking)
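
The stacking idea behind the meta-learner can be illustrated with a small sketch. It assumes that each base learner returns a confidence score per label and that the meta-learner holds one weight per (learner, label) pair; the learner names, labels, and weight values below are hypothetical and are not taken from the slides. In a real system the weights would be learned from training data rather than fixed by hand.

```python
from collections import defaultdict


def combine_predictions(base_predictions, weights):
    """Combine per-learner confidence scores into one score per label.

    base_predictions: {learner_name: {label: confidence}}
    weights: {(learner_name, label): weight}; missing pairs count as 0.0
    """
    combined = defaultdict(float)
    for learner, scores in base_predictions.items():
        for label, confidence in scores.items():
            combined[label] += weights.get((learner, label), 0.0) * confidence
    return dict(combined)


def best_label(combined):
    """Return the label with the highest combined score."""
    return max(combined, key=combined.get)


# Hypothetical scores for one source tag from two base learners.
base_predictions = {
    "name_matcher": {"ADDRESS": 0.8, "DESCRIPTION": 0.2},
    "naive_bayes": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
}
# Hypothetical weights: how much each learner is trusted for each label.
weights = {
    ("name_matcher", "ADDRESS"): 0.7,
    ("name_matcher", "DESCRIPTION"): 0.3,
    ("naive_bayes", "ADDRESS"): 0.3,
    ("naive_bayes", "DESCRIPTION"): 0.7,
}

combined = combine_predictions(base_predictions, weights)
print(best_label(combined))
```

Keeping a separate weight per label lets the combination trust a learner on the labels it predicts well and discount it elsewhere.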
6. XML Learner
7. XML Learner (Cont.)
8. Constraint Handler
9. Constraint Handler (Cont.)
- Search Heuristic
- Mapping Cost
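
As a rough illustration of how a mapping cost and a search over candidate mappings fit together, the sketch below scores each complete mapping and picks the cheapest one. The cost function (negative log confidence plus a fixed penalty per violated constraint), the exhaustive search, the example constraint, and all tag and label names are assumptions made for this example; the slides do not specify the actual cost formula or search heuristic.

```python
import math
from itertools import product


def mapping_cost(mapping, confidences, constraints, penalty=10.0):
    """Assumed cost: low confidence and violated constraints both add cost."""
    cost = 0.0
    for tag, label in mapping.items():
        confidence = confidences[tag].get(label, 0.0)
        cost += -math.log(max(confidence, 1e-9))  # guard against log(0)
    cost += penalty * sum(1 for is_violated in constraints if is_violated(mapping))
    return cost


def best_mapping(confidences, labels, constraints):
    """Exhaustively try every tag-to-label assignment (fine for tiny inputs only)."""
    tags = list(confidences)
    candidates = (
        dict(zip(tags, combo)) for combo in product(labels, repeat=len(tags))
    )
    return min(candidates, key=lambda m: mapping_cost(m, confidences, constraints))


def more_than_one_address(mapping):
    """Hypothetical integrity constraint: violated if two tags map to ADDRESS."""
    return list(mapping.values()).count("ADDRESS") > 1


# Hypothetical combined confidence scores for two source tags.
confidences = {
    "location": {"ADDRESS": 0.7, "DESCRIPTION": 0.3},
    "comments": {"ADDRESS": 0.6, "DESCRIPTION": 0.4},
}

result = best_mapping(confidences, ["ADDRESS", "DESCRIPTION"], [more_than_one_address])
print(result)
```

Without the constraint, both tags would be mapped to ADDRESS because that label has the highest score for each; the penalty pushes the second tag to its next-best label. Exhaustive enumeration grows exponentially with the number of tags, which is why a search heuristic is needed in practice.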
10. Training Phase
11. Example 1 (Training Phase)
12. Example 1 (Cont.)
13. Example 1 (Cont.)
(location, ADDRESS)
("Miami, FL", ADDRESS)
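
The two pairs above suggest how training examples are derived from a manually mapped source: one kind pairs a tag name with its label, the other pairs a data instance with its label. A minimal sketch of that extraction follows; the function name and the dictionary-based input format are hypothetical, and only the `location` / `Miami, FL` / `ADDRESS` values come from the slide.

```python
def build_training_examples(source_mapping, source_data):
    """Create (tag name, label) and (data instance, label) training pairs.

    source_mapping: {source_tag: mediated_schema_label}, supplied by the user
    source_data: {source_tag: [data instances extracted from the source]}
    """
    # Examples for learners that look at tag names.
    name_examples = [(tag, label) for tag, label in source_mapping.items()]
    # Examples for learners that look at data content.
    content_examples = [
        (value, source_mapping[tag])
        for tag, values in source_data.items()
        if tag in source_mapping
        for value in values
    ]
    return name_examples, content_examples


source_mapping = {"location": "ADDRESS"}
source_data = {"location": ["Miami, FL"]}

name_examples, content_examples = build_training_examples(source_mapping, source_data)
print(name_examples)
print(content_examples)
```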
14. Matching Phase
15. Example 2 (Matching Phase)
16. Example 2 (Cont.)
17. Example 2 (Cont.)
18. Empirical Evaluation
19. Measures
- Matching accuracy of a source
- Average matching accuracy of a source
- Average matching accuracy of a domain
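
The three measures can be sketched as follows, assuming that the matching accuracy of a source is the fraction of its matchable tags that receive the correct label, that a source's average is taken over repeated experiment runs, and that a domain's average is taken over its sources. The slides list only the measure names, so these definitions, the function names, and the sample tags and labels are assumptions for illustration.

```python
def matching_accuracy(predicted, correct):
    """Fraction of matchable tags whose predicted label is correct.

    correct maps each source tag to its true label, or None if unmatchable.
    """
    matchable = [tag for tag, label in correct.items() if label is not None]
    if not matchable:
        return 0.0
    hits = sum(1 for tag in matchable if predicted.get(tag) == correct[tag])
    return hits / len(matchable)


def average(values):
    """Plain arithmetic mean; returns 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical ground truth and prediction for one source in one run.
correct = {
    "location": "ADDRESS",
    "comments": "DESCRIPTION",
    "phone": "AGENT-PHONE",
    "ad-id": None,  # no counterpart in the mediated schema, so not matchable
}
predicted = {"location": "ADDRESS", "comments": "ADDRESS", "phone": "AGENT-PHONE"}

run_accuracy = matching_accuracy(predicted, correct)

# Average matching accuracy of a source: mean over its runs (second run is hypothetical).
source_average = average([run_accuracy, 1.0])

# Average matching accuracy of a domain: mean over the per-source averages.
domain_average = average([source_average, 0.5])
print(run_accuracy, source_average, domain_average)
```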
20. Experiment Results
21. Experiment Results (Cont.)
Contributions of base learners and the constraint handler
22. Experiment Results (Cont.)
Contributions of schema information and data instances
23. Experiment Results (Cont.)
Performance sensitivity to the amount of data instances
24. Limitations
- Enough Training Data
- Domain Dependent Learners
- Ambiguities in Sources
- Efficiency
- Overlapping of Schemas
25. Conclusion and Future Work
- Improve over time
- Extensible framework
- Multiple types of knowledge
- Non-1-1 mappings?