Automating Schema Matching for Data Integration - PowerPoint PPT Presentation

About This Presentation
Title:

Automating Schema Matching for Data Integration

Description:

Car Year Make Model Mileage Price PhoneNr. 0001 1989 Subaru SW $1900 (363)835-8597 ... Each row is a car. ... table rows (e.g. for each Car). UFMG, June 2002 ... – PowerPoint PPT presentation

Number of Views:58
Avg rating:3.0/5.0
Slides: 67
Provided by: davidw8
Learn more at: https://www.deg.byu.edu
Category:

less

Transcript and Presenter's Notes

Title: Automating Schema Matching for Data Integration


1
Automating Schema Matchingfor Data Integration
  • David W. Embley
  • Brigham Young University

Funded by NSF
2
Information Exchange
Source
Target
Information Extraction
Schema Matching
3
Presentation Outline
  • Information Extraction
  • Schema Matching for HTML Table
  • Direct Schema Matching
  • Indirect Schema Matching
  • Conclusions and Future Work

4
Information Extraction
5
Extracting Pertinent Information from Documents
6
A Conceptual Modeling Solution
7
Car-Ads Ontology
  • Car -gtobject
  • Car 0..1 has Year 1..
  • Car 0..1 has Make 1..
  • Car 0...1 has Model 1..
  • Car 0..1 has Mileage 1..
  • Car 0.. has Feature 1..
  • Car 0..1 has Price 1..
  • PhoneNr 1.. is for Car 0..
  • PhoneNr 0..1 has Extension 1..
  • Year matches 4
  • constant extract \d2
  • context "(\\d)4-9\d,
    \d"
  • substitute "" -gt "19" ,
  • End

8
Recognition and Extraction
9
Schema Matching forHTML Tables
10
Table-Schema Matching(Basic Idea)
  • Many tables on the Web
  • Ontology-Based Extraction
  • Works well for unstructured or semistructured
    data
  • What about structured data tables?
  • Method
  • Form attribute-value pairs
  • Do extraction
  • Infer mappings from extraction patterns

11
Problem Different Schemas
  • Target Database Schema
  • Car, Year, Make, Model, Mileage, Price,
    PhoneNr, PhoneNr, Extension, Car, Feature
  • Different Source Table Schemas
  • Run , Yr, Make, Model, Tran, Color, Dr
  • Make, Model, Year, Colour, Price, Auto, Air
    Cond., AM/FM, CD
  • Vehicle, Distance, Price, Mileage
  • Year, Make, Model, Trim, Invoice/Retail, Engine,
    Fuel Economy

12
Problem Attribute is Value
13
Problem Attribute-Value is Value
14
Problem Value is not Value
15
Problem Implied Values
16
Problem Missing Attributes
17
Problem Compound Attributes
18
Problem Merged Values
19
Problem Values not of Interest
20
Problem Factored Values
21
Problem Split Values
22
Problem Information Behind Links
23
Solution
  • Form attribute-value pairs (adjust if necessary)
  • Do extraction
  • Infer mappings from extraction patterns

24
Solution Remove Internal Factoring
Discover Nesting Make, (Model, (Year, Colour,
Price, Auto, Air Cond, AM/FM, CD))
25
Solution Replace Boolean Values
ACURA
ACURA
Legend
26
Solution Form Attribute-Value Pairs
ACURA
ACURA
Legend
ltMake, Hondagt, ltModel, Civic EXgt, ltYear, 1995gt,
ltColour, Whitegt, ltPrice, 6300gt, ltAuto,
Autogt, ltAir Cond., Air Cond.gt, ltAM/FM, AM/FMgt
27
Solution Adjust Attribute-Value Pairs
ACURA
ACURA
Legend
ltMake, Hondagt, ltModel, Civic EXgt, ltYear, 1995gt,
ltColour, Whitegt, ltPrice, 6300gt, ltAutogt,
ltAir Condgt, ltAM/FMgt
28
Solution Do Extraction
ACURA
ACURA
Legend
ltMake, Hondagt, ltModel, Civic EXgt, ltYear, 1995gt,
ltColour, Whitegt, ltPrice, 6300gt, ltAutogt,
ltAir Condgt, ltAM/FMgt
29
Solution Infer Mappings
ACURA
ACURA
Legend
Car, Year, Make, Model, Mileage, Price,
PhoneNr, PhoneNr, Extension, Car, Feature
30
Solution Do Extraction
ACURA
ACURA
Legend
Car, Year, Make, Model, Mileage, Price,
PhoneNr, PhoneNr, Extension, Car, Feature
31
Solution Do Extraction
ACURA
ACURA
Legend
?PriceTable
Car, Year, Make, Model, Mileage, Price,
PhoneNr, PhoneNr, Extension, Car, Feature
32
Solution Do Extraction
ACURA
ACURA
Legend
?Colour?Feature?ColourTable ? ?Auto?Feature?Auto?A
utoTable ? ?Air Cond.?Feature?Air Cond. ?Air
Cond.Table ? ?AM/FM?Feature?AM/FM?AM/FMTable ?
?CD?Feature?CD?CDTable
Yes,
Yes,
Yes,
Yes,
Car, Year, Make, Model, Mileage, Price,
PhoneNr, PhoneNr, Extension, Car, Feature
33
Experiment
  • Tables from 60 sites
  • 10 training tables
  • 50 test tables
  • 357 mappings (from all 60 sites)
  • 172 direct mappings (same attribute and meaning)
  • 185 indirect mappings (29 attribute synonyms, 5
    Yes/No columns, 68 unions over columns for
    Feature, 19 factored values, and 89 columns of
    merged values that needed to be split)

34
Results
  • 10 training tables
  • 100 of the 57 mappings (no false mappings)
  • 94.6 of the values in linked pages (5.4 false
    declarations)
  • 50 test tables
  • 94.7 of the 300 mappings (no false mappings)
  • On the bases of sampling 3,000 values in linked
    pages, we obtained 97 recall and 86 precision
  • 16 missed mappings
  • 4 partial (not all unions included)
  • 6 non-U.S. car-ads (unrecognized makes and
    models)
  • 2 U.S. unrecognized makes and models
  • 3 prices (missing or found MSRP instead)
  • 1 mileage (mileages less than 1,000)

35
Direct Schema Matching
36
Attribute Matchingfor Populated Schemas
  • Central Idea Exploit All Data Metadata
  • Matching Possibilities (Facets)
  • Attribute Names
  • Data-Value Characteristics
  • Expected Data Values
  • Data-Dictionary Information
  • Structural Properties

37
Approach
  • Target Schema T
  • Source Schema S
  • Framework
  • Individual Facet Matching
  • Combining Facets
  • Best-First Match Iteration

38
Example
Car
Car
Style
has
01
0
01
01
has
has
has
Cost
Mileage
Miles
Source Schema S
Target Schema T
39
Individual Facet Matching
  • Attribute Names
  • Data-Value Characteristics
  • Expected Data Values

40
Attribute Names
  • Target and Source Attributes
  • T A
  • S B
  • WordNet
  • C4.5 Decision Tree feature selection, trained on
    schemas in DB books
  • f0 same word
  • f1 synonym
  • f2 sum of distances to a common hypernym root
  • f3 number of different common hypernym roots
  • f4 sum of the number of senses of A and B

41
WordNet Rule
42
Confidence Measures
43
Data-Value Characteristics
  • C4.5 Decision Tree
  • Features
  • Numeric data
  • (Mean, variation, standard deviation, )
  • Alphanumeric data
  • (String length, numeric ratio, space ratio)

44
Confidence Measures
45
Expected Data Values
  • Target Schema T and Source Schema S
  • Regular expression recognizer for attribute A in
    T
  • Data instances for attribute B in S
  • Hit Ratio N/N for (A, B) match
  • N number of B data instances recognized by the
    regular expressions of A
  • N number of B data instances

46
Confidence Measures
47
Combined Measures
Threshold 0.5
48
Final Confidence Measures
0
0
0
49
Experimental Results
  • This schema, plus 6 other schemas
  • 32 matched attributes
  • 376 unmatched attributes
  • Matched 100
  • Unmatched 99.5
  • Feature ---Color
  • Feature ---Body Type

50
Indirect Schema Matching
51
Schema Matching
Color
Year
Year
Feature
Make
Make Model
Body Type
Cost
Model
Car
Style
Phone
Cost
Miles
Mileage
Source
52
Mapping Generation
  • Direct Matches as described earlier
  • Attribute Names based on WordNet
  • Value Characteristics based on value lengths,
    averages,
  • Expected Values based on regular-expression
    recognizers
  • Indirect Matches
  • Direct matches
  • Structure Evaluation
  • Union
  • Selection
  • Decomposition
  • Composition

53
Union and Selection
Color
Year
Year
Feature
Make
Make Model
Body Type
Cost
Model
Car
Style
Phone
Cost
Miles
Mileage
Source
54
Decomposition and Composition
Color
Year
Year
Feature
Make
Make Model
Body Type
Cost
Model
Car
Style
Phone
Cost
Miles
Mileage
Source
55
Structure
Example Taken From MBR, VLDB01
PO
PurchaseOrder
Items
POShipTo
POBillTo
POLines
DeliverTo
InvoiceTo
Count
Address
Item
ItemCount
City
Street
City
Street
Item
ItemNumber
City
Street
Line
Qty
UoM
Quantity
UnitOfMeasure
Target
Source
56
Structure(Nonlexical Matches)
PO
PurchaseOrder
Items
POShipTo
POBillTo
POLines
DeliverTo
InvoiceTo
DeliverTo
Count
Address
Item
Count
City
Street
City
Street
Item
ItemNumber
City
Street
Line
Qty
UoM
Quantity
UnitOfMeasure
Target
Source
57
Structure(Join over FD Relationship Sets, )
PO
PurchaseOrder
Items
POBillTo
POLines
InvoiceTo
POShipTo
DeliverTo
City
Count
City
Item
Count
Street
City
Street
City
Street
Item
Street
ItemNumber
Line
Qty
UoM
Quantity
UnitOfMeasure
Target
Source
58
Structure(Lexical Matches)
PO
PurchaseOrder
Items
POBillTo
POLines
InvoiceTo
POShipTo
DeliverTo
City
City
Count
Count
City
City
Item
Count
Street
Street
Count
City
Street
City
Street
Item
City
Street
City
Street
Street
Street
ItemNumber
Line
Qty
UoM
Line
Qty
Quantity
Quantity
UnitOfMeasure
Target
Source
59
Experiments
  • Methodology
  • Measures
  • Precision
  • Recall
  • F Measure

60
Results
Applications (Number of Schemes) Precision () Recall () F () Correct False Positive False Negative
Course Schedule (5) 98 93 96 119 2 9
Faculty Member (5) 100 100 100 140 0 0
Real Estate (5) 92 96 94 235 20 10
Indirect Matches 94 (precision, recall,
F-measure)
Data borrowed from Univ. of Washington DDH,
SIGMOD01
Rough Comparison with U of W Results (Direct
Matches only) Course Schedule Accuracy
71 Faculty Members Accuracy, 92
Real Estate (2 tests) Accuracy 75
61
Conclusions and Future Work
62
Conclusions
  • Table Mappings
  • Tables 94.7 (Recall) 100 (Precision)
  • Linked Text 97 (Recall) 86 (Precision)
  • Direct Attribute Matching
  • Matched 32 of 32 100 Recall
  • 2 False Positives 94 Precision
  • Direct and Indirect Attribute Matching
  • Matched 494 of 513 96 Recall
  • 22 False Positives 96 Precision

www.deg.byu.edu
63
Current Future WorkImprove and Extend
Indirect Matching
  • Improve Object-Set Matching (e.g. Lex/non-Lex)
  • Add Relationship-Set Matching
  • Computations

64
Current Future WorkTables Behind Forms
  • Crawling the Hidden Web
  • Filling in Forms from Global Queries

65
Current Future WorkDeveloping Extraction
Ontologies
  • Creation from Knowledge Sources and Sample
    Application Pages
  • ?K Ontology Data Frames, Lexicons,
  • RDF Ontologies
  • User Creation by Example

66
Current Future Workand Much More
  • Table Understanding
  • Microfilm Census Records
  • Generate Ontologies by Reading Tables

www.deg.byu.edu
Write a Comment
User Comments (0)
About PowerShow.com