Brigham Young University - PowerPoint PPT Presentation

About This Presentation
Title:

Brigham Young University


Slides: 51
Provided by: davidw8
Learn more at: https://www.deg.byu.edu

Transcript and Presenter's Notes


1
Source Discovery and Schema Mapping for Data
Integration
  • Brigham Young University
  • Li Xu

2
Data Integration
Find houses with four bedrooms priced under
$200,000
[Figure: a mediator with a global schema sits above source schemas 1, 2, and 3; wrappers connect the source schemas to homes.com, realestate.com, and homeseekers.com.]
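
The mediator arrangement on this slide can be sketched in a few lines. This is only an illustration, not the system from the talk: the wrapper functions, the field names `bedrooms` and `price`, and every listing are invented stand-ins.

```python
# Hypothetical wrappers: each returns listings already translated
# into the global schema (bedrooms, price). All data is invented.
def homes_wrapper():
    return [{"bedrooms": 4, "price": 185000}, {"bedrooms": 3, "price": 150000}]

def realestate_wrapper():
    return [{"bedrooms": 4, "price": 240000}]

def homeseekers_wrapper():
    return [{"bedrooms": 4, "price": 199000}]

WRAPPERS = [homes_wrapper, realestate_wrapper, homeseekers_wrapper]

def mediator_query(predicate):
    """Ask every wrapper for its records and keep those matching the query."""
    results = []
    for wrapper in WRAPPERS:
        results.extend(r for r in wrapper() if predicate(r))
    return results

# "Find houses with four bedrooms priced under 200,000"
hits = mediator_query(lambda r: r["bedrooms"] == 4 and r["price"] < 200000)
```

The user only ever writes the query against the global schema; which sites answer it is hidden behind the wrappers.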
3
Problems
  • How to Recognize Applicable Information Sources
    for an Application?
  • How to Specify Mapping between the Source Schemas
    and the Global Schema?
  • How to Reformulate User Queries?
  • How to Merge Data from Heterogeneous Sources?

4
Recognizing Ontology-Applicable HTML Documents
5
Application Ontology
How to specify an application?
6
Applicable HTML Documents
  • Multiple-Record Documents
  • Single-Record Documents
  • HTML Forms

How to distinguish an applicable HTML document?
7
Multiple-Record Docs
8
Single-Record Doc.
9
HTML Forms
10
Recognition Heuristics
  • h1: Densities
  • h2: Expected Values
  • h3: Grouping

How to measure the applicability of an HTML
document for an application?
11
h1: Densities
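
The slide gives no formula for density, so the following is a guess at one plausible reading: the fraction of a document's characters that the application's recognizers match. The regular expressions and sample strings are hypothetical.

```python
import re

# Hypothetical recognizers for a car-ads ontology (assumed patterns).
RECOGNIZERS = {
    "Year": r"\b(?:19|20)\d{2}\b",
    "Price": r"\$\d[\d,]*",
    "Make": r"\b(?:Ford|Honda|Toyota)\b",
}

def density(text):
    """Fraction of characters covered by at least one recognizer match."""
    if not text:
        return 0.0
    covered = [False] * len(text)
    for pattern in RECOGNIZERS.values():
        for m in re.finditer(pattern, text):
            for i in range(m.start(), m.end()):
                covered[i] = True
    return sum(covered) / len(text)

ad = "1998 Ford Taurus, $4,500, call now"
page = "Our company history began long ago"
```

Under this reading, `density(ad)` is well above `density(page)`, which is zero because nothing in that sentence is recognized.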
12
h2: Expected Values
<Year: 0.98, Make: 0.93, Model: 0.91, Mileage: 0.45,
Price: 0.80, Feature: 2.10, PhoneNr: 1.15>
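
One way to use such a vector is to compare it with the per-record averages observed in a candidate document. The sketch below uses cosine similarity for that comparison, which is an assumption since the slide does not name a measure. The two observed vectors are invented.

```python
import math

# Expected values per record, copied from the slide.
EXPECTED = {"Year": 0.98, "Make": 0.93, "Model": 0.91, "Mileage": 0.45,
            "Price": 0.80, "Feature": 2.10, "PhoneNr": 1.15}

def cosine(observed, expected=EXPECTED):
    """Cosine similarity between an observed vector and the expected one."""
    keys = list(expected)
    dot = sum(observed.get(k, 0.0) * expected[k] for k in keys)
    norm_o = math.sqrt(sum(observed.get(k, 0.0) ** 2 for k in keys))
    norm_e = math.sqrt(sum(expected[k] ** 2 for k in keys))
    return dot / (norm_o * norm_e) if norm_o and norm_e else 0.0

# Hypothetical per-record averages observed in two documents.
car_page = {"Year": 1.0, "Make": 1.0, "Model": 0.9, "Mileage": 0.5,
            "Price": 0.9, "Feature": 2.0, "PhoneNr": 1.0}
other_page = {"Year": 3.0, "PhoneNr": 0.2}
```

A page whose profile resembles a car ad scores close to 1, while a page that merely mentions many years scores much lower.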
13
h3: Grouping (of 1-Max Object Sets)
14
Classification Problem
  • Subtasks: Multiple Records, Singleton Record,
    Application Form
  • Learning Algorithm: Decision Tree (C4.5)
  • (h1_0, h1_1, ..., h2, h3, Positive)
  • (h1_0, h1_1, ..., h2, h3, Negative)

How to construct recognition rules for an
application?
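
C4.5 picks each split by gain ratio, so a rough picture of how a rule could emerge from labeled heuristic vectors is the computation below. The four training rows and the 0.5 threshold are invented, and a real C4.5 run would also search thresholds and build the full tree.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, feature, threshold):
    """Gain ratio for splitting rows on rows[i][feature] <= threshold."""
    labels = [r["label"] for r in rows]
    left = [r["label"] for r in rows if r[feature] <= threshold]
    right = [r["label"] for r in rows if r[feature] > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing is useless
    n = len(rows)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info

# Invented heuristic values for four documents.
docs = [
    {"h1": 0.30, "h2": 0.95, "h3": 0.9, "label": "Positive"},
    {"h1": 0.25, "h2": 0.90, "h3": 0.8, "label": "Positive"},
    {"h1": 0.02, "h2": 0.30, "h3": 0.7, "label": "Negative"},
    {"h1": 0.05, "h2": 0.20, "h3": 0.2, "label": "Negative"},
]
best = max(["h1", "h2", "h3"], key=lambda f: gain_ratio(docs, f, 0.5))
```

On this toy data the h2 split separates the classes perfectly, so it would become the root test of the tree.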
15
Experiments: Car Ads and Obituaries
  • Training Sets
  • Car Ads (Yes / No): 143 / 363, 614 / 636, 50 / 69
  • Obituaries (Yes / No): 68 / 135, 50 / 69, 62 / 135
  • Test Sets
  • Car Ads (40 Yes, 40 No): Precision 95%, Recall 98%,
    F-measure 96%
  • Obituaries (40 Yes, 40 No): Precision 95%, Recall 95%,
    F-measure 95%

16
Link Analysis
17
Form Filling
18
Form Filling (Cont.)
19
Incorrect Positive Response: Motorcycle
Year, Make, Price, Mileage, PhoneNr, Feature
20
Historical Figure
Deceased Name, Death Date, Birth Date, Age, Relationship,
Relative Name
21
Automating Schema Mapping for Data Integration
22
Schema Mapping
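[Figure: two car schemas, one labeled Source, with lines between matching attributes. Labels shown: Car, Color, Year (twice), Feature, Make, Make Model, Model, Body Type, Style, Cost (twice), Phone, Miles, Mileage.]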
23
Schema Mapping for Populated Schemas
  • Central Idea: Exploit All Data and Metadata
  • Matching Possibilities (Facets)
  • Attribute Names
  • Data-Value Characteristics
  • Expected Data Values
  • Data-Dictionary Information
  • Structural Properties

24
The Approach
  • Input
  • Two Graphs, S and T
  • Data Instances for S and T
  • Lightweight Domain Ontology
  • Output
  • A Source-to-Target Mapping between S and T
  • Should enable translating data instances from S
    to T.
  • Direct and Many Indirect Matches
  • (t, s)
  • (t, s < ?)
  • Framework
  • Individual Facet Matching
  • Combination of Individual Matchers
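
The slide does not say how the individual matchers are combined. As a placeholder, the sketch below averages per-facet confidence scores and accepts the best source attribute above a threshold. The combination rule, the threshold, and all scores are assumptions.

```python
# Hypothetical confidence scores in [0, 1] from each facet matcher for
# candidate (target, source) attribute pairs. All numbers are invented.
facet_scores = {
    ("Bedrooms", "beds"): {"names": 0.8, "value_characteristics": 0.9,
                           "expected_values": 0.7},
    ("Bedrooms", "SQFT"): {"names": 0.1, "value_characteristics": 0.6,
                           "expected_values": 0.0},
}

def combine(scores):
    """Assumed combination rule: unweighted mean of the facet scores."""
    return sum(scores.values()) / len(scores)

def best_match(target, candidates, threshold=0.5):
    """Pick the highest-scoring source for a target, if it clears the threshold."""
    scored = [(combine(s), src) for (t, src), s in candidates.items() if t == target]
    if not scored:
        return None
    score, src = max(scored)
    return src if score >= threshold else None
```

Here `best_match("Bedrooms", facet_scores)` selects `beds`, because agreement across several facets outweighs the single facet on which `SQFT` looks similar.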

25
Attribute Names
  • Target and Source Attributes
  • T: A
  • S: B
  • WordNet
  • C4.5 Decision Tree: feature selection, trained on
    schemas in DB books
  • f0: same word
  • f1: synonym
  • f2: sum of distances to a common hypernym root
  • f3: number of different common hypernym roots
  • f4: sum of the number of senses of A and B
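
To make the first three features concrete, here is a toy version that swaps WordNet for two small hand-made tables. The tables are invented, and f2 is read here as the summed distance to the nearest shared hypernym, which is one interpretation of the slide's wording.

```python
# Toy stand-ins for WordNet (invented for illustration, not real WordNet data).
SYNONYMS = {frozenset({"cost", "price"}), frozenset({"phone", "telephone"})}
HYPERNYM = {"car": "vehicle", "truck": "vehicle", "vehicle": "entity",
            "price": "amount", "amount": "entity"}

def path_to_root(word):
    """Follow hypernym links upward from word until none is left."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def name_features(a, b):
    """Return (f0, f1, f2) for target attribute a and source attribute b."""
    a, b = a.lower(), b.lower()
    f0 = a == b
    f1 = frozenset({a, b}) in SYNONYMS
    # f2: summed distance from a and b to their nearest shared hypernym,
    # or None when the toy hierarchy has no shared ancestor.
    pa, pb = path_to_root(a), path_to_root(b)
    common = [w for w in pa if w in pb]
    f2 = pa.index(common[0]) + pb.index(common[0]) if common else None
    return f0, f1, f2
```

For example, `name_features("Car", "Truck")` gives `(False, False, 2)`: different words, not synonyms, each one step below `vehicle`.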

26
WordNet Rule
27
Data-Value Characteristics
  • C4.5 Decision Tree
  • Features
  • Numeric data
  • (Mean, variation, standard deviation, ...)
  • Alphanumeric data
  • (String length, numeric ratio, space ratio)
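
The listed features are simple column statistics. A minimal sketch follows, assuming "variation" means variance and that ratios are taken over all characters in the column. The sample columns are invented.

```python
import statistics

def numeric_features(values):
    """Mean, variance, and standard deviation of a numeric column."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }

def alphanumeric_features(values):
    """Average string length, digit ratio, and space ratio of a text column."""
    total_chars = sum(len(v) for v in values)
    return {
        "avg_length": total_chars / len(values),
        "numeric_ratio": sum(c.isdigit() for v in values for c in v) / total_chars,
        "space_ratio": sum(c == " " for v in values for c in v) / total_chars,
    }

prices = [4500, 5500, 6500]
phones = ["555 1234", "555 9876"]
```

Two columns whose feature vectors are close, such as two price columns from different sources, become candidates for a match even when their attribute names differ.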

28
Expected Data Values
  • Concepts and Relationships
  • Data Recognizers
  • CarMake
  • ford
  • honda
  • CarModel
  • accord
  • mustang
  • taurus

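[Figure: Target and Source columns checked against the data recognizers. A "Make Model" column (Ford Mustang, Ford Taurus, Ford F150) is recognized as CarMake . CarModel; a "Brand" column (Acura, Audi, BMW) as CarMake; a "Model" column (Legend, Mustang, A4) as CarModel.]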
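
A data recognizer can be approximated by a vocabulary lookup. The sketch below uses only the five words listed on the slide and scores a column by the share of its values that a recognizer accepts. The function names and the scoring rule are assumptions.

```python
# Recognizer vocabularies; the slide lists ford, honda, accord, mustang, taurus.
CAR_MAKE = {"ford", "honda"}
CAR_MODEL = {"accord", "mustang", "taurus"}

def is_make(v):
    return v.lower() in CAR_MAKE

def is_model(v):
    return v.lower() in CAR_MODEL

def is_make_then_model(v):
    """True when the value reads as a make followed by a model, e.g. 'Ford Taurus'."""
    parts = v.split(maxsplit=1)
    return len(parts) == 2 and is_make(parts[0]) and is_model(parts[1])

def hit_rate(values, recognizer):
    """Fraction of column values accepted by a recognizer."""
    return sum(recognizer(v) for v in values) / len(values)

make_model_column = ["Ford Mustang", "Ford Taurus", "Ford F150"]
```

With this small vocabulary the concatenated recognizer accepts two of the three values (F150 is not listed), while the plain make recognizer accepts none, which points to an indirect match that needs the column split in two.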
29
Structure Matching
[Figure: Target and Source schema graphs. Labels shown: MLS (both graphs), House, Bedrooms, beds, SQFT, Basic_features, Golf course, Water front, location, location_description, Address (twice), Street, City, State, Agent, agent, Name, name, Fax, fax, phone.]
30
Structure Matching (Cont.)
[Figure: the same Target and Source schema graphs and labels as on slide 29.]
31
Structure Matching (Cont.)
[Figure: the same Target and Source schema graphs and labels as on slide 29.]
32
Structure Matching (Cont.)
[Figure: the same Target and Source schema graphs and labels as on slide 29.]
33
Structure Matching (Cont.)
[Figure: the same Target and Source schema graphs and labels as on slide 29.]
34
Structure Matching (Cont.)
[Figure: the same Target and Source schema graphs and labels as on slide 29.]
35
House, MLS vs. MLS
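[Figure: the House and MLS portions of the Source and Target schema graphs. Labels shown: House, MLS (both graphs), Bedrooms, beds, SQFT, Basic_features, location, location_description, Golf course, Water front, Address, Street, City, State.]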
36
House, MLS vs. MLS
[Figure: the same graphs and labels as on slide 35.]
37
House, MLS vs. MLS
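[Figure: the House and MLS portions of the Source and Target schema graphs after restructuring. Labels shown: House (twice), MLS (both graphs), Bedrooms, beds, SQFT, Basic_features, location, location_description, Golf course, Water front, Address, Address1, Street, City, State.]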
38
House, MLS vs. MLS
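[Figure: the House and MLS portions of the Target and Source schema graphs after further restructuring. Labels shown: House (twice), MLS (both graphs), Bedrooms, beds, SQFT, Basic_features, location, location_description, Golf course (twice), Water front (twice), Address, Address1, Street, City, State, Street1, City1, State1.]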
39
Agent vs. agent
[Figure: the Agent and agent portions of the Source and Target schema graphs. Labels shown: Agent, agent, Name, name, Fax, fax, phone, address, Address, Street, City, State.]
40
Agent vs. agent
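[Figure: the Agent and agent portions of the Source and Target schema graphs after restructuring. Labels shown: Agent, agent, Name, name, Fax, fax, phone, address, Address, Address2, Street, City, State, Street2, City2, State2.]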
41
Inter-Relationship Set
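[Figure: Source and Target schema graphs highlighting the relationship set between House and Agent. Labels shown: MLS (both graphs), House (twice), Bedrooms, Agent, agent, Name, Fax, Golf course, Water front, Address, Street, City, State.]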
42
Example Source-To-Target Mapping
[Figure: the resulting source-to-target mapping. Labels shown: House, MLS, beds, Golf course, Water front, agent, name, fax, Address, Address1, Address2, Street, City, State.]
43
Target-based Integration and Query System (TIQS)
  • Definition: I = (T, {Si}, {Mi})
  • Phases
  • Design (Source-to-Target Mappings Mi)
  • Query Processing (Rule Unfolding)

44
Query Reformulation
  • Query
  • House-Bedrooms(x, 4) :- House-Bedrooms(x, 4),
  • House-Golf_course(x, Yes),
  • House-Water_front(x, Yes)

45
Query Reformulation
  • Query
  • House-Bedrooms(x, 4) :- House-Bedrooms(x, 4),
  • House-Golf_Course(x, Yes),
  • House-Water_Front(x, Yes)
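
Rule unfolding replaces each target predicate in the query with the source expression that defines it. The sketch below shows the substitution step only. The source predicate names (`src1_beds` and so on) are invented, and real mappings could expand one target predicate into several source atoms.

```python
# Hypothetical source-to-target mappings: each target predicate is defined
# by one source predicate (names on the right are invented for illustration).
MAPPINGS = {
    "House-Bedrooms": "src1_beds",
    "House-Golf_Course": "src1_golf_course",
    "House-Water_Front": "src1_water_front",
}

def unfold(query_body, mappings=MAPPINGS):
    """Replace every target predicate in the query body with its source definition."""
    unfolded = []
    for predicate, args in query_body:
        if predicate not in mappings:
            raise KeyError(f"no mapping for target predicate {predicate}")
        unfolded.append((mappings[predicate], args))
    return unfolded

query = [("House-Bedrooms", ("x", 4)),
         ("House-Golf_Course", ("x", "Yes")),
         ("House-Water_Front", ("x", "Yes"))]
source_query = unfold(query)
```

Because the mappings are fixed at design time, query processing reduces to this kind of substitution, with no reasoning about source contents at query time.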

46
TIQS (Cont.)
  • User Queries
  • Logic Rules
  • Maximal and Sound Query Answers
  • Advantages
  • Rule Unfolding
  • Scalability

47
Experimental Results
Application (Number of Schemes) | Precision (%) | Recall (%) | F (%) | Number Matches | Number Correct | Number Incorrect
Faculty Member (5)              | 100           | 100        | 100   | 540            | 540            | 0
Course Schedule (5)             | 99            | 93         | 96    | 490            | 454            | 6
Real Estate (5)                 | 90            | 94         | 92    | 876            | 820            | 92
Indirect Matches (precision 87%, recall 94%,
F-measure 90%)
Data borrowed from Univ. of Washington [DDH,
SIGMOD '01]
  • Rough Comparison with U of W Results
  • Course Schedule: Accuracy 71%
  • Real Estate (2 tests): Accuracy 75%
  • Faculty Member: Accuracy 92%
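
The percentages in the results table can be recomputed from its count columns under two assumptions that the slide does not state: precision is correct / (correct + incorrect), and recall is correct / number of matches. The F-measure is taken as the harmonic mean of the two.

```python
def scores(n_matches, n_correct, n_incorrect):
    """Percent precision, recall, and F-measure from the table's count columns."""
    precision = 100 * n_correct / (n_correct + n_incorrect)
    recall = 100 * n_correct / n_matches
    # F is computed from the rounded percentages, as a reader of the table would.
    p, r = round(precision), round(recall)
    return p, r, round(2 * p * r / (p + r))

# (Number Matches, Number Correct, Number Incorrect) from the table.
rows = {
    "Faculty Member": (540, 540, 0),
    "Course Schedule": (490, 454, 6),
    "Real Estate": (876, 820, 92),
}
results = {name: scores(*counts) for name, counts in rows.items()}
```

With these definitions all three rows reproduce the reported precision, recall, and F values, which suggests the assumed definitions agree with the ones used in the talk.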

48
Conclusion
  • A Robust and Flexible Approach to Check
    Applicability of HTML documents
  • A Composite Approach to Automate Schema Mapping
  • Direct Matches
  • Indirect Matches
  • An Approach that Combines Advantages of Basic
    Approaches to Data Integration

49
Future Work
  • Test More Applications and Data to Evaluate the
    Approaches
  • Extend Training Classifiers for Applicability
    Checking
  • Further Automating Schema Mapping
  • Automate Ontology Mapping on the Semantic Web
  • Automate Mapping between XML Documents

50
Thanks! Questions?