Data Integration: A Status Report

Slides: 34

Provided by: uw387


1
Data Integration: A Status Report

Alon Halevy
University of Washington, Seattle
BTW 2003

2
Data Integration Report

Recent progress
Mediation languages
Query processing (XML and other)
Commercial
Current challenges
Flexible architectures: peer data management.
Getting to the root of semantic heterogeneity: schema mapping.

3
Data Integration Systems

This is one possible architecture (virtual integration).
Only the logical mediated schema is central; data stays at the sources.

4
Motivation and Activity

Application areas of data integration
Enterprise information integration ()
The government
Data sources on the web
Scientific data sharing.
Many research projects
Mine: Information Manifold, Tukwila, LSD.
Companies
Many startups, big guys getting in.

5
Outline

Recent progress
Mediation languages
Adaptive Query processing
XML data management
Commercial
Current challenges
Flexible architectures: peer data management.
Getting to the root of semantic heterogeneity: schema mapping.
Crossing the Structure Chasm.

6
Mediation Languages
Goal
Mediated Schema
Language for Specifying Semantic relationships
7
Global-as-View (GAV)
Create View Actor AS R1 Union Select A, B From S2 Union ...
[Figure: mediated schema (Title, Actor, ...) defined over sources R1, R2, R3, R4, R5]
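A minimal sketch of the GAV idea, using Python's built-in sqlite3 module. The names Actor, R1, and S2 come from the slide; the column names and the toy rows are assumptions made only for illustration. The mediated relation is a view over the sources, so a query against it is unfolded into queries over the sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical sources with different layouts (columns are assumed).
    CREATE TABLE R1 (title TEXT, name TEXT);
    CREATE TABLE S2 (A TEXT, B TEXT, year INTEGER);
    INSERT INTO R1 VALUES ('Metropolis', 'Brigitte Helm');
    INSERT INTO S2 VALUES ('Run Lola Run', 'Franka Potente', 1998);

    -- GAV: the mediated relation is defined as a view over the sources.
    CREATE VIEW Actor AS
        SELECT title, name FROM R1
        UNION
        SELECT A, B FROM S2;
""")


def actors_in(title):
    """Query the mediated relation; the engine unfolds the view definition."""
    rows = conn.execute("SELECT name FROM Actor WHERE title = ?", (title,))
    return [name for (name,) in rows]


print(actors_in("Run Lola Run"))
```

Adding a new source in this style means editing the view definition, which is the usual cost attributed to GAV.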
8
Local-as-View (LAV)
(GLAV)
Create View R5 AS Select * From Movie Where lang = 'German'
Create View R1 AS Select title, name From Title Join Actor Where Year > 1970
[Figure: mediated schema (Title, Actor, ...) with sources R1, R2, R3, R4, R5 described as views over it]
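A rough sketch of the LAV idea, under strong simplifying assumptions: the mediated schema is reduced to a single hypothetical Movie(title, lang, year) relation, and each source description is just a pair of optional conditions rather than a full view definition. R5 and R1 loosely follow the slide; R2 and all rows are invented for illustration. Answering a query means picking the sources whose descriptions can contribute and then filtering their contents.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Movie = Tuple[str, str, int]  # (title, lang, year) in the assumed mediated schema


@dataclass
class Source:
    name: str
    lang: Optional[str] = None        # description: only movies in this language
    after_year: Optional[int] = None  # description: only movies with year > this
    rows: List[Movie] = field(default_factory=list)


def relevant(src: Source, lang: str) -> bool:
    """A source can contribute unless its description rules the language out."""
    return src.lang is None or src.lang == lang


def answer(sources: List[Source], lang: str):
    """Return (sources used, titles) for the query 'movies in the given language'."""
    used, titles = [], set()
    for src in sources:
        if not relevant(src, lang):
            continue  # pruned using only the source description
        used.append(src.name)
        titles.update(t for (t, row_lang, _) in src.rows if row_lang == lang)
    return used, sorted(titles)


R5 = Source("R5", lang="German", rows=[("Metropolis", "German", 1927)])
R1 = Source("R1", after_year=1970,
            rows=[("Run Lola Run", "German", 1998), ("Amelie", "French", 2001)])
R2 = Source("R2", lang="French", rows=[("Amelie", "French", 2001)])

print(answer([R5, R1, R2], "German"))
```

Adding a source here only requires writing its own description; the query logic is untouched, which is the usual argument for LAV.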
9
Adaptive Query Processing

Problem: no statistics, unstable network.
Cannot plan first and then execute.
Need to adapt the plan during execution.
The idea was already in Ingres (1976).
Proposed before data integration:
Cole and Graefe (choose nodes)
Kabra and DeWitt (mid-query re-optimization).

10
Convergent Query Processing
Zack Ives, Ph.D. 2002, U. Penn

Processor starts with an initial plan.
Monitors execution, accumulating statistics.
Switches plans when a better one is found.
Reuses intermediate results.
Final cleanup phase.
Possible transformation types:
Plan partitioning, data partitioning, low-level rescheduling.
Can be aggressive (e.g., with aggregations).
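The monitor-and-switch loop can be illustrated with a toy that is much simpler than the processor described above: a conjunction of filters is evaluated batch by batch, and the filter order (the "plan") is revised between batches from the pass rates observed so far. This is only a sketch of the data-partitioning flavor of adaptation; the function name, the batch size, and the statistics used are assumptions, not the actual algorithm.

```python
def run_adaptive(rows, predicates, batch_size=100):
    """Evaluate a conjunction of predicates, re-ordering them between batches.

    predicates maps a name to a one-argument boolean function.
    Returns (matching rows, list of plans used per batch).
    """
    order = list(predicates)              # initial plan: the order given
    seen = {name: 0 for name in predicates}
    passed = {name: 0 for name in predicates}
    output, plans = [], []
    for start in range(0, len(rows), batch_size):
        plans.append(tuple(order))        # record the plan used for this batch
        for row in rows[start:start + batch_size]:
            keep = True
            for name in order:
                seen[name] += 1
                if predicates[name](row):
                    passed[name] += 1
                else:
                    keep = False
                    break                 # short-circuit: later filters are skipped
            if keep:
                output.append(row)
        # Re-plan: put the filter with the lowest observed pass rate first.
        order.sort(key=lambda n: passed[n] / seen[n] if seen[n] else 1.0)
    return output, plans


result, plans = run_adaptive(
    list(range(1000)),
    {"even": lambda x: x % 2 == 0, "rare": lambda x: x % 50 == 0},
)
print(len(result), plans[0], plans[1])
```

Because every batch is filtered by the full conjunction, the union of the per-batch outputs is the same regardless of which order each batch used.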

11
XML Query Processing

XML facilitates integration.
Mediator query processor may manipulate XML directly.
Progress on:
Publishing to XML, XML views on relations
Physical algebras for manipulating XML
Optimization of XQuery.

12
The Commercial World

Some startups:
Nimble, MetaMatrix, Calixa, Enosys, ...
Big guys making announcements:
IBM, BEA, MS (Oracle still being defiant).
Progress: analysts have a buzzword, EII.
Challenges:
Integration with EAI?
Yet another middleware?
Horizontal vs. vertical?

13
Outline

Recent progress
Mediation languages
Adaptive Query processing
XML data management
Commercial
Current challenges
Flexible architectures: peer data management.
Getting to the root of semantic heterogeneity: schema mapping.

14
Peer Data-Management

PDMS: a network of peers.
Peers can:
Export base data
Provide views on base data
Serve as logical mediators for other peers
A peer can be both a server and a client.
Semantic relationships are specified locally (between small sets of peers).

15
Network of Mappings (Piazza)
[Figure: peers CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken, Berlin connected by GAV, LAV, and GLAV mappings]
16
Advantages of PDMS

No need for a central mediated schema.
Can map data opportunistically, as is most convenient.
Queries are posed using the peer's schema.
Answers come from anywhere in the system.
Semantic Web.
This is not P2P file sharing:
Data has rich semantics.
Membership is not as dynamic.
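To make "queries are posed using the peer's schema, answers come from anywhere" concrete, here is a small sketch that follows local mappings transitively. It assumes mappings are plain attribute renamings, which is far weaker than the GAV/LAV/GLAV mappings of a real PDMS. The peer names come from the slides; the attribute names and the MAPPINGS table are hypothetical.

```python
from collections import deque

# Hypothetical local mappings: (peer_a, peer_b) -> {attribute at a: attribute at b}
MAPPINGS = {
    ("UW", "DBLP"): {"paper_title": "title", "writer": "author"},
    ("DBLP", "CiteSeer"): {"title": "doc_title", "author": "creator"},
    ("DBLP", "Leipzig"): {"title": "titel"},
}


def reformulate(start_peer, attrs):
    """Rewrite a list of requested attributes into every reachable peer's schema.

    Returns {peer: attributes in that peer's vocabulary}. A peer is skipped
    when some requested attribute has no counterpart under the mapping.
    """
    result = {start_peer: list(attrs)}
    queue = deque([start_peer])
    while queue:
        peer = queue.popleft()
        for (a, b), mapping in MAPPINGS.items():
            if a != peer or b in result:
                continue  # not an outgoing mapping, or peer already reached
            current = result[peer]
            if all(x in mapping for x in current):
                result[b] = [mapping[x] for x in current]
                queue.append(b)
    return result


print(reformulate("UW", ["paper_title", "writer"]))
```

The query is written once in UW's vocabulary; DBLP and CiteSeer are reached through chained mappings, while Leipzig is skipped because its mapping does not cover the writer attribute.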

17
Schema Mediation
When can LAV and GAV be combined to form such a network structure? (ICDE-03; WWW-03 for XML)
[Figure: peer network of CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken, Berlin with GAV, LAV, and GLAV mappings]
18
Query Optimization

Problems:
Redundant paths
Expensive reformulation.

Possible solution:
Pre-compose some paths.

[Figure: peer network of CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken, Berlin]
19
Mapping Composition

Incredibly subtle! (with Madhavan)
In general, a composition can be an infinite set of GLAV formulas.
Results:
Finite in many cases.
Even when infinite, it often has a finite, useful encoding.
Hence, compositions can usually be pre-optimized.
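For intuition only, here is the most trivial finite case: composing two attribute-renaming mappings into one so that a path can be followed in a single step. This deliberately sidesteps everything that makes GLAV composition subtle. The mappings and attribute names are hypothetical.

```python
def compose(m_ab, m_bc):
    """Compose renamings A->B and B->C into a direct renaming A->C.

    Attributes of A whose image in B has no counterpart in C are dropped.
    """
    return {a: m_bc[b] for a, b in m_ab.items() if b in m_bc}


uw_to_dblp = {"paper_title": "title", "writer": "author", "pages": "pp"}
dblp_to_citeseer = {"title": "doc_title", "author": "creator"}

# Pre-composed once, then reused for every query from UW to CiteSeer.
uw_to_citeseer = compose(uw_to_dblp, dblp_to_citeseer)
print(uw_to_citeseer)  # {'paper_title': 'doc_title', 'writer': 'creator'}
```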

20
Management of Updates (with Mork, Gribble)

Problem: when updates are generated, we don't know who will use them.
Solution:
Represent updates as first-class citizens.
Complement with boosters.
Rules for usage.

[Figure: peer network of CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken, Berlin]
21
Other Research Issues
Intelligent data placement
Management of mapping networks
Improving networks: finding additional connections.
Indexing of views
[Figure: peer network of CiteSeer, UW, Stanford, DBLP, Leipzig, Saarbruecken, Berlin]
22
Schema Matching/Mapping

Given:
S1 and S2: a pair of schemas/DTDs/ontologies
Possibly, accompanying data instances
Additional domain knowledge
Find:
A match between S1 and S2: a set of correspondences between the terms.
Ultimately, a mapping that enables translating data between the schemas.

23
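Example: House Listings
[Figure: two house-listing schemas linked by a 1-1 mapping and a non 1-1 mapping. First schema labels: house, address, num-baths, Water view, Lake, Mountains. Second schema labels: house, location, view, full-baths, half-baths, front, back.]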
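One plausible reading of the example, written as a translation function. It assumes the 1-1 correspondence is address/location and the non 1-1 correspondence is num-baths = full-baths + half-baths; the figure itself is not recoverable here, so both pairings are assumptions, as is the sample listing.

```python
def translate_listing(listing):
    """Translate a listing from the (location, full-baths, half-baths) schema
    into the (address, num-baths) schema, under the assumed correspondences."""
    return {
        # 1-1 mapping: a single element renamed.
        "address": listing["location"],
        # Non 1-1 mapping: one target element computed from two source elements.
        "num-baths": listing["full-baths"] + listing["half-baths"],
    }


sample = {"location": "Seattle, WA", "full-baths": 2, "half-baths": 1}
print(translate_listing(sample))  # {'address': 'Seattle, WA', 'num-baths': 3}
```

A match only says which elements correspond; the mapping additionally needs the expression (here a sum) that turns one side's data into the other's.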
24
Motivations

Heart of any data sharing architecture:
Virtual, warehouse, messaging, web services, semantic web
Translation of legacy data, EAI, ...
Key operator in model management:
Algebra for manipulating models of data
See Bernstein, CIDR-03; Melnik et al., SIGMOD-03.
Currently a bottleneck: done mostly by hand.

25
Approaches to Matching

Matching is hard because the schema does not fully capture the semantics.
Many techniques proposed. They consider similarities in:
Attribute names (synonyms)
Data values, data types
Relationships between columns
Structural similarities
Anything a human expert would try!
Hence, let's try to simulate a human.
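Two of the listed signals, attribute names and data values, can be combined in a few lines. This is a minimal sketch: string similarity from the standard difflib module stands in for a synonym-aware name matcher, Jaccard overlap stands in for a value-based matcher, and the equal weights, schemas, and sample values are all assumptions.

```python
from difflib import SequenceMatcher


def name_sim(a, b):
    """String similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_sim(vals_a, vals_b):
    """Jaccard overlap of two sets of sample values, in [0, 1]."""
    sa, sb = set(vals_a), set(vals_b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def match(schema_a, schema_b, w_name=0.5, w_value=0.5):
    """For each column of schema_a, pick the best-scoring column of schema_b.

    Each schema maps a column name to a list of sample values.
    """
    result = {}
    for col_a, vals_a in schema_a.items():
        scored = [
            (w_name * name_sim(col_a, col_b) + w_value * value_sim(vals_a, vals_b), col_b)
            for col_b, vals_b in schema_b.items()
        ]
        result[col_a] = max(scored)[1]
    return result


schema_a = {"address": ["12 Oak St", "9 Elm Ave"], "view": ["lake", "mountains"]}
schema_b = {"location": ["12 Oak St", "9 Elm Ave"], "water_view": ["lake", "mountains"]}
print(match(schema_a, schema_b))
```

Here the names "address" and "location" share almost nothing as strings, so the value overlap is what carries that match; no single signal is enough on its own.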

26
Philosophy of Solutions

Effective schema matching requires a principled combination of techniques.
Like human experts, the matcher should improve over time:
Learn from seeing many schemas, matches.
LSD: Doan, Ph.D. 2002, U. of Illinois
COMA: Do et al.

27
Corpus-Based Solution
Madhavan, Bernstein, Chen, Halevy, Shenoy

Collect a corpus of schemas and matches.
Learn from the corpus:
Create a classifier for every corpus element.
Use multi-strategy learning.
Given S1 and S2:
Compare each schema element to corpus elements.
If two elements' similarity vectors are close, then maybe they match each other.
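The similarity-vector step can be sketched as follows. Instead of a learned classifier per corpus element, this toy uses plain string similarity as a stand-in, so it shows only the shape of the computation, not the learning. The corpus contents and function names are assumptions.

```python
from difflib import SequenceMatcher
from math import sqrt

# Hypothetical corpus elements; a real corpus would hold many schemas and matches.
CORPUS = ["address", "location", "street", "price", "cost", "num_baths", "bathrooms"]


def sim_vector(element):
    """One score per corpus element (stand-in for per-element classifiers)."""
    return [SequenceMatcher(None, element.lower(), c).ratio() for c in CORPUS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def corpus_match(s1, s2):
    """For each element of s1, propose the s2 element with the closest vector."""
    return {
        e1: max(s2, key=lambda e2: cosine(sim_vector(e1), sim_vector(e2)))
        for e1 in s1
    }


print(corpus_match(["price", "addr"], ["address", "price"]))
```

The two schemas are never compared directly: each element is first described by how it relates to the corpus, and the match is proposed from those descriptions.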

28
Learning from Corpus vs. Learning from the schemas
29
Finding Different Matches
30
Other Corpus Based Tools

Conjecture: a corpus of schemas can be the basis for many useful tools.
Auto-complete:
I start creating a schema (or show sample data), and the tool suggests a completion.
Query reformulation:
I ask a query using my terminology, and it gets reformulated appropriately.
Improving structured queries over structured web sites (and focused crawling, a la BINGO!)
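The auto-complete idea can be sketched with simple co-occurrence counts: attributes that appear in corpus schemas overlapping the partial schema are suggested, weighted by the size of the overlap. The corpus, the scoring rule, and the function name are assumptions chosen for brevity, not the tool envisioned above.

```python
from collections import Counter

# Hypothetical corpus: each schema is reduced to its set of attribute names.
CORPUS = [
    {"address", "price", "num_baths", "num_beds"},
    {"address", "price", "agent", "num_beds"},
    {"title", "author", "year"},
    {"title", "author", "venue", "year"},
]


def suggest(partial, k=2):
    """Suggest up to k attributes to add to a partially written schema."""
    partial = set(partial)
    counts = Counter()
    for schema in CORPUS:
        overlap = len(partial & schema)
        if overlap:
            for attr in schema - partial:
                counts[attr] += overlap   # weight by how similar the schema is
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [attr for attr, _ in ranked[:k]]


print(suggest({"address", "price"}))  # ['num_beds', 'agent']
print(suggest({"title"}))             # ['author', 'year']
```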

31
The Corpus