Title: Data Integration: A Status Report
1Data IntegrationA Status Report
- Alon Halevy
- University of Washington, Seattle
- BTW 2003
-
2Data Integration Report
- Recent progress
- Mediation languages
- Query processing (XML and other)
- Commercial
- Current challenges
- Flexible architectures peer-data mgmt.
- Getting to the root of semantic heterogeneity
schema mapping.
3Data Integration Systems
- This is one possible architecture (virtual
integration) - Only logical mediated schema is central. Data
stays at the sources.
4Motivation and Activity
- Application areas of data integration
- Enterprise information integration ()
- The government
- Data sources on the web
- Scientific data sharing.
- Many research projects
- Mine Information Manifold, Tukwila, LSD.
- Companies
- Many startups, big guys getting in.
5Outline
- Recent progress
- Mediation languages
- Adaptive Query processing
- XML data management
- Commercial
- Current challenges
- Flexible architectures peer-data mgmt.
- Getting to the root of semantic heterogeneity
schema mapping. - Crossing the Structure Chasm.
6Mediation Languages
Goal
Mediated Schema
Language for Specifying Semantic relationships
7Global-as-View (GAV)
Create view Actor AS R1 Union Select A,B From
S2 Union
Mediated Schema
Title, Actor,
R1
R2
R3
R4
R5
8Local-as-View (LAV)
(GLAV)
Create View R5 as Select From Movie Where
langGerman
Create View R1 as Select title, name From Title
Join Actor Where Yeargt1970
Mediated Schema
Title, Actor
R1
R2
R3
R4
R5
9Adaptive Query Processing
- Problem no stats, network unstable
- Cannot Plan and then execute
- Need to adapt plan during execution.
- Idea already in Ingres (1976)
- Proposed before data integration
- Cole and Graefe (choose nodes)
- Kabra and Dewitt (mid-query re-opt).
10Convergent Query ProcessingZack Ives, Ph.D
2002, U. Penn
- Processor starts with initial plan
- Monitors execution, accumulating stats.
- Switches plan when a better one found
- Reuses intermediate results.
- Final, cleanup phase.
- Possible transformation types
- Plan partitioning, data partitioning, low-level
rescheduling. - Can be aggressive (e.g., with aggregations).
11XML Query Processing
- XML facilitates integration.
- Mediator query processor may manipulate XML
directly. - Progress on
- Publishing to XML, XML views on relations
- Physical algebras for manipulating XML
- Optimization of XQuery.
12The Commercial World
- Some startups
- Nimble, MetaMatrix, Calixa, Enosys,
- Big guys making announcements
- IBM, BEA, MS, (Oracle still being defiant).
- Progress analysts have buzzword -- EII.
- Challenges
- Integration with EAI?
- Yet another middleware?
- Horizontal vs. vertical?
13Outline
- Recent progress
- Mediation languages
- Adaptive Query processing
- XML data management
- Commercial
- Current challenges
- Flexible architectures peer-data mgmt.
- Getting to the root of semantic heterogeneity
schema mapping.
14 Peer Data-Management
- PDMS a network of peers
- Peers can
- Export base data
- Provide views on base data
- Serve as logical mediators for other peers
- A peer can be both a server and a client.
- Semantic relationships are specified locally
(between small sets of peers).
15Network of Mappings (Piazza)
CiteSeer
UW
Stanford
GAV, LAV GLAV
DBLP
Leipzig
Saarbruecken
Berlin
16Advantages of PDMS
- No need for a central mediated schema.
- Can map data opportunistically, as is most
convenient. - Queries are posed using the peers schema.
Answers come from anywhere in the system. - Semantic Web.
- This is not P2P file sharing.
- Data has rich semantics
- Membership is not as dynamic.
17Schema Mediation
When can LAV and GAV be combined to form such a
network structure? ICDE-03, WWW-03 for XML
CiteSeer
UW
Stanford
GAV, LAV GLAV
DBLP
Leipzig
Saarbruecken
Berlin
18Query Optimization
- Problems
- redundant paths
- expensive reformulation.
CiteSeer
UW
Stanford
- Possible solution
- Pre-compose some paths
DBLP
Leipzig
Saarbruecken
Berlin
19Mapping Composition
- Incredibly subtle! w/ Madhavan
- In general, composition can be an infinite set of
GLAV formulas. - Results
- Finite in many cases
- Even when infinite, often has finite, useful
encoding. - Hence, compositions can usually be pre-optimized.
20Management of Updatesw/ Mork, Gribble
- Problem when updates are generated, we dont
know who will use them. - Solution
- represent updates as first-class citizens
- Complement with boosters
- Rules for usage.
CiteSeer
UW
Stanford
DBLP
Leipzig
Saarbruecken
Berlin
21Other Research Issues
Intelligent data placement Management of mapping
networks Improving networks finding additional
connections. Indexing of views
CiteSeer
UW
Stanford
DBLP
Leipzig
Saarbruecken
Berlin
22Schema Matching/Mapping
- Given
- S1 and S2 a pair of schemas/DTDs/ontologies,
- Possibly, data accompanying instances
- Additional domain knowledge
- Find
- A match between S1 and S2
- A set of correspondences between the terms.
- Ultimately, a mapping
- Should enable translating data between the
schemas.
23Example House Listings
house
address
num-baths
Water view
Lake Mountains
?
1-1 mapping
non 1-1 mapping
house
location view
full-baths
half-baths
front back
24Motivations
- Heart of any data sharing architecture
- Virtual, warehouse, messaging,
- web services, semantic web
- Translation of legacy data, EAI,
- Key operator in model management
- Algebra for manipulating models of data
- See Bernstein, CIDR-03, Melnik et al. SIGMOD
03. - Currently, a bottleneck. Done mostly by hand.
25Approaches to Matching
- Matching is hard because schema does not fully
capture the semantics. - Many techniques proposed. They consider
similarities in - Attribute names (synonyms)
- Data values, data types
- Relationships between columns
- Structural similarities
- Anything a human expert would try!
- Hence, lets try to simulate a human.
26Philosophy of Solutions
- Effective schema matching requires a principled
combination of techniques. - Like human experts, the matcher should improve
over time - Learn from seeing many schemas, matches.
- LSD Doan, Ph.D 2002, U. of Illinois
- COMA Do et al.
27Corpus Based SolutionMadhavan, Bernstein, Chen,
Halevy, Shenoy
- Collect a corpus of schemas and matches.
- Learn from the corpus
- Create a classifier for every corpus element
- Use multi-strategy learning.
- Given S1 and S2
- Compare each schema element to corpus elements.
- If two elements similarity vectors are close,
then maybe they match each other.
28Learning from Corpus vs. Learning from the schemas
29Finding Different Matches
30Other Corpus Based Tools
- Conjecture a corpus of schemas can be the basis
for many useful tools. - Auto-complete
- I start creating a schema (or show sample data),
and the tool suggests a completion. - Query reformulation
- I ask a query using my terminology, and it gets
reformulated appropriately. - Improving structured queries over structured web
sites (and focused crawling, a la BINGO!)
31The Corpus
- Contents
- Schemas, ontologies, meta-data, data, queries.
- Sample statistics
- How often does a word appear as a relation name?
- When it does, what tend to be the attribute
names? - What other tables are there? What are the foreign
keys?
32Conclusion Crossing the Structure Chasm
- Data authoring, querying and sharing is
everywhere done by novices too. - Semantic web the extreme example.
Corpus Of schemas
33Some References
- www.cs.washington.edu/homes/alon
- Piazza WebDB01, ICDE03, WWW03
- The Structure Chasm CIDR-03
- Mediation surveys VLDB Journal 01
- Lenzerini, PODS 02 tutorial.
- Schema matching
- Rahm and Bernstein, VLDB Journal 01.