Title: Naveen Ashish
1 Information Mediation: Integrating Information from Multiple Information Sources
- Naveen Ashish
- Amit P. Sheth
- Department of Computer Science and
- Large Scale Distributed Information Systems Lab
- University of Georgia, Athens
2 What is an Information Agent/Mediator?
- A software system that provides integrated and structured query access to multiple distributed information sources
- Sources may be databases of various kinds or Web sources
- Sources are autonomously created and heterogeneous
- Accessible via a network
- Mediator provides the illusion of a single information source
3 Information Agents aka Mediators
Example: Restaurant and Theatre Info on the Web
[Figure: Ariadne Mediator over Web sources, including Map Servers and Health Ratings]
4 Why the Interest in Building Such Systems?
[Figure labels: Legacy System, Object-Oriented DB]
5 Mediators on the Web
6 Organization of Remainder of Talk
- Introduction
- Information Agents, System Architecture
- Research Issues
- Information Modeling
- Query Planning
- Semi-automatic Wrapper Generation
- Performance Optimization by Materialization
- Resolving Inconsistencies
- Industry Products for Data Extraction and Integration
- Start-up Ventures
7 Representative Systems (Research Projects)
- SIMS/Ariadne, University of Southern California/ISI
- TSIMMIS, Stanford
- Information Manifold, AT&T Research
- Garlic, IBM Almaden
- Tukwila, University of Washington
- InfoSleuth, MCC
- DISCO, University of Maryland/INRIA
- HERMES, University of Maryland
- InfoMaster, Stanford
- InfoQuilt, University of Georgia
8 Information Modeling
- Multiple, heterogeneous, autonomously created information sources
- User sees an integrated (global) view
- Queries a mediated schema
- A uniform model for all sources
- Must be (at least) expressive enough to model the most complex information source
- Each source provides a set of relations or classes
- Translation (model) is done by wrapper at each source
- Integration
- Global as view, Local as view
9 Global as View
- For each relation (class) in mediated schema we specify how to obtain its tuples from the sources
[Figure labels: DOH Ratings; Address, Lat, Lon]
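A minimal sketch of the global-as-view idea, assuming three hypothetical wrapped sources (a restaurant listing, a DOH ratings source, and a geocoder that maps an address to lat/lon). All names and rows are made up for illustration; the mediated Restaurant relation is simply defined as a join over the sources, and a query is answered by unfolding that definition.

```python
# Hypothetical wrapped sources; every name and value here is illustrative only.
RESTAURANT_SOURCE = [
    {"name": "Peking Star", "address": "1 Broad St", "cuisine": "Chinese"},
    {"name": "Chinois", "address": "2720 Main St", "cuisine": "French-Chinese"},
]
DOH_RATINGS = [
    {"name": "Peking Star", "rating": "A"},
    {"name": "Chinois", "rating": "B"},
]
GEOCODER = {"1 Broad St": (34.05, -118.25), "2720 Main St": (34.00, -118.48)}


def restaurant_view():
    """GAV definition: the mediated relation
    Restaurant(name, address, cuisine, rating, lat, lon)
    is specified as a join over the three sources."""
    ratings = {r["name"]: r["rating"] for r in DOH_RATINGS}
    for row in RESTAURANT_SOURCE:
        if row["name"] in ratings and row["address"] in GEOCODER:
            lat, lon = GEOCODER[row["address"]]
            yield {**row, "rating": ratings[row["name"]], "lat": lat, "lon": lon}


def query(predicate):
    """Answer a query on the mediated relation by unfolding the view."""
    return [t for t in restaurant_view() if predicate(t)]


# Example query against the mediated schema: restaurants with an "A" rating.
a_rated = query(lambda t: t["rating"] == "A")
```

Because the view is written per mediated relation, adding a new source in this style means editing the view definition itself.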
10 Heterogeneity Resolution
- Sources may use different models
- OO, Relational, Legacy, ...
- May be Web sources
- Wrapper exports contents in a uniform model
- Structural and schematic differences
- (name, address) vs. (name, street, city, state, zip)
- Semantic
- (name, phonenumber) vs. (name, telephone)
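As a rough illustration of the differences listed above, the sketch below assumes two hypothetical sources: one stores a single address field and calls the phone attribute phonenumber, the other splits the address into parts and calls it telephone. Each wrapper exports records in one uniform model; the target field names (name, address, phone) are an assumption made for this example.

```python
def wrap_source_a(rec):
    """Source A: (name, address, phonenumber). Only the phone field is renamed."""
    return {"name": rec["name"], "address": rec["address"], "phone": rec["phonenumber"]}


def wrap_source_b(rec):
    """Source B: (name, street, city, state, zip, telephone).
    The address parts are merged and the phone field is renamed."""
    address = f'{rec["street"]}, {rec["city"]}, {rec["state"]} {rec["zip"]}'
    return {"name": rec["name"], "address": address, "phone": rec["telephone"]}


# Illustrative records (made-up values).
rec_a = {"name": "Chinois", "address": "2720 Main St", "phonenumber": "310-777-9876"}
rec_b = {"name": "Peking Star", "street": "1 Broad St", "city": "Los Angeles",
         "state": "CA", "zip": "90001", "telephone": "213-999-7676"}

uniform = [wrap_source_a(rec_a), wrap_source_b(rec_b)]
```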
11 Global as View Models
- KR based models (SIMS, Ariadne, ...)
- OO, based on ODMG (DISCO, Garlic)
interface Restaurant {
  attribute string name;
  attribute string address;
  attribute string cuisine;
  attribute string review;
}
extent restaurant0 of Restaurant wrapper w0 repository r0
  map ((zagts0=restaurant0) (name=n) (address=a) (cuisine=c));
12 Local as View
- For every information source S describe it in terms of relations in the mediated schema
v1(name, address, cuisine, rating) :- Restaurant(name, address, cuisine, rating) & city = Santa Monica
v2(name, foodrating) :- Restaurant(name, address, cuisine, rating) .
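A simplified sketch of how local-as-view descriptions might be used: each source is recorded with the attributes it exports and any constraint on its contents, and the mediator keeps only the sources that can contribute to a query. The SourceDescription class and the relevance test are assumptions for illustration; this is not a full query-rewriting algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDescription:
    name: str
    exports: tuple                                    # attributes the source returns
    constraints: dict = field(default_factory=dict)   # e.g. {"city": "Santa Monica"}


# Descriptions mirroring v1 and v2 on the slide.
SOURCES = [
    SourceDescription("v1", ("name", "address", "cuisine", "rating"),
                      {"city": "Santa Monica"}),
    SourceDescription("v2", ("name", "foodrating")),
]


def relevant_sources(wanted, selections):
    """Keep sources that export at least one wanted attribute and whose
    constraints do not contradict the query's selections."""
    result = []
    for s in SOURCES:
        contradicts = any(k in selections and selections[k] != v
                          for k, v in s.constraints.items())
        if not contradicts and set(wanted) & set(s.exports):
            result.append(s.name)
    return result
```

For a query on city = "Pasadena", v1 is pruned because its description says it only holds Santa Monica restaurants. Adding a source in this style only requires adding one more description.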
13 Query Planning and Optimization
- Mediator must generate an information gathering plan
- Constraints on execution
- Binding patterns ....
- Optimization of query plans
- Current areas of work
- Optimization
- Approximate answers (incomplete sources)
- Query planning for other sources such as simulations, computer programs etc.
- Query execution engines
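Binding patterns constrain the order in which sources can be accessed: a source that requires an input (for example, a geocoder that needs an address) can only be called once that input is bound. The sketch below is a simple greedy ordering under that constraint; the source names and their input/output attributes are hypothetical, and a real planner would also weigh cost.

```python
def order_sources(sources, bound):
    """sources: dict mapping name -> (required_inputs, outputs).
    bound: attributes bound by the query itself.
    Returns an executable access order, or raises ValueError if none exists."""
    bound = set(bound)
    remaining = dict(sources)
    plan = []
    while remaining:
        # A source is ready when all of its required inputs are already bound.
        ready = sorted(n for n, (needs, _) in remaining.items() if set(needs) <= bound)
        if not ready:
            raise ValueError(f"no executable order; stuck with {sorted(remaining)}")
        name = ready[0]
        plan.append(name)
        bound |= set(remaining.pop(name)[1])  # its outputs become bound
    return plan


# Hypothetical sources with their binding patterns.
SOURCES = {
    "geocoder": (("address",), ("lat", "lon")),
    "restaurants": (("cuisine",), ("name", "address")),
    "ratings": (("name",), ("rating",)),
}

plan = order_sources(SOURCES, bound={"cuisine"})
```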
14 Query Plans and Plan Quality
[Figure labels: Low-Quality Plan, High-Quality Plan]
15 Accessing Sources via Wrappers
SELECT address, tel FROM Restaurant WHERE cuisine
Chinois, 2720 Main St, 310-777-9876
Peking Star, 1 Broad St, 213-999-7676
.....
16 Semi-Automatic Wrapper Generation
- Need wrappers for several sites
- Building wrappers by hand is tedious and time consuming
- Approaches to automating the process
- Exploit format information (structure, HTML etc.)
- Template based approaches
- Machine learning techniques
<name> Peking Star </name> <address> 1 Broad Street, Los Angeles </address> <phone>31-822-1511
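A small sketch of a template-based wrapper, assuming a hypothetical page layout with one restaurant per list item. The page text, the regular-expression template, and the helper names are all illustrative; the output is marked-up text of the kind shown on the slide.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical page: one restaurant per <li>, name in <b>, address in <i>.
PAGE = """
<ul>
<li><b>Peking Star</b> <i>1 Broad Street, Los Angeles</i> Tel: 213-999-7676</li>
<li><b>Chinois</b> <i>2720 Main St</i> Tel: 310-777-9876</li>
</ul>
"""

# The "template": fixed formatting with named slots for the fields.
TEMPLATE = re.compile(
    r"<li><b>(?P<name>.*?)</b>\s*<i>(?P<address>.*?)</i>\s*Tel:\s*(?P<phone>[\d-]+)</li>"
)


def extract(page):
    """Return one dict per record that matches the template."""
    return [m.groupdict() for m in TEMPLATE.finditer(page)]


def to_xml(record):
    """Mark up an extracted record with one element per field."""
    return "".join(f"<{k}>{escape(v)}</{k}>" for k, v in record.items())


records = extract(PAGE)
```

A hand-written template like this breaks when the site changes its layout, which is part of the motivation for learning wrappers from examples.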
17 Wrappers .... Work in Progress
- Database wrappers
- Variety of techniques for Web wrappers
- Upmarking
- To XML
- Building Web-bases
- Other Artificial Intelligence techniques
- Natural Language Processing
- IR
- Classifiers
18 Performance Issue
- Query processing time is typically very high
- Despite the mediator generating efficient query plans
- Cost of fetching data and pages from remote sources dominates
- Have to typically fetch a large number of Web pages
- The Web sources are not designed for database like query access
- The Web sources can be slow
- Further improve performance by materializing data at the mediator side.
19 Store and Materialize Data Locally
[Figure labels: Wrapped Web Source (SLOW), Materialized Data (FAST)]
20 Selective Materialization
- Why not simply materialize all the data in all the Web sources being integrated and have a really fast mediator?
- Will not scale, amount of space needed may be too much
- Web sources can get updated
- Cost of keeping data consistent can get prohibitive
- We are building a mediator, not a data warehouse!
- Approach then is to selectively materialize data
- How do we automatically identify the portion of data most useful to materialize?
21 Selecting Data to Materialize
- Distribution of User Queries (Identify frequently accessed classes)
- Structure of Sources (Prefetch data to speed up expensive queries)
- Classes of Data to Materialize
- Updates (Have to consider maintenance cost)
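One way to combine these factors is a simple benefit score: time saved on frequent, expensive queries minus the cost of refreshing data that changes often. The sketch below is an assumed heuristic for illustration only, not the selection method of any particular mediator; the class names, numbers, and space budget are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataClass:
    name: str
    queries_per_day: float   # from the distribution of user queries
    remote_cost_s: float     # time to answer from the Web source
    updates_per_day: float   # how often the source changes
    refresh_cost_s: float    # time to re-fetch after an update
    size_mb: float


def benefit(c):
    """Seconds saved per day by materializing, net of maintenance."""
    return c.queries_per_day * c.remote_cost_s - c.updates_per_day * c.refresh_cost_s


def select_to_materialize(classes, budget_mb):
    """Greedily pick classes with the best benefit per MB within the space budget."""
    chosen, used = [], 0.0
    for c in sorted(classes, key=lambda c: benefit(c) / c.size_mb, reverse=True):
        if benefit(c) > 0 and used + c.size_mb <= budget_mb:
            chosen.append(c.name)
            used += c.size_mb
    return chosen


# Made-up workload statistics.
CLASSES = [
    DataClass("reviews", 200, 5, 1, 60, 50),
    DataClass("showtimes", 300, 4, 48, 30, 10),   # changes too often to be worth it
    DataClass("maps", 50, 8, 0.1, 100, 400),
]
```

With a 100 MB budget only the frequently queried, rarely updated class is kept; a larger budget also admits the bulky but stable one.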
22 Inconsistency Resolution
- Same object in different formats
- United States and US
- Red Lobster and The Red Lobster
- John Smith, Smith, J., J. Smith, Dr. John Smith ...
- Has appeared in other database and IR contexts
- Solutions
- Mapping tables
- For finite domains (such as cities, countries, companies)
- Simply maintain an enumerated list of possible formats for each object
- (New York, N.Y., NYC, New York City, Big
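A mapping table for a finite domain can be sketched as a lookup from every known format to one canonical value. The table contents below come from the examples on this slide; the function names and the case-insensitive lookup are assumptions.

```python
# Canonical value -> enumerated list of known formats.
MAPPING_TABLE = {
    "New York": {"New York", "N.Y.", "NYC", "New York City"},
    "United States": {"United States", "US"},
}

# Inverted index: lower-cased format -> canonical value.
LOOKUP = {
    variant.lower(): canon
    for canon, variants in MAPPING_TABLE.items()
    for variant in variants
}


def canonical(value):
    """Return the canonical form, or None if the format is not in the table."""
    return LOOKUP.get(value.strip().lower())


def same_object(a, b):
    """Two values denote the same object if they map to the same canonical form."""
    ca, cb = canonical(a), canonical(b)
    return ca is not None and ca == cb
```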
23 Mapping Functions
- Mapping functions
- When domain is not finite (person names)
- Domain specific mapping transformations
- Stemming common words (Inc., Corp., The etc.)
- Matching full word and abbreviation
- Match 2 formats with a score
- Current work
- Learning mapping functions from example matches
- IR based approaches
- Building metabases
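A minimal sketch of a mapping function that matches two formats with a score: common words are dropped first, then the remaining strings are compared. The stop-word list follows the slide; using Python's difflib.SequenceMatcher as the similarity measure is a stand-in chosen for this example, not the method of any system named in the talk.

```python
import re
from difflib import SequenceMatcher

# Common words to drop before comparing (from the slide: Inc., Corp., The).
STOP_WORDS = {"inc", "corp", "the"}


def normalize(name):
    """Lower-case, strip punctuation, and remove common words."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


def match_score(a, b):
    """Similarity in [0, 1] between the normalized forms of two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

For example, match_score("Red Lobster", "The Red Lobster") is 1.0 because both normalize to the same string. A threshold on the score (a tunable assumption) would then decide whether two formats are treated as the same object.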
24 Mediator Prototypes and Software
- Software and tools from mediator research projects
- What may be available.
- Mediator kernels (integration engines)
- Data modeling tools, Description Logic systems
- Wrapper and extractor toolkits and software
- Plenty of papers!
- Ariadne, USC/ISI, http://www.isi.edu/ariadne
- TSIMMIS, Stanford, http://www-db.stanford.edu/tsimmis/
- MIX, UCSD, http://feast.ucsd.edu/Projects/MIX/
- InfoSleuth, MCC, http://www.mcc.com/projects/infosleuth/
- DISCO, U Maryland, http://www.umiacs.umd.edu/labs/CLIP/im.html
- Garlic, IBM Almaden, http://www.almaden.ibm.com/cs/garlic.html
- Tukwila, U Washington, http://data.cs.washington.e
25 Applications of Mediators
- Heterogeneous and Distributed Database Integration
- Legacy systems integration
- Web Sources Integration
- Data Integration for E-commerce
- Integrating product catalogs, multiple vendors
- Data Warehousing
- For populating data warehouses
- Bioinformatics
- Information Management Environments
- Digital Libraries
- Healthcare Information Systems
26 Industry Products (IBM DB2 DataJoiner)
- IBM DB2 DataJoiner
- http://www-4.ibm.com/software/data/datajoiner/
- Enterprise data integration middleware
- DataJoiner functionality now incorporated in IBM DB2 UDB
- http://www-4.ibm.com/software/data/db2/udb/about.html
- Native support for popular relational data sources
- DB2, Informix, SQL Server, Sybase, Teradata and others
- Supports non relational data sources
- Support for Web data
- Available on variety of platforms and OS
27 Start-up Ventures: Junglee Corp
- Website: www.amazon.com (Acquired)
- Researcher Founders: Rajaraman, Gupta, Harinarayan, Mathur
- Products and Services
- Tools for data extraction and integration
- Building warehouse from multiple Web sources
- Integrating apartment listings from multiple sources
- Integrating job postings from multiple online job sources
- Market focus: Online shopping
- Current Status: Acquired by Amazon
- Similar ventures: Netbots Inc. (www.excite.com), Acquired by Excite
28 Cohera
- Website: www.cohera.com
- Researcher Founders: Stonebraker, Hellerstein
- Products and Services
- Cohera E-Catalog System
- Integrates product data from multiple sellers and product catalogs
- Set of software servers and tools for building and running live e-catalogs
- Market(s) Targeted: E-Commerce
- Customers: E-Commerce communities (ThomasNet, Trapezo, LiveListings, FoodService.Com)
- Current Status: Founded October 1997, Privately Held
- Similar ventures: Enosys Markets Inc. (www.enosysmarkets.com), Mergent Inc. (www.mergent.com)
29 Nimble Technology
- Website: www.nimble.com
- Researcher Founders: Levy, Weld
- Products and Services
- Nimble Data Integration Suite
- XML based integration approach
- Current focus on multiple information sources integration
- Tools for data extraction and Data Integration Engine
- Market focus: CRM, Business Intelligence, B2B, Portals
- Current Status: Founded June 1999, Privately Held
30 WhizbangLabs!
- Website: www.whizbanglabs.com
- Researcher Founders: Quass, Geddes, Mitchell
- Products and Services
- Technology for building Webbases: databases created by extracting data from Web pages
- Topic specific
- Topic specific crawler for retrieving pages
- Tools for extracting data from Web pages, cleaning data and loading into database
- Market focus: Content providing portals
- Current Status: Founded March 1999, Privately held
- Similar ventures: Fetch Technologies
31 Bioinformatics: A Data Integration Grand Challenge
- Mapping of Human Genetic Code complete
- New, revolutionary, computational approach to drug discovery
- Huge amounts of genetic, chemical and biological data being generated at an exponential rate in biotech/pharma R&D
- Complex structures, maps, sequence data etc.
- Drug discovery scientists need integrated access to this data
- Look for patterns across data sources
- Need to integrate data from multiple labs
- Lab procedures (thus the data) keep changing
- Good amount of genomic data is free text
- DiscoveryLink: State of the art Life Sciences data integration middleware from IBM
- http://www-4.ibm.com/software/webservers/lifescien
- Information mediation
- Issues in building such systems
- Research projects
- Industry products
- Start-up ventures
- Applicable to wide areas such as E-commerce,
database and legacy systems integration, Web
source extraction, content management, portals,
digital libraries, bioinformatics.