Naveen Ashish - PowerPoint PPT Presentation

About This Presentation
Title:

Naveen Ashish

Description:

Naveen Ashish Amit P. Sheth Department of Computer Science and Large Scale Distributed Information Systems Lab University of Georgia, Athens What is an Information ... – PowerPoint PPT presentation


Transcript and Presenter's Notes


1
Information Mediation: Integrating Information
from Multiple Information Sources
  • Naveen Ashish
  • Amit P. Sheth
  • Department of Computer Science and
    Large Scale Distributed Information Systems Lab
  • University of Georgia, Athens

2
What is an Information Agent/Mediator?
  • A software system that provides integrated and
    structured query access to multiple distributed
    information sources
  • Sources may be databases of various kinds or Web
    sources
  • Sources are autonomously created and
    heterogeneous
  • Accessible via a network
  • Mediator provides the illusion of a single
    information source

3
Information Agents aka Mediators
Example: Restaurant and Theatre Info on the Web
[Figure: the Ariadne Mediator integrating Map Servers, Geocoders, Zagat, Health Ratings, and Movies sources]

4
Why the Interest in Building Such Systems?
[Figure: a MEDIATOR connected to Oracle, Sybase, IBM DB2, a Legacy System, and an Object-Oriented DB]

5
Mediators on the Web
[Figure: a MEDIATOR connected to DB1, DB2, and a Wrapper]

6
Organization of Remainder of Talk
  • Introduction
  • Information Agents, System Architecture
  • Research Issues
  • Information Modeling
  • Query Planning
  • Semi-automatic Wrapper Generation
  • Performance Optimization by Materialization
  • Resolving Inconsistencies
  • Industry Products for Data Extraction and
    Integration
  • Start-up Ventures

7
Representative Systems (Research Projects)
  • SIMS/Ariadne: University of Southern
    California/ISI
  • TSIMMIS: Stanford
  • Information Manifold: AT&T Research
  • Garlic: IBM Almaden
  • Tukwila: University of Washington
  • InfoSleuth: MCC
  • DISCO: University of Maryland/INRIA
  • HERMES: University of Maryland
  • InfoMaster: Stanford
  • InfoQuilt: University of Georgia

8
Information Modeling
  • Multiple, heterogeneous, autonomously created
    information sources
  • User sees an integrated (global) view
  • Queries a mediated schema
  • A uniform model for all sources
  • Must be (at least) expressive enough to model the
    most complex information source
  • Each source provides a set of relations or
    classes
  • Translation (model) is done by wrapper at each
    source
  • Integration
  • Global as view, Local as view

9
Global as View
  • For each relation (class) in mediated schema we
    specify how to obtain its tuples from the sources

[Figure: the mediated RESTAURANT relation defined over the sources ZAGAT, FODORS, DOH Ratings, and GEOCODER; attribute labels shown: Name, Phonenumber, Phone, Telephone, Address, Rating, Reviews, Lat, Lon]
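A minimal Python sketch of the global-as-view idea: the mediated RESTAURANT relation is written as a function over the source relations. The source rows, the rating values, the choice of attributes, and the name `restaurant_view` are hypothetical and chosen only for illustration; they are not taken from Ariadne or any other system named in the talk.

```python
# Hypothetical source relations, as wrappers might export them.
ZAGAT = [
    {"name": "Chinois", "address": "2720 Main St", "rating": 26},
    {"name": "Peking Star", "address": "1 Broad St", "rating": 21},
]
FODORS = [
    {"name": "Chinois", "telephone": "310-777-9876"},
    {"name": "Peking Star", "telephone": "213-999-7676"},
]


def restaurant_view(zagat, fodors):
    """Global-as-view definition (illustrative): the tuples of
    RESTAURANT(name, address, phone, rating) are obtained by joining
    the ZAGAT and FODORS source relations on name."""
    phones = {row["name"]: row["telephone"] for row in fodors}
    return [
        {
            "name": z["name"],
            "address": z["address"],
            "phone": phones[z["name"]],
            "rating": z["rating"],
        }
        for z in zagat
        if z["name"] in phones
    ]


restaurants = restaurant_view(ZAGAT, FODORS)
```

Because the mediated relation is defined directly over the sources, answering a query against RESTAURANT amounts to unfolding this definition; adding a new source means editing the definition.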
10
Heterogeneity Resolution
  • Sources may use different models
  • OO, Relational, Legacy, ...
  • May be Web sources
  • Wrapper exports contents in a uniform model
  • Structural and schematic differences
  • (name, address) vs. (name, street, city, state, zip)
  • Semantic
  • (name, phonenumber) vs. (name, telephone)
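The two kinds of difference can be sketched in Python as follows. This is only an illustration under stated assumptions: the address format "street, city, STATE ZIP", the state and zip values, and the helper names `split_address` and `to_uniform` are hypothetical and not part of the talk.

```python
def split_address(address):
    """Structural mapping: one 'address' string becomes
    (street, city, state, zip). Assumes the hypothetical format
    'street, city, STATE ZIP'."""
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.split()
    return {"street": street, "city": city, "state": state, "zip": zip_code}


# Semantic mapping: different attribute names for the same concept.
ATTRIBUTE_SYNONYMS = {"telephone": "phonenumber"}


def to_uniform(record):
    """Translate one source record into the uniform model."""
    uniform = {}
    for key, value in record.items():
        if key == "address":
            uniform.update(split_address(value))
        else:
            uniform[ATTRIBUTE_SYNONYMS.get(key, key)] = value
    return uniform


record = to_uniform(
    {
        "name": "Peking Star",
        "address": "1 Broad St, Los Angeles, CA 90012",  # state/zip invented
        "telephone": "213-999-7676",
    }
)
```

A wrapper could apply a translation of this kind before exporting its contents to the mediator.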

11
Global as View Models
  • KR based models (SIMS, Ariadne, ...)
  • LOOM, CLASSIC
  • OO, based on ODMG (DISCO, Garlic)

interface Restaurant {
    attribute string name;
    attribute string address;
    attribute string cuisine;
    attribute string review;
}
extent restaurant0 of Restaurant wrapper w0 repository r0
    map ((zagts0=restaurant0) (name=n) (address=a) (cuisine=c));
12
Local as View
  • For every information source S describe it in
    terms of relations in the mediated schema

v1(name, address, cuisine, rating) :-
    Restaurant(name, address, cuisine, rating), city = "Santa Monica"
v2(name, foodrating) :-
    Restaurant(name, address, cuisine, rating)
.
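A minimal Python sketch of how local-as-view source descriptions might be stored and used to pick candidate sources for a query. The dictionary layout and the function `relevant_sources` are hypothetical. A real mediator would rewrite the query in terms of the views; this sketch only filters out sources that cannot contribute.

```python
# Each source is described in terms of the mediated relation Restaurant:
# which attributes it exports and which constraints its tuples satisfy.
SOURCE_DESCRIPTIONS = {
    "v1": {
        "exports": {"name", "address", "cuisine", "rating"},
        "constraints": {"city": "Santa Monica"},
    },
    "v2": {
        "exports": {"name", "foodrating"},
        "constraints": {},
    },
}


def relevant_sources(wanted, conditions, descriptions):
    """Return the sources that export every wanted attribute and whose
    constraints do not contradict the query conditions."""
    result = []
    for source, desc in sorted(descriptions.items()):
        if not wanted <= desc["exports"]:
            continue
        contradicts = any(
            attr in conditions and conditions[attr] != value
            for attr, value in desc["constraints"].items()
        )
        if not contradicts:
            result.append(source)
    return result


santa_monica = relevant_sources(
    {"name", "address"}, {"city": "Santa Monica"}, SOURCE_DESCRIPTIONS
)
pasadena = relevant_sources({"name"}, {"city": "Pasadena"}, SOURCE_DESCRIPTIONS)
```

Adding a source only requires adding one description; the mediated schema and the other descriptions stay unchanged.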
13
Query Planning and Optimization
  • Mediator must generate an information gathering
    plan
  • Constraints on execution
  • Binding patterns ....
  • Optimization of query plans
  • Current areas of work
  • Optimization
  • Approximate answers (incomplete sources)
  • Query planning for other sources such as
    simulations, computer programs etc.
  • Query execution engines
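Binding patterns constrain the order in which sources can be accessed: a source can be called only once its required inputs are known. The following Python sketch orders source accesses greedily under that constraint. The source names, their input and output attributes, and the function `order_sources` are hypothetical; this is not the planner of any system named in the talk, and it ignores plan cost.

```python
def order_sources(sources, bound):
    """Greedy ordering under binding patterns.

    sources: dict mapping a source name to (required_inputs, outputs),
             both sets of attribute names.
    bound:   attributes already bound by the query.
    Returns a list of source names in an executable order, or raises
    ValueError if some source can never have its inputs bound.
    """
    bound = set(bound)
    remaining = dict(sources)
    plan = []
    while remaining:
        ready = sorted(
            name for name, (needs, _) in remaining.items() if needs <= bound
        )
        if not ready:
            raise ValueError("no executable plan: unbound inputs")
        name = ready[0]
        plan.append(name)
        bound |= remaining.pop(name)[1]  # outputs become available bindings
    return plan


SOURCES = {
    "geocoder": ({"address"}, {"lat", "lon"}),
    "zagat": ({"cuisine"}, {"name", "address"}),
    "map_server": ({"lat", "lon"}, {"map"}),
}
plan = order_sources(SOURCES, bound={"cuisine"})
```

With only `cuisine` bound, the restaurant source has to be called first, then the geocoder, then the map server.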

14
Query Plans and Plan Quality
[Figure: a Low-Quality Plan compared with a High-Quality Plan]

15
Accessing Sources via Wrappers
SELECT address, tel FROM Restaurant WHERE cuisine = 'chinese'

Chinois, 2720 Main St, 310-777-9876
Peking Star, 1 Broad St, 213-999-7676
.....
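A minimal Python sketch of what a wrapper does: it turns semi-structured page text into tuples and answers a selection query over them. The page layout (fields separated by "|"), the third row, and the functions `wrap` and `select` are hypothetical, chosen only to make the example self-contained.

```python
# Stand-in for the text of a fetched Web page (layout is invented).
PAGE = """\
Chinois | chinese | 2720 Main St | 310-777-9876
Peking Star | chinese | 1 Broad St | 213-999-7676
Trattoria Roma | italian | 9 Elm Ave | 310-555-0100
"""


def wrap(page):
    """Extract one tuple per non-empty line of the page."""
    fields = ("name", "cuisine", "address", "tel")
    return [
        dict(zip(fields, (cell.strip() for cell in line.split("|"))))
        for line in page.splitlines()
        if line.strip()
    ]


def select(rows, columns, **where):
    """Answer SELECT columns FROM rows WHERE every key = value."""
    return [
        tuple(row[col] for col in columns)
        for row in rows
        if all(row[key] == value for key, value in where.items())
    ]


answer = select(wrap(PAGE), ("address", "tel"), cuisine="chinese")
```

The mediator sees only the relational interface; the page format is hidden inside `wrap`.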
16
Semi-Automatic Wrapper Generation
  • Need wrappers for several sites
  • Building wrappers by hand is tedious and time
    consuming
  • Approaches to automating the process
  • Exploit format information (structure, HTML, etc.)
  • Template based approaches
  • Machine learning techniques
  • XML

<name> Peking Star </name> <address> 1 Broad
Street, Los Angeles </address> <phone>31-822-1511
</phone>
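When a source already marks up its data in XML, extraction becomes straightforward. A minimal Python sketch using the standard library parser follows; the enclosing `<record>` element and the function `record_from_xml` are assumptions added for illustration, since the slide shows only the three tagged fields.

```python
import xml.etree.ElementTree as ET


def record_from_xml(fragment):
    """Parse a run of sibling elements into a dict of tag -> text.
    The fragment is wrapped in a <record> root (an assumption) so that
    it forms a well-formed document."""
    root = ET.fromstring("<record>" + fragment + "</record>")
    return {child.tag: (child.text or "").strip() for child in root}


fragment = (
    "<name> Peking Star </name> "
    "<address> 1 Broad Street, Los Angeles </address> "
    "<phone>31-822-1511</phone>"
)
record = record_from_xml(fragment)
```

No site-specific extraction rules are needed here, which is why XML appears among the approaches to reducing wrapper-building effort.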
17
Wrappers .... Work in Progress
  • Database wrappers
  • Variety of techniques for Web wrappers
  • Upmarking
  • To XML
  • Building Web-bases
  • Other Artificial Intelligence techniques
  • Natural Language Processing
  • IR
  • Classifiers

18
Performance Issue
  • Query processing time is typically very high
  • Despite the mediator generating efficient query
    plans
  • Cost of fetching data and pages from remote
    sources dominates
  • Have to typically fetch a large number of Web
    pages
  • The Web sources are not designed for database-like
    query access
  • The Web sources can be slow
  • Further improve performance by materializing data
    at the mediator side.

19
Store and Materialize Data Locally
[Figure: the MEDIATOR answers from Materialized Data (FAST) instead of the Wrapped Web Source (SLOW)]
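A minimal Python sketch of answering from locally materialized data before falling back to the remote source. The class `Mediator`, its methods, and the stand-in function `slow_source` are hypothetical; the sketch ignores updates and consistency.

```python
class Mediator:
    """Answers from materialized data when possible, otherwise fetches
    from the (slow) wrapped source."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote
        self._materialized = {}
        self.remote_calls = 0  # counts accesses to the remote source

    def _fetch(self, key):
        self.remote_calls += 1
        return self._fetch_remote(key)

    def materialize(self, key):
        """Fetch once and store the result at the mediator side."""
        self._materialized[key] = self._fetch(key)

    def query(self, key):
        if key in self._materialized:
            return self._materialized[key]  # fast path, no remote access
        return self._fetch(key)


def slow_source(cuisine):
    """Stand-in for a wrapped Web source."""
    return {"chinese": ["Chinois", "Peking Star"]}.get(cuisine, [])


mediator = Mediator(slow_source)
mediator.materialize("chinese")
first = mediator.query("chinese")   # served from materialized data
second = mediator.query("italian")  # goes to the remote source
```

Only the materialization itself and the second query touch the remote source.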
20
Selective Materialization
  • Why not simply materialize all the data in all
    the Web sources being integrated and have a
    really fast mediator?
  • Will not scale, amount of space needed may be too
    much
  • Web sources can get updated
  • Cost of keeping data consistent can get
    prohibitive
  • We are building a mediator, not a data warehouse!
  • Approach then is to selectively materialize data
  • How do we automatically identify the portion of
    data most useful to materialize ?

21
Selecting Data to Materialize
[Figure: three factors feed into SELECTING CLASSES, which yields the Classes of Data to Materialize]
  • Distribution of User Queries (identify frequently
    accessed classes)
  • Structure of Sources (prefetch data to speed up
    expensive queries)
  • Updates (have to consider maintenance cost)
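One way these factors could be combined is sketched below in Python. The scoring formula (query frequency times remote cost, minus update cost), the greedy selection under a space budget, and all numbers are assumptions made for illustration; the talk does not specify how the factors are weighed.

```python
def select_classes(candidates, space_budget):
    """Greedily pick classes of data to materialize.

    Each candidate is a dict with: name, query_freq, remote_cost,
    update_cost, size. Classes whose maintenance cost exceeds the
    expected saving are never chosen.
    """

    def net_benefit(c):
        return c["query_freq"] * c["remote_cost"] - c["update_cost"]

    ranked = sorted(
        (c for c in candidates if net_benefit(c) > 0),
        key=lambda c: net_benefit(c) / c["size"],
        reverse=True,
    )
    chosen, used = [], 0
    for c in ranked:
        if used + c["size"] <= space_budget:
            chosen.append(c["name"])
            used += c["size"]
    return chosen


# Hypothetical statistics for three classes of data.
CANDIDATES = [
    {"name": "chinese_restaurants", "query_freq": 50, "remote_cost": 4.0,
     "update_cost": 20.0, "size": 10},
    {"name": "movie_showtimes", "query_freq": 80, "remote_cost": 2.0,
     "update_cost": 200.0, "size": 5},
    {"name": "health_ratings", "query_freq": 5, "remote_cost": 6.0,
     "update_cost": 1.0, "size": 8},
]
chosen = select_classes(CANDIDATES, space_budget=15)
```

Here the frequently updated class is rejected despite being queried often, because its maintenance cost outweighs the saving.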
22
Inconsistency Resolution
  • Same object in different formats
  • United States and US
  • Red Lobster and The Red Lobster
  • John Smith, Smith, J., J. Smith, Dr.
    John Smith ...
  • Has appeared in other database and IR contexts
  • Solutions
  • Mapping tables
  • For finite domains (such as cities, countries,
    companies)
  • Simply maintain an enumerated list of possible
    formats for each object
  • (New York, N.Y., NYC, New York City, Big
    Apple)
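A mapping table of this kind can be sketched in Python as a lookup from each known format to one canonical form. The table contents come from the examples on this slide; the case-insensitive matching and the function name `canonicalize` are assumptions.

```python
# Canonical form -> enumerated list of formats seen in the sources.
MAPPING_TABLE = {
    "New York": ["New York", "N.Y.", "NYC", "New York City", "Big Apple"],
    "United States": ["United States", "US"],
}

# Inverted index: any known format (lower-cased) -> canonical form.
LOOKUP = {
    variant.lower(): canonical
    for canonical, variants in MAPPING_TABLE.items()
    for variant in variants
}


def canonicalize(value):
    """Return the canonical form, or the value unchanged if unknown."""
    return LOOKUP.get(value.strip().lower(), value)
```

For example, `canonicalize("NYC")` and `canonicalize("Big Apple")` both return "New York", so tuples from different sources can be joined on the canonical value.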

23
Mapping Functions
  • Mapping functions
  • When domain is not finite (person names)
  • Domain specific mapping transformations
  • Stemming common words (Inc., Corp., The etc.)
  • Matching full word and abbreviation
  • Match 2 formats with a score
  • Current work
  • Learning mapping functions from example matches
  • IR based approaches
  • Building metabases
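A minimal Python sketch of a mapping function that drops common words and matches two formats with a score. The word list, the tokenization, and the use of `difflib.SequenceMatcher` as the similarity measure are stand-ins chosen for illustration; the talk does not say which scoring method is used.

```python
import re
from difflib import SequenceMatcher

# Common words to ignore when comparing names (illustrative list).
STOP_WORDS = {"inc", "corp", "the", "dr"}


def normalize(name):
    """Lower-case, split into alphanumeric tokens, drop common words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def match_score(a, b):
    """Score in [0, 1]: 1.0 when the normalized token sets agree,
    otherwise a string similarity over the sorted tokens."""
    ta, tb = sorted(normalize(a)), sorted(normalize(b))
    if ta == tb:
        return 1.0
    return SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()


same = match_score("Red Lobster", "The Red Lobster")
abbreviated = match_score("John Smith", "Smith, J.")
different = match_score("Red Lobster", "Olive Garden")
```

A threshold on the score then decides whether two values denote the same object; choosing that threshold is domain specific.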

24
Mediator Prototypes and Software
  • Software and tools from mediator research
    projects
  • What may be available.
  • Mediator kernels (integration engines)
  • Data modeling tools, Description Logic systems
  • Wrapper and extractor toolkits and software
  • Plenty of papers!
  • Ariadne, USC/ISI, http://www.isi.edu/ariadne
  • TSIMMIS, Stanford, http://www-db.stanford.edu/tsimmis/
  • MIX, UCSD, http://feast.ucsd.edu/Projects/MIX/
  • InfoSleuth, MCC, http://www.mcc.com/projects/infosleuth/
  • DISCO, U Maryland, http://www.umiacs.umd.edu/labs/CLIP/im.html
  • Garlic, IBM Almaden, http://www.almaden.ibm.com/cs/garlic.html
  • Tukwila, U Washington, http://data.cs.washington.edu/integration/tukwila/

25
Applications of Mediators
  • Heterogeneous and Distributed Database
    Integration
  • Legacy systems integration
  • Web Sources Integration
  • Data Integration for E-commerce
  • Integrating product catalogs, multiple vendors
  • Data Warehousing
  • For populating data warehouses
  • Bioinformatics
  • Information Management Environments
  • Digital Libraries
  • Healthcare Information Systems

26
Industry Products (IBM DB2 DataJoiner)
  • IBM DB2 DataJoiner
  • http://www-4.ibm.com/software/data/datajoiner/
  • Enterprise data integration middleware
  • DataJoiner functionality now incorporated in IBM
    DB2 UDB
  • http://www-4.ibm.com/software/data/db2/udb/about.html
  • Native support for popular relational data
    sources
  • DB2, Informix, SQL Server, Sybase, Teradata and
    others
  • Supports non relational data sources
  • Support for Web data
  • Available on variety of platforms and OS

27
Start-up Ventures: Junglee Corp
  • Website: www.amazon.com (Acquired)
  • Researcher Founders: Rajaraman, Gupta,
    Harinarayan, Mathur
  • Products and Services
  • Tools for data extraction and integration
  • Building warehouse from multiple Web sources
  • Integrating apartment listings from multiple
    sources
  • Integrating job postings from multiple online job
    sources
  • Market focus: Online shopping
  • Current Status: Acquired by Amazon
  • Similar ventures: Netbots Inc. (www.excite.com),
    acquired by Excite

28
Cohera
  • Website: www.cohera.com
  • Researcher Founders: Stonebraker, Hellerstein
  • Products and Services
  • Cohera E-Catalog System
  • Integrates product data from multiple sellers and
    product catalogs
  • Set of software servers and tools for building
    and running live e-catalogs
  • Market(s) Targeted: E-Commerce
  • Customers: E-Commerce communities - ThomasNet,
    Trapezo, LiveListings, FoodService.Com
  • Current Status: Founded October 1997, Privately
    Held
  • Similar ventures: Enosys Markets Inc.
    (www.enosysmarkets.com),
    Mergent Inc. (www.mergent.com)

29
Nimble Technology
  • Website: www.nimble.com
  • Researcher Founders: Levy, Weld
  • Products and Services
  • Nimble Data Integration Suite
  • XML-based integration approach
  • Current focus on integration of multiple
    information sources
  • Tools for data extraction and Data Integration
    Engine
  • Market focus: CRM, Business Intelligence, B2B,
    Portals
  • Current Status: Founded June 1999, Privately Held

30
WhizBang! Labs
  • Website: www.whizbanglabs.com
  • Researcher Founders: Quass, Geddes, Mitchell
  • Products and Services
  • Technology for building Webbases - databases
    created by extracting data from Web pages
  • Topic specific
  • Topic specific crawler for retrieving pages
  • Tools for extracting data from Web pages,
    cleaning data and loading into database
  • Market focus: Content providing portals
  • Current Status: Founded March 1999, Privately
    held
  • Similar ventures: Fetch Technologies
    (www.fetch.com)

31
Bioinformatics: A Data Integration Grand Challenge
  • Mapping of Human Genetic Code complete
  • New, revolutionary, computational approach to
    drug discovery
  • Huge amounts of genetic, chemical and biological
    data being generated at an exponential rate in
    biotech/pharma R&D
  • Complex structures, maps, sequence data etc.
  • Drug discovery scientists need integrated access
    to this data
  • Look for patterns across data sources
  • Need to integrate data from multiple labs
  • Lab procedures (and thus the data) keep changing
  • A good amount of genomic data is free text
  • DiscoveryLink: state-of-the-art Life Sciences
    data integration middleware from IBM
  • http://www-4.ibm.com/software/webservers/lifesciences/discovery.html

32
Conclusion
  • Information mediation
  • Issues in building such systems
  • Research projects
  • Industry products
  • Start-up ventures
  • Applicable to wide areas such as E-commerce,
    database and legacy systems integration, Web
    source extraction, content management, portals,
    digital libraries, bioinformatics.