Artemis: Integrating Scientific Data on the Grid

Provided by: ISI4
Learn more at: https://www.isi.edu

Transcript and Presenter's Notes


1
Artemis: Integrating Scientific Data on the Grid
  • Rattapoom Tuchinda
  • Snehal Thakkar
  • Yolanda Gil
  • Ewa Deelman

2
Outline
  • Motivation
  • Data integration needs in scientific applications
  • Distributed computing in grids
  • Problem statement
  • Artemis architecture
  • Evaluation
  • Related Work
  • Conclusions and future work

3
Scientific Data Integration
  • Large-scale, cross-disciplinary scientific data
    collection, storage, and analysis exacerbate
    heterogeneity and dynamics
  • National Virtual Observatory (NVO)
  • Earth System Grid (ESG)

4
Grid Computing [Foster & Kesselman 04]
  • Grids provide middleware services for distributed
    computing
  • Seamless integration and management of resources:
    OGSA
  • Job submission and execution management: Condor
  • Resource availability and performance: Monitoring
    and Directory Service (MDS)
  • Data replication for robustness and efficiency:
    Replica Location Service (RLS)
  • Descriptions of data sources: Metadata Catalog
    Service (MCS)

[From Kesselman 04]
  • Security policy must underlie access management
    decisions
  • Many sources of data, services, computation
  • Resource management is needed to ensure progress
    and arbitrate competing demands
  • Exploration and analysis may involve complex,
    multi-step workflows
  • Data integration activities may require access
    to, and exploration/analysis of, data at many
    locations

5
Scientific Data Storage and Access
  • Data sources are very heterogeneous
    • Data results from various instruments,
      disciplines, and types of analyses
    • Wide variety of data storage systems (files, DBs,
      servers, etc.)
  • Data sources are highly distributed
    • Data stored in different locations on the grid
    • Data is replicated in multiple locations
  • Data sources are highly dynamic
    • Data grows continuously, new data models are
      routine
    • New data sources regularly appear
    • Data sources may become unavailable sporadically
  • Data available at unprecedented scale
    • Very soon petabytes
  • These challenges are in the way of scientific
    progress in many disciplines

6
Data Storage and Access in Grids
  • Data described with metadata attributes
    • Attribute names may not be consistent across
      different sources
    • Metadata descriptions often stored separately
      from the data itself
  • Metadata Catalog Service (MCS) [Moore et al 01,
    Singh et al 03]
    • Stores descriptive metadata and allows users to
      query based on desired attributes
    • Addresses heterogeneity of data source
      implementations and access

7
Sample Query
  • Search constraints
    • keywords: "atmospheric data" or "climate data"
      or "climate model"
    • model type: "CCSM" or "PCM"
    • period: 2001
  • Search results: files, collections, or views
    /CCSM2/b20.007/atm
    /PCM/B06.62/atm
    /PCM/B06.20/atm
    /PCM/B06.21/atm

8
Problem Statement
  • Users should have seamless single-point access
    • Should not have to formulate a different query
      for each source
    • Should not have to manage the unavailability of
      data sources
  • Users need assistance formulating the queries
    • Data models may have different attribute names
      and representations (even from the same source)
    • New data models/metadata attributes are created
      all the time

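[Figure: three metadata catalogs (MCS1, MCS2, MCS3) in front of three data sources (DB1, DB2, DB3), each needing its own query (q1, q2, q3). Attribute names shown in the diagram: stime, etime, descr, sub, starttime, endtime]
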
9
Artemis
  • A mixed-initiative data integration system that
    aims to
    • Abstract users from diversity in attribute
      representations
    • Assist users in formulating queries step by step
    • Manage the access and availability of dynamic
      collections of data sources
  • Integrates and extends various AI techniques
    • Data integration
    • Ontology
    • Dialogue wizards

10
Approach
11
Artemis Architecture
[Figure: architecture diagram. Labeled components: MCS Wizard, Dynamic Model Generator, Prometheus Query Mediator, Ontology, Models, Model Mappings, Entity selection, Filters, three Metadata Catalog Services, and three Data Sources]

12
MCS Wizard
  • Based on the Agent Wizard [Tuchinda 2003]
  • Domain experts create mappings between ontologies
    and metadata attributes
  • Users can then pick the ontology and the mappings
    relevant to their domain
  • Guides the user through available operations and
    filters consistent with the models of the data
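
The mapping idea on this slide can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the ontology terms, the assignment of attribute names (`stime`, `etime`, `starttime`, `endtime`) to particular catalogs, and the `translate` helper are chosen only for illustration and do not reflect the actual MCS Wizard implementation.

```python
# Hypothetical expert-authored mappings:
# ontology term -> attribute name used by each catalog.
MAPPINGS = {
    "start_time": {"MCS1": "stime", "MCS3": "starttime"},
    "end_time": {"MCS1": "etime", "MCS3": "endtime"},
}


def translate(constraints, catalog):
    """Rewrite ontology-level constraints into one catalog's attribute names.

    Returns None when the catalog has no mapping for some term, meaning
    the catalog cannot answer the query as posed.
    """
    translated = []
    for term, op, value in constraints:
        attribute = MAPPINGS.get(term, {}).get(catalog)
        if attribute is None:
            return None
        translated.append((attribute, op, value))
    return translated


# The user states the query once, in ontology terms.
query = [("start_time", ">", 50000), ("end_time", "<", 60000)]
for catalog in ("MCS1", "MCS2", "MCS3"):
    print(catalog, translate(query, catalog))
```

The user writes one query against the ontology, and the per-catalog attribute names stay hidden behind the mapping table.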

13
Prometheus Query Mediator
  • Data integration system from earlier research
    [Thakkar et al. 2004, Knoblock et al. 2003]
  • Provides unified query interface to a wide
    variety of data sources
  • Relational model
  • Requires pre-defined domain model relating
    sources to domain relations
  • Extended in Artemis to support
    • Source relations: various MCSs
    • Domain relations: File, View, Collection
    • Dynamic domain model based on availability of
      data sources

14
Dynamic Model Generation
  • Generate mediator model dynamically by querying
    MCSs
  • Convert object-oriented model of MCSs to
    relational model of the mediator
  • Handles dynamic nature of data by generating new
    domain models at query time
  • Intuitive idea
    • Query MCSs one at a time for all possible
      attributes of different objects
    • Create domain relation for each object type with
      all possible attributes
    • Create rules defining each MCS as data source
    • Relate various data sources to domain relations
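
The steps of the intuitive idea can be sketched in a few lines of Python. This is a simplified illustration, not the Artemis generator: the `build_domain_model` function and its dictionary layout are assumptions, and the catalog listings reuse the File1 to File4 example from the next slide.

```python
def build_domain_model(catalogs, object_type="File"):
    """Derive a domain relation, source relations, and rules.

    `catalogs` maps a catalog name to {object name: [attribute, ...]},
    standing in for the answers each MCS would give when asked for
    the attributes of its objects.
    """
    domain_attrs = []
    source_relations = {}
    for catalog, objects in catalogs.items():
        source_attrs = []
        for attrs in objects.values():
            for attr in attrs:
                if attr not in source_attrs:
                    source_attrs.append(attr)
                if attr not in domain_attrs:
                    domain_attrs.append(attr)
        # Every source relation also exposes the object's name.
        source_relations[f"{catalog}{object_type}"] = source_attrs + ["name"]
    domain_attrs.append("name")

    # One rule per source; attributes the source lacks stay unbound.
    rules = []
    for relation, attrs in source_relations.items():
        unbound = [a for a in domain_attrs if a not in attrs]
        rules.append({"head": object_type, "body": relation, "unbound": unbound})
    return domain_attrs, source_relations, rules


catalogs = {
    "MCS1": {
        "File1": ["starttime", "endtime", "frequency"],
        "File2": ["starttime", "endtime", "frequency", "amplitude"],
    },
    "MCS2": {
        "File3": ["starttime", "endtime", "lat", "lon", "temp"],
        "File4": ["starttime", "endtime", "lat", "lon", "windspeed"],
    },
}
domain_attrs, source_relations, rules = build_domain_model(catalogs)
print(domain_attrs)
print(source_relations)
print(rules)
```

Running the generator again at query time picks up any attributes or catalogs that appeared since the last run, which is the point of building the model dynamically.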

15
Dynamic Model Generator (Contd)
  • Example
    • MCS 1
      File1(starttime, endtime, frequency),
      File2(starttime, endtime, frequency, amplitude)
    • MCS 2
      File3(starttime, endtime, lat, lon, temp),
      File4(starttime, endtime, lat, lon, windspeed)
  • Domain relation
    • File(starttime, endtime, frequency, amplitude,
      lat, lon, temp, windspeed, name)
  • Source relations
    • MCS1File(starttime, endtime, frequency,
      amplitude, name)
    • MCS2File(starttime, endtime, lat, lon, temp,
      windspeed, name)
  • Domain rules
    • File(starttime, endtime, frequency, amplitude,
      lat, lon, temp, windspeed, name) :-
      MCS1File(starttime, endtime, frequency,
      amplitude, name)
      (lat ) (lon ) (temp ) (windspeed )
    • File(starttime, endtime, frequency, amplitude,
      lat, lon, temp, windspeed, name) :-
      MCS2File(starttime, endtime, lat, lon, temp,
      windspeed, name)
      (frequency ) (amplitude )

16
Query Processing
  • When Prometheus receives a query, it determines
    which MCSs are relevant
  • Relevant MCSs are determined by comparing the
    constraints of the query with the constraints of
    the MCSs
  • MCSs that do not satisfy the constraints of the
    query are not used in the query
  • For example, if the query asks for files that
    contain data for some lat and lon, MCS1 would not
    be queried, because its files carry no lat or lon
    attributes
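
A rough way to picture this pruning step is the following Python sketch. It simplifies the comparison to an attribute-coverage test (a source is kept only if it supplies every attribute the query constrains), which is an assumption made for illustration rather than the mediator's actual constraint reasoning. The `relevant_sources` helper is hypothetical.

```python
def relevant_sources(constrained_attrs, source_relations):
    """Keep only sources that supply every attribute the query constrains."""
    needed = set(constrained_attrs)
    return [
        name
        for name, attrs in source_relations.items()
        if needed <= set(attrs)
    ]


source_relations = {
    "MCS1File": ["starttime", "endtime", "frequency", "amplitude", "name"],
    "MCS2File": ["starttime", "endtime", "lat", "lon", "temp", "windspeed", "name"],
}

# A query constraining lat and lon can only be answered by MCS2.
print(relevant_sources(["lat", "lon"], source_relations))

# A query constraining only the time range is relevant to both catalogs.
print(relevant_sources(["starttime", "endtime"], source_relations))
```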

17
Query Processing Example
  • Suppose the user uses the MCS Wizard to form the
    following query:
    Q(name) :-
      File(starttime, endtime, frequency, amplitude,
      lat, lon, temp, windspeed, name)
      (lat > 33) (lat < 34)
      (lon < -118) (lon > -119)
      (starttime > 50000) (endtime < 60000)
  • The Prometheus mediator would generate a datalog
    program with the query and domain rules
    • File(starttime, endtime, frequency, amplitude,
      lat, lon, temp, windspeed, name) :-
      MCS1File(starttime, endtime, frequency,
      amplitude, name)
      (lat ) (lon ) (temp ) (windspeed )
    • File(starttime, endtime, frequency, amplitude,
      lat, lon, temp, windspeed, name) :-
      MCS2File(starttime, endtime, lat, lon, temp,
      windspeed, name)
      (frequency ) (amplitude )

18
Query Processing Example
  • The mediator determines that the constraints on
    the lat and lon attributes in the first rule (for
    MCS1) are not compatible with the order
    constraints on lat and lon in the query, so only
    MCS2 is queried
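
To see why skipping MCS1 loses no answers, the example query can be run naively against both sources in a small Python sketch. The rows and file names below are invented for illustration, and treating a missing attribute as "no value, so every comparison fails" is an assumption about how the unbound attributes behave, not a statement about the Prometheus implementation.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt}

# The order constraints from the example query.
QUERY = [
    ("lat", ">", 33), ("lat", "<", 34),
    ("lon", "<", -118), ("lon", ">", -119),
    ("starttime", ">", 50000), ("endtime", "<", 60000),
]

# Hypothetical catalog contents (not from the presentation).
SOURCES = {
    "MCS1File": [
        {"starttime": 51000, "endtime": 52000, "frequency": 60,
         "amplitude": 2, "name": "f1"},
    ],
    "MCS2File": [
        {"starttime": 51000, "endtime": 59000, "lat": 33.5, "lon": -118.5,
         "temp": 18, "windspeed": 4, "name": "f3"},
        {"starttime": 51000, "endtime": 59000, "lat": 40.0, "lon": -118.5,
         "temp": 9, "windspeed": 7, "name": "f4"},
    ],
}


def satisfies(row, constraints):
    """A row qualifies only if each constrained attribute exists and passes."""
    return all(
        row.get(attr) is not None and OPS[op](row[attr], value)
        for attr, op, value in constraints
    )


def answer(constraints, sources):
    """Evaluate Q(name) over every source, without any pruning."""
    return [
        row["name"]
        for rows in sources.values()
        for row in rows
        if satisfies(row, constraints)
    ]


print(answer(QUERY, SOURCES))
```

No MCS1 row can pass the lat and lon constraints, so the answer is the same whether or not MCS1 is contacted. Pruning it up front only saves the remote call.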

19
Artemis Top level Selection
20
Artemis Filtering
21
Evaluation
  • Enabled users to query 12 different MCSs
    • Covering information from three different
      applications: LIGO, ESG, and a geo-spatial data
      warehouse
    • Covering 17,000 different files
    • Metadata consisted of about 300 different
      attributes
  • Simulated addition of metadata to MCSs and
    failure of several MCSs while the system was running

22
Related Work
  • MCS [Singh et al 03]
    • Organizes metadata about objects on the data grid
    • Object-oriented schema to support user-defined
      metadata attributes
    • Difficult for users to keep track of diverse
      attribute names
    • No semantic information is attached to the
      attributes
  • Agent Wizard [Tuchinda et al. 2003]
    • Interactive application that guides the user by
      dividing a complex task into a series of simpler
      question-answering tasks
    • Challenge is to model a complex task as a set of
      simpler subtasks
  • Prometheus Mediator [Thakkar et al. 2004]
    • Data integration system that can efficiently
      integrate data from a wide variety of data
      sources
    • Key restriction is that the relational schema for
      data sources and domain must be known in advance

23
Related Work (Contd)
  • myGrid [Wroe 2003]
    • Models data sources as semantic web services
    • Integration of data sources is represented as a
      workflow
    • Requires that data sources have a fixed schema
      and associated semantics
  • Model-based mediator system for scientific data
    management [Ludascher 2003]
    • Data sources provide semantic information
      regarding their data
    • The provided information is used to generate a
      domain model for a mediator system
    • Assumption is that semantic information is
      provided by the different data sources of interest

24
Conclusions
  • Contributions
    • Mixed-initiative approach to help scientists
      query objects on the data grid
    • Isolate users from heterogeneity of data sources
    • Manage distributed dynamic data
  • Future Work
    • Algorithm to determine when to dynamically
      generate domain model
    • Better support for specifying model mappings
    • Artemis available as a grid service
    • More extensive testing and usability studies

25
  • ?