Title: Artemis: Integrating Scientific Data on the Grid
1Artemis: Integrating Scientific Data on the Grid
- Rattapoom Tuchinda
- Snehal Thakkar
- Yolanda Gil
- Ewa Deelman
2Outline
- Motivation
- Data integration needs in scientific applications
- Distributed computing in grids
- Problem statement
- Artemis architecture
- Evaluation
- Related Work
- Conclusions and future work
3Scientific Data Integration
- Large-scale, cross-disciplinary scientific data collection, storage, and analysis exacerbate heterogeneity and dynamics
- National Virtual Observatory (NVO)
- Earth System Grid (ESG)
4Grid Computing [Foster & Kesselman 04]
- Grids provide middleware services for distributed computing
- Seamless integration and management of resources: OGSA
- Job submission and execution management: Condor
- Resource availability and performance: Monitoring and Directory Svc (MDS)
- Data replication for robustness and efficiency: Replica Loc Svc (RLS)
- Descriptions of data sources: Metadata Catalog Services (MCS)
Figure (from [Kesselman 04]), callouts:
- Security policy must underlie access management decisions
- Many sources of data, services, computation
- Resource management is needed to ensure progress and arbitrate competing demands
- Exploration and analysis may involve complex, multi-step workflows
- Data integration activities may require access to, and exploration/analysis of, data at many locations
5Scientific Data Storage and Access
- Data sources are very heterogeneous
- Data that results from various instruments, disciplines, and types of analyses
- Wide variety of data storage systems (files, DBs, servers, etc.)
- Data sources are highly distributed
- Data stored in different locations on the grid
- Data is replicated in multiple locations
- Data sources are highly dynamic
- Data grows continuously, new data models are routine
- New data sources regularly appear
- Data sources may become unavailable sporadically
- Data available at unprecedented scale
- Very soon petabytes
- These challenges are in the way of scientific
progress in many disciplines
6Data Storage and Access in Grids
- Data described with metadata attributes
- Attribute names may not be consistent across different sources
- Metadata descriptions often stored separately from the data itself
- Metadata Catalog Service (MCS) [Moore et al 01, Singh et al 03]
- Stores descriptive metadata and allows users to query based on desired attributes
- Addresses heterogeneity of data source implementations and access
7Sample Query
- search constraints
- keywords: "atmospheric data" or "climate data" or "climate model"
- model type: "CCSM" or "PCM"
- period: 2001
- search results: Files, collections, or views
/CCSM2/b20.007/atm
/PCM/B06.62/atm
/PCM/B06.20/atm
/PCM/B06.21/atm
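
A minimal sketch of how an attribute-based search like the one above could be evaluated. The `Record` layout, the `search` helper, the attribute values attached to each path, and the third catalog entry are all hypothetical; the slide does not show the actual MCS interface.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One catalog entry: a logical path plus descriptive metadata (toy layout)."""
    path: str
    attributes: dict = field(default_factory=dict)


def search(records, constraints):
    """Return the paths whose attributes satisfy every constraint.

    `constraints` maps an attribute name to the set of accepted values:
    values for one attribute are OR-ed, different attributes are AND-ed.
    """
    hits = []
    for rec in records:
        if all(rec.attributes.get(name) in accepted
               for name, accepted in constraints.items()):
            hits.append(rec.path)
    return hits


# Hypothetical catalog contents; attribute values are invented for illustration.
CATALOG = [
    Record("/CCSM2/b20.007/atm",
           {"keywords": "atmospheric data", "model_type": "CCSM", "period": 2001}),
    Record("/PCM/B06.62/atm",
           {"keywords": "climate model", "model_type": "PCM", "period": 2001}),
    Record("/hypothetical/ocean/run1",
           {"keywords": "ocean data", "model_type": "PCM", "period": 1999}),
]

QUERY = {
    "keywords": {"atmospheric data", "climate data", "climate model"},
    "model_type": {"CCSM", "PCM"},
    "period": {2001},
}

results = search(CATALOG, QUERY)
```

Representing each constraint as a set of accepted values keeps the "or" alternatives of the sample query in one place.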
8Problem Statement
- Users should have seamless single point access
- Should not have to formulate a different query for each source
- Should not have to manage the unavailability of data sources
- Users need assistance formulating the queries
- Data models may have different attribute names and representations (even from the same source)
- New data models/metadata attributes created all the time
Figure: queries q1, q2, q3 issued against MCS1, MCS2, MCS3 and their databases DB1, DB2, DB3; attribute labels shown: stime, etime, descr, sub, starttime, endtime.
9Artemis
- A mixed-initiative data integration system that
- Shields users from diversity in attribute representations
- Assists users in formulating queries step-by-step
- Manages the access and availability of dynamic collections of data sources
- Integrates and extends various AI techniques
- Data Integration
- Ontology
- Dialogue wizards
10Approach
11Artemis Architecture
Figure: components and labels shown: Dynamic Model Generator, MCS Wizard, Prometheus Query Mediator, Ontology, Models, Model Mappings, Entity selection, Filters, and three Metadata Catalog Service / Data Source pairs.
12MCS Wizard
- Based on the Agent Wizard [Tuchinda 2003]
- Domain experts create mappings between ontologies and metadata attributes
- Users can then pick the ontology and the mappings relevant to their domain
- Guides the user through available operations and filters consistent with the models of the data
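
A rough sketch of what an ontology-to-attribute mapping could look like in code. The ontology terms (`start_time`, `end_time`), the pairing of attribute names with particular catalogs, and both helper functions are assumptions made for illustration, not the wizard's actual data structures.

```python
# Hypothetical mappings from ontology terms to catalog-specific attribute names.
# Which catalog uses which spelling is an assumption made for this example.
MAPPINGS = {
    "MCS1": {"start_time": "stime", "end_time": "etime"},
    "MCS3": {"start_time": "starttime", "end_time": "endtime"},
}


def translate(filters, catalog):
    """Rewrite ontology-level filters into one catalog's attribute names.

    Returns None when the catalog has no mapping for a requested term,
    meaning the filter cannot be expressed against that catalog.
    """
    mapping = MAPPINGS.get(catalog, {})
    translated = {}
    for term, value in filters.items():
        if term not in mapping:
            return None
        translated[mapping[term]] = value
    return translated


def available_terms(catalog):
    """Ontology terms a wizard could offer as filters for this catalog."""
    return sorted(MAPPINGS.get(catalog, {}))
```

With such a table the user states a filter once, in ontology terms, and the same filter is rewritten per catalog: `translate({"start_time": 50000}, "MCS1")` yields `{"stime": 50000}`, while the same call for `"MCS3"` yields `{"starttime": 50000}`.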
13Prometheus Query Mediator
- Data integration system from earlier research [Thakkar et al. 2004, Knoblock et al. 2003]
- Provides unified query interface to a wide variety of data sources
- Relational model
- Requires pre-defined domain model relating sources to domain relations
- Extended in Artemis to support
- Source relations: various MCSs
- Domain relations: File, View, Collection
- Dynamic domain model based on availability of
data sources
14Dynamic Model Generation
- Generate mediator model dynamically by querying MCSs
- Convert object oriented model of MCSs to relational model of the mediator
- Handles dynamic nature of data by generating new domain models at query time
- Intuitive idea
- Query MCSs one at a time for all possible attributes of different objects
- Create domain relation for each object type with all possible attributes
- Create rules defining each MCS as data source
- Relate various data sources to domain relations
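
The intuitive idea can be sketched as a union of attributes per object type. The `build_model` function, its input layout, and the convention of appending a `name` attribute to every relation are assumptions for illustration; this is not the Artemis implementation.

```python
def build_model(catalogs, name_attr="name"):
    """Derive domain and source relations from per-catalog object descriptions.

    `catalogs` maps a catalog name to {object_type: [attribute list per object]}.
    Returns (domain, sources), each mapping a relation name to its attribute list.
    """
    domain, sources = {}, {}
    for catalog, objects in catalogs.items():        # visit catalogs one at a time
        for obj_type, attr_lists in objects.items():
            # Source relation: every attribute this catalog uses for the type.
            source_attrs = []
            for attrs in attr_lists:
                for attr in attrs:
                    if attr not in source_attrs:
                        source_attrs.append(attr)
            sources[catalog + obj_type] = source_attrs + [name_attr]
            # Domain relation: union over all catalogs, first-seen order.
            domain_attrs = domain.setdefault(obj_type, [])
            for attr in source_attrs:
                if attr not in domain_attrs:
                    domain_attrs.append(attr)
    for attrs in domain.values():
        attrs.append(name_attr)                       # object name goes last
    return domain, sources


CATALOGS = {
    "MCS1": {"File": [["starttime", "endtime", "frequency"],
                      ["starttime", "endtime", "frequency", "amplitude"]]},
    "MCS2": {"File": [["starttime", "endtime", "lat", "lon", "temp"],
                      ["starttime", "endtime", "lat", "lon", "windspeed"]]},
}

domain, sources = build_model(CATALOGS)
```

Because the model is recomputed from whatever the catalogs report, a newly added attribute shows up in the domain relation the next time the function runs.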
15Dynamic Model Generator (Contd)
- Example
- MCS 1
- File1(starttime, endtime, frequency), File2(starttime, endtime, frequency, amplitude)
- MCS 2
- File3(starttime, endtime, lat, lon, temp), File4(starttime, endtime, lat, lon, windspeed)
- Domain relation
- File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name)
- Source relations
- MCS1File(starttime, endtime, frequency, amplitude, name)
- MCS2File(starttime, endtime, lat, lon, temp, windspeed, name)
- Domain Rules
- File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :- MCS1File(starttime, endtime, frequency, amplitude, name) (lat ) (lon ) (temp ) (windspeed )
- File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :- MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) (frequency ) (amplitude )
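
One way to read the domain rules: each source row is padded out to the full File relation, with a placeholder in the columns that source does not provide. The sketch below assumes `None` as that placeholder (the value the rules actually bind is not legible on the slide); `to_domain_tuple` is a hypothetical helper.

```python
DOMAIN_FILE = ["starttime", "endtime", "frequency", "amplitude",
               "lat", "lon", "temp", "windspeed", "name"]

SOURCES = {
    "MCS1File": ["starttime", "endtime", "frequency", "amplitude", "name"],
    "MCS2File": ["starttime", "endtime", "lat", "lon", "temp", "windspeed", "name"],
}


def to_domain_tuple(source, row, missing=None):
    """Pad one source row out to the File domain relation.

    Attributes the source does not provide receive `missing`
    (None here, which is an assumption of this sketch).
    """
    attrs = SOURCES[source]
    if len(row) != len(attrs):
        raise ValueError(f"{source} expects {len(attrs)} values, got {len(row)}")
    values = dict(zip(attrs, row))
    return tuple(values.get(attr, missing) for attr in DOMAIN_FILE)


# Example rows with made-up values.
row1 = to_domain_tuple("MCS1File", (50000, 60000, 10, 2, "f1"))
row2 = to_domain_tuple("MCS2File", (50000, 60000, 33.5, -118.5, 20, 5, "f3"))
```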
16Query Processing
- When Prometheus receives a query it determines which MCSs are relevant
- Relevant MCSs are determined by comparing the constraints of the query with the constraints of the MCSs
- MCSs that do not satisfy constraints of the query are not used in the query
- For example, if the query asked for files that contained data for some lat, lon then MCS1 would not be queried
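
A simplified sketch of the relevance check, reduced to attribute presence: a catalog is kept only if it provides every attribute the query constrains. The real mediator compares constraints rather than attribute names alone; `relevant_sources` and the `SOURCE_ATTRS` table are hypothetical.

```python
SOURCE_ATTRS = {
    "MCS1": {"starttime", "endtime", "frequency", "amplitude", "name"},
    "MCS2": {"starttime", "endtime", "lat", "lon", "temp", "windspeed", "name"},
}


def relevant_sources(constrained_attrs, source_attrs=SOURCE_ATTRS):
    """Names of the catalogs that provide every constrained attribute."""
    wanted = set(constrained_attrs)
    return sorted(name for name, attrs in source_attrs.items()
                  if wanted <= attrs)       # subset test: nothing wanted is missing
```

For the example in the text, `relevant_sources(["lat", "lon"])` keeps only `MCS2`, while a query constraining just `starttime` keeps both catalogs.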
17Query Processing Example
- Let's say the user uses the MCS Wizard to form the following query.
- Q(name) :-
- File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name)
- (lat > 33)(lat < 34)
- (lon < -118)(lon > -119)
- (starttime > 50000)(endtime < 60000)
- The Prometheus mediator would generate a datalog program with the query and domain rules
- File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
- MCS1File(starttime, endtime, frequency, amplitude, name)
- (lat ) (lon ) (temp ) (windspeed )
- File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
- MCS2File(starttime, endtime, lat, lon, temp, windspeed, name)
- (frequency ) (amplitude )
18Query Processing Example
- The mediator determines that the constraints on lat and lon in rule one are not compatible with the order constraints on lat and lon in the query, so only MCS2 is queried
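
The compatibility test can be sketched with open intervals. The sketch assumes that a rule binding an attribute to a placeholder is modeled as `None`, which no order constraint can satisfy; the `compatible` function and the interval encoding are illustrative, not the mediator's actual algorithm.

```python
import math


def compatible(query_bounds, rule_bounds):
    """True when no attribute constrained by both sides is contradictory.

    Bounds are open intervals (low, high). A rule entry of None means the
    rule fixes the attribute to a placeholder (an assumption of this sketch).
    """
    for attr, (q_low, q_high) in query_bounds.items():
        if attr not in rule_bounds:
            continue                      # rule leaves this attribute free
        bound = rule_bounds[attr]
        if bound is None:
            return False                  # placeholder cannot meet an order constraint
        r_low, r_high = bound
        if max(q_low, r_low) >= min(q_high, r_high):
            return False                  # open intervals do not overlap
    return True


QUERY = {
    "lat": (33, 34),
    "lon": (-119, -118),
    "starttime": (50000, math.inf),
    "endtime": (-math.inf, 60000),
}

RULES = {
    "MCS1": {"lat": None, "lon": None, "temp": None, "windspeed": None},
    "MCS2": {"frequency": None, "amplitude": None},
}

to_query = [name for name, bounds in RULES.items() if compatible(QUERY, bounds)]
```

Here `to_query` contains only `MCS2`, matching the outcome described above.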
19Artemis Top level Selection
20Artemis Filtering
21Evaluation
- Enabled users to query 12 different MCSs
- Covering information from three different applications
- LIGO, ESG, and Geo-spatial data warehouse
- Covering 17,000 different files
- Metadata consisted of about 300 different attributes
- Simulated addition of metadata to MCSs and failure of several MCSs while system was running
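
A small sketch of how catalog failures might be tolerated when the model is rebuilt at query time: each catalog is probed, and those that fail are left out of the model. `CatalogUnavailable`, `live_catalogs`, and the fetch functions are hypothetical names; the evaluation setup itself is not described in more detail on the slide.

```python
class CatalogUnavailable(Exception):
    """Raised by a fetch function when its catalog cannot be reached."""


def live_catalogs(fetchers):
    """Call each catalog's fetch function and keep the ones that answer.

    `fetchers` maps a catalog name to a zero-argument callable returning
    that catalog's attribute list. Returns (live, failed).
    """
    live, failed = {}, []
    for name, fetch in fetchers.items():
        try:
            live[name] = fetch()
        except CatalogUnavailable:
            failed.append(name)           # skip it; the model is built without it
    return live, failed


def fetch_mcs1():
    return ["starttime", "endtime", "frequency", "amplitude"]


def fetch_mcs2():
    # Simulated outage.
    raise CatalogUnavailable("MCS2")


live, failed = live_catalogs({"MCS1": fetch_mcs1, "MCS2": fetch_mcs2})
```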
22Related Work
- MCS [Singh et al 03]
- Organize metadata about objects on the data grid
- Object oriented schema to support user defined metadata attributes
- Difficult for users to keep track of diverse attribute names
- No semantic information is attached to the attributes
- Agent Wizard [Tuchinda et al. 2003]
- Interactive application that guides user by dividing complex tasks into a series of simpler question answering tasks
- Challenge is to model complex task as set of simpler subtasks
- Prometheus Mediator [Thakkar et al. 2004]
- Data integration system that can efficiently integrate data from a wide variety of data sources
- Key restriction is that relational schema for data sources and domain must be known in advance
23Related Work (Contd)
- myGrid [Wroe 2003]
- Model data sources as semantic web services
- Integration of data sources is represented as a workflow
- Requires that data sources have fixed schema and associated semantics
- Model-based mediator system for scientific data management [Ludascher 2003]
- Data sources provide semantic information regarding their data
- The provided information is used to generate domain model for a mediator system
- Assumption is that semantic information is provided by different data sources of interest
24Conclusions
- Contributions
- Mixed-initiative approach to help scientists query objects on the data grid
- Isolate users from heterogeneity of data sources
- Manage distributed dynamic data
- Future Work
- Algorithm to determine when to dynamically generate domain model
- Better support for specifying model mappings
- Artemis available as a grid service
- More extensive testing and usability studies