Title: Scientific Data Management
1 Scientific Data Management Center (SDM-ISIC)
Arie Shoshani
Computing Sciences Directorate
Lawrence Berkeley National Laboratory
http://sdm.lbl.gov/sdmcenter
2 Participants
- Center Director: Arie Shoshani
- DOE Laboratories
  - ANL
    - Bill Gropp <gropp_at_mcs.anl.gov> (coordinating PI)
    - Rob Ross <rross_at_mcs.anl.gov>
  - LBNL
    - Ekow Otoo <ejotoo_at_lbl.gov>
    - Arie Shoshani <shoshani_at_lbl.gov> (coordinating PI)
  - LLNL
    - Terence Critchlow <critchlow_at_llnl.gov> (coordinating PI)
  - ORNL
    - Randy Burris <burrisrd_at_ornl.gov>
    - Thomas Potok <potokte_at_ornl.gov> (coordinating PI)
- Universities
  - Georgia Institute of Technology
    - Ling Liu <lingliu_at_cc.gatech.edu>
    - Calton Pu <calton.pu_at_cc.gatech.edu> (coordinating PI)
  - North Carolina State University
    - Mladen Vouk <vouk_at_csc.ncsu.edu> (coordinating PI)
  - Northwestern University
    - Alok Choudhary <choudhar_at_ece.nwu.edu> (coordinating PI)
    - Wei-Keng Liao <wkliao_at_ece.nwu.edu>
  - UC San Diego (Supercomputer Center)
    - Amarnath Gupta <gupta_at_sdsc.edu>
    - Reagan Moore <moore_at_sdsc.edu> (coordinating PI)
3 Original Goals and Framework
- Coordinated framework for the unification, development, deployment, and reuse of scientific data management software
- Framework
  - 4 areas
    - Very large databases
    - Distributed databases
    - Heterogeneous databases
    - Data mining
    - (plus agent technology)
  - 4 tier levels
    - Storage level
    - File level
    - Dataset level
    - Federated data level
4 Master Diagram
[Diagram: SDM-ISIC projects organized by area (1 to 5) and tier level (a to d). The diagram also carries the label "Grid Enabling Technology".]
Areas
- 1) Storage and retrieval of very large datasets
- 2) Access optimization of distributed data
- 3) Data mining and discovery of access patterns
- 4) Distributed, heterogeneous data access
- 5) Agent technology
Tier levels
- a) Storage Level
- b) File Level
- c) Dataset Level
- d) Dataset Federation Level
Projects shown in the diagram
- Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
- Knowledge-based federation of heterogeneous databases (SDSC)
- Analysis of application-level query patterns (LLNL, NWU)
- Optimizing shared access to tertiary storage (LBNL, ORNL)
- High-dimensional indexing techniques (LBNL)
- Multi-agent high-dimensional cluster analysis (ORNL)
- MPI I/O implementation based on file-level hints (ANL, NWU)
- Low-level API for grid I/O (ANL)
- Dimension reduction and sampling (LLNL, LBNL)
- Parallel I/O: improving parallel access from clusters (ANL, NWU)
- Adaptive file caching in a distributed system (LBNL)
- Optimization of low-level data storage, retrieval and transport (ORNL)
- Enabling communication among tools and data (ORNL, NCSU)
5 Scientific Data Management ISIC
[Diagram: data flows from scientific simulations and experiments (labeled "Petabytes" and "Terabytes") through data manipulation to scientific analysis and discovery.]
- Participants
  - DOE Labs: ANL, LBNL, LLNL, ORNL
  - Universities: GTech, NCSU, NWU, SDSC
- Scientific Simulations & Experiments
  - Climate Modeling
  - Astrophysics
  - Genomics and Proteomics
  - High Energy Physics
- SDM-ISIC Technology
  - Optimizing shared access from mass storage systems
  - Metadata and knowledge-based federations
  - API for Grid I/O
  - High-dimensional cluster analysis
  - High-dimensional indexing
  - Adaptive file caching
  - Agents
- Data Manipulation
  - Getting files from tape archive
  - Extracting subset of data from files
  - Reformatting data
  - Getting data from heterogeneous, distributed systems
  - Moving data over the network
- Current
  - Data Manipulation: 80% of time
  - Scientific Analysis & Discovery: 20% of time
- Goal (using SDM-ISIC technology)
  - Data Manipulation: 20% of time
  - Scientific Analysis & Discovery: 80% of time
- Goals
  - Optimize and simplify
    - access to very large datasets
    - access to distributed data
    - access to heterogeneous data
    - data mining of very large datasets
6 Benefits to Applications
- Efficiency
  - Example: by removing I/O bottlenecks, matching storage structures to the application
- Effectiveness
  - Example: by making access to data from tertiary storage or various sites on the data grid transparent, more effective data exploration is possible
- New algorithms
  - Example: by developing a more effective high-dimensional clustering technique for large datasets, discovery of new correlations is possible
- Enabling ad-hoc exploration of data
  - Example: by enabling a "run and render" capability to visualize simulation output while the code is running, it is possible to monitor and steer a long-running simulation
7 Current Projects
- 1) High-Dimensional Clustering
  - Target applications: Astrophysics, Climate Modeling
  - LLNL, ORNL
  - Scientific problem targeted: To understand the mechanism(s) behind core-collapse supernovae, it is crucial to explore and quantify
    - The correlations between the neutrino flux and stellar core convection
    - The correlations between convection and spatial dimensionality
    - The correlations between convection and rotation
    - Contact: Anthony Mezzacappa, ORNL
  - Scientific problem targeted: Separating volcano and ENSO (El Niño Southern Oscillation) signals from the rest of the climate data to study variability in temperature
    - Contact: Ben Santer, PCMDI, LLNL
8 Current Projects
- 2) Efficient Parallel I/O to Disk Storage
  - Target application: Astrophysics
  - ANL, NWU, LLNL
  - Scientific problem targeted: Astrophysics simulation code (FLASH). Early production runs spent as much as half of the time writing checkpoint and visualization data
    - Contact: Mike Zingale, U of Chicago
  - Scientific problem targeted: improving parallel I/O efficiency for tiled displays, a popular medium for collaborative viewing of high-resolution visualization of astrophysics data
    - Contact: Mike Papka, ANL
  - Scientific problem targeted: Query pattern analysis for astrophysics star data, devising a disk layout for the data such that overall data access time across multiple applications and users is reduced
    - Contact: LLNL
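The third problem on this slide (using query patterns to devise a disk layout) can be illustrated with a minimal sketch. It assumes a hypothetical log in which each query lists the data blocks it read, and it uses a simple greedy heuristic that places blocks that are often read together next to each other. The function names, the log format, the block names, and the heuristic itself are illustrative assumptions, not the method used by the project.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(query_log):
    """Count how often each pair of blocks appears in the same query."""
    counts = Counter()
    for blocks in query_log:
        for a, b in combinations(sorted(set(blocks)), 2):
            counts[(a, b)] += 1
    return counts


def greedy_layout(query_log):
    """Order blocks so that frequently co-accessed blocks sit side by side.

    Hypothetical heuristic: start from the most frequently read block, then
    repeatedly append the unplaced block most often read together with the
    block placed last (ties broken by frequency, then by name).
    """
    counts = co_access_counts(query_log)
    freq = Counter(b for blocks in query_log for b in set(blocks))
    if not freq:
        return []
    remaining = set(freq)
    current = min(remaining, key=lambda b: (-freq[b], b))
    layout = [current]
    remaining.remove(current)
    while remaining:
        def affinity(b):
            pair = (current, b) if current < b else (b, current)
            return counts[pair]
        current = min(remaining, key=lambda b: (-affinity(b), -freq[b], b))
        layout.append(current)
        remaining.remove(current)
    return layout


# Hypothetical query log: each entry lists the blocks one query read.
log = [
    ["mass", "luminosity"],
    ["mass", "luminosity", "temperature"],
    ["position", "velocity"],
    ["mass", "luminosity"],
]
layout = greedy_layout(log)
print(layout)
```

A real layout tool would also have to weigh block sizes, striping across disks, and competing access patterns from different applications; the sketch only shows how observed query patterns can drive placement.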
9 Current Projects
- 3) Providing transparent access to grid data
  - Target application: High Energy Physics
  - LBNL, ORNL
  - Scientific problem targeted: given a logical request (expressed on event attributes), get relevant data from grid sites and tertiary storage to application code without human intervention
    - Contact: Doug Olson, LBNL
    - Contact: Stephen Gowdy, SLAC
    - Contact: Jackie Chan, Sandia Livermore (combustion)
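The idea of a logical request expressed on event attributes can be sketched as follows. This is a minimal illustration, assuming a hypothetical event catalog (attribute values plus the file that holds each event) and a hypothetical replica catalog (where each file currently resides). All names, attributes, and locations are invented for the example and do not describe the project's actual interfaces.

```python
from collections import defaultdict

# Hypothetical event catalog: attribute values plus the file holding each event.
EVENT_CATALOG = [
    {"event_id": 1, "energy": 12.5, "num_tracks": 4, "file": "run01.dat"},
    {"event_id": 2, "energy": 48.0, "num_tracks": 9, "file": "run01.dat"},
    {"event_id": 3, "energy": 51.2, "num_tracks": 7, "file": "run02.dat"},
    {"event_id": 4, "energy": 8.1, "num_tracks": 2, "file": "run03.dat"},
]

# Hypothetical replica catalog: where each file currently lives.
FILE_LOCATIONS = {
    "run01.dat": "local_disk_cache",
    "run02.dat": "remote_grid_site",
    "run03.dat": "tape_archive",
}


def plan_request(predicate):
    """Map a logical request (a predicate over event attributes) to a fetch plan.

    Returns {location: {file: [event_ids]}} so that each file is fetched once,
    no matter how many qualifying events it holds.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for event in EVENT_CATALOG:
        if predicate(event):
            location = FILE_LOCATIONS[event["file"]]
            plan[location][event["file"]].append(event["event_id"])
    return {loc: dict(files) for loc, files in plan.items()}


# Logical request: events with energy above 40 and more than 5 tracks.
plan = plan_request(lambda e: e["energy"] > 40 and e["num_tracks"] > 5)
print(plan)
```

The application states only what it wants in terms of event attributes; which files to fetch, and from where, is worked out for it. Files with no qualifying events (here the one on tape) are never requested.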
10 Current Projects
- 4) Heterogeneous Data Federation
  - Target application: Biology
  - LLNL, SDSC, GTU, NCSU, ORNL
  - Scientific problem targeted: developing our infrastructure in support of cancer researchers at LLNL, who expect to use it to help identify genes that respond to low doses of radiation. This problem is difficult because the information required by the scientists is spread across many independent, web-based data sources, each using its own interfaces and data formats
    - Contact: Matt Coleman, LLNL
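The difficulty described above (many independent sources, each with its own interface and data format) is commonly handled with one wrapper per source that maps records onto a shared schema. The sketch below is a minimal illustration under that assumption; the two sources, their field names, and the sample records are hypothetical and stand in for real web-based data sources.

```python
import csv
import io
import json

# Two hypothetical sources describing genes in different formats and field names.
SOURCE_A_CSV = "gene_symbol,organism\nTP53,human\nATM,human\n"
SOURCE_B_JSON = (
    '[{"name": "TP53", "chromosome": "17"},'
    ' {"name": "CDKN1A", "chromosome": "6"}]'
)


def read_source_a(text):
    """Wrapper for source A: CSV rows mapped onto the common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"gene": row["gene_symbol"], "organism": row["organism"]}


def read_source_b(text):
    """Wrapper for source B: JSON objects mapped onto the common schema."""
    for item in json.loads(text):
        yield {"gene": item["name"], "chromosome": item["chromosome"]}


def federate(*record_streams):
    """Merge records from all wrappers into one view keyed by gene symbol."""
    merged = {}
    for stream in record_streams:
        for record in stream:
            merged.setdefault(record["gene"], {}).update(record)
    return merged


view = federate(read_source_a(SOURCE_A_CSV), read_source_b(SOURCE_B_JSON))
for gene in sorted(view):
    print(gene, view[gene])
```

Adding a new source then means writing one more wrapper rather than changing the query side. A real federation would also need to reconcile conflicting values and differing identifiers, which this sketch ignores.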