N N meeting Australia 2003 - PowerPoint PPT Presentation


1
Data Management in a Grid Environment - theory
and practical examples
  • Kerstin Kleese van Dam et al.
  • CCLRC e-Science Centre
  • k.kleese_at_dl.ac.uk
  • http://www.e-science.clrc.ac.uk

2
  • Council for the Central Laboratory of the
    Research Councils
  • One of Europe's largest research support
    organisations, providing large-scale
    experimental, data and computing facilities,
    primarily to the UK research community in both
    academia and industry. It annually supports around
    12,000 scientists from all major scientific
    domains and has 1,800 members of staff over three sites:
  • Rutherford Appleton Laboratory in Oxfordshire
  • Daresbury Laboratory in Cheshire
  • Chilbolton Observatory in Hampshire
  • Large quantities of data are associated with the
    various facilities. CCLRC houses 1 World Data Centre, 3
    National Data Centres and a range of
    community-based data services.
  • http://www.cclrc.ac.uk

3
  • CCLRC e-Science Centre

Early involvement in e-Science (from 1999, Data
Grid / WOS onwards). Centre established in 2000;
since 2001 with direct governmental funding, plus
additional funding through participation in other
projects. Currently housing the UK Grid Support
Centre (together with Manchester and Edinburgh) and
the BBSRC Grid Support Centre. Involved in DataGrid,
GridPP, AstroGrid and NERC DataGrid. Currently 40
permanent members of staff, 10 in the data
management group. http://www.escience.clrc.ac.uk
4
Data Management Group
5
Current e-Science Projects of the Data Management
Group
Working on collaborations with partners inside
CCLRC, the UK and internationally:
CLRC DataPortal
Integration of ISIS and BADC operational Data
Catalogues
Environment from the Molecular Level
NERC DataGrid
e-Science Technologies for the Simulation of
Complex Materials
Extensions of the Storage Resource Broker (SRB)
together with SDSC
Earth Science Portal Project
Database service for CCLRC and related e-Science
projects
6
Data Management
7
Currently scientists have to take care of their
data themselves, providing the binding link between
different areas of work.
In the future we hope that e-Science technologies
will provide scientists with a more helpful
environment.
8
Issues
  • Data capture from instruments and computers
  • Data storage
  • Annotating data
  • Data discovery
  • Association of data with appropriate applications
  • Conversion of data from one application to the other
  • Merging of data from different sources
9
Data capture from instruments and computers
In a Grid environment scientists will
ultimately have little control over where they
carry out their experiments or calculations, and
therefore over where their data will be.
  • Capture data
  • Capture information about the environment
  • Direct where output goes
10
Data Capture from Experimental Facilities (1)
Instruments produce varying amounts of data,
ranging from small (e.g. temperature readings at
a station) to large (e.g. LHC with several Tbytes
per second).
Each instrument will produce data in its own
format, often incompatible with anything else.
Most facilities provide their own short term
storage, but will neither annotate nor manage the
data.
The collection of environmental information is
often limited; much of the information is still
recorded in lab notebooks.
Correction values or error margins related to the
instrument are not linked to the collected data.
11
Data Capture from Experimental Facilities (2) -
Requirements
Generalised description of data format (possible
standardisation for instruments of the same type).
Automatic capture of environment information
including Instrument scientists if necessary.
Automatic linking of data about the environment
and the raw data produced by the instrument.
Automatic insertion of both types of data into
interim or final data repository.
Automatic linking of the donated data to existing
related information e.g. proposal, other
experiments of the same project.
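The requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CaptureRecord`, `Repository`, the field names and the sample values are all hypothetical and do not describe any existing CCLRC system. The sketch shows raw data being stored together with its environment information and linked to a proposal at the moment of capture.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CaptureRecord:
    """Raw instrument output linked to its environment information."""
    instrument: str
    data_format: str      # generalised description of the data format
    raw_data: bytes
    environment: dict     # e.g. temperature, operator, correction values
    proposal_id: str      # link to existing related information
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def checksum(self) -> str:
        # A content hash gives the raw data a stable identifier.
        return hashlib.sha256(self.raw_data).hexdigest()


class Repository:
    """Hypothetical interim repository, indexed by checksum and by proposal."""

    def __init__(self):
        self.records = {}
        self.by_proposal = {}

    def insert(self, record: CaptureRecord) -> str:
        key = record.checksum
        self.records[key] = record
        self.by_proposal.setdefault(record.proposal_id, []).append(key)
        return key


# Illustrative values only.
repo = Repository()
key = repo.insert(CaptureRecord(
    instrument="example-diffractometer",
    data_format="raw-v1",
    raw_data=b"\x00\x01\x02",
    environment={"temperature_K": 293.0, "operator": "instrument scientist"},
    proposal_id="PROPOSAL-0001",
))
```

Because the environment dictionary travels inside the same record as the raw bytes, the two cannot drift apart, which is the point of the "automatic linking" requirement.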
12
Data Capture from Experimental Facilities (3) -
Examples
Finally Integrated with other Facility Data
within and outside CCLRC via Instances of the
CCLRC DataPortal software.
Collection of Raw data from the Instrument,
Detector specific Information for this experiment
etc.
ICAT - CLRC ISIS Catalogue http://www.isis.rl.ac.uk/dataanalysis
Integrate Raw Data with original Proposal
Information and Log files of the Instrument
Scientists
See also Comb-e-Chem - http://www.combechem.org
13
Data Storage
The Grid environment provides access to a
multitude of storage systems, often hiding the
type of system behind services interfaces.
  • Where is the data?
  • How can I manage it?
  • On which media is my data (access time)?
  • How can it be accessed?
  • Where are replicas of my data?
14
Data Storage (2) - Requirements
  • Easy overview of where your data is on the Grid
  • Support to manage your data (transfers/replicas)
  • Access and access control to your data wherever it is
  • Support to share your data
Two possible solutions:
  • Globus Data Management tools - example ESG http://www.earthsystemsgrid.org
  • Storage Resource Broker (SRB) from the San Diego Supercomputer Center http://www.npaci.edu/DICE/SRB
15
Typical Analysis Scenario and the use of Storage
Resource Managers (SRM)
Metadata Catalogue for Data Discovery within one
Virtual Organisation
Request goes out to Disk and Hierarchical Storage
Resource Managers
Replica Catalogue keeps track of all replicas of
specific datasets within one Virtual Organisation
The Network Weather Service helps to plan fastest
Access routes to the data
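The interplay between a replica catalogue and network forecasts can be sketched as follows. This is a minimal illustration, not the Globus or SRM API: the catalogue contents, the site names, the example.org URLs and the bandwidth figures are all made up, and the forecast table merely stands in for the kind of information a service such as the Network Weather Service could supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str
    url: str
    size_bytes: int


# Hypothetical replica catalogue: logical dataset name -> known replicas.
REPLICA_CATALOGUE = {
    "climate/run42": [
        Replica("site-a", "gsiftp://site-a.example.org/run42", 4_000_000_000),
        Replica("site-b", "gsiftp://site-b.example.org/run42", 4_000_000_000),
    ],
}

# Hypothetical bandwidth forecasts in bytes per second.
BANDWIDTH_FORECAST = {"site-a": 10_000_000, "site-b": 40_000_000}


def estimated_transfer_seconds(replica, forecast):
    """Estimate transfer time from replica size and forecast bandwidth."""
    return replica.size_bytes / forecast[replica.site]


def choose_replica(logical_name, catalogue, forecast):
    """Return the replica with the shortest estimated transfer time."""
    replicas = catalogue.get(logical_name)
    if not replicas:
        raise KeyError(f"no replicas registered for {logical_name!r}")
    return min(replicas, key=lambda r: estimated_transfer_seconds(r, forecast))


best = choose_replica("climate/run42", REPLICA_CATALOGUE, BANDWIDTH_FORECAST)
print(best.site, estimated_transfer_seconds(best, BANDWIDTH_FORECAST))
```

The user only ever names the logical dataset; which physical copy is fetched is decided from the catalogue and the current forecast.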
16
The Earth System Grid
[Architecture diagram; the layout is not recoverable from the transcript.]
Sites shown: LBNL, ANL, NCAR, LLNL, ORNL, ISI
Storage shown: disk, HPSS (High Performance Storage System), MSS (Mass Storage System), SRM (Storage Resource Management)
Data transfer shown: gridFTP servers, CAS-enabled Striped-gridFTP servers, Striped gridFTP client, openDAPg servers
Security shown: CAS (Community Authorization Services), CAS client, MyProxy server, MyProxy client, GRAM gatekeeper
Catalogues and portal shown: MCS (Metadata Cataloguing Services), MCS client, RLS (Replica Location Services), RLS client, LAS (Live Access Server), TOMCAT Servlet engine
Connections labelled: gridFTP, SOAP, RMI
17
Storage Resource Broker (1)
Professional data storage management system,
initially developed in the mid-1990s by the San
Diego Supercomputer Center.
http://www.npaci.edu/DICE/SRB/
The current version supports many
platforms and authentication methods, and offers Web
services interfaces.
18
Storage Resource Broker
Device Interface Modules for a wide range of
platforms; easy to extend to new systems.
SRB External Interface Modules: MySRB (web
based), command line interface, C and Fortran
APIs. Password and certificate authorisation.
MCAT provides links from logical to physical
data location, and tracks replicas and versions. MCAT can
be run on a variety of relational databases.
Integrated access to data on PC, UNIX, Linux, DB
and tape store. http://www.npaci.edu/dice/srb/mySRB/mySRB.html
Also used in the BIRN project: http://www.nbirn.net/
19
Functions including ingestion, movement and
replication of data. Providing access to data for
others
Version of Data
Type of Data
Replica or Original Data
Physical Data Location and Type of Resource
20
(No Transcript)
21
(No Transcript)
22
Biomedical Informatics Research Network
23
Annotating Data
Data without further information is only of short
and very limited use.
  • Information about the data itself
  • Information about the where, why, who and when
  • Information about the environment in which the data was captured
  • Related information
Example: CLRC Scientific Metadata Schema
http://www.e-science.clrc.ac.uk/Activity/ACTIVITY
DataPortalSECTION5
24
Diversity Users Searches
25
General Scientific Metadata
A generic metadata model for all scientific
applications with Specialisation for each domain

  • Can answer questions across domains
  • Can answer questions about specific domains
26
CLRC DataPortal - Scientific Metadata Model
Metadata Object
Topic
Study Description
Access Conditions
Data Description
Data Location
Related Material
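The six top-level parts of the metadata object listed above map naturally onto a record type. The following Python sketch only mirrors the part names from the slide; the field types, the sample entry and the `matches_topic` helper are assumptions made for illustration and are not taken from the CLRC Scientific Metadata Schema itself.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class MetadataObject:
    """One entry with the six top-level parts named on the slide."""
    topic: list                  # keywords or discipline terms
    study_description: str
    access_conditions: str
    data_description: str
    data_location: list          # where the data can be fetched from
    related_material: list = field(default_factory=list)


def matches_topic(entry: MetadataObject, keyword: str) -> bool:
    """Case-insensitive exact match against the entry's topic keywords."""
    keyword = keyword.lower()
    return any(keyword == term.lower() for term in entry.topic)


# Illustrative entry; all values are invented.
entry = MetadataObject(
    topic=["Chemistry", "Zeolites"],
    study_description="Example diffraction study",
    access_conditions="restricted to project members",
    data_description="raw detector counts",
    data_location=["https://data.example.org/study/1/raw"],
)

record = asdict(entry)  # plain dict, ready to serialise as XML or JSON
```

Keeping the generic parts fixed and letting domains specialise the contents is what allows the same query to be asked across facilities.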
27
Data Discovery
Most data is currently discovered by word of
mouth from friends and colleagues or sheer
luck.
  • Discovery
  • Browsing
  • Selection
  • Comparison
  • Access
Example: CLRC DataPortal http://esc.dl.ac.uk:9000/index.html
28
Different Levels of Metadata supporting Discovery
and Selection
A-Metadata: can be derived from the data itself
B-Metadata: a summary of all other types of
metadata
C-Metadata: all related metadata, papers,
pictures, related studies
D-Metadata: user-provided information on what,
who and when
29
CLRC DataPortal
  • The DataPortal currently allows access to
    selected metadata and data from four facilities,
    the first three of which are housed by CLRC:
  • The Synchrotron Radiation Department (SRD)
  • The Neutron Spallation Source (ISIS)
  • The British Atmospheric Data Centre (BADC)
  • Max-Planck Institute for Meteorology (MPIM)

You will be able to access the available data via
the basic search. If you are not one of our
partners but would like to try the system, you
can log in with one of our test accounts, using
'dpuser' as both username and password.
http://esc.dl.ac.uk:9000/index.html
30
DataPortal Architecture
The major functions of the DataPortal (DP) are
grouped into modules. Each module has a grid
services interface to communicate with the other
DP services and, in some cases, also with outside
services such as the Visualisation or HPC Portal. The
SOAP protocol is used for communication and WSDL
to describe the various services. We do not
change any local metadata system, but use our own
wrappers to translate our general query format
into the local syntax. Replies from the resources
are XML files compliant with the CLRC
Scientific Metadata Format
(http://www-dienst.rl.ac.uk/library/2002/tr/dltr-2002001.pdf).
The UK e-Science Grid CA provides Globus X.509
certificates for the UK e-Science community. The
CA is located at RAL and is run as part of
the Grid Support Centre funded by the Research
Councils' Core e-Science programme
(http://www.grid-support.ac.uk/).
The implementation of the core modules as grid
services allows the DataPortal to be a truly
distributed application and allows several
instances of the DataPortal to be logically combined,
thus extending any user query.
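The wrapper idea, one general query format translated into each facility's local syntax, can be sketched as below. This is not the DataPortal's wrapper code: the class names, the `studies` table, its columns and the example.org URL are hypothetical, chosen only to show two different local syntaxes produced from the same general query.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlencode


class FacilityWrapper(ABC):
    """Translates the general query format into one facility's local syntax."""

    @abstractmethod
    def translate(self, query: dict):
        ...


class SqlCatalogueWrapper(FacilityWrapper):
    """Targets a hypothetical relational catalogue with a `studies` table."""

    def translate(self, query):
        clauses, params = [], []
        if "keyword" in query:
            clauses.append("topic LIKE ?")
            params.append(f"%{query['keyword']}%")
        if "year" in query:
            clauses.append("year = ?")
            params.append(query["year"])
        where = " AND ".join(clauses) or "1=1"
        # Values are passed as parameters, never pasted into the SQL text.
        return f"SELECT id, title FROM studies WHERE {where}", params


class UrlCatalogueWrapper(FacilityWrapper):
    """Targets a hypothetical catalogue that is searched via URL parameters."""

    def __init__(self, base_url):
        self.base_url = base_url

    def translate(self, query):
        return f"{self.base_url}?{urlencode(sorted(query.items()))}"


general_query = {"keyword": "zeolite", "year": 2002}
sql, params = SqlCatalogueWrapper().translate(general_query)
url = UrlCatalogueWrapper("https://catalogue.example.org/search").translate(general_query)
```

The local catalogues stay untouched; supporting a new facility means writing one more wrapper class.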
31
General CLRC DataPortal Architecture
32
DataPortal Architecture (2)
Access the DataPortal either via the Web Interface or
via Web Services interfaces, e.g. Query and Reply.
Authenticate and authorise the user by checking
certificate validity and checking with the associated
facilities for general access rights.
Query generation and selection of suitable
facilities to query. Farm out the query to the selected
facilities in parallel, then collect and collate the
results.
As well as interacting with the DataPortal via
the Web Interface, users can also run queries by
directly calling the Query and Reply service,
assuming that they are properly authenticated.
Other services are also externally visible, for
example the Shopping Cart.
Put interesting Data in your personal, permanent
Shopping Cart, which you can share with others as
required.
Use the Data Transfer Service to send your data
on to a chosen application or service
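Farming a query out to several facilities in parallel and collating the replies can be sketched with Python's standard thread pool. The facility names come from the slides, but the catalogue contents are invented sample data and `query_facility` is a local stand-in for what would really be a remote web service call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Invented sample catalogues; real facilities would be remote services.
FACILITY_CATALOGUES = {
    "ISIS": ["Neutron scattering of zeolites"],
    "BADC": ["Ozone profiles", "Surface temperature records"],
    "SRD": ["X-ray diffraction of zeolites"],
}


def query_facility(name, keyword):
    """Stand-in for a remote query; raises KeyError for an unknown facility."""
    titles = FACILITY_CATALOGUES[name]
    return [t for t in titles if keyword.lower() in t.lower()]


def farm_out(keyword, facilities):
    """Query all facilities in parallel and collate the results per facility."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(facilities))) as pool:
        futures = {pool.submit(query_facility, f, keyword): f for f in facilities}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing facility must not sink the whole query.
                errors[name] = repr(exc)
    return results, errors


results, errors = farm_out("zeolites", ["ISIS", "BADC", "SRD", "MPIM"])
```

Collecting failures separately lets the portal return partial results when one facility is unreachable, here simulated by "MPIM" having no catalogue entry.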
33
Choose Facilities of Interest
Select Discipline and reduce Search Field
34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
Annotate your Search Results
Specific Services associated with this data
Forgotten where your data came from?
38
Association of data with appropriate applications
Scientists will need to be able to link to
all their favourite applications for analysis,
simulation and visualisation, but they also need
to be informed about other suitable
programs.
  • Suitable applications
  • Correct format
  • Suitable for your environment
  • Availability
39
HPCGrid Services Portal
This is a pilot project funded by the CLRC
e-Science Centre to develop a Web portal to
search for resources and submit HPC applications
to a computational Grid in the UK. It will form
the basis of application portals for the UK
e-Science Grid and "thematic Grids", e.g. for the NERC
DataGrid and HPCI Consortia. This project is a
collaboration with the San Diego Supercomputer
Center, who have developed the GridPort Portal and
HotPage software for the NPACI HPC Grid, and with
the University of Lecce, Italy, who have developed
the Grid Resource Broker. http://esc.dl.ac.uk/HPCPortal/
40
HPC Grid Services Portal
Provides a portal for HPC resources which can be
customised for domain-specific applications.
Original collaboration with San Diego
Supercomputer Center, now University of Texas
(Mary Thomas).
  • Similar functionality to HotPage and GridPort (SDSC)
  • Single sign-on using a digital certificate (GSI)
  • Resource monitoring and discovery (Globus)
  • Application discovery (search engine)
  • Personal "desktop" workspace
  • File transfer (Globus) and job submission (Globus)
41
InfoPortal
Searching for Applications on the UK Level 2 Grid
HPCPortal
DataPortal
42
Choose Application DLPOLY
Resulting Findings for DLPOLY
43
Summary Description
Web Service Address for DLPOLY code
Information about the systems on which the code is
installed and available for use
Link to job submission
44
(No Transcript)
45
All machines on the UK level 2 Grid and their
availability
46
Conversion of data from one application to the
other
Scientists will need to be able to pass data
from one application to the next seamlessly and
with minimum intervention on their part.
  • Determining data formats
  • Data schema
  • Interchange/conversion
Example: e-Materials Project
47
The CLRC DataPortal Related Projects
E-SCIENCE TECHNOLOGIES IN THE SIMULATION OF
COMPLEX MATERIALS
A combination of novel
computational and computer science methodologies
and teams will be used to develop GRID e-Science
technologies to deliver new simulation solutions
to problems and fields relating to combinatorial
materials science and polymorph prediction. The
project will exploit the latest developments in
scientific simulation methodologies (both
electronic structure and force field based) and
hardware ranging from desktop to HPC. It will
establish a field tested integrated data and
computing e-Science infrastructure customised for
these key areas of current materials science.
This infrastructure will, among others, enable
the automatic submission of simulation, triggered
by the identification of knowledge gaps in the
database in response to user queries.
Furthermore, the automatic integration of
experimental and computational results for
screening applications will be supported.
48
The Science Filtering
Two-point displacement method used to build up the
dynamical matrix. Single point energy calculation
at each displacement, +ve and -ve, in x, y and z.
Purely SiO4 zeolite
Metal substitution with addition of proton
Calculation of Vibrational Freqs
  • Information of Interest
  • Structure
  • Total energy
  • Binding Energy
  • HOMO/LUMO
  • Population Analysis
  • Vibrational Freqs

Increase quality of calculation for best
candidates
Add probe
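A two-point (central difference) displacement scheme can be sketched in plain Python. The sketch makes several assumptions that are not in the slide: each single point calculation is taken to return a gradient, the gradient comes from a made-up two-variable quadratic energy rather than from a quantum chemistry code, and the mass-weighting that turns the second-derivative matrix into a dynamical matrix is left out.

```python
def gradient(coords):
    """Gradient of the toy energy E = x^2 + x*y + 2*y^2 (a stand-in model)."""
    x, y = coords
    return [2.0 * x + y, x + 4.0 * y]


def two_point_hessian(grad, coords, step=1e-3):
    """Second-derivative matrix from +/- displacements of each coordinate.

    Column j is (g(x + h*e_j) - g(x - h*e_j)) / (2*h), so every coordinate
    costs two gradient evaluations.
    """
    n = len(coords)
    hessian = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus = list(coords)
        minus = list(coords)
        plus[j] += step      # positive displacement
        minus[j] -= step     # negative displacement
        g_plus, g_minus = grad(plus), grad(minus)
        for i in range(n):
            hessian[i][j] = (g_plus[i] - g_minus[i]) / (2.0 * step)
    return hessian


hessian = two_point_hessian(gradient, [0.3, -0.2])
```

For this toy energy the exact matrix is [[2, 1], [1, 4]], so the numerical result can be checked directly; for a real system, vibrational frequencies would follow from the eigenvalues of the mass-weighted matrix.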
49
The Computation
ChemShell
1. Micro iterations to relax shells wrt forces
from QM region. RMS criteria (x) tested for
further movement of shells.
GAMESS-UK
GULP
2. Energy and gradients passed from GAMESS-UK to
GULP and then final forces passed back to
ChemShell (newopt module), which performs
geometry optimisation.
RMSx
ChemShell Optimiser
Maxg and maxs < 0.01
3. Optimisation is considered complete when both
max gradient and max step are below set criteria.
GAMESS-UK
GULP
ChemShell
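The convergence test in step 3, stopping only when both the maximum gradient and the maximum step fall below a threshold, can be illustrated with a small loop. This is not ChemShell's newopt algorithm: plain steepest descent is swapped in for simplicity, and the gradient is a made-up quadratic model standing in for the forces that GAMESS-UK and GULP would supply. The step size and starting point are likewise arbitrary.

```python
def gradient(coords):
    """Gradient of the toy energy E = x^2 + x*y + 2*y^2 (a stand-in model)."""
    x, y = coords
    return [2.0 * x + y, x + 4.0 * y]


def optimise(grad, coords, step_size=0.1, tol=0.01, max_cycles=1000):
    """Steepest descent; converged when max gradient and max step are < tol."""
    coords = list(coords)
    g = grad(coords)
    for cycle in range(1, max_cycles + 1):
        step = [-step_size * gi for gi in g]
        coords = [c + s for c, s in zip(coords, step)]
        g = grad(coords)
        max_g = max(abs(gi) for gi in g)      # largest gradient component
        max_s = max(abs(si) for si in step)   # largest coordinate change
        if max_g < tol and max_s < tol:
            return coords, cycle
    raise RuntimeError("optimisation did not converge")


final_coords, cycles = optimise(gradient, [1.0, 1.0])
```

Testing the step as well as the gradient guards against stopping on a flat region where the gradient is small but the geometry is still moving.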
50
CML: Chemical Markup Language
CML is a new approach to managing molecular
information. It has a large scope, covering
disciplines from macromolecular sequences to
inorganic molecules and quantum chemistry. CML is
new in bringing the power of XML to the
management of chemical information. CML and its
associated tools allow for the conversion of
current files without semantic loss into
structured documents, including chemical
publications, and provide for the precise
location of information within files. Developed
by Peter Murray-Rust and Henry S. Rzepa.
http://www.xml-cml.org
In addition, they are also looking at CCML, a
Computational Chemical Markup Language.
51
<document>
  <!-- CML document - caffeine - karne - 7/8/00 -->
  <!-- file converted from MDL .mol -->
  <cml title="caffeine" id="cml_caffeine_karne" xmlns="x-schema:cml_schema_ie_02.xml">
    <molecule title="caffeine" id="mol_caffeine_karne" convention="mol">
      <formula>C8 H10 N4 O2</formula>
      <string title="CAS">58-08-2</string>
      <string title="ACX">I1001269</string>
      <string title="DOT">UN 1544</string>
      <string title="RTECS">EV6475000</string>
      <float title="molecule weight">194.19</float>
      <float title="melting point" units="degC">238</float>
      <float title="specific gravity">1.23</float>
      <string title="water solubility" units="g/100 mL" convention="g per 100 mL at 23 degC">1-5</string>
      <string title="comments">White powder or white glistening needles usually melted together. LIGHT SENSITIVE</string>
      <list title="alternate names">
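Reading properties out of a CML-style document needs nothing beyond a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened, namespace-free fragment modelled on the caffeine example; the fragment is a simplification made for illustration, and `read_properties` is a hypothetical helper, not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET

# Shortened fragment modelled on the caffeine example, without the namespace.
CML_FRAGMENT = """
<molecule title="caffeine" id="mol_caffeine_karne" convention="mol">
  <formula>C8 H10 N4 O2</formula>
  <string title="CAS">58-08-2</string>
  <float title="molecule weight">194.19</float>
  <float title="melting point" units="degC">238</float>
</molecule>
"""


def read_properties(xml_text):
    """Collect the formula plus every titled <string> and <float> element."""
    root = ET.fromstring(xml_text.strip())
    props = {"formula": root.findtext("formula")}
    for node in root.findall("string"):
        props[node.get("title")] = node.text
    for node in root.findall("float"):
        # Keep the unit next to the number so it is not lost in conversion.
        props[node.get("title")] = (float(node.text), node.get("units"))
    return props


props = read_properties(CML_FRAGMENT)
```

Because every value carries a title and, where relevant, a unit, a converter can pick out exactly the fields another application needs without guessing at column positions.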
52
The CLRC DataPortal Related Projects
ENVIRONMENT FROM THE MOLECULAR LEVEL: AN
E-SCIENCE PROPOSAL FOR MODELLING THE ATOMISTIC
PROCESSES INVOLVED IN ENVIRONMENTAL ISSUES
Many
environmental problems, such as transport of
pollutants, development of remediation
strategies, weathering, and containment of
high-level radioactive waste, require an
understanding of fundamental mechanisms and
processes at a molecular level. Computer
simulations at a molecular level can
considerably advance our understanding of
these processes. Developments in atomistic
simulation tools must now be linked with GRID
technologies in order to facilitate simulation
studies that can be performed with realistic
conditions, and which can scan across a wide
range of physical and chemical parameters. This
proposal brings together simulation scientists,
applications developers and computer scientists
to develop UK e-science/GRID capabilities for
molecular simulations of environmental issues. A
common set of simulation tools will be developed
for a wide range of applications, and a GRID
environment will be established, which will result
in a giant leap in the capabilities of these
powerful scientific tools. See http://eminerals.org/
53
The CLRC DataPortal Related Projects
THE NERC DATAGRID
Data discovery and delivery are
inherent components of many aspects of science.
They can be considered part of a processing chain
that starts with raw data from a variety of
sources, and ends with the graphical production
of information that is directly used in
scientific research. This proposal is to build a
grid which makes data discovery, delivery and use
much easier than it is now, facilitating better
use of the existing investment in the curation
and maintenance of quality data archives. Further
we intend to make the connection between data
held in managed archives and data held by
individual research groups seamless in such a way
that the same tools can be used to compare and
manipulate data from both sources. What will be
completely new will be the ability to compare and
contrast data from an extensive range of (US,
European, UK, NERC) datasets from within one
specific context. The presence of the NERC
DataGrid will allow grid based visualisation
services to access a wide variety of data held at
the British Atmospheric and Oceanographic Data
Centres (BADC and BODC respectively) as well as
on individual storage systems belonging to groups
which register their data with the NERC DataGrid.
The structures put in place will also allow NERC
data to become part of the putative future
semantic grid. See http://ndg.badc.rl.ac.uk/
54
CLRC DataPortal Related Projects
EARTH SCIENCE PORTAL
The Earth Science Portal
(ESP) is a collaboration designed to build the
infrastructure needed to create web portals to
provide access to observed and simulated data
within the climate and weather communities. The
infrastructure created within ESP will provide a
flexible framework that will allow
interoperability between the front-end and
back-end software components. The initial ESP
community workshop was held on January 23rd and
24th, 2003 at the National Center
for Atmospheric Research, Boulder, Colorado.
Based on the discussions of the workshop we
created a draft document that describes the
software framework within ESP. The development
activities in ESP are intended to support this
framework. The document will be updated based on
these activities and on comments and suggestions
from the community. Partners are BADC, CCLRC,
CDC and GFDL NOAA, NASA, LLNL, NCAR and
PMEL. http://nomads.gfdl.noaa.gov/ck/esp/webpages
55
The CLRC DataPortal Related Projects
EUROPEAN SPATIO-TEMPORAL DATA INFRASTRUCTURE FOR
HIGH-PERFORMANCE COMPUTING
ESTEDI, an initiative
of European software vendors and supercomputing
centres, will establish a European standard for
the storage and retrieval of multidimensional
high-performance computing (HPC) data. It
addresses a main technical obstacle, the delivery
bottleneck of large HPC results to the users, by
augmenting high-volume data generators with a
flexible data management and extraction tool for
spatio-temporal raster data. To this end, the
multidimensional database system RasDaMan will be
enhanced with intelligent mass storage handling
and optimised towards HPC. See http://www.estedi.org/
56
The CLRC DataPortal Related Projects
MSC PROJECT ON AUTOMATED DATA MANAGEMENT FOR
CLIMATE SIMULATIONS
These days data is no longer
only produced by experiments, measurements and
observations. Many of the more complex phenomena
are studied in computer simulations. These
simulations can produce large quantities of data.
However in contrast to much experimental or
observational data these results are often not
accessible to the wider research communities.
Simulation data could be more widely exploited if
better information was available concerning the
simulation itself. This project aims to
investigate the possibility of automatically
capturing as much metadata concerning the
simulation as possible and storing it in a
suitable database. The database will be
accessible via the CLRC DataPortal. It is
expected that next to investigating the issue in
general a prototype installation will be provided
by the students.
57
The CLRC DataPortal Related Projects
CLRC e-Science Database Service
The search for the most flexible operating system in terms of
both available software and price/performance
ultimately led to the choice of a Linux-based
system (enterprise editions). For running the
widest choice of databases, the Redhat Advanced
Server and SuSE Linux Enterprise Server are
available. Oracle has been selected for the
initial database service as it offers a
clustering technology. Oracle Real Application
Clusters are the multi-node extension to Oracle
database server. A cluster is a group of
independent servers (nodes) that cooperate as a
single system. The primary cluster components are
processor nodes, a cluster interconnect, and a
shared storage subsystem. Oracle cluster database
combines the memory in the individual nodes to
provide a single view of the distributed cache
memory for the entire database system. Oracle are
the only vendor to offer this capability.
We chose IBM x440 series nodes as the building
blocks for the data clusters. The IBM Enterprise
X-Architecture servers are Intel processor-based
and offer features such as support for up to 16-way SMP
and remote I/O. The clusters connect
to 1TB RAID 5 storage arrays via fibre channel
switches.
PostgreSQL
58
For information see:
Integrated e-Science Environment Portal http://esc.dl.ac.uk/IeSE/
HPC Grid Services Portal http://esc.dl.ac.uk/HPCPortal/
DataPortal http://esc.dl.ac.uk:9000/index.html
CLRC e-Science Centre http://www.e-science.clrc.ac.uk