SC2000 Tutorial
Transcript and Presenter's Notes


1

DataGrid Data Management Work Package Architecture
Peter Z. Kunszt (Peter.Kunszt@cern.ch)
and the EU DataGrid Data Management Working Group
2
Multi-Tier Model
  • Online System (event data at PBytes/sec from the
    detector), feeding the Offline Farm at 100 MBytes/sec
  • Tier 0: CERN Computer Center, fed at 100 MBytes/sec
  • Tier 1: Regional Centers (Fermilab, France, Italy,
    UK), linked to Tier 0 at 2.4 Gbits/sec
  • Tier 2: regional sites, linked at 0.6 - 2.4 Gbits/sec
  • Tier 3: Institutes, each with a physics data cache
  • Workstations, connected at 100 - 1000 Mbits/sec
3
Data Management WP Tasks
  • Data Transfer: efficient, secure and reliable
    transfer of data between sites
  • Data Replication: replicate data consistently
    across sites
  • Data Access Optimization: optimize data access
    using replication and remote open
  • Data Access Control: authentication, ownership,
    access rights on data
  • Metadata Storage: Grid-wide persistent metadata
    store for all kinds of Grid information

4
Grid Application Layer
  • Application Management, Database Management,
    Algorithm Registry, Job Management
  • Data Registering, Job Decomposition, Job
    Prediction, Data Reclustering
Collective Services
  • Information Monitoring, Replica Manager, Grid
    Scheduler, Service Index, Network Monitoring
  • Time Estimation, Replica Catalog, Grid
    Information, Load Balancing, Replica Optimisation
Underlying Grid Services
  • Remote Job Execution Services (GRAM)
  • Security Services (Authentication, Access Control)
  • Messaging Services (MPI)
  • File Transfer Services (GridFTP)
  • SQL Database Service (Metadata storage)
5
Data Management M9 Status
  • Data Transfer: efficient, secure and reliable
    transfer of data between sites
  • Data Replication: replicate data consistently
    across sites
  • Data Access Optimization: optimize data access
    using replication and remote open
  • Data Access Control: authentication, ownership,
    access rights on data
  • Metadata Storage: Grid-wide persistent metadata
    store for all kinds of Grid information

  • Work in progress
  • Complex, R&D
  • Design not final
  • Data granularity: files, not objects
6
Data Grid Storage Model
7
Replication Service
  • Replica Catalog: store logical-to-physical file
    mappings and metadata
  • Replica Manager: interface to create/delete
    replicas on demand
  • Replica Selection: find the closest replica based
    on certain criteria
  • Access Optimization: automatic creation and
    selection of replicas for whole jobs (a sketch of
    these interfaces follows below)

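The slide names these interfaces only; purely as an illustration, here is a minimal Python sketch of how a Replica Manager and replica selection could be invoked. All class, method and host names below are hypothetical, not the WP2 API.

    # Hypothetical sketch only; not the DataGrid WP2 API.
    class ReplicaManager:
        def __init__(self, catalog):
            self.catalog = catalog                      # {LFN: [PFN, ...]}

        def create_replica(self, lfn, target_se):
            """Copy one existing replica to target_se and register the new PFN."""
            source_pfn = self.catalog[lfn][0]           # any existing copy
            path = lfn.split("/", 3)[-1]                # path part of the LFN
            new_pfn = "PFN://" + target_se + "/" + path
            # ... a GridFTP transfer from source_pfn to new_pfn would go here ...
            self.catalog[lfn].append(new_pfn)
            return new_pfn

        def delete_replica(self, lfn, pfn):
            self.catalog[lfn].remove(pfn)

    def select_replica(catalog, lfn, cost):
        """Replica selection: pick the 'closest' PFN according to a cost function."""
        return min(catalog[lfn], key=cost)

    # Example (invented hosts, dummy cost function):
    catalog = {"LFN://cms.org/analysis/run10/event24.dat":
               ["PFN://se1.cern.ch/data/run10/event24.dat"]}
    rm = ReplicaManager(catalog)
    rm.create_replica("LFN://cms.org/analysis/run10/event24.dat", "se2.fnal.gov")
    print(select_replica(catalog, "LFN://cms.org/analysis/run10/event24.dat", len))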
8
Terminology
  • StorageElement
    • any storage system with a Grid interface
    • supports GridFTP
  • Logical File Name (LFN)
    • globally unique
    • LFN://hostname/string
    • hostname = virtual organization id
    • use of hostname guarantees uniqueness
    • e.g. LFN://cms.org/analysis/run10/event24.dat
  • Physical File Name (PFN)
    • PFN://hostname/path
    • hostname = StorageElement host
      (a small parsing sketch follows below)

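To make the naming syntax concrete, here is a tiny parsing helper that splits an LFN or PFN into scheme, hostname and path. It is hypothetical and not part of any DataGrid tool.

    # Hypothetical helper illustrating the LFN/PFN syntax; not a DataGrid API.
    def parse_grid_name(name):
        """Split 'LFN://hostname/string' or 'PFN://hostname/path' into parts."""
        scheme, rest = name.split("://", 1)          # 'LFN' or 'PFN'
        hostname, _, path = rest.partition("/")      # VO id (LFN) or SE host (PFN)
        return scheme, hostname, path

    print(parse_grid_name("LFN://cms.org/analysis/run10/event24.dat"))
    # -> ('LFN', 'cms.org', 'analysis/run10/event24.dat')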
9
Replica Catalog
  • Replica Catalog: a database containing mappings of
    a logical filename to one or more physical
    filenames (see the sketch below)
  • Design goals
    • scalable: the CMS experiment estimates that by
      2006 its replica catalog will contain 50 million
      files spread across dozens of institutions
    • decentralized: local data should always be in
      local catalogs! If an application wants to access
      a local replica, it should not have to query a
      replica catalog on the other side of the
      country/planet
    • fault tolerant

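As a concrete picture of the one-to-many mapping, a single entry might look like the following (entries and hostnames invented for illustration only):

    # Illustration only: one logical file known under several physical names.
    replica_catalog = {
        "LFN://cms.org/analysis/run10/event24.dat": [
            "PFN://se.cern.ch/castor/cms/run10/event24.dat",
            "PFN://se.fnal.gov/enstore/cms/run10/event24.dat",
        ],
    }
    # A scalable, decentralized catalog would keep entries like this in
    # per-site (local) catalogs rather than in one central database.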
10
ReplicaCatalog Design
  • There should be exactly one ReplicaCatalog for
    each StorageElement
  • All Replica catalogs for a given virtual
    organization are linked together, probably in a
    tree structure
  • leaf catalogs contain a mapping of LFN to PFN
  • non-leaf catalogs contain only a pointer to
    another replica catalog
  • All ReplicaCatalogs (leaf and non-leaf) have
    identical client APIs (see the sketch below)

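One way to picture the leaf / non-leaf split with an identical client API; the classes below are an illustrative sketch under these design notes, not the planned implementation.

    # Hypothetical sketch: leaf and non-leaf catalogs expose the same lookup() API.
    class LeafCatalog:
        def __init__(self, mappings):
            self.mappings = mappings                 # {LFN: [PFN, ...]}

        def lookup(self, lfn):
            return self.mappings.get(lfn, [])

    class NonLeafCatalog:
        def __init__(self, children):
            self.children = children                 # pointers to other catalogs

        def lookup(self, lfn):
            pfns = []
            for child in self.children:              # follow the pointers downwards
                pfns.extend(child.lookup(lfn))
            return pfns

    # A two-level tree: one non-leaf catalog pointing at two leaf catalogs.
    tree = NonLeafCatalog([
        LeafCatalog({"LFN://cms.org/analysis/run10/event24.dat":
                     ["PFN://se.cern.ch/castor/cms/run10/event24.dat"]}),
        LeafCatalog({}),
    ])
    print(tree.lookup("LFN://cms.org/analysis/run10/event24.dat"))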
11
Sample Hierarchy
12
Sample Usage
  • Queries can start at any catalog
    • queries can be redirected up and/or down the tree
      until all instances are found
  • Common application query: "Is there a replica on
    my local StorageElement?"
    • only the local ReplicaCatalog needs to be queried
      to answer this
    • this will reduce the load on the top-level RC
  • ReplicaCatalog updates (i.e. createNewReplica())
    are done at leaf nodes only, and then propagated
    up the tree in batches (e.g. every few minutes);
    see the sketch below
    • this will also reduce the load on the top-level
      server

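A rough sketch of the update path described above: createNewReplica() touches only the leaf catalog, and the change reaches the parent later in a single batch. All names are invented for illustration.

    # Hypothetical sketch of leaf-only updates propagated upwards in batches.
    class ParentCatalog:
        def __init__(self):
            self.known = []                          # which LFN/PFN pairs children hold

        def receive_batch(self, updates):
            self.known.extend(updates)

    class LeafCatalog:
        def __init__(self, parent):
            self.mappings = {}                       # {LFN: [PFN, ...]}
            self.pending = []                        # updates not yet sent upwards
            self.parent = parent

        def create_new_replica(self, lfn, pfn):
            self.mappings.setdefault(lfn, []).append(pfn)
            self.pending.append((lfn, pfn))          # queue; do not contact the parent now

        def propagate(self):
            """Run every few minutes: push queued updates up the tree in one batch."""
            if self.pending:
                self.parent.receive_batch(self.pending)
                self.pending = []

    parent = ParentCatalog()
    leaf = LeafCatalog(parent)
    leaf.create_new_replica("LFN://cms.org/analysis/run10/event24.dat",
                            "PFN://se.cern.ch/castor/cms/run10/event24.dat")
    leaf.propagate()                                 # one batched message, not one per update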
13
More Design Issues
  • We want to decouple transport protocol, query
    mechanism, and database mechanisms
  • provides greater flexibility
  • support arbitrary LFNs, support emerging XML
    add-ons
  • allows multiple types of database backends
  • Current implementation plan
    • SOAP/XML over HTTPS to servlets which interface
      to the backend DB
    • HTTP redirect mechanism to traverse the hierarchy
      (see the sketch below)
    • choice of DB is probably not important:
      mySQL, BerkeleyDB, LDAP should all work

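As a sketch of the redirect-based traversal: the client sends its query to any catalog servlet over HTTPS, and a catalog that cannot answer replies with an HTTP redirect to another catalog, which standard HTTP clients follow automatically. The URL and query parameter below are invented; the real servlet interface may differ.

    # Hypothetical client sketch; the catalog URL and query parameter are invented.
    import urllib.parse
    import urllib.request

    def query_catalog(catalog_url, lfn):
        """Ask any ReplicaCatalog for the PFNs of an LFN; HTTP redirects
        from intermediate catalogs steer the request through the hierarchy."""
        query = urllib.parse.urlencode({"lfn": lfn})
        with urllib.request.urlopen(catalog_url + "?" + query) as reply:
            return reply.read()                      # urllib follows redirects transparently

    # e.g. query_catalog("https://rc.cms.org/replica-catalog",
    #                    "LFN://cms.org/analysis/run10/event24.dat")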
14
Other Grid Directories
  • If this is a good design for the ReplicaCatalog,
    what about for other types of directories?
  • GIS: Globus MDS uses a variant of this design, but
    • it uses LDAP for everything (hierarchy traversal,
      query mechanism, protocol, database)
    • updates do not need to be propagated up the tree,
      because the LDAP naming hierarchy makes this
      unnecessary
  • GMA Producer Directory
    • similar requirements to the ReplicaCatalog?
  • Authorization Service (e.g. CAS or Akenti)
    • each Virtual Organization and each site will
      likely have directories mapping a Grid user DN to
      a set of capabilities
    • similar requirements to the ReplicaCatalog?

15
For More Information
  • http://www.eu-datagrid.org/
  • http://cern.ch/grid-data-management/