Title: Grid Data Management Components
1. Grid Data Management Components
- Adam Belloum
- Computer Architecture Parallel Systems group
- University of Amsterdam
- adam_at_science.uva.nl
2. The Problem (application pull)
- A new class of applications is emerging in
different domains, involving huge collections of
data that are geographically distributed and
owned by different collaborating organizations
- Examples of such applications
- Large Hadron Collider at CERN (2005)
- Climate Modeling
3. Requirements of Grid-based Applications
- Efficient data transfer service
- Efficient data access service
- Reliability and security
- Possibility to create and manage multiple copies
of the data
4. Summary
- Distributed application
- Requested data management system?
- Services specific to the Data Grid infrastructure
- Low-level services (shared with other Grid components)
5. The Data Grid is then...
- The Data Grid is the infrastructure that provides
the services required for manipulating large,
geographically distributed collections of
measured and computed data
- security services
- replica services
- data transfer services
- etc.
- Design principles
- Mechanism neutrality
- Compatibility with Grid infrastructure
- Uniformity of information infrastructure
6. The Data Grid Architecture
- High-level components
- Core services
- Data Grid-specific services
- Generic Grid services
7. Replica Services for the Data Grid
- Possibility to create multiple copies of the data
- Efficient and reliable management of the replicas
- Efficient replication strategy: Replica Management Service
- Location of the replicas: replica location mechanism
- Coherence of the replicas: replica consistency mechanism
- Selection of a replica: Replica Selection Service
- Secure replica mechanism
8. Transfer Services for the Data Grid
- Fast mechanisms for large data transfers
- Reliable transfer mechanisms
- Secure transfer mechanisms
- GridFTP
9. Security Services for the Data Grid
- Authentication
- Who can access or view the data?
- Authorization
- Who is authorized to effectively use the data?
- Accounting
- Users may be charged for using the data
10. Replica Management for the Data Grid
How do we get efficient access to the data sets? Create replicas of the data sets.
[Diagram: a data set replicated across multiple sites]
11. The Replica Management Problem
- When a request for a large file is issued, a
considerable amount of bandwidth is required to
serve it. The bandwidth available at that time
determines the latency of access to the requested
file.
- Replicate files near the potential users
- (in other domains this is called caching)
12. What is the Replica Manager?
- It is a Grid service responsible for creating
complete and partial copies of datasets (mainly
collections of files)
- Grid data model
- Datasets are stored in files grouped into collections
- A replica
- is a subset of a collection that is stored on a
particular physical storage system

Bill Allcock et al., Data Management and Transfer
in High-Performance Computational Grid Environments
13. The Role of the Replica Manager Service
- Its purpose is to map a logical file name to a
physical name for the file on a specific storage
system
- Note: it does not use any semantic information
contained in the logical file names
14. Services Relevant to the Replica Manager
[Diagram: layered architecture]
- Applications: particle physics applications, climate modeling applications, etc.
- High-level services: Replica Management Service, Replica Selection Service, Metadata Services, Distributed Catalog Service, Information Services
- Protocols: storage management, catalog management, network management, and compute management protocols
- Core services: communications, service discovery (DNS), authentication, delegation
- Resources: storage systems, networks, compute systems, replica catalog, metadata catalog
15. Framework of the Replica Manager Service
- Separation of replication and metadata information
- Only the information needed to map logical file
names to physical locations is considered
- Replication semantics
- Replicas are not guaranteed to be coherent
- Information on the original copy is not saved
- Replica management consistency
- The replica manager is able to recover and
return to a consistent state
16. Replica Management Targets
- Replica management should answer the following
questions
- Which files should be replicated?
- static files, large files
- When should a replica be created?
- for frequently accessed files
- Where should the replicas be located?
- close to users, on fast storage systems, ...
17. Replication Strategy
How should I replicate the data? You need a replication strategy.
18. Simple Dynamic Replication Strategies
- Best client
- Files are replicated at the node where they are
most requested
- Cascading replication
- Replicas are created each time a request
threshold is reached, starting from the original
node (root) and following the node hierarchy
- Plain caching
- Files are stored locally at the client side
- Fast spread
- Files are stored on each node of the path to the
destination
- Caching plus cascading replication
19. Dynamic Model-Driven Replication
- The decision to replicate a file, and where to
locate the replicas, follows a performance model
that compares the costs and benefits of creating
replicas of a particular file at certain locations
- single-system stability
- transfer time between nodes
- storage cost
- accuracy of the replica location mechanism
- etc.

Kavitha R. et al., Improving Data Availability
through Dynamic Model-Driven Replication in Large
Peer-to-Peer Communities
20. Dynamic Model-Driven Replication
- The model-driven approach tries to answer
critical questions
- What is the optimal number of replicas for a
given file?
- Which is the best location for the replicas?
- When does a file need to be replicated?
21. Number of Replicas for a File
- Defined with respect to a target availability
- Proposed model
- RLacc * (1 - (1 - p)^r) > Avail
- where
- p is the probability of a node being up
- RLacc is the accuracy of the location mechanism
- Avail is the required availability
- r is the number of replicas

Kavitha R. et al., Improving Data Availability
through Dynamic Model-Driven Replication in Large
Peer-to-Peer Communities
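The model can be evaluated directly: given p, RLacc and the target availability, the smallest r with RLacc * (1 - (1 - p)^r) > Avail is found by trying increasing replica counts. A minimal sketch (the function name and the search loop are ours; the inequality is the slide's):

```python
def min_replicas(p, rl_acc, avail, r_max=1000):
    """Smallest replica count r with rl_acc * (1 - (1 - p)**r) > avail.

    p      -- probability that a single node is up
    rl_acc -- accuracy of the replica location mechanism
    avail  -- required availability
    """
    for r in range(1, r_max + 1):
        if rl_acc * (1 - (1 - p) ** r) > avail:
            return r
    raise ValueError("target availability unreachable within r_max replicas")
```

For example, with nodes that are up half the time (p = 0.5) and a perfect location mechanism, reaching 95% availability takes 5 replicas; an imperfect location mechanism (rl_acc < 1) pushes the count up further.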
22. Best Location for the Replicas
- A query to the discovery service returns a number
of candidate nodes for replication which
- don't contain a copy of the file
- have available storage
- and have a reasonable response time
- The best candidates should maximize the
difference between
- the replication benefit (as high as possible)
- and the replication cost (as low as possible)

Kavitha R. et al., Improving Data Availability
through Dynamic Model-Driven Replication in Large
Peer-to-Peer Communities
23. Best Location for the Replicas
- Replication cost
- S(F, N2) + trans(F, N1, N2)
- where
- N1 is the node that currently contains the file
- N2 is the candidate node for a new replica
- S(F, N) is the storage cost for a file F at node N
- trans(F, a, b) is the transfer cost between
locations a and b
- The benefit of creating a replica is
- trans(F, N1, User) - trans(F, N2, User)

Kavitha R. et al., Improving Data Availability
through Dynamic Model-Driven Replication in Large
Peer-to-Peer Communities
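The slide's arithmetic, cost = S(F, N2) + trans(F, N1, N2) and benefit = trans(F, N1, User) - trans(F, N2, User), can be wrapped in a small helper that ranks candidate nodes. `storage_cost` and `transfer_cost` are placeholder callables a real deployment would back with measurements; the function names are ours.

```python
def net_benefit(f, n1, n2, user, storage_cost, transfer_cost):
    """Benefit minus cost of replicating file f from node n1 to
    candidate n2, for an expected access by `user` (slide 23)."""
    cost = storage_cost(f, n2) + transfer_cost(f, n1, n2)
    benefit = transfer_cost(f, n1, user) - transfer_cost(f, n2, user)
    return benefit - cost

def best_candidate(f, n1, user, candidates, storage_cost, transfer_cost):
    """Candidate node maximizing benefit - cost (slide 22)."""
    return max(candidates,
               key=lambda n2: net_benefit(f, n1, n2, user,
                                          storage_cost, transfer_cost))
```

With a toy quadratic transfer cost between integer node positions, the ranking behaves as expected: a midpoint between source and user beats both a node next to the source and a node next to the user.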
24. Replica Catalog
How do I keep track of the replicas? Create a catalog.
[Diagram: multiple replicas registered in a catalog]
25. The Replica Catalog
- The replica catalog is a key component of the
replica management service; it provides the
mapping between logical and physical entities.
- The replica catalog registers three types of
entities
- Logical collections: represent a number of
logical file names
- Locations: map a logical collection to a
particular physical instance of that collection
- Logical files: represent a unique logical file
name
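A minimal in-memory sketch of the three entity types makes the mapping concrete. The dict layout, function names, and field names here are illustrative assumptions, not the Globus replica catalog schema.

```python
catalog = {
    "collections": {},  # collection name -> list of logical file names
    "locations": {},    # collection name -> list of physical instances
    "files": {},        # logical file name -> attributes (e.g. size)
}

def register_collection(name, logical_files):
    catalog["collections"][name] = list(logical_files)

def register_location(collection, protocol, hostname, path, files=None):
    """Map a logical collection to one physical (possibly partial)
    instance; by default the instance holds the whole collection."""
    catalog["locations"].setdefault(collection, []).append({
        "protocol": protocol, "hostname": hostname, "path": path,
        "files": list(files) if files else catalog["collections"][collection],
    })

def lookup(collection, logical_file):
    """Return the physical locations holding a given logical file."""
    return [loc for loc in catalog["locations"].get(collection, [])
            if logical_file in loc["files"]]
```

This mirrors the example on the next slide: a climate collection of monthly files, with one GridFTP-accessible physical instance.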
26. The Replica Catalog (example)
- Logical collection: logical files "Jan 1998",
"Feb 1998", "Mar 1998", "Jun 1998", etc.
- Location entry: Protocol: GridFTP, Hostname:
jupiter.isi.edu, Path: nfs/v6/climate
- Logical file entry: "Jan 1998", Size: 1468762
27. Operations Allowed on the Replica Catalog
- Publish (File_publish)
- copies a file from a storage system not
registered in the replica catalog to a registered
storage system and updates the replica catalog
- Copy (File_copy)
- copies a file from a registered storage system to
another registered storage system and updates the
replica catalog; this is what creates replicas
- Delete (File_delete)
- deletes a filename from a replica catalog
location entry and optionally removes the file
from the registered storage system
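The three operations can be sketched over a toy registry mapping each registered storage system to the set of files it holds. This is illustrative pseudocode made runnable, not the Globus replica management API; `transfer` stands in for the actual data movement (e.g. GridFTP).

```python
registry = {}  # registered storage system name -> set of file names held

def transfer(filename, src, dst):
    pass  # placeholder for the actual wide-area copy (e.g. GridFTP)

def file_publish(filename, unregistered_src, dst):
    """Publish: copy a file from an unregistered storage system to a
    registered one and record it in the catalog."""
    transfer(filename, unregistered_src, dst)
    registry.setdefault(dst, set()).add(filename)

def file_copy(filename, src, dst):
    """Copy: replicate a file between two registered storage systems
    and update the catalog -- this is what creates replicas."""
    if filename not in registry.get(src, set()):
        raise LookupError(f"{filename} is not registered at {src}")
    transfer(filename, src, dst)
    registry.setdefault(dst, set()).add(filename)

def file_delete(filename, location, remove_file=False):
    """Delete: drop a filename from a location entry, optionally
    removing the stored file as well."""
    registry.get(location, set()).discard(filename)
    if remove_file:
        pass  # placeholder: remove from the physical storage system
```

Note the asymmetry the slide calls out: delete only touches the catalog by default, while publish and copy always move data and update the catalog together.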
28. Replica Management Recovery
- At least two functions are required to restart
the replica manager after a failure
- restart
- rollback
29. Replica Location Service
Where did I put the replicas?! You need a replica location service.
30. Replica Location Service
[Diagram: layered data services]
- Application-oriented data services
- Data management services: reliable replication services
- Metadata service, replica location service, file transfer service

Ann Chervenak et al., Giggle: A Framework for
Constructing Scalable Replica Location Services
31. Role of the Replica Location Service
- The main task of the replica location service is
to find a specified number of physical file names
given a logical file name
- The minimal set of requirements is
- autonomy
- best-effort consistency
- adaptiveness
32. Distributed, Adaptive Replica Location Service
- Replica location node
- answers queries based on the LFN
- forwards queries
- digest distribution
- overlay network
- soft state
- Storage sites
- register/delete pairs of (LFN, PFN)

Matei Ripeanu, Ian Foster, Decentralized,
Adaptive Replica Location Service
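The soft-state idea above can be sketched concretely: storage sites periodically re-register their (LFN, PFN) pairs, and mappings that are not refreshed within a timeout simply age out, which is how best-effort consistency is achieved without explicit deletes. The TTL value and the data layout are our assumptions, not from the paper.

```python
import time

TTL = 30.0  # seconds a registration stays valid without a refresh (assumed)

mappings = {}  # LFN -> {PFN: time of last refresh}

def register(lfn, pfn, now=None):
    """A storage site (re-)registers one (LFN, PFN) pair."""
    now = time.time() if now is None else now
    mappings.setdefault(lfn, {})[pfn] = now

def query(lfn, now=None):
    """Return the PFNs whose registrations are still fresh; stale
    entries are dropped rather than explicitly invalidated."""
    now = time.time() if now is None else now
    live = {p: t for p, t in mappings.get(lfn, {}).items() if now - t <= TTL}
    mappings[lfn] = live
    return sorted(live)
```

A site that crashes stops refreshing and its stale mappings disappear on their own, so the index never needs to detect the failure explicitly.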
33. Replica Selection Service
Which of the many replicas should be used? You need a replica selection service.
[Diagram: multiple candidate replicas feeding a selection service]
34. The Problem of Replica Selection
- An application that requires access to replicated
data
- queries a specific metadata repository
- the logical file names identify the existing
replicas
- the application requires access to the most
appropriate replica (according to specific
characteristics)
- This task is achieved by replica selection.

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
35. The Role of Replica Selection
- Replica selection is the process of choosing a
replica from among those spread across the Grid,
based on characteristics specified by the
application
- access speed
- geographical location
- access cost
- etc.

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
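Stripped of the surrounding services, the selection step itself is a ranking over candidate replicas by an application-supplied criterion. The replica records and the example scoring function below are illustrative assumptions, not the Globus API.

```python
def select_replica(replicas, score):
    """Return the replica minimizing the application-supplied score,
    e.g. predicted access time or cost."""
    if not replicas:
        raise LookupError("no replica available")
    return min(replicas, key=score)

# Example criterion: prefer low predicted transfer time, break ties
# by access cost (both fields are hypothetical measurement outputs).
def by_time_then_cost(replica):
    return (replica["predicted_time"], replica["cost"])
```

The interesting engineering is in producing the scores, which is exactly what the information services and predictors on the following slides supply.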
36. A Data Selection Scenario
[Diagram: numbered interactions between the application and the Grid services]
- (1) The application sends the attributes of the desired data to the replica selection service
- (2)(3) The metadata service resolves the attributes into logical file names
- (4)(5) The replica management service returns the locations of one or more replicas
- (6) Candidate transfer sources and destinations are passed to the information service
- (7) The information service returns performance measurements and predictions
- (8) The location of the selected replica is returned to the application

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
37. How Does the Replica Selection Achieve Its Goals?
- (1) Locate the replicas
- (2) Get the capabilities and usage policies of all the replicas
- (3) Search for the replicas that match the application's characteristics
- Built on top of the core services

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
38. The Replica Selection
- Two services are necessary to the replica
selection service
- Replica Management Service (high-level service)
- provides information on all existing replicas
- Resource Management (core service)
- provides information on the characteristics of
the underlying resources

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
39. GRIS / GIIS
[Diagram: a hierarchy of GIIS index nodes registering GRIS servers at the storage resources]
- Metacomputing Directory Service: information collection, publication, and access service for Grid resources
- Grid Index Information Service (GIIS)
- registers GRISs
- supports broad user queries
- Grid Resource Information Server (GRIS), one per storage resource
- collects and publishes system configuration metadata, security and state information
- dynamically generates information

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
40. Replica Catalog
Storage broker: search, match, access. You need a storage broker service.
[Diagram: a storage broker mediating between the replica catalog and the replicas]
41. The Matching Problem
- The matching process depends on
- the physical characteristics of the resources and
the load on the CPUs, networks, and storage
devices that are part of the end-to-end path
linking possible sources and sinks
- The matching process depends on factors which are
highly dynamic (they can change dramatically over
time)
- a predictor is needed to estimate future usage

Sudharshan Vazhkudai, Predicting the Performance
of Wide Area Data Transfers
42. Intelligent Matching Process
- Have the replica location expose performance data
about
- previous data transfers, which can be used to
predict future behaviour between sites
- Prediction of end-to-end system performance
- create a model of each system component involved
in the end-to-end data transfer (CPU, cache hits,
disk access, network, ...)
- use observations of past applications across the
entire system

Sudharshan Vazhkudai, Predicting the Performance
of Wide Area Data Transfers
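One simple way to turn past-transfer logs into a prediction, sketched here as an illustration: estimate the bandwidth of the next transfer between a site pair as an exponentially weighted average of the bandwidths observed for past transfers between the same pair. The smoothing weight `ALPHA` and the log format are our assumptions, not taken from the cited paper.

```python
ALPHA = 0.3  # weight of the most recent observation (assumed value)

def predict_bandwidth(history):
    """history: oldest-to-newest list of observed bandwidths (MB/s)
    for past transfers between one (source, destination) pair."""
    if not history:
        raise ValueError("no past transfers to predict from")
    estimate = history[0]
    for bw in history[1:]:
        # exponentially weighted moving average: recent transfers
        # count more, but old behaviour is not forgotten at once
        estimate = ALPHA * bw + (1 - ALPHA) * estimate
    return estimate

def predict_transfer_time(size_mb, history):
    """Predicted wall-clock time (s) to move size_mb between the pair."""
    return size_mb / predict_bandwidth(history)
```

Using full-transfer logs rather than small probes sidesteps the probe-versus-GridFTP discrepancy mentioned on the next slide, at the price of the sparsity problem it also mentions: between rare transfers the estimate can go stale.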
43. Collecting the Observations
- Tools: NWS, NetLogger, Web100, iperf, NetPerf
- Experience has shown a substantial difference in
performance between a small network probe (64 KB)
and the actual data transfer (GridFTP)
- From logs of past applications
- the sporadic nature of large data transfers means
that often no data is available about current
conditions

Sudharshan Vazhkudai, Predicting the Performance
of Wide Area Data Transfers
44. Summary
- Two services were not discussed in this course:
the security service and the data transfer service
- A number of techniques for replica management
have not been addressed
- replica location using small-world models
- a system for representing, querying, and
automating data derivation
- Most of the topics not addressed in this course
are covered by documents available at
www.globus.org/research/papers.html