Title: Grid Data Management Components
1. Grid Data Management Components
- Adam Belloum
- Computer Architecture Parallel Systems group
- University of Amsterdam
- adam_at_science.uva.nl
2. The Problem (application pull)
- A new class of applications is emerging in
different domains, involving huge collections of
data that are geographically distributed and
owned by different collaborating organizations
- Examples of such applications
- Large Hadron Collider at CERN (2005)
- Climate Modeling
3. Requirements of Grid-based Applications
- Efficient data transfer service
- Efficient data access service
- Reliability and security
- Possibility to create and manage multiple copies
of the data
4. Summary
- Distributed application
- Requested data management system?
- Services specific to the Data Grid infrastructure
- Low-level services (shared with other Grid components)
5. The Data Grid is then...
- The Data Grid is the infrastructure that provides
the services required for manipulating large,
geographically distributed collections of
measured and computed data
- security services
- replica services
- data transfer services
- etc.
- Design principles
- Mechanism neutrality
- Compatibility with Grid infrastructure
- Uniformity of information infrastructure
6. The Data Grid Architecture
- High-level components
- Core services
- Data Grid-specific services
- Generic Grid services
7. Replica Services for the Data Grid
- Possibility to create multiple copies of the data
- Efficient and reliable management of the replicas
- Efficient replication strategy: Replica Management Service
- Location of the replicas: replica location mechanism
- Coherence of the replicas: replica consistency mechanism
- Selection of a replica: Replica Selection Service
- Secure replica mechanism
8. Transfer Services for the Data Grid
- Fast mechanisms for large data transfers
- Reliable transfer mechanisms
- Secure transfer mechanisms
- GridFTP
9. Security Services for the Data Grid
- Authentication
- Who can access or view the data?
- Authorization
- Who is authorized to effectively use the data?
- Accounting
- Users may be charged for using the data
10. Replica Management for the Data Grid
How do we get efficient access to the data sets? Create replicas of the data sets.
[Diagram: a data set replicated across multiple sites]
11. The Replica Management Problem
- When a request for a large file is issued, a
considerable amount of bandwidth is required to
serve it. The bandwidth available at that time
determines the latency of access to the requested
file.
- Replicate files near the potential users
- (in other domains this is called caching)
12. What is the Replica Manager?
- It is a Grid service responsible for creating
complete and partial copies of datasets (mainly
collections of files)
- Grid data model
- Datasets are stored in files grouped into collections
- A replica
- is a subset of a collection that is stored on a
particular physical storage system

Bill Allcock et al., Data Management and Transfer
in High-Performance Computational Grid Environments
13. The Role of the Replica Manager Service
- Its purpose is to map a logical file name to a
physical name for the file on a specific storage
system
- Note: it does not use any semantic information
contained in the logical file names
14. Services Relevant to the Replica Manager
[Diagram: layered architecture]
- Applications: particle physics applications, climate modeling applications, etc.
- High-level services: Replica Management Service, Replica Selection Service, Metadata Services, Distributed Catalog Service, Information Services
- Protocols: storage management, catalog management, network management, and compute management protocols
- Core services: communications, service discovery (DNS), authentication, delegation
- Resources: storage systems, networks, compute systems, replica catalog, metadata catalog
15. Framework of the Replica Manager Service
- Separation of replication and metadata information
- Only the information needed to map logical file
names to physical locations is considered
- Replication semantics
- Replicas are not guaranteed to be coherent
- Information on the original copy is not saved
- Replica management consistency
- The replica manager is able to recover and
return to a consistent state
16. Replica Management Targets
- Replica management should answer the following
questions
- Which files should be replicated?
- static files, large files
- When should a replica be created?
- for frequently accessed files
- Where should the replicas be located?
- close to users, on fast storage systems, ...
17. Replication Strategy
How should I replicate the data? You need a replication strategy.
18. Simple Dynamic Replication Strategies
- Best client
- Files are replicated at the node where they are
most requested
- Cascading replication
- Replicas are created each time a request
threshold is reached, starting from the original
node (root) and following the node hierarchy
- Plain caching
- Files are stored locally at the client side
- Fast spread
- Files are stored on each node of the path to the
destination
- Caching plus cascading replication
19. Dynamic Model-Driven Replication
- The decision to replicate a file, and where to
locate the replicas, follows a performance model
that compares the costs and benefits of creating
replicas of a particular file at certain locations
- single-system stability
- transfer time between nodes
- storage cost
- accuracy of the replica location mechanism
- etc.

Kavitha R. et al., Improving Data Availability
through Dynamic Model-Driven Replication in Large
Peer-to-Peer Communities
20. Dynamic Model-Driven Replication
- The model-driven approach tries to answer
critical questions
- What is the optimal number of replicas for a
given file?
- Which is the best location for the replicas?
- When does a file need to be replicated?
21. Number of Replicas for a File
- Defined with respect to a target availability
- Proposed model
- RLacc * (1 - (1 - p)^r) > Avail
- where
- p is the probability of a node being up
- RLacc is the accuracy of the location mechanism
- Avail is the required availability
- r is the number of replicas

Kavitha R. et al., Improving Data Availability
through Dynamic Model-Driven Replication in Large
Peer-to-Peer Communities
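The model can be evaluated directly: given p, RLacc and the target availability, the smallest r with RLacc * (1 - (1 - p)^r) > Avail is found by trying increasing replica counts. A minimal sketch (the function name and the search loop are ours; the inequality is the slide's):

```python
def min_replicas(p, rl_acc, avail, r_max=1000):
    """Smallest replica count r with rl_acc * (1 - (1 - p)**r) > avail.

    p      -- probability that a single node is up
    rl_acc -- accuracy of the replica location mechanism
    avail  -- required availability
    """
    for r in range(1, r_max + 1):
        if rl_acc * (1 - (1 - p) ** r) > avail:
            return r
    raise ValueError("target availability unreachable within r_max replicas")
```

For example, with nodes that are up half the time (p = 0.5) and a perfect location mechanism, reaching 95% availability takes 5 replicas; an imperfect location mechanism (rl_acc < 1) pushes the count up further.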
22. Best Location for the Replicas
- A query to the discovery service returns a number
of candidate nodes for replication which
- don't contain a copy of the file
- have available storage
- and have a reasonable response time
- The best candidates should maximize the
difference between
- the replication benefit (as high as possible)
- and the replication cost (as low as possible)

Kavitha R. et al., Improving Data Availability
through Dynamic Model-Driven Replication in Large
Peer-to-Peer Communities
23. Best Location for the Replicas
- Replication cost
- S(F, N2) + trans(F, N1, N2)
- where
- N1 is the node that currently contains the file
- N2 is the candidate node for a new replica
- S(F, N) is the storage cost for a file F at node N
- trans(F, a, b) is the transfer cost between
locations a and b
- The benefit of creating a replica is
- trans(F, N1, User) - trans(F, N2, User)

Kavitha R. et al., Improving Data Availability
through Dynamic Model-Driven Replication in Large
Peer-to-Peer Communities
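The slide's arithmetic, cost = S(F, N2) + trans(F, N1, N2) and benefit = trans(F, N1, User) - trans(F, N2, User), can be wrapped in a small helper that ranks candidate nodes. `storage_cost` and `transfer_cost` are placeholder callables a real deployment would back with measurements; the function names are ours.

```python
def net_benefit(f, n1, n2, user, storage_cost, transfer_cost):
    """Benefit minus cost of replicating file f from node n1 to
    candidate n2, for an expected access by `user` (slide 23)."""
    cost = storage_cost(f, n2) + transfer_cost(f, n1, n2)
    benefit = transfer_cost(f, n1, user) - transfer_cost(f, n2, user)
    return benefit - cost

def best_candidate(f, n1, user, candidates, storage_cost, transfer_cost):
    """Candidate node maximizing benefit - cost (slide 22)."""
    return max(candidates,
               key=lambda n2: net_benefit(f, n1, n2, user,
                                          storage_cost, transfer_cost))
```

With a toy quadratic transfer cost between integer node positions, the ranking behaves as expected: a midpoint between source and user beats both a node next to the source and a node next to the user.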
24. Replica Catalog
How do I keep track of the replicas? Create a catalog.
[Diagram: multiple replicas registered in a catalog]
25. The Replica Catalog
- The replica catalog is a key component of the
replica management service; it provides the
mapping between logical and physical entities.
- The replica catalog registers three types of
entities
- Logical collections: represent a number of
logical file names
- Locations: map a logical collection to a
particular physical instance of that collection
- Logical files: represent a unique logical file
name
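A minimal in-memory sketch of the three entity types makes the mapping concrete. The dict layout, function names, and field names here are illustrative assumptions, not the Globus replica catalog schema.

```python
catalog = {
    "collections": {},  # collection name -> list of logical file names
    "locations": {},    # collection name -> list of physical instances
    "files": {},        # logical file name -> attributes (e.g. size)
}

def register_collection(name, logical_files):
    catalog["collections"][name] = list(logical_files)

def register_location(collection, protocol, hostname, path, files=None):
    """Map a logical collection to one physical (possibly partial)
    instance; by default the instance holds the whole collection."""
    catalog["locations"].setdefault(collection, []).append({
        "protocol": protocol, "hostname": hostname, "path": path,
        "files": list(files) if files else catalog["collections"][collection],
    })

def lookup(collection, logical_file):
    """Return the physical locations holding a given logical file."""
    return [loc for loc in catalog["locations"].get(collection, [])
            if logical_file in loc["files"]]
```

This mirrors the example on the next slide: a climate collection of monthly files, with one GridFTP-accessible physical instance.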
26. The Replica Catalog (example)
- Logical collection: logical files "Jan 1998",
"Feb 1998", "Mar 1998", "Jun 1998", etc.
- Location entry: Protocol: GridFTP, Hostname:
jupiter.isi.edu, Path: nfs/v6/climate
- Logical file entry: "Jan 1998", Size: 1468762
27. Operations Allowed on the Replica Catalog
- Publish (File_publish)
- copies a file from a storage system not
registered in the replica catalog to a registered
storage system and updates the replica catalog
- Copy (File_copy)
- copies a file from a registered storage system to
another registered storage system and updates the
replica catalog; this is what creates replicas
- Delete (File_delete)
- deletes a filename from a replica catalog
location entry and optionally removes the file
from the registered storage system
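The three operations can be sketched over a toy registry mapping each registered storage system to the set of files it holds. This is illustrative pseudocode made runnable, not the Globus replica management API; `transfer` stands in for the actual data movement (e.g. GridFTP).

```python
registry = {}  # registered storage system name -> set of file names held

def transfer(filename, src, dst):
    pass  # placeholder for the actual wide-area copy (e.g. GridFTP)

def file_publish(filename, unregistered_src, dst):
    """Publish: copy a file from an unregistered storage system to a
    registered one and record it in the catalog."""
    transfer(filename, unregistered_src, dst)
    registry.setdefault(dst, set()).add(filename)

def file_copy(filename, src, dst):
    """Copy: replicate a file between two registered storage systems
    and update the catalog -- this is what creates replicas."""
    if filename not in registry.get(src, set()):
        raise LookupError(f"{filename} is not registered at {src}")
    transfer(filename, src, dst)
    registry.setdefault(dst, set()).add(filename)

def file_delete(filename, location, remove_file=False):
    """Delete: drop a filename from a location entry, optionally
    removing the stored file as well."""
    registry.get(location, set()).discard(filename)
    if remove_file:
        pass  # placeholder: remove from the physical storage system
```

Note the asymmetry the slide calls out: delete only touches the catalog by default, while publish and copy always move data and update the catalog together.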
28. Replica Management Recovery
- At least two functions are required to restart
the replica manager after a failure
- restart
- rollback
29. Replica Location Service
Where did I put the replicas?! You need a replica location service.
30. Replica Location Service
[Diagram: layered data services]
- Application-oriented data services
- Data management services: reliable replication services
- Metadata service, replica location service, file transfer service

Ann Chervenak et al., Giggle: A Framework for
Constructing Scalable Replica Location Services
31. Role of the Replica Location Service
- The main task of the replica location service is
to find a specified number of physical file names
given a logical file name
- The minimal set of requirements is
- autonomy
- best-effort consistency
- adaptiveness
32. Distributed, Adaptive Replica Location Service
- Replica location node
- answers queries based on the LFN
- forwards queries
- digest distribution
- overlay network
- soft state
- Storage sites
- register/delete pairs of (LFN, PFN)

Matei Ripeanu, Ian Foster, Decentralized,
Adaptive Replica Location Service
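The soft-state idea above can be sketched concretely: storage sites periodically re-register their (LFN, PFN) pairs, and mappings that are not refreshed within a timeout simply age out, which is how best-effort consistency is achieved without explicit deletes. The TTL value and the data layout are our assumptions, not from the paper.

```python
import time

TTL = 30.0  # seconds a registration stays valid without a refresh (assumed)

mappings = {}  # LFN -> {PFN: time of last refresh}

def register(lfn, pfn, now=None):
    """A storage site (re-)registers one (LFN, PFN) pair."""
    now = time.time() if now is None else now
    mappings.setdefault(lfn, {})[pfn] = now

def query(lfn, now=None):
    """Return the PFNs whose registrations are still fresh; stale
    entries are dropped rather than explicitly invalidated."""
    now = time.time() if now is None else now
    live = {p: t for p, t in mappings.get(lfn, {}).items() if now - t <= TTL}
    mappings[lfn] = live
    return sorted(live)
```

A site that crashes stops refreshing and its stale mappings disappear on their own, so the index never needs to detect the failure explicitly.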
33. Replica Selection Service
Which of the many replicas should be used? You need a replica selection service.
[Diagram: multiple candidate replicas feeding a selection service]
34. The Problem of Replica Selection
- An application that requires access to replicated
data
- queries a specific metadata repository
- the logical file names identify the existing
replicas
- the application requires access to the most
appropriate replica (according to specific
characteristics)
- This task is achieved by replica selection.

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
35. The Role of Replica Selection
- Replica selection is the process of choosing a
replica from among those spread across the Grid,
based on characteristics specified by the
application
- access speed
- geographical location
- access cost
- etc.

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
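Stripped of the surrounding services, the selection step itself is a ranking over candidate replicas by an application-supplied criterion. The replica records and the example scoring function below are illustrative assumptions, not the Globus API.

```python
def select_replica(replicas, score):
    """Return the replica minimizing the application-supplied score,
    e.g. predicted access time or cost."""
    if not replicas:
        raise LookupError("no replica available")
    return min(replicas, key=score)

# Example criterion: prefer low predicted transfer time, break ties
# by access cost (both fields are hypothetical measurement outputs).
def by_time_then_cost(replica):
    return (replica["predicted_time"], replica["cost"])
```

The interesting engineering is in producing the scores, which is exactly what the information services and predictors on the following slides supply.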
36. A Data Selection Scenario
[Diagram: numbered interactions between the application and the Grid services]
- (1) The application sends the attributes of the desired data to the replica selection service
- (2)(3) The metadata service resolves the attributes into logical file names
- (4)(5) The replica management service returns the locations of one or more replicas
- (6) Candidate transfer sources and destinations are passed to the information service
- (7) The information service returns performance measurements and predictions
- (8) The location of the selected replica is returned to the application

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
37. How Does the Replica Selection Achieve Its Goals?
- (1) Locate the replicas
- (2) Get the capabilities and usage policies of all the replicas
- (3) Search for the replicas that match the application's characteristics
- Built on top of the core services

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
38. The Replica Selection
- Two services are necessary to the replica
selection service
- Replica Management Service (high-level service)
- provides information on all existing replicas
- Resource Management (core service)
- provides information on the characteristics of
the underlying resources

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
39. GRIS / GIIS
[Diagram: a hierarchy of GIIS index nodes registering GRIS servers at the storage resources]
- Metacomputing Directory Service: information collection, publication, and access service for Grid resources
- Grid Index Information Service (GIIS)
- registers GRISs
- supports broad user queries
- Grid Resource Information Server (GRIS), one per storage resource
- collects and publishes system configuration metadata, security and state information
- dynamically generates information

Sudharshan Vazhkudai, Replica Selection in the
Globus Data Grid
40. Replica Catalog
Storage broker: search, match, access. You need a storage broker service.
[Diagram: a storage broker mediating between the replica catalog and the replicas]
41. The Matching Problem
- The matching process depends on
- the physical characteristics of the resources and
the load on the CPUs, networks, and storage
devices that are part of the end-to-end path
linking possible sources and sinks
- The matching process depends on factors which are
highly dynamic (they can change dramatically over
time)
- a predictor is needed to estimate future usage

Sudharshan Vazhkudai, Predicting the Performance
of Wide Area Data Transfers
42. Intelligent Matching Process
- Have the replica location expose performance data
about
- previous data transfers, which can be used to
predict future behaviour between sites
- Prediction of end-to-end system performance
- create a model of each system component involved
in the end-to-end data transfer (CPU, cache hits,
disk access, network, ...)
- use observations of past applications across the
entire system

Sudharshan Vazhkudai, Predicting the Performance
of Wide Area Data Transfers
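One simple way to turn past-transfer logs into a prediction, sketched here as an illustration: estimate the bandwidth of the next transfer between a site pair as an exponentially weighted average of the bandwidths observed for past transfers between the same pair. The smoothing weight `ALPHA` and the log format are our assumptions, not taken from the cited paper.

```python
ALPHA = 0.3  # weight of the most recent observation (assumed value)

def predict_bandwidth(history):
    """history: oldest-to-newest list of observed bandwidths (MB/s)
    for past transfers between one (source, destination) pair."""
    if not history:
        raise ValueError("no past transfers to predict from")
    estimate = history[0]
    for bw in history[1:]:
        # exponentially weighted moving average: recent transfers
        # count more, but old behaviour is not forgotten at once
        estimate = ALPHA * bw + (1 - ALPHA) * estimate
    return estimate

def predict_transfer_time(size_mb, history):
    """Predicted wall-clock time (s) to move size_mb between the pair."""
    return size_mb / predict_bandwidth(history)
```

Using full-transfer logs rather than small probes sidesteps the probe-versus-GridFTP discrepancy mentioned on the next slide, at the price of the sparsity problem it also mentions: between rare transfers the estimate can go stale.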
43. Collecting the Observations
- Tools: NWS, NetLogger, Web100, iperf, NetPerf
- Experience has shown a substantial difference in
performance between a small network probe (64 KB)
and the actual data transfer (GridFTP)
- From logs of past applications
- the sporadic nature of large data transfers means
that often no data is available about current
conditions

Sudharshan Vazhkudai, Predicting the Performance
of Wide Area Data Transfers
44. Summary
- Two services were not discussed in this course:
the security service and the data transfer service
- A number of techniques for replica management
have not been addressed
- replica location using small-world models
- a system for representing, querying, and
automating data derivation
- Most of the topics not addressed in this course
are covered by documents available at
www.globus.org/research/papers.html