Grid Data Management Components - PowerPoint PPT Presentation

About This Presentation
Title:

Grid Data Management Components

Description:

Large Hadron Collider at CERN (2005) Climate Modeling. 3 ... When a request for a large file is issued, it requires a considerable amount of ... – PowerPoint PPT presentation

Slides: 45
Provided by: Ada578

Transcript and Presenter's Notes

Title: Grid Data Management Components


1
Grid Data Management Components
  • Adam Belloum
  • Computer Architecture Parallel Systems group
  • University of Amsterdam
  • adam_at_science.uva.nl

2
The Problem (application pull)
  • A new class of applications is emerging in
    different domains. These applications involve
    huge collections of geographically distributed
    data owned by different collaborating
    organizations
  • Examples of such applications
  • Large Hadron Collider at CERN (2005)
  • Climate Modeling

3
Requirements of the Grid-based Applications
  • Efficient data transfer service
  • Efficient data access service
  • Reliability and security
  • Possibility to create and manage multiple copies
    of the data

4
Summary
Distributed Application
Requested Data Management System ?
Services specific to the data Grid
infrastructure
Low level Services (shared with other Grid
Components)
5
The Data Grid is then
  • The Data Grid is the infrastructure that provides
    the services required for manipulating large,
    geographically distributed collections of
    measured and computed data
  • security services
  • replicas services
  • data transfer services
  • Etc.
  • Design Principles
  • Mechanism neutrality
  • Compatibility with Grid infrastructure
  • Uniformity of information and infrastructure

6
The Data Grid Architecture
High Level Components
Core Services
Data Grid Specific services
Generic Grid services
7
Replica services for Data Grid
  • Possibility to create multiple copies of the data
  • Efficient and reliable management of the replicas
  • Efficient replication strategy: Replica
    Management Service
  • Location of the replicas: Replica location
    mechanism
  • Coherence of the replicas: Replica consistency
    mechanism
  • Selection of the replica: Replica selection
    service
  • Secure replicas mechanism

8
Transfer services for Data Grid
  • Fast mechanisms for large data transfer
  • Reliable transfer mechanisms
  • Secure transfer mechanisms
  • GridFTP

9
Security services for Data Grid
  • Authentication
  • Who can access or view the data?
  • Authorization
  • Who is authorized to effectively use the data?
  • Accounting
  • The users may be charged for using the data

10
Replica Management for Data Grid
Efficient access to the data sets?
Create replicas of the data sets
[Figure: several replicas of the data sets]
11
Replica Management Problem
  • When a request for a large file is issued,
    serving it requires a considerable amount of
    bandwidth. Whether that bandwidth is available
    at the time of the request affects the latency
    of access to the requested file.
  • Replicate files near their potential users
  • (in other domains this is called caching)

12
What is the replica manager?
  • It is a Grid Service responsible for
    creating complete and partial copies of the
    datasets (mainly a collection of files)
  • Grid Data model
  • Datasets are stored in files grouped into
    collections
  • A replica
  • Is a subset of a collection that is stored on a
    particular physical storage

Bill Allcock et al. Data Management and
Transfer in High-Performance Computational Grid
Environments
13
The role of the replica manager service
  • Its purpose is to map a logical file name to a
    physical name for the file on a specific storage
  • Note: it does not use any semantic information
    contained in the logical file names

14
Services relevant to the Replicas Manager
Particle Physics applications, climate modeling
application, etc.
Replica Mgmt service
Replica Selection Service
Metadata Services
Distributed Catalog Service
Information Services
Storage Mgmt protocols
Catalog Mgmt protocols
Network Mgmt protocols
Compute Mgmt protocols
Communications, service discovery (DNS),
authentication, delegation
Storage Systems
Networks
Compute Systems
Replica catalog
Metadata Catalog
15
Framework of the replica manager service
  • Separation of Replication and Metadata
    Information
  • Only the information needed to map logical
    file names to physical locations is
    considered
  • Replication Semantics
  • The replicas are not guaranteed to be coherent,
  • The information on the original copy is not saved
  • Replica Management Consistency
  • The Replicas Manager is able to recover and
    return to a consistent state

16
Replicas Management Targets
  • Replicas Management should answer the following
    questions
  • Which files should be replicated?
  • static files, large size files
  • When should a replica be created?
  • Frequently accessed files
  • Where should the replicas be located?
  • Close to users, on fast storage systems, etc.

17
Replication Strategy
How should I replicate the data?
You need a replication strategy
18
Simple Dynamic Replication Strategies
  • Best Client
  • Files are replicated at the node where they are
    requested most often
  • Cascading replication
  • Replicas are created each time a threshold of
    requests is reached, starting from the original
    node (root) and following the hierarchy of the nodes
  • Plain caching
  • Files are stored locally at the client side
  • Fast Spread
  • Files are stored on each node of the path to the
    destination
  • Caching plus cascading replication
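
As an illustration of the Best Client strategy listed above, here is a minimal sketch in Python. The class name, the method names, and the `threshold` parameter are assumptions made for this example and do not come from the slides. The node that holds a file counts requests per client and, once enough requests have arrived, nominates the most frequent requester as the place for a new replica.

```python
from collections import Counter


class BestClientTracker:
    """Counts requests per file and client, as a node holding the file might."""

    def __init__(self, threshold):
        # Hypothetical parameter: total requests a file needs before it is replicated.
        self.threshold = threshold
        self.requests = {}  # file name -> Counter mapping client name to request count

    def record_request(self, file_name, client):
        self.requests.setdefault(file_name, Counter())[client] += 1

    def replication_target(self, file_name):
        """Return the client that requested the file most often, or None."""
        counts = self.requests.get(file_name)
        if not counts or sum(counts.values()) < self.threshold:
            return None
        client, _ = counts.most_common(1)[0]
        return client


tracker = BestClientTracker(threshold=3)
for client in ["amsterdam", "geneva", "amsterdam"]:
    tracker.record_request("run42.dat", client)
print(tracker.replication_target("run42.dat"))  # amsterdam
```

Cascading replication could reuse the same counters, but would place the new replica one level further down the node hierarchy instead of directly at the requesting client.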

19
Dynamic Model-Driven Replication
  • The decisions of whether to replicate a file and
    where to place the replicas are made using a
    performance model that compares the costs and the
    benefits of creating replicas of a particular
    file in certain locations
  • Single-system stability
  • Transfer time between nodes
  • Storage cost
  • Accuracy of the replica location mechanism
  • Etc.

Kavitha R. et al. Improving Data Availability
through Dynamic-Driven Replication in Large
Peer-to-Peer Communities
20
Dynamic Model-Driven Replication
  • The model driven approach is trying to answer
    critical questions
  • What is the optimal number of replicas for a
    given file?
  • Which is the best location for the replicas?
  • When does a file need to be replicated?

21
Number of replicas for a file
  • Is determined by the required availability
  • Proposed model
  • $RL_{acc} \cdot (1 - (1 - p)^{r}) > Avail$
  • Where
  • $p$ is the probability of a node being up
  • $RL_{acc}$ is the accuracy of the location mechanism
  • $Avail$ is the needed availability
  • $r$ is the number of replicas

Kavitha R. et al. Improving Data Availability
through Dynamic-Driven Replication in Large
Peer-to-Peer Communities
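
The availability model on this slide, RLacc * (1 - (1 - p)^r) > Avail, can be turned into a small calculation. The sketch below searches for the smallest r that satisfies the inequality. The function name and the `max_replicas` search bound are assumptions for illustration only.

```python
def min_replicas(p, rl_acc, avail, max_replicas=1000):
    """Smallest r with rl_acc * (1 - (1 - p)**r) > avail, or None if none is found.

    p            probability of a node being up
    rl_acc       accuracy of the replica location mechanism
    avail        needed availability
    max_replicas hypothetical upper bound on the search (not part of the model)
    """
    for r in range(1, max_replicas + 1):
        if rl_acc * (1 - (1 - p) ** r) > avail:
            return r
    return None


print(min_replicas(p=0.3, rl_acc=0.8, avail=0.75))  # 8
print(min_replicas(p=0.3, rl_acc=0.8, avail=0.90))  # None
```

The second call returns None because the left-hand side can never exceed RLacc: if the location mechanism is only 80% accurate, no number of replicas reaches 90% availability.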
22
Best location for the replicas
  • A query to the Discovery service returns a number
    of nodes (candidate for replication) which
  • do not contain a copy of the file
  • have available storage
  • have a reasonable response time
  • The best candidates should maximize the
    difference between
  • The replication benefit (as high as
    possible)
  • The replication costs (as low as possible)

Kavitha R. et al. Improving Data Availability
through Dynamic-Driven Replication in Large
Peer-to-Peer Communities
23
Best location for the replicas
  • Replication costs
  • $S(F, N_2) + trans(F, N_1, N_2)$
  • Where
  • $N_1$ is the node that currently contains the file
  • $N_2$ is the candidate for a new replica
  • $S(F, N)$ is the storage cost for a file $F$ at node $N$
  • $trans(F, a, b)$ is the transfer cost of $F$ between
    locations $a$ and $b$
  • The benefit of creating a replica is
  • $trans(F, N_1, User) - trans(F, N_2, User)$

Kavitha R. et al. Improving Data Availability
through Dynamic-Driven Replication in Large
Peer-to-Peer Communities
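
A minimal sketch of how the best candidate could be chosen from these quantities: the cost is taken as storage cost plus the transfer cost from N1 to N2, the benefit as the reduction in transfer cost to the user, and the candidate with the largest difference wins. The dictionary layout, function names, and all numbers are made up for illustration.

```python
def net_benefit(trans, storage, f, n1, n2, user):
    """Benefit minus cost of replicating file f from node n1 to candidate n2."""
    cost = storage[(f, n2)] + trans[(f, n1, n2)]
    benefit = trans[(f, n1, user)] - trans[(f, n2, user)]
    return benefit - cost


def best_candidate(trans, storage, f, n1, candidates, user):
    """Candidate node that maximizes the difference between benefit and cost."""
    return max(candidates, key=lambda n2: net_benefit(trans, storage, f, n1, n2, user))


# Hypothetical costs in arbitrary units.
trans = {
    ("F", "A", "U"): 10, ("F", "B", "U"): 2, ("F", "C", "U"): 6,
    ("F", "A", "B"): 3, ("F", "A", "C"): 1,
}
storage = {("F", "B"): 1, ("F", "C"): 1}

print(net_benefit(trans, storage, "F", "A", "B", "U"))            # 4
print(net_benefit(trans, storage, "F", "A", "C", "U"))            # 2
print(best_candidate(trans, storage, "F", "A", ["B", "C"], "U"))  # B
```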
24
Replica Catalog
How do I keep track of the replicas?
Create a catalog
[Figure: several replicas]
25
The Replica catalog
  • The Replica catalog is a key component of the
    Replica management service, it provides the
    mapping between logical and physical entities.
  • The Replica catalog registers three types of
    entities
  • Logical collections: each represents a number of
    logical file names
  • Locations: each maps the logical collection to a
    particular physical instance of that collection
  • Logical files: each represents a unique logical
    file name

26
The Replica catalog
Filename: Jan 1998, Filename: Feb 1998, etc.
Filename: Mar 1998, Filename: Jun 1998, Protocol: GridFTP, Hostname: jupiter.isi.edu, Path: nfs/v6/climate
Logical File: Jan 1998, Size: 1468762
[Figure: example replica catalog with further Logical File entries]
27
Operations allowed on the Replica catalog
  • Publish (File_publish)
  • Copies a file from a storage system not
    registered in the replica catalogue to a
    registered storage system and updates the replica
    catalogue.
  • Copy (File_copy)
  • Copies a file from a registered storage system to
    another registered storage system and updates the
    replica catalogue. This is the operation that
    creates replicas.
  • Delete (File_delete)
  • Deletes a filename from the replica catalogue
    location entry and optionally removes the file
    from the registered storage system.
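
The catalogue side of these three operations can be sketched as follows. This is a toy in-memory model under stated assumptions: the class, its method names, and the host and path values are hypothetical, and the actual file transfer (for example over GridFTP) is deliberately left out, so only the bookkeeping is shown.

```python
class ReplicaCatalog:
    """Toy in-memory catalogue: logical file name -> physical locations."""

    def __init__(self):
        self.storage = set()  # registered storage systems
        self.entries = {}     # logical file name -> {storage system: physical path}

    def register_storage(self, host):
        self.storage.add(host)

    def _add(self, lfn, host, path):
        if host not in self.storage:
            raise ValueError(f"{host} is not a registered storage system")
        self.entries.setdefault(lfn, {})[host] = path

    def publish(self, lfn, dest_host, dest_path):
        # The file comes from unregistered storage; only the destination is recorded.
        self._add(lfn, dest_host, dest_path)

    def copy(self, lfn, src_host, dest_host, dest_path):
        # The source must already be known to the catalogue; this creates a replica entry.
        if src_host not in self.entries.get(lfn, {}):
            raise KeyError(f"{lfn} has no registered copy on {src_host}")
        self._add(lfn, dest_host, dest_path)

    def delete(self, lfn, host):
        # Removes the catalogue entry only; deleting the physical file is optional.
        self.entries.get(lfn, {}).pop(host, None)

    def lookup(self, lfn):
        return dict(self.entries.get(lfn, {}))


catalog = ReplicaCatalog()
catalog.register_storage("site-a")
catalog.register_storage("site-b")
catalog.publish("jan1998", "site-a", "/data/jan1998")
catalog.copy("jan1998", "site-a", "site-b", "/mirror/jan1998")
catalog.delete("jan1998", "site-a")
print(catalog.lookup("jan1998"))  # {'site-b': '/mirror/jan1998'}
```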

28
Replicas management recovery
  • At least two functions are required to
    restart the replica manager after a failure
  • Restart
  • Rollback

29
Replica Location Service
Where did I put the replicas?
You need a replica location service
30
Replica Location Service
Application-Oriented Data Services
Data Management services
Reliable Replication Services
Metadata Service
Replica Location Service
File Transfer Service
Ann Chervenak et al. Giggle: A Framework for
Constructing Scalable Replica Location Services
31
Role of the replica location service
  • The main task of the replica location Service
    is to find a specified number of Physical File
    Names given a Logical File Name
  • The minimal set of required functions is
  • Autonomy
  • Best-effort consistency
  • Adaptiveness

32
Distributed, Adaptive Replica Location Service
  • Replica Location Node: query based on LFN,
    query forwarding, digest distribution,
    overlay network, soft state
  • Storage sites: register/delete pairs of
    (LFN, PFN)
Matei Ripeanu, Ian Foster. Decentralized,
Adaptive Replica Location service
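
To make the soft-state idea concrete, here is a minimal sketch of a single replica location node. It assumes that storage sites re-register their (LFN, PFN) pairs periodically and that entries not refreshed within a time-to-live are ignored. The class, the `ttl` parameter, the explicit `now` argument, and the example names are assumptions for this illustration. Query forwarding, digests, and the overlay network are not modelled.

```python
class SoftStateIndex:
    """One replica location node holding (LFN, PFN) pairs that expire unless refreshed."""

    def __init__(self, ttl):
        self.ttl = ttl      # hypothetical lifetime of a registration, in seconds
        self.entries = {}   # (lfn, pfn) -> time of the last registration

    def register(self, lfn, pfn, now):
        # Registering again simply refreshes the timestamp (soft state).
        self.entries[(lfn, pfn)] = now

    def delete(self, lfn, pfn):
        self.entries.pop((lfn, pfn), None)

    def lookup(self, lfn, count, now):
        """Return up to `count` physical file names still considered live for `lfn`."""
        live = [pfn for (name, pfn), seen in self.entries.items()
                if name == lfn and now - seen <= self.ttl]
        return sorted(live)[:count]


index = SoftStateIndex(ttl=60)
index.register("lfn://climate/jan1998", "site-a:/data/jan1998", now=0)
index.register("lfn://climate/jan1998", "site-b:/mirror/jan1998", now=0)
index.register("lfn://climate/jan1998", "site-b:/mirror/jan1998", now=50)  # refresh
print(index.lookup("lfn://climate/jan1998", count=2, now=70))
# ['site-b:/mirror/jan1998']
```

Because stale entries simply age out, a site that crashes never has to send an explicit delete, which fits the best-effort consistency requirement.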
33
Replica Selection Service
You need a replica selection service
[Figure: several replicas]
34
The Problem of the Replicas Selection
  • An application that requires access to replicated
    data
  • Queries a specific metadata repository.
  • The logical file names are identified, along with
    the existence of replicas.
  • The application requires access to the most
    appropriate replica (according to specific
    characteristics).
  • This task is performed by the Replica selection
    service.

Sudharshan Vazhkudai Replica Selection in the
Globus Data Grid
35
The role of replica selection
  • Replica selection is the process of choosing
    a replica from among those spread across the Grid
    based on some characteristics specified by the
    application.
  • Access speed
  • Geographical location
  • Access costs
  • Etc.

Sudharshan Vazhkudai Replica Selection in the
Globus Data Grid
36
A Data selection scenario
[Figure: data selection scenario involving the application, the Replica Selection Service, the Metadata service, the Replica Management Service, and the Information Service. Labeled steps:]
  • (1) Attributes of desired data
  • (2), (3) Logical file names
  • (4), (5) Locations of one or more replicas
  • (6) Source and destination candidate transfers
  • (7) Performance measurements and predictions
  • (8) Location of the selected replicas
Sudharshan Vazhkudai Replica Selection in the
Globus Data Grid
37
How does replica selection achieve its goals?
(1) Locate the replicas
(2) Get the capability and usage policy for
all the replicas
(3) Search for the replicas that match
the application's characteristics
Core Services
Sudharshan Vazhkudai Replica Selection in the
Globus Data Grid
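
The three steps on this slide can be sketched as a filter followed by a ranking. The attribute names, the replica records, and the function below are hypothetical and only illustrate the idea of matching application requirements against replica capabilities. They do not reproduce the matching mechanism of the cited work.

```python
def select_replica(replicas, min_requirements, rank_by):
    """Pick the replica that meets all minimum requirements and ranks highest.

    replicas         list of dicts with a 'location' and an 'attrs' dict (steps 1 and 2)
    min_requirements attribute name -> minimum acceptable value (step 3)
    rank_by          attribute used to order the matching replicas
    """
    matching = [
        r for r in replicas
        if all(r["attrs"].get(name, 0) >= minimum
               for name, minimum in min_requirements.items())
    ]
    if not matching:
        return None
    return max(matching, key=lambda r: r["attrs"].get(rank_by, 0))


# Hypothetical capabilities published for three replicas of the same logical file.
replicas = [
    {"location": "site-a", "attrs": {"bandwidth_mbps": 40, "free_slots": 0}},
    {"location": "site-b", "attrs": {"bandwidth_mbps": 25, "free_slots": 4}},
    {"location": "site-c", "attrs": {"bandwidth_mbps": 10, "free_slots": 8}},
]

chosen = select_replica(replicas, {"free_slots": 1}, rank_by="bandwidth_mbps")
print(chosen["location"])  # site-b
```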
38
The replica selection
  • Two services are necessary to the replica
    selection service
  • Replica management Service (High Level Service)
  • Provides the information on all existing replicas
  • Resource Management (Core Service)
  • Provides the information of the characteristics
    of the underlying resource.

Sudharshan Vazhkudai Replica Selection in the
Globus Data Grid
39
Metacomputing Directory Service
  • Information collection, publication, and access
    service for grid resources
Grid Index Information Service (GIIS)
  • Registers GRISs
  • Supports broad user queries
Grid Resource Information Server (GRIS)
  • Attached to a resource (here, a storage resource)
  • Collects and publishes system configuration
    metadata
  • Security
  • State propagation
  • Dynamically generates information
[Figure: several GIIS and GRIS nodes]
Sudharshan Vazhkudai Replica Selection in the
Globus Data Grid
40
Replica catalog
Storage broker: search, match, access
You need a storage broker service
[Figure: several replicas]
41
Matching Problem
  • The matching process depends on
  • Physical characteristics of the resources and the
    load of the CPU, Networks, storage devices that
    are part of the end-to-end path linking possible
    sources and sinks
  • The matching process depends on factors which are
    very dynamic (they can change dramatically in the
    future)
  • Predictor to estimate future usage

Sudharshan Vazhkudai Predicting the Performance
of Wide Area Data Transfers
42
Intelligent Matching process
  • Have replica locations expose performance
    information about previous data transfers,
    which can be used to predict future behaviour
    between sites
  • Prediction of end-to-end system performance
  • Create a model of each system component involved
    in the end-to-end data transfer (CPU, cache hit,
    disk access, network)
  • Or use observations of past applications across
    the entire system.

Sudharshan Vazhkudai Predicting the Performance
of Wide Area Data Transfers
43
Collecting the observations
  • Tools: NWS, NetLogger, Web100, iperf, NetPerf
  • Experience has shown a substantial difference in
    performance between a small network probe (64 KB)
    and the actual data transfer (GridFTP)
  • From logs of past applications
  • The sporadic nature of large data transfers means
    that often there is no data available about
    current conditions

Sudharshan Vazhkudai Predicting the Performance
of Wide Area Data Transfers
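
A prediction from logs of past transfers can be as simple as a summary statistic over the most recent observations. The sketch below is a minimal example of that idea and not the predictor used in the cited work. The function name, the `window` parameter, and the bandwidth values are assumptions for illustration.

```python
from statistics import mean, median


def predict_bandwidth(history, window=5, method="mean"):
    """Predict the next transfer bandwidth from the last `window` logged transfers.

    history  bandwidths of past transfers between two sites, oldest first
    window   hypothetical number of recent observations to consider
    method   'mean' or 'median'
    """
    recent = history[-window:]
    if not recent:
        return None  # sporadic transfers: there may be no data at all
    return mean(recent) if method == "mean" else median(recent)


# Hypothetical log of past transfer bandwidths between two sites, in MB/s.
log = [10, 20, 30, 40, 100]
print(predict_bandwidth(log, window=4, method="mean"))    # 47.5
print(predict_bandwidth(log, window=4, method="median"))  # 35.0
print(predict_bandwidth([], window=4))                    # None
```

The median is less sensitive to a single unusually fast or slow transfer, which matters when only a handful of observations are available.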
44
Summary
  • Two services were not discussed in this course:
    the security service and the data transfer
    service
  • A number of replica management techniques
    have not been addressed
  • Replica location using small-world Models
  • System for Representing, Querying, and Automating
    Data derivation
  • Most of the topics not addressed in this course
    are covered by documents available at
    www.globus.org/research/papers.hmtl