Title: Data Grid Technologies
1 Data Grid Technologies
- Sathish Vadhiyar
- Sources/Credits: technical papers listed in the
references
2 Replica Strategies
3 Problem Motivation
- Replication helps deal with faults and provides
scheduling flexibility.
- Given a file that is partitioned into blocks that
are replicated throughout a wide-area file
system, how can a client retrieve the file with
the best performance?
- Various algorithms
4 Basic Downloading Algorithm
- The client opens a thread to each server
containing the file
- A block size is chosen
- Each thread selects a different block to download,
and all threads start downloading
- When a thread finishes, it chooses a new block that is
currently not being downloaded by any other
thread
- Adaptive: servers with higher bandwidths to the
client download more blocks
- Selection of block size is tricky
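The algorithm above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `fetch_block` is a hypothetical callback standing in for the actual network transfer. Faster servers finish their blocks sooner, loop back sooner, and so naturally claim more blocks, which is where the adaptivity comes from.

```python
import threading

def download_file(num_blocks, servers, fetch_block):
    """One thread per server; each thread repeatedly claims the lowest
    block no other thread has claimed and downloads it."""
    lock = threading.Lock()
    claimed = set()   # blocks some thread has taken responsibility for
    result = {}       # block index -> downloaded data

    def worker(server):
        while True:
            with lock:
                remaining = [b for b in range(num_blocks) if b not in claimed]
                if not remaining:
                    return
                block = remaining[0]
                claimed.add(block)
            # Outside the lock: the (simulated) transfer itself.
            result[block] = fetch_block(server, block)

    threads = [threading.Thread(target=worker, args=(s,)) for s in servers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [result[b] for b in range(num_blocks)]
```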
5 Aggressive Redundancy
- To provide fault tolerance and to improve
download time
- A redundancy factor, R
- The client downloads a block simultaneously from
R servers
- Only one copy is kept - whichever returns first
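A sketch of the per-block redundancy, again with a hypothetical `fetch_block` callback. A real client would cancel the losing transfers; here the pool simply lets them finish and discards their results.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def fetch_redundant(block, servers, r, fetch_block):
    """Request `block` from r servers at once and keep whichever
    copy arrives first."""
    with ThreadPoolExecutor(max_workers=r) as pool:
        futures = [pool.submit(fetch_block, s, block) for s in servers[:r]]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
```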
6 Progress-Driven Redundancy
- Retry a download only when it is progressing
slowly
- Progress number P, redundancy factor R
- Each block is assigned a download number, initialized
to 0
- When a thread attempts to download a block, it
increments the block's download number
7 Progress-Driven Redundancy (Continued)
- For selecting a new block to download:
- If there is a block B whose download number is less than R,
and there are P blocks after B whose downloads
have completed, then select B
- Else select the next block whose download number is
zero
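The selection rule above, as a sketch (the exact bookkeeping in the paper may differ):

```python
def select_block(download_num, completed, p, r):
    """download_num[i]: how many threads have attempted block i;
    completed[i]: whether block i has finished downloading.
    Returns the index of the block to download next, or None."""
    n = len(download_num)
    # Rule 1: retry a lagging block B (started, unfinished, download
    # number still below R) once P blocks after it have completed.
    for b in range(n):
        if (not completed[b] and 0 < download_num[b] < r
                and sum(completed[b + 1:]) >= p):
            return b
    # Rule 2: otherwise take the next block nobody has started.
    for b in range(n):
        if download_num[b] == 0:
            return b
    return None
```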
8 Fastest1
- Another approach
- For downloading a block, choose the server that has
the minimum value of time * (l + 1)
- time: predicted time to download a block when
there is no contention, obtained from NWS numbers
before the download is initiated
- l: number of threads currently downloading from
the server
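The selection rule is one line; a sketch:

```python
def fastest1(predicted_time, active_threads):
    """Pick the server minimizing t * (l + 1), where t is the
    NWS-predicted contention-free download time for that server and
    l is the number of threads currently downloading from it."""
    return min(predicted_time,
               key=lambda s: predicted_time[s] * (active_threads[s] + 1))
```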
9 Results
10 Multiple clients
- The situation arises when parallel data for
computation on parallel clients has to be
selected from available replica server locations
- More challenges: a download decision by one client
can impact download performance of other clients.
Need to predict this impact.
- Periodic network monitoring has to be augmented
by measurements corresponding to current
downloads
11 Collective Download Algorithm
- Each client connects to a server only once,
even if some of the data belongs to other clients
(download phase)
- The clients then redistribute data among
themselves (redistribution phase)
- Widely followed in parallel I/O
- Especially useful when clients and servers are on
either side of a WAN: multiple latencies can be
avoided at the cost of a less expensive
redistribution phase
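One way to read the two phases, as a rough sketch. The round-robin assignment of servers to downloading clients and the `fetch` bulk-transfer callback are assumptions for illustration, not from the paper.

```python
def collective_download(clients, servers, need, fetch):
    """need[c]: list of (server, block) pairs client c wants.
    fetch(client, server, blocks): one bulk transfer, returning
    {block: data}. Phase 1: each server is contacted by exactly one
    client, which pulls every block any client needs from it.
    Phase 2: blocks are redistributed to the clients that want them."""
    # Phase 1: gather, one connection per server.
    per_server = {s: set() for s in servers}
    for c in clients:
        for s, b in need[c]:
            per_server[s].add(b)
    staged = {}  # (server, block) -> data, held collectively
    for i, s in enumerate(servers):
        downloader = clients[i % len(clients)]  # round robin (assumption)
        for b, data in fetch(downloader, s, sorted(per_server[s])).items():
            staged[(s, b)] = data
    # Phase 2: cheap redistribution among the clients.
    return {c: {key: staged[key] for key in need[c]} for c in clients}
```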
12 Replica Placement Strategies
- Replica placement questions
- When should replicas be created?
- Which files should be replicated?
- Where should replicas be placed?
- The model assumes that data is produced in tier-1
(the root) and that there is storage space at various
tiers (levels of the hierarchy)
- Clients that request data form the leaves of the
hierarchy
13 Placement strategies
- Best client
- Each storage node maintains a history of the
number of requests for the files it contains
- If the number of requests for a file exceeds a
threshold, the node creates a replica of the file
at the client node that has generated the most
requests for that file (the best client)
- The request details for the file are then cleared.
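A sketch of the best-client decision, using an in-memory request counter (the data structure is an assumption; the paper only describes the policy):

```python
def maybe_replicate(counts, threshold):
    """counts[f][c]: number of requests for file f from client node c.
    When a file's total request count exceeds the threshold, return
    (file, best_client) and clear that file's history; else None."""
    for f, per_client in counts.items():
        if sum(per_client.values()) > threshold:
            best = max(per_client, key=per_client.get)  # best client
            counts[f] = {}  # clear the request details for this file
            return f, best
    return None
```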
14 Strategies
- Cascading replication
- Analogy to a 3-tiered fountain
- Once a threshold for a file is exceeded at the
root, a replica is created at the next level on
the path to the best client, and so on
- Geographical locality is exploited
- Plain caching done at the client
- Caching plus cascading replication
15 Strategies
- Fast Spread
- A replica of the file is stored at each node
along its path to the client
- Replica selection: closest replica
- Replica replacement: the least popular file with the
oldest age is replaced. Popularity logs are
cleared periodically
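The Fast Spread replacement rule in miniature (file metadata layout is an assumption):

```python
def evict_victim(files):
    """files: name -> (popularity, age). Evict the least popular file,
    breaking ties by oldest age."""
    return min(files, key=lambda f: (files[f][0], -files[f][1]))
```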
16 Findings
- Best-client performs worst for random access
patterns and improves for access
patterns with a degree of geographical locality
- Fast Spread works much better than cascading for
random data access
- Bandwidth savings are greater with Fast Spread than with
cascading
- Fast Spread has high storage requirements
17 Computation and Data
18 GriPhyN
- Focuses on virtual data grid technologies
- Allows exploitation of computation procedures and
results as community resources
- A request for data can either retrieve the data or
execute the computation procedures that produce
it
19 Challenges
- Representing transformations in a virtual data
catalog
- Tracing derived data
- Mapping computations onto effective flow graphs
- Rebuilding dependent objects when code or data
changes
- Automated generation and scheduling of the
computations required to instantiate data products
20 Chimera
- A virtual data system that supports capture and
reuse of data generated by computations
- Consists of a virtual data catalog (VDC) and a virtual data
language (VDL) interpreter
- The VDC tracks how data is derived
- Transformation: an abstract definition of how a
program is to be invoked, what parameters and
input files it needs, etc.
- Derivation: an invocation of a transformation with a
specific set of inputs and files
- Execution of all transformations is recorded in the
Chimera database
- VDL query functions allow searching the VDC for
derivations or transformations, queried by
application, transformation, input, or output name
21 Chimera architecture
22 Transformation and Derivation Example
23 Chimera-Pegasus Architecture
24 Workflows
25 Decoupling computation and data movement
26 Architecture
- External Scheduler (ES)
- Decides which remote site to send the job to
- Local Scheduler (LS)
- Follows its own policies
- Data Scheduler (DS)
- Replicates popular data sets to remote sites
following some algorithm
27 Algorithms
- 4 different ES algorithms
- JobRandom
- JobLeastLoaded
- JobDataPresent
- JobLocal
- 3 different DS algorithms
- DataDoNothing
- DataRandom
- DataLeastLoaded
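To give the flavor of these, a sketch of JobDataPresent: send the job where its input already sits. The tie-break by queue length and the fallback argument are assumptions for illustration, not details from the paper.

```python
def job_data_present(job_input, sites, fallback):
    """sites: name -> (queue_length, set_of_local_files). Prefer a site
    that already holds the job's input file (least loaded among them);
    otherwise fall back to `fallback`."""
    holders = [s for s, (_, files) in sites.items() if job_input in files]
    if holders:
        return min(holders, key=lambda s: sites[s][0])
    return fallback
```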
28 Simulation
- A discrete event simulator was used
- Resource capacities were modeled
- Dataset sizes: uniform distribution between 500
MB and 2 GB
- Initially, only one replica per data set
- Users mapped evenly across sites
- Each job requires a single input file and
runs for 300 * D seconds, where D is the input size
in GB
- Network contention modeled based on the number of
simultaneous data transfers
- Input file requests generated randomly according
to a geometric distribution based on the popularity of
files
29 Popularity distribution
30 Results
31 Sources / References / Credits
- Algorithms for High Performance, Wide-Area
Distributed File Downloads. J.S. Plank, S.
Atchley, Y. Ding and M. Beck. Parallel Processing
Letters, vol. 13, no. 2, pp. 207-224, June 2003.
- Downloading Replicated Wide-Area Files: a
Framework and Empirical Evaluation. R.L. Collins
and J.S. Plank. NCA 2004.
- Identifying Dynamic Replication Strategies for a
High-Performance Data Grid. K. Ranganathan and I.
Foster. Grid 2002.
32 Sources / References / Credits
- Grid-Based Galaxy Morphology Analysis for the
National Virtual Observatory. Ewa Deelman,
Raymond Plante, Carl Kesselman, Gurmeet Singh,
Mei-Hui Su, Gretchen Greene, Robert Hanisch,
Niall Gaffney, Antonio Volpicelli, James Annis,
Vijay Sekhri, Tamas Budavari, Maria
Nieto-Santisteban, William O'Mullane, David
Bohlender, Tom McGlynn, Arnold Rots, Olga
Pevunova. Supercomputing 2003.
- Applying Chimera Virtual Data Concepts to Cluster
Finding in the Sloan Sky Survey. James Annis,
Yong Zhao, Jens Voeckler, Michael Wilde, Steve
Kent, and Ian Foster. SC 2002.
33 Sources / References / Credits
- Kavitha Ranganathan and Ian Foster, Decoupling
Computation and Data Scheduling in Distributed
Data Intensive Applications, Proceedings of the
11th International Symposium for High Performance
Distributed Computing (HPDC-11), Edinburgh, July
2002.
34 Replica Creation and Elimination Policy
- More replicas improve load balance but put
pressure on storage capacities
- Replica creation
- On demand
- By replica managers in the background
- Replica management decisions for on-demand creation
- Replica decision: should a remote file be
replicated at the local site in response to a
file request?
- Replica selection, and
- Replica replacement
35 Replica Optimization Strategies
- LRU
- Replication decision: always replicate
- Replica selection: based on closest location
- Replica replacement: LRU
- Binomial
- Replica decision: based on a file value calculated
from a binomial prediction of file popularity
- Replica selection: using auction and bidding
- Replica replacement: replace the local replica with the
lowest file value
- Zipf
- Same as binomial, but a Zipf distribution is used
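The LRU strategy is simple enough to sketch end to end (`fetch_remote` is a hypothetical callback for pulling a remote replica):

```python
from collections import OrderedDict

class LRUReplicaCache:
    """Always replicate on access; evict the least-recently-used
    local replica when storage is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # file -> data, least recent first

    def access(self, name, fetch_remote):
        if name in self.store:
            self.store.move_to_end(name)        # refresh recency
        else:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # evict LRU replica
            self.store[name] = fetch_remote(name)  # always replicate
        return self.store[name]
```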
36 Scheduling optimizations
- Random: assign the job to a random host
- Shortest queue: assign the job to the host whose queue
is smallest
- Access cost: assign the job to the host whose access
cost for the files required by the job is smallest
- Queue access cost: assign the job to the host where the
sum of access costs for this job and all the jobs
in the queue is smallest
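The queue-access-cost rule, sketched with a deliberately crude cost model (a uniform per-file transfer cost for non-local files is an assumption; a real scheduler would use measured bandwidths):

```python
def access_cost(files, local, transfer_cost):
    """Cost for a host holding `local` replicas to access `files`:
    zero for local files, `transfer_cost` per remote file."""
    return sum(0 if f in local else transfer_cost for f in files)

def schedule(job_files, hosts, transfer_cost=1.0):
    """hosts: name -> (local_replicas, queued_jobs_file_lists).
    Choose the host minimizing this job's access cost plus the access
    cost of every job already queued there."""
    def total(h):
        local, queue = hosts[h]
        return sum(access_cost(j, local, transfer_cost)
                   for j in queue + [job_files])
    return min(hosts, key=total)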
37 Results: Impact of Network Performance