Title: Data Grids: Data-Intensive Computing
1. Data Grids: Data-Intensive Computing
2. Simplistically
- Data Grids
- Large number of users
- Large volume of data
- Large computational task involved
- Connecting resources through a network
3. Grid
- The term "Grid" is borrowed from the electrical grid
- Users obtain computing power through the Internet by using the Grid, just as they draw electrical power from any wall socket
4. Data Grid
- By connecting to a Grid, users can get
- needed computing power
- storage space and data
- specialized equipment
- Each user has a single login account to access all resources
- Resources are owned by diverse organizations, which together form a Virtual Organization
5. Data Grids
- Data
- Measured in terabytes and petabytes
- Also geographically distributed
- Researchers
- Access and analyze data
- Sophisticated, computationally expensive
- Geographically distributed
- Queries
- Require management of caches and data transfer over WANs
- Schedule data transfer and computation
- Performance estimates to select replicas
6. Data Grids
- Domains as diverse as
- Global climate change
- High energy physics
- Computational genomics
- Biomedical applications
7. Data Grids
- A data grid differs from
- Cluster computing: a grid is more than homogeneous sites connected by a LAN (a grid can contain multiple clusters)
- Distributed systems: a grid is more than distributing the load of a program across two or more processes
- Parallel computing: a grid is more than a single task on multiple machines
- A data grid is
- heterogeneous, geographically distributed, with independent sites
- Gridware manages the resources of a Grid
8. Methods of Grid Computing
- Distributed Supercomputing
- Tackle problems that cannot be solved on a single system
- High-Throughput Computing
- goal of putting unused processor cycles to work on loosely coupled, independent tasks (e.g., SETI, the Search for Extraterrestrial Intelligence)
- On-Demand Computing
- short-term requirements for resources that are not locally accessible; real-time demands
9. Methods of Grid Computing
- Data-Intensive Computing
- Synthesize new information from data that is maintained in geographically distributed repositories, databases, etc.
- Collaborative Computing
- enabling and enhancing human-to-human interactions
10. An Illustrative Example
- A NASA research scientist
- collected microbiological samples in the tidewaters around Wallops Island, Virginia.
- Needs
- the high-performance microscope at the National Center for Microscopy and Imaging Research (NCMIR), University of California, San Diego.
11. Example (continued)
- The samples were sent to San Diego, and she used NPACI's Telescience Grid and NASA's Information Power Grid (IPG) to view and control the output of the microscope from her desk on Wallops Island.
- She viewed the samples and moved the platform holding them, making adjustments to the microscope.
12. Example (continued)
- The microscope produced a huge dataset of images
- This dataset was stored using a storage resource broker on NASA's IPG
- The scientist was able to run algorithms on this dataset while watching the results in real time
13. Grid - Lower-level services
- Other basic services
- Authorization/authentication
- Resource reservation for predictable transfers
- Performance measurements, estimation techniques
- Instrumentation services that enable end-to-end instrumentation of storage transfers
14. Grid - Higher-level services
- Replica manager
- Create/delete copies of file instances
- Typically byte-for-byte copies
- Replicas are created for better performance/availability
- A logical file exists in a metadata repository with a globally unique name
- Related logical files are grouped into collections in replica catalogs (hierarchies too)
- A file not in a catalog is in a local cache
- Replica policy is separate from the replica manager
- Can keep local copies separate
15. Topics to follow
- Discuss data Grid research at UA
- Discuss Green computing
- Discuss Celadon cluster at UA
16. An On-Line Replication Strategy to Increase Availability in Data Grids
- Ming Lei, PhD
- Department of Computer Science
- University of Alabama
- Now at Oracle Corporation, Atlanta, GA
17. Introduction
- How to improve file access time and data availability?
- Replicate the data!
- Copies of files at different sites
- Deciding where and when is the problem
- Dynamic behavior of Grid users
- Large volume of datasets
- Hundreds of clients across the globe submit requests
18. Introduction
- Early work in data replication focused on decreasing access latency and network bandwidth
- As bandwidth and computing capacity become cheaper, data access latency can drop
- How to improve availability and reliability becomes the focus
- Unavailability of a file can cause a job to hang
- The potential delay to a job can be unbounded
- Any node failure or data outage can cause potential file unavailability
19. Related Replica Work
- Economic model: replica decisions based on an auction protocol (Carman, Zini, et al.); e.g., replicate if the file will be used in the future; assumes unlimited storage
- Hotzone: places replicas so that client-to-replica latency is minimized (Szymaniak et al.)
- Replica strategies: centralized and distributed replication (Tang et al.); consider limited storage but only LRU replacement
- Multi-tiered Grid: Simple Bottom Up and Aggregate Bottom Up (Tang et al.)
- Replicate fragments of files, with a block mapping procedure for direct user access (Chang and Chen)
20. Motivation
- Want to complete a job with correct data
- File access failure can lead to an incorrect result or a job crash
- Improve overall system availability
- Propose to measure the system-level data availability
- Assume limited file storage
21. Data Grid Architecture
(Figure: Computing Element (CE), Storage Element (SE), and a replica manager containing a replica optimizer)
22. File Availability
- File availability
- Associated with each SE (storage element) is a file availability (the probability the file will be available)
- It doesn't help to increase copies at the same SE; they all fail together
- One copy per SE
- All copies at the same SE have the same availability
23. Measures of System Availability
- System File Missing Rate (SFMR)
- (number of files potentially unavailable) / (number of all files requested by all jobs)
- System Bytes Missing Rate (SBMR)
- (number of bytes potentially unavailable) / (total number of bytes requested by all jobs)
- The two metrics are the same when all files are the same size
24. System model
Given the set of jobs J = (j_1, j_2, j_3, ..., j_N), the availability P_j of file f_j is

    P_j = 1 - \prod_{i=1}^{k} (1 - P_{SE_i})

where P_{SE_i} is the file availability in the ith SE and k denotes the number of copies of the file f_j.
25. System model
System File Missing Rate (SFMR):

    SFMR = \frac{\sum (1 - P_j)}{n \cdot m}

where the sum runs over all file accesses, n denotes the total number of jobs, and each job has m file accesses.

System Bytes Missing Rate (SBMR):

    SBMR = \frac{\sum (1 - P_j) S_j}{\sum S_j}

where S_j denotes the size of file f_j. (A small computational sketch of these metrics follows.)
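These metrics can be computed directly from the per-file availabilities. Below is a minimal Python sketch, not taken from the slides; the function names and the (P_j, S_j) request representation are illustrative assumptions.

```python
# Minimal sketch of the availability metrics; names and data layout are assumed.

def file_availability(se_availabilities):
    """P_j = 1 - prod(1 - P_SEi) over the SEs that hold a copy of f_j."""
    p_all_fail = 1.0
    for p_se in se_availabilities:
        p_all_fail *= (1.0 - p_se)
    return 1.0 - p_all_fail

def sfmr(requests):
    """requests: list of (P_j, S_j) pairs, one per file access.
    SFMR = expected number of unavailable files / number of requested files."""
    return sum(1.0 - p for p, _ in requests) / len(requests)

def sbmr(requests):
    """SBMR = expected number of unavailable bytes / total number of requested bytes."""
    return sum((1.0 - p) * s for p, s in requests) / sum(s for _, s in requests)

# Example: one file replicated on two SEs that are each 99% available,
# and one file with a single copy on a 99%-available SE.
p_two_copies = file_availability([0.99, 0.99])   # 0.9999
reqs = [(p_two_copies, 500), (0.99, 300)]
print(sfmr(reqs), sbmr(reqs))
```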
26. Problem Generalization
For a sequence of file requests O = (r_1, r_2, r_3, ..., r_N),

    SFMR = \frac{1}{N} \sum_{i=1}^{N} (1 - P_i)        SBMR = \frac{\sum_{i=1}^{N} (1 - P_i) S_i}{\sum_{i=1}^{N} S_i}

The best system data availability results from minimizing the above equations subject to the storage constraint

    \sum_i C_i S_i \le S

where P_i and S_i are the availability and size of the file requested by r_i, C_i denotes the number of copies of f_i, and S is the total storage available.
27. Problem Generalization
Transform the minimization problem into a maximization problem. Rewriting,

    SFMR = 1 - \frac{1}{N} \sum_{i=1}^{N} P_i        SBMR = 1 - \frac{\sum_{i=1}^{N} P_i S_i}{T_{bytes}}

where N is the total number of request operations in the given set O and T_{bytes} denotes the total bytes that will be accessed for all of O.

To minimize the SFMR and SBMR, we need to maximize

    \sum_{i=1}^{N} P_i        and        \sum_{i=1}^{N} P_i S_i
28. On-line Optimal Replication Problem
With each file f_i, associate a value V_i (its expected future accesses).
Assume the newly requested file is t.
Choose a file set d = {f_1, f_2, ..., f_k} from the file set F ∪ {t} to achieve the maximum of

    \sum_{f_i \in d} P_i V_i        and        \sum_{f_i \in d} P_i V_i S_i

If t is in d, then we need to replicate the file.
The above optimization problem is a classic knapsack problem: aggregate each file's replica storage costs together as the weight of the item f_i.
29. On-line Optimal Replication Problem
Solving this knapsack problem at each replacement instant is NP-hard.
We can convert our optimization problem into an approximate fractional knapsack problem (done elsewhere by researchers at Berkeley).
Assume that the storage capacity is sufficiently large and holds a significantly large number of files.
The amount of space left after storing the maximum number of files is then negligible compared to the total storage space. (A greedy sketch of this approximation follows.)
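As an illustration of the fractional-knapsack approximation, here is a hedged Python sketch of greedy selection by value density; the (file_id, value, size) tuples stand in for P_i · V_i and the replica storage cost, and none of this is the OptorSim implementation.

```python
# Greedy selection by value density, the standard fractional-knapsack solution.
# With a large capacity and many files, the space left unused at the end is
# negligible, so this 0/1 greedy answer is close to optimal.

def select_replicas(files, capacity):
    """files: iterable of (file_id, value, size) with value ~ P_i * V_i.
    Returns the ids kept within the storage capacity."""
    kept, used = [], 0.0
    for file_id, value, size in sorted(files, key=lambda f: f[1] / f[2], reverse=True):
        if used + size <= capacity:
            kept.append(file_id)
            used += size
    return kept

# Densities: f1 = 0.0045, f3 = 0.003, f2 = 0.001 -> keeps f1 and f3 within 400 units.
print(select_replicas([("f1", 0.9, 200), ("f2", 0.5, 500), ("f3", 0.3, 100)], 400))
```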
30. Minimum Data Missing Rate Strategy: MinDmr
- Propose the MinDmr replica optimizer
- In our greedy algorithm, we introduce the file weight
- W = (P_j * V_j) / (C_j * S_j)   (a worked example follows this list)
- V_j - file value based on future accesses
- P_j - file f_j's availability
- C_j - the number of copies of f_j
- S_j - the size of f_j
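For example (the numbers are illustrative, not taken from the slides), a file with P_j = 0.99, predicted value V_j = 4, a single copy, and size S_j = 250 MB has weight

    W = (0.99 \cdot 4) / (1 \cdot 250) \approx 0.0158

while a file with the same availability but three copies, value 1, and size 300 MB has W = 0.99 / 900 \approx 0.0011, so the latter sorts first as a replacement candidate.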
31. MinDmr Strategy
- Value V_i
- Must make long-term performance decisions
- Each file access operation r_i, at instant T, is associated with an important variable V_i
- V_i is set to the number of times the file will be accessed in the future
- Assign the future value to a file via a prediction function
32. Prediction Functions
- Prediction is done via four kinds of prediction functions (a sketch of the two simpler ones follows this list)
- Bio Prediction: a binomial distribution is used to predict a value based on the file access history
- Zipf Prediction: a Zipf distribution is used to predict a value based on the file access history
- Queue Prediction: the current job queue is used to predict the value of the file
- No Prediction: no predictions are made for the file; the value is always 1
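Purely as an illustration, the two simpler predictors could look like the sketch below; the job-queue representation and function names are assumptions, and the binomial and Zipf predictors (which fit distributions to the access history) are omitted.

```python
# Sketch of two value predictors, assuming the job queue is visible as a list
# of per-job file-access lists. Names are illustrative, not from OptorSim.

def value_no_prediction(file_id):
    """'No Prediction': every file gets the constant value 1."""
    return 1

def value_queue_prediction(file_id, job_queue):
    """'Queue Prediction': value = number of pending accesses to this file
    in the current job queue."""
    return sum(job.count(file_id) for job in job_queue)

queue = [["f1", "f2"], ["f1", "f3", "f1"]]
print(value_no_prediction("f2"), value_queue_prediction("f1", queue))  # 1 3
```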
33. MinDmr Strategy
- For each file request
- If there is enough space
- replicate the file
- Else
- Sort the stored files by weight W
- Replace file(s) only if the value gained by replicating is greater than the value lost by replacing the file(s) (see the sketch below)
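The following is a minimal sketch of this decision rule, assuming each stored file is a dict with availability, value, copies, and size fields; comparing summed weights for the gain/loss test is my reading of the slides, not the exact OptorSim optimizer.

```python
# MinDmr-style replacement sketch. Weight W = (P_j * V_j) / (C_j * S_j):
# files that are cheap to lose (many copies, low value, large size) sort first.

def weight(f):
    return (f["availability"] * f["value"]) / (f["copies"] * f["size"])

def maybe_replicate(new_file, stored, free_space):
    """Return (replicate?, files to evict). Replicate if the file fits, or if
    evicting the lowest-weight files loses less weight than the new replica gains."""
    need = new_file["size"] - free_space
    if need <= 0:
        return True, []                       # enough space: always replicate
    victims, reclaimed, lost = [], 0.0, 0.0
    for f in sorted(stored, key=weight):      # least valuable files first
        victims.append(f)
        reclaimed += f["size"]
        lost += weight(f)
        if reclaimed >= need:
            break
    if reclaimed >= need and weight(new_file) > lost:
        return True, victims                  # gain outweighs the loss: replace
    return False, []                          # otherwise read the file remotely

stored = [{"availability": 0.99, "value": 1, "copies": 3, "size": 300},
          {"availability": 0.99, "value": 5, "copies": 1, "size": 200}]
new = {"availability": 0.99, "value": 4, "copies": 1, "size": 250}
print(maybe_replicate(new, stored, free_space=100))
```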
35. Existing Eco Model Comparison
- Compare to the Economic Model in OptorSim
- Eco
- A file is replicated if doing so maximizes the profit of the SE (e.g., what is earned over time based on the predicted number of file requests)
- Eco prediction functions
- EcoBio
- EcoZipf
36. Existing Eco Model Comparison
- MinDmr differs from Eco
- Both are greedy
- MinDmr uses two values: a gain/loss value and a value for sorting existing files for replacement
- Eco uses the same value to determine a file's value and its replacement
- MinDmr includes availability, number of copies, and size
- The incidence of replication differs for the two: Eco replicates the same file many more times
37. OptorSim
- Evaluate the performance of our replication and replacement strategy
- Using OptorSim
- OptorSim was developed by the EU DataGrid Project to test dynamic replica schemes
38. Grid topology
39. Compare to
- We will compare 8 replica schemes (optimizers)
- BioMD (Bio MinDmr)
- ZipfMD (Zipf MinDmr)
- MDNoPred (MinDmr, no prediction)
- MDQuePred (MinDmr, queue prediction)
- EcoBio
- EcoZipf
- LRU (least recently used)
- LFU (least frequently used)
40. What to vary
- Comparisons made for
- varying access patterns
- total job time
- varying schedulers
- queue length
- SE availability
- different (uniform) file sizes
- files of different sizes
41. Access Patterns
- Consider 4 access patterns (OptorSim)
- Random
- Random Walk Gaussian
- Sequential
- Random Walk Zipf
42. Job Schedulers
- Consider 4 types of job schedulers (OptorSim)
- Random
- Shortest Queue
- Access Cost: the site where the file has the lowest access cost
- Queue Access Cost: the site where the sum of the access cost for the job and the access costs for all jobs in the queue is smallest
43. Performance Results
44. Workload and system parameter values
File availability at each SE is 99%
45. SFMR with varying replica optimizers
46. Results
- MinDmr (MD) performs better than both Eco schemes
- EcoBio is worst, EcoZipf second worst
- SFMR for Eco is up to 200 times greater than for MinDmr
- LFU is slightly better than LRU
- ZipfMD is worse than LRU and LFU
- This is consistent across most of the results
- ZipfMD uses the Zipf prediction function in OptorSim, which is not accurate
47. Total job time with sequential access
48. Results
- Total job time is smallest for MinDmr
- BioMD is shortest, EcoBio the longest
- LRU has a higher SFMR but a lower total job time
- Note that we only used sequential access here
49. SFMR with varying job schedulers
50. Results
- Shortest Queue and Access Cost give similar SFMR values for all replica schemes
- Random is worst, Queue Access Cost is best
- Note that LRU was dropped here
51. SFMR with varying job queue length
52. Results
- Effect of queue length on SFMR
- Consider only MDQuePred
- The shorter the job queue, the higher the SFMR
- However, if the queue is too long, SFMR can increase slightly
- Files deemed valuable are replicated and stay in storage too long
53. Total job time with varying job queue length
54. Results
- As the length of the queue increases, total running time decreases
- The decrease is larger for longer queues
- Trade-off: total job time versus SFMR
55. SFMR ratio to MDQuePred with varying SE availability
56. Results
- Vary availability at 90%, 99%, 99.9% and 99.99%
- Compare each scheme's SFMR to that of MDQuePred (the smallest)
- All schemes benefit from higher availability
- The MinDmr strategies are always smaller
57. SFMR with sequential access when varying file size
58. Results
- Change the size of all files to 200, 300, 400, 500, and 600 MB (all files still the same size)
- The larger the file size, the higher the SFMR
- All MinDmr schemes are better, except ZipfMD
59. SFMR and SBMR with different file sizes
60. Results
- Each file has a different size
- Sizes range from 500 MB to 1 GB
- All replica schemes except LFU have a higher SBMR than SFMR
- Schemes store small files in the replica space, displacing larger ones
- LFU (and LRU) are not affected
- MinDmr (except ZipfMD) is better
61. Difference between SBMR and SFMR with different file sizes
62. Results
- Plots SBMR minus SFMR
- Largest gap for EcoBio, smallest for BioMD
- The gap is larger for EcoBio, EcoZipf and ZipfMD
- SFMR and SBMR are both small for LFU, BioMD, MDNoPred and MDQuePred
63. Conclusions
- MinDmr is better than the other schemes in terms of the new data availability metrics, regardless of
- file sizes
- system load
- queue length
- prediction function
- job scheduler
- file access pattern
64. Future Work
- Differentiate SFMR and SBMR when file sizes are not uniform
- Study the algorithm's preferential treatment of smaller files
- File bundle situations
- Quality of service issues