Title: Data Grids: Data Intensive Computing
1 Data Grids: Data Intensive Computing
2 Simplistically
- Data Grids
- Large number of users
- Large volume of data
- Large computational task involved
- Connecting resources through a network
3 Grid
- Term Grid borrowed from electrical grid
- Users obtain computing power through the Internet by using the Grid, just like electrical power from any wall socket
4 Data Grid
- By connecting to a Grid, users can get
- Needed computing power
- Storage space and data
- Specialized equipment
- Each user: a single login account to access all resources
- Resources: owned by diverse organizations
Virtual Organization
5 Data Grids
- Data
- Measured in terabytes and soon petabytes
- Also geographically distributed
- Researchers
- Access and analyze data
- Sophisticated, computationally expensive
- Geographically distributed
- Queries
- Require management of caches, data transfer over WANs
- Schedule data transfer and computation
- Performance estimates to select replicas
6 Data Grids
- Domains as diverse as
- Global climate change
- High energy physics
- Computational genomics
- Biomedical applications
- http://www.gridstart.org/links.shtml
7 Data Grids
- Data grid differs from
- Cluster computing: a grid is more than homogeneous sites connected by a LAN (a grid can be multiple clusters)
- Distributed system: a grid is more than distributing the load of a program across two or more processes
- Parallel computing: a grid is more than a single task on multiple machines
- Data grid is
- Heterogeneous, geographically distributed, independent sites
- Gridware manages the resources for Grids
8 Methods of Grid Computing
- Distributed Supercomputing
- Tackle problems that cannot be solved on a single system
- High-Throughput Computing
- Goal of putting unused processor cycles to work on loosely coupled, independent tasks (e.g., SETI: Search for Extraterrestrial Intelligence)
- On-Demand Computing
- Short-term requirements for resources that are not locally accessible; real-time demands
9 Methods of Grid Computing
- Data-Intensive Computing
- Synthesize new information from data that is maintained in geographically distributed repositories, databases, etc.
- Collaborative Computing
- Enabling and enhancing human-to-human interactions
10 An Illustrative Example
- NASA research scientist
- Collected microbiological samples in the tidewaters around Wallops Island, Virginia
- Needs
- High-performance microscope at the National Center for Microscopy and Imaging Research (NCMIR), University of California, San Diego
11 Example (continued)
- Samples were sent to San Diego; she used NPACI's Telescience Grid and NASA's Information Power Grid (IPG) to view and control the output of the microscope from her desk on Wallops Island
- Viewed the samples and moved the platform holding them, making adjustments to the microscope
12 Example (continued)
- The microscope produced a huge dataset of images
- This dataset was stored using a storage resource broker on NASA's IPG
- Scientist was able to run algorithms on this very dataset while watching the results in real time
13 Grid topology
Figure: Grid Topology
14 Higher level services
- Replica selection
- Choosing a specific replica to optimize performance, cost, security
- Grid info services provide info about the network; metadata provide info about file size
- Determine the replica with fastest access and/or determine whether replicating will result in better access
- Subsets of files
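A minimal sketch of replica selection by estimated access time, assuming the grid information service supplies a bandwidth and latency estimate per site and the metadata catalog supplies the file size. The `Replica` class, its field names, the site names, and the cost formula (latency plus size divided by bandwidth) are illustrative assumptions, not part of any specific gridware API.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    site: str               # hypothetical site name
    bandwidth_mbps: float   # estimated bandwidth to the client, megabits/s
    latency_s: float        # estimated latency to the client, seconds


def estimated_transfer_time(replica: Replica, file_size_mb: float) -> float:
    """Estimated seconds to fetch a file of file_size_mb megabytes."""
    return replica.latency_s + (file_size_mb * 8) / replica.bandwidth_mbps


def select_replica(replicas, file_size_mb):
    """Pick the replica with the smallest estimated transfer time."""
    return min(replicas, key=lambda r: estimated_transfer_time(r, file_size_mb))


replicas = [
    Replica("site-a", bandwidth_mbps=100.0, latency_s=0.05),
    Replica("site-b", bandwidth_mbps=1000.0, latency_s=0.20),
]
best_large = select_replica(replicas, file_size_mb=1000.0)  # bandwidth dominates
best_small = select_replica(replicas, file_size_mb=0.001)   # latency dominates
```

With these made-up numbers the large file is served best from the high-bandwidth site, while the tiny file is served best from the low-latency site, which is why file size from the metadata matters for the choice.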
15 Scheduling and Replication in Data-Intensive Computing
- Ming Lei, PhD student
- Department of Computer Science
- University of Alabama
16 Introduction
- Early work in data replication focused on decreasing access latency and network bandwidth consumption
- As bandwidth and computing capacity become cheaper, data access latency can drop
- How to improve availability and reliability becomes the focus
- Due to the dynamic nature of Grid users/systems, it is difficult to make replica decisions that meet a system availability goal
- Unlimited storage is usually assumed
17 Introduction
- Related work
- Economical model: replica decision based on auction protocol (Carman, Zini, et al.), e.g., replicate if used in future, unlimited storage
- HotZone: places replicas so client-to-replica latency is minimized (Szymaniak et al.)
- Replica strategies: dynamic, shortest turnaround, least relative load (Tang et al.); consider only LRU
- Multi-tiered Grid: Simple Bottom Up and Aggregate Bottom Up (Tang et al.)
- Replicate fragments of files, block mapping procedure for direct user access (Chang and Chen)
18 Sliding Window Replica Scheme
- Alternative to future prediction, which can overemphasize future access times when the queue is long
- Following example based on future access times
19 Figure 2. Without Replica File 4
Figure 3. Replica File 5
20 Sliding window replica protocol
- Build sliding window: set of files used immediately in the future
- Size bound by size of local storage element (SE)
- Includes all files the current job will access and distinct files from the next arriving jobs
- Sliding window slides forward one more file each time the system finishes processing a file
- Sliding window is dynamic
21 Sliding window replica protocol
- Sum of sizes of all files < size of SE
- No duplicate files in sliding window
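The window construction described on these two slides can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the argument shapes, and the choice to stop at the first file that does not fit (so the window stays a contiguous run of upcoming accesses) are all assumptions.

```python
def build_sliding_window(current_job_files, queued_jobs, file_sizes, se_capacity):
    """Return the distinct files needed soonest whose total size stays below se_capacity.

    current_job_files: files the running job still has to access, in order
    queued_jobs: list of file lists, one per waiting job, in arrival order
    file_sizes: dict mapping file name to size (same unit as se_capacity)
    """
    upcoming = list(current_job_files)
    for job_files in queued_jobs:
        upcoming.extend(job_files)

    window, seen, used = [], set(), 0.0
    for name in upcoming:
        if name in seen:
            continue  # no duplicate files in the window
        if used + file_sizes[name] >= se_capacity:
            break     # assumed: stop once the next file would not fit
        window.append(name)
        seen.add(name)
        used += file_sizes[name]
    return window


sizes = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f4": 1.0, "f5": 1.0}
queue = [["f2", "f3"], ["f4", "f5"]]
before = build_sliding_window(["f1", "f2"], queue, sizes, se_capacity=3.5)
# After f1 has been processed, rebuild from the remaining accesses:
after = build_sliding_window(["f2"], queue, sizes, se_capacity=3.5)
```

Rebuilding the window after each processed file is one simple way to get the "slides forward one more file" behavior; files that fall outside the new window become candidates for replacement.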
23 OptorSim
- Evaluate the performance of our replica and replacement strategy
- Using OptorSim
- OptorSim was developed by the EU DataGrid Project to test dynamic replica schemes
24 Simulation results
- Assume OptorSim topology
- 10,000 jobs in Grid
- Each job accesses 3-10 files
- Storage available: 100 MB-10 GB
- File size: 0.5-1.5 GB
- Compare replica strategies
- LRU, LFU, EcoBio, EcoZipf, No Prediction, Sliding Window
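For readers unfamiliar with the two classical baselines in this comparison, here is a minimal sketch of how LRU and LFU choose which stored file to evict. The dictionaries and function names are illustrative assumptions and do not reflect OptorSim's internals.

```python
def lru_victim(last_access_time):
    """LRU: evict the file whose most recent access is oldest."""
    return min(last_access_time, key=last_access_time.get)


def lfu_victim(access_count):
    """LFU: evict the file that has been accessed the fewest times."""
    return min(access_count, key=access_count.get)


# Hypothetical bookkeeping for three files held on a storage element.
last_access_time = {"a": 5, "b": 2, "c": 9}   # logical timestamps
access_count = {"a": 3, "b": 7, "c": 1}       # number of accesses so far

lru_choice = lru_victim(last_access_time)
lfu_choice = lfu_victim(access_count)
```

Both policies look only at past accesses, whereas the sliding window scheme uses the known upcoming accesses in the job queue.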
25 Measurements
- Measurements
- Total Running time
26 Eco model
- Compare to Economical Model in OptorSim
- Eco
- File replicated if it maximizes profit of SE (e.g., what is earned over time based on predicted number of file requests)
- Eco prediction functions
- EcoBio
- EcoZipf
- Queue Prediction
- No Prediction
27 Prediction Functions
- Prediction via four kinds of prediction functions
- Bio Prediction: binomial distribution is used to predict a value based on file access history
- Zipf Prediction: Zipf distribution is used to predict a value based on file access history
- Queue Prediction: the current job queue is used to predict a value of the file
- No Prediction: no predictions of the file are made; the value will always be 1
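To make the last two functions concrete, here is a rough sketch. The slides do not give the exact formulas, so the queue-based value below (the number of queued jobs that reference the file) is an assumption chosen for illustration; the function names and signatures are also hypothetical. The binomial and Zipf variants are omitted because their parameters are not stated here.

```python
from collections import Counter


def no_prediction(file_name, job_queue):
    """No Prediction: every file gets the same value, 1."""
    return 1


def queue_prediction(file_name, job_queue):
    """Assumed Queue Prediction: count the queued jobs that will access the file."""
    counts = Counter(f for job_files in job_queue for f in set(job_files))
    return counts[file_name]


# Hypothetical queue: each entry lists the files one waiting job will read.
job_queue = [["a", "b"], ["b", "c"], ["b"]]
value_b = queue_prediction("b", job_queue)
value_z = queue_prediction("z", job_queue)
flat = no_prediction("b", job_queue)
```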
28 Access Patterns
- Consider 4 access patterns
- Random
- Random Walk Gaussian
- Sequential
- Random Walk Zipf
29 Running Time
- First, study sliding window without RDAE
- Measure running time
30 Figure 7. Running Time with Varying File Access Pattern.
31 Sliding Window
- Sliding window replica scheme always has the best turnaround time
- No replication and EcoBio are the worst
- LFU is second best
32 Figure 8. Impact of network bandwidth
33 Sliding Window
- The higher the bandwidth, the shorter the total running time
- Sliding window replica is always the best
- EcoZipf, EcoBio perform almost the same as no prediction
34 Switch Time
35 Figure 10. Impact of varying the job switch time
36 Switch Time
- The longer the switch time, the longer the total running time
- Sliding window is the best
- LRU, LFU second best
- Improvement provided by sliding window over LRU, LFU is greatest for smaller switch times
- Improvement provided by sliding window over EcoBio, EcoZipf remains high for higher switch times
37 Varying Schedulers
- Impact of varying schedulers
38 Figure 11. Running Time with Varying Schedulers
39 Conclusions and Future work
- File bundle situation
- Preferential treatment of smaller size files
- Green Grids
40 Power Aware
- Info from Pate, et al., 2002
- Data center with 1000 racks, 25,000 square feet requires 10 MW of power
- Also requires 5 MW to dissipate the heat
- At $100/MWh, $4M per year
- Average utilization of 20%, and 80% for peak loads
- This means 80% of resources are not utilized, yet they generate heat and consume power
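The yearly cost figure can be checked with a short calculation, assuming the price is in dollars per MWh and the load runs continuously for all 8,760 hours of a year. The function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_cost_usd(power_mw, price_usd_per_mwh):
    """Yearly energy cost for a constant load of power_mw megawatts."""
    return power_mw * HOURS_PER_YEAR * price_usd_per_mwh


cooling_cost = annual_cost_usd(5, 100)    # 5 MW of heat dissipation
compute_cost = annual_cost_usd(10, 100)   # 10 MW for the racks themselves
```

Under these assumptions the 5 MW cooling load alone costs about $4.4M per year, which matches the roughly $4M on the slide, and the 10 MW compute load costs about $8.8M on top of that.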
41 Power Aware
- Green Grid: group of IT professionals
- Power Usage Effectiveness (PUE)
- Total facility power / IT equipment power
- Data Center infrastructure Efficiency metric (DCiE)
- 1/PUE
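A small worked example of the two metrics. The input numbers are hypothetical: they assume a facility drawing 15 MW in total, of which 10 MW reaches the IT equipment. The function names are illustrative.

```python
def pue(total_facility_power, it_equipment_power):
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_power <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_power / it_equipment_power


def infrastructure_efficiency(total_facility_power, it_equipment_power):
    """The reciprocal metric, 1/PUE: share of facility power reaching IT equipment."""
    return 1.0 / pue(total_facility_power, it_equipment_power)


example_pue = pue(15.0, 10.0)                               # 1.5
example_efficiency = infrastructure_efficiency(15.0, 10.0)  # about 0.67
```

A PUE of 1.5 means that for every watt delivered to IT equipment, another half watt goes to cooling and other overhead; an ideal facility would approach a PUE of 1.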
42 Power Aware
- IEEE Computer, Dec. 2007 issue, devoted to green computing
- Servers
- Seek high energy efficiency at peak performance
- In sleep, consume near-zero energy
- But
- Servers are rarely completely idle and seldom operate near maximum utilization
- Operate at 10% to 50% of maximum utilization
43 Power Aware
- Mismatch between server energy efficiency and behavior of server-class workloads
- Need energy-proportional machines that exhibit a wide dynamic power range
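The mismatch can be illustrated with a simple linear power model. This is a sketch under assumptions: power is taken to grow linearly from an idle level to peak, and the 50% idle fraction used below is a hypothetical value, not a measurement from the slides.

```python
def power_fraction(utilization, idle_fraction):
    """Power drawn, as a fraction of peak, under an assumed linear model."""
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(utilization, idle_fraction):
    """Work done per unit of energy, relative to running at full utilization."""
    if utilization == 0:
        return 0.0  # no work is done, whatever the power draw
    return utilization / power_fraction(utilization, idle_fraction)


# Hypothetical server that draws half of its peak power when idle.
typical = relative_efficiency(0.3, idle_fraction=0.5)
# Ideal energy-proportional machine: zero power when idle.
proportional = relative_efficiency(0.3, idle_fraction=0.0)
```

At 30% utilization the hypothetical server delivers less than half the work per unit of energy that it delivers at peak, while the energy-proportional machine keeps the same efficiency at every load. A wide dynamic power range corresponds to a small `idle_fraction`.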
44 What can be done to achieve energy-proportional behavior?
- Server power (Google): peak vs. idle
- CPU is a smaller and smaller fraction of total power when the system is idle
- Processors are close to energy-proportional behavior
- Desktop and server processors can consume less than 1/3 of their peak power at low-activity modes (can achieve 1/10 or less)
- Dynamic power range of other components is much narrower
- Lower voltage/frequency processor modes are good because they have little impact on overall performance
- Networking equipment doesn't offer low-power modes
- Need improvements in memory and disk systems
45 Future Work - Power Aware
- Ways to conserve energy
- Power down CPU
- Slow down CPU
- Power down storage disks (but energy is needed to power them back up)
- Store less data
46 Future Work
- With Dr. John Lusth
- Build a green grid (with lower-energy motherboards)
- Hardware purchased, put together
- Implement some power saving strategies
- Run benchmarks (e.g., computations, queries)
- Compare to existing grid at Arkansas
- DESIGN NEW STRATEGIES