Fault Tolerance and Reliable Data Placement

About This Presentation

Title:

Fault Tolerance and Reliable Data Placement

Description:

Stork. A scheduler for data placement activities in the Grid ... Enhanced interaction between Stork, DiskRouter and higher level planners ... – PowerPoint PPT presentation

Number of Views:66

Avg rating:3.0/5.0

Slides: 37

Provided by: tevfik4

Category:

more less

Transcript and Presenter's Notes

Title: Fault Tolerance and Reliable Data Placement

1
Fault Tolerance and Reliable Data Placement
Zach Miller University of Wisconsin-Madison zmill
er_at_cs.wisc.edu
2
Fault Tolerant Shell (FTSH)

A grid is a harsh environment.
FTSH to the rescue!
The ease of scripting with very precise error
semantics.
Exception-like structure allows scripts to be
both succinct and safe.
A focus on timed repetition simplifies the most
common form of recovery in a distributed system.
A carefully-vetted set of language features
limits the "surprises" that haunt system
programmers.

3
Simple Bourne script

!/bin/sh
cd /work/foo
rm rf data
cp -r /fresh/data .

What if /work/foo is unavailable??
4
Getting Grid Ready

!/bin/sh
for attempt in 1 2 3
cd /work/foo
if ! ?
then
echo "cd failed, trying again..."
sleep 5
else
break
fi
done
if ! ?
then
echo "couldn't cd, giving up..."
return 1
fi

5
Or with FTSH

!/usr/bin/ftsh
try 5 times
cd /work/foo
rm -rf bar
cp -r /fresh/data .
end

6
Or with FTSH

!/usr/bin/ftsh
try for 3 days or 100 times
cd /work/foo
rm -rf bar
cp -r /fresh/data .
end

7
Or with FTSH

!/usr/bin/ftsh
try for 3 days every 1 hour
cd /work/foo
rm -rf bar
cp -r /fresh/data .
end

8
Exponential Backoff Example

command_wrapper /path/to/command max_attempts
max_time each_time initial_delay
zmiller_at_cs.wisc.edu 2003-08-02
try for 3 hours or 2 times
try 1 time for 4 hours
1 6 7 8 9 10 11 12
catch
echo "ERROR 1 6 7 8 9 10
11 12
echo "sleeping for delay
seconds"
sleep delay
delaydelay .mul. 2
failure
end
catch
echo ERROR all attempts failed...
returning failure
exit 1
end

9
Another quick example

hosts"mirror1.wisc.edu mirror2.wisc.edu
mirror3.wisc.edu"
forany h in hosts
echo "Attempting host h"
wget http//h/some-file
end
echo "Got file from h
File transfers may be better served by Stork

10
FTSH Summary

All the usual shell constructs
Redirection, loops, conditionals, functions,
expressions, nesting,
And more
Logging
Timeouts
Process Cancellation
Complete parsing at startup
File cleanup
Used on Linux, Solaris, Irix, Cygwin,

11
FTSH Summary

Written by Doug Thain
Available under GPL license at
http//www.cs.wisc.edu/thain/research/ftsh/

12
Outline

Introduction
FTSH
Stork
DiskRouter
Conclusions

13
A Single Project..

LHC (Large Hadron Collider)
Comes online in 2006
Will produce 1 Exabyte data by 2012
Accessed by 2000 physicists, 150 institutions,
30 countries

14
And Many Others..

Genomic information processing applications
Biomedical Informatics Research Network (BIRN)
applications
Cosmology applications (MADCAP)
Methods for modeling large molecular systems
Coupled climate modeling applications
Real-time observatories, applications, and
data-management (ROADNet)

15
The Same Big Problem..

Need for data placement
Locate the data
Send data to processing sites
Share the results with other sites
Allocate and de-allocate storage
Clean-up everything
Do these reliably and efficiently

16
Stork

A scheduler for data placement activities in the
Grid
What Condor is for computational jobs, Stork is
for data placement
Stork comes with a new concept
Make data placement a first class citizen in the
Grid.

17
The Concept
18
The Concept
19
The Concept
Condor Job Queue
DaP A A.submit DaP B B.submit Job C
C.submit .. Parent A child B Parent B child
C Parent C child D, E ..
DAG specification
C
DAGMan
Stork Job Queue
C
E
20
Why Stork?

Stork understands the characteristics and
semantics of data placement jobs.
Can make smart scheduling decisions, for reliable
and efficient data placement.

21
Failure Recovery and Efficient Resource
Utilization

Fault tolerance
Just submit a bunch of data placement jobs, and
then go away..
Control number of concurrent transfers from/to
any storage system
Prevents overloading
Space allocation and De-allocations
Make sure space is available

22
Support for Heterogeneity
Protocol translation using Stork memory buffer.
23
Support for Heterogeneity
Protocol translation using Stork Disk Cache.
24
Flexible Job Representation and Multilevel Policy
Support

Type Transfer
Src_Url srb//ghidorac.sdsc.edu/kosart.cond
or/x.dat
Dest_Url nest//turkey.cs.wisc.edu/kosart/x
.dat
Max_Retry 10
Restart_in 2 hours

25
Run-time Adaptation

Dynamic protocol selection
dap_type transfer
src_url drouter//slic04.sdsc.edu/tmp/tes
t.dat
dest_url drouter//quest2.ncsa.uiuc.edu/tmp
/test.dat
alt_protocols nest-nest, gsiftp-gsiftp
dap_type transfer
src_url any//slic04.sdsc.edu/tmp/test.da
t
dest_url any//quest2.ncsa.uiuc.edu/tmp/tes
t.dat

26
Run-time Adaptation

Run-time Protocol Auto-tuning
link slic04.sdsc.edu quest2.ncsa.uiuc.edu
protocol gsiftp
bs 1024KB //block size
tcp_bs 1024KB //TCP buffer size
p 4

27
Outline

Introduction
FTSH
Stork
DiskRouter
Conclusions

28
DiskRouter

A mechanism for high performance, large scale
data transfers
Uses hierarchical buffering to aid in large scale
data transfers
Enables application-level overlay network for
maximizing bandwidth
Supports application-level multicast

29
Store and Forward
C
A
With DiskRouter
DiskRouter
B
Without DiskRouter
Improves performance when bandwidth fluctuation
between A and B is independent of the bandwidth
fluctuation between B and C
30
DiskRouter Overlay Network
90 Mb/s
B
A
31
DiskRouter Overlay Network
90 Mb/s
B
A
400 Mb/s
400 Mb/s
DiskRouter
C
Add a DiskRouter Node C which is not necessarily
on the path from A to B, to enforce use of an
alternative path.
32
Data Mover/Distributed Cache
Source
Destination
DiskRouter Cloud

Source writes to the closest DiskRouter and
Destination receives it up from its closest
DiskRouter

33
Outline

Introduction
FTSH
Stork
DiskRouter
Conclusions

34
Conclusions

Regard data placement as first class citizen.
Introduce a specialized scheduler for data
placement.
Introduce a high performance data transfer tool.
End-to-end automation, fault tolerance, run-time
adaptation, multilevel policy support, reliable
and efficient transfers.

35
Future work

Enhanced interaction between Stork, DiskRouter
and higher level planners
co-scheduling of CPU and I/O
Enhanced authentication mechanisms
More run-time adaptation

36
You dont have to FedEx your data anymore.. We
deliver it for you!

For more information
Stork
Tevfik Kosar
Email kosart_at_cs.wisc.edu
http//www.cs.wisc.edu/condor/stork
DiskRouter
George Kola
Email kola_at_cs.wisc.edu
http//www.cs.wisc.edu/condor/diskrouter

Write a Comment

User Comments (0)

About PowerShow.com

Fault Tolerance and Reliable Data Placement - PowerPoint PPT Presentation

Fault Tolerance and Reliable Data Placement

Stork. A scheduler for data placement activities in the Grid ... Enhanced interaction between Stork, DiskRouter and higher level planners ... – PowerPoint PPT presentation