Title: The Ethernet Approach to Grid Computing


1
The Ethernet Approach to Grid Computing
  • Douglas Thain and Miron Livny
  • Condor Project, University of Wisconsin
  • http://www.cs.wisc.edu/condor/ftsh

2
The UW US-CMS Physics Grid
Gatekeeper (C)
MCRunJob (python)
Impala (bash)
Jobmanager (C)
MOP (python)
Batch Interface (bash)
Submit DAG (perl)
Batch System (???)
DAGMan (C)
Condor-G (C)
MOP wrapper (bash)
Gridmanager (C)
Impala wrapper (bash)
Actual Job (Fortran)
GAHP Server (C)
3
Outline
  • Two problems in real systems
    • Timing is uncontrollable.
    • Failures lack detail.
  • A solution
    • The Ethernet Approach.
  • A language and a tool
    • The Fault Tolerant Shell.
    • Time and failures are explicit.
  • Example Applications
    • Shared Job Queue.
    • Shared Disk Buffer.
    • Shared Data Servers.

Ethernet: Carrier Sense, Collision Detect, Exponential Backoff, Limited Allocation
try for 30 minutes ... end
4
1 - Timing is Uncontrollable
  • Consider a distributed file system.
  • Suppose that the network is down.
    • Soft mounted: failure after one minute.
    • Hard mounted: failure never exposed.
  • Time is an unknown in nearly every operating
    system activity:
    • Process invocation.
    • Memory access.
    • Network communications.

5
2 - Failures Lack Detail
  • Consider this trivial program:

    cp a b

  • We would like to distinguish:
    • success.
    • file not found.
    • NFS server down, still trying.
    • couldn't find library libc.so.25.
6
2 - Failures Lack Detail
  • Consider this trivial program:

    cp a b

  • Actual results:
    • success. (exit code 0)
    • file not found. (exit code 1)
    • NFS server down, still trying. (exit code 1)
    • couldn't find library libc.so.25. (exit code 1)
7
Examples Abound!
  • TCP connect -> ECONNREFUSED
    • Wrong port number.
    • A loaded service is rejecting connections.
    • The machine has just rebooted, has initialized
      TCP/IP, but not yet started the service.
  • FTP RETR -> code 550
    • 550 File or directory not found.
    • 550 Erlaubnis hat verweigert. (German: permission denied)
    • 550 Archiveer systeem offline. (Dutch: archive system offline)
    • 550 Fuori di memoria. (Italian: out of memory)
    • 550 File staging in from tape. (NCSA Unitree)

8
How do we design new systems that avoid these problems?
  Error Scope (HPDC 2002)
Real systems have these problems. How can we learn to live with them?
  The Ethernet Approach (HPDC 2003)
Not enough information or control.
9
The Ethernet Approach
Ethernet Rules: Carrier Sense, Collision Detect, Exponential Backoff, Limited Allocation
No Carrier Sense: Aloha Protocol
Network or Memory or Disk Space or OS Resources
10
The Fault Tolerant Shell
  • A tool that encourages the Ethernet approach in
    system integration.
  • Similar to the Bourne or C shells.
    • Process invocation and repetition are simple.
    • Other elements are possible but ugly.
  • Not meant to be general purpose, high
    performance, or abstractly beautiful.
    • Not OOP, AOP, SOP, GP, etc.
    • Ethernet ideas could be used in such languages.
  • Elements
    • Brittle property, try/catch, timed try,
      forany/forall.

11
The Brittle Property
Failure of any step causes an immediate halt of
the entire group.

    wget http://host/file.tar.gz
    gunzip file.tar.gz
    tar xvf file.tar
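
The brittle property can also be illustrated outside ftsh. The following Python sketch is not part of ftsh; `run_brittle`, `GroupFailed`, `STEP_OK`, and `STEP_FAIL` are hypothetical names chosen for this example. It runs a group of commands in order and halts the group as soon as one exits with a nonzero code.

```python
import subprocess
import sys


class GroupFailed(Exception):
    """Raised when one step of a brittle group exits with a nonzero code."""

    def __init__(self, argv, returncode, completed):
        super().__init__(f"{argv!r} exited with code {returncode}")
        self.argv = argv
        self.returncode = returncode
        self.completed = completed  # steps that finished before the failure


def run_brittle(commands):
    """Run each command in order and halt the group at the first failure."""
    completed = []
    for argv in commands:
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Later steps are never started.
            raise GroupFailed(argv, result.returncode, completed)
        completed.append(argv)
    return completed


# Stand-ins for wget/gunzip/tar (assumption: only the Python interpreter
# is known to be installed, so it plays the role of each step).
STEP_OK = [sys.executable, "-c", "pass"]
STEP_FAIL = [sys.executable, "-c", "import sys; sys.exit(1)"]
```

Calling `run_brittle([STEP_OK, STEP_FAIL, STEP_OK])` raises `GroupFailed` after the second step, and the third step never runs.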

12
Untyped Exceptions
    try
      wget http://host/file.tar.gz
      gunzip file.tar.gz
      tar xvf file.tar
    catch
      echo Zoiks!
    end

Failure of this group raises an exception.
Exceptions have no type!
13
Timed Try Statements
    try for 30 minutes
      wget http://host/file.tar.gz
      gunzip file.tar.gz
      tar xvf file.tar
    end

  • The enclosed statement will be cancelled after 30 minutes.
  • An exception in the enclosed statement causes a retry, for
    up to 30 minutes, with exponential backoff.
  • Success after n attempts is as good as success after one.
    Otherwise, the whole statement fails.
14
Timed Try Statements
  • If group completes within time limit:
    • Try block succeeds.
  • If group fails within time limit:
    • Automatically retried.
    • Exponentially increasing delay.
    • Random factor to avoid collisions.
  • If group runs over time limit:
    • Resources reclaimed, exception thrown.
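
The retry rules on this slide can be sketched in Python. This is an illustration under stated assumptions, not ftsh's implementation: `try_for` and `TryTimeout` are hypothetical names, and the base delay, the delay cap, and the 1x to 2x random factor are choices made for the example. Unlike ftsh, the sketch only checks the deadline between attempts and does not cancel an attempt that is already running.

```python
import random
import time


class TryTimeout(Exception):
    """Raised when the time limit expires before any attempt succeeds."""


def try_for(limit_seconds, action, base_delay=1.0, max_delay=3600.0,
            sleep=time.sleep, clock=time.monotonic):
    """Retry action() with randomized exponential backoff until the deadline.

    Success on any attempt counts as success. If the next wait would pass
    the deadline, give up and raise TryTimeout.
    """
    deadline = clock() + limit_seconds
    delay = base_delay
    while True:
        try:
            return action()
        except Exception as err:  # untyped: any failure triggers a retry
            last_error = err
        # Random factor so that competing clients do not retry in lockstep.
        wait = delay * random.uniform(1.0, 2.0)
        if clock() + wait >= deadline:
            raise TryTimeout(f"gave up after {limit_seconds}s") from last_error
        sleep(wait)
        delay = min(delay * 2, max_delay)  # exponentially increasing delay
```

The `sleep` and `clock` parameters exist only so that the function can be exercised with a fake clock instead of real waiting.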

15
forany and forall
    forany host in xxx yyy zzz
      wget http://${host}/file
    end

Attempt to make this statement succeed for any
random branch.

    forall host in xxx yyy zzz
      wget http://${host}/file
    end

Attempt to make this statement succeed for all
branches simultaneously.
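
The difference between the two constructs can be sketched in Python. The functions `forany` and `forall` below are hypothetical stand-ins written for this example, not ftsh itself. One simplification is labeled in the comments: when a branch fails, this `forall` still waits for the other branches to finish before reporting the failure.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def forany(items, action):
    """Try the branches in random order and return the first success."""
    candidates = list(items)
    random.shuffle(candidates)
    errors = []
    for item in candidates:
        try:
            return item, action(item)
        except Exception as err:
            errors.append((item, err))
    raise RuntimeError(f"all {len(errors)} branches failed")


def forall(items, action):
    """Run every branch concurrently and succeed only if all succeed."""
    items = list(items)
    with ThreadPoolExecutor(max_workers=max(1, len(items))) as pool:
        futures = [pool.submit(action, item) for item in items]
        # result() re-raises the exception of the first failed branch.
        # Simplification: the remaining branches are not cancelled.
        return [future.result() for future in futures]
```

With an `action` that succeeds only for host `yyy`, `forany` returns the result from `yyy` regardless of the order tried, while `forall` over the same hosts raises.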
16
Example Applications
Ethernet Property    Job Queue          Disk Buffer            Data Servers         Handled by
Collision Detect     failed cmd         failed cmd             failed cmd           ftsh
Exp Backoff          try backoff        try backoff            try backoff          ftsh
Limited Allocation   try timeout        try timeout            try timeout          ftsh
Carrier Sense        File Descriptors   Estimated Free Space   Short Active Probe   coder
17
Shared Job Queue
Multiple clients connect to a job queue to
manipulate jobs. (Submit, query, remove, etc.)
What's the bottleneck?
[Figure: diagram showing Clients, Match Maker, Condor schedd, CPUs, Jobs, Job Queue, Activity Log, and Local Filesystem.]
18
Aloha Client
    try for 5 minutes
      condor_submit job.file
    end
19
Ethernet Client
    try for 5 minutes
      if avail_fds() .lt. 1000
        failure
      end
      condor_submit job.file
    end

Measure free file descriptors.
Throw an exception and try again.
21
Shared Disk Buffer
Multiple batch jobs share an output buffer. Jobs
write output files, and a mover pushes them out.
  • Step A: Arbitrate
  • Step B: Write
  • Step C: Commit
  • Step D: Read
  • Step E: Send
  • Step F: Delete
[Figure: Jobs 8, 9, and 10, the Data Mover, and the Local File System with files d4.c, d5.c, d6.c, d7.c, d8.i, d9.i, and d10.i.]
22
Aloha Client
    try for 30 minutes
      try
        run-job > dn.i
        mv dn.i dn.c
      catch
        rm -f dn.i
      end
    end

  • run-job > dn.i: create the file, marked incomplete.
  • mv dn.i dn.c: atomically commit the file.
  • rm -f dn.i: remove the file on any failure.
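
The write, commit, and clean-up pattern on this slide can be sketched in Python. `write_then_commit` is a hypothetical helper written for this example, and `produce` stands in for `run-job`. The `.i` and `.c` suffixes follow the slide. The sketch relies on `os.replace`, which renames atomically on POSIX systems when both names are on the same filesystem.

```python
import os


def write_then_commit(produce, name, directory="."):
    """Write output to NAME.i, then atomically rename it to NAME.c.

    produce(file_object) plays the role of run-job. On any failure the
    incomplete file is removed and the exception is re-raised.
    """
    incomplete = os.path.join(directory, name + ".i")
    committed = os.path.join(directory, name + ".c")
    try:
        with open(incomplete, "wb") as out:
            produce(out)
            out.flush()
            os.fsync(out.fileno())  # make sure the data reached the disk
        os.replace(incomplete, committed)  # the commit step
    except BaseException:
        try:
            os.remove(incomplete)  # same role as rm -f dn.i
        except FileNotFoundError:
            pass
        raise
    return committed
```

A reader such as the data mover can then ignore every `.i` file and treat the presence of a `.c` file as proof that the output is complete.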
23
Ethernet Client
    try for 30 minutes
      if overcommitted()
        failure
      end
      try
        run-job > dn.i
        mv dn.i dn.c
      catch
        rm -f dn.i
      end
    end

The buffer is overcommitted if its estimated needs exceed
the available space.
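
A carrier-sense test of this kind might look as follows in Python. This is a minimal sketch: the slide does not say how the needs are estimated, so the estimate is left to the caller as the hypothetical parameter `estimated_needs_bytes`, and only the comparison against free space is shown.

```python
import shutil


def overcommitted(estimated_needs_bytes, path="."):
    """Return True if the estimated needs exceed the free space at path.

    How estimated_needs_bytes is computed is an assumption left to the
    caller; the slide does not specify it.
    """
    free_bytes = shutil.disk_usage(path).free
    return estimated_needs_bytes > free_bytes
```

A client would call this before starting a job and treat a `True` result as a failure, so that the surrounding retry logic backs off instead of filling the disk.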
25
Shared Data Servers
A black hole server accepts all connections and holds them idle
indefinitely.
A healthy but loaded server might also have a
high response time.
Each client wants one instance of the data set,
but doesn't care which one. How to deal with
delays and failures?
26
Aloha Client
    try for 15 minutes
      forany host in xxx yyy zzz
        try for 1 minute
          wget http://${host}/data
        end
      end
    end
27
Ethernet Client
    try for 15 minutes
      forany host in xxx yyy zzz
        try for 5 seconds
          wget http://${host}/tiny
        end
        try for 1 minute
          wget http://${host}/data
        end
      end
    end

Test the server by fetching a tiny file.
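
The probe-then-fetch idea can be sketched in Python. `fetch_url` and `fetch_from_any` are hypothetical names for this example, and the `/tiny` and `/data` paths follow the slide. The sketch makes a single pass over the hosts; the outer 15-minute retry of the ftsh version is left out. The `timeout` passed to `urllib.request.urlopen` bounds each blocking socket operation, which is enough to escape a server that accepts a connection and then stays silent.

```python
import random
import urllib.request


def fetch_url(url, timeout):
    """Fetch a URL; the timeout bounds each blocking socket operation."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_from_any(hosts, fetch=fetch_url, probe_timeout=5, data_timeout=60):
    """Probe each host with a tiny fetch before asking it for the data."""
    candidates = list(hosts)
    random.shuffle(candidates)  # spread clients across the servers
    for host in candidates:
        try:
            # Short active probe: a black hole fails here within seconds.
            fetch(f"http://{host}/tiny", probe_timeout)
            return fetch(f"http://{host}/data", data_timeout)
        except OSError:
            continue  # timeouts and refused connections: try the next host
    raise RuntimeError("no server passed the probe and delivered the data")
```

The `fetch` parameter is there so that the selection logic can be tried with a stub instead of real servers.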
28
All Clients Blocked on Black Hole
29
Some Thoughts
  • This is a necessary technique for real problems.
    • Timing is uncontrollable; failures lack detail.
    • A simple technique has significant payoff.
  • The Ethernet approach is not always ideal.
    • Carefully chosen errnos are powerful.
    • Designing errnos is tricky.
    • Requires clients of good will.
    • Some scenarios require external coordination.
    • Admission control for admission control?
  • Time and failure are first-class concerns.
    • They should be first-class elements of languages!
    • We get good mileage without complex
      constructions.
  • More info at
    • http://www.cs.wisc.edu/condor/ftsh

30
"Computing's central challenge, 'How not to make a
mess of it', has not yet been met." - Edsger Dijkstra