Title: Asynchronous Remote Execution
1 Asynchronous Remote Execution
- PhD Preliminary Examination
- Douglas Thain
- University of Wisconsin
- 19 March 2002
2 Thesis
- Asynchronous operations improve the throughput, resiliency, and scalability of remote execution systems.
- However, asynchrony introduces new failure modes that must be carefully understood in order to preserve the illusion of synchronous operation.
3 Proposal
- I propose to explore the coupling between asynchrony, failures, and performance in remote execution.
- To accomplish this, I will modify an existing system and increase the available asynchrony in degrees.
4 Contributions
- A measurement of the performance benefits of asynchrony.
- A system design that accommodates asynchrony while tolerating a significant set of expected failure modes.
- An exploration of the balance between performance, risk, and knowledge in a distributed system.
5 Outline
- Introduction
- A Case Study
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
6 Science Needs Large-Scale Computing
- Theory, Experiment, Computation
- Nearly every field of scientific study has a grand challenge problem
- Meteorology
- Genetics
- Astronomy
- Physics
7 The Grid Vision
[Diagram: security services, a tape archive, and several disk archives]
8 The Grid Reality
- Systems for managing CPUs
- Condor, LSF, PBS
- Programming Interfaces
- POSIX, Java, C, MPI, PVM....
- Systems for managing data
- SRB, HRM, SAM, ReqEx (DapMan)
- Systems for storing data
- NeST, IBP
- Systems for moving data
- GridFTP, HTTP, Kangaroo
- Systems for remote authentication
- SSL, GSI, Kerberos, NTSSPI
9 The Grid Reality
- Host uptime
- Median 15.92 days
- Mean 5.53 days
- Local maximum 1 day
- Long et al., "A Longitudinal Study of Internet Host Reliability," Symposium on Reliable Distributed Systems, 1995.
- Wide-area connectivity
- approx. 1% chance of a 30-second interruption
- approx. 0.1% chance of a persistent outage
- Chandra et al., "End-to-End WAN Service Availability," Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems, 2001.
10 [Diagram: a job surrounded by security services, disk archives, and tape archives]
11 Usual Approach: Hold and Wait
- Request CPU, Wait for Success
- Stage Data to CPU, Wait
- Move Executable to CPU, Wait
- Execute Program, Wait
- Missing File -> Stage Data, Wait
- Stage Output Data, Wait
- Failure? Start over...
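A minimal sketch of this hold-and-wait sequence, with hypothetical stand-in functions in place of the real grid operations (this is not Condor code):

    import time

    class StepFailed(Exception):
        pass

    # Stand-ins: each blocks until its step succeeds (or raises StepFailed).
    def request_cpu():  time.sleep(0.1); return "cpu-host"
    def stage_in(cpu):  time.sleep(0.1)
    def execute(cpu):   time.sleep(0.1); return 0
    def stage_out(cpu): time.sleep(0.1)

    def run_job_hold_and_wait():
        while True:                   # Failure? Start over...
            try:
                cpu = request_cpu()   # CPU is held from here on, even while idle
                stage_in(cpu)         # CPU idle while disk and network are busy
                code = execute(cpu)   # disk and network idle while CPU is busy
                stage_out(cpu)        # CPU idle again
                return code
            except StepFailed:
                continue

    print(run_job_hold_and_wait())

Every step blocks, and a failure at any point discards all of the work done so far.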
12 Synchronous Systems are Inflexible
- Poor utilization and throughput due to hold-and-wait.
- CPU idle while disk busy.
- Disk idle while CPU busy.
- Disk full? CPU stops.
- System sensitive to failures of both performance and correctness.
- Network down? Everything stops.
- Network slow? Everything slows.
- Credentials lost? Everything aborts.
13 Resiliency Requires Flexibility
- Most jobs have weak couplings between all of their components.
- Asynchrony = Time Decoupling
- Can't have network now? Ok, use disk.
- Can't have CPU now? Ok, checkpoint.
- Can't store data now? Ok, recompute later.
- Time Decoupling -> Space Decoupling
14 Computing's Central Challenge
- "How not to make a mess of it."
- -- Edsger Dijkstra, CACM, March 2001.
How can we harness the advantages of asynchrony while maintaining a coherent and reliable user experience?
15 Outline
- Introduction
- A Case Study: The Distributed Buffer Cache
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
16 Case Study: The Distributed Buffer Cache
- The Kangaroo distributed buffer cache introduces asynchronous I/O for remote execution.
- It offers improved job throughput and failure resiliency at the price of increased latency in I/O arrival.
- A small mess: jobs and I/O are not recoupled at completion time.
17 Kangaroo Prototype
An application may contact any node in the system and perform partial-file reads and writes.
The node may then execute or buffer operations as conditions warrant.
A consistency protocol ensures no loss of data due to crash/disconnect.
[Diagram: an application writes through a chain of Kangaroo (K) nodes, each with a buffer, toward a destination disk]
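A minimal sketch of the buffering idea only; this is not the Kangaroo implementation or its consistency protocol, and the class interface is illustrative:

    import collections

    class BufferingNode:
        def __init__(self, forward):
            self.forward = forward            # callable that pushes a write onward
            self.spool = collections.deque()  # operations buffered for later delivery

        def write(self, path, offset, data, network_ok):
            op = (path, offset, data)
            if network_ok:
                self.forward(*op)             # execute the operation now
            else:
                self.spool.append(op)         # buffer it and return to the application

        def flush(self):
            while self.spool:                 # deliver buffered writes in order
                self.forward(*self.spool.popleft())

    delivered = []
    node = BufferingNode(lambda path, off, data: delivered.append((path, off, data)))
    node.write("/out/a", 0, b"x", network_ok=False)   # spooled for later
    node.write("/out/b", 0, b"y", network_ok=True)    # forwarded immediately
    node.flush()
    print(delivered)

The application's write returns as soon as the operation is queued; delivery happens whenever conditions allow, which is exactly where the decoupling (and the later "small mess") comes from.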
18 Distributed Buffer Cache
[Diagram: many Kangaroo (K) nodes forming a distributed buffer cache in front of a disk]
19 Macrobenchmark: Image Processing
- Post-processing of satellite image data: need to compute various enhancements and produce output for each.
- Read input image
- For i = 1 to N
- Compute transformation of image
- Write output image
- Example
- Image size: about 5 MB
- Compute time: about 6 sec
- I/O-CPU ratio: 0.91 MB/s
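For concreteness, the benchmark's structure as a short sketch (the file names and the transformation body are placeholders, not the actual benchmark code):

    def read_input(path):                  # one ~5 MB input image
        with open(path, "rb") as f:
            return f.read()

    def transform(image, i):               # stand-in for ~6 s of computation
        return image

    def write_output(path, image):         # one ~5 MB output per iteration
        with open(path, "wb") as f:
            f.write(image)

    def process(input_path, n):
        image = read_input(input_path)
        for i in range(1, n + 1):
            out = transform(image, i)
            write_output("enhanced.%d" % i, out)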
20 I/O Models for Image Processing
[Diagram: three timelines built from INPUT, CPU, and OUTPUT phases, comparing offline staging I/O, online streaming I/O, and Kangaroo]
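A rough back-of-the-envelope comparison of the three models. The 1 MB/s wide-area bandwidth is an assumed value (chosen near the benchmark's 0.91 MB/s I/O-CPU ratio), input staging is ignored, and local-disk writes are treated as free:

    N, S, C, B = 100, 5.0, 6.0, 1.0      # images, MB per output, s of compute, MB/s

    t_out = S / B                        # seconds to move one output off-site

    staging   = N * C + N * t_out                  # compute all, then ship all outputs
    streaming = N * (C + t_out)                    # each remote write blocks the CPU
    kangaroo  = N * max(C, t_out) + min(C, t_out)  # transfers overlap the next compute

    print(staging, streaming, kangaroo)  # 1100.0 1100.0 605.0

Under these assumptions, staging and streaming spend the same total time; only the overlapped model removes the output transfers from the critical path.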
22 A Small Mess
- The output will make it back eventually, barring the removal of a disk.
- But, what if...
- ...we need to know when it arrives?
- ...the data should be cancelled?
- ...it never arrives?
- There is a hold-and-wait operation (push), but this defeats much of the purpose.
- The job result needs to be a function of both the compute and data results.
23 Lesson
- We may decouple CPU and I/O consumption for improved throughput.
- But CPU and I/O must be semantically coupled at both dispatch and completion in order to provide useful semantics.
- Not necessary in a monolithic system
- All components fail at once.
- Integration of CPU and I/O management (fsync)
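A minimal sketch of this completion-time coupling. The commit interface is hypothetical; the point is only that the job result is a join of the program result and the I/O result, much as fsync joins buffered writes with the rest of the program:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def run_program():            # the compute half of the job
        time.sleep(0.1)
        return 0                  # program exit code

    def commit_output():          # buffered output drains in the background
        time.sleep(0.2)
        return True               # True once the data is safely at its target

    def run_job():
        exit_code = run_program()                  # program exits; its writes are still buffered
        with ThreadPoolExecutor() as pool:
            io_done = pool.submit(commit_output)   # output drains asynchronously; CPU may be released
            # Close the loop: the job result depends on BOTH results.
            if exit_code == 0 and io_done.result(timeout=10):
                return "job complete"
        return "job failed: re-run or re-push output"

    print(run_job())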
24 Outline
- Introduction
- A Case Study
- Remote Execution
- Synchronous Execution
- Asynchronous Execution
- Failures, Transparency, and Performance
- Related Work
- Progress Made
- Research Agenda
- Conclusion
25 Remote Execution
- Remote execution is the problem of running a job in a distributed system.
- A job is a request to consume a set of resources in a coordinated way
- Abstract: Programs, Files, Licenses, Users
- Concrete: CPUs, Storage, Servers, Terminals
- A distributed system is a computing environment that is
- Composed of autonomous units.
- Subject to uncoordinated failure.
- Subject to high performance variability.
26 About Jobs
- Job policy dictates what resources are acceptable to consume
- CPU must be a SPARC
- Must have > 128 MB memory
- Must be within 100 ms of a disk server
- CPU must be owned by a trusted authority.
- Input data set X may come from any trusted replication site.
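For illustration only, the constraints above written as a predicate over a machine description, in the spirit of a Condor ClassAd Requirements expression. The attribute names and the trusted domains are made up for this sketch:

    TRUSTED_DOMAINS = {"cs.wisc.edu", "fnal.gov"}           # hypothetical trust list

    def acceptable(machine):
        return (machine["arch"] == "SPARC"                   # CPU must be a SPARC
                and machine["memory_mb"] > 128               # > 128 MB memory
                and machine["disk_server_rtt_ms"] <= 100     # within 100 ms of a disk server
                and machine["owner_domain"] in TRUSTED_DOMAINS)

    print(acceptable({"arch": "SPARC", "memory_mb": 256,
                      "disk_server_rtt_ms": 40, "owner_domain": "cs.wisc.edu"}))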
27 About Jobs
- The components of a job have flexible temporal and spatial requirements.
[Diagram: a CPU reading input data and writing output data, plus an input device, an output device, a program image, a license, and credentials; annotations include "present throughout," "present at startup," and "interactive preferred"]
28 Expected Jobs
- In this work, I will concentrate on a limited class of jobs
- Executable image
- Single CPU request
- May checkpoint/restart to manage CPU.
- Input data (online/offline)
- Output data (multiple targets)
29 Expected Systems
- High latency
- I/O operations are ms -> sec
- Process dispatch is seconds -> minutes
- Performance variation
- TCP hiccups cause outages of seconds -> minutes.
- By day, network congested; by night, free.
- Uncoordinated failure
- File system fails, CPU continues to run.
- Network fails, but endpoints continue.
- Autonomy
- Users reclaim workstation CPUs.
- Best-effort storage is reclaimed.
30 Expected Users
- A wide variety of users will have varying degrees of policy aggression.
- Scientific computation
- Maximize long-term utilization/throughput.
- Scientific instrument
- Minimize use of one device.
- Disaster response
- Compute this ASAP at any expense!
- Graduate student
- Finish job before mid-life crisis.
31 The Synchronous Approach
- Grab one resource at a time as they become necessary and available.
- Assume any other resources are immediately available online.
- Start with the resource with the most contention.
- Examples
- Condor distributed batch system
- Fermi Sequential Access Manager (SAM)
32 The Condor Approach
[Diagram: a job with input needs and CPU needs, a matchmaker, a set of CPUs, online storage, and results returning]
33 The SAM Approach
[Diagram: a job with input needs and CPU needs, a tape archive, temporary disk, a set of CPUs, and results returning]
34 Problems
- What if one resource is not obviously (or consistently) the constraint?
- What is the expense of holding one resource idle while waiting for another?
- What if no single resource is under your absolute control?
- What if all your resource requirements cannot be stated offline?
- How can we deal with failures without starting everything again from scratch?
35 Asynchronous Execution
- Recognize when a job has loose synchronization requirements.
- Seek parallelism where available.
- Synchronize parallel activities at necessary joining points.
- Allow idle resources to be released and re-allocated for use by others.
- Consider failures in execution as allocation problems.
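A minimal sketch of that structure: the CPU allocation and the input staging are dispatched in parallel and joined only where execution needs both, and a failure is treated as an allocation problem rather than a reason to restart everything. The functions are hypothetical stand-ins:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def allocate_cpu():   time.sleep(0.1); return "cpu-host"
    def stage_input():    time.sleep(0.2); return "/scratch/input"
    def release_cpu(cpu): pass             # give the CPU back for others to use

    def run_job():
        with ThreadPoolExecutor() as pool:
            cpu_f   = pool.submit(allocate_cpu)   # seek both allocations at once
            input_f = pool.submit(stage_input)
            try:
                cpu  = cpu_f.result(timeout=30)   # join: execution needs both
                path = input_f.result(timeout=30)
            except Exception:
                if cpu_f.done() and cpu_f.exception() is None:
                    release_cpu(cpu_f.result())   # do not hold an idle resource
                raise                             # reschedule later rather than start over
            return "exec on %s reading %s" % (cpu, path)

    print(run_job())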
36 [Diagram: a CPU request; the program image is started with exec, input data is read, output data is written, and an exit code is returned at exit]
37 The Benefits of Asynchrony
- Better utilization of disjoint resources -> higher throughput.
- More resilient to performance variations.
- Less expensive recovery from partial system failures.
38 The Price of Asynchrony
- Complexity
- Many new boundary cases to cover.
- Is the complexity worth the trouble?
- Risk
- Without appropriate policies, we may
- Oversubscribe (internal fragmentation)
- Undersubscribe (external fragmentation)
39 The Problem of Closing the Loop
[Diagram: a timeline from job submission to job completion]
40 Synchronous I/O
[Timeline: the CPU is busy, goes idle at I/O dispatch while the I/O is busy, resumes when the I/O result returns, and then produces the program result]
41 Asynchronous Open-Loop I/O
[Timeline: I/O is dispatched and the CPU stays busy; the program result is reported while the I/O result and its validation arrive later, with nothing joining them back to the job]
42 Asynchronous Closed-Loop I/O
[Timeline: I/O is dispatched and the CPU stays busy; the job result is produced only after both the program result and the validated I/O result are known]
43 Outline
- Abstract
- Introduction
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
44 Related Work
- Many components of grid computing
- CPUs, storage, networks...
- Many traditional research areas
- scheduling, file systems, virtual memory...
- What systems seek parallelism in operations that would appear to be atomic?
- What systems exchange one resource for another?
- How do they deal with failures?
45 Computer Architecture
- These two are remarkably similar
- sort -n < infile > outfile
- ADD r1, r2, r3
- Each has multiple parts with a loose coupling in time and space.
- idle -> working -> done -> committed
- A failure or an unsuccessful speculation must
roll back dependent parts.
46 Trading Storage for Communication
- Immediate Closure
- Synchronous I/O
- Bounded Closure
- GASS, AFS, transaction
- Indeterminate Closure
- UNIX buffer cache
- Imprecise Exceptions
- Human Closure
- Coda -> failures invoke email
47 Trading Computation for Communication
- Time Warp simulation model
- All nodes checkpoint frequently.
- All messages are one-way, without synchronization.
- Missed a message? Roll back and send out anti-messages to undo earlier work.
- Problems
- When can I throw out a checkpoint?
- Can't tolerate message failure.
- Virtual Data Grid
- Data sets have a functional specification.
- Transfer here, or recompute?
- Decide at run time using cost/benefit.
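A sketch of that run-time decision; the cost estimates are placeholders, where a real system would use measured bandwidths, queue predictions, and the data set's functional specification:

    def transfer_cost(size_mb, bandwidth_mb_s):
        return size_mb / bandwidth_mb_s            # seconds to copy an existing replica here

    def recompute_cost(cpu_seconds, queue_wait_s):
        return queue_wait_s + cpu_seconds          # seconds to derive the data set again

    def obtain(size_mb, bandwidth_mb_s, cpu_seconds, queue_wait_s):
        if transfer_cost(size_mb, bandwidth_mb_s) <= recompute_cost(cpu_seconds, queue_wait_s):
            return "transfer"
        return "recompute"

    print(obtain(size_mb=500, bandwidth_mb_s=1.0, cpu_seconds=120, queue_wait_s=60))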
48 Outline
- Introduction
- Case Study
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
49 Progress Made
- We have already laid much of the research foundation necessary to explore asynchronous remote execution.
- Deployable software
- Bypass, Kangaroo, NeST
- Organizing Concepts
- Distributed buffer cache
- I/O communities
- Error management theory
50 Interposition Agents
[Diagram: layering of the application, an interposition agent, the standard library, and the kernel]
51 The Grid Console
[Diagram: a half-interactive process connected across an unreliable network]
52 I/O Communities
[Diagram: two Condor pools, each with a NeST storage server; a job and its data]
53 References in ClassAds
[Diagram: a job ad that refers to NearestStorage, a machine ad that knows where NearestStorage is, and a storage ad are matched; the job, the machine, and the NeST are linked]
54 Distributed Buffer Cache
[Diagram: many Kangaroo (K) nodes forming a distributed buffer cache in front of a disk]
55 Error Management
- In preparation: "Error Scope on a Computational Grid: Theory and Practice"
- An environment for Java in Condor.
- How do we understand the significance of the many things that may go wrong?
- Every scope must have a handler.
56 Publications
- Douglas Thain and Miron Livny, "Error Scope on a Computational Grid," in preparation.
- Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny, "Gathering at the Well: Creating Communities for Grid I/O," in Proceedings of Supercomputing 2001, Denver, Colorado, November 2001.
- Douglas Thain, Jim Basney, Se-Chang Son, and Miron Livny, "The Kangaroo Approach to Data Movement on the Grid," in Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC-10), San Francisco, California, August 7-9, 2001, pp. 325-333.
- Douglas Thain and Miron Livny, "Multiple Bypass: Interposition Agents for Distributed Computing," Journal of Cluster Computing, Volume 4, pp. 39-47, 2001.
- Douglas Thain and Miron Livny, "Bypass: A Tool for Building Split Execution Systems," in Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC-9), Pittsburgh, Pennsylvania, August 1-4, 2000, pp. 79-85.
57 Outline
- Introduction
- A Case Study
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
58 Research Agenda
- I propose to create an end-to-end structure for asynchronous remote execution.
- To accomplish this, I will take an existing remote execution system and increase the asynchrony by degrees.
- The focus will be mechanisms, not policies.
- Suggest points where policies must be attached.
- Use simple policies to demonstrate use.
- Mechanisms must be correct regardless of policy
choices.
59 Research Environment
- The Condor distributed batch system.
- Local resources
- Test Pool: approx. 20 workstations.
- Main Pool: approx. 1000 machines.
- Possible to deploy significant changes to all participating software.
- Remote resources
- INFN Bologna: approx. 300 workstations.
- Other pools as necessary.
- Can only deploy changes within the context of an interposition agent.
60 Stage One: Asynchronous Output
- Timeline: March-May 2002
- Goal
- Decouple CPU allocation from output data movement.
- Method
- Couple Kangaroo with Condor and close the loop.
- The job requires a new state: "waiting for output."
- Policy
- How long should a job remain in "waiting for output" before it is re-run?
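A sketch of the proposed new state and the timeout policy question. The state names and the policy knob are illustrative, not the planned Condor implementation:

    RUNNING, WAITING_FOR_OUTPUT, COMPLETE, RERUN = "running", "waiting", "complete", "rerun"

    def advance(state, output_committed, waited_s, policy_timeout_s=3600):
        if state == RUNNING:
            return WAITING_FOR_OUTPUT            # CPU released; output still in flight
        if state == WAITING_FOR_OUTPUT:
            if output_committed:
                return COMPLETE                  # loop closed: output safely delivered
            if waited_s > policy_timeout_s:      # the policy question on this slide
                return RERUN
        return state

    print(advance(WAITING_FOR_OUTPUT, output_committed=False, waited_s=7200))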
61 Stage Two: Asynchronous Input
- Timeline: June-August 2002
- Goal
- Decouple CPU allocation from input data movement.
- Method
- Modify the scheduler to be aware of I/O communities and seek CPU and I/O allocations independently.
- Unexpected I/O needs may use checkpointing to release CPU allocations.
- Policy
- How long to hold idle before timeout?
- How to estimate queueing time for each resource?
62 Stage Three: Disconnected Operation
- Timeline: September-December 2002
- Goal
- Permit the application to execute without any run-time dependence on the submitter.
- Method
- Release the umbilical once policy is set.
- The job requires a new state: "presumed alive."
- Unexpected policy needs may require reconnection.
- Policy
- How much autonomy may be delegated to the interposition agent? (performance/control)
63 Stage Four: Dissertation
- Timeline: January-May 2003
- Design
- What algorithms and data structures are necessary?
- Performance
- What are the quantitative costs/benefits?
- Discussion
- What are the tradeoffs between performance, risk, and knowledge?
- What are the implications of designing for fault tolerance?
64 Evaluation Criteria
- Correctness
- The system must meet its interface obligations.
- Reliability
- Satisfy the user with high probability.
- Throughput
- Improve by avoiding hold-and-wait.
- Latency
- A modest increase is ok for batch workloads.
- Knowledge
- Has my job finished?
- How much has it consumed?
- Complexity
65 Contributions
- Short-term
- Improvement of the Condor software.
- Not the goal, but a necessary validation
- Medium-term
- Serve as a design resource for grid computing.
- Key concepts such as closing the loop and matching interfaces.
- Long-term
- Serve as a basis for further research.
66 Further Work
- Should jobs move to data, or vice versa?
- Let's try both!
- Many opportunities for speculation
- Potentially stale data in the file cache? Keep going.
- Partial program completion is useful
- DAGMan: dispatch a process based on exit code, dispatch another based on output data.
- What if we change the API?
- A drastic step, but... MPI, PVM, MW, Linda.
- Can we admit subprogram failure and maintain a usable interface?
67 Outline
- Introduction
- A Case Study
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
68 Conclusion
- Large-grained asynchrony has yet to be explored in the context of remote program execution.
- Asynchrony has benefits, but requires careful management of failure modes.
- This dissertation will contribute a system design and an exploration of performance, risk, and knowledge in a distributed system.
69 Extra Slides
70 [Diagram: split execution. At the execution site, a starter forks the job, which runs under a JVM with an I/O library; an I/O proxy handles local I/O (Chirp) and local system calls. At the submission site, a shadow with an I/O server reaches the home file system. The two sites communicate via secure remote I/O.]
71 [Diagram: a running program uses an application interface with explicit descriptions of ordering, reliability, performance, and availability (POSIX, MPI, PVM, MW); an interposition agent translates this onto remote resource interfaces (disk, CPU, RAM, network) that give few guarantees on performance, availability, and reliability.]
72 Supervisor Process
- In the event of a failure:
- Retry: hold the CPU allocation and try again.
- Checkpoint: release the CPU, with some restart condition.
- Abort: expose the failure to a supervisor.
[Diagram: the running program, its interposition agent, and the resources (disk, CPU, RAM, network)]
73 The Cost of Hiding Failures
- Each technique is valid from the standpoint of API conformity.
- What to use? Depends on cost:
- Retry: holds the CPU idle while retrying the device.
- Checkpoint: consumes disk and network, but releases the CPU.
- Abort: no expense up front, but must re-consume resources when ready to roll forward.
- User policy is vital in determining what costs are acceptable for hiding failures.
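A toy cost-based selection among the three techniques; the cost model and the numbers are placeholders, since which expense is acceptable is exactly the user-policy question:

    def choose(expected_outage_s, checkpoint_cost_s, cpu_cost_per_s, user_budget_s):
        retry_cost      = expected_outage_s * cpu_cost_per_s   # CPU held idle while retrying
        checkpoint_cost = checkpoint_cost_s                    # pay disk/network now, free the CPU
        if retry_cost <= min(checkpoint_cost, user_budget_s):
            return "retry"
        if checkpoint_cost <= user_budget_s:
            return "checkpoint"
        return "abort"        # spend nothing now; re-acquire resources when rolling forward

    print(choose(expected_outage_s=300, checkpoint_cost_s=60,
                 cpu_cost_per_s=1.0, user_budget_s=120))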
74 Running Program
[Diagram: the running program, through its interposition agent, exchanges questions and notices with a policy director: "What file should I open?" "How long should I try?" "May I checkpoint now?" "Where should I store the checkpoint image?" "Should I stage or stream the output?" "FYI, I've used 395 service units here." "FYI, I'm about to be evicted from this site." Resources: disk, CPU, RAM, network.]
75 Disconnected Operation
- The policy manager is also an execution resource that is occasionally slow or unavailable.
- Holding a resource idle while waiting indefinitely for policy direction is still wait-while-idle.
- A higher degree of asynchrony can be achieved through disconnected operation.
- Requires each autonomous unit be given an allowance.