Title: Asynchronous Remote Execution
1 Asynchronous Remote Execution
- PhD Preliminary Examination
- Douglas Thain
- University of Wisconsin
- 19 March 2002
2 Thesis
- Asynchronous operations improve the throughput, resiliency, and scalability of remote execution systems.
- However, asynchrony introduces new failure modes that must be carefully understood in order to preserve the illusion of synchronous operation.
3 Proposal
- I propose to explore the coupling between asynchrony, failures, and performance in remote execution.
- To accomplish this, I will modify an existing system and increase the available asynchrony in degrees.
4 Contributions
- A measurement of the performance benefits of asynchrony.
- A system design that accommodates asynchrony while tolerating a significant set of expected failure modes.
- An exploration of the balance between performance, risk, and knowledge in a distributed system.
5 Outline
- Introduction
- A Case Study
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
6 Science Needs Large-Scale Computing
- Theory, Experiment, Computation
- Nearly every field of scientific study has a grand challenge problem
- Meteorology
- Genetics
- Astronomy
- Physics
7 The Grid Vision
[Diagram: security services, a tape archive, and several disk archives]
8 The Grid Reality
- Systems for managing CPUs
- Condor, LSF, PBS
- Programming Interfaces
- POSIX, Java, C, MPI, PVM....
- Systems for managing data
- SRB, HRM, SAM, ReqEx (DapMan)
- Systems for storing data
- NeST, IBP
- Systems for moving data
- GridFTP, HTTP, Kangaroo
- Systems for remote authentication
- SSL, GSI, Kerberos, NTSSPI
9 The Grid Reality
- Host uptime
- Median 15.92 days
- Mean 5.53 days
- Local maximum 1 day
- Long et al., "A Longitudinal Study of Internet Host Reliability," Symposium on Reliable Distributed Systems, 1995.
- Wide-area connectivity
- approx. 1% chance of a 30-second interruption
- approx. 0.1% chance of a persistent outage
- Chandra et al., "End-to-End WAN Service Availability," Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems, 2001.
10 [Diagram: a job surrounded by security services, disk archives, and tape archives]
11 Usual Approach: Hold and Wait
- Request CPU, Wait for Success
- Stage Data to CPU, Wait
- Move Executable to CPU, Wait
- Execute Program, Wait
- Missing File -> Stage Data, Wait
- Stage Output Data, Wait
- Failure? Start over...
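A minimal sketch of this hold-and-wait sequence, with hypothetical stand-in functions in place of the real grid operations (this is not Condor code):

    import time

    class StepFailed(Exception):
        pass

    # Stand-ins: each blocks until its step succeeds (or raises StepFailed).
    def request_cpu():  time.sleep(0.1); return "cpu-host"
    def stage_in(cpu):  time.sleep(0.1)
    def execute(cpu):   time.sleep(0.1); return 0
    def stage_out(cpu): time.sleep(0.1)

    def run_job_hold_and_wait():
        while True:                   # Failure? Start over...
            try:
                cpu = request_cpu()   # CPU is held from here on, even while idle
                stage_in(cpu)         # CPU idle while disk and network are busy
                code = execute(cpu)   # disk and network idle while CPU is busy
                stage_out(cpu)        # CPU idle again
                return code
            except StepFailed:
                continue

    print(run_job_hold_and_wait())

Every step blocks, and a failure at any point discards all of the work done so far.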
12 Synchronous Systems are Inflexible
- Poor utilization and throughput due to hold-and-wait.
- CPU idle while disk busy.
- Disk idle while CPU busy.
- Disk full? CPU stops.
- System sensitive to failures of both performance and correctness.
- Network down? Everything stops.
- Network slow? Everything slows.
- Credentials lost? Everything aborts.
13 Resiliency Requires Flexibility
- Most jobs have weak couplings between all of their components.
- Asynchrony = Time Decoupling
- Can't have network now? Ok, use disk.
- Can't have CPU now? Ok, checkpoint.
- Can't store data now? Ok, recompute later.
- Time Decoupling -> Space Decoupling
14 Computing's Central Challenge
- "How not to make a mess of it."
- -- Edsger Dijkstra, CACM, March 2001.
How can we harness the advantages of asynchrony while maintaining a coherent and reliable user experience?
15 Outline
- Introduction
- A Case Study: The Distributed Buffer Cache
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
16 Case Study: The Distributed Buffer Cache
- The Kangaroo distributed buffer cache introduces asynchronous I/O for remote execution.
- It offers improved job throughput and failure resiliency at the price of increased latency in I/O arrival.
- A small mess: jobs and I/O are not recoupled at completion time.
17 Kangaroo Prototype
An application may contact any node in the system and perform partial-file reads and writes.
The node may then execute or buffer operations as conditions warrant.
A consistency protocol ensures no loss of data due to crash/disconnect.
[Diagram: an application writes through a chain of Kangaroo (K) nodes, each with a buffer, toward a destination disk]
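A minimal sketch of the buffering idea only; this is not the Kangaroo implementation or its consistency protocol, and the class interface is illustrative:

    import collections

    class BufferingNode:
        def __init__(self, forward):
            self.forward = forward            # callable that pushes a write onward
            self.spool = collections.deque()  # operations buffered for later delivery

        def write(self, path, offset, data, network_ok):
            op = (path, offset, data)
            if network_ok:
                self.forward(*op)             # execute the operation now
            else:
                self.spool.append(op)         # buffer it and return to the application

        def flush(self):
            while self.spool:                 # deliver buffered writes in order
                self.forward(*self.spool.popleft())

    delivered = []
    node = BufferingNode(lambda path, off, data: delivered.append((path, off, data)))
    node.write("/out/a", 0, b"x", network_ok=False)   # spooled for later
    node.write("/out/b", 0, b"y", network_ok=True)    # forwarded immediately
    node.flush()
    print(delivered)

The application's write returns as soon as the operation is queued; delivery happens whenever conditions allow, which is exactly where the decoupling (and the later "small mess") comes from.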
18 Distributed Buffer Cache
[Diagram: many Kangaroo (K) nodes forming a distributed buffer cache in front of a disk]
19 Macrobenchmark: Image Processing
- Post-processing of satellite image data: need to compute various enhancements and produce output for each.
- Read input image
- For i = 1 to N
- Compute transformation of image
- Write output image
- Example
- Image size: about 5 MB
- Compute time: about 6 sec
- I/O-CPU ratio: 0.91 MB/s
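For concreteness, the benchmark's structure as a short sketch (the file names and the transformation body are placeholders, not the actual benchmark code):

    def read_input(path):                  # one ~5 MB input image
        with open(path, "rb") as f:
            return f.read()

    def transform(image, i):               # stand-in for ~6 s of computation
        return image

    def write_output(path, image):         # one ~5 MB output per iteration
        with open(path, "wb") as f:
            f.write(image)

    def process(input_path, n):
        image = read_input(input_path)
        for i in range(1, n + 1):
            out = transform(image, i)
            write_output("enhanced.%d" % i, out)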
20 I/O Models for Image Processing
[Diagram: three timelines built from INPUT, CPU, and OUTPUT phases, comparing offline staging I/O, online streaming I/O, and Kangaroo]
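A rough back-of-the-envelope comparison of the three models. The 1 MB/s wide-area bandwidth is an assumed value (chosen near the benchmark's 0.91 MB/s I/O-CPU ratio), input staging is ignored, and local-disk writes are treated as free:

    N, S, C, B = 100, 5.0, 6.0, 1.0      # images, MB per output, s of compute, MB/s

    t_out = S / B                        # seconds to move one output off-site

    staging   = N * C + N * t_out                  # compute all, then ship all outputs
    streaming = N * (C + t_out)                    # each remote write blocks the CPU
    kangaroo  = N * max(C, t_out) + min(C, t_out)  # transfers overlap the next compute

    print(staging, streaming, kangaroo)  # 1100.0 1100.0 605.0

Under these assumptions, staging and streaming spend the same total time; only the overlapped model removes the output transfers from the critical path.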
22 A Small Mess
- The output will make it back eventually, barring the removal of a disk.
- But, what if...
- ...we need to know when it arrives?
- ...the data should be cancelled?
- ...it never arrives?
- There is a hold-and-wait operation (push), but this defeats much of the purpose.
- The job result needs to be a function of both the compute and data results.
23 Lesson
- We may decouple CPU and I/O consumption for improved throughput.
- But CPU and I/O must be semantically coupled at both dispatch and completion in order to provide useful semantics.
- Not necessary in a monolithic system
- All components fail at once.
- Integration of CPU and I/O management (fsync)
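A minimal sketch of this completion-time coupling. The commit interface is hypothetical; the point is only that the job result is a join of the program result and the I/O result, much as fsync joins buffered writes with the rest of the program:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def run_program():            # the compute half of the job
        time.sleep(0.1)
        return 0                  # program exit code

    def commit_output():          # buffered output drains in the background
        time.sleep(0.2)
        return True               # True once the data is safely at its target

    def run_job():
        exit_code = run_program()                  # program exits; its writes are still buffered
        with ThreadPoolExecutor() as pool:
            io_done = pool.submit(commit_output)   # output drains asynchronously; CPU may be released
            # Close the loop: the job result depends on BOTH results.
            if exit_code == 0 and io_done.result(timeout=10):
                return "job complete"
        return "job failed: re-run or re-push output"

    print(run_job())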
24 Outline
- Introduction
- A Case Study
- Remote Execution
- Synchronous Execution
- Asynchronous Execution
- Failures, Transparency, and Performance
- Related Work
- Progress Made
- Research Agenda
- Conclusion
25 Remote Execution
- Remote execution is the problem of running a job in a distributed system.
- A job is a request to consume a set of resources in a coordinated way
- Abstract: Programs, Files, Licenses, Users
- Concrete: CPUs, Storage, Servers, Terminals
- A distributed system is a computing environment that is
- Composed of autonomous units.
- Subject to uncoordinated failure.
- Subject to high performance variability.
26 About Jobs
- Job policy dictates what resources are acceptable to consume
- CPU must be a SPARC
- Must have > 128 MB memory
- Must be within 100 ms of a disk server
- CPU must be owned by a trusted authority.
- Input data set X may come from any trusted replication site.
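For illustration only, the constraints above written as a predicate over a machine description, in the spirit of a Condor ClassAd Requirements expression. The attribute names and the trusted domains are made up for this sketch:

    TRUSTED_DOMAINS = {"cs.wisc.edu", "fnal.gov"}           # hypothetical trust list

    def acceptable(machine):
        return (machine["arch"] == "SPARC"                   # CPU must be a SPARC
                and machine["memory_mb"] > 128               # > 128 MB memory
                and machine["disk_server_rtt_ms"] <= 100     # within 100 ms of a disk server
                and machine["owner_domain"] in TRUSTED_DOMAINS)

    print(acceptable({"arch": "SPARC", "memory_mb": 256,
                      "disk_server_rtt_ms": 40, "owner_domain": "cs.wisc.edu"}))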
27 About Jobs
- The components of a job have flexible temporal and spatial requirements.
[Diagram: a CPU reading input data and writing output data, plus an input device, an output device, a program image, a license, and credentials; annotations include "present throughout," "present at startup," and "interactive preferred"]
28 Expected Jobs
- In this work, I will concentrate on a limited class of jobs
- Executable image
- Single CPU request
- May checkpoint/restart to manage CPU.
- Input data (online/offline)
- Output data (multiple targets)
29 Expected Systems
- High latency
- I/O operations are ms -> sec
- Process dispatch is seconds -> minutes
- Performance variation
- TCP hiccups cause outages of seconds -> minutes.
- By day, network congested; by night, free.
- Uncoordinated failure
- File system fails, CPU continues to run.
- Network fails, but endpoints continue.
- Autonomy
- Users reclaim workstation CPUs.
- Best-effort storage is reclaimed.
30 Expected Users
- A wide variety of users will have varying degrees of policy aggression.
- Scientific computation
- Maximize long-term utilization/throughput.
- Scientific instrument
- Minimize use of one device.
- Disaster response
- Compute this ASAP at any expense!
- Graduate student
- Finish job before mid-life crisis.
31 The Synchronous Approach
- Grab one resource at a time as they become necessary and available.
- Assume any other resources are immediately available online.
- Start with the resource with the most contention.
- Examples
- Condor distributed batch system
- Fermi Sequential Access Manager (SAM)
32 The Condor Approach
[Diagram: a job with input needs and CPU needs, a matchmaker, a set of CPUs, online storage, and results returning]
33 The SAM Approach
[Diagram: a job with input needs and CPU needs, a tape archive, temporary disk, a set of CPUs, and results returning]
34 Problems
- What if one resource is not obviously (or consistently) the constraint?
- What is the expense of holding one resource idle while waiting for another?
- What if no single resource is under your absolute control?
- What if all your resource requirements cannot be stated offline?
- How can we deal with failures without starting everything again from scratch?
35 Asynchronous Execution
- Recognize when a job has loose synchronization requirements.
- Seek parallelism where available.
- Synchronize parallel activities at necessary joining points.
- Allow idle resources to be released and re-allocated for use by others.
- Consider failures in execution as allocation problems.
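A minimal sketch of that structure: the CPU allocation and the input staging are dispatched in parallel and joined only where execution needs both, and a failure is treated as an allocation problem rather than a reason to restart everything. The functions are hypothetical stand-ins:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def allocate_cpu():   time.sleep(0.1); return "cpu-host"
    def stage_input():    time.sleep(0.2); return "/scratch/input"
    def release_cpu(cpu): pass             # give the CPU back for others to use

    def run_job():
        with ThreadPoolExecutor() as pool:
            cpu_f   = pool.submit(allocate_cpu)   # seek both allocations at once
            input_f = pool.submit(stage_input)
            try:
                cpu  = cpu_f.result(timeout=30)   # join: execution needs both
                path = input_f.result(timeout=30)
            except Exception:
                if cpu_f.done() and cpu_f.exception() is None:
                    release_cpu(cpu_f.result())   # do not hold an idle resource
                raise                             # reschedule later rather than start over
            return "exec on %s reading %s" % (cpu, path)

    print(run_job())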
36 [Diagram: a CPU request; the program image is started with exec, input data is read, output data is written, and an exit code is returned at exit]
37 The Benefits of Asynchrony
- Better utilization of disjoint resources -> higher throughput.
- More resilient to performance variations.
- Less expensive recovery from partial system failures.
38 The Price of Asynchrony
- Complexity
- Many new boundary cases to cover.
- Is the complexity worth the trouble?
- Risk
- Without appropriate policies, we may
- Oversubscribe (internal fragmentation)
- Undersubscribe (external fragmentation)
39 The Problem of Closing the Loop
[Diagram: a timeline from job submission to job completion]
40 Synchronous I/O
[Timeline: the CPU is busy, goes idle at I/O dispatch while the I/O is busy, resumes when the I/O result returns, and then produces the program result]
41 Asynchronous Open-Loop I/O
[Timeline: I/O is dispatched and the CPU stays busy; the program result is reported while the I/O result and its validation arrive later, with nothing joining them back to the job]
42 Asynchronous Closed-Loop I/O
[Timeline: I/O is dispatched and the CPU stays busy; the job result is produced only after both the program result and the validated I/O result are known]
43 Outline
- Abstract
- Introduction
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
44 Related Work
- Many components of grid computing
- CPUs, storage, networks...
- Many traditional research areas
- scheduling, file systems, virtual memory...
- What systems seek parallelism in operations that would appear to be atomic?
- What systems exchange one resource for another?
- How do they deal with failures?
45 Computer Architecture
- These two are remarkably similar
- sort -n < infile > outfile
- ADD r1, r2, r3
- Each has multiple parts with a loose coupling in time and space.
- idle -> working -> done -> committed
- A failure or an unsuccessful speculation must
roll back dependent parts.
46 Trading Storage for Communication
- Immediate Closure
- Synchronous I/O
- Bounded Closure
- GASS, AFS, transaction
- Indeterminate Closure
- UNIX buffer cache
- Imprecise Exceptions
- Human Closure
- Coda -> failures invoke email
47 Trading Computation for Communication
- Time Warp simulation model
- All nodes checkpoint frequently.
- All messages are one-way, without synchronization.
- Missed a message? Roll back and send out anti-messages to undo earlier work.
- Problems
- When can I throw out a checkpoint?
- Can't tolerate message failure.
- Virtual Data Grid
- Data sets have a functional specification.
- Transfer here, or recompute?
- Decide at run time using cost/benefit.
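A sketch of that run-time decision; the cost estimates are placeholders, where a real system would use measured bandwidths, queue predictions, and the data set's functional specification:

    def transfer_cost(size_mb, bandwidth_mb_s):
        return size_mb / bandwidth_mb_s            # seconds to copy an existing replica here

    def recompute_cost(cpu_seconds, queue_wait_s):
        return queue_wait_s + cpu_seconds          # seconds to derive the data set again

    def obtain(size_mb, bandwidth_mb_s, cpu_seconds, queue_wait_s):
        if transfer_cost(size_mb, bandwidth_mb_s) <= recompute_cost(cpu_seconds, queue_wait_s):
            return "transfer"
        return "recompute"

    print(obtain(size_mb=500, bandwidth_mb_s=1.0, cpu_seconds=120, queue_wait_s=60))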
48 Outline
- Introduction
- Case Study
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
49 Progress Made
- We have already laid much of the research foundation necessary to explore asynchronous remote execution.
- Deployable software
- Bypass, Kangaroo, NeST
- Organizing Concepts
- Distributed buffer cache
- I/O communities
- Error management theory
50 Interposition Agents
[Diagram: layering of the application, an interposition agent, the standard library, and the kernel]
51 The Grid Console
[Diagram: a half-interactive process connected across an unreliable network]
52 I/O Communities
[Diagram: two Condor pools, each with a NeST storage server; a job and its data]
53 References in ClassAds
[Diagram: a job ad that refers to NearestStorage, a machine ad that knows where NearestStorage is, and a storage ad are matched; the job, the machine, and the NeST are linked]
54 Distributed Buffer Cache
[Diagram: many Kangaroo (K) nodes forming a distributed buffer cache in front of a disk]
55 Error Management
- In preparation: "Error Scope on a Computational Grid: Theory and Practice"
- An environment for Java in Condor.
- How do we understand the significance of the many things that may go wrong?
- Every scope must have a handler.
56 Publications
- Douglas Thain and Miron Livny, "Error Scope on a Computational Grid," in preparation.
- Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny, "Gathering at the Well: Creating Communities for Grid I/O," in Proceedings of Supercomputing 2001, Denver, Colorado, November 2001.
- Douglas Thain, Jim Basney, Se-Chang Son, and Miron Livny, "The Kangaroo Approach to Data Movement on the Grid," in Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC-10), San Francisco, California, August 7-9, 2001, pp. 325-333.
- Douglas Thain and Miron Livny, "Multiple Bypass: Interposition Agents for Distributed Computing," Journal of Cluster Computing, Volume 4, pp. 39-47, 2001.
- Douglas Thain and Miron Livny, "Bypass: A Tool for Building Split Execution Systems," in Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC-9), Pittsburgh, Pennsylvania, August 1-4, 2000, pp. 79-85.
57 Outline
- Introduction
- A Case Study
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
58 Research Agenda
- I propose to create an end-to-end structure for asynchronous remote execution.
- To accomplish this, I will take an existing remote execution system and increase the asynchrony by degrees.
- The focus will be mechanisms, not policies.
- Suggest points where policies must be attached.
- Use simple policies to demonstrate use.
- Mechanisms must be correct regardless of policy
choices.
59 Research Environment
- The Condor distributed batch system.
- Local resources
- Test Pool: approx. 20 workstations.
- Main Pool: approx. 1000 machines.
- Possible to deploy significant changes to all participating software.
- Remote resources
- INFN Bologna: approx. 300 workstations.
- Other pools as necessary.
- Can only deploy changes within the context of an interposition agent.
60 Stage One: Asynchronous Output
- Timeline: March-May 2002
- Goal
- Decouple CPU allocation from output data movement.
- Method
- Couple Kangaroo with Condor and close the loop.
- The job requires a new state: "waiting for output."
- Policy
- How long should a job remain in "waiting for output" before it is re-run?
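A sketch of the proposed new state and the timeout policy question. The state names and the policy knob are illustrative, not the planned Condor implementation:

    RUNNING, WAITING_FOR_OUTPUT, COMPLETE, RERUN = "running", "waiting", "complete", "rerun"

    def advance(state, output_committed, waited_s, policy_timeout_s=3600):
        if state == RUNNING:
            return WAITING_FOR_OUTPUT            # CPU released; output still in flight
        if state == WAITING_FOR_OUTPUT:
            if output_committed:
                return COMPLETE                  # loop closed: output safely delivered
            if waited_s > policy_timeout_s:      # the policy question on this slide
                return RERUN
        return state

    print(advance(WAITING_FOR_OUTPUT, output_committed=False, waited_s=7200))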
61 Stage Two: Asynchronous Input
- Timeline: June-August 2002
- Goal
- Decouple CPU allocation from input data movement.
- Method
- Modify the scheduler to be aware of I/O communities and seek CPU and I/O allocations independently.
- Unexpected I/O needs may use checkpointing to release CPU allocations.
- Policy
- How long to hold idle before timeout?
- How to estimate queueing time for each resource?
62 Stage Three: Disconnected Operation
- Timeline: September-December 2002
- Goal
- Permit the application to execute without any run-time dependence on the submitter.
- Method
- Release the umbilical once policy is set.
- The job requires a new state: "presumed alive."
- Unexpected policy needs may require reconnection.
- Policy
- How much autonomy may be delegated to the interposition agent? (performance/control)
63 Stage Four: Dissertation
- Timeline: January-May 2003
- Design
- What algorithms and data structures are necessary?
- Performance
- What are the quantitative costs/benefits?
- Discussion
- What are the tradeoffs between performance, risk, and knowledge?
- What are the implications of designing for fault tolerance?
64 Evaluation Criteria
- Correctness
- The system must meet its interface obligations.
- Reliability
- Satisfy the user with high probability.
- Throughput
- Improve by avoiding hold-and-wait.
- Latency
- A modest increase is ok for batch workloads.
- Knowledge
- Has my job finished?
- How much has it consumed?
- Complexity
65 Contributions
- Short-term
- Improvement of the Condor software.
- Not the goal, but a necessary validation
- Medium-term
- Serve as a design resource for grid computing.
- Key concepts such as closing the loop and matching interfaces.
- Long-term
- Serve as a basis for further research.
66 Further Work
- Should jobs move to data, or vice versa?
- Let's try both!
- Many opportunities for speculation
- Potentially stale data in the file cache? Keep going.
- Partial program completion is useful
- DAGMan: dispatch a process based on exit code, dispatch another based on output data.
- What if we change the API?
- A drastic step, but... MPI, PVM, MW, Linda.
- Can we admit subprogram failure and maintain a usable interface?
67 Outline
- Introduction
- A Case Study
- Remote Execution
- Related Work
- Progress Made
- Research Agenda
- Conclusion
68 Conclusion
- Large-grained asynchrony has yet to be explored in the context of remote program execution.
- Asynchrony has benefits, but requires careful management of failure modes.
- This dissertation will contribute a system design and an exploration of performance, risk, and knowledge in a distributed system.
69 Extra Slides
70 [Diagram: split execution. At the execution site, a starter forks the job, which runs under a JVM with an I/O library; an I/O proxy handles local I/O (Chirp) and local system calls. At the submission site, a shadow with an I/O server reaches the home file system. The two sites communicate via secure remote I/O.]
71 [Diagram: a running program uses an application interface with explicit descriptions of ordering, reliability, performance, and availability (POSIX, MPI, PVM, MW); an interposition agent translates this onto remote resource interfaces (disk, CPU, RAM, network) that give few guarantees on performance, availability, and reliability.]
72 Supervisor Process
- In the event of a failure:
- Retry: hold the CPU allocation and try again.
- Checkpoint: release the CPU, with some restart condition.
- Abort: expose the failure to a supervisor.
[Diagram: the running program, its interposition agent, and the resources (disk, CPU, RAM, network)]
73 The Cost of Hiding Failures
- Each technique is valid from the standpoint of API conformity.
- What to use? Depends on cost:
- Retry: holds the CPU idle while retrying the device.
- Checkpoint: consumes disk and network, but releases the CPU.
- Abort: no expense up front, but must re-consume resources when ready to roll forward.
- User policy is vital in determining what costs are acceptable for hiding failures.
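A toy cost-based selection among the three techniques; the cost model and the numbers are placeholders, since which expense is acceptable is exactly the user-policy question:

    def choose(expected_outage_s, checkpoint_cost_s, cpu_cost_per_s, user_budget_s):
        retry_cost      = expected_outage_s * cpu_cost_per_s   # CPU held idle while retrying
        checkpoint_cost = checkpoint_cost_s                    # pay disk/network now, free the CPU
        if retry_cost <= min(checkpoint_cost, user_budget_s):
            return "retry"
        if checkpoint_cost <= user_budget_s:
            return "checkpoint"
        return "abort"        # spend nothing now; re-acquire resources when rolling forward

    print(choose(expected_outage_s=300, checkpoint_cost_s=60,
                 cpu_cost_per_s=1.0, user_budget_s=120))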
74 Running Program
[Diagram: the running program, through its interposition agent, exchanges questions and notices with a policy director: "What file should I open?" "How long should I try?" "May I checkpoint now?" "Where should I store the checkpoint image?" "Should I stage or stream the output?" "FYI, I've used 395 service units here." "FYI, I'm about to be evicted from this site." Resources: disk, CPU, RAM, network.]
75 Disconnected Operation
- The policy manager is also an execution resource that is occasionally slow or unavailable.
- Holding a resource idle while waiting indefinitely for policy direction is still wait-while-idle.
- A higher degree of asynchrony can be achieved through disconnected operation.
- Requires each autonomous unit be given an allowance.