Title: Errors, Status, and Asynchrony Discussion Session
1Errors, Status, and Asynchrony Discussion Session
- PPDG Data Replication Meeting
- 10 January 2002
- Douglas Thain, Condor Project
- University of Wisconsin
2Agenda
- A Working Model
- Two Error-Management Issues
- Thinking of Data-Movement as Jobs
- Reconciling Error Representations
- Example Problem
- Discussion
- Open Issues
- Hints and Absolutes in Replica Management
- Tradeoff between consistency and availability
3Discussion Points
- Data Job Management and Fault-Tolerance
- What faults do we intend to tolerate/expose/ignore?
- Can we develop a general transaction infrastructure for replication-related activities?
- How should we evaluate designs that may be error sensitive? (design review, stress testing)
- Error Identification and Representation
- Should we have a uniform error space?
- Is it feasible to translate between existing error spaces?
- What systems have unusual error modes that outsiders may not expect?
- How do we deal with unusual errors that must pass through existing APIs?
4A Working Model
[Figure: Giggle model. A GRIN index maps logical names L1, L2, L3 to site B; Replica Site B maps L1 to P1, L2 to P2, L3 to P3.]
Foster, Iamnitchi, Ripeanu, Chervenak, Deelman, Kesselman, Hoschek, Kunszt, Stockinger, Stockinger, Tierney, Giggle: A Framework for Constructing Scalable Replica Location Services
5The Problem
- Replication systems will be subject to a wide variety of errors.
- How do we build systems that maintain consistency in the face of errors?
- Answer: Use transactions to manage jobs, but...
- How do we build systems that make reasonable performance decisions in the face of errors?
- Answer: Informative errors, but...
6Fault Tolerance Terminology
- Failure
- An externally-visible deviation from specifications.
- Error
- An internal data state that leads to a failure.
- Fault
- An external event that creates an error.
A. Avizienis and J.C. Laprie, Dependable computing: From concepts to design diversity, Proc. IEEE 74, 5 (May), 629-638
7Example
[Figure: a Client and a Server working on a sqrt request ("Hmm, sqrt(4) is...", "Hmm, sqrt(9) is..."), annotated with FAULT, ERROR, and FAILURE.]
8
- Silent errors (failures)
- The system claims to have reached a valid result, but an auditor claims it is invalid.
- Explicit errors (failures)
- The system tells us it cannot complete the desired action.
- Escaping errors (failures)
- The system detects an error, but has no method of reporting it, so it escapes by an alternate route -- drop connection, core dump, kernel panic. (exception)
John B. Goodenough, Exception Handling: Issues and a Proposed Notation, CACM 18(12) (1975), pp 683-696.
9What Errors to Expect in a Replication System?
- Errors of communication
- File transfer was broken between bytes.
- Collection transfer was broken between files.
- Errors of omission
- Requested some files, but response was slow, so the caller gave up and left. (with/out abort?)
- Errors in configuration
- Space at target server can't admit all incoming data at once.
10What Must Be Consistent?
[Figure: Replica Catalog maps L1, L2, L3 to site B; Replica Site B maps L1 to P1, L2 to P2, L3 to P3, alongside the stored physical files P1, P2, P3.]
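One way to read this slide is that three layers have to agree: the catalog's site list, each site's LFN-to-PFN map, and the files actually in storage. The following is a minimal sketch of such a check, not anything from the talk; the function name `find_inconsistencies` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List, Set


def find_inconsistencies(
    catalog: Dict[str, Set[str]],          # LFN -> sites (hypothetical layout)
    site_maps: Dict[str, Dict[str, str]],  # site -> {LFN: PFN}
    stored: Dict[str, Set[str]],           # site -> PFNs present in storage
) -> List[str]:
    """Report every catalog entry that a site cannot back with a stored file."""
    problems = []
    for lfn in sorted(catalog):
        for site in sorted(catalog[lfn]):
            pfn = site_maps.get(site, {}).get(lfn)
            if pfn is None:
                problems.append(f"{lfn}@{site}: no LFN-to-PFN entry")
            elif pfn not in stored.get(site, set()):
                problems.append(f"{lfn}@{site}: {pfn} missing from storage")
    return problems
```

An empty result means the three layers agree; each string names one broken link.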
11Data Movement as a Job
- Each request issued for replication must have a past, present, and future
- Who issued it, and why?
- What is it doing now?
- Is it done? Did it succeed?
- Enough information to roll back after a failure.
- A complete program execution
- data jobs, cpu jobs, dependencies
DAGMan/DaPMan
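The questions on this slide map fairly directly onto the fields of a job record. The sketch below is one possible shape for such a record, not the DAGMan/DaPMan format; the class `DataJob`, its field names, and the string form of the undo actions are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    QUEUED = "queued"
    WORKING = "working"
    DONE = "done"
    FAILED = "failed"


@dataclass
class DataJob:
    job_id: int
    issued_by: str                      # past: who issued it
    reason: str                         # past: and why
    source: str
    target: str
    state: State = State.QUEUED         # present: what is it doing now?
    undo_log: List[str] = field(default_factory=list)   # enough to roll back
    depends_on: List[int] = field(default_factory=list)  # job dependencies

    def record_step(self, undo_action: str) -> None:
        """Note how to reverse a step at the time the step is taken."""
        self.state = State.WORKING
        self.undo_log.append(undo_action)

    def finish(self, ok: bool) -> List[str]:
        """Mark the job done or failed; on failure return undo actions, newest first."""
        self.state = State.DONE if ok else State.FAILED
        return [] if ok else list(reversed(self.undo_log))
```

Recording the undo action alongside each step is what lets a failed job be rolled back without guessing what it had already done.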
12Job Management
- Primary technique for reliably interacting with the job queue: transaction.
- ACID Test: Atomicity, Consistency, Isolation, Durability.
- Of course, the natural interface to a db, but not all participants are a full db.
- Interface
- 2PL and friends
- Implementation
- Logging, shadowing, a real db?
13Two-Phase Commit
[Figure: two-phase commit between a Client and a Server, with Stable Storage, Work Space, and Archival Space.]
J. Eliot Moss, Nested Transactions: An Approach to Reliable Distributed Computing, MIT Press, 1985.
14Two-Phase Commit
[Figure: two-phase commit between a Client and a Server, with Stable Storage, Work Space, and Archival Space.]
James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), 2001.
15Transactions and Status
- The transaction ID then becomes a persistent job number for later queries
- Success, failure, abort, timeout
- unknown-past, unknown-future.
- For this status to be useful, a record of the job must be kept around for a certain period of time.
- Also ok to time out, cancel, or otherwise remove data movement jobs.
- But, a committed transaction must be kept.
- Can't re-use a job number!
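A small sketch may make the status rules concrete. It assumes job numbers are issued from a counter that never goes backwards, so a number with no record is either already expired (unknown-past) or not yet issued (unknown-future); that reading of the two "unknown" states is an interpretation, and the class `JobTable` is hypothetical. Persistence is omitted: a real table would live on stable storage.

```python
from enum import Enum
from typing import Dict


class Status(Enum):
    WORKING = "working"
    SUCCESS = "success"
    FAILURE = "failure"
    ABORT = "abort"
    TIMEOUT = "timeout"
    UNKNOWN_PAST = "unknown-past"
    UNKNOWN_FUTURE = "unknown-future"


class JobTable:
    def __init__(self) -> None:
        self._next_id = 1                      # only ever increases: no re-use
        self._records: Dict[int, Status] = {}

    def begin(self) -> int:
        job_id = self._next_id
        self._next_id += 1
        self._records[job_id] = Status.WORKING
        return job_id

    def finish(self, job_id: int, status: Status) -> None:
        self._records[job_id] = status

    def expire(self, job_id: int) -> None:
        """Drop an old record; committed (successful) transactions are kept."""
        if self._records.get(job_id) is not Status.SUCCESS:
            self._records.pop(job_id, None)

    def query(self, job_id: int) -> Status:
        if job_id in self._records:
            return self._records[job_id]
        if job_id < self._next_id:
            return Status.UNKNOWN_PAST         # issued once, record removed
        return Status.UNKNOWN_FUTURE           # never issued
```

Because the counter never resets, a late query for an expired job cannot be answered with the status of some newer job that happened to get the same number.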
16Transaction Implementations
- Logging
- Keep a log of all actions, new and old values.
- Read forward to redo, backwards to undo.
- Shadowing
- Add changed data to unallocated space.
- Atomically commit new pointers to data.
[Figure: shadowing diagram; one block labeled M and seven blocks labeled D.]
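The logging bullets can be sketched in a few lines. This is a toy in-memory version under stated assumptions: the class `LoggedStore` is hypothetical, values are strings, and `None` stands for "key absent". A real log would be written to stable storage before the change is applied.

```python
from typing import Dict, List, Optional, Tuple

# One log entry: (key, old value, new value); None means the key was absent.
LogEntry = Tuple[str, Optional[str], Optional[str]]


class LoggedStore:
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.log: List[LogEntry] = []

    def set(self, key: str, value: Optional[str]) -> None:
        """Log the old and new values, then apply the change."""
        self.log.append((key, self.data.get(key), value))
        self._apply(key, value)

    def _apply(self, key: str, value: Optional[str]) -> None:
        if value is None:
            self.data.pop(key, None)
        else:
            self.data[key] = value

    def undo(self) -> None:
        """Read the log backwards, restoring old values."""
        for key, old, _new in reversed(self.log):
            self._apply(key, old)

    def redo(self) -> None:
        """Read the log forwards, re-applying new values."""
        for key, _old, new in self.log:
            self._apply(key, new)
```

Keeping both values in each entry is what allows the same log to serve undo and redo.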
17Transaction Implementations
- If a standard file system is the underlying storage, then shadowing is a natural fit.
- Most metadata updates are designed to be atomic and synchronous.
- Most large data updates are designed to provide good throughput, but are asynchronous and not guaranteed until after an explicit commit.
18Atomic File Update
fd = creat(file.tmp)
write(fd,data,length)
fsync(fd)
close(fd)
rename(file.tmp,file)
On Failure or abort: unlink(file.tmp)
On Success: Done.
On reboot: unlink(*.tmp)
(Technique used on Condor checkpoint servers and
scheduler processes.)
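The same sequence can be written as runnable Python. This is a sketch of the technique on the slide, not the Condor code; the function names `atomic_write` and `cleanup_after_reboot` are hypothetical, and `os.replace` is used as the rename step.

```python
import glob
import os
from typing import List


def atomic_write(path: str, data: bytes) -> None:
    """Write to path.tmp, force it to disk, then rename over the target."""
    tmp = path + ".tmp"
    try:
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # data is durable before the rename
        os.replace(tmp, path)      # atomic switch to the new contents
    except BaseException:
        # On failure or abort: remove the shadow copy, keep the old file.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def cleanup_after_reboot(directory: str) -> List[str]:
    """On reboot: any leftover *.tmp file is an uncommitted update."""
    removed = []
    for tmp in glob.glob(os.path.join(directory, "*.tmp")):
        os.unlink(tmp)
        removed.append(tmp)
    return sorted(removed)
```

A reader of `path` sees either the old contents or the new contents, never a partial write, because the only step that changes the visible file is the rename.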
19Unifying Storage Services
[Figure: an App uses the POSIX interface of a Virtual Operating System, which sits on a UNIX Driver, SRB Driver, GridFTP Driver, NeST Driver, Kangaroo Driver, and GASS Driver.]
An Alphabet Soup of Protocols, APIs, Systems,
Authorities, and Authors
20Error Representation: A Problem of Depth
[Figure: the many layers between an App and a Tape Archive. Labels: POSIX, Bypass Agent, ???, Disk Cache, PPDG API, Win32, Replica Access Library, FTP Server, Replica Catalog, RAP, FTP, two Replica Servers, RMP.]
21A Problem of Design Direction
[Figure: two App stacks contrasting Bottom Up Design with Outside In Design. Layer labels: ???, POSIX, Application Library, Virtual OS, ANSI, PPDG API, Standard Library, Replica Access, POSIX, SRB, OS Kernel, Replica Server.]
22The End-to-End Argument
- In complex software, the outermost layer has the ultimate responsibility for interpreting and recovering from errors.
- Recovery in a lower layer is an optimization of performance or convenience.
- If the possibility of error is very high, lower-level recovery is needed for good performance.
Saltzer, Reed, and Clark, End-to-End Arguments in System Design, ACM Transactions on Computer Systems 2(4), pp 277-288, 1984.
23UNIX Errnos
- A single namespace of integer errors that apply to all levels of the system.
- Any call is free to return any possible error. (124)
- General vs specific
- ENOENT vs ECHILD
- Some artifacts
- EACCES vs EPERM
- EADV and EDOTDOT
EPERM 1 /* Operation not permitted */
ENOENT 2 /* No such file or directory */
ESRCH 3 /* No such process */
EINTR 4 /* Interrupted system call */
EIO 5 /* I/O error */
ENXIO 6 /* No such device or address */
E2BIG 7 /* Arg list too long */
ENOEXEC 8 /* Exec format error */
EBADF 9 /* Bad file number */
ECHILD 10 /* No child processes */
EAGAIN 11 /* Try again */
ENOMEM 12 /* Out of memory */
EACCES 13 /* Permission denied */
..
24FTP Reply Codes
- Integer codes indicate the severity of a response to an action.
- Many transfer problems are identified, but few file system problems are.
- Third digit specified infrequently, and for wide classes of errors.
100 - Positive Preliminary
200 - Positive Completion
300 - Positive Intermediate
400 - Transient Negative
500 - Permanent Negative
000 - Syntax
010 - Information
020 - Connections
030 - Authentication
040 - Unspecified
050 - File System
550 - e.g. File not found, no access
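The digit structure in the table above can be decoded mechanically. The sketch below uses only the first-digit and second-digit meanings listed on this slide; the function names `decode_reply` and `should_retry` are hypothetical, and treating every 4xx reply as retryable is a simplifying assumption.

```python
from typing import Tuple

SEVERITY = {
    1: "Positive Preliminary",
    2: "Positive Completion",
    3: "Positive Intermediate",
    4: "Transient Negative",
    5: "Permanent Negative",
}

CATEGORY = {
    0: "Syntax",
    1: "Information",
    2: "Connections",
    3: "Authentication",
    4: "Unspecified",
    5: "File System",
}


def decode_reply(code: int) -> Tuple[str, str]:
    """Split a three-digit reply code into (severity, category)."""
    if not 100 <= code <= 599:
        raise ValueError(f"not an FTP reply code: {code}")
    first = code // 100
    second = (code // 10) % 10
    return SEVERITY[first], CATEGORY.get(second, "Unknown")


def should_retry(code: int) -> bool:
    """Transient negative replies are the ones worth retrying."""
    return code // 100 == 4
```

Note what the decoding cannot tell the caller: 550 says "permanent, file system" but not whether the file is missing or merely unreadable.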
25SRB Reply Codes
- Error space is an amalgam of all back end error spaces.
- Pros: No information is ever lost in translation.
- Cons: Very difficult to write code that switches on the error number (1026 cases).
UNIX_EPERM -1301
UNIX_ENOENT -1302
. . .
UNIX_EDEADLOCK -1356
HPSS_EPERM -1401
HPSS_ENOENT -1402
. . .
HPSS_NOCOS -1499
SQL_RSLT_TOO_LONG -1600
HTTP_ERR_BAD_PATH -1700
MCAT_OPEN_ERROR -3001
MCAT_CONNECT_ERROR -3002
. . .
MCAT_USER_NOT_IN_DOMN -3032
26Globus Error Objects
- Pros
- Errors may be identified at varying levels of granularity.
- Easily expandable.
- Lots of debug info.
- Cons
- Can be difficult to decide in which class to place an external error.
- In practice, most errors are returned as objects of type string.
[Figure: error class hierarchy rooted at Error, with classes Authentication, Authorization, Communication, and String, and more specific classes No Creds, Expired Creds, and No Trust.]
27Translation Can be Done to a Point
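[Figure: SRB error codes on one side, mapped onto a small set of UNIX errnos on the other, with OTHER as the catch-all.]
SRB codes:
UNIX_EPERM -1301
UNIX_ENOENT -1302
. . .
UNIX_EDEADLOCK -1356
HPSS_EPERM -1401
HPSS_ENOENT -1402
. . .
HPSS_NOCOS -1499
SQL_RSLT_TOO_LONG -1600
HTTP_ERR_BAD_PATH -1700
MCAT_OPEN_ERROR -3001
MCAT_CONNECT_ERROR -3002
. . .
MCAT_USER_NOT_IN_DOMN -3032
UNIX errnos: EPERM, ENOENT, ESRCH, EINTR, EIO, EACCES, EISDIR, OTHER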
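A table-driven translation with a catch-all is one way to sketch "translation can be done, to a point". Only the four pairings whose names make the target obvious are included; which errno the slide's other SRB codes map to is not recoverable from the text, so they fall through to OTHER here. The sentinel value `OTHER = -1` and the function names are assumptions.

```python
import errno

OTHER = -1  # hypothetical catch-all for codes with no errno equivalent

# Only pairings whose names make the target unambiguous.
SRB_TO_ERRNO = {
    -1301: errno.EPERM,   # UNIX_EPERM
    -1302: errno.ENOENT,  # UNIX_ENOENT
    -1401: errno.EPERM,   # HPSS_EPERM
    -1402: errno.ENOENT,  # HPSS_ENOENT
}


def translate(srb_code: int) -> int:
    """Map an SRB error code to an errno, or OTHER if nothing fits."""
    return SRB_TO_ERRNO.get(srb_code, OTHER)


def describe(srb_code: int) -> str:
    """Return the symbolic errno name for an SRB code, or 'OTHER'."""
    code = translate(srb_code)
    return "OTHER" if code == OTHER else errno.errorcode[code]
```

The catch-all is where information is lost: a caller that receives OTHER can no longer tell HPSS_NOCOS from MCAT_OPEN_ERROR.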
28Grope in the Dark
if GET succeeds
    return success
else
    if CHDIR succeeds
        return EISDIR
    else
        if LIST succeeds
            return EACCES
        else
            return ENOENT
        end
    end
end
[Figure: the probe sequence GET, CHDIR, LIST, ending in EACCES.]
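The probing sequence above can be written as a small function. This sketch takes the three FTP operations as callables that return True on success, so it makes no assumption about any particular FTP client library; the name `classify_get_failure` and the use of 0 for success are hypothetical.

```python
import errno
from typing import Callable

Probe = Callable[[str], bool]  # returns True if the FTP command succeeds


def classify_get_failure(get: Probe, chdir: Probe, list_: Probe, path: str) -> int:
    """Guess a UNIX errno for a failed GET by probing with CHDIR and LIST."""
    if get(path):
        return 0                 # success
    if chdir(path):
        return errno.EISDIR      # we could enter it, so it is a directory
    if list_(path):
        return errno.EACCES      # it exists, but we cannot read it
    return errno.ENOENT          # nothing by that name
```

Each failed GET costs up to two extra round trips, which is the price of recovering detail the reply code did not carry.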
29Error Identification is a Performance Concern
- We can always find some way to produce an execution that avoids a silent failure.
- Pass all errors up one level.
- Retry all errors until time expires.
- Abort process completely.
- But, a known, finite space of errors allows the caller to make targeted decisions about what to do next
- Not Authorized -- best to pass up one level.
- Operation Interrupted -- best to retry here.
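The two targeted decisions on this slide can be sketched as a wrapper around an operation that raises `OSError`. The function name `call_with_policy`, the time limit, and the choice of `EINTR` as the only retry-here error are assumptions for illustration.

```python
import errno
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRY_HERE = {errno.EINTR}  # "Operation Interrupted": retry at this level


def call_with_policy(op: Callable[[], T], limit_seconds: float = 5.0) -> T:
    """Retry interrupted operations until time expires; pass the rest up."""
    deadline = time.monotonic() + limit_seconds
    while True:
        try:
            return op()
        except OSError as e:
            if e.errno in RETRY_HERE and time.monotonic() < deadline:
                continue         # targeted decision: retry here
            raise                # e.g. not authorized: pass up one level
```

With an opaque error the only safe choices are the blunt ones from the first bullet list; it is the errno that lets this layer retry one case and immediately hand the other to its caller.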
30Give the Essence or Give the Details?
- Example in file systems
- Fell off the end of the directory linked list.
- or No file by that name.
- Example in networking
- Timer went off, but no network interrupt received.
- or Connection lost.
- Example in security
- Failure in PEM_do_header while reading password.
- or You have no credentials.
- Example in Storage
- HPSS_NOCOS
- or ?????
31Example and Discussion
32Example
- Goal
- User requests a replication of a file from B to A.
- Data Structures at each Node
- A persistent map from LFNs (logical file names) to PFNs (physical file names).
- A persistent store for transactions.
- A persistent store for data.
- Assumptions
- Files are read-only, no need for invalidation.
- All nodes must survive reboot cleanly.
- File transfers may be resumed from any point.
33Replica Catalog
[Figure: a Client consults the Replica Catalog, which maps L1, L2, L3 to site B; Replica Site B maps L1 to P1, L2 to P2, L3 to P3.]
34Replica Site A
[Figure: Client and Server at Replica Site A. The LFN-to-TRN map holds L2 -> T53. Transaction records shown across the animation steps:]
T53.tmp: LFN L2, PFN P16, State Working
T53.tmp: LFN L2, PFN P16, State Done
T53: LFN L2, PFN P16, State Working
T53: LFN L2, PFN P16, State Done
P16: Physical Data File
35More Issues
- Cleanup at Reboot
- Remove uncommitted transactions.
- Jobs in progress: Update LFN->TRN entry.
- Client Status Check
- Requesting client examines state of transaction.
- Or, other clients indirect through LFN entry.
- Notification of Status Change
- Unreliable -- Server sends messages to client.
- Reliable -- Server must do transaction to client.
- (See Condor-G Paper)
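The reboot cleanup can be sketched on top of plain files, one record per transaction. This is an interpretation of the example, not the authors' implementation: the JSON record layout, the directory of transaction files, and the function names `write_record` and `recover` are all assumptions. A name ending in `.tmp` is treated as uncommitted; a committed record still in state Working is a job in progress.

```python
import json
import os
from typing import Dict


def write_record(txn_dir: str, trn: str, lfn: str, pfn: str, state: str) -> None:
    """Atomically replace one transaction record (write shadow, then rename)."""
    final = os.path.join(txn_dir, trn)
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"lfn": lfn, "pfn": pfn, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)


def recover(txn_dir: str) -> Dict[str, str]:
    """Reboot cleanup: drop uncommitted records, rebuild LFN -> TRN for jobs in progress."""
    lfn_to_trn: Dict[str, str] = {}
    for name in sorted(os.listdir(txn_dir)):
        path = os.path.join(txn_dir, name)
        if name.endswith(".tmp"):
            os.unlink(path)                    # uncommitted transaction
            continue
        with open(path) as f:
            record = json.load(f)
        if record["state"] == "Working":
            lfn_to_trn[record["lfn"]] = name   # job in progress: re-point the LFN
    return lfn_to_trn
```

After `recover`, a client asking about L2 can be sent to T53, and the transfer can resume, since the example assumes transfers may be resumed from any point.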