Title: Dariusz Kowalski
1 Performing Tasks in Asynchronous Environments
- Dariusz Kowalski
- University of Connecticut Warsaw University
- joint work with
- Alex Shvartsman
- University of Connecticut MIT
2 Do-All problem (DHW et al.)
- The DA(p,t) problem abstracts the basic problem of cooperation in a distributed setting: p processors must perform t tasks, and at least one processor must know about it [Dwork, Halpern, Waarts 92/98]
- Tasks are
- known to every processor
- similar - each takes a similar number of local steps
- independent - may be performed in any order
- idempotent - may be performed concurrently
3 Do-All: synchronous model with crashes
- Model: processors are synchronous, may fail by crashes
- Solutions: problem well understood, results close to optimal
- Shared-memory model -- communication by read/write
- Kanellakis, P.C., Shvartsman, A.A. Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
- Message-passing model -- communication by exchanging messages
- Dwork, C., Halpern, J., Waarts, O. Performing work efficiently in the presence of faults. SIAM Journal on Computing, 27 (1998)
- De Prisco, R., Mayer, A., Yung, M. Time-optimal message-efficient work performance in the presence of faults. Proc. of 13th PODC (1994)
- Chlebus, B., De Prisco, R., Shvartsman, A.A. Performing tasks on synchronous restartable message-passing processors. Distributed Computing, 14 (2001)
4 Do-All: asynchronous models
- Models
- Shared-memory model -- communication by read/write -- widely studied, but solutions far from optimal
- Kanellakis, P.C., Shvartsman, A.A. Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
- Anderson, R.J., Woll, H. Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
- Kedem, Z., Palem, K., Raghunathan, A., Spirakis, P. Combining tentative and definite executions for very fast dependable parallel computing. Proc. of 23rd STOC (1991)
- Message-passing model -- communication by exchanging messages -- no interesting solutions until recently
5 Shared-Memory vs. Message-Passing
- Shared-Memory (atomic registers)
- processors communicate by read/write in shared memory
- atomicity - guarantees that a read outputs the last written value
- one read/write operation per local clock cycle
- information propagates and information is persistent
- Hence cooperation is always possible, although delayed
- Here processor scheduling is the major challenge
- Message-Passing
- processors communicate by exchanging messages
- duration of a local step may be unbounded
- message delays may be unbounded
- information may not propagate -- send/recv depend on delay
6 Message-delay-sensitive approach
- Even if message delays are bounded by d (d-adversary), cooperation may be difficult
- Observation
- If $d = \Omega(t)$ then work must be $\Omega(t \cdot p)$
- This means that cooperation is difficult, and addressing scheduling alone is not enough -- algorithm design and analysis must be d-sensitive
- Message-delay-sensitive approach
- C. Dwork, N. Lynch and L. Stockmeyer. Consensus in the presence of partial synchrony. J. of the ACM, 35 (1988)
7 Measures of efficiency
- Termination time: the first time when all tasks are done and at least one processor knows about it
- Used only to define work and message complexity
- Not interesting on its own: if all processors but one are delayed then trivially time is $\Omega(t)$
- Work: measures the sum, over all processors, of the number of local steps taken until termination time
- Message complexity (message-passing model): measures the number of all point-to-point messages sent until termination time
8 Structure of the presentation
- Part 1: Shared-memory model
- Model and bibliography
- Improving the AW algorithm in shared memory by better scheduling of processors (task load-balancing)
- Part 2: Message-passing model
- Model: asynchrony, message delay, and modeling issues
- Delay-sensitive lower bounds for Do-All
- Progress-tree Do-All algorithms
- Simulating shared-memory and Anderson-Woll (AW)
- Asynchronous message-passing progress-tree algorithm
- Permutation Do-All algorithms
9 Shared-Memory - model and goal
- We consider the following model
- p asynchronous processors with PID in {0,...,p-1}
- processors communicate by read/write in shared memory
- atomicity - read outputs the last written value
- one read/write operation per local clock cycle
- Write-All: write 1s into t locations of a given array
- Goal: improve scheduling of cooperating asynchronous processors, leading to better load-balancing with respect to tasks
10 Write-All: Selected Bibliography
- Introducing the Write-All problem
- Kanellakis, P.C., Shvartsman, A.A. Efficient parallel algorithms can be made robust. PODC (1989), Distributed Computing (1992)
- AW algorithm with work $O(t \cdot p^{\varepsilon})$
- Anderson, R.J., Woll, H. Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
- Randomized algorithm with work $\Theta(t + p \log p)$
- Martel, C., Subramonian, R. On the complexity of Certified Write-All algorithms. J. Algorithms 16 (1994)
- First work-optimal deterministic algorithm for $t = \Omega(p^4 \log p)$
- Malewicz, G. A work-optimal deterministic algorithm for the asynchronous Certified Write-All problem. PODC (2003)
11 Progress tree algorithms [BKRS, AW]
- Shared memory
- p processors, t tasks (p = t)
- q permutations of [q]
- q-ary progress tree of depth $\log_q p$
- nodes are binary completion bits
- Permutations establish the order in which the children are visited
- p processors traverse the tree and use the q-ary expansion of their PID to choose permutations [Anderson, Woll]
[Figure: q-ary progress tree; the children of each node are labeled 1 2 3 ... q]
12 Algorithm AWT [Anderson, Woll]
- Progress tree data structure is stored in shared memory
- p = t = 9, q = 3
- $\Psi$ = list of 3 schedules from $S_3$
- T = ternary tree of 9 leaves (progress tree), values 0-1
- PID(j) = j-th digit of the ternary representation of PID
[Figure: ternary progress tree T with nodes numbered 0-12; $\pi_0$ is used by PID 0,3,6, $\pi_1$ by PID 1,4,7, and $\pi_2$ by PID 2,5,8]
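The traversal described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the tree is stored in heap order (children of node v are 3v+1, 3v+2, 3v+3), the three schedules in `perms` are hypothetical, digits of the PID are taken least-significant first, and the processors run one after another rather than concurrently.

```python
def children(node, q):
    """Children of `node` in a complete q-ary tree stored in heap order."""
    return [q * node + 1 + c for c in range(q)]


def digit(pid, j, q):
    """j-th least-significant digit of the q-ary representation of pid."""
    return (pid // q ** j) % q


def traverse(tree, perms, pid, q, height, node=0, depth=0, performed=None):
    """One processor's post-order traversal; returns the leaves it performed."""
    if performed is None:
        performed = []
    if tree[node] == 1:               # subtree already finished, skip it
        return performed
    if depth == height:               # leaf: perform the attached task
        performed.append(node)
    else:
        order = perms[digit(pid, depth, q)]
        kids = children(node, q)
        for c in order:               # visit children in the chosen order
            traverse(tree, perms, pid, q, height, kids[c], depth + 1, performed)
    tree[node] = 1                    # completion bit, never reset to 0
    return performed


# Hypothetical instance matching the slide's sizes: p = t = 9, q = 3.
q, height = 3, 2
perms = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]   # assumed schedules from S_3
tree = [0] * 13                             # 1 root + 3 inner nodes + 9 leaves
first = traverse(tree, perms, pid=4, q=q, height=height)
second = traverse(tree, perms, pid=7, q=q, height=height)
```

In this sequential run the first processor performs all nine leaf tasks and the second finds the root bit already set, so it performs none. With real asynchrony the processors would interleave and the permutations would determine how much work is duplicated.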
13 Contention of permutations
- $S_n$ - group of all permutations on set [n], with composition $\circ$ and identity $e_n$
- $\pi$, $\sigma$ - permutations in $S_n$
- $\Psi$ - set of q permutations from $S_n$
- i is lrm (left-to-right maximum) in $\pi$ if $\pi(i) > \max_{j<i} \pi(j)$
- $\mathrm{LRM}(\pi)$ - number of lrm in $\pi$ [Knuth]
- $\mathrm{Cont}(\Psi,\sigma) = \sum_{\pi \in \Psi} \mathrm{LRM}(\sigma^{-1} \circ \pi)$
- Contention of $\Psi$: $\mathrm{Cont}(\Psi) = \max_{\sigma} \mathrm{Cont}(\Psi,\sigma)$ [AW]
- Theorem [AW]: For any n > 0 there exists a set $\Psi$ of n permutations from $S_n$ with $\mathrm{Cont}(\Psi) \le 3nH_n = \Theta(n \log n)$.
- [Knuth] Knuth, D.E. The art of computer programming Vol. 3 (third edition). Addison-Wesley Pub Co. (1998)
[Figure: example permutation on set [11]: 10 3 5 2 4 6 1 9 7 8 11]
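The definitions of lrm and contention can be checked on small inputs with a brute-force sketch. The function names are illustrative, permutations are 0-based tuples, and the maximum is taken by enumerating all of $S_n$, so this is only usable for very small n. The sequence `example` is the one that survives from the slide's figure; its order is assumed to be the order shown there.

```python
from itertools import permutations


def lrm_count(seq):
    """Number of left-to-right maxima: values larger than everything before them."""
    count, best = 0, None
    for v in seq:
        if best is None or v > best:
            count += 1
            best = v
    return count


def contention_wrt(schedules, sigma):
    """Cont(Psi, sigma): sum over pi in Psi of LRM(sigma^{-1} o pi)."""
    inverse = {value: index for index, value in enumerate(sigma)}
    return sum(lrm_count([inverse[x] for x in pi]) for pi in schedules)


def contention(schedules):
    """Cont(Psi): maximum of Cont(Psi, sigma) over all sigma (brute force)."""
    n = len(schedules[0])
    return max(contention_wrt(schedules, s) for s in permutations(range(n)))


example = [10, 3, 5, 2, 4, 6, 1, 9, 7, 8, 11]
identical = [(0, 1, 2)] * 3                    # three copies of one schedule
shifted = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]    # three cyclic shifts
```

Here `lrm_count(example)` is 2 (the maxima are 10 and 11). Comparing `contention(identical)` with `contention(shifted)` shows why the choice of schedules matters: identical schedules give the worst possible value $n^2$, while the cyclic shifts do better.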
14 Procedure Oblivious Do
- n - number of jobs and units
- $\Psi$ - list of n schedules from $S_n$
- Procedure Oblivious
- Forall processors PID = 0 to n-1
- for i = 1 to n do
- perform Job($\pi_{PID}(i)$)
- Execution of Job($\pi_{PID}(i)$) by processor PID is primary if job $\pi_{PID}(i)$ has not been previously performed
- Lemma [AW]: In algorithm Oblivious with n units, n jobs, and using the list $\Psi$ of n permutations from $S_n$, the number of primary job executions is at most $\mathrm{Cont}(\Psi)$.
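The lemma can be exercised with a small simulation. This is a sketch under assumptions that are not in the slides: each job has a separate start event and finish event, a random scheduler picks which processor takes the next event, and an execution counts as primary when the job has not been finished by anyone at the moment it is started. The schedules are hypothetical cyclic shifts, and contention is computed by brute force.

```python
import random
from itertools import permutations


def lrm_count(seq):
    """Number of left-to-right maxima of a sequence."""
    count, best = 0, None
    for v in seq:
        if best is None or v > best:
            count += 1
            best = v
    return count


def contention(schedules):
    """Cont(Psi) by brute force over all sigma in S_n (small n only)."""
    n = len(schedules[0])
    best = 0
    for sigma in permutations(range(n)):
        inverse = {value: index for index, value in enumerate(sigma)}
        total = sum(lrm_count([inverse[x] for x in pi]) for pi in schedules)
        best = max(best, total)
    return best


def run_oblivious(schedules, rng):
    """Simulate Oblivious under a random interleaving; return primary executions."""
    n = len(schedules)
    position = [0] * n        # index of the current job in each schedule
    running = [False] * n     # True while a processor is inside a job
    finished = set()          # jobs completed by at least one processor
    primary = 0
    active = list(range(n))
    while active:
        pid = rng.choice(active)
        job = schedules[pid][position[pid]]
        if not running[pid]:              # start event
            if job not in finished:
                primary += 1
            running[pid] = True
        else:                             # finish event
            finished.add(job)
            running[pid] = False
            position[pid] += 1
            if position[pid] == n:
                active.remove(pid)
    return primary


schedules = [(0, 1, 2, 3), (1, 2, 3, 0), (2, 3, 0, 1), (3, 0, 1, 2)]
bound = contention(schedules)
results = [run_oblivious(schedules, random.Random(seed)) for seed in range(200)]
```

Every value in `results` lies between n (each job is started as primary at least once) and `bound`, which is what the lemma predicts for this event model.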
15 AWT(q) - new progress tree traversal algorithm
- Instead of using q permutations on set [q], we use q permutations on set [n], where $n = q^2 \log q$
- p = 6, t = 16, q = 2, n = 4
- $\Psi$ = list of 2 schedules from $S_4$
- T = 4-ary tree of 16 leaves (progress tree), values 0-1
- PID(j) = j-th digit of the q-ary (here binary) representation of PID
- $\pi_0$: PID even
- $\pi_1$: PID odd
[Figure: 4-ary progress tree T with nodes numbered 0-20]
16 Main result
- Set $n = q^2 \log q$ and let $\Psi$ be a list of q schedules from $S_n$
- Define $\mathrm{Cont}(\Psi,\Sigma) = \max_{\sigma \in \Sigma} \mathrm{Cont}(\Psi,\sigma)$
- Lemma: For sufficiently large q and any set $\Sigma$ of at most $\exp(q^2 \log^2 q)$ permutations on set $[q^2 \log q]$, there is a list $\Psi$ of q schedules from $S_n$ such that $\mathrm{Cont}(\Psi,\Sigma) \le q^2 \log q + 6q \log q$
- Take q = log p and $\Psi$ from the above Lemma
- Theorem: For every $\varepsilon > 0$, sufficiently large p and $t = \Omega(p^{2+\varepsilon})$, algorithm AWT(q) performs work O(t).
17 Message-Passing - model and goals
- We consider the following model
- p asynchronous processors with PID in {0,...,p-1}
- processors communicate by message passing
- in one local step each processor can send a message to any subset of processors
- messages incur delays between send and receive
- processing of all received messages can be done during one local step
- Goal: understand the impact of message delay on efficiency of algorithmic solutions for Do-All
18 Lower bound - randomized algorithms
- Theorem: Any randomized algorithm solving DA with t tasks using p asynchronous message-passing processors performs expected work $\Omega(t + p \cdot d \cdot \log_{d+1} t)$ against any d-adversary.
- Proof (sketch)
- Adversary partitions computation into stages, each containing d time units, and constructs the delay pattern stage after stage
- delays all messages in a stage to be received at the end of the stage
- delays a linear number of processors (those which want to perform more than a (1-1/(3d)) fraction of undone tasks) during the stage
- selection is on-line, and with high probability has good properties
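To get a feeling for how the bound behaves as d grows, the expression inside the $\Omega(\cdot)$ can be evaluated numerically. This is only a sketch: the function name is illustrative, constants hidden by the asymptotic notation are ignored, and the sample values of p, t and d are arbitrary.

```python
import math


def lower_bound_expression(p, t, d):
    """Evaluate t + p * d * log_{d+1}(t), ignoring the constant hidden in Omega."""
    return t + p * d * math.log(t, d + 1)


p, t = 16, 1024
small_delay = lower_bound_expression(p, t, 1)    # d = 1, logarithm base 2
large_delay = lower_bound_expression(p, t, t)    # d = t, logarithm close to 1
```

With d = 1 the second term is p log t, so the expression stays close to t. With d = t the logarithm is close to 1 and the expression is close to t + p t, which agrees with the earlier observation that work must be $\Omega(t \cdot p)$ when $d = \Omega(t)$.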
19 Simulating shared-memory algorithms
- Write-All algorithm AWT
- Anderson, R.J., Woll, H. Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
- Quorum systems, atomic memory services
- Attiya, H., Bar-Noy, A., Dolev, D. Sharing memory robustly in message passing systems. J. of the ACM, 42 (1996)
- Lynch, N., Shvartsman, A. RAMBO: A Reconfigurable Atomic Memory Service. Proc. of 16th DISC (2002)
- Emulating asynchronous shared-memory algorithms
- Momenzadeh, M. Emulating shared-memory Do-All in asynchronous message passing systems. Masters Thesis, CSE, University of Conn (2003)
20 Atomic memory is not required
- We use q-ary progress trees as the main data structure that is written and read -- note that atomicity is not required
- If the following two writes occur (the entire tree is written), then a subsequent read may obtain a third value that was never written
- Property of monotone progress
- 1 at a tree node i indicates that all tasks attached to the leaves in the sub-tree rooted at i have been performed
- If 1 is written at a node i in the progress tree of a processor, it remains 1 forever
[Figure: two writes of a progress tree followed by a read that returns a value that was never written]
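A small sketch of why a non-atomic read is harmless here. The tree contents and the helper names are hypothetical; the only property used is the monotone one stated above, namely that a 1 at a node is only ever written after the corresponding tasks are done and is never reset.

```python
def merge(local, received):
    """Combine two progress trees node by node; a bit set in either stays set."""
    return [a | b for a, b in zip(local, received)]


def torn_read(write_a, write_b, from_b):
    """Model a non-atomic read: node i is taken from write_b if i is in from_b."""
    return [b if i in from_b else a
            for i, (a, b) in enumerate(zip(write_a, write_b))]


# Hypothetical four-node trees (root followed by three leaves).
write_a = [0, 1, 0, 0]    # writer A has finished the first leaf
write_b = [0, 0, 1, 0]    # writer B has finished the second leaf
mixed = torn_read(write_a, write_b, from_b={2})
combined = merge(write_a, mixed)
```

The value `mixed` was never written by anyone, yet every 1 in it still marks a task that some processor really performed, so a reader that merges it into its local tree never records false progress.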
21 Algorithm DAq - traverse progress tree
- Instead of using shared memory, processors broadcast their progress trees as soon as local progress is recorded
- p = t = 9, q = 3
- $\Psi$ = list of 3 schedules from $S_3$
- T = ternary tree of 9 leaves (progress tree), values 0-1
- PID(j) = j-th digit of the ternary representation of PID
[Figure: ternary progress tree T with nodes numbered 0-12; $\pi_0$ is used by PID 0,3,6, $\pi_1$ by PID 1,4,7, and $\pi_2$ by PID 2,5,8]
22 Algorithm DAq - case $p \ge t$
23 Procedure DOWORK
24 Algorithm DAq - analysis
- Modification of algorithm DAq for p < t
- We partition the t tasks into p jobs of size t/p and let the algorithm DAq work with these jobs.
- It takes a processor O(t/p) work (instead of constant) to process such a job (job unit).
- Since in each step a processor broadcasts at most one message to p-1 other processors, we obtain
- Theorem 4: For any constant $\varepsilon > 0$ there is a constant q such that the algorithm DAq has work $W(p,t,d) = O(t \cdot p^{\varepsilon} + p \cdot d \cdot \lceil t/d \rceil^{\varepsilon})$ and message complexity $O(p \cdot W(p,t,d))$ against any d-adversary (d = o(t)).
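The partition of t tasks into p jobs used for the case p < t might look as follows. This is a sketch with an illustrative function name; it assumes tasks are numbered 0..t-1 and handles the case where p does not divide t by letting job sizes differ by at most one.

```python
def partition_into_jobs(t, p):
    """Split task ids 0..t-1 into p contiguous jobs whose sizes differ by at most one."""
    base, extra = divmod(t, p)
    jobs, start = [], 0
    for j in range(p):
        size = base + (1 if j < extra else 0)   # first `extra` jobs get one more task
        jobs.append(list(range(start, start + size)))
        start += size
    return jobs


jobs = partition_into_jobs(10, 3)
```

Each job then plays the role of a single task for the algorithm, at a cost of O(t/p) local steps per job instead of a constant.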
25 Permutation algorithms - case $p \ge t$
- Algorithms proceed in a loop
- select the next task using the ORDERSELECT rule
- perform the selected task
- send messages, receive messages, and update state
- ORDERSELECT rules
- PARAN1: initially processor PID permutes tasks randomly; PID selects the first task remaining on its schedule
- PARAN2: no initial order; PID selects a task from the remaining set randomly
- PADET: initially processor PID chooses schedule $\pi_{PID}$ in $\Psi$; PID selects the first task remaining on schedule $\pi_{PID}$
- $\Psi$ - list of p schedules from $S_t$
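The three ORDERSELECT rules differ only in where the order comes from. A minimal sketch, with illustrative function names and with the set of tasks a processor still believes to be undone passed in as `remaining`:

```python
import random


def select_first_remaining(schedule, remaining):
    """PARAN1 and PADET rule: first task on the schedule that is still remaining."""
    for task in schedule:
        if task in remaining:
            return task
    return None


def select_random_remaining(remaining, rng):
    """PARAN2 rule: a uniformly random task among those still remaining."""
    return rng.choice(sorted(remaining)) if remaining else None


def random_schedule(tasks, rng):
    """PARAN1 initial step: a random permutation of the tasks."""
    schedule = list(tasks)
    rng.shuffle(schedule)
    return schedule


rng = random.Random(0)
paran1_schedule = random_schedule(range(5), rng)      # drawn once, up front
padet_schedule = [3, 1, 2, 0, 4]                      # assumed fixed schedule from Psi
next_padet = select_first_remaining(padet_schedule, remaining={0, 2})
next_paran2 = select_random_remaining({0, 2}, rng)
```

PARAN1 and PADET share the selection rule and differ only in whether the schedule is drawn at random or taken from the fixed list; PARAN2 draws fresh randomness at every selection.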
26 d-Contention of permutations
- We introduce the notion of d-Contention
- i is d-lrm in $\pi$ if $|\{ j < i : \pi(i) < \pi(j) \}| < d$
- Figure example: d = 2
- $\mathrm{LRM}_d(\pi)$ - number of d-lrm in $\pi$
- $\mathrm{Cont}_d(\Psi,\sigma) = \sum_{\pi \in \Psi} \mathrm{LRM}_d(\sigma^{-1} \circ \pi)$
- d-Contention of $\Psi$: $\mathrm{Cont}_d(\Psi) = \max_{\sigma} \mathrm{Cont}_d(\Psi,\sigma)$
- Theorem: For sufficiently large p and n, there is a list $\Psi$ of p permutations from $S_n$ such that, for every integer d > 1, $\mathrm{Cont}_d(\Psi) \le n \log n + 5pd \ln(e + n/d)$.
- Moreover, a random $\Psi$ is good with high probability.
[Figure: example permutation on set [11]: 10 3 5 2 4 6 1 9 7 8 11]
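The d-lrm definition can be checked directly on the permutation from the figure. A sketch with an illustrative function name; the sequence is taken as it survives from the slide and its order is assumed to be the order shown there.

```python
def d_lrm_count(seq, d):
    """Number of positions that have fewer than d larger values before them."""
    count = 0
    for i, v in enumerate(seq):
        larger_before = sum(1 for u in seq[:i] if u > v)
        if larger_before < d:
            count += 1
    return count


example = [10, 3, 5, 2, 4, 6, 1, 9, 7, 8, 11]
ordinary = d_lrm_count(example, 1)   # d = 1 is the ordinary lrm count
relaxed = d_lrm_count(example, 2)    # d = 2 as in the slide
```

For d = 1 only 10 and 11 qualify. For d = 2 the values 3, 5, 6 and 9 qualify as well, since each has exactly one larger value (10) before it, so the count rises from 2 to 6.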
27 d-Contention and work
- Lemma: For algorithms PADET and PARAN1, the respective worst-case work and expected work is at most $\mathrm{Cont}_d(\Psi)$ against any d-adversary.
- Example
- p = 2, t = 11, d = 2
Order of tasks to perform: 1,2,3,4,5,6,7,8,9,10,11
[Figure: two schedules, 1 3 2 5 7 4 9 8 6 11 10 and 2 4 6 8 10 11 9 7 5 3 1, with the tasks each processor performs, 1 3 2 5 7 9 11 10 and 2 4 6 8 10 11]
28 Permutation algorithms - results
- Theorem: Randomized algorithms PARAN1 and PARAN2 perform expected work $O(t \cdot \log p + p \cdot d \cdot \log(t/d))$ and have expected communication $O(t \cdot p \cdot \log p + p^2 \cdot d \cdot \log(t/d))$ against any d-adversary (d = o(t)).
- Corollary: There exists a deterministic list of schedules $\Psi$ such that algorithm PADET performs work $O(t \cdot \log p + p \cdot \min\{t,d\} \cdot \log(2 + t/d))$ and has communication $O(t \cdot p \cdot \log p + p^2 \cdot \min\{t,d\} \cdot \log(2 + t/d))$ when $p \ge t$.
29 Conclusions and open problems
- Work-optimal Write-All algorithm for $t = \Omega(p^{2+\varepsilon})$
- First message-delay-sensitive analysis of the Do-All problem for asynchronous processors in the message-passing model
- lower bounds for deterministic and randomized algorithms
- deterministic and randomized algorithms with subquadratic (in p and t) work for any message delay d as long as d = o(t)
- Among the interesting open questions are
- is there work-optimal scheduling for $t = \Omega(p \log p)$
- for algorithm PADET, how to construct the list $\Psi$ of permutations efficiently
- closing the gap between the upper and the lower bounds
- investigate algorithms that simultaneously control work and message complexity