Buffersizing for Precedence Graphs on Restricted Multiprocessor Architectures presentation

About This Presentation

Title:

Buffersizing for Precedence Graphs on Restricted Multiprocessor Architectures

Description:

Edward A. Lee, Thomas M. Parks - Proceedings of the IEEE, 1995. 11/12/09. 13. Preliminaries ... increasing communication buffer size to hold all the Wa tokens; ... –

Slides: 69

Provided by: eecsBe

Transcript and Presenter's Notes

1
Buffer-sizing for Precedence Graphs on Restricted
Multiprocessor Architectures

Thomas Feng
Yang Yang
Mentors: Qi Zhu, Abhijit Davare

2
Outline

Motivation
Previous Work
Preliminaries
Problem Statement
Investigative Approach
Summary and Conclusion
Compare with Prior Work
Our Contribution
Future Work

4
Parallel Heterogeneous Platforms (PHPs)

Advantages
High computational capability
Challenges
Explore the theoretically high performance

(From Abhijit Davare's Quals Presentation)
5
Project Goals

Dataflow Programming Model
Infinite Buffers
Blocking read, non-blocking write
Many scheduling and allocation techniques

Multiprocessor Platform
Limited connectivity between processors
Limited, finite depth FIFOs
Low overhead reads and writes to FIFOs

6
Deploying applications on PHPs

Computation Synthesis
Task Allocation
Task Scheduling
Communication Synthesis
Interconnection Synthesis
Buffer sizing (The part we are working on)

7
Buffer Sizing

Architectures have bounded buffer resources.
If more communication buffer resources are
utilized, processors may spend less time waiting
to send/receive data.
Additional buffer resources may adversely affect
communication overhead, achievable clock speed,
or design closure.

8
Example Flow
[Figure: example flow. The function model and the architecture model feed allocation and scheduling, followed by buffer sizing.]
10
Previous Work

Transformations from various statically schedulable
dataflow variants into precedence DAGs [1]
Survey of allocation and scheduling algorithms
for precedence DAGs assuming infinite-length
buffers [2]
Minimizing buffer requirements for uniprocessor
architectures [3]
Minimizing multiprocessor buffer sizes for SDF
applications under conservative
(non-interleaving) conditions [4]

[1] S. Bhattacharyya, R. Leupers, P. Marwedel. Software Synthesis and Code Generation for Signal Processing Systems. IEEE Transactions on Circuits and Systems, 2000.
[2] Y.-K. Kwok, I. Ahmad. Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Computing Surveys, 1999.
[3] M. Geilen, T. Basten, S. Stuijk. Minimizing Buffer Requirements of Synchronous Dataflow Graphs with Model Checking. DAC 2005.
[4] M. Adé, R. Lauwereins, J. A. Peperstraete. Data Memory Minimization for Synchronous Data Flow Graphs Emulated on DSP-FPGA Targets. DAC 1997.
12
Preliminaries

Precedence DAG
A precedence DAG is a common representation for
the deployment of an application across multiple
processors.
A precedence DAG can be generated from statically
schedulable dataflow descriptions, such as
synchronous dataflow or cyclo-static dataflow.
This is suitable for most applications in the
multimedia domain. [4][5]
Our synthesis process starts from a precedence DAG.

[4] J. L. Pino, E. A. Lee, S. S. Bhattacharyya. A Hierarchical Multiprocessor Scheduling System for DSP Applications. 29th Asilomar Conference on Signals, Systems and Computers, 1995.
[5] E. A. Lee, T. M. Parks. Dataflow Process Networks. Proceedings of the IEEE, 1995.
13
Preliminaries

Synthesis process
Relationships among allocation, scheduling and
buffer sizing:
Allocation: assign each node in the precedence DAG to
a particular processor in the architecture.
Scheduling: specify an execution sequence for the
set of tasks on each processor.
Buffer sizing: assign sizes to inter-processor
communication channels.
In our approach, allocation and scheduling are
done assuming unbounded communication buffer
sizes. Buffer sizing is then based on the
result of allocation and scheduling.

14
Preliminaries

Artificial deadlock is deadlock that results when
the size of the buffers between processors is reduced
from infinity to some finite number. [6]
In buffer sizing, we want to minimize the
objective function while avoiding artificial deadlock.
(In the following slides, "deadlock" means
artificial deadlock.)

[6] M. Geilen, T. Basten. Requirements on the Execution of Kahn Process Networks. Programming Languages and Systems, 12th European Symposium on Programming (ESOP), 2003.
16
Problem Statement
Allocation and scheduling with unbounded
communication buffer size
-> Bounded communication buffer size
-> Artificial deadlock
Resolve by using internal buffers:
P1: Can we always find a legal schedule with
one-place communication buffers by using internal
buffers, assuming we have a legal schedule for
unbounded communication buffers?
P2: Does using internal buffers increase the makespan
or not?
Resolve by using communication buffers:
P3: Minimize the total (or largest) communication
buffer size
Similar problems:
Minimize the total (or largest) internal buffer size
P4: Give an optimal internal buffer assignment
17
Assumptions

Interleaving communication is assumed:
for inter-processor communication, when the write
and read tasks are both active, they can
communicate any amount of data through a one-place
buffer.

19
Classification of Blocked nodes

In a precedence DAG, we classify the nodes that can
be blocked during execution into 3 kinds:
read blocked node: the node is blocked
because it cannot read enough tokens.
write blocked node: the node is blocked
because it cannot finish writing all the
tokens it produced.
scheduling blocked node: the node cannot be
fired because its predecessor on the same
processor has not finished execution.

20
Our Observations of Deadlock

We proved that deadlock is impossible when the only
blocked nodes are scheduling blocked nodes and read
blocked nodes.
We proved that if a precedence DAG has deadlock,
then it must contain at least one pattern, called a
write blocked cycle, in which
- all the schedule edges point in the same
direction
- there are one or more write blocked nodes,
whose in-degree within the cycle is 0
- there may be read blocked nodes, whose
in-degree within the cycle is one or more
- reversing the directed data edges leaving all
the write blocked nodes turns it into a directed
cycle.
Note: not every write blocked edge must be in a
write blocked cycle.

21
Write Blocked Cycle
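[Figure: a write blocked cycle between processors P_i and P_j, drawn with data edges and schedule edges over the nodes n_i, n_{i1}, ..., n_{im} and n_j, n_{j1}, ..., n_{jn}, with m > 1 and n > 1.]
Each dotted segment is a series of several data/schedule edges, in which
the schedule edges are in the same direction as
the schedule edge from n_i to n_{i1}, while the
data edges could be in either direction.
Buffer space from P_i to P_j < token count on the
write edge from n_i to n_{jn}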
22
Examples of Write Blocked Cycle
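[Figure: Fig. 1 and Fig. 2 show examples of write blocked cycles on processors P1, P2 and P3, with nodes a to f, data edges labeled 2, and blocked writes marked "!".]
Buffer(P_i, P_j) = 1 for i, j = 1, 2, 3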
23
How to Avoid Deadlock

There is no artificial deadlock in a precedence
DAG if and only if there is no write blocked
cycle in the graph.
Proof by contradiction. (Omitted here.)
We can resolve the write blocked cycles by using
enough communication buffer or internal buffer space.

24
How to Avoid Deadlock
[Figure: nodes a and b on P_i, nodes c and d on P_j; a writes W_a tokens to d through the buffer B_ij.]

If $W_a > \mathrm{Space}(B_{ij})$, a is write blocked. The deadlock is
resolved by
- increasing the communication buffer size to hold
all the W_a tokens,
- or increasing the internal buffer size in P_i to hold
all the W_a tokens, and then writing them to d after
b and c,
- or reading all the tokens before c, and
increasing the internal buffer size in P_j to hold all
the W_a tokens.

25
How to Avoid Deadlock
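[Figure: two examples. Left: write edges with token counts W_a and W_e cross the buffers B_ij (P_i to P_j) and B_mn (P_m to P_n). Right: write edges with token counts W_a and W_b share the buffer B_ij between P_i and P_j.]
Left: $W_a > \mathrm{Space}(B_{ij})$ and $W_e > \mathrm{Space}(B_{mn})$, so a and e are
write blocked. The deadlock is resolved by increasing
the size of B_ij to hold all the W_a tokens, or increasing
the size of B_mn to hold all the W_e tokens, or
using internal buffers.
Right: $W_a + W_b > \mathrm{Space}(B_{ij})$, so a and b are write
blocked. The deadlock is resolved by increasing the buffer
size to hold all the W_a + W_b tokens, or by
increasing the internal buffer in P_i or P_j.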
26
Problem Statement
P1: Can we always find a legal schedule with
one-place communication buffers by using internal
buffers, assuming we have a legal schedule for
unbounded communication buffers?
27
Using Internal Buffers

We can always find a legal schedule S1 with one-place
communication buffers by using internal
buffers, assuming we have a legal schedule SU
that requires unbounded communication buffers.
Proof:
Since the schedule SU (a partial order) is
legal, we can always find a sequential order SS
to execute the nodes that is also legal. We can
obtain SS by arbitrarily assigning a total order
that conforms to the partial order.

Let the communication buffer between every two
processors be one place. If there is deadlock in
SS, the first blocked node must be a
write blocked node, because in a sequential order
every predecessor has already executed and reads
never wait.

Simulate this execution. Let x be the first
blocked node, and let it write tokens to nodes y1, ...,
yk. The block can be eliminated by letting x
write all the produced tokens to internal
buffers: for every node yi on processor pi, write
the tokens from x to the internal buffer of pi.
The corresponding read code is inserted right
after the code already executed on pi. Therefore, the
deadlock at x is resolved, and the following execution
is not affected by this change. Repeat
this process until all the write blocked nodes are
eliminated. Consequently, a legal schedule is
found by using internal buffers.
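
The first step of the proof, turning the partial order into a sequential order SS, can be sketched in Python. This is only an illustration under assumptions: the four-node graph and the helper names `sequential_order` and `conforms` are hypothetical and not taken from the slides, and the standard-library `graphlib` module (Python 3.9 or later) stands in for "arbitrarily assigning a total order".

```python
from graphlib import TopologicalSorter

# Hypothetical precedence DAG: node -> set of predecessors
# (data edges and schedule edges are treated alike here).
predecessors = {
    "a": set(),
    "c": set(),
    "b": {"a", "c"},
    "d": {"a", "c"},
}

def sequential_order(preds):
    """Return one total order that conforms to the partial order."""
    return list(TopologicalSorter(preds).static_order())

def conforms(order, preds):
    """Check that every node appears after all of its predecessors."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[p] < position[n]
               for n, ps in preds.items() for p in ps)

order = sequential_order(predecessors)
print(order, conforms(order, predecessors))
```

Any order accepted by `conforms` is a valid choice for SS; the proof does not depend on which one is picked.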
29
Problem Statement
P2: Does using internal buffers increase the makespan
or not?
30
MakeSpan

The makespan is the maximum completion time over a
set of processors.
Assumptions:
1. Interprocessor communication takes place
through bounded-depth FIFOs with blocking reads
and writes
2. Unlimited internal buffer space is available
on each processor
Conjecture:
For a task precedence graph, if insufficient
FIFO depth leads to deadlock, reading and writing
can be reordered in such a way that deadlock is
eliminated and the makespan is not affected.
Counterexample:
The example is scheduled in such a way that
multiple paths are relatively critical.
Reordering the reads and writes to eliminate the
deadlock increases the length of some of the
relatively critical paths, extending the
makespan, even if tx/rx time << computation time.

31
[Figure: counterexample schedule with tasks a to i on processors P1 to P5, annotated with task times.]
Communication model: Tx/Rx time = 5 units, latency = 0 units
32

In the example, edges a->d and c->g may be
blocked due to insufficient FIFO depth. Without
increasing the FIFO depth, there are 4 ways to
resolve this:
1. Move the a->d communication after b
2. Move the a->d communication before c
3. Move the c->g communication after d
4. Move the c->g communication before f
Options 1 and 3 delay d and g by a large amount,
and increase the makespan significantly.
Options 2 and 4 extend the critical paths that
end at h and i.

33
Problem Statement
P3: Minimize the total (or largest) communication
buffer size
34
NP-hard Problem

Formally, the problem DEADLOCK-FREE-MIN-BUFFER
(DFMB) is defined as follows: given a precedence
DAG D, find the minimal total buffer size such
that there is no deadlock in D.
The problem DEADLOCK-FREE-MIN-BUFFER is NP-hard.
Proof:
We prove it by reducing the FEEDBACK ARC SET (FAS)
problem, which is known to be NP-complete, step by
step to the DFMB problem.

The FEEDBACK ARC SET (FAS) problem is the
following: given a directed graph G(V, E) and a
positive integer K, does there exist a subset
$B \subseteq E$ with $|B| \le K$, such that B contains at least one
edge from every directed cycle in G?
This problem is known to be NP-complete [7].

[7] M. R. Garey, D. S. Johnson. Computers and Intractability. W. H. Freeman and Co., NY, 1979.
36

First, we prove that the FAS problem can be reduced to
Problem B below.
Problem B: given a directed graph G(V, E) with
weight w(e) on every edge $e \in E$, find the
minimal $\sum_{e \in B} w(e)$ such that $B \subseteq E$
and B contains at least one edge from every
directed cycle in G.
Then we prove that Problem B can be reduced to the DFMB
problem, by proving that an arbitrary instance X
of Problem B can be transformed into an instance X'
of the DFMB problem in polynomial time, and that the
result for X is obtained by solving X'.
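
To make Problem B concrete, here is a brute-force sketch in Python for very small graphs. It is not the reduction from the slides and not a practical algorithm (it enumerates every edge subset); the function names and the example graph are hypothetical. It relies on the fact that B hits every directed cycle exactly when removing B leaves an acyclic graph.

```python
from itertools import combinations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no directed cycle."""
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        n = ready.pop()
        visited += 1
        for src, dst in edges:
            if src == n:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return visited == len(nodes)

def min_weight_feedback_arc_set(nodes, weights):
    """Try every subset B of edges; keep the cheapest one that breaks all cycles."""
    edges = list(weights)
    best_cost, best_set = None, None
    for size in range(len(edges) + 1):
        for subset in combinations(edges, size):
            remaining = [e for e in edges if e not in subset]
            if is_acyclic(nodes, remaining):
                cost = sum(weights[e] for e in subset)
                if best_cost is None or cost < best_cost:
                    best_cost, best_set = cost, set(subset)
    return best_cost, best_set

# Hypothetical instance with two directed cycles: u->v->w->u and u->v->u.
nodes = ["u", "v", "w"]
weights = {("u", "v"): 3, ("v", "w"): 1, ("w", "u"): 2, ("v", "u"): 4}
cost, chosen = min_weight_feedback_arc_set(nodes, weights)
print(cost, chosen)
```

In this instance the edge (u, v) lies on both cycles, so removing it alone is cheaper than removing one edge per cycle.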

37
Correspondence between instance X (Problem B) and instance X' (DFMB):
A vertex in G(V, E) -> the corresponding nodes and schedule edge in D(N, E')
An edge in G(V, E) -> the corresponding data edge in D(N, E')
38
A directed cycle in G(V, E) -> a write blocked cycle in D(N, E'), where E' = DE ∪ SE
39

Solution to instance X:
$\min \sum_{e \in B} w(e)$, where $B \subseteq E$ and B
contains at least one edge from every directed
cycle in G
Solution to instance X':
the minimal total weight of a set M of data edges, where M
contains at least one edge from every write
blocked cycle in D


40
Algorithms
41
Minimizing Maximum FIFO Size (1)

Mathematical Model
$V = \{v_1, v_2, \ldots, v_m\}$. The set of vertices.
$P = \{p_1, p_2, \ldots, p_l\}$. The set of processors.
$M: V \to P$. Mapping from vertices to the processors
they are scheduled on.
$E = \{e_1, e_2, \ldots, e_n\}$. The set of edges.
$S = \{e \mid e \in E \land M(\mathrm{src}(e)) = M(\mathrm{des}(e))\}$. Set of
schedule edges.
$D = \{e \mid e \in E \land M(\mathrm{src}(e)) \ne M(\mathrm{des}(e))\}$. Set of
data edges.
$W: D \to \mathbb{R}$. The weight function.
$F: P \times P \to \mathbb{R}$. The function that returns the FIFO
size. $F(p_i, p_j)$ need not be equal to $F(p_j, p_i)$.
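
As a minimal sketch of this model, the sets of schedule edges and data edges can be derived from E and M in a few lines of Python. The vertex names, the processor names and the helper `split_edges` are assumptions made for illustration.

```python
def split_edges(edges, mapping):
    """Split E into schedule edges (same processor) and data edges (different processors)."""
    schedule = [(s, d) for s, d in edges if mapping[s] == mapping[d]]
    data = [(s, d) for s, d in edges if mapping[s] != mapping[d]]
    return schedule, data

# Hypothetical schedule: a and b run on p0, c and d run on p1.
M = {"a": "p0", "b": "p0", "c": "p1", "d": "p1"}
E = [("a", "b"), ("c", "d"), ("a", "d"), ("c", "b")]

S, D = split_edges(E, M)
print("schedule edges:", S)
print("data edges:", D)
```

Here the two edges that stay on one processor become schedule edges, and the two edges that cross between p0 and p1 become data edges.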

42
Minimizing Maximum FIFO Size (2)

Formalizing the problems
Find an algorithm that, given a schedule $\langle V, P, M, E, W \rangle$,
finds a valid F function such that
$\max F(p_i, p_j)$ is minimized (the min max
problem).
With interleaving communication.
Without interleaving communication.
Find an algorithm that, given a schedule $\langle V, P, M, E, W \rangle$,
finds a valid F function such that
$\sum F(p_i, p_j)$ is minimized (the min total
problem).

43
Min Max Problem (1)

Free vertices: the vertices with no incoming
edges. (a and c in the figure)
Free edges: the edges starting from free
vertices. (ab, ad, cd, ce in the figure)
Our algorithm always deals with free edges. When
one free edge is resolved by the algorithm,
some other edges may become free.
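
The two definitions translate directly into code. The sketch below is an illustration only: the edges ab, ad, cd and ce come from the slide, while the extra edge (d, f) and the function names are hypothetical, added to show an edge that becomes free later.

```python
def free_vertices(vertices, edges):
    """Vertices with no incoming edge."""
    targets = {dst for _, dst in edges}
    return {v for v in vertices if v not in targets}

def free_edges(vertices, edges):
    """Edges whose source is a free vertex."""
    free = free_vertices(vertices, edges)
    return [(s, d) for s, d in edges if s in free]

vertices = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("a", "d"), ("c", "d"), ("c", "e"), ("d", "f")]

print(free_vertices(vertices, edges))
print(free_edges(vertices, edges))

# After the free edges into b and d are resolved and deleted,
# d has no incoming edge left, so (d, f) becomes free.
remaining = [("c", "e"), ("d", "f")]
print(free_edges(vertices, remaining))
```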

44
Dependency Graph

Dependency graph: a graph constructed from the
precedence DAG by making all the data edges
bidirectional.
A data edge implies a bidirectional dependency
between the two vertices. A schedule edge
remains unidirectional.

45
Min Max Problem (2)

4 types of free edges (priority 1 > 2 > 3 > 4)
Type 1: free schedule edge whose source has no
other outgoing edges.
Type 2: free data edge between two free vertices
(ignoring the incoming data edges to the second
vertex).
Type 3: free data edge that is not of type 2 and is not in a
dependency cycle.
Type 4: free data edge that is not of type 2 and is in a
dependency cycle.

46
Min Max Problem (3)

Type 1: just delete it, because a can finish
immediately. b becomes free.
Type 2: just delete it, because a and c can run
simultaneously with interleaving communication.

47
Min Max Problem (4)

Type 3: just delete it, because d will be ready
later, and a just needs to wait.
Type 4: resolve blocking before deleting the edge.
Use the remaining FIFO space first; increase the
FIFO size only if no space is left.

48
Choice of Free Edges (Min Max)

If edges of type 1, 2 or 3 exist, remove them
first.
If only edges of type 4 are left, choose one of them
to resolve in a greedy manner:
among the edges of type 4, always pick the edge e such
that F(M(src(e)), M(des(e))) is minimal after e
is resolved.

49
Min Max Proof of Optimality

Induction. G is the complete precedence DAG. At
step i, $G_i$ is the sub-graph we have solved, and
$G - G_i$ is the sub-graph with only the remaining
edges. $F_i$ is the F function at step i.
Base case: $G_0$ is empty, so $G_0$ is optimal.
Induction step: assume $G_k$ is optimal
($\max\{F_k(M(\mathrm{src}(e)), M(\mathrm{des}(e))) \mid e \in G_k\}$ is
minimal). Prove $G_{k+1}$ is also optimal.
$G_{k+1}$ is obtained by either removing an edge of
type 1, 2 or 3 (in which case $G_{k+1}$ is
obviously optimal), or updating the FIFO for an edge
of type 4. In the latter case, we always pick an
edge $e_{k+1}$ such that $F_{k+1}(M(\mathrm{src}(e_{k+1})), M(\mathrm{des}(e_{k+1})))$
is minimum among such edges. Then
$$\max\{F_{k+1}(M(\mathrm{src}(e)), M(\mathrm{des}(e))) \mid e \in G_{k+1}\} = \max\bigl(\max\{F_k(M(\mathrm{src}(e)), M(\mathrm{des}(e))) \mid e \in G_k\},\ F_{k+1}(M(\mathrm{src}(e_{k+1})), M(\mathrm{des}(e_{k+1})))\bigr)$$
is also minimal. So $G_{k+1}$ is optimal.
This proof does not work for min total.

50
Linear cycle detection algorithm Ac

To decide whether the data edge from a to d is in a
cycle:
1. Without considering edge (a, d) in the dependency
graph, can we reach d by traversing the graph from
a?
2. Without considering edge (d, a) in the dependency
graph, can we reach a by traversing the graph from
d?
If either answer is yes, return true;
otherwise, return false.
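
The check can be sketched in Python as two depth-first searches over the dependency graph. This is an illustrative reading of Ac, not the authors' implementation; the function names and the two small test graphs are assumptions.

```python
from collections import defaultdict

def dependency_graph(schedule_edges, data_edges):
    """Schedule edges stay one-way; each data edge becomes two opposite arcs."""
    arcs = set(schedule_edges)
    for s, d in data_edges:
        arcs.add((s, d))
        arcs.add((d, s))
    return arcs

def reachable(arcs, start, goal, skip):
    """Depth-first search from start to goal that ignores the single arc `skip`."""
    succ = defaultdict(list)
    for arc in arcs:
        if arc != skip:
            succ[arc[0]].append(arc[1])
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in succ[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def in_dependency_cycle(arcs, a, d):
    """Ac: true if d is reachable from a without (a, d), or a from d without (d, a)."""
    return (reachable(arcs, a, d, skip=(a, d))
            or reachable(arcs, d, a, skip=(d, a)))

# Hypothetical example: a, b on one processor, c, d on another,
# with data edges a->d and c->b.
cyclic = dependency_graph([("a", "b"), ("c", "d")], [("a", "d"), ("c", "b")])
print(in_dependency_cycle(cyclic, "a", "d"))

# Only one data edge and one schedule edge: no dependency cycle.
simple = dependency_graph([("a", "b")], [("a", "d")])
print(in_dependency_cycle(simple, "a", "d"))
```

Each search visits every arc at most once, which matches the "linear" label on the slide.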

51
Quadratic Min Max Algorithm Am

for all i ∈ P, j ∈ P: space[i][j] = 0, fifo[i][j] = 0
while E is not empty do
    type = 0; sel_src = None; sel_des = None
    min_fifo = -1.0
    for each edge e = (src, des) do
        if src is free then
            if e is of type 1 then
                type = 3; sel_src = src; sel_des = des
            else if type < 2 and e is of type 2 then
                type = 2; sel_src = src; sel_des = des
            else if type <= 1 then  // e is of type 3 or 4
                fifo_size = calculate_fifo(src, des)
                if min_fifo < 0 or min_fifo > fifo_size then
                    type = 1; sel_src = src
                    sel_des = des; min_fifo = fifo_size
    switch (type)
        case 3: remove(sel_src, sel_des); break
        case 2: remove(sel_src, sel_des)
                remove(sel_des, sel_src); break
        case 1: if (sel_src, sel_des) is in a
                cycle according to Ac then
                    ... /* resolve blocking and
                           update fifo and space */
                remove(sel_src, sel_des)
                remove(sel_des, sel_src); break

52
Exponential Min Total Algorithm At

At is very similar to Am,
except that if only free edges of type 4 are left, At
picks them one by one in an arbitrary order, and
each time it recursively computes the FIFO sizes
based on that choice.
After finishing one FIFO assignment, it backtracks
and picks another such edge to try again.
This process ends when all possible sequences of
choices are exhausted. The FIFO assignment with the minimum
total size is returned.
Some intermediate results can be saved and reused.
Because the exact min total problem is NP-hard,
no polynomial-time exact algorithm is known, and
At takes exponential time in the worst case.

53
Lower bound of FIFO size for Non-interleaving
Communication Ln

For platforms that do not allow interleaving
communication, we compute a conservative lower
bound on the FIFO size.
Non-interleaving communication is a common assumption in the literature.
Let
$\forall p_1 \in P, p_2 \in P.\ (p_1 = p_2) \Rightarrow L_n(p_1, p_2) = 0$ and
$(p_1 \ne p_2) \Rightarrow L_n(p_1, p_2) = \max\{W(e) \mid e \in E \text{ and } M(\mathrm{src}(e)) = p_1 \text{ and } M(\mathrm{des}(e)) = p_2\}$
Under this assumption, the following must hold for any
valid F function:
$\forall p_1 \in P, p_2 \in P.\ L_n(p_1, p_2) \le F(p_1, p_2)$
$L_n(p_1, p_2) = F(p_1, p_2)$ may not be achievable in
most non-trivial cases.
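
A minimal Python sketch of the Ln computation follows. The vertex-to-processor mapping, the edge weights and the name `lower_bound` are hypothetical, and the sketch assumes that a processor pair with no data edge between it gets a bound of 0.

```python
def lower_bound(processors, data_edges, mapping, weight):
    """Ln(p1, p2): the largest token count on any data edge from p1 to p2."""
    ln = {(p1, p2): 0 for p1 in processors for p2 in processors}
    for edge in data_edges:
        src, dst = edge
        pair = (mapping[src], mapping[dst])
        ln[pair] = max(ln[pair], weight[edge])
    return ln

# Hypothetical instance with three processors and three data edges.
processors = ["p0", "p1", "p2"]
mapping = {"v0": "p0", "v1": "p0", "v3": "p1", "v4": "p1", "v5": "p1", "v7": "p2"}
weight = {("v0", "v4"): 2, ("v1", "v5"): 1, ("v3", "v7"): 3}

ln = lower_bound(processors, list(weight), mapping, weight)
print(ln[("p0", "p1")], ln[("p1", "p2")], ln[("p2", "p0")])
```

Without interleaving, a FIFO must hold the largest single transfer in full, which is why the maximum edge weight per processor pair is a lower bound on any valid F.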

54
Example Am, At and Ln

Am
resolve (v0, v4)
resolve (v3, v7)
F(p0, p1) = 2, F(p1, p2) = 3
max = 3, total = 5
At
resolve (v3, v7)
F(p1, p2) = 3
max = 3, total = 3
Ln
Ln(p0, p1) = 2, Ln(p0, p2) = Ln(p1, p2) = 3
max = 3, total = 8
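
The max and total figures on this slide can be rechecked with a short Python sketch. The FIFO sizes are the ones listed above; the dictionary layout and the helper names `max_fifo` and `total_fifo` are assumptions, and unlisted processor pairs are taken to have size 0.

```python
def max_fifo(f):
    """Min max objective: the largest FIFO size in the assignment."""
    return max(f.values(), default=0)

def total_fifo(f):
    """Min total objective: the sum of all FIFO sizes in the assignment."""
    return sum(f.values())

am = {("p0", "p1"): 2, ("p1", "p2"): 3}
at = {("p1", "p2"): 3}
ln = {("p0", "p1"): 2, ("p0", "p2"): 3, ("p1", "p2"): 3}

for name, f in (("Am", am), ("At", at), ("Ln", ln)):
    print(name, "max =", max_fifo(f), "total =", total_fifo(f))
```

All three assignments share the same maximum, so only the total distinguishes them in this example.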

55
Implementation

The code is written in C++ and compiled with GCC
and VC.
It uses the BGL (Boost Graph Library).
Adjacency list data structure for graphs.
Input:
A text description of the tasks to be scheduled
and the number of processors, either hand-written or
randomly generated by TGFF (Task Graphs For Free).
Output:
Screen output of the FIFO sizes between every pair of
processors, and the time spent in each algorithm.

56
Benchmark of Bigger Tests
58
Compare with Prior Work

We focus on
buffer sizing for multiprocessor
architectures using interleaving communication.
- Uniprocessor vs. multiprocessor
Much of the prior work was on uniprocessors.
For multiprocessors, buffer sizing is more
complicated. There is not much work on this, and
none under interleaving communication.
- Non-interleaving communication vs.
interleaving communication for inter-processor
communication
Non-interleaving communication: write and
read tasks cannot exchange data in an
interleaved way, so a read task does not
start reading until all the data are ready to be
read. It is a conservative approach.
Interleaving communication: when the write and
read tasks are both active, they can communicate
any amount of data through a one-place buffer. This
approach is more efficient.

59
Our Contribution

Theory
We proved:
- Several properties of deadlock in a
precedence graph.
- Given a legal schedule for unbounded
communication buffers, we can always find a legal
schedule with one-place communication buffers by
using internal buffers. (P1)
- Using internal buffers can increase the makespan.
(P2)
- The problem of finding, for a given precedence DAG, the
minimal total buffer size that avoids deadlock is
NP-hard. (P3)
Implementation
We implemented and tested:
- Algorithms for minimizing the maximum FIFO size
without using internal buffers. (P3)
- Algorithms for minimizing the total FIFO size
without using internal buffers. (P3)

60
Summary

Avoidance of artificial deadlock on architectures
that support low-overhead interleaving
communication
Show sufficiency of one-place buffers for
inter-processor communication
Possible increase of minimum makespan with
limited FIFO depth
Algorithms for minimizing two cost functions for
FIFO size without using internal buffers
Minimize maximum FIFO size
Minimize total FIFO size (proved to be NP-hard)

61
Conclusions

Identification of an implementation gap between
dataflow programming models and architectural
platforms
Must be bridged to enable automated code synthesis

62
Future work

Better heuristics for Min-Total
Compare with optimal buffer sizing under no
interleaving communication
Industrial case studies

63
Future Work Case Studies

Apply to industrial applications
JPEG encoder
Motion JPEG encoder
H.264 encoder
Deployment on multiprocessor architectures
Intel MXP5800
Xilinx Virtex II Pro with Fast Simplex Link (FSL)
communication

64
Thank you!
65
Background Slides
66
Precedence DAG

A precedence directed acyclic graph (DAG) is a
common representation for the deployment of an
application across multiple processors.
Precedence DAGs can be generated from statically
schedulable dataflow descriptions, such as
synchronous dataflow or cyclo-static dataflow.
The nodes in this graph represent an execution
(or firing) of a particular task in the
application. Node weights represent the estimated
execution times on each of the processors in the
architecture.
The directed edges in the graph represent data
dependencies between nodes. A node can only be
activated after data is received from all
predecessor nodes. Edge weights represent the
amount of data transferred between nodes.