Title: Buffersizing for Precedence Graphs on Restricted Multiprocessor Architectures
1Buffer-sizing for Precedence Graphs on Restricted
Multiprocessor Architectures
- Thomas Feng
- Yang Yang
- Mentors: Qi Zhu, Abhijit Davare
2Outline
- Motivation
- Previous Work
- Preliminaries
- Problem Statement
- Investigative Approach
- Summary and Conclusion
- Compare with Prior Work
- Our Contribution
- Future Work
4Parallel Heterogeneous Platforms (PHPs)
- Advantages
- High computational capability
- Challenges
- Explore the theoretically high performance
(From Abhijit Davare's Quals Presentation)
5Project Goals
- Dataflow Programming Model
- Infinite Buffers
- Blocking read, non-blocking write
- Many scheduling and allocation techniques
- Multiprocessor Platform
- Limited connectivity between processors
- Limited, finite depth FIFOs
- Low overhead reads and writes to FIFOs
6Deploying applications on PHPs
- Computation Synthesis
- Task Allocation
- Task Scheduling
- Communication Synthesis
- Interconnection Synthesis
- Buffer sizing (The part we are working on)
7Buffer Sizing
- Architectures have bounded buffer resources.
- If more communication buffer resources are utilized, processors may spend less time waiting to send/receive data.
- Additional buffer resources may adversely affect communication overhead, achievable clock speed, or design closure.
8Example Flow
[Figure: design flow with boxes labeled Function, Function Model, Architecture, Architecture Model, Allocation/Scheduling, and Buffer Sizing.]
10Previous Work
- Transformations from various statically-schedulable dataflow variants into precedence DAGs [1]
- Survey on allocation and scheduling algorithms for precedence DAGs assuming infinite-length buffers [2]
- Minimizing buffer requirements for uniprocessor architectures [3]
- Minimizing multiprocessor buffer sizing for SDF applications under conservative (non-interleaving) conditions [4]
[1] S. Bhattacharyya, R. Leupers, P. Marwedel. Software Synthesis and Code Generation for Signal Processing Systems. IEEE Transactions on Circuits and Systems, 2000.
[2] Y. K. Kwok, I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 1999.
[3] M. Geilen, T. Basten, S. Stuijk. Minimizing Buffer Requirements of Synchronous Dataflow Graphs with Model Checking. DAC 2005.
[4] M. Adé, R. Lauwereins, J. A. Peperstraete. Data Memory Minimization for Synchronous Data Flow Graphs Emulated on DSP-FPGA Targets. DAC 1997.
12Preliminaries
- Precedence DAG
- A precedence DAG is a common representation for the deployment of an application across multiple processors.
- A precedence DAG can be generated from statically schedulable dataflow descriptions, such as synchronous dataflow or cyclo-static dataflow. This is suitable for most applications in the multimedia domain. [4][5]
- Our synthesis process starts from a precedence DAG.
[4] J. L. Pino, E. A. Lee, S. S. Bhattacharyya. A Hierarchical Multiprocessor Scheduling System for DSP Applications. 29th Asilomar Conference on Signals, Systems and Computers, 1995.
[5] E. A. Lee, T. M. Parks. Dataflow Process Networks. Proceedings of the IEEE, 1995.
13Preliminaries
- Synthesis process
- Relationships among allocation, scheduling, and buffer sizing
- Allocation assigns each node in the precedence DAG to a particular processor in the architecture.
- Scheduling specifies an execution sequence for the set of tasks on each processor.
- Buffer sizing assigns sizes to the inter-processor communication channels.
- In our approach, allocation and scheduling are done first, assuming unbounded communication buffers. Buffer sizing is then based on the result of allocation and scheduling.
14Preliminaries
- Artificial deadlock is deadlock that results when the size of the buffers between processors is reduced from infinity to some finite number. [6]
- In buffer sizing, we want to minimize the objective function while avoiding artificial deadlock.
- (In the following slides, "deadlock" means artificial deadlock.)
[6] M. Geilen, T. Basten. Requirements on the Execution of Kahn Process Networks. Programming Languages and Systems, 12th European Symposium on Programming (ESOP), 2003.
16Problem Statement
Flow: allocation and scheduling with unbounded communication buffer size -> bounded communication buffer size -> artificial deadlock -> use internal buffer or use communication buffer
- Use internal buffer
- P1 Can we always find a legal scheduling with one-place communication buffers by using internal buffers, assuming we have a legal scheduling for unbounded communication buffers?
- P2 Does using internal buffers increase makespan or not?
- P4 Give an optimal internal buffer assignment
- Use communication buffer
- P3 Minimize total (or largest) communication buffer size
- Similar problem: minimize total (or largest) internal buffer size
17Assumptions
- Interleaving communication: for inter-processor communication, when the write and read tasks are both active, they can communicate any amount of data through a one-place buffer.
19Classification of Blocked nodes
- In a precedence DAG, we classify the nodes that will be blocked during execution into 3 kinds.
- Read blocked node: the node is blocked because it cannot read in enough tokens.
- Write blocked node: the node is blocked because it cannot finish writing all the produced tokens.
- Scheduling blocked node: the node cannot be fired because its previous node on the same processor has not finished execution.
20Our Observations of Deadlock
- We proved that it is impossible to have deadlock with only scheduling blocked nodes and read blocked nodes.
- We proved that if a precedence DAG has deadlock, then it must contain at least one pattern, called a write blocked cycle, in which
- all the schedule edges are in the same direction
- there are 1 or more write blocked nodes, whose incoming degree is 0 in the cycle
- there may be read blocked nodes, whose incoming degree is one or more in the cycle
- reversing the directed data edges from all the write blocked nodes turns it into a directed cycle.
- Note: not every write blocked edge must be in a write blocked cycle.
21Write Blocked Cycle
[Figure: a write blocked cycle between processors Pi and Pj, drawn with data edges and schedule edges through nodes ni, ni+1, ..., ni+m and nj, nj+1, ..., nj+n, with m ≥ 1, n ≥ 1.]
- Between the labeled nodes there is a series of several data/schedule edges, in which the schedule edges are in the same direction as the schedule edge from ni to ni+1, while the data edges could be in either direction.
- Buffer space from Pi to Pj < token count on the write edge from ni to nj+n
22Examples of Write Blocked Cycle
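[Fig. 1 and Fig. 2: examples of write blocked cycles on processors P1, P2, and P3 with nodes a to f. The edge labels are 2, and blocked writes are marked with "!".]
Buffer(Pi, Pj) = 1 for i, j = 1, 2, 3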
23How to Avoid Deadlock
- There is no artificial deadlock in a precedence DAG if and only if there is no write blocked cycle in the graph.
- Proof by contradiction. (Omitted here.)
- We can resolve the write blocked cycles by using enough communication buffer or internal buffer.
24How to Avoid Deadlock
[Figure: processors Pi and Pj with nodes a, b, c, and d; a writes Wa tokens through the communication buffer Bij.]
- Wa > Space(Bij), so a is write blocked. Deadlock is resolved by
- increasing the communication buffer size to hold all the Wa tokens,
- or increasing the internal buffer size in Pi to hold all the Wa tokens, and then writing them to d after b and c,
- or reading all the tokens before c, and increasing the internal buffer size in Pj to hold all the Wa tokens.
25How to Avoid Deadlock
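[Figure: two examples. In the first, a and e write Wa and We tokens through the buffers Bij and Bmn between the processor pairs Pi, Pj and Pm, Pn. In the second, a and b write Wa and Wb tokens through the same buffer Bij between Pi and Pj.]
- First example: Wa > Space(Bij) and We > Space(Bmn), so a and e are write blocked. Deadlock is resolved by increasing the size of Bij to hold all the Wa tokens, or increasing the size of Bmn to hold all the We tokens, or using internal buffers.
- Second example: Wa + Wb > Space(Bij), so a and b are write blocked. Deadlock is resolved by increasing the buffer size to hold all the Wa + Wb tokens, or by increasing the internal buffer in Pi or Pj.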
26Problem Statement
P1 Can we always find a legal scheduling with one-place communication buffers by using internal buffers, assuming we have a legal scheduling for unbounded communication buffers?
27Using Internal Buffers
- We can always find a legal scheduling S1 with one-place communication buffers by using internal buffers, assuming we have a legal scheduling SU requiring unbounded communication buffers.
- Proof
- Since the scheduling SU (a partial order) is legal, we can always find a sequential order SS to execute the nodes, which is also legal. We can get SS by arbitrarily assigning a total order conforming to the partial order. (to be continued)
28
- Let the communication buffer between every two processors be one place. If there is deadlock in SS, the first blocked node must be a write blocked node: SS respects every precedence constraint, so the inputs of each node have already been produced when the node runs.
- Simulate this execution. Let x be the first blocked node, and let it write tokens to nodes y1, ..., yk. The block can be eliminated by letting x write all the produced tokens to internal buffers: for every node yi on processor pi, write the tokens from x to the internal buffer of pi. The corresponding read code is inserted right after the code already executed on pi. Therefore, the deadlock at x is resolved, and the following execution is not affected by this change.
- Repeat this process until all the write blocked nodes are eliminated. Consequently, a legal schedule is found by using internal buffers.
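The first step of the proof, picking a sequential order SS that conforms to the partial order SU, can be sketched with a standard topological sort (Kahn's algorithm). This is only an illustration: the function name `sequential_order` and the four-node graph are assumptions and do not come from the slides.

```python
from collections import deque

def sequential_order(nodes, edges):
    """Return one total order of `nodes` that respects every (u, v) edge."""
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Nodes without unmet predecessors can run next.
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("edges contain a cycle, so no legal order exists")
    return order

# Hypothetical graph: (a, b) and (c, d) are schedule edges,
# (a, d) and (c, b) are data edges between two processors.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "d"), ("a", "d"), ("c", "b")]
ss = sequential_order(nodes, edges)
```

Any order returned this way is one valid choice of SS; the proof only needs some total order that conforms to the partial order.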
29Problem Statement
P2 Does using internal buffers increase makespan or not?
30MakeSpan
- Makespan is the maximum completion time over a set of processors.
- Assumptions
- 1. Interprocessor communication takes place through bounded-depth FIFOs with blocking reads and writes
- 2. Unlimited internal buffer space is available on each processor
- Conjecture
- For a task precedence graph, if insufficient FIFO depth leads to deadlock, reading and writing can be reordered in such a way that deadlock is eliminated and makespan is not affected.
- Counterexample
- The example is scheduled in such a way that multiple paths are relatively critical. Reordering the reads and writes to eliminate the deadlock increases the length of some of the relatively critical paths, extending the makespan, even if tx/rx time << computation time.
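As a small illustration of the makespan definition used above, the sketch below takes per-processor completion times and returns the largest one. The processor names and times are made up for the example and are not taken from the counterexample.

```python
def makespan(finish_times):
    """finish_times maps a processor name to the completion times of its tasks."""
    return max(max(times) for times in finish_times.values() if times)

# Hypothetical completion times on three processors.
example = {"P1": [10, 50], "P2": [30, 80], "P3": [65]}
result = makespan(example)
```

Delaying any task that ends a critical path raises the largest completion time, which is the effect the counterexample relies on.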
31
[Figure: counterexample task graph with nodes a to i scheduled on processors P1 to P5. Communication model: Tx/Rx time 5 units, latency 0 units.]
32
- In the example, edges a->d and c->g may be blocked due to insufficient FIFO depth. Without increasing the FIFO depth, there are 4 ways to resolve this:
- 1. Move the a->d communication after b
- 2. Move the a->d communication before c
- 3. Move the c->g communication after d
- 4. Move the c->g communication before f
- Options 1 and 3 delay d and g by a large amount, and increase the makespan significantly
- Options 2 and 4 extend the critical paths that end at h and i
33Problem Statement
P3 Minimize total (or largest) communication buffer size
34NP-hard Problem
- Formally, the problem DEADLOCK-FREE-MIN-BUFFER (DFMB) is defined as follows: given a precedence DAG D, find the minimal total buffer size such that there is no deadlock in D.
- The problem DEADLOCK-FREE-MIN-BUFFER is NP-hard.
- Proof
- We prove it by reducing the FEEDBACK ARC SET (FAS) problem, which is known to be NP-complete, step by step to the DFMB problem.
35
- The FEEDBACK ARC SET (FAS) problem is the following: given a directed graph $G(V, E)$ and a positive integer $K$, does there exist a subset $B \subseteq E$ with $|B| \le K$ such that $B$ contains at least one edge from every directed cycle in $G$?
- This problem is known to be NP-complete. [7]
[7] M. R. Garey, D. S. Johnson. Computers and Intractability. W. H. Freeman and Co., NY, 1979.
36
- First, we prove that the FAS problem can be reduced to the problem below.
- Problem B: given a directed graph $G(V, E)$ with weight $w(e)$ on every edge $e \in E$, find the minimal $\sum_{e \in B} w(e)$ such that $B \subseteq E$ and $B$ contains at least one edge from every directed cycle in $G$.
- Then we prove that Problem B can be reduced to the DFMB problem, by proving that an arbitrary instance X of Problem B can be transformed to an instance X' of the DFMB problem in polynomial time, and that the result of X is obtained by solving X'.
37Instance X and Instance X'
- A vertex in G(V, E) (instance X) maps to the corresponding nodes and schedule edge in D(N, E) (instance X').
- An edge in G(V, E) maps to the corresponding data edge in D(N, E).
38
- A directed cycle in G(V, E) maps to a write blocked cycle in D(N, E), where E = DE ∪ SE.
39
- Solution to instance X
- $\min \sum_{e \in B} w(e)$, where $B$ contains at least one edge from every directed cycle in $G$
- Solution to instance X'
- $\min \sum_{e \in M} w(e)$, where $M$ contains at least one edge from every write blocked cycle in $D$
40Algorithms
41Minimizing Maximum FIFO Size (1)
- Mathematical Model
- $V = \{v_1, v_2, \ldots, v_m\}$. The set of vertices.
- $P = \{p_1, p_2, \ldots, p_l\}$. The set of processors.
- $M: V \to P$. Mapping from vertices to the processors they are scheduled on.
- $E = \{e_1, e_2, \ldots, e_n\}$. The set of edges.
- $S = \{e \mid e \in E \land M(src(e)) = M(des(e))\}$. Set of schedule edges.
- $D = \{e \mid e \in E \land M(src(e)) \ne M(des(e))\}$. Set of data edges.
- $W: D \to \mathbb{R}^{+}$. The weight function.
- $F: P \times P \to \mathbb{R}$. The function that returns the FIFO size. $F(p_i, p_j)$ need not be equal to $F(p_j, p_i)$.
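The model above can be written down as plain data structures. The sketch below is one possible encoding, not the authors' implementation: the `Edge` class, the `split_edges` helper, and the four-vertex instance are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str
    des: str
    weight: float = 0.0  # token count; only meaningful for data edges

def split_edges(edges, mapping):
    """Split E into schedule edges S (same processor) and data edges D (different processors)."""
    schedule = [e for e in edges if mapping[e.src] == mapping[e.des]]
    data = [e for e in edges if mapping[e.src] != mapping[e.des]]
    return schedule, data

# Hypothetical instance with two processors.
mapping = {"a": "p0", "b": "p0", "c": "p1", "d": "p1"}  # M: V -> P
edges = [Edge("a", "b"), Edge("c", "d"), Edge("a", "d", 2.0), Edge("c", "b", 3.0)]
S, D = split_edges(edges, mapping)
W = {(e.src, e.des): e.weight for e in D}  # weight function on data edges
```

Here `S` holds the edges a->b and c->d, and `D` holds the two cross-processor edges with their token counts.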
42Minimizing Maximum FIFO Size (2)
- Formalizing the problems
- Find an algorithm that, given a schedule $\langle V, P, M, E, W \rangle$, finds a valid $F$ function such that $\max F(p_i, p_j)$ is minimized (a.k.a. the min max problem).
- With interleaving communication.
- Without interleaving communication.
- Find an algorithm that, given a schedule $\langle V, P, M, E, W \rangle$, finds a valid $F$ function such that $\sum F(p_i, p_j)$ is minimized (a.k.a. the min total problem).
43Min Max Problem (1)
- Free vertices: the vertices with no incoming edges. (a and c in the figure)
- Free edges: the edges starting from free vertices. (ab, ad, cd, ce in the figure)
- Our algorithm always deals with free edges. When one free edge is resolved with the algorithm, some other edges may become free.
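The two definitions above translate directly into code. In the sketch below the edges ab, ad, cd, and ce follow the slide's figure, while the extra edge (b, f) and the helper names are assumptions added to show how removing one free edge can free another.

```python
def free_vertices(vertices, edges):
    """Vertices with no incoming edges."""
    targets = {des for _, des in edges}
    return [v for v in vertices if v not in targets]

def free_edges(vertices, edges):
    """Edges whose source is a free vertex."""
    free = set(free_vertices(vertices, edges))
    return [(src, des) for src, des in edges if src in free]

vertices = ["a", "b", "c", "d", "e", "f"]
# (b, f) is a hypothetical extra edge, not part of the slide's figure.
edges = [("a", "b"), ("a", "d"), ("c", "d"), ("c", "e"), ("b", "f")]

before = free_edges(vertices, edges)
# Resolving (a, b) removes it; b then has no incoming edge, so (b, f) becomes free.
remaining = [e for e in edges if e != ("a", "b")]
after = free_edges(vertices, remaining)
```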
44Dependency Graph
- Dependency graph: a graph constructed from the precedence DAG by making all the data edges bidirectional.
- A data edge implies bidirectional dependency between the two vertices. A schedule edge is still unidirectional.
45Min Max Problem (2)
- 4 types of free edges (priority 1 > 2 > 3 > 4)
- Type 1: free schedule edge, and the source has no other outgoing edges.
- Type 2: free data edge between two free vertices (ignoring the incoming data edges to the second vertex).
- Type 3: free data edge that is not type 2 and is not in a dependency cycle.
- Type 4: free data edge that is not type 2 and is in a dependency cycle.
46Min Max Problem (3)
- Type 1: just delete it, because a can finish immediately. b becomes free.
- Type 2: just delete it, because a and c can run simultaneously with interleaving communication.
47Min Max Problem (4)
- Type 3: just delete it, because d will be ready later, and a just needs to wait.
- Type 4: resolve blocking before deleting the edge. Use the remaining FIFO space first; increase the FIFO size only if no space is left.
48Choice of Free Edges (Min Max)
- If edges of type 1, 2, or 3 exist, remove them first.
- If only edges of type 4 are left, choose one of them to resolve in a greedy manner.
- Among the edges of type 4, always pick the edge e such that F(M(src(e)), M(des(e))) is minimal after e is resolved.
49Min Max Proof of Optimality
- Induction. $G$ is the complete precedence DAG. At step $i$, $G_i$ is the sub-graph we have solved. $G \setminus G_i$ is the sub-graph with only the remaining edges. $F_i$ is the $F$ function at step $i$.
- Base case: $G_0$ is empty. $G_0$ is optimal.
- Induction step: assume $G_k$ is optimal ($\max\{F_k(M(src(e)), M(des(e))) \mid e \in G_k\}$ is minimal). Prove $G_{k+1}$ is also optimal.
- $G_{k+1}$ is obtained by either removing an edge of type 1, 2 or 3 (in which case $G_{k+1}$ is obviously optimal), or updating the FIFO for an edge of type 4. In the latter case, we always pick an edge $e_{k+1}$ such that $F_{k+1}(M(src(e_{k+1})), M(des(e_{k+1})))$ is minimum among such edges. Then
- $\max\{F_{k+1}(M(src(e)), M(des(e))) \mid e \in G_{k+1}\} = \max\bigl(\max\{F_k(M(src(e)), M(des(e))) \mid e \in G_k\},\ F_{k+1}(M(src(e_{k+1})), M(des(e_{k+1})))\bigr)$
- is also minimal. So, $G_{k+1}$ is optimal.
- This proof does not work with min total.
50Linear cycle detection algorithm Ac
- To decide whether the data edge from a to d is in a cycle:
- Without considering edge (a, d) in the dependency graph, can we find d by traversing the graph from a?
- Without considering edge (d, a) in the dependency graph, can we find a by traversing the graph from d?
- If either case is true, then return true; otherwise, return false.
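The two traversals of Ac can be sketched as two depth-first searches on the dependency graph, each ignoring one directed edge. The helper names (`dependency_graph`, `reachable`, `in_dependency_cycle`) and both example graphs are assumptions made for illustration.

```python
def dependency_graph(schedule_edges, data_edges):
    """Schedule edges stay one-way; every data edge is added in both directions."""
    adj = {}
    for u, v in schedule_edges:
        adj.setdefault(u, set()).add(v)
    for u, v in data_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def reachable(adj, start, goal, skip):
    """Depth-first search from start to goal that never uses the directed edge `skip`."""
    stack, seen = [start], {start}
    while stack:
        u = stack.pop()
        if u == goal:
            return True
        for v in adj.get(u, ()):
            if (u, v) != skip and v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def in_dependency_cycle(adj, a, d):
    """Ac: is the data edge between a and d part of a dependency cycle?"""
    return reachable(adj, a, d, skip=(a, d)) or reachable(adj, d, a, skip=(d, a))

# Hypothetical graphs: schedule edges a->b and c->d in both cases.
crossed = dependency_graph([("a", "b"), ("c", "d")], [("a", "d"), ("c", "b")])
single = dependency_graph([("a", "b"), ("c", "d")], [("a", "d")])
```

Each search visits every vertex and edge at most once, which matches the linear running time claimed for Ac.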
51Quadratic Min Max Algorithm Am
∀ i ∈ P, j ∈ P: space[i][j] = 0, fifo[i][j] = 0
while E is not empty do
    type = 0; sel_src = None; sel_des = None; min_fifo = -1.0
    for each edge e = (src, des) do
        if src is free
            if e is type 1 then
                type = 3; sel_src = src; sel_des = des
            else if type < 2 and e is type 2 then
                type = 2; sel_src = src; sel_des = des
            else if type <= 1 then   // e is type 3 or 4
                new_fifo_size = calculate_fifo(src, des)
                if min_fifo < 0 or min_fifo > new_fifo_size then
                    type = 1; sel_src = src; sel_des = des; min_fifo = new_fifo_size
    switch (type)
        case 3: remove(sel_src, sel_des); break
        case 2: remove(sel_src, sel_des); remove(sel_des, sel_src); break
        case 1: if (sel_src, sel_des) is in a cycle according to Ac then
                    ... /* resolve blocking and update fifo and space */
                remove(sel_src, sel_des); remove(sel_des, sel_src); break
52Exponential Min Total Algorithm At
- At is very similar to Am,
- except that if only free edges of type 4 are left, At picks them one by one in an arbitrary order, and each time it recursively computes the FIFO sizes based on that choice.
- After finishing computing one FIFO assignment, it backtracks and picks another such edge to try again.
- This process ends when all possible sequences of choices are exhausted. The FIFO assignment with the minimum total size is returned.
- Some intermediate results can be saved and reused.
- Because the exact min total problem is NP-hard, no polynomial-time exact algorithm is known, and At takes exponential time in the worst case.
53Lower bound of FIFO size for Non-interleaving
Communication Ln
- For platforms that do not allow interleaving communication, we compute a conservative lower bound of the FIFO size.
- A common assumption in the literature.
- Let
- $\forall p_1 \in P, p_2 \in P.\ (p_1 = p_2) \Rightarrow L_n(p_1, p_2) = 0$ and $(p_1 \ne p_2) \Rightarrow L_n(p_1, p_2) = \max\{W(e) \mid e \in E \text{ and } M(src(e)) = p_1 \text{ and } M(des(e)) = p_2\}$
- Under this assumption, this must be true for any valid $F$ function:
- $\forall p_1 \in P, p_2 \in P.\ L_n(p_1, p_2) \le F(p_1, p_2)$
- $L_n(p_1, p_2) = F(p_1, p_2)$ may not be achievable in most non-trivial cases.
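The bound Ln is the largest data-edge weight between each ordered pair of processors. The sketch below computes it under one added assumption: a pair with no data edge gets 0. The function name, task names, and weights are hypothetical.

```python
def lower_bound(processors, data_edges, mapping, weight):
    """Ln(p1, p2): largest data-edge weight from p1 to p2, and 0 when p1 == p2.

    Assumption: pairs of processors with no data edge between them also get 0.
    """
    ln = {(p1, p2): 0 for p1 in processors for p2 in processors}
    for src, des in data_edges:
        p1, p2 = mapping[src], mapping[des]
        if p1 != p2:
            ln[(p1, p2)] = max(ln[(p1, p2)], weight[(src, des)])
    return ln

# Hypothetical schedule: six tasks on three processors.
processors = ["p0", "p1", "p2"]
mapping = {"t0": "p0", "t1": "p1", "t2": "p0", "t3": "p1", "t4": "p1", "t5": "p2"}
data_edges = [("t0", "t1"), ("t2", "t3"), ("t4", "t5")]
weight = {("t0", "t1"): 1, ("t2", "t3"): 2, ("t4", "t5"): 3}

ln = lower_bound(processors, data_edges, mapping, weight)
ln_max = max(ln.values())
ln_total = sum(ln.values())
```

Without interleaving, a FIFO must hold the largest single transfer that crosses it, which is why every valid F is at least Ln entry by entry.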
54Example Am, At and Ln
- Am
- resolve (v0, v4)
- resolve (v3, v7)
- F(p0, p1) = 2, F(p1, p2) = 3
- max = 3, total = 5
- At
- resolve (v3, v7)
- F(p1, p2) = 3
- max = 3, total = 3
- Ln
- Ln(p0, p1) = 2, Ln(p0, p2) = Ln(p1, p2) = 3
- max = 3, total = 8
55Implementation
- The code is written in C++ and compiled with GCC and VC.
- It uses the BGL (Boost Graph Library).
- Adjacency list data structure for graphs.
- Input
- A text description of the tasks to be scheduled and the number of processors, either handwritten or randomly generated by TGFF (Task Graphs For Free).
- Output
- Screen output of the FIFO sizes between each pair of processors, and the time spent in each algorithm.
56Benchmark of Bigger Tests
58Compare with Prior Work
- We are focusing on
- Deploying buffers for multiprocessor architectures using interleaving communication.
- Uniprocessor vs. multiprocessor
- Much prior work was on uniprocessors.
- For multiprocessors, buffer sizing is more complicated. There is not much work on this and, in particular, no work under interleaving communication.
- Non-interleaving communication vs. interleaving communication for inter-processor communication
- Non-interleaving communication: write and read tasks cannot exchange data in an interleaved way, so the read task does not start reading until all the data are ready to be read. This is a conservative approach.
- Interleaving communication: when write and read tasks are both active, they can communicate any amount of data through a one-place buffer. This is more efficient.
59Our Contribution
- Theory
- We proved
- Several properties about deadlock in a precedence graph.
- Given a legal scheduling for unbounded communication buffers, we can always find a legal scheduling with one-place communication buffers by using internal buffers. (P1)
- Using internal buffers can increase makespan. (P2)
- The problem of finding, for a given precedence DAG, the minimal total buffer size that avoids deadlock is NP-hard. (P3)
- Implementation
- We implemented and tested
- Algorithms for minimizing maximum FIFO size without using internal buffers. (P3)
- Algorithms for minimizing total FIFO size without using internal buffers. (P3)
60Summary
- Avoidance of artificial deadlock on architectures that support low-overhead interleaving communication
- Show sufficiency of one-place buffers for inter-processor communication
- Possible increase of minimum makespan with limited FIFO depth
- Algorithms for minimizing two cost functions for FIFO size without using internal buffers
- Minimize maximum FIFO size
- Minimize total FIFO size (proved to be NP-hard)
61Conclusions
- Identification of an implementation gap between dataflow programming models and architectural platforms
- Must be bridged to enable automated code synthesis
62Future work
- Better heuristics for Min-Total
- Compare with optimal buffer sizing under no interleaving communication
- Industrial case studies
63Future Work Case Studies
- Apply to industrial applications
- JPEG encoder
- Motion JPEG encoder
- H.264 encoder
- Deployment on multiprocessor architectures
- Intel MXP5800
- Xilinx Virtex II Pro with Fast Simplex Link (FSL)
communication
64Thank you!
65Background Slides
66Precedence DAG
- A precedence directed acyclic graph (DAG) is a common representation for the deployment of an application across multiple processors. Precedence DAGs can be generated from statically schedulable dataflow descriptions, such as synchronous dataflow or cyclo-static dataflow.
- The nodes in this graph represent an execution (or firing) of a particular task in the application. Node weights represent the estimated execution times on each of the processors in the architecture.
- The directed edges in the graph represent data dependencies between nodes. A node can only be activated after data is received from all predecessor nodes. Edge weights represent the amount of data transferred between nodes.
67Assumptions
- Firing a node
- We abstract the firing of a node (or a task) as an execution sequence on a single processor.
[Figure: the execution sequence of a node on processor Pi.]
68Assumptions
- Two kinds of deadlock are already avoided in the scheduling step:
- 1. Cross
- 2. Directed cycle
[Figure: the two patterns drawn between processors Pi and Pj, using data edges, schedule edges, and edges that may be either of the two.]