Title: Dataflow Architecture
1. Dataflow Architecture
Liu, Chaofeng; Dai, Chaowei; Li, Mao
2. Dataflow Overview
- What is dataflow and what can it achieve
- Distinguishing dataflow from the von Neumann control-flow model
3. Dataflow Overview
What is dataflow and what can it achieve
Arvind and Iannucci identified two fundamental issues that must be addressed to construct a successful multiprocessor:
- Memory latency: the time between issuing a memory request and receiving the corresponding response.
- Synchronization: needed to enforce the ordering of instruction execution according to data dependencies.
Before dataflow, the majority of multiprocessor systems were based on von Neumann style processors. These processors use a program counter to sequence the execution of instructions in a program. This sequential execution style makes it difficult to exploit parallelism in a program. In the 1970s, dataflow computers were proposed and developed to address these deficiencies of von Neumann control-flow multiprocessors.
4. Distinguishing dataflow from the von Neumann control-flow model: the control-flow model
The control-flow model assumes that a program is a series of addressable instructions, each of which either specifies an operation along with the memory locations of its operands, or specifies a transfer of control to another instruction, unconditionally or when some condition holds. The following example illustrates the difference between the control-flow and dataflow architectures:

c := if n < 0 then a + b else a - b fi

Under a control-flow model, this program is translated into a series of instructions starting with an instruction comparing n to 0, which then transfers control either to an instruction adding a to b or to another instruction subtracting b from a; in both cases the result is stored in c. The essential point is that the branch instruction must wait until the comparison of n with 0 has completed before it can execute.
5. Distinguishing dataflow from the von Neumann control-flow model: the dataflow model
Unlike the control-flow model, dataflow assumes that a program is a data-dependency graph whose nodes denote operations and whose edges denote dependencies between operations. Any operation denoted by a node is executed as soon as its incoming edges carry the necessary operands. In particular, as soon as n is available, the comparison can be applied to its operands n and the constant 0. Similarly, once a and b are available, both the addition and the subtraction can be applied to a and b, even before the comparison of n with 0 has completed.
Fig. Dataflow graph for c := if n < 0 then a + b else a - b fi
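The data-driven firing described above can be sketched with a toy interpreter (a hypothetical helper, not any real dataflow machine's logic): every node fires as soon as all of its input edges carry tokens, regardless of program order, and a select node routes one of the branch results to c. Here a node's name also serves as the name of its output edge.

```python
# Toy token-driven evaluation of: c := if n < 0 then a + b else a - b
class Node:
    def __init__(self, op, inputs):
        self.op = op            # function applied when the node fires
        self.inputs = inputs    # names of edges feeding this node
        self.fired = False

def run(graph, tokens):
    """Repeatedly fire any node whose input edges all carry tokens."""
    order = []                  # record firing order for inspection
    progress = True
    while progress:
        progress = False
        for name, node in graph.items():
            if not node.fired and all(i in tokens for i in node.inputs):
                tokens[name] = node.op(*(tokens[i] for i in node.inputs))
                node.fired = True
                order.append(name)
                progress = True
    return tokens, order

graph = {
    "cmp": Node(lambda n: n < 0, ["n"]),
    "add": Node(lambda a, b: a + b, ["a", "b"]),
    "sub": Node(lambda a, b: a - b, ["a", "b"]),
    # select node: routes the add or sub token to c once cmp is known
    "c":   Node(lambda p, x, y: x if p else y, ["cmp", "add", "sub"]),
}
tokens, order = run(graph, {"n": -3, "a": 5, "b": 2})
```

Note that the add and sub nodes both fire before the select resolves, mirroring the slide's point that they need not wait for the comparison; a real machine would instead use switch nodes to route tokens to only one branch.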
6. Static vs dynamic dataflow architecture
- Based on the node firing rules, the dataflow model is classified into two types:
- Static dataflow architecture
- Dynamic dataflow architecture
- In the static approach, the firing rule restricts each edge to holding at most one token at a time, and an operation is executed when tokens (values) are present on all of its input edges. This also implies that an executable operation can actually fire only when its output edge carries no token. The static dataflow model can exploit structural parallelism (different unrelated operators executing at the same time) and pipeline parallelism (different stages of the graph consuming different tokens of a stream at the same time).
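The static firing rule can be stated as a simple predicate; the sketch below (a hypothetical helper, not any real machine's logic) checks both conditions named above: every input edge holds a token, and the output edge is empty.

```python
# Static firing rule: an edge holds at most one token, and a node may fire
# only when every input edge carries a token AND its output edge is empty.
def can_fire(node, edges):
    # edges maps an edge name to its token list (length 0 or 1 here)
    inputs_ready = all(len(edges[e]) == 1 for e in node["in"])
    output_clear = len(edges[node["out"]]) == 0
    return inputs_ready and output_clear

add_node = {"in": ["a", "b"], "out": "sum"}

ready = can_fire(add_node, {"a": [5], "b": [2], "sum": []})
blocked = can_fire(add_node, {"a": [5], "b": [2], "sum": [7]})
# blocked: the previous result still occupies the output edge, so the node
# stalls until the downstream consumer drains it -- the single-token-per-edge
# restriction that limits the static model to one active iteration.
```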
7. Static dataflow
- Disadvantage
- Since each edge can hold only one token at a time, only one iteration or one function invocation can be active at a time. Thus the static model cannot exploit dynamic forms of parallelism such as loop parallelism (from simultaneously executing different unrelated iterations of a loop body) or recursive parallelism (from simultaneously evaluating multiple recursive function calls).
- Application
- This model is thus well suited for applications with regular numeric computational structures, such as signal processing and image processing applications, that do not make heavy use of iterative or recursive program structures.
8. Dynamic dataflow
- In contrast to the static dataflow model, the dynamic dataflow model allows unbounded storage on each arc. Since data values belonging to a particular instantiation of an operator must be identified, tags are assigned to the data tokens. For this reason the second scheme is often referred to as the tagged-token approach. The tag associated with each token is a four-tuple (c, i, b, a):
- c is the invocation ID, which distinguishes tokens of different function invocations.
- i is the iteration ID, which distinguishes tokens belonging to different iterations.
- b is the code block address; together with the instruction address, it identifies a token's destination.
- a is the instruction address within the code block.
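The four-tuple tag can be sketched directly as a record (field values here are illustrative, not from any real machine): two tokens from different loop iterations target the same instruction (same b and a) yet carry distinct tags, so the matching unit never pairs a value from iteration 0 with one from iteration 1.

```python
# Tagged-token sketch: each token carries the four-tuple (c, i, b, a).
from collections import namedtuple

Tag = namedtuple("Tag", ["c", "i", "b", "a"])  # invocation, iteration, block, instruction
Token = namedtuple("Token", ["tag", "value"])

# Same destination instruction (b=4, a=2), but different iteration IDs:
t0 = Token(Tag(c=1, i=0, b=4, a=2), value=10)
t1 = Token(Tag(c=1, i=1, b=4, a=2), value=20)
```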
9. Dynamic dataflow
Fig. Overall organization of the tagged-token dataflow architecture: processing elements PE1-PE4 interconnected via an interconnection network (IN)
The overall organization of the tagged-token dataflow architecture is shown above. It consists of several processing elements (PEs), illustrated in the next figure, interconnected via a packet-switching interconnection network (IN).
10. Dynamic dataflow
A processing element (PE) consists of a matching store unit (MU), an instruction fetch unit (IFU), a processing unit (PU), and an output unit (OU).
Fig. The Internal Structure of a PE
11. Dynamic dataflow
Step 1: If a token arriving at the matching unit completes all the input requirements for the execution of an instruction, a group token is formed from all the input data and sent to the instruction fetch unit. Otherwise, the token is stored in the matching unit along with the tokens already gathered for that instruction.
Step 2: When the instruction fetch unit (IFU) receives a packet with all the data for a particular instruction, the corresponding instruction is fetched. The instruction together with the data forms an executable packet that is sent to the processing unit (PU).
Step 3: The PU contains a number of function units (FUs) that can perform dataflow operations in parallel. The results generated by the PU are sent to the output unit (OU).
Step 4: The main function of the OU is to form tokens from the results generated by the PU. The OU also evaluates the assignment function to determine the physical address of the PE to which each token needs to be sent.
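Step 1 can be sketched as a small matching store (a hypothetical model, not the actual hardware): tokens are buffered per tag, and once all operands for an instruction have arrived, a group token is released toward the instruction fetch stage.

```python
# Matching-unit sketch: buffer tokens by tag until an operand group is complete.
class MatchingUnit:
    def __init__(self, arity):
        self.arity = arity      # tag -> number of operands the instruction needs
        self.waiting = {}       # tag -> {port: value} gathered so far

    def accept(self, tag, port, value):
        """Return a complete operand group, or None while still waiting."""
        slot = self.waiting.setdefault(tag, {})
        slot[port] = value
        if len(slot) == self.arity.get(tag, 2):
            del self.waiting[tag]
            return (tag, slot)  # group token -> instruction fetch unit
        return None

tag = ("f", 0, 4, 2)            # illustrative (c, i, b, a) tag
mu = MatchingUnit(arity={tag: 2})
first = mu.accept(tag, "left", 10)    # buffered: partner not yet arrived
group = mu.accept(tag, "right", 3)    # completes the pair, group released
```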
12. Dynamic dataflow
In short, the evolution of dataflow computers has been motivated by the need for better performance. The dynamic dataflow computer was designed to execute loop iterations and subprogram invocations in parallel. The two best-known examples are the MIT tagged-token dataflow architecture and the Manchester dataflow computer. More recently, hybrid architectures have been proposed to combine positive features of the von Neumann and dataflow architectures; they bridge the gap between existing systems and new dataflow supercomputers by allowing execution of existing software written for conventional processors. This topic will be discussed in depth by Chaowei Dai.
14. Hybrid Dataflow Architecture Model
15. Computer Architecture Models
- There are two basic models: the von Neumann sequential control model and the data-driven distributed control model.
- The von Neumann model uses a program counter to sequence the execution of instructions in a program. The dataflow model is an alternative to the conventional stored-program (von Neumann) execution model.
16. Computer Architecture Models
- In the dataflow architecture model, program parallelism is expressed in terms of a directed graph, in which the nodes describe the operations to be performed and the arcs represent the data dependencies among the operations. Execution of the directed graph is data driven, in the sense that a node does not execute until the data at its inputs are available.
17. Advantages of Dataflow Architecture Models
- The dataflow model of computation offers a sound, simple, yet powerful model of parallel computation.
- In dataflow programming and architectures, there is no notion of a single point or locus of control.
- Dataflow architectures promise to address the two fundamental problems of von Neumann multiprocessing: memory latency and synchronization overhead.
18. Advantages of Dataflow Architecture Models
- The advantage of the von Neumann model is the efficiency and simplicity of its instruction sequencing mechanism, as well as over 40 years of optimization of the instruction execution mechanism.
- Research on combining the advantages of these two models has been carried out over the last 10 years. Some of the research achievements are introduced as follows.
19. Research Work on Hybrid Dataflow Model
- 1. One of the most valuable research efforts was conducted by Dr. Gao. In his work, an efficient hybrid architecture model was proposed. The model employs:
- (1) a conventional architecture technique to achieve fast pipelined instruction execution, while exploiting fine-grain parallelism through data-driven instruction scheduling;
- (2) an efficient mechanism that supports concurrent operation of multiple instruction threads on the hybrid architecture model;
20. Research Work on Hybrid Dataflow Model
- (3) a compiling paradigm for dataflow software pipelining which efficiently exploits loop parallelism through limited balancing. A set of basic results was established by the author:
- It showed that the fine-grain parallelism in a loop exposed through limited balancing can be fully exploited by a simple greedy runtime data-driven scheduling scheme, achieving both time and space efficiency simultaneously.
21. Research Work on Hybrid Dataflow Model
- 2. Based on his experience with the MIT dynamic (tagged-token) dataflow architecture, Iannucci combined dataflow ideas with sequential thread execution to define a hybrid computation model. The ideas later evolved into a multithreaded architecture project at the IBM Yorktown research center. The architecture includes features such as cache memory with synchronization controls, prioritized processor ready queues, and features for efficient process migration to facilitate load balancing.
22. Research Work on Hybrid Dataflow Model
- 3. The P-RISC is a hybrid model exploring the possibility of constructing a multithreaded architecture around a RISC processor (R. S. Nikhil, Arvind). The Start-T project, a successor of the Monsoon project, has defined a multiprocessor architecture using an extension of an off-the-shelf processor architecture to support fine-grain communication and scheduling of user microthreads. The architecture is intended to retain the latency-hiding feature of Monsoon's split-phase global memory operations.
23. Example of Hybrid Dataflow Architecture Model
- Dr. Gao's research was primarily focused on the organization of a single processor that can support multiple instruction threads, while instructions from different threads are subject to pipelined execution.
- The proposed hybrid dataflow architecture model is an extension of the McGill dataflow architecture model, which employs the argument-fetching principle.
24. MDFA Model
25. MDFA Model
26. Hybrid Dataflow Architecture Model
- The idea is that the IPU directly generates the address of the next p-instruction to be executed, instead of going through the scheduling phase in the ISU.
- Each p-instruction is extended to carry an extra field, a tag field (also called the von Neumann bit), which indicates whether the instruction follows dataflow-style scheduling or von Neumann-style scheduling.
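The effect of the von Neumann bit can be sketched as follows (an illustrative encoding, not Dr. Gao's actual instruction format or unit interfaces): an instruction tagged for von Neumann scheduling yields the next sequential address directly, while a dataflow-tagged instruction defers to the data-driven scheduler's queue of enabled instructions.

```python
# Sketch of mode selection via the "von Neumann bit" (vn_bit).
def next_instruction(instr, pc, ready_queue):
    """Pick the address of the next p-instruction to execute."""
    if instr["vn_bit"]:
        # von Neumann mode: the IPU supplies pc + 1 directly,
        # skipping the scheduling phase in the ISU.
        return pc + 1
    # Dataflow mode: the ISU picks an enabled instruction
    # from the queue of data-ready instructions.
    return ready_queue.pop(0)

seq = next_instruction({"vn_bit": True}, pc=7, ready_queue=[42])
dfl = next_instruction({"vn_bit": False}, pc=7, ready_queue=[42])
```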
27. Hybrid Dataflow Model
28. Features of the Hybrid MDFA Model
- 1. Generality: the hybrid MDFA model supports both thread-level and instruction-level parallelism through efficient fine-grain synchronization. At any time, the IPU can execute several instructions in parallel. This model is different from the so-called macro-dataflow schemes, where dataflow scheduling can only be done at the inter-procedure level. It retains the advantage of dataflow models in dealing with the two fundamental issues of von Neumann multiprocessing.
29. Features of the Hybrid MDFA Model
- 2. Flexibility: there are no restrictions on the size of an instruction thread supported by this model. In fact, multiple instruction threads, each with a different size, can be active concurrently. This is an important advantage in comparison with other multithreaded architectures, where the number of threads is fixed a priori.
30. Features of the Hybrid MDFA Model
- 3. Simplicity: under the hybrid MDFA model, any instruction in a program can be set to either of the two modes, regardless of its function or type. This flexibility makes the job of a compiler easier, since the mode control and the operation of an instruction become orthogonal.
31. Decoupled Scheduled Dataflow Multithreaded Architecture
- Mao Li
- CIS
- Cleveland State University
32. Decoupling Memory Accesses from the Execution Pipeline
- The gap between processor speed and average memory access speed limits the achievement of high performance.
- Decoupled architectures offer a solution for leaping over the memory wall.
- Integrating the decoupled architecture with multithreading opens a wide range of implementations for next-generation architectures.
- Two multithreaded architectures support decoupled memory accesses: Rhamma and PL/PS.
33. Rhamma Processor
- Rhamma uses two separate processors: the Memory Processor performs all load and store instructions, and all other instructions are executed by the Execution Processor.
- A single sequence of instructions (a thread) is generated for both processors.
- When a memory access instruction is decoded by the Execution Processor, a context switch hands the thread over to the Memory Processor.
- When the Memory Processor decodes a non-memory-access instruction, a context switch causes the thread to be handed over to the Execution Processor.
- Threads are blocking (no other thread can run until the current one finishes).
34. PL/PS Architecture
- Threads are non-blocking.
- All memory accesses are done by the Memory Processor, which delivers enabled threads to the Execute Processor.
- A thread is enabled when its required inputs are available and all operands have been pre-loaded into a register context.
- Once enabled, a thread executes to completion without blocking; all instructions belonging to the thread execute on the Execute Processor.
- Results from completed threads are post-stored by the Memory Processor.
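The PL/PS pre-load/execute/post-store cycle can be sketched in three phases (hypothetical helper functions, not the real hardware interfaces): the Memory Processor gathers all operands before the thread is enabled, the Execute Processor then runs pure computation with no memory access and no blocking, and results are written back only after the thread completes.

```python
# PL/PS-style decoupling sketch: memory work brackets a non-blocking thread.
def preload(memory, addrs):
    # Memory Processor: gather all inputs into a register context
    # before the thread is enabled.
    return [memory[a] for a in addrs]

def execute(regs):
    # Execute Processor: operates only on registers, so the thread
    # runs to completion without ever stalling on memory.
    return sum(regs)

def poststore(memory, addr, value):
    # Memory Processor: write results back after the thread completes.
    memory[addr] = value

memory = {0: 4, 1: 6, 2: None}
regs = preload(memory, [0, 1])   # thread becomes enabled only now
result = execute(regs)
poststore(memory, 2, result)
```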
35. Limitations of Pure Dataflow
- The dataflow model holds the promise of an elegant execution paradigm that exposes the parallelism in applications, but no actual implementation has delivered the promised performance.
- Major limitations of the pure dataflow model that prevented commercial implementation:
- Too fine-grained (instruction-level) multithreading
- Difficulty in using memory hierarchies and registers
- Asynchronous triggering of instructions
36. Pipeline Structure of Scheduled Dataflow
37. Analytical Models Evaluating the New Architecture
Effect of thread parallelism
- The same normalized workload is used for all architectures (all architectures execute the same amount of useful work).
- The latency of a pair of threads is the time difference between the termination of a thread and the initiation of its successor.
- Both Scheduled Dataflow and Rhamma show performance gains as thread parallelism increases.
- Scheduled Dataflow executes the multithreaded workload faster than Rhamma for all values of thread parallelism.
- Scheduled Dataflow can provide a higher degree of thread parallelism than Rhamma, since its non-blocking nature leads to finer-grained threads.
38. Thread Granularity
- Normalized thread length includes only functional instructions and excludes architecture-specific overhead instructions.
- For the conventional and Scheduled Dataflow architectures, increasing thread run-lengths yields performance gains up to a certain degree (since longer threads imply fewer context switches).
- For Rhamma, longer threads do not guarantee shorter execution times.
39. Fraction of memory access instructions
- In a conventional architecture, increasing the fraction of memory access instructions leads to more cache misses, and thus longer execution times.
- Decoupling allows the two multithreaded processors to tolerate cache miss penalties.
- Scheduled Dataflow outperforms Rhamma for all fractions of memory access instructions, because Scheduled Dataflow groups memory accesses through pre-loading and post-storing.
40. Conclusions
- This paper presented a new dataflow architecture utilizing control-flow-like scheduling of instructions and separating memory accesses from instruction execution, to tolerate the long latencies incurred by memory accesses.
- The proposed Scheduled Dataflow system is instruction driven: program-counter-style sequencing is used to execute instructions. The instructions within a thread still retain dataflow (functional) properties, which eliminates the need for complex hardware.
- Decoupled access/execute implementations combined with the multithreading model present better opportunities for exploiting the decoupling of memory accesses from the execution pipeline.
- Grouping memory accesses (pre-load and post-store) for threads eliminates unnecessary delays caused by memory accesses.