Title: Optimus: Efficient Realization of Streaming Applications on FPGAs
1Optimus: Efficient Realization of Streaming Applications on FPGAs
- University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke
- IBM Research: David Bacon, Rodric Rabbah
2Introduction
- End of free ride from clock scaling.
- Wide variety of applications
- Evolution of new architectures
- Heterogeneous Architectures
3Why FPGAs?
4Why Streaming?
- Suitable for a wide range of applications (digital signal processing, graphics, encryption, etc.)
- Promising approach for multi-cores
- Locality
- Exposed parallelism
- New languages: StreamIt, Brook, CUDA, SPUR, etc.
Ray tracing example
5Vision
[Figure: compilation flow from Lime to heterogeneous targets]
- Front end: Lime, Streaming, SIR, Front-End Compiler, Annotated Java bytecode
- FPGA path: Optimus Back-End Compiler (with FPGA Model), HDL, Xilinx VHDL Compiler, Xilinx bitfile, Virtex5 FPGA
- Cell path: Crucible Back-End Compiler, C, Cell SDK, Cell binary, Cell BE
- Runtime: Stream VM
6Overview
- StreamIt Example
- Compilation Flow
- Scheduling
- Optimizations
- Results
7StreamIt Example
8Operation Compilation
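[Figure: datapath generated for the basic block below, built from functional units (FU: two ADD, one CMP) and registers, with inputs i0 to im, outputs o0 to on, a predicate, a control input, and control outputs 3 and 4]
sum = sum + temp
i = i + 1
branch bb2 if i < 8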
9Filter Compilation
10Top Level Compilation
11Stream Scheduling
- All the filters are active at the same time.
- Pushes and pops are blocking.
- No need for double buffering.
- Channels can have any size.
- Deadlock is possible.
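The blocking behavior listed above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how Optimus generates hardware: filters are modeled as Python generators that yield hypothetical `("push", channel, value)` or `("pop", channel)` requests, the function `run` and the filters `splitter` and `joiner` are invented for illustration, and filters are evaluated one after another within a simulated cycle rather than truly in parallel.

```python
from collections import deque


def run(filters, capacity, max_cycles=10_000):
    """Step all filters each cycle until they finish or none can progress.

    `capacity` maps a channel name to its size. A request that cannot be
    served (push to a full channel, pop from an empty one) blocks and is
    retried in the next cycle. Returns "done", "deadlock", or "timeout".
    """
    queues = {name: deque() for name in capacity}
    pending = {}  # filter -> the request it is currently waiting on
    for f in filters:
        try:
            pending[f] = next(f)
        except StopIteration:
            pass
    for _ in range(max_cycles):
        if not pending:
            return "done"
        progressed = False
        for f, request in list(pending.items()):
            op, name = request[0], request[1]
            queue = queues[name]
            if op == "push" and len(queue) < capacity[name]:
                queue.append(request[2])
                reply = None
            elif op == "pop" and queue:
                reply = queue.popleft()
            else:
                continue  # blocked this cycle
            progressed = True
            try:
                pending[f] = f.send(reply)
            except StopIteration:
                del pending[f]
        if not progressed:
            return "deadlock"  # every remaining filter is blocked
    return "timeout"


def splitter():
    # Sends two items to branch A, then one item to branch B.
    yield ("push", "A", 1)
    yield ("push", "A", 2)
    yield ("push", "B", 3)


def joiner(out):
    # Reads branch B first, then both items from branch A.
    out.append((yield ("pop", "B")))
    out.append((yield ("pop", "A")))
    out.append((yield ("pop", "A")))
```

With channel A of size 1 the splitter blocks on its second push while the joiner waits on the empty channel B, so `run([splitter(), joiner([])], {"A": 1, "B": 1})` reports a deadlock. Growing A to size 2 lets both filters finish, which is why channel sizing matters under this scheduling model.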
12Optimizations
- Classic optimizations (micro functional)
- Common subexpression elimination, constant folding, loop unrolling, etc.
- Done in many previous hardware compilers
- Streaming optimizations (macro functional)
- Channel allocation, channel access fusion, filter fission and fusion, etc.
- These optimizations need global information about the stream graph.
- Ignored in previous work.
13Channel Allocation
- Larger channels
- More SRAM
- More control logic
- Fewer stalls
- All filters are active at the same time
- Interlocking makes sure that each filter gets the right data or blocks
- Channels can be smaller than in a statically scheduled graph
14Channel Allocation Algorithm
- Set the size of the channels to infinity.
- Warm up the queues.
- Record the steady-state schedules for each producer/consumer pair.
- Find the maximum number of overlapping lifetimes.
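The last step above can be sketched as an interval-overlap count. This is a minimal illustration, not the Optimus implementation: it assumes the recorded steady-state schedule is available as two lists of cycle numbers (hypothetical inputs `push_cycles` and `pop_cycles`), and it assumes an element occupies a channel slot from its push cycle up to, but not including, its pop cycle.

```python
def max_overlapping_lifetimes(push_cycles, pop_cycles):
    """Return the peak number of elements live in a FIFO channel.

    push_cycles[k] and pop_cycles[k] are the cycles at which the k-th
    element is pushed by the producer and popped by the consumer.
    """
    if len(push_cycles) != len(pop_cycles):
        raise ValueError("every pushed element must be popped")
    events = []
    for push, pop in zip(push_cycles, pop_cycles):
        if pop <= push:
            raise ValueError("an element must be popped after it is pushed")
        events.append((push, +1))  # lifetime starts
        events.append((pop, -1))   # lifetime ends
    # Tuples sort by cycle first; at the same cycle the pop (-1) comes
    # before the push (+1), so a freed slot can be reused immediately.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```

For a made-up schedule in which the producer pushes at cycles 0 to 5 and the consumer pops each element three cycles later, `max_overlapping_lifetimes([0, 1, 2, 3, 4, 5], [3, 4, 5, 6, 7, 8])` gives 3, so under these assumptions a three-entry channel would be enough for that pair.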
15Channel Allocation Example
[Figure: producer and consumer schedules for a stream graph with Source, Filter 1, Filter 2, and Sink; maximum overlap = 3]
16Channel Allocation
17Channel Access Fusion
- Each channel access (push or pop) takes one cycle.
- Increases the communication-to-computation ratio
- Longer critical path latency
- Limits task-level parallelism
18Channel Access Fusion Algorithm
- Loop Unrolling
- Push/Pop grouping
- Balance channel access clusters
- Widen the channels
19Access Fusion Example
After fusion (four pops grouped into one access on a widened channel):

    int->int filter Adder(int addSize) {
      work pop addSize push 1 {
        int sum = 0;
        int t1, t2, t3, t4;
        for (int i = 0; i < addSize/4; i++) {
          (t1, t2, t3, t4) = pop4();
          sum += t1 + t2 + t3 + t4;
        }
        push(sum);
      }
    }

Before fusion:

    int->int filter Adder(int addSize) {
      work pop addSize push 1 {
        int sum = 0;
        for (int i = 0; i < addSize; i++)
          sum += pop();
        push(sum);
      }
    }
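The effect of fusing the pops in the Adder filter above can be checked with a small model that counts channel accesses. It is a sketch under assumptions: `CountingChannel`, `adder`, and `adder_fused` are hypothetical names, each access is charged one cycle as stated on the Channel Access Fusion slide, and the fused version assumes `add_size` is a multiple of the access width.

```python
class CountingChannel:
    """Input channel that counts accesses (one access = one cycle here)."""

    def __init__(self, data):
        self.data = list(data)
        self.pos = 0
        self.accesses = 0

    def pop(self, width=1):
        """Pop `width` elements in a single, possibly widened, access."""
        if self.pos + width > len(self.data):
            raise IndexError("pop from empty channel")
        chunk = self.data[self.pos:self.pos + width]
        self.pos += width
        self.accesses += 1
        return chunk


def adder(channel, add_size):
    # Unfused version: one channel access per element.
    total = 0
    for _ in range(add_size):
        total += channel.pop()[0]
    return total


def adder_fused(channel, add_size, width=4):
    # Fused version: the loop is unrolled by `width` and the pops are
    # grouped into one access on a channel that is `width` elements wide.
    if add_size % width:
        raise ValueError("add_size must be a multiple of width")
    total = 0
    for _ in range(add_size // width):
        total += sum(channel.pop(width))
    return total
```

Summing the eight values 1 to 8 yields 36 in both versions, but the unfused filter performs eight channel accesses while the fused one performs two, which is the reduction in communication that access fusion targets.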
20Access Fusion
21Speedup (baseline PowerPC)
22Energy Consumption
23Conclusion
- Streaming language to program heterogeneous systems
- Hierarchical synthesis
- Macro functional optimizations
24Thank you!
25Static Stream Scheduling
- Resources have to be ready before a filter starts (pushes and pops are non-blocking).
- Double buffering for parallelism.
- Deadlock can be detected at compile time.
- Could be inefficient in case of data-dependent behavior.