Optimus: Efficient Realization of Streaming Applications on FPGAs

About This Presentation

Slides: 26

Provided by: hsienhs

Transcript and Presenter's Notes

1
Optimus: Efficient Realization of Streaming
Applications on FPGAs

University of Michigan: Amir Hormati, Manjunath
Kudlur, Scott Mahlke
IBM Research: David Bacon, Rodric Rabbah

2
Introduction

End of free ride from clock scaling.
Wide variety of applications
Evolution of new architectures
Heterogeneous Architectures

3
Why FPGAs?
4
Why Streaming?

Suitable for a wide range of applications (digital
signal processing, graphics, encryption, etc.)
Promising approach for multi-cores
Locality
Exposed parallelism
New languages: StreamIt, Brook, CUDA, SPUR, etc.

Ray tracing example
5
Vision
[Diagram: compilation flow. Labels: Streaming, Lime, SIR, Front-End Compiler, FPGA Model, Annotated Java bytecode, Crucible Back-End Compiler, Optimus Back-End Compiler, HDL, C, Xilinx VHDL Compiler, Cell SDK, Xilinx bitfile, Cell binary, Virtex5 FPGA, Cell BE, Stream VM]
6
Overview

StreamIt Example
Compilation Flow
Scheduling
Optimizations
Results

7
StreamIt Example
8
Operation Compilation
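[Diagram: functional unit (FU) built for one basic block. Labels: inputs i0 ... im, outputs o0 ... on, values temp, sum and i, constants 1 and 8, two ADD units, a CMP unit producing the predicate, Register, Control in, Control out 3, Control out 4]

sum = sum + temp; i = i + 1; Branch bb2 if i < 8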

9
Filter Compilation
10
Top Level Compilation
11
Stream Scheduling

All the filters are active at the same time.
Pushes and pops are blocking.
No need for double buffering. Channels can have
any size.
Deadlock is possible.
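
The scheduling model above can be illustrated with a small software simulation. This is only a sketch under stated assumptions: the Channel class, the run loop and the three example filters are hypothetical names invented here, and one channel access per filter per cycle is assumed. It does not reproduce the hardware that Optimus generates.

```python
from collections import deque


class Channel:
    """Bounded FIFO: a push blocks when full, a pop blocks when empty."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def can_push(self):
        return len(self.items) < self.capacity

    def can_pop(self):
        return len(self.items) > 0


def run(filters, max_cycles=1000):
    """Step every filter once per cycle (all filters are active together).

    Each filter is a generator yielding ("push", channel, value) or
    ("pop", channel). A request that cannot be served stalls and is
    retried on the next cycle. Returns (status, cycles), where status is
    "done", "deadlock" or "timeout".
    """
    # Prime each generator so that it is parked on its first request.
    state = {name: (gen, next(gen)) for name, gen in filters.items()}
    for cycle in range(1, max_cycles + 1):
        progressed = False
        for name in list(state):
            gen, request = state[name]
            op, channel = request[0], request[1]
            if op == "push" and channel.can_push():
                channel.items.append(request[2])
                reply = None
            elif op == "pop" and channel.can_pop():
                reply = channel.items.popleft()
            else:
                continue  # blocked: this filter stalls for the cycle
            progressed = True
            try:
                state[name] = (gen, gen.send(reply))
            except StopIteration:
                del state[name]
        if not state:
            return "done", cycle
        if not progressed:
            return "deadlock", cycle  # every remaining filter is blocked
    return "timeout", max_cycles


def source(out, count):
    for value in range(count):
        yield ("push", out, value)


def sink(inp, count, received):
    for _ in range(count):
        value = yield ("pop", inp)
        received.append(value)


def pop_then_push(inp, out):
    value = yield ("pop", inp)
    yield ("push", out, value)
```

With a single-entry channel, source and sink still run to completion because each blocked access simply stalls. Two pop_then_push filters wired in a cycle over empty channels never make progress, which is the deadlock case the slide warns about.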

12
Optimizations

Classic optimizations (micro functional)
Common subexpression elimination, constant
folding, loop unrolling, etc.
Done in many previous hardware compilers
Streaming optimizations (macro functional)
Channel allocation, channel access fusion,
filter fission and fusion, etc.
These optimizations need global information
about the stream graph.
Ignored in previous work.

13
Channel Allocation

Larger channels
More SRAM
More control logic
Fewer stalls
All filters are active at the same time
Interlocking makes sure that each filter gets the
right data or blocks
Channels can be smaller than in the
statically scheduled graph

14
Channel Allocation Algorithm

Set the size of the channels to infinity.
Warm up the queues.
Record the steady-state schedules for each producer/consumer pair.
Find the maximum number of overlapping lifetimes.
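
The last step can be sketched as a sweep over recorded item lifetimes. This is a minimal illustration, not the authors' implementation: the function name channel_size, the (push_cycle, pop_cycle) representation and the sample numbers are assumptions made here, as is the choice to let a slot freed in a cycle be reused in that same cycle.

```python
def channel_size(lifetimes):
    """Return the maximum number of items live in a channel at once.

    lifetimes: list of (push_cycle, pop_cycle) pairs recorded from the
    steady-state schedule of one producer/consumer pair.
    """
    events = []
    for push_cycle, pop_cycle in lifetimes:
        events.append((push_cycle, +1))  # item enters the channel
        events.append((pop_cycle, -1))   # item leaves the channel
    # Sort by cycle; on ties, -1 sorts before +1, so a pop frees its
    # slot before a push in the same cycle claims one (an assumption).
    events.sort()
    live = 0
    largest = 0
    for _, delta in events:
        live += delta
        largest = max(largest, live)
    return largest


# Made-up lifetimes for four items in one steady-state iteration.
example = [(0, 4), (1, 5), (2, 6), (4, 7)]
print(channel_size(example))
```

For these invented lifetimes, at most three items are live together, so a three-entry channel would be enough for that pair.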

15
Channel Allocation Example
[Diagram labels: Producer, Consumer; Source, Filter 1, Filter 2, Sink]
Max overlap: 3
16
Channel Allocation
17
Channel Access Fusion

Each channel access (push or pop) takes one
cycle.
Increases communication-to-computation ratio
Longer critical path latency
Limits task-level parallelism

18
Channel Access Fusion Algorithm

Loop Unrolling
Push/Pop grouping
Balance channel access clusters
Widen the channels

19
Access Fusion Example
After access fusion:

    int->int filter Adder(int addSize) {
      work pop addSize push 1 {
        int sum = 0;
        int t1, t2, t3, t4;
        for (int i = 0; i < addSize/4; i++) {
          (t1, t2, t3, t4) = pop4();
          sum += t1 + t2 + t3 + t4;
        }
        push(sum);
      }
    }

Original filter:

    int->int filter Adder(int addSize) {
      work pop addSize push 1 {
        int sum = 0;
        for (int i = 0; i < addSize; i++)
          sum += pop();
        push(sum);
      }
    }
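
To make the saving concrete, the two Adder variants can be modeled in Python with a channel that counts accesses. This is a rough sketch: CountingChannel, adder and adder_fused are hypothetical names introduced here, each access is assumed to cost one cycle as stated on the Channel Access Fusion slide, and add_size is assumed to be a multiple of 4.

```python
from collections import deque


class CountingChannel:
    """Input channel that counts accesses (one access = one cycle)."""

    def __init__(self, items):
        self.items = deque(items)
        self.accesses = 0

    def pop(self):
        self.accesses += 1
        return self.items.popleft()

    def pop4(self):
        # One access on a channel widened 4x returns four items.
        self.accesses += 1
        return tuple(self.items.popleft() for _ in range(4))


def adder(channel, add_size):
    """Original filter: one pop per loop iteration."""
    total = 0
    for _ in range(add_size):
        total += channel.pop()
    return total


def adder_fused(channel, add_size):
    """Fused filter: the loop is unrolled by 4 and the pops are grouped."""
    total = 0
    for _ in range(add_size // 4):
        t1, t2, t3, t4 = channel.pop4()
        total += t1 + t2 + t3 + t4
    return total


data = list(range(8))
plain, fused = CountingChannel(data), CountingChannel(data)
print(adder(plain, 8), plain.accesses)
print(adder_fused(fused, 8), fused.accesses)
```

Both variants compute the same sum, but the fused one touches the input channel add_size/4 times instead of add_size times.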
20
Access Fusion
21
Speedup (baseline PowerPC)
22
Energy Consumption
23
Conclusion

Streaming language to program heterogeneous
systems
Hierarchical synthesis
Macro functional optimizations

24
Thank you!

Questions?

25
Static Stream Scheduling

Resources have to be ready before a filter
starts (pushes and pops are non-blocking).
Double buffering for parallelism.
Deadlock can be detected at compile time.
Could be inefficient in case of data-dependent
behavior.
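
A static schedule relies on fixed push and pop rates. As a hedged illustration, the sketch below computes how often each filter in a linear pipeline must fire per steady-state iteration so that every channel is balanced. It uses the standard synchronous-dataflow balance condition, which the slide does not spell out, and pipeline_repetitions and the example rates are assumptions made here.

```python
from fractions import Fraction
from math import lcm  # available in Python 3.9 and later


def pipeline_repetitions(rates):
    """Return the smallest integer firing counts for a linear pipeline.

    rates: list of (pop, push) per filter, in pipeline order.
    Balance condition on each channel:
        reps[k] * push[k] == reps[k + 1] * pop[k + 1]
    """
    reps = [Fraction(1)]
    for k in range(1, len(rates)):
        push_prev = rates[k - 1][1]
        pop_here = rates[k][0]
        reps.append(reps[-1] * push_prev / pop_here)
    # Scale the rational solution up to the smallest all-integer one.
    scale = lcm(*(r.denominator for r in reps))
    return [int(r * scale) for r in reps]


# Hypothetical pipeline: a source pushing 1 item per firing, an Adder
# that pops 8 and pushes 1, and a printer that pops 1.
print(pipeline_repetitions([(0, 1), (8, 1), (1, 0)]))
```

Under these assumed rates the source fires eight times for each firing of the Adder and the printer. With data-dependent rates no such fixed solution exists, which is one way to read the inefficiency noted on the slide.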
