Optimus: Efficient Realization of Streaming Applications on FPGAs - PowerPoint PPT Presentation

Slides: 26
Provided by: hsienhs

1
Optimus: Efficient Realization of Streaming
Applications on FPGAs
  • University of Michigan: Amir Hormati, Manjunath
    Kudlur, Scott Mahlke
  • IBM Research: David Bacon, Rodric Rabbah

2
Introduction
  • End of free ride from clock scaling.
  • Wide variety of applications
  • Evolution of new architectures
  • Heterogeneous Architectures

3
Why FPGAs?
4
Why Streaming?
  • Suitable for a wide range of applications (digital
    signal processing, graphics, encryption, etc.)
  • Promising approach for multi-cores.
  • Locality
  • Exposed parallelism
  • New languages: StreamIt, Brook, CUDA, SPUR, etc.

Ray tracing example
5
Vision
(Diagram labels: Streaming; Lime SIR; Front-End Compiler; FPGA Model;
Annotated Java bytecode; Crucible Back-End Compiler; Optimus Back-End
Compiler; HDL; C; Xilinx VHDL Compiler; Cell SDK; Xilinx bitfile;
Cell binary; Virtex5 FPGA; Cell BE; Stream VM)
6
Overview
  • StreamIt Example
  • Compilation Flow
  • Scheduling
  • Optimizations
  • Results

7
StreamIt Example
8
Operation Compilation
Code: sum = sum + temp; i = i + 1; Branch bb2 if i < 8
(Diagram labels: temp; sum; i; 1; 8; i0; im; o0; on; FU; ADD; CMP;
predicate; Register; Control in; Control out 3; Control out 4)

9
Filter Compilation
10
Top Level Compilation
11
Stream Scheduling
  • All the filters are active at the same time.
  • Pushes and pops are blocking.
  • No need for double buffering. Channels can have
    any size.
  • Deadlock is possible.
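
The scheduling model above (every filter active at once, blocking pushes and pops, channels of any size) can be imitated in software with threads and bounded queues. The following is only a minimal sketch of that behavior, not the hardware Optimus generates; the three filters, the `DONE` end-of-stream marker, and the name `run_pipeline` are assumptions made for illustration.

```python
import queue
import threading


def run_pipeline(values, channel_size=1):
    """Run source -> doubler -> sink concurrently over bounded channels."""
    a = queue.Queue(maxsize=channel_size)  # channel: source -> doubler
    b = queue.Queue(maxsize=channel_size)  # channel: doubler -> sink
    out = []
    DONE = object()  # end-of-stream marker (an assumption of this sketch)

    def source():
        for v in values:
            a.put(v)  # blocking push: waits while the channel is full
        a.put(DONE)

    def doubler():
        while True:
            v = a.get()  # blocking pop: waits while the channel is empty
            if v is DONE:
                b.put(DONE)
                return
            b.put(2 * v)

    def sink():
        while True:
            v = b.get()
            if v is DONE:
                return
            out.append(v)

    # All filters are started together, mirroring "all filters active".
    threads = [threading.Thread(target=f) for f in (source, doubler, sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because every access blocks, the result is the same for any `channel_size` of at least 1; only the amount of stalling changes.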

12
Optimizations
  • Classic optimizations (micro functional): common
    subexpression elimination, constant folding, loop
    unrolling, etc.
  • Done in many previous hardware compilers.
  • Streaming optimizations (macro functional): channel
    allocation, channel access fusion, filter fission
    and fusion, etc.
  • These optimizations need global information about
    the stream graph.
  • Ignored in previous work.

13
Channel Allocation
  • Larger channels mean more SRAM, more control logic,
    and fewer stalls.
  • All filters are active at the same time.
  • Interlocking makes sure that each filter gets the
    right data or blocks.
  • Channels can be smaller than in the statically
    scheduled graph.

14
Channel Allocation Algorithm
  • Set the size of the channels to infinity.
  • Warm up the queues.
  • Record the steady-state schedules for each
    producer/consumer pair.
  • Find the maximum number of overlapping lifetimes.
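
The last step can be sketched as a sweep over push and pop events. This is an illustrative reading of the slide rather than the authors' implementation: it assumes each token's lifetime runs from the cycle it is pushed to the cycle it is popped, and that a slot freed in a cycle can be reused in that same cycle. The function name and the sample cycle numbers are made up.

```python
def channel_size(push_cycles, pop_cycles):
    """Return the peak number of tokens live at once on one channel.

    push_cycles[k] and pop_cycles[k] are the recorded cycles at which
    token k is pushed by the producer and popped by the consumer.
    """
    events = []
    for push, pop in zip(push_cycles, pop_cycles):
        if pop < push:
            raise ValueError("token popped before it was pushed")
        events.append((push, +1))
        events.append((pop, -1))
    # Tuples sort by cycle first; at equal cycles -1 sorts before +1,
    # so a pop frees its slot before a push in the same cycle claims one.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```

For example, with pushes at cycles 0 to 5 and each token popped three cycles after it is pushed, `channel_size([0, 1, 2, 3, 4, 5], [3, 4, 5, 6, 7, 8])` returns 3, so a three-entry channel would suffice for that recorded schedule.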

15
Channel Allocation Example
(Figure labels: Producer, Consumer; Source, Filter 1, Filter 2, Sink)
Max overlap: 3
16
Channel Allocation
17
Channel Access Fusion
  • Each channel access (push or pop) takes one cycle.
  • Increases communication-to-computation ratio.
  • Longer critical path latency.
  • Limits task-level parallelism.

18
Channel Access Fusion Algorithm
  • Loop Unrolling
  • Push/Pop grouping
  • Balance channel access clusters
  • Widen the channels

19
Access Fusion Example
Fused (four pops per channel access):

int->int filter Adder(int addSize) {
  work pop addSize push 1 {
    int sum = 0;
    int t1, t2, t3, t4;
    for (int i = 0; i < addSize/4; i++) {
      (t1, t2, t3, t4) = pop4();
      sum += t1 + t2 + t3 + t4;
    }
    push(sum);
  }
}

Original (one pop per channel access):

int->int filter Adder(int addSize) {
  work pop addSize push 1 {
    int sum = 0;
    for (int i = 0; i < addSize; i++)
      sum += pop();
    push(sum);
  }
}
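
To make the effect of the two Adder variants concrete, the sketch below counts channel accesses for each. It is a software model only: it assumes, as the earlier slide states, that each access costs one cycle, and that a widened channel delivers `width` values in a single access. The function names and the access counter are hypothetical.

```python
def adder(stream, add_size):
    """Unfused Adder: one value per pop. Returns (sum, channel accesses)."""
    it = iter(stream)
    total = 0
    accesses = 0
    for _ in range(add_size):
        total += next(it)  # pop()
        accesses += 1
    accesses += 1          # push(sum)
    return total, accesses


def adder_fused(stream, add_size, width=4):
    """Fused Adder: `width` values per access, in the style of pop4()."""
    if add_size % width:
        raise ValueError("add_size must be a multiple of width")
    it = iter(stream)
    total = 0
    accesses = 0
    for _ in range(add_size // width):
        group = [next(it) for _ in range(width)]  # one wide access
        total += sum(group)
        accesses += 1
    accesses += 1          # push(sum)
    return total, accesses
```

With `add_size = 8` and the inputs 0 through 7, both variants compute 28, but the unfused one makes 9 channel accesses and the fused one makes 3.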
20
Access Fusion
21
Speedup (baseline PowerPC)
22
Energy Consumption
23
Conclusion
  • Streaming language to program heterogeneous
    systems
  • Hierarchical synthesis
  • Macro functional optimizations

24
Thank you!
  • Questions?

25
Static Stream Scheduling
  • Resources have to be ready before a filter
    starts (pushes and pops are non-blocking).
  • Double buffering for parallelism.
  • Deadlock can be detected at compile-time.
  • Could be inefficient with data-dependent
    behavior.
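
As a rough illustration of the first and third points, a static schedule can be checked ahead of time by replaying it against the filters' declared rates. This is a simplified sketch for a linear pipeline, not the scheduler used in the talk; `check_schedule`, the `(pop_rate, push_rate)` tuples, and the single shared `capacity` are assumptions.

```python
def check_schedule(filters, schedule, capacity):
    """Replay a static schedule for a linear pipeline of filters.

    filters:  list of (pop_rate, push_rate), one tuple per filter, in order.
    schedule: list of filter indices in firing order.
    capacity: number of slots in every channel.
    Returns True if each firing finds its inputs and output space ready.
    """
    n = len(filters)
    fill = [0] * (n - 1)  # fill[i]: tokens between filter i and filter i+1
    for f in schedule:
        pop_rate, push_rate = filters[f]
        if f > 0 and fill[f - 1] < pop_rate:
            return False  # not enough input tokens: firing cannot start
        if f < n - 1 and fill[f] + push_rate > capacity:
            return False  # not enough output space: firing cannot start
        if f > 0:
            fill[f - 1] -= pop_rate
        if f < n - 1:
            fill[f] += push_rate
    return True
```

For a source that pushes 1, an adder that pops 4 and pushes 1, and a sink that pops 1, the schedule `[0, 0, 0, 0, 1, 2]` passes with `capacity=4` and fails with `capacity=3`, which is the kind of problem that can be caught before the design is built.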