Post Placement C-Slow Retiming for Xilinx Virtex FPGAs - PowerPoint PPT Presentation

Slides: 25

Provided by: nicholas75

Learn more at: http://brass.cs.berkeley.edu

Transcript and Presenter's Notes

Title: Post Placement C-Slow Retiming for Xilinx Virtex FPGAs

1
Post Placement C-Slow Retiming for Xilinx Virtex
FPGAs
Nicholas Weaver Yury Markovskiy Yatish Patel John Wawrzynek

UC Berkeley Reconfigurable Architectures,
Systems, and Software (BRASS) Group
ACM Symposium on Field Programmable Gate Arrays
(FPGA)
February 2x, 2003
http://www.cs.berkeley.edu/~nweaver/cslow.html

2
Outline

Automatically Double Your Throughput
You paid for those registers, here's how to use them
Retiming and C-slow Retiming
The transformation
C-slow Retiming and the Virtex FPGA
The target
Retiming 3 Benchmarks
The tests

3
Retiming and Repipelining

Retiming
Automatically moving registers to minimize the
clock period
Benefits limited by the number of registers
Algorithm developed by Leiserson et al
Repipelining
Adding registers to the front or back
Let retiming then move them around
But What About Feedback Loops?
Retiming and repipelining are of limited benefit
when you have feedback loops
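
Why feedback loops limit retiming can be seen in the usual graph model of retiming, where each edge u -> v carries a register count w and each node gets an integer lag r, giving a new count of w + r(v) - r(u). The sketch below is only an illustration: the three-node loop and the lag values are made up, not taken from the talk. It shows that retiming moves the register but cannot change how many registers sit on the loop.

```python
# Hypothetical 3-node feedback loop; each edge is (src, dst, register_count).
edges = [("a", "b", 1), ("b", "c", 0), ("c", "a", 0)]

def retime(edges, lag):
    """Apply node lags: new weight = w + lag[dst] - lag[src]."""
    retimed = [(u, v, w + lag[v] - lag[u]) for u, v, w in edges]
    if any(w < 0 for _, _, w in retimed):
        raise ValueError("illegal retiming: negative register count")
    return retimed

def cycle_registers(edges):
    """Total registers around the loop."""
    return sum(w for _, _, w in edges)

lag = {"a": 0, "b": -1, "c": -1}  # illustrative lags only
moved = retime(edges, lag)

# The single register moved from edge a->b to edge c->a,
# but the loop still holds exactly one register.
```

Any sum of r(v) - r(u) terms around a closed loop cancels, so the register count on a feedback loop is fixed under retiming alone.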

4
C-Slow Retiming

Replace every register with a sequence of C
registers.
With more registers, retiming can break the
design into finer pieces
Again proposed by Leiserson et al, to meet
systolic slowdown
Semantic altering transformation
But resulting semantics are predictable and
useful
Ideal: C-slow in synthesis, retime after
placement
Our prototype: C-slow and retime after placement
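
As a rough sketch of the transformation on a graph model (edges annotated with register counts), C-slowing multiplies every register count by C. The loop, its 20 ns delay, and the helper names below are assumptions for illustration. The bound computed is the idealized one: total loop delay divided by registers on the loop, ignoring register overhead and the fact that logic cannot be split at arbitrary points.

```python
from fractions import Fraction

# Hypothetical feedback loop: (src, dst, register_count) per edge.
loop = [("a", "b", 1), ("b", "c", 0), ("c", "a", 0)]
LOOP_DELAY_NS = 20  # assumed total combinational delay around the loop

def c_slow(edges, c):
    """Replace every register with a chain of c registers."""
    return [(u, v, w * c) for u, v, w in edges]

def loop_bound_ns(edges, delay_ns):
    """Idealized best clock period for one loop: delay / registers on it."""
    registers = sum(w for _, _, w in edges)
    return Fraction(delay_ns, registers)

before = loop_bound_ns(loop, LOOP_DELAY_NS)             # 20 ns
after = loop_bound_ns(c_slow(loop, 4), LOOP_DELAY_NS)   # 5 ns
```

The extra registers are what give retiming room to cut the loop into finer pieces.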

5
Design Semantics After C-Slowing

Design operates on C independent data streams
Data streams are externally interleaved on round
robin basis
Semantics apply to designs with Task Level
Parallelism
Encryption
Counter (CTR) mode works on independent blocks
Sequence matching
Compare sequence vs database
C-slowing improves throughput but adds latency
and registers
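
The interleaved-stream semantics can be checked with a small software model. The sketch below is a toy, not one of the benchmarks: a running-sum accumulator stands in for a design with one feedback register, and the stream values are invented. Replacing that register with a chain of C registers and feeding C streams round robin yields the same per-stream results as C separate copies of the original design.

```python
from collections import deque

def accumulate(stream):
    """Original design: one feedback register holding a running sum."""
    state, out = 0, []
    for x in stream:
        state += x
        out.append(state)
    return out

def c_slow_accumulate(interleaved, c):
    """C-slowed design: the feedback register becomes a chain of c registers."""
    regs = deque([0] * c)           # c registers in the feedback path
    out = []
    for x in interleaved:
        state = regs.popleft() + x  # oldest register feeds the adder
        regs.append(state)          # result re-enters the chain
        out.append(state)
    return out

streams = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
c = len(streams)
# Round-robin interleave: one item from each stream per turn.
interleaved = [s[i] for i in range(3) for s in streams]
result = c_slow_accumulate(interleaved, c)
per_stream = [result[k::c] for k in range(c)]  # de-interleave the outputs
```

Each stream sees its own state only every C-th cycle, which is where the added latency per stream comes from.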

6
C-slowing, Retiming, and the Virtex FPGA

Every 4-LUT has associated register
Register can, almost always, be used
independently of the LUT
LUTs can act as clocked shift registers (SRL16s)
Used in our AES hand-benchmark
Not used in our tool
Many designs have low register utilization
Excess of registers available in unoptimized
designs
Retiming best performed with/after placement
Xilinx placement operates on mapped slices
Need net delay information for better results

7
Sketch of Tool's Operation

Convert .ncd to .xdl after placement
Load design into graph representation
Replace registers with edge annotations to
represent registers
Replace every single register with C registers
Compute costs based on delay model
Retime
Convert edge annotations back to instance
registers
Write out .xdl, convert to .ncd
Route
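
The register-to-annotation and C-slow steps above can be sketched as follows. This is a minimal illustration and not the prototype's code: the netlist representation, the instance names, and the function name are hypothetical. It assumes there are no register-only cycles and at most one path between any pair of logic instances.

```python
def registers_to_edge_weights(nets, registers):
    """Collapse register instances into per-edge register counts.

    nets: list of (src, dst) connections between instances.
    registers: set of instance names that are flip-flops.
    Returns {(src_logic, dst_logic): register_count}.
    """
    fanout = {}
    for src, dst in nets:
        fanout.setdefault(src, []).append(dst)

    edges = {}
    for src in fanout:
        if src in registers:
            continue  # only start walks from logic instances
        stack = [(dst, 0) for dst in fanout[src]]
        while stack:
            node, count = stack.pop()
            if node in registers:
                # Walk through the register, counting it on the edge.
                stack.extend((nxt, count + 1) for nxt in fanout.get(node, []))
            else:
                edges[(src, node)] = count
    return edges

# Hypothetical placed netlist: lut_a -> ff1 -> lut_b -> lut_a
nets = [("lut_a", "ff1"), ("ff1", "lut_b"), ("lut_b", "lut_a")]
graph = registers_to_edge_weights(nets, {"ff1"})

# C-slow step: every register becomes C registers (here C = 3).
c_slowed = {edge: count * 3 for edge, count in graph.items()}
```

After retiming adjusts the counts, the reverse step would turn each count back into that many register instances along the edge.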

8
Experiment 1: How Good is the Tool?

Tool is a simple prototype
Manhattan distance delay estimate
No attempt to minimize flip-flops
Basic flip-flop allocation
Two benchmarks: AES and Smith/Waterman
Hand mapped
(optionally) hand placed
(optionally) hand C-slowed and retimed
Our best hand AES implementation:
1.3 Gb/s
<800 Slices, 10 BlockRAMs
$10 part, Spartan II-100
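
A Manhattan-distance delay estimate of the kind listed on this slide might look like the sketch below. The constants, the (row, column) coordinate convention, and the function name are placeholders, not values from the tool or from Xilinx timing data.

```python
def manhattan_delay_ns(src, dst, base_ns=1.0, ns_per_hop=0.5):
    """Estimate net delay from placed (row, col) coordinates.

    base_ns and ns_per_hop are assumed placeholder constants.
    """
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return base_ns + ns_per_hop * hops

# Two hypothetical slice locations, 3 rows and 4 columns apart (7 hops).
estimate = manhattan_delay_ns((2, 3), (5, 7))
```

An estimate like this is cheap to compute after placement but ignores the actual routing, which is one reason the slide calls the tool a simple prototype.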

9
Experiment 1: AES, Automatically Placed

Version                 Clock Rate (Throughput)   Stream Clock Rate (1 / Latency)
Initial Design          48 MHz                    48 MHz
5-Slow by hand          105 MHz                   21 MHz
Retimed Automatically   47 MHz                    47 MHz
2-Slow Automatically    64 MHz                    32 MHz
3-Slow Automatically    75 MHz                    25 MHz
4-Slow Automatically    87 MHz                    21 MHz
5-Slow Automatically    88 MHz                    18 MHz

Just retiming is of no benefit
Automatic C-slowing very effective
But could do even better
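
The two table columns are related by simple arithmetic: the per-stream clock is the design clock divided by C, and the throughput gain is the design clock relative to the initial 48 MHz. The sketch below recomputes both from the automatic AES rows; the function names are illustrative, and the table appears to show the per-stream values rounded to whole MHz.

```python
INITIAL_MHZ = 48
automatic = {2: 64, 3: 75, 4: 87, 5: 88}  # C -> clock rate in MHz, from the table

def stream_rate(clock_mhz, c):
    """Clock rate seen by each of the c interleaved streams."""
    return clock_mhz / c

def speedup(clock_mhz, base_mhz=INITIAL_MHZ):
    """Aggregate throughput relative to the initial design."""
    return clock_mhz / base_mhz

rows = {c: (stream_rate(mhz, c), speedup(mhz)) for c, mhz in automatic.items()}
```

For the 5-slow row this gives 17.6 MHz per stream and a throughput gain of about 1.83x, against about 2.19x for the hand 5-slowed design at 105 MHz.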

10
Experiment 1: Smith/Waterman, Automatically Placed

Version                 Clock Rate (Throughput)   Stream Clock Rate (1 / Latency)
Initial Design          43 MHz                    43 MHz
4-Slow by hand          90 MHz                    22 MHz
Retimed Automatically   40 MHz                    40 MHz
2-Slow Automatically    69 MHz                    34 MHz
3-Slow Automatically    84 MHz                    28 MHz
4-Slow Automatically    76 MHz                    25 MHz

Again, just retiming is of no benefit
C-slowing highly effective
Within 7% of hand-built implementation

11
Experiment 1: Comments

Just retiming is of no benefit
Both designs limited by single cycle feedback
loops
C-Slowing very effective
Able to automatically nearly double throughput
Hand implementations more than doubled throughput
Reasonable numbers of additional registers
Limitations of prototype tool
Flip-flop allocation routines could be better
Some AES hand benchmarks used SRL16 delay chains
Simple is pretty good
Relatively simplistic implementation gets
reasonably close to hand-mapped performance

12
Experiment 2: Retiming LEON

Can we automatically C-slow a large, synthesized
design?
Leon 1: A synthesized, GPLed SPARC-compatible
microprocessor core [1]
5 stage pipeline, integer only
Modify register file to use BlockRAMs
BlockRAMs are used as negative edge devices
Remove caches, I/O, etc
Synthesize, using Synplify with CEs disabled
Edit EDIF to replace Sets/Resets
Retime and C-slow with prototype tool
Prototype tool converts BlockRAMs to positive
edge
C-slow a microprocessor core...
Get an interleaved multithreaded architecture

[1] Leon 1, by Jiri Gaisler, http://www.gaisler.com/leonmain.html
13
Experiment 2: Results

Version                 Clock Rate (Throughput)   Thread Clock Rate (Latency)   LUT-Associated Flip-Flops   LUT-Independent Flip-Flops
Initial Design          23 MHz                    23 MHz                        1611                        N/A
Retimed Automatically   25 MHz                    25 MHz                        2398                        194
2-Slow Automatically    46 MHz                    23 MHz                        2150                        388
3-Slow Automatically    47 MHz                    16 MHz                        2438                        3713

Retiming alone worked surprisingly well
2-slowing very effective
3-slowing hit diminishing returns

6132 LUTs for all designs
14
Experiment 2: Comments

Retiming alone worked surprisingly well
Tool automatically converted BlockRAMs to
positive-edge clocking and rebalanced the
pipeline
2-slowing very effective
Effectively doubled the initial throughput
NO slowdown in latency over initial design
because retiming was effective without C-slowing
Used many more registers, but still fewer
registers than LUTs
3-slowing hit diminishing returns
Too many registers required combined with poor
register allocation → poor performance
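
The register comments can be checked against the results table with a short calculation. The sketch assumes the two flip-flop columns are disjoint counts that can be added, which the slides do not state explicitly; the dictionary keys are shorthand for the table rows.

```python
LUTS = 6132  # LUT count reported for all LEON designs

# (LUT-associated flip-flops, LUT-independent flip-flops) from the results table
flip_flops = {
    "Retimed": (2398, 194),
    "2-Slow": (2150, 388),
    "3-Slow": (2438, 3713),
}

totals = {name: assoc + indep for name, (assoc, indep) in flip_flops.items()}
exceeds_luts = {name: total > LUTS for name, total in totals.items()}
```

Under that assumption the 2-slow design uses 2538 flip-flops, well under the LUT count, while the 3-slow design uses 6151, slightly more flip-flops than LUTs.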

15
Conclusions