Post Placement C-Slow Retiming for Xilinx Virtex FPGAs

1
Post Placement C-Slow Retiming for Xilinx Virtex
FPGAs
Nicholas Weaver Yury Markovskiy Yatish Patel John Wawrzynek
  • UC Berkeley Reconfigurable Architectures,
    Systems, and Software (BRASS) Group
  • ACM Symposium on Field Programmable Gate Arrays
    (FPGA)
  • February 2x, 2003
  • http://www.cs.berkeley.edu/nweaver/cslow.html

2
Outline
  • Automatically Double Your Throughput
  • You paid for those registers, here's how to
    use them
  • Retiming and C-slow Retiming
  • The transformation
  • C-slow Retiming and the Virtex FPGA
  • The target
  • Retiming 3 Benchmarks
  • The tests

3
Retiming and Repipelining
  • Retiming
  • Automatically moving registers to minimize the
    clock period
  • Benefits limited by the number of registers
  • Algorithm developed by Leiserson et al
  • Repipelining
  • Adding registers to the front or back
  • Let retiming then move them around
  • But What About Feedback Loops?
  • Retiming and repipelining are of limited benefit
    when you have feedback loops
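
A rough illustration of why feedback loops cap the benefit of retiming. In the Leiserson-Saxe formulation, registers are integer weights on graph edges and retiming assigns each node a lag. The lags cancel around any cycle, so the register count in a loop never changes and the clock period cannot drop below the loop's total delay divided by its register count. The three-node loop, the delays, and the function names are invented for this sketch.

```python
# Hypothetical 3-node feedback loop a -> b -> c -> a; delays (ns) are made up.
delay = {"a": 3, "b": 4, "c": 5}
edges = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 1}  # weight = register count
loop = ["a", "b", "c"]

def retime(edges, lag):
    """Leiserson-Saxe retiming: w_r(u -> v) = w(u -> v) + lag[v] - lag[u]."""
    return {(u, v): w + lag[v] - lag[u] for (u, v), w in edges.items()}

def loop_registers(edges, loop):
    """Total registers on the edges around a cycle given as a node list."""
    pairs = zip(loop, loop[1:] + loop[:1])
    return sum(edges[p] for p in pairs)

def loop_bound(delay, edges, loop):
    """Lower bound on the clock period imposed by one feedback loop."""
    return sum(delay[n] for n in loop) / loop_registers(edges, loop)

# Move the single register from edge c->a to edge b->c.
moved = retime(edges, {"a": 0, "b": 0, "c": 1})
print(loop_registers(edges, loop), loop_registers(moved, loop))
print(loop_bound(delay, moved, loop))
```

The register can be placed on any edge of the loop, but there is still only one of it, so the bound stays at the full loop delay.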

4
C-Slow Retiming
  • Replace every register with a sequence of C
    registers
  • With more registers, retiming can break the
    design into finer pieces
  • Again proposed by Leiserson et al., to meet
    systolic slowdown
  • Semantic-altering transformation
  • But resulting semantics are predictable and
    useful
  • Ideal: C-slow in synthesis, retime after
    placement
  • Our prototype: C-slow and retime after placement
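
A minimal sketch of the transformation on a toy graph, assuming registers are stored as edge weights. C-slowing multiplies every weight by C; retiming then spreads the extra registers around the loop. The graph, the delays, the chosen lags, and all function names are illustrative assumptions, not the prototype tool's code.

```python
# Hypothetical feedback loop a -> b -> c -> a with made-up delays in ns.
delay = {"a": 3, "b": 4, "c": 5}
edges = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 1}  # weight = register count

def c_slow(edges, c):
    """Replace every register with a chain of c registers."""
    return {e: w * c for e, w in edges.items()}

def retime(edges, lag):
    """w_r(u -> v) = w(u -> v) + lag[v] - lag[u]."""
    return {(u, v): w + lag[v] - lag[u] for (u, v), w in edges.items()}

def clock_period(delay, edges):
    """Longest register-free path; assumes every cycle holds a register."""
    succ = {}
    for (u, v), w in edges.items():
        if w == 0:
            succ.setdefault(u, []).append(v)
    memo = {}
    def longest(n):
        if n not in memo:
            tail = max((longest(m) for m in succ.get(n, [])), default=0)
            memo[n] = delay[n] + tail
        return memo[n]
    return max(longest(n) for n in delay)

slowed = c_slow(edges, 3)                            # 3 registers on c -> a
spread = retime(slowed, {"a": 0, "b": 1, "c": 2})    # one register per edge
print(clock_period(delay, edges), clock_period(delay, slowed),
      clock_period(delay, spread))
```

C-slowing by itself leaves the critical path unchanged; the gain comes only once retiming redistributes the added registers.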

5
Design Semantics After C-Slowing
  • Design operates on C independent data streams
  • Data streams are externally interleaved on round
    robin basis
  • Semantics apply to designs with Task Level
    Parallelism
  • Encryption
  • Counter (CTR) mode works on independent blocks
  • Sequence matching
  • Compare sequence vs database
  • C-slowing improves throughput but adds latency
    and registers
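
The interleaved semantics can be checked with a small simulation. This sketch uses a running-sum accumulator as a stand-in design (an assumption; it is not one of the benchmarks) and models its C-slowed version by replacing the single feedback register with a chain of C registers.

```python
from collections import deque

def run_accumulator(xs):
    """Original design: one feedback register holding a running sum."""
    acc, out = 0, []
    for x in xs:
        acc += x
        out.append(acc)
    return out

def run_c_slowed(interleaved, c):
    """C-slowed design: the feedback register becomes a chain of c registers."""
    regs = deque([0] * c)
    out = []
    for x in interleaved:
        # The value leaving the chain was written c cycles ago,
        # i.e. by the same stream that owns the current cycle.
        nxt = regs.popleft() + x
        regs.append(nxt)
        out.append(nxt)
    return out

stream_a, stream_b = [1, 2, 3], [10, 20, 30]
interleaved = [x for pair in zip(stream_a, stream_b) for x in pair]
out = run_c_slowed(interleaved, 2)
print(out[0::2], out[1::2])
```

De-interleaving the output gives each stream exactly what the original design would have produced for it, which is why the transformation suits workloads with independent streams.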

6
C-slowing, Retiming, and the Virtex FPGA
  • Every 4-LUT has associated register
  • Register can, almost always, be used
    independently of the LUT
  • LUTs can act as clocked shift registers (SRL16s)
  • Used in our AES hand-benchmark
  • Not used in our tool
  • Many designs have low register utilization
  • Excess of registers available in unoptimized
    designs
  • Retiming best performed with/after placement
  • Xilinx placement operates on mapped slices
  • Need net delay information for better results

7
Sketch of Tool's Operation
  1. Convert .ncd to .xdl after placement
  2. Load design into graph representation
  3. Replace registers with edge annotations to
    represent registers
  4. Replace every single register with C registers
  5. Compute costs based on delay model
  6. Retime
  7. Convert edge annotations back to instance
    registers
  8. Write out .xdl, convert to .ncd
  9. Route
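
Steps 3 and 4 above can be sketched as follows. The netlist encoding (a dict of name to kind and drivers) is a made-up stand-in for the parsed .xdl design, and the sketch ignores parallel connections between the same pair of nodes, which a real tool would have to keep distinct.

```python
def registers_to_edges(netlist):
    """Collapse flip-flop instances into register counts on edges.

    netlist: name -> (kind, [driver names]), kind in {"IN", "LUT", "FF"}.
    Returns {(src, dst): register_count} between non-register nodes.
    """
    edges = {}
    for name, (kind, drivers) in netlist.items():
        if kind == "FF":
            continue
        for d in drivers:
            count = 0
            while netlist[d][0] == "FF":   # walk back through the FF chain
                count += 1
                d = netlist[d][1][0]       # a flip-flop has one data input
            edges[(d, name)] = count
    return edges

# Hypothetical design: lut1 -> ff1 -> lut2 -> ff2 -> back into lut1.
netlist = {
    "in":   ("IN",  []),
    "lut1": ("LUT", ["in", "ff2"]),
    "ff1":  ("FF",  ["lut1"]),
    "lut2": ("LUT", ["ff1"]),
    "ff2":  ("FF",  ["lut2"]),
}
edges = registers_to_edges(netlist)            # step 3
two_slow = {e: 2 * w for e, w in edges.items()}  # step 4 with C = 2
print(edges, two_slow)
```

Step 7 is the inverse: each edge with weight w is expanded back into a chain of w flip-flop instances before the design is written out.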

8
Experiment 1: How Good is the Tool?
  • Tool is a simple prototype
  • Manhattan distance delay estimate
  • No attempt to minimize flip-flops
  • Basic flip-flop allocation
  • Two benchmarks: AES and Smith/Waterman
  • Hand mapped
  • (optionally) hand placed
  • (optionally) hand C-slowed and retimed
  • Our best hand AES implementation:
  • 1.3 Gb/s
  • <800 Slices, 10 BlockRAMs
  • $10 part, Spartan II-100
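
A Manhattan distance delay estimate can be as simple as the sketch below. The slides do not give the tool's actual coefficients, so the base and per-step delays here are placeholder values, and the function names are hypothetical.

```python
def manhattan(src, dst):
    """Manhattan distance between two placed (column, row) locations."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def estimate_net_delay_ps(src, dst, base_ps=500, per_step_ps=100):
    """Linear delay model: fixed cost plus a cost per grid step.

    base_ps and per_step_ps are placeholder values, not measured
    Virtex numbers.
    """
    return base_ps + per_step_ps * manhattan(src, dst)

print(estimate_net_delay_ps((2, 3), (5, 7)))
```

A model like this needs only placement coordinates, which is why it can run before routing, at the cost of ignoring the routing resources a net actually uses.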

9
Experiment 1: AES, Automatically Placed
Version                 Clock Rate (Throughput)   Stream Clock Rate (1 / Latency)
Initial Design          48 MHz                    48 MHz
5-Slow by hand          105 MHz                   21 MHz
Retimed Automatically   47 MHz                    47 MHz
2-Slow Automatically    64 MHz                    32 MHz
3-Slow Automatically    75 MHz                    25 MHz
4-Slow Automatically    87 MHz                    21 MHz
5-Slow Automatically    88 MHz                    18 MHz
  • Just retiming is of no benefit
  • Automatic C-slowing very effective
  • But could do even better
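
The two columns of the AES table are related by simple arithmetic: each of the C streams gets one cycle in every C, so the per-stream clock is the clock rate divided by C, while the throughput gain is the clock rate relative to the initial design. The sketch below recomputes both from the slide's numbers; the slide's per-stream figures appear to be given in whole MHz, so small differences are expected.

```python
initial_mhz = 48
hand_5_slow_mhz = 105
auto = {2: 64, 3: 75, 4: 87, 5: 88}   # C -> clock rate in MHz, from the slide

def stream_rate(clock_mhz, c):
    """Clock rate seen by one of the c interleaved streams."""
    return clock_mhz / c

def speedup(clock_mhz, baseline_mhz):
    """Aggregate throughput gain over the baseline design."""
    return clock_mhz / baseline_mhz

for c, mhz in auto.items():
    print(f"{c}-slow: {speedup(mhz, initial_mhz):.2f}x throughput, "
          f"{stream_rate(mhz, c):.1f} MHz per stream")
print(f"5-slow by hand: {speedup(hand_5_slow_mhz, initial_mhz):.2f}x throughput")
```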

10
Experiment 1: Smith/Waterman, Automatically Placed
Version                 Clock Rate (Throughput)   Stream Clock Rate (1 / Latency)
Initial Design          43 MHz                    43 MHz
4-Slow by hand          90 MHz                    22 MHz
Retimed Automatically   40 MHz                    40 MHz
2-Slow Automatically    69 MHz                    34 MHz
3-Slow Automatically    84 MHz                    28 MHz
4-Slow Automatically    76 MHz                    25 MHz
  • Again, just retiming is of no benefit
  • C-slowing highly effective
  • Within 7% of hand-built implementation

11
Experiment 1: Comments
  • Just retiming is of no benefit
  • Both designs limited by single cycle feedback
    loops
  • C-Slowing very effective
  • Able to automatically nearly double throughput
  • Hand implementations more than doubled throughput
  • Reasonable numbers of additional registers
  • Limitations of prototype tool
  • Flip-flop allocation routines could be better
  • Some AES hand benchmarks used SRL16 delay chains
  • Simple is pretty good
  • Relatively simplistic implementation gets
    reasonably close to hand-mapped performance

12
Experiment 2: Retiming LEON
  • Can we automatically C-slow a large, synthesized
    design?
  • Leon 1: a synthesized, GPLed, SPARC-compatible
    microprocessor core [1]
  • 5 stage pipeline, integer only
  • Modify register file to use BlockRAMs
  • BlockRAMs are used as negative edge devices
  • Remove caches, I/O, etc
  • Synthesize, using Synplify with CEs disabled
  • Edit EDIF to replace Sets/Resets
  • Retime and C-slow with prototype tool
  • Prototype tool converts BlockRAMs to positive
    edge
  • C-slow a microprocessor core...
  • Get an interleaved multithreaded architecture

[1] Leon 1, by Jiri Gaisler, http://www.gaisler.com/leonmain.html
13
Experiment 2: Results
Version                 Clock Rate (Throughput)   Thread Clock Rate (Latency)   LUT-Associated Flip-Flops   LUT-Independent Flip-Flops
Initial Design          23 MHz                    23 MHz                        1611                        NA
Retimed Automatically   25 MHz                    25 MHz                        2398                        194
2-Slow Automatically    46 MHz                    23 MHz                        2150                        388
3-Slow Automatically    47 MHz                    16 MHz                        2438                        3713
  • Retiming alone worked surprisingly well
  • 2-slowing very effective
  • 3-slowing hit diminishing returns

6132 LUTs for all designs
14
Experiment 2: Comments
  • Retiming alone worked surprisingly well
  • Tool automatically converted BlockRAMs to
    positive-edge clocking and rebalanced the
    pipeline
  • 2-slowing very effective
  • Effectively doubled the initial throughput
  • NO slowdown in latency over initial design
    because retiming was effective without C-slowing
  • Used many more registers, but fewer registers
    than LUTs
  • 3-slowing hit diminishing returns
  • Too many registers required combined with poor
    register allocation → poor performance
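
A quick tally of the Experiment 2 flip-flop counts against the 6132 LUTs makes the register pressure concrete. This assumes the LUT-associated and LUT-independent columns are disjoint and that their sum is the design's total flip-flop count, which the slides do not state explicitly.

```python
luts = 6132
ffs = {  # version -> (LUT-associated, LUT-independent) flip-flops, from the table
    "Retimed": (2398, 194),
    "2-Slow": (2150, 388),
    "3-Slow": (2438, 3713),
}

# Assumed: total flip-flops = associated + independent.
totals = {version: assoc + indep for version, (assoc, indep) in ffs.items()}
over_budget = [version for version, total in totals.items() if total > luts]

for version, total in totals.items():
    print(f"{version}: {total} flip-flops, {total / luts:.2f} per LUT")
print("More flip-flops than LUTs:", over_budget)
```

Under that assumption only the 3-slow design needs more flip-flops than there are LUTs, which fits the slide's explanation for its poor result.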

15
Conclusions
  • C-slow retiming is very effective
  • "Automatically double your throughput"
  • Benefits: more throughput
  • Costs: more flip-flops, worse latency
  • Post-placement retiming appropriate
  • Independent Flip Flop usage critical
  • Have delay model for interconnect as well as
    logic
  • Some room for improvement
  • Faster/Better implementation
  • Minimize Flip Flop usage as well as delay
  • Use SRL16s
  • Better placement of Flip Flops
  • Experience suggests more Flip Flops/LUT would be
    useful

16
Backup Slide: Why Not Use (Current) Synthesis
Tools?
  • Many synthesis tools support retiming, but with
    caveats
  • ONLY works for synthesized items
  • AES and Smith/Waterman didn't use synthesis
  • Can't automatically C-slow
  • Can't retime through memory blocks
  • Can't accurately guesstimate interconnect delay
    before placement
  • >½ of the delay is the interconnect
  • Can't effectively scavenge unused flip-flops
    before placement
  • Xilinx placement operates on slices, not LUTs

17
Backup Slide: Why the limitations on total
speedup?
  • Absolute maximum
  • Interconnect + LUT + Flip-Flop
  • Practical maximums
  • Too many flip-flops to allocate
  • Only one flip-flop per LUT available
  • Flip-flop allocation poor
  • Quick and dirty greedy heuristic
  • Works well for mild C-slowing
  • Fails with highly aggressive C-slowing
  • Tool doesn't minimize flip-flops
  • Critical path is defined by the single worst path
  • Tool uses cheap and dirty interconnect delay
    model

18
(Backup Slide) Design Restrictions to Enable
C-slowing
  • Resets and Clock Enables
  • Convert to explicit logic
  • Memories
  • Increase by a factor of C
  • Add high bits of addr to provide round-robin
    access
  • Every stream sees an independent memory
  • Global Set/Reset
  • Convert to individual resets
  • Still highly restrictive
  • Interleave/deinterleave IO
  • Requires external logic
  • No asynchronous sets/resets
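
The memory restriction above can be sketched as a small model: the memory grows by a factor of C, and the index of the stream that owns the current cycle supplies the high address bits. The class and its interface are hypothetical and only illustrate the addressing scheme.

```python
class CSlowedMemory:
    """Memory enlarged by a factor of c; each stream sees a private copy."""

    def __init__(self, depth, c):
        self.depth = depth
        self.c = c
        self.cells = [0] * (depth * c)
        self.cycle = 0

    def _physical(self, addr):
        stream = self.cycle % self.c          # round-robin stream index
        # With a power-of-two depth this equals placing the stream
        # index in the high address bits.
        return stream * self.depth + addr

    def access(self, addr, write=None):
        """One access per clock cycle; optionally write, then return the cell."""
        p = self._physical(addr)
        if write is not None:
            self.cells[p] = write
        value = self.cells[p]
        self.cycle += 1
        return value

mem = CSlowedMemory(depth=4, c=2)
mem.access(1, write=11)      # cycle 0: stream 0 writes address 1
mem.access(1, write=22)      # cycle 1: stream 1 writes address 1
seen_by_stream0 = mem.access(1)   # cycle 2: stream 0 reads address 1
seen_by_stream1 = mem.access(1)   # cycle 3: stream 1 reads address 1
print(seen_by_stream0, seen_by_stream1)
```

Both streams use the same logical address, yet neither overwrites the other's data.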
