Title: Design Techniques for Million Gate, High Speed FPGAs
1. Design Techniques for Million Gate, High Speed FPGAs
Michael A. Bohm Chief Scientist Technical Fellow
Mentor Graphics
2. Agenda
- The Problem
- State-of-the-Art Technology
- Design Issues
- Performance Oriented Design
3. The Problem
How do we move mainstream designs from ASICs to high performance FPGAs?
4. State-of-the-Art 2000
- Technology
- Gate Count
- Frequency
- Clock Domains
- Computer Hardware
- Design Software
- RTL Language
- Design
5. Those who cannot remember the past are condemned to repeat it.
From The Life of Reason, by George Santayana, 1906
Technology is changing rapidly. It took 21 years to get to a 1 GHz processor. It will take 1 year to get to a 2 GHz processor.
6. State-of-the-Art Technology
Process Geometries
7. State-of-the-Art Gate Count
Gate Count (excluding memory)
8. State-of-the-Art Frequency
System Frequency
9. State-of-the-Art Clock Domains
10. State-of-the-Art Computer Design Hardware
Altera device   Xilinx device   RAM      Virtual Swap
EP20K160E       XCV300          128MB    256MB
EP20K400E       XCV600          256MB    400MB
EP20K600E       XCV1000         512MB    800MB
EP20K1000E      XCV2000         1GB      1GB
EP20K1500E      XCV3200         1.5GB    2GB
11. State-of-the-Art RTL Language
System
Algorithm
RTL
Logic
Gate
- Abstract Data Types
- Design reusability
- Compiled concepts
- Design Management
- Structure replication
- Fixed Data Types
- Easier to learn
- Interpreted concepts
- Gate Level Sign-off
12. State-of-the-Art Design
Text
- Co-simulation within HDL simulator
- Mix of HDL and user-defined C/C++
- Behavioral Synthesis
- Tight physical correlation.
Flow Chart
13. State-of-the-Art Failures
Failures
Logical                  55%
Slow Path                13%
Clocking                 10%
Power                    6%
Race Condition           4%
Yield                    4%
Misc                     3%
IR drops                 2%
Mixed signal interface   1%
FPGAs make a failure recoverable.
14. State-of-the-Art FPGA
- APEX and Virtex at 3 Million Gates
- Maximum Operating Frequency is 200 MHz (pushing 300 MHz)
- Large blocks of memory
- Embedded Processors (PowerPC, ARM, MIPS)
- Copper interconnect
15. The Development Gap
[Figure: design size, with curves for Ability to Fabricate, Ability to Design, and Ability to Verify; the Design Gap and the Verification Gap are marked.]
16. System / SOC Design Methodology
Embedded Software Development
Hardware / Software Coverification
Pre-existing Hardware
Hardware Development
Pre-existing Software
Manufacturing
17. Adjusting to a New Methodology
- Team Design
- IP Logic
- More software content
- Heavy with memory
- Less synthesis / more chip level assembly
'97 - ASIC, 50-150K gates
'99 - SOC, 1M gates
'02 - SOC, 10M gates
[Diagram blocks: Memory, BlockA, BlockB, CPU, IP, Block1]
18. Effects of the Design Flow
[Figure: design entry by abstraction level (VHDL, Verilog, C, Java; VHDL, Verilog, C; VHDL, Verilog, EDIF) against ratios of 20:1, 10:1, 5:1, 3:1, and 2:1.]
Higher abstraction provides more design choices!
19. ASIC versus FPGA design
M per re-spin!!
FPGA Design
Fab Chip
Physical Design
FPGA Synthesis
Logic Verif.
Logic Design
System Verification with fewer iterations
RTL Prototype
Software Dev. and Debug
20. A Designer's Life
Flow stages: Design Specification, Beh / RTL Description, Functional Verification, Synthesis, Place & Route, Timing Validation, System Verification
Percentages shown on the chart, in extracted order: 15, 8, 15, 7, 15, 20, 20
21. How to make a better designer
- Provide proper training
- Designers went to college to learn digital logic design, but most have less than 10 hours of RTL training.
- Provide a proven Design Methodology
- Enforce Design for Quality techniques
- Quality circuits are always easier to manufacture and are the most profitable.
- Functionality is only a minor part of the design process. Using Performance Oriented Design techniques is the key to successful product development.
22. Performance Oriented Design Techniques
The Keys to Success
- RTL Coding Styles
- Design Architecture trade-offs
- Design Structure
- Timing Optimization
- Physical Optimization
23. Coding style impact
- Coding style does impact performance
- It affects FPGAs more than ASICs
- Different level of RTL
- Different descriptions give different results
- Tools are also part of the equation
- Different tools give different results
- Learn to know your tool !!!
24. The Keys to Language Synthesis
- Data Types
- Packages
- Ports
- Hierarchy
- Combinational Logic
- Relational Operators
- Arithmetic Operators
- Sequential Logic
- Memory
- IOs
25. Structuring A Design
- A design should read like a book.
- Table of contents: An explanation of the design structure.
- Logical flow from beginning to end.
- Chapters: Logical breaks in a design.
- Commentary: Comments on complex structure in the design.
99% of all designs are unintelligible to another designer!
26. Source Code Control
Revision Comparison
The main difference between hardware and software
is the control!
27. Hierarchy
Partitioning between logical and virtual
hierarchy is key!
28. Understand what the RTL does!!
Every time you use an if-then-else, a 2:1 mux is built.
29. Serial / Priority Structure
The 1st branch of the if is the critical
signal. On some FPGAs, this structure is faster
than a case statement.
30. Parallel Structure
All logic branches are equal.
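The contrast between the serial / priority structure and the parallel structure can be sketched with a small software model. This is an illustrative Python model, not HDL and not any tool's output; the function names are hypothetical, and the level counts assume the logic is built purely from 2:1 muxes.

```python
def mux2(sel, when_true, when_false):
    """A 2:1 mux: the hardware built by a single if-then-else."""
    return when_true if sel else when_false

def priority_select(conds, values, default):
    """Model an if / elsif / ... / else chain as a chain of 2:1 muxes.

    The last branch is muxed first and the first branch last, so the
    first branch sits closest to the output and wins any tie.
    """
    out = default
    for cond, value in zip(reversed(conds), reversed(values)):
        out = mux2(cond, value, out)
    return out

def priority_levels(branch_index):
    """2:1 mux levels between branch `branch_index` (0 = first) and the output."""
    return branch_index + 1

def parallel_levels(n_inputs):
    """2:1 mux levels in a balanced tree: the same depth for every input."""
    return (n_inputs - 1).bit_length()
```

In this model the first branch of an 8-way priority chain passes through one mux level and the last through eight, while every input of an 8-way balanced tree passes through three. That is one way to read the advice to put the critical signal in the first branch.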
31. Tri-State
Internal tri-state buses are slow on most
FPGAs. Tri-states belong on the top level of the
design.
32. Bi-directional Buffer
A bi-directional bus causes timing loops. False paths need to be marked.
33. Relational Operators
Large relational operators (> 4 bits) are built out of high speed carry chains on the FPGA.
34. Addition Operators
- Adders are the #1 used operator in a design.
- Use constants wisely
- A + 2 => A + 1 with cin
- A - 2 => A - 1 with cin
- A + 8 => (A(high downto 3) + 1) & A(2 downto 0)
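The constant-addition rewrites on this slide can be checked with a short bit-level model. The sketch below is Python rather than HDL and assumes the slide's last rewrite reads A + 8 = (A(high downto 3) + 1) & A(2 downto 0), with unsigned wrap-around at a fixed width; the function names and the 16-bit default width are hypothetical.

```python
def add8_split(a, width=16):
    """(a + 8) mod 2**width, incrementing only bits width-1 downto 3."""
    a &= (1 << width) - 1
    upper = (a >> 3) + 1                 # A(high downto 3) + 1
    upper &= (1 << (width - 3)) - 1      # drop the carry out, as fixed-width hardware does
    lower = a & 0b111                    # A(2 downto 0) passes through untouched
    return (upper << 3) | lower

def add2_with_cin(a, width=16):
    """(a + 2) mod 2**width as an incrementer with its carry-in tied to 1."""
    cin = 1
    return (a + 1 + cin) & ((1 << width) - 1)
```

The point of the split form is that the adder is three bits narrower, because the low three bits never change when adding 8.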
35. Resource Sharing (when it really hurts)
if (B > C) then
  sig <= A + B;
else
  sig <= A + C;
end if;
Resource Sharing OFF: Total LUTs 64, Clock Freq 133.3 MHz (52% faster!!!)
Resource Sharing ON: Total LUTs 32, Clock Freq 87.7 MHz
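The two implementations being compared can be modeled in a few lines. This is a sketch in Python, assuming the operator dropped from the slide's snippet is addition; the function names are hypothetical and the model captures only the structure, not LUT counts or timing.

```python
def without_sharing(a, b, c):
    """Two adders run in parallel; a 2:1 mux then picks one result."""
    sum_ab = a + b
    sum_ac = a + c
    return sum_ab if b > c else sum_ac

def with_sharing(a, b, c):
    """A 2:1 mux picks the operand first; a single adder follows it."""
    operand = b if b > c else c
    return a + operand
```

Both functions return the same value. In the shared form the comparison and mux sit in front of the adder instead of beside it, which is consistent with the slide's result of fewer LUTs at a lower clock frequency.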
36. Multiplication Operator
- Most expensive operator
- Slowest operator, unless built into the FPGA.
- When multiplying by a constant, use a CSD multiplier.
- Use constants wisely
- A * 2 => A sll 1
- A * 3 => (A sll 1) + A
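A CSD (canonical signed digit) multiplier recodes the constant into digits from {-1, 0, +1} with no two adjacent non-zero digits, so the product needs only shifts plus a small number of additions and subtractions. The following is a minimal Python sketch of that idea; the function names are hypothetical and it models arithmetic only, not the hardware a synthesis tool would produce.

```python
def csd_digits(n):
    """Canonical signed digits of n >= 0, least significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)   # n mod 4 == 1 gives +1, n mod 4 == 3 gives -1
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits

def csd_multiply(a, constant):
    """Multiply a by a non-negative constant using shifts, adds and subtracts."""
    total = 0
    for shift, digit in enumerate(csd_digits(constant)):
        if digit == 1:
            total += a << shift
        elif digit == -1:
            total -= a << shift
    return total
```

For example, 7 is recoded as 8 - 1, so multiplying by 7 takes one subtraction instead of the two additions that plain binary 111 would need.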
37. Pipelined Multipliers
- Improve timing by introducing parallelism
- Registers introduced by pipelining may have modest area impact
- Requirements
- Certain constructs in the input RTL source code description
- Output of the multiplier must be registered.
- Optimal pipeline stages = log2(input data bus width)
- A 16-bit data bus => optimal pipeline value of 4
- A 32-bit bus => optimal pipeline value of 5.
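The pipeline-depth rule is easy to tabulate. The sketch below is Python with hypothetical function names; rounding up for widths that are not powers of two is an assumption, since the slide only gives 16 and 32 bits. The second function shows one plausible reading of the rule, also an assumption: summing N partial products pairwise takes about log2(N) adder levels, so one register per level gives log2(N) stages.

```python
def optimal_pipeline_stages(bus_width):
    """log2(bus_width), rounded up when the width is not a power of two."""
    if bus_width < 1:
        raise ValueError("bus_width must be at least 1")
    return (bus_width - 1).bit_length()

def adder_tree_levels(n_partial_products):
    """Levels needed to sum the partial products two at a time."""
    levels = 0
    remaining = n_partial_products
    while remaining > 1:
        remaining = (remaining + 1) // 2   # each level halves the number of operands
        levels += 1
    return levels
```

For the two widths on the slide, both functions agree with the quoted values of 4 and 5.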
38. A little Algebra goes a long ways
Original Code               Modified Code                    Area Reduction
A - B = 0                   A = B                            80%
A * 9                       (A SHL 3) + A                    40%
A < 0                       A(A'high)                        90%
A + 1 when en = 1 else A    A + en                           60%
A when A > 0 else -A        (not A) + 1 when A(31) else A    30%
A * 2                       A SHL 1                          100%
- Minimize all arithmetic equations to eliminate operators.
- Frequency increased dramatically.
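The rewrites in this table can be checked for equivalence with a small model. The sketch below is Python, assuming 32-bit two's-complement values (the slide's A(31) suggests that width) and assuming the operators dropped from the table are -, =, *, +, < and >; the helper names are hypothetical.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

def to_signed(x):
    """Interpret the low WIDTH bits of x as a two's-complement value."""
    x &= MASK
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x

def msb(a):
    """A(A'high): the sign bit of a WIDTH-bit value."""
    return (a & MASK) >> (WIDTH - 1)

def check_rewrites(a, b, en):
    """Assert that each original expression matches its rewritten form."""
    assert ((a - b) == 0) == (a == b)                      # A - B = 0  vs  A = B
    assert to_signed(a * 9) == to_signed((a << 3) + a)     # A * 9      vs  (A SHL 3) + A
    assert (a < 0) == bool(msb(a))                         # A < 0      vs  sign bit
    assert (a + 1 if en == 1 else a) == a + en             # conditional increment vs A + en
    abs_original = a if a > 0 else -a
    abs_rewritten = to_signed(~a + 1) if msb(a) else a
    assert to_signed(abs_original) == abs_rewritten        # absolute value
    assert to_signed(a * 2) == to_signed(a << 1)           # A * 2      vs  A SHL 1
    return True
```

Calling check_rewrites over a range of signed 32-bit values, including the extremes, exercises every row; the model says nothing about the area figures, which come from the slide.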
39. D Flip-flop
Most FPGAs only have an Async Set or Reset DFF.
This will be translated to sync set and async
reset for FPGAs.
40. Complex Clock Enables
- Higher Frequency
- Denser Logic
Clock enables will only be found with 4-6 levels of logic. Use clock enables instead of a gated clock.
41. Latches
A latch is a 2 to 1 mux with the output fed back
to an input. This can put combinational loops in
your circuit depending on the FPGA Vendor.
42. Counter
Counters should either be built as a macro, or you should make sure the synthesis tool has counter recognition.
43. State Machine
- Tools have made progress with FSM compilers
- Reachability analysis, highly optimal results
- Extended encoding techniques
- Without an FSM compiler, one-hot is often the best choice
- Deflates the next state decoding logic cloud
- FSM compiler without Safe State
- Implements the functionality, however the state machine may not be totally bullet proof
- The Safe option
- default switch in the case may be ignored
- Recovery logic is implemented to go back to the reset state
- The Exact implementation
- You want a better match with simulation
- Performance is not an obstacle
- Your design works in a harsh environment
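The difference between a one-hot state machine with and without safe-state recovery can be sketched in software. The model below is Python, not the output of any FSM compiler; the state names, inputs, and the choice to hold an illegal code when recovery is off are all assumptions made for illustration.

```python
STATES = ("IDLE", "LOAD", "RUN", "DONE")            # hypothetical state names
ONE_HOT = {name: 1 << i for i, name in enumerate(STATES)}
RESET = ONE_HOT["IDLE"]

def is_one_hot(code):
    """True when exactly one bit of the state register is set."""
    return code != 0 and (code & (code - 1)) == 0

def next_state(code, start, finished, safe=True):
    """Next-state logic: each legal state is decoded from a single bit."""
    if code == ONE_HOT["IDLE"]:
        return ONE_HOT["LOAD"] if start else code
    if code == ONE_HOT["LOAD"]:
        return ONE_HOT["RUN"]
    if code == ONE_HOT["RUN"]:
        return ONE_HOT["DONE"] if finished else code
    if code == ONE_HOT["DONE"]:
        return RESET
    # Illegal code: no bit or several bits set, for example after an upset.
    return RESET if safe else code
```

With four states there are sixteen possible register values but only four legal ones. In this model the safe variant returns to the reset state from any illegal code, while the variant without recovery stays stuck there.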
44. State Machine
45. Read Only Memory (ROM)
- ROMs provide a method for setting don't cares
- Different algorithms are used on ROM logic.
- A ROM is just a RAM with initial programming.
- Indexing into a constant array is very efficient for simulation and synthesis
46. Single Port RAMs
47. Dual Port RAMs
48. Content Addressable Memory (CAM)
- Use a CAM when address translation is needed.
- Use CAMs for sparsely used addresses.
- CAMs replace large priority encoders.
- 60% area reduction
- 80% timing reduction
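The address-translation use of a CAM can be illustrated with a tiny behavioral model. This is a Python sketch with a hypothetical class and method names; real hardware compares every entry in the same cycle, which the loop here only imitates.

```python
class Cam:
    """Behavioral CAM: present a key, get back the index of the matching entry."""

    def __init__(self, depth):
        self.entries = [None] * depth

    def write(self, index, key):
        """Store a key (for example a sparse address) in slot `index`."""
        self.entries[index] = key

    def lookup(self, key):
        """Return (hit, index); the lowest matching index wins on duplicates."""
        for index, stored in enumerate(self.entries):
            if stored is not None and stored == key:
                return True, index
        return False, 0

# Usage: translate a few widely scattered addresses into dense slot numbers.
cam = Cam(depth=4)
cam.write(0, 0x40000000)
cam.write(1, 0x8000FF00)
cam.write(2, 0x00001234)
hit, slot = cam.lookup(0x8000FF00)
```

The dense slot number can then index a small RAM, which is the sense in which a CAM suits sparsely used addresses.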
49. Checklist for performance
- Pipeline for high performance
- Make hardware work in parallel
- Optimize late-arriving signals
- Control arithmetic circuits
- Use IP and hard-macros
50. Parallel Gates
Parallel Gates are removed during the pre-optimize stage!!
51. Attributes
- Attributes can be passed through HDL code
- Homogeneous syntax in VHDL for attributes
- No syntax checks, just passed through !
- Synthesis attributes helpful for...
- Improved usability
- Name preservation
- Replication
- Resource sharing
- Speed / area control
- FSM encoding
- Attributes enable...
- Mapping control
- DLLs setup
- IOB flop control
- Ram initialization
- Soft macros for speed
52. Physical Optimization
- Floor Plan your FPGA.
- Produces a faster circuit
- Circuit is more predictable and repeatable.
- Timing convergence occurs quickly.
- Back Annotate real timing data.
- Allows a 2nd pass of synthesis to work on real critical paths.
53. FPGA High-Level Floorplanner
- Tight links to Exemplar's synthesis tool.
- Position blocks into regions of device
- Generates area constraints
- Required for new Incremental design flow
- Useful for Design Planning
54. TimeCloser Flow
Optimization
- Allocation of clock resources
- Allocation of some routing resources (low skew)
Timing Optimization
- Critical path optimization
- Logic and register replication
- Clustering of critical path objects
- Allocation of routing resources for high-fanout nets
Manual Floor Planning
Place & Route
Incremental P&R
Back Annotation of P&R delays
Critical Path optimization (based upon real delay values)
55. Incremental Optimization using Incremental Files
[Flow diagram between Leonardo Spectrum and the P&R software; box labels as extracted:]
- Synthesize 1st pass, Critical Path Optimization
- EDIF Netlist, constraints
- Perform Initial Place and Route
- Incremental data
- Save Design in Incremental files (XDB format)
- Critical Path Timing Optimization
- Delay File
- Perform Timing Analysis
- Restore original Netlist
- Top-Level EDIF Netlist
- ECO or Incremental Flow
- Reoptimize only changed sub block
- Perform incremental place and route with guide files
Unique incremental flow to Leonardo Spectrum
56. Constraint Based Clustering
- Uses place and route timing data to improve device performance
- Reduces levels of logic on true critical paths
- Reduces route delay effects by using a timing driven clustering algorithm
57. Logic Replication
- Reduces route delay effects using logic replication and route optimization
- Useful to duplicate flip-flops and control fanout
- However you cannot prevent automatic replication from the tools
- Helps to manually control the fanout
- Keep the name of the nets in the netlist
- Very useful for simulation
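Manual fanout control by replication amounts to splitting the loads of one driver among several copies of it. The sketch below is a Python illustration of that bookkeeping only; the function name, the load names, and the naming of the replicated nets are hypothetical and do not reflect any tool's conventions.

```python
def replicate_for_fanout(loads, max_fanout):
    """Split the loads of one net into groups, one group per replicated driver."""
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    return [loads[i:i + max_fanout] for i in range(0, len(loads), max_fanout)]

# Usage: ten loads on a control flip-flop, at most four loads per copy.
loads = [f"load_{i}" for i in range(10)]
groups = replicate_for_fanout(loads, max_fanout=4)
replicas = {f"ctrl_rep{n}": group for n, group in enumerate(groups)}
```

Here the ten loads end up on three copies of the driver. Giving each copy a predictable net name is what keeps the netlist readable during simulation.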
58. Critical Path Restructuring
- Uses place and route timing data to improve device performance
- Reduces levels of logic on true critical paths
- Moves late arriving signals up the logic tree
59. User Applied Physical Constraints
- Preserve signals
- Assign nets to secondary routing resources
- Specify fanout on net by net basis