Title: Module 2
1Designing for 100 MHz
2 1999 Designs Demand...
- Higher system speed
- Higher integration
- smaller size, less power, better reliability
- Lower cost
- Shorter development time
- Better product differentiation
3Traditional Multi-Chip Boards
- Discrete design components
- CPU, memory
- bus transceivers, PCI controller, FIFOs
- Ethernet controller, Graphics accelerator, MPEG, DSP, etc.
- programmable logic as glue and custom function
- Advantages
- well-documented sophisticated functions
- readily available as IP in silicon
4Multi-Chip Board Problems
- Physical size
- Power consumption and reliability
- PC board signal integrity
- Limited flexibility
- prevents design modifications and upgrades
- prevents product diversification
- prevents product customization
- Poor product differentiation
- standard parts mean a standard architecture
5FPGA Advantages
- Smaller size
- Lower power consumption
- Better signal integrity
- fewer PC-board issues
- Enhanced flexibility
- easy modifications, upgrades, etc.
- Enhanced product differentiation
- proprietary architectures
6FPGA Users Want...
- System clock rate of 100 MHz
- >100,000 gates
- Efficient design methodologies
- Availability of well-documented Cores
- Reasonable cost
7The FPGA Solution
4th Generation FPGA: Logic, Memory, Routing
Delay-Locked Loop for Fast Clock and I/O
3.3 ns Synchronous Dual-Port SRAM
Multi-Standard Select I/O
500 Mbps SelectMAP Configuration
Temperature Sensing
8Now the Challenge...
Design a 100 MHz system
- Together, we can do it...
- we'll supply the ingredients...
- you use them intelligently
- But don't forget...
- the clock period is less than 10 ns !
9Designing for 100 MHz.
- Volts, Amps, and Watts
- PCB signal distribution
- chip inputs and outputs
- power and thermal considerations
- Ones and zeros
- logic emulation
- Bits and bytes
- memory hierarchy
10Moore Meets Einstein
[Figure: clock frequency (MHz) and trace length (inches per 1/4 clock period) plotted against year, 1965 to 2010, on a logarithmic scale from 1 to 2048]
- Speed doubles every 5 years... but the speed of light never changes
11Volts, Amps, and Watts
- PCB design issues
- capacitive loading
- transmission lines and termination
- Chip inputs and outputs
- clock distribution and DLLs
- I/O standards
- Power and thermal considerations
- temperature sensing diode
- power supply decoupling
- Configuration
- new SelectMAP mode
12Capacitive Loading
- Capacitance slows outputs and increases power
- output delay increase
- 25 ps per pF of additional loading
- output power dissipation increase
- 11 µW per MHz per pF with 3.3-V swing
- Sources of capacitance
- 10 pF max for each device pin
- 2 pF per inch for narrow traces ( 0.8 pF/cm )
- 130 pF per inch² for copper areas (20 pF/cm²)
- IBIS files provide output impedance details
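The rules of thumb on this slide can be turned into a quick estimate. The sketch below is only an illustration: the function names are hypothetical, and it treats the whole load as "additional" loading, which may overstate the delay if the data-sheet timing already assumes a reference load.

```python
# Rules of thumb quoted on the slide
DELAY_PS_PER_PF = 25.0           # added output delay per pF of load
POWER_UW_PER_MHZ_PER_PF = 11.0   # added output power per MHz per pF (3.3-V swing)
PIN_PF_MAX = 10.0                # worst-case capacitance of one device pin
TRACE_PF_PER_INCH = 2.0          # narrow PCB trace


def load_capacitance_pf(trace_inches, receiver_pins):
    """Total load seen by one output: trace plus receiving pins."""
    return trace_inches * TRACE_PF_PER_INCH + receiver_pins * PIN_PF_MAX


def added_delay_ns(load_pf):
    """Extra output delay caused by the load."""
    return load_pf * DELAY_PS_PER_PF / 1000.0


def added_power_mw(load_pf, toggle_mhz):
    """Extra output power for a pin toggling at toggle_mhz."""
    return load_pf * toggle_mhz * POWER_UW_PER_MHZ_PER_PF / 1000.0


# Example: 4 inches of narrow trace driving two device pins at 50 MHz
load = load_capacitance_pf(trace_inches=4, receiver_pins=2)
print(load, added_delay_ns(load), added_power_mw(load, 50))
```

With these assumed inputs the load is 28 pF, so even a short point-to-point net can cost a noticeable part of a 10 ns period.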
13Transmission Lines
- Some traces must be treated as transmission lines to minimize ringing
- transmission line if round trip > transition time
- lumped-capacitance if round trip < transition time
- Signal delay on a PCB
- 140 to 180 ps per inch (50 to 70 ps/cm)
- Lumped-capacitance trace length
- 3 inches max for a 1-ns transition time (7.5 cm)
- 6 inches max for a 2-ns transition time (15 cm)
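The round-trip rule above can be sketched as a small check. The function names are hypothetical, and the default of 180 ps per inch is simply the slow end of the range quoted on the slide.

```python
def round_trip_ns(length_inches, ps_per_inch=180.0):
    """Time for an edge to travel down the trace and back."""
    return 2 * length_inches * ps_per_inch / 1000.0


def is_transmission_line(length_inches, transition_ns, ps_per_inch=180.0):
    """True if the round trip takes longer than the signal transition."""
    return round_trip_ns(length_inches, ps_per_inch) > transition_ns


def max_lumped_length_inches(transition_ns, ps_per_inch=180.0):
    """Longest trace that can still be treated as a lumped capacitance."""
    return transition_ns * 1000.0 / (2 * ps_per_inch)


print(is_transmission_line(2, 1.0))         # short trace, 1-ns edge
print(is_transmission_line(6, 1.0))         # long trace, 1-ns edge
print(max_lumped_length_inches(1.0, 140.0))
print(max_lumped_length_inches(1.0, 180.0))
```

For a 1-ns transition the limit lands between roughly 2.8 and 3.6 inches across the quoted delay range, which is consistent with the 3-inch guideline.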
14Terminated Transmission Lines: Reflections and Ringing
- Traditional Thevenin termination at the end
- 100 Ω to VCC and 100 Ω to ground on a 50 Ω line
- Dynamic termination at the end is better and saves power
- 50 Ω in series with 100 pF on a 50 Ω line
- Series termination at the source is best: single source and destination only!
- 22 Ω to 27 Ω series resistor driving a 50 Ω line (50 Ω total)
15On-Chip Clock Distribution
[Figure: on-chip clock distribution from the Clock input to the CLBs and IOBs, alongside the Data path]
- Clock distribution introduces delay
- larger chips suffer more clock delay
16Clock Delay Problems
- Clock delay increases clock-to-output times
- Clock delay leads to unacceptable input hold time
- set-up time is negative
- Additional data delay can eliminate the hold time
- set-up time becomes positive
- but tolerance build-up widens the data-valid
window
[Figure: IOB flip-flop (D, Q) with an optional Delay on the Data input and the clock distribution delay on the Clock input; the timing diagram compares the required data-valid window without and with the delay]
17DLLs Maximize I/O Speed
- Clock-to-output time plus set-up time determines the I/O speed and data bandwidth
- min clock period = max clock-to-out + max set-up
- Traditional solution
- use highly buffered, balanced clock trees
- needed to reduce internal clock skew
- cannot totally eliminate the delay
- The Virtex solution
- use a Delay-Locked-Loop ( DLL )
- aligns the internal and external clocks
- effectively eliminates the clock-distribution
delay
18Virtex Has 4 Independent DLLs
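[Figure: DLL block diagram. The Clock input passes through an adjustable Delay, controlled by an Error Comparator, on its way to the CLBs and IOBs; Data enters through the IOB]
- DLLs adjust clock delay to align internal and external clocks
- digital closed-loop control
- 25 to 200-MHz range, 35-picosecond resolution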
19Fast Clock-to-Out With DLL
- 160 MHz inter-chip data rate
- 16-mA LVTTL
- IOB register to IOB register
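[Figure: two Virtex FPGAs, each with a DLL on the common Clock; data travels from an IOB register (Q) in the first chip to an IOB register (D) in the second; labelled delays 0.5 ns, 3.8 ns, and 1.9 ns]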
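The 160 MHz figure can be checked against the three delays labelled in the diagram. This is a rough sketch: the function name is hypothetical, and which delay is clock-to-out, board delay, or set-up is an assumption (the sum does not depend on that assignment).

```python
def max_data_rate_mhz(clock_to_out_ns, setup_ns, board_delay_ns=0.0):
    """Highest register-to-register rate: one period must cover
    clock-to-out, the board trace, and the receiver set-up time."""
    period_ns = clock_to_out_ns + board_delay_ns + setup_ns
    return 1000.0 / period_ns


# Delays from the LVTTL diagram (assignment of roles is assumed)
lvttl_mhz = max_data_rate_mhz(clock_to_out_ns=3.8, setup_ns=1.9,
                              board_delay_ns=0.5)
print(round(lvttl_mhz, 1))
```

The three delays add up to 6.2 ns, which corresponds to a little over 160 MHz.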
20LVTTL Data Rate with DLL
- 1.4 ns measured clock-to-output delay
Output standard: LVTTL Fast 16 mA (OBUF_F_16)
Conditions: Temp = 100°C, Vdd = 2.375 V, Vcco = 3.3 V
Waveforms: 1 CLKIN, 2 DATA OUT (no DLL), 3 DATA OUT (DLL deskewed)
Timing without DLL: r->r 3.9 ns, r->f 3.9 ns
Timing with DLL: r->r 1.4 ns, r->f 1.4 ns
21Other DLL Functions
- Double the incoming clock frequency
- fast internal operation slow external clock
- Clock mirroring to the PCB
- Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
- Adjust clock duty cycle to 50-50
- Create four quadrature clock phases
- input four sequential bits per clock period
22Duty Cycle Correction
- 25% duty cycle in, 50% duty cycle out
[Figure: Virtex FPGA with a 1X DLL; a 25 MHz clock with 25% duty cycle goes in, a 25 MHz clock with 50% duty cycle comes out]
23Clock Doubling and Mirroring
- Clock mirror with less than 100 ps skew
- simplifies PCB clock distribution
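[Figure: actual HDTV customer example. A 37 MHz system clock (1 input load) feeds two DLLs in the Virtex FPGA. DLL 1 supplies a 74 MHz clock to the SDRAM; DLL 2 supplies the 74 MHz and 37 MHz internal clocks through zero-delay internal clock buffers. The clocks at the system clock input, at the SDRAM, and inside the FPGA are exactly aligned]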
24Precise Clock Mirroring
- 2x system clock for board use
[Figure: Virtex FPGA with a 2X DLL; 66 MHz clock in, 132 MHz clock out]
25Clock Division
- Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
- maintain synchronous edges
CLKIn 200 MHz
CLKout 200 MHz
CLKDV 12.5 MHz
26Multi-Standard SelectI/O
[Figure: one FPGA interfacing through SelectI/O. Devices shown: microprocessor, SRAM, SDRAM, FLASH, mixed signal, DSP, busses/backplanes (3/5V PCI, ISA, GTL). Standards shown: GTL, 2.5V SSTL, 1.8V, 3.3V LVTTL, 5V, 5V tolerant]
27Mix & Match Output Standards
- User-supplied voltages determine output swing
- 3.3 V, 2.5 V, 1.5 V
- one voltage per bank
- a bank is half of a chip edge
- Output characteristics are programmable on a per-pin basis
- push-pull or open-drain
- LVTTL drive strength
- 2-mA to 24-mA sink and source current
- LVTTL Slew rate
28Mix & Match Input Standards
- Internal or user-supplied threshold voltage
- selectable on a per-pin basis
- one user-supplied threshold voltage per bank
- Programmable over-voltage protection
- 5-V tolerant or diode clamp to VCCO
- selectable on a per-pin basis
[Figure: a bank of inputs, each comparing against either the internal reference or the bank's VREF]
29SSTL Clock-to-Out With DLL
- 200 MHz inter-chip data rate
- SSTL 3, Class II
- IOB register to IOB register
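[Figure: two Virtex FPGAs, each with a DLL on the common Clock; data travels from an IOB register (Q) in the first chip to an IOB register (D) in the second; labelled delays 0.3 ns, 2.8 ns, and 1.9 ns]
(SSTL: Stub Series Terminated Logic)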
30SSTL Data Rate with DLL
- 1.3 ns measured clock-to-output delay
- much lower noise than LVTTL
Output standard: SSTL 3 Class 2 (OBUF_SSTL3_II)
Conditions: Temp = 100°C, Vdd = 2.375 V, Vcco = 3.3 V, Vtt = 1.5 V
Waveforms: 1 CLKIN, 2 DATA OUT (no DLL), 3 DATA OUT (DLL deskewed)
Timing without DLL: r->r 3.5 ns, r->f 3.8 ns
Timing with DLL: r->r 1.1 ns, r->f 1.3 ns
31From FPGA to System Component: Redefining the FPGA
[Figure: a Virtex FPGA as a system component, linking cache SRAM (Mbytes), SDRAM (133 MHz), a low-voltage CPU, Chip 1, and a high-speed system backplane through LVCMOS, SSTL3, LVTTL, and GTL interfaces, with x1 and x2 clocks]
"Virtex moves FPGAs from glue to system component" - Ron Neale, EE
32Power and Thermal Issues
- Power and heat are serious concerns
- All CMOS power consumption is dynamic
- proportional to VCC²
- proportional to capacitance
- proportional to frequency
- Virtex conserves power
- 2.5-V supply voltage
- small geometries and short interconnects reduce
capacitance
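The three proportionalities above combine into the usual first-order model P = C x VCC² x f. The sketch below only illustrates the voltage term; the 10 pF and 100 MHz values are arbitrary assumptions, and the 3.3-V reference point is chosen for comparison, not taken from the slide.

```python
def dynamic_power_w(capacitance_f, vcc_v, frequency_hz):
    """First-order CMOS dynamic power: P = C * V^2 * f."""
    return capacitance_f * vcc_v ** 2 * frequency_hz


# Same node (assumed 10 pF switching at 100 MHz) at two supply voltages
p_3v3 = dynamic_power_w(10e-12, 3.3, 100e6)
p_2v5 = dynamic_power_w(10e-12, 2.5, 100e6)
ratio = p_2v5 / p_3v3
print(p_3v3, p_2v5, ratio)
```

Under this model a 2.5-V supply dissipates a little under 60% of the power of the same circuit at 3.3 V, before any capacitance savings are counted.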
33Virtex Power Consumption
- Virtex is designed to conserve power
- 100 MHz 16-bit counters
- 12.5 MHz average transition rate
- 6.5 mW per counter including clock distribution
- 100 MHz 8-bit counters
- 25 MHz average transition rate
- 5 mW per counter including clock distribution
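The average transition rates quoted above follow from how a binary counter toggles: bit 0 changes every clock, bit 1 every second clock, and so on. A minimal sketch (the function name is hypothetical):

```python
def average_transition_rate_mhz(clock_mhz, bits):
    """Average toggle rate per bit of a free-running binary counter.

    Bit k toggles once every 2**k clocks, so the whole counter makes
    just under 2 transitions per clock, shared among `bits` outputs.
    """
    toggles_per_clock = sum(1.0 / 2 ** k for k in range(bits))
    return clock_mhz * toggles_per_clock / bits


print(average_transition_rate_mhz(100, 16))
print(average_transition_rate_mhz(100, 8))
```

The results come out marginally below 12.5 MHz and 25 MHz because the sum approaches 2 without reaching it.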
34Thermal Management
- Temperature-sensing diode
- matched to Maxim MAX1617 A/D
- programmable alarms
- similar to the Pentium II solution
[Figure: the Virtex FPGA diode pins DXP and DXN connect to a Maxim MAX1617, which provides SMBCLK, SMBDATA, and ALERT]
35Power Supply Decoupling
- CMOS power-supply current is dynamic
- current pulse every active clock edge
- Peak current can be 5x the average current
- instantaneous current peaks can only be supplied by decoupling capacitors
- Use one 0.1 µF ceramic chip capacitor for each power-supply pin
- low L and R are more important than high C
- double up for lower L and R if necessary
- use direct vias to the supply planes, close to the power-supply pins
36Virtex Configuration
- New byte-wide SelectMAP mode
- up to 528 Mbps at 66 MHz
- simple handshake protocol
- up to 400 Mbps at 50 MHz
- no handshake required
- Configuration bit-stream length
- 0.5 Mbits to 6.1 Mbits
[Figure: control logic (EPLD) addresses a configuration EPROM whose Data bus feeds the Virtex FPGA; the control logic drives WE and CS and monitors Busy]
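The SelectMAP rates follow directly from the byte-wide bus. The sketch below assumes one byte is transferred on every clock with no protocol overhead, so the configuration times are lower bounds; the function names are hypothetical.

```python
def selectmap_rate_mbps(clock_mhz, bus_width_bits=8):
    """Peak configuration rate of a byte-wide port."""
    return clock_mhz * bus_width_bits


def config_time_ms(bitstream_mbits, clock_mhz):
    """Idealised time to load a bit-stream of the given length."""
    return bitstream_mbits / selectmap_rate_mbps(clock_mhz) * 1000.0


print(selectmap_rate_mbps(66), selectmap_rate_mbps(50))
print(config_time_ms(6.1, 66))   # largest bit-stream, with handshake
print(config_time_ms(0.5, 50))   # smallest bit-stream, no handshake
```

Under these assumptions even the largest bit-stream loads in under 12 ms.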
37Volts, Amps, and Watts Recap
- PCB design issues
- minimize capacitance for higher speed
- terminate transmission lines to reduce ringing
- Chip inputs and outputs
- use DLLs to maximize I/O bandwidth
- use SelectI/O to interface with different standards
- Power and thermal considerations
- use the sensing diode to manage chip temperature
- decouple the power supply well
- Configuration
- configure faster with the SelectMAP mode
38Designing for 100 MHz.
- Volts, Amps, and Watts
- PCB signal distribution
- chip inputs and outputs
- power and thermal considerations
- Ones and zeros
- logic emulation
- Bits and bytes
- memory hierarchy
39Spending the 10 ns Budget
- Fast logic requires fast function generators
- signals often pass through several function generators
- Routing delays must also be kept short
- there are routing delays between successive function generators
- Arithmetic delays are important
- carry chains often create critical paths
40You Don't Have To Be An Expert
- You don't have to be an FPGA architecture expert to implement high-performance designs
- the benefits of a good architecture are automatic
- all the logic goes faster
- software provides easy access to the features
- You can achieve high performance only with a good FPGA architecture
- a good FPGA empowers its users
- You'll design better if you know the architecture
- matching your design style to the available features increases performance and/or lowers cost
41Virtex CLB
- Logic and arithmetic delay reduction demands improvements in the CLB
- Virtex CLB is divided into two slices, each with
- 2 function generators
- 2 flip-flops
- 2 bits of carry logic
[Figure: Virtex CLB with two slices, each containing two function generators and two carry stages]
42Fast Function Generators
- Each function generator emulates 2 to 3 levels of logic
- a 10-level logic path typically requires 3 to 5 function generators in series
- at 100 MHz, they must be less than 2 ns each including the routing
- Virtex has 0.6-ns function generators
- leaves 1.4 ns for each route
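The timing budget on this slide is simple arithmetic, sketched below. Like the slide, it ignores flip-flop clock-to-out and set-up time; the function name is hypothetical.

```python
def routing_budget_ns(clock_mhz, levels, function_generator_ns=0.6):
    """Routing time left per level after the function generator delay."""
    period_ns = 1000.0 / clock_mhz
    per_level_ns = period_ns / levels
    return per_level_ns - function_generator_ns


# Worst case from the slide: 5 function generators in series at 100 MHz
print(routing_budget_ns(100, 5))
# A 3-level path leaves considerably more room for routing
print(routing_budget_ns(100, 3))
```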
43Connecting Function Generators
- Some functions need several function generators
- F5 MUXs connect pairs of function generators
- functions with 5 to 9 inputs
- F6 MUXs connect all 4 function generators
- functions with 6 to 17 inputs
[Figure: four function generators; each pair feeds an F5 MUX, and the two F5 outputs feed the F6 MUX]
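One way to see how an F5 MUX builds a 5-input function from two 4-input function generators is Shannon expansion on the fifth input. The model below is a behavioural sketch only; the helper names and the choice of a 5-input majority function are assumptions made for illustration.

```python
from itertools import product


def make_lut4(func):
    """Build the 16-entry truth table of a 4-input function."""
    return [func(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1)
            for i in range(16)]


def lut4(table, a, b, c, d):
    """Look up one entry, as a 4-input function generator would."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]


def majority5(a, b, c, d, e):
    """Example 5-input target function."""
    return int(a + b + c + d + e >= 3)


# One function generator holds the function with e = 0, the other with e = 1
lut_e0 = make_lut4(lambda a, b, c, d: majority5(a, b, c, d, 0))
lut_e1 = make_lut4(lambda a, b, c, d: majority5(a, b, c, d, 1))


def f5_function(a, b, c, d, e):
    """The F5 MUX selects between the two function generators using e."""
    return lut4(lut_e1, a, b, c, d) if e else lut4(lut_e0, a, b, c, d)


matches = all(f5_function(*bits) == majority5(*bits)
              for bits in product((0, 1), repeat=5))
print(matches)
```

The same step applied once more, with an F6 MUX choosing between two F5 outputs, gives any 6-input function from four function generators.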
44Fast Local Routing
- Local routing provides fast interconnects
- in a CLB, function generators connect with minimal routing delays
- fast paths between adjacent CLBs increase flexibility
[Figure: adjacent CLBs, each with function generators and carry logic, linked by fast local routing]
45Use Pipelining for Speed
- Shorter clock periods mean doing less each period
- create a pipeline structure
- pipeline stages operate concurrently
- more functions are done at the same time
- throughput increases
- All function generators have output flip-flops
- most pipeline support is free
46 16-Bit Pipeline in One LUT
- In directly cascaded pipelines the flip-flops are not free
- One SRLUT can implement up to 16 bits of delay
- shift data in and select the appropriate tap
[Figure: 16-bit shift register with Input, Delay Select, and Output]
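A behavioural model makes the tap selection concrete. This is a sketch, not the hardware primitive: the class name and the convention that tap 0 holds the newest bit are assumptions.

```python
class ShiftRegisterLUT:
    """16-bit shift register with a selectable output tap."""

    def __init__(self, depth=16):
        self.bits = [0] * depth

    def clock(self, data_in):
        """Shift one bit in; the oldest bit falls off the end."""
        self.bits = [data_in] + self.bits[:-1]

    def read(self, tap):
        """Tap 0 is the newest bit; tap n is the bit shifted in n clocks earlier."""
        return self.bits[tap]


stream = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
srl = ShiftRegisterLUT()
delayed = []
for bit in stream:
    srl.clock(bit)
    delayed.append(srl.read(tap=4))
print(delayed)
```

Changing the tap changes the pipeline depth without using any additional flip-flops.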
47 Fast Logic Needs Fast Routing
- Our typical design with 3 to 5 CLBs needed an average routing delay of 1.4 ns or less
- the Virtex routing architecture delivers this performance
- Delay is independent of direction
- dependably short delays
48Go Farther, Faster
- Virtex achieves its speed through a hierarchy of highly buffered routing resources
- wires span 1, 2, or 6 CLBs
- The Virtex routing architecture is designed for large arrays
- today's FPGAs are big but tomorrow's will be even bigger
- Virtex is designed to maintain its performance even in very large arrays
49No Routing Congestion
- For high-speed applications, routing must be dependably fast
- not just capable of being fast
- In the past, high device utilization has caused routing congestion
- critical nets might be forced to meander
- Virtex minimizes these problems
- abundant resources prevent congestion
If it needs to be fast, it will be fast automatically!
50Built-in Tri-State Busses
- Bi-directional busses are supported directly by tri-state buffers built into each CLB
- two drivers per CLB
- segmentable every four CLB columns
[Figure: a row of CLBs driving a shared tri-state bus]
51Arithmetic: A Special Case
- Adders, accumulators, counters, and comparators all depend on carry chains
- Carry-chain logic is usually much deeper than the rest of the design
- 32 levels for a 16-bit ripple adder
- too deep to use function generators at 100 MHz
- arithmetic delays would limit performance
- Dedicated carry logic provides the desired speed
- 16-bit adders can operate at up to 200 MHz register-to-register
52Wide Arithmetic
- 64-bit adders would require 128 levels of logic
- expensive complex carry schemes would be needed to preserve performance
- Virtex minimizes the carry propagation delay
- 100 ps per bit pair
- zero routing delay between CLBs
- Minimal performance loss for each extra bit
16-bit adders operate at up to 200 MHz
64-bit adders operate at up to 135 MHz
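The two adder speeds are consistent with the 100 ps per bit pair figure. The sketch below assumes the 16-bit, 200 MHz adder as the baseline and adds only carry delay for the extra bits; the function name is hypothetical.

```python
CARRY_NS_PER_BIT_PAIR = 0.1   # 100 ps per bit pair, from the slide


def adder_fmax_mhz(bits, base_bits=16, base_fmax_mhz=200.0):
    """Estimate adder speed by extending the carry chain of a known adder."""
    base_period_ns = 1000.0 / base_fmax_mhz
    extra_pairs = (bits - base_bits) / 2
    period_ns = base_period_ns + extra_pairs * CARRY_NS_PER_BIT_PAIR
    return 1000.0 / period_ns


print(adder_fmax_mhz(16))
print(adder_fmax_mhz(32))
print(adder_fmax_mhz(64))
```

Going from 16 to 64 bits adds 24 bit pairs, or 2.4 ns, to the 5 ns period, which gives about 135 MHz.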
53Efficient Virtex Multipliers
- Cascade vs. tree structure
- cascade is simpler and smaller
- tree is faster
- Virtex gives the best of both worlds
- as fast as a tree
- smaller than a cascade
- 160 MHz clock rate for pipelined 16 x 16
multiplier
[Figure: two bar charts comparing cascade, tree, and Virtex tree multipliers at 4 x 4, 8 x 8, and 16 x 16; one chart shows delay, the other the number of CLBs]
54Fast Address Decoders
- Wide address decoders could slow operation
- wide AND gates with invertible inputs
- Virtex carry-chain MUXs can act as AND gates
- combine function generator ANDs
- 64-bit decoders operate at up to 155 MHz
[Figure: carry-chain MUXs combining function generator outputs into one wide AND]
55Speed Is Never Wasted
- You can never have too much performance
- excess performance can always be traded for size and cost reduction
- Replace single-cycle functions with smaller multi-cycle versions
- a 2-cycle multiplier is half the cost of a single-cycle multiplier
Reduce costs by designing down to the performance you need
56Creating a High-Speed Clock
- Logic sometimes needs to operate faster than the available clock
- multiple RAM accesses in a single cycle
- low-speed PCB clock distribution for power or noise reduction
- Virtex DLLs can double and redouble incoming clocks
[Figure: two cascaded 2X DLLs; DLL1 doubles 45 MHz to 90 MHz and DLL2 doubles 90 MHz to 180 MHz]
57Optimized for the Future
- Deep sub-micron technology permits larger and larger array sizes
- poses new circuit-design challenges
- changes the rules of FPGA architecture
- Across-chip routing is the most vulnerable
- could easily limit design performance
- Virtex is designed for long-term growth
- even long, across-chip routes will remain fast
- Virtex is tomorrow's FPGA today!
58 10 ns is Long Enough
- Virtex CLBs can implement relatively complex functions in 10 ns
- 0.6 ns per 4-input function generator
- Virtex offers fast interconnections
- even across-chip when fully utilized
- fast tri-state buses
- Support for very fast arithmetic operations
- 16-bit adders at 200 MHz
59Implement Designs Automatically
- You don't have to be an FPGA wizard to use Virtex
- Virtex is optimized for automated implementation
- uniform structure
- efficient mapping/synthesis
- ample routing
- simple placement and no congestion
- predictable performance
- effective synthesis
- IP cores speed design even more
- validated functionality with guaranteed
performance
60Designing for 100 MHz
- Volts, Amps, and Watts
- PCB signal distribution
- chip inputs and outputs
- power and thermal considerations
- Ones and zeros
- logic emulation
- Bits and bytes
- memory hierarchy
61 100 MHz Memory
- Virtex memory operates up to 200 MHz
- High-speed memory has two benefits
- data storage
- work-in-progress
- input/output buffers, FIFOs
- accelerating complex functions
- store pre-computed values in look-up tables
62Data Storage Hierarchy
- Virtex supports 3 levels of memory hierarchy
- On-chip SelectRAM
- small-to-medium memories
- 0.6-ns read access time
- On-chip Block SelectRAM
- larger memories
- true dual-ported operation
- 3.3-ns read access time
- Fast SelectI/O interfaces to external RAM
- DLL boosts memory bandwidth
63SelectRAM
- SelectRAM uses CLB LUTs as user memory
- 16-deep RAMs
- 32-deep RAMs
- 16-deep dual-ported RAMs
- 16-deep shift registers
- Cascadable for larger memories
- 128 or more words deep
- uses logic resources for expansion
64Block SelectRAM
- Up to 32 dual-ported 4096-bit RAM Blocks
- synchronous read and write
- True dual-port memory
- each port has full read and write capability
- different clocks for each port
- Configurable aspect ratio
- trade width for depth
- 4096 x 1 bit to 256 x 16 bits
- separate configurations for each port
- Dedicated routing for memory expansion
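Trading width for depth keeps the block at 4096 bits. The sketch below lists the configurations between the two endpoints named on the slide; the intermediate power-of-two widths are an assumption here, and the helper names are hypothetical.

```python
BLOCK_BITS = 4096


def aspect_ratios(widths=(1, 2, 4, 8, 16)):
    """(depth, width) pairs that fill one 4096-bit block."""
    return [(BLOCK_BITS // width, width) for width in widths]


def reads_per_write(write_width, read_width):
    """Narrow reads needed to drain one wide write when ports differ."""
    return write_width // read_width


print(aspect_ratios())
print(reads_per_write(write_width=16, read_width=4))
```

Configuring one port as 256 x 16 and the other as 1024 x 4, for example, converts each 16-bit word written into four 4-bit words read.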
65High-Speed Memory Interfaces
- SelectI/O and DLLs together provide fast access to many types of external memory
- Xilinx currently offers two reference designs
- fully synthesized
- automatic placement and routing
- SDRAM up to 125 MHz
- ZBT RAM (Zero Bus Turnaround) up to 143 MHz
66Input/Output Data Buffers
- High-performance systems need data buffers to decouple internal operation from I/O activity
- I/O may be sporadic (burst-mode busses)
- I/O may be faster or slower
- I/O may be wider or narrower
- I/O buffers can take several forms
- dual-ported RAMs
- ping-pong buffers
- FIFOs
67Dual-ported I/O Buffers
- Block SelectRAM is ideal for I/O buffers
- dual-ported operation
- independent clocks and controls
- bridges between clock domains
- simultaneous read and write
- port-specific aspect-ratio control
- built-in rate/width conversions
- SelectRAM provides similar benefits on a
smaller scale
68Ping Pong Buffers
- Ping-pong buffers are pairs of blocks that alternate between input and processing
- SRLUT for small buffers
- self-addressing input
- 0.6-ns read access
- Larger buffers can use the dual-ported Block RAM
- one address bit alternates read/write areas
- 3.3-ns read access
[Figure: two 16-bit shift registers sharing Input and Read Address, with Select choosing which one drives Output]
69Small FIFOs in SRLUTs
- Small FIFOs can be implemented in SRLUTs
- word count addresses the output data
- increment and enable SRLUT to Push
- decrement to Pop
- enable only for both
- 16-Byte FIFO in 4 CLBs
- 16 x 16 in 6 CLBs
- 200 MHz
- Expandable for deeper FIFOs
[Figure: 16-bit shift register between Input and Output, addressed by a word counter that counts up on Push and down on Pop]
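The push and pop rules above can be modelled in a few lines. This is a behavioural sketch under assumptions: the class name is hypothetical, and the output is taken to be addressed by the word count minus one so that the oldest word is read first.

```python
class SrlFifo:
    """FIFO built from a shift register plus an up/down word counter."""

    def __init__(self, depth=16):
        self.depth = depth
        self.regs = [None] * depth
        self.count = 0          # number of words currently stored

    def clock(self, push=False, pop=False, data=None):
        """One clock edge; returns the popped word, or None."""
        out = None
        if pop:
            if self.count == 0:
                raise IndexError("pop from empty FIFO")
            out = self.regs[self.count - 1]   # oldest word
        if push:
            if self.count == self.depth and not pop:
                raise OverflowError("push to full FIFO")
            self.regs = [data] + self.regs[:-1]   # enable the shift
        if push and not pop:
            self.count += 1     # increment and enable to push
        elif pop and not push:
            self.count -= 1     # decrement to pop
        # push and pop together: shift only, count unchanged
        return out


fifo = SrlFifo()
for word in (1, 2, 3):
    fifo.clock(push=True, data=word)
first = fifo.clock(pop=True)
second = fifo.clock(push=True, pop=True, data=4)
third = fifo.clock(pop=True)
fourth = fifo.clock(pop=True)
print(first, second, third, fourth, fifo.count)
```

No read or write pointers are needed: the shift register orders the data and the counter alone addresses the output.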
70Large FIFOs in Block RAM
- Large FIFOs can use the dual-ported block RAM
- add read and write address counters
- Asynchronous push and pop
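- Different port sizes give rate-for-width conversion
- Block RAM FIFOs can operate at up to 170 MHz including flag logic
[Figure: Block SelectRAM FIFO. Input data is written through one port and output data is read through the other; each port has an address counter with an enable; control logic takes Push and Pop and produces WE, the counter enables, and the Full and Empty flags]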
71Pre-computing for Speed
- Some functions are too complex for 10-ns logic implementation
- pipelining is not always possible
- An alternative is to pre-compute all the possible results and store them in memory
- select a result according to the inputs
- Function time is independent of complexity
- 0.6 ns SelectRAM access time
- 3.3 ns Block SelectRAM access time
- The function table can be smaller than the logic
72Multiplication By A Constant
- Sometimes, data has to be scaled
- multiplied by a constant value
- A full multiplier is too expensive
- it can multiply by a variable
- unnecessarily general and too complex
- Storing all multiples of the constant is a better alternative
- smaller and much faster
[Figure: a multiplier array taking Input and Constant to produce Scaled Data, compared with a product table taking only Input to produce Scaled Data]
73 16-bit Scaler
- A 2^16-word product table is impractical
- partition the input into nibbles
- use 16-word LUTs for nibble products
- combine the partial products in adders
- Roughly half the CLBs of a full multiplier
- for a 16-bit coefficient: 36 CLBs vs. 62 CLBs
- Pipeline the adders for extra speed
[Figure: the Input feeds four nibble LUTs; the outputs of the upper three are weighted x16, x256, and x4096, and all four are summed to give Scaled Data]
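The nibble decomposition can be sketched in software. The function names and the example constant are assumptions; the structure (one 16-word product table, four look-ups, weighted sum) follows the slide.

```python
def build_product_table(constant):
    """16-word LUT holding every multiple of the constant, 0 to 15 times."""
    return [constant * nibble for nibble in range(16)]


def scale(value, table):
    """Multiply a 16-bit value by the constant using four nibble look-ups."""
    if not 0 <= value < 1 << 16:
        raise ValueError("value must fit in 16 bits")
    total = 0
    for i in range(4):
        nibble = (value >> (4 * i)) & 0xF
        total += table[nibble] << (4 * i)   # weights x1, x16, x256, x4096
    return total


CONSTANT = 12345                      # hypothetical 16-bit coefficient
table = build_product_table(CONSTANT)
print(scale(0xBEEF, table), CONSTANT * 0xBEEF)
```

Every nibble position reuses the same table contents, which is why 16 stored words replace a table of 2^16 entries.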
74Changing the Constant
- The SRLUT mode can be used to update the table
- push-only stack
- last 16 bits loaded define the table
- A simple accumulator computes all products of a new constant
[Figure: the Constant, two registers with Clear and Load controls, and a Change Constant signal feeding the 16-bit shift register between Input and Output]
75Large Function Tables
- Larger functions can be implemented in the Block SelectRAM
- 12-input functions
- micro-coded state machines
- Data tables can also be implemented
- sine/cosine tables for DSP, for example
- dual-ported access gives the sine and cosine simultaneously
- a simple address offset gives 90° phase shift for accessing sine and cosine from a single table
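The address-offset trick relies on cos(x) = sin(x + 90°). The sketch below uses a 256-entry floating-point table purely for illustration; the table size and names are assumptions, and a real design would store fixed-point words.

```python
import math

TABLE_SIZE = 256   # assumed depth, e.g. a 256 x 16 block configuration
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE)
              for i in range(TABLE_SIZE)]


def sin_cos(phase_index):
    """Read sine and cosine from one table, as two ports would.

    The second address is offset by a quarter of the table (90 degrees).
    """
    sine = SINE_TABLE[phase_index % TABLE_SIZE]
    cosine = SINE_TABLE[(phase_index + TABLE_SIZE // 4) % TABLE_SIZE]
    return sine, cosine


print(sin_cos(0))
print(sin_cos(32))   # 45 degrees: sine and cosine are equal
```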
76Block RAM/ROM Creation
- CORE Generator software creates RAMs and ROMs
- simple GUI interface
- Initialization file is loaded into RAMs and
ROMs at configuration time
77Memory Summary
- Virtex has two kinds of internal memory
- distributed SelectRAM for small RAMs
- Block SelectRAM for larger RAMs
- SelectRAM
- 0.6 ns read access time
- 16- and 32-word RAMs
- 16-word dual-ported RAMs
- 16-word shift registers
- sequential write/random-access read
- FIFOs, pipelining, LUT functions, etc...
78Memory Summary
- Dual-ported 4096-bit Block SelectRAM
- 3.3 ns read access time
- true dual-ported operation
- both ports are read/write
- ports can be clocked asynchronously
- configurable aspect ratio
- 4096 x 1 bit to 256 x 16 bits
- configure ports differently for width/rate conversion
- High-speed SelectI/O access to external RAM
79Designing for 100 MHz
- Volts, Amps, and Watts
- DLLs and flexible I/O standards
- fast inter-chip communication
- simple rules for good signal integrity
- Ones and zeros
- fast logic and fast interconnect
- dependable high performance
- Bits and bytes
- distributed SelectRAM
- dual-ported Block SelectRAM
80The Virtex Family
- The complete Virtex Data Sheet is on your AppLinx CD-ROM and at www.xilinx.com/partinfo/virtex.pdf
81Designing for 100 MHz