Module 2

About This Presentation

Title:

Module 2

Description:

Designing for 100+ MHz – PowerPoint PPT presentation

Slides: 82

Provided by: Berni94

Transcript and Presenter's Notes

1
Designing for 100 MHz
2
1999 Designs Demand...

Higher system speed
Higher integration
smaller size, less power, better reliability
Lower cost
Shorter development time
Better product differentiation

3
Traditional Multi-Chip Boards

Discrete design components
CPU, memory
bus transceivers, PCI controller, FIFOs
Ethernet controller, Graphics accelerator, MPEG,
DSP, etc.
programmable logic as glue and custom function
Advantages
well-documented sophisticated functions
readily available as IP in silicon

4
Multi-Chip Board Problems

Physical size
Power consumption and reliability
PC board signal integrity
Limited flexibility
prevents design modifications and upgrades
prevents product diversification
prevents product customization
Poor product differentiation
standard parts, standard architecture

5
FPGA Advantages

Smaller size
Lower power consumption
Better signal integrity
fewer PC-board issues
Enhanced flexibility
easy modifications, upgrades, etc.
Enhanced product differentiation
proprietary architectures

6
FPGA Users Want...

System clock rate of 100 MHz
>100,000 gates
Efficient design methodologies
Availability of well-documented Cores
Reasonable cost

7
The FPGA Solution
4th Generation FPGA: Logic, Memory, Routing
Delay-Locked Loop for Fast Clock and I/O
3.3 ns Synchronous Dual-Port SRAM
Multi-Standard Select I/O
500 Mbps SelectMAP Configuration
Temperature Sensing
8
Now the Challenge...
Design a 100 MHz system

Together, we can do it...
we'll supply the ingredients...
you use them intelligently
But don't forget...
the clock period is less than 10 ns!

9
Designing for 100 MHz.

Volts, Amps, and Watts
PCB signal distribution
chip inputs and outputs
power and thermal considerations
Ones and zeros
logic emulation
Bits and bytes
memory hierarchy

10
Moore Meets Einstein
[Chart: clock frequency in MHz (log scale, 1 to 2048) and trace length in inches per 1/4 clock period, plotted against year from 1965 to 2010]

Speed doubles every 5 years... but the speed of light never changes

11
Volts, Amps, and Watts

PCB design issues
capacitive loading
transmission lines and termination
Chip inputs and outputs
clock distribution and DLLs
I/O standards
Power and thermal considerations
temperature sensing diode
power supply decoupling
Configuration
new SelectMAP mode

12
Capacitive Loading

Capacitance slows outputs and increases power
output delay increase
25 ps per pF of additional loading
output power dissipation increase
11 µW per MHz per pF with 3.3-V swing
Sources of capacitance
10 pF max for each device pin
2 pF per inch for narrow traces (0.8 pF/cm)
130 pF per square inch for copper areas (20 pF/cm²)
IBIS files provide output impedance details
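
The rules of thumb on this slide can be turned into a quick estimate. The sketch below is only illustrative: the function names and the example net (4 inches of trace, two receiver pins, 50 MHz toggle rate) are assumptions, and it treats the whole load as "additional" capacitance.

```python
# Rule-of-thumb figures quoted on this slide.
DELAY_PS_PER_PF = 25.0          # added output delay per pF of extra load
POWER_UW_PER_MHZ_PER_PF = 11.0  # added output power at a 3.3-V swing
PIN_PF_MAX = 10.0               # worst-case capacitance per device pin
TRACE_PF_PER_INCH = 2.0         # narrow PCB trace


def load_capacitance_pf(trace_inches, receiver_pins):
    """Estimate the lumped load on one output, in pF."""
    return trace_inches * TRACE_PF_PER_INCH + receiver_pins * PIN_PF_MAX


def added_delay_ns(load_pf):
    """Extra output delay caused by the load, in ns."""
    return load_pf * DELAY_PS_PER_PF / 1000.0


def added_power_mw(load_pf, toggle_mhz):
    """Extra output power caused by the load, in mW."""
    return load_pf * toggle_mhz * POWER_UW_PER_MHZ_PER_PF / 1000.0


# Hypothetical net: 4 inches of trace driving two inputs, toggling at 50 MHz.
load = load_capacitance_pf(trace_inches=4, receiver_pins=2)
print(f"load  = {load:.0f} pF")
print(f"delay = {added_delay_ns(load):.2f} ns")
print(f"power = {added_power_mw(load, 50):.1f} mW")
```

Even this modest net costs a noticeable slice of a 10 ns period, which is why the recap says to minimize capacitance.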

13
Transmission Lines

Some traces must be treated as transmission lines
to minimize ringing
transmission line if round trip > transition time
lumped capacitance if round trip < transition time
Signal delay on a PCB
140 to 180 ps per inch (50 to 70 ps/cm)
Lumped-capacitance trace length
3 inches max for a 1-ns transition time (7.5 cm)
6 inches max for a 2-ns transition time (15 cm)
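
The round-trip rule can be checked with a few lines of arithmetic. In this sketch the default of 160 ps per inch is an assumption (the midpoint of the range quoted above), and the function names are made up for illustration.

```python
def round_trip_ns(length_inches, ps_per_inch=160.0):
    """Time for an edge to travel down the trace and back, in ns."""
    return 2 * length_inches * ps_per_inch / 1000.0


def is_transmission_line(length_inches, transition_ns, ps_per_inch=160.0):
    """True when the round trip exceeds the signal transition time."""
    return round_trip_ns(length_inches, ps_per_inch) > transition_ns


for inches, edge_ns in ((3, 1.0), (4, 1.0), (6, 2.0)):
    kind = "transmission line" if is_transmission_line(inches, edge_ns) else "lumped"
    print(f"{inches} in, {edge_ns} ns edge: {kind}")
```

With a slower board (closer to 180 ps per inch) the 3-inch and 6-inch limits become marginal, so they are best read as upper bounds.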

14
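Terminated Transmission Lines: Reflections and Ringing
Traditional Thevenin termination at the end
[Figure: 50 Ω line with 100 Ω to VCC and 100 Ω to ground at the far end]
Dynamic termination at the end is better and saves power
[Figure: 50 Ω line with 50 Ω in series with 100 pF at the far end]
Series termination at the source is best: single source and destination only!
[Figure: 22 Ω to 27 Ω series resistor at the driver into a 50 Ω line (50 Ω total)]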
15
On-Chip Clock Distribution
[Figure: on-chip clock and data paths; labels: Clock, CLB, Data, IOB]

Clock distribution introduces delay
larger chips suffer more clock delay

16
Clock Delay Problems

Clock delay increases clock-to-output times
Clock delay leads to unacceptable input hold time
set-up time is negative
Additional data delay can eliminate the hold time
set-up time becomes positive
but tolerance build-up widens the data-valid
window

[Figure: IOB flip-flop (D, Q) with a delay element on the data input and the clock distribution delay on the clock input; the timing diagram shows Clock, Required Data Valid (without delay), and Required Data Valid (with delay)]
17
DLLs Maximize I/O Speed

Clock-to-output time plus set-up time determines the I/O speed and data bandwidth
min clock period = max clock-to-out + max set-up
Traditional solution
use highly buffered, balanced clock trees
needed to reduce internal clock skew
cannot totally eliminate the delay
The Virtex solution
use a Delay-Locked Loop (DLL)
aligns the internal and external clocks
effectively eliminates the clock-distribution
delay

18
Virtex Has 4 Independent DLLs
[Figure: DLL block diagram; labels: Clock, Error, Comparator, Delay, CLB, IOB, Data]

DLLs adjust clock delay to align internal and
external clocks
digital closed-loop control
25 to 200-MHz range, 35-picosecond resolution
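
As a rough mental model of the closed-loop idea (not the actual Virtex circuit), a DLL can be pictured as adding delay in fixed steps until the distributed clock lines up with the next edge of the external clock. Everything in this sketch apart from the 35 ps step is an assumption, including the 3 ns distribution delay.

```python
STEP_PS = 35  # delay-line resolution quoted on the slide


def dll_lock(period_ps, distribution_delay_ps, step_ps=STEP_PS):
    """Return the number of delay steps needed to reach alignment.

    Delay is added until distribution delay plus inserted delay spans
    at least one full clock period, so the internal edge coincides with
    the following external edge. Assumes the distribution delay is
    shorter than one period.
    """
    steps = 0
    while distribution_delay_ps + steps * step_ps < period_ps:
        steps += 1
    return steps


def phase_error_ps(period_ps, distribution_delay_ps, steps, step_ps=STEP_PS):
    """Residual misalignment after lock; always smaller than one step."""
    return (distribution_delay_ps + steps * step_ps) % period_ps


# Hypothetical case: 100 MHz clock (10,000 ps) and 3,010 ps of clock-tree delay.
steps = dll_lock(10_000, 3_010)
print(steps, "steps, residual error", phase_error_ps(10_000, 3_010, steps), "ps")
```

The residual error is bounded by the step size, which is why the resolution figure matters.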

19
Fast Clock-to-Out With DLL

160 MHz inter-chip data rate
16-mA LVTTL
IOB register to IOB register
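
The quoted data rate can be cross-checked against the three delays labeled in this slide's figure (3.8 ns, 0.5 ns, and 1.9 ns). The sketch assumes those values add in series along the register-to-register path; the slide does not say which delay is which, and the function name is made up.

```python
def max_rate_mhz(path_delays_ns):
    """Highest clock rate whose period covers the summed path delays."""
    return 1000.0 / sum(path_delays_ns)


lvttl = max_rate_mhz([3.8, 0.5, 1.9])  # values from this slide's figure
sstl = max_rate_mhz([2.8, 0.3, 1.9])   # values from the later SSTL slide
print(f"LVTTL: {lvttl:.0f} MHz, SSTL: {sstl:.0f} MHz")
```

Under that assumption the sums come to 6.2 ns and 5.0 ns, which agrees with the 160 MHz and 200 MHz figures in the deck.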

[Figure: two Virtex FPGAs, each with a DLL, sharing one clock; the register-to-register path (Q to D) is labeled 3.8 ns, 0.5 ns, and 1.9 ns]
20
LVTTL Data Rate with DLL

1.4 ns measured clock-to-output delay

Output standard: LVTTL Fast 16 mA (OBUF_F_16)
Temp = 100°C, Vdd = 2.375 V, Vcco = 3.3 V
Waveforms: 1 = CLKIN, 2 = DATA OUT (no DLL), 3 = DATA OUT (DLL deskewed)
Timing      w/o DLL           w/ DLL
            r->r    r->f      r->r    r->f
            3.9 ns  3.9 ns    1.4 ns  1.4 ns
21
Other DLL Functions

Double the incoming clock frequency
fast internal operation, slow external clock
Clock mirroring to the PCB
Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
Adjust clock duty cycle to 50-50
Create four quadrature clock phases
input four sequential bits per clock period

22
Duty Cycle Correction

25% duty cycle in, 50% duty cycle out

[Figure: Virtex FPGA with a 1X DLL; 25 MHz, 25% duty cycle in; 25 MHz, 50% duty cycle out]
23
Clock Doubling and Mirroring

Clock mirror with less than 100 ps skew
simplifies PCB clock distribution

[Figure: actual HDTV customer example. Labels: 37 MHz system clock (1 input load), DLL 1, DLL 2, 74 MHz outputs 1 and 2 to SDRAM (exactly aligned), 74 MHz and 37 MHz internal clocks inside the FPGA, zero-delay internal clock buffer]
24
Precise Clock Mirroring

2x system clock for board use

[Figure: Virtex FPGA with a 2X DLL; 66 MHz clock in, 132 MHz clock out]
25
Clock Division

Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
maintain synchronous edges

CLKIn 200 MHz
CLKout 200 MHz
CLKDV 12.5 MHz
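
The example above corresponds to the divide-by-16 setting. A small sketch of the arithmetic, with a hypothetical helper name, using only the divisors listed on this slide:

```python
DIVISORS = (1.5, 2, 2.5, 3, 4, 5, 8, 16)  # settings listed on the slide


def clkdv_mhz(clkin_mhz, divisor):
    """Divided clock frequency for one of the listed divisors."""
    if divisor not in DIVISORS:
        raise ValueError(f"unsupported divisor: {divisor}")
    return clkin_mhz / divisor


for d in DIVISORS:
    print(f"200 MHz / {d} = {clkdv_mhz(200, d):.2f} MHz")
```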
26
Multi-Standard SelectI/O
[Figure: one Virtex interfacing to many standards. Labels: GTL, 2.5 V SSTL, microprocessor, SRAM, 1.8 V, SDRAM, 5 V tolerant, FLASH, mixed signal, 5 V, 3.3 V LVTTL, DSP, busses/backplanes (3/5 V PCI, ISA, GTL)]
27
Mix & Match Output Standards

User-supplied voltages determine output swing
3.3 V, 2.5 V, 1.5 V
one voltage per bank
a bank is half of a chip edge
Output characteristics are programmable on a
per-pin basis
push-pull or open-drain
LVTTL drive strength
2-mA to 24-mA sink and source current
LVTTL Slew rate

28
Mix & Match Input Standards
Internal Reference

Internal or user-supplied threshold voltage
selectable on a per-pin basis
one user-supplied threshold voltage per bank
Programmable over-voltage protection
5-V tolerant or diode clamp to VCCO
selectable on a per-pin basis

[Figure: a row of input pins with VREF pins at either end]
29
SSTL Clock-to-Out With DLL

200 MHz inter-chip data rate
SSTL 3, Class II
IOB register to IOB register

[Figure: two Virtex FPGAs, each with a DLL, sharing one clock; the register-to-register path (Q to D) is labeled 2.8 ns, 0.3 ns, and 1.9 ns]
(SSTL: Stub Series Terminated Logic)
30
SSTL Data Rate with DLL

1.3 ns measured clock-to-output delay
much lower noise than LVTTL

Output standard: SSTL 3 Class 2 (OBUF_SSTL3_II)
Temp = 100°C, Vdd = 2.375 V, Vcco = 3.3 V, Vtt = 1.5 V
Waveforms: 1 = CLKIN, 2 = DATA OUT (no DLL), 3 = DATA OUT (DLL deskewed)
Timing      w/o DLL           w/ DLL
            r->r    r->f      r->r    r->f
            3.5 ns  3.8 ns    1.1 ns  1.3 ns
31
From FPGA to System Component: Redefining the FPGA
[Figure: Virtex at the center of a system. Labels: cache SRAM (Mbytes), Chip 1, SDRAM (133 MHz), LVCMOS, x2 CLK, x1 CLK, low-voltage CPU, SSTL3, LVTTL, GTL, high-speed system backplane]
"Virtex moves FPGAs from glue to system component" - Ron Neale, EE
32
Power and Thermal Issues

Power and heat are serious concerns
All CMOS power consumption is dynamic
proportional to VCC²
proportional to capacitance
proportional to frequency
Virtex conserves power
2.5-V supply voltage
small geometries and short interconnects reduce
capacitance
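
The three proportionalities above are the usual C x V² x f estimate for switching power. The sketch below applies it with a made-up function name; the 1 pF and 1 MHz operating point is chosen only so the result can be compared with figures elsewhere in the deck.

```python
def dynamic_power_w(c_farads, vcc_volts, f_hz):
    """Switching power of a node: capacitance x voltage squared x frequency."""
    return c_farads * vcc_volts ** 2 * f_hz


# 1 pF switched at 1 MHz with a 3.3-V swing, in microwatts.
p_33 = dynamic_power_w(1e-12, 3.3, 1e6) * 1e6
# The same node at the 2.5-V core supply.
p_25 = dynamic_power_w(1e-12, 2.5, 1e6) * 1e6
print(f"3.3 V: {p_33:.2f} uW, 2.5 V: {p_25:.2f} uW, ratio {p_25 / p_33:.2f}")
```

The 3.3 V result is close to the 11 µW per MHz per pF quoted on the Capacitive Loading slide, and the ratio shows that dropping the supply to 2.5 V alone cuts switching power by a little over 40 percent.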

33
Virtex Power Consumption

Virtex is designed to conserve power
100 MHz 16-bit counters
12.5 MHz average transition rate
6.5 mW per counter including clock distribution
100 MHz 8-bit counters
25 MHz average transition rate
5 mW per counter including clock distribution
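
The average transition rates quoted here follow from how a binary counter toggles: bit 0 changes every clock, bit 1 every second clock, and so on. A minimal sketch of that calculation (the function name is an assumption):

```python
def average_transition_rate_mhz(bits, clock_mhz):
    """Average toggle rate per bit of a free-running binary counter."""
    # Bit i toggles once every 2**i clock cycles.
    toggles_per_clock = sum(1 / 2 ** i for i in range(bits))
    return clock_mhz * toggles_per_clock / bits


print(f"16-bit: {average_transition_rate_mhz(16, 100):.1f} MHz")
print(f" 8-bit: {average_transition_rate_mhz(8, 100):.1f} MHz")
```

All bits together toggle about twice per clock, so the per-bit average is roughly 2 x 100 MHz divided by the counter width.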

34
Thermal Management

Temperature-sensing diode
matched to Maxim MAX1617 A/D
programmable alarms
similar to the Pentium II solution

[Figure: Virtex FPGA DXP and DXN pins connected to a Maxim MAX1617, which provides SMBCLK, SMBDATA, and ALERT]
35
Power Supply Decoupling

CMOS power-supply current is dynamic
current pulse every active clock edge
Peak current can be 5x the average current
instantaneous current peaks can only be supplied
by decoupling capacitors
Use one 0.1 µF ceramic chip capacitor for each
power-supply pin
low L and R are more important than high C
double up for lower L and R if necessary
use direct vias to the supply planes, close to
the power-supply pins

36
Virtex Configuration

New byte-wide SelectMAP mode
up to 528 Mbps at 66 MHz
simple handshake protocol
up to 400 Mbps at 50 MHz
no handshake required
Configuration bit-stream length
0.5 Mbits to 6.1 Mbits
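
The throughput figures are simply the byte-wide bus multiplied by the clock rate, which also gives a rough configuration time. This sketch assumes one byte is accepted on every clock (no Busy stalls) and that 1 Mbit means 10^6 bits; the helper names are hypothetical.

```python
def selectmap_mbps(clock_mhz, bus_bits=8):
    """Peak SelectMAP throughput in Mbit/s for a byte-wide bus."""
    return clock_mhz * bus_bits


def config_time_ms(bitstream_mbits, clock_mhz):
    """Best-case time to load a bitstream, in milliseconds."""
    return 1000.0 * bitstream_mbits / selectmap_mbps(clock_mhz)


print(selectmap_mbps(66), selectmap_mbps(50))                   # 528 and 400 Mbps
print(f"{config_time_ms(0.5, 50):.2f} ms for the smallest bitstream at 50 MHz")
print(f"{config_time_ms(6.1, 66):.2f} ms for the largest bitstream at 66 MHz")
```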

[Figure: control logic (EPLD) addressing a configuration EPROM that feeds data to the Virtex FPGA; signals: Busy, CS, WE, Data]
37
Volts, Amps, and Watts Recap

PCB design issues
minimize capacitance for higher speed
terminate transmission lines to reduce ringing
Chip inputs and outputs
use DLLs to maximize I/O bandwidth
use SelectI/O to interface with different
standards
Power and thermal considerations
use the sensing diode to manage chip temperature
decouple the power supply well
Configuration
configure faster with the SelectMAP mode

38
Designing for 100 MHz.

Volts, Amps, and Watts
PCB signal distribution
chip inputs and outputs
power and thermal considerations
Ones and zeros
logic emulation
Bits and bytes
memory hierarchy

39
Spending the 10 ns Budget

Fast logic requires fast function generators
signals often pass through several function
generators
Routing delays must also be kept short
there are routing delays between every function
generator
Arithmetic delays are important
carry chains often create critical paths

40
You Don't Have To Be An Expert

You don't have to be an FPGA architecture expert
to implement high-performance designs
the benefits of a good architecture are automatic
all the logic goes faster
software provides easy access to the features
You can achieve high performance only with a good
FPGA architecture
a good FPGA empowers its users
You'll design better if you know the architecture
matching your design style to the available
features increases performance and/or lowers cost

41
Virtex CLB

Logic and arithmetic delay reduction demands
improvements in the CLB
Virtex CLB is divided into two slices, each with
2 function generators
2 flip-flops
2 bits of carry logic

[Figure: one CLB drawn as two slices, each with two function generators and two carry blocks]
42
Fast Function Generators

Each function generator emulates 2 to 3 levels
of logic
a 10-level logic path typically requires 3 to 5
Function Generators in series
at 100 MHz, they must be less than 2 ns each
including the routing
Virtex has 0.6-ns function generators
leaves 1.4 ns for each route
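
The budget on this slide is simple division and subtraction. The sketch below reproduces it; like the slide, it ignores flip-flop clock-to-out and set-up time, and the function name is an assumption.

```python
def routing_budget_ns(clock_mhz, stages, lut_delay_ns=0.6):
    """Routing time left per stage once the function generator is paid for."""
    period_ns = 1000.0 / clock_mhz
    per_stage_ns = period_ns / stages
    return per_stage_ns - lut_delay_ns


for stages in (3, 4, 5):
    print(f"{stages} stages: {routing_budget_ns(100, stages):.2f} ns per route")
```

Five function generators in series is the tight case: 2 ns per stage, of which 1.4 ns is left for routing.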

43
Connecting Function Generators

Some functions need several function generators
F5 MUXs connect pairs of function generators
functions with 5 to 9 inputs
F6 MUXs connect all 4 function generators
functions with 6 to 17 inputs
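
The F5 case can be illustrated in software: any 5-input function splits into two 4-input lookup tables, one for each value of the fifth input, with a 2:1 multiplexer choosing between them. This is a behavioral sketch only; the names are invented and nothing here models the real CLB wiring.

```python
from itertools import product


def make_lut4(func):
    """Build the 16-entry truth table of a 4-input function."""
    return [func(*bits) for bits in product((0, 1), repeat=4)]


def read_lut4(table, a, b, c, d):
    """Look up one entry; input a is the most significant address bit."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


def f5(func5):
    """Implement a 5-input function as two LUT4s plus a 2:1 mux on input e."""
    lut_lo = make_lut4(lambda a, b, c, d: func5(a, b, c, d, 0))
    lut_hi = make_lut4(lambda a, b, c, d: func5(a, b, c, d, 1))

    def evaluate(a, b, c, d, e):
        return read_lut4(lut_hi if e else lut_lo, a, b, c, d)

    return evaluate


def matches(func5):
    """Exhaustively compare the two-LUT version with the original function."""
    impl = f5(func5)
    return all(impl(*v) == func5(*v) for v in product((0, 1), repeat=5))


def majority5(a, b, c, d, e):
    """Example function: 1 when at least three inputs are 1."""
    return int(a + b + c + d + e >= 3)


print(matches(majority5))
```

The same split applied once more, with a second mux level, covers 6-input functions across all four function generators.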

[Figure: four function generators; an F5 MUX combines each pair and an F6 MUX combines the two F5 outputs]
44
Fast Local Routing

Local routing provides fast interconnects
in a CLB, Function Generators connect with
minimal routing delays
fast paths between adjacent CLBs increase
flexibility

[Figure: adjacent CLBs, each with four function generators and carry logic, joined by local routing]
45
Use Pipelining for Speed

Shorter clock periods mean doing less each
period
create a pipeline structure
pipeline stages operate concurrently
more functions are done at the same time
throughput increases
All function generators have output flip-flops
most pipeline support is free

46
16-Bit Pipeline in One LUT

In directly cascaded pipelines the flip-flops are
not free
One SRLUT can implement up to 16 bits of delay
shift data in and select the appropriate tap
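
A behavioral model makes the "shift in, select a tap" idea concrete. The class below is a software sketch under assumed names, not a description of the LUT's internals.

```python
class SRL16:
    """16-bit shift register whose output tap is chosen by an address."""

    def __init__(self):
        self.bits = [0] * 16  # bits[0] holds the most recently clocked input

    def clock(self, data_in):
        """Shift every bit one place and capture the new input."""
        self.bits = [data_in] + self.bits[:-1]

    def read(self, tap):
        """Tap 0 is the input delayed by 1 clock; tap 15 by 16 clocks."""
        return self.bits[tap]


srl = SRL16()
for bit in (1, 0, 0, 0, 0):  # a single pulse followed by zeros
    srl.clock(bit)
print([srl.read(t) for t in range(6)])  # the pulse sits at tap 4
```

Changing the tap address changes the pipeline depth without touching any flip-flops, which is the saving the slide describes.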

[Figure: 16-bit shift register with Input, Delay Select, and Output]
47
Fast Logic Needs Fast Routing

Our typical design with 3 to 5 CLBs needed an
average routing delay of 1.4 ns or less
the Virtex routing architecture delivers this
performance
Delay is independent of direction
dependably short delays

48
Go Farther, Faster

Virtex achieves its speed through a hierarchy of
highly buffered routing resources
wires span 1, 2, or 6 CLBs
The Virtex routing architecture is designed for
large arrays
today's FPGAs are big but tomorrow's will be
even bigger
Virtex is designed to maintain its performance
even in very large arrays

49
No Routing Congestion

For high-speed applications, routing must be
dependably fast
not just capable of being fast
In the past, high device utilization has caused
routing congestion
critical nets might be forced to meander
Virtex minimizes these problems
abundant resources prevent congestion

If it needs to be fast, it will be fast
automatically!
50
Built-in Tri-State Busses

Bi-directional busses are supported directly by
tri-state buffers built into each CLB
two drivers per CLB
segmentable every four CLB columns

[Figure: a row of CLBs sharing a tri-state bus]
51
Arithmetic: A Special Case

Adders, accumulators, counters, and comparators
all depend on carry chains
Carry-chain logic is usually much deeper than the
rest of the design
32 levels for a 16-bit ripple adder
too deep to use function generators at 100 MHz
arithmetic delays would limit performance
Dedicated carry logic provides the desired speed
16-bit adders can operate at up to 200 MHz
register-to-register

52
Wide Arithmetic

64-bit adders would require 128 levels of logic
expensive complex carry schemes would be needed
to preserve performance
Virtex minimizes the carry propagation delay
100 ps per bit pair
zero routing delay between CLBs
Minimal performance loss for each extra bit

16-bit adders operate at up to 200 MHz
64-bit adders operate at up to 135 MHz
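
The two adder speeds are consistent with the 100 ps per bit pair figure. The sketch below extrapolates from the 16-bit case; treating every extra bit pair as exactly 100 ps of added period is an assumption, as are the function names.

```python
CARRY_NS_PER_BIT_PAIR = 0.1  # 100 ps per bit pair, from the slide


def adder_period_ns(bits, base_bits=16, base_mhz=200.0):
    """Estimated register-to-register period for a wider adder."""
    base_period_ns = 1000.0 / base_mhz
    extra_pairs = (bits - base_bits) / 2
    return base_period_ns + extra_pairs * CARRY_NS_PER_BIT_PAIR


def adder_mhz(bits):
    """Estimated maximum clock rate for an adder of the given width."""
    return 1000.0 / adder_period_ns(bits)


for width in (16, 32, 64):
    print(f"{width}-bit: {adder_period_ns(width):.1f} ns, {adder_mhz(width):.0f} MHz")
```

Going from 16 to 64 bits adds 24 bit pairs, or 2.4 ns, to the 5 ns period, which lands at about 135 MHz.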
53
Efficient Virtex Multipliers

Cascade vs. tree structure
cascade is simpler and smaller
tree is faster
Virtex gives the best of both worlds
as fast as a tree
smaller than a cascade
160 MHz clock rate for pipelined 16 x 16
multiplier

[Charts: delay and number of CLBs for cascade, tree, and Virtex tree multipliers at 4 x 4, 8 x 8, and 16 x 16]
54
Fast Address Decoders

Wide address decoders could slow operation
wide AND gates with invertible inputs
Virtex carry-chain MUXs can act as AND gates
combine function generator ANDs
64-bit decoders operate at up to 155 MHz

[Figure: carry-chain MUXs cascaded as a wide AND gate, with 0 and 1 inputs]
55
Speed Is Never Wasted

You can never have too much performance
excess performance can always be traded for size
and cost reduction
Replace single-cycle functions with smaller
multi-cycle versions
a 2-cycle multiplier is half the cost of a
single-cycle multiplier

Reduce costs by designing down to the performance
you need
56
Creating a High-Speed Clock

Logic sometimes needs to operate faster than the
available clock
multiple RAM accesses in a single cycle
low-speed PCB clock distribution for power or
noise reduction
Virtex DLLs can double and redouble incoming
clocks

[Figure: 45 MHz into DLL1 (2X) gives 90 MHz; 90 MHz into DLL2 (2X) gives 180 MHz]
57
Optimized for the Future

Deep sub-micron technology permits larger and
larger array sizes
poses new circuit-design challenges
changes the rules of FPGA architecture
Across-chip routing is the most vulnerable
could easily limit design performance
Virtex is designed for long-term growth
even long, across-chip routes will remain fast
Virtex is tomorrow's FPGA today!

58
10 ns is Long Enough

Virtex CLBs can implement relatively complex
functions in 10 ns
0.6 ns per 4-input function generator
Virtex offers fast interconnections
even across-chip when fully utilized
fast tri-state buses
Support for very fast arithmetic operations
16-bit adders at 200 MHz

59
Implement Designs Automatically

You don't have to be an FPGA wizard to use Virtex
Virtex is optimized for automated implementation
uniform structure
efficient mapping/synthesis
ample routing
simple placement and no congestion
predictable performance
effective synthesis
IP cores speed design even more
validated functionality with guaranteed
performance

60
Designing for 100 MHz

Volts, Amps, and Watts
PCB signal distribution
chip inputs and outputs
power and thermal considerations
Ones and zeros
logic emulation
Bits and bytes
memory hierarchy

61
100 MHz Memory

Virtex memory operates up to 200 MHz
High-speed memory has two benefits
data storage
work-in-progress
input/output buffers, FIFOs
accelerating complex functions
store pre-computed values in look-up tables

62
Data Storage Hierarchy

Virtex supports 3 levels of memory hierarchy
On-chip SelectRAM
small-to-medium memories
0.6-ns read access time
On-chip Block SelectRAM
larger memories
true dual-ported operation
3.3-ns read access time
Fast SelectI/O interfaces to external RAM
DLL boosts memory bandwidth

63
SelectRAM

SelectRAM uses CLB LUTs as user memory
16-deep RAMs
32-deep RAMs
16-deep dual-ported RAMs
16-deep shift registers
Cascadable for larger memories
128 or more words deep
uses logic resources for expansion

64
Block SelectRAM

Up to 32 dual-ported 4096-bit RAM Blocks
synchronous read and write
True dual-port memory
each port has full read and write capability
different clocks for each port
Configurable aspect ratio
trade width for depth
4096 x 1 bit to 256 x 16 bits
separate configurations for each port
Dedicated routing for memory expansion
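
Trading width for depth keeps the block at 4096 bits. The sketch below lists the configurations between the two endpoints given on the slide, assuming the width doubles at each step; the function name is made up.

```python
BLOCK_BITS = 4096  # size of one Block SelectRAM


def aspect_ratios(max_width=16):
    """List (depth, width) pairs that use the whole block."""
    ratios = []
    width = 1
    while width <= max_width:
        ratios.append((BLOCK_BITS // width, width))
        width *= 2
    return ratios


for depth, width in aspect_ratios():
    print(f"{depth} x {width}")
```

Because each port picks its own configuration, writing through a 256 x 16 port and reading through a 4096 x 1 port gives the width conversion mentioned for I/O buffers.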

65
High-Speed Memory Interfaces

SelectI/O and DLLs together provide fast access to
many types of external memory
Xilinx currently offers two reference designs
fully synthesized
automatic placement and routing
SDRAM up to 125 MHz
ZBT RAM (Zero Bus Turnaround) up to 143 MHz

66
Input/Output Data Buffers

High-performance systems need data buffers to
decouple internal operation from I/O activity
I/O may be sporadic (burst-mode busses)
I/O may be faster or slower
I/O may be wider or narrower
I/O buffers can take several forms
dual-ported RAMs
ping-pong buffers
FIFOs

67
Dual-ported I/O Buffers

Block SelectRAM is ideal for I/O buffers
dual-ported operation
independent clocks and controls
bridges between clock domains
simultaneous read and write
port-specific aspect-ratio control
built-in rate/width conversions
SelectRAM provides similar benefits on a
smaller scale

68
Ping Pong Buffers

Ping-pong buffers are pairs of blocks that
alternate between input and processing
SRLUT for small buffers
self-addressing input
0.6-ns read access
Larger buffers can use the dual-ported Block RAM
one address bit alternates read/write areas
3.3-ns read access

[Figure: two 16-bit shift registers sharing Input and Read Address, with Select choosing the Output]
69
Small FIFOs in SRLUTs

Small FIFOs can be implemented in SRLUTs
word count addresses the output data
increment and enable SRLUT to Push
decrement to Pop
enable only for both
16-Byte FIFO in 4 CLBs
16 x 16 in 6 CLBs
200 MHz
Expandable for deeper FIFOs
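
The push and pop rules above can be modeled in a few lines. This is a behavioral sketch with invented names: the shift register is a Python list, and the word count selects the oldest valid word.

```python
class SrlFifo:
    """FIFO built from a 16-deep shift register and an up/down word counter."""

    DEPTH = 16

    def __init__(self):
        self.srl = []    # srl[0] is the newest word
        self.count = 0   # number of valid words

    def output(self):
        """Oldest valid word, selected by the word count."""
        return self.srl[self.count - 1] if self.count else None

    def clock(self, push=False, pop=False, data=None):
        """Apply one clock edge; returns the popped word, if any."""
        if pop and self.count == 0:
            raise IndexError("pop from empty FIFO")
        if push and not pop and self.count == self.DEPTH:
            raise OverflowError("push to full FIFO")
        popped = self.output() if pop else None
        if push:                      # shift enable
            self.srl = ([data] + self.srl)[: self.DEPTH]
        if push and not pop:
            self.count += 1           # push: shift and count up
        elif pop and not push:
            self.count -= 1           # pop: count down only
        return popped                 # push and pop together: shift only


fifo = SrlFifo()
fifo.clock(push=True, data="a")
fifo.clock(push=True, data="b")
print(fifo.clock(pop=True), fifo.output())
```

No read or write pointers are needed: the shift register moves the data and the counter alone tracks where the oldest word sits.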

[Figure: up/down word counter (Push counts up, Pop counts down) addressing a 16-bit shift register with Input and Output]
70
Large FIFOs in Block RAM

Large FIFOs can use the dual-ported block RAM
add read and write address counters
Asynchronous push and pop
Different port sizes give rate-for-width
conversion
Block RAM FIFOs can operate at up to 170 MHz
including flag logic

[Figure: Block SelectRAM with input and output data ports; a write address counter (enabled by Push, with WE) and a read address counter (enabled by Pop); control logic generates Full and Empty]
71
Pre-computing for Speed

Some functions are too complex for 10-ns logic
implementation
pipelining is not always possible
An alternative is to pre-compute all the possible
results and store them in memory
select a result according to the inputs
Function time is independent of complexity
0.6 ns SelectRAM access time
3.3 ns Block SelectRAM access time
The function table can be smaller than the logic

72
Multiplication By A Constant

Sometimes, data has to be scaled
multiplied by a constant value
A full multiplier is too expensive
it can multiply by a variable
unnecessarily general and too complex
Storing all multiples of the constant is a
better alternative
smaller and much faster

[Figure: a multiplier array taking Constant and Input to produce Scaled Data, compared with a product table addressed by Input to produce Scaled Data]
73
16-bit Scaler

A 2^16-word product table is impractical
partition the input into nibbles
use 16-word LUTs for nibble products
combine the partial products in adders
Roughly half the CLBs of a full multiplier
for a 16-bit coefficient: 36 CLBs vs. 62 CLBs
Pipeline the adders for extra speed
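
The nibble partitioning can be sketched in software: one 16-word table holds every multiple of the constant, each 4-bit slice of the input looks up its partial product, and the partial products are added with weights of 1, 16, 256, and 4096. Function names and the example constant are assumptions.

```python
def build_product_table(constant):
    """16-word LUT holding constant x n for every 4-bit value n."""
    return [constant * n for n in range(16)]


def scale16(value, table):
    """Multiply a 16-bit value by the constant using four nibble lookups."""
    result = 0
    for shift in (0, 4, 8, 12):           # weights x1, x16, x256, x4096
        nibble = (value >> shift) & 0xF
        result += table[nibble] << shift   # shifted partial product
    return result


table = build_product_table(12345)         # hypothetical coefficient
print(scale16(0x1234, table) == 0x1234 * 12345)
```

In hardware the four lookups happen in parallel and only the adders sit in series, which is why pipelining the adders recovers the speed.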

[Figure: Input split across four LUTs; the LUT outputs are weighted x4096, x256, x16, and x1 and summed into Scaled Data]
74
Changing the Constant

The SRLUT mode can be used to update the table
push-only stack
last 16 bits loaded define the table
A simple accumulator computes all products of a
new constant

[Figure: a Constant register and an accumulator register (Clear, Load) feeding a 16-bit shift register between Input and Output; Change Constant starts the update]
75
Large Function Tables

Larger functions can be implemented in the Block
SelectRAM
12-input functions
micro-coded state machines
Data tables can also be implemented
sine/cosine tables for DSP, for example
dual-ported access gives the sine and cosine
simultaneously
a simple address offset gives 90° phase shift for
accessing sine and cosine from a single table
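
The address-offset trick relies on cos(x) = sin(x + 90°). In the sketch below the table length of 256 entries per period is an assumption, as are the function names; the two lookups stand in for the two ports of the RAM.

```python
import math

N = 256  # hypothetical table length covering one full period
SINE_TABLE = [math.sin(2 * math.pi * i / N) for i in range(N)]


def sine(index):
    """Port A: read the table directly."""
    return SINE_TABLE[index % N]


def cosine(index):
    """Port B: same table, address advanced by a quarter period."""
    return SINE_TABLE[(index + N // 4) % N]


i = 32  # 45 degrees when N = 256
print(f"sin = {sine(i):.4f}, cos = {cosine(i):.4f}")
```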

76
Block RAM/ROM Creation