Title: Achieving 550 MHz in an ASIC Methodology
Slide 1: Achieving 550 MHz in an ASIC Methodology
David Chinnery, Borivoje Nikolic, Kurt Keutzer
University of California at Berkeley
Slide 2: ASIC and custom gap in processors in 0.18 um
[Figure: speed chart. Labels:]
- 200 MHz in slower, cheaper process
- Average ASIC: 100 to 250 MHz
- Tensilica Xtensa: 320 MHz
- Pentium 4: 1700 MHz
- 5× gap between ASIC and custom speed. Can this be bridged?
- Last year we showed the 6× to 8× gap and the causes
- But where is an ASIC bridging the gap?
Slide 3: ASICs can do better: disk drive read channels
The picture in 1999
[Figure: speed chart. Labels:]
- 200 MHz in slower, cheaper process
- Average ASIC: 100 to 250 MHz
- Tensilica Xtensa: 320 MHz
- TI SP4140: 550 MHz, 0.21 um
- Pentium III: 800 MHz
- 1.5× gap between ASIC and custom speed. This can't be bridged yet
- Last year we showed the 6× to 8× gap and the causes
- But where is an ASIC bridging the gap?
- Texas Instruments SP4140!
Slide 4: Where does the speed go?
Slide 5: Where does the speed go?
- 4.20× micro-architecture and pipelining
- Architectural transformations to shorten critical path
- Reducing critical path length by inserting flip-flops or latches
- 1.00× vs. good ASIC
Slide 6: Where does the speed go?
- 2.00× due to process variation and accessibility
- 1.20× vs. good ASIC
ASIC libraries may lag technology improvements
[Figure: number of chips produced vs. speed. Labels: ASIC worst case, worst process; ASIC good yield, worst process; ASIC good yield, good process; fastest custom bin; 2.0×]
Slide 7: Where does the speed go?
- 1.50× through dynamic logic on critical paths
- p-channel MOSFETs replaced by precharge transistor
- Reduced gate input capacitance, reduced area
- There are also other high speed, custom logic styles
[Figure: static CMOS and domino logic gate schematics. Labels: VDD, GND, clock]
Slide 8: Where does the speed go?
- 1.40× timing
- Distribute clock tree carefully to reduce clock skew
- Use latches instead of flip-flops
- Avoid clock skew and setup time penalty
- Share slack between pipeline stages
- Slack passing
- Time borrowing
- 1.15× vs. good ASIC
[Figure: timing diagram of two latch stages. Signals: clock (H, L), D1, Q1, D2, Q2. Delays: t D-Q, t comb1, t comb2. Annotation: slack passed]
Slide 9: Where does the speed go?
- 1.25× good floorplanning and placement
- Place connected modules nearby, reducing wire lengths
- Layout follows datapath
- 1.00× vs. good ASIC
Slide 10: Where does the speed go?
- 1.25× appropriate sizing of transistors and wires
- 1.05× vs. good ASIC
[Figure: two gate schematics. Labels: VDD, GND, C]
Slide 11: ASIC Example: The TI SP4140
- Constraints
  - Entirely new design, in a new process, in 9 months
  - Disk drive read channel needs high throughput, 525 Mb/s
  - Maximum power 1.7 W at full speed
- Technology characteristics
  - Supply voltage 1.9 V
  - Process 0.21 um CMOS (0.18 um effective channel length)
[Figure: read channel block diagram. Write path blocks (write data to write signal): encoder, scrambler, precomp. Read path blocks (read signal to read data): VGA, CT filter, ADC, FIR filter equalizer, Viterbi decoder, decoder, descrambler. Other blocks: servo, timing recovery]
Slide 12: The TI SP4140 Critical Paths
- The digital logic critical paths are in the read portion
- Encoded data is read and 6-bit sampled at 550 Mb/s
- FIR filter
  - Critical path of this is the multiply-accumulate operation
  - Runs at 275 MHz, 550 Mb/s throughput
  - Lower power consumption
- Viterbi decoder
  - Critical part of this is the add-compare-select (ACS) unit
  - Single cycle feedback
  - Hard to pipeline, would have to unroll the recursive loop
  - Runs at 550 MHz, 525 Mb/s output (redundancy removed)
Slide 13: Components of Critical Path Delay
- With flip-flops, cycle time T is a function of
  - Combinational logic delay
  - Clock skew
  - Setup time, when input must be stable
  - Clock-to-Q delay, from clock edge to when output changes
[Figure: timing diagram. Signals: data, Q1, Q2, clock, Tclock1, Tclock2. Annotation: clock-to-Q]
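The four components listed on this slide are commonly combined by simple addition to bound the clock period. The following is a minimal Python sketch under that additive assumption; the function names and the example delays are illustrative only and are not SP4140 figures.

```python
def min_cycle_time_ns(t_comb, t_skew, t_setup, t_clk_to_q):
    """Lower bound on the cycle time T of a flip-flop stage, in ns.

    Assumes the common additive model: data leaves the launching
    flip-flop after clock-to-Q, passes through the logic, and must
    settle one setup time before the capturing edge, which may
    arrive early by the clock skew.
    """
    return t_clk_to_q + t_comb + t_setup + t_skew


def max_frequency_mhz(cycle_time_ns):
    """Convert a cycle time in ns to a clock frequency in MHz."""
    return 1000.0 / cycle_time_ns


# Hypothetical delays in ns, chosen only to illustrate the formula.
T = min_cycle_time_ns(t_comb=1.5, t_skew=0.2, t_setup=0.1, t_clk_to_q=0.2)
f = max_frequency_mhz(T)
print(f"T >= {T:.2f} ns, f <= {f:.0f} MHz")
```

Only the combinational term does useful work; the other three terms are the timing overhead that later slides try to shrink.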
Slide 14: Concrete for our bridge
We examine the architecture, timing overhead, and process.
Slide 15: 1. Architectural Transformations
- Increase speed by reducing the critical path length
- Pipeline: adding latches or flip-flops between logic
  - With flip-flops, pipeline stages must be balanced for high speed
  - Latches allow time borrowing and sharing between stages
- Parallelization: increasing throughput by duplication
- Retiming to remove some logic from the critical path
- 25% speed up from ASIC 320 MHz to 400 MHz!
Slide 16: Pipelined FIR
- Pipelining breaks up the critical path into smaller pieces
[Figure: direct form FIR and transpose-form FIR block diagrams, each with input x(n), coefficient multipliers ×h0, ×h1, ×h2, ..., ×hn, and output y(n)]
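The two FIR structures named on this slide compute the same output but place their registers differently. The sketch below models both in Python, assuming an ordinary causal FIR filter with zero initial state; the function names and test data are illustrative and do not describe the SP4140 datapath.

```python
def fir_direct(h, x):
    """Direct form: a delay line on the input, then one long sum.

    All tap products are summed in the same cycle, so the adder chain
    grows with the number of taps.
    """
    delay_line = [0] * len(h)
    y = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]
        y.append(sum(c * d for c, d in zip(h, delay_line)))
    return y


def fir_transpose(h, x):
    """Transpose form: registers sit between the adders.

    Each cycle does one multiply and one add per tap, then stores the
    partial sum, so the path between registers stays short.
    """
    regs = [0] * (len(h) - 1)
    y = []
    for sample in x:
        products = [c * sample for c in h]
        y.append(products[0] + (regs[0] if regs else 0))
        next_regs = []
        for i in range(len(regs)):
            upstream = regs[i + 1] if i + 1 < len(regs) else 0
            next_regs.append(products[i + 1] + upstream)
        regs = next_regs
    return y


h = [1, 2, 3]
x = [1, 0, 0, 2]
print(fir_direct(h, x))
print(fir_transpose(h, x))
```

Both calls return the same sequence, which is the point of the transformation: the arithmetic is unchanged and only the register placement moves.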
Slide 17: Pipelined, Interleaved FIR
- Computation in parallel to double throughput
[Figure: transpose-form FIR with input x(n), coefficient multipliers ×h0, ×h1, ×h2, ..., ×hn, and output y(n)]
Two-path parallel transpose FIR: throughput doubles, area doubles
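Slide 18: Pipelined, Interleaved FIR
- 8 pipeline stages, parallel computation on two paths
- Speed up by pipelining and interleaving, with flip-flops
- Initial cycle time of [expression missing in source]
- After transposing to pipeline [expression missing in source]
- Interleaving doubles throughput, but doubles the area
[Figure: two parallel transpose-form FIR paths with coefficient multipliers ×h0, ×h1, ×h2, ×h3, ..., ×hn. One path takes x(n)odd and produces y(n)odd, the other takes x(n)even and produces y(n)even. A MUX controlled by select merges them into y(n)]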
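Interleaving can be sketched as two filter paths that each produce every other output sample, with a multiplexer merging them. The Python model below is a functional sketch only: it assumes each path has access to the full sample stream, and the names are illustrative. The exact wiring of the slide's figure is not recoverable from the text.

```python
def fir_output(h, x, n):
    """One output sample y(n) of a causal FIR filter with taps h."""
    return sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)


def interleaved_fir(h, x):
    """Two paths, each computing half of the outputs, merged by a mux.

    Each path only has to deliver one result every two sample periods,
    so it can be clocked at half the sample rate.
    """
    even_path = [fir_output(h, x, n) for n in range(0, len(x), 2)]
    odd_path = [fir_output(h, x, n) for n in range(1, len(x), 2)]
    y = []
    for n in range(len(x)):
        select = n % 2                      # mux select toggles per sample
        path = odd_path if select else even_path
        y.append(path[n // 2])
    return y


h = [1, 2, 3]
x = [1, 0, 0, 2, 5, 1]
reference = [fir_output(h, x, n) for n in range(len(x))]
print(interleaved_fir(h, x) == reference)
```

This matches the earlier statement that the FIR filter runs at 275 MHz while sustaining 550 Mb/s: two paths at half rate give the full rate, at the cost of duplicated hardware.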
Slide 19: Retimed Viterbi Add-Compare-Select
- Retiming removes subtractor from the critical path
- Critical path is shorter!
[Figure: Standard Add-Compare-Select datapath, with state metrics sm, branch metrics bm, values p at steps n-1 and n, select signals, and a MUX]
[Figure: Transformation to Compare-Select-Add]
Retimed Compare-Select-Add: area doubles, subtractor not in critical path
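The retiming on this slide moves the add from before the compare to after the select, without changing what the decoder computes. The Python sketch below checks that functional equivalence on a small, fully connected two-state trellis. Reading `sm` as state metric and `bm` as branch metric is an assumption based on the figure labels, and the trellis, metric values, and function names are hypothetical.

```python
import random


def acs(bm):
    """Standard add-compare-select over a fully connected trellis.

    bm[n][i][k] is the branch metric from state i to state k at step n.
    Returns the per-step survivor choices and the final state metrics.
    """
    num_states = len(bm[0])
    sm = [0] * num_states
    decisions = []
    for step in bm:
        new_sm, chosen = [], []
        for k in range(num_states):
            sums = [sm[i] + step[i][k] for i in range(num_states)]  # add
            best = sums.index(min(sums))                            # compare
            chosen.append(best)                                     # select
            new_sm.append(sums[best])
        sm = new_sm
        decisions.append(chosen)
    return decisions, sm


def csa(bm):
    """Retimed compare-select-add: the add is moved after the select.

    p[i][k] holds sm[i] + bm[i][k], already added in the previous step.
    """
    num_states = len(bm[0])
    p = [[bm[0][i][k] for k in range(num_states)] for i in range(num_states)]
    decisions, sm = [], [0] * num_states
    for n in range(len(bm)):
        sm, chosen = [], []
        for k in range(num_states):
            column = [p[i][k] for i in range(num_states)]
            best = column.index(min(column))  # compare
            chosen.append(best)               # select
            sm.append(column[best])
        decisions.append(chosen)
        if n + 1 < len(bm):                   # add, for the next step
            p = [[sm[k] + bm[n + 1][k][l] for l in range(num_states)]
                 for k in range(num_states)]
    return decisions, sm


random.seed(0)
branch_metrics = [[[random.randint(0, 15) for _ in range(2)] for _ in range(2)]
                  for _ in range(20)]
print(acs(branch_metrics) == csa(branch_metrics))
```

The code only demonstrates that both orderings select the same survivors. In hardware, the benefit would come from where the registers fall in the loop, and the extra stored sums per state are one plausible reading of the slide's note that area doubles.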
Slide 20: 2. Reducing Timing Overhead
- Timing overhead as a proportion of total delay varies
- Custom is 1.4× faster than typical ASIC
  - Better clock skew: 1.10×
  - Fast latches and flip-flops: 1.10×
  - Can include logic in a custom latch
  - Overlapping clock phases to eliminate some latches
  - 1.15× (third factor on the slide, not attached to a specific bullet)
- ASICs can reduce skew and use faster memory elements
  - Typical ASIC has timing overhead of 1.0 ns
  - SP4140 has timing overhead of less than 0.5 ns
- Speed up of 25% from 400 MHz to 500 MHz!
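The overhead numbers on this slide are consistent with the quoted speed up, as a short calculation shows. The Python sketch below assumes the 400 MHz design carried the typical 1.0 ns of overhead and that the logic delay is unchanged when the overhead drops to 0.5 ns; both are assumptions made for the illustration, and the function name is hypothetical.

```python
def frequency_after_overhead_cut(f_mhz, old_overhead_ns, new_overhead_ns):
    """Clock rate after shrinking timing overhead, logic delay unchanged."""
    period_ns = 1000.0 / f_mhz
    logic_ns = period_ns - old_overhead_ns
    return 1000.0 / (logic_ns + new_overhead_ns)


# 400 MHz is a 2.5 ns period: 1.5 ns of logic plus 1.0 ns of overhead.
f_new = frequency_after_overhead_cut(400.0, old_overhead_ns=1.0,
                                     new_overhead_ns=0.5)
speed_up_percent = 100.0 * (f_new - 400.0) / 400.0
print(f"{f_new:.0f} MHz, +{speed_up_percent:.0f}%")
```

Halving the overhead takes the period from 2.5 ns to 2.0 ns, which is the 400 MHz to 500 MHz step claimed on the slide.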
Slide 21: Use Latches
- Cycle time T with flip-flops
  - T ≥ clock-to-Q delay + combinational logic delay + setup time + clock skew
- Latch-based designs are faster because
  - Latch is transparent for half of the cycle
  - Cycle time is not affected by clock skew or setup time, if the input arrives after the latch is transparent and before the latch closes
- Minimum time between clock edges is [expression missing in source]
- When latch is transparent, delay from input arrival to output changing is the latch D-to-Q delay
[Figure: clock driving latches L1 and L2, with combinational logic 1 after L1 and combinational logic 2 after L2]
Slide 22: Use Latches
- Latches are faster than flip-flops because
  - Pipeline stages don't have to be well-balanced
  - Slack passing and time borrowing between stages
[Figure: latches L1 and L2 with combinational logic 1 and 2, above a timing diagram. Signals: clock (H, L), D1, Q1, D2, Q2. Delays: t D-Q, t comb1, t comb2. Annotation: slack passed]
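The benefit of not needing balanced stages can be illustrated with a simplified timing model. In the Python sketch below, a flip-flop pipeline is limited by its slowest stage plus the full timing overhead, while an idealized latch pipeline is limited by the average stage delay plus the latch D-to-Q delays. The model, the function names, and all delay values are assumptions for illustration. In particular, the latch bound ignores the limit on how much time a stage may borrow.

```python
def flip_flop_cycle_ns(stage_delays, t_clk_to_q, t_setup, t_skew):
    """Flip-flop pipeline: the slowest stage sets the clock period."""
    return max(stage_delays) + t_clk_to_q + t_setup + t_skew


def latch_cycle_ns(stage_delays, t_dq, latches_per_stage=2):
    """Idealized latch pipeline with unrestricted time borrowing.

    Assumes data always arrives while each latch is transparent, so
    only the average logic delay plus the latch D-to-Q delays counts.
    A real design must also check that no stage borrows more than the
    transparent window allows.
    """
    average = sum(stage_delays) / len(stage_delays)
    return average + latches_per_stage * t_dq


# Hypothetical unbalanced two-stage pipeline, delays in ns.
stages = [1.6, 1.0]
t_ff = flip_flop_cycle_ns(stages, t_clk_to_q=0.2, t_setup=0.1, t_skew=0.2)
t_latch = latch_cycle_ns(stages, t_dq=0.15)
print(f"flip-flops: {t_ff:.2f} ns, latches: {t_latch:.2f} ns")
```

With these numbers the slow stage borrows 0.3 ns from the fast one, so the latch design is set by the 1.3 ns average rather than the 1.6 ns worst case.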
Slide 23: Reduce flip-flop setup time and clock-to-Q delay
- Sometimes have to use flip-flops
  - Single cycle recursion in Viterbi decoder
  - No time borrowing!
- Fast flip-flops: pulsed latches
  - Hybrid-latch flip-flop
  - Sense-amplifier flip-flop
  - Custom cell characterized for standard cell synthesis of SP4140
- Characteristics
  - Smaller setup time and clock-to-Q delay
  - First stage generates a pulse
  - Second stage captures the pulse
  - Clock skew tolerance
Slide 24: Hybrid-Latch Flip-flop
- When D transitions low
  - X, Clk, and the delayed inverted Clk are all high
  - Causes Q to transition low
- When D transitions high
  - Low pulse on X
  - Causes Q to transition high
- Otherwise Q is held by the cross-coupled inverters
[Figure: hybrid-latch flip-flop schematic. Labels: Vdd, pulse generator, true single-phase clock latch, D, X, Clk, Q]
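Slide 25: Sense-Amplifier Flip-flop (SAFF)
- Sense amplifier amplifies the difference between D and its complement
- After the clock goes high, it pulls S or R low
- Set-reset latch captures the pulse
- Sized for typical load
- Characterized for use in ASIC flow
[Figure: SAFF schematic. A sense amplifier stage (inputs D, complement of D, and Clk; outputs S and R) feeds a set-reset latch that drives Q and its complement. Labels: Vdd, sense amplifier, set-reset latch]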
Slide 26: Partitioning and Clock Tree Design
- Timing critical blocks are 10,000 to 30,000 gates
- Layout areas of 1 to 2 mm², small size helps synthesis converge
- Blocks have local gated clock trees
- Clock distribution over a smaller area reduces clock skew
- Fixed fanout at each clock tree level
- Insert dummies to match the insertion delays
- Local trees merged into global tree with added clock gating
- Poor ASIC skew is 500 ps or more
- Good ASIC skew is 200 ps
- TI SP4140 clock skew of 60 ps to about 100 clocked elements
Slide 27: 3. Process Variation and Accessibility
- ASIC libraries calculate worst case speeds for process
- Achieve a good yield by better knowledge of process variation
- SP4140 has a voltage regulator on chip, reducing supply voltage variation
- Speed up of 10% from 500 MHz to 550 MHz!
- Speeds off a line estimated to vary by 20% to 40%
- Custom designs can speed bin the chips
[Figure: number of chips produced vs. speed. Labels: ASIC worst case; ASIC with good yield; fast chip, rest slower; 1.1×; 1.2× to 1.4×]
Slide 28: ASIC vs. custom speed, area for ACS
- ASIC exploration
  - ACS recursion as fast as 2.2 ns
  - CSA recursion as fast as 1.6 ns
  - Area increases 2.5×
  - Synthesis increases area for speed
- Custom CSA
  - Half the area at same speed
  - 20% faster at same area
[Figure: area (axis 0 to 20000) vs. clock period (axis 1.0 to 3.5 ns), with curves for ACS, CSA, and custom CSA]
Slide 29: Summary
We've quantified the speed differences between ASIC and custom designs. We've told you how to improve ASIC speeds.
- Good ASIC speeds of 320 MHz
- SP4140
  - Architectural transformations to get to 400 MHz
  - Reducing timing overhead to get to 500 MHz
  - Reduce process variation to get to 550 MHz
- Gap closed from 5× to 3×
- Area and power gap of about 2×
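The speed steps in this summary can be checked with a few lines of arithmetic. The Python sketch below uses only clock rates quoted in the talk. Measuring the gap against the 1700 MHz Pentium 4 is an assumption about how the 5× and 3× figures were derived.

```python
# Clock rates quoted on the slides, in MHz.
steps = [
    ("good ASIC", 320),
    ("architectural transformations", 400),
    ("reduced timing overhead", 500),
    ("reduced process variation", 550),
]
PENTIUM_4_MHZ = 1700  # custom reference point, assumed basis for the gap

gains = []
for (_, before), (name, after) in zip(steps, steps[1:]):
    gain = 100.0 * (after - before) / before
    gains.append(gain)
    print(f"{name}: {before} -> {after} MHz (+{gain:.0f}%)")

gap_before = PENTIUM_4_MHZ / steps[0][1]
gap_after = PENTIUM_4_MHZ / steps[-1][1]
print(f"gap vs. Pentium 4: {gap_before:.1f}x -> {gap_after:.1f}x")
```

The three gains come out at 25%, 25%, and 10%, and the ratio to the Pentium 4 falls from roughly 5.3 to roughly 3.1, in line with the rounded figures on the slide.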
Slide 30: Future Work
- Quantify impact with respect to power and area
- How do we improve ASIC area and power?
- Does it help to extend standard cell library with more sizes?
- What is the design time cost of including custom cells characterized for an ASIC flow?
- What about bit slice tiling and custom placement versus automated place and route?
- Look for more ASIC-oriented techniques for closing speed gap with minimal power and area increase