Title: Achieving 550 MHz in an ASIC Methodology

1
Achieving 550 MHz in an ASIC Methodology
David Chinnery, Borivoje Nikolic, Kurt Keutzer
University of California at Berkeley
2
ASIC and custom gap in processors in 0.18 um
200 MHz in slower, cheaper process
5× gap between ASIC and custom speed
Can this be bridged?
Pentium 4: 1700 MHz
Average ASIC: 100 to 250 MHz
Tensilica Xtensa: 320 MHz

Last year we showed the 6× to 8× gap and the causes
But where is an ASIC bridging the gap?
3
ASICs can do better: disk drive read channels
The picture in 1999
200 MHz in slower, cheaper process
1.5× gap between ASIC and custom speed
This can't be bridged yet
Average ASIC: 100 to 250 MHz
Pentium III: 800 MHz
TI SP4140: 550 MHz, 0.21 um
Tensilica Xtensa: 320 MHz

Last year we showed the 6× to 8× gap and the causes
But where is an ASIC bridging the gap?
Texas Instruments SP4140!
4
Where does the speed go?
5
Where does the speed go?

4.20× micro-architecture and pipelining
Architectural transformations to shorten critical path
Reducing critical path length by inserting flip-flops or latches
1.00× vs. good ASIC
6
Where does the speed go?

2.00× due to process variation and accessibility
1.20× vs. good ASIC
ASIC libraries may lag technology improvements

[Figure: distribution of chips produced versus speed, marking ASIC worst case in the worst process, ASIC good yield in the worst process, ASIC good yield in a good process, and the fastest custom bin; a 2.0× span is annotated]
7
Where does the speed go?

1.50× through dynamic logic on critical paths
p-channel MOSFETs replaced by precharge transistor
Reduced gate input capacitance, reduced area
There are also other high speed, custom logic styles

[Figure: static CMOS gate beside a clocked domino logic gate, each drawn between VDD and GND]
8
Where does the speed go?

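1.40× timing
Distribute clock tree carefully to reduce clock skew
Use latches instead of flip-flops
Avoid clock skew and setup time penalty
Share slack between pipeline stages
Slack passing
Time borrowing
1.15× vs. good ASIC

[Figure: two latches (D1/Q1 and D2/Q2) with combinational logic comb1 and comb2, and a clock waveform with high (H) and low (L) phases showing the D-Q latch delays and the slack passed from one stage to the next]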
9
Where does the speed go?

1.25× good floorplanning and placement
Place connected modules nearby, reducing wire lengths
Layout follows datapath
1.00× vs. good ASIC
10
Where does the speed go?

1.25× appropriate sizing of transistors and wires
1.05× vs. good ASIC

[Figure: two gates between VDD and GND, each driving a load capacitance C]
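
The factors listed on the preceding slides can be combined in a quick calculation. The sketch below assumes the factors are independent and compound multiplicatively, which the slides do not state, so the results are rough bounds rather than figures from the talk. The dictionary names are illustrative. Dynamic logic is left out of the second product because no "vs. good ASIC" figure is given for it.

```python
from math import prod

# Headline custom-versus-ASIC factors as listed on the slides.
headline = {
    "micro-architecture and pipelining": 4.20,
    "process variation and accessibility": 2.00,
    "dynamic logic": 1.50,
    "timing": 1.40,
    "floorplanning and placement": 1.25,
    "transistor and wire sizing": 1.25,
}

# Factors quoted "vs. good ASIC" (none is quoted for dynamic logic).
vs_good_asic = {
    "micro-architecture and pipelining": 1.00,
    "process variation and accessibility": 1.20,
    "timing": 1.15,
    "floorplanning and placement": 1.00,
    "transistor and wire sizing": 1.05,
}

# Assumption: independent factors multiply.
combined_headline = prod(headline.values())
combined_vs_good = prod(vs_good_asic.values())

print(f"all headline factors combined: {combined_headline:.2f}x")
print(f"quoted vs. good ASIC factors combined: {combined_vs_good:.2f}x")
```
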
11
ASIC Example: The TI SP4140

Constraints
Entirely new design, in a new process, in 9 months
Disk drive read channel needs high throughput, 525 Mb/s
Maximum power 1.7 W at full speed
Technology characteristics
Supply voltage 1.9 V
Process 0.21 um CMOS (0.18 um effective channel length)

[Figure: read channel block diagram between the data and signal ports. Write path labels: write encoder, scrambler, precomp. Read path labels: VGA, CT filter, ADC, FIR filter equalizer, Viterbi decoder, descrambler, read decoder. Also shown: servo and timing recovery.]
12
The TI SP4140 Critical Paths

The digital logic critical paths are in the read portion
Encoded data is read and 6-bit sampled at 550 Mb/s
FIR filter
Critical path of this is the multiply-accumulate operation
Runs at 275 MHz, 550 Mb/s throughput
Lower power consumption
Viterbi decoder
Critical part of this is the add-compare-select (ACS) unit
Single cycle feedback
Hard to pipeline, would have to unroll the recursive loop
Runs at 550 MHz, 525 Mb/s output (redundancy removed)
13
Components of Critical Path Delay

With flip-flops, cycle time T is a function of
Combinational logic delay
Clock skew
Setup time, when input must be stable
Clock-to-Q delay, from clock edge to when output changes

[Figure: timing diagram of data, Q1 and Q2 against clock edges Tclock1 and Tclock2, with the clock-to-Q delay marked]
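
The four components listed above are conventionally combined into a single lower bound on the cycle time. The symbol names below are assumptions chosen for illustration, not notation taken from the slides.

```latex
T \;\ge\; t_{\mathrm{comb}} + t_{\mathrm{skew}} + t_{\mathrm{setup}} + t_{\mathrm{clk\text{-}to\text{-}Q}}
```

Here $t_{\mathrm{comb}}$ is the combinational logic delay on the critical path; the other three terms are the timing overhead that remains even if the logic itself is made faster.
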
14
Concrete for our bridge
We examine the architecture, timing overhead, and process.
15
1. Architectural Transformations

Increase speed by reducing the critical path length
Pipeline: adding latches or flip-flops between logic
With flip-flops, pipeline stages must be balanced for high speed
Latches allow time borrowing and sharing between stages
Parallelization: increasing throughput by duplication
Retiming: to remove some logic from the critical path
25% speed up from ASIC 320 MHz to 400 MHz!
16
Pipelined FIR

Pipelining breaks up the critical path into smaller pieces

[Figure: direct-form FIR and transpose-form FIR, each with input x(n), coefficient multipliers ×h0, ×h1, ×h2, ..., ×hn, and output y(n)]
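
The difference between the two forms can be illustrated with a small behavioural model. This is a sketch, not the SP4140 implementation: the coefficients and input samples are hypothetical. Both functions compute the same output, but in the direct form each output needs one multiply followed by a chain of additions, while in the transpose form a register sits between every pair of adders, so each clocked step is one multiply plus one addition.

```python
def fir_direct(x, h):
    """Direct form: delay the input, then sum all tap products in one step."""
    delay_line = [0] * len(h)
    y = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]
        # One multiply followed by a chain of len(h) - 1 additions.
        y.append(sum(c * d for c, d in zip(h, delay_line)))
    return y


def fir_transposed(x, h):
    """Transpose form: a register holds the partial sum between adders."""
    partial = [0] * len(h)  # partial[k] is the registered sum feeding tap k
    y = []
    for sample in x:
        products = [c * sample for c in h]
        y.append(products[0] + partial[0])
        # Each register update is one multiply plus one addition.
        partial = [products[k + 1] + partial[k + 1]
                   for k in range(len(h) - 1)] + [0]
    return y


x = [1, 2, 3, 4]   # hypothetical input samples
h = [1, 2, 3]      # hypothetical coefficients
y_direct = fir_direct(x, h)
y_transposed = fir_transposed(x, h)
print(y_direct, y_transposed)
```
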
17
Pipelined, Interleaved FIR

Computation in parallel to double throughput

[Figure: transpose-form FIR with input x(n), coefficient multipliers ×h0, ×h1, ×h2, ..., ×hn, and output y(n)]
Two-path parallel transpose FIR: throughput doubles, area doubles
18
Pipelined, Interleaved FIR

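8 pipeline stages, parallel computation on two paths
Speed up by pipelining and interleaving, with flip-flops
Initial cycle time of
After transposing to pipeline
Interleaving doubles throughput, but doubles the area,

[Figure: two interleaved transpose-form FIR paths. One path takes x(n)odd and produces y(n)odd, the other takes x(n)even and produces y(n)even, each with coefficient multipliers ×h0, ×h1, ×h2, ×h3, ..., ×hn; a MUX driven by a select signal merges them into y(n)]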
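
One way to picture the two-path arrangement is to let each path compute every other output sample and have a multiplexer interleave the results. The sketch below is a behavioural illustration under that assumption; the slides do not detail how the SP4140 splits the work, and the function names and sample values are hypothetical. Each path only has to produce one result every two sample periods, which is why it can run at half the sample rate.

```python
def fir_sample(x, h, n):
    """Output sample y(n) of an FIR filter with coefficients h."""
    return sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)


def fir_two_path(x, h):
    """Compute even and odd output samples on separate paths, then merge."""
    y_even = [fir_sample(x, h, n) for n in range(0, len(x), 2)]  # path 1
    y_odd = [fir_sample(x, h, n) for n in range(1, len(x), 2)]   # path 2
    y = [0] * len(x)
    y[0::2] = y_even  # the MUX alternates between the two paths
    y[1::2] = y_odd
    return y


PATHS = 2
PATH_CLOCK_MHZ = 275  # per-path clock rate quoted on the slides
# Assumption: each path delivers one output per clock cycle.
combined_rate = PATHS * PATH_CLOCK_MHZ

y_two_path = fir_two_path([1, 2, 3, 4], [1, 2, 3])
print(y_two_path, combined_rate)
```

Under the one-output-per-clock assumption, two paths at 275 MHz give the 550 Mb/s throughput quoted for the FIR filter.
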
19
Retimed Viterbi Add-Compare-Select

Retiming removes subtractor from the critical path
Critical path is shorter!

[Figure: transformation from the standard Add-Compare-Select to the retimed Compare-Select-Add, drawn with state metrics sm, branch metrics bm, path metrics p, select signals and MUXes for states i, j, k, l, m at steps n-1 and n]
Retimed Compare-Select-Add: area doubles, subtractor not in critical path
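
The reordering from add-compare-select to compare-select-add can be checked functionally with a toy model. The sketch below assumes a fully connected two-state trellis in which the smaller metric wins, and it tracks only the metrics (no survivor paths); the branch metric values are hypothetical. It shows that storing the pre-added path metrics p and doing the addition after the selection gives the same state metrics. It does not model gate delays, so it says nothing about how much shorter the critical path becomes.

```python
def acs_step(sm_prev, bm):
    """Add-compare-select: add branch metrics, compare, keep the smaller."""
    n = len(sm_prev)
    return [min(sm_prev[i] + bm[i][k] for i in range(n)) for k in range(n)]


def csa_step(p_prev, bm_next):
    """Compare-select-add: compare stored path metrics, select the smaller,
    then add the next branch metrics for every outgoing branch."""
    n = len(p_prev)
    sm = [min(p_prev[i][k] for i in range(n)) for k in range(n)]
    p_next = [[sm[k] + bm_next[k][l] for l in range(n)] for k in range(n)]
    return sm, p_next


def run_acs(sm0, branch_metrics):
    sm = sm0
    for bm in branch_metrics:
        sm = acs_step(sm, bm)
    return sm


def run_csa(sm0, branch_metrics):
    n = len(sm0)
    zero = [[0] * n for _ in range(n)]  # padding after the last step
    # Registers start with the first set of additions already done.
    p = [[sm0[i] + branch_metrics[0][i][k] for k in range(n)] for i in range(n)]
    sm = sm0
    for step in range(len(branch_metrics)):
        last = step + 1 == len(branch_metrics)
        sm, p = csa_step(p, zero if last else branch_metrics[step + 1])
    return sm


# branch_metrics[step][i][k]: hypothetical metric for branch i -> k.
branch_metrics = [[[1, 3], [2, 0]], [[0, 2], [4, 1]]]
acs_result = run_acs([0, 0], branch_metrics)
csa_result = run_csa([0, 0], branch_metrics)
print(acs_result, csa_result)
```
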
20
2. Reducing Timing Overhead

Timing overhead's proportion of total delay varies
Custom is 1.4× faster than typical ASIC
Better clock skew: 1.10×
Fast latches and flip-flops: 1.10×
Can include logic in a custom latch
Overlapping clock phases to eliminate some latches
ASICs can reduce skew and use faster memory elements
Typical ASIC has timing overhead of 1.0 ns
SP4140 has timing overhead of less than 0.5 ns
Speed up of 25% from 400 MHz to 500 MHz!

1.15×
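
The 400 MHz to 500 MHz step is consistent with the overhead figures on this slide, as the arithmetic below shows. It is a sketch under stated assumptions: the 400 MHz design is taken to carry the typical 1.0 ns overhead, the logic delay is taken to be unchanged, and the SP4140 overhead is set to exactly 0.5 ns although the slide says "less than 0.5 ns". The variable names are illustrative.

```python
def period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency in MHz."""
    return 1_000_000 / freq_mhz


TYPICAL_OVERHEAD_PS = 1000  # typical ASIC timing overhead (1.0 ns)
SP4140_OVERHEAD_PS = 500    # upper bound used for the SP4140 (0.5 ns)

# Assumption: the 400 MHz period is logic delay plus the typical overhead.
logic_ps = period_ps(400) - TYPICAL_OVERHEAD_PS

# Same logic delay with the reduced overhead.
new_period_ps = logic_ps + SP4140_OVERHEAD_PS
new_freq_mhz = 1_000_000 / new_period_ps

print(f"logic delay: {logic_ps:.0f} ps, new clock: {new_freq_mhz:.0f} MHz")
```
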
21
Use Latches

Cycle time T with flip-flops
Latch-based designs are faster because
Latch is transparent for half of the cycle
Cycle time not affected by
clock skew
setup time
If input arrives after latch is transparent, and before the latch closes
Minimum time between clock edges is
When latch is transparent, delay from input arrival to output changing is

[Figure: clock driving latch L1, combinational logic 1, latch L2, and combinational logic 2 in sequence]
22
Use Latches

Latches are faster than flip-flops because
Pipeline stages don't have to be well-balanced
Slack passing and time borrowing between stages

[Figure: latch L1 (D1, Q1), combinational logic 1, latch L2 (D2, Q2), and combinational logic 2, with a clock waveform showing high (H) and low (L) phases, the D-Q latch delays, the comb1 and comb2 delays, and the slack passed between stages]
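
Time borrowing can be illustrated with a simplified timing walk through a chain of latches that are transparent on alternating half cycles. The model below is an idealized sketch: it ignores clock skew and hold-time checks, assumes the input is ready when the first latch opens at t = 0, and uses hypothetical delays in picoseconds. A stage "borrows" whenever its data reaches the next latch after that latch has opened but before it closes.

```python
def latch_time_borrowing(comb_delays_ps, period_ps, t_dq_ps, t_setup_ps):
    """Return the time each stage borrows from the next one, or None if
    some latch would close before its input is stable."""
    half = period_ps // 2
    departure = t_dq_ps  # data leaves the first latch, which opens at t = 0
    borrowed = []
    for i, t_comb in enumerate(comb_delays_ps, start=1):
        arrival = departure + t_comb
        opens = i * half        # receiving latch becomes transparent
        closes = opens + half   # and closes half a cycle later
        if arrival > closes - t_setup_ps:
            return None         # too late even with borrowing
        borrowed.append(max(0, arrival - opens))
        # Data passes straight through if the latch is already open.
        departure = max(arrival, opens) + t_dq_ps
    return borrowed


# Unbalanced stages: 1200 ps then 600 ps, with a 1000 ps half cycle.
borrowed = latch_time_borrowing([1200, 600], period_ps=2000,
                                t_dq_ps=100, t_setup_ps=100)
# A first stage that overruns the whole transparent window fails.
too_slow = latch_time_borrowing([2000, 600], period_ps=2000,
                                t_dq_ps=100, t_setup_ps=100)
print(borrowed, too_slow)
```

In the first case the long stage borrows 300 ps from the short stage that follows, and the short stage finishes exactly as the next latch opens, so the pair fits a 2000 ps cycle even though the first stage alone exceeds the half cycle.
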
23
Reduce flip-flop setup time and clock-to-Q delay

Sometimes have to use flip-flops
Single cycle recursion in Viterbi decoder
No time borrowing!
Fast flip-flops: pulsed latches
Hybrid-latch flip-flop
Sense-amplifier flip-flop
Custom cell characterized for standard cell synthesis of SP4140
Characteristics
Smaller setup time and clock-to-Q delay
First stage generates a pulse
Second stage captures the pulse
Clock skew tolerance
24
Hybrid-Latch Flip-flop

When D transitions low
X, Clk, and the delayed, inverted Clk all high
Causes Q to transition low
When D transitions high
Low pulse on X
Causes Q to transition high
Otherwise Q is held by the cross-coupled inverters

[Figure: hybrid-latch flip-flop schematic, with a pulse generator stage and a true single-phase clock latch stage, signals D, Clk, X and Q, supplied from Vdd]
25
Sense-Amplifier Flip-flop (SAFF)

Sense amplifier amplifies the difference between D and its complement
After the clock goes high, it pulls S or R low
Set-reset latch captures the pulse
Sized for typical load
Characterized for use in ASIC flow

[Figure: SAFF schematic, with a clocked sense amplifier stage taking D and its complement and producing S and R, followed by a set-reset latch producing Q and its complement, supplied from Vdd]
26
Partitioning and Clock Tree Design

Timing critical blocks are 10,000 to 30,000 gates
Layout areas of 1 to 2 mm², small size helps synthesis converge
Blocks have local gated clock trees
Clock distribution over a smaller area reduces clock skew
Fixed fanout at each clock tree level
Insert dummies to match the insertion delays
Local trees merged into global tree with added clock gating
Poor ASIC skew is 500 ps or more
Good ASIC skew is 200 ps
TI SP4140 clock skew of 60 ps

[Figure: local clock tree fanning out to about 100 clocked elements]
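
To see why the skew figures matter at this clock rate, each can be expressed as a fraction of the 550 MHz cycle. This is an illustrative comparison only: it evaluates all three skew values against the same 550 MHz clock, which is an assumption made for the sake of comparison, and the names are hypothetical.

```python
CLOCK_MHZ = 550  # SP4140 clock rate from the slides

skews_ps = {"poor ASIC": 500, "good ASIC": 200, "TI SP4140": 60}


def skew_fraction(skew_ps, clock_mhz):
    """Clock skew as a fraction of the clock period."""
    return skew_ps * clock_mhz / 1_000_000


fractions = {name: skew_fraction(ps, CLOCK_MHZ) for name, ps in skews_ps.items()}

for name, fraction in fractions.items():
    print(f"{name}: {100 * fraction:.1f}% of the cycle")
```
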
27
3. Process Variation and Accessibility

ASIC libraries calculate worst case speeds for process
Achieve a good yield by better knowledge of process variation
SP4140 has a voltage regulator on chip, reducing supply voltage variation
Speed up of 10% from 500 MHz to 550 MHz!
Speeds off a line estimated to vary by 20% to 40%
Custom designs can speed bin the chips

[Figure: distribution of chips produced versus speed, marking ASIC worst case, ASIC with good yield at 1.1×, and the fast chips at 1.2× to 1.4× with the rest slower]
28
ASIC vs. custom speed, area for ACS

ASIC exploration
ACS recursion as fast as 2.2 ns
CSA recursion as fast as 1.6 ns
Area increases 2.5×
Synthesis increases area for speed
Custom CSA
Half the area at same speed
20% faster at same area

[Figure: area (0 to 20000) versus clock period (1.0 to 3.5 ns) for ACS, CSA, and custom CSA]
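
Converting the recursion times to clock rates shows why the retimed structure matters for a 550 MHz target. The sketch assumes the quoted recursion time is the full clock period of the single-cycle loop, including timing overhead, which the slide does not state explicitly; the names are illustrative.

```python
TARGET_MHZ = 550  # Viterbi decoder clock rate from the slides

recursion_ns = {"ACS": 2.2, "CSA": 1.6}


def max_freq_mhz(period_ns):
    """Highest clock rate for a single-cycle loop of the given duration."""
    return 1000 / period_ns


meets_target = {name: max_freq_mhz(t) >= TARGET_MHZ
                for name, t in recursion_ns.items()}

for name, t in recursion_ns.items():
    print(f"{name}: {max_freq_mhz(t):.0f} MHz, meets target: {meets_target[name]}")
```
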
29
Summary
We've quantified the speed differences between ASIC and custom designs
We've told you how to improve ASIC speeds

Good ASIC speeds of 320 MHz
SP4140
Architectural transformations to get to 400 MHz
Reducing timing overhead to get to 500 MHz
Reduce process variation to get to 550 MHz
Gap closed from 5× to 3×
Area and power gap of about 2×
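
The milestones in the summary can be checked against the percentages quoted earlier and against the custom reference point. The sketch below uses the Pentium 4 at 1700 MHz from the opening slides as the custom design; the variable names are illustrative.

```python
milestones_mhz = [
    ("good ASIC", 320),
    ("architectural transformations", 400),
    ("reduced timing overhead", 500),
    ("reduced process variation", 550),
]
PENTIUM_4_MHZ = 1700  # custom reference from the opening slides

# Speedup contributed by each step, as a ratio of consecutive milestones.
step_speedups = [
    after / before
    for (_, before), (_, after) in zip(milestones_mhz, milestones_mhz[1:])
]

gap_before = PENTIUM_4_MHZ / milestones_mhz[0][1]
gap_after = PENTIUM_4_MHZ / milestones_mhz[-1][1]

for (name, _), speedup in zip(milestones_mhz[1:], step_speedups):
    print(f"{name}: {speedup:.2f}x")
print(f"gap to custom: {gap_before:.1f}x before, {gap_after:.1f}x after")
```
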
30
Future Work

Quantify impact with respect to power and area
How do we improve ASIC area and power?
Does it help to extend standard cell library with
more sizes?
What is the design time cost of including custom
cells characterized for an ASIC flow?
What about bit slice tiling and custom placement
versus automated place and route?
Look for more ASIC-oriented techniques for
closing speed gap with minimal power and area
increase