1
VLSI Arithmetic: Lecture 8

Prof. Vojin G. Oklobdzija
University of California
http://www.ece.ucdavis.edu/acsel

2
Designing for Speed and Power

"Ultimate Speed Adders", IEEE Trans. on Electronic Computers, April 1963: correspondence between Sklansky and Lehman
Sklansky: "Consequently the question 'Which adder is the fastest?' is an impossibly difficult question if we define adder speed as the contribution of an adder to the over-all computational effectiveness."

3
High-Performance Arithmetic: Challenges From Architectures to Circuits

Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar
Borkar
Microprocessor Research, Intel Labs
Intel Corporation, Hillsboro, OR, USA
ramk_at_ichips.intel.com
Prof. Vojin Oklobdzija
ACSEL Lab, Dept. of ECE
University of California, Davis, CA, USA
vojin_at_ece.ucdavis.edu

16th IEEE International Computer Arithmetic
Symposium, Santiago, June 18th 2003
4
Outline

Motivation
Design choices for high-performance circuits
SOI vs. bulk devices: ALU design test-case
64-bit ALUs in PD-SOI and Bulk CMOS
Energy-efficient high-performance AGU/ALUs
4GHz Sparse-tree AGU Design
6.5-10GHz Integer ALU Design
Summary

5
High-performance trends

Frequency doubles every generation
Performance-critical units
ALUs and AGUs
Register files, L0 caches

Single-cycle latency and throughput
6
64-bit ALUs in 0.18µm PD-SOI/Bulk CMOS: Design and Scaling Trends
S. Mathew et al, ISSCC 2001
S. Mathew et al, JSSC, Nov 2001
7
Design choices

High performance devices
Partially depleted Silicon-on-Insulator
Pros and cons vs. bulk CMOS
Scaling trends
High performance circuit design
Sparse-tree semi-dynamic AGU
Single-rail dynamic ALU

8
PD-SOI Devices
[Figure: cross-section of PD-SOI devices showing n and p diffusions, P-type and N-type bodies, STI isolation, buried oxide, and P-substrate]

Body of devices not tied to Vcc/Vss
Body is isolated by buried oxide
Floating Body!

9
History Effect in PD-SOI
[Figure: PD-SOI nMOS cross-section with gate (G), source (S), drain (D), n diffusions, body potential, buried oxide, backgate, and capacitances Cgb, Cdb, Csb, Cbox]

Delay is a function of switching history
Capacitive coupling from S/G/D
Impact ionization, diode conduction
Transient Vbs differs from DC Vbs

Complicates timing analysis
10
64-bit ALU architecture
[Figure: 64-bit ALU block diagram: external operands, 5:1 and 9:1 input muxes with mux, shift, and sign control, 2:1 mux, single-rail adder core, 3:1 sum mux, 0.5pF load, 1200µm loopback bus]
Ideal test-bed for evaluating process technologies
11
High-performance Adders: Kogge-Stone
[Figure: Kogge-Stone adder, stages 1 to 7, for even and odd input bits: PG generation, carry-merge stages CM1 to CM5, and XOR, producing Sumeven and Sumodd]
Carry-merge: $GG = G_i + P_i G_{i-1}$, $GP = P_i P_{i-1}$

Generate all carries
Full-blown binary tree → energy-inefficient
Carry-merge stages: log2(N)
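
The carry-merge recurrence and the log2(N) stage count above can be illustrated with a small behavioral model. This is a minimal sketch in Python, not the circuit from the slides: the function name `kogge_stone_add` is hypothetical, propagate is assumed to be XOR-based, and carry-in is assumed to be 0.

```python
def kogge_stone_add(a, b, n=64):
    """Return ((a + b) mod 2**n, number of carry-merge stages)."""
    # PG generation: per-bit generate (AND) and propagate (XOR, an assumption).
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    gg, gp = g[:], p[:]
    stages = 0
    d = 1
    while d < n:
        nxt_g, nxt_p = gg[:], gp[:]
        for i in range(d, n):
            # Carry-merge: GG = G_hi + P_hi * G_lo, GP = P_hi * P_lo
            nxt_g[i] = gg[i] | (gp[i] & gg[i - d])
            nxt_p[i] = gp[i] & gp[i - d]
        gg, gp = nxt_g, nxt_p
        d *= 2
        stages += 1
    # Sum XOR: the carry into bit i is the group generate of bits i-1..0.
    total = 0
    for i in range(n):
        carry_in = gg[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, stages
```

Every bit position gets a carry-merge gate in every stage where a lower neighbor exists, which is what makes the full tree fast but costly in gate count.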

12
64-bit Han-Carlson adder core
[Figure: 64-bit Han-Carlson adder core, odd and even bits b0 to b63: PG generator (3N), carry-merge stages Carry-merge0 (2P), Carry-merge1 (2N), through Carry-merge5 (2N), odd carry generator (2P), sum XOR]

Carry-merge done on even bitslices
50% fewer carry-merge gates vs. Kogge-Stone
Extra logic stage generates odd carries
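
A behavioral sketch can make the gate-count trade-off concrete. The Python model below is an illustration under stated assumptions, not the slide's circuit: the names `han_carlson_add` and `kogge_stone_gate_count` are hypothetical, the tree is placed on bit positions 1, 3, 5, ... (the slides' even/odd naming may follow a different indexing convention), propagate is XOR-based, carry-in is 0, and only true merge gates are counted (no buffers).

```python
def han_carlson_add(a, b, n=64):
    """Return ((a + b) mod 2**n, tree_gates, extra_gates)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    gg, gp = g[:], p[:]
    tree_gates = 0
    d = 1
    while d < n:
        nxt_g, nxt_p = gg[:], gp[:]
        # Carry-merge only on alternate bit slices (positions 1, 3, 5, ...).
        for i in range(1, n, 2):
            if i - d >= 0:
                nxt_g[i] = gg[i] | (gp[i] & gg[i - d])
                nxt_p[i] = gp[i] & gp[i - d]
                tree_gates += 1
        gg, gp = nxt_g, nxt_p
        d *= 2
    # Extra stage: remaining slices take their carry from the neighboring
    # tree output with one more carry-merge each.
    extra_gates = 0
    for i in range(2, n, 2):
        gg[i] = g[i] | (p[i] & gg[i - 1])
        extra_gates += 1
    total = 0
    for i in range(n):
        carry_in = gg[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, tree_gates, extra_gates


def kogge_stone_gate_count(n=64):
    """Carry-merge gates in a full Kogge-Stone tree (no buffers counted)."""
    count, d = 0, 1
    while d < n:
        count += n - d
        d *= 2
    return count
```

Under this counting, the 64-bit tree uses 161 carry-merge gates against 321 for the full Kogge-Stone tree, roughly half, with 31 further gates in the extra stage.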

13
Energy-efficient adder core

43% less energy/transition at iso-performance

14
Han-Carlson carry-merge tree
[Figure: Han-Carlson carry-merge tree for even and odd inputs: PG gen., single-rail carry-merge stages CM0 to CM6 built from 2N, 2P, and 3N gates, odd carry generator, and complementary signal generators (CSG) producing dual-rail Ceven and Codd]

Single rail adder core
CSG circuit generates dual-rail carry

15
Complementary signal gen.
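[Figure: complementary signal generator at the Φ2 boundary: true and complementary pull-down paths, each with a keeper, driven by Cini and producing Carryi and its complement]

Domino-compatible Carry and complemented Carry
Permits a single-rail carry-merge tree design
Not time-borrowable: penalty absorbed by placing the gate at the Φ2 boundary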

16
Partial sum generator
[Figure: partial sum generator circuit clocked by Φ1, with inputs Ai and Bi, a keeper, and outputs Pi, Gi, and Psumi]

Generates domino-compatible partial sum
Placing the gate at the Φ1 boundary mitigates output noise-glitches

17
ALU performance in bulk CMOS
[Figure: ALU critical path across clock phases Φ1 and Φ2: input, 9:1 mux, 5:1 mux, adder core, sum, 3:1 mux, bus driver, 1200µm bus; gate stages labeled 2N, 2P, 3N, and XOR; 310ps annotation]
0.18µm bulk CMOS, Vcc = 1.5V
18
Porting from bulk to PD-SOI
Direct port
SOI design

Design issues
Noise tolerance due to lowered Vt
Min-delay timing-analysis

Bulk design
SOI-optimal design
SOI favored redesign

Motivation for redesign
Reduced SOI stack penalty
Deeper stack design
Stage reduction

Design choices
Architecture should favor deep stack design
Avoid increase in fanouts

19
0.18µm bulk and PD-SOI technologies

Equal IOFF at DC Vbs
SOI IDSAT is 1-2% lower

20
History effect measurements in 0.18µm PD-SOI
21
Direct port of Han-Carlson ALU to PD-SOI
0.18µm technology, Vcc = 1.5V

Adder core speedup: 14%
Stasiak et al., ISSCC 2000: 21% speedup

22
Speedup analysis

Diffusion-dominated muxes: max. speedup
Load-dominated gates: speedup decreases

23
Motivation for PDSOI-optimal redesign

Reduced stack penalty in SOI
Deeper stack design → stage reduction
ALU is amenable to such a redesign
Not true for all CPU critical paths
SOI-optimal ALU architecture
Increasing stack depth must not increase fanouts
A novel deep-stack sparse-tree ALU was developed

24
Sparse-tree adder core
[Figure: 64-bit sparse-tree adder core, bits b0 to b63: PG generator (2N); fast carry-merge tree with 2P, 4N, 2P, and 3N stages merging groups 1:0, 3:2, ..., 63:62, then 7:0, 15:8, 23:16, 31:24, 39:32, 47:40, then 15:0, 31:16, 47:32, then 31:0 and 47:0; intermediate carry generators; muxes; 16 SumGen blocks]

50% reduced fanouts compared to Han-Carlson
7 gate stages (two fewer than Han-Carlson)

25
Intermediate Carry Generator
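[Figure: intermediate carry generator: group signals (P3:0, G3:0), (P7:4, G7:4), (P11:8, G11:8) feed carry-merge (CM) chains for carry-in 0 and carry-in 1; 2:1 muxes controlled by the carry from the fast CM chain select C3, C7, C11]

Generates 1 in 4 carries (C3, C7, C11, ..., C59)
Non-critical path (ripple carry-select scheme)
Fast carry selects between the conditional carries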
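
The ripple carry-select scheme described above can be sketched behaviorally. In this minimal Python illustration, the function name `intermediate_carries` and its list-based interface are assumptions made for the example: two carry-merge chains ripple across consecutive 4-bit groups, one assuming a block carry-in of 0 and one assuming 1, and the fast carry picks between them through 2:1 muxes.

```python
def intermediate_carries(group_g, group_p, fast_carry):
    """Select 1-in-4 carries for consecutive 4-bit groups of one block.

    group_g, group_p: group generate/propagate bits, lowest group first
    (for example groups 3:0, 7:4, 11:8).
    fast_carry: carry into the block, as produced by the fast carry-merge tree.
    """
    c0, c1 = 0, 1  # conditional carries for block carry-in 0 and 1
    selected = []
    for g, p in zip(group_g, group_p):
        c0 = g | (p & c0)  # CM chain assuming carry-in 0
        c1 = g | (p & c1)  # CM chain assuming carry-in 1
        selected.append(c1 if fast_carry else c0)  # 2:1 mux
    return selected
```

Both chains can start before the fast carry arrives, so only the final mux sits behind the fast carry-merge tree.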

26
Non-critical Sum Generator
[Figure: non-critical sum generator: Pi to Pi+3 and Gi to Gi+3 feed two ripple carry-merge (CM) chains, one for carry-in 0 and one for carry-in 1; XORs form conditional sums Sumi,0 and Sumi,1; 2:1 muxes controlled by the carry select Sumi, Sumi+1, Sumi+2, Sumi+3]

Non-critical path: ripple carry chain
Reduced area, energy consumption, leakage
Generate conditional sums for each bit
1-in-4 carry selects appropriate sum

27
Sparse-tree adder critical path
[Figure: sparse-tree adder critical path from input to Sumout: fast carry-merge path, intermediate carry generator, and sum generator, with gate stages labeled 2N, 2P, 3N, 4N, and Inv]

Fast carry-merge path: critical path
Non-critical side-paths: ripple-carry

28
PD-SOI optimal redesign in 0.18µm
0.18µm technology, Vcc = 1.5V

Deeper stack redesign: additional 5% speedup

29
Margining for reverse-body bias in PD-SOI

400mV reverse bias increases rise-delay by 10%
Difficult to detect for large circuits
10% margin required for all max-delay paths

Overall PD-SOI speedup reduces to 11%
30
Reducing reverse-bias penalty in dynamic SOI gates
[Figure: dynamic SOI gate clocked by Φ with inputs A and B, floating bodies Body-A and Body-B, devices P0 and M1, stack node, and output Out]

Point solution for dynamic designs
Pre-charging stack node decreases penalty to 2%
Cost: 5% increase in clock energy

Max-delay margin reduced to 2%
31
0.18µm ALU performance after margining
0.18µm technology, Vcc = 1.5V

Maximum PD-SOI speedup reduces to 19%

32
Scaling to 0.13µm technologies

Equal SOI and bulk IOFF-DC
MOSFET impact ionization data obtained from 0.13µm bulk measurements
SOI parasitic BJT/diode characteristics unchanged from 0.18µm fitting

33
Scaling ALU designs to 0.13µm technology
0.13µm technology, Vcc = 1.2V

Maximum PD-SOI speedup reduces to 16%

34
SOI vs. bulk: Summary

482ps energy-efficient dynamic 64b ALU in 0.18µm bulk
310ps adder core
Direct port to 0.18µm SOI: 14% speedup
SOI-optimal redesign: 19% speedup
Floating body can get reverse-biased
Preconditioning reduces margin from 10% to 2%
Scaling to 0.13µm decreases PD-SOI speedup
Maximum PD-SOI speedup in 0.13µm falls to 16%

35
High-Performance Low Power Datapath design
[Figure: energy vs. delay curve]
Goal: shift the E-D curve
36
A 4GHz 130nm Address Generation Unit with 32-bit Sparse-tree Adder Core
S. Mathew et al., VLSI Symp. 2002; S. Mathew et al., JSSC, May 2003
37
Motivation
[Figure: processor thermal map, temperature in °C, showing cache, execution core, and AGU; labeled 120°C]

AGUs: performance and peak-current limiters
High activity → thermal hotspot
Goal: high-performance energy-efficient design

38
AGU Architecture

[Figure: AGU block diagram: 32-bit Base, Index, Segment, and Displacement inputs; 3b shift on Index; 3:2 compressor; 32b add producing the Effective Address; clocks clk, clk2, clk3]

Single-cycle latency and throughput
Effective Address = Base + Index × Scale + (Segment + Displacement)
2-phase address computation

39
AGU Operation Phase 1

[Figure: AGU block diagram, phase 1 highlighted: Base, Index (3b shift), Segment, and Displacement feed the 3:2 compressor, whose output is in carry-save format; 32b adder; clocks clk, clk2, clk3]

Index pre-scaled via 3-bit barrel shifter
3:2 compressor renders partial address in carry-save format
Adder in pre-charge state
40
AGU Operation Phase 2
[Figure: AGU block diagram, phase 2 highlighted: 3:2 compressor output feeds the 32b adder, which produces the Effective Address; clocks clk, clk2, clk3]

Carry-save to binary format conversion
2's complement parallel 32-bit adder
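
The two phases can be mimicked with plain integer arithmetic. The following Python sketch rests on assumptions: the names `compress_3_2` and `effective_address` are hypothetical, the scale is modeled as a left shift of 0 to 3 bit positions, and Segment + Displacement is assumed to arrive as a single pre-added 32-bit operand, so that the compressor sees exactly three inputs.

```python
MASK32 = 0xFFFFFFFF


def compress_3_2(x, y, z):
    """Bitwise 3:2 compressor: x + y + z == s + c (mod 2**32)."""
    s = x ^ y ^ z                              # per-bit sum
    c = ((x & y) | (y & z) | (x & z)) << 1     # per-bit majority, shifted left
    return s & MASK32, c & MASK32


def effective_address(base, index, scale_shift, seg_disp):
    """Two-phase address computation (behavioral model only)."""
    # Phase 1: pre-scale the index, then compress to carry-save format.
    scaled = (index << scale_shift) & MASK32
    s, c = compress_3_2(base & MASK32, scaled, seg_disp & MASK32)
    # Phase 2: carry-save to binary conversion with a 32-bit adder.
    return (s + c) & MASK32
```

The compressor has no carry propagation across bits, so phase 1 is short and the only long carry chain is the 32-bit adder in phase 2.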

41
Kogge-Stone Adder
[Figure: 32-bit Kogge-Stone adder, bits 0 to 31: PG stage, carry-merge gates, XOR stage]

Critical path: PG + 5 carry-merge + XOR = 7 gate stages
Generate, Propagate fanout of 2, 3
Maximum interconnect spans 16b

Energy-inefficient
42
Sparse-tree Adder Architecture

Generate every 4th carry in parallel
Side-path 4-bit conditional sum generator
73% fewer carry-merge gates → energy-efficient

43
Non-critical Sum Generator
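[Figure: non-critical sum generator: Pi to Pi+3 and Gi to Gi+3 feed two ripple carry-merge (CM) chains, one for carry-in 0 and one for carry-in 1; XORs form conditional sums Sumi,0 and Sumi,1; 2:1 muxes controlled by the carry select Sumi, Sumi+1, Sumi+2, Sumi+3]

Non-critical path: ripple carry chain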
Reduced area, energy consumption, leakage
Generate conditional sums for each bit
Sparse-tree carry selects appropriate sum
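
Taken together, the sparse tree and the conditional sum generators can be modeled in a few lines. This Python sketch is an illustration only: the names `ripple_chain` and `sparse_tree_add` are hypothetical, the tree over 4-bit groups is modeled as a simple parallel-prefix network (the slides' exact tree topology is not reproduced), and carry-in is assumed to be 0.

```python
def ripple_chain(xa, xb, cin, width=4):
    """Ripple-carry chain over one group; returns (sum bits, carry out)."""
    s, c = 0, cin
    for i in range(width):
        ai, bi = (xa >> i) & 1, (xb >> i) & 1
        s |= (ai ^ bi ^ c) << i
        c = (ai & bi) | ((ai ^ bi) & c)
    return s, c


def sparse_tree_add(a, b, n=32, width=4):
    """Return ((a + b) mod 2**n, list of every 4th carry)."""
    mask = (1 << width) - 1
    groups = n // width
    slices = [((a >> (k * width)) & mask, (b >> (k * width)) & mask)
              for k in range(groups)]
    # Group generate: carry out with carry-in 0. Group propagate: all bits propagate.
    gg = [ripple_chain(xa, xb, 0, width)[1] for xa, xb in slices]
    gp = [1 if (xa ^ xb) == mask else 0 for xa, xb in slices]
    # Sparse tree: parallel-prefix carry-merge over groups only.
    d = 1
    while d < groups:
        nxt_g, nxt_p = gg[:], gp[:]
        for k in range(d, groups):
            nxt_g[k] = gg[k] | (gp[k] & gg[k - d])
            nxt_p[k] = gp[k] & gp[k - d]
        gg, gp = nxt_g, nxt_p
        d *= 2
    carries = gg  # carries[k] is the carry out of bit width*k + width - 1
    total = 0
    for k, (xa, xb) in enumerate(slices):
        s0, _ = ripple_chain(xa, xb, 0, width)  # conditional sum, carry-in 0
        s1, _ = ripple_chain(xa, xb, 1, width)  # conditional sum, carry-in 1
        cin = carries[k - 1] if k > 0 else 0
        total |= (s1 if cin else s0) << (k * width)  # 2:1 mux per group
    return total, carries
```

Only one carry per group is computed in the tree; the per-bit work moves to the short ripple chains, which run in parallel with the tree.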

44
Optimized First-level Carry-merge
[Figure: conditional carry for Cin = 0: carry-merge (CM) gate with carry-in tied to 0 and input Gi, producing C_0]

Carry-merge stage reduces to inverter
Conditional carry_0 = Gi

45
Optimized First-level Carry-merge
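[Figure: conditional carry for Cin = 1: carry-merge (CM) gate with carry-in tied to 1 and inputs Pi and Gi, producing C_1; reduced form driven by Pi alone]

Pi and Gi are correlated
Conditional carry_1 = Pi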
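
Both simplifications can be checked exhaustively over the four input combinations. The Python sketch below uses hypothetical helper names and makes one assumption explicit: the identity C_1 = Pi holds when propagate is formed as A OR B, so that Gi = 1 implies Pi = 1 (the correlation noted above). With an XOR-based propagate the second identity would not hold.

```python
from itertools import product


def carry_merge(g, p, cin):
    """Carry-merge gate: carry out = G + P * Cin."""
    return g | (p & cin)


def first_level_identities(propagate):
    """Check C_0 == G and C_1 == P for every (A, B) input pair."""
    ok = True
    for a, b in product((0, 1), repeat=2):
        g = a & b
        p = propagate(a, b)
        c0 = carry_merge(g, p, 0)  # conditional carry for Cin = 0
        c1 = carry_merge(g, p, 1)  # conditional carry for Cin = 1
        ok = ok and c0 == g and c1 == p
    return ok


or_based = first_level_identities(lambda a, b: a | b)   # assumed OR propagate
xor_based = first_level_identities(lambda a, b: a ^ b)  # fails for A = B = 1
```

Here `or_based` evaluates to True and `xor_based` to False, since with A = B = 1 the XOR propagate is 0 while the carry out is 1.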

46
Optimized Sum Generator
[Figure: optimized sum generator: optimized 1st-level carry-merge followed by the remaining CM stages for Pi to Pi+3 and Gi to Gi+3; XORs form conditional sums Sumi,0 and Sumi,1; 2:1 muxes controlled by the carry select Sumi, Sumi+1, Sumi+2, Sumi+3]

Optimized non-critical path: 4 stages

47
Adder Core Critical Path
[Figure: adder core critical path, clocked by clk, clk2, clk3: adder inputs, PG, GG1, GG3, GG7, GG15, GG27 on the single-rail dynamic sparse-tree path producing C27; static sum generator with CM0 latch, CM1, and XOR producing Sum31_0 and Sum31_1; output Sum31]

Critical path: 7 gate stages → same as KS
Sparse-tree: single-rail dynamic
Exploit non-criticality of sum generator
Convert to static logic → semi-dynamic design

48
1st-level Carry-merge Static Latch

Holds state in pre-charge phase
Prevents pre-charging of static stages

49
Domino-Static Interface
[Figure: domino-static interface, clocked by clk0 and clk1]

Sum = Sum0 during pre-charge
Mux output resolves during evaluation

50
Sparse-tree Architecture

Performance impact (20% speedup)
33-50% reduced G/P fanouts
80% reduced wiring complexity
30% reduction in maximum interconnect
Power impact (56% reduction)
73% fewer carry-merge gates
50% reduction in average transistor size

51
Energy-delay Space
[Figure: energy-delay space, 130nm CMOS, 1.2V, 110°C simulation: worst-case energy (pJ, 0 to 100) vs. delay (ps, 140 to 280) for dynamic Kogge-Stone and semi-dynamic sparse-tree; annotations mark 56%, 20%, and the 4GHz design point]