VLSI Arithmetic Lecture 8

1
VLSI Arithmetic Lecture 8
  • Prof. Vojin G. Oklobdzija
  • University of California
  • http://www.ece.ucdavis.edu/acsel

2
Designing for Speed and Power
  • "Ultimate Speed Adders," IEEE Trans. on Electronic
    Computers, April 1963: correspondence between
    Sklansky and Lehman
  • Sklansky:
  • "Consequently the question 'Which adder is the
    fastest?' is an impossibly difficult question if
    we define adder speed as the contribution of an
    adder to the over-all computational
    effectiveness."

3
High-Performance Arithmetic Challenges From
Architectures to Circuits
  • Ram K. Krishnamurthy, Sanu K. Mathew, Shekhar
    Borkar
  • Microprocessor Research, Intel Labs
  • Intel Corporation, Hillsboro, OR, USA
  • ramk@ichips.intel.com
  • Prof. Vojin Oklobdzija
  • ACSEL Lab, Dept. of ECE
  • University of California, Davis, CA, USA
  • vojin@ece.ucdavis.edu

16th IEEE International Computer Arithmetic
Symposium, Santiago, June 18th 2003
4
Outline
  • Motivation
  • Design choices for high-performance circuits
  • SOI vs. bulk devices: ALU design test-case
  • 64-bit ALUs in PD-SOI and Bulk CMOS
  • Energy-efficient high-performance AGU/ALUs
  • 4GHz Sparse-tree AGU Design
  • 6.5-10GHz Integer ALU Design
  • Summary

5
High-performance trends
  • Frequency doubles every generation
  • Performance-critical units
  • ALUs and AGUs
  • Register files, L0 caches

Single-cycle latency and throughput
6
64-bit ALUs in 0.18µm PD-SOI/Bulk CMOS: Design and
Scaling Trends
S. Mathew et al., ISSCC 2001
S. Mathew et al., JSSC, Nov. 2001
7
Design choices
  • High performance devices
  • Partially depleted Silicon-on-Insulator
  • Pros and cons vs. bulk CMOS
  • Scaling trends
  • High performance circuit design
  • Sparse-tree semi-dynamic AGU
  • Single-rail dynamic ALU

8
PD-SOI Devices
[Figure: PD-SOI device cross-section showing P-type and N-type bodies above the buried oxide, separated by STI, on a P-substrate]
  • Body of devices not tied to Vcc/Vss
  • Body is isolated by buried oxide
  • Floating Body!

9
History Effect in PD-SOI
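[Figure: PD-SOI transistor cross-section (gate G, source S, drain D) with the body potential coupled through Cgb, Cdb and Csb, and through Cbox across the buried oxide to the backgate]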
  • Delay is a function of switching history
  • Capacitive coupling from S/G/D
  • Impact ionization, diode conduction
  • Transient Vbs differs from DC Vbs

Complicates timing analysis
10
64-bit ALU architecture
[Figure: 64-bit ALU block diagram. Labels: external operands, 9:1 muxes, 5:1 mux, 3:1 mux, 2:1 mux, single-rail adder core, sum, mux control, shift control, sign control, 0.5pF load, 1200µm loopback bus]
Ideal test-bed for evaluating process technologies
11
High-performance Adders: Kogge-Stone
[Figure: Kogge-Stone adder, gate stages 1 to 7. Even and odd input bits each pass through PG generation, carry-merge stages CM1 to CM5, and a sum XOR, producing Sumeven and Sumodd]
Carry-merge: GG = Gi + Pi·Gi-1, GP = Pi·Pi-1
  • Generate all carries
  • Full-blown binary tree → energy-inefficient
  • Carry-merge stages: log2(N)
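
The carry-merge recurrence on this slide can be sketched behaviourally in Python. This is only an illustration of the prefix computation, not the slide's domino circuit: the function name `kogge_stone_add`, the list-based bit representation, and the use of an XOR-type propagate are assumptions made for the example.

```python
def kogge_stone_add(a, b, n=64):
    """Add two n-bit integers with a Kogge-Stone style prefix tree.

    Returns (sum modulo 2**n, number of carry-merge stages).
    """
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    # Bitwise generate and propagate (XOR-type propagate assumed here).
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    gg, gp = g[:], p[:]
    stages = 0
    d = 1
    while d < n:
        new_gg, new_gp = gg[:], gp[:]
        for i in range(d, n):
            # Carry-merge: GG = Gi + Pi*Gj, GP = Pi*Pj, with j = i - d.
            new_gg[i] = gg[i] | (gp[i] & gg[i - d])
            new_gp[i] = gp[i] & gp[i - d]
        gg, gp = new_gg, new_gp
        d *= 2
        stages += 1
    # gg[i] is now the carry out of bit i; bit i consumes the carry out of bit i-1.
    total = 0
    for i in range(n):
        cin = gg[i - 1] if i > 0 else 0
        total |= (p[i] ^ cin) << i
    return total, stages
```

For n = 64 the loop runs for merge distances 1, 2, 4, 8, 16 and 32, which is the log2(N) stage count quoted above; every bit position has a merge cell at every stage, which is where the energy cost comes from.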

12
64-bit Han-Carlson adder core
[Figure: 64-bit Han-Carlson adder core over bits b0 to b63 (odd and even bitslices): PG generator, carry-merge stages CM0 to CM5, odd carry generator, sum XOR; gate labels 3N, 2P, 2N]
  • Carry-merge done on even bitslices
  • 50% fewer carry-merge gates vs. Kogge-Stone
  • Extra logic stage generates odd carries
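
A behavioural sketch of the Han-Carlson idea, with a simple cell count, is given below. It is an illustration under stated assumptions rather than the slide's circuit: the names `han_carlson_add` and `kogge_stone_cells` are hypothetical, bits are indexed from 0 so the tree runs on odd-indexed positions (whose carry-outs are the carries into the even bits), and one "cell" is counted per carry-merge operation.

```python
def kogge_stone_cells(n):
    """Carry-merge cells in a full Kogge-Stone tree: (n - d) cells per distance d."""
    cells, d = 0, 1
    while d < n:
        cells += n - d
        d *= 2
    return cells


def han_carlson_add(a, b, n=64):
    """Return (sum mod 2**n, tree cells, extra-stage cells) for a Han-Carlson style adder."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    gg, gp = g[:], p[:]
    tree_cells = 0
    # First merge level: each odd position absorbs the even neighbour below it.
    for i in range(1, n, 2):
        gg[i], gp[i] = g[i] | (p[i] & g[i - 1]), p[i] & p[i - 1]
        tree_cells += 1
    # Kogge-Stone style levels restricted to the odd positions.
    d = 2
    while d < n:
        new_gg, new_gp = gg[:], gp[:]
        for i in range(d + 1, n, 2):
            new_gg[i] = gg[i] | (gp[i] & gg[i - d])
            new_gp[i] = gp[i] & gp[i - d]
            tree_cells += 1
        gg, gp = new_gg, new_gp
        d *= 2
    # Extra stage: carry out of each even position from the odd position below it.
    carry = gg[:]
    extra_cells = 0
    for i in range(2, n, 2):
        carry[i] = g[i] | (p[i] & gg[i - 1])
        extra_cells += 1
    total = 0
    for i in range(n):
        cin = carry[i - 1] if i > 0 else 0
        total |= (p[i] ^ cin) << i
    return total, tree_cells, extra_cells
```

Under this counting model a 64-bit tree uses 161 merge cells against 321 for Kogge-Stone, roughly half, with 31 further cells in the extra stage. Whether the slide's 50% figure counts the extra stage is not stated in the source.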

13
Energy-efficient adder core
  • 43% less energy/transition at iso-performance

14
Han-Carlson carry-merge tree
[Figure: Han-Carlson carry-merge tree. Even and odd inputs pass through PG generation, carry-merge stages CM0 to CM6, the odd carry generator and complementary signal generators (CSG) to produce Ceven and Codd; labels: single rail, dual rail, gate types 2P, 2N, 3N]
  • Single rail adder core
  • CSG circuit generates dual-rail carry

15
Complementary signal gen.
[Figure: complementary signal generator clocked by Φ2. A true pull-down path and a complementary pull-down path, each with a keeper, produce Carry_i and its complement]
  • Domino-compatible Carry and complemented Carry
  • Permits a single-rail carry-merge tree design
  • Not time-borrowable: penalty absorbed by placing
    gate at Φ2 boundary

16
Partial sum generator
[Figure: partial sum generator clocked by Φ1, with keeper; inputs Ai, Bi, Pi, Gi; output Psum_i]
  • Generates domino-compatible partial sum
  • Placing the gate at Φ1 boundary mitigates output
    noise glitches

17
ALU performance in bulk CMOS
[Figure: ALU critical path in bulk CMOS across Φ1 and Φ2. Labels: input, 9:1 mux, adder core, sum, 5:1 mux, 3:1 mux, bus driver, 1200µm bus; gate sequence 2N, 2P, 2P, 2P, 2N, XOR, 2N, 3N, 2P; 310ps]
0.18µm bulk CMOS, Vcc = 1.5V
18
Porting from bulk to PD-SOI
Direct port: bulk design → SOI design
  • Design issues
  • Noise tolerance due to lowered Vt
  • Min-delay timing analysis

SOI-favored redesign: bulk design → SOI-optimal design
  • Motivation for redesign
  • Reduced SOI stack penalty
  • Deeper stack design
  • Stage reduction
  • Design choices
  • Architecture should favor deep stack design
  • Avoid increase in fanouts

19
0.18µm bulk and PD-SOI technologies
  • Equal IOFF at DC Vbs
  • SOI IDSAT is 1-2% lower

20
History effect measurements in 0.18µm PD-SOI
21
Direct port of Han-Carlson ALU to PD-SOI
0.18µm technology, Vcc = 1.5V
  • Adder core speedup: 14%
  • Stasiak et al., ISSCC 2000: 21% speedup

22
Speedup analysis
  • Diffusion-dominated muxes: maximum speedup
  • Load-dominated gates: speedup decreases

23
Motivation for PD-SOI-optimal redesign
  • Reduced stack penalty in SOI
  • Deeper stack design → stage reduction
  • ALU is amenable to such a redesign
  • Not true for all CPU critical paths
  • SOI-optimal ALU architecture
  • Increasing stack depth must not increase fanouts
  • A novel deep-stack sparse-tree ALU was developed

24
Sparse-tree adder core
[Figure: 64-bit sparse-tree adder core over bits b0 to b63. A PG generator and fast carry-merge tree merge bit pairs (1:0, 3:2, through 63:62) into wider groups (7:0, 15:8, 23:16, 31:24, 39:32, 47:40, then 15:0, 31:16, 47:32, then 31:0, 47:0); intermediate carry generators and muxes feed sixteen SumGen blocks; gate labels 2N, 2P, 4N, 3N]
  • 50% reduced fanouts compared to Han-Carlson
  • 7 gate stages (two fewer than Han-Carlson)

25
Intermediate Carry Generator
[Figure: intermediate carry generator. Inputs P3:0/G3:0, P7:4/G7:4 and P11:8/G11:8 feed carry-merge (CM) gates for carry-in 0 and carry-in 1; 2:1 muxes controlled by the carry from the fast CM chain output C3, C7 and C11]
  • Generates 1 in 4 carries (C3, C7, C19 .. C59)
  • Non-critical path (ripple carry-select scheme)
  • Fast carry selects between the conditional carries

26
Non-critical Sum Generator
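[Figure: non-critical 4-bit sum generator. Inputs Pi, Gi through Pi+3, Gi+3; carry-merge (CM) gates ripple the conditional carries for carry-in 0 and carry-in 1; XORs form conditional sums Sumi,0 and Sumi,1; 2:1 muxes controlled by the incoming carry output Sumi through Sumi+3]
  • Non-critical path: ripple carry chain
  • Reduced area, energy consumption, leakage
  • Generate conditional sums for each bit
  • 1 in 4 carry selects appropriate sum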
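
The conditional-sum idea in these bullets can be sketched for a single 4-bit group as follows. This is a behavioural illustration only; the function names `conditional_sums` and `select_sum` are hypothetical, and the XOR-type propagate is an assumption of the example.

```python
def conditional_sums(a4, b4):
    """Ripple both candidate sums of a 4-bit group: one for carry-in 0, one for carry-in 1."""
    results = []
    for cin in (0, 1):
        carry = cin
        s = 0
        for i in range(4):
            ai = (a4 >> i) & 1
            bi = (b4 >> i) & 1
            g = ai & bi
            p = ai ^ bi
            s |= (p ^ carry) << i
            # Ripple carry-merge inside the group (the non-critical side path).
            carry = g | (p & carry)
        results.append(s)
    return results[0], results[1]


def select_sum(a4, b4, group_carry_in):
    """2:1 mux: the 1-in-4 carry picks the matching precomputed sum."""
    sum0, sum1 = conditional_sums(a4, b4)
    return sum1 if group_carry_in else sum0
```

Both candidate sums are ready before the group's carry-in arrives, so only the final 2:1 mux sits behind the carry on the critical path.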

27
Sparse-tree adder critical path
[Figure: sparse-tree adder critical path from Input to Sumout. Labels: fast carry-merge path, intermediate carry generator, sum generator, Inv; gate types 2N, 2P, 3N, 4N]
  • Fast carry-merge path: critical path
  • Non-critical side-paths: ripple-carry

28
PD-SOI optimal redesign in 0.18µm
0.18µm technology, Vcc = 1.5V
  • Deeper stack redesign: additional 5% speedup

29
Margining for reverse-body bias in PD-SOI
  • 400mV reverse bias increases rise delay by 10%
  • Difficult to detect for large circuits
  • 10% margin required for all max-delay paths

Overall PD-SOI speedup reduces to 11%
30
Reducing reverse-bias penalty in dynamic SOI
gates
[Figure: dynamic SOI gate with clock Φ, device P0, stacked inputs A (Body-A) and B (Body-B), output Out, and device M1 on the stack node]
Cost: 5% increase in clock energy
  • Point solution for dynamic designs
  • Pre-charging stack node decreases penalty to 2%

Max-delay margin reduced to 2%
31
0.18µm ALU performance after margining
0.18µm technology, Vcc = 1.5V
  • Maximum PD-SOI speedup reduces to 19%

32
Scaling to 0.13µm technologies
  • Equal SOI and bulk IOFF-DC
  • MOSFET impact ionization data obtained from
    0.13µm bulk measurements
  • SOI parasitic BJT/diode characteristics
    unchanged from 0.18µm fitting

33
Scaling ALU designs to 0.13µm technology
0.13µm technology, Vcc = 1.2V
  • Maximum PD-SOI speedup reduces to 16%

34
SOI vs. bulk: Summary
  • 482ps energy-efficient dynamic 64b ALU in 0.18µm
    bulk
  • 310ps adder core
  • Direct port to 0.18µm SOI: 14% speedup
  • SOI-optimal redesign: 19% speedup
  • Floating body can get reverse-biased
  • Preconditioning reduces margin from 10% to 2%
  • Scaling to 0.13µm decreases PD-SOI speedup
  • Maximum PD-SOI speedup in 0.13µm falls to 16%

35
High-Performance Low Power Datapath design
[Figure: energy vs. delay curve]
Goal: shift the E-D curve
36
A 4GHz 130nm Address Generation Unit with 32-bit
Sparse-tree Adder Core
S. Mathew et al., VLSI Symp. 2002; S. Mathew et
al., JSSC, May 2003
37
Motivation
[Figure: processor thermal map, temperature in °C. Labels: cache, execution core, AGU, 120°C]
  • AGUs: performance and peak-current limiters
  • High activity → thermal hotspot
  • Goal: high-performance, energy-efficient design

38
AGU Architecture

[Figure: AGU datapath. 32-bit Base, Index (through a 3b shift), Segment and Displacement operands feed a 3:2 compressor and a 32b adder that outputs the 32-bit Effective Address; clocks clk, clk2, clk3]
  • Single-cycle latency and throughput
  • Effective Address = Base + Index*Scale +
    (Segment + Displacement)
  • 2-phase address computation

39
AGU Operation Phase 1

[Figure: AGU datapath (same block diagram), phase 1; the 3:2 compressor output is in carry-save format]
  • Index pre-scaled via 3-bit barrel shifter
  • 3:2 compressor renders partial address
  • Carry-save format
  • Adder in pre-charge state

40
AGU Operation Phase 2
[Figure: AGU datapath (same block diagram), phase 2]
  • Carry-save to binary format conversion
  • 2's complement parallel 32-bit adder
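
The two phases can be sketched in Python as a 3:2 carry-save compression followed by a single carry-propagate addition. This is a minimal behavioural model under stated assumptions: the names `compress_3_2` and `effective_address` are hypothetical, the segment and displacement are taken as one pre-combined operand `seg_disp` (suggested by the parenthesised term and the three-input compressor, but not confirmed by the source), and the scale is modelled as a left shift of 0 to 3 bits.

```python
MASK32 = 0xFFFFFFFF


def compress_3_2(x, y, z):
    """3:2 compressor: reduce three operands to a sum word and a carry word."""
    s = x ^ y ^ z                              # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit majority, weighted one bit up
    return s & MASK32, c & MASK32


def effective_address(base, index, scale_shift, seg_disp):
    """Phase 1: pre-scale the index and compress. Phase 2: one 32-bit add."""
    scaled = (index << scale_shift) & MASK32   # assumed shift range: 0..3
    s, c = compress_3_2(base & MASK32, scaled, seg_disp & MASK32)
    return (s + c) & MASK32                    # carry-save to binary conversion
```

For example, `effective_address(0x1000, 3, 2, 0x20)` returns `0x102C`. The point of the carry-save form is that phase 1 has no carry chain at all, so the only long carry propagation in the cycle is the final 32-bit adder.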

41
Kogge-Stone Adder
[Figure: 32-bit Kogge-Stone adder over bits 0 to 31: PG stage, carry-merge gates, sum XOR]
  • Critical path: PG + 5 carry-merge stages + XOR = 7 gate stages
  • Generate, propagate fanouts of 2, 3
  • Maximum interconnect spans 16b

Energy inefficient
42
Sparse-tree Adder Architecture
  • Generate every 4th carry in parallel
  • Side-path 4-bit conditional sum generator
  • 73% fewer carry-merge gates → energy-efficient
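
A behavioural sketch of a sparse-tree adder along these lines is shown below. It is not the slide's gate-level topology: the name `sparse_tree_add` is hypothetical, the prefix network over the 4-bit groups is a plain Kogge-Stone style recurrence chosen for brevity, and the propagate signal is XOR-type by assumption.

```python
def sparse_tree_add(a, b, n=32, group=4):
    """Add two n-bit integers, computing only one carry per `group` bits in the tree."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    # Group generate/propagate for each 4-bit slice.
    grp_g, grp_p = [], []
    for k in range(0, n, group):
        gg, gp = g[k], p[k]
        for i in range(k + 1, k + group):
            gg, gp = g[i] | (p[i] & gg), p[i] & gp
        grp_g.append(gg)
        grp_p.append(gp)
    # Prefix tree over groups only: yields every 4th carry (C3, C7, C11 and so on).
    d = 1
    while d < len(grp_g):
        new_g, new_p = grp_g[:], grp_p[:]
        for j in range(d, len(grp_g)):
            new_g[j] = grp_g[j] | (grp_p[j] & grp_g[j - d])
            new_p[j] = grp_p[j] & grp_p[j - d]
        grp_g, grp_p = new_g, new_p
        d *= 2
    total = 0
    for j in range(len(grp_g)):
        cin = grp_g[j - 1] if j > 0 else 0
        # Side path: ripple both conditional sums for this group.
        candidates = []
        for assumed_cin in (0, 1):
            carry, s = assumed_cin, 0
            for i in range(j * group, (j + 1) * group):
                s |= (p[i] ^ carry) << i
                carry = g[i] | (p[i] & carry)
            candidates.append(s)
        total |= candidates[cin]   # the sparse-tree carry selects the sum
    return total
```

The tree here handles n/4 group signals instead of n bit signals, which is the source of the reduction in carry-merge gates; the exact 73% figure depends on the slide's own topology and is not reproduced by this sketch.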

43
Non-critical Sum Generator
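[Figure: non-critical 4-bit sum generator. Inputs Pi, Gi through Pi+3, Gi+3; carry-merge (CM) gates ripple the conditional carries for carry-in 0 and carry-in 1; XORs form conditional sums Sumi,0 and Sumi,1; 2:1 muxes controlled by the incoming carry output Sumi through Sumi+3]
  • Non-critical path: ripple carry chain
  • Reduced area, energy consumption, leakage
  • Generate conditional sums for each bit
  • Sparse-tree carry selects appropriate sum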

44
Optimized First-level Carry-merge
[Figure: first-level carry-merge (CM) gate with carry-in tied to 0 and input Gi, output C_0: conditional carry for Cin = 0]
  • Carry-merge stage reduces to inverter
  • Conditional carry_0 = Gi

45
Optimized First-level Carry-merge
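[Figure: first-level carry-merge (CM) gate with carry-in tied to 1 and inputs Pi, Gi, output C_1: conditional carry for Cin = 1; reduced form driven by Pi alone]
  • Pi and Gi are correlated
  • Conditional carry_1 = Pi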
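
Both first-level simplifications can be checked exhaustively with a few lines of Python. The sketch assumes an OR-type propagate, Pi = Ai OR Bi, which is what makes Pi and Gi correlated (Gi = 1 implies Pi = 1); the source does not state the propagate definition, and the function names are hypothetical.

```python
def carry_merge(g, p, cin):
    """Generic first-level carry-merge: carry out = G + P*Cin."""
    return g | (p & cin)


def check_first_level():
    """Return (a, b, carry_0, carry_1) for all input pairs, asserting the simplifications."""
    rows = []
    for a in (0, 1):
        for b in (0, 1):
            g = a & b
            p = a | b                      # assumption: OR-type propagate
            c0 = carry_merge(g, p, 0)
            c1 = carry_merge(g, p, 1)
            assert c0 == g                 # Cin = 0: the stage passes Gi
            assert c1 == p                 # Cin = 1: G + P collapses to Pi
            rows.append((a, b, c0, c1))
    return rows
```

With an XOR-type propagate the second identity would not hold as written, since G + P would then equal Ai OR Bi rather than Pi.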

46
Optimized Sum Generator
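[Figure: optimized 4-bit sum generator. Inputs Pi, Gi+1, Pi+1, Pi+2, Gi+2, Pi+3, Gi+3; optimized first-level carry-merge followed by CM gates; XORs form conditional sums Sumi,0 and Sumi,1; 2:1 muxes controlled by the incoming carry output Sumi through Sumi+3]
  • Optimized non-critical path: 4 stages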

47
Adder Core Critical Path
[Figure: adder core critical path from adder inputs to Sum31. Labels: single-rail dynamic sparse-tree path (PG, GG1, GG3, GG7, GG15, GG27, C27); static sum generator (CM0 latch, CM1, XOR, Sum31_0, Sum31_1); clocks clk, clk2, clk3]
  • Critical path: 7 gate stages → same as KS
  • Sparse-tree: single-rail dynamic
  • Exploit non-criticality of sum generator
  • Convert to static logic → semi-dynamic design

48
1st-level Carry-merge Static Latch
  • Holds state in pre-charge phase
  • Prevents pre-charging of static stages

49
Domino-Static Interface
[Figure: domino-static interface circuit, clocks clk0 and clk1]
  • Sum and its complement are both 0 during pre-charge
  • Mux output resolves during evaluation

50
Sparse-tree Architecture
  • Performance impact (20% speedup)
  • 33-50% reduced G/P fanouts
  • 80% reduced wiring complexity
  • 30% reduction in maximum interconnect
  • Power impact (56% reduction)
  • 73% fewer carry-merge gates
  • 50% reduction in average transistor size

51
Energy-delay Space
[Figure: worst-case energy (pJ, 0 to 100) vs. delay (ps, 140 to 280) for the dynamic Kogge-Stone and semi-dynamic sparse-tree designs; 130nm CMOS, 1.2V, 110°C simulation; annotations: 4GHz design point, 20%, 56%]
  • 20% speedup over Kogge-Stone
  • 56% worst-case energy reduction
  • Scales with activity factor

52
Semi-dynamic Design
[Figure: average energy (pJ, 0 to 40) vs. activity factor (0 to 0.5) for the dynamic Kogge-Stone and semi-dynamic sparse-tree designs; annotation: 71%]
  • Static sum generators: low switching activity
  • 71% lower average energy at 10% activity

53
Dual-Vt Allocation
130nm CMOS, 1.2V, 110°C simulation
  • Exploit non-criticality of sidepaths
  • Use high-Vt devices
  • 0% performance penalty
  • 56% reduction in active leakage energy

54
Scaling Performance
  • Average transistor size: 3.5µm
  • Reduces impact of increasing leakage
  • 80% reduction in wiring complexity
  • Reduces impact of wire resistance
  • 33% delay scaling, 50% energy reduction

55
A 6.5GHz, 130nm Single-ended Dynamic ALU
M. Anders et al., ISSCC 2002; S. Vangal et al.,
JSSC, November 2002
56
32-bit ALU/Scheduler Loop
  • Performance-critical execution core loop

57
Han-Carlson ALU Organization
  • Single-rail dynamic 9-stage low-Vt design

58
Odd-bits CSG Sum Generation
  • Final carry-merge CSG (dual-rail carry output)
  • → pass-transistor sum XOR

59
Even-bits CSG Sum Generation
  • Domino-compatible sum
  • Dual-rail sum from single-ended g inputs

60
Die Micro-photograph
  • 130nm 6-metal dual-Vt CMOS
  • Scheduler: 210µm x 210µm
  • ALU: 84µm x 336µm

[Figure: die micro-photograph with the scheduler and ALU marked]
61
Delay and Power Measurements
[Figure: measured delay and power plots at 25ºC, with the design target marked]
  • 6.5GHz at 1.1V, 25ºC
  • Power: 120mW total, 15mW leakage
  • Scalable to 10GHz at 1.7V, 25ºC

62
Improvements Over Dual-rail Domino
  • Leakage reduced by eliminating dual-rail logic
  • Robustness not compromised
  • CSG improves both area and performance

63
Summary
  • 4GHz AGU in 1.2V, 130nm technology
  • Sparse-tree adder architecture described
  • 20% speedup and 56% energy reduction
  • Semi-dynamic design
  • Energy scales with switching activity
  • Dual-Vt non-critical paths
  • Low active leakage energy
  • 6.5GHz ALU and scheduler loop at 1.1V, 25ºC
  • Scalable to 10GHz at 1.7V, 25ºC