Title: Digital and Other ICs
1Digital (and Other) ICs
- Borivoje Nikolic
- bora_at_eecs.berkeley.edu
2Research Areas
- Low-Density Parity-Check Decoders
- PLLs
- Low-Power Digital ICs
- Power-performance optimization
- Compensating the impact of variations
- Working with advanced devices
- Analog-to-Digital Converters
3Outline
- Low-Density Parity-Check Decoders
- PLLs
- Low-Power Digital ICs
- Power-performance optimization
- Compensating the impact of variations
- Working with advanced devices
- Analog-to-Digital Converters
4Iterative Coding
- Iterative decoders are part of many new standards
- Also 10G Ethernet, magnetic disk drive and tape storage
5Low-Density Parity-Check Codes
- Low-density parity-check codes (Gallager, 1963)
- Sparse binary parity-check matrix, H
- Nullspace of H forms the set of codewords
- Decoded using message-passing algorithms
- Message-passing decoders
- Low-density parity-check (LDPC) codes
- Turbo-product codes are decoded similarly
[Figure label: interleaver]
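The two defining properties above (codewords lie in the nullspace of H, and decoding is iterative over the checks) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the 4 x 6 matrix `H` is a toy example, not a code from this work, and the decoder is hard-decision bit flipping, a much simpler relative of the soft message-passing algorithms the slides refer to.

```python
# Toy parity-check matrix (hypothetical): 6 bits, 4 checks, each bit in 2 checks.
H = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]

def syndrome(H, x):
    """Return H*x over GF(2); all zeros means x is a codeword."""
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]

def bit_flip_decode(H, received, max_iters=10):
    """Hard-decision bit flipping: flip the bits that sit in the most failed checks."""
    x = list(received)
    for _ in range(max_iters):
        s = syndrome(H, x)
        if not any(s):
            return x  # every parity check is satisfied
        # For each bit, count how many of its checks are currently unsatisfied.
        votes = [sum(s[i] for i in range(len(H)) if H[i][j]) for j in range(len(x))]
        worst = max(votes)
        x = [b ^ int(v == worst) for b, v in zip(x, votes)]
    return x

codeword = [1, 1, 0, 1, 0, 0]      # satisfies all four checks of H
corrupted = list(codeword)
corrupted[0] ^= 1                  # inject a single bit error
decoded = bit_flip_decode(H, corrupted)
```

With this particular `H`, a single flipped bit fails exactly the two checks it belongs to, so one iteration is enough to restore the codeword. Larger codes and soft-decision decoders behave differently.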
6Parallel LDPC Decoder Architecture
A. Blanksby and C. J. Howland, JSSC 2002
[Figure: processing elements PEv1, PEv2, ..., PEvN and PEc1, PEc2, ..., PEcM connected through an interconnect fabric]
7Staggered Serial LDPC Decoder
E. Yeo et al., Globecom 2001
8LDPC codes based on Galois Fields
- Codes based on GF projections are low rate.
- No cycles of length 4 (short loop)
- Cyclic rows
- e.g. (1023 x 1023) code has rate of 0.68
- Column splitting
- Each column in the original matrix is split into four
- Non-zero entries in the original column are cycled through the 4 new columns
- e.g. (1023 x 4092) code has rate of 0.75
- Partial loss of regularity (cyclic structure)
- Complex O(N^2) encoding
- Puncturing
- Truncate height of PC matrix
- Columns in the maximum zero run-length region correspond to parity bit locations
- Cyclic encoding using direct application of the PC matrix is now possible
Y. Kou et al., ISIT 2000
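The column-splitting step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `split_columns` and `design_rate` are hypothetical helper names, the 3 x 3 matrix is a toy example, and the non-zero entries are dealt round-robin to the new columns, which is one plausible reading of "cycled through".

```python
def split_columns(H, q):
    """Split every column of H into q columns, dealing its non-zero
    entries round-robin (top to bottom) across the q new columns."""
    m, n = len(H), len(H[0])
    out = [[0] * (n * q) for _ in range(m)]
    for j in range(n):
        rows = [i for i in range(m) if H[i][j]]
        for k, i in enumerate(rows):
            out[i][j * q + k % q] = 1
    return out

def design_rate(m, n):
    """Lower bound on the code rate of an m x n parity-check matrix.
    The true rate is higher when the rows are linearly dependent."""
    return 1 - m / n

H_small = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]
H_split = split_columns(H_small, 2)   # 3 x 6 matrix, same row weights
rate_split = design_rate(1023, 4092)  # 0.75 for the split matrix on the slide
```

Splitting keeps the number of checks and the row weights, and lowers the column weights. The bound `1 - m/n` matches the 0.75 quoted for the 1023 x 4092 matrix. It says nothing useful for the square 1023 x 1023 matrix, whose rows are linearly dependent.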
9Shift register-based implementation
- Staggered decoding.
- Regularity of codes based on finite-field geometries.
E. Yeo et al., Globecom 2001
10 4092-bit LDPC Decoder
- 1.8 million transistors
- 2.7 mm x 3.1 mm (10x smaller than a 1024-bit LDPC decoder)
- 1 GHz
- Chip back in December
E. Yeo
11Structured LDPCs
Bit node groups
Check node groups
Ed Liao
- Construction based on Ramanujan graphs allows for
hierarchical decomposition and good performance
12LDPC Codes - Status
- Two students graduated
- Engling Yeo (Ph.D.): ST Microelectronics, Berkeley Lab
- Ed Liao (M.S.): Qualcomm R&D, San Diego
- Continuing investigation of variable-rate, variable-block-size LDPC codes based on structured constructions
- BEE and ASIC implementations
13Outline
- Low-Density Parity-Check Decoders
- PLLs
- Low-Power Digital ICs
- Power-performance optimization
- Compensating the impact of variations
- Working with advanced devices
- Analog-to-Digital Converters
14PLL Jitter Analysis
15PLL Jitter Analysis
Adjusting the loop characteristics (ωN, ζ) modulates the output jitter. There exists a minimum that depends on the noise source characteristics.
Problem: noise characteristics are NOT known a priori!
Therefore, adaptive jitter optimization is desirable!
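The existence of a jitter minimum as the loop characteristics change can be reproduced with a textbook model. The sketch below is not the analysis from these slides. It assumes a standard second-order loop, white reference phase noise (`n_ref`) and VCO phase noise falling as 1/f^2 (`k_vco / f**2`), and all numeric values are made up for illustration.

```python
import math

def output_phase_variance(f_n, zeta=0.707, n_ref=1e-12, k_vco=3.0,
                          f_lo=1e1, f_hi=1e12, points=4000):
    """Integrate the output phase-noise PSD (rad^2) of a second-order PLL.
    Reference noise sees the low-pass H(s); VCO noise sees the high-pass 1 - H(s)."""
    w_n = 2 * math.pi * f_n
    ratio = (f_hi / f_lo) ** (1 / (points - 1))  # log-spaced frequency grid
    total, prev_f, prev_s, f = 0.0, None, None, f_lo
    for _ in range(points):
        s = 1j * 2 * math.pi * f
        den = s * s + 2 * zeta * w_n * s + w_n * w_n
        h = (2 * zeta * w_n * s + w_n * w_n) / den   # reference-to-output
        e = s * s / den                              # VCO-to-output
        s_out = abs(h) ** 2 * n_ref + abs(e) ** 2 * k_vco / f ** 2
        if prev_f is not None:
            total += 0.5 * (s_out + prev_s) * (f - prev_f)  # trapezoid rule
        prev_f, prev_s = f, s_out
        f *= ratio
    return total

def find_optimum():
    """Sweep the natural frequency from 10 kHz to 100 MHz and return
    (index of the best point, swept grid, phase variance at each point)."""
    grid = [10 ** (4 + k / 10) for k in range(41)]
    variances = [output_phase_variance(f_n) for f_n in grid]
    best = min(range(len(grid)), key=lambda k: variances[k])
    return best, grid, variances
```

In this model a wider loop passes more reference noise and a narrower loop passes more VCO noise, so the swept variance has an interior minimum. Its location moves with `n_ref` and `k_vco`, which is why an adaptive scheme is attractive when those are unknown.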
16PLL Circuit
17Jitter Estimation
Signals track jitter boundaries
18Implementation
- Circuit designed in 0.13 µm CMOS and taped out
- Chip came back in December and is under evaluation
- Socrates Vamvakos
[Figure labels: Jitter Estimation, PLL, DL, Driver]
19Outline
- Low-Density Parity-Check Decoders
- PLLs
- Low-Power Digital ICs
- Power-performance optimization
- Compensating the impact of variations
- Working with advanced devices
- Analog-to-Digital Converters
20Power is a Problem
- If we continue doing business as usual, both dynamic and leakage power will be a problem: chips are getting hot and phones leaky!
- Need to deliver maximum performance under power constraints
From S. Borkar, Intel
21Optimizing Combinational Circuits
[Flow diagram labels: netlist; initial W, Vdd, Vth; static timer (C); delay, energy; W, Vdd, Vth; output]
- OPTIMIZER (Matlab)
- Minimize DELAY subject to
- Maximum ENERGY
- Generate Energy-Delay (E-D) tradeoffs for combinational blocks
- Investigate the optimality of any given design
- Optimize critical single-cycle blocks
- Use inside microarchitecture optimizer
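The "minimize delay subject to maximum energy" formulation can be tried on a toy problem. The sketch below is an illustration only: it sizes a hypothetical three-inverter chain with a normalized delay model, uses total switched capacitance as the energy measure, keeps Vdd and Vth fixed, and replaces the Matlab optimizer and static timer with a brute-force grid search. None of the function names or numbers come from the original flow.

```python
def chain_delay(w2, w3, c_in=1.0, c_load=64.0):
    """Normalized delay of a 3-stage chain: each stage costs 1 + fanout."""
    return (1 + w2 / c_in) + (1 + w3 / w2) + (1 + c_load / w3)

def chain_energy(w2, w3, c_in=1.0, c_load=64.0):
    """Energy proxy: total switched capacitance at fixed supply voltage."""
    return c_in + w2 + w3 + c_load

def minimize_delay(e_max, step=0.25, w_max=32.0):
    """Grid search for the stage sizes with minimum delay and energy <= e_max.
    Returns (delay, w2, w3), or None if no size fits the budget."""
    best = None
    n = int(w_max / step)
    for i in range(4, n + 1):          # sizes from 1.0 up to w_max
        w2 = i * step
        for j in range(4, n + 1):
            w3 = j * step
            if chain_energy(w2, w3) > e_max:
                continue
            d = chain_delay(w2, w3)
            if best is None or d < best[0]:
                best = (d, w2, w3)
    return best

# Sweeping the energy budget traces out an energy-delay tradeoff curve.
budgets = [70.0, 75.0, 80.0, 85.0, 90.0]
tradeoff = [(e, minimize_delay(e)[0]) for e in budgets]
```

In this toy model the unconstrained optimum is the familiar fanout-of-4 sizing, and tightening the budget below its energy forces a slower design. A real flow would also vary Vdd and Vth and take delay from a timer rather than a closed-form expression.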
22Example: 64-bit CLA Adders
- Wide adders are common in the critical paths of high-performance microprocessors
- Static adders are low power but slow
- Domino logic is the choice for short cycle times
- Setup
- 0.13 µm, 6M, 1.2 V
- Cout = 450 fF
- Cin ? 150fF
Zlatanovici et al., ESSCIRC 2003
23Adder Architecture in 90nm CMOS
[Figure: adder block diagram. Labels: inputs a63:0, b63:0; t, g gen (t63:0, g63:0); group4 gen, group16 gen, group64 gen, group4 prop, group16 prop (H4, I4, H16, I16, H64); carry-in, c1 to c4, pc1 to pc4, psel; sum select; sum gen with XOR (transmission gate) and j, k (static); outputs sum0 63:0, sum1 63:0]
Critical path of five gate delays: 6.3 FO4 at 8.5 pJ/cycle
S. Kao, R. Zlatanovici
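The hierarchical 4/16/64 grouping named in the figure can be mimicked in a bit-level functional model. The sketch below is a generic carry-lookahead adder, not the circuit on the slide. It uses the common generate/propagate definition (g = a AND b, p = a XOR b) rather than the t, g, H, I signals of the figure, it does not model the sum-select stage, and it ripples the carry inside each group of four where hardware would compute it in parallel.

```python
def combine(hi, lo):
    """Merge (generate, propagate) of a higher block with the block below it."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def group(gp, size):
    """Collapse each run of `size` (g, p) pairs (LSB first) into one pair."""
    out = []
    for k in range(0, len(gp), size):
        acc = gp[k]
        for pair in gp[k + 1:k + size]:
            acc = combine(pair, acc)
        out.append(acc)
    return out

def carries(gp, cin, size=4):
    """Carry into every position, resolved group by group (64 -> 16 -> 4 -> 1)."""
    if len(gp) == 1:
        return [cin]
    group_cin = carries(group(gp, size), cin, size)  # carry into each group
    out = []
    for k, c in enumerate(group_cin):
        for g, p in gp[k * size:(k + 1) * size]:
            out.append(c)
            c = g | (p & c)
    return out

def cla_add(a, b, cin=0, n=64):
    """Add two n-bit integers; returns (n-bit sum, carry-out)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    c = carries(gp, cin)
    total = sum((p ^ ci) << i for i, ((_, p), ci) in enumerate(zip(gp, c)))
    g_top, p_top = gp[-1]
    return total, g_top | (p_top & c[-1])
```

For example, `cla_add(2**64 - 1, 0, 1)` propagates the carry through all 64 positions and returns a zero sum with carry-out set.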
24 90nm Design
- Finalize test strategy
- Implement clock generator and assemble all circuits
- Extract layout for timing and power verification
- Tapeout Feb03
25Optimizing Pipelined Circuits
- Cycle boundaries: transparent latches
[Figure label: COMBINATIONAL LOGIC]
- Grand goal: find the configuration (transistor sizes, cutset) achieving the shortest cycle time for a given power budget and pipeline depth
26Using Posynomial Models: Results
- Widening profile of NANDs and NORs, 3 cycles
- Latches migrate towards the output as the power constraint tightens (they start at the input)
- Reverse direction of migration for narrowing profile
- No migration for flat profile
- Demonstrate on a floating-point unit
R. Zlatanovici
27Micro-Architecture Optimization
[Figure labels: A, B adders; input data rate f]
Optimal ELk/ESw is about 0.5 (all designs operate at the throughput of the nominal design sized for minimum delay under Vddmax and Vthref)
D. Markovic
28Time-Mux SVD Example
[Figure: architecture with four PE U? blocks (labels s1,w1 to s4,w4; rk1 to rk4; y) and a timing diagram from 0 to Tsymbol for PE-U?1 to PE-U?4, with intervals marked "wasted time"]
Time-Mux Architecture
[Figure: a single PE U? block with labels s1,w1 to s4,w4 and y]
- Some mux overhead
- Large area reduction
29Energy-Area Tradeoff
- Top1 can be achieved with M5 (E < Eop1) or M3 (E < Eop2)
- Area (M5) = 3/5 Area (M3)
Energy-Area is a measure of the overall chip cost
30Working With Advanced Devices
[Figure: device cross-sections for Bulk MOSFET, Double-Gate (DG) and Ultra-Thin Body (UTB), with labels Gate, Source, Drain, Buried Oxide, Tbody and Substrate]
[Figure: bar charts comparing Bulk, DG and UTB in three panels (Match Delay, Match Leakage, Match Power), obtained by changing VDD. Axes: Energy (fJ), FO4 (ps). Values as extracted, not mapped to bars: 12, 12, 7.2, 10.2, 9, 8.7, 7.9, 4.4, 3.3]
L. Chang, T.-J. King
31FinFET SRAM Array
- FinFET devices, Ldrawn = 50 nm, Leff = 20 nm
- 1 metal layer, 0.35 µm technology
- SRAM cell size: 5.75 µm x 4 µm
- WL poly
- BL M1 fin
- 15x15 SRAM array
- Static NAND decoder
- Cross-coupled latch-based sense amp
- Array size: approx. 140 µm x 70 µm
- Sematech run Jan04
R. Zlatanovici, S. Balasubramanian, with Prof. T.-J. King
32Advanced Devices
[Figure: plots for HP FinFET, LP FinFET, BG-ENH (enhancement mode) and BG-ACC (accumulation mode) devices, with DSP, embedded and uP regions marked. Axes: delay (ps), frequency (GHz), logic depth. Caption: Adaptive VDD, Vth]
J. Garrett, S. Balasubramanian, with T.-J. King
33Outline
- Low-Density Parity-Check Decoders
- PLLs
- Low-Power Digital ICs
- Power-performance optimization
- Compensating the impact of variations
- Working with advanced devices
- Analog-to-Digital Converters
34ADCs
- Measured: 1.8-V, 14-b, 12-MS/s pipelined ADC in 0.18-µm CMOS with 102-dB SFDR (Yun Chiu)
- In design: 500-MS/s, 12-b, 1.2-V digitally background-calibrated ADC in 0.13-µm CMOS
- After lunch
35Summary
- LDPC decoder in testing (E. Yeo)
- PLL in testing (S. Vamvakos)
- Optimal power-performance tradeoffs, SVD (D. Markovic)
- Power-performance optimal FPU (R. Zlatanovici)
- Optimal 64-bit adder close to tapeout (S. Kao, R. Zlatanovici)
- Adaptive VDD, Vth for low power (J. Garrett)
- Power-performance optimization in synthesis flows (F. Sheikh)
- Layout techniques to control variations (L.-T. Pang)