Title: Architectures and VLSI Implementations of High Throughput Iterative Decoders
1- Architectures and VLSI Implementations of High
Throughput Iterative Decoders - Engling Yeo
- August 9, 2002
- Department of Electrical Engineering and Computer
Sciences, University of California, Berkeley
2Coding in Communication Applications
xk
Encoder
Modulator (write)
Channel (medium)
Noise
yk
Decoder
Demodulator (read)
3Background Iterative Codes
4 dB
C. Berrou and A. Glavieux, "Near Optimum Error
Correcting Coding And Decoding: Turbo-Codes,"
IEEE Trans. Comms., Vol.44, No.10, Oct 1996.
- Key Problem: Implementation Complexity
- !! Block size of 10^7 bits.
4Competing Types of Iterative Codes
Convolutional Encoder 1
LDPC Encoder
uk
xk
xk
uk
Puncture
p
Convolutional Encoder 2
Concatenated convolutional schemes (Turbo
convolutional)
Low Density Parity Check (LDPC) codes
5Complexity Issues with Iterative Codes
Turbo
LDPC
Convolutional Code
Comparisons based on 64 iterations of decoding
6Outline
- Objectives
- Concatenated encoders/ Iterative decoder systems
- Soft-Input-Soft Output (SISO) Decoder Algorithms
and Architectures
7Outline
- Objectives
- Concatenated encoders/ Iterative decoder systems
- Soft-Input-Soft Output (SISO) Decoder Algorithms
and Architectures
- Summary and projections
8Iterative Decoding for Partial Response Channels
(Serial Concatenation)
xk
π: Pseudo Random Interleaver; π^-1: Pseudo Random
Deinterleaver
- Inner Code: Partial Response Channel
- Outer Code
- Convolutional Code
- LDPC Outer Code
T. Souvignier, M. Oberg, P. Siegel, R.
Swanson, and J. Wolf, "Turbo decoding for partial
response channels," IEEE Trans. Comms, Aug. 2000.
p.1297-308.
9Convolutional Codes
xk
Rate ½ convolutional code
uk
D
D
xk
Finite State Machine
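A minimal sketch of the finite state machine above: a rate-1/2 feedforward encoder with two delay elements (4 states). The slide does not give the tap connections, so the generators 7 and 5 (octal) used here are an assumption, and `conv_encode` is a hypothetical name.

```python
def conv_encode(bits, g1=0b111, g2=0b101):
    """Rate-1/2 convolutional encoder with two delay elements.

    g1 and g2 are assumed generator taps (octal 7 and 5); the slide
    does not specify them.
    """
    state = 0  # two-bit register holding (u[k-1], u[k-2])
    out = []
    for u in bits:
        reg = (u << 2) | state              # bits: u[k], u[k-1], u[k-2]
        x1 = bin(reg & g1).count("1") & 1   # parity of tapped bits, output 1
        x2 = bin(reg & g2).count("1") & 1   # parity of tapped bits, output 2
        out.extend([x1, x2])
        state = reg >> 1                    # shift: state becomes (u[k], u[k-1])
    return out


# Impulse response of the assumed (7,5) code.
print(conv_encode([1, 0, 0]))
```

Each input bit uk produces two output bits xk, and the next state depends only on the two most recent inputs.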
10Message Passing Analogy (Belief Propagation)
- S. M. Aji and R. J. McEliece, "The generalized
distributive law," IEEE Transactions on
Information Theory, vol.46, (no.2), IEEE, March
2000. p.325-43.
Objective: To evaluate the total number of nodes
in a tree by message passing to adjacent
nodes. Method: Each node sends to each neighbor the
marginalized sum of its inputs (all inputs except
the one from that neighbor) + 1.
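The node-counting analogy can be sketched in a few lines. This is an illustrative toy, not code from the talk; `count_nodes` and the edge-list input format are assumptions.

```python
from collections import defaultdict


def count_nodes(edges, root):
    """Count the nodes of a tree using only messages between neighbors."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    def message(src, dst):
        # Message src -> dst: 1 for src itself plus everything src has
        # heard from its other neighbors (dst's own input is excluded).
        return 1 + sum(message(n, src) for n in adj[src] if n != dst)

    # Any node recovers the total as 1 plus the sum of incoming messages.
    return 1 + sum(message(n, root) for n in adj[root])


tree = [(0, 1), (1, 2), (1, 3), (3, 4)]
print(count_nodes(tree, root=0), count_nodes(tree, root=4))
```

Every node arrives at the same total, which is the property belief propagation relies on when the graph has no cycles.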
13MAP Algorithm (BCJR)
- Each bit decision affected by received values of
both prior and future symbols.
- Bi-directional trellis path propagation
- Forward Propagation α(k).
- Backward Propagation β(k).
- Large memory requirement.
- Extended latencies.
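A compact log-domain sketch of the forward and backward recursions. It assumes a 4-state rate-1/2 trellis with generators 7 and 5 (octal), an unterminated block, and channel LLRs defined as log P(x=0)/P(x=1); none of these choices come from the slide, and `step`, `bcjr`, and `logsumexp2` are hypothetical names.

```python
import math

NEG = -math.inf


def logsumexp2(a, b):
    """log(exp(a) + exp(b)), safe when one argument is -inf."""
    if a == NEG:
        return b
    if b == NEG:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def step(state, u):
    """Assumed trellis: next state and two coded bits for input u."""
    reg = (u << 2) | state
    x1 = bin(reg & 0b111).count("1") & 1
    x2 = bin(reg & 0b101).count("1") & 1
    return reg >> 1, (x1, x2)


def bcjr(llr_pairs):
    """Return a posteriori LLRs, log P(u=0)/P(u=1), for each input bit."""
    n, S = len(llr_pairs), 4

    def gamma(k, s, u):
        ns, (x1, x2) = step(s, u)
        l1, l2 = llr_pairs[k]
        # Branch metric up to a constant: +L/2 if the bit is 0, -L/2 if 1.
        return ns, 0.5 * ((1 - 2 * x1) * l1 + (1 - 2 * x2) * l2)

    # Forward propagation alpha(k); the encoder starts in state 0.
    alpha = [[NEG] * S for _ in range(n + 1)]
    alpha[0][0] = 0.0
    for k in range(n):
        for s in range(S):
            if alpha[k][s] == NEG:
                continue
            for u in (0, 1):
                ns, g = gamma(k, s, u)
                alpha[k + 1][ns] = logsumexp2(alpha[k + 1][ns], alpha[k][s] + g)

    # Backward propagation beta(k); all end states equally likely.
    beta = [[NEG] * S for _ in range(n + 1)]
    beta[n] = [0.0] * S
    for k in range(n - 1, -1, -1):
        for s in range(S):
            for u in (0, 1):
                ns, g = gamma(k, s, u)
                beta[k][s] = logsumexp2(beta[k][s], beta[k + 1][ns] + g)

    # Combine alpha, gamma, and beta into a soft output for each bit.
    post = []
    for k in range(n):
        num = den = NEG
        for s in range(S):
            if alpha[k][s] == NEG:
                continue
            for u in (0, 1):
                ns, g = gamma(k, s, u)
                m = alpha[k][s] + g + beta[k + 1][ns]
                if u == 0:
                    num = logsumexp2(num, m)
                else:
                    den = logsumexp2(den, m)
        post.append(num - den)
    return post
```

The full `alpha` and `beta` tables are what drive the memory requirement, and the backward pass cannot start until the whole block has arrived, which is the source of the latency.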
15Soft Output Viterbi Algorithm (SOVA)
- Measure of confidence by comparing the difference
in path metric between
- Most likely path (α).
- Next most likely path (β).
16Proposed SOVA Decoder Architecture
Reliability Measure Unit
- Realize a SOVA decoder by cascading a typical VA
survival memory unit with a SOVA section.
- Viterbi Algorithm goes through an initial pass to
determine most likely path α. This includes the
Add-Compare-Select and Traceback sections.
17Register Exchange
- SOVA requirements
- Ensure branching off ML path does not result in
an equivalent decision.
- Find branch corresponding to minimum difference
in path metric.
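The two requirements above correspond to the usual SOVA reliability update, sketched below under simplifying assumptions: the most likely (ML) decisions and each competing path's decisions are already available as bit lists, and `sova_reliabilities` is a hypothetical helper, not the chip's datapath.

```python
import math


def sova_reliabilities(ml_bits, merges):
    """Reliability of each ML decision.

    merges: list of (delta, competitor_bits), where delta >= 0 is the
    path-metric difference at a point where a competing path merges
    into the ML path, and competitor_bits holds that path's decisions.
    """
    rel = [math.inf] * len(ml_bits)
    for delta, comp in merges:
        for k, (a, b) in enumerate(zip(ml_bits, comp)):
            # Only a differing decision (XOR = 1) can lower the confidence;
            # an equivalent decision leaves the reliability untouched.
            if a ^ b:
                rel[k] = min(rel[k], delta)
    return rel


ml = [1, 0, 1, 1]
merges = [(3.0, [1, 1, 1, 1]), (1.5, [0, 1, 1, 1]), (2.0, [1, 0, 1, 0])]
print(sova_reliabilities(ml, merges))
```

A bit that no competing path disagrees on keeps an infinite reliability here; a hardware implementation would saturate it to the largest representable value.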
18Proposed structures for SOVA Implementation
(Register Exchange Method)
Reliability Measure Unit
SOVA Survival Memory Unit
RMU implements recursion to determine next ML
path
XOR gates added to test for equivalence between
inputs to each multiplexer in register-exchange.
19SOVA Decoder Implementation
- E. Yeo, S. Augsburger, W. R. Davis, B. Nikolic,
"A 510Mbps Soft Output Viterbi Decoder," to
appear at IEEE ESSCIRC 2002.
20Chip Testing
- 4-layer PCB designed and fabricated with 75
discrete components.
- Logical verification at 50MHz.
- Download and upload data with networked Logic
Analyzer.
- Test vectors generated from Simulink.
22Low Density Parity Check Codes (LDPC)
CHECK NODES
VARIABLE NODES
- R. G. Gallager, IRE Trans. Info. Theory, Vol.
8 (1962) p. 21
- LDPC representation by bi-partite graph.
- Decoding by message computation and relay along
edges
- Iteratively improved estimates of log-likelihood
ratios
- Example code
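The bipartite-graph view can be sketched by reading edges straight off a parity-check matrix. The 3x6 matrix below is a made-up toy, not the example code from the slide, and `tanner_graph` is a hypothetical helper.

```python
def tanner_graph(H):
    """Return (check_neighbors, variable_neighbors) for parity-check matrix H."""
    n_vars = len(H[0])
    # Check node m connects to every variable n with H[m][n] = 1.
    checks = [[n for n, h in enumerate(row) if h] for row in H]
    # Variable node n connects to every check m with H[m][n] = 1.
    variables = [[m for m, row in enumerate(H) if row[n]] for n in range(n_vars)]
    return checks, variables


# Toy matrix (assumed for illustration only).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
checks, variables = tanner_graph(H)
print(checks)
print(variables)
```

Each 1 in H is one edge, so the number of messages per iteration grows with the number of ones rather than with the matrix size.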
23Message from Variable n to Check m
[Figure: variable node n receives messages R1n, R2n, R3n from check nodes 1, 2, 3, plus the decoder input, and sends message Qnm to check node m.]
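The slide's formula did not survive extraction. The sketch below assumes the standard sum-product rule, in which Qnm is the decoder input plus all incoming check messages except the one from check m; `variable_to_check` is a hypothetical name.

```python
def variable_to_check(decoder_input, r_in):
    """Compute Qnm for every neighboring check m of one variable node.

    decoder_input: channel log-likelihood ratio for this variable.
    r_in: dict mapping check index m to the incoming message Rmn.
    """
    total = decoder_input + sum(r_in.values())
    # Excluding check m's own message is the same as subtracting it
    # from the full sum, which needs only one adder tree per node.
    return {m: total - r for m, r in r_in.items()}


print(variable_to_check(0.5, {1: 1.0, 2: -2.0, 3: 0.25}))
```

Computing the full sum once and subtracting each input is a common hardware choice, since it avoids building a separate adder tree per outgoing edge.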
24Message from Check m to Variable n
- Signed magnitude representation
- MSB represents parity information
[Figure: check node m receives messages Q1m, Q2m, Q3m from variable nodes 1, 2, 3 and sends message Rmn to variable node n.]
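A sketch of the check-node update in sign-magnitude form, assuming the standard sum-product rule: the sign is the parity (XOR) of the other inputs' sign bits, and the magnitude is obtained through the function psi(x) = -ln tanh(x/2). The clamping limits and function names are assumptions for this illustration.

```python
import math


def psi(x):
    """psi(x) = -ln(tanh(x/2)); clamped so log and tanh stay finite."""
    x = min(max(x, 1e-12), 50.0)
    return -math.log(math.tanh(x / 2.0))


def check_to_variable(q_in):
    """Compute the outgoing message to every neighboring variable n.

    q_in: dict mapping variable index n to the incoming message Qnm.
    """
    out = {}
    for n in q_in:
        sign, mag_sum = 1, 0.0
        for n2, q in q_in.items():
            if n2 == n:
                continue            # exclude the destination's own message
            if q < 0:
                sign = -sign        # parity of the sign bits (MSBs)
            mag_sum += psi(abs(q))  # magnitudes add in the psi domain
        out[n] = sign * psi(mag_sum)
    return out


print(check_to_variable({1: 1.2, 2: -0.8, 3: 2.5}))
```

In hardware, psi is typically a lookup table, so the check node reduces to XOR gates for the signs plus adders and table lookups for the magnitudes.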
25Hardware for Computation of Rmn
b: Wordlength of messages
26Parallel Architecture of LDPC decoders
A. Blanksby and C. J. Howland, "A 220mW 1-Gbit/s
1024-Bit Rate-1/2 Low Density Parity Check Code
Decoder," Proc IEEE CICC, San Diego, CA, USA, pp.
293-6, May 2001.
27Serial Architecture of LDPC decoders
G. Al-Rawi, J. Cioffi, and M. Horowitz,
"Optimizing the mapping of low-density parity
check codes on parallel decoding architectures,"
Proc. IEEE ITCC, Las Vegas, NV, USA, pp.578-86,
Apr 2001.
28Hardware Pipelining of serial architecture
STALL!!
- Traditional DSP Algorithms
- e.g. FFT, Digital Filters
- Throughput increases
- High spatial locality
29Decoding with Software Approach
- General purpose microprocessors and Digital
Signal Processors (DSP)
- Limited number of Processing Elements (ALUs)
- Serial Architecture
- A few hundred kbps throughput
- Design, simulate, and perform comparative
analysis of LDPC codes
- Low throughput applications with fast time to
market element
30Decoding with Hardware Approach
- Parallel architecture
- Power and throughput efficiency
- FPGA
- Parallel adders and table lookups
- Need to fit PEs and routing onto single FPGA die
- Existing implementations with serial architecture
limited to 56Mbps throughput
- M. M. Mansour and N. R. Shanbhag,
"Memory-efficient turbo decoder architectures for
LDPC codes," Proc. IEEE SIPS 2002, San Diego, CA,
Oct. 2002.
- T. Zhang and Keshab Parhi, "A 56Mbps
(3,6)-Regular FPGA LDPC Decoder," Proc. IEEE SIPS
2002, San Diego, CA, Oct. 2002.
31Decoding with Hardware Approach
- Custom ASIC
- Parallel implementation demonstrated with 1Gbps
throughput
- A.J. Blanksby and C.J. Howland, "A 690-mW
1-Gb/s 1024-b, rate-1/2 low-density parity-check
code decoder," IEEE Journal of Solid-State
Circuits, vol.37, (no.3), (Proceedings of the
IEEE 2001 Custom Integrated Circuits Conference,
San Diego, CA, USA, 6-9 May 2001.) IEEE, March
2002. p.404-12.
- Routing congestion
- Logic density is 50%
- Design not scalable to codes with larger block
sizes
32Solving Routing Congestion in Hardware
- Serial architecture with groups of parallel
optimized processing elements
- Full utilization of pipelined hardware with
alternating blocks
- E.g. 128x parallelism in commercial IP
(Flarion™)
- Further memory reduction through staggered
decoding schedule
- E. Yeo, P. Pakzad, B. Nikolic, and V.
Anantharam, "High throughput low-density
parity-check architectures," Proc. IEEE
Globecom2001, San Antonio, TX, pp.3019-24, Nov
2001.
33Platform vs. Throughput Summary
[Figure: platform vs. throughput chart; axis ticks from 10^3 to 10^9.]
34Designing Systems-on-a-Chip in a Day
35Architecture choices
.5-5 MIPS/mW
10-100 MOPS/mW
Flexibility
Embedded Processor
DSP (e.g. TI 320CXX )
100-1000 MOPS/mW
Reconfigurable Processors (Maia)
Embedded
Factor of 100-1000
FPGA
Direct Mapped
Area or Power
Hardware
Brodersen Rabaey
36Cellular Phone Baseband SOC
ROM
MCU
DSP
Gates
RAM
Analog
2000 phones on each 8" wafer @ 0.15 µm Leff
1M Baseband Chips per Day
(Source: Texas Instruments)
37DSP Software Development
I. Verbauwhede, ISSCC00
38Results in fully parallel solutions
Reducing supply voltage saves energy: E ∝ CV^2
(numbers taken from vendor-published
benchmarks)
Orders of magnitude lower efficiency
even for an optimized processor architecture
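A quick arithmetic sketch of the quadratic dependence of switching energy on supply voltage. The 1.8 V starting point is an assumed nominal supply chosen for illustration; only the 1.0 V figure appears elsewhere in these slides, and the capacitance value is arbitrary because it cancels in the ratio.

```python
def switching_energy(c_farads, vdd_volts):
    """Energy per switching event, E = C * V^2 (proportionality only)."""
    return c_farads * vdd_volts ** 2


C = 1e-12  # arbitrary load; it cancels out of the ratio below
ratio = switching_energy(C, 1.0) / switching_energy(C, 1.8)
print(f"energy at 1.0 V relative to 1.8 V: {ratio:.3f}")
```

Parallel hardware can afford the lower supply because each unit may run slowly while the aggregate throughput is preserved.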
39Why program in C?
Algorithm developers use a parallel description
Then it is re-entered in C
Then architects try to rediscover the parallelism
- While (i0iiltnum)
- a a ci
- bi sin (a pi) cos(api)
-
- Outfil bi indata
Isn't there a more direct path to a parallel
solution?
40Direct Mapping
- Direct mapping of an algorithm into the
architecture
- Algorithms are typically developed in C (or
Matlab)
- Then translated into Verilog/VHDL
- Synthesized with added timing constraints
- Mapped into standard cell layout
41Start with a parallel description of the
algorithm
42then map it into hardware
43Chip-in-a-Day Design Flow: A User Perspective
- Allows regeneration and reanalysis of the design
for small changes at the push of a button
- Uses flow dependency graphs to manage large
projects
44Example 3: Soft-Output Viterbi Decoder
- BER/wordlength optimization, architectural
exploration performed in Simulink, then just
passed through the flow
- Compare-select-add is optimal for implementing
ACS recursion
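A small sketch of why compare-select-add (CSA) can stand in for add-compare-select (ACS): it is the same recursion with the add moved to follow the select, so both give identical path metrics. It assumes a fully connected trellis with `bm[p][s]` as the branch metric from state p to state s (use `math.inf` for absent branches); the function names are hypothetical and this is not the chip's datapath.

```python
def acs_step(pm, bm):
    """Add-compare-select: add branch metrics, then keep the minimum."""
    S = len(pm)
    return [min(pm[p] + bm[p][s] for p in range(S)) for s in range(S)]


def csa_run(pm0, bms):
    """Compare-select-add over a non-empty list of branch-metric tables."""
    S = len(pm0)
    # Initial add: candidate sums for the first trellis step.
    sums = [[pm0[p] + bms[0][p][s] for s in range(S)] for p in range(S)]
    pm = list(pm0)
    for k in range(len(bms)):
        # Compare-select on sums formed in the previous stage.
        pm = [min(sums[p][s] for p in range(S)) for s in range(S)]
        if k + 1 < len(bms):
            # Add the next step's branch metrics right after the select.
            sums = [[pm[p] + bms[k + 1][p][s] for s in range(S)]
                    for p in range(S)]
    return pm


bms = [[[1, 4], [3, 0]], [[2, 2], [0, 5]], [[1, 0], [4, 1]]]
pm = [0, 0]
for bm in bms:
    pm = acs_step(pm, bm)
print(pm, csa_run([0, 0], bms))
```

The reordering changes only where the pipeline register sits in the loop, which is what makes it attractive for shortening the critical path.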
45SOVA Chip
SOVA Chip Summer 2001 (E. Yeo)
- 500k transistors
- 0.18 µm
- 1.0 V
- 500 MHz
- Functional first time
46TDMA Baseband Receiver
- DSSS TDMA w/ length 31 spreading code, 25 MHz
chip rate
- 806 kHz symbol rate, w/ QPSK gives 1.6 Mb/s data
rate
- 7 bit I/Q streams at 200 MHz, 8 parallel
streams at 25 MHz
Q
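The rates quoted for the TDMA receiver follow from the chip rate and spreading length; a short check (variable names are illustrative):

```python
chip_rate_hz = 25e6        # chip rate from the slide
spreading_length = 31      # chips per symbol
bits_per_symbol = 2        # QPSK carries 2 bits per symbol

symbol_rate_hz = chip_rate_hz / spreading_length
data_rate_bps = symbol_rate_hz * bits_per_symbol
parallel_streams = 200e6 / 25e6   # 200 MHz input split into 25 MHz streams

print(f"symbol rate: {symbol_rate_hz / 1e3:.0f} kHz")
print(f"data rate:   {data_rate_bps / 1e6:.1f} Mb/s")
print(f"streams:     {parallel_streams:.0f}")
```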
47Design Effort
- Spec. changes required no modification of
datapath macros
- Routing began after 2 months
- No modification to dataflow graph from
switch-level sims.
- Flow under development
- Reuse is crucial
48Automation Statistics
- Assuming automated flow and libraries are
debugged, design time is little more than a day
49Chip Layout Plot
- 600k transistors
- 0.18 µm
- 1.0 V
- 25 MHz
- 3.7 mm x 3.7 mm (w/ pads)
- 1.8 mm x 1.3 mm (core only)
- 21 mW
- J. Ammer
50Example 4: Maskless Lithography
OPTICS
LASER
DATA
MIRROR CHIP
WAFER STAGE
Required 10Tb/s throughput
B. Wild, B. Warlick
51Maskless Lithography Chip
Parallel decompression data paths
Mirror-interface SRAM memory
Taped out in April 2002
52Future SSHAFT/BEE Designs
[Figure: Simulink feeds two paths: Module Compiler / Design Compiler to Final Silicon Layout, and Xilinx System Generator to Emulation on BEE Board, with interaction between the two.]
53The Berkeley Emulation Engine
54What's BEE?
- Real time hardware emulator built from 20 FPGAs.
- Emulation capacity of 10 Million ASIC gates.
- 600 BOPS (16-bit adds).
- Emulation speed 1-100 MHz
- 2400 external I/O for add-ons e.g. radios.
- Automated design flow from Simulink to FPGA
emulation, integrated with the Chip-in-a-Day ASIC
design flow.
55The Hive
Analog Front-end
Network
LVDS
Dedicated Ethernet
Integrated Design Flow
FPGA Bit Stream Conf File
Simulink MDL
ASIC Layout
56Applications
- Real-time hardware emulation
- Novel Communication Systems with analog front-end
hardware (MCMA, UWB, 60GHz)
- Digital signal processing systems
- Real-time control systems
- Neuron-like network processing
- Hardware acceleration
- Large communication/signal processing system
simulation
- Hardware-in-the-loop co-simulation with software
system
- Complex parallel computing algorithms
57System Architecture
- Processing Board
- Total 20 Xilinx VirtexE 2000 chips, 16 on a first
level mesh processing, 4 on a second level mesh.
- 16 ZBT SRAM chips, 1MB each.
- Control module
- Intel StrongARM 1110, on board 10 Base-T
Ethernet, Linux OS
- Radio Rx/Tx Front-End
- 2.4 GHz transceiver, Ultrawide-band transceiver
- Design Flow
- Integrated Simulink to Implementations
(ASIC/FPGA) automatic design flow.
58Processing Board Architecture
48 bit buses
59Chassis
60BEE Hardware Performance
- Board-level Main Clock Rate: 160MHz
- On Board connection speed
- FPGA to FPGA: 100MHz
- XBAR to XBAR: 70MHz
- Off board connection speed (3 ft SCSI cable loop
back through riser card)
- LVTTL: 40MHz
- LVDS: 160MHz - 220MHz
61BEE Hardware Capacity
- Reference Design
- 10240 tap FIR filter
- 512 taps per FPGA
- Slice utilization: 99% of 19200 slices
- Max Clock Rate: 28.5MHz
- ASIC Gate: 401K per FPGA, 8M total
- MOPS: 583,680 total (16bit add, 12bit cmult)
- Power: 2.5W per FPGA, 50W total
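The totals in the reference design can be reproduced from the per-FPGA figures. The sketch assumes each tap performs two operations per clock cycle (one add and one constant multiply), which is inferred from the "add, cmult" note rather than stated outright.

```python
fpgas = 20
taps_per_fpga = 512
clock_mhz = 28.5
ops_per_tap = 2            # assumed: one add + one constant multiply
gates_per_fpga = 401_000
power_per_fpga_w = 2.5

total_taps = fpgas * taps_per_fpga
total_mops = total_taps * ops_per_tap * clock_mhz
total_gates = fpgas * gates_per_fpga
total_power_w = fpgas * power_per_fpga_w

print(total_taps, total_mops, total_gates, total_power_w)
```

Under that assumption the 583,680 MOPS figure is exactly taps x 2 x clock rate, and the gate count rounds to the quoted 8M.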
62 10240 Tap FIR Design
63 10240 Tap FIR Design (cont.)
64Conclusion
65Density Evolution
- Density Evolution
- Very good codes (< 0.0045dB from theoretical
bound)
- Large variable edge degree (100)
- Large block size (10^7)
- Cayley and Ramanujan Graphs
- Unstructured interconnects
- Algebraic Constructions
- Cyclic or quasi-cyclic properties
- Use of shift registers
- Parallel implementation has to address sparse
code / interconnect issue.
66Architecture of LDPC decoders
[Figure: decoder architecture block diagrams built from Memory, Switch Fabric, and Message Comp blocks.]
A. Blanksby and C. J. Howland, "A 220mW 1-Gbit/s
1024-Bit Rate-1/2 Low Density Parity Check Code
Decoder," Proc IEEE CICC, San Diego, CA, USA, pp.
293-6, May 2001.
G. Al-Rawi, J. Cioffi, and M. Horowitz,
"Optimizing the mapping of low-density parity
check codes on parallel decoding architectures,"
Proc. IEEE ITCC, Las Vegas, NV, USA, pp.578-86,
Apr 2001.
67LDPC Decoding Algorithm
- E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
"High throughput low-density parity-check
architectures," to appear at Globecom2001,
November 2001
68Simulation Results
[Figure: two BER vs. SNR plots.]
- New decoding algorithm appears to require far
fewer iterations to converge.
- Results with high iterations (> 10) are poor
compared to original decoding algorithm.
69Proposed shift register based implementation
- Code generated from 2D GF(2^M).
- Column splitting (14) reduces matrix density and
increases irregularity.
- Takes advantage of
- Regularity of codes based on Finite Field
geometries.
- Independence between bits in consecutive groups
of 4.
- Staggered decoding.
71Research Contributions
- Architectural evaluation of various Soft-Input
Soft-Output decoders.
- Proposed timing schedule of MAP decoder allows
high-speed memory access pattern with minimal
control logic.
- 510Mbps SOVA decoder chip fabricated in 0.18 µm
technology.
- Proposed staggered LDPC decoding alleviates large
memory requirement in direct implementation of
LDPC decoder.
72Summary of Computational Complexities
- Assumptions
- CSA structures implemented
- 7-bit wordlengths
- 32kb of interleaver memory not included
73Future Work
- Complete testing and evaluation of fabricated
SOVA decoders (Spring 2002). 2002 IEEE VLSI
Symposia
- Behavior of LDPC code under staggered decoding,
and modifications to avoid saturation after 5
iterations (Immediate).
- Physical implementation of LDPC decoder using
codes constructed from 2-dimensional Galois
Fields and staggered decoding (Immediate).
74Research Schedule
IEEE Globecom Conf. Pub.
IEEE Globecom Conf. Pub.
IEEE TMRC Conf. Pub.
IEEE Trans. Magnetics Pub.
Qualifying Exam
IEEE VLSI Symposia
IEEE ISSCC
Timeline: 1/00, 7/00, 1/01, 7/01, 1/02, 7/02, 1/03
Algorithmic Exploration
SOVA ASIC Physical Design
Architectural Exploration
SOVA Board Design and Testing
Current Status: Working chip undergoing BER
tests
LDPC ASIC Design
LDPC Board Design and Testing
LDPC ASIC design to include behavior of LDPC
code under staggered decoding, and modifications
to avoid saturation after 5 iterations.
Dissertation
75Current List of Publications
- W. R. Davis, N. Zhang, K. Camera, D. Markovic, T.
Smilkstein, M. J. Ammer, E. Yeo, S. Augsburger,
B. Nikolic, and R. W. Brodersen, "A Design
Environment for High-Throughput, Low-Power
Dedicated Signal Processing Systems," to appear
in March, 2002 issue of the IEEE Journ.
Solid-State Circuits.
- E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
"High throughput low-density parity-check
architectures," to appear at Globecom2001,
November 2001.
- E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
"VLSI architectures for iterative decoders in
magnetic recording channels," IEEE Trans.
Magnetics, vol.37, no.2, p 748-55, March 2001.
- E. Yeo, P. Pakzad, B. Nikolic, V. Anantharam,
"VLSI architectures for iterative decoders in
magnetic recording channels," Digests of The
Magnetic Recording Conference, TMRC 2000, on
Magnetic Recording Systems, p. E6, Santa Clara,
CA, August 14-16, 2000.
76Read Channel System Block
[Figure: read channel block diagram. Labeled blocks: Data In, ECC Encoder, RLL/MTR Encoder, NRZI Encoding, magnetic medium (N/S transitions), LPF, A/D, EQ, PLL, DFE, Sequence Detector, RLL/MTR Decoder, ECC, Data Out.]
- No industry-wide standardization.
- High speed communication requirements.
- < 10^-6 bit error rates (BER).
- > 1Gbps throughputs.
- Advances in areal densities.
- Decreasing SNR.
- Costs limit die size to 25mm^2.