Title: Architectures and VLSI Implementations of High Throughput Iterative Decoders

1

Architectures and VLSI Implementations of High
Throughput Iterative Decoders
Engling Yeo
August 9, 2002
Department of Electrical Engineering and Computer
Sciences, University of California, Berkeley

2
Coding in Communication Applications
xk
Encoder
Modulator (write)
Channel (medium)
Noise
yk
Decoder
Demodulator (read)
3
Background: Iterative Codes
4 dB
C. Berrou and A. Glavieux, "Near Optimum Error
Correcting Coding And Decoding: Turbo-Codes,"
IEEE Trans. Comms., Vol.44, No.10, Oct 1996.

Key Problem: Implementation Complexity
Block size of 10^7 bits.

4
Competing Types of Iterative Codes
Convolutional Encoder 1
LDPC Encoder
uk
xk
xk
uk
Puncture
p
Convolutional Encoder 2
Concatenated convolutional schemes (Turbo
convolutional)
Low Density Parity Check (LDPC) codes
5
Complexity Issues with Iterative Codes
Turbo
LDPC
Convolutional Code
Comparisons based on 64 iterations of decoding
6
Outline

Objectives
Concatenated encoders/ Iterative decoder systems
Soft-Input-Soft Output (SISO) Decoder Algorithms
and Architectures

Summary and projections

8
Iterative Decoding for Partial Response Channels
(Serial Concatenation)
xk
π: Pseudo Random Interleaver
π^-1: Pseudo Random Deinterleaver

Inner Code Partial Response Channel
Outer Code
Convolutional Code
LDPC Outer Code

T. Souvignier, M. Oberg, P. Siegel, R. Swanson,
and J. Wolf, "Turbo decoding for partial
response channels," IEEE Trans. Comms, Aug. 2000,
pp. 1297-308.
9
Convolutional Codes
xk
Rate ½ convolutional code
uk
D
D
xk
Finite State Machine
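
The finite state machine view of a rate 1/2 convolutional code can be sketched in a few lines. This is a minimal illustration only: the slide does not state the generator polynomials, so the octal (7,5) pair and the function name conv_encode are assumptions.

```python
def conv_encode(bits, g1=0b111, g2=0b101):
    """Rate-1/2 feedforward convolutional encoder with two delay elements.

    g1, g2 are ASSUMED generator taps (octal 7 and 5); the slide does not give them.
    """
    state = 0  # contents of the two delay elements (D, D)
    out = []
    for u in bits:
        reg = (u << 2) | state                      # current input, then the two stored bits
        out.append(bin(reg & g1).count("1") % 2)    # first coded bit (parity over g1 taps)
        out.append(bin(reg & g2).count("1") % 2)    # second coded bit (parity over g2 taps)
        state = reg >> 1                            # shift: input becomes the newest stored bit
    return out


if __name__ == "__main__":
    print(conv_encode([1, 0, 0]))
```

Two delay elements give four states, so each input bit u_k produces two coded bits x_k and moves the machine to one of two successor states.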
10
Message Passing Analogy (Belief Propagation)

S. M. Aji and R. J. McEliece, "The generalized
distributive law," IEEE Transactions on
Information Theory, vol.46, (no.2), IEEE, March
2000. p.325-43.

Objective: To evaluate the total number of nodes
in a tree by message passing to adjacent
nodes. Method: Each node outputs the
marginalized sum of its inputs + 1.
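
The node-counting analogy can be tried directly. In this sketch, which assumes the usual reading that a message to a neighbor is the sum of the messages from all other neighbors plus one, every node ends up knowing the size of the whole tree. The adjacency-dictionary format and the name count_nodes are illustrative choices.

```python
def count_nodes(adjacency):
    """Count the nodes of a tree by message passing.

    adjacency: dict mapping each node to a list of its neighbors (must be a tree).
    Returns a dict giving, for every node, the total it computes locally.
    """
    messages = {}

    def message(src, dst):
        # Message src -> dst: 1 (for src itself) plus all messages into src
        # except the one coming from dst.
        if (src, dst) not in messages:
            total = 1
            for nb in adjacency[src]:
                if nb != dst:
                    total += message(nb, src)
            messages[(src, dst)] = total
        return messages[(src, dst)]

    # Each node combines all incoming messages and adds itself.
    return {n: 1 + sum(message(nb, n) for nb in adjacency[n]) for n in adjacency}


if __name__ == "__main__":
    tree = {0: [1, 2], 1: [0, 3, 4], 2: [0], 3: [1], 4: [1]}
    print(count_nodes(tree))
```

The exclusion of the destination's own message is the same "extrinsic information" rule that the iterative decoders in the later slides rely on.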
11
Outline

Objectives
Concatenated encoders/ Iterative decoder systems
Soft-Input-Soft Output (SISO) Decoder Algorithms
and Architectures
Summary and projections

13
MAP Algorithm (BCJR)

Each bit decision affected by received values of
both prior and future symbols.
Bi-directional trellis path propagation
Forward Propagation α(k).
Backward Propagation β(k).
Large memory requirement.
Extended latencies.
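
A probability-domain sketch of the forward and backward recursions shows where the memory and latency come from: all forward values must be stored until the backward pass reaches them. The 4-state code with octal (7,5) generators, the BPSK mapping 0 -> +1, 1 -> -1, the AWGN branch metric, and the name bcjr_llr are all assumptions made for illustration and are not taken from the slides.

```python
import math


def bcjr_llr(rx, noise_var=1.0, g1=0b111, g2=0b101):
    """Posterior log-likelihood ratios log(P(u=0)/P(u=1)) for each input bit.

    rx: list of (y1, y2) soft channel values, one pair per trellis step.
    The encoder is assumed to start in state 0; the end state is unknown.
    """
    n_states, K = 4, len(rx)

    def branch(k, state, u):
        # Next state and branch probability (up to a constant) for input u.
        reg = (u << 2) | state
        c1 = bin(reg & g1).count("1") % 2
        c2 = bin(reg & g2).count("1") % 2
        y1, y2 = rx[k]
        d = (y1 - (1 - 2 * c1)) ** 2 + (y2 - (1 - 2 * c2)) ** 2
        return reg >> 1, math.exp(-d / (2.0 * noise_var))

    # Forward propagation alpha(k), stored for every k (the memory cost).
    alpha = [[0.0] * n_states for _ in range(K + 1)]
    alpha[0][0] = 1.0
    for k in range(K):
        for s in range(n_states):
            for u in (0, 1):
                nxt, g = branch(k, s, u)
                alpha[k + 1][nxt] += alpha[k][s] * g
        norm = sum(alpha[k + 1])
        alpha[k + 1] = [a / norm for a in alpha[k + 1]]

    # Backward propagation beta(k), started from a uniform end-state belief.
    beta = [[0.0] * n_states for _ in range(K + 1)]
    beta[K] = [1.0 / n_states] * n_states
    for k in range(K - 1, -1, -1):
        for s in range(n_states):
            for u in (0, 1):
                nxt, g = branch(k, s, u)
                beta[k][s] += g * beta[k + 1][nxt]
        norm = sum(beta[k])
        beta[k] = [b / norm for b in beta[k]]

    # Combine both directions: each decision uses prior and future symbols.
    llrs = []
    for k in range(K):
        p = [0.0, 0.0]
        for s in range(n_states):
            for u in (0, 1):
                nxt, g = branch(k, s, u)
                p[u] += alpha[k][s] * g * beta[k + 1][nxt]
        llrs.append(math.log(p[0] / p[1]))
    return llrs


if __name__ == "__main__":
    # Noiseless reception of the codeword for input bits 1, 0, 0.
    print(bcjr_llr([(-1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]))
```

A hardware decoder would normally work in the log domain with windowing; this sketch keeps plain probabilities only to make the two recursions easy to read.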

14
Outline

Objectives
Concatenated encoders/ Iterative decoder systems
Soft-Input-Soft Output (SISO) Decoder Algorithms
and Architectures
Summary and projections

15
Soft Output Viterbi Algorithm (SOVA)

Measure of confidence by comparing the difference
in path metric between
Most likely path (α).
Next most likely path (β).

16
Proposed SOVA Decoder Architecture
Reliability Measure Unit

Realize a SOVA decoder by cascading a typical VA
survival memory unit with a SOVA section.
Viterbi Algorithm goes through an initial pass to
determine most likely path α. This includes the
Add-Compare-Select and Traceback sections.

17
Register Exchange

SOVA requirements
Ensure branching off ML path does not result in
an equivalent decision.
Find branch corresponding to minimum difference
in path metric.
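
The two SOVA requirements can be sketched as follows. This is an illustration of the commonly used minimum-difference update rule, not the chip's circuit: the function names, the convention that smaller path metrics are better, and the full-length competitor bit vectors are assumptions.

```python
def acs_with_difference(pm_a, bm_a, pm_b, bm_b):
    """Add-compare-select for one state, also returning the metric difference.

    pm_*: path metrics of the two predecessor states; bm_*: branch metrics.
    Smaller metric is ASSUMED to be better.
    Returns (survivor metric, survivor index 0 or 1, absolute difference).
    """
    cand_a = pm_a + bm_a
    cand_b = pm_b + bm_b
    survivor = 0 if cand_a <= cand_b else 1
    return min(cand_a, cand_b), survivor, abs(cand_a - cand_b)


def sova_reliabilities(ml_bits, competitors):
    """Reliability of each decision on the most likely (ML) path.

    competitors: list of (delta, competing_bits), where delta is the path
    metric difference at the point where the competing path merges with the
    ML path, and competing_bits has the same length as ml_bits.
    """
    rel = [float("inf")] * len(ml_bits)
    for delta, comp_bits in competitors:
        for j, (a, b) in enumerate(zip(ml_bits, comp_bits)):
            if a ^ b:                       # decisions differ (the XOR test)
                rel[j] = min(rel[j], delta)  # keep the minimum difference
    return rel


if __name__ == "__main__":
    print(acs_with_difference(3.0, 1.0, 2.5, 2.0))
    print(sova_reliabilities([1, 0, 1, 1],
                             [(2.0, [1, 1, 1, 1]),
                              (0.5, [0, 0, 1, 1]),
                              (1.0, [1, 1, 0, 1])]))
```

A competing path that yields the same bit decision leaves that bit's reliability untouched, which is why the equivalence test is needed before the minimum is taken.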

18
Proposed structures for SOVA Implementation
(Register Exchange Method)
Reliability Measure Unit
SOVA Survival Memory Unit
RMU implements recursion to determine next ML
path
XOR gates added to test for equivalence between
inputs to each multiplexer in register-exchange.
19
SOVA Decoder Implementation

E. Yeo, S. Augsburger, W. R. Davis, B. Nikolic,
"A 510Mbps Soft Output Viterbi Decoder," to
appear at IEEE ESSCIRC 2002.

20
Chip Testing

4-layer PCB designed and fabricated with 75
discrete components.
Logical verification at 50MHz.
Download and upload data with networked Logic
Analyzer.
Test vectors generated from Simulink.

21
Outline

Concatenated encoders/ Iterative decoder systems
Soft-Input-Soft Output (SISO) Decoder Algorithms
and Architectures
Summary and projections

22
Low Density Parity Check Codes (LDPC)
CHECK NODES
VARIABLE NODES
- R. G. Gallager, IRE Trans. Info. Theory, Vol.
8(1962) p. 21

LDPC representation by bi-partite graph.
Decoding by message computation and relay along
edges
Iteratively improved estimates of log-likelihood
ratios
Example code

23
Message from Variable n to Check m
CHECK NODES
m
1
2
3
R2n
R1n
Qnm
R3n
VARIABLE NODE
n
Decoder input
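
A minimal sketch of the variable-node update, assuming the standard sum-product rule: the message Q_nm to check m is the decoder input for bit n plus all incoming check messages R_m'n except the one from m itself. The dictionary layout and the name variable_to_check are illustrative.

```python
def variable_to_check(decoder_input, incoming_r):
    """Messages Q_nm from variable node n to each connected check node m.

    decoder_input: channel log-likelihood ratio for bit n.
    incoming_r: dict {m: R_mn} of the latest messages from each check node.
    """
    total = decoder_input + sum(incoming_r.values())
    # Subtracting R_mn removes the contribution that came from check m,
    # so only extrinsic information is sent back to it.
    return {m: total - r for m, r in incoming_r.items()}


if __name__ == "__main__":
    print(variable_to_check(0.5, {1: 1.0, 2: -2.0, 3: 0.25}))
```

Computing the full sum once and subtracting per edge is also how a hardware variable node avoids one adder tree per outgoing message.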
24
Message from Check m to Variable n

Signed magnitude representation
MSB represents parity information

CHECK NODE
m
Q3m
Q1m
Rmn
Q2m
VARIABLE NODE
n
1
2
3
25
Hardware for Computation of Rmn
b: Wordlength of messages
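
The check-node computation in signed-magnitude form can be sketched as below, assuming the standard sum-product rule in which the sign of R_mn is the parity (XOR) of the incoming signs and the magnitude uses the function phi(x) = -ln tanh(x/2). Floating point is used here instead of b-bit words, and the names phi and check_to_variable are illustrative.

```python
import math


def phi(x):
    """phi(x) = -ln(tanh(x/2)); clamped to avoid log(0) for tiny inputs."""
    x = max(x, 1e-12)
    return -math.log(math.tanh(x / 2.0))


def check_to_variable(incoming_q):
    """Messages R_mn from check node m to each connected variable node n.

    incoming_q: dict {n: Q_nm} of the latest messages from each variable node.
    """
    out = {}
    for n in incoming_q:
        sign, mag_sum = 1, 0.0
        for n2, q in incoming_q.items():
            if n2 == n:
                continue                 # exclude the destination's own message
            if q < 0:
                sign = -sign             # sign bit acts as a running parity
            mag_sum += phi(abs(q))       # magnitudes are summed in the phi domain
        out[n] = sign * phi(mag_sum)
    return out


if __name__ == "__main__":
    print(check_to_variable({1: 1.5, 2: -0.8, 3: 2.0}))
```

In a fixed-point design phi is typically a small lookup table, so the sign path reduces to XOR gates and the magnitude path to table lookups and adders.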
26
Parallel Architecture of LDPC decoders
A. Blanksby and C. J. Howland, "A 220mW 1-Gbit/s
1024-Bit Rate-1/2 Low Density Parity Check Code
Decoder," Proc. IEEE CICC, Las Vegas, NV, USA, pp.
293-6, May 2001.
27
Serial Architecture of LDPC decoders
G. Al-Rawi, J. Cioffi, and M. Horowitz,
"Optimizing the mapping of low-density parity
check codes on parallel decoding architectures,"
Proc. IEEE ITCC, Las Vegas, NV, USA, pp. 578-86,
Apr 2001.
28
Hardware Pipelining of serial architecture
STALL!!

Traditional DSP Algorithms
e.g. FFT, Digital Filters
Throughput increases
High spatial locality

29
Decoding with Software Approach

General purpose microprocessors and Digital
Signal Processors (DSP)
Limited number of Processing Elements (ALUs)
Serial Architecture
Few hundreds of kbps throughput
Design, simulate, and perform comparative
analysis of LDPC codes
Low throughput applications with fast time to
market element

30
Decoding with Hardware Approach

Parallel architecture
Power and throughput efficiency
FPGA
Parallel adders and table lookups
Need to fit PEs and routing onto single FPGA die
Existing implementations with serial architecture
limited to 56Mbps throughput
M. M. Mansour and N. R. Shanbhag,
"Memory-efficient turbo decoder architectures for
LDPC codes," Proc. IEEE SIPS 2002, San Diego, CA,
Oct. 2002.
T. Zhang and Keshab Parhi, "A 56Mbps
(3,6)-Regular FPGA LDPC Decoder," Proc. IEEE SIPS
2002, San Diego, CA, Oct. 2002.

31
Decoding with Hardware Approach

Custom ASIC
Parallel implementation demonstrated with 1Gbps
throughput
A. J. Blanksby and C. J. Howland, "A 690-mW
1-Gb/s 1024-b, rate-1/2 low-density parity-check
code decoder," IEEE Journal of Solid-State
Circuits, vol. 37, no. 3, pp. 404-12, March 2002
(Proceedings of the IEEE 2001 Custom Integrated
Circuits Conference, San Diego, CA, USA,
6-9 May 2001).
Routing congestion
Logic density is 50%
Design not scalable to codes with larger block
sizes

32
Solving Routing Congestion in Hardware

Serial architecture with groups of parallel
optimized processing elements
Full utilization of pipelined hardware with
alternating blocks
E.g. 128x parallelism in commercial IP
(Flarion™)
Further memory reduction through staggered
decoding schedule
E. Yeo, P. Pakzad, B. Nikolic, and V.
Anantharam, "High throughput low-density
parity-check architectures," Proc. IEEE
Globecom2001, San Antonio, TX, pp.3019-24, Nov
2001.

33
Platform vs. Throughput Summary
[Figure: platform vs. throughput, axis ticks from 10^3 to 10^9]
34
Designing Systems-on-a-Chip in a Day
35
Architecture choices
.5-5 MIPS/mW
10-100 MOPS/mW
Flexibility
Embedded Processor
DSP (e.g. TI 320CXX )
100-1000 MOPS/mW
Reconfigurable Processors (Maia)
Embedded
Factor of 100-1000
FPGA
Direct Mapped
Area or Power
Hardware
Brodersen Rabaey
36
Cellular Phone Baseband SOC
ROM
MCU
DSP
Gates
RAM
Analog
2000 phones on each 8" wafer @ .15 Leff
1M Baseband Chips per Day
(Source: Texas Instruments)
37
DSP Software Development
I. Verbauwhede, ISSCC00
38
Results in fully parallel solutions
Reducing supply voltage saves energy (E ∝ CV^2).
Numbers taken from vendor-published benchmarks.
Orders of magnitude lower efficiency even for an
optimized processor architecture.
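
The quadratic dependence of switching energy on supply voltage can be illustrated numerically. The 1.0 V value matches the chips reported later in the deck; the 1.8 V comparison point and the 1 pF capacitance are assumptions chosen only for illustration.

```python
def switching_energy(capacitance_f, vdd_v):
    """Energy proportional to C * V^2 for switching a capacitive node."""
    return capacitance_f * vdd_v ** 2


# ASSUMED comparison: same switched capacitance at 1.0 V versus 1.8 V.
CAPACITANCE_F = 1e-12
ratio = switching_energy(CAPACITANCE_F, 1.0) / switching_energy(CAPACITANCE_F, 1.8)

if __name__ == "__main__":
    print(f"energy at 1.0 V relative to 1.8 V: {ratio:.3f}")
```

A parallel datapath can run each unit at a lower clock rate, which is what makes the lower supply voltage usable without losing throughput.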
39
Why program in C?
Algorithm developers use a parallel description
Then it is re-entered in C
Then architects try to rediscover the parallelism

While (i=0; i++; i<num)
a a ci
bi sin (a pi) cos(api)
Outfil bi indata

Isn't there a more direct path to a parallel
solution?
40
Direct Mapping

Direct mapping of an algorithm into the
architecture
Algorithms are typically developed in C (or
Matlab)
Then translated into Verilog/VHDL
Synthesized with added timing constraints
Mapped into standard cell layout

41
Start with a parallel description of the
algorithm
42
then map it into hardware
43
Chip-in-a-Day Design Flow: A User Perspective

Allow regeneration and reanalysis of the design
for small changes at the push of a button
Uses flow dependency graphs to manage large
projects

44
Example 3: Soft-Output Viterbi Decoder

BER/wordlength optimization, architectural
exploration performed in Simulink, then just
passed through the flow
Compare-select-add is optimal for implementing
ACS recursion

45
SOVA Chip
SOVA Chip Summer 2001 (E. Yeo)

500k transistors
0.18 µm
1.0 V
500 MHz
Functional first time

46
TDMA Baseband Receiver

DSSS TDMA w/ length 31 spreading code, 25 MHz
chip rate
806 kHz symbol rate, w/ QPSK gives 1.6 Mb/s data
rate
7-bit I/Q streams at 200 MHz, 8 parallel
streams at 25 MHz
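
The receiver rates follow from simple arithmetic, sketched here as a check. The variable names are illustrative; the only assumption beyond the slide is that QPSK carries 2 bits per symbol.

```python
CHIP_RATE_HZ = 25e6          # chip rate from the slide
SPREADING_LENGTH = 31        # chips per symbol
BITS_PER_SYMBOL = 2          # QPSK
INPUT_SAMPLE_RATE_HZ = 200e6
PARALLEL_STREAMS = 8

symbol_rate_hz = CHIP_RATE_HZ / SPREADING_LENGTH
data_rate_bps = symbol_rate_hz * BITS_PER_SYMBOL
per_stream_rate_hz = INPUT_SAMPLE_RATE_HZ / PARALLEL_STREAMS

if __name__ == "__main__":
    print(f"symbol rate: {symbol_rate_hz / 1e3:.0f} kHz")
    print(f"data rate:   {data_rate_bps / 1e6:.1f} Mb/s")
    print(f"per-stream rate: {per_stream_rate_hz / 1e6:.0f} MHz")
```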

Q
47
Design Effort

Spec. Changes required no modification of
datapath macros
Routing began after 2 months
No modification to dataflow graph from
switch-level sims.
Flow under development
Reuse is crucial

48
Automation Statistics

Assuming automated flow and libraries are
debugged, design time is little more than a day

49
Chip Layout Plot

600k transistors
0.18 µm
1.0 V
25 MHz
3.7 mm x 3.7 mm (w/ pads)
1.8 mm x 1.3 mm (core only)
21 mW
J. Ammer

50
Example 4: Maskless Lithography
OPTICS
LASER
DATA
MIRROR CHIP
WAFER STAGE
Required 10Tb/s throughput
B. Wild, B. Warlick
51
Maskless Lithography Chip
Parallel decompression data paths
Mirror-interface SRAM memory
Taped out in April 2002
52
Future SSHAFT/BEE Designs
Simulink
Module Compiler /
Xilinx System
Design Compiler
Generator
Interaction
Final Silicon Layout
Emulation on BEE Board
53
The Berkeley Emulation Engine
54
What's BEE?

Real time hardware emulator built from 20 FPGAs.
Emulation capacity of 10 million ASIC gates.
600 BOPS (16-bit adds).
Emulation speed 1 to 100 MHz.
2400 external I/O for add-ons e.g. radios.
Automated design flow from Simulink to FPGA
emulation, integrated with the Chip-in-a-Day ASIC
design flow.

55
The Hive
Analog Front-end
Network
LVDS
Dedicated Ethernet
Integrated Design Flow
FPGA Bit Stream Conf File
Simulink MDL
ASIC Layout
56
Applications

Real-time hardware emulation
Novel Communication Systems with analog front-end
hardware (MCMA, UWB, 60GHz)
Digital signal processing systems
Real-time control systems
Neuron-like network processing
Hardware acceleration
Large communication/signal processing system
simulation
Hardware-in-the-loop co-simulation with software
system
Complex parallel computing algorithms

57
System Architecture

Processing Board
Total 20 Xilinx VirtexE 2000 chips, 16 on a first
level mesh processing, 4 on a second level mesh.
16 ZBT SRAM chips, 1MB each.
Control module
Intel StrongARM 1110, on board 10 Base-T
Ethernet, Linux OS
Radio Rx/Tx Front-End
2.4 GHz transceiver, Ultrawide-band transceiver
Design Flow
Integrated Simulink to Implementations
(ASIC/FPGA) automatic design flow.

58
Processing Board Architecture
48 bit buses
59
Chassis
60
BEE Hardware Performance

Board-level Main Clock Rate: 160 MHz
On-board connection speed
FPGA to FPGA: 100 MHz
XBAR to XBAR: 70 MHz
Off-board connection speed (3 ft SCSI cable
loopback through riser card)
LVTTL: 40 MHz
LVDS: 160 MHz to 220 MHz

61
BEE Hardware Capacity

Reference Design
10240 tap FIR filter
512 taps per FPGA
Slice utilization: 99% of 19200 slices
Max Clock Rate: 28.5 MHz
ASIC Gates: 401K per FPGA, 8M total
MOPS: 583,680 total (16-bit add, 12-bit cmult)
Power: 2.5 W per FPGA, 50 W total
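
The reference-design totals can be reproduced from the per-FPGA figures. One assumption is needed to match the MOPS number: each tap is counted as two operations per clock (one add and one constant multiply), which is inferred from the figures rather than stated on the slide.

```python
NUM_FPGAS = 20
TOTAL_TAPS = 10240
MAX_CLOCK_MHZ = 28.5
OPS_PER_TAP = 2              # ASSUMED: one 16-bit add plus one 12-bit constant multiply
GATES_PER_FPGA = 401e3
POWER_PER_FPGA_W = 2.5

taps_per_fpga = TOTAL_TAPS // NUM_FPGAS
total_mops = TOTAL_TAPS * OPS_PER_TAP * MAX_CLOCK_MHZ
total_gates = GATES_PER_FPGA * NUM_FPGAS
total_power_w = POWER_PER_FPGA_W * NUM_FPGAS

if __name__ == "__main__":
    print(taps_per_fpga, total_mops, total_gates, total_power_w)
```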

62
10240 Tap Fir Design
63
10240 Tap Fir Design (cont.)
64
Conclusion
65
Density Evolution

Density Evolution
Very good codes (< 0.0045 dB from theoretical
bound)
Large variable edge degree ( 100)
Large block size (10^7)
Cayley and Ramanujan Graphs
Unstructured interconnects
Algebraic Constructions
Cyclic or quasi-cyclic properties
Use of shift registers
Parallel implementation has to address sparse
code / interconnect issue.

66
Architecture of LDPC decoders
[Block diagrams: memories, switch fabrics, and message computation units]
A. Blanksby and C. J. Howland, "A 220mW 1-Gbit/s
1024-Bit Rate-1/2 Low Density Parity Check Code
Decoder," Proc. IEEE CICC, Las Vegas, NV, USA, pp.
293-6, May 2001.
G. Al-Rawi, J. Cioffi, and M. Horowitz,
"Optimizing the mapping of low-density parity
check codes on parallel decoding architectures,"
Proc. IEEE ITCC, Las Vegas, NV, USA, pp. 578-86,
Apr 2001.
67
LDPC Decoding Algorithm

E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
"High throughput low-density parity-check
architectures," to appear at Globecom2001,
November 2001

68
Simulation Results
[Figures: BER vs. SNR, two panels]

New decoding algorithm appears to require far
fewer iterations to converge.
Results with high iterations (> 10) are poor
compared to original decoding algorithm.

69
Proposed shift register based implementation

Code generated from 2D GF(2^M).
Column splitting (14) reduces matrix density and
increases irregularity.
Takes advantage of
Regularity of codes based on Finite Field
geometries.
Independence between bits in consecutive groups
of 4.
Staggered decoding.

70
Outline

Concatenated encoders/ Iterative decoder systems.
Soft-Input-Soft Output (SISO) Decoder Algorithms
and Architectures
Summary and projections

71
Research Contributions

Architectural evaluation of various Soft-Input
Soft-Output decoders.
Proposed timing schedule of MAP decoder allows
high-speed memory access pattern with minimal
control logic.
510Mbps SOVA decoder chip fabricated in 0.18 µm
technology.
Proposed staggered LDPC decoding alleviates the
large memory requirement in direct implementation
of an LDPC decoder.

72
Summary of Computational Complexities

Assumptions
CSA structures implemented
7-bit wordlengths
32kb of interleaver memory not included

73
Future Work

Complete testing and evaluation of fabricated
SOVA decoders (Spring 2002). 2002 IEEE VLSI
Symposia
Behavior of LDPC code under staggered decoding,
and modifications to avoid saturation after 5
iterations (Immediate).
Physical implementation of LDPC decoder using
codes constructed from 2-dimensional Galois
Fields and staggered decoding (Immediate).

74
Research Schedule
IEEE Globecom Conf. Pub.
IEEE Globecom Conf. Pub.
IEEE TMRC Conf. Pub.
IEEE Trans. Magnetics Pub.
Qualifying Exam
IEEE VLSI Symposia
IEEE ISSCC
Timeline: 1/00, 7/00, 1/01, 7/01, 1/02, 7/02, 1/03
Algorithmic Exploration
SOVA ASIC Physical Design
Architectural Exploration
SOVA Board Design and Testing
Current Status: Working chip undergoing BER
tests
LDPC ASIC Design
LDPC Board Design and Testing
LDPC ASIC design to include behavior of LDPC
code under staggered decoding, and modifications
to avoid saturation after 5 iterations.
Dissertation
75
Current List of Publications

W. R. Davis, N. Zhang, K. Camera, D. Markovic, T.
Smilkstein, M. J. Ammer, E. Yeo, S. Augsburger,
B. Nikolic, and R. W. Brodersen, "A Design
Environment for High-Throughput, Low-Power
Dedicated Signal Processing Systems," to appear
in March 2002 issue of the IEEE Journ.
Solid-State Circuits.
E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
"High throughput low-density parity-check
architectures," to appear at Globecom2001,
November 2001.
E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
"VLSI architectures for iterative decoders in
magnetic recording channels," IEEE Trans.
Magnetics, vol. 37, no. 2, pp. 748-55, March 2001.
E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
"VLSI architectures for iterative decoders in
magnetic recording channels," Digests of The
Magnetic Recording Conference, TMRC 2000, on
Magnetic Recording Systems, p. E6, Santa Clara,
CA, August 14-16, 2000.

76
Read Channel System Block
NRZI Encoding
RLL/MTR Encoder
Data In
Encoder
ECC
N S
N S
S N
S N
N S
S N
RLL/MTR Decoder
Sequence Detector
Data Out
LPF
A/D
EQ
ECC
PLL
DFE