Architectures and VLSI Implementations of High Throughput Iterative Decoders

1
  • Architectures and VLSI Implementations of High
    Throughput Iterative Decoders
  • Engling Yeo
  • August 9, 2002
  • Department of Electrical Engineering and Computer
    Sciences, University of California, Berkeley

2
Coding in Communication Applications
xk
Encoder
Modulator (write)
Channel (medium)
Noise
yk
Decoder
Demodulator (read)
3
Background: Iterative Codes
4 dB
C. Berrou and A. Glavieux, "Near Optimum Error
Correcting Coding And Decoding: Turbo-Codes,"
IEEE Trans. Comms., Vol. 44, No. 10, Oct. 1996.
  • Key Problem: Implementation Complexity
  • !! Block size of 10⁷ bits.

4
Competing Types of Iterative Codes
Convolutional Encoder 1
LDPC Encoder
uk
xk
xk
uk
Puncture
p
Convolutional Encoder 2
Concatenated convolutional schemes (Turbo
convolutional)
Low Density Parity Check (LDPC) codes
5
Complexity Issues with Iterative Codes
Turbo
LDPC
Convolutional Code
Comparisons based on 64 iterations of decoding
6
Outline
  • Objectives
  • Concatenated encoders/ Iterative decoder systems
  • Soft-Input-Soft Output (SISO) Decoder Algorithms
    and Architectures
  • Summary and projections

7
Outline
  • Objectives
  • Concatenated encoders/ Iterative decoder systems
  • Soft-Input-Soft Output (SISO) Decoder Algorithms
    and Architectures
  • Summary and projections

8
Iterative Decoding for Partial Response Channels
(Serial Concatenation)
xk
π: Pseudo-Random Interleaver; π⁻¹: Pseudo-Random
Deinterleaver
  • Inner Code: Partial Response Channel
  • Outer Code:
  • Convolutional Code
  • LDPC Outer Code

T. Souvignier, M. Oberg, P. Siegel, R. Swanson,
and J. Wolf, "Turbo decoding for partial response
channels," IEEE Trans. Comms., Aug. 2000,
pp. 1297-1308.
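The interleaver π and deinterleaver π⁻¹ in the serial concatenation can be sketched as a fixed pseudo-random permutation and its inverse. This is a minimal illustration only: the seed, the block length, and the function names are assumptions, not taken from the slides.

```python
import random

def make_interleaver(n, seed=0):
    """Return a fixed pseudo-random permutation of range(n) (hypothetical helper)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

def interleave(seq, perm):
    # Output position i takes the input symbol at position perm[i].
    return [seq[p] for p in perm]

def deinterleave(seq, perm):
    # Inverse mapping: put each symbol back where it came from.
    out = [None] * len(perm)
    for i, p in enumerate(perm):
        out[p] = seq[i]
    return out

perm = make_interleaver(8, seed=1)
data = [1, 0, 1, 1, 0, 0, 1, 0]
restored = deinterleave(interleave(data, perm), perm)
```

Because the permutation is fixed and known to both sides, the decoder can shuffle soft values between the inner and outer decoders on every iteration.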
9
Convolutional Codes
xk
Rate ½ convolutional code
uk
D
D
xk
Finite State Machine
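A rate ½ convolutional encoder with two delay elements can be written as a small finite state machine. The slide does not give the generator polynomials, so the sketch below assumes the common (7, 5) octal pair purely for illustration.

```python
def conv_encode(bits, g1=0b111, g2=0b101):
    """Rate-1/2 encoder with two delay elements (D, D).

    g1 and g2 are ASSUMED generator polynomials (7, 5 octal);
    the slide does not specify them.
    """
    state = 0  # two-bit state: [D1, D2]
    out = []
    for u in bits:
        reg = (u << 2) | state              # register contents [u, D1, D2]
        x1 = bin(reg & g1).count("1") % 2   # parity of taps selected by g1
        x2 = bin(reg & g2).count("1") % 2   # parity of taps selected by g2
        out.extend([x1, x2])
        state = reg >> 1                    # shift: u -> D1, D1 -> D2
    return out

impulse_response = conv_encode([1, 0, 0])
```

Each input bit produces two output bits, and the two delay elements give four trellis states.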
10
Message Passing Analogy (Belief Propagation)
  • S. M. Aji and R. J. McEliece, "The generalized
    distributive law," IEEE Transactions on
    Information Theory, vol.46, (no.2), IEEE, March
    2000. p.325-43.

Objective: To evaluate the total number of nodes
in a tree by message passing to adjacent nodes.
Method: Each node outputs the marginalized sum
of its inputs + 1.
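The node-counting analogy can be sketched as follows: the message a node sends to one neighbor is the sum of the messages from all its other neighbors, plus 1 for itself. The tree and the function name are illustrative assumptions.

```python
from collections import defaultdict

def count_nodes_by_message_passing(edges):
    """Each node computes the total node count from its incoming messages."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    messages = {}  # (sender, receiver) -> number of nodes on the sender's side

    def message(sender, receiver):
        # Sum over all neighbors except the receiver, plus 1 for the sender.
        if (sender, receiver) not in messages:
            incoming = sum(message(n, sender)
                           for n in neighbors[sender] if n != receiver)
            messages[(sender, receiver)] = incoming + 1
        return messages[(sender, receiver)]

    # A node's own total: all incoming messages plus itself.
    return {node: 1 + sum(message(n, node) for n in neighbors[node])
            for node in neighbors}

tree = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]
totals = count_nodes_by_message_passing(tree)
```

Every node ends up with the same total even though no node ever sees the whole tree, which is the property iterative decoders rely on.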
11
Outline
  • Objectives
  • Concatenated encoders/ Iterative decoder systems
  • Soft-Input-Soft Output (SISO) Decoder Algorithms
    and Architectures
  • Summary and projections

12
Outline
  • Objectives
  • Concatenated encoders/ Iterative decoder systems
  • Soft-Input-Soft Output (SISO) Decoder Algorithms
    and Architectures
  • Summary and projections

13
MAP Algorithm (BCJR)
  • Each bit decision affected by received values of
    both prior and future symbols.
  • Bi-directional trellis path propagation
  • Forward Propagation α(k).
  • Backward Propagation β(k)
  • Large memory requirement.
  • Extended latencies.
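A minimal probability-domain sketch of the forward and backward recursions is given below. The 2-state trellis, the branch weights, and the unterminated ending are assumptions chosen to keep the example small; a practical decoder would work in the log domain with normalization.

```python
# NEXT_STATE[s][u]: hypothetical 2-state trellis (next state = s XOR u)
NEXT_STATE = [[0, 1], [1, 0]]

def bcjr_posteriors(gamma, n_states=2):
    """gamma[k][s][u] is the branch weight for input u leaving state s at step k.

    Returns P(u_k = 1 | all observations) for every k.
    """
    K = len(gamma)
    # Forward recursion alpha(k), starting in state 0.
    alpha = [[0.0] * n_states for _ in range(K + 1)]
    alpha[0][0] = 1.0
    for k in range(K):
        for s in range(n_states):
            for u in (0, 1):
                alpha[k + 1][NEXT_STATE[s][u]] += alpha[k][s] * gamma[k][s][u]
    # Backward recursion beta(k); trellis assumed unterminated.
    beta = [[0.0] * n_states for _ in range(K + 1)]
    beta[K] = [1.0] * n_states
    for k in range(K - 1, -1, -1):
        for s in range(n_states):
            for u in (0, 1):
                beta[k][s] += gamma[k][s][u] * beta[k + 1][NEXT_STATE[s][u]]
    # Combine: every bit decision uses both past (alpha) and future (beta).
    posteriors = []
    for k in range(K):
        w = [0.0, 0.0]
        for s in range(n_states):
            for u in (0, 1):
                w[u] += alpha[k][s] * gamma[k][s][u] * beta[k + 1][NEXT_STATE[s][u]]
        posteriors.append(w[1] / (w[0] + w[1]))
    return posteriors

example_gamma = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.2, 0.8], [0.7, 0.3]],
    [[0.5, 0.5], [0.1, 0.9]],
]
posteriors = bcjr_posteriors(example_gamma)
```

Note that the sketch stores every alpha and beta value for the whole block, which is where the memory and latency costs listed above come from.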

14
Outline
  • Objectives
  • Concatenated encoders/ Iterative decoder systems
  • Soft-Input-Soft Output (SISO) Decoder Algorithms
    and Architectures
  • Summary and projections

15
Soft Output Viterbi Algorithm (SOVA)
  • Measure of confidence by comparing the difference
    in path metric between:
  • Most likely path (α).
  • Next most likely path (β).
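
The path-metric difference can be sketched as an add-compare-select step that also returns the margin between the winning and losing candidates. The sketch assumes that a smaller metric means a more likely path and that each state has two incoming branches; the function name is hypothetical.

```python
def acs_with_reliability(pm_a, bm_a, pm_b, bm_b):
    """Add-compare-select for one state with two incoming branches.

    pm_* are predecessor path metrics, bm_* are branch metrics.
    Returns (survivor_metric, decision, delta), where delta is the
    path-metric difference used as the confidence measure.
    """
    cand_a = pm_a + bm_a          # add
    cand_b = pm_b + bm_b
    if cand_a <= cand_b:          # compare
        return cand_a, 0, cand_b - cand_a   # select branch a
    return cand_b, 1, cand_a - cand_b       # select branch b

survivor_metric, decision, delta = acs_with_reliability(3, 1, 2, 3)
```

A small delta means the discarded path was almost as good as the survivor, so decisions on which the two paths disagree deserve low confidence.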

16
Proposed SOVA Decoder Architecture
Reliability Measure Unit
  • Realize a SOVA decoder by cascading a typical VA
    survival memory unit with a SOVA section.
  • Viterbi Algorithm goes through an initial pass to
    determine most likely path α. This includes the
    Add-Compare-Select and Traceback sections.

17
Register Exchange
  • SOVA requirements
  • Ensure branching off ML path does not result in
    an equivalent decision.
  • Find branch corresponding to minimum difference
    in path metric.

18
Proposed structures for SOVA Implementation
(Register Exchange Method)
Reliability Measure Unit
SOVA Survival Memory Unit
RMU implements recursion to determine next ML
path
XOR gates added to test for equivalence between
inputs to each multiplexer in register-exchange.
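The XOR equivalence test can be sketched in software as follows. This uses the commonly described SOVA update rule (lower a bit's reliability to the metric difference only where the survivor and the competing path disagree); it is an illustration under that assumption, not the chip's actual datapath.

```python
def update_reliabilities(reliability, survivor_bits, competitor_bits, delta):
    """Update per-bit reliabilities after one path merge.

    delta is the path-metric difference between survivor and competitor.
    """
    updated = []
    for rel, s_bit, c_bit in zip(reliability, survivor_bits, competitor_bits):
        differs = s_bit ^ c_bit  # XOR: 1 only if the two paths disagree here
        updated.append(min(rel, delta) if differs else rel)
    return updated

INF = float("inf")
new_rel = update_reliabilities(
    [INF, 3.0, 2.0, INF],   # current reliabilities
    [1, 0, 1, 1],           # decisions along the survivor path
    [1, 1, 0, 1],           # decisions along the competing path
    2.5,                    # metric difference at the merge
)
```

Positions where both paths agree are left untouched, since branching off there would give an equivalent decision.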
19
SOVA Decoder Implementation
  • E. Yeo, S. Augsburger, W. R. Davis, B. Nikolic,
    "A 510Mbps Soft Output Viterbi Decoder," to
    appear at IEEE ESSCIRC 2002.

20
Chip Testing
  • 4-layer PCB designed and fabricated with 75
    discrete components.
  • Logical verification at 50MHz.
  • Download and upload data with networked Logic
    Analyzer.
  • Test vectors generated from Simulink.

21
Outline
  • Concatenated encoders/ Iterative decoder systems
  • Soft-Input-Soft Output (SISO) Decoder Algorithms
    and Architectures
  • Summary and projections

22
Low Density Parity Check Codes (LDPC)
CHECK NODES
VARIABLE NODES
R. G. Gallager, IRE Trans. Info. Theory, Vol. 8
(1962), p. 21
  • LDPC representation by bi-partite graph.
  • Decoding by message computation and relay along
    edges
  • Iteratively improved estimates of log-likelihood
    ratios
  • Example code

23
Message from Variable n to Check m
CHECK NODES
m
1
2
3
R2n
R1n
Qnm
R3n
VARIABLE NODE
n
Decoder input
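Using the labels from the figure, the variable-to-check message can be sketched with the standard sum-product rule: Q_nm is the decoder input for bit n plus every incoming R message except the one from check m. The function name and the example values are assumptions.

```python
def variable_to_check(channel_llr, r_in):
    """r_in maps check index m -> incoming message R_mn (log-likelihood ratios).

    Returns a map m -> Q_nm, each excluding the message that came from m.
    """
    total = channel_llr + sum(r_in.values())
    # Subtracting R_mn from the total leaves the sum over the other checks.
    return {m: total - r for m, r in r_in.items()}

q_out = variable_to_check(1.5, {1: 0.5, 2: -2.0, 3: 1.0})
```

Computing the total once and subtracting each input is a common way to avoid recomputing a separate sum per edge.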
24
Message from Check m to Variable n
  • Signed magnitude representation
  • MSB represents parity information

CHECK NODE
m
Q3m
Q1m
Rmn
Q2m
VARIABLE NODE
n
1
2
3
25
Hardware for Computation of Rmn
b: Wordlength of messages
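The slides do not show the formula for R_mn, so the sketch below assumes the standard sum-product check update in sign-magnitude form: signs combine by parity, and magnitudes pass through Φ(x) = -ln tanh(x/2), which hardware typically realizes as a lookup table. The clamping bounds are arbitrary assumptions.

```python
import math

def phi(x):
    """Phi(x) = -ln(tanh(x/2)); clamped to avoid log(0) and overflow."""
    x = min(max(x, 1e-9), 30.0)
    return -math.log(math.tanh(x / 2.0))

def check_to_variable(q_in):
    """q_in maps variable index n -> incoming message Q_nm.

    Returns a map n -> R_mn, each excluding the message that came from n.
    """
    out = {}
    for n in q_in:
        sign = 1
        mag_sum = 0.0
        for n2, q in q_in.items():
            if n2 == n:
                continue
            if q < 0:
                sign = -sign           # parity of the sign bits
            mag_sum += phi(abs(q))     # magnitudes add in the Phi domain
        out[n] = sign * phi(mag_sum)   # Phi is its own inverse
    return out

r_out = check_to_variable({1: 1.2, 2: -0.8, 3: 2.5})
```

Keeping sign and magnitude separate is what makes a signed-magnitude number format convenient here: the sign path is a chain of XORs and the magnitude path is lookups and adders.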
26
Parallel Architecture of LDPC decoders
A. Blanksby and C. J. Howland, "A 220mW 1-Gbit/s
1024-Bit Rate-1/2 Low Density Parity Check Code
Decoder," Proc. IEEE CICC, San Diego, CA, USA,
pp. 293-6, May 2001.
27
Serial Architecture of LDPC decoders
G. Al-Rawi, J. Cioffi, and M. Horowitz,
"Optimizing the mapping of low-density parity
check codes on parallel decoding architectures,"
Proc. IEEE ITCC, Las Vegas, NV, USA, pp. 578-86,
Apr. 2001.
28
Hardware Pipelining of serial architecture
STALL!!
  • Traditional DSP Algorithms
  • e.g. FFT, Digital Filters
  • Throughput increases
  • High spatial locality

29
Decoding with Software Approach
  • General purpose microprocessors and Digital
    Signal Processors (DSP)
  • Limited number of Processing Elements (ALUs)
  • Serial Architecture
  • Few hundreds of kbps throughput
  • Design, simulate, and perform comparative
    analysis of LDPC codes
  • Low throughput applications with fast time to
    market element

30
Decoding with Hardware Approach
  • Parallel architecture
  • Power and throughput efficiency
  • FPGA
  • Parallel adders and table lookups
  • Need to fit PEs and routing onto single FPGA die
  • Existing implementations with serial architecture
    limited to 56Mbps throughput
  • M. M. Mansour and N. R. Shanbhag,
    "Memory-efficient turbo decoder architectures for
    LDPC codes," Proc. IEEE SIPS 2002, San Diego, CA,
    Oct. 2002.
  • T. Zhang and Keshab Parhi, "A 56Mbps
    (3,6)-Regular FPGA LDPC Decoder," Proc. IEEE SIPS
    2002, San Diego, CA, Oct. 2002.

31
Decoding with Hardware Approach
  • Custom ASIC
  • Parallel implementation demonstrated with 1Gbps
    throughput
  • A. J. Blanksby and C. J. Howland, "A 690-mW
    1-Gb/s 1024-b, rate-1/2 low-density parity-check
    code decoder," IEEE Journal of Solid-State
    Circuits, vol. 37, no. 3, March 2002,
    pp. 404-12. (Proceedings of the IEEE 2001 Custom
    Integrated Circuits Conference, San Diego, CA,
    USA, 6-9 May 2001.)
  • Routing congestion
  • Logic density is 50%
  • Design not scalable to codes with larger block
    sizes

32
Solving Routing Congestion in Hardware
  • Serial architecture with groups of parallel
    optimized processing elements
  • Full utilization of pipelined hardware with
    alternating blocks
  • E.g. 128x parallelism in commercial IP
    (Flarion™)
  • Further memory reduction through staggered
    decoding schedule
  • E. Yeo, P. Pakzad, B. Nikolic, and V.
    Anantharam, "High throughput low-density
    parity-check architectures," Proc. IEEE
    Globecom 2001, San Antonio, TX, pp. 3019-24,
    Nov. 2001.

33
Platform vs. Throughput Summary
[Figure: platforms plotted against throughput on a logarithmic axis from 10³ to 10⁹]
34
Designing Systems-on-a-Chip in a Day
35
Architecture choices
[Figure: flexibility vs. area or power for embedded
processor, DSP (e.g. TI 320CXX), reconfigurable
processors (Maia), embedded FPGA, and direct mapped
hardware. Efficiency labels: 0.5-5 MIPS/mW,
10-100 MOPS/mW, 100-1000 MOPS/mW; overall factor of
100-1000. Source: Brodersen, Rabaey]
36
Cellular Phone Baseband SOC
ROM
MCU
DSP
Gates
RAM
Analog
2000 phones on each 8" wafer @ 0.15 µm Leff
1M Baseband Chips per Day
(Source: Texas Instruments)
37
DSP Software Development
I. Verbauwhede, ISSCC00
38
Results in fully parallel solutions
Reducing supply voltage saves energy (E ∝ CV²).
(Numbers taken from vendor-published benchmarks.)
Orders of magnitude lower efficiency
even for an optimized processor architecture.
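Since switching energy scales with CV², the saving from a lower supply voltage is the square of the voltage ratio. The 1.8 V starting point below is an assumed nominal supply used only for illustration; 1.0 V matches the supply listed for the chips later in the deck.

```python
def dynamic_energy_ratio(v_new, v_old):
    """Ratio of switching energy per operation at v_new relative to v_old.

    Assumes the switched capacitance C is unchanged, so E scales with V**2.
    """
    return (v_new / v_old) ** 2

# ASSUMPTION: 1.8 V nominal supply, scaled down to 1.0 V.
ratio = dynamic_energy_ratio(1.0, 1.8)
```

Lower voltage also slows the logic, which is why a parallel architecture is needed to keep throughput while running at the reduced supply.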
39
Why program in C?
Algorithm developers use a parallel description
Then it is re-entered in C
Then architects try to rediscover the parallelism
  • While (i=0; i++; i<num)
  • a = a * c[i]
  • b[i] = sin(a * pi) + cos(a * pi)
  • Outfil = b[i] * indata

Isn't there a more direct path to a parallel
solution?
40
Direct Mapping
  • Direct mapping of an algorithm into the
    architecture
  • Algorithms are typically developed in C (or
    Matlab)
  • Then translated into Verilog/VHDL
  • Synthesized with added timing constraints
  • Mapped into standard cell layout

41
Start with a parallel description of the
algorithm
42
then map it into hardware
43
Chip-in-a-Day Design Flow: A User Perspective
  • Allow regeneration and reanalysis of the design
    for small changes at the push of a button
  • Uses flow dependency graphs to manage large
    projects

44
Example 3: Soft-Output Viterbi Decoder
  • BER/wordlength optimization, architectural
    exploration performed in Simulink, then just
    passed through the flow
  • Compare-select-add is optimal for implementing
    ACS recursion

45
SOVA Chip
SOVA Chip, Summer 2001 (E. Yeo)
  • 500k transistors
  • 0.18 µm
  • 1.0 V
  • 500 MHz
  • Functional first time

46
TDMA Baseband Receiver
  • DSSS TDMA w/ length-31 spreading code, 25 MHz
    chip rate
  • 806 kHz symbol rate, w/ QPSK gives 1.6 Mb/s data
    rate
  • 7-bit I and Q streams at 200 MHz, 8 parallel
    streams at 25 MHz
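
The rates on this slide follow from simple arithmetic, reproduced here as a quick check. Variable names are illustrative.

```python
chip_rate_hz = 25e6        # chip rate from the slide
spreading_length = 31      # chips per symbol
bits_per_symbol = 2        # QPSK carries 2 bits per symbol

symbol_rate_hz = chip_rate_hz / spreading_length
data_rate_bps = symbol_rate_hz * bits_per_symbol

input_rate_hz = 200e6      # sample rate of the I and Q input streams
parallel_streams = 8
per_stream_rate_hz = input_rate_hz / parallel_streams

print(round(symbol_rate_hz / 1e3), "kHz symbol rate")
print(round(data_rate_bps / 1e6, 1), "Mb/s data rate")
print(per_stream_rate_hz / 1e6, "MHz per parallel stream")
```

Splitting the 200 MHz input into 8 parallel streams brings each stream down to the 25 MHz chip rate, which is also the clock rate listed for the chip.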

47
Design Effort
  • Spec. Changes required no modification of
    datapath macros
  • Routing began after 2 months
  • No modification to dataflow graph from
    switch-level sims.
  • Flow under development
  • Reuse is crucial

48
Automation Statistics
  • Assuming automated flow and libraries are
    debugged, design time is little more than a day

49
Chip Layout Plot
  • 600k transistors
  • 0.18 µm
  • 1.0 V
  • 25 MHz
  • 3.7 mm x 3.7 mm (w/ pads)
  • 1.8 mm x 1.3 mm (core only)
  • 21 mW
  • J. Ammer

50
Example 4: Maskless Lithography
OPTICS
LASER
DATA
MIRROR CHIP
WAFER STAGE
Required 10Tb/s throughput
B. Wild, B. Warlick
51
Maskless Lithography Chip
Parallel decompression data paths
Mirror-interface SRAM memory
Taped out in April 2002
52
Future SSHAFT/BEE Designs
[Figure: Simulink feeds two interacting paths: Module
Compiler / Design Compiler to final silicon layout, and
Xilinx System Generator to emulation on the BEE board]
53
The Berkeley Emulation Engine
54
What's BEE?
  • Real time hardware emulator built from 20 FPGAs.
  • Emulation capacity of 10 million ASIC gates.
  • 600 BOPS (16-bit adds).
  • Emulation speed 1-100 MHz
  • 2400 external I/O for add-ons e.g. radios.
  • Automated design flow from Simulink to FPGA
    emulation, integrated with the Chip-in-a-Day ASIC
    design flow.

55
The Hive
Analog Front-end
Network
LVDS
Dedicated Ethernet
Integrated Design Flow
FPGA Bit Stream Conf File
Simulink MDL
ASIC Layout
56
Applications
  • Real-time hardware emulation
  • Novel Communication Systems with analog front-end
    hardware (MCMA, UWB, 60GHz)
  • Digital signal processing systems
  • Real-time control systems
  • Neuron-like network processing
  • Hardware acceleration
  • Large communication/signal processing system
    simulation
  • Hardware-in-the-loop co-simulation with software
    system
  • Complex parallel computing algorithms

57
System Architecture
  • Processing Board
  • Total 20 Xilinx VirtexE 2000 chips, 16 on a first
    level mesh processing, 4 on a second level mesh.
  • 16 ZBT SRAM chips, 1MB each.
  • Control module
  • Intel StrongARM 1110, on board 10 Base-T
    Ethernet, Linux OS
  • Radio Rx/Tx Front-End
  • 2.4 GHz transceiver, Ultrawide-band transceiver
  • Design Flow
  • Integrated Simulink to Implementations
    (ASIC/FPGA) automatic design flow.

58
Processing Board Architecture
48 bit buses
59
Chassis
60
BEE Hardware Performance
  • Board-level main clock rate: 160 MHz
  • On-board connection speed
  • FPGA to FPGA: 100 MHz
  • XBAR to XBAR: 70 MHz
  • Off-board connection speed (3 ft SCSI cable loop
    back through riser card)
  • LVTTL: 40 MHz
  • LVDS: 160-220 MHz

61
BEE Hardware Capacity
  • Reference Design
  • 10240-tap FIR filter
  • 512 taps per FPGA
  • Slice utilization: 99% of 19200 slices
  • Max clock rate: 28.5 MHz
  • ASIC gates: 401K per FPGA, 8M total
  • MOPS: 583,680 total (16-bit add, 12-bit cmult)
  • Power: 2.5 W per FPGA, 50 W total
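
The totals on this slide can be cross-checked from the per-FPGA figures. The assumption that each tap performs one add and one constant multiply per clock cycle is inferred from the "16-bit add, 12-bit cmult" note.

```python
total_taps = 10240
taps_per_fpga = 512
clock_mhz = 28.5
ops_per_tap = 2              # ASSUMED: one add + one constant multiply per cycle
gates_per_fpga = 401_000
power_per_fpga_w = 2.5

fpgas = total_taps // taps_per_fpga
total_mops = total_taps * ops_per_tap * clock_mhz
total_gates = fpgas * gates_per_fpga
total_power_w = fpgas * power_per_fpga_w

print(fpgas, "FPGAs")
print(total_mops, "MOPS")
print(total_gates / 1e6, "M gates")
print(total_power_w, "W")
```

The filter occupies all 20 FPGAs on the board, and the computed MOPS, gate count, and power agree with the totals listed above.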

62
10240 Tap Fir Design
63
10240 Tap Fir Design (cont.)
64
Conclusion
65
Density Evolution
  • Density Evolution
  • Very good codes (< 0.0045 dB from theoretical
    bound)
  • Large variable edge degree ( 100)
  • Large block size (10⁷)
  • Cayley and Ramanujan Graphs
  • Unstructured interconnects
  • Algebraic Constructions
  • Cyclic or quasi-cyclic properties
  • Use of shift registers
  • Parallel implementation has to address sparse
    code / interconnect issue.
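
The appeal of cyclic constructions for shift-register hardware can be illustrated with a circulant parity-check matrix, where every row is a cyclic shift of the first. The 7-column first row below is an assumed toy example, far smaller than any practical code.

```python
def circulant(first_row):
    """Build a square matrix whose row i is first_row cyclically shifted right by i."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]

# ASSUMED toy example: ones at positions {0, 1, 3} of a length-7 row.
H = circulant([1, 1, 0, 1, 0, 0, 0])

row_weights = [sum(row) for row in H]
col_weights = [sum(col) for col in zip(*H)]
# Number of columns shared by each pair of distinct rows.
overlaps = [sum(a & b for a, b in zip(H[i], H[j]))
            for i in range(7) for j in range(i + 1, 7)]
```

Only the first row needs to be stored, since a shift register regenerates the rest. In this example any two rows share exactly one column, so the graph has no length-4 cycles.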

66
Architecture of LDPC decoders
[Figure: LDPC decoder architectures built from memory
blocks, switch fabrics, and message computation units]
A. Blanksby and C. J. Howland, "A 220mW 1-Gbit/s
1024-Bit Rate-1/2 Low Density Parity Check Code
Decoder," Proc. IEEE CICC, San Diego, CA, USA,
pp. 293-6, May 2001.
G. Al-Rawi, J. Cioffi, and M. Horowitz,
"Optimizing the mapping of low-density parity
check codes on parallel decoding architectures,"
Proc. IEEE ITCC, Las Vegas, NV, USA, pp. 578-86,
Apr. 2001.
67
LDPC Decoding Algorithm
  • E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
    "High throughput low-density parity-check
    architectures," to appear at Globecom 2001,
    November 2001

68
Simulation Results
[Plots: BER vs. SNR]
  • New decoding algorithm appears to require far
    fewer iterations to converge.
  • Results with high iterations (> 10) are poor
    compared to original decoding algorithm.

69
Proposed shift register based implementation
  • Code generated from 2D GF(2^M).
  • Column splitting (1:4) reduces matrix density and
    increases irregularity.
  • Takes advantage of:
  • Regularity of codes based on Finite Field
    geometries.
  • Independence between bits in consecutive groups
    of 4.
  • Staggered decoding.

70
Outline
  • Concatenated encoders/ Iterative decoder systems.
  • Soft-Input-Soft Output (SISO) Decoder Algorithms
    and Architectures
  • Summary and projections

71
Research Contributions
  • Architectural evaluation of various Soft-Input
    Soft-Output decoders.
  • Proposed timing schedule of MAP decoder allows
    high-speed memory access pattern with minimal
    control logic.
  • 510Mbps SOVA decoder chip fabricated in 0.18 µm
    technology.
  • Proposed staggered LDPC decoding alleviates the
    large memory requirement in direct implementation
    of LDPC decoder.

72
Summary of Computational Complexities
  • Assumptions
  • CSA structures implemented
  • 7-bit wordlengths
  • 32kb of interleaver memory not included

73
Future Work
  • Complete testing and evaluation of fabricated
    SOVA decoders (Spring 2002). 2002 IEEE VLSI
    Symposia
  • Behavior of LDPC code under staggered decoding,
    and modifications to avoid saturation after 5
    iterations (Immediate).
  • Physical implementation of LDPC decoder using
    codes constructed from 2-dimensional Galois
    Fields and staggered decoding (Immediate).

74
Research Schedule
IEEE Globecom Conf. Pub.
IEEE Globecom Conf. Pub.
IEEE TMRC Conf. Pub.
IEEE Trans. Magnetics Pub.
Qualifying Exam
IEEE VLSI Symposia
IEEE ISSCC
[Timeline axis: 1/00, 7/00, 1/01, 7/01, 1/02, 7/02, 1/03]
Algorithmic Exploration
SOVA ASIC Physical Design
Architectural Exploration
SOVA Board Design and Testing
Current Status: Working chip undergoing BER
tests
LDPC ASIC Design
LDPC Board Design and Testing
LDPC ASIC design to include behavior of LDPC
code under staggered decoding, and modifications
to avoid saturation after 5 iterations.
Dissertation
75
Current List of Publications
  • W. R. Davis, N. Zhang, K. Camera, D. Markovic, T.
    Smilkstein, M. J. Ammer, E. Yeo, S. Augsburger,
    B. Nikolic, and R. W. Brodersen, "A Design
    Environment for High-Throughput, Low-Power
    Dedicated Signal Processing Systems," to appear
    in March, 2002 issue of the IEEE Journ.
    Solid-State Circuits.
  • E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
    "High throughput low-density parity-check
    architectures," to appear at Globecom2001,
    November 2001.
  • E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam,
    "VLSI architectures for iterative decoders in
    magnetic recording channels," IEEE Trans.
    Magnetics, vol.37, no.2, p 748-55, March 2001.
  • E. Yeo, P. Pakzad, B. Nikolic, V. Anantharam,
    "VLSI architectures for iterative decoders in
    magnetic recording channels," Digests of The
    Magnetic Recording Conference, TMRC 2000, on
    Magnetic Recording Systems, p. E6, Santa Clara,
    CA, August 14-16, 2000.

76
Read Channel System Block
[Block diagram. Blocks shown: Data In, ECC Encoder,
RLL/MTR Encoder, NRZI Encoding, magnetic medium (N/S
transitions), LPF, A/D, EQ, PLL, DFE, Sequence Detector,
RLL/MTR Decoder, ECC, Data Out]
  • No industry-wide standardization.
  • High speed communication requirements.
  • < 10⁻⁶ bit error rates (BER).
  • > 1 Gbps throughputs.
  • Advances in areal densities.
  • Decreasing SNR.
  • Costs limit die size to 25 mm².