Title: Design of a High-Throughput Low-Power IS95 Viterbi Decoder
1Design of a High-Throughput Low-Power IS95 Viterbi Decoder
- Xun Liu, Marios C. Papaefthymiou
- Advanced Computer Architecture Laboratory
- Electrical Engineering and Computer Science Department
- University of Michigan
4IS95 Convolutional Encoding
- Used in the reverse link of IS95 CDMA system
- 256 states (8 state registers)
- Rate 1/3
- Maximum Free Distance coding
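To make the encoder parameters concrete, here is a minimal sketch of a rate-1/3 encoder with 8 state registers (256 states). The generator polynomials (557, 663, 711 octal) are an assumption based on the commonly cited IS-95 reverse-link code, and the bit ordering (newest input bit in the most significant position) is an illustrative choice, not something stated on the slide.

```python
# Assumed generator polynomials (octal) for a rate-1/3, constraint-length-9 code.
GENERATORS = (0o557, 0o663, 0o711)
K = 9  # constraint length: 1 input bit + 8 state registers -> 2**8 = 256 states


def encode(bits):
    """Return 3 coded symbols per input bit (rate 1/3)."""
    state = 0  # contents of the 8 state registers, newest bit in bit 7
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state  # 9-bit window: input bit + state
        for g in GENERATORS:
            # Each output symbol is the parity of the taps selected by g.
            out.append(bin(reg & g).count("1") & 1)
        state = reg >> 1  # shift: the input bit enters, the oldest bit drops out
    return out
```

Each input bit produces three output symbols, so a block of n bits encodes to 3n symbols.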
5Viterbi Decoding (VD)
- VD is optimal for convolutional codes.
- Maximum likelihood decoding scheme.
- Minimum error for the additive white Gaussian noise (AWGN) channel.
- VD procedure.
- Construction of a complex graph called a trellis.
- Computation of the shortest path.
7Challenge of Large-State VD Designs
- High computational complexity.
- VDs with hundreds of states require multiple GOPS of throughput when symbol transfer rates reach the Mbps range.
- Parallel processing.
- High interconnect power dissipation.
- Complex routing among the processors.
For large-state VDs, global data transfer and interconnect issues must be considered carefully.
8Viterbi Decoder Designs
9Presentation Outline
- Viterbi decoding overview
- Our contributions
- Data-transfer-oriented hierarchical inter-processor optimization
- Intra-processor power optimization
- Chip data
10Encoding Example
11Viterbi Decoding
14VD Summary
- Each decoded symbol requires a layer of similar computations.
- 2N edge weight computations (N = number of states).
- N add-compare-select (ACS) operations.
- Operations within each layer are independent.
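One trellis layer can be sketched as follows. This is an illustrative model only: it uses hard-decision Hamming distance as the edge weight and assumes generator polynomials 557, 663, 711 (octal) with the newest bit in the most significant state position. The names `expected_symbols` and `acs_layer` are hypothetical.

```python
K = 9
N = 1 << (K - 1)                     # 256 states
GENERATORS = (0o557, 0o663, 0o711)   # assumed generator polynomials


def expected_symbols(state, bit):
    """Symbols the encoder would emit when `bit` enters from `state`."""
    reg = (bit << (K - 1)) | state
    return [bin(reg & g).count("1") & 1 for g in GENERATORS]


def acs_layer(path_metrics, received):
    """One layer: 2N edge weights and N add-compare-select operations."""
    new_metrics = [0] * N
    survivors = [0] * N
    for nxt in range(N):             # each iteration is independent of the others
        bit = nxt >> (K - 2)         # input bit that leads into state `nxt`
        best = None
        for lsb in (0, 1):           # the two predecessor states of `nxt`
            prev = ((nxt << 1) & (N - 1)) | lsb
            expected = expected_symbols(prev, bit)
            weight = sum(e != r for e, r in zip(expected, received))
            cand = path_metrics[prev] + weight          # add
            if best is None or cand < best:             # compare
                best, survivors[nxt] = cand, prev       # select
        new_metrics[nxt] = best
    return new_metrics, survivors
```

Because no iteration of the outer loop reads a value written by another iteration, the N ACS operations of a layer can be distributed over any number of processors.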
15Viterbi Decoder Architectures
Design space: number of processors used
16Viterbi Decoder Architectures
Design space: number of processors used
Intermediate solutions
17Key Issues
- How many ACS processors?
- Which ACS operations are executed in each processor?
- Which ACS operations can be executed concurrently?
- In what order are the operations executed?
- Can processors be pipelined?
18Q: Which operations are executed in each ACS processor?
A: Operation partitioning for global data transfer reduction
19Operation Partitioning Example
20Operation Partitioning Results
- Obtain solution by iterative bi-partitioning (Kernighan-Lin, KL).
- For 64 partitions, more than 50% of data transfers are global.
- Largest absolute reduction: 4 to 32 partitions.
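The quantity a partitioner tries to minimize can be sketched as below. This is not the KL procedure from the slides; it only evaluates the cost of a given assignment of ACS operations to partitions. It assumes a transfer is "global" when the ACS for a state reads a path metric produced in a different partition, and it uses the shift-register trellis connectivity of a 256-state code. The contiguous partition is a naive baseline chosen for illustration.

```python
N = 256  # number of states / ACS operations per layer


def predecessors(state, n=N):
    """The two states whose path metrics feed the ACS for `state`."""
    base = (state << 1) & (n - 1)
    return (base, base | 1)


def global_fraction(assign, n=N):
    """Fraction of path-metric transfers that cross partition boundaries."""
    total = crossing = 0
    for s in range(n):
        for p in predecessors(s, n):
            total += 1
            crossing += assign[p] != assign[s]
    return crossing / total


def contiguous(parts, n=N):
    """Naive baseline: split the state indices into equal contiguous blocks."""
    size = n // parts
    return [s // size for s in range(n)]
```

A KL-style partitioner would start from some assignment and swap operations between partitions whenever a swap lowers this cost.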
21Q: Which operations are executed simultaneously?
A: Operation packing for global bus minimization
22Operation Packing Example
23Operation Packing
- Packing procedure for global bus minimization
- One operation from each partition in each slice
- Global data transfers within a slice done simultaneously
- Bus cost: the number of ACS units connected
- Our heuristic
- Distribute global transfers evenly in all slices
24Operation Packing Results
- Comparison solution: one bus between any two ACS processors
- Global bus reduction: 31% on average
- Most effective range: 8 to 32 partitions
25Q: In what order should operations be executed?
Q: Can ACS units be pipelined?
A: Non-forwarding scheduling
26Non-forwarding Scheduling
27Non-forwarding Scheduling Results
- Greedy heuristic
- Pick slice with the least dependencies first.
- Iteratively pick the next slice such that the
upper bound of the non-forwarding pipeline depth
derived by the chosen slices is maximized.
- Architectures with 16 or more parallel processors
allow very limited non-forwarding pipeline depth.
28Q How many ACS processors should be used?
29Viterbi Decoder Architecture
30Processor Internal Architecture
- 16-bit datapath
- 8 pipeline stages
31Processor Level Power Reduction
- Combine precomputation and saturation arithmetic.
- If one or two operands overflow, ACS is partially shut off.
- No significant degradation of the decoding performance.
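One way to read the idea above is sketched here. It is an interpretation, not the authors' circuit: path metrics are assumed to be unsigned 16-bit values that clamp at a ceiling `SAT`, and a cheap precomputed flag per operand decides whether an adder needs to be active. The third return value counts the additions actually performed, as a rough stand-in for switching activity.

```python
SAT = (1 << 16) - 1  # assumed ceiling of an unsigned 16-bit path metric


def sat_add(a, b):
    """Saturating add: clamp at SAT instead of wrapping around."""
    return min(a + b, SAT)


def acs_saturating(pm0, bm0, pm1, bm1):
    """Return (metric, selected_branch, adds_performed)."""
    sat0, sat1 = pm0 >= SAT, pm1 >= SAT    # precomputed overflow flags
    if sat0 and sat1:
        return SAT, 0, 0                   # both operands saturated: no add, no compare
    if sat0:
        return sat_add(pm1, bm1), 1, 1     # branch 0 cannot win: skip its adder
    if sat1:
        return sat_add(pm0, bm0), 0, 1     # branch 1 cannot win: skip its adder
    c0, c1 = sat_add(pm0, bm0), sat_add(pm1, bm1)
    return (c0, 0, 2) if c0 <= c1 else (c1, 1, 2)
```

A saturated path metric can never beat an unsaturated candidate, so skipping its adder does not change which branch is selected.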
32Chip Implementation
- Design: RTL Verilog
- Synthesis: Design Analyzer
- Placement: manual floorplan
- Routing: Silicon Ensemble
- Verification: gate-level Verilog
- Power estimation: PrimePower
33Chip Summary
34Conclusion
- Design case study of a 256-state IS95 VD
- Hierarchical optimization methodology
- Global data transfer minimization
- Global bus reduction
- Non-forwarding scheduling
- Precomputation and saturation arithmetic
- Viterbi decoder
- 8 pipelined processors
- 4 global buses
- Throughput: 20 Mbps
- Power dissipation: 450 mW
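As a rough back-of-the-envelope reading of these figures: assume the 20 Mbps refers to decoded bits, that each decoded bit needs N = 256 ACS operations (one per state), and that the work is spread evenly over the 8 processors. The implied ACS rate and energy per decoded bit are then:

```latex
\begin{align*}
\text{ACS rate} &= N \times R = 256 \times 20\times10^{6}\ \text{s}^{-1}
                 = 5.12\times10^{9}\ \text{ACS/s},\\
\text{per processor} &= \frac{5.12\times10^{9}}{8} = 6.4\times10^{8}\ \text{ACS/s},\\
E_{\text{bit}} &= \frac{P}{R}
                = \frac{450\times10^{-3}\ \text{W}}{20\times10^{6}\ \text{bit/s}}
                = 22.5\ \text{nJ/bit}.
\end{align*}
```

These numbers are derived here, not reported in the slides. Under the stated assumptions they are consistent with the earlier remark that large-state decoders need multiple GOPS of throughput.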