DSP architectures for wireless communications


1
DSP architectures for wireless communications

Sridhar Rajagopal
Department of Electrical and Computer Engineering
Rice University, Houston TX
ECE Pizza Talk
March 28, 2003

This work has been supported in part by Nokia,
TI, TATP and NSF
2
Future wireless devices

High data rate mobile devices with multimedia
Multiple antennas w/ complex algorithms, GOPs of
computation
Area-Time-Power constraints
Seamless connection across environments and
standards
Use the fastest and cheapest available service

3
Aim of the talk
4
Trends
FLEXIBILITY
5
Change in flexibility requirements
No change (already flexible)
Maximum change (needs to support multiple
environments, algorithms and standards)
6
Architecture trade-offs

Past: more DSP, less ASIC. Current: less DSP,
more ASIC
Reason: need less flexibility, OR DSPs not
powerful enough?
Can't we build better DSPs?
How much flexibility do we need?

7
Problems with current DSPs

Current DSPs
Not enough functional units (FUs) for GOPs of
computation
Need 100s of FUs
Not low power enough!!
Cannot extend to more FUs
Limited Instruction Level Parallelism (ILP)
Limited Subword Parallelism (such as MMX)
Cannot support more registers (area, ports)
Difficult for compilers to find ILP as FUs increase

8
Scalable Wireless Application-specific Processors
(SWAPs)

Exploit data parallelism (DP)
Available in many wireless algorithms
This is what ASICs do!!
Example
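int i, a[1024], b[1024], c[1024];     // 32 bits
short int d[1024], e[1024], f[1024];  // 16 bits, packed
for (i = 1; i < 1024; i++) {
    a[i] = b[i] + c[i];  // operator lost in the transcript; + assumed
    d[i] = e[i] + f[i];  // operator lost in the transcript; + assumed
}

DP: loop iterations are independent
ILP: the two statements in the loop body are independent
Subword: 16-bit values packed into 32-bit words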
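
The subword case in the example above can be sketched in portable C. This is only an illustration, assuming the 16-bit operation is an addition (the transcript does not preserve the operator); packed_add16 and packed_add16_array are hypothetical names, not part of the talk.

```c
#include <stdint.h>

/* Add two pairs of packed 16-bit values held in 32-bit words.
   Each 16-bit lane wraps around on its own; no carry crosses lanes. */
uint32_t packed_add16(uint32_t x, uint32_t y)
{
    const uint32_t low15 = 0x7FFF7FFFu;  /* low 15 bits of each lane */
    const uint32_t top   = 0x80008000u;  /* top bit of each lane */
    uint32_t sum = (x & low15) + (y & low15);  /* carries stop at bit 15 of each lane */
    return sum ^ ((x ^ y) & top);              /* add the top bits, dropping carry out */
}

/* Data parallelism: every iteration is independent, so the loop could be
   split across clusters. n_words is the number of packed 32-bit words. */
void packed_add16_array(uint32_t *d, const uint32_t *e, const uint32_t *f,
                        int n_words)
{
    for (int i = 0; i < n_words; i++)
        d[i] = packed_add16(e[i], f[i]);
}
```

One 32-bit operation here does the work of two 16-bit additions, which is the effect a subword (MMX-style) unit provides in hardware.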
9
SWAPs: stream processors for wireless

Kernels (computation) and streams (communication)
Operations on kernels use local data
Streams expose data parallelism
Imagine stream processor at Stanford

10
DSP vs. SWAPs
Stream Register File (SRF)
SWAPs (max. clusters; all clusters are the same and do
the same operations)
DSP (1 cluster)
11
Arithmetic clusters

FUs (+, *, /)
Scratch-pad (Sp)
Indexed accesses
Comm. unit (CU)
Intercluster comm.
Distributed reg. files
Support more FUs

[Figure: arithmetic cluster, showing the FUs, local register files, cross point, scratch-pad (Sp), comm. unit (CU), intercluster network, and the path from/to the SRF]
12
SWAPs vs. DSPs: trade-offs

Same internal memory size as DSPs
Dependent on application, not architecture
Needs more area to support more functional units
Area is less of a constraint than power
Varying levels of DP in applications
Needs reconfiguration!!
Need to turn off unused clusters (and FUs)
More parallelism → lower clock frequency → lower
voltage
→ low power (∝ CV²f + leakage) in spite of
larger area
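
The power argument above can be checked with a few lines of arithmetic. This is a rough sketch that looks only at the dynamic CV²f term and ignores leakage; it assumes an n-way parallel design has n times the switched capacitance and runs at 1/n of the clock. The function names and the voltages in the usage note are illustrative, not figures from the talk.

```c
/* Dynamic switching power: P = C * V^2 * f (leakage ignored). */
double dynamic_power(double cap_farads, double volts, double freq_hz)
{
    return cap_farads * volts * volts * freq_hz;
}

/* Dynamic power of an n-way parallel design relative to a 1-way design
   at the same throughput. Assumes capacitance grows n times, the clock
   drops n times, and the supply is lowered from v_serial to v_parallel. */
double parallel_power_ratio(int n, double v_serial, double v_parallel)
{
    double serial   = dynamic_power(1.0, v_serial, 1.0);
    double parallel = dynamic_power((double)n, v_parallel, 1.0 / n);
    return parallel / serial;
}
```

Under these assumptions the capacitance and frequency factors cancel, so the saving comes entirely from the lower supply voltage: for example, parallel_power_ratio(4, 1.2, 0.6) gives one quarter of the original dynamic power.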

13
Design methodology
Chain of receiver algorithms
Low complexity, parallel, fixed point
Flexibility-performance trade-offs
High level language implementation
Architecture exploration
FPGA, customized, reconfigurable, heterogeneous
designs
ASIC design
learn
learn
Modular programmable architecture design
DSP, SWAPs
H-SWAPs
14
Physical layer of wireless receivers
Receiver more complex than transmitter
15
Algorithms for

Multiple antenna systems (MIMO systems)
Complexity exponential in the number of transmit and receive
antennas
Wide range of extremely complex algorithms
Optimal choice depends on fading, mobility, bandwidth,
antennas
GOPs of computations
Estimation: linear MMSE, blind, conjugate gradient
Detection: FFT, (blind) interference cancellation
Decoding: Viterbi, Turbo, LDPC
Implement ALL of them AND the NEXT one in line
Use the best one for the situation
Example for concept demonstration: Viterbi decoding

16
Parallel Viterbi Decoding

1. Add-Compare-Select (ACS): trellis interconnect
Parallelism depends on constraint length (number of states)
2. Conventional traceback
Sequential (no DP)
Difficult to implement in a parallel architecture
Use Register Exchange (RE) instead: a parallel solution
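
The two steps above can be sketched as one trellis stage in C. This is a minimal illustration and not the code from the talk: it assumes a binary shift-register trellis where the next state is ((s << 1) | input_bit) masked to K-1 bits, takes the branch metrics as an input, and uses a small constraint length (K_LEN = 3) for readability. acs_re_stage and all parameter names are hypothetical.

```c
#include <stdint.h>

#define K_LEN   3                    /* constraint length (assumed for the sketch) */
#define STATES  (1 << (K_LEN - 1))   /* 2^(K-1) trellis states */

/* One trellis stage: ACS plus register exchange for every state.
   pm_in, pm_out: path metrics (smaller is better)
   bm:            branch metrics, bm[2*ns + j] is the cost of entering
                  state ns from its j-th predecessor
   re_in, re_out: survivor bit histories, newest bit in the LSB */
void acs_re_stage(const unsigned *pm_in, const unsigned *bm,
                  const uint32_t *re_in,
                  unsigned *pm_out, uint32_t *re_out)
{
    for (int ns = 0; ns < STATES; ns++) {      /* iterations are independent: DP */
        int p0 = ns >> 1;                      /* predecessor whose oldest bit is 0 */
        int p1 = p0 | (STATES >> 1);           /* predecessor whose oldest bit is 1 */
        unsigned m0 = pm_in[p0] + bm[2 * ns];      /* add */
        unsigned m1 = pm_in[p1] + bm[2 * ns + 1];
        int best = (m1 < m0) ? p1 : p0;            /* compare */
        pm_out[ns] = (m1 < m0) ? m1 : m0;          /* select */
        /* Register exchange: copy the survivor's history and append the
           input bit that leads into ns (the low bit of ns). */
        re_out[ns] = (re_in[best] << 1) | (uint32_t)(ns & 1);
    }
}
```

Because each state carries its own decoded-bit history forward, no sequential traceback pass is needed; the decoded bits are read from the register of the best final state. The cost is copying a register per state per stage, which is the kind of work that maps onto identical clusters.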

17
Re-ordering for parallel Viterbi

Exploiting Viterbi DP in SWAPs
Re-order ACS, RE
Overhead

18
SWAP: algorithms and architecture

Algorithm design for parallelism
Architecture design?

19
SWAP design

Decide how many clusters
Exploit DP
Decide what to put within each cluster
Maximize ILP with high functional unit efficiency
Search design space with explore tool
See how it meets time-area-power constraints

20
Inside a SWAP cluster: EXPLORE
Auto-exploration of adders and multipliers for
ACS
(Adder FU, Multiplier FU)
21
Explore tool benefits

Instruction count vs. functional unit efficiency
What goes inside each cluster
Explore all algorithms
turn off functional units not in use for given
kernel
Design customized application-specific units
Better performance with increased FU utilization
Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Architecture: 4 adders, 3 multipliers, 64 clusters
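
The architecture figures above are the per-resource maximum over the two algorithms. A minimal sketch of that selection rule follows; the struct and function names are illustrative and say nothing about how the explore tool itself is implemented.

```c
/* Resources one algorithm needs (field names are illustrative). */
struct cluster_cfg {
    int adders;       /* adders per cluster */
    int multipliers;  /* multipliers per cluster */
    int clusters;     /* number of clusters */
};

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that covers every algorithm in the list:
   take the maximum of each resource. */
struct cluster_cfg cover_all(const struct cluster_cfg *algs, int n)
{
    struct cluster_cfg arch = {0, 0, 0};
    for (int i = 0; i < n; i++) {
        arch.adders      = max_int(arch.adders, algs[i].adders);
        arch.multipliers = max_int(arch.multipliers, algs[i].multipliers);
        arch.clusters    = max_int(arch.clusters, algs[i].clusters);
    }
    return arch;
}
```

Whatever a given kernel does not use from this superset (for example, two of the three multipliers while running Algorithm 2) is a candidate for being turned off.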

22
Viterbi reconfiguration
DP
Can be turned OFF
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
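
The cluster counts above follow from the number of trellis states, 2^(K-1), if each cluster handles 4 states. That ratio is inferred from the three packet examples rather than stated on the slide, and the function names below are hypothetical.

```c
/* Trellis states for constraint length k: 2^(k-1). */
int viterbi_states(int k)
{
    return 1 << (k - 1);
}

/* Clusters to keep switched on so that each one handles
   states_per_cluster states (rounded up). */
int active_clusters(int k, int states_per_cluster)
{
    int states = viterbi_states(k);
    return (states + states_per_cluster - 1) / states_per_cluster;
}
```

With 4 states per cluster this gives 16, 64 and 4 clusters for K = 7, 9 and 5, matching the three packets; the remaining clusters of a 64-cluster machine can be turned off.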
23
Viterbi decoding: rate 1/2 at 128 Kbps, 10 MHz
[Plot: frequency needed to attain real-time (in MHz; axis ticks 1, 10, 100, 1000) vs. number of clusters (axis ticks 1, 10, 100); curves for K = 5, K = 7 and K = 9; labels: DSP, static architecture, SWAPs]
Ideal C64x (w/o co-proc) needs 200 MHz for
real-time
24
SWAPs: salient features

1-2 orders of magnitude better than a single-processor
DSP
Any constraint length: 10 MHz at 128 Kbps
Same code for all constraint lengths
no need to re-compile or load another code
as long as parallelism/cluster ratio is constant
Power savings due to dynamic cluster scaling

25
Expected SWAP power consumption

64 clusters and 1 multiplier per cluster
0.13 micron, 1.2 V
Peak active power: 9 mW at 1 MHz
Area: 53.7 mm²
10 MHz, 128 Kbps with reconfiguration

Brucek Khailany et al., "Exploring the VLSI Scalability of Stream Processors," Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
26
Flexibility vs. performance

Suitable for mobile devices?
SWAPs: real-time at 10-100 mW
Maybe, but can we do better?
ASICs: real-time at 10-100 µW
No special customization for the application
No application-specific units
Generic inter-cluster communication network
Overhead for extracting parallelism
SWAPs suitable for base stations?
Why not? Power is not a primary constraint!

27
Multiuser Estimation-Detection-Decoding
Real-time target: 128 Kbps per user
Ideal C64x (w/o co-proc) needs 15 GHz for
real-time
28
Current research

SWAPs: completely flexible and general
How do we trade off flexibility for better
performance?
Handset SWAPs (H-SWAPs)

29
H-SWAPs: potential advantages
[Figure: execution time for DSP (RE), SWAPs and H-SWAPs]
30
Conclusions

Need flexible architectures for future wireless
devices
Higher data rates, lower power, more complex
algorithms
Design methodology (SWAPs, H-SWAPs, ASICs)
Flexibility vs. performance trade-offs
Blurs distinction between ASICs and programmable
solutions
Also need parallel, low precision algorithms for
efficient mapping
Inter-disciplinary research
Computer architecture, VLSI, wireless
communications, computer arithmetic, compilers
