Title: DSP architectures for wireless communications
1DSP architectures for wireless communications
- Sridhar Rajagopal
- Department of Electrical and Computer Engineering
- Rice University, Houston TX
- ECE Pizza Talk
- March 28, 2003
This work has been supported in part by Nokia,
TI, TATP and NSF
2Future wireless devices
- High data rate mobile devices with multimedia
- Multiple antennas w/ complex algorithms, GOPs of computation
- Area-Time-Power constraints
- Seamless connection across environments and standards
- Use the fastest and cheapest available service
3Aim of the talk
4Trends
FLEXIBILITY
5Change in flexibility requirements
No change (already flexible)
Maximum change (needs to support multiple
environments, algorithms and standards)
6Architecture trade-offs
- Past: more DSP, less ASIC; current: less DSP, more ASIC
- Reason: need less flexibility OR DSPs not powerful enough?
- Can't we build better DSPs?
- How much flexibility do we need?
7Problems with current DSPs
- Current DSPs
- Not enough functional units (FUs) for GOPs of computation
- Need 100s of FUs
- Not low power enough!!
- Cannot extend to more FUs
- Limited Instruction Level Parallelism (ILP)
- Limited Subword Parallelism (such as MMX)
- Cannot support more registers (area,ports)
- Difficult for compilers to find ILP as FUs increase
8Scalable Wireless Application-specific Processors (SWAPs)
- Exploit data parallelism (DP)
- Available in many wireless algorithms
- This is what ASICs do!!
- Example
- int i, a[1024], b[1024], c[1024]; // 32 bits
- short int d[1024], e[1024], f[1024]; // 16 bits packed
- for (i = 1; i < 1024; i++)
- {
- a[i] = b[i] + c[i];
- d[i] = e[i] * f[i];
- }
DP
ILP
Subword
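The loop above can be mimicked in a few lines to show how data parallelism and subword parallelism combine. This is only a sketch under stated assumptions: the helper names (`pack2`, `unpack2`, `packed_add16`, `run_on_clusters`) are hypothetical, addition stands in for whatever 16-bit operation the slide used, and the "clusters" are simulated sequentially rather than run in parallel.

```python
MASK16 = 0xFFFF  # one 16-bit lane


def pack2(lo, hi):
    """Pack two unsigned 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack2(word):
    """Split a 32-bit word back into its (low, high) 16-bit lanes."""
    return word & MASK16, (word >> 16) & MASK16


def packed_add16(x, y):
    """Subword parallelism: one call adds both 16-bit lanes independently."""
    x_lo, x_hi = unpack2(x)
    y_lo, y_hi = unpack2(y)
    # pack2 masks each lane, so a carry never leaks into the other lane.
    return pack2(x_lo + y_lo, x_hi + y_hi)


def run_on_clusters(op, xs, ys, num_clusters):
    """Data parallelism: cluster c handles elements c, c + num_clusters, ..."""
    out = [None] * len(xs)
    for c in range(num_clusters):  # conceptually, all clusters run at once
        for i in range(c, len(xs), num_clusters):
            out[i] = op(xs[i], ys[i])
    return out


e = [pack2(i, 2 * i) for i in range(8)]
f = [pack2(1, 3) for _ in range(8)]
d = run_on_clusters(packed_add16, e, f, num_clusters=4)
```

Because no iteration depends on another, the result is the same for any number of clusters; only the number of elements each cluster processes changes.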
9SWAPs stream processors for wireless
- Kernels (computation) and streams (communication)
- Operations on kernels use local data
- Streams expose data parallelism
- Imagine stream processor at Stanford
10DSP vs. SWAPs
[Figure labels: Stream Register File (SRF); SWAPs (max. clusters; all clusters do the same operations); DSP (1 cluster)]
11Arithmetic clusters
- FUs (+, *, /)
- Scratch-pad (Sp)
- Indexed accesses
- Comm. unit (CU)
- Intercluster comm.
- Distributed reg. Files
- more FUs
[Figure: arithmetic cluster; labels: From/To SRF, Local Register File, FUs, Cross Point, Sp, CU, Intercluster Network]
12SWAPs vs. DSPs trade-offs
- Same internal memory size as DSPs
- Dependent on application, not architecture
- Needs more area to support more functional units
- Area is less of a constraint than power
- Varying levels of DP in applications
- Needs reconfiguration!!
- Need to turn off unused clusters (and FUs)
- More parallelism → lower clock frequency → lower voltage → low power (P ∝ CV²f + leakage) in spite of larger area
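The power argument on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the capacitance, voltage floor and leakage values are made-up numbers, and the assumption that supply voltage scales linearly with clock frequency down to a floor `v_min_v` is a simplification, not something stated in the talk.

```python
def dynamic_power_w(c_eff_f, vdd_v, freq_hz):
    """Switching power C * V^2 * f, in watts."""
    return c_eff_f * vdd_v ** 2 * freq_hz


def parallel_power_w(c_eff_f, vdd_v, freq_hz, n_units, v_min_v,
                     leak_w_per_unit=0.0):
    """Power of n_units copies, each clocked at freq_hz / n_units.

    Assumption: voltage scales with frequency but not below v_min_v.
    Leakage is charged once per unit, so it grows with area.
    """
    f_each = freq_hz / n_units
    v_scaled = max(vdd_v / n_units, v_min_v)
    per_unit = dynamic_power_w(c_eff_f, v_scaled, f_each) + leak_w_per_unit
    return n_units * per_unit


# Hypothetical numbers: 1 nF switched capacitance, 1.2 V, 100 MHz.
baseline = dynamic_power_w(1e-9, 1.2, 100e6)
parallel = parallel_power_w(1e-9, 1.2, 100e6, n_units=4, v_min_v=0.6)
leaky = parallel_power_w(1e-9, 1.2, 100e6, n_units=4, v_min_v=0.6,
                         leak_w_per_unit=0.03)
```

With these numbers the four-way parallel version does the same work at a quarter of the switching power, because the total CV²f falls with the square of the voltage. The `leaky` case shows why unused clusters need to be turned off: a large enough per-unit leakage cancels the gain.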
13Design methodology
Chain of receiver algorithms
Low complexity, parallel, fixed point
Flexibility- performance tradeoffs
High level language implementation
Architecture exploration
FPGA, customized, reconfigurable, heterogeneous
designs
ASIC design
learn
learn
Modular programmable architecture design
DSP, SWAPs
H-SWAPs
14Physical layer of wireless receivers
Receiver more complex than transmitter
15Algorithms for
- Multiple antenna systems (MIMO systems)
- Complexity exponential with the number of transmit and receive antennas
- Wide range of extremely complex algorithms
- Optimal depends on fading, mobility, bandwidth, antennas
- GOPs of computations
- Estimation: Linear MMSE, blind, conjugate gradient.
- Detection: FFT, (blind) interference cancellation.
- Decoding: Viterbi, Turbo, LDPC.
- Implement ALL of them AND the NEXT one in line
- Use the best one for the situation
- Example for concept demonstration: Viterbi decoding
16Parallel Viterbi Decoding
- 1. Add-Compare-Select (ACS) trellis interconnect
- Parallelism depends on constraint length (states)
- 2. Conventional Traceback
- Sequential (No DP)
- Difficult to implement in parallel architecture
- Use Register Exchange (RE)
- parallel solution
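The two steps above can be sketched as a small hard-decision decoder. This is a minimal illustration, not the implementation from the talk: the constraint length K = 3, the (7, 5) octal generators and all function names are assumptions chosen to keep the example short. The inner loop over states is the part that a data-parallel machine could spread across clusters, and register exchange replaces traceback by carrying one survivor register per state.

```python
K = 3                      # constraint length (assumed for this sketch)
G = (0b111, 0b101)         # rate-1/2 generators, (7, 5) octal (assumed)
NUM_STATES = 1 << (K - 1)


def parity(x):
    return bin(x).count("1") & 1


def branch_output(state, bit):
    """Encoder output pair when `bit` is shifted in while in `state`."""
    reg = (bit << (K - 1)) | state          # newest bit sits in the MSB
    return tuple(parity(reg & g) for g in G)


def next_state(state, bit):
    return ((bit << (K - 1)) | state) >> 1


def encode(bits):
    state, out = 0, []
    for b in bits:
        out.append(branch_output(state, b))
        state = next_state(state, b)
    return out


def viterbi_re(received):
    """Hard-decision Viterbi decoding with register exchange (no traceback)."""
    inf = float("inf")
    metrics = [0] + [inf] * (NUM_STATES - 1)   # encoder starts in state 0
    paths = [[] for _ in range(NUM_STATES)]    # one survivor register per state
    for r in received:
        new_metrics, new_paths = [], []
        for ns in range(NUM_STATES):           # independent per state (DP)
            bit = ns >> (K - 2)                # input bit that leads into ns
            candidates = []
            for low in (0, 1):                 # the two predecessor states
                prev = ((ns << 1) & (NUM_STATES - 1)) | low
                dist = sum(a != b for a, b in zip(branch_output(prev, bit), r))
                candidates.append((metrics[prev] + dist, prev))      # add
            best_metric, best_prev = min(candidates)      # compare, select
            new_metrics.append(best_metric)
            new_paths.append(paths[best_prev] + [bit])    # register exchange
        metrics, paths = new_metrics, new_paths
    best_state = min(range(NUM_STATES), key=metrics.__getitem__)
    return paths[best_state]


message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # last two zeros flush the encoder
noisy = [list(pair) for pair in encode(message)]
noisy[2][0] ^= 1                           # inject a single channel bit error
decoded = viterbi_re(noisy)
```

Copying a whole survivor register at every step costs more storage and data movement than traceback, but every state does the same work at the same time, which is what makes it map onto identical clusters.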
17Re-ordering for parallel Viterbi
- Exploiting Viterbi DP in SWAPs
- Re-order ACS, RE
- Overhead
18SWAP Algorithms Architecture
- Algorithm design for parallelism
- Architecture design?
19SWAP design
- Decide how many clusters
- Exploit DP
- Decide what to put within each cluster
- Maximize ILP with high functional unit efficiency
- Search design space with explore tool
- See how it meets time-area-power constraints
20Inside a SWAP cluster EXPLORE
Auto-exploration of adders and multipliers for "ACS"
(Adder FU, Multiplier FU)
21Explore tool benefits
- Instruction count vs. functional unit efficiency
- What goes inside each cluster
- Explore all algorithms
- turn off functional units not in use for given kernel
- Design customized application-specific units
- Better performance with increased FU utilization
- Algorithm 1: 3 adders, 3 multipliers, 32 clusters
- Algorithm 2: 4 adders, 1 multiplier, 64 clusters
- Architecture: 4 adders, 3 multipliers, 64 clusters
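The last three bullets amount to taking, for each resource, the largest requirement over all algorithms. A minimal sketch of that step follows; the dictionary layout and the function names are hypothetical, and only the numbers come from the slide.

```python
requirements = {
    "Algorithm 1": {"adders": 3, "multipliers": 3, "clusters": 32},
    "Algorithm 2": {"adders": 4, "multipliers": 1, "clusters": 64},
}


def covering_architecture(reqs):
    """Smallest configuration that satisfies every algorithm's needs."""
    resources = next(iter(reqs.values())).keys()
    return {r: max(alg[r] for alg in reqs.values()) for r in resources}


def idle_resources(architecture, need):
    """Resources an algorithm leaves unused, candidates for being turned off."""
    return {r: architecture[r] - need[r] for r in architecture}


architecture = covering_architecture(requirements)
idle_for_alg1 = idle_resources(architecture, requirements["Algorithm 1"])
idle_for_alg2 = idle_resources(architecture, requirements["Algorithm 2"])
```

Here `architecture` reproduces the slide's 4 adders, 3 multipliers and 64 clusters, and the idle counts show what each algorithm could switch off: Algorithm 1 leaves one adder per cluster and 32 clusters unused, Algorithm 2 leaves two multipliers per cluster unused.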
22Viterbi reconfiguration
DP
Can be turned OFF
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
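The cluster counts for the three packets are consistent with a trellis of 2^(K-1) states mapped at four states per cluster. The sketch below reproduces them; the four-states-per-cluster ratio is inferred from the slide's numbers rather than stated in the talk, and the function names are hypothetical.

```python
def viterbi_states(constraint_length):
    """Number of trellis states for constraint length K: 2^(K-1)."""
    return 2 ** (constraint_length - 1)


def active_clusters(constraint_length, states_per_cluster=4, max_clusters=64):
    """Clusters kept on for one packet; the remaining ones can be turned off.

    states_per_cluster=4 is an inferred value, not one given on the slide.
    """
    states = viterbi_states(constraint_length)
    needed = -(-states // states_per_cluster)   # ceiling division
    return min(needed, max_clusters)


packets = {"Packet 1": 7, "Packet 2": 9, "Packet 3": 5}
clusters_per_packet = {name: active_clusters(k) for name, k in packets.items()}
```

Keeping the states-per-cluster ratio fixed is what lets the same kernel code serve every constraint length: only the number of active clusters changes from packet to packet.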
23Viterbi decoding rate 1/2 at 128 Kbps 10 MHz
[Figure: frequency needed to attain real-time (in MHz, log scale from 1 to 1000) vs. number of clusters (log scale from 1 to 100); curves for K = 9, K = 7 and K = 5; labels: DSP, SWAPs, static architecture]
Ideal C64x (w/o co-proc) needs 200 MHz for
real-time
24SWAPs Salient features
- 1-2 orders of magnitude better than 1 processor DSP
- Any constraint length: 10 MHz at 128 Kbps
- Same code for all constraint lengths
- no need to re-compile or load another code
- as long as parallelism/cluster ratio is constant
- Power savings due to dynamic cluster scaling
25Expected SWAP power consumption
- 64 clusters and 1 multiplier per cluster
- 0.13 micron, 1.2 V
- Peak Active Power 9 mW at 1 MHz
- Area 53.7 mm2
- 10 MHz, 128 Kbps with reconfiguration
Source: Brucek Khailany et al., "Exploring the VLSI Scalability of Stream Processors", Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
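A rough way to read these figures is to scale the quoted peak power with clock frequency and with the fraction of clusters left on. The sketch below does only that. Both linear scalings are simplifying assumptions (they ignore leakage and any power that does not depend on the number of active clusters), and the function name is hypothetical.

```python
PEAK_MW_PER_MHZ = 9.0    # from the slide: 9 mW at 1 MHz with all 64 clusters
TOTAL_CLUSTERS = 64


def estimated_peak_power_mw(freq_mhz, active_clusters=TOTAL_CLUSTERS):
    """Peak active power estimate in mW.

    Assumes power is linear in clock frequency and in active clusters.
    """
    return PEAK_MW_PER_MHZ * freq_mhz * active_clusters / TOTAL_CLUSTERS


all_on = estimated_peak_power_mw(10)           # all 64 clusters at 10 MHz
sixteen_on = estimated_peak_power_mw(10, 16)   # 16 clusters active
four_on = estimated_peak_power_mw(10, 4)       # 4 clusters active
```

Under these assumptions the full array at 10 MHz comes to 90 mW, and turning clusters off for shorter constraint lengths reduces the estimate in proportion.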
26Flexibility vs. performance
- Suitable for mobile devices?
- SWAPs: real-time at 10-100 mW
- Maybe, but can we do better?
- ASICs: real-time at 10-100 µW
- No special customization for the application
- No application-specific units
- Generic inter-cluster communication network
- Overhead for extracting parallelism
- SWAPs suitable for base-stations?
- Why not? power is not a primary constraint!
27Multiuser Estimation-Detection-Decoding
Real-time target 128 Kbps per user
Ideal C64x (w/o co-proc) needs 15 GHz for
real-time
28Current research
- SWAPs: completely flexible and general
- How do we trade off flexibility for better performance?
- Handset SWAPs (H-SWAPs)
29H-SWAPs Potential advantages
[Figure: execution time comparison; labels: DSP (RE), SWAPs, H-SWAPs]
30Conclusions
- Need flexible architectures for future wireless devices
- Higher data rates, lower power, more complex algorithms
- Design methodology (SWAPs, H-SWAPs, ASICs)
- Flexibility vs. performance trade-offs
- Blurs distinction between ASICs and programmable solutions
- Also need parallel, low precision algorithms for efficient mapping
- Inter-disciplinary research
- Computer architecture, VLSI, wireless communications, computer arithmetic, compilers