Title: Flexible wireless communication architectures
1Flexible wireless communication architectures
- Sridhar Rajagopal
- Department of Electrical and Computer Engineering
- Rice University, Houston TX
- Faculty Candidate Seminar, University of Rochester, March 31, 2003
This work has been supported in part by Nokia, TI, TATP and NSF
2Future wireless devices
- High data rate mobile devices with multimedia
- Multiple antennas w/ complex algorithms, GOPs of computation
- Area-Time-Power constraints
- Seamless connection across environments and standards
- Use the fastest and cheapest available service
Wireless Cellular
Bluetooth/ Home Networks
Wireless LAN
3Change in flexibility requirements
No change (already flexible)
Maximum change (needs to support multiple
environments, algorithms and standards)
4Challenges faced in achieving this goal
- Long time-to-market
- Algorithm research
- Implementation issues on current architectures
- Architecture research
- Ad-hoc design methodology for architecture designs
- ASICs
- DSPs
- Heterogeneous
- Reconfigurable
5Research vision
- Architecture design methodology to explore
- Flexibility: support a variety of sophisticated algorithms
- High performance: GOPs of computation (Mbps)
- Low power: < 500 mW
- Algorithms
- Need efficient algorithms for mapping to architectures
6My contributions: Algorithms
- Multi-user estimation (Jnl. of VLSI Sig. Proc. '02, ASAP '00)
- matrix inversions, high numerical stability
- Numerical techniques
- conjugate-gradient descent for complexity reduction
- Multi-user detection (ISCAS '01)
- Block-based computation to streaming computations
- Pipelining, lower memory req.
- Parallel, fixed-point, streaming VLSI implementations (Trans. Wireless Comm. '02)
7My contributions: Architectures
- Heterogeneous system designs (ICSPAT '00)
- Computer arithmetic (Symp. on Comp. Arith. '01)
- Dynamic truncation in ASICs using on-line arithmetic
- Ph.D. Thesis
- Design methodology to explore flexibility-power-performance tradeoffs
- Scalable Wireless Application-specific Processors (SWAPs)
8SWAP design methodology
Chain of receiver algorithms
Low complexity, parallel, fixed point
Flexibility- performance tradeoffs
High level language implementation
Architecture exploration
FPGA, customized, reconfigurable, heterogeneous designs
ASIC design
learn
learn
Scalable programmable architecture design
SWAPs
9Benefits of this approach
- Provides a framework to explore
- algorithms
- flexible, high performance, low power architectures (SWAPs)
- Understanding of both algorithms and ASICs used for better SWAP designs
- Flexibility-performance trade-off with increasing customization in SWAPs
- Inter-disciplinary research
- Wireless communications, VLSI signal processing, computer architecture, computer arithmetic, CAD, compilers
10Talk Outline
- SWAPs framework
- SWAP Concept demonstration
- Algorithm design
- Application-specific architecture design
- Current and Future Research Goals
11DSP solutions
- Current DSPs
- Not enough functional units (FUs) for GOPs of computation
- TI C6x DSP has 8 FUs; need 100s of FUs
- Not low power enough!
- Cannot extend to more FUs
- Limited Instruction Level Parallelism (ILP)
- Limited Subword Parallelism (MMX)
- Cannot support more registers (area, ports)
- Compilers find it difficult to extract ILP as FUs increase
12Solution SWAPs
- Exploit data parallelism (DP)
- Available in many wireless algorithms
- This is what ASICs do!!
- Example
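    int i, a[N], b[N], c[N];        // 32 bits
    short int d[N], e[N], f[N];     // 16 bits, packed
    for (i = 0; i < 1024; i++)      // independent iterations: DP
    {
        a[i] = b[i] + c[i];         // element-wise add (operator assumed)
        d[i] = e[i] + f[i];         // independent of the line above: ILP; 16-bit operands: Subword
    }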
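The subword parallelism labelled in the example above can be illustrated in portable C. This is a minimal sketch, assuming two 16-bit lanes packed into one 32-bit word; the helper names `pack16` and `add_packed16` are hypothetical and not from the talk. DSPs and MMX-style extensions typically provide this as a single instruction; the masking below only emulates it.

```c
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word (hi lane in the upper half). */
static uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Lane-wise 16-bit add with wrap-around; carries never cross lanes. */
static uint32_t add_packed16(uint32_t x, uint32_t y)
{
    const uint32_t msb = 0x80008000u;         /* top bit of each lane */
    uint32_t low = (x & ~msb) + (y & ~msb);   /* add the low 15 bits of each lane */
    return low ^ ((x ^ y) & msb);             /* fold the top bits back in, carry-free */
}
```

One call to `add_packed16` performs two of the `d[i] = e[i] + f[i]` additions at once, which is the gain the Subword label refers to.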
13SWAPs stream processors for wireless
- Kernels (computation) and streams (communication)
- Operations on kernels use local data in clusters
- Streams expose data parallelism
- Imagine stream processor at Stanford
[Figure: receiver as kernels and streams, from input data to output data. Labels: received signal, correlator, channel estimation, matched filter, interference cancellation, Viterbi decoding, decoded bits.]
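The kernel and stream model can be sketched in plain C. This is only an illustration under stated assumptions: `NUM_CLUSTERS`, `run_kernel`, and the toy kernels `despread` and `hard_decision` are hypothetical names, and the kernels are stand-ins rather than the receiver algorithms from the talk. The inner loop models clusters that all run the same kernel on different stream elements.

```c
#include <stddef.h>

#define NUM_CLUSTERS 8   /* assumed cluster count, for illustration only */

typedef int (*kernel_fn)(int);   /* a kernel: per-element computation */

/* Apply one kernel to an input stream, producing an output stream.
 * The inner loop models the clusters: each handles one element per step,
 * all running the same kernel (SIMD-style data parallelism). */
static void run_kernel(kernel_fn k, const int *in, int *out, size_t n)
{
    for (size_t base = 0; base < n; base += NUM_CLUSTERS)
        for (size_t c = 0; c < NUM_CLUSTERS && base + c < n; c++)
            out[base + c] = k(in[base + c]);
}

/* Toy stand-ins for receiver kernels (not the real algorithms). */
static int despread(int x)      { return 2 * x; }
static int hard_decision(int x) { return x >= 0 ? 1 : 0; }

/* Chain kernels through an intermediate stream, as in a receiver pipeline. */
static void receiver(const int *received, int *tmp, int *bits, size_t n)
{
    run_kernel(despread, received, tmp, n);
    run_kernel(hard_decision, tmp, bits, n);
}
```

Calling `receiver(rx, tmp, bits, n)` passes the data between kernels only through streams, which is what exposes the data parallelism to the hardware.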
14DSP vs. SWAPs
Stream Register File (SRF)
SWAPs (max. clusters): all clusters execute the same operations. Power down unused FUs and clusters.
DSP (1 cluster)
15Arithmetic clusters
- FUs (+, *, /)
- Scratch-pad (Sp)
- Indexed accesses
- Comm. unit (CU)
- Intercluster comm.
- Distributed reg. files
- more FUs
[Figure: arithmetic cluster. Labels: From/To SRF, local register file, FUs, cross point, scratch-pad (Sp), comm. unit (CU), intercluster network.]
16Talk Outline
- SWAPs framework
- SWAP Concept demonstration
- Algorithm design
- Application-specific architecture design
- Current and Future Research Goals
17Physical layer of wireless receivers
Receiver more complex than transmitter
18Algorithms for
- Multiple antenna systems (MIMO systems)
- Complexity exponential with transmit and receive antennas
- Wide range of extremely complex algorithms
- Optimal choice depends on fading, mobility, bandwidth, antennas
- GOPs of computations
- Estimation: linear MMSE, blind, conjugate gradient
- Detection: FFT, (blind) interference cancellation
- Decoding: Viterbi, Turbo, LDPC
- Implement ALL of them AND the NEXT one in line
- Use the best one for the situation
- Example for concept demonstration: Viterbi decoding
19Parallel Viterbi Decoding
[Figure: Viterbi decoder. Detected bits enter the ACS unit, followed by the traceback unit, which outputs the decoded bits.]
- Add-Compare-Select (ACS): trellis interconnect
- Parallelism depends on constraint length (states)
- Conventional traceback
- Sequential (no DP)
- Difficult to implement in a parallel architecture
- Use Register Exchange (RE)
- parallel solution
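The ACS and register-exchange steps can be sketched as a small hard-decision decoder. This is a minimal sketch under stated assumptions: a rate-1/2 code with constraint length 3 and generators 7 and 5 (octal) is chosen for brevity and is not taken from the talk, and all function names are hypothetical. Each state keeps its own survivor register, so no separate traceback pass is needed, and the loop over next states is the data-parallel dimension.

```c
#include <stdint.h>
#include <string.h>

#define K       3                 /* constraint length (assumed small example) */
#define NSTATES (1 << (K - 1))    /* trellis states = available data parallelism */
#define G0      07                /* generator polynomials (octal), assumed */
#define G1      05

static int parity(unsigned x) { int p = 0; while (x) { p ^= x & 1; x >>= 1; } return p; }

/* Hamming distance between two 2-bit symbols. */
static unsigned dist2(unsigned a, unsigned b) { unsigned x = a ^ b; return (x & 1) + (x >> 1); }

/* 2-bit encoder output for input bit `in` leaving state `s`. */
static unsigned branch_out(unsigned s, unsigned in)
{
    unsigned reg = (in << (K - 1)) | s;
    return (unsigned)((parity(reg & G0) << 1) | parity(reg & G1));
}

/* Encode nbits (<= 32) of msg, MSB first; one 2-bit symbol per input bit. */
static void encode(uint32_t msg, int nbits, uint8_t *sym)
{
    unsigned s = 0;
    for (int t = 0; t < nbits; t++) {
        unsigned in = (msg >> (nbits - 1 - t)) & 1;
        sym[t] = (uint8_t)branch_out(s, in);
        s = ((in << (K - 1)) | s) >> 1;
    }
}

/* Viterbi decode with register exchange (up to 32 decoded bits). */
static uint32_t decode(const uint8_t *sym, int nbits)
{
    unsigned pm[NSTATES], npm[NSTATES];        /* path metrics */
    uint32_t re[NSTATES] = {0}, nre[NSTATES];  /* survivor registers */

    for (unsigned s = 0; s < NSTATES; s++) pm[s] = s ? 1000u : 0u; /* start in state 0 */

    for (int t = 0; t < nbits; t++) {
        /* Each next state is computed independently: the DP dimension. */
        for (unsigned ns = 0; ns < NSTATES; ns++) {
            unsigned in = ns >> (K - 2);                       /* input bit that leads to ns */
            unsigned p0 = (ns << 1) & (NSTATES - 1), p1 = p0 | 1;
            unsigned m0 = pm[p0] + dist2(branch_out(p0, in), sym[t]);  /* Add */
            unsigned m1 = pm[p1] + dist2(branch_out(p1, in), sym[t]);
            unsigned best = (m1 < m0) ? p1 : p0;               /* Compare */
            npm[ns] = (m1 < m0) ? m1 : m0;                     /* Select */
            nre[ns] = (re[best] << 1) | in;                    /* Register exchange */
        }
        memcpy(pm, npm, sizeof pm);
        memcpy(re, nre, sizeof re);
    }
    unsigned best = 0;
    for (unsigned s = 1; s < NSTATES; s++) if (pm[s] < pm[best]) best = s;
    return re[best];   /* decoded bits, first bit most significant */
}
```

The design choice to copy whole survivor registers costs more storage and data movement than traceback, but every state does the same work each step, which suits clusters running in lockstep.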
20Re-ordering for parallel Viterbi
DP
- Exploiting Viterbi DP in SWAPs
- Re-order ACS, RE
21SWAP Algorithms Architecture
- Algorithm design for parallelism
- Architecture design?
22SWAP design
- Decide how many clusters
- Exploit DP
- Decide what to put within each cluster
- Maximize ILP with high functional unit efficiency
- Search design space with explore tool
- See how it meets time-area-power constraints
23Inside a SWAP cluster EXPLORE
Auto-exploration of adders and multipliers for ACS
(Adder FU, Multiplier FU)
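A design-space search over adders and multipliers can be sketched as follows. This is not the explore tool from the talk; it is a deliberately crude model, assuming a kernel is described only by its add and multiply counts per iteration and that operations have no dependencies, so the cycle count is bounded by the busiest FU type. The names `design_point`, `evaluate`, and `explore` are hypothetical.

```c
typedef struct {
    int adders, multipliers;   /* FUs per cluster */
    int cycles;                /* cycles per kernel iteration (lower bound) */
    double efficiency;         /* fraction of FU slots doing useful work */
} design_point;

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Evaluate one (adders, multipliers) configuration for a kernel with
 * n_add additions and n_mul multiplications per iteration. */
static design_point evaluate(int n_add, int n_mul, int adders, int multipliers)
{
    design_point p = { adders, multipliers, 0, 0.0 };
    int ca = ceil_div(n_add, adders);
    int cm = ceil_div(n_mul, multipliers);
    p.cycles = ca > cm ? ca : cm;
    p.efficiency = (double)(n_add + n_mul) /
                   ((double)p.cycles * (adders + multipliers));
    return p;
}

/* Exhaustive search over 1..max_fu of each FU type:
 * fewest cycles first, then highest FU efficiency. */
static design_point explore(int n_add, int n_mul, int max_fu)
{
    design_point best = evaluate(n_add, n_mul, 1, 1);
    for (int a = 1; a <= max_fu; a++) {
        for (int m = 1; m <= max_fu; m++) {
            design_point p = evaluate(n_add, n_mul, a, m);
            if (p.cycles < best.cycles ||
                (p.cycles == best.cycles && p.efficiency > best.efficiency))
                best = p;
        }
    }
    return best;
}
```

For example, `explore(8, 2, 4)` searches the 16 configurations with up to 4 FUs of each type for a hypothetical kernel with 8 additions and 2 multiplications.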
24Explore tool benefits
- Instruction count vs. functional unit efficiency
- What goes inside each cluster
- Explore all algorithms
- turn off functional units not in use for a given kernel
- Design customized application-specific units
- Better performance with increased FU utilization
- Algorithm 1: 3 adders, 3 multipliers, 32 clusters
- Algorithm 2: 4 adders, 1 multiplier, 64 clusters
- Architecture: 4 adders, 3 multipliers, 64 clusters
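The architecture figures on this slide are consistent with taking, for each resource, the maximum over the algorithms to be supported. A minimal sketch of that reading follows; the per-resource maximum is an inference from the numbers above, and `swap_config`, `cover`, and `idle` are hypothetical names.

```c
typedef struct { int adders, multipliers, clusters; } swap_config;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest single configuration that covers every algorithm's needs
 * (per-resource maximum; an inferred rule, not stated in the talk). */
static swap_config cover(const swap_config *algs, int n)
{
    swap_config arch = { 0, 0, 0 };
    for (int i = 0; i < n; i++) {
        arch.adders      = max_int(arch.adders, algs[i].adders);
        arch.multipliers = max_int(arch.multipliers, algs[i].multipliers);
        arch.clusters    = max_int(arch.clusters, algs[i].clusters);
    }
    return arch;
}

/* Resources that can be turned off while `alg` runs on `arch`. */
static swap_config idle(swap_config arch, swap_config alg)
{
    swap_config off = {
        arch.adders - alg.adders,
        arch.multipliers - alg.multipliers,
        arch.clusters - alg.clusters
    };
    return off;
}
```

With the two algorithms listed above, `cover` yields 4 adders, 3 multipliers and 64 clusters, and `idle` reports what each algorithm leaves unused.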
25Saving Power
- Turning off FUs
- Easy
- Use the right FUs for a statically scheduled algorithm
- Turning off clusters
- Not so easy
- Each cluster does not have access to entire SRF
- Need data from SRF of other clusters
26Reconfiguration 1 Data transfer
SRF
CU
Clusters
Move data to appropriate clusters via comm units
Significant performance loss, additional SRF memory required
Can turn off SRF too!
27Reconfiguration 2 Conditional streams
Sp
Sp
Sp
Sp
Transfer data via comm unit (CU) and scratchpad (Sp)
Minimal loss in performance
Cannot turn off SRF, comm unit, scratchpad in clusters
28Reconfiguration 3 Multiplexed buffers
Use mux-demux buffers
Minimal loss in performance
Can turn off clusters entirely: more power savings
29Viterbi reconfiguration
DP
Can be turned OFF
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
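The cluster counts above follow from the number of trellis states, 2^(K-1), at a fixed ratio of 4 states per cluster. A minimal sketch, assuming that ratio (inferred from the three packets on this slide) and a 64-cluster machine; the function names are hypothetical.

```c
#define MAX_CLUSTERS       64   /* machine size used in the talk's examples */
#define STATES_PER_CLUSTER 4    /* inferred: 64 states on 16 clusters, 256 on 64, 16 on 4 */

/* Number of trellis states for constraint length k. */
static int viterbi_states(int k) { return 1 << (k - 1); }

/* Clusters kept on for a packet with constraint length k. */
static int active_clusters(int k)
{
    int c = viterbi_states(k) / STATES_PER_CLUSTER;
    if (c < 1) c = 1;
    if (c > MAX_CLUSTERS) c = MAX_CLUSTERS;
    return c;
}

/* Clusters that can be turned off for that packet. */
static int idle_clusters(int k) { return MAX_CLUSTERS - active_clusters(k); }
```

For the K = 5 packet this leaves 60 of the 64 clusters idle, which is where the power saving from reconfiguration comes from.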
30Execution Time (cycles)
[Figure: execution time (cycles) of clusters and memory for three 64-bit, rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5). Labels: kernels (computation), no data memory accesses.]
31Viterbi decoding rate 1/2 at 128 Kbps 10 MHz
[Figure: log-log plot of frequency needed to attain real-time (in MHz, 1 to 1000) against number of clusters (1 to 100), with curves for K = 5, K = 7 and K = 9. Annotations: DSP, static architecture, SWAPs.]
Ideal C64x (w/o co-proc) needs 200 MHz for real-time
32SWAPs Salient features
- 1-2 orders of magnitude better than a 1-processor DSP
- Any constraint length: 10 MHz at 128 Kbps
- Same code for all constraint lengths
- no need to re-compile or load another code
- as long as parallelism/cluster ratio is constant
- Power savings due to dynamic cluster scaling
33Expected SWAP power consumption
- 64 clusters and 1 multiplier per cluster
- 0.13 micron, 1.2 V
- Peak active power: 9 mW at 1 MHz (DSP: 1 mW at 1 MHz)
- Area: 53.7 mm²
- 10 MHz, 128 Kbps with reconfiguration (DSP: 200 mW)
Brucek Khailany et al., "Exploring the VLSI Scalability of Stream Processors", Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
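The per-MHz figures above can be turned into operating-point estimates. This is a back-of-the-envelope sketch, assuming peak active power scales linearly with clock frequency and that 128 Kbps means 128,000 bits per second; both are assumptions, and the function names are hypothetical. It gives the peak with all clusters on, not the reconfigured figure from the slide.

```c
/* Peak active power in mW, assuming linear scaling with clock frequency. */
static double peak_power_mw(double mw_per_mhz, double freq_mhz)
{
    return mw_per_mhz * freq_mhz;
}

/* Clock cycles available per decoded bit at a given clock and data rate. */
static double cycles_per_bit(double freq_mhz, double rate_kbps)
{
    return (freq_mhz * 1e6) / (rate_kbps * 1e3);
}
```

Under these assumptions, `peak_power_mw(9.0, 10.0)` gives 90 mW for the SWAP at 10 MHz, `peak_power_mw(1.0, 200.0)` reproduces the 200 mW quoted for the DSP at 200 MHz, and `cycles_per_bit(10.0, 128.0)` gives about 78 cycles per bit.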
34Flexibility vs. performance
- Suitable for mobile devices?
- SWAPs 128 Kbps at 10-100 mW for Viterbi
- What if we want to do better?
- No special customization for the application
- No application-specific units
- Generic inter-cluster communication network
- Overhead for extracting parallelism
- SWAPs suitable for base-stations?
- Why not? power is not a primary constraint!
35Multiuser Estimation-Detection-Decoding
Real-time target 128 Kbps per user
Ideal C64x (w/o co-proc) needs 15 GHz for
real-time
36Expected SWAP power base-station
- 32-user base-station with 3 multipliers per cluster and 64 clusters
- 0.13 micron, 1.2 V
- Peak active power: 18.19 mW at 1 MHz (increased)
- Area: 93.4 mm²
- Total peak base-station power consumption
- 18.19 W at 1 GHz for 32 users at 128 Kbps/user
37Talk Outline
- SWAPs framework
- SWAP Concept demonstration
- Algorithm design
- Application-specific architecture design
- Current and Future Research Goals
38Current research
- SWAPs: completely flexible and general
- How do we trade off flexibility for better performance?
- Handset SWAPs (H-SWAPs)
39Handset SWAPs H-SWAPs
- Trade Data Parallelism for Task Pipelining
- Design SWAPlets and customize each SWAPlet
SWAPs (max. clusters and reconfigure)
40H-SWAPs Potential advantages
Programmable solutions with increased customization
DSPs: ILP, Subword
SWAPs: ILP, Subword, DP
H-SWAPs: ILP, Subword, DP, Task Pipelining, Custom FUs
Performance, Power benefits
41Future research efficient algorithms
42Future research architectures
- Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
- Some other potential applications
- Image processing
- Cameras: variety of compression algorithms
- Biomedical applications
- Hearing aids: DSP running on body heat
- Sensor networks
- Compression of data before transmission
Quote: Gene Frantz, TI Fellow
43Conclusions
- Need flexible architectures for future wireless devices
- Higher data rates, lower power, more complex algorithms
- Design methodology (SWAPs concept)
- Flexibility vs. performance trade-offs
- SWAPs
- Exploit data parallelism like ASICs
- 1-2 orders of magnitude better than DSPs
- Turn off unused clusters and unused FUs for low power
- H-SWAPs for better performance and power benefits