1
Flexible wireless communication architectures

Sridhar Rajagopal
Department of Electrical and Computer Engineering
Rice University, Houston TX
Faculty Candidate Seminar, University of Rochester
March 31, 2003

This work has been supported in part by Nokia,
TI, TATP and NSF
2
Future wireless devices

High data rate mobile devices with multimedia
Multiple antennas w/ complex algorithms, GOPs of
computation
Area-Time-Power constraints
Seamless connection across environments and
standards
Use the fastest and cheapest available service

Wireless Cellular
Bluetooth/ Home Networks
Wireless LAN
3
Change in flexibility requirements
No change (already flexible)
Maximum change (needs to support multiple
environments, algorithms and standards)
4
Challenges faced in achieving this goal

Long time-to-market
Algorithm research
Implementation issues on current architectures
Architecture research
Ad-hoc design methodology for architecture
designs
ASICs
DSPs
Heterogeneous
Reconfigurable

5
Research vision

Architecture design methodology to explore
Flexibility: support a variety of sophisticated algorithms
High performance: GOPs of computation (Mbps)
Low power: < 500 mW
Algorithms
Need efficient algorithms for mapping to architectures

6
My contributions: Algorithms

Multi-user estimation [Jnl. of VLSI Sig. Proc. '02, ASAP '00]
Matrix inversions, high numerical stability
Numerical techniques
Conjugate-gradient descent for complexity reduction
Multi-user detection [ISCAS '01]
Block-based computation to streaming computations
Pipelining, lower memory requirements
Parallel, fixed-point, streaming VLSI implementations [Trans. Wireless Comm. '02]

7
My contributions: Architectures

Heterogeneous system designs [ICSPAT '00]
Computer arithmetic [Symp. on Comp. Arith. '01]
Dynamic truncation in ASICs using on-line arithmetic
Ph.D. thesis
Design methodology to explore flexibility-power-performance tradeoffs
Scalable Wireless Application-specific Processors (SWAPs)

8
SWAP design methodology
Chain of receiver algorithms
Low complexity, parallel, fixed point
Flexibility-performance tradeoffs
High level language implementation
Architecture exploration
FPGA, customized, reconfigurable, heterogeneous
designs
ASIC design
learn
learn
Scalable programmable architecture design
SWAPs
9
Benefits of this approach

Provides a framework to explore
algorithms
flexible, high performance, low power
architectures (SWAPs)
Understanding of both algorithms and ASICs used
for better SWAP designs
Flexibility-performance trade-off with increasing
customization in SWAPs
Inter-disciplinary research
Wireless communications, VLSI Signal Processing,
Computer architecture, Computer arithmetic, CAD,
Compilers

10
Talk Outline

SWAPs framework
SWAP Concept demonstration
Algorithm design
Application-specific architecture design
Current and Future Research Goals

11
DSP solutions

Current DSPs
Not enough functional units (FUs) for GOPs of computation
TI C6x DSP has 8 FUs; need 100s of FUs
Not low power enough!
Cannot extend to more FUs
Limited instruction-level parallelism (ILP)
Limited subword parallelism (MMX)
Cannot support more registers (area, ports)
Compilers find it difficult to extract ILP as FUs increase

12
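Solution: SWAPs

Exploit data parallelism (DP)
Available in many wireless algorithms
This is what ASICs do!
Example:
int i, a[N], b[N], c[N];       // 32 bits
short int d[N], e[N], f[N];    // 16 bits, packed
for (i = 0; i < 1024; i++) {
    a[i] = b[i] + c[i];        // operator assumed; not legible in the slide
    d[i] = e[i] + f[i];        // operator assumed; not legible in the slide
}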

DP
ILP
Subword
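The three labels map onto the loop in the example: iterations are independent (DP), the two statements in the body are independent (ILP), and the 16-bit arrays can be processed two elements per 32-bit word (subword). A minimal sketch of the subword case in portable C follows, assuming the element-wise operation is an addition. The function names add_packed16 and add_packed_stream are hypothetical and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two 16-bit lanes packed in one 32-bit word (wraps per lane).
   The top bit of each lane is handled separately so that a carry
   never crosses from the low lane into the high lane. */
static uint32_t add_packed16(uint32_t x, uint32_t y)
{
    const uint32_t top = 0x80008000u;        /* top bit of each lane */
    uint32_t low = (x & ~top) + (y & ~top);  /* add the low 15 bits of each lane */
    return low ^ ((x ^ y) & top);            /* add the top bits without carry-out */
}

/* d = e + f over n 16-bit elements stored two per word (n must be even).
   Loop iterations are independent of each other (data parallelism);
   each iteration performs two additions at once (subword parallelism). */
void add_packed_stream(uint32_t *d, const uint32_t *e, const uint32_t *f, size_t n)
{
    assert(n % 2 == 0);
    for (size_t w = 0; w < n / 2; ++w)
        d[w] = add_packed16(e[w], f[w]);
}
```

On a DSP the inner helper would typically be a single packed-add instruction; the loop over words is the part a multi-cluster machine can spread across clusters.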
13
SWAPs: stream processors for wireless

Kernels (computation) and streams (communication)
Kernels operate on local data in clusters
Streams expose data parallelism
cf. the Imagine stream processor at Stanford

[Figure: receiver as a stream program. Kernels: correlator, channel estimation, matched filter, interference cancellation, Viterbi decoding. Streams: input data (received signal), output data (decoded bits).]
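The kernel-and-stream model can be sketched in plain C as a kernel applied element by element to a stream, with the elements of each step spread across clusters. This is only a sequential emulation under stated assumptions: kernel_fn, scale_by_two and run_kernel are hypothetical names, and the sample kernel is a placeholder rather than one of the receiver kernels.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*kernel_fn)(int);   /* hypothetical per-element kernel */

/* Example kernel: any pure function of one stream element works here. */
int scale_by_two(int x) { return 2 * x; }

/* Apply kernel k to a stream of n elements on `clusters` clusters.
   In each step, cluster c handles element base + c; the inner loop
   stands in for clusters working at the same time.
   Returns the number of steps, i.e. ceil(n / clusters). */
size_t run_kernel(kernel_fn k, const int *in, int *out, size_t n, size_t clusters)
{
    assert(clusters > 0);
    size_t steps = 0;
    for (size_t base = 0; base < n; base += clusters) {
        for (size_t c = 0; c < clusters && base + c < n; ++c)
            out[base + c] = k(in[base + c]);
        ++steps;
    }
    return steps;
}
```

Because the kernel touches only its own element, the step count falls as clusters are added, which is the sense in which streams expose data parallelism.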
14
DSP vs. SWAPs
Stream Register File (SRF)
SWAPs: max. clusters; all clusters are the same and run the same operations. Power down unused FUs and clusters
DSP (1 cluster)
15
Arithmetic clusters

FUs (+, ×, /)
Scratch-pad (Sp)
Indexed accesses
Comm. unit (CU)
Inter-cluster communication
Distributed register files
Support more FUs

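[Figure: arithmetic cluster. FUs with local register files joined by a cross point, a scratchpad (Sp) and a comm. unit (CU); data moves from/to the SRF and over the intercluster network.]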
16
Talk Outline

SWAPs framework
SWAP Concept demonstration
Algorithm design
Application-specific architecture design
Current and Future Research Goals

17
Physical layer of wireless receivers
Receiver more complex than transmitter
18
Algorithms for

Multiple antenna systems (MIMO systems)
Complexity exponential in the number of transmit and receive antennas
Wide range of extremely complex algorithms
Optimal choice depends on fading, mobility, bandwidth, antennas
GOPs of computation
Estimation: linear MMSE, blind, conjugate gradient
Detection: FFT, (blind) interference cancellation
Decoding: Viterbi, Turbo, LDPC
Implement ALL of them AND the NEXT one in line
Use the best one for the situation
Example for concept demonstration: Viterbi decoding

19
Parallel Viterbi Decoding
[Figure: detected bits enter the ACS unit, followed by the traceback unit, which outputs decoded bits]

Add-Compare-Select (ACS): trellis interconnect
Parallelism depends on constraint length (number of states)
Conventional traceback
Sequential (no DP)
Difficult to implement in a parallel architecture
Use Register Exchange (RE)
Parallel solution
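One trellis stage of ACS combined with register exchange can be sketched as below. This is a minimal illustration, not the implementation from the talk. It assumes a 4-state trellis (constraint length 3), a state encoding with the newest input bit in the least significant position, and branch metrics supplied by the caller; NUM_STATES, bm0, bm1 and acs_register_exchange are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_STATES 4u   /* assumed: constraint length K = 3, so 2^(K-1) states */

/* One trellis stage. pm and reg hold the current path metric and survivor
   register of each state. bm0[ns] and bm1[ns] are the branch metrics for
   entering state ns from its first and second predecessor.
   Every ns is computed independently, so the loop is data parallel. */
void acs_register_exchange(const uint32_t *pm, const uint32_t *reg,
                           const uint32_t *bm0, const uint32_t *bm1,
                           uint32_t *pm_next, uint32_t *reg_next)
{
    assert(pm != pm_next && reg != reg_next);      /* outputs must not alias inputs */
    for (uint32_t ns = 0; ns < NUM_STATES; ++ns) {
        uint32_t p0 = ns >> 1;                     /* predecessor whose oldest bit is 0 */
        uint32_t p1 = p0 | (NUM_STATES >> 1);      /* predecessor whose oldest bit is 1 */
        uint32_t m0 = pm[p0] + bm0[ns];            /* add */
        uint32_t m1 = pm[p1] + bm1[ns];
        uint32_t best = (m1 < m0) ? p1 : p0;       /* compare */
        pm_next[ns] = (m1 < m0) ? m1 : m0;         /* select */
        /* register exchange: copy the survivor's bits, append the input bit */
        reg_next[ns] = (reg[best] << 1) | (ns & 1u);
    }
}
```

Since each state carries its own decoded-bit register forward, the decoder output can be read from the register of the best final state, with no sequential traceback pass.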

20
Re-ordering for parallel Viterbi
DP

Exploiting Viterbi DP in SWAPs
Re-order ACS, RE

21
SWAP Algorithms Architecture

Algorithm design for parallelism
Architecture design?

22
SWAP design

Decide how many clusters
Exploit DP
Decide what to put within each cluster
Maximize ILP with high functional unit efficiency
Search design space with explore tool
See how it meets time-area-power constraints

23
Inside a SWAP cluster: Explore
Auto-exploration of adders and multipliers for "ACS"
(Adder FU, Multiplier FU)
24
Explore tool benefits

Instruction count vs. functional unit efficiency
What goes inside each cluster
Explore all algorithms
Turn off functional units not in use for a given kernel
Design customized application-specific units
Better performance with increased FU utilization
Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Architecture: 4 adders, 3 multipliers, 64 clusters
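The architecture row equals the element-wise maximum of the two algorithm rows. A short sketch of that combination step follows. Whether the Explore tool derives the architecture this way is an assumption, and swap_config and covering_config are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned adders;       /* adders per cluster */
    unsigned multipliers;  /* multipliers per cluster */
    unsigned clusters;     /* number of clusters */
} swap_config;

static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

/* Smallest configuration that meets every algorithm's requirement:
   the element-wise maximum over all per-algorithm configurations. */
swap_config covering_config(const swap_config *algs, size_t n)
{
    assert(n > 0);
    swap_config arch = algs[0];
    for (size_t i = 1; i < n; ++i) {
        arch.adders      = max_u(arch.adders, algs[i].adders);
        arch.multipliers = max_u(arch.multipliers, algs[i].multipliers);
        arch.clusters    = max_u(arch.clusters, algs[i].clusters);
    }
    return arch;
}
```

Units that a given algorithm does not need (for example 2 of the 3 multipliers under Algorithm 2) are the ones that can be turned off while that algorithm runs.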

25
Saving Power

Turning off FUs
Easy
Use the right FUs for the statically scheduled algorithm
Turning off clusters
Not so easy
Each cluster does not have access to entire SRF
Need data from SRF of other clusters

26
Reconfiguration 1: Data transfer
SRF
CU
Clusters
Move data to appropriate clusters via comm. units
Significant performance loss, additional SRF memory required
Can turn off SRF too!
27
Reconfiguration 2: Conditional streams
[Figure: clusters, each with a scratchpad (Sp)]
Transfer data via comm. unit (CU) and scratchpad (Sp)
Minimal loss in performance
Cannot turn off SRF, comm. unit, or scratchpad in clusters
28
Reconfiguration 3: Multiplexed buffers
Use mux-demux buffers
Minimal loss in performance
Can turn off clusters entirely: more power savings
29
Viterbi reconfiguration
DP
Can be turned OFF
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
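The three cluster counts are consistent with 2^(K-1) trellis states for constraint length K and 4 states per cluster: 64/4 = 16, 256/4 = 64 and 16/4 = 4. A small sketch of that mapping follows. The fixed ratio of 4 is inferred from these numbers rather than stated on the slide, and active_clusters is a hypothetical name.

```c
#include <assert.h>

/* Number of clusters to keep powered for constraint length k, assuming
   2^(k-1) trellis states and a fixed number of states per cluster.
   The result is capped at the number of clusters the machine has. */
unsigned active_clusters(unsigned k, unsigned states_per_cluster, unsigned max_clusters)
{
    assert(k >= 1 && k <= 31);
    assert(states_per_cluster > 0 && max_clusters > 0);
    unsigned states = 1u << (k - 1);
    /* round up so that every state is assigned to some cluster */
    unsigned needed = (states + states_per_cluster - 1) / states_per_cluster;
    return needed < max_clusters ? needed : max_clusters;
}
```

For example, active_clusters(7, 4, 64) gives 16, so the remaining 48 clusters of a 64-cluster machine can be turned off while Packet 1 is decoded.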
30
[Figure: execution time (cycles). Labels: clusters, memory, kernels (computation). Packet 1: 64-bit, rate 1/2, K = 7; Packet 2: K = 9; Packet 3: K = 5. No data memory accesses.]
31
Viterbi decoding: rate 1/2 at 128 Kbps, 10 MHz
[Figure: frequency needed to attain real-time (in MHz, log scale from 1 to 1000) vs. number of clusters (log scale from 1 to 100), with curves for K = 9, K = 7 and K = 5. Legend: static architecture, DSP, SWAPs.]
Ideal C64x (w/o co-proc) needs 200 MHz for real-time
32
SWAPs: Salient features

1-2 orders of magnitude better than a 1-processor DSP
Any constraint length: 10 MHz at 128 Kbps
Same code for all constraint lengths
No need to re-compile or load another code
as long as the parallelism/cluster ratio is constant
Power savings due to dynamic cluster scaling

33
Expected SWAP power consumption

64 clusters and 1 multiplier per cluster
0.13 micron, 1.2 V
Peak active power: 9 mW at 1 MHz (DSP: 1 mW at 1 MHz)
Area: 53.7 mm²
10 MHz, 128 Kbps with reconfiguration (DSP: 200 mW)

Exploring the VLSI Scalability of Stream
Processors, Brucek Khailany et al, Proceedings of
the Ninth Symposium on High Performance Computer
Architecture, February 8-12, 2003, Anaheim,
California, USA, pp. 153-164
34
Flexibility vs. performance

Suitable for mobile devices?
SWAPs: 128 Kbps at 10-100 mW for Viterbi
What if we want to do better?
No special customization for the application
No application-specific units
Generic inter-cluster communication network
Overhead for extracting parallelism
SWAPs suitable for base-stations?
Why not? Power is not a primary constraint!

35
Multiuser Estimation-Detection-Decoding
Real-time target: 128 Kbps per user
Ideal C64x (w/o co-proc) needs 15 GHz for
real-time
36
Expected SWAP power: base-station

32-user base-station with 3 multipliers per cluster and 64 clusters
0.13 micron, 1.2 V
Peak active power: 18.19 mW at 1 MHz
(increased )
Area: 93.4 mm²
Total peak base-station power consumption:
18.19 W at 1 GHz for 32 users at 128 Kbps/user
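The total follows from scaling the per-MHz figure to the clock rate: 18.19 mW/MHz × 1000 MHz = 18.19 W. A one-function sketch of that arithmetic follows, assuming peak active power grows linearly with clock frequency at a fixed supply voltage. The name peak_power_mw is hypothetical.

```c
#include <assert.h>

/* Peak active power in mW at clock_mhz, assuming power grows linearly
   with clock frequency at a fixed supply voltage. */
double peak_power_mw(double mw_per_mhz, double clock_mhz)
{
    assert(mw_per_mhz >= 0.0 && clock_mhz >= 0.0);
    return mw_per_mhz * clock_mhz;
}
```

The same scaling applied to the handset figures (9 mW at 1 MHz, 10 MHz clock) gives about 90 mW, which sits inside the 10-100 mW range quoted for Viterbi decoding.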

37
Talk Outline

SWAPs framework
SWAP Concept demonstration
Algorithm design
Application-specific architecture design
Current and Future Research Goals

38
Current research

SWAPs: completely flexible and general
How do we trade off flexibility for better performance?
Handset SWAPs (H-SWAPs)

39
Handset SWAPs: H-SWAPs

Trade Data Parallelism for Task Pipelining
Design SWAPlets and customize each SWAPlet

SWAPs (max. clusters and reconfigure)
40
H-SWAPs: Potential advantages
Programmable solutions with increased customization
DSPs: ILP, Subword
SWAPs: ILP, Subword, DP
H-SWAPs: ILP, Subword, DP, Task Pipelining, Custom FUs
Performance, power benefits
41
Future research: efficient algorithms
42
Future research: architectures

Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
Some other potential applications
Image processing
Cameras: variety of compression algorithms
Biomedical applications
Hearing aids: DSP running on body heat
Sensor networks
Compression of data before transmission

Quote Gene Frantz, TI Fellow
43
Conclusions