Flexible wireless communication architectures

1
Flexible wireless communication architectures
  • Sridhar Rajagopal
  • Department of Electrical and Computer Engineering
  • Rice University, Houston TX
  • Faculty Candidate Seminar, University of Rochester
  • March 31, 2003

This work has been supported in part by Nokia,
TI, TATP and NSF
2
Future wireless devices
  • High data rate mobile devices with multimedia
  • Multiple antennas w/ complex algorithms, GOPs of
    computation
  • Area-Time-Power constraints
  • Seamless connection across environments and
    standards
  • Use the fastest and cheapest available service

Wireless Cellular
Bluetooth/ Home Networks
Wireless LAN
3
Change in flexibility requirements
No change (already flexible)
Maximum change (needs to support multiple
environments, algorithms and standards)
4
Challenges faced in achieving this goal
  • Long time-to-market
  • Algorithm research
  • Implementation issues on current architectures
  • Architecture research
  • Ad-hoc design methodology for architecture
    designs
  • ASICs
  • DSPs
  • Heterogeneous
  • Reconfigurable

5
Research vision
  • Architecture design methodology to explore
  • Flexibility: support a variety of sophisticated
    algorithms
  • High performance: GOPs of computation (Mbps)
  • Low power: < 500 mW
  • Algorithms
  • Need efficient algorithms for mapping to
    architectures

6
My contributions: Algorithms
  • Multi-user estimation [Jnl. of VLSI Sig. Proc. '02,
    ASAP '00]
  • Matrix inversions, high numerical stability
  • Numerical techniques
  • Conjugate-gradient descent for complexity
    reduction
  • Multi-user detection [ISCAS '01]
  • Block-based computation to streaming computations
  • Pipelining, lower memory requirements
  • Parallel, fixed-point, streaming VLSI
    implementations [Trans. Wireless Comm. '02]

7
My contributions: Architectures
  • Heterogeneous system designs [ICSPAT '00]
  • Computer arithmetic [Symp. on Comp. Arith. '01]
  • Dynamic truncation in ASICs using on-line
    arithmetic
  • Ph.D. Thesis
  • Design methodology to explore
    flexibility-power-performance tradeoffs
  • Scalable Wireless Application-specific Processors
    (SWAPs)

8
SWAP design methodology
[Flow diagram, box labels: Chain of receiver algorithms; Low complexity, parallel, fixed point; Flexibility-performance tradeoffs; High level language implementation; Architecture exploration; FPGA, customized, reconfigurable, heterogeneous designs; ASIC design; Scalable programmable architecture design (SWAPs). The label "learn" appears twice.]
9
Benefits of this approach
  • Provides a framework to explore
  • algorithms
  • flexible, high performance, low power
    architectures (SWAPs)
  • Understanding of both algorithms and ASICs used
    for better SWAP designs
  • Flexibility-performance trade-off with increasing
    customization in SWAPs
  • Inter-disciplinary research
  • Wireless communications, VLSI Signal Processing,
    Computer architecture, Computer arithmetic, CAD,
    Compilers

10
Talk Outline
  • SWAPs framework
  • SWAP Concept demonstration
  • Algorithm design
  • Application-specific architecture design
  • Current and Future Research Goals

11
DSP solutions
  • Current DSPs
  • Not enough functional units (FUs) for GOPs of
    computation
  • TI C6x DSP has 8 FUs -- Need 100s of FUs
  • Not low power enough!!
  • Cannot extend to more FUs
  • Limited Instruction Level Parallelism (ILP)
  • Limited Subword Parallelism (MMX)
  • Cannot support more registers (area,ports)
  • Compilers find it difficult to extract ILP as FUs increase

12
Solution: SWAPs
  • Exploit data parallelism (DP)
  • Available in many wireless algorithms
  • This is what ASICs do!!
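  • Example:
  • int i, a[N], b[N], c[N]; // 32 bits
  • short int d[N], e[N], f[N]; // 16 bits packed
  • for (i = 0; i < 1024; i++) {
  •   a[i] = b[i] + c[i]; // operator assumed
  •   d[i] = e[i] + f[i]; // operator assumed
  • }

Parallelism in this loop: DP (across loop iterations), ILP (the two independent statements), Subword (packed 16-bit data)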
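
The example loop on this slide combines data, instruction-level and subword parallelism. As a minimal sketch of the subword part only, the C below adds two packed 16-bit values with a single 32-bit addition. The function names (`pack16x2`, `add16x2`, `add_packed`) are hypothetical, the `+` operator is an assumption, and unsigned 16-bit lanes are used instead of the slide's `short int` so that wrap-around is well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word:
   lo in bits 0-15, hi in bits 16-31. */
static uint32_t pack16x2(uint16_t lo, uint16_t hi) {
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Add both 16-bit lanes with one 32-bit add. The top bit of each lane
   is masked off before the add so no carry can cross into the other
   lane, then restored with an XOR. Each lane wraps modulo 2^16. */
static uint32_t add16x2(uint32_t x, uint32_t y) {
    uint32_t sum = (x & 0x7FFF7FFFu) + (y & 0x7FFF7FFFu);
    return sum ^ ((x ^ y) & 0x80008000u);
}

/* d[i] = e[i] + f[i] for n 16-bit elements, two elements per add.
   n must be even in this sketch. */
static void add_packed(uint16_t *d, const uint16_t *e,
                       const uint16_t *f, size_t n) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        uint32_t s = add16x2(pack16x2(e[i], e[i + 1]),
                             pack16x2(f[i], f[i + 1]));
        d[i]     = (uint16_t)(s & 0xFFFFu);
        d[i + 1] = (uint16_t)(s >> 16);
    }
}
```

Data parallelism would come from running different values of `i` on different clusters, and ILP from issuing the 32-bit and 16-bit statements of the loop body together; neither is modeled here.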
13
SWAPs: stream processors for wireless
  • Kernels (computation) and streams (communication)
  • Operations on kernels use local data in clusters
  • Streams expose data parallelism
  • Imagine stream processor at Stanford

[Figure labels: Input Data (received signal); kernels: Correlator, channel estimation, Matched filter, Interference Cancellation, Viterbi decoding; Output Data (Decoded bits)]
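
The kernel and stream split described on this slide can be sketched in a few lines of C. This is a rough model under stated assumptions, not the Imagine or SWAP programming interface: a stream is a plain array, a kernel is a function applied to each element, and `run_kernel`, `scale2` and `add1` are hypothetical names. Elements are handled in batches of `clusters`, standing in for all clusters executing the same operation on their own element.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel: a pure computation on one stream element. */
typedef int (*Kernel)(int);

/* Apply kernel k to the input stream and write the output stream.
   The inner loop stands for work done by the clusters in lockstep.
   Returns the number of batches, a rough proxy for time steps. */
static size_t run_kernel(Kernel k, const int *in, int *out,
                         size_t n, size_t clusters) {
    assert(clusters > 0);
    size_t batches = 0;
    for (size_t base = 0; base < n; base += clusters) {
        for (size_t c = 0; c < clusters && base + c < n; c++) {
            out[base + c] = k(in[base + c]); /* one element per cluster */
        }
        batches++;
    }
    return batches;
}

/* Two toy kernels that can be chained through an intermediate stream. */
static int scale2(int x) { return 2 * x; }
static int add1(int x)   { return x + 1; }
```

Chaining `scale2` and `add1` through an intermediate array mirrors how the output stream of one receiver kernel feeds the next, and increasing `clusters` reduces the batch count for the same stream length.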
14
DSP vs. SWAPs
[Figure labels: Stream Register File (SRF); DSP (1 cluster); SWAPs (max. clusters)]
SWAP clusters are identical and execute the same operations. Power down unused FUs and clusters.
15
Arithmetic clusters
  • FUs (+, *, /)
  • Scratch-pad (Sp)
  • Indexed accesses
  • Comm. unit (CU)
  • Intercluster comm.
  • Distributed reg. Files
  • more FUs

[Figure: arithmetic cluster. Labels: From/To SRF, SRF, Cross Point, Local Register File, FUs (including /), Sp, CU, Intercluster Network]
16
Talk Outline
  • SWAPs framework
  • SWAP Concept demonstration
  • Algorithm design
  • Application-specific architecture design
  • Current and Future Research Goals

17
Physical layer of wireless receivers
Receiver more complex than transmitter
18
Algorithms for
  • Multiple antenna systems (MIMO systems)
  • Complexity exponential in the number of transmit
    and receive antennas
  • Wide range of extremely complex algorithms
  • Optimal choice depends on fading, mobility,
    bandwidth, antennas
  • GOPs of computations
  • Estimation: linear MMSE, blind, conjugate
    gradient
  • Detection: FFT, (blind) interference
    cancellation
  • Decoding: Viterbi, Turbo, LDPC
  • Implement ALL of them AND the NEXT one in line
  • Use the best one for the situation
  • Example for concept demonstration: Viterbi
    decoding

19
Parallel Viterbi Decoding
[Figure: Detected bits, ACS Unit, Traceback Unit, Decoded bits]
  • Add-Compare-Select (ACS): trellis interconnect
  • Parallelism depends on constraint length
    (states)
  • Conventional Traceback
  • Sequential (No DP)
  • Difficult to implement in parallel architecture
  • Use Register Exchange (RE)
  • parallel solution
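
To make the ACS and register-exchange steps concrete, here is a minimal sketch of a hard-decision Viterbi decoder. Everything specific in it is an assumption rather than something taken from the slides: a small rate-1/2 code with constraint length K = 3 and generators 7 and 5 (octal), a Hamming branch metric, blocks of at most 64 bits, and the function names. The point it illustrates is that each state's add-compare-select and its survivor-register copy are independent of the other states, so the loop over states is the data-parallel part, and no sequential traceback is needed at the end.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define NSTATES 4u /* K = 3 gives 2^(K-1) = 4 trellis states */

/* Encoder output (g0 << 1 | g1) for input bit u in state s, where
   s = (u[t-1] << 1) | u[t-2]. Generators 7 and 5 octal (assumed). */
static unsigned branch_output(unsigned s, unsigned u) {
    unsigned s1 = (s >> 1) & 1u, s0 = s & 1u;
    return ((u ^ s1 ^ s0) << 1) | (u ^ s0);
}

/* Convolutional encoder: one 2-bit symbol per input bit. */
static void conv_encode(const uint8_t *bits, int n, uint8_t *symbols) {
    unsigned s = 0;
    for (int t = 0; t < n; t++) {
        unsigned u = bits[t] & 1u;
        symbols[t] = (uint8_t)branch_output(s, u);
        s = (u << 1) | (s >> 1);
    }
}

/* Hamming distance between two 2-bit symbols. */
static unsigned hamming2(unsigned a, unsigned b) {
    unsigned x = (a ^ b) & 3u;
    return (x & 1u) + (x >> 1);
}

/* Viterbi decoding with register exchange: each state carries the
   decoded bit history of its survivor path in reg[]. */
static void viterbi_re(const uint8_t *symbols, int n, uint8_t *decoded) {
    assert(n > 0 && n <= 64);
    unsigned metric[NSTATES] = {0, UINT_MAX / 2, UINT_MAX / 2, UINT_MAX / 2};
    uint64_t reg[NSTATES] = {0, 0, 0, 0};

    for (int t = 0; t < n; t++) {
        unsigned new_metric[NSTATES];
        uint64_t new_reg[NSTATES];
        /* Independent per state: this loop is the data-parallel part. */
        for (unsigned ns = 0; ns < NSTATES; ns++) {
            unsigned u  = ns >> 1;         /* input bit leading into ns */
            unsigned p0 = (ns & 1u) << 1;  /* the two predecessor states */
            unsigned p1 = p0 | 1u;
            /* Add */
            unsigned m0 = metric[p0] + hamming2(branch_output(p0, u), symbols[t]);
            unsigned m1 = metric[p1] + hamming2(branch_output(p1, u), symbols[t]);
            /* Compare and select */
            unsigned best = (m1 < m0) ? p1 : p0;
            new_metric[ns] = (m1 < m0) ? m1 : m0;
            /* Register exchange: copy the survivor history, append u */
            new_reg[ns] = (reg[best] << 1) | u;
        }
        for (unsigned s = 0; s < NSTATES; s++) {
            metric[s] = new_metric[s];
            reg[s] = new_reg[s];
        }
    }

    unsigned best = 0;
    for (unsigned s = 1; s < NSTATES; s++) {
        if (metric[s] < metric[best]) best = s;
    }
    for (int t = 0; t < n; t++) {
        decoded[t] = (uint8_t)((reg[best] >> (n - 1 - t)) & 1u);
    }
}
```

The cost of register exchange is visible in `new_reg`: every state copies a full history on every step, which trades extra storage and data movement for the removal of the sequential traceback.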

20
Re-ordering for parallel Viterbi
DP
  • Exploiting Viterbi DP in SWAPs
  • Re-order ACS, RE

21
SWAP Algorithms Architecture
  • Algorithm design for parallelism
  • Architecture design?

22
SWAP design
  • Decide how many clusters
  • Exploit DP
  • Decide what to put within each cluster
  • Maximize ILP with high functional unit efficiency
  • Search design space with explore tool
  • See how it meets time-area-power constraints

23
Inside a SWAP cluster: EXPLORE
Auto-exploration of adders and multipliers for "ACS"
[Plot axes: (Adder FU, Multiplier FU)]
24
Explore tool benefits
  • Instruction count vs. functional unit efficiency
  • What goes inside each cluster
  • Explore all algorithms
  • turn off functional units not in use for given
    kernel
  • Design customized application-specific units
  • Better performance with increased FU utilization
  • Algorithm 1: 3 adders, 3 multipliers, 32
    clusters
  • Algorithm 2: 4 adders, 1 multiplier, 64 clusters
  • Architecture: 4 adders, 3 multipliers, 64 clusters
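
The three configurations listed on this slide are consistent with sizing the architecture as the per-resource maximum over the algorithms it must run. The sketch below assumes that rule; the `Config` type and the `cover` function are hypothetical names and not part of the explore tool.

```c
#include <assert.h>
#include <stddef.h>

/* Resources one algorithm needs: FUs per cluster and cluster count. */
typedef struct {
    int adders;
    int multipliers;
    int clusters;
} Config;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest single configuration that covers every algorithm:
   the element-wise maximum of the individual requirements. */
static Config cover(const Config *algs, size_t n) {
    assert(n > 0);
    Config arch = algs[0];
    for (size_t i = 1; i < n; i++) {
        arch.adders      = max_int(arch.adders, algs[i].adders);
        arch.multipliers = max_int(arch.multipliers, algs[i].multipliers);
        arch.clusters    = max_int(arch.clusters, algs[i].clusters);
    }
    return arch;
}
```

With the slide's two algorithms as input, `cover` yields 4 adders, 3 multipliers and 64 clusters. The gap between that result and each algorithm's own needs is what can be turned off while that algorithm runs.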

25
Saving Power
  • Turning off FUs
  • Easy
  • Use the right FUs for a statically scheduled
    algorithm
  • Turning off clusters
  • Not so easy
  • Each cluster does not have access to entire SRF
  • Need data from SRF of other clusters

26
Reconfiguration 1: Data transfer
[Figure labels: SRF, CU, Clusters]
  • Move data to appropriate clusters via comm units
  • Significant performance loss, additional SRF
    memory required
  • Can turn off SRF too!
27
Reconfiguration 2: Conditional streams
[Figure labels: Sp (one per cluster)]
  • Transfer data via comm unit (CU) and scratchpad
    (Sp)
  • Minimal loss in performance
  • Cannot turn off SRF, comm unit or scratchpad in
    clusters
28
Reconfiguration 3: Multiplexed buffers
  • Use mux-demux buffers
  • Minimal loss in performance
  • Can turn off clusters entirely: more power
    savings
29
Viterbi reconfiguration
DP
Can be turned OFF
Packet 1: Constraint length 7 (16 clusters)
Packet 2: Constraint length 9 (64 clusters)
Packet 3: Constraint length 5 (4 clusters)
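
The cluster counts for the three packets follow a simple pattern: a constraint length K code has 2^(K-1) trellis states, and each count on this slide equals that number divided by four. The sketch below assumes this 4-states-per-cluster ratio, which is inferred from the slide's numbers rather than stated in the talk, and a 64-cluster machine; the function and constant names are hypothetical.

```c
#include <assert.h>

enum {
    MAX_CLUSTERS = 64,       /* size of the machine in this sketch */
    STATES_PER_CLUSTER = 4   /* assumed ratio, inferred from the slide */
};

/* Number of trellis states for constraint length K. */
static int trellis_states(int K) {
    assert(K >= 1 && K <= 30);
    return 1 << (K - 1);
}

/* Clusters kept active for a packet with constraint length K,
   clamped to the range 1..MAX_CLUSTERS. */
static int active_clusters(int K) {
    int c = trellis_states(K) / STATES_PER_CLUSTER;
    if (c < 1) c = 1;
    if (c > MAX_CLUSTERS) c = MAX_CLUSTERS;
    return c;
}

/* Clusters that can be turned off for that packet. */
static int idle_clusters(int K) {
    return MAX_CLUSTERS - active_clusters(K);
}
```

Under these assumptions a K = 7 packet keeps 16 clusters active and leaves 48 idle, while a K = 5 packet leaves 60 of the 64 idle.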
30
Execution Time (cycles)
[Figure labels: Clusters; Memory; Kernels (Computation); No Data Memory accesses. Packet 1: 64-bit, rate ½, K = 7. Packet 2: K = 9. Packet 3: K = 5.]
31
Viterbi decoding: rate 1/2 at 128 Kbps, 10 MHz
[Plot: frequency needed to attain real-time (in MHz, log scale from 1 to 1000) versus number of clusters (log scale from 1 to 100). Curves for K = 5, K = 7 and K = 9. Annotations: DSP, Static architecture, SWAPs.]
Ideal C64x (w/o co-proc) needs 200 MHz for
real-time
32
SWAPs: Salient features
  • 1-2 orders of magnitude better than a
    single-processor DSP
  • Any constraint length: 10 MHz at 128 Kbps
  • Same code for all constraint lengths
  • no need to re-compile or load another code
  • as long as parallelism/cluster ratio is constant
  • Power savings due to dynamic cluster scaling

33
Expected SWAP power consumption
  • 64 clusters and 1 multiplier per cluster
  • 0.13 micron, 1.2 V
  • Peak active power: 9 mW at 1 MHz (DSP: 1 mW at
    1 MHz)
  • Area: 53.7 mm²
  • 10 MHz, 128 Kbps with reconfiguration (
    DSP 200mW)

Exploring the VLSI Scalability of Stream
Processors, Brucek Khailany et al, Proceedings of
the Ninth Symposium on High Performance Computer
Architecture, February 8-12, 2003, Anaheim,
California, USA, pp. 153-164
34
Flexibility vs. performance
  • Suitable for mobile devices?
  • SWAPs: 128 Kbps at 10-100 mW for Viterbi
  • What if we want to do better?
  • No special customization for the application
  • No application-specific units
  • Generic inter-cluster communication network
  • Overhead for extracting parallelism
  • SWAPs suitable for base-stations?
  • Why not? Power is not a primary constraint!

35
Multiuser Estimation-Detection-Decoding
Real-time target: 128 Kbps per user
Ideal C64x (w/o co-proc) needs 15 GHz for
real-time
36
Expected SWAP power: base-station
  • 32-user base-station with 3 multipliers per
    cluster and 64 clusters
  • 0.13 micron, 1.2 V
  • Peak Active Power 18.19 mW for 1 MHz
    (increased )
  • Area: 93.4 mm²
  • Total peak base-station power consumption:
  • 18.19 W at 1 GHz for 32 users at 128 Kbps/user
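
The base-station figure on this slide is the per-MHz figure scaled linearly with clock frequency: 18.19 mW at 1 MHz becomes 18.19 W at 1 GHz. A minimal sketch of that arithmetic follows, assuming peak active power stays proportional to frequency at fixed voltage; `peak_power_mw` is a hypothetical helper name.

```c
#include <assert.h>

/* Peak active power in mW, assuming it scales linearly with clock
   frequency at a fixed supply voltage. */
static double peak_power_mw(double mw_per_mhz, double freq_mhz) {
    assert(mw_per_mhz >= 0.0 && freq_mhz >= 0.0);
    return mw_per_mhz * freq_mhz;
}
```

Calling `peak_power_mw(18.19, 1000.0)` gives 18190 mW, matching the 18.19 W total. Applying the same rule to the handset configuration quoted earlier in the talk (9 mW at 1 MHz, run at 10 MHz) gives 90 mW, a derived figure that sits inside the 10-100 mW range quoted for Viterbi decoding.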

37
Talk Outline
  • SWAPs framework
  • SWAP Concept demonstration
  • Algorithm design
  • Application-specific architecture design
  • Current and Future Research Goals

38
Current research
  • SWAPs Completely flexible and general
  • How do we trade-off flexibility for better
    performance?
  • Handset SWAPs (H-SWAPs)

39
Handset SWAPs: H-SWAPs
  • Trade Data Parallelism for Task Pipelining
  • Design SWAPlets and customize each SWAPlet

SWAPs (max. clusters and reconfigure)
40
H-SWAPs: Potential advantages
Programmable solutions with increased
customization
  • DSPs: ILP, Subword
  • SWAPs: ILP, Subword, DP
  • H-SWAPs: ILP, Subword, DP, Task Pipelining, Custom FUs
Performance, Power benefits
41
Future research: efficient algorithms
42
Future research: architectures
  • Generalized framework and tools for evaluating
    algorithm-architecture and
    area-time-power-flexibility trade-offs
  • Some other potential applications
  • Image processing
  • Cameras: variety of compression algorithms
  • Biomedical applications
  • Hearing aids: DSP running on body heat
  • Sensor networks
  • Compression of data before transmission

Quote: Gene Frantz, TI Fellow
43
Conclusions
  • Need flexible architectures for future wireless
    devices
  • Higher data rates, lower power, more complex
    algorithms
  • Design methodology (SWAPs concept)
  • Flexibility vs. performance trade-offs
  • SWAPs
  • Exploit data parallelism like ASICs
  • 1-2 orders of magnitude better than DSPs
  • Turn off unused clusters and unused FUs for low
    power
  • H-SWAPs for better performance and power benefits