Programmable processors for wireless basestations - PowerPoint PPT Presentation

Provided by: Srid

Learn more at: http://www.ece.rice.edu

Transcript and Presenter's Notes

Title: Programmable processors for wireless basestations

1
Programmable processors for wireless base-stations

Sridhar Rajagopal
(sridhar_at_rice.edu)
December 11, 2003

2
Wireless rates vs. clock rates
[Figure: clock rates (200 MHz, 4 GHz) compared with wireless data rates (9.6 Kbps, 1 Mbps, 2-10 Mbps, 54-100 Mbps)]

Need to process 100X more bits per clock cycle
today than in 1996

3
Base-stations need horsepower
Sophisticated signal processing for multiple users
Need 100s-1000s of arithmetic operations to process 1 bit
Base-stations require > 100 ALUs

4
Power efficiency and flexibility
Power efficiency implies not wasting power; it does not imply low power

Wireless systems getting harder-to-design
Evolving standards, compatibility issues
More base-stations per unit area
operational and maintenance costs
Flexibility provides power-efficiency
Base-stations rarely operate at full capacity
Varying users, data rates, spreading, modulation,
coding
Adapt resources to needs

5
Thesis addresses the following problem

I want to design programmable processors for
wireless base-stations with 100s of ALUs
(a) map wireless algorithms onto these processors
(b) make them power-efficient (adapt resources to needs)
(c) decide the number of ALUs and the clock frequency

How programmable? As programmable as possible

6
Choice: Stream processors

Single processors won't do
ILP, subword parallelism not sufficient
Register file explosion with increasing ALUs
Multiprocessors
Data parallelism in wireless systems
SIMD (vector) processors appropriate
Stream processors: media processing
Share characteristics with wireless systems
Shown potential to support 100s-1000s of ALUs
Cycle-accurate simulator and compiler tools available

7
Thesis contributions

(a) Mapping algorithms on stream processors
designing data-parallel algorithm versions
tradeoffs between packing, ALU utilization and
memory
reduced inter-cluster communication network
(b) Improve power efficiency in stream processors
adapting compute resources to workload variations
varying voltage and frequency to real-time
requirements
(c) Design exploration between ALUs and clock
frequency to minimize power consumption
fast real-time performance prediction

8
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

9
Wireless workloads
[Figure: wireless workloads over time, 1996 to 2004?]

10
Key kernels studied for wireless

FFT: Media processing
QRD: Media processing
Outer product updates
Matrix-vector operations
Matrix-matrix operations
Matrix transpose
Viterbi decoding
LDPC decoding

11
Characteristics of wireless

Compute-bound
Finite precision
Limited temporal data reuse
Streaming data
Data parallelism
Static, deterministic, regular workloads
Limited control flow

12
Parallelism levels in wireless systems

int i, a[N], b[N], sum[N];        // 32 bits
short int c[N], d[N], diff[N];    // 16 bits, packed
for (i = 0; i < 1024; i++) {
    sum[i] = a[i] + b[i];
    diff[i] = c[i] - d[i];
}
Instruction Level Parallelism (ILP) - DSP
Subword Parallelism (MMX) - DSP
Data Parallelism (DP) - Vector Processor
DP can decrease by increasing ILP and MMX
Example: loop unrolling
[Figure: the loop annotated with its DP, ILP and MMX]
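A minimal sketch of the trade-off on this slide, assuming a plain C target with no vector instructions. Unrolling the loop by two gives two independent adds per iteration (ILP), and the two 16-bit subtractions are done in one 32-bit word (subword parallelism), so only n/2 independent iterations remain as DP. The names sub16x2 and sum_diff_unrolled are illustrative and not from the slides; the packed subtract is the well-known lane-masking trick, standing in for a real MMX-style instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Subtract two 16-bit lanes packed in one 32-bit word; each lane wraps on its own. */
static uint32_t sub16x2(uint32_t x, uint32_t y)
{
    const uint32_t H = 0x80008000u;               /* top bit of each lane */
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H); /* no borrow crosses lanes */
}

/* Unrolled by two: two independent adds per iteration (ILP) and one packed
   subtract covering two 16-bit elements (subword parallelism). Only n/2
   independent iterations are left as data parallelism. */
void sum_diff_unrolled(const int32_t *a, const int32_t *b, int32_t *sum,
                       const int16_t *c, const int16_t *d, int16_t *diff, int n)
{
    assert(n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        uint32_t cw, dw, rw;
        sum[i]     = a[i] + b[i];
        sum[i + 1] = a[i + 1] + b[i + 1];
        memcpy(&cw, &c[i], sizeof cw);   /* load c[i], c[i+1] as one word */
        memcpy(&dw, &d[i], sizeof dw);   /* load d[i], d[i+1] as one word */
        rw = sub16x2(cw, dw);
        memcpy(&diff[i], &rw, sizeof rw);
    }
}
```

The lane order inside the 32-bit word depends on endianness, but the subtraction is lane-wise, so the stored result is the same either way.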
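
13
Stream Processors: multi-cluster DSPs
[Figure: a VLIW DSP (1 cluster, exploiting ILP and MMX) compared with a stream processor, in which memory feeds a Stream Register File (SRF) and multiple clusters exploit DP]
Adapt clusters to DP
Identical clusters, same operations
Power down unused FUs, clusters

14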
Programming model
Communication
Computation
"Your new hardware won't run your old software"
(Balch's law)
15
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

16
Viterbi needs odd-even grouping

Exploiting Viterbi DP in SWAPs
Use Register exchange (RE) instead of regular
traceback
Re-order ACS, RE
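A minimal sketch of register-exchange Viterbi decoding, assuming a small rate-1/2, constraint length 3 code with generators 7 and 5 (octal) and hard decisions; the slides do not give the code parameters, and all function names here are illustrative. Each butterfly reads the even/odd state pair (2j, 2j+1) and writes states j and j + NSTATES/2, which is the access pattern that calls for odd-even grouping. Each state carries its full decoded sequence, so there is no traceback pass.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

enum { K = 3, NSTATES = 1 << (K - 1) };  /* constraint length 3 -> 4 states */

/* Two coded bits for encoder state s and input bit b (generators 7 and 5, octal). */
static unsigned branch_output(unsigned s, unsigned b)
{
    unsigned reg = (b << (K - 1)) | s;                   /* newest bit on top */
    unsigned g0 = (reg ^ (reg >> 1) ^ (reg >> 2)) & 1u;  /* taps 111 */
    unsigned g1 = (reg ^ (reg >> 2)) & 1u;               /* taps 101 */
    return (g0 << 1) | g1;
}

/* Encode the low nbits of msg (LSB first), starting from state 0. */
void conv_encode(uint32_t msg, int nbits, uint8_t *sym)
{
    unsigned s = 0;
    for (int t = 0; t < nbits; ++t) {
        unsigned b = (msg >> t) & 1u;
        sym[t] = (uint8_t)branch_output(s, b);
        s = (b << (K - 2)) | (s >> 1);
    }
}

/* Hard-decision Viterbi decoding with register exchange (no traceback). */
uint32_t viterbi_re_decode(const uint8_t *sym, int nbits)
{
    int metric[NSTATES], next_metric[NSTATES];
    uint32_t path[NSTATES] = {0}, next_path[NSTATES];
    int best = 0;

    assert(nbits > 0 && nbits <= 32);
    for (int s = 0; s < NSTATES; ++s)
        metric[s] = (s == 0) ? 0 : INT_MAX / 2;  /* encoder starts in state 0 */

    for (int t = 0; t < nbits; ++t) {
        for (unsigned j = 0; j < (unsigned)NSTATES / 2; ++j) {  /* one butterfly */
            for (unsigned b = 0; b < 2; ++b) {
                unsigned ns = (b << (K - 2)) | j;  /* j or j + NSTATES/2 */
                unsigned x0 = branch_output(2 * j, b) ^ sym[t];
                unsigned x1 = branch_output(2 * j + 1, b) ^ sym[t];
                /* add: Hamming distance of the 2-bit symbol */
                int m0 = metric[2 * j] + (int)((x0 & 1u) + (x0 >> 1));
                int m1 = metric[2 * j + 1] + (int)((x1 & 1u) + (x1 >> 1));
                /* compare-select */
                unsigned win = (m1 < m0) ? 2 * j + 1 : 2 * j;
                next_metric[ns] = (m1 < m0) ? m1 : m0;
                /* register exchange: copy the winner's bits, append this bit */
                next_path[ns] = path[win] | ((uint32_t)b << t);
            }
        }
        for (int s = 0; s < NSTATES; ++s) {
            metric[s] = next_metric[s];
            path[s] = next_path[s];
        }
    }
    for (int s = 1; s < NSTATES; ++s)
        if (metric[s] < metric[best])
            best = s;
    return path[best];
}
```

The per-state registers are 32 bits wide only to keep the sketch short; a real decoder would bound them by the decoding depth.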

17
Performance of Viterbi decoding
Ideal C64x (w/o co-proc) needs 200 MHz for
real-time
18
Patterns in inter-cluster comm

Intercluster comm network fully connected
Structure in access patterns can be exploited
Broadcasting
Matrix-vector multiplication, matrix-matrix
multiplication, outer product updates
Odd-even grouping
Transpose, Packing, Viterbi decoding

19
Odd-even grouping

Packing
overhead when input and output precisions are
different
Not always beneficial for performance
Odd-even grouping required for bringing data to
right cluster
Matrix transpose
Better done in ALUs than in memory
Shown to have an order-of-magnitude better
performance
Done in ALUs as repeated odd-even groupings

20
Odd-even grouping
0 1 2 3 4 5 6 7 → 0 2 4 6 1 3 5 7
Inter-cluster communication
Entire chip length
Limits clock frequency
Limits scaling
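The permutation above, and the claim that a matrix transpose can be done as repeated odd-even groupings, can be sketched as follows. This is an illustrative scalar version, assuming a row-major matrix whose column count is a power of two; the function names are not from the slides. One grouping moves the lowest index bit to the top, so log2(cols) groupings send element (r, c) from index r*cols + c to index c*rows + r.

```c
#include <assert.h>
#include <string.h>

/* Odd-even grouping: even-indexed elements first, then odd-indexed ones. */
void odd_even_group(const int *in, int *out, int n)
{
    assert(n % 2 == 0);
    for (int i = 0; i < n / 2; ++i) {
        out[i]         = in[2 * i];
        out[n / 2 + i] = in[2 * i + 1];
    }
}

/* Transpose a rows x cols row-major matrix in m by applying odd-even
   grouping log2(cols) times. cols must be a power of two; tmp is scratch
   space of rows * cols elements. */
void transpose_by_grouping(int *m, int *tmp, int rows, int cols)
{
    int n = rows * cols;
    assert(rows > 0 && cols > 0 && (cols & (cols - 1)) == 0);
    for (int step = cols; step > 1; step >>= 1) {
        odd_even_group(m, tmp, n);
        memcpy(m, tmp, (size_t)n * sizeof *m);
    }
}
```

For a 2 x 4 matrix holding 0..7, two groupings give 0 4 1 5 2 6 3 7, which is the 4 x 2 transpose.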
21
A reduced inter-cluster comm network
only nearest neighbor interconnections
22
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

23
Flexibility needed in workloads
3G workload varies from 1 GOPs for 4 users with
constraint length 7 Viterbi to 23 GOPs for 32 users with
constraint length 9 Viterbi
24
Flexibility affects DP
U - Users, K - constraint length, N - spreading
gain, R - decoding rate
25
When DP changes

4 → 2 clusters
Data not in the right SRF banks
Overhead in bringing data to the right banks
Via memory
Via inter-cluster communication network

26
Adapting clusters to Data Parallelism
[Figure: SRF connected through an adaptive multiplexer network to four clusters (C), shown in four configurations: no reconfiguration, 4 → 2 reconfiguration, 4 → 1 reconfiguration, all clusters off]
Unused clusters are turned off using voltage gating to eliminate
static and dynamic power dissipation

27
Cluster utilization variation
[Figure: cluster utilization vs. cluster index]
Cluster utilization variation on a 32-cluster
processor; (32, 9) = 32 users, constraint length
9 Viterbi

28
Frequency variation on 32 clusters
29
Operation

Dynamic Voltage-Frequency scaling when system
changes significantly
Users, data rates
Coarse time scale (every few seconds)
Turn off clusters
when parallelism changes significantly
Memory operations
Exceed real-time requirements
Finer time scales (100s of microseconds)

30
Power: Voltage Gating and Scaling
Power can change from 12.38 W to 300 mW (40x
savings) depending on workload changes
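A rough sketch of why gating clusters and scaling voltage and frequency together can give savings of this size. It assumes the textbook dynamic-power relation P = C V^2 f, a supply voltage that tracks frequency linearly, and gated clusters that draw no power. None of these constants or names come from the slides, and the sketch does not reproduce the 12.38 W and 300 mW figures.

```c
#include <assert.h>

/* Hypothetical first-order power model; all fields are assumed values. */
typedef struct {
    double cap_per_cluster;  /* switched capacitance per cluster, farads */
    double v_max;            /* supply voltage at f_max, volts */
    double f_max;            /* maximum clock frequency, hertz */
} power_model;

/* Dynamic power of the active clusters at clock f. Gated clusters are
   assumed to contribute nothing, and voltage is assumed to scale
   linearly with frequency. */
double cluster_power(const power_model *pm, int active_clusters, double f)
{
    double v;
    assert(active_clusters >= 0 && f >= 0.0 && f <= pm->f_max);
    v = pm->v_max * (f / pm->f_max);
    return active_clusters * pm->cap_per_cluster * v * v * f;
}
```

Under these assumptions, halving both the active clusters and the clock cuts power by 2 x 8 = 16x, so a drop of about 40x is plausible when the workload shrinks on both axes.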
31
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

32
Deciding ALUs vs. clock frequency

No independent variables
Clusters, ALUs, frequency, voltage (c,a,m,f)
Trade-offs exist
How to find the right combination for lowest
power!

33
Static design exploration
also helps in quickly predicting real-time
performance
34
Sensitivity analysis important

We have a capacitance model [Khailany2003]
Not all equations are exact
Need to see how variations affect solutions

35
Design exploration methodology

3 types of parallelism: ILP, MMX, DP
For best performance (power)
Maximize the use of all
Maximize ILP and MMX at expense of DP
Loop unrolling, packing
Schedule on sufficient number of
adders/multipliers
If DP remains, set clusters = DP
No other way to exploit that parallelism

36
Setting clusters, adders, multipliers

If sufficient DP, linear decrease in frequency
with clusters
Set clusters depending on DP and execution time
estimate
To find adders and multipliers,
Let compiler schedule algorithm workloads across
different numbers of adders and multipliers and
let it find execution time
Put all numbers in power equation
Compare increase in capacitance due to added ALUs
and clusters with benefits in execution time
Choose the solution that minimizes the power
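The selection step above can be sketched as a search over candidate designs. This is a minimal illustration, assuming power scales as capacitance times frequency cubed (P = C V^2 f with voltage proportional to frequency) and that the real-time frequency is the compiler-reported cycle count divided by the deadline. The struct, its fields, and the function names are hypothetical, and the capacitance values would come from a model such as the one cited on the sensitivity-analysis slide.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate design; field names are illustrative. */
typedef struct {
    int clusters, adders, multipliers;
    double cycles;       /* execution cycles reported by the compiler schedule */
    double capacitance;  /* relative switched capacitance from the model */
} design_point;

/* Relative power at the lowest frequency that still meets the deadline.
   Assumes P ~ C * f^3, i.e. voltage scaled in proportion to frequency. */
double relative_power(const design_point *d, double deadline_s)
{
    double f = d->cycles / deadline_s;  /* real-time clock frequency */
    return d->capacitance * f * f * f;
}

/* Index of the candidate with the lowest relative power. */
size_t pick_min_power(const design_point *pts, size_t n, double deadline_s)
{
    size_t best = 0;
    assert(n > 0 && deadline_s > 0.0);
    for (size_t i = 1; i < n; ++i)
        if (relative_power(&pts[i], deadline_s) <
            relative_power(&pts[best], deadline_s))
            best = i;
    return best;
}
```

Because each candidate needs only a schedule length and a capacitance estimate, the whole search is a handful of multiplications per design point.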

37
Design exploration for clusters (c)
[Figure labels: DP, ILP]

For a sufficiently large number of
adders and multipliers per cluster
Explore Algorithm 1: 32 clusters
Explore Algorithm 2: 64 clusters
Explore Algorithm 3: 64 clusters
Explore Algorithm 4: 16 clusters

38
Clusters: frequency and power
32 clusters at frequency 836.692 MHz (p = 1)
64 clusters at frequency 543.444 MHz (p = 2)
64 clusters at frequency 543.444 MHz (p = 3)
3G workload

39
ALU utilization with frequency
3G workload
40
Choice of adders and multipliers
41
Exploration results

Final design conclusion
Clusters: 64
Multipliers/cluster: 1
Multiplier utilization: 62%
Adders/cluster: 3
Adder utilization: 55%
Real-time frequency: 568.68 MHz for 128 Kbps/user
Exploration done in seconds.

42
Outline

Background
Wireless systems
Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work

43
Broader impact

Results not specific to base-stations
High performance, low power system designs
Concepts can be extended to handsets
Mux network applicable to all SIMD processors
Power efficiency in scientific computing
Results 2, 3 applicable to all stream
applications
Design and power efficiency
Multimedia, MPEG,

44
Future work

Don't believe the model is the reality
(Proof is in the pudding)
Fabrication needed to verify concepts
Cycle accurate simulator
Extrapolating models for power
LDPC decoding (in progress)
Sparse matrix requires permutations over large
data
Indexed SRF may help
3G requires 1 GHz at 128 Kbps/user
4G equalization at 1 Mbps breaks down (expected)

45
Need for new architectures, definitions and
benchmarks

Road ends - conventional architectures [Agarwal2000]
Wide range of architectures: DSP, ASSP, ASIP,
reconfigurable, stream, ASIC, programmable
Difficult to compare and contrast
Need new definitions that allow comparisons
Wireless workloads
Typically ASIC designs
SPEC benchmark needed for programmable designs

46
Conclusions

Utilizing 100-1000s ALUs/clock cycle and mapping
algorithms not easy in programmable architectures
Data parallel algorithms need to be designed and
mapped
Power efficiency needs to be provided
Design exploration needed to decide ALUs to meet
real-time constraints
My thesis lays the initial foundations
