Title: Programmable processors for wireless basestations
1Programmable processors for wireless base-stations
- Sridhar Rajagopal
- (sridhar_at_rice.edu)
- December 11, 2003
2Wireless rates vs. clock rates
[Figure: clock rates (200 MHz, 4 GHz) plotted against wireless data rates (9.6 Kbps, 1 Mbps, 2-10 Mbps, 54-100 Mbps)]
- Need to process 100X more bits per clock cycle today than in 1996
3Base-stations need horsepower
- Sophisticated signal processing for multiple users
- Need 100-1000s of arithmetic operations to process 1 bit
- Base-stations require > 100 ALUs
4Power efficiency and flexibility
Power efficiency implies that power is not wasted; it does not imply low power.
- Wireless systems getting harder-to-design
- Evolving standards, compatibility issues
- More base-stations per unit area
- operational and maintenance costs
- Flexibility provides power-efficiency
- Base-stations rarely operate at full capacity
- Varying users, data rates, spreading, modulation, coding
- Adapt resources to needs
5Thesis addresses the following problem
- I want to design programmable processors for wireless base-stations with 100s of ALUs
- (a) map wireless algorithms on these processors
- (b) power-efficient (adapt resources to needs)
- (c) decide ALUs, clock frequency
How programmable? As programmable as possible.
6Choice: Stream processors
- Single processors won't do
- ILP, subword parallelism not sufficient
- Register file explosion with increasing ALUs
- Multiprocessors
- Data parallelism in wireless systems
- SIMD (vector) processors appropriate
- Stream processors: media processing
- Share characteristics with wireless systems
- Shown potential to support 100-1000s of ALUs
- Cycle-accurate simulator and compiler tools available
7Thesis contributions
- (a) Mapping algorithms on stream processors
- designing data-parallel algorithm versions
- tradeoffs between packing, ALU utilization and memory
- reduced inter-cluster communication network
- (b) Improve power efficiency in stream processors
- adapting compute resources to workload variations
- varying voltage and frequency to real-time requirements
- (c) Design exploration between ALUs and clock frequency to minimize power consumption
- fast real-time performance prediction
8Outline
- Background
- Wireless systems
- Stream processors
- Mapping algorithms to stream processors
- Power efficiency
- Design exploration
- Broad impact and future work
9Wireless workloads
[Figure: wireless workloads over time, from 1996 to 2004]
10Key kernels studied for wireless
- FFT (media processing)
- QRD (media processing)
- Outer product updates
- Matrix-vector operations
- Matrix-matrix operations
- Matrix transpose
- Viterbi decoding
- LDPC decoding
11Characteristics of wireless
- Compute-bound
- Finite precision
- Limited temporal data reuse
- Streaming data
- Data parallelism
- Static, deterministic, regular workloads
- Limited control flow
12Parallelism levels in wireless systems
    int i, a[N], b[N], sum[N];        // 32 bits
    short int c[N], d[N], diff[N];    // 16 bits, packed
    for (i = 0; i < 1024; i++) {
        sum[i] = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }
- Instruction Level Parallelism (ILP) - DSP
- Subword Parallelism (MMX) - DSP
- Data Parallelism (DP) - Vector Processor
- DP can be decreased by increasing ILP and MMX
- Example: loop unrolling
[Figure labels: DP, ILP, MMX]
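As a rough illustration of the loop-unrolling remark, the sketch below writes the same loop two ways. The function names `kernel_baseline` and `kernel_unrolled2` and the fixed-width types are assumptions made for this example and do not come from the slides. It only shows how unrolling halves the number of independent iterations left for DP while doubling the independent operations per iteration available for ILP.

```c
#include <stdint.h>

#define N 1024

/* Baseline: N independent iterations are available as data parallelism,
   with two independent operations (one add, one subtract) per iteration. */
void kernel_baseline(const int32_t *a, const int32_t *b, int32_t *sum,
                     const int16_t *c, const int16_t *d, int16_t *diff)
{
    for (int i = 0; i < N; ++i) {
        sum[i]  = a[i] + b[i];
        diff[i] = (int16_t)(c[i] - d[i]);
    }
}

/* Unrolled by two: only N/2 iterations remain for DP, but each iteration
   now exposes four independent operations that a VLIW scheduler can
   issue in parallel (more ILP). Assumes N is even. */
void kernel_unrolled2(const int32_t *a, const int32_t *b, int32_t *sum,
                      const int16_t *c, const int16_t *d, int16_t *diff)
{
    for (int i = 0; i < N; i += 2) {
        sum[i]      = a[i]     + b[i];
        sum[i + 1]  = a[i + 1] + b[i + 1];
        diff[i]     = (int16_t)(c[i]     - d[i]);
        diff[i + 1] = (int16_t)(c[i + 1] - d[i + 1]);
    }
}
```

Both functions compute the same result; only the split between ILP and DP differs.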
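13Stream processors: multi-cluster DSPs
[Figure: a VLIW DSP (1 cluster) exploiting ILP and MMX, next to a stream processor with memory, a Stream Register File (SRF), and multiple clusters exploiting DP]
- Adapt clusters to DP
- Identical clusters, same operations
- Power-down unused FUs, clusters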
14Programming model
[Figure: programming model separating communication from computation]
"Your new hardware won't run your old software" (Balch's law)
16Viterbi needs odd-even grouping
- Exploiting Viterbi DP in SWAPs
- Use register exchange (RE) instead of regular traceback
- Re-order ACS, RE
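To make the ACS and register-exchange terms concrete, here is a minimal hard-decision sketch. It assumes a rate-1/2 code with constraint length 3 and generators 7 and 5 (octal), chosen only to keep the example short; the workloads in this talk use constraint lengths 7 and 9. The names `conv_encode` and `viterbi_re` are hypothetical, and the sketch does not model the cluster mapping or the odd-even grouping. Each state keeps its decoded survivor bits in a register that is copied and extended at every add-compare-select (ACS) step, so no traceback pass is needed, and the inner loop over states is the data-parallel dimension.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_STATES 4u  /* 2^(K-1) states for the assumed K = 3 */

/* Two coded bits produced when the encoder leaves state s on input bit u.
   State layout: bit 1 = previous input, bit 0 = the input before that. */
static unsigned branch_output(unsigned s, unsigned u)
{
    unsigned b1 = (s >> 1) & 1u, b2 = s & 1u;
    unsigned o0 = u ^ b1 ^ b2;   /* generator 7 (octal) */
    unsigned o1 = u ^ b2;        /* generator 5 (octal) */
    return (o0 << 1) | o1;
}

/* Hamming distance between two 2-bit symbols. */
static unsigned hamming2(unsigned x, unsigned y)
{
    unsigned diff = x ^ y;
    return (diff & 1u) + ((diff >> 1) & 1u);
}

/* Encode n bits; coded[t] holds the 2-bit symbol for input bit t. */
void conv_encode(const uint8_t *bits, int n, uint8_t *coded)
{
    unsigned s = 0;
    for (int t = 0; t < n; ++t) {
        unsigned u = bits[t] & 1u;
        coded[t] = (uint8_t)branch_output(s, u);
        s = (u << 1) | (s >> 1);
    }
}

/* Decode n <= 32 received symbols. Returns the decoded bits packed in a
   word, with the first decoded bit in bit position n-1. */
uint32_t viterbi_re(const uint8_t *rx, int n)
{
    assert(n >= 1 && n <= 32);
    unsigned metric[NUM_STATES] = {0, 1000, 1000, 1000}; /* start in state 0 */
    uint32_t reg[NUM_STATES] = {0, 0, 0, 0};             /* survivor registers */

    for (int t = 0; t < n; ++t) {
        unsigned new_metric[NUM_STATES];
        uint32_t new_reg[NUM_STATES];
        for (unsigned ns = 0; ns < NUM_STATES; ++ns) {   /* independent per state */
            unsigned u  = ns >> 1;            /* input bit that leads into ns */
            unsigned p0 = (ns & 1u) << 1;     /* the two predecessor states */
            unsigned p1 = p0 | 1u;
            /* Add */
            unsigned m0 = metric[p0] + hamming2(branch_output(p0, u), rx[t]);
            unsigned m1 = metric[p1] + hamming2(branch_output(p1, u), rx[t]);
            /* Compare and select */
            unsigned best = (m1 < m0) ? p1 : p0;
            new_metric[ns] = (m1 < m0) ? m1 : m0;
            /* Register exchange: copy the survivor's bits, append u */
            new_reg[ns] = (reg[best] << 1) | u;
        }
        memcpy(metric, new_metric, sizeof metric);
        memcpy(reg, new_reg, sizeof reg);
    }

    unsigned best_state = 0;
    for (unsigned s = 1; s < NUM_STATES; ++s)
        if (metric[s] < metric[best_state])
            best_state = s;
    return reg[best_state];
}
```

Because the decoded bits travel with the path metrics, the register update can be scheduled together with the ACS step, which is one way to read "re-order ACS, RE".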
17Performance of Viterbi decoding
Ideal C64x (w/o co-proc) needs 200 MHz for real-time
18Patterns in inter-cluster comm
- Inter-cluster comm network is fully connected
- Structure in access patterns can be exploited
- Broadcasting
- Matrix-vector multiplication, matrix-matrix multiplication, outer product updates
- Odd-even grouping
- Transpose, packing, Viterbi decoding
19Odd-even grouping
- Packing
- Overhead when input and output precisions are different
- Not always beneficial for performance
- Odd-even grouping required for bringing data to right cluster
- Matrix transpose
- Better done in ALUs than in memory
- Shown to have an order-of-magnitude better performance
- Done in ALUs as repeated odd-even groupings
20Odd-even grouping
0 1 2 3 4 5 6 7 -> 0 2 4 6 1 3 5 7
Inter-cluster communication wires span the entire chip length, which limits clock frequency and limits scaling.
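The grouping and its use for transposition can be sketched in a few lines. This is an illustrative software model only: the function names `odd_even_group` and `transpose_by_grouping` are assumptions, the data sits in a flat array rather than in per-cluster registers, and the transpose assumes a row-major matrix whose column count is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One odd-even grouping pass: even-indexed elements first, then the
   odd-indexed ones. n must be even. */
void odd_even_group(const int *in, int *out, size_t n)
{
    assert(n % 2 == 0);
    size_t half = n / 2;
    for (size_t i = 0; i < half; ++i) {
        out[i]        = in[2 * i];
        out[half + i] = in[2 * i + 1];
    }
}

/* Transpose a rows x cols row-major matrix in place by applying
   log2(cols) odd-even grouping passes over the whole array. Each pass
   moves the lowest column-index bit to the top of the linear index, so
   after log2(cols) passes element (r, c) sits at index c * rows + r.
   cols must be a power of two; scratch must hold rows * cols elements. */
void transpose_by_grouping(int *data, int *scratch, size_t rows, size_t cols)
{
    assert(cols > 0 && (cols & (cols - 1)) == 0);
    size_t n = rows * cols;
    for (size_t c = cols; c > 1; c >>= 1) {
        odd_even_group(data, scratch, n);
        memcpy(data, scratch, n * sizeof *data);
    }
}
```

For eight elements, one pass maps 0 1 2 3 4 5 6 7 to 0 2 4 6 1 3 5 7, and a second pass over that result gives 0 4 1 5 2 6 3 7, which is the transpose of the 2 x 4 matrix.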
21A reduced inter-cluster comm network
only nearest neighbor interconnections
23Flexibility needed in workloads
3G workload varies from 1 GOPs (4 users, constraint length 7 Viterbi) to 23 GOPs (32 users, constraint length 9 Viterbi)
24Flexibility affects DP
U - Users, K - constraint length, N - spreading
gain, R - decoding rate
25When DP changes
- 4 -> 2 clusters
- Data not in the right SRF banks
- Overhead in bringing data to the right banks
- Via memory
- Via inter-cluster communication network
26Adapting clusters to Data Parallelism
[Figure: SRF connected to four clusters (C) through an adaptive multiplexer network, shown in four configurations: no reconfiguration, 4 -> 2 reconfiguration, 4 -> 1 reconfiguration, and all clusters off]
Unused clusters are turned off using voltage gating to eliminate static and dynamic power dissipation.
27Cluster utilization variation
[Figure: cluster utilization vs. cluster index on a 32-cluster processor; (32, 9) denotes 32 users, constraint length 9 Viterbi]
28Frequency variation on 32 clusters
29Operation
- Dynamic voltage-frequency scaling when system changes significantly
- Users, data rates
- Coarse time scale (every few seconds)
- Turn off clusters
- when parallelism changes significantly
- Memory operations
- Exceed real-time requirements
- Finer time scales (100s of microseconds)
30Power: voltage gating and scaling
Power can change from 12.38 W to 300 mW (40x savings) depending on workload changes
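A first-order way to see why gating clusters and scaling voltage and frequency compound is the standard CMOS dynamic-power relation, P proportional to C * V^2 * f. The sketch below is an assumption-laden illustration, not the power model behind the figures above: it treats switched capacitance as proportional to the number of active clusters, ignores static power and fixed overheads such as the SRF and multiplexer network, and the function name `relative_dynamic_power` is hypothetical.

```c
#include <assert.h>

/* Dynamic power of a configuration relative to the baseline with all
   clusters on at nominal voltage and frequency, using P ~ C * V^2 * f.
   v_scale and f_scale are ratios to the nominal voltage and frequency.
   Assumption: switched capacitance scales with the active cluster count. */
double relative_dynamic_power(int active_clusters, int total_clusters,
                              double v_scale, double f_scale)
{
    assert(total_clusters > 0);
    assert(active_clusters >= 0 && active_clusters <= total_clusters);
    double c_scale = (double)active_clusters / (double)total_clusters;
    return c_scale * v_scale * v_scale * f_scale;
}
```

For example, `relative_dynamic_power(16, 32, 0.5, 0.5)` models gating half the clusters while halving both voltage and frequency. The measured range quoted above comes from the full model used in the thesis and is not reproduced by this sketch.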
32Deciding ALUs vs. clock frequency
- No independent variables
- Clusters, ALUs, frequency, voltage (c,a,m,f)
- Trade-offs exist
- How to find the right combination for lowest power?
33Static design exploration
Also helps in quickly predicting real-time performance
34Sensitivity analysis important
- We have a capacitance model [Khailany2003]
- Not all equations are exact
- Need to see how variations affect solutions
35Design exploration methodology
- 3 types of parallelism: ILP, MMX, DP
- For best performance (power)
- Maximize the use of all
- Maximize ILP and MMX at expense of DP
- Loop unrolling, packing
- Schedule on sufficient number of adders/multipliers
- If DP remains, set clusters = DP
- No other way to exploit that parallelism
36Setting clusters, adders, multipliers
- If sufficient DP, linear decrease in frequency with clusters
- Set clusters depending on DP and execution time estimate
- To find adders and multipliers,
- Let compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find execution time
- Put all numbers in power equation
- Compare increase in capacitance due to added ALUs and clusters with benefits in execution time
- Choose the solution that minimizes the power
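The search over adders and multipliers described above could look roughly like the sketch below. Everything numeric here is a placeholder: the cycle counts would come from the compiler's schedules, the linear capacitance model (`cap_base`, `cap_adder`, `cap_mult`) is a stand-in for the real capacitance model, and the exponent `p` is read as the power-versus-frequency exponent suggested by the p values (1, 2, 3) listed with the cluster results for the 3G workload, which is an assumption. The names `design_point` and `explore_alus` are hypothetical.

```c
#include <assert.h>

typedef struct {
    int adders;         /* adders per cluster */
    int multipliers;    /* multipliers per cluster */
    double freq_mhz;    /* frequency needed to meet the real-time deadline */
    double power;       /* relative power, arbitrary units */
} design_point;

/* x raised to a small positive integer power (avoids needing libm). */
static double int_pow(double x, int p)
{
    double r = 1.0;
    while (p-- > 0)
        r *= x;
    return r;
}

/* cycles[(a-1) * max_mults + (m-1)] is the cycle count the compiler
   schedule needs per real-time frame with a adders and m multipliers.
   frame_us is the frame deadline in microseconds, so cycles / frame_us
   is the required frequency in MHz. Power is modeled as
   capacitance * frequency^p, with a linear capacitance placeholder. */
design_point explore_alus(const double *cycles, int max_adders, int max_mults,
                          double frame_us, double cap_base, double cap_adder,
                          double cap_mult, int p)
{
    assert(max_adders > 0 && max_mults > 0 && frame_us > 0.0 && p >= 1);
    design_point best = {0, 0, 0.0, 0.0};
    for (int a = 1; a <= max_adders; ++a) {
        for (int m = 1; m <= max_mults; ++m) {
            double f = cycles[(a - 1) * max_mults + (m - 1)] / frame_us;
            double cap = cap_base + cap_adder * a + cap_mult * m;
            double power = cap * int_pow(f, p);
            if (best.adders == 0 || power < best.power) {
                best.adders = a;
                best.multipliers = m;
                best.freq_mhz = f;
                best.power = power;
            }
        }
    }
    return best;
}
```

Adding an ALU only pays off when the drop in required frequency outweighs the added capacitance, which is the comparison the bullets above describe. Since the loop only evaluates a closed-form expression per candidate, a sweep like this is cheap once the schedules are available.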
37Design exploration for clusters (c)
[Figure labels: DP, ILP]
- For sufficiently large numbers of adders and multipliers per cluster
- Explore Algorithm 1: 32 clusters
- Explore Algorithm 2: 64 clusters
- Explore Algorithm 3: 64 clusters
- Explore Algorithm 4: 16 clusters
38Clusters: frequency and power
- 32 clusters at frequency 836.692 MHz (p = 1)
- 64 clusters at frequency 543.444 MHz (p = 2)
- 64 clusters at frequency 543.444 MHz (p = 3)
3G workload
39ALU utilization with frequency
3G workload
40Choice of adders and multipliers
41Exploration results
Final design conclusion:
- Clusters: 64
- Multipliers/cluster: 1
- Multiplier utilization: 62%
- Adders/cluster: 3
- Adder utilization: 55%
- Real-time frequency: 568.68 MHz for 128 Kbps/user
- Exploration done in seconds.
43Broader impact
- Results not specific to base-stations
- High performance, low power system designs
- Concepts can be extended to handsets
- Mux network applicable to all SIMD processors
- Power efficiency in scientific computing
- Results 2, 3 applicable to all stream applications
- Design and power efficiency
- Multimedia, MPEG,
44Future work
- Don't believe the model is the reality
- (Proof is in the pudding)
- Fabrication needed to verify concepts
- Cycle-accurate simulator
- Extrapolating models for power
- LDPC decoding (in progress)
- Sparse matrix requires permutations over large data
- Indexed SRF may help
- 3G requires 1 GHz at 128 Kbps/user
- 4G equalization at 1 Mbps breaks down (expected)
45Need for new architectures, definitions and benchmarks
- Road ends for conventional architectures [Agarwal2000]
- Wide range of architectures: DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable
- Difficult to compare and contrast
- Need new definitions that allow comparisons
- Wireless workloads
- Typically ASIC designs
- SPEC benchmark needed for programmable designs
46Conclusions
- Utilizing 100-1000s ALUs/clock cycle and mapping algorithms not easy in programmable architectures
- Data parallel algorithms need to be designed and mapped
- Power efficiency needs to be provided
- Design exploration needed to decide ALUs to meet real-time constraints
- My thesis lays the initial foundations