Programmable processors for wireless basestations - PowerPoint PPT Presentation

Description:

Trying to use your cell phone during the blackout was nearly impossible. What went wrong? ... Schedule on sufficient number of adders/multipliers. If DP remains, ...

Slides: 47
Provided by: Srid
Learn more at: http://www.ece.rice.edu

1
Programmable processors for wireless base-stations
  • Sridhar Rajagopal
  • (sridhar_at_rice.edu)
  • December 11, 2003

2
Wireless rates vs. clock rates
[Figure: clock rates (200 MHz, 4 GHz) and wireless data rates (9.6 Kbps, 1 Mbps, 2-10 Mbps, 54-100 Mbps) over time]
  • Need to process 100X more bits per clock cycle
    today than in 1996

3
Base-stations need horsepower
  • Sophisticated signal processing for multiple users
  • Need 100-1000s of arithmetic operations to process 1 bit
  • Base-stations require > 100 ALUs

4
Power efficiency and flexibility
Power efficiency implies that power is not wasted; it does not imply low
power
  • Wireless systems getting harder-to-design
  • Evolving standards, compatibility issues
  • More base-stations per unit area
  • operational and maintenance costs
  • Flexibility provides power-efficiency
  • Base-stations rarely operate at full capacity
  • Varying users, data rates, spreading, modulation,
    coding
  • Adapt resources to needs

5
Thesis addresses the following problem
  • I want to design programmable processors for
    wireless base-stations with 100s of ALUs
  • (a) map wireless algorithms on these processors
  • (b) power-efficient (adapt resources to needs)
  • (c) decide ALUs, clock frequency

How programmable? As programmable as possible

6
Choice: Stream processors
  • Single processors won't do
  • ILP, subword parallelism not sufficient
  • Register file explosion with increasing ALUs
  • Multiprocessors
  • Data parallelism in wireless systems
  • SIMD (vector) processors appropriate
  • Stream processors: media processing
  • Share characteristics with wireless systems
  • Shown potential to support 100-1000s of ALUs
  • Cycle accurate simulator and compiler tools
    available

7
Thesis contributions
  • (a) Mapping algorithms on stream processors
  • designing data-parallel algorithm versions
  • tradeoffs between packing, ALU utilization and
    memory
  • reduced inter-cluster communication network
  • (b) Improve power efficiency in stream processors
  • adapting compute resources to workload variations
  • varying voltage and frequency to real-time
    requirements
  • (c) Design exploration between ALUs and clock
    frequency to minimize power consumption
  • fast real-time performance prediction

8
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

9
Wireless workloads
[Figure: wireless workloads over time, from 1996 to 2004 and beyond (?)]

10
Key kernels studied for wireless
  • FFT (media processing)
  • QRD (media processing)
  • Outer product updates
  • Matrix-vector operations
  • Matrix-matrix operations
  • Matrix transpose
  • Viterbi decoding
  • LDPC decoding

11
Characteristics of wireless
  • Compute-bound
  • Finite precision
  • Limited temporal data reuse
  • Streaming data
  • Data parallelism
  • Static, deterministic, regular workloads
  • Limited control flow

12
Parallelism levels in wireless systems
  • int i, a[N], b[N], sum[N];          // 32 bits
  • short int c[N], d[N], diff[N];      // 16 bits, packed
  • for (i = 0; i < 1024; i++) {
  •     sum[i] = a[i] + b[i];
  •     diff[i] = c[i] - d[i];
  • }
  • Instruction Level Parallelism (ILP) - DSP
  • Subword Parallelism (MMX) - DSP
  • Data Parallelism (DP) - Vector Processor
  • DP can be decreased by increasing ILP and MMX
  • Example: loop unrolling
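
A minimal sketch of the trade-off on this slide, assuming plain C on a generic machine rather than the stream processor's own tools. The function names (`add_sub`, `add_sub_unrolled4`, `sub16x2`) are illustrative and not from the talk. Unrolling by 4 leaves a quarter of the iterations (less DP) but exposes more independent operations per iteration (more ILP); `sub16x2` emulates what a single subword instruction would do on two packed 16-bit values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline loop: every iteration is independent, so DP equals n. */
void add_sub(size_t n, const int32_t *a, const int32_t *b, int32_t *sum,
             const int16_t *c, const int16_t *d, int16_t *diff)
{
    for (size_t i = 0; i < n; i++) {
        sum[i]  = a[i] + b[i];
        diff[i] = (int16_t)(c[i] - d[i]);
    }
}

/* Unrolled by 4: only n/4 iterations remain (less DP), but each one
 * exposes eight independent operations to the scheduler (more ILP). */
void add_sub_unrolled4(size_t n, const int32_t *a, const int32_t *b,
                       int32_t *sum, const int16_t *c, const int16_t *d,
                       int16_t *diff)
{
    assert(n % 4 == 0); /* this sketch has no remainder loop */
    for (size_t i = 0; i < n; i += 4) {
        sum[i]      = a[i]     + b[i];
        sum[i + 1]  = a[i + 1] + b[i + 1];
        sum[i + 2]  = a[i + 2] + b[i + 2];
        sum[i + 3]  = a[i + 3] + b[i + 3];
        diff[i]     = (int16_t)(c[i]     - d[i]);
        diff[i + 1] = (int16_t)(c[i + 1] - d[i + 1]);
        diff[i + 2] = (int16_t)(c[i + 2] - d[i + 2]);
        diff[i + 3] = (int16_t)(c[i + 3] - d[i + 3]);
    }
}

/* Emulates one subword subtract: two 16-bit lanes packed in a 32-bit
 * word are subtracted independently (no borrow crosses the lanes). */
uint32_t sub16x2(uint32_t x, uint32_t y)
{
    uint32_t lo = ((x & 0xFFFFu) - (y & 0xFFFFu)) & 0xFFFFu;
    uint32_t hi = ((x >> 16) - (y >> 16)) << 16;
    return hi | lo;
}
```

Both loop versions compute the same results; only the split between DP and ILP differs.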

13
Stream Processors: multi-cluster DSPs
[Figure: VLIW DSP (1 cluster) compared with a stream processor; labels: Memory, Stream Register File (SRF), ILP, MMX, DP]
  • Adapt clusters to DP
  • Identical clusters, same operations
  • Power down unused FUs and clusters

14
Programming model
[Figure: communication and computation]
"Your new hardware won't run your old software" (Balch's law)

15
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

16
Viterbi needs odd-even grouping
  • Exploiting Viterbi DP in SWAPs
  • Use Register exchange (RE) instead of regular
    traceback
  • Re-order ACS, RE

17
Performance of Viterbi decoding
Ideal C64x (w/o co-proc) needs 200 MHz for
real-time
18
Patterns in inter-cluster comm
  • Inter-cluster comm network is fully connected
  • Structure in access patterns can be exploited
  • Broadcasting
  • Matrix-vector multiplication, matrix-matrix
    multiplication, outer product updates
  • Odd-even grouping
  • Transpose, Packing, Viterbi decoding

19
Odd-even grouping
  • Packing
  • overhead when input and output precisions are
    different
  • Not always beneficial for performance
  • Odd-even grouping required for bringing data to
    right cluster
  • Matrix transpose
  • Better done in ALUs than in memory
  • Shown to have an order-of-magnitude better
    performance
  • Done in ALUs as repeated odd-even groupings

20
Odd-even grouping
0 1 2 3 4 5 6 7 → 0 2 4 6 1 3 5 7
Inter-cluster communication
  • Spans the entire chip length
  • Limits clock frequency
  • Limits scaling
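
The grouping, and the earlier point that a transpose can be done as repeated odd-even groupings, can be sketched in plain C. This is an illustration on an ordinary array, not the inter-cluster network implementation from the talk; the function names and the scratch buffer `tmp` are assumptions. For a row-major matrix whose column count is a power of two, applying the grouping log2(cols) times yields the transposed layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Odd-even grouping: even-indexed elements first, then odd-indexed.
 * tmp is caller-provided scratch space of n elements. */
void odd_even_group(int *x, int *tmp, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; i++) {
        tmp[i]         = x[2 * i];
        tmp[n / 2 + i] = x[2 * i + 1];
    }
    memcpy(x, tmp, n * sizeof *x);
}

/* Transpose a rows x cols row-major matrix by applying the grouping
 * log2(cols) times. cols must be a power of two. */
void transpose_by_grouping(int *m, int *tmp, size_t rows, size_t cols)
{
    assert(cols != 0 && (cols & (cols - 1)) == 0);
    for (size_t c = cols; c > 1; c /= 2)
        odd_even_group(m, tmp, rows * cols);
}
```

For example, the 2 x 4 matrix 0 1 2 3 / 4 5 6 7 needs two groupings and ends up stored as 0 4 1 5 2 6 3 7, which is the 4 x 2 transpose.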
21
A reduced inter-cluster comm network
only nearest neighbor interconnections
22
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

23
Flexibility needed in workloads
3G workload varies from 1 GOPs (4 users, constraint length 7 Viterbi)
to 23 GOPs (32 users, constraint length 9 Viterbi)
24
Flexibility affects DP
U - Users, K - constraint length, N - spreading
gain, R - decoding rate
25
When DP changes
  • 4 → 2 clusters
  • Data not in the right SRF banks
  • Overhead in bringing data to the right banks
  • Via memory
  • Via inter-cluster communication network
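
A small sketch of why data ends up in the wrong banks, under a loudly stated assumption: stream element i is stored in SRF bank i mod (number of active clusters). That interleaved layout and the helper names `bank_of` and `count_misplaced` are hypothetical and not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: element i of a stream lives in SRF bank i mod clusters. */
size_t bank_of(size_t i, size_t clusters)
{
    assert(clusters > 0);
    return i % clusters;
}

/* Counts the elements whose bank changes when the number of active
 * clusters goes from `from` to `to`; these must be moved, via memory
 * or via the inter-cluster communication network. */
size_t count_misplaced(size_t n, size_t from, size_t to)
{
    size_t moved = 0;
    for (size_t i = 0; i < n; i++)
        if (bank_of(i, from) != bank_of(i, to))
            moved++;
    return moved;
}
```

Under this layout, going from 4 to 2 clusters with 8 elements leaves elements 2, 3, 6 and 7 in banks that no longer match, so half of the data has to move.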

26
Adapting clusters to Data Parallelism
[Figure: SRF connected to the clusters (C) through an adaptive multiplexer network; configurations shown: no reconfiguration, 4 → 2 reconfiguration, 4 → 1 reconfiguration, all clusters off]
Unused clusters are turned off using voltage gating to eliminate
static and dynamic power dissipation

27
Cluster utilization variation
[Figure: cluster utilization vs. cluster index on a 32-cluster
processor; (32, 9) denotes 32 users, constraint length 9 Viterbi]

28
Frequency variation on 32 clusters
29
Operation
  • Dynamic Voltage-Frequency scaling when system
    changes significantly
  • Users, data rates
  • Coarse time scale (every few seconds)
  • Turn off clusters
  • when parallelism changes significantly
  • Memory operations
  • Exceed real-time requirements
  • Finer time scales (100s of microseconds)

30
Power: voltage gating and scaling
Power can change from 12.38 W to 300 mW (40x
savings) depending on workload changes
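
As a rough illustration of how gating and scaling combine, the sketch below uses the standard CMOS dynamic power relation, P proportional to C * V^2 * f. It is not the power model from the talk and does not reproduce the 12.38 W and 300 mW figures: it ignores static power and assumes, hypothetically, that switched capacitance scales with the fraction of active clusters. The function names are illustrative.

```c
#include <assert.h>

/* Dynamic power relative to a reference operating point:
 * P / P_ref = (C / C_ref) * (V / V_ref)^2 * (f / f_ref). */
double relative_dynamic_power(double cap_ratio, double volt_ratio,
                              double freq_ratio)
{
    assert(cap_ratio >= 0.0 && volt_ratio >= 0.0 && freq_ratio >= 0.0);
    return cap_ratio * volt_ratio * volt_ratio * freq_ratio;
}

/* ASSUMPTION: switched capacitance is proportional to the number of
 * clusters left on after gating (SRF and mux network are ignored). */
double gated_scaled_power(int active, int total, double volt_ratio,
                          double freq_ratio)
{
    assert(total > 0 && active >= 0 && active <= total);
    return relative_dynamic_power((double)active / total, volt_ratio,
                                  freq_ratio);
}
```

With half the clusters gated off and both voltage and frequency halved, this simple model gives 1/16 of the reference dynamic power.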
31
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

32
Deciding ALUs vs. clock frequency
  • No independent variables
  • Clusters, ALUs, frequency, voltage (c,a,m,f)
  • Trade-offs exist
  • How to find the right combination for lowest
    power!

33
Static design exploration
also helps in quickly predicting real-time
performance
34
Sensitivity analysis important
  • We have a capacitance model [Khailany2003]
  • All equations not exact
  • Need to see how variations affect solutions

35
Design exploration methodology
  • 3 types of parallelism: ILP, MMX, DP
  • For best performance (power)
  • Maximize the use of all
  • Maximize ILP and MMX at expense of DP
  • Loop unrolling, packing
  • Schedule on sufficient number of
    adders/multipliers
  • If DP remains, set clusters DP
  • No other way to exploit that parallelism

36
Setting clusters, adders, multipliers
  • If sufficient DP, linear decrease in frequency
    with clusters
  • Set clusters depending on DP and execution time
    estimate
  • To find adders and multipliers,
  • Let compiler schedule algorithm workloads across
    different numbers of adders and multipliers and
    let it find execution time
  • Put all numbers in power equation
  • Compare increase in capacitance due to added ALUs
    and clusters with benefits in execution time
  • Choose the solution that minimizes the power
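
The steps above can be sketched as a small exhaustive search. Everything numeric here is hypothetical: `schedule_cycles` is a made-up stand-in for the compiler's schedule, the capacitance expression is not the [Khailany2003] model, and the linear voltage-frequency relation is an assumption. Only the structure follows the slide: frequency falls linearly with clusters when DP is sufficient, all numbers go into a power equation, and the minimum-power combination is chosen.

```c
#include <assert.h>
#include <float.h>

typedef struct {
    int clusters, adders, multipliers;
    double freq_mhz; /* clock needed to meet the real-time budget */
    double power;    /* relative units, not watts */
} design_t;

/* HYPOTHETICAL stand-in for the compiler: cycles to run the workload
 * on one cluster, limited by the busier functional-unit type. */
double schedule_cycles(int adders, int multipliers)
{
    const double add_ops = 3.0e6, mul_ops = 1.0e6; /* made-up counts */
    assert(adders > 0 && multipliers > 0);
    double add_cycles = add_ops / adders;
    double mul_cycles = mul_ops / multipliers;
    return add_cycles > mul_cycles ? add_cycles : mul_cycles;
}

/* With sufficient DP, required frequency decreases linearly with
 * clusters. Cycles per microsecond equals MHz. */
double required_freq_mhz(int clusters, int adders, int multipliers)
{
    const double budget_us = 1000.0; /* made-up real-time budget */
    assert(clusters > 0);
    return schedule_cycles(adders, multipliers) / clusters / budget_us;
}

/* HYPOTHETICAL power equation: capacitance grows with ALUs and
 * clusters; voltage is assumed to rise linearly with frequency. */
double relative_power(int clusters, int adders, int multipliers)
{
    double f = required_freq_mhz(clusters, adders, multipliers);
    double cap = clusters * (1.0 + 0.2 * adders + 0.6 * multipliers)
               + 0.02 * clusters * clusters;
    double v = 0.6 + f / 1000.0;
    return cap * v * v * f;
}

/* Try power-of-two cluster counts up to the available DP and every
 * adder/multiplier count in range; keep the lowest-power design. */
design_t explore(int dp, int max_adders, int max_multipliers)
{
    design_t best = {0, 0, 0, 0.0, DBL_MAX};
    for (int c = 1; c <= dp; c *= 2)
        for (int a = 1; a <= max_adders; a++)
            for (int m = 1; m <= max_multipliers; m++) {
                double p = relative_power(c, a, m);
                if (p < best.power) {
                    design_t d = {c, a, m, required_freq_mhz(c, a, m), p};
                    best = d;
                }
            }
    return best;
}
```

Calling `explore(64, 5, 3)` searches cluster counts 1 to 64 with up to 5 adders and 3 multipliers per cluster; the answer depends entirely on the made-up constants, so only the procedure carries over.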

37
Design exploration for clusters (c)
  • For a sufficiently large number of
    adders and multipliers per cluster
  • Explore Algorithm 1: 32 clusters
  • Explore Algorithm 2: 64 clusters
  • Explore Algorithm 3: 64 clusters
  • Explore Algorithm 4: 16 clusters

38
Clusters: frequency and power
3G workload
  • 32 clusters at frequency 836.692 MHz (p = 1)
  • 64 clusters at frequency 543.444 MHz (p = 2)
  • 64 clusters at frequency 543.444 MHz (p = 3)

39
ALU utilization with frequency
3G workload
40
Choice of adders and multipliers
41
Exploration results
  • Final design conclusion
  • Clusters: 64
  • Multipliers/cluster: 1
  • Multiplier utilization: 62%
  • Adders/cluster: 3
  • Adder utilization: 55%
  • Real-time frequency: 568.68 MHz for 128
    Kbps/user
  • Exploration done in seconds.

42
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

43
Broader impact
  • Results not specific to base-stations
  • High performance, low power system designs
  • Concepts can be extended to handsets
  • Mux network applicable to all SIMD processors
  • Power efficiency in scientific computing
  • Results 2, 3 applicable to all stream
    applications
  • Design and power efficiency
  • Multimedia, MPEG,

44
Future work
  • Don't believe the model is the reality
  • (Proof is in the pudding)
  • Fabrication needed to verify concepts
  • Cycle accurate simulator
  • Extrapolating models for power
  • LDPC decoding (in progress)
  • Sparse matrix requires permutations over large
    data
  • Indexed SRF may help
  • 3G requires 1 GHz at 128 Kbps/user
  • 4G equalization at 1 Mbps breaks down (expected)

45
Need for new architectures, definitions and
benchmarks
  • Road ends - conventional architectures [Agarwal2000]
  • Wide range of architectures: DSP, ASSP, ASIP,
    reconfigurable, stream, ASIC, programmable
  • Difficult to compare and contrast
  • Need new definitions that allow comparisons
  • Wireless workloads
  • Typically ASIC designs
  • SPEC benchmark needed for programmable designs

46
Conclusions
  • Utilizing 100-1000s ALUs/clock cycle and mapping
    algorithms not easy in programmable architectures
  • Data parallel algorithms need to be designed and
    mapped
  • Power efficiency needs to be provided
  • Design exploration needed to decide ALUs to meet
    real-time constraints
  • My thesis lays the initial foundations