Title: Lecture 14 Software Design for Low-Power
1Lecture 14: Software Design for Low-Power
- Software dictates much of hardware activity
- Need software power estimation method
- Must optimize software at several levels of abstraction
- Ultimately involves hardware/software trade-offs
- Summary
- Michael L. Bushnell
- CAIP Center and WINLAB
- ECE Dept., Rutgers U., Piscataway, NJ
2Sources of Software Power Dissipation
- Memory system uses 1/10 to 1/4 of the power in portable computers
- Dominates in video processing
- System bus switching activity controlled by software
- ALU and FPU data paths need good scheduling to avoid pipeline stalls
- Control logic and clock: reduce power by using the shortest possible program to do the computation
- Software control of hardware
- Reduces power of idle components
- Controls power saving modes
3Memory Accesses Are Expensive
- Highly capacitive data / address / row / column decode / word data lines with high fanout
- Mapping of data structures to memory banks determines possibility of parallel word loads (which are more energy efficient)
- Memory access patterns greatly affect cache performance
- Cache more power efficient than main memory
- Closer to CPU and smaller than main memory
4Methods of Software Power Estimation for Code
- Lower level: use gate-level simulation power estimation on a gate-level description of the hardware running the code (most accurate)
- Higher level: look at frequency of each type of instruction / sequence (execution profile)
- Use lookup table
- Model ALUs, register files, etc.
- Use bus switching activity exclusively
- Must know bus architecture, opcodes, representative input data to program, mapping of code/data to address space
- Characterize instruction power empirically on real hardware
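The bus-switching style of estimate can be sketched as a toggle count over a bus trace. This is a minimal sketch under stated assumptions: the 16-bit bus width, the function name `bus_toggles`, and both address traces are hypothetical, and the count ignores line capacitance and supply voltage.

```python
def bus_toggles(trace, width=16):
    """Count bit transitions between consecutive values driven on a bus."""
    mask = (1 << width) - 1
    toggles = 0
    for prev, cur in zip(trace, trace[1:]):
        # Bits that differ between consecutive values are the lines that switch
        toggles += bin((prev ^ cur) & mask).count("1")
    return toggles

# Hypothetical address traces (assumed for illustration only)
sequential = [0x1000, 0x1001, 0x1002, 0x1003]    # walks one array in order
alternating = [0x1000, 0x2FFF, 0x1001, 0x2FFE]   # ping-pongs between two regions

print("sequential toggles: ", bus_toggles(sequential))
print("alternating toggles:", bus_toggles(alternating))
```

The same data touched in a different order produces very different switching activity, which is why the mapping of code and data to the address space has to be known for this kind of estimate.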
5Instruction Level Power Analysis
- For each instruction, need to measure
- Base power cost (independent of prior state)
- Prior state cost: needs large-scale analysis / simulation
- Pipeline stalls, buffer stalls, cache misses
- Circuit state effects due to localized processor state change from execution of instruction pair
- Example: ADD A B, C
- MULT C A, B
- Includes change of instruction bus state
- Switching of control lines
- ALU mode changes
- Routing costs to / from register file
6Instruction Analysis Example
- EP: overall energy cost of program
- Bi: base cost of instruction type i
- Oi,j: cost of instruction type i followed by type j
- Oi,j Oj,i
- Ek: costs of pipeline stalls and cache misses
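These definitions fit the commonly used instruction-level form EP = sum over i of (Bi x Ni) + sum over (i,j) of (Oi,j x Ni,j) + sum over k of Ek, where Ni and Ni,j are execution counts. The counts are an assumption here, as are the function name `program_energy` and every numeric cost below; real values would come from measurement on the target processor.

```python
from collections import Counter

def program_energy(trace, base, overhead, other_costs=()):
    """Estimate program energy from an executed instruction trace.

    trace       -- sequence of instruction type names, in execution order
    base        -- Bi: base cost per instruction type
    overhead    -- Oi,j: circuit-state cost for type i followed by type j
    other_costs -- Ek: stall and cache-miss costs, one entry per event
    """
    n_i = Counter(trace)                    # Ni: executions of each type
    n_ij = Counter(zip(trace, trace[1:]))   # Ni,j: occurrences of each adjacent pair
    e_base = sum(base[i] * n for i, n in n_i.items())
    # Pairs with no table entry are assumed to add no overhead
    e_pair = sum(overhead.get(pair, 0.0) * n for pair, n in n_ij.items())
    return e_base + e_pair + sum(other_costs)

# Hypothetical cost tables in arbitrary energy units
base = {"LOAD": 2.0, "ADD": 1.0, "MULT": 3.0}
overhead = {("LOAD", "ADD"): 0.25, ("ADD", "MULT"): 0.5, ("MULT", "ADD"): 0.5}
trace = ["LOAD", "ADD", "MULT", "ADD"]

total = program_energy(trace, base, overhead, other_costs=[4.0])
print("estimated energy:", total)
```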
7Detailed Example
- Simple DSP with 4 registers A, B, C, D
- Evaluate (x y) z
8Example Concluded
9Software Power Optimizations
- Select least expensive instructions / instruction sequences
- Minimize frequency or cost of memory access
- Use hardware power minimization features
- Algorithm choices
- Must map well to available hardware
- Be efficient for problem being solved
- Must maximize performance (parallel processing)
- Constraints
- Battery-powered system: minimize total energy dissipated
- When heat dissipation or reliability is important: minimize instantaneous / average power
10Algorithm Transformations
- Reduction operations use parallel hardware, but
run slowly
11Two Adders Available
12Four Adders Available
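The effect of adding adders to a sum reduction can be sketched as a cycle count. This is a minimal sketch, assuming each addition takes one cycle and that any two available values can be paired; the function name `reduction_cycles` and the operand count of 8 are hypothetical.

```python
def reduction_cycles(n_operands, n_adders):
    """Cycles needed to sum n_operands values with up to n_adders additions per cycle."""
    remaining = n_operands
    cycles = 0
    while remaining > 1:
        # At most one addition per independent pair of available values
        adds = min(n_adders, remaining // 2)
        remaining -= adds      # each addition replaces two values with one
        cycles += 1
    return cycles

for adders in (1, 2, 4):
    print(adders, "adder(s):", reduction_cycles(8, adders), "cycles")
```

The total number of additions stays at n - 1 in every case. The shorter schedule only creates slack, and that slack is what allows a slower clock and lower supply voltage.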
13Minimizing Memory Access Costs
- Memory causes power and performance bottleneck
- Minimize accesses needed by algorithm
- Minimize total memory size needed by algorithm
- Put memory accesses as close as possible to CPU
- Register, then cache, external RAM last
- Efficiently use memory bandwidth
- Use multiple-word parallel loads, not single word
loads
14Improve Loop Nesting and Operation Order
- B too large to store in registers, so memory transfers were used instead
- Loop rearrangement allowed intermediate B to stay in a general register
- Got rid of 2N memory accesses for B and the N memory locations that are no longer needed
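The slide's original code is not reproduced here, so the following is a sketch of the same kind of rearrangement under stated assumptions: two hypothetical loops that communicate through an array `b` are fused so the intermediate value lives in a local variable, standing in for a register. The arithmetic (`* 2`, `+ 1`) is made up for illustration.

```python
def two_loops(a):
    """Separate loops: the intermediate B is an N-element array in memory."""
    n = len(a)
    b = [0] * n
    c = [0] * n
    b_accesses = 0
    for i in range(n):
        b[i] = a[i] * 2        # N stores to B
        b_accesses += 1
    for i in range(n):
        c[i] = b[i] + 1        # N loads from B
        b_accesses += 1
    return c, b_accesses

def fused_loop(a):
    """Fused loop: the intermediate is a scalar, so B needs no memory at all."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        b = a[i] * 2           # stays in a local (stand-in for a register)
        c[i] = b + 1
    return c, 0

data = [3, 1, 4, 1, 5]
print(two_loops(data))
print(fused_loop(data))
```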
15Reduce Space Requirements
- C not needed afterward: reorder so that B can overwrite C
16Reduced Intermediate Value Storage
- Reduce storage for intermediate values AiM
- Reduced from N locations to 1
17Dual Memory Load Example
18Maximize Parallel Loads of Multiple Memory Words
- Dual word loads reduce energy by 47%
- 1st: maximize dual loads with memory allocation
- 2nd: combine memory accesses
- Cases compared: no dual load, dual load, dual load with parallel execution
19Minimize Memory Bandwidth
- Allocate registers to minimize memory references
- Cache blocking, loop unrolling: fix array computations so that blocks of the array are read only once
- Register blocking: eliminate redundant register loads
- Recurrence detection and optimization: use registers for values carried over from one recursion level to the next
- Compact multiple memory references into 1 reference
- 40% speedup on DEC Alpha, 25% on Motorola 88100
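Recurrence detection can be illustrated with a running sum, where each result depends on the previous one. This is a minimal sketch, assuming a hypothetical recurrence y[i] = y[i-1] + x[i]; the read counters model memory loads only and the function names are made up.

```python
def running_sum_from_memory(x):
    """Naive form: reloads y[i-1] from the array on every iteration."""
    if not x:
        return [], 0
    y = [0] * len(x)
    y[0] = x[0]
    reads = 1                      # load x[0]
    for i in range(1, len(x)):
        y[i] = y[i - 1] + x[i]     # loads y[i-1] and x[i]
        reads += 2
    return y, reads

def running_sum_in_register(x):
    """Optimized form: the carried value stays in a local (stand-in for a register)."""
    if not x:
        return [], 0
    y = [0] * len(x)
    carried = x[0]
    reads = 1                      # load x[0]
    y[0] = carried
    for i in range(1, len(x)):
        carried = carried + x[i]   # loads only x[i]
        reads += 1
        y[i] = carried
    return y, reads

values = [1, 2, 3, 4, 5]
print(running_sum_from_memory(values))
print(running_sum_in_register(values))
```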
20Cache Locking
- Lock data into cache
- Prevents memory reads/writes from going to main memory
- Real benefit: cache write hits not written through to main memory
- Read hits: no main memory reference even if cache were unlocked
- Fujitsu SPARClite: writing 0 drew 341 mA from power supply when cache unlocked, only 194 mA when cache locked
21Instruction Selection/Ordering
- Instruction packing
- Single instruction does both ALU operation and memory data transfer
- Much instruction overhead not duplicated when operations run in parallel
- Concurrent execution of integer and floating-point ops
- Easier to do in VLIW and superscalar architectures
- Reorder instructions to minimize circuit state effects
- Most significant for DSP units
- Accumulator spilling and mode switching are most sensitive
22Operand Swapping/Ordering
- Swapping minimizes switching of functional unit inputs
- Example: for x + 7 and y + 7, keep 7 on the same adder input
- Ordering most significant if commutative operands not treated symmetrically by hardware
- Example: Booth multiplier
- 2nd operand bit pattern determines additions and subtractions (called recoding weight)
- Put the operand with the lowest recoding weight in the 2nd operand position
- Saved 10-30% of the power in Lee's experiments
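The operand-ordering rule for a Booth multiplier can be sketched as follows. This is a minimal sketch, assuming radix-2 Booth recoding, where a nonzero recoded digit (an addition or subtraction) occurs wherever adjacent bits differ; the 16-bit width and both function names are hypothetical.

```python
def booth_recoding_weight(value, bits=16):
    """Number of nonzero digits in the radix-2 Booth recoding of a two's-complement value."""
    u = value & ((1 << bits) - 1)   # two's-complement bit pattern
    weight = 0
    prev = 0                        # implicit 0 to the right of the LSB
    for i in range(bits):
        bit = (u >> i) & 1
        if bit != prev:             # a 0->1 or 1->0 boundary needs an add or subtract
            weight += 1
        prev = bit
    return weight

def order_for_booth(a, b, bits=16):
    """Return (first, second) with the lower-weight operand in the 2nd position."""
    if booth_recoding_weight(a, bits) < booth_recoding_weight(b, bits):
        return b, a
    return a, b

print(booth_recoding_weight(7), booth_recoding_weight(5))
print(order_for_booth(7, 5))
```

The product is the same either way because multiplication is commutative. Only the number of add/subtract steps inside the multiplier changes.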
23Power Management
- Software can often control processor power-down
modes - User interfaces activity comes in bursts
- When system idle time exceeds threshold, likely
to continue to be idle start shutdown - Example SPARClite
- Power-down register masks/enables clock for
- SDRAM, DMA module, FPU, floating-point queues
- Example Hitachi SH3
- Standby mode CPU core stopped, peripheral
controller, bus controller, memory refresh
continue - Sleep mode Everything but real-time clock stops
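The idle-threshold policy can be sketched as a pass over a list of activity timestamps. This is a minimal sketch under stated assumptions: the function name `shutdown_intervals`, the event times, and the threshold are hypothetical, and wake-up latency and the energy cost of entering and leaving a power-down mode are ignored.

```python
def shutdown_intervals(event_times, threshold, end_time):
    """Return (start, end) intervals during which the system is powered down.

    Policy: once no activity has been seen for `threshold` time units,
    shut down and stay down until the next event (or end_time).
    """
    times = sorted(event_times) + [end_time]
    intervals = []
    for prev, nxt in zip(times, times[1:]):
        if nxt - prev > threshold:
            intervals.append((prev + threshold, nxt))
    return intervals

# Hypothetical bursty user-interface activity (arbitrary time units)
events = [0, 1, 2, 10, 11, 30]
off = shutdown_intervals(events, threshold=3, end_time=40)
off_time = sum(end - start for start, end in off)

print("powered-down intervals:", off)
print("total time powered down:", off_time)
```

A shorter threshold powers down more often but risks shutting down just before the next burst, so the threshold is a tuning choice that depends on the wake-up cost of the hardware.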
24More Examples
- Example: Intel 486SL
- System Management Mode entered by asynchronous interrupt
- Can enable, disable, switch between fast and slow clocks for CPU and ISA bus
- Example: PowerPC 603 and 604
- Dynamic power management removes clock from execution units (saves 8-16%)
- Static power management
- Doze: shuts off most function units, keeps bus snooping enabled to maintain data cache coherence
- Nap: shuts off bus snooping and sets wakeup timer, keeps phase-locked loop running to allow quick clock restart
- Sleep mode: also shuts off phase-locked loop
- Software control of power management better than pure hardware control: more information
25Automated Low-Power Code Generation
- High-level
- Graphical / textual languages available to describe DSP algorithms
- HYPER_LP uncovers parallelism and minimizes critical paths
- Allows data path supply voltage to be reduced
- MASAI reorganizes loops to minimize memory transfers and size
- DSP compiler technology must deal with small register set, irregular data paths, fully using parallel resources
26Su's Cold Scheduling Algorithm
- Allocate registers
- Pre-assemble: calculate target addresses, index symbol table, transform instructions to binary
- Schedule instructions with cold scheduling
- Post-assemble: complete assembly
- Reduced switching activity 20-30%, performance loss of 2-4%
- Lee et al. had a similar approach, but used an as-soon-as-possible packing of instructions
- Saved 26 to 73% energy compared with no instruction packing and no memory bank assignment optimization
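The scheduling step can be illustrated with a greedy list scheduler that picks, among the instructions whose dependencies are satisfied, the one whose binary encoding differs least from the previous instruction. This is a sketch in the spirit of cold scheduling, not Su's actual algorithm: the Hamming-distance cost, the 4-bit encodings, the tie-breaking rule, and all names are assumptions, and the dependency graph is assumed to be acyclic.

```python
def hamming(a, b):
    """Number of bit positions in which two instruction words differ."""
    return bin(a ^ b).count("1")

def cold_schedule(encodings, deps):
    """Greedy low-switching schedule.

    encodings -- binary instruction words, indexed by instruction number
    deps      -- maps an instruction to the set of instructions that must precede it
    """
    order, done, prev = [], set(), None
    while len(order) < len(encodings):
        ready = [i for i in range(len(encodings))
                 if i not in done and deps.get(i, set()) <= done]
        # Lowest switching cost wins; ties go to the earliest instruction
        nxt = min(ready, key=lambda i: (0 if prev is None else hamming(prev, encodings[i]), i))
        order.append(nxt)
        done.add(nxt)
        prev = encodings[nxt]
    return order

def total_toggles(encodings, order):
    """Instruction-bus bit toggles for a given execution order."""
    return sum(hamming(encodings[a], encodings[b]) for a, b in zip(order, order[1:]))

# Hypothetical 4-bit instruction encodings
words = [0b1111, 0b0000, 0b1110, 0b0001]
free_order = cold_schedule(words, deps={})
dep_order = cold_schedule(words, deps={2: {1}})   # instruction 2 needs the result of 1

print(free_order, total_toggles(words, free_order))
print(dep_order, total_toggles(words, dep_order))
```

Running this after the pre-assemble step matters because the switching cost is computed from the binary instruction words, which are only known once opcodes and addresses have been encoded.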
27Instruction Set Co-Design
- PEAS-I system
- Takes HDL for CPU and C compiler, assembler, and simulator for CPU
- Takes design constraints (chip area, power), hardware module database, sample application program, program data set
- Optimizes instruction set and implementations for application program and given data set
- Starts with core instructions needed for any C program
- Augments with instructions for C operators not already included as a single instruction
- Defines hardware, microprogram, and software implementations
- Estimates power and area of each instruction implementation
- Accounts for pipeline hazards
- Solves as an integer program using branch-and-bound search
28Instruction Set Design
- Huang and Despain system
- Optimizes instruction set for sample application, too
- Groups micro-operations (MOPS) together to form higher-level instructions
- Merge MOPS together as byproduct of scheduling
- MOPS must be scheduled to same clock cycle
- Constrain with instruction bit width, instruction set size, and hardware resources
- Solved by simulated annealing with objective of minimizing execution cycles and instruction set size
29Reconfigurable Computing
- Some of hardware interconnect and logic is modified at run time
- Can closely optimize to wide variety of applications
- Can reconfigure at gate level or at architectural level
- Usually implemented using FPGAs
- Gives software designer chance to tailor software and processor to fit each other
- Even after hardware design is fixed
30Memory System Considerations
- Feature
- Total size
- Bank partitioning
- Wide data bus
- Proximity to CPU
- Cache size
- Cache protocol
- Cache locking
- Low-Power Impact
- Code compaction, algorithm transformations
- Smaller, lower CL, faster
- Determined by application size, data structure
- Needed for parallel loads
- Reduces CL, makes memory use less costly
- Minimizes cache misses for application
- Need spatial and temporal locality
- Optimize for application
- Use to run part of application only from cache
31Architectural Considerations
- Processor class (DSP/RISC/CISC): application dependent
- Parallel processing (VLIW, superscalar, SIMD, MIMD): does parallel processing greatly improve performance so that reduced VDD and slower clock can be used?
- Bus architecture: separate address, data, instruction, I/O buses make it easier to optimize instructions, addressing, and data to minimize bus activity
- Register file size: eases register allocation, reduces memory accesses; too many registers increase power
32Power Management Considerations
- Software vs. hardware control: software control lets power management be tailored to application
- Granularity of clock/voltage removal: shutdown modes needed to benefit from minimized execution times
- Clock/voltage reduction: useful if there is schedule slack for a software task (power reduced during slack)
- Guarded evaluation: latch functional unit inputs to avoid meaningless calculation when not in use
33Summary
- Estimation of software contribution to power dissipation
- Minimize software power dissipation
- Choose best algorithm that is suited to hardware resources
- Minimize memory size and expensive memory accesses
- Algorithm transformations
- Efficient data mapping onto memory
- Optimal use of memory bandwidth, registers, cache
- Optimally use available parallelism for application
- Use hardware power management support
- Select instruction sequences to minimize switching in CPU and data path