Title: An Instruction-Level Power Model for the XScale Architecture
1An Instruction-Level Power Model for the XScale
Architecture
- By Hailemelekot Seifu and
- Rishi Ranjan
- Real-Time Systems
2BACKGROUND I
- Previous Work by Tiwari and others
- Software is a major component of power consumption in real-time and non-real-time applications
- Therefore, software must be analyzed in terms of power consumption to minimize the energy needs of total systems
3BACKGROUND II
- The lowest level of hardware analysis is the logic gate
- The instruction in software parallels the logic gate in hardware. Power analysis should be at this basic level.
- Most work previous to Tiwari was hardware-based
4BACKGROUND III
- Instruction level power model
- Used to quantify the fundamental information needed to evaluate the power cost of a program.
- Can be used by compilers, code generators and code schedulers targeted for low power.
- Does not require knowledge of lower-level details of the processor.
5IMPORTANCE
- Low Power Consumption
- Prolongs life of embedded systems that have low battery/power stores.
- Requires less maintenance and upkeep
- Saves money
6Applications
- Instruction-Level Power Models
- can give total power consumption information on a software system before it is deployed on a power-constrained system
- allow creation of software optimization techniques at code compilation and generation stages
7Project Goals
- Perform an instruction-level power analysis of the Intel XScale-based architecture (processor core only).
- Create a power estimator that takes a program's instruction trace as input and determines its power consumption
- Evaluate the model and estimator using power measurements
8Hardware I
- ADI BRH 80200
- Intel XScale (400-733MHz) 80200 CPU
- 128MB DRAM.
- Dual Intel EEpro/100 (10/100) ethernet ports.
- Oscilloscope
- A Tektronix oscilloscope was used for current readings. It allows time granularity of 5 ms and current granularity of 5 mA.
9HARDWARE II
[Diagram: measurement setup showing the power source, the oscilloscope, and the CPU]
10ARM ISA I
- Six Instruction Classes were distinguished
- Branch: B, BL, BX, BLX
- Data Processing: ADD, ADC, BIC, CMP
- Multiply: MUL, MLA, SMLAL, UMLAL
- Load/Store: LDR, LDMDA, STR, STMDA
- Miscellaneous: MRS, MSR, MRC, MCR
- No operation: NOP
11ARM ISA II
- Branch instruction readings were not performed due to the difficulty of creating programs that would accurately isolate the current caused by branches
- All readings were taken on top of the Linux operating system
12Measurement Methods I
- Power = Current × Voltage
- Energy = Power × Time
- Time = Number of processor cycles × Clock Period
- Energy = Current × Voltage × Time
- Voltage is constant for a battery (XScale supports voltage scaling)
- Measure Current
- Measure Time
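The relations on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the 1.5 V supply voltage in the usage lines is a hypothetical value, not one taken from the measurements. The 426 mA figure is the ADC base reading and 400 MHz is the low end of the stated clock range.

```python
def execution_time_s(cycles: int, clock_hz: float) -> float:
    """Time = number of processor cycles x clock period (1 / clock frequency)."""
    return cycles / clock_hz


def energy_joules(current_a: float, voltage_v: float,
                  cycles: int, clock_hz: float) -> float:
    """Energy = current x voltage x time, with voltage assumed constant."""
    return current_a * voltage_v * execution_time_s(cycles, clock_hz)


# Usage: 426 mA average current, HYPOTHETICAL 1.5 V supply,
# 400 million cycles at 400 MHz (one second of execution).
example_energy = energy_joules(0.426, 1.5, 400_000_000, 400e6)
print(f"{example_energy:.3f} J")
```

Because voltage is held constant, only the current and the execution time need to be measured, which is why the readings in the later slides are reported in mA.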
13Measurement Methods II
- Energy consumption has three factors
- Base Costs
- the energy consumed by the basic processing of each instruction
- Inter-Instruction Effects
- the energy costs due to the change in circuit state when two consecutive, different instructions are executed
- Cache misses and stalls
- the energy effect of instruction/data cache misses or pipeline stalls due to resource constraints
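The three factors can be combined into a single expression, in the general form used by instruction-level power models such as Tiwari's. The notation below is our own assumption and does not appear on the slides: B_i is the base cost of instruction i, N_i is the number of times it executes, O_{i,j} is the inter-instruction overhead for the consecutive pair (i, j), N_{i,j} is the number of times that pair occurs, and E_k is the energy contribution of each cache miss or stall event k.

```latex
E_{\text{program}} \;=\; \sum_{i} B_i \, N_i
  \;+\; \sum_{(i,j)} O_{i,j} \, N_{i,j}
  \;+\; \sum_{k} E_k
```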
14Measurement Methods III
- Base Cost Calculation
- Base instruction cost is measured and the given
current reading is taken as an average - The programs contain the instruction to be
measured repeated in a loop. - Avoid stalls and cache misses
- Overcome the effect of jump instruction at bottom
- Contradictory requirements
15Base Cost Program Example
- .global main
- main:
- ADC R0,R1,R3
- ADC R1,R1,R3
- ADC R2,R1,R3
- ADC R3,R1,R3
- ADC R4,R1,R3
- ADC R0,R1,R3
- ADC R1,R1,R3
- (repeat for 200 lines)
- B main
- Cache size is 32 KB, so 200 lines avoids cache misses
- With 200 lines, the branch instruction at the end will have an insignificant effect on the measured current.
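A test program of this shape is tedious to write by hand, so it could be generated. The sketch below is an assumption about how one might do that, not the tooling used in the project. The function names are ours, it only handles three-operand instructions of the form shown on the slide, and the cache check assumes 4-byte ARM instructions.

```python
def base_cost_program(mnemonic: str = "ADC", lines: int = 200,
                      dest_regs: int = 5) -> str:
    """Emit a loop of `lines` copies of one instruction, cycling the
    destination register through R0..R(dest_regs-1) as on the slide."""
    body = [f"    {mnemonic} R{i % dest_regs},R1,R3" for i in range(lines)]
    return "\n".join([".global main", "main:"] + body + ["    B main", ""])


def fits_in_icache(lines: int, icache_bytes: int = 32 * 1024,
                   bytes_per_instr: int = 4) -> bool:
    """True if the loop body plus the closing branch fits in the
    instruction cache, so the loop runs without cache misses."""
    return (lines + 1) * bytes_per_instr <= icache_bytes


program_text = base_cost_program("ADC", 200)
print(program_text.splitlines()[2])   # first instruction of the loop body
print(fits_in_icache(200))
```

The two requirements pull in opposite directions: more lines dilute the effect of the closing branch, while fewer lines keep the loop safely inside the cache. At 200 lines the loop occupies well under 1 KB of the 32 KB cache.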
16Measurement Methods IV
- Inter-Instruction Effect
- These effects occur mostly during the fetch stage of the execution pipeline, when the hardware circuit state changes significantly due to the setup of two different instructions
- The programs are generally a loop of two alternating instructions
17Inter-Instruction Program Example
- ADD R0,R10,R11
- MLA R1,R2,R10,R11
- ADD R0,R10,R11
- MLA R1,R2,R10,R11
- Expected current: (405 + 508)/2 = 456.5 mA
- Measured current: 567 mA
- Difference: 567 - 456.5 = 110.5 mA
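The overhead calculation for an alternating pair can be written as a small helper. This is a minimal sketch with a function name of our own choosing. It assumes both instructions contribute equally to the average, as the slide's expected-current calculation does.

```python
def inter_instruction_overhead_ma(base_a_ma: float, base_b_ma: float,
                                  measured_pair_ma: float) -> float:
    """Overhead = measured current of the alternating loop minus the
    average of the two base costs (the current expected with no
    circuit-state effect)."""
    expected_ma = (base_a_ma + base_b_ma) / 2
    return measured_pair_ma - expected_ma


# Usage with the ADD/MLA readings quoted on this slide.
add_mla_overhead = inter_instruction_overhead_ma(405, 508, 567)
print(add_mla_overhead)
```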
18Measurement Methods V
- Resource Constraint Effects
- Can occur for various reasons
- Most likely is due to register dependencies among consecutive instructions
- Run a program in a loop with repeated occurrences of re-used destination registers
19Resource Constraint Example
- .global main
- main:
- ADD R0,R6,R7
- ADD R0,R6,R7
- ADD R0,R6,R7
- ADD R0,R6,R7
- (repeat for 200 lines)
- B main
- 200 lines avoids cache misses
- Same destination register (ADD R0,R6,R7) causes pipeline stalls
20Measurement Methods VI
- To create cache miss effects
- Run programs with initial instructions that
invalidate all cache entries.
21Cache Miss Program Example
- .global main
- main:
- MCR P15,0,R1,C7,C5,0
- MRC P15,0,R0,C2,C0,0
- MOV R0,R0
- SUB PC,PC,#4
- ADC R0,R1,R3
- ADC R0,R1,R3
- ADC R0,R1,R3
- (repeat for 8000 lines)
- B main
- 32 KB instruction cache
- MCR invalidates the cache
- 8000 lines to cause all possible cache misses
- 8000 lines will compensate for the time we wait for cache invalidation
- MCR is a privileged instruction, so the code is written as kernel modules
22Base Cost Results
Instruction   Base Reading (mA)
ADC           426
ADD           406
AND           405
BIC           401
CLZ           400
CMN           400
CMP           474
EOR           401
MOV           398
RSB           503
LDMDA         606
SWP           537
NOP           380
23Inter-Instruction Results
Instruction Pair   Reading (mA)
ADC_LDMDA          471
CLZ_STR            530
BIC_MSR            460
STMDA_NOP          495
24Cache Miss and Stall Results
Instruction   Cache Reading (mA)
ADC           456
MLA           350
SWP           393

Instruction   Stall Reading (mA)
BIC           384
MVN           386
TEQ           383
25Power Model Analysis I
Base Cost
26Power Model Analysis II
Inter-Instruction
27Power Model Analysis III
Cache Miss
28Base Cost
29Base Cost Observations
- Instructions in the same class have similar base costs
- Can be used to group instructions together and assign an average power cost.
- Profiler optimization.
- Assumes insignificant change due to variations in arguments.
- Same base cost for an instruction
- Inter-instruction cost calculated between groups of instructions.
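Grouping by class and averaging could look like the following sketch. The readings are the base-cost results reported earlier. The class assignment for each instruction is our assumption, since the slides list only a few examples per class. In particular, placing CLZ with data processing and SWP with load/store is a guess.

```python
from collections import defaultdict
from statistics import mean

# Base-cost readings in mA, from the base cost results.
BASE_MA = {
    "ADC": 426, "ADD": 406, "AND": 405, "BIC": 401, "CLZ": 400,
    "CMN": 400, "CMP": 474, "EOR": 401, "MOV": 398, "RSB": 503,
    "LDMDA": 606, "SWP": 537, "NOP": 380,
}

# ASSUMED class membership (hypothetical mapping for illustration).
CLASS_OF = {op: "data_processing" for op in
            ("ADC", "ADD", "AND", "BIC", "CLZ", "CMN", "CMP", "EOR", "MOV", "RSB")}
CLASS_OF.update({"LDMDA": "load_store", "SWP": "load_store", "NOP": "nop"})


def class_average_ma(base_ma, class_of):
    """Average the base readings of all instructions in each class."""
    groups = defaultdict(list)
    for op, reading in base_ma.items():
        groups[class_of[op]].append(reading)
    return {cls: mean(vals) for cls, vals in groups.items()}


class_avg = class_average_ma(BASE_MA, CLASS_OF)
for cls, avg in class_avg.items():
    print(f"{cls}: {avg:.1f} mA")
```

With one average per class, a profiler needs a lookup table with a handful of entries rather than one per opcode, and inter-instruction costs only have to be measured between classes.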
30Inter-Instruction
31Inter-Instruction Observations
- Circuit-state overhead was insignificant for x86 (Tiwari's work).
- Our measurements show that it IS significant compared to our base costs.
- A power profiler should include this overhead for the ARM architecture.
32Resource Constraint Penalty
33Observations
- There is limited variation for different instructions
- Profiler would assign average power to all stalls
- Profiler would assign the penalty due to a single occurrence multiplied by the total number of occurrences in a program
34Current Estimator I
- The software estimator will
- take as input a program's instruction trace, cache miss, and stall information
- use the above instruction power model and
- output the corresponding average current for that program
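A minimal sketch of such an estimator is shown below, restricted to base costs and inter-instruction overhead. It is not the project's implementation: the function name, the trace format (a list of opcode strings), and the simplifying assumption that every instruction occupies the same number of cycles are all ours. Cache miss and stall terms are left out.

```python
from typing import Dict, FrozenSet, Sequence


def estimate_average_current_ma(
    trace: Sequence[str],
    base_ma: Dict[str, float],
    overhead_ma: Dict[FrozenSet[str], float],
    default_overhead_ma: float = 0.0,
) -> float:
    """Average current over a trace: sum of base costs plus the overhead
    of every adjacent pair of different instructions, divided by the
    trace length (ASSUMES equal cycles per instruction)."""
    if not trace:
        raise ValueError("trace must contain at least one instruction")
    total = sum(base_ma[op] for op in trace)
    for prev, cur in zip(trace, trace[1:]):
        if prev != cur:
            total += overhead_ma.get(frozenset((prev, cur)), default_overhead_ma)
    return total / len(trace)


# Usage with the ADD/MLA readings from the inter-instruction example.
base = {"ADD": 405.0, "MLA": 508.0}
pair_overhead = {frozenset(("ADD", "MLA")): 567.0 - (405.0 + 508.0) / 2}
alternating = ["ADD", "MLA"] * 100
estimate = estimate_average_current_ma(alternating, base, pair_overhead)
print(f"{estimate:.1f} mA")
```

Keying the overhead table on an unordered pair treats ADD followed by MLA the same as MLA followed by ADD. That matches how the alternating-loop measurements are taken, but it is a modelling choice rather than something the measurements establish.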
35Current Estimator II
- Estimator's Design
- SimpleScalarARM (Umich) was used as a starting point since it could give an instruction trace most similar to the ARM architecture
- Cache miss and stall information would be obtained from a per-process Performance Counter facility for the ADI boards.
36Current Estimator III
- Several Challenges Emerged
- Performance Counter facility is not working
- SimpleScalar's cache miss functionality was also not working with the chosen benchmarks
- So
- cache miss and stall information could not be included in the evaluation phase.
- Base cost and inter-instruction effects can still be evaluated with SimpleScalar's instruction trace
37Benchmarks
- MIBench (Umich)
- is a benchmark suite for embedded systems
- has several sub-categories depending on the system's use, such as telecommunications, automotive, and consumer
- was chosen because it is an embedded suite and is freely available
38Estimator Results I
39Estimator II
- Results are very skewed due to the lack of cache miss and stall information
- Other benchmarks were challenging to compile for ARM/SimpleScalar; we are working on other benchmarks
- Proper handling of floating point instructions needs to be added
40Necessary Work - LOTS!
- Include Cache miss and stall information
- the power profiler can't be accurate without it
- Inclusion of Branch instruction in model
- branch instructions are used for function calls, etc.
- Inclusion of instruction operands in power model
- due to time constraints we only looked at opcodes
- Improvement of Estimator performance
- using better data structures, opcode search
41Necessary Work II
- Proper analysis of Stall information
- Modelling of Load/Store inter-instruction effects
- Data cache accesses, etc
42QUESTIONS