An Instruction-Level Power Model for the Xscale Architecture

1
An Instruction-Level Power Model for the Xscale
Architecture

By Hailemelekot Seifu and
Rishi Ranjan
Real-Time Systems

2
BACKGROUND I

Previous Work by Tiwari and others
Software is a major component of power
consumption in real-time and non-real-time
applications
Therefore, software must be analyzed in terms of
power consumption to minimize energy needs of
total systems

3
BACKGROUND II

The lowest level of hardware analysis is the
logic gate
The instruction in software parallels the logic
gate in hardware. Power analysis should be at
this basic level.
Most work previous to Tiwari was hardware-based

4
BACKGROUND III

Instruction level power model
Used to quantify the fundamental information
needed to evaluate power cost of a program.
Can be used by compilers, code generators and
code schedulers targeted for low power.
Does not require knowledge of lower-level details
of the processor.

5
IMPORTANCE

Low Power Consumption
Prolongs life of embedded systems that have low
battery/power stores.
Requires less maintenance and upkeep
Saves money

6
Applications

Instruction-Level Power Models
can give total power consumption information on a
software system before it is deployed on a power
constrained system
allow creation of software optimization
techniques at code compilation and generation
stages

7
Project Goals

Perform an instruction-level power analysis of
the Intel Xscale-based architecture (processor
core only).
Create a power estimator that takes a program's
instruction trace as input and determines its
power consumption
Evaluate the model and estimator using power
measurements

8
Hardware I

ADI BRH 80200
Intel XScale (400-733MHz) 80200 CPU
128MB DRAM.
Dual Intel EEpro/100 (10/100) Ethernet ports.
Oscilloscope
A Tektronix? was used for current readings. It
allows a time granularity of 5 ms and a current
granularity of 5 mA.

9
HARDWARE II
Diagram labels: Power Source, Oscilloscope, CPU
10
ARM ISA I

Six instruction classes were distinguished:
Branch: B, BL, BX, BLX
Data Processing: ADD, ADC, BIC, CMP
Multiply: MUL, MLA, SMLAL, UMLAL
Load/Store: LDR, LDMDA, STR, STMDA
Miscellaneous: MRS, MSR, MRC, MCR
No operation: NOP

11
ARM ISA II

Branch instruction readings were not performed
due to difficulty in creating programs that would
accurately discern current caused by branches
All readings were taken on top of the Linux
operating system

12
Measurement Methods I

Power = Current × Voltage
Energy = Power × Time
Time = Number of processor cycles × Clock Period
Energy = Current × Voltage × Time
Voltage is constant for a battery (Xscale
supports voltage scaling)
Measure Current
Measure Time
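
The relations on this slide can be checked with a short calculation. The following is a minimal Python sketch, not the authors' tooling. The 1.5 V supply and the cycle count are assumed values chosen for illustration; the 400 MHz clock and the 426 mA reading are figures quoted elsewhere in this deck.

```python
def execution_time_s(cycles: int, clock_hz: float) -> float:
    """Time = number of processor cycles x clock period (period = 1 / frequency)."""
    return cycles / clock_hz


def energy_j(current_a: float, voltage_v: float, time_s: float) -> float:
    """Energy = current x voltage x time, in joules."""
    return current_a * voltage_v * time_s


# Illustrative inputs: the voltage and cycle count are assumptions, not measurements.
run_time = execution_time_s(cycles=400_000_000, clock_hz=400e6)
run_energy = energy_j(current_a=0.426, voltage_v=1.5, time_s=run_time)
print(f"time = {run_time} s, energy = {run_energy:.3f} J")
```

With the voltage held constant, only the current and the run time need to be measured, which is the point the slide makes.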

13
Measurement Methods II

Energy consumption has three factors
Base Costs
the energy consumed by the basic processing of
each instruction
Inter-Instruction Effects
the energy costs due to the change in circuit
state when two consecutive, different
instructions are executed
Cache misses and stalls
the energy effect of instruction/data cache
misses or pipeline stalls due to resource
constraints

14
Measurement Methods III

Base Cost Calculation
Base instruction cost is measured and the given
current reading is taken as an average
The programs contain the instruction to be
measured repeated in a loop.
Avoid stalls and cache misses
Overcome the effect of jump instruction at bottom
Contradictory requirements

15
Base Cost Program Example

.global main
main:
    ADC R0,R1,R3    @ Cache size 32 KB -> 200 lines avoids cache misses
    ADC R1,R1,R3    @ With 200 lines, the branch instruction at the end
    ADC R2,R1,R3    @ will have an insignificant effect on the measured
    ADC R3,R1,R3    @ current.
    ADC R4,R1,R3
    ADC R0,R1,R3
    ADC R1,R1,R3
    @ ... repeat for 200 lines
    B main
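
Writing 200 near-identical lines by hand is error-prone, so a test program of this shape could be generated. The sketch below is a hypothetical helper, not part of the original project. It rotates the destination register through R0 to R4 as the listing does, and it assumes 4-byte ARM-mode instructions when checking that the loop fits in the 32 KB instruction cache.

```python
def base_cost_program(mnemonic: str = "ADC", repeats: int = 200) -> str:
    """Return assembly text: `repeats` copies of one instruction, then a branch back."""
    lines = [".global main", "main:"]
    # Rotate the destination register (R0..R4) as the slide's listing does.
    lines += [f"    {mnemonic} R{i % 5},R1,R3" for i in range(repeats)]
    lines.append("    B main")
    return "\n".join(lines) + "\n"


def footprint_bytes(instruction_count: int, bytes_per_instruction: int = 4) -> int:
    """Code size in bytes, assuming fixed-width ARM-mode instructions."""
    return instruction_count * bytes_per_instruction


program = base_cost_program("ADC", 200)
loop_bytes = footprint_bytes(200 + 1)  # 200 measured instructions plus the branch
fits_in_icache = loop_bytes < 32 * 1024
print(program.splitlines()[2], loop_bytes, fits_in_icache)
```

The loop body is long enough that the single branch contributes little to the average current, yet far smaller than the cache, which is how the two requirements are balanced.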

16
Measurement Methods IV

Inter-Instruction Effect
These effects occur mostly during the fetch stage
of the execution pipeline when the hardware
circuit changes significantly due to setup of two
different instructions
The programs are generally a loop of two
alternating instructions

17
Inter-Instruction Program Example

ADD R0,R10,R11
MLA R1,R2,R10,R11
ADD R0,R10,R11
MLA R1,R2,R10,R11
Expected current = (405 + 508)/2 = 456.5 mA
Measured current = 567 mA
Difference = 100.5 mA
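
The expected current for an alternating pair is the mean of the two base costs, and the inter-instruction overhead is whatever the measurement shows beyond that. A minimal Python sketch of that arithmetic follows; the function names are illustrative, not taken from the project.

```python
def expected_pair_current(base_a_ma: float, base_b_ma: float) -> float:
    """Mean of two base costs: the current expected if alternation cost nothing."""
    return (base_a_ma + base_b_ma) / 2


def inter_instruction_overhead(measured_ma: float, base_a_ma: float, base_b_ma: float) -> float:
    """Measured current minus the expected mean of the two base costs."""
    return measured_ma - expected_pair_current(base_a_ma, base_b_ma)


# Base costs quoted on this slide for the ADD/MLA pair.
expected = expected_pair_current(405, 508)
print(expected)  # 456.5
```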

18
Measurement Methods V

Resource Constraint Effects
Can occur for various reasons
Most likely is due to register dependencies among
consecutive instructions
Run a program in a loop with repeated occurrences
of re-used destination registers

19
Resource Constraint Example

.global main
main:
    ADD R0,R6,R7    @ Avoids cache misses
    ADD R0,R6,R7    @ Same destination register
    ADD R0,R6,R7    @ causes pipeline stalls
    ADD R0,R6,R7
    ADD R0,R6,R7
    @ ... repeat for 200 lines
    B main

20
Measurement Methods VI

To create cache miss effects
Run programs with initial instructions that
invalidate all cache entries.

21
Cache Miss Program Example

.global main
main:
    MCR P15,0,R1,C7,C5,0    @ 32 KB instruction cache
    MRC P15,0,R0,C2,C0,0    @ MCR invalidates the cache
    MOV R0,R0
    SUB PC,PC,#4
    ADC R0,R1,R3            @ 8000 lines to cause all possible cache misses
    ADC R0,R1,R3            @ 8000 lines will compensate for the time we
    ADC R0,R1,R3            @ wait for cache invalidation
    @ ... 8000 lines
    B main
    @ MCR is a privileged instruction, so the code runs as kernel modules.

22
Base Cost Results
Instruction    Base Reading (mA)
ADC            426
ADD            406
AND            405
BIC            401
CLZ            400
CMN            400
CMP            474
EOR            401
MOV            398
RSB            503
LDMDA          606
SWP            537
NOP            380
23
Inter-Instruction Results
Instruction Pair    Reading (mA)
ADC_LDMDA           471
CLZ_STR             530
BIC_MSR             460
STMDA_NOP           495
24
Cache Miss and Stall Results
Instruction    Cache Reading (mA)
ADC            456
MLA            350
SWP            393

Instruction    Stall Reading (mA)
BIC            384
MVN            386
TEQ            383
25
Power Model Analysis I
Base Cost
26
Power Model Analysis II
Inter-Instruction
27
Power Model Analysis III
Cache Miss
28
Base Cost
29
Base Cost Observations

Instructions in the same class have similar base
costs
Can be used to group instructions together and
assign an average power cost.
Profiler optimization.
Assumes insignificant change due to variations in
arguments.
Same base cost for an instruction
Inter-instruction cost calculated between groups
of instructions.
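
Grouping instructions by class and assigning each class an average cost could look like the Python sketch below. It uses only base readings from the results slide whose class is named on the ARM ISA slide; the class labels in the dictionary are illustrative identifiers, not project names.

```python
from statistics import mean

# Base readings in mA, taken from the Base Cost Results slide.
BASE_MA = {"ADC": 426, "ADD": 406, "BIC": 401, "CMP": 474, "LDMDA": 606, "NOP": 380}

# Class membership as listed on the ARM ISA slide.
CLASS_OF = {
    "ADC": "data_processing",
    "ADD": "data_processing",
    "BIC": "data_processing",
    "CMP": "data_processing",
    "LDMDA": "load_store",
    "NOP": "nop",
}


def class_averages(base_ma: dict, class_of: dict) -> dict:
    """Average the base readings of all instructions that share a class."""
    groups: dict = {}
    for op, reading in base_ma.items():
        groups.setdefault(class_of[op], []).append(reading)
    return {cls: mean(readings) for cls, readings in groups.items()}


averages = class_averages(BASE_MA, CLASS_OF)
print(averages)
```

A profiler built this way needs one table entry per class rather than one per instruction, at the cost of hiding within-class differences such as the higher CMP reading.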

30
Inter-Instruction
31
Inter-Instruction Observations

Circuit-state overhead is insignificant for x86
(Tiwari's work).
Our measurement shows that it IS significant
compared to our base costs.
Power profiler should include this overhead for
ARM architecture.

32
Resource Constraint Penalty
33
Observations

There is limited variation among different
instructions
The profiler would assign an average power to all
stalls
The profiler would charge the penalty for a single
occurrence multiplied by the total number of
occurrences in a program

34
Current Estimator I

The software estimator will
take as input a program's instruction trace,
cache miss, and stall information
use the above instruction power model and
output the corresponding average current for that
program
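
One way such an estimator could combine base costs and inter-instruction overheads is sketched below in Python. This is an assumed formulation, not the authors' implementation: it weights every instruction equally (as if each took one cycle), adds an overhead term whenever two different instructions are adjacent, and leaves out cache-miss and stall terms. The overhead value in the usage lines is hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional


def estimate_average_current(
    trace: List[str],
    base_ma: Dict[str, float],
    pair_overhead_ma: Optional[Dict[FrozenSet[str], float]] = None,
) -> float:
    """Average current (mA) over a trace: base costs plus pairwise overheads."""
    if not trace:
        raise ValueError("instruction trace is empty")
    overheads = pair_overhead_ma or {}
    total = sum(base_ma[op] for op in trace)
    # Charge an overhead only where two consecutive instructions differ.
    for prev, cur in zip(trace, trace[1:]):
        if prev != cur:
            total += overheads.get(frozenset((prev, cur)), 0.0)
    return total / len(trace)


# Base costs from the results slide; the 40 mA overhead is a made-up example value.
base = {"ADC": 426, "ADD": 406, "NOP": 380}
overhead = {frozenset(("ADC", "ADD")): 40.0}
estimate = estimate_average_current(["ADC", "ADD", "ADD", "NOP"], base, overhead)
print(estimate)
```

Keying the overhead table on an unordered pair treats ADC followed by ADD the same as ADD followed by ADC, which is a simplification that measurements may not support.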

35
Current Estimator II

Estimator Design
SimpleScalarARM (Umich) was used as a starting
point since it could give an instruction trace
most similar to the ARM architecture
Cache miss and stall information would be
obtained from a per-process Performance Counter
facility for the ADI boards.

36
Current Estimator III

Several Challenges Emerged
The Performance Counter facility is not working
SimpleScalar's cache miss functionality was also
not working with the chosen benchmarks
So
cache miss and stall information could not be
included in the evaluation phase.
Base cost and inter-instruction effects can still
be evaluated with SimpleScalar's instruction trace

37
Benchmarks

MIBench (Umich)
is a benchmark suite for embedded systems
has several sub-categories depending on the
system's use such as telecommunications,
automotive, and consumer
was chosen because it is an embedded suite and is
freely available

38
Estimator Results I
39
Estimator II

Results are heavily skewed due to the lack of
cache miss and stall information
Other benchmarks were challenging to compile so
that they function on ARM/SimpleScalar; we are
still working on them
Proper handling of floating-point instructions
needs to be added

40
Necessary Work - LOTS!