Title: Runtime Processor Power Monitoring
1Runtime Processor Power Monitoring
- VICE VERSA
- 07/18/2003 Talk
- Canturk ISCI
2INTRODUCTION
- Runtime processor power
- Measurement with HW
- Estimation with Performance counters
- CPU Unit Power Breakdowns
- Runtime verification
- Processor thermal modeling
- Power Phase Behavior of programs
- Mapping between power behavior and program
structure
3THE BIG PICTURE
Bottom line
- To estimate component power & temperature breakdowns for P4 at runtime
- To analyze how power phase behavior relates to program structure
4Agenda
- Performance Monitoring
- P4 Performance Counters
- Performance Reader LKM
- Real Power Measurement
- P4 Power Measurement Setup
- Examples
- Power Modeling
- P4 Power Model
- Model Measurement Sync Setup, Verification
- Thermal Modeling
- Brief Thermal Model Intro
- Ppro Thermal Model Results
5Bonus Material
- Power Phase Behavior
- Similarity Based on Power Vectors
- Identifying similar program regions
- Profiling Execution Flow
- Sampling process execution
- PCsampler LKM
- Program Structure
- Execution vs. Code space
- Power Phases -> Exec. Phases
- <OR VICE VERSA>
6Performance Monitoring
- Related Work
- Performance Monitoring
- P4 Performance Counters
- Performance Reader LKM
- Real Power Measurement
- P4 Power Measurement Setup
- Examples
- Power Modeling
- P4 Power Model
- Model Measurement Sync Setup, Verification
- Thermal Modeling
- Refined Thermal Model
- Ex Ppro Thermal Model
7Live CPU Performance Monitoring with Hardware
Counters
- Most CPUs have hardware performance counters
- P4 Performance Monitoring HW
- 18 Event Counters
- 18 Counter Configuration Control Registers
- Configure how to count
- 45 Event Selection Control Registers
- Configure what to count
- Additional Control Registers
8Our Event-Counter Performance Reader
- Performance Reader implemented as Linux Loadable Kernel Module
- Implements 6 syscalls
- select_events()
- reset_event_counter()
- start_event_counter()
- stop_event_counter()
- get_event_counts()
- set_replay_MSRs()
- User Level Interface
- Defines the events -> Starts counters
- Stops counters -> Reads counters & TSC
- Event Types
- 59 event classes
- 100s of events to count
9Performance Reader Example Validation
- L1_Dcache benchmark
- Controls cache hit behavior
- Validated against measured cache events
- Vary hit rate from 0-100%
10Processor Power Measurement
- Related Work
- Performance Monitoring
- P4 Performance Counters
- Performance Reader LKM
- Real Power Measurement
- P4 Power Measurement Setup
- Examples
- Power Modeling
- P4 Power Model
- Model Measurement Sync Setup, Verification
- Thermal Modeling
- Refined Thermal Model
- Ex Ppro Thermal Model
11P4 Power Measurement Setup
Clamp ammeter on 12V lines on measured CPU
DMM reading clamp voltages
1mV/Adc conversion
Voltage readings via RS232 to logging machine
Convert to Power vs. time window
12PowerPlotter Example
13SPEC Power Examples
- Different programs show very different power characteristics
- Timescale of interest can be huge -> inaccessible via simulation
14Processor Power Modeling
- Related Work
- Performance Monitoring
- P4 Performance Counters
- Performance Reader LKM
- Real Power Measurement
- P4 Power Measurement Setup
- Examples
- Power Modeling
- P4 Power Model
- Model Measurement Sync Setup, Verification
- Thermal Modeling
- Refined Thermal Model
- Ex Ppro Thermal Model
15P4 POWER MODEL
16Defining Components
17Defining Events -> Access Rates
- We determined 24 events to approximate access rates for 22 components
- Used several heuristics to represent each access rate
- Examples
- Need to rotate counters 4 times to collect all event data
- Used 15 counters & 4 rotations to collect all event data
18Access Rates -> Component Powers
- Performance counter based access rate estimations are used as a proxy for max component power weighting, together with microarchitectural details, in order to estimate processor sub-unit powers
- Ex: Trace cache delivers 3 uops/cycle in deliver mode and 1 uop/cycle in build mode
- Power(TC) = [Access-Rate(TC)/3 + Access-Rate(ID)] x MaxPower(TC) + Non-gated TC CLK power
- Total power is computed as the sum of all 22 component powers + measured idle power (8W)
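A minimal sketch of the trace cache expression and the total power sum. It assumes the operators dropped from the slide text are "=" and "+". The function names and all numeric inputs except the 8W idle power are hypothetical, not taken from the talk.

```python
IDLE_POWER_W = 8.0  # measured idle power quoted on the slide


def trace_cache_power(tc_rate, id_rate, max_power_w, non_gated_clk_w):
    """Trace cache power from its two access rates.

    tc_rate is scaled by 1/3 because deliver mode supplies 3 uops/cycle;
    id_rate (build mode, 1 uop/cycle) is used unscaled.
    """
    return (tc_rate / 3.0 + id_rate) * max_power_w + non_gated_clk_w


def total_power(component_powers_w, idle_w=IDLE_POWER_W):
    """Sum of all component powers plus the measured idle power."""
    return sum(component_powers_w) + idle_w


# Hypothetical numbers, for illustration only
tc = trace_cache_power(tc_rate=1.5, id_rate=0.25, max_power_w=4.0,
                       non_gated_clk_w=0.5)
print(tc, total_power([tc, 1.5]))
```

In the full model the list passed to `total_power` would hold one entry per modeled component (22 on the slide).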
19Experiment Setup Recall
Clamp ammeter on 12V lines on measured CPU
DMM reading clamp voltages
1mV/Adc conversion
Voltage readings via RS232 to logging machine
Convert to Power vs. time window
20Experiment Setup
1mV/Adc conversion
Voltage readings via RS232 to logging machine
21Experiment Setup
1mV/Adc conversion
POWER SERVER
Voltage readings via RS232 to logging machine
Component access rates over ethernet
POWER CLIENT
Convert voltage to measured power
Convert access rates to modeled powers
Sync together in time window
22Tuning Benchmarks
Branch exercise (Taken rate 1)
High-Low
L1Dcache (Hit Rate 0.1)
Fast
23Component Breakdowns
Component Breakdowns for branch_exercise Colors
for 4 CPU subsystems
Execution
Issue - Retire
24Benchmark Power Breakdowns
25SPEC2000 Results
VPR Elaboration (Integer benchmark)
- 2 runs: 1st Placement, 2nd Route
- 1st run has much more stable power, 2nd is more variable
- Placement has higher miss than route <L2 pwr>
- Significant FPE power due to x87_SIMD_moves
Equake Elaboration (FP benchmark)
- Initialization and computation phases
- FP intensive mesh computation phase
- Initialization with high complex IA32 instructions
Twolf Elaboration (Integer benchmark)
- Several loop computations traversing memory <High Memory Power>
- Although const. total power, component powers have slight gradients
26Average SPEC Total Powers
- 1st set: Overall, 2nd set: Non-idle power
- Average difference between measurement and estimation: 3W
- Worst case: Equake (5.8W)
27Stdev of SPEC Total Powers
- 1st set: Overall, 2nd set: Non-idle power
- Average difference: 2W
- Worst case: Vortex (3.5W)
28Desktop Applications
- We aim to track low power utilizations as well.
- Desktop applications are usually low power with intermittent power bursts
- 3 applications, with common operations such as open/close application, web, streaming media, text editing, save to disk, statistical computations.
29Thermal Model
- Related Work
- Performance Monitoring
- P4 Performance Counters
- Performance Reader LKM
- Real Power Measurement
- P4 Power Measurement Setup
- Examples
- Power Modeling
- P4 Power Model
- Model Measurement Sync Setup, Verification
- Thermal Modeling
- Brief Thermal Model Intro
- Ppro Thermal Model Results
30THERMAL MODELING A Basic Model
- Based on lumped R-C model from packaging
- Built upon power modeling
- Sampled Component Powers
- Respective component areas
- Physical processor Parameters
- Packaging
- Heat Transfer
31Physical Structure vs. Thermal Model
Ambient Temperature
Ambient Airflow
Heatsink
Thermal Grease
Heat Spreader
Package
Die
32Simulation Outputs
- Thermal nodes updated every Δt = 20ms
- Component temperatures build up to 350K in 5hrs
- T_heatsink moves very slowly, as expected
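The periodic node update can be illustrated with a single lumped R-C node stepped by forward Euler. This is only a sketch: the model in the talk has several nodes (die, package, heatsink, and so on), and the resistance, capacitance, ambient temperature and power used here are assumed values, not the talk's parameters.

```python
def rc_step(temp_k, power_w, ambient_k, r_k_per_w, c_j_per_k, dt_s=0.02):
    """Advance one thermal node by dt_s seconds.

    Solves C * dT/dt = P - (T - T_ambient) / R with forward Euler.
    """
    dtemp_dt = (power_w - (temp_k - ambient_k) / r_k_per_w) / c_j_per_k
    return temp_k + dt_s * dtemp_dt


def simulate(power_trace_w, ambient_k=300.0, r_k_per_w=1.0,
             c_j_per_k=50.0, dt_s=0.02):
    """Return the node temperature after each power sample."""
    temp = ambient_k
    temps = []
    for power in power_trace_w:
        temp = rc_step(temp, power, ambient_k, r_k_per_w, c_j_per_k, dt_s)
        temps.append(temp)
    return temps


# Constant 50 W for 500 s (ten time constants with the assumed R and C)
temps = simulate([50.0] * 25000)
print(round(temps[-1], 2))
```

With these assumed values the node settles near ambient + P x R. A large thermal capacitance, as for a heatsink node, makes the same update move much more slowly.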
33Power Phase Behavior
- Power Phase Behavior
- Similarity Based on Power Vectors
- Identifying similar program regions
- Profiling Execution Flow
- Sampling process execution
- PCsampler LKM
- Program Structure
- Execution vs. Code space
- Power Phases -> Exec. Phases
- <OR VICE VERSA>
34Power Vectors for Similarity
- Similar to basic block vectors
- Use component power vector samples to represent program phases
- Consider Manhattan distance between 2 vectors as the measure of dissimilarity between the corresponding execution points
- Construct a similarity matrix to represent similarity among all pairs of execution points
- Each entry in the similarity matrix
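A small sketch of the matrix construction, assuming each entry simply holds the Manhattan distance between two power vectors (so 0 means identical behavior). The sample vectors are made up for illustration.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length power vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def similarity_matrix(vectors):
    """Pairwise distance matrix; entry [i][j] compares samples i and j."""
    n = len(vectors)
    return [[manhattan(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical 2-component power vectors sampled at three execution points
samples = [[1, 2], [1, 2], [4, 6]]
for row in similarity_matrix(samples):
    print(row)
```

The matrix is symmetric with a zero diagonal, so only the upper triangle needs to be stored for long runs.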
35Gcc Gzip Similarity Matrices
Gzip Elaboration
- Much more regular power behavior
- Spurious similarities such as 100-150s and 200-280s are distinguished by the similarity analysis
Gcc Elaboration
- Highly variable power
- Almost identical power behavior at 30, 50, 180s
- Although 88s, 110s, 140s, 210s and 230s show similar total power, 88, 210 and 230 share higher similarity
36Generating representative vectors
- Gzip has 1000 power vectors
- Cluster vectors based on similarity
- Could we represent power behavior with reasonable accuracy, with a small number of signature vectors?
- Ex 26 representative vectors with Thresholding Algorithm
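The slide does not spell out the thresholding algorithm. One plausible greedy variant, given here purely as an assumption, keeps a vector as a new representative whenever it is farther than a threshold from every representative found so far.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length power vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pick_representatives(vectors, threshold):
    """Greedy thresholding (assumed variant, not necessarily the talk's).

    Returns (representatives, labels), where labels[i] is the index of
    the representative that vector i was assigned to.
    """
    reps = []
    labels = []
    for vec in vectors:
        for idx, rep in enumerate(reps):
            if manhattan(vec, rep) <= threshold:
                labels.append(idx)
                break
        else:
            # No representative is close enough: start a new group
            reps.append(vec)
            labels.append(len(reps) - 1)
    return reps, labels


# Hypothetical power vectors with two visibly distinct behaviors
vectors = [[10, 5], [10.5, 5], [20, 1], [19.5, 1.2], [10, 5.2]]
reps, labels = pick_representatives(vectors, threshold=1.0)
print(len(reps), labels)
```

The threshold trades accuracy against the number of signature vectors: a smaller threshold yields more representatives.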
37Program Execution Profile
- Power Phase Behavior
- Similarity Based on Power Vectors
- Identifying similar program regions
- Profiling Execution Flow
- Sampling process execution
- PCsampler LKM
- Program Structure
- Execution vs. Code space
- Power Phases -> Exec. Phases
- <OR VICE VERSA>
38Program Execution Profile
- Sample program flow simultaneously with power
- Our LKM implementation PCsampler
- Generate code space similarity in parallel with power space similarity
- Relative comparisons for
- Complexity
- Accuracy
- Applicability, etc.
39 40Where is all this useful?
- Measurement/Modeling for microarchitectural details
- Compiler level power
- SW power profiling
- Power Aware OS
- Dynamic power/thermal/µarch. configuration
- Dynamic memory allocation, process cruise control, etc.
- Demonstrates modern processor power
- Need for speed! Long timescales, thermal constants
- Identify program phases w/o knowledge of code, no basic block info whatsoever
- Program signatures for detailed simulation, say power points rather than simpoints
41 42Counter Access Heuristics
- 1) BUS CONTROL
- No 3rd Level cache -> BSQ allocations ≈ IOQ allocations
- Metric1 Bus accesses from all agents
- Event IOQ_allocation
- Counts various types of bus transactions
- Should account for BSQ as well
- access based rather than duration
- MASK
- Default req. type, all read (128B) and write (64B) types, include OWN, OTHER and PREFETCH
- Metric2 Bus Utilization (the % of time Bus is utilized)
- Event FSB_data_activity
- Counts DataReaDY and DataBuSY events on Bus
- Mask
- Count when processor or other agents drive/read/reserve the bus
- Expression FSB_data_activity x BusRatio / Clocks Elapsed
- To account for clock ratios
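The bus utilization expression can be written as a one-line helper. The function name and the sample counter values and bus ratio below are hypothetical; only the expression itself comes from the slide.

```python
def bus_utilization(fsb_data_activity, bus_ratio, clocks_elapsed):
    """Fraction of elapsed core clocks during which the bus was active.

    fsb_data_activity counts bus-clock events, so it is scaled by the
    core-to-bus clock ratio before dividing by elapsed core clocks.
    """
    if clocks_elapsed <= 0:
        raise ValueError("clocks_elapsed must be positive")
    return fsb_data_activity * bus_ratio / clocks_elapsed


# Hypothetical counter readings over one sampling interval
print(bus_utilization(1_000_000, 3.5, 14_000_000))
```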
43Counter Access Heuristics
- 2) L2 Cache
- Metric 2nd Level cache references
- Event BSQ_cache_reference
- Counts cache ref-s as seen by bus unit
- MASK
- All MESI read misses (LD & RFO)
- 2nd level WR misses
- 3) 2nd Level BPU
- Metric 1 Instructions fetched from L2 (predict)
- Event ITLB_Reference
- Counts ITLB translations
- Mask
- All hits, misses & UC hits
- Metric 2 Branches retired (history update)
- Event branch_retired
- Counts branches retired
- Mask
- Count all Taken/NT/Predicted/MissP
44Counter Access Heuristics
- 4) ITLB & I-Fetch
- etc
- 10) FP Execution
- Metric FP instructions executed
- event1 packed_SP_uop
- counts packed single precision uops
- event2 packed_DP_uop
- counts packed double precision uops
- event3 scalar_SP_uop
- counts scalar single precision uops
- event4 scalar_DP_uop
- counts scalar double precision uops
- event5 64bit_MMX_uop
- counts MMX uops with 64bit SIMD operands
- event6 128bit_MMX_uop
- counts integer SSE2 uops with 128bit SIMD operands
- event7 x87_FP_UOP
- counts x87 FP uops
- event8 x87_SIMD_moves_uop
45 46P4 Details
- Karelian.ee
- P4 1.4GHz
- 0.18 µm, C4-FC-PGA-423
- Heatsink -> Folded Fin
- M6, Al interconnect
- Die Size 217 mm²
- Package Size 5.34cm x 5.17cm
- Power Idle/typ./max ??/51.8/71W
- D1 & T1/L2 8K & 12K uops/256K
- Voltage 1.7/1.75V
47P4 Details
- 1st LKM <LKM_CPUinfo & UserLevel_CPUinfo>
- Implements syscall getCPUinfo()
- Gathers CPU info from
- /asm/processor.h
- Intel control registers (CR4)
- CPUID instruction
- Reveals
- Debug Store mechanism exists for PEBS
- TSC exists
- MSRs implemented
- We can read/write performance counters
- EX
- karelian (P4,willamette) UserLevel_CPUinfo
- viale (P4, Northwood) UserLevel_CPUinfo
48P4 Detector - Counter Clusters
Event Detectors
Event Counters
EVENTS
P4 Components
4 bit wide bus
49Counters, ESCRs & CCCRs
- Simplified Recipe
- Select Event to count
- Select a counter (also defines CCCR)
- Select an ESCR
- Set ESCR fields
- Set CCCR fields
- Enable CCCR
50Counting Mechanisms
- Counting Types
- Non-retirement: Events occur any time during execution
- At-Retirement: Events at the retirement of instruction
- Can count BOGUS vs NBOGUS, Tag uops to count, etc.
- Terminology?
- Mechanisms
- Front end tagging (e.g. LD/ST retired)
- Execution tagging (e.g. packed_DP_retired)
- Replay Tagging (e.g. L1 misses)
- No Tags (e.g. uops retired)
- Also
- Event Counting, IEBS, PEBS
51At Retirement Counting Terminology
- BOGUS/NBOGUS (speculative)
- Tagging (count uops that encounter event)
- Replay (Data speculation)
52Verifying Counter Reader
- 1) L1Dcache_exercise
- Uses pointer assignment
- L1 = 8K, L2 = 256K
- Array Size = (L1 Size / Hit Rate)
- e.g. for 10% hit rate: 80K -> 20K entries
- Array Size < L2 size
- Array elements: PRBS of array indices
- Bench loop
- new index <- array[old index]
- However, gcc puts 5 LDs in the bench loop
- 4 static -> Hit rate 100%
- 1 our load -> our desired hit rate
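A sketch of the benchmark sizing arithmetic. It assumes 4-byte entries (consistent with 80K giving 20K entries), and uses Python's `random` shuffle as a stand-in for the PRBS index sequence. The helper names are hypothetical.

```python
import random

L1_BYTES = 8 * 1024   # L1 data cache size from the slide
ENTRY_BYTES = 4       # assumption: 4-byte entries (80K -> 20K entries)


def array_entries(hit_rate):
    """Number of entries so that array size = L1 size / hit rate."""
    if not 0 < hit_rate <= 1:
        raise ValueError("hit_rate must be in (0, 1]")
    return round(L1_BYTES / hit_rate) // ENTRY_BYTES


def build_chain(n, seed=0):
    """Array where chain[i] is the next index, forming one random cycle."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    chain = [0] * n
    for k in range(n):
        chain[order[k]] = order[(k + 1) % n]
    return chain


def observed_hit_rate(target_hit_rate, static_loads=4):
    """Hit rate over all loads when static_loads extra loads always hit."""
    return (static_loads + target_hit_rate) / (static_loads + 1)


print(array_entries(0.1), observed_hit_rate(0.25))

# Bench loop: new index <- array[old index]
chain = build_chain(16)
index = 0
for _ in range(16):
    index = chain[index]
```

The last helper reflects the note about gcc's 5 loads: counters would see the blended rate, not the rate of the benchmark's own load.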
53Verifying Counter Reader
- 1) L1Dcache_exercise results
Ex L1Dcache_exercise Hit Rate 0.25
54Verifying Counter Reader
- 2) branch_exercise
- Uses random number comparison
- Assigns 400K PRBS array outside bench loop
- To avoid rand() instructions in bench loop
- bench loop
- Compares array index to threshold
- Threshold = RAND_MAX x TakenRate
- Repeats 1000 times, reseeding each time
- However gcc adds 2 more branches into bench loop
- Loop exit condition (Prediction 100%)
- Unconditional JMP (Prediction 100%)
- Our Branch's Expected Mispredict Rate
- (0.5 - |TakenRate - 0.5|)
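A short sketch of the expected mispredict rate, reading the slide's expression as 0.5 - |TakenRate - 0.5|. That matches a predictor that settles on the more frequent direction of a random branch, which is an assumption here. The second helper folds in the two always-predicted branches that gcc adds; both function names are hypothetical.

```python
def expected_mispredict_rate(taken_rate):
    """Mispredict rate of the benchmark's own random branch."""
    if not 0 <= taken_rate <= 1:
        raise ValueError("taken_rate must be in [0, 1]")
    return 0.5 - abs(taken_rate - 0.5)


def loop_mispredict_rate(taken_rate, extra_branches=2):
    """Rate over all branches when extra_branches are always predicted."""
    return expected_mispredict_rate(taken_rate) / (1 + extra_branches)


for rate in (0.0, 0.25, 0.5, 1.0):
    print(rate, expected_mispredict_rate(rate), loop_mispredict_rate(rate))
```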
55Verifying Counter Reader
- 2) branch_exercise results
Ex branch_exercise Taken Rate0.5
56MEASUREMENT Method
- Select Power lines that reflect CPU power
- P4 uses 12 V lines
- Clamp the current probe over the 12V lines
- 1mV/Adc conversion
- Connect the clamp into DMM
- Send Voltage reading over serial
- Log the voltage readings
- Convert to instantaneous power as
- 12 x Vsample x 1000
- Log Power values
- Plot Power values
57MEASUREMENT Tools
- Poll serial port every 20ms
- quicker -> overkill, slower -> overlook
- Compute running average
- sample every Δt you select
- Easier to sync with Power Model
- PowerMeter
- Convert voltage reading to power and log
- P = 12 x Vread x 1000
- PowerPlotter
- Plot Power samples over sliding time window
- 100 s history with 1000 samples (Δt = 100ms)
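The conversion, averaging and sliding history described above could look roughly like this. The function names are hypothetical and the voltage readings are made up. Only the 12 V supply, the 1 mV/A clamp ratio, the 20 ms polling period and the 1000-sample history come from the slides.

```python
from collections import deque

SUPPLY_VOLTS = 12.0      # CPU power is drawn from the 12 V lines
AMPS_PER_VOLT = 1000.0   # clamp outputs 1 mV per ampere


def voltage_to_power(v_read):
    """P = 12 x Vread x 1000, with Vread in volts."""
    return SUPPLY_VOLTS * v_read * AMPS_PER_VOLT


def window_averages(samples, per_window):
    """Average each full group of per_window consecutive samples."""
    averages = []
    for start in range(0, len(samples) - per_window + 1, per_window):
        chunk = samples[start:start + per_window]
        averages.append(sum(chunk) / per_window)
    return averages


def make_history(max_samples=1000):
    """Sliding window that silently drops the oldest sample when full."""
    return deque(maxlen=max_samples)


# Hypothetical 20 ms voltage readings, averaged into 100 ms power samples
readings = [0.0040, 0.0042, 0.0041, 0.0039, 0.0038] * 2
powers = [voltage_to_power(v) for v in readings]
history = make_history()
history.extend(window_averages(powers, per_window=5))
print(list(history))
```

Averaging five 20 ms readings per output sample gives the 100 ms step, and a 1000-entry deque then spans the 100 s plotting history.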
58Current Probe
- Fluke i410
- Uses Hall Voltage to measure current and convert to Voltage
- 1mV / Adc
- Range 0.5 - 400A
- Accuracy 3.5% + 0.5A
- Generated voltage is fed to DMM
- Compared against the Ppro Amoeba shunt setup for verification
59DMM
- Agilent 34401A
- Measurement Motive
- We should sample as quickly as possible (grep case)
- Measurement Setup
- Fast 4 digit, Autozero OFF, Display OFF
- From [8], 1000 readings/s (x150 faster than fast 6 digit)
- Serial Interface
- From [9], 55 ASCII readings/s
- Polling serial port faster than 20ms is overkill
60P4 Power Lines
- Which power lines should we cut / clamp?
- [5] shows the power lines
- 1-CPU power connector
- 13-System power connector
- P1 -> 13, P2 -> 1
- [6,7] say P4 uses 12V lines for CPU, rather than 5V lines
- Both P1 & P2 have 12, 5 and 3.3 V lines
- I run branch_exercise (takenRate = 1) and gzip_static -> obtain the current variation on the lines
61Current on Power Lines
Reveals ALL 3 12V lines' currents follow CPU activity -> All add to CPU Power!
62About the ripples
- Add ripple stuff here!!!!!!!!!!!!!!!!!!!!!!!!!!!
63P4 Architecture vs Layout
Components to Model
- FP RF
- Decode
- Trace
- 1st Level BPU
- Microcode ROM
- Allocation
- MOB
- Mem Control
- DTLB
- Int EXE
- FP EXE
- Int RF
- Rename
- Inst-n Qs
- Schedule
- Inst-n Qs
- Retirement
- Bus Control
- L2 Cache
- 2nd Level BPU
- ITLB & Ifetch
- L1 Cache
64Counter Rotations