Runtime Processor Power Monitoring - PowerPoint PPT Presentation

Runtime Processor Power Monitoring VICE VERSA 07/18/2003 Talk Canturk ISCI – PowerPoint PPT presentation

1
Runtime Processor Power Monitoring
  • VICE VERSA
  • 07/18/2003 Talk
  • Canturk ISCI

2
INTRODUCTION
  • Runtime processor power
  • Measurement with HW
  • Estimation with Performance counters
  • CPU Unit Power Breakdowns
  • Runtime verification
  • Processor thermal modeling
  • Power Phase Behavior of programs
  • Mapping between power behavior and program
    structure

3
THE BIG PICTURE
Bottom line
  • To estimate component power and temperature
    breakdowns for the P4 at runtime
  • To analyze how power phase behavior relates to
    program structure

4
Agenda
  • Performance Monitoring
  • P4 Performance Counters
  • Performance Reader LKM
  • Real Power Measurement
  • P4 Power Measurement Setup
  • Examples
  • Power Modeling
  • P4 Power Model
  • Model Measurement Sync Setup, Verification
  • Thermal Modeling
  • Brief Thermal Model Intro
  • Ppro Thermal Model Results

5
Bonus Material
  • Power Phase Behavior
  • Similarity Based on Power Vectors
  • Identifying similar program regions
  • Profiling Execution Flow
  • Sampling process execution
  • PCsampler LKM
  • Program Structure
  • Execution vs. Code space
  • Power Phases → Exec. Phases
  • <OR VICE VERSA>

6
Performance Monitoring
  • Related Work
  • Performance Monitoring
  • P4 Performance Counters
  • Performance Reader LKM
  • Real Power Measurement
  • P4 Power Measurement Setup
  • Examples
  • Power Modeling
  • P4 Power Model
  • Model Measurement Sync Setup, Verification
  • Thermal Modeling
  • Refined Thermal Model
  • Ex Ppro Thermal Model

7
Live CPU Performance Monitoring with Hardware
Counters
  • Most CPUs have hardware performance counters
  • P4 Performance Monitoring HW
  • 18 Event Counters
  • 18 Counter Configuration Control Registers
  • Configure how to count
  • 45 Event Selection Control Registers
  • Configure what to count
  • Additional Control Registers

8
Our Event-Counter Performance Reader
  • Performance Reader implemented as Linux Loadable
    Kernel Module
  • Implements 6 syscalls
  • select_events()
  • reset_event_counter()
  • start_event_counter()
  • stop_event_counter()
  • get_event_counts()
  • set_replay_MSRs()
  • User Level Interface
  • Defines the events → Starts counters
  • Stops counters → Reads counters and TSC
  • Event Types
  • 59 event classes
  • 100s of events to count

9
Performance Reader Example Validation
  • L1_Dcache benchmark
  • Controls cache hit behavior
  • Validated against measured cache events
  • Vary hit rate from 0 to 100%

10
Processor Power Measurement

11
P4 Power Measurement Setup
Clamp ammeter on 12V lines on measured CPU
DMM reading clamp voltages
1mV/Adc conversion
Voltage readings via RS232 to logging machine
Convert to Power vs. time window
12
PowerPlotter Example
13
SPEC Power Examples
  • Different programs show very different power
    characteristics
  • Timescale of interest can be huge => inaccessible
    via simulation

14
Processor Power Modeling

15
P4 POWER MODEL
16
Defining Components
  • Bus Control
  • L2 Cache
  • ITLB
  • Ifetch
  • 2nd Level BPU
  • Mem Cntrl
  • L1 Cache
  • Int RF
  • Instr-n Queue
  • Int EXE
  • MOB
  • DTLB
  • Scheduler
  • Check
  • Replay
  • Retire
  • Instr-n Decode
  • FP RF
  • Instr-n Queue
  • 1st Level BPU
  • Rename
  • Ucode ROM
  • Allocate

17
Defining Events → Access Rates
  • We determined 24 events to approximate access
    rates for 22 components
  • Used several heuristics to represent each access
    rate
  • Examples
  • Used 15 counters and 4 rotations to collect all
    event data

18
Access Rates → Component Powers
  • Performance-counter-based access rate
    estimations are used as a proxy for max component
    power weighting, together with microarchitectural
    details, in order to estimate processor sub-unit
    powers
  • EX: Trace cache delivers 3 uops/cycle in deliver
    mode and 1 uop/cycle in build mode
  • Power(TC) = [Access-Rate(TC)/3 + Access-Rate(ID)]
    x MaxPower(TC)
    + Non-gated TC CLK power
  • Total power is computed as the sum of all 22
    component powers + measured idle power (8W)
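A minimal sketch of the trace cache example and the total power sum, assuming the slide's expression reads as [Access-Rate(TC)/3 + Access-Rate(ID)] x MaxPower(TC) plus the non-gated clock power. The function names and all numeric inputs are hypothetical; only the 8W idle power comes from the slide.

```python
IDLE_POWER_W = 8.0  # measured idle power quoted on the slide


def trace_cache_power(tc_access_rate, id_access_rate, max_power_w, non_gated_clk_w):
    """Estimate trace cache power from counter-based access rates.

    tc_access_rate is scaled by 1/3 because deliver mode moves 3 uops/cycle;
    id_access_rate stands in for build mode (1 uop/cycle).
    """
    scaled_rate = tc_access_rate / 3.0 + id_access_rate
    return scaled_rate * max_power_w + non_gated_clk_w


def total_power(component_powers_w, idle_power_w=IDLE_POWER_W):
    """Total power = sum of the component powers + measured idle power."""
    return sum(component_powers_w) + idle_power_w


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    tc = trace_cache_power(1.5, 0.25, max_power_w=4.0, non_gated_clk_w=0.5)
    print(tc, total_power([tc, 1.5]))
```

The same shape (access rate x architectural scaling x max power + non-gated clock power) would apply to the other components, each with its own scaling.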

19
Experiment Setup Recall
Clamp ammeter on 12V lines on measured CPU
DMM reading clamp voltages
1mV/Adc conversion
Voltage readings via RS232 to logging machine
Convert to Power vs. time window
20
Experiment Setup
1mV/Adc conversion
Voltage readings via RS232 to logging machine
21
Experiment Setup
1mV/Adc conversion
POWER SERVER
Voltage readings via RS232 to logging machine
Component access rates over ethernet
POWER CLIENT
Convert voltage to measured power
Convert access rates to modeled powers
Sync together in time window
22
Tuning Benchmarks
Branch exercise (Taken rate = 1)
High-Low
L1Dcache (Hit Rate = 0.1)
Fast
23
Component Breakdowns
Component Breakdowns for branch_exercise Colors
for 4 CPU subsystems
Execution
Issue - Retire
24
Benchmark Power Breakdowns
25
SPEC2000 Results
VPR Elaboration (Integer benchmark)
  • 2 runs: 1st → Placement, 2nd → Route
  • 1st run has much more stable power, 2nd is more
    variable
  • Placement has a higher miss rate than Route
    <L2 pwr>
  • Significant FPE power due to x87_SIMD_moves
Equake Elaboration (FP benchmark)
  • Initialization and computation phases
  • FP intensive mesh computation phase
  • Initialization with highly complex IA32
    instructions
Twolf Elaboration (Integer benchmark)
  • Several loop computations traversing memory
    <High Memory Power>
  • Although total power is constant, component
    powers have slight gradients
26
Average SPEC Total Powers
  • 1st set: Overall, 2nd set: Non-idle power
  • Average difference between measurement and
    estimation: 3W
  • Worst case: Equake (5.8W)

27
Stdev of SPEC Total Powers
  • 1st set: Overall, 2nd set: Non-idle power
  • Average difference: 2W
  • Worst case: Vortex (3.5W)

28
Desktop Applications
  • We aim to track low power utilizations as well.
  • Desktop applications are usually low power with
    intermittent power bursts
  • 3 applications, with common operations such as
    open/close application, web, streaming media,
    text editing, save to disk, statistical
    computations.

29
Thermal Model
  • <Rush Mode!>
  • Related Work
  • Performance Monitoring
  • P4 Performance Counters
  • Performance Reader LKM
  • Real Power Measurement
  • P4 Power Measurement Setup
  • Examples
  • Power Modeling
  • P4 Power Model
  • Model Measurement Sync Setup, Verification
  • Thermal Modeling
  • Brief Thermal Model Intro
  • Ppro Thermal Model Results

30
THERMAL MODELING: A Basic Model
  • Based on lumped R-C model from packaging
  • Built upon power modeling
  • Sampled Component Powers
  • Respective component areas
  • Physical processor Parameters
  • Packaging
  • Heat Transfer

31
Physical Structure vs. Thermal Model
Ambient Temperature
Ambient Airflow
Heatsink
Thermal Grease
Heat Spreader
Package
Die
32
Simulation Outputs
  • Thermal nodes updated every Δt = 20ms
  • Component temperatures build up to 350K in 5hrs
  • T_heatsink moves very slowly, as expected
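As an illustration of a lumped R-C update at a 20ms step, here is a single-node sketch using explicit Euler integration. The talk's model has several nodes (die, package, heat spreader, heatsink); this one-node version, its function names, and every R, C, power and ambient value are assumptions chosen only to show the update rule.

```python
def step_temperature(temp_k, power_w, ambient_k, r_k_per_w, c_j_per_k, dt_s=0.020):
    """One explicit-Euler update of a single lumped R-C thermal node.

    C * dT/dt = P - (T - T_ambient) / R
    """
    heat_out_w = (temp_k - ambient_k) / r_k_per_w
    return temp_k + dt_s * (power_w - heat_out_w) / c_j_per_k


def simulate(power_trace_w, ambient_k=300.0, r_k_per_w=0.5, c_j_per_k=100.0, dt_s=0.020):
    """Run the node over a sampled power trace; returns the final temperature."""
    temp_k = ambient_k
    for power_w in power_trace_w:
        temp_k = step_temperature(temp_k, power_w, ambient_k, r_k_per_w, c_j_per_k, dt_s)
    return temp_k


if __name__ == "__main__":
    # Hypothetical constant 40 W load; the steady state is ambient + P * R.
    print(simulate([40.0] * 100_000))
```

With a time constant R x C much larger than Δt, the node moves slowly, which is the behavior the slide notes for the heatsink.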

33
Power Phase Behavior

34
Power Vectors for Similarity
  • Similar to basic block vectors
  • Use component power vector samples to represent
    program phases
  • Consider Manhattan distance between 2 vectors as
    the measure of dissimilarity between the
    corresponding execution points
  • Construct a similarity matrix to represent
    similarity among all pairs of execution points
  • Each entry in the similarity matrix is the
    Manhattan distance between the corresponding
    pair of power vectors
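The matrix construction described above can be sketched as follows. The function names and the sample vectors are hypothetical; each vector stands for one sample of component powers.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length power vectors."""
    if len(u) != len(v):
        raise ValueError("power vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def similarity_matrix(power_vectors):
    """Pairwise distance matrix; smaller entries mean more similar execution points."""
    n = len(power_vectors)
    return [
        [manhattan(power_vectors[i], power_vectors[j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical 2-component power samples (watts).
    samples = [[1.0, 2.0], [1.0, 2.0], [3.0, 0.5]]
    for row in similarity_matrix(samples):
        print(row)
```

The matrix is symmetric with a zero diagonal, so only the upper triangle needs to be stored or plotted.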

35
Gcc & Gzip Similarity Matrices
Gzip Elaboration
  • Very regular power behavior
  • Spurious similarities, such as 100-150s and
    200-280s, are distinguished by the similarity
    analysis
Gcc Elaboration
  • Highly variable power
  • Almost identical power behavior at 30, 50 and
    180s
  • Although 88s, 110s, 140s, 210s and 230s show
    similar total power, 88, 210 and 230 share
    higher similarity
[Figure: Gcc and Gzip similarity matrices, axes in seconds]

36
Generating representative vectors
  • Gzip has 1000 power vectors
  • Cluster vectors based on similarity
  • Could we represent power behavior with reasonable
    accuracy, with a small number of signature
    vectors?
  • Ex: 26 representative vectors with Thresholding
    Algorithm
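The slides do not spell out the Thresholding Algorithm, so the following is only one plausible greedy variant, shown as an assumption: a vector joins the first representative within a Manhattan-distance threshold, otherwise it becomes a new representative. The names and the sample data are hypothetical.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length power vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def threshold_cluster(vectors, threshold):
    """Greedy threshold clustering (an assumed variant, not the talk's exact method).

    Returns (representatives, labels), where labels[i] is the index of the
    representative that vectors[i] was assigned to.
    """
    representatives = []
    labels = []
    for vec in vectors:
        for idx, rep in enumerate(representatives):
            if manhattan(vec, rep) <= threshold:
                labels.append(idx)
                break
        else:
            # No representative is close enough: start a new cluster.
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return representatives, labels


if __name__ == "__main__":
    vectors = [[1.0, 1.0], [1.1, 1.0], [5.0, 5.0], [5.0, 5.2]]
    reps, labels = threshold_cluster(vectors, threshold=0.5)
    print(len(reps), labels)
```

Raising the threshold gives fewer signature vectors at the cost of accuracy, which is the trade-off the slide asks about.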

37
Program Execution Profile

38
Program Execution Profile
  • Sample program flow simultaneously with power
  • Our LKM implementation: PCsampler
  • Generate code space similarity in parallel with
    power space similarity
  • Relative comparisons for
  • Complexity
  • Accuracy
  • Applicability, etc.

39
  • EOP

40
Where is all this useful?
  • Measurement/Modeling for microarchitectural
    details
  • Compiler level power
  • SW power profiling
  • Power Aware OS
  • Dynamic power/thermal/µarch. configuration
  • Dynamic memory allocate, Process cruise control,
    etc.
  • Demonstrates modern processor power
  • Need for speed! Long Timescales, thermal
    constants
  • Identify program phases w/o knowledge of code, no
    basic block info whatsoever
  • Program signatures for detailed simulation, say
    power points rather than simpoints

41
  • ACCESS
  • HEURISTICS

42
Counter Access Heuristics
  • 1) BUS CONTROL
  • No 3rd level cache → BSQ allocations = IOQ
    allocations
  • Metric 1: Bus accesses from all agents
  • Event: IOQ_allocation
  • Counts various types of bus transactions
  • Should account for BSQ as well
  • Access based rather than duration based
  • MASK
  • Default req. type, all read (128B) and write
    (64B) types, include OWN, OTHER and PREFETCH
  • Metric 2: Bus utilization (the % of time the bus
    is utilized)
  • Event: FSB_data_activity
  • Counts DataReaDY and DataBuSY events on the bus
  • Mask
  • Count when processor or other agents
    drive/read/reserve the bus
  • Expression: FSB_data_activity x BusRatio
    / Clocks Elapsed
  • To account for clock ratios
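The bus utilization expression can be written out as a small helper. This is a sketch: the function name is hypothetical, the counts are made up, and the bus ratio of 14 is an example value for the core-to-bus clock ratio rather than a figure taken from the slides.

```python
def bus_utilization(fsb_data_activity, bus_ratio, clocks_elapsed):
    """Fraction of core clock cycles during which the bus was active.

    fsb_data_activity counts bus-clock events, so it is scaled by the
    core-to-bus clock ratio before dividing by elapsed core clocks.
    """
    if clocks_elapsed <= 0:
        raise ValueError("clocks_elapsed must be positive")
    return fsb_data_activity * bus_ratio / clocks_elapsed


if __name__ == "__main__":
    # Hypothetical counter readings over one sampling interval.
    print(bus_utilization(fsb_data_activity=1000, bus_ratio=14, clocks_elapsed=56000))
```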

43
Counter Access Heuristics
  • 2) L2 Cache
  • Metric: 2nd level cache references
  • Event: BSQ_cache_reference
  • Counts cache ref-s as seen by bus unit
  • MASK
  • All MESI read misses (LD & RFO)
  • 2nd level WR misses
  • 3) 2nd Level BPU
  • Metric 1: Instructions fetched from L2 (predict)
  • Event: ITLB_Reference
  • Counts ITLB translations
  • Mask
  • All hits, misses & UC hits
  • Metric 2: Branches retired (history update)
  • Event: branch_retired
  • Counts branches retired
  • Mask
  • Count all Taken/NT/Predicted/MissP

44
Counter Access Heuristics
  • 4) ITLB & I-Fetch
  • etc.
  • 10) FP Execution
  • Metric: FP instructions executed
  • event1: packed_SP_uop
  • counts packed single precision uops
  • event2: packed_DP_uop
  • counts packed double precision uops
  • event3: scalar_SP_uop
  • counts scalar single precision uops
  • event4: scalar_DP_uop
  • counts scalar double precision uops
  • event5: 64bit_MMX_uop
  • counts MMX uops with 64bit SIMD operands
  • event6: 128bit_MMX_uop
  • counts integer SSE2 uops with 128bit SIMD
    operands
  • event7: x87_FP_UOP
  • counts x87 FP uops
  • event8: x87_SIMD_moves_uop

45
  • ADDITIONAL
  • SLIDES

46
P4 Details
  • Karelian.ee
  • P4 1.4GHz
  • 0.18µ, C4-FC-PGA-423
  • Heatsink → Folded Fin
  • M6, Al interconnect
  • Die Size: 217 mm²
  • Package Size: 5.34cm x 5.17cm
  • Power Idle/typ./max: ??/51.8/71W
  • D1+T1/L2: 8K+12K uops/256K
  • Voltage: 1.7/1.75V

47
P4 Details
  • 1st LKM <LKM_CPUinfo & UserLevel_CPUinfo>
  • Implements syscall getCPUinfo()
  • Gathers CPU info from
  • /asm/processor.h
  • Intel control registers (CR4)
  • CPUID instruction
  • Reveals
  • Debug Store mechanism exists for PEBS
  • TSC exists
  • MSRs implemented
  • We can read/write performance counters
  • EX
  • karelian (P4,willamette) UserLevel_CPUinfo
  • viale (P4, Northwood) UserLevel_CPUinfo

48
P4 Detector - Counter Clusters
Event Detectors
Event Counters
EVENTS
P4 Components
4 bit wide bus
49
Counters, ESCRs & CCCRs
  • Simplified Recipe
  • Select Event to count
  • Select a counter (also defines CCCR)
  • Select an ESCR
  • Set ESCR fields
  • Set CCCR fields
  • Enable CCCR

50
Counting Mechanisms
  • Counting Types
  • Non-retirement: events that occur any time during
    execution
  • At-retirement: events counted at the retirement
    of the instruction
  • Can count BOGUS vs NBOGUS, tag uops to count,
    etc.
  • Terminology?
  • Mechanisms
  • Front end tagging (e.g. LD/ST retired)
  • Execution tagging (e.g. packed_DP_retired)
  • Replay tagging (e.g. L1 misses)
  • No tags (e.g. uops retired)
  • Also
  • Event Counting, IEBS, PEBS

51
At Retirement Counting Terminology
  • BOGUS/NBOGUS (speculative)
  • Tagging (count uops that encounter event)
  • Replay (Data speculation)

52
Verifying Counter Reader
  • 1) L1Dcache_exercise
  • Uses pointer assignment
  • L1 = 8K, L2 = 256K
  • Array Size = (L1 Size / Hit Rate)
  • i.e. for 10% hit rate: 80K → 20K entries
  • Array Size < L2 size
  • Array elements: PRBS of array indices
  • Bench loop
  • new index ← array[old index]
  • However, gcc puts 5 LDs in the bench loop
  • 4 static → hit rate 100%
  • 1 our load → our desired hit rate
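A rough sketch of the array sizing and of the hit rate one would expect the counters to report once the four static loads are included. It assumes 4-byte array entries (consistent with 80K giving 20K entries) and that the static loads always hit; the function names are hypothetical.

```python
L1_SIZE_BYTES = 8 * 1024
L2_SIZE_BYTES = 256 * 1024
ENTRY_BYTES = 4  # assumption: 4-byte indices, so 80K bytes -> 20K entries


def array_entries(hit_rate):
    """Number of array entries that targets the given L1 hit rate for our load."""
    if not 0.0 < hit_rate <= 1.0:
        raise ValueError("hit_rate must be in (0, 1]")
    array_bytes = round(L1_SIZE_BYTES / hit_rate)
    if array_bytes >= L2_SIZE_BYTES:
        raise ValueError("array must stay smaller than L2")
    return array_bytes // ENTRY_BYTES


def overall_hit_rate(hit_rate, static_loads=4):
    """Expected loop-wide L1 hit rate: static loads always hit, ours hits at hit_rate."""
    return (static_loads * 1.0 + hit_rate) / (static_loads + 1)


if __name__ == "__main__":
    print(array_entries(0.10), overall_hit_rate(0.10))
```

Under these assumptions the measured hit rate never drops below 80%, which is why the counter readings must be corrected for the static loads before comparing against the target rate.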

53
Verifying Counter Reader
  • 1) L1Dcache_exercise results

Ex: L1Dcache_exercise, Hit Rate = 0.25
54
Verifying Counter Reader
  • 2) branch_exercise
  • Uses random number comparison
  • Assigns 400K PRBS array outside bench loop
  • To avoid rand() instructions in bench loop
  • Bench loop
  • Compares array[index] to threshold
  • Threshold = RAND_MAX x TakenRate
  • Repeats 1000 times, reseeding each time
  • However, gcc adds 2 more branches into the bench
    loop
  • Loop exit condition (prediction 100%)
  • Unconditional JMP (prediction 100%)
  • Our branch's expected mispredict rate
  • = (0.5 - |TakenRate - 0.5|)
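A small sketch of the expected rates, reading the slide's expression as 0.5 - |TakenRate - 0.5| (that is, the smaller of the taken and not-taken fractions). The function names are hypothetical, the RAND_MAX default of 2^31 - 1 is an assumption about the C library in use, and the loop-wide rate assumes the two compiler-added branches are never mispredicted.

```python
def branch_threshold(taken_rate, rand_max=2**31 - 1):
    """Comparison threshold: RAND_MAX x TakenRate."""
    return int(rand_max * taken_rate)


def expected_mispredict_rate(taken_rate):
    """Expected mispredict rate of the random branch: 0.5 - |TakenRate - 0.5|."""
    if not 0.0 <= taken_rate <= 1.0:
        raise ValueError("taken_rate must be in [0, 1]")
    return 0.5 - abs(taken_rate - 0.5)


def loop_mispredict_rate(taken_rate, predictable_branches=2):
    """Loop-wide rate when the extra branches are always predicted correctly."""
    return expected_mispredict_rate(taken_rate) / (predictable_branches + 1)


if __name__ == "__main__":
    for rate in (0.0, 0.25, 0.5, 1.0):
        print(rate, expected_mispredict_rate(rate), loop_mispredict_rate(rate))
```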

55
Verifying Counter Reader
  • 2) branch_exercise results

Ex: branch_exercise, Taken Rate = 0.5
56
MEASUREMENT Method
  • Select Power lines that reflect CPU power
  • P4 uses 12 V lines
  • Clamp the current probe over the 12V lines
  • 1mV/Adc conversion
  • Connect the clamp into DMM
  • Send Voltage reading over serial
  • Log the voltage readings
  • Convert to instantaneous power (W) as
  • P = 12 x Vsample x 1000 (Vsample in volts)
  • Log Power values
  • Plot Power values
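The conversion step works out as follows: the clamp outputs 1mV per ampere, so a reading in volts times 1000 is the current in amperes, and multiplying by the 12V supply gives watts. A minimal sketch, with a hypothetical function name and a made-up reading:

```python
SUPPLY_VOLTS = 12.0      # CPU is fed from the 12 V lines
AMPS_PER_VOLT = 1000.0   # clamp conversion: 1 mV per ampere


def sample_to_power(v_sample):
    """Convert one DMM voltage reading (volts) into instantaneous power (watts)."""
    current_a = v_sample * AMPS_PER_VOLT
    return SUPPLY_VOLTS * current_a


if __name__ == "__main__":
    # Hypothetical reading of 4 mV, i.e. 4 A on the 12 V lines.
    print(sample_to_power(0.004))
```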

57
MEASUREMENT Tools
  • Poll serial port every 20ms
  • quicker → overkill, slower → overlook
  • Compute running average
  • sample every Δt you select
  • Easier to sync with Power Model
  • PowerMeter
  • Convert voltage reading to power and log
  • P = 12 x Vread x 1000
  • PowerPlotter
  • Plot power samples over sliding time window
  • 100 s history with 1000 samples (Δt = 100ms)
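One way the averaging and the sliding window could fit together is sketched below. This is not the PowerMeter or PowerPlotter source; the class name and structure are assumptions. It averages the 20ms polls that fall into each Δt and keeps a bounded history of the resulting samples.

```python
from collections import deque


class PowerWindow:
    """Averages fixed-rate polls into dt-spaced samples with a bounded history."""

    def __init__(self, poll_s=0.020, dt_s=0.100, history=1000):
        # Number of polls folded into one power sample (5 for 20 ms polls, 100 ms dt).
        self.per_sample = max(1, round(dt_s / poll_s))
        self.pending = []
        # deque with maxlen drops the oldest sample once the window is full.
        self.samples = deque(maxlen=history)

    def add_reading(self, v_read):
        """Add one DMM voltage reading (volts); emit a sample when dt has elapsed."""
        self.pending.append(12.0 * v_read * 1000.0)  # P = 12 x Vread x 1000
        if len(self.pending) == self.per_sample:
            self.samples.append(sum(self.pending) / len(self.pending))
            self.pending.clear()


if __name__ == "__main__":
    window = PowerWindow()
    for _ in range(10):          # ten hypothetical 4 mV readings
        window.add_reading(0.004)
    print(list(window.samples))
```

With the defaults, 1000 samples at Δt = 100ms cover the 100 s history mentioned on the slide.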

58
Current Probe
  • Fluke i410
  • Uses Hall Voltage to measure current and convert
    to Voltage
  • 1mV / Adc
  • Range: 0.5 - 400A
  • Accuracy: 3.5% + 0.5A
  • Generated voltage is fed to DMM
  • Compared against the Ppro Amoeba shunt setup for
    verification

59
DMM
  • Agilent 34401A
  • Measurement Motive
  • We should sample as quickly as possible (grep
    case)
  • Measurement Setup
  • Fast 4½ digit, Autozero OFF, Display OFF
  • From [8]: 1000 readings/s (x150 faster than
    fast 6½ digit)
  • Serial Interface
  • From [9]: 55 ASCII readings/s
  • Polling serial port faster than 20ms is overkill

60
P4 Power Lines
  • Which power lines should we cut / clamp?
  • [5] shows the power lines
  • 1: CPU power connector
  • 13: System power connector
  • P1 → 13, P2 → 1
  • [6,7] say P4 uses 12V lines for CPU, rather
    than 5V lines
  • Both P1 & P2 have 12, 5 and 3.3 V lines
  • I run branch_exercise (takenRate = 1) and
    gzip_static → obtain the current variation on the
    lines

61
Current on Power Lines
Reveals: currents on ALL 3 12V lines follow CPU
activity → All add to CPU Power!
62
About the ripples
  • Add ripple stuff here!!!!!!!!!!!!!!!!!!!!!!!!!!!

63
P4 Architecture vs Layout
Components to Model
  1. FP RF
  2. Decode
  3. Trace
  4. 1st Level BPU
  5. Microcode ROM
  6. Allocation
  1. MOB
  2. Mem Control
  3. DTLB
  4. Int EXE
  5. FP EXE
  6. Int RF
  1. Rename
  2. Inst-n Qs
  3. Schedule
  4. Inst-n Qs
  5. Retirement
  1. Bus Control
  2. L2 Cache
  3. 2nd Level BPU
  4. ITLB Ifetch
  5. L1 Cache

64
Counter Rotations
65
  • EOP