Title: Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation
1. Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation
- Abhishek Bhattacharjee
- Gilberto Contreras
- Margaret Martonosi
- Princeton University
2. Problem: SW Simulators for Architectural Power Estimation
- Power has become a first-class design problem
  - Affects power density, thermal behavior, packaging constraints
- Early-stage µ-arch perf/power evaluation is crucial
- Conventional SW simulators (Wattch, SimplePower, HotSpot)
  - Flexible, low development time
  - But SW simulations are too slow
- Chips getting more complex: core counts, interconnect, etc.
- Design space getting more complex: perf/power/thermal
- Must consider OS, workload interaction
3. Alternatives to Long Simulations
- Run application snippets, ignore OS
  - Compromises result accuracy and credibility
- Parallelize the simulator [Falsafi et al. ACM Modeling '97, Mukherjee et al. IEEE Concurrency '00, Chidester et al. ACM Modeling '02, and others]
  - Shared structures (LLC, coherence) limit scalability
- Hardware runtime monitoring [Joseph et al. ISLPED '01, Bellosa et al. ACM SIGOPS, and others]
  - Fast evaluation time
  - Restricted view of components
  - Requires an existing design
4. Our Approach: FPGA-Based Full-System Emulation
- Develop an FPGA-based perf./power emulator of a proposed CMP machine
- Emulation rate of 50-300 MHz → run full apps, OS
  - Similar to HW monitoring
- Programmable → insert relevant monitors, model various designs
  - Similar to SW simulations
- Bottom line: get the detail and full-system effects of real measurements before the machine is built
- First full-system FPGA emulation of a CMP running Linux
- Demonstrate use on an activity migration example
5. Recent Related Work on FPGA-Based Emulation
- Memory controller emulation
  - RPM [Oner et al. ISFPGA '95]
- Purely performance emulation
  - HASim [Emer et al. ISFPGA '06], RAMP [Wawrzynek et al. '06]
  - Modular, parameterizable perf. models on FPGAs
- Purely power emulation [Coburn et al. DAC '05]
  - RTL with power models on FPGA (area/latency overhead analysis)
- Performance and power emulation [Atienza et al. DAC '06]
  - Performance and thermal emulation of MPSoCs for existing cores
  - Runs OS on host and communicates with FPGA
6. Presentation Outline
- Designing the emulator
- Validating emulator power models
- Evaluating emulator speedup
- Profiling application runtime power behavior
- Case study: Activity migration
- Conclusion
7. Steps in Designing the Emulator
- 1. Choose target platform
- 2. Choose candidate core design
- 3. Design event counters
- 4. Design power models
- 5. Boot OS and run full apps.
8. Target Emulation Platform
(Emulator design step 1: choose target platform)
- Target FPGA platform: BEE2
  - 5 Xilinx V2P70 FPGAs (1 control / 4 user)
  - Current design runs on the control FPGA
- Methodology extensible to other platforms
9. Candidate Core Design: Leon3 SparcV8 CMP
(Emulator design step 2: choose candidate core design)
Candidate Core     Leon3 SparcV8 VHDL core
Organization       L1 snoopy cache coherence (AMBA AHB bus)
Pipeline           Single-issue, in-order, 7-stage
Functional Units   Adder, shifter, pipelined Mul/Div
L1 I-Cache         4 KB, 2-way, 32-byte lines, LRR
L1 D-Cache         4 KB, 2-way, 32-byte lines, LRR, write-through, virtually addressed
MMU                8-entry I and D TLBs, LRU
- Paper emulates 2 cores, subsequently scaled to 4 cores
- Currently uses 60% of LUTs and 20% of BRAMs on 1 Virtex2P 70
- Current synthesized FPGA clock rate: 65 MHz
- Future: further scale core count and L1 caches; add LLC, FPU
- Methodology extensible to other core designs
10. Inserting Event Counters
(Emulator design step 3: design event counters)
[Block diagram: SparcV8 Core 1 ... Core N, each with a 3-port register file, 7-stage integer pipeline, 4 KB I-cache, and 4 KB D-cache, connected through an AHB controller to the AHB bus; 64-bit event counters attached to each module.]
- Memory-mapped counters; ISA extended with start/stop/reset counter operations (see the sketch below)
- 36 counters → roughly 3% of LUTs, no impact on frequency
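To make the counter interface concrete, here is a minimal C sketch of how software on the emulated CMP might snapshot the memory-mapped counters. The base address, control-register layout, and bit definitions are assumptions for illustration only; the deck states just that the counters are memory-mapped and that start/stop/reset operations were added to the ISA.

```c
#include <stdint.h>

/* Hypothetical counter map: the base address and control-register layout
 * below are illustrative assumptions, not the real design. */
#define EMU_CTR_BASE    0x80000F00u
#define EMU_NUM_CTRS    36
#define EMU_COUNTERS    ((volatile uint64_t *)EMU_CTR_BASE)
#define EMU_CTR_CTRL    ((volatile uint32_t *)(EMU_CTR_BASE + EMU_NUM_CTRS * 8))
#define CTR_START       (1u << 0)
#define CTR_STOP        (1u << 1)
#define CTR_RESET       (1u << 2)

/* Snapshot all event counters into a caller-provided buffer. */
static void read_event_counters(uint64_t out[EMU_NUM_CTRS])
{
    for (int i = 0; i < EMU_NUM_CTRS; i++)
        out[i] = EMU_COUNTERS[i];
}

/* Reset and restart all counters, e.g. at the start of a sampling interval. */
static void restart_event_counters(void)
{
    *EMU_CTR_CTRL = CTR_STOP;
    *EMU_CTR_CTRL = CTR_RESET;
    *EMU_CTR_CTRL = CTR_START;
}
```

In the real design the start/stop/reset controls come from the ISA extension rather than a memory-mapped control register; the plain 64-bit reads are what the kernel's 10 ms sampling code (slide 18) relies on.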
11. Power Model Development
(Emulator design step 4: design power models)
- General form of the component power model: a dynamic power term plus an un-clock-gated leakage power term (see the sketch below)
- How to assign the per-event energies E_i?
  - We want the power of the emulated machine, not of the FPGA!
- Calibrate with gate-level simulations and microbenchmarks
  - Write 500-1000 instruction benchmarks exercising the events
  - Get the Leon3 gate-level netlist from Synopsys Design Compiler
  - Feed µ-benchmarks and netlist into Synopsys PrimeTime to get a component power breakdown
- Please refer to the paper for details
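The equation itself did not survive extraction; below is a plausible LaTeX reconstruction of the stated form (a dynamic term plus un-clock-gated leakage), using notation consistent with the per-event energies and idle power calibrated on the next slide. The symbols N_i, E_i, and T are assumed names; the exact formulation is in the paper.

```latex
P_{\text{component}} \;=\;
\underbrace{\frac{1}{T}\sum_{i} N_i \, E_i}_{\text{dynamic power term}}
\;+\;
\underbrace{P_{\text{idle}}}_{\text{un-clock-gated leakage power}}
% N_i: event count of type i over the sampling interval (from the 64-bit counters)
% E_i: calibrated energy per event of type i
% T  : length of the sampling interval
```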
12. Register File Switching Power Model
(Emulator design step 4: design power models)
- Write 500-instruction microbenchmarks
- Vary the event/nop ratio
- Calibrated values: Idle Power 18.83 mW, Write 0.53 nJ, Single Read 0.29 nJ, Double Read 0.39 nJ (used in the sketch below)
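A minimal sketch of turning these calibrated register-file values into an average power estimate for one sampling interval. The function name is hypothetical; the constants are the values quoted above.

```c
#include <stdint.h>

/* Calibrated register-file model values from this slide. */
#define RF_IDLE_POWER_MW    18.83   /* un-clock-gated leakage / idle power */
#define RF_WRITE_NJ          0.53
#define RF_SINGLE_READ_NJ    0.29
#define RF_DOUBLE_READ_NJ    0.39

/* Average register-file power (mW) over a sampling interval, given the
 * event-counter deltas for that interval and its length in milliseconds. */
double regfile_power_mw(uint64_t writes, uint64_t single_reads,
                        uint64_t double_reads, double interval_ms)
{
    double dynamic_nj = writes       * RF_WRITE_NJ +
                        single_reads * RF_SINGLE_READ_NJ +
                        double_reads * RF_DOUBLE_READ_NJ;

    /* nJ per ms equals µW, so divide by 1000 to express the result in mW. */
    return RF_IDLE_POWER_MW + (dynamic_nj / interval_ms) / 1000.0;
}
```

For example, 100,000 writes in a 10 ms window contribute 100,000 × 0.53 nJ / 10 ms ≈ 5.3 mW of dynamic power on top of the 18.83 mW idle power.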
13. Full-System Emulator with OS and Applications
(Emulator design step 5: boot OS and run full apps)
[System diagram: the BEE2 control-unit FPGA hosts the emulated CMP (SparcV8 Core 0, SparcV8 Core 1, AHB bus, main memory) with event counters for all modules; the emulated system boots Linux 2.6 and runs full applications (Spec2006, Splash-2, PARSEC) with knowledge of the power models. A host PC connects for I/O over RS-232 and Ethernet.]
14. Presentation Outline
- Designing the emulator
- Validating emulator power models
- Evaluating emulator speedup
- Profiling application runtime power behavior
- Case study: Activity migration
- Conclusion
15. Validating Emulator Power Models
- Extensive validation with Synopsys PrimeTime PX using:
  - Validation micro-benchmarks
    - 2x calibration micro-benchmarks, multiple event types
  - Spec 2006 benchmarks
    - Mcf, Libquantum, Bzip2, Gcc, Sjeng (train problem size)
    - Run 5 distinct 1-million-instruction snapshots (short snippets due to PrimeTime)
Module      µ-benchmarks error (%)   Spec 2006 error (%)
Pipeline    7.51                     7.58
Reg. File   6.03                     6.23
I-Cache     6.81                     7.21
D-Cache     7.21                     7.41
AHB         5.66                     7.30
16. Results: Emulation Speedup
- Speedup over an architectural simulator, Multifacet GEMS
  - 2 cores, 4 KB L1 caches
  - Mcf, Libquantum, Bzip2, Gcc, Sjeng on each core with train size
  - With Ruby: max. 35x
- Even greater speedup expected for:
  - Detailed pipeline modeling
  - Modeling greater core counts
  - Collecting power/thermal data
  - Greater FPGA clock rates
NOTE: The GEMS host uses a 64-bit, 2-GHz dual-core AMD Athlon processor.
17. Presentation Outline
- Designing the emulator
- Validating emulator power models
- Evaluating emulator speedup
- Profiling application runtime power behavior
- Case study: Activity migration
- Conclusion
18. Runtime Power Profiling
- Important for OS-controlled power-aware scheduling
- Modify the Linux kernel to feed counter values to the power models (see the sketch below)
  - Read counters within the 10 ms timer interrupt
  - Sampling rate: multiples of 10 ms
  - Accessing all 36 counters takes 5700 cycles → max. 0.87% perturbation
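A rough sketch of the per-tick sampling step, assuming hypothetical array and table names (and using floating point for clarity, where a real in-kernel version would use fixed-point arithmetic). The structure follows the slide: take counter deltas every 10 ms and feed them to the per-component power models.

```c
#define NUM_COUNTERS    36
#define NUM_COMPONENTS  5      /* pipeline, reg. file, I-cache, D-cache, AHB */
#define TICK_MS         10.0   /* sampling tied to the 10 ms timer interrupt */

/* Memory-mapped event counters (see the counter sketch on slide 10). */
extern volatile unsigned long long emu_counters[NUM_COUNTERS];

/* Calibrated per-event energies (nJ) and idle powers (mW) per component,
 * filled in from the PrimeTime calibration step (hypothetical tables). */
extern const double event_energy_nj[NUM_COMPONENTS][NUM_COUNTERS];
extern const double idle_power_mw[NUM_COMPONENTS];

static unsigned long long prev[NUM_COUNTERS];
double component_power_mw[NUM_COMPONENTS];

/* Called from the 10 ms timer interrupt path. */
void power_sample_tick(void)
{
    unsigned long long delta[NUM_COUNTERS];
    int i, c;

    for (i = 0; i < NUM_COUNTERS; i++) {
        unsigned long long now = emu_counters[i];
        delta[i] = now - prev[i];
        prev[i]  = now;
    }

    for (c = 0; c < NUM_COMPONENTS; c++) {
        double dynamic_nj = 0.0;
        for (i = 0; i < NUM_COUNTERS; i++)
            dynamic_nj += delta[i] * event_energy_nj[c][i];
        /* nJ per ms equals µW; divide by 1000 to report mW. */
        component_power_mw[c] = idle_power_mw[c]
                              + (dynamic_nj / TICK_MS) / 1000.0;
    }
}
```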
19. Runtime Power for LU (2 Threads)
[Runtime power profile for LU, annotated:]
- CPU1 is the master; CPU0 idles at 380 mW
- At barriers, CPU0 spin-waits
- A possible Reg. File hotspot cannot be tracked on a composite per-CPU profile
- Low power numbers and small swings: no L2, no FPU, no gating, simple pipeline
20. Case Study: Activity Migration
- Goal: Demonstrate use of the emulation system on a problem of real-world relevance
- Problem: Use activity migration (AM) to mitigate CMP hotspots [Heo et al. ISLPED '03, Choi et al. ISLPED '07]
- Our Solution: Modify the Linux kernel scheduler to read counters, deduce power trends, and migrate threads accordingly (see the sketch below)
- Our emulator is an ideal platform for AM studies
  - Hotspots depend on component power; the emulator directly provides this
  - On-chip temperature rise/fall times are on the order of 100 ms; the emulator is fast enough to run the OS and apps beyond this time range
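A simplified sketch of the kind of migration decision the modified scheduler could make from per-CPU power estimates. The imbalance threshold and the swap_runqueues() helper are illustrative assumptions, not the paper's implementation.

```c
#define NUM_CPUS          2
#define AM_THRESHOLD_MW   100.0   /* assumed power-imbalance threshold */

/* Per-CPU power estimates produced from the event-counter power models. */
extern double cpu_power_mw[NUM_CPUS];

/* Hypothetical helper that swaps the threads running on two cores. */
extern void swap_runqueues(int hot_cpu, int cool_cpu);

/* Invoked roughly every 2 s (see slide 24) so the ~300 ms migration cost
 * stays a small fraction of the interval. */
void activity_migration_check(void)
{
    double diff = cpu_power_mw[0] - cpu_power_mw[1];

    if (diff > AM_THRESHOLD_MW)
        swap_runqueues(0, 1);   /* CPU0 running hot: give it the cooler thread */
    else if (diff < -AM_THRESHOLD_MW)
        swap_runqueues(1, 0);   /* CPU1 running hot */
}
```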
21. Case Study: AM on Bzip2, Mcf
[Power trace with migration, annotated:]
- Bzip2: small working set, high activity, high power
- Mcf: large working set, high stalls, low power; once Mcf's data is cached, computation proceeds
- CPU0 (running Bzip2) overheats → migration is triggered → CPU0 (now running Mcf) cools off
22. Presentation Outline
- Designing the emulator
- Validating emulator power models
- Evaluating emulator speedup
- Profiling application runtime power behavior
- Case study: Activity migration
- Conclusion
23. Conclusion
- First FPGA-based emulator for full-system power-performance modeling of early-stage CMP designs
- Emulator combines HW speeds (65 MHz) with SW programmability: 35x speedup over GEMS (Ruby)
- Power models accurate within 10% of Synopsys simulations
- Can model a range of proposed designs
- Moore's Law applies to FPGAs too!
- Ongoing/future work:
  - Emulate architectures with GHz frequencies using a raw FPGA clock in the MHz range
  - DVFS emulation
  - Thermal models
24. Linux Kernel Scheduler for AM
- Avg. migration time ≈ 300 ms (65 MHz clock and small caches)
- 2 s interval for a max. 15% migration penalty
25. Modeling Leakage Power
- General form of the component power model: a dynamic power term plus an un-clock-gated leakage power term (see the sketch below)
- Leakage power depends heavily on temperature
  - A separate voltage/temperature-dependent leakage term is possible
  - The emulator runs fast enough to collect accurate temperature data → more accurate leakage power estimates
- Calibrate with leakage estimates from Synopsys PrimeTime
  - Write µ-benchmarks across a range of temperatures and observe per-component leakage variation
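As a sketch only: one way to write the extended model with a voltage/temperature-dependent leakage term. The exponential form is a common empirical choice and not necessarily the paper's formulation; \beta and T_0 would be fit from the PrimeTime leakage sweeps.

```latex
P_{\text{component}}(V, T) \;=\;
\frac{1}{T_{\text{sample}}}\sum_{i} N_i \, E_i
\;+\;
\underbrace{P_{\text{leak}}(V, T_0)\, e^{\beta\,(T - T_0)}}_{\text{temperature-dependent leakage}}
% T_0  : reference temperature at which leakage is calibrated
% \beta: empirical coefficient fit from per-component leakage sweeps
```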