Title: Identifying Energy-Efficient Concurrency Levels Using Machine Learning

1. Identifying Energy-Efficient Concurrency Levels Using Machine Learning
Matthew Curtis-Maury (1,3), Karan Singh (2,3), Sally A. McKee (2), Filip Blagojevic (1), Dimitrios S. Nikolopoulos (1), Bronis R. de Supinski (3), Martin Schulz (3)
1) Virginia Tech, 2) Cornell University, 3) Lawrence Livermore National Lab
2. Outline
- Evaluation of scientific application scalability
  - Study on an Intel quad-core processor
- Adaptive concurrency throttling
  - Reduce concurrency at runtime
  - Improve energy efficiency
- Runtime scalability prediction
  - Artificial neural networks
  - Select the optimal concurrency for each program phase
3. High levels of parallelism becoming the norm
- Microarchitectural focus shifting towards parallelism
  - Diminishing returns from exploiting ILP
  - CMP, SMT, and heterogeneous multi-core
  - Intel's 80-core prototype
  - Industry predictions include 100s of cores within a decade
- However, modern scientific applications often scale poorly
  - Even in applications with plenty of parallelism
  - Interaction between hardware and software
  - Scalability bottlenecks in shared HW resources
- Must achieve scalability to be energy-efficient
4. Our experimental multi-core platform
- Dell Precision 390 Workstation
  - Linux kernel 2.6.18
  - 2 GB main memory
  - Intel Q6600 quad-core processor
- NAS Parallel Benchmarks using OpenMP
  - Extensively optimized for parallelism and locality
- Ran benchmarks with all static configurations
[Figure: the Intel quad-core (Core1-Core4) and the static thread configurations evaluated: 1, 2s, 2p, 3, and 4]
5. Scalability analysis of NAS Parallel Benchmarks: Good scalability
- Application exhibits very good scalability to 4 cores (2.69X)
- Also shows increasing power consumption (1.31X)
- Good energy efficiency results from the scalability
- Example shows the potential of CMPs
  - Proves that multi-core can be effective for some types of scientific applications
6. Scalability analysis of NAS Parallel Benchmarks: Poor scalability
- IS has extremely poor scalability on this architecture
  - More cores hurt performance significantly
- Results with the shared cache on configurations 2s, 3, and 4 reveal the cause
- Power does not necessarily increase with more cores for this application
  - Contention leads to reduced utilization, and therefore lower power consumption
- This occurs at only the quad-core level
  - Suggests problems for scientific applications on future many-core processors
7. Scalability analysis of NAS Parallel Benchmarks: Phase variability
- Scalability can vary greatly by phase in parallel applications
- Different phases of SP are optimal at 4 different configurations
- Other applications experience similar variability
- We exploit this property later in this work
8. A high-level view of our library, ACTOR
[Figure: ACTOR architecture. The application supplies hardware event counters (HECs) to the runtime system, whose ANN-model-based Performance Predictor and Decision Enforcer drive adaptation on the hardware, yielding a self-adapting application]
9. Concurrency throttling in multithreaded programs
[Figure: parallel regions in a multithreaded program]
- Concurrency throttling
  - Modifying the number of threads used to execute a parallel code region, as well as the placement of threads on processing elements (a minimal sketch follows this list)
  - Optimal decision depends upon the execution characteristics of the phase (OpenMP parallel region) in question
- Why throttle concurrency? (Why not use all available processing elements?)
  - Decrease execution time by alleviating scalability bottlenecks
  - Often will also reduce power consumption by disabling cores
  - Possibly a better use of additional cores (e.g., prefetching, hypervisors, reliability)
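To make this concrete, here is a minimal OpenMP sketch of per-region concurrency throttling. The phase_threads table and run_phase function are hypothetical illustrations; thread placement would additionally require pinning threads to cores (e.g., via sched_setaffinity), which is omitted here.

#include <omp.h>

/* Hypothetical per-phase decisions: the thread count chosen for
 * each parallel region (phase) of the application. */
static const int phase_threads[] = { 4, 2, 4, 1 };

/* Throttle concurrency for one phase by setting the thread count
 * immediately before entering its parallel region. */
void run_phase(int phase, double *a, const double *b, int n) {
    omp_set_num_threads(phase_threads[phase]);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}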
10. Making decisions for adaptive concurrency throttling
- How can we decide what configuration to use for each phase?
- Test all possible configurations, select the best performance
- Test a reduced set of configurations (HPPAC'06)
  - Each sample requires one iteration of execution
  - Comes with overhead: execution on suboptimal configurations
  - Becomes a problem on machines with many potential choices
- Can instead find the optimal configuration through prediction
  - Reduces search overhead: few samples suffice to observe behavior (see the sketch below)
- We utilize machine learning (ANNs) to make performance predictions
  - In previous work we considered multiple linear regression (ICS'06)
  - Regression requires detailed architectural knowledge in the model
  - ANNs provide a non-linear model with no user-provided domain knowledge
  - Further, here we consider CMPs rather than SMPs/SMTs
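As a sketch of selection by prediction, assume a hypothetical predict_ipc() wrapper around the ANN model described on the next two slides. Rather than timing every configuration, the runtime samples one configuration and picks the one with the highest predicted IPC.

#define NUM_CONFIGS 5  /* e.g., configurations 1, 2s, 2p, 3, and 4 */

/* Hypothetical wrapper around the ANN model: predicts IPC on
 * configuration cfg from the IPC and HEC rates sampled at
 * maximal concurrency. */
double predict_ipc(int cfg, const double *hec_rates, double ipc_sample);

/* Return the configuration with the highest predicted IPC. */
int best_configuration(const double *hec_rates, double ipc_sample) {
    int best = 0;
    double best_ipc = predict_ipc(0, hec_rates, ipc_sample);
    for (int cfg = 1; cfg < NUM_CONFIGS; cfg++) {
        double ipc = predict_ipc(cfg, hec_rates, ipc_sample);
        if (ipc > best_ipc) {
            best_ipc = ipc;
            best = cfg;
        }
    }
    return best;
}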
11. What are artificial neural networks?
- Machine learning studies algorithms that learn automatically through experience
- We specifically use artificial neural networks (ANNs)
  - Learn to predict one or more targets from a set of inputs
  - Well suited for generalized non-linear regression
- We use the Fusion Predictive Modeling Tools (FPMT) software
12. ANN-based dynamic performance prediction
- Predict the effects of changing concurrency and thread placement
- Map data from execution at maximal concurrency to performance on other configurations
  - Specifically, hardware event counters (HECs) expressed as rates
- Predict performance in terms of IPC
- Make predictions for each phase of the application
[Figure: HEC rates from one sample configuration map to predicted targets on the remaining configurations]
13. ANN-based dynamic performance prediction (2)
- Develop an ANN-based model of performance
  - T = target configuration, S = sample configuration, e_x = event x
  - IPC_T = F_T(IPC_S, e_{1,S}, e_{2,S}, ..., e_{n,S}) (sketched below)
- Model training
  - Select a set of training applications and identify phase boundaries
  - Collect samples offline to serve as training input
  - Feed the sample training data into the ANN software
  - Generates one model, F_T, per target configuration from the training data for target T
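To illustrate the shape of one F_T, here is a minimal feed-forward sketch with a single hidden layer. The layer width, the tanh activation, and the weights are placeholder assumptions; the actual networks are trained offline by the FPMT software.

#include <math.h>

#define N_IN  3   /* IPC_S plus two event rates (2 counters available) */
#define N_HID 8   /* hidden-layer width: a placeholder choice */

/* Weights and biases for one target configuration T, learned offline. */
typedef struct {
    double w1[N_HID][N_IN];  /* input -> hidden weights */
    double b1[N_HID];        /* hidden biases */
    double w2[N_HID];        /* hidden -> output weights */
    double b2;               /* output bias */
} ann_model;

/* F_T: predict IPC_T from in[] = { IPC_S, e_{1,S}, e_{2,S} }. */
double ann_predict(const ann_model *m, const double in[N_IN]) {
    double out = m->b2;
    for (int h = 0; h < N_HID; h++) {
        double z = m->b1[h];
        for (int i = 0; i < N_IN; i++)
            z += m->w1[h][i] * in[i];
        out += m->w2[h] * tanh(z);  /* non-linear hidden unit */
    }
    return out;  /* predicted IPC on target configuration T */
}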
14. ANN-based dynamic performance prediction (3)
- Select counters based on their expected impact on scalability
  - E.g., L2 misses, bus accesses, stall cycles, etc.
  - These events themselves largely determine the observed scalability
[Figure: live-execution timeline; each phase first executes with maximal concurrency to collect a sample]
- Determine the number of counters to use by limiting overhead
  - Experimental platform supports 2 simultaneous counters (illustrative counter-sampling sketch below)
  - We limit sample iterations to 20% of total execution
15. Evaluation of the predictor: Prediction accuracy
- Median error in IPC prediction is 9% (compared to the observed value)
- The more important metric is identification of optimal concurrency levels
  - The single best configuration was selected for 59% of phases
  - A further 29% selected the second-best configuration
16. Evaluation of adaptation: Performance results
- Large gains in performance compared to 4 cores: 6.5% on average
  - 4 cores is the default choice of a naïve developer
- Comparable to oracular input, but not quite as good
  - Could be improved on architectures with more counter registers
- Even some scalable applications benefit, due to phase awareness
17. Evaluation of adaptation: Energy results
- No power savings on average through concurrency throttling
  - Throttling in response to contention increases processor utilization
  - Throttling is more effective on SMPs, where each unit consumes considerable power
- Substantial average energy savings still result (5.2%)
  - Due to the reduction in execution time without increasing power
- Energy-delay-squared (ED2) improvement of 17.2% on average
18. Conclusions and contributions
- Modern scientific applications achieve varying scalability on multi-core
  - Some applications scale quite well and achieve good energy efficiency
  - Others see substantial performance losses from using more cores
- Utilized artificial neural networks to predict performance across hardware configurations using hardware event counts
  - Median model prediction accuracy of 91%
  - Results in successful identification of improved concurrency levels
  - Reduces the end-user burden in model training compared to regression
- Adapted concurrency per phase at runtime, based on ANN predictions of performance, on a real quad-core CMP
  - Achieved substantial performance improvements: 6.5% on average
  - Improved the energy efficiency of many of the applications
- The benefit of adaptation is likely to grow in the future