Title: Identifying Energy-Efficient Concurrency Levels Using Machine Learning

1. Identifying Energy-Efficient Concurrency Levels Using Machine Learning
Matthew Curtis-Maury (1,3), Karan Singh (2,3), Sally A. McKee (2), Filip Blagojevic (1), Dimitrios S. Nikolopoulos (1), Bronis R. de Supinski (3), Martin Schulz (3)
1) Virginia Tech, 2) Cornell University, 3) Lawrence Livermore National Lab
2. Outline
- Evaluation of scientific application scalability
  - Study on an Intel quad-core processor
- Adaptive concurrency throttling
  - Reduce concurrency at runtime
  - Improve energy efficiency
- Runtime scalability prediction
  - Artificial neural networks
  - Select the optimal concurrency for each program phase
3. High levels of parallelism becoming the norm
- Microarchitectural focus shifting towards parallelism
  - Diminishing returns from exploiting ILP
  - CMP, SMT, and heterogeneous multi-core
  - Intel's 80-core prototype
  - Industry predictions include 100s of cores within a decade
- However, modern scientific applications often scale poorly
  - Even in applications with plenty of parallelism
  - Interaction between hardware and software
  - Scalability bottlenecks in shared HW resources
- Must achieve scalability to be energy-efficient
4. Our experimental multi-core platform
- Dell Precision 390 Workstation
  - Linux kernel 2.6.18
  - 2 GB main memory
  - Intel Q6600 quad-core processor
- NAS Parallel Benchmarks using OpenMP
  - Extensively optimized for parallelism and locality
- Ran benchmarks with all static configurations
[Figure: the Intel quad-core (Core1-Core4) and the static thread configurations evaluated: 1, 2s, 2p, 3, and 4]
5. Scalability analysis of NAS Parallel Benchmarks: Good scalability
- Application exhibits very good scalability to 4 cores (2.69X)
- Also shows increasing power consumption (1.31X)
- Good energy efficiency results from the scalability
- Example shows the potential of CMPs
  - Proves that multi-core can be effective for some types of scientific applications
6. Scalability analysis of NAS Parallel Benchmarks: Poor scalability
- IS has extremely poor scalability on this architecture
  - More cores hurt performance significantly
- Results with the shared cache on configurations 2s, 3, and 4 reveal the cause
- Power does not necessarily increase with more cores for this application
  - Contention leads to reduced utilization, and therefore lower power consumption
- This occurs at only the quad-core level
  - Suggests problems for scientific applications on future many-core processors
7. Scalability analysis of NAS Parallel Benchmarks: Phase variability
- Scalability can vary greatly by phase in parallel applications
- Different phases of SP are optimal at 4 different configurations
- Other applications experience similar variability
- We exploit this property later in this work
8. A high-level view of our library, ACTOR
[Figure: ACTOR architecture. The application supplies hardware event counters (HECs) to the runtime system, whose ANN-model-based Performance Predictor and Decision Enforcer drive adaptation on the hardware, yielding a self-adapting application]
9. Concurrency throttling in multithreaded programs
[Figure: parallel regions in a multithreaded program]
- Concurrency throttling
  - Modifying the number of threads used to execute a parallel code region, as well as the placement of threads on processing elements (a minimal sketch follows this list)
  - Optimal decision depends upon the execution characteristics of the phase (OpenMP parallel region) in question
- Why throttle concurrency? (Why not use all available processing elements?)
  - Decrease execution time by alleviating scalability bottlenecks
  - Often will also reduce power consumption by disabling cores
  - Possibly a better use of additional cores (e.g., prefetching, hypervisors, reliability)
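To make this concrete, here is a minimal OpenMP sketch of per-region concurrency throttling. The phase_threads table and run_phase function are hypothetical illustrations; thread placement would additionally require pinning threads to cores (e.g., via sched_setaffinity), which is omitted here.

#include <omp.h>

/* Hypothetical per-phase decisions: the thread count chosen for
 * each parallel region (phase) of the application. */
static const int phase_threads[] = { 4, 2, 4, 1 };

/* Throttle concurrency for one phase by setting the thread count
 * immediately before entering its parallel region. */
void run_phase(int phase, double *a, const double *b, int n) {
    omp_set_num_threads(phase_threads[phase]);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}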
10. Making decisions for adaptive concurrency throttling
- How can we decide what configuration to use for each phase?
- Test all possible configurations, select the best performance
- Test a reduced set of configurations (HPPAC'06)
  - Each sample requires one iteration of execution
  - Comes with overhead: execution on suboptimal configurations
  - Becomes a problem on machines with many potential choices
- Can instead find the optimal configuration through prediction
  - Reduces search overhead: few samples suffice to observe behavior (see the sketch below)
- We utilize machine learning (ANNs) to make performance predictions
  - In previous work we considered multiple linear regression (ICS'06)
  - Regression requires detailed architectural knowledge in the model
  - ANNs provide a non-linear model with no user-provided domain knowledge
  - Further, here we consider CMPs rather than SMPs/SMTs
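As a sketch of selection by prediction, assume a hypothetical predict_ipc() wrapper around the ANN model described on the next two slides. Rather than timing every configuration, the runtime samples one configuration and picks the one with the highest predicted IPC.

#define NUM_CONFIGS 5  /* e.g., configurations 1, 2s, 2p, 3, and 4 */

/* Hypothetical wrapper around the ANN model: predicts IPC on
 * configuration cfg from the IPC and HEC rates sampled at
 * maximal concurrency. */
double predict_ipc(int cfg, const double *hec_rates, double ipc_sample);

/* Return the configuration with the highest predicted IPC. */
int best_configuration(const double *hec_rates, double ipc_sample) {
    int best = 0;
    double best_ipc = predict_ipc(0, hec_rates, ipc_sample);
    for (int cfg = 1; cfg < NUM_CONFIGS; cfg++) {
        double ipc = predict_ipc(cfg, hec_rates, ipc_sample);
        if (ipc > best_ipc) {
            best_ipc = ipc;
            best = cfg;
        }
    }
    return best;
}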
11. What are artificial neural networks?
- Machine learning studies algorithms that learn automatically through experience
- We specifically use artificial neural networks (ANNs)
  - Learn to predict one or more targets from a set of inputs
  - Well suited for generalized non-linear regression
- We use the Fusion Predictive Modeling Tools (FPMT) software
12. ANN-based dynamic performance prediction
- Predict the effects of changing concurrency and thread placement
- Map data from execution at maximal concurrency to performance on other configurations
  - Specifically, hardware event counters (HECs) expressed as rates
- Predict performance in terms of IPC
- Make predictions for each phase of the application
[Figure: HEC rates from one sample configuration map to predicted targets on the remaining configurations]
13. ANN-based dynamic performance prediction (2)
- Develop an ANN-based model of performance
  - T = target configuration, S = sample configuration, e_x = event x
  - IPC_T = F_T(IPC_S, e_{1,S}, e_{2,S}, ..., e_{n,S}) (sketched below)
- Model training
  - Select a set of training applications and identify phase boundaries
  - Collect samples offline to serve as training input
  - Feed the sample training data into the ANN software
  - Generates one model, F_T, per target configuration from the training data for target T
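To illustrate the shape of one F_T, here is a minimal feed-forward sketch with a single hidden layer. The layer width, the tanh activation, and the weights are placeholder assumptions; the actual networks are trained offline by the FPMT software.

#include <math.h>

#define N_IN  3   /* IPC_S plus two event rates (2 counters available) */
#define N_HID 8   /* hidden-layer width: a placeholder choice */

/* Weights and biases for one target configuration T, learned offline. */
typedef struct {
    double w1[N_HID][N_IN];  /* input -> hidden weights */
    double b1[N_HID];        /* hidden biases */
    double w2[N_HID];        /* hidden -> output weights */
    double b2;               /* output bias */
} ann_model;

/* F_T: predict IPC_T from in[] = { IPC_S, e_{1,S}, e_{2,S} }. */
double ann_predict(const ann_model *m, const double in[N_IN]) {
    double out = m->b2;
    for (int h = 0; h < N_HID; h++) {
        double z = m->b1[h];
        for (int i = 0; i < N_IN; i++)
            z += m->w1[h][i] * in[i];
        out += m->w2[h] * tanh(z);  /* non-linear hidden unit */
    }
    return out;  /* predicted IPC on target configuration T */
}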
14. ANN-based dynamic performance prediction (3)
- Select counters based on their expected impact on scalability
  - E.g., L2 misses, bus accesses, stall cycles, etc.
  - These events themselves largely determine the observed scalability
[Figure: live-execution timeline; each phase first executes with maximal concurrency to collect a sample]
- Determine the number of counters to use by limiting overhead
  - Experimental platform supports 2 simultaneous counters (illustrative counter-sampling sketch below)
  - We limit sample iterations to 20% of total execution
15. Evaluation of the predictor: Prediction accuracy
- Median error in IPC prediction is 9% (compared to the observed value)
- The more important metric is identification of optimal concurrency levels
  - The single best configuration was selected for 59% of phases
  - A further 29% selected the second-best configuration
16. Evaluation of adaptation: Performance results
- Large gains in performance compared to 4 cores: 6.5% on average
  - 4 cores is the default choice of a naïve developer
- Comparable to oracular input, but not quite as good
  - Could be improved on architectures with more counter registers
- Even some scalable applications benefit, due to phase awareness
17. Evaluation of adaptation: Energy results
- No power savings on average through concurrency throttling
  - Throttling in response to contention increases processor utilization
  - Throttling is more effective on SMPs, where each unit consumes considerable power
- Substantial average energy savings still result (5.2%)
  - Due to the reduction in execution time without increasing power
- Energy-delay-squared (ED2) improvement of 17.2% on average
18. Conclusions and contributions
- Modern scientific applications achieve varying scalability on multi-core
  - Some applications scale quite well and achieve good energy efficiency
  - Others see substantial performance losses from using more cores
- Utilized artificial neural networks to predict performance across hardware configurations using hardware event counts
  - Median model prediction accuracy of 91%
  - Results in successful identification of improved concurrency levels
  - Reduces the end-user burden in model training compared to regression
- Adapted concurrency per phase at runtime, based on ANN predictions of performance, on a real quad-core CMP
  - Achieved substantial performance improvements: 6.5% on average
  - Improved the energy efficiency of many of the applications
- The benefit of adaptation is likely to grow in the future