1
Identifying Energy-Efficient Concurrency Levels
Using Machine Learning
Matthew Curtis-Maury1,3, Karan Singh2,3, Sally A.
McKee2, Filip Blagojevic1, Dimitrios S.
Nikolopoulos1, Bronis R. de Supinski3, Martin
Schulz3 1) Virginia Tech 2) Cornell
University 3) Lawrence Livermore National Lab
2
Outline
  • Evaluation of scientific application scalability
  • Study on an Intel quad-core processor
  • Adaptive concurrency throttling
  • Reduce concurrency at runtime
  • Improve energy-efficiency
  • Runtime scalability prediction
  • Artificial Neural Networks
  • Select optimal concurrency for each program phase

3
High levels of parallelism becoming the norm
  • Microarchitectural focus shifting towards
    parallelism
  • Diminishing returns from exploiting ILP
  • CMP, SMT, and heterogeneous multi-core
  • Intel's 80-core prototype
  • Industry predictions include 100s of cores
    within a decade
  • However, modern scientific applications often
    scale poorly
  • Even in applications with plenty of parallelism
  • Interaction between hardware and software
  • Scalability bottlenecks in shared HW resources
  • Must achieve scalability to be energy-efficient

4
Our experimental multi-core platform
  • Dell Precision 390 Workstation
  • Linux kernel 2.6.18
  • 2GB main memory
  • Intel Q6600 quad-core processor
  • NAS Parallel Benchmarks using OpenMP
  • Extensively optimized for parallelism and
    locality
  • Ran benchmarks with all static configurations

[Figure: Intel quad-core processor (Core1, Core2, Core3, Core4) and the static thread-placement configurations evaluated: 1, 2p, 2s, 3, and 4]
5
Scalability analysis of NAS Parallel Benchmarks: Good scalability
  • Application exhibits very good scalability to 4
    cores (2.69X)
  • Also shows increasing power consumption (1.31X)
  • Good energy-efficiency results from the
    scalability
  • Example shows potential of CMPs
  • Proves that multi-core can be effective for some
    types of scientific applications

6
Scalability analysis of NAS Parallel Benchmarks: Poor scalability
  • IS has extremely poor scalability on this
    architecture
  • More cores hurt performance significantly
  • Results with a shared cache (configurations 2s, 3, and 4)
    reveal the cause
  • Power does not necessarily increase with more
    cores for this application
  • Contention leads to reduced utilization, and
    therefore lower power consumption
  • This occurs at only the quad-core level
  • Suggests problems for scientific applications on
    future many-core processors

7
Scalability analysis of NAS Parallel Benchmarks: Phase variability
  • Scalability can vary greatly by phase in parallel
    applications
  • Different phases of SP reach their optimum at 4 different
    configurations
  • Other applications experience similar variability
  • We exploit this property later in this work

8
A high-level view of our library ACTOR
[Block diagram: the Application uses ACTOR (ANN Model, Performance Predictor, Decision Enforcer) within the Runtime System; hardware event counters (HECs) from the Hardware feed the predictor, producing a Self-Adapting Application]
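As a rough illustration of the structure in the diagram, here is a minimal C/OpenMP sketch of how an application might bracket each phase with calls into a phase-adaptive runtime. The function names actor_phase_begin and actor_phase_end are hypothetical, introduced only for this sketch; they are not ACTOR's actual API.

```c
/* Hypothetical phase-adaptive runtime interface, for illustration only.
 * These are NOT ACTOR's real functions; they only mirror the roles in the
 * diagram: predict and enforce a configuration, then collect counter data. */
static void actor_phase_begin(int phase_id)
{
    (void)phase_id;
    /* Performance Predictor: choose a concurrency level for this phase.
     * Decision Enforcer: set the thread count / placement accordingly. */
}

static void actor_phase_end(int phase_id)
{
    (void)phase_id;
    /* Collect hardware event counter (HEC) samples for this phase. */
}

void application_phase(double *a, const double *b, int n)
{
    actor_phase_begin(0);               /* runtime adapts concurrency here */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += 0.5 * b[i];
    actor_phase_end(0);                 /* runtime records this phase's behavior */
}
```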
9
Concurrency throttling in multithreaded programs
[Figure: parallel regions in a multithreaded program]
  • Concurrency throttling
  • Modifying the number of threads used to execute a
    parallel code region, as well as the placement of
    threads on processing elements
  • Optimal decision depends upon execution
    characteristics of the phase (OpenMP parallel
    region) in question
  • Why throttle concurrency? (Why not use all
    available processing elements?)
  • Decrease execution time by alleviating
    scalability bottlenecks
  • Often will also reduce power consumption by
    disabling cores
  • Possibly a better use of additional cores (e.g.,
    prefetching, hypervisors, reliability)
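
The bullet above on modifying the number of threads per parallel region can be made concrete with a minimal OpenMP/C sketch. The thread count of 2 is an arbitrary illustrative value, and the affinity mechanisms named in the comment are common options rather than the method used in this work.

```c
/* Minimal sketch of concurrency throttling for one OpenMP parallel region.
 * In an adaptive system the thread count would come from a predictor;
 * here it is hard-coded purely for illustration. */
void compute_phase(double *x, int n)
{
    int threads_for_this_phase = 2;   /* e.g., 2 of 4 cores for a memory-bound phase */

    /* Thread placement can additionally be constrained with affinity settings
     * (for example, CPU-affinity environment variables or sched_setaffinity). */
    #pragma omp parallel for num_threads(threads_for_this_phase)
    for (int i = 0; i < n; i++)
        x[i] = x[i] * x[i] + 1.0;
}
```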

10
Making decisions for adaptive concurrency
throttling
  • How can we decide what configuration to use for
    each phase?
  • Test all possible configurations, select best
    performance
  • Test reduced set of configurations (HPPAC'06)
  • Each sample requires one iteration of execution
  • Comes with overhead: execution on suboptimal
    configurations
  • Becomes a problem on machines with many potential
    choices
  • Can instead find the optimal configuration through
    prediction (see the selection sketch after this list)
  • Reduces search overhead: only a few samples are needed
    to observe behavior
  • We utilize machine learning (ANNs) to make
    performance predictions
  • In previous work we have considered multiple
    linear regression (ICS'06)
  • Regression requires detailed architectural
    knowledge in the model
  • ANNs provide non-linear model, no user-provided
    domain knowledge
  • Further, here we consider CMPs rather than
    SMPs/SMTs
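
To contrast the two strategies, the C sketch below shows prediction-based selection: one sample run supplies counter rates and IPC, and the configuration with the highest predicted IPC is chosen instead of timing every configuration. The predict_ipc stub stands in for the ANN model described on the following slides; it is an assumption of this sketch, not code from the original work.

```c
#define NUM_CONFIGS 5   /* e.g., configurations 1, 2p, 2s, 3, and 4 on the quad-core */

/* Placeholder for the trained ANN model: a real implementation would evaluate
 * the learned model for target configuration c on the observed sample data. */
static double predict_ipc(int c, const double *counter_rates, double sample_ipc)
{
    (void)c; (void)counter_rates;
    return sample_ipc;   /* stub value only */
}

/* Prediction-based selection: one sample run, then argmax of predicted IPC. */
int select_configuration(const double *counter_rates, double sample_ipc)
{
    int best = 0;
    double best_ipc = predict_ipc(0, counter_rates, sample_ipc);
    for (int c = 1; c < NUM_CONFIGS; c++) {
        double ipc = predict_ipc(c, counter_rates, sample_ipc);
        if (ipc > best_ipc) { best_ipc = ipc; best = c; }
    }
    return best;   /* exhaustive search would instead time all NUM_CONFIGS runs */
}
```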

11
What are artificial neural networks?
  • Machine learning studies algorithms that learn
    automatically through experience
  • We specifically use artificial neural networks
    (ANNs)
  • Learn to predict one or more targets from set of
    inputs
  • Well suited for generalized non-linear regression
  • Use the Fusion Predictive Modeling Tools (FPMT)
    software

12
ANN-based dynamic performance prediction
  • Predict effects of changing concurrency and
    thread placement
  • Map data from execution at maximal concurrency to
    performance on other configurations
  • Specifically, hardware event counter (HEC) rates
  • Predict performance in terms of IPC
  • Make predictions for each phase of the application

13
ANN-based dynamic performance prediction (2)
  • Develop ANN-based model of performance
  • T = target configuration, S = sample configuration,
    e_x = event x
  • Model training
  • Select set of training applications and identify
    phase boundaries
  • Collect samples offline to serve as training
    input
  • Feed sample training data into ANN software
  • Will generate a model, F_T, for each target
    configuration

IPC_T = F_T(IPC_S, e_(1,S), e_(2,S), ..., e_(n,S))
Training data for target T
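
As an illustration of what evaluating one trained model F_T could look like at runtime, here is a small C sketch assuming a single-hidden-layer feed-forward network with a sigmoid activation. The layer sizes and the activation function are assumptions made for the sketch, not details taken from the slides.

```c
#include <math.h>

#define N_INPUTS  3    /* IPC_S plus two event-counter rates (illustrative sizes) */
#define N_HIDDEN  4

/* One trained model F_T per target configuration: weights learned offline. */
typedef struct {
    double w_hidden[N_HIDDEN][N_INPUTS];
    double b_hidden[N_HIDDEN];
    double w_out[N_HIDDEN];
    double b_out;
} ann_model;

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* IPC_T = F_T(IPC_S, e_(1,S), ..., e_(n,S)): forward pass through the network. */
double predict_ipc_for_target(const ann_model *m, const double inputs[N_INPUTS])
{
    double out = m->b_out;
    for (int h = 0; h < N_HIDDEN; h++) {
        double z = m->b_hidden[h];
        for (int i = 0; i < N_INPUTS; i++)
            z += m->w_hidden[h][i] * inputs[i];
        out += m->w_out[h] * sigmoid(z);
    }
    return out;   /* predicted IPC on the target configuration */
}
```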
14
ANN-based dynamic performance prediction (3)
  • Select counters based on expected impact on
    scalability
  • E.g., L2 misses, bus accesses, stall cycles, etc.
  • Events themselves largely determine the observed
    scalability

[Timeline figure: during live execution, a phase is run once at maximal concurrency to collect counter samples]
  • Determine number of counters to use by limiting
    overhead
  • Experimental platform supports 2 simultaneous
    counters
  • We limit sample iterations to 20% of total
    execution
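
The sketch below shows one way the two simultaneous counters could be sampled around a single phase iteration. PAPI is used here only as a generic counter interface and the two events chosen are illustrative; this is an assumption of the sketch, not necessarily the tool or events used in the original work.

```c
#include <papi.h>

/* Sketch: sample two hardware events across one iteration of a phase,
 * then convert the raw counts into rates for the prediction model. */
void sample_phase(void (*run_phase)(void), double rates[2])
{
    int es = PAPI_NULL;
    long long counts[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_L2_TCM);   /* L2 cache misses (illustrative choice) */
    PAPI_add_event(es, PAPI_RES_STL);  /* cycles stalled on resources           */

    long long t0 = PAPI_get_real_usec();
    PAPI_start(es);
    run_phase();                       /* one iteration at maximal concurrency */
    PAPI_stop(es, counts);
    long long t1 = PAPI_get_real_usec();

    /* Event rates (events per microsecond) serve as model inputs. */
    rates[0] = (double)counts[0] / (double)(t1 - t0);
    rates[1] = (double)counts[1] / (double)(t1 - t0);
}
```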

15
Evaluation of the predictor: Prediction accuracy
  • Median error in IPC prediction is 9% (compared to the
    observed value)
  • More important metric is identification of
    optimal concurrency levels
  • The single best configuration is selected for 59% of
    phases
  • Another 29% of phases select the second-best configuration

16
Evaluation of adaptation: Performance results
  • Large gains in performance compared to using all 4 cores,
    6.5% on average
  • Default choice of naïve developer
  • Comparable to oracular input, but not quite as
    good
  • Could be improved on architectures with more
    counter registers
  • Even some scalable applications benefit due to
    phase awareness

17
Evaluation of adaptation: Energy results
  • No power savings on average through concurrency
    throttling
  • Throttling in response to contention increases
    processor utilization
  • More effective on SMPs where each unit consumes
    considerable power
  • Substantial average energy savings still result
    (5.2%)
  • Due to the reduction in execution time without
    increasing power
  • Energy-Delay² (ED²) improvement of 17.2% on average

18
Conclusions and contributions
  • Modern scientific applications achieve varying
    scalability on multi-core
  • Some applications scale quite well and achieve
    good energy-efficiency
  • Others see substantial performance losses through
    more cores
  • Utilized artificial neural networks to predict
    performance across hardware configurations using
    hardware event counts
  • Median model prediction accuracy of 91%
  • Results in successful identification of improved
    concurrency levels
  • Reduces end-user burden in model training
    compared to regression
  • Adapted concurrency per phase at runtime based on
    ANN-based predictions of performance on a real
    quad-core CMP
  • Achieved substantial improvements in performance of
    6.5% on average
  • Improved the energy-efficiency of many of the
    applications
  • Benefit of adaptation likely to improve in the
    future