Title: Dissecting On-node Memory Performance with MemAxes
1. Dissecting On-node Memory Performance with MemAxes
Petascale Tools Workshop 2014
Madison, WI, August 4-7, 2014
Alfredo Gimenez, Todd Gamblin, Martin Schulz, Peer-Timo Bremer, Barry Rountree, Abhinav Bhatele, Ilir Jusufi, and Bernd Hamann
LLNL
UC Davis
2. Memory Access Sampling

- Recent hardware additions allow us to precisely sample events, including memory accesses
  - Intel PEBS, AMD IBS
- Memory access samples contain
  - The instruction pointer
  - The address accessed
  - How many core clock cycles elapsed during the access
  - Where in the memory hierarchy the address was resolved (e.g., L1 cache, local RAM, remote RAM)
- We need a way to meaningfully interpret these samples
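For reference, these sample fields can be requested on Linux through the perf_events interface. A minimal sketch, assuming the raw event code 0x1cd (the load-latency PEBS event on several Intel microarchitectures) and an illustrative latency threshold; consult the CPU manual for the correct encodings on a given part:

/* Sketch: configure PEBS-style load-latency sampling via perf_events.
 * The raw event code and ldlat threshold below are assumptions. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

static int open_load_latency_sampler(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size          = sizeof(attr);
    attr.type          = PERF_TYPE_RAW;
    attr.config        = 0x1cd;   /* assumed: Intel load-latency PEBS event */
    attr.config1       = 3;       /* assumed: minimum latency (ldlat) in cycles */
    attr.sample_period = 1000;    /* one sample per 1000 qualifying loads */
    attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                         PERF_SAMPLE_WEIGHT | PERF_SAMPLE_DATA_SRC;
    attr.precise_ip    = 2;       /* request precise (PEBS) samples */
    attr.exclude_kernel = 1;

    /* monitor the calling thread on any CPU */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

Each sample then carries exactly the fields above: the instruction pointer, the accessed address, the access latency in cycles (the "weight"), and the data source (which level of the hierarchy resolved it).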
3. Adding Context

- We can better understand memory references with appropriate context
- Contexts include
  - The code
  - The node hardware topology
  - The calling context (call path)
  - The application domain (e.g., fluid dynamics)
- Related work by Liu and Mellor-Crummey has mapped access latencies to particular variables, call paths, and access patterns.
4. We can already get coarse-grained application context for some codes

- Physics data is available in data structures
- Time steps are easy to mark in the code
- Per-process performance is easy to get:
  - turn on counters at the beginning of the run
  - read them periodically (sketched below)
- What if we want finer-grained attribution?
  - How to tie measurements to data structures?
  - How to slice and dice the data?
(Figure: FLOP/s per MPI process for an aluminum problem)
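A minimal sketch of that turn-on-then-poll pattern using the PAPI counter API; the event choice (PAPI_FP_OPS) and the timestep loop are illustrative:

/* Per-process counter polling: start counters once, read each timestep. */
#include <papi.h>
#include <stdio.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long flops;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_FP_OPS);   /* floating-point operation count */
    PAPI_start(evset);                    /* turn on at the start of the run */

    for (int step = 0; step < 100; step++) {
        /* ... timestep work ... */
        PAPI_read(evset, &flops);         /* read periodically */
        printf("step %d: %lld FP ops\n", step, flops);
    }
    return 0;
}

This yields one number per process per interval; it says nothing about which data structure the operations touched, which is exactly the finer-grained attribution question above.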
5. Node topology is easy to get, but not shown clearly

- PEBS provides metadata that ties samples to the node topology
- We want to highlight connections between components clearly, to show
  - Load distribution
  - Bandwidth
  - Resource contention
- The existing visualization from hwloc (right)
  - does not scale
  - clutters connections between components
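Getting the topology itself is indeed straightforward; a minimal sketch using the hwloc C API to enumerate the objects (packages, caches, cores, PUs) that a clearer visualization would need to lay out:

/* Walk the hwloc topology tree level by level. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);   /* discover this node's hardware */

    int depth = hwloc_topology_get_depth(topo);
    for (int d = 0; d < depth; d++) {
        unsigned n = hwloc_get_nbobjs_by_depth(topo, d);
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topo, d, 0);
        printf("depth %d: %u x %s\n", d, n,
               hwloc_obj_type_string(obj->type));
    }
    hwloc_topology_destroy(topo);
    return 0;
}

The challenge is not obtaining this tree but drawing it so that per-sample data motion across it remains legible.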
6. We have developed a measurement tool for collecting detailed context

- Uses PEBS sampling for hardware information
- Supplements it with application instrumentation that maps addresses to physical coordinates
- A Semantic Memory Tree (SMT) data structure uses registered callbacks to map sampled instruction operands into the application domain
7. Currently the developer has to instrument the application manually

- Add calls that register metadata for allocated objects:
  - Label string
  - Start and end addresses
  - Size of each element
  - Number of elements
  - Callback to map an address to physical coordinates
- Metadata must be provided by the programmer
  - Could easily be implemented in libraries
  - Many common mesh libraries would be interesting targets for this (see the sketch below)
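A sketch of what such instrumentation might look like; the registration function mem_register_buffer and the coordinate callback are hypothetical stand-ins for the tool's actual interface:

/* Hypothetical instrumentation sketch: register an allocated buffer with
 * a label, its address range, element geometry, and a callback that maps
 * an element index to physical (mesh) coordinates. Names are illustrative. */
#include <stddef.h>
#include <stdlib.h>

typedef void (*coord_fn)(size_t elem_index, double *xyz_out);

/* hypothetical: provided by the measurement tool */
void mem_register_buffer(const char *label, void *start, void *end,
                         size_t elem_size, size_t n_elems, coord_fn map);

static void velocity_coords(size_t i, double *xyz)
{
    /* illustrative: linear index -> coordinates on a 100^3 structured grid */
    xyz[0] = (double)(i % 100);
    xyz[1] = (double)((i / 100) % 100);
    xyz[2] = (double)(i / 10000);
}

void allocate_fields(size_t n)
{
    double *velocity = malloc(n * sizeof(double));
    mem_register_buffer("Velocity", velocity, velocity + n,
                        sizeof(double), n, velocity_coords);
}

A mesh library could issue these registrations from inside its own allocators, so application developers would get the attribution for free.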
8. Instrumentation

Add additional semantic attributes and define an attribution function (optional).
9. Semantic Memory Tree

(Diagram: a binary search tree over semantic memory ranges. Interior nodes split the registered address space (bounded here by 0x0F and 0xF6) into address ranges; leaves map those ranges to the application's data buffers (Velocity, Pressure, Temp, Density). Resolving sampled addresses through the tree translates them into the application domain, so performance data can be recorded there.)
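A minimal sketch of the lookup the diagram depicts; the node layout is assumed from the figure, not taken from the tool's implementation:

/* Sketch of a semantic memory range tree: interior nodes split the address
 * space, leaves name the owning data buffer. */
#include <stdint.h>
#include <stddef.h>

typedef struct range_node {
    uintptr_t lo, hi;              /* address range [lo, hi) covered */
    const char *label;             /* non-NULL at leaves: "Velocity", ... */
    struct range_node *left, *right;
} range_node;

/* Resolve a sampled address to the buffer it falls in (or NULL). */
static const char *resolve(const range_node *n, uintptr_t addr)
{
    while (n) {
        if (addr < n->lo || addr >= n->hi)
            return NULL;           /* outside any registered buffer */
        if (n->label)
            return n->label;       /* leaf: found the buffer */
        /* descend toward the child whose range contains addr */
        n = (n->left && addr < n->left->hi) ? n->left : n->right;
    }
    return NULL;
}

Each lookup is O(log n) in the number of registered ranges, which keeps per-sample attribution cheap.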
10. Lagrangian Hydrodynamics: LULESH

(Figures: LULESH mesh in 2D, in 3D, and in 3D with mapped performance data)
11. We have developed MemAxes, a tool for analyzing on-node memory performance

- The measurement component samples memory instructions
- We map latency information onto (A) source code and (B) node topology
- (C) A pie chart shows the selected samples' share of total latency
- (D) A parallel coordinates view allows exploration of correlations
12. Linked views clearly show on-node locality problems

- The parallel coordinates view shows a correlation between array index and core ID in LULESH
- The linked node topology view shows data motion for highlighted memory operations
- A contiguous chunk of an array is initially split between threads on four cores
- Using an optimized affinity scheme (sketched below), we improve locality
- Performance improved by 10%

(Figures: default thread affinity with poor locality vs. optimized thread affinity with good locality)
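A minimal sketch of such an affinity scheme using pthreads on Linux; the 1:1 thread-to-core mapping is illustrative, not necessarily the scheme used in the study:

/* Pin the calling thread to one core so that its contiguous chunk of the
 * array stays local to that core's caches. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int bind_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

With threads pinned, assigning each thread the array chunk it touches most keeps accesses within a core's local cache hierarchy instead of bouncing lines between cores.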
13. Hyperion Thread/Core Binding

Improved cache usage: 44% fewer access cycles, 10% total speedup.
14. Future work

- Back-port the perf_events API to the production TOSS 2 kernel
  - Currently unable to do fine-grained memory sampling on production machines due to PMU access limits
  - This affects some Intel thread tools as well
- More detailed architecture mapping
  - Sandy Bridge LLC ring interconnect information?
  - Other node architecture features?
- Instrument AMR libraries for proper context attribution
  - Study per-patch memory behavior
  - Study blocking behavior of solvers
- How to query large instruction traces effectively?