Title: Dissecting On-node Memory Performance with MemAxes
1. Dissecting On-node Memory Performance with MemAxes
Petascale Tools Workshop 2014
Madison, WI, August 4-7, 2014
Alfredo Gimenez, Todd Gamblin, Martin Schulz, Peer-Timo Bremer, Barry Rountree, Abhinav Bhatele, Ilir Jusufi, and Bernd Hamann
LLNL
UC Davis
2. Memory Access Sampling

- Recent hardware additions allow us to precisely sample events, including memory accesses
  - Intel PEBS, AMD IBS
- Memory access samples contain
  - The instruction pointer
  - The address accessed
  - How many core clock cycles elapsed during the access
  - Where in the memory hierarchy the address was resolved (e.g., L1 cache, local RAM, remote RAM)
- We need a way to meaningfully interpret these samples
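For reference, these sample fields can be requested on Linux through the perf_events interface. A minimal sketch, assuming the raw event code 0x1cd (the load-latency PEBS event on several Intel microarchitectures) and an illustrative latency threshold; consult the CPU manual for the correct encodings on a given part:

/* Sketch: configure PEBS-style load-latency sampling via perf_events.
 * The raw event code and ldlat threshold below are assumptions. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

static int open_load_latency_sampler(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size          = sizeof(attr);
    attr.type          = PERF_TYPE_RAW;
    attr.config        = 0x1cd;   /* assumed: Intel load-latency PEBS event */
    attr.config1       = 3;       /* assumed: minimum latency (ldlat) in cycles */
    attr.sample_period = 1000;    /* one sample per 1000 qualifying loads */
    attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                         PERF_SAMPLE_WEIGHT | PERF_SAMPLE_DATA_SRC;
    attr.precise_ip    = 2;       /* request precise (PEBS) samples */
    attr.exclude_kernel = 1;

    /* monitor the calling thread on any CPU */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

Each sample then carries exactly the fields above: the instruction pointer, the accessed address, the access latency in cycles (the "weight"), and the data source (which level of the hierarchy resolved it).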
3. Adding Context

- We can better understand memory references with appropriate context
- Contexts include
  - The code
  - The node hardware topology
  - The calling context (call path)
  - The application domain (e.g., fluid dynamics)
- Related work by Liu and Mellor-Crummey has mapped access latencies to particular variables, call paths, and access patterns.
4. We can already get coarse-grained application context for some codes

- Physics data is available in data structures
- Time steps are easy to mark in the code
- Per-process performance is easy to get:
  - turn on counters at the beginning of the run
  - read them periodically (sketched below)
- What if we want finer-grained attribution?
  - How to tie measurements to data structures?
  - How to slice and dice the data?
(Figure: FLOP/s per MPI process for an aluminum problem)
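A minimal sketch of that turn-on-then-poll pattern using the PAPI counter API; the event choice (PAPI_FP_OPS) and the timestep loop are illustrative:

/* Per-process counter polling: start counters once, read each timestep. */
#include <papi.h>
#include <stdio.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long flops;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_FP_OPS);   /* floating-point operation count */
    PAPI_start(evset);                    /* turn on at the start of the run */

    for (int step = 0; step < 100; step++) {
        /* ... timestep work ... */
        PAPI_read(evset, &flops);         /* read periodically */
        printf("step %d: %lld FP ops\n", step, flops);
    }
    return 0;
}

This yields one number per process per interval; it says nothing about which data structure the operations touched, which is exactly the finer-grained attribution question above.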
5. Node topology is easy to get, but not shown clearly

- PEBS provides metadata that ties samples to the node topology
- We want to highlight connections between components clearly, to show
  - Load distribution
  - Bandwidth
  - Resource contention
- The existing visualization from hwloc (right)
  - does not scale
  - clutters connections between components
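Getting the topology itself is indeed straightforward; a minimal sketch using the hwloc C API to enumerate the objects (packages, caches, cores, PUs) that a clearer visualization would need to lay out:

/* Walk the hwloc topology tree level by level. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);   /* discover this node's hardware */

    int depth = hwloc_topology_get_depth(topo);
    for (int d = 0; d < depth; d++) {
        unsigned n = hwloc_get_nbobjs_by_depth(topo, d);
        hwloc_obj_t obj = hwloc_get_obj_by_depth(topo, d, 0);
        printf("depth %d: %u x %s\n", d, n,
               hwloc_obj_type_string(obj->type));
    }
    hwloc_topology_destroy(topo);
    return 0;
}

The challenge is not obtaining this tree but drawing it so that per-sample data motion across it remains legible.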
6. We have developed a measurement tool for collecting detailed context

- Uses PEBS sampling for hardware information
- Supplements it with application instrumentation that maps addresses to physical coordinates
- A Semantic Memory Tree (SMT) data structure uses registered callbacks to map sampled instruction operands into the application domain
7. Currently the developer has to instrument the application manually

- Add calls that register metadata for allocated objects:
  - Label string
  - Start and end addresses
  - Size of each element
  - Number of elements
  - Callback to map an address to physical coordinates
- Metadata must be provided by the programmer
  - Could easily be implemented in libraries
  - Many common mesh libraries would be interesting targets for this (see the sketch below)
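A sketch of what such instrumentation might look like; the registration function mem_register_buffer and the coordinate callback are hypothetical stand-ins for the tool's actual interface:

/* Hypothetical instrumentation sketch: register an allocated buffer with
 * a label, its address range, element geometry, and a callback that maps
 * an element index to physical (mesh) coordinates. Names are illustrative. */
#include <stddef.h>
#include <stdlib.h>

typedef void (*coord_fn)(size_t elem_index, double *xyz_out);

/* hypothetical: provided by the measurement tool */
void mem_register_buffer(const char *label, void *start, void *end,
                         size_t elem_size, size_t n_elems, coord_fn map);

static void velocity_coords(size_t i, double *xyz)
{
    /* illustrative: linear index -> coordinates on a 100^3 structured grid */
    xyz[0] = (double)(i % 100);
    xyz[1] = (double)((i / 100) % 100);
    xyz[2] = (double)(i / 10000);
}

void allocate_fields(size_t n)
{
    double *velocity = malloc(n * sizeof(double));
    mem_register_buffer("Velocity", velocity, velocity + n,
                        sizeof(double), n, velocity_coords);
}

A mesh library could issue these registrations from inside its own allocators, so application developers would get the attribution for free.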
8. Instrumentation

Add additional semantic attributes and define an attribution function (optional).
9. Semantic Memory Tree

(Diagram: a binary search tree over semantic memory ranges. Interior nodes split the registered address space (bounded here by 0x0F and 0xF6) into address ranges; leaves map those ranges to the application's data buffers (Velocity, Pressure, Temp, Density). Resolving sampled addresses through the tree translates them into the application domain, so performance data can be recorded there.)
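A minimal sketch of the lookup the diagram depicts; the node layout is assumed from the figure, not taken from the tool's implementation:

/* Sketch of a semantic memory range tree: interior nodes split the address
 * space, leaves name the owning data buffer. */
#include <stdint.h>
#include <stddef.h>

typedef struct range_node {
    uintptr_t lo, hi;              /* address range [lo, hi) covered */
    const char *label;             /* non-NULL at leaves: "Velocity", ... */
    struct range_node *left, *right;
} range_node;

/* Resolve a sampled address to the buffer it falls in (or NULL). */
static const char *resolve(const range_node *n, uintptr_t addr)
{
    while (n) {
        if (addr < n->lo || addr >= n->hi)
            return NULL;           /* outside any registered buffer */
        if (n->label)
            return n->label;       /* leaf: found the buffer */
        /* descend toward the child whose range contains addr */
        n = (n->left && addr < n->left->hi) ? n->left : n->right;
    }
    return NULL;
}

Each lookup is O(log n) in the number of registered ranges, which keeps per-sample attribution cheap.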
10. Lagrangian Hydrodynamics: LULESH

(Figures: LULESH mesh in 2D, in 3D, and in 3D with mapped performance data)
11. We have developed MemAxes, a tool for analyzing on-node memory performance

- The measurement component samples memory instructions
- We map latency information onto (A) source code and (B) node topology
- (C) A pie chart shows the selected samples' share of total latency
- (D) A parallel coordinates view allows exploration of correlations
12. Linked views clearly show on-node locality problems

- The parallel coordinates view shows a correlation between array index and core ID in LULESH
- The linked node topology view shows data motion for highlighted memory operations
- A contiguous chunk of an array is initially split between threads on four cores
- Using an optimized affinity scheme (sketched below), we improve locality
- Performance improved by 10%

(Figures: default thread affinity with poor locality vs. optimized thread affinity with good locality)
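A minimal sketch of such an affinity scheme using pthreads on Linux; the 1:1 thread-to-core mapping is illustrative, not necessarily the scheme used in the study:

/* Pin the calling thread to one core so that its contiguous chunk of the
 * array stays local to that core's caches. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int bind_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

With threads pinned, assigning each thread the array chunk it touches most keeps accesses within a core's local cache hierarchy instead of bouncing lines between cores.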
13. Hyperion Thread/Core Binding

Improved cache usage: 44% fewer access cycles, 10% total speedup.
14. Future work

- Back-port the perf_events API to the production TOSS 2 kernel
  - Currently unable to do fine-grained memory sampling on production machines due to PMU access limits
  - This affects some Intel thread tools as well
- More detailed architecture mapping
  - Sandy Bridge LLC ring interconnect information?
  - Other node architecture features?
- Instrument AMR libraries for proper context attribution
  - Study per-patch memory behavior
  - Study blocking behavior of solvers
- How to query large instruction traces effectively?