Title: Detailed Cache Coherence Characterization for OpenMP Benchmarks
1 Detailed Cache Coherence Characterization for OpenMP Benchmarks
Jaydeep Marathe (1), Anita Nagarajan (2), Frank Mueller (1)
(1) Department of Computer Science, NCSU
(2) Intel Technology India Pvt. Ltd.
2 Our Focus
- Target shared memory SMP systems.
- Characterize coherence behavior at application level.
- Metrics guide coherence optimization.
Figure: Bus-Based Shared Memory SMP. Processor-1 through Processor-N, each with its own cache, are connected by a bus; a coherence protocol runs between the caches.
3 Invalidation-based coherence protocol
- Cache lines have state bits.
- Data migrates between processor caches, state
transitions maintain coherence.
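- MESI protocol, 4 states: M (Modified), E (Exclusive), S (Shared), I (Invalid)
Example: Processor A and Processor B access a variable x
1. Read x: the reading cache goes I->E (exclusive).
2. Read x from the other processor: the reader goes I->S, the first holder goes E->S (both copies shared).
3. Write x: the writer goes S->M (modified/dirty), the other copy goes S->I (invalid). The cache line was invalidated.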
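The read, read, write sequence above can be traced with a small state machine. The following is a minimal Python sketch, not the simulator described in these slides. It assumes Processor A issues the first read and the write, and Processor B issues the second read; it tracks only the state of one cache line and ignores data values, write-backs and bus timing. The function names `read` and `write` are illustrative.

```python
# Minimal MESI sketch for ONE cache line held by several caches.
# Illustrative only: no data values, no write-backs, no bus timing.

def read(states, p):
    """Processor p reads the line; returns the updated list of states."""
    s = list(states)
    if s[p] == "I":
        others_hold = any(st != "I" for i, st in enumerate(s) if i != p)
        # Any other holder (M, E or S) is downgraded to Shared.
        for i in range(len(s)):
            if i != p and s[i] != "I":
                s[i] = "S"
        s[p] = "S" if others_hold else "E"
    return s

def write(states, p):
    """Processor p writes the line; every other copy is invalidated."""
    s = ["I"] * len(states)
    s[p] = "M"
    return s

A, B = 0, 1                      # assumed processor order
states = ["I", "I"]
after_step1 = read(states, A)    # 1. A reads x
after_step2 = read(after_step1, B)   # 2. B reads x
after_step3 = write(after_step2, A)  # 3. A writes x
print(after_step1, after_step2, after_step3)
```

Under these assumptions the three steps produce the I->E, then I->S / E->S, then S->M / S->I transitions listed on the slide.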
4 A performance perspective
- Writes to shared variables cause invalidation traffic (E->I, M->I, S->I).
- Worse, invalidations lead to coherence misses!
Example with Proc A and Proc B:
1. Proc A writes shared vars Var_1, Var_2, ..., Var_n: invalidations!
2. Proc B reads Var_1, Var_2, ..., Var_n: the data is transferred, with coherence misses in Proc B!
3. Proc A writes Var_1, Var_2, Var_3: invalidations!
Coherence bottlenecks: reducing coherence misses and invalidations -> improved performance!
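The write/read ping-pong between two processors can be counted with a very small model. This is a sketch under stated assumptions, not the authors' simulator: caches are unbounded (so there are no capacity misses), each variable sits on its own cache line, and a miss counts as a coherence miss only if the line was previously removed by a remote write. The name `simulate` and the trace format are hypothetical.

```python
def simulate(trace, nprocs=2):
    """trace: list of (proc, op, line) with op in {"R", "W"}."""
    valid = [set() for _ in range(nprocs)]        # lines currently cached
    invalidated = [set() for _ in range(nprocs)]  # lines lost to a remote write
    stats = {"invalidations": 0, "coherence_misses": 0}
    for proc, op, line in trace:
        if line not in valid[proc]:
            # Miss: it is a coherence miss only if a remote write evicted the line.
            if line in invalidated[proc]:
                stats["coherence_misses"] += 1
                invalidated[proc].discard(line)
            valid[proc].add(line)
        if op == "W":
            # A write invalidates every other cached copy of the line.
            for other in range(nprocs):
                if other != proc and line in valid[other]:
                    valid[other].discard(line)
                    invalidated[other].add(line)
                    stats["invalidations"] += 1
    return stats

A, B = 0, 1
shared_vars = ["Var_1", "Var_2", "Var_3"]
trace = []
for _ in range(2):  # two rounds of "A writes, B reads"
    trace += [(A, "W", v) for v in shared_vars]
    trace += [(B, "R", v) for v in shared_vars]

stats = simulate(trace)
print(stats)
```

In the first round B only takes cold misses; from the second round on, every write by A invalidates B's copy and every read by B is a coherence miss.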
5 Question: Does a coherence bottleneck exist?
One approach: time-based profilers
Figure: Load imbalance between 2 threads (KAI GuideView), showing parallel time and imbalance time.
Problem: the information is implicit. Does imbalance or speedup loss indicate a coherence bottleneck?
- We can't tell!
6 Does a coherence bottleneck exist? (contd)
Another approach: using hardware counters
Figure: total misses, coherence misses and invalidations (y-axis: total) for processors P-1 to P-4.
- Detect potential coherence bottlenecks.
- Block-level statistics: perturbation prevents fine-grained monitoring.
- Can't diagnose the cause!
- Which source code refs? Which data structures?
Need more detailed information!
7 What we offer
- Hierarchical levels of detail:
Overall statistics
Per-reference metrics (invalidations split into true/false sharing and in/across region):
Ref          Coherence Misses   True-In   True-Across   False-In   False-Across
V1_Read      8627               4         7517          31         1342
Timer_Read   3182               1         0             3122       0
Clock_Read   2165               2166      0             0          0
...
Invalidator set for each reference:
Invalidator   True-Sharing   False-Sharing
V2            4811           20
V3            3199           0
Clock         200            1000
...
- Rich metrics: coherence misses, true and false sharing, per-reference invalidators
- Facilitates easy isolation of bottleneck references!
8 Our Framework
- Bound OpenMP threads for SMP parallelism
- Access traces using dynamic binary rewriting
- Traces used for incremental SMP cache simulation (L1 + L2 + coherence)
Figure: framework overview.
Trace generation: the controller instruments the target OpenMP executable; while Thread-0, Thread-1, ..., Thread-N execute, each thread's Handler() feeds a compressed access trace.
Simulation: the per-thread access traces and a target descriptor extracted by the controller (instruction -> line, file; global and local variables) feed the SMP cache simulator, which outputs detailed coherence metrics.
9 Target Executable Instrumentation
Figure: the controller uses DynInst to build a CFG from the machine code and place instrumentation points.
Source:
  myfunc()
    ...
    #pragma omp parallel for
    for (I = 0; I < N; I++)
      A[I] = B[I] + C[I]
    // end parallel for
    #pragma omp barrier
Machine code (instrumentation points):
  myfunc:
    ...
    CALL _xlsmpParallelDoSetup
    LOAD B[I]
    LOAD C[I]
    STORE A[I]
  Exit_from_parallel_for:
    CALL _xlsmpBarrier
- Enhanced DynInst dynamic binary rewriting package (U. Wisconsin)
- Instrument memory access instructions (LD/ST)
- Instrument compiler-generated OpenMP construct functions.
10 Per-reference metrics
- Uniprocessor Misses
- Coherence Misses
- Invalidations
- List and count of invalidator references
Fork-Join Model
  // serial code
  ...
  #pragma omp parallel
  ...
  // end parallel
  // serial code
  #pragma omp parallel do
  ...
  // end parallel
Figure: each parallel section is a region. Invalidations are classified as In-Region or Across-Region, and each class is split into True Sharing and False Sharing.
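The two-way classification of an invalidation (true vs. false sharing, in-region vs. across-region) can be sketched as follows. The slides do not give the exact rules, so this Python sketch makes its own assumptions: sharing is "true" when the invalidating write touches the same word the victim processor accessed, and an invalidation is "in-region" when both accesses belong to the same parallel region instance. The line size, word size, `classify` function and access records are all hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)
WORD_SIZE = 8   # bytes per word (assumed)

def classify(victim_access, writer_access):
    """Classify one invalidation.

    Each access is a dict with:
      'addr'   - byte address touched
      'region' - id of the parallel region instance it executed in
    """
    # An invalidation only happens when both accesses fall in the same line.
    assert victim_access["addr"] // LINE_SIZE == writer_access["addr"] // LINE_SIZE
    same_word = (victim_access["addr"] // WORD_SIZE
                 == writer_access["addr"] // WORD_SIZE)
    sharing = "true" if same_word else "false"
    same_region = victim_access["region"] == writer_access["region"]
    scope = "in-region" if same_region else "across-region"
    return sharing, scope

# Neighbouring words of one line, same region: false sharing, in-region.
case1 = classify({"addr": 0, "region": 1}, {"addr": 8, "region": 1})
# Same word, different regions: true sharing, across-region.
case2 = classify({"addr": 16, "region": 1}, {"addr": 16, "region": 2})
print(case1, case2)
```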
11 In-depth Example: SMG2000
- ASCI Purple benchmark
- Has been scaled up to 3150 processors
- Hybrid OpenMP + MPI parallelization
- Code is known to be memory-intensive
- Code characteristics
  - 72 files, 24213 lines (non-whitespace)
- Instrumentation characteristics
  - 4 OpenMP threads, default workload
  - Functions instrumented: 313 (69 OpenMP, 244 others)
  - Access points instrumented: 10692
    - 8531 loads (2184 64-bit, 6329 32-bit, 18 8-bit)
    - 2161 stores (722 64-bit, 1425 32-bit, 14 8-bit)
  - Tracing: 16.73 million accesses logged.
12 Overall Performance: SMG2000
Figure panels: A. Overall Misses; B. Cumulative Coherence Misses
- Most L2 misses are coherence misses
- Most invalidations result in coherence misses
- Only 280 out of 10692 access points show coherence activity (2.6%!)
- Only 10 of these points account for > 90% of the coherence misses
13 Drilling Down: Per-Access Point Metrics
- Top-5 metrics for Processor-1 (invalidations split into true/false sharing and in/across region)
No  File                Line  Ref                Group  Coherence Misses  True-In  True-Across  False-In  False-Across
1   smg_residual.c      289   rp_Read            1      168545            0        0            158842    9672
2   smg_residual.c      289   rp_Read            1      81729             0        0            74242     7587
3   smg_residual.c      289   rp_Write           1      43338             0        0            42684     3648
4   cyclic_reduction.c  853   xp_Write           1      22467             0        0            21388     1128
5   threading.c         24    num_threads_Write  2      16553             17402    0            0         0
- Group-1 refs: false-sharing in-region (same OpenMP region) invalidations dominate
- Group-2 ref: true-sharing in-region invalidations only
14 Drilling Further: A Ref and its invalidators
File:Line           Ref      True  False-In  False-Across
smg_residual.c:289  rp_Read  0     158842    9672
Code:
  for k = 0 to Kmax
    for j = 0 to Jmax
      #pragma omp parallel do
      for i = 0 to Imax
        ... rp[k][j][i] ...
        rp[k][j][i] = ... A[i] ... xp ...
      // end omp do
Invalidator list:
No  Reference        True-Sharing Invalidations  False-Sharing Invalidations
1   Proc_1:rp_Write  0                           77820
2   Proc_1:rp_Write  0                           86161
3   Proc_2:rp_Write  0                           2352
Figure: cache lines, each divided among P1, P2, P3 and P4.
- Large number of false in-region invalidations!
- Sub-optimal parallelization: fine-grained sharing
- Solution: parallelize the outermost loop (Coarsening)
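A toy model shows why moving the parallel loop outward removes the fine-grained sharing. This Python sketch is an illustration under assumptions, not a reproduction of the SMG2000 measurements: the array is stored contiguously with `i` as the fastest-varying index, a cache line holds 8 elements, iterations are split into equal static chunks, and the array sizes are made up. It counts cache lines written by more than one thread (candidates for false sharing), not actual invalidations.

```python
from math import ceil

ELEMS_PER_LINE = 8  # e.g. 64-byte line / 8-byte element (assumed)

def writers_per_line(K, J, I, T, parallel_loop):
    """Map each cache line to the set of threads that write it.

    parallel_loop = "i": innermost loop split statically across T threads.
    parallel_loop = "k": outermost loop split statically across T threads.
    """
    writers = {}
    chunk_i = ceil(I / T)
    chunk_k = ceil(K / T)
    for k in range(K):
        for j in range(J):
            for i in range(I):
                thread = i // chunk_i if parallel_loop == "i" else k // chunk_k
                line = ((k * J + j) * I + i) // ELEMS_PER_LINE
                writers.setdefault(line, set()).add(thread)
    return writers

def shared_lines(K, J, I, T, parallel_loop):
    """Number of cache lines written by more than one thread."""
    w = writers_per_line(K, J, I, T, parallel_loop)
    return sum(1 for threads in w.values() if len(threads) > 1)

inner = shared_lines(K=4, J=4, I=8, T=4, parallel_loop="i")
outer = shared_lines(K=4, J=4, I=8, T=4, parallel_loop="k")
print(inner, outer)
```

With these hypothetical sizes every cache line is written by all four threads when the innermost loop is parallelized, and by exactly one thread when the outermost loop is parallelized.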
15 Another Optimization
File:Line       Ref                True-In  True-Across  False-In  False-Across
threading.c:24  num_threads_Write  17402    0            0         0
Code:
  #pragma omp parallel
    num_threads = omp_get_num_threads()
Invalidator list:
No  Reference                True-Sharing Invalidations  False-Sharing Invalidations
1   Proc_1:num_thread_Write  17402                       0
Figure: one cache line holding num_thread, accessed by P1, P2, P3 and P4.
- Multiple threads updating the same shared variable!
- Solution: remove unnecessary sharing (SharedRemoval)
    num_threads = omp_get_max_threads()
16 Impact of Optimizations
- SMG2000 run on IBM SP Blue Horizon (POWER3).
- Wall-clock times for recommended full-sized workloads (threads: 1, 2, 4, 8)
- Maximum of 73% improvement for the 4th workload
17 Highlights and Future Directions
Highlights
- First tool for characterizing OpenMP SMP performance
- Dynamic binary rewriting: no source code modification!
- Detailed source-correlated statistics.
- Rich set of coherence metrics.
Future Directions
- Use of partial access traces (intermittent instrumentation)
- Other threading models: Pthreads, etc.
- Characterizing perennial server applications (Apache)
18 The End
19 Simulator Accuracy: Comparing invalidations
                                      IS      MG     CG      FT      BT      SP      NBF
HPM-Raw                               165246  24631  134964  326595  185317  282269  474121
HPM after OpenMP run-time correction  162964  13629  100487  325257  157384  258922  135926

                   IS      MG     CG      FT      BT      SP      NBF
HPM (Corrected)    162964  13629  100487  325257  157384  258922  135926
ccSIM-Interleaved  163073  13174  117117  302630  157503  268334  137498
Error (%)          -0.006  3.3    -16.5   6.9     -0.07   -3.6    -1.15
- NAS 3.0 OpenMP benchmarks + NBF
- Total invalidations: hardware counters (IBM SP HPM) vs. simulator (ccSIM)
- Account for OpenMP runtime overhead in HPM.
- 16.5% maximum absolute error; most benchmarks have < 7% error.
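The error row can be recomputed from the two count rows. The slide does not state the formula; the sketch below assumes a relative error of (HPM corrected - ccSIM) / HPM corrected, which is inferred from the listed values rather than taken from the authors. The variable and function names are illustrative.

```python
benchmarks = ["IS", "MG", "CG", "FT", "BT", "SP", "NBF"]
hpm_corrected = [162964, 13629, 100487, 325257, 157384, 258922, 135926]
ccsim = [163073, 13174, 117117, 302630, 157503, 268334, 137498]

def percent_error(measured, simulated):
    """Relative error in percent (assumed formula, see lead-in)."""
    return 100.0 * (measured - simulated) / measured

errors = {
    name: percent_error(h, s)
    for name, h, s in zip(benchmarks, hpm_corrected, ccsim)
}

# Benchmark with the largest absolute error.
worst = max(errors, key=lambda name: abs(errors[name]))

for name in benchmarks:
    print(f"{name:4s} {errors[name]:8.2f} %")
print("largest absolute error:", worst)
```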
20 Related Work
MemSpy: Martonosi et al. (1992)
- Execution-driven simulation
- Classifies by code and data objects
- Invalidations and coherence misses
- No true/false sharing
- No invalidator lists
- Compiler-inserted instrumentation
- Uniprocessor-simulated parallel threads
SM-Prof: Brorsson et al. (1995)
- Variable classification tool
- Access classes of shared/private, read/write, few/many
- Can't detect true/false sharing, magnitude
- No coherence misses, invalidator lists
Rsim, Proteus, SimOS
- Architecture-oriented simulators, only bulk statistics
- Not meant for application developers
21 Tracing Overhead (earlier work: METRIC)
- 1-3 orders of magnitude overhead, in most cases.
- Conventional breakpoints (TRAP) have > 4 orders of magnitude overhead.
22 Trap-based instrumentation
From DynInst documentation:
application    operations  dyninst ops/sec  dyninst time (sec)  gdb time (sec)
compress 95    32,513      406,655.7        0.08                74.35
li (xlmatch)   110,209     43,607.7         2.53                221.04
li (compare)   4,475       640.2            6.99                16.39
li (binary)    401         19.4             20.69               21.62
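The gap between the two instrumentation approaches is easier to read as a ratio of run times. The short Python sketch below only divides the gdb time by the dyninst time for each row of the table; the dictionary layout and the name `slowdown` are illustrative.

```python
# (dyninst time in sec, gdb time in sec) per application, from the table.
times = {
    "compress 95": (0.08, 74.35),
    "li (xlmatch)": (2.53, 221.04),
    "li (compare)": (6.99, 16.39),
    "li (binary)": (20.69, 21.62),
}

# How many times slower the trap-based (gdb) run is than the dyninst run.
slowdown = {app: gdb / dyn for app, (dyn, gdb) in times.items()}

for app, ratio in slowdown.items():
    print(f"{app:14s} gdb / dyninst = {ratio:8.1f}x")
```

The ratio ranges from roughly three orders of magnitude for compress 95 down to almost 1 for li (binary).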
23 Compression Ratio (earlier work: METRIC)
- Comparison of uncompressed and compressed stream sizes
- 1 million accesses logged
- 2-4 orders of magnitude compression, in most cases.