NC State University, 6/28/2004

1
Detailed Cache Coherence Characterization for OpenMP Benchmarks
Jaydeep Marathe (1), Anita Nagarajan (2), Frank Mueller (1)
(1) Department of Computer Science, NCSU
(2) Intel Technology India Pvt. Ltd.
2
Our Focus

Target shared-memory SMP systems.
Characterize coherence behavior at the application level.
Metrics guide coherence optimization.

[Figure: Bus-based shared-memory SMP. Processor-1 through Processor-N each have a cache; the caches are connected by a bus and kept consistent by a coherence protocol.]
3
Invalidation-based coherence protocol

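Cache lines have state bits.
Data migrates between processor caches; state transitions maintain coherence.

MESI protocol, 4 states: M (Modified), E (Exclusive), S (Shared), I (Invalid).

[Figure: MESI example for a variable x cached by Processor A and Processor B.]
1. Read x: the reader's line goes I->E (exclusive).
2. Read x by the other processor: its line goes I->S and the first copy goes E->S (both shared).
3. Write x: the writer's line goes S->M (modified/dirty) and the other copy goes S->I (invalid). The cache line was invalidated.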
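The three steps above can be traced with a small state machine. This is a minimal sketch, not the simulator used in this work: the `State` enum and the `read` and `write` helpers are hypothetical names, the model tracks a single cache line, and it assumes Processor A (index 0) issues the first read and the write.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

def read(states, p):
    """Processor p reads the line; returns the new per-processor states."""
    new = list(states)
    if new[p] is State.I:
        # A read miss: check whether any other cache holds a valid copy.
        others_hold = any(s is not State.I for i, s in enumerate(new) if i != p)
        # A remote M or E copy is downgraded to S when another cache reads.
        for i, s in enumerate(new):
            if i != p and s in (State.M, State.E):
                new[i] = State.S
        new[p] = State.S if others_hold else State.E
    return new

def write(states, p):
    """Processor p writes the line; every other copy is invalidated."""
    new = [State.I] * len(states)
    new[p] = State.M
    return new

A, B = 0, 1
start = [State.I, State.I]
step1 = read(start, A)   # A reads x
step2 = read(step1, B)   # B reads x
step3 = write(step2, A)  # A writes x (assumed writer)

for label, states in [("1. Read x", step1), ("2. Read x", step2), ("3. Write x", step3)]:
    print(label, [s.name for s in states])
```

The same helpers extend to more than two processors because the state list can have any length.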
4
A performance perspective

Writes to shared variables cause invalidation traffic (E->I, M->I, S->I).
Worse, invalidations lead to coherence misses!

[Figure: Interaction between Proc. A and Proc. B.]
1. Proc. A writes shared variables Var_1, Var_2, ..., Var_n: invalidations!
2. Proc. B reads Var_1, Var_2, ..., Var_n: the data must be transferred, causing coherence misses in Proc. B!
3. Var_1, Var_2, Var_3 are written again: invalidations!

Coherence bottlenecks: reducing coherence misses and invalidations -> improved performance!
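The write/read/write pattern above can be counted with a toy model. The sketch below is an illustration under strong assumptions, not the tool's simulator: caches are unbounded, each variable occupies its own cache line, and a miss counts as a coherence miss when the line was last lost to a remote write. The `simulate` function and the trace format are hypothetical.

```python
def simulate(trace, nprocs=2):
    """trace: iterable of (proc, op, line) with op 'R' or 'W'."""
    valid = [set() for _ in range(nprocs)]        # lines currently cached
    invalidated = [set() for _ in range(nprocs)]  # lines lost to a remote write
    stats = {"invalidations": 0, "coherence_misses": 0, "cold_misses": 0}
    for proc, op, line in trace:
        if line not in valid[proc]:
            if line in invalidated[proc]:
                stats["coherence_misses"] += 1
                invalidated[proc].discard(line)
            else:
                stats["cold_misses"] += 1
            valid[proc].add(line)
        if op == "W":
            # A write removes every other valid copy of the line.
            for other in range(nprocs):
                if other != proc and line in valid[other]:
                    valid[other].discard(line)
                    invalidated[other].add(line)
                    stats["invalidations"] += 1
    return stats

A, B = 0, 1
variables = ["Var_1", "Var_2", "Var_3"]
trace = (
    [(B, "R", v) for v in variables]    # B caches the variables first
    + [(A, "W", v) for v in variables]  # 1. A writes: B's copies are invalidated
    + [(B, "R", v) for v in variables]  # 2. B reads: coherence misses in B
    + [(A, "W", v) for v in variables]  # 3. A writes again: more invalidations
)
stats = simulate(trace)
print(stats)
```

Every read by B after a write by A misses only because of the invalidation, which is why reducing invalidations also reduces misses.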
5
Question: Does a coherence bottleneck exist?
One approach: time-based profilers.
[Figure: Load imbalance between 2 threads (KAI GuideView), showing parallel time and imbalance time.]
Problem: the information is implicit. Does imbalance or speedup loss indicate a coherence bottleneck? We can't tell!
6
Does a coherence bottleneck exist? (contd.)
Another approach: using hardware counters.
[Figure: Total misses, coherence misses, and invalidations for processors P-1 to P-4; axes: Total vs. Processors.]
Can detect potential coherence bottlenecks.
- Block-level statistics: perturbation prevents fine-grained monitoring.
- Can't diagnose the cause! Which source code refs? Which data structures?
Need more detailed information!
7
What we offer..

Hierarchical levels of detail:

Overall statistics

Invalidations are split into True/False sharing and In/Across region.

Ref        | Coherence Misses | True-In | True-Across | False-In | False-Across
V1_Read    | 8627             | 4       | 7517        | 31       | 1342
Timer_Read | 3182             | 1       | 0           | 3122     | 0
Clock_Read | 2165             | 2166    | 0           | 0        | 0
...        | ...              | ...     | ...         | ...      | ...

Per-reference metrics

Invalidator | True-Sharing | False-Sharing
V2          | 4811         | 20
V3          | 3199         | 0
Clock       | 200          | 1000
...         | ...          | ...

Invalidator set for each reference.

Rich metrics: coherence misses, true and false sharing, per-reference invalidators.

Facilitates easy isolation of bottleneck references!

8
Our Framework

Bound OpenMP threads for SMP parallelism.
Access traces generated using dynamic binary rewriting.
Traces used for incremental SMP cache simulation (L1 + L2 + coherence).

[Figure: Framework overview. Instrumentation and trace generation: the controller instruments the target OpenMP executable; Thread-0 through Thread-N execute, each calling Handler(), and the logged accesses are compressed into per-thread access traces. Simulation: the SMP cache simulator consumes the access traces together with a target descriptor extracted by the controller (instruction -> line, file; global and local variables) and produces detailed coherence metrics.]
9
Target Executable Instrumentation
[Figure: The controller uses DynInst to turn machine code into a CFG and to place instrumentation points.]

Source:
    myfunc() {
      ...
      #pragma omp parallel for
      for (I = 0; I < N; I++)
        A[I] = B[I] + C[I];
      // end parallel for
      #pragma omp barrier
    }

Machine code:
    myfunc
      ..
      CALL _xlsmpParallelDoSetup
      LOAD B[I]
      LOAD C[I]
      STORE A[I]
      Exit_from_parallel_for
      CALL _xlsmpBarrier

Instrumentation points:

Enhanced DynInst: dynamic binary rewriting package (U. Wisconsin).
Instrument memory access instructions (LD/ST).
Instrument compiler-generated OpenMP construct functions.

10
Per-reference metrics

Uniprocessor misses
Coherence misses

Invalidations
List and count of invalidator references

Fork-join model:
    // serial code
    ..
    #pragma omp parallel
      .
    // end parallel
    // serial code
    #pragma omp parallel do
      . .
    // end parallel

[Figure: Classification of invalidations relative to OpenMP parallel regions. Each invalidation is either In-Region or Across-Region, and within each of these either True Sharing or False Sharing.]
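The four invalidation classes can be illustrated with a small classifier. The definitions used here are assumptions, since the slides do not spell them out: an invalidation counts as true sharing when the invalidating write touches an address the victim processor accessed in that line, and as in-region when the victim's last access to the line happened in the same parallel region as the write. `LINE_SIZE`, `classify`, and the trace format are hypothetical.

```python
LINE_SIZE = 32  # bytes per cache line (hypothetical)

def classify(trace):
    """trace: iterable of (region, proc, op, addr); returns counts per class."""
    cached = {}  # (proc, line) -> (addresses touched, region of last access)
    counts = {}
    for region, proc, op, addr in trace:
        line = addr // LINE_SIZE
        if op == "W":
            # Invalidate every other processor's copy of this line.
            for (other, l), (touched, last_region) in list(cached.items()):
                if l == line and other != proc:
                    sharing = "true" if addr in touched else "false"
                    scope = "in-region" if last_region == region else "across-region"
                    key = (sharing, scope)
                    counts[key] = counts.get(key, 0) + 1
                    del cached[(other, l)]
        touched, _ = cached.get((proc, line), (set(), region))
        cached[(proc, line)] = (touched | {addr}, region)
    return counts

trace = [
    (1, 0, "R", 0),  # region 1: P0 reads address 0
    (1, 1, "W", 4),  # region 1: P1 writes address 4 on the same line
    (1, 0, "R", 0),  # region 1: P0 re-reads address 0
    (2, 1, "W", 0),  # region 2: P1 writes the address P0 read
]
counts = classify(trace)
print(counts)
```

The first write hits a different address on P0's line within the same region (false sharing, in-region); the second hits the address P0 actually read, one region later (true sharing, across-region).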
11
In-depth Example: SMG2000

ASCI Purple benchmark.
Has been scaled up to 3150 processors.
Hybrid OpenMP + MPI parallelization.
Code is known to be memory-intensive.
Code characteristics:
72 files, 24213 lines (non-whitespace).
Instrumentation characteristics:
4 OpenMP threads, default workload.
Functions instrumented: 313 (69 OpenMP, 244 others).
Access points instrumented: 10692.
8531 loads (2184 64-bit, 6329 32-bit, 18 8-bit).
2161 stores (722 64-bit, 1425 32-bit, 14 8-bit).
Tracing: 16.73 million accesses logged.

12
Overall Performance: SMG2000
[Figure: Panel A, Overall Misses; Panel B, Cumulative Coherence Misses.]

Most L2 misses are coherence misses.
Most invalidations result in coherence misses.
Only 280 out of 10692 access points show coherence activity (2.6%!).
Only 10 of these points account for > 90% of the coherence misses.
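The concentration of coherence activity can be checked with simple arithmetic. The sketch below recomputes the 2.6% figure from the counts on this slide, then applies a cumulative-coverage helper to the five largest per-access-point coherence miss counts reported for Processor-1. `points_needed` is a hypothetical helper, and the coverage it reports is relative to those five points only, not to all 280.

```python
from itertools import accumulate

def points_needed(misses, fraction):
    """Smallest number of top-ranked points covering `fraction` of the total."""
    ranked = sorted(misses, reverse=True)
    total = sum(ranked)
    for n, running in enumerate(accumulate(ranked), start=1):
        if running >= fraction * total:
            return n
    return len(ranked)

active_points, all_points = 280, 10692
active_percent = 100.0 * active_points / all_points

# Coherence misses of the top-5 access points for Processor-1.
top5 = [168545, 81729, 43338, 22467, 16553]
needed = points_needed(top5, 0.90)

print(f"{active_percent:.1f}% of access points show coherence activity")
print(f"{needed} of the top 5 points cover 90% of their combined misses")
```

Ranking by miss count and reading off the cumulative curve is the same view that the cumulative coherence miss panel presents.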

13
Drilling Down: Per-Access-Point Metrics

Top-5 metrics for Processor-1

Invalidations are split into True/False sharing and In-Region/Across-Region.

No | File               | Line | Ref               | Group | Coherence Misses | True In-Region | True Across-Region | False In-Region | False Across-Region
1  | smg_residual.c     | 289  | rp_Read           | 1     | 168545           | 0              | 0                  | 158842          | 9672
2  | smg_residual.c     | 289  | rp_Read           | 1     | 81729            | 0              | 0                  | 74242           | 7587
3  | smg_residual.c     | 289  | rp_Write          | 1     | 43338            | 0              | 0                  | 42684           | 3648
4  | cyclic_reduction.c | 853  | xp_Write          | 1     | 22467            | 0              | 0                  | 21388           | 1128
5  | threading.c        | 24   | num_threads_Write | 2     | 16553            | 17402          | 0                  | 0               | 0

Group-1 refs: false-sharing In-Region (same OpenMP region) invalidations dominate.
Group-2 ref: true-sharing In-Region invalidations only.

14
Drilling Further: A Ref & its Invalidators

File:Line          | Ref     | True | False In | False Across
smg_residual.c:289 | rp_Read | 0    | 158842   | 9672

    for k = 0 to Kmax
      for j = 0 to Jmax
        #pragma omp parallel do
        for i = 0 to Imax
          ... rp[k][j][i]
          rp[k][j][i] = ... A[i] ... xp ...
        // end omp do

Invalidator list

No | Reference       | True-Sharing Invalidations | False-Sharing Invalidations
1  | Proc_1:rp_Write | 0                          | 77820
2  | Proc_1:rp_Write | 0                          | 86161
3  | Proc_2:rp_Write | 0                          | 2352

[Figure: Cache lines shared among P1, P2, P3, and P4.]

Large number of false-sharing In-Region invalidations!
Sub-optimal parallelization: fine-grained sharing.
Solution: parallelize the outermost loop (Coarsening).
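Why coarsening helps can be seen by counting cache lines that more than one thread writes. The following is a toy model under stated assumptions, not SMG2000's actual layout: `rp` is stored contiguously with `i` as the fastest-varying index, a line holds 4 elements, and iterations are assigned to threads in equal contiguous blocks. `static_owner` and `shared_lines` are hypothetical helpers.

```python
def static_owner(index, extent, nthreads):
    """Thread owning `index` when `extent` iterations are split into blocks."""
    chunk = -(-extent // nthreads)  # ceiling division
    return index // chunk

def shared_lines(kmax, jmax, imax, nthreads, parallel_loop, elems_per_line=4):
    """Count cache lines of rp[k][j][i] written by more than one thread."""
    writers = {}
    for k in range(kmax):
        for j in range(jmax):
            for i in range(imax):
                if parallel_loop == "inner":
                    owner = static_owner(i, imax, nthreads)
                else:
                    owner = static_owner(k, kmax, nthreads)
                offset = (k * jmax + j) * imax + i
                writers.setdefault(offset // elems_per_line, set()).add(owner)
    return sum(1 for w in writers.values() if len(w) > 1)

inner = shared_lines(4, 4, 6, nthreads=4, parallel_loop="inner")
outer = shared_lines(4, 4, 6, nthreads=4, parallel_loop="outer")
print("lines with several writers, inner-loop parallel:", inner)
print("lines with several writers, outer-loop parallel:", outer)
```

With these sizes each thread's slab under outer-loop parallelization is a whole number of lines, so no line has two writers; in general, coarsening leaves shared lines only at slab boundaries.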

15
Another Optimization
File:Line      | Ref               | True In | True Across | False In | False Across
threading.c:24 | num_threads_Write | 17402   | 0           | 0        | 0

    #pragma omp parallel
      num_threads = omp_get_num_threads();

Invalidator list

No | Reference               | True-Sharing Invalidations | False-Sharing Invalidations
1  | Proc_1:num_thread_Write | 17402                      | 0

[Figure: The cache line holding num_thread is shared by P1, P2, P3, and P4.]

Multiple threads updating the same shared variable!
Solution: remove unnecessary sharing (SharedRemoval).

    num_threads = omp_get_max_threads();
16
Impact of Optimizations

SMG2000 run on IBM SP Blue Horizon (POWER3).
Wall-clock times for recommended full-sized workloads (threads: 1, 2, 4, 8).
Maximum of 73% improvement for 4th workload.

17
Highlights & Future Directions
Highlights

First tool for characterizing OpenMP SMP performance.
Dynamic binary rewriting: no source code modification!
Detailed source-correlated statistics.
Rich set of coherence metrics.

Future Directions

Use of partial access traces (intermittent instrumentation).
Other threading models: Pthreads, etc.
Characterizing perennial server applications (Apache).

18
The End
19
Simulator Accuracy: Comparing Invalidations

Benchmark                            | IS     | MG    | CG     | FT     | BT     | SP     | NBF
HPM-Raw                              | 165246 | 24631 | 134964 | 326595 | 185317 | 282269 | 474121
HPM after OpenMP run-time correction | 162964 | 13629 | 100487 | 325257 | 157384 | 258922 | 135926

Benchmark         | IS     | MG    | CG     | FT     | BT     | SP     | NBF
HPM (Corrected)   | 162964 | 13629 | 100487 | 325257 | 157384 | 258922 | 135926
ccSIM-Interleaved | 163073 | 13174 | 117117 | 302630 | 157503 | 268334 | 137498
Error (%)         | -0.006 | 3.3   | -16.5  | 6.9    | -0.07  | -3.6   | -1.15

NAS 3.0 OpenMP benchmarks + NBF.
Total invalidations: hardware counters (IBM SP HPM) vs. simulator (ccSIM).
Account for OpenMP runtime overhead in HPM.
16.5% maximum absolute error; most benchmarks have < 7% error.
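The error row can be recomputed from the two invalidation counts per benchmark. The formula below is an assumption inferred from the table, not one stated on the slide: the difference between the corrected HPM count and the ccSIM count, relative to the corrected HPM count. It appears consistent with most entries of the error row. `percent_error` is a hypothetical helper.

```python
benchmarks = ["IS", "MG", "CG", "FT", "BT", "SP", "NBF"]
hpm_corrected = [162964, 13629, 100487, 325257, 157384, 258922, 135926]
ccsim = [163073, 13174, 117117, 302630, 157503, 268334, 137498]

def percent_error(measured, simulated):
    """Signed error of the simulated count relative to the measured count."""
    return 100.0 * (measured - simulated) / measured

errors = {
    name: percent_error(h, s)
    for name, h, s in zip(benchmarks, hpm_corrected, ccsim)
}
worst = max(errors, key=lambda name: abs(errors[name]))
under_seven = [name for name, e in errors.items() if abs(e) < 7.0]

for name in benchmarks:
    print(f"{name}: {errors[name]:+.2f}%")
print("largest absolute error:", worst)
```

Here `worst` names the benchmark with the largest absolute error and `under_seven` lists those below the 7% mark quoted above.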

20
Related Work
MemSpy: Martonosi et al. (1992)

Execution-driven simulation.
Classifies by code and data objects.
Invalidations and coherence misses.
No true/false sharing.
No invalidator lists.
Compiler-inserted instrumentation.
Uniprocessor-simulated parallel threads.

SM-Prof: Brorsson et al. (1995)

Variable classification tool.
Access classes of shared/private, read/write, few/many.
Can't detect true/false sharing or its magnitude.
No coherence misses, no invalidator lists.

Rsim, Proteus, SimOS

Architecture-oriented simulators, only bulk statistics.
Not meant for application developers.

21
Tracing Overhead (earlier work: METRIC)

1-3 orders of magnitude overhead in most cases.
Conventional breakpoints (TRAP) have > 4 orders of magnitude overhead.

22
Trap-based instrumentation
From the DynInst documentation:

Application   | Operations | ops/sec   | dyninst time (sec) | gdb time (sec)
compress 95   | 32,513     | 406,655.7 | 0.08               | 74.35
li (xlmatch)  | 110,209    | 43,607.7  | 2.53               | 221.04
li (compare)  | 4,475      | 640.2     | 6.99               | 16.39
li (binary)   | 401        | 19.4      | 20.69              | 21.62
23
Compression Ratio (earlier work: METRIC)