Detailed Cache Coherence Characterization for OpenMP Benchmarks

Slides dated 6/28/2004, NC State University.

1
Detailed Cache Coherence Characterization for
OpenMP Benchmarks
Jaydeep Marathe (1), Anita Nagarajan (2), Frank Mueller (1)
(1) Department of Computer Science, NCSU
(2) Intel Technology India Pvt. Ltd.
2
Our Focus
  • Target shared memory SMP systems.
  • Characterize coherence behavior at application
    level.
  • Metrics guide coherence optimization.

[Figure: Bus-Based Shared Memory SMP. Processor-1 through Processor-N, each with its own cache, connected by a bus; a coherence protocol operates between the caches.]
3
Invalidation-based coherence protocol
  • Cache lines have state bits.
  • Data migrates between processor caches, state
    transitions maintain coherence.
  • MESI protocol, 4 states: M = Modified, E =
    Exclusive, S = Shared, I = Invalid

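Example: Processor A and Processor B access x
  1. Read x: I->E; the reader's copy is exclusive.
  2. Read x (other processor): I->S for the new reader, E->S for the
     first holder; both copies are shared.
  3. Write x: S->M for the writer (modified/dirty), S->I for the other
     copy (invalid). The cache line was invalidated.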
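The three steps above can be traced with a small state tracker. This is a minimal sketch, not the simulator from the talk: the names `State` and `LineStates` are hypothetical, only a single cache line is modeled, write-backs are ignored, and it assumes Processor A performs the first read and the final write.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

class LineStates:
    """MESI state of ONE cache line (holding x) in each processor's cache."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches

    def read(self, cpu):
        if self.states[cpu] is not State.I:
            return  # read hit: no state change
        holders = [c for c, s in enumerate(self.states)
                   if c != cpu and s is not State.I]
        for c in holders:
            # M/E/S holders downgrade to S (write-back of M data not modeled)
            self.states[c] = State.S
        # Exclusive if nobody else holds the line, otherwise Shared
        self.states[cpu] = State.S if holders else State.E

    def write(self, cpu):
        # Invalidation-based protocol: every other copy becomes Invalid
        for c in range(len(self.states)):
            if c != cpu:
                self.states[c] = State.I
        self.states[cpu] = State.M

A, B = 0, 1
line = LineStates(2)
line.read(A)    # 1. A: I->E
line.read(B)    # 2. B: I->S, A: E->S
line.write(A)   # 3. A: S->M, B: S->I
print([s.name for s in line.states])
```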
4
A performance perspective
  • Writes to shared variables cause invalidation
    traffic (E->I, M->I, S->I).
  • Worse, invalidations lead to coherence misses!

Proc. A and Proc. B:
  1. Write shared vars Var_1, Var_2, ..., Var_n: invalidations!
  2. Read Var_1, Var_2, ..., Var_n: data is transferred; coherence
     misses in Proc. B!
  3. Write Var_1, Var_2, Var_3: invalidations!
Coherence bottlenecks: reducing coherence misses and invalidations
-> improved performance!
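The write/read/write pattern above can be counted with a toy model. This is a sketch under stated assumptions: caches are unbounded (no capacity evictions), each variable sits on its own cache line, Proc. A is the writer in steps 1 and 3, and Proc. B already holds the variables before step 1. The names `Cache`, `access`, and `ping_pong` are hypothetical. A miss is counted as a coherence miss when the line was last lost to a remote write.

```python
class Cache:
    def __init__(self):
        self.valid = set()          # lines currently cached
        self.invalidated = set()    # lines lost to a remote write
        self.invalidations = 0      # invalidations received
        self.coherence_misses = 0   # misses on previously invalidated lines
        self.other_misses = 0       # all remaining misses (here: cold misses)

def access(caches, cpu, line, is_write):
    me = caches[cpu]
    if line not in me.valid:
        if line in me.invalidated:
            me.coherence_misses += 1
            me.invalidated.discard(line)
        else:
            me.other_misses += 1
        me.valid.add(line)
    if is_write:
        # A write invalidates every other valid copy of the line
        for i, other in enumerate(caches):
            if i != cpu and line in other.valid:
                other.valid.discard(line)
                other.invalidated.add(line)
                other.invalidations += 1

def ping_pong(n_vars):
    a, b = Cache(), Cache()
    caches = [a, b]
    for v in range(n_vars):            # warm-up: B reads every variable
        access(caches, 1, v, False)
    for v in range(n_vars):            # 1. A writes Var_1..Var_n
        access(caches, 0, v, True)
    for v in range(n_vars):            # 2. B reads Var_1..Var_n
        access(caches, 1, v, False)
    for v in range(min(3, n_vars)):    # 3. A writes Var_1..Var_3
        access(caches, 0, v, True)
    return a, b

a, b = ping_pong(4)
print(b.invalidations, b.coherence_misses, b.other_misses)
```

In this toy run every invalidation from step 1 turns into a coherence miss in step 2, which is the cost the slide warns about.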
5
Question: Does a coherence bottleneck exist?
One approach: time-based profilers
[Figure: load imbalance between 2 threads (KAI GuideView); legend: Parallel Time, Imbalance Time]
Problem: the information is implicit. Does imbalance or
speedup loss indicate a coherence bottleneck?
- We can't tell!
6
Does a coherence bottleneck exist? (contd.)
Another approach: using hardware counters
[Chart: total misses, coherence misses, and invalidations (y-axis: total) for processors P-1 to P-4 (x-axis: processors)]
Detects potential coherence bottlenecks.
- Block-level statistics; perturbation prevents
  fine-grained monitoring.
- Can't diagnose the cause!
- Which source code refs? Which data structures?
Need more detailed information!
7
What we offer..
  • Hierarchical levels of detail..

Levels: overall statistics -> per-reference metrics -> invalidator set for each reference

Per-reference metrics
                                 Invalidations
             Coherence    True             False
Ref          Misses       In     Across    In     Across
V1_Read      8627         4      7517      31     1342
Timer_Read   3182         1      0         3122   0
Clock_Read   2165         2166   0         0      0
...

Invalidator set for each reference
Invalidator  True-Sharing  False-Sharing
V2           4811          20
V3           3199          0
Clock        200           1000
...
  • Rich metrics: coherence misses, true & false
    sharing, per-reference invalidators
  • Facilitates easy isolation of bottleneck
    references!

8
Our Framework
  • Bound OpenMP threads for SMP parallelism
  • Access Traces using dynamic binary rewriting
  • Traces used for incremental SMP cache
    simulation (L1 + L2 + coherence).

[Figure: framework overview.
 Trace generation: the Controller instruments the target OpenMP executable; as Thread-0 ... Thread-N execute, each calls Handler(), and the accesses are compressed into one access trace per thread.
 Simulation: the access traces and the target descriptor extracted by the Controller (instruction -> line, file; global & local variables) feed the SMP cache simulator, which reports detailed coherence metrics.]
9
Target Executable Instrumentation
[Figure labels: Controller, DynInst, CFG, Machine Code, Instrumentation Points]

Source:
  myfunc() {
    ...
    #pragma omp parallel for
    for (I = 0; I < N; I++)
      A[I] = B[I] + C[I];
    // end parallel for
    #pragma omp barrier
  }

Machine code:
  myfunc:
    ..
    ..
    CALL _xlsmpParallelDoSetup
    LOAD B[I]
    LOAD C[I]
    STORE A[I]
  Exit_from_parallel_for:
    CALL _xlsmpBarrier
  • Enhanced DynInst: dynamic binary rewriting
    package (U. Wisconsin)
  • Instrument Memory access instructions (LD/ST)
  • Instrument Compiler-generated OpenMP construct
    functions.

10
Per-reference metrics
  • Uniprocessor Misses
  • Coherence Misses
  • Invalidations
  • List & count of invalidator references

Fork-Join Model
  // serial code
  ..
  #pragma omp parallel
  .
  // end parallel
  // serial code
  #pragma omp parallel do
  .
  .
  // end parallel

[Figure: invalidations are split into In-Region and Across-Region (relative to OpenMP parallel regions); each class is further split into True Sharing and False Sharing.]
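The four invalidation classes can be illustrated with a short classifier. This is a sketch under assumptions the slides do not spell out: an invalidation counts as true sharing when the victim had touched the exact address being written, and as false sharing when it had only touched other addresses on the same line; it is "in" region when the victim's last access to the line happened in the same parallel region instance as the invalidating write, otherwise "across". The trace format and the name `classify` are hypothetical, and the 128-byte line size is an assumed value.

```python
LINE_SIZE = 128  # bytes; assumed value

def classify(trace, line_size=LINE_SIZE):
    """trace: iterable of (region, cpu, addr, is_write) in global order.
    Returns invalidation counts keyed by (sharing, scope)."""
    counts = {("true", "in"): 0, ("true", "across"): 0,
              ("false", "in"): 0, ("false", "across"): 0}
    # holders[line][cpu] = (addresses touched while cached, region of last access)
    holders = {}
    for region, cpu, addr, is_write in trace:
        line = addr // line_size
        copies = holders.setdefault(line, {})
        if is_write:
            for other in [c for c in copies if c != cpu]:
                touched, last_region = copies.pop(other)  # invalidate the copy
                sharing = "true" if addr in touched else "false"
                scope = "in" if last_region == region else "across"
                counts[(sharing, scope)] += 1
        touched, _ = copies.get(cpu, (set(), region))
        touched.add(addr)
        copies[cpu] = (touched, region)
    return counts

trace = [
    (0, 0, 0, False),  # region 0: cpu 0 reads address 0
    (0, 1, 8, True),   # region 0: cpu 1 writes address 8 on the same line
    (0, 0, 8, False),  # region 0: cpu 0 reads address 8
    (1, 1, 8, True),   # region 1: cpu 1 writes address 8 again
]
print(classify(trace))
```

The second access produces a false-sharing in-region invalidation, and the last one a true-sharing across-region invalidation.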
11
In-depth Example: SMG2000
  • ASCI Purple benchmark
  • Has been scaled up to 3150 processors
  • Hybrid OpenMP + MPI parallelization
  • Code is known to be memory-intensive
  • Code characteristics:
      • 72 files, 24213 lines (non-whitespace)
  • Instrumentation characteristics:
      • 4 OpenMP threads, default workload
      • Functions instrumented: 313 (69 OpenMP, 244
        others)
      • Access points instrumented: 10692
          • 8531 loads (2184 64-bit, 6329 32-bit, 18
            8-bit)
          • 2161 stores (722 64-bit, 1425 32-bit, 14
            8-bit)
      • Tracing: 16.73 million accesses logged.

12
Overall Performance: SMG2000
[Charts: A. Overall Misses; B. Cumulative Coherence Misses]
  • Most L2 misses are Coherence misses
  • Most Invalidations result in Coherence Misses
  • Only 280 out of 10692 access points show
    coherence activity (2.6%!)
  • Only 10 of these points account for > 90% of
    the coherence misses

13
Drilling Down: Per-Access-Point Metrics
  • Top-5 metrics for Processor-1

                                                                          Invalidations
                                                            Coherence  True       True           False      False
No  File                Line  Ref                Group  Misses     In-Region  Across-Region  In-Region  Across-Region
1   smg_residual.c      289   rp_Read            1      168545     0          0              158842     9672
2   smg_residual.c      289   rp_Read            1      81729      0          0              74242      7587
3   smg_residual.c      289   rp_Write           1      43338      0          0              42684      3648
4   cyclic_reduction.c  853   xp_Write           1      22467      0          0              21388      1128
5   threading.c         24    num_threads_Write  2      16553      17402      0              0          0
  • Group-1 refs: false-sharing in-region (same
    OpenMP region) invalidations dominate
  • Group-2 ref: true-sharing in-region
    invalidations only
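The grouping can be checked directly from the Top-5 table by asking which invalidation class dominates each reference. This is a small sketch; the helper name `dominant_category` and the tuple layout are hypothetical, while the numbers are copied from the table.

```python
# (file, line, ref, true_in, true_across, false_in, false_across)
ROWS = [
    ("smg_residual.c", 289, "rp_Read", 0, 0, 158842, 9672),
    ("smg_residual.c", 289, "rp_Read", 0, 0, 74242, 7587),
    ("smg_residual.c", 289, "rp_Write", 0, 0, 42684, 3648),
    ("cyclic_reduction.c", 853, "xp_Write", 0, 0, 21388, 1128),
    ("threading.c", 24, "num_threads_Write", 17402, 0, 0, 0),
]
CATEGORIES = ("true in-region", "true across-region",
              "false in-region", "false across-region")

def dominant_category(row):
    """Return (category name, its share of the row's invalidations)."""
    counts = row[3:]
    total = sum(counts)
    idx = max(range(len(counts)), key=lambda i: counts[i])
    return CATEGORIES[idx], counts[idx] / total

for row in ROWS:
    name, share = dominant_category(row)
    print(f"{row[0]}:{row[1]} {row[2]}: {name} ({share:.1%})")
```

For the first four references the dominant class is false in-region with a share above 90%; for `num_threads_Write` it is true in-region at 100%.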

14
Drilling Further: A Ref & Its Invalidators
                                     Invalidations
File:Line           Ref       True   False In   False Across
smg_residual.c:289  rp_Read   0      158842     9672

  for k = 0 to Kmax
    for j = 0 to Jmax
      #pragma omp parallel do
      for i = 0 to Imax
        ... rp[k][j][i] ...
        rp[k][j][i] = ... A[i] ... xp ...
      // end omp do

Invalidator List
No  Reference        True-sharing Invalidations  False-sharing Invalidations
1   Proc_1:rp_Write  0                           77820
2   Proc_1:rp_Write  0                           86161
3   Proc_2:rp_Write  0                           2352

[Figure: cache lines shared among P1, P2, P3, P4]
  • Large number of false-sharing in-region
    invalidations!
  • Sub-optimal parallelization: fine-grained
    sharing
  • Solution: parallelize the outermost loop
    (coarsening)
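Why coarsening helps can be seen by counting how many cache lines of `rp` are written by more than one thread under each parallelization. This is a sketch with assumed parameters, not the SMG2000 configuration: a static block schedule, 8-byte elements, 128-byte lines, a line-aligned array with `i` as the fastest-varying index, and a hypothetical helper named `shared_lines`.

```python
def shared_lines(kmax, jmax, imax, threads, parallel_loop,
                 elem_size=8, line_size=128):
    """Count cache lines of rp[k][j][i] written by more than one thread."""
    def owner(index, extent):
        # Static block schedule: contiguous chunks of ceil(extent / threads)
        chunk = -(-extent // threads)
        return index // chunk

    writers = {}  # cache line -> set of thread ids that write it
    for k in range(kmax):
        for j in range(jmax):
            for i in range(imax):
                t = owner(i, imax) if parallel_loop == "i" else owner(k, kmax)
                offset = ((k * jmax + j) * imax + i) * elem_size
                writers.setdefault(offset // line_size, set()).add(t)
    return sum(1 for ts in writers.values() if len(ts) > 1)

# 8 x 8 x 8 grid, 4 threads: the array spans 32 cache lines
print(shared_lines(8, 8, 8, 4, "i"))  # innermost loop parallelized
print(shared_lines(8, 8, 8, 4, "k"))  # outermost loop parallelized
```

With these parameters every one of the 32 lines has several writers when the innermost loop is parallelized, and none does when the outermost loop is parallelized, since each thread then owns whole contiguous slabs.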

15
Another Optimization
                                     Invalidations
File:Line       Ref                True In  True Across  False In  False Across
threading.c:24  num_threads_Write  17402    0            0         0

  #pragma omp parallel
  num_threads = omp_get_num_threads()

Invalidator List
No  Reference                 True-sharing Invalidations  False-sharing Invalidations
1   Proc_1:num_threads_Write  17402                       0

[Figure: the cache line holding num_threads, accessed by P1, P2, P3, P4]
  • Multiple threads updating the same shared variable!
  • Solution: remove unnecessary sharing
    (SharedRemoval)

  num_threads = omp_get_max_threads()
16
Impact of Optimizations
  • SMG2000 run on IBM SP Blue Horizon. (POWER3)
  • Wall-clock times for recommended full-sized
    workloads (threads 1, 2, 4, 8)
  • Maximum of 73% improvement for 4th workload

17
Highlights & Future Directions
Highlights
  • First tool for characterizing OpenMP SMP
    performance
  • Dynamic binary rewriting: no source code
    modification!
  • Detailed source-correlated statistics.
  • Rich set of coherence metrics.

Future Directions
  • Use of partial access traces. (intermittent
    instrumentation)
  • Other threading models: Pthreads, etc.
  • Characterizing Perennial Server applications
    (Apache)

18
The End
19
Simulator Accuracy: Comparing Invalidations
                                      IS      MG     CG      FT      BT      SP      NBF
HPM-Raw                               165246  24631  134964  326595  185317  282269  474121
HPM after OpenMP run-time correction  162964  13629  100487  325257  157384  258922  135926

                   IS      MG     CG      FT      BT      SP      NBF
HPM (Corrected)    162964  13629  100487  325257  157384  258922  135926
ccSIM-Interleaved  163073  13174  117117  302630  157503  268334  137498
Error (%)          -0.006  3.3    -16.5   6.9     -0.07   -3.6    -1.15
  • NAS 3.0 OpenMP benchmarks + NBF
  • Total invalidations: hardware counters (IBM SP
    HPM) vs. simulator (ccSIM)
  • Account for OpenMP runtime overhead in HPM.
  • 16.5% maximum absolute error; most benchmarks
    have < 7% error.
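The error row can be recomputed from the two invalidation counts. The formula below, (HPM - ccSIM) / HPM x 100, is an assumption inferred from the numbers rather than something the slides state. Under it, six of the seven entries agree with the slide after truncation; for IS it yields about -0.07, whereas the slide prints -0.006.

```python
BENCHMARKS = ["IS", "MG", "CG", "FT", "BT", "SP", "NBF"]
HPM_CORRECTED = [162964, 13629, 100487, 325257, 157384, 258922, 135926]
CCSIM = [163073, 13174, 117117, 302630, 157503, 268334, 137498]

def percent_error(hpm, sim):
    # Assumed definition: signed error relative to the corrected HPM count
    return (hpm - sim) / hpm * 100.0

errors = {b: percent_error(h, s)
          for b, h, s in zip(BENCHMARKS, HPM_CORRECTED, CCSIM)}

worst = max(errors, key=lambda b: abs(errors[b]))
for b in BENCHMARKS:
    print(f"{b}: {errors[b]:.3f}%")
print("largest absolute error:", worst)
```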

20
Related Work
MemSpy: Martonosi et al. (1992)
  • Execution-driven simulation
  • Classifies by code & data objects
  • Invalidations & coherence misses
  • No true/false sharing
  • No invalidator lists
  • Compiler-inserted instrumentation
  • Uniprocessor-simulated parallel threads

SM-Prof: Brorsson et al. (1995)
  • Variable classification tool
  • Access classes of shared/private, read/write,
    few/many
  • Can't detect true/false sharing or magnitude
  • No coherence misses or invalidator lists

Rsim, Proteus, SimOS
  • Architecture-oriented simulators, only bulk
    statistics
  • Not meant for application developers

21
Tracing Overhead (earlier work: METRIC)
  • 1-3 orders of magnitude overhead, in most
    cases.
  • Conventional breakpoints (TRAP) have > 4 orders
    of magnitude overhead.

22
Trap-based instrumentation
From DynInst documentation:
application   operations  ops/sec    dyninst time (sec)  gdb time (sec)
compress 95   32,513      406,655.7  0.08                74.35
li (xlmatch)  110,209     43,607.7   2.53                221.04
li (compare)  4,475       640.2      6.99                16.39
li (binary)   401         19.4       20.69               21.62
23
Compression Ratio (earlier work: METRIC)
  • Comparison of uncompressed and compressed stream
    sizes
  • 1 million accesses logged
  • 2-4 orders of magnitude compression, in most
    cases.
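Access streams from loops compress well because consecutive addresses often differ by a constant stride. The slides do not describe METRIC's actual compression scheme, so the following is only an illustrative stand-in: a stride run-length encoder with hypothetical names `compress` and `decompress`.

```python
def compress(addresses):
    """Collapse an address stream into [start, stride, count] runs."""
    runs = []
    for addr in addresses:
        if runs:
            start, stride, count = runs[-1]
            if count == 1:
                # Second element of a run fixes its stride
                runs[-1] = [start, addr - start, 2]
                continue
            if addr == start + stride * count:
                runs[-1][2] += 1   # address continues the current run
                continue
        runs.append([addr, 0, 1])  # start a new run
    return runs

def decompress(runs):
    """Expand [start, stride, count] runs back into the address stream."""
    return [start + stride * i
            for start, stride, count in runs
            for i in range(count)]

stream = list(range(0, 8000, 8)) + [5, 5, 5]
runs = compress(stream)
print(len(stream), "accesses ->", len(runs), "runs:", runs)
```

Here 1003 accesses collapse into two runs, and `decompress(runs)` reproduces the original stream exactly. Real traces with irregular accesses would compress less.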
