Title: Atomic Sections
1. Atomic Sections: A Design and Evaluation Study under OpenMP-XN
Presenters: Joseph Bryant Manzano Franco, Yuan Zhang, Guang R. Gao
Computer Architecture and Parallel System Laboratory, University of Delaware
In collaboration with Kemal Ebcioglu and Vivek Sarkar, X10 Team, IBM
2. Context: IBM PERCS/X10 project
- DARPA HPCS program: Phase 2 focuses on evaluating new technologies for productivity and performance.
- PERCS Programming Tools: performance-guided parallelization and transformation, static and dynamic checking, and separation of concerns, all integrated into a single development environment (Eclipse).
Atomic Section Milestone 4 under OpenMP-XN
- Focusing only on extensions to the familiar SPMD model is essential to the PERCS programming model.
- Facilitate comparative studies on a large body of OpenMP code.
- Permit an early study to begin before the X10 project is underway.
- Permit risky ideas to be studied in a more focused context.
- The algorithms developed can become a basis for future implementation and integration under the X10 framework.
PERCS Programming Model (diagram; layers as listed)
- OpenMP, MPI
- Static and dynamic compilers for base languages with programming model extensions. Mature languages: C/C++, Fortran, Java. Emerging languages: UPC, StreamIt. Experimental language: X10.
- Language runtime, dynamic compilation, continuous optimization
- PERCS system software (K42, vHype)
- PERCS system hardware
3. Major Goals
- We show how OpenMP can be extended with the concept of atomic sections.
- We develop a methodology for implementing analyzable atomic sections that includes three steps: (1) identify the consistency list for an atomic section; (2) assign locks to concurrent atomic sections to expose maximum parallelism at minimum lock cost; (3) place fine-grained synchronization.
- We develop an OpenMP-XN prototype implementation framework. We report and analyze the results of our experiments on a selected set of benchmarks.
- We have conducted a productivity study that shows, via examples, how OpenMP-XN with analyzable atomic sections can improve programming productivity, measured by time to the first correct implementation.
4. Random Access
- The one-dimensional array is passed by reference.
- Initialize the table.
- Start a parallel region, and specify the shared and private data.
- Initialize ran for each thread with a random number seeded by the thread id and scheduling information.
- Each thread begins to execute some iterations of the for loop.
- The atomic section synchronizes accesses to the shared table by making its operations atomic and mutually exclusive.
Atomic section (AS): a section of code that is intended to be executed atomically, and mutually exclusive with other conflicting atomic operations.
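The annotated steps on this slide can be sketched in standard OpenMP C. This is only an illustration under stated assumptions: the OpenMP-XN atomic-section directive is not shown in this transcript, so `#pragma omp critical` stands in for it. The names `update_table` and `next_random` are hypothetical, and the generator is a plain xorshift step rather than the benchmark's own. Unlike the slide, the sketch seeds `ran` from the iteration index instead of the thread id, so the final table does not depend on scheduling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one xorshift64 step, used here as a stand-in
   random number generator (not the benchmark's real generator). */
static uint64_t next_random(uint64_t x) {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return x;
}

/* XOR pseudo-random values into a shared table.
   The table is passed by reference; table_size must be a power of two. */
void update_table(uint64_t *table, size_t table_size, long num_updates) {
    assert(table_size > 0 && (table_size & (table_size - 1)) == 0);

    /* Initialize the table. */
    for (size_t k = 0; k < table_size; k++)
        table[k] = (uint64_t)k;

    /* Start a parallel region: the table is shared, ran is private
       because it is declared inside the region. */
    #pragma omp parallel shared(table, table_size, num_updates)
    {
        uint64_t ran;

        /* Each thread executes some iterations of the loop. */
        #pragma omp for
        for (long i = 0; i < num_updates; i++) {
            /* Assumption for this sketch: seed from the iteration index. */
            ran = next_random((uint64_t)i + 1);

            /* Stand-in for the atomic section: the read-modify-write of
               the shared table entry is mutually exclusive. */
            #pragma omp critical
            {
                table[ran & (table_size - 1)] ^= ran;
            }
        }
    }
}
```

Because XOR is commutative and each update value depends only on the iteration index, two calls with the same arguments produce the same table regardless of thread interleaving. If the compiler is run without OpenMP support, the pragmas are ignored and the loop runs serially.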
5. Atomic Section
- A section of code that is intended to be
executed atomically, and mutually exclusive with
other conflicting atomic operations.
6. OpenMP-XN Runtime Model
7. Atomic Section Implementation
Note: The five-step process is produced automatically by the OpenMP-XN compiler. High-productivity programmers need not know about the lock assignment and data replacement.
An atomic section is implemented as a five-step process: (1) acquire lock, (2) refresh, (3) computation, (4) write-back, (5) release lock.
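The five steps can be pictured with a small hand-written C sketch. It is not the code the OpenMP-XN compiler emits, which this transcript does not show. The `account_t` type, the `ACCOUNT_INIT` macro, and the `deposit` function are hypothetical, and the C11 `atomic_flag` spinlock is only a stand-in for whatever lock the runtime uses. Here "refresh" and "write-back" are modeled as copying the shared value into and out of a local variable.

```c
#include <stdatomic.h>

/* Hypothetical shared object guarded by a single lock. */
typedef struct {
    atomic_flag lock;  /* test-and-set spinlock (stand-in for the runtime lock) */
    long balance;      /* shared datum accessed inside the atomic section */
} account_t;

/* Static initializer: lock clear, balance set to v. */
#define ACCOUNT_INIT(v) { ATOMIC_FLAG_INIT, (v) }

/* Add amount to the balance as one atomic section; returns the new balance. */
long deposit(account_t *acct, long amount) {
    /* (1) acquire lock */
    while (atomic_flag_test_and_set_explicit(&acct->lock, memory_order_acquire)) {
        /* spin until the lock is free */
    }

    /* (2) refresh: bring the shared value into a local copy */
    long local = acct->balance;

    /* (3) computation: work only on the local copy */
    local += amount;

    /* (4) write-back: publish the result to shared memory */
    acct->balance = local;

    /* (5) release lock */
    atomic_flag_clear_explicit(&acct->lock, memory_order_release);

    return local;
}
```

A caller would declare `account_t a = ACCOUNT_INIT(100);` and call `deposit(&a, 50)` from any thread. The acquire and release orderings on the flag make the write-back of one section visible to the refresh of the next section that takes the same lock.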
8. Atomic Section Implementation (contd.)
- Assumptions
  - No nested atomic sections
  - No nested parallel regions
- A Three-Step Approach
  - Consistency List Analysis (CLA): given an OpenMP-XN program, analyze each atomic section and identify the shared data that might be read or written within that atomic section.
  - Lock Refinement and Assignment (LRA): given an OpenMP-XN program, assign one or more locks to guard the entrance of each atomic section, so that any pair of concurrent atomic sections that might access the same shared data will be guarded by the same lock.
  - Generation of Consistency Actions (GCA): generate refresh and write_back operations in atomic sections so that the runtime number of these operations is minimized.
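The lock-assignment requirement can be illustrated with a deliberately simple C sketch. It is not the LRA algorithm from this work, whose details are not given here. It assumes each consistency list is encoded as a bitmask (bit v set means the section may read or write shared variable v), that every pair of sections may run concurrently, and that each section gets exactly one lock. Sections whose lists overlap, directly or through a chain of overlaps, are merged with a union-find structure and share a lock. `assign_locks`, `find_root`, and `MAX_SECTIONS` are hypothetical names.

```c
#include <assert.h>

#define MAX_SECTIONS 32

/* Union-find lookup with path halving. */
static int find_root(int *parent, int s) {
    while (parent[s] != s) {
        parent[s] = parent[parent[s]];
        s = parent[s];
    }
    return s;
}

/* consistency[s]: bitmask of shared variables section s may access.
   On return, lock_of[s] is the lock id guarding section s.
   Returns the number of distinct locks used. */
int assign_locks(const unsigned *consistency, int n, int *lock_of) {
    int parent[MAX_SECTIONS];
    int id_of_root[MAX_SECTIONS];
    int num_locks = 0;

    assert(n >= 0 && n <= MAX_SECTIONS);

    for (int s = 0; s < n; s++) {
        parent[s] = s;
        id_of_root[s] = -1;
    }

    /* Merge every pair of sections whose consistency lists overlap. */
    for (int a = 0; a < n; a++) {
        for (int b = a + 1; b < n; b++) {
            if (consistency[a] & consistency[b]) {
                int ra = find_root(parent, a);
                int rb = find_root(parent, b);
                if (ra != rb)
                    parent[rb] = ra;
            }
        }
    }

    /* Give each merged group its own lock id, numbered from 0. */
    for (int s = 0; s < n; s++) {
        int r = find_root(parent, s);
        if (id_of_root[r] < 0)
            id_of_root[r] = num_locks++;
        lock_of[s] = id_of_root[r];
    }
    return num_locks;
}
```

For example, with lists {x}, {x, y}, {z}, {y}, the first, second, and fourth sections end up under one lock and the third gets its own, so it can run in parallel with the others. This sketch is conservative: it treats read-read overlap as a conflict and never assigns multiple finer-grained locks to one section, both of which a real refinement step would want to improve on.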
9. OpenMP-XN Experimental Testbed Structure (based on Omni)
10. Experiments
- A preliminary implementation of AS in OpenMP-XN has been completed and tested.
- Performance analysis based on OpenMP-XN on DARPA HPCS benchmarks and other benchmarks is in progress.
- A productivity study on AS has been conducted.
Benchmarks
- Micro-Benchmarks: a set of small benchmarks to test the performance of atomic sections. Source: Delaware internal benchmarks.
- Delaware Banker: a simple simulator of bank transactions, implemented in parallel. Source: Delaware internal benchmarks.
- TAMMP: Toy Another Molecular Mechanics Program, the kernel of the SPEC OMP molecular dynamics benchmark ammp.
- Random Access: random access benchmark modified to run under OpenMP-XN. Source: HPC Challenge (modified version).
- Radix Sort: implementation of the parallel integer radix sort algorithm. Source: IBM MAMBO benchmarks.
- Gram-Schmidt Orthonormalization: composed of dot products derived from IBM benchmarks. Source: developed based on the MAMBO/HPCS Inner-Product program.
11. Preliminary Experimental Results
- The current OpenMP-XN platform permits users to collect or derive:
  - Execution time
  - Speedup curves
  - Performance statistics
    - Cache consistency traffic
    - Cache misses
    - Number of memory operations
    - CPU cycles of each computation unit in the program
- Case Study
  - Test bed: Sun UltraSPARC III, 4 CPUs, 400 MHz
  - Benchmark: Random Access
  - Problem size: 2^14
  - Comparison: atomic section versus critical section
- With the right architectural support, atomic sections will not introduce performance overhead.
- The Delaware tool chain can support more in-depth performance studies; the snoop counts reported with these results are one interesting observation.
- Preliminary Performance Observations
  - On conventional hardware platforms, the memory wall (especially cache consistency traffic) is a bottleneck for performance improvement.
  - Realizing the performance potential of atomic sections requires architectural innovations.
Number of snoops (averaged over 240 runs)
- Critical Section: 2027.6
- Atomic Section: 1185.6
Observation: It appears that the OpenMP-XN runtime model based on atomic sections reduces the number of coherence transactions considerably for this example, compared to standard OpenMP critical sections. However, more study is needed for further explanation and exploration.
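To put the two snoop counts in proportion, the relative reduction can be worked out directly from the reported averages. This is arithmetic on the figures above and says nothing about variance across the 240 runs, which is not reported here.

```latex
\[
\frac{2027.6 - 1185.6}{2027.6} \;=\; \frac{842.0}{2027.6} \;\approx\; 0.415
\]
```

That is, the atomic-section version averaged roughly 41.5% fewer snoops than the critical-section version on this benchmark.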
Future results will be obtained from PowerPC
systems (work already in progress)