Title: Study of OpenMP with MPI for IFS
Slide 1: Study of OpenMP with MPI for IFS, ECMWF's production weather model, on IBM NightHawk2
John Hague (IBM UK), Deborah Salmond (ECMWF)
Slide 2: IFS
- IFS (Integrated Forecast System) contains:
  - Global weather forecast model
  - 4D-Var data assimilation
  - Wave model
  - Ocean model
- These components are run operationally by ECMWF
- IFS has been parallelised to run on distributed-memory systems using MPI
- OpenMP directives have been introduced at a high level to enhance the possibility for parallelisation on shared-memory systems
Slide 3: Forecast Days/Day for T511
[Chart: Forecast Days/Day achieved at resolution T511]

Slide 4: OpenMP + MPI
[Chart: Forecast Days/Day (256 to 1024) vs. number of processors (0 to 2000), comparing an MPI-only run with a combined OpenMP + MPI run]
Slide 5: IFS MPI + OpenMP
- IFS on IBM NightHawk2
- With MPI alone, good speedups are obtained with hundreds of MPI tasks
- With OpenMP + MPI, improved speedups are obtained beyond 500 processors by using a few OpenMP threads per MPI task
Slide 6: IFS MPI + OpenMP

The grid-point computations are parallelised with OpenMP over NPROMA-sized blocks:

    !$OMP PARALLEL DO
    DO J = 1, NGPTOT, NPROMA
       CALL CPG
    ENDDO
    !$OMP END PARALLEL DO

[Diagram: one IFS timestep. Grid-point space: DYNAMICS, RADIATION, SL-TRAJ, slcomm, SL-INTERP, PHYSICS. Transforms between spaces: transpose, FTINV/FTDIR, transpose, LTINV/LTDIR, with a buffer copy and MPI communication. Spectral space: SPECTRAL CALCS.]
Slide 7: IFS, details of the MPI and OpenMP run
- MPI environment variables:
  - MP_SHARED_MEMORY=yes
  - MP_WAIT_MODE=poll
  - MP_EUILIB=us
- OpenMP environment variables:
  - XLSMPOPTS=parthds=2:stack=50000000:spins=500000:yields=50000
Slide 8: IFS MPI + OpenMP
- Large numbers of threads do not give good speedups
- Typically 2 to 4 threads give the best performance
- Partly because less than 100% of the code runs in parallel regions
- Speedup within the parallel regions is also less than expected
Slide 9: IFS parallel regions (T159)
[Chart: time per parallel region over 24 timesteps: PHYSICS, SL-INTERP, SL-TRAJ, DYNAMICS, SPECTRAL]
Slide 10: MPI speedup vs. OpenMP speedup for IFS parallel regions (T159)
[Chart: MPI speedup vs. OpenMP speedup for SL-TRAJ, SL-INTERP, DYNAMICS, SLCOMM, and total time]
Slide 11: Factors which could affect OpenMP speedup
1) Thread dispatching overhead
   - Loop startup cost
2) Memory bandwidth limitations
   - Caused by stores/loads missing L2 and competing for memory access
3) L2 cache differences
   - Caused by the master thread having more data in L2 than the other threads
4) Cache interference
   - Caused by different threads storing to the same 128-byte cache line
5) Load imbalance
Slide 12: 1) Thread Dispatching Overhead

Slide 13: [chart only, no transcript]

Slide 14: Thread startup times
[Chart: measured thread startup times]
Slide 15: IFS T159 with 8 MPI tasks and OpenMP
[Chart: speedups of the IFS parallel regions with 1 and 2 threads, plotted against msec/call; regions include DYNAMICS and SLCOMM]
Slide 16: 2) Memory Bandwidth Limitations
Slide 17: Store kernels: s(i) = zero + s(i) and s(i) = zero
[Charts: total GB/s vs. total number of processors for the store kernels]

Slide 18: Load kernels: ss = ss + a(i) + b(i) + c(i) + d(i) and ss = ss + a(i)
[Charts: total GB/s vs. total number of processors for the load kernels]
Slide 19: IFS T159 with 8 MPI tasks and OpenMP
- Speedups of the IFS parallel regions with 1 and 2 threads
- Compared with L2 misses from the hardware performance monitor
- Shows the effect of memory bandwidth
Slide 20: 3) L2 Cache Differences
- Cause thread imbalance
- Example:
  - The master thread stores an array
  - A parallel loop then loads the array
  - L2 misses differ between threads
Slide 21: IFS T159 with 8 MPI tasks and OpenMP
[Chart: speedups of the IFS parallel regions with 1 and 2 threads, compared with the difference in L2 misses between threads]
Slide 22: 4) Cache Interference

Slide 23: [chart only, no transcript]
Slide 24: 5) Load Imbalance

Slide 25: [chart only, no transcript]
Slide 26: Conclusions
Did each factor limit OpenMP speedup?
1) Thread dispatching overhead: NO
2) Memory bandwidth limitations: YES
3) L2 cache differences: YES
4) Cache interference: Not much
5) Load imbalance: YES