Title: Introduction to Parallel MM5
1 Introduction to Parallel MM5
- Yupeng Li
- Air Quality Modeling, Dept. of Computer Science, UH
2 Credits
- Parallelized by John Michalakes, Software
Engineer, MCS Division, Argonne National
Laboratory
3 Parallelism in MM5
- What is meant by parallel?
- Increase the computational and memory resources available for larger, faster runs by having more than one computer work on the problem
- Isn't MM5 already parallel?
- Yes, the model has been able to run shared-memory parallel since MM4, using Cray microtasking directives
- More recently, standardized OpenMP directives have been incorporated
- What is DM-parallelism? Why?
- Processors store part of the model domain in local memory, not shared with other processors, and work together on the problem by exchanging messages over a network
- Scalable, because it eliminates bottlenecks on shared resources such as the bus or memory
- Possibly also more cost-effective, since the systems can be commodity hardware
- You already have the DM-parallel version of MM5
4 A Little on Parallel Computing
- There are two ways to use more than one processor:
- (SMP) Shared Memory
- (DMP) Distributed Memory
- There is also a newer one, called Shared Distributed Memory (SDMP)
5 Memory Hierarchy
Approximate access speed, in machine cycles, at each level:
- CPU register (1)
- Cache (2-5) (may have several levels)
- Main memory (30)
- Disk (10^6)
6 Shared Memory Architecture
- For small numbers of processors
[Diagram: processors proc1 ... procN, each with its own cache (cache1 ... cacheN), all connected to one shared memory]
7 Distributed Memory
- The alternative model to shared memory
[Diagram: processors proc1 ... procN, each with its own local memory (mem1 ... memN), connected to one another by a network]
8 The Revolution of Communication Networks: From LAN to SAN (System Area Network)
- Myrinet: scalable interconnection networks; its single-chip building block appeared in 1995
- high bandwidth
- low latency
- reliable
- success story: Berkeley NOW
- Synfinity: a component for configuring very large, highly reliable interconnection networks (HAL/Fujitsu, 1998)
- high bandwidth (1.6 GBytes/sec/port)
- low latency (42 ns)
- highly reliable
9 Machines
10 Programming Technology for Parallel Architectures
- (SMP) Shared Memory
- Pthreads
- Directives: OpenMP (OMP)
- (DMP) Distributed Memory
- API: Message Passing Interface (MPI)
- Language: High Performance Fortran (HPF)
- OMP and MPI are the most widely used.
11 OpenMP
- Mainly directives, plus a small set of library functions:

      PROGRAM TEST
      ...
!$OMP PARALLEL
      ...
!$OMP DO
      DO I = ...
         ...
         CALL SUB1
         ...
      ENDDO
      ...
      CALL SUB2
      ...
!$OMP END PARALLEL
12 OpenMP
- And:

      INTEGER NTHREADS, TID
      INTEGER OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
C     Obtain and print the thread id
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread ', TID
C     Only the master thread does this
      IF (TID .EQ. 0) THEN
         NTHREADS = OMP_GET_NUM_THREADS()
         PRINT *, 'Number of threads ', NTHREADS
      END IF
C     All threads join the master thread and disband
!$OMP END PARALLEL
13 OpenMP
- If the compiler cannot recognize OMP directives, the directives are simply ignored and the code builds as a serial program (see the sketch below).
- If there are not enough processors for the threads you ask for, some threads will have to share a processor.
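
A minimal sketch of how this works in practice, assuming a generic Unix Fortran compiler (the OpenMP flag name is vendor-specific, e.g. -mp or -omp; the flag shown here is illustrative):

      f77 hello.f -o hello          # no OpenMP flag: directives ignored, serial run
      f77 -omp hello.f -o hello     # OpenMP enabled
      setenv OMP_NUM_THREADS 4      # ask for 4 threads
      ./hello

If the machine has fewer than 4 processors, the run still works; some threads simply time-share a processor.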
14 MPI
- A bunch of function calls:

      include 'mpif.h'
      integer myrank, numprocs, ierr
      integer status(MPI_STATUS_SIZE)
      real side, square
      side = 7.0
      square = 0.0
C
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myrank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      print *, 'I am node ', myrank, ' out of ', numprocs, ' nodes.'
      if (myrank .eq. 1) then
         square = side * side
         call MPI_SEND(square,1,MPI_REAL,0,99,MPI_COMM_WORLD,ierr)
      endif
      if (myrank .eq. 0) then
         print *, 'Before receive square is ', square
         call MPI_RECV(square,1,MPI_REAL,1,99,MPI_COMM_WORLD,
     &                 status,ierr)
         print *, 'After receive square is ', square
      endif
      call MPI_FINALIZE(ierr)
      end
15 MPI
- Usually involves one master node (rank 0) and several slave nodes
- The master distributes the job load to the set of computing nodes (MPI_SEND) and collects results from them as well (MPI_RECV)
- Initialization and finalization are a must (see the run sketch below)
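
A minimal sketch of compiling and running the example above under MPICH (wrapper and launcher names differ in other MPI implementations):

      mpif77 square.f -o square
      mpirun -np 2 ./square

With two nodes, rank 1 computes 7.0 * 7.0 and sends it to rank 0, which should print square as 0.0 before the receive and 49.0 after.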
16 Before we had MPI in MM5
- There were already several MPI implementations of MM5 out there, but none was officially adopted, because it is too expensive to maintain two sets of code.
- Then the Runtime System Library (RSL) and the Fortran Loop and Index Converter (FLIC) were used to parallelize MM5 for DMP. This is called the "same source" approach.
- A single-source implementation of parallelism has obvious benefits for maintainability, avoiding the effort needed to keep multiple, architecture-specific versions up to date with respect to each other.
17 Same source concept
- Ideal: source code for the DM-parallel and non-DM-parallel model is identical at the science level
- Hide the parallel details under the hood: automate and encapsulate
- Parallel toolbox:
- FLIC: automatic generation of I and J loop indexes (see the sketch below)
- RSL: routines for domain decomposition and message passing
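
To illustrate the idea, here is a hypothetical before/after sketch of the kind of loop transformation FLIC performs (the bounds variables ISTART/IEND/JSTART/JEND are made-up names for the patch a processor owns, not FLIC's actual output). A whole-domain loop such as

C     Original loop over the full domain
      DO J = 2, JLX-1
      DO I = 2, ILX-1
         T(I,J) = T(I,J) + DT*TTEN(I,J)
      ENDDO
      ENDDO

is rewritten so that each processor sweeps only its local patch:

C     Converted loop (illustrative): bounds clipped to this processor's patch
      DO J = MAX(2, JSTART), MIN(JLX-1, JEND)
      DO I = MAX(2, ISTART), MIN(ILX-1, IEND)
         T(I,J) = T(I,J) + DT*TTEN(I,J)
      ENDDO
      ENDDO

The science code inside the loop body is untouched; only the index ranges change, which is what keeps the DM-parallel and non-DM-parallel source identical at the science level.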
18 Distributed Memory Parallel Performance
19 DM-parallel Features
- All MM5 options supported except
- Moving nested grids
- Arakawa-Schubert Cumulus
- Pleim-Xu PBL
20 Building the DM-parallel MM5
- Download:
      ftp://ftp.ucar.edu/mesouser/MM5V3/MM5.TAR.gz
      ftp://ftp.ucar.edu/mesouser/MM5V3/MPP.TAR.gz
- Unzip and untar:
      gzip -d -c MM5.TAR.gz | tar -xf -
      cd MM5
      gzip -d -c MPP.TAR.gz | tar -xf -
- Edit the configure.user file for your computer and configuration
21 Editing configure.user
- Find the MPP subsection in Section 7 of configure.user pertaining to your computer, and uncomment those rules
- Adjust the PROCMIN_NS and PROCMIN_EW settings at the top of Section 7 for memory scaling
22 PROCMIN variables
- Determine the horizontal dimensions of the MM5 arrays on each processor at compile time
- PROCMIN_NS divides MIX (north-south decomposition); PROCMIN_EW divides MJX (east-west decomposition)
- Can reduce the per-processor size of the MM5 arrays to exploit the aggregate memory of the parallel machine (see the example below)
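
For instance, a configure.user fragment might look like the following (the values and comments are illustrative; check Section 7 of your configure.user for the exact layout):

      # Size arrays for at least a 2 (N-S) x 2 (E-W) decomposition
      PROCMIN_NS = 2        # per-processor N-S dimension is roughly MIX/2
      PROCMIN_EW = 2        # per-processor E-W dimension is roughly MJX/2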
23 PROCMIN variables
- An executable compiled with PROCMIN_NS = 1 and PROCMIN_EW = 1 uses the maximum per-processor memory, but is valid for any number of processors.
- Warning! An executable compiled with PROCMIN_NS = 2 and PROCMIN_EW = 2 can be run on no fewer than 4 processors, and even then only on counts that factor into at least 2 in each direction. For example, it can NOT be run on 5 processors (the MIX/2 dimension is too small for a 1x5 decomposition).
24 PROCMIN variables
- Experiment with different decompositions: e.g., run times for 16-processor jobs compiled as 2x8, 4x4, and 8x2 might vary significantly.
25 Building the code
- Build the model: make mpp
- Resulting executable: Run/mm5.mpp
- To remake the code in a different configuration: make mpclean
- To reinstall the code in a different location: make uninstall
26 Running the model
- Generate the mmlif (namelist) file:
- make mm5.deck
- Edit mm5.deck
- ./mm5.deck (creates the namelist file in Run/mmlif; does not run the code)
- Also rerun it after you change configure.user
- Run the model (a complete session sketch follows this list):
- cd Run
- mpirun -np 4 ./mm5.mpp (standard, MPICH)
- dmpirun (DEC MPI)
- mprun (Sun MPI)
- mpimon (Linux/ScaMPI)
- poe (IBM)
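
Putting the steps together, a typical MPICH session looks like this (the processor count and editor are just examples):

      make mm5.deck
      vi mm5.deck            # edit run options
      ./mm5.deck             # writes Run/mmlif; does not run the model
      cd Run
      mpirun -np 4 ./mm5.mpp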
27 Running the model
- The model generates the normal MMOUT_DOMAIN output files, plus 3 text files per processor:
- rsl.out.0000 (contains standard output)
- rsl.error.0000 (contains standard error)
- show_domain_0000 (shows the domain decomposition)
28 Test datasets
- Storm of the Century:
      ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/input2mm5.tar.gz
      ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/soc_benchmark_config.tar.gz
- Good small case for initial testing
- Includes a nest
- Large domain (World Series Rain-out):
      ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/largedomainrun.tar.gz
- Representative problem sizes for distributed memory
29 Important Information Sources
- http://www.mmm.ucar.edu/mm5/
- When you really cannot find an answer in the documentation, send email to mesouser@ncar.ucar.edu
- Subscribe to the MM5 users' discussion mailing list through the address below, to see whether others have had the same experience:
      http://www.mmm.ucar.edu/mm5/support/news.html