Introduction to Parallel MM5 - PowerPoint PPT Presentation


Transcript and Presenter's Notes



1
Introduction to Parallel MM5
  • Yupeng Li
  • Air Quality Modeling,
  • Dept. of Computer Science,
  • UH

2
Credits
  • Parallelized by John Michalakes, Software
    Engineer, MCS Division, Argonne National
    Laboratory

3
Parallelism in MM5
  • What is meant by parallel?
  • Increase computational and memory resources
    available for larger, faster runs by having more
    than one computer work on the problem
  • Isn't MM5 already parallel?
  • Yes, the model has been able to run shared-memory
    parallel since MM4 using Cray Microtasking
    directives
  • More recently, standardized OpenMP directives
    have been incorporated
  • What is DM-parallelism? Why?
  • Processors store part of model domain in local
    memory, not shared with other processors, and
    work together on a problem by exchanging messages
    over a network
  • Scalable, because it eliminates bottlenecks on
    shared resources such as the bus or memory
  • Possibly also more cost-effective, since the
    systems can be built from commodity hardware
  • You already have the DM-parallel version of MM5

4
A Little on Parallel Computing
  • There are two ways to use more than one processor
  • (SMP) Shared Memory
  • (DMP) Distributed Memory
  • There is also a newer hybrid called Shared
    Distributed Memory (SDMP)

5
Memory Hierarchy
  • Memory hierarchy, from the CPU or processor
    outward, with approximate access speed in machine
    cycles:
  • Register (1)
  • Cache (2-5) (may have several levels)
  • Main memory (30)
  • Disk (10^6)
6
Shared Memory Architecture
  • For small numbers of processors
  • [Diagram: processors proc1 ... procN, each with a
    private cache (cache1 ... cacheN), all connected
    to a single shared memory]
7
Distributed Memory
  • The alternative model to shared memory

  • [Diagram: processors proc1 ... procN, each with
    its own local memory (mem1 ... memN), connected
    to one another by a network]
8
The revolution of communication networks: from LAN
to SAN (System Area Network)
  • Myrinet: scalable interconnection networks
  • high bandwidth
  • low latency
  • reliable
  • Success story: Berkeley NOW

Single-chip building block of Myrinet (1995)
  • Synfinity: component for configuring very large,
    highly reliable interconnection networks
    (HAL/Fujitsu, 1998)
  • high bandwidth (1.6GBytes/Sec/Port)
  • low latency (42ns)
  • highly reliable

9
Machines
10
Programming Technology for Parallel Architecture
  • (SMP) Shared Memory
  • Pthreads
  • Directives: OpenMP (OMP)
  • (DMP) Distributed Memory
  • API: Message Passing Interface (MPI)
  • Language: High Performance Fortran (HPF)
  • OpenMP and MPI are the most widely used.

11
OpenMP
  • Mainly directives, plus a small set of library
    functions:

      PROGRAM TEST
      ...
!$OMP PARALLEL
      ...
!$OMP DO
      DO I = ...
         ...
         CALL SUB1
         ...
      ENDDO
      ...
      CALL SUB2
      ...
!$OMP END PARALLEL

12
OpenMP
  • And:

!$OMP PARALLEL PRIVATE(NTHREADS, TID)
C     Obtain and print thread id
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread ', TID
C     Only master thread does this
      IF (TID .EQ. 0) THEN
         NTHREADS = OMP_GET_NUM_THREADS()
         PRINT *, 'Number of threads ', NTHREADS
      END IF
C     All threads join master thread and disband
!$OMP END PARALLEL

13
OpenMP
  • If the compiler does not recognize OpenMP
    directives, they are simply ignored and the code
    builds as a serial program (see the sketch below).
  • If there are fewer processors than the threads
    you ask for, some threads will have to share a
    processor.
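  • A minimal sketch (not from the original slides)
    of why this works: OpenMP directives and the !$
    conditional-compilation sentinel look like
    ordinary Fortran comments, so a compiler without
    OpenMP support still builds a correct serial
    program from the same source.

      PROGRAM FALLBK
      INTEGER NTH
C     The !$ lines are compiled only when OpenMP is
C     enabled; otherwise they are plain comments
!$    INTEGER OMP_GET_MAX_THREADS
      NTH = 1
!$    NTH = OMP_GET_MAX_THREADS()
      PRINT *, 'Running with up to ', NTH, ' thread(s)'
!$OMP PARALLEL
      PRINT *, 'Hello (once per thread, or just once)'
!$OMP END PARALLEL
      END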

14
MPI
  • A bunch of function calls:

      include 'mpif.h'
      integer myrank, numprocs, ierr
      integer status(MPI_STATUS_SIZE)
      real side, square
      side = 7.0
      square = 0.0
C
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myrank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      print *, 'I am node ', myrank, ' out of ', numprocs, ' nodes.'
      if (myrank .eq. 1) then
         square = side * side
         call MPI_SEND(square, 1, MPI_REAL, 0, 99,
     &                 MPI_COMM_WORLD, ierr)
      endif
      if (myrank .eq. 0) then
         print *, 'Before receive square is ', square
         call MPI_RECV(square, 1, MPI_REAL, 1, 99,
     &                 MPI_COMM_WORLD, status, ierr)
         print *, 'After receive square is ', square
      endif
      call MPI_FINALIZE(ierr)
      end

15
MPI
  • Usually involves one master node (rank 0) and
    several slave nodes
  • The master distributes the workload to the
    computing nodes (MPI_SEND) and collects their
    results (MPI_RECV), as in the sketch below
  • Initialization (MPI_INIT) and finalization
    (MPI_FINALIZE) are a must
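  • A minimal master/worker sketch (illustrative, not
    MM5 code): the master hands each worker one
    number, each worker squares it and sends the
    result back.

      PROGRAM MWORK
      INCLUDE 'mpif.h'
      INTEGER myrank, numprocs, ierr, i
      INTEGER status(MPI_STATUS_SIZE)
      REAL work, result
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      IF (myrank .EQ. 0) THEN
C        Master: distribute one value to every worker ...
         DO i = 1, numprocs-1
            work = REAL(i)
            CALL MPI_SEND(work, 1, MPI_REAL, i, 1,
     &                    MPI_COMM_WORLD, ierr)
         ENDDO
C        ... then collect the results
         DO i = 1, numprocs-1
            CALL MPI_RECV(result, 1, MPI_REAL, i, 2,
     &                    MPI_COMM_WORLD, status, ierr)
            PRINT *, 'Result from node ', i, ' is ', result
         ENDDO
      ELSE
C        Worker: receive a value, compute, reply to master
         CALL MPI_RECV(work, 1, MPI_REAL, 0, 1,
     &                 MPI_COMM_WORLD, status, ierr)
         result = work * work
         CALL MPI_SEND(result, 1, MPI_REAL, 0, 2,
     &                 MPI_COMM_WORLD, ierr)
      ENDIF
      CALL MPI_FINALIZE(ierr)
      END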

16
Before we had MPI in MM5
  • There were already several MPI implementations of
    MM5, but none was officially adopted, because
    maintaining two sets of code is too expensive.
  • That changed when the Runtime System Library
    (RSL) and the Fortran Loop and Index Converter
    (FLIC) were used to parallelize MM5 for DMP. This
    is called the same-source approach.
  • Single-source implementation of parallelism has
    obvious benefits for maintainability, avoiding
    the effort needed to keep multiple,
    architecture-specific versions up to date with
    respect to each other.

17
Same source concept
  • Ideal: source code for the DM-parallel and
    non-DM-parallel model is identical at the science
    level
  • Hide parallel details under the hood: automate
    and encapsulate
  • Parallel toolbox:
  • FLIC: automatic generation of I and J loop
    indexes (see the toy example below)
  • RSL: routines for domain decomposition and
    message passing
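  • A runnable toy (not actual FLIC or RSL output;
    JSTA, JEND and the strip split are illustrative)
    showing the idea behind per-processor loop
    indexes: each processor works out which slice of
    the global J range 1..MJX it owns, and the
    science loops then run over JSTA..JEND instead of
    the whole domain.

      PROGRAM LOOPIDX
      INCLUDE 'mpif.h'
      INTEGER myrank, numprocs, ierr
      INTEGER MJX, JSTA, JEND, chunk
      PARAMETER (MJX = 100)
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
C     Split the global range 1..MJX into contiguous strips,
C     one per processor; the last strip takes the remainder
      chunk = MJX / numprocs
      JSTA  = myrank * chunk + 1
      JEND  = JSTA + chunk - 1
      IF (myrank .EQ. numprocs-1) JEND = MJX
C     Science loops would now run DO J = JSTA, JEND
      PRINT *, 'node ', myrank, ' owns rows ', JSTA, ' to ', JEND
      CALL MPI_FINALIZE(ierr)
      END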

18
Distributed Memory Parallel Performance
19
DM-parallel Features
  • All MM5 options supported except
  • Moving nested grids
  • Arakawa-Schubert Cumulus
  • Pleim-Xu PBL

20
Building the DM-parallel MM5
  • Download:
    ftp://ftp.ucar.edu/mesouser/MM5V3/MM5.TAR.gz
    ftp://ftp.ucar.edu/mesouser/MM5V3/MPP.TAR.gz
  • Unzip and untar:
    gzip -d -c MM5.TAR.gz | tar -xf -
    cd MM5
    gzip -d -c MPP.TAR.gz | tar -xf -
  • Edit configure.user file for computer and
    configuration

21
Editing configure.user
  • Find the MPP subsection in Section 7 of
    configure.user pertaining to your computer and
    uncomment those rules
  • Adjust the PROCMIN_NS and PROCMIN_EW settings at
    the top of Section 7 for memory scaling, as
    sketched below
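  • A sketch of what that edit might look like (the
    exact placement and values in configure.user are
    illustrative, not prescribed):

    # Top of Section 7 in configure.user:
    # compile-time decomposition used to size
    # per-processor arrays
    PROCMIN_NS = 1
    PROCMIN_EW = 1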

22
PROCMIN variables
  • Determine horizontal dimensions of MM5 arrays for
    each processor at compile time
  • PROCMIN_NS divides MIX (north-south
    decomposition); PROCMIN_EW divides MJX (east-west
    decomposition)
  • Can reduce the per-processor size of MM5 arrays
    to exploit the aggregate memory of the parallel
    machine, as in the worked example below
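  • A worked example with illustrative numbers: if
    the full domain is dimensioned MIX = 100 by
    MJX = 120, compiling with PROCMIN_NS = 2 and
    PROCMIN_EW = 4 sizes each processor's local
    arrays at roughly MIX/2 = 50 by MJX/4 = 30 points
    (plus halo rows and columns), i.e. about one
    eighth of the per-processor memory needed with
    PROCMIN_NS = PROCMIN_EW = 1.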

23
PROCMIN variables
  • An executable compiled with PROCMIN_NS = 1 and
    PROCMIN_EW = 1 uses the maximum per-processor
    memory but is valid for any number of processors.
  • Warning! An executable compiled with
    PROCMIN_NS = 2 and PROCMIN_EW = 2 can be run on
    no fewer than 4 processors, but, for example, it
    can NOT be run on 5 processors (the MIX/2
    dimension is too small for a 1x5 decomposition).

24
PROCMIN variables
  • Experiment with different decompositions: e.g.,
    runtimes for 16-processor jobs compiled as 2x8,
    4x4, and 8x2 might vary significantly.

25
Building the code
  • Build the model: make mpp
  • Resulting executable: Run/mm5.mpp
  • To remake the code in a different configuration:
    make mpclean
  • To reinstall the code in a different location:
    make uninstall

26
Running the model
  • Generate the mmlif (namelist) file
  • make mm5.deck
  • Edit mm5.deck
  • ./mm5.deck (creates the namelist file Run/mmlif;
    does not run the code)
  • Also rerun it after you change configure.user
  • Run the model
  • cd Run
  • mpirun -np 4 ./mm5.mpp (standard, MPICH)
  • dmpirun (DEC MPI)
  • mprun (Sun MPI)
  • mpimon (Linux/ScaMPI)
  • POE (IBM)

27
Running the model
  • Model generates normal MMOUT_DOMAIN output files
    and 3 text files per processor
  • rsl.out.0000 (contains standard output)
  • rsl.error.0000 (contains standard error)
  • show_domain_0000 (shows the domain decomposition)

28
Test datasets
  • Storm of the Century:
    ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/input2mm5.tar.gz
    ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/soc_benchmark_config.tar.gz
  • Good small case for initial testing
  • Includes a nest
  • Large domain (World Series Rain-out):
    ftp://ftp.ucar.edu/mesouser/MM5V3/TESTDATA/largedomainrun.tar.gz
  • Representative problem sizes for distributed
    memory

29
Important Information Sources
  • http://www.mmm.ucar.edu/mm5/
  • When you really cannot find an answer in the
    documentation, send email to
    mesouser@ncar.ucar.edu
  • Subscribe to the MM5 Users' Discussion mailing
    list through the address below to see if others
    have had the same experience:
    http://www.mmm.ucar.edu/mm5/support/news.html