Transcript and Presenter's Notes

Title: MPI for better scalability


1
MPI for better scalability & application performance
  • Byoung-Do Kim, Ph.D.
  • National Center for Supercomputing Applications
  • University of Illinois at Urbana-Champaign
  • bdkim@ncsa.uiuc.edu
  • Seungdo Hong
  • Dept. of Mechanical Engineering
  • Pusan National University, Pusan, Korea

2
Outline
  • MPI basics
  • MPI collective communication
  • MPI datatype
  • Data parallelism & domain decomposition
  • Algorithm Implementation
  • Examples
  • Conclusion

3
MPI Basics
  • MPI_Init starts up the MPI runtime environment at
    the beginning of a run.
  • MPI_Finalize shuts down the MPI runtime
    environment at the end of a run.
  • MPI_Comm_size gets the number of processes in a
    run, Np (typically called just after MPI_Init).
  • MPI_Comm_rank gets the process ID that the
    current process uses, which is between 0 and Np-1
    inclusive (typically called just after MPI_Init).

4
MPI example code in Fortran
  PROGRAM my_mpi_program
    IMPLICIT NONE
    INCLUDE "mpif.h"
    !! other includes
    INTEGER my_rank, num_procs, mpi_error_code
    !! other declarations
    CALL MPI_Init(mpi_error_code)                                  !! Start up MPI
    CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, mpi_error_code)
    CALL MPI_Comm_size(MPI_COMM_WORLD, num_procs, mpi_error_code)
    !! actual work goes here
    CALL MPI_Finalize(mpi_error_code)                              !! Shut down MPI
  END PROGRAM my_mpi_program

5
MPI example code in C
  #include <stdio.h>
  #include "mpi.h"
  /* other includes */

  int main (int argc, char *argv[])
  { /* main */
    int my_rank, num_procs, mpi_error;
    /* other declarations */
    MPI_Init(&argc, &argv);                     /* Start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    /* actual work goes here */
    MPI_Finalize();                             /* Shut down MPI */
    return 0;
  } /* main */

6
How an MPI Run Works
  • Every process gets a copy of the executable: Single Program, Multiple Data (SPMD).
  • They all start executing it.
  • Each looks at its own rank to determine which
    part of the problem to work on.
  • Each process works completely independently of
    the other processes, except when communicating.
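
As a minimal C sketch of the SPMD idea above (not from the slides), each rank derives its own block of loop iterations from its rank and the process count; the problem size n and the block arithmetic are illustrative assumptions.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int my_rank, num_procs;
    const int n = 1000;               /* total amount of work (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    /* Each rank derives its own block of iterations from its rank. */
    int chunk = (n + num_procs - 1) / num_procs;
    int start = my_rank * chunk;
    int end   = (start + chunk < n) ? start + chunk : n;

    printf("Rank %d of %d works on [%d, %d)\n", my_rank, num_procs, start, end);

    MPI_Finalize();
    return 0;
  }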

7
Send & Receive
  • MPI_SEND(buf, count, datatype, dest, tag, comm)
  • MPI_RECV(buf, count, datatype, source, tag, comm, status)
  • When MPI sends a message, it doesn't just send the contents; it also sends an "envelope" describing the contents:
  • buf: initial address of the send (or receive) buffer
  • count: number of entries to send
  • datatype: datatype of each entry
  • source: rank of the sending process
  • dest: rank of the process to receive
  • tag: message ID
  • comm: communicator (e.g., MPI_COMM_WORLD)
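
A minimal C sketch of these two calls (not from the slides): rank 0 sends one integer to rank 1. The payload value and tag are illustrative assumptions, and the sketch assumes at least two processes.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int my_rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Assumes the job is started with at least two processes. */
    if (my_rank == 0) {
      value = 42;                                   /* illustrative payload */
      MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
      MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
  }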

8
MPI_SENDRECV
  • MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)
  • Useful for communication patterns where each node both sends and receives messages.
  • Executes a blocking send and receive operation.
  • Both the send and the receive use the same communicator, but have distinct tag arguments.
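
A minimal C sketch of MPI_SENDRECV (not from the slides): each rank passes a value to the next rank around a ring and receives one from the previous rank; the ring pattern and tags are illustrative assumptions.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int my_rank, num_procs, send_val, recv_val;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    int next = (my_rank + 1) % num_procs;              /* neighbor to send to   */
    int prev = (my_rank - 1 + num_procs) % num_procs;  /* neighbor to recv from */
    send_val = my_rank;

    /* Combined blocking send + receive avoids deadlock in the ring. */
    MPI_Sendrecv(&send_val, 1, MPI_INT, next, 0,
                 &recv_val, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, &status);

    printf("Rank %d received %d from rank %d\n", my_rank, recv_val, prev);

    MPI_Finalize();
    return 0;
  }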

9
Collective Communication
  • Broadcast (MPI_Bcast)
  • A single proc sends the same data to every proc.
  • Reduction (MPI_Reduce)
  • All the procs contribute data that is combined using a binary operation (min, max, sum, etc.); one proc obtains the final answer.
  • Allreduce (MPI_Allreduce)
  • Same as MPI_Reduce, but every proc obtains the final answer.
  • Gather (MPI_Gather)
  • Collect the data from every proc and store it on proc root.
  • Scatter (MPI_Scatter)
  • Split the data on proc root into Np segments, one for each proc.
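
A minimal C sketch of two of these collectives (not from the slides): the root broadcasts a parameter with MPI_Bcast, every rank computes a partial result, and MPI_Reduce sums the partials back on the root; the numbers involved are illustrative assumptions.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int my_rank, num_procs, n = 0;
    double partial, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    if (my_rank == 0) n = 100;                       /* parameter chosen by root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* every rank gets n        */

    partial = (double)n * my_rank;                   /* each rank's contribution */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);          /* sum the partials on root */

    if (my_rank == 0)
      printf("Sum of partials = %f\n", total);

    MPI_Finalize();
    return 0;
  }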

10
(No Transcript)
11
MPI Datatype
MPI supports several other data types, but most are variations of these, and probably these are all you'll use.
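
The slide's table of basic datatypes is not in this transcript; for reference (supplied here, not taken from the slide), the commonly used basic types are:

  • C types: MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
  • Fortran types: MPI_CHARACTER, MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION
  • Language-independent: MPI_BYTE, MPI_PACKED
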
12
Data packaging
  • Use an MPI derived-datatype constructor if the data to be transmitted consists of a subset of the entries in an array.
  • MPI_Type_contiguous builds a derived type whose elements are contiguous entries in an array.
  • MPI_Type_vector is for equally spaced entries.
  • MPI_Type_indexed is for arbitrarily spaced entries of an array (a sketch follows below).
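
A minimal C sketch of MPI_Type_indexed (not from the slides): the block lengths and displacements chosen here are illustrative assumptions.

  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int blocklengths[3]  = {1, 2, 1};   /* lengths of the three blocks    */
    int displacements[3] = {0, 4, 9};   /* starting offsets of the blocks */
    MPI_Datatype subset;

    MPI_Init(&argc, &argv);

    /* Type describing elements 0, 4, 5, and 9 of an MPI_DOUBLE array. */
    MPI_Type_indexed(3, blocklengths, displacements, MPI_DOUBLE, &subset);
    MPI_Type_commit(&subset);

    /* ... subset can now be used as the datatype in send/receive calls ... */

    MPI_Type_free(&subset);
    MPI_Finalize();
    return 0;
  }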

13
MPI_Type_Vector
  • MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype)
  • IN count: number of blocks (int)
  • IN blocklength: number of elements in each block (int)
  • IN stride: spacing between the start of each block, measured in elements (int)
  • IN oldtype: old datatype (handle)
  • OUT newtype: new datatype (handle)

[Figure: layout of a vector datatype built from oldtype elements, with count = 3 blocks and stride = 3]
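
A minimal C sketch of MPI_TYPE_VECTOR (not from the slides), using parameters matching the figure; reading the figure as blocklength = 2 is an assumption.

  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    MPI_Datatype vec;

    MPI_Init(&argc, &argv);

    /* 3 blocks of 2 doubles each, block starts 3 doubles apart:
     * picks elements 0,1, 3,4, 6,7 of a double array.           */
    MPI_Type_vector(3, 2, 3, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    /* ... vec can now be used as the datatype in point-to-point calls ... */

    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
  }
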
14
Virtual Topology
  • MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart)
  • Describes a Cartesian structure of arbitrary dimension.
  • Creates a new communicator that contains information on the structure of the Cartesian topology.
  • Returns a handle to the new communicator with the topology information attached.
  • MPI_CART_RANK(comm, coords, rank)
  • MPI_CART_COORDS(comm, rank, maxdims, coords)
  • MPI_CART_SHIFT(comm, direction, disp, rank_source, rank_dest)
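
A minimal C sketch of these topology calls (not from the slides): a 1-D, non-periodic Cartesian communicator whose ranks look up their neighbors with MPI_Cart_shift; letting MPI_Dims_create pick the grid size is an illustrative choice.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int num_procs, my_rank, coords[1];
    int dims[1]    = {0};   /* 0 lets MPI_Dims_create choose the size */
    int periods[1] = {0};   /* non-periodic in this sketch            */
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    MPI_Dims_create(num_procs, 1, dims);                   /* fill in dims[0]   */
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods,
                    1 /* reorder */, &cart);               /* new communicator  */
    MPI_Comm_rank(cart, &my_rank);
    MPI_Cart_coords(cart, my_rank, 1, coords);             /* my grid position  */

    int below, above;
    MPI_Cart_shift(cart, 0, 1, &below, &above);            /* neighbors, dim 0  */

    printf("Rank %d at coord %d: below=%d above=%d\n",
           my_rank, coords[0], below, above);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
  }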

15
Application: 3-D Heat Conduction Problem
  • Solving the heat conduction equation by TDMA (Tri-Diagonal Matrix Algorithm)
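
For reference, a generic serial Thomas-algorithm (TDMA) sketch in C, not the authors' solver; the 4x4 test system and boundary values are illustrative assumptions.

  #include <stdio.h>

  /* Generic Thomas algorithm (TDMA) for a tridiagonal system
   *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1,
   * with a[0] and c[n-1] unused. Overwrites c and d as scratch. */
  static void tdma(int n, const double *a, const double *b,
                   double *c, double *d, double *x)
  {
    c[0] /= b[0];
    d[0] /= b[0];
    for (int i = 1; i < n; i++) {
      double m = 1.0 / (b[i] - a[i] * c[i - 1]);   /* forward elimination */
      c[i] = c[i] * m;
      d[i] = (d[i] - a[i] * d[i - 1]) * m;
    }
    x[n - 1] = d[n - 1];
    for (int i = n - 2; i >= 0; i--)               /* back substitution   */
      x[i] = d[i] - c[i] * x[i + 1];
  }

  int main(void)
  {
    /* Small system -x[i-1] + 2x[i] - x[i+1] = d[i] with boundary data
     * chosen so the exact solution is x = {1, 2, 3, 4}.               */
    double a[4] = { 0, -1, -1, -1};
    double b[4] = { 2,  2,  2,  2};
    double c[4] = {-1, -1, -1,  0};
    double d[4] = { 0,  0,  0,  5};
    double x[4];

    tdma(4, a, b, c, d, x);
    for (int i = 0; i < 4; i++) printf("x[%d] = %f\n", i, x[i]);
    return 0;
  }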

16
Domain Decomposition
  • Data parallelization: extensibility, portability
  • Divide the computational domain into many sub-domains based on the number of processors.
  • Each processor solves the same problem on its sub-domain, but needs to transfer the boundary-condition information of the overlapping boundary area.
  • Requires communication between the sub-domains at every time step (a 1-D halo-exchange sketch follows below).
  • Major parallelization method in CFD applications.
  • In order to get good scalability, algorithms need to be implemented carefully.
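
A minimal C sketch of the boundary exchange described above (not the authors' code): each rank in a 1-D decomposition swaps one ghost cell with its two neighbors via MPI_Sendrecv on a Cartesian communicator; the local array size and single-cell overlap are illustrative assumptions.

  #include "mpi.h"

  #define NLOC 64   /* interior cells per rank (assumed) */

  int main(int argc, char *argv[])
  {
    double u[NLOC + 2];        /* local slab plus one ghost cell on each side */
    int dims[1] = {0}, periods[1] = {0}, below, above, num_procs;
    MPI_Comm cart;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Dims_create(num_procs, 1, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &cart);
    MPI_Cart_shift(cart, 0, 1, &below, &above);

    for (int i = 0; i <= NLOC + 1; i++) u[i] = 0.0;   /* initialize field */

    /* Exchange ghost cells with both neighbors (MPI_PROC_NULL ends are no-ops). */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, above, 0,
                 &u[0],        1, MPI_DOUBLE, below, 0, cart, &status);
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, below, 1,
                 &u[NLOC + 1], 1, MPI_DOUBLE, above, 1, cart, &status);

    /* ... interior update using u[0] and u[NLOC+1] as boundary data ... */

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
  }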

17
1-D decomposition
  !----------------------------------------------------------------
  ! MPI Cartesian Coordinate Communicator
  !----------------------------------------------------------------
  CALL MPI_CART_CREATE (MPI_COMM_WORLD, NDIMS, DIMS,          &
                        PERIODIC, REORDER, CommZ, ierr)
  CALL MPI_COMM_RANK   (CommZ, myPE, ierr)
  CALL MPI_CART_COORDS (CommZ, myPE, NDIMS, CRDS, ierr)
  CALL MPI_CART_SHIFT  (CommZ, 0, 1, PEb, PEt, ierr)
  !----------------------------------------------------------------
  ! MPI Datatype creation
  !----------------------------------------------------------------
  CALL MPI_TYPE_CONTIGUOUS (NxNy, MPI_DOUBLE_PRECISION, XY_p, ierr)
  CALL MPI_TYPE_COMMIT     (XY_p, ierr)

18
2-D decomposition
  CALL MPI_CART_CREATE (MPI_COMM_WORLD, NDIMS, DIMS,          &
                        PERIODIC, REORDER, CommXY, ierr)
  CALL MPI_COMM_RANK   (CommXY, myPE, ierr)
  CALL MPI_CART_COORDS (CommXY, myPE, NDIMS, CRDS, ierr)
  CALL MPI_CART_SHIFT  (CommXY, 1, 1, PEw, PEe, ierr)
  CALL MPI_CART_SHIFT  (CommXY, 0, 1, PEs, PEn, ierr)
  !----------------------------------------------------------------
  ! MPI Datatype creation
  !----------------------------------------------------------------
  CALL MPI_TYPE_VECTOR (cnt_yz, block_yz, strd_yz,             &
                        MPI_DOUBLE_PRECISION, YZ_p, ierr)
  CALL MPI_TYPE_COMMIT (YZ_p, ierr)
  CALL MPI_TYPE_VECTOR (cnt_xz, block_xz, strd_xz,             &
                        MPI_DOUBLE_PRECISION, XZ_p, ierr)
  CALL MPI_TYPE_COMMIT (XZ_p, ierr)

19
3-D decomposition
  CALL MPI_CART_CREATE (MPI_COMM_WORLD, ..., CommXYZ, ierr)
  CALL MPI_COMM_RANK   (CommXYZ, myPE, ierr)
  CALL MPI_CART_COORDS (CommXYZ, myPE, NDIMS, CRDS, ierr)
  CALL MPI_CART_SHIFT  (CommXYZ, 2, 1, PEw, PEe, ierr)
  CALL MPI_CART_SHIFT  (CommXYZ, 1, 1, PEs, PEn, ierr)
  CALL MPI_CART_SHIFT  (CommXYZ, 0, 1, PEb, PEt, ierr)
  !----------------------------------------------------------------
  CALL MPI_TYPE_VECTOR (cnt_yz, block_yz, strd_yz,             &
                        MPI_DOUBLE_PRECISION, YZ_p, ierr)
  CALL MPI_TYPE_COMMIT (YZ_p, ierr)
  CALL MPI_TYPE_VECTOR (cnt_xz, block_xz, strd_xz,             &
                        MPI_DOUBLE_PRECISION, XZ_p, ierr)
  CALL MPI_TYPE_COMMIT (XZ_p, ierr)
  CALL MPI_TYPE_CONTIGUOUS (cnt_xy, MPI_DOUBLE_PRECISION, XY_p, ierr)
  CALL MPI_TYPE_COMMIT     (XY_p, ierr)

20
Scalability: 1-D
  • Good scalability up to a small number of processors (16).
  • After the choke point, communication overhead becomes dominant.
  • Performance degrades with a large number of processors.

21
Scalability: 2-D
  • Strong scalability up to a large number of processors.
  • Actual runtime is larger than in the 1-D case for small numbers of processors.
  • The sweep direction of the TDMA solver affects parallel performance due to communication overhead.

22
Scalability: 3-D
  • Superior scalability behavior over the other two cases.
  • No choke point observed up to 512 processors.
  • Communication overhead is negligible compared to total runtime.

23
SpeedUps
24
Superlinear Speedup of 3-D Parallel Case
  • Benefit from the Intel Itanium chip architecture (large L3 cache; floating-point calculations bypass L1).
  • Small message size per communication due to good scalability.

25
Conclusion
  • 1-D decomposition is OK for small problem sizes, but has a communication-overhead problem as the size increases.
  • 2-D shows strong scaling behavior, but needs to be applied carefully because of the influence of the numerical solver's characteristics.
  • 3-D demonstrates superior scalability over the other two, but shows superlinear speedup due to the hardware architecture.
  • There is no one-size-fits-all magic solution. To get the best scalability and application performance, the MPI algorithm, application characteristics, and hardware architecture must work in harmony.