Transcript and Presenter's Notes

Title: MPI for better scalability


1
MPI for better scalability & application performance
  • Byoung-Do Kim, Ph.D.
  • National Center for Supercomputing Applications
  • University of Illinois at Urbana-Champaign
  • bdkim@ncsa.uiuc.edu
  • Seungdo Hong
  • Dept. of Mechanical Engineering
  • Pusan National University, Pusan, Korea

2
Outline
  • MPI basics
  • MPI collective communication
  • MPI datatype
  • Data parallelism & domain decomposition
  • Algorithm Implementation
  • Examples
  • Conclusion

3
MPI Basics
  • MPI_Init starts up the MPI runtime environment at
    the beginning of a run.
  • MPI_Finalize shuts down the MPI runtime
    environment at the end of a run.
  • MPI_Comm_size gets the number of processes in a
    run, Np (typically called just after MPI_Init).
  • MPI_Comm_rank gets the process ID that the
    current process uses, which is between 0 and Np-1
    inclusive (typically called just after MPI_Init).

4
MPI example code in Fortran
  PROGRAM my_mpi_program
    IMPLICIT NONE
    INCLUDE "mpif.h"
    !! other includes
    INTEGER my_rank, num_procs, mpi_error_code
    !! other declarations
    CALL MPI_Init(mpi_error_code)                                  !! Start up MPI
    CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, mpi_error_code)
    CALL MPI_Comm_size(MPI_COMM_WORLD, num_procs, mpi_error_code)
    !! actual work goes here
    CALL MPI_Finalize(mpi_error_code)                              !! Shut down MPI
  END PROGRAM my_mpi_program

5
MPI example code in C
  #include <stdio.h>
  #include "mpi.h"
  /* other includes */

  int main (int argc, char *argv[])
  { /* main */
    int my_rank, num_procs, mpi_error;
    /* other declarations */
    MPI_Init(&argc, &argv);                     /* Start up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    /* actual work goes here */
    MPI_Finalize();                             /* Shut down MPI */
    return 0;
  } /* main */

6
How an MPI Run Works
  • Every process gets a copy of the executable: Single Program, Multiple Data (SPMD).
  • They all start executing it.
  • Each looks at its own rank to determine which
    part of the problem to work on.
  • Each process works completely independently of
    the other processes, except when communicating.
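
As a minimal C sketch of the SPMD idea above (not from the slides), each rank derives its own block of loop iterations from its rank and the process count; the problem size n and the block arithmetic are illustrative assumptions.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int my_rank, num_procs;
    const int n = 1000;               /* total amount of work (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    /* Each rank derives its own block of iterations from its rank. */
    int chunk = (n + num_procs - 1) / num_procs;
    int start = my_rank * chunk;
    int end   = (start + chunk < n) ? start + chunk : n;

    printf("Rank %d of %d works on [%d, %d)\n", my_rank, num_procs, start, end);

    MPI_Finalize();
    return 0;
  }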

7
Send & Receive
  • MPI_SEND(buf, count, datatype, dest, tag, comm)
  • MPI_RECV(buf, count, datatype, source, tag, comm, status)
  • When MPI sends a message, it doesn't just send the contents; it also sends an "envelope" describing the contents:
  • buf: initial address of the send (or receive) buffer
  • count: number of entries to send
  • datatype: datatype of each entry
  • source: rank of the sending process
  • dest: rank of the process to receive
  • tag: message ID
  • comm: communicator (e.g., MPI_COMM_WORLD)
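
A minimal C sketch of these two calls (not from the slides): rank 0 sends one integer to rank 1. The payload value and tag are illustrative assumptions, and the sketch assumes at least two processes.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int my_rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Assumes the job is started with at least two processes. */
    if (my_rank == 0) {
      value = 42;                                   /* illustrative payload */
      MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
      MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
  }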

8
MPI_SENDRECV
  • MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)
  • Useful for communication patterns where each node both sends and receives messages.
  • Executes a blocking send and receive operation.
  • Both the send and the receive use the same communicator, but have distinct tag arguments.
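
A minimal C sketch of MPI_SENDRECV (not from the slides): each rank passes a value to the next rank around a ring and receives one from the previous rank; the ring pattern and tags are illustrative assumptions.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int my_rank, num_procs, send_val, recv_val;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    int next = (my_rank + 1) % num_procs;              /* neighbor to send to   */
    int prev = (my_rank - 1 + num_procs) % num_procs;  /* neighbor to recv from */
    send_val = my_rank;

    /* Combined blocking send + receive avoids deadlock in the ring. */
    MPI_Sendrecv(&send_val, 1, MPI_INT, next, 0,
                 &recv_val, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, &status);

    printf("Rank %d received %d from rank %d\n", my_rank, recv_val, prev);

    MPI_Finalize();
    return 0;
  }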

9
Collective Communication
  • Broadcast (MPI_Bcast)
  • A single proc sends the same data to every proc.
  • Reduction (MPI_Reduce)
  • All the procs contribute data that is combined using a binary operation (min, max, sum, etc.); one proc obtains the final answer.
  • Allreduce (MPI_Allreduce)
  • Same as MPI_Reduce, but every proc obtains the final answer.
  • Gather (MPI_Gather)
  • Collect the data from every proc and store it on proc root.
  • Scatter (MPI_Scatter)
  • Split the data on proc root into Np segments, one for each proc.
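
A minimal C sketch of two of these collectives (not from the slides): the root broadcasts a parameter with MPI_Bcast, every rank computes a partial result, and MPI_Reduce sums the partials back on the root; the numbers involved are illustrative assumptions.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int my_rank, num_procs, n = 0;
    double partial, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    if (my_rank == 0) n = 100;                       /* parameter chosen by root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* every rank gets n        */

    partial = (double)n * my_rank;                   /* each rank's contribution */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);          /* sum the partials on root */

    if (my_rank == 0)
      printf("Sum of partials = %f\n", total);

    MPI_Finalize();
    return 0;
  }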

10
(No Transcript)
11
MPI Datatype
MPI supports several other data types, but most are variations of these, and probably these are all you'll use.
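
The slide's table of basic datatypes is not in this transcript; for reference (supplied here, not taken from the slide), the commonly used basic types are:

  • C types: MPI_CHAR, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE
  • Fortran types: MPI_CHARACTER, MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION
  • Language-independent: MPI_BYTE, MPI_PACKED
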
12
Data packaging
  • Use an MPI derived-datatype constructor if the data to be transmitted consists of a subset of the entries in an array.
  • MPI_Type_contiguous builds a derived type whose elements are contiguous entries in an array.
  • MPI_Type_vector is for equally spaced entries.
  • MPI_Type_indexed is for arbitrarily spaced entries of an array (a sketch follows below).
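
A minimal C sketch of MPI_Type_indexed (not from the slides): the block lengths and displacements chosen here are illustrative assumptions.

  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int blocklengths[3]  = {1, 2, 1};   /* lengths of the three blocks    */
    int displacements[3] = {0, 4, 9};   /* starting offsets of the blocks */
    MPI_Datatype subset;

    MPI_Init(&argc, &argv);

    /* Type describing elements 0, 4, 5, and 9 of an MPI_DOUBLE array. */
    MPI_Type_indexed(3, blocklengths, displacements, MPI_DOUBLE, &subset);
    MPI_Type_commit(&subset);

    /* ... subset can now be used as the datatype in send/receive calls ... */

    MPI_Type_free(&subset);
    MPI_Finalize();
    return 0;
  }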

13
MPI_Type_Vector
  • MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype)
  • IN count: number of blocks (int)
  • IN blocklength: number of elements in each block (int)
  • IN stride: spacing between the start of each block, measured in elements (int)
  • IN oldtype: old datatype (handle)
  • OUT newtype: new datatype (handle)

[Figure: layout of a vector datatype built from oldtype elements, with count = 3 blocks and stride = 3]
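
A minimal C sketch of MPI_TYPE_VECTOR (not from the slides), using parameters matching the figure; reading the figure as blocklength = 2 is an assumption.

  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    MPI_Datatype vec;

    MPI_Init(&argc, &argv);

    /* 3 blocks of 2 doubles each, block starts 3 doubles apart:
     * picks elements 0,1, 3,4, 6,7 of a double array.           */
    MPI_Type_vector(3, 2, 3, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    /* ... vec can now be used as the datatype in point-to-point calls ... */

    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
  }
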
14
Virtual Topology
  • MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart)
  • Describes a Cartesian structure of arbitrary dimension.
  • Creates a new communicator that contains information on the structure of the Cartesian topology.
  • Returns a handle to the new communicator with the topology information attached.
  • MPI_CART_RANK(comm, coords, rank)
  • MPI_CART_COORDS(comm, rank, maxdims, coords)
  • MPI_CART_SHIFT(comm, direction, disp, rank_source, rank_dest)
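
A minimal C sketch of these topology calls (not from the slides): a 1-D, non-periodic Cartesian communicator whose ranks look up their neighbors with MPI_Cart_shift; letting MPI_Dims_create pick the grid size is an illustrative choice.

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
    int num_procs, my_rank, coords[1];
    int dims[1]    = {0};   /* 0 lets MPI_Dims_create choose the size */
    int periods[1] = {0};   /* non-periodic in this sketch            */
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    MPI_Dims_create(num_procs, 1, dims);                   /* fill in dims[0]   */
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods,
                    1 /* reorder */, &cart);               /* new communicator  */
    MPI_Comm_rank(cart, &my_rank);
    MPI_Cart_coords(cart, my_rank, 1, coords);             /* my grid position  */

    int below, above;
    MPI_Cart_shift(cart, 0, 1, &below, &above);            /* neighbors, dim 0  */

    printf("Rank %d at coord %d: below=%d above=%d\n",
           my_rank, coords[0], below, above);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
  }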

15
Application: 3-D Heat Conduction Problem
  • Solving the heat conduction equation by TDMA (Tri-Diagonal Matrix Algorithm)
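
For reference, a generic serial Thomas-algorithm (TDMA) sketch in C, not the authors' solver; the 4x4 test system and boundary values are illustrative assumptions.

  #include <stdio.h>

  /* Generic Thomas algorithm (TDMA) for a tridiagonal system
   *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1,
   * with a[0] and c[n-1] unused. Overwrites c and d as scratch. */
  static void tdma(int n, const double *a, const double *b,
                   double *c, double *d, double *x)
  {
    c[0] /= b[0];
    d[0] /= b[0];
    for (int i = 1; i < n; i++) {
      double m = 1.0 / (b[i] - a[i] * c[i - 1]);   /* forward elimination */
      c[i] = c[i] * m;
      d[i] = (d[i] - a[i] * d[i - 1]) * m;
    }
    x[n - 1] = d[n - 1];
    for (int i = n - 2; i >= 0; i--)               /* back substitution   */
      x[i] = d[i] - c[i] * x[i + 1];
  }

  int main(void)
  {
    /* Small system -x[i-1] + 2x[i] - x[i+1] = d[i] with boundary data
     * chosen so the exact solution is x = {1, 2, 3, 4}.               */
    double a[4] = { 0, -1, -1, -1};
    double b[4] = { 2,  2,  2,  2};
    double c[4] = {-1, -1, -1,  0};
    double d[4] = { 0,  0,  0,  5};
    double x[4];

    tdma(4, a, b, c, d, x);
    for (int i = 0; i < 4; i++) printf("x[%d] = %f\n", i, x[i]);
    return 0;
  }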

16
Domain Decomposition
  • Data parallelization: extensibility, portability
  • Divide the computational domain into many sub-domains based on the number of processors.
  • Each processor solves the same problem on its sub-domain, but needs to transfer the boundary-condition information of the overlapping boundary area.
  • Requires communication between the sub-domains at every time step (a 1-D halo-exchange sketch follows below).
  • Major parallelization method in CFD applications.
  • In order to get good scalability, algorithms need to be implemented carefully.
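
A minimal C sketch of the boundary exchange described above (not the authors' code): each rank in a 1-D decomposition swaps one ghost cell with its two neighbors via MPI_Sendrecv on a Cartesian communicator; the local array size and single-cell overlap are illustrative assumptions.

  #include "mpi.h"

  #define NLOC 64   /* interior cells per rank (assumed) */

  int main(int argc, char *argv[])
  {
    double u[NLOC + 2];        /* local slab plus one ghost cell on each side */
    int dims[1] = {0}, periods[1] = {0}, below, above, num_procs;
    MPI_Comm cart;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Dims_create(num_procs, 1, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &cart);
    MPI_Cart_shift(cart, 0, 1, &below, &above);

    for (int i = 0; i <= NLOC + 1; i++) u[i] = 0.0;   /* initialize field */

    /* Exchange ghost cells with both neighbors (MPI_PROC_NULL ends are no-ops). */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, above, 0,
                 &u[0],        1, MPI_DOUBLE, below, 0, cart, &status);
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, below, 1,
                 &u[NLOC + 1], 1, MPI_DOUBLE, above, 1, cart, &status);

    /* ... interior update using u[0] and u[NLOC+1] as boundary data ... */

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
  }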

17
1-D decomposition
  !----------------------------------------------------------------
  ! MPI Cartesian Coordinate Communicator
  !----------------------------------------------------------------
  CALL MPI_CART_CREATE (MPI_COMM_WORLD, NDIMS, DIMS,          &
                        PERIODIC, REORDER, CommZ, ierr)
  CALL MPI_COMM_RANK   (CommZ, myPE, ierr)
  CALL MPI_CART_COORDS (CommZ, myPE, NDIMS, CRDS, ierr)
  CALL MPI_CART_SHIFT  (CommZ, 0, 1, PEb, PEt, ierr)
  !----------------------------------------------------------------
  ! MPI Datatype creation
  !----------------------------------------------------------------
  CALL MPI_TYPE_CONTIGUOUS (NxNy, MPI_DOUBLE_PRECISION, XY_p, ierr)
  CALL MPI_TYPE_COMMIT     (XY_p, ierr)

18
2-D decomposition
  CALL MPI_CART_CREATE (MPI_COMM_WORLD, NDIMS, DIMS,          &
                        PERIODIC, REORDER, CommXY, ierr)
  CALL MPI_COMM_RANK   (CommXY, myPE, ierr)
  CALL MPI_CART_COORDS (CommXY, myPE, NDIMS, CRDS, ierr)
  CALL MPI_CART_SHIFT  (CommXY, 1, 1, PEw, PEe, ierr)
  CALL MPI_CART_SHIFT  (CommXY, 0, 1, PEs, PEn, ierr)
  !----------------------------------------------------------------
  ! MPI Datatype creation
  !----------------------------------------------------------------
  CALL MPI_TYPE_VECTOR (cnt_yz, block_yz, strd_yz,             &
                        MPI_DOUBLE_PRECISION, YZ_p, ierr)
  CALL MPI_TYPE_COMMIT (YZ_p, ierr)
  CALL MPI_TYPE_VECTOR (cnt_xz, block_xz, strd_xz,             &
                        MPI_DOUBLE_PRECISION, XZ_p, ierr)
  CALL MPI_TYPE_COMMIT (XZ_p, ierr)

19
3-D decomposition
  CALL MPI_CART_CREATE (MPI_COMM_WORLD, ..., CommXYZ, ierr)
  CALL MPI_COMM_RANK   (CommXYZ, myPE, ierr)
  CALL MPI_CART_COORDS (CommXYZ, myPE, NDIMS, CRDS, ierr)
  CALL MPI_CART_SHIFT  (CommXYZ, 2, 1, PEw, PEe, ierr)
  CALL MPI_CART_SHIFT  (CommXYZ, 1, 1, PEs, PEn, ierr)
  CALL MPI_CART_SHIFT  (CommXYZ, 0, 1, PEb, PEt, ierr)
  !----------------------------------------------------------------
  CALL MPI_TYPE_VECTOR (cnt_yz, block_yz, strd_yz,             &
                        MPI_DOUBLE_PRECISION, YZ_p, ierr)
  CALL MPI_TYPE_COMMIT (YZ_p, ierr)
  CALL MPI_TYPE_VECTOR (cnt_xz, block_xz, strd_xz,             &
                        MPI_DOUBLE_PRECISION, XZ_p, ierr)
  CALL MPI_TYPE_COMMIT (XZ_p, ierr)
  CALL MPI_TYPE_CONTIGUOUS (cnt_xy, MPI_DOUBLE_PRECISION, XY_p, ierr)
  CALL MPI_TYPE_COMMIT     (XY_p, ierr)

20
Scalability: 1-D
  • Good scalability up to a small number of processors (16).
  • After the choke point, communication overhead becomes dominant.
  • Performance degrades with a large number of processors.

21
Scalability: 2-D
  • Strong scalability up to a large number of processors.
  • Actual runtime is larger than in the 1-D case for small numbers of processors.
  • The sweep direction of the TDMA solver affects parallel performance due to communication overhead.

22
Scalability: 3-D
  • Superior scalability behavior over the other two cases.
  • No choke point observed up to 512 processors.
  • Communication overhead is negligible compared to total runtime.

23
SpeedUps
24
Superlinear Speedup of 3-D Parallel Case
  • Benefit from the Intel Itanium chip architecture (large L3 cache; floating-point calculations bypass L1).
  • Small message size per communication due to good scalability.

25
Conclusion
  • 1-D decomposition is OK for small problem sizes, but has a communication-overhead problem as the size increases.
  • 2-D shows strong scaling behavior, but needs to be applied carefully because of the influence of the numerical solver's characteristics.
  • 3-D demonstrates superior scalability over the other two, but shows superlinear speedup due to the hardware architecture.
  • There is no one-size-fits-all magic solution. To get the best scalability and application performance, the MPI algorithm, application characteristics, and hardware architecture must work in harmony.