Title: MPI Communication
1. MPI Communication
- (2002.8.22)
- Kyung-Lang Park
- Yonsei Univ. Super Computing Lab.
2. Contents
- Backup Slides
- Misc. about MPI and our project
- Communication Overview
- Collective Operation Overview
- Analysis of MPI_Bcast
3. Solaris and Linux
- Problem
- Cannot run MPI between Solaris and Linux
- Origin
- MAXHOSTNAMELEN is 64 on Linux but 256 on other machines
- How to patch? Edit MPI_DIR/mpid/globus2/mpi2.h (see the sketch below)
- Remove: define COMMWORLDCHANNELSNAMELEN (MAXHOSTNAMELEN+20)
- Add: define G2_MAXHOSTNAMELEN 256
- Add: define COMMWORLDCHANNELSNAMELEN (G2_MAXHOSTNAMELEN+20)
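A sketch of the patched region of mpi2.h as the slide describes it; the "+20" padding is an assumption, since the operator was lost in the extracted text.

    /* mpid/globus2/mpi2.h -- sketch of the patch above; the "+20" is assumed. */

    /* removed: sized from the OS limit, which is only 64 on Linux             */
    /* #define COMMWORLDCHANNELSNAMELEN (MAXHOSTNAMELEN+20)                    */

    /* added: use a fixed 256-byte hostname limit on every platform            */
    #define G2_MAXHOSTNAMELEN        256
    #define COMMWORLDCHANNELSNAMELEN (G2_MAXHOSTNAMELEN+20)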
4. Job Scheduler
- Misunderstanding
- We assumed we did not need to install a job scheduler such as LSF or PBS on the cluster because of the private IP issue
- The job scheduler is not related to the private IP problem
- We should install a job scheduler to examine how MPI works on a cluster
5. Resource Management Architecture
[Diagram: the Globus resource management architecture. An application submits RSL; queries to the Information Service return info that drives RSL specialization into ground RSL; simple ground RSL requests go to GRAM instances, which hand jobs to local resource managers such as LSF, Condor, or fork (default).]
6. Job Scheduler Problem
[Diagram: DUROC submits two subjobs (Subjob_size=4, SUBJOB_INDEX=0 and SUBJOB_INDEX=1) through GRAM to LSF and PBS; each subjob's local ranks 0-3 run on nodes node101-node104 and node201-node202.]
7. NGrid MPI Team Testbed
[Diagram: the NGrid MPI team testbed across KISTI, Yonsei, Sogang, and KUT. Resources shown: a Compaq SMP (8x2); a 20-node private-IP Linux cluster; Sparc-Solaris and Linux hosts sdd111-sdd114, cybercs, supercom, imap, and parallel; two 8-node Linux clusters (one using fork); and an 80-node cluster (mercury, venus, mars, jupitor; Intel and Alpha nodes), managed by LSF (16- and 24-node partitions) and PBS.]
8. MPI Communication (cont.)
- Preparing Communication - MPI_Init()
- Get basic information
- Gather information from each process
- Create the ChannelTable
- Make a passive socket
- Register the listen_callback() function
- Sending a Message - MPI_Send() (see the sketch below)
- Get protocol information from the ChannelTable
- Open a socket using globus_io
- Write data to the socket using globus_io
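A minimal sketch of the send path above, with plain POSIX sockets standing in for globus_io; the channel struct and lookup_channel() are hypothetical stand-ins for the ChannelTable, not MPICH-G2 code.

    /* Sketch only: the MPI_Send() steps above, with POSIX sockets in place
     * of globus_io.  The channel table and lookup_channel() are hypothetical. */
    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    struct channel { char host[256]; char port[16]; };      /* toy ChannelTable entry */
    static struct channel table[2] = { { "127.0.0.1", "5001" },
                                       { "127.0.0.1", "5002" } };
    static struct channel *lookup_channel(int dest) { return &table[dest]; }

    static int sketch_send(int dest, const void *buf, size_t len)
    {
        struct channel *ch = lookup_channel(dest);          /* 1. protocol info from C.T. */
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_INET;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(ch->host, ch->port, &hints, &res) != 0)
            return -1;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {   /* 2. open socket */
            freeaddrinfo(res);
            if (fd >= 0) close(fd);
            return -1;
        }
        freeaddrinfo(res);
        ssize_t n = write(fd, buf, len);                    /* 3. write data to socket */
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
    }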
9. MPI Communication
- Receiving a Message - listen_callback() (see the sketch below)
- Accept the socket connection
- Read data from the socket
- Copy the data into the recv-queue
- Receiving a Message - MPI_Recv(..buf..)
- Search the recv-queue
- Copy the data from the recv-queue to buf
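A matching sketch of the receive side, again with POSIX sockets in place of globus_io; recv_queue_append() is a hypothetical helper for the recv-queue above.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void recv_queue_append(void *data, size_t len)    /* hypothetical recv-queue */
    {
        (void)data; (void)len;                                /* real code would enqueue */
    }

    /* listen_callback in miniature: accept the connection, read the payload,
     * and copy it into the recv-queue for a later MPI_Recv to find.           */
    static void sketch_listen_callback(int listen_fd)
    {
        int fd = accept(listen_fd, NULL, NULL);               /* accept the connection  */
        if (fd < 0)
            return;
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));               /* read data from socket  */
        if (n > 0) {
            void *copy = malloc((size_t)n);                   /* copy into recv-queue   */
            memcpy(copy, buf, (size_t)n);
            recv_queue_append(copy, (size_t)n);
        }
        close(fd);
    }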
10. Making Channel Table
[Diagram: DUROC co-allocates four subjobs (SUBJOB_INDEX 0-3, Subjob_size 4, 2, 2, 4) on Grid1/Grid2.yonsei.ac.kr and Grid1/Grid2.sogang.ac.kr. Each subjob numbers its processes locally (0, 1, ...); the channel table lays them out subjob by subjob to give the global ranks of MPI_COMM_WORLD.]
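A toy calculation of the ordering the diagram implies (not MPICH-G2 code): a process's global rank is its local rank plus the sizes of all lower-indexed subjobs.

    #include <stdio.h>

    /* Global rank = sum of the sizes of all lower-indexed subjobs + local rank. */
    static int global_rank(const int *subjob_size, int subjob_index, int local_rank)
    {
        int base = 0;
        for (int i = 0; i < subjob_index; i++)
            base += subjob_size[i];
        return base + local_rank;
    }

    int main(void)
    {
        int sizes[4] = { 4, 2, 2, 4 };                /* the four subjobs on the slide */
        for (int s = 0; s < 4; s++)
            for (int r = 0; r < sizes[s]; r++)
                printf("subjob %d local %d -> global %d\n",
                       s, r, global_rank(sizes, s, r));
        return 0;
    }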
11. CommworldChannels
[Diagram: CommworldChannels is an array of struct channel_t with nprocs entries (0, 1, ...). Each entry holds a Proto_list and a Selected_proto, both struct miproto_t. A miproto_t carries a type (tcp, mpi, unknown), a void *info pointer, and a next pointer. For the TCP protocol, info points to: hostname, port, handlep, whandle, header, to_self, connection_lock, connection_cond, attr, cancel_head, cancel_tail, send_head, send_tail.]
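A C sketch of the structures the diagram names; the field types are guesses from the names, not the actual mpid/globus2 declarations.

    /* Field names follow the diagram; the types are assumptions. */
    struct tcp_info_t {                   /* per-channel TCP bookkeeping      */
        char  hostname[256];
        int   port;
        void *handlep;
        void *whandle;
        void *header;
        int   to_self;
        void *connection_lock;
        void *connection_cond;
        void *attr;
        void *cancel_head, *cancel_tail;  /* cancel queue                     */
        void *send_head,   *send_tail;    /* pending-send queue               */
    };

    struct miproto_t {
        enum { PROTO_TCP, PROTO_MPI, PROTO_UNKNOWN } type;
        void             *info;           /* e.g. struct tcp_info_t *         */
        struct miproto_t *next;
    };

    struct channel_t {
        struct miproto_t *proto_list;     /* all protocols that reach the peer */
        struct miproto_t *selected_proto; /* the one actually used             */
    };

    /* CommworldChannels: one channel_t per process in MPI_COMM_WORLD. */
    extern struct channel_t *CommworldChannels;   /* length: nprocs */
    extern int               nprocs;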
12. MPI Communication
[Sequence diagram between Process A (rank 3) and Process B (rank 5):
1. Create a passive socket
2. Get information about Process B
3. Get protocol information from COMMWORLDCHANNEL (rank 5: selected protocol tcp, link -> the TCP info record: hostname, port, handlep, whandle, header, to_self, connection_lock, connection_cond, attr, cancel_head, cancel_tail, send_head, send_tail)
4. Make a socket for writing
5. Connect
6. Accept the socket
7. Send the data
8. Call listen_callback
9. Copy the data into the unexpected queue (MPID_recvs: posted / unexpected)
10. Move the data to the posted queue
11. Delete the original buffer
12. Read the data from the posted queue]
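A sketch of the queue handling in steps 9-12, with a deliberately simplified element that matches on source rank and tag only; the real MPID_recvs bookkeeping is more involved.

    #include <stdlib.h>
    #include <string.h>

    /* One simplified queue element: source rank, tag, and the copied data.
     * (MPID_recvs keeps both a posted and an unexpected queue; only the
     * unexpected queue is sketched here.)                                   */
    struct rq_elem {
        int             src, tag;
        void           *buf;
        size_t          len;
        struct rq_elem *next;
    };

    static struct rq_elem *unexpected;

    /* Step 9: an arriving message with no matching posted receive is copied
     * into the unexpected queue.                                             */
    static void enqueue_unexpected(int src, int tag, const void *data, size_t len)
    {
        struct rq_elem *e = malloc(sizeof *e);
        e->src = src; e->tag = tag; e->len = len;
        e->buf = malloc(len);
        memcpy(e->buf, data, len);
        e->next = unexpected;
        unexpected = e;
    }

    /* Steps 10-12: MPI_Recv looks for a match, copies the data into the user
     * buffer, and deletes the original copy.                                  */
    static int match_and_copy(int src, int tag, void *user_buf, size_t maxlen)
    {
        for (struct rq_elem **p = &unexpected; *p; p = &(*p)->next) {
            struct rq_elem *e = *p;
            if (e->src == src && e->tag == tag) {
                size_t n = e->len < maxlen ? e->len : maxlen;
                memcpy(user_buf, e->buf, n);      /* 12. read the data           */
                *p = e->next;                     /* 10. take it off unexpected  */
                free(e->buf);                     /* 11. delete the original buf */
                free(e);
                return (int)n;
            }
        }
        return -1;                                /* no match: post the receive  */
    }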
13. Collective Operation
- Communication among multiple processes simultaneously
- Patterns (see the example below)
- Root sends data to all processes: broadcast and scatter
- Root receives data from all processes: gather
- Each process communicates with every other process: allgather and alltoall
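For concreteness, a minimal MPI program exercising the three patterns; the buffer contents and sizes are arbitrary example values.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Root sends to all: broadcast. */
        int value = (rank == 0) ? 42 : 0;
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Root receives from all: gather. */
        int *gathered = (rank == 0) ? malloc(size * sizeof(int)) : NULL;
        MPI_Gather(&rank, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Everyone talks to everyone: alltoall. */
        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++)
            sendbuf[i] = rank * 100 + i;
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        if (rank == 0)
            printf("broadcast value on root: %d\n", value);

        free(sendbuf); free(recvbuf); free(gathered);
        MPI_Finalize();
        return 0;
    }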
14. Basic Concept (cont.)
- Flat tree vs. binomial tree (see the toy model below)
- If Tr - Ts is small, binomial is better
- If Tr - Ts is large, flat tree is better
- M. Bernaschi et al., "Collective Communication Operations: Experimental Results vs. Theory," April 1998
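A toy cost model (not the one in the cited paper) that reproduces the rule of thumb, where Ts is the per-message send overhead and Tr the receive overhead: the flat tree finishes in about (P-1)*Ts + Tr, a binomial tree in about ceil(log2 P)*(Ts + Tr).

    #include <math.h>
    #include <stdio.h>

    /* Toy model only: flat tree = root sends P-1 messages back to back and
     * the last leaf then receives once; binomial = ceil(log2 P) rounds, each
     * with one send and one receive on the critical path.                    */
    static double flat_tree(int P, double Ts, double Tr) { return (P - 1) * Ts + Tr; }
    static double binomial(int P, double Ts, double Tr)  { return ceil(log2(P)) * (Ts + Tr); }

    int main(void)
    {
        int    P  = 8;
        double Ts = 1.0;
        /* Tr close to Ts: binomial wins; Tr much larger than Ts: flat tree wins. */
        for (double Tr = 1.0; Tr <= 9.0; Tr += 4.0)
            printf("Tr=%.0f  flat=%.0f  binomial=%.0f\n",
                   Tr, flat_tree(P, Ts, Tr), binomial(P, Ts, Tr));
        return 0;
    }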
15. Basic Concept
- Exploiting Hierarchy
- WAN_TCP < LAN_TCP < intra TCP < vendor MPI
[Diagram: processes P0-P9 and P10-P19 on machines at m1.utech.edu, P20-P29 at c1.nlab.gov. 1. WAN_TCP level: P0, P20. 2. LAN_TCP level: P0, P10. 3. Intra-TCP level: P10, ..., P19. 4. Vendor MPI level: P0, ..., P9 and P20, ..., P29.]
16. MPI_Bcast (cont.)
[Call flow: MPI_Bcast(buf, comm) -> comm_ptr = MPIR_To_Pointer(comm) -> comm_ptr->collops->Bcast(buf).
- If the communicator type is MPI_INTRA, this resolves to the intra-communicator path; otherwise to Inter_Bcast(buf), which is not supported yet.
- If MPID_Bcast() is #defined, the intra path is MPID_FN_Bcast(buf), the topology-aware bcast; otherwise it is Intra_Bcast(buf), the binomial bcast.]
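A toy rendering of that dispatch; the names mirror the slide, but the types and the glue code are simplified stand-ins, not the real MPICH source.

    /* Names mirror the slide; types and glue are simplified stand-ins. */
    typedef enum { COMM_INTRA, COMM_INTER } comm_type_t;

    typedef struct { int (*Bcast)(void *buf); } collops_t;

    typedef struct {
        comm_type_t type;
        collops_t  *collops;
    } communicator_t;

    static int Intra_Bcast(void *buf)   { (void)buf; return 0;  }  /* binomial bcast       */
    static int MPID_FN_Bcast(void *buf) { (void)buf; return 0;  }  /* topology-aware bcast */
    static int Inter_Bcast(void *buf)   { (void)buf; return -1; }  /* not supported yet    */

    /* Bound when the communicator is created: intra-communicators get the
     * topology-aware MPID_FN_Bcast when MPID_Bcast is #defined, otherwise the
     * plain binomial Intra_Bcast; inter-communicators get Inter_Bcast.        */
    static collops_t intra_ops = {
    #ifdef MPID_Bcast
        MPID_FN_Bcast
    #else
        Intra_Bcast
    #endif
    };
    static collops_t inter_ops = { Inter_Bcast };

    /* MPI_Bcast then simply resolves the communicator and calls through collops. */
    static int Bcast_sketch(void *buf, communicator_t *comm_ptr)
    {
        return comm_ptr->collops->Bcast(buf);
    }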
17. MPI_Bcast (cont.)
[Flow of MPID_FN_Bcast(buf): involve(comm, set_info); allocate a request; then, for all sets in set_info, run flat_tree_bcast(buf) at level 0 and binomial_bcast(buf) at the other levels. Within a set, the process that is root of the set sends with MPI_Isend(buf), while the other members call MPI_Recv(buf); in the tree case a member receives from its parent with MPI_Recv(buf) and passes the buffer on with MPI_Send(buf).]
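A sketch of the per-set step, assuming a simplified set descriptor (members given as ranks in comm) and standard MPI point-to-point calls; only the flat tree is shown, and the real flat_tree_bcast/binomial_bcast differ in detail.

    #include <mpi.h>

    /* Simplified stand-in for one entry of set_info (next slide):
     * members are listed as ranks in comm.                          */
    typedef struct {
        int  size;            /* number of ranks in this set      */
        int *ranks;           /* the members                      */
        int  root_index;      /* index of the set root in ranks[] */
        int  my_rank_index;   /* my own index in ranks[]          */
    } set_t;

    /* Flat tree: the set root sends to every other member with MPI_Isend;
     * the members receive once with MPI_Recv.                             */
    static void flat_tree_bcast(void *buf, int count, MPI_Datatype dt,
                                const set_t *s, MPI_Comm comm)
    {
        if (s->my_rank_index == s->root_index) {
            MPI_Request req[64];              /* assumes size <= 64 in this sketch */
            int nreq = 0;
            for (int i = 0; i < s->size; i++)
                if (i != s->root_index)
                    MPI_Isend(buf, count, dt, s->ranks[i], 0, comm, &req[nreq++]);
            MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
        } else {
            MPI_Recv(buf, count, dt, s->ranks[s->root_index], 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }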
18. struct multiple_set_t
[Diagram: the set_info built for two processes of the slide-15 example; each set entry records size, level, root_index, my_rank_index, and the member list.
- Rank 0: num = 3 sets: {0, 20}, {0, 10}, and {0, ..., 9}.
- Rank 10: num = 2 sets: {0, 10} and {10, ..., 19}.]
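A C sketch of multiple_set_t as the diagram suggests; the field names follow the slide, but the types are assumptions.

    /* Field names follow the diagram; the types are assumptions. */
    struct set_t {
        int  size;            /* number of processes in the set        */
        int  level;           /* hierarchy level (0 = WAN_TCP, ...)    */
        int  root_index;      /* index of the set root in set[]        */
        int  my_rank_index;   /* my own index in set[]                 */
        int *set;             /* member ranks, e.g. {0, 20}            */
    };

    struct multiple_set_t {
        int           num;    /* number of sets for this process       */
        struct set_t *sets;   /* e.g. rank 0: {0,20}, {0,10}, {0..9}   */
    };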
19. Project ?? ??
- Testbed ?? ???
- ???? ??? ?? ?????? ??
- ???? MPI ????? ??
- CPI, Matrix Multiplication
- ???? ???? MPI?? ??
20. Future Work
- ??? ?? ?? ??
- ???? ?? ??
- ???? ??? ??
- ?? ????
- CPI, Matrix Multiplication
- NAS Parallel Benchmarks
- ???? ???? ??
- Isend, Bsend, Ssend
- ?? ?? ?? ?? ? ?????