Title: GridMPI
1. GridMPI
- Yutaka Ishikawa
- University of Tokyo and Grid Technology Research Center at AIST
2. Background
- SCore Cluster System Software
  - Real World Computing Partnership (1992-2001)
  - Funded by the Ministry of Economy, Trade and Industry (METI)
- High Performance Communication Libraries
  - PMv2: 11.0 usec round-trip time, 240 MB/s bandwidth
  - MPICH-SCore MPI library: 24.4 usec round-trip time, 228 MB/s bandwidth
  - PM/Ethernet network trunking: utilizing more than one NIC
- Global Operating System: SCore-D
  - Single/multi-user environment
  - Gang scheduling
  - Checkpoint and restart
- Parallel Programming Language: MPC++ Multi-Thread Template Library
- Shared Memory Programming Support: Omni OpenMP on SCASH
3. RWC SCore III
- Host
  - NEC Express servers
  - Dual Pentium III, 933 MHz
  - 512 Mbytes of main memory
- Number of Hosts
  - 512 hosts (1,024 processors)
- Networks
  - Myrinet-2000 (2 Gbps + 2 Gbps)
  - 2 Ethernet links
- Linpack Result
  - 618.3 Gflops
- This was the world's fastest PC cluster as of August 2001
4. TOP 500 as of December 2002
- HELICS, rank 64th (825.0 Gflops)
  - 512 Athlon 1.4 GHz, Myrinet-2000
  - Heidelberg University IWR, http://helics.iwr.uni-heidelberg.de/
- Presto III, rank 68th (760.2 Gflops)
  - 512 Athlon 1.6 GHz, Myrinet-2000
  - GSIC, Tokyo Institute of Technology, http://www.gsic.titech.ac.jp/
- Magi Cluster, rank 86th (654.0 Gflops)
  - 1040 Pentium III 933 MHz, Myrinet-2000
  - CBRC-TACC/AIST, http://www.cbrc.jp/magi/
- RWC SCore III, rank 90th (618.3 Gflops)
  - 1024 Pentium III 933 MHz, Myrinet-2000
  - RWCP, http://www.rwcp.or.jp
5. SCore Users
- Japan
  - Universities
    - University of Tokyo, Tokyo Institute of Technology, University of Tsukuba, and others
  - Industries
    - Japanese car manufacturing companies use it on their production lines
- UK
  - Oxford University
  - Warwick University
- Germany
  - University of Bonn
  - University of Heidelberg
  - University of Tuebingen
- Streamline Computing Ltd: SCore integration business
6. PC Cluster Consortium
http://www.pccluster.org
- Purpose
  - Contribution to the PC cluster market through the development, maintenance, and promotion of cluster system software based on the SCore cluster system software and the Omni OpenMP compiler, developed at RWCP.
- Members
  - Japanese companies
    - NEC, Fujitsu, Hitachi, HP Japan, IBM Japan, Intel Japan, AMD Japan, and others
  - Research institutes
    - Tokyo Institute of Technology GSIC, RIKEN
  - Individuals
7. Lessons Learned
- A new MPI implementation is needed
  - It is tough to change/modify existing MPI implementations
  - The new MPI implementation should be
    - An open implementation, in addition to open source
    - Customizable, to implement a new protocol
- A new transport implementation is needed
  - The PM library does not run on top of the IP protocol
    - Not acceptable in the Grid environment
  - The current TCP/IP implementations (BSD and Linux) do not perform well in a large-latency environment (see the sketch after this list)
  - Mismatch between the socket API and the MPI communication model
  - The TCP/IP protocol is not the issue; its implementation is
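As a concrete illustration of the latency point above (not GridMPI code): with a stock BSD/Linux TCP stack, throughput on a high-latency path is capped by the socket buffer size, so an MPI-over-TCP transport has to size its buffers to at least the bandwidth-delay product. The sketch below is a minimal, hypothetical C example; the bandwidth and delay figures are taken from the metropolitan-area numbers discussed later in the talk, and the function name is invented.

    /* Minimal sketch (not GridMPI code): size a TCP socket's buffers to the
     * bandwidth-delay product so the TCP window does not cap throughput on a
     * high-latency path.  Example figures: 1 Gbps link, 2 ms one-way delay. */
    #include <sys/socket.h>
    #include <netinet/in.h>

    int make_wide_area_socket(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        /* Bandwidth-delay product: (1 Gbps / 8) * 4 ms RTT = 500,000 bytes. */
        int bdp = (1000 * 1000 * 1000 / 8) / 1000 * 4;

        /* Request send/receive buffers that cover the BDP; the kernel may
         * clamp the values to its configured maximums. */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp));
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));
        return fd;
    }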
8. Lessons Learned
- Mismatch between the socket API and the MPI communication model
  - MPI_Irecv(buf, MPI_ANY_SOURCE, MPI_ANY_TAG, ...)
  - MPI_Irecv(buf, 1, 2, ...)
  - MPI_Irecv(buf, 1, MPI_ANY_TAG, ...)
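The fragments above are the slide's shorthand for MPI's (source, tag) matching. A hedged sketch of why this does not map directly onto the socket API: an MPI receive with wildcards can be satisfied by any peer, while a socket recv() is bound to a single connection, so a TCP-based MPI must multiplex all connections and do the matching itself. The function and buffer names below are illustrative only.

    #include <mpi.h>

    /* One MPI call expresses "receive from anyone, with any tag" ... */
    void post_wildcard_receive(void *buf, int count)
    {
        MPI_Request req;
        MPI_Status  status;

        MPI_Irecv(buf, count, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &status);

        /* ... whereas a socket-based transport must poll one TCP connection
         * per peer (e.g., with select/poll) and perform the (source, tag)
         * matching against posted receives in user space. */
    }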
9. GridMPI
- Latency-aware MPI implementation
  - Applications are developed on a small cluster located at a lab
  - Production runs take place in the Grid environment
- (Figure: an application-development cluster and data resources connected over the Internet)
10. Is It Feasible?
- Is it feasible to run non-EP (Embarrassingly Parallel) applications on Grid-connected clusters?
  - NO for long-distance networks
  - YES for metropolitan- or campus-area networks
- Example: Greater Tokyo area (a rough check follows after this list)
  - Diameter: 100-300 km (60-200 miles)
  - Latency: 1-2 ms one-way
  - Bandwidth: 1-10 Gbps or more
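A back-of-the-envelope check of the claim above, using the stated figures (1 Gbps-class links, 1-2 ms one-way within the Tokyo area versus tens of milliseconds for a long-distance path). The message size is an illustrative assumption; the point is only that a few milliseconds leave a synchronous exchange close to the link's nominal rate, while tens of milliseconds do not.

    #include <stdio.h>

    int main(void)
    {
        double msg_bytes   = 1.0e6;        /* 1 MB message (assumed)   */
        double bytes_per_s = 1.0e9 / 8.0;  /* 1 Gbps link              */
        double one_way_s[] = { 0.001, 0.002, 0.050 };  /* 1, 2, 50 ms  */

        for (int i = 0; i < 3; i++) {
            /* Time for one blocking exchange: transfer time plus one RTT. */
            double t = msg_bytes / bytes_per_s + 2.0 * one_way_s[i];
            printf("one-way %2.0f ms: effective rate %6.1f MB/s\n",
                   one_way_s[i] * 1000.0, msg_bytes / t / 1.0e6);
        }
        return 0;
    }
    /* Prints roughly 100 MB/s at 1 ms, 83 MB/s at 2 ms, and 9 MB/s at 50 ms,
     * against a nominal 125 MB/s. */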
11. Experimental Environment
- Two 16-node clusters (Cluster 1 and Cluster 2), each on 1 Gbps Ethernet
- The clusters are connected through router PCs running NIST Net
- Injected delays: 0.5 ms, 1.0 ms, 1.5 ms, 2.0 ms, 10 ms
12. NAS Parallel Benchmark Results
- Benchmarks: CG, LU, and MG (Class B)
- Scalability relative to the 16-node MPICH-SCore case with no delay
- Speedup: 1.2x to 2x
- Memory usage: 2x
13. Approach
- Latency-aware Communication Facility
  - New TCP/IP implementation
    - New socket API
    - Additional features for MPI
  - New communication protocol at the MPI implementation level
    - Message routing
    - Dynamic collective communication path
14. GridMPI Software Architecture
- MPI Core
  - Provides MPI features: Communicator, Group, and Topology
  - Provides MPI communication facilities, implemented using the Grid ADI
- RPIM (Remote Process Invocation Mechanism)
  - Abstraction of remote process invocation mechanisms
- IMPI
  - Interoperable MPI specification
- Grid ADI
  - Abstraction of communication facilities (a hypothetical sketch follows after this list)
- LACT (Latency-Aware Communication Topology)
  - Transparency of latency and network topology
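To make the role of the Grid ADI more concrete, here is a hypothetical sketch of such an abstraction layer in C: the MPI core calls through a small table of function pointers, and each transport (an intra-cluster device, IMPI over TCP, a LACT-routed path) supplies its own table. The type and function names are invented for illustration and are not the actual GridMPI interface.

    #include <stddef.h>

    /* Hypothetical transport interface behind a Grid-ADI-style abstraction. */
    typedef struct grid_adi_ops {
        int (*init)(void);
        int (*send)(int dest, const void *buf, size_t len, int tag);
        int (*recv)(int src, void *buf, size_t len, int tag);
        int (*finalize)(void);
    } grid_adi_ops;

    /* The MPI core is written against the table only, so adding a new
     * wide-area protocol means supplying another ops table rather than
     * rewriting the point-to-point layer. */
    static const grid_adi_ops *active_transport;

    int adi_send(int dest, const void *buf, size_t len, int tag)
    {
        return active_transport->send(dest, buf, len, tag);
    }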
15. LACT (Latency-Aware Communication Topology)
- Takes network bandwidth and latency into account
- Message routing using point-to-point communication
  - Independent of IP routing
- Collecting data in collective communication
  - Communication pattern matched to the network topology
16. LACT (Latency-Aware Communication Topology): An Example Reduction
- Takes network bandwidth and latency into account
- Message routing using point-to-point communication
  - Independent of IP routing
- Collecting data in collective communication
  - Communication pattern matched to the network topology
- Example topology (from the figure): Clusters A, B, C, and D interconnected by links of 10 Gbps / 1 ms latency, 1 Gbps / 0.5 ms (two links), and 100 Mbps / 2 ms (a minimal reduction sketch follows below)
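A minimal sketch of the kind of latency-aware reduction the example describes: combine values inside each cluster first, over the fast local links, and only then combine the per-cluster partial results across the slow wide-area links. It assumes every process knows its cluster ID; this illustrates the idea, not the actual GridMPI/LACT algorithm.

    #include <mpi.h>

    /* Reduce a double across Grid-connected clusters in two stages. */
    double hierarchical_reduce_sum(double local_value, int cluster_id)
    {
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* One communicator per cluster. */
        MPI_Comm cluster_comm;
        MPI_Comm_split(MPI_COMM_WORLD, cluster_id, world_rank, &cluster_comm);

        /* Stage 1: intra-cluster reduction over low-latency links. */
        double cluster_sum = 0.0;
        MPI_Reduce(&local_value, &cluster_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, cluster_comm);

        /* Stage 2: only the cluster leaders (local rank 0) talk across the
         * wide-area links, so few messages cross the slow paths. */
        int cluster_rank;
        MPI_Comm_rank(cluster_comm, &cluster_rank);
        MPI_Comm leader_comm;
        MPI_Comm_split(MPI_COMM_WORLD,
                       cluster_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &leader_comm);

        double total = 0.0;
        if (leader_comm != MPI_COMM_NULL) {
            MPI_Reduce(&cluster_sum, &total, 1, MPI_DOUBLE, MPI_SUM,
                       0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }
        MPI_Comm_free(&cluster_comm);

        /* The global sum ends up on the root of leader_comm. */
        return total;
    }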
17. Schedule
- Current
- The first GridMPI implementation
- A part of MPI-1 and IMPI
- NAS parallel benchmarks run
- FY 2003
- GridMPI version 0.1
- MPI-1 and IMPI
- Prototype of new TCP/IP implementation
- Prototype of a LACT implementation
- FY 2004
- GridMPI version 0.5
- MPI-2
- New TCP/IP implementation
- LACT implementation
- OGSA interface
- Vendor MPI
- FY 2005
- GridMPI version 1.0