Title: GridMPI
1. GridMPI
- Yutaka Ishikawa
- University of Tokyo and Grid Technology Research Center at AIST
2. Background
- SCore Cluster System Software
  - Real World Computing Partnership (1992-2001)
  - Funded by the Ministry of Economy, Trade and Industry (METI)
- High Performance Communication Libraries
  - PMv2: 11.0 usec round-trip time, 240 MB/s bandwidth
  - MPICH-SCore MPI library: 24.4 usec round-trip time, 228 MB/s bandwidth
  - PM/Ethernet network trunking: utilizing more than one NIC
- Global Operating System: SCore-D
  - Single/multi-user environment
  - Gang scheduling
  - Checkpoint and restart
- Parallel Programming Language: MPC++ Multi-Thread Template Library
- Shared Memory Programming Support: Omni OpenMP on SCASH
3. RWC SCore III
- Host
  - NEC Express servers
  - Dual Pentium III, 933 MHz
  - 512 Mbytes of main memory
- Number of Hosts
  - 512 hosts (1,024 processors)
- Networks
  - Myrinet-2000 (2 Gbps + 2 Gbps)
  - 2 Ethernet links
- Linpack Result
  - 618.3 Gflops
- This was the world's fastest PC cluster as of August 2001
4. TOP 500 as of December 2002
- HELICS, rank 64th (825.0 Gflops)
  - 512 Athlon 1.4 GHz, Myrinet-2000
  - Heidelberg University IWR, http://helics.iwr.uni-heidelberg.de/
- Presto III, rank 68th (760.2 Gflops)
  - 512 Athlon 1.6 GHz, Myrinet-2000
  - GSIC, Tokyo Institute of Technology, http://www.gsic.titech.ac.jp/
- Magi Cluster, rank 86th (654.0 Gflops)
  - 1040 Pentium III 933 MHz, Myrinet-2000
  - CBRC-TACC/AIST, http://www.cbrc.jp/magi/
- RWC SCore III, rank 90th (618.3 Gflops)
  - 1024 Pentium III 933 MHz, Myrinet-2000
  - RWCP, http://www.rwcp.or.jp
5. SCore Users
- Japan
  - Universities
    - University of Tokyo, Tokyo Institute of Technology, University of Tsukuba, and others
  - Industries
    - Japanese car manufacturing companies use it on their production lines
- UK
  - Oxford University
  - Warwick University
- Germany
  - University of Bonn
  - University of Heidelberg
  - University of Tuebingen
- Streamline Computing Ltd: SCore integration business
6. PC Cluster Consortium
http://www.pccluster.org
- Purpose
  - Contribution to the PC cluster market through the development, maintenance, and promotion of cluster system software based on the SCore cluster system software and the Omni OpenMP compiler, developed at RWCP.
- Members
  - Japanese companies
    - NEC, Fujitsu, Hitachi, HP Japan, IBM Japan, Intel Japan, AMD Japan, and others
  - Research institutes
    - Tokyo Institute of Technology GSIC, RIKEN
  - Individuals
7. Lessons Learned
- A new MPI implementation is needed
  - It is tough to change/modify existing MPI implementations
  - The new MPI implementation should be
    - An open implementation, in addition to open source
    - Customizable, to implement a new protocol
- A new transport implementation is needed
  - The PM library does not run on top of the IP protocol
    - Not acceptable in the Grid environment
  - The current TCP/IP implementations (BSD and Linux) do not perform well in a large-latency environment (see the sketch after this list)
  - Mismatch between the socket API and the MPI communication model
  - The TCP/IP protocol is not the issue; its implementation is
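As a concrete illustration of the latency point above (not GridMPI code): with a stock BSD/Linux TCP stack, throughput on a high-latency path is capped by the socket buffer size, so an MPI-over-TCP transport has to size its buffers to at least the bandwidth-delay product. The sketch below is a minimal, hypothetical C example; the bandwidth and delay figures are taken from the metropolitan-area numbers discussed later in the talk, and the function name is invented.

    /* Minimal sketch (not GridMPI code): size a TCP socket's buffers to the
     * bandwidth-delay product so the TCP window does not cap throughput on a
     * high-latency path.  Example figures: 1 Gbps link, 2 ms one-way delay. */
    #include <sys/socket.h>
    #include <netinet/in.h>

    int make_wide_area_socket(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        /* Bandwidth-delay product: (1 Gbps / 8) * 4 ms RTT = 500,000 bytes. */
        int bdp = (1000 * 1000 * 1000 / 8) / 1000 * 4;

        /* Request send/receive buffers that cover the BDP; the kernel may
         * clamp the values to its configured maximums. */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp));
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));
        return fd;
    }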
8. Lessons Learned
- Mismatch between the socket API and the MPI communication model
  - MPI_Irecv(buf, MPI_ANY_SOURCE, MPI_ANY_TAG, ...)
  - MPI_Irecv(buf, 1, 2, ...)
  - MPI_Irecv(buf, 1, MPI_ANY_TAG, ...)
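The fragments above are the slide's shorthand for MPI's (source, tag) matching. A hedged sketch of why this does not map directly onto the socket API: an MPI receive with wildcards can be satisfied by any peer, while a socket recv() is bound to a single connection, so a TCP-based MPI must multiplex all connections and do the matching itself. The function and buffer names below are illustrative only.

    #include <mpi.h>

    /* One MPI call expresses "receive from anyone, with any tag" ... */
    void post_wildcard_receive(void *buf, int count)
    {
        MPI_Request req;
        MPI_Status  status;

        MPI_Irecv(buf, count, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &status);

        /* ... whereas a socket-based transport must poll one TCP connection
         * per peer (e.g., with select/poll) and perform the (source, tag)
         * matching against posted receives in user space. */
    }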
9. GridMPI
- Latency-aware MPI implementation
  - Applications are developed on a small cluster located at a lab
  - Production runs take place in the Grid environment
- (Figure: an application-development cluster and data resources connected over the Internet)
10. Is It Feasible?
- Is it feasible to run non-EP (Embarrassingly Parallel) applications on Grid-connected clusters?
  - NO for long-distance networks
  - YES for metropolitan- or campus-area networks
- Example: Greater Tokyo area (a rough check follows after this list)
  - Diameter: 100-300 km (60-200 miles)
  - Latency: 1-2 ms one-way
  - Bandwidth: 1-10 Gbps or more
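A back-of-the-envelope check of the claim above, using the stated figures (1 Gbps-class links, 1-2 ms one-way within the Tokyo area versus tens of milliseconds for a long-distance path). The message size is an illustrative assumption; the point is only that a few milliseconds leave a synchronous exchange close to the link's nominal rate, while tens of milliseconds do not.

    #include <stdio.h>

    int main(void)
    {
        double msg_bytes   = 1.0e6;        /* 1 MB message (assumed)   */
        double bytes_per_s = 1.0e9 / 8.0;  /* 1 Gbps link              */
        double one_way_s[] = { 0.001, 0.002, 0.050 };  /* 1, 2, 50 ms  */

        for (int i = 0; i < 3; i++) {
            /* Time for one blocking exchange: transfer time plus one RTT. */
            double t = msg_bytes / bytes_per_s + 2.0 * one_way_s[i];
            printf("one-way %2.0f ms: effective rate %6.1f MB/s\n",
                   one_way_s[i] * 1000.0, msg_bytes / t / 1.0e6);
        }
        return 0;
    }
    /* Prints roughly 100 MB/s at 1 ms, 83 MB/s at 2 ms, and 9 MB/s at 50 ms,
     * against a nominal 125 MB/s. */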
11. Experimental Environment
- Two 16-node clusters (Cluster 1 and Cluster 2), each on 1 Gbps Ethernet
- The clusters are connected through router PCs running NIST Net
- Injected delays: 0.5 ms, 1.0 ms, 1.5 ms, 2.0 ms, 10 ms
12. NAS Parallel Benchmark Results
- Benchmarks: CG, LU, and MG (Class B)
- Scalability relative to the 16-node MPICH-SCore case with no delay
- Speedup: 1.2x to 2x
- Memory usage: 2x
13. Approach
- Latency-aware Communication Facility
  - New TCP/IP implementation
    - New socket API
    - Additional features for MPI
  - New communication protocol at the MPI implementation level
    - Message routing
    - Dynamic collective communication path
14. GridMPI Software Architecture
- MPI Core
  - Provides MPI features: Communicator, Group, and Topology
  - Provides MPI communication facilities, implemented using the Grid ADI
- RPIM (Remote Process Invocation Mechanism)
  - Abstraction of remote process invocation mechanisms
- IMPI
  - Interoperable MPI specification
- Grid ADI
  - Abstraction of communication facilities (a hypothetical sketch follows after this list)
- LACT (Latency-Aware Communication Topology)
  - Transparency of latency and network topology
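To make the role of the Grid ADI more concrete, here is a hypothetical sketch of such an abstraction layer in C: the MPI core calls through a small table of function pointers, and each transport (an intra-cluster device, IMPI over TCP, a LACT-routed path) supplies its own table. The type and function names are invented for illustration and are not the actual GridMPI interface.

    #include <stddef.h>

    /* Hypothetical transport interface behind a Grid-ADI-style abstraction. */
    typedef struct grid_adi_ops {
        int (*init)(void);
        int (*send)(int dest, const void *buf, size_t len, int tag);
        int (*recv)(int src, void *buf, size_t len, int tag);
        int (*finalize)(void);
    } grid_adi_ops;

    /* The MPI core is written against the table only, so adding a new
     * wide-area protocol means supplying another ops table rather than
     * rewriting the point-to-point layer. */
    static const grid_adi_ops *active_transport;

    int adi_send(int dest, const void *buf, size_t len, int tag)
    {
        return active_transport->send(dest, buf, len, tag);
    }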
15. LACT (Latency-Aware Communication Topology)
- Takes network bandwidth and latency into account
- Message routing using point-to-point communication
  - Independent of IP routing
- Collecting data in collective communication
  - Communication pattern matched to the network topology
16. LACT (Latency-Aware Communication Topology): An Example Reduction
- Takes network bandwidth and latency into account
- Message routing using point-to-point communication
  - Independent of IP routing
- Collecting data in collective communication
  - Communication pattern matched to the network topology
- Example topology (from the figure): Clusters A, B, C, and D interconnected by links of 10 Gbps / 1 ms latency, 1 Gbps / 0.5 ms (two links), and 100 Mbps / 2 ms (a minimal reduction sketch follows below)
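A minimal sketch of the kind of latency-aware reduction the example describes: combine values inside each cluster first, over the fast local links, and only then combine the per-cluster partial results across the slow wide-area links. It assumes every process knows its cluster ID; this illustrates the idea, not the actual GridMPI/LACT algorithm.

    #include <mpi.h>

    /* Reduce a double across Grid-connected clusters in two stages. */
    double hierarchical_reduce_sum(double local_value, int cluster_id)
    {
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* One communicator per cluster. */
        MPI_Comm cluster_comm;
        MPI_Comm_split(MPI_COMM_WORLD, cluster_id, world_rank, &cluster_comm);

        /* Stage 1: intra-cluster reduction over low-latency links. */
        double cluster_sum = 0.0;
        MPI_Reduce(&local_value, &cluster_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, cluster_comm);

        /* Stage 2: only the cluster leaders (local rank 0) talk across the
         * wide-area links, so few messages cross the slow paths. */
        int cluster_rank;
        MPI_Comm_rank(cluster_comm, &cluster_rank);
        MPI_Comm leader_comm;
        MPI_Comm_split(MPI_COMM_WORLD,
                       cluster_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &leader_comm);

        double total = 0.0;
        if (leader_comm != MPI_COMM_NULL) {
            MPI_Reduce(&cluster_sum, &total, 1, MPI_DOUBLE, MPI_SUM,
                       0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }
        MPI_Comm_free(&cluster_comm);

        /* The global sum ends up on the root of leader_comm. */
        return total;
    }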
17. Schedule
- Current
- The first GridMPI implementation
- A part of MPI-1 and IMPI
- NAS parallel benchmarks run
- FY 2003
- GridMPI version 0.1
- MPI-1 and IMPI
- Prototype of new TCP/IP implementation
- Prototype of a LACT implementation
- FY 2004
- GridMPI version 0.5
- MPI-2
- New TCP/IP implementation
- LACT implementation
- OGSA interface
- Vendor MPI
- FY 2005
- GridMPI version 1.0