Title: Enabling the Efficient Use of SMP Clusters
1. Enabling the Efficient Use of SMP Clusters
- The GAMESS / DDI Approach
Ryan M. Olson, Iowa State University
2. Overview
- Trends in Supercomputers and Beowulf Clusters.
- Distributed-Memory Programming
- Distributed Data Interface (DDI)
- Current Implementation for Clusters
- Improvements to Maximize Efficiency on SMPs and
Clusters of SMPs.
3. Trends in Supercomputers (ASCI)
- ASCI Red
- 4536 Dual-CPU Pentium Pro 200 MHz/128 MB
- ASCI Blue-Pacific
- 1464 4-CPU IBM PowerPC 604e
- ASCI Q
- 3072 4-CPU HP AlphaServer ES45s
- ASCI White
- 512 16-CPU IBM RS/6000 SP
- ASCI Purple
- 196 64-CPU IBM Power5 (50 TB of Memory!!)
- That's 12,544 processors!!
4. Beowulf Clusters
- Beowulf Cluster
- Commodity PC components
- Dedicated Compute Nodes on a Dedicated Communication Network
- The Two Monsters
- Time: Get things done faster.
- Money: Supercomputers are expensive.
- Modern Beowulf Clusters
- Multi-processor (< 4 CPUs) nodes built on a high-performance network (Gigabit, Myrinet, Quadrics, InfiniBand, etc.)
5. More Processors per Node - Reasons, Benefits, Limitations
- Traditional Bottleneck for HPC
- The NETWORK!!
- Really fast = ridiculously expensive!
- Increased Computational Density
- More CPUs with the same number of network connections
- Cost effective
- Less Dedicated Network Bandwidth per CPU
- More Complicated Memory Model
- Some means of exploiting shared memory is needed
- Explicitly programmed in the application
- Latent benefit through SMP-aware libraries (MPI, etc.)
6. Our Interests
- Computational Chemistry
- GAMESS
- Calculations can take a long time
- 40-day CCSD(T)
- Required 10 GB of Memory and 100 GB of disk
- Distributed-Memory Parallelism
- Memory requirements scale as O(N^4)
- Operation counts scale steeply, e.g. CCSD(T): O(Nv^4 No^3 + No^4 Nv^3)
7. Gold Clusters (Au8)
- Determine Lowest Energy Structure
- Multiple different levels of theory
- Most accurate method available: CCSD(T)
- 1 energy ~ 40 days per structure
8. Our Approach
- Develop a common set of tools for distributed-memory programming
- The Distributed Data Interface (DDI)
- DDI allows us to
- Create distributed-data arrays
- Access any element of a DD array (regardless of physical location) via one-sided communication (see the usage sketch below)
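A minimal, single-process mock of the DDI calls named on the following slides (DDI_Create / DDI_Put / DDI_Get), shown only to illustrate the usage pattern. The real DDI spreads the array across many processes and uses one-sided communication; the signatures and the "whole column as a patch" simplification below are assumptions, not the actual library interface.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the distributed storage: one local array. */
static double *storage;
static int nrows_;

/* Create an NRows x NCols "distributed" array (mock: a single allocation). */
void DDI_Create(int *handle, int nrows, int ncols) {
    storage = calloc((size_t)nrows * ncols, sizeof *storage);
    nrows_  = nrows;
    *handle = 0;                      /* only one array in this mock */
}

/* A "patch" here is simply a whole column; real DDI patches are more general. */
void DDI_Put(int handle, int col, const double *buf) {
    (void)handle;
    memcpy(storage + (size_t)col * nrows_, buf, nrows_ * sizeof *buf);
}

void DDI_Get(int handle, int col, double *buf) {
    (void)handle;
    memcpy(buf, storage + (size_t)col * nrows_, nrows_ * sizeof *buf);
}

int main(void) {
    int handle;
    double colin[4] = { 1.0, 2.0, 3.0, 4.0 }, colout[4];

    DDI_Create(&handle, 4, 8);        /* 4 x 8 matrix                      */
    DDI_Put(handle, 3, colin);        /* write column 3, wherever it lives */
    DDI_Get(handle, 3, colout);       /* read it back through the same API */
    printf("colout[2] = %g\n", colout[2]);

    free(storage);
    return 0;
}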
9. DDI Implementations
[Diagram: DDI software stack]
- GAMESS (Application Level)
- Distributed Data Interface (DDI) (High-Level API)
- Implementation layer: native and non-native implementations (SHMEM / GPSHMEM, MPI-2, MPI-1 + GA, MPI-1, TCP/IP, System V IPC)
- Hardware API (Elan, GM, etc.)
10. Virtual-Shared Memory Model - Distributed-Matrix Example
[Figure: A distributed matrix created with DDI_Create(Handle, NRows, NCols). The NRows x NCols matrix is split into column sub-patches, one per CPU (CPU 0-3), held in distributed-memory storage.]
- Two types of distributed memory
- Local: fastest access
- Remote: accessible, with a penalty
(A small sketch of the column-to-CPU mapping follows.)
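A small, self-contained illustration of the column-block ownership pictured above: each of the P processes gets a contiguous block of columns. The even split used here is an assumption consistent with the figure, not the exact DDI distribution formula.

#include <stdio.h>

/* First (inclusive) and last+1 (exclusive) column owned by process p. */
static void col_range(int ncols, int nproc, int p, int *lo, int *hi) {
    int base = ncols / nproc, extra = ncols % nproc;
    *lo = p * base + (p < extra ? p : extra);
    *hi = *lo + base + (p < extra ? 1 : 0);
}

int main(void) {
    int ncols = 10, nproc = 4;
    for (int p = 0; p < nproc; p++) {
        int lo, hi;
        col_range(ncols, nproc, p, &lo, &hi);
        printf("CPU %d owns columns [%d, %d)\n", p, lo, hi);
    }
    return 0;
}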
11. Virtual Shared Memory (Cray SHMEM) - Native DDI Implementation
- Three essential distributed-data operations
- DDI_Get (Handle,Patch,MyBuffer)
- DDI_Put (Handle,Patch,MyBuffer)
- DDI_Acc (Handle,Patch,MyBuffer)
[Figure: DDI_GET, DDI_PUT, and DDI_ACC operations issued by CPUs 0-3 against the distributed-memory storage.]
12. Virtual Shared Memory for Clusters - Original DDI Implementation
- Remote one-sided access is not directly supported by standard message-passing libraries (MPI-1, TCP/IP sockets, etc.)
- This requires a specialized model
- DDI used a data-server model (see the sketch after the figure)
[Figure: Node 0 (CPU0, CPU1) and Node 1 (CPU2, CPU3). Compute processes 0-3 issue GET/PUT/ACC requests to data servers 4-7, which hold the distributed-memory storage.]
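A sketch of what a data-server process does under this model, written against MPI-1 send/recv since that is one of the transports named above. The request structure, the message tags, and the local_block() helper are illustrative assumptions, not the actual DDI wire protocol.

#include <mpi.h>
#include <stdlib.h>

enum { REQ_GET = 1, REQ_PUT = 2, REQ_ACC = 3, REQ_QUIT = 4 };

typedef struct {            /* hypothetical request descriptor             */
    int type;               /* REQ_GET / REQ_PUT / REQ_ACC / REQ_QUIT      */
    int handle;             /* distributed-array handle                    */
    int offset;             /* offset into this server's local block       */
    int nelem;              /* number of doubles in the requested patch    */
} Request;

/* Would return this data server's portion of the distributed array;
 * here it is just a stand-in declaration.                                 */
extern double *local_block(int handle);

void data_server_loop(void) {
    for (;;) {
        Request req;
        MPI_Status st;

        /* Wait for a request from any compute process. */
        MPI_Recv(&req, (int)sizeof req, MPI_BYTE, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, &st);
        if (req.type == REQ_QUIT) break;

        double *loc = local_block(req.handle) + req.offset;

        if (req.type == REQ_GET) {
            /* GET: ship the requested patch back to the caller. */
            MPI_Send(loc, req.nelem, MPI_DOUBLE, st.MPI_SOURCE, 1,
                     MPI_COMM_WORLD);
        } else {
            /* PUT overwrites, ACC accumulates into the local patch. */
            double *tmp = malloc(req.nelem * sizeof *tmp);
            MPI_Recv(tmp, req.nelem, MPI_DOUBLE, st.MPI_SOURCE, 1,
                     MPI_COMM_WORLD, &st);
            for (int i = 0; i < req.nelem; i++)
                loc[i] = (req.type == REQ_PUT) ? tmp[i] : loc[i] + tmp[i];
            free(tmp);
        }
    }
}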
13. Data Server Model - Advantages / Disadvantages
- Very portable and easy to implement
- All inter-process communication is handled via send/recv operations through the message-passing library
- Inherits any latent advantages of the message-passing library (e.g. an SMP-aware MPI)
- Inherits any latent disadvantages of the message-passing library (e.g. MPI polling)
- Ignores data locality
14. Improved Data Server Model - Fast-Link Model
[Figure: Node 0 (CPU0, CPU1) and Node 1 (CPU2, CPU3). Compute processes 0-3 issue GET/PUT/ACC requests to data servers 4-7, but the distributed-memory storage now lives in System V shared-memory segments, so each compute process has a fast link to its own local data.]
- Fast access to local distributed data (see the sketch below)
- Maximize the use of local data!!
- Ignores the remaining intra-node data
- Generates a race condition!!
- Exclusive access to distributed data is not guaranteed!!
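One way to picture the fast link: a get first checks whether the requested patch is owned locally; if so it is copied straight out of the attached System V segment, otherwise the usual data-server request is made. All helper names below (owner_node, segment_addr, request_from_server) are hypothetical stand-ins, and the unlocked local copy is exactly where the race condition noted above comes from.

#include <string.h>

extern int     my_node;                              /* this process's node id      */
extern int     owner_node(int handle, int off);      /* node holding that offset    */
extern double *segment_addr(int handle, int off);    /* address inside attached shm */
extern void    request_from_server(int node, int handle,
                                   int off, int n, double *buf);

void fast_get(int handle, int off, int n, double *buf) {
    if (owner_node(handle, off) == my_node) {
        /* Local distributed data: read it straight from the shared segment,
         * with no message to a data server (and, here, no locking).        */
        memcpy(buf, segment_addr(handle, off), n * sizeof *buf);
    } else {
        /* Remote data: fall back to the usual data-server request. */
        request_from_server(owner_node(handle, off), handle, off, n, buf);
    }
}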
15. Access Control for Shared Memory - System V Semaphores
- General (counting) semaphores
- Initial value: BIG_NUM
- An operation blocks if the resource is not available
- Read access: decrement by 1 (not exclusive)
- Write access: decrement by BIG_NUM (exclusive access)
(A semaphore sketch follows the figure.)
[Figure: Layout of a shared-memory segment: an index, the distributed arrays (Array 0, Array 1, Array 2), each guarded by its own access semaphore (Access0, Access1, Access2), and the remaining free space.]
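The read/write scheme above maps naturally onto a single System V counting semaphore; the sketch below is one way to express it in C. BIG_NUM, the key handling, and the missing error checks are illustrative choices, not DDI's actual code.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On many systems the application must declare union semun itself. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

/* BIG_NUM just has to exceed the number of concurrent readers; it is also
 * bounded by SEMVMX (typically 32767), so that value is used here.        */
#define BIG_NUM 32767

static void sem_change(int semid, int delta) {
    struct sembuf op;
    op.sem_num = 0;                   /* the access semaphore for one array   */
    op.sem_op  = (short)delta;
    op.sem_flg = 0;                   /* no IPC_NOWAIT: block until available */
    semop(semid, &op, 1);
}

/* Create the semaphore and give it its initial value of BIG_NUM. */
int create_access_sem(key_t key) {
    int semid = semget(key, 1, IPC_CREAT | 0600);
    union semun arg;
    arg.val = BIG_NUM;
    semctl(semid, 0, SETVAL, arg);
    return semid;
}

/* Read access takes one unit: many readers can hold it at once.           */
void lock_read(int semid)    { sem_change(semid, -1); }
void unlock_read(int semid)  { sem_change(semid, +1); }

/* Write access takes all BIG_NUM units: it blocks until every reader has
 * finished and keeps everyone else out until it is released.              */
void lock_write(int semid)   { sem_change(semid, -BIG_NUM); }
void unlock_write(int semid) { sem_change(semid, +BIG_NUM); }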
16. Further Improved Data Server Model - Full SMP Implementation
[Figure: Node 0 (CPU0, CPU1) and Node 1 (CPU2, CPU3). Compute processes 0-3 issue GET/PUT/ACC requests; the distributed-memory storage lives in System V shared-memory segments, and data servers 4-7 can each service any of their node's data.]
- Data servers on a node are equivalent
- Do we need so many?
- Only one data request needs to be sent per node (see the sketch below)
- Global operations require significantly fewer point-to-point operations
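A sketch of how a get could be split by node under the FULL model: node-local data comes straight from the shared segments, and at most one request goes to each remote node. The helpers (node_range, shm_addr, send_node_request) and the flat-offset view of a patch are hypothetical, not actual DDI routines.

#include <string.h>

extern int  my_node, n_nodes;
/* Intersection of the requested range [lo, hi) with the portion of `handle`
 * owned by `node`; returns 0 if the intersection is empty.                  */
extern int  node_range(int handle, int node, int lo, int hi, int *nlo, int *nhi);
extern double *shm_addr(int handle, int offset);            /* attached SysV segment */
extern void send_node_request(int node, int handle,
                              int lo, int hi, double *buf); /* one message per node  */

void full_smp_get(int handle, int lo, int hi, double *buf) {
    for (int node = 0; node < n_nodes; node++) {
        int nlo, nhi;
        if (!node_range(handle, node, lo, hi, &nlo, &nhi)) continue;

        if (node == my_node) {
            /* Everything this node owns -- not just my own columns -- is
             * visible through the shared segments, so copy it directly.     */
            memcpy(buf + (nlo - lo), shm_addr(handle, nlo),
                   (nhi - nlo) * sizeof *buf);
        } else {
            /* One request per remote node, serviced by any one of that
             * node's equivalent data servers.                               */
            send_node_request(node, handle, nlo, nhi, buf + (nlo - lo));
        }
    }
}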
17. Benchmark
- MP2 gradient: distributed-memory algorithm; operations scale as O(N^5), memory as O(N^4)
- Benzoquinone
- N = 245 (atomic basis functions)
- Total aggregate memory needed: 1024 MB
- A relatively small problem
18. Average Data Transfer
19. Timings
20. Conclusions / Future Work
- Major performance benefit from explicit use of shared memory
- FAST model: speeds up the data used most
- A good first step; good for 1-CPU nodes
- FULL model: the best way to access intra-node data
- Reduces the number of data requests from 1 per processor to 1 per node
- Algorithms should make use of all local intra-node data, not just the portion they own
- DDI_Distrib -> DDI_NDistrib
- How many data servers are needed?
- Use something better than TCP/IP!!
- GM for Myrinet, Elan for Quadrics, Mellanox for InfiniBand
- Myrinet's GM wrapper for TCP/IP sockets is next on the list!
21. Acknowledgments
- Funding through APAC, the U.S. Air Force Office of Scientific Research, and NSF
- APAC and HP for the use of the SC and the GS
- And to Alistair