Title: Enabling the Efficient Use of SMP Clusters
1. Enabling the Efficient Use of SMP Clusters
- The GAMESS / DDI Approach
Ryan M. Olson, Iowa State University
2. Overview
- Trends in Supercomputers and Beowulf Clusters.
- Distributed-Memory Programming
- Distributed Data Interface (DDI)
- Current Implementation for Clusters
- Improvements to Maximize Efficiency on SMPs and
Clusters of SMPs.
3. Trends in Supercomputers (ASCI)
- ASCI Red
- 4536 Dual-CPU Pentium Pro 200 MHz/128 MB
- ASCI Blue-Pacific
- 1464 4-CPU IBM PowerPC 604e
- ASCI Q
- 3072 4-CPU HP AlphaServer ES45s
- ASCI White
- 512 16-CPU IBM RS/6000 SP
- ASCI Purple
- 196 64-CPU IBM Power5 (50 TB of Memory!!)
- That's 12,544 processors!!
4. Beowulf Clusters
- Beowulf Cluster
- Commodity PC components
- Dedicated Compute Nodes on a Dedicated Communication Network
- The Two Monsters
- Time: Get things done faster.
- Money: Supercomputers are expensive.
- Modern Beowulf Clusters
- Multi-processor (< 4 CPUs) nodes built on a high-performance network (Gigabit, Myrinet, Quadrics, InfiniBand, etc.)
5. More Processors per Node - Reasons, Benefits, Limitations
- Traditional Bottleneck for HPC
- The NETWORK!!
- Really fast = ridiculously expensive!
- Increased Computational Density
- More CPUs with the same number of network connections
- Cost effective
- Less Dedicated Network Bandwidth per CPU
- More Complicated Memory Model
- Some means of exploiting shared memory is needed
- Explicitly programmed in the application
- Latent benefit through SMP-aware libraries (MPI, etc.)
6. Our Interests
- Computational Chemistry
- GAMESS
- Calculations can take a long time
- 40-day CCSD(T)
- Required 10 GB of Memory and 100 GB of disk
- Distributed-Memory Parallelism
- Memory requirements scale as O(N^4)
- Operation counts scale steeply, e.g. CCSD(T): O(Nv^4 No^3 + No^4 Nv^3)
7. Gold Clusters (Au8)
- Determine Lowest Energy Structure
- Multiple different levels of theory
- Most accurate method available: CCSD(T)
- 1 energy ~ 40 days per structure
8. Our Approach
- Develop a common set of tools for distributed-memory programming
- The Distributed Data Interface (DDI)
- DDI allows us to
- Create distributed-data arrays
- Access any element of a DD array (regardless of physical location) via one-sided communication (see the usage sketch below)
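A minimal, single-process mock of the DDI calls named on the following slides (DDI_Create / DDI_Put / DDI_Get), shown only to illustrate the usage pattern. The real DDI spreads the array across many processes and uses one-sided communication; the signatures and the "whole column as a patch" simplification below are assumptions, not the actual library interface.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the distributed storage: one local array. */
static double *storage;
static int nrows_;

/* Create an NRows x NCols "distributed" array (mock: a single allocation). */
void DDI_Create(int *handle, int nrows, int ncols) {
    storage = calloc((size_t)nrows * ncols, sizeof *storage);
    nrows_  = nrows;
    *handle = 0;                      /* only one array in this mock */
}

/* A "patch" here is simply a whole column; real DDI patches are more general. */
void DDI_Put(int handle, int col, const double *buf) {
    (void)handle;
    memcpy(storage + (size_t)col * nrows_, buf, nrows_ * sizeof *buf);
}

void DDI_Get(int handle, int col, double *buf) {
    (void)handle;
    memcpy(buf, storage + (size_t)col * nrows_, nrows_ * sizeof *buf);
}

int main(void) {
    int handle;
    double colin[4] = { 1.0, 2.0, 3.0, 4.0 }, colout[4];

    DDI_Create(&handle, 4, 8);        /* 4 x 8 matrix                      */
    DDI_Put(handle, 3, colin);        /* write column 3, wherever it lives */
    DDI_Get(handle, 3, colout);       /* read it back through the same API */
    printf("colout[2] = %g\n", colout[2]);

    free(storage);
    return 0;
}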
9. DDI Implementations
[Diagram: DDI software stack]
- GAMESS (Application Level)
- Distributed Data Interface (DDI) (High-Level API)
- Implementation layer: native and non-native implementations (SHMEM / GPSHMEM, MPI-2, MPI-1 + GA, MPI-1, TCP/IP, System V IPC)
- Hardware API (Elan, GM, etc.)
10. Virtual-Shared Memory Model - Distributed-Matrix Example
[Figure: A distributed matrix created with DDI_Create(Handle, NRows, NCols). The NRows x NCols matrix is split into column sub-patches, one per CPU (CPU 0-3), held in distributed-memory storage.]
- Two types of distributed memory
- Local: fastest access
- Remote: accessible, with a penalty
(A small sketch of the column-to-CPU mapping follows.)
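A small, self-contained illustration of the column-block ownership pictured above: each of the P processes gets a contiguous block of columns. The even split used here is an assumption consistent with the figure, not the exact DDI distribution formula.

#include <stdio.h>

/* First (inclusive) and last+1 (exclusive) column owned by process p. */
static void col_range(int ncols, int nproc, int p, int *lo, int *hi) {
    int base = ncols / nproc, extra = ncols % nproc;
    *lo = p * base + (p < extra ? p : extra);
    *hi = *lo + base + (p < extra ? 1 : 0);
}

int main(void) {
    int ncols = 10, nproc = 4;
    for (int p = 0; p < nproc; p++) {
        int lo, hi;
        col_range(ncols, nproc, p, &lo, &hi);
        printf("CPU %d owns columns [%d, %d)\n", p, lo, hi);
    }
    return 0;
}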
11. Virtual Shared Memory (Cray SHMEM) - Native DDI Implementation
- Three essential distributed-data operations
- DDI_Get (Handle,Patch,MyBuffer)
- DDI_Put (Handle,Patch,MyBuffer)
- DDI_Acc (Handle,Patch,MyBuffer)
[Figure: DDI_GET, DDI_PUT, and DDI_ACC operations issued by CPUs 0-3 against the distributed-memory storage.]
12. Virtual Shared Memory for Clusters - Original DDI Implementation
- Remote one-sided access is not directly supported by standard message-passing libraries (MPI-1, TCP/IP sockets, etc.)
- This requires a specialized model
- DDI used a data-server model (see the sketch after the figure)
[Figure: Node 0 (CPU0, CPU1) and Node 1 (CPU2, CPU3). Compute processes 0-3 issue GET/PUT/ACC requests to data servers 4-7, which hold the distributed-memory storage.]
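A sketch of what a data-server process does under this model, written against MPI-1 send/recv since that is one of the transports named above. The request structure, the message tags, and the local_block() helper are illustrative assumptions, not the actual DDI wire protocol.

#include <mpi.h>
#include <stdlib.h>

enum { REQ_GET = 1, REQ_PUT = 2, REQ_ACC = 3, REQ_QUIT = 4 };

typedef struct {            /* hypothetical request descriptor             */
    int type;               /* REQ_GET / REQ_PUT / REQ_ACC / REQ_QUIT      */
    int handle;             /* distributed-array handle                    */
    int offset;             /* offset into this server's local block       */
    int nelem;              /* number of doubles in the requested patch    */
} Request;

/* Would return this data server's portion of the distributed array;
 * here it is just a stand-in declaration.                                 */
extern double *local_block(int handle);

void data_server_loop(void) {
    for (;;) {
        Request req;
        MPI_Status st;

        /* Wait for a request from any compute process. */
        MPI_Recv(&req, (int)sizeof req, MPI_BYTE, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, &st);
        if (req.type == REQ_QUIT) break;

        double *loc = local_block(req.handle) + req.offset;

        if (req.type == REQ_GET) {
            /* GET: ship the requested patch back to the caller. */
            MPI_Send(loc, req.nelem, MPI_DOUBLE, st.MPI_SOURCE, 1,
                     MPI_COMM_WORLD);
        } else {
            /* PUT overwrites, ACC accumulates into the local patch. */
            double *tmp = malloc(req.nelem * sizeof *tmp);
            MPI_Recv(tmp, req.nelem, MPI_DOUBLE, st.MPI_SOURCE, 1,
                     MPI_COMM_WORLD, &st);
            for (int i = 0; i < req.nelem; i++)
                loc[i] = (req.type == REQ_PUT) ? tmp[i] : loc[i] + tmp[i];
            free(tmp);
        }
    }
}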
13. Data Server Model - Advantages / Disadvantages
- Very portable and easy to implement
- All inter-process communication is handled via send/recv operations through the message-passing library
- Inherits any latent advantages of the message-passing library (e.g. an SMP-aware MPI)
- Inherits any latent disadvantages of the message-passing library (e.g. MPI polling)
- Ignores data locality
14. Improved Data Server Model - Fast-Link Model
[Figure: Node 0 (CPU0, CPU1) and Node 1 (CPU2, CPU3). Compute processes 0-3 issue GET/PUT/ACC requests to data servers 4-7, but the distributed-memory storage now lives in System V shared-memory segments, so each compute process has a fast link to its own local data.]
- Fast access to local distributed data (see the sketch below)
- Maximize the use of local data!!
- Ignores the remaining intra-node data
- Generates a race condition!!
- Exclusive access to distributed data is not guaranteed!!
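One way to picture the fast link: a get first checks whether the requested patch is owned locally; if so it is copied straight out of the attached System V segment, otherwise the usual data-server request is made. All helper names below (owner_node, segment_addr, request_from_server) are hypothetical stand-ins, and the unlocked local copy is exactly where the race condition noted above comes from.

#include <string.h>

extern int     my_node;                              /* this process's node id      */
extern int     owner_node(int handle, int off);      /* node holding that offset    */
extern double *segment_addr(int handle, int off);    /* address inside attached shm */
extern void    request_from_server(int node, int handle,
                                   int off, int n, double *buf);

void fast_get(int handle, int off, int n, double *buf) {
    if (owner_node(handle, off) == my_node) {
        /* Local distributed data: read it straight from the shared segment,
         * with no message to a data server (and, here, no locking).        */
        memcpy(buf, segment_addr(handle, off), n * sizeof *buf);
    } else {
        /* Remote data: fall back to the usual data-server request. */
        request_from_server(owner_node(handle, off), handle, off, n, buf);
    }
}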
15. Access Control for Shared Memory - System V Semaphores
- General (counting) semaphores
- Initial value: BIG_NUM
- An operation blocks if the resource is not available
- Read access: decrement by 1 (not exclusive)
- Write access: decrement by BIG_NUM (exclusive access)
(A semaphore sketch follows the figure.)
[Figure: Layout of a shared-memory segment: an index, the distributed arrays (Array 0, Array 1, Array 2), each guarded by its own access semaphore (Access0, Access1, Access2), and the remaining free space.]
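The read/write scheme above maps naturally onto a single System V counting semaphore; the sketch below is one way to express it in C. BIG_NUM, the key handling, and the missing error checks are illustrative choices, not DDI's actual code.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On many systems the application must declare union semun itself. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

/* BIG_NUM just has to exceed the number of concurrent readers; it is also
 * bounded by SEMVMX (typically 32767), so that value is used here.        */
#define BIG_NUM 32767

static void sem_change(int semid, int delta) {
    struct sembuf op;
    op.sem_num = 0;                   /* the access semaphore for one array   */
    op.sem_op  = (short)delta;
    op.sem_flg = 0;                   /* no IPC_NOWAIT: block until available */
    semop(semid, &op, 1);
}

/* Create the semaphore and give it its initial value of BIG_NUM. */
int create_access_sem(key_t key) {
    int semid = semget(key, 1, IPC_CREAT | 0600);
    union semun arg;
    arg.val = BIG_NUM;
    semctl(semid, 0, SETVAL, arg);
    return semid;
}

/* Read access takes one unit: many readers can hold it at once.           */
void lock_read(int semid)    { sem_change(semid, -1); }
void unlock_read(int semid)  { sem_change(semid, +1); }

/* Write access takes all BIG_NUM units: it blocks until every reader has
 * finished and keeps everyone else out until it is released.              */
void lock_write(int semid)   { sem_change(semid, -BIG_NUM); }
void unlock_write(int semid) { sem_change(semid, +BIG_NUM); }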
16. Further Improved Data Server Model - Full SMP Implementation
[Figure: Node 0 (CPU0, CPU1) and Node 1 (CPU2, CPU3). Compute processes 0-3 issue GET/PUT/ACC requests; the distributed-memory storage lives in System V shared-memory segments, and data servers 4-7 can each service any of their node's data.]
- Data servers on a node are equivalent
- Do we need so many?
- Only one data request needs to be sent per node (see the sketch below)
- Global operations require significantly fewer point-to-point operations
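A sketch of how a get could be split by node under the FULL model: node-local data comes straight from the shared segments, and at most one request goes to each remote node. The helpers (node_range, shm_addr, send_node_request) and the flat-offset view of a patch are hypothetical, not actual DDI routines.

#include <string.h>

extern int  my_node, n_nodes;
/* Intersection of the requested range [lo, hi) with the portion of `handle`
 * owned by `node`; returns 0 if the intersection is empty.                  */
extern int  node_range(int handle, int node, int lo, int hi, int *nlo, int *nhi);
extern double *shm_addr(int handle, int offset);            /* attached SysV segment */
extern void send_node_request(int node, int handle,
                              int lo, int hi, double *buf); /* one message per node  */

void full_smp_get(int handle, int lo, int hi, double *buf) {
    for (int node = 0; node < n_nodes; node++) {
        int nlo, nhi;
        if (!node_range(handle, node, lo, hi, &nlo, &nhi)) continue;

        if (node == my_node) {
            /* Everything this node owns -- not just my own columns -- is
             * visible through the shared segments, so copy it directly.     */
            memcpy(buf + (nlo - lo), shm_addr(handle, nlo),
                   (nhi - nlo) * sizeof *buf);
        } else {
            /* One request per remote node, serviced by any one of that
             * node's equivalent data servers.                               */
            send_node_request(node, handle, nlo, nhi, buf + (nlo - lo));
        }
    }
}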
17. Benchmark
- MP2 gradient: distributed-memory algorithm; operations scale as O(N^5), memory as O(N^4)
- Benzoquinone
- N = 245 (atomic basis functions)
- Total aggregate memory needed: 1024 MB
- A relatively small problem
18. Average Data Transfer
19. Timings
20. Conclusions / Future Work
- Major performance benefit from explicit use of shared memory
- FAST model: speeds up the data used most
- A good first step; good for 1-CPU nodes
- FULL model: the best way to access intra-node data
- Reduces the number of data requests from 1 per processor to 1 per node
- Algorithms should make use of all local intra-node data, not just the portion they own
- DDI_Distrib -> DDI_NDistrib
- How many data servers are needed?
- Use something better than TCP/IP!!
- GM for Myrinet, Elan for Quadrics, Mellanox for InfiniBand
- Myrinet's GM wrapper for TCP/IP sockets is next on the list!
21. Acknowledgments
- Funding through APAC, the U.S. Air Force Office of Scientific Research, and NSF
- APAC and HP for the use of the SC and the GS
- And to Alistair