Title: Hewlett Packard V2500 Sharon Brunett Caltech Aug 30, 2000
1. Hewlett Packard V2500
Sharon Brunett, Caltech
Aug 30, 2000
2. V2500 Hardware Overview
- Two 64-CPU Shared Memory (ccNUMA) Systems
- Processors: 128 x PA-8500, 440 MHz, 4-way superscalar
  - 2 nodes with 32 CPUs, 4 nodes with 16 CPUs
- Peak Speed: 1.76 Gflops/CPU, 220 Gflops total
- 1 MB Data and 0.5 MB Instruction on-chip cache
  - 4-way set associative, 4 ns latency
- 128 GB Main Memory
- Low Latency, High Bandwidth
  - Local: 484 ns, 15.3 GB/s peak per node
  - Remote: 1600 ns, 3.84 GB/s peak
- 1.15 TB Disk
3. V2500 Processing System
- I/O Subsystem
  - 8 x 240 MB/s channels
  - Each channel hosts a 240 MB/s 64-bit PCI (2x PCI) bus
  - 3 or 4 PCI controllers per bus
- CPUs
  - 440 MHz PA-8500
  - 56-entry instruction reorder buffer
  - 10 functional units
  - 3.84 GB/s cache-to-register bandwidth
  - 1 MB 1st-level D cache, 4-way set-associative
  - 0.5 MB 1st-level I cache
- I/O Channel
- Agent
  - Cache-coherency agent: ensures coherency of cache lines for its 1-4 processors
  - HyperPlane port (processor board): 8 ports, two independent paths per port, each 64 bits @ 120 MHz (0.96 GB/s read path, 0.96 GB/s write path)
  - Memory Cntlr. <-> SCA HyperLink: 2 buses, 480 MB/s each
- Crossbar
  - HyperPlane port (memory side): 8 ports, two independent paths per port, each 64 bits @ 120 MHz (0.96 GB/s read path, 0.96 GB/s write path)
- Memory Controller
  - SCA HyperLink, Y-direction: 480 MB/s (32 bits @ 120 MHz)
  - Memory buses: 8 ports, two independent paths per port, each 64 bits @ 120 MHz (0.96 GB/s read path, 0.96 GB/s write path)
- SCA HyperLink Controller
  - SCA HyperLink, X-direction: 480 MB/s (32 bits @ 120 MHz)
- 32 GB SDRAM (shared by all CPUs)
4. SCA HyperLink: Low-Latency, High-Bandwidth Scalable Memory Interconnect
- Local cabinet keeps the most recently used cache lines (up to 1/2 of physical memory) in the SCA cache, reducing average latency
- Supports 256 outstanding memory requests per cabinet, reducing average latency
- Cache-coherent, non-uniform memory (ccNUMA): memory references and I/O transfers travel over the interconnect
- Hardware cache coherency eliminates complicated programming
- 8-way interleaved, split-transaction protocol; each link 3.84 GB/s
- Interconnect accelerates data transfers, synchronization primitives and other message-passing operations
5. Software Overview
- Queuing System
  - LSF
- Compilers, Libraries and Tools
  - Fortran, Fortran90, C, ANSI C++, KAI C++
  - MLIB, LAPACK, MPI, OpenMP, compiler directives, Pthreads, parallel FFTs, BLAS3
  - Profilers (cxperf, glance, mpiview) and debuggers (TotalView, wdb)
6. Programming Models
- MPI
- OpenMP
- Shared Memory Compiler Directives
- Pthreads
- Hybrid: all of the above (a combined MPI + OpenMP sketch follows below)
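A minimal hybrid MPI + OpenMP sketch in C, shown only to illustrate how the models combine on a shared-memory node; the program below is generic and is not taken from any code discussed in this talk:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nranks;

        /* One MPI process per node; OpenMP threads share memory within it. */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
        {
            printf("MPI rank %d of %d, thread %d of %d\n",
                   rank, nranks, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }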
7. Recommended Porting Methods
- Modify Build Procedure and/or Sources
  - Makefile, OS-specific calls, timers, etc.
  - Start with the safest optimization flags
- Create/Modify LSF Batch Script (a skeletal example follows below)
- Run Small -> Large Test Cases, Verify for Correctness
- Increase Compiler Optimization Levels
  - +Oreport for details on automatic and thwarted parallelization
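A skeletal LSF submission script of the kind referred to above; the processor count, queue name and binary name are placeholders, and site-specific options will differ:

    #!/bin/sh
    #BSUB -n 16               # processors requested (placeholder)
    #BSUB -q short            # queue name (placeholder)
    #BSUB -o run.%J.out       # stdout; %J expands to the LSF job id
    #BSUB -e run.%J.err       # stderr

    # launch the MPI binary on the allocated processors (placeholder name)
    mpirun -np 16 ./myapp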
8. Performance Considerations
- Explore
  - MPI environment variables, page size attributes, process scheduling alternatives, memory mapping
  - Parallel math libraries
    - FFTs and sparse matrix solvers
  - Compiler optimization flags
    - +O2/+O3/+O4, +Oparallel, +Olibcalls, +Onolimit, +Odataprefetch, +DA2.0W, ...
- Compare and Contrast Scaling and Performance on Two- and Four-Node Systems
9. Performance Considerations II
- Shared Memory Tuning
  - Align arrays to cache-line boundaries
  - Index loops with the thread_id for best locality
    - mpctl() or pthreads provide the thread_id
  - Initialize data in loops so it lands on the proper locality domain
- MPI Tuning
  - Use synchronous messages, where possible
  - Avoid virtual memory aliasing
  - Avoid MPI_ANY_SOURCE, where possible
  - Try persistent communications and shared memory primitives (see the sketch below)
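A small C sketch of the persistent-communication suggestion above; the buffer size, tag, neighbor rank and iteration count are arbitrary placeholders:

    #include <mpi.h>

    #define N 1024

    /* Exchange a buffer with a fixed neighbor many times; persistent requests
     * pay the message setup cost once instead of on every iteration. */
    void exchange_loop(int neighbor, int iterations)
    {
        double sendbuf[N], recvbuf[N];
        MPI_Request reqs[2];
        int i;

        MPI_Send_init(sendbuf, N, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Recv_init(recvbuf, N, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

        for (i = 0; i < iterations; i++) {
            /* ... fill sendbuf for this iteration ... */
            MPI_Startall(2, reqs);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            /* ... consume recvbuf ... */
        }

        MPI_Request_free(&reqs[0]);
        MPI_Request_free(&reqs[1]);
    }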
10. Performance Results/Comparisons
- Sample Applications and Experimental Kernels
  - Quantum Chemistry, Dr. Carl Winstead (Caltech)
    - Computation of electron-molecule collision cross sections
  - Cochlea Modeling, Dr. Ed Givelberg (U. of Michigan)
    - Former Strategic Application Collaboration application
  - Java-driven thread-based parallelism tests, Dr. Roy Williams (Caltech)
11. Application 1: Quantum Chemistry
- Computation of Low-Energy Electron-Molecule Collision Cross Sections
  - Analytical evaluation and transformation of electron repulsion integrals
  - Substantial matrix computation demands
- Mature Parallel Code
  - Paragon, Cray, IBM SP, and HP X2000 and V2500 ports
  - Fortran, C, MPI, shared memory directives, Pthreads!
12. (No transcript)
13. Application 2: Cochlea Modeling
- Application Goal
  - 3D modeling of the human cochlea (inner ear)
  - Desired fidelity/problem size requires large memory
  - Good fit to the V2500!
- Approach
  - Modify the sequential code with shared memory pragmas in key areas (illustrated below)
  - Use calls to shared memory FFT libraries
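To illustrate the kind of shared-memory pragma insertion described above, a generic OpenMP-style parallel loop in C; the array and update are invented for illustration and are not taken from the cochlea code:

    #define NX 128

    /* Spread a compute-heavy loop nest over the shared-memory CPUs;
     * each thread handles a block of i-planes. */
    void scale_grid(double g[NX][NX][NX], double factor)
    {
        int i, j, k;
    #pragma omp parallel for private(j, k)
        for (i = 0; i < NX; i++)
            for (j = 0; j < NX; j++)
                for (k = 0; k < NX; k++)
                    g[i][j][k] *= factor;   /* placeholder update */
    }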
14. (No transcript)
15. HP V2500 3D FFT Performance for Various Cochlea Grid Sizes
(Figure: MFLOPS vs. number of CPUs)
16. Thread Scheduling Experiment
- Simple Parallel Java Program Computing the Mandelbrot Set
- Tests thread self-scheduling
  - Shared memory worker threads
  - Dynamic load balancing
    - Difficult with MPI
- Garbage Collection Overhead
17. Mandelbrot
- Each scanline is a task; task costs are heterogeneous
- Each thread gets a scanline, computes it, gets the next scanline, ...
- The next scanline to work on is a shared variable (see the Pthreads sketch below)
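The self-scheduling loop described above, sketched in C with Pthreads rather than Java; the image size, iteration limit and thread count are placeholders:

    #include <pthread.h>
    #include <stddef.h>

    #define WIDTH    1024
    #define HEIGHT   1024
    #define MAXIT    256
    #define NTHREADS 8

    static int image[HEIGHT][WIDTH];
    static int next_scanline = 0;                      /* shared task counter */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Iterate z = z*z + c for every pixel of one scanline. */
    static void compute_scanline(int y)
    {
        int x, it;
        double ci = -2.0 + 4.0 * y / HEIGHT;
        for (x = 0; x < WIDTH; x++) {
            double cr = -2.0 + 4.0 * x / WIDTH, zr = 0.0, zi = 0.0;
            for (it = 0; it < MAXIT && zr * zr + zi * zi < 4.0; it++) {
                double t = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = t;
            }
            image[y][x] = it;
        }
    }

    /* Each worker claims the next unclaimed scanline until none remain,
     * so expensive and cheap rows balance across threads automatically. */
    static void *worker(void *arg)
    {
        for (;;) {
            int y;
            pthread_mutex_lock(&lock);
            y = next_scanline++;
            pthread_mutex_unlock(&lock);
            if (y >= HEIGHT)
                break;
            compute_scanline(y);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int t;
        for (t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, NULL);
        for (t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }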
18. Mandelbrot on HP V-Class (64 x PA-8500, 440 MHz)
- Good speedup to 64 threads (64 processors)
19. Mandelbrot on Sun HPC10000 (64 x UltraSPARC II, 400 MHz)
- Good speedup to 64 threads (64 processors)
20. HP and Sun, Large Problem
- HP is 1.4 times faster than Sun
21. HP and Sun, Small Problem
- Conclusion: HP is faster, but Sun has better thread management
- (Figure annotations: factor 1.4, factor 1.6)
22. User Experiences
- Great Stability
  - Hardware and system software
- Good Turnaround Time in the Queuing System
  - Short, long, small and large jobs
- Exploit Shared Memory FFT Libraries
  - Perform well
  - Straightforward integration
- Helpful Performance Tuning Tools
  - TotalView, glance
23. Future Developments
- Identify and Help Tune Well-Suited Applications
  - Large memory requirements
  - Large I/O requirements
- Expand Robustness of the Queuing System
- Prepare for Transitioning to the Successor SuperDome HP System