Title: Hewlett Packard V2500 Sharon Brunett Caltech Aug 30, 2000
1. Hewlett Packard V2500
Sharon Brunett, Caltech
Aug 30, 2000
2. V2500 Hardware Overview
- Two 64-CPU Shared Memory (ccNUMA) Systems
- Processors: 128 x PA-8500, 440 MHz, 4-way superscalar
  - 2 nodes with 32 CPUs, 4 nodes with 16 CPUs
- Peak Speed: 1.76 Gflops/CPU, 220 Gflops total
- 1 MB Data and 0.5 MB Instruction on-chip cache
  - 4-way set associative, 4 ns latency
- 128 GB Main Memory
- Low Latency, High Bandwidth
  - Local: 484 ns, 15.3 GB/s peak per node
  - Remote: 1600 ns, 3.84 GB/s peak
- 1.15 TB Disk
3. V2500 Processing System
- I/O Subsystem
  - 8 x 240 MB/s channels
  - Each channel hosts a 240 MB/s 64-bit PCI (2x PCI) bus
  - 3 or 4 PCI controllers per bus
- CPUs
  - 440 MHz PA-8500
  - 56-entry instruction reorder buffer
  - 10 functional units
  - 3.84 GB/s cache-to-register bandwidth
  - 1 MB 1st-level D cache, 4-way set-associative
  - 0.5 MB 1st-level I cache
- I/O Channel
- Agent
  - Cache-coherency agent: ensures coherency of cache lines for its 1-4 processors
  - HyperPlane port (processor board): 8 ports, two independent paths per port, each 64 bits @ 120 MHz (0.96 GB/s read path, 0.96 GB/s write path)
  - Memory Cntlr. <-> SCA HyperLink: 2 buses, 480 MB/s each
- Crossbar
  - HyperPlane port (memory side): 8 ports, two independent paths per port, each 64 bits @ 120 MHz (0.96 GB/s read path, 0.96 GB/s write path)
- Memory Controller
  - SCA HyperLink, Y-direction: 480 MB/s (32 bits @ 120 MHz)
  - Memory buses: 8 ports, two independent paths per port, each 64 bits @ 120 MHz (0.96 GB/s read path, 0.96 GB/s write path)
- SCA HyperLink Controller
  - SCA HyperLink, X-direction: 480 MB/s (32 bits @ 120 MHz)
- 32 GB SDRAM (shared by all CPUs)
4. SCA HyperLink: Low-Latency, High-Bandwidth Scalable Memory Interconnect
- Local cabinet keeps the most recently used cache lines (up to 1/2 of physical memory) in the SCA cache, reducing average latency
- Supports 256 outstanding memory requests per cabinet, reducing average latency
- Cache-coherent, non-uniform memory (ccNUMA): memory references and I/O transfers travel over the interconnect
- Hardware cache coherency eliminates complicated programming
- 8-way interleaved, split-transaction protocol; each link 3.84 GB/s
- Interconnect accelerates data transfers, synchronization primitives and other message-passing operations
5. Software Overview
- Queuing System
  - LSF
- Compilers, Libraries and Tools
  - Fortran, Fortran90, C, ANSI C++, KAI C++
  - MLIB, LAPACK, MPI, OpenMP, compiler directives, Pthreads, parallel FFTs, BLAS3
  - Profilers (cxperf, glance, mpiview) and debuggers (TotalView, wdb)
6. Programming Models
- MPI
- OpenMP
- Shared Memory Compiler Directives
- Pthreads
- Hybrid: all of the above (a combined MPI + OpenMP sketch follows below)
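A minimal hybrid MPI + OpenMP sketch in C, shown only to illustrate how the models combine on a shared-memory node; the program below is generic and is not taken from any code discussed in this talk:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nranks;

        /* One MPI process per node; OpenMP threads share memory within it. */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
        {
            printf("MPI rank %d of %d, thread %d of %d\n",
                   rank, nranks, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }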
7. Recommended Porting Methods
- Modify Build Procedure and/or Sources
  - Makefile, OS-specific calls, timers, etc.
  - Start with the safest optimization flags
- Create/Modify LSF Batch Script (a skeletal example follows below)
- Run Small -> Large Test Cases, Verify for Correctness
- Increase Compiler Optimization Levels
  - +Oreport for details on automatic and thwarted parallelization
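A skeletal LSF submission script of the kind referred to above; the processor count, queue name and binary name are placeholders, and site-specific options will differ:

    #!/bin/sh
    #BSUB -n 16               # processors requested (placeholder)
    #BSUB -q short            # queue name (placeholder)
    #BSUB -o run.%J.out       # stdout; %J expands to the LSF job id
    #BSUB -e run.%J.err       # stderr

    # launch the MPI binary on the allocated processors (placeholder name)
    mpirun -np 16 ./myapp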
8. Performance Considerations
- Explore
  - MPI environment variables, page size attributes, process scheduling alternatives, memory mapping
  - Parallel math libraries
    - FFTs and sparse matrix solvers
  - Compiler optimization flags
    - +O2/+O3/+O4, +Oparallel, +Olibcalls, +Onolimit, +Odataprefetch, +DA2.0W, ...
- Compare and Contrast Scaling and Performance on Two- and Four-Node Systems
9. Performance Considerations II
- Shared Memory Tuning
  - Align arrays to cache-line boundaries
  - Index loops with the thread_id for best locality
    - mpctl() or pthreads provide the thread_id
  - Initialize data in loops so it lands on the proper locality domain
- MPI Tuning
  - Use synchronous messages, where possible
  - Avoid virtual memory aliasing
  - Avoid MPI_ANY_SOURCE, where possible
  - Try persistent communications and shared memory primitives (see the sketch below)
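A small C sketch of the persistent-communication suggestion above; the buffer size, tag, neighbor rank and iteration count are arbitrary placeholders:

    #include <mpi.h>

    #define N 1024

    /* Exchange a buffer with a fixed neighbor many times; persistent requests
     * pay the message setup cost once instead of on every iteration. */
    void exchange_loop(int neighbor, int iterations)
    {
        double sendbuf[N], recvbuf[N];
        MPI_Request reqs[2];
        int i;

        MPI_Send_init(sendbuf, N, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Recv_init(recvbuf, N, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

        for (i = 0; i < iterations; i++) {
            /* ... fill sendbuf for this iteration ... */
            MPI_Startall(2, reqs);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            /* ... consume recvbuf ... */
        }

        MPI_Request_free(&reqs[0]);
        MPI_Request_free(&reqs[1]);
    }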
10. Performance Results/Comparisons
- Sample Applications and Experimental Kernels
  - Quantum Chemistry, Dr. Carl Winstead (Caltech)
    - Computation of electron-molecule collision cross sections
  - Cochlea Modeling, Dr. Ed Givelberg (U. of Michigan)
    - Former Strategic Application Collaboration application
  - Java-driven thread-based parallelism tests, Dr. Roy Williams (Caltech)
11. Application 1: Quantum Chemistry
- Computation of Low-Energy Electron-Molecule Collision Cross Sections
  - Analytical evaluation and transformation of electron repulsion integrals
  - Substantial matrix computation demands
- Mature Parallel Code
  - Paragon, Cray, IBM SP, and HP X2000 and V2500 ports
  - Fortran, C, MPI, shared memory directives, Pthreads!
12. (No transcript)
13. Application 2: Cochlea Modeling
- Application Goal
  - 3D modeling of the human cochlea (inner ear)
  - Desired fidelity/problem size requires large memory
  - Good fit to the V2500!
- Approach
  - Modify the sequential code with shared memory pragmas in key areas (illustrated below)
  - Use calls to shared memory FFT libraries
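To illustrate the kind of shared-memory pragma insertion described above, a generic OpenMP-style parallel loop in C; the array and update are invented for illustration and are not taken from the cochlea code:

    #define NX 128

    /* Spread a compute-heavy loop nest over the shared-memory CPUs;
     * each thread handles a block of i-planes. */
    void scale_grid(double g[NX][NX][NX], double factor)
    {
        int i, j, k;
    #pragma omp parallel for private(j, k)
        for (i = 0; i < NX; i++)
            for (j = 0; j < NX; j++)
                for (k = 0; k < NX; k++)
                    g[i][j][k] *= factor;   /* placeholder update */
    }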
14. (No transcript)
15. HP V2500 3D FFT Performance for Various Cochlea Grid Sizes
(Figure: MFLOPS vs. number of CPUs)
16. Thread Scheduling Experiment
- Simple Parallel Java Program Computing the Mandelbrot Set
- Tests thread self-scheduling
  - Shared memory worker threads
  - Dynamic load balancing
    - Difficult with MPI
- Garbage Collection Overhead
17. Mandelbrot
- Each scanline is a task; task costs are heterogeneous
- Each thread gets a scanline, computes it, gets the next scanline, ...
- The next scanline to work on is a shared variable (see the Pthreads sketch below)
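The self-scheduling loop described above, sketched in C with Pthreads rather than Java; the image size, iteration limit and thread count are placeholders:

    #include <pthread.h>
    #include <stddef.h>

    #define WIDTH    1024
    #define HEIGHT   1024
    #define MAXIT    256
    #define NTHREADS 8

    static int image[HEIGHT][WIDTH];
    static int next_scanline = 0;                      /* shared task counter */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Iterate z = z*z + c for every pixel of one scanline. */
    static void compute_scanline(int y)
    {
        int x, it;
        double ci = -2.0 + 4.0 * y / HEIGHT;
        for (x = 0; x < WIDTH; x++) {
            double cr = -2.0 + 4.0 * x / WIDTH, zr = 0.0, zi = 0.0;
            for (it = 0; it < MAXIT && zr * zr + zi * zi < 4.0; it++) {
                double t = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = t;
            }
            image[y][x] = it;
        }
    }

    /* Each worker claims the next unclaimed scanline until none remain,
     * so expensive and cheap rows balance across threads automatically. */
    static void *worker(void *arg)
    {
        for (;;) {
            int y;
            pthread_mutex_lock(&lock);
            y = next_scanline++;
            pthread_mutex_unlock(&lock);
            if (y >= HEIGHT)
                break;
            compute_scanline(y);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int t;
        for (t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, NULL);
        for (t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }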
18. Mandelbrot on HP V-Class (64 x PA-8500, 440 MHz)
- Good speedup to 64 threads (64 processors)
19. Mandelbrot on Sun HPC10000 (64 x UltraSPARC II, 400 MHz)
- Good speedup to 64 threads (64 processors)
20. HP and Sun, Large Problem
- HP is 1.4 times faster than Sun
21. HP and Sun, Small Problem
- Conclusion: HP is faster, but Sun has better thread management
- (Figure annotations: factor 1.4, factor 1.6)
22. User Experiences
- Great Stability
  - Hardware and system software
- Good Turnaround Time in the Queuing System
  - Short, long, small and large jobs
- Exploit Shared Memory FFT Libraries
  - Perform well
  - Straightforward integration
- Helpful Performance Tuning Tools
  - TotalView, glance
23. Future Developments
- Identify and Help Tune Well-Suited Applications
  - Large memory requirements
  - Large I/O requirements
- Expand Robustness of the Queuing System
- Prepare for Transitioning to the Successor SuperDome HP System