Hewlett Packard V2500
Sharon Brunett, Caltech, Aug 30, 2000
1
Hewlett Packard V2500
Sharon Brunett
Caltech
Aug 30, 2000
2
V2500 Hardware Overview
  • Two 64-CPU Shared Memory (ccNUMA) Systems
  • Processors: 128 PA-8500, 440 MHz, 4-way superscalar
  • 2 nodes with 32 CPUs, 4 nodes with 16 CPUs
  • Peak Speed: 1.76 Gflops/CPU, 220 Gflops total
  • 1 MB Data and 0.5 MB Instruction on-chip cache
  • 4-way set associative, 4 ns latency
  • 128 GB Main Memory
  • Low Latency, High Bandwidth
  • Local: 484 ns, 15.3 GB/s peak per node
  • Remote: 1600 ns, 3.84 GB/s peak
  • 1.15 TB Disk

3
V2500 Processing System
  • I/O Subsystem
  • 8 x 240 MB/s channels
  • Each channel hosts 240 MB/s 64 bit PCI (2x PCI)
    bus
  • 3 or 4 PCI controllers per bus
  • CPUs
  • 440 MHz PA-8500
  • 56 entry instruction reorder buffer
  • 10 functional units
  • 3.84 GB/s cache-to-register
  • 1 MB 1st level D Cache
    4-way set-assoc.
  • 0.5 MB 1st level I Cache

I/O Channel
  • Agent
  • cache-coherency agent
  • ensures coherency of cache lines for these 1-4
    processors
  • HyperPlane port (processor board)
  • 8 ports, two independent paths per port, each 64
    bits @ 120 MHz
  • 0.96 GB/s read path, 0.96 GB/s write path

Agent
  • Memory Controller <-> SCA HyperLink
  • 2 buses, 480 MB/s each

Crossbar
  • HyperPlane port (memory side)
  • 8 ports, two independent paths per port, each 64
    bits @ 120 MHz
  • 0.96 GB/s read path, 0.96 GB/s write path

Memory Controller
  • SCA HyperLink, Y-Direction
  • 480 MB/s (32 bits @ 120 MHz)
  • Memory buses
  • 8 ports, two independent paths per
    port, each 64 bits @ 120 MHz
  • 0.96 GB/s read path, 0.96 GB/s write path

SCA HyperLink Controller
32 GB SDRAM (shared by all CPUs)
  • SCA HyperLink, X-Direction
  • 480 MB/s (32 bits @ 120 MHz)

4
SCA HyperLink: Low-Latency, High-Bandwidth
Scalable Memory Interconnect
  • Local cabinet keeps most recently used cache
    lines (up to 1/2 physical memory) in SCA cache,
    reducing average latency
  • Supports 256 outstanding memory requests per
    cabinet, reducing average latency
  • Cache-coherent, non-uniform memory access
    (ccNUMA) for memory references and I/O transfers
    over the interconnect
  • Hardware cache coherency eliminates the need for
    explicit coherency management in software
  • 8-way interleaved, split-transaction protocol.
    Each link 3.84 GB/s
  • Interconnect accelerates data transfers,
    synchronization primitives and other
    message-passing operations

5
Software Overview
  • Queuing System
  • LSF
  • Compilers, Libraries and Tools
  • Fortran, Fortran90, C, ANSI C, C++, KAI
  • Mlib, LAPACK, MPI, OpenMP, compiler directives,
    Pthreads, parallel FFTs, BLAS3
  • Profilers (cxperf, glance, mpiview), debuggers
    (totalview, wdb)

6
Programming Models
  • MPI
  • OpenMP
  • Shared Memory Compiler Directives
  • Pthreads
  • Hybrid - All of the Above

7
Recommended Porting Methods
  • Modify Build Procedure and/or Sources
  • Makefile, OS specific calls, timers, etc.
  • Start with safest optimization flags
  • Create/Modify LSF Batch Script
  • Run Small -> Large Test Cases, Verify for
    Correctness
  • Increase Compiler Optimization Levels
  • +Oreport for details on automatic and thwarted
    parallelization
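The porting flow above ends in an LSF batch script. A minimal sketch follows; the #BSUB directive syntax is standard LSF, but the queue name, CPU count, and binary name are placeholders, not values from the deck:

```shell
#!/bin/sh
#BSUB -n 16            # CPUs requested (illustrative)
#BSUB -q normal        # queue name -- placeholder; use the site's queue
#BSUB -o out.%J        # stdout file; %J expands to the LSF job id
#BSUB -e err.%J        # stderr file
# "myapp" is a placeholder binary.  Build first with safe optimization
# flags, verify small test cases, then rebuild with higher levels and
# re-run the same script to check correctness at scale.
mpirun -np 16 ./myapp
```

Submitting with `bsub < script.sh` lets LSF pick up the embedded #BSUB options.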

8
Performance Considerations
  • Explore
  • MPI environment variables, page size attributes,
    process scheduling alternatives, memory mapping
  • Parallel math libraries
  • FFTs and Sparse Matrix solvers
  • Compiler optimization flags
  • +O2/+O3/+O4, +Oparallel, +Olibcalls, +Onolimit,
    +Odataprefetch, +DA2.0W
  • Compare and Contrast Scaling and Performance on
    Two and Four Node Systems

9
Performance Considerations II
  • Shared Memory Tuning
  • Align arrays to cache line boundaries
  • Index loops with thread_id for best locality
  • mpctl() or pthreads calls supply thread_id
  • Initialize loops to coerce data onto proper
    locality domain
  • MPI Tuning
  • Use synchronous messages, where possible
  • Avoid virtual memory aliasing
  • Avoid MPI_ANY_SOURCE, where possible
  • Try persistent comms and shared memory primitives

10
Performance Results/Comparisons
  • Sample applications and Experimental Kernels
  • Quantum Chemistry - Dr. Carl Winstead (Caltech)
  • computation of electron molecule collision cross
    section
  • Cochlea Modeling - Dr. Ed Givelberg (U. of
    Michigan)
  • A former Strategic Application Collaboration
    application
  • Java driven thread-based parallelism tests - Dr.
    Roy Williams (Caltech)

11
Application 1 - Quantum Chemistry
  • Computation of Low-energy Electron Molecule
    Collision Cross Sections
  • Analytical evaluation and transformation of
    electron repulsion integrals
  • Substantial matrix computation demands
  • Mature Parallel Code
  • Paragon, Cray, IBM SP and HP X2000 and V2500
    ports
  • Fortran, C, MPI, shared memory directives,
    Pthreads!

12
(No Transcript)
13
Application 2 - Cochlea Modeling
  • Application Goal
  • 3D modeling of human cochlea (inner ear)
  • Desired fidelity/problem size requires large
    memory
  • Good fit to the V2500!
  • Approach
  • Modify sequential code with shared memory pragmas
    in key areas
  • Use calls to shared memory FFT libs

14
(No Transcript)
15
HP V2500 3D FFT Performance for Various
Cochlea Grid Sizes
(chart: MFLOPS vs. CPUs)
16
Thread Scheduling Experiment
  • Simple Parallel Java Program, Computing
    Mandelbrot Set
  • Tests thread self-scheduling
  • Shared memory worker threads
  • Dynamic load balancing
  • Difficult with MPI
  • Garbage Collection Overhead

17
Mandelbrot
  • Each scanline is a task (heterogeneous cost)
  • Each thread gets a scanline, computes it, gets
    the next scanline, ...
  • next scanline is a shared variable
18
Mandelbrot on HP VClass
Good speedup to 64 threads (64 processors)
64 x PA-8500, 440 MHz
19
Mandelbrot on Sun HPC10000
64 x UltraSPARC II, 400 MHz
Good speedup to 64 threads (64 processors)
20
HP and Sun, Large Problem
HP is 1.4 times faster than Sun
21
HP and Sun, Small Problem
Conclusion: HP is faster, but Sun has better
thread management
(chart annotations: factor 1.4, factor 1.6)
22
User Experiences
  • Great Stability
  • Hardware and system software
  • Good Turnaround Time in Queuing System
  • Short, long, small and large jobs
  • Exploit Shared Memory FFT Libs.
  • Performs well
  • Straightforward integration
  • Helpful Performance Tuning Tools
  • Totalview, glance

23
Future Developments
  • Identify and Help Tune Well-suited Applications
  • Large memory requirements
  • Large I/O requirements
  • Expand Robustness of Queuing System
  • Prepare for Transitioning to the Successor
    HP SuperDome System