Title: Programming the IBM Power3 SP
1 Programming the IBM Power3 SP
- Eric Aubanel
- Advanced Computational Research Laboratory
- Faculty of Computer Science, UNB
2 Advanced Computational Research Laboratory
- High Performance Computational Problem-Solving and Visualization Environment
- Computational experiments in multiple disciplines: CS, Science, and Engineering
- 16-processor IBM SP3
- Member of C3.ca Association, Inc. (http://www.c3.ca)
3 Advanced Computational Research Laboratory
- www.cs.unb.ca/acrl
- Virendra Bhavsar, Director
- Eric Aubanel, Research Associate, Scientific Computing Support
- Sean Seeley, System Administrator
6 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
7 POWER chip: 1990 to 2003
- 1990
- POWER: Performance Optimized With Enhanced RISC
- RISC: Reduced Instruction Set Computer
- Superscalar, with a combined floating-point multiply-add (FMA) unit that allowed a peak MFLOPS rate of 2 x clock MHz
- Initially 25 MHz (50 MFLOPS) and 64 KB data cache
8 POWER chip: 1990 to 2003
- 1991: SP1
- IBM's first SP (Scalable POWERparallel)
- Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
- Parallel Environment system software
9 POWER chip: 1990 to 2003
- 1993: POWER2
- 2 FMAs
- Increased data cache size
- 66.5 MHz (254 MFLOPS)
- Improved instruction set (incl. hardware square root)
- SP2: POWER2 plus a higher-bandwidth switch for larger systems
10 POWER chip: 1990 to 2003
- 1993: PowerPC
- SMP support
- 1996: P2SC
- POWER2 Super Chip, clock speeds up to 160 MHz
11 POWER chip: 1990 to 2003
- Feb. 1999: POWER3
- Combined P2SC and PowerPC
- 64-bit architecture
- Initially 2-way SMP, 200 MHz
- Cache improvements, including an L2 cache of 1-16 MB
- Instruction and data prefetch
12 POWER3 chip, Feb. 2000
- Winterhawk II: 375 MHz
- 4-way SMP
- 2 MULT/ADD = 1500 MFLOPS per processor
- 64 KB Level 1 cache: 5 ns / 3.2 GB/s
- 8 MB Level 2 cache: 45 ns / 6.4 GB/s
- 1.6 GB/s memory bandwidth
- 6 GFLOPS/node
- Nighthawk II: 375 MHz
- 16-way SMP
- 2 MULT/ADD = 1500 MFLOPS per processor
- 64 KB Level 1 cache: 5 ns / 3.2 GB/s
- 8 MB Level 2 cache: 45 ns / 6.4 GB/s
- 14 GB/s memory bandwidth
- 24 GFLOPS/node
13 The Clustered SMP
ACRL's SP: four 4-way SMP nodes
Each node has its own copy of the OS
Processors on the same node are "closer" than those on different nodes
14 POWER3 Architecture
15 POWER4 - 32-way
- Logical UMA
- SP High node
- L3 cache shared between all processors on a node: 32 MB
- Up to 32 GB main memory
- Each processor: 1.1 GHz
- 140 GFLOPS total peak
16 Going to NUMA
NUMA: up to 256 processors (1.1 TFLOPS)
17 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
18 Uni-processor Optimization
- Compiler options
- start with -O3 -qstrict, then -O3, -qarch=pwr3 (example compile lines below)
- Cache re-use
- Take advantage of the superscalar architecture
- give enough operations per load/store
- Use ESSL: optimization already maximally exploited
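
As an illustration (not from the slides), typical compile lines might look like the following; the flag spellings follow IBM XL Fortran, and -qtune=pwr3 and -qhot are assumptions beyond what the slide lists.

  # conservative first pass: aggressive optimization without changing numerical results
  xlf90 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -o myprog myprog.f90
  # more aggressive: drop -qstrict, add high-order loop transformations
  xlf90 -O3 -qarch=pwr3 -qtune=pwr3 -qhot -o myprog myprog.f90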
19 Memory Access Times
20 Cache
- L2 cache: 4-way set-associative, 8 MB total
- L1 cache: 128-way set-associative, 64 KB
21 How to Monitor Performance?
- IBM's hardware monitor HPMCOUNT
- Uses hardware counters on the chip
- Cache and TLB misses, floating-point ops, load/stores, ...
- Beta version
- Available soon on ACRL's SP
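
As a sketch (the tool was still in beta, so options may differ), the monitor simply wraps the program launch:

  hpmcount ./myprog      # run myprog and report the hardware counter totals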
22 HPMCOUNT sample output
      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
        do i = 1, 256
          a(i,j) = b(i,j) + c(i,j)
        end do
      end do
      end

  PM_TLB_MISS (TLB misses)              :  66543
  Average number of loads per TLB miss  :  5.916
  Total loads and stores                :  0.525 M
  Instructions per load/store           :  2.749
  Cycles per instruction                :  2.378
  Instructions per cycle                :  0.420
  Total floating point operations       :  0.066 M
  Hardware floating point rate          :  2.749 Mflop/sec
23 HPMCOUNT sample output
      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
        do i = 1, 257
          a(i,j) = b(i,j) + c(i,j)
        end do
      end do
      end

  PM_TLB_MISS (TLB misses)              :  1634
  Average number of loads per TLB miss  :  241.876
  Total loads and stores                :  0.527 M
  Instructions per load/store           :  2.749
  Cycles per instruction                :  1.271
  Instructions per cycle                :  0.787
  Total floating point operations       :  0.066 M
  Hardware floating point rate          :  3.525 Mflop/sec

Padding the leading dimension from 256 to 257 breaks the power-of-two address strides between the three arrays, which reduces cache and TLB set conflicts: TLB misses drop by a factor of about 40 and the flop rate improves.
24 ESSL
- Linear algebra, Fourier and related transforms, sorting, interpolation, quadrature, random numbers
- Fast!
- 560x560 real*8 matrix multiply:
- Hand coding: 19 MFLOPS
- dgemm: 1.2 GFLOPS
- Parallel (threaded and distributed) versions
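
As an illustration (not from the slides), ESSL's dgemm is called through the standard BLAS interface; the program name and array names below are just for the sketch, and linking is typically done with -lessl.

  ! C <- 1.0*A*B + 0.0*C using ESSL's dgemm (standard BLAS argument list)
  program mm
    implicit none
    integer, parameter :: n = 560
    real*8 :: a(n,n), b(n,n), c(n,n)
    call random_number(a)
    call random_number(b)
    c = 0.0d0
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    print *, 'c(1,1) = ', c(1,1)
  end program mm

  ! compile and link, e.g.:  xlf90 -O3 mm.f90 -lessl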
25 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
26 ACRL's IBM SP
- 4 Winterhawk II nodes
- 16 processors
- Each node has:
- 1 GB RAM
- 9 GB (mirrored) disk
- Switch adapter
- High Performance Switch
- Gigabit Ethernet (1 node)
- Control workstation
- Disk: SSA tower with six 18.2 GB disks
28 IBM Power3 SP Switch
- Bidirectional multistage interconnection network (MIN)
- 300 MB/s bidirectional bandwidth
- 1.2 µs latency
29 General Parallel File System
[Diagram: nodes 1-4 accessing a shared file system through the SP Switch]
30 ACRL Software
- Operating system: AIX 4.3.3
- Compilers
- IBM XL Fortran 7.1 (HPF not yet installed)
- VisualAge C for AIX, Version 5.0.1.0
- VisualAge C++ Professional for AIX, Version 5.0.0.0
- IBM VisualAge for Java (not yet installed)
- Job scheduler: LoadLeveler 2.2
- Parallel programming tools
- IBM Parallel Environment 3.1 (MPI, MPI-2 parallel I/O)
- Numerical libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)
- Visualization: OpenDX (not yet installed)
- E-Commerce software (not yet installed)
31 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
32 Why Parallel Computing?
- Solve large problems in reasonable time
- Many algorithms are inherently parallel
- image processing, Monte Carlo
- Simulations (e.g. CFD)
- High-performance computers have parallel architectures
- Commercial off-the-shelf (COTS) components
- Beowulf clusters
- SMP nodes
- Improvements in network technology
33 NRL Layered Ocean Model at the Naval Research Laboratory, on an IBM Winterhawk II SP
34 Parallel Computational Models
- Data parallelism
- Parallel program looks like a serial program
- parallelism is in the data
- Vector processors
- HPF
35 Parallel Computational Models
[Diagram: two processes exchanging Send and Receive operations]
- Message Passing (MPI)
- Processes have only local memory but can communicate with other processes by sending and receiving messages
- Data transfer between processes requires operations to be performed by both processes
- The communication network is not part of the computational model (hypercube, torus, ...)
36 Parallel Computational Models
- Shared Memory (threads)
- P(osix) threads
- OpenMP: a higher-level standard
37 Parallel Computational Models
[Diagram: one process issuing Get and Put operations on another process's memory]
- Remote Memory Operations
- One-sided communication
- MPI-2, IBM's LAPI
- One process can access the memory of another without the other's participation, but it does so explicitly, not in the same way it accesses local memory
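
A minimal sketch of MPI-2 one-sided communication (not from the slides; the window size, names, and fence synchronization are illustrative assumptions): rank 0 puts data directly into a window exposed by rank 1, which makes no matching receive call. Run on at least two processes.

  program one_sided
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100
    real*8 :: buf(n), win_mem(n)
    integer :: my_id, win, ierr
    integer(kind=MPI_ADDRESS_KIND) :: winsize, disp

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    winsize = n * 8                       ! window size in bytes (real*8)

    ! every rank exposes win_mem as a window that other ranks may access
    call MPI_WIN_CREATE(win_mem, winsize, 8, MPI_INFO_NULL, &
                        MPI_COMM_WORLD, win, ierr)

    call MPI_WIN_FENCE(0, win, ierr)      ! open an access epoch on all ranks
    if (my_id .eq. 0) then
      buf = 42.0d0
      disp = 0
      ! write buf into rank 1's window; rank 1 does nothing to receive it
      call MPI_PUT(buf, n, MPI_DOUBLE_PRECISION, 1, disp, n, &
                   MPI_DOUBLE_PRECISION, win, ierr)
    end if
    call MPI_WIN_FENCE(0, win, ierr)      ! close the epoch; the data is now visible

    call MPI_WIN_FREE(win, ierr)
    call MPI_FINALIZE(ierr)
  end program one_sided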
38 Parallel Computational Models
- Combined Message Passing and Threads
- Driven by clusters of SMPs
- Leads to software complexity!
39 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
40 Message Passing Interface
- MPI 1.0 standard in 1994
- MPI 1.1 in 1995
- IBM support
- MPI 2.0 in 1997
- Includes 1.1 but adds new features
- MPI-IO
- One-sided communication
- Dynamic processes
41 Advantages of MPI
- Universality
- Expressivity
- Well suited to formulating a parallel algorithm
- Ease of debugging
- Memory is local
- Performance
- Explicit association of data with process allows good use of cache
42 MPI Functionality
- Several modes of point-to-point message passing
- blocking (e.g. MPI_SEND)
- non-blocking (e.g. MPI_ISEND)
- synchronous (e.g. MPI_SSEND)
- buffered (e.g. MPI_BSEND)
- Collective communication and synchronization
- e.g. MPI_REDUCE, MPI_BARRIER
- User-defined datatypes
- Logically distinct communicator spaces
- Application-level or virtual topologies
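
As an illustration of the collective calls listed above (a sketch, not from the slides; the variable names are arbitrary), each process computes a partial sum and rank 0 receives the total:

  program collectives
    implicit none
    include 'mpif.h'
    integer :: my_id, n, ierr
    real*8 :: partial, total

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)

    n = 10
    call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)   ! rank 0's n goes to everyone

    partial = dble(my_id) * n                                    ! each rank computes its piece
    call MPI_REDUCE(partial, total, 1, MPI_DOUBLE_PRECISION, &
                    MPI_SUM, 0, MPI_COMM_WORLD, ierr)            ! sum gathered on rank 0

    if (my_id .eq. 0) print *, 'total = ', total
    call MPI_FINALIZE(ierr)
  end program collectives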
43 Simple MPI Example
[Diagram: processes with My_Id = 0 and 1; process 0 prints "This is from MPI process number 0", the others print "This is from MPI processes other than 0"]
44 Simple MPI Example
      Program Trivial
      implicit none
      include "mpif.h"     ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
      print *, ' My_id, numb_of_procs ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
        print *, ' This is from MPI process number ', My_Id
      else
        print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( ierr ) ! bad things happen if you forget ierr
      stop
      end
45 MPI Example with send/recv
[Diagram: processes My_Id = 0 and 1 each send to and receive from the other]
46 MPI Example with send/recv
      Program Simple
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status ( MPI_STATUS_SIZE )   ! needed by MPI_RECV
      parameter ( Nx = 100 )
      real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id,
     &                MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id,
     &                MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end
47 What Will Happen?
  /* Processor 0 */
  ...
  MPI_Send(sendbuf, bufsize, MPI_CHAR,
           partner, tag, MPI_COMM_WORLD);
  printf("Posting receive now ...\n");
  MPI_Recv(recvbuf, bufsize, MPI_CHAR,
           partner, tag, MPI_COMM_WORLD,
           &status);

  /* Processor 1 */
  ...
  MPI_Send(sendbuf, bufsize, MPI_CHAR,
           partner, tag, MPI_COMM_WORLD);
  printf("Posting receive now ...\n");
  MPI_Recv(recvbuf, bufsize, MPI_CHAR,
           partner, tag, MPI_COMM_WORLD,
           &status);

Both processes send before they receive, so completion depends on whether the messages are buffered; the eager limit on the next slide decides this, and above it the code deadlocks. A non-blocking alternative is sketched after slide 48.
48 MPI Message Passing Modes
  Mode:      Ready   Standard                            Synchronous   Buffered
  Protocol:  Ready   Eager (message < eager limit)       Rendezvous    Buffered
                     Rendezvous (message > eager limit)
- The default eager limit on the SP is 4 KB (it can be raised to 64 KB, e.g. via the MP_EAGER_LIMIT environment variable)
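
One way to make an exchange like the one on slide 47 safe regardless of the eager limit is to use non-blocking calls. Below is a Fortran sketch (not from the slides; names and sizes are arbitrary, run on two processes) in which both operations are posted immediately and completed later with MPI_WAITALL.

  program nonblocking
    implicit none
    include 'mpif.h'
    integer, parameter :: nx = 100
    real :: a(nx), b(nx)
    integer :: my_id, other_id, ierr
    integer :: requests(2), statuses(MPI_STATUS_SIZE, 2)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    other_id = mod(my_id + 1, 2)
    a = real(my_id)

    ! post the receive and the send without blocking
    call MPI_IRECV(b, nx, MPI_REAL, other_id, other_id, &
                   MPI_COMM_WORLD, requests(1), ierr)
    call MPI_ISEND(a, nx, MPI_REAL, other_id, my_id, &
                   MPI_COMM_WORLD, requests(2), ierr)

    ! ... useful computation can overlap the communication here ...

    ! wait for both operations before reusing a or reading b
    call MPI_WAITALL(2, requests, statuses, ierr)

    call MPI_FINALIZE(ierr)
  end program nonblocking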
49 MPI Performance Visualization
- ParaGraph
- Developed at the University of Illinois
- Graphical display system for visualizing the behaviour and performance of MPI programs
52 Message Passing on SMP
[Diagram: on one node, MPI_SEND copies the data to send through a buffer, across the memory crossbar or the switch, into the receiver's buffer, where MPI_RECEIVE picks it up as received data]
export MP_SHARED_MEMORY=yes|no
53 Shared Memory MPI
- MP_SHARED_MEMORY=<yes|no>
                          Latency (µs)   Bandwidth (MB/s)
  between 2 nodes              24             133
  same node (no)               30              80
  same node (yes)              10             270
54 Message Passing off Node
- MPI across all the processors: many more messages going through the switch fabric
55 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
56 OpenMP
- In 1997, a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms.
- www.openmp.org
- OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C/C++.
57 OpenMP
- All processors can access all the memory in the parallel system
- Parallel execution is achieved by generating threads which execute in parallel
- The overhead for SMP parallelization is large (100-200 µs): a parallel work construct must be significant enough to overcome the overhead
58 OpenMP
- 1. All OpenMP programs begin as a single process: the master thread
- 2. FORK: the master thread creates a team of parallel threads
- 3. Parallel region statements are executed in parallel among the team threads
- 4. JOIN: the threads synchronize and terminate, leaving only the master thread
59 OpenMP
- How is OpenMP typically used?
- OpenMP is usually used to parallelize loops:
- Find your most time-consuming loops.
- Split them up between threads.
- Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!
60 OpenMP Loop Parallelization
  !$OMP PARALLEL DO
  do i = 0, ilong
    do k = 1, kshort
      ...
    end do
  end do

  #pragma omp parallel for
  for (i = 0; i < ilong; i++)
    for (k = 1; k < kshort; k++)
      ...
61 Variable Scoping
- The most difficult part of shared-memory parallelization
- What memory is shared?
- What memory is private? (each thread has its own copy)
- Compare MPI: all variables are private
- Variables are shared by default, except:
- loop indices
- scalars that are set and then used in the loop (see the scoping sketch below)
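
A sketch (not from the slides; names are arbitrary) of explicit scoping: the scratch scalar is private to each thread, the array is shared, and the REDUCTION clause combines the partial sums.

  program scoping
    implicit none
    integer, parameter :: n = 100000
    integer :: i
    real*8 :: x(n), scratch, total

    call random_number(x)
    total = 0.0d0

  !$OMP PARALLEL DO PRIVATE(scratch) SHARED(x) REDUCTION(+:total)
    do i = 1, n                     ! the loop index i is private automatically
      scratch = x(i) * x(i)         ! each thread has its own copy of scratch
      total = total + scratch       ! partial sums are combined at the end
    end do
  !$OMP END PARALLEL DO

    print *, 'sum of squares = ', total
  end program scoping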
62 How Does Sharing Work?
Shared variable x, initially 0

  THREAD 1                            THREAD 2
  increment(x)                        increment(x)
  { x = x + 1 }                       { x = x + 1 }

  THREAD 1                            THREAD 2
  10 LOAD A, (x address)              10 LOAD A, (x address)
  20 ADD A, 1                         20 ADD A, 1
  30 STORE A, (x address)             30 STORE A, (x address)

The result could be 1 or 2: synchronization is needed.
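
A minimal sketch (not from the slides) of one fix in OpenMP: the ATOMIC directive makes the increment indivisible, so no updates are lost.

  program atomic_demo
    implicit none
    integer :: x
    x = 0
  !$OMP PARALLEL
  !$OMP ATOMIC
    x = x + 1
  !$OMP END PARALLEL
    print *, 'x = ', x   ! equals the number of threads, with no lost updates
  end program atomic_demo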
63 False Sharing
[Diagram: a cache block holding eight elements (7 6 5 4 3 2 1 0), with its address tag, stored in one cache line]
Say A(1:5) starts on a cache line; then some of A(6:10) will be on that same cache line, so a second thread working on A(6:10) cannot access it until the first thread is finished with the line.

  !$OMP PARALLEL DO
  do I = 1, 20
    A(I) = ...
  end do
64 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
65 Why Hybrid MPI-OpenMP?
- To optimize performance on mixed-mode hardware like the SP
- MPI is used for inter-node communication and OpenMP for intra-node parallelism
- threads have lower latency
- threads can alleviate the network contention of a pure MPI implementation
66 Hybrid MPI-OpenMP?
- Unless you are forced against your will, for the hybrid model to be worthwhile:
- There has to be obvious parallelism to exploit
- The code has to be easy to program and maintain
- it is easy to write bad OpenMP code
- It has to promise to perform at least as well as the equivalent all-MPI program
- Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
- this is especially true of applications with a single level of parallelism
67 Hybrid Scenario
- Thread the computational portions of the code that exist between MPI calls (sketched below)
- MPI calls are single-threaded and therefore use only a single CPU
- Assumes:
- the application has two natural levels of parallelism
- or that, in breaking up an MPI code with one level of parallelism, there is little or no communication between the resulting threads
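
A minimal sketch of this scenario (not from the slides; names are arbitrary, and the thread-safe compiler driver mpxlf90_r with -qsmp=omp is an assumption): MPI calls are made by a single thread, and the compute loop between them is parallelized with OpenMP.

  program hybrid
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100000
    integer :: i, my_id, ierr
    real*8 :: a(n), local_sum, global_sum

    call MPI_INIT(ierr)                       ! single-threaded MPI section
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    a = dble(my_id + 1)

    local_sum = 0.0d0
  !$OMP PARALLEL DO REDUCTION(+:local_sum)
    do i = 1, n                               ! threaded compute section
      local_sum = local_sum + a(i)
    end do
  !$OMP END PARALLEL DO

    ! back to a single thread for inter-node communication
    call MPI_REDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                    MPI_SUM, 0, MPI_COMM_WORLD, ierr)
    if (my_id .eq. 0) print *, 'global sum = ', global_sum

    call MPI_FINALIZE(ierr)
  end program hybrid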
68 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
69 MPI-IO
[Diagram: the memory of several processes mapped onto a single shared file]
- Part of MPI-2
- Resulted from work at IBM Research exploring the analogy between I/O and message passing
- See "Using MPI-2" by Gropp et al. (MIT Press)
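
As an illustration (a sketch, not from the slides; the file name and sizes are arbitrary), each MPI process writes its own block of data to a shared file at an offset computed from its rank:

  program mpiio_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100
    real*8 :: a(n)
    integer :: my_id, ierr, fh
    integer(kind=MPI_OFFSET_KIND) :: offset

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    a = dble(my_id)

    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'data.out', &
                       MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)

    ! each rank writes n real*8 values at its own offset in the shared file
    offset = my_id * n * 8
    call MPI_FILE_WRITE_AT(fh, offset, a, n, MPI_DOUBLE_PRECISION, &
                           MPI_STATUS_IGNORE, ierr)

    call MPI_FILE_CLOSE(fh, ierr)
    call MPI_FINALIZE(ierr)
  end program mpiio_sketch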
70 Conclusion
- Don't forget uni-processor optimization
- If you choose one parallel programming API, choose MPI
- Mixed MPI-OpenMP may be appropriate in certain cases
- More work is needed here
- The remote memory access model may be the answer