Title: Programming the IBM Power3 SP
1 Programming the IBM Power3 SP
- Eric Aubanel
- Advanced Computational Research Laboratory
- Faculty of Computer Science, UNB
2 Advanced Computational Research Laboratory
- High Performance Computational Problem-Solving and Visualization Environment
- Computational experiments in multiple disciplines: CS, Science, and Engineering
- 16-processor IBM SP3
- Member of C3.ca Association, Inc. (http://www.c3.ca)
3 Advanced Computational Research Laboratory
- www.cs.unb.ca/acrl
- Virendra Bhavsar, Director
- Eric Aubanel, Research Associate, Scientific Computing Support
- Sean Seeley, System Administrator
6 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
7 POWER chip: 1990 to 2003
- 1990
- POWER: Performance Optimized With Enhanced RISC
- RISC: Reduced Instruction Set Computer
- Superscalar, with a combined floating-point multiply-add (FMA) unit that allowed a peak MFLOPS rate of 2 x clock MHz
- Initially 25 MHz (50 MFLOPS) and 64 KB data cache
8 POWER chip: 1990 to 2003
- 1991: SP1
- IBM's first SP (Scalable POWERparallel)
- Rack of standalone POWER processors (62.5 MHz) connected by an internal switch network
- Parallel Environment system software
9 POWER chip: 1990 to 2003
- 1993: POWER2
- 2 FMAs
- Increased data cache size
- 66.5 MHz (254 MFLOPS)
- Improved instruction set (incl. hardware square root)
- SP2: POWER2 plus a higher-bandwidth switch for larger systems
10 POWER chip: 1990 to 2003
- 1993: PowerPC
- SMP support
- 1996: P2SC
- POWER2 Super Chip, clock speeds up to 160 MHz
11 POWER chip: 1990 to 2003
- Feb. 1999: POWER3
- Combined P2SC and PowerPC
- 64-bit architecture
- Initially 2-way SMP, 200 MHz
- Cache improvements, including an L2 cache of 1-16 MB
- Instruction and data prefetch
12 POWER3 chip, Feb. 2000
- Winterhawk II: 375 MHz
- 4-way SMP
- 2 MULT/ADD = 1500 MFLOPS per processor
- 64 KB Level 1 cache: 5 ns / 3.2 GB/s
- 8 MB Level 2 cache: 45 ns / 6.4 GB/s
- 1.6 GB/s memory bandwidth
- 6 GFLOPS/node
- Nighthawk II: 375 MHz
- 16-way SMP
- 2 MULT/ADD = 1500 MFLOPS per processor
- 64 KB Level 1 cache: 5 ns / 3.2 GB/s
- 8 MB Level 2 cache: 45 ns / 6.4 GB/s
- 14 GB/s memory bandwidth
- 24 GFLOPS/node
13 The Clustered SMP
ACRL's SP: four 4-way SMP nodes
Each node has its own copy of the OS
Processors on the same node are "closer" than those on different nodes
14 POWER3 Architecture
15 POWER4 - 32-way
- Logical UMA
- SP High node
- L3 cache shared between all processors on a node: 32 MB
- Up to 32 GB main memory
- Each processor: 1.1 GHz
- 140 GFLOPS total peak
16 Going to NUMA
NUMA: up to 256 processors (1.1 TFLOPS)
17 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
18 Uni-processor Optimization
- Compiler options
- start with -O3 -qstrict, then -O3, -qarch=pwr3 (example compile lines below)
- Cache re-use
- Take advantage of the superscalar architecture
- give enough operations per load/store
- Use ESSL: optimization already maximally exploited
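
As an illustration (not from the slides), typical compile lines might look like the following; the flag spellings follow IBM XL Fortran, and -qtune=pwr3 and -qhot are assumptions beyond what the slide lists.

  # conservative first pass: aggressive optimization without changing numerical results
  xlf90 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -o myprog myprog.f90
  # more aggressive: drop -qstrict, add high-order loop transformations
  xlf90 -O3 -qarch=pwr3 -qtune=pwr3 -qhot -o myprog myprog.f90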
19 Memory Access Times
20 Cache
- L2 cache: 4-way set-associative, 8 MB total
- L1 cache: 128-way set-associative, 64 KB
21 How to Monitor Performance?
- IBM's hardware monitor HPMCOUNT
- Uses hardware counters on the chip
- Cache and TLB misses, floating-point ops, load/stores, ...
- Beta version
- Available soon on ACRL's SP
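
As a sketch (the tool was still in beta, so options may differ), the monitor simply wraps the program launch:

  hpmcount ./myprog      # run myprog and report the hardware counter totals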
22 HPMCOUNT sample output
      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
        do i = 1, 256
          a(i,j) = b(i,j) + c(i,j)
        end do
      end do
      end

  PM_TLB_MISS (TLB misses)              :  66543
  Average number of loads per TLB miss  :  5.916
  Total loads and stores                :  0.525 M
  Instructions per load/store           :  2.749
  Cycles per instruction                :  2.378
  Instructions per cycle                :  0.420
  Total floating point operations       :  0.066 M
  Hardware floating point rate          :  2.749 Mflop/sec
23 HPMCOUNT sample output
      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
        do i = 1, 257
          a(i,j) = b(i,j) + c(i,j)
        end do
      end do
      end

  PM_TLB_MISS (TLB misses)              :  1634
  Average number of loads per TLB miss  :  241.876
  Total loads and stores                :  0.527 M
  Instructions per load/store           :  2.749
  Cycles per instruction                :  1.271
  Instructions per cycle                :  0.787
  Total floating point operations       :  0.066 M
  Hardware floating point rate          :  3.525 Mflop/sec

Padding the leading dimension from 256 to 257 breaks the power-of-two address strides between the three arrays, which reduces cache and TLB set conflicts: TLB misses drop by a factor of about 40 and the flop rate improves.
24 ESSL
- Linear algebra, Fourier and related transforms, sorting, interpolation, quadrature, random numbers
- Fast!
- 560x560 real*8 matrix multiply:
- Hand coding: 19 MFLOPS
- dgemm: 1.2 GFLOPS
- Parallel (threaded and distributed) versions
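
As an illustration (not from the slides), ESSL's dgemm is called through the standard BLAS interface; the program name and array names below are just for the sketch, and linking is typically done with -lessl.

  ! C <- 1.0*A*B + 0.0*C using ESSL's dgemm (standard BLAS argument list)
  program mm
    implicit none
    integer, parameter :: n = 560
    real*8 :: a(n,n), b(n,n), c(n,n)
    call random_number(a)
    call random_number(b)
    c = 0.0d0
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    print *, 'c(1,1) = ', c(1,1)
  end program mm

  ! compile and link, e.g.:  xlf90 -O3 mm.f90 -lessl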
25 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
26 ACRL's IBM SP
- 4 Winterhawk II nodes
- 16 processors
- Each node has:
- 1 GB RAM
- 9 GB (mirrored) disk
- Switch adapter
- High Performance Switch
- Gigabit Ethernet (1 node)
- Control workstation
- Disk: SSA tower with six 18.2 GB disks
28 IBM Power3 SP Switch
- Bidirectional multistage interconnection network (MIN)
- 300 MB/s bidirectional bandwidth
- 1.2 µs latency
29 General Parallel File System
[Diagram: nodes 1-4 accessing a shared file system through the SP Switch]
30 ACRL Software
- Operating system: AIX 4.3.3
- Compilers
- IBM XL Fortran 7.1 (HPF not yet installed)
- VisualAge C for AIX, Version 5.0.1.0
- VisualAge C++ Professional for AIX, Version 5.0.0.0
- IBM VisualAge for Java (not yet installed)
- Job scheduler: LoadLeveler 2.2
- Parallel programming tools
- IBM Parallel Environment 3.1 (MPI, MPI-2 parallel I/O)
- Numerical libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2)
- Visualization: OpenDX (not yet installed)
- E-Commerce software (not yet installed)
31 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
32 Why Parallel Computing?
- Solve large problems in reasonable time
- Many algorithms are inherently parallel
- image processing, Monte Carlo
- Simulations (e.g. CFD)
- High-performance computers have parallel architectures
- Commercial off-the-shelf (COTS) components
- Beowulf clusters
- SMP nodes
- Improvements in network technology
33 NRL Layered Ocean Model at the Naval Research Laboratory, on an IBM Winterhawk II SP
34 Parallel Computational Models
- Data parallelism
- Parallel program looks like a serial program
- parallelism is in the data
- Vector processors
- HPF
35 Parallel Computational Models
[Diagram: two processes exchanging Send and Receive operations]
- Message Passing (MPI)
- Processes have only local memory but can communicate with other processes by sending and receiving messages
- Data transfer between processes requires operations to be performed by both processes
- The communication network is not part of the computational model (hypercube, torus, ...)
36 Parallel Computational Models
- Shared Memory (threads)
- P(osix) threads
- OpenMP: a higher-level standard
37 Parallel Computational Models
[Diagram: one process issuing Get and Put operations on another process's memory]
- Remote Memory Operations
- One-sided communication
- MPI-2, IBM's LAPI
- One process can access the memory of another without the other's participation, but it does so explicitly, not in the same way it accesses local memory
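
A minimal sketch of MPI-2 one-sided communication (not from the slides; the window size, names, and fence synchronization are illustrative assumptions): rank 0 puts data directly into a window exposed by rank 1, which makes no matching receive call. Run on at least two processes.

  program one_sided
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100
    real*8 :: buf(n), win_mem(n)
    integer :: my_id, win, ierr
    integer(kind=MPI_ADDRESS_KIND) :: winsize, disp

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    winsize = n * 8                       ! window size in bytes (real*8)

    ! every rank exposes win_mem as a window that other ranks may access
    call MPI_WIN_CREATE(win_mem, winsize, 8, MPI_INFO_NULL, &
                        MPI_COMM_WORLD, win, ierr)

    call MPI_WIN_FENCE(0, win, ierr)      ! open an access epoch on all ranks
    if (my_id .eq. 0) then
      buf = 42.0d0
      disp = 0
      ! write buf into rank 1's window; rank 1 does nothing to receive it
      call MPI_PUT(buf, n, MPI_DOUBLE_PRECISION, 1, disp, n, &
                   MPI_DOUBLE_PRECISION, win, ierr)
    end if
    call MPI_WIN_FENCE(0, win, ierr)      ! close the epoch; the data is now visible

    call MPI_WIN_FREE(win, ierr)
    call MPI_FINALIZE(ierr)
  end program one_sided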
38 Parallel Computational Models
- Combined Message Passing and Threads
- Driven by clusters of SMPs
- Leads to software complexity!
39 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
40 Message Passing Interface
- MPI 1.0 standard in 1994
- MPI 1.1 in 1995
- IBM support
- MPI 2.0 in 1997
- Includes 1.1 but adds new features
- MPI-IO
- One-sided communication
- Dynamic processes
41 Advantages of MPI
- Universality
- Expressivity
- Well suited to formulating a parallel algorithm
- Ease of debugging
- Memory is local
- Performance
- Explicit association of data with process allows good use of cache
42 MPI Functionality
- Several modes of point-to-point message passing
- blocking (e.g. MPI_SEND)
- non-blocking (e.g. MPI_ISEND)
- synchronous (e.g. MPI_SSEND)
- buffered (e.g. MPI_BSEND)
- Collective communication and synchronization
- e.g. MPI_REDUCE, MPI_BARRIER
- User-defined datatypes
- Logically distinct communicator spaces
- Application-level or virtual topologies
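
As an illustration of the collective calls listed above (a sketch, not from the slides; the variable names are arbitrary), each process computes a partial sum and rank 0 receives the total:

  program collectives
    implicit none
    include 'mpif.h'
    integer :: my_id, n, ierr
    real*8 :: partial, total

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)

    n = 10
    call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)   ! rank 0's n goes to everyone

    partial = dble(my_id) * n                                    ! each rank computes its piece
    call MPI_REDUCE(partial, total, 1, MPI_DOUBLE_PRECISION, &
                    MPI_SUM, 0, MPI_COMM_WORLD, ierr)            ! sum gathered on rank 0

    if (my_id .eq. 0) print *, 'total = ', total
    call MPI_FINALIZE(ierr)
  end program collectives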
43 Simple MPI Example
[Diagram: processes with My_Id = 0 and 1; process 0 prints "This is from MPI process number 0", the others print "This is from MPI processes other than 0"]
44 Simple MPI Example
      Program Trivial
      implicit none
      include "mpif.h"     ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
      print *, ' My_id, numb_of_procs ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
        print *, ' This is from MPI process number ', My_Id
      else
        print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( ierr ) ! bad things happen if you forget ierr
      stop
      end
45 MPI Example with send/recv
[Diagram: processes My_Id = 0 and 1 each send to and receive from the other]
46 MPI Example with send/recv
      Program Simple
      implicit none
      include "mpif.h"
      integer My_Id, Other_Id, Nx, Ierr
      integer Status ( MPI_STATUS_SIZE )   ! needed by MPI_RECV
      parameter ( Nx = 100 )
      real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id,
     &                MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id,
     &                MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end
47 What Will Happen?
  /* Processor 0 */
  ...
  MPI_Send(sendbuf, bufsize, MPI_CHAR,
           partner, tag, MPI_COMM_WORLD);
  printf("Posting receive now ...\n");
  MPI_Recv(recvbuf, bufsize, MPI_CHAR,
           partner, tag, MPI_COMM_WORLD,
           &status);

  /* Processor 1 */
  ...
  MPI_Send(sendbuf, bufsize, MPI_CHAR,
           partner, tag, MPI_COMM_WORLD);
  printf("Posting receive now ...\n");
  MPI_Recv(recvbuf, bufsize, MPI_CHAR,
           partner, tag, MPI_COMM_WORLD,
           &status);

Both processes send before they receive, so completion depends on whether the messages are buffered; the eager limit on the next slide decides this, and above it the code deadlocks. A non-blocking alternative is sketched after slide 48.
48 MPI Message Passing Modes
  Mode:      Ready   Standard                            Synchronous   Buffered
  Protocol:  Ready   Eager (message < eager limit)       Rendezvous    Buffered
                     Rendezvous (message > eager limit)
- The default eager limit on the SP is 4 KB (it can be raised to 64 KB, e.g. via the MP_EAGER_LIMIT environment variable)
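
One way to make an exchange like the one on slide 47 safe regardless of the eager limit is to use non-blocking calls. Below is a Fortran sketch (not from the slides; names and sizes are arbitrary, run on two processes) in which both operations are posted immediately and completed later with MPI_WAITALL.

  program nonblocking
    implicit none
    include 'mpif.h'
    integer, parameter :: nx = 100
    real :: a(nx), b(nx)
    integer :: my_id, other_id, ierr
    integer :: requests(2), statuses(MPI_STATUS_SIZE, 2)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    other_id = mod(my_id + 1, 2)
    a = real(my_id)

    ! post the receive and the send without blocking
    call MPI_IRECV(b, nx, MPI_REAL, other_id, other_id, &
                   MPI_COMM_WORLD, requests(1), ierr)
    call MPI_ISEND(a, nx, MPI_REAL, other_id, my_id, &
                   MPI_COMM_WORLD, requests(2), ierr)

    ! ... useful computation can overlap the communication here ...

    ! wait for both operations before reusing a or reading b
    call MPI_WAITALL(2, requests, statuses, ierr)

    call MPI_FINALIZE(ierr)
  end program nonblocking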
49 MPI Performance Visualization
- ParaGraph
- Developed at the University of Illinois
- Graphical display system for visualizing the behaviour and performance of MPI programs
52 Message Passing on SMP
[Diagram: on one node, MPI_SEND copies the data to send through a buffer, across the memory crossbar or the switch, into the receiver's buffer, where MPI_RECEIVE picks it up as received data]
export MP_SHARED_MEMORY=yes|no
53 Shared Memory MPI
- MP_SHARED_MEMORY=<yes|no>
                          Latency (µs)   Bandwidth (MB/s)
  between 2 nodes              24             133
  same node (no)               30              80
  same node (yes)              10             270
54 Message Passing off Node
- MPI across all the processors: many more messages going through the switch fabric
55 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
56 OpenMP
- In 1997, a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms.
- www.openmp.org
- OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C/C++.
57 OpenMP
- All processors can access all the memory in the parallel system
- Parallel execution is achieved by generating threads which execute in parallel
- The overhead for SMP parallelization is large (100-200 µs): a parallel work construct must be significant enough to overcome the overhead
58 OpenMP
- 1. All OpenMP programs begin as a single process: the master thread
- 2. FORK: the master thread creates a team of parallel threads
- 3. Parallel region statements are executed in parallel among the team threads
- 4. JOIN: the threads synchronize and terminate, leaving only the master thread
59 OpenMP
- How is OpenMP typically used?
- OpenMP is usually used to parallelize loops:
- Find your most time-consuming loops.
- Split them up between threads.
- Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!
60 OpenMP Loop Parallelization
  !$OMP PARALLEL DO
  do i = 0, ilong
    do k = 1, kshort
      ...
    end do
  end do

  #pragma omp parallel for
  for (i = 0; i < ilong; i++)
    for (k = 1; k < kshort; k++)
      ...
61 Variable Scoping
- The most difficult part of shared-memory parallelization
- What memory is shared?
- What memory is private? (each thread has its own copy)
- Compare MPI: all variables are private
- Variables are shared by default, except:
- loop indices
- scalars that are set and then used in the loop (see the scoping sketch below)
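
A sketch (not from the slides; names are arbitrary) of explicit scoping: the scratch scalar is private to each thread, the array is shared, and the REDUCTION clause combines the partial sums.

  program scoping
    implicit none
    integer, parameter :: n = 100000
    integer :: i
    real*8 :: x(n), scratch, total

    call random_number(x)
    total = 0.0d0

  !$OMP PARALLEL DO PRIVATE(scratch) SHARED(x) REDUCTION(+:total)
    do i = 1, n                     ! the loop index i is private automatically
      scratch = x(i) * x(i)         ! each thread has its own copy of scratch
      total = total + scratch       ! partial sums are combined at the end
    end do
  !$OMP END PARALLEL DO

    print *, 'sum of squares = ', total
  end program scoping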
62 How Does Sharing Work?
Shared variable x, initially 0

  THREAD 1                            THREAD 2
  increment(x)                        increment(x)
  { x = x + 1 }                       { x = x + 1 }

  THREAD 1                            THREAD 2
  10 LOAD A, (x address)              10 LOAD A, (x address)
  20 ADD A, 1                         20 ADD A, 1
  30 STORE A, (x address)             30 STORE A, (x address)

The result could be 1 or 2: synchronization is needed.
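
A minimal sketch (not from the slides) of one fix in OpenMP: the ATOMIC directive makes the increment indivisible, so no updates are lost.

  program atomic_demo
    implicit none
    integer :: x
    x = 0
  !$OMP PARALLEL
  !$OMP ATOMIC
    x = x + 1
  !$OMP END PARALLEL
    print *, 'x = ', x   ! equals the number of threads, with no lost updates
  end program atomic_demo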
63 False Sharing
[Diagram: a cache block holding eight elements (7 6 5 4 3 2 1 0), with its address tag, stored in one cache line]
Say A(1:5) starts on a cache line; then some of A(6:10) will be on that same cache line, so a second thread working on A(6:10) cannot access it until the first thread is finished with the line.

  !$OMP PARALLEL DO
  do I = 1, 20
    A(I) = ...
  end do
64 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
65 Why Hybrid MPI-OpenMP?
- To optimize performance on mixed-mode hardware like the SP
- MPI is used for inter-node communication and OpenMP for intra-node parallelism
- threads have lower latency
- threads can alleviate the network contention of a pure MPI implementation
66 Hybrid MPI-OpenMP?
- Unless you are forced against your will, for the hybrid model to be worthwhile:
- There has to be obvious parallelism to exploit
- The code has to be easy to program and maintain
- it is easy to write bad OpenMP code
- It has to promise to perform at least as well as the equivalent all-MPI program
- Experience has shown that converting working MPI code to a hybrid model rarely results in better performance
- this is especially true of applications with a single level of parallelism
67 Hybrid Scenario
- Thread the computational portions of the code that exist between MPI calls (sketched below)
- MPI calls are single-threaded and therefore use only a single CPU
- Assumes:
- the application has two natural levels of parallelism
- or that, in breaking up an MPI code with one level of parallelism, there is little or no communication between the resulting threads
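
A minimal sketch of this scenario (not from the slides; names are arbitrary, and the thread-safe compiler driver mpxlf90_r with -qsmp=omp is an assumption): MPI calls are made by a single thread, and the compute loop between them is parallelized with OpenMP.

  program hybrid
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100000
    integer :: i, my_id, ierr
    real*8 :: a(n), local_sum, global_sum

    call MPI_INIT(ierr)                       ! single-threaded MPI section
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    a = dble(my_id + 1)

    local_sum = 0.0d0
  !$OMP PARALLEL DO REDUCTION(+:local_sum)
    do i = 1, n                               ! threaded compute section
      local_sum = local_sum + a(i)
    end do
  !$OMP END PARALLEL DO

    ! back to a single thread for inter-node communication
    call MPI_REDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                    MPI_SUM, 0, MPI_COMM_WORLD, ierr)
    if (my_id .eq. 0) print *, 'global sum = ', global_sum

    call MPI_FINALIZE(ierr)
  end program hybrid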
68 Programming the IBM Power3 SP
- History and future of the POWER chip
- Uni-processor optimization
- Description of ACRL's IBM SP
- Parallel Processing
- MPI
- OpenMP
- Hybrid MPI/OpenMP
- MPI-I/O (one slide)
69 MPI-IO
[Diagram: the memory of several processes mapped onto a single shared file]
- Part of MPI-2
- Resulted from work at IBM Research exploring the analogy between I/O and message passing
- See "Using MPI-2" by Gropp et al. (MIT Press)
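
As an illustration (a sketch, not from the slides; the file name and sizes are arbitrary), each MPI process writes its own block of data to a shared file at an offset computed from its rank:

  program mpiio_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100
    real*8 :: a(n)
    integer :: my_id, ierr, fh
    integer(kind=MPI_OFFSET_KIND) :: offset

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    a = dble(my_id)

    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'data.out', &
                       MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)

    ! each rank writes n real*8 values at its own offset in the shared file
    offset = my_id * n * 8
    call MPI_FILE_WRITE_AT(fh, offset, a, n, MPI_DOUBLE_PRECISION, &
                           MPI_STATUS_IGNORE, ierr)

    call MPI_FILE_CLOSE(fh, ierr)
    call MPI_FINALIZE(ierr)
  end program mpiio_sketch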
70 Conclusion
- Don't forget uni-processor optimization
- If you choose one parallel programming API, choose MPI
- Mixed MPI-OpenMP may be appropriate in certain cases
- More work is needed here
- The remote memory access model may be the answer