Title: Parallel Architecture Models
1 Parallel Architecture Models
- Shared Memory
  - Dual/Quad Pentium, Cray T90, IBM Power3 Node
- Distributed Memory
  - Cray T3E, IBM SP2, Network of Workstations
- Distributed-Shared Memory
  - SGI Origin 2000, Convex Exemplar
2 Shared Memory Systems (SMP)
- Any processor can access any memory location at equal cost (Symmetric Multi-Processor)
- Tasks communicate by writing/reading common locations
- Easier to program
- Cannot scale beyond around 30 PEs (bus bottleneck)
- Most workstation vendors make SMPs today (SGI, Sun, HP, Digital, Pentium-based)
- Cray Y-MP, C90, T90 (cross-bar between PEs and memory)
3 Cache Coherence in SMPs
- Each processor's cache holds the most recently accessed values
- If a word cached in several caches is modified, all copies must be made consistent
- Bus-based SMPs use an efficient mechanism: a snoopy bus
- The snoopy bus monitors all writes and marks the other cached copies invalid
- When a processor finds a cache word marked invalid, it fetches a fresh copy from shared memory (see the sketch after this list)
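To make the write-invalidate idea concrete, the following small Fortran sketch models one word per cache line. It is illustrative only and not part of the original slides; the module and routine names (snoopy_sketch, write_word, read_word) and the arrays cache_state, cache_data and shared_mem are invented for the example. On a write, every other cache "snoops" the bus and marks its copy invalid; a later read of an invalidated word refetches it from shared memory.

module snoopy_sketch
  implicit none
  integer, parameter :: NPROC = 4, NLINE = 8
  integer, parameter :: INVALID = 0, VALID = 1
  integer :: cache_state(NPROC, NLINE) = VALID   ! per-processor state of each cached word
  real    :: cache_data(NPROC, NLINE)  = 0.0     ! per-processor cached copies
  real    :: shared_mem(NLINE)         = 0.0     ! the shared memory itself
contains

  subroutine write_word(p, l, val)       ! processor p writes word l
    integer, intent(in) :: p, l
    real,    intent(in) :: val
    integer :: q
    cache_data(p, l) = val
    shared_mem(l)    = val               ! write-through, to keep the sketch simple
    do q = 1, NPROC                      ! every other cache snoops the bus write ...
      if (q /= p) cache_state(q, l) = INVALID   ! ... and marks its copy invalid
    end do
    cache_state(p, l) = VALID
  end subroutine write_word

  real function read_word(p, l)          ! processor p reads word l
    integer, intent(in) :: p, l
    if (cache_state(p, l) == INVALID) then
      cache_data(p, l)  = shared_mem(l)  ! invalid copy: fetch a fresh one from shared memory
      cache_state(p, l) = VALID
    end if
    read_word = cache_data(p, l)
  end function read_word

end module snoopy_sketch

Real protocols (e.g. MESI) track more states and invalidate whole cache lines rather than single words, but the invalidate-on-write behaviour is the one described in the bullets above.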
4 Distributed Memory Systems
(Figure: each node contains a Processor (P), Cache (C), Memory (M) and Network Interface Card (NIC); the nodes are connected by an interconnection network)
- Each processor can only access its own memory
- Explicit communication by sending and receiving messages
- More tedious to program
- Can scale to hundreds/thousands of processors
- Cache coherence is not needed
- Examples: IBM SP-2, Cray T3E, workstation clusters
5 Distributed Shared Memory
- Each processor can directly access any memory location
- Physically distributed memory permits many simultaneous accesses
- Non-uniform memory access costs
- Examples: Convex Exemplar, SGI Origin 2000
- Complex hardware and high cost for cache coherence
- Software DSM systems (e.g. TreadMarks) implement the shared-memory abstraction on top of distributed-memory systems
6 Parallel Programming Models
- Shared-Address Space Models
  - BSP (Bulk Synchronous Parallel model)
  - HPF (High Performance Fortran)
  - OpenMP
- Message Passing
  - Partitioned address space: PVM, MPI (see Ch. 8 of I. Foster's book Designing and Building Parallel Programs, available online)
- Higher-Level Programming Environments
  - PETSc (Portable, Extensible Toolkit for Scientific Computation)
  - POOMA (Parallel Object-Oriented Methods and Applications)
7 OpenMP
- Standard sequential Fortran/C model
- Single global view of data
- Automatic parallelization by compiler
- User can provide loop-level directives (see the sketch after this list)
- Easy to program
- Only available on Shared-Memory Machines
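As a concrete illustration of a loop-level directive (a made-up example, not from the slides), a single !$omp parallel do with a reduction clause is enough to parallelize an array sum on a shared-memory machine:

program omp_sum_sketch
  implicit none
  integer, parameter :: n = 100000
  real    :: x(n), s
  integer :: i

  x = 1.0
  s = 0.0
  ! the directive splits the loop iterations across the threads and the
  ! reduction clause combines the per-thread partial sums into s
!$omp parallel do reduction(+:s)
  do i = 1, n
    s = s + x(i)
  end do
!$omp end parallel do

  print *, 'sum =', s
end program omp_sum_sketch

Compiled without OpenMP support the directives are treated as comments and the same code runs sequentially, which is part of what makes the model easy to program.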
8 High Performance Fortran
- Global shared address space, similar to the sequential programming model
- User provides data mapping directives
- User can provide information on loop-level parallelism
- Portable: available on all three types of architectures
- Compiler automatically synthesizes message-passing code if needed
- Restricted to dense arrays and regular distributions
- Performance is not consistently good
9 Message Passing
- Program is a collection of tasks
- Each task can only read/write its own data
- Tasks communicate data by explicitly sending/receiving messages
- Need to translate from the global shared view to the local partitioned view when porting a sequential program
- Tedious to program/debug
- Very good performance
10 Illustrative Example
  Real a(n,n), b(n,n)
  Do k = 1, NumIter
    Do i = 2, n-1
      Do j = 2, n-1
        a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
      End Do
    End Do
    Do i = 2, n-1
      Do j = 2, n-1
        b(i,j) = a(i,j)
      End Do
    End Do
  End Do
(Figure: the arrays a(20,20) and b(20,20) drawn as 20x20 grids)
11 Example: OpenMP
  Real a(n,n), b(n,n)
!$omp parallel shared(a,b) private(i,j,k)
  Do k = 1, NumIter
!$omp do
    Do i = 2, n-1
      Do j = 2, n-1
        a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
      End Do
    End Do
!$omp do
    Do i = 2, n-1
      Do j = 2, n-1
        b(i,j) = a(i,j)
      End Do
    End Do
  End Do
!$omp end parallel
Global shared view of data
(Figure: a(20,20) and b(20,20) as 20x20 grids, visible in full to every thread)
12 Example: HPF (1D partition)
  Real a(n,n), b(n,n)
!hpf$ distribute a(block,*)
!hpf$ distribute b(block,*)
  Do k = 1, NumIter
!hpf$ independent, new(j)
    Do i = 2, n-1
      Do j = 2, n-1
        a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
      End Do
    End Do
!hpf$ independent, new(j)
    Do i = 2, n-1
      Do j = 2, n-1
        b(i,j) = a(i,j)
      End Do
    End Do
  End Do
Global shared view of data
(Figure: a(20,20) and b(20,20) partitioned by block rows across processors P0, P1, P2, P3)
13 Example: HPF (2D partition)
  Real a(n,n), b(n,n)
!hpf$ distribute a(block,block)
!hpf$ distribute b(block,block)
  Do k = 1, NumIter
!hpf$ independent, new(j)
    Do i = 2, n-1
      Do j = 2, n-1
        a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
      End Do
    End Do
!hpf$ independent, new(j)
    Do i = 2, n-1
      Do j = 2, n-1
        b(i,j) = a(i,j)
      End Do
    End Do
  End Do
Global shared view of data
(Figure: a(20,20) and b(20,20) partitioned into 2D blocks across the processors)
14 Message Passing: Local View
(Figure: the global shared view of the array versus the local partitioned view, one bl(5,20) block per processor; values along the partition boundaries require communication)
15 Example: Message Passing
  Real al(NdivP,n), bl(0:NdivP+1,n)
  me = get_my_procnum()
  Do k = 1, NumIter
    if (me < P-1) send(me+1, bl(NdivP,1:n))
    if (me > 0)   recv(me-1, bl(0,1:n))
    if (me > 0)   send(me-1, bl(1,1:n))
    if (me < P-1) recv(me+1, bl(NdivP+1,1:n))
    if (me == 0)   then; i1 = 2;         else; i1 = 1;     end if
    if (me == P-1) then; i2 = NdivP - 1; else; i2 = NdivP; end if
    Do i = i1, i2
      Do j = 2, n-1
        al(i,j) = (bl(i-1,j) + bl(i,j-1) + bl(i+1,j) + bl(i,j+1))/4
      End Do
    End Do
    ...
(Figure: the local partitioned view with ghost cells around each processor's al(5,20) block; the ghost cells are communicated by message passing)
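The send/recv calls above are pseudocode. The sketch below shows how the same ghost-cell (halo) exchange could be written with actual MPI calls in Fortran; it is illustrative only, assumes the same row-block decomposition with NdivP rows per process, and the program name, tags and placeholder data are invented for the example.

program halo_exchange_sketch
  use mpi
  implicit none
  integer, parameter :: n = 20, NdivP = 5
  real    :: bl(0:NdivP+1,n)            ! locally owned rows plus two ghost rows
  real    :: sendbuf(n), recvbuf(n)     ! contiguous buffers for one (strided) row
  integer :: me, P, up, down, ierr
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, P, ierr)

  bl = real(me)                         ! placeholder local data
  up   = me - 1                         ! rank owning the block of rows above mine
  down = me + 1                         ! rank owning the block of rows below mine
  if (up < 0)     up   = MPI_PROC_NULL  ! no neighbour: the transfer becomes a no-op
  if (down > P-1) down = MPI_PROC_NULL

  ! send my last owned row down, receive the upper ghost row bl(0,:)
  sendbuf = bl(NdivP,1:n)
  call MPI_Sendrecv(sendbuf, n, MPI_REAL, down, 0, &
                    recvbuf, n, MPI_REAL, up,   0, &
                    MPI_COMM_WORLD, status, ierr)
  if (up /= MPI_PROC_NULL) bl(0,1:n) = recvbuf

  ! send my first owned row up, receive the lower ghost row bl(NdivP+1,:)
  sendbuf = bl(1,1:n)
  call MPI_Sendrecv(sendbuf, n, MPI_REAL, up,   1, &
                    recvbuf, n, MPI_REAL, down, 1, &
                    MPI_COMM_WORLD, status, ierr)
  if (down /= MPI_PROC_NULL) bl(NdivP+1,1:n) = recvbuf

  ! ... Jacobi update of the locally owned rows, as on the slide ...

  call MPI_Finalize(ierr)
end program halo_exchange_sketch

Using MPI_Sendrecv with MPI_PROC_NULL neighbours avoids the explicit first/last-process special cases of the pseudocode; an MPI derived datatype (MPI_Type_vector) could replace the explicit packing of the strided row into sendbuf.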
16 Comparison of Models
- Program Porting/Development Effort
  - OpenMP, HPF << MPI
- Portability across systems
  - HPF, MPI >> OpenMP (only shared-memory)
- Applicability
  - MPI, OpenMP >> HPF (only dense arrays)
- Performance
  - MPI > OpenMP >> HPF
17 PETSc
- Higher-level parallel programming model
- Aims to provide both ease of use and high performance for numerical PDE solution
- Uses an efficient message-passing implementation underneath, but provides a global view of data arrays
- The system takes care of the needed message-passing
- Portable across shared- and distributed-memory systems
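To give a flavour of the model, the sketch below creates a globally sized, automatically partitioned PETSc vector from Fortran and sets it to a constant; the message-passing work happens inside the library. It is a minimal sketch only: the include path and calling conventions vary with the PETSc version, the source must be preprocessed (e.g. a .F90 file), and the global size of 100 is arbitrary.

program petsc_vec_sketch
#include <petsc/finclude/petscvec.h>
  use petscvec
  implicit none
  Vec            x
  PetscErrorCode ierr
  PetscInt       nglobal
  PetscScalar    one

  call PetscInitialize(PETSC_NULL_CHARACTER, ierr)   ! also starts up MPI
  nglobal = 100
  one     = 1.0

  ! the user gives only the global size; PETSc decides how the vector is
  ! distributed across the processes and performs the message-passing
  ! needed by later operations on it
  call VecCreate(PETSC_COMM_WORLD, x, ierr)
  call VecSetSizes(x, PETSC_DECIDE, nglobal, ierr)
  call VecSetFromOptions(x, ierr)
  call VecSet(x, one, ierr)

  call VecDestroy(x, ierr)
  call PetscFinalize(ierr)
end program petsc_vec_sketch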