Title: Parallel Computing Overview
1. Parallel Computing Overview
- CS 524 High-Performance Computing
2. Parallel Computing
- Multiple processors work cooperatively to solve a computational problem
- Examples of parallel computing range from specially designed parallel computers and algorithms to geographically distributed networks of workstations cooperating on a task
- There are problems that cannot be solved by present-day serial computers, or that take an impractically long time to solve
- Parallel computing exploits the concurrency and parallelism inherent in the problem domain (a sketch of the two forms follows this list)
  - Task parallelism
  - Data parallelism
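A minimal OpenMP sketch of the two forms (an illustration, not from the slides; array names and values are made up): the loop exhibits data parallelism, since the same operation is applied to different elements of the arrays, while the two sections exhibit task parallelism, since two independent pieces of work run concurrently.

      ! Illustrative sketch only: data parallelism vs. task parallelism with OpenMP
      program parallelism_kinds
        implicit none
        integer, parameter :: n = 100000
        real :: a(n), b(n), c(n)
        integer :: i
        a = 1.0
        b = 2.0

        ! Data parallelism: the same update applied to different chunks of the arrays
      !$omp parallel do
        do i = 1, n
           c(i) = a(i) + b(i)
        end do
      !$omp end parallel do

        ! Task parallelism: two independent tasks executed concurrently
      !$omp parallel sections
      !$omp section
        print *, 'sum of c =', sum(c)
      !$omp section
        print *, 'max of c =', maxval(c)
      !$omp end parallel sections
      end program parallelism_kinds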
3. Development Trends
- Advances in IC technology and processor design
  - CPU performance has doubled roughly every 18 months for the past 20 years (Moore's Law)
  - Clock rates rose from 4.77 MHz for the 8088 (1979) to 3.6 GHz for the Pentium 4 (2004)
  - FLOPS rose from a handful (1945) to 35.86 TFLOPS (NEC Earth Simulator, 2002 to date)
  - Decrease in cost and size
- Advances in computer networking
  - Bandwidth has risen from a few bits per second to > 10 Gb/s
  - Decrease in size and cost, and increase in reliability
- Need
  - Solution of larger and more complex problems
4. Issues in Parallel Computing
- Parallel architectures
  - Design of bottleneck-free hardware components
- Parallel programming models
  - Parallel view of the problem domain for effective partitioning and distribution of work among processors
- Parallel algorithms
  - Efficient algorithms that take advantage of parallel architectures
- Parallel programming environments
  - Programming languages, compilers, portable libraries, development tools, etc.
5. Two Key Algorithm Design Issues
- Load balancing
  - The execution time of a parallel program is the time elapsed from the start of processing by the first processor to the end of processing by the last processor (a simple timing model is sketched after this list)
  - Partitioning of the computational load among processors
- Communication overhead
  - Processors are much faster than communication links
  - Partitioning of data among processors
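As a rough way to see how both issues enter the execution time (an illustrative model, not from the slides), write the parallel time as the slowest processor's computation plus communication:

    T_{\mathrm{par}} \;=\; \max_{0 \le p < P}\left( T^{\mathrm{comp}}_{p} + T^{\mathrm{comm}}_{p} \right),
    \qquad
    \mathrm{speedup} \;=\; \frac{T_{\mathrm{serial}}}{T_{\mathrm{par}}}

Load imbalance raises the max because a single overloaded processor determines when the program finishes, while communication overhead adds terms that a serial run never pays.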
6. Parallel MVM: Row-Block Partition
    do i = 1, N
      do j = 1, N
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do
[Figure: A partitioned into four row blocks owned by P0-P3; the vectors x and y are distributed in blocks across P0-P3]
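A hypothetical MPI version of the row-block partition (a sketch, not from the slides; it assumes N is divisible by the number of processes and that x and y are distributed in matching blocks): each process stores N/P rows of A, gathers the full x once, and then computes its block of y without further communication.

      ! Illustrative sketch only: row-block matrix-vector multiply with MPI
      program mvm_rowblock
        use mpi
        implicit none
        integer, parameter :: N = 8
        integer :: ierr, rank, P, nloc, i, j
        real, allocatable :: Aloc(:,:), xloc(:), x(:), yloc(:)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, P, ierr)
        nloc = N / P                  ! rows of A (and elements of x, y) owned locally

        allocate(Aloc(nloc,N), xloc(nloc), x(N), yloc(nloc))
        Aloc = 1.0                    ! illustrative values
        xloc = real(rank + 1)
        yloc = 0.0

        ! Communication: every process needs all of x, so gather the distributed blocks
        call MPI_Allgather(xloc, nloc, MPI_REAL, x, nloc, MPI_REAL, &
                           MPI_COMM_WORLD, ierr)

        ! Computation: each process forms its block of y from its own rows of A
        do i = 1, nloc
           do j = 1, N
              yloc(i) = yloc(i) + Aloc(i,j)*x(j)
           end do
        end do

        call MPI_Finalize(ierr)
      end program mvm_rowblock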
7. Parallel MVM: Column-Block Partition
    do j = 1, N
      do i = 1, N
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do
[Figure: A partitioned into four column blocks owned by P0-P3; the vectors x and y are distributed in blocks across P0-P3]
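A hypothetical MPI version of the column-block partition (a sketch, not from the slides; same divisibility assumption): each process stores N/P columns of A plus the matching block of x, forms a full-length partial y, and the partial vectors are then summed across processes.

      ! Illustrative sketch only: column-block matrix-vector multiply with MPI
      program mvm_colblock
        use mpi
        implicit none
        integer, parameter :: N = 8
        integer :: ierr, rank, P, nloc, i, j
        real, allocatable :: Aloc(:,:), xloc(:), ypart(:), y(:)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, P, ierr)
        nloc = N / P                  ! columns of A (and elements of x) owned locally

        allocate(Aloc(N,nloc), xloc(nloc), ypart(N), y(N))
        Aloc = 1.0                    ! illustrative values
        xloc = real(rank + 1)
        ypart = 0.0

        ! Computation: partial product using only the locally owned columns
        do j = 1, nloc
           do i = 1, N
              ypart(i) = ypart(i) + Aloc(i,j)*xloc(j)
           end do
        end do

        ! Communication: sum the partial vectors so every process ends up with the full y
        call MPI_Allreduce(ypart, y, N, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)

        call MPI_Finalize(ierr)
      end program mvm_colblock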
8. Parallel MVM: Block Partition
- Can we do any better?
- Assume the same distribution of x and y
- Can A be partitioned to reduce communication? (a rough count follows the figure)
[Figure: A partitioned into two-dimensional blocks among P0-P3, with x and y distributed across the processors as before]
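A rough per-processor communication count (an illustrative estimate, not from the slides), assuming P is a perfect square and A is laid out on a \sqrt{P} \times \sqrt{P} grid of processors:

    \text{1-D row or column blocks:}\quad O(N)\ \text{words moved per processor}
    \text{2-D blocks on a } \sqrt{P}\times\sqrt{P}\ \text{grid:}\quad O\!\left(\frac{N}{\sqrt{P}}\right)\ \text{words moved per processor}

With 2-D blocks a processor only needs the portion of x owned by its block column and only contributes partial sums for the portion of y owned by its block row, so the communication per processor shrinks as P grows.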
9. Parallel Architecture Models
- Bus-based shared-memory or symmetric multiprocessor, SMP (e.g. suraj, dual/quad-processor Xeon machines)
- Network-based distributed-memory (e.g. Cray T3E, our Linux cluster)
- Network-based distributed-shared-memory (e.g. SGI Origin 2000)
- Network-based distributed shared-memory (e.g. SMP clusters)
10. Bus-Based Shared-Memory (SMP)
[Figure: processors connected through a bus to a single shared memory]
- Any processor can access any memory location at equal cost (symmetric multiprocessor)
- Tasks communicate by writing to and reading from commonly accessible locations
- Easier to program
- Cannot scale beyond about 30 processors (bus bottleneck)
- Examples: most workstation vendors make SMPs (Sun, IBM, Intel-based SMPs); Cray T90, SV1 (uses a crossbar)
11. Network-Connected Distributed-Memory
[Figure: processors, each with its own local memory (M), connected by an interconnection network]
- Each processor can access only its own memory
- Explicit communication by sending and receiving messages
- More tedious to program
- Can scale to thousands of processors
- Examples: Cray T3E, clusters
12. Network-Connected Distributed-Shared-Memory
[Figure: processors with physically distributed memories (M) joined by an interconnection network into one shared address space]
- Each processor can directly access any memory location
- Physically distributed memory
- Non-uniform memory access costs
- Example: SGI Origin 2000
13. Network-Connected Distributed Shared-Memory
[Figure: SMP nodes (processors sharing memory over a bus) connected to one another by an interconnection network]
- Network of SMPs
- Each SMP can access only its own memory
- Explicit communication between SMPs
- Can take advantage of both the shared-memory and distributed-memory programming models (a hybrid sketch follows this list)
- Can scale to hundreds of processors
- Examples: SMP clusters
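A hypothetical hybrid sketch (not from the slides): OpenMP threads exploit the shared memory inside each SMP node, while MPI handles the explicit communication between nodes. It would be built with an MPI Fortran wrapper and OpenMP enabled (e.g. mpif90 -fopenmp).

      ! Illustrative sketch only: MPI across SMP nodes, OpenMP threads within a node
      program hybrid_sketch
        use mpi
        implicit none
        integer, parameter :: n = 100000
        integer :: ierr, rank, nprocs, i
        real :: a(n), local_sum, global_sum

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        a = real(rank + 1)
        local_sum = 0.0

        ! Shared-memory part: threads of this MPI process sum the local array together
      !$omp parallel do reduction(+:local_sum)
        do i = 1, n
           local_sum = local_sum + a(i)
        end do
      !$omp end parallel do

        ! Distributed-memory part: explicit message passing combines the node results
        call MPI_Reduce(local_sum, global_sum, 1, MPI_REAL, MPI_SUM, 0, &
                        MPI_COMM_WORLD, ierr)
        if (rank == 0) print *, 'global sum =', global_sum

        call MPI_Finalize(ierr)
      end program hybrid_sketch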
14. Parallel Programming Models
- Global-address (or shared-address) space model
  - POSIX threads (Pthreads)
  - OpenMP
- Message-passing (or distributed-address) model
  - MPI (Message Passing Interface)
  - PVM (Parallel Virtual Machine)
- Higher-level programming environments
  - High Performance Fortran (HPF)
  - PETSc (Portable, Extensible Toolkit for Scientific Computation)
  - POOMA (Parallel Object-Oriented Methods and Applications)
15. Other Parallel Programming Models
- Task and channel
  - Similar to message passing
  - Tasks communicate through named channels instead of directly with named tasks (as in the message-passing model)
- SPMD (single program, multiple data)
  - Each processor executes the same program code, operating on different data (a sketch follows this list)
  - Most message-passing programs are SPMD
- Data parallel
  - Operations on chunks of data (e.g. arrays) are parallelized
- Grid
  - The problem domain is viewed as parcels, with the processing for each parcel allocated to different processors
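A minimal SPMD sketch with MPI (an illustration, not from the slides): every process runs exactly this program, and the rank returned by MPI determines which portion of the data it operates on.

      ! Illustrative sketch only: single program, multiple data
      program spmd_sketch
        use mpi
        implicit none
        integer, parameter :: n = 16      ! total elements (assume divisible by nprocs)
        integer :: ierr, rank, nprocs, nloc, first, i
        real, allocatable :: xloc(:)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        nloc  = n / nprocs                ! elements owned by this process
        first = rank*nloc + 1             ! global index of the first local element
        allocate(xloc(nloc))

        ! Same code everywhere, different data on each process
        do i = 1, nloc
           xloc(i) = real(first + i - 1)**2
        end do
        print *, 'rank', rank, 'handles elements', first, 'to', first + nloc - 1

        call MPI_Finalize(ierr)
      end program spmd_sketch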
16. Example
    real a(n,n), b(n,n)
    do k = 1, NumIter
      ! replace each interior point of a by the average of its four neighbours in b
      do i = 2, n-1
        do j = 2, n-1
          a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
        end do
      end do
      ! copy the updated values back into b for the next iteration
      do i = 2, n-1
        do j = 2, n-1
          b(i,j) = a(i,j)
        end do
      end do
    end do
17. Shared-Address Space Model: OpenMP
    real a(n,n), b(n,n)
!$omp parallel shared(a,b) private(i,j,k)
    do k = 1, NumIter
!$omp do
      do i = 2, n-1
        do j = 2, n-1
          a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
        end do
      end do
!$omp do
      do i = 2, n-1
        do j = 2, n-1
          b(i,j) = a(i,j)
        end do
      end do
    end do
!$omp end parallel
18. Message Passing Pseudo-code
    real aLoc(NdivP,n), bLoc(0:NdivP+1,n)
    me = get_my_procnum()
    do k = 1, NumIter
      ! exchange boundary (halo) rows with the neighbouring processors
      if (me .ne. P-1) send(me+1, bLoc(NdivP, 1:n))
      if (me .ne. 0)   recv(me-1, bLoc(0, 1:n))
      if (me .ne. 0)   send(me-1, bLoc(1, 1:n))
      if (me .ne. P-1) recv(me+1, bLoc(NdivP+1, 1:n))
      ! skip the global boundary rows on the first and last processors
      if (me .eq. 0)   then ibeg = 2       else ibeg = 1     endif
      if (me .eq. P-1) then iend = NdivP-1 else iend = NdivP endif
      do i = ibeg, iend
        do j = 2, n-1
          aLoc(i,j) = (bLoc(i-1,j) + bLoc(i,j-1) + bLoc(i+1,j) + bLoc(i,j+1))/4
        end do
      end do
      do i = ibeg, iend
        do j = 2, n-1
          bLoc(i,j) = aLoc(i,j)
        end do