Title: Multiprocessing
1. Multiprocessing
- COMP381
- Tutorial 13
- 2nd-5th Dec, 08
2. Outline
- Why do we need multiprocessing?
- Performance Potential Using Multiprocessing
- Flynn's Taxonomy of Computing
- SIMD
- MIMD
- Shared Memory Model
- Cache Coherence Problem Solution
3. The Bottleneck of a Single Processor
Even though various ways of increasing single-processor performance have been introduced, the performance of a single processor still has its bottleneck. The example above demonstrated the limitation of ILP.
4. We Need High-Performance Computing
- Less time for a time-consuming operation
- A higher number of operations in a given period of time
5. Multiprocessing (Parallel Processing)
- It is hard, or costs too much, to further improve the performance of a single processor
- Concurrent execution of tasks (programs) using multiple computing, memory, and interconnection resources
- Provides an alternative to a faster clock for performance
- Uses multiple processors to solve a single problem
6. Outline
- Why do we need multiprocessing?
- Performance Potential Using Multiprocessing
- Flynn's Taxonomy of Computing
- SIMD
- MIMD
- Shared Memory Model
- Cache Coherence Problem Solution
- Clusters
7. Performance Potential Using Multiprocessing
- Amdahl's Law
- Using multiple processors to solve the same problem as the single processor
- Pessimistic: limited speedup factor
- Gustafson's View
- Using multiple processors to solve larger or more complex problems
- The parallel portion increases as the problem size increases, so the speedup factor can be increased by deploying more processors
8. Amdahl's Law
- A parallel program has a sequential part and a parallel part; their proportions are σ and (1 - σ).
- Assume the total execution time for a single processor is
  T1 = σT1 + (1 - σ)T1
- The total execution time for p processors would be
  Tp = σT1 + (1 - σ)T1 / p
- Speedup(p) = 1 / (σ + (1 - σ)/p) = p / (σp + 1 - σ) → 1 / σ as p → ∞
9. Amdahl's Law
- Amdahl's Law is pessimistic (in this case)
- Let s be the serial part
- Let p be the part that can be parallelized n ways
- Serial: SSPPPPPP
- 6 processors:
  SSP
    P
    P
    P
    P
    P
- Speedup = 8/3 ≈ 2.67
- T(n) = s + p/n
- As n → ∞, T(n) → s
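The Amdahl speedup formula can be checked with a short script. This is a sketch, not from the slides; the function name `amdahl_speedup` is chosen here for illustration.

```python
def amdahl_speedup(sigma, p):
    """Amdahl's Law: Speedup(p) = 1 / (sigma + (1 - sigma) / p),
    where sigma is the serial fraction and p the processor count."""
    return 1.0 / (sigma + (1.0 - sigma) / p)

# The SSPPPPPP example: 2 of 8 time units are serial, so sigma = 2/8 = 0.25.
print(amdahl_speedup(0.25, 6))     # 8/3 ≈ 2.67, matching the slide

# As p grows, speedup approaches the 1/sigma ceiling (here 4) but never passes it.
print(amdahl_speedup(0.25, 1000))
```

Note how even 1000 processors cannot beat the 1/σ limit, which is exactly the pessimism the slides describe.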
10. Amdahl's Law
[Figure: speedup versus percentage of serial code (0-99%) for 4, 16, and 1000 CPUs. Limited speedup factor!!!]
11. Gustafson's View
- The parallel portion increases as the problem size increases
- Let s be the serial part
- Let p be the part that can be parallelized n ways
- Old Serial: SSPPPPPP
- 6 processors:
  SSPPPPPP
    PPPPPP
    PPPPPP
    PPPPPP
    PPPPPP
    PPPPPP
- Hypothetical Serial:
  SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP
- Speedup(6) = (8 + 5×6)/8 = 38/8 = 4.75
- Speedup(N) = N(1 - σ) + σ, so Speedup(N) → ∞ as N → ∞!!!
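Gustafson's scaled speedup can be sketched the same way (again an illustrative helper name, not from the slides):

```python
def gustafson_speedup(sigma, n):
    """Gustafson's Law: Speedup(N) = sigma + N * (1 - sigma),
    where sigma is the serial fraction of the scaled workload."""
    return sigma + n * (1.0 - sigma)

# The example above: sigma = 2/8 = 0.25, 6 processors.
print(gustafson_speedup(0.25, 6))     # (8 + 5*6)/8 = 4.75

# Unlike Amdahl's Law, speedup grows without bound as N increases,
# because the problem (the parallel part) grows with the machine.
print(gustafson_speedup(0.25, 1000))
```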
13. Outline
- Why do we need multiprocessing?
- Performance Potential Using Multiprocessing
- Flynn's Taxonomy of Computing
- SIMD
- MIMD
- Shared Memory Model
- Cache Coherence Problem Solutions
- Clusters
14. Flynn's Taxonomy of Computing
- SISD (Single Instruction, Single Data)
- Typical uniprocessor systems that we've studied throughout this course
- Uniprocessor systems can time-share and still be SISD
- SIMD (Single Instruction, Multiple Data)
- Multiple processors simultaneously executing the same instruction on different data
- Specialized applications (e.g., image processing)
- MIMD (Multiple Instruction, Multiple Data)
- Multiple processors autonomously executing different instructions on different data
- Keep in mind that the processors are working together to solve a single problem
15. Outline
- Why do we need multiprocessing?
- Performance Potential Using Multiprocessing
- Flynn's Taxonomy of Computing
- SIMD
- MIMD
- Shared Memory Model
- Cache Coherence Problem Solutions
- Clusters
16. Typical SIMD Systems
One control unit; all processors run in lockstep, so every P executes the same instruction or does nothing.
Possible applications: databases, image processing, signal processing, etc.
17. SIMD Operations on a Single Processor
- Sequential pixel operations take a very long time to perform.
- A 512x512 image would require 262,144 iterations through a sequential loop, with each iteration executing 10 instructions. That translates to 2,621,440 clock cycles (if each instruction takes a single cycle), plus loop overhead.
Each pixel is operated on sequentially, one after another.
18. SIMD Operations with Multiprocessing
- On a SIMD system with 512x512 processors (which is not unreasonable on SIMD machines), the same operation would take 10 cycles.
Each processor operates on a single pixel of the 512x512 image in parallel.
Speedup due to parallelism = 2,621,440 / 10 = 262,144 = 512x512 (the number of processors)!
Notice: no loop overhead!
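The sequential-versus-lockstep contrast above can be illustrated in software with NumPy, whose whole-array operations are a data-parallel analogue of SIMD (a sketch only; the brightness-by-50 operation is a made-up example, not from the slides):

```python
import numpy as np

# A 512x512 grayscale image, as in the example above (all zeros for the demo).
image = np.zeros((512, 512), dtype=np.uint8)

# Sequential style: one pixel per loop iteration (262,144 iterations).
bright_seq = image.astype(np.int32)
for i in range(512):
    for j in range(512):
        bright_seq[i, j] = min(bright_seq[i, j] + 50, 255)

# Data-parallel style: one operation applied to every pixel at once,
# the way a 512x512-processor SIMD machine would run it in lockstep.
bright_par = np.minimum(image.astype(np.int32) + 50, 255)

# Both styles compute the same result; only the execution model differs.
assert (bright_seq == bright_par).all()
```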
19. Outline
- Why do we need multiprocessing?
- Performance Potential Using Multiprocessing
- Flynn's Taxonomy of Computing
- SIMD
- MIMD
- Shared Memory Model
- Cache Coherence Problem Solutions
- Clusters
20. MIMD Architecture
Multiple processors autonomously execute different instructions on different data. This is a more general form of multiprocessing and can be used in numerous applications.
21. Outline
- Why do we need multiprocessing?
- Performance Potential Using Multiprocessing
- Flynn's Taxonomy of Computing
- SIMD
- MIMD
- Shared Memory Model
- Cache Coherence Problem Solutions
22. Programming with Shared Memory
- We use the shared memory model for multiprocessing: all processors load from and store to a single shared address space.
23. Outline
- Why do we need multiprocessing?
- Performance Potential Using Multiprocessing
- Flynn's Taxonomy of Computing
- SIMD
- MIMD
- Shared Memory Model
- Cache Coherence Problem Solutions
24. Cache Coherence Problem
- Processor 0 writes X = 17 into its own cache, while memory and the other processors' caches still hold the old value X = 42.
- When processor 3 then reads X, it gets 42: processor 3 does not see the value written by processor 0.
25. Write-Through Does Not Help
- Processor 0 writes X = 17 through to memory, so both its cache and memory now hold 17, but processor 3's cache still holds the stale copy X = 42.
- Processor 3 sees 42 in its cache and does not get the correct value (17) from memory.
26. Outline
- Why do we need multiprocessing?
- Performance Potential Using Multiprocessing
- Flynn's Taxonomy of Computing
- SIMD
- MIMD
- Shared Memory Model
- Cache Coherence Problem Solutions
27. Solutions to the Cache Coherence Problem
- Shared Cache
- Cache placement is identical to a single cache; there is only one copy of any cached block, which results in limited bandwidth
- Snoopy Cache-Coherence Protocols for Distributed Caches
28. Snooping Cache Coherency
Bus snoop; cache-memory transaction.
- The cache controller snoops all transactions on the shared bus and takes action to ensure coherence (invalidate, update, or supply the value)
- A transaction is relevant if it involves a cache block currently contained in this cache
29. Hardware Cache Coherence Demonstration
- Write-invalidate: when one cache writes X, an invalidate transaction is broadcast on the bus; the writer keeps its updated copy (X → X') while every other cached copy is discarded (X → Inv). Memory is updated according to the write policy.
- Write-update (also called distributed write): when one cache writes X, the new value is broadcast on the bus and every other cached copy is updated in place (X → X'), so all caches stay current.
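The write-invalidate behavior can be modeled with a toy simulation: single-block "caches" snoop a shared bus of peers, and a write invalidates every other copy so the next read misses and fetches the fresh value. This is a software sketch of the idea, not real hardware or a full protocol (real snoopy protocols such as MESI track per-block states).

```python
class Cache:
    """A one-block toy cache that snoops a shared 'bus' (a list of peer caches)."""
    def __init__(self, memory):
        self.memory = memory
        self.block = None            # cached value of X, or None if invalid

    def read(self, bus):
        if self.block is None:       # miss: fetch the current value from memory
            self.block = self.memory["X"]
        return self.block

    def write(self, value, bus):
        # Write-invalidate: broadcast an invalidate so every other copy of X
        # is discarded before this cache installs its new value.
        for peer in bus:
            if peer is not self:
                peer.block = None    # X -> Inv in the peer caches
        self.block = value           # X -> X' in the writing cache
        self.memory["X"] = value     # write through to memory for simplicity

memory = {"X": 42}
bus = [Cache(memory) for _ in range(4)]
p0, p3 = bus[0], bus[3]

p3.read(bus)            # P3 caches X = 42
p0.write(17, bus)       # P0 writes 17; P3's stale copy is invalidated
print(p3.read(bus))     # prints 17: the forced miss refetches the fresh value
```

Without the invalidate broadcast, `p3.read` would have returned the stale 42, which is exactly the coherence problem shown on slide 24.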
30. All the best for your upcoming final exam!!