Title: CSE 8383 - Advanced Computer Architecture
1. CSE 8383 - Advanced Computer Architecture
- Week-5
- Week of Feb 9, 2004
- engr.smu.edu/rewini/8383
2. Contents
- Project/Schedule
- Introduction to Multiprocessors
- Parallelism
- Performance
- PRAM Model
3. Warm Up
- Parallel Numerical Integration
- Parallel Matrix Multiplication
- In class: Discuss with your neighbor!
- Videotape: Think about it!
- What kind of architecture do we need?
4. Explicit vs. Implicit Parallelism
(Figure: a sequential program passes through a parallelizer to become a parallel program; the parallel program runs through a programming environment on a parallel architecture)
5. Motivation
- One-processor systems are not capable of delivering solutions to some problems in reasonable time
- Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution
- Speed-up versus Quality-up
6. Multiprocessing
- One-processor: physical limitations
- Multiprocessor: N processors cooperate to solve a single computational task
  - Speed-up
  - Quality-up
  - Sharing
7. Flynn's Classification (revisited)
- SISD (single instruction stream over a single data stream)
- SIMD (single instruction stream over multiple data streams)
- MIMD (multiple instruction streams over multiple data streams)
- MISD (multiple instruction streams over a single data stream)
8. SISD (single instruction stream over a single data stream)
- SISD uniprocessor architecture
- Captions: CU = control unit; PU = processing unit; MU = memory unit; IS = instruction stream; DS = data stream; PE = processing element; LM = local memory
9. SIMD (single instruction stream over multiple data streams)
(Figure: SIMD architecture)
10. MIMD (multiple instruction streams over multiple data streams)
(Figure: MIMD architecture with shared memory — control units CU1..CUn each issue an instruction stream to processing units PU1..PUn, which exchange data streams with the shared memory and I/O)
11. MISD (multiple instruction streams over a single data stream)
(Figure: MISD architecture (the systolic array) — control units CU1..CUn send separate instruction streams to PU1..PUn, while a single data stream from a memory holding program and data flows through the processing units to I/O)
12. System Components
- Three major components:
- Processors
- Memory Modules
- Interconnection Network
13. Memory Access
- Shared Memory
- Distributed Memory
(Figure: shared memory — several processors P connected to one memory M; distributed memory — each processor P paired with its own memory M)
14. Interconnection Network Taxonomy
- Interconnection Network
  - Static
    - 1-D
    - 2-D
    - HC (hypercube)
  - Dynamic
    - Bus-based
      - Single
      - Multiple
    - Switch-based
      - SS (single-stage)
      - MS (multistage)
      - Crossbar
15. MIMD Shared Memory Systems
(Figure: processors and memory modules connected by interconnection networks)
16. Shared Memory
- Single address space
- Communication via read/write
- Synchronization via locks
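A minimal sketch of these two ideas in C with POSIX threads; the names (counter, NTHREADS) are illustrative, not from the slides:

```c
/* Shared-memory communication (a variable visible to all threads)
 * and lock-based synchronization. Compile with: gcc demo.c -pthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                 /* shared: single address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* synchronization via locks */
        counter++;                       /* communication via read/write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000 with correct locking */
    return 0;
}
```

Without the lock, the increments race and the final count is unpredictable — a preview of the coherence and synchronization issues later in this deck.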
17. Bus-based vs. Switch-based Shared Memory Systems
(Figure: processors connected to a global memory, either over a shared bus or through a switch)
18. Cache-Coherent NUMA
19. MIMD Distributed Memory Systems
(Figure: processors P, each with local memory, connected by interconnection networks)
20. Distributed Memory
- Multiple address spaces
- Communication via send/receive
- Synchronization via messages
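A minimal sketch of message passing between two processes using MPI; a blocking receive both communicates the data and synchronizes the processes. The tag value and variable names are illustrative:

```c
/* Run with: mpicc demo.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                              /* data to communicate */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocks until the message arrives: synchronization via messages. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```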
21. SIMD Computers
(Figure: a von Neumann computer driving processing elements connected by some interconnection network)
22. SIMD (Data Parallel)
- Parallel operations within a computation are partitioned spatially rather than temporally
- Scalar instructions vs. array instructions
- Processors are incapable of operating autonomously; they must be driven by the control unit
23. Past Trends in Parallel Architecture (inside the box)
- Completely custom designed components (processors, memory, interconnects, I/O)
- Longer R&D time (2-3 years)
- Expensive systems
- Quickly becoming outdated
- Bankrupt companies!!
24. New Trends in Parallel Architecture (outside the box)
- Advances in commodity processors and network technology
- A network of PCs and workstations connected via LAN or WAN forms a parallel system
- Network computing
- Compete favorably (cost/performance)
- Utilize unused cycles of systems sitting idle
25. Clusters
(Figure: cluster layers — a programming environment on top of middleware, on top of an interconnection network joining the nodes)
26. Grids
- Grids are geographically distributed platforms for computation.
- They provide dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.
27. Problem
- Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must be the diameter of a round chip so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were 10^12 times per second?
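A worked sketch, assuming the signal must cross the full diameter within one switching period: the travel budget per cycle is d = c/f with c = 3 × 10^8 m/s. For f = 10^9 switches per second, d = (3 × 10^8)/(10^9) = 0.3 m = 30 cm; for f = 10^12, d = 0.3 mm. Hence the pressure toward ever-smaller dies as clock rates rise.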
28. Grosch's Law (1960s)
- To sell a computer for twice as much, it must be four times as fast
- Vendors skip small speed improvements in favor of waiting for large ones
- Buyers of expensive machines would wait for a twofold improvement in performance for the same price
29. Moore's Law
- Gordon Moore (cofounder of Intel)
- Processor performance would double every 18 months
- This prediction has held for several decades
- It is unlikely that single-processor performance will continue to increase indefinitely
30. Von Neumann's Bottleneck
- John von Neumann: great mathematician of the 1940s and 1950s
- A single control unit connects a memory to a processing unit
- Instructions and data are fetched one at a time from memory and fed to the processing unit
- Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit
31. Parallelism
- Multiple CPUs
- Within the CPU
- One Pipeline
- Multiple pipelines
32. Speedup
- S = Speed(new) / Speed(old)
- S = [Work/time(new)] / [Work/time(old)]
- S = time(old) / time(new)
- S = time(before improvement) / time(after improvement)
33. Speedup
- Time on one CPU: T(1)
- Time on n CPUs: T(n)
- Speedup: S = T(1) / T(n)
34. Amdahl's Law
- The performance improvement to be gained from
using some faster mode of execution is limited by
the fraction of the time the faster mode can be
used
35. Example
(Figure: a trip from A to B — 200 miles of road plus 20 hours that must be walked)
- Walk at 4 miles/hour: 50 + 20 = 70 hours, S = 1
- Bike at 10 miles/hour: 20 + 20 = 40 hours, S = 1.8
- Car-1 at 50 miles/hour: 4 + 20 = 24 hours, S = 2.9
- Car-2 at 120 miles/hour: 1.67 + 20 = 21.67 hours, S = 3.2
- Car-3 at 600 miles/hour: 0.33 + 20 = 20.33 hours, S = 3.4
- The 20 hours of walking is the serial part: no matter how fast the vehicle, speedup can never exceed 70/20 = 3.5
36. Amdahl's Law (1967)
- α: the fraction of the program that is naturally serial
- (1 − α): the fraction of the program that is naturally parallel
37. S = T(1)/T(N)

T(N) = T(1)·α + T(1)·(1 − α)/N

S = 1 / (α + (1 − α)/N) = N / (αN + (1 − α))
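A minimal sketch in C tabulating this formula; the serial fraction α = 0.05 is an illustrative value, not one from the slides:

```c
/* Amdahl's-law speedup S = N / (alpha*N + (1 - alpha)) for several N. */
#include <stdio.h>

static double amdahl(double alpha, double n)
{
    return n / (alpha * n + (1.0 - alpha));
}

int main(void)
{
    const double alpha = 0.05;                 /* assumed serial fraction */
    const int counts[] = { 1, 2, 4, 8, 16, 64, 1024 };

    for (int i = 0; i < 7; i++)
        printf("N = %4d  S = %6.2f\n", counts[i], amdahl(alpha, counts[i]));

    /* As N grows, S approaches 1/alpha = 20: the serial part caps speedup. */
    return 0;
}
```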
38. Amdahl's Law
39. Gustafson-Barsis Law
- N and α are not independent of each other
- α: the fraction of the program that is naturally serial
- Fix the parallel run time at T(N) = 1; then T(1) = α + (1 − α)N, so S = α + (1 − α)N = N − (N − 1)α
40. Gustafson-Barsis Law
41. Comparison of Amdahl's Law vs. Gustafson-Barsis Law
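A worked comparison with assumed numbers: take a serial fraction α = 0.05 and N = 100 processors. Amdahl's law (fixed problem size) gives S = 100 / (0.05·100 + 0.95) ≈ 16.8, while Gustafson-Barsis (problem scaled with N) gives S = 100 − 99·0.05 = 95.05. The same α yields very different projections because the two laws hold different quantities constant.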
42. Example

for I = 1 to 10 do
begin
    S[I] = 0.0
    for J = 1 to 10 do
        S[I] = S[I] + M[I, J]
    S[I] = S[I] / 10
end
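The outer-loop iterations are independent (each computes the average of one matrix row), so they can run in parallel. A minimal sketch in C with OpenMP; the array names mirror the pseudocode, and the sample data is illustrative:

```c
/* Row averages of a 10x10 matrix; each outer iteration touches only
 * its own row, so the loop parallelizes directly.
 * Compile with: gcc -fopenmp avg.c */
#include <stdio.h>

#define N 10

int main(void)
{
    double M[N][N], S[N];

    /* Fill the matrix with sample data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            M[i][j] = i + j;

    #pragma omp parallel for            /* each row handled independently */
    for (int i = 0; i < N; i++) {
        S[i] = 0.0;
        for (int j = 0; j < N; j++)
            S[i] += M[i][j];
        S[i] /= N;
    }

    for (int i = 0; i < N; i++)
        printf("S[%d] = %.2f\n", i, S[i]);
    return 0;
}
```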
44. Distributed Computing Performance
- Single Program Performance
- Multiple Program Performance
45. PRAM Model
46. What is a Model?
- According to Webster's Dictionary, a model is "a description or analogy used to help visualize something that cannot be directly observed."
- According to The Oxford English Dictionary, a model is "a simplified or idealized description or conception of a particular system, situation or process."
47. Why Models?
- In general, the purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction.
- Maggs, Matheson and Tarjan (1995)
48. Models in Problem Solving
- Computer scientists use models to help design problem solving tools such as:
- Fast Algorithms
- Effective Programming Environments
- Powerful Execution Engines
49. An Interface
- A model is an interface separating high-level properties from low-level ones
(Figure: the MODEL sits between Applications above, to which it provides operations, and Architectures below, from which it requires an implementation)
50. PRAM Model
- Synchronized read-compute-write cycle
- EREW
- ERCW
- CREW
- CRCW
- Complexity: T(n), P(n), C(n)
(Figure: processors P1..Pp, each with a private memory, operating under a common control and connected to a global memory)
51. The PRAM Model and its Variations (cont.)
- There are different modes for read and write operations in a PRAM:
  - Exclusive read (ER)
  - Exclusive write (EW)
  - Concurrent read (CR)
  - Concurrent write (CW)
    - Common
    - Arbitrary
    - Minimum
    - Priority
- Based on the different modes described above, the PRAM can be further divided into the following four subclasses:
  - EREW-PRAM model
  - CREW-PRAM model
  - ERCW-PRAM model
  - CRCW-PRAM model
52. Analysis of Algorithms
- Sequential Algorithms
- Time Complexity
- Space Complexity
- An algorithm whose time complexity is bounded by
a polynomial is called a polynomial-time
algorithm. An algorithm is considered to be
efficient if it runs in polynomial time.
53. Analysis of Sequential Algorithms
(Figure: the relationships among P, NP, NP-complete, and NP-hard — P inside NP, with NP-complete at the intersection of NP and NP-hard)
54. Analysis of Parallel Algorithms
- The performance of a parallel algorithm is expressed in terms of how fast it is and how much resources it uses when it runs:
- Run time, which is defined as the time during the execution of the algorithm
- Number of processors the algorithm uses to solve a problem
- The cost of the parallel algorithm, which is the product of the run time and the number of processors
55. Analysis of Parallel Algorithms: The NC-class and P-completeness
(Figure: the relationships among P, NP, NP-complete, NP-hard, NC, and P-complete (if P ≠ NP and NC ≠ P) — NC inside P, with P-complete in P but outside NC)
56. Simulating Multiple Accesses on an EREW PRAM
- Broadcasting mechanism:
- P1 reads x and makes it known to P2.
- P1 and P2 make x known to P3 and P4, respectively, in parallel.
- P1, P2, P3 and P4 make x known to P5, P6, P7 and P8, respectively, in parallel.
- These eight processors will make x known to another eight processors, and so on.
57. Simulating Multiple Accesses on an EREW PRAM (cont.)
(Figure: simulating concurrent read on an EREW PRAM with eight processors using Algorithm Broadcast_EREW — panels (b), (c), (d) show x propagating through the shared array L to P2, then P3 and P4, and onward)
58. Simulating Multiple Accesses on an EREW PRAM (cont.)
- Algorithm Broadcast_EREW
- Processor P1:
  - y (in P1's private memory) ← x
  - L[1] ← y
- for i = 0 to (log p) − 1 do
  - forall Pj, where 2^i + 1 ≤ j ≤ 2^(i+1), do in parallel
    - y (in Pj's private memory) ← L[j − 2^i]
    - L[j] ← y
  - endfor
- endfor
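A minimal sequential sketch in C of the same doubling pattern, stepping through the rounds that the PRAM would execute in parallel; p = 8 and the variable names are illustrative:

```c
/* Simulates Algorithm Broadcast_EREW round by round: after round i,
 * processors 1..2^(i+1) hold x in L[]. A PRAM does each round's copies
 * in parallel; this sketch does them in a loop. */
#include <stdio.h>

#define P 8                     /* number of processors (a power of two) */

int main(void)
{
    int x = 42;                 /* the value to broadcast (illustrative) */
    int L[P + 1] = { 0 };       /* shared array, 1-indexed like the slides */

    L[1] = x;                   /* P1 reads x and publishes it in L[1] */

    for (int round = 1; round < P; round *= 2)       /* log p rounds */
        for (int j = round + 1; j <= 2 * round; j++) /* parallel on a PRAM */
            L[j] = L[j - round];                     /* Pj copies from Pj-2^i */

    for (int j = 1; j <= P; j++)
        printf("P%d holds %d\n", j, L[j]);           /* all hold 42 */
    return 0;
}
```

Each round doubles the number of processors that hold x, so p processors are reached in O(log p) EREW steps — no concurrent read ever occurs.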
59. Bus-based Shared Memory
- Collection of wires and connectors
- Only one transaction at a time
- Bottleneck!! How can we solve the problem?
60. Single Processor Caching
- Hit: the data is in the cache; hit rate h
- Miss: the data is not in the cache; miss rate m = (1 − h)
(Figure: processor P with a cache holding x, backed by a memory holding x)
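A standard consequence, not on the slide: under the simple model where a hit costs the cache access time t_c and a miss costs a full memory access t_m, the average access time is t_avg = h·t_c + (1 − h)·t_m. With illustrative numbers h = 0.9, t_c = 1 ns, t_m = 50 ns, this gives t_avg = 0.9·1 + 0.1·50 = 5.9 ns. (Variants of the model also charge the cache probe on a miss.)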
61. Writing in the Cache
(Figure: three panels — before the write; write-through, where both the cached copy and the memory copy of x are updated; and write-back, where only the cached copy of x is updated and memory is updated later)
62. Using Caches
(Figure: processors P1..Pn, each with its own cache C1..Cn, sharing a bus)
- How many processors?
- Cache coherence problem
63. Group Activity
- Variables:
- Number of processors (n)
- Hit rate (h)
- Bus bandwidth (B)
- Processor speed (v)
- Condition: n(1 − h)v < B
- Maximum number of processors: n = B / ((1 − h)v)
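A worked sketch with assumed numbers (not from the slides): if each processor issues v = 10^7 memory references per second, the hit rate is h = 0.95, and the bus carries B = 10^8 references per second, then each processor puts (1 − 0.95)·10^7 = 5 × 10^5 references per second on the bus, so the bus saturates at n = 10^8 / (5 × 10^5) = 200 processors.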
64. Cache Coherence
(Figure: processors P1..Pn, each holding a copy of x in its cache, plus the copy in shared memory)
- Multiple copies of x
- What if P1 updates x?
65. Cache Coherence Policies
- Writing to cache in the 1-processor case:
  - Write-Through
  - Write-Back
- Writing to cache in the n-processor case:
  - Write-Update with Write-Through
  - Write-Update with Write-Back
  - Write-Invalidate with Write-Through
  - Write-Invalidate with Write-Back
66. Write-Invalidate
(Figure: three panels over caches P1..P3 — before the write; write-through, where P1 updates x in its cache and in memory while the copies at P2 and P3 are marked invalid (I); and write-back, where P1 updates only its cached x and the other copies are invalidated)
67. Write-Update
(Figure: three panels over caches P1..P3 — before the write; write-through, where the new value of x is propagated to all caches and to memory; and write-back, where the new value is propagated to all caches and memory is updated later)
68. Synchronization
69. Superscalar Parallelism
- Scheduling