Title: CSE 8383 - Advanced Computer Architecture
1. CSE 8383 - Advanced Computer Architecture
- Week-5
- Week of Feb 9, 2004
- engr.smu.edu/rewini/8383
2. Contents
- Project/Schedule
- Introduction to Multiprocessors
- Parallelism
- Performance
- PRAM Model
3. Warm Up
- Parallel Numerical Integration
- Parallel Matrix Multiplication
- In class: Discuss with your neighbor!
- Videotape: Think about it!
- What kind of architecture do we need?
4. Explicit vs. Implicit Parallelism
(Figure: a sequential program passes through a parallelizer to become a parallel program; the parallel program runs through a programming environment on a parallel architecture)
5. Motivation
- One-processor systems are not capable of delivering solutions to some problems in reasonable time
- Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution
- Speed-up versus Quality-up
6. Multiprocessing
- One-processor: physical limitations
- Multiprocessor: N processors cooperate to solve a single computational task
  - Speed-up
  - Quality-up
  - Sharing
7. Flynn's Classification (revisited)
- SISD (single instruction stream over a single data stream)
- SIMD (single instruction stream over multiple data streams)
- MIMD (multiple instruction streams over multiple data streams)
- MISD (multiple instruction streams over a single data stream)
8. SISD (single instruction stream over a single data stream)
- SISD uniprocessor architecture
- Captions: CU = control unit; PU = processing unit; MU = memory unit; IS = instruction stream; DS = data stream; PE = processing element; LM = local memory
9. SIMD (single instruction stream over multiple data streams)
(Figure: SIMD architecture)
10. MIMD (multiple instruction streams over multiple data streams)
(Figure: MIMD architecture with shared memory — control units CU1..CUn each issue an instruction stream to processing units PU1..PUn, which exchange data streams with the shared memory and I/O)
11. MISD (multiple instruction streams over a single data stream)
(Figure: MISD architecture (the systolic array) — control units CU1..CUn send separate instruction streams to PU1..PUn, while a single data stream from a memory holding program and data flows through the processing units to I/O)
12. System Components
- Three major components:
- Processors
- Memory Modules
- Interconnection Network
13. Memory Access
- Shared Memory
- Distributed Memory
(Figure: shared memory — several processors P connected to one memory M; distributed memory — each processor P paired with its own memory M)
14. Interconnection Network Taxonomy
- Interconnection Network
  - Static
    - 1-D
    - 2-D
    - HC (hypercube)
  - Dynamic
    - Bus-based
      - Single
      - Multiple
    - Switch-based
      - SS (single-stage)
      - MS (multistage)
      - Crossbar
15. MIMD Shared Memory Systems
(Figure: processors and memory modules connected by interconnection networks)
16. Shared Memory
- Single address space
- Communication via read/write
- Synchronization via locks
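A minimal sketch of these two ideas in C with POSIX threads; the names (counter, NTHREADS) are illustrative, not from the slides:

```c
/* Shared-memory communication (a variable visible to all threads)
 * and lock-based synchronization. Compile with: gcc demo.c -pthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                 /* shared: single address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* synchronization via locks */
        counter++;                       /* communication via read/write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000 with correct locking */
    return 0;
}
```

Without the lock, the increments race and the final count is unpredictable — a preview of the coherence and synchronization issues later in this deck.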
17. Bus-based vs. Switch-based Shared Memory Systems
(Figure: processors connected to a global memory, either over a shared bus or through a switch)
18. Cache-Coherent NUMA
19. MIMD Distributed Memory Systems
(Figure: processors P, each with local memory, connected by interconnection networks)
20. Distributed Memory
- Multiple address spaces
- Communication via send/receive
- Synchronization via messages
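A minimal sketch of message passing between two processes using MPI; a blocking receive both communicates the data and synchronizes the processes. The tag value and variable names are illustrative:

```c
/* Run with: mpicc demo.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                              /* data to communicate */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocks until the message arrives: synchronization via messages. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```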
21. SIMD Computers
(Figure: a von Neumann computer driving processing elements connected by some interconnection network)
22. SIMD (Data Parallel)
- Parallel operations within a computation are partitioned spatially rather than temporally
- Scalar instructions vs. array instructions
- Processors are incapable of operating autonomously; they must be driven by the control unit
23. Past Trends in Parallel Architecture (inside the box)
- Completely custom designed components (processors, memory, interconnects, I/O)
- Longer R&D time (2-3 years)
- Expensive systems
- Quickly becoming outdated
- Bankrupt companies!!
24. New Trends in Parallel Architecture (outside the box)
- Advances in commodity processors and network technology
- A network of PCs and workstations connected via LAN or WAN forms a parallel system
- Network computing
- Compete favorably (cost/performance)
- Utilize unused cycles of systems sitting idle
25. Clusters
(Figure: cluster layers — a programming environment on top of middleware, on top of an interconnection network joining the nodes)
26. Grids
- Grids are geographically distributed platforms for computation.
- They provide dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.
27. Problem
- Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must be the diameter of a round chip so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were 10^12 times per second?
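A worked sketch, assuming the signal must cross the full diameter within one switching period: the travel budget per cycle is d = c/f with c = 3 × 10^8 m/s. For f = 10^9 switches per second, d = (3 × 10^8)/(10^9) = 0.3 m = 30 cm; for f = 10^12, d = 0.3 mm. Hence the pressure toward ever-smaller dies as clock rates rise.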
28. Grosch's Law (1960s)
- To sell a computer for twice as much, it must be four times as fast
- Vendors skip small speed improvements in favor of waiting for large ones
- Buyers of expensive machines would wait for a twofold improvement in performance for the same price
29. Moore's Law
- Gordon Moore (cofounder of Intel)
- Processor performance would double every 18 months
- This prediction has held for several decades
- It is unlikely that single-processor performance will continue to increase indefinitely
30. Von Neumann's Bottleneck
- John von Neumann: great mathematician of the 1940s and 1950s
- A single control unit connects a memory to a processing unit
- Instructions and data are fetched one at a time from memory and fed to the processing unit
- Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit
31. Parallelism
- Multiple CPUs
- Within the CPU
- One Pipeline
- Multiple pipelines
32. Speedup
- S = Speed(new) / Speed(old)
- S = [Work/time(new)] / [Work/time(old)]
- S = time(old) / time(new)
- S = time(before improvement) / time(after improvement)
33. Speedup
- Time on one CPU: T(1)
- Time on n CPUs: T(n)
- Speedup: S = T(1) / T(n)
34. Amdahl's Law
- The performance improvement to be gained from
using some faster mode of execution is limited by
the fraction of the time the faster mode can be
used
35. Example
(Figure: a trip from A to B — 200 miles of road plus 20 hours that must be walked)
- Walk at 4 miles/hour: 50 + 20 = 70 hours, S = 1
- Bike at 10 miles/hour: 20 + 20 = 40 hours, S = 1.8
- Car-1 at 50 miles/hour: 4 + 20 = 24 hours, S = 2.9
- Car-2 at 120 miles/hour: 1.67 + 20 = 21.67 hours, S = 3.2
- Car-3 at 600 miles/hour: 0.33 + 20 = 20.33 hours, S = 3.4
- The 20 hours of walking is the serial part: no matter how fast the vehicle, speedup can never exceed 70/20 = 3.5
36. Amdahl's Law (1967)
- α: the fraction of the program that is naturally serial
- (1 − α): the fraction of the program that is naturally parallel
37. S = T(1)/T(N)

T(N) = T(1)·α + T(1)·(1 − α)/N

S = 1 / (α + (1 − α)/N) = N / (αN + (1 − α))
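A minimal sketch in C tabulating this formula; the serial fraction α = 0.05 is an illustrative value, not one from the slides:

```c
/* Amdahl's-law speedup S = N / (alpha*N + (1 - alpha)) for several N. */
#include <stdio.h>

static double amdahl(double alpha, double n)
{
    return n / (alpha * n + (1.0 - alpha));
}

int main(void)
{
    const double alpha = 0.05;                 /* assumed serial fraction */
    const int counts[] = { 1, 2, 4, 8, 16, 64, 1024 };

    for (int i = 0; i < 7; i++)
        printf("N = %4d  S = %6.2f\n", counts[i], amdahl(alpha, counts[i]));

    /* As N grows, S approaches 1/alpha = 20: the serial part caps speedup. */
    return 0;
}
```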
38. Amdahl's Law
39. Gustafson-Barsis Law
- N and α are not independent of each other
- α: the fraction of the program that is naturally serial
- Fix the parallel run time at T(N) = 1; then T(1) = α + (1 − α)N, so S = α + (1 − α)N = N − (N − 1)α
40. Gustafson-Barsis Law
41. Comparison of Amdahl's Law vs. Gustafson-Barsis Law
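A worked comparison with assumed numbers: take a serial fraction α = 0.05 and N = 100 processors. Amdahl's law (fixed problem size) gives S = 100 / (0.05·100 + 0.95) ≈ 16.8, while Gustafson-Barsis (problem scaled with N) gives S = 100 − 99·0.05 = 95.05. The same α yields very different projections because the two laws hold different quantities constant.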
42. Example

for I = 1 to 10 do
begin
    S[I] = 0.0
    for J = 1 to 10 do
        S[I] = S[I] + M[I, J]
    S[I] = S[I] / 10
end
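The outer-loop iterations are independent (each computes the average of one matrix row), so they can run in parallel. A minimal sketch in C with OpenMP; the array names mirror the pseudocode, and the sample data is illustrative:

```c
/* Row averages of a 10x10 matrix; each outer iteration touches only
 * its own row, so the loop parallelizes directly.
 * Compile with: gcc -fopenmp avg.c */
#include <stdio.h>

#define N 10

int main(void)
{
    double M[N][N], S[N];

    /* Fill the matrix with sample data. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            M[i][j] = i + j;

    #pragma omp parallel for            /* each row handled independently */
    for (int i = 0; i < N; i++) {
        S[i] = 0.0;
        for (int j = 0; j < N; j++)
            S[i] += M[i][j];
        S[i] /= N;
    }

    for (int i = 0; i < N; i++)
        printf("S[%d] = %.2f\n", i, S[i]);
    return 0;
}
```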
44. Distributed Computing Performance
- Single Program Performance
- Multiple Program Performance
45. PRAM Model
46. What is a Model?
- According to Webster's Dictionary, a model is "a description or analogy used to help visualize something that cannot be directly observed."
- According to The Oxford English Dictionary, a model is "a simplified or idealized description or conception of a particular system, situation or process."
47. Why Models?
- In general, the purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction.
- Maggs, Matheson and Tarjan (1995)
48. Models in Problem Solving
- Computer scientists use models to help design problem solving tools such as:
- Fast Algorithms
- Effective Programming Environments
- Powerful Execution Engines
49. An Interface
- A model is an interface separating high-level properties from low-level ones
(Figure: the MODEL sits between Applications above, to which it provides operations, and Architectures below, from which it requires an implementation)
50. PRAM Model
- Synchronized read-compute-write cycle
- EREW
- ERCW
- CREW
- CRCW
- Complexity: T(n), P(n), C(n)
(Figure: processors P1..Pp, each with a private memory, operating under a common control and connected to a global memory)
51. The PRAM Model and its Variations (cont.)
- There are different modes for read and write operations in a PRAM:
  - Exclusive read (ER)
  - Exclusive write (EW)
  - Concurrent read (CR)
  - Concurrent write (CW)
    - Common
    - Arbitrary
    - Minimum
    - Priority
- Based on the different modes described above, the PRAM can be further divided into the following four subclasses:
  - EREW-PRAM model
  - CREW-PRAM model
  - ERCW-PRAM model
  - CRCW-PRAM model
52. Analysis of Algorithms
- Sequential Algorithms
- Time Complexity
- Space Complexity
- An algorithm whose time complexity is bounded by
a polynomial is called a polynomial-time
algorithm. An algorithm is considered to be
efficient if it runs in polynomial time.
53. Analysis of Sequential Algorithms
(Figure: the relationships among P, NP, NP-complete, and NP-hard — P inside NP, with NP-complete at the intersection of NP and NP-hard)
54. Analysis of Parallel Algorithms
- The performance of a parallel algorithm is expressed in terms of how fast it is and how much resources it uses when it runs:
- Run time, which is defined as the time during the execution of the algorithm
- Number of processors the algorithm uses to solve a problem
- The cost of the parallel algorithm, which is the product of the run time and the number of processors
55. Analysis of Parallel Algorithms: The NC-class and P-completeness
(Figure: the relationships among P, NP, NP-complete, NP-hard, NC, and P-complete (if P ≠ NP and NC ≠ P) — NC inside P, with P-complete in P but outside NC)
56. Simulating Multiple Accesses on an EREW PRAM
- Broadcasting mechanism:
- P1 reads x and makes it known to P2.
- P1 and P2 make x known to P3 and P4, respectively, in parallel.
- P1, P2, P3 and P4 make x known to P5, P6, P7 and P8, respectively, in parallel.
- These eight processors will make x known to another eight processors, and so on.
57. Simulating Multiple Accesses on an EREW PRAM (cont.)
(Figure: simulating concurrent read on an EREW PRAM with eight processors using Algorithm Broadcast_EREW — panels (b), (c), (d) show x propagating through the shared array L to P2, then P3 and P4, and onward)
58. Simulating Multiple Accesses on an EREW PRAM (cont.)
- Algorithm Broadcast_EREW
- Processor P1:
  - y (in P1's private memory) ← x
  - L[1] ← y
- for i = 0 to (log p) − 1 do
  - forall Pj, where 2^i + 1 ≤ j ≤ 2^(i+1), do in parallel
    - y (in Pj's private memory) ← L[j − 2^i]
    - L[j] ← y
  - endfor
- endfor
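A minimal sequential sketch in C of the same doubling pattern, stepping through the rounds that the PRAM would execute in parallel; p = 8 and the variable names are illustrative:

```c
/* Simulates Algorithm Broadcast_EREW round by round: after round i,
 * processors 1..2^(i+1) hold x in L[]. A PRAM does each round's copies
 * in parallel; this sketch does them in a loop. */
#include <stdio.h>

#define P 8                     /* number of processors (a power of two) */

int main(void)
{
    int x = 42;                 /* the value to broadcast (illustrative) */
    int L[P + 1] = { 0 };       /* shared array, 1-indexed like the slides */

    L[1] = x;                   /* P1 reads x and publishes it in L[1] */

    for (int round = 1; round < P; round *= 2)       /* log p rounds */
        for (int j = round + 1; j <= 2 * round; j++) /* parallel on a PRAM */
            L[j] = L[j - round];                     /* Pj copies from Pj-2^i */

    for (int j = 1; j <= P; j++)
        printf("P%d holds %d\n", j, L[j]);           /* all hold 42 */
    return 0;
}
```

Each round doubles the number of processors that hold x, so p processors are reached in O(log p) EREW steps — no concurrent read ever occurs.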
59. Bus-based Shared Memory
- Collection of wires and connectors
- Only one transaction at a time
- Bottleneck!! How can we solve the problem?
60. Single Processor Caching
- Hit: the data is in the cache; hit rate h
- Miss: the data is not in the cache; miss rate m = (1 − h)
(Figure: processor P with a cache holding x, backed by a memory holding x)
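A standard consequence, not on the slide: under the simple model where a hit costs the cache access time t_c and a miss costs a full memory access t_m, the average access time is t_avg = h·t_c + (1 − h)·t_m. With illustrative numbers h = 0.9, t_c = 1 ns, t_m = 50 ns, this gives t_avg = 0.9·1 + 0.1·50 = 5.9 ns. (Variants of the model also charge the cache probe on a miss.)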
61. Writing in the Cache
(Figure: three panels — before the write; write-through, where both the cached copy and the memory copy of x are updated; and write-back, where only the cached copy of x is updated and memory is updated later)
62. Using Caches
(Figure: processors P1..Pn, each with its own cache C1..Cn, sharing a bus)
- How many processors?
- Cache coherence problem
63. Group Activity
- Variables:
- Number of processors (n)
- Hit rate (h)
- Bus bandwidth (B)
- Processor speed (v)
- Condition: n(1 − h)v < B
- Maximum number of processors: n = B / ((1 − h)v)
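A worked sketch with assumed numbers (not from the slides): if each processor issues v = 10^7 memory references per second, the hit rate is h = 0.95, and the bus carries B = 10^8 references per second, then each processor puts (1 − 0.95)·10^7 = 5 × 10^5 references per second on the bus, so the bus saturates at n = 10^8 / (5 × 10^5) = 200 processors.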
64. Cache Coherence
(Figure: processors P1..Pn, each holding a copy of x in its cache, plus the copy in shared memory)
- Multiple copies of x
- What if P1 updates x?
65. Cache Coherence Policies
- Writing to cache in the 1-processor case:
  - Write-Through
  - Write-Back
- Writing to cache in the n-processor case:
  - Write-Update with Write-Through
  - Write-Update with Write-Back
  - Write-Invalidate with Write-Through
  - Write-Invalidate with Write-Back
66. Write-Invalidate
(Figure: three panels over caches P1..P3 — before the write; write-through, where P1 updates x in its cache and in memory while the copies at P2 and P3 are marked invalid (I); and write-back, where P1 updates only its cached x and the other copies are invalidated)
67. Write-Update
(Figure: three panels over caches P1..P3 — before the write; write-through, where the new value of x is propagated to all caches and to memory; and write-back, where the new value is propagated to all caches and memory is updated later)
68. Synchronization
69. Superscalar Parallelism
- Scheduling