CSE 8383 - Advanced Computer Architecture - PowerPoint PPT Presentation

Description:

Title: Introduction To Parallel Processors Author: rewini Last modified by: rewini Created Date: 3/5/2001 10:21:45 PM Document presentation format – PowerPoint PPT presentation


Transcript and Presenter's Notes


1
CSE 8383 - Advanced Computer Architecture
  • Week-5
  • Week of Feb 9, 2004
  • engr.smu.edu/rewini/8383

2
Contents
  • Project/Schedule
  • Introduction to Multiprocessors
  • Parallelism
  • Performance
  • PRAM Model

3
Warm Up
  • Parallel Numerical Integration
  • Parallel Matrix Multiplication
  • In class: Discuss with your neighbor!
  • Videotape: Think about it!
  • What kind of architecture do we need?

4
Explicit vs. Implicit Parallelism
Parallel program
Sequential program
Parallelizer
Programming Environment
Parallel Architecture
5
Motivation
  • One-processor systems are not capable of
    delivering solutions to some problems in
    reasonable time
  • Multiple processors cooperate to jointly execute
    a single computational task in order to speed up
    its execution
  • Speed-up versus Quality-up

6
Multiprocessing
One-processor
Physical limitations
Multiprocessor
N processors cooperate to solve a single
computational task
Speed-up
Quality-up
Sharing
7
Flynn's Classification - revisited
  • SISD (single instruction stream over a single
    data stream)
  • SIMD (single instruction stream over multiple
    data streams)
  • MIMD (multiple instruction streams over multiple
    data streams)
  • MISD (multiple instruction streams and a single
    data stream)

8
SISD (single instruction stream over a single
data stream)
  • SISD uniprocessor architecture

Captions: CU = control unit, PU = processing unit, MU = memory unit,
IS = instruction stream, DS = data stream, PE = processing element,
LM = local memory
9
SIMD (single instruction stream over multiple
data streams)
SIMD Architecture
10
MIMD (multiple instruction streams over multiple
data streams)
[Figure: MIMD Architecture (with shared memory). Labels: control units (CU), processing units PU1 ... PUn, instruction streams (IS), data streams (DS), I/O, Shared Memory]
11
MISD (multiple instruction streams and a single
data stream)
[Figure: MISD architecture (the systolic array). Labels: control units CU1, CU2, ..., CUn, processing units PU1, PU2, ..., PUn, instruction streams (IS), data streams (DS), Memory (Program and data), I/O]
12
System Components
  • Three major Components
  • Processors
  • Memory Modules
  • Interconnection Network

13
Memory Access
  • Shared Memory
  • Distributed Memory

[Figure: processors (P) and memory modules (M) in the shared-memory and distributed-memory arrangements]
14
Interconnection Network Taxonomy
  • Interconnection Network: Dynamic or Static
  • Dynamic, Bus-based: Single, Multiple
  • Dynamic, Switch-based: SS, MS, Crossbar
  • Static: 1-D, 2-D, HC
15
MIMD Shared Memory Systems
Interconnection Networks
16
Shared Memory
  • Single address space
  • Communication via read and write
  • Synchronization via locks
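
The three properties above can be sketched in Python, using threads as stand-ins for processors. This is an illustration only: the slides name no language or API, and `run` and `worker` are hypothetical names. All threads see one address space, they communicate by reading and writing the same counter, and a lock serializes the updates.

```python
import threading

def run(num_threads=4, increments=1000):
    """Threads share one variable and guard each update with a lock."""
    state = {"counter": 0}      # lives in the single shared address space
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:                  # synchronization via a lock
                state["counter"] += 1   # communication via read and write

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]

print(run())
```

Without the lock, two threads could read the same old value and one increment would be lost, which is why the slide pairs shared memory with lock-based synchronization.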

17
Bus-based and switch-based SM Systems
Global Memory
18
Cache Coherent NUMA
19
MIMD Distributed Memory Systems
[Figure: four processors (P) connected by Interconnection Networks]
20
Distributed Memory
  • Multiple address spaces
  • Communication via send and receive
  • Synchronization via messages

21
SIMD Computers
von Neumann Computer
Some Interconnection Network
22
SIMD (Data Parallel)
  • Parallel Operations within a computation are
    partitioned spatially rather than temporally
  • Scalar instructions vs. Array instructions
  • Processors are incapable of operating
    autonomously; they must be driven by the control
    unit

23
Past Trends in Parallel Architecture (inside the
box)
  • Completely custom designed components
    (processors, memory, interconnects, I/O)
  • Longer R&D time (2-3 years)
  • Expensive systems
  • Quickly becoming outdated
  • Bankrupt companies!!

24
New Trends in Parallel Architecture (outside the
box)
  • Advances in commodity processors and network
    technology
  • Network of PCs and workstations connected via LAN
    or WAN forms a Parallel System
  • Network Computing
  • Compete favorably (cost/performance)
  • Utilize unused cycles of systems sitting idle

25
Clusters
Programming Environment
Middleware
Interconnection Network
26
Grids
  • Grids are geographically distributed platforms
    for computation.
  • They provide dependable, consistent, pervasive,
    and inexpensive access to high end computational
    capabilities.

27
Problem
  • Assume that a switching component such as a
    transistor can switch in zero time. We propose to
    construct a disk-shaped computer chip with such a
    component. The only limitation is the time it
    takes to send electronic signals from one edge of
    the chip to the other. Make the simplifying
    assumption that electronic signals travel 300,000
    kilometers per second. What must be the diameter
    of a round chip so that it can switch 10^9 times
    per second? What would the diameter be if the
    switching requirements were 10^12 times per second?
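
One way to work this problem, as a sketch: assume a signal must cross the full diameter once per switching cycle, so the largest allowed diameter is the distance covered in one cycle, d = v / f. That reading is an assumption, and the names `SIGNAL_SPEED_M_PER_S` and `max_diameter_m` are illustrative.

```python
SIGNAL_SPEED_M_PER_S = 300_000 * 1000   # 300,000 km/s expressed in m/s

def max_diameter_m(switches_per_second):
    """Distance a signal covers in one switching period, in meters."""
    return SIGNAL_SPEED_M_PER_S / switches_per_second

for rate in (1e9, 1e12):
    d_mm = max_diameter_m(rate) * 1000
    print(f"{rate:.0e} switches/s: diameter at most {d_mm:g} mm")
```

Under this assumption the diameter shrinks from about 0.3 m to about 0.3 mm as the switching rate rises from 10^9 to 10^12 per second.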

28
Grosch's Law (1960s)
  • To sell a computer for twice as much, it must be
    four times as fast
  • Vendors skip small speed improvements in favor of
    waiting for large ones
  • Buyers of expensive machines would wait for a
    twofold improvement in performance for the same
    price.

29
Moore's Law
  • Gordon Moore (cofounder of Intel)
  • Processor performance would double every 18
    months
  • This prediction has held for several decades
  • Unlikely that single-processor performance
    continues to increase indefinitely

30
Von Neumann's bottleneck
  • Great mathematician of the 1940s and 1950s
  • Single control unit connecting a memory to a
    processing unit
  • Instructions and data are fetched one at a time
    from memory and fed to processing unit
  • Speed is limited by the rate at which
    instructions and data are transferred from memory
    to the processing unit.

31
Parallelism
  • Multiple CPUs
  • Within the CPU
  • One Pipeline
  • Multiple pipelines

32
Speedup
  • S = Speed(new) / Speed(old)
  • S = [Work/time(new)] / [Work/time(old)]
  • S = time(old) / time(new)
  • S = time(before improvement) /
    time(after improvement)

33
Speedup
  • Time (one CPU): T(1)
  • Time (n CPUs): T(n)
  • Speedup: S
  • S = T(1)/T(n)

34
Amdahl's Law
  • The performance improvement to be gained from
    using some faster mode of execution is limited by
    the fraction of the time the faster mode can be
    used

35
Example
[Figure: trip from A to B: a 200-mile stretch, plus a stretch that must be walked and takes 20 hours]
  • Walk, 4 miles/hour: 50 + 20 = 70 hours, S = 1
  • Bike, 10 miles/hour: 20 + 20 = 40 hours, S = 1.8
  • Car-1, 50 miles/hour: 4 + 20 = 24 hours, S = 2.9
  • Car-2, 120 miles/hour: 1.67 + 20 = 21.67 hours, S = 3.2
  • Car-3, 600 miles/hour: 0.33 + 20 = 20.33 hours, S = 3.4
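
The arithmetic in this example can be reproduced with a short Python sketch. The constant and function names are illustrative, not from the slides; the figures (200 miles, 20 hours on foot, and the five speeds) are the ones given above.

```python
WALK_ONLY_HOURS = 20    # stretch that must be walked regardless of vehicle
DISTANCE_MILES = 200    # stretch where a faster mode can be used

def total_hours(speed_mph):
    """Trip time when the 200-mile stretch is covered at speed_mph."""
    return DISTANCE_MILES / speed_mph + WALK_ONLY_HOURS

baseline = total_hours(4)   # walking the whole way
modes = {"Walk": 4, "Bike": 10, "Car-1": 50, "Car-2": 120, "Car-3": 600}
for name, mph in modes.items():
    t = total_hours(mph)
    print(f"{name}: {t:.2f} hours, S = {baseline / t:.1f}")
```

However fast the vehicle, the trip never drops below 20 hours, so the speedup stays below 70 / 20 = 3.5.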
36
Amdahl's Law (1967)
  • α: the fraction of the program that is naturally
    serial
  • (1 - α): the fraction of the program that is
    naturally parallel

37
\[ S = \frac{T(1)}{T(N)} \]
\[ T(N) = T(1)\,\alpha + \frac{T(1)\,(1-\alpha)}{N} \]
\[ S = \frac{1}{\alpha + \frac{1-\alpha}{N}} = \frac{N}{\alpha N + (1-\alpha)} \]
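
A minimal Python sketch of the speedup formula derived on this slide, where `alpha` is the naturally serial fraction and `n` the number of processors. The function name and the sample value 0.1 are illustrative.

```python
def amdahl_speedup(alpha, n):
    """Speedup on n processors when a fraction alpha of the work is serial."""
    return n / (alpha * n + (1 - alpha))

for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.1, n), 2))
```

As n grows the speedup approaches 1 / alpha, so a program that is 10% serial cannot exceed a speedup of 10 no matter how many processors are added.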
38
Amdahl's Law
39
Gustafson-Barsis Law
N and α are not independent of each other
α: the fraction of the program that is naturally
serial
\[ T(N) = 1, \qquad T(1) = \alpha + (1-\alpha)\,N, \qquad S = N - (N-1)\,\alpha \]
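
The scaled speedup on this slide can be checked numerically. In this sketch the parallel run is normalized to T(N) = 1 and the serial fraction is measured on that run, as the slide states; the function names are illustrative.

```python
def gustafson_speedup(alpha, n):
    """Scaled speedup S = N - (N - 1) * alpha."""
    return n - (n - 1) * alpha

def single_processor_time(alpha, n):
    """T(1) when the parallel run is normalized to T(N) = 1."""
    return alpha + (1 - alpha) * n

for n in (1, 10, 100, 1000):
    # Both expressions describe the same quantity, since T(N) = 1.
    print(n, gustafson_speedup(0.1, n), single_processor_time(0.1, n))
```

Unlike the fixed-size case, this speedup grows linearly with N for a fixed serial fraction, because the parallel part of the workload is assumed to scale with the machine.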
40
Gustafson-Barsis Law
41
Comparison of Amdahl's Law vs Gustafson-Barsis
Law
42
Example
For I = 1 to 10 do
begin
  S[I] = 0.0
  for J = 1 to 10 do
    S[I] = S[I] + M[I, J]
  S[I] = S[I] / 10
end
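
A Python sketch of the same loop, which computes the average of each row of a 10 x 10 matrix. Each outer iteration touches only its own row and its own result, so the ten iterations are independent. Handing them to a thread pool is one illustrative way to express that, not something the slide prescribes, and the matrix contents are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def row_average(row):
    """Body of one outer iteration: sum one row and divide by its length."""
    total = 0.0
    for value in row:
        total += value
    return total / len(row)

# Made-up 10 x 10 input, for illustration only.
m = [[float(i * 10 + j) for j in range(10)] for i in range(10)]

sequential = [row_average(row) for row in m]
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(row_average, m))   # one task per row

print(sequential == parallel)
```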
44
Distributed Computing Performance
  • Single Program Performance
  • Multiple Program Performance

45
PRAM Model
46
What is a Model?
  • According to Webster's Dictionary, a model is a
    description or analogy used to help visualize
    something that cannot be directly observed.
  • According to The Oxford English Dictionary, a
    model is a simplified or idealized description
    or conception of a particular system, situation
    or process.

47
Why Models?
  • In general, the purpose of Modeling is to capture
    the salient characteristics of phenomena with
    clarity and the right degree of accuracy to
    facilitate analysis and prediction.
  • Maggs, Matheson and Tarjan (1995)

48
Models in Problem Solving
  • Computer Scientists use models to help design
    problem solving tools such as
  • Fast Algorithms
  • Effective Programming Environments
  • Powerful Execution Engines

49
An Interface
Applications
  • A model is an interface separating high level
    properties from low level ones

Provides operations
MODEL
Requires implementation
Architectures
50
PRAM Model
  • Synchronized Read Compute Write Cycle
  • EREW
  • ERCW
  • CREW
  • CRCW
  • Complexity
  • T(n), P(n), C(n)

[Figure: PRAM with a Control unit, processors P1, P2, ..., Pp, each with its own Private Memory, and a Global Memory]
51
The PRAM model and its variations (cont.)
  • There are different modes for read and write
    operations in a PRAM.
  • Exclusive read(ER)
  • Exclusive write(EW)
  • Concurrent read(CR)
  • Concurrent write(CW)
  • Common
  • Arbitrary
  • Minimum
  • Priority
  • Based on the different modes described above, the
    PRAM can be further divided into the following
    four subclasses.
  • EREW-PRAM model
  • CREW-PRAM model
  • ERCW-PRAM model
  • CRCW-PRAM model
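
The four concurrent-write modes can be illustrated with a small Python sketch that decides which value ends up in a cell when several processors write to it in the same step. The slide only names the modes, so the rules below are assumptions stated in the comments: "minimum" is read as "the smallest value wins" and "priority" as "the lowest processor index wins". The function name is hypothetical.

```python
def resolve_concurrent_write(writes, policy):
    """writes: list of (processor_id, value) aimed at one cell in one step."""
    if not writes:
        raise ValueError("no writes to resolve")
    values = [value for _, value in writes]
    if policy == "common":
        # All writers must agree on the value.
        if len(set(values)) != 1:
            raise ValueError("common CW requires identical values")
        return values[0]
    if policy == "arbitrary":
        # Any single writer may succeed; this sketch picks the first listed.
        return writes[0][1]
    if policy == "minimum":
        # ASSUMPTION: the smallest value written is the one stored.
        return min(values)
    if policy == "priority":
        # ASSUMPTION: the lowest processor id has the highest priority.
        return min(writes, key=lambda w: w[0])[1]
    raise ValueError(f"unknown policy: {policy}")

writes = [(3, 7), (1, 9), (2, 5)]
for policy in ("arbitrary", "minimum", "priority"):
    print(policy, resolve_concurrent_write(writes, policy))
```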

52
Analysis of Algorithms
  • Sequential Algorithms
  • Time Complexity
  • Space Complexity
  • An algorithm whose time complexity is bounded by
    a polynomial is called a polynomial-time
    algorithm. An algorithm is considered to be
    efficient if it runs in polynomial time.

53
Analysis of Sequential Algorithms
[Figure: The relationships among P, NP, NP-complete, NP-hard]
54
Analysis of parallel algorithms
  • Performance of a parallel algorithm is expressed
    in terms of how fast it is and how many resources
    it uses when it runs.
  • Run time, which is defined as the time elapsed during the
    execution of the algorithm
  • Number of processors the algorithm uses to solve
    a problem
  • The cost of the parallel algorithm, which is the
    product of the run time and the number of
    processors
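
The cost measure can be made concrete with a small Python sketch. The example compared here, summing n numbers sequentially versus with a tree reduction on n/2 processors in log2(n) steps, is a standard textbook illustration and not taken from the slides; `cost` is an illustrative name.

```python
import math

def cost(run_time, processors):
    """Cost = run time x number of processors."""
    return run_time * processors

n = 1024
sequential_cost = cost(n - 1, 1)                         # n - 1 additions, 1 processor
parallel_cost = cost(math.ceil(math.log2(n)), n // 2)    # log2(n) steps, n/2 processors

print(sequential_cost, parallel_cost)
```

The parallel version finishes in far fewer steps but has the higher cost, which is why run time and processor count are reported together.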

55
Analysis of parallel algorithms: The NC-class and
P-completeness
[Figure: The relationships among P, NP, NP-complete, NP-hard, NC, and P-complete (if P ≠ NP and NC ≠ P)]
56
Simulating multiple accesses on an EREW PRAM
  • Broadcasting mechanism
  • P1 reads x and makes it known to P2.
  • P1 and P2 make x known to P3 and P4,
    respectively, in parallel.
  • P1, P2, P3 and P4 make x known to P5, P6, P7 and
    P8, respectively, in parallel.
  • These eight processors will make x known to
    another eight processors, and so on.

57
Simulating multiple accesses on an EREW PRAM
(cont.)
[Figure: Simulating Concurrent read on EREW PRAM with eight processors using Algorithm Broadcast_EREW; the panels show x spreading through the shared array L]
58
Simulating multiple accesses on an EREW PRAM
(cont.)
  • Algorithm Broadcast_EREW
  • Processor P1
  •   y (in P1's private memory) ← x
  •   L[1] ← y
  • for i = 0 to log p - 1 do
  •   forall Pj, where 2^i + 1 <= j <= 2^(i+1) do in
      parallel
  •     y (in Pj's private memory) ← L[j - 2^i]
  •     L[j] ← y
  •   endfor
  • endfor
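
A sequential Python simulation of Algorithm Broadcast_EREW, offered as a sketch. It assumes p is a power of two, models the shared array L as a 1-indexed list, and models each processor's private y as one slot of a second list. Within a step, all reads are collected before any write, to mimic the synchronized read-compute-write cycle. The function name and the internal check are illustrative additions.

```python
def broadcast_erew(x, p):
    """Simulate Broadcast_EREW for p processors (p assumed a power of two)."""
    L = [None] * (p + 1)    # shared array, 1-indexed; L[0] is unused
    y = [None] * (p + 1)    # y[j] models the private memory of Pj
    y[1] = x                # P1 reads x
    L[1] = y[1]
    steps = p.bit_length() - 1          # log2(p) for a power of two
    for i in range(steps):
        # Processors Pj with 2^i + 1 <= j <= 2^(i+1) act in this step.
        reads = [(j, j - 2 ** i) for j in range(2 ** i + 1, 2 ** (i + 1) + 1)]
        # Exclusive read: no two processors read the same cell.
        assert len({src for _, src in reads}) == len(reads)
        fetched = {j: L[src] for j, src in reads}   # read phase
        for j, value in fetched.items():            # write phase, distinct cells
            y[j] = value
            L[j] = value
    return L[1:], steps

cells, steps = broadcast_erew(42, 8)
print(cells, steps)
```

Each step doubles the number of cells holding x, so p processors are reached in log2(p) steps without any cell being read or written by two processors at once.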

59
Bus-based Shared Memory
  • Collection of wires and connectors
  • Only one transaction at a time
  • Bottleneck!! How can we solve the problem?

60
Single Processor caching
[Figure: processor P, its Cache, and Memory, with x held in both the cache and memory]
Hit: data is in the cache
Miss: data is not in the cache
Hit rate: h
Miss rate: m = (1 - h)
61
Writing in the cache
[Figure: copies of x in Memory and in the Cache of processor P, shown in three panels: Before, Write through, Write back]
62
Using Caches
[Figure: processors P1, P2, P3, ..., Pn, each with its own cache C1, C2, C3, ..., Cn]
- How many processors?
- Cache Coherence problem
63
Group Activity
  • Variables
  • Number of processors (n)
  • Hit rate (h)
  • Bus Bandwidth (B)
  • Processor speed (v)
  • Condition: n(1 - h)v <= B
  • Maximum number of processors: n = B / ((1 - h)v)
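
The bound from this activity can be evaluated with a short Python sketch. It assumes B and v are expressed in the same units (for example, memory requests per second) and that every cache miss becomes one bus transaction; the function name and the sample numbers are made up for illustration.

```python
import math

def max_processors(hit_rate, bus_bandwidth, processor_speed):
    """Largest n satisfying n * (1 - h) * v <= B."""
    miss_rate = 1 - hit_rate
    if miss_rate <= 0:
        raise ValueError("with a hit rate of 1 the bus is never used")
    return math.floor(bus_bandwidth / (miss_rate * processor_speed))

# Hypothetical values: 75% hit rate, bus handles 1e7 requests/s,
# each processor issues 1e6 requests/s.
print(max_processors(0.75, 1e7, 1e6))
```

Raising the hit rate shrinks the per-processor bus traffic (1 - h)v, which is what lets more processors share the same bus.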

64
Cache Coherence
[Figure: processors P1, P2, P3, ..., Pn, each holding a copy of x]
  • Multiple copies of x
  • What if P1 updates x?

65
Cache Coherence Policies
  • Writing to Cache in 1 processor case
  • Write Through
  • Write Back
  • Writing to Cache in n processor case
  • Write Update - Write Through
  • Write Update - Write Back
  • Write Invalidate - Write Through
  • Write Invalidate - Write Back
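
A toy Python model of these policies, as a sketch under simplifying assumptions: every cache already holds a copy of x, one processor writes a new value, and only the immediate effect of that single write is shown. The coherence policy decides what happens to the other cached copies, and the memory policy decides whether memory is updated right away. The helper `write_x` and the marker `INVALID` are hypothetical names.

```python
INVALID = "I"   # marker for an invalidated cache copy

def write_x(caches, memory, writer, value, coherence, memory_policy):
    """Return (caches, memory) after `writer` stores `value` into x."""
    if coherence not in ("invalidate", "update"):
        raise ValueError("coherence must be 'invalidate' or 'update'")
    if memory_policy not in ("through", "back"):
        raise ValueError("memory_policy must be 'through' or 'back'")
    new_caches = {}
    for proc in caches:
        if proc == writer or coherence == "update":
            new_caches[proc] = value      # writer's copy, or updated copy
        else:
            new_caches[proc] = INVALID    # other copies are invalidated
    # Write-through updates memory now; write-back leaves it stale for now.
    new_memory = value if memory_policy == "through" else memory
    return new_caches, new_memory

caches = {"P1": 5, "P2": 5, "P3": 5}
for coherence in ("invalidate", "update"):
    for memory_policy in ("through", "back"):
        print(coherence, memory_policy,
              write_x(caches, 5, "P1", 9, coherence, memory_policy))
```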

66
Write-invalidate
[Figure: copies of x for P1, P2, P3 in three panels: Before, Write Through, Write back; after the write, other copies are marked invalid (I)]
67
Write-Update
[Figure: copies of x for P1, P2, P3 in three panels: Before, Write Through, Write back]
68
Synchronization
69
Superscalar Parallelism
Scheduling