Title: Lecture 17: I/O wrapup, parallel processing
1. Lecture 17: I/O wrapup, parallel processing
- Prof. Kenneth M. Mackenzie
- Computer Systems and Networks
- CS2200, Spring 2003
Includes slides from Bill Leahy
2. Review 1/3
- Filesystem representation
- files
- directories
- API for I/O
- how to wait ... dealing with blocks
- unix approach
- alternatives
- device drivers
- Disk head scheduling (finish this today)
3. Review 2/3: Unix Inode
4. Review 3/3: File Interface
- open() -- open a file by name, return handle
- close() -- done with file
- read() -- read block, advance internal index
- write() -- write block, advance internal index
- fseek() -- set internal index
- mmap() -- map a file into virtual addresses
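- A minimal usage sketch (my addition, not from the slides), using the POSIX calls open/read/lseek/close; lseek() plays the role of the fseek() row above, and the filename and 512-byte block size are arbitrary:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[512];
        int fd = open("data.bin", O_RDONLY);      /* open by name, get a handle */
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = read(fd, buf, sizeof(buf));   /* read a block; the index advances */
        printf("read %zd bytes\n", n);

        lseek(fd, 0, SEEK_SET);                   /* set the internal index back to 0 */
        n = read(fd, buf, sizeof(buf));           /* re-read the same block */
        printf("read %zd bytes again\n", n);

        close(fd);                                /* done with the file */
        return 0;
    }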
5. Misc. Issues
- read/write/seek
- 10% free space reserve in BSD-FFS
- and descendants, including Linux ext2fs
- called the root reserve; it helps avoid denial-of-service
- however, its main purpose is to give the allocator slack, similar to the over-provisioning factor in a hash table
- defrag utilities also exist for unix, just not ordinarily used
- Overlap useful work with the seek to the next block:

    while (1) {
        read();
        seek();    /* to next block */
        ....       /* do work while moving head */
    }
6. Today
- Disk head scheduling
- Parallel systems
7. Disk Head Scheduling
8. Review...
- A radial position on one surface is called a track; the tracks at the same radial position on all surfaces form a cylinder
- Tracks are divided into sectors
- Must move head to the correct position (seek)
- Must wait for the platter to rotate under the head
- Rotational latency
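- For a rough sense of scale (numbers assumed, not from the slides): at 7200 RPM one revolution takes 60/7200 s, about 8.3 ms, so the average rotational latency (half a revolution) is about 4.2 ms, on top of whatever seek time the scheduling policies below try to minimize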
9. Disk Scheduling
- Algorithm Objective
- Minimize seek time
- Assumptions
- Single disk module
- Requests for equal sized blocks
- Random distribution of locations on disk
- One movable arm
- Seek time proportional to tracks crossed
- No delay due to controller
- Read/write times are equal
10. Disk Scheduling
- Measures of response
- Mean wait time
- Variance of wait time
- Throughput
- Measures of load
- Number of requests per second
- Average queue length
11. Policies
- FCFS
- Fair
- OK for light loads, but saturates quickly
- Low variance
- Wide swings in seek distance lead to a high mean wait time
12. FCFS
13. Policies
- SSTF (Shortest seek time first)
- Analogous to SJF for CPU scheduling
- Good throughput
- Saturates slowest
- High variance
- Possibility of starvation
14. SSTF
15. Policies
- SCAN (elevator algorithm)
- Start at one end, move toward the other end
- Service requests on the way
- Reverse and continue scanning
- Lower variance than SSTF, similar mean waiting time
- Requests just in front of versus just behind the direction of head motion?
16. SCAN
17. Policies
- C-SCAN
- Treat disk as if it were circular
- Scan in one direction, retract, and resume scan
- Leads to more uniform waiting time than SCAN
18. C-SCAN
19. Policies
- LOOK
- Same as SCAN, but reverse direction if there are no more requests in the scan direction
- Leads to better performance than SCAN
- C-LOOK
- Circular LOOK
20. C-LOOK
21. Policies
- N-Step SCAN
- Two queues
- active (N requests or fewer)
- latent
- Service the active queue
- When the active queue is empty, transfer N requests from latent to active
- Leads to lower variance compared to SCAN
- Worse than SCAN for mean waiting time
22. Algorithm Selection
- SSTF (Shortest Seek Time First) is common; better than FCFS
- If the load is heavy, SCAN and C-SCAN are best because they are less likely to have starvation problems
- We could calculate an optimum for any series of requests, but that is costly
- Depends on the number and type of requests
- e.g., imagine we have only one request pending
- Also depends on file layout
- Recommendation: a modular scheduling algorithm that can be changed
23. Typical Question
- Suppose a disk drive has 5000 cylinders, numbered 0 to 4999. The head is currently at cylinder 143 and the previous request was at cylinder 125. The queue (in FIFO order) is
- 86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130
- Starting from the current position, what is the total distance (in cylinders) the disk head moves to satisfy all requests?
- Using FCFS, SSTF, SCAN, LOOK, C-SCAN
24. FCFS
- 143 to 86: 57
- 86 to 1470: 1384
- 1470 to 913: 557
- 913 to 1774: 861
- 1774 to 948: 826
- 948 to 1509: 561
- 1509 to 1022: 487
- 1022 to 1750: 728
- 1750 to 130: 1620
- Total: 7081 cylinders <-- Answer
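- To double-check totals like this, here is a small C sketch (my addition, not from the slides) that sums the FCFS head movement for the queue above; it prints 7081. Visiting the queue in a different order gives the totals for the other policies.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* FIFO request queue from the slide; the head starts at cylinder 143 */
        int queue[] = { 86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130 };
        int n = sizeof(queue) / sizeof(queue[0]);
        int pos = 143;
        long total = 0;

        for (int i = 0; i < n; i++) {
            total += abs(queue[i] - pos);   /* cylinders crossed for this request */
            pos = queue[i];
        }
        printf("FCFS total head movement: %ld cylinders\n", total);   /* 7081 */
        return 0;
    }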
25. Parallel Processors
26. Today: Parallelism vs. Parallelism
- Uni
- Pipelined
- Superscalar
- VLIW/EPIC
- (ILP: instruction-level parallelism, exploited within the single-processor designs above)
- SMP (Symmetric)
- Distributed
- (TLP: thread-level parallelism, exploited across processors in SMP and distributed systems)
27. Parallel Computers
- Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." (Almasi and Gottlieb, Highly Parallel Computing, 1989)
- Questions about parallel computers
- How large a collection?
- How powerful are processing elements?
- How do they cooperate and communicate?
- How are data transmitted?
- What type of interconnection?
- What are HW and SW primitives for programmer?
- Does it translate into performance?
28. The Plan
- Applications (problem space)
- Key hardware issues
- shared memory: how to keep caches coherent
- message passing: low-cost communication
- Commercial examples
- SGI (Cray) O2K: 512 processors, cache-coherent (CC) shared memory
- Cray T3E: 2048 processors, message-passing (MP) machine
29. Current Practice
- Some success with MPPs (Massively Parallel Processors)
- dense-matrix scientific computing (petroleum, automotive, aeronautics, pharmaceuticals)
- file servers, databases, web search engines
- entertainment/graphics
- Small-scale machines: Dell Workstation 530
- 1.7 GHz Intel Pentium 4 (in minitower)
- 512 MB RDRAM memory, 40 GB disk, 20X CD, 19" monitor, Quadro2 Pro graphics card, Red Hat Linux, 3 yrs service: $2,760; for a 2nd processor, add $515
30. Parallel Architecture
- Parallel architecture extends traditional computer architecture with a communication architecture
- Programming model (SW view)
- Abstractions (HW/SW interface)
- Implementation to realize the abstraction efficiently
- Historically, implementations have been tied to programming models, but that is changing.
31. Parallel Applications
- Throughput-oriented (want many answers)
- multiprogramming
- databases, web servers
- Acme...
- Latency oriented (want one answer, fast)
- Grand Challenge problems
- global climate model
- human genome
- quantum chromodynamics
- combustion model
- cognition
32. Speedup: a metric for performance on latency-sensitive applications
- Speedup = Time(1) / Time(P) for P processors
- note: must use the best sequential algorithm for Time(1) -- the parallel algorithm may be different
[Plot: speedup vs. number of processors (1, 2, 4, ..., 64). Linear speedup is the ideal; typical curves roll off beyond some number of processors; occasionally you see superlinear speedup... why?]
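- A quick worked example (numbers assumed, not from the slide): if the best sequential program takes 60 s and the parallel version takes 10 s on 8 processors, the speedup is 60 / 10 = 6, short of the linear ideal of 8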
33. Hardware: Two Main Variations
- Shared-Memory
- may be physically shared or only logically shared
- communication is implicit in loads and stores
- Message-Passing
- must add explicit communication
34. Shared-Memory Hardware (1)
- The hardware and the programming model don't have to match, but this is the mental model for shared-memory programming
- Memory is centralized with uniform access time (UMA), a bus interconnect, and shared I/O
- Examples: Dell Workstation 530, Sun Enterprise, SGI Challenge
- typical latencies:
- 1 cycle to the local cache
- 20 cycles to a remote cache
- 100 cycles to memory
35. Shared-Memory Hardware (2)
- Variation: memory is not centralized; called non-uniform memory access (NUMA)
- Shared-memory accesses are converted into a messaging protocol (usually by hardware)
- Examples: DASH/Alewife/FLASH (academic), SGI Origin, Compaq GS320, Sequent (IBM) NUMA-Q
36. Message Passing Model
- Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations
- Essentially NUMA, but integrated at the I/O devices instead of at the memory system
- Send specifies a local buffer + the receiving process on the remote computer
- Receive specifies the sending process on the remote computer + the local buffer to place the data in
- Usually send includes a process tag and receive has a rule on the tag: match one, match any
- Synchronization: when the send completes, when the buffer is free, when the request is accepted; receive waits for the send
- Send + receive => memory-memory copy, where each side supplies its local address, AND pairwise synchronization!
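- As one concrete (assumed) illustration of these semantics, here is a minimal MPI sketch rather than the abstract send/receive above: rank 0 sends a tagged buffer to rank 1, which accepts any tag (the "match any" rule). Build with mpicc and run with two ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data[4] = { 1, 2, 3, 4 };
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* send: local buffer + receiving process (rank 1) + tag 99 */
            MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive: sending process (rank 0) + local buffer; MPI_ANY_TAG = "match any" */
            MPI_Status st;
            MPI_Recv(data, 4, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            printf("rank 1 received tag %d, first element %d\n", st.MPI_TAG, data[0]);
        }

        MPI_Finalize();
        return 0;
    }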
37. Multicomputer
[Figure: two nodes, each a processor + cache (A and B) with its own private memory, connected by an interconnect]
38. Multiprocessor: Symmetric Multiprocessor, or SMP
[Figure: caches A and B sharing a single memory]
39. Key Problem: Cache Coherence
[Figure: caches A and B share a memory holding X = 0. Both read X; A then writes X = 1 in its cache while B keeps reading its stale copy, X = 0. Oops!]
40. Simplest Coherence Strategy: Enforce Exactly One Copy
[Figure: the same access sequence under the exactly-one-copy rule: only one cache holds X at a time, so B's stale copy goes away when A takes the line to write X = 1; memory still shows X = 0]
41. Exactly One Copy
[State diagram: INVALID goes to VALID on a read or write (invalidating other copies); VALID stays VALID on more reads or writes; VALID returns to INVALID on replacement or invalidation]
- Maintain a lock per cache line
- Invalidate other caches on a read/write
- Easy on a bus: snoop the bus for transactions (a code sketch of the state machine follows below)
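- Below is a minimal C sketch (my own, not from the slides) of the two-state, exactly-one-copy protocol for a single cache line; bus_invalidate() is a stand-in for whatever the real snooping hardware would do.

    #include <stdbool.h>

    enum line_state { INVALID, VALID };

    struct cache_line {
        enum line_state state;
        int             data;    /* the cached copy of X */
    };

    /* Stand-in for the bus: tell every other cache to drop its copy. */
    static void bus_invalidate(void) { /* snoop hardware would act here */ }

    /* Local CPU read or write: acquire the only valid copy, then access it. */
    static void on_local_access(struct cache_line *line, bool is_write, int value)
    {
        if (line->state == INVALID) {
            bus_invalidate();        /* read or write: invalidate other copies first */
            line->state = VALID;
        }
        if (is_write)
            line->data = value;      /* more reads/writes stay VALID, no bus traffic */
    }

    /* Another cache's read/write was snooped, or the line is being replaced. */
    static void on_replacement_or_invalidation(struct cache_line *line)
    {
        line->state = INVALID;
    }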
42. Snoopy Cache
- CPU references check the cache tags (as usual)
- Cache misses are filled from memory (as usual)
- Other reads/writes seen on the bus must check the tags too, and possibly invalidate
[Figure: a CPU above its cache, each cache entry holding State, Tag, and Data, attached to the shared bus]
43. Exactly One Copy
- Works, but performance is crummy.
- Suppose we all just want to read the same memory location
- one lousy global variable, n, the size of the problem, written once at the start of the program and read thereafter
- Permit multiple readers (a readers/writer lock per cache line)
44. Programming Examples
45. Electric Field
- 2D plane with a known distribution of charges; we want to solve for the electric field everywhere.
- Electrostatic potential: Poisson's eqn
- Poisson: ∇²φ(x, y) = ρ(x, y)
- just solve this differential equation for φ...
- The field is E(x, y) = -∇φ(x, y)
46. Approach: Discretize
- continuous -> discrete
- discretize the plane into cells
- convert the differential equation into a difference equation (spelled out in the note below)
- write out the differences as a system Ax = b
- x is φ for every cell, b is ρ for every cell
- Solve the equations (linear algebra)
- oh, by the way: x has 1M elements and A is a 1M x 1M matrix...
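- Note (mine, using the standard five-point stencil on a grid with spacing h): the difference equation at cell (i, j) is (φ[i-1][j] + φ[i+1][j] + φ[i][j-1] + φ[i][j+1] - 4 φ[i][j]) / h² = ρ(i, j); one such equation per cell is exactly the system Ax = b above, and solving it for φ[i][j] gives the "average of the neighbors" update (plus a ρ term that vanishes where ρ = 0) that the Jacobi method on the next slides iterates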
47. Jacobi Method (iterative solver for, e.g., Poisson's eqn)
- The ultimate parallel method! (all communication is between neighbors, i.e., local)
- Natural partitioning by square tiles
- Next A[i][j] = average of its neighbors
48. Jacobi Relaxation: sequential
- New cell is the average of its four neighbors
- Use two arrays (oarray, narray) and alternate between them

    jacobi_one_step(oarray, narray, rows, cols)
    {
        for (row = 1; row < rows - 1; row++)
            for (col = 1; col < cols - 1; col++)
                narray[row][col] = ((oarray[row - 1][col] +
                                     oarray[row + 1][col] +
                                     oarray[row][col - 1] +
                                     oarray[row][col + 1]) / 4.0);
    }
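- A small driver sketch (mine, not on the slide) of the "alternate between them" step, assuming oarray/narray are rows x cols grids of doubles and MAX_ITERS is an arbitrary iteration bound:

    #define MAX_ITERS 1000   /* arbitrary; a real solver would test convergence */

    void jacobi_solve(double **oarray, double **narray, int rows, int cols)
    {
        for (int iter = 0; iter < MAX_ITERS; iter++) {
            jacobi_one_step(oarray, narray, rows, cols);
            double **tmp = oarray;   /* swap: this step's output is the next step's input */
            oarray = narray;
            narray = tmp;
        }
    }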
49. Small Multiprocessors: connect via a bus; memory is literally shared
[Figure: processor + cache A and processor + cache B on a bus to a single memory]
- Good for 2-4 way; the bus is eventually a bottleneck
50. Large Multiprocessors: scalable point-to-point interconnect
[Figure: each node has a processor + cache (A, B), a local memory (Memory A, Memory B), and a communication controller or assist attached to the network]
- The SM controller detects remote accesses and maintains cache coherence
51. Jacobi Relaxation: shared memory
[Figure: oarray and narray, each divided into six partitions numbered 1-6, one per processor]
- Divide up the arrays among processors (how?)
- Start a thread on each processor; each thread computes for its partition.
- Note: neighboring cells are transported on demand
- Synchronize the threads
52. Jacobi Relaxation: shared memory
- Communication is implicit (no code!)
- Need synchronization (here, a barrier)

    jacobi_one_step_sm(oarray, narray, rows, cols)
    {
        botrow = rows / NPROCS * my_pid();
        toprow = botrow + rows / NPROCS;
        for (row = botrow; row < toprow; row++)
            for (col = 1; col < cols - 1; col++)
                narray[row][col] = ((oarray[row - 1][col] +
                                     oarray[row + 1][col] +
                                     oarray[row][col - 1] +
                                     oarray[row][col + 1]) / 4.0);
        barrier();
    }
53. Jacobi Relaxation: message passing
- Divide up the arrays among processors (same problem)
- A processor can only see its own partition; data from other partitions must be transported explicitly via messages
- Note, though, that synchronization is inherent in the messages
[Figure: processor 3's partitions of oarray and narray, with boundary rows arriving as messages from P2 and from P4]
54. Jacobi Relaxation: message passing

    jacobi_one_step_mp(oarray, narray, srows, cols)
    {
        send(my_pid() - 1, &oarray[0][0], srows);
        send(my_pid() + 1, &oarray[srows - 1][0], srows);
        recv(my_pid() - 1, &oarray[-1][0], srows);
        recv(my_pid() + 1, &oarray[srows][0], srows);
        for (row = 0; row < srows; row++)
            for (col = 1; col < cols - 1; col++)
                narray[row][col] = ((oarray[row - 1][col] +
                                     oarray[row + 1][col] +
                                     oarray[row][col - 1] +
                                     oarray[row][col + 1]) / 4.0);
    }
55. Messages vs. Shared Memory?
- Shared memory
- As a programming model, shared memory is considered easier
- automatic caching is good for dynamic/irregular problems
- Message passing
- As a programming model, messages are the most portable
- the Right Thing for static/regular problems
- BW ++, latency --, no concept of caching
- Model = implementation?
- not necessarily...
56. Advanced Applications: Plasma Simulation
- Use the E-field as part of a larger problem, e.g., a plasma simulation
- Computation in two phases: (a) compute the field, (b) particle push
- Stability constraint: push no more than one cell per step
- F = q(E + v x B)
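- As an illustrative (assumed) form of the particle-push phase, a simple explicit update over a timestep dt is v_new = v + (q/m)(E + v x B) dt and x_new = x + v_new dt; the stability constraint above means dt must be small enough that a particle moves less than one cell per step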
57. Plasma Simulation: Partitioning??
- F = q(E + v x B)
- ...but for physically interesting systems, 80% of the particles are found in 10% of the cells!
- How to partition?
- Load balance <-> communication
- this is a "particle in cell" problem
58. Example MPPs
- 1. SGI (Cray) Origin 2000
- 2. Cray (SGI) (Tera) T3E
59. SGI Origin 2000 (note: slides/photos from the SGI web site)
- Hardware DSM, ccNUMA
- 1-1024 processors with shared memory (directory limit)
61. Protocol Details
- Protocol hacks
- read-exclusive cache state
- three-party forwarding
- Directory layout
- full map, 16 or 64 bits per memory line (a simplified directory-entry sketch follows below)
- a blocking mode where each bit = 2-8 nodes
- per-VM-page reference counters for page migration
- Other hacks
- fetch&op at memory
- DMA engine for block transfers
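- A simplified sketch (mine; not SGI's actual layout) of what a full-map directory entry might look like, with one presence bit per node and a small state field:

    #include <stdint.h>

    /* Assumed, simplified directory entry: one per memory line.
     * Full-map mode: presence bit i means node i (may) hold a copy.
     * In the blocking mode above, each bit would stand for 2-8 nodes. */
    enum dir_state { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE };

    struct dir_entry {
        enum dir_state state;
        uint64_t       presence;   /* up to 64 nodes in this sketch */
    };

    /* Record that a node has fetched a shared copy of the line. */
    static void dir_add_sharer(struct dir_entry *e, int node)
    {
        e->presence |= (uint64_t)1 << node;
        if (e->state == DIR_UNOWNED)
            e->state = DIR_SHARED;
    }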
63. Network
- SGI "Spider" routers
- 6 pairs of ports per router
- 41 ns fall-through latency, wormhole routed
- four virtual channels per physical channel
- 2 for request/reply
- 2 more for adaptive routing (not limited to dimension-order routing like the MRC)
67. O2K Summary
- ccNUMA DSM all in hardware
- 1-1024 processors
- Scalable packaging
- New product, Origin 3000, appears similar
68. Cray T3E (note: slides/photos from the SGI/Cray web site)
- Message passing; non-coherent remote memory
- Up to 2048 processors
- Air- or liquid-cooled
- New product: SV1
72. T3E Mechanisms
- Remote load/store (non-coherent)
- staged through E-registers
- address centrifuge for data distribution
- some fetch&op synchronization operations
- Block transfer to remote message queues
- Hardware barrier/eureka network
- Steve Oberlin, ISCA-99: "the only mechanism widely used today is remote load/store (without the centrifuge)"
75. Summary
- Mechanisms
- shared memory: the coherence problem is solved at runtime (in HW or SW)
- message passing
- Origin 2000
- coherent shared memory to 128, maybe 1024, processors
- one 16-processor machine in the CoC: "nighthawk"
- T3E
- biggest tightly-coupled MPP: 2048 processors
- several auxiliary communication mechanisms