Title: Lecture 17: I/O wrapup, parallel processing
1. Lecture 17: I/O wrapup, parallel processing
- Prof. Kenneth M. Mackenzie
- Computer Systems and Networks
- CS2200, Spring 2003
Includes slides from Bill Leahy
2. Review 1/3
- Filesystem representation
- files
- directories
- API for I/O
- how to wait ... dealing with blocks
- unix approach
- alternatives
- device drivers
- Disk head scheduling (finish this today)
3. Review 2/3: Unix Inode
4. Review 3/3: File Interface
- open() -- open a file by name, return handle
- close() -- done with file
- read() -- read block, advance internal index
- write() -- write block, advance internal index
- fseek() -- set internal index
- mmap() -- map a file into virtual addresses
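- A minimal usage sketch (my addition, not from the slides), using the POSIX calls open/read/lseek/close; lseek() plays the role of the fseek() row above, and the filename and 512-byte block size are arbitrary:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[512];
        int fd = open("data.bin", O_RDONLY);      /* open by name, get a handle */
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = read(fd, buf, sizeof(buf));   /* read a block; the index advances */
        printf("read %zd bytes\n", n);

        lseek(fd, 0, SEEK_SET);                   /* set the internal index back to 0 */
        n = read(fd, buf, sizeof(buf));           /* re-read the same block */
        printf("read %zd bytes again\n", n);

        close(fd);                                /* done with the file */
        return 0;
    }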
5. Misc. Issues
- read/write/seek
- 10% free space reserve in BSD-FFS
- and descendants, including Linux ext2fs
- called the root reserve; it helps avoid denial-of-service
- however, its main purpose is to give the allocator slack, similar to the over-provisioning factor in a hash table
- defrag utilities also exist for unix, just not ordinarily used
- Overlap useful work with the seek to the next block:

    while (1) {
        read();
        seek();    /* to next block */
        ....       /* do work while moving head */
    }
6. Today
- Disk head scheduling
- Parallel systems
7. Disk Head Scheduling
8. Review...
- A radial position on one surface is called a track; the tracks at the same radial position on all surfaces form a cylinder
- Tracks are divided into sectors
- Must move head to the correct position (seek)
- Must wait for the platter to rotate under the head
- Rotational latency
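- For a rough sense of scale (numbers assumed, not from the slides): at 7200 RPM one revolution takes 60/7200 s, about 8.3 ms, so the average rotational latency (half a revolution) is about 4.2 ms, on top of whatever seek time the scheduling policies below try to minimize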
9. Disk Scheduling
- Algorithm Objective
- Minimize seek time
- Assumptions
- Single disk module
- Requests for equal sized blocks
- Random distribution of locations on disk
- One movable arm
- Seek time proportional to tracks crossed
- No delay due to controller
- Read/write times are equal
10. Disk Scheduling
- Measures of response
- Mean wait time
- Variance of wait time
- Throughput
- Measures of load
- Number of requests per second
- Average queue length
11. Policies
- FCFS
- Fair
- OK for light loads, but saturates quickly
- Low variance
- Wide swings in seek distance lead to a high mean wait time
12. FCFS
13. Policies
- SSTF (Shortest seek time first)
- Analogous to SJF for CPU scheduling
- Good throughput
- Saturates slowest
- High variance
- Possibility of starvation
14. SSTF
15. Policies
- SCAN (elevator algorithm)
- Start at one end, move toward the other end
- Service requests on the way
- Reverse and continue scanning
- Lower variance than SSTF, similar mean waiting time
- Requests just in front of versus just behind the direction of head motion?
16. SCAN
17. Policies
- C-SCAN
- Treat disk as if it were circular
- Scan in one direction, retract, and resume scan
- Leads to more uniform waiting time than SCAN
18. C-SCAN
19. Policies
- LOOK
- Same as SCAN, but reverse direction if there are no more requests in the scan direction
- Leads to better performance than SCAN
- C-LOOK
- Circular LOOK
20. C-LOOK
21. Policies
- N-Step SCAN
- Two queues
- active (N requests or fewer)
- latent
- Service the active queue
- When the active queue is empty, transfer N requests from latent to active
- Leads to lower variance compared to SCAN
- Worse than SCAN for mean waiting time
22. Algorithm Selection
- SSTF (Shortest Seek Time First) is common; better than FCFS
- If the load is heavy, SCAN and C-SCAN are best because they are less likely to have starvation problems
- We could calculate an optimum for any series of requests, but that is costly
- Depends on the number and type of requests
- e.g., imagine we have only one request pending
- Also depends on file layout
- Recommendation: a modular scheduling algorithm that can be changed
23. Typical Question
- Suppose a disk drive has 5000 cylinders, numbered 0 to 4999. The head is currently at cylinder 143 and the previous request was at cylinder 125. The queue (in FIFO order) is
- 86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130
- Starting from the current position, what is the total distance (in cylinders) the disk head moves to satisfy all requests?
- Using FCFS, SSTF, SCAN, LOOK, C-SCAN
24. FCFS
- 143 to 86: 57
- 86 to 1470: 1384
- 1470 to 913: 557
- 913 to 1774: 861
- 1774 to 948: 826
- 948 to 1509: 561
- 1509 to 1022: 487
- 1022 to 1750: 728
- 1750 to 130: 1620
- Total: 7081 cylinders <-- Answer
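- To double-check totals like this, here is a small C sketch (my addition, not from the slides) that sums the FCFS head movement for the queue above; it prints 7081. Visiting the queue in a different order gives the totals for the other policies.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* FIFO request queue from the slide; the head starts at cylinder 143 */
        int queue[] = { 86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130 };
        int n = sizeof(queue) / sizeof(queue[0]);
        int pos = 143;
        long total = 0;

        for (int i = 0; i < n; i++) {
            total += abs(queue[i] - pos);   /* cylinders crossed for this request */
            pos = queue[i];
        }
        printf("FCFS total head movement: %ld cylinders\n", total);   /* 7081 */
        return 0;
    }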
25. Parallel Processors
26. Today: Parallelism vs. Parallelism
- Uni
- Pipelined
- Superscalar
- VLIW/EPIC
- (ILP: instruction-level parallelism, exploited within the single-processor designs above)
- SMP (Symmetric)
- Distributed
- (TLP: thread-level parallelism, exploited across processors in SMP and distributed systems)
27. Parallel Computers
- Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." (Almasi and Gottlieb, Highly Parallel Computing, 1989)
- Questions about parallel computers
- How large a collection?
- How powerful are processing elements?
- How do they cooperate and communicate?
- How are data transmitted?
- What type of interconnection?
- What are HW and SW primitives for programmer?
- Does it translate into performance?
28. The Plan
- Applications (problem space)
- Key hardware issues
- shared memory: how to keep caches coherent
- message passing: low-cost communication
- Commercial examples
- SGI (Cray) O2K: 512 processors, cache-coherent (CC) shared memory
- Cray T3E: 2048 processors, message-passing (MP) machine
29. Current Practice
- Some success with MPPs (Massively Parallel Processors)
- dense-matrix scientific computing (petroleum, automotive, aeronautics, pharmaceuticals)
- file servers, databases, web search engines
- entertainment/graphics
- Small-scale machines: Dell Workstation 530
- 1.7 GHz Intel Pentium 4 (in minitower)
- 512 MB RDRAM memory, 40 GB disk, 20X CD, 19" monitor, Quadro2 Pro graphics card, Red Hat Linux, 3 yrs service: $2,760; for a 2nd processor, add $515
30. Parallel Architecture
- Parallel architecture extends traditional computer architecture with a communication architecture
- Programming model (SW view)
- Abstractions (HW/SW interface)
- Implementation to realize the abstraction efficiently
- Historically, implementations have been tied to programming models, but that is changing.
31. Parallel Applications
- Throughput-oriented (want many answers)
- multiprogramming
- databases, web servers
- Acme...
- Latency oriented (want one answer, fast)
- Grand Challenge problems
- global climate model
- human genome
- quantum chromodynamics
- combustion model
- cognition
32. Speedup: a metric for performance on latency-sensitive applications
- Speedup = Time(1) / Time(P) for P processors
- note: must use the best sequential algorithm for Time(1) -- the parallel algorithm may be different
[Plot: speedup vs. number of processors (1, 2, 4, ..., 64). Linear speedup is the ideal; typical curves roll off beyond some number of processors; occasionally you see superlinear speedup... why?]
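- A quick worked example (numbers assumed, not from the slide): if the best sequential program takes 60 s and the parallel version takes 10 s on 8 processors, the speedup is 60 / 10 = 6, short of the linear ideal of 8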
33. Hardware: Two Main Variations
- Shared-Memory
- may be physically shared or only logically shared
- communication is implicit in loads and stores
- Message-Passing
- must add explicit communication
34. Shared-Memory Hardware (1)
- The hardware and the programming model don't have to match, but this is the mental model for shared-memory programming
- Memory is centralized with uniform access time (UMA), a bus interconnect, and shared I/O
- Examples: Dell Workstation 530, Sun Enterprise, SGI Challenge
- typical latencies:
- 1 cycle to the local cache
- 20 cycles to a remote cache
- 100 cycles to memory
35. Shared-Memory Hardware (2)
- Variation: memory is not centralized; called non-uniform memory access (NUMA)
- Shared-memory accesses are converted into a messaging protocol (usually by hardware)
- Examples: DASH/Alewife/FLASH (academic), SGI Origin, Compaq GS320, Sequent (IBM) NUMA-Q
36. Message Passing Model
- Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations
- Essentially NUMA, but integrated at the I/O devices instead of at the memory system
- Send specifies a local buffer + the receiving process on the remote computer
- Receive specifies the sending process on the remote computer + the local buffer to place the data in
- Usually send includes a process tag and receive has a rule on the tag: match one, match any
- Synchronization: when the send completes, when the buffer is free, when the request is accepted; receive waits for the send
- Send + receive => memory-memory copy, where each side supplies its local address, AND pairwise synchronization!
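- As one concrete (assumed) illustration of these semantics, here is a minimal MPI sketch rather than the abstract send/receive above: rank 0 sends a tagged buffer to rank 1, which accepts any tag (the "match any" rule). Build with mpicc and run with two ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data[4] = { 1, 2, 3, 4 };
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* send: local buffer + receiving process (rank 1) + tag 99 */
            MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive: sending process (rank 0) + local buffer; MPI_ANY_TAG = "match any" */
            MPI_Status st;
            MPI_Recv(data, 4, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            printf("rank 1 received tag %d, first element %d\n", st.MPI_TAG, data[0]);
        }

        MPI_Finalize();
        return 0;
    }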
37. Multicomputer
[Figure: two nodes, each a processor + cache (A and B) with its own private memory, connected by an interconnect]
38. Multiprocessor: Symmetric Multiprocessor, or SMP
[Figure: caches A and B sharing a single memory]
39. Key Problem: Cache Coherence
[Figure: caches A and B share a memory holding X = 0. Both read X; A then writes X = 1 in its cache while B keeps reading its stale copy, X = 0. Oops!]
40. Simplest Coherence Strategy: Enforce Exactly One Copy
[Figure: the same access sequence under the exactly-one-copy rule: only one cache holds X at a time, so B's stale copy goes away when A takes the line to write X = 1; memory still shows X = 0]
41. Exactly One Copy
[State diagram: INVALID goes to VALID on a read or write (invalidating other copies); VALID stays VALID on more reads or writes; VALID returns to INVALID on replacement or invalidation]
- Maintain a lock per cache line
- Invalidate other caches on a read/write
- Easy on a bus: snoop the bus for transactions (a code sketch of the state machine follows below)
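- Below is a minimal C sketch (my own, not from the slides) of the two-state, exactly-one-copy protocol for a single cache line; bus_invalidate() is a stand-in for whatever the real snooping hardware would do.

    #include <stdbool.h>

    enum line_state { INVALID, VALID };

    struct cache_line {
        enum line_state state;
        int             data;    /* the cached copy of X */
    };

    /* Stand-in for the bus: tell every other cache to drop its copy. */
    static void bus_invalidate(void) { /* snoop hardware would act here */ }

    /* Local CPU read or write: acquire the only valid copy, then access it. */
    static void on_local_access(struct cache_line *line, bool is_write, int value)
    {
        if (line->state == INVALID) {
            bus_invalidate();        /* read or write: invalidate other copies first */
            line->state = VALID;
        }
        if (is_write)
            line->data = value;      /* more reads/writes stay VALID, no bus traffic */
    }

    /* Another cache's read/write was snooped, or the line is being replaced. */
    static void on_replacement_or_invalidation(struct cache_line *line)
    {
        line->state = INVALID;
    }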
42. Snoopy Cache
- CPU references check the cache tags (as usual)
- Cache misses are filled from memory (as usual)
- Other reads/writes seen on the bus must check the tags too, and possibly invalidate
[Figure: a CPU above its cache, each cache entry holding State, Tag, and Data, attached to the shared bus]
43. Exactly One Copy
- Works, but performance is crummy.
- Suppose we all just want to read the same memory location
- one lousy global variable, n, the size of the problem, written once at the start of the program and read thereafter
- Permit multiple readers (a readers/writer lock per cache line)
44. Programming Examples
45. Electric Field
- 2D plane with a known distribution of charges; we want to solve for the electric field everywhere.
- Electrostatic potential: Poisson's eqn
- Poisson: ∇²φ(x, y) = ρ(x, y)
- just solve this differential equation for φ...
- The field is E(x, y) = -∇φ(x, y)
46. Approach: Discretize
- continuous -> discrete
- discretize the plane into cells
- convert the differential equation into a difference equation (spelled out in the note below)
- write out the differences as a system Ax = b
- x is φ for every cell, b is ρ for every cell
- Solve the equations (linear algebra)
- oh, by the way: x has 1M elements and A is a 1M x 1M matrix...
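- Note (mine, using the standard five-point stencil on a grid with spacing h): the difference equation at cell (i, j) is (φ[i-1][j] + φ[i+1][j] + φ[i][j-1] + φ[i][j+1] - 4 φ[i][j]) / h² = ρ(i, j); one such equation per cell is exactly the system Ax = b above, and solving it for φ[i][j] gives the "average of the neighbors" update (plus a ρ term that vanishes where ρ = 0) that the Jacobi method on the next slides iterates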
47. Jacobi Method (iterative solver for, e.g., Poisson's eqn)
- The ultimate parallel method! (all communication is between neighbors, i.e., local)
- Natural partitioning by square tiles
- Next A[i][j] = average of its neighbors
48. Jacobi Relaxation: sequential
- New cell is the average of its four neighbors
- Use two arrays (oarray, narray) and alternate between them

    jacobi_one_step(oarray, narray, rows, cols)
    {
        for (row = 1; row < rows - 1; row++)
            for (col = 1; col < cols - 1; col++)
                narray[row][col] = ((oarray[row - 1][col] +
                                     oarray[row + 1][col] +
                                     oarray[row][col - 1] +
                                     oarray[row][col + 1]) / 4.0);
    }
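- A small driver sketch (mine, not on the slide) of the "alternate between them" step, assuming oarray/narray are rows x cols grids of doubles and MAX_ITERS is an arbitrary iteration bound:

    #define MAX_ITERS 1000   /* arbitrary; a real solver would test convergence */

    void jacobi_solve(double **oarray, double **narray, int rows, int cols)
    {
        for (int iter = 0; iter < MAX_ITERS; iter++) {
            jacobi_one_step(oarray, narray, rows, cols);
            double **tmp = oarray;   /* swap: this step's output is the next step's input */
            oarray = narray;
            narray = tmp;
        }
    }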
49. Small Multiprocessors: connect via a bus; memory is literally shared
[Figure: processor + cache A and processor + cache B on a bus to a single memory]
- Good for 2-4 way; the bus is eventually a bottleneck
50. Large Multiprocessors: scalable point-to-point interconnect
[Figure: each node has a processor + cache (A, B), a local memory (Memory A, Memory B), and a communication controller or assist attached to the network]
- The SM controller detects remote accesses and maintains cache coherence
51. Jacobi Relaxation: shared memory
[Figure: oarray and narray, each divided into six partitions numbered 1-6, one per processor]
- Divide up the arrays among processors (how?)
- Start a thread on each processor; each thread computes for its partition.
- Note: neighboring cells are transported on demand
- Synchronize the threads
52. Jacobi Relaxation: shared memory
- Communication is implicit (no code!)
- Need synchronization (here, a barrier)

    jacobi_one_step_sm(oarray, narray, rows, cols)
    {
        botrow = rows / NPROCS * my_pid();
        toprow = botrow + rows / NPROCS;
        for (row = botrow; row < toprow; row++)
            for (col = 1; col < cols - 1; col++)
                narray[row][col] = ((oarray[row - 1][col] +
                                     oarray[row + 1][col] +
                                     oarray[row][col - 1] +
                                     oarray[row][col + 1]) / 4.0);
        barrier();
    }
53. Jacobi Relaxation: message passing
- Divide up the arrays among processors (same problem)
- A processor can only see its own partition; data from other partitions must be transported explicitly via messages
- Note, though, that synchronization is inherent in the messages
[Figure: processor 3's partitions of oarray and narray, with boundary rows arriving as messages from P2 and from P4]
54. Jacobi Relaxation: message passing

    jacobi_one_step_mp(oarray, narray, srows, cols)
    {
        send(my_pid() - 1, &oarray[0][0], srows);
        send(my_pid() + 1, &oarray[srows - 1][0], srows);
        recv(my_pid() - 1, &oarray[-1][0], srows);
        recv(my_pid() + 1, &oarray[srows][0], srows);
        for (row = 0; row < srows; row++)
            for (col = 1; col < cols - 1; col++)
                narray[row][col] = ((oarray[row - 1][col] +
                                     oarray[row + 1][col] +
                                     oarray[row][col - 1] +
                                     oarray[row][col + 1]) / 4.0);
    }
55. Messages vs. Shared Memory?
- Shared memory
- As a programming model, shared memory is considered easier
- automatic caching is good for dynamic/irregular problems
- Message passing
- As a programming model, messages are the most portable
- the Right Thing for static/regular problems
- BW ++, latency --, no concept of caching
- Model = implementation?
- not necessarily...
56. Advanced Applications: Plasma Simulation
- Use the E-field as part of a larger problem, e.g., a plasma simulation
- Computation in two phases: (a) compute the field, (b) particle push
- Stability constraint: push no more than one cell per step
- F = q(E + v x B)
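- As an illustrative (assumed) form of the particle-push phase, a simple explicit update over a timestep dt is v_new = v + (q/m)(E + v x B) dt and x_new = x + v_new dt; the stability constraint above means dt must be small enough that a particle moves less than one cell per step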
57. Plasma Simulation: Partitioning??
- F = q(E + v x B)
- ...but for physically interesting systems, 80% of the particles are found in 10% of the cells!
- How to partition?
- Load balance <-> communication
- this is a "particle in cell" problem
58. Example MPPs
- 1. SGI (Cray) Origin 2000
- 2. Cray (SGI) (Tera) T3E
59. SGI Origin 2000 (note: slides/photos from the SGI web site)
- Hardware DSM, ccNUMA
- 1-1024 processors with shared memory (directory limit)
61. Protocol Details
- Protocol hacks
- read-exclusive cache state
- three-party forwarding
- Directory layout
- full map, 16 or 64 bits per memory line (a simplified directory-entry sketch follows below)
- a blocking mode where each bit = 2-8 nodes
- per-VM-page reference counters for page migration
- Other hacks
- fetch&op at memory
- DMA engine for block transfers
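- A simplified sketch (mine; not SGI's actual layout) of what a full-map directory entry might look like, with one presence bit per node and a small state field:

    #include <stdint.h>

    /* Assumed, simplified directory entry: one per memory line.
     * Full-map mode: presence bit i means node i (may) hold a copy.
     * In the blocking mode above, each bit would stand for 2-8 nodes. */
    enum dir_state { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE };

    struct dir_entry {
        enum dir_state state;
        uint64_t       presence;   /* up to 64 nodes in this sketch */
    };

    /* Record that a node has fetched a shared copy of the line. */
    static void dir_add_sharer(struct dir_entry *e, int node)
    {
        e->presence |= (uint64_t)1 << node;
        if (e->state == DIR_UNOWNED)
            e->state = DIR_SHARED;
    }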
63. Network
- SGI "Spider" routers
- 6 pairs of ports per router
- 41 ns fall-through latency, wormhole routed
- four virtual channels per physical channel
- 2 for request/reply
- 2 more for adaptive routing (not limited to dimension-order routing like the MRC)
67. O2K Summary
- ccNUMA DSM all in hardware
- 1-1024 processors
- Scalable packaging
- New product, Origin 3000, appears similar
68. Cray T3E (note: slides/photos from the SGI/Cray web site)
- Message passing; non-coherent remote memory
- Up to 2048 processors
- Air- or liquid-cooled
- New product: SV1
72. T3E Mechanisms
- Remote load/store (non-coherent)
- staged through E-registers
- address centrifuge for data distribution
- some fetch&op synchronization operations
- Block transfer to remote message queues
- Hardware barrier/eureka network
- Steve Oberlin, ISCA-99: "the only mechanism widely used today is remote load/store (without the centrifuge)"
75. Summary
- Mechanisms
- shared memory: the coherence problem is solved at runtime (in HW or SW)
- message passing
- Origin 2000
- coherent shared memory to 128, maybe 1024, processors
- one 16-processor machine in the CoC: "nighthawk"
- T3E
- biggest tightly-coupled MPP: 2048 processors
- several auxiliary communication mechanisms