Lecture 17: I/O wrapup, parallel processing (PowerPoint transcript)
1
Lecture 17: I/O wrapup, parallel processing
  • Prof. Kenneth M. Mackenzie
  • Computer Systems and Networks
  • CS2200, Spring 2003

Includes slides from Bill Leahy
2
Review 1/3
  • Filesystem representation
  • files
  • directories
  • API for I/O
  • how to wait ... dealing with blocks
  • unix approach
  • alternatives
  • device drivers
  • Disk head scheduling (finish this today)

3
Review 2/3: Unix Inode
4
Review 3/3: File Interface
  • open() -- open a file by name, return handle
  • close() -- done with file
  • read() -- read block, advance internal index
  • write() -- write block, advance internal index
  • fseek() -- set internal index
  • mmap() -- map a file into virtual addresses
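These calls correspond closely to the POSIX system-call interface. The sketch below is an illustration only, not from the lecture; it assumes a readable file named data.bin and exercises the descriptor-based calls, where lseek() plays the role that fseek() plays in stdio:

/* Minimal POSIX I/O sketch (assumes a readable file "data.bin"). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);        /* open by name, get handle */
    if (fd < 0) { perror("open"); return 1; }

    char block[512];
    ssize_t n = read(fd, block, sizeof block);  /* read a block, index advances */
    printf("read %zd bytes\n", n);

    lseek(fd, 0, SEEK_SET);                     /* set the internal index */

    struct stat st;
    fstat(fd, &st);                             /* file size, needed for mmap */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        printf("first byte via mmap: %d\n", p[0]);  /* file appears as memory */
        munmap(p, st.st_size);
    }

    close(fd);                                  /* done with the file */
    return 0;
}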

5
Misc. Issues
  • read/write/seek
  • 10% free space kept in reserve in BSD-FFS
  • and descendants, including Linux ext2fs
  • called the "root reserve" and helps avoid
    denial-of-service
  • however, its main purpose is to help the allocator,
    similar to the over-provisioning factor in a hash
    table
  • defrag utilities also exist for Unix, they are just
    not ordinarily used

while (1) {
    read();      /* read current block */
    seek();      /* to next block */
    ....         /* do work while moving head */
}
6
Today
  • Disk head scheduling
  • Parallel systems

7
Disk Head Scheduling
8
Review...
  • A radial position on one surface is called a track;
    the set of tracks at that radius on all surfaces is a cylinder
  • Tracks divided into sectors
  • Must move head to correct position (seek)
  • Must wait for platter to rotate under head
  • Rotational Latency

9
Disk Scheduling
  • Algorithm Objective
  • Minimize seek time
  • Assumptions
  • Single disk module
  • Requests for equal sized blocks
  • Random distribution of locations on disk
  • One movable arm
  • Seek time proportional to tracks crossed
  • No delay due to controller
  • Read/write times are equal

10
Disk Scheduling
  • Measures of response
  • Mean wait time
  • Variance of wait time
  • Throughput
  • Measures of load
  • Number of requests per second
  • Average queue length

11
Policies
  • FCFS
  • Fair
  • OK for small loads, but saturates quickly
  • High mean wait time
  • Low variance
  • Random arrival order leads to wide swings in seek
    distance, hence the high mean wait time

12
FCFS
13
Policies
  • SSTF (Shortest seek time first)
  • Analogous to SJF for CPU scheduling
  • Good throughput
  • Saturates slowest
  • High variance
  • Possibility of starvation

14
SSTF
15
Policies
  • SCAN (elevator algorithm)
  • Start at one end, move toward the other end
  • Service requests on the way
  • Reverse and continue scanning
  • Lower variance than SSTF, similar mean waiting
    time
  • What about requests just behind the head versus just
    ahead of it, relative to the direction of motion?
    (a sketch follows below)
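As a concrete illustration (mine, not the lecture's), the sketch below orders a request queue the way the elevator does: sort the requests, service everything at or beyond the head in the current direction, then reverse. Strictly this is LOOK, since it turns around at the last pending request rather than at the physical end of the disk; it uses the request queue from the worked example later in the lecture.

/* Elevator (LOOK-style) ordering sketch: prints the service order. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    int queue[] = { 86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130 };
    int n = sizeof queue / sizeof queue[0];
    int head = 143;
    int moving_up = 1;          /* previous request was 125, so the head moves up */

    qsort(queue, n, sizeof queue[0], cmp_int);

    int split = 0;              /* first request at or above the head position */
    while (split < n && queue[split] < head)
        split++;

    if (moving_up) {
        for (int i = split; i < n; i++)       /* sweep toward higher cylinders */
            printf("%d ", queue[i]);
        for (int i = split - 1; i >= 0; i--)  /* then reverse and sweep down   */
            printf("%d ", queue[i]);
    } else {
        for (int i = split - 1; i >= 0; i--)
            printf("%d ", queue[i]);
        for (int i = split; i < n; i++)
            printf("%d ", queue[i]);
    }
    printf("\n");
    return 0;
}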

16
SCAN
17
Policies
  • C-SCAN
  • Treat disk as if it were circular
  • Scan in one direction, retract, and resume scan
  • Leads to more uniform waiting time than SCAN

18
C-Scan
19
Policies
  • Look
  • Same as SCAN but reverse direction if no more
    requests in the scan direction
  • Leads to better performance than SCAN
  • C-Look
  • Circular look

20
C-Look
21
Policies
  • N-Step SCAN
  • Two queues
  • active (N requests or less)
  • latent
  • Service active queue
  • When no more in active, transfer N requests from
    latent to active
  • Leads to lower variance compared to SCAN
  • Worse than SCAN for mean waiting time

22
Algorithm Selection
  • SSTF (Shortest Seek Time First) is common. Better
    than FCFS.
  • If load is heavy, SCAN and C-SCAN are best because
    they are less likely to have starvation problems
  • We could calculate an optimal schedule for any series
    of requests, but it is costly
  • Depends on the number and type of requests
  • e.g. Imagine we only have one request pending
  • Also depends on file layout
  • Recommend a modular scheduling algorithm that can
    be changed.

23
Typical Question
  • Suppose a disk drive has 5000 cylinders, numbered
    from 0 to 4999. Currently at cylinder 143 and
    previous request was at 125. Queue (in FIFO
    order) is
  • 86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130
  • Starting from the current position, what is the total
    distance (in cylinders) the disk head moves to
    satisfy all the requests,
  • using FCFS, SSTF, SCAN, LOOK, and C-SCAN?

24
FCFS
  • 143 → 86: 57
  • 86 → 1470: 1384
  • 1470 → 913: 557
  • 913 → 1774: 861
  • 1774 → 948: 826
  • 948 → 1509: 561
  • 1509 → 1022: 487
  • 1022 → 1750: 728
  • 1750 → 130: 1620
  • Total: 7081 cylinders <-- Answer
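A small sketch (not part of the lecture) that just accumulates the head movement for a given service order; with the FCFS order above it reproduces the 7081-cylinder answer, and any other service order (e.g. the one produced by the LOOK sketch earlier) can be dropped in:

/* Total head movement for a given service order, starting at cylinder 143. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int order[] = { 86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130 };  /* FCFS */
    int n = sizeof order / sizeof order[0];
    int pos = 143, total = 0;

    for (int i = 0; i < n; i++) {
        total += abs(order[i] - pos);   /* cylinders crossed for this request */
        pos = order[i];
    }
    printf("total movement: %d cylinders\n", total);  /* prints 7081 for FCFS */
    return 0;
}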

25
Parallel Processors
26
Today: Parallelism vs. Parallelism
  • Uni
  • Pipelined
  • Superscalar
  • VLIW/EPIC
  • SMP (Symmetric)
  • Distributed

(ILP: pipelined, superscalar, VLIW/EPIC; TLP: SMP, distributed)
27
Parallel Computers
  • Definition: "A parallel computer is a collection
    of processing elements that cooperate and
    communicate to solve large problems fast."
  • -- Almasi and Gottlieb, Highly Parallel Computing,
    1989
  • Questions about parallel computers
  • How large a collection?
  • How powerful are processing elements?
  • How do they cooperate and communicate?
  • How are data transmitted?
  • What type of interconnection?
  • What are HW and SW primitives for programmer?
  • Does it translate into performance?

28
The Plan
  • Applications (problem space)
  • Key hardware issues
  • shared memory: how to keep caches coherent
  • message passing: low-cost communication
  • Commercial examples
  • SGI (Cray) O2K: 512 processors, CC shared memory
  • Cray T3E: 2048 processors, MP machine

29
Current Practice
  • Some success w/MPPs (Massively Parallel
    Processors)
  • dense matrix scientific computing (petroleum,
    automotive, aeronautics, pharmaceuticals)
  • file servers, databases, web search engines
  • entertainment/graphics
  • Small-scale machines: Dell Workstation 530
  • 1.7GHz Intel Pentium IV (in Minitower)
  • 512 MB RDRAM memory, 40GB disk, 20X CD, 19"
    monitor, Quadro2Pro Graphics card, RedHat Linux,
    3yrs service
  • $2,760; for a 2nd processor, add $515

30
Parallel Architecture
  • Parallel Architecture extends traditional
    computer architecture with a communication
    architecture
  • Programming model (SW view)
  • Abstractions (HW/SW interface)
  • Implementation to realize abstraction efficiently
  • Historically, implementations have been tied to
    programming models but that is changing.

31
Parallel Applications
  • Throughput-oriented (want many answers)
  • multiprogramming
  • databases, web servers
  • Acme...
  • Latency oriented (want one answer, fast)
  • Grand Challenge problems
  • global climate model
  • human genome
  • quantum chromodynamics
  • combustion model
  • cognition

32
Speedup: metric for performance on
latency-sensitive applications
  • Time(1) / Time(P) for P processors
  • note must use the best sequential algorithm for
    Time(1) -- the parallel algorithm may be
    different.

[Figure: speedup vs. number of processors (1, 2, 4, ..., 64): linear
speedup is the ideal; typical curves roll off with some number of
processors; occasionally you see superlinear speedup... why?]
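As a made-up worked example of the metric: if the best sequential program takes Time(1) = 120 s and the parallel program takes Time(16) = 10 s on 16 processors, the speedup is 120 / 10 = 12, i.e. a parallel efficiency of 12/16 = 75%. Superlinear speedup, when it appears, usually comes from the aggregate cache and memory of P machines holding a working set that a single machine cannot.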
33
Hardware: Two Main Variations
  • Shared-Memory
  • may be physically shared or only logically shared
  • communication is implicit in loads and stores
  • Message-Passing
  • must add explicit communication

34
Shared-Memory Hardware (1): Hardware and
programming model don't have to match, but this
is the mental model for shared-memory programming
  • Memory centralized with uniform access time
    (UMA) and bus interconnect, I/O
  • Examples: Dell Workstation 530, Sun Enterprise,
    SGI Challenge
  • typical latencies:
  • 1 cycle to local cache
  • 20 cycles to remote cache
  • 100 cycles to memory

35
Shared-Memory Hardware (2)
  • Variation: memory is not centralized. Called
    non-uniform memory access (NUMA)
  • Shared memory accesses are converted into a
    messaging protocol (usually by hardware)
  • Examples: DASH/Alewife/FLASH (academic), SGI
    Origin, Compaq GS320, Sequent (IBM) NUMA-Q

36
Message Passing Model
  • Whole computers (CPU, memory, I/O devices)
    communicate as explicit I/O operations
  • Essentially NUMA but integrated at I/O devices
    instead of at the memory system
  • Send specifies local buffer + receiving process
    on remote computer
  • Receive specifies sending process on remote
    computer + local buffer to place data
  • Usually send includes process tag and receive has a
    rule on tag: match one, match any
  • Synch: when send completes, when buffer free, when
    request accepted, receive waits for send
  • Send + receive => memory-memory copy, where each
    side supplies a local address, AND does pairwise
    synchronization! (a minimal MPI sketch follows below)
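The semantics on this slide map directly onto MPI. The following minimal sketch is my illustration, not lecture material; it assumes an MPI installation and a run with two ranks (e.g. mpirun -np 2 ./a.out):

/* Matched send/recv: rank 0 sends, rank 1 receives; the tags must match. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send: local buffer + receiving process (rank 1) + tag 7 */
        MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int got;
        /* receive: sending process + local buffer; MPI_ANY_TAG would be "match any" */
        MPI_Recv(&got, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", got);  /* completion implies the send happened:
                                                 pairwise synchronization */
    }

    MPI_Finalize();
    return 0;
}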

37
Multicomputer
[Figure: two complete computers, each with Proc + Cache (A, B) and its
own memory, connected by an interconnect]
38
Multiprocessor: Symmetric Multiprocessor or SMP
[Figure: Proc + Cache A and Proc + Cache B sharing a single memory]
39
Key Problem: Cache Coherence
[Figure: X = 0 in memory. Cache A reads X, then writes X = 1; Cache B
then reads X and still sees the stale X = 0. Oops!]
40
Simplest Coherence Strategy: Enforce Exactly One
Copy
[Figure: same access sequence, but each cache's read or write of X first
invalidates the copy in the other cache, so only one valid copy of X
exists at any time]
41
Exactly One Copy
[State diagram, one per cache line: INVALID -> VALID on a read or write
(which invalidates other copies); stays VALID on more reads or writes;
VALID -> INVALID on replacement or invalidation]
  • Maintain a lock per cache line
  • Invalidate other caches' copies on a read/write
  • Easy on a bus: snoop the bus for transactions

42
Snoopy Cache
[Figure: CPU on top of a cache (State / Tag / Data per line), attached
to the bus]
  • CPU references check the cache tags (as usual)
  • Cache misses are filled from memory (as usual)
  • Other reads/writes seen on the bus must check the
    tags, too, and possibly invalidate
43
Exactly One Copy
  • Works, but performance is crummy.
  • Suppose we all just want to read the same memory
    location
  • one lousy global variable n = the size of the
    problem, written once at the start of the program
    and read thereafter

Permit multiple readers (a readers/writer lock per
cache line) -- a toy sketch of this follows below
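The following toy simulation was written for this transcript rather than taken from the lecture. It shows the write-invalidate idea with multiple readers: any cache may hold a VALID read-only copy, and a write by one cache invalidates all the others (write-through to memory is assumed for simplicity):

/* Toy write-invalidate coherence for a single line X shared by NCACHE caches:
   many readers may hold a VALID copy, but a write invalidates all other copies. */
#include <stdio.h>

#define NCACHE 2
typedef enum { INVALID, VALID } state_t;

static state_t state[NCACHE];  /* per-cache state of line X */
static int     copy[NCACHE];   /* per-cache cached value    */
static int     memory_x = 0;   /* backing value in memory   */

static void cache_read(int c)
{
    if (state[c] == INVALID) {        /* miss: fetch from memory, become a reader */
        copy[c] = memory_x;
        state[c] = VALID;
    }
    printf("cache %d reads X = %d\n", c, copy[c]);
}

static void cache_write(int c, int v)
{
    for (int i = 0; i < NCACHE; i++)  /* "snoop": invalidate every other copy */
        if (i != c)
            state[i] = INVALID;
    copy[c] = v;
    state[c] = VALID;
    memory_x = v;                     /* write-through, so later misses see v */
    printf("cache %d writes X = %d\n", c, v);
}

int main(void)
{
    cache_read(0);       /* A reads X = 0                                  */
    cache_read(1);       /* B may read too: multiple readers are fine      */
    cache_write(0, 1);   /* A writes X = 1; B's copy is invalidated        */
    cache_read(1);       /* B misses and re-fetches, seeing 1, not stale 0 */
    return 0;
}

Real protocols typically add an exclusive/modified state so that repeated writes do not have to go to memory every time.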
44
Programming Examples
45
Electric Field
2D plane with a known distribution of charges; want to
solve for the electric field everywhere.
  • Electrostatic potential: Poisson's eqn
  • Poisson: ∇²φ(x, y) = ρ(x, y)
  • just solve this differential equation for φ...
  • Field is E(x, y) = -∇φ(x, y)

46
Approach: Discretize
[Figure: continuous plane vs. discrete grid of cells]
  • discretize the plane into cells
  • convert the differential equation into a difference eqn
  • write out the differences as a system Ax = b
  • x is φ for every cell, b is ρ for every cell
  • Solve the equations (linear algebra)
  • oh, by the way, x has 1M elements and A is a 1M x 1M
    matrix...
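For reference, the difference equation hiding behind "Ax = b" and behind the Jacobi code on the later slides can be written out explicitly. This is the standard 5-point form with grid spacing h, my own restatement rather than a formula copied from the slides:

% 5-point discretization of Poisson's equation on a uniform grid
\nabla^2 \phi \approx
\frac{\phi_{i-1,j} + \phi_{i+1,j} + \phi_{i,j-1} + \phi_{i,j+1} - 4\,\phi_{i,j}}{h^2}
= \rho_{i,j}
\quad\Longrightarrow\quad
\phi_{i,j} = \frac{\phi_{i-1,j} + \phi_{i+1,j} + \phi_{i,j-1} + \phi_{i,j+1} - h^2\,\rho_{i,j}}{4}

With ρ = 0 in a cell (Laplace's equation away from the charges), the right-hand form reduces to the plain average of the four neighbors that appears in the code.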

47
Jacobi Method (iterative solver for, e.g.,
Poisson's eqn)
  • Ultimate parallel method! (all communication is
    between neighbors, i.e., local)
  • Natural partitioning by square tiles

Next A[i][j] = average of its neighbors
48
Jacobi Relaxation: sequential
  • New cell is the average of its four neighbors
  • Use two arrays (oarray, narray) and alternate between
    them (a possible driver loop is sketched after the code)
jacobi_one_step(oarray, narray, rows, cols)
{
    for (row = 1; row < rows - 1; row++)
        for (col = 1; col < cols - 1; col++)
            narray[row][col] = ((oarray[row - 1][col] +
                                 oarray[row + 1][col] +
                                 oarray[row][col - 1] +
                                 oarray[row][col + 1]) / 4.0);
}
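The slide shows only one sweep. A possible driver that alternates the two arrays could look like the sketch below; this is my addition, the array sizes, types, iteration count, and the typed prototype are assumptions, and it is meant to be compiled together with a typed version of the step function above:

/* Hypothetical driver: run jacobi_one_step() for NITER sweeps, swapping
   the roles of the old and new arrays after every sweep. */
#define ROWS  1024
#define COLS  1024
#define NITER 1000

/* assumed (typed) prototype for the slide's step function */
void jacobi_one_step(double o[][COLS], double n[][COLS], int rows, int cols);

static double a[ROWS][COLS], b[ROWS][COLS];   /* oarray/narray, zero-initialized */

void jacobi_solve(void)
{
    double (*oarray)[COLS] = a, (*narray)[COLS] = b;

    for (int iter = 0; iter < NITER; iter++) {
        jacobi_one_step(oarray, narray, ROWS, COLS);
        double (*tmp)[COLS] = oarray;         /* swap: the new array becomes the old */
        oarray = narray;
        narray = tmp;
    }
    /* after NITER sweeps, oarray holds the most recent iterate */
}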
49
Small Multiprocessors: connect via bus, memory
literally shared
[Figure: Proc + Cache A and Proc + Cache B on a bus with one memory]
Good for 2-4 way; the bus is eventually a bottleneck
50
Large Multiprocessors: scalable point-to-point
interconnect
[Figure: each node has a Proc + Cache (A, B), a Communication Controller
or Assist, and its local memory (Memory A, Memory B), all connected by a
network]
SM controller detects remote accesses, maintains
cache coherence
51
Jacobi Relaxation: shared memory
  • Divide up the arrays among the processors (how?)
  • Start a thread on each processor; each thread
    computes for its partition.
  • Note neighboring cells are transported on demand
  • Synchronize the threads

[Figure: oarray and narray, each divided into horizontal strips numbered
1-6, one strip per processor]
52
Jacobi Relaxation: shared memory
  • Communication is implicit (no code!)
  • Need synchronization (here, a barrier); a possible
    pthreads harness is sketched after the code
jacobi_one_step_sm(oarray, narray, rows, cols)
{
    botrow = rows / NPROCS * my_pid();
    toprow = botrow + rows / NPROCS;
    for (row = botrow; row < toprow; row++)
        for (col = 1; col < cols - 1; col++)
            narray[row][col] = ((oarray[row - 1][col] +
                                 oarray[row + 1][col] +
                                 oarray[row][col - 1] +
                                 oarray[row][col + 1]) / 4.0);
    barrier();
}
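The my_pid(), NPROCS, and barrier() used above are not defined on the slide. One plausible harness, an assumption on my part rather than lecture code, is POSIX threads with a pthread_barrier_t (compile with -pthread; POSIX barriers are available on Linux):

/* Hypothetical pthreads harness providing my_pid() and barrier(). */
#include <pthread.h>

#define NPROCS 4

static pthread_barrier_t bar;
static __thread int thread_id;        /* per-thread id (GCC/Clang extension) */

int  my_pid(void)  { return thread_id; }
void barrier(void) { pthread_barrier_wait(&bar); }

static void *worker(void *arg)
{
    thread_id = (int)(long)arg;
    /* ... here each thread would repeatedly call jacobi_one_step_sm()
       and swap the oarray/narray pointers between sweeps ... */
    return NULL;
}

int main(void)
{
    pthread_t t[NPROCS];
    pthread_barrier_init(&bar, NULL, NPROCS);   /* all NPROCS threads meet here */
    for (long i = 0; i < NPROCS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NPROCS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}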
53
Jacobi Relaxation: message passing
  • Divide up the arrays among the processors (same problem)
  • A processor can only see its partition; data from
    other partitions must be transported explicitly
    via messages
  • Note, though, that synchronization is inherent in
    the messages

[Figure: processor P3's strips of oarray and narray, with boundary rows
arriving as messages from P2 above and P4 below]
54
Jacobi Relaxation: message passing
jacobi_one_step_mp(oarray, narray, srows, cols)
{
    send(my_pid() - 1, &oarray[0][0], srows);
    send(my_pid() + 1, &oarray[srows - 1][0], srows);
    recv(my_pid() - 1, &oarray[-1][0], srows);
    recv(my_pid() + 1, &oarray[srows][0], srows);
    for (row = 0; row < srows; row++)
        for (col = 1; col < cols - 1; col++)
            narray[row][col] = ((oarray[row - 1][col] +
                                 oarray[row + 1][col] +
                                 oarray[row][col - 1] +
                                 oarray[row][col + 1]) / 4.0);
}
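One practical note: issuing both sends before both receives can deadlock if the underlying send blocks until a matching receive is posted; a common fix is a combined send/receive. The fragment below is a hedged MPI rendering of just the boundary exchange, not the lecture's code. The local layout (srows interior rows plus ghost rows at indices 0 and srows+1, each row cols long) and the neighbor ranks up/down are assumptions; MPI_PROC_NULL can be passed at the edges.

/* Exchange ghost rows with the neighbors above and below (MPI sketch). */
#include <mpi.h>

void exchange_ghost_rows(double *oarray, int srows, int cols,
                         int up, int down)   /* neighbor ranks, or MPI_PROC_NULL */
{
    /* row r starts at oarray + r * cols; rows 0 and srows+1 are ghosts */
    MPI_Sendrecv(oarray + 1 * cols,           cols, MPI_DOUBLE, up,   0,
                 oarray + (srows + 1) * cols, cols, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                 /* send top interior row up; receive the bottom ghost row */
    MPI_Sendrecv(oarray + srows * cols,       cols, MPI_DOUBLE, down, 0,
                 oarray + 0 * cols,           cols, MPI_DOUBLE, up,   0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                 /* send bottom interior row down; receive the top ghost row */
}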
55
Messages vs. Shared Memory?
  • Shared Memory
  • As a programming model, shared memory is
    considered easier
  • automatic caching is good for dynamic/irregular
    problems
  • Message Passing
  • As a programming model, messages are the most
    portable
  • Right Thing for static/regular problems
  • BW +, latency --, no concept of caching
  • Model = implementation?
  • not necessarily...

56
Advanced Applications: Plasma Simulation
  • Use the E-field as part of a larger problem, e.g.,
    plasma simulation
  • Computation in two phases: (a) compute field, (b)
    particle push.
  • Stability constraint: push no more than one cell
    per time step

F = q(E + v x B)
57
Plasma Simulation: Partitioning??

F = q(E + v x B)

  • but for physically interesting systems, 80% of the
    particles are found in 10% of the cells!
  • How to partition?
  • Load balance <-> communication
  • this is a "particle-in-cell" problem

58
Example MPPs
  • 1. SGI (Cray) Origin 2000
  • 2. Cray (SGI) (Tera) T3E

59
SGI Origin 2000 (note: slides/photos from the SGI web
site)
  • Hardware DSM, ccNUMA
  • 1-1024 processors w/shared memory (directory
    limit)

60
(No Transcript)
61
Protocol Details
  • Protocol hacks
  • read-exclusive cache state
  • three-party forwarding
  • Directory layout
  • full-map, 16- or 64-bits per memory line
  • blocking mode where each bit = 2-8 nodes
  • per-VM-page reference counters for page migration
  • Other hacks
  • fetch-and-op at memory
  • DMA engine for block transfers

62
(No Transcript)
63
Network
  • SGI spider routers
  • 6 pairs of ports/router
  • 41 ns fall-through, wormhole routed
  • four virtual channels per physical channel
  • 2 for request/reply
  • 2 more for adaptive routing (not limited to
    dimension-order like the MRC)

64
(No Transcript)
65
(No Transcript)
66
(No Transcript)
67
O2K Summary
  • ccNUMA DSM, all in hardware
  • 1-1024 processors
  • Scalable packaging
  • New product, Origin 3000, appears similar

68
Cray T3E (note: slides/photos from the SGI/Cray web
site)
  • Message-Passing: non-coherent remote memory
  • Up to 2048 processors
  • Air- or liquid-cooled
  • New product: SV1

69
(No Transcript)
70
(No Transcript)
71
(No Transcript)
72
T3E Mechanisms
  • Remote load/store (non-coherent)
  • staged through E registers
  • address centrifuge for data distribution
  • some fetch-and-op synchronization operations
  • Block transfer to remote message queues
  • Hardware barrier/eureka network

Steve Oberlin, ISCA-99: "the only mechanism widely
used today is remote load/store (w/out the
centrifuge)"
73
(No Transcript)
74
(No Transcript)
75
Summary
  • Mechanisms
  • shared memory: coherence problem solved at
    runtime (HW or SW)
  • message passing
  • Origin 2000
  • coherent shared memory to 128, maybe 1024, processors
  • one 16-processor machine in CoC ("nighthawk")
  • T3E
  • Biggest tightly-coupled MPP: 2048 processors
  • several auxiliary communication mechanisms