THIRD REVIEW SESSION - PowerPoint PPT Presentation

Title: THIRD REVIEW SESSION

Description: COSC3330/6308 Computer Architecture
Author: Jehan-François Pâris
Created: 8/29/2001


1
THIRD REVIEW SESSION
  • Jehan-François Pâris
  • May 5, 2010

2
MATERIALS (I)
  • Memory hierarchies
  • Caches
  • Virtual memory
  • Protection
  • Virtual machines
  • Cache consistency

3
MATERIALS (II)
  • I/O Operations
  • More about disks
  • I/O operation implementation
  • Busses
  • Memory-mapped I/O
  • Specific I/O instructions
  • RAID organizations

4
MATERIALS (III)
  • Parallel Architectures
  • Shared memory multiprocessors
  • Computer clusters
  • Hardware multithreading
  • SISD, SIMD, MIMD,
  • Roofline performance model

5
CACHING AND VIRTUAL MEMORY
6
Common objective
  • Make a combination of
  • Small, fast and expensive memory
  • Large, slow and cheap memory
  • look like
  • A single large and fast memory
  • Fetch policy is fetch on demand

7
Questions to ask
  • What are the transfer units?
  • How are they placed in the faster memory?
  • How are they accessed?
  • How do we handle misses?
  • How do we implement writes?
  • and more generally
  • Are these tasks performed by the hardware or the
    OS?

8
Transfer units
  • Blocks or pages containing 2^n bytes
  • Always properly aligned
  • If a block or a page contains 2^n bytes, the n
    LSBs of its start address will be all zeroes

9
Examples
  • If block size is 4 words,
  • Corresponds to 16 = 2^4 bytes
  • 4 LSBs of the block address will be all zeroes
  • If page size is 4 KB
  • Corresponds to 2^2 × 2^10 = 2^12 bytes
  • 12 LSBs of page address will be all zeroes
  • Remaining bits of address form page number

10
Examples
Page size = 4 KB
32-bit address of first byte in page: 20 arbitrary MSBs followed by <12 zeroes>
An address within the page = 20-bit page number + 12-bit offset
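
A minimal C sketch of this page-number/offset split (the 4 KB page size and the sample address are illustrative assumptions, not from the slides):

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 12                    /* 4 KB pages */
    #define PAGE_SIZE   (1u << OFFSET_BITS)

    int main(void) {
        uint32_t addr = 0x12345ABCu;                      /* arbitrary sample address */
        uint32_t page_number = addr >> OFFSET_BITS;       /* the 20 MSBs              */
        uint32_t offset      = addr & (PAGE_SIZE - 1);    /* the 12 LSBs              */
        printf("page 0x%X, offset 0x%X\n",
               (unsigned)page_number, (unsigned)offset);  /* page 0x12345, offset 0xABC */
        return 0;
    }
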
11
Consequence
  • In a 32-bit architecture,
  • We identify a block or a page of size 2^n bytes by
    the 32 - n MSBs of its address
  • Will be called
  • Tag
  • Page number

12
Placement policy
  • Two extremes
  • Each block can only occupy a fixed address in the
    faster memory
  • Direct mapping (many caches)
  • Each page can occupy any address in the faster
    memory
  • Fully associative mapping (virtual memory)

13
Direct mapping
  • Assume
  • Cache has 2^m entries
  • Block size is 2^n bytes
  • a is the block address (with its n LSBs removed)
  • The block will be placed at cache position
  • a mod 2^m

14
Consequence
  • The tag identifying the cache block will be the
    start address of the block with its n + m LSBs
    removed
  • the original n LSBs because they are known to be
    all zeroes
  • the next m LSBs because they are equal to a mod 2^m

15
Consequence
Tag = block start address with the n LSBs (all zeroes) and the m additional LSBs given by a mod 2^m removed
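
A small C sketch of the index and tag computation for a direct-mapped cache (the block and cache sizes are illustrative assumptions):

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_BITS 3     /* 2^3 = 8-byte blocks (n) */
    #define INDEX_BITS 10    /* 2^10 cache entries (m)  */

    int main(void) {
        uint32_t addr  = 0x0000ABCDu;                /* sample byte address           */
        uint32_t a     = addr >> BLOCK_BITS;         /* block address, n LSBs dropped */
        uint32_t index = a % (1u << INDEX_BITS);     /* a mod 2^m : cache position    */
        uint32_t tag   = a >> INDEX_BITS;            /* remaining MSBs form the tag   */
        printf("index %u, tag 0x%X\n", (unsigned)index, (unsigned)tag);
        return 0;
    }
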
16
A cache whose block size is 8 bytes
[Diagram: each cache entry holds a Tag and the block Contents]
17
Fully associative solution
Page Frame
0 4
1 7
2 27
3 44
4 5
  • Used in virtual memory systems
  • Each page can occupy any free page frame in main
    memory
  • Use a page table
  • Without redundant first column

18
Solutions with limited associativity
  • A cache of size 2^m with associativity level k
    lets a given block occupy any of k possible
    locations in the cache
  • Implementation looks very much like k caches of
    size 2^m/k put together
  • All possible cache locations for a block have the
    same position a mod (2^m/k) in each of the smaller
    caches
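
A corresponding C sketch for a k-way set-associative cache (the sizes are again assumed for illustration):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x0000ABCDu;
        uint32_t block_bits = 3, entries = 1024, k = 2;
        uint32_t a        = addr >> block_bits;    /* block address                     */
        uint32_t num_sets = entries / k;           /* 2^m / k sets                      */
        uint32_t set      = a % num_sets;          /* same position in each of the k ways */
        uint32_t tag      = a / num_sets;          /* rest of the block address         */
        printf("set %u, tag 0x%X\n", (unsigned)set, (unsigned)tag);
        return 0;
    }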

19
A set-associative cache with k = 2
20
Accessing an entry
  • In a cache, use hardware to compute the
    possible cache position for the block containing
    the data
  • a mod 2^m for a cache using direct mapping
  • a mod (2^m/k) for a cache of associativity level k
  • Check then if the cache entry is valid using its
    valid bit

21
Accessing an entry
  • In a VM system, hardware checks the TLB to find
    the frame containing a given page number
  • TLB entries contain
  • A page number (tag)
  • A frame number
  • A valid bit
  • A dirty bit

22
Accessing an entry
  • The valid bit indicates if the mapping is valid
  • The dirty bit indicates whether we need to
    save the page contents when we expel it

23
Accessing an entry
  • If page mapping is not in the TLB, must consult
    the page table and update the TLB
  • Can be done by hardware or software

24
Realization
25
Handling cache misses
  • Cache hardware fetches missing block
  • Often overwriting an existing entry
  • Which one?
  • The one that occupies the same location if the
    cache uses direct mapping
  • One of those that occupy the same location if the
    cache is set-associative

26
Handling cache misses
  • Before expelling a cache entry, we must
  • Check its dirty bit
  • Save its contents if dirty bit is on.

27
Handling page faults
  • OS fetches missing page
  • Often overwriting an existing page
  • Which one?
  • One that was not recently used
  • Selected by page replacement policy

28
Handling page faults
  • Before expelling a page, we must
  • Check its dirty bit
  • Save its contents if dirty bit is on.

29
Handling writes (I)
  • Two ways to handle writes
  • Write through
  • Each write updates both the cache and the main
    memory
  • Write back
  • Writes are not propagated to the main memory
    until the updated word is expelled from the cache
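
A toy C sketch contrasting the two policies, assuming a single cached word (the model is a simplification, not part of the original slides):

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy model: RAM plus a single cached word, to contrast the two policies. */
    int ram[16];
    int cached_addr = -1, cached_value;
    bool dirty = false;

    void write_through(int addr, int value) {
        cached_addr = addr; cached_value = value;
        ram[addr] = value;                    /* every write also updates RAM     */
    }

    void write_back(int addr, int value) {
        cached_addr = addr; cached_value = value;
        dirty = true;                         /* RAM is updated only at expulsion */
    }

    void expel(void) {
        if (dirty) ram[cached_addr] = cached_value;   /* save the modified word   */
        cached_addr = -1; dirty = false;
    }

    int main(void) {
        write_back(3, 42);
        printf("before expulsion: ram[3] = %d\n", ram[3]);   /* still 0 */
        expel();
        printf("after expulsion:  ram[3] = %d\n", ram[3]);   /* now 42  */
        return 0;
    }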

30
Handling writes (II)
  • Write through
  • Write back

[Diagram: with write through, CPU writes go to the cache and immediately to RAM; with write back, the RAM copy is updated later]
31
Pros and cons
  • Write through
  • Ensures that memory is always up to date
  • Expelled cache entries can be overwritten
  • Write back
  • Faster writes
  • Complicates cache expulsion procedure
  • Must write back cache entries that have been
    modified in the cache

32
A better write through (I)
  • Add a small buffer to speed up write performance
    of write-through caches
  • At least four words
  • Holds modified data until they are written into
    main memory
  • Cache can proceed as soon as data are written
    into the write buffer

33
A better write through (II)
  • Write through
  • Better write through

[Diagram: with a write buffer, writes go CPU → Cache → Write buffer → RAM instead of stalling until RAM is updated]
34
Designing RAM to support caches
  • RAM connected to CPU through a "bus"
  • Clock rate much slower than CPU clock rate
  • Assume that a RAM access takes
  • 1 bus clock cycle to send the address
  • 15 bus clock cycles to initiate a read
  • 1 bus clock cycle to send a word of data

35
Designing RAM to support caches
  • Assume
  • Cache block size is 4 words
  • One-word bank of DRAM
  • Fetching a cache block would take
  • 1 + 4×15 + 4×1 = 65 bus clock cycles
  • Transfer rate is 0.25 byte/bus cycle
  • Awful!

36
Designing RAM to support caches
  • Could
  • Have an interleaved memory organization
  • Four one-word banks of DRAM
  • A 32-bit bus

[Diagram: a 32-bit bus connects the CPU and cache to four one-word RAM banks (bank 0 through bank 3)]
37
Designing RAM to support caches
  • Can do the 4 accesses in parallel
  • Must still transmit the block 32 bits by 32 bits
  • Fetching a cache block would take
  • 1 + 15 + 4×1 = 20 bus clock cycles
  • Transfer rate is 0.80 byte/bus cycle
  • Even better
  • Much cheaper than having a 64-bit bus
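
A small C sketch reproducing the two cycle counts above, using the per-phase timings given on the slides:

    #include <stdio.h>

    int main(void) {
        int addr = 1, latency = 15, xfer = 1, words = 4;    /* bus cycles per phase */
        int one_bank   = addr + words * (latency + xfer);   /* 1 + 4*15 + 4*1 = 65  */
        int four_banks = addr + latency + words * xfer;     /* 1 + 15 + 4*1   = 20  */
        printf("%d cycles, %.2f byte/cycle\n", one_bank,   16.0 / one_bank);
        printf("%d cycles, %.2f byte/cycle\n", four_banks, 16.0 / four_banks);
        return 0;
    }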

38
PERFORMANCE ISSUES
39
Memory stalls
  • Can divide CPU time into
  • N_EXEC clock cycles spent executing instructions
  • N_MEM_STALLS cycles spent waiting for memory
    accesses
  • We have
  • CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE

40
Memory stalls
  • We assume that
  • cache access times can be neglected
  • most CPU cycles spent waiting for memory accesses
    are caused by cache misses

41
Global impact
  • We have
  • N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate
    × Cache miss penalty
  • and also
  • N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction)
    × Cache miss penalty
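
The same formula written as a small C helper (the sample instruction count, miss rate, and penalty are illustrative assumptions):

    #include <stdio.h>

    /* N_MEM_STALLS = N_INSTRUCTIONS * (misses per instruction) * miss penalty */
    double memory_stall_cycles(double n_instr, double misses_per_instr, double penalty) {
        return n_instr * misses_per_instr * penalty;
    }

    int main(void) {
        /* e.g. one million instructions, 0.02 misses/instruction, 100-cycle penalty */
        printf("%.0f stall cycles\n", memory_stall_cycles(1e6, 0.02, 100));
        return 0;
    }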

42
Example
  • Miss rate of instruction cache is 2 percent
  • Miss rate of data cache is 5 percent
  • In the absence of memory stalls, each instruction
    would take 2 cycles
  • Miss penalty is 100 cycles
  • 40 percent of instructions access the main memory
  • How many cycles are lost due to cache misses?

43
Solution (I)
  • Impact of instruction cache misses
  • 0.02 × 100 = 2 cycles/instruction
  • Impact of data cache misses
  • 0.40 × 0.05 × 100 = 2 cycles/instruction
  • Total impact of cache misses
  • 2 + 2 = 4 cycles/instruction

44
Solution (II)
  • Average number of cycles per instruction
  • 2 + 4 = 6 cycles/instruction
  • Fraction of time wasted
  • 4/6 ≈ 67 percent

45
Average memory access time
  • Some authors call it AMAT
  • T_AVERAGE = T_CACHE + f × T_MISS
  • where f is the cache miss rate
  • Times can be expressed
  • In nanoseconds
  • In number of cycles
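
A small C helper for this formula (the sample values come from the example on the next slides):

    #include <stdio.h>

    /* T_AVERAGE = T_CACHE + f * T_MISS */
    double amat(double t_cache, double miss_rate, double t_miss) {
        return t_cache + miss_rate * t_miss;
    }

    int main(void) {
        printf("%.1f cycles\n", amat(1.0, 0.04, 100.0));   /* 5.0 cycles */
        return 0;
    }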

46
Example
  • A cache has a hit rate of 96 percent
  • Accessing data
  • In the cache requires one cycle
  • In the memory requires 100 cycles
  • What is the average memory access time?

47
Solution
  • Miss rate = 1 - Hit rate = 0.04
  • Applying the formula
  • T_AVERAGE = 1 + 0.04 × 100 = 5 cycles

48
In other words
It's the miss rate, stupid!
49
Improving cache hit rate
  • Two complementary techniques
  • Using set-associative caches
  • Must check tags of all blocks with the same index
    values
  • Slower
  • Have fewer collisions
  • Fewer misses
  • Use a cache hierarchy

50
A cache hierarchy
  • Topmost cache
  • Optimized for speed, not miss rate
  • Rather small
  • Uses a small block size
  • As we go down the hierarchy
  • Cache sizes increase
  • Block sizes increase
  • Cache associativity level increases

51
Example
  • Cache miss rate per instruction is 3 percent
  • In the absence of memory stalls, each instruction
    would take one cycle
  • Cache miss penalty is 100 ns
  • Clock rate is 4 GHz
  • How many cycles are lost due to cache misses?

52
Solution (I)
  • Duration of clock cycle
  • 1/(4 GHz) = 0.25 × 10^-9 s = 0.25 ns
  • Cache miss penalty
  • 100 ns = 400 cycles
  • Total impact of cache misses
  • 0.03 × 400 = 12 cycles/instruction

53
Solution (II)
  • Average number of cycles per instruction
  • 1 + 12 = 13 cycles/instruction
  • Fraction of time wasted
  • 12/13 ≈ 92 percent

A very good case for hardware multithreading
54
Example (cont'd)
  • How much faster would the processor be if we added
    an L2 cache that
  • Has a 5 ns access time
  • Would reduce miss rate to main memory to one
    percent?

55
Solution (I)
  • L2 cache access time
  • 5 ns = 20 cycles
  • Impact of cache misses per instruction
  • L1 cache misses + L2 cache misses
    = 0.03 × 20 + 0.01 × 400 = 0.6 + 4.0 = 4.6
    cycles/instruction
  • Average number of cycles per instruction
  • 1 + 4.6 = 5.6 cycles/instruction

56
Solution (II)
  • Fraction of time wasted
  • 4.6/5.6 ≈ 82 percent
  • CPU speedup
  • 13/5.6 ≈ 2.3
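
A small C sketch of the cycles-per-instruction computation with an L2 cache, using the values from this example:

    #include <stdio.h>

    /* Cycles per instruction with an L2 cache: every L1 miss pays the L2 access
       time, and the (global) L2 misses additionally pay the main-memory penalty. */
    double cpi_with_l2(double base_cpi, double l1_miss, double l2_cycles,
                       double l2_miss, double mem_cycles) {
        return base_cpi + l1_miss * l2_cycles + l2_miss * mem_cycles;
    }

    int main(void) {
        printf("%.1f cycles/instruction\n",
               cpi_with_l2(1, 0.03, 20, 0.01, 400));   /* 1 + 0.6 + 4.0 = 5.6 */
        return 0;
    }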

57
Problem
  • Redo the second part of the example assuming that
    the secondary cache
  • Has a 3 ns access time
  • Can reduce miss rate to main memory to one
    percent?

58
Solution
  • Fraction of time wasted
  • 86 percent
  • CPU speedup
  • 1.22

A new L2 cache with a lower access time but a
higher miss rate performs much worse than the first
L2 cache
59
Example
  • A virtual memory has a page fault rate of 10^-4
    faults per memory access
  • Accessing data
  • In the memory requires 100 ns
  • On disk requires 5 ms
  • What is the average memory access time?
  • T_avg = 100 ns + 10^-4 × 5 ms = 600 ns

60
The cost of a page fault
  • Let
  • Tm be the main memory access time
  • Td the disk access time
  • f the page fault rate
  • Ta the average access time of the VM
  • T_a = (1 - f) T_m + f (T_m + T_d) = T_m + f T_d

61
Example
  • Assume T_m = 50 ns and T_d = 5 ms

f       Mean memory access time
10^-3   50 ns + 5 ms/10^3 = 5,050 ns
10^-4   50 ns + 5 ms/10^4 = 550 ns
10^-5   50 ns + 5 ms/10^5 = 100 ns
10^-6   50 ns + 5 ms/10^6 = 55 ns
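
A small C sketch reproducing the table above (T_m, T_d, and the fault rates are the slide's values):

    #include <stdio.h>

    int main(void) {
        double t_m = 50.0, t_d = 5e6;               /* 50 ns RAM, 5 ms = 5e6 ns disk */
        double rates[] = {1e-3, 1e-4, 1e-5, 1e-6};
        for (int i = 0; i < 4; i++)                 /* T_a = T_m + f * T_d           */
            printf("f = %g: %.0f ns\n", rates[i], t_m + rates[i] * t_d);
        return 0;
    }
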
62
In other words
It's the page fault rate, stupid!
63
Locality principle (I)
  • A process that would access its pages in a
    totally unpredictable fashion would perform very
    poorly in a VM system unless all its pages are in
    main memory

64
Locality principle (II)
  • Process P accesses randomly a very large array
    consisting of n pages
  • If m of these n pages are in main memory, the
    page fault frequency of the process will be
    (n - m)/n
  • Must switch to another algorithm

65
First problem
  • A virtual memory system has
  • 32 bit addresses
  • 4 KB pages
  • What are the sizes of the
  • Page number field?
  • Offset field?

66
Solution (I)
  • Step 1: Convert page size to a power of 2: 4 KB =
    2^12 B
  • Step 2: The exponent is the length of the offset field

67
Solution (II)
  • Step 3: Size of page number field = Address size
    - Offset size. Here 32 - 12 = 20 bits

12 bits for the offset and 20 bits for the page
number
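
A small C sketch of these three steps (32-bit addresses and 4 KB pages as in the problem):

    #include <stdio.h>

    int main(void) {
        int address_bits = 32, page_size = 4096;    /* 4 KB pages                */
        int offset_bits = 0;
        while ((1 << offset_bits) < page_size)      /* page size = 2^offset_bits */
            offset_bits++;
        printf("offset: %d bits, page number: %d bits\n",
               offset_bits, address_bits - offset_bits);   /* 12 and 20 */
        return 0;
    }
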
68
MEMORY PROTECTION
69
Objective
  • Unless we have an isolated single-user system, we
    must prevent users from
  • Accessing
  • Deleting
  • Modifying
  • the address spaces of other processes, including
    the kernel

70
Memory protection (I)
  • VM ensures that processes cannot access page
    frames that are not referenced in their page
    table.
  • Can refine control by distinguishing among
  • Read access
  • Write access
  • Execute access
  • Must also prevent processes from modifying their
    own page tables

71
Dual-mode CPU
  • Require a dual-mode CPU
  • Two CPU modes
  • Privileged mode or executive mode that allows
    CPU to execute all instructions
  • User mode that allows CPU to execute only safe
    unprivileged instructions
  • State of the CPU is determined by a special bit

72
Switching between states
  • User mode will be the default mode for all
    programs
  • Only the kernel can run in supervisor mode
  • Switching from user mode to supervisor mode is
    done through an interrupt
  • Safe because the jump address is at a
    well-defined location in main memory

73
Memory protection (II)
  • Has additional advantages
  • Prevents programs from corrupting address spaces
    of other programs
  • Prevents programs from crashing the kernel
  • Not true for device drivers which are inside the
    kernel
  • Required part of any multiprogramming system

74
INTEGRATING CACHES AND VM
75
The problem
  • In a VM system, each byte of memory has two
    addresses
  • A virtual address
  • A physical address
  • Should cache tags contain virtual addresses or
    physical addresses?

76
Discussion
  • Using virtual addresses
  • Directly available
  • Bypass TLB
  • Cache entries specific to a given address space
  • Must flush caches when the OS selects another
    process
  • Using physical addresses
  • Must first access the TLB
  • Cache entries not specific to a given address
    space
  • Do not have to flush caches when the OS selects
    another process

77
The best solution
  • Let the cache use physical addresses
  • No need to flush the cache at each context switch
  • TLB access delay is tolerable

78
VIRTUAL MACHINES
79
Key idea
  • Let different operating systems run at the same
    time on a single computer
  • Windows, Linux and Mac OS
  • A real-time OS and a conventional OS
  • A production OS and a new OS being tested

80
How it is done
  • A hypervisor /VM monitor defines two or more
    virtual machines
  • Each virtual machine has
  • Its own virtual CPU
  • Its own virtual physical memory
  • Its own virtual disk(s)

81
Two virtual machines
[Diagram: the two virtual machines run in user mode; the hypervisor runs in privileged mode]
82
Translating a block address
[Diagram: the VM kernel asks to access block x, y of its virtual disk; the hypervisor translates this into block v, w of the actual disk and issues the access to the actual disk]
83
Handling I/Os
  • Difficult task because
  • Wide variety of devices
  • Some devices may be shared among several VMs
  • Printers
  • Shared disk partition
  • Want to let Linux and Windows access the same
    files

84
Virtual Memory Issues
  • Each VM kernel manages its own memory
  • Its page tables map program virtual addresses
    into pseudo-physical addresses
  • It treats these addresses as physical addresses

85
The dilemma
[Diagram: the VM kernel believes page 735 of user process A is stored in page frame 435; the hypervisor knows that frame is really page frame 993 of the actual RAM]
86
The solution (I)
  • Address translation must remain fast!
  • The hypervisor lets each VM kernel manage its own
    page tables but does not use them
  • They contain bogus mappings!
  • It maintains instead its own shadow page tables
    with the correct mappings
  • Used to handle TLB misses

87
The solution (II)
  • To keep its shadow page tables up to date,
    hypervisor must track any changes made by the VM
    kernels
  • Mark page tables read-only

88
Nastiest Issue
  • The whole VM approach assumes that a kernel
    executing in user mode will behave exactly like a
    kernel executing in privileged mode
  • Not true for all architectures!
  • Intel x86 Pop flags (POPF) instruction

89
Solutions
  • Modify the instruction set and eliminate
    instructions like POPF
  • IBM redesigned the instruction set of their 360
    series for the 370 series
  • Mask it through clever software
  • Dynamic "binary translation" when direct
    execution of code could not work(VMWare)

90
CACHE CONSISTENCY
91
The problem
  • Specific to architectures with
  • Several processors sharing the same main memory
  • Multicore architectures
  • Each core/processor has its own private cache
  • A must for performance
  • Happens when same data are present in two or more
    private caches

92
An example (I)
RAM
93
An example (II)
Increments x
Still assumes x = 0
RAM
94
An example
Both CPUs must apply the two updates in the same
order
Sets x to 1
Resets x to 0
RAM
95
Rules
  • Whenever a processor accesses a variable, it
    always gets the value stored by the processor
    that updated that variable last, if the updates
    are sufficiently separated in time
  • A processor accessing a variable sees all updates
    applied to that variable in the same order
  • No compromise is possible here

96
A realization Snoopy caches
  • All caches are linked to the main memory through
    a shared bus
  • All caches observe the writes performed by other
    caches
  • When a cache notices that another cache performs
    a write on a memory location that it has in its
    cache, it invalidates the corresponding cache
    block

97
An example (I)
Fetches x = 2
RAM
98
An example (II)
Also fetches x
RAM
99
An example (III)
Resets x to 0
RAM
100
An example (IV)
Performs write-through
Detects write-through and invalidates its copy
of x
RAM
101
An example (IV)
When the CPU wants to access x, the cache gets the
correct value from RAM
RAM
102
A last correctness condition
  • Caches cannot reorder their memory updates
  • The cache-to-RAM buffer must be FIFO
  • First in first out

103
Miscellaneous fallacies
  • Segmented address spaces
  • Address is segment number + offset in segment
  • Programmers hate them
  • Ignoring virtual memory behavior when accessing
    large two-dimensional arrays

104
Miscellaneous fallacies
  • Segmented address spaces
  • Address is segment number + offset in segment
  • Programmers hate them
  • Ignoring virtual memory behavior when accessing
    large two-dimensional arrays
  • Believing that you can virtualize any CPU
    architecture

105
DEPENDABILITY
106
Reliability and Availability
  • Reliability
  • Probability R(t) that system will be up at time
    t if it was up at time t = 0
  • Availability
  • Fraction of time the system is up
  • Reliability and availability do not measure the
    same thing!

107
MTTF, MTTR and MTBF
  • MTTF is mean time to failure
  • MTTR is mean time to repair
  • 1/MTTF is the failure rate λ
  • MTBF, the mean time between failures, is
  • MTBF = MTTF + MTTR

108
Reliability
  • As a first approximation R(t) = exp(-t/MTTF)
  • Not true if failure rate varies over time

109
Availability
  • Measured by MTTF/(MTTF + MTTR) = MTTF/MTBF
  • MTTR is very important

110
Example
  • A server crashes on the average once a month
  • When this happens, it takes twelve hours to reboot
    it
  • What is the server availability ?

111
Solution
  • MTBF 30 days
  • MTTR ½ day
  • MTTF 29 ½ days
  • Availability is 29.5/30 ≈ 98.3 percent

112
Example
  • A disk drive has a MTTF of 20 years.
  • What is the probability that the data it contains
    will not be lost over a period of five years?

113
Example
  • A disk farm contains 100 disks whose MTTF is 20
    years.
  • What is the probability that no data will be
    lost over a period of five years?

114
Solution
  • The aggregate failure rate of the disk farm is
  • 100 × 1/20 = 5 failures per year
  • The mean time to failure of the farm is
  • 1/5 year
  • We apply the formula
  • R(t) = exp(-t/MTTF) = exp(-25) ≈ 1.4 × 10^-11
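
A small C sketch of this computation, which also answers the single-disk question above (assuming a constant failure rate):

    #include <stdio.h>
    #include <math.h>

    /* R(t) = exp(-t / MTTF), assuming a constant failure rate */
    double reliability(double t_years, double mttf_years) {
        return exp(-t_years / mttf_years);
    }

    int main(void) {
        printf("single disk over 5 years:   %.2f\n", reliability(5, 20));    /* ~0.78    */
        printf("100-disk farm over 5 years: %.1e\n", reliability(5, 0.2));   /* ~1.4e-11 */
        return 0;
    }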

115
RAID Arrays
116
Today's Motivation
  • We use RAID today for
  • Increasing disk throughput by allowing parallel
    access
  • Eliminating the need to make disk backups
  • Disks are too big to be backed up in an efficient
    fashion

117
RAID LEVEL 0
  • No replication
  • Advantages
  • Simple to implement
  • No overhead
  • Disadvantage
  • If the array has n disks, its failure rate is n times
    the failure rate of a single disk

118
RAID levels 0 and 1
[Diagram: RAID levels 0 and 1; level 1 adds mirror disks]
119
RAID LEVEL 1
  • Mirroring
  • Two copies of each disk block
  • Advantages
  • Simple to implement
  • Fault-tolerant
  • Disadvantage
  • Requires twice the disk capacity of normal file
    systems

120
RAID LEVEL 2
  • Instead of duplicating the data blocks we use an
    error correction code
  • Very bad idea because disk drives either work
    correctly or do not work at all
  • Only possible errors are omission errors
  • We need an omission correction code
  • A parity bit is enough to correct a single
    omission

121
RAID levels 2 and 3
[Diagram: RAID level 2 uses several check disks; RAID level 3 uses a single parity disk]
122
RAID LEVEL 3
  • Requires N + 1 disk drives
  • N drives contain data (1/N of each data block)
  • Block b_k is now partitioned into N fragments
    b_k,1, b_k,2, ..., b_k,N
  • Parity drive contains the exclusive or of these N
    fragments
  • p_k = b_k,1 ⊕ b_k,2 ⊕ ... ⊕ b_k,N
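
A small C sketch of parity computation and single-fragment recovery (the fragment values are illustrative assumptions):

    #include <stdio.h>

    #define N 3                                  /* number of data fragments */

    int main(void) {
        unsigned char frag[N] = {0x01, 0x0F, 0xF0}, parity = 0;
        for (int i = 0; i < N; i++)
            parity ^= frag[i];                   /* p = b1 XOR b2 XOR ... XOR bN */
        /* recover a lost fragment by XOR-ing the parity with the survivors */
        unsigned char recovered = parity ^ frag[1] ^ frag[2];
        printf("parity 0x%02X, recovered 0x%02X\n", parity, recovered);   /* 0xFE, 0x01 */
        return 0;
    }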

123
How parity works?
  • Truth table for XOR (same as parity)

A  B  A⊕B
0 0 0
0 1 1
1 0 1
1 1 0
124
Recovering from a disk failure
  • Small RAID level 3 array with data disks D0 and
    D1 and parity disk P can tolerate failure of
    either D0 or D1

D0 D1 P
0 0 0
0 1 1
1 0 1
1 1 0
D1⊕P (=D0)   D0⊕P (=D1)
0 0
0 1
1 0
1 1
125
How RAID level 3 works (I)
  • Assume we have N + 1 disks
  • Each block is partitioned into N equal chunks

N = 4 in example
126
How RAID level 3 works (II)
  • XOR data chunks to compute the parity chunk
  • Each chunk is written into a separate disk

Parity
127
How RAID level 3 works (III)
  • Each read/write involves all disks in RAID array
  • Cannot do two or more reads/writes in parallel
  • Performance of the array is not better than that of
    a single disk

128
RAID LEVEL 4 (I)
  • Requires N + 1 disk drives
  • N drives contain data
  • Individual blocks, not chunks
  • Blocks with same disk address form a stripe

129
RAID LEVEL 4 (II)
  • Parity drive contains the exclusive or of the N
    blocks in the stripe
  • p_k = b_k ⊕ b_k+1 ⊕ ... ⊕ b_k+N-1
  • Parity block now reflects contents of several
    blocks!
  • Can now do parallel reads/writes

130
RAID levels 4 and 5
[Diagram: in RAID level 4 the single parity disk is a bottleneck; RAID level 5 spreads the parity blocks over all drives]
131
RAID LEVEL 5
  • Single parity drive of RAID level 4 is involved
    in every write
  • Will limit parallelism
  • RAID level 5 distributes the parity blocks among
    the N + 1 drives
  • Much better

132
The small write problem
  • Specific to RAID 5
  • Happens when we want to update a single block
  • Block belongs to a stripe
  • How can we compute the new value of the parity
    block

[Diagram: a stripe consisting of blocks b_k, b_k+1, b_k+2, ... and its parity block p_k]
133
First solution
  • Read values of N-1 other blocks in stripe
  • Recompute
  • p_k = b_k ⊕ b_k+1 ⊕ ... ⊕ b_k+N-1
  • Solution requires
  • N-1 reads
  • 2 writes (new block and new parity block)

134
Second solution
  • Assume we want to update block b_m
  • Read old values of b_m and parity block p_k
  • Compute
  • new p_k = new b_m ⊕ old b_m ⊕ old p_k
  • Solution requires
  • 2 reads (old values of block and parity block)
  • 2 writes (new block and new parity block)
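
A small C sketch of this parity update (the byte values are illustrative assumptions):

    #include <stdio.h>

    int main(void) {
        unsigned char old_block = 0x0F, new_block = 0xF0, old_parity = 0xAA;
        /* new parity = new block XOR old block XOR old parity (2 reads, 2 writes) */
        unsigned char new_parity = new_block ^ old_block ^ old_parity;
        printf("new parity 0x%02X\n", new_parity);   /* 0x55 */
        return 0;
    }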

135
RAID level 6 (I)
  • Not part of the original proposal
  • Two check disks
  • Tolerates two disk failures
  • More complex updates

136
RAID level 6 (II)
  • Has become more popular as disks are becoming
  • Bigger
  • More vulnerable to irrecoverable read errors
  • Most frequent cause for RAID level 5 array
    failures is
  • Irrecoverable read error occurring while
    contents of a failed disk are reconstituted

137
CONNECTING I/O DEVICES
138
Busses
  • Connecting computer subsystems with each other
    was traditionally done through busses
  • A bus is a shared communication link connecting
    multiple devices
  • Transmit several bits at a time
  • Parallel buses

139
Busses
140
Examples
  • Processor-memory busses
  • Connect CPU with memory modules
  • Short and high-speed
  • I/O busses
  • Longer
  • Wide range of data bandwidths
  • Connect to memory through processor-memory bus or
    backplane bus

141
Synchronous busses
  • Include a clock in the control lines
  • Bus protocols expressed in actions to be taken at
    each clock pulse
  • Have very simple protocols
  • Disadvantages
  • All bus devices must run at same clock rate
  • Due to clock skew issues, cannot be both fast
    and long

142
Asynchronous busses
  • Have no clock
  • Can accommodate a wide variety of devices
  • Have no clock skew issues
  • Require a handshaking protocol before any
    transmission
  • Implemented with extra control lines

143
Advantages of busses
  • Cheap
  • One bus can link many devices
  • Flexible
  • Can add devices

144
Disadvantages of busses
  • Shared devices
  • can become bottlenecks
  • Hard to run many parallel lines at high clock
    speeds

145
New trend
  • Away from parallel shared buses
  • Towards serial point-to-point switched
    interconnections
  • Serial
  • One bit at a time
  • Point-to-point
  • Each line links a specific device to another
    specific device

146
x86 bus organization
  • Processor connects to peripherals through two
    chips (bridges)
  • North Bridge
  • South Bridge

147
x86 bus organization
North Bridge
South Bridge
148
North bridge
  • Essentially a DMA controller
  • Lets disk controller access main memory w/o any
    intervention of the CPU
  • Connects CPU to
  • Main memory
  • Optional graphics card
  • South Bridge

149
South Bridge
  • Connects North bridge to a wide variety of I/O
    busses

150
Communicating with I/O devices
  • Two solutions
  • Memory-mapped I/O
  • Special I/O instructions

151
Memory mapped I/O
  • A portion of the address space reserved for I/O
    operations
  • Writes to any of these addresses are interpreted
    as I/O commands
  • Reading from these addresses gives access to
  • Error bit
  • I/O completion bit
  • Data being read

152
Memory mapped I/O
  • User processes cannot access these addresses
  • Only the kernel
  • Prevents user processes from accessing the disk
    in an uncontrolled fashion

153
Dedicated I/O instructions
  • Privileged instructions that cannot be executed by
    user processes
  • Only the kernel can issue them
  • Prevents user processes from accessing the disk
    in an uncontrolled fashion

154
Polling
  • Simplest way for an I/O device to communicate
    with the CPU
  • CPU periodically checks the status of pending I/O
    operations
  • High CPU overhead

155
I/O completion interrupts
  • Notify the CPU that an I/O operation has
    completed
  • Allows the CPU to do something else while waiting
    for the completion of an I/O operation
  • Multiprogramming
  • I/O completion interrupts are processed by CPU
    between instructions
  • No internal instruction state to save

156
Interrupts levels
  • See previous chapter

157
Direct memory access
  • DMA
  • Lets disk controller access main memory w/o any
    intervention of the CPU

158
DMA and virtual memory
  • A single DMA transfer may cross page boundaries
    with
  • One page being in main memory
  • One missing page

159
Solutions
  • Make DMA work with virtual addresses
  • Issue is then dealt with by the virtual memory
    subsystem
  • Break DMA transfers crossing page boundaries into
    chains of transfers that do not cross page
    boundaries

160
An Example
[Diagram: a DMA transfer that crosses a page boundary is broken into two DMA transfers, one per page]
161
DMA and cache hierarchy
  • Three approaches for handling temporary
    inconsistencies between caches and main memory

162
Solutions
  • Routing all DMA accesses through the cache
  • Bad solution
  • Have the OS selectively
  • Invalidate affected cache entries when
    performing a read
  • Force an immediate flush of dirty cache entries
    when performing a write
  • Have specific hardware do the same

163
Benchmarking I/O
164
Benchmarks
  • Specific benchmarks for
  • Transaction processing
  • Emphasis on speed and graceful recovery from
    failures
  • Atomic transactions
  • All or nothing behavior

165
An important observation
  • Very difficult to operate a disk subsystem at a
    reasonable fraction of its maximum throughput
  • Unless we access sequentially very large ranges
    of data
  • 512 KB and more

166
Major fallacies
  • Since rated MTTFs of disk drives exceed one
    million hours, disks can last more than 100 years
  • MTTF expresses the failure rate during the disk's
    actual lifetime
  • Disk failure rates in the field match the MTTFs
    mentioned in the manufacturers' literature
  • They are up to ten times higher

167
Major fallacies
  • Neglecting to do end-to-end checks
  • Using magnetic tapes to back up disks
  • Tape formats can become quickly obsolescent
  • Disk bit densities have grown much faster than
    tape data densities.

168
WRITING PARALLEL PROGRAMS
169
Overview
  • Some problems are embarrassingly parallel
  • Many computer graphics tasks
  • Brute force searches in cryptography or password
    guessing
  • Much more difficult for other applications
  • Communication overhead among sub-tasks
  • Amdahl's law
  • Balancing the load

170
Amdahl's Law
  • Assume a sequential process takes
  • tp seconds to perform operations that could be
    performed in parallel
  • ts seconds to perform purely sequential
    operations
  • The maximum speedup will be
  • (t_p + t_s)/t_s
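
A small C sketch of this bound (the 90/10 split of the workload is an illustrative assumption):

    #include <stdio.h>

    /* Amdahl's law upper bound on speedup: (t_p + t_s) / t_s */
    double max_speedup(double t_parallel, double t_sequential) {
        return (t_parallel + t_sequential) / t_sequential;
    }

    int main(void) {
        printf("%.1fx\n", max_speedup(90, 10));   /* at best 10x when 10 percent is sequential */
        return 0;
    }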

171
Balancing the load
  • Must ensure that workload is equally divided
    among all the processors
  • Worst case is when one of the processors does
    much more work than all others

172
A last issue
  • Humans like to address issues one after the
    other
  • We have meeting agendas
  • We do not like to be interrupted
  • We write sequential programs

173
MULTI PROCESSOR ORGANIZATIONS
174
Shared memory multiprocessors

[Diagram: multiple processors connected to shared RAM and I/O through an interconnection network]
175
Shared memory multiprocessor
  • Can offer
  • Uniform memory access to all processors (UMA)
  • Easiest to program
  • Non-uniform memory access to all
    processors (NUMA)
  • Can scale up to larger sizes
  • Offer faster access to nearby memory

176
Computer clusters

[Diagram: independent computers connected by an interconnection network]
177
Computer clusters
  • Very easy to assemble
  • Can take advantage of high-speed LANs
  • Gigabit Ethernet, Myrinet,
  • Data exchanges must be done through message
    passing

178
HARDWARE MULTITHREADING
179
General idea
  • Let the processor switch to another thread of
    computation while the current one is stalled
  • Motivation
  • Increased cost of cache misses

180
Implementation
  • Entirely controlled by the hardware
  • Unlike multiprogramming
  • Requires a processor capable of
  • Keeping track of the state of each thread
  • One set of registers, including the PC, for each
    concurrent thread
  • Quickly switching among concurrent threads

181
Approaches
  • Fine-grained multithreading
  • Switches between threads for each instruction
  • Provides highest throughputs
  • Slows down execution of individual threads

182
Approaches
  • Coarse-grained multithreading
  • Switches between threads whenever a long stall
    is detected
  • Easier to implement
  • Cannot eliminate all stalls

183
Approaches
  • Simultaneous multi-threading
  • Takes advantage of the possibility of modern
    hardware to perform different tasks in parallel
    for instructions of different threads
  • Best solution

184
ALPHABET SOUP
185
Classification
  • SISD
  • Single instruction, single data
  • Conventional uniprocessor architecture
  • MIMD
  • Multiple instructions, multiple data
  • Conventional multiprocessor architecture

186
Classification
  • SIMD
  • Single instruction, multiple data
  • Perform same operations on a set of similar data
  • Think of adding two vectors
  • for (i = 0; i < VECSIZE; i++) sum[i] = a[i] +
    b[i];

187
PERFORMANCE ISSUES
188
Roofline model
  • Takes into account
  • Memory bandwidth
  • Floating-point performance
  • Introduces arithmetic intensity
  • Total number of floating point operations in a
    program divided by total number of bytes
    transferred to main memory
  • Measured in FLOPS/byte

189
Roofline model
  • Attainable GFLOPs/s = Min(Peak Memory
    BW × Arithmetic Intensity, Peak
    Floating-Point Performance)
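
A small C sketch of this formula (the bandwidth and peak-performance numbers are illustrative assumptions):

    #include <stdio.h>

    /* Attainable GFLOPs/s = min(peak memory BW * arithmetic intensity, peak FP) */
    double attainable_gflops(double peak_bw, double intensity, double peak_fp) {
        double memory_bound = peak_bw * intensity;
        return memory_bound < peak_fp ? memory_bound : peak_fp;
    }

    int main(void) {
        printf("%.1f GFLOPs/s\n", attainable_gflops(25, 0.5, 100));   /* 12.5: memory-bound   */
        printf("%.1f GFLOPs/s\n", attainable_gflops(25, 8.0, 100));   /* 100.0: compute-bound */
        return 0;
    }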

190
Roofline model
[Diagram: the roofline plot; the flat roof is the peak floating-point performance, and at low arithmetic intensity floating-point performance is limited by memory bandwidth]