Title: Cache Performance, Interfacing, Multiprocessors (CPSC 321)
1. Cache Performance, Interfacing, Multiprocessors
CPSC 321
2. Today's Menu
- Cache Performance
- Review of Virtual Memory
- Processor and Peripherals
- Multiprocessors
3. Cache Performance
4. Caching Basics
- What are the different cache placement schemes?
- direct mapped
- set associative
- fully associative
- Explain how a 2-way set-associative cache with 4 sets works
- To read the memory block at address addr, we search the set addr mod 4
- The memory block could be in either element of the set
- Compare the tags with the upper n-2 bits of addr
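The lookup described above can be sketched in code. This is a minimal illustration of a 2-way set-associative cache with 4 sets; the `Line` structure, the `fill` helper, and the use of block addresses (rather than byte addresses with a block offset) are simplifying assumptions, not from the slides.

```python
# Sketch of a lookup in a 2-way set-associative cache with 4 sets.
# Block addresses are used directly; replacement policy is omitted.

NUM_SETS = 4
WAYS = 2

class Line:
    def __init__(self):
        self.valid = False
        self.tag = None

# cache[set_index] is a list of WAYS lines
cache = [[Line() for _ in range(WAYS)] for _ in range(NUM_SETS)]

def lookup(block_addr):
    """Return True on hit, False on miss."""
    set_index = block_addr % NUM_SETS    # addr mod 4 selects the set
    tag = block_addr // NUM_SETS         # the upper bits of the address
    for line in cache[set_index]:        # block may be in either way
        if line.valid and line.tag == tag:
            return True
    return False

def fill(block_addr, way):
    """Install a block into a given way of its set (illustrative helper)."""
    set_index = block_addr % NUM_SETS
    line = cache[set_index][way]
    line.valid = True
    line.tag = block_addr // NUM_SETS

fill(13, 0)          # block 13 maps to set 13 mod 4 = 1, tag 3
print(lookup(13))    # True  (hit)
print(lookup(5))     # False (block 5 also maps to set 1, different tag)
```

Note that blocks 13 and 5 land in the same set but differ in their tags, which is exactly why the tag comparison is needed.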
5. Implementation of a Cache
- Sketch an implementation of a 4-way set-associative cache
6. Measuring Cache Performance
- CPU cycle time
- CPU execution clock cycles (including cache hits)
- Memory-stall clock cycles (cache misses)
- CPU time = (CPU execution clock cycles + memory-stall clock cycles) x clock cycle time
- Memory-stall clock cycles
  - read-stall cycles (rsc)
  - write-stall cycles (wsc)
- Memory-stall clock cycles = rsc + wsc
7. Measuring Cache Performance
- Write-stall cycles, write-through scheme
- two sources of stalls
  - write misses (usually require fetching the block)
  - write buffer stalls (write buffer is full when a write occurs)
- WSCs are the sum of the two
- WSC = (writes/program x write miss rate x write miss penalty) + write buffer stalls
- Read-stall clock cycles are computed similarly
8. Cache Performance Example
- Instruction cache miss rate 2%
- Data miss rate 4%
- Assume a CPI of 2 without any memory stalls
- Miss penalty 40 cycles for all misses
- Instruction count I
- Instruction miss cycles = I x 2% x 40 = 0.80 x I
- gcc has 36% loads and stores
- Data miss cycles = I x 36% x 4% x 40 = 0.58 x I
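The example's arithmetic can be checked directly. The miss rates, gcc's 36% load/store fraction, and the 40-cycle penalty are from the slide; the final CPI line is a straightforward continuation of the same numbers, not stated on the slide.

```python
# Worked version of the cache performance example.

I = 1.0                      # normalize instruction count to 1
inst_miss_rate = 0.02        # 2% instruction cache miss rate
data_miss_rate = 0.04        # 4% data cache miss rate
loads_stores = 0.36          # 36% of gcc's instructions access data
miss_penalty = 40            # cycles per miss
base_cpi = 2.0               # CPI with a perfect cache

inst_miss_cycles = I * inst_miss_rate * miss_penalty                 # 0.80 I
data_miss_cycles = I * loads_stores * data_miss_rate * miss_penalty  # 0.576 I

total_cpi = base_cpi + inst_miss_cycles + data_miss_cycles
print(round(inst_miss_cycles, 3))   # 0.8
print(round(data_miss_cycles, 3))   # 0.576
print(round(total_cpi, 3))          # 3.376
```

The 0.58 on the slide is 0.576 rounded; memory stalls alone add 1.376 cycles per instruction, nearly doubling the base CPI of 2.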
9. Review of Virtual Memory
10. Virtual Memory
- The processor generates virtual addresses
- Memory is accessed using physical addresses
- Virtual and physical memory are broken into blocks of memory, called pages
- A virtual page may be absent from main memory, residing on disk, or may be mapped to a physical page
11. Page Tables
The page table maps each page to either a page in
main memory or to a page stored on disk
12. Pages: virtual memory blocks
- Page faults: if data is not in memory, retrieve it from disk
  - huge miss penalty, thus pages should be fairly large (e.g., 4KB)
  - reducing page faults is important (LRU is worth the price)
  - using write-through takes too long, so we use write-back
- Example: page size 2^12 = 4KB, 2^18 physical pages
  - main memory ≤ 1GB, virtual memory ≤ 4GB
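The example's sizes follow directly from the page size; this quick check assumes a 32-bit virtual address space, which is consistent with the 4GB figure.

```python
# Arithmetic behind the example: 4KB pages (2^12 bytes) and 2^18
# physical pages give exactly 1GB of main memory; a 4GB (32-bit)
# virtual address space holds 2^20 virtual pages.

page_size = 2**12                   # 4KB per page
physical_pages = 2**18
main_memory = page_size * physical_pages
print(main_memory == 2**30)         # True: exactly 1GB

virtual_memory = 2**32              # 4GB virtual address space
virtual_pages = virtual_memory // page_size
print(virtual_pages == 2**20)       # True
```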
13. Page Faults
- Incredibly high penalty for a page fault
- Reduce the number of page faults by optimizing page placement
- Use fully associative placement
  - a full search of pages is impractical
  - pages are located by a full table that indexes the memory, called the page table
  - the page table resides within the memory
14. Making Memory Access Fast
- Page tables slow us down
- Memory access will take at least twice as long
  - access the page table in memory
  - access the page
- What can we do?
  Memory access is local => use a cache that keeps track of recently used address translations, called a translation lookaside buffer
15. Making Address Translation Fast
- A cache for address translations: the translation lookaside buffer (TLB)
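The TLB idea can be sketched in a few lines. This is a toy model, not a hardware description: the page table is a plain dict from virtual page number to physical page number, the TLB has no capacity limit or replacement policy, and page faults are omitted.

```python
# Minimal sketch of address translation with a TLB.

PAGE_SIZE = 4096  # 4KB pages, as in the earlier example

page_table = {0: 7, 1: 3, 2: 9}   # virtual page -> physical page (toy data)
tlb = {}                           # cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                 # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    else:                          # TLB miss: extra memory access for the
        ppn = page_table[vpn]      # page table (page faults omitted)
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset

print(translate(4100))   # vpn 1, offset 4 -> 3*4096 + 4 = 12292
print(translate(4200))   # vpn 1 again: served from the TLB this time
```

Because memory access is local, most translations hit in the TLB and avoid the extra page-table access that would otherwise double the memory access time.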
16. Processors and Peripherals
17. Collection of I/O Devices
- Communication between I/O devices, processor, and memory uses protocols on the bus and interrupts
18. Impact of I/O on Performance
- A benchmark executes in 100 seconds
  - 90 seconds CPU time
  - 10 seconds I/O time
- If the CPU improves 50% per year for the next 5 years, how much faster does the benchmark run in 5 years?
  90/(1.5)^5 = 90/7.59 = 11.85 s CPU time
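Completing the computation above (the final speedup is a straightforward continuation of the slide's numbers): the I/O time stays fixed at 10 seconds, so it limits the overall improvement.

```python
# Impact of I/O on performance: CPU improves, I/O does not.

cpu_time = 90 / 1.5**5          # 50%/year improvement for 5 years
total_after = cpu_time + 10     # I/O time is unchanged
speedup = 100 / total_after

print(round(cpu_time, 2))     # 11.85
print(round(total_after, 2))  # 21.85
print(round(speedup, 2))      # 4.58
```

Despite a 7.6x faster CPU, the benchmark only speeds up by about 4.6x, an instance of Amdahl's law.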
19. I/O Devices
- Very diverse devices
  - behavior (i.e., input vs. output)
  - partner (who is at the other end?)
  - data rate
20. Communicating with the Processor
- Polling
  - simple
  - I/O device puts information in a status register
  - processor retrieves the information
  - processor checks the status periodically
- Interrupt-driven I/O
  - device notifies the processor that it has completed some operation by causing an interrupt
  - similar to an exception, except that it is asynchronous
  - processor must be notified of the device causing the interrupt
  - interrupts must be prioritized
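The polling scheme above can be sketched as follows. This is a toy model: the status-register layout (a READY bit plus a data field) and the `Device` class are illustrative assumptions, and a real device would set its status asynchronously.

```python
# Toy sketch of polling: the processor busy-waits on a device
# status register until data is ready.

READY = 0x1

class Device:
    """Stand-in for a device with status and data registers."""
    def __init__(self):
        self.status = 0
        self.data = 0
    def complete(self, value):       # device finishes an operation
        self.data = value
        self.status |= READY

def poll_read(dev):
    while not (dev.status & READY):  # busy-wait: wasted CPU cycles are
        pass                         # the main drawback of polling
    dev.status &= ~READY             # clear the ready bit
    return dev.data

dev = Device()
dev.complete(42)        # in reality this happens asynchronously
print(poll_read(dev))   # 42
```

Interrupt-driven I/O avoids the busy-wait loop entirely: the device signals the processor only when there is work to do.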
21. I/O Example: Disk Drives
- To access data
  - seek: position the head over the proper track (8 to 20 ms avg.)
  - rotational latency: wait for the desired sector (0.5 rotation / RPM)
  - transfer: grab the data (one or more sectors), 2 to 15 MB/sec
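A worked instance of the three access-time components above. The specific disk parameters (7200 RPM, 10 ms seek, 5 MB/s, 512-byte sector) are illustrative assumptions chosen within the slide's ranges.

```python
# Average access time for a hypothetical disk = seek + rotational
# latency + transfer time.

seek_ms = 10.0                           # average seek, within 8-20 ms
rpm = 7200
rotational_ms = 0.5 / rpm * 60 * 1000    # half a rotation, on average
transfer_rate = 5e6                      # 5 MB/s, within 2-15 MB/s
sector_bytes = 512
transfer_ms = sector_bytes / transfer_rate * 1000

total_ms = seek_ms + rotational_ms + transfer_ms
print(round(rotational_ms, 2))  # 4.17
print(round(total_ms, 2))       # 14.27
```

Note that the mechanical components (seek and rotation) dominate; the actual data transfer takes about 0.1 ms.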
22. I/O Example: Buses
- Shared communication link (one or more wires)
- Difficult design
  - may be a bottleneck
  - tradeoffs (buffers for higher bandwidth increase latency)
  - support for many different devices
  - cost
- Types of buses
  - processor-memory (short, high speed, custom design)
  - backplane (high speed, often standardized, e.g., PCI)
  - I/O (lengthy, different devices, standardized, e.g., SCSI)
- Synchronous vs. asynchronous
  - synchronous: use a clock and a synchronous protocol; fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short
  - asynchronous: don't use a clock; use handshaking instead
23. Asynchronous Handshake Protocol
- ReadReq: indicates a read request to memory
- DataRdy: indicates that a data word is now ready on the data lines
- Ack: used to acknowledge the ReadReq or DataRdy signal of the other party
24. Asynchronous Handshake Protocol
- Memory sees ReadReq, reads the address from the data bus, raises Ack
- I/O device sees Ack high, releases ReadReq and the data lines
- Memory sees ReadReq low, drops Ack to acknowledge ReadReq
- When memory has the data ready, it places the data from the read request on the data lines and raises DataRdy
- I/O device sees DataRdy, reads the data from the bus, signals that it has the data by raising Ack
- Memory sees the Ack signal, drops DataRdy, releases the data lines
- When DataRdy goes low, the I/O device drops Ack to indicate that the transmission is over
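The steps above can be traced as a sequence of signal transitions. This is a simplification for illustration: the two sides really run concurrently in hardware, whereas here the whole exchange is serialized into one script.

```python
# Step-by-step trace of the asynchronous handshake, modeled as
# transitions on the three control signals.

ReadReq = DataRdy = Ack = 0
trace = []

def step(desc):
    trace.append((desc, ReadReq, DataRdy, Ack))

ReadReq = 1;  step("device raises ReadReq (address on data lines)")
Ack = 1;      step("memory latches address, raises Ack")
ReadReq = 0;  step("device sees Ack high, releases ReadReq")
Ack = 0;      step("memory sees ReadReq low, drops Ack")
DataRdy = 1;  step("memory puts data on lines, raises DataRdy")
Ack = 1;      step("device reads data, raises Ack")
DataRdy = 0;  step("memory drops DataRdy, releases data lines")
Ack = 0;      step("device drops Ack: transfer complete")

for desc, rr, dr, ack in trace:
    print(f"ReadReq={rr} DataRdy={dr} Ack={ack}  {desc}")

# All signals are back low at the end, ready for the next transfer.
print((ReadReq, DataRdy, Ack))  # (0, 0, 0)
```

Each transition is only made in response to the other party's previous transition, which is what lets the protocol work without a shared clock.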
25. Synchronous vs. Asynchronous Buses
- Compare the max. bandwidth of a synchronous bus and an asynchronous bus
- Synchronous bus
  - has a clock cycle time of 50 ns
  - each transmission takes 1 clock cycle
- Asynchronous bus
  - requires 40 ns per handshake
- Find the bandwidth of each bus when performing one-word reads from a 200 ns memory
26. Synchronous Bus
- Send address to memory: 50 ns
- Read memory: 200 ns
- Send data to device: 50 ns
- Total: 300 ns
- Max. bandwidth
  - 4 bytes/300 ns = 13.3 MB/second
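The synchronous-bus figure follows directly from the three steps above:

```python
# Synchronous bus: one-word (4-byte) read from a 200 ns memory.

steps_ns = [50, 200, 50]     # send address, read memory, send data
total_ns = sum(steps_ns)     # 300 ns per read
bandwidth_mb = 4 / (total_ns * 1e-9) / 1e6   # bytes/s -> MB/s
print(total_ns)                # 300
print(round(bandwidth_mb, 1))  # 13.3
```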
27. Asynchronous Bus
- Apparently much slower, because each step of the protocol takes 40 ns and the memory access 200 ns
- Notice that several steps overlap with the memory access time
  - memory receives the address at step 1
  - it does not need to put the data on the bus until step 5
  - steps 2, 3, 4 can overlap with the memory access
- Step 1: 40 ns
- Steps 2, 3, 4: max(3 x 40 ns = 120 ns, 200 ns) = 200 ns
- Steps 5, 6, 7: 3 x 40 ns = 120 ns
- Total time: 360 ns
- Max. bandwidth: 4 bytes/360 ns = 11.1 MB/second
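The same computation for the asynchronous bus, with the overlap described above:

```python
# Asynchronous bus: steps 2-4 of the handshake overlap with the
# 200 ns memory access, so only the longer of the two counts.

handshake_ns = 40
step1 = handshake_ns                     # send address
steps_2_4 = max(3 * handshake_ns, 200)   # overlapped with memory access
steps_5_7 = 3 * handshake_ns             # deliver the data
total_ns = step1 + steps_2_4 + steps_5_7
bandwidth_mb = 4 / (total_ns * 1e-9) / 1e6   # MB/s
print(total_ns)                # 360
print(round(bandwidth_mb, 1))  # 11.1
```

Thanks to the overlap, the asynchronous bus (11.1 MB/s) comes close to the synchronous bus (13.3 MB/s) despite its seven protocol steps.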
28. Other Important Issues
- Bus arbitration
  - daisy-chain arbitration (not very fair)
  - centralized arbitration (requires an arbiter), e.g., PCI
  - self-selection, e.g., NuBus used in the Macintosh
  - collision detection, e.g., Ethernet
- Operating system: polling, interrupts, DMA
- Performance analysis techniques
  - queuing theory
  - simulation
  - analysis, i.e., find the weakest link (see I/O System Design)
29. Overhead of Polling
30. Overhead of Polling
31. Ways to Transfer Data between Memory and Device
32. Multiprocessors
33. Idea
- Build powerful computers by connecting many smaller ones.
34. Multiprocessors
- Good for timesharing, easy to realize
- difficult to write good concurrent programs
- hard to parallelize tasks
- mapping to the architecture can be difficult
35. Questions
- How do parallel processors share data?
  - single address space
  - message passing
- How do parallel processors coordinate?
  - synchronization (locks, semaphores)
  - built into send/receive primitives
  - operating system protocols
- How are they implemented?
  - connected by a single bus
  - connected by a network
36. Shared Memory Multiprocessors
Problems???
Symmetric multiprocessor (SMP)
37. Distributed Memory Multiprocessors
- Distributed shared-memory multiprocessor
- Message passing multiprocessor
38. Multiprocessors

                            Global memory                    Distributed memory
Common address space        Symmetric multiprocessor (SMP)   Distributed shared-memory multiprocessor
Distributed address space   does not exist                   Message-passing multiprocessor
39. Connection Network
- Static network
  - fixed connections between nodes
- Dynamic network
  - packet switching (packets routed from sender to recipient)
  - circuit switching (a connection between nodes can be established by a crossbar or a switching network)
40. Static Connection Networks
41. Circuit Switching: Delta Networks
- Route from any input x to output y by selecting links determined by successive d-ary digits of y's label.
- This process is reversible: we can route from output y back to x by following the links determined by successive digits of x's label.
- This self-routing property allows for simple hardware-based routing of cells.
[Figure: delta network routing example from input x = 0101 to output y = 1101, with labels x = x_{k-1} . . . x_0 and y = y_{k-1} . . . y_0]
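The self-routing property can be sketched for a binary (d = 2) delta network: at stage i, the cell simply takes the output port given by the i-th most significant bit of the destination label. The stage-by-stage port convention here is a common one but is an assumption; the slides do not fix it.

```python
# Self-routing in a binary delta network: the destination label alone
# determines the port taken at each stage.

def route(y, k):
    """Return the output port taken at each of the k stages to reach y."""
    bits = format(y, f"0{k}b")      # y's label, most significant bit first
    return [int(b) for b in bits]   # one port decision per stage

# The figure's example: destination 1101 in a 4-stage network.
print(route(0b1101, 4))   # [1, 1, 0, 1]

# Reversibility: routing from output back to input uses x's digits.
print(route(0b0101, 4))   # [0, 1, 0, 1]
```

No routing tables are needed; each switch inspects one digit of the label, which is what makes simple hardware-based routing possible.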
42. Network versus Bus
43. Performance / Unit Cost
44. Programming
- lock variables
- semaphores
- monitors
45. Cache Coherency
46. Outlook
- Distributed Algorithms
- Distributed Systems
- Parallel Programming
- Parallelizing Compilers