Title: Cache Performance, Interfacing, Multiprocessors (CPSC 321)
1. Cache Performance, Interfacing, Multiprocessors
CPSC 321
2. Today's Menu
- Cache Performance
- Review of Virtual Memory
- Processor and Peripherals
- Multiprocessors
3. Cache Performance
4. Caching Basics
- What are the different cache placement schemes?
- direct mapped
- set associative
- fully associative
- Explain how a 2-way set-associative cache with 4 sets works
- To read the memory block at address addr, we search the set addr mod 4
- The memory block could be in either element of the set
- Compare the tags with the upper n-2 bits of addr
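The lookup described above can be sketched in code. This is a minimal illustration of a 2-way set-associative cache with 4 sets; the `Line` structure, the `fill` helper, and the use of block addresses (rather than byte addresses with a block offset) are simplifying assumptions, not from the slides.

```python
# Sketch of a lookup in a 2-way set-associative cache with 4 sets.
# Block addresses are used directly; replacement policy is omitted.

NUM_SETS = 4
WAYS = 2

class Line:
    def __init__(self):
        self.valid = False
        self.tag = None

# cache[set_index] is a list of WAYS lines
cache = [[Line() for _ in range(WAYS)] for _ in range(NUM_SETS)]

def lookup(block_addr):
    """Return True on hit, False on miss."""
    set_index = block_addr % NUM_SETS    # addr mod 4 selects the set
    tag = block_addr // NUM_SETS         # the upper bits of the address
    for line in cache[set_index]:        # block may be in either way
        if line.valid and line.tag == tag:
            return True
    return False

def fill(block_addr, way):
    """Install a block into a given way of its set (illustrative helper)."""
    set_index = block_addr % NUM_SETS
    line = cache[set_index][way]
    line.valid = True
    line.tag = block_addr // NUM_SETS

fill(13, 0)          # block 13 maps to set 13 mod 4 = 1, tag 3
print(lookup(13))    # True  (hit)
print(lookup(5))     # False (block 5 also maps to set 1, different tag)
```

Note that blocks 13 and 5 land in the same set but differ in their tags, which is exactly why the tag comparison is needed.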
5. Implementation of a Cache
- Sketch an implementation of a 4-way set-associative cache
6. Measuring Cache Performance
- CPU cycle time
- CPU execution clock cycles (including cache hits)
- Memory-stall clock cycles (cache misses)
- CPU time = (CPU execution clock cycles + memory-stall clock cycles) x clock cycle time
- Memory-stall clock cycles
  - read-stall cycles (rsc)
  - write-stall cycles (wsc)
- Memory-stall clock cycles = rsc + wsc
7. Measuring Cache Performance
- Write-stall cycles, write-through scheme
- two sources of stalls
  - write misses (usually require fetching the block)
  - write buffer stalls (write buffer is full when a write occurs)
- WSCs are the sum of the two
- WSC = (writes/program x write miss rate x write miss penalty) + write buffer stalls
- Read-stall clock cycles are computed similarly
8. Cache Performance Example
- Instruction cache miss rate 2%
- Data miss rate 4%
- Assume a CPI of 2 without any memory stalls
- Miss penalty 40 cycles for all misses
- Instruction count I
- Instruction miss cycles = I x 2% x 40 = 0.80 x I
- gcc has 36% loads and stores
- Data miss cycles = I x 36% x 4% x 40 = 0.58 x I
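The example's arithmetic can be checked directly. The miss rates, gcc's 36% load/store fraction, and the 40-cycle penalty are from the slide; the final CPI line is a straightforward continuation of the same numbers, not stated on the slide.

```python
# Worked version of the cache performance example.

I = 1.0                      # normalize instruction count to 1
inst_miss_rate = 0.02        # 2% instruction cache miss rate
data_miss_rate = 0.04        # 4% data cache miss rate
loads_stores = 0.36          # 36% of gcc's instructions access data
miss_penalty = 40            # cycles per miss
base_cpi = 2.0               # CPI with a perfect cache

inst_miss_cycles = I * inst_miss_rate * miss_penalty                 # 0.80 I
data_miss_cycles = I * loads_stores * data_miss_rate * miss_penalty  # 0.576 I

total_cpi = base_cpi + inst_miss_cycles + data_miss_cycles
print(round(inst_miss_cycles, 3))   # 0.8
print(round(data_miss_cycles, 3))   # 0.576
print(round(total_cpi, 3))          # 3.376
```

The 0.58 on the slide is 0.576 rounded; memory stalls alone add 1.376 cycles per instruction, nearly doubling the base CPI of 2.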
9. Review of Virtual Memory
10. Virtual Memory
- The processor generates virtual addresses
- Memory is accessed using physical addresses
- Virtual and physical memory are broken into blocks of memory, called pages
- A virtual page may be absent from main memory, residing on disk, or may be mapped to a physical page
11. Page Tables
The page table maps each page to either a page in
main memory or to a page stored on disk
12. Pages: virtual memory blocks
- Page faults: if data is not in memory, retrieve it from disk
  - huge miss penalty, thus pages should be fairly large (e.g., 4KB)
  - reducing page faults is important (LRU is worth the price)
  - using write-through takes too long, so we use write-back
- Example: page size 2^12 = 4KB, 2^18 physical pages
  - main memory ≤ 1GB, virtual memory ≤ 4GB
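The example's sizes follow directly from the page size; this quick check assumes a 32-bit virtual address space, which is consistent with the 4GB figure.

```python
# Arithmetic behind the example: 4KB pages (2^12 bytes) and 2^18
# physical pages give exactly 1GB of main memory; a 4GB (32-bit)
# virtual address space holds 2^20 virtual pages.

page_size = 2**12                   # 4KB per page
physical_pages = 2**18
main_memory = page_size * physical_pages
print(main_memory == 2**30)         # True: exactly 1GB

virtual_memory = 2**32              # 4GB virtual address space
virtual_pages = virtual_memory // page_size
print(virtual_pages == 2**20)       # True
```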
13. Page Faults
- Incredibly high penalty for a page fault
- Reduce the number of page faults by optimizing page placement
- Use fully associative placement
  - a full search of pages is impractical
  - pages are located by a full table that indexes the memory, called the page table
  - the page table resides within the memory
14. Making Memory Access Fast
- Page tables slow us down
- Memory access will take at least twice as long
  - access the page table in memory
  - access the page
- What can we do?
  Memory access is local => use a cache that keeps track of recently used address translations, called a translation lookaside buffer
15. Making Address Translation Fast
- A cache for address translations: the translation lookaside buffer (TLB)
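The TLB idea can be sketched in a few lines. This is a toy model, not a hardware description: the page table is a plain dict from virtual page number to physical page number, the TLB has no capacity limit or replacement policy, and page faults are omitted.

```python
# Minimal sketch of address translation with a TLB.

PAGE_SIZE = 4096  # 4KB pages, as in the earlier example

page_table = {0: 7, 1: 3, 2: 9}   # virtual page -> physical page (toy data)
tlb = {}                           # cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                 # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    else:                          # TLB miss: extra memory access for the
        ppn = page_table[vpn]      # page table (page faults omitted)
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset

print(translate(4100))   # vpn 1, offset 4 -> 3*4096 + 4 = 12292
print(translate(4200))   # vpn 1 again: served from the TLB this time
```

Because memory access is local, most translations hit in the TLB and avoid the extra page-table access that would otherwise double the memory access time.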
16. Processors and Peripherals
17. Collection of I/O Devices
- Communication between I/O devices, processor, and memory uses protocols on the bus and interrupts
18. Impact of I/O on Performance
- A benchmark executes in 100 seconds
  - 90 seconds CPU time
  - 10 seconds I/O time
- If the CPU improves 50% per year for the next 5 years, how much faster does the benchmark run in 5 years?
  90/(1.5)^5 = 90/7.59 = 11.85 s CPU time
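Completing the computation above (the final speedup is a straightforward continuation of the slide's numbers): the I/O time stays fixed at 10 seconds, so it limits the overall improvement.

```python
# Impact of I/O on performance: CPU improves, I/O does not.

cpu_time = 90 / 1.5**5          # 50%/year improvement for 5 years
total_after = cpu_time + 10     # I/O time is unchanged
speedup = 100 / total_after

print(round(cpu_time, 2))     # 11.85
print(round(total_after, 2))  # 21.85
print(round(speedup, 2))      # 4.58
```

Despite a 7.6x faster CPU, the benchmark only speeds up by about 4.6x, an instance of Amdahl's law.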
19. I/O Devices
- Very diverse devices
  - behavior (i.e., input vs. output)
  - partner (who is at the other end?)
  - data rate
20. Communicating with the Processor
- Polling
  - simple
  - I/O device puts information in a status register
  - processor retrieves the information
  - processor checks the status periodically
- Interrupt-driven I/O
  - device notifies the processor that it has completed some operation by causing an interrupt
  - similar to an exception, except that it is asynchronous
  - processor must be notified of the device causing the interrupt
  - interrupts must be prioritized
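The polling scheme above can be sketched as follows. This is a toy model: the status-register layout (a READY bit plus a data field) and the `Device` class are illustrative assumptions, and a real device would set its status asynchronously.

```python
# Toy sketch of polling: the processor busy-waits on a device
# status register until data is ready.

READY = 0x1

class Device:
    """Stand-in for a device with status and data registers."""
    def __init__(self):
        self.status = 0
        self.data = 0
    def complete(self, value):       # device finishes an operation
        self.data = value
        self.status |= READY

def poll_read(dev):
    while not (dev.status & READY):  # busy-wait: wasted CPU cycles are
        pass                         # the main drawback of polling
    dev.status &= ~READY             # clear the ready bit
    return dev.data

dev = Device()
dev.complete(42)        # in reality this happens asynchronously
print(poll_read(dev))   # 42
```

Interrupt-driven I/O avoids the busy-wait loop entirely: the device signals the processor only when there is work to do.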
21. I/O Example: Disk Drives
- To access data
  - seek: position the head over the proper track (8 to 20 ms avg.)
  - rotational latency: wait for the desired sector (0.5 rotation / RPM)
  - transfer: grab the data (one or more sectors), 2 to 15 MB/sec
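A worked instance of the three access-time components above. The specific disk parameters (7200 RPM, 10 ms seek, 5 MB/s, 512-byte sector) are illustrative assumptions chosen within the slide's ranges.

```python
# Average access time for a hypothetical disk = seek + rotational
# latency + transfer time.

seek_ms = 10.0                           # average seek, within 8-20 ms
rpm = 7200
rotational_ms = 0.5 / rpm * 60 * 1000    # half a rotation, on average
transfer_rate = 5e6                      # 5 MB/s, within 2-15 MB/s
sector_bytes = 512
transfer_ms = sector_bytes / transfer_rate * 1000

total_ms = seek_ms + rotational_ms + transfer_ms
print(round(rotational_ms, 2))  # 4.17
print(round(total_ms, 2))       # 14.27
```

Note that the mechanical components (seek and rotation) dominate; the actual data transfer takes about 0.1 ms.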
22. I/O Example: Buses
- Shared communication link (one or more wires)
- Difficult design
  - may be a bottleneck
  - tradeoffs (buffers for higher bandwidth increase latency)
  - support for many different devices
  - cost
- Types of buses
  - processor-memory (short, high speed, custom design)
  - backplane (high speed, often standardized, e.g., PCI)
  - I/O (lengthy, different devices, standardized, e.g., SCSI)
- Synchronous vs. asynchronous
  - synchronous: use a clock and a synchronous protocol; fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short
  - asynchronous: don't use a clock; use handshaking instead
23. Asynchronous Handshake Protocol
- ReadReq: indicates a read request to memory
- DataRdy: indicates that a data word is now ready on the data lines
- Ack: used to acknowledge the ReadReq or DataRdy signal of the other party
24. Asynchronous Handshake Protocol
- Memory sees ReadReq, reads the address from the data bus, raises Ack
- I/O device sees Ack high, releases ReadReq and the data lines
- Memory sees ReadReq low, drops Ack to acknowledge ReadReq
- When memory has the data ready, it places the data from the read request on the data lines and raises DataRdy
- I/O device sees DataRdy, reads the data from the bus, signals that it has the data by raising Ack
- Memory sees the Ack signal, drops DataRdy, releases the data lines
- When DataRdy goes low, the I/O device drops Ack to indicate that the transmission is over
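The steps above can be traced as a sequence of signal transitions. This is a simplification for illustration: the two sides really run concurrently in hardware, whereas here the whole exchange is serialized into one script.

```python
# Step-by-step trace of the asynchronous handshake, modeled as
# transitions on the three control signals.

ReadReq = DataRdy = Ack = 0
trace = []

def step(desc):
    trace.append((desc, ReadReq, DataRdy, Ack))

ReadReq = 1;  step("device raises ReadReq (address on data lines)")
Ack = 1;      step("memory latches address, raises Ack")
ReadReq = 0;  step("device sees Ack high, releases ReadReq")
Ack = 0;      step("memory sees ReadReq low, drops Ack")
DataRdy = 1;  step("memory puts data on lines, raises DataRdy")
Ack = 1;      step("device reads data, raises Ack")
DataRdy = 0;  step("memory drops DataRdy, releases data lines")
Ack = 0;      step("device drops Ack: transfer complete")

for desc, rr, dr, ack in trace:
    print(f"ReadReq={rr} DataRdy={dr} Ack={ack}  {desc}")

# All signals are back low at the end, ready for the next transfer.
print((ReadReq, DataRdy, Ack))  # (0, 0, 0)
```

Each transition is only made in response to the other party's previous transition, which is what lets the protocol work without a shared clock.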
25. Synchronous vs. Asynchronous Buses
- Compare the max. bandwidth of a synchronous bus and an asynchronous bus
- Synchronous bus
  - has a clock cycle time of 50 ns
  - each transmission takes 1 clock cycle
- Asynchronous bus
  - requires 40 ns per handshake
- Find the bandwidth of each bus when performing one-word reads from a 200 ns memory
26. Synchronous Bus
- Send address to memory: 50 ns
- Read memory: 200 ns
- Send data to device: 50 ns
- Total: 300 ns
- Max. bandwidth
  - 4 bytes/300 ns = 13.3 MB/second
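The synchronous-bus figure follows directly from the three steps above:

```python
# Synchronous bus: one-word (4-byte) read from a 200 ns memory.

steps_ns = [50, 200, 50]     # send address, read memory, send data
total_ns = sum(steps_ns)     # 300 ns per read
bandwidth_mb = 4 / (total_ns * 1e-9) / 1e6   # bytes/s -> MB/s
print(total_ns)                # 300
print(round(bandwidth_mb, 1))  # 13.3
```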
27. Asynchronous Bus
- Apparently much slower, because each step of the protocol takes 40 ns and the memory access 200 ns
- Notice that several steps overlap with the memory access time
  - memory receives the address at step 1
  - it does not need to put the data on the bus until step 5
  - steps 2, 3, 4 can overlap with the memory access
- Step 1: 40 ns
- Steps 2, 3, 4: max(3 x 40 ns = 120 ns, 200 ns) = 200 ns
- Steps 5, 6, 7: 3 x 40 ns = 120 ns
- Total time: 360 ns
- Max. bandwidth: 4 bytes/360 ns = 11.1 MB/second
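The same computation for the asynchronous bus, with the overlap described above:

```python
# Asynchronous bus: steps 2-4 of the handshake overlap with the
# 200 ns memory access, so only the longer of the two counts.

handshake_ns = 40
step1 = handshake_ns                     # send address
steps_2_4 = max(3 * handshake_ns, 200)   # overlapped with memory access
steps_5_7 = 3 * handshake_ns             # deliver the data
total_ns = step1 + steps_2_4 + steps_5_7
bandwidth_mb = 4 / (total_ns * 1e-9) / 1e6   # MB/s
print(total_ns)                # 360
print(round(bandwidth_mb, 1))  # 11.1
```

Thanks to the overlap, the asynchronous bus (11.1 MB/s) comes close to the synchronous bus (13.3 MB/s) despite its seven protocol steps.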
28. Other Important Issues
- Bus arbitration
  - daisy-chain arbitration (not very fair)
  - centralized arbitration (requires an arbiter), e.g., PCI
  - self-selection, e.g., NuBus used in the Macintosh
  - collision detection, e.g., Ethernet
- Operating system: polling, interrupts, DMA
- Performance analysis techniques
  - queuing theory
  - simulation
  - analysis, i.e., find the weakest link (see I/O System Design)
29. Overhead of Polling
30. Overhead of Polling
31. Ways to Transfer Data between Memory and Device
32. Multiprocessors
33. Idea
- Build powerful computers by connecting many smaller ones.
34. Multiprocessors
- Good for timesharing, easy to realize
- difficult to write good concurrent programs
- hard to parallelize tasks
- mapping to the architecture can be difficult
35. Questions
- How do parallel processors share data?
  - single address space
  - message passing
- How do parallel processors coordinate?
  - synchronization (locks, semaphores)
  - built into send/receive primitives
  - operating system protocols
- How are they implemented?
  - connected by a single bus
  - connected by a network
36. Shared Memory Multiprocessors
Problems???
Symmetric multiprocessor (SMP)
37. Distributed Memory Multiprocessors
- Distributed shared-memory multiprocessor
- Message passing multiprocessor
38. Multiprocessors

                            Global memory                    Distributed memory
Common address space        Symmetric multiprocessor (SMP)   Distributed shared-memory multiprocessor
Distributed address space   does not exist                   Message-passing multiprocessor
39. Connection Network
- Static network
  - fixed connections between nodes
- Dynamic network
  - packet switching (packets routed from sender to recipient)
  - circuit switching (a connection between nodes can be established by a crossbar or a switching network)
40. Static Connection Networks
41. Circuit Switching: Delta Networks
- Route from any input x to output y by selecting links determined by successive d-ary digits of y's label.
- This process is reversible: we can route from output y back to x by following the links determined by successive digits of x's label.
- This self-routing property allows for simple hardware-based routing of cells.
[Figure: delta network routing example from input x = 0101 to output y = 1101, with labels x = x_{k-1} . . . x_0 and y = y_{k-1} . . . y_0]
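The self-routing property can be sketched for a binary (d = 2) delta network: at stage i, the cell simply takes the output port given by the i-th most significant bit of the destination label. The stage-by-stage port convention here is a common one but is an assumption; the slides do not fix it.

```python
# Self-routing in a binary delta network: the destination label alone
# determines the port taken at each stage.

def route(y, k):
    """Return the output port taken at each of the k stages to reach y."""
    bits = format(y, f"0{k}b")      # y's label, most significant bit first
    return [int(b) for b in bits]   # one port decision per stage

# The figure's example: destination 1101 in a 4-stage network.
print(route(0b1101, 4))   # [1, 1, 0, 1]

# Reversibility: routing from output back to input uses x's digits.
print(route(0b0101, 4))   # [0, 1, 0, 1]
```

No routing tables are needed; each switch inspects one digit of the label, which is what makes simple hardware-based routing possible.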
42. Network versus Bus
43. Performance / Unit Cost
44. Programming
- lock variables
- semaphores
- monitors
45. Cache Coherency
46. Outlook
- Distributed Algorithms
- Distributed Systems
- Parallel Programming
- Parallelizing Compilers