Transcript and Presenter's Notes

Title: CST305 Performance Evaluation


1
CST305 Performance Evaluation
  • L2 Basic Serial Architecture
  • Pipelining
  • Memory Hierarchy

2
References
  • Course web page
  • www.wmin.ac.uk/lancasd/CST305/
  • K. Dowd and C. Severance, High Performance Computing,
    chapters 2-3 (1st ed.)
  • J. L. Hennessy and D. A. Patterson, Computer
    Architecture: A Quantitative Approach

3
  • Designing to Last through Trends

                 Capacity         Speed
    Logic        2x in 3 years    2x in 3 years
    DRAM         4x in 3 years    2x in 10 years
    Disk         4x in 3 years    2x in 10 years
    Processor    (n.a.)           2x in 1.5 years

  • Time to run the task
  • Execution time, response time, latency
  • Tasks per day, hour, week, sec, ns, ...
  • Throughput, bandwidth
  • "X is n times faster than Y" means
    ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
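As a minimal sketch of this relation (my own illustration with assumed timings, not from the slides):

    #include <stdio.h>

    /* "X is n times faster than Y" means n = ExTime(Y) / ExTime(X). */
    int main(void) {
        double extime_x = 2.0;   /* execution time of X in seconds (assumed) */
        double extime_y = 6.0;   /* execution time of Y in seconds (assumed) */
        printf("X is %.1f times faster than Y\n", extime_y / extime_x);
        return 0;
    }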

4
Price vs. Cost
5
Computer Architecture's Changing Definition
  • 1950s to 1960s: Computer Architecture Course =
    Computer Arithmetic
  • 1970s to mid 1980s: Computer Architecture Course =
    Instruction Set Design, especially ISA appropriate
    for compilers
  • 1990s: Computer Architecture Course = Design of
    CPU, memory system, I/O system, Multiprocessors

6
Instruction Set Architecture (ISA)
[Diagram: the instruction set is the interface between software above and hardware below]
7
  • Performance prediction is hard because
    architecture and software are complicated
  • Consider a (benchmark) program, compile to get
    the list of instructions

- Takes 14 cycles sequentially
- Assumes that data is in cache
8
  • Pipeline
  • Like an assembly line
  • Overlap the execution of each instruction, so
    they start 1 cycle before the previous one
    completes

- Takes 11 cycles with the pipeline
- Does scale with clock speed
9
  • Memory Hierarchy
  • If one of the load instructions can't find the
    data in the cache

- Takes 62 cycles with a cache miss
- Does not scale with clock speed -- how do you
predict that there will be a cache miss?
10
RISC
  • Pipelining and a complicated memory hierarchy are
    characteristic of modern RISC processors
  • They make it much more difficult (than with older
    CISC processors) to predict performance
  • RISC = Reduced Instruction Set Computer
  • CISC = Complex Instruction Set Computer

11
RISC vs CISC
  • A disagreement in instruction set philosophy
  • CISC
  • Powerful primitives, close to high level
    languages
  • VAX, Intel
  • RISC
  • Low-level primitives - can compute anything, but
    need more instructions
  • Alpha, PowerPC, Sparc

12
  • Before circa 1985 design variables favoured CISC;
    they now favour RISC
  • Memory was precious, and CISC executables are
    smaller. Now memory is cheap.
  • Programmers wrote in assembler and used the
    complex instructions. Now people rely on
    compilers, which have difficulty using complex
    instructions when optimising.

13
A "Typical" RISC
  • Uniform instruction length
  • easier pipeline (with higher clock speed)
  • Simple addressing modes
  • to avoid stalls
  • Load/store architecture
  • memory references only in these explicit
    instructions
  • Many registers
  • avoid memory references
  • Delayed branch
  • a branch delay slot after any branch instruction

14
A "Typical" RISC
  • 32-bit fixed format instructions (3 formats)
  • 32 32-bit GPRs (R0 contains zero; DP takes a pair)
  • 3-address, reg-reg arithmetic instructions
  • Single address mode for load/store: base +
    displacement
  • no indirection
  • Simple branch conditions
  • Delayed branch

see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
15
Example: MIPS (DLX)

Register-Register
  Op (31:26) | Rs1 (25:21) | Rs2 (20:16) | Rd (15:11) | Opx (10:0)

Register-Immediate
  Op (31:26) | Rs1 (25:21) | Rd (20:16) | immediate (15:0)

Branch
  Op (31:26) | Rs1 (25:21) | Rs2/Opx (20:16) | immediate (15:0)

Jump / Call
  Op (31:26) | target (25:0)
16
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

17
Sequential Laundry
[Timeline figure, 6 PM to midnight: the four loads run back-to-back; each load uses the washer for 30 min, then the dryer for 40 min, then the folder for 20 min]
  • Sequential laundry takes 6 hours for 4 loads

18
Pipelined Laundry: Start work ASAP
[Timeline figure, 6 PM onward: each load starts as soon as the washer is free, so the washer, dryer, and folder work on different loads at the same time]
  • Pipelined laundry takes 3.5 hours for 4 loads

19
Pipelining Lessons
  • Pipelining doesn't help the latency of a single
    task; it helps the throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup (the sketch below checks these
    numbers for the laundry example)

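A minimal sketch (mine, not from the slides) that checks the laundry arithmetic: sequentially, each of the N loads takes the full 90 minutes; pipelined, after the first load finishes filling the pipeline, the dryer (the slowest stage) sets the pace.

    #include <stdio.h>

    int main(void) {
        int wash = 30, dry = 40, fold = 20;   /* stage times in minutes */
        int n_loads = 4;
        int pass = wash + dry + fold;         /* one load, start to finish */
        int slowest = dry;                    /* the dryer limits the rate */

        int sequential = n_loads * pass;                  /* 360 min = 6 h   */
        int pipelined = pass + (n_loads - 1) * slowest;   /* 210 min = 3.5 h */

        printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
        return 0;
    }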
20
Computer Pipelines
  • Execute billions of instructions, so throughput
    is what matters
  • DLX desirable features: all instructions the same
    length, registers located in the same place in the
    instruction format, memory operands only in loads
    or stores

21
5 Steps of DLX Datapath
  1. Instruction Fetch
  2. Instr. Decode / Reg. Fetch
  3. Execute / Addr. Calc.
  4. Memory Access
  5. Write Back

[Datapath figure; IR is the instruction register, LMD the load memory data register]
22
Pipelined DLX Datapath

[Figure: the same five stages - Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back - separated by pipeline registers]

  • Data stationary control
  • local decode for each instruction phase /
    pipeline stage

23
Visualizing Pipelining

[Figure: instructions in program order down the page, clock cycles across; each instruction's five stages overlap diagonally with its neighbours]
24
It's Not That Easy for Computers
  • If a complicated memory access occurs in stage 3,
    stage 4 will be delayed and the rest of the pipe
    is stalled.
  • If there is a branch (if...) or a jump, then some of
    the instructions that have already entered the
    pipeline should not be processed.
  • Need optimal conditions in order to keep the
    pipeline moving

25
Hazards prevent efficient pipelining
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (a single person to
    fold and put clothes away)
  • Data hazards: an instruction depends on the result
    of a prior instruction still in the pipeline
    (a missing sock)
  • Control hazards: pipelining of branches and other
    instructions that stall the pipeline

26
One Memory Port/Structural Hazards

[Figure, clock cycles across, instructions down: a Load followed by Instr 1-4; with a single memory port, the Load's memory-access stage and a later instruction's fetch need the port in the same cycle]
27
One Memory Port/Structural Hazards

[Figure: the same sequence with a stall inserted, so Instr 3's fetch waits one cycle for the memory port]
28
Can now build RISC cheaply
  • Instruction cache to speed instruction fetch
  • Dramatic memory size increases
  • Better pipeline design
  • Optimising compilers
  • CISC made it difficult to build pipelines
    resistant to stalls

29
Pipelining Summary
  • Just overlap tasks - easy if the tasks are
    independent
  • Speedup ≤ Pipeline Depth; if the ideal CPI is 1, then

    Speedup = (Pipeline Depth / (1 + Pipeline stall CPI))
              × (Clock Cycle Unpipelined / Clock Cycle Pipelined)

  • Hazards limit pipeline performance
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • Control: delayed branch, prediction
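A small sketch of this formula (the sample numbers are my own assumptions):

    #include <stdio.h>

    /* Speedup = depth / (1 + stall CPI) x (unpipelined clock / pipelined clock) */
    double pipeline_speedup(double depth, double stall_cpi,
                            double clk_unpipelined, double clk_pipelined) {
        return depth / (1.0 + stall_cpi) * (clk_unpipelined / clk_pipelined);
    }

    int main(void) {
        /* 5-stage pipeline, 0.5 stall cycles per instruction, equal clock
           periods before and after pipelining (all assumed values) */
        printf("speedup = %.2f\n", pipeline_speedup(5.0, 0.5, 1.0, 1.0));
        return 0;
    }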
30
Floating Point pipelines
  • RISC aims for 1 instruction/cycle
  • Actually with FP can get more!
  • Separate and more than one FP pipeline
  • Superscalar
  • schedule operations for side-by-side execution at
    run time
  • IBM RS/6000, Supersparc, DEC AXP
  • Speculative execution

31
5 minute Class Break
  • 120 minutes straight is too long for me to
    lecture (11:00-13:00)
  • 5 minutes: review last time & motivate this
    lecture
  • 40 minutes: lecture - pipeline
  • 5 minutes: break
  • 40 minutes: lecture - memory hierarchy
  • 5 minutes: summary of today's important topics

32
Memory Hierarchy
  • To cope with modern fast CPUs we need large, fast
    (and economical) memory
  • interleave
  • wide memory bus
  • separate instruction cache
  • hierarchy
  • fast expensive volatile cache
  • slower cheap non-volatile disk

33
Random Access Memory
  • DRAM
  • dynamic, charge based -- must keep refreshing
  • generally best price/performance
  • memory cycle time
  • SRAM
  • static, preserved as long as power is on
  • sometimes (e.g. graphics) better for critical fast
    access

34
Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)

[Figure, 1980-2000, performance on a log scale from 1 to 1000: CPU performance ("Moore's Law") grows 60%/yr (2x/1.5 yr), DRAM performance grows 9%/yr (2x/10 yrs); the processor-memory performance gap grows about 50%/year]
35
The Principle of Locality
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Two Different Types of Locality
  • Temporal Locality (Locality in Time) If an item
    is referenced, it will tend to be referenced
    again soon (e.g., loops, reuse)
  • Spatial Locality (Locality in Space) If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon (e.g., array access)
  • For the last 15 years, HW has relied on locality for speed
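A tiny illustration of both kinds of locality (my own example, not from the slides): the loop below walks an array in address order (spatial locality) and reuses the accumulator and the loop code itself on every iteration (temporal locality).

    #include <stdio.h>

    int main(void) {
        int a[1024], sum = 0;
        for (int i = 0; i < 1024; i++)
            a[i] = i;
        for (int i = 0; i < 1024; i++)
            sum += a[i];   /* a[i+1] is adjacent in memory; sum is reused */
        printf("%d\n", sum);
        return 0;
    }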

36
A Modern Memory Hierarchy
  • By taking advantage of the principle of locality
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at speeds offered by the fastest
    technology.

[Hierarchy figure: the processor (control + datapath with registers) is backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape).
Speed (ns): 1s / 10s / 100s / 10,000,000s (10s ms) / 10,000,000,000s (10s sec)
Size (bytes): 100s / Ks / Ms / Gs / Ts]
37
Levels of the Memory Hierarchy

Level           Capacity / Access Time / Cost                  Staging Xfer Unit
CPU Registers   100s Bytes, <10s ns                            Instr. Operands: prog./compiler, 1-8 bytes
Cache           K Bytes, 10-100 ns, 1-0.1 cents/bit            Blocks: cache cntl, 8-128 bytes
Main Memory     M Bytes, 200-500 ns, .0001-.00001 cents/bit    Pages: OS, 512-4K bytes
Disk            G Bytes, 10 ms (10,000,000 ns),                Files: user/operator, Mbytes
                10^-5 - 10^-6 cents/bit
Tape            infinite, sec-min, 10^-8 cents/bit

(Upper levels are smaller and faster; lower levels are larger and slower.)
38
Memory Hierarchy Terminology
  • Hit: data appears in some block (e.g. X) in the
    upper level
  • Hit Rate: the fraction of memory accesses found in
    the upper level
  • Hit Time: time to access the upper level, which
    consists of RAM access time + time to determine
    hit/miss
  • Miss: data must be retrieved from a lower level
    block (Y); Miss Rate = 1 - (Hit Rate)
  • Miss Penalty: time to replace a block in the
    upper level
  • + time to deliver the block to the processor

[Figure: block X in upper level memory and block Y in lower level memory, moving to/from the processor]
39
Memory Hierarchy
  • Cache - Main memory
  • Cache miss
  • Cache types
  • Main Memory - Disk
  • Virtual memory
  • Paging
  • TLB

40
Cache Measures
  • Hit rate: fraction found in that level
  • So high that we usually talk about the Miss rate instead
  • Average memory-access time = Hit time + Miss
    rate × Miss penalty (ns or clocks)
  • Miss penalty: time to replace a block from the lower
    level, including time to replace in the CPU
  • access time: time to the lower level
    = f(latency to lower level)
  • transfer time: time to transfer the block
    = f(bandwidth between upper & lower levels)
  • Hit Time << Miss Penalty (500 instructions on
    the 21264!)
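A minimal sketch of the average memory-access time formula (the numbers are assumptions, not from the slides):

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* e.g. 1-cycle hit, 5% miss rate, 50-cycle miss penalty (assumed) */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));   /* 3.50 */
        return 0;
    }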

41
Cache lines
  • Arranged in lines, each holding a handful of
    adjacent main memory entries
  • Neighbouring lines might contain data far apart
    in main memory
  • Reading: hit/miss rates
  • Writing: (write policy)
  • the line eventually gets replaced into main memory
  • Multi-CPU systems must implement cache coherency
  • Effective due to locality: can often use the rest
    of the information on a new line

42
Cache - Main Memory mapping
  • Direct mapped
  • each main memory ref → particular cache line
  • simplest
  • danger of thrashing
  • Fully Associative
  • general table, store the main memory address
  • not much used (but sometimes in TLB)
  • Set Associative
  • usually two-way or four-way
  • each main memory ref → 2 (4) cache lines

43
Simplest Cache: Direct Mapped

[Figure: a 16-location memory (addresses 0-F) feeding a 4 Byte Direct Mapped Cache (cache indices 0-3)]

  • Location 0 can be occupied by data from
  • Memory location 0, 4, 8, ... etc.
  • In general: any memory location whose 2 LSBs of
    the address are 0s
  • Address<1:0> → cache index
  • Which one should we place in the cache?
  • How can we tell which one is in the cache?
44
1 KB Direct Mapped Cache, 32B blocks
  • For a 2^N byte cache:
  • The uppermost (32 - N) bits are always the Cache
    Tag
  • The lowest M bits are the Byte Select (Block Size
    = 2^M)

[Figure: a 32-bit address split into Cache Tag (bits 31:10, example 0x50), Cache Index (bits 9:5, ex. 0x01), and Byte Select (bits 4:0, ex. 0x00). The cache holds 32 lines (0-31), each with a Valid Bit, a Cache Tag stored as part of the cache state, and 32 bytes of Cache Data; line 1, with tag 0x50, holds Byte 32 ... Byte 63, and the last line holds Byte 992 ... Byte 1023]
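A sketch (mine, not from the slides) of extracting these fields for this 1 KB, 32-byte-block cache; the example address is a hypothetical one chosen to reproduce the values in the figure.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x00014020;   /* hypothetical example address */

        uint32_t byte_select = addr & 0x1F;   /* bits 4:0, 32-byte blocks */
        uint32_t index = (addr >> 5) & 0x1F;  /* bits 9:5, 32 lines */
        uint32_t tag = addr >> 10;            /* bits 31:10 */

        /* prints tag=0x50 index=1 byte=0, matching the figure */
        printf("tag=0x%x index=%u byte=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)byte_select);
        return 0;
    }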
45
Thrashing
  • If alternate memory references point to the same
    cache line

      REAL*4 A(1024), B(1024)
      DO 10 I = 1, 1024
        A(I) = A(I) + B(I)
 10   CONTINUE

Arrays A and B each take up exactly 4kB, and are
adjacent in memory. In a 4kB direct mapped cache,
A(1) and B(1) will be mapped to the same cache
line. References to A and B will alternately
cause cache misses.
46
Set Associative Cache
  • Two (or four) direct mapped caches next to each
    other making a single cache
  • Now there are 2 (4) choices for where a
    particular main memory reference will appear in
    the cache
  • Which choice to use?
  • Least recently used
  • Random
  • Less likely to thrash

47
Two-way Set Associative Cache
  • N-way set associative: N entries for each Cache
    Index
  • N direct mapped caches operate in parallel (N
    typically 2 to 4)
  • Example: two-way set associative cache
  • the Cache Index selects a set from the cache
  • the two tags in the set are compared in parallel
  • data is selected based on the tag comparison result

[Figure: the Cache Index selects one set; each way's stored Cache Tag and Valid bit are compared with the address tag (Adr Tag) in parallel; the two compare results (Sel0, Sel1) drive a mux over the two Cache Blocks, and their OR signals a Hit]
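A compact sketch of that lookup (my own illustration, with an assumed geometry of 32 sets and 32-byte blocks):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 32   /* assumed: 32 sets x 2 ways x 32B blocks = 2 KB */

    typedef struct {
        bool valid;
        uint32_t tag;
        uint8_t data[32];
    } CacheLine;

    static CacheLine cache[NUM_SETS][2];   /* two ways per set */

    /* Returns the way that hits, or -1 on a miss. */
    int lookup(uint32_t addr) {
        uint32_t index = (addr >> 5) & (NUM_SETS - 1);  /* set index */
        uint32_t tag = addr >> 10;
        for (int way = 0; way < 2; way++)   /* done in parallel in hardware */
            if (cache[index][way].valid && cache[index][way].tag == tag)
                return way;
        return -1;   /* miss: pick a victim (LRU or random) and refill */
    }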
48
Write Policy
  • Write Through: write the information to both the
    block in the cache and the block in lower-level
    memory.
  • Write Back: write only to the block in the cache. The
    modified cache block is written to main memory
    only when replaced.
  • What about cache coherency for multiprocessors?
  • To reduce write stalls, use a write buffer
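A rough sketch (structure and names are my own assumptions) contrasting the two policies on a store; write-back defers the memory update with a dirty bit:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool valid, dirty;
        uint32_t tag;
        uint8_t data[32];
    } Line;

    /* Write-through: update the cache line and memory together. */
    void write_through(Line *l, int off, uint8_t v,
                       uint8_t *mem, uint32_t addr) {
        l->data[off] = v;
        mem[addr] = v;    /* often staged through a write buffer */
    }

    /* Write-back: update only the cache; flush when the line is evicted. */
    void write_back(Line *l, int off, uint8_t v) {
        l->data[off] = v;
        l->dirty = true;  /* main memory is updated on replacement */
    }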

49
The Cache Design Space
  • Several interacting dimensions
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs write-back
  • write allocation
  • The optimal choice is a compromise
  • depends on access characteristics
  • workload
  • use (I-cache, D-cache, TLB)
  • depends on technology / cost
  • Simplicity often wins

[Figure: axes for cache size, associativity, and block size; a schematic curve shows two competing factors, A and B, trading off between good and bad as a design parameter moves from less to more]
50
Virtual Memory
  • All processes think they have all the memory to
    themselves
  • Virtual memory → physical memory
  • Pages (typical size 512B - 16kB; say 4kB)
  • Mapping kept in a page table
  • Virtual memory is 32-bit/64-bit - bigger than
    physical memory; some pages are on disk
  • Size of page table: a 32-bit CPU with 4kB pages
    has > 1 million entries (2^32 / 2^12 = 2^20)

51
Virtual Memory History
  • To run programs bigger than memory - overlays
  • program out-of-core by hand
  • overlay 1 completed, pull overlay 2 from disk, etc.
  • Found a way to automate this - make the OS
    control it
  • 1961 - virtual memory, paging
  • See a book on OS

52
Address Map

V = {0, 1, ..., n - 1}  virtual address space
M = {0, 1, ..., m - 1}  physical address space, n > m

MAP: V → M ∪ {0}  address mapping function

MAP(a) = a'  if data at virtual address a is present at physical address a' in M
MAP(a) = 0   if data at virtual address a is not present in M (a missing item fault)

[Figure: the processor issues virtual address a in name space V; the address translation mechanism returns physical address a' in main memory, or raises a missing item fault that invokes the fault handler; the OS then performs the transfer from secondary memory to main memory]
53
Paging Organization

[Figure: the virtual address space is divided into 1K-byte pages (page 0 at VA 0, page 1 at 1024, ..., page 31 at 31744) and physical memory into 1K-byte frames (frame 0 at PA 0, frame 1 at 1024, ..., frame 7 at 7168); the page is the unit of mapping and also the unit of transfer from virtual to physical memory]

Address Mapping: a virtual address splits into a page number and a 10-bit displacement. The page number indexes the page table (located in physical memory, found via the Page Table Base Reg); each entry holds a valid bit V, Access Rights, and a frame address PA, which is combined with the displacement (actually, concatenation is more likely) to give the physical memory address.
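A toy sketch of that translation (types and names are my own assumptions):

    #include <stdint.h>

    #define PAGE_BITS 10    /* 1K-byte pages, as in the figure */

    typedef struct {
        int valid;          /* V bit */
        int access;         /* access rights (ignored here) */
        uint32_t frame;     /* physical frame number */
    } PTE;

    /* Returns the physical address, or -1 to signal a missing item fault. */
    int64_t translate(const PTE *page_table, uint32_t va) {
        uint32_t page = va >> PAGE_BITS;   /* page number indexes the table */
        uint32_t disp = va & ((1u << PAGE_BITS) - 1);
        if (!page_table[page].valid)
            return -1;      /* fault: the OS loads the page from disk */
        /* concatenate the frame number with the displacement */
        return ((int64_t)page_table[page].frame << PAGE_BITS) | disp;
    }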
54
VM problems
  • Size of page table
  • VM with a cache
  • It takes an extra memory access to translate VA
    to PA, and this makes cache access too expensive

55
Translation Lookaside Buffer TLB
  • A way to speed up translation is to use a special
    cache of recently used page table entries: the
    Translation Lookaside Buffer, or TLB
  • Based on locality of page reference
  • Like a cache on the page table mappings
  • TLB access time is comparable to cache access
    time

56
Translation Lookaside Buffers
TLBs are usually small, typically not more than
128 - 256 entries even on high end machines.
This permits fully associative lookup on these
machines. Most mid-range machines use small n-way
set associative organizations.
[Figure, translation with a TLB: the CPU presents a VA to the TLB lookup; a hit delivers the PA straight to the cache, while a miss goes through the full translation first; the cache then returns data on a hit or goes to main memory on a miss. With a cache access time of t, the TLB-hit path costs about 1/2 t and the miss/translation path about 20 t]
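Using the figure's timings, a quick sketch (the TLB hit rate is my assumption) of the effective translation cost:

    #include <stdio.h>

    int main(void) {
        double t = 1.0;           /* cache access time, normalised */
        double hit_rate = 0.98;   /* assumed TLB hit rate */

        /* hit path ~ t/2, miss path ~ 20t (values from the figure) */
        double eff = hit_rate * 0.5 * t + (1.0 - hit_rate) * 20.0 * t;
        printf("effective translation time = %.2f t\n", eff);   /* 0.89 t */
        return 0;
    }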
57
Summary TLB, Virtual Memory
  • Page tables map virtual address to physical
    address
  • TLBs are important for fast translation
  • TLB misses are significant in processor
    performance

58
Summary of Memory Hierarchy
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Temporal Locality, Spatial Locality
  • Three Major Categories of Cache Misses
  • Compulsory Misses: sad facts of life, e.g. cold
    start misses
  • Capacity Misses: increase cache size
  • Conflict Misses: increase cache size and/or
    associativity. Nightmare Scenario: thrashing!

59
Summary
  • Caches, TLBs, Virtual Memory all understood by
    examining how they deal with 4 questions
  • 1) Where can block be placed?
  • 2) How is block found?
  • 3) What block is replaced on miss?
  • 4) How are writes handled?