Transcript and Presenter's Notes

Title: CST305 Performance Evaluation


1
CST305 Performance Evaluation
  • L2 Basic Serial Architecture
  • Pipelining
  • Memory Hierarchy

2
References
  • Course web page
  • www.wmin.ac.uk/lancasd/CST305/
  • K. Dowd and C. Severance, High Performance Computing,
    chapters 2-3 (1st ed.)
  • J. L. Hennessy and D. A. Patterson, Computer
    Architecture: A Quantitative Approach

3
  • Designing to Last through Trends

                 Capacity         Speed
    Logic        2x in 3 years    2x in 3 years
    DRAM         4x in 3 years    2x in 10 years
    Disk         4x in 3 years    2x in 10 years
    Processor    (n.a.)           2x in 1.5 years

  • Time to run the task
  • Execution time, response time, latency
  • Tasks per day, hour, week, sec, ns, ...
  • Throughput, bandwidth
  • "X is n times faster than Y" means
    ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
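As a minimal sketch of this relation (my own illustration with assumed timings, not from the slides):

    #include <stdio.h>

    /* "X is n times faster than Y" means n = ExTime(Y) / ExTime(X). */
    int main(void) {
        double extime_x = 2.0;   /* execution time of X in seconds (assumed) */
        double extime_y = 6.0;   /* execution time of Y in seconds (assumed) */
        printf("X is %.1f times faster than Y\n", extime_y / extime_x);
        return 0;
    }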

4
Price vs. Cost
5
Computer Architecture's Changing Definition
  • 1950s to 1960s: Computer Architecture Course =
    Computer Arithmetic
  • 1970s to mid 1980s: Computer Architecture Course =
    Instruction Set Design, especially ISA appropriate
    for compilers
  • 1990s: Computer Architecture Course = Design of
    CPU, memory system, I/O system, Multiprocessors

6
Instruction Set Architecture (ISA)
[Diagram: the instruction set is the interface between software above and hardware below]
7
  • Performance prediction is hard because
    architecture and software are complicated
  • Consider a (benchmark) program, compile to get
    the list of instructions

- Takes 14 cycles sequentially
- Assumes that data is in cache
8
  • Pipeline
  • Like an assembly line
  • Overlap the execution of each instruction, so
    they start 1 cycle before the previous one
    completes

- Takes 11 cycles with the pipeline
- Does scale with clock speed
9
  • Memory Hierarchy
  • If one of the load instructions can't find the
    data in the cache

- Takes 62 cycles with a cache miss
- Does not scale with clock speed -- how do you
predict that there will be a cache miss?
10
RISC
  • Pipelining and a complicated memory hierarchy are
    characteristic of modern RISC processors
  • They make it much more difficult (than with older
    CISC processors) to predict performance
  • RISC = Reduced Instruction Set Computer
  • CISC = Complex Instruction Set Computer

11
RISC vs CISC
  • A disagreement in instruction set philosophy
  • CISC
  • Powerful primitives, close to high level
    languages
  • VAX, Intel
  • RISC
  • Low-level primitives - can compute anything, but
    need more instructions
  • Alpha, PowerPC, Sparc

12
  • Before circa 1985 design variables favoured CISC;
    they now favour RISC
  • Memory was precious, and CISC executables are
    smaller. Now memory is cheap.
  • Programmers wrote in assembler and used the
    complex instructions. Now people rely on
    compilers, which have difficulty using complex
    instructions when optimising.

13
A "Typical" RISC
  • Uniform instruction length
  • easier pipeline (with higher clock speed)
  • Simple addressing modes
  • to avoid stalls
  • Load/store architecture
  • memory references only in these explicit
    instructions
  • Many registers
  • avoid memory references
  • Delayed branch
  • a branch delay slot after any branch instruction

14
A "Typical" RISC
  • 32-bit fixed format instructions (3 formats)
  • 32 32-bit GPRs (R0 contains zero; DP takes a pair)
  • 3-address, reg-reg arithmetic instructions
  • Single address mode for load/store: base +
    displacement
  • no indirection
  • Simple branch conditions
  • Delayed branch

see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
15
Example: MIPS (DLX)

Register-Register
  Op (31:26) | Rs1 (25:21) | Rs2 (20:16) | Rd (15:11) | Opx (10:0)

Register-Immediate
  Op (31:26) | Rs1 (25:21) | Rd (20:16) | immediate (15:0)

Branch
  Op (31:26) | Rs1 (25:21) | Rs2/Opx (20:16) | immediate (15:0)

Jump / Call
  Op (31:26) | target (25:0)
16
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

17
Sequential Laundry
[Timeline figure, 6 PM to midnight: the four loads run back-to-back; each load uses the washer for 30 min, then the dryer for 40 min, then the folder for 20 min]
  • Sequential laundry takes 6 hours for 4 loads

18
Pipelined Laundry: Start work ASAP
[Timeline figure, 6 PM onward: each load starts as soon as the washer is free, so the washer, dryer, and folder work on different loads at the same time]
  • Pipelined laundry takes 3.5 hours for 4 loads

19
Pipelining Lessons
  • Pipelining doesn't help the latency of a single
    task; it helps the throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup (the sketch below checks these
    numbers for the laundry example)

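A minimal sketch (mine, not from the slides) that checks the laundry arithmetic: sequentially, each of the N loads takes the full 90 minutes; pipelined, after the first load finishes filling the pipeline, the dryer (the slowest stage) sets the pace.

    #include <stdio.h>

    int main(void) {
        int wash = 30, dry = 40, fold = 20;   /* stage times in minutes */
        int n_loads = 4;
        int pass = wash + dry + fold;         /* one load, start to finish */
        int slowest = dry;                    /* the dryer limits the rate */

        int sequential = n_loads * pass;                  /* 360 min = 6 h   */
        int pipelined = pass + (n_loads - 1) * slowest;   /* 210 min = 3.5 h */

        printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
        return 0;
    }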
20
Computer Pipelines
  • Execute billions of instructions, so throughput
    is what matters
  • DLX desirable features: all instructions the same
    length, registers located in the same place in the
    instruction format, memory operands only in loads
    or stores

21
5 Steps of DLX Datapath
  1. Instruction Fetch
  2. Instr. Decode / Reg. Fetch
  3. Execute / Addr. Calc.
  4. Memory Access
  5. Write Back

[Datapath figure; IR is the instruction register, LMD the load memory data register]
22
Pipelined DLX Datapath

[Figure: the same five stages - Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back - separated by pipeline registers]

  • Data stationary control
  • local decode for each instruction phase /
    pipeline stage

23
Visualizing Pipelining

[Figure: instructions in program order down the page, clock cycles across; each instruction's five stages overlap diagonally with its neighbours]
24
It's Not That Easy for Computers
  • If a complicated memory access occurs in stage 3,
    stage 4 will be delayed and the rest of the pipe
    is stalled.
  • If there is a branch (if...) or a jump, then some of
    the instructions that have already entered the
    pipeline should not be processed.
  • Need optimal conditions in order to keep the
    pipeline moving

25
Hazards prevent efficient pipelining
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (a single person to
    fold and put clothes away)
  • Data hazards: an instruction depends on the result
    of a prior instruction still in the pipeline
    (a missing sock)
  • Control hazards: pipelining of branches and other
    instructions that stall the pipeline

26
One Memory Port/Structural Hazards

[Figure, clock cycles across, instructions down: a Load followed by Instr 1-4; with a single memory port, the Load's memory-access stage and a later instruction's fetch need the port in the same cycle]
27
One Memory Port/Structural Hazards

[Figure: the same sequence with a stall inserted, so Instr 3's fetch waits one cycle for the memory port]
28
Can now build RISC cheaply
  • Instruction cache to speed instruction fetch
  • Dramatic memory size increases
  • Better pipeline design
  • Optimising compilers
  • CISC made it difficult to build pipelines
    resistant to stalls

29
Pipelining Summary
  • Just overlap tasks - easy if the tasks are
    independent
  • Speedup ≤ Pipeline Depth; if the ideal CPI is 1, then

    Speedup = (Pipeline Depth / (1 + Pipeline stall CPI))
              × (Clock Cycle Unpipelined / Clock Cycle Pipelined)

  • Hazards limit pipeline performance
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • Control: delayed branch, prediction
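A small sketch of this formula (the sample numbers are my own assumptions):

    #include <stdio.h>

    /* Speedup = depth / (1 + stall CPI) x (unpipelined clock / pipelined clock) */
    double pipeline_speedup(double depth, double stall_cpi,
                            double clk_unpipelined, double clk_pipelined) {
        return depth / (1.0 + stall_cpi) * (clk_unpipelined / clk_pipelined);
    }

    int main(void) {
        /* 5-stage pipeline, 0.5 stall cycles per instruction, equal clock
           periods before and after pipelining (all assumed values) */
        printf("speedup = %.2f\n", pipeline_speedup(5.0, 0.5, 1.0, 1.0));
        return 0;
    }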
30
Floating Point pipelines
  • RISC aims for 1 instruction/cycle
  • Actually with FP can get more!
  • Separate and more than one FP pipeline
  • Superscalar
  • schedule operations for side-by-side execution at
    run time
  • IBM RS/6000, Supersparc, DEC AXP
  • Speculative execution

31
5 minute Class Break
  • 120 minutes straight is too long for me to
    lecture (11:00-13:00)
  • 5 minutes: review last time & motivate this
    lecture
  • 40 minutes: lecture - pipeline
  • 5 minutes: break
  • 40 minutes: lecture - memory hierarchy
  • 5 minutes: summary of today's important topics

32
Memory Hierarchy
  • To cope with modern fast CPUs we need large, fast
    (and economical) memory
  • interleave
  • wide memory bus
  • separate instruction cache
  • hierarchy
  • fast expensive volatile cache
  • slower cheap non-volatile disk

33
Random Access Memory
  • DRAM
  • dynamic, charge based -- must keep refreshing
  • generally best price/performance
  • memory cycle time
  • SRAM
  • static, preserved as long as power is on
  • sometimes (e.g. graphics) better for critical fast
    access

34
Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)

[Figure, 1980-2000, performance on a log scale from 1 to 1000: CPU performance ("Moore's Law") grows 60%/yr (2x/1.5 yr), DRAM performance grows 9%/yr (2x/10 yrs); the processor-memory performance gap grows about 50%/year]
35
The Principle of Locality
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Two Different Types of Locality
  • Temporal Locality (Locality in Time) If an item
    is referenced, it will tend to be referenced
    again soon (e.g., loops, reuse)
  • Spatial Locality (Locality in Space) If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon (e.g., array access)
  • For the last 15 years, HW has relied on locality for speed
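A tiny illustration of both kinds of locality (my own example, not from the slides): the loop below walks an array in address order (spatial locality) and reuses the accumulator and the loop code itself on every iteration (temporal locality).

    #include <stdio.h>

    int main(void) {
        int a[1024], sum = 0;
        for (int i = 0; i < 1024; i++)
            a[i] = i;
        for (int i = 0; i < 1024; i++)
            sum += a[i];   /* a[i+1] is adjacent in memory; sum is reused */
        printf("%d\n", sum);
        return 0;
    }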

36
A Modern Memory Hierarchy
  • By taking advantage of the principle of locality
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at speeds offered by the fastest
    technology.

[Hierarchy figure: the processor (control + datapath with registers) is backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape).
Speed (ns): 1s / 10s / 100s / 10,000,000s (10s ms) / 10,000,000,000s (10s sec)
Size (bytes): 100s / Ks / Ms / Gs / Ts]
37
Levels of the Memory Hierarchy

Level           Capacity / Access Time / Cost                  Staging Xfer Unit
CPU Registers   100s Bytes, <10s ns                            Instr. Operands: prog./compiler, 1-8 bytes
Cache           K Bytes, 10-100 ns, 1-0.1 cents/bit            Blocks: cache cntl, 8-128 bytes
Main Memory     M Bytes, 200-500 ns, .0001-.00001 cents/bit    Pages: OS, 512-4K bytes
Disk            G Bytes, 10 ms (10,000,000 ns),                Files: user/operator, Mbytes
                10^-5 - 10^-6 cents/bit
Tape            infinite, sec-min, 10^-8 cents/bit

(Upper levels are smaller and faster; lower levels are larger and slower.)
38
Memory Hierarchy Terminology
  • Hit: data appears in some block (e.g. X) in the
    upper level
  • Hit Rate: the fraction of memory accesses found in
    the upper level
  • Hit Time: time to access the upper level, which
    consists of RAM access time + time to determine
    hit/miss
  • Miss: data must be retrieved from a lower level
    block (Y); Miss Rate = 1 - (Hit Rate)
  • Miss Penalty: time to replace a block in the
    upper level
  • + time to deliver the block to the processor

[Figure: block X in upper level memory and block Y in lower level memory, moving to/from the processor]
39
Memory Hierarchy
  • Cache - Main memory
  • Cache miss
  • Cache types
  • Main Memory - Disk
  • Virtual memory
  • Paging
  • TLB

40
Cache Measures
  • Hit rate: fraction found in that level
  • So high that we usually talk about the Miss rate instead
  • Average memory-access time = Hit time + Miss
    rate × Miss penalty (ns or clocks)
  • Miss penalty: time to replace a block from the lower
    level, including time to replace in the CPU
  • access time: time to the lower level
    = f(latency to lower level)
  • transfer time: time to transfer the block
    = f(bandwidth between upper & lower levels)
  • Hit Time << Miss Penalty (500 instructions on
    the 21264!)
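A minimal sketch of the average memory-access time formula (the numbers are assumptions, not from the slides):

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* e.g. 1-cycle hit, 5% miss rate, 50-cycle miss penalty (assumed) */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));   /* 3.50 */
        return 0;
    }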

41
Cache lines
  • Arranged in lines, each holding a handful of
    adjacent main memory entries
  • Neighbouring lines might contain data far apart
    in main memory
  • Reading: hit/miss rates
  • Writing: (write policy)
  • the line eventually gets replaced into main memory
  • Multi-CPU systems must implement cache coherency
  • Effective due to locality: can often use the rest
    of the information on a new line

42
Cache - Main Memory mapping
  • Direct mapped
  • each main memory ref → particular cache line
  • simplest
  • danger of thrashing
  • Fully Associative
  • general table, store the main memory address
  • not much used (but sometimes in TLB)
  • Set Associative
  • usually two-way or four-way
  • each main memory ref → 2 (4) cache lines

43
Simplest Cache: Direct Mapped

[Figure: a 16-location memory (addresses 0-F) feeding a 4 Byte Direct Mapped Cache (cache indices 0-3)]

  • Location 0 can be occupied by data from
  • Memory location 0, 4, 8, ... etc.
  • In general: any memory location whose 2 LSBs of
    the address are 0s
  • Address<1:0> → cache index
  • Which one should we place in the cache?
  • How can we tell which one is in the cache?
44
1 KB Direct Mapped Cache, 32B blocks
  • For a 2^N byte cache:
  • The uppermost (32 - N) bits are always the Cache
    Tag
  • The lowest M bits are the Byte Select (Block Size
    = 2^M)

[Figure: a 32-bit address split into Cache Tag (bits 31:10, example 0x50), Cache Index (bits 9:5, ex. 0x01), and Byte Select (bits 4:0, ex. 0x00). The cache holds 32 lines (0-31), each with a Valid Bit, a Cache Tag stored as part of the cache state, and 32 bytes of Cache Data; line 1, with tag 0x50, holds Byte 32 ... Byte 63, and the last line holds Byte 992 ... Byte 1023]
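A sketch (mine, not from the slides) of extracting these fields for this 1 KB, 32-byte-block cache; the example address is a hypothetical one chosen to reproduce the values in the figure.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x00014020;   /* hypothetical example address */

        uint32_t byte_select = addr & 0x1F;   /* bits 4:0, 32-byte blocks */
        uint32_t index = (addr >> 5) & 0x1F;  /* bits 9:5, 32 lines */
        uint32_t tag = addr >> 10;            /* bits 31:10 */

        /* prints tag=0x50 index=1 byte=0, matching the figure */
        printf("tag=0x%x index=%u byte=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)byte_select);
        return 0;
    }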
45
Thrashing
  • If alternate memory references point to the same
    cache line

      REAL*4 A(1024), B(1024)
      DO 10 I = 1, 1024
        A(I) = A(I) + B(I)
 10   CONTINUE

Arrays A and B each take up exactly 4kB, and are
adjacent in memory. In a 4kB direct mapped cache,
A(1) and B(1) will be mapped to the same cache
line. References to A and B will alternately
cause cache misses.
46
Set Associative Cache
  • Two (or four) direct mapped caches next to each
    other making a single cache
  • Now there are 2 (4) choices for where a
    particular main memory reference will appear in
    the cache
  • Which choice to use?
  • Least recently used
  • Random
  • Less likely to thrash

47
Two-way Set Associative Cache
  • N-way set associative: N entries for each Cache
    Index
  • N direct mapped caches operate in parallel (N
    typically 2 to 4)
  • Example: two-way set associative cache
  • the Cache Index selects a set from the cache
  • the two tags in the set are compared in parallel
  • data is selected based on the tag comparison result

[Figure: the Cache Index selects one set; each way's stored Cache Tag and Valid bit are compared with the address tag (Adr Tag) in parallel; the two compare results (Sel0, Sel1) drive a mux over the two Cache Blocks, and their OR signals a Hit]
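A compact sketch of that lookup (my own illustration, with an assumed geometry of 32 sets and 32-byte blocks):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 32   /* assumed: 32 sets x 2 ways x 32B blocks = 2 KB */

    typedef struct {
        bool valid;
        uint32_t tag;
        uint8_t data[32];
    } CacheLine;

    static CacheLine cache[NUM_SETS][2];   /* two ways per set */

    /* Returns the way that hits, or -1 on a miss. */
    int lookup(uint32_t addr) {
        uint32_t index = (addr >> 5) & (NUM_SETS - 1);  /* set index */
        uint32_t tag = addr >> 10;
        for (int way = 0; way < 2; way++)   /* done in parallel in hardware */
            if (cache[index][way].valid && cache[index][way].tag == tag)
                return way;
        return -1;   /* miss: pick a victim (LRU or random) and refill */
    }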
48
Write Policy
  • Write Through: write the information to both the
    block in the cache and the block in lower-level
    memory.
  • Write Back: write only to the block in the cache. The
    modified cache block is written to main memory
    only when replaced.
  • What about cache coherency for multiprocessors?
  • To reduce write stalls, use a write buffer
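A rough sketch (structure and names are my own assumptions) contrasting the two policies on a store; write-back defers the memory update with a dirty bit:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool valid, dirty;
        uint32_t tag;
        uint8_t data[32];
    } Line;

    /* Write-through: update the cache line and memory together. */
    void write_through(Line *l, int off, uint8_t v,
                       uint8_t *mem, uint32_t addr) {
        l->data[off] = v;
        mem[addr] = v;    /* often staged through a write buffer */
    }

    /* Write-back: update only the cache; flush when the line is evicted. */
    void write_back(Line *l, int off, uint8_t v) {
        l->data[off] = v;
        l->dirty = true;  /* main memory is updated on replacement */
    }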

49
The Cache Design Space
  • Several interacting dimensions
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs write-back
  • write allocation
  • The optimal choice is a compromise
  • depends on access characteristics
  • workload
  • use (I-cache, D-cache, TLB)
  • depends on technology / cost
  • Simplicity often wins

[Figure: axes for cache size, associativity, and block size; a schematic curve shows two competing factors, A and B, trading off between good and bad as a design parameter moves from less to more]
50
Virtual Memory
  • All processes think they have all the memory to
    themselves
  • Virtual memory → physical memory
  • Pages (typical size 512B - 16kB; say 4kB)
  • Mapping kept in a page table
  • Virtual memory is 32-bit/64-bit - bigger than
    physical memory; some pages are on disk
  • Size of page table: a 32-bit CPU with 4kB pages
    has > 1 million entries (2^32 / 2^12 = 2^20)

51
Virtual Memory History
  • To run programs bigger than memory - overlays
  • program out-of-core by hand
  • overlay 1 completed, pull overlay 2 from disk, etc.
  • Found a way to automate this - make the OS
    control it
  • 1961 - virtual memory, paging
  • See a book on OS

52
Address Map

V = {0, 1, ..., n - 1}  virtual address space
M = {0, 1, ..., m - 1}  physical address space, n > m

MAP: V → M ∪ {0}  address mapping function

MAP(a) = a'  if data at virtual address a is present at physical address a' in M
MAP(a) = 0   if data at virtual address a is not present in M (a missing item fault)

[Figure: the processor issues virtual address a in name space V; the address translation mechanism returns physical address a' in main memory, or raises a missing item fault that invokes the fault handler; the OS then performs the transfer from secondary memory to main memory]
53
Paging Organization

[Figure: the virtual address space is divided into 1K-byte pages (page 0 at VA 0, page 1 at 1024, ..., page 31 at 31744) and physical memory into 1K-byte frames (frame 0 at PA 0, frame 1 at 1024, ..., frame 7 at 7168); the page is the unit of mapping and also the unit of transfer from virtual to physical memory]

Address Mapping: a virtual address splits into a page number and a 10-bit displacement. The page number indexes the page table (located in physical memory, found via the Page Table Base Reg); each entry holds a valid bit V, Access Rights, and a frame address PA, which is combined with the displacement (actually, concatenation is more likely) to give the physical memory address.
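A toy sketch of that translation (types and names are my own assumptions):

    #include <stdint.h>

    #define PAGE_BITS 10    /* 1K-byte pages, as in the figure */

    typedef struct {
        int valid;          /* V bit */
        int access;         /* access rights (ignored here) */
        uint32_t frame;     /* physical frame number */
    } PTE;

    /* Returns the physical address, or -1 to signal a missing item fault. */
    int64_t translate(const PTE *page_table, uint32_t va) {
        uint32_t page = va >> PAGE_BITS;   /* page number indexes the table */
        uint32_t disp = va & ((1u << PAGE_BITS) - 1);
        if (!page_table[page].valid)
            return -1;      /* fault: the OS loads the page from disk */
        /* concatenate the frame number with the displacement */
        return ((int64_t)page_table[page].frame << PAGE_BITS) | disp;
    }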
54
VM problems
  • Size of page table
  • VM with a cache
  • It takes an extra memory access to translate VA
    to PA, and this makes cache access too expensive

55
Translation Lookaside Buffer TLB
  • A way to speed up translation is to use a special
    cache of recently used page table entries: the
    Translation Lookaside Buffer, or TLB
  • Based on locality of page reference
  • Like a cache on the page table mappings
  • TLB access time is comparable to cache access
    time

56
Translation Lookaside Buffers
TLBs are usually small, typically not more than
128 - 256 entries even on high end machines.
This permits fully associative lookup on these
machines. Most mid-range machines use small n-way
set associative organizations.
[Figure, translation with a TLB: the CPU presents a VA to the TLB lookup; a hit delivers the PA straight to the cache, while a miss goes through the full translation first; the cache then returns data on a hit or goes to main memory on a miss. With a cache access time of t, the TLB-hit path costs about 1/2 t and the miss/translation path about 20 t]
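Using the figure's timings, a quick sketch (the TLB hit rate is my assumption) of the effective translation cost:

    #include <stdio.h>

    int main(void) {
        double t = 1.0;           /* cache access time, normalised */
        double hit_rate = 0.98;   /* assumed TLB hit rate */

        /* hit path ~ t/2, miss path ~ 20t (values from the figure) */
        double eff = hit_rate * 0.5 * t + (1.0 - hit_rate) * 20.0 * t;
        printf("effective translation time = %.2f t\n", eff);   /* 0.89 t */
        return 0;
    }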
57
Summary TLB, Virtual Memory
  • Page tables map virtual address to physical
    address
  • TLBs are important for fast translation
  • TLB misses are significant in processor
    performance

58
Summary of Memory Hierarchy
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Temporal Locality, Spatial Locality
  • Three Major Categories of Cache Misses
  • Compulsory Misses: sad facts of life, e.g. cold
    start misses
  • Capacity Misses: increase cache size
  • Conflict Misses: increase cache size and/or
    associativity. Nightmare Scenario: thrashing!

59
Summary
  • Caches, TLBs, Virtual Memory all understood by
    examining how they deal with 4 questions
  • 1) Where can block be placed?
  • 2) How is block found?
  • 3) What block is replaced on miss?
  • 4) How are writes handled?