1
EECS 583 Class 20: Compiler Directed Prefetching
  • University of Michigan
  • March 28, 2005
  • Guest Speaker: Manjunath Kudlur

2
Announcements
  • Scott back for Wednesday's class
  • Register allocation
  • Review for exam
  • Exam next Monday in class (April 4)
  • SIG meetings
  • We'll definitely have them this week on Thurs or
    Fri
  • Figure out schedule on Wed in class
  • Sorry about last week; too much stuff got pushed
    to Fri

3
Introduction
  • Applications spend most of their time waiting for
    memory
  • Prefetching reduces stalls by overlapping data
    fetch with computation
  • Why compiler based prefetching?
  • Compiler has global view of data access patterns
  • Can perform sophisticated and aggressive
    optimisations

4
Some Preliminaries
  • Hardware support
  • Non-faulting prefetch instruction in the ISA
  • Lockup-free cache: multiple outstanding misses
  • Characteristics of compiler-based prefetching
    algorithms
  • Accuracy: whether missing loads and missing
    addresses are correctly predicted
  • Timeliness
  • Prefetching too early would pollute the cache
  • Prefetching too late is useless
  • Instruction bandwidth: instructions to compute
    the prefetch address, plus the prefetch
    instruction itself

5
In This Talk..
  • Design and Evaluation of a Compiler Algorithm
    for Prefetching, Todd C. Mowry, Monica S. Lam
    and Anoop Gupta, ASPLOS V, 1992
  • Automatic Compiler-Inserted Prefetching for
    Pointer-Based Applications, C.-K. Luk and Todd C.
    Mowry, IEEE Transactions on Computers, 1999
  • Efficient Discovery of Regular Stride Patterns
    in Irregular Programs and Its Use in Compiler
    Prefetching, Youfeng Wu, PLDI, 2002
  • Compiler Orchestrated Prefetching via
    Speculation and Predication, Rodric Rabbah
    et al., ASPLOS 2002
  • Generating Cache Hints for Improved Program
    Efficiency, Kristof Beyls and Erik H.
    D'Hollander, Journal of Systems Architecture, 2004

6
A Prefetching Algorithm for Array Based Programs
  • Handles programs where array accesses are
    affine functions of the iteration vector
  • for (i = 0; i < 100; i++)
  •   for (j = 0; j < 100; j++)
  •     .. X[i][j-1] ..
  • I = (i, j) is the iteration vector
  • The access X[i][j-1] is affine, A·I + c:

        [ 1  0 ] [ i ]   [  0 ]   [  i  ]
        [ 0  1 ] [ j ] + [ -1 ] = [ j-1 ]
7
Locality Analysis
  • Find out which references reuse the same cache
    line
  • Tells us that the other references have to be
    prefetched
  • Reuse computed in terms of iteration distances
  • for (i = 0; i < 32; i++)
  •   for (j = 0; j < 32; j++)
  •     .. X[i][j] ..
  • (Figure: reuse points for X plotted in the i-j
    iteration space)
  • Result from linear algebra:
  • The reuse points for X[A·I + c] lie in the null
    space of the matrix A
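A worked instance of the null-space property, using the affine accesses from the earlier example loop (the derivation is a standard linear-algebra check, not taken from the slides):

```latex
% Self-reuse of X[i][j]: here A is the identity, so
% A\vec{r} = \vec{0} \iff \vec{r} = \vec{0},
% i.e. the null space is trivial: no two distinct iterations touch
% the same element (only spatial reuse within a cache line).
A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
\mathrm{Null}(A) = \{\vec{0}\}
% Group reuse between X[i][j] and X[i][j-1]: solve
% A\vec{I}_1 + \vec{c}_1 = A\vec{I}_2 + \vec{c}_2
% with \vec{c}_1 = (0,0)^T and \vec{c}_2 = (0,-1)^T, giving
\vec{I}_2 - \vec{I}_1 = A^{-1}(\vec{c}_1 - \vec{c}_2) = \begin{pmatrix} 0 \\ 1 \end{pmatrix}
% so the element read as X[i][j] is reused one j-iteration later
% by the X[i][j-1] reference.
```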
8
(Overly Simplistic) Example
  • Cache line size: 8 bytes (2 elements of B)
  • Original code:

        for (i = 0; i < 100; i++)
          .. B[i] ..

  • After loop splitting and prefetch insertion:

        for (i = 0; i < 1; i++)
          prefetch(&B[i]);         // prologue: prefetching this cache line
        for (i = 0; i < 100; i += 2) {
          prefetch(&B[i+2]);       // prefetching a future cache line
          .. B[i] ..               // working on this cache line
          .. B[i+1] ..
        }

  • A software pipeline: each iteration prefetches the
    cache line needed ahead while working on the
    current one
9
A Prefetching Algorithm for Pointer Based Programs
  • Challenges in prefetching for pointer based
    programs
  • Recursive data structures
  • Trees, linked lists
  • Many SPEC and real world programs use RDSs
    extensively

10
Greedy Prefetching
  • Prefetch all child nodes of the current node
  • Advantages
  • Straightforward to implement in a compiler
  • Very low overhead
  • Disadvantages
  • Prefetching just one level of children may not
    be enough

11
Pointer Chasing Problem
  • Have to prefetch a node far away to hide latency
  • Have to dereference multiple levels of pointers
    to get to far away node
  • Pointer chasing leads to loads, which themselves
    cause misses; this defeats the purpose
  • History Pointer Prefetching

12
History Pointer Prefetching
  • Maintain pointers to far away nodes in current
    node
  • The history pointers are filled during first
    traversal
  • Data prefetched using history pointers during
    later traversals
  • Disadvantages
  • Extra space/computation overhead for history
    pointers
  • First traversal does not benefit

13
Regular Stride Patterns
  • Compiler prefetching at assembly level
  • Consider individual load instructions, look for
    patterns in the address they load
  • Artifacts of memory allocation order and access
    patterns

In 197.parser, the loads in S1 and S2 have a constant
stride 94% of the time
In 254.gap, the load in S2 has 2 dominant strides,
occurring 48% and 47% of the time
14
Use in Prefetching
  • Discover regular strides based on profiling
  • When 1 or 2 dominant strides are found for a load
    instruction, insert prefetches accordingly

15
Precomputation Based Prefetching
Assembly:

    R1 = list
    R5 = 0
    loop:
      R2 = R1 + 4
      R3 = load [R2]        // record->field
      R4 = R1 + 8
      R1 = load [R4]        // record->next
      R5 = R5 + R3          // sum += data
      br loop (R1 != NULL)

Source:

    // record is a pointer to a structure in memory
    record = list;
    while (record != NULL) {
      data = record->field;
      record = record->next;
      sum += data;
    }
16
Precomputation Based Prefetching (Contd..)
Main loop (as before):

    R1 = list
    R5 = 0
    R6 = R1 + 8
    loop:
      R2 = R1 + 4
      R3 = load [R2]
      R4 = R1 + 8
      R1 = load [R4]
      R5 = R5 + R3
      br loop (R1 != NULL)

Precomputation slice (runs ahead of the main loop):

    R7 = load [R6]        // chase the next pointer ahead of the loop
    R6 = R7 + 8
    R8 = pref [R7]        // prefetch the future node
    R9 = R8 + 8
17
Informing Memory Operations
  • What if the precomputation operations themselves
    cause cache misses?
  • Should abandon the prefetch when not profitable
  • Informing load operation
  • Sets some architectural state to inform the
    program about hit/miss status
  • E.g., on Itanium 2, iLD sets a predicate register
    when it misses
  • Can use iLD's predicate to guard later prefetch
    instructions

18
Informing Memory Operations
Main loop (as before):

    R1 = list
    R5 = 0
    R6 = R1 + 8
    loop:
      R2 = R1 + 4
      R3 = load [R2]
      R4 = R1 + 8
      R1 = load [R4]
      R5 = R5 + R3
      br loop (R1 != NULL)

Precomputation slice, predicated on the informing load:

    R7 = iLD [R6]          // informing load: sets p1
    R6 = R7 + 8     if p1
    R8 = pref [R7]  if p1
    R9 = R8 + 8     if p1
19
Cache Hints
  • Annotations to regular memory instructions
  • Source cache specifier
  • Indicates at which cache level the data is likely
    to be found
  • Used by the compiler to estimate the access
    latency
  • Target cache specifier
  • Indicates at which cache level the data is kept
    after execution
  • Used by hardware to make decisions on data
    placement

20
Cache Hints on Itanium
  • Target cache hints specify whether there is
    temporal locality at a given cache level
  • t1, nt1, nt2, nta

21
Static Cache Hint Selection
  • Reuse distance: the number of unique memory elements
    accessed between two accesses to the same element
  • Property 1: an access a will hit in a fully
    associative LRU cache with n lines iff its
    backward reuse distance is less than n
  • E.g., a6 will be a hit if RD(a1, a6) = 3 < n
  • Property 2: data accessed by access a will
    remain in the cache iff its forward reuse distance
    is less than n

22
Static Cache Hint Selection (Contd..)
  • Plot a graph of reuse distance vs. number of
    times that distance was seen
  • Pick cache hint accordingly
  • Profit!