Cache Simulations and Application Performance

Learn more at: https://icl.utk.edu

1
Cache Simulations and Application Performance
  • Christopher Kerr (kerr_at_gfdl.gov)
  • Philip Mucci (mucci_at_cs.utk.edu)
  • Jeff Brown (jeffb_at_lanl.gov)
  • Los Alamos, Sandia National Laboratories

2
Goal
  • To optimize large, numerically intensive
    applications with poor cache utilization.
  • By taking advantage of the memory hierarchy, we
    can often achieve the greatest performance
    improvement for the time invested.

3
Philosophy
  • By simulating the cache hierarchy, we wish to
    understand how the application's data maps to a
    specific cache architecture.
  • In addition, we wish to understand the
    application's reference pattern and its
    relationship to that mapping.
  • Performance improvements can then be derived
    algorithmically from this information.

4
Cache Simulator
  • Consists of:
    • Instrumentation assistant (Perl)
    • Header files
    • Run-time library to be linked with the
      application (C)
  • Works with:
    • C, C++, Fortran 77, Fortran 90

5
How it works
  • Cache simulator is called on memory (array)
    references
  • Cache simulator reads a configuration file
    containing an architectural description of the
    memory hierarchy for multiple machines.
  • Environment variables enable different options

6
How it works (cont)
  • Each call to the simulator provides as input
  • Address of the reference
  • Size of the datum being accessed
  • Symbolic name consisting of the name, file and
    line number
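
The call interface above can be illustrated with a small model. This is a minimal sketch, not the library's actual implementation: it assumes a single set-associative cache level with LRU replacement (the slides do not state the replacement policy), and the class name CacheSim and its access method are hypothetical. The default geometry is 32 kB with 32 B lines and 2-way associativity.

```python
from collections import OrderedDict, defaultdict


class CacheSim:
    """Hypothetical single-level, set-associative cache model (LRU assumed)."""

    def __init__(self, size_bytes=32 * 1024, line_bytes=32, assoc=2):
        self.line_bytes = line_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (line_bytes * assoc)
        # One OrderedDict per set: keys are line tags, ordered oldest -> newest.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0
        self.misses_by_name = defaultdict(int)

    def _touch_line(self, line_number):
        """Look up one cache line; return True on a hit."""
        index = line_number % self.num_sets
        tag = line_number // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(lines) >= self.assoc:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = True
        return False

    def access(self, address, size, name):
        """Simulate one reference: address, datum size in bytes, symbolic name."""
        first = address // self.line_bytes
        last = (address + size - 1) // self.line_bytes
        self.accesses += 1
        hit = True
        for line_number in range(first, last + 1):  # more than one line: split access
            if not self._touch_line(line_number):
                hit = False
        if not hit:
            self.misses += 1
            self.misses_by_name[name] += 1
        return hit


sim = CacheSim()
# Walk 64 consecutive 8-byte elements starting at a line-aligned address.
for i in range(64):
    sim.access(0x32D60 + 8 * i, 8, "R(i,j):13:stencil.F")
print(sim.accesses, sim.misses)  # prints: 64 16
```

With 8-byte elements and 32-byte lines, four consecutive elements share a line, so a unit-stride walk over a cold cache misses once per four accesses.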

7
Instrumentation (before)
    subroutine kji(A, ii, jj, lda, B, kk, ldb, C, ldc)
    dimension A(lda,lda), B(ldb,ldb), C(ldc,ldc)
    do k = 1, kk
      do j = 1, jj
        do i = 1, ii
          A(i,j) = A(i,j) + B(i,k) * C(k,j)
        enddo
      enddo
    enddo
    return
    end

8
Instrumentation (after)
    ...
    do i = 1, ii
      call cache_sim(A(i,j),KIND(A(i,j)),'A(i,j):7:stdin\0')
      call cache_sim(A(i,j),KIND(A(i,j)),'A(i,j):7:stdin\0')
      call cache_sim(B(i,k),KIND(B(i,k)),'B(i,k):7:stdin\0')
      call cache_sim(C(k,j),KIND(C(k,j)),'C(k,j):7:stdin\0')
      A(i,j) = A(i,j) + B(i,k) * C(k,j)
    enddo
    ...
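
The before/after transformation can be sketched as a simple text rewrite. The actual instrumentation assistant is written in Perl; the Python below is only a hypothetical stand-in. The function instrument_statement, its arguments, and the colon-separated name:line:file layout of the symbolic name are assumptions made for illustration, and the regular expression handles only simple, non-nested subscripts.

```python
import re

# Matches simple array references such as A(i,j); nested parentheses are not handled.
ARRAY_REF = re.compile(r"\b([A-Za-z_]\w*)\(([^()]*)\)")


def instrument_statement(statement, line_no, filename, arrays):
    """Return cache_sim calls for each array reference, followed by the statement.

    `arrays` is the set of names known to be arrays, so that function calls
    with the same syntax are left alone.
    """
    out = []
    for match in ARRAY_REF.finditer(statement):
        if match.group(1) in arrays:
            ref = match.group(0)
            out.append(
                f"call cache_sim({ref},KIND({ref}),'{ref}:{line_no}:{filename}\\0')"
            )
    out.append(statement)
    return out


lines = instrument_statement(
    "A(i,j) = A(i,j) + B(i,k) * C(k,j)", 7, "kji.f", {"A", "B", "C"}
)
print("\n".join(lines))
```

One call is emitted per textual reference, so A(i,j) is reported twice: once for the write on the left-hand side and once for the read on the right.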

9
Output
  • Summary
  • Misses by name
  • Misses by address
  • Conflict matrix
  • Address trace

10
Summary
    Machine 1: test-machine
    Cache level 1: size 32kB, line size 32B,
                   associativity 2, 1024 lines total
    --------------------------------
    Total mem accesses   166492.00
    Total cache misses    10276.00
    Total cache hits     156216.00
    Total hit rate           93.83
    --------------------------------
    Num split accesses        0.00
    Cold cache misses      1024.00
    Real misses            9252.00
    Real hit rate            94.41
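
The derived figures in this summary can be reproduced from the three raw counters. The sketch below is a worked check, not code from the simulator. The formula used for the real hit rate (hits divided by accesses excluding cold misses) is an inference that happens to match the reported 94.41; the slides do not state how it is computed.

```python
total_accesses = 166492
total_misses = 10276
cold_misses = 1024

total_hits = total_accesses - total_misses
real_misses = total_misses - cold_misses
hit_rate = 100.0 * total_hits / total_accesses
# Assumption: cold misses are excluded from the access count as well.
real_hit_rate = 100.0 * total_hits / (total_accesses - cold_misses)
# 32 kB of cache divided into 32 B lines.
lines_total = (32 * 1024) // 32

print(total_hits, real_misses, round(hit_rate, 2), round(real_hit_rate, 2), lines_total)
# prints: 156216 9252 93.83 94.41 1024
```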

11
Misses by name
  • Name trace: miss rate for references > 0
    percent.

    Percentage  Real misses  Reference:Line:File
          0.01        1.000  X(i,j):26:stencil.F
          0.01        1.000  X(i,j-1):26:stencil.F
          9.08      840.000  R(i,j):26:stencil.F
          0.12       11.000  X(i+1,j-1):26:stencil.F
          9.08      840.000  X(i+1,j+1):26:stencil.F
          9.08      840.000  AN(i,j):26:stencil.F
          7.61      704.000  ANE(i,j):17:stencil.F
          7.61      704.000  AN(i,j):15:stencil.F
          0.13       12.000  ANE(i,j-1):26:stencil.F
          9.08      840.000  ANE(i,j):26:stencil.F
    ...

12
Misses by address
  • Address trace: miss rate for references > 0
    percent.

    Percentage  Real misses  Address  Reference:Line:File
          0.01        1.000  0x32d60  R(i,j):13:stencil.F
          0.01        1.000  0x32d80  R(i,j):13:stencil.F
          0.01        1.000  0x32da0  R(i,j):13:stencil.F
          0.01        1.000  0x32dc0  R(i,j):13:stencil.F
          0.01        1.000  0x32de0  R(i,j):13:stencil.F
          0.01        1.000  0x32e00  R(i,j):13:stencil.F
    ...

13
Conflict matrix
  • Each axis lists the instrumented arrays.
  • X axis is the replacer, Y axis is the replacee.
  • Each element is the number of times an element
    of the X-axis array replaced an element of the
    Y-axis array in the cache.
  • Goal is to algorithmically determine optimal
    layout, placement, padding and blocking.
  • This is a minimization problem.

14
Conflict matrix (cont)
           0    1    2
      0  100   50   20
      1   20   10   10
      2   50    0    0

    Num  Name
      0  A(i,j):7:kji.f
      1  B(i,k):7:kji.f
      2  C(k,j):7:kji.f
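
How such a matrix could be accumulated is sketched below. This is a simplified illustration under stated assumptions, not the simulator's code: it models a direct-mapped cache rather than the 2-way example, the function conflict_matrix is a hypothetical name, and the two-array trace is invented for the demonstration.

```python
from collections import defaultdict


def conflict_matrix(trace, num_lines=1024, line_bytes=32):
    """Count replacements in a direct-mapped cache (an assumed, simplified model).

    `trace` is an iterable of (address, name) pairs. The result maps
    (replacee, replacer) -> count, i.e. row = evicted name, column = evicting name.
    """
    resident = {}                      # line index -> (tag, name)
    counts = defaultdict(int)
    for address, name in trace:
        line_number = address // line_bytes
        index = line_number % num_lines
        tag = line_number // num_lines
        if index in resident:
            old_tag, old_name = resident[index]
            if old_tag != tag:
                counts[(old_name, name)] += 1   # `name` replaces `old_name`
        resident[index] = (tag, name)
    return counts


# Hypothetical layout: two arrays exactly one cache size (32 kB) apart,
# so A(i) and B(i) always map to the same line and evict each other.
trace = []
for i in range(4):
    trace.append((0x10000 + 8 * i, "A"))
    trace.append((0x10000 + 32 * 1024 + 8 * i, "B"))

counts = conflict_matrix(trace)
print(counts[("A", "B")], counts[("B", "A")])  # prints: 4 3

# Padding B by one cache line (32 bytes) moves it to a different line index.
padded = [(addr + 32 if name == "B" else addr, name) for addr, name in trace]
print(sum(conflict_matrix(padded).values()))   # prints: 0
```

The padded run shows the kind of layout change the matrix is meant to suggest: a large off-diagonal count between two arrays points to a placement that makes them compete for the same lines.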

15
Address dump
  • Symbolic name
  • Virtual address
  • Cache line
  • Goal is to use this for replay.
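
One possible shape for a dump record and its replay-side parser is sketched below. The record layout, the helper names cache_line_of, dump_record and parse_record, and the reading of "cache line" as a set index are all assumptions; the slides do not specify the file format. The geometry is the 32 kB, 32 B line, 2-way example, which gives 512 sets.

```python
def cache_line_of(address, line_bytes=32, num_sets=512):
    """Return (memory line number, set index) for a virtual address."""
    line_number = address // line_bytes
    return line_number, line_number % num_sets


def dump_record(name, address):
    """Format one dump entry: symbolic name, virtual address, cache set index."""
    _, index = cache_line_of(address)
    return f"{name} {address:#x} {index}"


def parse_record(text):
    """Inverse of dump_record, as a replay tool would need."""
    name, address, index = text.rsplit(" ", 2)
    return name, int(address, 16), int(index)


record = dump_record("R(i,j):13:stencil.F", 0x32D60)
print(record)  # prints: R(i,j):13:stencil.F 0x32d60 363
```

Because the virtual address is stored alongside the line index, a replay could recompute the mapping for a different geometry by calling cache_line_of with other parameters instead of rerunning the application.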

16
Future
  • Handle nonblocking caches, replacement policies,
    write strategies and buffering.
  • Add output file with starting address and extents
    for each array.
  • Facility for replay of the simulator using the
    address dump and data regarding padding,
    blocking and alignment. This will eliminate the
    need for additional runs.

17
Future (cont)
  • Categorize cold misses for repeatedly accessed
    data items.
  • Provide cost metrics to analyze approximate
    performance loss due to poor locality.
  • Full Lex/Yacc based parser.
  • Perl/Tk GUI for finer control of instrumentation.

18
Future (cont)
  • MPI, Thread aware
  • Reduction in run-time requirements
  • MUT integration
  • Tools to compare data sets