Dagoberto A.R.Justo - PowerPoint PPT Presentation

1
Memory System Performance (Chapter 3)
  • Dagoberto A.R.Justo
  • PPGMAp
  • UFRGS

2
Introduction
  • The clock rate of your CPU does not alone
    determine its performance
  • Nowadays, memory speed is becoming the limiting
    factor
  • Hardware designers are creating architectures
    that try to overcome memory speed limitations
  • However, this hardware is designed to be
    efficient only for programs structured in
    certain ways
  • Thus, careful program design is essential to
    obtain high performance
  • We introduce these issues by looking at simple
    matrix operations and modeling their performance,
    given certain characteristics of the CPU
    architecture

3
Memory Issues -- Definitions
  • Consider an architecture with several caches
  • How long does it take to receive a word in a
    particular cache, once requested?
  • How many words can be retrieved in a unit of
    time, once the first one is received?
  • Assume we request a word of memory and receive a
    block of data of size b words containing the
    desired word
  • Latency: the time l to receive the first word
    after the request (usually in nanoseconds)
  • Bandwidth: the number of words received per unit
    of time with one request (usually in millions of
    words per second)

4
Hypothetical Machine 1 (No Cache)
  • Clock rate 1 GHz (1 nanosecond, 1 ns)
  • Two multiply-add units
  • can perform 2 multiply and 2 add operations per
    cycle
  • Fast CPU -- 4 flops per cycle → peak performance
    of 4 Gflops
  • Latency 100 ns
  • takes 100 cycles x 1 ns to obtain a word once
    requested
  • Block size 1 word (8 bytes = 64 bits)
  • Thus, the bandwidth is 10 megawords per second
    (Mwords/s)
  • However, very slow in practice
  • Consider a dot product operation
  • Each step is a multiply and add accessing 2
    elements of 2 vectors and accumulating the result
    in a register, say s
  • The 2 elements arrive from memory every 200 ns
    (two 100 ns fetches) and the 2 operations on
    them are performed in 1 cycle
  • Thus, the machine spends 201 ns per 2 flops, or
    100.5 ns for each flop
  • ≈ 10 Mflops -- a factor of 400 slower than the
    peak
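
The arithmetic on this slide can be checked with a short script. This is only a sketch of the timing model described above: the function and constant names are illustrative, and the two fetches are assumed to happen one after the other, with no overlap.

```python
# Timing model for a dot product on Hypothetical Machine 1 (no cache).
# Names are illustrative; the numbers come from the slide.

CYCLE_NS = 1.0            # 1 GHz clock
LATENCY_NS = 100.0        # time to receive one word from memory
PEAK_FLOPS_PER_CYCLE = 4  # two multiply-add units


def dot_product_mflops(latency_ns=LATENCY_NS, cycle_ns=CYCLE_NS):
    """Sustained rate when each step fetches 2 words back to back,
    then performs one multiply and one add (2 flops) in one cycle."""
    time_per_step_ns = 2 * latency_ns + cycle_ns      # 201 ns
    ns_per_flop = time_per_step_ns / 2                # 100.5 ns
    return 1e3 / ns_per_flop                          # 1 flop/ns = 1000 Mflops


def peak_mflops(cycle_ns=CYCLE_NS):
    """Peak rate: 4 flops every cycle."""
    return PEAK_FLOPS_PER_CYCLE / cycle_ns * 1e3


if __name__ == "__main__":
    rate = dot_product_mflops()
    print(f"sustained: {rate:.2f} Mflops")
    print(f"peak / sustained: {peak_mflops() / rate:.0f}x")
```

Under these assumptions the sustained rate is just under 10 Mflops, roughly 400 times below the 4 Gflops peak.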

[Timing diagram: two successive one-word fetches (100 ns each) followed by 2 flops (2FL)]

5
Hypothetical Machine 2 (Cache, but the Problem Is Different)
  • Consider matrix multiplication problem
  • Cache size 32 Kbytes
  • Block size 1 word (8 bytes)
  • Two things are different here
  • The problem is different
  • The dot product performs, on average, 1 operation
    per data item
  • 2n operands, 2n operations
  • Matrix multiplication has data reuse
  • O(n^3) operations for O(n^2) data
  • The machine is different: it has a cache (line,
    or block, size of 1 word, as listed above)
  • A cache is an intermediate storage area that the
    CPU can access in 1 cycle (low latency) and that
    holds enough data to take advantage of data
    reuse -- the important aspect of matrix
    multiplication

6
Hypothetical Machine 2 (continued)
  • Same processor, clock rate 1 GHz (1 ns)
  • Latency 100 ns (from memory to cache)
  • Block size 1 word
  • Let n = 32 and let A, B, C be 32×32 matrices.
    Consider the multiplication C = AB
  • Each matrix needs 1024 words; 3 matrices × 1024
    words × 8 bytes per word = 24 KB, which fits
    entirely in a 32 KB cache
  • Time to fetch the 2 input matrices into the cache
  • 2048 words × 100 ns = 204.8 μs
  • Perform 2n^3 operations for the matrix multiply
    (2 × 32^3 = 64K operations)
  • Time 2 × 32^3 / 4 ns ≈ 16.4 μs (4 flops per cycle)
  • Thus, the flop rate is 64K / (204.8 + 16.4) μs ≈
    296 Mflops
  • A lot better than 10 Mflops but nowhere near the
    peak of 4 Gflops
  • The notion of reusing data in this way is called
    temporal locality (many operations on the same
    data occur close in time)
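
The estimate for the 32×32 case can be reproduced with a small model. This is a sketch under the slide's assumptions (both input matrices are fetched one word at a time at 100 ns each, then all 2n^3 flops run at 4 per cycle); the function names are illustrative.

```python
# Model of Hypothetical Machine 2 multiplying two n x n matrices.
# Names are illustrative; the parameters are the slide's.

def working_set_kb(n=32, matrices=3, bytes_per_word=8):
    """Memory needed to hold A, B and C, in KB."""
    return matrices * n * n * bytes_per_word / 1024


def matmul_mflops(n=32, latency_ns=100.0, flops_per_cycle=4, cycle_ns=1.0):
    """Flop rate when A and B are first loaded word by word, then multiplied."""
    fetch_ns = 2 * n * n * latency_ns               # 2048 words x 100 ns
    flops = 2 * n ** 3                              # one multiply + one add per term
    compute_ns = flops / flops_per_cycle * cycle_ns
    return flops / (fetch_ns + compute_ns) * 1e3    # flops/ns -> Mflops


if __name__ == "__main__":
    print(f"working set: {working_set_kb():.0f} KB")
    print(f"rate: {matmul_mflops():.1f} Mflops")
```

Changing n in this model shows the effect of reuse: the fetch time grows as n^2 while the work grows as n^3, so larger matrices (as long as they still fit in cache) come closer to the peak.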

[Timing diagram: successive one-word fetches (100 ns each) and groups of 4 flops (4FL)]

7
Hypothetical Machine 3 (Increase the Memory/Cache Bandwidth)
  • Increase the block size b from 1 word to 4 words
  • As stated, this implies that the data path from
    memory to cache is 128 bits wide (4 × 32 bits/word)
  • For the dot product algorithm,
  • A request for a(1) brings in a(1:4)
  • a(1) takes 100 ns, but a(2), a(3), and a(4)
    arrive at the same time
  • Similarly, a request for b(1) brings in b(1:4)
  • The request for b(1) is issued one cycle after
    that for a(1), but the bus is busy bringing a(1:4)
    into the cache
  • Thus, after 201 ns, the dot product computation
    starts and proceeds 1 cycle at a time, completing
    at a(4) and b(4)
  • Next, the request for a(5) brings in a(5:8), and
    so on
  • Thus, the CPU performs approximately 8 flops in
    roughly 204 ns, or 1 operation per 25 ns, or
    40 Mflops
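
A minimal sketch of this estimate, assuming the two line fetches are serialized on the bus and that one multiply-add completes per cycle once the data are in cache. The function name and parameters are illustrative.

```python
# Dot product on Hypothetical Machine 3: each miss brings in a whole line.

def blocked_dot_mflops(block_words=4, latency_ns=100.0, cycle_ns=1.0):
    """Rate per group of `block_words` elements of each vector."""
    fetch_ns = 2 * latency_ns             # one line of a, then one line of b
    compute_ns = block_words * cycle_ns   # one multiply-add per cycle
    flops = 2 * block_words               # multiply + add per element pair
    return flops / (fetch_ns + compute_ns) * 1e3


if __name__ == "__main__":
    for b in (1, 4, 8):
        print(f"block size {b}: {blocked_dot_mflops(b):.1f} Mflops")
```

With a 4-word line the model gives about 39 Mflops, which the slide rounds to 40; with a 1-word line it falls back to the roughly 10 Mflops of Machine 1.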

8
Hypothetical Machine 3 -- Analyzed in Terms of Cache-Hit Ratio
  • Cache hit ratio = (number of memory accesses
    found in cache) / (total number of memory
    accesses)
  • In this case, the first access in every 4
    accesses is a miss and the remaining 3 are hits
  • Thus, the cache hit ratio is 75%
  • Assume the dominant overhead is the misses
  • Then 25% of the memory latency is the average
    overhead per access, or 25 ns (25% of the 100 ns
    memory latency)
  • Because the dot product has one operation per
    word accessed, this also works out to 40 Mflops
  • A more accurate estimate is (75% × 1 + 25% ×
    100) ns/word
  • That is, 25.75 ns/word, or 38.8 Mflops
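
The weighted-average estimate can be written as a two-line model. This is a sketch with illustrative names; it assumes a 1 ns cache access, a 100 ns memory access, and one flop per word accessed, as on the slide. The exact weighted sum is 25.75 ns per word.

```python
# Average access time as a function of the cache hit ratio.

def avg_access_ns(hit_ratio, cache_ns=1.0, memory_ns=100.0):
    """Hits cost one cache access, misses cost the full memory latency."""
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * memory_ns


def mflops_one_flop_per_word(hit_ratio, cache_ns=1.0, memory_ns=100.0):
    """Flop rate when every word accessed feeds exactly one operation."""
    return 1e3 / avg_access_ns(hit_ratio, cache_ns, memory_ns)


if __name__ == "__main__":
    for h in (0.0, 0.75, 0.90, 0.99):
        print(f"hit ratio {h:.2f}: {avg_access_ns(h):6.2f} ns/word, "
              f"{mflops_one_flop_per_word(h):6.1f} Mflops")
```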

9
Actual Implementations
  • 128 bit wide buses are expensive
  • The usual implementation is to pipeline a 32-bit
    bus so that the remaining words in the line
    (block) arrive one per clock cycle after the
    first item is received
  • That is, instead of 4 words received after a 100
    ns latency, the 4 items arrive after 100 + 3 ns
  • However, multiply-add operations can start as
    each item arrives, so the result is the same
    -- that is, 40 or so Mflops

[Timing diagrams: one 128-bit fetch followed by four groups of 4 flops (4FL), compared with four pipelined 32-bit fetches, each followed by 4 flops]

10
Spatial Locality Issue
  • It is clear that the dot-product is taking
    advantage of the consecutiveness of elements of
    the vector
  • This is called spatial locality -- the elements
    are close together in memory
  • Consider a different problem
  • The matrix-vector multiplication y = Ax
  • In C, the elements of a column are not
    consecutive in memory
  • That is, they are separated by a number of
    elements equal to the row length (the number of
    columns)
  • In this case, the accesses are not spatially
    local and essentially all accesses to every
    element of every column are cache misses

11
In Fortran
  • The matrix A is stored by columns (column-major
    order) in memory; for example, with 4 rows, at
    hexadecimal word addresses
  • 8000, A(1,1)
  • 8001, A(2,1)
  • 8002, A(3,1)
  • 8003, A(4,1)
  • 8004, A(1,2)
  • 8005, A(2,2)
  • 8006, A(3,2)
  • 8007, A(4,2)
  • 8008, A(1,3)
  • 8009, A(2,3)
  • 800A, A(3,3)
  • 800B, A(4,3)
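
The address list above follows a simple formula. The sketch below computes it for 1-based indices, with a row-major (C-style) version for contrast. The function names and the base address parameter are illustrative, and addresses are counted in words.

```python
# Word address of A(i, j) for 1-based indices, counted in words from `base`.

def column_major_address(i, j, rows, base=0x8000):
    """Fortran layout: consecutive elements of a column are adjacent."""
    return base + (i - 1) + (j - 1) * rows


def row_major_address(i, j, cols, base=0x8000):
    """C layout: consecutive elements of a row are adjacent."""
    return base + (i - 1) * cols + (j - 1)


if __name__ == "__main__":
    # Reproduce the list for a matrix with 4 rows and 3 columns.
    for j in range(1, 4):
        for i in range(1, 5):
            print(f"{column_major_address(i, j, rows=4):04X}, A({i},{j})")
```

In the column-major layout, moving down a column changes the address by 1, while moving along a row changes it by the number of rows. In the row-major layout the roles are swapped.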

12
Sum All Elements of a Matrix
  • Consider the problem of computing the sum of all
    elements of a 1000×1000 matrix B
  • S = 0.D0
  • do i = 1, 1000
  •   do j = 1, 1000
  •     S = S + B(i,j)
  •   end do
  • end do
  • This code performs very poorly
  • S fits in cache
  • Since the inner loop runs over the column index
    j, consecutive accesses are far apart in memory
    (one full column of 1000 elements)
  • They are unlikely to be in the same cache line
  • Every access experiences the maximum latency
    delay (100 ns)

13
Sum All Elements of a Matrix
  • Changing the order of the loops
  • S = 0.D0
  • do j = 1, 1000
  •   do i = 1, 1000
  •     S = S + B(i,j)
  •   end do
  • end do
  • The inner loop now walks down a column of B
  • The elements of a column are consecutive in
    memory, so one memory access brings in multiple
    elements at a time (4 for our model machine), and
    the performance is reasonable for this machine
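
The effect of the loop order can be illustrated by counting cache misses in a toy simulation. This is a sketch only: it assumes a column-major matrix, 4-word lines, and a small fully associative cache with least-recently-used replacement, none of which is specified on the slides beyond the line size. All names are illustrative.

```python
from collections import OrderedDict

# Count cache misses when summing an n x n column-major matrix.


def count_misses(n, inner, line_words=4, cache_lines=64):
    """inner='j' models 'do i / do j' (inner loop strides across a row);
    inner='i' models 'do j / do i' (inner loop walks down a column)."""
    cache = OrderedDict()   # keys are line numbers, ordered by recency
    misses = 0
    for outer_idx in range(n):
        for inner_idx in range(n):
            if inner == "j":
                i, j = outer_idx, inner_idx
            else:
                i, j = inner_idx, outer_idx
            line = (i + j * n) // line_words   # 0-based column-major address
            if line in cache:
                cache.move_to_end(line)        # hit: mark as most recent
            else:
                misses += 1
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return misses


if __name__ == "__main__":
    n = 256
    print("inner loop over j:", count_misses(n, "j"), "misses")
    print("inner loop over i:", count_misses(n, "i"), "misses")
```

In this toy model, whenever the matrix has more columns than the cache has lines, the first ordering misses on every access and the second misses on one access in four.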

14
Peak Floating Point Performance vs. Peak Memory Bandwidth
  • The performance issue
  • The achievable floating point performance is
    bounded by the peak memory bandwidth
  • For fast microprocessors, the ratio is about 100
    MFLOPS per MByte/s of bandwidth
  • Solve the problem by modifying the algorithms to
    hide the large memory latencies
  • Some compilers can transform simple codes to
    obtain better performance
  • For large scale vector processors, the ratio is
    about 1 MFLOPS per MByte/s of bandwidth
  • There, these modifications are typically
    unnecessary, but they don't hurt the computation
    and sometimes help

15
Hiding Memory Latency
  • Consider the example of getting information from
    the Internet using a browser
  • What can you do to reduce the wait time?
  • While reading one page, we anticipate the next
    pages we will read and therefore begin fetches
    for them in advance
  • This corresponds to pre-fetching pages in
    anticipation of them being read
  • We open multiple browser windows and begin
    accesses in each window in parallel
  • This corresponds to multiple threads running in
    parallel
  • We request many pages in order
  • This corresponds to pipelining with spatial
    locality

16
Multi-threading To Hide Memory Latency
  • Consider the matrix-vector multiplication c = Ab
  • Each row-by-vector inner product is an
    independent computation -- thus, create a
    different thread for each computation as follows
  • do k = 1, n
  •   c(k) = create_thread(dot_product, A(k,:), b)
  • end do
  • As separate threads
  • On the first cycle, the first thread accesses the
    first pair of data elements for the first row and
    waits for the data to arrive
  • On the second cycle, the second thread accesses
    the first pair of elements for the second row and
    waits for the data to arrive
  • And so on until l units of time (the latency)
  • Then the first thread performs a computation and
    requests more data next
  • Then the second thread performs a computation and
    requests more data
  • And so on so that after the first latency of l
    cycles, every cycle is performing a computation

17
Multithread, block size = 1
[Timing diagram: threads issue the fetches A(1,1), A(1,2), A(2,1), A(2,2), ... on successive cycles; once the first data arrive, floating-point operations (FL) interleave with further fetches]

18
Pre-fetching To Hide Memory Latency
  • Advance the loads ahead of when the data is
    needed
  • The problem is that the cache space may be needed
    for computation between the pre-fetch and use of
    the pre-fetched data
  • This is no worse than not performing the
    pre-fetch because the pre-fetch memory unit is
    typically an independent functional unit
  • Dot product again (or vector sum) provides an
    example
  • a(1) and b(1) are requested in a loop
  • The processor sees that a(2) and b(2) are needed
    for the next iteration and in the next cycle
    requests them and so on
  • Assume the first item takes 100 ns to arrive and
    the requests for the others are issued on every
    consecutive cycle
  • The processor waits 101 cycles for the first
    pair, performs the computation, and the next pair
    are there on the next cycle ready for computation
    and so on
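
A rough cycle count shows what pre-fetching buys in this model. The sketch assumes, as on the slide, that the first pair is ready after 101 cycles, that each computation takes 1 cycle, and that memory can deliver one further pair on every following cycle. The function names are illustrative.

```python
# Cycle counts for a dot product over n_pairs element pairs.


def cycles_without_prefetch(n_pairs, first_pair_cycles=101):
    """Every iteration waits the full latency, then computes for 1 cycle."""
    return n_pairs * (first_pair_cycles + 1)


def cycles_with_prefetch(n_pairs, first_pair_cycles=101):
    """Only the first pair is waited for; later pairs arrive one per cycle,
    so each of the n_pairs computations takes one cycle after the wait."""
    return first_pair_cycles + n_pairs


if __name__ == "__main__":
    n = 1000
    slow = cycles_without_prefetch(n)
    fast = cycles_with_prefetch(n)
    print(f"without pre-fetch: {slow} cycles")
    print(f"with pre-fetch:    {fast} cycles ({slow / fast:.1f}x faster)")
```

For long vectors the latency is paid only once, so the time per element pair approaches one cycle.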

19
Impact On Memory Bandwidth
  • Pre-fetching and multithreading increase the
    bandwidth requirements to memory. Compare
  • A single-thread computation experiencing a 90%
    cache hit ratio
  • The memory bandwidth requirement is estimated to
    be 400 MB/s
  • A 32-thread computation experiencing a cache hit
    ratio of 25% (because all threads share the same
    cache and memory access)
  • The memory bandwidth requirement is estimated to
    be 3 GB/s
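
The two bandwidth figures are consistent with a simple model in which every miss moves one word from memory. The request rate (one word per 1 ns cycle) and the word size (4 bytes) below are assumptions that the slide does not state; they are chosen because they reproduce its numbers. The function name is illustrative.

```python
# Memory bandwidth needed to serve cache misses.
# ASSUMPTIONS (not stated on the slide): one word requested per 1 ns cycle,
# and 4-byte words.


def dram_bandwidth_bytes_per_s(hit_ratio, requests_per_s=1e9, bytes_per_word=4):
    """Bytes per second that must come from memory for the given hit ratio."""
    miss_ratio = 1.0 - hit_ratio
    return miss_ratio * requests_per_s * bytes_per_word


if __name__ == "__main__":
    single = dram_bandwidth_bytes_per_s(0.90)
    threaded = dram_bandwidth_bytes_per_s(0.25)
    print(f"1 thread, 90% hits:   {single / 1e6:.0f} MB/s")
    print(f"32 threads, 25% hits: {threaded / 1e9:.0f} GB/s")
```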