Transcript and Presenter's Notes

Title: Memory Consistency


1
Memory Consistency
2
Memory Consistency
3
Memory Consistency
  • Reads and writes of the shared memory face a
    consistency problem
  • Need to achieve controlled consistency in memory
    events
  • Shared memory behavior is determined by
  • Program order
  • Memory access order
  • Challenges
  • Modern processors reorder operations
  • Compiler optimizations (scalar replacement,
    instruction rescheduling)

4
Basic Concept
  • On a multiprocessor
  • Concurrent instruction streams (threads) run on
    different processors
  • Memory events performed by one process may create
    data to be used by another
  • Events: read and write
  • The memory consistency model specifies how the
    memory events initiated by one process should be
    observed by other processes
  • Event ordering
  • Declares which memory access is allowed, and which
    process should wait for a later access when
    processes compete

5
Uniprocessor vs. Multiprocessor Model
6
Understanding Program Order
Initially X = 2

P1                          P2
..                          ..
r0 = Read(X)                r1 = Read(X)
r0 = r0 + 1                 r1 = r1 + 1
Write(r0, X)                Write(r1, X)

Possible execution sequences

P1: r0 = Read(X)            P2: r1 = Read(X)
P2: r1 = Read(X)            P2: r1 = r1 + 1
P1: r0 = r0 + 1             P2: Write(r1, X)
P1: Write(r0, X)            P1: r0 = Read(X)
P2: r1 = r1 + 1             P1: r0 = r0 + 1
P2: Write(r1, X)            P1: Write(r0, X)
X = 3                       X = 4
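Not part of the slide, but the same example can be run directly; a
minimal C sketch with pthreads (the data race on X is deliberate,
purely to illustrate that different interleavings yield 3 or 4):

#include <pthread.h>
#include <stdio.h>

/* Illustrative only: X is intentionally unsynchronized, so this program
 * has a data race. Depending on how the two program orders interleave,
 * the final value printed is 3 (both read the initial 2) or 4 (one
 * process runs entirely after the other). */
int X = 2;

void *p1(void *arg) { int r0 = X; r0 = r0 + 1; X = r0; return NULL; }
void *p2(void *arg) { int r1 = X; r1 = r1 + 1; X = r1; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("X = %d\n", X);
    return 0;
}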
7
Interleaving
  • Program orders of individual instruction streams
    may need to be modified because of interaction
    among them
  • Finding the optimum global memory order is an
    NP-hard problem

8
Example
  • Concatenate program orders in P1, P2 and P3
  • 6-tuple binary strings (64 output combinations)
  • (a,b,c,d,e,f) → (001011) (in-order execution)
  • (a,c,e,b,d,f) → (111111) (in-order execution)
  • (b,d,f,e,a,c) → (000000) (out-of-order
    execution)
  • 6! = 720 possible permutations

9
Mutual exclusion problem
  • mutual exclusion problem in concurrent
    programming
  • allow two threads to share a single-use resource
    without conflict, using only shared memory for
    communication.
  • avoid the strict alternation of a naive
    turn-taking algorithm

10
Definition
  • If two processes attempt to enter a critical
    section at the same time, allow only one process
    in, based on whose turn it is.
  • If one process is already in the critical
    section, the other process will wait for the
    first process to exit.
  • How would you implement this while guaranteeing
  • mutual exclusion,
  • freedom from deadlock, and
  • freedom from starvation?

11
Solution Dekker's Algorithm
  • This is done by using two flags, flag[0] and
    flag[1], which indicate an intention to enter the
    critical section, and a turn variable which
    indicates who has priority between the two
    processes.

12
  • flag[0] = false
  • flag[1] = false
  • turn = 0 // or 1

P0
    flag[0] = true;
    while (flag[1] == true) {
        if (turn != 0) {
            flag[0] = false;
            while (turn != 0) { /* busy wait */ }
            flag[0] = true;
        }
    }
    // critical section
    ...
    turn = 1;
    flag[0] = false;
    // remainder section

P1
    flag[1] = true;
    while (flag[0] == true) {
        if (turn != 1) {
            flag[1] = false;
            while (turn != 1) { /* busy wait */ }
            flag[1] = true;
        }
    }
    // critical section
    ...
    turn = 0;
    flag[1] = false;
    // remainder section
13
Disadvantages
  • limited to two processes
  • makes use of busy waiting instead of process
    suspension.
  • Modern CPUs execute their instructions in an
    out-of-order fashion,
  • even memory accesses can be reordered

14
Peterson's Algorithm
  • flag[0] = 0
  • flag[1] = 0
  • turn

P0
    flag[0] = 1;
    turn = 1;
    while (flag[1] == 1 && turn == 1) { /* busy wait */ }
    // critical section
    ...
    // end of critical section
    flag[0] = 0;

P1
    flag[1] = 1;
    turn = 0;
    while (flag[0] == 1 && turn == 0) { /* busy wait */ }
    // critical section
    ...
    // end of critical section
    flag[1] = 0;
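As slide 13 notes, modern CPUs and compilers reorder memory accesses,
so the plain-variable version above is not reliable on real hardware.
A minimal sketch of Peterson's lock using C11 atomics (the lock/unlock
names and the two-thread id convention are assumptions of the sketch);
the default sequentially consistent operations prevent the reordering
that breaks the algorithm:

#include <stdatomic.h>
#include <stdbool.h>

/* Peterson's algorithm with C11 atomics. The default memory order
 * (sequentially consistent) keeps the store to flag[self] and turn from
 * being reordered past the loads in the while loop. */
static atomic_bool flag[2];
static atomic_int  turn;

void lock(int self) {                 /* self is 0 or 1 */
    int other = 1 - self;
    atomic_store(&flag[self], true);  /* announce intent to enter */
    atomic_store(&turn, other);       /* give the other thread priority */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other) {
        /* busy wait */
    }
}

void unlock(int self) {
    atomic_store(&flag[self], false);
}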
15
Lamport's bakery algorithm
// declaration and initial values of global variables
Entering: array [1..NUM_THREADS] of bool = {false};
Number:   array [1..NUM_THREADS] of integer = {0};

lock(integer i) {
    Entering[i] = true;
    Number[i] = 1 + max(Number[1], ..., Number[NUM_THREADS]);
    Entering[i] = false;
    for (j = 1; j <= NUM_THREADS; j++) {
        // Wait until thread j receives its number
        while (Entering[j]) { /* nothing */ }
        // Wait until all threads with smaller numbers or with the same
        // number, but with higher priority, finish their work
        while ((Number[j] != 0) && ((Number[j], j) < (Number[i], i))) {
            /* nothing */
        }
    }
}

unlock(integer i) {
    Number[i] = 0;
}

Thread(integer i) {
    while (true) {
        lock(i);
        // The critical section goes here...
        unlock(i);
        // non-critical section...
    }
}
  • a bakery with a numbering machine
  • the 'customers' will be threads, identified by
    the letter i, obtained from a global variable.
  • more than one thread might get the same number

16
Models
Strict Consistency A read always returns the most
recent write to the same address
Sequential Consistency The result of any
execution appears as some interleaving of the
individual programs, each strictly in sequential
program order
Processor Consistency Writes issued by each
processor are observed in program order, but writes
from different processors can be seen out of order
(Goodman)
Weak Consistency The programmer uses synchronization
operations to enforce sequential consistency
(Dubois)
Reads from each processor are not restricted; more
opportunities for pipelining
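Not on the slide, but a standard store-buffering litmus test separates
sequential consistency from the weaker models above. A minimal C11
sketch (the thread names and the use of pthreads are assumptions); with
sequentially consistent atomics the outcome r1 == 0 and r2 == 0 is
forbidden, while relaxed ordering would allow it:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Store-buffering litmus test. Initially X = Y = 0. Under sequential
 * consistency some write must come first in the single interleaving,
 * so r1 and r2 cannot both read 0. */
atomic_int X, Y;
int r1, r2;

void *t1(void *arg) {
    atomic_store(&X, 1);
    r1 = atomic_load(&Y);
    return NULL;
}

void *t2(void *arg) {
    atomic_store(&Y, 1);
    r2 = atomic_load(&X);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1 = %d, r2 = %d\n", r1, r2);  /* never both 0 with seq_cst */
    return 0;
}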
17
Relationship to Cache Coherence Protocol
  • Cache coherence protocol must observe the
    constraints imposed by the memory consistency
    model
  • Ex: read hit in a cache
  • Reading without waiting for the completion of a
    previous write may violate sequential consistency
  • Cache coherence protocol provides a mechanism to
    propagate the newly written value
  • Memory consistency model places an additional
    constraint on when the value can be propagated to
    a given processor

18
Latency Tolerance
  • Scalable systems
  • Distributed shared memory architecture
  • Access to remote memory incurs long latency
  • Processor speed vs. the memory and interconnect
  • Need for
  • Latency reduction, avoidance, hiding

19
Latency Avoidance
  • Organize user applications at architectural,
    compiler or application levels to achieve
    program/data locality
  • Possible when applications exhibit
  • Temporal or spatial locality
  • How do you enhance locality?

20
Locality Enhancement
  • Architectural support
  • Cache coherency protocols, memory consistency
    models, fast message passing, etc.
  • User support
  • A High Performance Fortran program instructs the
    compiler how to allocate the data (example?)
  • Software support
  • Compiler performs certain transformations
  • Example?

21
Latency Reduction
  • What if locality is limited?
  • Data access is dynamically changing?
  • For example, sorting algorithms
  • We need latency reduction mechanisms
  • Target the communication subsystem
  • Interconnect
  • Network interface
  • Fast communication software
  • Cluster: TCP, UDP, etc.

22
Latency Hiding
  • Hide communication latency within computation
  • Overlapping techniques
  • Prefetching techniques
  • Hide read latency
  • Distributed coherent caches
  • Reduce cache misses
  • Shorten time to retrieve clean copy
  • Multiple context processors
  • Switch from one context to another when a
    long-latency operation is encountered (hardware-
    supported multithreading)

23
Memory Delays
  • SMP
  • high in multiprocessors due to added contention
    for shared resources such as a shared bus and
    memory modules
  • Distributed
  • are even more pronounced in distributed-memory
    multiprocessors where memory requests may need to
    be satisfied across an interconnection network.
  • By masking some or all of these significant
    memory latencies, prefetching can be an effective
    means of speeding up multiprocessor applications

24
Data Prefetching
  • Overlapping computation with memory accesses
  • Rather than waiting for a cache miss to perform a
    memory fetch, data prefetching anticipates such
    misses and issues a fetch to the memory system in
    advance of the actual memory reference.

25
Cache Hierarchy
  • Popular latency reducing technique
  • But still common for scientific programs to spend
    more than half their run times stalled on memory
    requests
  • partially a result of the on-demand fetch
    policy
  • fetch data into the cache from main memory only
    after the processor has requested a word and
    found it absent from the cache.

26
Why do scientific applications exhibit poor cache
utilization?
  • Is something wrong with the principle of
    locality?
  • The traversal of large data arrays is often at
    the heart of this problem.
  • Temporal locality in array computations
  • once an element has been used to compute a
    result, it is often not referenced again before
    it is displaced from the cache to make room for
    additional array elements.
  • While sequential array access patterns exhibit a
    high degree of spatial locality, many other types
    of array access patterns do not.
  • For example, in a language that stores matrices
    in row-major order, a column-wise traversal of a
    matrix will result in consecutively referenced
    elements being widely separated in memory. Such
    strided reference patterns result in low spatial
    locality if the stride is greater than the cache
    block size. In this case, only one word per cache
    block is actually used while the remainder of the
    block remains untouched, even though cache space
    has been allocated for it.
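As a concrete illustration (not from the slides), summing one column of
a row-major C array produces exactly this kind of strided pattern:

#define N 1024
static double a[N][N];          /* C stores this array row-major */

/* Column sum: consecutive references a[i][j] and a[i+1][j] are
 * N * sizeof(double) bytes apart, so with a typical 64-byte cache block
 * every iteration touches a new block and uses only one word of it. */
double col_sum(int j) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i][j];
    return s;
}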

27
(Figure) Execution timeline: memory references r1, r2, and r3 miss in
the cache; time is divided between computation, memory references
satisfied within the cache hierarchy, and main memory access time.
28
Challenges
  • Cache pollution
  • For the data to arrive early enough to hide all of
    the memory latency,
  • it must be held in the processor cache for some
    period of time before it is used by the
    processor.
  • During this time, the prefetched data are exposed
    to the cache replacement policy and may be
    evicted from the cache before use.
  • Moreover, the prefetched data may displace data
    in the cache that is currently in use by the
    processor.
  • Memory bandwidth
  • Back to the figure
  • No prefetch: the three memory requests occur
    within the first 31 time units of program
    startup
  • With prefetch: these requests are compressed into
    a period of 19 time units
  • By removing processor stall cycles, prefetching
    effectively increases the frequency of memory
    requests issued by the processor.
  • Memory systems must be designed to match this
    higher bandwidth to avoid becoming saturated and
    nullifying the benefits of prefetching.

29
Spatial Locality
  • Block transfer is a way of prefetching (1960s)
  • Software prefetching later (1980s)

30
Binding Prefetch
  • Non-blocking load instructions
  • these instructions are issued in advance of the
    actual use to take advantage of the parallelism
    between the processor and memory subsystem.
  • Rather than loading data into the cache, however,
    the specified word is placed directly into a
    processor register.
  • the value of the prefetched variable is bound to
    a named location at the time the prefetch is
    issued.

31
Software-Initiated Data Prefetching
  • Some form of fetch instruction
  • can be as simple as a load into a processor
    register
  • Fetches are non-blocking memory operations
  • Allow prefetches to bypass other outstanding
    memory operations in the cache.
  • Fetch instructions cannot cause exceptions
  • The hardware required to implement
    software-initiated prefetching is modest
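A minimal sketch of such a fetch in portable C, using the GCC/Clang
__builtin_prefetch intrinsic (the loop, array, and prefetch distance PD
are illustrative assumptions):

/* Software-initiated prefetching: issue a non-blocking, non-faulting
 * fetch for data that will be needed PD iterations from now. */
#define PD 8   /* assumed prefetch distance, in loop iterations */

double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + PD], 0, 3);  /* read access, keep in cache */
        s += a[i];
    }
    return s;
}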

32
Prefetch Challenges
  • prefetch scheduling.
  • judicious placement of fetch instructions within
    the target application.
  • not possible to precisely predict when to
    schedule a prefetch so that data arrives in the
    cache at the moment it will be requested by the
    processor
  • uncertainties not predictable at compile time
  • careful consideration when statically scheduling
    prefetch instructions.
  • may be added by the programmer or by the compiler
    during an optimization pass.
  • programming effort ?

33
Suitable spots for Fetch
  • most often used within loops responsible for
    large array calculations.
  • common in scientific codes,
  • exhibit poor cache utilization
  • predictable array referencing patterns.

34
Example
Assume a four-word cache block
Issues: cache misses during the first iteration,
unnecessary prefetches in the last iteration of the
unrolled loop
How to solve these two issues? Software pipelining
(see the sketch below)
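A hedged sketch of the software-pipelined loop in C, assuming four
array elements per cache block and the GCC/Clang __builtin_prefetch
intrinsic: the prologue covers the first iteration's blocks, the main
loop prefetches one block ahead, and the epilogue finishes without
issuing prefetches past the end of the arrays.

/* Inner-product loop, unrolled by the four-word block size and
 * software-pipelined. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    int i = 0;
    __builtin_prefetch(&a[0], 0, 3);          /* prologue: first blocks */
    __builtin_prefetch(&b[0], 0, 3);
    for (; i + 4 <= n - 4; i += 4) {          /* main loop */
        __builtin_prefetch(&a[i + 4], 0, 3);  /* next block of a */
        __builtin_prefetch(&b[i + 4], 0, 3);  /* next block of b */
        sum += a[i] * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                        /* epilogue: no prefetches */
        sum += a[i] * b[i];
    return sum;
}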
35
Assumptions
  • implicit assumption
  • Prefetching one iteration ahead of the data's
    actual use is sufficient to hide the latency
  • What if the loop contains a small computational
    body?
  • Define prefetch distance
  • initiate prefetches d iterations before the data
    is referenced
  • How do you determine d?
  • Let
  • l be the average cache miss latency, measured
    in processor cycles,
  • s be the estimated cycle time of the shortest
    possible execution path through one loop
    iteration, including the prefetch overhead.
  • d = ⌈ l / s ⌉

36
Revisiting the example
  • let us assume an average miss latency of 100
    processor cycles and a loop iteration time of 45
    cycles
  • d = ⌈100 / 45⌉ = 3 (handle a prefetch distance of
    three)

37
Case Study
  • Given a distributed-shared multiprocessor
  • let's define a remote access cache (RAC)
  • Assume that RAC is located at the network
    interface of each node
  • Motivation prefetched remote data could be
    accessed at a speed comparable to that of local
    memory while the processor cache hierarchy was
    reserved for demand-fetched data.
  • Which one is better: having the RAC or prefetching
    data directly into the processor cache hierarchy?
  • Despite significantly increasing cache contention
  • and reducing overall cache space,
  • the latter approach results in higher cache hit
    rates,
  • which is the dominant performance factor.

38
Case Study
  • Transfer of individual cache blocks across the
    interconnection network of a multiprocessor
    yields low network efficiency
  • what if we propose transferring prefetched data
    in larger units?
  • Method a compiler schedules a single prefetch
    command before the loop is entered rather than
    software pipelining prefetches within a loop.
  • transfer of large blocks of remote memory used
    within the loop body
  • prefetched into local memory to prevent excessive
    cache pollution.
  • Issues
  • this is a binding prefetch, since data stored in a
    processor's local memory are not exposed to any
    coherency policy
  • imposes constraints on the use of prefetched data
    which, in turn, limits the amount of remote data
    that can be prefetched.

39
What about besides the loops?
  • Prefetching is normally restricted to loops
  • array accesses whose indices are linear functions
    of the loop indices
  • compiler must be able to predict memory access
    patterns when scheduling prefetches.
  • such loops are relatively common in scientific
    codes but far less so in general applications.
  • Irregular data structures
  • difficult to reliably predict when a particular
    datum will be accessed
  • once a cache block has been accessed, there is
    less of a chance that several successive cache
    blocks will also be requested when data
    structures such as graphs and linked lists are
    used.
  • comparatively high temporal locality
  • results in high cache utilization, thereby
    diminishing the benefit of prefetching.

40
What is the overhead of fetch instructions?
  • require extra execution cycles
  • fetch source addresses must be calculated and
    stored in the processor
  • to avoid recalculation for the matching load or
    store instruction.
  • How
  • Register space
  • Problem
  • compiler will have less register space to
    allocate to other active variables.
  • fetch instructions increase register pressure
  • It gets worse when
  • the prefetch distance is greater than one
  • multiple prefetch addresses
  • code expansion
  • may degrade instruction cache performance.
  • software-initiated prefetching is done statically
  • unable to detect when a prefetched block has been
    prematurely evicted and needs to be re-fetched.

41
Hardware-Initiated Data Prefetching
  • Prefetching capabilities without the need for
    programmer or compiler intervention.
  • No changes to existing executables
  • instruction overhead completely eliminated.
  • can take advantage of run-time information to
    potentially make prefetching more effective.

42
Cache Blocks
  • Typically fetch data from main memory into the
    processor cache in units of cache blocks.
  • multiple word cache blocks are themselves a form
    of data prefetching.
  • large cache blocks
  • Effective prefetching vs cache pollution.
  • What is the complication for SMPs with private
    caches?
  • false sharing: when two or more processors wish
    to access different words within the same cache
    block and at least one of the accesses is a
    store.
  • cache coherence traffic is generated to ensure
    that the changes made to a block by a store
    operation are seen by all processors caching the
    block.
  • Unnecessary traffic
  • Increasing the cache block size increases the
    likelihood of such occurrences
  • How do we take advantage of spatial locality
    without introducing some of the problems
    associated with large cache blocks?

43
Sequential prefetching
  • one block lookahead (OBL) approach
  • initiates a prefetch for block b+1 when block b
    is accessed.
  • How is it different from doubling the block size?
  • prefetched blocks are treated separately with
    regard to the cache replacement and coherency
    policies.

44
OBL Case Study
  • Assume that a large block contains one word which
    is frequently referenced and several other words
    which are not in use.
  • Assume that an LRU replacement policy is used,
  • What is the implication?
  • the entire block will be retained even though
    only a portion of the block's data is actually in
    use.
  • How do we solve?
  • Replace large block with two smaller blocks,
  • one of them could be evicted to make room for
    more active data.
  • use of smaller cache blocks reduces the
    probability of false sharing

45
OBL implementations
  • Based on what type of access to block b
    initiates the prefetch of b+1
  • prefetch on miss
  • Initiates a prefetch for block b+1 whenever an
    access to block b results in a cache miss.
  • If b+1 is already cached, no memory access is
    initiated
  • tagged prefetch algorithms
  • Associates a tag bit with every memory block.
  • Use this bit to detect
  • when a block is demand-fetched or
  • when a prefetched block is referenced for the
    first time.
  • Then, next sequential block is fetched.
  • Which one is better in terms of reducing miss
    rate? Prefetch on miss vs tagged prefetch?
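A hedged, simulation-style sketch of the two OBL variants in C (the
direct-mapped toy cache, NSETS, and the counters are assumptions made
purely for illustration):

#include <stdbool.h>

/* Toy direct-mapped cache used to contrast the two OBL variants. */
#define NSETS 64

typedef struct { bool valid; long block; bool tag; } line_t;

static line_t cache[NSETS];
static long demand_fetches, prefetches;

static line_t *lookup(long b) {
    line_t *l = &cache[b % NSETS];
    return (l->valid && l->block == b) ? l : NULL;
}

static line_t *install(long b, bool tag) {
    line_t *l = &cache[b % NSETS];
    l->valid = true; l->block = b; l->tag = tag;
    return l;
}

/* Prefetch on miss: look ahead only when block b misses. */
void access_prefetch_on_miss(long b) {
    if (!lookup(b)) {
        demand_fetches++; install(b, false);
        if (!lookup(b + 1)) { prefetches++; install(b + 1, false); }
    }
}

/* Tagged prefetch: also look ahead the first time a prefetched block
 * is referenced (its tag bit is still set). */
void access_tagged(long b) {
    line_t *l = lookup(b);
    if (!l) {
        demand_fetches++; l = install(b, false);
        if (!lookup(b + 1)) { prefetches++; install(b + 1, true); }
    } else if (l->tag) {
        l->tag = false;
        if (!lookup(b + 1)) { prefetches++; install(b + 1, true); }
    }
}

On a strictly sequential stream of block addresses, access_tagged takes
one demand fetch and then stays one block ahead, while
access_prefetch_on_miss still misses on every other block; that
difference is what the figure on the next slide illustrates.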

46
Prefetch on miss vs tagged prefetch
(Figure) Accessing three contiguous blocks under a strictly
sequential access pattern
47
Shortcoming of the OBL
  • prefetch may not be initiated far enough in
    advance of the actual use to avoid a processor
    memory stall.
  • A sequential access stream resulting from a tight
    loop, for example, may not allow sufficient time
    between the use of blocks b and b+1 to completely
    hide the memory latency.

48
How do you solve this shortcoming?
  • Increase the number of blocks prefetched after a
    demand fetch from one to d
  • As each prefetched block, b, is accessed for the
    first time, the cache is interrogated to check if
    blocks b+1, ..., b+d are present in the cache
  • What if d = 1? What kind of prefetching is this?
  • Tagged

49
Another technique with d-prefetch
  • d prefetched blocks are brought into a FIFO
    stream buffer before being brought into the
    cache.
  • As each buffer entry is referenced, it is brought
    into the cache while the remaining blocks are
    moved up in the queue and a new block is
    prefetched into the tail position.
  • If a miss occurs in the cache and the desired
    block is also not found at the head of the stream
    buffer, the buffer is flushed.
  • Advantage
  • prefetched data are not placed directly into the
    cache,
  • avoids cache pollution.
  • Disadvantage
  • requires that prefetched blocks be accessed in a
    strictly sequential order to take advantage of
    the stream buffer.
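A hedged sketch of this stream-buffer scheme in C (the depth D, the
stub cache routines, and the driver trace are assumptions of the
sketch):

#include <stdbool.h>
#include <stdio.h>

#define D 4                  /* assumed stream buffer depth */

static long stream[D];       /* stream[0] is the head of the FIFO */
static long next_block;      /* next block address to prefetch into the tail */
static bool active;

static void issue_prefetch(long b) { printf("prefetch block %ld\n", b); }
static void cache_fill(long b)     { printf("fill cache with block %ld\n", b); }

static void reallocate(long b) {     /* flush and restart the buffer after block b */
    next_block = b + 1;
    for (int i = 0; i < D; i++) {
        stream[i] = next_block;
        issue_prefetch(next_block++);
    }
    active = true;
}

void on_cache_miss(long b) {
    if (active && stream[0] == b) {      /* desired block is at the head */
        cache_fill(b);                   /* move it from the buffer into the cache */
        for (int i = 1; i < D; i++)      /* shift the remaining entries up */
            stream[i - 1] = stream[i];
        stream[D - 1] = next_block;      /* prefetch a new block into the tail */
        issue_prefetch(next_block++);
    } else {                             /* not at the head: demand fetch and flush */
        cache_fill(b);
        reallocate(b);
    }
}

int main(void) {
    long trace[] = {100, 101, 102, 103, 200};  /* sequential run, then a new stream */
    for (int i = 0; i < 5; i++)
        on_cache_miss(trace[i]);
    return 0;
}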

50
Tradeoffs of d-prefetching?
  • Good increasing the degree of prefetching
  • reduces miss rates in sections of code that show
    a high degree of spatial locality
  • Bad
  • additional traffic and cache pollution are
    generated by sequential prefetching during
    program phases that show little spatial locality.
  • What if we are able to vary d?
51
Adaptive sequential prefetching
  • d is matched to the degree of spatial locality
    exhibited by the program at a particular point in
    time.
  • a prefetch efficiency metric is periodically
    calculated
  • Prefetch efficiency
  • ratio of useful prefetches to total prefetches
  • a useful prefetch occurs whenever a prefetched
    block results in a cache hit.
  • d is initialized to one,
  • incremented whenever the efficiency exceeds a
    predetermined upper threshold
  • decremented whenever the efficiency drops below a
    lower threshold (this adjustment loop is sketched
    below)
  • If d = 0, no prefetching is issued
  • Which one is better: adaptive or tagged
    prefetching?
  • Miss ratio vs. memory traffic and contention
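A minimal sketch of that periodic adjustment in C (the thresholds, the
cap on d, and the epoch counters are illustrative assumptions, not
values from the slides):

#define UPPER 0.75           /* assumed upper efficiency threshold */
#define LOWER 0.40           /* assumed lower efficiency threshold */
#define D_MAX 8              /* assumed cap on the degree of prefetching */

static int d = 1;            /* current degree of prefetching */
static long useful_prefetches;   /* prefetched blocks that were later referenced */
static long total_prefetches;

/* Called at the end of each measurement epoch. */
void adjust_degree(void) {
    if (total_prefetches > 0) {
        double efficiency = (double)useful_prefetches / (double)total_prefetches;
        if (efficiency > UPPER && d < D_MAX)
            d++;             /* strong spatial locality: prefetch deeper */
        else if (efficiency < LOWER && d > 0)
            d--;             /* little locality: back off (d == 0 disables prefetching) */
    }
    useful_prefetches = total_prefetches = 0;   /* start a new epoch */
}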

52
Sequential prefetching summary
  • Does sequential prefetching require changes to
    existing executables?
  • What about the hardware complexity?
  • Which one offers both simplicity and performance?
  • TAGGED
  • Compared to software-initiated prefetching, what
    might be the problem?
  • tend to generate more unnecessary prefetches.
  • Non-sequential access patterns are not handled
    well
  • Ex: scalar references or array accesses with
    large strides will result in unnecessary
    prefetch requests
  • they do not exhibit the spatial locality upon
    which sequential prefetching is based.
  • To enable prefetching of strided and other
    irregular data access patterns, several more
    elaborate hardware prefetching techniques have
    been proposed.

53
Prefetching with arbitrary strides
  • Reference Prediction Table

States: initial, transient, steady (a sketch of one
RPT entry and its update rule follows)
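A simplified sketch of an RPT entry and its update rule in C, using
only the three states named above (the published table also has a
no-prediction state; the table size, indexing, and prefetch hook are
assumptions of the sketch):

#include <stdbool.h>

typedef enum { INITIAL, TRANSIENT, STEADY } rpt_state_t;

typedef struct {
    unsigned long pc;          /* tag: address of the load/store instruction */
    unsigned long prev_addr;   /* last effective address it generated */
    long          stride;      /* last observed stride between addresses */
    rpt_state_t   state;
    bool          valid;
} rpt_entry_t;

#define RPT_SIZE 64
static rpt_entry_t rpt[RPT_SIZE];

extern void issue_prefetch(unsigned long addr);   /* assumed hook */

void rpt_access(unsigned long pc, unsigned long addr) {
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_SIZE];
    if (!e->valid || e->pc != pc) {                  /* allocate a new entry */
        *e = (rpt_entry_t){ pc, addr, 0, INITIAL, true };
        return;
    }
    long stride = (long)addr - (long)e->prev_addr;
    if (stride == e->stride) {                       /* prediction confirmed */
        e->state = (e->state == INITIAL) ? TRANSIENT : STEADY;
    } else {                                         /* learn the new stride */
        e->stride = stride;
        e->state  = (e->state == STEADY) ? INITIAL : TRANSIENT;
    }
    e->prev_addr = addr;
    if (e->state == STEADY)
        issue_prefetch(addr + e->stride);            /* predict the next reference */
}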
54
RPT Entries State Transition
55
Matrix Multiplication
Assume starting addresses a = 10000, b = 20000,
c = 30000, and a 1-word cache block
After the first iteration of inner loop
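The traced loop is presumably the standard triple-nested multiply; a
plausible form in C (the i-j-k ordering and dimension N are
assumptions), annotated with the stride each operand presents to the
RPT:

#define N 100                /* assumed matrix dimension */
static int a[N][N], b[N][N], c[N][N];

/* In the inner (k) loop: b[i][k] advances by one word per iteration
 * (stride 1), c[k][j] advances by one row per iteration (stride N
 * words), and a[i][j] is the same address every iteration (stride 0). */
void matmul(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                a[i][j] = a[i][j] + b[i][k] * c[k][j];
}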
56
Matrix Multiplication
After the second iteration of inner loop
Hits/misses?
57
Matrix Multiplication
After the third iteration
b and c hit, provided that a prefetch distance of
one is enough
58
RPT Limitations
  • Prefetch distance is limited to one loop iteration
  • Loop entrance: miss
  • Loop exit: unnecessary prefetch
  • How can we solve this?
  • Use a longer distance
  • Prefetch address = effective address +
    (stride × distance)
  • with a lookahead program counter (LA-PC)

59
Summary
  • Prefetches should be
  • timely, useful, and introduce little overhead.
  • Reduce secondary effects in the memory system
  • Strategies are diverse and no single strategy
    provides optimal performance

60
Summary
  • Prefetching schemes are diverse.
  • To help categorize a particular approach it is
    useful to answer three basic questions concerning
    the prefetching mechanism
  • 1) When are prefetches initiated,
  • 2) Where are prefetched data placed,
  • 3) What is the unit of prefetch?

61
Software vs Hardware Prefetching
  • Prefetch instructions actually increase the
    amount of work done by the processor.
  • Hardware-based prefetching techniques do not
    require the use of explicit fetch instructions.
  • hardware monitors the processor in an attempt to
    infer prefetching opportunities.
  • no instruction overhead
  • generates more unnecessary prefetches than
    software-initiated schemes.
  • need to speculate on future memory accesses
    without the benefit of compile-time information
  • Cache pollution
  • Consume memory bandwidth

62
Conclusions
  • Prefetches can be initiated either by
  • explicit fetch operation within a program
    (software initiated)
  • logic that monitors the processor's referencing
    pattern (hardware-initiated).
  • Prefetches must be timely.
  • issued too early
  • chance that the prefetched data will displace
    other useful data or be displaced itself before
    use.
  • issued too late
  • may not arrive before the actual memory reference
    and introduce stalls
  • Prefetches must be precise.
  • The software approach issues prefetches only for
    data that is likely to be used
  • Hardware schemes tend to fetch more data
    unnecessarily.

63
Conclusions
  • The decision of where to place prefetched data in
    the memory hierarchy
  • data must be placed in a high enough level of the
    memory hierarchy to provide a performance benefit.
  • The majority of schemes place
  • prefetched data in some type of cache memory.
  • Prefetched data placed in processor registers
  • the prefetch is binding, and additional constraints
    must be imposed on the use of the data.
  • Finally, multiprocessor systems can introduce
    additional levels into the memory hierarchy which
    must be taken into consideration.

64
Conclusions
  • Data can be prefetched in units of single words,
    cache blocks or larger blocks of memory.
  • determined by the organization of the underlying
    cache and memory system.
  • Uniprocessors and SMPs
  • Cache blocks appropriate
  • Distributed memory multiprocessor
  • larger memory blocks
  • to amortize the cost of initiating a data
    transfer across an interconnection network