1
Thoughts on Shared Caches
  • Jeff Odom, University of Maryland

2
A Brief History of Time
  • First there was the single CPU
  • Memory tuning new field
  • Large improvements possible
  • Life is good
  • Then came multiple CPUs
  • Rethink memory interactions
  • Life is good (again)
  • Now there's multi-core on multi-CPU
  • Rethink memory interactions (again)
  • Life will be good (we hope)

3
SMP vs. CMP
  • Symmetric Multiprocessing (SMP)
  • Single CPU core per chip
  • All caches private to each CPU
  • Communication via main memory
  • Chip Multiprocessing (CMP)
  • Multiple CPU cores on one integrated circuit
  • Private L1 cache
  • Shared second-level and higher caches

4
CMP Features
  • Thread-level parallelism
  • One thread per core
  • Same as SMP
  • Shared higher-level caches
  • Reduced latency
  • Improved memory bandwidth
  • Non-homogeneous data decomposition
  • Not all cores are created equal

5
CMP Challenges
  • New optimizations
  • False sharing/private data copies
  • Delaying reads until shared
  • Fewer locations to cache data
  • More chance of data eviction in high-throughput
    computations
  • Hybrid SMP/CMP systems
  • Connect multiple multi-core nodes
  • Composite cache sharing scheme
  • Cray XT4
  • 2 cores/chip
  • 2 chips/node

6
False Sharing
  • Occurs when two CPUs access different data
    structures on the same cache line
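
  A minimal C sketch of this effect (the pthreads setup and all names
  are illustrative assumptions, not from the slides): two threads
  update adjacent fields that will typically land on the same cache
  line, so each write invalidates the other CPU's copy of the line.

    #include <pthread.h>
    #include <stdio.h>

    /* a and b are logically independent, but adjacency makes them
       likely to share one cache line */
    struct { long a; long b; } counters;

    static void *bump_a(void *arg) {
        for (long i = 0; i < 100000000L; i++)
            counters.a++;   /* each write invalidates the other core's line */
        return NULL;
    }

    static void *bump_b(void *arg) {
        for (long i = 0; i < 100000000L; i++)
            counters.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", counters.a, counters.b);
        return 0;
    }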

7-14
False Sharing (SMP)
  • (animation sequence; figures only, no transcript text)

15-22
False Sharing (CMP)
  • (animation sequence; figures only, no transcript text)

23
False Sharing (SMP vs. CMP)
  • With private L2 (SMP), modification of
    co-resident data structures results in trips to
    main memory
  • In CMP, false sharing impact is limited by the
    shared L2
  • Latency from L1 to L2 much less than L2 to main
    memory

24
Maintaining Private Copies
  • Two threads modifying the same cache line will
    want to move data to their L1
  • Simultaneous reading/modification causes
    thrashing between L1s and L2
  • Keeping a copy of the data in a separate cache
    line keeps it local to the processor (see the
    sketch below)
  • Updates to shared data occur less often
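
  A sketch of the private-copy technique using padding, assuming a
  64-byte cache line and a GCC-style alignment attribute:

    #define CACHE_LINE 64

    /* one counter per thread; the pad pushes each array element
       onto its own cache line, so private updates stay in that
       thread's L1 instead of ping-ponging */
    struct padded {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    struct padded per_thread[2] __attribute__((aligned(CACHE_LINE)));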

25
Delaying Reads Until Shared
  • Often the results from one thread are pipelined
    to another
  • Typical signal-based sharing
  • Thread 1 (T1) accesses data, pulling it into T1's L1
  • T1 modifies data
  • T1 signals T2 that data is ready
  • T2 requests data, forcing eviction from T1's L1 into
    the shared L2
  • Data is now shared
  • The evicted L1 line is not refilled, wasting space

26
Delaying Reads Until Shared
  • Optimized sharing
  • T1 pulls data into its L1 as before
  • T1 modifies data
  • T1 waits until it has other data to fill the line
    with, then uses that load to push the modified line
    into the shared L2
  • T1 signals T2 that data is ready
  • T1 and T2 now share data in the L2
  • Eviction is a side effect of loading the new line
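
  For reference, the signal-based handoff on these two slides,
  sketched with C11 atomics (the flag, names, and spin-wait are
  illustrative assumptions); the delayed-read optimization itself
  hinges on when a cache line is evicted, which portable C cannot
  control directly:

    #include <stdatomic.h>

    long data;                    /* produced by T1, consumed by T2 */
    atomic_int ready;             /* the "data is ready" signal */

    void producer(void) {         /* runs as T1 */
        data = 42;                /* line pulled into T1's L1 and modified */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    void consumer(void) {         /* runs as T2 */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                     /* spin until signaled */
        long v = data;            /* unoptimized case: this read forces
                                     the eviction from T1's L1 */
        (void)v;
    }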

27
Hybrid Models
  • Most CMP systems will have SMP as well
  • Arbitrarily high core density on one chip not feasible
  • Want to balance processing with cache sizes
  • Different access patterns
  • Co-resident cores behave differently than cores on
    different nodes
  • Results may differ depending on which processor
    pairs you get

28
Experimental Framework
  • Simics simulator
  • Full system simulation
  • Hot-swappable components
  • Configurable memory system
  • Reconfigurable cache hierarchy
  • Roll-your-own coherency protocol
  • Simulated environment
  • SunFire 6800, Solaris 10
  • Single CPU board, 4 UltraSPARC IIi
  • Uniform main memory access
  • Similar to actual hardware on hand

29
Experimental Workload
  • NAS Parallel Benchmarks
  • Well known, standard applications
  • Various data access patterns (conjugate gradient,
    multi-grid, etc.)
  • OpenMP-optimized
  • Already converted from original serial versions
  • MPI-based versions also available
  • Small (W) workloads
  • Simulation framework slows down execution
  • Will examine larger (A-C) versions to verify
    tool correctness

30
Workload Results
  • Some benchmarks show marked improvement (CG)
  • Others show marginal improvement (FT)
  • Still others show asymmetrical loads (BT)
  • And some show asymmetrical improvement (EP)

31
The Next Step
  • How do we get programmers the data and tools to
    deal with this?
  • Hardware
  • Languages
  • Analysis tools
  • Specialized hardware counters
  • Which CPU forced eviction
  • Are cores or nodes contending for data
  • Coherency protocol diagnostics

32
The Next Step
  • CMP-aware parallel languages
  • Language-based frameworks make automatic
    optimization easier
  • OpenMP, UPC likely candidates
  • Specialized partitioning may be needed to
    leverage shared caches
  • Implicit data partitioning
  • Current languages distribute data uniformly
  • May require extensions (hints) in the form of
    language directives
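
  Today's OpenMP can only approximate such hints; a sketch using the
  existing schedule clause so each thread's chunk covers whole cache
  lines (the chunk size of 8 doubles = 64 bytes is an assumption):

    #include <omp.h>

    void scale(double *x, int n, double s) {
        /* static chunks of 8 doubles: each thread writes whole
           64-byte lines, avoiding false sharing at chunk
           boundaries */
        #pragma omp parallel for schedule(static, 8)
        for (int i = 0; i < n; i++)
            x[i] *= s;
    }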

33
The Next Step
  • Post-execution analysis tools
  • Identify memory hotspots
  • Provide hints on restructuring
  • Blocking (see the sketch below)
  • Execution interleaving
  • Convert SMP-optimized code for use in CMP
  • Dynamic instrumentation opportunities
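
  A sketch of the blocking restructuring such a tool might suggest
  (the tile size B is an assumed tuning parameter, chosen so a tile
  fits in the target cache):

    #define B 64   /* tile edge, in elements */

    /* blocked matrix transpose: each B-by-B tile is processed
       while it is still resident in cache, instead of streaming
       whole rows that evict each other */
    void transpose_blocked(int n, double dst[n][n], double src[n][n]) {
        for (int ii = 0; ii < n; ii += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int j = jj; j < jj + B && j < n; j++)
                        dst[j][i] = src[i][j];
    }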

34
Questions?