Adaptive History-Based Memory Schedulers

Description:

Memory system performance is not increasing as fast as CPU performance ... previous approaches with data intensive applications: Stream, NAS, and microbenchmarks ...

Slides: 30
Provided by: IBMU537
Learn more at: https://microarch.org

Transcript and Presenter's Notes

Title: Adaptive History-Based Memory Schedulers


1
Adaptive History-Based Memory Schedulers
  • Ibrahim Hur and Calvin Lin
  • IBM Austin
  • The University of Texas at Austin

2
Memory Bottleneck
  • Memory system performance is not increasing as
    fast as CPU performance
  • Latency: use caches, prefetching, ...
  • Bandwidth: use parallelism inside the memory system

3
How to Increase Memory Command Parallelism?
  • As with instruction scheduling, commands can be
    reordered for higher bandwidth

4
Inside the Memory System
[Diagram: caches send requests to a Read Queue and a Write Queue (both not FIFO) in the Memory Controller; the arbiter schedules memory operations from these queues into a FIFO Memory Queue, which issues them to DRAM.]
5
Our Work
  • Study memory command scheduling in the context
    of the IBM Power5
  • Present new memory arbiters
  • 20% increased bandwidth
  • Very little cost: 0.04% increase in chip area

6
Outline
  • The Problem
  • Characteristics of DRAM
  • Previous Scheduling Methods
  • Our approach
  • History-based schedulers
  • Adaptive history-based schedulers
  • Results
  • Conclusions

7
Understanding the Problem: Characteristics of DRAM
  • Multi-dimensional structure
  • Banks, rows, and columns
  • IBM Power5: ranks and ports as well
  • Access time is not uniform
  • Bank-to-bank conflicts
  • Read after Write to the same rank conflicts
  • Write after Read to a different port conflicts
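The non-uniform access times above can be modeled as a pairwise cost on consecutive commands. A minimal sketch, assuming illustrative delay values (the real Power5 timings differ) and a hypothetical `conflict_cost` helper:

```python
# Hypothetical pairwise cost model for consecutive DRAM commands; the
# delay values are illustrative, not actual Power5 timings. Each command
# is a tuple (op, rank, bank, port) with op 'R' or 'W'.
def conflict_cost(prev, cur):
    cost = 0
    if prev[1] == cur[1] and prev[2] == cur[2]:
        cost += 4   # bank conflict: same rank and bank back to back
    if prev[0] == 'W' and cur[0] == 'R' and prev[1] == cur[1]:
        cost += 3   # Read after Write to the same rank
    if prev[0] == 'R' and cur[0] == 'W' and prev[3] != cur[3]:
        cost += 2   # Write after Read to a different port
    return cost
```

An arbiter that knows the last few issued commands can sum such costs to estimate how expensive each candidate would be next.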

8
Previous Scheduling Approaches: FIFO Scheduling
[Diagram: caches feed the Read Queue and Write Queue; the arbiter forwards commands in arrival order into the Memory Queue (FIFO) and on to DRAM.]
9
Memoryless Scheduling
Adapted from Rixner et al., ISCA 2000
[Diagram: caches feed the Read Queue and Write Queue; the arbiter holds a command in the reorder queues until its conflicts are resolved, causing a long delay before it enters the Memory Queue (FIFO) and DRAM.]
10
What we really want
  • Keep the pipeline full: don't hold commands in
    the reorder queues until conflicts are totally
    resolved
  • Forward them to the memory queue in an order
    that minimizes future conflicts
  • To do this, we need to know the history of the
    commands

[Diagram: Read/Write Queues feed the arbiter, which fills the memory queue.]
11
Another Goal: Match the Application's Memory Command
Behavior
  • The arbiter should select commands from the queues
    roughly in the ratio in which the application
    generates them
  • Otherwise, the read or write queue may become
    congested
  • Command history is useful here too

12
Our Approach: History-Based Memory Schedulers
  • Benefits
  • Minimize contention costs
  • Consider multiple constraints
  • Match the application's memory access behavior
  • 2 Reads per Write?
  • 1 Read per Write?
  • The result: a less congested memory system, i.e.,
    more bandwidth

13
How does it work?
  • Use a Finite State Machine (FSM)
  • Each state in the FSM represents one possible
    history
  • Transitions out of a state are prioritized
  • At any state, scheduler selects the available
    command with the highest priority
  • FSM is generated at design time

14
An Example
[Diagram: given the current state and the available commands in the reorder queues, the FSM's prioritized transitions (First through Fourth Preference) select the most appropriate command to send to memory and determine the next state.]
15
How to determine priorities?
  • Two criteria
  • A: Minimize contention costs
  • B: Satisfy the program's Read/Write command mix
  • First method: use A, break ties with B
  • Second method: use B, break ties with A
  • Which method to use?
  • Combine the two methods probabilistically
  • (details in the paper)
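The two methods can be sketched as lexicographic comparisons. Here `pick`, `cost`, and `mix_error` are hypothetical names for illustration, not from the paper:

```python
# Hedged sketch of the two prioritization methods. `cost` (expected
# conflict cost) and `mix_error` (distance from the program's Read/Write
# ratio) are hypothetical helper functions, not from the paper.
def pick(candidates, cost, mix_error, cost_first=True):
    if cost_first:
        key = lambda c: (cost(c), mix_error(c))   # Method A, ties broken by B
    else:
        key = lambda c: (mix_error(c), cost(c))   # Method B, ties broken by A
    return min(candidates, key=key)
```

A real arbiter would flip `cost_first` probabilistically from one decision to the next, which is the combination the slide alludes to.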

16
Limitation of the History-Based Approach
  • Designed for one particular mix of Reads/Writes
  • Solution: Adaptive History-Based Schedulers
  • Create multiple state machines, one for each
    Read/Write mix
  • Periodically select the most appropriate state
    machine

17
Adaptive History-Based Schedulers
[Diagram: three state machines, tuned for 2R1W, 1R1W, and 1R2W command mixes; the scheduler switches among them.]
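The periodic selection step might look like the following sketch; the class name, the period, and the nominal ratios are assumptions for illustration, mirroring the 2R1W / 1R1W / 1R2W machines on the slide:

```python
# Illustrative sketch of adaptive selection; the period and the nominal
# reads-per-write ratios of the three machines are assumptions, not the
# Power5 parameters.
class AdaptiveScheduler:
    def __init__(self, period=10000):
        # nominal reads-per-write ratio of each pre-built state machine
        self.machines = {'2R1W': 2.0, '1R1W': 1.0, '1R2W': 0.5}
        self.period = period
        self.reads = self.writes = 0
        self.current = '1R1W'

    def observe(self, op):
        # Count issued commands; every `period` commands, switch to the
        # state machine whose ratio best matches the observed mix.
        if op == 'R':
            self.reads += 1
        else:
            self.writes += 1
        if self.reads + self.writes >= self.period:
            ratio = self.reads / max(self.writes, 1)
            self.current = min(self.machines,
                               key=lambda m: abs(self.machines[m] - ratio))
            self.reads = self.writes = 0
        return self.current
```

For example, a read-heavy phase drives the scheduler toward the 2R1W machine, while a write-heavy phase selects 1R2W.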
18
Evaluation
  • Used a cycle-accurate simulator for the IBM
    Power5
  • 1.6 GHz, 266-DDR2, 4-rank, 4-bank, 2-port
  • Evaluated our approach against previous
    approaches on data-intensive applications:
    Stream, NAS, and microbenchmarks

19
The IBM Power5
  • 2 cores on a chip
  • SMT capability
  • Large on-chip L2 cache
  • Hardware prefetching
  • 276 million transistors

Memory Controller (1.6% of chip area)
20
Results 1: Stream Benchmarks
21
Results 2: NAS Benchmarks
(1 core active)
22
Results 3: Microbenchmarks
23

[Diagram: the memory system sustaining 12 concurrent commands; caches feed the Read Queue and Write Queue, the arbiter fills the Memory Queue (FIFO), and DRAM services the commands.]
24
DRAM Utilization
[Chart: number of active commands in DRAM, memoryless approach vs. our approach]
25
Why does it work?
(detailed analysis in the paper)
[Diagram of the Memory Controller, annotated: full reorder queues, a full memory queue, a busy memory system, and low occupancy in the reorder queues.]
26
Other Results
  • We obtain >95% of the performance of the perfect
    DRAM configuration (no conflicts)
  • Results with higher frequency and with no data
    prefetching are in the paper
  • A history size of 2 works well

27
Conclusions
  • Introduced adaptive history-based schedulers
  • Evaluated on a highly tuned system, the IBM Power5
  • Performance improvement
  • Over FIFO: Stream 63%, NAS 11%
  • Over Memoryless: Stream 19%, NAS 5%
  • Little cost: 0.04% chip area increase

28
Conclusions (cont.)
  • Similar arbiters can be used in other places as
    well, e.g., cache controllers
  • Can optimize for other criteria, e.g., power or
    power/performance

29
  • Thank you