Title: Adaptive History-Based Memory Schedulers
1. Adaptive History-Based Memory Schedulers
- Ibrahim Hur and Calvin Lin
- IBM Austin
- The University of Texas at Austin
2. Memory Bottleneck
- Memory system performance is not increasing as fast as CPU performance
- Latency: use caches, prefetching, ...
- Bandwidth: use parallelism inside the memory system
3. How to Increase Memory Command Parallelism?
- Similar to instruction scheduling: commands can be reordered for higher bandwidth
4. Inside the Memory System
[Figure: Memory Controller containing the Read Queue and Write Queue (each labeled "not FIFO") and the Memory Queue; the arbiter schedules memory operations]
5. Our Work
- Study memory command scheduling in the context of the IBM Power5
- Present new memory arbiters
- 20% increased bandwidth
- Very little cost: 0.04% increase in chip area
6. Outline
- The Problem
- Characteristics of DRAM
- Previous Scheduling Methods
- Our approach
- History-based schedulers
- Adaptive history-based schedulers
- Results
- Conclusions
7. Understanding the Problem: Characteristics of DRAM
- Multi-dimensional structure
- Banks, rows, and columns
- IBM Power5: ranks and ports as well
- Access time is not uniform
- Bank-to-Bank conflicts
- Read after Write to the same rank conflict
- Write after Read to different port conflict
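The three conflict classes listed above can be written as a small predicate over two consecutive commands. This is an illustrative sketch only: the `Command` fields, the conflict names, and the reading of a bank conflict as two back-to-back commands to the same bank are assumptions, not details from the slides, and no Power5 timing values are modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Command:
    kind: str  # "R" for read, "W" for write
    port: int
    rank: int
    bank: int


def conflicts(prev: Command, nxt: Command) -> List[str]:
    """Return the conflict classes that `nxt` would hit if issued right after `prev`."""
    found = []
    # Assumption: a bank conflict means both commands target the same bank.
    if (prev.port, prev.rank, prev.bank) == (nxt.port, nxt.rank, nxt.bank):
        found.append("bank")
    # Read after Write to the same rank.
    if prev.kind == "W" and nxt.kind == "R" and prev.rank == nxt.rank:
        found.append("read-after-write-same-rank")
    # Write after Read to a different port.
    if prev.kind == "R" and nxt.kind == "W" and prev.port != nxt.port:
        found.append("write-after-read-different-port")
    return found
```

A scheduler could use such a predicate to rank candidate commands by how many (or how costly) conflicts they would trigger.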
8. Previous Scheduling Approaches: FIFO Scheduling
[Figure: Read Queue and Write Queue feeding the Memory Queue (FIFO)]
9. Memoryless Scheduling
- Adapted from Rixner et al., ISCA 2000
[Figure: Read Queue and Write Queue feeding the Memory Queue (FIFO); annotation: long delay]
10. What we really want
- Keep the pipeline full: don't hold commands in the reorder queues until conflicts are totally resolved
- Forward them to the memory queue in an order that minimizes future conflicts
- To do this we need to know the history of the memory queue
[Figure: Read/Write Queues]
11. Another Goal: Match Application's Memory Command
- Arbiter should select commands from queues roughly in the ratio in which the application generates them
- Otherwise, read or write queue may be congested
- Command history is useful here too
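To illustrate how command history can help match a read/write ratio, the sketch below prefers whichever command type is currently under-represented in the recent history relative to a target mix. The function name `prefer_kind` and the string encoding of the history are hypothetical, chosen only for this example.

```python
def prefer_kind(history: str, target_reads: int, target_writes: int) -> str:
    """Pick "R" or "W" so the issued mix drifts toward target_reads:target_writes.

    `history` is a string of recently issued commands, e.g. "RRW".
    """
    reads = history.count("R")
    writes = history.count("W")
    # Reads are not ahead of the target ratio when
    # reads / writes <= target_reads / target_writes (cross-multiplied
    # to avoid dividing by zero).
    if reads * target_writes <= writes * target_reads:
        return "R"
    return "W"
```

With a target of 2 reads per write, a history of "RR" yields a preference for a write, while "RW" yields a preference for another read.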
12. Our Approach: History-Based Memory Schedulers
- Benefits
- Minimize contention costs
- Consider multiple constraints
- Match application's memory access behavior
- 2 Reads per Write?
- 1 Read per Write?
- The Result: less congested memory system, i.e. more bandwidth
13. How does it work?
- Use a Finite State Machine (FSM)
- Each state in the FSM represents one possible history
- Transitions out of a state are prioritized
- At any state, the scheduler selects the available command with the highest priority
- FSM is generated at design time
14. An Example
[Figure: from the current state, the scheduler checks its first, second, third, and fourth preferences against the available commands in the reorder queues, sends the most appropriate command to memory, and moves to the next state]
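The FSM mechanism can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the four command types (`R0`, `R1`, `W0`, `W1`, meaning read or write to port 0 or 1), the history length of 2, and the `cost` function, which is a made-up penalty and not the Power5 timing model.

```python
from itertools import product

COMMANDS = ["R0", "R1", "W0", "W1"]  # hypothetical command types


def cost(history, cmd):
    """Made-up contention cost; the more recent command weighs more."""
    total = 0
    for weight, prev in zip((1, 2), history):  # history[1] is the most recent
        if prev[0] == "W" and cmd[0] == "R" and prev[1] == cmd[1]:
            total += 2 * weight
        if prev[0] == "R" and cmd[0] == "W" and prev[1] != cmd[1]:
            total += 1 * weight
    return total


def build_fsm():
    """Design-time step: one state per history, each with a priority list."""
    fsm = {}
    for history in product(COMMANDS, repeat=2):
        fsm[history] = sorted(COMMANDS, key=lambda c: cost(history, c))
    return fsm


def select(fsm, state, available):
    """Run-time step: issue the highest-priority available command."""
    for cmd in fsm[state]:
        if cmd in available:
            return cmd, (state[1], cmd)  # next state shifts the history
    return None, state  # nothing available: stay in the same state
```

At run time the arbiter only walks a precomputed priority list, so all cost reasoning happens when the table is built.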
15. How to determine priorities?
- Two criteria
- A: Minimize contention costs
- B: Satisfy program's Read/Write command mix
- First Method: Use A, break ties with B
- Second Method: Use B, break ties with A
- Which method to use?
- Combine two methods probabilistically
- (details in the paper)
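The two tie-breaking methods map naturally onto lexicographic sort keys. The sketch below shows both orderings plus one possible way to combine them probabilistically; the slides defer the actual combination scheme to the paper, so `combined_order` and its parameter `p_a` are assumptions, as are the per-command `cost` and `mix_dev` tables.

```python
import random


def order_a_then_b(cmds, cost, mix_dev):
    """First method: minimize contention cost (A), break ties with mix deviation (B)."""
    return sorted(cmds, key=lambda c: (cost[c], mix_dev[c]))


def order_b_then_a(cmds, cost, mix_dev):
    """Second method: minimize mix deviation (B), break ties with contention cost (A)."""
    return sorted(cmds, key=lambda c: (mix_dev[c], cost[c]))


def combined_order(cmds, cost, mix_dev, p_a, rng):
    """Assumed combination: use the first method with probability p_a, else the second."""
    if rng.random() < p_a:
        return order_a_then_b(cmds, cost, mix_dev)
    return order_b_then_a(cmds, cost, mix_dev)
```

For example, `combined_order(["R", "W"], cost, mix_dev, 0.5, random.Random(0))` picks one of the two orderings for a given state; a fixed seed keeps the design-time choice reproducible.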
16. Limitation of the History-Based Approach
- Designed for one particular mix of Reads/Writes
- Solution: Adaptive History-Based Schedulers
- Create multiple state machines, one for each Read/Write mix
- Periodically select the most appropriate state machine
17. Adaptive History-Based Schedulers
[Figure: state machines for the 2R1W, 1R1W, and 1R2W mixes]
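The periodic selection step can be sketched as picking the state machine whose target mix is closest to the mix observed over the last interval. The selection rule (nearest read fraction), the function name `select_fsm`, and the default for an idle interval are assumptions for illustration; only the three mixes come from the slide.

```python
# Target read fraction for each of the three state machines.
TARGET_READ_FRACTION = {
    "2R1W": 2 / 3,
    "1R1W": 1 / 2,
    "1R2W": 1 / 3,
}


def select_fsm(reads: int, writes: int) -> str:
    """Choose the FSM whose read/write mix best matches the observed counts."""
    total = reads + writes
    if total == 0:
        return "1R1W"  # assumed default when no commands were observed
    observed = reads / total
    return min(
        TARGET_READ_FRACTION,
        key=lambda name: abs(TARGET_READ_FRACTION[name] - observed),
    )
```

Calling `select_fsm` once per interval with that interval's read and write counts would switch the arbiter between the three machines as the workload changes.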
- Used a cycle-accurate simulator for the IBM Power5
- 1.6 GHz, 266-DDR2, 4-rank, 4-bank, 2-port
- Evaluated and compared our approach with previous approaches on data-intensive applications: Stream, NAS, and microbenchmarks
19. The IBM Power5
- 2 cores on a chip
- SMT capability
- Large on-chip L2 cache
- Hardware prefetching
- 276 million transistors
[Figure label: Memory Controller (1.6% of chip area)]
20. Results 1: Stream Benchmarks
21. Results 2: NAS Benchmarks
(1 core active)
22. Results 3: Microbenchmarks
23. 12 concurrent commands
[Figure: Read Queue and Write Queue feeding the Memory Queue (FIFO)]
24. DRAM Utilization
[Figure: two panels, Memoryless Approach and Our Approach; axis: Number of Active Commands in DRAM]
25. Why does it work?
- Detailed analysis in the paper
[Figure: Memory Controller with Read Queue, Write Queue, and Memory Queue; annotations: Low Occupancy in Reorder Queues, Full Memory Queue, Busy Memory System, Full Reorder Queues]
26. Other Results
- We obtain >95% of the performance of the perfect DRAM configuration (no conflicts)
- Results with higher frequency, and with no data prefetching, are in the paper
- History size of 2 works well
27. Conclusions
- Introduced adaptive history-based schedulers
- Evaluated on a highly tuned system, IBM Power5
- Performance improvement
- Over FIFO: Stream 63%, NAS 11%
- Over Memoryless: Stream 19%, NAS 5%
- Little cost: 0.04% chip area increase
28. Conclusions (cont.)
- Similar arbiters can be used in other places as well, e.g. cache controllers
- Can optimize for other criteria, e.g. power or