Title: Dynamic Management of Microarchitecture Resources in Future Processors
1Dynamic Management of Microarchitecture Resources in Future Processors
Rajeev Balasubramonian
Dept. of Computer Science, University of Rochester
2Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
- On-chip cache hierarchy
- Clustered processors
- Pre-execution threads
- Future work
4Design Goals in Modern Processors
Microprocessor designs strive for
- High performance
- High clock speed
- High parallelism
- Low power
- Low design complexity
- Short, simple pipelines
Unfortunately, not all can be achieved simultaneously.
5Trade-Off in the Cache Size
[Figure: two CPUs, each with its own L1 data cache]
                             L1 data cache    L1 data cache
Size/access time             32KB/2 cycles    128KB/4 cycles
sort 4000 miss rate          very low         very low
sort 4000 execution time     t                t + x
sort 16000 miss rate         high             very low
sort 16000 execution time    T                T - X
6Trade-Off in the Register File Size
Register file
The register file stores results for all active instructions in the processor.
Large register file
  → more active instructions → high parallelism
  → long access times → slow clock speed / more pipeline stages → high power, design complexity
7Trade-Offs Involving Resource Sizes
Trade-offs influence the design of the cache, register file, issue queue, etc.
Large resource size
  → high parallelism, ability to support more threads
  → long latency → long pipelines / low clock speed → high power, high design complexity
8Parallelism-Latency Trade-Off
- For each resource, performance depends on
  - parallelism it can help extract
  - negative impact of its latency
- Every program has different parallelism and latency needs.
9Limitations of Conventional Designs
- Resource sizes are fixed at design time: the size that works best, on average, for all programs
- This average size is often too small or too large for many programs
- For optimal performance, the hardware should match the program's parallelism needs.
10Dynamic Resource Management
- Reconfigurable memory hierarchy (MICRO00, IEEE TOC, PACT02)
- Trade-offs in clusters (ISCA03)
- Selective pre-execution (ISCA01)
- Efficient register file design (MICRO01)
- Dynamic voltage/frequency scaling (HPCA02)
11Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
- On-chip cache hierarchy
- Clustered processors
- Pre-execution threads
- Future work
12Conventional Cache Hierarchies
[Figure: CPU, L1, L2, Main Memory; axes labeled Capacity and Speed]
L1: 32KB, 2-way set-associative, 2 cycles, miss rate 2.3%
L2: 2MB, 8-way, 20 cycles, miss rate 0.2%
13Conventional Cache Layout
[Figure: cache layout with Address input, Decoder, wordline, bitline, way 0, way 1, Output Driver, and Data output]
14Wire Delays
- Delay is a quadratic function of the wire length
- By inserting repeaters/buffers, delay grows roughly linearly with length
    Length x (no repeater): delay t
    Length 2x (no repeater): delay 4t
    Length 2x (one repeater): delay 2t + logic_delay
- Repeaters electrically isolate the wire segments
- Commonly used today in long wires
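The delay figures on this slide can be reproduced with a small model. This is only a sketch: the unit delay `t`, the value of `logic_delay`, and both function names are assumptions made for illustration, not values from the talk.

```python
def unrepeated_delay(length, t=1.0):
    """Delay of a plain wire: grows with the square of its length.

    `t` is the delay of a wire of length 1 (an arbitrary, assumed unit).
    """
    return t * length ** 2


def repeated_delay(length, segments, t=1.0, logic_delay=0.1):
    """Delay of a wire split into `segments` equal pieces by repeaters.

    Each piece pays the quadratic cost for its own (shorter) length, and
    each of the `segments - 1` repeaters adds an assumed `logic_delay`.
    """
    piece = length / segments
    return segments * unrepeated_delay(piece, t) + (segments - 1) * logic_delay


if __name__ == "__main__":
    print(unrepeated_delay(1))     # length x, no repeater
    print(unrepeated_delay(2))     # length 2x, no repeater
    print(repeated_delay(2, 2))    # length 2x, one repeater in the middle
```

With the repeater in place, doubling the length roughly doubles the delay instead of quadrupling it, at the cost of one repeater's logic delay.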
15Exploiting Technology
[Figure: cache layout with Decoder]
16The Reconfigurable Cache Layout
[Figure: cache layout with Decoder and way 0, way 1, way 2, way 3]
17The Reconfigurable Cache Layout
[Figure: cache layout with Decoder and way 0, way 1, way 2, way 3]
32KB 1-way cache, 2 cycles
18The Reconfigurable Cache Layout
[Figure: cache layout with Decoder and way 0, way 1, way 2, way 3]
64KB 2-way cache, 3 cycles
The disabled portions of the cache are used as the non-inclusive L2.
19Changing the Boundary between L1-L2
[Figure: CPU, L1, L2]
21Trade-Off in the Cache Size
[Figure: two CPUs, each with its own L1 data cache]
                             L1 data cache    L1 data cache
Size/access time             32KB/2 cycles    128KB/4 cycles
sort 4000 miss rate          very low         very low
sort 4000 execution time     t                t + x
sort 16000 miss rate         high             very low
sort 16000 execution time    T                T - X
22Salient Features
- Low-cost: exploits the benefits of repeaters
- Optimizes the access time/capacity trade-off
- Can reduce energy -- most efficient when cache size equals working set size
23Control Mechanism
Gather statistics at periodic intervals (every 10K instructions).
Inspect stats. Is there a phase change?
- yes (exploration): run each configuration for an interval, then pick the best configuration
- no: remain at the selected configuration
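The flow above can be sketched as a small state machine that is called once per interval. The class name, the method name, and the use of IPC as the score are assumptions made for illustration, and the phase-change decision is passed in rather than computed here.

```python
class IntervalController:
    """Hypothetical sketch of an interval-based exploration controller."""

    def __init__(self, configs):
        self.configs = list(configs)
        self.to_explore = list(self.configs)   # configurations still to be tried
        self.scores = {}                       # configuration -> IPC seen while exploring
        self.current = self.to_explore.pop(0)
        self.exploring = True

    def end_of_interval(self, ipc, phase_changed=False):
        """Report the IPC of the finished interval; return the next configuration."""
        if self.exploring:
            self.scores[self.current] = ipc
            if self.to_explore:
                # Run the next configuration for one interval.
                self.current = self.to_explore.pop(0)
            else:
                # Every configuration has been tried: pick the best one.
                self.current = max(self.scores, key=self.scores.get)
                self.exploring = False
        elif phase_changed:
            # Phase change: start a fresh exploration.
            self.scores = {}
            self.to_explore = list(self.configs)
            self.current = self.to_explore.pop(0)
            self.exploring = True
        # Otherwise remain at the selected configuration.
        return self.current
```

A caller would invoke `end_of_interval` every 10K instructions with the IPC measured over that interval and apply whichever configuration is returned.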
24Metrics
- Optimizing performance
  - metric for best configuration is simply instructions per cycle (IPC)
- Detecting a phase change
  - Change in branch frequency or miss rate frequency or sudden change in IPC → change in program phase
  - To avoid unnecessary explorations, the thresholds can be adapted at run-time
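A minimal sketch of the phase-change test follows. The 10% starting thresholds, the choice of relative change as the comparison, and the `relax` policy for adapting thresholds are all assumptions made for illustration; the slide only states which statistics are watched and that thresholds can adapt.

```python
from dataclasses import dataclass


@dataclass
class IntervalStats:
    branches: int   # branches executed in the interval
    misses: int     # cache misses in the interval
    ipc: float      # instructions per cycle over the interval


class PhaseDetector:
    """Hypothetical detector: flags a phase change on a large relative shift."""

    def __init__(self, count_threshold=0.10, ipc_threshold=0.10):
        self.count_threshold = count_threshold   # assumed starting value
        self.ipc_threshold = ipc_threshold       # assumed starting value
        self.reference = None                    # stats the current choice is based on

    @staticmethod
    def _relative_change(new, old):
        return abs(new - old) / max(abs(old), 1e-9)

    def phase_changed(self, stats):
        if self.reference is None:
            self.reference = stats
            return False
        ref = self.reference
        return (
            self._relative_change(stats.branches, ref.branches) > self.count_threshold
            or self._relative_change(stats.misses, ref.misses) > self.count_threshold
            or self._relative_change(stats.ipc, ref.ipc) > self.ipc_threshold
        )

    def relax(self, factor=2.0):
        """Raise both thresholds, e.g. after an exploration that changed nothing."""
        self.count_threshold *= factor
        self.ipc_threshold *= factor
```

Raising the thresholds after a fruitless exploration is one simple way to make later explorations less frequent; other adaptation policies would fit the same interface.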
25Simulation Methodology
- Modified version of SimpleScalar-3.0 -- includes many details on bus contention
- Executing programs from various benchmark sets (a mix of many program types)
26Performance Results
Overall harmonic mean (HM) improvement: 17%
27Energy Results
Overall energy savings: 42%
28Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
- On-chip cache hierarchy
- Clustered processors
- Pre-execution threads
- Future work
29Conventional Processor Design
[Figure: I Cache, Branch Predictor, Rename/Dispatch, Issue Q, Register File, and four FUs]
Large structures → slower clock speed
30The Clustered Processor
[Figure: I Cache, Branch Predictor, and Rename/Dispatch feeding four clusters, each with its own Regfile, IQ, and FU]
Example instructions:
    r1 ← r3 + r4
    r2 ← r1 + r41
    r41 ← r43 + r44
Small structures → faster clock speed
But, high latency for some instructions
31Emerging Trends
- Wire delays and faster clocks will make each cluster smaller
- Larger transistor budgets and low design cost will enable the implementation of many clusters on the chip
- The support of many threads will require many resources and clusters
→ Numerous, small clusters will be a reality!
32Communication Costs
[Figure: a 4-cluster and an 8-cluster organization, each cluster with its own Regs, IQ, and FU]
More clusters → more communication
33Communication vs Parallelism
4 clusters → 100 active instrs
    r1 ← r2 + r3
    r5 ← r1 + r3
    r7 ← r2 + r3
    r8 ← r7 + r3
8 clusters → 200 active instrs
    r1 ← r2 + r3
    r5 ← r1 + r3
    r7 ← r2 + r3
    r8 ← r7 + r3
    r5 ← r1 + r7
    r9 ← r2 + r3
Legend: ready instructions
Distant parallelism: distant instructions that are ready to execute
34Communication-Parallelism Trade-Off
- More clusters → more communication
- More clusters → more parallelism
- Selectively use more clusters
  - if communication is tolerable
  - if there is additional distant parallelism
35IPC with Many Clusters (ISCA03)
36Trade-Off Management
- The clustered processor abstraction exposes the trade-off between communication and parallelism
- It also simplifies the management of resources -- we can disable a cluster by simply not dispatching instructions to it
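To illustrate disabling a cluster through dispatch alone, here is a toy steering function. Round-robin steering and the function name `steer` are assumptions chosen for brevity; a real steering heuristic would also weigh dependences and load.

```python
from itertools import cycle


def steer(instructions, active_clusters):
    """Assign each instruction to one of the active clusters, round-robin.

    A cluster is "disabled" simply by leaving it out of `active_clusters`,
    so no instruction is ever dispatched to it.
    """
    if not active_clusters:
        raise ValueError("at least one cluster must be active")
    targets = cycle(active_clusters)
    return [(instr, next(targets)) for instr in instructions]


if __name__ == "__main__":
    program = ["i0", "i1", "i2", "i3", "i4", "i5"]
    print(steer(program, [0, 1, 2, 3]))   # all four clusters in use
    print(steer(program, [0, 1]))         # clusters 2 and 3 disabled
```

Because the set of active clusters is just an argument to dispatch, changing the configuration needs no other change to the pipeline model.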
37Control Mechanism
Gather statistics at periodic intervals (every 10K instructions).
Inspect stats. Is there a phase change?
- yes (exploration): run each configuration for an interval, then pick the best configuration
- no: remain at the selected configuration
38The Interval Length
- Success depends on ability to repeat behavior across successive intervals
- Every program is likely to have phase changes at different granularities
- Must also pick the interval length at run-time
39Picking the Interval Length
- Start with minimum allowed interval length
- If phase changes are too frequent, double the interval length: find a coarse enough granularity such that behavior is consistent
- Repeat every 10 billion instructions
- Small interval lengths can result in noisy measurements
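The doubling procedure can be sketched as follows. The callback `instability_at`, the 5% tolerance, and the upper bound on the interval length are hypothetical; the slide specifies only the starting point, the doubling step, and the periodic restart.

```python
def pick_interval_length(instability_at, min_length=10_000,
                         max_length=10_000_000, max_instability=5.0):
    """Double the interval length until phase changes are infrequent enough.

    `instability_at(length)` is a hypothetical callback that runs the program
    for a while at the given interval length and returns the percentage of
    intervals that flagged a phase change.
    """
    length = min_length
    while length < max_length and instability_at(length) > max_instability:
        length *= 2   # behavior too erratic at this granularity: go coarser
    return length


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    measured = {10_000: 40.0, 20_000: 20.0, 40_000: 3.0}
    print(pick_interval_length(lambda n: measured.get(n, 0.0)))
```

Re-running this search periodically (every 10 billion instructions on the slide) lets the interval length track programs whose phase granularity drifts over time.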
40Varied Interval Lengths
Instability factor: percentage of intervals that flag a phase change.
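As a worked instance of the definition, the instability factor can be computed from a list of per-interval flags. The function name and the convention of returning 0 for an empty list are assumptions.

```python
def instability_factor(phase_change_flags):
    """Percentage of intervals that flagged a phase change."""
    flags = list(phase_change_flags)
    if not flags:
        return 0.0   # assumed convention: no intervals observed yet
    return 100.0 * sum(1 for flag in flags if flag) / len(flags)


if __name__ == "__main__":
    # One phase change in four intervals.
    print(instability_factor([True, False, False, False]))
```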
41Results with Interval-Based Scheme
Overall improvement: 11%
42Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
- On-chip cache hierarchy
- Clustered processors
- Pre-execution threads
- Future work
43Pre-Execution
- Executing a subset of the program in advance
- Helps warm up various processor structures such as the cache and branch predictor
44The Future Thread (ISCA01)
- The main program thread executes every single instruction
- Some registers are reserved for the future thread so it can jump ahead
[Figure: instruction streams of the main thread and the pre-execution thread]
45Key Innovations
- Ability to advance much further
  - eager recycling of registers
  - skipping idle instructions
- Integrating pre-executed results
  - re-using register results
  - correcting branch mispredicts
  - prefetch into the caches
- Allocation of resources
46Trade-Offs in Resource Allocation
- Allocating more registers for the main thread favors nearby parallelism
[Figure: instruction streams of the main thread and the future thread]
- Allocating more registers for the future thread favors distant parallelism
- The interval-based mechanism can pick the optimal allocation
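Applied to register allocation, exploration amounts to trying each candidate split for one interval and keeping the best. In this sketch the function name, the candidate list, and the `measure_ipc` callback are hypothetical, and the numbers in the usage example are made up.

```python
def best_allocation(total_regs, candidates, measure_ipc):
    """Try each future-thread register count for an interval; keep the best.

    `measure_ipc(main_regs, future_regs)` is a hypothetical callback that
    runs one interval with the given split and returns the IPC achieved.
    Returns the chosen (main_regs, future_regs) pair.
    """
    results = {}
    for future_regs in candidates:
        main_regs = total_regs - future_regs
        results[future_regs] = measure_ipc(main_regs, future_regs)
    best = max(results, key=results.get)
    return total_regs - best, best


if __name__ == "__main__":
    # Made-up IPC values indexed by future-thread register count.
    ipc = {8: 1.0, 12: 1.3, 16: 1.1}
    print(best_allocation(64, [8, 12, 16], lambda main, future: ipc[future]))
```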
47Pre-Execution Results
Overall improvement with 12 registers: 11%
Overall improvement with dynamic allocation: 18%
48Conclusion
- Emerging technologies will make trade-off management vital
- Approaches to hardware adaptation
  - cache hierarchy
  - clustered processors
  - pre-execution threads
- The interval-based mechanism with exploration is robust and applies to most problem domains
49Talk Outline
- Trade-offs in future microprocessors
- Dynamic resource management
- On-chip cache hierarchy
- Clustered processors
- Pre-execution threads
- Future work
50Future Scenarios
- Clustered designs can be used to produce all classes of processors
- A library of simple cluster cores with different energy, clock speed, latency, and parallelism characteristics
- The role of the architect: putting these cores together on the chip and exploiting them to maximize performance
51Heterogeneous Clusters
- Having different clusters on the chip provides many options for instruction steering
- For example, a program limited by communication will benefit from large, slow cluster cores
- Non-critical instructions of a program could be steered to slow, energy-efficient clusters -- such clusters can also help reduce processor hot-spots
52Other Critical Problems
- How does one build a highly clustered processor?
- Where does the cache go?
- What interconnect topology do we use?
- How does multithreading affect these choices?
53More Details
Research synopses and papers available at http://www.cs.rochester.edu/rajeev/research.html