Dynamic Management of - PowerPoint PPT Presentation

Slides: 56

Provided by: talk8

Learn more at: https://users.cs.utah.edu

Transcript and Presenter's Notes

Title: Dynamic Management of

1
Dynamic Management of Microarchitecture Resources in Future Processors
Rajeev Balasubramonian
Dept. of Computer Science, University of Rochester
2
Talk Outline

Trade-offs in future microprocessors
Dynamic resource management
On-chip cache hierarchy
Clustered processors
Pre-execution threads
Future work

3
Talk Outline

Trade-offs in future microprocessors
Dynamic resource management
On-chip cache hierarchy
Clustered processors
Pre-execution threads
Future work

4
Design Goals in Modern Processors
Microprocessor designs strive for

High performance
High clock speed
High parallelism
Low power
Low design complexity
Short, simple pipelines

Unfortunately, not all can be achieved simultaneously.
5
Trade-Off in the Cache Size
[Diagram: two CPUs, each with an L1 data cache of a different size]

Size / access time       32KB / 2 cycles    128KB / 4 cycles
sort 4000   miss rate    very low           very low
sort 4000   exec time    t                  t + x
sort 16000  miss rate    high               very low
sort 16000  exec time    T                  T - X
6
Trade-Off in the Register File Size
Register file
The register file stores results for all active instructions in the processor.
Large register file
  -> more active instructions -> high parallelism
  -> long access times -> slow clock speed / more pipeline stages
  -> high power, design complexity
7
Trade-Offs Involving Resource Sizes
Trade-offs influence the design of the cache, register file, issue queue, etc.
Large resource size
  -> high parallelism, ability to support more threads
  -> long latency -> long pipelines / low clock speed
  -> high power, high design complexity
8
Parallelism-Latency Trade-Off

For each resource, performance depends on
  the parallelism it can help extract
  the negative impact of its latency
Every program has different parallelism and latency needs.

9
Limitations of Conventional Designs

Resource sizes are fixed at design time: the size that works best, on average, for all programs
This average size is often too small or too large for many programs
For optimal performance, the hardware should match the program's parallelism needs.

10
Dynamic Resource Management

Reconfigurable memory hierarchy (MICRO'00, IEEE TOC, PACT'02)
Trade-offs in clusters (ISCA'03)
Selective pre-execution (ISCA'01)
Efficient register file design (MICRO'01)
Dynamic voltage/frequency scaling (HPCA'02)

11
Talk Outline

Trade-offs in future microprocessors
Dynamic resource management
On-chip cache hierarchy
Clustered processors
Pre-execution threads
Future work

12
Conventional Cache Hierarchies
[Diagram: CPU, L1, L2, Main Memory, with arrows labeled Capacity and Speed]
L1: 32KB, 2-way set-associative, 2 cycles, miss rate 2.3%
L2: 2MB, 8-way, 20 cycles, miss rate 0.2%
13
Conventional Cache Layout
[Diagram: cache array with way 0 and way 1; the Address feeds the Decoder, which drives the wordlines; bitlines lead to the Output Driver, which delivers the Data]
14
Wire Delays

Delay is a quadratic function of the wire length.
By inserting repeaters/buffers, delay grows roughly linearly with length.

Length x: delay t
Length 2x, no repeater: delay 4t
Length 2x, one repeater: delay 2t + logic_delay

Repeaters electrically isolate the wire segments.
Commonly used today in long wires.

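The wire-delay arithmetic on this slide can be checked with a small sketch. It assumes an idealized model in which an unbuffered wire's delay is exactly quadratic in its length and every repeater adds a fixed logic delay; the function names, `t_unit`, and the default `logic_delay` value are illustrative assumptions, not figures from the talk.

```python
def unrepeated_delay(length: float, t_unit: float = 1.0) -> float:
    """Delay of an unbuffered wire: quadratic in length.

    t_unit is the delay of a wire of length 1 (the slide's "t").
    """
    return t_unit * length ** 2


def repeated_delay(length: float, segments: int,
                   t_unit: float = 1.0, logic_delay: float = 0.1) -> float:
    """Delay of a wire cut into equal segments joined by repeaters.

    Each segment pays its own quadratic delay; each of the
    (segments - 1) repeaters adds logic_delay (an assumed value).
    """
    if segments < 1:
        raise ValueError("segments must be at least 1")
    seg_len = length / segments
    wire = segments * unrepeated_delay(seg_len, t_unit)
    return wire + (segments - 1) * logic_delay


if __name__ == "__main__":
    print("length x, no repeater:  ", unrepeated_delay(1.0))
    print("length 2x, no repeater: ", unrepeated_delay(2.0))
    print("length 2x, one repeater:", repeated_delay(2.0, segments=2))
```

With one segment per unit of length, the total grows as `length * t_unit` plus the repeater overhead, which is the roughly linear behavior the slide refers to.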
15
Exploiting Technology
Decoder
16
The Reconfigurable Cache Layout
Decoder
way 0
way 1
way 2
way 3
17
The Reconfigurable Cache Layout
Decoder
way 0
way 1
way 2
way 3
32KB 1-way cache, 2 cycles
18
The Reconfigurable Cache Layout
Decoder
way 0
way 1
way 2
way 3
64KB 2-way cache, 3 cycles
The disabled portions of the cache are used as the non-inclusive L2.
19
Changing the Boundary between L1-L2
L1
L2
CPU
20
Changing the Boundary between L1-L2
L1
L2
CPU
21
Trade-Off in the Cache Size
[Diagram: two CPUs, each with an L1 data cache of a different size]

Size / access time       32KB / 2 cycles    128KB / 4 cycles
sort 4000   miss rate    very low           very low
sort 4000   exec time    t                  t + x
sort 16000  miss rate    high               very low
sort 16000  exec time    T                  T - X
22
Salient Features

Low-cost: exploits the benefits of repeaters
Optimizes the access time/capacity trade-off
Can reduce energy -- most efficient when cache size equals working set size

23
Control Mechanism
Gather statistics at periodic intervals (every 10K instructions)
Inspect stats. Is there a phase change?
  yes: exploration. Run each configuration for an interval, then pick the best configuration
  no: remain at the selected configuration
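The flowchart on this slide can be sketched as a loop over intervals. This is only a minimal illustration under stated assumptions: `measure_ipc` and `phase_changed` are hypothetical callbacks standing in for the hardware counters, and the configuration names are placeholders.

```python
from typing import Callable, Dict, List, Optional, Sequence


def run_controller(configs: Sequence[str],
                   measure_ipc: Callable[[str, int], float],
                   phase_changed: Callable[[int], bool],
                   num_intervals: int) -> List[str]:
    """Return the configuration used in each interval.

    measure_ipc(config, interval) and phase_changed(interval) are
    hypothetical hooks; a real design would read hardware counters.
    """
    history: List[str] = []
    current: Optional[str] = None
    i = 0
    while i < num_intervals:
        if current is None or phase_changed(i):
            # Exploration: run each configuration for one interval.
            scores: Dict[str, float] = {}
            for cfg in configs:
                if i >= num_intervals:
                    break
                scores[cfg] = measure_ipc(cfg, i)
                history.append(cfg)
                i += 1
            # Pick the best configuration seen during exploration.
            current = max(scores, key=lambda c: scores[c])
        else:
            # No phase change: remain at the selected configuration.
            measure_ipc(current, i)
            history.append(current)
            i += 1
    return history
```

The sketch also explores at start-up, since no configuration has been selected yet; that choice is an assumption rather than something the slide states.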
24
Metrics

Optimizing performance
  metric for best configuration is simply instructions per cycle (IPC)
Detecting a phase change
  change in branch frequency or miss rate frequency, or sudden change in IPC -> change in program phase
  to avoid unnecessary explorations, the thresholds can be adapted at run-time

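A phase-change test along these lines might look like the sketch below. The slide does not say how the thresholds are adapted, so `adapt_threshold` is a purely hypothetical policy (loosen the threshold after an exploration that changed nothing, tighten it otherwise); the `IntervalStats` fields and all numeric defaults are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class IntervalStats:
    branch_freq: float  # branches per instruction in the interval
    miss_rate: float    # cache misses per access in the interval
    ipc: float          # instructions per cycle in the interval


def rel_change(new: float, old: float) -> float:
    """Relative change of new with respect to old."""
    if old == 0.0:
        return 0.0 if new == 0.0 else float("inf")
    return abs(new - old) / abs(old)


def phase_changed(ref: IntervalStats, cur: IntervalStats,
                  threshold: float) -> bool:
    """Flag a phase change if any statistic moved by more than threshold."""
    return (rel_change(cur.branch_freq, ref.branch_freq) > threshold
            or rel_change(cur.miss_rate, ref.miss_rate) > threshold
            or rel_change(cur.ipc, ref.ipc) > threshold)


def adapt_threshold(threshold: float, exploration_was_useful: bool,
                    step: float = 0.01, lowest: float = 0.01,
                    highest: float = 0.5) -> float:
    """Hypothetical adaptation: raise the threshold after an exploration
    that picked the same configuration again, lower it otherwise."""
    if exploration_was_useful:
        return max(lowest, threshold - step)
    return min(highest, threshold + step)
```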
25
Simulation Methodology

Modified version of Simplescalar-3.0 -- includes many details on bus contention
Executing programs from various benchmark sets (a mix of many program types)

26
Performance Results
Overall harmonic mean (HM) improvement: 17%
27
Energy Results
Overall energy savings: 42%
28
Talk Outline

Trade-offs in future microprocessors
Dynamic resource management
On-chip cache hierarchy
Clustered processors
Pre-execution threads
Future work

29
Conventional Processor Design
[Diagram: I Cache, Branch Predictor, Rename Dispatch, Issue Q, Register File, and four FUs]
Large structures -> slower clock speed
30
The Clustered Processor
[Diagram: I Cache, Branch Predictor, and Rename Dispatch feeding four clusters, each with its own Regfile, IQ, and FU]
Example instructions (destination <- sources): r1 <- r3, r4; r2 <- r1, r41; r41 <- r43, r44
Small structures -> faster clock speed
But, high latency for some instructions
31
Emerging Trends

Wire delays and faster clocks will make each cluster smaller
Larger transistor budgets and low design cost will enable the implementation of many clusters on the chip
The support of many threads will require many resources and clusters
-> Numerous, small clusters will be a reality!

32
Communication Costs
[Diagram: an array of clusters, each with Regs, IQ, and FU; groups labeled "4 clusters" and "8 clusters"]
More clusters -> more communication
33
Communication vs Parallelism
4 clusters -> 100 active instrs
  r1 <- r2, r3
  r5 <- r1, r3
  r7 <- r2, r3
  r8 <- r7, r3
8 clusters -> 200 active instrs
  r1 <- r2, r3
  r5 <- r1, r3
  r7 <- r2, r3
  r8 <- r7, r3
  r5 <- r1, r7
  r9 <- r2, r3
[Diagram label: Ready instructions]
Distant parallelism: distant instructions that are ready to execute
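The notion of distant parallelism can be illustrated with a small dependence check. The sketch assumes each instruction on the slide has one destination and two source registers (the operators did not survive in the transcript), and it treats an instruction as ready when none of its sources is written by an earlier instruction in the same window. The `PROGRAM` list is that reading of the slide, not data from the talk.

```python
from typing import List, Sequence, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination, sources)

# Instruction sequence as read from the slide (an assumption).
PROGRAM: List[Instr] = [
    ("r1", ("r2", "r3")),
    ("r5", ("r1", "r3")),
    ("r7", ("r2", "r3")),
    ("r8", ("r7", "r3")),
    ("r5", ("r1", "r7")),
    ("r9", ("r2", "r3")),
]


def ready_instructions(window: Sequence[Instr]) -> List[int]:
    """Indices of instructions with no producer earlier in the window."""
    pending = set()  # registers written by earlier, unfinished instructions
    ready = []
    for idx, (dest, srcs) in enumerate(window):
        if not any(src in pending for src in srcs):
            ready.append(idx)
        pending.add(dest)
    return ready


if __name__ == "__main__":
    print("small window:", ready_instructions(PROGRAM[:4]))
    print("large window:", ready_instructions(PROGRAM))
```

Under these assumptions the larger window exposes one more ready instruction (the last one), which is the kind of distant parallelism extra clusters can exploit.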
34
Communication-Parallelism Trade-Off

More clusters
  -> more communication
  -> more parallelism
Selectively use more clusters
  if communication is tolerable
  if there is additional distant parallelism

35
IPC with Many Clusters (ISCA'03)
36
Trade-Off Management

The clustered processor abstraction exposes the trade-off between communication and parallelism
It also simplifies the management of resources -- we can disable a cluster by simply not dispatching instructions to it

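Disabling clusters at the dispatch stage can be sketched as follows. The round-robin steering used here is a deliberate simplification and an assumption (a real steering heuristic would also consider dependences and load); the `Dispatcher` class and its method names are hypothetical.

```python
class Dispatcher:
    """Steers instructions to clusters 0..active-1; the rest stay idle."""

    def __init__(self, total_clusters: int) -> None:
        if total_clusters < 1:
            raise ValueError("need at least one cluster")
        self.total = total_clusters
        self.active = total_clusters
        self._next = 0

    def set_active(self, count: int) -> None:
        """Enable only the first `count` clusters."""
        if not 1 <= count <= self.total:
            raise ValueError("active cluster count out of range")
        self.active = count
        self._next %= count

    def steer(self) -> int:
        """Return the cluster for the next instruction (round-robin)."""
        cluster = self._next
        self._next = (self._next + 1) % self.active
        return cluster
```

Calling `set_active(4)` on an 8-cluster dispatcher means clusters 4 through 7 simply never receive instructions, with no other change to the hardware model.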
37
Control Mechanism
Gather statistics at periodic intervals (every 10K instructions)
Inspect stats. Is there a phase change?
  yes: exploration. Run each configuration for an interval, then pick the best configuration
  no: remain at the selected configuration
38
The Interval Length

Success depends on ability to repeat behavior
across successive intervals
Every program is likely to have phase changes
at different granularities
Must also pick the interval length at run-time

39
Picking the Interval Length

Start with minimum allowed interval length
If phase changes are too frequent, double the interval length: find a coarse enough granularity such that behavior is consistent
Repeat every 10 billion instructions
Small interval lengths can result in noisy measurements

40
Varied Interval Lengths
Instability factor: percentage of intervals that flag a phase change.
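The interval-length search and the instability factor can be sketched together. The `observe` callback (which returns the per-interval phase-change flags seen while running at a given interval length), the 5% instability limit, and the length bounds are all assumptions made for illustration; the talk does not give these values here.

```python
from typing import Callable, Iterable


def instability_factor(flags: Iterable[bool]) -> float:
    """Percentage of intervals that flagged a phase change."""
    flags = list(flags)
    if not flags:
        return 0.0
    return 100.0 * sum(flags) / len(flags)


def pick_interval_length(observe: Callable[[int], Iterable[bool]],
                         min_length: int = 10_000,
                         max_length: int = 10_000_000,
                         max_instability: float = 5.0) -> int:
    """Double the interval length until behavior looks stable.

    observe(length) is a hypothetical hook; min_length, max_length,
    and max_instability are assumed values, not from the slides.
    """
    length = min_length
    while (length < max_length
           and instability_factor(observe(length)) > max_instability):
        length = min(2 * length, max_length)
    return length
```

Re-running `pick_interval_length` periodically corresponds to the "repeat" step on the previous slide.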
41
Results with Interval-Based Scheme
Overall improvement: 11%
42
Talk Outline

Trade-offs in future microprocessors
Dynamic resource management
On-chip cache hierarchy
Clustered processors
Pre-execution threads
Future work

43
Pre-Execution

Executing a subset of the program in advance
Helps warm up various processor structures
such as the cache and branch predictor

44
The Future Thread (ISCA'01)

The main program thread executes every single instruction
Some registers are reserved for the future thread so it can jump ahead

[Diagram: instruction stream, with the pre-execution thread running ahead of the main thread]
45
Key Innovations

Ability to advance much further
  eager recycling of registers
  skipping idle instructions
Integrating pre-executed results
  re-using register results
  correcting branch mispredicts
  prefetch into the caches
Allocation of resources

46
Trade-Offs in Resource Allocation

Allocating more registers for the main thread favors nearby parallelism

[Diagram: instruction stream, with the main thread and the future thread marked]

Allocating more registers for the future thread favors distant parallelism
The interval-based mechanism can pick the optimal allocation

47
Pre-Execution Results
Overall improvement with 12 registers: 11%
Overall improvement with dynamic allocation: 18%
48
Conclusion

Emerging technologies will make trade-off management vital
Approaches to hardware adaptation
  cache hierarchy
  clustered processors
  pre-execution threads
The interval-based mechanism with exploration is robust and applies to most problem domains

49
Talk Outline

Trade-offs in future microprocessors
Dynamic resource management
On-chip cache hierarchy
Clustered processors
Pre-execution threads
Future work

50
Future Scenarios

Clustered designs can be used to produce all classes of processors
A library of simple cluster cores with different energy, clock speed, latency, and parallelism characteristics
The role of the architect: putting these cores together on the chip and exploiting them to maximize performance

51
Heterogeneous Clusters

Having different clusters on the chip provides many options for instruction steering
For example, a program limited by communication will benefit from large, slow cluster cores
Non-critical instructions of a program could be steered to slow, energy-efficient clusters -- such clusters can also help reduce processor hot-spots

52
Other Critical Problems

How does one build a highly clustered processor?
Where does the cache go?
What interconnect topology do we use?
How does multithreading affect these choices?

53
More Details
Research synopses and papers available at http://www.cs.rochester.edu/rajeev/research.html