Title: ELEC 669 Low Power Design Techniques Lecture 2
1 ELEC 669 Low Power Design Techniques - Lecture 2
- Amirali Baniasadi
- amirali_at_ece.uvic.ca
2 How to write a review?
- Think Critically.
- What if?
- Next Step?
- Any other applications?
3 Branches
- Instructions which can alter the flow of
instruction execution in a program
4 Motivation
- Pipelined execution
- A new instruction enters the pipeline every cycle
- But each instruction still takes several cycles to execute
- Control flow changes
- Two possible paths after a branch is fetched
- Introduces pipeline "bubbles"
- Branch delay slots
- Prediction offers a chance to avoid these bubbles
A branch is fetched
But takes N cycles to execute
Pipeline bubble
5 Techniques for handling branches
- Stalling
- Branch delay slots
- Relies on programmer/compiler to fill
- Depends on being able to find suitable instructions
- Ties the resolution delay to a particular pipeline
6 Why aren't these techniques acceptable?
- Branches are frequent - 15-25% of instructions
- Today's pipelines are deeper and wider
- Higher performance penalty for stalling
- Misprediction penalty = issue width x resolution delay cycles
- A lot of cycles can be wasted!!!
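The penalty formula above can be sketched directly; the machine parameters below are illustrative assumptions, not measurements from any particular processor:

```python
# Rough model: issue slots wasted per misprediction = issue width x resolution delay.
def misprediction_penalty(issue_width, resolution_delay_cycles):
    """Issue slots lost while wrong-path instructions are fetched and later squashed."""
    return issue_width * resolution_delay_cycles

# A hypothetical 4-wide machine that resolves branches after 10 cycles
# can waste up to 40 instruction slots per mispredicted branch.
print(misprediction_penalty(4, 10))  # -> 40
```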
7 Branch Prediction
- Predicting the outcome of a branch
- Direction
- Taken / Not Taken
- Direction predictors
- Target Address
- PC+offset (Taken) / PC+4 (Not Taken)
- Target address predictors
- Branch Target Buffer (BTB)
8 Why do we need branch prediction?
- Branch prediction
- Increases the number of instructions available for the scheduler to issue
- Increases instruction level parallelism (ILP)
- Allows useful work to be completed while waiting for the branch to resolve
9 Branch Prediction Strategies
- Static
- Decided before runtime
- Examples
- Always-Not Taken
- Always-Taken
- Backwards Taken, Forward Not Taken (BTFNT)
- Profile-driven prediction
- Dynamic
- Prediction decisions may change during the
execution of the program
10 What happens when a branch is predicted?
- On misprediction
- No speculative state may commit
- Squash instructions in the pipeline
- Must not allow stores in the pipeline to occur
- Cannot allow stores that would not have happened to commit
- Even for good branch predictors, more than half of the fetched instructions are squashed
11 Instruction traffic due to misprediction
Half of fetched instructions wasted. More waste in the front-end.
12 Energy Loss due to Mispredictions
21% average energy loss. More energy waste in integer benchmarks.
13 Simple Static Predictors
- Simple heuristics
- Always taken
- Always not taken
- Backwards taken / Forward not taken
- Relies on the compiler to arrange the code following this assertion
- Certain opcodes taken
- Programmer provided hints
- Profiling
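The simple static heuristics above can be written down in a few lines; the PCs and targets below are hypothetical, just to show the direction of the branches:

```python
# Static prediction heuristics. Each takes the branch PC and its target address.
def predict_always_taken(pc, target):
    return True

def predict_always_not_taken(pc, target):
    return False

def predict_btfnt(pc, target):
    # Backwards Taken, Forward Not Taken: backward branches (target <= pc)
    # are usually loop back-edges, so predict them taken.
    return target <= pc

print(predict_btfnt(0x400, 0x3F0))  # backward branch -> True (taken)
print(predict_btfnt(0x400, 0x440))  # forward branch  -> False (not taken)
```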
14 Simple Static Predictors
15 Dynamic Hardware Predictors
- Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go - will the branch be taken or not
- The hardware can look for clues based on the instructions, or it can use past history - we will discuss both of these directions
16 A Generic Branch Predictor
(Figure: Fetch uses f(PC, x) to produce the predicted stream of T/NT outcomes; Resolve compares it against the actual stream in execution order)
- What's f(PC, x)? x can be any relevant information; thus far x was empty
17 Bimodal Branch Predictors
- Dynamically store information about the branch behaviour
- Branches tend to behave in a fixed way
- Branches tend to behave in the same way across program execution
- Index a Pattern History Table using the branch address
- 1 bit: branch behaves as it did last time
- Saturating 2-bit counter: branch behaves as it usually does
18 Saturating-Counter Predictors
- Consider a strongly biased branch with an infrequent outcome
- TTTTTTTTNTTTTTTTTNTTTT
- Last-outcome will mispredict twice per infrequent outcome encounter
- Idea: remember the most frequent case
- Saturating counter = hysteresis
- Often called a bi-modal predictor
- Captures temporal bias
19 Bimodal Prediction
- Table of 2-bit saturating counters
- Predict the most common direction
- Advantages: simple, cheap, good accuracy
- Bimodal will mispredict once per infrequent outcome encounter
- TTTTTTTTNTTTTTTTTNTTTT
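A minimal sketch of the bimodal scheme above; the table size, initial counter state, and the biased T/N stream are assumptions for illustration:

```python
# Bimodal predictor: a table of 2-bit saturating counters indexed by the
# low bits of the branch PC.
class Bimodal:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)  # counters 0-3, start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2  # states 2, 3 predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# Strongly biased stream TTTTTTTTN repeated: after warm-up, only the N's miss.
bp, stream = Bimodal(), ("T" * 8 + "N") * 4
miss = 0
for c in stream:
    taken = (c == "T")
    if bp.predict(0x1234) != taken:
        miss += 1
    bp.update(0x1234, taken)
print(miss)  # -> 5 (1 warm-up miss + one miss per infrequent N)
```

After warm-up the counter's hysteresis gives one mispredict per infrequent outcome, versus two for a last-outcome (1-bit) scheme.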
20 Bimodal Branch Predictors
21 Correlating Predictors
- From the program perspective
- Different branches may be correlated
- if (aa == 2) aa = 0
- if (bb == 2) bb = 0
- if (aa != bb) then ...
- Can be viewed as a pattern detector
- Instead of keeping aggregate history information
- i.e., most frequent outcome
- Keep exact history information
- Pattern of n most recent outcomes
- Example
- BHR: n most recent branch outcomes
- Use PC and BHR (xor?) to access the prediction table
22 Pattern-based Prediction
- Nested loops
- for i = 0 to N
-   for j = 0 to 3
- Branch outcome stream for the j-for branch
- 11101110111011101110
- Patterns
- 111 -> 0
- 110 -> 1
- 101 -> 1
- 011 -> 1
- 100% accuracy
- Learning time: 4 instances
- Table index: (PC, 3-bit history)
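The pattern table for the j-loop branch above can be simulated in a few lines. This sketch uses a dict keyed by the 3-bit history and a last-outcome entry per pattern; table organization and the default prediction are assumptions:

```python
# Per-branch pattern prediction for the stream 1110 1110 ...
# A 3-bit history indexes a small table that learns 111 -> 0, else -> 1.
history, table = 0, {}          # table maps 3-bit pattern -> last outcome seen
correct, total = 0, 0
stream = [1, 1, 1, 0] * 10      # outcome stream of the inner j-loop branch
for outcome in stream:
    pred = table.get(history, 1)            # unknown pattern: predict taken
    correct += (pred == outcome)
    total += 1
    table[history] = outcome                # learn this pattern's outcome
    history = ((history << 1) | outcome) & 0b111  # shift in newest outcome
print(correct, total)  # -> 39 40
```

With the optimistic taken default, only the 111 pattern has to be learned here, so a single mispredict occurs during warm-up; afterwards the predictor is 100% accurate on this stream.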
23 Two-level Branch Predictors
- A branch outcome depends on the outcomes of previous branches
- First level: Branch History Registers (BHR)
- Global history / branch correlation: past executions of all branches
- Self history / private history: past executions of the same branch
- Second level: Pattern History Table (PHT)
- Use first level information to index a table
- Possibly XOR with the branch address
- PHT: usually saturating 2-bit counters
- Also private, shared or global
24 Gshare Predictor (McFarling)
(Figure: the global BHR and the PC are combined by f to index the Branch History Table, which produces the prediction)
- PC and BHR can be
- concatenated
- completely overlapped
- partially overlapped
- xored, etc.
- How deep should the BHR be?
- Really depends on the program
- Deeper increases learning time
- But may increase the quality of information
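A minimal gshare sketch, assuming the xored combination of PC and global history; the history depth and table size are illustrative choices:

```python
# Gshare: global history XORed with the PC indexes a table of 2-bit counters.
class Gshare:
    def __init__(self, hist_bits=8, table_bits=12):
        self.hist, self.hist_mask = 0, (1 << hist_bits) - 1
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)  # start weakly not-taken

    def _index(self, pc):
        return (pc ^ self.hist) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.hist = ((self.hist << 1) | int(taken)) & self.hist_mask

# A strictly alternating branch T,N,T,N,... defeats a bimodal counter but is
# captured by the history pattern once the BHR warms up.
bp, misses = Gshare(), 0
for i in range(100):
    taken = (i % 2 == 0)
    if bp.predict(0x4000) != taken:
        misses += 1
    bp.update(0x4000, taken)
print(misses)  # -> 5 (all during warm-up of the history register)
```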
25 Two-level Branch Predictors (II)
26 Hybrid Prediction
- Combining branch predictors
- Use two different branch predictors
- Access both in parallel
- A third table determines which prediction to use
- Two or more predictor components combined
- Different branches benefit from different types of history
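The chooser-table idea above can be sketched as follows. The component predictors, chooser size, and the last-outcome predictor used for the demo are all assumptions; any two predictors with predict/update methods would fit:

```python
# Hybrid predictor: two components plus a PC-indexed table of 2-bit
# "chooser" counters that tracks which component does better per branch.
class LastOutcome:
    def __init__(self): self.last = {}
    def predict(self, pc): return self.last.get(pc, False)
    def update(self, pc, taken): self.last[pc] = taken

class Hybrid:
    def __init__(self, pred_a, pred_b, bits=10):
        self.a, self.b = pred_a, pred_b
        self.mask = (1 << bits) - 1
        self.choice = [1] * (1 << bits)   # < 2 favors a, >= 2 favors b

    def predict(self, pc):
        use_b = self.choice[pc & self.mask] >= 2
        return self.b.predict(pc) if use_b else self.a.predict(pc)

    def update(self, pc, taken):
        i = pc & self.mask
        a_ok = self.a.predict(pc) == taken
        b_ok = self.b.predict(pc) == taken
        if b_ok and not a_ok:             # only move the chooser when the
            self.choice[i] = min(3, self.choice[i] + 1)   # components disagree
        elif a_ok and not b_ok:
            self.choice[i] = max(0, self.choice[i] - 1)
        self.a.update(pc, taken)          # both components train in parallel
        self.b.update(pc, taken)
```

In a real design the components would be, e.g., a bimodal table and a gshare table accessed in parallel, as on the slide.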
27 Hybrid Branch Predictors (II)
28 Issues Affecting Accurate Branch Prediction
- Aliasing
- More than one branch may use the same BHT/PHT entry
- Constructive: a prediction that would have been incorrect is predicted correctly
- Destructive: a prediction that would have been correct is predicted incorrectly
- Neutral: no change in accuracy
29 More Issues
- Training time
- Need to see enough branches to uncover pattern
- Need enough time to reach steady state
- Wrong history
- Incorrect type of history for the branch
- Stale state
- Predictor is updated after information is needed
- Operating system context switches
- More aliasing caused by branches in different
programs
30 Performance Metrics
- Misprediction rate
- Mispredicted branches per executed branch
- Unfortunately the most commonly reported
- Instructions per mispredicted branch
- Gives a better idea of program behaviour
- Branches are not evenly spaced
31 Impact of Realistic Branch Prediction
- Limiting the type of branch prediction
(Figure: IPC of 15-45 for FP and 6-12 for integer benchmarks)
32 BPP: A Power-Aware Branch Predictor
- Combined Predictors
- Branch Instruction Behavior
- BPP (Branch Predictor Prediction)
- Results
33 Combined Predictors
- Different behaviors, different sub-predictors
- Selector picks the sub-predictor
- Improved performance over processors using only one sub-predictor
- Consequence: extra power (50%)
34 Branch Predictors & Power
- Direct effect: up to 10%
- Indirect effect: wrong-path instructions
- Smaller/less complex predictors, more wasted energy
- Power-aware predictors MUST be highly accurate
35 Branch Instruction Behavior
- Branches use the same sub-predictor
36 Branch Predictor Prediction
- BPP buffer stores hints on the next two branches. How?
- 11: mispredicted branch
- 00: branch used Bimod last time
- 01: branch used Gshare last time
37 BPP example
(Figure: a code sequence A, B, C, D, E, F; on its first appearance the BPP buffer is filled with hints for branches A, B, C, D)
38 BPP example
39 Results
- Power (total branch predictors) and performance
- Compared to three base cases
- A) Non-gated combined (CMB)
- B) Bimodal (BMD)
- C) Gshare (GSH)
- Reported for 32K-entry banked predictors
40 Performance
Within 0.4% of CMB, better than BMD (7%) and GSH (3%)
41 Branch Predictors Energy
13% less than CMB, more than BMD (35%) and GSH (22%)
42 Total Energy
0.3%, 4.5% and 1.8% less than CMB, BMD and GSH
43 ILP, benefits and costs?
- How can we extract more ILP?
- What are the costs?
44 Upper Limit to ILP: Ideal Machine
Amount of parallelism when there are no branch mispredictions and we are limited only by data dependencies.
(Figure: IPC - instructions that could theoretically be issued per cycle - of 75-150 for FP and 18-60 for integer benchmarks)
45 Complexity-Effective Designs
- History: brainiacs and speed demons
- Brainiacs: maximizing the number of instructions issued per clock cycle
- Speed demons: simpler implementation with a very fast clock
- Complexity-Effective
- A complexity-effective architecture takes both the benefits of complex issue schemes and the benefits of a simpler implementation with a fast clock cycle
- Complexity measurement: delay of the critical path
- Proposed architecture
- High performance (high IPC) with a very high clock frequency
46 Extracting More Parallelism
(Figure: today's machines issue 4-8 wide; future machines may issue 128-256 wide)
Higher IPC - at what clock and power cost?
Want: high IPC + fast clock + low power
47 Generic pipeline description
- Baseline superscalar model
- Criteria for sources of complexity (delay)
- Structures whose delay is a function of issue window size and issue width
- Structures which tend to rely on broadcast operations over long wires
48 Sources of complexity
- Register renaming logic
- Translates logical register designators to physical register designators
- Wakeup logic
- Responsible for waking up instructions waiting for their source operands to become available
- Selection logic
- Responsible for selecting instructions for execution from the pool of ready instructions
- Bypass logic
- Bypasses operand values from instructions that have completed execution
- Other structures not considered here
- Access time of the register file varies with the number of registers and the number of ports
- Access time of a cache is a function of the size and associativity of the cache
49 Register rename logic complexity
50 Delay analysis for rename logic
- Delay analysis for the RAM scheme
- The RAM scheme operates like a standard RAM
- Issue width affects delay through its impact on wire lengths
- Increasing issue width increases the number of bit/word lines
- Delay of the rename logic is a linear function of the issue width
- SPICE results
- Total delay and each component delay increase linearly with IW
- Bit line and word line delay worsen as the feature size is reduced
- (Logic delay is reduced linearly as the feature size is reduced, but wire delay falls at a slower rate)
51 Wakeup logic
- Responsible for updating source dependences for instructions in the issue window waiting for their source operands to become available
- Basic structure
- 2 OR gates and 2 x IW comparators per issue window entry
- Delay analysis
- Almost a linear function (above 0.35um)
- A quadratic function under 0.35um
52 Delay analysis for wakeup logic
- SPICE results (figure 5, at 0.18um)
- Issue width has a greater impact on the delay than window size
- Window size mainly affects Tdrive; issue width affects Tdrive, Ttagmatch and TmatchOR
- (figure 6, 8-way, 64-entry window)
- The tag drive and tag match delays are less scalable than the match OR delay
- Tdrive + Ttagmatch: about 52% of total delay at 0.8um, about 62% at 0.18um
53 Selection Logic
- Responsible for choosing instructions for execution from the pool of ready instructions in the issue window
- Basic structure
- REQ (input) and GRANT (output) signals
- Operation: 2 phases
- REQ signals propagate up to the root
- GRANT propagates from the highest-priority arbiter cell down to the leaf arbiters
- Selection policy (oldest first)
- <implementation> left-most entries have the highest priority
- The issue window is compacted to the left every time instructions are issued, and new instructions are inserted at the right end
54 Delay analysis for selection logic
- The optimal number of arbiter inputs is four here
- SPICE results
- Assuming a single functional unit
- The various components of the total delay scale well as the feature size is reduced
- All the delays are logic delays (wires are not considered)
- It is possible to minimize the effect of wire delays if the ready signals are stored in a smaller, more compact array
55 Data bypass logic
- Responsible for forwarding result values from completing instructions to dependent instructions, bypassing the register file
- Basic structure
- In a fully bypassed design, bypass paths = 2 x IW^2 x S
- where S = number of pipeline stages after the first output-producing stage
- Current trend: deeper pipelining and wider issue
- make the bypass logic critically important
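The path-count formula above can be evaluated directly; the issue widths and stage count below are illustrative:

```python
# Bypass paths in a fully bypassed design: 2 * IW^2 * S, where S is the number
# of pipeline stages after the first result-producing stage (per the slide).
def bypass_paths(issue_width, stages_after_first_result):
    return 2 * issue_width ** 2 * stages_after_first_result

# Quadratic growth: widening issue from 4 to 8 quadruples the path count.
print(bypass_paths(4, 2))   # -> 64
print(bypass_paths(8, 2))   # -> 256
```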
56 Delay analysis for data bypass logic
- The length of the wires is a function of the number of result wires
- Increasing IW increases the length of the result wires
- SPICE results
- Based on the basic structure (layout)
- The delays are the same for the three technologies (feature sizes)
57 Summary of Delays and Pipeline Issues
- Pipeline delay results
- For the 4-way machine, the window logic (WL) sets the critical path delay
- For the 8-way machine, the bypass logic (BL) sets the critical path delay
- Future machines (higher ILP)
- WL and BL will pose the largest problems
- Both are difficult to divide into more pipeline segments (atomic operations)
- In WL it is the wake-up/select loop; in BL it is the bypass itself
- In order for dependent operations to execute in consecutive cycles, the bypass value must be made available to the dependent instruction within a cycle
- Solution: stall (trade-off between cycle time and the bottleneck from bypassing at wider issue widths)
58 A Complexity-Effective Micro-Architecture
- Dependence-based microarchitecture
- Replaces the issue window with a simpler structure that facilitates a faster clock while exploiting similar levels of parallelism
- Naturally lends itself to clustering and helps the bypass problem to a large extent
- Simple description
- Dependent instructions cannot execute in parallel, only consecutively
- The issue window is replaced by a small number of FIFO buffers
- The FIFO buffers are constrained to issue in-order, and dependent instructions are steered to the same FIFO
- Register availability only needs to be fanned out to the heads of the FIFO buffers
- (In a typical issue window, result tags have to be broadcast to all entries)
- The instructions at the FIFO heads monitor reservation bits (one per physical register) to check for operand availability
- SRC_FIFO: table for steering instructions to the appropriate buffers
- Indexed using logical register designators
- SRC_FIFO(Ra) = the identity of the FIFO buffer
59 Instruction Steering Heuristics
- Applied heuristics
- Case 1: all operands of I are available -> steer I to a new (free) FIFO
- Case 2: a single outstanding operand of I, produced by Isource in FIFO fa
- If there are no instructions behind Isource in FIFO fa -> steer I to FIFO fa
- Else -> steer I to a new FIFO
- Case 3: two outstanding operands of I -> apply case 2 to one of the two operands
60 Performance results
- Proposed arch.: 8 FIFOs, 8 entries per FIFO
- Baseline arch.: 64-entry issue window
- The dependence-based microarchitecture is nearly as effective (extracts similar parallelism) as the typical window-based microarchitecture
- Max. 8% difference
61 Complexity analysis
- Reservation table
- If instruction Ia at the head of FIFO Fa is dependent on an instruction Ib waiting in a FIFO, Ia cannot issue until Ib completes
- The delay of the wakeup logic is determined by the delay of accessing the reservation table
- The selection logic is simple because only the instructions at the FIFO heads need to be considered for selection
- Effect
- The suggested architecture can improve the clock period (faster clock) by as much as 39% in 0.18um technology
62 Clustering
- Clustering the dependence-based microarchitecture
- Advantages
- Wakeup and selection logic are simplified
- Because dependent instructions are assigned to the same FIFOs, local bypasses are used more frequently than inter-cluster bypasses (overall delay is reduced)
- Multiple copies of the register file reduce the number of ports (faster RF access)
63 Performance of Clustering
- Performance comparison
- Between a 2 x 4-way dependence-based and a conventional 8-way, 64-entry window-based architecture
- Assuming 1-cycle local bypass delay and 2-cycle inter-cluster bypass delay
- Overall performance, considering clock speed
- Average 16% improvement
- Max. 12% difference
64 Conclusion
- Some important results
- The logic associated with the issue window and the data bypass logic are going to become increasingly critical as future designs employ wider issue widths, bigger windows, and smaller feature sizes
- Wire delays will increasingly dominate total delay in future technologies
- (Window logic and bypass logic are atomic operations)
- Complexity-effective architecture
- An architecture that facilitates a fast clock while exploiting similar levels of ILP
- Dependence-based architecture as a complexity-effective architecture
- Simplifies window logic
- Naturally lends itself to clustering by grouping dependent instructions
65 The Motivation for Caches
- Motivation
- Large memories (DRAM) are slow
- Small memories (SRAM) are fast
- Make the average access time small by
- Servicing most accesses from a small, fast memory
- Reducing the bandwidth required of the large memory
66 Levels of the Memory Hierarchy
From upper level (smaller, faster) to lower level (larger, slower); capacity, access time, cost, and staging/transfer unit:
- CPU Registers: 100s of bytes, <10s ns; transfer unit: instruction operands, 1-8 bytes (managed by prog./compiler)
- Cache: K bytes, 10-100 ns, $.01-.001 per bit; transfer unit: blocks, 8-128 bytes (managed by cache controller)
- Main Memory: M bytes, 100 ns-1 us, $.01-.001; transfer unit: pages, 512-4K bytes (managed by OS)
- Disk: G bytes, ms, 10^-4 - 10^-3 cents; transfer unit: files, Mbytes (managed by user/operator)
- Tape: infinite, sec-min, 10^-6 cents
67 The Principle of Locality
- Programs access a relatively small portion of the address space at any instant of time
- Example: 90% of time in 10% of the code
- Two different types of locality
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon
68 Memory Hierarchy: Principles of Operation
- At any given time, data is copied between only 2 adjacent levels
- Upper level (cache): the one closer to the processor
- Smaller, faster, and uses more expensive technology
- Lower level (memory): the one further away from the processor
- Bigger, slower, and uses less expensive technology
- Block
- The minimum unit of information that can either be present or not present in the two-level hierarchy
(Figure: blocks Blk X and Blk Y move between the upper level (cache), which feeds the processor, and the lower level (memory))