Title: ELEC 669 Low Power Design Techniques Lecture 2
1 ELEC 669 Low Power Design Techniques - Lecture 2
- Amirali Baniasadi
- amirali_at_ece.uvic.ca
2 How to write a review?
- Think Critically.
- What if?
- Next Step?
- Any other applications?
3 Branches
- Instructions which can alter the flow of
instruction execution in a program
4 Motivation
- Pipelined execution
- A new instruction enters the pipeline every cycle
- But each instruction still takes several cycles to execute
- Control flow changes
- Two possible paths after a branch is fetched
- Introduces pipeline "bubbles"
- Branch delay slots
- Prediction offers a chance to avoid these bubbles
A branch is fetched
But takes N cycles to execute
Pipeline bubble
5 Techniques for handling branches
- Stalling
- Branch delay slots
- Relies on programmer/compiler to fill
- Depends on being able to find suitable instructions
- Ties the resolution delay to a particular pipeline
6 Why aren't these techniques acceptable?
- Branches are frequent - 15-25% of instructions
- Today's pipelines are deeper and wider
- Higher performance penalty for stalling
- Misprediction penalty = issue width x resolution delay cycles
- A lot of cycles can be wasted!!!
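The penalty formula above can be sketched directly; the machine parameters below are illustrative assumptions, not measurements from any particular processor:

```python
# Rough model: issue slots wasted per misprediction = issue width x resolution delay.
def misprediction_penalty(issue_width, resolution_delay_cycles):
    """Issue slots lost while wrong-path instructions are fetched and later squashed."""
    return issue_width * resolution_delay_cycles

# A hypothetical 4-wide machine that resolves branches after 10 cycles
# can waste up to 40 instruction slots per mispredicted branch.
print(misprediction_penalty(4, 10))  # -> 40
```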
7 Branch Prediction
- Predicting the outcome of a branch
- Direction
- Taken / Not Taken
- Direction predictors
- Target Address
- PC+offset (Taken) / PC+4 (Not Taken)
- Target address predictors
- Branch Target Buffer (BTB)
8 Why do we need branch prediction?
- Branch prediction
- Increases the number of instructions available for the scheduler to issue
- Increases instruction level parallelism (ILP)
- Allows useful work to be completed while waiting for the branch to resolve
9 Branch Prediction Strategies
- Static
- Decided before runtime
- Examples
- Always-Not Taken
- Always-Taken
- Backwards Taken, Forward Not Taken (BTFNT)
- Profile-driven prediction
- Dynamic
- Prediction decisions may change during the
execution of the program
10 What happens when a branch is predicted?
- On misprediction
- No speculative state may commit
- Squash instructions in the pipeline
- Must not allow stores in the pipeline to occur
- Cannot allow stores that would not have happened to commit
- Even for good branch predictors, more than half of the fetched instructions are squashed
11 Instruction traffic due to misprediction
Half of fetched instructions wasted. More waste in the front-end.
12 Energy Loss due to Mispredictions
21% average energy loss. More energy waste in integer benchmarks.
13 Simple Static Predictors
- Simple heuristics
- Always taken
- Always not taken
- Backwards taken / Forward not taken
- Relies on the compiler to arrange the code following this assertion
- Certain opcodes taken
- Programmer provided hints
- Profiling
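The simple static heuristics above can be written down in a few lines; the PCs and targets below are hypothetical, just to show the direction of the branches:

```python
# Static prediction heuristics. Each takes the branch PC and its target address.
def predict_always_taken(pc, target):
    return True

def predict_always_not_taken(pc, target):
    return False

def predict_btfnt(pc, target):
    # Backwards Taken, Forward Not Taken: backward branches (target <= pc)
    # are usually loop back-edges, so predict them taken.
    return target <= pc

print(predict_btfnt(0x400, 0x3F0))  # backward branch -> True (taken)
print(predict_btfnt(0x400, 0x440))  # forward branch  -> False (not taken)
```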
14 Simple Static Predictors
15 Dynamic Hardware Predictors
- Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go - will the branch be taken or not
- The hardware can look for clues based on the instructions, or it can use past history - we will discuss both of these directions
16 A Generic Branch Predictor
(Figure: Fetch uses f(PC, x) to produce the predicted stream of T/NT outcomes; Resolve compares it against the actual stream in execution order)
- What's f(PC, x)? x can be any relevant information; thus far x was empty
17 Bimodal Branch Predictors
- Dynamically store information about the branch behaviour
- Branches tend to behave in a fixed way
- Branches tend to behave in the same way across program execution
- Index a Pattern History Table using the branch address
- 1 bit: branch behaves as it did last time
- Saturating 2-bit counter: branch behaves as it usually does
18 Saturating-Counter Predictors
- Consider a strongly biased branch with an infrequent outcome
- TTTTTTTTNTTTTTTTTNTTTT
- Last-outcome will mispredict twice per infrequent outcome encounter
- Idea: remember the most frequent case
- Saturating counter = hysteresis
- Often called a bi-modal predictor
- Captures temporal bias
19 Bimodal Prediction
- Table of 2-bit saturating counters
- Predict the most common direction
- Advantages: simple, cheap, good accuracy
- Bimodal will mispredict once per infrequent outcome encounter
- TTTTTTTTNTTTTTTTTNTTTT
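A minimal sketch of the bimodal scheme above; the table size, initial counter state, and the biased T/N stream are assumptions for illustration:

```python
# Bimodal predictor: a table of 2-bit saturating counters indexed by the
# low bits of the branch PC.
class Bimodal:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)  # counters 0-3, start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2  # states 2, 3 predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# Strongly biased stream TTTTTTTTN repeated: after warm-up, only the N's miss.
bp, stream = Bimodal(), ("T" * 8 + "N") * 4
miss = 0
for c in stream:
    taken = (c == "T")
    if bp.predict(0x1234) != taken:
        miss += 1
    bp.update(0x1234, taken)
print(miss)  # -> 5 (1 warm-up miss + one miss per infrequent N)
```

After warm-up the counter's hysteresis gives one mispredict per infrequent outcome, versus two for a last-outcome (1-bit) scheme.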
20 Bimodal Branch Predictors
21 Correlating Predictors
- From the program perspective
- Different branches may be correlated
- if (aa == 2) aa = 0
- if (bb == 2) bb = 0
- if (aa != bb) then ...
- Can be viewed as a pattern detector
- Instead of keeping aggregate history information
- i.e., most frequent outcome
- Keep exact history information
- Pattern of n most recent outcomes
- Example
- BHR: n most recent branch outcomes
- Use PC and BHR (xor?) to access the prediction table
22 Pattern-based Prediction
- Nested loops
- for i = 0 to N
-   for j = 0 to 3
- Branch outcome stream for the j-for branch
- 11101110111011101110
- Patterns
- 111 -> 0
- 110 -> 1
- 101 -> 1
- 011 -> 1
- 100% accuracy
- Learning time: 4 instances
- Table index: (PC, 3-bit history)
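The pattern table for the j-loop branch above can be simulated in a few lines. This sketch uses a dict keyed by the 3-bit history and a last-outcome entry per pattern; table organization and the default prediction are assumptions:

```python
# Per-branch pattern prediction for the stream 1110 1110 ...
# A 3-bit history indexes a small table that learns 111 -> 0, else -> 1.
history, table = 0, {}          # table maps 3-bit pattern -> last outcome seen
correct, total = 0, 0
stream = [1, 1, 1, 0] * 10      # outcome stream of the inner j-loop branch
for outcome in stream:
    pred = table.get(history, 1)            # unknown pattern: predict taken
    correct += (pred == outcome)
    total += 1
    table[history] = outcome                # learn this pattern's outcome
    history = ((history << 1) | outcome) & 0b111  # shift in newest outcome
print(correct, total)  # -> 39 40
```

With the optimistic taken default, only the 111 pattern has to be learned here, so a single mispredict occurs during warm-up; afterwards the predictor is 100% accurate on this stream.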
23 Two-level Branch Predictors
- A branch outcome depends on the outcomes of previous branches
- First level: Branch History Registers (BHR)
- Global history / branch correlation: past executions of all branches
- Self history / private history: past executions of the same branch
- Second level: Pattern History Table (PHT)
- Use first level information to index a table
- Possibly XOR with the branch address
- PHT: usually saturating 2-bit counters
- Also private, shared or global
24 Gshare Predictor (McFarling)
(Figure: the global BHR and the PC are combined by f to index the Branch History Table, which produces the prediction)
- PC and BHR can be
- concatenated
- completely overlapped
- partially overlapped
- xored, etc.
- How deep should the BHR be?
- Really depends on the program
- Deeper increases learning time
- But may increase the quality of information
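A minimal gshare sketch, assuming the xored combination of PC and global history; the history depth and table size are illustrative choices:

```python
# Gshare: global history XORed with the PC indexes a table of 2-bit counters.
class Gshare:
    def __init__(self, hist_bits=8, table_bits=12):
        self.hist, self.hist_mask = 0, (1 << hist_bits) - 1
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)  # start weakly not-taken

    def _index(self, pc):
        return (pc ^ self.hist) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.hist = ((self.hist << 1) | int(taken)) & self.hist_mask

# A strictly alternating branch T,N,T,N,... defeats a bimodal counter but is
# captured by the history pattern once the BHR warms up.
bp, misses = Gshare(), 0
for i in range(100):
    taken = (i % 2 == 0)
    if bp.predict(0x4000) != taken:
        misses += 1
    bp.update(0x4000, taken)
print(misses)  # -> 5 (all during warm-up of the history register)
```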
25 Two-level Branch Predictors (II)
26 Hybrid Prediction
- Combining branch predictors
- Use two different branch predictors
- Access both in parallel
- A third table determines which prediction to use
- Two or more predictor components combined
- Different branches benefit from different types of history
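The chooser-table idea above can be sketched as follows. The component predictors, chooser size, and the last-outcome predictor used for the demo are all assumptions; any two predictors with predict/update methods would fit:

```python
# Hybrid predictor: two components plus a PC-indexed table of 2-bit
# "chooser" counters that tracks which component does better per branch.
class LastOutcome:
    def __init__(self): self.last = {}
    def predict(self, pc): return self.last.get(pc, False)
    def update(self, pc, taken): self.last[pc] = taken

class Hybrid:
    def __init__(self, pred_a, pred_b, bits=10):
        self.a, self.b = pred_a, pred_b
        self.mask = (1 << bits) - 1
        self.choice = [1] * (1 << bits)   # < 2 favors a, >= 2 favors b

    def predict(self, pc):
        use_b = self.choice[pc & self.mask] >= 2
        return self.b.predict(pc) if use_b else self.a.predict(pc)

    def update(self, pc, taken):
        i = pc & self.mask
        a_ok = self.a.predict(pc) == taken
        b_ok = self.b.predict(pc) == taken
        if b_ok and not a_ok:             # only move the chooser when the
            self.choice[i] = min(3, self.choice[i] + 1)   # components disagree
        elif a_ok and not b_ok:
            self.choice[i] = max(0, self.choice[i] - 1)
        self.a.update(pc, taken)          # both components train in parallel
        self.b.update(pc, taken)
```

In a real design the components would be, e.g., a bimodal table and a gshare table accessed in parallel, as on the slide.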
27 Hybrid Branch Predictors (II)
28 Issues Affecting Accurate Branch Prediction
- Aliasing
- More than one branch may use the same BHT/PHT entry
- Constructive: a prediction that would have been incorrect is predicted correctly
- Destructive: a prediction that would have been correct is predicted incorrectly
- Neutral: no change in accuracy
29 More Issues
- Training time
- Need to see enough branches to uncover pattern
- Need enough time to reach steady state
- Wrong history
- Incorrect type of history for the branch
- Stale state
- Predictor is updated after information is needed
- Operating system context switches
- More aliasing caused by branches in different
programs
30 Performance Metrics
- Misprediction rate
- Mispredicted branches per executed branch
- Unfortunately the most commonly reported
- Instructions per mispredicted branch
- Gives a better idea of program behaviour
- Branches are not evenly spaced
31 Impact of Realistic Branch Prediction
- Limiting the type of branch prediction
(Figure: IPC of 15-45 for FP and 6-12 for integer benchmarks)
32 BPP: A Power-Aware Branch Predictor
- Combined Predictors
- Branch Instruction Behavior
- BPP (Branch Predictor Prediction)
- Results
33 Combined Predictors
- Different behaviors, different sub-predictors
- Selector picks the sub-predictor
- Improved performance over processors using only one sub-predictor
- Consequence: extra power (50%)
34 Branch Predictors & Power
- Direct effect: up to 10%
- Indirect effect: wrong-path instructions
- Smaller/less complex predictors, more wasted energy
- Power-aware predictors MUST be highly accurate
35 Branch Instruction Behavior
- Branches use the same sub-predictor
36 Branch Predictor Prediction
- BPP buffer stores hints on the next two branches. How?
- 11: mispredicted branch
- 00: branch used Bimod last time
- 01: branch used Gshare last time
37 BPP example
(Figure: a code sequence A, B, C, D, E, F; on its first appearance the BPP buffer is filled with hints for branches A, B, C, D)
38 BPP example
39 Results
- Power (total branch predictors) and performance
- Compared to three base cases
- A) Non-gated combined (CMB)
- B) Bimodal (BMD)
- C) Gshare (GSH)
- Reported for 32K-entry banked predictors
40 Performance
Within 0.4% of CMB, better than BMD (7%) and GSH (3%)
41 Branch Predictors Energy
13% less than CMB, more than BMD (35%) and GSH (22%)
42 Total Energy
0.3%, 4.5% and 1.8% less than CMB, BMD and GSH
43 ILP, benefits and costs?
- How can we extract more ILP?
- What are the costs?
44 Upper Limit to ILP: Ideal Machine
Amount of parallelism when there are no branch mispredictions and we are limited only by data dependencies.
(Figure: IPC - instructions that could theoretically be issued per cycle - of 75-150 for FP and 18-60 for integer benchmarks)
45 Complexity-Effective Designs
- History: brainiacs and speed demons
- Brainiacs: maximizing the number of instructions issued per clock cycle
- Speed demons: simpler implementation with a very fast clock
- Complexity-Effective
- A complexity-effective architecture takes both the benefits of complex issue schemes and the benefits of a simpler implementation with a fast clock cycle
- Complexity measurement: delay of the critical path
- Proposed architecture
- High performance (high IPC) with a very high clock frequency
46 Extracting More Parallelism
(Figure: today's machines issue 4-8 wide; future machines may issue 128-256 wide)
Higher IPC - at what clock and power cost?
Want: high IPC + fast clock + low power
47 Generic pipeline description
- Baseline superscalar model
- Criteria for sources of complexity (delay)
- Structures whose delay is a function of issue window size and issue width
- Structures which tend to rely on broadcast operations over long wires
48 Sources of complexity
- Register renaming logic
- Translates logical register designators to physical register designators
- Wakeup logic
- Responsible for waking up instructions waiting for their source operands to become available
- Selection logic
- Responsible for selecting instructions for execution from the pool of ready instructions
- Bypass logic
- Bypasses operand values from instructions that have completed execution
- Other structures not considered here
- Access time of the register file varies with the number of registers and the number of ports
- Access time of a cache is a function of the size and associativity of the cache
49 Register rename logic complexity
50 Delay analysis for rename logic
- Delay analysis for the RAM scheme
- The RAM scheme operates like a standard RAM
- Issue width affects delay through its impact on wire lengths
- Increasing issue width increases the number of bit/word lines
- Delay of the rename logic is a linear function of the issue width
- SPICE results
- Total delay and each component delay increase linearly with IW
- Bit line and word line delay worsen as the feature size is reduced
- (Logic delay is reduced linearly as the feature size is reduced, but wire delay falls at a slower rate)
51 Wakeup logic
- Responsible for updating source dependences for instructions in the issue window waiting for their source operands to become available
- Basic structure
- 2 OR gates and 2 x IW comparators per issue window entry
- Delay analysis
- Almost a linear function (above 0.35um)
- A quadratic function under 0.35um
52 Delay analysis for wakeup logic
- SPICE results (figure 5, at 0.18um)
- Issue width has a greater impact on the delay than window size
- Window size mainly affects Tdrive; issue width affects Tdrive, Ttagmatch and TmatchOR
- (figure 6, 8-way, 64-entry window)
- The tag drive and tag match delays are less scalable than the match OR delay
- Tdrive + Ttagmatch: about 52% of total delay at 0.8um, about 62% at 0.18um
53 Selection Logic
- Responsible for choosing instructions for execution from the pool of ready instructions in the issue window
- Basic structure
- REQ (input) and GRANT (output) signals
- Operation: 2 phases
- REQ signals propagate up to the root
- GRANT propagates from the highest-priority arbiter cell down to the leaf arbiters
- Selection policy (oldest first)
- <implementation> left-most entries have the highest priority
- The issue window is compacted to the left every time instructions are issued, and new instructions are inserted at the right end
54 Delay analysis for selection logic
- The optimal number of arbiter inputs is four here
- SPICE results
- Assuming a single functional unit
- The various components of the total delay scale well as the feature size is reduced
- All the delays are logic delays (wires are not considered)
- It is possible to minimize the effect of wire delays if the ready signals are stored in a smaller, more compact array
55 Data bypass logic
- Responsible for forwarding result values from completing instructions to dependent instructions, bypassing the register file
- Basic structure
- In a fully bypassed design, bypass paths = 2 x IW^2 x S
- where S = number of pipeline stages after the first output-producing stage
- Current trend: deeper pipelining and wider issue
- make the bypass logic critically important
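The path-count formula above can be evaluated directly; the issue widths and stage count below are illustrative:

```python
# Bypass paths in a fully bypassed design: 2 * IW^2 * S, where S is the number
# of pipeline stages after the first result-producing stage (per the slide).
def bypass_paths(issue_width, stages_after_first_result):
    return 2 * issue_width ** 2 * stages_after_first_result

# Quadratic growth: widening issue from 4 to 8 quadruples the path count.
print(bypass_paths(4, 2))   # -> 64
print(bypass_paths(8, 2))   # -> 256
```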
56 Delay analysis for data bypass logic
- The length of the wires is a function of the number of result wires
- Increasing IW increases the length of the result wires
- SPICE results
- Based on the basic structure (layout)
- The delays are the same for the three technologies (feature sizes)
57 Summary of Delays and Pipeline Issues
- Pipeline delay results
- For the 4-way machine, the window logic (WL) sets the critical path delay
- For the 8-way machine, the bypass logic (BL) sets the critical path delay
- Future machines (higher ILP)
- WL and BL will pose the largest problems
- Both are difficult to divide into more pipeline segments (atomic operations)
- In WL it is the wake-up/select loop; in BL it is the bypass itself
- In order for dependent operations to execute in consecutive cycles, the bypass value must be made available to the dependent instruction within a cycle
- Solution: stall (trade-off between cycle time and the bottleneck from bypassing at wider issue widths)
58 A Complexity-Effective Micro-Architecture
- Dependence-based microarchitecture
- Replaces the issue window with a simpler structure that facilitates a faster clock while exploiting similar levels of parallelism
- Naturally lends itself to clustering and helps the bypass problem to a large extent
- Simple description
- Dependent instructions cannot execute in parallel, only consecutively
- The issue window is replaced by a small number of FIFO buffers
- The FIFO buffers are constrained to issue in-order, and dependent instructions are steered to the same FIFO
- Register availability only needs to be fanned out to the heads of the FIFO buffers
- (In a typical issue window, result tags have to be broadcast to all entries)
- The instructions at the FIFO heads monitor reservation bits (one per physical register) to check for operand availability
- SRC_FIFO: table for steering instructions to the appropriate buffers
- Indexed using logical register designators
- SRC_FIFO(Ra) = the identity of the FIFO buffer
59 Instruction Steering Heuristics
- Applied heuristics
- Case 1: all operands of I are available -> steer I to a new (free) FIFO
- Case 2: a single outstanding operand of I, produced by Isource in FIFO fa
- If there are no instructions behind Isource in FIFO fa -> steer I to FIFO fa
- Else -> steer I to a new FIFO
- Case 3: two outstanding operands of I -> apply case 2 to one of the two operands
60 Performance results
- Proposed arch.: 8 FIFOs, 8 entries per FIFO
- Baseline arch.: 64-entry issue window
- The dependence-based microarchitecture is nearly as effective (extracts similar parallelism) as the typical window-based microarchitecture
- Max. 8% difference
61 Complexity analysis
- Reservation table
- If instruction Ia at the head of FIFO Fa is dependent on an instruction Ib waiting in a FIFO, Ia cannot issue until Ib completes
- The delay of the wakeup logic is determined by the delay of accessing the reservation table
- The selection logic is simple because only the instructions at the FIFO heads need to be considered for selection
- Effect
- The suggested architecture can improve the clock period (faster clock) by as much as 39% in 0.18um technology
62 Clustering
- Clustering the dependence-based microarchitecture
- Advantages
- Wakeup and selection logic are simplified
- Because dependent instructions are assigned to the same FIFOs, local bypasses are used more frequently than inter-cluster bypasses (overall delay is reduced)
- Multiple copies of the register file reduce the number of ports (faster RF access)
63 Performance of Clustering
- Performance comparison
- Between a 2 x 4-way dependence-based and a conventional 8-way, 64-entry window-based architecture
- Assuming 1-cycle local bypass delay and 2-cycle inter-cluster bypass delay
- Overall performance, considering clock speed
- Average 16% improvement
- Max. 12% difference
64 Conclusion
- Some important results
- The logic associated with the issue window and the data bypass logic are going to become increasingly critical as future designs employ wider issue widths, bigger windows, and smaller feature sizes
- Wire delays will increasingly dominate total delay in future technologies
- (Window logic and bypass logic are atomic operations)
- Complexity-effective architecture
- An architecture that facilitates a fast clock while exploiting similar levels of ILP
- Dependence-based architecture as a complexity-effective architecture
- Simplifies window logic
- Naturally lends itself to clustering by grouping dependent instructions
65 The Motivation for Caches
- Motivation
- Large memories (DRAM) are slow
- Small memories (SRAM) are fast
- Make the average access time small by
- Servicing most accesses from a small, fast memory
- Reducing the bandwidth required of the large memory
66 Levels of the Memory Hierarchy
From upper level (smaller, faster) to lower level (larger, slower); capacity, access time, cost, and staging/transfer unit:
- CPU Registers: 100s of bytes, <10s ns; transfer unit: instruction operands, 1-8 bytes (managed by prog./compiler)
- Cache: K bytes, 10-100 ns, $.01-.001 per bit; transfer unit: blocks, 8-128 bytes (managed by cache controller)
- Main Memory: M bytes, 100 ns-1 us, $.01-.001; transfer unit: pages, 512-4K bytes (managed by OS)
- Disk: G bytes, ms, 10^-4 - 10^-3 cents; transfer unit: files, Mbytes (managed by user/operator)
- Tape: infinite, sec-min, 10^-6 cents
67 The Principle of Locality
- Programs access a relatively small portion of the address space at any instant of time
- Example: 90% of time in 10% of the code
- Two different types of locality
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon
68 Memory Hierarchy: Principles of Operation
- At any given time, data is copied between only 2 adjacent levels
- Upper level (cache): the one closer to the processor
- Smaller, faster, and uses more expensive technology
- Lower level (memory): the one further away from the processor
- Bigger, slower, and uses less expensive technology
- Block
- The minimum unit of information that can either be present or not present in the two-level hierarchy
(Figure: blocks Blk X and Blk Y move between the upper level (cache), which feeds the processor, and the lower level (memory))