ELEC 669 Low Power Design Techniques Lecture 2
(Transcript and Presenter's Notes)
1
ELEC 669Low Power Design TechniquesLecture 2
  • Amirali Baniasadi
  • amirali@ece.uvic.ca

2
How to write a review?
  • Think Critically.
  • What if?
  • Next Step?
  • Any other applications?

3
Branches
  • Instructions which can alter the flow of
    instruction execution in a program

4
Motivation
  • Pipelined execution
  • A new instruction enters the pipeline every
    cycle...
  • but each instruction still takes several cycles
    to execute
  • Control flow changes
  • Two possible paths after a branch is fetched
  • Introduces pipeline "bubbles"
  • Branch delay slots
  • Prediction offers a chance to avoid these bubbles

A branch is fetched
But takes N cycles to execute
Pipeline bubble
5
Techniques for handling branches
  • Stalling
  • Branch delay slots
  • Relies on programmer/compiler to fill
  • Depends on being able to find suitable
    instructions
  • Ties resolution delay to a particular pipeline

6
Why aren't these techniques acceptable?
  • Branches are frequent: 15-25% of instructions
  • Today's pipelines are deeper and wider
  • Higher performance penalty for stalling
  • Misprediction penalty = issue width × resolution
    delay cycles
  • A lot of cycles can be wasted!!!

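As a rough, hypothetical illustration of the penalty relation above (the numbers are assumed, not from the slides), the wasted issue slots grow with both issue width and resolution delay:

```python
# Rough model: wasted issue slots per misprediction is about
# issue width times branch resolution delay (in cycles).
def wasted_slots(issue_width, resolution_delay_cycles):
    return issue_width * resolution_delay_cycles

# A hypothetical 4-wide machine that resolves branches after 10 cycles
# can waste up to 40 issue slots on every mispredicted branch.
print(wasted_slots(4, 10))  # 40
```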
7
Branch Prediction
  • Predicting the outcome of a branch
  • Direction
  • Taken / Not Taken
  • Direction predictors
  • Target Address
  • PC+offset (Taken) / PC+4 (Not Taken)
  • Target address predictors
  • Branch Target Buffer (BTB)

8
Why do we need branch prediction?
  • Branch prediction
  • Increases the number of instructions available
    for the scheduler to issue. Increases
    instruction level parallelism (ILP)
  • Allows useful work to be completed while waiting
    for the branch to resolve

9
Branch Prediction Strategies
  • Static
  • Decided before runtime
  • Examples
  • Always-Not Taken
  • Always-Taken
  • Backwards Taken, Forward Not Taken (BTFNT)
  • Profile-driven prediction
  • Dynamic
  • Prediction decisions may change during the
    execution of the program

10
What happens when a branch is predicted?
  • On misprediction
  • No speculative state may commit
  • Squash instructions in the pipeline
  • Must not allow speculative stores in the
    pipeline to reach memory
  • Cannot allow stores which would not have happened
    to commit
  • Even for good branch predictors, more than half of
    the fetched instructions are squashed

11
Instruction traffic due to misprediction
(Chart: lower is better.) Half of the fetched
instructions are wasted; more waste in the
front-end.
12
Energy Loss due to Mispredictions
(Chart: lower is better.) 21% average energy loss;
more energy waste in integer benchmarks.
13
Simple Static Predictors
  • Simple heuristics
  • Always taken
  • Always not taken
  • Backwards taken / Forward not taken
  • Relies on the compiler to arrange the code
    following this assertion
  • Certain opcodes taken
  • Programmer provided hints
  • Profiling

14
Simple Static Predictors
15
Dynamic Hardware Predictors
  • Dynamic Branch Prediction is the ability of the
    hardware to make an educated guess about which
    way a branch will go - will the branch be taken
    or not.
  • The hardware can look for clues based on the
    instructions, or it can use past history - we
    will discuss both of these directions.

16
A Generic Branch Predictor
(Diagram: Fetch uses f(PC, x) to produce the
predicted stream of (PC, T or NT); Resolve
produces the actual stream, f(PC, x) = T or NT.
Execution order runs from the predicted stream to
the actual stream.)
  • What's f(PC, x)? x can be any relevant
    piece of information; thus far x was empty.
17
Bimodal Branch Predictors
  • Dynamically store information about the branch
    behaviour
  • Branches tend to behave in a fixed way
  • Branches tend to behave in the same way across
    program execution
  • Index a Pattern History Table using the branch
    address
  • 1 bit branch behaves as it did last time
  • Saturating 2 bit counter branch behaves as it
    usually does

18
Saturating-Counter Predictors
  • Consider a strongly biased branch with an
    infrequent outcome
  • TTTTTTTTNTTTTTTTTNTTTT
  • Last-outcome will mispredict twice per
    infrequent-outcome encounter
  • TTTTTTTTNTTTTTTTTNTTTT
  • Idea: remember the most frequent case
  • Saturating counter adds hysteresis
  • often called a bimodal predictor
  • Captures temporal bias

19
Bimodal Prediction
  • Table of 2-bit saturating counters
  • Predict the most common direction
  • Advantages: simple, cheap, good accuracy
  • Bimodal will mispredict once per infrequent
    outcome encounter
  • TTTTTTTTNTTTTTTTTNTTTT

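A minimal sketch of one such saturating 2-bit counter (states 0-1 predict not-taken, 2-3 predict taken); run on the biased stream above, it mispredicts only once per infrequent outcome:

```python
# Saturating 2-bit counter: states 0,1 predict not-taken; 2,3 taken.
class TwoBitCounter:
    def __init__(self, state=3):
        self.state = state  # start strongly taken

    def predict(self):
        return self.state >= 2  # True = taken

    def update(self, taken):
        # Saturate at 0 and 3 instead of wrapping around.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# Biased stream TTTTTTTTN TTTTTTTTN: one mispredict per N.
ctr = TwoBitCounter()
stream = [True] * 8 + [False] + [True] * 8 + [False]
mispredicts = 0
for taken in stream:
    if ctr.predict() != taken:
        mispredicts += 1
    ctr.update(taken)
print(mispredicts)  # 2
```

A 1-bit (last-outcome) scheme would mispredict 4 times on the same stream, twice per N.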
20
Bimodal Branch Predictors
21
Correlating Predictors
  • From a program perspective
  • Different branches may be correlated
  • if (aa == 2) aa = 0;
  • if (bb == 2) bb = 0;
  • if (aa != bb) then ...
  • Can be viewed as a pattern detector
  • Instead of keeping aggregate history information
  • i.e., most frequent outcome
  • Keep exact history information
  • Pattern of n most recent outcomes
  • Example
  • BHR = n most recent branch outcomes
  • Use PC and BHR (xor?) to access prediction table

22
Pattern-based Prediction
  • Nested loops
  • for i = 0 to N
  •   for j = 0 to 3
  • Branch outcome stream for the j-loop branch
  • 11101110111011101110
  • Patterns
  • 111 -> 0
  • 110 -> 1
  • 101 -> 1
  • 011 -> 1
  • 100% accuracy
  • Learning time: 4 instances
  • Table index: (PC, 3-bit history)

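A minimal sketch of this pattern-based idea: a table maps the last 3 outcomes to the outcome that followed them last time. (The warm-start history is an assumption for illustration.)

```python
# Pattern-based prediction for the j-loop stream 1110 1110 ...:
# map the last 3 outcomes to the outcome that followed them last time.
history = (1, 1, 1)   # assumed warm-start history, most recent last
table = {}            # 3-bit pattern -> predicted next outcome
mispredicts = 0
for outcome in [1, 1, 1, 0] * 5:
    if table.get(history) != outcome:   # None while still learning
        mispredicts += 1
    table[history] = outcome            # learn/refresh the pattern
    history = history[1:] + (outcome,)
print(mispredicts)  # 5: all during learning; 100% accurate afterwards
```

All mispredictions occur while the four patterns (111, 110, 101, 011) are being learned; once the table is warm, the periodic stream is predicted perfectly.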
23
Two-level Branch Predictors
  • A branch outcome depends on the outcomes of
    previous branches
  • First level: Branch History Registers (BHR)
  • Global history / branch correlation: past
    executions of all branches
  • Self history / private history: past executions
    of the same branch
  • Second level: Pattern History Table (PHT)
  • Use first-level information to index a table
  • Possibly XOR with the branch address
  • PHT: usually saturating 2-bit counters
  • Also private, shared, or global

24
Gshare Predictor (McFarling)
(Diagram: PC and the global BHR are combined by a
function f to index the branch history table,
which yields the prediction.)
  • PC and BHR can be
  • concatenated
  • completely overlapped
  • partially overlapped
  • xored, etc.
  • How deep should the BHR be?
  • Really depends on the program
  • Deeper increases learning time
  • but may increase the quality of information

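A sketch of one common way to form the gshare index (XOR of PC bits with the global history register; the table size and PC shift are assumptions, not from the slides):

```python
TABLE_BITS = 12  # assumed: 4K-entry table of 2-bit counters

def gshare_index(pc, ghr):
    # Drop low PC bits (instruction alignment), XOR with global history,
    # then mask down to the table index width.
    return ((pc >> 2) ^ ghr) & ((1 << TABLE_BITS) - 1)

# The same branch under two different global histories maps to two
# different counters, which is how gshare captures correlation.
print(gshare_index(0x400120, 0b1010))
print(gshare_index(0x400120, 0b0110))
```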
25
Two-level Branch Predictors (II)
26
Hybrid Prediction
  • Combining branch predictors
  • Use two different branch predictors
  • Access both in parallel
  • A third table determines which prediction to use
  • Two or more predictor components combined
  • Different branches benefit from different types
    of history

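A sketch of the "third table" selector idea: per-branch 2-bit counters trained toward whichever component predicted correctly. (Component names and table size are illustrative, not from the slides.)

```python
# Selector for a hybrid predictor: a per-branch 2-bit counter chooses
# between component A (e.g. bimodal) and component B (e.g. gshare).
class Selector:
    def __init__(self, size=1024):
        self.ctr = [2] * size  # >= 2 means "trust component B"

    def choose(self, pc, pred_a, pred_b):
        return pred_b if self.ctr[pc % len(self.ctr)] >= 2 else pred_a

    def update(self, pc, pred_a, pred_b, outcome):
        # Train only when the components disagree in correctness.
        i = pc % len(self.ctr)
        if pred_a == outcome and pred_b != outcome:
            self.ctr[i] = max(0, self.ctr[i] - 1)   # move toward A
        elif pred_b == outcome and pred_a != outcome:
            self.ctr[i] = min(3, self.ctr[i] + 1)   # move toward B
```

Both components are still accessed in parallel; the selector only decides whose answer is used, which is where the extra power of combined predictors comes from.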
27
Hybrid Branch Predictors (II)
28
Issues Affecting Accurate Branch Prediction
  • Aliasing
  • More than one branch may use the same BHT/PHT
    entry
  • Constructive
  • Prediction that would have been incorrect,
    predicted correctly
  • Destructive
  • Prediction that would have been correct,
    predicted incorrectly
  • Neutral
  • No change in the accuracy

29
More Issues
  • Training time
  • Need to see enough branches to uncover pattern
  • Need enough time to reach steady state
  • Wrong history
  • Incorrect type of history for the branch
  • Stale state
  • Predictor is updated after information is needed
  • Operating system context switches
  • More aliasing caused by branches in different
    programs

30
Performance Metrics
  • Misprediction rate
  • Mispredicted branches per executed branch
  • Unfortunately the metric most usually reported
  • Instructions per mispredicted branch
  • Gives a better idea of the program behaviour
  • Branches are not evenly spaced

31
Impact of Realistic Branch Prediction
  • Limiting the type of branch prediction.

(Chart, IPC: FP 15 - 45, Integer 6 - 12)
32
BPP: A Power-Aware Branch Predictor
  • Combined Predictors
  • Branch Instruction Behavior
  • BPP (Branch Predictor Prediction)
  • Results

33
Combined Predictors
  • Different behaviors, different sub-predictors
  • Selector picks the sub-predictor.
  • Improved performance over processors using only
    one sub-predictor
  • Consequence: extra power (~50%)

34
Branch Predictors' Power
  • Direct effect: up to 10%.
  • Indirect effect: wrong-path instructions
  • Smaller/less complex predictors mean more wasted
    energy.
  • Power-aware predictors MUST be highly accurate.

35
Branch Instruction Behavior
  • Branches tend to keep using the same sub-predictor

36
Branch Predictor Prediction

(Diagram: BPP buffer holding hints.)
Hints on the next two branches. How?
11 = mispredicted branch
00 = branch used Bimodal last time
01 = branch used Gshare last time
37
BPP example
Code Sequence, First Appearance
(Diagram: instructions A through F; the BPP buffer
holds hints for A, B, C and D.)
38
BPP example


39
Results
  • Power (total and branch predictors') and
    performance.
  • Compared to three base cases:
  • A) Non-gated combined (CMB)
  • B) Bimodal (BMD)
  • C) Gshare (GSH)
  • Reported for 32k-entry banked predictors.

40
Performance

Within 0.4% of CMB; better than BMD (by 7%) and
GSH (by 3%)
41
Branch Predictors' Energy

13% less than CMB; more than BMD (by 35%) and
GSH (by 22%)
42
Total Energy

0.3%, 4.5% and 1.8% less than CMB, BMD and GSH
43
ILP, benefits and costs?
  • How can we extract more ILP?
  • What are the costs?

44
Upper Limit to ILP: Ideal Machine
Amount of parallelism when there are no branch
mispredictions and we're limited only by data
dependencies.
(Chart, IPC: FP 75 - 150, Integer 18 - 60)
Instructions that could theoretically be issued
per cycle.
45
Complexity-Effective Designs
  • History: brainiacs and speed demons
  • Brainiacs: maximizing the number of instructions
    issued per clock cycle
  • Speed demons: simpler implementation with a very
    fast clock
  • Complexity-Effective
  • A complexity-effective architecture combines the
    benefits of complex issue schemes with the
    benefits of a simpler implementation and a fast
    clock cycle
  • Complexity measurement: delay of the critical
    path
  • Proposed architecture
  • High performance (high IPC) with a very high clock
    frequency

46
Extracting More Parallelism
(Chart: issue widths of 4 and 8 today; 128 or 256
in the future? Higher IPC, but at what clock and
power cost?)
Want: high IPC + fast clock + low power
47
Generic pipeline description
  • Baseline superscalar model
  • Criteria for sources of complexity (delay):
  • structures whose delay is a function of issue
    window size and issue width
  • structures which tend to rely on broadcast
    operations over long wires

48
Sources of complexity
  • Register renaming logic
  • translates logical register designators to
    physical register designators
  • Wakeup logic
  • Responsible for waking up instructions waiting
    for their source operands to become available
  • Selection logic
  • Responsible for selecting instructions for
    execution from the pool of ready instructions
  • Bypass logic
  • Bypassing operand values from instructions
    that have completed execution
  • Other structures not considered here
  • Access time of the register file varies with the
    number of registers and the number of ports.
  • Access time of a cache is a function of the size
    of the cache and its associativity

49
Register rename logic complexity
50
Delay analysis for rename logic
  • Delay analysis for the RAM scheme
  • The RAM scheme operates like a standard RAM
  • Issue width affects delay through its impact on
    wire lengths
  • - Increasing issue width increases the number of
    bit/word lines
  • - The delay of the rename logic is a linear
    function of the issue width.
  • SPICE result
  • Total delay and each component delay
  • increase linearly with IW
  • Bit-line and word-line delay worsens
  • as the feature size is reduced.
  • (Logic delay is reduced linearly as
  • the feature size is reduced, but wire
  • delay falls at a slower rate.)

51
Wakeup logic
  • Wakeup logic
  • Responsible for updating source dependences for
    instructions in the issue window waiting for
    their source operands to become available.
  • Basic structure
  • 2 OR gates and 2 × IW comparators per
    issue-window entry
  • Delay analysis
  • Almost a linear function (> 0.35um)
  • A quadratic function under 0.35um

52
Delay analysis for wakeup logic
  • SPICE result
  • (figure 5, under 0.18um)
  • Issue width has a greater impact on
  • the delay than window size.
  • Window size increases Tdrive
  • Issue width increases Tdrive, Ttagmatch and
    TmatchOR
  • (figure 6: 8-way, 64-entry window)
  • The tag drive and tag match delays are
  • less scalable than the match OR delay.
  • Tdrive + Ttagmatch: about 52% of the delay under
    0.8um, about 62% under 0.18um

53
Selection Logic
  • Selection logic
  • Responsible for choosing instructions for
    execution from the pool of ready instructions in
    the issue window
  • Basic structure
  • REQ (input) and GRANT (output) signals
  • Operation: 2 phases
  • REQ propagates up to the root.
  • GRANT, with high priority on the
  • arbiter cell, propagates down to
  • the leaf arbiter.
  • Selection policy (oldest first)
  • (implementation)
  • left-most entries have the highest
  • priority.
  • The issue window is compacted to the left every
  • time instructions are issued, and new
  • instructions are inserted at the right end.

54
Delay analysis for selection logic
  • Delay analysis
  • The optimal number of arbiter inputs is found to
    be four here.
  • SPICE result
  • Assuming a single functional unit
  • The various components of the total
  • delay scale well as the feature size
  • is reduced.
  • (All the delays are logic delays;
  • wires are not considered.)
  • It is possible to minimize the effect of
  • wire delays if the ready
  • signals are stored in a smaller, more
  • compact array.

55
Data bypass logic
  • Bypass logic
  • Responsible for forwarding result values from
    completing instructions to dependent
    instructions, bypassing the register file
  • Basic structure
  • In a fully bypassed design,
  • bypass paths = 2 × IW² × S
  • where S = number of pipeline stages after the
  • first output-producing stage
  • Current trend: deeper pipelining and
  • wider issue
  • -> bypass logic becomes critically important

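The path-count formula above can be spelled out numerically (the issue widths below are chosen for illustration):

```python
# Fully bypassed network: each of IW results from each of S stages must
# reach both source operands of all IW functional units.
def bypass_paths(issue_width, stages_after_first_result):
    return 2 * issue_width ** 2 * stages_after_first_result

print(bypass_paths(4, 2))  # 64
print(bypass_paths(8, 2))  # 256: quadratic growth with issue width
```

Doubling issue width quadruples the number of bypass paths, which is why wider issue makes bypass wiring a critical concern.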
56
Delay analysis for data bypass logic
  • Delay analysis
  • The delay is a function of the length of the
    result wires
  • Increasing IW increases the length of the result
    wires
  • SPICE result
  • Based on the basic structure (layout)
  • The delays are the same for the three
  • technologies (feature sizes)

57
Summary of Delays and Pipeline Issues
  • Pipeline delay results
  • For the 4-way machine,
  • the window logic (WL) determines the critical
    path delay
  • For the 8-way machine,
  • the bypass logic (BL) determines the critical
    path delay
  • Future machines (more ILP)
  • WL and BL will pose the largest problems.
  • Both are difficult to divide
  • into more pipeline segments (atomic
    operations)
  • In WL (wake-up/select)
  • In BL (bypass logic)
  • In order for dependent operations to execute in
    consecutive cycles, the bypass value must be made
    available to the dependent instruction within a
    cycle.
  • Solution: stall (trade-off between the cycle time
    and the bottleneck from bypass at wider issue)

58
A Complexity-Effective Micro-Arch.
  • Dependence-based microarchitecture
  • Replaces the issue window with a simpler
    structure that facilitates a faster clock while
    exploiting similar levels of parallelism.
  • Naturally lends itself to clustering and helps
    the bypass problem to a large extent.
  • Simple description
  • Dependent instructions can't execute
  • in parallel, only consecutively.
  • The issue window is replaced by
  • a small number of FIFO buffers
  • The FIFO buffers are constrained to
  • issue in order, and dependent instructions
  • are steered to the same FIFO.
  • Register availability only needs to be fanned
    out to the heads of the FIFO buffers.
  • (In a typical issue window, result tags have
    to be broadcast to all the entries.)
  • The instructions at the FIFO heads monitor
    reservation bits (one per physical register) to
    check for operand availability.
  • SRC_FIFO: a table for steering instructions to
    the appropriate buffers
  • Indexed using logical register designators.
  • SRC_FIFO(Ra) = the identity of the FIFO buffer

59
Instruction Steering Heuristics
  • Applied heuristics
  • Case 1: All operands of I are available ->
    steer I into a new (free) FIFO
  • Case 2: A single outstanding operand of I,
    produced by Isource in FIFO fa
  • if there are no instructions behind Isource in
    FIFO fa -> steer I into FIFO fa
  • else
    -> steer I into a new FIFO
  • Case 3: 2 outstanding operands of I -> apply
    case 2 to one of the 2 operands

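A minimal sketch of these steering cases (the data structures and helper names are assumptions for illustration, not from the paper): steer an instruction directly behind its producer when the producer sits at a FIFO tail, otherwise into a free FIFO.

```python
# FIFOs hold instruction ids; 'done' holds ids that have completed.
def free_fifo(fifos):
    for i, f in enumerate(fifos):
        if not f:
            return i
    return None  # all FIFOs occupied -> stall (not modeled here)

def steer(sources, fifos, done):
    waiting = [s for s in sources if s not in done]
    if not waiting:                     # Case 1: all operands available
        return free_fifo(fifos)
    producer = waiting[0]               # Cases 2/3: follow one producer
    for i, f in enumerate(fifos):
        if f and f[-1] == producer:     # nothing behind the producer
            return i
    return free_fifo(fifos)             # else: a new FIFO

fifos = [[1], [], []]
print(steer([1], fifos, set()))  # 0: placed right behind its producer
print(steer([], fifos, set()))   # 1: no outstanding operands, free FIFO
```

In-order issue within each FIFO then guarantees the dependent instruction executes after its producer without any broadcast wakeup.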
60
Performance results
  • Performance results
  • Proposed arch.: 8 FIFOs, 8 entries per FIFO;
  • baseline arch.: 64-entry issue window
  • The dependence-based microarchitecture is nearly
    as effective (extracts similar parallelism) as the
    typical window-based microarchitecture.

(Chart: max. 8% difference)
61
Complexity analysis
  • Reservation table
  • If the instruction Ia at the head of FIFO Fa is
    dependent on an instruction Ib waiting in a FIFO,
    Ia cannot issue until Ib completes.
  • The delay of the wakeup logic is determined by
    the delay of accessing the reservation table.
  • The selection logic is simple
  • because only the instructions
  • at the FIFO heads need to be
  • considered for selection.
  • Effect
  • The suggested arch. can improve the clock
    period (faster clock)
  • by as much as 39% in 0.18 um technology

62
Clustering
  • Clustering the dependence-based microarchitecture
  • Advantages
  • Wakeup and selection logic are simplified.
  • Because dependent instructions are assigned to
    the same FIFOs, local bypasses are used
  • more frequently than inter-cluster
    bypasses (overall delay is reduced).
  • Multiple copies of the register file reduce the
    number of ports (faster RF access)

63
Performance of Clustering
  • Performance comparison
  • Between a 2 x 4-way dependence-based and a
    conventional 8-way,
  • 64-entry window-based architecture
  • Assuming 1-cycle local bypass delay and 2-cycle
    inter-cluster bypass delay
  • Overall performance,
  • considering clock speed:
  • average 16% improvement

(Chart: max 12% difference)
64
Conclusion
  • Some important results
  • The logic associated with the issue window and
    the data bypass logic are going to become
    increasingly critical as future designs employ
    wider issue widths, bigger windows and smaller
    feature sizes.
  • Wire delays will increasingly dominate total
    delay in future technologies.
  • (Window logic and bypass logic are atomic
    operations.)
  • Complexity-effective architecture
  • An architecture that facilitates a fast clock
    while exploiting similar levels of ILP
  • The dependence-based architecture as a
    complexity-effective architecture
  • simplifies window logic
  • naturally lends itself to clustering by
    grouping dependent instructions

65
The Motivation for Caches
  • Motivation
  • Large memories (DRAM) are slow
  • Small memories (SRAM) are fast
  • Make the average access time small by
  • Servicing most accesses from a small, fast
    memory.
  • Reduce the bandwidth required of the large memory

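The "make the average access time small" goal is usually captured by the average memory access time (AMAT) formula; the numbers below are assumptions for illustration:

```python
# AMAT = hit time + miss rate * miss penalty: servicing most accesses
# from the small fast memory keeps the average near the fast case.
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    return hit_time_ns + miss_rate * miss_penalty_ns

# Assumed: 1 ns SRAM hit, 5% miss rate, 100 ns DRAM penalty.
print(amat(1.0, 0.05, 100.0))  # 6.0 ns on average
```

Even though DRAM is 100x slower, a 95% hit rate keeps the average access within a small multiple of the SRAM hit time.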
66
Levels of the Memory Hierarchy
Capacity, access time, cost / managed by /
transfer ("staging") unit, from upper (faster)
to lower (larger) level:
  • Registers: 100s bytes, <10s ns / prog./compiler /
    1-8 bytes (instruction operands)
  • Cache: K bytes, 10-100 ns, $.01-.001/bit /
    cache controller / 8-128 bytes (blocks)
  • Main memory: M bytes, 100 ns-1 us, $.01-.001 /
    OS / 512-4K bytes (pages)
  • Disk: G bytes, ms, 10^-4 - 10^-3 cents /
    user/operator / Mbytes (files)
  • Tape: infinite, sec-min, 10^-6 cents
67
The Principle of Locality
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.
  • Example: 90% of time in 10% of the code
  • Two Different Types of Locality
  • Temporal Locality (Locality in Time): If an item
    is referenced, it will tend to be referenced
    again soon.
  • Spatial Locality (Locality in Space): If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon.

68
Memory Hierarchy Principles of Operation
  • At any given time, data is copied between only 2
    adjacent levels
  • Upper Level (Cache): the one closer to the
    processor
  • Smaller, faster, and uses more expensive
    technology
  • Lower Level (Memory): the one further away from
    the processor
  • Bigger, slower, and uses less expensive
    technology
  • Block
  • The minimum unit of information that can either
    be present or not present in the two-level
    hierarchy

(Diagram: blocks Blk X and Blk Y move between the
upper level (cache), next to the processor, and
the lower level (memory).)