Title: EE 382N Guest Lecture Wish Branches
1. EE 382N Guest Lecture: Wish Branches
Hyesoon Kim, HPS Research Group, The University of Texas at Austin
2. Lecture Outline
- Predicated execution
- Wish branches
- 2D-profiling
3. Motivation
- Branch predictors are still not perfect.
- Deeper pipelines and larger instruction windows increase the branch misprediction penalty.
- Predicated execution can eliminate branch mispredictions by converting control dependencies to data dependencies. However, predicated code has overhead.
4. Predicated Execution
[Figure: control-flow graph with blocks A, B, C, D, shown next to its predicated form.]
(predicated code)
  A:  p1 = (cond)
  B:  (!p1) mov b, 1
  C:  (p1) mov b, 0
  D:  add x, b, 1
- Convert control flow dependency to data dependency
- Pro: Eliminate hard-to-predict branches
- Cons:
  (1) Fetch blocks B and C all the time
  (2) Wait until p1 is resolved
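The conversion on this slide can be mimicked in a few lines of Python. This is only an illustrative sketch: the function names `branch_version` and `predicated_version` are made up here, and a real predicated ISA nullifies instructions in hardware rather than selecting values in software.

```python
def branch_version(cond: bool) -> int:
    """Control-dependent form: only one of the two assignments executes."""
    if cond:
        b = 0
    else:
        b = 1
    return b + 1  # add x, b, 1


def predicated_version(cond: bool) -> int:
    """Data-dependent form: both assignments are always 'fetched'."""
    p1 = cond                # p1 = (cond)
    b = None
    b = 1 if not p1 else b   # (!p1) mov b, 1 : takes effect only when p1 is false
    b = 0 if p1 else b       # (p1)  mov b, 0 : takes effect only when p1 is true
    return b + 1             # add x, b, 1 : must wait until p1 is resolved


# Both forms compute the same x for either outcome of the condition.
results = [(c, branch_version(c), predicated_version(c)) for c in (False, True)]
```

The predicated form has no branch to mispredict, but it always pays for both assignments and cannot produce `b` before `p1` is known, which corresponds to the two cons listed above.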
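5. The Overhead of Predicated Execution
[Figure: bar chart comparing predicated code against the non-predicated baseline; surviving data labels: -2, 13, 16 (units lost in extraction).]
(Predicated code)
  A:  p1 = (cond)
  B:  (!p1) mov b, 1
  C:  (p1) mov b, 0
  D:  add x, b, 1
The slide also shows the same code with the predicates replaced by known values: (0) mov b, 1 and (1) mov b, 0.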
If all overhead were ideally eliminated, predicated execution would provide a 16% improvement in average execution time.
6. The Problem
- Due to the predication overhead, predicated execution sometimes reduces performance.
- Branch misprediction characteristics are dependent on run-time behavior: input set, control-flow path, and phase behavior. The compiler cannot accurately estimate the run-time behavior of branches.
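7. Predicated Code Performance vs. Branch Misprediction Rate
[Figure: performance of predicated code and normal branch code as a function of branch misprediction rate, with regions labeled "Predicated code performs better" and "Normal branch code performs better" and marked points for profile-time (input A), run-time (input B), and X.]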
- Converting a branch to predicated code could hurt
performance if run-time misprediction rate is
lower than profile-time misprediction rate
- Execution time (normal branch code) $= \mathit{exec\_T} \cdot P(T) + \mathit{exec\_N} \cdot P(N) + \mathit{misp\_penalty} \cdot P(\text{misprediction})$
- Execution time (predicated code) $= \mathit{exec\_pred}$
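The two execution-time expressions on this slide can be turned into a small cost model. The sketch below assumes P(N) = 1 - P(T); the function names and all numeric values are hypothetical and chosen only to show the break-even point.

```python
def exec_time_branch(exec_t, exec_n, p_taken, misp_penalty, p_misp):
    """Expected time of normal branch code, following the slide's formula."""
    return exec_t * p_taken + exec_n * (1.0 - p_taken) + misp_penalty * p_misp


def break_even_misp_rate(exec_t, exec_n, p_taken, misp_penalty, exec_pred):
    """Misprediction rate at which branch code and predicated code tie."""
    base = exec_t * p_taken + exec_n * (1.0 - p_taken)
    return (exec_pred - base) / misp_penalty


# Hypothetical numbers: 4-cycle paths, 30-cycle penalty, 7-cycle predicated code.
threshold = break_even_misp_rate(4, 4, 0.5, 30, 7)
profile_time = exec_time_branch(4, 4, 0.5, 30, 0.15)  # e.g. input A
run_time = exec_time_branch(4, 4, 0.5, 30, 0.05)      # e.g. input B
```

With these made-up numbers the profile run (15% mispredictions) favors predicated code, while the actual run (5%) favors the branch, which is the situation described in the bullet above.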
8. Lecture Outline
- Predicated execution
- Wish branches
- 2D-profiling
9. Wish Branches [Kim et al., MICRO-38]
- A new type of control flow instruction
  - 3 types: wish jump/join and wish loop
- The compiler generates code (with wish branches) that can be executed either as predicated code or non-predicated code (normal branch code)
- The hardware decides to execute predicated code or normal branch code at run-time based on the confidence of branch prediction
  - Easy to predict: normal branch code
  - Hard to predict: predicated code
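10. Wish Jump/Join
[Figure: control-flow graph with blocks A, B, C, D, a wish jump at the end of A (Taken / Not-Taken edges) and a wish join at the end of B, drawn for the High Confidence and Low Confidence cases; one block is marked nop.]
wish jump/join code:
  A:          p1 = (cond)
              wish.jump p1, TARGET
  B:          (!p1) mov b, 1
              wish.join !p1, JOIN
  C: TARGET:  (p1) mov b, 0
  D: JOIN:
The slide also shows each predicated instruction with its predicate replaced by (1): (1) mov b, 1; wish.join (1), JOIN; (1) mov b, 0.
Block A in normal branch code, for comparison:
  A:          p1 = (cond)
              branch p1, TARGET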
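A rough software model of the two modes may help when reading the figure. It assumes that a high-confidence wish jump behaves like a normal branch (only the predicted side is fetched, and a wrong prediction costs a flush) and that a low-confidence one behaves like predicated code (both sides are fetched, no flush). The function name and return format are invented for this sketch.

```python
def run_wish_jump_join(cond: bool, high_confidence: bool, predicted_taken: bool = False):
    """Return (b, fetched_blocks, flushed) for one execution of the hammock."""
    if high_confidence:
        # Normal-branch behavior: follow the prediction, fetch one side only.
        fetched = ["A", "C", "D"] if predicted_taken else ["A", "B", "D"]
        flushed = predicted_taken != cond
    else:
        # Predicated behavior: fetch B and C; the false-predicate side acts as nops.
        fetched = ["A", "B", "C", "D"]
        flushed = False
    b = 0 if cond else 1  # architectural result is identical in both modes
    return b, fetched, flushed
```

For example, `run_wish_jump_join(True, True, predicted_taken=False)` reports a flush, while the same outcome in low-confidence mode fetches all four blocks and never flushes.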
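11. Wish Loop
[Figure: loop control-flow graph with header H, loop body X, and exit Y; the loop branch has Taken (T) and Not-taken (N) edges, drawn for the Low Confidence and High Confidence cases.]
wish loop code:
  H:        mov p1, 1
  X: LOOP:  (p1) add a, a, 1
            (p1) add i, i, 1
            (p1) p1 = (cond)
            wish.loop p1, LOOP
  Y: EXIT:
The slide also shows the three predicates in the loop body replaced by (1).
normal backward branch code:
  X: LOOP:  add a, a, 1
            add i, i, 1
            p1 = (i < N)
            branch p1, LOOP
  Y: EXIT: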
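The effect of the loop predicate can be seen by executing the wish loop code for more iterations than the loop actually needs. The sketch below assumes the loop condition is `i < N`, as in the normal backward branch code; `run_wish_loop` and `fetched_iters` are names invented here.

```python
def run_wish_loop(n: int, fetched_iters: int):
    """Execute the predicated loop body `fetched_iters` times and return (a, i)."""
    a = 0
    i = 0
    p1 = True                       # H: mov p1, 1
    for _ in range(fetched_iters):  # the front end keeps fetching the loop body
        if p1:
            a += 1                  # (p1) add a, a, 1
        if p1:
            i += 1                  # (p1) add i, i, 1
        if p1:
            p1 = i < n              # (p1) p1 = (cond)
    return a, i
```

Once `p1` becomes false, every further iteration is a no-op, so `run_wish_loop(3, 5)` ends in the same state as `run_wish_loop(3, 3)`.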
12. Mispredicted Case 1: Early-Exit
[Figure: wish loop with blocks H, X, Y. Correct execution: H, X1, X2, X3, Y (loop branch outcomes T, T, N). Early-exit (low confidence): H, X1, X2, Y are fetched, the pipeline is flushed, and X3, Y are then fetched.]
- Compared to normal branch code
  - predicate data dependency and one extra instruction (-)
13. Mispredicted Case 2: Late-Exit
[Figure: Correct execution: H, X1, X2, X3, Y (loop branch outcomes T, T, N). Late-exit (low confidence): H, X1, X2, X3, X4, X5, Y (T, T, T, T, N); the two extra iterations X4 and X5 become nops.]
- Compared to normal branch code
  - pro: reduce flush penalty (+)
  - cons: predicate data dependency and one extra instruction (-)
14. Mispredicted Case 3: No-Exit
[Figure: Correct execution: H, X1, X2, X3, Y (loop branch outcomes T, T, N). Late-exit, repeated for comparison: H, X1, ..., X5, Y, with X4 and X5 as nops. No-exit: H, X1, ..., X6 (T, T, T, T, T), followed by a pipeline flush and then Y.]
- No-exit
  - predicate data dependency and one extra instruction (-)
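The three misprediction cases can be summarized as a small classifier. This sketch rests on one reading of the figures that the slides do not state explicitly: late-exit and no-exit differ in whether the front end has already left the loop when the loop branch resolves. The function and parameter names are hypothetical.

```python
def classify_wish_loop_outcome(actual_iters: int, fetched_iters: int, front_end_exited: bool):
    """Classify a low-confidence wish loop execution.

    actual_iters:     iterations the program really executes
    fetched_iters:    iterations fetched when the final loop branch resolves
    front_end_exited: whether the front end had already predicted the loop exit
    """
    if fetched_iters < actual_iters:
        # Left the loop too soon: the missing iterations must be refetched.
        return {"case": "early-exit", "flush": True, "nop_iters": 0}
    extra = fetched_iters - actual_iters
    if not front_end_exited:
        # Still fetching the loop body: the wrong-path fetch must be flushed.
        return {"case": "no-exit", "flush": True, "nop_iters": extra}
    if extra == 0:
        return {"case": "correct", "flush": False, "nop_iters": 0}
    # Left the loop late: the extra iterations drain as nops, no flush needed.
    return {"case": "late-exit", "flush": False, "nop_iters": extra}
```

Using the iteration counts drawn on the slides, a three-iteration loop fetched five times with the exit already predicted is a late-exit with two nop iterations and no flush.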
15. Questions?
- What kind of branches should be converted to wish branches (jump/join)?
  - Why not all branches?
- What kind of branches should be converted to wish loops?
16. Advantages/Disadvantages of Wish Branches
- Advantages compared to predicated execution
  - Reduce the overhead of predication
  - Increase the benefits of predicated code by allowing the compiler to generate more aggressively-predicated code
  - Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (wish loops)
  - Make predicated code less dependent on machine configuration (e.g. branch predictor)
17. Advantages/Disadvantages of Wish Branches
- Disadvantages compared to predicated execution
  - Extra branch instructions use machine resources
  - Extra branch instructions increase the contention for branch predictor table entries
  - May constrain the compiler's scope for code optimizations
18. Wish Branch Support
- ISA Support
  - predicated execution, wish branch instruction
- Compiler Support
  - Wish branch generation algorithms
  - The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches
- Hardware Support
  - Instruction decode logic
  - Predicate dependency elimination module
  - Confidence estimator
  - Front-end and branch misprediction detection/recovery module
19. ISA Support
- Using existing hint bits (IA-64, x86, PowerPC)
- Hint bits can be ignored: a wish branch can be treated as a normal branch.
Instruction format:
  OPCODE | btype | wtype | target offset | p
  btype: branch type (0: normal branch, 1: wish branch)
  wtype: wish branch type (0: jump, 1: loop, 2: join)
  p: predicate register identifier
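The hint fields can be illustrated with a toy encoder and decoder. The slide does not give field widths or bit positions, so the layout below (1 bit for btype, 2 bits for wtype, 6 bits for p) is purely an assumption for illustration, and OPCODE and target offset are left out.

```python
from dataclasses import dataclass

WTYPE_NAMES = {0: "jump", 1: "loop", 2: "join"}


@dataclass(frozen=True)
class BranchHint:
    btype: int  # 0: normal branch, 1: wish branch
    wtype: int  # 0: jump, 1: loop, 2: join
    p: int      # predicate register identifier (assumed 6 bits wide)


def pack(h: BranchHint) -> int:
    """Pack the hint fields into an assumed 9-bit layout: btype | wtype | p."""
    if h.btype not in (0, 1) or h.wtype not in WTYPE_NAMES or not 0 <= h.p < 64:
        raise ValueError("field out of range")
    return (h.btype << 8) | (h.wtype << 6) | h.p


def unpack(bits: int) -> BranchHint:
    return BranchHint((bits >> 8) & 0x1, (bits >> 6) & 0x3, bits & 0x3F)


def decode(bits: int, supports_wish: bool = True) -> str:
    """Hardware without wish branch support ignores the hints."""
    h = unpack(bits)
    if h.btype == 0 or not supports_wish:
        return "normal branch"
    return "wish " + WTYPE_NAMES[h.wtype]
```

The `supports_wish=False` path mirrors the point that hint bits can be ignored, so the same binary still runs as normal branch code.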
20. Wish Branch Support
- ISA Support
  - predicated execution, wish branch instruction
- Compiler Support
  - Wish branch generation algorithms
  - The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches
- Hardware Support
  - Instruction decode logic
  - Predicate dependency elimination module
  - Confidence estimator
  - Front-end and branch misprediction detection/recovery module
21. Compiler Support
[Figure: compiler phase diagram with the phases region formation, if-conversion, loop opt (swp, unrolling), global inst. sched, register allocation, and local inst. sched, marked by a legend: existing, modified, new.]
- Major phase ordering with wish branch generation in code generation (ORC)
22. Wish Branch Generation Algorithm
- Wish jump/join candidates: all branches that are suitable for if-conversion
  - If the number of instructions in the fall-through block > N (N = 5): wish jump and join are inserted
  - All other branches: converted to predicated code
- A loop branch is converted into a wish loop when the loop body has fewer than L instructions (L = 30)
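The selection rules above amount to two threshold tests. The sketch below uses the slide's values N = 5 and L = 30; the function names are invented, and the fallbacks for branches that are not if-convertible and for large loops (left as normal branches) are assumptions based on the statement that some branches stay as normal branches.

```python
N_FALLTHROUGH = 5   # N from the slide
L_LOOP_BODY = 30    # L from the slide


def classify_forward_branch(if_convertible: bool, fallthrough_insts: int,
                            n: int = N_FALLTHROUGH) -> str:
    """Decide how a forward branch is compiled."""
    if not if_convertible:
        return "normal branch"       # assumption: not a wish jump/join candidate
    if fallthrough_insts > n:
        return "wish jump/join"      # large fall-through block
    return "predicated"              # all other candidates


def classify_loop_branch(loop_body_insts: int, l: int = L_LOOP_BODY) -> str:
    """Decide how a backward (loop) branch is compiled."""
    if loop_body_insts < l:
        return "wish loop"
    return "normal branch"           # assumption: large loops are left unchanged
```

For example, an if-convertible branch with an 8-instruction fall-through block becomes a wish jump/join, while one with 3 instructions is simply predicated.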
23. Wish Branch Support
- ISA Support
  - predicated execution, wish branch instruction
- Compiler Support
  - Wish branch generation algorithms
  - The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches
- Hardware Support
  - Instruction decode logic
  - Predicate dependency elimination module
  - Front-end and branch misprediction detection/recovery module
  - Confidence estimator
24. Hardware Support
- Instruction fetch/decode logic
  - Decoder: decode wish branches
  - BTB: mark wish branches
- Wish branch state machine hardware
  - Wish loop stays in low-confidence mode until the loop exits
- Predicate dependency elimination module
  - High-confidence mode: predicate values are predicted
- Branch misprediction detection/recovery module
  - No flush if wish branch is mispredicted during low-confidence mode
- Confidence estimator
25. JRS Confidence Estimator
- Estimates how much confidence the processor has in a branch prediction
- Trained with branch misprediction information
[Figure: the PC and the global BHR form an m-bit index into a table of 2^m entries of n-bit counters; the selected counter yields High Confidence or Low Confidence.]
- "Assigning Confidence to Conditional Branch Predictions", Jacobsen et al., MICRO-29
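A minimal software model of such an estimator is sketched below. It assumes the commonly described arrangement of resetting counters indexed by PC XOR global history; the table size, counter width, and threshold are hypothetical values, and the tagging used in the simulated configuration is not modeled.

```python
class ConfidenceEstimator:
    """Table of saturating counters that reset to zero on a misprediction."""

    def __init__(self, index_bits: int = 10, counter_bits: int = 4, threshold: int = 15):
        self.mask = (1 << index_bits) - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)  # 2^m entries of n-bit counters

    def _index(self, pc: int, bhr: int) -> int:
        return (pc ^ bhr) & self.mask         # assumed hash: PC XOR global BHR

    def is_high_confidence(self, pc: int, bhr: int) -> bool:
        return self.table[self._index(pc, bhr)] >= self.threshold

    def update(self, pc: int, bhr: int, prediction_correct: bool) -> None:
        i = self._index(pc, bhr)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0                 # train on misprediction information
```

A branch is reported as high confidence only after a run of correct predictions at least as long as the threshold, and a single misprediction drops it back to low confidence.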
26. Experimental Infrastructure
[Figure: tool flow. Source Code -> IA-64 Compiler (ORC) -> IA-64 Binary -> Trace generation module -> IA-64 Trace -> Micro-op Translator -> µops -> Micro-op Simulator]
- IA-64 provides full support for predication
- Convert IA-64 traces to micro-ops to simulate an out-of-order superscalar processor model
27. Simulation Methodology
- Nine SPEC 2000 integer benchmarks
- Baseline Processor Configuration
  - Front End
    - Large and accurate branch predictor (64KB hybrid branch predictor: gshare and local)
    - Minimum 30-cycle branch misprediction penalty
    - 64KB, 2-cycle latency I-cache
  - Execution Core
    - 8-wide out-of-order processor
    - 512-entry instruction window
  - Confidence Estimator
    - 1KB tagged 16-bit history JRS confidence estimator (Jacobsen et al., MICRO-29)
28. Performance Improvement
[Figure: performance chart relative to the non-predicated baseline; surviving data labels: -4, 14, 2.02, 8, 24.]
Callouts on the chart:
- 16% over conditional branch prediction, 11% over selective-predication, 7% over aggressive predication (all w/o mcf)
- 14% over conditional branch prediction, 13% over selective-predication, 16% over aggressive-predication
- 12% over conditional branch prediction, 11% over selective-predication, 13% over aggressive predication
Definitions:
- SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis
- AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated
29. Wish Branch Conclusion
- New control flow instructions: wish branches (jump/join/loop)
- Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture
  - Compiler analyzes the control-flow graph and generates code
  - Microarchitecture makes run-time decision to use predication
- Wish branches provide significant performance benefits
  - 16% compared to conditional branch prediction
  - 13% compared to selectively predicated code
- Wish branches can make predicated execution more viable and effective in high performance processors
  - By enabling adaptive and aggressive predicated execution
30. Lecture Outline
- Predicated execution
- Wish branches
- 2D-profiling
31. 2D-profiling
- Goal: Identify input-dependent branches by using a single input set for profiling
- If we know a branch is input-dependent
  - May not convert it to predicated code.
  - May convert it to a wish branch.
  - May not perform other compiler optimizations or may perform them less aggressively.
    - Hot-path/trace/superblock-based optimizations [Fisher81, Pettis90, Hwu93, Merten99]
32. Key Insight of 2D-profiling
Phase behavior in prediction accuracy is a good indicator of input dependence.
[Figure: prediction accuracy over time with three marked regions: phase 1, phase 2, phase 3.]
33. Traditional Profiling
[Figure: prediction accuracy (pr. Acc) over time for brA and for brB, each annotated with its MEAN pr.Acc.]
MEAN pr.Acc(brA) ≈ MEAN pr.Acc(brB) ⇒ behavior of brA ≈ behavior of brB
34. 2D-profiling
[Figure: prediction accuracy (pr. Acc) over time for brA and for brB, each annotated with its MEAN pr.Acc and STD pr.Acc.]
MEAN pr.Acc(brA) ≈ MEAN pr.Acc(brB), STD pr.Acc(brA) ≠ STD pr.Acc(brB) ⇒ behavior of brA ≠ behavior of brB
A: input-dependent br, B: input-independent br
35. 2D-profiling Mechanism
- The profiler collects branch prediction accuracy information for every static branch over time
[Figure: the time axis is divided into Slice 1, Slice 2, ..., Slice N, with slice size M instructions. For each slice the profiler records mean Pr.Acc(brA, s1), mean Pr.Acc(brA, s2), ..., mean Pr.Acc(brA, sN), and likewise for brB. In the example plots, brA is drawn against its mean with PAM: 50%, and brB against its mean with PAM: 0%.]
- Calculate MEAN (brA, brB, ...), standard deviation (brA, brB, ...), and PAM: Points Above Mean (brA, brB, ...)
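The three per-branch statistics can be computed directly from the per-slice accuracies. The sketch below uses the population standard deviation, which is an arbitrary choice here, and expresses PAM as the fraction of slices whose accuracy lies above the mean; the sample accuracies are invented to reproduce the PAM values in the figure.

```python
from statistics import mean, pstdev


def profile_2d(slice_acc):
    """Return (MEAN, STD, PAM) for one branch's per-slice prediction accuracy."""
    m = mean(slice_acc)
    sd = pstdev(slice_acc)
    pam = sum(1 for acc in slice_acc if acc > m) / len(slice_acc)
    return m, sd, pam


# Hypothetical per-slice accuracies with the same mean but different variation.
br_a = [0.5, 1.0, 0.5, 1.0]      # accuracy changes from slice to slice
br_b = [0.75, 0.75, 0.75, 0.75]  # accuracy is flat

stats_a = profile_2d(br_a)
stats_b = profile_2d(br_b)
```

Both branches have a mean of 0.75, so averaging alone cannot tell them apart, but brA shows a non-zero standard deviation and a PAM of 50% while brB shows zero for both.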
36. 2D-profiling Conclusion & Future Work
- 2D-profiling is a new profiling technique to find input-dependent characteristics by using a single input data set for profiling
  - 2D-profiling uses time-varying information instead of just average data
  - Phase behavior in prediction accuracy in a profile run ⇒ input-dependent
- Future Work
  - Better predicated code/wish branch generation algorithms
  - Detecting other input-dependent program characteristics