Title: Predicting Conditional Branches With Fusion-Based Hybrid Predictors
1 Predicting Conditional Branches With Fusion-Based Hybrid Predictors
Gabriel H. Loh, Yale University, Dept. of Computer Science
Dana S. Henry, Yale University, Depts. of Elec. Eng. and Comp. Sci.
This research was funded by NSF Grant MIP-9702281
2 The Branch Prediction Problem
[Figure: pipeline stages from PC Compute to Branch resolution]
- 1 out of 5 instructions is a branch
- May require many cycles to resolve
- Pentium 4 has a 20-cycle branch resolution pipeline
- Future pipeline depths likely to increase Sprangle02
- Predict branches to keep pipeline full
3 Bigger Predictors More Accurate
(but bigger predictors slower)
- Larger predictors tend to yield more accurate predictions
- Faster cycle times force smaller branch predictors
- Overriding predictor couples small, fast predictor with a large, multi-cycle predictor Jiménez2000
- Performs close to ideal large-fast predictor
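The overriding idea above can be sketched as follows. This is a minimal illustration, not the scheme from Jiménez2000 itself: the function name and the `override_penalty` parameter are hypothetical, and the cost model (a fixed number of bubble cycles on disagreement) is an assumption.

```python
def overriding_predict(fast_pred: bool, slow_pred: bool, override_penalty: int = 3):
    """Combine a 1-cycle prediction with a later, more accurate one.

    Returns (final_prediction, bubble_cycles). The fast prediction steers
    fetch immediately; when the slow prediction arrives and disagrees, it
    overrides the fast one and the wrongly fetched instructions are dropped.
    `override_penalty` is an assumed cost, not a figure from the slides.
    """
    if fast_pred == slow_pred:
        return slow_pred, 0          # agreement: no cycles lost
    return slow_pred, override_penalty  # disagreement: slow predictor wins


print(overriding_predict(True, True))
print(overriding_predict(True, False))
```

The design point is that the penalty is paid only on disagreement, which is why such a pairing can approach an ideal large and fast predictor.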
4 Hybrid Predictors
- Wide variety of branch prediction algorithms available
- Hybrid combines more than one stand-alone or component predictor McFarling93
[Figure: component predictors P1 and P2 and a Meta-Predictor produce the Final Prediction]
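A selection-based hybrid of this kind might look like the sketch below. It assumes a table of 2-bit saturating counters indexed by the low bits of the branch address, trained only when the two components disagree; the class name, table size, and initial counter value are illustrative choices.

```python
class SelectionHybrid:
    """Meta-predictor that selects between two component predictions."""

    def __init__(self, index_bits: int = 10):
        self.mask = (1 << index_bits) - 1
        # 2-bit counters: values 2..3 mean "trust P2", 0..1 mean "trust P1".
        self.meta = [2] * (1 << index_bits)

    def predict(self, pc: int, p1: bool, p2: bool) -> bool:
        return p2 if self.meta[pc & self.mask] >= 2 else p1

    def update(self, pc: int, p1: bool, p2: bool, taken: bool) -> None:
        if p1 == p2:
            return  # nothing to learn when the components agree
        i = pc & self.mask
        if p2 == taken:
            self.meta[i] = min(3, self.meta[i] + 1)
        else:
            self.meta[i] = max(0, self.meta[i] - 1)


hybrid = SelectionHybrid(index_bits=4)
print(hybrid.predict(5, True, False))   # initially trusts P2
hybrid.update(5, True, False, taken=True)
print(hybrid.predict(5, True, False))   # now trusts P1
```

Note that the final prediction is always one of the component predictions, a limitation the later slides contrast with fusion.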
5 Multi-Hybrids
[Figure: Quad-Hybrid Evers00: P1 and P2 feed M1, P3 and P4 feed M2, and M3 produces the final prediction]
[Figure: Multi-Hybrid Evers96: P1, P2, ..., Pn feed a priority encoder that produces the final prediction]
6 Our Idea: Prediction Fusion
[Figure: components P1, P2, P3, ..., Pn; label: Prediction Selection]
7 Early Attempt from ML
[Figure: eight predictors P1 to P8 with weights; the two weighted sums are 0.487 and 0.513]
- P2, P6 and P7 say not-taken
- P1, P3, P4, P5 and P8 say taken
- Weighted Majority algorithm LW94
- Better predictors get assigned larger weights
- Make final prediction with larger sum
- Predictor with largest weight not always correct
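The weighted-voting scheme in these bullets can be sketched as below. This is a minimal version assuming the common multiplicative update (each wrong predictor's weight is scaled by a factor `beta`); the class name, `beta = 0.5`, and the tie-break toward taken are assumptions, not details from the slides.

```python
class WeightedMajority:
    """Fuse n binary predictions by comparing weighted vote sums."""

    def __init__(self, n: int, beta: float = 0.5):
        self.weights = [1.0] * n
        self.beta = beta  # assumed penalty factor for a wrong predictor

    def predict(self, preds: list) -> bool:
        taken = sum(w for w, p in zip(self.weights, preds) if p)
        not_taken = sum(w for w, p in zip(self.weights, preds) if not p)
        return taken >= not_taken  # ties go to taken (arbitrary choice)

    def update(self, preds: list, outcome: bool) -> None:
        self.weights = [
            w * self.beta if p != outcome else w
            for w, p in zip(self.weights, preds)
        ]


wm = WeightedMajority(3)
preds = [True, False, False]
print(wm.predict(preds))       # two equal-weight votes for not-taken win
wm.update(preds, True)
wm.update(preds, True)
print(wm.weights, wm.predict(preds))
```

Because the output depends on the sum of weights rather than on the single heaviest predictor, a group of lighter predictors can outvote the one with the largest weight.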
8 Outline
- COLT Predictor
- Choosing parameters and components
- Performance
- Prediction distributions, component choice
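9 COLT Organization
[Figure: component predictions P1, P2, P3, ..., Pn (1 0 1 0 in the example) index into a mapping table; branch address and branch history select the mapping table within the VMT; the selected entry gives the final prediction]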
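One way to read this organization is sketched below. It assumes that address and history bits are concatenated to pick a mapping table, that each entry is a saturating counter trained toward the actual outcome, and that counters start just below the taken threshold; the class and method names are hypothetical.

```python
class ColtSketch:
    """Illustrative fusion predictor: component outputs index a counter table."""

    def __init__(self, n_components, addr_bits, hist_bits, counter_bits=4):
        self.addr_bits = addr_bits
        self.hist_bits = hist_bits
        self.max_count = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)
        n_tables = 1 << (addr_bits + hist_bits)
        # One mapping table per (address, history) pattern, one counter per
        # combination of component predictions; start weakly not-taken.
        self.vmt = [[self.threshold - 1] * (1 << n_components)
                    for _ in range(n_tables)]

    def _locate(self, pc, history, preds):
        a = pc & ((1 << self.addr_bits) - 1)
        h = history & ((1 << self.hist_bits) - 1)
        table = self.vmt[(a << self.hist_bits) | h]  # assumed: concatenation
        entry = 0
        for p in preds:
            entry = (entry << 1) | int(p)
        return table, entry

    def predict(self, pc, history, preds):
        table, entry = self._locate(pc, history, preds)
        return table[entry] >= self.threshold

    def update(self, pc, history, preds, taken):
        table, entry = self._locate(pc, history, preds)
        if taken:
            table[entry] = min(self.max_count, table[entry] + 1)
        else:
            table[entry] = max(0, table[entry] - 1)


colt = ColtSketch(n_components=4, addr_bits=2, hist_bits=2)
preds = [True, False, True, False]
print(colt.predict(0x40, 0b01, preds))
colt.update(0x40, 0b01, preds, taken=True)
print(colt.predict(0x40, 0b01, preds))
```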
10 Pathological Example
[Figure: components P1, P2 and P3 each predict 0 (not-taken)]
Actual outcome: 1 (taken)
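11 Example (contd)
[Figure: selection and COLT side by side, each with P1, P2 and P3 all predicting 0; the COLT side shows the VMT entry for 0 0 0 giving a final prediction of 1]
- Selection: outcome is always wrong
- COLT: can recognize and remember this pattern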
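The pathological case can be reproduced with a short simulation. The sketch assumes three components that always predict not-taken, a branch that is always taken, and a single 8-entry mapping table of 2-bit counters starting at 0; the function names are illustrative.

```python
def index_of(preds):
    """Pack component predictions (0/1) into a mapping-table index."""
    idx = 0
    for p in preds:
        idx = (idx << 1) | p
    return idx


def run(n_branches=10):
    preds = (0, 0, 0)   # every component says not-taken
    taken = 1           # the branch is actually taken
    table = [0] * 8     # 2-bit counters, one per prediction pattern
    selection_correct = 0
    fusion_correct = 0
    for _ in range(n_branches):
        # Even a perfect selector must output one of the component predictions.
        if taken in preds:
            selection_correct += 1
        i = index_of(preds)
        fused = 1 if table[i] >= 2 else 0
        if fused == taken:
            fusion_correct += 1
        # Train the selected counter toward the actual outcome.
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    return selection_correct, fusion_correct


print(run(10))  # (0, 8): selection never right, fusion right after warm-up
```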
12 COLT Lookup Delay
[Figure: timing diagram over time; components P1, P2, ..., Pn produce prediction bits that lead to the final Prediction]
13 Design Choices
- Number of branch address bits
- Number of branch history bits
  (these two determine the number of mapping tables)
- Number of components
  (determines the size of individual MTs)
- Choice of components
  - gshare, PAs, gskewed, ...
  - History length, PHT size, ...
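How these parameters translate into table counts and storage can be worked out directly. The helper below is a sketch under the assumption that every mapping-table entry is one counter of `counter_bits` bits; the function name and returned keys are hypothetical.

```python
def colt_geometry(addr_bits, hist_bits, n_components, counter_bits):
    """Derive table counts and storage from the design parameters."""
    mapping_tables = 2 ** (addr_bits + hist_bits)   # one per index pattern
    entries_per_table = 2 ** n_components           # one per prediction combo
    total_counters = mapping_tables * entries_per_table
    return {
        "mapping_tables": mapping_tables,
        "entries_per_table": entries_per_table,
        "total_counters": total_counters,
        "storage_bytes": total_counters * counter_bits // 8,
    }


print(colt_geometry(addr_bits=2, hist_bits=7, n_components=4, counter_bits=4))
```

Each added component doubles the size of every mapping table, so the number of components trades off directly against the address and history bits within a fixed budget.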
14 Predictor Components
- Global History
  - gshare McFarling93
  - Bi-Mode Lee97
  - Enhanced gskewed Michaud97
  - YAGS Eden98
- Local History
  - PAs Yeh94
  - pskewed Evers96
- Other
  - 2bC (bimodal) Smith81
  - Loop Chang95
  - alloyed Perceptron Jiménez02
- History lengths optimized on test data sets
- Total of 59 configurations
- Sizes vary up to 64KB
15 Huge Search Space
- 2^59 ways to choose components
- ? ways to choose COLT parameters
- We use a genetic search
Gene format:
- bit k = 0 means don't include Pk; bit k = 1 means do include Pk
- VMT size
- History length
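The gene layout and a search loop over it might be sketched as follows. Everything beyond the three gene fields is an assumption: the value ranges, the one-point crossover, the mutation rate, the keep-the-fitter-half policy, and above all `toy_fitness`, which stands in for the simulated prediction accuracy a real search would use.

```python
import random
from dataclasses import dataclass

N_CANDIDATES = 59  # number of candidate component configurations


@dataclass(frozen=True)
class Gene:
    component_mask: int  # bit k set means "include Pk"
    log2_vmt_size: int   # hypothetical encoding of the VMT size
    history_length: int


def chosen_components(gene):
    return [k for k in range(N_CANDIDATES) if (gene.component_mask >> k) & 1]


def random_gene(rng):
    return Gene(rng.getrandbits(N_CANDIDATES), rng.randint(8, 15), rng.randint(0, 12))


def crossover(a, b, rng):
    point = rng.randint(1, N_CANDIDATES - 1)
    low = (1 << point) - 1
    full = (1 << N_CANDIDATES) - 1
    mask = (a.component_mask & low) | (b.component_mask & ~low & full)
    return Gene(mask,
                rng.choice([a.log2_vmt_size, b.log2_vmt_size]),
                rng.choice([a.history_length, b.history_length]))


def mutate(gene, rng, rate=0.02):
    mask = gene.component_mask
    for k in range(N_CANDIDATES):
        if rng.random() < rate:
            mask ^= 1 << k  # flip the include/exclude bit for Pk
    return Gene(mask, gene.log2_vmt_size, gene.history_length)


def genetic_search(fitness, rng, pop_size=20, generations=30):
    pop = [random_gene(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # keep the fitter half unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)


def toy_fitness(gene):
    # Placeholder objective: prefer exactly four components.
    return -abs(len(chosen_components(gene)) - 4)


best = genetic_search(toy_fitness, random.Random(0))
print(chosen_components(best), best.log2_vmt_size, best.history_length)
```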
16 Methodology
- SPEC2000 integer benchmarks
  - For tuning/optimization: 10M branches from the test inputs
  - For evaluation: 500M branches from the train inputs
  - Skipped first 100M branches
- Compiled with cc -arch ev6 -O4 -fast -non_shared
- SimpleScalar simulator
  - sim-safe for trace collection
  - MASE for ILP simulations
17 Genetic Search COLT Results
Name  Size (KB)  VMT size  Counter width  History length  Components
a     16         2048      4              8               alpct(34/10), gskewed(12), gshare(8)
b     32         8192      4              7               alpct(34/10), gshare(15), gshare(9), PAs(7)
g     64         16384     4              10              alpct(40/14), gshare(16), YAGS(11), pskewed(6)
d     128        16384     4              7               alpct(40/14), alpct(38/14), gshare(16), gskewed(13), YAGS(12), PAs(8)
h     256        32768     4              4               alpct(50/18), alpct(34/10), gshare(18), Bi-Mode(16), gskewed(15), PAs(8)
18 Overall Predictor Performance
19 Per-Benchmark Performance
20 ILP Performance
- Simulated CPU
  - 6-issue
  - 20-cycle pipeline
  - Same functional units, latencies, caches as Intel P4/NetBurst microarchitecture
- Predictor configurations compared
  - 1-cycle 2bC
  - 4-cycle overriding (OR) alpct
  - 4-cycle overriding (OR) COLT
  - Ideal 1-cycle COLT
21 ILP Impact
22 COLT Parameter Sensitivity
- Mapping table counter widths
- Number of mapping tables
- Number of history bits for VMT index
23 Counter Width
24 VMT Size
25 History Length
26 Explaining Choice of Components
- Parameter sensitivity results show the GA performed well for the COLT parameters
- Why did it choose the component predictors that it did?
27 Classifying COLT Predictions
- We examined the b (32KB) COLT config.
- For each mapping table lookup, we examine the neighboring entries
[Figure: components P1 to P4 predict 1 0 0 1, selecting entry 1001 (counter 1111, T); neighboring entries shown: 0001 (counter 0010, NT) and 1101 (counter 1001, T)]
28 Classifying Predictions (contd)
[Figure: 32KB COLT with components gshare (9), gshare (14), PAs (7), alpct (34/10)]
Classes:
- easy: all neighboring entries agree
- short: only gshare(9) distinguishes
- long: only gshare(14) distinguishes
- local: only PAs(7) distinguishes
- perceptron: only alpct(34/10) distinguishes
- multi-length: mix of gshare(9), (14) or alpct
- mixed: both global and local components
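The classification can be sketched as a check of which single-component flips change the mapping table's answer. The sketch assumes that "neighboring entries" means entries whose index differs in exactly one component's bit, and that alpct counts as a global component for the multi-length class, as the class list suggests; the function and constant names are hypothetical.

```python
SINGLE_CLASS = {
    "gshare(9)": "short",
    "gshare(14)": "long",
    "PAs(7)": "local",
    "alpct(34/10)": "perceptron",
}
LOCAL_COMPONENTS = {"PAs(7)"}


def classify(table, entry, names):
    """Classify one lookup.

    table: list of booleans (True = entry predicts taken), length 2**len(names)
    entry: index formed from the component predictions, names[0] as the MSB
    """
    n = len(names)
    base = table[entry]
    distinguishing = []
    for pos, name in enumerate(names):
        neighbor = entry ^ (1 << (n - 1 - pos))  # flip this component's bit
        if table[neighbor] != base:
            distinguishing.append(name)
    if not distinguishing:
        return "easy"
    if len(distinguishing) == 1:
        return SINGLE_CLASS[distinguishing[0]]
    if any(name in LOCAL_COMPONENTS for name in distinguishing):
        return "mixed"
    return "multi-length"


names = ["gshare(9)", "gshare(14)", "PAs(7)", "alpct(34/10)"]
follows_pas = [bool(i & 0b0010) for i in range(16)]
print(classify(follows_pas, 0b1001, names))  # local
```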
29 Prediction Classifications
30 Related Work/Issues
- Alloyed history Skadron00
- Variable path history length Stark98
- Dynamic history length fitting Juan98
- Interference reduction (lots)
- COLT handles all of these cases
- Doesn't support partial update policies
31 Open Research
- Better individual components
- Augment with SBI Manne99, agree Sprangle97
- Better fusion algorithms
- Hybrid fusion/selection algorithms
- Other domains (branch confidence prediction, value prediction, memory dependence prediction, instruction criticality prediction, ...)
32 Summary
- Fusion is more powerful than selection
- Combines multiple sources of information
- Branch behavior is very varied
- Need long, short, global and local histories, multiple simultaneous lengths and types of history
- COLT is one possible fusion-based predictor
- Combines multiple types of information
- Current best purely dynamic predictor
33 Questions?