Title: Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures
1 Reducing Misspeculation Penalty in Trace-Level
Speculative Multithreaded Architectures
ISHPC-VI, Nara City (Japan) - September 7-9, 2005
- Carlos Molina (1, 3)
- Jordi Tubella (3)
- Antonio González (2, 3)
(1) Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain, carlos.molina_at_urv.net
(2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain, antoniox.gonzalez_at_intel.com
(3) Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain, {antonio,cmolina,jordit}_at_ac.upc.edu
2 Techniques to Boost Instruction Execution
Computation Repetition
- Avoid serialization caused by data dependences
- Determine results of instructions without executing them
- Target is to boost the execution of programs
5 Trace Level Speculation
- Avoids serialization caused by data dependences
- Skips multiple instructions in a row
- Predicts values based on the past
- Introduces penalties due to misspeculations
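The idea above can be illustrated with a minimal sketch. It assumes a hypothetical trace table keyed by the starting program counter, where each entry holds a predicted end point and predicted live output values. The names `TraceEntry`, `speculate` and `trace_table` are illustrative only and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class TraceEntry:
    """Hypothetical trace-table entry: predicted end point and live outputs."""
    end_pc: int
    live_outputs: dict  # register name -> predicted value


def speculate(pc, regs, trace_table):
    """Return (next_pc, regs, skipped) after optionally skipping a trace."""
    entry = trace_table.get(pc)
    if entry is None:
        return pc, regs, False            # no prediction: execute normally
    new_regs = dict(regs)
    new_regs.update(entry.live_outputs)   # predicted values, still unverified
    return entry.end_pc, new_regs, True   # all instructions in between are skipped


# One trace starting at 0x100 that is predicted to end at 0x140.
trace_table = {0x100: TraceEntry(end_pc=0x140, live_outputs={"r1": 7, "r2": 0})}
pc, regs, skipped = speculate(0x100, {"r1": 1, "r2": 5, "r3": 9}, trace_table)
```

Because the live outputs are only predictions, the skipped trace still has to be verified later, which is where the misspeculation penalty comes from.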
6 Trace Level Speculation with Live Output Test
[Figure: speculative thread (ST) and non-speculative thread (NST); trace misspeculation detection and recovery actions]
7 Motivation
- Two orthogonal issues
  - microarchitecture support for trace speculation
  - control and data speculation techniques
    - prediction of initial and final points
    - prediction of live output values
- This work focuses on
  - microarchitecture support (TSMA)
  - concretely, on reducing penalties due to misspeculations
Molina, González, Tubella, Trace-Level Speculative Multithreaded Architecture (TSMA), ICCD'02
Molina, González, Tubella, Compiler Analysis for TSMA, INTERACT'05
8 Outline
- TSMA (Trace-level Speculative Multithreaded Architecture)
- Verification Engine
- Enhanced Verification Engine
- Experimental Framework
- Simulation Results
- Conclusions
9 TSMA Block Diagram
[Figure: TSMA block diagram, including the Look Ahead Buffer]
10 Verification Engine
- Program Counter
- Operation Type
- Source and Destination Register Numbers
- Source and Destination Register Values
- Effective Address
14 Verification Engine
- BRANCHES: source value tested, program counter updated
- ARITHMETIC INSTRUCTIONS: source values tested, destination register updated
- STORES: effective address verified, destination memory updated
- LOADS: effective address verified, memory value checked, register updated
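The per-type checks above can be sketched as follows. This is a simplified model, not the authors' hardware: `LabEntry`, `ArchState`, `verify` and `drain_lab` are hypothetical names, and the effective address is assumed to be correct whenever the recorded source values match.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LabEntry:
    op: str                        # "branch", "arith", "store" or "load"
    src: dict                      # source register -> value seen by the speculative thread
    dst_reg: Optional[str] = None
    value: Optional[int] = None    # result, stored value or loaded value
    addr: Optional[int] = None     # effective address (loads and stores)
    next_pc: Optional[int] = None  # branch outcome


@dataclass
class ArchState:
    regs: dict
    mem: dict = field(default_factory=dict)
    pc: int = 0


def verify(entry, state):
    """Conventional check: any source mismatch fails the instruction."""
    if any(state.regs.get(r) != v for r, v in entry.src.items()):
        return False
    if entry.op == "branch":
        state.pc = entry.next_pc
    elif entry.op == "arith":
        state.regs[entry.dst_reg] = entry.value
    elif entry.op == "store":
        state.mem[entry.addr] = entry.value
    elif entry.op == "load":
        if state.mem.get(entry.addr) != entry.value:
            return False               # memory value check failed
        state.regs[entry.dst_reg] = entry.value
    return True


def drain_lab(lab, state):
    """Return (verified, squashed); everything after the first failure is squashed."""
    for i, entry in enumerate(lab):
        if not verify(entry, state):
            return i, len(lab) - i
    return len(lab), 0


state = ArchState(regs={"r1": 1, "r2": 2, "r5": 11})
lab = [
    LabEntry("arith", {"r1": 1, "r2": 2}, dst_reg="r3", value=3),
    LabEntry("arith", {"r5": 10}, dst_reg="r4", value=11),   # r5 was mispredicted
    LabEntry("arith", {"r1": 1}, dst_reg="r6", value=2),     # correct, but squashed
]
verified, squashed = drain_lab(lab, state)
```

In this example the third entry was executed correctly and does not depend on the mispredicted value, yet it is discarded together with the second one.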
15 Squashed Instructions from LAB
- On average, up to 85 instructions are squashed from the LAB in each thread synchronization
16 Correctly Executed Instructions
- On average, over 20% of the squashed instructions were correctly executed by ST
17 Our Proposal
- Enhanced Verification Engine
  - does not throw away execution results of instructions that are independent of the mispredicted point
  - reduces the number of instructions fetched and executed
  - thread synchronizations can be delayed or even aborted
  - verification of branches, loads, stores and single-cycle instructions is reconsidered
18 Related Work
- Instruction reissue (Lipasti 1997, González & González 1997, Sato 1998)
- Squash reuse (Sodani & Sohi 1997)
- Control independence in trace processors (Rotenberg et al. 1997)
- Dynamic control independence (Chou et al. 1999)
- Register integration (Roth & Sohi 2000)
22 Enhanced Verification Engine
- BRANCHES: the branch target is validated instead of the source values.
- ARITHMETIC INSTRUCTIONS: if the source values do not match, the instruction is re-executed.
- STORES: if verification fails, the effective address is recomputed and memory is updated with the value obtained from the non-speculative architectural state.
- LOADS: if verification fails, the effective address is recomputed and the destination value obtained from memory is committed to the register file.
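A minimal sketch of these four rules follows, under stated assumptions: each entry carries a hypothetical `compute` callable that recomputes the result (arithmetic), the target (branch) or the effective address (load/store) from architectural register values, and `spec` is what the speculative thread obtained. The names are illustrative, and the limit on re-executions per cycle is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    op: str                          # "branch", "arith", "store" or "load"
    src: dict                        # source register -> value seen by the speculative thread
    spec: int                        # result, branch target or address computed by ST
    compute: Callable[[dict], int]   # recomputes that quantity from architectural registers
    dst_reg: Optional[str] = None    # arith and load destination
    data_reg: Optional[str] = None   # store data register


def verify_enhanced(entry, regs, mem):
    """Return (ok, reexecuted); ok is False only when ST must be synchronized."""
    sources_ok = all(regs.get(r) == v for r, v in entry.src.items())
    actual = entry.spec if sources_ok else entry.compute(regs)
    if entry.op == "branch":
        return actual == entry.spec, int(not sources_ok)   # only the target matters
    if entry.op == "arith":
        regs[entry.dst_reg] = actual
    elif entry.op == "store":
        mem[actual] = regs[entry.data_reg]                 # value from architectural state
    elif entry.op == "load":
        regs[entry.dst_reg] = mem.get(actual, 0)           # unset memory reads 0 (assumption)
    return True, int(not sources_ok)


def drain_lab(lab, regs, mem):
    """Return (verified, reexecuted); stops at the first required synchronization."""
    reexecuted = 0
    for i, entry in enumerate(lab):
        ok, redo = verify_enhanced(entry, regs, mem)
        reexecuted += redo
        if not ok:
            return i, reexecuted
    return len(lab), reexecuted


regs, mem = {"r1": 1, "r2": 2, "r5": 11}, {}
lab = [
    Entry("arith", {"r1": 1, "r2": 2}, 3, lambda r: r["r1"] + r["r2"], dst_reg="r3"),
    Entry("arith", {"r5": 10}, 11, lambda r: r["r5"] + 1, dst_reg="r4"),
    Entry("branch", {"r4": 11}, 0x200, lambda r: 0x200 if r["r4"] > 0 else 0x204),
    Entry("store", {"r1": 1, "r4": 11}, 9, lambda r: r["r1"] + 8, data_reg="r4"),
]
result = drain_lab(lab, regs, mem)
```

Here the mispredicted value of `r5` causes three re-executions but no thread synchronization, because the branch still resolves to the target that the speculative thread followed.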
23 Incorrectly Speculated Instructions
- Only 1% of the instructions inserted in the LAB are incorrectly predicted
- On average, close to 90% of the instructions are branches, loads, stores and single-cycle instructions
24 Experimental Framework
- Simulator
  - Alpha version of the SimpleScalar Toolset
- Benchmarks
  - SPEC2000, ref input
- Maximum Optimization Level
  - DEC C and F77 compilers with -non_shared -O5
- Statistics Collected for 250 million instructions
  - Skipping an initial part of 500 million instructions
25 Simulation Parameters
- Base microarchitecture
  - out-of-order machine, 4 instructions per cycle
  - instruction cache 16KB, data cache 16KB, shared L2 256KB
  - bimodal predictor
- TSMA additional structures
  - each thread: instruction window, reorder buffer, register file
  - speculative data cache: 1KB
  - trace table: 128 entries, 4-way set associative
  - look ahead buffer: 128 entries
  - verification engine: up to 8 instructions per cycle
  - only one instruction re-executed per cycle
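For reference, the parameters above can be captured in a small configuration sketch. The class and field names are hypothetical and only the values come from the slide. The two derived quantities are simple arithmetic on those values, and the drain time is a lower bound that assumes no re-execution.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseConfig:
    issue_width: int = 4            # instructions per cycle, out of order
    icache_kb: int = 16
    dcache_kb: int = 16
    l2_kb: int = 256                # shared L2
    branch_predictor: str = "bimodal"


@dataclass(frozen=True)
class TsmaConfig:
    spec_dcache_kb: int = 1
    trace_table_entries: int = 128
    trace_table_ways: int = 4
    lab_entries: int = 128
    ve_width: int = 8               # instructions verified per cycle (peak)
    reexec_per_cycle: int = 1

    @property
    def trace_table_sets(self) -> int:
        return self.trace_table_entries // self.trace_table_ways

    @property
    def min_cycles_to_drain_lab(self) -> int:
        # Lower bound: a full LAB verified at peak rate with no re-execution.
        return math.ceil(self.lab_entries / self.ve_width)


cfg = TsmaConfig()
sets, drain_cycles = cfg.trace_table_sets, cfg.min_cycles_to_drain_lab
```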
26 Thread Synchronizations
[Figure: thread synchronizations, Conventional VE vs. Enhanced VE]
- On average, the number of thread synchronizations is about 10% lower (from 30% to 20%)
27 Speedup
[Figure: speedup, Conventional VE vs. Enhanced VE; y-axis from 1.00 to 1.45]
- On average, the performance improvement is around 9%
28 Reduction in Executed Instructions
- On average, almost 8% fewer instructions are executed with the enhanced VE
29 Conclusions
- TSMA
  - a significant number of instructions are correctly executed, but discarded when synchronizing
  - novel hardware technique to enhance TSMA
- Enhanced Verification Engine
  - thread synchronizations are delayed or even aborted
  - branches, loads, stores and single-cycle instructions are reconsidered
- Results show
  - speedup of 38% (9% improvement)
  - misprediction rate of 20% (10% reduction)
30 Future Work
- Aggressive trace level predictors
- Generalization to multiple threads
31 Questions & Answers