Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures


Learn more at: https://arco.e.ac.upc.edu


1
Reducing Misspeculation Penalty in Trace-Level
Speculative Multithreaded Architectures
ISHPC-VI, Nara City (Japan) - September 7-9, 2005

Carlos Molina (1, 3)
Jordi Tubella (3)
Antonio González (2, 3)

(1) Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain. carlos.molina_at_urv.net
(2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain. antoniox.gonzalez_at_intel.com
(3) Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain. {antonio,cmolina,jordit}_at_ac.upc.edu
2
Techniques to Boost I Execution
Computation Repetition

Avoid serialization caused by data dependences
Determine results of instructions without
executing them
Target is to boost the execution of programs

3
Techniques to Boost I Execution
Computation Repetition
4
Techniques to Boost I Execution
Computation Repetition
5
Trace Level Speculation

Avoids serialization caused by data dependences
Skips in a row multiple instructions
Predicts values based on the past
Introduces penalties due to misspeculations

6
Trace Level Speculation with Live Output Test
ST
NST
Trace Misspeculation Detection and Recovery Actions
7
Motivation

Two orthogonal issues:
  microarchitecture support for trace speculation
  control and data speculation techniques:
    prediction of initial and final points
    prediction of live output values
This work focuses on:
  microarchitecture support (TSMA)
  concretely, on reducing penalties due to misspeculations

Molina, González, Tubella, "Trace-Level Speculative Multithreaded Architecture (TSMA)", ICCD'02
Molina, González, Tubella, "Compiler Analysis for TSMA", INTERACT'05
8
Outline

TSMA (Trace-level Speculative Multithreaded
Architecture)
Verification Engine
Enhanced Verification Engine
Experimental Framework
Simulation Results
Conclusions

9
TSMA Block Diagram
Look Ahead Buffer
10
Verification Engine
Fields shown in the diagram:
  Program Counters
  Operation Type
  Source and Destination Register Numbers
  Source and Destination Register Values
  Effective Address
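The fields listed on this slide can be read as the contents of one Look-Ahead Buffer (LAB) entry. The following is a minimal sketch under that reading, not the authors' design: the names `LabEntry` and `LookAheadBuffer`, the FIFO ordering, and the "refuse the push when full" behaviour are all assumptions made for illustration. Only the field list and the 128-entry capacity come from the slides.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LabEntry:
    """One instruction committed by the speculative thread (hypothetical layout)."""
    pc: int
    op_type: str                      # e.g. "branch", "arith", "load", "store"
    src_regs: Tuple[int, ...]         # source register numbers
    src_vals: Tuple[int, ...]         # source register values seen by ST
    dst_reg: Optional[int] = None     # destination register number, if any
    dst_val: Optional[int] = None     # destination register value, if any
    eff_addr: Optional[int] = None    # effective address for loads and stores


class LookAheadBuffer:
    """FIFO of LabEntry objects (assumed ordering: oldest entry verified first)."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._fifo = deque()

    def push(self, entry: LabEntry) -> bool:
        # Assumption: when the buffer is full the producer has to wait.
        if len(self._fifo) >= self.capacity:
            return False
        self._fifo.append(entry)
        return True

    def pop_oldest(self) -> Optional[LabEntry]:
        return self._fifo.popleft() if self._fifo else None

    def squash(self) -> int:
        """Discard every pending entry and report how many were thrown away."""
        discarded = len(self._fifo)
        self._fifo.clear()
        return discarded

    def __len__(self) -> int:
        return len(self._fifo)
```

`squash()` returns the number of discarded entries so that a driver could accumulate the kind of "squashed instructions per synchronization" statistic discussed later in the talk.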
11
Verification Engine
BRANCHES: source value tested, program counter updated
12
Verification Engine
BRANCHES: source value tested, program counter updated
ARITH Is: source values tested, destination register updated
13
Verification Engine
BRANCHES: source value tested, program counter updated
ARITH Is: source values tested, destination register updated
STORES: effective address verified, destination memory updated
14
Verification Engine
BRANCHES: source value tested, program counter updated
ARITH Is: source values tested, destination register updated
STORES: effective address verified, destination memory updated
LOADS: effective address verified, memory value checked, register updated
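The four rules above can be sketched as a small software model. This is an illustrative sketch only: `Entry`, `ArchState`, `verify` and `drain` are hypothetical names, the `next_pc` field is an assumption (the slides do not say how the branch outcome is stored), and the effective-address check is modelled by testing the source values the address was computed from.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Entry:
    op: str                           # "branch" | "arith" | "store" | "load"
    src_regs: Tuple[int, ...] = ()
    src_vals: Tuple[int, ...] = ()    # values the speculative thread used
    dst_reg: Optional[int] = None
    dst_val: Optional[int] = None     # result, store data, or loaded value
    eff_addr: Optional[int] = None
    next_pc: Optional[int] = None     # assumed field: path followed after a branch


@dataclass
class ArchState:
    regs: Dict[int, int] = field(default_factory=dict)
    mem: Dict[int, int] = field(default_factory=dict)
    pc: int = 0


def verify(e: Entry, s: ArchState) -> bool:
    """Apply one entry to the non-speculative state; False means misspeculation."""
    # Every rule starts by testing the recorded source values.
    if any(s.regs.get(r, 0) != v for r, v in zip(e.src_regs, e.src_vals)):
        return False
    if e.op == "branch":
        s.pc = e.next_pc
    elif e.op == "arith":
        s.regs[e.dst_reg] = e.dst_val
    elif e.op == "store":
        s.mem[e.eff_addr] = e.dst_val
    elif e.op == "load":
        if s.mem.get(e.eff_addr, 0) != e.dst_val:   # memory value checked
            return False
        s.regs[e.dst_reg] = e.dst_val
    else:
        raise ValueError(f"unknown op {e.op!r}")
    return True


def drain(lab: List[Entry], s: ArchState) -> Tuple[int, int]:
    """Verify in order; return (verified, squashed) counts."""
    for i, e in enumerate(lab):
        if not verify(e, s):
            return i, len(lab) - i    # the failing entry and all younger ones
    return len(lab), 0
```

In this model a single stale source value discards the failing entry and everything younger, including entries that did not depend on it. That behaviour is what the next slides quantify.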
15
Squashed Is from LAB

On average, up to 85 instructions are squashed
from LAB in each thread synchronization

16
Correctly Executed Is

On average, over 20% of the squashed instructions
were correctly executed by ST

17
Our Proposal

Enhanced Verification Engine
  does not throw away execution results of instructions that are independent of the mispredicted point
  reduces the number of Is fetched and executed
  thread synchronizations can be delayed or even aborted
  verification of branches, loads, stores and single-cycle instructions is reconsidered

18
Related Work

Instruction reissue: Lipasti 1997, González and González 1997, Sato 1998
Squash reuse: Sodani and Sohi 1997
Control independence in trace processors: Rotenberg et al. 1997
Dynamic control independence: Chou et al. 1999
Register integration: Roth and Sohi 2000

19
Enhanced Verification Engine
BRANCHES: branch target is validated instead of source values.
20
Enhanced Verification Engine
BRANCHES: branch target is validated instead of source values.
ARITH Is: if source values do not match, the instruction is re-executed.
21
Enhanced Verification Engine
BRANCHES: branch target is validated instead of source values.
ARITH Is: if source values do not match, the instruction is re-executed.
STORES: if the effective address check fails, the address is re-computed and memory is updated with the value obtained from the non-speculative architectural state.
22
Enhanced Verification Engine
BRANCHES: branch target is validated instead of source values.
ARITH Is: if source values do not match, the instruction is re-executed.
STORES: if the effective address check fails, the address is re-computed and memory is updated with the value obtained from the non-speculative architectural state.
LOADS: if the effective address check fails, the address is re-computed and the destination value obtained from memory is committed to the register file.
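A rough software model of the enhanced rules is sketched below. Everything beyond the four rules themselves is an assumption made for illustration: the tiny instruction set (`add`, `bnez`, `load`, `store`), the fixed 4-byte instruction size, and the names `Entry`, `ArchState`, `verify_enhanced` and `drain`. The model also assumes the speculative thread computed correctly from the inputs it saw, so recomputing from live values gives the recorded result whenever the sources match.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Entry:
    op: str                        # hypothetical mini-ISA: add | bnez | load | store
    src_regs: Tuple[int, ...]
    src_vals: Tuple[int, ...]      # values the speculative thread saw
    dst_reg: Optional[int] = None
    imm: int = 0                   # address offset for load/store
    pc: int = 0
    target: int = 0                # taken target of bnez
    next_pc: int = 0               # path the speculative thread followed


@dataclass
class ArchState:
    regs: Dict[int, int] = field(default_factory=dict)
    mem: Dict[int, int] = field(default_factory=dict)


def verify_enhanced(e: Entry, s: ArchState) -> str:
    """Return "ok", "reexecuted", or "sync" (thread synchronization needed)."""
    live = tuple(s.regs.get(r, 0) for r in e.src_regs)
    if e.op == "bnez":
        # Only the branch target is validated, not the source values.
        actual = e.target if live[0] != 0 else e.pc + 4   # assumed 4-byte instructions
        return "ok" if actual == e.next_pc else "sync"
    if e.op == "add":
        s.regs[e.dst_reg] = live[0] + live[1]
    elif e.op == "load":
        # Address recomputed from live base; value taken from memory.
        s.regs[e.dst_reg] = s.mem.get(live[0] + e.imm, 0)
    elif e.op == "store":
        # Address and data both come from non-speculative state.
        s.mem[live[0] + e.imm] = live[1]
    else:
        raise ValueError(f"unknown op {e.op!r}")
    return "ok" if live == e.src_vals else "reexecuted"


def drain(lab: List[Entry], s: ArchState) -> List[str]:
    """Verify entries in order, stopping at the first synchronization."""
    statuses = []
    for e in lab:
        statuses.append(verify_enhanced(e, s))
        if statuses[-1] == "sync":
            break
    return statuses
```

In this sketch a stale source value only costs a re-execution, and younger independent entries are kept. A synchronization is forced only when a branch resolves to a different path than the one the speculative thread followed. The one-re-execution-per-cycle limit from the simulation parameters is not modelled.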
23
Incorrect Speculated Is

Only 1% of the Is inserted in the LAB are incorrectly
predicted

On average, close to 90% of the instructions are
branches, loads, stores and single-cycle
instructions

24
Experimental Framework

Simulator: Alpha version of the SimpleScalar Toolset
Benchmarks: Spec2000, ref input
Maximum optimization level: DEC C and F77 compilers with -non_shared -O5
Statistics collected for 250 million instructions, after skipping an initial part of 500 million instructions

25
Simulation Parameters

Base microarchitecture
  out-of-order machine, 4 instructions per cycle
  I cache 16KB, D cache 16KB, shared L2 256KB
  bimodal predictor
TSMA additional structures
  each thread: I window, reorder buffer, register file
  speculative data cache: 1KB
  trace table: 128 entries, 4-way set associative
  look-ahead buffer: 128 entries
  verification engine: up to 8 instructions per cycle
  only one I re-executed per cycle

26
Thread Synchronizations
[Chart: Conventional VE vs. Enhanced VE]

On average, the number of thread synchronizations
is about 10% lower (from 30% to 20%)

27
Speedup
[Chart: speedup, Conventional VE vs. Enhanced VE; vertical axis from 1.00 to 1.45]

On average, the performance improvement
is around 9%

28
Executed Is Reduced

On average, almost 8% fewer instructions are
executed with the enhanced VE

29
Conclusions

TSMA
  a significant number of Is are correctly executed, but discarded when synchronizing
  novel hardware technique to enhance TSMA
Enhanced Verification Engine
  thread synchronizations are delayed or even aborted
  verification of branches, loads, stores and single-cycle Is is reconsidered
Results show
  speedup of 38% (9% improvement)
  misprediction rate of 20% (10% reduction)

30
Future Work