Title: Microarchitectural Techniques to Exploit Repetitive Computations and Values
1. Microarchitectural Techniques to Exploit Repetitive Computations and Values
Thesis Defense (Barcelona, December 14, 2005)
Advisors: Antonio González and Jordi Tubella
2. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
3. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
4. Motivation
- Software is general by design
  - real-world programs
  - operating systems
- Often designed with future expansion and code reuse in mind
- Input sets have little variation
5. Types of Repetition
[Figure: repetition in a computation z = F(x, y)]
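As an illustration of the two kinds of repetition the figure distinguishes, the following minimal C fragment (purely illustrative, not from the thesis) re-executes a computation with the same operands and also produces the same value from different operands:

#include <stdio.h>

/* Illustrative helper, not from the thesis. */
static int F(int x, int y) { return x * y + 1; }

int main(void) {
    int z1 = F(3, 4);  /* first execution: the result must be computed      */
    int z2 = F(3, 4);  /* repetitive computation: same (x, y), same result  */
    int z3 = F(2, 6);  /* different operands, yet the same value 13 again:  */
                       /* a repetitive value                                */
    printf("%d %d %d\n", z1, z2, z3);
    return 0;
}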
6. Repetitive Computations
[Chart: percentage of repetitive computations per benchmark]
Spec CPU2000, 500 million instructions
7. Types of Repetition
[Figure: repetition in a computation z = F(x, y)]
8. Repetitive Values
[Chart: percentage of repetitive values per benchmark]
Spec CPU2000, 500 million instructions, analysis of destination values
9. Objectives
10. Experimental Framework
- Methodology
  - Analysis of benchmarks
  - Definition of proposal
  - Evaluation of proposal
- Tools
  - Atom
  - Cacti 3.0
  - SimpleScalar Tool Set
- Benchmarks
  - Spec CPU95
  - Spec CPU2000
11. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
12. Techniques to Improve Memory
Value Repetition
13. Redundant Store Instructions
[Figure: STORE (@i, Value Y) is a redundant store when location @i already holds value Y]
- Contributions
  - Redundant store instructions
  - Analysis of repetition in the same storage location
  - Redundant stores applied to reduce memory traffic
- Main results
  - 15-25% of store instructions are redundant
  - 5-20% reduction in memory traffic
Molina, González, Tubella, "Reducing Memory Traffic via Redundant Store Instructions", HPCN'99
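A minimal sketch of the redundant-store idea follows. It is illustrative only: the tiny backing array and the statistics counters are assumptions for the sake of the example, not the actual HPCN'99 hardware mechanism.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MEM_WORDS 1024                        /* tiny backing store for the sketch */
static uint32_t memory[MEM_WORDS];

static unsigned long total_stores = 0;        /* hypothetical statistics counters */
static unsigned long redundant_stores = 0;

/* A store is redundant when the addressed location already holds the
 * value being written; such a store can be dropped, saving the write
 * and its memory traffic. Returns true if the store was filtered. */
bool filtered_store(uint32_t addr, uint32_t value)
{
    total_stores++;
    if (memory[addr % MEM_WORDS] == value) {
        redundant_stores++;
        return true;                          /* drop the redundant write */
    }
    memory[addr % MEM_WORDS] = value;
    return false;
}

int main(void)
{
    filtered_store(16, 7);                    /* first write: not redundant  */
    filtered_store(16, 7);                    /* same value again: redundant */
    printf("%lu of %lu stores were redundant\n", redundant_stores, total_stores);
    return 0;
}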
14. Non Redundant Data Cache
- Contributions
  - Analysis of value repetition in several storage locations
  - Non redundant data cache (NRC)
- Main results
  - On average, a value is stored 4 times at any given time
  - NRC: -32% area, -13% energy, -25% latency, +5% miss rate
Molina, Aliagas, García, Tubella, González, "Non Redundant Data Cache", ISLPED'03
Aliagas, Molina, García, González, Tubella, "Value Compression to Reduce Power in Data Caches", EUROPAR'03
15. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
16. Techniques to Speed Up Instruction Execution
Computation Repetition
- Avoid serialization caused by data dependences
- Determine results of instructions without executing them
- The target is to speed up the execution of programs
22. Instruction Level Reuse (ILR)
[Figure: Reuse Table / Redundant Computation Buffer (RCB)]
- Contributions
  - Performance potential of ILR
  - Redundant Computation Buffer (RCB)
- Main results
  - Ideal ILR speed-up of 1.5
  - RCB speed-up of 1.1 (outperforms previous proposals)
Molina, González, Tubella, "Dynamic Removal of Redundant Computations", ICS'99
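The sketch below shows instruction-level reuse with a PC-indexed reuse table. The entry format and lookup policy are simplified illustrations of the general idea, not the exact RCB organization of the ICS'99 proposal.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define REUSE_ENTRIES 1024                    /* assumed table size */

struct reuse_entry {
    uint64_t pc, op1, op2, result;
    bool valid;
};
static struct reuse_entry reuse_table[REUSE_ENTRIES];

/* If this PC executed before with identical operand values, hand back
 * the memoized result so the instruction need not be executed again. */
bool reuse_lookup(uint64_t pc, uint64_t op1, uint64_t op2, uint64_t *result)
{
    struct reuse_entry *e = &reuse_table[pc % REUSE_ENTRIES];
    if (e->valid && e->pc == pc && e->op1 == op1 && e->op2 == op2) {
        *result = e->result;
        return true;
    }
    return false;
}

/* Called after an instruction really executes, to record its outcome. */
void reuse_update(uint64_t pc, uint64_t op1, uint64_t op2, uint64_t result)
{
    struct reuse_entry *e = &reuse_table[pc % REUSE_ENTRIES];
    *e = (struct reuse_entry){ pc, op1, op2, result, true };
}

int main(void)
{
    uint64_t r;
    reuse_update(0x400100, 3, 4, 12);                 /* first execution */
    if (reuse_lookup(0x400100, 3, 4, &r))             /* same operands   */
        printf("reused result: %llu\n", (unsigned long long)r);
    return 0;
}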
23. Trace Level Reuse (TLR)
- Contributions
  - Trace-level reuse concept
  - Initial design issues for integrating TLR
  - Performance potential of TLR
- Main results
  - Ideal TLR speed-up of 3.6
  - 4K-entry table: 25% reuse, average trace size of 6 instructions
González, Tubella, Molina, "Trace-Level Reuse", ICPP'99
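A minimal sketch of trace-level reuse follows: a trace entry records the live-input values a dynamic instruction sequence consumed and the live-output values it produced; if the live-input test succeeds, the whole trace is skipped. The entry format, sizes, and the toy register file are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

#define NUM_REGS 32
#define MAX_LIVE 8                            /* assumed per-trace limit */

static uint64_t regs[NUM_REGS];               /* toy register file */

struct trace_entry {
    uint64_t initial_pc, final_pc;
    int n_in, n_out;
    int in_reg[MAX_LIVE];   uint64_t in_val[MAX_LIVE];    /* live inputs  */
    int out_reg[MAX_LIVE];  uint64_t out_val[MAX_LIVE];   /* live outputs */
};

/* Live-input test: if every recorded input still matches the current
 * register state, the whole trace is skipped and its outputs written.
 * Returns the PC to continue from. */
uint64_t try_trace_reuse(const struct trace_entry *t)
{
    for (int i = 0; i < t->n_in; i++)
        if (regs[t->in_reg[i]] != t->in_val[i])
            return t->initial_pc;             /* must execute normally */
    for (int i = 0; i < t->n_out; i++)
        regs[t->out_reg[i]] = t->out_val[i];  /* skip the trace */
    return t->final_pc;
}

int main(void)
{
    regs[1] = 10;
    struct trace_entry t = { .initial_pc = 0x100, .final_pc = 0x140,
                             .n_in = 1, .n_out = 1,
                             .in_reg = {1},  .in_val = {10},
                             .out_reg = {2}, .out_val = {99} };
    printf("continue at 0x%llx, r2 = %llu\n",
           (unsigned long long)try_trace_reuse(&t),
           (unsigned long long)regs[2]);
    return 0;
}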
24. Trace Level Speculation (TLS)
- Contributions
  - Trace Level Speculative Multithreaded Architecture (TSMA)
  - Compiler analysis to support TSMA
- Main results
  - Speed-up of 1.38 with 20% misspeculations
Molina, González, Tubella, "Trace-Level Speculative Multithreaded Architecture (TSMA)", ICCD'02
Molina, González, Tubella, "Compiler Analysis for TSMA", INTERACT'05
Molina, Tubella, González, "Reducing Misspeculation Penalty in TSMA", ISHPC'05
25. Objectives & Proposals
- To improve the memory system
  - Redundant store instructions
  - Non redundant data cache
- To speed up the execution of instructions
  - Redundant computation buffer (ILR)
  - Trace-level reuse buffer (TLR)
  - Trace-level speculative multithreaded architecture (TLS)
26. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
27. Motivation
- Caches occupy close to 50% of total die area
- Caches are responsible for a significant part of the total power dissipated by a processor
28. Data Value Repetition
[Chart: percentage of repetitive values vs. percentage of time]
Spec CPU2000, 1 billion instructions, 256KB data cache
29. Conventional Cache
Value Repetition
30. Non Redundant Data Cache
[Figure: Pointer Table and Value Table; die area reduction]
31. Non Redundant Data Cache
[Figure: Pointer Table and Value Table]
32. Non Redundant Data Cache
[Figure: Pointer Table and Value Table]
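The sketch below captures the organization shown in these slides: the pointer table has one entry per cached word, but instead of the data it holds an index into a much smaller value table where each distinct value is stored only once. Table sizes, the linear search, and the round-robin allocation are illustrative assumptions, not the evaluated design.

#include <stdint.h>
#include <stdio.h>

#define PT_ENTRIES 4096                       /* assumed pointer-table size */
#define VT_ENTRIES 1024                       /* assumed value-table size   */

static uint16_t pointer_table[PT_ENTRIES];    /* one pointer per cached word     */
static uint32_t value_table[VT_ENTRIES];      /* each distinct value stored once */
static unsigned vt_next = 0;                  /* naive allocation cursor         */

/* Read path: the pointer selects the shared value-table entry. */
uint32_t nrc_read(unsigned pt_index)
{
    return value_table[pointer_table[pt_index % PT_ENTRIES]];
}

/* Write path (simplified): if the value already exists in the value
 * table, the word only needs a pointer to it; otherwise an entry is
 * allocated. A real design would use an associative search and
 * reference counts instead of this scan and round-robin allocation. */
void nrc_write(unsigned pt_index, uint32_t value)
{
    for (unsigned i = 0; i < VT_ENTRIES; i++)
        if (value_table[i] == value) {        /* value shared across words */
            pointer_table[pt_index % PT_ENTRIES] = (uint16_t)i;
            return;
        }
    value_table[vt_next] = value;
    pointer_table[pt_index % PT_ENTRIES] = (uint16_t)vt_next;
    vt_next = (vt_next + 1) % VT_ENTRIES;
}

int main(void)
{
    nrc_write(10, 0xCAFE);                    /* two words holding the same  */
    nrc_write(20, 0xCAFE);                    /* value share one VT entry    */
    printf("%x %x\n", nrc_read(10), nrc_read(20));
    return 0;
}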
33. Data Value Inlining
- Some values can be represented with a small number of bits (narrow values)
- Narrow values can be inlined into the pointer area
  - simple sign extension is applied (see the narrow-value check sketched below)
- Benefits
  - enlarges the effective capacity of the VT
  - reduces latency
  - reduces power dissipation
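A minimal sketch of the narrow-value test behind data value inlining: a value can be placed directly in the pointer field when sign-extending its low-order bits reproduces the full value. The pointer-field width used here is an assumption for illustration.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define POINTER_BITS 12                       /* assumed pointer-field width */

/* A value is "narrow" when it fits in the pointer field as a signed
 * number, i.e. sign-extending its low POINTER_BITS bits gives the value
 * back. Such values can be inlined in the pointer table instead of
 * occupying a value-table entry. */
bool is_narrow(int32_t value)
{
    int32_t max =  (1 << (POINTER_BITS - 1)) - 1;   /*  2047 for 12 bits */
    int32_t min = -(1 << (POINTER_BITS - 1));       /* -2048 for 12 bits */
    return value >= min && value <= max;
}

int main(void)
{
    printf("%d %d\n", is_narrow(1234), is_narrow(100000));  /* prints 1 0 */
    return 0;
}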
34. Non Redundant Data Cache
[Figure: Pointer Table and Value Table with Data Value Inlining]
35. Miss Rate vs. Die Area
[Chart: miss ratio vs. die area (cm2) for L2 caches from 256KB to 4MB, comparing VT50, VT30, VT20, and CONV configurations]
Spec CPU2000, 1 billion instructions
36. Results
- Caches ranging from 256 KB to 4 MB
37. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
38. Trace Level Speculation
- Avoids serialization caused by data dependences
- Skips multiple instructions in a row
- Predicts values based on the past
- Solves the live-input test
- Introduces penalties due to misspeculations
39. Trace Level Speculation
- Two orthogonal issues
  - microarchitecture support for trace speculation
  - control and data speculation techniques
    - prediction of initial and final points
    - prediction of live-output values
- Trace Level Speculative Multithreaded Architecture (TSMA)
  - does not introduce significant misspeculation penalties
- Compiler Analysis
  - based on static analysis that uses profiling data
40. Trace Level Speculation with Live-Output Test
[Figure: speculative thread (ST) and non-speculative thread (NST)]
41. TSMA Block Diagram
[Figure: block diagram, including the Look-Ahead Buffer]
42. Compiler Analysis
- Focuses on
  - developing effective trace selection schemes for TSMA
  - based on static analysis that uses profiling data
- Trace Selection
  - Graph Construction (CFG and DDG)
  - Graph Analysis
43. Graph Analysis
- Two important issues
  - initial and final points of a trace
    - maximize trace length, minimize misspeculations
  - predictability of live-output values
    - prediction accuracy and utilization degree
- Three basic heuristics
  - Procedure Trace Heuristic
  - Loop Trace Heuristic
  - Instruction Chaining Trace Heuristic
44. Trace Speculation Engine
- Traces are communicated to the hardware
  - at program loading time
  - by filling a special hardware structure (the trace table)
- Each entry of the trace table contains (sketched in code below)
  - initial PC
  - final PC
  - live-output values information
  - branch history
  - frequency counter
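A plain-C rendering of the fields just listed might look like the following; the field widths and the live-output encoding are illustrative assumptions, not the exact TSMA trace-table format.

#include <stdint.h>

#define MAX_LIVE_OUT 8                        /* assumed per-trace limit */

/* One trace-table entry, mirroring the fields listed above. */
struct trace_table_entry {
    uint64_t initial_pc;                      /* where speculation starts       */
    uint64_t final_pc;                        /* where execution resumes        */
    int      n_out;                           /* live-output values information */
    int      out_reg[MAX_LIVE_OUT];
    uint64_t out_val[MAX_LIVE_OUT];
    uint32_t branch_history;                  /* path the trace corresponds to  */
    uint32_t frequency;                       /* frequency counter              */
};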
45. Simulation Parameters
- Base microarchitecture
  - out-of-order machine, 4 instructions per cycle
  - 16KB instruction cache, 16KB data cache, 256KB shared L2
  - bimodal branch predictor
  - 64-entry ROB; FUs: 4 integer, 2 divide, 2 multiply, 4 floating-point
- TSMA additional structures
  - per-thread instruction window, reorder buffer, and register file
  - speculative data cache: 1KB
  - trace table: 128 entries, 4-way set associative
  - look-ahead buffer: 128 entries
  - verification engine: up to 8 instructions per cycle
46. Speedup
[Chart: TSMA speedup per benchmark, y-axis from 1.00 to 1.45]
Spec CPU2000, 250 million instructions
47. Misspeculations
Spec CPU2000, 250 million instructions
48. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
49. Conclusions
- Repetition is very common in programs
- It can be exploited
  - to improve the memory system
  - to speed up the execution of instructions
- Several alternatives have been investigated
  - novel cache organizations
  - an instruction-level reuse approach
  - the trace-level reuse concept
  - a trace-level speculation architecture
50. Future Work
- Value repetition in instruction caches
- Profiling to support data value reuse schemes
- Traces starting at different PCs
- Value prediction in TSMA
- Multiple speculations in TSMA
- Multiple threads in TSMA
51. Publications
- Value Repetition in Cache Organizations
  - Reducing Memory Traffic via Redundant Store Instructions, HPCN'99
  - Non Redundant Data Cache, ISLPED'03
  - Value Compression to Reduce Power in Data Caches, EUROPAR'03
- Instruction & Trace Level Reuse
  - The Performance Potential of Data Value Reuse, TR-UPC-DAC98
  - Dynamic Removal of Redundant Computations, ICS'99
  - Trace-Level Reuse, ICPP'99
- Trace Level Speculation
  - Trace-Level Speculative Multithreaded Architecture, ICCD'02
  - Compiler Analysis for TSMA, INTERACT'05
  - Reducing Misspeculation Penalty in TSMA, ISHPC'05
52. Microarchitectural Techniques to Exploit Repetitive Computations and Values
Thesis Defense (Barcelona, December 14, 2005)
Advisors: Antonio González and Jordi Tubella