Title: Microarchitectural Techniques to Exploit Repetitive Computations and Values
1. Microarchitectural Techniques to Exploit Repetitive Computations and Values
Thesis Defense (Barcelona, December 14, 2005)
Advisors: Antonio González and Jordi Tubella
2. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
3. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
4. Motivation
- Software is general by design
  - real-world programs
  - operating systems
- Often designed with future expansion and code reuse in mind
- Input sets have little variation
5. Types of Repetition
[Figure: repetition in a computation z = F(x, y)]
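As an illustration of the two kinds of repetition the figure distinguishes, the following minimal C fragment (purely illustrative, not from the thesis) re-executes a computation with the same operands and also produces the same value from different operands:

#include <stdio.h>

/* Illustrative helper, not from the thesis. */
static int F(int x, int y) { return x * y + 1; }

int main(void) {
    int z1 = F(3, 4);  /* first execution: the result must be computed      */
    int z2 = F(3, 4);  /* repetitive computation: same (x, y), same result  */
    int z3 = F(2, 6);  /* different operands, yet the same value 13 again:  */
                       /* a repetitive value                                */
    printf("%d %d %d\n", z1, z2, z3);
    return 0;
}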
6. Repetitive Computations
[Chart: percentage of repetitive computations per benchmark]
Spec CPU2000, 500 million instructions
7. Types of Repetition
[Figure: repetition in a computation z = F(x, y)]
8. Repetitive Values
[Chart: percentage of repetitive values per benchmark]
Spec CPU2000, 500 million instructions, analysis of destination values
9. Objectives
10. Experimental Framework
- Methodology
  - Analysis of benchmarks
  - Definition of proposal
  - Evaluation of proposal
- Tools
  - Atom
  - Cacti 3.0
  - SimpleScalar Tool Set
- Benchmarks
  - Spec CPU95
  - Spec CPU2000
11. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
12. Techniques to Improve Memory
Value Repetition
13. Redundant Store Instructions
[Figure: STORE (@i, Value Y) is a redundant store when location @i already holds value Y]
- Contributions
  - Redundant store instructions
  - Analysis of repetition in the same storage location
  - Redundant stores applied to reduce memory traffic
- Main results
  - 15-25% of store instructions are redundant
  - 5-20% reduction in memory traffic
Molina, González, Tubella, "Reducing Memory Traffic via Redundant Store Instructions", HPCN'99
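A minimal sketch of the redundant-store idea follows. It is illustrative only: the tiny backing array and the statistics counters are assumptions for the sake of the example, not the actual HPCN'99 hardware mechanism.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MEM_WORDS 1024                        /* tiny backing store for the sketch */
static uint32_t memory[MEM_WORDS];

static unsigned long total_stores = 0;        /* hypothetical statistics counters */
static unsigned long redundant_stores = 0;

/* A store is redundant when the addressed location already holds the
 * value being written; such a store can be dropped, saving the write
 * and its memory traffic. Returns true if the store was filtered. */
bool filtered_store(uint32_t addr, uint32_t value)
{
    total_stores++;
    if (memory[addr % MEM_WORDS] == value) {
        redundant_stores++;
        return true;                          /* drop the redundant write */
    }
    memory[addr % MEM_WORDS] = value;
    return false;
}

int main(void)
{
    filtered_store(16, 7);                    /* first write: not redundant  */
    filtered_store(16, 7);                    /* same value again: redundant */
    printf("%lu of %lu stores were redundant\n", redundant_stores, total_stores);
    return 0;
}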
14. Non Redundant Data Cache
- Contributions
  - Analysis of value repetition in several storage locations
  - Non redundant data cache (NRC)
- Main results
  - On average, a value is stored 4 times at any given time
  - NRC: -32% area, -13% energy, -25% latency, +5% miss rate
Molina, Aliagas, García, Tubella, González, "Non Redundant Data Cache", ISLPED'03
Aliagas, Molina, García, González, Tubella, "Value Compression to Reduce Power in Data Caches", EUROPAR'03
15. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
16. Techniques to Speed Up Instruction Execution
Computation Repetition
- Avoid serialization caused by data dependences
- Determine results of instructions without executing them
- The target is to speed up the execution of programs
22. Instruction Level Reuse (ILR)
[Figure: Reuse Table / Redundant Computation Buffer (RCB)]
- Contributions
  - Performance potential of ILR
  - Redundant Computation Buffer (RCB)
- Main results
  - Ideal ILR speed-up of 1.5
  - RCB speed-up of 1.1 (outperforms previous proposals)
Molina, González, Tubella, "Dynamic Removal of Redundant Computations", ICS'99
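The sketch below shows instruction-level reuse with a PC-indexed reuse table. The entry format and lookup policy are simplified illustrations of the general idea, not the exact RCB organization of the ICS'99 proposal.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define REUSE_ENTRIES 1024                    /* assumed table size */

struct reuse_entry {
    uint64_t pc, op1, op2, result;
    bool valid;
};
static struct reuse_entry reuse_table[REUSE_ENTRIES];

/* If this PC executed before with identical operand values, hand back
 * the memoized result so the instruction need not be executed again. */
bool reuse_lookup(uint64_t pc, uint64_t op1, uint64_t op2, uint64_t *result)
{
    struct reuse_entry *e = &reuse_table[pc % REUSE_ENTRIES];
    if (e->valid && e->pc == pc && e->op1 == op1 && e->op2 == op2) {
        *result = e->result;
        return true;
    }
    return false;
}

/* Called after an instruction really executes, to record its outcome. */
void reuse_update(uint64_t pc, uint64_t op1, uint64_t op2, uint64_t result)
{
    struct reuse_entry *e = &reuse_table[pc % REUSE_ENTRIES];
    *e = (struct reuse_entry){ pc, op1, op2, result, true };
}

int main(void)
{
    uint64_t r;
    reuse_update(0x400100, 3, 4, 12);                 /* first execution */
    if (reuse_lookup(0x400100, 3, 4, &r))             /* same operands   */
        printf("reused result: %llu\n", (unsigned long long)r);
    return 0;
}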
23. Trace Level Reuse (TLR)
- Contributions
  - Trace-level reuse concept
  - Initial design issues for integrating TLR
  - Performance potential of TLR
- Main results
  - Ideal TLR speed-up of 3.6
  - 4K-entry table: 25% reuse, average trace size of 6 instructions
González, Tubella, Molina, "Trace-Level Reuse", ICPP'99
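A minimal sketch of trace-level reuse follows: a trace entry records the live-input values a dynamic instruction sequence consumed and the live-output values it produced; if the live-input test succeeds, the whole trace is skipped. The entry format, sizes, and the toy register file are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

#define NUM_REGS 32
#define MAX_LIVE 8                            /* assumed per-trace limit */

static uint64_t regs[NUM_REGS];               /* toy register file */

struct trace_entry {
    uint64_t initial_pc, final_pc;
    int n_in, n_out;
    int in_reg[MAX_LIVE];   uint64_t in_val[MAX_LIVE];    /* live inputs  */
    int out_reg[MAX_LIVE];  uint64_t out_val[MAX_LIVE];   /* live outputs */
};

/* Live-input test: if every recorded input still matches the current
 * register state, the whole trace is skipped and its outputs written.
 * Returns the PC to continue from. */
uint64_t try_trace_reuse(const struct trace_entry *t)
{
    for (int i = 0; i < t->n_in; i++)
        if (regs[t->in_reg[i]] != t->in_val[i])
            return t->initial_pc;             /* must execute normally */
    for (int i = 0; i < t->n_out; i++)
        regs[t->out_reg[i]] = t->out_val[i];  /* skip the trace */
    return t->final_pc;
}

int main(void)
{
    regs[1] = 10;
    struct trace_entry t = { .initial_pc = 0x100, .final_pc = 0x140,
                             .n_in = 1, .n_out = 1,
                             .in_reg = {1},  .in_val = {10},
                             .out_reg = {2}, .out_val = {99} };
    printf("continue at 0x%llx, r2 = %llu\n",
           (unsigned long long)try_trace_reuse(&t),
           (unsigned long long)regs[2]);
    return 0;
}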
24. Trace Level Speculation (TLS)
- Contributions
  - Trace Level Speculative Multithreaded Architecture (TSMA)
  - Compiler analysis to support TSMA
- Main results
  - Speed-up of 1.38 with 20% misspeculations
Molina, González, Tubella, "Trace-Level Speculative Multithreaded Architecture (TSMA)", ICCD'02
Molina, González, Tubella, "Compiler Analysis for TSMA", INTERACT'05
Molina, Tubella, González, "Reducing Misspeculation Penalty in TSMA", ISHPC'05
25. Objectives & Proposals
- To improve the memory system
  - Redundant store instructions
  - Non redundant data cache
- To speed up the execution of instructions
  - Redundant computation buffer (ILR)
  - Trace-level reuse buffer (TLR)
  - Trace-level speculative multithreaded architecture (TLS)
26. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
27. Motivation
- Caches occupy close to 50% of total die area
- Caches are responsible for a significant part of the total power dissipated by a processor
28. Data Value Repetition
[Chart: percentage of repetitive values vs. percentage of time]
Spec CPU2000, 1 billion instructions, 256KB data cache
29. Conventional Cache
Value Repetition
30. Non Redundant Data Cache
[Figure: Pointer Table and Value Table; die area reduction]
31. Non Redundant Data Cache
[Figure: Pointer Table and Value Table]
32. Non Redundant Data Cache
[Figure: Pointer Table and Value Table]
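The sketch below captures the organization shown in these slides: the pointer table has one entry per cached word, but instead of the data it holds an index into a much smaller value table where each distinct value is stored only once. Table sizes, the linear search, and the round-robin allocation are illustrative assumptions, not the evaluated design.

#include <stdint.h>
#include <stdio.h>

#define PT_ENTRIES 4096                       /* assumed pointer-table size */
#define VT_ENTRIES 1024                       /* assumed value-table size   */

static uint16_t pointer_table[PT_ENTRIES];    /* one pointer per cached word     */
static uint32_t value_table[VT_ENTRIES];      /* each distinct value stored once */
static unsigned vt_next = 0;                  /* naive allocation cursor         */

/* Read path: the pointer selects the shared value-table entry. */
uint32_t nrc_read(unsigned pt_index)
{
    return value_table[pointer_table[pt_index % PT_ENTRIES]];
}

/* Write path (simplified): if the value already exists in the value
 * table, the word only needs a pointer to it; otherwise an entry is
 * allocated. A real design would use an associative search and
 * reference counts instead of this scan and round-robin allocation. */
void nrc_write(unsigned pt_index, uint32_t value)
{
    for (unsigned i = 0; i < VT_ENTRIES; i++)
        if (value_table[i] == value) {        /* value shared across words */
            pointer_table[pt_index % PT_ENTRIES] = (uint16_t)i;
            return;
        }
    value_table[vt_next] = value;
    pointer_table[pt_index % PT_ENTRIES] = (uint16_t)vt_next;
    vt_next = (vt_next + 1) % VT_ENTRIES;
}

int main(void)
{
    nrc_write(10, 0xCAFE);                    /* two words holding the same  */
    nrc_write(20, 0xCAFE);                    /* value share one VT entry    */
    printf("%x %x\n", nrc_read(10), nrc_read(20));
    return 0;
}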
33. Data Value Inlining
- Some values can be represented with a small number of bits (narrow values)
- Narrow values can be inlined into the pointer area
  - simple sign extension is applied (see the narrow-value check sketched below)
- Benefits
  - enlarges the effective capacity of the VT
  - reduces latency
  - reduces power dissipation
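A minimal sketch of the narrow-value test behind data value inlining: a value can be placed directly in the pointer field when sign-extending its low-order bits reproduces the full value. The pointer-field width used here is an assumption for illustration.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define POINTER_BITS 12                       /* assumed pointer-field width */

/* A value is "narrow" when it fits in the pointer field as a signed
 * number, i.e. sign-extending its low POINTER_BITS bits gives the value
 * back. Such values can be inlined in the pointer table instead of
 * occupying a value-table entry. */
bool is_narrow(int32_t value)
{
    int32_t max =  (1 << (POINTER_BITS - 1)) - 1;   /*  2047 for 12 bits */
    int32_t min = -(1 << (POINTER_BITS - 1));       /* -2048 for 12 bits */
    return value >= min && value <= max;
}

int main(void)
{
    printf("%d %d\n", is_narrow(1234), is_narrow(100000));  /* prints 1 0 */
    return 0;
}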
34. Non Redundant Data Cache
[Figure: Pointer Table and Value Table with Data Value Inlining]
35. Miss Rate vs. Die Area
[Chart: miss ratio vs. die area (cm2) for L2 caches from 256KB to 4MB, comparing VT50, VT30, VT20, and CONV configurations]
Spec CPU2000, 1 billion instructions
36. Results
- Caches ranging from 256 KB to 4 MB
37. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
38. Trace Level Speculation
- Avoids serialization caused by data dependences
- Skips multiple instructions in a row
- Predicts values based on the past
- Solves the live-input test
- Introduces penalties due to misspeculations
39. Trace Level Speculation
- Two orthogonal issues
  - microarchitecture support for trace speculation
  - control and data speculation techniques
    - prediction of initial and final points
    - prediction of live-output values
- Trace Level Speculative Multithreaded Architecture (TSMA)
  - does not introduce significant misspeculation penalties
- Compiler Analysis
  - based on static analysis that uses profiling data
40. Trace Level Speculation with Live-Output Test
[Figure: speculative thread (ST) and non-speculative thread (NST)]
41. TSMA Block Diagram
[Figure: block diagram, including the Look-Ahead Buffer]
42. Compiler Analysis
- Focuses on
  - developing effective trace selection schemes for TSMA
  - based on static analysis that uses profiling data
- Trace Selection
  - Graph Construction (CFG and DDG)
  - Graph Analysis
43. Graph Analysis
- Two important issues
  - initial and final points of a trace
    - maximize trace length, minimize misspeculations
  - predictability of live-output values
    - prediction accuracy and utilization degree
- Three basic heuristics
  - Procedure Trace Heuristic
  - Loop Trace Heuristic
  - Instruction Chaining Trace Heuristic
44. Trace Speculation Engine
- Traces are communicated to the hardware
  - at program loading time
  - by filling a special hardware structure (the trace table)
- Each entry of the trace table contains (sketched in code below)
  - initial PC
  - final PC
  - live-output values information
  - branch history
  - frequency counter
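A plain-C rendering of the fields just listed might look like the following; the field widths and the live-output encoding are illustrative assumptions, not the exact TSMA trace-table format.

#include <stdint.h>

#define MAX_LIVE_OUT 8                        /* assumed per-trace limit */

/* One trace-table entry, mirroring the fields listed above. */
struct trace_table_entry {
    uint64_t initial_pc;                      /* where speculation starts       */
    uint64_t final_pc;                        /* where execution resumes        */
    int      n_out;                           /* live-output values information */
    int      out_reg[MAX_LIVE_OUT];
    uint64_t out_val[MAX_LIVE_OUT];
    uint32_t branch_history;                  /* path the trace corresponds to  */
    uint32_t frequency;                       /* frequency counter              */
};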
45. Simulation Parameters
- Base microarchitecture
  - out-of-order machine, 4 instructions per cycle
  - 16KB instruction cache, 16KB data cache, 256KB shared L2
  - bimodal branch predictor
  - 64-entry ROB; FUs: 4 integer, 2 divide, 2 multiply, 4 floating-point
- TSMA additional structures
  - per-thread instruction window, reorder buffer, and register file
  - speculative data cache: 1KB
  - trace table: 128 entries, 4-way set associative
  - look-ahead buffer: 128 entries
  - verification engine: up to 8 instructions per cycle
46. Speedup
[Chart: TSMA speedup per benchmark, y-axis from 1.00 to 1.45]
Spec CPU2000, 250 million instructions
47. Misspeculations
Spec CPU2000, 250 million instructions
48. Outline
- Motivation & Objectives
- Overview of Proposals
  - To improve the memory system
  - To speed up the execution of instructions
- Non Redundant Data Cache
- Trace-Level Speculative Multithreaded Architecture
- Conclusions & Future Work
49. Conclusions
- Repetition is very common in programs
- It can be exploited
  - to improve the memory system
  - to speed up the execution of instructions
- Several alternatives have been investigated
  - novel cache organizations
  - an instruction-level reuse approach
  - the trace-level reuse concept
  - a trace-level speculation architecture
50. Future Work
- Value repetition in instruction caches
- Profiling to support data value reuse schemes
- Traces starting at different PCs
- Value prediction in TSMA
- Multiple speculations in TSMA
- Multiple threads in TSMA
51. Publications
- Value Repetition in Cache Organizations
  - Reducing Memory Traffic via Redundant Store Instructions, HPCN'99
  - Non Redundant Data Cache, ISLPED'03
  - Value Compression to Reduce Power in Data Caches, EUROPAR'03
- Instruction & Trace Level Reuse
  - The Performance Potential of Data Value Reuse, TR-UPC-DAC98
  - Dynamic Removal of Redundant Computations, ICS'99
  - Trace-Level Reuse, ICPP'99
- Trace Level Speculation
  - Trace-Level Speculative Multithreaded Architecture, ICCD'02
  - Compiler Analysis for TSMA, INTERACT'05
  - Reducing Misspeculation Penalty in TSMA, ISHPC'05
52. Microarchitectural Techniques to Exploit Repetitive Computations and Values
Thesis Defense (Barcelona, December 14, 2005)
Advisors: Antonio González and Jordi Tubella