Title: Slipstream Processors
1. Slipstream Processors
- Presenter: Vimal Reddy
- Advisor: Eric Rotenberg
2. The slipstream paradigm
- Only a fraction of the dynamic instruction stream is required for correct program execution (Rotenberg '99)
- Detect and remove ineffectual instructions: run a shortened, effectual version of the program (the Advanced stream, or A-stream)
- Ensure correctness by running a complete version of the program (the Redundant stream, or R-stream)
- Slipstream created: the shortened A-stream finishes fast; the R-stream consumes near-perfect predictions from the A-stream and finishes close behind
- This redundant arrangement is much faster than conventional, non-redundant execution
3. Slipstream microarchitecture
- Multiple cores of a Chip Multiprocessor (CMP) are used to concurrently run the R-stream and A-stream
- Slipstream components
  - Instruction Removal (IR) Detector
  - IR Predictor
  - Delay Buffer
- Pipeline changes to enable instruction removal and ensure correctness
  - Core running the A-stream
    - Gets predictions from the IR predictor (not a conventional branch predictor)
    - Skips ineffectual instructions in fetch
    - Writes only to its private L1 data cache (not the shared L2)
  - Core running the R-stream
    - Gets predictions from the Delay Buffer and verifies them
4. Slipstream microarchitecture (contd.)
5. Creating the slipstream
- Main steps involved
  - Create a reduced A-stream
  - Communicate A-stream outcomes to the R-stream
  - Check the A-stream's forward progress and recover from deviations
6. Creating a reduced A-stream
- The IR detector and IR predictor combine to create the A-stream
- IR detector
  - Monitors retired R-stream instructions
  - Detects (past) ineffectualness and conveys it to the IR predictor
- IR predictor
  - Removes an instruction from the A-stream after repeated indications from the IR detector
7. Communicating outcomes
- The Delay Buffer is used to pass outcomes from the A-stream to the R-stream
- Separate control and data FIFOs
- Control flow information is complete: the IR predictor predicts all branches
- Data flow information is incomplete: 1 bit per dynamic instruction binds values to instructions
8. Memory hierarchy
- A-stream loads and stores should not interfere with the R-stream's
- Solution: exploit the typical memory hierarchy found in CMPs
  - Both the A-stream and R-stream read and write their respective private L1 data caches
  - The R-stream L1 cache is write-through: it writes to a shared L2 cache
  - The A-stream L1 cache is neither write-through nor write-back: its stores are not propagated to the shared L2 cache
- With the R-stream close behind, an evicted A-stream line is generally regenerated by the R-stream in the shared L2
- If the A-stream re-references an evicted line in the shared L2 before regeneration, it gets stale data and diverges
9. Memory hierarchy (contd.)
10. A-stream deviation detection and recovery
- When?
  - The A-stream deviates due to incorrect removal or a stale data access in its L1 data cache
- Detection?
  - A branch or value mispredict in the R-stream (known as an IR misprediction)
- Recovery?
  - Restore A-stream register state: copy values from R-stream registers using the delay buffer or a shared-memory exception handler
  - Restore A-stream memory state: invalidate the A-stream L1 data cache (more recovery models later)
11. Slipstream components: The IR detector
- Monitors retired R-stream instructions for three triggering conditions
  - Unreferenced writes
  - Non-modifying writes
  - Correctly-predicted branches
- Selects triggering instructions as candidates for removal
- Also selects their computation chains for removal: remove an instruction if it is killed and all of its consumers are selected for removal
- Computation chains are implicitly removed: removing consumers makes their producers unreferenced writes the next time around
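The first two triggering conditions can be sketched in software. This is a minimal illustration over a window of retired instructions, not the hardware mechanism, and the instruction-record fields (`dest`, `srcs`, `old_val`, `new_val`) are assumptions for the sketch:

```python
def find_triggers(window):
    """window: list of dicts with 'dest', 'srcs', 'old_val', 'new_val'.
    Returns indices of instructions that trigger removal candidacy."""
    triggers = set()
    for i, instr in enumerate(window):
        dest = instr['dest']
        # Non-modifying write: the write leaves the old value unchanged.
        if instr['new_val'] == instr['old_val']:
            triggers.add(i)
            continue
        # Unreferenced write: dest is overwritten later in the window
        # before any instruction reads it.
        for later in window[i + 1:]:
            if dest in later['srcs']:
                break  # value was referenced -> not a trigger
            if later['dest'] == dest:
                triggers.add(i)  # killed without being read
                break
    return triggers

window = [
    {'dest': 'r1', 'srcs': [],     'old_val': 0, 'new_val': 5},  # killed unread
    {'dest': 'r2', 'srcs': [],     'old_val': 7, 'new_val': 7},  # non-modifying
    {'dest': 'r1', 'srcs': ['r2'], 'old_val': 5, 'new_val': 9},
]
print(find_triggers(window))  # -> {0, 1}
```

The third condition, correctly-predicted branches, would simply compare the retired branch outcome against the prediction.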
12. Slipstream components: The IR detector
13. IR detection example
14. Slipstream components: The IR predictor
- An augmented gshare branch predictor
- Each table entry corresponds to one basic block in the dynamic instruction stream
  - Tag: start PC of the basic block
  - A 2-bit counter for predicting the branch terminating the basic block
  - Confidence counters, one per basic-block instruction, to predict its ineffectualness
- Updated by the IR detector
  - Counter incremented if the instruction is detected as removable
  - Counter reset to zero otherwise
- Saturated counter -> instruction removed from the A-stream when next encountered
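The per-instruction confidence mechanism above can be sketched as follows. The class and method names are illustrative; the saturation threshold of 64 matches the configuration reported in the results slide:

```python
SATURATION = 64  # confidence threshold; counter saturates here

class ConfidenceCounter:
    """One counter per basic-block instruction in the IR predictor."""

    def __init__(self):
        self.count = 0

    def update(self, detected_removable):
        # Updated by the IR detector as R-stream instructions retire.
        if detected_removable:
            self.count = min(self.count + 1, SATURATION)
        else:
            self.count = 0  # any effectual use resets confidence

    def remove_from_a_stream(self):
        # Saturated counter -> skip this instruction in A-stream fetch.
        return self.count == SATURATION
```

The reset-to-zero policy makes removal conservative: one effectual use forces the instruction to re-earn its removal over 64 more detections.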
15. Slipstream components: The improved IR predictor
- Key ideas
  - Use ineffectualness information to skip fetch for completely ineffectual basic blocks
  - If execution bandwidth is high, slipstream still performs well due to the fetch cycles saved
16. Slipstream components: The improved IR predictor (contd.)
17. Memory recovery models
- Invalidate the entire A-stream L1 cache
  - Complete recovery
  - Easy to implement: an invalidation signal resets the valid bits
  - Compulsory A-stream cache misses after recovery
- Invalidate only dirty lines
  - Fewer compulsory misses
  - Easy to implement: an invalidation signal resets the valid bits of dirty lines
  - Needless invalidation of lines that were dirty before the A-stream diverged
  - Incomplete recovery
    - Persistent-stale problem: clean lines brought in from the L2 before the A-stream diverged persist
    - Persistent-skipped-write problem: lines that remain clean because of incorrectly-skipped stores persist
18. Memory recovery models (contd.)
- Use invalidated lines as value predictions in the A-stream
- Key ideas
  - Invalidating a line preserves its tag and data
  - Only a few lines are corrupt when the A-stream diverges
  - Match the tag even if the cache line is invalid: use its data as a value prediction
- Summary
  - Memory recovery after A-stream divergence is easy
  - The only hardware support needed:
    - Make the L2 cache R-stream-only
    - Provide cache invalidate signals based on the recovery model
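The reuse of invalidated lines can be sketched with a toy direct-mapped cache; all names here are illustrative, and real hardware would of course operate on tag/valid bits rather than Python objects:

```python
class CacheLine:
    def __init__(self, tag, data):
        self.tag, self.data, self.valid = tag, data, True

class AStreamL1:
    """Toy direct-mapped A-stream L1 illustrating recovery behavior."""

    def __init__(self):
        self.lines = {}  # index -> CacheLine

    def invalidate_all(self):
        # Recovery: clear only the valid bits; tags and data survive.
        for line in self.lines.values():
            line.valid = False

    def load(self, index, tag):
        line = self.lines.get(index)
        if line is not None and line.tag == tag:
            if line.valid:
                return line.data, 'hit'
            # Tag matches an invalidated line: speculatively use its
            # (possibly corrupt) data as a value prediction.
            return line.data, 'value-prediction'
        return None, 'miss'  # must fetch from the shared L2

cache = AStreamL1()
cache.lines[0] = CacheLine(tag=0x40, data=123)
cache.invalidate_all()
print(cache.load(0, 0x40))  # -> (123, 'value-prediction')
```

A wrong value prediction is safe: the R-stream eventually flags it as an IR misprediction and triggers recovery again.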
19. Primary performance results (SPEC2K)
- Slipstream configuration used
  - IR predictor
    - 2^20 entries, gshare-indexed
    - 16 confidence counters per entry
    - Confidence threshold: 64
  - IR detector
    - Number of entries buffered: 128
  - Delay buffer
    - Data flow buffer: 256 entries
    - Control flow buffer: 4K branch predictions
  - Memory model
    - Invalidate dirty lines; use invalidated data as value predictions
20. Using a second processor for slipstreaming
21. Designing deployable slipstream components
- IR detector
  - An operand rename table (ORT) to detect trigger instructions
  - A FIFO to update the IR predictor on a per-basic-block basis
  - A small, ORT-like cache to detect ineffectual stores
- The delay buffer is a FIFO
- Existing memory hierarchies work well for slipstream
- The IR predictor is complex
  - Ineffectualness information is tied up with the gshare predictor
  - A tag and confidence counter per instruction add too much overhead
22. New IR predictor experiments: Finding where removal lies
- Observation
  - Removal is a 90-10 case: 90% of removal is contributed by 10% of all dynamic basic blocks
23. New IR predictor designs
- Key ideas
  - The current design stores ineffectual-removal (IR) information for the most frequently accessed basic blocks
  - Instead, store IR information only for the basic blocks that contribute most removal (the 10% of basic blocks)
24. New IR predictor designs (contd.)
- Design based on a simple filter
  - Cache the IR information and index it with (PC, BHR)
  - Problem: frequently accessed basic blocks will evict the infrequent basic blocks that contribute most removal
  - Fix: use a simple filter of counters
  - Use a regular gshare for branch prediction
25. New IR predictor designs (contd.)
- Integrate confidence counters into the I-cache
  - A table of confidence counters, one per instruction in the I-cache
  - Eliminates tag storage: leverages the I-cache tags
26. New IR predictor designs: Roving confidence counter (contd.)
- Use one roving counter per basic block
  - Eliminates having one confidence counter per instruction in a basic block
  - Instructions in a basic block time-share a single counter
- An instruction relinquishes the counter if
  - the IR detector does not select it for removal, OR
  - the IR detector selects it and the counter is saturated
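A minimal sketch of this time-sharing scheme, assuming one counter per basic block that roves on the two relinquish conditions above (the class name and saturation threshold are illustrative):

```python
SATURATION = 64

class RovingCounter:
    """One confidence counter time-shared by a basic block's instructions."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.owner = 0   # index of the instruction holding the counter
        self.count = 0

    def update(self, selected):
        """selected: did the IR detector select the owner for removal?"""
        if selected and self.count < SATURATION:
            self.count += 1
            return
        # Relinquish: owner not selected, or selected while already
        # saturated (in the latter case the owner stays marked removable;
        # that state is not modeled in this sketch).
        self.owner = (self.owner + 1) % self.block_size
        self.count = 0
```

The storage saving is the point: one counter per block instead of one per instruction, at the cost of building confidence for one instruction at a time.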
27. New IR predictor designs: preliminary results
- IR predictor size
  - Large: 78 MB!!
  - Filter cache: 56 KB
  - I-cache: 12 KB
28. Putting slipstream components to work
- Observations
  - Slipstream does not always yield a performance gain (low instruction removal)
  - Slipstream components are off the critical path
- Key ideas
  - Use the slipstream components for profiling in the background and predict slipstream performance (while running in both modes, slipstream and non-slipstream)
  - Perform opportunity-based slipstreaming
29. Opportunity-based slipstreaming
- Goal: get comparable slipstream performance with the minimal required slipstream-on time
- How to find the best slipstreaming opportunities?
  - Use the percentage of predicted-ineffectual instructions
- Main steps
  - Count the number of instructions predicted as ineffectual by the IR predictor
  - Monitor the count over an interval of retired instructions
  - Slipstream on the next interval if the predicted removal for the current interval exceeds a threshold
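The main steps above amount to simple interval-based gating. A sketch, with an interval size matching the experiments that follow and an illustrative threshold:

```python
INTERVAL = 4096    # retired instructions per monitoring interval (4K)
THRESHOLD = 0.30   # fraction predicted ineffectual to turn slipstream on

class OpportunityGate:
    """Decides per interval whether slipstreaming is worthwhile."""

    def __init__(self):
        self.retired = 0
        self.pred_ineffectual = 0
        self.slipstream_on = False

    def retire(self, predicted_ineffectual):
        # Called once per retired instruction.
        self.retired += 1
        if predicted_ineffectual:
            self.pred_ineffectual += 1
        if self.retired == INTERVAL:
            # Set the mode for the next interval, then reset the counts.
            self.slipstream_on = (
                self.pred_ineffectual / INTERVAL >= THRESHOLD)
            self.retired = self.pred_ineffectual = 0
```

The IR predictor already produces the per-instruction ineffectual bit, so this gating needs only a counter and a comparator.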
30. Using predicted-ineffectual instructions: Speedup
- Interval: 4K instructions; slipstream turn-on threshold: 30% predicted-ineffectual instructions
31. Using predicted-ineffectual instructions: Slipstream-on time
32. Opportunity-based slipstreaming (contd.)
- Problems with the previous approach
  - Instruction removal is not the correct measure for predicting performance; it is the cycles saved due to instruction removal that matter
  - Program behavior may change across intervals; a prediction based on the current interval may be wrong in the next interval
33. Opportunity-based slipstreaming (contd.)
- New approach
  - Count the cycles saved by removing predicted-ineffectual instructions
34. Opportunity-based slipstreaming (contd.)
- Slipstream if the cycles saved exceed a threshold
- Add confidence to handle across-interval changes in program behavior: slipstream only if the cycles saved exceed the threshold repeatedly
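This confidence refinement can be sketched in a few lines; both the cycle threshold and the confidence depth of two consecutive intervals are illustrative assumptions:

```python
CYCLE_THRESHOLD = 1000   # illustrative cycles-saved threshold per interval
CONFIDENCE_DEPTH = 2     # consecutive qualifying intervals required

class ConfidentGate:
    """Turns slipstream on only after repeated cycle savings."""

    def __init__(self):
        self.streak = 0
        self.slipstream_on = False

    def end_interval(self, cycles_saved):
        # Called once per interval with the estimated cycles saved.
        if cycles_saved > CYCLE_THRESHOLD:
            self.streak += 1
        else:
            self.streak = 0  # one poor interval resets confidence
        self.slipstream_on = self.streak >= CONFIDENCE_DEPTH
```

Requiring a streak filters out one-interval spikes that the simple count-based gate would act on.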
35. Opportunity-based slipstreaming (contd.)
- Other fun stuff: managing resources
36. Conclusions
- Slipstream is a novel means of using CMP cores to
  - enhance single-program speedup, and
  - implicitly enhance fault tolerance
- Preliminary experiments with the new IR predictor designs are encouraging
- Existing slipstream components can be used to implement opportunity-based slipstreaming
- A slipstream management unit allows better utilization of CMP cores under many job constraints
37. Questions?
38. Slipstream performance (SPEC2K)
- Models used for comparison
  - SS(64x4): a single 4-way superscalar processor with 64 ROB entries
  - SS(128x8): a single 8-way superscalar processor with 128 ROB entries
  - SS(256x16): a single 16-way superscalar processor with 256 ROB entries
  - CMP(2x64x4): slipstreaming on a CMP composed of two SS(64x4) cores
  - CMP(2x64x4)/byp: same as above, but the A-stream can bypass instruction fetching
  - CMP(2x128x8): slipstreaming on a CMP composed of two SS(128x8) cores
  - CMP(2x128x8)/byp: same as above, but the A-stream can bypass instruction fetching
39. Slipstream performance: using an extra core for slipstreaming
40. Slipstream performance: two small cores vs. one large core
41. Instruction removal
42. Memory recovery model results