Title: Slipstream Processors
1. Slipstream Processors
- Presenter: Vimal Reddy
- Advisor: Eric Rotenberg
2. The slipstream paradigm
- Only a fraction of the dynamic instruction stream is required for correct program execution (Rotenberg '99)
- Detect and remove ineffectual instructions: run a shortened, effectual version of the program (the Advanced stream, or A-stream)
- Ensure correctness by running a complete version of the program (the Redundant stream, or R-stream)
- Slipstream created: the shortened A-stream finishes fast; the R-stream consumes near-perfect predictions from the A-stream and finishes close behind
- This redundant arrangement is much faster than conventional, non-redundant execution
3. Slipstream microarchitecture
- Multiple cores of a Chip Multiprocessor (CMP) are used to concurrently run the R-stream and A-stream
- Slipstream components
  - Instruction Removal (IR) Detector
  - IR Predictor
  - Delay Buffer
- Pipeline changes to enable instruction removal and ensure correctness
  - Core running the A-stream
    - Gets predictions from the IR predictor (not a conventional branch predictor)
    - Skips ineffectual instructions in fetch
    - Writes only to its private L1 data cache (not the shared L2)
  - Core running the R-stream
    - Gets predictions from the Delay Buffer and verifies them
4. Slipstream microarchitecture (contd.)
5. Creating the slipstream
- Main steps involved
  - Create a reduced A-stream
  - Communicate A-stream outcomes to the R-stream
  - Check the A-stream's forward progress and recover from deviations
6. Creating a reduced A-stream
- The IR detector and IR predictor combine to create the A-stream
- IR detector
  - Monitors retired R-stream instructions
  - Detects (past) ineffectualness and conveys it to the IR predictor
- IR predictor
  - Removes an instruction from the A-stream after repeated indications from the IR detector
7. Communicating outcomes
- The Delay Buffer is used to pass outcomes from the A-stream to the R-stream
- Separate control and data FIFOs
- Control flow information is complete: the IR predictor predicts all branches
- Data flow information is incomplete: 1 bit per dynamic instruction binds values to instructions
8. Memory hierarchy
- A-stream loads and stores should not interfere with the R-stream's
- Solution: exploit the typical memory hierarchy found in CMPs
  - Both the A-stream and R-stream read and write their respective private L1 data caches
  - The R-stream L1 cache is write-through: it writes to a shared L2 cache
  - The A-stream L1 cache is neither write-through nor write-back: its stores are not propagated to the shared L2 cache
- With the R-stream close behind, an evicted A-stream line is generally regenerated by the R-stream in the shared L2
- If the A-stream re-references an evicted line in the shared L2 before regeneration, it gets stale data and diverges
9. Memory hierarchy (contd.)
10. A-stream deviation detection and recovery
- When?
  - The A-stream deviates due to incorrect removal or a stale data access in its L1 data cache
- Detection?
  - A branch or value mispredict in the R-stream (known as an IR misprediction)
- Recovery?
  - Restore A-stream register state: copy values from R-stream registers using the delay buffer or a shared-memory exception handler
  - Restore A-stream memory state: invalidate the A-stream L1 data cache (more recovery models later)
11. Slipstream components: The IR detector
- Monitors retired R-stream instructions for three triggering conditions
  - Unreferenced writes
  - Non-modifying writes
  - Correctly-predicted branches
- Selects triggering instructions as candidates for removal
- Also selects their computation chains for removal: remove an instruction if it is killed and all of its consumers are selected for removal
- Computation chains are implicitly removed: removing consumers makes their producers unreferenced writes the next time around
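The first two triggering conditions can be sketched in software. This is a minimal illustration over a window of retired instructions, not the hardware mechanism, and the instruction-record fields (`dest`, `srcs`, `old_val`, `new_val`) are assumptions for the sketch:

```python
def find_triggers(window):
    """window: list of dicts with 'dest', 'srcs', 'old_val', 'new_val'.
    Returns indices of instructions that trigger removal candidacy."""
    triggers = set()
    for i, instr in enumerate(window):
        dest = instr['dest']
        # Non-modifying write: the write leaves the old value unchanged.
        if instr['new_val'] == instr['old_val']:
            triggers.add(i)
            continue
        # Unreferenced write: dest is overwritten later in the window
        # before any instruction reads it.
        for later in window[i + 1:]:
            if dest in later['srcs']:
                break  # value was referenced -> not a trigger
            if later['dest'] == dest:
                triggers.add(i)  # killed without being read
                break
    return triggers

window = [
    {'dest': 'r1', 'srcs': [],     'old_val': 0, 'new_val': 5},  # killed unread
    {'dest': 'r2', 'srcs': [],     'old_val': 7, 'new_val': 7},  # non-modifying
    {'dest': 'r1', 'srcs': ['r2'], 'old_val': 5, 'new_val': 9},
]
print(find_triggers(window))  # -> {0, 1}
```

The third condition, correctly-predicted branches, would simply compare the retired branch outcome against the prediction.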
12. Slipstream components: The IR detector
13. IR detection example
14. Slipstream components: The IR predictor
- An augmented gshare branch predictor
- Each table entry corresponds to one basic block in the dynamic instruction stream
  - Tag: start PC of the basic block
  - A 2-bit counter for predicting the branch terminating the basic block
  - Confidence counters, one per basic-block instruction, to predict its ineffectualness
- Updated by the IR detector
  - Counter incremented if the instruction is detected as removable
  - Counter reset to zero otherwise
- Saturated counter -> instruction removed from the A-stream when next encountered
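The per-instruction confidence mechanism above can be sketched as follows. The class and method names are illustrative; the saturation threshold of 64 matches the configuration reported in the results slide:

```python
SATURATION = 64  # confidence threshold; counter saturates here

class ConfidenceCounter:
    """One counter per basic-block instruction in the IR predictor."""

    def __init__(self):
        self.count = 0

    def update(self, detected_removable):
        # Updated by the IR detector as R-stream instructions retire.
        if detected_removable:
            self.count = min(self.count + 1, SATURATION)
        else:
            self.count = 0  # any effectual use resets confidence

    def remove_from_a_stream(self):
        # Saturated counter -> skip this instruction in A-stream fetch.
        return self.count == SATURATION
```

The reset-to-zero policy makes removal conservative: one effectual use forces the instruction to re-earn its removal over 64 more detections.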
15. Slipstream components: The improved IR predictor
- Key ideas
  - Use ineffectualness information to skip fetch for completely ineffectual basic blocks
  - If execution bandwidth is high, slipstream still performs well due to the fetch cycles saved
16. Slipstream components: The improved IR predictor (contd.)
17. Memory recovery models
- Invalidate the entire A-stream L1 cache
  - Complete recovery
  - Easy to implement: an invalidation signal resets the valid bits
  - Compulsory A-stream cache misses after recovery
- Invalidate only dirty lines
  - Fewer compulsory misses
  - Easy to implement: an invalidation signal resets the valid bits of dirty lines
  - Needless invalidation of lines that were dirty before the A-stream diverged
  - Incomplete recovery
    - Persistent-stale problem: clean lines brought in from the L2 before the A-stream diverged persist
    - Persistent-skipped-write problem: lines that remain clean because of incorrectly-skipped stores persist
18. Memory recovery models (contd.)
- Use invalidated lines as value predictions in the A-stream
- Key ideas
  - Invalidating a line preserves its tag and data
  - Only a few lines are corrupt when the A-stream diverges
  - Match the tag even if the cache line is invalid: use its data as a value prediction
- Summary
  - Memory recovery after A-stream divergence is easy
  - The only hardware support needed:
    - Make the L2 cache R-stream-only
    - Provide cache invalidate signals based on the recovery model
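The reuse of invalidated lines can be sketched with a toy direct-mapped cache; all names here are illustrative, and real hardware would of course operate on tag/valid bits rather than Python objects:

```python
class CacheLine:
    def __init__(self, tag, data):
        self.tag, self.data, self.valid = tag, data, True

class AStreamL1:
    """Toy direct-mapped A-stream L1 illustrating recovery behavior."""

    def __init__(self):
        self.lines = {}  # index -> CacheLine

    def invalidate_all(self):
        # Recovery: clear only the valid bits; tags and data survive.
        for line in self.lines.values():
            line.valid = False

    def load(self, index, tag):
        line = self.lines.get(index)
        if line is not None and line.tag == tag:
            if line.valid:
                return line.data, 'hit'
            # Tag matches an invalidated line: speculatively use its
            # (possibly corrupt) data as a value prediction.
            return line.data, 'value-prediction'
        return None, 'miss'  # must fetch from the shared L2

cache = AStreamL1()
cache.lines[0] = CacheLine(tag=0x40, data=123)
cache.invalidate_all()
print(cache.load(0, 0x40))  # -> (123, 'value-prediction')
```

A wrong value prediction is safe: the R-stream eventually flags it as an IR misprediction and triggers recovery again.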
19. Primary performance results (SPEC2K)
- Slipstream configuration used
  - IR predictor
    - 2^20 entries, gshare-indexed
    - 16 confidence counters per entry
    - Confidence threshold: 64
  - IR detector
    - Number of entries buffered: 128
  - Delay buffer
    - Data flow buffer: 256 entries
    - Control flow buffer: 4K branch predictions
  - Memory model
    - Invalidate dirty lines; use invalidated data as value predictions
20. Using a second processor for slipstreaming
21. Designing deployable slipstream components
- IR detector
  - An operand rename table (ORT) to detect trigger instructions
  - A FIFO to update the IR predictor on a per-basic-block basis
  - A small, ORT-like cache to detect ineffectual stores
- The delay buffer is a FIFO
- Existing memory hierarchies work well for slipstream
- The IR predictor is complex
  - Ineffectualness information is tied up with the gshare predictor
  - A tag and confidence counter per instruction add too much overhead
22. New IR predictor experiments: Finding where removal lies
- Observation
  - Removal is a 90-10 case: 90% of removal is contributed by 10% of all dynamic basic blocks
23. New IR predictor designs
- Key ideas
  - The current design stores ineffectual-removal (IR) information for the most frequently accessed basic blocks
  - Instead, store IR information only for the basic blocks that contribute most removal (the 10% of basic blocks)
24. New IR predictor designs (contd.)
- Design based on a simple filter
  - Cache the IR information and index it with (PC, BHR)
  - Problem: frequently accessed basic blocks will evict the infrequent basic blocks that contribute most removal
  - Fix: use a simple filter of counters
  - Use a regular gshare for branch prediction
25. New IR predictor designs (contd.)
- Integrate confidence counters into the I-cache
  - A table of confidence counters, one per instruction in the I-cache
  - Eliminates tag storage: leverages the I-cache tags
26. New IR predictor designs: Roving confidence counter (contd.)
- Use one roving counter per basic block
  - Eliminates having one confidence counter per instruction in a basic block
  - Instructions in a basic block time-share a single counter
- An instruction relinquishes the counter if
  - the IR detector does not select it for removal, OR
  - the IR detector selects it and the counter is saturated
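A minimal sketch of this time-sharing scheme, assuming one counter per basic block that roves on the two relinquish conditions above (the class name and saturation threshold are illustrative):

```python
SATURATION = 64

class RovingCounter:
    """One confidence counter time-shared by a basic block's instructions."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.owner = 0   # index of the instruction holding the counter
        self.count = 0

    def update(self, selected):
        """selected: did the IR detector select the owner for removal?"""
        if selected and self.count < SATURATION:
            self.count += 1
            return
        # Relinquish: owner not selected, or selected while already
        # saturated (in the latter case the owner stays marked removable;
        # that state is not modeled in this sketch).
        self.owner = (self.owner + 1) % self.block_size
        self.count = 0
```

The storage saving is the point: one counter per block instead of one per instruction, at the cost of building confidence for one instruction at a time.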
27. New IR predictor designs: preliminary results
- IR predictor size
  - Large: 78 MB!!
  - Filter cache: 56 KB
  - I-cache: 12 KB
28. Putting slipstream components to work
- Observations
  - Slipstream does not always yield a performance gain (low instruction removal)
  - Slipstream components are off the critical path
- Key ideas
  - Use the slipstream components for profiling in the background and predict slipstream performance (while running in both modes, slipstream and non-slipstream)
  - Perform opportunity-based slipstreaming
29. Opportunity-based slipstreaming
- Goal: get comparable slipstream performance with the minimal required slipstream-on time
- How to find the best slipstreaming opportunities?
  - Use the percentage of predicted-ineffectual instructions
- Main steps
  - Count the number of instructions predicted as ineffectual by the IR predictor
  - Monitor the count over an interval of retired instructions
  - Slipstream on the next interval if the predicted removal for the current interval exceeds a threshold
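The main steps above amount to simple interval-based gating. A sketch, with an interval size matching the experiments that follow and an illustrative threshold:

```python
INTERVAL = 4096    # retired instructions per monitoring interval (4K)
THRESHOLD = 0.30   # fraction predicted ineffectual to turn slipstream on

class OpportunityGate:
    """Decides per interval whether slipstreaming is worthwhile."""

    def __init__(self):
        self.retired = 0
        self.pred_ineffectual = 0
        self.slipstream_on = False

    def retire(self, predicted_ineffectual):
        # Called once per retired instruction.
        self.retired += 1
        if predicted_ineffectual:
            self.pred_ineffectual += 1
        if self.retired == INTERVAL:
            # Set the mode for the next interval, then reset the counts.
            self.slipstream_on = (
                self.pred_ineffectual / INTERVAL >= THRESHOLD)
            self.retired = self.pred_ineffectual = 0
```

The IR predictor already produces the per-instruction ineffectual bit, so this gating needs only a counter and a comparator.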
30. Using predicted-ineffectual instructions: Speedup
- Interval: 4K instructions; slipstream turn-on threshold: 30% predicted-ineffectual instructions
31. Using predicted-ineffectual instructions: Slipstream-on time
32. Opportunity-based slipstreaming (contd.)
- Problems with the previous approach
  - Instruction removal is not the correct measure for predicting performance; it is the cycles saved due to instruction removal that matter
  - Program behavior may change across intervals; a prediction based on the current interval may be wrong in the next interval
33. Opportunity-based slipstreaming (contd.)
- New approach
  - Count the cycles saved by removing predicted-ineffectual instructions
34. Opportunity-based slipstreaming (contd.)
- Slipstream if the cycles saved exceed a threshold
- Add confidence to handle across-interval changes in program behavior: slipstream only if the cycles saved exceed the threshold repeatedly
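This confidence refinement can be sketched in a few lines; both the cycle threshold and the confidence depth of two consecutive intervals are illustrative assumptions:

```python
CYCLE_THRESHOLD = 1000   # illustrative cycles-saved threshold per interval
CONFIDENCE_DEPTH = 2     # consecutive qualifying intervals required

class ConfidentGate:
    """Turns slipstream on only after repeated cycle savings."""

    def __init__(self):
        self.streak = 0
        self.slipstream_on = False

    def end_interval(self, cycles_saved):
        # Called once per interval with the estimated cycles saved.
        if cycles_saved > CYCLE_THRESHOLD:
            self.streak += 1
        else:
            self.streak = 0  # one poor interval resets confidence
        self.slipstream_on = self.streak >= CONFIDENCE_DEPTH
```

Requiring a streak filters out one-interval spikes that the simple count-based gate would act on.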
35. Opportunity-based slipstreaming (contd.)
- Other fun stuff: managing resources
36. Conclusions
- Slipstream is a novel means of using CMP cores to
  - enhance single-program speedup, and
  - implicitly enhance fault tolerance
- Preliminary experiments with the new IR predictor designs are encouraging
- Existing slipstream components can be used to implement opportunity-based slipstreaming
- A slipstream management unit allows better utilization of CMP cores under many job constraints
37. Questions?
38. Slipstream performance (SPEC2K)
- Models used for comparison
  - SS(64x4): a single 4-way superscalar processor with 64 ROB entries
  - SS(128x8): a single 8-way superscalar processor with 128 ROB entries
  - SS(256x16): a single 16-way superscalar processor with 256 ROB entries
  - CMP(2x64x4): slipstreaming on a CMP composed of two SS(64x4) cores
  - CMP(2x64x4)/byp: same as above, but the A-stream can bypass instruction fetching
  - CMP(2x128x8): slipstreaming on a CMP composed of two SS(128x8) cores
  - CMP(2x128x8)/byp: same as above, but the A-stream can bypass instruction fetching
39. Slipstream performance: using an extra core for slipstreaming
40. Slipstream performance: two small cores vs. one large core
41. Instruction removal
42. Memory recovery model results