Title: Enhancing Software Reliability with Speculative Threads
1 Enhancing Software Reliability with Speculative Threads
- Jeffrey Oplinger and Monica Lam
- Stanford University
2 Motivation
- Reliability, availability and serviceability (RAS) are dominant issues in computing
- Security holes are costly!
- programmer: finding and fixing vulnerabilities
- user: applying update after update
- everyone: aftermath of a security compromise
- What can we as architects and system designers do to help?
- first, a look at the current approaches
3 Current Techniques to Address RAS
- Static analysis
- Formal verification: doesn't really work
- Tools: PREFIX, LCLINT, ...
- useful, but unsound/incomplete
- Runtime schemes: overheads!
- Safer languages (Java): significantly slower
- Purify: 2x to 5x slowdown
- bounds-checking-gcc: > 10x slowdown
- Programmer discipline: bugs are inevitable!
4 State of Computer Architecture
- What to do with all these transistors?
- Use them to increase performance?
- A perennial target
- Marginal returns are decreasing
- RAS is increasingly important
- Instead, provide new features to software
- Make safer code easier to write
- Speed up expensive but useful runtime schemes
5 Proposal: Monitor-and-Recover at Runtime
- Monitor the program execution at runtime
- Verify that execution was correct
- Recover from detected errors if possible
- Future programs will hopefully have more application-level checking and verification
- Especially if hardware makes the task easier and more efficient!
6 Outline
- Motivation: Architecture Support for RAS
- Monitoring Code and Error Recovery
- Current Schemes
- Proposed Programming Paradigms
- Hardware Support: Speculative Threads
- Experimental Evaluations
- Monitoring Code
- Recovery with Fine-grain Transactions
- Conclusions
7 Execution Monitoring at Runtime
- Examples
- Performance monitoring (Pixie)
- Detecting memory misuse (Purify)
- Run-time anomaly detection (DIDUCE)
- Too expensive for shipped code
- Even painful during the development cycle
- Viewed as not essential even if useful
- PROPOSAL: make monitoring more efficient
- efficiency/performance -> more use/functionality
8 Monitor-and-Recover Paradigm
- Monitoring code
- Inserted into the original program and obeys sequential semantics
- Typically does not affect the main computation
- Perhaps a predictable "ok" value returned
- For performance, execute in parallel with rest of normal program
- Re-execute if any data dependences violated
- Precise exception semantics
9 Error Detection at Runtime
- StackGuard
- Instruments the program to detect stack corruption
- Libsafe
- Replace unsafe string routines in C library
- Catches corruption before or after it happens
- Corruption detection: how to recover?
- kill the process -> denial-of-service opportunity
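The canary idea behind StackGuard-style detection can be illustrated with a toy model. This is a sketch under stated assumptions, not StackGuard's implementation: the frame layout (buffer, then canary, then return address), the sizes, and the names `Frame`, `unsafe_copy`, and `canary_intact` are all invented for this example.

```python
import os

BUF_SIZE = 16      # assumed buffer size for this toy frame
CANARY_SIZE = 4    # assumed canary width


class Frame:
    """Toy stack frame laid out as [buffer | canary | return address]."""

    def __init__(self, return_address: bytes, canary: bytes = None):
        # A random canary is chosen at function entry unless one is supplied.
        self.canary = canary if canary is not None else os.urandom(CANARY_SIZE)
        self.memory = (bytearray(BUF_SIZE) + bytearray(self.canary)
                       + bytearray(return_address))

    def unsafe_copy(self, data: bytes) -> None:
        """Copy like an unchecked strcpy: long data runs past the buffer."""
        end = min(len(data), len(self.memory))
        self.memory[:end] = data[:end]

    def canary_intact(self) -> bool:
        """Check made before returning: was the canary overwritten?"""
        stored = bytes(self.memory[BUF_SIZE:BUF_SIZE + CANARY_SIZE])
        return stored == self.canary


frame = Frame(b"\x00\x00\x40\x00", canary=b"\xde\xad\xbe\xef")
frame.unsafe_copy(b"A" * 24)      # 24 bytes overwrite buffer, canary and return address
print(frame.canary_intact())      # False
```

The check only reports that corruption has already happened; what to do next is the recovery question the slide raises.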
10 Error Recovery at Runtime
- Manual recovery
- Need to know exactly what to fix
- Hard to write cleanup code
- Easy to get wrong or incomplete
- Automatic recovery (transactions, logging)
- Often too expensive
- Coarse granularity often not appropriate
- PROPOSAL: fine-grained transactions
11 Monitor-and-Recover Paradigm
- Fine-grain recoverable transactions
- Software marks the beginning of a transaction
- All further side-effects (memory and register) are buffered
- Software decides when to either commit or abort the transaction
- Allows for robust end-to-end error detection and recovery
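The buffering behaviour described above can be modelled in software. The sketch below is only an illustration of the semantics: the deck proposes hardware buffering, register state is not modelled, and the `Transaction` class and its method names are assumptions made for this example.

```python
class Transaction:
    """Buffers writes to a backing store until commit; abort discards them."""

    def __init__(self, memory: dict):
        self.memory = memory      # committed state
        self.buffer = {}          # speculative writes, not yet visible
        self.active = True

    def read(self, addr):
        # A transaction sees its own buffered writes first.
        if addr in self.buffer:
            return self.buffer[addr]
        return self.memory.get(addr, 0)

    def write(self, addr, value):
        self.buffer[addr] = value

    def commit(self):
        # Make all buffered side effects visible at once.
        self.memory.update(self.buffer)
        self._finish()

    def abort(self):
        # Drop the buffered side effects; memory is untouched.
        self._finish()

    def _finish(self):
        self.buffer = {}
        self.active = False
```

Because nothing reaches `memory` before `commit`, an abort needs no cleanup code, which is the property the slide contrasts with manual recovery.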
12 Recovery Programming Model: Example
    <input string parsing code>
13 Recovery Programming Model: Example
    <input string parsing code>
    if (StackGuardError()) exit(-1);
14 Recovery Programming Model: Example
    try {
        <input string parsing code>
        if (StackGuardError()) ABORT;
        COMMIT;
    } catch {
        log_error();
        skip_to_next_input();
    }
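The try / ABORT / COMMIT / catch example above can be mimicked in ordinary software to show the control flow. This is a hedged sketch: copying the state stands in for hardware buffering, the malformed-field test stands in for `StackGuardError()`, and `Abort`, `parse_input`, and `process_all` are hypothetical names.

```python
class Abort(Exception):
    """Raised to abandon the current transaction (stands in for ABORT)."""


def parse_input(text: str, state: dict) -> None:
    """Hypothetical parser: updates state and may leave it half-written."""
    for field in text.split(","):
        key, sep, value = field.partition("=")
        state[key] = value            # side effect happens before the check
        if not sep:                   # stand-in for StackGuardError()
            raise Abort(f"malformed field: {field!r}")


def process_all(inputs, state: dict, log: list) -> None:
    for text in inputs:
        scratch = dict(state)         # TRY: work on a private copy of the state
        try:
            parse_input(text, scratch)
            state.clear()             # COMMIT: make the side effects visible
            state.update(scratch)
        except Abort as err:          # CATCH: the scratch copy is simply dropped
            log.append(str(err))      # log_error(), then skip to the next input


state, log = {}, []
process_all(["a=1,b=2", "c=3,oops", "d=4"], state, log)
print(state)   # {'a': '1', 'b': '2', 'd': '4'}
```

The second input is rejected as a whole, including the `c=3` field that was parsed before the error was detected, and processing continues with the third input instead of killing the process.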
19 Hardware Support: Speculative Threads
- Thread-level Speculation (TLS) originally designed to speed up uniprocessor integer programs
- Break the computation into (relatively) independent threads; execute in parallel
- Buffer side effects and detect data dependence violations; discard and re-execute if needed
20-37 Procedural Thread-level Speculation (TLS)
[Animation, slides 20 to 37: NORMAL SEQUENTIAL EXECUTION (A1, CALL B, RET, A2, CALL C, RET, A3, ...) shown beside its TLS EXECUTE timeline.]
- At CALL B, fork a speculative thread that runs A2 in parallel with B; at CALL C, fork again so that A3 runs in parallel with C
- Need data dependence checking
- Observed data dependence: the ST in B executes before the LD in A2
- Unobserved data dependence: the LD in A2 executes before the ST in B; A2 and the threads forked from it are squashed (x), then A2 is re-executed and forks A3 and C again
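The dependence check in these slides can be sketched as a small sequential simulation. It is a simplification under stated assumptions: real TLS hardware runs the threads concurrently and checks dependences as stores happen, whereas here the speculative continuation is simply run first. The names `SpecMemory` and `run_call_speculatively` are invented for this example.

```python
class SpecMemory:
    """Memory view that buffers writes and records which addresses were read."""

    def __init__(self, base: dict):
        self.base = base
        self.writes = {}
        self.reads = set()

    def load(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        self.reads.add(addr)          # value came from outside this thread
        return self.base.get(addr, 0)

    def store(self, addr, value):
        self.writes[addr] = value


def run_call_speculatively(memory: dict, callee, continuation) -> bool:
    """Run continuation speculatively, as if in parallel with callee.

    Returns True if the speculative work was discarded and re-executed.
    """
    spec = SpecMemory(memory)
    continuation(spec)                # speculative thread runs ahead of the callee

    head = SpecMemory(memory)
    callee(head)                      # non-speculative thread executes the call
    memory.update(head.writes)        # the head thread always commits

    # Violation: the continuation read an address the callee later stored to.
    violated = not spec.reads.isdisjoint(head.writes)
    if violated:
        spec = SpecMemory(memory)     # discard the buffered state ...
        continuation(spec)            # ... and re-execute with the stores visible
    memory.update(spec.writes)
    return violated
```

Either way the final memory matches sequential execution of the callee followed by the continuation; the violation only costs the re-execution.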
38-39 Using TLS to speed up Monitoring
[Diagram: INSERT INSTRUMENTATION adds monitoring code M1, M2 to the sequence A1, A2, A3, ...; under TLS EXECUTE a fork at each monitoring call lets M1 and M2 run in parallel with the original code.]
- hopefully significant parallelism between monitoring and original code
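40 Using TLS to speed up Heavy Monitoring
[Diagram: INSERT INSTRUMENTATION yields a sequence dominated by monitoring code M1, M2, M3, M4, ...; under TLS EXECUTE the monitoring invocations are forked to run in parallel with one another.]
- here, need independence between monitoring code invocations to get decent speedup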
41 Using TLS to Support Transactions
- Speculative buffers must hold all memory side-effects
- Memory hazard detection not needed
- Start speculative execution at TRY
- Initial register state is saved
- Thread restart is changed to ABORT
- jumps to CATCH instead of re-executing
- Thread control exposed to software via COMMIT and ABORT primitives
42 Machine Architectures for Thread-level Speculation
- Variety of proposals
- Speculative buffering in cache or load-store queues
- Selective recovery or restart whole thread
- Our machine
- Fine-grained threads -> Simultaneous Multithreading (SMT) based
- Use load-store queues to buffer the state
- Trace buffers expensive -> No selective recovery
- Procedural speculation -> Return value prediction
43 Machine Architecture: Base Superscalar
[Block diagram: PC, FETCH, DECODE, RENAME, INST QUEUE, FUs, D]
44 Machine Architecture: SMT support
[Same block diagram with ADDED and MODIFIED blocks marked]
45 Machine Architecture: TLS support
[Same block diagram with ADDED and MODIFIED blocks marked]
46 Machine Architecture: TLS performance
[Same block diagram with ADDED and MODIFIED blocks marked]
47 Machine Architecture
[Block diagram: PC, FETCH, DECODE, RENAME, INST QUEUE, FUs, D]
49 Experimental Evaluation: Monitoring Code
- Pixie: counts basic-block executions
- Third Degree: memory checker (like PURIFY)
- DIDUCE: anomaly detection tool
- Originally for Java; tracks values, reports when anomalies are detected
- Can watch load/stores, parameters, return vals
- Our version instruments loads in the binary
- DIDUCE: instruments all static loads
- DIDUCE.1: instruments only 10% of static loads
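As an illustration of what value-tracking instrumentation on loads might do, the sketch below keeps, for each load site, the set of bits that have never changed and flags a value that breaks that invariant. It is a simplified sketch in the spirit of the description above, not DIDUCE's actual model or reporting logic; `LoadSiteInvariant` and `monitor_loads` are hypothetical names.

```python
class LoadSiteInvariant:
    """Tracks which bits of the values seen at one load site never changed."""

    def __init__(self):
        self.first = None             # first value observed at this site
        self.stable_mask = ~0         # bits still believed invariant (all, initially)

    def observe(self, value: int) -> bool:
        """Record a loaded value; return True if it breaks the current invariant."""
        if self.first is None:
            self.first = value
            return False
        changed = (value ^ self.first) & self.stable_mask
        self.stable_mask &= ~changed  # relax the invariant to admit the new value
        return changed != 0


def monitor_loads(trace):
    """trace: iterable of (site, value) pairs; returns the anomalous events."""
    sites, anomalies = {}, []
    for site, value in trace:
        inv = sites.setdefault(site, LoadSiteInvariant())
        if inv.observe(value):
            anomalies.append((site, value))
    return anomalies


trace = [("pc1", 4), ("pc1", 4), ("pc1", 6), ("pc1", 6), ("pc2", 1)]
print(monitor_loads(trace))   # [('pc1', 6)]
```

Each call to `observe` reads and updates only the state of its own load site, so invocations for different sites do not depend on one another.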
50 Simulated Machine
- Common across all experiments
- 5-stage pipeline
- Configurable number of thread contexts
- Maximum of 2 threads fetch per cycle
- 32k 4-way L1D, 32k 2-way L1I, 512k 4-way L2
- Based on SimpleScalar simulator
51 Simulation Parameters
- Sample Configurations
- SMT1/t1 is a 4-wide processor with one thread (no TLS)
- SMT4/t1 is a 16-wide processor with one thread (no TLS)
- only exploits additional ILP
- SMT4/t8 is a 16-wide processor with 8 TLS threads
52 Simulation: Thread Control Operations
- Thread fork
- Initiated in DECODE pipeline stage
- Single-cycle flash copy of starting register state
- New thread begins FETCH in the following cycle
- Thread meet
- Before meet, finishing thread must:
- issue all buffered stores to memory
- finalize outstanding register writes for validation
- Only one fork or meet allowed each cycle
53 Simulated Programs
- Four different instrumentations
- Pixie
- Third
- DIDUCE.1
- DIDUCE
- Two different SPEC95 base programs are instrumented
- (V) Vortex
- (P) Perl
54-56 Runtime Overhead of Instrumentation
[Charts not reproduced]
57 Effective IPC
[Chart not reproduced; series: ILP, ILP+TLS]
58 Relative Performance Improvement
[Chart not reproduced; series: ILP, ILP+TLS, TLS]
60 Using TLS to Support Recovery
- Speculative buffers must hold all memory side-effects
- Need significant buffering -> use L1 cache
- e.g. to hold buffer overrun attacks
- possibilities when even larger?
- buffer further in the memory hierarchy (L2, L3)
- commit by default
- abort by default
- fall back to coarse (OS) support
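Two of the overflow options listed above, commit by default and abort by default, can be contrasted in a small software model. This is a sketch of one possible reading of those bullets, not the authors' mechanism: capacity is counted in distinct addresses rather than cache lines, and `BoundedTransaction` and `BufferOverflow` are hypothetical names.

```python
class BufferOverflow(Exception):
    """Raised when an abort-by-default transaction outgrows its buffer."""


class BoundedTransaction:
    def __init__(self, memory: dict, capacity: int, on_overflow: str = "abort"):
        assert on_overflow in ("abort", "commit")
        self.memory = memory
        self.capacity = capacity          # distinct addresses that can be buffered
        self.on_overflow = on_overflow
        self.buffer = {}
        self.recoverable = True           # False once writes have reached memory

    def write(self, addr, value):
        if not self.recoverable:
            self.memory[addr] = value     # already given up: write through
            return
        if addr not in self.buffer and len(self.buffer) >= self.capacity:
            if self.on_overflow == "abort":
                self.buffer.clear()       # abort by default: drop everything
                raise BufferOverflow(f"more than {self.capacity} buffered addresses")
            self.commit()                 # commit by default: keep going,
            self.recoverable = False      # but recovery is no longer possible
            self.memory[addr] = value
            return
        self.buffer[addr] = value

    def commit(self):
        self.memory.update(self.buffer)
        self.buffer.clear()
```

Abort by default keeps the all-or-nothing guarantee at the cost of rejecting large transactions; commit by default always makes progress but silently gives up recoverability once the buffer is full.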
61 Evaluation: Transactions with Recovery
- Examined three networked programs, wrapped routines with buffer-overflow vulnerabilities into transactions
- bftpd, imapd: unsafe use of C string library functions, used Libsafe-like error detection
- ntpd: bug in handwritten string parsing, used StackGuard-like error detection
- Stack traversal is optimized here (unoptimized in the paper)
62 Transaction Results
63 Conclusions
- Performance is still an issue for monitoring
- Better performance means more utility
- 1.6x speedup from TLS, 2.4x with ILP as well
- e.g. 2.5x overhead became 12% overhead
- 5.3 IPC overall
- Need to provide more ways for programmers to get their code right!
- Fine-grained transactions allow easier checks with precise and complete recovery
- More research needed!