Title: Enhancing Software Reliability with Speculative Threads

1
Enhancing Software Reliability with Speculative Threads

Jeffrey Oplinger and Monica Lam
Stanford University

2
Motivation

Reliability, availability and serviceability (RAS) are dominant issues in computing
Security holes are costly!
programmer: finding and fixing vulnerabilities
user: applying update after update
everyone: aftermath of a security compromise
What can we as architects and system designers do to help?
first, a look at the current approaches

3
Current Techniques to Address RAS

Static analysis
Formal verification: doesn't really work
Tools: PREfix, LCLint, ...
useful, but unsound/incomplete
Runtime schemes: overheads!
Safer languages (Java): significantly slower
Purify: 2x to 5x slowdown
bounds-checking-gcc: > 10x slowdown
Programmer discipline: bugs are inevitable!

4
State of Computer Architecture

What to do with all these transistors?
Use them to increase performance?
A perennial target
Marginal returns are decreasing
RAS is increasingly important
Instead, provide new features to software
Make safer code easier to write
Speed up expensive but useful runtime schemes

5
Proposal: Monitor-and-Recover at Runtime

Monitor the program execution at runtime
Verify that execution was correct
Recover from detected errors if possible
Future programs will hopefully have more application-level checking and verification
Especially if hardware makes the task easier and more efficient!

6
Outline

Motivation: Architecture Support for RAS
Monitoring Code and Error Recovery
Current Schemes
Proposed Programming Paradigms
Hardware Support: Speculative Threads
Experimental Evaluations
Monitoring Code
Recovery with Fine-grain Transactions
Conclusions

7
Execution Monitoring at Runtime

Examples
Performance monitoring (Pixie)
Detecting memory misuse (Purify)
Run-time anomaly detection (DIDUCE)
Too expensive for shipped code
Even painful during the development cycle
Viewed as not essential even if useful
PROPOSAL: make monitoring more efficient
efficiency/performance → more use/functionality

8
Monitor-and-Recover Paradigm

Monitoring code
Inserted into the original program and obeys sequential semantics
Typically does not affect the main computation
Perhaps a predictable "ok" value returned
For performance, execute in parallel with the rest of the normal program
Re-execute if any data dependences are violated
Precise exception semantics

9
Error Detection at Runtime

StackGuard
Instruments the program to detect stack
corruption
Libsafe
Replaces unsafe string routines in the C library
Catches corruption before or after it happens
Corruption detection: how to recover?
kill the process → denial-of-service opportunity

10
Error Recovery at Runtime

Manual recovery
Need to know exactly what to fix
Hard to write cleanup code
Easy to get wrong or incomplete
Automatic recovery (transactions, logging)
Often too expensive
Coarse granularity often not appropriate
PROPOSAL: fine-grained transactions

11
Monitor-and-Recover Paradigm

Fine-grain recoverable transactions
Software marks the beginning of a transaction
All further side-effects (memory and register) are buffered
Software decides when to either commit or abort the transaction
Allows for robust end-to-end error detection and recovery

12
Recovery Programming Model Example
<input string parsing code>
13
Recovery Programming Model Example
<input string parsing code>
if (StackGuardError()) exit(-1);
14
Recovery Programming Model Example
try {
  <input string parsing code>
  if (StackGuardError()) ABORT;
  COMMIT;
} catch {
  log_error();
  skip_to_next_input();
}
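
The try/ABORT/COMMIT pattern can be imitated in software to make the semantics concrete. The following is a minimal Python sketch, not the hardware mechanism described in the slides: a hypothetical `Transaction` class buffers writes to a dict-based "memory" and applies them only on commit. The names `Abort`, `parse_input`, `process_all`, and the 8-character limit are illustrative assumptions standing in for the input parsing code and the StackGuard-style check.

```python
class Abort(Exception):
    """Raised to abandon a transaction (stands in for ABORT)."""


class Transaction:
    """Buffers writes to `memory` until commit; abort discards them."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}          # speculative side effects

    def write(self, key, value):
        self.buffer[key] = value  # memory itself is untouched

    def read(self, key):
        # Reads see the transaction's own buffered writes first.
        return self.buffer.get(key, self.memory.get(key))

    def commit(self):
        self.memory.update(self.buffer)
        self.buffer = {}

    def abort(self):
        self.buffer = {}
        raise Abort()


def parse_input(txn, index, text, limit=8):
    """Hypothetical parser: stores the text, then checks for an overrun."""
    txn.write(("input", index), text)
    txn.write("count", txn.read("count") + 1)
    if len(text) > limit:         # assumed stand-in for StackGuardError()
        txn.abort()
    txn.commit()


def process_all(inputs):
    memory = {"count": 0}
    errors = []
    for i, text in enumerate(inputs):
        txn = Transaction(memory)
        try:
            parse_input(txn, i, text)
        except Abort:
            errors.append(i)      # log_error(); skip_to_next_input()
    return memory, errors
```

In this sketch an oversized input leaves no trace in `memory`: both the stored text and the counter update are discarded together, which is the property the slides ask of a fine-grained transaction.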

19
Hardware Support: Speculative Threads

Thread-level Speculation (TLS): originally designed to speed up uniprocessor integer programs
Break the computation into (relatively) independent threads; execute in parallel
Buffer side effects and detect data dependence violations; discard and re-execute if needed
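
The buffer, detect, and re-execute cycle can be illustrated with a small software simulation. This is a hedged sketch under simplifying assumptions (two threads, a dict as memory, whole-thread restart); the helper names `run` and `speculate` are hypothetical and do not come from the slides. The later thread runs first against stale memory, recording the addresses it loads; if the earlier thread then stores to one of those addresses, the later thread's buffered results are discarded and it is re-executed.

```python
def run(task, memory):
    """Run task against a private view; return (read_set, write_buffer)."""
    reads, writes = set(), {}

    def load(addr):
        if addr in writes:            # forward from own buffered store
            return writes[addr]
        reads.add(addr)
        return memory[addr]

    def store(addr, value):
        writes[addr] = value          # buffered, not yet visible

    task(load, store)
    return reads, writes


def speculate(earlier, later, memory):
    """Execute `later` speculatively, as if in parallel with `earlier`.

    Returns the number of times `later` had to be re-executed.
    """
    # The speculative thread reads memory before `earlier` has committed.
    spec_reads, spec_writes = run(later, memory)

    _, early_writes = run(earlier, memory)
    memory.update(early_writes)       # earlier thread commits first

    restarts = 0
    if spec_reads.intersection(early_writes):
        # Violation: `later` loaded an address that `earlier` stored to.
        restarts = 1
        _, spec_writes = run(later, memory)   # discard and re-execute
    memory.update(spec_writes)
    return restarts
```

When the two tasks touch disjoint addresses, `speculate` returns 0 and the speculative work is kept as is; either way the final memory matches what sequential execution would produce.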

20
Procedural Thread-level Speculation (TLS)
[Figure, built up over slides 20-37: normal sequential execution (A1, CALL B, RET, A2, CALL C, RET, A3, ...) shown beside TLS execution, in which a fork at each CALL lets the code after the call (A2, then A3) run in parallel with the callee (B, then C). Callouts: "need data dependence checking"; "observed data dependence" (ST in B, then LD in A2); "unobserved data dependence" (LD in A2 before the ST in B), after which the affected threads are discarded (marked x) and A2 is re-executed.]
38
Using TLS to speed up Monitoring
[Figure, slides 38-39: inserting instrumentation places monitoring code M1, M2 between the original code segments A1, A2, A3; when executed with TLS, forks let M1 and M2 run in parallel with A2 and A3. Callout: "hopefully significant parallelism between monitoring and original code".]
40
Using TLS to speed up Heavy Monitoring
[Figure: monitoring invocations M1, M2, M3, M4 dominate execution; successive forks run them in parallel. Callout: "here, need independence between monitoring code invocations to get decent speedup".]
41
Using TLS to Support Transactions

Speculative buffers must hold all memory side-effects
Memory hazard detection not needed
Start speculative execution at TRY
Initial register state is saved
Thread restart is changed to ABORT
jumps to CATCH instead of re-executing
Thread control exposed to software via COMMIT and ABORT primitives

42
Machine Architectures for Thread-level Speculation

Variety of proposals
Speculative buffering in cache or load-store queues
Selective recovery or restart whole thread
Our machine
Fine-grained threads → Simultaneous Multithreading (SMT) based
Use load-store queues to buffer the state
Trace buffers expensive → No selective recovery
Procedural speculation → Return value prediction

43
Machine Architecture
[Pipeline block diagram, slides 43-47, with blocks PC, FETCH, DECODE, RENAME, INST QUEUE, FUs, D. Successive builds titled "Base Superscalar", "SMT support", "TLS support" and "TLS performance" mark which blocks are added or modified.]

49
Experimental Evaluation: Monitoring Code

Pixie: counts basic-block executions
Third Degree: memory checker (like Purify)
DIDUCE: anomaly detection tool
Originally for Java; tracks values and reports when anomalies are detected
Can watch loads/stores, parameters, return values
Our version instruments loads in the binary
DIDUCE: instruments all static loads
DIDUCE.1: instruments only 10% of static loads
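
To give a feel for what "tracks values and reports when anomalies are detected" can mean for instrumented loads, here is a simplified Python sketch of value-invariant tracking. It is an illustration only and should not be read as DIDUCE's actual algorithm: the `ValueMonitor` class, the per-site bit mask, and the `site` identifiers are assumptions made for this example. Each load site remembers the first value it saw and which bits have ever differed from it; a value that flips a previously constant bit is reported, and the invariant is then relaxed.

```python
class ValueMonitor:
    """Tracks, per load site, which bits of the loaded value have varied."""

    def __init__(self):
        self.first = {}      # site -> first value seen
        self.varied = {}     # site -> mask of bits that have changed
        self.reports = []    # (site, value) pairs flagged as anomalies

    def observe(self, site, value):
        if site not in self.first:
            self.first[site] = value
            self.varied[site] = 0
            return
        # Bits that differ from the first value and were constant so far.
        new_bits = (value ^ self.first[site]) & ~self.varied[site]
        if new_bits:
            self.reports.append((site, value))
            self.varied[site] |= new_bits   # relax the invariant
```

Every `observe` call is extra work attached to a load in the original program, which is why the slides distinguish between instrumenting all static loads and only a fraction of them.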

50
Simulated Machine

Common across all experiments
5-stage pipeline
Configurable number of thread contexts
Maximum of 2 threads fetch per cycle
32k 4-way L1D, 32k 2-way L1I, 512k 4-way L2
Based on SimpleScalar simulator

51
Simulation Parameters

Sample Configurations
SMT1/t1 is a 4-wide processor with one thread (no TLS)
SMT4/t1 is a 16-wide processor with one thread (no TLS)
only exploits additional ILP
SMT4/t8 is a 16-wide processor with 8 TLS threads

52
Simulation Thread Control Operations

Thread fork
Initiated in DECODE pipeline stage
Single-cycle flash copy of starting register state
New thread begins FETCH in the following cycle
Thread meet
Before meet, finishing thread must:
issue all buffered stores to memory
finalize outstanding register writes for validation
Only one fork or meet allowed each cycle

53
Simulated Programs

Four different instrumentations
Pixie
Third
DIDUCE.1
DIDUCE
Two different SPEC95 base programs are instrumented
(V) Vortex
(P) Perl

54
Runtime Overhead of Instrumentation
[Chart, built up over slides 54-56; values not recoverable from the transcript.]
57
Effective IPC
[Chart; series: ILP, ILP+TLS.]
58
Relative Performance Improvement
[Chart; series: ILP, ILP+TLS, TLS.]

60
Using TLS to Support Recovery

Speculative buffers must hold all memory side-effects
Need significant buffering → use L1 cache
e.g. to hold buffer overrun attacks
Possibilities when the state is even larger:
buffer further in the memory hierarchy (L2, L3)
commit by default
abort by default
fall back to coarse (OS) support

61
Evaluation: Transactions with Recovery

Examined three networked programs; wrapped routines with buffer-overflow vulnerabilities into transactions
bftpd, imapd: unsafe use of C string library functions; used Libsafe-like error detection
ntpd: bug in handwritten string parsing; used StackGuard-like error detection
Stack traversal is optimized here (unoptimized in the paper)

62
Transaction Results
63
Conclusions

Performance is still an issue for monitoring
Better performance means more utility
1.6x speedup from TLS, 2.4x with ILP as well
e.g. 2.5x overhead became 12% overhead
5.3 IPC overall
Need to provide more ways for programmers to get their code right!
Fine-grained transactions allow easier checks with precise and complete recovery
More research needed!