Title: The potential for Software-only thread-level speculation
1. The potential for Software-only thread-level speculation
- Depth Oral Presentation
- Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza
- Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik
- By Chuck (Chengyan) Zhao
- April 25, 2005
2. Chip Multi-Processor (CMP) is now everywhere
- From all major companies
- IBM
- Power 4
- Power 5
- Intel
- Montecito
- Smithfield
- AMD
- Dual-core Opteron
- Sun
- MAJC
- Sony, Toshiba, IBM
- Cell
[Figure: chip photos of Power 4, a dual-core Intel chip, Cell, and a dual-core Opteron]
Abundant Chip Multiprocessors
3. Improving Throughput with a Chip Multi-Processor
[Figure: a multiprogramming workload of applications running on the processors and caches of a CMP, plotted against execution time]
improve throughput
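4. Improving Single Application Performance with a Chip Multi-Processor
[Figure: a single application running on a CMP, plotted against execution time, with the open question of how to use the other processors]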
need parallel threads to reduce execution time
5. Using Chip Multi-Processor for improvements
- Improve throughput for multi-programming workload
- Easy
- CMP behaves like a normal MP
- Improve single-application performance
- Hard
- Control and Data Dependence
- Proposed approach: Thread-Level Speculation (TLS)
CMP trade-offs
6. Thread-Level Speculation (TLS)
- Enable the compiler to create parallel threads despite the existence of ambiguous data dependences
- Optimistically parallelize at compile time
- Detect violations and recover at runtime
Optimistic at compile time, detect and recover at runtime
7. Example of Thread-Level Speculation
for (...) { ... p ... q ... }
- Cannot be parallelized by parallelizing compilers
- Uncertain dependence between p and q
- Might be runtime or user-input dependent
Break loop iterations into threads, explore uncertainty in each thread
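A minimal sketch of the kind of loop this slide describes. The function names, the index arrays `w` and `r`, and the loop body are hypothetical and not taken from the talk; they only illustrate a dependence between `p` and `q` that is decided by runtime data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop with an ambiguous dependence: iteration i stores
 * through p and loads through q, and both targets come from index
 * arrays that are only known at run time (for example, user input). */
void update(int *data, const size_t *w, const size_t *r, size_t n)
{
    assert(data != NULL && w != NULL && r != NULL);
    for (size_t i = 0; i < n; i++) {
        int *p = &data[w[i]];        /* store target */
        const int *q = &data[r[i]];  /* load source  */
        *p = *q + 1;
    }
}

/* Returns 1 if some iteration reads a location that an earlier
 * iteration wrote (a true cross-iteration dependence), else 0.
 * A compiler cannot evaluate this statically because w and r are
 * runtime values, which is why it must assume the worst. */
int has_cross_iteration_raw(const size_t *w, const size_t *r, size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (r[i] == w[j])
                return 1;
    return 0;
}
```

When `has_cross_iteration_raw` would return 0 for the actual inputs, the iterations could have run in parallel; TLS bets on that case and pays for recovery only when the bet fails.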
8. How Thread-Level Speculation works
9. Thread-Level Speculation quick summary
- Benefits
- Reduce inter-thread communication time among cores
- Scale
- New parallel programming model
- Types of implementations
- Hardware only
- Combined hardware and software
- Software only
Thread-Level Speculation is good for Chip Multi-Processor
10. Thread-Level Speculation Implementation Diagram
Overall picture of Thread-Level Speculation
11. Thread-Level Speculation Implementation Comparison
- Hardware-only approach
- Lots of research
- Good speedup in simulation
- Nobody has built it yet
- Costly and risky
- Needs both HW and SW at the same time
- Outcome
- HW-only TLS looks promising
- Requires significant hardware changes
- Software-only approach: limited work, limited progress
- Major problem: high overhead
- Buffer memory for speculative states
- Track each memory read and write for violation detection
- Recover from failed speculation by re-execution
Quick summary on HW-only and SW-only approaches
12. Outline for the rest of the talk
- Hardware TLS schemes
- Software TLS schemes
- Our scheme
- Our goals
- Starting point
- Potential applications
- Conclusion
13. Hardware-only Thread-Level Speculation
Overall picture of HW-only TLS approach
14. Hardware Thread-Level Speculation Schemes
- Lots of hardware TLS research
- CMU Stampede
- Stanford Hydra
- Wisconsin Multiscalar
- UIUC IA-COMA
- UMN Super-threaded architecture
- Convergence of hardware schemes
- Use cache to buffer speculative state
- Extend cache coherence protocol to track data dependence
Convergence of HW-only Thread-Level Speculation
15. Hardware TLS Schemes quick summary
- Result
- TLS is promising
- SPECint improvement
- 30 - 100
- Depends on aggressiveness of the hardware support
[Figure: CMP with four hardware speculative-state (Sp-state) buffers and an enhanced cache coherence protocol]
Convergence of HW-only Thread-Level Speculation
16. Software-only Thread-Level Speculation
Overall picture of SW-only TLS approach
17. Software-only Thread-Level Speculation Schemes
- LRPD Test (UIUC)
- VM for dependence tracking (Spiros, CMU)
- Cintra's SW TLS (U Edinburgh)
- Problem of the software-only approach: high overhead
- Try to reduce it
overview of SW-only TLS approach
18. LRPD Test (UIUC)
[Figure: execution-time diagram of the LRPD test]
- implemented entirely in software
- applies only to array-based code
- no partial parallelism
- entire loop will re-execute sequentially if there is any dependence
Pros & Cons of LRPD
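A much-simplified marking test in the spirit of the bullets above, not the published LRPD algorithm. It assumes a loop of the hypothetical form `a[w[i]] = a[r[i]] + 1`, uses shadow arrays to record which iteration touched each element, and reports whether the whole loop was dependence-free. All names (`Shadow`, `mark_read`, `mark_write`, `loop_is_doall`) are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define N_ELEM    8     /* hypothetical array size                 */
#define NO_ITER   (-1)  /* element not touched yet                 */
#define MANY_ITER (-2)  /* element read by more than one iteration */

typedef struct {
    int writer[N_ELEM];  /* iteration that wrote the element          */
    int reader[N_ELEM];  /* iteration(s) that read it before writing  */
    int conflict;        /* set once any cross-iteration hazard shows */
} Shadow;

void shadow_init(Shadow *s)
{
    for (size_t i = 0; i < N_ELEM; i++) {
        s->writer[i] = NO_ITER;
        s->reader[i] = NO_ITER;
    }
    s->conflict = 0;
}

void mark_read(Shadow *s, size_t idx, int iter)
{
    assert(idx < N_ELEM);
    if (s->writer[idx] == iter)
        return;                          /* covered by this iteration's own write */
    if (s->writer[idx] != NO_ITER)
        s->conflict = 1;                 /* reads another iteration's write */
    if (s->reader[idx] == NO_ITER)
        s->reader[idx] = iter;
    else if (s->reader[idx] != iter)
        s->reader[idx] = MANY_ITER;
}

void mark_write(Shadow *s, size_t idx, int iter)
{
    assert(idx < N_ELEM);
    if (s->writer[idx] != NO_ITER && s->writer[idx] != iter)
        s->conflict = 1;                 /* two iterations write the element */
    if (s->reader[idx] != NO_ITER && s->reader[idx] != iter)
        s->conflict = 1;                 /* another iteration read the element */
    s->writer[idx] = iter;
}

/* Marks every access of the loop and returns 1 only if no element is
 * shared between iterations; any conflict rejects the whole loop. */
int loop_is_doall(const size_t *w, const size_t *r, int n)
{
    Shadow s;
    shadow_init(&s);
    for (int i = 0; i < n; i++) {
        mark_read(&s, r[i], i);
        mark_write(&s, w[i], i);
    }
    return !s.conflict;
}
```

The all-or-nothing result mirrors the "no partial parallelism" point: a single conflict anywhere sends the entire loop back to sequential execution. This sketch is stricter than the real test, which can also privatize arrays.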
19. Dependence tracking using Virtual Memory
[Figure: execution-time diagram of software dependence tracking through VM pages; virtual memory is used to synchronize and transfer pages]
Pros & Cons of VM Tracking
20. CMU Spiros's approach: dependence tracking using Virtual Memory
- Coarse-grain, software-only
- Based on memory tracking
- virtual memory page protection mechanism
- use software DSM (TreadMarks)
- Synchronization through VM pages through cost analysis
- Overhead is prohibitive
- 2 sec (sequential) vs. 5 min (parallel)
- Not a viable approach at this coarse level of granularity
SW-TLS through VM Tracking is not attractive
21. Cintra's SW TLS: memory tracking tuned for performance
[Figure: execution-time diagram of efficient tracking for array references]
Efficient but custom-made for arrays only
22. Cintra's software-only Thread-Level Speculation quick summary
- Features
- Software simulation of an extended cache coherence protocol
- Provide speculative state transition table
- Violation detection through speculative state comparison
- Instrument each load and store
- Pros & Cons
- advanced implementation of LRPD test
- implemented entirely in software
- covers partial parallelism
- hand-crafted code for performance
- applies only to array-based code
Summary of Cintra's work
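To make "state transition table" and "state comparison" concrete, here is a small sketch under assumptions. The four state names, the table contents, the thread count, and the functions `on_load` and `on_store` are hypothetical choices for illustration and are not taken from the slides. Threads are assumed to be numbered in sequential program order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_THREADS 4  /* hypothetical */
#define N_ELEM    8  /* hypothetical */

/* Assumed per-thread, per-element speculative states. */
enum State { NOT_ACC, EXP_LD, MOD, EXP_LD_MOD, N_STATES };
enum Op    { LOAD, STORE, N_OPS };

/* next_state[current][operation] */
static const enum State next_state[N_STATES][N_OPS] = {
    /* NOT_ACC    */ { EXP_LD,     MOD        },
    /* EXP_LD     */ { EXP_LD,     EXP_LD_MOD },
    /* MOD        */ { MOD,        MOD        },
    /* EXP_LD_MOD */ { EXP_LD_MOD, EXP_LD_MOD },
};

typedef struct { enum State st[N_THREADS][N_ELEM]; } AccessTable;

void table_init(AccessTable *t) { memset(t, 0, sizeof *t); } /* NOT_ACC == 0 */

/* Instrumentation for a load of element idx by thread tid. */
void on_load(AccessTable *t, int tid, size_t idx)
{
    assert(tid >= 0 && tid < N_THREADS && idx < N_ELEM);
    t->st[tid][idx] = next_state[t->st[tid][idx]][LOAD];
}

/* Instrumentation for a store. Compares the states of all later
 * threads: one that already did an exposed load of this element may
 * have consumed a stale value. Returns that thread's id, or -1.
 * This check is conservative and ignores value forwarding. */
int on_store(AccessTable *t, int tid, size_t idx)
{
    assert(tid >= 0 && tid < N_THREADS && idx < N_ELEM);
    t->st[tid][idx] = next_state[t->st[tid][idx]][STORE];
    for (int later = tid + 1; later < N_THREADS; later++) {
        enum State s = t->st[later][idx];
        if (s == EXP_LD || s == EXP_LD_MOD)
            return later;
    }
    return -1;
}
```

Because both hooks run on every load and store, their cost is the per-access overhead that the slides identify as the main problem of software-only schemes.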
23. Problems with Software Thread-Level Speculation
- High overhead
- Buffer speculative state
- Track data dependence for all memory references
- Re-execute in case of failed speculation
- Potential speedup
- largely unexplored
- Possible directions for future research
- Reduce overhead
- Achieve speedup from TLS parallelism
Summary of Software TLS
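The three overhead sources listed above (buffering, dependence tracking, re-execution) can be seen in a tiny software-only sketch. It is an illustrative model only: the `Epoch` type, its fixed capacity, and the function names are assumptions, and real schemes use far more efficient data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACC 64  /* hypothetical per-epoch capacity */

typedef struct {
    int *rd[MAX_ACC];   /* addresses read speculatively     */
    size_t nrd;
    int *wr[MAX_ACC];   /* buffered writes: address ...     */
    int val[MAX_ACC];   /* ... and the value to be written  */
    size_t nwr;
} Epoch;

void epoch_begin(Epoch *e) { e->nrd = 0; e->nwr = 0; }

/* Instrumented load: use this epoch's own buffered value if present,
 * otherwise read memory and record the address for tracking. */
int spec_load(Epoch *e, int *addr)
{
    for (size_t i = e->nwr; i-- > 0; )
        if (e->wr[i] == addr)
            return e->val[i];
    assert(e->nrd < MAX_ACC);
    e->rd[e->nrd++] = addr;
    return *addr;
}

/* Instrumented store: buffer the value instead of touching memory. */
void spec_store(Epoch *e, int *addr, int v)
{
    for (size_t i = 0; i < e->nwr; i++)
        if (e->wr[i] == addr) { e->val[i] = v; return; }
    assert(e->nwr < MAX_ACC);
    e->wr[e->nwr] = addr;
    e->val[e->nwr] = v;
    e->nwr++;
}

/* Conservative violation check, called before `earlier` commits: the
 * later epoch is violated if it read any address the earlier one wrote. */
int epoch_violated(const Epoch *later, const Epoch *earlier)
{
    for (size_t i = 0; i < later->nrd; i++)
        for (size_t j = 0; j < earlier->nwr; j++)
            if (later->rd[i] == earlier->wr[j])
                return 1;
    return 0;
}

/* Commit: write the buffered values back to memory. */
void epoch_commit(Epoch *e)
{
    for (size_t i = 0; i < e->nwr; i++)
        *e->wr[i] = e->val[i];
    epoch_begin(e);
}
```

A violated epoch is squashed by calling `epoch_begin` again and re-running its code. Every load and store passes through a function call and a search, which shows where the overhead comes from.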
24. Our current Thread-Level Speculation approach
Overall position for our SW TLS approach
25. Long term future plan
- Goals
- Target
- Chip Multi-Processors
- Tightly-coupled MPs
- Apply to general-purpose code, not only arrays
- Minimize overhead
- Capitalize on compiler analysis and optimizations
- Idempotency analysis <done>
- Synchronization and communications <done>
- PPA: Probabilistic Pointer Analysis framework (Jeff's work) <progressing>
- Minimal backup and buffer retrieval analysis <progressing>
- more analysis we will invent <todo>
- SW-only approach: room to improve
- Starting point: highly efficient software checkpointing
Goals and Plans
26. Starting point: efficient software checkpointing
[Figure: program execution timeline; memory changes are buffered as execution moves past each checkpoint]
Software checkpointing
- Checkpoints at some program points in the source code
- Buffer state changes between the current execution point and its latest checkpoint
- Execution can always efficiently rewind to its latest checkpoint
Introduce software checkpointing
27. Potential use of Software checkpointing
- Software Rollback
- automatic software TLS support
- foundation of future automatic TLS parallelization
- Debug
- controlled rewind
- Enhance application reliability
- Speculative optimizations in uni-processor programs
- larger window size
- deep branch speculation
- speculative code motion
what can software checkpointing do
28. Software checkpointing schemes
- Compiler analysis
- Local: Basic Block level
- Backup only needed memory writes
- Optimize to minimize
- number of backups
- number of buffer retrievals
- Global: procedure level
- Populate buffers through control-flow graph
- Iterate until buffer stabilizes
- Inter-procedural level
- Potential approaches for software backup
- Undo backup
- Todo backup
build software checkpointing
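One plausible reading of "populate buffers through control-flow graph, iterate until buffer stabilizes" is an iterative dataflow analysis. The sketch below is an assumption about how such a pass could look, not the author's implementation: it computes, for each basic block, the set of variables already backed up on every path from the checkpoint at the entry block, so that only the remaining writes need a backup. The `Block` layout and function names are hypothetical; variables are numbered and kept as bits in a mask.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCKS 8  /* hypothetical limits */
#define MAX_PREDS  4

typedef struct {
    uint32_t writes;        /* bit v set: this block writes variable v */
    int preds[MAX_PREDS];   /* indices of predecessor blocks           */
    int npreds;
} Block;

/* Forward "must" analysis over the CFG; block 0 is the entry, where
 * the checkpoint is taken and nothing is backed up yet.
 *   in[b]  = intersection of out[p] over all predecessors p
 *   out[b] = in[b] | writes[b]
 * Iterates until no set changes and returns the number of passes. */
int solve_backed_up(const Block *cfg, int n, uint32_t *in, uint32_t *out)
{
    assert(n > 0 && n <= MAX_BLOCKS);
    for (int b = 0; b < n; b++) {
        in[b] = 0;
        out[b] = ~(uint32_t)0;          /* optimistic initial value */
    }
    int passes = 0;
    int changed = 1;
    while (changed) {
        changed = 0;
        passes++;
        for (int b = 0; b < n; b++) {
            uint32_t meet = (b == 0) ? 0 : ~(uint32_t)0;
            for (int k = 0; k < cfg[b].npreds; k++)
                meet &= out[cfg[b].preds[k]];
            uint32_t o = meet | cfg[b].writes;
            if (meet != in[b] || o != out[b]) {
                in[b] = meet;
                out[b] = o;
                changed = 1;
            }
        }
    }
    return passes;
}

/* Variables that block b must still back up before writing them. */
uint32_t backups_needed(const Block *cfg, const uint32_t *in, int b)
{
    return cfg[b].writes & ~in[b];
}
```

For a diamond-shaped graph where the entry writes variable 0 and only one branch writes variable 1, the join block still has to back up variable 1, because the backup is missing on the other path.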
29. Undo backup
- Compile-time analysis
- Backup once
- per distinct memory write
- per Basic Block
- Program continues to operate on the original (non-backup) memory
- Action upon execution completion
- Commit: trash buffer
- Rollback: restore from buffer
undo backup properties
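30. Undo backup example
[Flowchart, at Basic Block level: the program writes a = 10 and b = 12 and computes c from a and b; the undo backup memory holds one entry each for a, b, and c; a conflicts check follows: Y -> restore undo memory, N -> trash undo memory; execution then continues with the next Basic Block]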
undo backup process
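The undo process can be sketched as a small log of (address, old value) pairs. This is an illustration under assumptions: the type and function names are hypothetical, the duplicate check is done at run time here whereas the slides describe it as a compile-time decision, and the `+` in the example block is assumed because the operator did not survive in the slide text.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_CAP 32  /* hypothetical buffer capacity */

typedef struct { int *addr; int old; } UndoEntry;
typedef struct { UndoEntry e[UNDO_CAP]; size_t n; } UndoLog;

/* Save the current value of *addr once, before its first write in
 * this basic block. Later writes to the same address are not logged. */
void undo_backup(UndoLog *log, int *addr)
{
    for (size_t i = 0; i < log->n; i++)
        if (log->e[i].addr == addr)
            return;                      /* already backed up */
    assert(log->n < UNDO_CAP);
    log->e[log->n].addr = addr;
    log->e[log->n].old = *addr;
    log->n++;
}

/* Commit: the new values are already in memory, so trash the buffer. */
void undo_commit(UndoLog *log) { log->n = 0; }

/* Rollback: restore every saved value, newest entry first. */
void undo_rollback(UndoLog *log)
{
    while (log->n > 0) {
        log->n--;
        *log->e[log->n].addr = log->e[log->n].old;
    }
}

/* Hand-instrumented version of the example basic block. */
void example_block(UndoLog *log, int *a, int *b, int *c)
{
    undo_backup(log, a);
    undo_backup(log, b);
    undo_backup(log, c);
    *a = 10;
    *b = 12;
    *c = *a + *b;   /* operator assumed for illustration */
}
```

Reads inside the block go straight to memory with no buffer lookup, which is the reason the undo scheme is the fast one.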
31. Todo backup
- Performed at runtime
- Happens on every memory write inside a Basic Block
- Each following read might need to retrieve from the buffer
- Action upon completion (reverse of Undo type)
- Commit: write back from buffer
- Rollback: trash buffer
todo backup properties
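32. Todo backup example
[Flowchart, at Basic Block level: the program writes a through p and b through q, then refers to p and q again; the todo backup memory holds the pairs (p, a) and (q, b); a conflicts check follows: Y -> trash todo backup, N -> write todo backup to memory; execution then continues with the next Block]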
todo backup process
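The todo process can be sketched as a buffer of (address, new value) pairs that is replayed on commit. As before, this is a minimal illustration with hypothetical names and a fixed capacity, not the scheme's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

#define TODO_CAP 32  /* hypothetical buffer capacity */

typedef struct { int *addr; int val; } TodoEntry;
typedef struct { TodoEntry e[TODO_CAP]; size_t n; } TodoBuf;

/* Every write through a pointer is appended to the buffer;
 * memory itself is left untouched until commit. */
void todo_write(TodoBuf *b, int *addr, int v)
{
    assert(b->n < TODO_CAP);
    b->e[b->n].addr = addr;
    b->e[b->n].val = v;
    b->n++;
}

/* Every read must first search the buffer, newest entry first, in
 * case an earlier buffered write hit the same address (aliasing). */
int todo_read(const TodoBuf *b, const int *addr)
{
    for (size_t i = b->n; i-- > 0; )
        if (b->e[i].addr == addr)
            return b->e[i].val;
    return *addr;
}

/* Commit: replay the writes in program order, so the last one wins. */
void todo_commit(TodoBuf *b)
{
    for (size_t i = 0; i < b->n; i++)
        *b->e[i].addr = b->e[i].val;
    b->n = 0;
}

/* Rollback: memory was never modified, so trash the buffer. */
void todo_rollback(TodoBuf *b) { b->n = 0; }
```

The search in `todo_read` is what makes this variant slow, but it works for any pointer because no address has to be known before run time.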
33. Backup Comparison
- Undo
- Pro: fast
- Few backups
- No need to retrieve from buffer for read
- Con: memory address needs to be known statically
- Scalar
- Pointer to fixed location
- Todo
- Pro
- Handles both scalar and general-purpose pointer cases
- Con: slow
- Backup once per memory write
- Need to retrieve each following read from buffer
- In reality both types are used
pros & cons of undo and todo
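34. An example in reality: mixed mode
[Figure: the code to execute declares int a, b, c and pointers p, q, then assigns a = 1, b = 2, and 5 through p, computes c from a and b, and accesses q; statements carry (d) or (u) tags; the undo buffer holds entries for a, b, and c; the todo buffer holds (p, 5)]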
combined-backup process in reality
35. Selection of backups in reality
- Combined approach
- Undo: memory address known
- Scalars
- Pointers to fixed address
- Compile-time analysis
- Todo: memory address unknown
- Normal pointers
- Run-time analysis
- Plan for implementation
- put into SUIF, as an optimization pass
- Minimize performance drop
use both types together in reality
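The selection rule above can be sketched as one checkpoint that carries both buffers. This is an illustration under a strong assumption that is stated here and in the code: the compiler has proven that locations handled by the undo path are never reached through pointers handled by the todo path. Without that guarantee a direct read could miss a buffered pointer write. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CAP 32  /* hypothetical buffer capacity */

typedef struct { int *addr; int val; } Entry;

typedef struct {
    Entry undo[CAP]; size_t nundo;  /* (address, old value) */
    Entry todo[CAP]; size_t ntodo;  /* (address, new value) */
} Checkpoint;

void ckpt_begin(Checkpoint *c) { c->nundo = 0; c->ntodo = 0; }

/* Undo path, for addresses known at compile time: save the old value,
 * then write in place. ASSUMPTION: addr never aliases a todo pointer. */
void write_known(Checkpoint *c, int *addr, int v)
{
    assert(c->nundo < CAP);
    c->undo[c->nundo].addr = addr;
    c->undo[c->nundo].val = *addr;
    c->nundo++;
    *addr = v;
}

/* Todo path, for addresses known only at run time: buffer the write. */
void write_unknown(Checkpoint *c, int *addr, int v)
{
    assert(c->ntodo < CAP);
    c->todo[c->ntodo].addr = addr;
    c->todo[c->ntodo].val = v;
    c->ntodo++;
}

/* Reads through unknown pointers check the todo buffer first. */
int read_unknown(const Checkpoint *c, const int *addr)
{
    for (size_t i = c->ntodo; i-- > 0; )
        if (c->todo[i].addr == addr)
            return c->todo[i].val;
    return *addr;
}

/* Commit: replay todo writes in order, drop the undo entries. */
void ckpt_commit(Checkpoint *c)
{
    for (size_t i = 0; i < c->ntodo; i++)
        *c->todo[i].addr = c->todo[i].val;
    ckpt_begin(c);
}

/* Rollback: restore undo entries newest first, drop the todo writes. */
void ckpt_rollback(Checkpoint *c)
{
    while (c->nundo > 0) {
        c->nundo--;
        *c->undo[c->nundo].addr = c->undo[c->nundo].val;
    }
    c->ntodo = 0;
}
```

Unlike a deduplicating undo log, this version logs every known write and relies on the reverse-order restore to bring back the oldest value. That trades buffer space for a cheaper write path.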
36. Conclusion
- Thread-Level Speculation is compelling
- Potential large performance gains
- Challenge
- Software overhead
- Limited SW TLS work
- No previous SW TLS working on general-purpose programs
- Killer advantage: compiler analyses
- Modest starting point
- efficient software checkpointing
summary
37. Questions and Answers
38. Concurrent HW-only Related Work
Approach Composition Compiler-assisted or Translator-only
DMT HW-only
CSMP HW-only
Trace Processor HW-only
Krishnan99 SW/HW
Hydra SW/HW
SVC SW/HW
SUDS SW/HW
Zhang99 SW/HW
Cintra00 SW/HW
STAMPede SW/HW
Another view of HW-only Thread-Level Speculation Schemes