Title: SMT for Dedicated Threads
1SMT for Dedicated Threads
- Jim Gast, jgast_at_cs.wisc.edu
- Laura Spencer, ljspence_at_cs.wisc.edu
- Brian Fields, fields_at_cs.wisc.edu
2Agenda
- Simultaneous Multithreading background
- Motivation for Priority
- Possible Priority Implementations
- Priority Setting Mechanisms
- Methodology/Results
- Conclusions
3Superscalar
SMT Background
[Figure: issue-slot diagram. Axes: Functional Units, Time (Processor Cycles). Legend: Thread 1, Unused.]
Horizontal Waste: there are cycles in which no instructions can issue because they must wait for prior instructions to finish.
4Fine-grained multithreading
SMT Background
[Figure: issue-slot diagram. Axes: Functional Units, Time (Processor Cycles). Legend: Thread 1, Thread 2, Thread 3, Unused.]
Fills horizontal waste, but still has vertical waste if a thread cannot use all FUs in parallel.
5Simultaneous MultiThreading
SMT Background
[Figure: issue-slot diagram. Axes: Functional Units, Time (Processor Cycles). Legend: Thread 1, Thread 2, Thread 3, Unused.]
SMT allows several threads to issue instructions in the same cycle.
6Benefits
SMT Background
- Eliminate overheads in interthread communication
- Minimize interrupts
- Minimize context switches
- Minimize the cost of branch mis-predicts
- Minimize the cost of pipeline flushes
8Motivation for Priority
Example without priority
Thread 1 (long):
1 Add R1, R2, R3
2 Add R4, R5, R6
3 Add R7, R8, R9
4 Add R9, R10, R11
5 Unlock X
Thread 2 (short):
10 Add R10, R11, R12
11 Add R13, R14, R15
12 Lock X
4-way issue, no priority
Cycle 1: 1, 2, 10, 11
Cycles 2 and 3: 3, 12, 12, 4, 5 (instruction 12 appears twice)
3 cycles to reach sync. point
9Motivation for Priority
Example with priority
Thread 1 (long):
1 Add R1, R2, R3
2 Add R4, R5, R6
3 Add R7, R8, R9
4 Add R9, R10, R11
5 Unlock X
Thread 2 (short):
10 Add R10, R11, R12
11 Add R13, R14, R15
12 Lock X
4-way issue, T1 has priority
Cycle 1: 1, 2, 3, 10
Cycle 2: 11, 4, 12, 5
2 cycles to reach sync. point
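The two examples above can be replayed with a small issue model. This is only a sketch under assumptions that the slides do not state: Alpha-style operand order (the last register is the destination, so instruction 4 reads the R9 written by instruction 3), in-order issue within each thread, a one-cycle wait for a dependent instruction, Thread 1 holding X at the start, and a failed Lock X that occupies an issue slot and is retried in the next cycle. The names `Instr`, `simulate`, and `thread1_quota` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    num: int           # instruction number used on the slide
    op: str            # "add", "lock", or "unlock"
    srcs: tuple = ()   # source registers
    dst: str = ""      # destination register (assumed to be the last operand)


def add(num, a, b, c):
    return Instr(num, "add", (a, b), c)


THREAD1 = [add(1, "R1", "R2", "R3"), add(2, "R4", "R5", "R6"),
           add(3, "R7", "R8", "R9"), add(4, "R9", "R10", "R11"),
           Instr(5, "unlock")]
THREAD2 = [add(10, "R10", "R11", "R12"), add(11, "R13", "R14", "R15"),
           Instr(12, "lock")]


def simulate(thread1_quota, width=4, max_cycles=20):
    """Return the per-cycle issue schedule up to the cycle where Lock X succeeds.

    thread1_quota is the number of slots offered to Thread 1 each cycle:
    width // 2 models "no priority", width models "T1 has priority".
    Thread 2 may use whatever slots are left over.
    """
    threads = [THREAD1, THREAD2]
    pc = [0, 0]
    lock_held = True            # assumption: Thread 1 starts out holding X
    schedule = []
    for _ in range(max_cycles):
        cycle, slots = [], width
        for tid, quota in ((0, thread1_quota), (1, width)):
            written, used = set(), 0   # registers this thread wrote this cycle
            while slots and used < quota and pc[tid] < len(threads[tid]):
                ins = threads[tid][pc[tid]]
                if written & set(ins.srcs):
                    break              # operand produced this cycle: wait
                cycle.append(ins.num)
                slots -= 1
                used += 1
                if ins.op == "lock":
                    if lock_held:
                        break          # spin: slot is used, retry next cycle
                    lock_held = True
                elif ins.op == "unlock":
                    lock_held = False
                if ins.dst:
                    written.add(ins.dst)
                pc[tid] += 1
        schedule.append(cycle)
        if pc[1] == len(THREAD2):
            return schedule
    raise RuntimeError("Lock X was never acquired")


print(simulate(thread1_quota=2))  # no priority: 3 cycles
print(simulate(thread1_quota=4))  # T1 has priority: 2 cycles
```

Under these assumptions the model reproduces the cycle counts on the slides: three cycles without priority, because Thread 2 spins once on Lock X, and two cycles when Thread 1 is served first.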
10Do it on a larger scale???
Motivation for Priority
[Figure: timeline along a Time axis with Lock and Unlock points marked.]
Let's give it a shot
12Where to implement Priority?
Priority Implementation
- Fetch unit?
  - ICOUNT 2.8 (Tullsen et al.)
  - Fetch 8 inst/cycle from up to 2 threads
  - Balance of inst from each active thread
- Issue unit?
  - Typically issue oldest first
  - Very difficult implementation
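A minimal sketch of the fetch choice described above: each cycle, pick up to 2 threads with the fewest instructions already in the front end and fetch at most 8 instructions in total. The function names and the `available` input (how many instructions each thread could supply this cycle) are hypothetical, and details of the real ICOUNT hardware are not modeled.

```python
def icount_select(icounts, num_threads=2):
    """Pick the threads with the fewest in-flight front-end instructions.

    icounts maps thread id -> instruction count; ties go to the lower id.
    """
    return sorted(icounts, key=lambda t: (icounts[t], t))[:num_threads]


def fetch(icounts, available, width=8, num_threads=2):
    """Return {thread id: instructions fetched} for one cycle.

    The first selected thread takes as many slots as it can supply;
    the second selected thread fills whatever is left.
    """
    fetched, slots = {}, width
    for t in icount_select(icounts, num_threads):
        n = min(slots, available.get(t, 0))
        if n:
            fetched[t] = n
        slots -= n
    return fetched


# Thread 1 has the fewest in-flight instructions, so it is fetched first.
print(fetch({0: 12, 1: 3, 2: 7}, {0: 8, 1: 5, 2: 8}))
```

Because threads that clog the front end are picked less often, this choice tends to balance instructions across the active threads.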
13Priority Implementation
Proposed Prioritization Mechanisms
- Fetch based on priority
  - Use ICOUNT for desired imbalance
  - Strict priority
- Issue based on priority
  - Countdown to rendezvous
  - Issue first from thread farthest from barrier
  - Use empty issue slots for other threads
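One possible reading of these mechanisms, as a sketch rather than the design that was actually simulated. For fetch, the in-flight instruction counts are either scaled by a per-thread weight (a deliberate imbalance) or overridden by strict priority. For issue, ready instructions are ranked by how far their thread is from the barrier, and lower-ranked threads get the leftover slots. All names, the weighting formula, and the `remaining` estimate are assumptions.

```python
def fetch_order(icounts, priority, strict=False):
    """Order threads for fetch.

    icounts:  thread id -> in-flight front-end instruction count
    priority: thread id -> positive weight (higher means more important)
    """
    if strict:
        # Strict priority: highest priority first, the count only breaks ties.
        return sorted(icounts, key=lambda t: (-priority[t], icounts[t], t))
    # Biased ICOUNT: dividing by the weight lets a high-priority thread
    # hold more in-flight instructions before it loses its turn.
    return sorted(icounts, key=lambda t: (icounts[t] / priority[t], t))


def issue_select(ready, remaining, width=4):
    """Pick up to `width` ready instructions for one cycle.

    ready:     list of (thread id, age) pairs, smaller age means older
    remaining: thread id -> estimated instructions left before the barrier
    Threads farthest from the barrier go first, oldest instruction first;
    other threads fill any slots that are left.
    """
    ranked = sorted(ready, key=lambda r: (-remaining[r[0]], r[1]))
    return ranked[:width]


counts = {0: 12, 1: 2, 2: 6}
weights = {0: 4, 1: 1, 2: 2}
print(fetch_order(counts, weights))               # biased ICOUNT
print(fetch_order(counts, weights, strict=True))  # strict priority
print(issue_select([(0, 5), (1, 1), (0, 6), (1, 2), (2, 3)],
                   {0: 10, 1: 40, 2: 25}))
```

The biased variant still lets a low-priority thread fetch when it has very few instructions in flight, while the strict variant can starve it.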
15Who chooses thread priority?
Priority Setting Mechanisms
- Programmer does it
  - Has knowledge of program structure
  - But no dynamic information
- Operating System
  - Limited dynamic information
- Hardware
  - Has dynamic information
  - But no knowledge of program structure
17SMTSIM (Tullsen at UCSD)
Methodology/Results
- Subset of the Alpha instruction set
- Synchronization primitives
  - smt_create, smt_lock, smt_unlock, smt_terminate
- Implements quiesce
- 8 threads, 8-way issue, 32-entry IF queues
- 32 kB L1 D/I caches, 1 MB shared L2, 4 MB shared L3
- We adjusted the synchronization interface, fixed several bugs, and added an smt_priority instruction.
18Benchmark
Methodology/Results
- 7 threads simulating things a video conference server does
  - decrypt, encrypt, generate CRC, check CRC, replicate packets
  - some threads are DCache hogs, some threads spawn work to multiple threads, some threads are register/CPU intensive
19Experimental Results
Methodology/Results
21Conclusions
- Using priority for performance gain
  - Use shared resources more effectively
  - Likely need program structure information
- Be careful with metrics
  - IPC no good if it counts spin locks
  - Interaction between threads is complex
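The point about IPC can be made concrete with the small lock example from the motivation slides. The sketch below separates raw IPC from IPC that excludes failed lock attempts. The function names are hypothetical, and the counts are read off the two example schedules (9 issue entries in 3 cycles without priority, one of them a repeated Lock X; 8 issue entries in 2 cycles with priority), not taken from the experiments.

```python
def raw_ipc(issued, cycles):
    """Instructions per cycle, counting every issued instruction."""
    return issued / cycles


def useful_ipc(issued, spin, cycles):
    """Instructions per cycle, excluding failed lock attempts (spins)."""
    return (issued - spin) / cycles


# No priority: 9 issue entries in 3 cycles, 1 of them a failed Lock X.
print(raw_ipc(9, 3), useful_ipc(9, 1, 3))
# T1 has priority: 8 issue entries in 2 cycles, no failed lock attempts.
print(raw_ipc(8, 2), useful_ipc(8, 0, 2))
```

A thread that spins for longer would push raw IPC up while doing no useful work, which is why time to reach the synchronization point is the safer comparison.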