Title: The case for simultaneous multithreading (SMT) and chip multiprocessor (CMP)
1. The case for simultaneous multithreading (SMT) and chip multiprocessor (CMP)
- Dean Tullsen, Susan J. Eggers, and Henry M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995.
- Kunle Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS VII, 1996.
2. State-of-the-art microprocessor architecture in 1995
- Superscalar (SS)
- Multiple instruction issue
- Dynamic scheduling, with hardware tracking dependencies
- Speculative execution: look past predicted branches
- Non-blocking caches: multiple outstanding memory operations
- Coarse-grain threading: instructions from the same thread are packed in each cycle
- Moore's law is still in action
- More logic available for the processor
- What is the best path to higher performance?
3. Options for higher performance
- Continue with superscalar
- Wider instruction issue
- Support more speculation
- Considerations
- Technology limits.
- Need to be able to pack instructions from one thread (how much ILP is there?)
4. Superscalar designs
- 3 phases: instruction fetch, issue and retirement, execution
- Performance limiter: issue and retirement
- 20% of the die area in the PA-8000 goes to its 56-instruction queue
- Wider issue width requires a deeper issue queue
- Quadratic increase in the size of the queue
- Long wires to broadcast result tags
- Execution phase: quadratic increase in the size of the register file and the bypass logic
5. Superscalar designs
- Delays increase as the issue queue grows and as the multi-ported register file grows.
- The performance return from wider issue is therefore limited (see the sketch below).
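A back-of-the-envelope sketch of the quadratic trend (Python; the queue-depth scaling factor and the two-source-operand count are illustrative assumptions, not numbers from either paper):

```python
# Toy model of issue-queue wakeup cost. Each queue entry compares its
# source tags against every result tag broadcast per cycle (one per
# issue slot). Constants are illustrative assumptions.
def wakeup_comparators(issue_width, queue_depth, src_operands=2):
    return queue_depth * src_operands * issue_width

for width in (2, 4, 6, 8):
    depth = 14 * width  # assume queue depth scales roughly with issue width
    print(f"{width}-issue, {depth}-entry queue: "
          f"{wakeup_comparators(width, depth)} tag comparators")
```

The comparator count, and the wires that broadcast result tags to them, grow roughly with the square of the issue width, which is the delay argument made on these two slides.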
6. How much ILP is there?
- Programs from SPEC92 on an 8-issue superscalar processor
- < 1.5 IPC
- The dominant source of lost cycles differs by application
- The ILP within one application cannot fully utilize the issue slots
7. Some conclusions from the 8-issue simulation
- No dominant cause of wasted cycles.
- No dominant solution
- What is next?
- A single thread does not have enough ILP for an 8-issue superscalar
- Exploit ILP from multiple threads
- Fine-grain multithreading: context switch every cycle, so instructions from only one thread issue in each cycle
8. Types of cycle waste in superscalar processors
The SPEC92 study reports 61% vertical waste (whole issue cycles lost) and 39% horizontal waste (unused slots within busy cycles), so fine-grain multithreading, which only attacks vertical waste, also has a limit; see the sketch below.
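A minimal sketch of how issue slots get classified (Python; the per-cycle issue counts are made up, only the 8-wide issue width comes from the study):

```python
# Classify wasted issue slots: vertical waste = whole cycle idle,
# horizontal waste = leftover slots in a cycle that issued something.
def waste_breakdown(issued_per_cycle, issue_width=8):
    vertical = horizontal = 0
    for n in issued_per_cycle:
        if n == 0:
            vertical += issue_width        # entire cycle lost
        else:
            horizontal += issue_width - n  # unused slots in a busy cycle
    total = vertical + horizontal
    return vertical / total, horizontal / total

v, h = waste_breakdown([0, 3, 0, 1, 2, 0, 4, 0])  # hypothetical trace
print(f"vertical waste {v:.0%}, horizontal waste {h:.0%}")
```

Fine-grain multithreading can fill a fully idle cycle with another thread, but it cannot fill the leftover slots of a cycle that already issued, which is why it has a limit.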
9. Another alternative: simultaneous multithreading
- Permit several independent threads to issue to multiple functional units in the same cycle
10. Simultaneous multithreading
- Used in the Pentium 4 (Hyper-Threading Technology) and POWER5
- Objective: increase processor utilization despite
- Long memory latencies
- Limited available parallelism per thread
- Advantages: combines
- Multiple instruction issue from superscalar architectures
- The latency-hiding ability of multithreaded architectures
11. Simultaneous multithreading
- Design parameters
- Number of threads
- Number of functional units
- Issue width per thread
- Binding between threads and units (see the issue-cycle sketch below)
- Static
- Completely dynamic
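A minimal sketch of one issue cycle under fully dynamic binding, contrasted with fine-grain multithreading (Python; the thread count, ready-instruction counts, and round-robin policy are assumptions for illustration):

```python
from collections import deque
from itertools import cycle

ISSUE_WIDTH = 8  # total issue slots per cycle, as in the 8-issue study

def make_threads(n_threads=4, ready=3):
    """Hypothetical queues of ready instructions, one per hardware thread."""
    return {t: deque(f"T{t}.i{k}" for k in range(ready)) for t in range(n_threads)}

def smt_issue_cycle(threads, width=ISSUE_WIDTH):
    """Fully dynamic binding: slots are filled from any ready thread,
    round-robin, until the slots or the ready instructions run out."""
    issued = []
    for t in cycle(sorted(threads)):
        if len(issued) == width or not any(threads.values()):
            break
        if threads[t]:
            issued.append(threads[t].popleft())
    return issued

def fgmt_issue_cycle(threads, turn, width=ISSUE_WIDTH):
    """Fine-grain multithreading: one thread owns the whole cycle, so its
    leftover slots become horizontal waste."""
    q = threads[turn]
    return [q.popleft() for _ in range(min(width, len(q)))]

print("SMT cycle :", smt_issue_cycle(make_threads()))           # fills all 8 slots
print("FGMT cycle:", fgmt_issue_cycle(make_threads(), turn=0))  # fills only 3
```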
12. Evaluation of SMT
- Compare alternatives
- Wide superscalars
- Traditional multi-threaded processors
- Small-scale multiple issue multiprocessors
13. Simulation environment
- What do they model?
- Execution pipelines
- Memory hierarchy
- TLBs
- Branch prediction logic
- Base processor: Alpha 21164 (functional-unit mix summarized in the sketch below)
- 10 functional units
- 4 integer
- 2 floating point
- 3 load/store
- 1 branch
- Latencies from 21164
- Assumes larger caches than the Alpha
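The modeled functional-unit mix from this slide, written out as a configuration table (Python; the unit names are shorthand, and the 21164 latencies are omitted because they are not listed here):

```python
# Functional-unit mix used in the SMT simulations (base: Alpha 21164).
FUNCTIONAL_UNITS = {
    "integer": 4,
    "floating_point": 2,
    "load_store": 3,
    "branch": 1,
}
assert sum(FUNCTIONAL_UNITS.values()) == 10  # 10 functional units in total
```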
14. Alternatives
- Baseline: fine-grain multithreading
- SMT
- SM: Full Simultaneous Issue
- SM: Single Issue, SM: Dual Issue, SM: Four Issue (per-thread issue caps; see the sketch below)
- SM: Limited Connection
- Limits how functional units are partitioned among threads, so it is less dynamic
- All with state-of-the-art instruction scheduling.
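A sketch of what the per-thread issue caps mean for a single cycle (Python; the ready-instruction counts are hypothetical, and SM: Limited Connection is omitted because a simple cap does not capture its static thread-to-unit partitioning):

```python
# Per-thread issue caps implied by the SM model names.
PER_THREAD_CAP = {
    "SM: Full Simultaneous Issue": 8,  # any one thread may fill all 8 slots
    "SM: Four Issue": 4,
    "SM: Dual Issue": 2,
    "SM: Single Issue": 1,
}

def slots_filled(ready_per_thread, cap, width=8):
    """Fill up to `width` slots, taking at most `cap` from each thread."""
    filled = 0
    for ready in ready_per_thread:
        filled = min(width, filled + min(ready, cap))
    return filled

ready = [6, 1, 3, 2]  # hypothetical ready instructions in four threads
for model, cap in PER_THREAD_CAP.items():
    print(f"{model:>28}: {slots_filled(ready, cap)} of 8 slots filled")
```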
15. Results
16. Results
17. Cache issues in SMT
- Decreased locality compared with the single-threaded model
- Design choice: hybrid of shared and private caches
- (Skipped.)
18. Another option: chip multiprocessor
- One more powerful superscalar core, or multiple less powerful cores?
- Same problems: the delay issues of a wide superscalar and the limited ILP in applications
- Compare two microarchitectures
- 6-way superscalar architecture
- 4 x 2-way superscalar architecture (four 2-issue cores)
- Roughly equivalent in terms of die size (see the sketch below)
- Floorplans given in the paper.
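A rough sketch of why the two designs can land at a similar die budget (Python; the linear-datapath plus quadratic-issue-logic area model and its constants are assumptions chosen for illustration, not the paper's floorplan data):

```python
# Toy area model: datapath area scales linearly with issue width,
# issue/bypass logic scales quadratically. Constants chosen only to
# illustrate how 1 x 6-way and 4 x 2-way can come out comparable.
def core_area(width, datapath_per_slot=10.0, issue_coeff=1.0):
    return datapath_per_slot * width + issue_coeff * width ** 2

print("1 x 6-way superscalar:", core_area(6))      # 60 + 36 = 96 units
print("4 x 2-way CMP cores  :", 4 * core_area(2))  # 4 * (20 + 4) = 96 units
```

Under these assumed constants the two come out equal; the paper makes the comparison with an actual floorplan rather than a formula.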
19. Simulated benchmarks
- Integer: eqntott, compress, m88ksim, MPsim
- Floating point: applu, apsi, swim, tomcatv
- Multiprogramming: pmake
20. IPC breakdown
- 6-issue: 1.6x for tomcatv, 2.4x for swim
- Pipeline stalls increase, limiting IPC
- FP applications have significant ILP, but d-cache stalls consume > 50% of the potential IPC
21. 4 x 2 CMP vs. 6-issue SS
- Non-parallelizable codes (compress)
- The wide superscalar architecture is better
- Fine-grain thread-level parallelism (eqntott, m88ksim, apsi)
- The wide superscalar architecture is slightly better
- But the simpler cores are expected to allow a higher clock rate
- Codes with coarse-grain thread-level parallelism
- The CMP is much better
22. SMT vs. CMP (from the SMT study)
- The difference is in the way resources are partitioned
- CMP statically partitions resources
- SMT allows the partitioning to change every cycle (see the sketch below)
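A minimal sketch of the partitioning difference (Python; the per-cycle issue demand of the four threads is made up; the 2-issue cores and the 8-wide shared SMT window are taken from the earlier slides purely for illustration):

```python
# Hypothetical issue demand (ready instructions) from 4 threads per cycle.
demand = [
    [6, 1, 0, 1],  # cycle 0: one thread has a burst of ILP
    [0, 2, 5, 1],  # cycle 1: a different thread has the burst
    [2, 2, 2, 2],  # cycle 2: evenly balanced demand
]

def cmp_issue(d, per_core=2):
    """CMP: static partition -- each thread is stuck with its own 2-wide core."""
    return sum(min(x, per_core) for x in d)

def smt_issue(d, width=8):
    """SMT: one shared 8-wide issue window, repartitioned every cycle."""
    return min(sum(d), width)

for c, d in enumerate(demand):
    print(f"cycle {c}: CMP issues {cmp_issue(d)}, SMT issues {smt_issue(d)}")
```

The shared window wins when per-thread demand is uneven and merely matches the static partition when demand is balanced, which is the heart of this comparison.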
23. SMT claims
- SMT is faster by 24%.
- Not very realistic due to the assumptions: die area is not compared, and a CMP with narrower issue is much smaller.
24. Discussion
- Why is SMT not dominant?
- Inter-thread contention
- Latency
- Hardware overhead (die size)
- Benchmarks
- How should the techniques be combined?
- Synergies and interference
- Do we need SMT per core?
- Where is the right balance?
- How should this be determined?