The case for simultaneous multithreading (SMT) and chip multiprocessor (CMP)

Provided by: Surf6

1
The case for simultaneous multithreading (SMT)
and chip multiprocessor (CMP)
  • Dean Tullsen, Susan J. Eggers, and Henry M. Levy,
    "Simultaneous Multithreading: Maximizing On-Chip
    Parallelism," ISCA 1995.
  • Kunle Olukotun et al., "The Case for a Single-Chip
    Multiprocessor," ASPLOS VII, 1996.

2
State-of-the-art microprocessor architecture in
1995
  • Superscalar (SS):
  • Multiple instruction issue
  • Dynamic scheduling: hardware tracks
    dependencies
  • Speculative execution: look past predicted
    branches
  • Non-blocking caches: multiple outstanding memory
    operations
  • Coarse-grain threading: instructions from the
    same thread are packed in each cycle.
  • Moore's law is still in action:
  • More logic for the processors
  • What is the best path to higher performance?

3
Options for higher performance
  • Continue with superscalar:
  • Wider instruction issue
  • Support more speculation
  • Considerations:
  • Technology limits.
  • Need to be able to pack instructions from one
    thread (how much ILP is there?).

4
Superscalar designs
  • 3 phases: instruction fetch, issue and
    retirement, execution.
  • Performance limiter:
  • Issue and retirement.
  • 20% of die area in the PA-8000, with a
    56-instruction queue
  • Wider issue width requires a deeper issue queue
  • Quadratic increase in the size of the queue
  • Long wires for broadcasting tags
  • Execution phase: quadratic increase in the size
    of the register file and the bypass logic.
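
The quadratic growth mentioned above can be made concrete with a small sketch. It assumes a pure power-law cost model, which is a simplification introduced here for illustration; the function name `relative_cost` and the `exponent` parameter are hypothetical and do not come from either paper.

```python
def relative_cost(old_width: int, new_width: int, exponent: float = 2.0) -> float:
    """Relative size of a structure after widening the issue width.

    Assumes cost grows as width ** exponent (exponent=2.0 models the
    "quadratic increase" on the slide; real designs have other terms).
    """
    if old_width <= 0 or new_width <= 0:
        raise ValueError("issue widths must be positive")
    return (new_width / old_width) ** exponent


# Doubling the issue width from 4 to 8 under the quadratic assumption.
growth = relative_cost(4, 8)
print(f"4-issue -> 8-issue: about {growth:.0f}x the structure size")
```

Under this assumption, doubling the issue width quadruples the affected structures, while the achievable IPC grows far more slowly.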

5
Superscalar designs
  • Delays increase as the size of the issue queue
    increases and as the size of the multi-ported
    register file increases.
  • The performance return from wider issue is limited.

6
How much ILP is there?
  • Programs in SPEC92 on an 8-issue superscalar
    processor:
  • < 1.5 IPC
  • The dominant cause of lost cycles differs by
    application.
  • The ILP in one application cannot fully utilize the
    issue slots.
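
To put the IPC figure in perspective, the short sketch below converts an IPC value into issue-slot utilization. The helper name `issue_utilization` is hypothetical; the inputs (1.5 IPC as the upper bound, 8-wide issue) are the numbers from this slide.

```python
def issue_utilization(ipc: float, issue_width: int) -> float:
    """Fraction of issue slots that carry a useful instruction."""
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    return ipc / issue_width


# 1.5 IPC is the upper bound quoted on the slide for an 8-issue machine.
upper_bound = issue_utilization(ipc=1.5, issue_width=8)
print(f"utilization is below {upper_bound:.2%}")
```

In other words, fewer than one in five issue slots does useful work, which is the motivation for looking beyond a single thread.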

7
Some conclusions from the 8-issue simulation
  • No dominant cause of wasted cycles.
  • No dominant solution.
  • What is next?
  • Each thread does not have enough ILP for an
    8-issue superscalar.
  • Exploit ILP from multiple threads.
  • Fine-grain multithreading: context switch each
    cycle; instructions from only one thread in each
    cycle.

8
Types of cycle waste in superscalar processors
The SPEC92 study reports 61% vertical waste and 39%
horizontal waste. Fine-grain multithreading can only
attack vertical waste, so it also has a limit.
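
The two kinds of waste can be illustrated with a minimal sketch. It assumes the usual definitions: a cycle in which nothing issues wastes every slot (vertical waste), and a cycle in which some but not all slots are filled wastes the remainder (horizontal waste). The function name `classify_waste` and the sample trace are made up for illustration and are not data from the study.

```python
from typing import Dict, Iterable


def classify_waste(issued_per_cycle: Iterable[int], issue_width: int) -> Dict[str, int]:
    """Count used, vertically wasted, and horizontally wasted issue slots."""
    used = vertical = horizontal = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issued count must be between 0 and issue_width")
        used += issued
        if issued == 0:
            # Completely idle cycle: every slot is vertical waste.
            vertical += issue_width
        else:
            # Partially filled cycle: leftover slots are horizontal waste.
            horizontal += issue_width - issued
    return {"used": used, "vertical": vertical, "horizontal": horizontal}


# Hypothetical 6-cycle trace on a 4-issue machine.
trace = [0, 2, 0, 1, 4, 0]
print(classify_waste(trace, issue_width=4))
```

For this made-up trace, 7 slots are used, 12 are lost to idle cycles, and 5 are lost in partially filled cycles.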
9
Another alternative: simultaneous multithreading
  • Permit several independent threads to issue to
    multiple functional units each cycle.
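
A toy model can contrast this with fine-grain multithreading. The sketch assumes each thread has a known number of ready instructions per cycle and that unissued instructions do not carry over to later cycles; both are simplifications made here, and the names `fine_grain_issue`, `smt_issue`, and the `ready` table are hypothetical.

```python
from typing import List


def fine_grain_issue(ready: List[List[int]], width: int) -> int:
    """Round-robin: only one thread may issue in any given cycle."""
    n_threads = len(ready)
    n_cycles = len(ready[0])
    return sum(min(ready[c % n_threads][c], width) for c in range(n_cycles))


def smt_issue(ready: List[List[int]], width: int) -> int:
    """SMT: all threads compete for the same issue slots every cycle."""
    n_cycles = len(ready[0])
    return sum(
        min(sum(thread[c] for thread in ready), width) for c in range(n_cycles)
    )


# ready[t][c] = instructions thread t could issue in cycle c (made-up values).
ready = [
    [2, 0, 1, 3],
    [1, 2, 0, 1],
]
print("fine-grain:", fine_grain_issue(ready, width=4))
print("SMT:       ", smt_issue(ready, width=4))
```

With these made-up values the fine-grain scheme issues 6 instructions over four cycles and the SMT scheme issues 10, because SMT can fill the slots a single thread leaves empty within a cycle.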

10
Simultaneous multithreading
  • Used in the Pentium 4 (Hyper-Threading
    Technology) and POWER5.
  • Objective: increase processor utilization despite:
  • Long memory latencies
  • Limited available parallelism per thread
  • Advantages: combines
  • Multiple instruction issue from superscalar
    architectures
  • Latency-hiding ability of multithreaded
    architectures.

11
Simultaneous multithreading
  • Design parameters:
  • Number of threads
  • Number of functional units
  • Issue width per thread
  • Binding between threads and units:
  • Static
  • Completely dynamic

12
Evaluation of SMT
  • Compare alternatives:
  • Wide superscalars
  • Traditional multithreaded processors
  • Small-scale multiple-issue multiprocessors

13
Simulation environment
  • What do they model?
  • Execution pipelines
  • Memory hierarchy
  • TLBs
  • Branch prediction logic
  • Base processor: Alpha 21164
  • 10 functional units:
  • 4 integer
  • 2 floating point
  • 3 load/store
  • 1 branch
  • Latencies from the 21164
  • Assumes larger caches than the Alpha
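
The functional-unit mix above can be written down as a small configuration record, for example when setting up a toy simulator. This is only a sketch: the class name `FunctionalUnits` and its field names are hypothetical, and only the four counts come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalUnits:
    """Functional-unit counts of the simulated base processor (from the slide)."""
    integer: int = 4
    floating_point: int = 2
    load_store: int = 3
    branch: int = 1

    def total(self) -> int:
        """Total number of functional units."""
        return self.integer + self.floating_point + self.load_store + self.branch


units = FunctionalUnits()
print("functional units:", units.total())
```

The total comes to the 10 functional units listed for the base processor.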

14
Alternatives
  • Baseline: fine-grain multithreading
  • SMT:
  • SM: full simultaneous issue
  • SM: single issue, SM: dual issue, SM: four issue
  • SM: limited connection
  • Limits the partitioning of functional units; less
    dynamic
  • All with state-of-the-art instruction
    scheduling.

15
Results
16
Results
17
Cache issues in SMT
  • Decrease in locality relative to the
    single-threaded model
  • Design choice: hybrid of shared and private caches.
  • Skip.

18
Another option: chip multiprocessor
  • A more powerful superscalar core, or multiple less
    powerful cores?
  • Same problems: the delay issue with superscalars
    and the limited ILP in applications.
  • Compare two microarchitectures:
  • 6-way superscalar architecture
  • 4 x 2-way superscalar architecture
  • Roughly equivalent in terms of die size
  • Floorplans given in the paper.

19
Simulate a set of benchmarks
  • Integer: eqntott, compress, m88ksim, MPsim
  • Floating point: applu, apsi, swim, tomcatv.
  • Multiprogramming application: pmake

20
IPC breakdown
  • 6-issue: 1.6x tomcatv, 2.4x swim
  • pipeline stalls increase lack of IPC
  • FP applications have significant ILP, but
    dcache stalls consume > 50% of IPC

21
4 x 2 CMP vs. 6-issue SS
  • Non-parallelizable codes (compress):
  • Wide superscalar architecture is better
  • Fine-grain thread-level parallelism (eqntott,
    m88ksim, apsi):
  • Wide superscalar architecture is slightly
    better.
  • Expect the simpler cores to allow a higher clock
    rate
  • Codes with coarse-grain thread-level
    parallelism:
  • CMP is much better

22
SMT vs. CMP (from SMT study)
  • Difference: the way resources are partitioned
  • CMP statically partitions resources
  • SMT enables the partitioning to change every cycle.
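
The partitioning difference can be sketched for a single cycle. The example assumes each thread has a known number of ready instructions and that the SMT machine has the same total issue bandwidth as the CMP (4 cores x 2 slots = 8 slots); the function names and the demand vector are hypothetical.

```python
from typing import List


def static_partition(demand: List[int], slots_per_core: int) -> int:
    """CMP: each thread is confined to the issue slots of its own core."""
    return sum(min(d, slots_per_core) for d in demand)


def dynamic_partition(demand: List[int], total_slots: int) -> int:
    """SMT: any thread may use any free slot in the shared pool."""
    return min(sum(demand), total_slots)


# Ready instructions per thread in one cycle (made-up values).
demand = [5, 0, 1, 0]
print("CMP, 4 x 2-issue:", static_partition(demand, slots_per_core=2))
print("SMT, 8-issue:    ", dynamic_partition(demand, total_slots=8))
```

Here the CMP issues 3 instructions while the SMT issues 6, because the busy thread can borrow slots that idle threads leave unused. The sketch ignores die area and cycle time, which is exactly the caveat raised about the SMT comparison.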

23
SMT claims
  • SMT is faster by 24%.
  • Not very realistic due to the assumptions: die area
    is not compared. A CMP with narrower issue is much
    smaller.

24
Discussion
  • Why is SMT not dominant?
  • Inter-thread contention
  • Latency
  • Hardware overhead (die size)
  • Benchmarks
  • How should the techniques be combined?
  • Synergies and interference
  • Do we need SMT per core?
  • Where is the right balance?
  • How should this be determined?