Simultaneous Multithreading: Maximising On-Chip Parallelism

Dean Tullsen, Susan Eggers, Henry Levy, Department of Computer Science, University of Washington, Seattle

1
Simultaneous Multithreading: Maximising On-Chip
Parallelism
Dean Tullsen, Susan Eggers, Henry Levy
Department of Computer Science, University of
Washington, Seattle
Proceedings of ISCA '95, Italy
Presented by Amit Gaur
2
Overview
  • Instruction Level Parallelism vs. Thread Level
    Parallelism
  • Motivation
  • Simulation Environment and Workload
  • Simultaneous Multithreading Models
  • Performance Analysis
  • Extensions in Design
  • Single Chip Multiprocessing
  • Summary
  • Current Implementations
  • Retrospective

3
Instruction Level Parallelism
  • Superscalar processors
  • Shortcomings
  • a) Instruction Dependencies
  • b) Long latencies within a single thread

4
Thread Level Parallelism
  • Traditional Multithreaded Architecture
  • Exploit parallelism at application level
  • Multiple threads: inherent parallelism
  • Attack vertical waste: memory and functional
    unit latencies
  • E.g. Server applications, online transaction
    processing, web services

5
Need for Simultaneous Multithreading
  • Attack vertical as well as horizontal waste
  • Fetch instructions from multiple threads each
    cycle
  • Exploit all parallelism: full utilization of
    execution resources
  • Decrease in wasted issue slots
  • Comparison with superscalar, fine-grain
    multithreaded, and single-chip, multiple-issue
    multiprocessor architectures

6
Simulation Environment
  • Emulation-based, instruction-level simulation
  • Modelled on the Alpha AXP 21164, extended for
    wide superscalar and multithreaded execution
  • Support for increased single-stream
    parallelism, more flexible instruction issue,
    improved branch prediction, and larger,
    higher-bandwidth caches
  • Code generated using the Multiflow trace
    scheduling compiler (static scheduling)

7
Simulation Environment (Continued)
  • 10 functional units (4 integer, 2 floating
    point, 3 load/store, 1 branch)
  • All units pipelined
  • In-order issue of dependence-free instructions
    from an 8-instruction per-thread window
  • L1 and L2 caches are on-chip
  • 2048-entry, 2-bit branch prediction history
    table maintained
  • Support for up to 8 hardware contexts

8
Workload Specifications
  • SPEC92 Benchmark suite simulated
  • To obtain TLP, a distinct program is allocated
    to each thread: a parallel workload based on
    multiprogramming
  • For each program, the executable with the lowest
    single-thread execution time is used

9
Limitations of Superscalar Processors
10
Superscalar Performance Degradation
  • Overlap in a number of delaying causes
  • Completely eliminating any one cause will not
    necessarily yield a corresponding performance
    increase
  • 61% vertical waste and 39% horizontal waste
  • Tackle both using simultaneous multithreading
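
The vertical/horizontal split above can be made concrete with a small calculation. The following is a minimal sketch, assuming a per-cycle count of issued instructions is available; the function name `classify_waste` and the example trace are hypothetical and are not taken from the paper or its simulator. A cycle that issues nothing contributes a full row of vertical waste, and a cycle that issues some but not all slots contributes horizontal waste.

```python
def classify_waste(issued_per_cycle, issue_width):
    """Split wasted issue slots into vertical and horizontal waste.

    issued_per_cycle: instructions issued in each cycle (hypothetical trace).
    issue_width: maximum instructions the processor can issue per cycle.
    """
    vertical = 0    # slots lost in cycles where nothing issued
    horizontal = 0  # unused slots in cycles where something issued
    for issued in issued_per_cycle:
        if issued == 0:
            vertical += issue_width
        else:
            horizontal += issue_width - issued
    return vertical, horizontal


# Made-up trace for an 8-issue processor, for illustration only.
trace = [0, 3, 0, 8, 1, 0, 2, 5]
vertical, horizontal = classify_waste(trace, issue_width=8)
total = vertical + horizontal
print(f"vertical: {vertical}/{total}, horizontal: {horizontal}/{total}")
```

Fine-grain multithreading can only fill the empty cycles (vertical waste); filling the partially used cycles requires issuing from several threads in the same cycle.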

11
Simultaneous Multithreading Models
  • Fine-Grain Multithreading: only one thread
    issues instructions in each cycle.
  • SM: Full Simultaneous Issue: all eight threads
    compete for each issue slot, each cycle ->
    maximum flexibility.
  • SM: Single Issue, SM: Dual Issue, SM: Four Issue:
    limit the number of instructions each thread can
    issue, or have active in the scheduling window,
    each cycle.
  • SM: Limited Connection: each hardware context is
    connected to exactly one of each type of
    functional unit -> least dynamic of all models.
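
The difference between the issue models above can be sketched as a one-cycle slot allocator. This is an illustrative simplification, not the paper's simulator: the function `issue_cycle`, its parameters, and the ready counts are assumptions, threads are served in fixed priority order, functional unit constraints are ignored, and the Limited Connection model is not covered.

```python
def issue_cycle(ready, issue_width, per_thread_limit=None, single_thread=False):
    """Return how many instructions each thread issues in one cycle.

    ready[i]: issuable instructions in thread i, highest priority first.
    per_thread_limit: cap per thread (1 = Single Issue, 2 = Dual Issue, ...);
        None means no cap beyond the issue width (Full Simultaneous Issue).
    single_thread: if True, only one thread may issue (fine-grain MT).
    """
    limit = issue_width if per_thread_limit is None else per_thread_limit
    issued = [0] * len(ready)
    slots = issue_width
    for tid, n in enumerate(ready):
        take = min(n, limit, slots)
        issued[tid] = take
        slots -= take
        # Stop when slots run out, or when fine-grain MT has picked its thread.
        if slots == 0 or (single_thread and take > 0):
            break
    return issued


ready = [2, 3, 1, 4]  # hypothetical ready counts for four threads
fine_grain = issue_cycle(ready, 8, single_thread=True)
full_issue = issue_cycle(ready, 8)
dual_issue = issue_cycle(ready, 8, per_thread_limit=2)
single_issue = issue_cycle(ready, 8, per_thread_limit=1)
print(fine_grain, full_issue, dual_issue, single_issue)
```

With these made-up numbers, the restricted models leave slots unused that the full model fills, which is the flexibility-versus-complexity trade the slide describes.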

12
Hardware Complexities of Models
13
Design Challenges in SMT processors
  • Issue slot usage limited by imbalances in
    resource needs and resource availability
  • Number of active threads, limitations on buffer
    sizes, instruction mix from multiple threads
  • Hardware complexity: need to implement
    superscalar execution along with thread-level
    parallelism
  • Use of priority threads can result in throughput
    reduction as pipeline less likely to have
    instruction mix from different threads
  • Mixing many threads also compromises performance
    of individual threads.
  • Tradeoff: a small number of active threads, and
    an even smaller number of preferred threads

14
From Superscalar to SMT
  • SMT is an out-of-order superscalar extended with
    hardware to support multiple threads
  • Multiple Thread Support
  • a) per-thread program counters
  • b) per-thread return stacks
  • c) per-thread bookkeeping for instruction
    retirement, traps, and instruction dispatch from
    the prefetch queue
  • d) thread identifiers, e.g. with BTB and TLB
    entries
  • Should SMT processors speculate?
  • Determine role of instruction speculation in SMT.
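
The per-thread state listed in items a) to d) above can be pictured as a small data structure. This is a sketch under stated assumptions: the names `HardwareContext` and `TaggedBTB` and their fields are hypothetical, chosen for illustration, and do not describe any real processor's layout. The point is that the branch target buffer is shared, so each entry carries a thread identifier to keep threads from using each other's predictions.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareContext:
    """Per-thread state replicated for each hardware context (illustrative)."""
    thread_id: int
    pc: int = 0                                        # per-thread program counter
    return_stack: list = field(default_factory=list)   # per-thread return stack
    retired: int = 0                                   # per-thread retirement count


class TaggedBTB:
    """Shared branch target buffer whose entries are tagged with a thread id."""

    def __init__(self):
        self._entries = {}

    def update(self, thread_id, branch_pc, target):
        self._entries[(thread_id, branch_pc)] = target

    def lookup(self, thread_id, branch_pc):
        # Returns None when this thread has no entry for the branch.
        return self._entries.get((thread_id, branch_pc))


contexts = [HardwareContext(thread_id=t) for t in range(8)]
btb = TaggedBTB()
btb.update(thread_id=0, branch_pc=0x400, target=0x480)
btb.update(thread_id=1, branch_pc=0x400, target=0x900)
```

Two threads branching from the same address keep separate targets because the thread identifier is part of the lookup key.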

15
Instruction Speculation
  • Speculation executes probable instructions to
    hide branch latencies
  • Processor fetches on a hardware-based prediction
  • Correct prediction - Keep going
  • Incorrect prediction - Rollback
  • SMT has 2 ways to deal with branch delay stalls
  • a) Speculation
  • b) Fetch/Issue from other threads
  • SMT and Speculation
  • Speculation can be wasteful on SMT, as one
    thread's speculative instructions can compete
    with and replace another's non-speculative
    instructions

16
Performance Evaluation of SMT
17
Performance Evaluation (Contd.)
  • Fine-grain MT: maximum speedup is 2.1. No
    further gain in vertical waste reduction after
    4 threads
  • SMT models: speedup ranges from 3.5 to 4.2, with
    issue rate reaching 6.3 IPC
  • Four-issue model gets nearly the same
    performance as full issue; dual issue is at 94%
    of full issue at 8 threads
  • As the ratio of threads to issue slots
    increases, performance of the models increases.
  • Tradeoff between number of hardware contexts and
    hardware complexity.
  • Adverse effect of competition for shared
    resources -> lowest priority thread runs slowest
  • More strain on caches due to reduced locality:
    increase in I-cache and D-cache misses
  • Overall increase in instruction throughput

18
Extensions: Alternative Cache Design for SMT
  • Comparison of private per-thread caches (L1) to
    shared caches for instructions and data.
  • Shared caches are optimized for a small number
    of threads
  • Shared d-cache outperforms private d-cache for
    all configurations.
  • Private I-caches perform better at high number
    of threads.

19
Speculation in SMT
20
SMT vs. Single-Chip Multiprocessing
  • Similarities: use of multiple register sets,
    multiple functional units, need for high issue
    bandwidth on a single chip
  • Differences: the multiprocessor uses static
    allocation of resources; the SM processor allows
    resource allocation to change every cycle.
  • Same configuration used for testing performance
  • a) 8 KB private I-cache and D-cache
  • b) 256 KB 4-way set-associative L2 cache
  • c) 2 MB direct-mapped L3 cache
  • Attempt to bias the test in favor of MP
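
The static-versus-dynamic allocation difference described above can be illustrated with a toy one-cycle comparison. This is a minimal sketch with assumed names (`mp_issue`, `smt_issue`) and made-up ready counts; it ignores functional unit limits, caches, and dependences, so it shows only the allocation effect and is not a reproduction of the paper's tests.

```python
def mp_issue(ready, issue_per_proc):
    """Multiprocessor: each thread runs on its own processor, and unused
    issue slots on one processor cannot be lent to another."""
    return sum(min(n, issue_per_proc) for n in ready)


def smt_issue(ready, issue_width):
    """SMT: all threads draw from one shared pool of issue slots."""
    return min(sum(ready), issue_width)


# Hypothetical cycle: one thread has plenty of ready instructions,
# the others have few or none.
ready = [6, 0, 1, 1]
mp_total = mp_issue(ready, issue_per_proc=2)   # four 2-issue processors
smt_total = smt_issue(ready, issue_width=8)    # one 8-issue SMT processor
print(mp_total, smt_total)
```

Both configurations have 8 issue slots in total, but the statically partitioned one strands the slots of idle processors while the busy thread is capped at 2.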

21
Test Results
22
Test Results (Contd.)
  • Tests A, B, C: high ratio of FUs and threads to
    issue bandwidth, so greater opportunity to
    utilize issue bandwidth.
  • Test D: repeats A, but the SMT processor has
    only 10 FUs. It still outperforms the
    multiprocessor.
  • Tests E, F: the MP is allowed greater issue
    bandwidth; even then the SMT processor shows
    better performance.
  • Test G: both have 8 FUs and 8 issues per cycle;
    however, the SMT processor has 8 contexts and
    the multiprocessor has 2 processors (2 register
    sets). The SMT processor has 2.5 times greater
    performance.

23
Summary
  • Simultaneous Multithreading combines facilities
    of superscalar as well as multithreaded
    architectures
  • It has the ability to boost utilization of
    resources by dynamically scheduling functional
    units among multiple threads
  • Several models of SMT have been compared with
    wide superscalar, fine-grain multithreaded, and
    single-chip, multiple-issue multiprocessing
    architectures
  • The results of simulation show that
  • a) a simultaneous multithreaded architecture
    with proper configuration can achieve 4 times
    the instruction throughput of a single-threaded
    wide superscalar with the same issue width
  • b) simultaneous multithreading outperforms
    fine-grain multithreading by a factor of 2.
  • c) a simultaneous multithreaded processor is
    superior in performance to a multiple-issue
    multiprocessor, given the same hardware
    resources

24
Commercial Machines
  • MemoryLogix - SMT processor for mobile devices.
  • Sun Microsystems has announced a 4-SMT-processor
    CMP.
  • Hyper-Threading Technology (Intel Xeon
    Architecture)
  • Clearwater Networks, a Los Gatos-based startup,
    was building an 8-context SMT network processor.
  • Compaq Computer Corp. designed a 4-context SMT
    processor, Alpha 21464 (EV-8)

25
In Retrospect
  • The design of the SMT architecture was
    influenced by previous projects like the Tera,
    MIT Alewife and M-Machine
  • SMT differed from previous projects in that it
    addressed a more complete and descriptive goal
    than previous designs.
  • The idea was to use thread-level parallelism to
    compensate for a lack of instruction-level
    parallelism
  • Aim was to target mainstream processor designs
    like the Alpha 21164