Title: Simultaneous Multithreading: Maximising On-Chip Parallelism
1. Simultaneous Multithreading: Maximising On-Chip Parallelism
Dean Tullsen, Susan Eggers, Henry Levy
Department of Computer Science, University of Washington, Seattle
Proceedings of ISCA '95, Italy
Presented by Amit Gaur
2. Overview
- Instruction-Level Parallelism vs. Thread-Level Parallelism
- Motivation
- Simulation Environment and Workload
- Simultaneous Multithreading Models
- Performance Analysis
- Extensions in Design
- Single Chip Multiprocessing
- Summary
- Current Implementations
- Retrospective
3. Instruction-Level Parallelism
- Superscalar processors exploit ILP by issuing multiple instructions per cycle
- Shortcomings:
- a) instruction dependencies
- b) long latencies within a single thread
4. Thread-Level Parallelism
- Traditional multithreaded architectures
- Exploit parallelism at the application level
- Multiple threads provide inherent parallelism
- Attack vertical waste: memory and functional-unit latencies
- E.g. server applications, online transaction processing, web services
5. Need for Simultaneous Multithreading
- Attack vertical as well as horizontal waste
- Fetch instructions from multiple threads each cycle
- Exploit all available parallelism for full utilization of execution resources
- Decrease in wasted issue slots
- Compared against a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessors
6. Simulation Environment
- Emulation-based instruction-level simulation
- Modeled on the Alpha AXP 21164, extended for wide superscalar and multithreaded execution
- Support for increased single-stream parallelism: more flexible instruction issue, improved branch prediction, and larger, higher-bandwidth caches
- Code generated with the Multiflow trace-scheduling compiler (static scheduling)
7. Simulation Environment (Continued)
- 10 functional units (4 integer, 2 floating point, 3 load/store, 1 branch)
- All units pipelined
- In-order issue of dependence-free instructions, with an 8-instruction-per-thread window
- L1 and L2 caches are on-chip
- 2048-entry branch prediction table with 2-bit history counters
- Support for up to 8 hardware contexts
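The 2048-entry, 2-bit branch prediction table above is a classic saturating-counter design. A minimal sketch (illustrative code, not the simulator's implementation; the indexing scheme is an assumption):

```python
# 2-bit saturating-counter branch history table.
# Counter states: 0-1 predict not-taken, 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self, entries=2048):
        self.entries = entries
        self.table = [1] * entries  # initialize weakly not-taken

    def _index(self, pc):
        # Index by low-order bits of the (word-aligned) branch PC.
        # Real tables may hash or fold in history bits; this is a guess.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor()
p.update(0x400, True)
p.update(0x400, True)
print(p.predict(0x400))  # True: two taken outcomes train the counter
```

The two-bit hysteresis means a single mispredicted iteration (e.g. a loop exit) does not flip a strongly-taken counter, which is why this design beats a 1-bit scheme on loopy code.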
8. Workload Specifications
- SPEC92 benchmark suite simulated
- To obtain TLP, a distinct program is allocated to each thread: a parallel workload based on multiprogramming
- For each benchmark, the executable with the lowest single-thread execution time is used
9. Limitations of Superscalar Processors
10. Superscalar Performance Degradation
- Many of the delaying causes overlap
- Completely eliminating any one cause will not result in a proportional performance increase
- 61% vertical waste and 39% horizontal waste
- Tackle both using simultaneous multithreading
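The vertical/horizontal split above can be made concrete with a toy issue-slot accounting: a cycle in which no slot issues is vertical waste; empty slots in a partially used cycle are horizontal waste. The trace below is made up for illustration, not taken from the paper:

```python
# Classify wasted issue slots for a 4-wide superscalar.
# A fully empty cycle wastes all slots vertically; a partially
# filled cycle wastes its empty slots horizontally.

def classify_waste(used_per_cycle, width=4):
    """used_per_cycle: list of issue-slot counts used each cycle.
    Returns (vertical, horizontal) waste as fractions of all slots."""
    vertical = horizontal = 0
    for used in used_per_cycle:
        if used == 0:
            vertical += width           # whole cycle lost: vertical waste
        else:
            horizontal += width - used  # partial cycle: horizontal waste
    total = len(used_per_cycle) * width
    return vertical / total, horizontal / total

trace = [0, 3, 0, 2, 4, 0, 1, 0]  # slots used in each of 8 cycles
v, h = classify_waste(trace)
print(f"vertical={v:.0%} horizontal={h:.0%}")  # vertical=50% horizontal=19%
```

Multithreading alone (issuing from another thread on an empty cycle) removes only the vertical term; only simultaneous issue from several threads can also fill the horizontal gaps.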
11. Simultaneous Multithreading Models
- Fine-Grain Multithreading: one thread issues instructions in each cycle
- SM:Full Simultaneous Issue: all eight threads compete for each issue slot, each cycle -> maximum flexibility
- SM:Single Issue, SM:Dual Issue, SM:Four Issue: limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle
- SM:Limited Connection: each hardware context is connected to exactly one of each type of functional unit -> the least dynamic of all the models
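The difference between full simultaneous issue and the capped models can be sketched as a per-cycle slot-filling loop. The thread ready-counts and round-robin policy here are illustrative assumptions, not the paper's scheduling logic:

```python
# One issue cycle of an 8-wide SMT machine. per_thread_cap=None models
# SM:Full Simultaneous Issue; 1, 2, or 4 model the SM:Single/Dual/Four
# Issue variants.

def issue_cycle(ready, width=8, per_thread_cap=None):
    """ready: dict thread_id -> count of ready, independent instructions.
    Returns instructions issued per thread this cycle."""
    issued = {t: 0 for t in ready}
    slots = width
    progress = True
    # Round-robin over threads until slots or ready instructions run out.
    while slots and progress:
        progress = False
        for t in ready:
            cap = per_thread_cap if per_thread_cap is not None else width
            if slots and issued[t] < min(ready[t], cap):
                issued[t] += 1
                slots -= 1
                progress = True
    return issued

ready = {0: 5, 1: 3, 2: 1, 3: 4}
print(issue_cycle(ready))                    # {0: 3, 1: 2, 2: 1, 3: 2}
print(issue_cycle(ready, per_thread_cap=2))  # {0: 2, 1: 2, 2: 1, 3: 2}
```

With four threads the full model fills all 8 slots; the dual-issue cap strands one slot even though thread 0 still has ready work, which is exactly why the capped models trade a little throughput for much simpler issue hardware.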
12. Hardware Complexities of the Models
13. Design Challenges in SMT Processors
- Issue-slot usage is limited by imbalances between resource needs and resource availability
- Number of active threads, limits on buffer sizes, and the instruction mix from multiple threads
- Hardware complexity: must implement superscalar issue along with thread-level parallelism
- Using priority threads can reduce throughput, as the pipeline is less likely to contain an instruction mix from different threads
- Mixing many threads also compromises the performance of individual threads
- Tradeoff: a small number of active threads, and an even smaller number of preferred threads
14. From Superscalar to SMT
- SMT is an out-of-order superscalar extended with hardware to support multiple threads
- Multiple-thread support:
- a) per-thread program counters
- b) per-thread return stacks
- c) per-thread bookkeeping for instruction retirement, traps, and instruction dispatch from the prefetch queue
- d) thread identifiers, e.g. with BTB and TLB entries
- Should SMT processors speculate?
- Determine the role of instruction speculation in SMT
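The replicated per-thread state in (a)-(b) and the thread-tagged shared state in (d) can be sketched as data structures. All names here are illustrative, not from the paper or any real design:

```python
# Per-thread front-end state an SMT core replicates, plus a shared
# BTB whose entries are tagged with a thread ID so threads do not
# overwrite each other's branch targets.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    tid: int
    pc: int = 0                                       # per-thread PC
    return_stack: list = field(default_factory=list)  # per-thread RAS

class SMTFrontEnd:
    def __init__(self, n_contexts=8):
        self.contexts = [ThreadContext(t) for t in range(n_contexts)]
        self.btb = {}  # shared BTB keyed by (tid, branch_pc)

    def call(self, tid, target, return_addr):
        ctx = self.contexts[tid]
        ctx.return_stack.append(return_addr)  # push onto this thread's RAS
        ctx.pc = target

    def ret(self, tid):
        ctx = self.contexts[tid]
        ctx.pc = ctx.return_stack.pop()       # only this thread's stack
        return ctx.pc

fe = SMTFrontEnd()
fe.call(tid=3, target=0x1000, return_addr=0x404)
print(hex(fe.ret(3)))  # 0x404
```

Keeping the return stacks private is essential: interleaving calls and returns from eight threads through one shared stack would mispredict nearly every return.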
15. Instruction Speculation
- Speculation executes probable instructions to hide branch latencies
- The processor fetches based on a hardware prediction
- Correct prediction: keep going
- Incorrect prediction: roll back
- SMT has two ways to deal with branch-delay stalls:
- a) speculation
- b) fetch/issue from other threads
- SMT and speculation:
- Speculation can be wasteful on SMT, as one thread's speculative instructions can compete with, or replace, another thread's non-speculative instructions
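Options (a) and (b) above can be sketched as a toy fetch policy: prefer a thread with no unresolved branch, and only speculate when every thread is blocked. This is an illustrative policy, not the paper's evaluated mechanism:

```python
# Toy per-cycle fetch choice for an SMT front end: give the fetch slot
# to a thread not stalled on a branch; otherwise either speculate past
# the branch or stall, depending on policy.

def pick_fetch(threads, speculate=True):
    """threads: list of dicts with 'tid' and 'blocked_on_branch' keys.
    Returns the tid to fetch from this cycle, or None to stall."""
    for th in threads:
        if not th["blocked_on_branch"]:
            return th["tid"]      # a non-speculative candidate always wins
    if speculate and threads:
        return threads[0]["tid"]  # all blocked: speculate down a prediction
    return None                   # no speculation allowed: stall this cycle

threads = [{"tid": 0, "blocked_on_branch": True},
           {"tid": 1, "blocked_on_branch": False}]
print(pick_fetch(threads))                       # 1: fetch the other thread
print(pick_fetch(threads[:1]))                   # 0: speculate for thread 0
print(pick_fetch(threads[:1], speculate=False))  # None: stall
```

The sketch shows why SMT weakens the case for speculation: with enough runnable threads, the first branch of the loop almost always finds non-speculative work, so speculative slots are spent only when no better instructions exist.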
16. Performance Evaluation of SMT
17. Performance Evaluation (Continued)
- Fine-grain MT: maximum speedup is 2.1; no further gain from vertical-waste reduction beyond 4 threads
- SMT models: speedup ranges from 3.5 to 4.2, with the issue rate reaching 6.3 IPC
- The four-issue model achieves nearly the same performance as full issue; dual issue reaches 94% of full issue at 8 threads
- As the ratio of threads to issue slots increases, the performance of the models increases
- Tradeoff between the number of hardware contexts and hardware complexity
- Adverse effect of competition for shared resources -> the lowest-priority thread runs slowest
- More strain on the caches due to reduced locality: increase in I- and D-cache misses
- Overall increase in instruction throughput
18. Extensions: Alternative Cache Designs for SMT
- Comparison of private per-thread L1 caches to shared caches, for both instructions and data
- Shared caches optimize for a small number of threads
- The shared D-cache outperforms a private D-cache in all configurations
- Private I-caches perform better at a high number of threads
19. Speculation in SMT
20. SMT vs. Single-Chip Multiprocessing
- Similarities: use of multiple register sets, multiple functional units, and the need for high issue bandwidth on a single chip
- Differences: a multiprocessor uses static allocation of resources; an SM processor allows resource allocation to change every cycle
- The same memory configuration is used for both:
- a) 8 KB private I-cache and D-cache
- b) 256 KB 4-way set-associative L2 cache
- c) 2 MB direct-mapped L3 cache
- The tests attempt to bias the comparison in favor of MP
21. Test Results
22. Test Results (Continued)
- Tests A, B, C: a high ratio of functional units and threads to issue bandwidth gives greater opportunity to utilize the issue bandwidth
- Test D: repeats A, but the SMT processor has 10 FUs; it still outperforms the multiprocessor
- Tests E, F: the MP is allowed greater issue bandwidth; even then, the SMT processor shows better performance
- Test G: both have 8 FUs and 8 issues per cycle, but the SMT processor has 8 contexts while the multiprocessor has 2 processors (2 register sets); the SMT processor delivers 2.5 times greater performance
23. Summary
- Simultaneous multithreading combines the facilities of superscalar and multithreaded architectures
- It boosts resource utilization by dynamically scheduling functional units among multiple threads
- Several SMT models were compared with wide superscalar, fine-grain multithreaded, and single-chip, multiple-issue multiprocessing architectures
- The simulation results show that:
- a) a properly configured simultaneous multithreaded architecture can achieve 4 times the instruction throughput of a single-threaded wide superscalar with the same issue width
- b) simultaneous multithreading outperforms fine-grain multithreading by a factor of 2
- c) a simultaneous multithreading processor is superior in performance to a multiple-issue multiprocessor, given the same hardware resources
24. Commercial Machines
- MemoryLogix: an SMT processor for mobile devices
- Sun Microsystems has announced a CMP of four SMT processors
- Hyper-Threading Technology (Intel Xeon architecture)
- Clearwater Networks, a Los Gatos-based startup, was building an 8-context SMT network processor
- Compaq Computer Corp. designed a 4-context SMT processor, the Alpha 21464 (EV8)
25. In Retrospect
- The design of the SMT architecture was influenced by previous projects such as the Tera, MIT Alewife, and the M-Machine
- SMT differed from those projects in addressing a more complete and clearly specified goal
- The idea was to exploit thread-level parallelism to compensate for the lack of instruction-level parallelism in a single stream
- The aim was to target mainstream processor designs such as the Alpha 21164