Title: Multithreaded Processor Architectures
1Multithreaded Processor Architectures
- Gregory T. Byrd and Mark A. Holliday
- IEEE Spectrum, Volume 32, Issue 8, Pages 38-46,
Aug. 1995
- Speaker: Wen-Kai Huang
2Abstract
- In their unending quest for computers with
higher performance, architects often seek to
reduce or hide latency, the number of cycles an
operation takes from start to finish.
Multithreaded architectures hide latency by
supporting multiple concurrent threads that are
independent of one another. This article
introduces multithreaded processor architectures
and discusses their design challenges.
3What's the Problem
- Long-latency operations such as
- memory accesses
- remote reads
- synchronization operations
- may extend for 10 to 100 cycles, forcing a
traditional processor to sit idle until the
result comes in
4Main Idea
- Multithreaded processors multiplex the execution
of a number of concurrent threads onto the
hardware in order to hide latencies
- When a long-latency operation occurs in one of
the threads, another begins execution
5Multiple Hardware Contexts
- Threads are mapped onto hardware contexts, each
of which includes
- General-purpose registers
- Program counter
- Status register
- Because a conventional context switch takes many
cycles, a multithreaded processor must provide a
mechanism to reduce context-switching overhead
- The mechanism is to provide multiple hardware
contexts
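- The idea of multiple hardware contexts can be sketched in a few lines of Python. This is only an illustration: the class names, the method name, and the count of 32 registers are assumptions, not details from the article. The point it shows is that switching threads selects another resident context instead of saving and restoring state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HardwareContext:
    """Per-thread state kept in hardware (illustrative names)."""
    # 32 general-purpose registers is an assumed size, not from the article.
    registers: List[int] = field(default_factory=lambda: [0] * 32)
    pc: int = 0       # program counter
    status: int = 0   # status register


class MultiContextCPU:
    """Hypothetical processor holding several contexts at once."""

    def __init__(self, num_contexts: int):
        self.contexts = [HardwareContext() for _ in range(num_contexts)]
        self.active = 0  # index of the running context

    @property
    def current(self) -> HardwareContext:
        return self.contexts[self.active]

    def switch_to(self, index: int) -> None:
        # Nothing is copied to memory: the switch only changes which
        # resident context the pipeline reads from.
        self.active = index


cpu = MultiContextCPU(num_contexts=4)
cpu.current.pc = 100   # thread in context 0 has advanced
cpu.switch_to(1)       # long-latency operation: run another thread
cpu.switch_to(0)       # later, resume the first thread
print(cpu.current.pc)  # state of context 0 was never saved or restored
```

- Each context keeps its own registers, program counter, and status register, so the state of a suspended thread stays intact while another thread runs.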
6Illustration of Multithreaded Processors
- [Figure: running thread and ready threads]
7Striking the Right Balance
- A multithreaded processor's efficiency is
determined by four parameters
- Number of contexts supported by the hardware
- Context-switching overhead
- Run length: the number of cycles executed
between context switches
- Characteristic latency
8Multithreaded Processor Efficiency
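- One way to relate the four parameters is a first-order model in which every run lasts exactly R cycles, every switch costs C cycles, and every long-latency operation takes L cycles. The sketch below assumes that deterministic model; the function name and the sample numbers are illustrative and are not taken from the article. With few contexts, efficiency grows linearly with the number of contexts; once the other contexts can cover the whole latency, it saturates at R / (R + C).

```python
def efficiency(num_contexts, run_length, switch_cost, latency):
    """Fraction of cycles spent on useful work (simplified model).

    Assumes fixed run length, switch cost and latency for every thread.
    """
    if num_contexts == 1:
        # A single thread never switches; it just waits out the latency.
        return run_length / (run_length + latency)
    # Linear region: the processor still idles while all threads wait.
    linear = num_contexts * run_length / (run_length + switch_cost + latency)
    # Saturation: latency is fully hidden, only switch overhead remains.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


# Assumed sample values: R = 10, C = 2, L = 36 cycles.
for n in (1, 2, 4, 8):
    print(n, round(efficiency(n, 10, 2, 36), 3))
```

- Under these assumptions, adding contexts beyond the saturation point brings no further gain; only a longer run length or a smaller switching overhead raises the ceiling.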
9Multithread Models
- Coarse-grained (block interleaving)
- Executes a single thread until a triggering
event, such as a long-latency operation, occurs
- Fine-grained (cycle-by-cycle interleaving)
- The processor switches to a different thread
on each cycle
- Multiple-issue (simultaneous)
- Integrates the multithreading mechanism into
superscalar architectures
10Coarse-grained Multithreading
- The triggering events in a block-interleaving
(coarse-grained) model can be classified as
follows [1]
11Coarse-grained Example - Sparcle
- The Sparcle processor
- Supports four hardware contexts and switches from
one to another whenever a cache miss occurs
12Discussions about Sparcle
- There is only one program counter and one status
register
- Context-switching overhead is 14 cycles
- The number of contexts that can be effectively
used is often less than four
- Ways to improve efficiency
- Reduce the switching overhead
- Reduce the switching frequency (longer run
length)
- Memory prefetching
- Support more hardware contexts? No
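- A rough calculation suggests why extra contexts would not help here. The sketch assumes a simplified model in which a context runs for a fixed run length, pays the 14-cycle switch, and then waits a fixed latency; the processor stays busy once the other contexts cover that latency. The function names and the run-length and latency values are hypothetical, chosen only for illustration, and are not measurements of Sparcle.

```python
import math

SWITCH_COST = 14  # cycles, the context-switching overhead quoted above


def contexts_to_saturate(run_length, latency, switch_cost=SWITCH_COST):
    """Smallest context count that hides the whole latency.

    The other (n - 1) contexts each occupy run_length + switch_cost
    cycles, so we need (n - 1) * (run_length + switch_cost) >= latency.
    """
    return math.ceil(1 + latency / (run_length + switch_cost))


def saturated_efficiency(run_length, switch_cost=SWITCH_COST):
    """Best achievable efficiency once latency is fully hidden."""
    return run_length / (run_length + switch_cost)


# Hypothetical workload: 50-cycle runs, 100-cycle latency.
print(contexts_to_saturate(50, 100))           # contexts needed
print(saturated_efficiency(50))                # ceiling with 14-cycle switch
print(saturated_efficiency(50, switch_cost=7)) # lower switching overhead
print(saturated_efficiency(100))               # longer run length
```

- For these assumed values three contexts already saturate the processor, so a fifth context would sit unused, while halving the overhead or doubling the run length both raise the ceiling.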
13Fine-grained Architectures
- Fine-grained architectures issue an instruction
from a different thread on every cycle
- Zero switching overhead, achieved by providing
enough registers that no state needs to be saved
or restored when switching contexts
- Major problems of fine-grained solutions
- Hardware cost
- Not all workloads contain sufficient parallelism
- Poor performance with single-thread workloads
14Cycle-by-cycle Interleaving
- An instruction of a thread can be fed into the
pipeline only after the previous instruction of
that thread has completed
- This eliminates pipeline hazards, so the
processor pipeline can be very simple
- However, single-thread performance is poor
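- The rule above can be sketched as a small simulation. It assumes every instruction takes exactly `pipeline_depth` cycles to complete and that threads are served round-robin; the function name and parameters are illustrative, not from the article. It shows why a single thread uses only a fraction of the issue slots while enough threads keep the pipeline full.

```python
def interleaved_utilization(num_threads, pipeline_depth, cycles):
    """Fraction of cycles in which some thread issues an instruction."""
    ready_at = [0] * num_threads  # earliest cycle each thread may issue
    issued = 0
    start = 0                     # round-robin pointer
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (start + offset) % num_threads
            if ready_at[t] <= cycle:
                # The thread's next instruction must wait until this
                # one has left the pipeline.
                ready_at[t] = cycle + pipeline_depth
                issued += 1
                start = (t + 1) % num_threads
                break
    return issued / cycles


# Assumed pipeline depth of 3 stages.
for threads in (1, 2, 3, 4):
    print(threads, round(interleaved_utilization(threads, 3, 300), 3))
```

- In this sketch utilization is roughly the number of threads divided by the pipeline depth, capped at 1, which matches the observation that single-thread performance is poor.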
15Fine-grained Example - MTA
- The MTA processor
- Three-stage pipeline
- 128 hardware contexts
- No cache, so memory accesses have very long
latency
- It is an expensive design
16The Laudon Scheme
- Laudon's architecture adopts caches
- It supports only four hardware contexts
- Because caches keep most latencies short, many
threads are not needed
- Also, when running a single-thread workload,
that thread can still fill the CPU pipeline
17Multiple Issue (Simultaneous)
- In a superscalar processor, each operation
proceeds through one of several functional units
- This setup is clearly compatible with
multithreading
- Operations can be issued from different threads
in the same cycle
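- A minimal sketch of filling issue slots from several threads in one cycle follows. It assumes all functional units are interchangeable and that threads are served in index order; both are simplifications, and the function name is hypothetical. It does not model any particular machine.

```python
def issue_cycle(ready_ops, num_units):
    """Assign functional units for one cycle.

    ready_ops[t] is the number of operations thread t could issue now.
    Threads are served in index order (an assumed priority policy).
    Returns the number of operations issued by each thread.
    """
    issued = []
    free = num_units
    for count in ready_ops:
        take = min(count, free)  # a thread cannot use more units than remain
        issued.append(take)
        free -= take
    return issued


# One thread with 3 ready operations leaves 5 of 8 units idle.
print(issue_cycle([3], 8))
# Three threads together fill all 8 units in the same cycle.
print(issue_cycle([3, 4, 5], 8))
```

- Units that one thread cannot use in a cycle are offered to the next thread, which is how simultaneous issue raises functional-unit utilization.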
18Simultaneous Example - M-Machine
- The M-Machine processor supports 8 functional
units
- This example illustrates processor coupling
with 3 threads over 5 consecutive cycles
19Detecting Parallelism
- The coarsest level
- Task-level parallelism, identified by the user
- Medium-grained, or control-level
- Function- or subroutine-level parallelism,
specified by the programmer or compiler
- Fine-grained, or data-level
- Involves executing the same set of instructions
on different data, e.g. software pipelining
- Very fine-grained
- VLIW or superscalar
20Different Levels of Parallelism
- The threads of a multithreaded architecture
usually correspond to medium- and fine-grain units
21Prospects for Success
- To date (1995), there have been no successful
multithreaded machines because of
- Extra cost
- Hardware complexity
- A dearth of tools for extracting thread-level
parallelism
- Today, however, there are many solutions to
these problems
- Advanced semiconductor technologies
- Advanced CAD tools for circuit design
- Advanced software tools for exploiting
parallelism
22Conclusion - Multithreaded Trends
- Recent announcements by processor vendors depict
a trend, with an increasing emphasis on enhancing
chip-level throughput through multiple cores and
threads [2]
- Each core or thread may not necessarily aim at
delivering the highest frequency [2]
23Appendix
- IEEE Micro Hot Chips 2004 (total 5 chips)
- Sun Microsystems, A Chip MultiThreaded (CMT)
Processor for Network Workloads: dual-thread
- IBM, IBM Power5 A Dual-Core Multithreaded
Processor: dual-thread for each core
- Related work on the ARM processor
- G. Cui and Z. Li, MT_ARM Multithreading
Implementation in ARM7 Architecture, Proc. 4th
International Conference on ASIC, pp. 793-796,
Oct. 2001
- References
- [1] J. Kreuzinger et al., Context-switching
Techniques for Decoupled Multithreaded
Processors, Proc. 25th EUROMICRO Conference,
Vol. 1, pp. 248-251, Sept. 1999
- [2] P. Bose, Chip-level microarchitecture
trends, IEEE Micro, Vol. 24, Issue 2, p. 5,
Mar.-Apr. 2004