Title: Theo Ungerer
1Opportunities for Hardware Multithreading in
Microprocessors and Microcontrollers
- Theo Ungerer
- Systems and Networking
- University of Augsburg
- ungerer_at_informatik.uni-augsburg.de
- http://www.informatik.uni-augsburg.de/sik/
2Basic Principle of Multithreading
[Figure: four hardware thread contexts (thread 1 to thread 4, with further contexts indicated by "..."), each with its own register set and its own PC and PSR; a thread pointer selects one of the contexts.]
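The principle in the figure can be illustrated with a small software model. This is only a sketch: the names `ThreadContext`, `MultithreadedCore`, and `switch_to` are hypothetical, the register count of 32 is an assumption, and a real processor does this in hardware rather than in code.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """One hardware thread slot: a private register set plus PC and PSR."""
    pc: int = 0
    psr: int = 0
    # 32 registers per set is an assumption made for this sketch.
    registers: list = field(default_factory=lambda: [0] * 32)


class MultithreadedCore:
    """Keeps all contexts resident; a context switch only moves the thread pointer."""

    def __init__(self, num_threads: int = 4):
        self.contexts = [ThreadContext() for _ in range(num_threads)]
        self.thread_pointer = 0  # selects the active register set and PC/PSR

    def switch_to(self, thread_id: int) -> None:
        # Nothing is saved or restored: every thread keeps its own state.
        self.thread_pointer = thread_id

    @property
    def active(self) -> ThreadContext:
        return self.contexts[self.thread_pointer]


core = MultithreadedCore()
core.active.registers[1] = 42   # the first context writes register 1
core.switch_to(2)
core.active.registers[1] = 7    # another context uses the same register number
core.switch_to(0)
restored_value = core.active.registers[1]  # state of the first context is untouched
```

Because each context stays resident, switching threads costs no register traffic in this model, which is the property the later slides rely on for fast context switching.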
3Multithreading in High Performance Processors
Hardware multithreading is the ability to pursue
more than one thread within a processor pipeline.
Typical features: multiple register sets and
fast context switching. Main objective:
performance gain by latency hiding for
multithreaded workloads.
- Multithreading in high-performance microprocessors:
  - IBM RS64 IV (SStar)
  - Sun UltraSPARC V
  - Intel Xeon TM
4Outline of the Presentation
- Motivation
- State-of-the-art
- Multithreading
- Multithreading for throughput increase
- Multithreading for power reduction
- Multithreading for embedded real-time systems
- Conclusions and Research Opportunities
5Today's Multiple-issue Processors
- Utilization of instruction-level parallelism
  - by a long instruction pipeline and
  - by the superscalar or the VLIW/EPIC technique.
6Problem: Low Resource Utilization by Sequential
Programs
[Figure: issue slots (horizontal) over processor cycles (vertical) for a sequential program; empty slots are marked as horizontal losses of 1, 2, and 3 slots and as vertical losses of 4 slots. Caption: Losses by empty issue slots]
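The two kinds of loss can be counted from a per-cycle issue trace. The following is a minimal sketch, assuming a 4-issue processor; the function name `issue_slot_losses` and the trace values are hypothetical and chosen only to mirror the labels in the figure.

```python
def issue_slot_losses(issued_per_cycle, issue_width):
    """Return (horizontal_loss, vertical_loss), both counted in issue slots.

    A cycle in which nothing issues wastes the full issue width (vertical loss).
    A cycle in which some but not all slots are filled wastes the rest
    (horizontal loss).
    """
    horizontal = 0
    vertical = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issued count must be between 0 and issue_width")
        if issued == 0:
            vertical += issue_width
        else:
            horizontal += issue_width - issued
    return horizontal, vertical


# Hypothetical trace: number of instructions issued in each cycle.
trace = [3, 0, 2, 4, 0, 1]
horizontal, vertical = issue_slot_losses(trace, issue_width=4)
utilization = sum(trace) / (len(trace) * 4)
```

In this trace the horizontal losses are 1, 2, and 3 slots and two cycles are lost entirely, so fewer than half of the 24 available slots do useful work.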
8Multithreading
- Two basic multithreading techniques:
  - Interleaved Multithreading
  - Block Multithreading
- Simultaneous multithreading (SMT):
  - combines wide-issue superscalar with multithreading,
  - issues instructions from several threads simultaneously.
9Basic Multithreading Techniques
[Figure: execution diagrams for a single thread, interleaved MT, and block MT]
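The difference between the two basic techniques can be sketched with a single-issue toy model. Everything here is an assumption made for illustration: the function name `simulate`, the encoding of a thread as a list of instruction latencies, and the rule that a thread cannot issue again until its previous instruction's latency has elapsed. Interleaved MT is modelled as rotating to the next ready thread every cycle, block MT as staying on the current thread until it stalls.

```python
def simulate(threads, policy):
    """Return the thread id issued in each cycle (None marks an empty slot).

    threads: list of instruction streams, each a list of latencies in cycles.
    policy:  "interleaved" or "block".
    """
    n = len(threads)
    next_instr = [0] * n   # index of the next instruction per thread
    ready_at = [0] * n     # first cycle in which each thread may issue again
    if policy == "interleaved":
        current = n - 1    # so that the rotation starts with thread 0
    elif policy == "block":
        current = 0
    else:
        raise ValueError("policy must be 'interleaved' or 'block'")

    schedule = []
    cycle = 0
    while any(next_instr[t] < len(threads[t]) for t in range(n)):
        if policy == "interleaved":
            # Prefer a different thread every cycle.
            order = [(current + 1 + k) % n for k in range(n)]
        else:
            # Prefer the running thread; switch only when it cannot issue.
            order = [(current + k) % n for k in range(n)]
        chosen = None
        for t in order:
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                chosen = t
                break
        if chosen is not None:
            ready_at[chosen] = cycle + threads[chosen][next_instr[chosen]]
            next_instr[chosen] += 1
            current = chosen
        schedule.append(chosen)
        cycle += 1
    return schedule


# Thread 0 contains one long-latency instruction (3 cycles), thread 1 does not.
single = simulate([[1, 3, 1]], "block")
block = simulate([[1, 3, 1], [1, 1, 1]], "block")
interleaved = simulate([[1, 3, 1], [1, 1, 1]], "interleaved")
```

Run alone, thread 0 leaves two empty cycles behind its long-latency instruction; with a second thread available, both policies fill those cycles, only in a different order.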
10SMT vs. CMP
[Figure: SMT compared with CMP (chip multiprocessor)]
11Characteristics of Multithreading
- Latency Utilization
  - The latencies that arise in the computation of a single instruction stream are filled by computations of another thread.
  - => Throughput of multithreaded workloads is increased
- Power Reduction
  - Using less speculation
- Rapid Context Switching
  - appropriate for real-time applications
13Multithreading for Throughput Increase
- Lots of research results with simulated SMT since 1995
- Some of our own research results:
  - Performance estimation of SMT multimedia
  - Taking transistor count and chip-space estimation of the models into account
14Relevant Attributes for Rating Microprocessors
[Figure labels: Performance, Resource Requirement, Clock Speed, Power Consumption]
- Two tools:
  - Performance estimation tool
  - Transistor count and chip-space estimation tool
15Transistor Count and Chip-space Estimator
- Vision:
  - The resources of the baseline model should be adjusted such that the same chip space or the same transistor count is covered as in the new microarchitecture models.
- We use an analytical method for memory-based structures like register files or internal queues, and
- an empirical method for logic blocks like control logic and functional units.
- The half-feature size λ serves as the measure of length of a basic cell.
- The estimator tool is available (also for SimpleScalar) at http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/complexity/
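To illustrate what an analytical estimate for a memory-based structure in units of λ can look like, here is a minimal sketch. It is not the estimator tool's actual model: the function names, the base cell dimensions, and the rule that every port adds one wire track to the cell in each direction are all assumptions made for this example.

```python
def register_file_area_lambda2(entries, bits, read_ports, write_ports,
                               base_cell_width=20, base_cell_height=20,
                               track_pitch=8):
    """Estimate the cell-array area of a register file in lambda^2.

    Assumption: each port adds one wire track (track_pitch lambda) to the
    width and to the height of a storage cell. All default dimensions are
    hypothetical placeholders, not measured values.
    """
    ports = read_ports + write_ports
    cell_width = base_cell_width + track_pitch * ports
    cell_height = base_cell_height + track_pitch * ports
    return entries * bits * cell_width * cell_height


def lambda2_to_mm2(area_lambda2, feature_size_um):
    """Convert an area in lambda^2 to mm^2, with lambda = half the feature size."""
    lambda_um = feature_size_um / 2
    return area_lambda2 * lambda_um ** 2 / 1e6  # 1 mm^2 = 1e6 um^2


# Hypothetical example: 32 registers of 32 bits with 2 read ports and 1 write port.
area = register_file_area_lambda2(entries=32, bits=32, read_ports=2, write_ports=1)
area_mm2 = lambda2_to_mm2(area, feature_size_um=0.18)
```

Expressing the estimate in λ first keeps it independent of the process; only the final conversion depends on the feature size.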
16Execution-based Simulator: Baseline SMT
Multimedia Processor Model
17Results of Performance and Hardware Cost
Estimation
- Demonstrated by two sets of models:
  - Maximum processor models with an abundance of resources
  - Small processor models
- Workload is an MPEG-2 decoder made multithreaded
18Simulation Parameters
- Fixed parameters:
  - 1024-entry BTAC, gshare branch predictor (2 K 2-bit counters, 8-bit history, misprediction penalty 5 cycles)
  - 4-way set-associative D- and I-caches with 32-byte cache lines
  - 32 KB local on-chip RAM
  - 64-bit system bus, 4 MB main memory
- Varied parameters:
  - 8-12 execution units
  - 256- and 32-entry reservation stations
  - 10 to 4 result buses
  - different D-cache sizes, D- and I-caches of 4 MB and 64 KB
- Parameters varied with number of threads:
  - 32 32-bit general-purpose registers and 40 rename registers (per thread)
  - 32- and 16-entry issue and retirement buffers (per thread)
  - Fetch and decode bandwidth is scaled with issue bandwidth and number of threads (1x1 to 8x8)
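For readers unfamiliar with the gshare scheme named among the fixed parameters, the following sketch shows the general technique with the listed sizes (2 K 2-bit counters, 8-bit global history). It is a generic textbook formulation, not the simulator's code: the class name, the initial counter value, and the use of the word-aligned PC for indexing are assumptions.

```python
class GsharePredictor:
    """gshare: a table of 2-bit counters indexed by PC XOR global history."""

    def __init__(self, table_bits=11, history_bits=8):
        self.table_mask = (1 << table_bits) - 1       # 2**11 = 2 K counters
        self.history_mask = (1 << history_bits) - 1   # 8-bit global history
        self.counters = [1] * (1 << table_bits)       # start weakly not-taken (assumption)
        self.history = 0

    def _index(self, pc):
        # Drop the two low PC bits (assumes 4-byte instructions), then XOR.
        return ((pc >> 2) ^ self.history) & self.table_mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2    # True means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


# A branch at a hypothetical address that is always taken, executed 20 times.
predictor = GsharePredictor()
mispredictions = 0
for _ in range(20):
    if not predictor.predict(0x400):
        mispredictions += 1
    predictor.update(0x400, True)

penalty_cycles = mispredictions * 5   # 5-cycle misprediction penalty from the slide
```

The warm-up cost comes from the history register: until it has filled with taken outcomes, every execution of the branch maps to a fresh counter.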
19Performance vs. Hardware Cost Estimation: Maximum
Processor Models
4 MB I- and D-caches, 6 integer/mm units, 2 local
load/store units
20Transistor Count and Chip Space Estimation of
Maximum Processor Models
21Small Processor Models
64 KB I- and D-caches, 3 integer/mm units, 1 local
load/store unit, 32-entry reservation stations, 16-entry
issue and retirement buffers, 4 result buses,
fetch and decode bandwidth fixed at 2x4
22Transistor Count and Chip Space Estimation of
Small Processor Models
23Results
- 4-threaded 8-issue SMT over a single-threaded 8-issue:
                   Speedup   Transistor increase   Chip-space increase
    maximum model  3         2%                    9%
    small model    1.5       9%                    27%
- Commercial multithreaded processors:
  - Tera, MAJC, Alpha 21464, IBM Blue Gene, Sun UltraSPARC V
  - Network processors (Intel IXP, IBM PowerNP, Vitesse IQ2x00, Lextra, ...)
  - IBM RS64 IV: two-threaded block MT, reported 5% overhead
  - Intel Xeon TM (hyperthreading): two-threaded SMT, reported 5% overhead
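One way to read the results is as speedup per unit of hardware cost. The sketch below assumes the increases on this slide are percentages (maximum model: speedup 3, transistor increase 2%, chip-space increase 9%; small model: speedup 1.5, 9%, 27%); the metric itself, speedup divided by relative cost, is an illustration and not taken from the slides.

```python
# Values read from the slide; the percent interpretation is an assumption.
models = {
    "maximum": {"speedup": 3.0, "transistor_increase": 0.02, "chip_space_increase": 0.09},
    "small": {"speedup": 1.5, "transistor_increase": 0.09, "chip_space_increase": 0.27},
}


def speedup_per_cost(speedup, relative_increase):
    """Speedup divided by relative hardware cost (baseline cost = 1.0)."""
    return speedup / (1.0 + relative_increase)


per_chip_space = {
    name: speedup_per_cost(m["speedup"], m["chip_space_increase"])
    for name, m in models.items()
}
per_transistor = {
    name: speedup_per_cost(m["speedup"], m["transistor_increase"])
    for name, m in models.items()
}
```

Under this reading both models gain more performance than they cost in chip space, with the maximum model benefiting far more than the small one.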
25SMT for Reduction of Power Consumption
- Observation: mispredictions cost energy
  - Today's superscalars: 60% of the fetched and 30% of the executed instructions are squashed
- Idea: fill issue slots with less speculative instructions of other threads
  - => Simulations of Seng et al. (2000) show that 22% less energy is consumed by using a power-aware scheduler
27Multithreading in Embedded Real-time Systems:
The Komodo Approach
- Observation: multithreading allows a context switching overhead of zero cycles
- Idea: harness multithreading for embedded real-time systems
- Komodo Project: Real-time Java Based on a Multithreaded Java-microcontroller
- http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/komodo/indexEng.html
28Real-time Requirements
- run-time predictability
- isolation of the threads
- programmability
- real-time scheduling support
- fast context switching
Hard real-time: a deadline may never be missed
Soft real-time: a deadline may occasionally be missed
29Komodo Solutions
- Extremely fast context switching by hardware multithreading
- Real-time scheduling in hardware
- Based on a Java processor core
- Predictability of all instruction executions by a careful hardware design
30Komodo Microcontroller Pipeline
31Komodo Microcontroller Design
32Hardware Real-time Scheduling
- Real-time scheduler is realized in hardware (by the priority manager)
- Scheduling decision every clock cycle
- Four different scheduling algorithms implemented:
  - Fixed Priority Preemptive (FPP)
  - Earliest Deadline First (EDF)
  - Least Laxity First (LLF)
  - Guaranteed Percentage (GP)
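The selection rules of the first three algorithms can be stated compactly in software. This is a sketch of the textbook definitions, not of the priority manager's hardware: the names `RtThread` and `pick`, the convention that a larger number means higher priority, and the example values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RtThread:
    name: str
    priority: int    # FPP: larger value means higher priority (assumption)
    deadline: int    # absolute deadline, in cycles
    remaining: int   # cycles of work still to be done


def pick(threads, policy, now):
    """Choose the thread to run in the current cycle, or None if all are done."""
    ready = [t for t in threads if t.remaining > 0]
    if not ready:
        return None
    if policy == "FPP":
        return max(ready, key=lambda t: t.priority)
    if policy == "EDF":
        return min(ready, key=lambda t: t.deadline)
    if policy == "LLF":
        # Laxity: time left until the deadline minus the work still required.
        return min(ready, key=lambda t: t.deadline - now - t.remaining)
    raise ValueError("policy must be 'FPP', 'EDF' or 'LLF'")


# Hypothetical thread set in which each policy prefers a different thread.
threads = [
    RtThread("A", priority=3, deadline=100, remaining=10),
    RtThread("B", priority=1, deadline=40, remaining=5),
    RtThread("C", priority=2, deadline=50, remaining=45),
]
choice = {policy: pick(threads, policy, now=0).name for policy in ("FPP", "EDF", "LLF")}
```

Because the decision is taken every clock cycle, a function like `pick` would be evaluated once per cycle; in hardware this is only practical when context switching itself is free.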
33Guaranteed Percentage Scheme
[Figure: guaranteed percentage scheme. Recoverable labels: "...ional processor" (partly lost), "violation", "context switch", "on a multithreaded processor", "surplus"]
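The idea behind a guaranteed percentage scheme can be sketched as a per-interval cycle budget. The details below are assumptions for illustration only: the function name `gp_schedule`, the interval length of 100 cycles, the rule for choosing among threads that still have budget, and the percentages. Only the thread names IC, PID, and FFT come from the thread mix used in the simulation results.

```python
def gp_schedule(percentages, interval=100):
    """Return the thread chosen in each cycle of one interval (None = surplus).

    Each thread is guaranteed its percentage of the cycles in the interval.
    Cycles not claimed by any guarantee are reported as surplus.
    """
    if sum(percentages.values()) > 100:
        raise ValueError("guaranteed percentages must not exceed 100 in total")
    budget = {name: share * interval // 100 for name, share in percentages.items()}
    schedule = []
    for _ in range(interval):
        candidates = [name for name in budget if budget[name] > 0]
        if candidates:
            # Arbitrary choice for this sketch: serve the largest remaining budget.
            chosen = max(candidates, key=lambda name: budget[name])
            budget[chosen] -= 1
        else:
            chosen = None
        schedule.append(chosen)
    return schedule


# Hypothetical guarantees for the three threads.
schedule = gp_schedule({"IC": 30, "PID": 20, "FFT": 40})
cycles = {name: schedule.count(name) for name in ("IC", "PID", "FFT", None)}
```

Each thread receives exactly its share regardless of what the others do, which is the isolation property; the unclaimed cycles correspond to the surplus.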
34Simulation Results
thread mix (IC, PID, and FFT) applied
35Technical Data of the Komodo Prototype
- Implementation of Komodo core pipeline on a Xilinx XCV800 with 800k gates
- ASIC synthesis of whole microcontroller (0.18 µm technology): 340 MHz, 3 mm² chip
    data bit width            32 bit
    address space             19 bit
    number of threads         4
    instruction window size   8 bytes
    stack size                128 entries
    external frequency        33 MHz
    internal frequency        8.25 MHz
    CLBs                      9 200
    number of gates           133 000
36Chip-Space of Komodo Core Pipeline
37Reducing Power Consumption Using Real-time
Scheduling in Hardware
Current work. Idea: use information about the
thread states and configurations available
within the priority manager for a fine-grained
adaptation of power consumption and performance.
- Frequency and voltage adjustments in short time
intervals done by hardware
38State of the Komodo Project
- Software simulator
- FPGA prototype
- Real-time Java system
- ASIC
- Middleware for distributed embedded systems
39Conclusions on Multithreading in Real-time
Environments
- Multithreaded processor cores:
  - Performance gain due to fast context switching (for hard real-time) and latency hiding (for soft and non real-time)
  - More efficient event handling by ISTs
  - Helper threads possible (garbage collection, debugging)
- Real-time scheduling in hardware:
  - Software overhead for real-time scheduling removed
  - More efficient power saving mechanisms possible
  - Better predictability by isolation of threads (GP scheduling)
40Conclusions and Research Opportunities
- Multithreading proves advantageous:
  - Latency hiding: speed-ups of 2-3 for SMT, lots of research done, next generation of microprocessors
  - Power reduction: 22% savings reported, not much research up to now
  - Fast context switching: utilized by microcontroller for real-time systems, not much research up to now
- Research opportunities:
  - Scheduling in SMT, network processors and multithreaded real-time systems
  - Thread speculation: how to speed up single-threaded programs?
  - Multithreading and power consumption
  - Multithreading in other communities: microcontrollers, SoCs
  - System software based on helper threads
41Acknowledgements
- SMT Multimedia research group:
  - Uli Sigmund and Heiko Oehring
- Complexity estimation group:
  - Marc Steinhaus, Reiner Kolla, Josep L. Larriba-Pey, Mateo Valero
- Komodo project group:
  - Jochen Kreuzinger, Matthias Pfeffer, Sascha Uhrig, Uwe Brinkschulte, Florentin Picioroaga, Etienne Schneider
42Microprocessor Technology Prognosis up to 2012
- SIA (Semiconductor Industry Association)
prognosis 1997
43Research Directions?
- Increase performance of a single thread of control by more instruction-level speculation:
  - Better branch prediction,
  - Trace cache and next trace prediction,
  - Data dependence and value prediction
- Increase throughput of a workload of multiple threads by utilizing thread-level and instruction-level parallelism:
  - Chip multiprocessors
  - Multithreading (a hardware thread may be a thread or a process)
  - Thread speculation