Theo Ungerer - PowerPoint PPT Presentation

About This Presentation

Title:

Theo Ungerer

Slides: 44

Provided by: irf66

Transcript and Presenter's Notes

Title: Theo Ungerer

1
Opportunities for Hardware Multithreading in
Microprocessors and Microcontrollers

Theo Ungerer
Systems and Networking
University of Augsburg
ungerer_at_informatik.uni-augsburg.de
http://www.informatik.uni-augsburg.de/sik/

2
Basic Principle of Multithreading
Thread pointer
thread 1: Register set 1, PC PSR 1
thread 2: Register set 2, PC PSR 2
thread 3: Register set 3, PC PSR 3
thread 4: Register set 4, PC PSR 4
...
3
Multithreading in High-Performance Processors
Hardware multithreading is the ability to pursue more than one thread within a processor pipeline.
Typically features multiple register sets, fast context switching
Main objective: performance gain by latency hiding for multithreaded workloads

Multithreading in high-performance
microprocessors
IBM RS64 IV (SStar)
Sun UltraSPARC V
Intel Xeon TM

4
Outline of the Presentation

Motivation
State-of-the-art
Multithreading
Multithreading for throughput increase
Multithreading for power reduction
Multithreading for embedded real-time systems
Conclusions and Research Opportunities

5
Today's Multiple-issue Processors

Utilization of instruction level parallelism
by a long instruction pipeline and
by the superscalar or the VLIW-/EPIC-technique.

6
Problem: Low Resource Utilization by Sequential Programs
[Figure: issue slots versus processor cycles, with empty issue slots marked as horizontal losses (1, 2, and 3 slots) and vertical losses (4 slots). Caption: Losses by empty issue slots]
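The two kinds of loss can be counted with a few lines of code. This is an illustrative sketch, not part of the original slides: the function name, the per-cycle encoding (number of instructions issued in each cycle), and the example trace are assumptions. A cycle that issues nothing counts as a vertical loss of the full issue width; a cycle that issues some but not all slots counts as a horizontal loss.

```python
def slot_losses(issued_per_cycle, issue_width):
    """Split empty issue slots into vertical and horizontal losses.

    issued_per_cycle[c] is the number of instructions issued in cycle c
    (hypothetical trace format chosen for this sketch).
    """
    # Vertical loss: a whole cycle with no instruction issued.
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    # Horizontal loss: unused slots in a cycle that issued at least one instruction.
    horizontal = sum(issue_width - n for n in issued_per_cycle if 0 < n < issue_width)
    total_slots = issue_width * len(issued_per_cycle)
    utilization = sum(issued_per_cycle) / total_slots
    return {"vertical": vertical, "horizontal": horizontal, "utilization": utilization}


# Hypothetical 4-issue trace with horizontal losses of 1, 2, and 3 slots
# and two fully empty cycles.
losses = slot_losses([3, 2, 0, 4, 0, 1], issue_width=4)
```

In this made-up trace, 14 of 24 issue slots stay empty, which is the kind of under-utilization that multithreading tries to fill.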
7
Outline of the Presentation

Motivation
State-of-the-art
Multithreading
Multithreading for throughput increase
Multithreading for power reduction
Multithreading for embedded real-time systems
Conclusions and Research Opportunities

8
Multithreading

Two basic multithreading techniques:
Interleaved Multithreading
Block Multithreading
Simultaneous multithreading (SMT):
combines wide-issue superscalar with multithreading,
issues instructions from several threads simultaneously.
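The difference between the two basic techniques can be sketched with a toy single-issue model. This is an illustration under stated assumptions, not material from the slides: the function names and the trace encoding ('C' for an ordinary instruction, 'M' for an instruction that triggers a long-latency event) are hypothetical, and the model ignores whether the next thread is actually ready.

```python
def interleaved_schedule(num_threads, cycles):
    """Interleaved MT: switch to the next thread in every cycle (round robin)."""
    return [cycle % num_threads for cycle in range(cycles)]


def block_schedule(traces, cycles):
    """Block MT: keep running one thread until it issues a long-latency
    instruction ('M'), then switch to the next thread.

    traces[t] is a string of 'C' and 'M' that is replayed cyclically
    (assumed encoding for this sketch).
    """
    position = [0] * len(traces)
    current = 0
    schedule = []
    for _ in range(cycles):
        schedule.append(current)
        trace = traces[current]
        op = trace[position[current] % len(trace)]
        position[current] += 1
        if op == "M":
            # Long-latency event: hand the pipeline to the next thread.
            current = (current + 1) % len(traces)
    return schedule


interleaved = interleaved_schedule(3, 6)        # thread id issued in each cycle
blocked = block_schedule(["CCM", "CM"], 7)      # thread id issued in each cycle
```

SMT is not modeled here, since it issues from several threads within the same cycle rather than choosing one thread per cycle.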

9
Basic Multithreading Techniques
[Figure: panels for single thread, interleaved MT, and block MT]
10
SMT vs. CMP
[Figure: panels for SMT and CMP]
11
Characteristics of Multithreading

Latency Utilization
The latencies that arise in the computation of a
single instruction stream are filled by
computations of another thread.
=> Throughput of multithreaded workloads is increased
Power Reduction
Using less speculation
Rapid Context Switching
appropriate for real-time applications

12
Outline of the Presentation

Motivation
State-of-the-art
Multithreading
Multithreading for throughput increase
Multithreading for power reduction
Multithreading for embedded real-time systems
Conclusions and Research Opportunities

13
Multithreading for Throughput Increase

Lots of research results with simulated SMT since
1995
Some of our own research results
Performance estimation of SMT multimedia
Regard transistor count and chip-space estimation
of the models.

14
Relevant Attributes for Rating Microprocessors
Performance
Resource Requirement
Clock Speed
Power Consumption

Two tools
Performance estimation tool
Transistor count and chip-space estimation tool

15
Transistor Count and Chip-space Estimator

Vision:
The resources of the baseline model should be adjusted such that the same chip space or the same transistor count is covered as in the new microarchitecture models.
We use an analytical method for memory-based structures like register files or internal queues and an empirical method for logic blocks like control logic and functional units.
Half-feature size λ as measure of length of basic cell
Estimator tool is available (also for SimpleScalar) at http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/complexity/
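To give a feel for what an analytical estimate of a memory-based structure can look like, here is a minimal sketch. It is not the estimator tool mentioned above. The cell model (4 transistors for the storage loop plus 2 access transistors per port) is a common textbook approximation assumed for illustration, the function name and the port count are hypothetical, and decoders, sense amplifiers, and wiring are ignored.

```python
def register_file_transistors(entries, bits, ports):
    """Rough transistor count for a multi-ported register file.

    Assumed cell model (not from the slides): 4 transistors for the
    cross-coupled storage loop plus 2 access transistors per port.
    """
    transistors_per_cell = 4 + 2 * ports
    return entries * bits * transistors_per_cell


# Example: 72 registers of 32 bits with a hypothetical 8 ports.
estimate = register_file_transistors(entries=72, bits=32, ports=8)
```

The point of such a formula is that the count grows with entries, width, and ports together, which is why register files and queues lend themselves to an analytical model while irregular control logic does not.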

16
Execution-based Simulator: Baseline SMT Multimedia Processor Model
17
Results of Performance and Hardware Cost
Estimation

Demonstrated by two sets of models:
Maximum processor models with an abundance of resources
Small processor models
Workload is an MPEG-2 decoder made multithreaded

18
Simulation Parameters

Fixed parameters:
1024-entry BTAC, gshare branch predictor (2 K 2-bit counters, 8-bit history, misprediction penalty 5 cycles)
4-way set-associative D- and I-caches with 32-byte cache lines
32 KB local on-chip RAM
64-bit system bus, 4 MB main memory
Varied parameters:
8-12 execution units
256- and 32-entry reservation stations
10 to 4 result buses
different D-cache sizes, D- and I-caches of 4 MB and 64 KB
Parameters varied with number of threads:
32 32-bit general-purpose registers and 40 rename registers (per thread)
32- and 16-entry issue and retirement buffers (per thread)
Fetch and decode bandwidth is scaled with issue bandwidth and number of threads, from 1x1 to 8x8
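For readers unfamiliar with the gshare predictor named in the fixed parameters, the sketch below shows the general technique with the sizes from the slide (2 K two-bit counters, 8 bits of global history). It is not the simulator's code. The class name, the initial counter value, dropping the two low PC bits, and XORing the history into the low index bits are assumptions made for this example.

```python
class Gshare:
    """Minimal gshare branch predictor sketch (illustrative only)."""

    def __init__(self, num_counters=2048, history_bits=8):
        # num_counters must be a power of two so a bit mask can index the table.
        self.index_mask = num_counters - 1
        self.history_mask = (1 << history_bits) - 1
        self.counters = [1] * num_counters  # 2-bit counters, start weakly not-taken
        self.history = 0                    # global branch history register

    def _index(self, pc):
        # gshare: XOR the branch address with the global history.
        return ((pc >> 2) ^ self.history) & self.index_mask

    def predict(self, pc):
        # Counter values 2 and 3 predict taken, 0 and 1 predict not taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = Gshare()
for _ in range(12):
    predictor.update(0x40, True)  # train on an always-taken branch
```

After the history register fills with taken outcomes, the same table entry is reused and its counter saturates, so the branch is predicted taken from then on.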

19
Performance vs. Hardware Cost Estimation: Maximum Processor Models
4 MB I- and D-caches, 6 integer/mm units, 2 local load/store units
20
Transistor Count and Chip Space Estimation of
Maximum Processor Models
21
Small Processor Models
64 KB I- and D-caches, 3 integer/mm units, 1 local load/store unit, 32-entry reservation stations, 16-entry issue and retirement buffers, 4 result buses, fetch and decode bandwidth fixed at 2x4
22
Transistor Count and Chip Space Estimation of
Small Processor Models
23
Results

4-threaded 8-issue SMT over a single-threaded 8-issue:

                Speedup   Transistor increase   Chip space increase
maximum model   3         2%                    9%
small model     1.5       9%                    27%

Commercial multithreaded processors:
Tera, MAJC, Alpha 21464, IBM Blue Gene, Sun UltraSPARC V
Network processors (Intel IXP, IBM PowerNP, Vitesse IQ2x00, Lexra, ...)
IBM RS64 IV: two-threaded block MT, reported 5% overhead
Intel Xeon TM (hyperthreading): two-threaded SMT, reported 5% overhead
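One way to read these figures together is to relate speedup to the extra chip space. The sketch below assumes the flattened numbers on this slide mean a speedup of 3 with 9% more chip space for the maximum model and a speedup of 1.5 with 27% more chip space for the small model. The metric itself (speedup divided by relative area) and the function name are choices made for this illustration, not something the slides define.

```python
def speedup_per_area(speedup, area_increase_percent):
    """Speedup normalized by relative chip space (1.0 = baseline area)."""
    relative_area = 1 + area_increase_percent / 100
    return speedup / relative_area


maximum_model = speedup_per_area(3, 9)     # about 2.75
small_model = speedup_per_area(1.5, 27)    # about 1.18
```

Under this reading both models still come out ahead of the single-threaded baseline after accounting for the added area, with the maximum model benefiting far more.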
24
Outline of the Presentation

Motivation
State-of-the-art
Multithreading
Multithreading for throughput increase
Multithreading for power reduction
Multithreading for embedded real-time systems
Conclusions and Research Opportunities

25
SMT for Reduction of Power Consumption

Observation: Mispredictions cost energy
Today's superscalars: 60% of the fetched and 30% of the executed instructions are squashed
Idea: fill issue slots with less speculative instructions of other threads
=> Simulations of Seng et al. (2000) show that 22% less energy is consumed by using a power-aware scheduler

26
Outline of the Presentation

Motivation
State-of-the-art
Multithreading
Multithreading for throughput increase
Multithreading for power reduction
Multithreading for embedded real-time systems
Conclusions and Research Opportunities

27
Multithreading in Embedded Real-time Systems:
The Komodo Approach

Observation: multithreading allows a context switching overhead of zero cycles
Idea: harness multithreading for embedded real-time systems
Komodo Project: Real-time Java Based on a Multithreaded Java-microcontroller
http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/komodo/indexEng.html

28
Real-time Requirements

run-time predictability
isolation of the threads
programmability
real-time scheduling support
fast context switching

Hard real-time: a deadline may never be missed
Soft real-time: a deadline may occasionally be missed
29
Komodo Solutions

Extremely fast context switching by hardware
multithreading
Real-time scheduling in hardware
Based on a Java processor core
Predictability of all instruction executions by a
careful hardware design

30
Komodo Microcontroller Pipeline
31
Komodo Microcontroller Design
32
Hardware Real-time Scheduling

Real-time scheduler is realized in hardware (by
the priority manager)
Scheduling decision every clock cycle
Four different scheduling algorithms implemented:
Fixed Priority Preemptive (FPP)
Earliest Deadline First (EDF)
Least Laxity First (LLF)
Guaranteed Percentage (GP)
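The first three policies differ only in the key used to pick the next thread, which the following software sketch makes explicit. It is an illustration, not the Komodo priority manager, which makes this decision in hardware every clock cycle. The `Thread` fields, the convention that a larger priority value wins, and the example thread set are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int      # larger value = more important (assumed convention)
    deadline: int      # absolute deadline, in cycles
    remaining: int     # remaining execution time, in cycles
    ready: bool = True


def laxity(thread, now):
    """Slack left before the deadline if the thread ran without interruption."""
    return thread.deadline - now - thread.remaining


def select(threads, policy, now=0):
    """Pick the thread to issue from in the current cycle."""
    ready = [t for t in threads if t.ready]
    if not ready:
        return None
    if policy == "FPP":   # Fixed Priority Preemptive: highest priority wins
        return max(ready, key=lambda t: t.priority)
    if policy == "EDF":   # Earliest Deadline First: nearest deadline wins
        return min(ready, key=lambda t: t.deadline)
    if policy == "LLF":   # Least Laxity First: smallest slack wins
        return min(ready, key=lambda t: laxity(t, now))
    raise ValueError(f"unknown policy: {policy}")


threads = [
    Thread("A", priority=3, deadline=100, remaining=10),
    Thread("B", priority=1, deadline=50, remaining=5),
    Thread("C", priority=2, deadline=60, remaining=40),
]
chosen = {policy: select(threads, policy).name for policy in ("FPP", "EDF", "LLF")}
```

With this made-up thread set each policy picks a different thread: A has the highest priority, B the earliest deadline, and C the least laxity.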

33
Guaranteed Percentage Scheme
[Figure: Guaranteed Percentage scheme. Recoverable labels: "...ional processor", "violation", "context switch on a multithreaded processor", "surplus"]
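The idea of a guaranteed percentage can be sketched as a per-interval cycle budget. This is a simplified software model and not the Komodo hardware: the 100-cycle interval, the rule of serving the thread with the largest remaining budget, the function name, and leaving unassigned cycles idle are all assumptions made for the example.

```python
def gp_schedule(percentages, interval=100):
    """Return the thread chosen in each cycle of one scheduling interval.

    percentages maps a thread name to its guaranteed share in percent.
    Cycles that no thread is entitled to are returned as None.
    """
    if sum(percentages.values()) > 100:
        raise ValueError("guaranteed percentages exceed 100")
    # Convert each percentage into a cycle budget for this interval.
    budget = {name: pct * interval // 100 for name, pct in percentages.items()}
    schedule = []
    for _ in range(interval):
        candidates = [name for name, left in budget.items() if left > 0]
        if not candidates:
            schedule.append(None)  # no thread has budget left in this interval
            continue
        # Serve the thread that is furthest from having received its share.
        chosen = max(candidates, key=lambda name: budget[name])
        budget[chosen] -= 1
        schedule.append(chosen)
    return schedule


schedule = gp_schedule({"A": 50, "B": 30, "C": 10})
```

Because each thread's budget is independent of what the others do, a misbehaving thread cannot take cycles that another thread was promised, which is the isolation property the scheme is meant to provide.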
34
Simulation Results
thread mix (IC, PID, and FFT) applied
35
Technical Data of the Komodo Prototype

Implementation of Komodo core pipeline on a Xilinx XCV800 with 800k gates
ASIC synthesis of whole microcontroller (0.18 µm technology): 340 MHz, 3 mm² chip

data bit width: 32 bit
address space: 19 bit
number of threads: 4
instruction window size: 8 bytes
stack size: 128 entries
external frequency: 33 MHz
internal frequency: 8.25 MHz
CLBs: 9 200
number of gates: 133 000
36
Chip-Space of Komodo Core Pipeline
37
Reducing Power Consumption Using Real-time
Scheduling in Hardware
Current work:
Idea: Use information about the thread states and configurations available within the priority manager for a fine-grained adaptation of power consumption and performance.

Frequency and voltage adjustments in short time
intervals done by hardware

38
State of the Komodo Project

Software simulator
FPGA prototype
Real-time Java system

ASIC
Middleware for distributed embedded systems

39
Conclusions on Multithreading in Real-time Environments

Multithreaded processor cores
Performance gain due to fast context switching
(for hard real-time) and latency hiding (for soft
and non real-time)
More efficient event handling by ISTs
Helper threads possible (garbage collection,
debugging)

Real-time scheduling in hardware
Software overhead for real-time scheduling
removed
more efficient power saving mechanisms possible
better predictability by isolation of threads (GP scheduling)

40
Conclusions and Research Opportunities

Multithreading proves advantageous
Latency hiding: speed-ups of 2-3 for SMT, lots of research done, next generation of microprocessors
Power reduction: 22% savings reported, not much research up to now
Fast context switching: utilized by microcontroller for real-time systems, not much research up to now
Research opportunities
Scheduling in SMT, network processors and multithreaded real-time systems
Thread speculation: how to speed up single-threaded programs?
Multithreading and power consumption
Multithreading in other communities: microcontrollers, SoCs
System software based on helper threads

41
Acknowledgements

SMT Multimedia research group
Uli Sigmund and Heiko Oehring
Complexity estimation group
Marc Steinhaus, Reiner Kolla, Josep L.
Larriba-Pey, Mateo Valero
Komodo project group
Jochen Kreuzinger, Matthias Pfeffer, Sascha
Uhrig, Uwe Brinkschulte, Florentin Picioroaga,
Etienne Schneider

42
Microprocessor Technology Prognosis up to 2012