Theo Ungerer - PowerPoint PPT Presentation

1
Opportunities for Hardware Multithreading in
Microprocessors and Microcontrollers
  • Theo Ungerer
  • Systems and Networking
  • University of Augsburg
  • ungerer_at_informatik.uni-augsburg.de
  • http://www.informatik.uni-augsburg.de/sik/

2
Basic Principle of Multithreading
[Figure: threads 1 to 4 (and further ones), each with its own register set and its own PC and PSR, plus a thread pointer]
3
Multithreading in High-Performance Processors
Hardware multithreading is the ability to pursue
more than one thread within a processor pipeline.
It typically features multiple register sets and
fast context switching.
Main objective: performance gain by latency hiding
for multithreaded workloads.
  • Multithreading in high-performance
    microprocessors:
  • IBM RS64 IV (SStar)
  • Sun UltraSPARC V
  • Intel Xeon TM

4
Outline of the Presentation
  • Motivation
  • State-of-the-art
  • Multithreading
  • Multithreading for throughput increase
  • Multithreading for power reduction
  • Multithreading for embedded real-time systems
  • Conclusions & Research Opportunities

5
Today's Multiple-issue Processors
  • Utilization of instruction-level parallelism
  • by a long instruction pipeline and
  • by the superscalar or the VLIW/EPIC technique.

6
Problem: Low Resource Utilization by Sequential
Programs
[Figure: issue slots of a processor over cycles. Empty slots are marked as horizontal loss (1, 2, or 3 slots in a cycle) or as vertical loss (4 slots, a whole cycle). Caption: Losses by empty issue slots]
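The two kinds of loss named on this slide can be counted from an issue trace. The following is a minimal sketch, assuming the usual reading of the terms: a cycle in which nothing issues counts entirely as vertical loss, and unused slots in a cycle that issues at least one instruction count as horizontal loss. The function name, the trace values, and the issue width of 4 are illustrative assumptions, not data from the presentation.

```python
def slot_losses(issued_per_cycle, issue_width):
    """Return (horizontal_loss, vertical_loss), both measured in issue slots.

    issued_per_cycle: number of instructions issued in each cycle.
    issue_width: number of issue slots available per cycle.
    """
    horizontal = 0
    vertical = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issued count must lie in 0..issue_width")
        if issued == 0:
            # Completely empty cycle: all slots are vertical loss.
            vertical += issue_width
        else:
            # Partially filled cycle: the unused slots are horizontal loss.
            horizontal += issue_width - issued
    return horizontal, vertical


if __name__ == "__main__":
    trace = [3, 2, 0, 1, 0, 4]  # hypothetical 4-issue trace
    h, v = slot_losses(trace, issue_width=4)
    used = sum(trace)
    total = len(trace) * 4
    print(f"horizontal loss: {h} slots, vertical loss: {v} slots")
    print(f"utilization: {used}/{total} slots")
```

Multithreading attacks both kinds of loss by filling the empty slots with instructions from other threads.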
7
Outline of the Presentation
  • Motivation
  • State-of-the-art
  • Multithreading
  • Multithreading for throughput increase
  • Multithreading for power reduction
  • Multithreading for embedded real-time systems
  • Conclusions & Research Opportunities

8
Multithreading
  • Two basic multithreading techniques:
  • Interleaved multithreading
  • Block multithreading
  • Simultaneous multithreading (SMT)
  • combines wide-issue superscalar with
    multithreading,
  • issues instructions from several threads
    simultaneously.
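The difference between the two basic techniques can be illustrated with a toy single-issue model. This is only a sketch under stated assumptions: each thread is a string of operations, 'c' is a one-cycle operation, and 'm' is an operation after which the thread stalls for miss_latency cycles. Interleaved multithreading is modelled as switching to the next runnable thread every cycle, and block multithreading as staying on one thread until it stalls. The function simulate and all parameters are hypothetical.

```python
def simulate(threads, policy, miss_latency=4):
    """Return the issue trace: the thread id issued in each cycle, or None.

    threads: list of strings over {'c', 'm'} (see the text above).
    policy:  "interleaved" or "block".
    """
    if policy not in ("interleaved", "block"):
        raise ValueError("unknown policy")
    n = len(threads)
    pc = [0] * n        # next operation of each thread
    ready_at = [0] * n  # first cycle in which each thread may issue again
    # Start so that thread 0 is chosen first under both policies.
    last = n - 1 if policy == "interleaved" else 0
    trace = []
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        runnable = [t for t in range(n)
                    if pc[t] < len(threads[t]) and ready_at[t] <= cycle]
        if not runnable:
            trace.append(None)  # empty issue slot, nothing hides the latency
        else:
            if policy == "interleaved":
                # Round robin: the runnable thread following the last one.
                t = min(runnable, key=lambda x: (x - last - 1) % n)
            elif last in runnable:
                t = last        # block MT: keep running the same thread
            else:
                t = min(runnable, key=lambda x: (x - last) % n)
            op = threads[t][pc[t]]
            pc[t] += 1
            if op == "m":
                ready_at[t] = cycle + 1 + miss_latency
            trace.append(t)
            last = t
        cycle += 1
    return trace


if __name__ == "__main__":
    work = ["cmcc", "cmcc"]
    for policy in ("interleaved", "block"):
        trace = simulate(work, policy, miss_latency=3)
        print(policy, trace, "empty slots:", trace.count(None))
```

Running one thread alone corresponds to simulate(["cmcc"], "block", 3). SMT is not covered by this model, because it needs several issue slots per cycle.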

9
Basic Multithreading Techniques
[Figure panels: Single thread, Interleaved MT, Block MT]
10
SMT vs. CMP
[Figure panels: SMT, CMP]
11
Characteristics of Multithreading
  • Latency Utilization
  • The latencies that arise in the computation of a
    single instruction stream are filled by
    computations of another thread.
  • → Throughput of multithreaded workloads is
    increased
  • Power Reduction
  • Using less speculation
  • Rapid Context Switching
  • appropriate for real-time applications

12
Outline of the Presentation
  • Motivation
  • State-of-the-art
  • Multithreading
  • Multithreading for throughput increase
  • Multithreading for power reduction
  • Multithreading for embedded real-time systems
  • Conclusions & Research Opportunities

13
Multithreading for Throughput Increase
  • Lots of research results with simulated SMT since
    1995
  • Some of our own research results
  • Performance estimation of SMT multimedia
  • Regard transistor count and chip-space estimation
    of the models.

14
Relevant Attributes for Rating Microprocessors
Performance
Resource Requirement
Clock Speed
Power Consumption
  • Two tools
  • Performance estimation tool
  • Transistor count and chip-space estimation tool

15
Transistor Count and Chip-space Estimator
  • Vision
  • The resources of the baseline model should be
    adjusted such that the same chip space or the
    same transistor count is covered as in the new
    microarchitecture models.
  • We use an analytical method for memory-based
    structures like register files or internal queues
    and
  • an empirical method for logic blocks like control
    logic and functional units.
  • Half-feature size λ as measure of length of basic
    cell
  • Estimator tool is available (also for
    SimpleScalar) at
    http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/complexity/

16
Execution-based Simulator: Baseline SMT
Multimedia Processor Model
17
Results of Performance and Hardware Cost
Estimation
  • Demonstrated by two sets of models:
  • Maximum processor models with an abundance of
    resources
  • Small processor models
  • Workload is an MPEG-2 decoder made multithreaded

18
Simulation Parameters
  • Fixed parameters:
  • 1024-entry BTAC, gshare branch predictor (2 K
    2-bit counters, 8-bit history, misprediction
    penalty 5 cycles)
  • 4-way set-associative D- and I-caches with
    32-byte cache lines
  • 32 KB local on-chip RAM
  • 64-bit system bus, 4 MB main memory
  • Varied parameters:
  • 8-12 execution units
  • 256- and 32-entry reservation stations
  • 10 to 4 result buses
  • different D-cache sizes, D- and I-caches of 4 MB
    and 64 KB
  • Parameters varied with number of threads:
  • 32 32-bit general-purpose registers and 40
    rename registers (per thread),
  • 32- and 16-entry issue and retirement buffers
    (per thread)
  • Fetch and decode bandwidth is scaled with issue
    bandwidth and number of threads, from 1x1 to 8x8
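The fixed parameters mention a gshare branch predictor with 2 K two-bit counters and an 8-bit history. A minimal sketch of the general gshare idea follows: the branch address is XORed with a global history register to index a table of saturating counters. How the simulator in the presentation combines address and history bits is not stated, so the indexing below (word-aligned address XOR history, modulo table size) and the initial counter value are assumptions, and the class name is hypothetical.

```python
class Gshare:
    """Toy gshare predictor with 2-bit saturating counters."""

    def __init__(self, entries=2048, history_bits=8):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                 # global branch history register
        self.counters = [1] * entries    # 1 = weakly not taken (assumed)

    def _index(self, pc):
        # Assumption: drop the two alignment bits, then XOR with the history.
        return ((pc >> 2) ^ self.history) % self.entries

    def predict(self, pc):
        """Predict taken if the counter is in one of the two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter with the real outcome and shift the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


if __name__ == "__main__":
    predictor = Gshare()
    mispredictions = 0
    for i in range(100):
        outcome = (i % 2 == 0)           # alternating taken / not taken
        if predictor.predict(0x400) != outcome:
            mispredictions += 1
        predictor.update(0x400, outcome)
    print("mispredictions on an alternating branch:", mispredictions)
```

Because the history takes part in the index, the alternating pattern ends up in two different counters and becomes predictable after a short warm-up.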

19
Performance vs. Hardware Cost Estimation: Maximum
Processor Models
4 MB I- and D-caches, 6 integer/mm units, 2 local
load/store units
20
Transistor Count and Chip Space Estimation of
Maximum Processor Models
21
Small Processor Models
64 KB I- and D-caches, 3 integer/mm units, 1 local
load/store unit, 32-entry reservation stations,
16-entry issue and retirement buffers, 4 result
buses, 2x4 fetch and decode bandwidth fixed
22
Transistor Count and Chip Space Estimation of
Small Processor Models
23
Results
  • 4-threaded 8-issue SMT over a single-threaded
    8-issue:

                     Speedup   Transistor   Chip Space
                               Increase     Increase
    maximum model    3         2%           9%
    small model      1.5       9%           27%

  • Commercial Multithreaded Processors:
  • Tera, MAJC, Alpha 21464, IBM Blue Gene, Sun
    UltraSPARC V
  • Network processors (Intel IXP, IBM PowerNP,
    Vitesse IQ2x00, Lexra, ...)
  • IBM RS64 IV: two-threaded block MT, reported 5%
    overhead
  • Intel Xeon TM (hyperthreading): two-threaded SMT,
    reported 5% overhead
24
Outline of the Presentation
  • Motivation
  • State-of-the-art
  • Multithreading
  • Multithreading for throughput increase
  • Multithreading for power reduction
  • Multithreading for embedded real-time systems
  • Conclusions & Research Opportunities

25
SMT for Reduction of Power Consumption
  • Observation: mispredictions cost energy
  • Today's superscalars: 60% of the fetched and
    30% of the executed instructions are squashed
  • Idea: fill issue slots with less speculative
    instructions of other threads
  • → Simulations of Seng et al. 2000 show that 22%
    less energy is consumed by using a power-aware
    scheduler

26
Outline of the Presentation
  • Motivation
  • State-of-the-art
  • Multithreading
  • Multithreading for throughput increase
  • Multithreading for power reduction
  • Multithreading for embedded real-time systems
  • Conclusions & Research Opportunities

27
Multithreading in Embedded Real-time Systems:
The Komodo Approach
  • Observation: multithreading allows a context
    switching overhead of zero cycles
  • Idea: harness multithreading for embedded
    real-time systems
  • Komodo Project: Real-time Java Based on a
    Multithreaded Java-microcontroller
  • http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/komodo/indexEng.html

28
Real-time Requirements
  • run-time predictability
  • isolation of the threads
  • programmability
  • real-time scheduling support
  • fast context switching

Hard real-time: a deadline may never be missed
Soft real-time: a deadline may occasionally be
missed
29
Komodo Solutions
  • Extremely fast context switching by hardware
    multithreading
  • Real-time scheduling in hardware
  • Based on a Java processor core
  • Predictability of all instruction executions by a
    careful hardware design

30
Komodo Microcontroller Pipeline
31
Komodo Microcontroller Design
32
Hardware Real-time Scheduling
  • Real-time scheduler is realized in hardware (by
    the priority manager)
  • Scheduling decision every clock cycle
  • Four different scheduling algorithms implemented
  • Fixed Priority Preemptive (FPP)
  • Earliest Deadline First (EDF)
  • Least Laxity First (LLF)
  • Guaranteed Percentage (GP)
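The first three policies can be written as a selection rule that is evaluated anew every cycle. The sketch below is a software illustration only, not the Komodo priority manager: the record fields, the convention that a larger number means higher priority, and the example threads are assumptions. Laxity is taken in its usual sense, time to deadline minus remaining execution time.

```python
from dataclasses import dataclass


@dataclass
class RtThread:
    name: str
    priority: int   # assumed convention: larger value = higher priority
    deadline: int   # absolute deadline, in cycles
    remaining: int  # execution cycles still needed


def pick(threads, policy, now):
    """Return the thread to run in cycle `now`, or None if none is ready."""
    ready = [t for t in threads if t.remaining > 0]
    if not ready:
        return None
    if policy == "FPP":   # Fixed Priority Preemptive
        return max(ready, key=lambda t: t.priority)
    if policy == "EDF":   # Earliest Deadline First
        return min(ready, key=lambda t: t.deadline)
    if policy == "LLF":   # Least Laxity First
        return min(ready, key=lambda t: t.deadline - now - t.remaining)
    raise ValueError("unknown policy")


if __name__ == "__main__":
    threads = [
        RtThread("a", priority=3, deadline=100, remaining=10),
        RtThread("b", priority=1, deadline=40, remaining=5),
        RtThread("c", priority=2, deadline=50, remaining=45),
    ]
    for policy in ("FPP", "EDF", "LLF"):
        print(policy, "->", pick(threads, policy, now=0).name)
```

With a decision every clock cycle, preemption costs nothing extra in this model: the chosen thread may simply differ from one cycle to the next.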

33
Guaranteed Percentage Scheme
[Figure: guaranteed percentage scheme. Surviving labels, partly truncated: "...ional processor", "...tion", "viola...", "context switch on a multithreaded processor", "surplus"]
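The slide text does not define the guaranteed percentage scheme, so the following sketch rests on an assumption: every thread is promised a fixed percentage of the processor cycles within a short interval (100 cycles here), and cycles that nobody is owed are surplus. The rule for ordering threads inside an interval (largest outstanding quota first) is also an assumption, as are all names.

```python
def gp_schedule(percentages, interval=100):
    """Return one interval of the schedule as a list of thread ids.

    percentages: dict mapping thread id to its guaranteed share in percent.
    A None entry marks a surplus cycle that no thread is owed.
    """
    if sum(percentages.values()) > 100:
        raise ValueError("guaranteed shares must not exceed 100 percent")
    # Cycles still owed to each thread in this interval.
    owed = {t: p * interval // 100 for t, p in percentages.items()}
    slots = []
    for _ in range(interval):
        candidates = [t for t, cycles in owed.items() if cycles > 0]
        if candidates:
            # Assumed rule: serve the thread with the largest outstanding quota.
            t = max(candidates, key=lambda x: owed[x])
            owed[t] -= 1
            slots.append(t)
        else:
            slots.append(None)  # surplus cycle
    return slots


if __name__ == "__main__":
    schedule = gp_schedule({"A": 50, "B": 30, "C": 10})
    for thread in ("A", "B", "C", None):
        print(thread, schedule.count(thread))
```

Since each thread receives its share regardless of what the others do, a scheme of this kind isolates the threads from each other in time.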
34
Simulation Results
thread mix (IC, PID, and FFT) applied
35
Technical Data of the Komodo Prototype
  • Implementation of Komodo core pipeline on a
    Xilinx XCV800 with 800k gates
  • ASIC synthesis of whole microcontroller (0.18 µm
    technology): 340 MHz, 3 mm² chip

data bit width             32 bit
address space              19 bit
number of threads          4
instruction window size    8 bytes
stack size                 128 entries
external frequency         33 MHz
internal frequency         8.25 MHz
CLBs                       9 200
number of gates            133 000
36
Chip-Space of Komodo Core Pipeline
37
Reducing Power Consumption Using Real-time
Scheduling in Hardware
Current work. Idea: use information about the
thread states and configurations available
within the priority manager for a fine-grained
adaptation of power consumption and performance.
  • Frequency and voltage adjustments in short time
    intervals done by hardware

38
State of the Komodo Project
  • Software simulator
  • FPGA prototype
  • Real-time Java system
  • ASIC
  • Middleware for distributed embedded systems

39
Conclusions on Multithreading in Real-time
Environments
  • Multithreaded processor cores:
  • Performance gain due to fast context switching
    (for hard real-time) and latency hiding (for soft
    and non-real-time)
  • More efficient event handling by interrupt
    service threads (ISTs)
  • Helper threads possible (garbage collection,
    debugging)
  • Real-time scheduling in hardware:
  • Software overhead for real-time scheduling
    removed
  • More efficient power-saving mechanisms possible
  • Better predictability by isolation of threads
    (GP scheduling)

40
Conclusions & Research Opportunities
  • Multithreading proves advantageous:
  • Latency hiding: speed-ups of 2-3 for SMT, lots
    of research done, next generation of
    microprocessors
  • Power reduction: 22% savings reported, not much
    research up to now
  • Fast context switching: utilized by
    microcontroller for real-time systems, not much
    research up to now
  • Research opportunities:
  • Scheduling in SMT, network processors and
    multithreaded real-time systems
  • Thread speculation: how to speed up
    single-threaded programs?
  • Multithreading and power consumption
  • Multithreading in other communities:
    microcontrollers, SoCs
  • System software based on helper threads

41
Acknowledgements
  • SMT Multimedia research group
  • Uli Sigmund and Heiko Oehring
  • Complexity estimation group
  • Marc Steinhaus, Reiner Kolla, Josep L.
    Larriba-Pey, Mateo Valero
  • Komodo project group
  • Jochen Kreuzinger, Matthias Pfeffer, Sascha
    Uhrig, Uwe Brinkschulte, Florentin Picioroaga,
    Etienne Schneider

42
Microprocessor Technology Prognosis up to 2012
  • SIA (Semiconductor Industry Association)
    forecast, 1997

43
Research Directions?
  • Increase performance of a single thread of
    control by
  • more instruction-level speculation
  • Better branch prediction,
  • Trace cache and next trace prediction,
  • Data dependence and value prediction
  • Increase throughput of a workload of multiple
    threads
  • Utilize thread-level and instruction-level
    parallelism
  • Chip-Multiprocessors
  • Multithreading (hardware thread = thread or
    process)
  • Thread speculation