Title: Introduction to Simultaneous Multithreading
1. Introduction to Simultaneous Multithreading
2. Law of Microprocessor Performance

Performance = 1 / (Time / Program)

Time / Program = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
               = (instr. count) × (CPI) × (cycle time)

⇒ Performance = (IPC × Hz) / instr. count
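To make the arithmetic concrete, here is a small C sketch that evaluates both forms of the formula; the instruction count, CPI, and clock frequency below are made-up illustrative values, not measurements of any real processor:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical workload/processor parameters (illustrative only). */
        double instr_count = 2e9;   /* instructions executed by the program */
        double cpi         = 0.5;   /* cycles per instruction (IPC = 2)     */
        double freq_hz     = 3e9;   /* clock frequency: 3 GHz               */

        /* Iron law: Time/Program = instr. count x CPI x cycle time */
        double time = instr_count * cpi * (1.0 / freq_hz);
        printf("time = %.3f s, performance = %.3f runs/s\n", time, 1.0 / time);

        /* Equivalent form: Performance = (IPC x Hz) / instr. count */
        printf("check: %.3f runs/s\n", ((1.0 / cpi) * freq_hz) / instr_count);
        return 0;
    }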
3. Optimization areas (so far)
- clock speed (↑Hz)
- architectural optimizations (↑IPC)
  - pipelining, superscalar execution, branch prediction, out-of-order execution
- cache (↑IPC)
4. Main problems
- clock speed: physical issues when increasing frequency
  - too much heat, too hard to dissipate
  - power consumption too high
  - current leakage problems
- architectural optimizations: insufficient inherent ILP in many applications ⇒ superscalar processors cannot exploit all of the available issue bandwidth
- Solution
  - must find ways other than ILP to boost processor performance ⇒ increase TLP
5. Current directions
- CMP: Chip Multiprocessors (multicores)
- SMT: Simultaneous Multithreading
- CMP and SMT both target increased per-chip TLP
- SMT processors
  - issue and execute instructions from multiple threads simultaneously (in the same cycle)
  - can be built easily and cheaply on top of modern superscalar processors
  - threads share almost all execution resources
6. Superscalar execution
7. Superscalar execution (2)
8. Chip Multiprocessors
9. Time Slicing Multithreading (Fine-grained)
10. Simultaneous Multithreading
Maximum utilization of functional units by independent operations
11. Hyperthreading technology
- Executes two tasks simultaneously
  - multiprogrammed workloads / multithreaded applications
- Physical CPU maintains architectural state for two logical processors
- First implemented on the Intel Xeon processor family
  - the logical processors come at a cost of <5% additional die area
12Replicated vs. Shared resources
Multiprocessor
Hyperthreading
- SMPs replicate execution resources
- HT shares execution resources
13. Resource management in HT technology
- Replicated resources
  - Architectural state: GPRs, control registers, APICs
  - Instruction Pointers, renaming logic
  - smaller structures (ITLBs, return stack buffers, branch history buffers)
- Partitioned resources
  - Re-Order buffers, Load/Store buffers, instruction queues
  - buffering queues guarantee independent forward progress
- Dynamically shared resources
  - Out-of-Order execution engine, global history array
  - Caches
14. Basic Microarchitecture
15. Front End: Trace Cache hit
- Trace Cache access is arbitrated each cycle
  - when both LPs request at the same time, access is granted in alternating cycles
  - when only one LP requests, the full TC bandwidth is exploited
16. Front End: Trace Cache miss
- L2 Cache: access is arbitrated on a FCFS basis, with one slot always reserved for each LP
- Decode: when used by both LPs at the same time, access is alternated, but in a more coarse-grained fashion
17. OoO Execution Engine
- Allocator
  - allocates buffer entries to each LP: 63/126 ROB entries, 24/48 Load buffer entries, 12/24 Store buffer entries, 128/128 Integer phys. regs, 128/128 FP phys. regs
  - on simultaneous requests, alternates between LPs at each cycle
  - stalls an LP when it tries to use more than half of the partitioned buffers (even if there aren't any uops from the peer LP in the queue)
18. OoO Execution Engine (2)
- Register Rename
  - dynamically maps architectural registers onto physical registers
  - per-LP Register Alias Table
19. OoO Execution Engine (3)
- Schedulers / Execution Units
  - oblivious of logical processors
  - general/memory queues send uops, alternating between LPs every cycle
  - 6 µIPC dispatch bandwidth (⇒ 3 µIPC effective per-LP bandwidth when both LPs are active)
20. Retirement
- architectural state for each thread is committed in program order, alternating between LPs at each cycle
21. Software development implications
- ILP vs. TLP
  - factors: parallelization cost, register/cache reuse, contention on execution units
  - TLP for non-optimized codes, ILP for highly-tuned ones
- Shared caches
  - should threads work on disjoint small, or on shared large cache portions?
- Threads' execution profile
  - pairing threads of different profiles (int-bound w/ fp-bound, cpu-bound w/ mem-bound) alleviates resource contention; a sketch of such pairing follows below
  - can we apply alternative, non-symmetric parallelization methods?
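As an illustration of the pairing idea, here is a minimal Linux sketch that pins an integer-bound and a memory-bound thread onto the two logical processors of one physical core using pthread_setaffinity_np. The worker bodies are placeholders, and the assumption that logical CPUs 0 and 1 are HT siblings is ours (verify via /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on a real machine):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Hypothetical worker bodies with complementary profiles. */
    static void *int_bound(void *arg) { /* ... integer-heavy kernel ... */ return 0; }
    static void *mem_bound(void *arg) { /* ... pointer-chasing kernel ... */ return 0; }

    /* Pin a thread to one logical CPU. */
    static void pin(pthread_t t, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, 0, int_bound, 0);
        pthread_create(&t2, 0, mem_bound, 0);
        pin(t1, 0);   /* assumed HT sibling pair: logical CPUs 0 and 1 */
        pin(t2, 1);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;
    }

Because the paired threads stress different execution units, they contend less for the dynamically shared resources described on slide 13.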
22. SMTs
- Targeting
  - throughput of multiprogrammed workloads
  - latency of multithreaded applications
- Challenge
  - latency of single-threaded applications
- How?
  - speculative execution: predict and/or precompute future accesses, branches, or even operation results, and integrate them into the main thread's execution
  - hardware of current SMT implementations can support only memory precomputation (prefetching) schemes
23. Example: prefetching helper threads
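The original slide shows this example as a figure. Below is a minimal C sketch of the idea: a helper thread runs ahead of the main thread and touches upcoming data so it is already cache-resident when the main thread reaches it. The array, the look-ahead distance, and the use of GCC's __builtin_prefetch are our illustrative assumptions, not the scheme from any specific implementation:

    #include <pthread.h>
    #include <stdio.h>
    #include <stddef.h>

    #define N      (1 << 24)
    #define AHEAD  4096        /* how far ahead the helper prefetches (tunable) */

    static double data[N];
    static volatile size_t pos;    /* main thread's current index */
    static volatile int done;

    /* Helper thread: touch data the main thread will need soon. */
    static void *prefetcher(void *arg) {
        while (!done) {
            size_t p = pos;
            for (size_t i = p; i < p + AHEAD && i < N; i += 8)
                __builtin_prefetch(&data[i], 0, 1);   /* read, low locality */
        }
        return 0;
    }

    int main(void) {
        pthread_t helper;
        pthread_create(&helper, 0, prefetcher, 0);

        double sum = 0.0;              /* main thread: the real work */
        for (size_t i = 0; i < N; i++) {
            sum += data[i];
            pos = i;
        }
        done = 1;
        pthread_join(helper, 0);
        printf("%f\n", sum);
        return 0;
    }

On an HT processor the helper shares the caches with the main thread (slide 13), which is exactly what makes the precomputation visible to it.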
24. Synchronization issues and HT
- spin-wait loops: the core of many synch. primitives (spin-locks, semaphores, barriers, cond. vars)
- Typical spin-wait loop:

    wait_loop:  mov  eax, sync_var
                cmp  eax, 0
                jne  wait_loop

- Two issues when executed on HT
  - upon condition satisfaction, all out-of-order pre-executed load-cmp-jne iterations waiting to be committed must be discarded ⇒ pipeline flush penalty
  - it spins too fast: it checks for condition satisfaction a lot faster than the memory bus can actually update sync_var ⇒ valuable resources are consumed
25. Synchronization issues and HT (2)
- Use of the pause instruction
  - introduces a slight delay into the loop
  - causes condition tests to be issued at approximately the speed of the memory bus
  - during pause, execution of the spinning thread is de-pipelined ⇒ dynamically shared resources are effectively allocated to the working thread

    wait_loop:  pause
                mov  eax, sync_var
                cmp  eax, 0
                jne  wait_loop
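In C, the same loop is typically written with the _mm_pause intrinsic, which compiles to the pause instruction; a minimal sketch, where sync_var and its type are our illustrative assumptions:

    #include <emmintrin.h>   /* _mm_pause() */

    static volatile int sync_var;   /* assumed flag, set by the working thread */

    /* Spin until sync_var becomes non-zero, yielding dynamically
     * shared pipeline resources to the sibling logical processor. */
    static void spin_wait(void) {
        while (sync_var == 0)
            _mm_pause();
    }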
26. Synchronization issues and HT (3)
- However, statically partitioned resources are not released (they simply remain unused)
- performance bottleneck, especially for long-duration wait loops (e.g. in prefetching schemes)
- 15% slowdown of the working thread on average
27. Synchronization issues and HT (4)
- Further optimization: use of the halt instruction

    wait_loop:  hlt
                mov  eax, sync_var
                cmp  eax, 0
                jne  wait_loop
28. Synchronization issues and HT (5)
- the spinning thread halts, and partitioned resources are recombined for full use by the working thread (ST-mode)
- upon condition satisfaction, the worker sends IPIs to wake up the sleeping thread, and resources are repartitioned (MT-mode)
- wait-loop repetitions and resource consumption are minimized
- tradeoffs: OS intervention, cost of ST/MT-mode transitions (a user-level analogue is sketched below)
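Since hlt is a privileged instruction, user code obtains the equivalent behavior by asking the OS to block the waiter outright. The following is a rough Linux analogue of the halt/IPI scheme built on the futex syscall; the futex-based design is our illustrative choice, not something the slides prescribe:

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <limits.h>

    static volatile int sync_var;   /* assumed flag, as in the spin-wait loops above */

    /* Waiter: sleep in the kernel while sync_var is still 0
     * (analogous to the halted logical processor in ST-mode). */
    static void wait_for_flag(void) {
        while (sync_var == 0)
            syscall(SYS_futex, &sync_var, FUTEX_WAIT, 0, NULL, NULL, 0);
    }

    /* Worker: set the flag, then wake the sleeper
     * (analogous to the IPI that brings the core back to MT-mode). */
    static void signal_flag(void) {
        sync_var = 1;
        syscall(SYS_futex, &sync_var, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
    }

As on the slide, the tradeoff is a kernel transition on each wait/wake in exchange for freeing the waiter's resources for the working thread.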