1
Introduction to Simultaneous Multithreading

2
Law of Microprocessor Performance
  • Performance = 1 / Time
  • Time / Program = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
                  = (instr. count) × (CPI) × (cycle time)
  • ⇒ Performance = (IPC × Hz) / instr. count
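For illustration (numbers assumed, not from the slides): a program of 2×10⁹ instructions with CPI = 1.25 on a 2 GHz clock (0.5 ns cycle time) runs in

    Time = 2×10⁹ × 1.25 × 0.5 ns = 1.25 s

so Performance = (IPC × Hz) / instr. count = (0.8 × 2×10⁹) / 2×10⁹ = 0.8 s⁻¹ = 1 / (1.25 s). Halving CPI (doubling IPC) or doubling the clock frequency each doubles performance.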
3
Optimization areas (so far)
  • clock speed (↑Hz)
  • architectural optimizations (↑IPC)
  • pipelining, superscalar execution, branch prediction, out-of-order execution
  • cache (↑IPC)

4
Main problems
  • clock speed: physical issues when increasing frequency
  • heat: too much, too hard to dissipate
  • power consumption: too high
  • current leakage problems
  • architectural optimizations: insufficient inherent ILP in many applications → superscalar processors cannot exploit all the available issue bandwidth
  • Solution
  • Must find ways other than ILP to boost processor performance
  • → increase TLP

5
Current directions
  • CMP: Chip Multiprocessors (multicores)
  • SMT: Simultaneous Multithreading
  • cache

CMP & SMT both target increased per-chip TLP
  • SMT processors
  • issue and execute instructions from multiple threads simultaneously (in the same cycle)
  • can be built easily and cheaply on top of modern superscalar processors
  • threads share almost all execution resources

6
Superscalar execution
7
Superscalar execution (2)
8
Chip Multiprocessors
9
Time Slicing Multithreading (Fine-grained)
10
Simultaneous Multithreading
Maximum utilization of functional units by independent operations
11
Hyperthreading technology
  • Executes two tasks simultaneously
  • multiprogrammed workloads / multithreaded applications
  • Physical CPU maintains architectural state for two logical processors
  • First implemented on the Intel Xeon processor family
  • the logical processors come at a cost of <5% additional die area
12
Replicated vs. Shared resources
Multiprocessor vs. Hyperthreading
  • SMPs replicate execution resources
  • HT shares execution resources

13
Resource management in HT technology
  • Replicated resources
  • Architectural state: GPRs, control registers, APICs
  • Instruction Pointers, renaming logic
  • smaller resources (ITLBs, return stack buffers, branch history buffers)
  • Partitioned resources
  • Re-Order buffers, Load/Store buffers, instruction queues
  • Buffering queues guarantee independent forward progress
  • Dynamically shared resources
  • Out-of-Order execution engine, global history array
  • Caches

14
Basic Microarchitecture
15
Front End Trace Cache hit
  • Trace Cache access arbitrated each cycle
  • when both LPs request at the same time, access is granted in alternating cycles
  • when only 1 LP requests, the full TC bandwidth is exploited
16
Front End Trace Cache miss
  • L2 Cache access arbitrated on a FCFS basis, with 1 slot always reserved for each LP
  • Decode: when used by both LPs at the same time, access alternates, but in a more coarse-grained fashion

17
OoO Execution Engine
  • Allocator
  • allocates buffer entries to each LP: 63/126 ROB entries, 24/48 Load buffer, 12/24 Store buffer, 128/128 Integer phys. regs, 128/128 FP phys. regs
  • on simultaneous requests, alternates between LPs each cycle
  • stalls an LP when it tries to use more than half of the partitioned buffers (even if there aren't any µops from the peer LP in the queue)

18
OoO Execution Engine (2)
  • Register Rename
  • dynamically expands architectural registers to physical registers
  • per-LP Register Alias Table

19
OoO Execution Engine (3)
  • Schedulers / Execution Units
  • oblivious of logical processors
  • general/memory queues send µops, alternating between LPs every cycle
  • 6 µops/cycle dispatch bandwidth (→ 3 µops/cycle effective per-LP bandwidth when both LPs are active)

20
Retirement
  • architectural state for each thread is committed in program order, alternating between LPs at each cycle
21
Software development implications
  • ILP vs. TLP
  • Factors: parallelization cost, register/cache reuse, contention on execution units
  • TLP for non-optimized codes, ILP for highly-tuned ones
  • Shared caches
  • Threads working on disjoint small, or shared large, cache portions?
  • Threads' execution profile
  • Pairing threads of different profiles (int-bound w/ fp-bound, cpu-bound w/ mem-bound) alleviates resource contention (see the sketch after this list)
  • Can we apply alternative, non-symmetric parallelization methods?
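A minimal sketch of such a pairing (not from the slides): one cache-resident, ALU-heavy thread next to one memory-streaming thread. It assumes Linux with the GNU-specific pthread_setaffinity_np, and that logical CPUs 0 and 1 are HT siblings of the same core, which is system-dependent:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define N (1 << 22)
    static double buf[N];              /* large array: memory-bound traffic */
    static volatile double sink;       /* keeps the sum from being optimized away */

    static void pin_to_cpu(int cpu)    /* pin calling thread to one logical CPU */
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *cpu_bound(void *arg)  /* ALU-heavy, fits in registers/L1 */
    {
        (void)arg;
        pin_to_cpu(0);
        unsigned x = 1;
        for (long i = 0; i < 400000000L; i++)
            x = x * 1664525u + 1013904223u;
        return (void *)(unsigned long)x;
    }

    static void *mem_bound(void *arg)  /* streams through a large array */
    {
        (void)arg;
        pin_to_cpu(1);                 /* assumed HT sibling of CPU 0 */
        double sum = 0.0;
        for (int pass = 0; pass < 16; pass++)
            for (int i = 0; i < N; i++)
                sum += buf[i];
        sink = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, cpu_bound, NULL);
        pthread_create(&t2, NULL, mem_bound, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

The two threads contend on few shared resources: the ALU thread barely touches the cache capacity the streaming thread needs, which is the point of the pairing.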

22
SMTs
  • Targeting
  • throughput of multiprogrammed workloads
  • latency of multithreaded applications
  • Challenge
  • latency of single-threaded applications
  • How?
  • Speculative execution: predict and/or precompute future accesses, branches, or even operation results, and integrate them into the main thread's execution
  • Hardware of current SMT implementations can support only memory-precomputation (prefetching) schemes

23
Example: prefetching helper threads
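A minimal sketch in C of what such a helper thread can look like (not from the slides): the helper runs on the sibling logical processor and touches cache lines a little ahead of the main thread, so the main thread's loads hit in the shared cache. GCC's __builtin_prefetch is a real builtin; data, progress, and RUN_AHEAD are illustrative names and values:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N (1 << 22)
    #define RUN_AHEAD 256                    /* illustrative run-ahead distance */

    static int data[N];
    static atomic_long progress;             /* main thread's current index */

    static void *helper(void *arg)           /* prefetches ahead of the main thread */
    {
        long i;
        (void)arg;
        while ((i = atomic_load(&progress)) < N)
            for (long j = i; j < i + RUN_AHEAD && j < N; j += 16)
                __builtin_prefetch(&data[j], 0, 1);  /* read, low temporal locality */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        long sum = 0;

        pthread_create(&t, NULL, helper, NULL);
        for (long i = 0; i < N; i++) {
            atomic_store(&progress, i);      /* publish position to the helper */
            sum += data[i];                  /* the actual computation */
        }
        atomic_store(&progress, N);          /* let the helper exit */
        pthread_join(t, NULL);
        printf("%ld\n", sum);
        return 0;
    }

The stride of 16 ints (64 bytes) touches one cache line per prefetch.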
24
Synchronization issues and HT
  • spin-wait loops: the core of many synch. primitives (spin-locks, semaphores, barriers, cond. vars)
  • Typical spin-wait loop (shown below)
  • Two issues when executed on HT
  • upon condition satisfaction, all out-of-order pre-executed ld-cmp-jne iterations waiting to be committed must be discarded → pipeline flush penalty
  • spins too fast: checks repeatedly for condition satisfaction a lot faster than the memory bus can actually update sync_var → valuable resources are consumed

wait_loop:
    ld   eax, sync_var
    cmp  eax, 0
    jne  wait_loop
25
Synchronization issues and HT (2)
  • Use of the pause instruction
  • introduces a slight delay in the loop
  • causes condition tests to be issued at approximately the speed of the memory bus
  • during pause, the execution of the spinning thread is de-pipelined → dynamically shared resources are effectively allocated to the working thread

wait_loop:
    pause
    ld   eax, sync_var
    cmp  eax, 0
    jne  wait_loop
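A C rendering of the same loop (a sketch, not from the slides), using the _mm_pause() intrinsic, which compiles to the pause instruction:

    #include <immintrin.h>       /* _mm_pause(), x86 only */
    #include <stdatomic.h>

    static atomic_int sync_var;  /* waiter spins while this is non-zero */

    static void spin_wait(void)
    {
        /* Same ld-cmp-jne structure as the slide's loop; pause throttles
           the spin and de-pipelines the spinning thread. */
        while (atomic_load_explicit(&sync_var, memory_order_acquire) != 0)
            _mm_pause();
    }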
26
Synchronization issues and HT (3)
  • However: statically partitioned resources are not released (they still remain unused)
  • performance bottleneck, especially for long-duration wait loops (e.g. in prefetching schemes)
  • ~15% slowdown of the working thread on average

27
Synchronization issues and HT (4)
  • Further optimization: use of the halt instruction

wait_loop:
    hlt
    ld   eax, sync_var
    cmp  eax, 0
    jne  wait_loop
28
Synchronization issues and HT (5)
  • spinning thread halts; partitioned resources are recombined for full use by the working thread (ST-mode)
  • upon condition satisfaction, the worker sends an IPI to wake up the sleeping thread; resources are repartitioned (MT-mode)
  • wait-loop repetitions and resource consumption are minimized
  • tradeoff: OS intervention, ST/MT-mode transition costs
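At user level, the closest analogue to the halt-based scheme is to block in the OS instead of spinning, letting the scheduler idle the logical processor; a minimal condition-variable sketch (an illustration, not the slides' in-kernel mechanism; the kernel wakeup plays the role of the IPI):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    static int sync_var = 1;       /* waiter blocks while non-zero */

    void waiter(void)              /* sleeps in the OS instead of spinning */
    {
        pthread_mutex_lock(&m);
        while (sync_var != 0)
            pthread_cond_wait(&c, &m);
        pthread_mutex_unlock(&m);
    }

    void worker_done(void)         /* releases the waiter when work completes */
    {
        pthread_mutex_lock(&m);
        sync_var = 0;
        pthread_cond_signal(&c);   /* kernel wakeup, analogous to the IPI */
        pthread_mutex_unlock(&m);
    }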