CPE 631: Multithreading: Thread-Level Parallelism Within a Processor - PowerPoint PPT Presentation

About This Presentation
Title:

CPE 631: Multithreading: Thread-Level Parallelism Within a Processor

Description:

Resource Sharing Performance Implications of SMT Single thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) ... – PowerPoint PPT presentation

Number of Views:191
Avg rating:3.0/5.0
Slides: 24
Provided by: Alek155
Learn more at: http://www.ece.uah.edu
Category:

less

Transcript and Presenter's Notes

Title: CPE 631: Multithreading: Thread-Level Parallelism Within a Processor


1
CPE 631 Multithreading Thread-Level
Parallelism Within a Processor
  • Electrical and Computer EngineeringUniversity of
    Alabama in Huntsville
  • Aleksandar Milenkovicmilenka_at_ece.uah.edu
  • http//www.ece.uah.edu/milenka

2
Outline
  • Trends in microarchitecture
  • Exploiting thread-level parallelism
  • Exploiting TLP within a processor
  • Resource sharing
  • Performance implications
  • Design challenges
  • Intels HT technology

3
Trends in microarchitecture
  • Higher clock speeds
  • To achieve high clock frequency make pipeline
    deeper (superpipelining)
  • Events that disrupt pipeline (branch
    mispredictions, cache misses, etc) become very
    expensive in terms of lost clock cycles
  • ILP Instruction Level Parallelism
  • Extract parallelism in a single program
  • Superscalar processors have multiple execution
    units working in parallel
  • Challenge to find enough instructions that can be
    executed concurrently
  • Out-of-order execution gt instructions are sent
    to execution units based on instruction
    dependencies rather than program order

4
Trends in microarchitecture
  • Cache hierarchies
  • Processor-memory speed gap
  • Use caches to reduce memory latency
  • Multiple levels of caches smaller and faster
    closer to the processor core
  • Thread-level Parallelism
  • Multiple programs execute concurrently
  • Web-servers have an abundance of software threads
  • Users surfing the web, listening to music,
    encoding/decoding video streams, etc.

5
Exploiting thread-level parallelism
  • CMP Chip Multiprocessing
  • Multiple processors, each with a full set of
    architectural resources, reside on the same die
  • Processors may share an on-chip cache or each
    can have its own cache
  • Examples HP Mako, IBM Power4
  • Challenges Power, Die area (cost)
  • Time-slice multithreading
  • Processor switches between software threads
    after a predefined time slice
  • Can minimize the effects of long lasting events
  • Still, some execution slots are wasted

6
Multithreading Within a Processor
  • Until now, we have executed multiple threads of
    an application on different processors can
    multiple threads execute concurrently on the same
    processor?
  • Why is this desireable?
  • inexpensive one CPU, no external interconnects
  • no remote or coherence misses (more capacity
    misses)
  • Why does this make sense?
  • most processors cant find enough work peak IPC
    is 6, average IPC is 1.5!
  • threads can share resources ? we can increase
    threads without a corresponding linear increase
    in area

7
What Resources are Shared?
  • Multiple threads are simultaneously active (in
    other words, a new thread can start without a
    context switch)
  • For correctness, each thread needs its own PC,
    its own logical regs (and its own mapping from
    logical to phys regs)
  • For performance, each thread could have its own
    ROB (so that a stall in one thread does not stall
    commit in other threads), I-cache, branch
    predictor, D-cache, etc. (for low interference),
    although note that more sharing ? better
    utilization of resources
  • Each additional thread costs a PC, rename table,
    and ROB cheap!

8
Approaches to Multithreading Within a Processor
  • Fine-grained multithreadingswitches threads on
    every clock cycle
  • Pro hide latency of from both short and long
    stalls
  • Con Slows down execution of the individual
    threads ready to go
  • Course-grained multithreadingswitches threads
    only on costly stalls (e.g., L2 stalls)
  • Pros no switching each clock cycle, no slow down
    for ready-to-go threads
  • Con limitations in hiding shorter stalls
  • Simultaneous Multithreadingexploits TLP at the
    same time it exploits ILP

9
How Resources are Shared?
Each box represents an issue slot for a
functional unit. Peak thruput is 4 IPC.
Thread 1
Thread 2
Thread 3
Cycles
Thread 4
Idle
Coarse-grainedMultithreading
Fine-Grained Multithreading
Superscalar
Simultaneous Multithreading
  • Superscalar processor has high under-utilization
    not enough work every cycle, especially when
    there is a cache miss
  • Fine-grained multithreading can only issue
    instructions from a single thread in a cycle
    can not find max work every cycle, but cache
    misses can be tolerated
  • Simultaneous multithreading can issue
    instructions from any thread every cycle has
    the highest probability of finding work for every
    issue slot

10
Resource Sharing
Thread-1
R1 ? R1 R2 R3 ? R1 R4 R5 ? R1 R3
P73? P1 P2 P74 ? P73 P4 P75 ? P73 P74
Instr Fetch
Instr Rename
Issue Queue
Instr Fetch
Instr Rename
P73? P1 P2 P74 ? P73 P4 P75 ? P73 P74 P76 ?
P33 P34 P77 ? P33 P76 P78 ? P77 P35
R2 ? R1 R2 R5 ? R1 R2 R3 ? R5 R3
P76 ? P33 P34 P77 ? P33 P76 P78 ? P77 P35
Thread-2
Register File
FU
FU
FU
FU
11
Performance Implications of SMT
  • Single thread performance is likely to go down
    (caches, branch predictors, registers, etc. are
    shared) this effect can be mitigated by trying
    to prioritize one thread
  • While fetching instructions, thread priority can
    dramatically influence total throughput a
    widely accepted heuristic (ICOUNT) fetch such
    that each thread has an equal share of processor
    resources
  • With eight threads in a processor with many
    resources, SMT yields throughput improvements of
    roughly 2-4
  • Alpha 21464 and Intel Pentium 4 are examples of
    SMT

12
Design Challenges
  • How many threads?
  • Many to find enough parallelism
  • However, mixing many threads will compromise
    execution of individual threads
  • Processor front-end (instruction fetch)
  • Fetch as far as possible in a single thread (to
    maximize thread performance)
  • However, this limits the number of instructions
    available for scheduling from other threads
  • Larger register files (multiple contexts)
  • Minimize clock cycle time
  • Cache conflicts

13
Pentium 4 Hyperthreading architecture
  • One physical processor appears as multiple
    logical processors
  • HT implementation on NetBurst microarchitecture
    has 2 logical processors

Architectural State
Architectural State
  • Architectural state
  • general purpose registers
  • control registers
  • APIC advanced programmable interrupt controller

Processor execution resources
14
Pentium 4 Hyperthreading architecture
  • Main processor resources are shared
  • caches, branch predictors, execution units,
    buses, control logic
  • Duplicated resources
  • register alias tables (map the architectural
    registers to physical rename registers)
  • next instruction pointer and associated control
    logic
  • return stack pointer
  • instruction streaming buffer and trace cache fill
    buffers

15
Pentium 4 Die Size and Complexity
16
Pentium 4 Resources sharing schemes
  • Partition dedicate equal resources to each
    logical processors
  • Good when expect high utilization and somewhat
    unpredicatable
  • Threshold flexible resource sharing with a
    limit on maximum resource usage
  • Good for small resources with bursty utilization
    and when the micro-ops stay in the structure for
    short predictable periods
  • Full sharing flexible with no limits
  • Good for large structures, with variable
    working-set sizes

17
Pentium 4 Shared vs. partitioned queues
shared
partitioned
18
NetBurst Pipeline
threshold
partitioned
19
Pentium 4 Shared vs. partitioned resources
  • Partitioned
  • E.g., major pipeline queues
  • Threshold
  • Puts a threshold on the number of resource
    entries a logical processor can have
  • E.g., scheduler
  • Fully shared resources
  • E.g., caches
  • Modest interference
  • Benefit if we have shared code and/or data

20
Pentium 4 Scheduler occupancy
21
Pentium 4 Shared vs. partitioned cache
22
Pentium 4 Performance Improvements
23
Multi-Programmed Speedup
Write a Comment
User Comments (0)
About PowerShow.com