Title: Hyper-Threading, Chip Multiprocessors, and Both
1 Hyper-Threading, Chip Multiprocessors, and Both
2 To Be Tackled in Multithreading
- Review of Threading Algorithms
- Hyper-Threading Concepts
- Hyper-Threading Architecture
- Advantages/Disadvantages
3 Threading Algorithms
- Time-slicing (fine grain)
  - The processor switches between threads at fixed time intervals.
  - High overhead, especially if one of the processes is in the wait state.
- Switch-on-event (coarse grain)
  - Task switching in case of long pauses
  - While waiting for data from a relatively slow source, CPU resources are given to other processes.
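The two switching policies above can be contrasted with a toy model. This is a minimal sketch: the workload format, cycle counts, and the assumption that a wait completes while another thread runs are all illustrative, not real scheduler behavior.

```python
# Toy model: each "thread" is a list of bursts; 'cpu' bursts do work,
# 'wait' bursts stall.  Time-slicing switches every fixed quantum;
# switch-on-event switches only when the running thread stalls.

def time_slicing(threads, quantum):
    """Fine grain: switch at fixed intervals; count context switches."""
    queues = [list(t) for t in threads]
    switches = 0
    while any(queues):
        for q in queues:
            if q:
                kind, cycles = q[0]
                if cycles > quantum:
                    q[0] = (kind, cycles - quantum)
                else:
                    q.pop(0)
                switches += 1          # a switch after every quantum
    return switches

def switch_on_event(threads):
    """Coarse grain: run until the thread stalls on a 'wait', then switch."""
    queues = [list(t) for t in threads]
    switches = 0
    while any(queues):
        for q in queues:
            while q and q[0][0] == 'cpu':
                q.pop(0)               # run all cpu bursts back to back
            if q:                      # hit a 'wait' burst: give the CPU away
                q.pop(0)               # assume the wait completes meanwhile
                switches += 1
    return switches

workload = [[('cpu', 30), ('wait', 100), ('cpu', 20)],
            [('cpu', 50), ('wait', 80), ('cpu', 10)]]
print(time_slicing(workload, quantum=10), switch_on_event(workload))  # 29 2
```

With these made-up bursts, fixed quanta force 29 switches while switch-on-event needs only 2, which is the slide's point about the relative overhead of the two policies.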
4 Threading Algorithms (cont.)
- Multiprocessing
  - Distributes the load over many processors
  - Adds extra cost
- Simultaneous multithreading
  - Multiple threads execute on a single processor without switching.
  - The basis of Intel's Hyper-Threading technology.
5 Hyper-Threading Concept
- At any point in time, only part of the processor's resources is used to execute the program code of a thread.
- Unused resources can also be loaded, for example, with parallel execution of another thread/application.
- Extremely useful in desktop and server applications where many threads are used.
6 Quick Recall: Many Resources IDLE!
For an 8-way superscalar.
From Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", ISCA 1995.
8
- A superscalar processor with no multithreading
- A superscalar processor with coarse-grain multithreading
- A superscalar processor with fine-grain multithreading
- A superscalar processor with simultaneous multithreading (SMT)
9 Simultaneous Multithreading (SMT)
- Example: the new Pentium with Hyper-Threading
- Key idea: exploit ILP across multiple threads!
  - i.e., convert thread-level parallelism into more ILP
- Exploits the following features of modern processors:
  - Multiple functional units
    - Modern processors typically have more functional units available than a single thread can utilize
  - Register renaming and dynamic scheduling
    - Multiple instructions from independent threads can co-exist and co-execute!
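The slot-filling idea behind SMT can be sketched with a toy issue model. The 8-wide width matches the study cited on slide 6; the per-cycle ILP of 3 and the instruction counts are made-up illustrative numbers, not measurements.

```python
# Toy model: an 8-wide core issues up to 8 instructions per cycle, but a
# single thread rarely has 8 independent instructions ready.  SMT fills the
# leftover slots from a second thread.

ISSUE_WIDTH = 8

def cycles_needed(per_cycle_ilp, total_insts):
    """Cycles to issue total_insts when only per_cycle_ilp are ready per cycle."""
    ready = min(per_cycle_ilp, ISSUE_WIDTH)
    return -(-total_insts // ready)        # ceiling division

def smt_cycles(ilp_a, insts_a, ilp_b, insts_b):
    """Both threads share the 8 issue slots each cycle; drain both."""
    a, b, cycles = insts_a, insts_b, 0
    while a > 0 or b > 0:
        slots = ISSUE_WIDTH
        take_a = min(ilp_a, slots, a)      # thread A fills what it can
        slots -= take_a
        take_b = min(ilp_b, slots, b)      # thread B takes the idle slots
        a, b, cycles = a - take_a, b - take_b, cycles + 1
    return cycles

# Two threads with ILP of ~3 each: alone, each wastes 5 of 8 slots per cycle.
alone = cycles_needed(3, 300) + cycles_needed(3, 300)  # run back to back
smt = smt_cycles(3, 300, 3, 300)                       # co-scheduled
print(alone, smt)  # 200 100
```

Co-scheduling the two streams halves the cycle count here, which is exactly the "convert TLP into more ILP" claim in this model's terms.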
10 Hyper-Threading Architecture
- First used in the Intel Xeon MP processor
- Makes a single physical processor appear as multiple logical processors.
- Each logical processor has a copy of the architecture state.
- Logical processors share a single set of physical execution resources.
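This is visible from software: the OS enumerates logical processors, not physical cores. A quick way to see it from Python is sketched below; `os.cpu_count()` reports logical CPUs, while the physical-core count is a Linux-specific, best-effort parse of `/proc/cpuinfo` and may not apply on other systems.

```python
import os

# Logical processors: what the OS (and Hyper-Threading) expose for scheduling.
logical = os.cpu_count()
print("logical processors:", logical)

# Best-effort physical core count on Linux: count distinct
# (physical id, core id) pairs in /proc/cpuinfo.
try:
    cores = set()
    phys = core = None
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("physical id"):
                phys = line.split(":")[1].strip()
            elif line.startswith("core id"):
                core = line.split(":")[1].strip()
                cores.add((phys, core))
    if cores:
        print("physical cores:", len(cores))
except OSError:
    pass  # not Linux; only the logical count is portable
```

On a Hyper-Threading machine the logical count is typically twice the physical count; on a machine without SMT the two match.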
11 Hyper-Threading Architecture
- Operating systems and user programs can schedule processes or threads to logical processors as if they were physical processors in a multiprocessing system.
- From an architecture perspective, we have to worry about the logical processors sharing resources:
  - Caches, execution units, branch predictors, control logic, and buses.
12 Power5 dataflow ...
- Why only two threads?
  - With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
- Cost
  - The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.
13 Advantages
- The extra architecture adds only about 5% to the total die area.
- No performance loss if only one thread is active; increased performance with multiple threads.
- Better resource utilization.
14 Disadvantages
- To benefit from Hyper-Threading, code cannot execute serially.
- Threads are non-deterministic and involve extra design effort.
- Threads have increased overhead.
- Shared resource conflicts.
15 Multicore
- Multiprocessors on a single chip
16 Basic Shared Memory Architecture
- Processors all connected to a large shared memory
- Where are the caches?
[Diagram: processors P1, P2, ..., Pn connected through an interconnect to memory]
- Now take a closer look at structure, costs, limits, programming
17 What About Caching???
- Want high performance for shared memory? Use caches!
  - Each processor has its own cache (or multiple caches)
  - Place data from memory into the cache
  - Write-back caches: don't send all writes over the bus to memory
- Caches reduce average latency
  - Automatic replication closer to the processor
  - More important to a multiprocessor than a uniprocessor: latencies are longer
- Normal uniprocessor mechanisms to access data
  - Loads and stores form a very low-overhead communication primitive
- Problem: cache coherence!
18 Example: Cache Coherence Problem
[Diagram: processors P1, P2, P3 with caches, connected by a bus to memory and I/O devices]
- Things to note:
  - Processors could see different values for u after event 3
  - With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value when
- How to fix with a bus: a coherence protocol
  - Use the bus to broadcast writes or invalidations
  - Simple protocols rely on the presence of a broadcast medium
- The bus is not scalable beyond about 64 processors (max)
  - Capacity, bandwidth limitations
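The scenario on this slide can be simulated in a few lines. This is an illustrative sketch only: the `Cache` class and `bus` list are stand-ins, and the coherent write simply writes through to memory, whereas a real snooping protocol would also source the data from the owning cache.

```python
# Sketch of the stale-read problem with write-back caches, plus the
# bus-based fix: broadcasting an invalidation on every write.

class Cache:
    def __init__(self, memory):
        self.memory, self.lines = memory, {}

    def read(self, addr):
        if addr not in self.lines:                # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value, bus=None):
        self.lines[addr] = value                  # write back: memory is stale
        if bus is not None:                       # coherence fix
            self.memory[addr] = value             # write through, for simplicity
            for other in bus:
                if other is not self:
                    other.lines.pop(addr, None)   # invalidate other copies

# Without a protocol: P1 and P3 read u=5, P3 writes 7, P1 still sees 5.
memory = {"u": 5}
p1, p3 = Cache(memory), Cache(memory)
p1.read("u"); p3.read("u")
p3.write("u", 7)
stale = p1.read("u")
print("no protocol, P1 sees:", stale)             # 5 (stale)

# With an invalidation broadcast, P1's copy is dropped; its next read misses.
memory = {"u": 5}
p1, p3 = Cache(memory), Cache(memory)
bus = [p1, p3]
p1.read("u"); p3.read("u")
p3.write("u", 7, bus=bus)
fresh = p1.read("u")
print("with invalidation, P1 sees:", fresh)       # 7 (fresh)
```

The broadcast is exactly why simple protocols need a bus, and why the bus becomes the scaling limit noted above.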
19 Limits of Bus-Based Shared Memory
- Assume:
  - 1 GHz processor w/o cache
  - ⇒ 4 GB/s instruction BW per processor (32-bit)
  - ⇒ 1.2 GB/s data BW at a 30% load-store mix
- Suppose a 98% instruction hit rate and a 95% data hit rate
  - ⇒ 80 MB/s instruction BW per processor
  - ⇒ 60 MB/s data BW per processor
  - ⇒ 140 MB/s combined BW
- Assuming 1 GB/s bus bandwidth
  - Therefore, 8 processors will saturate the bus
[Diagram: processors with caches on a shared bus to MEM and I/O; 5.2 GB/s between each processor and its cache, 140 MB/s from each cache to the bus]
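The arithmetic above can be checked directly:

```python
# The slide's numbers: caches filter most requests, but each processor
# still demands ~140 MB/s of bus bandwidth.

inst_bw = 4000                       # MB/s instruction BW per 1 GHz processor
data_bw = 1200                       # MB/s data BW at a 30% load-store mix
inst_hit, data_hit = 0.98, 0.95

inst_miss_bw = inst_bw * (1 - inst_hit)   # 80 MB/s reaches the bus
data_miss_bw = data_bw * (1 - data_hit)   # 60 MB/s reaches the bus
per_proc = inst_miss_bw + data_miss_bw    # 140 MB/s combined

bus_bw = 1000                             # 1 GB/s bus
saturating = bus_bw / per_proc            # ~7.1, so ~8 processors saturate it
print(round(per_proc), round(saturating, 1))  # 140 7.1
```

Since 1000 / 140 ≈ 7.1, the eighth processor pushes the bus past its capacity, matching the slide's conclusion.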
21 Cache Organizations for Multi-cores
- L1 caches are always private to a core
- L2 caches can be private or shared
- Advantages of a shared L2 cache:
  - Efficient dynamic allocation of space to each core
  - Data shared by multiple cores is not replicated
  - Every block has a fixed home, hence it is easy to find the latest copy
- Advantages of a private L2 cache:
  - Quick access to the private L2; good for small working sets
  - A private bus to the private L2 → less contention
22 A Reminder: SMT (Simultaneous Multithreading)
SMT vs. CMP
23 A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer '97
Superscalar (SS)
- For the same area (a billion-transistor DRAM area)
- Superscalar and SMT: very complex
  - Wide issue
  - Advanced branch prediction
  - Register renaming
  - Out-of-order (OOO) instruction issue
  - Non-blocking data caches
CMP
24 SS and SMT vs. CMP
- CPU cores: three main hardware design problems (of SS and SMT)
  - Area increases quadratically with core complexity
    - Number of registers: O(instruction window size)
    - Register ports: O(issue width)
    - CMP solves this problem (area linear in issue width)
  - Longer cycle times
    - Long wires, many MUXes and crossbars
    - Large buffers, queues, and register files
    - Clustering (decreases ILP) or deep pipelining (branch misprediction penalties)
    - CMP allows a small cycle time with little effort: cores are small and fast, rely on software to schedule, and have poor per-core ILP
  - Complex design and verification
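The quadratic-versus-linear area argument can be illustrated with a back-of-the-envelope model. The constants here (a window of 32 entries per issue slot) are arbitrary illustrative choices, not real silicon numbers; only the scaling shape matters.

```python
# Toy scaling model: superscalar area grows roughly with
# registers x register ports ~ window x issue width, and the window itself
# is scaled with issue width, so area is quadratic in issue width.
# A CMP replicates a fixed-width core, so its area is linear in cores.

def superscalar_area(issue_width, window=None):
    """Area ~ window * ports; window assumed proportional to issue width."""
    window = window if window is not None else 32 * issue_width
    return window * issue_width          # quadratic in issue width

def cmp_area(cores, issue_width):
    """A CMP just replicates a small fixed-width core."""
    return cores * superscalar_area(issue_width)   # linear in core count

# Quadrupling issue width (3 -> 12) costs ~16x area for one wide core...
print(superscalar_area(12) / superscalar_area(3))  # 16.0
# ...while 4 narrow cores (the same 12 total issue slots) cost only 4x.
print(cmp_area(4, 3) / superscalar_area(3))        # 4.0
```

Same total issue bandwidth, a factor-of-four difference in area under this model, which is the slide's case for CMP.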
25 SS and SMT vs. CMP
- Memory
  - A 12-issue SS or SMT requires a multiported data cache (4-6 ports)
    - 2 × 128 KB (2-cycle latency)
  - CMP: 16 × 16 KB (single-cycle latency), but the secondary cache is slower (multiported)
  - Shared memory, write-through caches
CMP
26 Performance Comparison
- Compress (integer apps): low ILP and no TLP
- MPEG-2 (multimedia apps): high ILP and TLP and moderate memory requirements (parallelized by hand)
  - SMT utilizes core resources better
  - But CMP has 16 issue slots instead of 12
- Tomcatv (FP applications): large loop-level parallelism and large memory bandwidth (TLP extracted by the compiler)
  - CMP has large memory bandwidth on the primary cache
  - SMT's fundamental problem: a unified and slow cache
- Multiprogram: integer multiprogramming workload, all computation-intensive (low ILP, high PLP)
27 CMP Motivation
- How to utilize available silicon?
  - Speculation (aggressive superscalar)
  - Simultaneous multithreading (SMT, Hyper-Threading)
  - Several processors on a single chip
- What is a CMP (Chip MultiProcessor)?
  - Several processors (several masters)
  - Both shared- and distributed-memory architectures
  - Both homogeneous and heterogeneous processor types
- Why?
  - Wire delays
  - Diminishing returns of uniprocessors
  - Very long design and verification times for modern processors
28 A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer '97
- TLP and PLP will become widespread in future applications
  - Various multimedia applications
  - Compilers and OS
  - Favours CMP
- CMP
  - Better performance with simple hardware
  - Higher clock rates, better memory bandwidth
  - Shorter pipelines
- SMT has better utilization, but CMP has more resources (no wide-issue logic)
- Although CMP is bad when there is no TLP and little ILP (compress), SMT and SS are not much better
29 A Reminder: SMT (Simultaneous Multithreading)
SMT
- Pool of execution units (wide machine)
- Several logical processors
  - A copy of the state for each
- Multiple threads run concurrently
- Better utilization and latency tolerance
CMP
- Simple cores
- Moderate amount of parallelism
- Threads run concurrently on different cores
30 SMT Dual-core: all four threads can run concurrently
[Diagram: two SMT cores, each with its own L1 D-Cache and D-TLB, integer and floating-point units, schedulers, uop queues, rename/alloc logic, trace cache, uCode ROM, BTB, decoder, BTB and I-TLB, and L2 cache and control, connected to the bus; Threads 1 and 2 run on one core, Threads 3 and 4 on the other]