Title: Hyper-Threading, Chip Multiprocessors, and Both
1 Hyper-Threading, Chip Multiprocessors, and Both
2 To Be Tackled in Multithreading
- Review of Threading Algorithms
- Hyper-Threading Concepts
- Hyper-Threading Architecture
- Advantages/Disadvantages
3 Threading Algorithms
- Time-slicing (fine grain)
  - The processor switches between threads at fixed time intervals.
  - High overhead, especially if one of the processes is in the wait state.
- Switch-on-event (coarse grain)
  - Task switching in case of long pauses
  - While waiting for data from a relatively slow source, CPU resources are given to other processes.
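The two switching policies above can be contrasted with a toy model. This is a minimal sketch: the workload format, cycle counts, and the assumption that a wait completes while another thread runs are all illustrative, not real scheduler behavior.

```python
# Toy model: each "thread" is a list of bursts; 'cpu' bursts do work,
# 'wait' bursts stall.  Time-slicing switches every fixed quantum;
# switch-on-event switches only when the running thread stalls.

def time_slicing(threads, quantum):
    """Fine grain: switch at fixed intervals; count context switches."""
    queues = [list(t) for t in threads]
    switches = 0
    while any(queues):
        for q in queues:
            if q:
                kind, cycles = q[0]
                if cycles > quantum:
                    q[0] = (kind, cycles - quantum)
                else:
                    q.pop(0)
                switches += 1          # a switch after every quantum
    return switches

def switch_on_event(threads):
    """Coarse grain: run until the thread stalls on a 'wait', then switch."""
    queues = [list(t) for t in threads]
    switches = 0
    while any(queues):
        for q in queues:
            while q and q[0][0] == 'cpu':
                q.pop(0)               # run all cpu bursts back to back
            if q:                      # hit a 'wait' burst: give the CPU away
                q.pop(0)               # assume the wait completes meanwhile
                switches += 1
    return switches

workload = [[('cpu', 30), ('wait', 100), ('cpu', 20)],
            [('cpu', 50), ('wait', 80), ('cpu', 10)]]
print(time_slicing(workload, quantum=10), switch_on_event(workload))  # 29 2
```

With these made-up bursts, fixed quanta force 29 switches while switch-on-event needs only 2, which is the slide's point about the relative overhead of the two policies.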
4 Threading Algorithms (cont.)
- Multiprocessing
  - Distributes the load over many processors
  - Adds extra cost
- Simultaneous multithreading
  - Multiple threads execute on a single processor without switching.
  - The basis of Intel's Hyper-Threading technology.
5 Hyper-Threading Concept
- At any point in time, only part of the processor's resources is used to execute the program code of a thread.
- Unused resources can also be loaded, for example, with parallel execution of another thread/application.
- Extremely useful in desktop and server applications where many threads are used.
6 Quick Recall: Many Resources IDLE!
For an 8-way superscalar.
From Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", ISCA 1995.
8
- A superscalar processor with no multithreading
- A superscalar processor with coarse-grain multithreading
- A superscalar processor with fine-grain multithreading
- A superscalar processor with simultaneous multithreading (SMT)
9 Simultaneous Multithreading (SMT)
- Example: the new Pentium with Hyper-Threading
- Key idea: exploit ILP across multiple threads!
  - i.e., convert thread-level parallelism into more ILP
- Exploits the following features of modern processors:
  - Multiple functional units
    - Modern processors typically have more functional units available than a single thread can utilize
  - Register renaming and dynamic scheduling
    - Multiple instructions from independent threads can co-exist and co-execute!
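The slot-filling idea behind SMT can be sketched with a toy issue model. The 8-wide width matches the study cited on slide 6; the per-cycle ILP of 3 and the instruction counts are made-up illustrative numbers, not measurements.

```python
# Toy model: an 8-wide core issues up to 8 instructions per cycle, but a
# single thread rarely has 8 independent instructions ready.  SMT fills the
# leftover slots from a second thread.

ISSUE_WIDTH = 8

def cycles_needed(per_cycle_ilp, total_insts):
    """Cycles to issue total_insts when only per_cycle_ilp are ready per cycle."""
    ready = min(per_cycle_ilp, ISSUE_WIDTH)
    return -(-total_insts // ready)        # ceiling division

def smt_cycles(ilp_a, insts_a, ilp_b, insts_b):
    """Both threads share the 8 issue slots each cycle; drain both."""
    a, b, cycles = insts_a, insts_b, 0
    while a > 0 or b > 0:
        slots = ISSUE_WIDTH
        take_a = min(ilp_a, slots, a)      # thread A fills what it can
        slots -= take_a
        take_b = min(ilp_b, slots, b)      # thread B takes the idle slots
        a, b, cycles = a - take_a, b - take_b, cycles + 1
    return cycles

# Two threads with ILP of ~3 each: alone, each wastes 5 of 8 slots per cycle.
alone = cycles_needed(3, 300) + cycles_needed(3, 300)  # run back to back
smt = smt_cycles(3, 300, 3, 300)                       # co-scheduled
print(alone, smt)  # 200 100
```

Co-scheduling the two streams halves the cycle count here, which is exactly the "convert TLP into more ILP" claim in this model's terms.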
10 Hyper-Threading Architecture
- First used in the Intel Xeon MP processor
- Makes a single physical processor appear as multiple logical processors.
- Each logical processor has a copy of the architecture state.
- Logical processors share a single set of physical execution resources.
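This is visible from software: the OS enumerates logical processors, not physical cores. A quick way to see it from Python is sketched below; `os.cpu_count()` reports logical CPUs, while the physical-core count is a Linux-specific, best-effort parse of `/proc/cpuinfo` and may not apply on other systems.

```python
import os

# Logical processors: what the OS (and Hyper-Threading) expose for scheduling.
logical = os.cpu_count()
print("logical processors:", logical)

# Best-effort physical core count on Linux: count distinct
# (physical id, core id) pairs in /proc/cpuinfo.
try:
    cores = set()
    phys = core = None
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("physical id"):
                phys = line.split(":")[1].strip()
            elif line.startswith("core id"):
                core = line.split(":")[1].strip()
                cores.add((phys, core))
    if cores:
        print("physical cores:", len(cores))
except OSError:
    pass  # not Linux; only the logical count is portable
```

On a Hyper-Threading machine the logical count is typically twice the physical count; on a machine without SMT the two match.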
11 Hyper-Threading Architecture
- Operating systems and user programs can schedule processes or threads to logical processors as if they were physical processors in a multiprocessing system.
- From an architecture perspective, we have to worry about the logical processors sharing resources:
  - Caches, execution units, branch predictors, control logic, and buses.
12 Power5 dataflow ...
- Why only two threads?
  - With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
- Cost
  - The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.
13 Advantages
- The extra architecture adds only about 5% to the total die area.
- No performance loss if only one thread is active; increased performance with multiple threads.
- Better resource utilization.
14 Disadvantages
- To benefit from Hyper-Threading, code cannot execute serially.
- Threads are non-deterministic and involve extra design effort.
- Threads have increased overhead.
- Shared resource conflicts.
15 Multicore
- Multiprocessors on a single chip
16 Basic Shared Memory Architecture
- Processors all connected to a large shared memory
- Where are the caches?
[Diagram: processors P1, P2, ..., Pn connected through an interconnect to memory]
- Now take a closer look at structure, costs, limits, programming
17 What About Caching???
- Want high performance for shared memory? Use caches!
  - Each processor has its own cache (or multiple caches)
  - Place data from memory into the cache
  - Write-back caches: don't send all writes over the bus to memory
- Caches reduce average latency
  - Automatic replication closer to the processor
  - More important to a multiprocessor than a uniprocessor: latencies are longer
- Normal uniprocessor mechanisms to access data
  - Loads and stores form a very low-overhead communication primitive
- Problem: cache coherence!
18 Example: Cache Coherence Problem
[Diagram: processors P1, P2, P3 with caches, connected by a bus to memory and I/O devices]
- Things to note:
  - Processors could see different values for u after event 3
  - With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value when
- How to fix with a bus: a coherence protocol
  - Use the bus to broadcast writes or invalidations
  - Simple protocols rely on the presence of a broadcast medium
- The bus is not scalable beyond about 64 processors (max)
  - Capacity, bandwidth limitations
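The scenario on this slide can be simulated in a few lines. This is an illustrative sketch only: the `Cache` class and `bus` list are stand-ins, and the coherent write simply writes through to memory, whereas a real snooping protocol would also source the data from the owning cache.

```python
# Sketch of the stale-read problem with write-back caches, plus the
# bus-based fix: broadcasting an invalidation on every write.

class Cache:
    def __init__(self, memory):
        self.memory, self.lines = memory, {}

    def read(self, addr):
        if addr not in self.lines:                # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value, bus=None):
        self.lines[addr] = value                  # write back: memory is stale
        if bus is not None:                       # coherence fix
            self.memory[addr] = value             # write through, for simplicity
            for other in bus:
                if other is not self:
                    other.lines.pop(addr, None)   # invalidate other copies

# Without a protocol: P1 and P3 read u=5, P3 writes 7, P1 still sees 5.
memory = {"u": 5}
p1, p3 = Cache(memory), Cache(memory)
p1.read("u"); p3.read("u")
p3.write("u", 7)
stale = p1.read("u")
print("no protocol, P1 sees:", stale)             # 5 (stale)

# With an invalidation broadcast, P1's copy is dropped; its next read misses.
memory = {"u": 5}
p1, p3 = Cache(memory), Cache(memory)
bus = [p1, p3]
p1.read("u"); p3.read("u")
p3.write("u", 7, bus=bus)
fresh = p1.read("u")
print("with invalidation, P1 sees:", fresh)       # 7 (fresh)
```

The broadcast is exactly why simple protocols need a bus, and why the bus becomes the scaling limit noted above.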
19 Limits of Bus-Based Shared Memory
- Assume:
  - 1 GHz processor w/o cache
  - ⇒ 4 GB/s instruction BW per processor (32-bit)
  - ⇒ 1.2 GB/s data BW at a 30% load-store mix
- Suppose a 98% instruction hit rate and a 95% data hit rate
  - ⇒ 80 MB/s instruction BW per processor
  - ⇒ 60 MB/s data BW per processor
  - ⇒ 140 MB/s combined BW
- Assuming 1 GB/s bus bandwidth
  - Therefore, 8 processors will saturate the bus
[Diagram: processors with caches on a shared bus to MEM and I/O; 5.2 GB/s between each processor and its cache, 140 MB/s from each cache to the bus]
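The arithmetic above can be checked directly:

```python
# The slide's numbers: caches filter most requests, but each processor
# still demands ~140 MB/s of bus bandwidth.

inst_bw = 4000                       # MB/s instruction BW per 1 GHz processor
data_bw = 1200                       # MB/s data BW at a 30% load-store mix
inst_hit, data_hit = 0.98, 0.95

inst_miss_bw = inst_bw * (1 - inst_hit)   # 80 MB/s reaches the bus
data_miss_bw = data_bw * (1 - data_hit)   # 60 MB/s reaches the bus
per_proc = inst_miss_bw + data_miss_bw    # 140 MB/s combined

bus_bw = 1000                             # 1 GB/s bus
saturating = bus_bw / per_proc            # ~7.1, so ~8 processors saturate it
print(round(per_proc), round(saturating, 1))  # 140 7.1
```

Since 1000 / 140 ≈ 7.1, the eighth processor pushes the bus past its capacity, matching the slide's conclusion.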
21 Cache Organizations for Multi-cores
- L1 caches are always private to a core
- L2 caches can be private or shared
- Advantages of a shared L2 cache:
  - Efficient dynamic allocation of space to each core
  - Data shared by multiple cores is not replicated
  - Every block has a fixed home, hence it is easy to find the latest copy
- Advantages of a private L2 cache:
  - Quick access to the private L2; good for small working sets
  - A private bus to the private L2 → less contention
22 A Reminder: SMT (Simultaneous Multithreading)
SMT vs. CMP
23 A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer '97
Superscalar (SS)
- For the same area (a billion-transistor DRAM area)
- Superscalar and SMT: very complex
  - Wide issue
  - Advanced branch prediction
  - Register renaming
  - Out-of-order (OOO) instruction issue
  - Non-blocking data caches
CMP
24 SS and SMT vs. CMP
- CPU cores: three main hardware design problems (of SS and SMT)
  - Area increases quadratically with core complexity
    - Number of registers: O(instruction window size)
    - Register ports: O(issue width)
    - CMP solves this problem (area linear in issue width)
  - Longer cycle times
    - Long wires, many MUXes and crossbars
    - Large buffers, queues, and register files
    - Clustering (decreases ILP) or deep pipelining (branch misprediction penalties)
    - CMP allows a small cycle time with little effort: cores are small and fast, rely on software to schedule, and have poor per-core ILP
  - Complex design and verification
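The quadratic-versus-linear area argument can be illustrated with a back-of-the-envelope model. The constants here (a window of 32 entries per issue slot) are arbitrary illustrative choices, not real silicon numbers; only the scaling shape matters.

```python
# Toy scaling model: superscalar area grows roughly with
# registers x register ports ~ window x issue width, and the window itself
# is scaled with issue width, so area is quadratic in issue width.
# A CMP replicates a fixed-width core, so its area is linear in cores.

def superscalar_area(issue_width, window=None):
    """Area ~ window * ports; window assumed proportional to issue width."""
    window = window if window is not None else 32 * issue_width
    return window * issue_width          # quadratic in issue width

def cmp_area(cores, issue_width):
    """A CMP just replicates a small fixed-width core."""
    return cores * superscalar_area(issue_width)   # linear in core count

# Quadrupling issue width (3 -> 12) costs ~16x area for one wide core...
print(superscalar_area(12) / superscalar_area(3))  # 16.0
# ...while 4 narrow cores (the same 12 total issue slots) cost only 4x.
print(cmp_area(4, 3) / superscalar_area(3))        # 4.0
```

Same total issue bandwidth, a factor-of-four difference in area under this model, which is the slide's case for CMP.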
25 SS and SMT vs. CMP
- Memory
  - A 12-issue SS or SMT requires a multiported data cache (4-6 ports)
    - 2 × 128 KB (2-cycle latency)
  - CMP: 16 × 16 KB (single-cycle latency), but the secondary cache is slower (multiported)
  - Shared memory, write-through caches
CMP
26 Performance Comparison
- Compress (integer apps): low ILP and no TLP
- MPEG-2 (multimedia apps): high ILP and TLP and moderate memory requirements (parallelized by hand)
  - SMT utilizes core resources better
  - But CMP has 16 issue slots instead of 12
- Tomcatv (FP applications): large loop-level parallelism and large memory bandwidth (TLP extracted by the compiler)
  - CMP has large memory bandwidth on the primary cache
  - SMT's fundamental problem: a unified and slow cache
- Multiprogram: integer multiprogramming workload, all computation-intensive (low ILP, high PLP)
27 CMP Motivation
- How to utilize available silicon?
  - Speculation (aggressive superscalar)
  - Simultaneous multithreading (SMT, Hyper-Threading)
  - Several processors on a single chip
- What is a CMP (Chip MultiProcessor)?
  - Several processors (several masters)
  - Both shared- and distributed-memory architectures
  - Both homogeneous and heterogeneous processor types
- Why?
  - Wire delays
  - Diminishing returns of uniprocessors
  - Very long design and verification times for modern processors
28 A Single-Chip Multiprocessor, L. Hammond et al. (Stanford), IEEE Computer '97
- TLP and PLP will become widespread in future applications
  - Various multimedia applications
  - Compilers and OS
  - Favours CMP
- CMP
  - Better performance with simple hardware
  - Higher clock rates, better memory bandwidth
  - Shorter pipelines
- SMT has better utilization, but CMP has more resources (no wide-issue logic)
- Although CMP is bad when there is no TLP and little ILP (compress), SMT and SS are not much better
29 A Reminder: SMT (Simultaneous Multithreading)
SMT
- Pool of execution units (wide machine)
- Several logical processors
  - A copy of the state for each
- Multiple threads run concurrently
- Better utilization and latency tolerance
CMP
- Simple cores
- Moderate amount of parallelism
- Threads run concurrently on different cores
30 SMT Dual-core: all four threads can run concurrently
[Diagram: two SMT cores, each with its own L1 D-Cache and D-TLB, integer and floating-point units, schedulers, uop queues, rename/alloc logic, trace cache, uCode ROM, BTB, decoder, BTB and I-TLB, and L2 cache and control, connected to the bus; Threads 1 and 2 run on one core, Threads 3 and 4 on the other]