Title: Multi-/Many-Core Processors
1. Multi-/Many-Core Processors
- Ana Lucia Varbanescu
- analucia@cs.vu.nl
2. Why?
3. In the search for performance
4. In the search for performance
- We have M(o)ore transistors
- How do we use them?
- Bigger cores
- Hit the walls: power, memory, parallelism (ILP)
- Dig through?
- Requires new technologies
- Go around?
- Multi-/many-cores
David Patterson, The Future of Computer Architecture, 2006. http://www.slidefinder.net/f/future_computer_architecture_david_patterson/6912680
5. Multi-/many-cores
- In the search for performance
- Build (HW)
- What architectures?
- Evaluate (HW)
- What metrics?
- How do we measure?
- Use (HW + SW)
- What workloads?
- Expected performance?
- Program (SW (+ HW))
- How to program?
- How to optimize?
- Benchmark
- How to analyze performance?
6. Build
7. Choices
- Core type(s)
- Fat or slim?
- Vectorized (SIMD)?
- Homogeneous or heterogeneous?
- Number of cores
- Few or many?
- Memory
- Shared-memory or distributed-memory?
- Parallelism
- SIMD/MIMD, SPMD/MPMD, ...
Main constraint: chip area!
8. A taxonomy
- Based on field-of-origin
- General-purpose (GPP/GPMC)
- Intel, AMD
- Graphics (GPUs)
- NVIDIA, ATI
- Embedded systems
- Philips/NXP, ARM
- Servers
- Sun (Oracle), IBM
- Gaming/Entertainment
- Sony/Toshiba/IBM
- High Performance Computing
- Intel, IBM, ...
9. General Purpose Processors
- Architecture
- Few fat cores
- Homogeneous
- Stand-alone
- Memory
- Shared, multi-layered
- Per-core cache
- Programming
- SMP machines
- Both symmetrical and asymmetrical threading
- OS Scheduler
- Gain performance
- MPMD, coarse-level parallelism
10. Intel
11. Intel's next gen
12. AMD
13. AMD's next gen
14. Server-side
- GPP-like with more HW threads
- Lower performance-per-thread
- Examples
- Sun UltraSPARC T2, T2+
- 8 cores x 8 threads each
- high throughput
- IBM POWER7
15. Graphics Processing Units
- Architecture
- Hundreds/thousands of slim cores
- Homogeneous
- Accelerator(s)
- Memory
- Very complex hierarchy
- Both shared and per-core
- Programming
- Off-load model
- (Many) Symmetrical threads
- Hardware scheduler
- Gain performance
- fine-grain parallelism, SIMT
16. NVIDIA G80/GT200/Fermi
- SM = streaming multiprocessor
- 1 SM = 8 SPs (streaming processors / CUDA cores)
- 1 TPC (thread processing cluster) = 2 x SM (G80) or 3 x SM (GT200)
17. NVIDIA GT200
18. NVIDIA Fermi
19. ATI GPUs
20. Cell/B.E.
- Architecture
- Heterogeneous
- 8 vector processors (SPEs) + 1 trimmed PowerPC (PPE)
- Accelerator or stand-alone
- Memory
- Per-core only
- Programming
- Asymmetrical multi-threading
- User-controlled scheduling
- 6 levels of parallelism, all under user control
- Gain performance
- Fine- and coarse-grain parallelism (MPMD, SPMD)
- SPE-specific optimizations
- Scheduling
21. Cell/B.E.
- 1 x PPE: 64-bit PowerPC
- L1: 32 KB I + 32 KB D
- L2: 512 KB
- 8 x SPE cores
- Local memory (LS): 256 KB
- 128 x 128-bit vector registers
- Main memory access
- PPE: Rd/Wr
- SPEs: async DMA
22. Intel Single-chip Cloud Computer
- Architecture
- Tile-based many-core (48 cores)
- A tile is a dual-core
- Stand-alone / cluster
- Memory
- Per-core and per-tile
- Shared off-chip
- Programming
- Multi-processing with message passing
- User-controlled mapping/scheduling
- Gain performance
- Coarse-grain parallelism (MPMD, SPMD)
- Multi-application workloads (cluster-like)
23. Intel SCC
24. Summary
25. Summary
26. Take home message
- Variety of platforms
- Core types and counts
- Memory architecture and sizes
- Parallelism layers and types
- Scheduling
- Open question(s)
- Why so many?
- How many platforms do we need?
- Can any application run on any platform?
27. Evaluate in theory
28. HW Performance metrics
- Clock frequency [Hz]: absolute HW speed(s)
- Memories, CPUs, interconnects
- Operational speed [GFLOPs]
- Operations per cycle
- Bandwidth [GB/s]
- Memory access speed(s)
- Differs a lot between the different memories on chip
- Power
- Per core/per chip
- Derived metrics
- FLOP/Byte
- FLOP/Watt
29. Peak performance
- Peak = cores x threads_per_core x FLOPS/cycle x clock_frequency
- Examples (see the sketch below)
- Nehalem EX: 8 x 2 x 4 x 2.66 GHz = 170 GFLOPs
- HD 5870: (20 x 16) x 5 x 0.85 GHz = 1360 GFLOPs
- GF100: (16 x 32) x 2 x 1.45 GHz = 1484 GFLOPs
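To make the formula concrete, a minimal C sketch that reproduces the three examples above. The function name and factor decomposition are illustrative, and the 2.66 GHz clock for Nehalem EX is an inferred value chosen so the product matches the quoted 170 GFLOPs.

/* Minimal sketch: peak = cores x units/core x FLOPS/cycle x clock (GHz). */
#include <stdio.h>

static double peak_gflops(int cores, int units_per_core,
                          int flops_per_cycle, double clock_ghz) {
    return (double)cores * units_per_core * flops_per_cycle * clock_ghz;
}

int main(void) {
    /* 2.66 GHz is an assumption that makes the product match 170 GFLOPs */
    printf("Nehalem EX: %7.1f GFLOPs\n", peak_gflops(8, 2, 4, 2.66));
    printf("HD 5870:    %7.1f GFLOPs\n", peak_gflops(20 * 16, 5, 1, 0.85));
    printf("GF100:      %7.1f GFLOPs\n", peak_gflops(16 * 32, 1, 2, 1.45));
    return 0;
}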
30. On-chip memory bandwidth
- Registers and per-core caches
- See specification
- Shared memory
- Peak_Data_Rate x Data_Bus_Width
- = (frequency x data_rate) x data_bus_width
- Example(s) (see the sketch below)
- Nehalem DDR3: 1.333 x 2 x 64 bits = 21 GB/s
- HD 5870: 4.800 x 256 bits = 153.6 GB/s
- Fermi: 4.200 x 384 bits = 201.6 GB/s
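These are theoretical peaks; the achieved figure (the "Achieved GB/s" metric used later) is usually measured with a streaming micro-benchmark. A minimal STREAM-like sketch in C, simplified for illustration and not the official STREAM benchmark:

/* Streams two large arrays through memory and reports achieved GB/s. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 25)  /* 32M doubles per array (~256 MB): too big for caches */

int main(void) {
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;
    for (long i = 0; i < N; i++) b[i] = (double)i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) a[i] = 2.0 * b[i];  /* 1 read + 1 write per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gb = 2.0 * N * sizeof(double) / 1e9;      /* total bytes moved */
    printf("achieved: %.1f GB/s (check: %g)\n", gb / secs, a[N - 1]);
    free(a); free(b);
    return 0;
}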
31. Off-chip memory bandwidth
- Depends on the interconnect
- Intel's technology: QPI
- 25.6 GB/s
- AMD's technology: HT3
- 19.2 GB/s
- Accelerators: PCIe 1.0 or 2.0
- 8 GB/s or 16 GB/s
32. Summary
Platform           Cores  Threads/ALUs   GFLOPS  BW (GB/s)  FLOPS/Byte
Cell/B.E.              8             8   204.80      25.6       8.0000
Nehalem EE             4             8    57.60      25.5       2.2588
Nehalem EX             8            16   170.00      63         2.6984
Niagara                8            32     9.33      20         0.4665
Niagara 2              8            64    11.20      76         0.1474
AMD Barcelona          4             8    37.00      21.4       1.7290
AMD Istanbul           6             6    62.40      25.6       2.4375
AMD Magny-Cours       12            12   124.80      25.6       4.8750
IBM Power 7            8            32   264.96      68.22      3.8839
G80                   16           128   404.80      86.4       4.6852
GT200                 30           240   933.00     141.7       6.5843
GF100                 16           512  1484.00     201.6       7.3611
ATI Radeon 4890      160           800   680.00     124.8       5.4487
HD5870               320          1600  1360.00     153.6       8.8542
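The last column is simply the ratio of the two preceding ones, i.e., the FLOP/Byte derived metric from slide 28. A trivial check, using two rows of the table as examples:

/* FLOPS/Byte = peak GFLOPS / peak bandwidth (GB/s). */
#include <stdio.h>

int main(void) {
    printf("Cell/B.E.: %.4f\n", 204.80 / 25.6);   /* 8.0000 */
    printf("HD5870:    %.4f\n", 1360.00 / 153.6); /* 8.8542 */
    return 0;
}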
33. Absolute HW performance (1)
- Achieved in optimal conditions
- Processing units 100% used
- All parallelism 100% exploited
- All data transfers at maximum bandwidth
How many applications are like this? Basically none; it is even hard to build the right benchmarks.
34. Evaluate in use
35. Workloads
- For a new application
- Design parallel algorithm
- Implement
- Optimize
- Benchmark
- Any application can run on any platform
- Influence on
- performance
- portability
- productivity
- Ideally, we want a good fit!
36. Performance goals
- Hardware designer
- How fast is my hardware running?
- End-user
- How fast is my application running?
- End-user's manager
- How efficient is my application?
- Developer's manager
- How much time does it take to program it?
- Developer
- How close can I get to the peak performance?
37. SW Performance metrics
- Execution time (user)
- Speed-up
- vs. best available sequential application
- Achieved GFLOPs (developer / user's manager)
- Computational efficiency
- Achieved GB/s (developer)
- Memory efficiency
- Productivity and portability (developer's manager)
- Production costs
- Maintenance costs
38. For example
Hundreds of applications reach speed-ups of up to 2 orders of magnitude!!! Incredible performance! Or is it?
39. Developer
- Searching for peak performance
- Which platform to use?
- What is the maximum I can achieve? And how?
- Performance models
- Amdahl's Law
- Arithmetic Intensity and the Roofline model
40. Amdahl's Law
- How can we apply Amdahl's law to multi-core applications? (see the sketch below)
- Discussion
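As a starting point for that discussion, a minimal sketch of the classic formula, assuming a parallel fraction p and N identical cores; heterogeneity and memory contention, which matter on multi-cores, are deliberately ignored here.

/* Amdahl's Law: speedup(N) = 1 / ((1 - p) + p / N),
 * where p is the fraction of the program that parallelizes. */
#include <stdio.h>

static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double p = 0.95;                      /* example: 95% parallel */
    for (int n = 1; n <= 64; n *= 2)
        printf("N = %2d -> speedup %.2f\n", n, amdahl(p, n));
    printf("limit: %.1fx\n", 1.0 / (1.0 - p));  /* 20x, however many cores */
    return 0;
}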
41. Arithmetic intensity (AI)
- AI = OP/Byte
- How many operations are executed per transferred byte?
- Determines the boundary between compute-intensive and data-intensive
42. Applications' AI
Is the application compute intensive or memory intensive?
- Example (sketched below)
- AI(RGB-to-Gray conversion) = 5/4
- Read 3 B, write 1 B
- Compute: 3 MUL + 2 ADD
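A sketch of the conversion with the slide's bookkeeping in the comments; the luminance weights are the usual ones and are an assumption here, but they do not change the op count.

/* RGB-to-Gray: per pixel, read 3 B + write 1 B = 4 B moved,
 * and 3 MUL + 2 ADD = 5 ops, hence AI = 5/4. */
#include <stddef.h>

void rgb_to_gray(const unsigned char *rgb, unsigned char *gray, size_t pixels) {
    for (size_t i = 0; i < pixels; i++) {
        float r = rgb[3 * i + 0];                        /* read 1 B */
        float g = rgb[3 * i + 1];                        /* read 1 B */
        float b = rgb[3 * i + 2];                        /* read 1 B */
        float y = 0.299f * r + 0.587f * g + 0.114f * b;  /* 3 MUL + 2 ADD */
        gray[i] = (unsigned char)y;                      /* write 1 B */
    }
}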
43. Platform AI
Is the application compute intensive or memory intensive?
[Figure: the platforms' FLOPS/Byte ratios compared against RGB-to-Gray]
44. The Roofline model (1)
- Achievable_peak = min(Peak_GFLOPs, AI x StreamBW) (see the sketch below)
- Peak_GFLOPs = platform peak performance
- StreamBW = streaming bandwidth
- AI = application arithmetic intensity
- Theoretical peak values are to be replaced by real values
- Without various optimizations
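A minimal sketch of the bound itself; the platform numbers are taken from the summary table (slide 32) and the AI from the RGB-to-Gray example (slide 42).

/* Roofline: achievable peak = min(platform peak, AI x stream bandwidth). */
#include <stdio.h>

static double roofline(double peak_gflops, double stream_bw, double ai) {
    double memory_bound = ai * stream_bw;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double ai = 5.0 / 4.0;  /* RGB-to-Gray, slide 42 */
    /* both platforms end up memory-bound for this low AI */
    printf("Nehalem EX: %6.1f GFLOPs\n", roofline(170.0, 63.0, ai));
    printf("GF100:      %6.1f GFLOPs\n", roofline(1484.0, 201.6, ai));
    return 0;
}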
45. The Roofline model (2)
- Black
- Theoretical peak
- Yellow
- No streaming optimizations
- Green
- No in-core optimizations
- Red
- Worst-case performance zone
- Dashed
- The application
46. Use the Roofline model
- To determine what to do first to gain performance:
- Increase arithmetic intensity
- Increase streaming rate
- Apply in-core optimizations
- ... and these are topics for your next lecture
Samuel Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009.
47. Take home message
- Performance evaluation depends on goals
- Execution time (users)
- GFLOPs and GB/s (developers)
- Efficiency (budget holders?)
- Stop tweaking when:
- You reach the performance goal
- You are constrained by the capabilities of the (application, platform) pair, e.g., as predicted by Roofline
- Choose the platform to fit the application:
- Parallelism layers
- Arithmetic intensity
- Streaming capabilities
48. Questions
Ana Lucia Varbanescu, analucia@cs.vu.nl