Title: Multi-/Many-Core Processors
1. Multi-/Many-Core Processors
- Ana Lucia Varbanescu
- analucia@cs.vu.nl
2. Why?
3. In the search for performance
4. In the search for performance
- We have M(o)ore transistors
- How do we use them?
- Bigger cores
- Hit the walls: power, memory, parallelism (ILP)
- Dig through?
- Requires new technologies
- Go around?
- Multi-/many-cores
David Patterson, The Future of Computer Architecture, 2006. http://www.slidefinder.net/f/future_computer_architecture_david_patterson/6912680
5. Multi-/many-cores
- In the search for performance
- Build (HW)
- What architectures?
- Evaluate (HW)
- What metrics?
- How do we measure?
- Use (HW + SW)
- What workloads?
- Expected performance?
- Program (SW (+ HW))
- How to program?
- How to optimize?
- Benchmark
- How to analyze performance?
6. Build
7. Choices
- Core type(s)
- Fat or slim?
- Vectorized (SIMD)?
- Homogeneous or heterogeneous?
- Number of cores
- Few or many?
- Memory
- Shared-memory or distributed-memory?
- Parallelism
- SIMD/MIMD, SPMD/MPMD, ...
Main constraint: chip area!
8. A taxonomy
- Based on field-of-origin
- General-purpose (GPP/GPMC)
- Intel, AMD
- Graphics (GPUs)
- NVIDIA, ATI
- Embedded systems
- Philips/NXP, ARM
- Servers
- Sun (Oracle), IBM
- Gaming/Entertainment
- Sony/Toshiba/IBM
- High Performance Computing
- Intel, IBM, ...
9. General Purpose Processors
- Architecture
- Few fat cores
- Homogeneous
- Stand-alone
- Memory
- Shared, multi-layered
- Per-core cache
- Programming
- SMP machines
- Both symmetrical and asymmetrical threading
- OS Scheduler
- Gain performance
- MPMD, coarse-level parallelism
10. Intel
11. Intel's next gen
12. AMD
13. AMD's next gen
14. Server-side
- GPP-like with more HW threads
- Lower performance-per-thread
- Examples
- Sun UltraSPARC T2, T2+
- 8 cores x 8 threads each
- high throughput
- IBM POWER7
15. Graphics Processing Units
- Architecture
- Hundreds/thousands of slim cores
- Homogeneous
- Accelerator(s)
- Memory
- Very complex hierarchy
- Both shared and per-core
- Programming
- Off-load model
- (Many) Symmetrical threads
- Hardware scheduler
- Gain performance
- fine-grain parallelism, SIMT
16. NVIDIA G80/GT200/Fermi
- SM = streaming multiprocessor
- 1 SM = 8 SPs (streaming processors / CUDA cores)
- 1 TPC (thread processing cluster) = 2 x SM (G80) or 3 x SM (GT200)
17. NVIDIA GT200
18. NVIDIA Fermi
19. ATI GPUs
20. Cell/B.E.
- Architecture
- Heterogeneous
- 8 vector processors (SPEs) + 1 trimmed PowerPC (PPE)
- Accelerator or stand-alone
- Memory
- Per-core only
- Programming
- Asymmetrical multi-threading
- User-controlled scheduling
- 6 levels of parallelism, all under user control
- Gain performance
- Fine- and coarse-grain parallelism (MPMD, SPMD)
- SPE-specific optimizations
- Scheduling
21. Cell/B.E.
- 1 x PPE: 64-bit PowerPC
- L1: 32 KB I + 32 KB D
- L2: 512 KB
- 8 x SPE cores
- Local memory (LS): 256 KB
- 128 x 128-bit vector registers
- Main memory access
- PPE: Rd/Wr
- SPEs: async DMA
22. Intel Single-chip Cloud Computer
- Architecture
- Tile-based many-core (48 cores)
- A tile is a dual-core
- Stand-alone / cluster
- Memory
- Per-core and per-tile
- Shared off-chip
- Programming
- Multi-processing with message passing
- User-controlled mapping/scheduling
- Gain performance
- Coarse-grain parallelism (MPMD, SPMD)
- Multi-application workloads (cluster-like)
23. Intel SCC
24. Summary
25. Summary
26. Take home message
- Variety of platforms
- Core types and counts
- Memory architecture and sizes
- Parallelism layers and types
- Scheduling
- Open question(s)
- Why so many?
- How many platforms do we need?
- Can any application run on any platform?
27. Evaluate in theory
28. HW Performance metrics
- Clock frequency [Hz]: absolute HW speed(s)
- Memories, CPUs, interconnects
- Operational speed [GFLOPs]
- Operations per cycle
- Bandwidth [GB/s]
- Memory access speed(s)
- Differs a lot between the different memories on chip
- Power
- Per core/per chip
- Derived metrics
- FLOP/Byte
- FLOP/Watt
29. Peak performance
- Peak = cores x threads_per_core x FLOPS/cycle x clock_frequency
- Examples (see the sketch below)
- Nehalem EX: 8 x 2 x 4 x 2.66 GHz = 170 GFLOPs
- HD 5870: (20 x 16) x 5 x 0.85 GHz = 1360 GFLOPs
- GF100: (16 x 32) x 2 x 1.45 GHz = 1484 GFLOPs
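To make the formula concrete, a minimal C sketch that reproduces the three examples above. The function name and factor decomposition are illustrative, and the 2.66 GHz clock for Nehalem EX is an inferred value chosen so the product matches the quoted 170 GFLOPs.

/* Minimal sketch: peak = cores x units/core x FLOPS/cycle x clock (GHz). */
#include <stdio.h>

static double peak_gflops(int cores, int units_per_core,
                          int flops_per_cycle, double clock_ghz) {
    return (double)cores * units_per_core * flops_per_cycle * clock_ghz;
}

int main(void) {
    /* 2.66 GHz is an assumption that makes the product match 170 GFLOPs */
    printf("Nehalem EX: %7.1f GFLOPs\n", peak_gflops(8, 2, 4, 2.66));
    printf("HD 5870:    %7.1f GFLOPs\n", peak_gflops(20 * 16, 5, 1, 0.85));
    printf("GF100:      %7.1f GFLOPs\n", peak_gflops(16 * 32, 1, 2, 1.45));
    return 0;
}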
30. On-chip memory bandwidth
- Registers and per-core caches
- See specification
- Shared memory
- Peak_Data_Rate x Data_Bus_Width
- = (frequency x data_rate) x data_bus_width
- Example(s) (see the sketch below)
- Nehalem DDR3: 1.333 x 2 x 64 bits = 21 GB/s
- HD 5870: 4.800 x 256 bits = 153.6 GB/s
- Fermi: 4.200 x 384 bits = 201.6 GB/s
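These are theoretical peaks; the achieved figure (the "Achieved GB/s" metric used later) is usually measured with a streaming micro-benchmark. A minimal STREAM-like sketch in C, simplified for illustration and not the official STREAM benchmark:

/* Streams two large arrays through memory and reports achieved GB/s. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 25)  /* 32M doubles per array (~256 MB): too big for caches */

int main(void) {
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;
    for (long i = 0; i < N; i++) b[i] = (double)i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) a[i] = 2.0 * b[i];  /* 1 read + 1 write per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gb = 2.0 * N * sizeof(double) / 1e9;      /* total bytes moved */
    printf("achieved: %.1f GB/s (check: %g)\n", gb / secs, a[N - 1]);
    free(a); free(b);
    return 0;
}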
31. Off-chip memory bandwidth
- Depends on the interconnect
- Intel's technology: QPI
- 25.6 GB/s
- AMD's technology: HT3
- 19.2 GB/s
- Accelerators: PCIe 1.0 or 2.0
- 8 GB/s or 16 GB/s
32. Summary
Platform           Cores  Threads/ALUs   GFLOPS  BW (GB/s)  FLOPS/Byte
Cell/B.E.              8             8   204.80      25.6       8.0000
Nehalem EE             4             8    57.60      25.5       2.2588
Nehalem EX             8            16   170.00      63         2.6984
Niagara                8            32     9.33      20         0.4665
Niagara 2              8            64    11.20      76         0.1474
AMD Barcelona          4             8    37.00      21.4       1.7290
AMD Istanbul           6             6    62.40      25.6       2.4375
AMD Magny-Cours       12            12   124.80      25.6       4.8750
IBM Power 7            8            32   264.96      68.22      3.8839
G80                   16           128   404.80      86.4       4.6852
GT200                 30           240   933.00     141.7       6.5843
GF100                 16           512  1484.00     201.6       7.3611
ATI Radeon 4890      160           800   680.00     124.8       5.4487
HD5870               320          1600  1360.00     153.6       8.8542
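The last column is simply the ratio of the two preceding ones, i.e., the FLOP/Byte derived metric from slide 28. A trivial check, using two rows of the table as examples:

/* FLOPS/Byte = peak GFLOPS / peak bandwidth (GB/s). */
#include <stdio.h>

int main(void) {
    printf("Cell/B.E.: %.4f\n", 204.80 / 25.6);   /* 8.0000 */
    printf("HD5870:    %.4f\n", 1360.00 / 153.6); /* 8.8542 */
    return 0;
}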
33. Absolute HW performance (1)
- Achieved in optimal conditions
- Processing units 100% used
- All parallelism 100% exploited
- All data transfers at maximum bandwidth
How many applications are like this? Basically none; it is even hard to build the right benchmarks.
34. Evaluate in use
35. Workloads
- For a new application
- Design parallel algorithm
- Implement
- Optimize
- Benchmark
- Any application can run on any platform
- Influence on
- performance
- portability
- productivity
- Ideally, we want a good fit!
36. Performance goals
- Hardware designer
- How fast is my hardware running?
- End-user
- How fast is my application running?
- End-user's manager
- How efficient is my application?
- Developer's manager
- How much time does it take to program it?
- Developer
- How close can I get to the peak performance?
37. SW Performance metrics
- Execution time (user)
- Speed-up
- vs. best available sequential application
- Achieved GFLOPs (developer / user's manager)
- Computational efficiency
- Achieved GB/s (developer)
- Memory efficiency
- Productivity and portability (developer's manager)
- Production costs
- Maintenance costs
38. For example
Hundreds of applications reach speed-ups of up to 2 orders of magnitude!!! Incredible performance! Or is it?
39. Developer
- Searching for peak performance
- Which platform to use?
- What is the maximum I can achieve? And how?
- Performance models
- Amdahl's Law
- Arithmetic Intensity and the Roofline model
40. Amdahl's Law
- How can we apply Amdahl's law to multi-core applications? (see the sketch below)
- Discussion
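As a starting point for that discussion, a minimal sketch of the classic formula, assuming a parallel fraction p and N identical cores; heterogeneity and memory contention, which matter on multi-cores, are deliberately ignored here.

/* Amdahl's Law: speedup(N) = 1 / ((1 - p) + p / N),
 * where p is the fraction of the program that parallelizes. */
#include <stdio.h>

static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double p = 0.95;                      /* example: 95% parallel */
    for (int n = 1; n <= 64; n *= 2)
        printf("N = %2d -> speedup %.2f\n", n, amdahl(p, n));
    printf("limit: %.1fx\n", 1.0 / (1.0 - p));  /* 20x, however many cores */
    return 0;
}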
41. Arithmetic intensity (AI)
- AI = OP/Byte
- How many operations are executed per transferred byte?
- Determines the boundary between compute-intensive and data-intensive
42. Applications' AI
Is the application compute intensive or memory intensive?
- Example (sketched below)
- AI(RGB-to-Gray conversion) = 5/4
- Read 3 B, write 1 B
- Compute: 3 MUL + 2 ADD
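A sketch of the conversion with the slide's bookkeeping in the comments; the luminance weights are the usual ones and are an assumption here, but they do not change the op count.

/* RGB-to-Gray: per pixel, read 3 B + write 1 B = 4 B moved,
 * and 3 MUL + 2 ADD = 5 ops, hence AI = 5/4. */
#include <stddef.h>

void rgb_to_gray(const unsigned char *rgb, unsigned char *gray, size_t pixels) {
    for (size_t i = 0; i < pixels; i++) {
        float r = rgb[3 * i + 0];                        /* read 1 B */
        float g = rgb[3 * i + 1];                        /* read 1 B */
        float b = rgb[3 * i + 2];                        /* read 1 B */
        float y = 0.299f * r + 0.587f * g + 0.114f * b;  /* 3 MUL + 2 ADD */
        gray[i] = (unsigned char)y;                      /* write 1 B */
    }
}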
43. Platform AI
Is the application compute intensive or memory intensive?
[Figure: the platforms' FLOPS/Byte ratios compared against RGB-to-Gray]
44. The Roofline model (1)
- Achievable_peak = min(Peak_GFLOPs, AI x StreamBW) (see the sketch below)
- Peak_GFLOPs = platform peak performance
- StreamBW = streaming bandwidth
- AI = application arithmetic intensity
- Theoretical peak values are to be replaced by real values
- Without various optimizations
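A minimal sketch of the bound itself; the platform numbers are taken from the summary table (slide 32) and the AI from the RGB-to-Gray example (slide 42).

/* Roofline: achievable peak = min(platform peak, AI x stream bandwidth). */
#include <stdio.h>

static double roofline(double peak_gflops, double stream_bw, double ai) {
    double memory_bound = ai * stream_bw;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double ai = 5.0 / 4.0;  /* RGB-to-Gray, slide 42 */
    /* both platforms end up memory-bound for this low AI */
    printf("Nehalem EX: %6.1f GFLOPs\n", roofline(170.0, 63.0, ai));
    printf("GF100:      %6.1f GFLOPs\n", roofline(1484.0, 201.6, ai));
    return 0;
}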
45. The Roofline model (2)
- Black
- Theoretical peak
- Yellow
- No streaming optimizations
- Green
- No in-core optimizations
- Red
- Worst-case performance zone
- Dashed
- The application
46. Use the Roofline model
- To determine what to do first to gain performance:
- Increase arithmetic intensity
- Increase streaming rate
- Apply in-core optimizations
- ... and these are topics for your next lecture
Samuel Williams et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009.
47. Take home message
- Performance evaluation depends on goals
- Execution time (users)
- GFLOPs and GB/s (developers)
- Efficiency (budget holders?)
- Stop tweaking when:
- You reach the performance goal
- You are constrained by the capabilities of the (application, platform) pair, e.g., as predicted by Roofline
- Choose the platform to fit the application:
- Parallelism layers
- Arithmetic intensity
- Streaming capabilities
48. Questions
Ana Lucia Varbanescu, analucia@cs.vu.nl