Multi-/Many-Core Processors

1
Multi-/Many-Core Processors
  • Ana Lucia Varbanescu
  • analucia_at_cs.vu.nl

2
Why?
3
In the search for performance
4
In the search for performance
  • We have M(o)ore transistors
  • How do we use them?
  • Bigger cores
  • Hit the walls: power, memory, parallelism (ILP)
  • Dig through?
  • Requires new technologies
  • Go around?
  • Multi-/many-cores

David Patterson, "The Future of Computer Architecture", 2006.
http://www.slidefinder.net/f/future_computer_architecture_david_patterson/6912680
5
Multi-/many-cores
  • In the search for performance
  • Build (HW)
  • What architectures?
  • Evaluate (HW)
  • What metrics?
  • How do we measure?
  • Use (HW + SW)
  • What workloads?
  • Expected performance?
  • Program (SW (+ HW))
  • How to program?
  • How to optimize?
  • Benchmark
  • How to analyze performance?

6
Build
7
Choices
  • Core type(s)
  • Fat or slim?
  • Vectorized (SIMD)?
  • Homogeneous or heterogeneous?
  • Number of cores
  • Few or many?
  • Memory
  • Shared-memory or distributed-memory?
  • Parallelism
  • SIMD/MIMD, SPMD/MPMD, ...

Main constraint: chip area!
8
A taxonomy
  • Based on field-of-origin
  • General-purpose (GPP/GPMC)
  • Intel, AMD
  • Graphics (GPUs)
  • NVIDIA, ATI
  • Embedded systems
  • Philips/NXP, ARM
  • Servers
  • Sun (Oracle), IBM
  • Gaming/Entertainment
  • Sony/Toshiba/IBM
  • High Performance Computing
  • Intel, IBM, ...

9
General Purpose Processors
  • Architecture
  • Few fat cores
  • Homogeneous
  • Stand-alone
  • Memory
  • Shared, multi-layered
  • Per-core cache
  • Programming
  • SMP machines
  • Both symmetrical and asymmetrical threading
  • OS Scheduler
  • Gain performance
  • MPMD, coarse-level parallelism

10
Intel
11
Intel's next gen
12
AMD
13
AMD's next gen
14
Server-side
  • GPP-like with more HW threads
  • Lower performance-per-thread
  • Examples
  • Sun UltraSPARC T2, T2+
  • 8 cores x 8 threads each
  • high throughput
  • IBM POWER7

15
Graphics Processing Units
  • Architecture
  • Hundreds/thousands of slim cores
  • Homogeneous
  • Accelerator(s)
  • Memory
  • Very complex hierarchy
  • Both shared and per-core
  • Programming
  • Off-load model
  • (Many) Symmetrical threads
  • Hardware scheduler
  • Gain performance
  • fine-grain parallelism, SIMT

16
NVIDIA G80/GT200/Fermi
  • SM = streaming multiprocessor
  • 1 SM = 8 SP (streaming processors/CUDA cores) on G80/GT200
  • 1 TPC = 2 x SM (G80) / 3 x SM (GT200)
  • TPC = thread processing cluster

17
NVIDIA GT200
18
NVIDIA Fermi
19
ATI GPUs
20
Cell/B.E.
  • Architecture
  • Heterogeneous
  • 8 vector-processors (SPEs) + 1 trimmed PowerPC
    (PPE)
  • Accelerator or stand-alone
  • Memory
  • Per-core only
  • Programming
  • Asymmetrical multi-threading
  • User-controlled scheduling
  • 6 levels of parallelism, all under user control
  • Gain performance
  • Fine- and coarse-grain parallelism (MPMD, SPMD)
  • SPE-specific optimizations
  • Scheduling

21
Cell/B.E.
  • 1 x PPE: 64-bit PowerPC
  • L1: 32 KB I + 32 KB D
  • L2: 512 KB
  • 8 x SPE cores
  • Local mem (LS): 256 KB
  • 128 x 128-bit vector registers
  • Main memory access
  • PPE: Rd/Wr
  • SPEs: async DMA

22
Intel Single-chip Cloud Computer
  • Architecture
  • Tile-based many-core (48 cores)
  • A tile is a dual-core
  • Stand-alone / cluster
  • Memory
  • Per-core and per-tile
  • Shared off-chip
  • Programming
  • Multi-processing with message passing
  • User-controlled mapping/scheduling
  • Gain performance
  • Coarse-grain parallelism (MPMD, SPMD)
  • Multi-application workloads (cluster-like)

23
Intel SCC
24
Summary
  • Computation

25
Summary
  • Memory

26
Take home message
  • Variety of platforms
  • Core types and counts
  • Memory architecture and sizes
  • Parallelism layers and types
  • Scheduling
  • Open question(s)
  • Why so many?
  • How many platforms do we need?
  • Any application to run on any platform?

27
Evaluate in theory
28
HW Performance metrics
  • Clock frequency [Hz] = absolute HW speed(s)
  • Memories, CPUs, interconnects
  • Operational speed [GFLOPs]
  • Operations per cycle
  • Bandwidth [GB/s]
  • memory access speed(s)
  • differs a lot between different memories on chip
  • Power
  • Per core/per chip
  • Derived metrics
  • FLOP/Byte
  • FLOP/Watt

29
Peak performance
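  • Peak = cores x threads_per_core x FLOPS/cycle x clock_frequency
  • Examples
  • Nehalem EX: 8 x 2 x 4 x 2.26 GHz, quoted as 170 GFLOPs (the product itself is about 144.6 GFLOPs; 170 GFLOPs would need a clock of about 2.66 GHz)
  • HD 5870: (20 x 16) x 5 x 0.85 GHz = 1360 GFLOPs
  • GF100: (16 x 32) x 2 x 1.45 GHz ≈ 1484 GFLOPs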
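
The peak formula can be checked with a few lines of Python. This is a minimal sketch: the function name `peak_gflops` and its parameter names are illustrative, and the inputs are the two GPU examples from this slide, read as (groups x units per group) x FLOPs per cycle x clock.

```python
def peak_gflops(parallel_units, flops_per_cycle, clock_ghz):
    """Theoretical peak in GFLOPs: parallel units x FLOPs per cycle x clock in GHz."""
    return parallel_units * flops_per_cycle * clock_ghz


# Inputs taken from the slide's examples.
hd5870 = peak_gflops(parallel_units=20 * 16, flops_per_cycle=5, clock_ghz=0.85)
gf100 = peak_gflops(parallel_units=16 * 32, flops_per_cycle=2, clock_ghz=1.45)

print(f"HD 5870: {hd5870:.1f} GFLOPs")
print(f"GF100:   {gf100:.1f} GFLOPs")
```

The results are upper bounds only: they assume every unit issues its maximum number of floating-point operations in every cycle.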

30
On-chip memory bandwidth
  • Registers and per-core caches
  • See the specification
  • Shared memory
  • Peak_Data_Rate x Data_Bus_Width
  • = (frequency x data_rate) x data_bus_width
  • Example(s) (data rate in GT/s, bus width in bits; divide by 8 for GB/s)
  • Nehalem DDR3: 1.333 x 2 x 64 / 8 ≈ 21 GB/s
  • HD 5870: 4.800 x 256 / 8 = 153.6 GB/s
  • Fermi: 4.200 x 384 / 8 = 201.6 GB/s
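
The bandwidth examples can be reproduced as follows. This is a sketch under two assumptions: the data rates are in GT/s and the bus widths in bits (hence the division by 8), and the factor 2 in the Nehalem example is a channel count. The slide does not label that factor, so `channels` is a guess.

```python
def peak_bandwidth_gbs(data_rate_gtps, bus_width_bits, channels=1):
    """Peak bandwidth in GB/s: transfers per second x bus width, bits converted to bytes."""
    return data_rate_gtps * bus_width_bits * channels / 8


hd5870_bw = peak_bandwidth_gbs(4.8, 256)
fermi_bw = peak_bandwidth_gbs(4.2, 384)
# Assumption: the unlabeled "2" on the slide is the number of memory channels.
nehalem_bw = peak_bandwidth_gbs(1.333, 64, channels=2)

for name, bw in [("HD 5870", hd5870_bw), ("Fermi", fermi_bw), ("Nehalem DDR3", nehalem_bw)]:
    print(f"{name}: {bw:.1f} GB/s")
```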

31
Off-chip memory bandwidth
  • Depends on the interconnect
  • Intel's technology: QPI
  • 25.6 GB/s
  • AMD's technology: HT3
  • 19.2 GB/s
  • Accelerators: PCI-e 1.0 or 2.0
  • 8 GB/s or 16 GB/s

32
Summary
Platform          Cores  Threads/ALUs  GFLOPS    BW (GB/s)  FLOPS/Byte
Cell/B.E.         8      8             204.80    25.6       8.0000
Nehalem EE        4      8             57.60     25.5       2.2588
Nehalem EX        8      16            170.00    63         2.6984
Niagara           8      32            9.33      20         0.4665
Niagara 2         8      64            11.20     76         0.1474
AMD Barcelona     4      8             37.00     21.4       1.7290
AMD Istanbul      6      6             62.40     25.6       2.4375
AMD Magny-Cours   12     12            124.80    25.6       4.8750
IBM Power 7       8      32            264.96    68.22      3.8839
G80               16     128           404.80    86.4       4.6852
GT200             30     240           933.00    141.7      6.5843
GF100             16     512           1484.00   201.6      7.3611
ATI Radeon 4890   160    800           680.00    124.8      5.4487
HD5870            320    1600          1360.00   153.6      8.8542
33
Absolute HW performance 1
  • Achieved in the optimal conditions
  • Processing units 100% used
  • All parallelism 100% exploited
  • All data transfers at maximum bandwidth

How many applications are like this? Basically none;
it's even hard to build the right benchmarks
34
Evaluate in use
35
Workloads
  • For a new application
  • Design parallel algorithm
  • Implement
  • Optimize
  • Benchmark
  • Any application can run on any platform
  • Influence on
  • performance
  • portability
  • productivity
  • Ideally, we want a good fit!

36
Performance goals
  • Hardware designer
  • How fast is my hardware running?
  • End-user
  • How fast is my application running?
  • End-user's manager
  • How efficient is my application?
  • Developer's manager
  • How much time does it take to program it?
  • Developer
  • How close can I get to the peak performance?

37
SW Performance metrics
  • Execution time (user)
  • Speed-up
  • vs. best available sequential application
  • Achieved GFLOPs (developer/user's manager)
  • Computational efficiency
  • Achieved GB/s (developer)
  • Memory efficiency
  • Productivity and portability (developer's manager)
  • Production costs
  • Maintenance costs

38
For example
Hundreds of applications reach speed-ups of
up to 2 orders of magnitude! Incredible
performance! Or is it?
39
Developer
  • Searching for peak performance
  • Which platform to use?
  • What is the maximum I can achieve? And how?
  • Performance models
  • Amdahl's Law
  • Arithmetic Intensity and the Roofline model

40
Amdahl's Law
  • How can we apply Amdahl's law to multi-core (MC) applications?
  • Discussion
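
As a starting point for that discussion, the textbook form of Amdahl's law is speed-up = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the execution time and n the number of cores. The sketch below uses that standard form; the fraction 0.95 and the core counts are illustrative values, not figures from the slides.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Upper bound on speed-up when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Illustrative: 95% of the run time is parallelizable.
for n in (8, 64, 512):
    print(f"{n:4d} cores: {amdahl_speedup(0.95, n):.2f}x")
```

Even with hundreds of cores the speed-up stays below 1 / (1 - p) = 20x here, which is why the sequential part matters more as core counts grow.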

41
Arithmetic intensity (AI)
  • AI = OP/Byte
  • How many operations are executed per transferred
    byte?
  • Determines the boundary between compute-intensive
    and data-intensive applications

42
Application AI
Is the application compute intensive or memory
intensive?
  • Example
  • AI (RGB-to-Gray conversion) = 5/4
  • Read: 3 B, Write: 1 B
  • Compute: 3 MUL + 2 ADD
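
The operation and byte counts in this example can be read off a per-pixel kernel. The sketch below is illustrative: the weights are the common ITU-R BT.601 luma coefficients, which the slide does not specify, and one byte per colour channel is assumed.

```python
def rgb_to_gray(pixel):
    """Convert one (r, g, b) pixel to gray: 3 multiplications and 2 additions."""
    r, g, b = pixel
    # Assumed weights (BT.601 luma); the slide gives only the operation counts.
    return 0.299 * r + 0.587 * g + 0.114 * b


def arithmetic_intensity(operations, bytes_transferred):
    """AI = operations executed per byte transferred."""
    return operations / bytes_transferred


OPS_PER_PIXEL = 3 + 2      # 3 MUL + 2 ADD
BYTES_PER_PIXEL = 3 + 1    # read R, G, B (1 B each), write 1 B of gray

ai = arithmetic_intensity(OPS_PER_PIXEL, BYTES_PER_PIXEL)
print(f"AI = {ai} OP/Byte")
```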

43
Platform AI
Is the application compute intensive or memory
intensive?
Example: RGB to Gray
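
One way to answer this question is to compare the application's AI with the platform's FLOPS/Byte ratio. The sketch below assumes that comparison is what the slide intends, treats the kernel's operations as FLOPs, and uses peak GFLOPS and bandwidth values from the summary table; the function names are illustrative.

```python
# (peak GFLOPS, bandwidth in GB/s), taken from the summary table.
PLATFORMS = {
    "Cell/B.E.": (204.80, 25.6),
    "Niagara 2": (11.20, 76.0),
    "GF100": (1484.00, 201.6),
}


def platform_balance(peak_gflops, bandwidth_gbs):
    """FLOPs the platform can execute per byte it can transfer."""
    return peak_gflops / bandwidth_gbs


def classify(app_ai, peak_gflops, bandwidth_gbs):
    """Label the application on a platform by comparing its AI to the platform balance."""
    if app_ai < platform_balance(peak_gflops, bandwidth_gbs):
        return "memory intensive"
    return "compute intensive"


RGB_TO_GRAY_AI = 5 / 4
verdicts = {name: classify(RGB_TO_GRAY_AI, peak, bw) for name, (peak, bw) in PLATFORMS.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

The same kernel comes out memory intensive on Cell/B.E. and GF100 but compute intensive on Niagara 2, so the label belongs to the (application, platform) pair rather than to the application alone.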
44
The Roofline model 1
  • Achievable_peak = min(PeakGFLOPs, AI x StreamBW)
  • PeakGFLOPs = platform peak
  • StreamBW = streaming bandwidth
  • AI = application arithmetic intensity
  • Theoretical peak values are to be replaced by real
    values
  • E.g., the values achieved without various optimizations
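
The Roofline bound is a one-line computation. The sketch below sweeps a few AI values for GF100, using the theoretical peak and bandwidth from these slides as stand-ins for measured values, which is an assumption since the slide recommends replacing them with real ones.

```python
def attainable_gflops(peak_gflops, stream_bw_gbs, ai):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, ai * stream_bw_gbs)


GF100_PEAK = 1484.0   # GFLOPs, theoretical
GF100_BW = 201.6      # GB/s, theoretical (stand-in for streaming bandwidth)

for ai in (0.25, 1.25, 4.0, 8.0, 16.0):
    bound = attainable_gflops(GF100_PEAK, GF100_BW, ai)
    limiter = "bandwidth" if bound < GF100_PEAK else "compute"
    print(f"AI = {ai:5.2f} OP/Byte -> at most {bound:7.1f} GFLOPs ({limiter}-bound)")
```

Below the point where AI x StreamBW reaches PeakGFLOPs, raising the arithmetic intensity raises the bound; above it, only the compute roof matters.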

45
The Roofline model 2
  • Black
  • Theoretical peak
  • Yellow
  • No streaming optimizations
  • Green
  • No in-core optimizations
  • Red
  • worst case performance zone
  • Dashed
  • The application

46
Use the Roofline model
  • To determine what to do first to gain
    performance:
  • Increase arithmetic intensity
  • Increase streaming rate
  • Apply in-core optimizations
  • ... and these are topics for your next lecture

Samuel Williams et al., "Roofline: an insightful
visual performance model for multicore
architectures"
47
Take home message
  • Performance evaluation depends on goals
  • Execution time (users)
  • GFLOPs and GB/s (developers)
  • Efficiency (budget holders?)
  • Stop tweaking when:
  • You reach the performance goal
  • You are constrained by the capabilities of the
    (application, platform) pair, e.g., as predicted
    by Roofline
  • Choose platform to fit application
  • Parallelism layers
  • Arithmetic intensity
  • Streaming capabilities

48
Questions
Ana Lucia Varbanescu analucia_at_cs.vu.nl