Title: Multicore
2. The Kill Rule for Multicore
- Anant Agarwal
- MIT and Tilera Corp.
3. Multicore is Moving Fast
Corollary of Moore's Law: the number of cores will double every 18 months.
What must change to enable this growth?
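To see what the stated corollary implies, here is a hedged back-of-envelope sketch in Python; the starting core count and the time horizon are made-up illustration values, not figures from the talk.

```python
# Projection of the stated corollary: core counts double every 18 months (1.5 years).
def projected_cores(years, start_cores=2):
    # start_cores is an illustrative assumption, not a figure from the slides
    return start_cores * 2 ** (years / 1.5)

print(round(projected_cores(6)))   # 2 cores today -> about 32 cores in 6 years
```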
4. Multicore Drivers Suggest Three Directions
- Diminishing returns → smaller structures
- Power efficiency → smaller structures; slower clocks, voltage scaling
- Wire delay → distributed structures
- Multicore programming
1. How we size core resources
2. How we connect the cores
3. How programming will evolve
5. How We Size Core Resources
[Figure: 3 cores with small caches]
6. KILL Rule for Multicore
Kill If Less than Linear
A resource in a core should be increased in area only if the core's performance improvement is at least proportional to the core's area increase.
Put another way: increase a resource's size only if, for every 1% increase in core area, there is at least a 1% increase in core performance.
Leads to power-efficient multicore design
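To make the rule concrete, here is a minimal sketch of the KILL test applied to a proposed resource increase; the numbers are made up for illustration.

```python
def passes_kill_rule(area_increase_pct, perf_increase_pct):
    """Kill If Less than Linear: grow a core resource only if core
    performance improves at least proportionally to the core area added."""
    return perf_increase_pct >= area_increase_pct

# Illustrative (made-up) numbers: suppose doubling a cache adds 30% core area
# but only buys 12% more core performance -- the KILL rule says don't do it.
print(passes_kill_rule(area_increase_pct=30, perf_increase_pct=12))  # False
```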
7. Kill Rule for Cache Size Using a Video Codec
8. Well Beyond Diminishing Returns
[Die photo: Madison Itanium 2 cache system, showing the L3 cache; photo courtesy Intel Corp.]
9. Slower Clocks Suggest Even Smaller Caches
Insight: maintain constant instructions per cycle (IPC).
10. Multicore Drivers Suggest Three Directions
- Diminishing returns → smaller structures
- Power efficiency → smaller structures; slower clocks, voltage scaling
- Wire delay → distributed structures
- Multicore programming
1. How we size core resources
   - The KILL rule suggests smaller caches for multicore: if the clock is slower by x, then for constant IPC the cache can be smaller by x² (see the sketch after this list).
   - The KILL rule applies to all multicore resources: issue width (2-way is probably ideal; Simplefit, TPDS 7/2001), cache sizes, and the number of memory hierarchy levels.
2. How we connect the cores
3. How programming will evolve
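A minimal sketch of the constant-IPC argument behind the x² claim. It assumes the common rule of thumb that miss rate scales roughly as 1/√(cache size) and that memory latency is fixed in nanoseconds; both are assumptions of this sketch, not statements from the slide.

```python
def allowed_cache_scale(clock_slowdown_x):
    # Slowing the clock by x cuts the miss penalty measured in cycles by x,
    # so the miss rate may grow by x at constant IPC. With miss rate roughly
    # proportional to 1/sqrt(cache size), an x-times-higher miss rate
    # corresponds to a cache that is x**2 times smaller.
    return 1.0 / clock_slowdown_x ** 2

print(allowed_cache_scale(2))   # 0.25: half the clock rate, a quarter of the cache
```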
11. Interconnect Options
Packet routing through switches
12. Bisection Bandwidth is Important
13. Concept of Bisection Bandwidth
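A small sketch of the idea, comparing a shared bus with a square mesh; the square-mesh assumption and the link-counting convention are illustrative choices, not from the slide.

```python
# Bisection bandwidth: total bandwidth across a cut that splits the chip
# into two equal halves, here counted in links crossing the cut.
def bus_bisection_links(n_cores):
    return 1                           # one shared medium crosses any cut

def mesh_bisection_links(n_cores):
    side = int(round(n_cores ** 0.5))  # assume a square side x side mesh
    return side                        # 'side' links cross the middle cut

for n in (16, 64, 256):
    print(n, bus_bisection_links(n), mesh_bisection_links(n))
# The bus stays at 1 link; the mesh grows with sqrt(n): 4, 8, 16.
```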
14. Meshes are Power Efficient
[Chart: energy savings of mesh vs. bus as a function of the number of processors, across benchmarks]
15. Meshes Offer Simple Layout
Example: MIT's Raw Multicore
- 16 cores
- Demonstrated in 2002
- 0.18 micron
- 425 MHz
- IBM SA27E standard cell
- 6.8 GOPS
www.cag.csail.mit.edu/raw
16. Multicore
- Single chip
- Multiple processing units
- Multiple, independent threads of control, or program counters (MIMD)
17. Multicore Drivers Suggest Three Directions
- Diminishing returns → smaller structures
- Power efficiency → smaller structures; slower clocks, voltage scaling
- Wire delay → distributed structures
- Multicore programming
1. How we size core resources
2. How we connect the cores
3. How programming will evolve
18. Multicore Programming Challenge
- Multicore programming is hard. Why?
  - It is new
  - It is misunderstood: some sequential programs are harder
  - Current tools are where VLSI design tools were in the mid '80s
  - Standards are needed (tools, ecosystems)
- This problem will be solved soon. Why?
  - Multicore is here to stay
  - Intel webinar: "Think parallel or perish"
  - Opportunity to create the API foundations
  - The incentives are there
19. Old Approaches Fall Short
- Pthreads
  - Intel webinar likens it to the "assembly language" of parallel programming
  - Data races are hard to analyze
  - No encapsulation or modularity
  - But evolutionary, and OK in the interim
- DMA with external shared memory
  - DSP programmers favor DMA
  - Explicit copying from global shared memory to local store
  - Wastes pin bandwidth and energy
  - But evolutionary, simple, modular, and has a small core memory footprint
- MPI
  - Province of HPC users
  - Based on sending explicit messages between private memories
  - High overheads and large core memory footprint
But there is a big new idea staring us in the face.
20. Inspiration from ASICs: Streaming
Stream of data over a hardware FIFO
- Streaming is energy efficient and fast
- Concept familiar and well developed in hardware
design and simulation languages
21. Streaming is Familiar, Like Sockets
- Basis of networking and internet software
- Familiar, popular
- Modular, scalable
- Conceptually simple
- Each process can use existing sequential code
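A minimal sketch of the socket-style programming model: two stages of ordinary sequential code connected by a FIFO channel. Here multiprocessing.Pipe stands in for a core-to-core stream, and the producer/consumer stages are made-up examples.

```python
from multiprocessing import Process, Pipe

def producer(conn):
    for x in range(8):
        conn.send(x * x)      # stream results downstream over the channel
    conn.send(None)           # end-of-stream marker
    conn.close()

def consumer(conn):
    while (item := conn.recv()) is not None:
        print("got", item)    # each stage is plain sequential code

if __name__ == "__main__":
    downstream, upstream = Pipe()
    stages = [Process(target=producer, args=(upstream,)),
              Process(target=consumer, args=(downstream,))]
    for s in stages: s.start()
    for s in stages: s.join()
```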
22. Core-to-Core Data Transfer Cheaper than Memory Access
- Energy
  - 32-bit network transfer over a 1 mm channel: 3 pJ
  - 32 KB cache read: 50 pJ
  - External memory access: 200 pJ
- Latency
  - Register to register: 5 cycles (Raw)
  - Cache to cache: 50 cycles
  - DRAM access: 200 cycles
Data based on a 90 nm process node.
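As a back-of-envelope illustration of why streaming between cores beats round trips through memory, here is a sketch using the energy numbers above; the hop distance and the write-plus-read round trip are assumptions made for illustration.

```python
# Per-32-bit-word energy figures from the slide (90 nm process node), in pJ.
NETWORK_PJ_PER_MM = 3     # 32b network transfer over a 1 mm channel
CACHE_READ_PJ     = 50    # 32 KB cache read (listed for comparison)
EXTERNAL_PJ       = 200   # external (off-chip) memory access

def stream_energy(words, distance_mm=5):
    # Direct core-to-core streaming: a few mm of on-chip wire per word (assumed).
    return words * distance_mm * NETWORK_PJ_PER_MM

def shared_memory_energy(words):
    # Hypothetical exchange through external shared memory: one write + one read.
    return words * 2 * EXTERNAL_PJ

print(stream_energy(1000), shared_memory_energy(1000))   # 15000 pJ vs 400000 pJ
```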
23. Streaming Supports Many Models
- Pipeline (among other models)
- Not great for blackboard-style shared state
- But then, there is no one-size-fits-all
24. Multicore Streaming Can Be Way Faster than Sockets
- No fundamental overheads for:
  - Unreliable communication
  - High-latency buffering
  - Hardware heterogeneity
  - OS heterogeneity
- Infrequent setup
- Common-case operations are fast and power efficient
- Low memory footprint
MCA's CAPI standard
25. CAPI's Stream Implementation 1
[Diagram: Process A (e.g., FIR1) on Core 1 streams to Process B (e.g., FIR2) on Core 2 within a multicore chip]
I/O register-mapped hardware FIFOs in SoCs
26. CAPI's Stream Implementation 2
[Diagram: Process A (e.g., FIR) on Core 1 and Process B (e.g., FIR) on Core 2, each with its own cache, connected by the on-chip interconnect within a multicore chip]
On-chip cache-to-cache transfers over the on-chip interconnect in general multicores
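The point of the two implementations is that the application sees the same stream interface whichever transport sits underneath. A hypothetical sketch follows; this is not the actual CAPI API, and the class and function names are invented for illustration.

```python
from collections import deque

class FifoStream:
    """Stands in for implementation 1: a register-mapped hardware FIFO."""
    def __init__(self):
        self._fifo = deque()
    def send(self, word):
        self._fifo.append(word)
    def recv(self):
        return self._fifo.popleft()

class CacheStream:
    """Stands in for implementation 2: cache-to-cache transfer over the
    on-chip interconnect; here just a different backing store."""
    def __init__(self):
        self._buf = []
    def send(self, word):
        self._buf.append(word)
    def recv(self):
        return self._buf.pop(0)

def fir(taps, inp, out, n):
    """Toy FIR stage: reads n samples from inp, writes n results to out.
    It only uses send/recv, so either transport works unchanged."""
    window = [0.0] * len(taps)
    for _ in range(n):
        window = [inp.recv()] + window[:-1]
        out.send(sum(t * x for t, x in zip(taps, window)))

# Usage with either transport (single-threaded toy run):
src, dst = FifoStream(), FifoStream()
for v in (1.0, 2.0, 3.0, 4.0):
    src.send(v)
fir([0.5, 0.5], src, dst, 4)
print([dst.recv() for _ in range(4)])   # [0.5, 1.5, 2.5, 3.5]
```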
27. Conclusions
- Multicore is here to stay
- Evolve the core and the interconnect
- Create multicore programming standards: users are ready
- Multicore success requires:
  - Reduction in core cache size
  - Adoption of a mesh-based on-chip interconnect
  - Use of a stream-based programming API
- Successful solutions will offer an evolutionary transition path