Title: Multicore
2. The Kill Rule for Multicore
- Anant Agarwal
- MIT and Tilera Corp.
3. Multicore is Moving Fast
Corollary of Moore's Law: the number of cores will double every 18 months.
What must change to enable this growth?
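To see what the stated corollary implies, here is a hedged back-of-envelope sketch in Python; the starting core count and the time horizon are made-up illustration values, not figures from the talk.

```python
# Projection of the stated corollary: core counts double every 18 months (1.5 years).
def projected_cores(years, start_cores=2):
    # start_cores is an illustrative assumption, not a figure from the slides
    return start_cores * 2 ** (years / 1.5)

print(round(projected_cores(6)))   # 2 cores today -> about 32 cores in 6 years
```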
4. Multicore Drivers Suggest Three Directions
- Diminishing returns → smaller structures
- Power efficiency → smaller structures; slower clocks, voltage scaling
- Wire delay → distributed structures
- Multicore programming
1. How we size core resources
2. How we connect the cores
3. How programming will evolve
5. How We Size Core Resources
[Figure: 3 cores with small caches]
6. KILL Rule for Multicore
Kill If Less than Linear
A resource in a core should be increased in area only if the core's performance improvement is at least proportional to the core's area increase.
Put another way: increase a resource's size only if, for every 1% increase in core area, there is at least a 1% increase in core performance.
Leads to power-efficient multicore design
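To make the rule concrete, here is a minimal sketch of the KILL test applied to a proposed resource increase; the numbers are made up for illustration.

```python
def passes_kill_rule(area_increase_pct, perf_increase_pct):
    """Kill If Less than Linear: grow a core resource only if core
    performance improves at least proportionally to the core area added."""
    return perf_increase_pct >= area_increase_pct

# Illustrative (made-up) numbers: suppose doubling a cache adds 30% core area
# but only buys 12% more core performance -- the KILL rule says don't do it.
print(passes_kill_rule(area_increase_pct=30, perf_increase_pct=12))  # False
```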
7. Kill Rule for Cache Size Using a Video Codec
8. Well Beyond Diminishing Returns
[Die photo: Madison Itanium 2 cache system, showing the L3 cache; photo courtesy Intel Corp.]
9. Slower Clocks Suggest Even Smaller Caches
Insight: maintain constant instructions per cycle (IPC).
10. Multicore Drivers Suggest Three Directions
- Diminishing returns → smaller structures
- Power efficiency → smaller structures; slower clocks, voltage scaling
- Wire delay → distributed structures
- Multicore programming
1. How we size core resources
   - The KILL rule suggests smaller caches for multicore: if the clock is slower by x, then for constant IPC the cache can be smaller by x² (see the sketch after this list).
   - The KILL rule applies to all multicore resources: issue width (2-way is probably ideal; Simplefit, TPDS 7/2001), cache sizes, and the number of memory hierarchy levels.
2. How we connect the cores
3. How programming will evolve
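A minimal sketch of the constant-IPC argument behind the x² claim. It assumes the common rule of thumb that miss rate scales roughly as 1/√(cache size) and that memory latency is fixed in nanoseconds; both are assumptions of this sketch, not statements from the slide.

```python
def allowed_cache_scale(clock_slowdown_x):
    # Slowing the clock by x cuts the miss penalty measured in cycles by x,
    # so the miss rate may grow by x at constant IPC. With miss rate roughly
    # proportional to 1/sqrt(cache size), an x-times-higher miss rate
    # corresponds to a cache that is x**2 times smaller.
    return 1.0 / clock_slowdown_x ** 2

print(allowed_cache_scale(2))   # 0.25: half the clock rate, a quarter of the cache
```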
11. Interconnect Options
Packet routing through switches
12. Bisection Bandwidth is Important
13. Concept of Bisection Bandwidth
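A small sketch of the idea, comparing a shared bus with a square mesh; the square-mesh assumption and the link-counting convention are illustrative choices, not from the slide.

```python
# Bisection bandwidth: total bandwidth across a cut that splits the chip
# into two equal halves, here counted in links crossing the cut.
def bus_bisection_links(n_cores):
    return 1                           # one shared medium crosses any cut

def mesh_bisection_links(n_cores):
    side = int(round(n_cores ** 0.5))  # assume a square side x side mesh
    return side                        # 'side' links cross the middle cut

for n in (16, 64, 256):
    print(n, bus_bisection_links(n), mesh_bisection_links(n))
# The bus stays at 1 link; the mesh grows with sqrt(n): 4, 8, 16.
```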
14. Meshes are Power Efficient
[Chart: energy savings of mesh vs. bus as a function of the number of processors, across benchmarks]
15. Meshes Offer Simple Layout
Example: MIT's Raw Multicore
- 16 cores
- Demonstrated in 2002
- 0.18 micron
- 425 MHz
- IBM SA27E standard cell
- 6.8 GOPS
www.cag.csail.mit.edu/raw
16. Multicore
- Single chip
- Multiple processing units
- Multiple, independent threads of control, or program counters (MIMD)
17. Multicore Drivers Suggest Three Directions
- Diminishing returns → smaller structures
- Power efficiency → smaller structures; slower clocks, voltage scaling
- Wire delay → distributed structures
- Multicore programming
1. How we size core resources
2. How we connect the cores
3. How programming will evolve
18. Multicore Programming Challenge
- Multicore programming is hard. Why?
  - It is new
  - It is misunderstood: some sequential programs are harder
  - Current tools are where VLSI design tools were in the mid '80s
  - Standards are needed (tools, ecosystems)
- This problem will be solved soon. Why?
  - Multicore is here to stay
  - Intel webinar: "Think parallel or perish"
  - Opportunity to create the API foundations
  - The incentives are there
19. Old Approaches Fall Short
- Pthreads
  - Intel webinar likens it to the "assembly language" of parallel programming
  - Data races are hard to analyze
  - No encapsulation or modularity
  - But evolutionary, and OK in the interim
- DMA with external shared memory
  - DSP programmers favor DMA
  - Explicit copying from global shared memory to local store
  - Wastes pin bandwidth and energy
  - But evolutionary, simple, modular, and has a small core memory footprint
- MPI
  - Province of HPC users
  - Based on sending explicit messages between private memories
  - High overheads and large core memory footprint
But there is a big new idea staring us in the face.
20. Inspiration from ASICs: Streaming
Stream of data over a hardware FIFO
- Streaming is energy efficient and fast
- Concept familiar and well developed in hardware
design and simulation languages
21. Streaming is Familiar, Like Sockets
- Basis of networking and internet software
- Familiar, popular
- Modular, scalable
- Conceptually simple
- Each process can use existing sequential code
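A minimal sketch of the socket-style programming model: two stages of ordinary sequential code connected by a FIFO channel. Here multiprocessing.Pipe stands in for a core-to-core stream, and the producer/consumer stages are made-up examples.

```python
from multiprocessing import Process, Pipe

def producer(conn):
    for x in range(8):
        conn.send(x * x)      # stream results downstream over the channel
    conn.send(None)           # end-of-stream marker
    conn.close()

def consumer(conn):
    while (item := conn.recv()) is not None:
        print("got", item)    # each stage is plain sequential code

if __name__ == "__main__":
    downstream, upstream = Pipe()
    stages = [Process(target=producer, args=(upstream,)),
              Process(target=consumer, args=(downstream,))]
    for s in stages: s.start()
    for s in stages: s.join()
```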
22. Core-to-Core Data Transfer Cheaper than Memory Access
- Energy
  - 32-bit network transfer over a 1 mm channel: 3 pJ
  - 32 KB cache read: 50 pJ
  - External memory access: 200 pJ
- Latency
  - Register to register: 5 cycles (Raw)
  - Cache to cache: 50 cycles
  - DRAM access: 200 cycles
Data based on a 90 nm process node.
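As a back-of-envelope illustration of why streaming between cores beats round trips through memory, here is a sketch using the energy numbers above; the hop distance and the write-plus-read round trip are assumptions made for illustration.

```python
# Per-32-bit-word energy figures from the slide (90 nm process node), in pJ.
NETWORK_PJ_PER_MM = 3     # 32b network transfer over a 1 mm channel
CACHE_READ_PJ     = 50    # 32 KB cache read (listed for comparison)
EXTERNAL_PJ       = 200   # external (off-chip) memory access

def stream_energy(words, distance_mm=5):
    # Direct core-to-core streaming: a few mm of on-chip wire per word (assumed).
    return words * distance_mm * NETWORK_PJ_PER_MM

def shared_memory_energy(words):
    # Hypothetical exchange through external shared memory: one write + one read.
    return words * 2 * EXTERNAL_PJ

print(stream_energy(1000), shared_memory_energy(1000))   # 15000 pJ vs 400000 pJ
```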
23. Streaming Supports Many Models
- Pipeline (among other models)
- Not great for blackboard-style shared state
- But then, there is no one-size-fits-all
24. Multicore Streaming Can Be Way Faster than Sockets
- No fundamental overheads for:
  - Unreliable communication
  - High-latency buffering
  - Hardware heterogeneity
  - OS heterogeneity
- Infrequent setup
- Common-case operations are fast and power efficient
- Low memory footprint
MCA's CAPI standard
25. CAPI's Stream Implementation 1
[Diagram: Process A (e.g., FIR1) on Core 1 streams to Process B (e.g., FIR2) on Core 2 within a multicore chip]
I/O register-mapped hardware FIFOs in SoCs
26. CAPI's Stream Implementation 2
[Diagram: Process A (e.g., FIR) on Core 1 and Process B (e.g., FIR) on Core 2, each with its own cache, connected by the on-chip interconnect within a multicore chip]
On-chip cache-to-cache transfers over the on-chip interconnect in general multicores
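The point of the two implementations is that the application sees the same stream interface whichever transport sits underneath. A hypothetical sketch follows; this is not the actual CAPI API, and the class and function names are invented for illustration.

```python
from collections import deque

class FifoStream:
    """Stands in for implementation 1: a register-mapped hardware FIFO."""
    def __init__(self):
        self._fifo = deque()
    def send(self, word):
        self._fifo.append(word)
    def recv(self):
        return self._fifo.popleft()

class CacheStream:
    """Stands in for implementation 2: cache-to-cache transfer over the
    on-chip interconnect; here just a different backing store."""
    def __init__(self):
        self._buf = []
    def send(self, word):
        self._buf.append(word)
    def recv(self):
        return self._buf.pop(0)

def fir(taps, inp, out, n):
    """Toy FIR stage: reads n samples from inp, writes n results to out.
    It only uses send/recv, so either transport works unchanged."""
    window = [0.0] * len(taps)
    for _ in range(n):
        window = [inp.recv()] + window[:-1]
        out.send(sum(t * x for t, x in zip(taps, window)))

# Usage with either transport (single-threaded toy run):
src, dst = FifoStream(), FifoStream()
for v in (1.0, 2.0, 3.0, 4.0):
    src.send(v)
fir([0.5, 0.5], src, dst, 4)
print([dst.recv() for _ in range(4)])   # [0.5, 1.5, 2.5, 3.5]
```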
27. Conclusions
- Multicore is here to stay
- Evolve the core and the interconnect
- Create multicore programming standards: users are ready
- Multicore success requires:
  - Reduction in core cache size
  - Adoption of a mesh-based on-chip interconnect
  - Use of a stream-based programming API
- Successful solutions will offer an evolutionary transition path