Intels Tarascale computing project - PowerPoint PPT Presentation

1 / 33

About This Presentation

Title:

Intels Tarascale computing project

Description:

What are the key architecture issues in many-cores CMP. CDA6159fa07 Peir 2. On ... A typical DDR2 bus is 16 bytes (128 bits) wide and operating at 800Mb/s. The ... – PowerPoint PPT presentation

Number of Views:31

Avg rating:3.0/5.0

Slides: 34

Provided by: cise8

Category:

more less

Transcript and Presenter's Notes

Title: Intels Tarascale computing project

1
Tara-Scale CMP

Intels Tara-scale computing project
100 cores, gt100 threads
Datacenter-on-a-chip
Suns Niagara2
8 cores, 64 Threads
Key design issues
Architecture Challenges and Tradeoffs
Packaging and off-chip memory bandwidth
Software and runtime environment

CDA6159fa07 peir
2
Many-Core CMPs High-level View
Cores
What are the key architecture issues in
many-cores CMP
L1I/D
L2

On-die interconnect
Cache organization Cache coherence
I/O and Memory architecture

CDA6159fa07 Peir 2
3
The General Block Diagram
FFU Fixed Function Unit, Mem C Memory
Controller, PCI-E C PCI-based Controller, R
Router, ShdU Shader Unit, Sys I/F System
Interface, TexU Texture Unit
CDA6159fa07 Peir 3
4
On-Die Interconnect
2D Embedding of a 64-core 3D-mesh network The
longest hop of the topological distance is
extended from 9 to 18!
5
On-Die Interconnect

Must satisfy bandwidth and latency within
power/area
Ring or 2D mesh/torus are good candidate topology
Wiring density, router complexity, design
complexity
Multiple source/dest. pairs can be switched
together avoid packets stop and buffered, save
power, help throughput
Xbar, general router are power hungry
Fault-tolerant interconnect
Provide spare modules, allow fault-tolerant
routing
Partition for performance isolation

6
Performance Isolation in 2D mesh

Performance isolation in 2D mesh with partition
3 rectangular partitions
Intra-communication confined within partition
Traffic generated in a partition will not affect
others
Virtualization of network interfaces
Interconnect as an abstraction of applications
Allow programmers fine-tune applications
inter-processor communication

7
Basic VC Router for On-Die Interconnect
8
2-D Mesh Router Layout
9
Router Pipeline

Micro-architecture optimization on Router
pipeline
Use Express Virtual Channel for intermediate hops
Eliminate VA and SA stages, Bypass buffers
Static EVC vs. dynamic EVC
Intelligent inconnect topologies

10
Many-Core CMPs
Cores
How about on-die cache organization with so many
cores?
L1I/D
L2

Shared vs. Private
Cache capacity vs. accessibility
Data replication vs. block migration
Cache partition

11
CMP Cache Organization
12
Capacity vs. Accessibility, A Tradeoff

Capacity favor Shared cache
No data replication, no cache coherence
Longer access time, contention issue
Flexible cache capacity sharing
Fair sharing among cores Cache partition
Accessibility favor Private cache
Fast local access with data replication, capacity
may suffer
Need maintain coherence among private caches
Equal partition, inflexible
Many works to take advantage of both
Capacity sharing on private cooperative caching
Utility-based cache partition on shared

13
Analytical Data Replication Model
Local hits increase R/S of hits to replica
Local hits increase R/S of hits to replica L of
replica hits local
P Miss Penalty Cycles G Local Gain Cycles Net
memory access cycle increase
Reuse distance histogram f(x) of accesses with
distance x
Cache size S Total hits gt Area beneath the
curve gt
Cache misses increase
Capacity decreases Cache hits now
14
Get Histogram f(x) for OLTP
X106
Step 1 Stack simulation Collect discrete reuse
distance
Step 2 Matlab Curve Fitting Find math expr.
15
Data Replication Effects

f(x)
G 15
P 400
L 0.5

Data Replication Impacts vary with different
cache sizes
S 2M
S 2M 0 best
S 4M
S 4M 40 best
S 8M
S 8M 65 best
(R/S)
16
Many-Core CMPs
Cores
How about Cache Coherence with so many
corescaches?
L1I/D
L2

Snooping bus Broadcast requests
Directory-based maintaining memory block
information
Review Cullers book

17
Simplicity Shared L2, Write-through L1

Existing designs
IBM Power4 5
Sun Niagara Niagara 2
Small number of cores,
Multiple L2 banks, Xbar
Still need L1 coherence!!
Inclusive L2, use L2 directory
record L1 sharers in Power45
Non-inclusive L2, Shadow L1 directory in Niagara
L2 (shared) coherence among multiple CMPs
Private L2 is assumed

18
Other Considerations

Broadcast
Snooping Bus loading, speed,
space, power, scalability, etc.
Ring slow traversal, ordering,
scalability
Memory-based directory
Huge directory space
Directory cache, extra penalty
Shadow L2 Directory copy all local L2s
Aggregated associativity Cores Ways/Core
6416 1024 way
High power

19
Directory-Based Approach

Directory needs to maintain the state and
location of all cached blocks
Directory is checked when the data cannot be
accessed locally, e.g. cache miss,
write-to-shared
Directory may route the request to remote cache
to fetch the requested block

20
Sparse Directory Approach

Holds states for all cached blocks
Low-cost set-associative design
No backup
Key issues
Centralized vs. Distributed
Extra invalidation due to conflicts
Presence bit vs. duplicated blocks

21
Conflict Issues in Coherence Directory

Coherence directory must be a superset of all
cached blocks
Uneven distribution of cached blocks in each
directory set cause invalidations
Potential solutions
High set associativity costly
Directory victim directory
Randomization and Skew associativity
Bigger directory - Costly
Others?

22
Impact of Invalidation due to Directory Conflict

8-core CMP, 1MB 8-way private L2 (total 8MB)
Set-associative dir of dir entry total
of cache blocks
Each cached block occupies a directory entry

96
93
75
72
23
Presence bits Issue in Directory

Presence bits (or not?)
Extra space, useless for multi-programs
Coherence directory must cover all cached blocks
(consider no sharing)
Potential solutions
Coarse-granularity present bits
Sparse presence vectors record core-ids
Allow duplicated block addresses with few
core-ids for each shared block, enable multiple
hits on directory search

24
Valid Blocks
Skew, and 10w-1/4 helps No difference 64v
25
Challenge in Memory Bandwidth

Increase in off-chip memory bandwidth to sustain
chip-level IPC
Need power-efficient high-speed off-die I/O
Need power-efficient high-bandwidth DRAM access
Potential Solutions
Embedded DRAM
Integrated DRAM, GDDR inside processor package
3D stacking of multiple DRAM/processor dies
Many technology issues to overcome

26
Memory Bandwidth Fundamental

BW of bits x bit rate
A typical DDR2 bus is 16 bytes (128 bits) wide
and operating at 800Mb/s. The memory bandwidth of
that bus is 16 bytes x 800Mb/s, which is 12.8GB/s
Latency and Capacity
Fast, but small capacity on-chip SRAM (caches)
Slow large capacity off-chip DRAM

27
Memory Bus vs. System Bus Bandwidth

Scaling of bus capability has usually involved a
combination of increasing the bus width while
simultaneously increasing the bus speed

28
Integrated CPU with Memory Controller

Eliminate off-chip controller delay
Fast, but difficult to adapt new DRAM technology
The entire burden of pin count and interconnect
speed to sustain increases in memory bandwidth
requirements now falls on the CPU package alone

29
Challenge in Memory Bandwidth and Pin Count
30
Challenge in Memory Bandwidth