Title: Efficient Interconnects for Clustered Microarchitectures

1
Efficient Interconnects for Clustered Microarchitectures

Joan-Manuel Parcerisa
Antonio González
Universitat Politècnica de Catalunya, Barcelona, Spain
{jmanel,antonio}@ac.upc.es

Julio Sahuquillo
José Duato
Universitat Politècnica de València, València, Spain
{jsahuqui,jduato}@disca.upv.es

2
Why Clustered Microarchitectures

Larger issue width, window length, predictor sizes
More complexity → more latency and power
Even worse: wire delays do not scale across technologies
Deeper pipelines, fewer logic levels per stage
Tight loops are difficult to fit in a single cycle
E.g. issue logic, bypass
Partitioning critical structures attacks both problems
E.g. clustered microarchitectures

3
A Typical Clustered uArch

Partitioned processor core
Instructions dynamically steered

Each cluster: RF, IQ, FUs
Faster issue, read, bypass
Inter-cluster communications
Go through slow interconnects
Take 1 cycle or more

Steering must maximize communication locality

4
Motivation

ICN is a critical part of the architecture
Performance is very sensitive to communication latency!
ICNs assumed by previous works
Crossbar → does not scale
Ring → simple, but long delays
Idealized

Our proposals
Several point-to-point ICNs for 4 and 8 clusters
Implementable, simple and efficient
A topology-aware steering scheme

5
Outline

Clustered architecture
Topology-aware steering
Proposed Interconnects
Experimental results
Summary and conclusions

6
Our Assumed Clustered uArch

Distributed RF
Results only written to local RF
Values are communicated with copy instructions
Automatically inserted
Each copy creates a new instance
Rename Table tracks the locations of multiple instances

7
Communication Timing

[Timing diagram; label: (to C1) copy R1 C1→C2]

8
Baseline Steering Scheme (dependence-based)

1. Minimize communication penalty
If all source operands are available
Select clusters that minimize the number of communications
If any source operand is not available
Select producer cluster
2. Maximize workload balance
Choose the least loaded of the clusters selected by rule 1
One exception
If workload imbalance > threshold, ignore rule 1
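
The two rules and the exception above can be sketched in Python as follows. This is only an illustrative sketch: the `Operand` record, the per-cluster `load` list, the imbalance metric (most loaded minus least loaded cluster), and the tie-breaking by lowest cluster index are assumptions made for the example, not details given on the slide.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Operand:
    """Hypothetical view of one source operand at steering time."""
    ready: bool               # has the value already been produced?
    clusters: FrozenSet[int]  # clusters holding an instance (when ready)
    producer: int             # cluster where the value is (or will be) produced


def baseline_steer(sources: Sequence[Operand], load: List[int], threshold: int) -> int:
    """Pick a cluster following rules 1 and 2 (sketch under the stated assumptions)."""
    all_clusters = range(len(load))

    # Exception: assumed imbalance metric is (max load - min load).
    if max(load) - min(load) > threshold:
        candidates = list(all_clusters)
    else:
        pending = [s for s in sources if not s.ready]
        if pending:
            # Rule 1, some operand not available: go to a producer cluster.
            candidates = sorted({s.producer for s in pending})
        else:
            # Rule 1, all operands available: minimize the number of copies needed.
            def copies(c: int) -> int:
                return sum(1 for s in sources if c not in s.clusters)

            fewest = min(copies(c) for c in all_clusters)
            candidates = [c for c in all_clusters if copies(c) == fewest]

    # Rule 2: least loaded candidate (ties broken by lowest cluster index).
    return min(candidates, key=lambda c: (load[c], c))
```

With four clusters, `load = [3, 1, 2, 5]` and a threshold of 8, an instruction whose two ready operands live in clusters {0} and {0, 2} is steered to cluster 0, since no copy is needed there.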

9
Topology-Aware Steering Scheme

Also minimize distance
Change part of rule 1
If all source operands are available
Baseline: select clusters that minimize the number of communications
Topology-aware: select clusters that minimize the longest communication distance
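
To make the difference between the two selection criteria concrete, here is a minimal Python sketch. It assumes an 8-cluster bidirectional ring, represents each ready operand by the set of clusters that hold an instance of it, and assumes each operand is fetched from its nearest instance. The function names are hypothetical and the code is not taken from the presentation.

```python
from typing import List, Sequence, Set


def ring_distance(a: int, b: int, n: int) -> int:
    """Hop count between clusters a and b on an n-cluster bidirectional ring."""
    d = (a - b) % n
    return min(d, n - d)


def baseline_candidates(operand_locs: Sequence[Set[int]], n: int) -> List[int]:
    """Clusters that need the fewest inter-cluster communications."""
    def comms(c: int) -> int:
        return sum(1 for locs in operand_locs if c not in locs)

    best = min(comms(c) for c in range(n))
    return [c for c in range(n) if comms(c) == best]


def topology_aware_candidates(operand_locs: Sequence[Set[int]], n: int) -> List[int]:
    """Clusters that minimize the longest communication distance."""
    def longest(c: int) -> int:
        # Each operand is assumed to come from its nearest instance.
        return max((min(ring_distance(src, c, n) for src in locs)
                    for locs in operand_locs), default=0)

    best = min(longest(c) for c in range(n))
    return [c for c in range(n) if longest(c) == best]


# Two ready operands held in clusters 0 and 4 of an 8-cluster ring.
operands = [{0}, {4}]
print(baseline_candidates(operands, 8))        # [0, 4]
print(topology_aware_candidates(operands, 8))  # [2, 6]
```

In this example the baseline criterion prefers clusters 0 and 4 (one communication, 4 hops long), whereas the topology-aware criterion prefers clusters 2 and 6 (two communications, each 2 hops long).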

10
Design Issues: Bandwidth

For each additional input bypass path
1 tag across the IQ
1 RF write port
1 entry to FU input MUXes
It increases the wakeup and bypass delays
Bandwidth requirements are rather low

1 input bypass path per cluster (1 RF write port)
2 links per connected cluster pair

11
Design Issues: Latency

Performance is very sensitive to communication latency

Simple routing structures and algorithms
Source routing
No intermediate buffering
In-transit messages have priority over newly injected ones
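
The priority rule for an output link can be sketched in a few lines of Python. This is a hypothetical illustration: messages are plain strings, and the function name and return convention are assumptions, not part of the presentation.

```python
from typing import Optional, Tuple


def arbitrate_link(in_transit: Optional[str],
                   new_message: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (message sent on the link this cycle, message left waiting at its source)."""
    if in_transit is not None:
        # No intermediate buffering: the in-transit message must move on now,
        # so a newly injected message waits in its source cluster.
        return in_transit, new_message
    return new_message, None
```

Because the router never stores an in-transit message, only the injecting cluster ever has to hold a message back.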

12
Design Issues: Connectivity

Assumed 1-cycle communication delay between adjacent clusters
Number of adjacent clusters dictated by technology and layout

Study topologies with different connectivity degrees

13
Design Issues: Point-to-point vs. Buses

Point-to-point advantages
Access to links is arbitrated locally
Wires are shorter and less loaded

Shared buses are studied for comparison

14
Interconnects for 4 clusters (I)

Bus2
1 bus per cluster, each connected to 1 write port
Latency: 4 cycles (2 for arbitration + 2 for transmission)
Arbitration overlaps with transmission

15
Interconnects for 4 clusters (II)

Synchronous Ring
Injection rules prevent 2 messages from arriving at once
Even cycles: 1-hop counter-clockwise / 2-hop clockwise
Odd cycles: reverse directions
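
One way to see why these injection rules avoid simultaneous arrivals is to check the arrival parity of each direction. The Python sketch below rests on assumptions: a 4-cluster ring (so messages travel 1 or 2 hops), 1 cycle per hop, and a reading of the rule in which even cycles allow 1-hop messages counter-clockwise and 2-hop messages clockwise, with the directions swapped on odd cycles. The names `CW`, `CCW` and the functions are hypothetical.

```python
from itertools import product
from typing import Dict, Set

CW, CCW = +1, -1  # hypothetical encoding of the two ring directions


def allowed_direction(cycle: int, hops: int) -> int:
    """Direction in which a message needing `hops` hops may be injected at `cycle`."""
    if hops not in (1, 2):
        raise ValueError("a 4-cluster ring only needs 1-hop or 2-hop messages")
    even = cycle % 2 == 0
    if hops == 1:
        return CCW if even else CW
    return CW if even else CCW


def arrival_directions_by_parity(num_cycles: int = 16) -> Dict[int, Set[int]]:
    """Map arrival-cycle parity (0 or 1) to the directions messages can arrive from."""
    seen: Dict[int, Set[int]] = {0: set(), 1: set()}
    for cycle, hops in product(range(num_cycles), (1, 2)):
        arrival_cycle = cycle + hops  # 1 cycle per hop
        seen[arrival_cycle % 2].add(allowed_direction(cycle, hops))
    return seen


print(arrival_directions_by_parity() == {0: {CW}, 1: {CCW}})  # True
```

Under these assumptions, clockwise messages only arrive on even cycles and counter-clockwise messages only on odd cycles. Since a link carries one message per cycle, each cluster receives at most one message per cycle.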

16
Interconnects for 4 clusters (III)

Partially Asynchronous Ring
Messages may issue in any cycle
2 messages may arrive at once
Small input queues

17
Interconnects for 4 clusters (IV)

Ideal Ring
Contention-free
unlimited number of links
unlimited number of RF write ports
For comparison purposes (upper-bound performance)

18
Interconnects for 8 Clusters (I)

Buses
Analogous to those for 4 clusters
Bus2: same latency (optimistic), 2+2 cycles
Bus4: twice the latency (realistic), 4+4 cycles
Rings
Analogous to those for 4 clusters
Synchronous and Asynchronous
Max. distance: 4 hops (average 2.29 hops)

19
Interconnects for 8 Clusters (II)

Mesh
Max. distance: 4 hops (average 2 hops)
2 in-transit messages may compete for the same output link
Constrained connectivity

20
Interconnects for 8 Clusters (III)

Torus
Max. distance: 3 hops
Same connectivity constraints as the mesh
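
The hop counts quoted for the 8-cluster topologies can be checked with a short breadth-first search. The Python sketch below assumes the mesh is a 2x4 grid and the torus is the same grid with each 4-cluster row closed into a ring. These layouts are assumptions chosen because they give every cluster at most 3 neighbours. Averages are taken over all ordered pairs of distinct clusters.

```python
from collections import deque
from itertools import product
from typing import Dict, Hashable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def ring(n: int) -> Graph:
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def grid(rows: int, cols: int, wrap_cols: bool = False) -> Graph:
    """rows x cols mesh; with wrap_cols=True each row is closed into a ring."""
    adj: Graph = {node: set() for node in product(range(rows), range(cols))}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for r, c in list(adj):
        if r + 1 < rows:
            link((r, c), (r + 1, c))
        if c + 1 < cols:
            link((r, c), (r, c + 1))
        elif wrap_cols:
            link((r, c), (r, 0))
    return adj


def hop_stats(adj: Graph) -> Tuple[int, float]:
    """(max, mean) shortest-path hop count over all ordered pairs of distinct nodes."""
    dists = []
    for start in adj:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        dists.extend(d for node, d in seen.items() if node != start)
    return max(dists), sum(dists) / len(dists)


for name, g in [("ring", ring(8)),
                ("mesh 2x4", grid(2, 4)),
                ("torus 2x4", grid(2, 4, wrap_cols=True))]:
    worst, mean = hop_stats(g)
    print(f"{name}: max {worst} hops, average {mean:.2f} hops")
# ring: max 4 hops, average 2.29 hops
# mesh 2x4: max 4 hops, average 2.00 hops
# torus 2x4: max 3 hops, average 1.71 hops
```

Under these assumptions the sketch reproduces the ring and mesh figures and the torus maximum given on the slides. The torus average of about 1.71 hops is a result of the sketch and is not stated in the presentation.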

21
Interconnects for 8 Clusters (IV)

Ideal Torus
Contention-free
unlimited number of links
unlimited number of RF write ports
For comparison purposes (upper-bound performance)

22
Router Structures

Features common to all ICNs
No intermediate buffering

[Router diagram labels: RightLink, LeftLink, Cluster Datapath]

Partially asynchronous ICNs
Competition for a write port
Add small input queues

Topologies with 3 adjacent nodes
Competition for the same output link
Constrained connectivity

23
Experimental Setup

Simulation
Extended version of sim-outorder (SimpleScalar v3.0)
14 Mediabench programs
Compiled with -O4 for an Alpha AXP
Architecture
L1 D-cache: 64KB, 2-way, 3-cycle hit
128-entry ROB, 64-entry LSQ
Each cluster: 2-way issue, 16-entry IQ, 56 physical regs.

24
Performance: 4 Clusters

Poor performance of Bus2
Asynchronous Ring
Better than Synchronous Ring
Close to Ideal (within 1%)

25
Synchronous / Asynchronous

Contention delays
Lower for Async. Ring
Message issues as soon as the link is available
Higher for 1-hop messages
A single path
Sync. Ring: can issue in only 1 cycle out of every 2

26
Length of Input Queues

Max. observed occupancy < 9 entries
Handle overflows by flushing the pipeline
Rather than including complex flow control

[Figure: sample statistics (djpeg)]

27
Performance: 8 Clusters

Poor performance of buses
Connectivity degree has a significant impact
Asynchronous Torus close to Ideal (within 1.5%)

28
Topology-Aware Steering

16.5% IPC improvement with 8 clusters (2.5% with 4 clusters)

29
Summary

An efficient topology-aware steering scheme
Cluster point-to-point interconnects
For 4 clusters and 8 clusters
Designed to minimize complexity and latency
Compared to
Bus-based models
Idealized models with unlimited bandwidth

30
Conclusions

The choice of ICN is crucial for performance
Point-to-point better than buses
Asynchronous rings better than synchronous
Asynchronous interconnects perform close to ideal with minimal complexity
Higher connectivity significantly improves performance
Topology-aware steering is essential to reduce latency
Especially with many clusters
