Title: Efficient Interconnects for Clustered Microarchitectures
1. Efficient Interconnects for Clustered Microarchitectures
- Joan-Manuel Parcerisa, Antonio González
- Universitat Politècnica de Catalunya, Barcelona, Spain
- {jmanel,antonio}@ac.upc.es
- Julio Sahuquillo, José Duato
- Universitat Politècnica de València, València, Spain
- {jsahuqui,jduato}@disca.upv.es
2. Why Clustered Microarchitectures
- Larger issue width, window length, predictor sizes
- More complexity → more latency and power
- Even worse: wire delays do not scale across technologies
- Deeper pipelines, fewer logic levels per stage
- Tight loops difficult to fit in a single cycle
- E.g. issue logic, bypass
- Partitioning critical structures attacks both problems
- E.g. clustered microarchitectures
3. A Typical Clustered uArch
- Partitioned processor core
- Instructions dynamically steered
- Each cluster: RF, IQ, FUs
- Faster issue, read, bypass
- Inter-cluster communications
- Go through slow interconnects
- Take 1 cycle or more
- Steering must maximize communication locality
4. Motivation
- The interconnection network (ICN) is a critical part of the architecture
- Performance is very sensitive to communication latency!
- ICN assumed by previous works
- Cross-bar → does not scale
- Ring → simple, but long delays
- Idealized
- Our proposals
- Several point-to-point ICN for 4 and 8 clusters
- Implementable, simple and efficient
- A topology-aware steering
5. Outline
- Clustered architecture
- Topology-aware steering
- Proposed Interconnects
- Experimental results
- Summary and conclusions
6. Our Assumed Clustered uArch
- Distributed RF
- Results only written to local RF
- Values are communicated with copy instructions
- Automatically inserted
- Each copy creates a new instance
- Rename Table tracks locations of multiple instances
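The copy mechanism above can be sketched as a small rename-table model. This is only an illustration under stated assumptions: the class name `RenameTable`, its methods, and the choice of which existing instance feeds a copy are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class RenameTable:
    """Maps each logical register to the clusters holding an instance of it."""
    locations: dict = field(default_factory=dict)

    def define(self, reg, cluster):
        # A result is written only to the producer's local register file,
        # so older instances of the same register are no longer valid.
        self.locations[reg] = {cluster}

    def read(self, reg, cluster):
        """Return the copy instructions needed so `cluster` can read `reg`."""
        holders = self.locations[reg]
        if cluster in holders:
            return []                      # value is already local: no copy
        source = min(holders)              # hypothetical choice of source instance
        holders.add(cluster)               # each copy creates a new instance
        return [("copy", reg, source, cluster)]


table = RenameTable()
table.define("R1", 1)
print(table.read("R1", 2))   # a copy from cluster 1 to cluster 2 is inserted
print(table.read("R1", 2))   # second read finds the new local instance
```

A later consumer in cluster 2 reuses the instance created by the first copy, which is why the table must track every location of a register rather than a single one.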
7. Communication Timing
[Figure: timing diagram of a copy instruction, "(to C1) copy R1 C1->C2"]
8. Baseline Steering Scheme (dependence-based)
- 1. Minimize communication penalty
- If all source operands available
- Select clusters that minimize communications
- If any source operand not available
- Select producer cluster
- 2. Maximize workload balance
- Choose the least loaded of clusters selected by rule 1
- One exception
- If workload imbalance > threshold, ignore rule 1
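The two rules and the exception can be written down as a short function. This is a sketch under assumptions, not the authors' implementation: the function name, the operand encoding, and the imbalance metric (maximum minus minimum load) are all hypothetical.

```python
def steer_baseline(operands, load, threshold):
    """Pick a cluster for one instruction (sketch of the two rules above).

    operands  -- one (holders, producer) pair per source operand: `holders` is
                 the set of clusters that already hold the value; `producer` is
                 the cluster still computing it, or None if it is available.
    load      -- hypothetical workload figure per cluster.
    threshold -- imbalance above which rule 1 is ignored.
    """
    clusters = range(len(load))
    if max(load) - min(load) > threshold:        # assumed imbalance metric
        candidates = list(clusters)              # exception: ignore rule 1
    else:
        pending = {prod for _, prod in operands if prod is not None}
        if pending:
            candidates = sorted(pending)         # rule 1: follow the producer(s)
        else:
            # Rule 1: one copy is needed for each operand not held locally.
            copies = [sum(c not in holders for holders, _ in operands)
                      for c in clusters]
            candidates = [c for c in clusters if copies[c] == min(copies)]
    return min(candidates, key=lambda c: load[c])   # rule 2: least loaded


# Both operands are available; only cluster 2 holds them both.
print(steer_baseline([({0, 2}, None), ({2}, None)], [5, 1, 3, 2], 10))  # 2
```

With a threshold of 3 instead of 10, the same call returns cluster 1: the imbalance (5 - 1 = 4) exceeds the threshold, so locality is ignored and the least loaded cluster wins.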
9. Topology-Aware Steering Scheme
- Also minimize distance
- Change part of rule 1
- If all source operands are available
- Baseline: select clusters that minimize communications
- Topology-aware: select clusters that minimize the longest communication distance
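A minimal sketch of the modified rule, assuming each operand is fetched from its nearest instance and that hop counts come from a topology-specific function. The names `topology_aware_candidates` and `ring8` are hypothetical; the 8-cluster ring is used purely as an example topology.

```python
def topology_aware_candidates(operand_holders, n_clusters, distance):
    """Clusters that minimize the longest operand communication distance.

    operand_holders -- one non-empty set of holder clusters per available operand.
    distance        -- function (src, dst) -> hops, supplied by the topology.
    """
    def longest(cluster):
        # Each operand comes from its nearest instance; the slowest one counts.
        return max((min(distance(h, cluster) for h in holders)
                    for holders in operand_holders), default=0)

    best = min(longest(c) for c in range(n_clusters))
    return [c for c in range(n_clusters) if longest(c) == best]


def ring8(a, b):
    """Hop count on an 8-cluster bidirectional ring (illustrative topology)."""
    d = abs(a - b)
    return min(d, 8 - d)


# Operands live in clusters 0 and 4. Counting copies alone would favor 0 or 4
# (one copy, but 4 hops away); minimizing the longest distance picks 2 or 6.
print(topology_aware_candidates([{0}, {4}], 8, ring8))  # [2, 6]
```

The example shows where the two policies diverge: fewer copies is not the same as shorter communication once clusters are more than one hop apart.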
10. Design Issues: Bandwidth
- For each additional input bypass path
- 1 tag across the IQ
- 1 RF write port
- 1 entry to FU input MUXes
- It increases the wakeup and bypass delays
- Bandwidth requirements are rather low
- 1 input bypass path per cluster (1 RF write port)
- 2 links per connected cluster pair
11. Design Issues: Latency
- Performance very sensitive to communication latency
- Simple routing structures and algorithms
- Source routing
- No intermediate buffering
- In-transit messages have priority over newly injected ones
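The routing choices above can be illustrated with a per-link arbitration sketch. The message format (a dictionary with a precomputed `route`) and the function names are assumptions made for illustration; the slides only state the policies.

```python
from collections import deque


def next_direction(message):
    """Source routing: the sender precomputed the route; each hop consumes one step."""
    return message["route"].popleft()


def arbitrate(in_transit, injection_queue):
    """Choose what one output link carries this cycle.

    in_transit      -- message arriving from upstream that needs this link, or None.
    injection_queue -- deque of local messages waiting to use this link.
    """
    if in_transit is not None:
        # No intermediate buffering: an in-transit message cannot wait,
        # so it always wins over a newly injected one.
        return in_transit
    if injection_queue:
        return injection_queue.popleft()
    return None


msg = {"payload": "R1", "route": deque(["right", "right"])}
print(next_direction(msg))             # "right", one hop of the route is consumed
waiting = deque(["local copy"])
print(arbitrate("passing copy", waiting))   # the in-transit message goes first
print(arbitrate(None, waiting))             # link is free, local message is injected
```

Under this policy contention only ever delays messages at their source, where they are still held by the issuing cluster.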
12. Design Issues: Connectivity
- Assumed 1-cycle communication delay between adjacent clusters
- Number of adjacent clusters dictated by technology and layout
- Study topologies with different connectivity degrees
13. Design Issues: Point-to-point vs Buses
- Point-to-point advantages
- Access to links is arbitrated locally
- Wires are shorter and less loaded
- Shared buses are studied for comparison
14. Interconnects for 4 clusters (I)
- Bus2
- 1 bus per cluster, each connected to 1 write port
- Latency: 4 cycles (2 for arbitration + 2 for transmission)
- Arbitration overlaps with transmission
15. Interconnects for 4 clusters (II)
- Synchronous Ring
- Injection rules prevent 2 messages from arriving at once
- Even cycles: 1-hop counter-clockwise / 2-hops clockwise
- Odd cycles: reverse directions
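The injection rule can be checked by brute force. The sketch below assumes one cycle per hop and clusters numbered clockwise (both assumptions, as are all names); it enumerates every injection the rule allows and records the travel direction of each message reaching a given cluster in a given cycle.

```python
N = 4  # clusters on the ring, numbered clockwise (assumed numbering)


def allowed_direction(hops, cycle):
    """Direction (+1 clockwise, -1 counter-clockwise) imposed at injection time."""
    even = cycle % 2 == 0
    if hops == 1:
        return -1 if even else +1
    return +1 if even else -1          # 2-hop messages take the opposite direction


def legal_injections(cycle):
    """All (src, dst, direction, hops) tuples that may be injected in `cycle`."""
    for src in range(N):
        for hops in (1, 2):
            d = allowed_direction(hops, cycle)
            yield src, (src + d * hops) % N, d, hops


def arrival_directions(cycles):
    """Map (dst, arrival_cycle) -> set of directions of the arriving messages."""
    seen = {}
    for cycle in range(cycles):
        for src, dst, d, hops in legal_injections(cycle):
            seen.setdefault((dst, cycle + hops), set()).add(d)
    return seen


# Every cluster sees arrivals from a single direction in any given cycle.
print(all(len(dirs) == 1 for dirs in arrival_directions(16).values()))  # True
```

Under these assumptions, all messages that reach a cluster in the same cycle travel in the same direction, so they share one incoming link and at most one of them can arrive. The price is that a 1-hop message to a given neighbor can be injected only every other cycle.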
16. Interconnects for 4 clusters (III)
- Partially Asynchronous Ring
- Messages may issue in any cycle
- 2 messages may arrive at once
- Small input queues
17. Interconnects for 4 clusters (IV)
- Ideal Ring
- Contention-free
- Unlimited number of links
- Unlimited number of RF write ports
- For comparison purposes (upper-bound performance)
18. Interconnects for 8 Clusters (I)
- Buses
- Analogous to those for 4 clusters
- Bus2: same latency (optimistic), 2+2 cycles
- Bus4: twice the latency (realistic), 4+4 cycles
- Rings
- Analogous to those for 4 clusters
- Synchronous and Asynchronous
- Max. distance: 4 hops (average 2.29 hops)
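The ring distance figures can be reproduced with a few lines, assuming a bidirectional ring where each message takes the shorter way round and the average is taken over the 7 other clusters. The function names are illustrative.

```python
def ring_hops(a, b, n):
    """Shortest hop count between clusters a and b on an n-cluster ring."""
    d = abs(a - b)
    return min(d, n - d)


def ring_stats(n):
    """Maximum and average distance from one cluster to every other cluster."""
    dists = [ring_hops(0, b, n) for b in range(1, n)]
    return max(dists), sum(dists) / len(dists)


longest, average = ring_stats(8)
print(longest, round(average, 2))  # 4 2.29
```

The distances from any cluster are 1, 1, 2, 2, 3, 3 and 4 hops, which sum to 16 over 7 destinations.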
19. Interconnects for 8 Clusters (II)
- Mesh
- Max. distance: 4 hops (average 2 hops)
- 2 in-transit messages may compete for the same output link
- Constrained connectivity
20. Interconnects for 8 Clusters (III)
- Torus
- Max. distance: 3 hops
- Same connectivity constraints as the mesh
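The mesh and torus distances can be checked the same way. The slides do not state the physical layout, so the sketch assumes a 2x4 arrangement of the 8 clusters. Under that assumption it reproduces the mesh figures (maximum 4 hops, average 2 hops) and the torus maximum of 3 hops.

```python
from itertools import product

ROWS, COLS = 2, 4   # assumed layout for 8 clusters; not stated in the slides


def hops(a, b, wrap):
    """Hop count between grid positions a and b; `wrap` enables torus links."""
    (r1, c1), (r2, c2) = a, b
    dr, dc = abs(r1 - r2), abs(c1 - c2)
    if wrap:                       # torus: links wrap around each dimension
        dr, dc = min(dr, ROWS - dr), min(dc, COLS - dc)
    return dr + dc


def stats(wrap):
    """Maximum and average distance over all ordered pairs of distinct clusters."""
    nodes = list(product(range(ROWS), range(COLS)))
    dists = [hops(a, b, wrap) for a in nodes for b in nodes if a != b]
    return max(dists), sum(dists) / len(dists)


print(stats(wrap=False))      # mesh:  (4, 2.0)
print(stats(wrap=True)[0])    # torus: 3
```

With only two rows, wrapping the short dimension adds nothing, so each torus node still has 3 neighbors, consistent with the connectivity constraint shared with the mesh.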
21. Interconnects for 8 Clusters (IV)
- Ideal Torus
- Contention-free
- Unlimited number of links
- Unlimited number of RF write ports
- For comparison purposes (upper-bound performance)
22. Router Structures
- Common features to all ICN
- No intermediate buffering
- Partially asynchronous ICN
- Competition for a write port
- Add small input queues
- Topologies with 3 adjacent nodes
- Competition for the same output link
- Constrained connectivity
[Figure: router structure; labels: RightLink, LeftLink, Cluster Datapath]
23. Experimental Setup
- Simulation
- Extended version of sim-outorder (SimpleScalar v3.0)
- 14 Mediabench programs
- Compiled with -O4 for an Alpha AXP
- Architecture
- L1 D-cache: 64KB, 2-way, 3-cycle hit
- 128-entry ROB, 64-entry LSQ
- Each cluster: 2-way issue, 16-entry IQ, 56 physical regs.
24. Performance: 4 Clusters
- Poor performance of Bus2
- Asynchronous Ring
- Better than Synchronous Ring
- Close to Ideal (within 1%)
25. Synchronous / Asynchronous
- Contention delays
- Lower for Async. Ring
- Message issues as soon as the link is available
- Higher for 1-hop messages
- A single path
- Sync. Ring: issue 1 cycle every 2
26. Length of Input Queues
- Max. observed occupancy < 9 entries
- Handle overflows by flushing the pipeline
- Rather than including complex control flow
[Figure: sample statistics (djpeg)]
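The overflow policy can be sketched as a bounded queue in front of the single register-file write port. This is a simplified model under assumptions: the class and method names are hypothetical, and the pipeline flush is reduced to a counter plus emptying the queue, whereas a real flush would also squash and re-execute instructions.

```python
from collections import deque


class InputQueue:
    """Small queue buffering incoming values that compete for one RF write port."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()
        self.flushes = 0
        self.max_occupancy = 0

    def arrive(self, message):
        """Enqueue an incoming message; on overflow, flush instead of stalling."""
        if len(self.entries) == self.capacity:
            self.flush_pipeline()
            return False
        self.entries.append(message)
        self.max_occupancy = max(self.max_occupancy, len(self.entries))
        return True

    def drain_one(self):
        """One write port: at most one queued value is written per cycle."""
        return self.entries.popleft() if self.entries else None

    def flush_pipeline(self):
        # Placeholder for the recovery action; only the bookkeeping is modeled.
        self.flushes += 1
        self.entries.clear()


queue = InputQueue(capacity=2)
queue.arrive("a")
queue.arrive("b")
print(queue.arrive("c"), queue.flushes)  # False 1
```

Because the observed occupancy stays small, overflows should be rare, which is what makes a heavyweight but simple recovery acceptable.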
27. Performance: 8 Clusters
- Poor performance of buses
- Connectivity degree has a significant impact
- Asynchronous Torus close to Ideal (within 1.5%)
28. Topology-Aware Steering
- 16.5% IPC improvement with 8 clusters (2.5% with 4 clusters)
29. Summary
- An efficient topology-aware steering scheme
- Cluster point-to-point interconnects
- For 4 clusters and 8 clusters
- Designed to minimize complexity and latency
- Compared to
- Bus-based models
- Idealized models with unlimited bandwidth
30. Conclusions
- The choice of ICN is crucial for performance
- Point-to-point better than buses
- Asynchronous rings better than synchronous
- Asynchronous interconnects perform close to ideal
- With minimal complexity
- Higher connectivity significantly improves performance
- Topology-aware steering essential to reduce latency
- Especially with many clusters