1
Efficient Interconnects for Clustered Microarchitectures
  • Joan-Manuel Parcerisa
  • Antonio González
  • Universitat Politècnica de Catalunya, Barcelona, Spain
    {jmanel,antonio}@ac.upc.es
  • Julio Sahuquillo
  • José Duato
  • Universitat Politècnica de València, València, Spain
    {jsahuqui,jduato}@disca.upv.es
2
Why Clustered Microarchitectures
  • Larger issue width, window length, predictor
    sizes
  • More complexity → more latency and power
  • Even worse, wire delays do not scale across
    technologies
  • Deeper pipelines, fewer logic levels per stage
  • Tight loops difficult to fit in a single cycle
  • E.g. issue logic, bypass
  • Partitioning critical structures attacks both
    problems
  • E.g. clustered microarchitectures

3
A Typical Clustered uArch
  • Partitioned processor core
  • Instructions dynamically steered
  • Each cluster: RF, IQ, FUs
  • Faster issue, read, bypass
  • Inter-cluster communications
  • Go through slow interconnects
  • Take 1 cycle or more
  • Steering must maximize communication locality

4
Motivation
  • ICN is a critical part of the architecture
  • Performance is very sensitive to communication
    latency!
  • ICN assumed by previous works
  • Crossbar → does not scale
  • Ring → simple, but long delays
  • Idealized
  • Our proposals
  • Several point-to-point ICNs for 4 and 8 clusters
  • Implementable, simple and efficient
  • A topology-aware steering scheme

5
Outline
  • Clustered architecture
  • Topology-aware steering
  • Proposed Interconnects
  • Experimental results
  • Summary and conclusions

6
Our Assumed Clustered uArch
  • Distributed RF
  • Results only written to local RF
  • Values are communicated with copy instructions
  • Automatically inserted
  • Each copy creates a new instance
  • Rename Table tracks locations of multiple
    instances
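
The tracking described above can be pictured with a small sketch (not from the slides; all names are illustrative): each logical register maps to the set of clusters that hold an instance of its value, and a copy simply adds one more location.

    # Hypothetical Python sketch of a rename table that tracks the multiple
    # instances of a value created by copy instructions.
    class RenameTable:
        def __init__(self):
            self.entries = {}                  # logical reg -> {cluster: phys reg}

        def define(self, logical, cluster, phys_reg):
            # A new producer overwrites all previous instances of the register.
            self.entries[logical] = {cluster: phys_reg}

        def add_copy(self, logical, dst_cluster, phys_reg):
            # A copy instruction creates one more instance in another cluster.
            self.entries[logical][dst_cluster] = phys_reg

        def clusters_holding(self, logical):
            # Clusters where the value is locally readable (or will be, once
            # the pending copy completes).
            return set(self.entries[logical])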

7
Communication Timing
[Figure: communication timing of a copy instruction forwarding R1 from cluster C1 to C2]
8
Baseline Steering Scheme (dependence-based)
  • 1. Minimize communication penalty
  • If all source operands available
  • Select clusters that minimize communications
  • If any source operand not available
  • Select producer cluster
  • 2. Maximize workload balance
  • Choose the least loaded of clusters selected by
    rule 1
  • One exception
  • If workload imbalance > threshold, ignore rule 1
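
The two rules can be summarized in a short sketch (an illustration, not the authors' implementation), assuming per-cluster load counters, a rename table like the one sketched earlier, and a hypothetical instr.ready() predicate:

    def steer_baseline(instr, rename_table, load, threshold, clusters):
        # Exception: if workload imbalance exceeds the threshold, ignore rule 1.
        if max(load.values()) - min(load.values()) > threshold:
            return min(clusters, key=lambda c: load[c])

        # Rule 1: minimize communication penalty.
        if all(instr.ready(r) for r in instr.src_regs):
            # All operands available: keep the clusters needing fewest copies.
            sources = [rename_table.clusters_holding(r) for r in instr.src_regs]
            def copies_needed(c):
                return sum(1 for locs in sources if c not in locs)
            best = min(copies_needed(c) for c in clusters)
            candidates = [c for c in clusters if copies_needed(c) == best]
        else:
            # Some operand is still in flight: follow its producer's cluster.
            pending = next(r for r in instr.src_regs if not instr.ready(r))
            candidates = list(rename_table.clusters_holding(pending))

        # Rule 2: maximize workload balance among the candidates of rule 1.
        return min(candidates, key=lambda c: load[c])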

9
Topology-Aware Steering Scheme
  • Also minimize distance
  • Change part of rule 1
  • If all source operands are available
  • Baseline: select clusters that minimize
    communications
  • Topology-aware: select clusters that minimize
    the longest communication distance
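
The change affects only the candidate set of rule 1; a hedged sketch, assuming a hop-distance table dist[a][b] for the chosen topology and the same per-operand location sets as in the baseline sketch:

    def candidates_topology_aware(sources, clusters, dist):
        # Cost of a candidate cluster c: the LONGEST distance any source
        # operand would have to travel (0 if a local instance exists).
        def worst_hops(c):
            if not sources:
                return 0
            return max(min(dist[s][c] for s in locs) for locs in sources)
        best = min(worst_hops(c) for c in clusters)
        return [c for c in clusters if worst_hops(c) == best]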

10
Design Issues: Bandwidth
  • For each additional input bypass path
  • 1 tag across the IQ
  • 1 RF write port
  • 1 entry to FU input MUXes
  • It increases the wakeup and bypass delays
  • Bandwidth requirements are rather low
  • 1 input bypass path per cluster (1 RF write port)
  • 2 links per connected cluster pair

11
Design Issues: Latency
  • Performance very sensitive to communication
    latency
  • Simple routing structures and algorithms
  • Source routing
  • No intermediate buffering
  • In-transit messages have priority over newly
    injected ones
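
These choices can be pictured with a minimal sketch (illustrative names, not the actual router logic): each message carries its precomputed route (source routing), and when an in-transit message and a newly injected one want the same output link, the in-transit one always wins and the local one waits.

    class Message:
        def __init__(self, value, route):
            self.value = value
            self.route = route          # source routing: list of links to follow

    def grant_output_link(in_transit, newly_injected):
        # No intermediate buffering: the losing local message is simply not
        # injected this cycle and retries later.
        return in_transit if in_transit is not None else newly_injected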

12
Design Issues: Connectivity
  • Assumed 1-cycle communication delay between
    adjacent clusters
  • Number of adjacents dictated by technology and
    layout
  • Study topologies with different connectivity
    degrees

13
Design Issues: Point-to-Point vs. Buses
  • Point-to-point advantages
  • Access to links is arbitrated locally
  • Wires are shorter and less loaded
  • Shared buses are studied for comparison

14
Interconnects for 4 clusters (I)
  • Bus2
  • 1 Bus per cluster, each connected to 1 write port
  • Latency: 4 cycles (2 for arbitration + 2 for
    transmission)
  • Arbitration overlaps with transmission

15
Interconnects for 4 clusters (II)
  • Synchronous Ring
  • Injection rules prevent 2 messages from arriving
    at once
  • Even cycles: 1 hop counter-clockwise / 2 hops
    clockwise
  • Odd cycles: directions reverse
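
A sketch of the rule (assuming clusters numbered 0..3 around the ring): it returns the hop count and direction a node may use in a given cycle, which is what guarantees that a node never receives two messages at once.

    def allowed_injections(cycle):
        # Injection rule for the 4-node synchronous ring.
        if cycle % 2 == 0:
            # Even cycles: 1 hop counter-clockwise or 2 hops clockwise.
            return [(1, "ccw"), (2, "cw")]
        # Odd cycles: the directions reverse.
        return [(1, "cw"), (2, "ccw")]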

16
Interconnects for 4 clusters (III)
  • Partially Asynchronous Ring
  • Messages may issue in any cycle
  • 2 messages may arrive at once
  • Small input queues
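
A sketch of how the single RF write port could be shared when two messages arrive in the same cycle (queue size and policy are assumptions, not taken from the slides):

    from collections import deque

    class InputPort:
        def __init__(self, depth=8):
            self.queue = deque()
            self.depth = depth              # small queue (see slide 26)

        def cycle(self, arriving):
            # Up to 2 messages may arrive at once; all are queued, but only
            # one can use the RF write port per cycle.
            self.queue.extend(arriving)
            return self.queue.popleft() if self.queue else None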

17
Interconnects for 4 clusters (IV)
  • Ideal Ring
  • Contention-free
  • unlimited number of links
  • unlimited number of RF write ports
  • For comparison purposes (upper-bound performance)

18
Interconnects for 8 Clusters (I)
  • Buses
  • Analogous to those for 4 clusters
  • Bus2: same latency (optimistic), 2 + 2 cycles
  • Bus4: twice the latency (realistic), 4 + 4 cycles
  • Rings
  • Analogous to those for 4 clusters
  • Synchronous and Asynchronous
  • Max. distance: 4 hops (average 2.29 hops)
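
The 2.29 figure follows from the ring geometry: in an 8-node ring the shortest-path distances from any node to the other seven are 1, 1, 2, 2, 3, 3, 4 hops, i.e. 16/7 ≈ 2.29. A small check:

    def ring_distances(n=8):
        # Shortest-path hops from one node to each of the other n-1 nodes.
        d = [min(k, n - k) for k in range(1, n)]
        return max(d), sum(d) / len(d)

    print(ring_distances())     # (4, 2.2857...): max 4 hops, average ~2.29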

19
Interconnects for 8 Clusters (II)
  • Mesh
  • Max. distance: 4 hops (average 2 hops)
  • 2 in-transit messages may compete for the same
    output link
  • Constrained connectivity

20
Interconnects for 8 Clusters (III)
  • Torus
  • Max. distance: 3 hops
  • Same connectivity constraints as the mesh
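
Assuming the 8 clusters are laid out as a 2x4 grid (which matches the distances quoted on the two slides above), the same check works for the mesh and for a torus that wraps around the long dimension:

    from itertools import product

    def grid_distances(wrap_columns):
        nodes = list(product(range(2), range(4)))       # 2 rows x 4 columns
        def hops(a, b):
            dy = abs(a[0] - b[0])                       # rows never need to wrap
            dx = abs(a[1] - b[1])
            if wrap_columns:
                dx = min(dx, 4 - dx)                    # torus wrap-around links
            return dx + dy
        d = [hops(a, b) for a, b in product(nodes, repeat=2) if a != b]
        return max(d), sum(d) / len(d)

    print(grid_distances(False))    # mesh:  max 4 hops, average 2.0 hops
    print(grid_distances(True))     # torus: max 3 hops, average ~1.7 hops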

21
Interconnects for 8 Clusters (IV)
  • Ideal Torus
  • Contention-free
  • unlimited number of links
  • unlimited number of RF write ports
  • For comparison purposes (upper-bound
    performance)

22
Router Structures
  • Features common to all ICNs
  • No intermediate buffering
  • Partially asynchronous ICNs
  • Contention for a write port
  • Add small input queues
  • Topologies with 3 adjacent nodes
  • Contention for the same output link
  • Constrained connectivity

[Figure: router structure, with LeftLink and RightLink ports and the cluster datapath]
23
Experimental Setup
  • Simulation
  • Extended version of sim-outorder (SimpleScalar
    v3.0)
  • 14 Mediabench programs
  • Compiled with -O4 for an Alpha AXP
  • Architecture
  • L1 D-cache: 64KB, 2-way, 3-cycle hit
  • 128-entry ROB, 64-entry LSQ
  • Each cluster: 2-way issue, 16-entry IQ, 56
    physical regs.

24
Performance 4 Clusters
  • Poor performance of Bus2
  • Asynchronous Ring
  • Better than Synchronous Ring
  • Close to Ideal (within 1%)

25
Synchronous / Asynchronous
  • Contention delays
  • Lower for the Async. Ring
  • A message issues as soon as the link is available
  • Higher for 1-hop messages
  • Only a single path
  • The Sync. Ring can inject them only 1 cycle out of
    every 2

26
Length of Input Queues
  • Max. observed occupancy < 9 entries
  • Handle overflows by flushing the pipeline
  • Rather than including complex flow control
[Chart: sample input-queue occupancy statistics (djpeg)]
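
A minimal sketch of that policy, assuming a hypothetical flush_pipeline() recovery hook: the rare overflow triggers a flush instead of adding back-pressure flow control to the interconnect.

    def enqueue_or_flush(queue, msg, capacity, flush_pipeline):
        # Occupancy stayed below 9 entries in the simulations, so overflow is
        # rare; when it happens, the pipeline is flushed and re-executed.
        if len(queue) >= capacity:
            flush_pipeline()
            return False
        queue.append(msg)
        return True
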
27
Performance 8 Clusters
  • Poor performance of buses
  • Connectivity degree has a significant impact
  • Asynchronous Torus close to Ideal (within 1.5%)

28
Topology-Aware Steering
  • 16.5% IPC improvement with 8 clusters (2.5% with
    4 clusters)

29
Summary
  • An efficient topology-aware steering scheme
  • Cluster point-to-point interconnects
  • For 4 clusters and 8 clusters
  • Designed to minimize complexity and latency
  • Compared to
  • Bus-based models
  • Idealized models with unlimited bandwidth

30
Conclusions
  • The choice of ICN is crucial for performance
  • Point-to-point better than buses
  • Asynchronous rings better than synchronous
  • Asynchronous interconnects perform close to ideal
  • with minimal complexity
  • Higher connectivity significantly improves
    performance
  • Topology-aware steering essential to reduce
    latency
  • Especially with many clusters