Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing

1
Piranha: A Scalable Architecture Based on
Single-Chip Multiprocessing
  • Barroso, Gharachorloo, McNamara, et al.
  • Proceedings of the 27th Annual ISCA, June 2000

Presented by Garver Moore, ECE259, Spring 2006,
Professor Daniel Sorin
2
Motivation
  • Economic: high demand for OLTP machines
  • Disconnect between the ILP focus of processor
    design and this demand
  • OLTP characteristics:
  • -- High memory latency
  • -- Little ILP (get, process, store)
  • -- Large TLP
  • OLTP is poorly served by aggressive ILP machines
  • Use old cores and an ASIC design methodology to
    build glueless, scalable OLTP machines with low
    development costs and short time to market
  • Amdahl's Law

3
The Piranha Processing Node
  • CPU: Alpha (ECE152-level work), single-issue,
    in-order, 8-stage pipeline
  • 180 nm process (2000)
  • Almost entirely ASIC design: 50% of the clock
    speed and 200% of the area versus full-custom
    methodology
  • Separate I and D L1 caches for each CPU
  • Logically shared, interleaved L2 cache
  • Eight memory controllers interface to a bank of
    up to 32 Rambus DRAM chips
  • Aggregate max bandwidth of 12.8 GB/sec
Figure directly from Barroso et al.
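The slide says the L2 is logically shared and interleaved but does not say how lines map to banks. A minimal sketch of one common scheme, selecting the bank from the low-order bits of the line address, is given here. The constants LINE_BYTES and NUM_BANKS and the function names are assumptions for illustration, not figures taken from the slide.

```python
# Hypothetical sketch of address-interleaved L2 banking.
# LINE_BYTES and NUM_BANKS are assumed values, not stated on the slide
# (eight memory controllers merely suggest eight banks).

LINE_BYTES = 64  # assumed cache-line size in bytes
NUM_BANKS = 8    # assumed number of L2 banks


def l2_bank(phys_addr: int) -> int:
    """Pick a bank from the low-order bits of the cache-line address."""
    line_addr = phys_addr // LINE_BYTES  # drop the byte-offset bits
    return line_addr % NUM_BANKS


def bank_histogram(addresses):
    """Count how many of the given addresses land in each bank."""
    counts = [0] * NUM_BANKS
    for addr in addresses:
        counts[l2_bank(addr)] += 1
    return counts
```

Under these assumptions, consecutive cache lines fall in consecutive banks, so a sequential sweep of memory spreads evenly across all banks and their memory controllers.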
4
Communication Assist
  • Home Engine and Remote Engine support shared
    memory across multiple nodes
  • System Control tackles system miscellany:
    interrupts, exceptions, init, monitoring, etc.
  • OQ, Router, IQ, Switch: standard
  • Total inter-node I/O bandwidth: 32 GB/sec
  • Each link and block here corresponds to actual
    wiring and an actual module. This allows for
    rapid parallel development and a semi-custom
    design methodology
  • Also facilitates multiple clock domains
  • THERE IS NO INHERENT I/O CAPABILITY.
5
I/O Organization
  • Smaller than the processing node
  • Router: 2 links, alleviating the need for a
    routing table
  • Memory is globally visible and part of the
    coherency scheme
  • CPU: optimized placement for drivers,
    translations, etc. that need low-latency access
    to I/O
  • Re-used dL1 design provides the interface to
    PCI/X
  • Supports arbitrary I/O:P ratio and network
    topology
  • Glueless scaling up to 1024 nodes of any type
    supports application-specific customization
6
Coherence: Local
  • Each L2 bank and its associated controller
    contain directory data for intra-chip requests
  • Centralized directory
  • Chip ICS (intra-chip switch) is responsible for
    all on-chip communication
  • L2 is non-inclusive: a large victim buffer for
    the L1s. It keeps copies of the L1 tags and
    state
  • The L2 controller can determine whether data is
    cached remotely and, if so, whether exclusively.
    The majority of L1 requests then require no CA
    (communication assist) help
  • On a request, the L2 can service it directly,
    forward it to the owner L1, forward it to a
    protocol engine, or get the data from memory
  • On a forward, the L2 blocks conflicting
    requests
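The four outcomes of an L1 request at the L2 controller can be sketched as a small decision function. This is a simplified illustration only: the LineState fields, the Action names, and the order in which the cases are checked are all assumptions, and the sketch ignores the difference between reads and writes.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    SERVICE_FROM_L2 = auto()
    FORWARD_TO_OWNER_L1 = auto()
    FORWARD_TO_PROTOCOL_ENGINE = auto()
    FETCH_FROM_MEMORY = auto()
    BLOCKED = auto()


@dataclass
class LineState:
    """Hypothetical per-line state visible to the L2 controller."""
    in_l2: bool = False               # line is present in this L2 bank
    owner_l1: Optional[int] = None    # id of an on-chip L1 that owns the line
    remote_exclusive: bool = False    # another node holds the line exclusively
    forward_pending: bool = False     # an earlier forward is still outstanding


def dispatch(state: LineState) -> Action:
    """Choose how the L2 controller handles an L1 request (assumed ordering)."""
    if state.forward_pending:
        return Action.BLOCKED                      # conflicting request waits
    if state.remote_exclusive:
        return Action.FORWARD_TO_PROTOCOL_ENGINE   # needs inter-node help
    if state.owner_l1 is not None:
        return Action.FORWARD_TO_OWNER_L1          # another on-chip L1 owns it
    if state.in_l2:
        return Action.SERVICE_FROM_L2              # hit, serviced locally
    return Action.FETCH_FROM_MEMORY                # miss, go to local memory
```

Only the remote-exclusive case involves a protocol engine in this sketch, which is one way to read the claim that most L1 requests need no CA assist.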
7
Coherence: Global
  • Trades ECC granularity for free directory data
    storage (4x granularity leaves 44 bits per
    64-byte line)
  • Invalidation-based distributed directory protocol
  • Some optimizations
  • No NACKing. Deadlock avoidance through I/O, L,
    and H priority virtual lanes. L: requests to the
    home node, low priority. H: forwarded requests
    and replies, high priority.
  • Also guarantees that forwards are always
    serviced by their targets, e.g. an owner writes
    back to home and holds the data until home
    acknowledges.
  • Removes NACK/retry traffic, as well as ownership
    change (DASH), retry counts (Origin), and "No,
    seriously" (Token).
  • Routing toward empty buffers for old messages
    gives linear buffer dependence on N. Sharing
    buffer space among lanes and CMI (cruise-missile
    invalidates) avoid deadlock.
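The 44-bit figure in the first bullet can be checked with a short calculation. The sketch assumes a SECDED Hamming code, a baseline ECC granularity of 64 bits (so 4x means 256 bits), and a 64-byte (512-bit) line; none of these is spelled out on the slide, and the function names are hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a SECDED Hamming code over data_bits of data.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one more overall-parity bit adds double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def freed_bits_per_line(line_bytes: int = 64,
                        fine_bits: int = 64,
                        coarse_bits: int = 256) -> int:
    """ECC bits saved per line by moving from fine to coarse granularity."""
    line_bits = line_bytes * 8
    fine_total = (line_bits // fine_bits) * secded_check_bits(fine_bits)
    coarse_total = (line_bits // coarse_bits) * secded_check_bits(coarse_bits)
    return fine_total - coarse_total
```

Under these assumptions, eight 64-bit words need 8 x 8 = 64 check bits, two 256-bit words need 2 x 10 = 20, and `freed_bits_per_line()` returns 44, matching the slide.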

8
Evaluation Methodology
  • Admittedly favorable OLTP benchmarks chosen
    (TPC-B and TPC-D modifications)
  • Simulated and compared to performance of
    aggressive OOO core (Alpha 21364) with integrated
    coherence and cache hardware
  • Fudged for full-custom effect
  • Four evaluations: P1 (one-core Piranha at
    500 MHz), INO (1 GHz single-issue in-order
    aggressive core), OOO (4-issue 1 GHz), and P8
    (spec. system)

9
Results
10
Questions/Discussion
  • Deadlock avoidance w/o NACK
  • CMP vs SMP
  • Fishy evaluation methodology?
  • Specialized computing
  • Buildability?