Industrial Experiences Pioneering Asynchronous Commercial Design - PowerPoint PPT Presentation

About This Presentation
Title:

Industrial Experiences Pioneering Asynchronous Commercial Design

Description:

Title: Introduction to basic concepts on asynchronous circuit design Author: Compaq Created Date: 2/13/2000 11:54:46 AM Document presentation format – PowerPoint PPT presentation

Slides: 36
Provided by: Comp6150
Learn more at: https://www.cs.upc.edu

Transcript and Presenter's Notes

Title: Industrial Experiences Pioneering Asynchronous Commercial Design


1
Industrial Experiences Pioneering Asynchronous
Commercial Design
  • Peter A. Beerel
  • Fulcrum Microsystems
  • Calabasas Hills, CA, USA

2
Agenda
  • Introduction to Fulcrum
  • Description of Integrated Pipelining
  • Fulcrum's clockless circuit architecture
  • Description of Fulcrum's Design Flow
  • Overview of Nexus
  • Fulcrum's Terabit crossbar
  • Overview of PivotPoint
  • Fulcrum's first commercial product

3
Company Snapshot
Clockless Semiconductor Company
Backed by top-tier investors (raised $14M in June)
5
Fulcrum's Integrated Pipelining
Robust, power efficient, and high performance
Fast delay-insensitive style using domino logic
without latches (developed at Caltech by
Fulcrum's founders)
[Figure: pipeline stages with Acknowledge signals]
6
Integrated Pipelining
  • Harnessing the power of Domino Logic
  • Addresses delay variability with Completion
    Sensing
  • Addresses power inefficiency with Async
    Handshakes
  • Leverages more efficient N transistors

[Figure: Leaf Cells A, B, and C, each containing Dual-Rail Domino Logic and Control, with Input Completion Detection and Output Completion Detection]
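
The completion-sensing idea on this slide can be illustrated in software. The following is a minimal Python sketch, not Fulcrum's circuit: it assumes a textbook dual-rail code in which each bit travels on a (false rail, true rail) pair and (0, 0) means "no data yet". All function names are hypothetical.

```python
from typing import List, Tuple

Rail = Tuple[int, int]  # (false_rail, true_rail); (0, 0) is the neutral state


def encode_dual_rail(bits: List[int]) -> List[Rail]:
    """Encode each bit as a rail pair: 0 -> (1, 0), 1 -> (0, 1)."""
    return [(1, 0) if b == 0 else (0, 1) for b in bits]


def is_complete(word: List[Rail]) -> bool:
    """Completion detection: every pair has exactly one rail high."""
    return all(f + t == 1 for f, t in word)


def is_neutral(word: List[Rail]) -> bool:
    """All rails low: the stage has been precharged and holds no data."""
    return all(f == 0 and t == 0 for f, t in word)


def decode_dual_rail(word: List[Rail]) -> List[int]:
    """Recover the bits; only meaningful once the word is complete."""
    if not is_complete(word):
        raise ValueError("word is not yet valid")
    return [t for _, t in word]


if __name__ == "__main__":
    partial = [(0, 1), (0, 0), (1, 0)]  # the middle bit has not arrived yet
    full = encode_dual_rail([1, 0, 1])
    print(is_complete(partial), is_complete(full))
    print(decode_dual_rail(full))
```

Because validity is carried by the data wires themselves, a downstream stage can wait on `is_complete` rather than on a clock edge, which is how completion sensing absorbs delay variability.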
7
Hierarchical Design
  • Multi-level hierarchy of communicating blocks

At each level blocks communicate along channels
8
Leaf Cells
[Figure: leaf cell block diagram with blocks labeled C, F, RCD, LCD, and D]
  • Definition
      • Smallest block that performs logic and
        communicates via channels
      • Based on small number of pipeline templates
        guiding design
      • Forms basic building block for physical design
  • Features
      • Facilitates high throughput and low latency
      • Provides easy timing validation and analog
        verification
      • 1,000 digital leaf cell types compose our leaf
        cell library
      • 200 additional subtypes for different physical
        environments (e.g., loads)

9
Template-Based Cell Design
  • Each pipeline style (QDI, timed) has a different
    blueprint
  • Library uses a blueprint to implement the lowest
    level blocks

[Figure: blueprint for a QDI N-input M-output pipeline stage, with example 2-input 1-output and 1-input 2-output pipeline stages, each built from C, F, LCD, and RCD blocks]
10
Summary of Characteristics
  • Delay-insensitive timing model
  • Gates and wires can have arbitrary delays
  • 4-phase 1-of-4 handshake
  • Uses 4 wires to send 2 bits
  • Plus an acknowledge wire for flow control
  • Returns to neutral between data transfers
  • Self-shielding
  • Precharge domino logic plus async handshake
  • Low latency, high frequency, robust
  • Automatic power conservation, zero standby power
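
The 4-phase 1-of-4 handshake summarized above can be sketched as a short trace. This is a simplified Python model under stated assumptions: the acknowledge wire is treated as active-high, and the function names are hypothetical rather than taken from Fulcrum's flow.

```python
from typing import List, Tuple

Wires = Tuple[int, ...]
NEUTRAL: Wires = (0, 0, 0, 0)


def encode_1of4(value: int) -> Wires:
    """Raise exactly one of four wires to carry a 2-bit value (0..3)."""
    if not 0 <= value <= 3:
        raise ValueError("a 1-of-4 code carries exactly 2 bits")
    return tuple(1 if i == value else 0 for i in range(4))


def decode_1of4(wires: Wires) -> int:
    """Return the 2-bit value; valid only when exactly one wire is high."""
    if len(wires) != 4 or sum(wires) != 1:
        raise ValueError("not a valid 1-of-4 codeword")
    return wires.index(1)


def four_phase_transfer(value: int) -> List[Tuple[Wires, int]]:
    """List the (data wires, acknowledge) states of one transfer."""
    data = encode_1of4(value)
    return [
        (data, 0),     # 1. sender drives one wire high
        (data, 1),     # 2. receiver raises acknowledge
        (NEUTRAL, 1),  # 3. sender returns the wires to neutral
        (NEUTRAL, 0),  # 4. receiver lowers acknowledge
    ]


if __name__ == "__main__":
    for wires, ack in four_phase_transfer(2):
        print(wires, ack)
```

Only one data wire rises and falls per 2-bit transfer, and every transfer ends in the neutral state, so an idle channel has no switching activity.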

12
Fulcrum Design Flow
  • Hierarchical design flow
  • Executable specifications
  • Formal decomposition
  • Creates design hierarchy
  • Semi-custom synthesis and layout
  • Hierarchical floor planning
  • Automated transistor sizing
  • Semi-automated physical design
  • Supports synchronous and asynchronous designs
  • Hard macro from place and route

13
Managing Design Hierarchy
  • Proprietary object-oriented hardware language
  • Integrated hierarchical design/verification
    language
  • Defines cell specification and implementation
  • Specification: Java or communicating sequential
    processes (CSP)
  • Implementation: multiple forms
  • Sub-cells
  • Sub-cells defined in terms of specification or
    implementation
  • Defines integrated test environment for each cell
  • Enables verification at all pairs of levels
  • Efficiency features
  • Supports refinement of cells and channels

14
Physical Design
  • Layout hierarchy based on design hierarchy
  • Hierarchical floor-planning, semi-automated
  • Large scale hand placement before sizing
  • Long distance channels planned carefully
  • Timing closure by construction
  • Placement drives sizing
  • Can insert extra pipelining on long wires late in
    design
  • Tradeoffs between performance and design time
  • Hand layout where necessary
  • Automated layout where possible
  • Goals
      • Full-custom density and speed within ASIC
        design time

15
Design Verification: System-Level
  • Mission
      • Verify that the executable spec, the written
        spec, and the gate-level model agree
  • Use industry-standard tools and methods
      • Cadence NCSIM and efficient Java-Verilog
        interface
      • Directed random testing
      • Line and functional coverage

[Figure: Test Bench around the Device Under Test, with Configuration Manager, Bus Functional Model, Test Cases, Executable Spec, Traffic Generator and Checker, Gate-level Verilog Model, and Monitor]
16
Design Verification: Unit-Level
[Figure: unit-level test setup with blocks labeled Test Engine, Copy, and Log]
  • Mitered co-simulation for unit-level verification
  • Check correctness of digital model by comparing
    it to golden CSP/Java model
  • Features
  • Framework automated and regressed
  • Checks correctness
  • Checks delay insensitivity and/or throughput and
    latency
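
The mitered comparison described above can be sketched as follows. This is an illustrative Python stand-in, not Fulcrum's framework: the two adder functions are hypothetical placeholders for a golden CSP/Java model and a digital model, and the random stimulus is an assumption.

```python
import random
from typing import Callable, Optional, Tuple

Model = Callable[[int, int], int]


def golden_adder(a: int, b: int) -> int:
    """Stand-in for the golden model: an 8-bit adder."""
    return (a + b) & 0xFF


def digital_adder(a: int, b: int) -> int:
    """Stand-in for the digital model: a bitwise ripple-carry adder."""
    result, carry = 0, 0
    for i in range(8):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result


def miter(golden: Model, dut: Model, trials: int = 1000,
          seed: int = 0) -> Optional[Tuple[int, int]]:
    """Drive both models with the same stimulus.

    Returns the first input pair on which the outputs differ,
    or None if every trial matched.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randrange(256), rng.randrange(256)
        if golden(a, b) != dut(a, b):
            return (a, b)
    return None


if __name__ == "__main__":
    print(miter(golden_adder, digital_adder))
```

Returning the failing stimulus rather than a boolean makes a mismatch easy to replay, which suits a framework that is automated and run as a regression.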

17
Analog Verification: Charge Sharing
[Figure: flow with blocks labeled Synthesis, Charge Sharing Test Generator, and SPICE]
  • SPICE-based charge sharing analysis
  • Test case generation and analysis automated
  • Charge-sharing problems solved in numerous ways
  • Symmetrization
  • Less transistor sharing
  • Delay perturbations

18
Synthesis: Gate Generation / Sizing
  • Automated generation of transistor netlists
  • Dynamic logic generation
  • Transistor sharing
  • Symmetrization
  • Gate-library matching
  • Transistor sizing
  • Path-based sizing to meet amortized unit-delay
    model
  • Micro-architecture feedback
  • Identifies where fanout limits performance

[Figure: flow with blocks labeled Logic Synthesis, Transistor Sizing, and CDL Netlist]
19
Fulcrum QDI vs. Synchronous Flows
  • Saves clock tree design, analysis, optimization,
    and verification
  • No timing closure problems
  • Unexpected long-wire bottlenecks easily solved
    with additional pipeline buffers late in design
    cycle
  • QDI/DI timing model reduces timing analysis
    challenges
  • Fulcrum QDI hierarchical design facilitates
    composability, re-use, and early bug detection
  • Hierarchical floorplanning improves
    predictability of wires
  • Template-based leaf cell design simplifies logic
    design
  • Design reuse reduces criticality of high-level
    synthesis
  • Decomposition methodology amenable to formal
    verification

21
Globally Asynchronous, Locally Synchronous
  • SoC designs have many cores with different clock
    domains
  • Async circuits can interconnect multiple sync
    cores in an SoC design, eliminating global clock
    distribution and simplifying clock domain
    crossing
  • Fulcrum's Nexus is a high speed on-chip
    interconnect
  • 16-port, 36-bit asynchronous crossbar
  • Asynchronous cross-chip channels
  • Async-sync clock domain converters
  • Runs at 1.35GHz in 130nm process

22
Nexus System-on-Chip Interconnect
  • Non-blocking crossbar
  • 16 full-duplex ports
  • Flow control extends through the crossbar
  • Full speed arbitration
  • Arbitrary-length bursts
  • Bridges clock domains
  • Scales in bit width and ports
  • Process portable

[Figure: Generic Nexus Example. Legend: synchronous IP block, asynchronous IP block, pipelined repeater, clock domain converter]

23
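Nexus Burst Format
Incoming (from source): control "To" (target module), then data words D1, D2, D3, ..., DN
Outgoing (to target): control "From" (source module), then data words D1, D2, D3, ..., DN
Data: 36 bits per word
Tail: 1 bit per word (0, 0, 0, ..., 1), set on the last word DN
Control: 4 bits
Arbitrary-length source-routed bursts provide
flexibility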
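
The burst format on this slide can be modeled in a few lines. The sketch below is one reading of the figure, not a documented interface: it assumes the crossbar selects the output port from the 4-bit "To" field and substitutes the source port number as "From" on the way out. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

DATA_BITS = 36  # data width per word, from the slide
PORTS = 16      # a 4-bit control field addresses 16 ports


@dataclass(frozen=True)
class Word:
    data: int  # 36-bit payload
    tail: int  # 1 on the last word of the burst, otherwise 0


@dataclass(frozen=True)
class Burst:
    control: int       # "To" port on input, "From" port on output
    words: List[Word]


def make_burst(target: int, payload: List[int]) -> Burst:
    """Build an incoming burst addressed to `target`."""
    if not 0 <= target < PORTS:
        raise ValueError("target must fit in the 4-bit control field")
    if not payload:
        raise ValueError("a burst needs at least one data word")
    if any(not 0 <= d < 2 ** DATA_BITS for d in payload):
        raise ValueError("data words must fit in 36 bits")
    last = len(payload) - 1
    return Burst(target, [Word(d, int(i == last)) for i, d in enumerate(payload)])


def route(burst: Burst, source: int) -> Tuple[int, Burst]:
    """Return the output port and the outgoing burst tagged with its source."""
    return burst.control, Burst(source, burst.words)


if __name__ == "__main__":
    incoming = make_burst(target=5, payload=[0x10, 0x20, 0x30])
    port, outgoing = route(incoming, source=2)
    print(port, outgoing.control, [w.tail for w in outgoing.words])
```

Marking the end of a burst with a tail bit, instead of a length field, is what lets bursts be of arbitrary length.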
24
Sync-to-Async Conversion
  • Synchronous Request / Grant FIFO protocol
  • Data transferred if request and grant both high
    on rising edge of clock
  • Compensates for any skew on asynchronous side
  • Low latency: 1/2 to 3/2 clock cycles at A2S

[Figure: S2A and A2S converters, each between an Asynchronous Datapath and a Synchronous Datapath, with Request, Grant, A, and clock signals]
Seamlessly Bridges Different Clock Domains
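
The synchronous request/grant rule on this slide (data moves only when request and grant are both high on a rising clock edge) can be sketched cycle by cycle. This Python model is a simplification under stated assumptions: each list element is the signal value sampled at one rising edge, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence


def request_grant_transfers(request: Sequence[int],
                            grant: Sequence[int],
                            data: Sequence[int]) -> List[Optional[int]]:
    """For each rising clock edge, return the word transferred, or None."""
    sent = 0
    transfers: List[Optional[int]] = []
    for req, gnt in zip(request, grant):
        if req and gnt:
            if sent >= len(data):
                raise ValueError("request asserted with no data left to send")
            transfers.append(data[sent])  # both high: one word moves
            sent += 1
        else:
            transfers.append(None)        # either side stalls this cycle
    return transfers


if __name__ == "__main__":
    request = [1, 1, 0, 1, 1]
    grant = [0, 1, 1, 1, 0]
    print(request_grant_transfers(request, grant, [0xA, 0xB]))
```

Either side can stall the transfer for a cycle by holding its signal low, which is how the synchronous interface tolerates variable timing on the asynchronous side.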
25
Arbitration and Ordering
  • Unrelated sender/receiver links are independent
  • Bursts sent from multiple input ports to the same
    output port are serviced fairly by built-in
    arbitration circuitry
  • Bursts from A to B remain ordered
  • Producer-consumer and global-store-ordering
    satisfied
  • A sends X to B, A notifies C, C can read X from B
  • A writes X to B, A writes Y to C, if D reads Y
    from C, it can read X from B
  • Split transactions implement loads
  • Load request and load completion bursts
  • Load completions returned out-of-order

Can tunnel common bus and cache coherence
protocols
26
Example Load/Store Systems
  • Option 1: Pure Master/Target Ports
      • Masters send Requests to Targets, which may
        return Completions
      • Each port must either be a Master or a Target
        so that Completions are never blocked by
        Requests
      • Devices which need to be both Masters and
        Targets are given two separate full-duplex
        ports
      • Could use two separate Nexus crossbars
  • Option 2: Peers
      • Modules which are both Masters and Targets
        implement an internal buffer to hold Requests
        so that Completions can bypass them
      • All Masters or Peers restrict number of
        outstanding Requests to avoid overflowing
        Request buffers

27
Example Switch Fabric
  • Each module maintains input/output queues for
    traffic to/from each other module
  • Data is sent from an input queue to an output
    queue over Nexus as a series of short bursts
  • Flow control credits for each output queue are
    sent backward
  • Eliminates head-of-line blocking
  • Segmentation, buffering, and overspeed optimize
    performance during congestion
  • Used in PivotPoint, Fulcrum's first chip product
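
The credit scheme described above can be sketched for a single output queue. This is a generic credit-based flow control model in Python, offered as an illustration and not as PivotPoint's implementation. The class names and the queue capacity are hypothetical.

```python
from collections import deque


class OutputQueue:
    """Receiving-side queue with a fixed number of burst slots."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: deque = deque()

    def accept(self, burst) -> None:
        if len(self.items) >= self.capacity:
            raise OverflowError("sender exceeded its credits")
        self.items.append(burst)

    def drain(self):
        """Remove the oldest burst; the caller then returns one credit."""
        return self.items.popleft()


class InputPort:
    """Sending side: holds one credit per free slot in the output queue."""

    def __init__(self, credits: int) -> None:
        self.credits = credits

    def try_send(self, queue: OutputQueue, burst) -> bool:
        if self.credits == 0:
            return False  # hold the burst in the input queue
        self.credits -= 1
        queue.accept(burst)
        return True

    def credit_returned(self) -> None:
        self.credits += 1


if __name__ == "__main__":
    queue, port = OutputQueue(capacity=2), InputPort(credits=2)
    print([port.try_send(queue, b) for b in ("a", "b", "c")])
    queue.drain()
    port.credit_returned()
    print(port.try_send(queue, "c"))
```

Because the sender never transmits without a credit, the output queue cannot overflow, and a burst that must wait stays in its own input queue without stalling traffic bound for other modules.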

28
Nexus Silicon Validation
TSMC 130nm LV Results
[Figure: block diagram of Nexus Validation Chip]

Process  Voltage (V)  Frequency (GHz)  Latency (ns)  Energy (pJ/bit)
Low-K    1.2          1.35             2.0           10.4
Low-K    1.0          1.11             2.4           7.0
FSG      1.2          1.10             2.5           11.2
FSG      1.0          0.87             3.1           7.6

Crossbar area: 1.75mm²
Total interconnect area: 4.15mm²
Peak cross-section bandwidth: 778Gb/s
[Figure: plot of Nexus crossbar]
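
The quoted peak bandwidth can be cross-checked against figures given elsewhere in the deck. The slides do not state how the number was derived, so the following Python sketch assumes the simplest reading: all 16 ports move one 36-bit word per cycle at the 1.35GHz Low-K, 1.2V operating point. Under that assumption the product is about 777.6 Gb/s, consistent with the quoted 778Gb/s.

```python
PORTS = 16              # crossbar ports, from the Nexus overview
BITS_PER_WORD = 36      # data width per port
FREQUENCY_GHZ = 1.35    # Low-K process at 1.2V


def peak_bandwidth_gbps(ports: int, bits: int, freq_ghz: float) -> float:
    """Aggregate bandwidth if every port transfers one word per cycle."""
    return ports * bits * freq_ghz


if __name__ == "__main__":
    total = peak_bandwidth_gbps(PORTS, BITS_PER_WORD, FREQUENCY_GHZ)
    print(f"{total:.1f} Gb/s")
```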
29
Nexus Summary
  • Nexus is an asynchronous crossbar interconnect
    designed to connect up to 16 synchronous modules
    in an SoC
  • Nexus can be used to implement load/store systems
    as well as switch fabrics
  • Systems using Nexus can be tested with standard
    equipment
  • Nexus runs up to 1.35GHz in TSMC 130nm
  • Asynchronous interconnect is now viable for very
    high performance SoC designs

31
PivotPoint Blade Interconnect
  • Large-scale SoC design
  • >32.5M transistors (83% async)
  • 14 separate clock domains
  • Includes key Fulcrum IP
  • Nexus Terabit Crossbar
  • Quad-port 600MHz async SRAM
  • Operates at over 1GHz
  • Delivers 192Gbps of non-blocking switching
    capacity
  • Testable via standard tools
  • JTAG scan chain
  • Activity-based power scaling
  • 9-month project

World's first high-performance clockless chip
[Figure: Generic System Blade with CPU/NPU/ASIC/FPGA devices, SPI-4 links, X8, I/O (Phy/MAC), and Backplane Interface]
32
PivotPoint Leverages Nexus
  • Flexible architecture
  • 6 duplex SPI-4.2 interfaces
  • All paths are independent
  • Optimized for performance
  • Up to 14.4Gbps per interface
  • Up to 32Gbps per Nexus port
  • Full-rate buffer memories
  • Lossless flow control
  • Easily configurable
  • 16-bit CPU interface
  • JTAG support
  • Modest size and power
  • 2 Watt per active interface
  • 1036 ball package

[Figure: PivotPoint block diagram with twelve SPI-4 blocks, each with a 16KB Buffer, six Route Tables, and a Control Bus (Serial Tree); 3ns latency]
A true SoC GALS design
33
Testing: A Multi-Dimensional Approach
  • DFT
      • Synchronous scan chains for synchronous logic
      • Asynchronous scan-chain-like structures for
        asynchronous logic and sync-async interfaces
      • Standardized JTAG interface for testing
  • Fault-Grading
      • Verilog fault-model for domino logic
      • Industry-standard fault grading tools
  • BIST
      • Use Nexus for observability in Nexus-based SoCs
      • RAM self test and repair

34
Differentiating Through Technology
Leveraging our clockless technology foundation
Differentiated Product Offering
  • High performance (latency, capacity)
  • Power efficient (linear scaling)
  • Robust in operation
Unique IP Blocks
  • Unmatched performance
  • Extremely robust (power and temperature)
  • Easy to integrate (benign behavior)
Clockless Technology Foundation
  • Silicon proven and customer validated
  • Mature CAD flow (integrated with commercial tools)
  • Robust cell library (thousands of unique cells)
35
Thank You!
Peter A. Beerel, PhD
VP Strategic CAD
pabeerel_at_fulcrummicro.com
818.871.8100
www.fulcrummicro.com
26775 Malibu Hills Road, Suite 200
Calabasas Hills, CA 91301

"A group of engineers wants to turn the
microprocessor world on its head by doing the
unthinkable: tossing out the clock and letting
the signals move about unencumbered. For those
designers, inspired by research conducted at
Caltech, clocks are for wimps."
Anthony Cataldo, EE Times