Title: Asynchronous Circuit Compilation
1Asynchronous Circuit Compilation
- Dr. Doug Edwards
- doug_at_cs.man.ac.uk
2Overview
- Asynchronous circuits
- Advantages
- Asynchronous Design Paradigms
- Syntax Directed Compilation
- Handshake Circuits
- Balsa
- Datapath Compilation
- Design Example - DMA Controller
3Asynchronous (self-timed) Basics
- Synchronous circuits
- a global clock separates system states
- A time domain view of system activity.
- Asynchronous circuits
- input changes separate system states
- A sequence or trace domain view of system
activity.
4Why Asynchronous?
- Low Power
- data-driven: power is only used to do useful work
- zero power when idle with instant restart
- Low EMI
- In a clocked circuit, all noise is correlated
- Async circuits have distributed switching
activity leading to uncorrelated EMI
5Why Asynchronous?
- No clock distribution problems
- Composability/Modularity
- facilitates IP reuse
- Average Case Performance
- exploit the fact that worst-case often occurs
infrequently
6Timing Models
- Delay Insensitive (DI)
- Delays in circuits and wires are arbitrary
- Quasi-Delay Insensitive (QDI)
- Similar to DI but assuming isochronic forks
- Speed Independent (SI)
- Wires have no delays, arbitrary gate delays
- Bounded Delay
- Single-sided timing constraints
7Asynchronous Design Paradigms
- AFSMs - for fast controllers etc
- Traditionally hard
- hazards, races, state assignment problems
- Research has led to new techniques
- STG/Petri net based SI circuits
- Burst-Mode circuits
- Macromodule-like for larger systems
- micropipeline approach, handshake circuits
8Asynchronous Control
- With no clock, some other means is required to
co-ordinate control flow
- Use a request/acknowledge handshake
[Figure: Sender with Req and Ack signals]
9Signalling Protocols
- req and ack are abstractions
- layer a signalling protocol on top of them
- Two common protocols
- 2-phase (transition signalling, NRZ)
- 4-phase (Return-to-Zero signalling)
10Data Validity Models
- Self Timed
- The validity of the data is encoded within the
data itself
- redundant coding, e.g. Dual Rail: each data bit requires two wires.
00 -> no data, 01 -> 0, 10 -> 1
- Bundled Data approach
- conventional datapath
- validity is assured by imposing timing
constraints.
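The dual-rail coding listed above can be sketched in a few lines of Python. This is only an illustration of the 00/01/10 code from the slide; the function names are hypothetical, and treating "11" as illegal is an assumption that the slide does not state.

```python
# Dual-rail code from the slide: each bit is carried on a pair of wires.
# "00" = no data, "01" = logic 0, "10" = logic 1.
# Assumption: "11" never occurs and is rejected as illegal.
ENCODE = {0: "01", 1: "10"}
DECODE = {"01": 0, "10": 1}

def encode(bits):
    """Encode a list of bits as a list of dual-rail wire pairs."""
    return [ENCODE[b] for b in bits]

def is_complete(pairs):
    """Completion detection: every bit position carries a valid codeword."""
    return all(p in DECODE for p in pairs)

def decode(pairs):
    """Decode a complete dual-rail word; reject partial or illegal words."""
    if not is_complete(pairs):
        raise ValueError("word is not (yet) valid")
    return [DECODE[p] for p in pairs]

word = encode([1, 0, 1])
print(word)                        # ['10', '01', '10']
print(is_complete(["10", "00"]))   # False: the second bit has not arrived
print(decode(word))                # [1, 0, 1]
```

The point of the sketch is that validity is read off the data wires themselves, so no separate timing assumption is needed, at the price of two wires per bit and a completion check over the whole word.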
11 2-phase Protocol
[Figure: timing diagram of Req and Ack showing two transactions, with the data marked valid during each transaction]
12 4-phase protocol
- Signals are returned to initial state after each
transaction
- Several possible interleavings of the signal
transitions
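To make the difference between the two protocols concrete, the following Python sketch lists the wire transitions for a number of transactions. It is an illustration only: the function names are hypothetical, and the 4-phase ordering shown (req up, ack up, req down, ack down) is just one of the possible interleavings.

```python
def two_phase(n):
    """Wire transitions for n transactions with 2-phase (transition) signalling."""
    req = ack = 0
    events = []
    for _ in range(n):
        req ^= 1                      # any transition on req starts a transaction
        events.append(("req", req))
        ack ^= 1                      # the matching transition on ack ends it
        events.append(("ack", ack))
    return events

def four_phase(n):
    """Wire transitions for n transactions with 4-phase (return-to-zero) signalling."""
    events = []
    for _ in range(n):
        # Assumed interleaving: both wires return to zero before the next transaction.
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events

print(two_phase(2))                           # [('req', 1), ('ack', 1), ('req', 0), ('ack', 0)]
print(len(two_phase(2)), len(four_phase(2)))  # 4 8
```

Two transitions per transaction against four is the conceptual simplicity of 2-phase signalling; the sketch says nothing about circuit speed or complexity.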
13Comparison of Approaches
- 2-phase/4-phase
- 2-phase conceptually simpler (once an event
mind-set is adopted)
- 2-phase circuits slower and more complex
- think 2-phase, build 4-phase
- Bundled-Data/Dual-rail
- Current orthodoxy: bundled data is faster, lower
power, smaller area, with tolerancing task no
worse than for a clocked design
14Current Approach
- QDI control
- Bounded-Delay (bundled-data) datapath
- 4-phase signalling
- Amulet3i
15Asynchronous HDLs
- Conventional programming languages lack 3
necessary constructs
- communication
- parallelism/concurrency
- sharing (of hardware)
- Conventional HDLs lack adequate
- fine-grain concurrency
- channel based communication primitives
16Asynchronous HDLs 2
- Tangram, Balsa
- CSP based data types
- based on underlying formal semantics
- guarantees correct composition rules
- easier composition than in sync circuits???
- transparent compilation
- each production rule in the language translates
to an intermediate handshake circuit
- allows designer to infer circuit costs and
performance from the program
17Handshake Circuits - 1
- Circuits communicate along channels
- Channels connect ports at circuit interface
- Ports have
- Type
- Direction
- Sense
18Handshake Circuits - 2
- Port type determines the number of data wires
- no data wires: control-only port!
- Port direction is input, output or control only
- Port sense
- Active initiates transfers
- Passive responds to requests
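The three port attributes can be captured as a small data structure. The Python sketch below is a hypothetical model, not part of any Balsa tool: the class names and the `can_connect` rule (a channel joins one active and one passive port of equal width, output to input) are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Direction(Enum):
    INPUT = "input"
    OUTPUT = "output"
    CONTROL_ONLY = "control only"

class Sense(Enum):
    ACTIVE = "active"      # initiates transfers
    PASSIVE = "passive"    # responds to requests

@dataclass(frozen=True)
class Port:
    name: str
    width: int             # number of data wires; 0 for a control-only port
    direction: Direction
    sense: Sense

    def __post_init__(self):
        if (self.width == 0) != (self.direction is Direction.CONTROL_ONLY):
            raise ValueError("exactly the control-only ports have zero data wires")

def can_connect(a: Port, b: Port) -> bool:
    """Assumed rule: equal width, opposite sense, compatible directions."""
    if a.width != b.width or a.sense == b.sense:
        return False
    if a.direction is Direction.CONTROL_ONLY:
        return b.direction is Direction.CONTROL_ONLY
    return {a.direction, b.direction} == {Direction.INPUT, Direction.OUTPUT}

out_port = Port("o", 16, Direction.OUTPUT, Sense.ACTIVE)
in_port = Port("i", 16, Direction.INPUT, Sense.PASSIVE)
print(can_connect(out_port, in_port))   # True
```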
19Micropipeline-Style Circuits
Push Circuits: Circuit waits for data
[Figure: circuit (cct) with a passive input and an active output; each port has req, ack and data]
20Micropipeline-Style Circuits
Push Circuits: data arrives
21Micropipeline-Style Circuits
Push Circuits: data validity signalled
22Micropipeline-Style Circuits
Push Circuits: circuit accepts data
23Micropipeline-Style Circuits
Push Circuits: circuit signals data taken
24Micropipeline-Style Circuits
Push Circuits: Circuit outputs data
25Micropipeline-Style Circuits
Push Circuits: Circuit signals validity
26Micropipeline-Style Circuits
Push Circuits: receiver takes data
27Micropipeline-Style Circuits
- 4-phase protocol not detailed
- Previous circuit decoupled input and output
- implies a latch inside the handshake circuit
- An alternative is for the input handshake to
enclose the output handshake
28Enclosed Handshake
Push Circuits: data arrives
[Figure: circuit (cct) with input and output ports; each port has req, ack and data]
29Enclosed Handshake
Push Circuits: data validity signalled
30Enclosed Handshake
Push Circuits: circuit accepts data
31Enclosed Handshake
Push Circuits: Circuit outputs data
32Enclosed Handshake
Push Circuits: Circuit signals validity
33Enclosed Handshake
Push Circuits: receiver takes data
34Enclosed Handshake
Push Circuits: input handshake completes. No latch required
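The two event orderings walked through on the preceding slides can be compared with a small checker. This Python sketch is illustrative only: the event names (`in.req`, `out.ack`, ...) and the helper `encloses` are hypothetical, and each handshake is abstracted to a single request and a single acknowledge.

```python
def encloses(trace, outer, inner):
    """True if every event on channel `inner` lies between outer.req and outer.ack."""
    start = trace.index(outer + ".req")
    end = trace.index(outer + ".ack")
    inner_pos = [k for k, e in enumerate(trace) if e.startswith(inner + ".")]
    return bool(inner_pos) and all(start < k < end for k in inner_pos)

# Decoupled: the input handshake completes first, so the data must be latched.
decoupled = ["in.req", "in.ack", "out.req", "out.ack"]

# Enclosed: the input is acknowledged only after the receiver has taken the data,
# so the sender keeps the data valid throughout and no latch is required.
enclosed = ["in.req", "out.req", "out.ack", "in.ack"]

print(encloses(decoupled, "in", "out"))   # False
print(encloses(enclosed, "in", "out"))    # True
```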
35Tangram Style Circuits
Pull Circuits: active ported circuits / control driven
[Figure: circuit (cct) with an active input port; each port has req, ack and data]
36Tangram Style Circuits
Pull Circuits: Circuit demands data
37Tangram Style Circuits
Pull Circuits: data is sent on demand
38Tangram Style Circuits
Pull Circuits: data is accepted and can then be released
39Balsa
- Language for synthesising large async circuits
and systems
- CSP/OCCAM background
- Tangram-like
- based on Tangram compilation function
- compiles to a small (but expanding) set of
handshake circuits
- origins: ESPRIT EXACT project
40Balsa Language Features
- Data types based on sequence of bits
- Arrays and records are bit-based
- Element extraction is by array slicing
- Strict data typing
- Structural iteration
- Arrayed channels
- Parameterised recursive functions
41Balsa Language Features
- Enclosed selection semantics
- Allows passive ported circuits
- Allows push (micropipeline-style) circuits
- Allows unbuffered (latch-free) circuits
- Can be considered a restricted form of Burns
probe construct.
42Balsa Source
43Example Single Place Buffer
import [balsa.types.basic]            -- library mechanism
public                                -- visibility
type word is 16 bits                  -- type declaration
procedure buffer (input i : word;     -- procedure definition, channel declarations
                  output o : word) is
  local variable x : word             -- implies latch
begin
  loop                                -- repeat forever
    i -> x                            -- Input communication: read input channel into local variable x
  ;                                   -- sequential operation
    o <- x                            -- Output communication: output local variable x to output channel
  end
end
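The behaviour of the single-place buffer (repeat forever: read the input channel into x, then write x to the output channel) can be mimicked with a Python generator. This is a behavioural sketch under stated assumptions, not a model of the compiled circuit: the `trace` argument is a hypothetical addition for inspection, and the loop terminates only because the test input is finite.

```python
def buffer(source, trace):
    """Model of: loop i -> x ; o <- x end (one value is held in x at a time)."""
    for value in source:                  # 'loop'; ends only because the input is finite
        x = value                         # i -> x : input communication
        trace.append(f"i -> x ({x})")
        trace.append(f"o <- x ({x})")     # o <- x : output communication
        yield x

trace = []
print(list(buffer([3, 1, 4], trace)))     # [3, 1, 4]
print(trace[:2])                          # ['i -> x (3)', 'o <- x (3)']
```

The trace strictly alternates input and output, which is the sequencing that the `;` operator expresses.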
44Buffer Handshake Circuit
Single-place buffer
[Figure: handshake circuit with an activation channel, a repeater, a sequencer, two transferrers (T), the variable x, and the channels i and o]
45Buffer Handshake Circuit
Single-place buffer: repeater is activated
46Buffer Handshake Circuit
Single-place buffer: Sequencer handshakes to left transferrer
47Buffer Handshake Circuit
Single-place buffer: transferrer requests data from environment
48Buffer Handshake Circuit
Single-place buffer: data transferred to variable x
49Buffer Handshake Circuit
Single-place buffer: variable handshake completes
50Buffer Handshake Circuit
Single-place buffer: transferrer handshake completes to environment
51Buffer Handshake Circuit
Single-place buffer: transferrer handshake completes
52Buffer Handshake Circuit
Single-place buffer: Sequencer handshakes to right transferrer
53Buffer Handshake Circuit
Single-place buffer: Transferrer reads variable
54Buffer Handshake Circuit
Single-place buffer: Transferrer outputs to environment
55Buffer Handshake Circuit
Single-place buffer: handshakes complete
56Buffer Handshake Circuit
Single-place buffer: Sequencer completes its input handshake
57Buffer Handshake Circuit
Single-place buffer: repeater initiates another transfer, etc
58Example Single Place Buffer
import [balsa.types.basic]
public
type word is 16 bits
procedure buffer (input i : word; output o : word) is
  local variable x : word
begin
  loop
    i -> x    -- Input communication
  ;
    o <- x    -- Output communication
  end
end
59Example 2-place buffer
import [balsa.types.basic]
import [buffer1a]                     -- reuse component
public
type word is 16 bits
procedure buffer2c (input i : word; output o : word) is
  local channel c : word              -- internal channel connects two 1-place buffers
begin
  buffer (i, c) ||                    -- parallel composition
  buffer (c, o)                       -- buffers connected by common signal name
end
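The parallel composition of two single-place buffers over an internal channel can be imitated with Python threads. This is a rough behavioural sketch: the `Channel` class, the `STOP` marker and the `run` harness are hypothetical helpers introduced here so the example terminates, and the unbuffered channel is built from `queue.Queue` with `join`/`task_done` standing in for the acknowledge.

```python
import queue
import threading

class Channel:
    """Unbuffered channel: send() returns only after the receiver has taken the value."""
    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def send(self, value):
        self._q.put(value)      # request, with data
        self._q.join()          # wait for the receiver's acknowledge

    def receive(self):
        value = self._q.get()
        self._q.task_done()     # acknowledge
        return value

STOP = object()  # hypothetical end-of-test marker; the Balsa loop never terminates

def buffer(i, o):
    while True:
        x = i.receive()         # i -> x
        o.send(x)               # o <- x
        if x is STOP:
            return

def buffer2(i, o):
    c = Channel()               # local channel c
    stages = [threading.Thread(target=buffer, args=(i, c)),
              threading.Thread(target=buffer, args=(c, o))]
    for t in stages:            # both buffers run in parallel
        t.start()
    return stages

def run(values):
    i, o = Channel(), Channel()
    stages = buffer2(i, o)
    feeder = threading.Thread(target=lambda: [i.send(v) for v in values + [STOP]])
    feeder.start()
    out = []
    while (v := o.receive()) is not STOP:
        out.append(v)
    for t in stages + [feeder]:
        t.join()
    return out

print(run([1, 2, 3]))           # [1, 2, 3]
```

Because each stage holds one value, the first buffer can accept a new input while the second is still waiting to deliver the previous one.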
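60 2-place Buffer Handshake Circuit
61 2-place Buffer Handshake Circuit
[Figure: 2-place buffer handshake circuit with a par component, a passivator on channel c, the channels i and o, two variables x and four transferrers (T)]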
62Peephole Optimisation
- Composition of handshake circuits leads to
inefficiencies at circuit boundaries
- Straightforward peephole optimizations
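63 2-place Buffer Handshake Circuit
[Figure: 2-place buffer handshake circuit with a par component, a passivator on channel c, the channels i and o, two variables x and four transferrers (T)]
64Optimized 2-place Buffer Circuit
[Figure: optimized circuit with a control-only channel; labels i, x, x and two transferrers (T)]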
65The Repeater
- Formal Definition
- REP(a°, b•) = (a° : #[b•])
• denotes active port
: denotes handshake enclosure
#[...] denotes repeat
° denotes passive port
66The Repeater
- Formal Definition
- REP(a°, b•) = (a° : #[b•])
- = (a°↑ : #[b•↑ ; b•↓])
- = (ar↑ ; #[br↑ ; ba↑ ; br↓ ; ba↓])
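Read as wire events, the repeater is activated once on its passive port a and then handshakes repeatedly on its active port b. The Python generator below is a sketch of that reading; the event names are hypothetical, and it assumes a 4-phase handshake on b and that the handshake on a is never acknowledged.

```python
from itertools import islice

def repeater():
    """Yield the wire-level event trace of a repeater (passive port a, active port b)."""
    yield "ar up"            # the environment activates the repeater
    while True:              # the handshake on a encloses an endless loop
        yield "br up"        # request on b
        yield "ba up"        # acknowledge from b
        yield "br down"      # return to zero
        yield "ba down"

print(list(islice(repeater(), 5)))
# ['ar up', 'br up', 'ba up', 'br down', 'ba down']
```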
67The Transferrer
- Several Implementations
- simplest: wire-only
[Figure: wire-only transferrer with signals ar, aa, br, ba, cr, ca and n-bit data]
68Balsa Toolkit -1
- balsa-c
- The compiler for the language
- breeze2dot
- Produces a PostScript plot of the generated
handshake circuits
- breezecost
- Reports the cost of the compiled circuit in
arbitrary units
69Balsa Toolkit -2
- breeze2lard
- The interface to the LARD simulation environment.
- balsa source is translated to LARD
- simple test harness is generated
- balsa-md
- An automatic makefile generation facility.
- balsa-mgr
- A GUI project manager
70Mod-16 Counter (all even)
71Bundled-Data Datapaths
- Problems
- random standard cell layout
- mixed control and datapath
- timing analysis required
- robustness of design reduced
- Possible Solutions
- DI codes
- hybrid bundled/DI
- simpler timing analysis
72DI Codes
- Dual Rail (used in 1st Tangram system)
- Can use standard cell approach without timing
analysis
- no need to distinguish between control and data
- abandoned in favour of bundled-data
- area cost in extra wires
- area and time cost in completion detection
- Tangram/Balsa generates push-pull pipelines with
expensive synchronization
73Generic Pipeline
- Passivators join compiled procedure
passivator
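74Passivator Implementation
[Figure: passivator implementations: a C-element (C) with signals ar, br, ba, aa and n-bit data, and an n-wide C-gate over data bits d0, d1 ... dn-1 (n bits wide) with br, ba, aa]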
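The C-gate named in the passivator figure is the Muller C-element: its output rises only when all inputs are high, falls only when all are low, and otherwise holds its value. The Python class below is a minimal behavioural sketch of that element alone; the class name and interface are hypothetical, and how the element is wired into the passivator is not modelled.

```python
class CElement:
    """Muller C-element with n inputs: follows the inputs when they agree, else holds."""

    def __init__(self, n_inputs, initial=0):
        self.n = n_inputs
        self.out = initial

    def update(self, inputs):
        if len(inputs) != self.n:
            raise ValueError("wrong number of inputs")
        if all(inputs):
            self.out = 1          # every input high: output rises
        elif not any(inputs):
            self.out = 0          # every input low: output falls
        return self.out           # inputs disagree: previous value is kept

c = CElement(2)
print([c.update(v) for v in ([1, 0], [1, 1], [0, 1], [0, 0])])   # [0, 1, 1, 0]
```

Used on two request wires, the output changes only once both sides have arrived, which is the synchronisation a passivator needs between two active parties.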
75DI Code Synchronizations
- Expensive
- need C-element synchronisation tree
- A partial solution (not always possible/desirable)
is:
- transform to push-style datapath
- (not possible in Tangram, only Balsa)
76Push Pipeline
Passive input port
connector (wires-only)
77Hybrid Solutions
- Use DI coding within bundled datapath framework
- e.g. use dual-rail carry signals within a
conventional adder
- early completion easily detected
- Average-case performance
- Only applicable to a few datapath operations
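Why early completion gives average-case performance in an adder can be illustrated numerically. The Python sketch below uses the longest run of carry-propagate positions (bits where the operands differ) as a rough proxy for how far a carry may have to ripple; this proxy, the function names and the random-operand experiment are assumptions for illustration, not a model of any particular adder circuit.

```python
import random

def longest_propagate_run(a, b, n):
    """Longest run of consecutive bit positions where a carry would be propagated."""
    p = a ^ b                       # propagate where exactly one operand bit is 1
    best = run = 0
    for k in range(n):
        run = run + 1 if (p >> k) & 1 else 0
        best = max(best, run)
    return best

def average_run(n, trials=10_000, seed=0):
    """Mean longest propagate run over random n-bit operand pairs."""
    rng = random.Random(seed)
    total = sum(longest_propagate_run(rng.getrandbits(n), rng.getrandbits(n), n)
                for _ in range(trials))
    return total / trials

print(longest_propagate_run(0xFFFF, 0x0001, 16))   # 15: near worst case, carry ripples all the way
print(longest_propagate_run(0x5555, 0x5555, 16))   # 0: every carry is resolved locally
print(average_run(32))                              # typically far below the worst case of 32
```

A bundled-data adder must always allow for the worst case, whereas completion detected on dual-rail carries lets the typical, much shorter, case finish early.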
78Simpler Timing Analysis
- Separate control and datapath
- generate regular, compiled, datapath
- area improvement over standard cell (because of
regular layout)
- generate matched delay paths (c.f. self-timed
PLAs)
- must be able to recognize datapath
- difficult: control often contains datapath-like
elements.
- e.g. start at variables and work backwards ...
79Datapath meets Control
- Example Balsa case statement
[Figure labels: 1-hot encoding; data n bits wide; true/complement lines (dual-rail expansion)]
80Case Component
- input from datapath
- dual-rail simplifies internal logic
- expansions parameterisable
- encode component is similar
- opposite of case with true/false expansion
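The decoding performed by a case component, from an n-bit guard to 1-hot select lines by way of true/complement lines, can be sketched as follows. The function names are hypothetical and the code models only the logic function, not the handshaking around it.

```python
def dual_rail(value, n):
    """Expand an n-bit value into (true, complement) line pairs, LSB first."""
    return [((value >> k) & 1, 1 - ((value >> k) & 1)) for k in range(n)]

def case_decode(value, n):
    """Return the 2**n one-hot select lines for an n-bit case guard."""
    rails = dual_rail(value, n)
    outputs = []
    for sel in range(2 ** n):
        # Each output ANDs, per bit, the true line where sel has a 1
        # and the complement line where sel has a 0.
        lines = [rails[k][0] if (sel >> k) & 1 else rails[k][1] for k in range(n)]
        outputs.append(int(all(lines)))
    return outputs

def encode(one_hot):
    """Opposite direction: one-hot lines back to a binary value."""
    if sum(one_hot) != 1:
        raise ValueError("exactly one line must be active")
    return one_hot.index(1)

print(case_decode(2, 2))           # [0, 0, 1, 0]
print(encode(case_decode(2, 2)))   # 2
```

With both polarities of every bit already available, each select line is a plain AND with no inverters, which is one way to read the remark that dual-rail simplifies the internal logic.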
81Simpler Timing Analysis
- Tool support required
- use existing (non-Balsa) tools if possible
- automatically add matched paths/delays to
synthesised datapaths
- Design own cells where appropriate
- e.g. hybrid stages
82Future Work
- Provide support for DI, hybrid and
datapath-compiled datapaths
- even with datapath compilation, some datapath
would still be standard cell
- e.g. instruction decoder (control heavy)
- datapath in control
- cost of connecting separate blocks in layout
- Test Design required (datapath heavy)
83Tool Enhancement
- balsa-c
- support for attribution to select compilation
mechanisms/optimisation schemes
- breeze2lard
- new models
- balsa-netlist
- new tech-mapping descriptions
- interface to datapath compilers
84AMULET3i
- Asynchronous macrocell
- ARM compatible processor core
- Full custom RAM
- Compiled ROM
- Balsa compiled DMA controller
- Test I/F, synchronous and off-chip bus bridges
- Synchronous peripherals
- Designed by commercial partner ...
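85AMULET3 System
[Figure: block diagram with CPU / RAM, ROM, DMAC, MARBLE, Sync bridge, SOCB and three peripherals (Periph1)]
86DMA Local RAM Access
[Figure: the same block diagram: CPU / RAM, ROM, DMAC, MARBLE, Sync bridge, SOCB and three peripherals (Periph1)]
87DMA Peripheral Accesses
[Figure: the same block diagram with DMA requests: CPU / RAM, ROM, DMAC, MARBLE, Sync bridge, SOCB and three peripherals (Periph1)]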
88Requirements / Specification
- 16 clients, 32 channels
- 3 channel types
- complicated register structure
- Programmable client → channel, 1 → many mapping
- Support synchronous requests
- Transfers mostly between synchronous clients
89Controller Structure
90Two Controller Descriptions
- Sequential (previous slides)
- Very simple control flow
- Requires two passes through register bank
- Slow! Only memory decoupling helps
- Parallel (next slides)
- Decouple TE actions from memory R/W with a new
unit: Transfer Interface
- Interrupt the register bank on end of transfer
91Parallel Design
92The Design
- 919 lines of Balsa describing register bank
control, TE and TI.
- Custom register banks and Synchronous Peripheral
Interface
- Miscellaneous glue standard cells
- Register bank controllers
- MARBLE interfaces
- Compass Design Automation CAD
93Implementation Technology
- 0.35 µm, 3LM CMOS
- Standard cells from ARM Ltd.
- Locally designed complex gates and asynchronous
elements/gates.
- Automated standard cell P&R
- Only essential and simple gate level
optimisation (by hand)
94Design Partitioning
Marble BUS outside of DMA controller
95Design Partitioning
Balsa synthesised standard cells
96Design Partitioning
Custom regular layout
97Design Partitioning
Hand designed standard cells
98DMA Controller Floor-Plan