Title: Synchronous Latency Insensitive Design
1Synchronous Latency Insensitive Design
- Christer Svensson and Anders Edman
- Linköping University
- With contributions by Behzad Mesgarzadeh and Peter Caputa
2Outline
- Introduction
- A comment on wires
- Architectural view of future systems
- Synchronous Latency Insensitive Design
- Multiple clocks
- Unmatched buses
- Conclusion
3Introduction
Decreased feature sizes and increased chip sizes result in:
- More devices per chip (larger complexity)
- Increased wire delays (which prevent synchronous design)
- Increased clock rates
Wire delay $\propto L^2/s^2$, gate delay $\propto s^a$, where $L$ is the wire length, $s$ is the feature size, and $1 < a < 2$.
4Introduction
- The synchronous design paradigm is VERY established; we need to keep it.
  - Easy to keep track of the exact timing of all events (predictable performance)
  - Vast experience used to manage ever increasing complexity
- Critical: timing relations between clock and data
- Present solution
  - Flat clock distribution (skew-free clock)
  - Does not solve the problem with data delays
[Figure: balanced clk net.] Balanced clk net: no skew. Wire delay still affects data.
5Introduction
The wire delay problem was recognized very early (Anceau 1982).
In spite of the alarm in 1982, we still manage multigigahertz synchronous designs, BUT today with considerable problems:
- ASIC style designs are normally limited to a 200-400 MHz clock, with severe timing closure problems.
- Multigigahertz designs require a very demanding full custom design style.
Wire delay $\propto L^2/s^2$, gate delay $\propto s^a$, where $s$ is the feature size and $a = 1 \ldots 2$.
6Introduction
Classical methods to manage:
- Asynchronous communication between blocks: FCPU (Lawson et al. 1974), Globally Asynchronous Locally Synchronous (GALS)
- Low frequency communication clock (Anceau 1982)
- Skew-free clock distribution and pipelined buses (e.g. Intel microprocessors)
7A comment on wires
Three cases:
- Highly lossy wires, RC-limited: large delay $\propto L^2$
- Highly lossy wires with repeaters: large delay $\propto L$
- Low loss wires (transmission lines): delay $= L/v$
$L$ = length of wire; $v$ = velocity of light in the actual dielectric, 1.5 dm/ns
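The three scaling laws can be compared numerically. The following is a minimal sketch, not taken from the slides: the resistance, capacitance, segment length, and repeater delay are hypothetical values chosen only to make the quadratic, linear, and time-of-flight behaviours visible, and the 0.38 factor is the usual textbook estimate for the 50% delay of a distributed RC line. Only the velocity of 1.5 dm/ns comes from the slide.

```python
# Hypothetical per-unit-length values, chosen only for illustration.
R_OHM_PER_MM = 100.0     # assumed wire resistance
C_F_PER_MM = 0.2e-12     # assumed wire capacitance
V_MM_PER_NS = 150.0      # 1.5 dm/ns from the slide, expressed in mm/ns


def rc_delay_ns(length_mm):
    """Unrepeated RC wire: delay grows as L^2 (0.38*r*c*L^2 estimate)."""
    return 0.38 * R_OHM_PER_MM * C_F_PER_MM * length_mm ** 2 * 1e9


def repeated_rc_delay_ns(length_mm, segment_mm=1.0, repeater_ns=0.02):
    """RC wire cut into equal segments by repeaters: delay grows as L."""
    segments = length_mm / segment_mm
    return segments * (rc_delay_ns(segment_mm) + repeater_ns)


def transmission_line_delay_ns(length_mm):
    """Low loss wire: time of flight L / v."""
    return length_mm / V_MM_PER_NS


for length in (1.0, 5.0, 10.0):
    print(length,
          rc_delay_ns(length),
          repeated_rc_delay_ns(length),
          transmission_line_delay_ns(length))
```

Doubling the length quadruples the unrepeated RC delay and doubles the repeated one, while the time of flight over 5 mm at 1.5 dm/ns is about 33 ps.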
8A comment on wires
Note the difference between latency, $t_d$, and data rate, $1/T_s$.
We may have a data rate larger than 1/delay (wave pipelining).
- $T_s > t_d$: RC case
- $T_s < t_d$ (wave pipelining, Xu 2003): transmission line case, RC case with repeaters
9Architectural view of future systems
[Figure: chips containing synchronous blocks, on-chip global links, high speed board links, clock.]
10Architectural view of future systems
Challenges:
- Allow scaling of clock rates and bandwidths
- Mitigate synchronization and clock skew problems
- Keep an unchanged synchronous design paradigm
- Find a systematic IP-based design style
[Figure: chips containing synchronous blocks, on-chip global links, high speed board links, clock.]
11Architectural view of future systems
- Wire delays are inevitable; we must accept latency.
- The latency/delay problem should be managed at two levels:
  - System level (predictability)
  - Implementation level (error-free)
12Architectural view of future systems
System level. Partition the system into blocks of limited size (preferably a natural partition: processors, memories, IP-blocks etc.).
- We may define a system where only the order of events is important (classical asynchronous; patient systems (Carloni et al. 1999)). We may then accept any latency between blocks.
- We may define a system with fixed latency between blocks (if the fixed latency is n clock cycles, the system is synchronous). We may then accept any latency $< nT_c$ between blocks, where $T_c$ is the clock period.
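The fixed-latency condition can be turned into a small sizing rule. As a hedged sketch: given a worst-case physical latency between two blocks and the clock period, the smallest usable n is the smallest integer with latency strictly below n times the period. The function name and the example numbers are hypothetical.

```python
import math


def min_fixed_latency_cycles(worst_case_latency_ns, clock_period_ns):
    """Smallest integer n with worst_case_latency < n * Tc (strict inequality)."""
    return math.floor(worst_case_latency_ns / clock_period_ns) + 1


TC_NS = 5.0  # clock period of a 200 MHz clock, used as an example value
print(min_fixed_latency_cycles(3.0, TC_NS))   # latency shorter than one cycle
print(min_fixed_latency_cycles(22.0, TC_NS))  # latency between 4 and 5 cycles
```

A latency that is exactly a whole number of periods is rounded up to the next cycle, since the inequality is strict.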
13Architectural view of future systems
Implementation level (we must avoid synchronization errors):
- Use synchronizers with long decision time (extra latency, nonzero error probability)
- Use stoppable clocks to synchronize communication (classical GALS, Chapiro 1984)
- Adapt clock phase to data (mesochronous clocks) (Mu 2001)
- Use FIFOs to isolate clock regions (FIFOs initialized with synchronizers, Chakraborty 2001; FIFOs initialized via system reset, Edman 2004)
14Architectural view of future systems
Implementation level, examples
[Figure: choice of clock phase (Mu 2001); labels: data in, data out, metastability detector, Rx clk.]
[Figure: FIFO solution (Chakraborty 2001, Edman 2004); labels: circular FIFO, write pointer, read pointer, data in, data out, Tx clk, Rx clk.]
15Synchronous Latency Insensitive Design
- Problem formulation
- Find a method to mitigate wire-induced latencies
within a synchronous paradigm
16Synchronous Latency Insensitive Design
[Timing diagram: global clock, A-block clock, A-logic, B-block clock, B-logic; 5 clock cycles.]
Nonflexible.
17Synchronous Latency Insensitive Design
[Timing diagram: global clock, A-block clock, A-logic, FIFO, B-block clock, B-logic, FIFO; 7 clock cycles.]
Flexible, at the cost of a fixed communication delay.
18Synchronous Latency Insensitive Design
[Timing diagram: global clock, A-block clock, A-logic, B-block clock, B-logic; 7 clock cycles.]
Flexible. Data delay added (0.6 clock cycles).
19Synchronous Latency Insensitive Design
[Timing diagram: global clock, A-block clock, A-logic, B-block clock, B-logic; 7 clock cycles.]
Flexible. Skew added: A-clock late (0.2), B-clock early (0.1).
20Synchronous Latency Insensitive Design
[Timing diagram: global clock, A-block clock, A-logic, B-block clock, B-logic; 7 clock cycles.]
Flexible. Skew added: A-clock early (0.2), B-clock late (0.1).
21Synchronous Latency Insensitive Design
[Timing diagram: global clock, A-block clock, A-logic, B-block clock, B-logic; 7 clock cycles.]
Flexible. Skew added, very small delay.
22Synchronous Latency Insensitive Design
Concept: clock-true model, followed by synthesis.
The clock-true model consists of synchronous blocks on a common clk, connected by communication links with fixed delays (n clk cycles).
During synthesis we replace the fixed delays with synchronizing ports (elastic FIFOs) that absorb all link latencies and clock skews. The final design agrees exactly with the clock-true model, independently of link delays and clock skews.
23Synchronous Latency Insensitive Design
Design flow
- System partition: natural partition (processors, memories, IP-blocks) into isochronous regions.
- Clock-true model verification (NEW): insertion of dummy delays between isochronous regions, then clock-true verification.
- Synthesis and back-end: replace dummy delays with elastic FIFOs.
- Timing verification: considerably easier; feedback can be avoided.
24Synchronous Latency Insensitive Design
Implementation
[Figure: synchronizing port; labels: reg, data, select, input counter, output counter, strobe, local clock, clk.]
Synchronizing port: fixed nominal delay preset in counters.
Example with three blocks and two links.
25Synchronous Latency Insensitive Design
Implementation
System reset and clock start are used as the initialization mechanism (example: n = 2).
[Timing diagram: reset, clk at root, data at Tx1 (written into FIFO(2) by strobe), data at Rx FIFO(2), clk at Rx, data in Rx (read from FIFO(2) by Rx clk after 2 counts); block diagram with Tx1, Tx2, Rx, clk, rst.]
Note that the relation of data to the clk period number is predictable.
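The behaviour of a synchronizing port can be illustrated with a small event-level model. This is a sketch under stated assumptions, not the authors' implementation: time is measured in clock periods, both counters start from the same reset, the strobe writes word k into slot k mod depth, and the Rx clock reads slot (m - n) mod depth at its edge m. The function names, the FIFO depth of 4, and the delay and skew values (loosely following the timing examples, with n = 2) are illustrative.

```python
def clock_true_model(words, n):
    """Reference: word k is visible in the receiver at clock cycle k + n."""
    return {k + n: w for k, w in enumerate(words)}


def synchronizing_port(words, n, link_delay, tx_skew, rx_skew, depth=4, tc=1.0):
    """Event-level sketch of a FIFO written by the strobe, read by the Rx clock."""
    events = []
    for k, w in enumerate(words):
        # The strobe edge arrives after the Tx clock skew plus the wire delay.
        events.append((k * tc + tx_skew + link_delay, 0, k, w))
    for m in range(n, n + len(words)):
        # The output counter starts reading n Rx clock cycles after reset.
        events.append((m * tc + rx_skew, 1, m, None))
    fifo = [None] * depth
    received = {}
    # Process events in time order (writes before reads on an exact tie).
    for _, kind, index, word in sorted(events, key=lambda e: e[:3]):
        if kind == 0:
            fifo[index % depth] = word                    # input counter
        else:
            received[index] = fifo[(index - n) % depth]   # output counter
    return received


words = list("ABCDEFGH")
reference = clock_true_model(words, n=2)
cases = [(0.6, 0.0, 0.0), (0.6, 0.2, -0.1), (0.6, -0.2, 0.1), (0.05, 0.2, -0.1)]
for delay, tx_skew, rx_skew in cases:
    print(delay, tx_skew, rx_skew,
          synchronizing_port(words, 2, delay, tx_skew, rx_skew) == reference)
```

In this model every case agrees with the clock-true reference as long as the link delay plus the skew difference stays below n clock periods. A link delay of 2.3 periods with n = 2 makes the read arrive before the write, and the result no longer matches.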
26Synchronous Latency Insensitive Design
Simulation
[Simulation waveforms: clk, Tx1, Tx2, Rx.]
27Synchronous Latency Insensitive Design
- New method to ease timing closure in large DSM chips
- Correct clock-true verification before synthesis
- Synchronous design paradigm and design tools kept
- Implementation-induced data delays and clock skews mitigated
- Implementation in standard libraries
- Full clock alignment between blocks
- No synchronizers, no risk of metastability
28Synchronous Latency Insensitive Design
On-going experiments:
- Corner-to-corner 200 MHz bus in FPGA (master thesis with Hardi electronics; Xilinx Virtex2 XCV6000-5; total data delay 5 clock cycles at 200 MHz)
- Implementation of SLID in Philips Aethereal Network on Chip (master thesis with Philips Research, Eindhoven; fixed delay 1.5 clock cycles)
- 6 Gb/s per wire over 5 mm on-chip transmission line (data delay 100 ps; PhD student Peter Caputa)
29Multiple clocks
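Can a multiple clock system be synchronous?
Example: rationally related clocks, $f_{c2} = (2/3) f_{c1}$.
[Figure: clock waveforms for $f_{c1}$, for $f_{c2} = (2/3) f_{c1}$, and for a clock with $f = \langle f_{c2} \rangle$ that is synchronous to $f_{c1}$.]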
30Multiple clocks
FIFO synchronization can be extended to rationally related clocks, with the FIFO used for mitigation of delays and of the introduced clock jitter (Chakraborty 2003; our proposal 2004).
Chakraborty extended his scheme to any clock frequency relation.
[Figure: FIFO with write pointer and read pointer; jitter accepted.]
31Multiple clocks
- Global reset used for initialization: distributed along the slow global clock and resynchronized locally
- Local clock generated by PLL: $f_{local} = n f_{global}$
- Frequency adjusted with rate multipliers
- Highest frequency reduced to the average of the lower frequency
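The slides do not say how the rate multipliers are built. One common construction, shown here purely as an assumption, is an accumulator that passes p out of every q edges of the fast clock, so that the gated clock stays synchronous to the fast clock while its average frequency is p/q of it. The function name `rate_multiplier` is hypothetical.

```python
def rate_multiplier(p, q, cycles):
    """Return one enable flag per fast-clock cycle, passing p of every q cycles."""
    if not 0 < p <= q:
        raise ValueError("expected 0 < p <= q")
    acc = 0
    enables = []
    for _ in range(cycles):
        acc += p
        if acc >= q:
            acc -= q            # overflow: let this clock edge through
            enables.append(1)
        else:
            enables.append(0)   # no overflow: swallow this clock edge
    return enables


pattern = rate_multiplier(2, 3, 9)
print(pattern)                      # [0, 1, 1, 0, 1, 1, 0, 1, 1]
print(sum(pattern) / len(pattern))  # average rate relative to the fast clock
```

With p = 2 and q = 3 the gated clock has two edges in every three fast-clock cycles, which corresponds to the 2/3 frequency ratio used in the earlier example. The uneven edge spacing is the introduced jitter that the FIFO has to absorb.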
32Multiple clocks
A) $f_{Tx1} : f_{Tx2} : f_{Rx} = 3 : 5 : 2$
B) $f_{Tx1} : f_{Tx2} : f_{Rx} = 2 : 3 : 5$
Both cases: nominal delay 2.
[Simulation waveforms for cases A) and B).]
33Unmatched buses
The proposed scheme depends on a matched bus, that is, data bits and strobe have equal delay. Can we extend this scheme to an unmatched bus?
Motivations:
- Long buses in ASIC or FPGA may have matching problems if tools do not support matching.
- In chip-to-chip communication, wire length matching is tricky because package or backplane contacts do not fit the board bus width.
34Unmatched buses
Need to solve two problems:
- Phase alignment of the latching receiver clock to arriving data
- Individual FIFO initialization per bit (cannot use strobe)
35Unmatched buses
[Figure: per-bit receiver; labels: data, phase choice, FIFO (put, get), Rx clock, strobe, data1, data2.]
- First data edge after reset initializes the FIFO
- First data edge latch chooses the latching edge
- Strobe tracks data delay variations due to temperature, voltage
36Conclusions
- Wire delays are inevitable
- Delays must be managed at system level and implementation level
- Our proposed scheme facilitates:
  - synchronous flow from system to implementation
  - clock-true verification before synthesis
  - mitigation of clock skews and data latencies
- Our scheme can be extended to multiple clock frequencies
- Our scheme can be extended to unmatched buses
37References
F. Anceau, "A Synchronous Approach for Clocking VLSI Systems", IEEE J. Solid-State Circuits, vol. 17, pp. 51-56, 1982.
D. M. Chapiro, "Globally-Asynchronous Locally-Synchronous Systems", PhD Thesis, Stanford University, Oct. 1984.
D. Sylvester and K. Keutzer, "Getting to the bottom of deep submicron", IEEE/ACM Int. Conference on Computer Aided Design 1998, Digest of Technical Papers, pp. 203-211, 1998.
L. P. Carloni, K. L. McMillan, A. Saldanha and A. L. Sangiovanni-Vincentelli, "A Methodology for Correct-by-Construction Latency Insensitive Design", 1999 IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, pp. 309-315, Nov. 1999.
F. Mu and C. Svensson, "Self-tested self-synchronization circuit for mesochronous clocking", IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, pp. 129-140, Feb. 2001.
C. Svensson, "Electrical Interconnects Revitalized", IEEE Trans. on Very Large Scale Integration, vol. 10, pp. 777-788, Dec. 2002.
J. Xu and W. Wolf, "A Wave-Pipelined On-chip Interconnect Structure for Network-on-Chips", Proc. of the 11th Symp. on High Performance Interconnect, pp. 10-14, 2003.
A. Chakraborty and M. R. Greenstreet, "Efficient Self-Timed Interfaces for Crossing Clock Domains", Proceedings of Ninth International Symposium on Asynchronous Circuits and Systems, pp. 78-88, May 2003.
C. Svensson, "Synchronous Latency Insensitive Design", invited paper, The 10th IEEE International Symposium on Asynchronous Circuits and Systems, Crete, April 19-23, 2004.
A. Edman and C. Svensson, "Timing Closure through a Globally Synchronous, Timing Partitioned Design Methodology", Proc. of the 41st Design Automation Conference, pp. 71-74, June 2004.