Performance Analysis and Optimization of Latency Insensitive Systems

1
Performance Analysis and Optimization of Latency
Insensitive Systems

Luca P. Carloni
Alberto L. Sangiovanni-Vincentelli

UC Berkeley
Design Automation Conference, Los Angeles,
June 2000
2
Motivation: System-on-a-Chip Design
3
Sequential Modules and RTL Design
[Figure labels: Primary Inputs, Combinational Logic, State Register, Output Register, Primary Outputs]
4
Block Diagram of a MAC Circuit
RTL Design separates functional specification
from performance analysis
5
Intra-Module Delay and Timing Constraints
Once all modules are composed, the overall
system works correctly as long as it runs
with a clock period $T_{clk} \geq \max(T_1, T_2, T_3, T_4)$
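
As a small worked example of the constraint above, the sketch below picks the smallest admissible clock period for a set of modules. The module names and delay values are hypothetical and serve only as an illustration.

```python
def min_clock_period(module_delays):
    """Smallest clock period that satisfies every module's own timing
    constraint: the maximum of the per-module minimum periods."""
    return max(module_delays.values())


# Hypothetical per-module minimum periods (in ns), for illustration only.
delays = {"M1": 2.0, "M2": 3.5, "M3": 1.8, "M4": 2.7}
print(min_clock_period(delays))
```

Note that this bound accounts only for intra-module delays; it says nothing about the wires between modules.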
6
Impact of Inter-Module Path Delays
7
DSM: Percentage of Reachable Die

For a 0.06 micron process, a signal can reach only
5% of the die's length in a clock cycle
[D. Matzke (TI), 1997]
Cause: combination of high frequencies and slower
wires

8
Need for a New Design Approach

To relax time constraints during early phases of
the design when correct measures of the
inter-module delay paths are not available
To simplify the composition of sequential modules
in pipeline mode
To facilitate the insertion of extra pipeline
stages between one module and the next one with
the purpose of buffering those signals which
propagate on long wires

9
Latency Insensitive Design [ICCAD '99]
10
Latency Insensitive Design [ICCAD '99]
[Figure label: RS]
11
Latency Insensitive Design [ICCAD '99]
[Figure labels: P2, P3, RS, RS]
12
Informative Events and Stalling Events

Each RelayStation introduces 1 stalling event
A module receiving a stalling event as input
emits stalling events as outputs at the next
cycle
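
The first statement above can be illustrated with a minimal sketch of a channel carrying relay stations. It assumes each relay station behaves as a one-cycle register that initially holds a stalling event, and it deliberately ignores back-pressure; the names `TAU` and `through_relay_stations` are made up for this example.

```python
TAU = None  # stand-in for a stalling event


def through_relay_stations(stream, n_stations):
    """Pass a stream of events through n_stations relay stations.

    Assumption: each relay station is a one-cycle register whose initial
    content is a stalling event (TAU). Back-pressure is not modeled.
    """
    registers = [TAU] * n_stations
    received = []
    for event in stream:
        registers.insert(0, event)        # new event enters the channel
        received.append(registers.pop())  # oldest content reaches the receiver
    return received


# With two relay stations the receiver sees two stalling events first.
print(through_relay_stations([1, 2, 3, 4], 2))
```

Under these assumptions, a channel with n relay stations delivers n stalling events before the first informative event.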

13
Advantages of LID Methodology
14
Robustness of LID Performance
Performance Loss (after RelayStation
insertion)
The Latency Insensitive Protocol does not affect
performance only if the design does not present
any feedback path between the sequential modules
15
Latency Insensitive Systems (LIS) Graph

Capture the structure of a Latency Insensitive
System without getting lost in the details of
the logic inside each sequential module
Focus on communication and synchronization
properties, while neglecting the particular data
items exchanged among the modules
Model the system performance by enabling
early-exploration as well as late-adjustments
of the latency-throughput trade-offs

16
LIS-Graph for the MAC Circuit

[Figure: LIS-graph of the MAC circuit; vertex labels: REG, MPY, Composite]
17
Weight of LIS-Graph Arcs
The weight of an arc is equal to the number of
relay stations on the corresponding channel
18
Equivalence of LIS-Graphs
19
Progressive Trace of a LIS-Graph Arc
20
Behavior of a LIS-Graph
The notion of LIS-graph behavior captures
the communication and synchronization
properties of a latency insensitive system
[Figure labels: S1, S2, S3, S4, S5]
21
Firing Semantic of a LIS-Graph

Independence Rule: every vertex Vj fires the
first informative event (number 1) on each
outgoing arc Ai = (Vj, Vk). However, if arc Ai
has weight w(Ai), the down-link vertex Vk will
observe w(Ai) stalling events before seeing the
first informative event from Vj
AND-Causality Rule: every vertex Vj fires the
n-th informative event only after the (n-1)-th
informative event has appeared on each arc
entering Vj
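
The two rules can be sketched as a small timing recurrence. This is one reading of the rules, not the authors' formal definition. It assumes that an event fired at cycle t over an arc of weight w becomes usable by the receiver at cycle t + w + 1, and that a vertex fires at most one informative event per cycle. The function name `firing_times` is made up for this example.

```python
def firing_times(vertices, arcs, n_events):
    """Return {vertex: [cycle at which it fires event 1, 2, ..., n_events]}.

    arcs is a list of (src, dst, weight) tuples, where weight is the number
    of relay stations on the channel.
    """
    # Independence Rule: every vertex fires event number 1 at cycle 0.
    times = {v: [0] for v in vertices}
    for n in range(1, n_events):
        for v in vertices:
            # Assumption: at most one informative event per cycle.
            earliest = times[v][n - 1] + 1
            for src, dst, weight in arcs:
                if dst == v:
                    # AND-Causality Rule: wait for the previous event on
                    # every incoming arc (assumed to arrive w + 1 cycles
                    # after it was fired).
                    earliest = max(earliest, times[src][n - 1] + weight + 1)
            times[v].append(earliest)
    return times


# Two-vertex loop with one relay station on the arc A -> B.
loop = [("A", "B", 1), ("B", "A", 0)]
print(firing_times(["A", "B"], loop, 6))
```

In this loop each vertex settles into firing two informative events every three cycles, which matches the intuition that a relay station inside a feedback path costs throughput.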

22
Cycle Means and System Throughput
23
Computing the Maximum Cycle Mean

Acyclic LIS-Graph (pipelined system with no
feedback):
$Thp(G) = MCM(G) = 1$
Cyclic LIS-Graph (1 Strongly Connected Component
(SCC)):
all K cycles can be detected in $O((V + A)(K + 1))$
Cyclic LIS-Graph (more than 1 SCC):
use Tarjan's algorithm to detect all SCCs,
then derive the largest MCM among all SCCs
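
A minimal sketch of the computation follows. The transcript does not preserve the cycle-mean formula, so the code assumes that the mean of a cycle C is (w(C) + |C|) / |C|, where w(C) is the sum of its arc weights and |C| its number of arcs, and that the throughput is the reciprocal of the maximum cycle mean. For brevity it enumerates simple cycles with a plain depth-first search, which is only practical for small graphs, instead of first splitting the graph into SCCs. All function names are made up for this example.

```python
from fractions import Fraction


def simple_cycles(arcs):
    """Yield every simple cycle once, as a list of (src, dst, weight) arcs."""
    out = {}
    for arc in arcs:
        out.setdefault(arc[0], []).append(arc)
    names = sorted({a[0] for a in arcs} | {a[1] for a in arcs})
    order = {v: i for i, v in enumerate(names)}

    def extend(start, vertex, path, on_path):
        for arc in out.get(vertex, []):
            nxt = arc[1]
            if nxt == start:
                yield path + [arc]
            elif order[nxt] > order[start] and nxt not in on_path:
                # Only visit vertices ordered after the start vertex, so each
                # cycle is reported from its lowest-ordered vertex only.
                yield from extend(start, nxt, path + [arc], on_path | {nxt})

    for start in names:
        yield from extend(start, start, [], {start})


def max_cycle_mean(arcs):
    """Assumed cycle mean: (sum of weights + number of arcs) / number of arcs.
    An acyclic graph has no cycles, so its MCM defaults to 1."""
    means = [Fraction(sum(w for _, _, w in c) + len(c), len(c))
             for c in simple_cycles(arcs)]
    return max(means, default=Fraction(1))


def throughput(arcs):
    """Assumed relation: throughput is the reciprocal of the MCM."""
    return 1 / max_cycle_mean(arcs)


graph = [("A", "B", 0), ("B", "C", 0), ("C", "A", 1), ("B", "A", 2)]
print(max_cycle_mean(graph), throughput(graph))
```

Here the two-arc cycle A, B carries two relay stations and dominates the three-arc cycle that carries one, which is the effect the equalization step later tries to avoid on short cycles.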

24
Recycling an Illegal LIS-Graph

Annotated LIS-Graph: each arc $a_i$ has a length
$l(a_i)$ that corresponds to the smallest multiple
of the clock period that is larger than the delay
of the channel associated with the arc
Illegal Arc: iff $w(a_i) < l(a_i) - 1$
Illegal LIS-Graph: iff it contains an illegal arc
Recycling Operation: legalize a graph by
increasing the weights of illegal arcs (i.e.
adding relay stations to the corresponding
channels)

25
Recycling: Legalization and Equalization

Legalization: after deriving the annotated
LIS-graph G, legalize it by augmenting the weight
of each illegal arc $a_i$ by
$\Delta W(a_i) = l(a_i) - 1 - w(a_i)$
Equalization: compute the max throughput Tk
sustainable by each SCC Sk in the legalized graph
G and equalize them by distributing Nk extra
relay stations on the critical cycle Ck of Sk
Key Point: avoid being forced to augment weights
of cycles having small cardinality
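
The legalization step can be sketched directly from the definitions above: an arc is illegal when its weight is below its length minus one, and it is fixed by adding exactly the missing relay stations. The arc names and numbers are hypothetical, and the equalization step is not covered by this sketch.

```python
def legalize(arcs):
    """arcs maps an arc name to (weight, length).

    Returns (new_weights, added), where added[name] is the number of relay
    stations inserted on that arc: l - 1 - w for illegal arcs, 0 otherwise.
    """
    new_weights, added = {}, {}
    for name, (weight, length) in arcs.items():
        delta = max(0, length - 1 - weight)  # positive only for illegal arcs
        added[name] = delta
        new_weights[name] = weight + delta
    return new_weights, added


# Hypothetical annotated arcs: name -> (weight, length in clock periods).
annotated = {"a1": (0, 3), "a2": (1, 2), "a3": (2, 1)}
print(legalize(annotated))
```

Only `a1` is illegal in this example (0 < 3 - 1), so it is the only arc that receives relay stations.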

26
Case Study: MPEG-2 Video Encoder
[Block diagram labels: Input, Preprocessing, Frame Memory, Motion Estimation, Motion Compensation, Frame Memory, DCT, Quantizer (Q), IDCT, Regulator, VLC Encoder, Buffer, Output]
27
LIS-graph of MPEG-2 Video Encoder
[Figure: LIS-graph with vertices S, V1 to V15, and T]
28
Detecting Cycles in MPEG-2 LIS-graph
[Figure: LIS-graph with vertices S, V1 to V15, and T; label: Cycles]
29
MPEG2 - Throughput Degradation
[Plot labels: Cycles' Cardinality (3, 4, 5, 8, 9, 10), Cycle Weight]
30
Moving Around the Latency - 1
[Figure: LIS-graph with vertices S, V1 to V15, and T; labels: Critical Cycle, thp(G)]
31
Moving Around the Latency - 2
[Figure: LIS-graph with vertices S, V1 to V15, and T; labels: Critical Cycle, thp(G)]
32
Practical Guidelines for LI Design

All modules should put comparable timing
constraints on the global clock
Modules whose corresponding LIS-graph nodes
belong to the same cycle should be kept close
while deriving the final implementation
Relay Station Insertion should be automatically
performed in a way similar to Buffer Insertion

33
Conclusions

LIS-graphs are a formal model to analyze the
properties of a Latency Insensitive System
Recycling is a rigorous method
to capture latency variations of the
communication channels
to compute exactly the final throughput of the
system
MPEG-2 Case Study shows that the present work
enables the exploration of latency/throughput
trade-offs at any stage of the design process
and facilitates the integration of pre-designed IP
cores on a single chip.

34
Performance Analysis and Optimization of Latency
Insensitive Systems