Asynchronous Pipelines - PowerPoint PPT Presentation


Asynchronous Pipelines


Asynchronous Pipelines Author: Peter Yeh Advisor: Professor Beerel – PowerPoint PPT presentation



1
Asynchronous Pipelines
  • Author: Peter Yeh
  • Advisor: Professor Beerel

2
Motivation
  • Can we reduce the communication overhead of
    asynchronous pipelines while hiding precharge
    time?
  • Can asynchronous pipelines achieve cycle times
    as fast as, if not faster than, their best
    synchronous counterparts?

3
Motivation: System Performance
  • Fixed-stage pipeline
      • Low pipeline usage: low latency is critical
      • High pipeline usage: cycle time is the
        limiting factor in generating new outputs as
        fast as possible
  • Flexible-stage pipeline
      • With zero forward overhead and a short cycle
        time, we can achieve a given throughput with
        fewer stages

4
Motivation: System Performance
  • Pipelines with loop dependencies
      • The optimal cycle time is the sum of the
        latencies around the loop
      • Pipelining is required to keep precharge/reset
        off the critical path
      • Our scheme requires fewer pipeline stages to
        achieve the same performance

5
Introduction
  • Asynchronous pipeline schemes using a Taken
    Detector (TD)
  • Best suited to coarse-grained pipelines
  • Two schemes targeting different requirements
    (plus a possible third, speed-independent scheme)

6
Outline
  • Background review
      • Sutherland
      • Ted Williams
      • Renaudin
      • Martin
  • Taken pipeline
  • Performance comparison
  • Conclusion

7
Definitions
  • Stage: a collection of logic that is precharged
    or evaluated at the same time
  • Cycle time: the time between the start of one
    evaluation of a stage and the start of its next
    evaluation
  • Forward latency: the time between the start of
    evaluation of the current stage and the start of
    evaluation of the next stage
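
As a rough illustration of how these two quantities combine, the sketch below estimates the time for a linear pipeline to deliver a given number of results. It assumes the pipeline starts empty, every stage has the same forward latency, and one new result emerges per cycle time in steady state. The function name and the delay values are illustrative and are not from the slides.

```python
def time_for_results(num_stages: int, forward_latency: float,
                     cycle_time: float, num_results: int) -> float:
    """Time until the last of `num_results` results leaves an initially empty pipeline.

    Assumes equal per-stage forward latency and steady-state operation
    limited by the cycle time (illustrative model, not from the slides).
    """
    if num_results < 1:
        raise ValueError("num_results must be at least 1")
    fill_time = num_stages * forward_latency           # first result ripples through
    return fill_time + (num_results - 1) * cycle_time  # later results follow once per cycle


# Low usage: a single result is dominated by forward latency.
single = time_for_results(num_stages=4, forward_latency=1.0, cycle_time=6.0, num_results=1)
# High usage: many results are dominated by cycle time.
many = time_for_results(num_stages=4, forward_latency=1.0, cycle_time=6.0, num_results=100)
print(single, many)
```

With one result the total is set entirely by forward latency; with many results the cycle-time term dominates.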

8
Background: Outline
  • Sutherland's micropipeline scheme
  • Ted Williams' PS0 and PC0 pipeline schemes
  • Renaudin's DCVSL pipeline scheme
  • Martin's deep pipeline scheme

9
Sutherland's Micropipeline
  • Sutherland is the father of the asynchronous
    pipeline; presented in his Turing Award lecture
  • Delay insensitive

[Figure: micropipeline block diagram with C-elements (c), LOGIC blocks, and the signals R(in), R(out), A(in), A(out), D(in), D(out)]

10
Williams' PC0
  • Speed independent
  • Cycle Time: $P = 3t_{F\uparrow} + t_{F\downarrow} + 4t_C + 4t_D$
  • Forward Latency: $L_f = t_{F\uparrow} + t_D + t_C$

[Figure: PC0 pipeline block diagram with precharged function blocks F1 to F3, completion detectors D1 to D3, C-elements C1 to C3, and the signals R(in), R(out), A(in), A(out), D(in), D(out)]

11
PC0 Timing Diagram
  • The cycle time is shown with red arrows, while
    the blue arrows show the precharge phase

12
Dependency Graph
[Figure: dependency graph over the C, F, and D transitions of successive stages, followed by two panels, "Flat Dependency Graph" and "Folded Dependency Graph", with edges labeled -1, 0, and 1]

13
Williams' PC1
  • Cycle Time: $P = 2t_{F\uparrow} + 4t_C + 4t_D$
  • Forward Latency: $L_f = t_{F\uparrow} + 2t_C + t_D$

[Figure: PC1 pipeline block diagram with precharged function blocks F1 and F2, a C latch, C-elements C1 and C2, completion detectors DA, DB, and D2, and the signals R(in), R(out), A(in), A(out), D(in), D(out)]

14
Williams' PS0
  • Not speed independent
  • Cycle Time: $P = 3t_{F\uparrow} + t_{F\downarrow} + 2t_D$
  • Forward Latency: $L_f = t_{F\uparrow}$

[Figure: PS0 pipeline block diagram with precharged function blocks F1 to F3, completion detectors D1 to D3, and the signals A(in), A(out), D(in), D(out)]

15
PS0 Timing Diagram
16
PS0 Timing Assumption
  • The pipeline has to meet the following timing
    assumption

[Equation: timing constraint on $t_F$, shown as an image on the original slide]

17
Renaudin's DCVSL Pipeline
  • Compared here against Williams' PC0 only
  • Uses DCVSL exclusively
  • Introduces latched DCVSL
  • Improves cycle time but not forward latency
  • Cycle Time: $P = t_{F\uparrow} + t_{F\downarrow} + 4t_C + 2t_D$
  • Forward Latency: $L_f = t_{F\uparrow} + t_C + t_D$
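
The claim that cycle time improves while forward latency does not can be checked by subtracting the DCVSL expressions from the PC0 expressions given earlier. The worked step below assumes that the arrows lost from the transcript denote evaluation ($\uparrow$) and precharge ($\downarrow$) and that the missing operators are sums.

```latex
\begin{align*}
P_{\text{PC0}} - P_{\text{DCVSL}}
  &= \left(3t_{F\uparrow} + t_{F\downarrow} + 4t_C + 4t_D\right)
   - \left(t_{F\uparrow} + t_{F\downarrow} + 4t_C + 2t_D\right)
   = 2t_{F\uparrow} + 2t_D, \\
L_{f,\text{PC0}} - L_{f,\text{DCVSL}}
  &= \left(t_{F\uparrow} + t_D + t_C\right)
   - \left(t_{F\uparrow} + t_C + t_D\right) = 0.
\end{align*}
```

Under this reading, the saving is two evaluation delays and two completion-detector delays per cycle, with no change in forward latency.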

18
DCVS Logic Family
[Figure panels: DCVS Logic; Latched DCVS Logic]

19
More on DCVSL
  • Advantages
      • Fast, based on dynamic domino-type logic
      • Built-in four-phase handshaking
      • Robust completion sensing
      • Storage element
  • Disadvantages
      • Higher complexity: more transistors and
        more area
      • Higher power dissipation

20
DCVS Pipeline
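  • Cycle Time: $P = t_{F\uparrow} + t_{F\downarrow} + 4t_C + 2t_D$
  • ($= 2t_{F\uparrow} + 4t_C + 2t_D$ when $t_{F\downarrow} = t_{F\uparrow}$)
  • Forward Latency: $L_f = t_{F\uparrow} + t_C + t_D$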

[Figure: DCVS pipeline block diagram with precharged function blocks F1 to F3, completion detectors D1 to D3, C-elements C1 to C3, and the signals R(in), R(out), A(in), A(out), D(in), D(out)]

21
DCVS Pipeline Timing Diagram
22
DCVS Dependency Graph
  • Cycle Time: $P = t_{F\uparrow} + t_{F\downarrow} + 4t_C + 2t_D$
  • Forward Latency: $L_f = t_{F\uparrow} + t_C + t_D$

[Figure: flat dependency graph and "Folded Dependency Graph" over the C, F, and D transitions, with edges labeled -1, 0, and 1]

23
Martin's Pipeline Schemes
  • Deep pipelining
  • Quasi delay-insensitive (QDI): no timing
    assumptions
  • Based on different reshufflings of the handshake
  • The best scheme has high concurrency, which
    reduces control overhead
  • Control logic is more complex

24
Basic Asynchronous Handshaking
[Figure: handshaking diagram with transitions on the signals Le, Re, L1, and R1]
  • Reshuffling eliminates the explicit variable x
  • Large control overhead
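
For readers unfamiliar with the underlying protocol, the following minimal sketch traces a generic four-phase (return-to-zero) handshake between a sender and a receiver. The signal names `req` and `ack` are placeholders chosen for illustration; they do not correspond to the Le, Re, L1, and R1 signals on the slide, and the sketch does not model any particular reshuffling.

```python
from typing import Iterator, List, Tuple


def four_phase_handshake(num_tokens: int) -> Iterator[Tuple[str, int]]:
    """Yield (signal, new_value) events for `num_tokens` four-phase transfers."""
    req, ack = 0, 0
    for _ in range(num_tokens):
        # Phase 1: sender raises request (data valid).
        req = 1
        yield ("req", req)
        # Phase 2: receiver acknowledges after taking the data.
        ack = 1
        yield ("ack", ack)
        # Phase 3: sender returns request to zero (data neutral).
        req = 0
        yield ("req", req)
        # Phase 4: receiver returns acknowledge to zero; the channel is idle again.
        ack = 0
        yield ("ack", ack)


trace: List[Tuple[str, int]] = list(four_phase_handshake(2))
print(trace)
```

Every transfer costs four signal transitions, which is the control overhead that the reshufflings on the following slides try to overlap with useful work.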

25
Handshaking Reshuffling
[Figure: handshaking diagram with transitions on the signals Re, Le, L1, and R1]
  • Still waits for the predecessor to reset before
    resetting itself, which means larger overhead
    with more inputs

26
Precharge-Logic Half-Buffer
[Figure: handshaking diagram with transitions on the signals Re, Le, L1, and R1]
  • Does not wait for the predecessor to reset before
    it resets its outputs. The control logic waits
    for the predecessor's reset only after the
    current stage has reset

27
Precharge-Logic Full-Buffer
[Figure: handshaking diagram with transitions on the signals Re, Le, en, L1, and R1]
  • Allows the neutrality test of the output data to
    overlap with raising the left enables
  • Complex control logic; requires an extra state
    variable

28
Martin's PCHB Full-Adder
29
Martin's Pipeline in General
[Figure: pipeline block diagram with precharged function blocks F1 to F3, a Control block per stage, completion detectors D1 to D3, enables Le and Re, and data D(in), D(out)]
  • The cycle time is limited by the properties of
    QDI
  • The next stage has to finish precharging before
    the current stage can evaluate its next input

30
Performance Analysis on PCFB
  • Control logic can be seen as completion detection
    (D) plus a C-element (C)
  • Reshuffling the handshake changes only the degree
    of concurrency; it does not affect the best-case
    performance analysis
  • Cycle Time: $P = 3t_{F\uparrow} + t_{F\downarrow} + 2t_C + 2t_D$
  • Forward Latency: $L_f = t_{F\uparrow}$

31
Outline
  • Background review
      • Sutherland
      • Ted Williams
      • Renaudin
      • Martin
  • Taken pipeline
  • Performance comparison
  • Conclusion

32
Taken Pipeline
  • Uses a Taken Detector (TD)
  • Two schemes to satisfy different requirements
  • Neither scheme is speed independent

33
Initial Idea
  • Precharge only when the next stage has taken the
    current result
  • Evaluate only when the next stage has precharged
  • Similar in idea to Martin's pipeline schemes

34
Further Observation
  • Precharge
      • We can precharge the current stage as soon as
        the first-level logic of the next stage has
        evaluated, that is, once the next stage has
        taken the result
  • Evaluate
      • Evaluation can start as soon as the guarded
        N-transistor in the first-level logic of the
        next stage has turned off

35
Relax Precharge (RP) Constraint
  • The current stage can precharge as soon as the
    first-level logic of the next stage has evaluated
    (the next stage has "taken" the result)
  • The current stage can evaluate as soon as the
    first-level logic of the next stage has
    precharged, which blocks the new result from
    passing through
  • No extra control logic is needed except the TD,
    which is similar to a completion detector

36
RP Pipeline Scheme
  • Cycle Time: $P = 2t_{F\uparrow} + t_{F1\uparrow} + t_{F1\downarrow} + 2t_{TD}$
  • Forward Latency: $L_f = t_{F\uparrow}$
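
One way to read the RP cycle-time expression, assuming the arrows lost from the transcript denote evaluation ($\uparrow$) and precharge ($\downarrow$), is as the chain of events that must complete before stage $i$ can evaluate again. The stage indices and the event ordering below are an interpretation of the RP constraint, not something stated on the slide.

```latex
\begin{align*}
P_{\text{RP}}
  &= \underbrace{t_{F\uparrow}}_{\text{stage } i \text{ evaluates}}
   + \underbrace{t_{F\uparrow}}_{\text{stage } i+1 \text{ evaluates}}
   + \underbrace{t_{F1\uparrow}}_{\text{first level of stage } i+2 \text{ evaluates}}
   + \underbrace{t_{TD}}_{\text{TD signals taken}} \\
  &\quad
   + \underbrace{t_{F1\downarrow}}_{\text{first level of stage } i+1 \text{ precharges}}
   + \underbrace{t_{TD}}_{\text{TD signals precharged}} \\
  &= 2t_{F\uparrow} + t_{F1\uparrow} + t_{F1\downarrow} + 2t_{TD}.
\end{align*}
```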

[Figure: RP pipeline block diagram with precharged function blocks F1 to F3 and data D(in), D(out)]

37
RP Timing Diagram
38
RP Timing Assumption
  • The timing assumption is easy to meet

39
RP Timing Assumption Cont.
  • $t_{F1,i}$ is the delay of the first-level logic
    of stage $i$
  • $t_{F2,i}$ is the delay of the logic after the
    first level of stage $i$
  • The rising and falling delays of the TD are
    assumed to be equal

40
Relax Evaluation (RE) Constraint
  • The current stage can start evaluating at about
    the same time as the next stage turns off the
    guarded N-transistors in its first-level logic
  • Requires a generalized C-element, but improves
    cycle time

41
RE Pipeline Scheme
  • The TD can be skewed for fast evaluation detection
  • Cycle Time: $P = 2t_{F\uparrow} + t_{F1\uparrow} + t_{TD} + t_C$
  • Forward Latency: $L_f = t_{F\uparrow}$

[Figure: RE pipeline block diagram with precharged function blocks F1 to F3, a generalized C-element (GC1) per stage, and data D(in), D(out)]

42
RE Timing Diagram
43
RE Timing Assumption 1
  • Precharge constraint

44
RE Timing Assumption 2
  • Evaluation constraint (Min Delay)

45
Issue in Fine-Grained Pipelines
  • In a fine-grained pipeline, such as Martin's
    single-gate pipeline, the RE scheme may require
    buffering because of process variation
  • Buffering is necessary because of the second
    timing assumption: the next gate (stage) may not
    have turned off its N-stack before the result
    from the current stage reaches it

46
Taken Detector (TD)
  • Similar to a completion detector
  • Detects both evaluation and precharge
  • Its inputs are the outputs of the first-level
    logic of each stage
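
A behavioral sketch of such a detector is shown below. It assumes dual-rail first-level outputs (exactly one rail high means evaluated, both rails low means precharged) and C-element-style hysteresis on the detector output. Both are assumptions made for illustration, since the slides do not give the TD circuit.

```python
from typing import Sequence, Tuple

DualRail = Tuple[int, int]  # (true_rail, false_rail); assumed encoding


class TakenDetector:
    """Behavioral model: output rises when all bits are valid, falls when all are neutral."""

    def __init__(self) -> None:
        self.taken = 0  # 0 = first level precharged, 1 = first level evaluated

    def update(self, first_level_outputs: Sequence[DualRail]) -> int:
        all_valid = all(t != f for t, f in first_level_outputs)
        all_neutral = all(t == 0 and f == 0 for t, f in first_level_outputs)
        if all_valid:
            self.taken = 1   # every first-level gate has evaluated
        elif all_neutral:
            self.taken = 0   # every first-level gate has precharged
        # Otherwise hold the previous value (partial evaluation or precharge).
        return self.taken


td = TakenDetector()
states = [
    td.update([(0, 0), (0, 0)]),  # precharged
    td.update([(1, 0), (0, 0)]),  # partially evaluated: hold
    td.update([(1, 0), (0, 1)]),  # fully evaluated
    td.update([(0, 0), (0, 1)]),  # partially precharged: hold
    td.update([(0, 0), (0, 0)]),  # fully precharged
]
print(states)
```

The hold behavior is what lets a single output signal both "result taken" (rising) and "first level precharged" (falling) without glitching in between.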

47
Datapath Merging and Splitting
  • Datapath merging and splitting can be done in a
    manner similar to Williams' style

48
Outline
  • Background review
      • Sutherland
      • Ted Williams
      • Renaudin
      • Martin
  • Taken pipeline
  • Performance comparison
  • Conclusions

49
Comparison of RE and Synchronous Skew-Tolerant Pipelines
  • Assume a 4-stage pipeline (stages 1 to 4) with
    4-phase clocking
  • Synchronous
      • Stage 1 starts its next evaluation after
        stage 4 starts evaluation
  • Asynchronous
      • Stage 1 starts its next evaluation after the
        completion of the first-level logic of
        stage 3 is detected

50
Comparison Assumptions
  • The pipeline is balanced: all stages have equal
    evaluation time
  • Precharge time is the same as evaluation time

51
Graphical Comparison
52
Optimum Number of Stages
  • Optimum Number of Stages (ONS)
      • Cycle time is not the only factor in system
        performance; forward latency is also a
        limiting factor
      • A larger cycle time can be compensated for by
        increasing the number of stages
      • However, a high Lf means system throughput
        cannot be increased by adding more stages
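
The trade-off can be sketched with a deliberately simplified model, assumed here rather than taken from the slides: a loop that holds a single data token, whose logic is split evenly across the stages, and whose iteration time is the larger of the total loop latency and the per-stage cycle time. All parameter names and numbers are hypothetical.

```python
from typing import Tuple


def loop_iteration_time(num_stages: int, total_logic_delay: float,
                        forward_overhead: float, cycle_factor: float,
                        cycle_overhead: float) -> float:
    """Simplified model of one trip around a loop holding a single token.

    Per-stage evaluation delay is total_logic_delay / num_stages.
    Loop latency = total_logic_delay + num_stages * forward_overhead
    Cycle time   = cycle_factor * stage delay + cycle_overhead
    The loop cannot iterate faster than either limit.
    """
    stage_delay = total_logic_delay / num_stages
    loop_latency = total_logic_delay + num_stages * forward_overhead
    cycle_time = cycle_factor * stage_delay + cycle_overhead
    return max(loop_latency, cycle_time)


def optimum_stages(max_stages: int, **params: float) -> Tuple[int, float]:
    """Return (stage count, iteration time) minimizing the model above."""
    best = min(range(1, max_stages + 1),
               key=lambda n: loop_iteration_time(n, **params))
    return best, loop_iteration_time(best, **params)


# Hypothetical numbers: 24 units of logic, arbitrary overheads.
with_overhead = optimum_stages(12, total_logic_delay=24.0, forward_overhead=2.0,
                               cycle_factor=4.0, cycle_overhead=8.0)
zero_overhead = optimum_stages(12, total_logic_delay=24.0, forward_overhead=0.0,
                               cycle_factor=2.0, cycle_overhead=3.0)
print(with_overhead, zero_overhead)
```

In this model, adding stages shortens the cycle time but lengthens the loop latency whenever the forward overhead is nonzero, so the iteration time has a minimum at a finite stage count. With zero forward overhead, extra stages never hurt latency, and the example reaches its floor with fewer stages.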

53
Conclusion
  • With Taken logic and some easy-to-meet timing
    requirements, we can achieve the best cycle time
    and forward latency
  • The performance comparison with existing pipeline
    schemes is favorable
  • An implementation is still required to prove the
    theory