High-level Specification and Efficient Implementation of Pipelined Circuits

Learn more at: https://people.csail.mit.edu


1
High-level Specification and Efficient
Implementation of Pipelined Circuits
Maria-Cristina Marinescu Martin Rinard
Laboratory for Computer Science Massachusetts
Institute of Technology
2
Overall Goal
Efficient, Synchronous, Parallel Implementation in
Synthesizable Verilog
Modular, Asynchronous, Sequential Specification
3
Specification Language Concepts

State (Registers, Memory)
Queues (Conceptually Unbounded Length)
Modules
Read inputs from queues and state
Write outputs to queues and state

4
Module Example
Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
Input Queue: <jz r0>, <inc r1>
Register Operand Fetch Module
Output Queue: <inc r2 100>, <inc r3 84>
5
Module Example
Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
Input Queue: <jz r0>
Register Operand Fetch Module: holds <inc r1>, reads register r1
Output Queue: <inc r2 100>, <inc r3 84>
6
Module Example
Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
Input Queue: <jz r0>
Register Operand Fetch Module: reads r1 = 43, forms <inc r1 43>
Output Queue: <inc r2 100>, <inc r3 84>
7
Module Example
Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
Input Queue: <jz r0>
Register Operand Fetch Module
Output Queue: <inc r1 43>, <inc r2 100>, <inc r3 84>
8
Module Behavior

Each module has a set of update rules
Each Update Rule Consists of
Precondition
Action (set of updates)
Rule is enabled (and can execute) if precondition
is true in current state
When rule executes, atomically applies updates in
action to produce new state

9
Update Rules in Example

If an increment instruction is at the head of the input queue and there is no RAW hazard, then atomically remove the instruction from the queue, fetch the value from the register file, and append the instruction with the register value to the output queue
<INC r> = head(iq) and notin(oq, <INC r _>) → iq = tail(iq), oq = append(oq, <INC r rf[r]>)
If a jump-on-zero instruction is at the head of the input queue and there is no RAW hazard, then atomically remove the instruction from the queue, fetch the value from the register file, and append the instruction with the register value to the output queue
<JZ r l> = head(iq) and notin(oq, <INC r _>) → iq = tail(iq), oq = append(oq, <JZ rf[r] l>)
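
The two update rules can be sketched in Python as a precondition test followed by an atomic update. This is only an illustration, not the authors' specification language: the tuple encoding of instructions, the helper names (`no_pending_inc`, `fetch_inc`, `fetch_jz`), the convention that the queue head is the leftmost deque element, and the jump label "L" are all assumptions made for the example.

```python
from collections import deque

def no_pending_inc(queue, reg):
    """notin(oq, <INC reg _>): no increment of reg is still in flight (no RAW hazard)."""
    return not any(ins[0] == "INC" and ins[1] == reg for ins in queue)

def fetch_inc(iq, oq, rf):
    """<INC r> = head(iq) and notin(oq, <INC r _>) -> dequeue, read rf[r], append."""
    if iq and iq[0][0] == "INC" and no_pending_inc(oq, iq[0][1]):
        _, r = iq.popleft()
        oq.append(("INC", r, rf[r]))
        return True          # rule was enabled and executed
    return False             # rule was not enabled; state unchanged

def fetch_jz(iq, oq, rf):
    """<JZ r l> = head(iq) and notin(oq, <INC r _>) -> dequeue, read rf[r], append."""
    if iq and iq[0][0] == "JZ" and no_pending_inc(oq, iq[0][1]):
        _, r, label = iq.popleft()
        oq.append(("JZ", rf[r], label))
        return True
    return False

# State taken from the Module Example slides (the label "L" is hypothetical).
rf = {"r0": 0, "r1": 43, "r2": 100, "r3": 84}
iq = deque([("INC", "r1"), ("JZ", "r0", "L")])
oq = deque([("INC", "r3", 84), ("INC", "r2", 100)])
fired = fetch_inc(iq, oq, rf)
```

After the call, `<INC r1 43>` sits at the tail of the output queue and only the jump instruction remains in the input queue, matching the last Module Example slide.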

10
From Modules to Systems

System is a set of Modules
Access same Registers and Memories
Also communicate via Queues
Behavior of System
Update rules from all Modules
Queues Provide Modularity
Decouple Modules
Enable Independent Development
Promote Reusable Modular Designs

11
Example System Specification

Instruction Fetch Module
TRUE → iq = append(iq, im[pc]), pc = pc + 1
Register Operand Fetch Module
<INC r> = head(iq) and notin(rq, <INC r _>) → iq = tail(iq), rq = append(rq, <INC r rf[r]>)
<JZ r l> = head(iq) and notin(rq, <INC r _>) → iq = tail(iq), rq = append(rq, <JZ rf[r] l>)
Compute and Writeback Module
<INC r v> = head(rq) → rf = rf[r → v+1], rq = tail(rq)
<JZ v l> = head(rq) and (v = 0) → pc = l, iq = nil, rq = nil
<JZ v l> = head(rq) and (v != 0) → rq = tail(rq)

12
Abstract Model of Execution

Conceptually, system execution is a sequence of rule executions
while TRUE
  choose an enabled rule
  execute rule
  obtain new state
Concepts in Abstract Execution Model
Rules execute atomically
Rules execute asynchronously
Rules execute sequentially
Unbounded Queues
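
The abstract execution model can be sketched in Python by running the example system's six rules under a random scheduler. This is a hedged illustration under several assumptions that are not on the slides: states are plain dicts, queues are unbounded lists with the head at index 0, jump labels are instruction indices, and the fetch rule (whose precondition is TRUE on the slide) is only enabled while `pc` is inside a small hypothetical program so that the loop terminates.

```python
import random

PROGRAM = [("INC", "r1"), ("INC", "r1"), ("JZ", "r1", 0)]  # hypothetical program

def make_state():
    return {"pc": 0, "im": PROGRAM, "rf": {"r0": 0, "r1": 0}, "iq": [], "rq": []}

def head_is(q, op):
    return bool(q) and q[0][0] == op

def no_pending_inc(rq, r):
    return not any(i[0] == "INC" and i[1] == r for i in rq)

# Instruction fetch (assumption: enabled only while pc is within the program).
def r0_pre(s): return s["pc"] < len(s["im"])
def r0_act(s):
    s["iq"].append(s["im"][s["pc"]])
    s["pc"] += 1

# Register operand fetch.
def r1_pre(s): return head_is(s["iq"], "INC") and no_pending_inc(s["rq"], s["iq"][0][1])
def r1_act(s):
    _, r = s["iq"].pop(0)
    s["rq"].append(("INC", r, s["rf"][r]))

def r2_pre(s): return head_is(s["iq"], "JZ") and no_pending_inc(s["rq"], s["iq"][0][1])
def r2_act(s):
    _, r, label = s["iq"].pop(0)
    s["rq"].append(("JZ", s["rf"][r], label))

# Compute and writeback.
def r3_pre(s): return head_is(s["rq"], "INC")
def r3_act(s):
    _, r, v = s["rq"].pop(0)
    s["rf"][r] = v + 1

def r4_pre(s): return head_is(s["rq"], "JZ") and s["rq"][0][1] == 0
def r4_act(s):
    s["pc"] = s["rq"][0][2]       # taken jump: redirect pc and flush both queues
    s["iq"], s["rq"] = [], []

def r5_pre(s): return head_is(s["rq"], "JZ") and s["rq"][0][1] != 0
def r5_act(s):
    s["rq"].pop(0)

RULES = [(r0_pre, r0_act), (r1_pre, r1_act), (r2_pre, r2_act),
         (r3_pre, r3_act), (r4_pre, r4_act), (r5_pre, r5_act)]

def run(state, seed=0):
    """while TRUE: choose an enabled rule, execute it atomically, obtain new state."""
    rng = random.Random(seed)
    steps = 0
    while True:
        enabled = [act for pre, act in RULES if pre(state)]
        if not enabled:
            return steps
        rng.choice(enabled)(state)
        steps += 1
```

For this program the RAW-hazard check forces the second increment to wait for the first writeback, so every schedule ends with `rf["r1"] == 2` no matter which enabled rule the scheduler picks.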

13
Synthesis Algorithm
14
Key Challenge

Specification Language
Sequential, atomic, asynchronous semantics
Conceptually unbounded queues
Implemented Circuit
Coordinated parallel execution
Finite length queues

15
Initial Synthesis Algorithm

Symbolically Execute Rules in Order
Each rule starts with result from previous rule
Obtain Expressions for New Values of Registers,
Memories, and Queues
Generate Combinational Circuit that Produces New
Values
Each clock cycle circuit computes new values,
writes new values back
Every rule gets a chance to execute, every clock
cycle!

[Diagram: symbolic execution states SE0, SE1, SE2, SE3, connected in sequence by Rule 1, Rule 2, Rule 3]
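
The chaining behaviour of the initial algorithm can be sketched in Python: within one clock cycle every rule is tested in a fixed order, and each rule sees the state left by the previous one. This is a behavioural illustration only (the real algorithm produces expressions and a combinational circuit); the helper names `clock_cycle` and `move` and the toy queues `a`, `b`, `c` are assumptions made for the example.

```python
def clock_cycle(state, rules):
    """One cycle: rule k tests and updates the state produced by rule k-1."""
    fired = []
    for name, pre, act in rules:
        if pre(state):
            state = act(state)   # new state feeds the next rule in the chain
            fired.append(name)
    return state, fired

def move(src, dst):
    """Build a toy rule that moves the head of queue src to the tail of queue dst."""
    def pre(s):
        return len(s[src]) > 0
    def act(s):
        new = {k: list(v) for k, v in s.items()}   # leave the old state untouched
        new[dst].append(new[src].pop(0))
        return new
    return (f"{src}->{dst}", pre, act)

rules = [move("a", "b"), move("b", "c")]
state, fired = clock_cycle({"a": [1, 2], "b": [], "c": []}, rules)
```

In this run item 1 passes through both rules in a single cycle and ends up in `c`, which is the long combinational path that motivates relaxation.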
16
Properties of Initial Algorithm

Preserves Semantics of Specification
Independent Rules Execute Concurrently
But May Have Long Clock Cycle
Output of each preceding rule fed in as input to
next rule
Data traverses ALL rules (and pipeline stages) in
a single cycle!
Solution: Relaxation

17
Relaxation

for each rule Ri with precondition Pi
  for each variable instance vi in precondition Pi
    replace vi with its earliest safe version

...
Rk-1: Pk-1 -> vk ...
...
Ri: Pi(vi,...) -> ...
...

vk safe for vi if either
  Pi[vk/vi] implies Pi
  (Pi, Pk-1) mutually exclusive
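
The first safety condition (the precondition evaluated on the earlier version implies the precondition on the later version) can be explored with a small bounded check in Python. This is a sketch under assumptions, not the authors' procedure: it only considers the case where the later version is the earlier queue with one element appended, it enumerates queues up to a small length rather than proving the implication, and the names `early_version_is_safe`, `head_is_inc`, and `is_empty` are hypothetical.

```python
from itertools import product

INSTRS = [("INC", "r1"), ("JZ", "r0")]   # tiny hypothetical instruction set

def all_queues(max_len):
    """Enumerate every queue over INSTRS with at most max_len elements."""
    for n in range(max_len + 1):
        for q in product(INSTRS, repeat=n):
            yield list(q)

def head_is_inc(iq):
    return bool(iq) and iq[0][0] == "INC"

def is_empty(iq):
    return not iq

def early_version_is_safe(pre, max_len=3):
    """Bounded check that pre(early) implies pre(late), with late = append(early, x)."""
    for early in all_queues(max_len):
        for x in INSTRS:
            late = early + [x]
            if pre(early) and not pre(late):
                return False     # counterexample: rule would fire wrongly
    return True

safe_head = early_version_is_safe(head_is_inc)
safe_empty = early_version_is_safe(is_empty)
```

A head-of-queue test survives an append, so it may read the start-of-cycle queue; an emptiness test does not, so it may not be relaxed this way.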

18
Relaxation Result

Relaxation exposes additional parallelism
Queues separate pipeline stages
Items traverse one stage per clock cycle
Safety: If a rule executes in new system
Then it also executes in old system
And it generates same result
Liveness: After relaxation, all rules test initial state
If rule enabled in old system but not in new system, then
Some rule executes in new system

19
Global Scheduling

Issue
Conceptually unbounded queues
Finite hardware buffers
Solution: Modify append rules s.t. no queue exceeds its specified length
Challenge
Schedule maximum number of rules
Rules can insert into full queues if within
length at the end of clock cycle

20
Global Scheduling

Assumption: queues start within length at beginning of cycle
Goal: generate circuit that makes queues remain within length at end of cycle
Basic Approach
Before enabled rule executes
Be sure there will be room for result in output queues at end of clock cycle
Key Idea: a rule can insert into a queue as long as enough following rules remove from it

21
GS Basic Concepts

Rule-Queue Graph
Nodes of 2 types: rules and queues
Edge from rule node to queue node if rule inserts
into queue
Edge from queue node to rule node if rule removes
from queue
In Example
[Diagram: rule-queue graph of the example system, with queue nodes iq and rq]
22
Acyclic Rule-Queue Graphs

Process Rules in Topological Sort Order
Augment execution precondition
If rule inserts into a queue, require that either
there is room in queue when rule executes or
future rules will execute and remove items to
make room in queue
Each queue has counter of number of elements in
queue at start of cycle
Combinational logic tracks queue insertions and
deletions
GS algorithm generates the control signals for
the combinational logic
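
The augmented preconditions for an acyclic rule-queue graph can be sketched in Python for the example system. The sketch makes simplifying assumptions that go beyond the slides: at most one rule inserts into a given queue per cycle, one removal is enough to make room, rules are numbered R0 (instruction fetch) through R5 in the order they appear in the example specification, and a rule's consumers are evaluated before the rule itself. `INSERTS`, `REMOVES`, and `schedule` are hypothetical names; `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Rule-queue graph of the example: which queues each rule inserts into / removes from.
INSERTS = {"R0": ["iq"], "R1": ["rq"], "R2": ["rq"], "R3": [], "R4": [], "R5": []}
REMOVES = {"R0": [], "R1": ["iq"], "R2": ["iq"],
           "R3": ["rq"], "R4": ["iq", "rq"], "R5": ["rq"]}

def consumers(q):
    """Rules with an edge from queue node q, i.e. rules that remove from q."""
    return [r for r, qs in REMOVES.items() if q in qs]

def schedule(pre, has_room):
    """Return which rules fire this cycle.

    pre[r]      -- original precondition of rule r, evaluated at start of cycle
    has_room[q] -- True if queue q is below its length at start of cycle
    """
    # A rule depends on the consumers of every queue it inserts into.
    deps = {r: {c for q in INSERTS[r] for c in consumers(q)} for r in INSERTS}
    fires = {}
    for r in TopologicalSorter(deps).static_order():   # consumers come first
        room = all(has_room[q] or any(fires[c] for c in consumers(q))
                   for q in INSERTS[r])
        fires[r] = pre[r] and room                     # augmented precondition
    return fires

full = {"iq": False, "rq": False}
base = {"R0": True, "R1": True, "R2": False, "R3": False, "R4": False, "R5": False}
flowing = schedule({**base, "R3": True}, full)   # writeback drains rq
stalled = schedule(base, full)                   # nothing drains rq
```

With both queues full, the pipeline keeps moving when the writeback rule fires and stalls when it does not, which is the stall-logic reading of the extra preconditions.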

23
Pipeline Implications

Counter becomes presence bit for single element
queues
Additional preconditions can be viewed as
pipeline stall logic
Design can be written to generate pipeline
forwarding/bypassing instead of stall

24
Global Scheduling Example

For length(iq) = 1, length(rq) = 1
(Pi' denotes the augmented precondition of rule Ri)
R0 executes and appends to iq if
P1' or P2' or P4', OR
iq0 = nil
R4 doesn't insert into queues
=> P4' = P4
Apply same rationale for R1, R2
R1 executes and appends to rq if
P4' or P3' or P5', OR
rq0 = nil
R3 and R5 don't insert into queues
=> P3' = P3, P5' = P5

[Diagram: successive versions IQ0 through IQ5 of iq, with relaxed preconditions P1[IQ0/IQ1] and P2[IQ0/IQ2], updates tail(IQ1) and nil, and selectors P0 and P4]

GS1(rq) = GS2(rq) = (rq0 = nil) or P4 or P3 or P5
GS0(iq) = (iq0 = nil) or P4 or P1' or P2'
        = (iq0 = nil) or P4 or ((P1 or P2) and ((rq0 = nil) or P3 or P5))
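
The two forms of GS0(iq) can be compared with an exhaustive truth table in Python. The check rests on one reading of the slide's damaged notation, which is an assumption: the augmented preconditions are P1' = P1 and GS1(rq) and P2' = P2 and GS2(rq), with GS1(rq) = GS2(rq) = (rq0 = nil) or P4 or P3 or P5. The function names are hypothetical.

```python
from itertools import product

def gs_rq(rq_empty, p3, p4, p5):
    """GS1(rq) = GS2(rq): room in rq at start of cycle, or a later rule drains it."""
    return rq_empty or p4 or p3 or p5

def gs0_primed(iq_empty, rq_empty, p1, p2, p3, p4, p5):
    """GS0(iq) written with the augmented preconditions P1' and P2'."""
    p1_aug = p1 and gs_rq(rq_empty, p3, p4, p5)
    p2_aug = p2 and gs_rq(rq_empty, p3, p4, p5)
    return iq_empty or p4 or p1_aug or p2_aug

def gs0_expanded(iq_empty, rq_empty, p1, p2, p3, p4, p5):
    """GS0(iq) with P1' and P2' substituted out and the redundant P4 absorbed."""
    return iq_empty or p4 or ((p1 or p2) and (rq_empty or p3 or p5))

# Compare the two forms on all 2**7 assignments of the boolean inputs.
equivalent = all(gs0_primed(*v) == gs0_expanded(*v)
                 for v in product([False, True], repeat=7))
```

The inner P4 can be dropped in the expanded form because P4 already appears as a top-level alternative.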

25
Cyclic Rule-Queue Graphs

Cyclic Graphs lead to Cyclic Dependences
Rule 1 depends on rule 2 to remove an item from a
queue
But rule 2 depends on rule 1 to remove an item
from another queue
Algorithm from acyclic case would generate
recursive preconditions

[Diagram: rule-queue graph with a cycle through rule 1, Queue x, rule 2, and Queue y]
26
Cyclic R-Q Graphs Example

Let P1' = P1 and GS1
Assumption: R1 executes (P1' = TRUE)
Find group of rules that must fire together
P1' = P1 and ((x = nil) or P2')
    = P1 and ((x = nil) or (P2 and ((y = nil) or P1')))
No need to explore P1' further (P1' = TRUE)
=> P1' = P1 and ((x = nil) or P2)

[Diagram: rule-queue graph with a cycle through rule 1, Queue x, rule 2, and Queue y]
27
Solution to Cyclic Dependence Problem

Key Idea: no deadlock if we can coordinate removals and insertions from/to all queues in cycle s.t. removals make room for insertions
Groups of rules must execute together
Use depth-first search on rule-queue graph to
find cyclic groups
Augment preconditions to allow all rules in cycle
to execute together
Extensions: include paths into and out of cyclic group
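
Finding the cyclic groups with a depth-first search can be sketched in Python as follows. This is a simple path-based search written for illustration, not the authors' implementation; the function name `cyclic_rules` is hypothetical, and the edge directions in the two-rule example (rule 1 inserts into x and removes from y, rule 2 the reverse) are an assumption read off the cyclic example.

```python
def cyclic_rules(inserts, removes):
    """Return the set of rules that lie on some cycle of the rule-queue graph."""
    # succ[r]: rules that remove from a queue r inserts into (rule -> queue -> rule).
    succ = {r: {c for c, qs in removes.items() for q in inserts[r] if q in qs}
            for r in inserts}
    on_cycle = set()

    def dfs(rule, path):
        for nxt in succ[rule]:
            if nxt in path:
                # Back edge: everything from nxt to the end of the path is a cycle.
                on_cycle.update(path[path.index(nxt):])
            else:
                dfs(nxt, path + [nxt])

    for start in succ:
        dfs(start, [start])
    return on_cycle

# Two rules in a cycle, plus a third rule that only feeds the cycle.
inserts = {"rule1": ["x"], "rule2": ["y"], "rule3": ["x"]}
removes = {"rule1": ["y"], "rule2": ["x"], "rule3": []}
group = cyclic_rules(inserts, removes)
```

Here `rule3` is on a path into the cycle but not part of it, so only `rule1` and `rule2` form the group that must execute together.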

28
Cyclic R-Q Graphs Algorithm

SymbolicExecution(Ri, CrtPath)
  for each queue q that Ri inserts into
    for each rule Rj that inserts/removes in/from q
      newCrtPath = if Rj ∈ CrtPath
                   then CrtPath
                   else CrtPath ∪ {Rj}
      newRj = if Rj ∈ CrtPath
              then TRUE    (rule already examined)
              else SymbolicExecution(Rj, newCrtPath)
      replace Rj with newRj in GSi(q)
  GSi = AND over all such q of GSi(q)
  Ri' = Ri and GSi
29
Symbolic Execution

Substitute out all intermediate versions of
variables
Obtain expression for last version of each
variable
Each expression defines new value of
corresponding variable
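
Substituting out intermediate versions can be sketched in Python with expressions stored as nested tuples. This is a minimal illustration under assumptions: the version names (`iq0`, `iq1`, `iq2`), the operand `instr`, and the function `substitute` are hypothetical, and the sketch ignores the rule preconditions that would select between old and new values in the generated circuit.

```python
def substitute(expr, defs):
    """Replace every intermediate version in expr by its defining expression."""
    if isinstance(expr, str):
        # A name with a definition is an intermediate version; expand it recursively.
        return substitute(defs[expr], defs) if expr in defs else expr
    return tuple(substitute(e, defs) for e in expr)

# Versions produced by executing two rules in order, starting from iq0.
defs = {
    "iq1": ("append", "iq0", "instr"),   # first rule appends to the initial queue
    "iq2": ("tail", "iq1"),              # second rule removes the head
}
final = substitute("iq2", defs)
```

The result mentions only the start-of-cycle version `iq0`, so it can be read as the expression that defines the new value of `iq`.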

30
Optimizations

Optimize expressions from symbolic execution
CSE: avoid unnecessary replication of HW
Mutual Exclusion Testing
Eliminate computation of values that never occur
in practice as result of mutually exclusive
preconditions

31
Verilog Generation

Synthesize HW directly from expressions
Each queue as one or more registers
Each memory variable as library block
Each state variable as one or more registers,
depending on type
Each expression as combinational logic that feeds
back into corresponding registers

32
Experimental Results

We have implemented synthesis system
Used system to generate synthesizable Verilog for
several specifications
(map effort medium, area effort low)

Architecture               Cycle (MHz)   Area
RISC Pipelined Processor   88.89         23195.25
SCU RTL 98 DSP             90.91         22999.50

Benchmark    Cycle (MHz)   Area
Bubblesort   107.06        5434
Butterfly    104.42        5411
Filter       105.01        3757
33
Conclusion

Starting Point (Good for Designer)
Modular, Asynchronous, Sequential Specification
with Conceptually Infinite Queues
Ending Point (Good for Implementation)
Efficient, Synchronous, Globally Scheduled,
Parallel Implementation with Finite Queues in
Synthesizable Verilog
Variety of Techniques
Symbolic Execution
Global Scheduling