High-level Specification and Efficient Implementation of Pipelined Circuits

1
High-level Specification and Efficient
Implementation of Pipelined Circuits
Maria-Cristina Marinescu, Martin Rinard
Laboratory for Computer Science, Massachusetts Institute of Technology
2
Overall Goal
  • From: Modular, Asynchronous, Sequential Specification
  • To: Efficient, Synchronous, Parallel Implementation in
    Synthesizable Verilog
3
Specification Language Concepts
  • State (Registers, Memory)
  • Queues (Conceptually Unbounded Length)
  • Modules
      • Read inputs from queues and state
      • Write outputs to queues and state

4
Module Example
[Figure: Register Operand Fetch Module between the Input Queue and the Output Queue]
  • Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
  • Input Queue: <jz r0>, <inc r1>
  • Output Queue: <inc r2 100>, <inc r3 84>
5
Module Example
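[Figure: Register Operand Fetch Module; <inc r1> has left the Input Queue and r1 is sent to the Register File]
  • Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
  • In module: <inc r1>
  • Input Queue: <jz r0>
  • Output Queue: <inc r2 100>, <inc r3 84>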
6
Module Example
[Figure: Register Operand Fetch Module; the Register File returns 43 for r1]
  • Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
  • In module: <inc r1 43>
  • Input Queue: <jz r0>
  • Output Queue: <inc r2 100>, <inc r3 84>
7
Module Example
[Figure: Register Operand Fetch Module; <inc r1 43> has been appended to the Output Queue]
  • Register File: r0 = 0, r1 = 43, r2 = 100, r3 = 84
  • Input Queue: <jz r0>
  • Output Queue: <inc r1 43>, <inc r2 100>, <inc r3 84>
8
Module Behavior
  • Each module has a set of update rules
  • Each update rule consists of
      • Precondition
      • Action (set of updates)
  • Rule is enabled (and can execute) if precondition
    is true in current state
  • When rule executes, atomically applies updates in
    action to produce new state

9
Update Rules in Example
  • If an increment instruction is at the head of
    the input queue and there is no RAW hazard, then
    atomically remove the instruction from the queue,
    fetch the value from the register file, and
    append the instruction with the register value
    to the output queue
      • <INC r> = head(iq) and notin(oq, <INC r _>) ⇒
        iq = tail(iq), oq = append(oq, <INC r rf[r]>)
  • If a jump-on-zero instruction is at the head of
    the input queue and there is no RAW hazard, then
    atomically remove the instruction from the queue,
    fetch the value from the register file, and
    append the instruction with the register value
    to the output queue
      • <JZ r l> = head(iq) and notin(oq, <INC r _>) ⇒
        iq = tail(iq), oq = append(oq, <JZ rf[r] l>)
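As a rough illustration, the two update rules of the Register Operand Fetch Module can be modeled in Python. This is a sketch under assumptions, not the authors' specification language: the tuple encoding of instructions, the helper names `notin_inc` and `operand_fetch`, and the jump label "L" are hypothetical, and the queue head is taken to be index 0.

```python
from collections import deque

def notin_inc(queue, r):
    """True if the queue holds no pending <INC r _> (no RAW hazard on r)."""
    return not any(ins[0] == "INC" and ins[1] == r for ins in queue)

def operand_fetch(iq, oq, rf):
    """Try the two update rules once; return True if one of them fired."""
    if not iq:
        return False
    head = iq[0]
    if head[0] == "INC" and notin_inc(oq, head[1]):
        r = head[1]
        iq.popleft()                   # iq = tail(iq)
        oq.append(("INC", r, rf[r]))   # oq = append(oq, <INC r rf[r]>)
        return True
    if head[0] == "JZ" and notin_inc(oq, head[1]):
        _, r, label = head
        iq.popleft()                   # iq = tail(iq)
        oq.append(("JZ", rf[r], label))  # oq = append(oq, <JZ rf[r] l>)
        return True
    return False

# State taken from the Module Example slides (label "L" is made up).
rf = {"r0": 0, "r1": 43, "r2": 100, "r3": 84}
iq = deque([("INC", "r1"), ("JZ", "r0", "L")])
oq = deque([("INC", "r3", 84), ("INC", "r2", 100)])
fired = operand_fetch(iq, oq, rf)
```

After the call, `<inc r1>` has moved from the input queue to the output queue with the value 43 attached, as in the Module Example slides. A rule whose register still has a pending `<INC r _>` in the output queue does not fire.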

10
From Modules to Systems
  • System is a set of Modules
      • Access same Registers and Memories
      • Also communicate via Queues
  • Behavior of System
      • Update rules from all Modules
  • Queues Provide Modularity
      • Decouple Modules
      • Enable Independent Development
      • Promote Reusable Modular Designs

11
Example System Specification
  • Instruction Fetch Module
      • TRUE ⇒ iq = append(iq, im[pc]), pc = pc + 1
  • Register Operand Fetch Module
      • <INC r> = head(iq) and notin(rq, <INC r _>) ⇒
        iq = tail(iq), rq = append(rq, <INC r rf[r]>)
      • <JZ r l> = head(iq) and notin(rq, <INC r _>) ⇒
        iq = tail(iq), rq = append(rq, <JZ rf[r] l>)
  • Compute and Writeback Module
      • <INC r v> = head(rq) ⇒
        rf = rf[r → v+1], rq = tail(rq)
      • <JZ v l> = head(rq) and (v = 0) ⇒
        pc = l, iq = nil, rq = nil
      • <JZ v l> = head(rq) and (v != 0) ⇒ rq = tail(rq)

12
Abstract Model of Execution
  • Conceptually, system execution is a sequence of
    rule executions
      while TRUE
        choose an enabled rule
        execute rule
        obtain new state
  • Concepts in Abstract Execution Model
      • Rules execute atomically
      • Rules execute asynchronously
      • Rules execute sequentially
      • Unbounded Queues
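The abstract execution loop can be sketched in a few lines of Python. The rule representation (a precondition function paired with an action function), the `run` helper, and the toy producer/consumer rules are all assumptions made for illustration. Unlike the slide's `while TRUE`, this sketch stops when no rule is enabled so that it terminates.

```python
import random

def run(state, rules, max_steps=1000, seed=0):
    """Repeatedly choose one enabled rule and execute it atomically."""
    rng = random.Random(seed)
    for _ in range(max_steps):
        enabled = [action for pre, action in rules if pre(state)]
        if not enabled:
            break                      # nothing can fire; stop the sketch
        rng.choice(enabled)(state)     # one rule at a time: sequential semantics
    return state

def produce(s):
    s["q"].append(s["next"])           # append to an unbounded queue
    s["next"] += 1

def consume(s):
    s["out"].append(s["q"].pop(0))     # remove from the head of the queue

rules = [
    (lambda s: s["next"] < 3, produce),
    (lambda s: bool(s["q"]), consume),
]
final = run({"q": [], "next": 0, "out": []}, rules)
```

Whichever enabled rule is chosen at each step, the queue preserves order, so `final["out"]` ends up as `[0, 1, 2]`. The interleaving is left open, which is the asynchronous part of the model.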

13
Synthesis Algorithm
14
Key Challenge
  • Specification Language
      • Sequential, atomic, asynchronous semantics
      • Conceptually unbounded queues
  • Implemented Circuit
      • Coordinated parallel execution
      • Finite length queues

15
Initial Synthesis Algorithm
  • Symbolically Execute Rules in Order
  • Each rule starts with result from previous rule
  • Obtain Expressions for New Values of Registers,
    Memories, and Queues
  • Generate Combinational Circuit that Produces New
    Values
  • Each clock cycle circuit computes new values,
    writes new values back
  • Every rule gets a chance to execute, every clock
    cycle!

[Figure: chain of symbolic states SE0, SE1, SE2, SE3, with Rule 1, Rule 2 and Rule 3 each taking one state to the next]
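The idea of symbolically executing rules in order can be sketched as follows. This is a toy, not the authors' algorithm: it handles a single integer variable, represents expressions as Python source strings, and uses two made-up rules. Each rule reads the expression left by the previous rule and wraps it in a conditional, so the final string is one combinational expression for the new value.

```python
def symbolic_cycle(rules, var="x0"):
    """Chain the rules: each one starts from the previous rule's result."""
    expr = var
    for pre, upd in rules:
        # "new value if precondition holds, else the old value" (a multiplexer)
        expr = f"({upd(expr)} if {pre(expr)} else {expr})"
    return expr

# Hypothetical rules: (precondition, update), both built as expression text.
RULES = [
    (lambda x: f"({x} == 0)", lambda x: f"({x} + 1)"),
    (lambda x: f"({x} > 0)", lambda x: f"({x} * 2)"),
]

def sequential_cycle(x):
    """Reference semantics: run the same two rules one after another."""
    if x == 0:
        x = x + 1
    if x > 0:
        x = x * 2
    return x

def evaluate(expr, x0):
    """Evaluate the generated expression for a concrete start value."""
    return eval(expr, {"__builtins__": {}}, {"x0": x0})

new_x = symbolic_cycle(RULES)
```

Evaluating `new_x` for any start value gives the same result as `sequential_cycle`, which reflects the claim that the circuit preserves the sequential semantics. The generated expression repeats subexpressions, and the second rule's inputs depend on the first rule's output, which is the long combinational path that relaxation later shortens.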
16
Properties of Initial Algorithm
  • Preserves Semantics of Specification
  • Independent Rules Execute Concurrently
  • But May Have Long Clock Cycle
      • Output of each preceding rule fed in as input to
        next rule
      • Data traverses ALL rules (and pipeline stages) in
        a single cycle!
  • Solution: Relaxation

17
Relaxation
  • for each rule Ri with precondition Pi
      for each variable instance vi in precondition Pi
        replace vi with its earliest safe version

    ...
    Rk-1: Pk-1 -> vk ...
    ...
    Ri: Pi(vi, ...) -> ...
    ...
  • vk safe for vi if either
      • Pi[vk/vi] implies Pi
      • (Pi, Pk-1) mutually exclusive
18
Relaxation Result
  • Relaxation exposes additional parallelism
      • Queues separate pipeline stages
      • Items traverse one stage per clock cycle
  • Safety: If a rule executes in new system
      • Then it also executes in old system
      • And it generates same result
  • Liveness: After relaxation, all rules test
    initial state
      • If rule enabled in old system but not in new
        system, then
      • Some rule executes in new system

19
Global Scheduling
  • Issue
      • Conceptually unbounded queues
      • Finite hardware buffers
  • Solution: Modify append rules such that no queue
    exceeds its specified length
  • Challenge
      • Schedule maximum number of rules
      • Rules can insert into full queues if within
        length at the end of clock cycle

20
Global Scheduling
  • Assumption: queues start within length at
    beginning of cycle
  • Goal: generate circuit that makes queues remain
    within length at end of cycle
  • Basic Approach
      • Before enabled rule executes
      • Be sure there will be room for result in output
        queues at end of clock cycle
  • Key Idea: a rule can insert into a queue as long
    as enough following rules remove from it

21
GS Basic Concepts
  • Rule-Queue Graph
      • Nodes of 2 types: rules and queues
      • Edge from rule node to queue node if rule inserts
        into queue
      • Edge from queue node to rule node if rule removes
        from queue
  • In Example

[Figure: rule-queue graph of the example system, with queue nodes iq and rq]
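A rule-queue graph for the example system might be built as in the sketch below. The rule names R0 to R5 follow the order in which the rules appear in the Example System Specification; that mapping, the treatment of `iq = nil` and `rq = nil` as removals, and the helper names are assumptions. The standard-library `graphlib` module (Python 3.9+) supplies the topological sort.

```python
from graphlib import TopologicalSorter

def rule_queue_graph(inserts, removes):
    """Edges rule -> queue for inserts and queue -> rule for removals."""
    edges = set()
    for rule, queues in inserts.items():
        edges.update((rule, q) for q in queues)
    for rule, queues in removes.items():
        edges.update((q, rule) for q in queues)
    return edges

def topological_order(edges):
    """Order nodes so every edge points forward; raises CycleError on cycles."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())
    return list(TopologicalSorter(preds).static_order())

# Assumed mapping: R0 fetch, R1/R2 operand fetch, R3/R4/R5 compute and writeback.
inserts = {"R0": ["iq"], "R1": ["rq"], "R2": ["rq"]}
removes = {"R1": ["iq"], "R2": ["iq"], "R3": ["rq"],
           "R4": ["iq", "rq"], "R5": ["rq"]}

edges = rule_queue_graph(inserts, removes)
order = topological_order(edges)
```

For this graph the sort succeeds, so the example falls into the acyclic case. A graph with a cycle would make `static_order()` raise `graphlib.CycleError`.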
22
Acyclic Rule-Queue Graphs
  • Process Rules in Topological Sort Order
  • Augment execution precondition
      • If rule inserts into a queue, require that either
          • there is room in queue when rule executes or
          • future rules will execute and remove items to
            make room in queue
  • Each queue has counter of number of elements in
    queue at start of cycle
  • Combinational logic tracks queue insertions and
    deletions
  • GS algorithm generates the control signals for
    the combinational logic

23
Pipeline Implications
  • Counter becomes presence bit for single element
    queues
  • Additional preconditions can be viewed as
    pipeline stall logic
  • Design can be written to generate pipeline
    forwarding/bypassing instead of stall

24
Global Scheduling Example
  • For length(iq) = 1, length(rq) = 1
  • R0 executes and appends to iq if
      • P1' ∨ P2' ∨ P4' OR
      • iq0 = nil
  • R4 doesn't insert into queues
      • ⇒ P4' = P4
  • Apply same rationale for R1, R2
  • R1 executes and appends to rq if
      • P4' ∨ P3' ∨ P5' OR
      • rq0 = nil
  • R3 and R5 don't insert into queues
      • ⇒ P3' = P3, P5' = P5

[Figure: symbolic versions of iq, IQ0 through IQ5, with the substituted preconditions P0, P1[IQ0/IQ1], P2[IQ0/IQ2] and P4 selecting among tail(IQ1), nil and the unchanged queue]
  • GS1(rq) = GS2(rq) = (rq0 = nil) ∨ P4 ∨ P3 ∨ P5
  • GS0(iq) = (iq0 = nil) ∨ P4 ∨ ((P1 ∨ P2) ∧
    ((rq0 = nil) ∨ P3 ∨ P5))
  • = (iq0 = nil) ∨ P4 ∨ P1' ∨ P2'
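The scheduling conditions for single-element iq and rq can be checked with plain Boolean functions. The transcript has lost the logical operators, so the formulas below are one reading of the slide, not a confirmed copy: Pi' is taken to mean Pi ∧ GSi for a rule that inserts into a queue, and GS1(rq) = GS2(rq) = (rq0 = nil) ∨ P4 ∨ P3 ∨ P5. The function and parameter names are hypothetical.

```python
from itertools import product

def gs_rq(rq_empty, p3, p4, p5):
    """Room in rq at end of cycle: empty at start, or a later rule removes."""
    return bool(rq_empty or p4 or p3 or p5)

def gs_iq_primed(iq_empty, rq_empty, p1, p2, p3, p4, p5):
    """GS0(iq) written with augmented preconditions P1' and P2'."""
    p1_aug = p1 and gs_rq(rq_empty, p3, p4, p5)   # P1' = P1 and GS1(rq)
    p2_aug = p2 and gs_rq(rq_empty, p3, p4, p5)   # P2' = P2 and GS2(rq)
    return bool(iq_empty or p4 or p1_aug or p2_aug)

def gs_iq_expanded(iq_empty, rq_empty, p1, p2, p3, p4, p5):
    """GS0(iq) with P1' and P2' substituted out and P4 absorbed."""
    return bool(iq_empty or p4 or ((p1 or p2) and (rq_empty or p3 or p5)))

# Exhaustive check over all 2**7 input combinations.
equivalent = all(
    gs_iq_primed(*bits) == gs_iq_expanded(*bits)
    for bits in product([False, True], repeat=7)
)
```

Under this reading the two forms agree on every input. The conditions also show the stall behavior: with both queues full, P1 true, and no rule removing from rq, `gs_iq_primed(False, False, True, False, False, False, False)` is false, so R0 may not append to iq in that cycle.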

25
Cyclic Rule-Queue Graphs
  • Cyclic Graphs lead to Cyclic Dependences
  • Rule 1 depends on rule 2 to remove an item from a
    queue
  • But rule 2 depends on rule 1 to remove an item
    from another queue
  • Algorithm from acyclic case would generate
    recursive preconditions

[Figure: cyclic rule-queue graph in which rule 1 and rule 2 are connected through Queue x and Queue y]
26
Cyclic R-Q Graphs Example
  • Let P1' = P1 ∧ GS1
  • Assumption: R1 executes (P1' = TRUE)
  • Find group of rules that must fire together
  • P1' = P1 ∧ ((x = nil) ∨ P2')
        = P1 ∧ ((x = nil) ∨ (P2 ∧ ((y = nil) ∨ P1')))
  • No need to explore P1' further (P1' = TRUE) ⇒
    P1' = P1 ∧ ((x = nil) ∨ P2)

27
Solution to Cyclic Dependence Problem
  • Key Idea: no deadlock if we can coordinate
    removals and insertions from/to all queues in
    cycle such that removals make room for insertions
  • Groups of rules must execute together
  • Use depth-first search on rule-queue graph to
    find cyclic groups
  • Augment preconditions to allow all rules in cycle
    to execute together
  • Extensions include paths into and out of cyclic
    group
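Finding a cyclic group with depth-first search can be sketched as below. The adjacency-list encoding, the `find_cycle` helper, and the edge directions chosen for the two-rule example (rule 1 inserts into x, rule 2 removes from x and inserts into y, rule 1 removes from y) are assumptions. The slides do not say how the search is implemented.

```python
def find_cycle(graph, start):
    """Return the nodes of one cycle reachable from start, or [] if none."""
    path, on_path, finished = [], set(), set()

    def dfs(node):
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            if nxt in on_path:                 # back edge closes a cycle
                return path[path.index(nxt):]
            if nxt not in finished:
                found = dfs(nxt)
                if found:
                    return found
        path.pop()
        on_path.discard(node)
        finished.add(node)
        return []

    return dfs(start)

# Edges: rule -> queue it inserts into, queue -> rule that removes from it.
cyclic = {"rule1": ["x"], "x": ["rule2"], "rule2": ["y"], "y": ["rule1"]}
acyclic = {"R0": ["iq"], "iq": ["R1"], "R1": []}

cycle = find_cycle(cyclic, "rule1")
cyclic_rules = [n for n in cycle if n.startswith("rule")]
```

Here `cyclic_rules` contains both rules, which is the group whose preconditions would have to be augmented so that they can fire together. The acyclic graph yields an empty list.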

28
Cyclic R-Q Graphs Algorithm
  • SymbolicExecution(Ri, CrtPath)
      for each queue q that Ri inserts into
        for each rule Rj that inserts/removes in/from q
          newRj = if Rj ∈ CrtPath
                  then TRUE    (rule already examined)
                  else SymbolicExecution(Rj)
          newCrtPath = if Rj ∈ CrtPath
                       then CrtPath
                       else CrtPath ∪ {Rj}
          replace Rj with newRj in GSi(q)
      GSi = ∧ GSi(q)    (conjunction over the queues q)
      Ri' = Ri ∧ GSi
29
Symbolic Execution
  • Substitute out all intermediate versions of
    variables
  • Obtain expression for last version of each
    variable
  • Each expression defines new value of
    corresponding variable

30
Optimizations
  • Optimize expressions from symbolic execution
  • CSE (common subexpression elimination): avoid
    unnecessary replication of HW
  • Mutual Exclusion Testing
      • Eliminate computation of values that never occur
        in practice as result of mutually exclusive
        preconditions
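One common way to get common subexpression elimination on expression trees is hash-consing: identical (operator, operands) nodes are stored once and shared. The slides do not say which CSE method the synthesis system uses, so the sketch below is a stand-in technique, and the `ExprTable` class and operator names are hypothetical.

```python
class ExprTable:
    """Stores each distinct expression node once and hands out its index."""

    def __init__(self):
        self._ids = {}     # (op, operand ids) -> node index
        self.nodes = []    # node index -> (op, operand ids)

    def node(self, op, *args):
        key = (op, args)
        if key not in self._ids:       # first occurrence: create the node
            self._ids[key] = len(self.nodes)
            self.nodes.append(key)
        return self._ids[key]          # later occurrences share the node

table = ExprTable()
x = table.node("var", "x")
one = table.node("const", 1)
first = table.node("add", x, one)      # x + 1, built for one rule
second = table.node("add", x, one)     # the same x + 1, built for another rule
product = table.node("mul", first, second)
```

Both requests for `x + 1` return the same index, so a generator walking `table.nodes` would emit a single adder and feed it to both uses.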

31
Verilog Generation
  • Synthesize HW directly from expressions
      • Each queue as one or more registers
      • Each memory variable as library block
      • Each state variable as one or more registers,
        depending on type
      • Each expression as combinational logic that feeds
        back into corresponding registers

32
Experimental Results
  • We have implemented synthesis system
  • Used system to generate synthesizable Verilog for
    several specifications
  • (map effort medium, area effort low)

Architecture               Cycle (MHz)   Area
RISC Pipelined Processor   88.89         23195.25
SCU RTL 98 DSP             90.91         22999.50

Benchmark                  Cycle (MHz)   Area
Bubblesort                 107.06        5434
Butterfly                  104.42        5411
Filter                     105.01        3757
33
Conclusion
  • Starting Point (Good for Designer)
      • Modular, Asynchronous, Sequential Specification
        with Conceptually Infinite Queues
  • Ending Point (Good for Implementation)
      • Efficient, Synchronous, Globally Scheduled,
        Parallel Implementation with Finite Queues in
        Synthesizable Verilog
  • Variety of Techniques
      • Symbolic Execution
      • Global Scheduling