Title: L041
1 Introduction to Bluespec: A new methodology for designing hardware
- Arvind
- Computer Science & Artificial Intelligence Lab.
- Massachusetts Institute of Technology
- February 11, 2009
2 What is needed to make hardware design easier
- Extreme IP reuse
- Multiple instantiations of a block for different performance and application requirements
- Packaging of IP so that the blocks can be assembled easily to build a large system (black box model)
- Ability to do modular refinement
- Whole system simulation to enable concurrent hardware-software development
3 IP Reuse sounds wonderful until you try it ...
Example: Commercially available FIFO IP block
These constraints are spread over many pages of the documentation...
No machine verification of such informal constraints is feasible
Bluespec can change all this
4 Bluespec promotes composition through guarded interfaces
- Self-documenting interfaces
- Automatic generation of logic to eliminate conflicts in use
[Figure labels: theModuleA, theModuleB]
5 Bluespec: A new way of expressing behavior using Guarded Atomic Actions
Bluespec
- Formalizes composition
- Modules with guarded interfaces
- Compiler manages connectivity (muxing and associated control)
- Powerful static elaboration facility
- Permits parameterization of designs at all levels
- Transaction level modeling
- Allows C and Verilog code to be encapsulated in Bluespec modules
- Smaller, simpler, clearer, more correct code
- not just simulation, synthesis as well
6 Bluespec: State and Rules organized into modules
All state (e.g., Registers, FIFOs, RAMs, ...) is explicit.
Behavior is expressed in terms of atomic actions on the state:
  Rule: guard → action
Rules can manipulate state in other modules only via their interfaces.
7 GCD: A simple example to explain hardware generation from Bluespec
8 Programming with rules: A simple example
- Euclid's algorithm for computing the Greatest Common Divisor (GCD):
- 15 6
- 9 6 subtract
- 3 6 subtract
- 6 3 swap
- 3 3 subtract
- 0 3 subtract
answer: 3
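The trace above can be reproduced with a few lines of Python. This is only a sketch of the subtract/swap idea, not the BSV module: the function name `gcd_trace` and the step labels are illustrative, and both inputs are assumed to be positive.

```python
def gcd_trace(a, b):
    """Euclid's algorithm by repeated subtraction, recording every step.

    Returns (answer, steps) where each step is (a, b, operation).
    Assumes a > 0 and b > 0 (an assumption of this sketch).
    """
    steps = [(a, b, "start")]
    while a != 0:
        if a >= b:
            a = a - b            # subtract the smaller value from the larger
            steps.append((a, b, "subtract"))
        else:
            a, b = b, a          # swap so the larger value comes first
            steps.append((a, b, "swap"))
    return b, steps


answer, steps = gcd_trace(15, 6)
for a, b, op in steps:
    print(a, b, op)
print("answer", answer)
```

For the inputs 15 and 6 the printed pairs follow the same sequence as the slide, ending with the pair 0 3.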
9 GCD in BSV
module mkGCD (I_GCD);
  Reg#(Int#(32)) x <- mkRegU;
  Reg#(Int#(32)) y <- mkReg(0);

  rule swap ((x > y) && (y != 0));
    x <= y; y <= x;
  endrule
  rule subtract ((x <= y) && (y != 0));
    y <= y - x;
  endrule

  method Action start(Int#(32) a, Int#(32) b) if (y == 0);
    x <= a; y <= b;
  endmethod
  method Int#(32) result() if (y == 0);
    return x;
  endmethod
endmodule
Assume a != 0
10 GCD Hardware Module
[Figure: GCD hardware module; label: implicit conditions]
interface I_GCD;
  method Action start (Int#(32) a, Int#(32) b);
  method Int#(32) result();
endinterface
- The module can easily be made polymorphic
- In a GCD call t could be Int#(32), UInt#(16), Int#(13), ...
- Many different implementations can provide the same interface: module mkGCD (I_GCD)
11 GCD: Another implementation
module mkGCD (I_GCD);
  Reg#(Int#(32)) x <- mkRegU;
  Reg#(Int#(32)) y <- mkReg(0);

  rule swapANDsub ((x > y) && (y != 0));
    x <= y; y <= x - y;
  endrule
  rule subtract ((x <= y) && (y != 0));
    y <= y - x;
  endrule

  method Action start(Int#(32) a, Int#(32) b) if (y == 0);
    x <= a; y <= b;
  endmethod
  method Int#(32) result() if (y == 0);
    return x;
  endmethod
endmodule
Does it compute faster ?
Does it take more resources ?
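One way to explore the first question is to count rule firings in a small Python model of the two rule sets. This is a sketch under stated assumptions: one rule fires per cycle, every action reads the old values of x and y, and a != 0. The helper `run_rules` and the rule tuples are illustrative names. The model says nothing about area or clock period, so it does not address the second question.

```python
def run_rules(a, b, rules):
    """Fire one enabled rule per cycle until y == 0; return (result, cycles).

    `rules` is a list of (guard, action) pairs over the state (x, y).
    Assumes a != 0, as on the slide.
    """
    x, y = a, b
    cycles = 0
    while y != 0:
        for guard, action in rules:
            if guard(x, y):
                x, y = action(x, y)   # right-hand sides read the old state
                break
        cycles += 1
    return x, cycles


subtract = (lambda x, y: x <= y and y != 0, lambda x, y: (x, y - x))
swap = (lambda x, y: x > y and y != 0, lambda x, y: (y, x))
swap_and_sub = (lambda x, y: x > y and y != 0, lambda x, y: (y, x - y))

original = [swap, subtract]        # first mkGCD
fused = [swap_and_sub, subtract]   # second mkGCD with swapANDsub

print(run_rules(15, 6, original))  # (3, 6)
print(run_rules(15, 6, fused))     # (3, 4)
```

In this model the fused rule saves one cycle for every swap, because the swap and the following subtract happen in the same firing.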
12 Bluespec Tool flow
Works in conjunction with existing tool flows
13 Generated Verilog RTL: GCD
module mkGCD(CLK,RST_N,start_a,start_b,EN_start,RDY_start,
             result,RDY_result);
  input CLK;
  input RST_N;
  // action method start
  input [31 : 0] start_a;
  input [31 : 0] start_b;
  input EN_start;
  output RDY_start;
  // value method result
  output [31 : 0] result;
  output RDY_result;
  // register x and y
  reg [31 : 0] x;
  wire [31 : 0] x$D_IN;
  wire x$EN;
  reg [31 : 0] y;
  wire [31 : 0] y$D_IN;
  wire y$EN;
  ...
  // rule RL_subtract
  assign WILL_FIRE_RL_subtract = x_SLE_y___d3 && !y_EQ_0___d10 ;
  // rule RL_swap
  assign WILL_FIRE_RL_swap = !x_SLE_y___d3 && !y_EQ_0___d10 ;
  ...
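14 Generated Hardware
[Figure: datapath for registers x and y]
x_en = swap?
y_en = swap? OR subtract?
15 Generated Hardware Module
[Figure: datapath with subtractor (sub) and method ports]
x_en = swap? OR start_en
y_en = swap? OR subtract? OR start_en
rdy = (y == 0)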
16 GCD: A Simple Test Bench
module mkTest ();
  Reg#(Int#(32)) state <- mkReg(0);
  I_GCD gcd <- mkGCD();

  rule go (state == 0);
    gcd.start (423, 142);
    state <= 1;
  endrule
  rule finish (state == 1);
    $display ("GCD of 423 & 142 = %d", gcd.result());
    state <= 2;
  endrule
endmodule
Why do we need the state variable?
Is there any timing issue in displaying the
result?
No. Because the finish rule cannot execute until
gcd.result is ready
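The role of the implicit condition can be illustrated with a toy Python model. It is a sketch, not a description of what the compiler generates: the class `GuardedGCD`, its `ready` and `step` methods, and the serialized loop in `run_testbench` are assumptions made for illustration (real hardware fires the test bench rules and the GCD rules concurrently, and x starts undefined with mkRegU rather than 0).

```python
class GuardedGCD:
    """Toy model of mkGCD: each method has a guard (implicit condition)."""

    def __init__(self):
        self.x, self.y = 0, 0          # y == 0 means idle / result ready

    def ready(self):
        return self.y == 0             # guard shared by start and result

    def start(self, a, b):
        assert self.ready(), "start called while busy"
        self.x, self.y = a, b

    def result(self):
        assert self.ready(), "result read before it is ready"
        return self.x

    def step(self):
        """Fire one internal rule (swap or subtract) if one is enabled."""
        if self.y != 0:
            if self.x > self.y:
                self.x, self.y = self.y, self.x
            else:
                self.y -= self.x


def run_testbench(a, b):
    gcd, state, printed = GuardedGCD(), 0, None
    while state != 2:
        if state == 0 and gcd.ready():      # rule go
            gcd.start(a, b)
            state = 1
        elif state == 1 and gcd.ready():    # rule finish: blocked by the guard
            printed = gcd.result()
            state = 2
        else:
            gcd.step()                      # GCD rules make progress
    return printed


print(run_testbench(423, 142))  # 1
```

The `finish` branch is only taken once `ready()` holds, which mirrors why the rule cannot display a stale value. Without the `state` variable the `go` branch would be enabled again as soon as the result became ready.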
17 GCD Test Bench
module mkTest ();
  Reg#(Int#(32)) state <- mkReg(0);
  Reg#(Int#(4)) c1 <- mkReg(1);
  Reg#(Int#(7)) c2 <- mkReg(1);
  I_GCD gcd <- mkGCD();

  rule req (state == 0);
    gcd.start(signExtend(c1), signExtend(c2));
    state <= 1;
  endrule
  rule resp (state == 1);
    $display ("GCD of %d & %d = %d", c1, c2, gcd.result());
    if (c1 == 7) begin c1 <= 1; c2 <= c2 + 1; end
    else c1 <= c1 + 1;
    if (c1 == 7 && c2 == 63) state <= 2; else state <= 0;
  endrule
endmodule
Feeds all pairs (c1,c2), 1 <= c1 <= 7, 1 <= c2 <= 63, to GCD
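The counter logic of the resp rule can be checked in Python by enumerating the pairs in the same order and comparing a subtract/swap GCD against `math.gcd`. This is a minimal sketch: `all_pairs` and `gcd_by_subtraction` are illustrative names, and the model ignores bit widths.

```python
from math import gcd


def gcd_by_subtraction(a, b):
    """Same swap/subtract steps as the mkGCD rules; assumes a != 0."""
    x, y = a, b
    while y != 0:
        if x > y:
            x, y = y, x
        else:
            y -= x
    return x


def all_pairs():
    """Yield (c1, c2) in the order the resp rule visits them."""
    c1, c2 = 1, 1
    while True:
        yield c1, c2
        if c1 == 7 and c2 == 63:
            return                    # corresponds to state <= 2
        if c1 == 7:
            c1, c2 = 1, c2 + 1
        else:
            c1 += 1


pairs = list(all_pairs())
mismatches = [(a, b) for a, b in pairs if gcd_by_subtraction(a, b) != gcd(a, b)]
print(len(pairs), len(mismatches))  # 441 0
```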
18 GCD Synthesis results
- Original (16 bits)
- Clock Period: 1.6 ns
- Area: 4240 µm²
- Unrolled (16 bits)
- Clock Period: 1.65 ns
- Area: 5944 µm²
- Unrolled takes 31 fewer cycles on the testbench
19 Rule scheduling and the synthesis of a scheduler
20 GAA Execution model
- Repeatedly:
- Select a rule to execute
- Compute the state updates
- Make the state updates
User annotations can help in rule selection
Implementation concern: Schedule multiple rules concurrently without violating one-rule-at-a-time semantics
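The execution model can be sketched as a small Python interpreter. The names `run_gaa`, `guard`, and `action`, and the use of a seeded random choice for rule selection, are assumptions of this sketch; the slides do not prescribe how a rule is selected.

```python
import random


def run_gaa(state, rules, max_steps=1000, seed=0):
    """One-rule-at-a-time execution of guarded atomic actions."""
    rng = random.Random(seed)
    state = dict(state)
    for _ in range(max_steps):
        enabled = [r for r in rules if r["guard"](state)]
        if not enabled:
            break                          # no rule can fire
        rule = rng.choice(enabled)         # select a rule to execute
        updates = rule["action"](state)    # compute the state updates
        state.update(updates)              # make the state updates
    return state


# Two illustrative rules that are often enabled at the same time.
rules = [
    {"name": "incx", "guard": lambda s: s["x"] < 3,
     "action": lambda s: {"x": s["x"] + 1}},
    {"name": "incy", "guard": lambda s: s["y"] < 3,
     "action": lambda s: {"y": s["y"] + 1}},
]

print(run_gaa({"x": 0, "y": 0}, rules))  # {'x': 3, 'y': 3}
```

Different seeds interleave the two rules differently, but each firing is atomic, so every run ends in the same final state for this particular pair of rules.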
21 Rule As a State Transformer
- A rule may be decomposed into two parts p(s) and d(s) such that
- s_next = if p(s) then d(s) else s
- p(s) is the condition (predicate) of the rule, a.k.a. the CAN_FIRE signal of the rule. p is a conjunction of explicit and implicit conditions
- d(s) is the state transformation function, i.e., computes the next-state values from the current state values
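22 Compiling a Rule
rule r (f.first() > 0);
  x <= x + 1; f.deq ();
endrule
[Figure: the current state (f, x) feeds p and d; p drives enable, d drives the next state values (f, x)]
- Inputs from the current state: rdy signals, read methods
- Outputs to the next state: enable signals, action parameters
p = enabling condition; d = action signals & values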
23 Combining State Updates: strawman
[Figure: update logic for a register R]
- p1 ... pn (p's from the rules that update R) -> OR -> latch enable
- d's from the rules that update R -> OR -> next state value
What if more than one rule is enabled?
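24 Combining State Updates
[Figure: update logic for a register R with a scheduler]
- p1 ... pn (p's from all the rules) -> Scheduler: Priority Encoder -> f1 ... fn
- f1 ... fn -> OR -> latch enable
- d's from the rules that update R -> OR -> next state value
Scheduler ensures that at most one fi is true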
25 One-rule-at-a-time Scheduler
[Figure: Scheduler: Priority Encoder with inputs p1, p2, ..., pn and outputs f1, f2, ..., fn]
1. fi ⇒ pi
2. p1 ∨ p2 ∨ .... ∨ pn ⇒ f1 ∨ f2 ∨ .... ∨ fn
3. One rewrite at a time, i.e. at most one fi is true
Very conservative way of guaranteeing correctness
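A priority encoder with these three properties can be sketched in Python as follows. The function name `priority_encoder` and the choice that the lowest index wins are assumptions of this sketch; the loop at the end checks the three properties exhaustively for four rules.

```python
from itertools import product


def priority_encoder(p):
    """Map CAN_FIRE bits p to WILL_FIRE bits f; the lowest index wins."""
    f, granted = [], False
    for can_fire in p:
        fire = can_fire and not granted
        f.append(fire)
        granted = granted or fire
    return f


for p in product([False, True], repeat=4):
    f = priority_encoder(p)
    assert all(pi or not fi for pi, fi in zip(p, f))   # 1. fi implies pi
    assert (not any(p)) or any(f)                      # 2. some p implies some f
    assert sum(f) <= 1                                 # 3. at most one fi is true

print(priority_encoder([False, True, True, False]))
# [False, True, False, False]
```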
26 Executing Multiple Rules Per Cycle: Conflict-free rules
rule ra (z > 10); x <= x + 1; endrule
rule rb (z > 20); y <= y + 2; endrule
Parallel execution behaves like ra < rb or equivalently rb < ra
Rulea and Ruleb are conflict-free if
∀s . pa(s) ∧ pb(s) ⇒
  1. pa(db(s)) ∧ pb(da(s))
  2. da(db(s)) == db(da(s))
Parallel Execution can also be understood in terms of a composite rule
rule ra_rb;
  if (z>10) then x <= x+1;
  if (z>20) then y <= y+2;
endrule
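The conflict-free conditions can be checked on sample states in Python. This is a sketch, not a proof: the state is modeled as a dictionary, `pa`, `da`, `pb`, `db` follow the slide's notation, and the small grid of states is an arbitrary choice.

```python
from itertools import product


def pa(s): return s["z"] > 10
def da(s): return {**s, "x": s["x"] + 1}
def pb(s): return s["z"] > 20
def db(s): return {**s, "y": s["y"] + 2}


def parallel(s):
    """Composite rule ra_rb: both actions read the old state s."""
    nxt = dict(s)
    if pa(s):
        nxt["x"] = s["x"] + 1
    if pb(s):
        nxt["y"] = s["y"] + 2
    return nxt


def conflict_free_on(states):
    """Check conditions 1 and 2 on every sample state where both rules fire."""
    for s in states:
        if pa(s) and pb(s):
            if not (pa(db(s)) and pb(da(s))):
                return False
            if da(db(s)) != db(da(s)):
                return False
    return True


states = [{"x": x, "y": y, "z": z}
          for x, y, z in product(range(3), range(3), (5, 15, 25))]
both = [s for s in states if pa(s) and pb(s)]

print(conflict_free_on(states))                                  # True
print(all(parallel(s) == db(da(s)) == da(db(s)) for s in both))  # True
```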
27 Mutually Exclusive Rules
- Rulea and Ruleb are mutually exclusive if they can never be enabled simultaneously
- ∀s . pa(s) ⇒ ¬pb(s)
Mutually-exclusive rules are Conflict-free by definition
28 Executing Multiple Rules Per Cycle: Sequentially Composable rules
rule ra (z > 10); x <= y + 1; endrule
rule rb (z > 20); y <= y + 2; endrule
Parallel execution behaves like ra < rb
- Rulea and Ruleb are sequentially composable if
- ∀s . pa(s) ∧ pb(s) ⇒
- 1. pb(da(s))
- 2. Prj_R(Rb)(db(s)) == Prj_R(Rb)(db(da(s)))
Parallel Execution can also be understood in terms of a composite rule
rule ra_rb;
  if (z>10) then x <= y+1;
  if (z>20) then y <= y+2;
endrule
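For the rules ra (x gets y + 1 when z > 10) and rb (y gets y + 2 when z > 20), a short Python sketch shows that parallel execution matches the order ra then rb but not rb then ra. The dictionary state and the sample values are assumptions chosen for illustration.

```python
def pa(s): return s["z"] > 10
def da(s): return {**s, "x": s["y"] + 1}   # ra reads y, writes x
def pb(s): return s["z"] > 20
def db(s): return {**s, "y": s["y"] + 2}   # rb reads y, writes y


def parallel(s):
    """Both actions read the old state, as registers do within one cycle."""
    nxt = dict(s)
    if pa(s):
        nxt["x"] = s["y"] + 1
    if pb(s):
        nxt["y"] = s["y"] + 2
    return nxt


s = {"x": 0, "y": 5, "z": 25}
print(parallel(s))   # {'x': 6, 'y': 7, 'z': 25}
print(db(da(s)))     # ra then rb: {'x': 6, 'y': 7, 'z': 25}
print(da(db(s)))     # rb then ra: {'x': 8, 'y': 7, 'z': 25}
```

Running rb first changes the value of y that ra reads, which is why only the order ra < rb is equivalent to the parallel result.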
29 Multiple-Rules-per-Cycle Scheduler
Divide the rules into smallest conflicting groups; provide a scheduler for each group
1. fi ⇒ pi
2. p1 ∨ p2 ∨ .... ∨ pn ⇒ f1 ∨ f2 ∨ .... ∨ fn
3. Multiple operations such that fi ∧ fj ⇒ Ri and Rj are conflict-free or sequentially composable
30 Compiler determines if two rules can be executed in parallel
Rulea and Ruleb are conflict-free if
∀s . pa(s) ∧ pb(s) ⇒
  1. pa(db(s)) ∧ pb(da(s))
  2. da(db(s)) == db(da(s))
D(Ra) ∩ R(Rb) = ∅ ∧ D(Rb) ∩ R(Ra) = ∅ ∧ R(Ra) ∩ R(Rb) = ∅
- Rulea and Ruleb are sequentially composable if
- ∀s . pa(s) ∧ pb(s) ⇒
- 1. pb(da(s))
- 2. Prj_R(Rb)(db(s)) == Prj_R(Rb)(db(da(s)))
D(Rb) ∩ R(Ra) = ∅
These conditions are sufficient but not necessary
These properties can be determined by examining
the domains and ranges of the rules in a pairwise
manner.
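The pairwise domain/range test can be sketched with Python sets. Here a rule's domain D is taken to be the set of state elements it reads and its range R the set it writes; the dictionaries below describe the ra/rb examples from slides 26 and 28, and the function names are illustrative.

```python
def conflict_free(ra, rb):
    """Sufficient test: neither rule reads or writes what the other writes."""
    return (not (ra["reads"] & rb["writes"])
            and not (rb["reads"] & ra["writes"])
            and not (ra["writes"] & rb["writes"]))


def sequentially_composable(ra, rb):
    """Sufficient test for 'parallel execution behaves like ra < rb'."""
    return not (rb["reads"] & ra["writes"])


# Slide 26: ra is x <= x + 1, rb is y <= y + 2, both guarded on z.
cf_ra = {"reads": {"z", "x"}, "writes": {"x"}}
cf_rb = {"reads": {"z", "y"}, "writes": {"y"}}

# Slide 28: ra is x <= y + 1, rb is y <= y + 2, both guarded on z.
sc_ra = {"reads": {"z", "y"}, "writes": {"x"}}
sc_rb = {"reads": {"z", "y"}, "writes": {"y"}}

print(conflict_free(cf_ra, cf_rb))             # True
print(conflict_free(sc_ra, sc_rb))             # False
print(sequentially_composable(sc_ra, sc_rb))   # True
print(sequentially_composable(sc_rb, sc_ra))   # False
```

Because the test only looks at read and write sets, it can reject pairs that would in fact be safe, which matches the remark that the conditions are sufficient but not necessary.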
Parallel execution of CF and SC rules does not
increase the critical path delay
31 Muxing structure
- Muxing logic requires determining for each register (action method) the rules that update it and under what conditions
If two CF rules update the same element then they must be mutually exclusive (p1 ⇒ ¬p2)
32 Scheduling and control logic
[Figure: Modules (current state) -> Rules (p1 ... pn, d1 ... dn) -> Scheduler (CAN_FIRE p1 ... pn in, WILL_FIRE f1 ... fn out) -> Muxing (cond, action) -> Modules (next state)]