L05-1 - PowerPoint PPT Presentation

About This Presentation
Title:

L05-1

Description:

Architectural Exploration: Area-Power tradeoff in 802.11a transmitter design Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology – PowerPoint PPT presentation

Slides: 42
Provided by: Nik1
Learn more at: http://csg.csail.mit.edu

Transcript and Presenter's Notes

Title: L05-1


1
  • Architectural Exploration
  • Area-Power tradeoff in 802.11a transmitter design
  • Arvind
  • Computer Science & Artificial Intelligence Lab
  • Massachusetts Institute of Technology

2
This lecture has two purposes
  • Illustrate how area-power tradeoff can be studied
    at a high-level for a realistic design
  • Example: 802.11a transmitter
  • Illustrate some features of BSV
  • Static elaboration
  • Combinational circuits
  • Simple synchronous pipelines
  • Valid bits as the Maybe type in BSV

No prior understanding of 802.11a is necessary to
follow this lecture
3
Bluespec Two-Level Compilation
Bluespec (Objects, Types, Higher-order functions)
  • Lennart Augustsson
  • @Sandburst 2000-2002
  • Type checking
  • Massive partial evaluation and static
    elaboration

Level 1 compilation
Rules and Actions (Term Rewriting System)
  • Rule conflict analysis
  • Rule scheduling

Level 2 synthesis
  • James Hoe & Arvind
  • @MIT 1997-2000

Object code (Verilog/C)
4
Static Elaboration
  • At compile time
  • Inline function calls and datatypes
  • Instantiate modules with specific parameters
  • Resolve polymorphism/overloading

5
802.11a Transmitter Overview
Must produce one OFDM symbol every 4 µsec
[Figure: transmitter overview; labels: headers, data, 24 uncoded bits]
6
Preliminary results
[MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind

Design Block     Lines of Code (BSV)   Relative Area (%)
Controller        49                    0
Scrambler         40                    0
Conv. Encoder    113                    0
Interleaver       76                    1
Mapper           112                   11
IFFT              95                   85
Cyc. Extender     23                    3

Complex arithmetic libraries constitute another
200 lines of code
7
Combinational IFFT
All numbers are complex and represented as two
sixteen bit quantities. Fixed-point arithmetic is
used to reduce area, power, ...
8
4-way Butterfly Node
function Vector#(4,Complex) bfly4
        (Vector#(4,Complex) t, Vector#(4,Complex) k);
  • BSV has a very strong notion of types
  • Every expression has a type. Either it is
    declared by the user or automatically deduced by
    the compiler
  • The compiler verifies that the type declarations
    are compatible

9
BSV code 4-way Butterfly
function Vector#(4,Complex) bfly4
        (Vector#(4,Complex) t, Vector#(4,Complex) k);
   Vector#(4,Complex) m = newVector(),
                      y = newVector(),
                      z = newVector();
   m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
   m[2] = k[2] * t[2]; m[3] = k[3] * t[3];
   y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
   y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);
   z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
   z[2] = y[0] - y[2]; z[3] = y[1] - y[3];
   return(z);
endfunction

Polymorphic code: works on any type of numbers
for which *, + and - have been defined
Note: Vector does not mean storage
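The butterfly on this slide can be modelled in a few lines of Python to see what it computes. This is a sketch, not the BSV source: Python's built-in complex type stands in for the fixed-point Complex type, and the constant i is passed as a parameter so the same code works for any number type that supports *, + and -. With all twiddles set to 1, the result is the unnormalized 4-point inverse DFT of the inputs.

```python
def bfly4(t, k, i=1j):
    """Python model of the slide's bfly4: t holds twiddles, k holds inputs."""
    # Twiddle multiplications.
    m = [k[n] * t[n] for n in range(4)]
    # First rank of additions and subtractions.
    y = [m[0] + m[2], m[0] - m[2],
         m[1] + m[3], i * (m[1] - m[3])]
    # Second rank of additions and subtractions.
    z = [y[0] + y[2], y[1] + y[3],
         y[0] - y[2], y[1] - y[3]]
    return z


unit_twiddles = [1, 1, 1, 1]
print(bfly4(unit_twiddles, [1, 2, 3, 4]))
```

Because the function only builds new values from its arguments, it mirrors the point that a Vector here is wiring between operators rather than storage.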
10
Combinational IFFT
stage_f function
repeat it three times
11
BSV Code Combinational IFFT
function SVector#(64, Complex) ifft
        (SVector#(64, Complex) in_data);
   //Declare vectors
   SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector);
   stage_data[0] = in_data;
   for (Integer stage = 0; stage < 3; stage = stage + 1)
      stage_data[stage+1] = stage_f(stage, stage_data[stage]);
   return(stage_data[3]);
endfunction
The for loop is unfolded and stage_f is inlined
during static elaboration
Note: no notion of loops or procedures during
execution
12
BSV Code Combinational IFFT- Unfolded
function SVector#(64, Complex) ifft
        (SVector#(64, Complex) in_data);
   //Declare vectors
   SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector);
   stage_data[0] = in_data;
   for (Integer stage = 0; stage < 3; stage = stage + 1)
      stage_data[stage+1] = stage_f(stage, stage_data[stage]);
   return(stage_data[3]);
endfunction

The for loop unfolds into:
   stage_data[1] = stage_f(0, stage_data[0]);
   stage_data[2] = stage_f(1, stage_data[1]);
   stage_data[3] = stage_f(2, stage_data[2]);

stage_f can be inlined now; it could also have been
inlined before loop unfolding. Does the
order matter?
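The unfolding on this slide can be mimicked with a toy "elaborator" in Python. This is only an illustration of the idea, not how the Bluespec compiler is implemented: the loop runs while the statement list is being generated, and what comes out is straight-line text with no loop left in it. The function name elaborate_ifft is made up for this sketch.

```python
def elaborate_ifft(num_stages=3):
    """Unroll the stage loop into straight-line statements (toy model)."""
    statements = []
    # This loop executes in the elaborator, not in the generated design.
    for stage in range(num_stages):
        statements.append(
            f"stage_data[{stage + 1}] = stage_f({stage}, stage_data[{stage}]);"
        )
    return statements


for line in elaborate_ifft():
    print(line)
```

Calling it with the default of three stages reproduces the three unfolded assignments shown on the slide.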
13
Bluespec Code for stage_f
function SVector#(64, Complex) stage_f
        (Bit#(2) stage, SVector#(64, Complex) stage_in);
begin
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage, fromInteger(i));
      let y = bfly4(twid, stage_in[idx:idx+3]);
      stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
      stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
   end
   //Permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
end
return(stage_out);
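A Python model of one stage may help make the data flow concrete: 16 butterflies over consecutive groups of four values, followed by a fixed permutation. The slides do not list the twiddle values or the permutation table, so get_twiddle and PERMUTE below are hypothetical placeholders chosen only so the sketch runs; they are not the 802.11a IFFT constants.

```python
def bfly4(t, k, i=1j):
    """Python model of the 4-way butterfly: t holds twiddles, k holds inputs."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], i * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


def get_twiddle(stage, i):
    # Hypothetical placeholder: the real twiddle factors are not in the slides.
    return [1, 1, 1, 1]


# Hypothetical placeholder permutation (a 16x4 transpose of the indices).
PERMUTE = [(j % 16) * 4 + j // 16 for j in range(64)]


def stage_f(stage, stage_in):
    """One IFFT stage: 16 butterflies, then a permutation of all 64 values."""
    stage_temp = [None] * 64
    for i in range(16):
        idx = i * 4
        twid = get_twiddle(stage, i)
        stage_temp[idx:idx + 4] = bfly4(twid, stage_in[idx:idx + 4])
    # Permutation: pure rewiring, no arithmetic.
    return [stage_temp[PERMUTE[j]] for j in range(64)]


out = stage_f(0, list(range(64)))
print(out[:4])
```

In hardware the permutation costs only wires; in this model it is a list comprehension over a constant index table.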

14
Architectural Exploration
15
Design Alternatives
  • Reuse a block over multiple cycles

We expect:
  • Throughput to decrease (less parallelism)
  • Area to decrease (reusing a block)

The clock needs to run faster for the same
throughput ⇒ hyper-linear increase in energy
Energy/unit work?
More on power issues later
16
Combinational IFFT: Opportunity for reuse
Reuse the same circuit three times
17
Circular pipeline: Reusing the Pipeline Stage
64, 4-way Muxes
Stage Counter
16 Bfly4s can be shared but not the three
permutations. Hence the need for muxes
18
Superfolded circular pipeline: Just one Bfly-4
node!
19
Algorithmic Improvements
1. All the three permutations can be made
   identical ⇒ more saving in area in the folded
   case
2. One multiplication can be removed from
   Bfly-4
20
Area improvements because of change in Algorithm
21
Which design consumes the least energy to
transmit a symbol?
  • Can we quickly code up all the alternatives?
  • single source with parameters?

Not practical in traditional hardware description
languages like Verilog/VHDL
22
Pipelining a block
Clock: C < P ≈ FP
Area: FP < C < P
Throughput: FP < C < P
(C = combinational, P = pipelined, FP = folded pipeline)
23
Synchronous pipeline
rule sync_pipeline (True);
   inQ.deq();
   sReg1 <= f1(inQ.first());
   sReg2 <= f2(sReg1);
   outQ.enq(f3(sReg2));
endrule

This rule can fire only if
  • inQ has an element
  • outQ has space

Atomicity: Either all or none of the state
elements inQ, outQ, sReg1 and sReg2 will be
updated

This is real IFFT code; just replace f1, f2 and
f3 with stage_f code
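The firing condition and atomicity of this rule can be sketched as a small Python simulation. This is a behavioral model under stated assumptions, not generated hardware: the class name SyncPipeline, the out_capacity parameter and the reset value 0 for the registers are all made up for the example. Each call to step models one clock cycle; all reads see the old state and all updates are committed together, or nothing changes.

```python
from collections import deque


class SyncPipeline:
    """Toy model of the three-stage synchronous pipeline rule."""

    def __init__(self, f1, f2, f3, out_capacity=2):
        self.f1, self.f2, self.f3 = f1, f2, f3
        self.in_q, self.out_q = deque(), deque()
        self.out_capacity = out_capacity   # assumed size of outQ
        self.s_reg1 = 0                    # assumed reset values
        self.s_reg2 = 0

    def can_fire(self):
        # The rule needs an element in inQ and space in outQ.
        return len(self.in_q) > 0 and len(self.out_q) < self.out_capacity

    def step(self):
        """One clock cycle: fire the rule atomically, or change nothing."""
        if not self.can_fire():
            return False
        # Read the old state first, then commit every update together.
        x, r1, r2 = self.in_q.popleft(), self.s_reg1, self.s_reg2
        self.s_reg1 = self.f1(x)
        self.s_reg2 = self.f2(r1)
        self.out_q.append(self.f3(r2))
        return True


pipe = SyncPipeline(lambda x: x + 1, lambda x: x * 2, lambda x: x - 3,
                    out_capacity=8)
pipe.in_q.extend([10, 20, 30])
while pipe.step():
    pass
print(list(pipe.out_q))
```

In this model the first two outputs are computed from the reset values of the registers rather than from real inputs, and the last two inputs stay stuck in the registers once inQ runs dry.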
24
Stage functions f1, f2 and f3
function f1(x);
   return (stage_f(1,x));
endfunction

function f2(x);
   return (stage_f(2,x));
endfunction

function f3(x);
   return (stage_f(3,x));
endfunction

The stage_f function is given on slide 13
25
Problem: What about pipeline bubbles?
Red and Green tokens must move even if there is
nothing in the inQ!

rule sync_pipeline (True);
   inQ.deq();
   sReg1 <= f1(inQ.first());
   sReg2 <= f2(sReg1);
   outQ.enq(f3(sReg2));
endrule

Also, if there is no token in sReg2 then nothing
should be enqueued in the outQ
Valid bits or the Maybe type
Modify the rule to deal with these conditions
26
The Maybe type data in the pipeline
typedef union tagged {
   void   Invalid;
   data_T Valid;
} Maybe#(type data_T);

Registers contain Maybe type values

rule sync_pipeline (True);
   if (inQ.notEmpty())
      begin
         sReg1 <= tagged Valid (f1(inQ.first()));
         inQ.deq();
      end
   else sReg1 <= tagged Invalid;
   case (sReg1) matches
      tagged Valid .sx1: sReg2 <= tagged Valid (f2(sx1));
      tagged Invalid:    sReg2 <= tagged Invalid;
   endcase
   case (sReg2) matches
      tagged Valid .sx2: outQ.enq(f3(sx2));
   endcase
endrule
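The valid-bit version of the rule can be modelled in Python with None standing in for Invalid. This is a sketch under assumptions: the helper names step and run are invented here, the output queue is treated as unbounded, and each entry of arrivals represents what shows up at the input in one cycle (None meaning nothing arrives).

```python
from collections import deque


def step(in_q, out_q, s_reg1, s_reg2, f1, f2, f3):
    """One firing of the rule; returns the next (s_reg1, s_reg2).

    None models the Invalid tag; any other value models Valid data.
    """
    next1 = f1(in_q.popleft()) if in_q else None
    next2 = f2(s_reg1) if s_reg1 is not None else None
    if s_reg2 is not None:
        out_q.append(f3(s_reg2))
    return next1, next2


def run(arrivals, f1, f2, f3):
    """Clock the pipeline once per entry in arrivals."""
    in_q, out_q = deque(), deque()
    r1 = r2 = None                      # pipeline starts empty (Invalid)
    for item in arrivals:
        if item is not None:
            in_q.append(item)
        r1, r2 = step(in_q, out_q, r1, r2, f1, f2, f3)
    return list(out_q)


result = run([1, None, 2, None, None, None],
             lambda x: x + 1, lambda x: x * 2, lambda x: x - 3)
print(result)
```

Bubbles now travel through the registers as None, so tokens keep moving when the input is idle and nothing spurious is enqueued at the output.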
27
Folded pipeline
The same code will work for superfolded pipelines
by changing n and stage function f
rule folded_pipeline (True);
   if (stage==1)
      begin sxIn = inQ.first(); inQ.deq(); end
   else sxIn = sReg;
   sxOut = f(stage, sxIn);
   if (stage==n) outQ.enq(sxOut);
   else sReg <= sxOut;
   stage <= (stage==n) ? 1 : stage+1;
endrule
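The folded rule can be exercised with a short Python model. It is a simplified sketch: the class name FoldedPipeline is invented, the output queue is unbounded, and the rule is assumed to stall only when stage is 1 and the input queue is empty (a real compiler may schedule the rule more conservatively).

```python
from collections import deque


class FoldedPipeline:
    """Toy model of the folded pipeline: one stage function reused n times."""

    def __init__(self, f, n):
        self.f, self.n = f, n
        self.stage = 1                  # stages are numbered 1..n as on the slide
        self.s_reg = None
        self.in_q, self.out_q = deque(), deque()

    def step(self):
        """One clock cycle; returns False if the rule could not fire."""
        if self.stage == 1:
            if not self.in_q:
                return False            # nothing to start on
            sx_in = self.in_q.popleft()
        else:
            sx_in = self.s_reg
        sx_out = self.f(self.stage, sx_in)
        if self.stage == self.n:
            self.out_q.append(sx_out)
        else:
            self.s_reg = sx_out
        self.stage = 1 if self.stage == self.n else self.stage + 1
        return True


# A stage function that records which stages were applied, in order.
pipe = FoldedPipeline(lambda stage, x: x * 10 + stage, n=3)
pipe.in_q.append(0)
while pipe.step():
    pass
print(list(pipe.out_q))
```

Changing n and f is all that distinguishes the folded design from the superfolded ones in this model, which matches the remark on the slide.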
28
802.11a Transmitter Synthesis results (Only the
IFFT block is changing)
IFFT Design                Area (mm²)   Throughput Latency (CLKs/sym)   Min. Freq Required
Pipelined                  5.25         04                              1.0 MHz
Combinational              4.91         04                              1.0 MHz
Folded (16 Bfly-4s)        3.97         04                              1.0 MHz
Super-Folded (8 Bfly-4s)   3.69         06                              1.5 MHz
SF (4 Bfly-4s)             2.45         12                              3.0 MHz
SF (2 Bfly-4s)             1.84         24                              6.0 MHz
SF (1 Bfly-4)              1.52         48                              12 MHz

All these designs were done in less than 24 hours!
TSMC 0.18 micron; numbers reported are before
place and route.
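The "Min. Freq Required" column follows from the clocks-per-symbol column by simple arithmetic. The sketch below assumes a symbol period of 4 µs, which is the 802.11a OFDM symbol time and is also what 4 clocks per symbol at 1.0 MHz implies; the function name is made up for the example.

```python
SYMBOL_PERIOD_US = 4.0   # assumed: one OFDM symbol every 4 microseconds


def min_freq_mhz(clks_per_symbol):
    """Lowest clock rate that still delivers one symbol per period."""
    # Cycles per microsecond is the same number as megahertz.
    return clks_per_symbol / SYMBOL_PERIOD_US


for clks in (4, 6, 12, 24, 48):
    print(clks, "CLKs/sym ->", min_freq_mhz(clks), "MHz")
```

Every halving of the number of Bfly-4 nodes doubles the clocks per symbol, and with it the clock rate needed to keep up.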
29
Why are the areas so similar?
  • Folding should have given a 3x improvement in
    IFFT area
  • BUT a constant twiddle allows low-level
    optimization on a Bfly-4 block
  • a 2.5x area reduction!

30
Summary
  • It is essential to do architectural exploration
    for better (area, power, performance, ...)
    designs.
  • It is possible to do so with new design tools and
    methodologies, i.e., Bluespec
  • Better and faster tools for estimating area,
    timing and power would dramatically increase our
    capability to do architectural exploration.

31
Bluespec Learnings
  • How to write highly parameterized combinational
    codes
  • How to write rules for simple synchronous
    pipelines
  • Effect of dynamic vs static values on generated
    circuits
  • Using Maybe types to express valid/invalid data

Thanks
32
Backup slides
33
Function f for the folded pipeline is the same
stage_f function but ...
function SVector#(64, Complex) stage_f
        (Bit#(2) stage, SVector#(64, Complex) stage_in);
begin
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage, fromInteger(i));
      let y = bfly4(twid, stage_in[idx:idx+3]);
      stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
      stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
   end
   //Permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
end
return(stage_out);

will cause a mux to be generated
34
Folded pipeline stage function f
35
Function f for the Superfolded pipeline (One
Bfly-4 case)
  • f will be invoked for 48 dynamic values of stage
  • each invocation will modify 4 numbers in sReg
  • after 16 invocations a permutation would be done
    on the whole sReg

36
Code for the Superfolded pipeline stage function
function SVector#(64, Complex) f
        (Bit#(6) stage, SVector#(64, Complex) stage_in);
begin
   let idx = stage mod 16;
   let twid = getTwiddle(stage div 16, idx);
   let y = bfly4(twid, stage_in[idx:idx+3]);
   stage_temp = stage_in;
   stage_temp[idx]   = y[0];
   stage_temp[idx+1] = y[1];
   stage_temp[idx+2] = y[2];
   stage_temp[idx+3] = y[3];
   for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
end
return((idx == 15) ? stage_out : stage_temp);

One Bfly-4 case
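One way to build confidence in the superfolded scheme is to check, in a Python model, that 48 single-butterfly steps give the same result as three full stages. The sketch rests on several assumptions: the twiddle values and the permutation are hypothetical placeholders (the slides do not list them), and the butterfly with index idx is taken to operate on elements 4*idx to 4*idx+3 so that the 16 butterflies of a stage touch disjoint groups of four.

```python
def bfly4(t, k, i=1j):
    """Python model of the 4-way butterfly: t holds twiddles, k holds inputs."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], i * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


UNITS = (1, 1j, -1, -1j)


def get_twiddle(stage, i):
    # Hypothetical values that vary with stage and index, for testing only.
    return [1, UNITS[stage % 4], UNITS[i % 4], UNITS[(stage + i) % 4]]


# Hypothetical placeholder permutation (a 16x4 transpose of the indices).
PERMUTE = [(j % 16) * 4 + j // 16 for j in range(64)]


def stage_f(stage, stage_in):
    """Reference: one full stage with 16 butterflies and a permutation."""
    temp = [None] * 64
    for i in range(16):
        temp[4 * i:4 * i + 4] = bfly4(get_twiddle(stage, i),
                                      stage_in[4 * i:4 * i + 4])
    return [temp[PERMUTE[j]] for j in range(64)]


def f_superfolded(stage, s_reg):
    """One step with a single butterfly; stage runs over 0..47."""
    idx = stage % 16
    base = 4 * idx                      # assumed element offset of this butterfly
    temp = list(s_reg)
    temp[base:base + 4] = bfly4(get_twiddle(stage // 16, idx),
                                s_reg[base:base + 4])
    # Permute the whole register only after the 16th butterfly of a stage.
    if idx == 15:
        return [temp[PERMUTE[j]] for j in range(64)]
    return temp


def run_superfolded(data):
    s_reg = list(data)
    for stage in range(48):
        s_reg = f_superfolded(stage, s_reg)
    return s_reg


def run_unfolded(data):
    for stage in range(3):
        data = stage_f(stage, data)
    return data


print(run_superfolded(list(range(64))) == run_unfolded(list(range(64))))
```

The two runs agree because each butterfly reads and writes the same four positions, so updating the register in place between steps never disturbs a value that a later butterfly of the same stage still needs.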
37
Experimental Results
  • Nirav Dave, Mike Pellauer, Steve Gerding, Arvind
  • MEMOCODE 2006

38
Expressing these designs in Bluespec was easy
Combinational
Pipelined
Folded (16 Bfly-4s)
Super-Folded (8 Bfly-4s)
Super-Folded (4 Bfly-4s)
Super-Folded (2 Bfly-4s)
Super-Folded (1 Bfly-4)
  • All these designs were done in less than one day!
  • Designers were experts in Bluespec
  • Area and power estimates?

39
Bluespec Tool flow
Bluespec SystemVerilog source
Bluespec Compiler
Verilog 95 RTL
Verilog sim
RTL synthesis
gates
FPGA
40
802.11a Transmitter Synthesis results (Only the
IFFT block is changing)
IFFT Design                Area (mm²)   Symbol Latency (CLKs)   Throughput Latency (CLKs/sym)   Min. Freq Required   Average Power (mW)
Pipelined                  5.25         12                      04                              1.0 MHz              4.92
Combinational              4.91         10                      04                              1.0 MHz              3.99
Folded (16 Bfly-4s)        3.97         12                      04                              1.0 MHz              7.27
Super-Folded (8 Bfly-4s)   3.69         15                      06                              1.5 MHz              10.9
SF (4 Bfly-4s)             2.45         21                      12                              3.0 MHz              14.4
SF (2 Bfly-4s)             1.84         33                      24                              6.0 MHz              21.1
SF (1 Bfly-4)              1.52         57                      48                              12 MHz               34.6

TSMC 0.18 micron; numbers reported are before
place and route (DesignCompiler). Power numbers
are from Sequence PowerTheater.
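The table gives average power, while the question posed earlier in the lecture is about energy per symbol. A rough conversion is possible under two assumptions that the slides do not state explicitly: every design is clocked so that it delivers exactly one symbol per 4 µs, and the power figures were taken at that operating point. Under those assumptions, energy per symbol is average power times the symbol period, and milliwatts times microseconds gives nanojoules directly.

```python
SYMBOL_PERIOD_US = 4.0   # assumed: one OFDM symbol every 4 microseconds

# Average power in mW, copied from the table above.
avg_power_mw = {
    "Pipelined": 4.92,
    "Combinational": 3.99,
    "Folded (16 Bfly-4s)": 7.27,
    "Super-Folded (8 Bfly-4s)": 10.9,
    "SF (4 Bfly-4s)": 14.4,
    "SF (2 Bfly-4s)": 21.1,
    "SF (1 Bfly-4)": 34.6,
}


def energy_per_symbol_nj(power_mw):
    """mW x microseconds = nJ, under the constant-symbol-rate assumption."""
    return power_mw * SYMBOL_PERIOD_US


energy_nj = {name: energy_per_symbol_nj(p) for name, p in avg_power_mw.items()}
for name, e in sorted(energy_nj.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} {e:7.2f} nJ/symbol")
```

With a fixed symbol rate the energy ranking is the same as the power ranking, so by these numbers the smallest design is also the most expensive per symbol.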
41
Power can be reduced further
  • Right now all blocks in the transmitter run on
    the same clock
  • ⇒ if we run IFFT faster then all other blocks
    also run faster
  • Bluespec has facilities for Multiple Clock
    Domains and the design can be easily modified to
    run the earlier blocks at a lower clock rate