Title: High-level Synthesis: An Essential Ingredient for Designing Complex ASICs
Slide 1: High-level Synthesis: An Essential Ingredient for Designing Complex ASICs
- Arvind, Rishiyur S. Nikhil, Daniel L. Rosenband, Nirav Dave
- Massachusetts Institute of Technology, CSAIL
- Bluespec Inc.
- ICCAD 2004
- November 10, 2004
Slide 2: The Designer's Dilemma
- Chips are becoming larger and larger: more and more gates to design in constant time.
- The architect creates a less precise spec that describes larger blocks and interfaces, e.g. "This block takes a 32b address and returns the longest prefix match with the SRAM table."
- The designer is responsible for the micro-architecture and intra-block interfaces.
- Without design exploration, the designer makes an educated guess, which leads to sub-optimal implementations.
Slide 3: High-level Synthesis to the Rescue?
- High-level synthesis promises
  - Higher level of abstraction than RTL
  - Faster design time
  - More code reuse
- Why hasn't high-level synthesis worked?
  - Tools have attempted to derive micro-architectures automatically
  - This ignores the designer's ingenuity
  - Unpredictable results
  - Poor synthesis results (area, timing, power)
Our high-level synthesis flow avoids these pitfalls by presenting a high level of abstraction while allowing the designer to specify the micro-architecture.
Slide 4: Outline
- The IP lookup problem
- Three different implementations and their synthesis results
- Additional case studies
- Why did Bluespec help?
Slide 5: IP Lookup block in a router
[Figure: router block diagram. Labels: Arbitration, Line Card (LC), Packet Processor, Control Processor, SRAM (lookup table), Switch, Queue Manager, IP Lookup, Exit functions.]
Slide 6: The IP lookup problem
- Packets are routed (at line rate: 15 Mpps for 10GE) based on the Longest Prefix Match (LPM) of the packet's IP address (32b) with entries in a routing table
- Variable number of memory lookups required
- Packets must be output in order of arrival
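As a point of reference for the longest-prefix-match rule stated above, here is a minimal software sketch. It is only an illustration: the table entries, the port names, and the function name `longest_prefix_match` are invented for this example and do not come from the talk, and a linear scan over prefixes is not how the SRAM-based hardware lookup works.

```python
import ipaddress

# Hypothetical routing table: (prefix, output port). All entries are made up.
ROUTES = [
    ("10.0.0.0/8", "A"),
    ("10.1.0.0/16", "B"),
    ("10.1.2.0/24", "C"),
    ("0.0.0.0/0", "default"),
]

def longest_prefix_match(addr, routes=ROUTES):
    """Return the port of the most specific prefix that contains addr."""
    ip = ipaddress.ip_address(addr)
    best_len, best_port = -1, None
    for prefix, port in routes:
        net = ipaddress.ip_network(prefix)
        # Keep the matching entry with the longest prefix seen so far.
        if ip in net and net.prefixlen > best_len:
            best_len, best_port = net.prefixlen, port
    return best_port
```

With this table, an address inside `10.1.2.0/24` matches three prefixes plus the default route, and the /24 entry wins because it is the longest.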
[Figure: example lookups against the IP lookup table.]
Slide 7: Sparse tree representation
[Figure: sparse tree representation of the IP lookup table.]
Slide 8: Table representation issues
- Real-world lookup algorithms are more complex, but most follow a sequence of dependent memory references.
- Major challenges:
  - small memory footprint
  - conserving memory bandwidth
  - reasonable latency
  - table updates must be possible
- Constraint: results must be returned in order
- Given a lookup algorithm, the designer still faces many micro-architectural choices
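The phrase "a sequence of dependent memory references" can be made concrete with a small multi-level table walk. The sketch below is an assumption-laden illustration, not the table format from the talk: the 16-8-8 split of the address, the `("leaf", port)` / `("node", base)` word encoding, the base addresses, and the names `build_memory` and `lookup` are all hypothetical.

```python
# Each memory word is ("leaf", port) or ("node", base_address_of_child_table).
LEAF, NODE = "leaf", "node"

def build_memory():
    """Build a tiny, made-up table using a 16-8-8 bit split of the address."""
    mem = {}
    # Level 1: indexed directly by the top 16 address bits.
    mem[0x0A00] = (LEAF, "A")           # 10.0.x.x resolves in 1 access
    mem[0x0A01] = (NODE, 0x10000)       # 10.1.x.x needs a second access
    # Level 2 table at 0x10000: indexed by the next 8 bits.
    mem[0x10000 + 2] = (NODE, 0x20000)  # 10.1.2.x needs a third access
    mem[0x10000 + 3] = (LEAF, "B")      # 10.1.3.x resolves in 2 accesses
    # Level 3 table at 0x20000: indexed by the last 8 bits.
    mem[0x20000 + 7] = (LEAF, "C")      # 10.1.2.7 resolves in 3 accesses
    return mem

def lookup(mem, addr, default="default"):
    """Return (port, number_of_memory_accesses) for a 32-bit address."""
    chunks = [(addr >> 16) & 0xFFFF, (addr >> 8) & 0xFF, addr & 0xFF]
    base, accesses = 0, 0
    for chunk in chunks:
        accesses += 1
        # Missing words are treated as leaves pointing at the default route.
        kind, value = mem.get(base + chunk, (LEAF, default))
        if kind == LEAF:
            return value, accesses
        base = value  # the next read address depends on this read's result
    raise ValueError("level-3 entries must be leaves")
```

Because each read address is computed from the previous read's data, the accesses cannot be issued in parallel, and different addresses finish after a different number of reads.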
Slide 9: Outline
- The IP lookup problem
- Three different implementations and their synthesis results
- Additional case studies
- Why did Bluespec help?
Slide 10: Longest Prefix Match for IP lookup: 3 possible implementation architectures
Circular pipeline: efficient memory use with the most complex control
Designer's Ranking
Which is best?
Slide 11: 1. Rigid Static Scheduling
- Assume the SRAM containing the table has n-cycle latency; statically schedule memory accesses to avoid conflicts
- Issues:
  - Since an LPM may take 1-3 memory accesses, unused slots may be left idle
  - May have to reschedule the pipeline for a different memory latency
  - Very difficult to plan if memory is also to be used for some unrelated task
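To see why idle slots matter, one can estimate how much of the reserved memory bandwidth a static schedule actually uses. This is a back-of-the-envelope sketch under a stated assumption: the schedule reserves the worst-case number of slots (3) for every lookup. The function name `static_slot_utilization` and the sample access counts are hypothetical, not measurements from the talk.

```python
def static_slot_utilization(access_counts, reserved_slots=3):
    """Fraction of statically reserved memory slots that are actually used.

    access_counts: memory accesses each lookup really needed (1 to 3).
    reserved_slots: slots the static schedule sets aside per lookup (assumed).
    """
    if not access_counts:
        raise ValueError("need at least one lookup")
    if max(access_counts) > reserved_slots:
        raise ValueError("schedule reserves too few slots")
    used = sum(access_counts)
    reserved = reserved_slots * len(access_counts)
    return used / reserved
```

For example, four lookups needing 1, 2, 3, and 1 accesses use 7 of the 12 reserved slots, so a little over 40% of the memory slots sit idle under this assumption.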
Slide 12: 2. Adaptive Linear Pipeline
[Figure: adaptive linear pipeline. Labels: RAM, IP Address Table, port replicator, rom0, rom1, rom2, fifo0, fifo1, fifo2, ofifo. Stage actions: start lookup 1; finish lookup 1 (start lookup 2); finish lookup 2 (start lookup 3); finish lookup 3. Annotation: memory is used efficiently.]
- Each pipeline stage accesses the memory only if required
- Advantages: better memory utilization, easy design, robust to changes in memory latencies
- Issues: FIFO sizing, FIFO area, latency
Slide 13: 3. Flexible Circular Pipeline
[Figure: flexible circular pipeline. Labels: lpmReq, lpmResp, getTicket, Completion buffer, done, tf, IP Address Table, RAM, node, leaf, Enter, Move, Circulate, Complete, ops.]
- Completion buffer
  - gives out tokens to control the entry into the circular pipeline
  - ensures that departures take place in order even if lookups complete out-of-order
- Advantages: robust to changes in memory latency, easy to alter lookup algorithm
- Disadvantages: more complicated control, gate-count?
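The two jobs of the completion buffer listed above (handing out tokens and restoring arrival order) can be sketched in software as a small ring buffer. This is an illustrative model only: the class name `CompletionBuffer`, its method names, and the ring-buffer layout are assumptions for this example and are not taken from the BSV design in the talk.

```python
class CompletionBuffer:
    """Hands out tokens in order and releases results in token order."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size    # results waiting to depart
        self.valid = [False] * size   # True once a slot's result has arrived
        self.head = 0                 # oldest outstanding token
        self.tail = 0                 # next token to hand out
        self.count = 0                # tokens currently outstanding

    def get_token(self):
        """Reserve a slot; returns None when the buffer is full."""
        if self.count == self.size:
            return None
        token = self.tail
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return token

    def complete(self, token, result):
        """Record a finished lookup; may be called in any order."""
        self.slots[token] = result
        self.valid[token] = True

    def drain(self):
        """Return the results that can depart now, oldest token first."""
        out = []
        while self.count > 0 and self.valid[self.head]:
            out.append(self.slots[self.head])
            self.valid[self.head] = False
            self.head = (self.head + 1) % self.size
            self.count -= 1
        return out
```

A full buffer refuses new tokens, which throttles entry into the pipeline, and `drain` never releases a result while an older token is still outstanding, even if younger lookups have already completed.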
Slide 14: Experimental setup
[Figure: experimental setup. Labels: routing table (real or random); compiler; targeted test data (IP addrs, expected routes); forwarding table; testbench; SRAM (lookup table); load SRAM; SRAM address; SRAM data; generate packets (random IP addr); IP address; routing info; check against expected result; pass/fail; expected routes.]
- Implemented the three architectures in BSV (Bluespec System Verilog) and Verilog (RTL)
- Synthesized designs to compare gate count and timing
- Used the common test infrastructure to verify all 6 implementations
Slide 15: Synthesis results
Synthesized to TSMC 0.18 µm library
- V: Verilog
- BSV: Bluespec System Verilog
Bluespec and Verilog synthesis results are nearly identical
Slide 16: Static pipeline exploration: One spec., two designers, two results
Each packet is processed by one FSM
Shared FSM
Slide 17: Static pipeline exploration: Bluespec
- BSV Data Alignment
  - Automatically packs complex data-types into bits
  - Simplifies design process but is not always optimal for usage
- BSV Type System
  - Types provide a level of safety (correct-by-construction design) not found in other hardware languages
  - Conversions from one type to another can introduce extra logic
- This problem has since been solved by Bluespec Inc.
Slide 18: Static pipeline exploration: summary
- Variations within a design
  - Implementation choice has dramatic impact on performance
  - Bluespec results can match carefully coded Verilog
  - Thinking about micro-architecture is important!
- Much more important than language differences!!
These issues are more serious for larger designs
Slide 19: Outline
- The IP lookup problem
- Three different implementations and their synthesis results
- Additional case studies
- Why did Bluespec help?
Slide 20: Beyond the paper ...
- Is BSV (Bluespec in System Verilog) for real?
  - More examples to compare the quality of results: BSV vs. Verilog
  - Bigger examples to showcase the productivity
- Usual caveats
  - mileage varies from designer to designer
  - controlled experiments to measure productivity, especially for large designs, are difficult
Slide 21: Case Study: Pkt
- In an apples-to-apples comparison with a product ASIC coded in Verilog, the in-house Bluespec team demonstrated:
  - 4 man-months to complete 1.5M gates
  - Passed the full regression test suite
  - 13x reduction in source code (from 66K Lines-of-Verilog-Code design to 4.7K lines of code)
  - 66% reduction in verification bugs
  - Matched performance (clock speed, area)
  - Enabled major design space explorations within time budgets
200 MHz, 1.5M gates, 0.18 µm
Slide 22: Case Study: MPEG4 design blocks
[Figure: MPEG4 decoder block diagram. Labels: YUV data, MPEG4 Stream, Motion Compensation, IDCT, Inverse Quantization, Inverse AC/DC Prediction, Inverse Scan, Video Bitstream Decoder.]
- Inverse Discrete Cosine Transform (IDCT)
  - Lines of code: 2716 (Verilog), 723 (BSV)
  - Total effort: 4 man-weeks (Verilog), 2.5 man-weeks (BSV)
  - Area (gates): 52K (Verilog), 48K (BSV)
- Motion compensation decoding
  - 1184 lines of BSV
  - Arch: 6 weeks; Coding: 3 weeks; Verif: 5 weeks
  - 180 nm TSMC, 10 nsec cycle time (7.52 nsec slack)
Slide 23: Outline
- The IP lookup problem
- Three different implementations and their synthesis results
- Additional case studies
- Why did Bluespec help?
Slide 24: Bluespec significantly reduces lines of code
[Figure: charts titled "Design Examples: Lines of Code" and bugs versus lines of code.]
Note: Typical RTL design metric: there are approximately 1-2 bugs for every 100 lines of code
- Depends on factors like
  - Designer
  - Architectural Complexity
  - Quality of static verification
  - Level of correct-by-construction
  - Re-use
Slide 25: So how did Bluespec help?
- Module interfaces are more than wires
  - capture protocol
  - strongly typed
  - FIFO, Port Replicator, Completion Buffer, ...
- Automatic rule scheduling
  - Rule atomicity helps in identifying conflicts
  - Eliminates subtle bugs in control logic
  - very important in some designs
- Rich high-level, two-level, modern language
  - Data structures and Polymorphic Type system
  - Rules, actions, modules, ... are all first-class objects in the language
  - exploitation of these features requires training in modern high-level programming
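The idea of rule atomicity can be illustrated with a toy model in which each rule is a guard plus an action and a whole rule fires as one indivisible step. This is a Python sketch of the concept, not BSV and not how the Bluespec compiler schedules hardware: firing exactly one rule per step, picking the first enabled rule, and the names `run_rules` and `GCD_RULES` are simplifying assumptions made for this example.

```python
def run_rules(state, rules, max_steps=1000):
    """Fire one enabled rule at a time until no rule's guard holds."""
    for _ in range(max_steps):
        enabled = [r for r in rules if r["guard"](state)]
        if not enabled:
            return state
        # A whole rule fires as one atomic step; the first enabled rule
        # is picked here, standing in for a real scheduler's choice.
        state = enabled[0]["action"](state)
    raise RuntimeError("rules did not quiesce")

# Euclid's GCD by subtraction, written as two guarded rules.
GCD_RULES = [
    {"guard": lambda s: s["x"] > s["y"] and s["y"] != 0,
     "action": lambda s: {"x": s["y"], "y": s["x"]}},           # swap
    {"guard": lambda s: s["x"] <= s["y"] and s["y"] != 0,
     "action": lambda s: {"x": s["x"], "y": s["y"] - s["x"]}},  # subtract
]
```

Because each action reads the old state and produces the new state in one step, no other rule can observe a half-updated `x` and `y`; that is the property that makes conflicts between rules easier to reason about. Starting from positive `x` and `y`, the rules stop when `y` reaches 0 and `x` holds the GCD.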
Slide 26: Conclusion
- Micro-architecture exploration is important to the design process
- High-level synthesis is important for rapid design exploration
  - without RTL it is difficult to optimize the architecture for area, time or power
- High-level synthesis must allow the designer to express micro-architectures
- High-level synthesis (BSV-style) produces timing and area results comparable to Verilog
Slide 27: Thank you!