Title: Automatic Synthesis and Optimization of Floating Point Hardware
1. Automatic Synthesis and Optimization of Floating Point Hardware
- Ho Chun Hok
- Department of Computer Science and Engineering
- The Chinese University of Hong Kong
- 18 July 2003
2. Overview
- Introduction
- Fly: Modifiable Compiler
- Float: Floating Point Library
- Function Generator
- Results
- Conclusion
3. Introduction
- Hardware Description Language (HDL) based design has shortcomings
- Hardware designs are parallel, but people think in von Neumann patterns
- Decomposing a hardware design into datapath and control signals is complex
- Errors may be introduced during the translation
- Debugging hardware is harder than debugging software
- A hardware interface for the FPGA board must be developed
- The designer must have a strong background in hardware design
4. Introduction
- Elementary Functions are not supported
- No floating point arithmetic
- No standard mathematical library like <math.h>
- No log, sin, cos, 1/x, etc.
- The size of FPGA is limited.
- Area is an essential factor of a design
5. Motivations
- Is it possible to use a single description for both software and hardware design?
- Can we optimize floating point arithmetic in hardware to save resources?
- In hardware design, can we introduce a mathematical library just as software does?
6. Objectives
- Main goal: develop hardware on an FPGA with the smallest effort
- No need to be familiar with hardware design
- The compilation from description to hardware is transparent to the designer
- Floating point arithmetic is supported
- An elementary mathematical library is provided, as in software programming
7. Contributions
- A framework with 3 modules is developed
- Fly: Modifiable Hardware Compiler
- Translates a description into a datapath
- Float: Floating Point Arithmetic Library
- Provides parameterized floating point operators and an optimization engine
- Function Generator
- Generates any differentiable function; can be regarded as a mathematical library
8. Contributions
- Applications developed using this framework
- Greatest Common Divisor Coprocessor
- Digital Sine-Cosine Generator (DSCG)
- Ordinary Differential Equation Solver (ODE)
- N-Body Problem Simulator
- Ranging from fixed point designs to floating point ones
9. Contribution: Traditional Design Flow
10. Contribution: Revised Design Flow
The hardware process is transparent to the designer
11. Fly: Hardware Compiler
12. Introduction
- Fly is easily extensible
- Source code can be easily understood and modified
- Support common programming constructs
- Fly language supports
- Register assignment
- Parallel statements
- If else branches
- While loops
- Built-in functions
- Comments
13. Fly Programming Language
Main elements
14. Fly Programming Language
- Compilation Technique
- Uses Page's compilation technique
- Each statement has associated start and end signals
- Fly constructs a one-hot state machine (i.e. the control part of the hardware design) from the program by cascading the signals
- The Fly compiler implementation is simple and concise due to the use of Perl as the development language
- One-pass compilation
- Outputs VHDL code
- Can support different FPGA and ASIC design tools
- Gives synthesis tools the opportunity to perform further logic optimization
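The start/end cascading described on this slide can be imitated in software. The following is a minimal Python sketch, not Fly's generated VHDL: it assumes a straight-line program in which every statement takes exactly one cycle, and the function name `run_one_hot` is hypothetical.

```python
def run_one_hot(n_statements, cycles):
    """Simulate a linear chain of one-hot state flip-flops.

    state[k] == 1 means statement k is executing in this cycle. The end
    signal of statement k is wired to the start signal of statement k + 1.
    """
    state = [0] * n_statements
    start = 1                      # global start pulse, high for one cycle
    trace = []
    for _ in range(cycles):
        # Each flip-flop latches the end signal of its predecessor.
        state = [start] + state[:-1]
        start = 0
        trace.append(list(state))
    return trace


trace = run_one_hot(3, 4)
for cycle, bits in enumerate(trace):
    print(cycle, bits)
```

At most one bit is set in any cycle, which is what makes the controller one-hot. Loops and branches would route the end signal back to an earlier flip-flop or to one of two successors.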
15. Application I - GCD Example
    {
        s = din1; l = din2;
        while (s != l) {
            a = l - s;
            if (a > 0) {
                l = a;
            }
            else {
                [s = l; l = s;]  # swap
            }
        }
        dout1 = l;
    }
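The GCD program above is a subtraction-based Euclidean algorithm. As a software reference, here is a minimal Python sketch of the same control flow. It assumes both inputs are positive integers (the loop does not terminate if one input is zero), and the function name `gcd_subtract` is hypothetical.

```python
def gcd_subtract(din1, din2):
    """Subtraction-based GCD mirroring the Fly example's control flow."""
    s, l = din1, din2
    while s != l:
        a = l - s
        if a > 0:
            l = a            # l was larger: replace it with the difference
        else:
            s, l = l, s      # parallel assignment models the hardware swap
    return l


print(gcd_subtract(12, 18))
```

The tuple assignment matters: in hardware both registers update on the same clock edge, so the swap needs no temporary register.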
16. Resultant Datapath
- s = din1; l = din2
- else [s = l; l = s]
17. Resultant Datapath
18. Resultant Datapath
19. Host Interface (register)
20. Summary
- Input: Perl-like description of floating point design
- Output: synthesizable VHDL code for implementation
- Datapath
- One-hot state machine (control signal)
- Host interface is introduced
- The datapath is correct because it is constructed automatically
- Errors in translating a software algorithm into datapath and control signals are eliminated
- Bitstream generation is transparent to the user
- A GCD coprocessor was given as an example
21. Float: Floating Point Design Environment
22. Introduction
- Many applications involve floating point operations
- Graphical transformation
- Scientific simulation
- Floating point arithmetic is seldom implemented on FPGA systems
- Implementing floating point arithmetic on an FPGA is possible
- Larger area
- Higher speed
- Arbitrary-sized floating point formats are possible on an FPGA
- Allows more flexible designs
23. Introduction
- Float Class
- Optimizes the floating point algorithm during simulation
- An instance of the Float class represents a floating point variable when simulating the algorithm
- VHDL Floating Point Generator
- Generates arbitrary-sized floating point adders and multipliers
- Integrated into the Fly environment
24. Float Design Environment
- Float Class
- Encapsulates the floating point data structure
- Arbitrary exponent and fraction size
- Implemented in Perl
- Supports several methods for floating point operations
25. Float Design Environment
- Float Class Attributes
- Sign, exponent, fraction
- Size of exponent, fraction
- Maximum magnitude
- Used to determine the minimum exponent size required
- Circuit size required for the floating point operation
26. Float Design Environment
- Float Class Method Support
- add()
- multiply()
- setExponetSize()
- setFractionSize()
- setValue()
- getValue()
- getCircuitSize()
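To make the idea of a simulation-time float with configurable widths concrete, here is a minimal Python sketch. It is not the Perl Float class from the slides: the class name `SimFloat` is hypothetical, the constructor argument order (fraction bits, exponent bits, value) is an assumption, and the exponent width is stored but not enforced.

```python
import math


class SimFloat:
    """Toy software model of a float with a configurable fraction width."""

    def __init__(self, fraction_bits, exponent_bits, value=0.0):
        self.fraction_bits = fraction_bits
        self.exponent_bits = exponent_bits   # recorded only; range is not checked
        self.value = self._quantize(value)

    def _quantize(self, x):
        # Round x to fraction_bits + 1 significant bits (hidden leading one).
        if x == 0.0:
            return 0.0
        mantissa, exponent = math.frexp(x)   # x == mantissa * 2**exponent
        scale = 1 << (self.fraction_bits + 1)
        return math.ldexp(round(mantissa * scale) / scale, exponent)

    def add(self, other):
        return SimFloat(self.fraction_bits, self.exponent_bits,
                        self.value + other.value)

    def multiply(self, other):
        return SimFloat(self.fraction_bits, self.exponent_bits,
                        self.value * other.value)


narrow = SimFloat(3, 8, 0.9)     # only 4 significant bits survive
wide = SimFloat(23, 8, 0.9)      # single-precision-like fraction width
print(narrow.value, wide.value)
```

Running an algorithm with such objects in place of native floats shows how much accuracy is lost at each candidate width before any hardware is generated.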
27. Float Design Environment
- Optimization
- Input: accuracy, resource constraint
- Output: size of each floating point operator
- Nelder-Mead method is used to minimize the cost function
28. Float Design Environment
- Cost Function
- Adder size
- Multiplier size
- Quantization Error (dB)
- Cost Function
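The slides do not preserve the cost function itself, so the following Python sketch only illustrates the general shape of such an optimization. Everything specific here is an assumption: the area model (adder proportional to the fraction width, multiplier to its square), the error measure (noise-to-signal ratio in dB), and the treatment of the accuracy target as a hard constraint. An exhaustive search over one width stands in for the Nelder-Mead method named on the slide.

```python
import math


def quantize(x, fraction_bits):
    """Round x to fraction_bits + 1 significant bits."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)
    scale = 1 << (fraction_bits + 1)
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def error_db(reference, fraction_bits):
    """Quantization noise relative to signal power, in dB (lower is better)."""
    noise = sum((r - quantize(r, fraction_bits)) ** 2 for r in reference)
    signal = sum(r * r for r in reference)
    return -math.inf if noise == 0 else 10 * math.log10(noise / signal)


def cost(fraction_bits, reference, target_db):
    """Assumed area model, with infinite cost when accuracy is not met."""
    if error_db(reference, fraction_bits) > target_db:
        return math.inf
    adder_area = fraction_bits             # assumption: linear in width
    multiplier_area = fraction_bits ** 2   # assumption: quadratic in width
    return adder_area + multiplier_area


reference = [math.sin(0.1 * n) for n in range(1, 100)]
best = min(range(4, 33), key=lambda n: cost(n, reference, target_db=-60.0))
print(best, error_db(reference, best))
```

With several operators, each with its own width, the search space grows quickly, which is where a direct-search method such as Nelder-Mead becomes attractive.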
29. Float Design Environment
- VHDL Floating Point Generator
- Generates parameterized adders and multipliers with arbitrary exponent and fraction sizes
- Fully pipelined design
- Latency of multiplication: 8 cycles
- Latency of addition: 4 cycles
- Throughput: one result per clock cycle
- The module is written in Perl as the interface to the library
- Compatible with the Fly compiler through start and end signals
30. Integration into Fly compiler
- C = A + B
- The datapath for integer addition needs one clock cycle to complete
- C = A .+ B
- The datapath for a floating point operation needs more cycles to complete, so more flip-flops are added to delay the control signal
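Delaying the control signal amounts to passing the start pulse through a chain of flip-flops as long as the operator's latency. A minimal Python sketch of that idea follows; the function name `delayed_end` is hypothetical and the latency must be at least 1.

```python
from collections import deque


def delayed_end(start_pulses, latency):
    """Return the end signal: the start signal delayed by `latency` cycles."""
    # One flip-flop per cycle of latency, all initially cleared.
    pipeline = deque([0] * latency, maxlen=latency)
    end = []
    for start in start_pulses:
        end.append(pipeline[0])   # the oldest flip-flop drives the end signal
        pipeline.append(start)    # shift the new start bit in, dropping the oldest
    return end


# A floating point adder with a latency of 4 cycles, as quoted for the generator.
print(delayed_end([1, 0, 0, 0, 0, 0], 4))
```

Because the operators are fully pipelined, a new start pulse can enter the chain every cycle while earlier ones are still in flight.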
31. Application II - Digital Sine Cosine Generator
- Let sin be the signal at time n
- If
32. Application II - Digital Sine Cosine Generator
    $cos_theta    = new Float(23, 8, 0.9);
    $cos_theta_p1 = new Float(23, 8, 1.9);
    $cos_theta_m1 = new Float(23, 8, -0.1);
    $s1[0] = new Float(23, 8, 0);
    $s2[0] = new Float(23, 8, 1);
    for ($i = 0; $i < 50; $i++) {
        $s1[$i+1] = $s1[$i] * $cos_theta +
                    $s2[$i] * $cos_theta_p1;
        $s2[$i+1] = $s1[$i] * $cos_theta_m1 +
                    $s2[$i] * $cos_theta;
    }
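The recurrence on this slide can be checked in ordinary double precision. The Python sketch below reads the two update lines as a coupled pair with coefficients cos(theta), cos(theta) + 1 and cos(theta) - 1, which is an interpretation of the slide's constants 0.9, 1.9 and -0.1. The function name `dscg` is hypothetical.

```python
import math


def dscg(cos_theta, steps, s1=0.0, s2=1.0):
    """Iterate the coupled sine-cosine recurrence and return both sequences."""
    s1_out, s2_out = [s1], [s2]
    for _ in range(steps):
        # Both new values depend on the old pair, so update them together.
        s1, s2 = (s1 * cos_theta + s2 * (cos_theta + 1.0),
                  s1 * (cos_theta - 1.0) + s2 * cos_theta)
        s1_out.append(s1)
        s2_out.append(s2)
    return s1_out, s2_out


theta = math.acos(0.9)
s1, s2 = dscg(0.9, 50)
print(s2[1], math.cos(theta))
```

With these initial values s2 follows cos(n * theta), and s1 is a scaled sin(n * theta). The update matrix has determinant 1, so in exact arithmetic the oscillation neither grows nor decays, and any drift seen in hardware comes from quantization.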
33. Application III - Ordinary Differential Equation Solver
- Used the modified Fly compiler to solve an ordinary differential equation
- Used Euler's method; h is the step size
- The example involves floating point addition, subtraction and multiplication
34. Application III - Ordinary Differential Equation Solver
    {
        h = read_host(1);

        t = 0.0; y = 1.0; dy = 0.0;
        onehalf = 0.5; index = 0;

        while (t < 3.0) {
            t1 = h .* onehalf; t2 = t .- y;
            dy = t1 .* t2; t = t .+ h;
            y = y .+ dy; index = index + 1;
            void = write_host(y, index);
        }
    }
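Read as Euler's method, the loop above integrates dy/dt = (t - y) / 2 from y(0) = 1 up to t = 3. This reading is inferred from the arithmetic in the listing. A minimal Python sketch of the same steps, with the hypothetical function name `euler`, is:

```python
import math


def euler(h, t_end=3.0):
    """Euler's method for dy/dt = (t - y) / 2 with y(0) = 1."""
    t, y = 0.0, 1.0
    ys = []
    while t < t_end:
        dy = (h * 0.5) * (t - y)   # step size times the derivative
        t = t + h
        y = y + dy
        ys.append(y)               # stands in for write_host(y, index)
    return t, ys


t_final, ys = euler(1.0 / 16.0)
exact = 1.0 + 3.0 * math.exp(-1.5)   # closed form y(t) = t - 2 + 3 exp(-t / 2) at t = 3
print(len(ys), ys[-1], exact)
```

A step size of 1/16 is exactly representable in binary, so the loop runs exactly 48 times. A step size such as 0.1 could run one extra iteration because of rounding in the accumulated t.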
35. Summary
- The Float environment is introduced
- The Float class makes it possible to determine the size of floating point operators while maintaining a given level of accuracy
- Area can be reduced through optimization
- As a result, more logic can be implemented on the FPGA
- Module generation allows the Fly compiler to support arbitrary-sized floating point arithmetic
- Floating point algorithms can be implemented on an FPGA with ease
- Translation from floating point to fixed point is no longer required
- DSCG and ODE applications were given
36. Function Generator
37. Introduction
- In software systems, standard mathematical library functions are available
- In hardware design, the mathematical library must be implemented by the designer
- A general method that allows arbitrary differentiable functions to be generated is desirable
- The STAM approach was adopted
- Integrated into the Fly compiler
38. STAM datapath
Symmetric properties were removed during implementation for simplicity
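Table-addition methods approximate a function by summing the outputs of several small lookup tables instead of reading one large table. The Python sketch below shows a two-table (bipartite) variant without the symmetry optimization, for f(x) = x^(-1.5) on [1, 2). It is an illustration under stated assumptions, not the generator's actual datapath: the bit split `n0`, `n1`, `n2`, the function names, and the use of unrounded table entries are all hypothetical choices.

```python
def build_tables(f, df, n0, n1, n2):
    """Build the two tables for x = 1 + k / 2**(n0 + n1 + n2)."""
    scale = 1 << (n0 + n1 + n2)
    table0 = {}
    for hi in range(1 << (n0 + n1)):
        # f evaluated at the midpoint of the span covered by the low n2 bits.
        xa = 1.0 + ((hi << n2) + (1 << (n2 - 1))) / scale
        table0[hi] = f(xa)
    table1 = {}
    for i0 in range(1 << n0):
        # Slope taken at the midpoint of the span covered by the top n0 bits.
        xb = 1.0 + ((i0 << (n1 + n2)) + (1 << (n1 + n2 - 1))) / scale
        for i2 in range(1 << n2):
            delta = (i2 - (1 << (n2 - 1))) / scale
            table1[(i0, i2)] = df(xb) * delta
    return table0, table1


def lookup(k, tables, n1, n2):
    """Approximate f by adding one entry from each table."""
    table0, table1 = tables
    hi = k >> n2                  # top n0 + n1 bits index the value table
    i0 = k >> (n1 + n2)           # top n0 bits select the slope
    i2 = k & ((1 << n2) - 1)      # low n2 bits select the offset
    return table0[hi] + table1[(i0, i2)]


N0 = N1 = N2 = 4
tables = build_tables(lambda x: x ** -1.5, lambda x: -1.5 * x ** -2.5, N0, N1, N2)
max_err = max(abs(lookup(k, tables, N1, N2) - (1.0 + k / 4096.0) ** -1.5)
              for k in range(4096))
print(len(tables[0]) + len(tables[1]), max_err)
```

In this configuration the two tables hold 512 entries in total, compared with 4096 for a single direct table over the same 12-bit input.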
39. Implementation using VHDL
- A Perl program automates the generation of VHDL code with the STAM algorithm
- The program preprocesses the VHDL design; the STAM specification is placed inside a comment
- BlockRAM stores the table entries
- The design can be used directly in the VHDL
40. VHDL Preprocessor
41. Floating Point extension
- The original STAM applies to fixed point arithmetic
- A minor add-on lets STAM handle floating point arithmetic
- Floating point evaluation of v^(-3/2) is implemented using STAM and the floating point library
42. Floating Point extension
43. Fly integration
- start and end signals are attached to the entity power15
- A built-in function _power15() is introduced into the Fly compiler with a slight modification
44. Application IV - N-Body Problem Simulation
- Calculates the acceleration of each particle by iteration
- Uses Fly, Float, and the function generator in this application
45. N-Body problem - Fly implementation
    {
        # initialization, fetch xi, yi, zi
        while (j < n) {
            # fetch xj, yj, zj from memory
            xj = read_host(index);
            index = index + 1;
            yj = read_host(index);
            index = index + 1;
            zj = read_host(index);
            index = index + 2;
            diffx = xj .- xi;
            diffy = yj .- yi;
            diffz = zj .- zi;

            x = diffx .* diffx;
            y = diffy .* diffy;
            z = diffz .* diffz;

            r1 = x .+ y; r2 = z .+ epsilon;
            # calculate rij
            rij = r1 .+ r2;
            # call built-in function power-1.5
            tmp2 = _power15(rij);
            tmpx = tmp2 .* diffx; tmpy = tmp2 .* diffy; tmpz = tmp2 .* diffz;
            # accumulate a
            ax = ax .+ tmpx; ay = ay .+ tmpy; az = az .+ tmpz;
            j = j + 1;
        }
    }
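For reference, the inner loop above accumulates, for particle i, the sum over j of (p_j - p_i) * (|p_j - p_i|^2 + epsilon)^(-1.5). A minimal Python sketch of that computation follows. It assumes unit masses and a unit gravitational constant, since the listing shows neither, and the function name `acceleration` is hypothetical.

```python
def acceleration(i, positions, epsilon=1e-3):
    """Softened acceleration on particle i from all particles."""
    xi, yi, zi = positions[i]
    ax = ay = az = 0.0
    for xj, yj, zj in positions:
        dx, dy, dz = xj - xi, yj - yi, zj - zi
        rij = dx * dx + dy * dy + dz * dz + epsilon   # softened squared distance
        w = rij ** -1.5                               # role of _power15(rij)
        ax += w * dx
        ay += w * dy
        az += w * dz
    return ax, ay, az


particles = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
a0 = acceleration(0, particles, epsilon=1e-12)
a1 = acceleration(1, particles, epsilon=1e-12)
print(a0, a1)
```

The softening term epsilon keeps the j = i term finite: its displacement is zero, so it contributes nothing to the sum, and no special case is needed in the loop.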
46. Summary
- The STAM approach enhances the flexibility of the Fly compiler
- Arbitrary mathematical functions are now supported through table lookup
- The mechanism is similar to software programming
- N-body problem simulation shows that a real-world problem can be solved with this framework
47. Results
48. Experiment Environment
- The framework was integrated into the Pilchard FPGA platform
- Pilchard uses a DIMM memory bus interface instead of a PCI bus (lower latency and higher bandwidth than PCI)
- The compilation and implementation process is transparent to the user
49. Result: Application I - GCD
- A GCD coprocessor was implemented using the Fly system
- Implemented on Pilchard (Xilinx XCV300E-8)
- Fixed point, 16-bit integer
- Max. frequency: 126 MHz
- Slices used: 135 out of 3072
- Computes a GCD every 1.63 ms (including all interface overheads)
50. Result: Floating Point Generator
- Floating point operators were implemented
- Implemented on Pilchard (Xilinx XCV1000E-6)
- Different fraction sizes were measured; the exponent size is 8
- Max. frequency (multiplier): 103 MHz
- Max. frequency (adder): 58 MHz
- The results were used to model the area relationship
52ResultDigital Sine Cosine Generator
- Use Float Class in simulation to optimized the
size required for floating point operation - Use Fly compiler to produce bitstream for
implementation - Implemented on Pilchard (Xilinx XCV1000E-6)
- Max. Frequency 52.38MHz
- Area used 3470 out of 12288 slices
53Result Reference Output
54Result Quantization Error
55Result Optimization
56. Result: Ordinary Differential Equation
- Used the Fly compiler to generate the bitstream
- The floating point library was used for floating point arithmetic
- Implemented on Pilchard (Xilinx XCV1000E-6)
- Max. frequency: 64.5 MHz
- Slices used: 2,349 out of 3,072 (for single precision arithmetic)
- For h = 1/16, one execution takes 28.7 µs (including all interface overheads)
57. Result: N-Body Problem Simulation
- Used the Fly compiler to generate the bitstream
- The floating point library was used for floating point arithmetic
- N = 10
- Implemented on Pilchard (Xilinx XCV1000E-6)
- Max. frequency: 44.79 MHz
- Area: 5475 out of 12288 slices
58. Summary
59. Conclusion
60. Conclusion
- A framework consisting of hardware compilation, module generators, and floating point arithmetic was introduced
- Any designer can use a programming language to implement a design
- A hardware design background is no longer required
61. Conclusion
- Single description for both software and hardware
- Saves time in
- Software debugging
- Hardware interfacing
- Retraining and learning hardware design knowledge
- Reduces errors when
- Translating a software design into hardware datapath and control signals
- Productivity increases
62. Conclusion
- Floating Point Arithmetic
- Floating point algorithms no longer need to be translated to fixed point, so time and errors are reduced
- A floating point library allows a wide range of floating point designs to be implemented on an FPGA
- Parameterized floating point operators can save resources or enhance accuracy to suit different design constraints
63. Conclusion
- Elementary Arithmetic
- It is not necessary to implement a mathematical library for each design
- Automatic generation saves design time
- It was demonstrated that the combination of Fly, Float and the function generator greatly reduces the design effort required to develop complex floating point applications such as the N-body problem
64. Conclusion
- Future Direction
- Allow different state machine generation mechanisms
- Enhance the efficiency of certain implementations
- Generate fully pipelined designs
- Fully utilize the datapath
- Detect parallelism automatically
- Speed up the design in the hardware environment
- Generate arbitrary functions for floating point arithmetic
65. Publication
- C.H. Ho, M.P. Leong, P.H.W. Leong, J. Becker, M. Glesner, "Rapid Prototyping of FPGA based Floating Point DSP Systems", in Proceedings of IEEE International Workshop on Rapid System Prototyping, July 2002.
- C.H. Ho, P.H.W. Leong, K.H. Tsoi, R. Ludewig, P. Zipf, A.G. Ortiz, M. Glesner, "Fly - A Modifiable Hardware Compiler", in Proceedings of International Conference on Field Programmable Logic and Applications, September 2002.
- C.H. Ho, K.H. Tsoi, H.C. Yeung, Y.M. Lam, K.H. Lee, P.H.W. Leong, R. Ludewig, P. Zipf, A.G. Ortiz, M. Glesner, "Arbitrary Function Approximation in HDLs", submitted to Proceedings of IEEE International Conference on Field-Programmable Technology, December 2003.
66. Thank You