Title: ISSS
1. ISSS'01
Combined Instruction and Loop Level Parallelism
for Regular Array Synthesis on FPGAs
- S. Derrien, S. Rajopadhye, S. Sur-Kolay
- IRISA, France; ISI, Calcutta
2. Outline
- Context and motivation
- Space-time transformations
- Transformation flow
- Experimental validation
- Conclusion
3. High-Performance IP-Cores
- High-level specifications
- Matlab, C, or a specific language (Alpha)
- Targeting nested loops
- Core must be formally correct
- Hard/Soft co-generation
- Hardware RTL module (VHDL)
- Simple driver API (C)
- Regular Processor Arrays
- High data throughput, specialized datapath
- Well suited for VLSI/FPGA
4. Targeting FPGAs
- Poor clock speed
- Typical clock speed is 1/10 of ASIC speed
- Very design-dependent
- Good at low precision arithmetic (8 bits)
- Really bad for complex operations (floats)
- But high performance
- Optimized designs can compete with ASICs
- Performance gain due to parallelism
- Pipeline comes for free (lots of DFFs)
5. Processor Array Synthesis
Matrix multiplication example:

  for i = 1 to 3
    for j = 1 to 3
      for k = 1 to 3
        C[i,j] = C[i,j] + A[i,k] * B[k,j]
      end for
    end for
  end for

- Iterations are scheduled on their associated PE
- Data dependence vectors between iterations
- Iteration domain extracted from the loop bounds
- Iteration domain is projected onto the processor grid
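The mapping above can be modeled in a few lines. This is a minimal sketch, assuming projection along the k axis (PE = (i, j)) and the schedule t = k; the variable names are illustrative, not the paper's notation.

```python
# Toy model of processor-array synthesis for 3x3 matrix multiply:
# iteration (i, j, k) runs on PE (i, j) (projection along k) at time k.
N = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[0] * N for _ in range(N)]

schedule = {}  # (pe, time) -> iteration point
for i in range(N):
    for j in range(N):
        for k in range(N):
            pe, t = (i, j), k          # space-time mapping
            schedule[(pe, t)] = (i, j, k)
            C[i][j] += A[i][k] * B[k][j]

# Each PE executes exactly N iterations, one per time step.
pe00 = [it for (pe, t), it in schedule.items() if pe == (0, 0)]
assert len(pe00) == N
```

Projecting out the k axis leaves a 3x3 PE grid, and the dependence along k becomes the accumulation carried inside each PE.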
6. PE Architecture
Temporal registers act as local memory
- Combinational datapath connected to registers
- Unidirectional flow and pipelined connections
- N classes of registers (N = loop nest dimension)
- One critical path for each register class
- Operating frequency set by worst critical path
Spatio-temporal registers must be disambiguated
Spatial registers serve as interconnect between PEs
7. Conclusion
- Simplistic schedule inside a PE (no ILP)
- Complex loop bodies induce poor performance
- Floating-point matrix multiplication operating at 12 MHz
- 2D SOR on 16 bits operating at 40 MHz
- The PE architecture is not suited to FPGAs !!
- Proposed solution: allow pipelined datapaths by altering the PE architecture through simple space-time transformations
8. Retiming
- Move registers to minimize clock period
- Handled by most FPGA RTL synthesis tools
- Efficient iff a sufficient number of registers is available
- We just need to add registers in the PE !!
Example: retiming reduces Tc from 2 logic levels to 1 logic level.
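A software model makes the retiming argument concrete. The sketch below (an illustration, not the tool's algorithm) splits a two-level combinational path with one inserted register: the function is preserved, shifted by one cycle, while the per-cycle critical path is halved.

```python
# Model of register insertion + retiming: a path of two "logic levels"
# (mul, then add) is split by a pipeline register.
def combinational(samples):
    return [x * 3 + 1 for x in samples]   # 2 logic levels per cycle

def retimed(samples):
    out, reg = [], 0                      # reg: the inserted register
    for x in samples:
        out.append(reg + 1)               # level 2 reads the register
        reg = x * 3                       # level 1 writes the register
    return out

xs = [5, 7, 2, 4]
# Same function, delayed by one cycle (initial register value flushes out).
assert retimed(xs)[1:] == combinational(xs)[:-1]
```

This is why the slides argue that adding registers suffices: the synthesis tool's retiming pass can then redistribute them along the path to balance the logic levels.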
9. Serialization (1/2)
- Regroup PEs into clusters
- Iterations in a cluster executed sequentially
- Throughput is slowed down by the cluster size
- Local memory is duplicated
10. Serialization (2/2)
- Decomposed along each spatial dimension
- Feedback loops are created for all spatial paths along the i-th axis
- Temporal registers are duplicated by the serialization factor si
- Serialization impacts the PE according to simple transformation rules
- Loop-level parallelism traded for instruction-level parallelism
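The two serialization effects above (throughput divided by the cluster size, temporal registers duplicated per virtual PE) can be sketched as follows. This is a toy model under assumed names, not the paper's transformation rules.

```python
# A cluster of s PEs is folded onto one physical PE executed round-robin:
# throughput drops by s, and each temporal register gets s copies
# (one per virtual PE) so the folded PEs keep independent state.
def run_serialized(inputs_per_pe, s):
    acc = [0] * s                 # temporal register, duplicated s times
    cycles = 0
    for step in range(len(inputs_per_pe[0])):
        for v in range(s):        # virtual PEs share one datapath
            acc[v] += inputs_per_pe[v][step]
            cycles += 1
    return acc, cycles

acc, cycles = run_serialized([[1, 2, 3], [10, 20, 30]], s=2)
assert acc == [6, 60]             # each virtual PE keeps its own sum
assert cycles == 2 * 3            # s times more cycles than one PE alone
```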
11. Skewing
- Affects latency, but not throughput
- Adds temporal registers along a spatial axis
- Skewing can be used before and after serialization
- Cannot reduce the original temporal critical path
Example: skewing by factor 2 along the vertical PE axis.
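Skewing by a factor lam along one PE axis delays PE p's whole schedule by lam * p cycles. The sketch below (assumed notation) checks the two properties claimed above: each PE's firing rate is unchanged, while total latency grows.

```python
# Skewing as a schedule transformation: t' = t + lam * pe.
def skew(schedule, lam):
    # schedule: {(pe, t): op}  ->  {(pe, t + lam * pe): op}
    return {(pe, t + lam * pe): op for (pe, t), op in schedule.items()}

orig = {(p, t): f"op{p}{t}" for p in range(3) for t in range(2)}
skewed = skew(orig, lam=2)

# Throughput unchanged: PE 2 still fires on two consecutive cycles.
assert sorted(t for (p, t) in skewed if p == 2) == [4, 5]
# Latency grows by lam * (num_PEs - 1) = 2 * 2 extra cycles.
assert max(t for (_, t) in skewed) == max(t for (_, t) in orig) + 4
```

The extra cycles between neighboring PEs are exactly the temporal registers that skewing inserts along the spatial axis, which retiming can later push into the datapath.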
12. Problem Formulation
- Find the optimal set of transformation parameters
- Minimize the number of registers
- Preserve loop-level parallelism
Example critical paths: Tc = 86 ns requires dj = 6 stages, Tc = 70 ns requires di = 5 stages, and Tc = 60 ns requires dt = 4 stages to reach Tc = 15 ns.
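The stage counts follow directly from the ratio of a path's combinational delay to the target clock period, d = ceil(Tc / T_target):

```python
import math

# Pipeline depth needed so each stage fits in the target clock period.
def stages(tc_ns, target_ns=15):
    return math.ceil(tc_ns / target_ns)

assert stages(86) == 6   # d_j
assert stages(70) == 5   # d_i
assert stages(60) == 4   # d_t
```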
13. Proposed Heuristic
- Goal: determine all the skewing parameters
- 1. Assume si given (partitioning step)
- 2. Sort PE space axes in ascending order of Tc
- 3. For each PE axis i:
  - i. Pre-serialization skewing λi_pre
  - ii. Serialization by si
- 4. For each PE axis i:
  - i. Post-serialization skewing λi_post
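The heuristic's control flow can be sketched as below. This is an assumed reconstruction of the step ordering only; the actual skewing-parameter computation is not shown.

```python
# Skeleton of the heuristic: serialization factors s are given; axes are
# visited in ascending order of critical path, skewed before each
# serialization, then post-skewed in a second pass.
def heuristic(axes, s):
    # axes: {name: critical path Tc in ns}, s: {name: serialization factor}
    plan = []
    for ax in sorted(axes, key=axes.get):      # ascending Tc
        plan.append(("pre_skew", ax))
        plan.append(("serialize", ax, s[ax]))
    for ax in sorted(axes, key=axes.get):
        plan.append(("post_skew", ax))
    return plan

plan = heuristic({"x": 70, "y": 60}, {"x": 2, "y": 2})
assert plan[0] == ("pre_skew", "y")            # smallest Tc handled first
assert plan[-1] == ("post_skew", "x")
```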
14. Transformation Example
1. Pre-skew along the y axis by factor λy_pre = 1.
2. Serialization along the y axis by factor sy = 2.
3. Pre-skew along the x axis by factor λx_pre = 2.
4. Serialization along the x axis by factor sx = 2.
5. Post-skew along the y axis by factor λy_post = 1.
6. Apply retiming.
15. Experimental Validation
- Chosen benchmarks
  - Matrix multiplication (8-bit, 16-bit, and floating point)
  - Adaptive filter (DLMS) (8-bit, 16-bit, and floating point)
  - String matching (DNA, protein)
- Performance metrics
  - Ape: PE area usage
  - fpe: PE operating frequency
  - Raw performance r = Npe * fpe
  - Npe approximated by 1/Ape
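The raw-performance metric reduces to a one-line formula; the sketch below uses made-up numbers purely for illustration.

```python
# Raw performance r = Npe * fpe, with the number of PEs fitting on the
# device approximated as Npe ~ 1 / Ape (normalized PE area).
def raw_performance(ape, fpe_mhz):
    npe = 1.0 / ape          # Npe approximated by 1/Ape
    return npe * fpe_mhz

# Trade-off captured by the metric: a PE twice as large must run
# twice as fast to break even.
assert raw_performance(0.25, 40) == raw_performance(0.5, 80)
```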
16. Area Overhead
The area overhead decreases as the combinational datapath area cost grows.
17. Frequency Improvement
Speed improvement of up to one order of magnitude (for floating point).
18. Raw Performance
Speed improvement of up to one order of magnitude (for floating point).
19. Conclusion
- Extracts very fine-grain ILP from the datapath as a whole
- Simple space-time transformations, yet they yield impressive results
- Preserves circuit correctness and control-logic regularity and simplicity
- Performance benefits are limited by the lack of place-and-route-aware retiming tools