Title: ISSS
1. ISSS'01
Combined Instruction and Loop Level Parallelism
for Regular Array Synthesis on FPGAs
- S. Derrien, S. Rajopadhye, S. Sur-Kolay
- IRISA, France; ISI, Calcutta
2. Outline
- Context and motivation
- Space-time transformations
- Transformation flow
- Experimental validation
- Conclusion
3. High-Performance IP-Cores
- High-level specifications
- Matlab, C, or a specific language (Alpha)
- Targeting nested loops
- Core must be formally correct
- Hard/Soft co-generation
- Hardware RTL module (VHDL)
- Simple driver API (C)
- Regular Processor Arrays
- High data throughput, specialized datapath
- Well suited for VLSI/FPGA
4. Targeting FPGAs
- Poor clock speed
- Typical clock speed is 1/10 of ASIC speed
- Very design-dependent
- Good at low precision arithmetic (8 bits)
- Really bad for complex operations (floats)
- But high performance
- Optimized designs can compete with ASICs
- Performance gain due to parallelism
- Pipeline comes for free (lots of DFFs)
5. Processor Array Synthesis
Matrix multiplication example:

  for i = 1 to 3
    for j = 1 to 3
      for k = 1 to 3
        C[i,j] = C[i,j] + A[i,k] * B[k,j]
      end for
    end for
  end for

- Iterations are scheduled on their associated PE
- Data dependence vectors between iterations
- Iteration domain extracted from the loop bounds
- Iteration domain is projected onto the processor grid
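The mapping above can be modeled in a few lines. This is a minimal sketch, assuming projection along the k axis (PE = (i, j)) and the schedule t = k; the variable names are illustrative, not the paper's notation.

```python
# Toy model of processor-array synthesis for 3x3 matrix multiply:
# iteration (i, j, k) runs on PE (i, j) (projection along k) at time k.
N = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[0] * N for _ in range(N)]

schedule = {}  # (pe, time) -> iteration point
for i in range(N):
    for j in range(N):
        for k in range(N):
            pe, t = (i, j), k          # space-time mapping
            schedule[(pe, t)] = (i, j, k)
            C[i][j] += A[i][k] * B[k][j]

# Each PE executes exactly N iterations, one per time step.
pe00 = [it for (pe, t), it in schedule.items() if pe == (0, 0)]
assert len(pe00) == N
```

Projecting out the k axis leaves a 3x3 PE grid, and the dependence along k becomes the accumulation carried inside each PE.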
6. PE Architecture
Temporal registers act as local memory
- Combinational datapath connected to registers
- Unidirectional flow and pipelined connections
- N classes of registers (N = loop nest dimension)
- One critical path for each register class
- Operating frequency set by worst critical path
Spatio-temporal registers must be disambiguated
Spatial registers serve as interconnect between PEs
7. Conclusion
- Simplistic schedule inside a PE (no ILP)
- Complex loop bodies induce poor performance
- Floating-point matrix multiplication operating at 12 MHz
- 2D SOR on 16 bits operating at 40 MHz
- The PE architecture is not suited to FPGAs !!
- Proposed solution: allow pipelined datapaths by altering the PE architecture through simple space-time transformations
8. Retiming
- Move registers to minimize clock period
- Handled by most FPGA RTL synthesis tools
- Efficient iff a sufficient number of registers is available
- We just need to add registers in the PE !!
Example: retiming reduces Tc from 2 logic levels to 1 logic level.
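A software model makes the retiming argument concrete. The sketch below (an illustration, not the tool's algorithm) splits a two-level combinational path with one inserted register: the function is preserved, shifted by one cycle, while the per-cycle critical path is halved.

```python
# Model of register insertion + retiming: a path of two "logic levels"
# (mul, then add) is split by a pipeline register.
def combinational(samples):
    return [x * 3 + 1 for x in samples]   # 2 logic levels per cycle

def retimed(samples):
    out, reg = [], 0                      # reg: the inserted register
    for x in samples:
        out.append(reg + 1)               # level 2 reads the register
        reg = x * 3                       # level 1 writes the register
    return out

xs = [5, 7, 2, 4]
# Same function, delayed by one cycle (initial register value flushes out).
assert retimed(xs)[1:] == combinational(xs)[:-1]
```

This is why the slides argue that adding registers suffices: the synthesis tool's retiming pass can then redistribute them along the path to balance the logic levels.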
9. Serialization (1/2)
- Regroup PEs into clusters
- Iterations in a cluster executed sequentially
- Throughput is slowed down by the cluster size
- Local memory is duplicated
10. Serialization (2/2)
- Decomposed along each spatial dimension
- Feedback loops are created for all spatial paths along the i-th axis
- Temporal registers are duplicated by the serialization factor si
- Serialization impacts the PE according to simple transformation rules
- Loop-level parallelism traded for instruction-level parallelism
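The two serialization effects above (throughput divided by the cluster size, temporal registers duplicated per virtual PE) can be sketched as follows. This is a toy model under assumed names, not the paper's transformation rules.

```python
# A cluster of s PEs is folded onto one physical PE executed round-robin:
# throughput drops by s, and each temporal register gets s copies
# (one per virtual PE) so the folded PEs keep independent state.
def run_serialized(inputs_per_pe, s):
    acc = [0] * s                 # temporal register, duplicated s times
    cycles = 0
    for step in range(len(inputs_per_pe[0])):
        for v in range(s):        # virtual PEs share one datapath
            acc[v] += inputs_per_pe[v][step]
            cycles += 1
    return acc, cycles

acc, cycles = run_serialized([[1, 2, 3], [10, 20, 30]], s=2)
assert acc == [6, 60]             # each virtual PE keeps its own sum
assert cycles == 2 * 3            # s times more cycles than one PE alone
```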
11. Skewing
- Affects latency, but not throughput
- Adds temporal registers along a spatial axis
- Skewing can be used before and after serialization
- Cannot reduce the original temporal critical path
Example: skewing by factor 2 along the vertical PE axis.
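Skewing by a factor lam along one PE axis delays PE p's whole schedule by lam * p cycles. The sketch below (assumed notation) checks the two properties claimed above: each PE's firing rate is unchanged, while total latency grows.

```python
# Skewing as a schedule transformation: t' = t + lam * pe.
def skew(schedule, lam):
    # schedule: {(pe, t): op}  ->  {(pe, t + lam * pe): op}
    return {(pe, t + lam * pe): op for (pe, t), op in schedule.items()}

orig = {(p, t): f"op{p}{t}" for p in range(3) for t in range(2)}
skewed = skew(orig, lam=2)

# Throughput unchanged: PE 2 still fires on two consecutive cycles.
assert sorted(t for (p, t) in skewed if p == 2) == [4, 5]
# Latency grows by lam * (num_PEs - 1) = 2 * 2 extra cycles.
assert max(t for (_, t) in skewed) == max(t for (_, t) in orig) + 4
```

The extra cycles between neighboring PEs are exactly the temporal registers that skewing inserts along the spatial axis, which retiming can later push into the datapath.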
12. Problem Formulation
- Find the optimal set of transformation parameters
- Minimize the number of registers
- Preserve loop-level parallelism
Example critical paths: Tc = 86 ns requires dj = 6 stages, Tc = 70 ns requires di = 5 stages, and Tc = 60 ns requires dt = 4 stages to reach Tc = 15 ns.
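The stage counts follow directly from the ratio of a path's combinational delay to the target clock period, d = ceil(Tc / T_target):

```python
import math

# Pipeline depth needed so each stage fits in the target clock period.
def stages(tc_ns, target_ns=15):
    return math.ceil(tc_ns / target_ns)

assert stages(86) == 6   # d_j
assert stages(70) == 5   # d_i
assert stages(60) == 4   # d_t
```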
13. Proposed Heuristic
- Goal: determine all the skewing parameters
- 1. Assume si given (partitioning step)
- 2. Sort PE space axes in ascending order of Tc
- 3. For each PE axis i:
  - i. Pre-serialization skewing λi_pre
  - ii. Serialization by si
- 4. For each PE axis i:
  - i. Post-serialization skewing λi_post
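The heuristic's control flow can be sketched as below. This is an assumed reconstruction of the step ordering only; the actual skewing-parameter computation is not shown.

```python
# Skeleton of the heuristic: serialization factors s are given; axes are
# visited in ascending order of critical path, skewed before each
# serialization, then post-skewed in a second pass.
def heuristic(axes, s):
    # axes: {name: critical path Tc in ns}, s: {name: serialization factor}
    plan = []
    for ax in sorted(axes, key=axes.get):      # ascending Tc
        plan.append(("pre_skew", ax))
        plan.append(("serialize", ax, s[ax]))
    for ax in sorted(axes, key=axes.get):
        plan.append(("post_skew", ax))
    return plan

plan = heuristic({"x": 70, "y": 60}, {"x": 2, "y": 2})
assert plan[0] == ("pre_skew", "y")            # smallest Tc handled first
assert plan[-1] == ("post_skew", "x")
```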
14. Transformation Example
1. Pre-skew along the y axis by factor λy_pre = 1.
2. Serialization along the y axis by factor sy = 2.
3. Pre-skew along the x axis by factor λx_pre = 2.
4. Serialization along the x axis by factor sx = 2.
5. Post-skew along the y axis by factor λy_post = 1.
6. Apply retiming.
15. Experimental Validation
- Chosen benchmarks
  - Matrix multiplication (8-bit, 16-bit, and floating point)
  - Adaptive filter (DLMS) (8-bit, 16-bit, and floating point)
  - String matching (DNA, protein)
- Performance metrics
  - Ape: PE area usage
  - fpe: PE operating frequency
  - Raw performance r = Npe * fpe
  - Npe approximated by 1/Ape
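The raw-performance metric reduces to a one-line formula; the sketch below uses made-up numbers purely for illustration.

```python
# Raw performance r = Npe * fpe, with the number of PEs fitting on the
# device approximated as Npe ~ 1 / Ape (normalized PE area).
def raw_performance(ape, fpe_mhz):
    npe = 1.0 / ape          # Npe approximated by 1/Ape
    return npe * fpe_mhz

# Trade-off captured by the metric: a PE twice as large must run
# twice as fast to break even.
assert raw_performance(0.25, 40) == raw_performance(0.5, 80)
```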
16. Area Overhead
The area overhead decreases as the combinational datapath area cost grows.
17. Frequency Improvement
Speed improvement of up to one order of magnitude (for floating point).
18. Raw Performance
Speed improvement of up to one order of magnitude (for floating point).
19. Conclusion
- Extracts very fine-grain ILP from the datapath as a whole
- Simple space-time transformations, yet they yield impressive results
- Preserves circuit correctness and control-logic regularity and simplicity
- Performance benefits are limited by the lack of place-and-route-aware retiming tools