Title: Memory Optimizations In High Level Synthesis
1Memory Optimizations In High Level Synthesis
- By
- Nitin K. Agarwal
- Under the Supervision of
- Dr. Preeti Ranjan Panda
- Department of Computer Science, I.I.T. Delhi
2Overview
- Motivation and Objective.
- Related Work.
- Behavioral VHDL generation.
- Structural VHDL generation.
- Partitioning Algorithms.
- References.
3ASSET Flow
Application Specification in C
HW/SW Partitioner
Functions mapped on Hardware
Part of My Project
Functions mapped to Software
C to VHDL
Compiler
Silicon Compiler
ASIC
ASIC
Processor
Memory
4Input/output of C to VHDL Converter.
Functional Specification In C
Component Lib.
C to VHDL Converter (Structural RTL)
C to VHDL Converter (Behavioral)
Structural RTL VHDL
Behavioral VHDL
5Assumption on Input C
- No pointers.
- No global variable should be used.
- No aggregate data type is supported.
- Only Int and Bool types are supported (for behavioral generation).
6Objective
- Update the existing CtoVHDL converter from SUIF1 to SUIF2 for behavioral VHDL generation.
- Partition the floating registers into register files.
- Generate structural VHDL code.
- Perform case studies to observe the gain.
7Partitioning of Registers
RF 1: R1, R3, R5
RF 2: R2, R8
RF 3: R4, R6, R7
8Before Partitioning
R1
R2
R3
R4
9After Partitioning
RF with 2 read ports and 1 write port
10Related Work - SiliconC: A Hardware Back End for SUIF
C Code -> C to SUIF -> SUIF IR -> EXIT 1 -> PORKY -> SSA -> BITSIZE -> VGEN -> Structural VHDL Code
11SiliconC Different Stages.
- EXIT 1
- Creates single-entry, single-exit functions.
- This makes it simpler for the VHDL-generation stage to synthesize the function output port.
- PORKY (optimization pass)
- Performs classical compiler optimizations.
- Static Single Assignment (SSA)
- Each variable is assigned exactly once.
- BITSIZE
- Tries to find the bit width actually used by each operand.
- VGEN
- Generates structural VHDL code.
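The SSA property ("each variable is assigned only once") can be illustrated with a minimal Python sketch. It assumes straight-line three-address code only, so no phi functions are needed; the tuple layout and the `_n` renaming scheme are assumptions for illustration, not SiliconC's actual representation.

```python
def to_ssa(statements):
    """Rename straight-line code so that every name is assigned exactly once.

    Each statement is a tuple (dest, op, src1, src2). Control flow is not
    handled, so this simplified sketch never needs phi functions.
    """
    version = {}  # variable name -> latest version number

    def use(name):
        # A name that was never assigned here is an input and keeps its name.
        return f"{name}_{version[name]}" if name in version else name

    out = []
    for dest, op, a, b in statements:
        a_new, b_new = use(a), use(b)          # read operands before redefining dest
        version[dest] = version.get(dest, 0) + 1
        out.append((f"{dest}_{version[dest]}", op, a_new, b_new))
    return out
```

For example, two assignments to `x` become assignments to `x_1` and `x_2`, and the second statement reads `x_1`.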
12Interface of Core Co-proc
CORE ASIC
Data Bus
Processor
Output Ready
Signals for a particular protocol
Input Ready
Start
Wrapper
13 Behavioral Code Generation
C to SUIF 1 IR
PORKY
SUIF1 to SUIF2
C Spec.
Identify Bool Var.
SUIF2 to VHDL
VHDL Code
14Partitioning
- Input is the CDFG on which
- Scheduling is done
- FU binding is done
- Output is the CDFG in which
- Each register will be allocated to some RF
- Assumptions
- It would not change the schedule
- Benefit
- Partitioning reduces the number of point-to-point connections and the multiplexer width.
15Need of Writing Synthesizer
C to VHDL (Behavioral Code)
C Spec.
Behavioral Compiler
Net List
16Complete Flow
C to MachSuif
Opcode Splitting
List Scheduling
C Spec.
Comp. Lib.
Partitioning Port Binding
Data Path Generation
Structural VHDL Code
FU Binding
Register Allocation
Control Part Synthesis
Implemented By me
17Opcode Splitting
- Input is the (MachSuif) IR, in which each opcode belongs to the SUIF virtual machine.
- Output is the IR in which each opcode belongs to the HW virtual machine.
- Each opcode of the SUIF virtual machine is split into primitive opcodes of the HW virtual machine.
- One-to-many mapping is possible (not implemented yet).
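A table-driven pass is one way to picture opcode splitting. The sketch below is a minimal Python illustration, not the MachSuif pass itself: the opcode names (`add`, `mul`, `mac`) and the `SPLIT_TABLE` contents are hypothetical, and operands are ignored.

```python
# Hypothetical mapping from SUIF-VM opcodes to HW-VM primitive opcodes.
# The real opcode sets are not listed in the slides.
SPLIT_TABLE = {
    "add": ["add"],         # one-to-one
    "mul": ["mul"],         # one-to-one
    "mac": ["mul", "add"],  # one-to-many (marked "not implemented yet" in the slides)
}


def split_opcodes(instructions):
    """Rewrite a list of SUIF-VM opcodes as a list of HW-VM primitive opcodes."""
    out = []
    for op in instructions:
        if op not in SPLIT_TABLE:
            raise KeyError(f"no HW mapping for opcode {op!r}")
        out.extend(SPLIT_TABLE[op])
    return out
```

A one-to-many entry simply contributes more than one primitive opcode to the output stream.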
18Component Library
- Component library specifies
- Resource Allocation
- Interface of component.
- Number of components.
- Architecture of component
- Module Selection
- Mapping of FU to opcodes of Hw virtual machine.
- Component Library is specified using MDES.
- It is parsed using MQS.
- The extracted information is stored in the IR for use by later passes.
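The kind of information the component library carries can be sketched as a small Python data structure. This is only a stand-in: it is not MDES syntax and does not use MQS, and the component names, counts, and port names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str           # FU type name
    count: int          # number of instances (resource allocation)
    opcodes: frozenset  # HW-VM opcodes this FU can execute (module selection)
    ports: tuple        # interface of the component


# Hypothetical library contents, for illustration only.
LIBRARY = [
    Component("adder", 2, frozenset({"add", "sub"}), ("a", "b", "sum")),
    Component("multiplier", 1, frozenset({"mul"}), ("a", "b", "prod")),
]


def candidates_for(opcode, library=LIBRARY):
    """Return the names of FU types able to execute the given opcode."""
    return [c.name for c in library if opcode in c.opcodes]
```

A later pass such as FU binding could query `candidates_for` for each scheduled operation.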
19Partitioning
- Input is the IR on which
- Scheduling is done
- FU binding is done
- Output is the IR in which
- Each register will be allocated to some RF
- Assumptions
- It would not change the schedule
- Benefit
- Partitioning reduces the number of point-to-point connections and the multiplexer width.
20Partitioning On Values
- Pros
- Fewer interconnections
- Cons
- More registers
- Scheduling and FU binding should be done.
- IR has to be converted into SSA form.
Partitioning On Values
Port Mapping
Register Allocation Within RF
21Partitioning On Registers
- Pros
- Fewer registers
- Cons
- More interconnections
- Scheduling, FU binding and Register Allocation should be done.
Global Register Allocation
Partitioning of Registers into RFs
Port Mapping
22Example
- Add V1, V2 -> V5
- Mul V5, V6 -> V3
- ..
- ..
- Sub V3, V4 -> V7
- Mul V7, V8 -> V2
Control Step i
Control Step i+1
RF has 2 read ports and 1 write port
23Partitioning On Values
V1 V2 V3 V4
V5 V6 V7 V8
8 registers, 8 point-to-point connections
24Partitioning On Registers
R1 R2
R3 R4
Reg. Alloc.: V1, V7 -> R1; V2, V8 -> R2; V3, V5 -> R3; V4, V6 -> R4
4 registers, 12 point-to-point connections
25Partitioning Algorithms
- Assumptions
- The schedule is not changed.
- Scheduling and FU binding should already be done.
- Iterative Version
- Put as many registers as possible into a single register file (using ILP).
- Repeat the process until all the registers are allocated.
- Concurrent Version
- Tries to allocate all the registers simultaneously to a set of register files.
- Objective: minimize the number of RFs.
- More complex, but gives better results.
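The iterative version can be sketched in Python as follows. Note the substitution: the slides fill each register file using ILP, whereas this sketch uses a simple greedy fill as a stand-in. The 2-read/1-write port limits follow the example RF in the slides; the per-step `reads`/`writes` sets and register names are hypothetical inputs.

```python
def fits(rf, reg, reads, writes, n_read=2, n_write=1):
    """True if adding reg keeps rf within its port limits in every control step."""
    trial = rf | {reg}
    steps = set(reads) | set(writes)
    return all(
        len(trial & reads.get(c, set())) <= n_read
        and len(trial & writes.get(c, set())) <= n_write
        for c in steps
    )


def partition_iterative(registers, reads, writes, n_read=2, n_write=1):
    """Fill one RF at a time (greedy stand-in for ILP) until all registers are placed."""
    remaining = list(registers)
    rfs = []
    while remaining:
        rf = set()
        for reg in remaining:
            if fits(rf, reg, reads, writes, n_read, n_write):
                rf.add(reg)
        if not rf:  # only possible if a single register already exceeds the ports
            raise ValueError("a register cannot fit in any register file")
        rfs.append(rf)
        remaining = [r for r in remaining if r not in rf]
    return rfs


# Hypothetical schedule: registers read and written in control steps 0 and 1.
reads = {0: {"R1", "R2", "R3"}, 1: {"R3", "R4"}}
writes = {0: {"R4"}, 1: {"R1", "R2"}}
print(partition_iterative(["R1", "R2", "R3", "R4"], reads, writes))
```

Here R2 cannot share a file with R1 because both are written in step 1 and the RF has a single write port, so two register files result.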
26Port Mapping Algorithms
- Assumptions
- Scheduling, FU binding and Partitioning should already be done.
- It can be done on values (before Register Allocation).
- Iterative Version
- Takes each control step in turn and performs port mapping.
- Objective: minimize the interconnections.
- Concurrent Version
- Takes all the control steps simultaneously.
- Objective: minimize the interconnections.
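A minimal Python sketch of the iterative version for the read ports of one RF is given below. The model is an assumption: each control step lists the FU inputs that read from the RF, a connection is a (read port, FU input) pair, and each step picks the port assignment that adds the fewest new connections. The FU input names are hypothetical.

```python
from itertools import permutations


def map_ports_iterative(steps, n_ports=2):
    """Assign reads to RF read ports one control step at a time.

    steps: list of lists; each inner list names the FU inputs that read
    from this RF in that control step.
    Returns (per-step mapping of FU input -> port, set of connections).
    """
    connections = set()  # (port, fu_input) wires created so far
    mapping = []
    for sinks in steps:
        if len(sinks) > n_ports:
            raise ValueError("more reads than read ports in one control step")
        # Choose the assignment that reuses the most existing wires.
        best = min(
            permutations(range(n_ports), len(sinks)),
            key=lambda ports: sum((p, s) not in connections for p, s in zip(ports, sinks)),
        )
        mapping.append(dict(zip(sinks, best)))
        connections.update(zip(best, sinks))
    return mapping, connections


steps = [["add.a", "add.b"], ["mul.a", "add.a"], ["add.a", "mul.a"]]
mapping, wires = map_ports_iterative(steps)
print(len(wires))  # 3
```

Assigning ports in listed order for every step would need 4 connections in this example; reusing the existing port-0-to-`add.a` wire keeps it at 3. A concurrent version would choose the assignments for all control steps together instead of step by step.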
27Partitioning Algorithm
- It tries to allocate all the registers simultaneously to a set of Register Files.
- Pick each register and find the list of candidate RFs.
- Choose one RF using the Interconnection Cost Metric.
28Algorithm.
- Partitioning_concurrent_heuristic
- {
-   For each register (R1) Do
-   {
-     reg_file_list = genRegFileList(R1)
-     If (reg_file_list != Empty) then
-       reg_file = pickBestOne(reg_file_list)
-     else
-       reg_file = create new reg_file
-     allocate(R1, reg_file)
-   }
- }
29Generate the Candidate RF List for Register
- genRegFileList(Register R)
- {
-   ret_list = all available register files
-   For each cstep (C1) in which R is accessed
-     Prune the ret_list according to the accesses in C1
-   return ret_list
- }
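The concurrent heuristic and its candidate-list generation can be sketched together in Python. Several parts are assumptions: pruning is modelled as a 2-read/1-write port check per control step, `pick_best_one` uses "fullest RF first" as a placeholder because the slides do not define the interconnection cost metric, and the input data is hypothetical.

```python
def gen_reg_file_list(reg, rfs, reads, writes, n_read=2, n_write=1):
    """Candidate RFs for reg: start with all RFs, prune per accessed control step."""
    candidates = list(range(len(rfs)))
    for c in set(reads) | set(writes):
        is_read = reg in reads.get(c, set())
        is_write = reg in writes.get(c, set())
        if not (is_read or is_write):
            continue  # only steps in which reg is accessed can rule out an RF
        candidates = [
            i for i in candidates
            if len(rfs[i] & reads.get(c, set())) + is_read <= n_read
            and len(rfs[i] & writes.get(c, set())) + is_write <= n_write
        ]
    return candidates


def pick_best_one(candidates, rfs):
    """Placeholder for the interconnection cost metric: prefer the fullest RF."""
    return max(candidates, key=lambda i: len(rfs[i]))


def partition_concurrent(registers, reads, writes):
    rfs = []  # each register file is a set of register names
    for reg in registers:
        candidates = gen_reg_file_list(reg, rfs, reads, writes)
        if candidates:
            idx = pick_best_one(candidates, rfs)
        else:
            rfs.append(set())  # create a new register file
            idx = len(rfs) - 1
        rfs[idx].add(reg)      # allocate(reg, reg_file)
    return rfs


# Hypothetical schedule: registers read and written in control steps 0 and 1.
reads = {0: {"R1", "R2", "R3"}, 1: {"R3", "R4"}}
writes = {0: {"R4"}, 1: {"R1", "R2"}}
print(partition_concurrent(["R1", "R2", "R3", "R4"], reads, writes))
```

Swapping in a different `pick_best_one`, or a different order for `registers`, changes the result without touching the main loop.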
30Example
Registers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
C1: 1, 2, 5, 4, 7, 8
C2: 5, 6, 2, 7, 9
C3: 1, 4, 6, 3, 8, 10
[Figure: diagram of registers R1 to R10]
31Comments
- Time Complexity
- O(KN^2), where K is the number of control steps and N is the number of registers.
- Heuristics can be applied in two places
- The sequence in which individual registers are taken for allocation.
- The choice of register file when multiple register files are available for a single register.
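As an illustration of the first heuristic, the Python sketch below orders registers so that those accessed in the most control steps are allocated first. This ordering rule is one possible choice and is not stated in the slides; the access lists are the C1 to C3 lists from the example slide.

```python
from collections import Counter

# Access lists from the example slide: control step -> registers accessed.
ACCESSES = {
    "C1": [1, 2, 5, 4, 7, 8],
    "C2": [5, 6, 2, 7, 9],
    "C3": [1, 4, 6, 3, 8, 10],
}


def allocation_order(accesses):
    """Order registers by number of control steps accessed (descending), then by number."""
    counts = Counter(r for regs in accesses.values() for r in regs)
    return sorted(counts, key=lambda r: (-counts[r], r))


print(allocation_order(ACCESSES))  # [1, 2, 4, 5, 6, 7, 8, 3, 9, 10]
```

Registers 3, 9 and 10 are each accessed in only one control step, so they are the least constrained and come last.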
32References
- SiliconC - C. Scott Ananian
- Hardware backend for SUIF
- http://www.flex-compiler.lcs.mit.edu/SiliconC/
- Embedded system group, I.I.T., Delhi
- http://www.iitd.ernet.in/esproject
- Stanford compiler group.
- http://www.suif.stanford.edu/
- M. Balakrishnan et al., Allocation of Multiport Memories in Data Path Synthesis, in IEEE Trans. on Computer Aided Design, 1988.