Title: A Genetic/Local Search Hybrid Architecture for VLSI Circuit Partitioning
1 A Genetic/Local Search Hybrid Architecture for VLSI Circuit Partitioning
- By Shawki Areibi
- University of Guelph
- School of Engineering
- Guelph, Ontario, Canada
2 Outline
- Introduction
- Circuit Layout
- Motivation
- Background
- Methodology
- Design Challenges
- Hardware Approach
- Results
- Future Work
3 VLSI Design: Circuit Layout
- Specification
- Physical Design Cycle
- Partitioning: Divide a circuit into smaller parts
- Placement: Place modules on a chip
- Routing: Determine how the wires connect the modules
- Extraction and Verification
- Fabrication
4 Motivation: Interconnect Delay
- Prior to 1.0 µ: Gate Delay
- More than 10 Million Transistors
- After 1.0 µ: Interconnect Delay
[Figure: Typical Gate Delay and Interconnect Delay, plotted as Delay (ns) from 0.1 to 10 against Minimum Feature Size from 2.0 µ down to 0.35 µ]
5 Motivation: Hardware Accelerators
- Complexity and size of circuits are rapidly increasing (3 billion transistors by 2008!)
- This places a demand on EDA for faster and more efficient techniques for physical design automation
- It will be practically impossible for even the fastest computers to solve these problems effectively within an acceptable time frame
- One possible solution is in the form of hardware accelerators
- This research investigates the development of a hardware-accelerated Memetic Algorithm for circuit partitioning
6 Circuit Partitioning
[Figure: six modules (M0 to M5) connected by five nets (Net 1 to Net 5) and divided between Block 0 and Block 1, with a bit table of Modules against Nets and the resulting Objective Value (Uncut Nets)]
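As a rough illustration of the objective on this slide, the sketch below counts cut and uncut nets for a two-way partition. The netlist, the module-to-block assignment, and the function name `count_nets` are hypothetical and are not taken from the figure.

```python
from typing import Dict, List, Tuple

def count_nets(netlist: Dict[str, List[int]], partition: List[int]) -> Tuple[int, int]:
    """Return (cut, uncut) net counts for a two-way partition.

    netlist maps a net name to the modules it connects;
    partition[m] is the block (0 or 1) that holds module m.
    """
    cut = 0
    for modules in netlist.values():
        blocks = {partition[m] for m in modules}
        if len(blocks) > 1:      # the net spans both blocks, so it is cut
            cut += 1
    return cut, len(netlist) - cut

# Hypothetical example data (assumption, not the netlist drawn on the slide).
NETLIST = {
    "net1": [0, 1],
    "net2": [0, 2, 4],
    "net3": [2, 4],
    "net4": [1, 3, 5],
    "net5": [3, 4],
}
PARTITION = [0, 1, 0, 1, 0, 1]   # module index -> block

if __name__ == "__main__":
    cut, uncut = count_nets(NETLIST, PARTITION)
    print("cut nets:", cut, "uncut nets:", uncut)
```

A partitioner can either minimize the first number or maximize the second; the two are equivalent because they always sum to the number of nets.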
7 Heuristic/Meta-Heuristic Techniques
- Local Search
- Single point based heuristic
- Swap/Move based technique
- Iteratively improves solution
- Exploits the solution space
- Genetic Algorithm
- Population based heuristic
- Based on biological reproduction
- survival of the fittest
- Explores solution space
8 Genetic Algorithm
[Figure: Genetic vs Local Search; labels: Local Search, Not Global Minimum]
9 Research Goals
- Hardware Implementation of GA
- Hardware Implementation of Local Search
- Achieve Speedup
- Investigate Hybrid Algorithm Techniques
- Improve Performance
- Investigate the suitability of High Level Languages, e.g. Handel-C, in designing systems
10 Design Restrictions
- Architecture must
- Fit common FPGA devices
- Adapt to other optimization problems
- TSP
- 0-1 Knapsack problem
- Handle large circuits
- Have user programmable parameters from host PC
11 MCNC Benchmark Suite
25,114
125
12 Celoxica Handel-C
- High-level language based on ISO/ANSI-C
- Eliminates need for retraining software engineers
- Generates VHDL or EDIF code
- Supports most FPGA devices
- Optimizes second-party PAR programs
13 Approach
- Explore the most efficient design
- Achieve increased performance
- Pipelining and parallelization
- Divide the tasks into separate but concurrent components
[Figure: FPGA Chip containing the Different Tasks of the algorithm]
14 Hardware Parallelism
- Different sections of the design operate concurrently
- Multiple tasks completed within a single clock cycle
[Figure: evaluation of F(x) from the terms (2 3), (2 5), (4 2) on Basic Computers versus a Hardware Design; operation labels: Multiplication, Addition, Division, Addition]
15 Hardware Pipelining
- Assembly line
- Different stages processed at the same time
- Number of stages determines throughput
[Figure: Basic Sequential Computers versus a Pipelined Hardware Design processing Task 1 to Task 4 in three stages each; the pipelined design achieves three times the throughput]
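The "three times the throughput" figure can be checked with a small cycle count. This is a minimal sketch under the assumptions that every stage takes exactly one clock cycle and the pipeline never stalls; the function names are illustrative only.

```python
def sequential_cycles(num_tasks: int, num_stages: int) -> int:
    """Each task runs all of its stages before the next task starts."""
    return num_tasks * num_stages

def pipelined_cycles(num_tasks: int, num_stages: int) -> int:
    """The first task fills the pipeline; after that one task finishes per cycle."""
    return num_stages + num_tasks - 1

def speedup(num_tasks: int, num_stages: int) -> float:
    """Ratio of sequential to pipelined cycle counts."""
    return sequential_cycles(num_tasks, num_stages) / pipelined_cycles(num_tasks, num_stages)

if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(n, sequential_cycles(n, 3), pipelined_cycles(n, 3), round(speedup(n, 3), 2))
```

For the four tasks of the slide the gain is only 2x (12 cycles versus 6) because of the fill time; the speedup approaches the number of stages, here 3, as the number of tasks grows.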
16 Bitwise Representation
- To improve timing, each bit of the word represents a cell within the solution
- Multiple cells manipulated in a single cycle
[Figure: a solution word whose bits (cells 0, 1, 2, ...) are marked "Cell in partition 1" or "Cell in partition 0"; Parent 0 and Parent 1 are combined through a Uniform Mask to produce the Offspring]
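The mask-based crossover on this slide can be sketched with word-wide bit operations, so that one AND/OR expression handles every cell in the word at once. The word width and the convention that a mask bit of 1 selects Parent 1 are assumptions made for illustration.

```python
import random

WORD_BITS = 16  # assumed word width; the slide does not state one

def uniform_crossover(parent0: int, parent1: int, mask: int, width: int = WORD_BITS) -> int:
    """Build the offspring word in a single expression.

    Bit i comes from parent1 where mask bit i is 1, otherwise from parent0.
    """
    full = (1 << width) - 1
    return ((parent0 & ~mask) | (parent1 & mask)) & full

def random_mask(rng: random.Random, width: int = WORD_BITS) -> int:
    """Draw a uniform random mask of the given width."""
    return rng.getrandbits(width)

if __name__ == "__main__":
    p0, p1, mask = 0b11001010, 0b01100111, 0b10101100
    child = uniform_crossover(p0, p1, mask, width=8)
    print(format(child, "08b"))
    print(format(uniform_crossover(p0, p1, random_mask(random.Random(0), 8), 8), "08b"))
```

In hardware the same expression maps to one layer of AND/OR gates per bit, which is what allows multiple cells to be manipulated in a single cycle.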
17 Aim of Genetic Algorithm in Hardware
[Figure: Pipelined Flow producing Child 0 and Child 1]
18 Memory Issues
- External RAM limits the architecture
- Semaphores add one clock cycle to memory access
- Memory-intensive routines execute sequentially
[Figure labels: Fitness Calculation, Crossover Routine, Repair Routine, Switch, External Memory]
19 Memory Problems
[Figure: clock-cycle timeline of five memory Requests; labels: Previous clock, Memory, Memory Access, Semaphore, Total clock]
20 Memory Solution 1
[Figure: clock-cycle timeline of two memory Requests; labels: Previous clock, Memory, Memory Access, Semaphore, Total clock]
21 Fitness Memory
- Problem with parallelization and memory
- Limits parallelization
[Figure labels: Fitness Calculation, Benchmark Memory, Fitness Module]
22 Genetic Algorithm Timing Results
- Results gathered from 5 trial runs
- Areibi Software: Sun Blade 3000, 900 MHz UltraSparc III
- Bitwise Software: HP Workstation 2100, 2.4 GHz Intel Pentium 4
23 Genetic Algorithm Solution Results
- Results gathered from 5 trial runs
- Areibi Software: Sun Blade 3000, 900 MHz UltraSparc III
24 Genetic Algorithm Discussion
- Execution Speed Limitation
- Sequential Fitness Calculation
- Operating at ¼ external clock rate
- Handel-C's use of Semaphores
- Potential Design Solution Quality Limitation
- Difference in Crossover techniques
- Handel-C (Uniform)
- Areibi Software (2-Point)
- Effect of Random Number Generator
25
- Question: Is it possible to improve the solution generated by the Genetic Algorithm?
- Answer: YES!
- Genetic Algorithms are known to be good at exploring the solution space but are weak at fine-tuning the solutions
- Solution: Insert a simple hill-climbing Local Search to improve upon the solutions
26 Local Search Algorithm
- The proposed Local Search forces a net to be contained exclusively within one partition
[Figure: modules and Nets 1 to 5 split between Partition 0 and Partition 1, with the Objective Value (Uncut Nets) and the Cell Data bit vectors for Partition 0 and Partition 1]
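One way to read the move on this slide is: pick a net, pull all of its cells into one partition, and keep the move only if more nets end up uncut. The sketch below follows that reading. The balance check, its tolerance `max_imbalance`, the example netlist, and all function names are assumptions, not details of the Handel-C architecture.

```python
from typing import Dict, List, Tuple

def uncut_nets(netlist: Dict[str, List[int]], partition: List[int]) -> int:
    """Number of nets whose cells all sit in the same partition."""
    return sum(1 for cells in netlist.values()
               if len({partition[c] for c in cells}) == 1)

def is_balanced(partition: List[int], max_imbalance: int) -> bool:
    """Assumed balance criterion: partition sizes differ by at most max_imbalance."""
    ones = sum(partition)
    zeros = len(partition) - ones
    return abs(ones - zeros) <= max_imbalance

def local_search(netlist: Dict[str, List[int]], partition: List[int],
                 max_imbalance: int = 2) -> Tuple[List[int], int]:
    """Hill climb by forcing one net at a time entirely into partition 0 or 1."""
    best = list(partition)
    best_score = uncut_nets(netlist, best)
    improved = True
    while improved:                      # stop when a full pass finds nothing better
        improved = False
        for cells in netlist.values():
            for target in (0, 1):
                candidate = list(best)   # backup copy before applying the move
                for c in cells:
                    candidate[c] = target
                if not is_balanced(candidate, max_imbalance):
                    continue
                score = uncut_nets(netlist, candidate)
                if score > best_score:   # accept only strict improvements
                    best, best_score = candidate, score
                    improved = True
    return best, best_score

# Hypothetical example data (assumption).
NETLIST = {"net1": [0, 1], "net2": [0, 2, 4], "net3": [2, 4],
           "net4": [1, 3, 5], "net5": [3, 4]}
START = [0, 1, 1, 0, 0, 1]

if __name__ == "__main__":
    solution, score = local_search(NETLIST, START)
    print(solution, score)
```

Because only strict improvements are accepted and the score is bounded by the number of nets, the loop always terminates, though possibly at a local rather than global maximum.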
27 Update Partition Data
- Backup Data and Apply Next Move
- Determine Cells (Modules) Connected to Net
- Determine Which Other Nets are Connected to these Cells (Modules)
- Determine Status of these Nets
[Figure: Netlist table of Net1 to Net5 against modules M0 to M5]
28 Sequential Issues
[Figure labels: Select Next Move, Copy Solution, Update Net Info, Block RAM (four instances)]
29 Handel-C Local Search Timing Results
- Results gathered from 5 trial runs
- Areibi Software: Sun Blade 3000, 900 MHz UltraSparc III
- Bitwise Software: HP Workstation 2100, 2.4 GHz Intel Pentium 4
30 Local Search Discussion
- Performance Improvement
- Handel-C achieves a 2.1 times speedup over software
- Improvement caused by evaluating the balancing criteria in parallel, which generates 85% of the total software execution time
- Cause of bottlenecks
- Creating backup copies of the original data
- Limitation due to memory
31 Memetic Algorithm
- Two Memetic Algorithms are developed from the Genetic Algorithm and Local Search architectures
- Exhaustive Memetic
- Applies the Local Search to a random pool of individuals from the final Genetic Algorithm population
- Forces the individuals to local maxima
- Intermediate Memetic
- Applies the Local Search to a few individuals after every X generations of the Genetic Algorithm
- Attempts to steer the population towards fitter solutions
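The difference between the two variants is only where the Local Search is scheduled, which the sketch below tries to make concrete. Everything apart from that scheduling is an assumption chosen to keep the example short: the toy fitness `one_max` stands in for the partitioning fitness, the hill climber flips single bits, and selection, mutation rate, and parameter names are illustrative.

```python
import random
from typing import Callable, List

Individual = List[int]

def one_max(ind: Individual) -> int:
    """Toy stand-in fitness (assumption): number of 1 bits."""
    return sum(ind)

def hill_climb(ind: Individual, fitness: Callable[[Individual], int]) -> Individual:
    """Single-bit-flip hill climbing until no flip improves the fitness."""
    best = list(ind)
    best_fit = fitness(best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            best[i] ^= 1                 # try flipping bit i
            f = fitness(best)
            if f > best_fit:
                best_fit, improved = f, True
            else:
                best[i] ^= 1             # undo a non-improving flip
    return best

def next_generation(pop: List[Individual], fitness, rng: random.Random) -> List[Individual]:
    """One GA generation: elitism, binary tournaments, uniform crossover, rare mutation."""
    def pick() -> Individual:
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b
    children = [list(max(pop, key=fitness))]     # keep the current best
    while len(children) < len(pop):
        p0, p1 = pick(), pick()
        child = [x if rng.random() < 0.5 else y for x, y in zip(p0, p1)]
        if rng.random() < 0.1:
            child[rng.randrange(len(child))] ^= 1
        children.append(child)
    return children

def memetic(fitness, mode: str = "exhaustive", n_bits: int = 32, pop_size: int = 20,
            generations: int = 30, every_x: int = 5, pool: int = 4, seed: int = 1) -> Individual:
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for gen in range(1, generations + 1):
        pop = next_generation(pop, fitness, rng)
        # Intermediate Memetic: Local Search on a few individuals every X generations.
        if mode == "intermediate" and gen % every_x == 0:
            for idx in rng.sample(range(pop_size), pool):
                pop[idx] = hill_climb(pop[idx], fitness)
    # Exhaustive Memetic: Local Search on a random pool of the final population.
    if mode == "exhaustive":
        for idx in rng.sample(range(pop_size), pool):
            pop[idx] = hill_climb(pop[idx], fitness)
    return max(pop, key=fitness)

if __name__ == "__main__":
    for mode in ("exhaustive", "intermediate"):
        print(mode, one_max(memetic(one_max, mode=mode)))
```

On the toy fitness both variants reach the optimum trivially; the point of the sketch is the control flow, not a comparison of solution quality.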
32 Algorithm Solution Quality
33 Algorithm Timing Results
34 Memetic Algorithm Discussion
- Faster than software GA
- Solution quality does not equal that of the software GA
- Exhaustive Memetic Algorithm
- Equal solution quality to Local Search
- Nearly twice the execution time
- Intermediate Memetic architecture
- Weaker results than both Local Search and Exhaustive Memetic
- Significantly more execution time
35 CAD Algorithm Results
- Genetic Algorithm
- Five times faster than traditional software
- 85% of the solution quality of traditional software
- 2 times slower than bitwise software
- Local Search
- 2.1 times faster than bitwise LS software
- Memetic Algorithms
- Slightly weaker results than traditional GA software
- Solution qualities equal to pure Local Search architecture
36 Current Work
- Handel-C Local Search
- Handel-C Genetic Algorithm
- Handel-C Memetic Algorithm
- Investigated Handel-C vs VHDL
Future Work
- Implement the Genetic Algorithm to further investigate findings
- Investigate effects of Crossover and RNG
- Improve Fitness Calculation (Pipeline/Parallelism)
- Incorporate Memory to eliminate repetitive searching
37 Conclusion
- Development of a Handel-C implementation of a Memetic Algorithm to incorporate a new local search methodology for circuit partitioning
- Development of a VHDL and Handel-C implementation of a Local Search algorithm for circuit partitioning, including a detailed comparison between the two approaches
- Comparison of the true speed performance of Handel-C architectures with software architectures
38
39 Why Develop in Handel-C vs VHDL
- Celoxica's Highlights for Handel-C
- Software engineers design hardware without retraining
- Rapid development of multi-million gate FPGAs
- Predictable and controllable hardware behavior
- Enables efficient use of available hardware
- Decrease in development time (factor of 3 to 4)
- Questions
- Are these claims accurate? And is Handel-C practical for designing hardware?
40 Handel-C Findings
- Advantages
- Development time: Handel-C (1 week, 1400 lines of code) vs VHDL (5 weeks, 8000 lines of code)
- Ease of learning the language
- Disadvantages
- Memory management: ¼ the external clock rate
- Resources used: 1.2 times the un-optimized VHDL design
- Slower execution time than the VHDL design
41 High vs Low Level Languages
Performance results of the Local Search Architectures
- VHDL requires 55% of the execution time needed by Handel-C while operating at half the frequency
- Reasons for improvement
- No Semaphores
- VHDL operates at full clock rate
- Architecture is optimized to perform a specific task
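Taking the figures above as 55% of the execution time at half the clock frequency, a rough clock-cycle comparison follows. It assumes execution time equals the number of clock cycles divided by the clock frequency; the symbols N (cycles), t (execution time), and f (frequency) are introduced here for illustration.

```latex
\frac{N_{\mathrm{VHDL}}}{N_{\mathrm{HC}}}
  = \frac{t_{\mathrm{VHDL}}\, f_{\mathrm{VHDL}}}{t_{\mathrm{HC}}\, f_{\mathrm{HC}}}
  = 0.55 \times 0.5
  = 0.275
```

Under these assumptions the VHDL design would need roughly 3.6 times fewer clock cycles than the Handel-C design for the same work, which is consistent with the listed reasons (no semaphores, full-rate memory access).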
42 Handel-C Discussion
- Handel-C
- There are three areas of concern in most hardware designs
- Minimize development time
- Increase execution performance
- Decrease size of design
- The only benefit of the Handel-C language is minimizing development time (designing and debugging)
43 Random Number Generator
Add and Mult
LFSR
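The slide names an LFSR (linear feedback shift register) as one random number source without giving its parameters. As a minimal sketch, the code below uses a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11, a commonly cited maximal-length choice; the width, taps, and seed are assumptions and not the values used in the hardware.

```python
from typing import List

def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one shift (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_words(seed: int, count: int) -> List[int]:
    """Return `count` successive 16-bit states, usable as crossover masks."""
    if seed == 0:
        raise ValueError("an all-zero seed locks the LFSR at zero")
    words = []
    state = seed
    for _ in range(count):
        state = lfsr16_step(state)
        words.append(state)
    return words

def lfsr16_period(seed: int = 0xACE1) -> int:
    """Count the shifts needed to return to the seed state."""
    state = lfsr16_step(seed)
    count = 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count

if __name__ == "__main__":
    print([format(w, "016b") for w in lfsr16_words(0xACE1, 3)])
    print("period:", lfsr16_period())
```

An LFSR costs only a shift register and a few XOR gates, which is why it is attractive on an FPGA; the trade-off is that successive states are strongly correlated, one possible reason the slides list the effect of the random number generator as an open question.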
44 Initial Genetic Algorithm Fitness Values (prim2)
Hardware
Software
- Initial Population
- Mean 1050.23
- SD 23.69
- Best 1111.6
- Worst 996.4
- Final Value
- Mean 1742.9
- SD 6.927
- Best 1753.6
- Worst 1722.6
- Initial Population
- Mean 1085.06
- SD 26.32
- Best 1147
- Worst 1011.0
- Final Value
- Mean 2536.7
- SD 15.114
- Best 2574.4
- Worst 2493.6
45 Reconfigurable Memory
- Causes a break in the pipeline
- Performs the same task as the current sequential fitness calculation, except it would allow for larger benchmarks
- Beneficial if configuring block RAMs (eliminates the need for the 4 clock read/write)
46 Other Optimization Problems to Adapt
Repair Module
Fitness Module
Replacement
Repair Module
Fitness Module
47 Future Work
- Investigate the difference between the Genetic Algorithm software (Areibi01) and the Handel-C architecture
- Optimize the Genetic Algorithm by implementing it in VHDL
- Adapt the current design to perform two fitness calculations using a single memory read
- Divide the Fitness Calculation into numerous pipeline stages to increase throughput
- Incorporate memory into the Local Search to eliminate repetitive searching
48 FPGAs
- Re-programmable hardware to perform different tasks
- Inexpensive hardware development
- Exploits qualities of hardware
- Parallelism
- Pipelining
49 VHDL
- Describes the behavior of a circuit
- Programmed in concurrent manner
- Complex sequential algorithms
- Lengthy debugging process