Title: AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO ADAPTIVE COMPUTING SYSTEMS
1 AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO
ADAPTIVE COMPUTING SYSTEMS
- MAPLD-99
- Laurel, MD
- September 29, 1999
- Senthil Natarajan, Ben Levine, Chandra Tan, Danny Newport and Don Bouldin
- Electrical and Computer Engineering
- University of Tennessee
- Knoxville, TN 37996-2100
- TEL (423)-974-5444
- FAX (423)-974-8245
- dbouldin_at_utk.edu
2 INITIAL DESIGN CAPTURE AND ALGORITHM
VERIFICATION USING KHOROS
3 KHOROS/CANTATA IS A VISUAL PROGRAMMING LANGUAGE
FOR PROTOTYPING ALGORITHMS
4 ADAPTIVE COMPUTING SYSTEMS CONSIST OF ACCELERATOR
BOARDS OF FPGAS
5 CURRENT STATE-OF-THE-ART
KHOROS
MISSING LINK
PARTITIONING ONTO MULTIPLE FPGAS
MISSING LINK
6 CHAMPION WILL AUTOMATICALLY MAP KHOROS DESIGNS
ONTO ADAPTIVE COMPUTING SYSTEMS
7 CHAMPION WILL IMPROVE PRODUCTIVITY
- GOAL: Automate the mapping of Khoros-based applications onto adaptive computing systems to improve designer productivity by 100x.
- IMPACT:
- More application designers will be able to achieve higher-quality implementations in less time.
- Adaptive computing systems will be utilized more effectively and by a wider audience.
[Figure: Manual Mapping Onto An Adaptive Computing System. Labels: KHOROS, ACS. Axis: TIME (WEEKS).]
[Figure: Champion Will Improve Productivity By Using Estimation and Automatic Mapping of Precompiled Library Primitives. Labels: KHOROS, ESTIMATION, ACS. Axis: TIME (WEEKS).]
8 OUTLINE OF THIS PRESENTATION
- Application Development Flow
- Library Development and Verification
- Manual Implementation
- ATR Executions on ACS
- Automated Partitioning Algorithms
- Lessons Learned and Future Plans
9 APPLICATION DEVELOPMENT FLOW
APPLICATION
KHOROS/CANTATA
DATA WIDTH MATCHING AND SYNCHRONIZATION
PARTITIONING
Precompiled Libraries
SYNTHESIS AND PLACEMENT/ROUTING
Destination Hardware Architecture
ADAPTIVE COMPUTING SYSTEM
10 KHOROS/CANTATA IMPLEMENTATION: TOP LEVEL
11 KHOROS/CANTATA IMPLEMENTATION --> FIND TARGETS
12 KHOROS/CANTATA IMPLEMENTATION --> MARK FRAME PIXELS
13 ALGORITHM STRUCTURE
Find Targets and Label Image: The target pixel map is then used to identify square regions that are considered to contain targets. These target regions are then masked off (it is assumed that there is only one target per region). The target region location is then used to draw a frame that will identify the target in the output image. This is repeated six times.
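The find, mask, and frame steps above can be sketched in software as follows. This is a minimal illustration, not the Khoros workspace or the hardware cells themselves: the function names, the fixed region size, and the choice to anchor each square region at the first target pixel found in raster order are all assumptions made for the example.

```python
def find_first_target(pixel_map):
    """Return (row, col) of the first nonzero pixel in raster order, or None."""
    for r, row in enumerate(pixel_map):
        for c, value in enumerate(row):
            if value:
                return r, c
    return None


def label_targets(pixel_map, region=3, max_targets=6):
    """Find up to max_targets targets; return the frame map and region corners.

    Assumption: each square region starts at the first target pixel found.
    """
    rows, cols = len(pixel_map), len(pixel_map[0])
    remaining = [row[:] for row in pixel_map]   # working copy of the target pixel map
    frame = [[0] * cols for _ in range(rows)]   # blank frame map
    corners = []
    for _ in range(max_targets):                # the slide repeats this six times
        hit = find_first_target(remaining)
        if hit is None:
            break
        r0, c0 = hit
        r1, c1 = min(r0 + region, rows), min(c0 + region, cols)
        corners.append((r0, c0))
        for r in range(r0, r1):
            for c in range(c0, c1):
                remaining[r][c] = 0             # mask the region: one target per region
                if r in (r0, r1 - 1) or c in (c0, c1 - 1):
                    frame[r][c] = 1             # mark frame pixels on the region border
    return frame, corners


# Hypothetical 6x8 target pixel map with two clusters of target pixels.
demo_map = [[0] * 8 for _ in range(6)]
for r, c in [(1, 1), (1, 2), (2, 1), (4, 5)]:
    demo_map[r][c] = 1
frame, corners = label_targets(demo_map)
```

Masking the whole region before searching again is what prevents the same target from being reported twice on the next pass.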
14 DEVELOP AND PRECOMPILE LIBRARY CELLS
Test Inputs
Responses
KHOROS--C Floating Point
KHOROS--C Fixed Point
VHDL
FPGA
Each Library Primitive Will Be Developed at Each Level, Verified, and Characterized.
15 KHOROS AND CHAMPION LIBRARY CELLS
- Champion Cells
- Each hardware cell has only one specific function and one data type.
- Hardware cells are parametrized to correspond to the desired data bit widths.
- Data is transferred between hardware cells sequentially, one pixel per clock cycle.
- Data arrival at hardware cells must be synchronized; Champion does this by inserting delay elements.
- Khoros Traditional Cells
- Some Khoros cells have multiple functions for the user to select.
- A single cell can handle all input dimension sizes.
- Cells can handle inputs of any data type.
- Data passed between cells is stored on the host CPU file system as temporary files.
- Khoros handles data movement between cells. Each cell begins its execution only after all its inputs have been written onto the host file system.
16 DATA BIT-WIDTHS MUST BE MATCHED
[Figure: dataflow graph from IN to OUT. Nodes: CONVSTREAM_8_256_256, ADD_8 (x4), ADD_9 (x2), ADD_10, ADD, RIGHT SHIFT 3, CLIP HIGH 255.]
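A small sketch of the width bookkeeping behind this slide. It assumes, from the cell names in the figures (ADD_8 feeding ADD_9 feeding ADD_10, and pad_8_9 in the later task graph), that an add_N cell takes two N-bit unsigned operands and produces an N+1-bit result, and that a pad_A_B cell widens A bits to B bits. The function names are hypothetical and this is not Champion's actual matching procedure.

```python
def match_widths(width_a, width_b):
    """Return (cells, out_width) for adding two unsigned operands of the given widths."""
    target = max(width_a, width_b)
    cells = []
    for width in (width_a, width_b):
        if width < target:
            cells.append(f"pad_{width}_{target}")  # widen the narrower operand
    cells.append(f"add_{target}")
    return cells, target + 1                       # one extra bit holds the carry


def adder_tree(widths):
    """Reduce operand widths pairwise; return all cells used and the final width."""
    cells = []
    level = list(widths)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            used, out = match_widths(level[i], level[i + 1])
            cells.extend(used)
            next_level.append(out)
        if len(level) % 2:
            next_level.append(level[-1])           # odd operand passes through unchanged
        level = next_level
    return cells, level[0]


tree_cells, tree_width = adder_tree([8] * 8)       # eight 8-bit pixels
mixed_cells, mixed_width = match_widths(8, 9)      # mismatched operands need a pad cell
```

Summing eight 8-bit values yields four add_8, two add_9, and one add_10 cell with an 11-bit result (8 x 255 = 2040 fits in 11 bits), which a right shift by 3 brings back into the 8-bit range.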
17 DATA MUST BE SYNCHRONIZED DUE TO DIFFERENT PATH
DELAYS
Data synchronization error! Input times are not equal.
[Figure: dataflow graph from IN to OUT annotated with cell latencies (L) and data arrival times (T). Nodes: PAD_HIGH_8_11 (L 0), CONVSTREAM_8_256_256 (L 257), ADD_8 (L 1, x4), ADD_9 (L 1, x2), ADD_10 (L 1), ADD_11, RIGHT_SHIFT_12_3, CLIP_HIGH_12_255, TRUNCATE_HIGH_12_8. Arrival times shown: T 0, T 257, T 258, T 259, T 260. Edge widths shown: 8, 11, 12.]
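One way to picture the synchronization step is as arrival-time bookkeeping over the task graph: every cell has a latency (L), every output has a time (T), and any input that arrives earlier than the latest one needs a delay element. The sketch below assumes that model; the `synchronize` function and the three-cell example graph are hypothetical, with latencies borrowed from the figure (257 for the convolution stream, 0 for padding, 1 for an adder).

```python
def synchronize(cells, inputs_of, latency):
    """Compute output times and the delay to insert on each early-arriving edge.

    cells must be in topological order; source cells have no inputs.
    Returns (time, delays), where delays[(src, dst)] is in clock cycles.
    """
    time = {}
    delays = {}
    for cell in cells:
        srcs = inputs_of.get(cell, [])
        arrival = max((time[s] for s in srcs), default=0)
        for s in srcs:
            if time[s] < arrival:
                delays[(s, cell)] = arrival - time[s]   # pad the early input
        time[cell] = arrival + latency[cell]
    return time, delays


# Hypothetical graph: IN feeds both a slow convolution path and a fast pad path.
cells = ["IN", "CONV", "PAD", "ADD"]
inputs_of = {"CONV": ["IN"], "PAD": ["IN"], "ADD": ["CONV", "PAD"]}
latency = {"IN": 0, "CONV": 257, "PAD": 0, "ADD": 1}
time, delays = synchronize(cells, inputs_of, latency)
```

Here the pad path reaches the adder 257 cycles before the convolution path, so a 257-cycle delay is inserted on that edge and the adder output appears at T 258.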
18 ORIGINAL KHOROS TASK GRAPH
[Figure: task graph. Nodes: RAM_Read_pf4_var_8, Sobel_8_8_256_256, Lowpass_8_8_256_256, START_Mean_SD (x2), shift_left_9_1, shift_left_10_2, add_9, add_10, and_1, Lowpass_1_4_256_256 (x2), gte_4_4 (x2), gte_8 (x2), MITR. Nodes carry S and D annotations (values such as S404, S346, S354, S168, D262, D1, D0) and type tags (R, M, A). Edges are labeled with bit widths of 1, 4, 8, 9, 10, or 11.]
19 HARDWARE TASK GRAPH WITH DATA BIT-WIDTH MATCHED
AND SYNCHRONIZED
[Figure: task graph. Nodes from the original Khoros graph: RAM_Read_pf4_var_8, Sobel_8_8_256_256, Lowpass_8_8_256_256, START_Mean_SD (x2), shift_left_9_1, shift_left_10_2, add_9, add_10, and_1, Lowpass_1_4_256_256 (x2), gte_4_4 (x2), gte_8 (x2), MITR. Added nodes: RAM_buffer_pf4_8 (x2), pad_8_9 (x2), pad_8_10 (x2), clip_high_10_8, clip_high_11_8, trunc_high_10_8, trunc_high_11_8. Nodes carry S and D annotations (values such as S404, S346, S354, S168, S56, D262, D16S, D1, D0) and type tags (R, M, A). Edges are labeled with bit widths of 1, 4, 8, 9, 10, or 11.]
20 OUR WILDFORCE ACS USED AS A LINEAR ARRAY
PCI Interface
Local Bus
32
36-bit Data Path
Crossbar
PE0
PE2
PE1
PE4
PE3
21 PARTITION EARLY INSTEAD OF LATE TO SHORTEN THE
HARDWARE MAPPING TIME
EARLY
[Figure: early-partitioning flow. Labels: Design Input in Khoros; Workspace to Netlist; Precompiled Library Cells; K-way partitioning; Global Place & Route; partitions P1, P2, P3, ... Pk, each with Merge and Place & Route; SUCCESS.]
- Coarser granularity -> smaller netlist.
- Hierarchical and functional flow information is preserved.
- Timing synchronization is greatly facilitated.
- Less resource utilization.
LATE
[Figure: late-partitioning flow. Labels: VHDL; Optimizer; Flatten; K-way partitioning; Global Place & Route; partitions P1, P2, P3, ... Pk, each with Place & Route; Hardware Configuration; SUCCESS.]
- Finer granularity -> larger netlist.
- Functional and algorithmic flow of the design is lost.
- Timing synchronization can be a problem.
- More resource utilization.
- The resulting subcircuits are more likely to be placeable and routable.
22 MULTI-FPGA PARTITIONING
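As a rough illustration of partitioning at the library-cell level onto a linear array of processing elements, the sketch below fills PEs in order and starts a new board configuration phase when the array is full. It is a deliberately simple greedy scheme, not the partitioning algorithm used by Champion; the function name, cell names, sizes, and capacity are all hypothetical.

```python
def greedy_partition(cells, capacity, num_pes):
    """Assign (name, size) cells, in order, to the PEs of a linear array.

    Returns a list of phases; each phase is a list of per-PE cell-name lists.
    A new board configuration phase starts when every PE is full.
    """
    phases = [[[]]]
    used = 0
    for name, size in cells:
        if size > capacity:
            raise ValueError(f"{name} does not fit in one PE")
        if used + size > capacity:              # current PE is full
            if len(phases[-1]) == num_pes:      # whole array is full: reconfigure
                phases.append([[]])
            else:
                phases[-1].append([])           # move on to the next PE
            used = 0
        phases[-1][-1].append(name)
        used += size
    return phases


# Hypothetical cells in dataflow order, with made-up resource sizes.
cells = [("sobel", 400), ("lowpass", 350), ("stats", 350), ("and", 10),
         ("find", 500), ("mask", 300), ("mark", 300)]
phases = greedy_partition(cells, capacity=600, num_pes=2)
```

Because whole precompiled cells are the unit of assignment, the partitioner works on a few dozen nodes rather than a flattened gate-level netlist, which is the advantage the early-partitioning slide argues for.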
23 TIMING RESULTS FOR ATR ON OUR WILDFORCE
- OUR WILDFORCE ACS IS 156X FASTER THAN KHOROS/CPU NOW.
- IF WE HAVE SUFFICIENT LOGIC AND MEMORY SUCH THAT NO RECONFIGURATIONS ARE NEEDED, THE ACS COULD BE 667X FASTER.
- IF FULLY PIPELINED, THE ACS COULD BE 32,000X FASTER.
[Bar chart, axis from 0 to 6000; units not stated on the slide:]
Data Processing 33
Data Transfer 34
Host Code 1544
Reconfiguration 5159
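The relationship between the time breakdown and the quoted speedups can be checked with a few lines of arithmetic. This sketch assumes the four bars share one time unit (not stated on the slide) and that the CPU baseline is fixed, so each projected speedup is the measured 156x scaled by the fraction of time that remains. The variable and function names are illustrative.

```python
breakdown = {
    "Data Processing": 33,
    "Data Transfer": 34,
    "Host Code": 1544,
    "Reconfiguration": 5159,
}
measured_speedup = 156                      # "156X faster than Khoros/CPU now"

total = sum(breakdown.values())
shares = {name: value / total for name, value in breakdown.items()}


def scaled_speedup(remaining_time):
    """Speedup if the run took remaining_time and the CPU baseline stayed fixed."""
    return measured_speedup * total / remaining_time


no_reconfig = scaled_speedup(total - breakdown["Reconfiguration"])
processing_only = scaled_speedup(breakdown["Data Processing"])
```

Under these assumptions reconfiguration accounts for roughly three quarters of the total, and keeping only the data-processing time gives a figure close to the slide's 32,000x. Removing only reconfiguration gives about 656x, near but not identical to the quoted 667x, so the authors' estimate presumably rests on slightly different assumptions.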
24 PARTITIONING - 1st BOARD CONFIGURATION PHASE
[Figure: board configuration for phase 1 across CPE0, PE1, PE2, PE3, and PE4. Blocks: Input Image, Sobel Filter, Low-Pass Filter, Low-Pass Filter Check > 4 (x2), Compute Edge Stats, Check Edge Stats, Compute Intensity Stats, Check Intensity Stats, AND, Mask Invalid Target Region, Blank Frame Map, Find First Target Pixel, Mask Target Pixels, Mark Frame Pixels, Write to RAM - A, RAM (x3). Numeric labels shown: 4, 11, 72, 500, 548, 554, 1296.]
25 PARTITIONING - 2nd BOARD CONFIGURATION PHASE
[Figure: board configuration for phase 2 across CPE0, PE1, PE2, PE3, and PE4. Blocks: Read from RAM - A, Find First Target Pixel (x3), Mask Target Pixels (x3), Mark Frame Pixels (x3), Write to RAM - B, RAM (x3). Numeric labels shown: 4, 5, 53, 72, 500.]
26 PARTITIONING - 3rd BOARD CONFIGURATION PHASE
[Figure: board configuration for phase 3 across CPE0, PE1, PE2, PE3, and PE4. Blocks: Read from RAM - B, Find First Target Pixel (x2), Mask Target Pixels (x2), Mark Frame Pixels (x2), Write to RAM - C, RAM (x2). Numeric labels shown: 0, 4, 5, 53, 72, 500.]
27 PARTITIONING - 4th BOARD CONFIGURATION PHASE
[Figure: board configuration for phase 4 across CPE0, PE1, PE2, PE3, and PE4. Blocks: Input Image, Read from RAM - C, Find Max Intensity, Combine Image and Frames, Output Image, RAM (x2). Numeric labels shown: 4, 11, 53, 72, 75, 90, 119.]
28 PRODUCTIVITY IMPROVEMENT IS 100X (250 hours
manually vs. 2.5 hours automatically)
Application
Khoros
Partitioning Suite
Data Matching and Data Synchronization
WSP2NETLIST
NETLIST2STV
Synthesis/Place & Route
ACS
Automatic
Manual
time
29 LESSONS LEARNED
- Learned that the translation from KHOROS to hardware is complicated by several factors, including:
- Differences in the way blocks of data are passed from operator to operator.
- Parameters for data bit-widths must be specified for each cell.
- The difference between data-driven KHOROS cells and clock-driven hardware cells creates a need for data synchronization.
- Determined that reconfiguration time was the major obstacle to achieving high performance, and that RAM access conflicts required more reconfigurations than would otherwise be necessary.
- Learned that manual implementation of KHOROS applications on WildForce is very time-consuming and tedious (250 hours).
- Thus, great potential exists for making a significant (100x) improvement in productivity via automation.
30 SCHEDULE AND MILESTONES
May 98: Demonstrated the manual mapping of a simple KHOROS network on a Xilinx-based ACS (EVC-1). We also validated our method for library development at the KHOROS, VHDL and FPGA levels.
Sep 98: Demonstrated the manual mapping of a more complex KHOROS network on a Xilinx-based ACS (Wildforce).
Mar 99: Demonstrated the manual mapping of a complex KHOROS network with some automated FPGA partitioning on the Wildforce.
Sep 99: Automated additional portions of the application development flow.
Jan 00: Will demonstrate the Army Night Vision Lab challenge problem with automatic mapping onto the Wildforce.
Mar 00: Will demonstrate two additional challenge problems (e.g., Face Detection and Image Backprojection) on the Wildforce.
Sep 00: Will demonstrate all three challenge problems on two additional ACS platforms (e.g., an Altera-based ACS and the latest Xilinx-Virtex ACS).
31 CHAMPION: A SOFTWARE DESIGN ENVIRONMENT FOR
ADAPTIVE COMPUTING SYSTEMS