Title: Resource Saving in Micro-Computer Software
1 Resource Saving in Micro-Computer Software and FPGA Firmware Designs
- Wu, Jinyuan
- Fermilab
- Nov. 2006
2 Resource Saving in FPGA (from CompactFPGAdesign.pdf)
- Glue Logic
- Digitization
  - TDC, (ADC), etc.
- Communication
  - C5, Digital Phase Follower, etc.
- Data Organization
  - Zero-Suppression, Parasitic Event Building, etc.
- Reconfigurable Computing
  - Hash Sorter, TTF, ELMS, etc.
Software -- Firmware
3 Computer Is Fast
- This is the first impression of many beginners.
- FPGA is big.
- Program creation time > execution time.
4 How to Slow Down Computers?
Square Wave Generator
[Figure: square wave with period T]
CPU: Z80, 4 MHz
        LD A,A
        NOOP
1 NOOP spends 1 us; 1,000,000 NOOPs spend 1 s.
- Single Layer Loop
        LD A,255
BACKA   NOOP
        DEC A
        JP NZ, BACKA
  - 256 x 3 x 4 x 0.25 us ≈ 0.75 ms
- Nested Loops
        LD B,255
BACKB   LD A,255
BACKA   NOOP
        DEC A
        JP NZ, BACKA
        LD A,B
        DEC B
        DEC A
        JP NZ, BACKB
  - 256 x 0.75 ms ≈ 0.19 s
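The timing arithmetic above can be reproduced with a small helper. This is only a sketch: the function names are hypothetical, and it assumes, as the slide does, 3 instructions per pass and 4 clock cycles per instruction at 4 MHz (real Z80 instruction timings differ per instruction).

```c
#include <stdio.h>

/* Time, in microseconds, spent in a counted software delay loop.
 * All parameters are illustrative; the slide's round numbers are
 * 3 instructions per pass and 4 clock cycles per instruction. */
double loop_delay_us(unsigned iterations, unsigned instr_per_iter,
                     unsigned cycles_per_instr, double clock_mhz)
{
    /* total clock cycles divided by clock cycles per microsecond */
    return (double)iterations * instr_per_iter * cycles_per_instr / clock_mhz;
}

/* Print the single-loop and nested-loop estimates. */
void print_delay_estimates(void)
{
    double inner = loop_delay_us(256, 3, 4, 4.0); /* one inner loop */
    double outer = 256.0 * inner;                 /* 256 inner loops */

    printf("single loop: %.0f us\n", inner);
    printf("nested loop: %.3f s\n", outer / 1e6);
}
```

The exact products are 768 us and about 0.197 s, which the slide rounds to 0.75 ms and 0.19 s.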
5 Knowing Slow, Knowing Fast: Where Resources Can Be Saved
- For micro-computer software
  - Pay attention to loops and frequently called subroutines.
  - Especially inner-most nested loops.
- For FPGA firmware
  - Algorithms rooted in micro-computer software.
  - Reusable blocks.
  - Occasionally used functions.
6 Example: Inner-Product
        LD   R1, n
        LD   R2, addr_a
        LD   R3, addr_X
        LD   R7, 0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
        DEC  R1
        BRNZ BckA1
[Figure: data path of the loop. R2 and R3 point into arrays a and X, R4 and R5 hold the fetched operands, the multiplier (x) produces R6, R7 accumulates the sum, and R1 is the loop counter (R1--).]
- Multiplier-less algorithms.
- Avoid using conditional branch for loop control: ELMS. Saves 25% execution time in this case.
- Reuse computations: using fast algorithms like FFT.
- Avoid entering the loop: using early constraints.
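For readers more used to C than to assembly, the same inner product might look like the sketch below. The function name and types are assumptions for illustration; the comments map each step back to the registers in the listing.

```c
#include <stddef.h>

/* Inner product of a[0..n-1] and x[0..n-1]. */
long inner_product(const int *a, const int *x, size_t n)
{
    long sum = 0;   /* R7: running sum */
    size_t i;       /* plays the role of the loop counter R1 */

    for (i = 0; i < n; i++) {
        /* R4 = a[i], R5 = x[i], R6 = R4 * R5, R7 += R6 */
        sum += (long)a[i] * x[i];
    }
    return sum;
}
```

Of the 8 instructions in the assembly loop body, DEC and BRNZ serve only loop control, which is where the 25% figure comes from.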
7 Computing Module in Micro-processor and FPGA
[Figure: a computing module evaluating an arithmetic expression on the data 100, 3, 4, 5, 7, driven by a control sequence that starts with LD]
- Micro-processors use a fully sequenced approach: one operation is performed in each clock cycle.
- In FPGA, flattened logic is allowed and is fast, but it takes a large silicon area.
8 Sequencing in FPGA for Resource Control
[Figure: Initialization1 and Initialization2 blocks and Sum1 to Sum4 stages, drawn for channels CH0 to CH3]
- Sequencing is a very efficient means of resource control in FPGA.
- Reuse processing resources for similar functions and/or different channels.
- Pay attention to occasionally-used functions like initialization.
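As a software analogy of the sequencing idea, the sketch below models one shared accumulator that is stepped through every sample of every channel, one addition per clock cycle. The constants, names, and the one-add-per-cycle cost model are assumptions for illustration, not details from the slide.

```c
#define N_CH   4   /* channels CH0..CH3 */
#define N_SAMP 4   /* samples summed per channel (assumed) */

/* Sum N_SAMP samples for each channel using a single shared
 * accumulator. Returns the number of modeled clock cycles. */
unsigned sequenced_sums(int in[N_CH][N_SAMP], int out[N_CH])
{
    unsigned cycles = 0;
    int ch, s;

    for (ch = 0; ch < N_CH; ch++) {
        int acc = 0;                  /* the one shared adder/register */
        for (s = 0; s < N_SAMP; s++) {
            acc += in[ch][s];         /* one addition per cycle */
            cycles++;
        }
        out[ch] = acc;
    }
    return cycles;
}
```

In this toy model, a fully flattened design would need N_CH x (N_SAMP - 1) = 12 adders to produce all sums at once, while the sequenced version reuses one adder over 16 cycles.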
9 Suggestion (1)
Use partially flattened and partially sequential logic to balance speed and size.
10 ELMS: Enclosed Loop Micro-Sequencer
- A PC+ROM structure can be a very good sequencer in FPGA.
- The Conditional Branch Logic is added to support regular conditional branches as in micro-processors.
- The Loop & Return Logic and Stack are added to support FOR loops with pre-defined iterations at machine code level.
- The resource usage of ELMS in FPGA is very small.
        FOR  BckA1 EndA1 n
        LD   R2, addr_a
        LD   R3, addr_X
        LD   R7, 0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
11 ELMS: Detailed Block Diagram
12 FOR Loops at Machine Code Level
Loop with conditional branch:
        LD   R1, n
        LD   R2, addr_a
        LD   R3, addr_X
        LD   R7, 0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
        DEC  R1
        BRNZ BckA1
Loop with FOR:
        FOR  BckA1 EndA1 n
        LD   R2, addr_a
        LD   R3, addr_X
        LD   R7, 0
BckA1   LD   R4, (R2)
        INC  R2
        LD   R5, (R3)
        INC  R3
        MUL  R6, R4, R5
EndA1   ADD  R7, R7, R6
- The looping sequence is known in this example before entering the loop.
- Regular micro-processors treat the sequence as unknown.
- ELMS supports FOR loops with pre-defined iterations at machine code level.
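The saving can be checked by counting executed instructions in the two listings. The sketch below assumes every instruction, including FOR, costs one time slot; the function names are hypothetical.

```c
/* Conditional-branch version: 4 setup instructions, then 8 per
 * iteration (6 of useful work plus DEC and BRNZ). */
unsigned long conventional_count(unsigned long n)
{
    return 4 + 8 * n;
}

/* FOR version: FOR plus 3 setup instructions, then 6 per iteration. */
unsigned long for_loop_count(unsigned long n)
{
    return 4 + 6 * n;
}

/* Fraction of executed instructions saved by the FOR version. */
double saving_fraction(unsigned long n)
{
    return 1.0 - (double)for_loop_count(n) / (double)conventional_count(n);
}
```

Under this counting, the saving approaches 2/8 = 25% as n grows, in line with the figure quoted for the inner-product example.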
13 Suggestion (2)
Eliminate unnecessary instructions, functions, time slots, etc. whenever possible.
14 Do You SUDOKU?
- Fill in 1-9 so that
  - Each column contains 1-9 without repeating.
  - Each row contains 1-9 without repeating.
  - Each 3x3 box contains 1-9 without repeating.
- It is fun to solve by hand.
- It is also fun to write a solver program, or read a good one.
15 A Possible SUDOKU Solver?
- For all empty boxes, assign 1-9 to each.
- Check correct or not.
- If not, repeat.
81 - 28 = 53 empty boxes, 9 possibilities for each box.
Total possibilities: 9^53.
Assume a computer checks 10^10 possibilities/sec.
A year: 3x10^7 sec.
Total time to solve: 9^53 / (10^10 x 3x10^7) >> 1000 years
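The estimate can be evaluated numerically. The helper names below are hypothetical, and the checking rate and seconds per year are the slide's round numbers.

```c
/* choices^slots by repeated multiplication (avoids needing libm). */
double count_combinations(double choices, unsigned slots)
{
    double total = 1.0;
    unsigned i;

    for (i = 0; i < slots; i++)
        total *= choices;
    return total;
}

/* Years needed to try every combination at a fixed checking rate. */
double brute_force_years(double choices, unsigned slots,
                         double checks_per_sec, double sec_per_year)
{
    return count_combinations(choices, slots)
           / (checks_per_sec * sec_per_year);
}
```

With 9 choices, 53 boxes, 10^10 checks/sec and 3x10^7 sec/year, the result is on the order of 10^33 years, far beyond the ">> 1000 years" bound.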
16 A Real SUDOKU Solver
- Eliminate impossible values for each empty box.
- Assign a possible value to the box.
- Repeat.
Total time to solve: < 1 sec
17 sudoku.c
#include <stdio.h>

/* show_board() -- print the board, leaving empty boxes blank */
void show_board(int b[9][9])
{
    int i, j;

    printf("---------------------\n");
    for (i = 0; i < 9; i++) {
        for (j = 0; j < 9; j++) {
            if (b[i][j] == 0)
                printf("  ");
            else
                printf(" %d", b[i][j]);
            if (j % 3 == 2)
                printf(" ");
        }
        printf("\n");
        if (i % 3 == 2)
            printf("---------------------\n");
    }
}

/* init_board() -- initialize the board with all 0 */
void init_board(int b[9][9])
{
    int i, j;

    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            b[i][j] = 0;
}

/* read_board() -- read the board from input file */
void read_board(FILE *fp, int b[9][9])
{
    int i, j, c;

    i = 0;
    j = 0;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            i++;
            j = 0;
        } else {
            /* a blank is an empty box; ignore input beyond 9x9 */
            if (c != ' ' && i < 9 && j < 9)
                b[i][j] = c - '0';
            j++;
        }
    }
}

/* check_row() -- check the row */
int check_row(int b[9][9], int x, int y, int v)
{
    int i;

    for (i = 0; i < 9; i++)
        if (i != y)
            if (b[x][i] == v)
                return 0;
    return v;
}

/* check_column() -- check the column */
int check_column(int b[9][9], int x, int y, int v)
{
    int i;

    for (i = 0; i < 9; i++)
        if (i != x)
            if (b[i][y] == v)
                return 0;
    return v;
}

/* check_square() -- check the square */
int check_square(int b[9][9], int x, int y, int v)
{
    int i, j, x0, y0;

    x0 = x / 3;
    y0 = y / 3;
    for (i = x0 * 3; i < x0 * 3 + 3; i++)
        for (j = y0 * 3; j < y0 * 3 + 3; j++)
            if (!((x == i) && (y == j)))
                if (b[i][j] == v)
                    return 0;
    return v;
}

/* unique_solution() -- find the unique solution for x, y */
int unique_solution(int b[9][9], int x, int y)
{
    int s = 0, n = 0, v;

    for (v = 1; v < 10; v++)
        if (check_row(b, x, y, v) &&
            check_column(b, x, y, v) &&
            check_square(b, x, y, v)) {
            s = v;
            n++;
        }
    if (n == 1)
        return s;
    else
        return 0;
}

/* possible_solutions() -- find the possible solutions for x, y */
int possible_solutions(int b[9][9], int x, int y, int *s)
{
    int n = 0, v;

    for (v = 1; v < 10; v++)
        if (check_row(b, x, y, v) &&
            check_column(b, x, y, v) &&
            check_square(b, x, y, v))
            s[n++] = v;
    return n;
}

/* solve1() -- one pass to solve the puzzle */
int solve1(int b[9][9])
{
    int i, j;
    int solved = 0;

    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            if (b[i][j] == 0) {
                b[i][j] = unique_solution(b, i, j);
                if (b[i][j])
                    solved++;
            }
    return (solved);
}

int solve(int b[9][9])
{
    int b2[9][9], i, j, k, n;
    int ps[9], s[9], pn, x = 0, y = 0;

    /* copy the board for recursion */
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            b2[i][j] = b[i][j];

    while (solve1(b2))
        show_board(b2);

    /* figure out possible solutions for the unknowns and
       remember the box with the fewest candidates */
    pn = 10;
    for (i = 0; i < 9; i++)
        for (j = 0; j < 9; j++)
            if (b2[i][j] == 0) {
                for (k = 0; k < 9; k++)
                    s[k] = 0;
                n = possible_solutions(b2, i, j, s);
                if (n < pn) {
                    pn = n;
                    for (k = 0; k < n; k++)
                        ps[k] = s[k];
                    x = i;
                    y = j;
                }
            }

    if (pn == 10) {
        /* that's it */
        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                if (b2[i][j] == 0)
                    return 0;
        return 1;
    }

    /* try each candidate for the chosen box and recurse */
    for (i = 0; i < pn; i++) {
        b2[x][y] = ps[i];
        show_board(b2);
        if (solve(b2))
            return 1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    int board[9][9];
    FILE *fp;

    if (argc > 1)
        fp = fopen(argv[1], "r");
    else
        fp = stdin;
    if (fp == NULL)
        return 1;

    init_board(board);
    read_board(fp, board);
    show_board(board);
    solve(board);
    return 0;
}
18 A Possible Track Finder?
- Choose a hit for each layer.
- Fit and calculate χ^2.
- Cut on χ^2.
10 layers: O(n^10). 100 hits/layer.
Total possibilities: 10^20.
Assume a computer checks 10^10 possibilities/sec.
A year: 3x10^7 sec.
Total time to check all possibilities: 10^20 / (10^10 x 3x10^7) > 300 years
19 A Better Track Finder
- Choose a hit for each of layers 1 and 2.
- Choose only compatible hits on layers 3 to 10.
- Calculate χ^2.
- Cut on χ^2.
First constraint at layer 3: O(n^3). 100 hits/layer.
Total possibilities: 10^6.
Assume a computer checks 10^10 possibilities/sec.
Total time to check all possibilities: 10^6 / 10^10 sec = 0.1 ms
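A minimal sketch of the early-constraint idea follows. It assumes straight-line tracks, equally spaced layers, and integer hit positions, none of which come from the slide; the function names, array sizes, and tolerance parameter are hypothetical. Layers are indexed from 0 here, so the first constraint is applied at index 2 (the slide's layer 3).

```c
#define N_LAYERS 10
#define MAX_HITS 100

/* Return 1 if some hit on the layer lies within tol of predicted. */
static int has_hit_near(const int *hits, int n, int predicted, int tol)
{
    int i;

    for (i = 0; i < n; i++) {
        int d = hits[i] - predicted;
        if (d < 0)
            d = -d;
        if (d <= tol)
            return 1;
    }
    return 0;
}

/* Count track candidates. A seed is a pair of hits on the first two
 * layers; it is extrapolated layer by layer and dropped at the first
 * layer that has no compatible hit. */
int count_tracks(int hits[N_LAYERS][MAX_HITS],
                 const int n_hits[N_LAYERS], int tol)
{
    int found = 0, i, j, l;

    for (i = 0; i < n_hits[0]; i++) {
        for (j = 0; j < n_hits[1]; j++) {
            int x0 = hits[0][i];
            int slope = hits[1][j] - x0;   /* position change per layer */
            int ok = 1;

            /* the loop stops as soon as one layer fails */
            for (l = 2; l < N_LAYERS && ok; l++)
                ok = has_hit_near(hits[l], n_hits[l], x0 + l * slope, tol);
            found += ok;
        }
    }
    return found;
}
```

Most wrong seeds fail at the third layer, so the dominant cost is roughly n^2 seeds times n hits examined there. A full fit and χ^2 cut would then be needed only for the few surviving candidates.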
20 Suggestion (2)
- Use early constraints to reduce the number of iterations.
- Evaluate the first constraint as simply as possible. (e.g. Offset, rather than χ^2)
- Apply the first constraint as early as possible. (e.g. At layer 3, not until 10)
21 Triplets
- Triplet
  - Data item with 2 free parameters.
  - (# of measurements) - (# of constraints) = 2.
- A triplet is not necessarily a straight track segment.
- A triplet may have more than 3 measurements.
- A circular track with a known interaction point is a triplet since it has 2 free parameters. (Otherwise it has 3 parameters.)
22 Triplet Finding
- Triplet finding can be done in software or in firmware.
- Tiny Triplet Finder (TTF) is a firmware implementation developed at Fermilab for BTeV.
- Tiny: small silicon usage.
- For more info on TTF, see handout.
Triplet Finding
- O(n^3): Software Processes
- O(n): FPGA Firmware Functions
- O(N log(N)): Implementation: Tiny Triplet Finder
- O(N^2): Implementations: CAM, Hough Trans., etc.
23 DFT and FFT
DFT: O(N^2)
FFT: O(N log(N))
- Why log(N)?
  - Information propagation
  - Multiplication reuse of rotational factors
24 FFT for Arbitrary Precision Multiplications
- Multiplication of two very long integers consumes O(N^2) computation.
- It can be viewed as a convolution.
- Convolutions can be computed using FFT with O(N log(N)) computation.
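To make the "multiplication as convolution" view concrete, the sketch below multiplies two numbers stored as little-endian decimal digit arrays. It uses the direct O(N^2) convolution, not an FFT; an FFT-based method would replace only the double loop in step 1. The function name and digit layout are assumptions for illustration.

```c
#include <stddef.h>

/* Multiply two non-negative integers given as little-endian decimal
 * digits (a[0] is the ones digit). out must hold na + nb entries. */
void multiply_digits(const unsigned *a, size_t na,
                     const unsigned *b, size_t nb,
                     unsigned *out)
{
    size_t i, j, k;
    unsigned carry = 0;

    for (k = 0; k < na + nb; k++)
        out[k] = 0;

    /* Step 1: linear convolution of the two digit sequences. */
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            out[i + j] += a[i] * b[j];

    /* Step 2: carry propagation turns the sums back into digits. */
    for (k = 0; k < na + nb; k++) {
        unsigned t = out[k] + carry;
        out[k] = t % 10;
        carry = t / 10;
    }
}
```

For example, 123 x 456 passes through the convolution sums 18, 27, 28, 13, 4 before the carries produce 56088.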
25 Suggestion (3)
Take advantage of fast (like FFT) or tiny (like Tiny Triplet Finder) algorithms.
26 Multiplier-less (ML) Approaches
- Canonic signed digit (CSD) and sum of powers of two (SOPOT) representations
  - 5xA = 4xA + A, 248xA = 256xA - 8xA
- Recursive implementation of finite impulse response (FIR) filter
  - Sliding sum, sinc^2, etc.
- CORDIC or similar algorithms
  - ML FFT, rotators, etc.
- Distributed Arithmetic (DA) designs
  - Look-up tables.
- Single-bit sinc^3 FIR decimation filter
  - In delta-sigma ADC
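Two of the items above are easy to sketch in C: constant multiplication by shift-and-add, and a recursive sliding sum. The function names are hypothetical, and unsigned types are used so that the shifts are well defined.

```c
#include <stdint.h>

/* 5*A = 4*A + A: one shift and one addition, no multiplier. */
uint32_t mul5(uint32_t a)
{
    return (a << 2) + a;
}

/* 248*A = 256*A - 8*A: two shifts and one subtraction. */
uint32_t mul248(uint32_t a)
{
    return (a << 8) - (a << 3);
}

/* Recursive sliding sum over the last len samples:
 * y[n] = y[n-1] + x[n] - x[n-len], so each output costs one add and
 * one subtract regardless of the window length. */
void sliding_sum(const int *x, int *y, int n, int len)
{
    int i, acc = 0;

    for (i = 0; i < n; i++) {
        acc += x[i];
        if (i >= len)
            acc -= x[i - len];
        y[i] = acc;
    }
}
```

The signed-digit form matters for constants like 248: written as 128+64+32+16+8 it would need four additions, while 256 - 8 needs a single subtraction.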
27 Least-Square (LS) Track Fitter
- Standard least-square fitting uses a large number of multiplications and possibly divisions.
28 Multiplier-less (ML) Track Fitter
- The coefficients are scaled to avoid using dividers.
- The coefficients for the ML approximate fitting algorithm are two-bit integers. The full multiplications are replaced by two integer shift-additions.
29 Errors of LS and ML Track Fitters
- The errors of the ML approximate fitting algorithm are only slightly larger than the LS fitting errors.
30 Errors: Several Track Fitters
- Generally speaking, more computation yields better quality results.
- However, after a certain point, the quality of the results does not improve as rapidly as before.
- It is common that a large amount of computation brings only a small improvement in the mathematically perfect algorithms.
31 Suggestion (4)
Consider resource/power friendly algorithms such as multiplier-less, divider-less algorithms.
32 Why Save Resources?
33 Moore's Law
Taken from www.intel.com
- Number of transistors in a package
  - x2 / 18 months
34 The Fever of Moore's Law vs. Maxwell's Equations
[Figure: Op/sec versus year, 1998 to 2010 (MIT, 2002)]
- During the fever of Moore's law, saving computing resources became non-critical, if not impossible.
- From basic principles like Maxwell's equations, it was known the fever would not last.
35 Moore's Law Today
Taken from www.intel.com
- # of transistors
  - Yes, via multi-core.
- Clock Speed
  - ?
36 Total Useful Work = (Clock Frequency) x (Silicon Size) x (Efficiency)
[Figure: diagrams labeled with F, S and E]
- There is big room for improvement in computation efficiency in both micro-computer software and FPGA firmware.
- Resource saving helps today, when technology stalls.
- Resource saving helps in the future, as technology progresses.
37 Resource Saving Helps Future: Where Resources Can Be Saved
- Today's subroutines or FPGA blocks are to be reused thousands of times in the future.
  - If today's design is slightly too slow, too big
- Today's students as well as old people gain experience from today's work and become bosses, reviewers, etc. in the future.
  - The experience (?)
  - E.g. Is a wedding with a $20K budget possible? (Given the experience of $1000/pizza?)
38 The End
39 Triplet Finding
- Three layers of nested loops are needed if the process is implemented in software.
- A total of n^3 combinations must be checked (e.g. 5x5x5 = 125).
- In FPGA, to unroll 2 layers of loops, large silicon resource may be needed without careful planning: O(N^2).
[Figure: hits on Plane A, Plane B and Plane C]
    for (i = 0; i < N_A; i++)
        for (j = 0; j < N_B; j++)
            for (k = 0; k < N_C; k++)
40 Circular Tracks from Collision Point on Cylindrical Detectors
[Figure: coincident map with axes labeled (F2-F3)64 and (F1-F3)64]
- For a given hit on layer 3, the coincidence between a layer 2 and a layer 1 hit satisfying the coincident map signifies a valid circular track.
- A track segment has 2 free parameters, i.e., it is a triplet.
- The coincident map is invariant under rotation.
41 Tiny Triplet Finder: Reuse Coincident Logic via Shifting Hit Patterns
[Figure: detector layers C1, C2 and C3]
One set of coincident logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.
42 Tiny Triplet Finder for Circular Tracks
Also works with more than 3 layers
[Figure: block diagram with two bit arrays, two shifters, bit-wise coincident logic, labels R1/R3 and R2/R3, and triplet map output to decoder]
- Fill the C1 and C2 bit arrays. (n1 clock cycles)
- Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)
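The shift-and-compare idea can be mimicked in software with 64-bin bit arrays. The sketch below is a toy model, not the TTF firmware: the 64-bin ring, the MAX_D range, and the coincidence rule (a C2 hit at offset d pairs with a C1 hit at offset 2d) are all assumptions chosen for illustration.

```c
#include <stdint.h>

#define MAX_D 4   /* largest offset in the toy coincidence map (assumed) */

/* Rotate a 64-bin ring right by s bins. */
static uint64_t rotr64(uint64_t v, unsigned s)
{
    s &= 63u;
    return s ? (v >> s) | (v << (64u - s)) : v;
}

/* Read one bin of the ring; negative positions wrap around. */
static int bit_at(uint64_t v, int pos)
{
    return (int)((v >> ((unsigned)pos & 63u)) & 1u);
}

/* Count triplets: for each C3 hit, rotate C1 and C2 so that the hit
 * sits at bin 0, then apply the same fixed coincidence test. */
unsigned count_triplets(uint64_t c1, uint64_t c2, uint64_t c3)
{
    unsigned count = 0, h;
    int d;

    for (h = 0; h < 64; h++) {
        uint64_t r1, r2;

        if (!((c3 >> h) & 1u))
            continue;               /* no hit in this C3 bin */
        r1 = rotr64(c1, h);
        r2 = rotr64(c2, h);
        /* in hardware these tests would run in parallel */
        for (d = -MAX_D; d <= MAX_D; d++)
            count += (unsigned)(bit_at(r2, d) & bit_at(r1, 2 * d));
    }
    return count;
}
```

Because the hit patterns are rotated rather than the logic replicated, only one copy of the coincidence test is needed no matter where the C3 hit lies.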