Resource Saving in Micro-Computer Software - PowerPoint PPT Presentation

1
Resource Saving in Micro-Computer Software
& FPGA Firmware Designs
  • Wu, Jinyuan
  • Fermilab
  • Nov. 2006

2
Resource Saving in FPGA (from CompactFPGAdesign.pdf)
  • Glue Logic
  • Digitization: TDC, (ADC), etc.
  • Communication: C5, Digital Phase Follower, etc.
  • Data Organization: Zero-Suppression, Parasitic Event Building, etc.
  • Reconfigurable Computing: Hash Sorter, TTF, ELMS, etc.

Software -- Firmware
3
Computer Is Fast
  • This is the first impression of many beginners.
  • FPGA is big.
  • Program Creation Time > Execution Time

4
How to Slow Down Computers?
Square Wave Generator
CPU: Z80, 4 MHz
LD A,A or NOOP: 1 NOOP spends 1 us; 1,000,000 NOOPs spend 1 s.
  • Single Layer Loop: 256 x 3 x 4 x 0.25 us ≈ 0.75 ms

            LD   A,255
    BACKA:  NOOP
            DEC  A
            JP   NZ, BACKA

  • Nested Loops: 256 x 0.75 ms ≈ 0.19 s

            LD   B,255
    BACKB:  LD   A,255
    BACKA:  NOOP
            DEC  A
            JP   NZ, BACKA
            LD   A,B
            DEC  B
            DEC  A
            JP   NZ, BACKB
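
The delay arithmetic on this slide can be reproduced with a short calculation. This is only a sketch: the names `loop_time_us` and `print_delay_example` are hypothetical, and the code follows the slide's simplification that every instruction takes 4 clock cycles at 4 MHz (real Z80 instruction timings differ from one instruction to another).

```c
#include <stdio.h>

/* Hypothetical helper: time in microseconds spent in a delay loop,
 * assuming every instruction takes the same number of clock cycles
 * (the slide's simplification, not exact Z80 timing). */
double loop_time_us(long iterations, int instr_per_iter,
                    int cycles_per_instr, double clock_mhz)
{
    double cycle_us = 1.0 / clock_mhz;   /* 4 MHz gives 0.25 us per cycle */
    return (double)iterations * instr_per_iter * cycles_per_instr * cycle_us;
}

/* Print the single-loop and nested-loop delays for the Z80 example. */
void print_delay_example(void)
{
    double single_us = loop_time_us(256, 3, 4, 4.0);
    double nested_us = 256.0 * single_us;   /* inner-loop cost only, as on the slide */

    printf("single loop:  %.0f us\n", single_us);
    printf("nested loops: %.3f s\n", nested_us / 1e6);
}
```

With these assumptions the single loop comes to 768 us and the nested loops to about 0.197 s, in line with the slide's rounded 0.75 ms and 0.19 s.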

5
Knowing Slow, Knowing Fast: Where Resources Can Be
Saved
  • For micro-computer software
  • Pay attention to loops and frequently called
    subroutines,
  • Especially inner-most nested loops.
  • For FPGA firmware
  • Algorithms rooted in micro-computer software.
  • Reusable blocks.
  • Occasionally used functions.

6
Example: Inner-Product

            LD   R1, n
            LD   R2, addr_a
            LD   R3, addr_X
            LD   R7, 0
    BckA1:  LD   R4, (R2)
            INC  R2
            LD   R5, (R3)
            INC  R3
            MUL  R6, R4, R5
    EndA1:  ADD  R7, R7, R6
            DEC  R1
            BRNZ BckA1

[Figure: data path for the loop. R1 is the loop counter (R1--), R2 and R3 address the arrays a and X, R4 and R5 feed the multiplier (x), R6 holds the product, R7 accumulates the sum]
  • Multiplier-less algorithms.
  • Avoid using conditional branch for loop control:
    ELMS.
  • Saves 25% execution time in this case.
  • Reuse computations: using fast algorithms like
    FFT.
  • Avoid entering the loop: using early constraints.

7
Computing Module in Micro-processor & FPGA
[Figure: a computing module evaluating an arithmetic expression on the data 100, 3, 4, 5, 7, one operation per control step, starting with LD]
  • Micro-processors use full sequencing approach.
    One operation is performed in each clock cycle.
  • In FPGA, flattened logic is allowed and is fast
    but takes large silicon area.

8
Sequencing in FPGA for Resource Control
[Figure: sequencing diagram. Initialization1 and Initialization2 are followed by the steps Sum1 to Sum4, repeated for channels CH0 to CH3]
  • Sequencing is a very efficient means of resource
    control in FPGA.
  • Reuse processing resource for similar function
    and/or different channels.
  • Pay attention to occasionally-used functions like
    initialization.

9
Suggestion (1)
Use partially flattened and partially sequential
logic to balance speed and size.
10
ELMS: Enclosed Loop Micro-Sequencer
  • A PC + ROM structure can be a very good sequencer
    in FPGA.
  • The Conditional Branch Logic is added to support
    regular conditional branches as in
    micro-processors.
  • The Loop & Return Logic + Stack are added to
    support FOR loops with pre-defined iterations at
    machine code level.
  • The resource usage of ELMS in FPGA is very small.

            FOR  BckA1 EndA1 n
            LD   R2, addr_a
            LD   R3, addr_X
            LD   R7, 0
    BckA1:  LD   R4, (R2)
            INC  R2
            LD   R5, (R3)
            INC  R3
            MUL  R6, R4, R5
    EndA1:  ADD  R7, R7, R6
11
ELMS Detailed Block Diagram
12
FOR Loops at Machine Code Level
Conventional loop control:

            LD   R1, n
            LD   R2, addr_a
            LD   R3, addr_X
            LD   R7, 0
    BckA1:  LD   R4, (R2)
            INC  R2
            LD   R5, (R3)
            INC  R3
            MUL  R6, R4, R5
    EndA1:  ADD  R7, R7, R6
            DEC  R1
            BRNZ BckA1

ELMS FOR loop:

            FOR  BckA1 EndA1 n
            LD   R2, addr_a
            LD   R3, addr_X
            LD   R7, 0
    BckA1:  LD   R4, (R2)
            INC  R2
            LD   R5, (R3)
            INC  R3
            MUL  R6, R4, R5
    EndA1:  ADD  R7, R7, R6
  • Looping sequence is known in this example before
    entering the loop.
  • Regular micro-processors treat the sequence as
    unknown.
  • ELMS supports FOR loops with pre-defined
    iterations at machine code level.

13
Suggestion (2)
Eliminate unnecessary instructions, functions,
time slots, etc. whenever possible.
14
Do You SUDOKU?
  • Fill in 1-9 so that
  • Each column contains 1-9 without repeating.
  • Each row contains 1-9 without repeating.
  • Each 3x3 box contains 1-9 without repeating.
  • It is fun to solve by hand.
  • It is also fun to write a solver program, or read
    a good one.

15
A Possible SUDOKU Solver?
  • For all empty boxes, assign 1-9 to each.
  • Check correct or not.
  • If not, repeat.

81 - 28 = 53 empty boxes, 9 possibilities for each
box. Total possibilities = 9^53. Assume a computer
checks 10^10 possibilities/sec. A year = 3x10^7
sec. Total time to solve = 9^53 / (10^10 x 3x10^7) >>
1000 years
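
The estimate above can be reproduced numerically. This is a minimal sketch with hypothetical function names; it uses the slide's round figures (10^10 checks per second, 3x10^7 seconds per year) and double precision, which is enough for an order-of-magnitude result.

```c
/* choices^slots by repeated multiplication; a double is sufficient
 * for an order-of-magnitude estimate. */
double count_possibilities(double choices, int slots)
{
    double total = 1.0;
    int i;

    for (i = 0; i < slots; i++)
        total *= choices;
    return total;
}

/* Years needed to check every possibility at the given rate. */
double brute_force_years(double choices, int slots, double checks_per_sec)
{
    const double seconds_per_year = 3e7;   /* the slide's round figure */

    return count_possibilities(choices, slots)
           / checks_per_sec / seconds_per_year;
}
```

Calling `brute_force_years(9, 53, 1e10)` gives a value on the order of 10^33 years, far beyond the ">> 1000 years" bound stated on the slide.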
16
A Real SUDOKU Solver
  • Eliminate impossible values for each empty box.
  • Assign a possible value to the box.
  • Repeat.

Total time to solve: < 1 sec
17
sudoku.c
    #include <stdio.h>

    /* show_board() -- print the board; empty boxes are shown as blanks */
    void show_board(int b[9][9])
    {
        int i, j;

        printf("---------------------\n");
        for (i = 0; i < 9; i++) {
            for (j = 0; j < 9; j++) {
                if (b[i][j] == 0)
                    printf("  ");
                else
                    printf(" %d", b[i][j]);
                if (j % 3 == 2)
                    printf(" ");
            }
            printf("\n");
            if (i % 3 == 2)
                printf("---------------------\n");
        }
    }

    /* init_board() -- initialize the board with all 0 */
    void init_board(int b[9][9])
    {
        int i, j;

        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                b[i][j] = 0;
    }

    /* read_board() -- read the board from input file */
    void read_board(FILE *fp, int b[9][9])
    {
        int i, j, c;

        i = 0;
        j = 0;
        while ((c = fgetc(fp)) != EOF) {
            if (c == '\n') {
                i++;
                j = 0;
            } else {
                /* bounds guard so an oversized file cannot overrun the board */
                if (c != ' ' && i < 9 && j < 9)
                    b[i][j] = c - '0';
                j++;
            }
        }
    }

    /* check_row() -- check the row */
    int check_row(int b[9][9], int x, int y, int v)
    {
        int i;

        for (i = 0; i < 9; i++)
            if (i != y)
                if (b[x][i] == v)
                    return 0;
        return v;
    }

    /* check_column() -- check the column */
    int check_column(int b[9][9], int x, int y, int v)
    {
        int i;

        for (i = 0; i < 9; i++)
            if (i != x)
                if (b[i][y] == v)
                    return 0;
        return v;
    }

    /* check_square() -- check the square */
    int check_square(int b[9][9], int x, int y, int v)
    {
        int i, j, x0, y0;

        x0 = x / 3;
        y0 = y / 3;
        for (i = x0 * 3; i < x0 * 3 + 3; i++)
            for (j = y0 * 3; j < y0 * 3 + 3; j++)
                if (!((x == i) && (y == j)))
                    if (b[i][j] == v)
                        return 0;
        return v;
    }

    /* unique_solution() -- find the unique solution for i, j */
    int unique_solution(int b[9][9], int x, int y)
    {
        int s = 0, n = 0, v;

        for (v = 1; v < 10; v++)
            if (check_row(b, x, y, v) && check_column(b, x, y, v) &&
                check_square(b, x, y, v)) {
                s = v;
                n++;
            }
        if (n == 1)
            return s;
        else
            return 0;
    }

    /* possible_solutions() -- find the possible solutions for i, j */
    int possible_solutions(int b[9][9], int x, int y, int *s)
    {
        int n = 0, v;

        for (v = 1; v < 10; v++)
            if (check_row(b, x, y, v) && check_column(b, x, y, v) &&
                check_square(b, x, y, v))
                s[n++] = v;
        return n;
    }

    /* solve1() -- one pass to solve the puzzle */
    int solve1(int b[9][9])
    {
        int i, j;
        int solved = 0;

        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                if (b[i][j] == 0) {
                    b[i][j] = unique_solution(b, i, j);
                    if (b[i][j])
                        solved++;
                }
        return (solved);
    }

    /* solve() -- fill forced boxes, then guess on the box with fewest candidates */
    int solve(int b[9][9])
    {
        int b2[9][9], i, j, k, n;
        int ps[9], s[9], pn, x = 0, y = 0;

        /* copy the board for recursion */
        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                b2[i][j] = b[i][j];

        while (solve1(b2))
            show_board(b2);

        /* figure out possible solutions for the unknown boxes */
        pn = 10;
        for (i = 0; i < 9; i++)
            for (j = 0; j < 9; j++)
                if (b2[i][j] == 0) {
                    for (k = 0; k < 9; k++)
                        s[k] = 0;
                    n = possible_solutions(b2, i, j, s);
                    if (n < pn) {
                        pn = n;
                        for (k = 0; k < n; k++)
                            ps[k] = s[k];
                        x = i;
                        y = j;
                    }
                }

        if (pn == 10) {
            /* that's it */
            for (i = 0; i < 9; i++)
                for (j = 0; j < 9; j++)
                    if (b2[i][j] == 0)
                        return 0;
            return 1;
        }

        /* try each candidate for the most constrained box */
        for (i = 0; i < pn; i++) {
            b2[x][y] = ps[i];
            show_board(b2);
            if (solve(b2))
                return 1;
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        int board[9][9];
        FILE *fp;

        if (argc > 1)
            fp = fopen(argv[1], "r");
        else
            fp = stdin;
        if (fp == NULL) {
            perror("fopen");
            return 1;
        }

        init_board(board);
        read_board(fp, board);
        show_board(board);
        solve(board);
        return 0;
    }
18
A Possible Track Finder?
  • Choose a hit for each layer.
  • Fit and calculate χ².
  • Cut on χ².

10 layers: O(n^10). 100 hits/layer. Total
possibilities = 10^20. Assume a computer checks
10^10 possibilities/sec. A year = 3x10^7 sec. Total
time to check all possibilities = 10^20 / (10^10 x
3x10^7) > 300 years
19
A Better Track Finder
  • Choose a hit for each of layers 1 and 2.
  • Choose only compatible hits on layers 3 to 10.
  • Calculate χ².
  • Cut on χ².

First constraint at layer 3: O(n^3). 100
hits/layer. Total possibilities = 10^6. Assume a
computer checks 10^10 possibilities/sec. Total
time to check all possibilities = 10^6 / 10^10 sec =
0.1 ms
20
Suggestion (2)
  • Use early constraints to reduce number of
    iterations.
  • Evaluate the first constraint as simply as
    possible (e.g. offset, rather than χ²).
  • Apply the first constraint as early as possible
    (e.g. at layer 3, not until layer 10).
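
An early constraint of this kind can be sketched as follows. The code is an illustration under stated assumptions, not the track finder from the talk: it assumes straight tracks, equally spaced layers, integer hit positions, and hypothetical names (`MAX_HITS`, `has_hit_near`, `count_tracks`). Each pair of hits on the first two layers defines a line, and the candidate is dropped at the first later layer where no hit lies within a simple offset tolerance.

```c
#include <stdlib.h>

#define MAX_HITS 16   /* assumed upper bound on hits per layer */

/* Return 1 if the layer has a hit whose offset from the predicted
 * position is within tol, otherwise 0. */
static int has_hit_near(const int *hits, int n, int predicted, int tol)
{
    int k;

    for (k = 0; k < n; k++)
        if (abs(hits[k] - predicted) <= tol)
            return 1;
    return 0;
}

/* Count straight-line track candidates on equally spaced layers.
 * hits[l][k] is the position of hit k on layer l; nhits[l] is the
 * number of hits on layer l. Requires nlayers >= 2. */
int count_tracks(int hits[][MAX_HITS], const int *nhits,
                 int nlayers, int tol)
{
    int i, j, l, tracks = 0;

    for (i = 0; i < nhits[0]; i++) {
        for (j = 0; j < nhits[1]; j++) {
            int slope = hits[1][j] - hits[0][i];
            int ok = 1;

            /* Early constraint: stop at the first layer with no hit
             * near the extrapolated position. */
            for (l = 2; l < nlayers && ok; l++) {
                int predicted = hits[1][j] + slope * (l - 1);
                ok = has_hit_near(hits[l], nhits[l], predicted, tol);
            }
            tracks += ok;
        }
    }
    return tracks;
}
```

The work is dominated by the two outer loops plus one scan of layer 3, which matches the O(n^3) estimate; most wrong pairs never reach layer 4.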
21
Triplets
  • Triplet
  • Data item with 2 free parameters.
  • (# of measurements) - (# of constraints) = 2.
  • A triplet is not necessarily a straight track
    segment.
  • A triplet may have more than 3 measurements.
  • Circular track with known interaction point is a
    triplet since it has 2 free parameters.
    (Otherwise it has 3 parameters.)

22
Triplet Finding
  • Triplet finding can be done in software or in
    firmware.
  • Tiny Triplet Finder (TTF) is a firmware
    implementation developed at Fermilab for BTeV.
  • Tiny: small silicon usage.
  • For more info on TTF, see handout.

Triplet Finding
  • Software processes: O(n^3)
  • FPGA firmware functions: O(n)
  • O(N^2) implementations: CAM, Hough transform, etc.
  • O(N log(N)) implementation: Tiny Triplet Finder
23
DFT and FFT
DFT: O(N^2)
FFT: O(N log(N))
  • Why log(N)?
  • Information propagation
  • Multiplication reuse of rotational factors

24
FFT for Arbitrary Precision Multiplications
  • Multiplication of two very long integers consumes
    O(N^2) computation.
  • It can be viewed as a convolution.
  • Convolutions can be computed using FFT with
    O(N log(N)) computation.
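
The view of long multiplication as a convolution can be made concrete with a short sketch. The function name `multiply_digits` and the digit layout (base 10, least significant digit first) are assumptions for illustration. The double loop below is the direct O(N^2) convolution; an FFT-based method would replace only that step and keep the carry propagation.

```c
#include <stddef.h>

/* Multiply two non-negative integers stored as base-10 digit arrays,
 * least significant digit first. out must have room for na + nb digits.
 * Intermediate sums are plain ints, which is fine for the modest
 * lengths this sketch is meant for. */
void multiply_digits(const int *a, size_t na,
                     const int *b, size_t nb, int *out)
{
    size_t i, j;
    int carry = 0;

    for (i = 0; i < na + nb; i++)
        out[i] = 0;

    /* Convolution: out[k] = sum over i + j = k of a[i] * b[j] */
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            out[i + j] += a[i] * b[j];

    /* Carry propagation turns the raw convolution into proper digits */
    for (i = 0; i < na + nb; i++) {
        int v = out[i] + carry;
        out[i] = v % 10;
        carry = v / 10;
    }
}
```

For example, the digit arrays {3, 2, 1} and {6, 5, 4} (123 and 456) produce {8, 8, 0, 6, 5, 0}, that is 56088.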

25
Suggestion (3)
Take advantage of fast (like FFT) or tiny (like
Tiny Triplet Finder) algorithms.
26
Multiplier-less (ML) Approaches
  • Canonic signed digit (CSD) and sum of powers of
    two (SOPOT) representations:
    5xA = 4xA + A, 248xA = 256xA - 8xA
  • Recursive implementation of finite impulse
    response (FIR) filters: sliding sum, sinc^2, etc.
  • CORDIC or similar algorithms: ML FFT, rotators, etc.
  • Distributed Arithmetic (DA) designs: look-up tables.
  • Single-bit sinc^3 FIR decimation filter in
    delta-sigma ADC
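
Two of the multiplier-less ideas above can be sketched in a few lines. The function names are hypothetical and the code is an illustration, not the firmware from the talk: constant multiplications become shifts and additions, and a sliding sum is computed recursively so that each output costs one addition and one subtraction regardless of the window length.

```c
#include <stdint.h>

/* 5 x A = 4 x A + A: one shift and one addition */
uint32_t times5(uint32_t a)
{
    return (a << 2) + a;
}

/* 248 x A = 256 x A - 8 x A: two shifts and one subtraction */
uint32_t times248(uint32_t a)
{
    return (a << 8) - (a << 3);
}

/* Recursive sliding sum over the last `taps` samples:
 * y[n] = y[n-1] + x[n] - x[n-taps]. */
void sliding_sum(const int32_t *x, int32_t *y, int n, int taps)
{
    int32_t acc = 0;
    int i;

    for (i = 0; i < n; i++) {
        acc += x[i];
        if (i >= taps)
            acc -= x[i - taps];   /* drop the sample leaving the window */
        y[i] = acc;
    }
}
```

In an FPGA the shifts cost only wiring, so each constant multiplication reduces to a single adder or subtractor.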

27
Least-Square (LS) Track Fitter
  • Standard least-square fitting uses a large number
    of multiplications and possibly divisions.

28
Multiplier-less (ML) Track Fitter
  • The coefficients are scaled to avoid using
    dividers.
  • The coefficients for the ML approximate fitting
    algorithm are two-bit integers. The full
    multiplications are replaced by two integer
    shift-additions.

29
Errors of LS and ML Track Fitters
  • The errors of the ML approximate fitting algorithm
    are only slightly larger than the LS fitting errors.

30
Errors of Several Track Fitters
  • Generally speaking, more computation yields
    better quality of the results.
  • However, after a certain point, the quality of the
    results does not improve as rapidly as before.
  • It is common that a large amount of computation
    brings only a small improvement in
    mathematically perfect algorithms.

31
Suggestion (4)
Consider resource/power friendly algorithms such
as multiplier-less, divider-less algorithms.
32
Why Saving Resource
  • ?
  • ?
  • ?
  • ?
  • ?
  • ?
  • ?

33
Moore's Law
Taken from www.intel.com
  • Number of transistors in a package
  • x2 every 18 months

34
The Fever of Moore's Law vs. Maxwell's Equations
[Figure: Op/sec vs. year, 1998 to 2010 (MIT, 2002)]
  • During the fever of Moore's law, saving computing
    resource became non-critical, if not impossible.
  • From basic principles like Maxwell's equations, it
    was known the fever would not last.

35
Moore's Law Today
Taken from www.intel.com
  • # of transistors
  • Yes, via multi-core.
  • Clock Speed
  • ?

36
Total Useful Work = (Clock Frequency) x
(Silicon Size) x (Efficiency)
  • There is big room for improvement in computation
    efficiency in both micro-computer software and
    FPGA firmware.
  • Resource saving helps today when technology
    stalls.
  • Resource saving helps in the future as technology
    progresses.

37
Resource Saving Helps Future: Where Resources Can
Be Saved
  • Today's subroutines or FPGA blocks are to be
    reused thousands of times in the future
  • If today's design is slightly too slow, too big
  • Today's students as well as old people gain
    experience from today's work and become bosses,
    reviewers, etc. in the future
  • The experience (?)
  • E.g. Is a wedding with a $20K budget possible?
    (Given the experience of $1000/pizza?)

38
The End
  • Thanks

39
Triplet Finding
  • Three layers of nested loops are needed if the
    process is implemented in software.
  • A total of n^3 combinations must be checked (e.g.
    5x5x5 = 125).
  • In FPGA, to unroll 2 layers of loops, large
    silicon resource may be needed without careful
    planning: O(N^2)

Plane A
Plane B
Plane C
    for (i = 0; i < N_A; i++)
        for (j = 0; j < N_B; j++)
            for (k = 0; k < N_C; k++)
40
Circular Tracks from Collision Point on
Cylindrical Detectors
[Figure: coincident map; axis labels (Φ2-Φ3)64 and (Φ1-Φ3)64]
  • For a given hit on layer 3, the coincidence
    between a layer 2 hit and a layer 1 hit satisfying
    the coincident map signifies a valid circular track.
  • A track segment has 2 free parameters, i.e., it is a
    triplet.
  • The coincident map is invariant under rotation.

41
Tiny Triplet Finder: Reuse Coincident Logic via
Shifting Hit Patterns
[Figure: hit patterns on layers C1, C2 and C3]
One set of coincident logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift
the hit patterns for C1 and C2 to search for
coincidence.
42
Tiny Triplet Finder for Circular Tracks
Also works with more than 3 layers
[Figure: two bit arrays, each feeding a shifter, whose outputs enter the bit-wise coincident logic]
  1. Fill the C1 and C2 bit arrays. (n1 clock cycles)
  2. Loop over C3 hits, shift bit arrays and check for
    coincidence. (n3 clock cycles)
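
The idea of reusing one set of coincidence logic by shifting hit patterns can be sketched in software. The code below is a simplified analogue and not the TTF firmware: it assumes straight tracks on three equally spaced planes A, B, C (not circular tracks on C1, C2, C3), pivots on the middle plane, and uses hypothetical names and sizes (`NBINS`, `MAX_SLOPE`, `fill_c`, `fill_a_reversed`, `triplet_map`). A track with slope s through bin b on plane B has hits at b - s on A and b + s on C, so after storing A in reversed order both patterns only need to be shifted and ANDed.

```c
#include <stdint.h>

#define NBINS     32   /* bins per plane (assumed) */
#define MAX_SLOPE 4    /* largest |slope| searched, in bins (assumed) */
#define MAP_MASK  ((((uint64_t)1) << (2 * MAX_SLOPE + 1)) - 1)

/* Fill plane C's bit array: a hit at bin c (0..NBINS-1) sets bit
 * c + MAX_SLOPE. The padding keeps all later shifts non-negative. */
uint64_t fill_c(const int *hits, int n)
{
    uint64_t bits = 0;
    int k;

    for (k = 0; k < n; k++)
        bits |= (uint64_t)1 << (hits[k] + MAX_SLOPE);
    return bits;
}

/* Fill plane A's bit array in reversed order: a hit at bin a sets
 * bit (NBINS - 1 - a) + MAX_SLOPE. */
uint64_t fill_a_reversed(const int *hits, int n)
{
    uint64_t bits = 0;
    int k;

    for (k = 0; k < n; k++)
        bits |= (uint64_t)1 << (NBINS - 1 - hits[k] + MAX_SLOPE);
    return bits;
}

/* Triplet map for one hit at bin b on the middle plane B. Bit
 * (s + MAX_SLOPE) is set when A has a hit at b - s and C has a hit
 * at b + s, for slopes s from -MAX_SLOPE to +MAX_SLOPE. */
uint64_t triplet_map(uint64_t a_rev, uint64_t c_bits, int b)
{
    uint64_t a_shifted = a_rev >> (NBINS - 1 - b);
    uint64_t c_shifted = c_bits >> b;

    return a_shifted & c_shifted & MAP_MASK;
}
```

Filling the arrays takes one step per hit and the search takes one shift-and-AND per pivot hit, which mirrors the two numbered steps above.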

[Figure labels: R1/R3 and R2/R3 at the shifters; triplet map output to decoder]