Title: General Optimization Issues
1 General Optimization Issues
2 To be tackled today
- Most optimized TigerSHARC instruction
- Integer and float
- Systematic optimization procedure
- SISD and SIMD modes
- Exercises
3 Most optimized SIMD floating-point (32-bit) TigerSHARC instruction
xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;
- xR3:0 = CB Q[j0 += 4] /* Fetches 4 values on the J bus into X compute registers XR3, XR2, XR1, XR0. Increments the J register and adjusts for circular buffer operation */
- yR3:0 = CB Q[k0 += 4] /* Fetches 4 values on the K bus into Y compute registers YR3, YR2, YR1, YR0. Increments the K register and adjusts for circular buffer operation */
- xyFR4 = R5 * R6 /* Two multiplications: XFR5 * XFR6 and YFR5 * YFR6 */
- xyFR7 = R8 + R9, FR10 = R8 - R9 /* Two additions, XFR8 + XFR9 and YFR8 + YFR9, AND two subtractions, XFR8 - XFR9 and YFR8 - YFR9 */
- The same registers must be used on either side of the + and - operators
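A rough C model of this one instruction line may help when reading it. This is a sketch only: the ComputeBlock type and all function names are made up for illustration, the fetch ordering (lowest address into R0) is an assumption, and alignment rules and the single-cycle parallel timing of the real hardware are not modelled.

```c
#include <stddef.h>

/* Hypothetical model of one compute block's register file (R0..R31). */
typedef struct {
    float r[32];
} ComputeBlock;

/* Models R3:0 = CB Q[ptr += 4]: load 4 values, then advance the index
   with wrap-around. Assumes the lowest address lands in R0. */
static void quad_fetch_cb(ComputeBlock *b, const float *buf,
                          size_t len, size_t *index)
{
    for (size_t i = 0; i < 4; i++) {
        b->r[i] = buf[(*index + i) % len];
    }
    *index = (*index + 4) % len;
}

/* Models FR4 = R5 * R6; FR7 = R8 + R9, FR10 = R8 - R9 in one block. */
static void mul_add_sub(ComputeBlock *b)
{
    float r5 = b->r[5], r6 = b->r[6], r8 = b->r[8], r9 = b->r[9];
    b->r[4]  = r5 * r6;
    b->r[7]  = r8 + r9;   /* same R8, R9 feed both the add ... */
    b->r[10] = r8 - r9;   /* ... and the subtract */
}

/* The xy prefix means the same operations run in both compute blocks. */
void simd_line(ComputeBlock *x, ComputeBlock *y,
               const float *jbuf, size_t jlen, size_t *j0,
               const float *kbuf, size_t klen, size_t *k0)
{
    /* The fetch targets (R3:0) and the compute operands (R5, R6, R8, R9)
       do not overlap, so the order of these calls does not change
       the result. */
    mul_add_sub(x);
    mul_add_sub(y);
    quad_fetch_cb(x, jbuf, jlen, j0);
    quad_fetch_cb(y, kbuf, klen, k0);
}
```

Each call to simd_line stands for one instruction line: 8 values fetched, 2 multiplications, 2 additions and 2 subtractions.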
4 Most optimized SIMD integer (short, 16-bit) TigerSHARC instruction
xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4]; xyR7:6 = R5:4 * R3:2; xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;
- xR3:0 = CB Q[j0 += 4] /* Fetches 4 values on the J bus into X compute registers XR3, XR2, XR1, XR0. Increments the J register and adjusts for circular buffer operation */
- yR3:0 = CB Q[k0 += 4] /* Fetches 4 values on the K bus into Y compute registers YR3, YR2, YR1, YR0. Increments the K register and adjusts for circular buffer operation */
- xyR7:6 = R5:4 * R3:2 /* Eight multiplications: XR5.H * XR3.H, XR5.L * XR3.L, XR4.H * XR2.H, XR4.L * XR2.L, ditto for YR */
- xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0 /* Eight additions ??? AND eight subtractions ??? */
5 Exercise: Write out the 16 operations performed
- xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0 /* Eight additions ??? AND eight subtractions ??? */
- Now do a sideways add on xySR9:8 and get a value
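One way to check an answer to this exercise is a small C model of the short add/subtract. It is a hedged sketch: the function names are invented, the order of the four 16-bit lanes inside a register pair is an assumption, and overflow or saturation behaviour is not modelled (inputs are assumed small enough not to overflow).

```c
#include <stdint.h>

/* A dual register such as R7:6 holds four 16-bit values ("lanes").
   The lane order within the pair is an assumption of this sketch. */
enum { LANES = 4 };

/* Models SR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0 for ONE compute
   block: four additions and four subtractions on the same inputs. */
void short_add_sub(const int16_t a[LANES], const int16_t b[LANES],
                   int16_t sum[LANES], int16_t diff[LANES])
{
    for (int i = 0; i < LANES; i++) {
        sum[i]  = (int16_t)(a[i] + b[i]);
        diff[i] = (int16_t)(a[i] - b[i]);
    }
}

/* "Sideways add": add the lanes of one result together. */
int32_t sideways_sum(const int16_t v[LANES])
{
    int32_t total = 0;
    for (int i = 0; i < LANES; i++) {
        total += v[i];
    }
    return total;
}
```

Calling short_add_sub once with the X registers and once with the Y registers gives the 8 additions and 8 subtractions of the xy instruction.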
6 Steps to optimize
- Get the algorithm to work in C
- Determine how much time is available
- If the timing is already okay, quit
- Determine the maximum number of each type of operation (add, subtract, multiply, memory fetch)
- Divide each calculated maximum by the number of available resources for that type of operation
- The largest division result is the theoretical number of cycles needed for the algorithm
- If that minimum time is more than 100% of the time available, find a new algorithm
- If that minimum time is less than 40% of the time available, perhaps you can optimize the code to meet the speed requirements
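The divide-and-take-the-largest steps above can be sketched in C. The OpBudget type and min_cycles function are hypothetical names for illustration; the sketch assumes every per_cycle value is greater than zero.

```c
#include <stddef.h>

/* One entry per operation type: how many the algorithm needs in total,
   and how many the processor can issue per cycle. */
typedef struct {
    unsigned long count;
    unsigned long per_cycle;   /* assumed > 0 */
} OpBudget;

/* Integer division rounded up. */
static unsigned long ceil_div(unsigned long a, unsigned long b)
{
    return (a + b - 1) / b;
}

/* Theoretical minimum cycle count: the operation type that needs the
   most cycles sets the limit. */
unsigned long min_cycles(const OpBudget *ops, size_t n)
{
    unsigned long worst = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long cycles = ceil_div(ops[i].count, ops[i].per_cycle);
        if (cycles > worst) {
            worst = cycles;
        }
    }
    return worst;
}
```

For example, with 128 additions at 2 per cycle, 128 multiplications at 2 per cycle and 256 fetches at 8 per cycle, the additions and multiplications each need 64 cycles and the fetches 32, so the bound is 64 cycles.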
7 Code optimization: 32-bit integers or 32-bit floats
- 2 * SIZE additions
- 2 * SIZE memory fetches
- If done correctly, can do 2 additions AND 2 memory fetches each cycle
- Therefore the optimum is SIZE cycles, IFF we can find all the optimizations
8 Code optimization: 32-bit integers or 32-bit floats
- 2 * SIZE additions
- 2 * SIZE memory fetches
- Left fetched on the J-bus and done in X-compute
- Right fetched on the K-bus and done in Y-compute
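The code under discussion is not reproduced in these notes. A loop with the same operation counts (2 * SIZE additions, 2 * SIZE fetches) might look like the sketch below, which sums a left and a right channel. The function and type names are assumptions, and the bus and compute-block comments only restate the mapping given above.

```c
#include <stddef.h>

typedef struct {
    float left;
    float right;
} ChannelSums;

/* Stand-in example: one fetch and one addition per channel per pass,
   so 2 * size fetches and 2 * size additions overall. */
ChannelSums sum_channels(const float *left, const float *right, size_t size)
{
    ChannelSums sums = {0.0f, 0.0f};
    for (size_t i = 0; i < size; i++) {
        sums.left  += left[i];    /* left: J-bus fetch, X-compute add  */
        sums.right += right[i];   /* right: K-bus fetch, Y-compute add */
    }
    return sums;
}
```

The two statements in the loop body are independent of each other, which is what allows them to share a cycle.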
9 16-bit integers (short int) might be okay in some circumstances
- 2 * SIZE additions
- 2 * SIZE memory fetches
- If done correctly, can do 8 short additions AND 32 short memory fetches each cycle
- Therefore the optimum is SIZE / 4 cycles, IFF we can find all the optimizations
10 FIR optimization
- SIZE additions
- SIZE multiplications
- SIZE * 2 memory fetches
- 2 additions, 2 multiplications and 8 fetches per cycle
- Should be able to do it in SIZE / 2 cycles
11 FIR optimization
- SIZE additions
- SIZE multiplications
- SIZE * 2 memory fetches
- Fetch 2 values along the J-bus into XA and YA compute
- Fetch 2 coefficients along the K-bus into XB and YB compute
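As a reference point for these counts, here is a minimal C sketch of one FIR output: SIZE multiplications, SIZE additions and 2 * SIZE fetches. The second function splits the work across two accumulators, which is one plausible way to mirror the X/Y split; the names and the even/odd assignment of taps are assumptions, not the lab's code.

```c
#include <stddef.h>

/* Straightforward version: one multiply-accumulate per tap. */
float fir_reference(const float *samples, const float *coeffs, size_t size)
{
    float acc = 0.0f;
    for (size_t i = 0; i < size; i++) {
        acc += samples[i] * coeffs[i];
    }
    return acc;
}

/* Two independent accumulators: even taps in one, odd taps in the
   other, combined at the end. */
float fir_two_accumulators(const float *samples, const float *coeffs,
                           size_t size)
{
    float x_acc = 0.0f;
    float y_acc = 0.0f;
    size_t i = 0;
    for (; i + 1 < size; i += 2) {
        x_acc += samples[i] * coeffs[i];
        y_acc += samples[i + 1] * coeffs[i + 1];
    }
    if (i < size) {               /* leftover tap when size is odd */
        x_acc += samples[i] * coeffs[i];
    }
    return x_acc + y_acc;
}
```

Because floating-point addition is not associative, the two versions can differ in the last bits for general data.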
12 Need a systematic approach to handling the optimization of code
- Get the C code to work
- Rewrite the code in simplest format: one operation per line
- Recommend rewriting the code using register names
- Unwrap the loop: start with twice
- Rewrite the second part of the loop using different register names (avoids setting up unexpected dependencies)
- Overlap the first and second parts of loops
- Rearrange start-up and ending code
13 Stage 1 -- Get the C code to work
15 Stage 2 -- Rewrite in simplest format
- Note naming convention
- Single operation per line
- Note other changes
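The slide's code is not included in these notes, so here is a stand-in sketch of the "simplest format" using a plain array sum. The function name and the register-style variable names (xr0_sum, xr1_value, j0_ptr) are assumptions, chosen so each C variable hints at the register it might map to.

```c
#include <stddef.h>

/* Array sum written with one operation per line and variables named
   after the registers they could occupy (hypothetical mapping). */
int sum_simple(const int *buffer, size_t n)
{
    int xr0_sum = 0;                      /* accumulator in XR0      */
    const int *j0_ptr = buffer;           /* address register J0     */
    for (size_t count = 0; count < n; count++) {
        int xr1_value = *j0_ptr;          /* one memory fetch        */
        j0_ptr = j0_ptr + 1;              /* one pointer increment   */
        xr0_sum = xr0_sum + xr1_value;    /* one addition            */
    }
    return xr0_sum;
}
```

Written this way, each line corresponds to roughly one assembly instruction, which makes the later parallel packing easier to see.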
17 Step 3 -- Unwrap the loop
- Again, note naming convention
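A sketch of unwrapping a loop twice, using a plain array sum as a stand-in for the slide's code (which is not in these notes). The second copy of the body uses a different variable name, xr2_value, so it does not depend on the first copy's register. All names are assumptions.

```c
#include <stddef.h>

/* Array sum with the loop body unwrapped twice. */
int sum_unrolled(const int *buffer, size_t n)
{
    int xr0_sum = 0;
    const int *j0_ptr = buffer;
    size_t count = 0;
    for (; count + 1 < n; count += 2) {
        /* first part of the loop */
        int xr1_value = *j0_ptr;
        j0_ptr = j0_ptr + 1;
        xr0_sum = xr0_sum + xr1_value;
        /* second part: same work, different register name */
        int xr2_value = *j0_ptr;
        j0_ptr = j0_ptr + 1;
        xr0_sum = xr0_sum + xr2_value;
    }
    if (count < n) {                      /* leftover element, odd n */
        xr0_sum = xr0_sum + *j0_ptr;
    }
    return xr0_sum;
}
```

The trailing if handles an odd element count, which the two-at-a-time loop would otherwise skip.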
19 Step 4 -- Overlap the first and second parts of loops
Note: The C code goes no faster, but using this format for translating into parallel assembly code will.
- Step 1: 4 * N
- Step 3: 8 * (N / 2) + 2
- Step 4: 6 * (N / 2) + 2
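To illustrate the overlap, the sketch below interleaves the two halves of a twice-unwrapped array sum so that the second fetch sits next to the first addition. The array sum is a stand-in example and the names are assumptions; in C this ordering changes nothing about speed, it only lays the operations out the way they could later be paired in parallel assembly.

```c
#include <stddef.h>

/* Twice-unwrapped array sum with the two halves overlapped:
   both fetches are issued before the additions that use them. */
int sum_overlapped(const int *buffer, size_t n)
{
    int xr0_sum = 0;
    const int *j0_ptr = buffer;
    size_t count = 0;
    for (; count + 1 < n; count += 2) {
        int xr1_value = *j0_ptr;          /* fetch for first part    */
        j0_ptr = j0_ptr + 1;
        int xr2_value = *j0_ptr;          /* fetch for second part   */
        j0_ptr = j0_ptr + 1;
        xr0_sum = xr0_sum + xr1_value;    /* add for first part      */
        xr0_sum = xr0_sum + xr2_value;    /* add for second part     */
    }
    if (count < n) {                      /* leftover element, odd n */
        xr0_sum = xr0_sum + *j0_ptr;
    }
    return xr0_sum;
}
```

Using separate names for the two fetched values is what makes this reordering safe: neither fetch overwrites a value that an addition still needs.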
21 Step 5A -- Rearrange start-up and ending code
- Software pipeline: move the first read outside the loop
- Need to add an extra read at the end of the loop
- Timing 2 (N/2 1) 6
- Need to adjust the loop start (Is it done correctly? Are we one-out?)
- CAUTION: NEED TO FIX
22 Step 5B -- Rearrange start-up and ending code
- Can now parallel additional adds and memory fetches
- Note: loop still in error
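The one-out risk flagged above is easier to see in a small C sketch of software pipelining. This uses a stand-in array sum with assumed names, not the slide's code. The first read moves in front of the loop, the loop runs n - 1 times, and the last addition moves after the loop, so no read goes past the end of the buffer.

```c
#include <stddef.h>

/* Software-pipelined array sum: each pass fetches the NEXT value while
   adding the PREVIOUS one. */
int sum_pipelined(const int *buffer, size_t n)
{
    if (n == 0) {
        return 0;                         /* nothing to prime the pipe */
    }
    int xr0_sum = 0;
    const int *j0_ptr = buffer;
    int xr1_value = *j0_ptr;              /* start-up: first read      */
    j0_ptr = j0_ptr + 1;
    for (size_t count = 1; count < n; count++) {
        int xr2_value = *j0_ptr;          /* fetch the next value ...  */
        j0_ptr = j0_ptr + 1;
        xr0_sum = xr0_sum + xr1_value;    /* ... add the previous one  */
        xr1_value = xr2_value;
    }
    xr0_sum = xr0_sum + xr1_value;        /* ending: last addition     */
    return xr0_sum;
}
```

Starting the loop count at 1 rather than 0 is the loop-start adjustment: a loop that still ran n times would perform one read too many. The copy from xr2_value to xr1_value is a C convenience; unwrapped assembly could alternate the two registers instead.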
23 Exercise -- Get the loop control correct
24 Exercise 1 -- Get the loop control correct
BUFFER_SIZE values: 1, 2, 4, 5, 8, 128
25 Exercise 2 -- Rewrite the code when it is known
that BUFFER_SIZE 127
BUFFER_SIZE 1 N 2 N 4 N 5 N 8 N 128
26 Code to this point is SISD parallel optimization
- SISD: single instruction, single data
- Using X_compute block and J memory bus
- Next stage SIMD: single instruction, multiple data
- Using X_compute block and J memory bus for left
- Using Y_compute block and K memory bus for right
- Will need similar but different code when you are doing FIR in Lab. 3
27 Exercise 3 -- BUFFER_SIZE 128: Rewrite so that X and Y ops are done together
BUFFER_SIZE 1 N 2 N 4 N 5 N 8 N 128
28 Exercise 4 -- BUFFER_SIZE 128: Rewrite so that no data dependency stalls are expected
BUFFER_SIZE 1 N 2 N 4 N 5 N 8 N 128
29 To be tackled today
- Most optimized TigerSHARC instruction
- Integer and float
- Systematic optimization procedure
- SISD and SIMD modes
- Exercises