Title: MMX Architecture Programming and Performance Optimization
1. MMX Architecture Programming and Performance Optimization
- The MMX Instruction Set
- MMX Technology Optimization Techniques
2. MMX Instruction Overview
- 57 new opcodes introduced
- Instructions are grouped into the following categories
- packed arithmetic (e.g. padd, psub)
- conversions (e.g. pack, unpack)
- logical operations (e.g. pand, por, pxor)
- data transfer operations (e.g. movd, movq)
- EMMS (Empty Multimedia State)
3. Packed Arithmetic Example 1
- paddsw MM2, MM4
- (Packed Add with Saturation, Word)
- p: Packed
- add: the operation (addition)
- s: signed Saturation
- w: Word (16-bit elements)
4. Packed Arithmetic Example 1
[Figure: paddsw worked example; two of the result words are marked "Saturated"]
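The saturating behaviour of paddsw can be sketched in portable C. This is only an emulation of the instruction's arithmetic on four 16-bit lanes, not MMX code; the function names `saturate_s16` and `paddsw_emulated` are made up for this illustration.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate sum to the signed 16-bit range. */
static int16_t saturate_s16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;   /* 7FFFh */
    if (v < INT16_MIN) return INT16_MIN;   /* 8000h */
    return (int16_t)v;
}

/* Emulates "paddsw dst, src": four independent signed 16-bit lanes.
 * (Hypothetical helper name; the real instruction works on MMX registers.) */
void paddsw_emulated(int16_t dst[4], const int16_t src[4])
{
    for (int lane = 0; lane < 4; ++lane)
        dst[lane] = saturate_s16((int32_t)dst[lane] + src[lane]);
}
```

With wraparound arithmetic, 30000 + 10000 would become a negative number; with saturation it stays at the largest representable word, which is usually the less damaging error for audio or pixel data.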
5. Packed Arithmetic Example 2
- pmaddwd MM2, MM4
- (Packed Multiply and Add)
- Multiply the four pairs of packed signed words in parallel
- Add the 32-bit results pairwise
- Store the two sums in the MMX register as dwords
6. Packed Arithmetic Example 2
Wraps around only when all source data elements are 8000h
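A minimal C sketch of what pmaddwd computes, including the single wraparound case. The function name `pmaddwd_emulated` is hypothetical, and the code models only the arithmetic, not the register file.

```c
#include <stdint.h>

/* Emulates "pmaddwd": a and b each hold four signed 16-bit words,
 * out receives two dwords:
 *   out[0] = a[0]*b[0] + a[1]*b[1]
 *   out[1] = a[2]*b[2] + a[3]*b[3]
 */
void pmaddwd_emulated(int32_t out[2], const int16_t a[4], const int16_t b[4])
{
    for (int pair = 0; pair < 2; ++pair) {
        int32_t p0 = (int32_t)a[2 * pair]     * b[2 * pair];
        int32_t p1 = (int32_t)a[2 * pair + 1] * b[2 * pair + 1];
        /* Add as unsigned so that the one overflowing case (all four
         * words equal to 8000h) wraps instead of being undefined in C.
         * Converting back to int32_t is implementation-defined; common
         * compilers keep the bit pattern. */
        uint32_t sum = (uint32_t)p0 + (uint32_t)p1;
        out[pair] = (int32_t)sum;
    }
}
```

Each individual product fits in 32 bits, so only the final addition can overflow, and only when both products are 40000000h, which requires every input word to be 8000h.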
7. Using Saturating Arithmetic
- Example (absolute difference for 8-bit unsigned data)
[Figure: computing the absolute differences]
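The slide's diagram did not survive, so the following is an assumption about the usual approach: subtract with unsigned saturation in both directions (psubusb) and OR the two results (por). One of the two differences is always clamped to zero, so the OR leaves the absolute difference. The C below emulates this per byte; the helper names are invented for the sketch.

```c
#include <stdint.h>

/* One lane of psubusb: unsigned subtraction that clamps at 0. */
static uint8_t sub_saturate_u8(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint8_t)(a - b) : 0;
}

/* |a - b| for eight unsigned bytes, the width of one MMX register.
 * Assumed instruction mapping: two psubusb followed by one por. */
void abs_diff_u8x8(uint8_t out[8], const uint8_t a[8], const uint8_t b[8])
{
    for (int i = 0; i < 8; ++i) {
        uint8_t a_minus_b = sub_saturate_u8(a[i], b[i]);  /* psubusb */
        uint8_t b_minus_a = sub_saturate_u8(b[i], a[i]);  /* psubusb */
        out[i] = (uint8_t)(a_minus_b | b_minus_a);        /* por */
    }
}
```

The point of the trick is that no comparison or branch is needed, which is what makes it usable on eight bytes at once.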
8. Optimization Techniques
- MMX instructions utilize the U and V pipes of the Pentium processor
- To write optimized MMX code, one must
- understand the MMX instruction latencies
- know how to pair MMX instructions
- learn how to efficiently mix MMX and regular integer instructions
- take the cache structure into consideration
9. MMX Instruction Latencies
- Latency rule 1: After modifying an MMX register, wait until the next clock (at least) before reading the same register (avoiding RAW hazards)
- Example
- movq mm0, [eax]     ; U pipe, clock 1
- movq mm3, mm2       ; V pipe, clock 1
- paddw mm0, mm1      ; U pipe, clock 2
- movq mm2, mm0       ; STALL, issues in clock 3
10. MMX Instruction Latencies
- Example (no stall!)
- movq mm0, [eax]     ; V pipe
- movq mm3, mm2       ; U pipe
- paddw mm0, mm1      ; V pipe
- movq mm2, mm0       ; U pipe
- Determine how the sequence of instructions will line up relative to the U and V pipes
- a branch target is always executed in the U pipe
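To see why the same four instructions stall in one alignment and not in the other, here is a toy in-order scheduler in C. It models latency rule 1 only (no pairing rules, no multiply latency), and the `MmxOp` type, the register numbering, and `schedule_rule1` are all assumptions made for this sketch.

```c
#include <stddef.h>

/* Register numbers 0..7 stand for mm0..mm7; NO_REG means "no operand". */
enum { NO_REG = -1 };

typedef struct {
    int dst;    /* MMX register written */
    int src1;   /* MMX registers read (NO_REG if unused) */
    int src2;
} MmxOp;

static int reads_reg(const MmxOp *op, int reg)
{
    return reg != NO_REG && (op->src1 == reg || op->src2 == reg);
}

/* Toy model of latency rule 1.
 * first_in_v != 0 means ops[0] lands in the V pipe of clock 1.
 * clock_out[i] receives the clock in which ops[i] issues.
 * Returns how many pairings were lost to a same-clock RAW hazard. */
int schedule_rule1(const MmxOp *ops, size_t n, int first_in_v, int *clock_out)
{
    int clock = 1, stalls = 0;
    size_t i = 0;

    if (first_in_v && n > 0)
        clock_out[i++] = clock++;               /* V slot of clock 1 */

    while (i < n) {
        clock_out[i] = clock;                   /* U pipe */
        if (i + 1 < n && reads_reg(&ops[i + 1], ops[i].dst)) {
            ++stalls;                           /* reader waits a clock */
            i += 1;
        } else if (i + 1 < n) {
            clock_out[i + 1] = clock;           /* V pipe, same clock */
            i += 2;
        } else {
            i += 1;
        }
        ++clock;
    }
    return stalls;
}
```

Fed the sequence from the example (load into mm0, copy mm2 to mm3, add mm1 into mm0, copy mm0 to mm2), the model reports one lost pairing when the first instruction starts in the U pipe and none when it starts in the V pipe, because in the second alignment the add and the dependent copy no longer share a clock.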
11. MMX Instruction Latencies
- Latency rule 2: After issuing a multiply instruction (pmaddwd, pmulhw, pmullw), wait until three clocks later before using the result
- Example
- pmaddwd mm1, [esi+4*ecx]   ; U pipe
- movq mm0, mm2              ; V pipe
- xxx                        ; don't use mm1
- xxx                        ; still don't
- xxx                        ; still don't
- xxx                        ; still don't
- paddd mm7, mm1             ; now you can!
12. MMX Instruction Latencies
- Latency rule 3: After modifying an MMX register, wait until two clocks later before storing the result to either memory or an integer register (EAX, EBX, and so on)
- Example
- psubw mm0, mm1     ; mm0 gets modified
- xxx                ; don't store yet
- xxx                ; still don't
- xxx                ; still don't
- movq [edi], mm0    ; OK to store
13. Pairing MMX Instructions
- Goal: to achieve the maximum throughput of two instructions per processor clock
- Follow the four basic MMX instruction pairing rules on the Pentium processor
14. Pairing MMX Instructions
- Pairing rule 1: In each clock, at most one MMX multiplication instruction (pmaddwd, pmulhw, or pmullw) can be executed.
- Example
- pmaddwd mm0, mm1   ; a multiply
- pmulhw mm2, mm3    ; another multiply, will not pair
- The two will take 2 clocks to execute (or more if there are stalls due to the latency rules!)
15. Pairing MMX Instructions
- Pairing rule 2: In each clock, at most one MMX shift, pack, or unpack instruction can be executed.
- Example: the following combinations will not pair
- 1. psrad / psllw
- 2. psrad / packssdw
- 3. packssdw / punpckwd
16. Pairing MMX Instructions
- Pairing rule 3: In each clock, the U-V instruction pair can contain at most one memory or integer register (EAX, EBX, etc.) reference, and if it contains one, it must be executed in the U pipe.
- Example: the following combinations will not pair
- 1. movq mm0, [esi] / movd eax, mm1 (two memory or integer register references)
- 2. paddd mm0, mm1 (U pipe) / movq mm2, [esi] (the memory reference would fall in the V pipe)
17. Pairing MMX Instructions
- Pairing rule 4: For optimal pairing, avoid instructions that are more than 7 bytes long. Such an instruction must be executed in the U pipe and usually will not pair (e.g. an instruction of 8 bytes or more is one with a memory operand containing a base register, an index, and a 32-bit displacement).
- Example
- paddusb mm3, [esi+4*ecx+1024]   ; 8 bytes long
- paddusb mm3, [esi+1024]         ; index removed
- Or
- paddusb mm3, [esi+4*ecx+64]     ; offset shortened to 8 bits (an offset can be 0, 8, or 32 bits only)
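The byte counts in this rule follow from the x86 encoding: an MMX instruction has a two-byte opcode (0Fh plus one opcode byte), one ModR/M byte, an optional SIB byte when an index register is used, and a displacement of 0, 1, or 4 bytes. The C sketch below adds these up. It is a simplified estimate under stated assumptions (no prefixes, base register other than EBP or ESP), and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes needed for the displacement field: 0, 1 (8-bit) or 4 (32-bit). */
static int disp_bytes(int32_t disp)
{
    if (disp == 0) return 0;
    if (disp >= -128 && disp <= 127) return 1;
    return 4;
}

/* Simplified length estimate for an MMX instruction of the form
 *   pxxx mm, [base + scale*index + disp]
 * Assumptions: no prefixes; the base register is not EBP or ESP, which
 * have special-case encodings that this sketch ignores. */
int mmx_mem_insn_length(bool has_index, int32_t disp)
{
    int length = 2;                 /* 0Fh escape byte + opcode byte */
    length += 1;                    /* ModR/M byte */
    if (has_index)
        length += 1;                /* SIB byte for the scaled index */
    return length + disp_bytes(disp);
}
```

For the three operand forms in the example, this gives 8 bytes with an index and a displacement of 1024, 7 bytes once the index is removed, and 5 bytes when the displacement shrinks to 64.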
18. Mixing Integer and MMX Instructions
- An integer instruction and an MMX instruction will pair if
- a. the integer instruction is pairable in the pipe where it is being executed
- b. the MMX instruction does not reference memory or an integer register
- Example
- pmulh mm0, mm1   ; no mem/int-reg reference
- add eax, 4       ; V-pairable
- add esi, edi     ; U-pairable
- padd mm2, mm3    ; no mem/int-reg reference
19. Mixing Floating-Point and MMX Instructions
- Each transition between MMX code and floating-point code costs about 50 clocks
- Do not mix floating-point and MMX instructions at the instruction level
- It is reasonable to mix floating-point and MMX instructions at the module (function) level
- Use the EMMS instruction at the end of every MMX code sequence. If you don't
- incorrect floating-point results may be produced
- floating-point exceptions may be generated, which degrades performance
20. Software Pipelining
- Dot product of x and y is $x \cdot y = \sum_{i=0}^{n-1} x_i y_i$
- x and y, two real vectors, can be conveniently represented for MMX technology programming as arrays of 16-bit integers of some length n
21. Software Pipelining
- Dot product in C code
- dotprod = 0;
- for (i = 0; i < n; i++)
- dotprod += x[i] * y[i];
- MMX implementation for the inner loop
- ; esi points to the x array, edi points to the y array
- ; ecx is the loop counter
- dotprod:
- movq mm0, [esi+8*ecx]      ; load x[i]
- pmaddwd mm0, [edi+8*ecx]   ; multiply by y[i]
- paddd mm1, mm0             ; accumulate into mm1
- dec ecx
- jge dotprod
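The relationship between the scalar loop and the MMX loop can be sketched in plain C. The second function below mirrors the MMX version's data flow: one quadword (four words) per iteration, two dword partial sums as in mm1, and a final addition of the two halves. It assumes n is a multiple of 4 and that the sums fit in 32 bits; the function names are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: the scalar loop from the slide. */
int32_t dotprod_scalar(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t dotprod = 0;
    for (size_t i = 0; i < n; i++)
        dotprod += (int32_t)x[i] * y[i];
    return dotprod;
}

/* Mirrors the MMX loop: each iteration consumes one quadword from x and y
 * and keeps two dword partial sums, as the two halves of mm1 would.
 * Assumes n is a multiple of 4 and that no partial sum overflows. */
int32_t dotprod_packed(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t acc[2] = {0, 0};                 /* the two dwords of mm1 */
    for (size_t q = n / 4; q-- > 0; ) {      /* counts down, like ecx */
        const int16_t *xq = x + 4 * q;
        const int16_t *yq = y + 4 * q;
        acc[0] += (int32_t)xq[0] * yq[0] + (int32_t)xq[1] * yq[1];
        acc[1] += (int32_t)xq[2] * yq[2] + (int32_t)xq[3] * yq[3];
    }
    return acc[0] + acc[1];                  /* combine the two halves */
}
```

Because integer addition is associative, the order in which the partial sums are formed does not change the result, which is what allows the packed form to regroup the products.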
22. Software Pipelining
- Analyzing the code using the latency and pairing rules
- code executed on the Pentium processor
- dotprod:                     pipe  clock
- movq mm0, [esi+8*ecx]        U     1
- (V-pipe stall)               V
- pmaddwd mm0, [edi+8*ecx]     U     2
- (V-pipe stall)               V
- (U-pipe stall)               U     3
- (V-pipe stall)               V
- (U-pipe stall)               U     4
- (V-pipe stall)               V     ; multiply now done
- paddd mm1, mm0               U     5
- dec ecx                      V     ; note int/MMX pairing
- jge dotprod                  U     6
- (V-pipe stall)               V
- It takes six clocks to process four elements per iteration, i.e., 1.5 clocks per array element
23. Software Pipelining
- Optimization: use additional iterations of the same loop to fill the empty slots
- Abstraction of each instruction into a symbol
- -Xi, where X could be
- M (multiply)
- S (shift, pack, unpack)
- A (anything else)
- the leading dash indicates a memory or integer register operand
- the subscript i distinguishes different instructions
- The programming task is reduced to a problem of filling a 2-D table with letters
24. Software Pipelining
- Symbolizing the three instructions in the dot product code
- -L: movq
- -M: pmaddwd
- A: paddd
- Each M falls one clock (or more) after the corresponding L, and each A falls three clocks (or more) after the corresponding M
- Each U,V pair contains at most one M and at most one "-" (and the "-", if present, must be in the U pipe)
25. Software Pipelining
- Interleaving three iterations of the loop (dependencies are shown with arrows; interleave factor k = 3)
26. Software Pipelining
- Adding loop control
- DEC and JGE can pair, but one extra clock is still needed for the decremented ECX to be available for the next memory reference.
- Including the loop control, it takes 8 clocks to process 12 (3×4) array elements, i.e., about 0.67 clocks per array element.
- Speedup: roughly 2x over the 1.5 clocks per element of the original loop (excluding loop prologue and epilogue)
27. Software Pipelining
- The optimized code structure
- I = n/4-3          ; start with the last quadword triplet
- -L(I+1)            ; loop prologue
- -M(I+1)
- -L(I+2)
- -M(I+2)
- --------------------------------------------------
- loop_top:          ; loop begins here
- -L(I)
- A
- -M(I)
- stall(V)
- -L(I-2)            ; note change in index (decr. by 3)
- A
28. Software Pipelining
- The optimized code structure (continued)
- -M(I-2)
- stall(V)
- -L(I-1)            ; note change in index
- A
- -M(I-1)
- stall(V)
- I = I-3            ; decrement I by 3
- jg loop_top
- --------------------------------------------------
- -L(0)              ; epilogue:
- A                  ; need to do A for I=1,2
- -M(0)              ; and the whole (L,M,A)
- A                  ; for I=0
- A
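The prologue / loop / epilogue structure outlined above can be mirrored in C to check that the reordering still visits every quadword exactly once. This is a sketch, not the MMX code itself: `load_q`, `mul_q`, and `acc_q` stand in for the L, M, and A steps, all names are hypothetical, and it assumes n/4 is a multiple of 3 and at least 3. It uses a `while` loop rather than a bottom-tested loop so that the smallest case (three quadwords) skips the loop body.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int16_t w[4]; } Quad;   /* one quadword of 16-bit words */
typedef struct { int32_t d[2]; } Prod;   /* the two dwords pmaddwd yields */

/* L: load quadword number q of array v. */
static Quad load_q(const int16_t *v, size_t q)
{
    Quad r;
    for (int k = 0; k < 4; k++)
        r.w[k] = v[4 * q + k];
    return r;
}

/* M: multiply a loaded quadword by quadword q of y, adding pairwise. */
static Prod mul_q(Quad xq, const int16_t *y, size_t q)
{
    Prod p;
    p.d[0] = (int32_t)xq.w[0] * y[4 * q]     + (int32_t)xq.w[1] * y[4 * q + 1];
    p.d[1] = (int32_t)xq.w[2] * y[4 * q + 2] + (int32_t)xq.w[3] * y[4 * q + 3];
    return p;
}

/* A: accumulate a finished product. */
static void acc_q(int32_t acc[2], Prod p)
{
    acc[0] += p.d[0];
    acc[1] += p.d[1];
}

/* Assumes n/4 is a multiple of 3 and at least 3. */
int32_t dotprod_pipelined(const int16_t *x, const int16_t *y, size_t n)
{
    size_t i = n / 4 - 3;            /* start with the last triplet */
    int32_t acc[2] = {0, 0};
    Quad xq;
    Prod pa, pb, pc;                 /* products still "in flight" */

    /* Prologue: L and M for quadwords i+1 and i+2. */
    xq = load_q(x, i + 1);  pa = mul_q(xq, y, i + 1);
    xq = load_q(x, i + 2);  pb = mul_q(xq, y, i + 2);

    while (i > 0) {
        /* Each A consumes a product started well before it. */
        xq = load_q(x, i);      acc_q(acc, pa);  pc = mul_q(xq, y, i);
        xq = load_q(x, i - 2);  acc_q(acc, pb);  pa = mul_q(xq, y, i - 2);
        xq = load_q(x, i - 1);  acc_q(acc, pc);  pb = mul_q(xq, y, i - 1);
        i -= 3;
    }

    /* Epilogue: finish quadwords 1 and 2, then all of quadword 0. */
    xq = load_q(x, 0);  acc_q(acc, pa);
    pc = mul_q(xq, y, 0);
    acc_q(acc, pb);
    acc_q(acc, pc);
    return acc[0] + acc[1];
}
```

In C the reordering buys nothing by itself; the value of the exercise is that each accumulate is separated from its multiply by other work, which is the property the latency rules require of the real instruction stream.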
29. Cache Considerations
- Optimizing for the cache becomes even more important because of the SIMD properties of MMX instructions
- To optimize the way the program uses the data cache, rearrange the way the data is located in memory using these techniques
- data alignment
- separate vs. compound arrays
- rearranging data structures
- padding and alignment
30. Cache Considerations
- Restructure the way the code accesses data
- loop interchange
- loop fusion
- blocking
- To optimize instruction cache utilization, restructure the code to reduce code size at the assembly level
31. Data Alignment
- The Pentium processor can access data at any byte boundary
- A misaligned access costs 3 extra cycles
- To avoid the misalignment penalty, align data objects according to their size
- align 2-byte data so that it doesn't cross a 4-byte boundary (aligning 2-byte data on a 2-byte boundary achieves this)
- align 8-byte MMX data on an 8-byte boundary
- Aligning C data structure example (clip)
32. Data Alignment
- If (amount of computation >> amount of input data), then duplicate the input data with different alignments, so that any segment of the original data can be read by reading one of the aligned duplicate arrays.
- If the amount of input data is too large, then align it on the fly.
- for example, to read 8 misaligned bytes, read the 2 aligned quadwords on either side, then shift and OR them together.
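The shift-and-OR technique for a misaligned 8-byte read can be sketched in C on 64-bit words. The function name is hypothetical, and the byte numbering assumes the little-endian layout used on x86 (byte k of a word occupies bits 8k to 8k+7).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the 8 bytes starting at byte_offset, touching the buffer only
 * through aligned 64-bit words. The caller must make sure that
 * words[byte_offset / 8 + 1] exists whenever byte_offset is not a
 * multiple of 8. */
uint64_t read_qword_unaligned(const uint64_t *words, size_t byte_offset)
{
    size_t   index = byte_offset / 8;
    unsigned shift = (unsigned)(byte_offset % 8) * 8;

    if (shift == 0)
        return words[index];                           /* already aligned */

    uint64_t low  = words[index]     >> shift;         /* like psrlq */
    uint64_t high = words[index + 1] << (64 - shift);  /* like psllq */
    return low | high;                                 /* like por   */
}
```

The aligned case is handled separately because shifting a 64-bit value by 64 is undefined in C.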
33. Separate vs. Compound Array
- First look at how the code accesses the array elements, then declare the structure in the same way
- Example
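The original example did not survive extraction, so the following is a stand-in of our own making. It contrasts the two declarations and pairs each with an access pattern that suits it; the type and function names and the array length are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { N_POINTS = 4 };   /* hypothetical array length */

/* Separate arrays: all x values are contiguous, then all y values. */
typedef struct {
    int16_t x[N_POINTS];
    int16_t y[N_POINTS];
} SeparateArrays;

/* Compound array: each element keeps its x and y side by side. */
typedef struct {
    int16_t x;
    int16_t y;
} Point;

/* Touches only x, so the separate layout streams through useful data
 * without dragging the unused y values into the cache. */
int32_t sum_x(const SeparateArrays *s)
{
    int32_t sum = 0;
    for (size_t i = 0; i < N_POINTS; i++)
        sum += s->x[i];
    return sum;
}

/* Touches x and y of the same element together, so the compound layout
 * brings both values in on the same cache line. */
int32_t sum_of_products(const Point *p, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)p[i].x * p[i].y;
    return sum;
}
```

Neither layout is better in general; the choice follows from which fields the inner loop reads together.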
34. Rearranging Data Structure
- Put data elements that are accessed in parallel together
- Put frequently used data elements together
- When rearranging, be careful not to misalign the data
35. Padding and Aligning Arrays
- When accessing an array of data structures randomly
- padding and aligning makes each structure span the minimum number of cache lines necessary
- for each structure, one cache miss is avoided
- drawback: padding enlarges the data structure, so less information is stored in the cache
- drawback: potential for capacity misses
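The effect of padding on the number of cache lines touched per structure can be checked with a little arithmetic. The helper below is a sketch with a hypothetical name; the line size is a parameter, and the 32-byte value used in the discussion afterwards is an assumption about the target processor's data cache.

```c
#include <stddef.h>

/* Number of cache lines touched by element `index` of an array of
 * `elem_size`-byte structures. `base_offset` is the byte offset of the
 * array's first element from a cache-line boundary. */
size_t lines_spanned(size_t base_offset, size_t elem_size,
                     size_t index, size_t line_size)
{
    size_t first_byte = base_offset + index * elem_size;
    size_t last_byte  = first_byte + elem_size - 1;
    return last_byte / line_size - first_byte / line_size + 1;
}
```

With 32-byte lines, element 1 of an aligned array of 24-byte structures occupies bytes 24 to 47 and straddles two lines. Padding the structure to 32 bytes keeps every element within one line, but only if the array itself starts on a line boundary, which is why padding and aligning go together.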
36. Loop Interchange
- Structure the code so that it accesses the array elements within each cache line consecutively
- C (row-wise), Fortran (column-wise)
- Spatial locality: all the elements in a cache line are used before the line is replaced
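A minimal C illustration of loop interchange on a row-wise (C order) matrix. Both functions compute the same sum; only the traversal order differs. The function names are invented for this sketch.

```c
#include <stddef.h>

/* a is a rows x cols matrix stored row-wise: element (i, j) is
 * a[i * cols + j]. */

/* Before interchange: the inner loop walks down a column, so successive
 * accesses are cols elements apart and each may land on a new line. */
long sum_column_order(const int *a, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}

/* After interchange: the inner loop walks along a row, so successive
 * accesses are adjacent in memory and use up each cache line in turn. */
long sum_row_order(const int *a, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}
```

In a column-wise language such as Fortran the preferred order is the opposite one.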
37. Loop Fusion
- Combine multiple loops over the same array into one single loop
- Increases temporal locality
- Often reduces capacity misses
[Figure: separate loops vs. fused loops]
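A small C sketch of loop fusion, using an illustrative scale-then-offset computation that is not taken from the slides. The two functions produce identical results; the fused one touches each element once while it is still in the cache.

```c
#include <stddef.h>
#include <stdint.h>

/* Separate loops: a[] is walked twice, so for a large array the first
 * elements may have been evicted before the second loop revisits them. */
void scale_offset_separate(int32_t *a, size_t n, int32_t scale, int32_t offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= scale;
    for (size_t i = 0; i < n; i++)
        a[i] += offset;
}

/* Fused loop: both operations are applied while a[i] is still at hand. */
void scale_offset_fused(int32_t *a, size_t n, int32_t scale, int32_t offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * scale + offset;
}
```

Fusion is only valid when the second loop's iteration i depends on nothing later than iteration i of the first loop, as is the case here.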
38. Blocking
- Restructure the program so that it uses smaller blocks of data
- blocking increases the temporal locality of the code
- useful when multiplying large matrices that cannot fit into the cache at the same time
39. Blocking
[Figure: blocked code vs. original code]
- fewer iterations mean fewer misses
- 4 elements from each line of A are used before the line is replaced
- hit rate 96.68% → 97.07% (1.38 improvement)
- a reduction of 1 million cache misses, saving at least 3 million clocks
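Since the slide's code listings were lost, here is a generic sketch of blocked matrix multiplication in C. It is not the presentation's own code: the matrix size `DIM`, the block size `BLOCK`, and the function names are placeholders, and real block sizes would be tuned to the cache.

```c
#include <stddef.h>

enum { DIM = 8, BLOCK = 4 };   /* DIM must be a multiple of BLOCK */

/* Original form: c = a * b for DIM x DIM row-wise matrices. */
void matmul_naive(const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++) {
            int sum = 0;
            for (size_t k = 0; k < DIM; k++)
                sum += a[i * DIM + k] * b[k * DIM + j];
            c[i * DIM + j] = sum;
        }
}

/* Blocked form: the kk/jj loops pick a BLOCK x BLOCK tile of b, which is
 * then reused for every row i before moving on to the next tile. */
void matmul_blocked(const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < DIM * DIM; i++)
        c[i] = 0;

    for (size_t kk = 0; kk < DIM; kk += BLOCK)
        for (size_t jj = 0; jj < DIM; jj += BLOCK)
            for (size_t i = 0; i < DIM; i++)
                for (size_t k = kk; k < kk + BLOCK; k++) {
                    int aik = a[i * DIM + k];
                    for (size_t j = jj; j < jj + BLOCK; j++)
                        c[i * DIM + j] += aik * b[k * DIM + j];
                }
}
```

The arithmetic is unchanged; what changes is that each tile of b is reused across all rows while it is still resident, instead of being streamed through the cache once per row.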
40. Reducing Code Size (Assembly Code)
- Keep the code size from exceeding 8 KB, the size of the instruction cache
- Replace a sequence of single-cycle instructions with a single multi-cycle instruction
- Pull address calculations into load/store instructions
- Use shorter opcodes
- Eliminate compares with immediate zeros
- Shorten instructions by using the Pentium processor's eax register
41. MMX Optimization Summary
- Make data structures and memory accesses 8-byte aligned
- Structure the program and data to maximize instruction and data cache hits
- For each function in the program, craft a minimal sequence of MMX code with SIMD-like thinking.
- Use software pipelining and loop unrolling
42. References
- David Bistry et al. (Intel Corp.), The Complete Guide to MMX Technology, McGraw-Hill, 1997
- John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 2nd edition, Morgan Kaufmann, 1996