1
MMX Architecture Programming and Performance Optimization

The MMX Instruction Set
MMX Technology Optimization Techniques

2
MMX Instruction Overview

57 new opcodes introduced
Instructions are grouped into the following categories
packed arithmetic (e.g. padd, psub)
conversions (e.g. pack, unpack)
logical operations (e.g. pand, por, pxor)
data transfer operations (e.g. movd, movq)
EMMS (Empty Multimedia State)

3
Packed Arithmetic Example 1

paddsw MM2, MM4
(Packed Add with Saturation for Word)
p = Packed
add = the instruction
s = Saturation
w = Word
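
As a rough illustration of what the mnemonic describes, the following C sketch models paddsw one word at a time. It is a scalar model, not MMX code; the function names add_sat_s16 and paddsw_model are made up for this example.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of paddsw: add, then clamp to the
   signed 16-bit range instead of wrapping around. */
int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;   /* positive saturation: 7FFFh */
    if (sum < INT16_MIN) return INT16_MIN;   /* negative saturation: 8000h */
    return (int16_t)sum;
}

/* Model of "paddsw dst, src" on a 64-bit operand viewed as four words. */
void paddsw_model(int16_t dst[4], const int16_t src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = add_sat_s16(dst[i], src[i]);
}
```

With wraparound arithmetic, 32000 + 1000 would become a negative number; with saturation it stays at the largest representable word value.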

4
Packed Arithmetic Example 1

paddsw MM2, MM4

[Figure: worked example of paddsw MM2, MM4; two of the word results are marked as saturated]

5
Packed Arithmetic Example 2

pmaddwd MM2, MM4
(Packed Multiply and Add)
Multiply packed words in parallel
Add the 32-bit results pairwise
Store in MMX register as dwords
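
The three steps above can be sketched in scalar C as follows. This is only a model of the arithmetic, with a hypothetical function name (pmaddwd_model); the final narrowing cast relies on the usual two's complement behavior of mainstream compilers.

```c
#include <stdint.h>

/* Scalar model of pmaddwd: four signed 16-bit products are formed,
   then adjacent products are added to give two 32-bit results. */
void pmaddwd_model(int32_t out[2], const int16_t a[4], const int16_t b[4])
{
    for (int i = 0; i < 2; i++) {
        int64_t p0 = (int64_t)a[2 * i]     * b[2 * i];
        int64_t p1 = (int64_t)a[2 * i + 1] * b[2 * i + 1];
        /* The sum is truncated to 32 bits; there is no saturation.
           Conversion to int32_t is implementation-defined when the
           value does not fit, and wraps on common compilers. */
        out[i] = (int32_t)(uint32_t)(uint64_t)(p0 + p1);
    }
}
```

Feeding the model words that are all 8000h gives 80000000h in each dword, which is the single wraparound case for this instruction.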

6
Packed Arithmetic Example 2

pmaddwd MM2, MM4

Wraps around only when all source data elements are 8000h (the result is then 80000000h)

7
Using Saturating Arithmetic

Example (absolute difference for 8-bit unsigned data)
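
The slide's diagram did not survive extraction. One common way to get this result, assumed here, is to subtract with unsigned saturation in both directions (psubusb) and OR the two results (por). The scalar C sketch below models that idea; the function names are hypothetical.

```c
#include <stdint.h>

/* One byte lane of an unsigned saturating subtract: clamps at 0. */
uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint8_t)(a - b) : 0;
}

/* |a - b| for eight unsigned bytes. For each lane, one of the two
   saturated differences is 0, so OR simply selects the other one. */
void absdiff_u8x8(uint8_t out[8], const uint8_t a[8], const uint8_t b[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(sub_sat_u8(a[i], b[i]) | sub_sat_u8(b[i], a[i]));
}
```

No compare or branch is needed, which is what makes the approach attractive for packed data.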

[Figure: worked example, with the result labeled "Absolute differences"]

8
Optimization Techniques

MMX instructions utilize the U and V pipes of the Pentium processor
To write optimized MMX code, one must
understand the MMX instruction latencies
know how to pair MMX instructions
learn how to efficiently mix MMX and regular integer instructions
take cache structure into consideration

9
MMX Instruction Latencies

Latency rule 1: After modifying an MMX register, wait until the next clock (at least) before reading the same register (avoiding RAW hazards)
Example
movq mm0, [eax]   ; U pipe, clock 1
movq mm3, mm2     ; V pipe, clock 1
paddw mm0, mm1    ; U pipe, clock 2
movq mm2, mm0     ; STALL, clock 3

10
MMX Instruction Latencies

Example (no stall!)
movq mm0, [eax]   ; V pipe
movq mm3, mm2     ; U pipe
paddw mm0, mm1    ; V pipe
movq mm2, mm0     ; U pipe
To determine how a sequence of instructions will line up relative to the U and V pipes, note that the instruction at a branch target is executed in the U pipe

11
MMX Instruction Latencies

Latency rule 2: After issuing a multiply instruction (pmaddwd, pmulhw, pmullw), wait until three clocks later before using the result
Example
pmaddwd mm1, [esi+4*ecx]   ; U pipe
movq mm0, mm2              ; V pipe
xxx                        ; don't use mm1
xxx                        ; still don't
xxx                        ; still don't
xxx                        ; still don't
paddd mm7, mm1             ; now you can!

12
MMX Instruction Latencies

Latency rule 3: After modifying an MMX register, wait until two clocks later before storing the result to either memory or an integer register (EAX, EBX, and so on)
Example
psubw mm0, mm1    ; mm0 gets modified
xxx               ; don't store yet
xxx               ; still don't
xxx               ; still don't
movq [edi], mm0   ; OK to store
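
The three latency rules can be summarized as a small lookup, sketched here in C. The enum and function names are invented for illustration, and the sketch deliberately does not say what happens when rules combine (for example, storing a multiply result), since the slides do not cover that.

```c
/* Hypothetical dependency kinds, one per latency rule on the slides. */
enum dep_kind {
    DEP_REG_READ,     /* rule 1: read an MMX register after it was modified */
    DEP_MUL_RESULT,   /* rule 2: use a pmaddwd/pmulhw/pmullw result */
    DEP_STORE         /* rule 3: store to memory or an integer register */
};

/* Earliest clock in which the dependent instruction can issue without
   stalling, given the clock in which the producer issued. */
int earliest_issue_clock(int producer_clock, enum dep_kind kind)
{
    switch (kind) {
    case DEP_REG_READ:   return producer_clock + 1;
    case DEP_MUL_RESULT: return producer_clock + 3;
    case DEP_STORE:      return producer_clock + 2;
    }
    return producer_clock + 1;  /* not reached for valid kinds */
}
```

For instance, a psubw issued in clock 1 can have its result stored in clock 3 at the earliest, which matches the three filler slots in the rule 3 example.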

13
Pairing MMX Instructions

Goal: to achieve a maximum throughput of two instructions per processor clock
The four basic MMX instruction pairing rules on the Pentium processor follow

14
Pairing MMX Instructions

Pairing rule 1: In each clock, at most one MMX multiplication instruction (pmaddwd, pmulhw, or pmullw) can be executed.
Example
pmaddwd mm0, mm1   ; a multiply
pmulhw mm2, mm3    ; another multiply: will not pair
The two instructions take 2 clocks to execute (or more if there are stalls due to the latency rules!)

15
Pairing MMX Instructions

Pairing rule 2: In each clock, at most one MMX shift, pack, or unpack instruction can be executed.
Example: the following combinations will not pair
1. psrad     2. psrad        3. packssdw
   psllw        packssdw        punpcklwd

16
Pairing MMX Instructions

Pairing rule 3: In each clock, the U/V instruction pair can contain at most one memory or integer register (EAX, EBX, etc.) reference, and if it contains one, it must be executed in the U pipe.
Example: the following combinations will not pair
1. movq mm0, [esi]     2. paddd mm0, mm1   ; U pipe
   movd eax, mm1          movq mm2, [esi]

17
Pairing MMX Instructions

Pairing rule 4: For optimal pairing, avoid instructions that are more than 7 bytes long. Such an instruction must be executed in the U pipe and usually will not pair (e.g. an 8-byte or longer instruction is one with a memory operand containing a base register, an index, and a 32-bit displacement).
Example
paddusb mm3, [esi+4*ecx+1024]   ; 8 bytes long
paddusb mm3, [esi+1024]         ; index removed
Or
paddusb mm3, [esi+4*ecx+64]     ; shortening the offset to 8 bits
                                ; (offset can be 0, 8, or 32 bits only)
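
The four pairing rules can be collected into a single predicate. The C sketch below is one possible encoding, with a hypothetical mmx_insn descriptor; it ignores register dependencies (covered by the latency rules) and treats rule 4 conservatively by assuming that a long instruction never pairs.

```c
#include <stdbool.h>

/* Hypothetical summary of the properties the pairing rules look at. */
struct mmx_insn {
    bool is_multiply;          /* pmaddwd, pmulhw, pmullw */
    bool is_shift_pack;        /* shift, pack, or unpack */
    bool refs_mem_or_intreg;   /* memory or integer-register operand */
    int  length_bytes;         /* encoded instruction length */
};

/* Returns true if u (U pipe) and v (V pipe) pass all four rules. */
bool can_pair(const struct mmx_insn *u, const struct mmx_insn *v)
{
    if (u->is_multiply && v->is_multiply)
        return false;                       /* rule 1 */
    if (u->is_shift_pack && v->is_shift_pack)
        return false;                       /* rule 2 */
    if (v->refs_mem_or_intreg)
        return false;                       /* rule 3: only U may hold it */
    if (u->length_bytes > 7 || v->length_bytes > 7)
        return false;                       /* rule 4, conservative */
    return true;
}
```

Checking only the V-pipe instruction in rule 3 covers both failure cases: two references in one pair, and a single reference placed in the V pipe.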

18
Mixing Integer and MMX Instructions

An integer instruction and an MMX instruction will pair if
a. the integer instruction is a pairable instruction for the pipe where it is being executed
b. the MMX instruction does not reference memory or an integer register
Example
pmulhw mm0, mm1   ; no mem/int-reg reference
add eax, 4        ; V-pairable
add esi, edi      ; U-pairable
paddw mm2, mm3    ; no mem/int-reg reference

19
Mixing Floating-Point and MMX Instructions

Each transition between MMX code and floating-point code costs about 50 clocks
Do not mix floating-point and MMX instructions at the instruction level
It is reasonable to mix floating-point and MMX instructions at the module (function) level
Use the EMMS instruction at the end of every MMX code sequence. If you don't, the possible consequences are
incorrect floating-point results produced
floating-point exceptions generated
degraded performance

20
Software Pipelining

The dot product of $x$ and $y$ is $x \cdot y = \sum_{i=0}^{n-1} x_i y_i$
$x$ and $y$, two real vectors, can be conveniently represented for MMX technology programming as arrays of 16-bit integers of some length $n$

21
Software Pipelining

Dot product in C code
dotprod = 0;
for (i = 0; i < n; i++)
    dotprod += x[i] * y[i];
MMX implementation for the inner loop
; esi points to the x array, edi points to the y array
; ecx is the loop counter
dotprod:
movq mm0, [esi+8*ecx]      ; load x[i]
pmaddwd mm0, [edi+8*ecx]   ; multiply by y[i]
paddd mm1, mm0             ; accumulate to mm1
dec ecx
jge dotprod
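
To make the data flow of the MMX loop easier to follow, here is a scalar C model of it. It is a sketch under two assumptions that the slide does not state: n is a multiple of 4, and all partial sums fit in 32 bits. The final addition of the two accumulator halves is not shown on the slide and is added here so the function returns a single value; the name dotprod_model is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the MMX loop: each iteration consumes one quadword
   (four 16-bit elements) from each array. */
int32_t dotprod_model(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t acc[2] = {0, 0};              /* the two dwords of mm1 */
    for (size_t q = n / 4; q-- > 0; ) {   /* counts down, like ecx */
        const int16_t *xq = x + 4 * q;    /* movq mm0, [esi+8*ecx] */
        const int16_t *yq = y + 4 * q;    /* operand of pmaddwd */
        /* pmaddwd forms two pairwise sums; paddd accumulates them */
        acc[0] += (int32_t)xq[0] * yq[0] + (int32_t)xq[1] * yq[1];
        acc[1] += (int32_t)xq[2] * yq[2] + (int32_t)xq[3] * yq[3];
    }
    return acc[0] + acc[1];               /* horizontal add (assumed) */
}
```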

22
Software Pipelining

Analyzing the code using the latency and pairing rules
Code as executed on the Pentium processor
dotprod:                       pipe  clock
movq mm0, [esi+8*ecx]          U     1
(V-pipe stall)                 V
pmaddwd mm0, [edi+8*ecx]       U     2
(V-pipe stall)                 V
(U-pipe stall)                 U     3
(V-pipe stall)                 V
(U-pipe stall)                 U     4
(V-pipe stall)                 V           ; multiply now done
paddd mm1, mm0                 U     5
dec ecx                        V           ; note int/MMX pairing
jge dotprod                    U     6
(V-pipe stall)                 V
It takes six clocks to process four elements per iteration, i.e., 1.5 clocks per array element

23
Software Pipelining

Optimization: use additional iterations of the same loop to fill the empty slots
Abstract each instruction into a symbol -Xi, where X can be
M (multiply)
S (shift, pack, unpack)
A (anything else)
the leading dash indicates a memory or integer register operand
the subscript i distinguishes different instructions
The programming task is reduced to a problem of filling a 2-D table with letters

24
Software Pipelining

Symbolizing the three instructions in the dot product code
-L movq
-M pmaddwd
A paddd
Each M falls one clock (or more) after the corresponding L, and each A falls three clocks (or more) after the corresponding M
Each U,V pair contains at most one M and at most one - (and the -, if present, must be in the U pipe)

25
Software Pipelining

Interleaving three iterations of the loop (dependencies are shown with arrows; interleave factor k = 3)

26
Software Pipelining

Adding loop control
DEC and JGE can pair, but one extra clock is still needed for the decremented ECX to be available for the next memory reference.
Including the loop control, it takes 8 clocks to process 12 (3×4) array elements, i.e., 0.75 clocks per array element.
Speedup ≈ 2 (excluding loop prologue and epilogue)

27
Software Pipelining

The optimized code structure
I = n/4 - 3        ; start with last quadword triplet
-L(I+1)            ; loop prologue
-M(I+1)
-L(I+2)
-M(I+2)
--------------------------------------------------
loop_top:          ; loop begins here
-L(I)
A
-M(I)
stall(V)
-L(I-2)            ; note change in index (decr. by 3)
A

28
Software Pipelining

The optimized code structure (continued)
-M(I-2)
stall(V)
-L(I-1)            ; note change in index
A
-M(I-1)
stall(V)
I = I-3            ; decrement I by 3
jg loop_top
--------------------------------------------------
-L(0)              ; epilogue
A                  ; need to do A for I = 1, 2
-M(0)              ; and the whole (L, M, A)
A                  ; for I = 0
A

29
Cache Considerations

Optimization for the cache becomes even more important because of the SIMD properties of MMX instructions
To optimize the way the program uses the data cache, rearrange the way the data is located in memory using these techniques
data alignment
separate vs. compound arrays
rearranging data structures
padding and alignment

30
Cache Considerations

Restructure the way the code accesses data
loop interchange
loop fusion
blocking
To optimize instruction cache utilization, restructure the code to reduce code size at the assembly level

31
Data Alignment

The Pentium processor can access data at any byte boundary
A misaligned access costs 3 extra cycles
To avoid the misalignment penalty, align data objects according to their size
align 2-byte data so that it doesn't cross a 4-byte boundary
align 2-byte data on a 2-byte boundary
align 8-byte MMX data on an 8-byte boundary
Aligning a C data structure: example (clip)
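
The slide's own example was clipped, so the following is only a stand-in: a minimal C11 sketch that requests 8-byte alignment for a buffer of 16-bit samples and checks whether a pointer is quadword aligned. The struct and function names are hypothetical, and alignas postdates the compilers of the MMX era, which relied on vendor-specific directives.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical buffer of 16-bit samples, aligned for 8-byte MMX loads. */
struct samples {
    alignas(8) int16_t data[64];
};

/* Nonzero if p can be read with an aligned 8-byte (quadword) access. */
int is_qword_aligned(const void *p)
{
    return ((uintptr_t)p % 8) == 0;
}
```

Every fourth element of data then starts a new aligned quadword, so a loop that steps four words at a time never pays the misalignment penalty.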

32
Data Alignment

If (amount of computation >> amount of input data), then duplicate the input data with different alignments, so that any segment of the original data can be read by reading one of the aligned duplicate arrays.
If the amount of input data is too large, then align it on the fly.
For example, to read 8 misaligned bytes, read the 2 aligned quadwords on either side, then shift and OR them together.
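
The shift-and-OR step can be sketched in portable C as below. The sketch assumes little-endian byte order within a quadword (as on x86) and assumes that buf itself starts on an 8-byte boundary and extends at least to the end of the second quadword; load_qword stands in for an aligned movq load, and both function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble a little-endian quadword from 8 bytes
   (stands in for an aligned 8-byte load). */
uint64_t load_qword(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Read 8 bytes starting at buf + offset using only quadword loads at
   multiples of 8, then shift and OR the two pieces together. */
uint64_t read_unaligned_qword(const uint8_t *buf, size_t offset)
{
    size_t base = offset & ~(size_t)7;     /* aligned quadword below */
    unsigned k = (unsigned)(offset & 7);   /* misalignment in bytes */
    uint64_t lo = load_qword(buf + base);
    if (k == 0)
        return lo;                         /* already aligned */
    uint64_t hi = load_qword(buf + base + 8);
    return (lo >> (8 * k)) | (hi << (64 - 8 * k));
}
```

The aligned case is handled separately because a shift by 64 bits is undefined in C.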

33
Separate vs. Compound Array

First look at how the code accesses the array elements, then declare the structure in the same way
Example
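
The slide's original example is not recoverable, so the following C sketch is a substitute that shows the two layouts side by side. The point type, the array names, and the length N_POINTS are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define N_POINTS 256   /* hypothetical array length */

/* Compound array (array of structures): the x and y of one point are
   adjacent. Suits code that uses both fields of a point together. */
struct point { int16_t x, y; };
struct point compound[N_POINTS];

/* Separate arrays: all x values are contiguous, as are all y values.
   Suits code that streams through one field at a time, for example
   four consecutive x values per 8-byte load. */
int16_t separate_x[N_POINTS];
int16_t separate_y[N_POINTS];

/* Summing x only: the separate layout reads no y data at all. */
int32_t sum_x_separate(void)
{
    int32_t s = 0;
    for (size_t i = 0; i < N_POINTS; i++)
        s += separate_x[i];
    return s;
}

/* The same sum over the compound layout drags every y through the
   cache alongside the x values. */
int32_t sum_x_compound(void)
{
    int32_t s = 0;
    for (size_t i = 0; i < N_POINTS; i++)
        s += compound[i].x;
    return s;
}
```

Both functions compute the same value; the difference lies in how much unrelated data shares each cache line with the data actually used.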

34
Rearranging Data Structure

Put data elements that are accessed in parallel
together
Put frequently used data elements together
When rearranging, be careful not to misalign the
data

35
Padding and Aligning Arrays

When accessing an array of data structures randomly
padding and aligning makes each structure span the minimum number of cache lines necessary
for each structure, one cache miss is avoided
- padding enlarges the data structure, so less information is stored in the cache
- potential for capacity misses
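
A small C11 sketch of the trade-off, assuming a 32-byte cache line (the slides do not give the line size, so CACHE_LINE is an assumption to adjust for the target processor). The struct names and the lines_spanned helper are hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 32   /* assumed line size in bytes */

/* 24 bytes of payload: in an unpadded array, some elements straddle
   two cache lines. */
struct record {
    int32_t fields[6];
};

/* Same payload aligned to the line size. The compiler rounds sizeof up
   to a multiple of the alignment, so each element occupies one line. */
struct padded_record {
    alignas(CACHE_LINE) int32_t fields[6];
};

/* Number of cache lines touched by `size` bytes at byte address `addr`. */
unsigned lines_spanned(uintptr_t addr, unsigned size)
{
    return (unsigned)((addr + size - 1) / CACHE_LINE - addr / CACHE_LINE) + 1;
}
```

The cost is visible in the sizes: the padded element takes 32 bytes instead of 24, so a third more cache space is needed for the same number of records.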

36
Loop Interchange

Structure the code so that it accesses array elements within each cache line
C (row-wise), Fortran (column-wise)
Spatial locality: all the elements in a cache line are used before the line is replaced
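
In C, where arrays are stored row by row, the interchange amounts to making the column index the inner loop. A minimal sketch, with hypothetical dimensions and function names:

```c
#include <stddef.h>

#define ROWS 64   /* hypothetical dimensions */
#define COLS 64

/* Column-by-column traversal of a row-major array: consecutive
   accesses are COLS elements apart, so each one may touch a new line. */
long sum_column_order(int a[ROWS][COLS])
{
    long s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}

/* Interchanged loops: consecutive accesses are adjacent in memory,
   so every element of a cache line is used before moving on. */
long sum_row_order(int a[ROWS][COLS])
{
    long s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}
```

The two functions return the same sum; only the order of memory accesses differs.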

37
Loop Fusion

Combine multiple loops over the same array into a single loop
Increases temporal locality
Often reduces capacity misses
[Figure: separate loops vs. fused loops]
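
As a minimal illustration in C (the operations and function names are made up for this sketch), the separate and fused forms might look like this:

```c
#include <stddef.h>

/* Separate loops: the array is traversed twice, so for a large array
   the early elements may be evicted before the second pass. */
void scale_then_offset(int *a, size_t n, int scale, int offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= scale;
    for (size_t i = 0; i < n; i++)
        a[i] += offset;
}

/* Fused loop: each element is loaded once and both operations are
   applied while it is still in the cache (or in a register). */
void scale_and_offset_fused(int *a, size_t n, int scale, int offset)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * scale + offset;
}
```

Fusion is only valid when the second loop does not depend on results that the first loop produces for other elements, as is the case here.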

38
Blocking

Restructure the program so that it uses smaller blocks of data
blocking increases the temporal locality of the code
useful when multiplying large matrices that cannot fit into the cache at the same time
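
A compact C sketch of blocked matrix multiplication follows. The matrix size N and block size BS are hypothetical (a real block size would be chosen from the cache size), BS is assumed to divide N, and the helper matmul_results_match exists only to compare the two versions.

```c
#include <stddef.h>

#define N  8   /* hypothetical matrix size */
#define BS 4   /* hypothetical block size; must divide N in this sketch */

/* Straightforward version: for large N, b is swept column by column
   and is evicted long before its elements are reused. */
void matmul_naive(int c[N][N], int a[N][N], int b[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            int s = 0;
            for (size_t k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* Blocked version: works on one BS x BS block of b at a time, so the
   block is reused for every row i while it is still cached.
   c must be zeroed by the caller. */
void matmul_blocked(int c[N][N], int a[N][N], int b[N][N])
{
    for (size_t jj = 0; jj < N; jj += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < jj + BS; j++) {
                    int s = c[i][j];
                    for (size_t k = kk; k < kk + BS; k++)
                        s += a[i][k] * b[k][j];
                    c[i][j] = s;
                }
}

/* Fills two test matrices and checks that both versions agree. */
int matmul_results_match(void)
{
    static int a[N][N], b[N][N], c1[N][N], c2[N][N];
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            a[i][j] = (int)(i + 2 * j);
            b[i][j] = (int)i - (int)j;
            c2[i][j] = 0;
        }
    matmul_naive(c1, a, b);
    matmul_blocked(c2, a, b);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            if (c1[i][j] != c2[i][j])
                return 0;
    return 1;
}
```

The blocked version performs exactly the same multiplications and additions; only their order changes, which is why the results match.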

39
Blocking

[Figure: original code vs. blocked code]
fewer iterations mean fewer misses
4 elements from each line of A are used before the line is replaced
hit rate 96.68% → 97.07% (1.38 improvement)
a reduction of 1 million cache misses, saving at least 3 million clocks

40
Reducing Code Size(Assembly Code)

So that code size does not exceed 8 KB, the size of the instruction cache
Replace a sequence of single-cycle instructions with a single multi-cycle instruction
Pull address calculation into load/store instructions
Use shorter opcodes
Eliminate compares with immediate zeros
Shorten instructions by using the Pentium processor's eax register

41
MMX Optimization Summary

Make data structures and memory accesses 8-byte aligned
Structure program and data to maximize instruction and data cache hits
For each function in the program, craft a minimal sequence of MMX code with SIMD-like thinking.
Use software pipelining and loop unrolling

42
References

David Bistry et al., Intel Corp., The Complete Guide to MMX Technology, McGraw-Hill, 1997
John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 2nd edition, Morgan Kaufmann, 1996
