Performance Programming: Exploiting the Power Processor

1
Performance Programming: Exploiting the Power Processor
  • Larry Carter
  • Sean Peisert

2
Topics
  • Exploiting the Power Processor
  • Peak processor performance
  • Is it attainable?
  • Reducing memory operations
  • Instruction-level parallelism
  • C versus Fortran
  • Cache and TLB issues
  • Examples

3
Approach
  • Engineer's method
  • DO UNTIL (exhausted)
  • tweak something
  • IF (better) THEN accept_change
  • Scientific method
  • DO UNTIL (enlightened)
  • make hypothesis
  • experiment
  • revise hypothesis

4
Power3's power and limits
  • Eight pipelined functional units
  • 2 floating point
  • 2 load/store
  • 2 single-cycle integer
  • 1 multi-cycle integer
  • 1 branch
  • Powerful operations
  • Fused multiply-add (FMA)
  • Load (or Store) update
  • Branch on count
  • Launch 4 ops per cycle
  • Can't launch 2 stores/cycle
  • FMA pipe 3-4 cycles long
  • Memory hierarchy
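
As a back-of-the-envelope check on what "peak performance" means here, a
minimal sketch, assuming a 222 MHz clock (the clock rate is not stated on
this slide; 222 MHz is the rate that makes the 772 MFLOP/s figure on the
next slide come out right):

    #include <stdio.h>

    int main(void) {
        /* Assumed clock rate in MHz -- an assumption, not from the slide. */
        double clock_mhz = 222.0;

        /* 2 FP units, each retiring one fused multiply-add (2 flops) per cycle. */
        double peak_mflops = 2 * 2 * clock_mhz;

        /* The next slide's kernel: 8 FMAs (16 flops) every 4.6 cycles. */
        double measured_mflops = 16.0 / 4.6 * clock_mhz;

        printf("peak     = %.0f MFLOP/s\n", peak_mflops);       /* ~888 */
        printf("measured = %.0f MFLOP/s\n", measured_mflops);   /* ~772 */
        return 0;
    }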

5
Can its power be harnessed?

Compiler-generated inner loop:

CL.6:
    FMA    fp31=fp31,fp2,fp0,fcr
    LFL    fp1=()double(gr3,16)
    FNMS   fp30=fp30,fp2,fp0,fcr
    LFDU   fp3,gr3=()double(gr3,32)
    FMA    fp24=fp24,fp0,fp1,fcr
    FNMS   fp25=fp25,fp0,fp1,fcr
    LFL    fp0=()double(gr3,24)
    FMA    fp27=fp27,fp2,fp3,fcr
    FNMS   fp26=fp26,fp2,fp3,fcr
    LFL    fp2=()double(gr3,8)
    FMA    fp29=fp29,fp1,fp3,fcr
    FNMS   fp28=fp28,fp1,fp3,fcr
    BCT    ctr=CL.6,

Source loop:

    for (j=0; j<n; j+=4) {
        p00 += a[j+0]*a[j+2];
        m00 -= a[j+0]*a[j+2];
        p01 += a[j+1]*a[j+3];
        m01 -= a[j+1]*a[j+3];
        p10 += a[j+0]*a[j+3];
        m10 -= a[j+0]*a[j+3];
        p11 += a[j+1]*a[j+2];
        m11 -= a[j+1]*a[j+2];
    }

8 FMAs, 4 loads. Runs at 4.6 cycles/iteration (about 772 MFLOP/s).

6
Can its power be harnessed (part II)
  • 8 FMA, 4 Load - 1.15 cycle/load (previous slide)
  • 8 FMA, 8 Load
  • for (j=0; j<n; j+=8) {
        p00 += a[j+0]*a[j+2];
        m00 -= a[j+0]*a[j+2];
        p01 += a[j+1]*a[j+3];
        m01 -= a[j+1]*a[j+3];
        p10 += a[j+4]*a[j+6];
        m10 -= a[j+4]*a[j+6];
        p11 += a[j+5]*a[j+7];
        m11 -= a[j+5]*a[j+7];
    }
  • Batch job: 0.6 cycle/load (740 MFLOP/sec)
  • Interactive: 1.2 cycle/load (370 MFLOP/sec)
  • Interactive nodes have a 1 cycle/MemOp barrier!
  • (the AGEN unit is disabled!)

7
FLOP to MemOp ratio
  • Most programs have at most one FMA per MemOp
  • Matrix-vector product: (k+1) loads, k FMAs
  • FFT butterfly: 8 MemOps, 10 flops (but only 5 or 6 FMAs)
  • DAXPY: 2 loads, 1 store, 1 FMA
  • DDOT: 2 loads, 1 FMA
  • A few have more (but they are in libraries)
  • Matrix multiply (well-tuned): 2 FMAs per load
  • Radix-8 FFT
  • Your program is limited by memory operations!
    (see the counting sketch below)
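
To make the counting above concrete, here is a minimal sketch of the DAXPY
and DDOT kernels named on this slide (textbook BLAS-1 definitions, not the
authors' code), with the per-element MemOp and FMA counts in the comments:

    #include <stdio.h>

    /* DAXPY: y[i] += a * x[i]
       Per element: 2 loads (x[i], y[i]) + 1 store (y[i]) = 3 MemOps,
       but only 1 FMA (2 flops). */
    static void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* DDOT: 2 loads per element, 1 FMA -- again MemOp-bound. */
    static double ddot(int n, const double *x, const double *y) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }

    int main(void) {
        double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
        daxpy(4, 2.0, x, y);                    /* y becomes {6, 7, 8, 9} */
        printf("ddot = %g\n", ddot(4, x, y));   /* 6 + 14 + 24 + 36 = 80 */
        return 0;
    }

With two load/store units, DAXPY's three MemOps per element take at least
1.5 cycles, so one FMA every 1.5 cycles is the best case and the FMA units
sit mostly idle.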

8
Decreasing MemOp to FLOP Ratio
Original loop:

    for (i=1; i<N; i++)
        for (j=1; j<N; j++)
            b[i][j] = 0.25 *
                (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

Unrolled by three rows:

    for (i=1; i<N-2; i+=3)
        for (j=1; j<N; j++) {
            b[i+0][j] = ...
            b[i+1][j] = ...
            b[i+2][j] = ...
        }
    for ( ; i<N; i++)
        ...    /* Do last rows */

Original loop: 3 loads, 1 store per 4 flops (1 MemOp per flop).
Unrolled loop: 5 loads, 3 stores per 12 flops (0.67 MemOps per flop).
9
The effect of pipeline latency
for (i=0; i<size; i++) sum = a[i] + sum;

3.86 cycles/addition
The next add can't start until the previous one has finished
(3 to 4 cycles later).

for (i=0; i<size; i+=8) {
    s0 += a[i+0]; s4 += a[i+4];
    s1 += a[i+1]; s5 += a[i+5];
    s2 += a[i+2]; s6 += a[i+6];
    s3 += a[i+3]; s7 += a[i+7];
}
sum = s0+s1+s2+s3+s4+s5+s6+s7;
0.5 cycle/addition
May change answer due to different rounding.
10
What's so great about Fortran??
C source:        for (i=0; i<N; i++) b[i] = a[i];

Fortran source:  DO I = 1, N
                    A(I) = B(I)
                 ENDDO

Code generated for the C loop:

CL.6:
    ST4U   gr4,()int(gr4,4)=gr24
    L4AU   gr24,gr3=()int(gr3,4)
    BCT    ctr=CL.6,

Code generated for the Fortran loop (unrolled by four):

CL.8:
    L4A    gr0=b(gr5,4)
    L4A    gr6=b(gr5,8)
    L4A    gr7=b(gr5,12)
    L4AU   gr8,gr5=b(gr5,16)
    ST4A   a(gr4,8)=gr6
    ST4A   a(gr4,4)=gr0
    ST4A   a(gr4,12)=gr7
    ST4U   gr4,a(gr4,16)=gr8
    BCT    ctr=CL.8,
11
Fortran vs. C - what's going on??
  • C prevents compiler from unrolling code
  • A feature, not a bug!
  • User may want b[0] and a[1] to be the same location
  • a tricky way to set a[n], ..., a[1] all to a[0]
    (see the sketch below)
  • Most C compilers don't try to prove non-aliasing
  • a and b were malloc-ed in this example
  • Fortran doesn't allow arrays to be aliased
  • Unless explicit, e.g. via EQUIVALENCE
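
A minimal sketch of the aliasing trick described above (the concrete array
contents are illustrative, not from the slide): when b points one element
past the start of a, the plain copy loop propagates a[0] down the array, so
the compiler may not reorder loads past stores unless it can prove that a
and b do not overlap.

    #include <stdio.h>

    int main(void) {
        int a[5] = {7, 1, 2, 3, 4};
        int *b = &a[1];              /* b[0] and a[1] are the same location */

        /* Same loop shape as the previous slide's C example: b[i] = a[i].
           Because of the overlap, each iteration reads the value written by
           the previous one, so a[1]..a[4] all become a[0]. */
        for (int i = 0; i < 4; i++)
            b[i] = a[i];

        for (int i = 0; i < 5; i++)
            printf("%d ", a[i]);     /* prints: 7 7 7 7 7 */
        printf("\n");
        return 0;
    }

If the compiler hoisted all four loads ahead of the stores, the way the
Fortran-generated code on the previous slide does, this program would print
a different answer; that is exactly why the C compiler must stay
conservative.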

12
Fortran vs. C - does it matter??
  • Yes: Fortran runs at 1.1 cycles per load-store,
    while the C code runs at 2.2 cycles per load-store
  • No: you could get the Fortran object code from
    the C below

for (i=0; i<N-3; i+=4) {
    b0 = a[i+0];  b1 = a[i+1];
    b2 = a[i+2];  b3 = a[i+3];
    b[i+0] = b0;  b[i+1] = b1;
    b[i+2] = b2;  b[i+3] = b3;
}
for ( ; i<N; i++) b[i] = a[i];
13
The Memory Hierarchy
  • L1 data cache
  • 64 KBytes: 512 lines, each 128 Bytes long
  • 128-way set associative (!!)
  • Prefetching for up to 4 streams
  • 6 or 7 cycle penalty for an L1 cache miss
  • Data TLB
  • 1 MB reach: 256 entries, each mapping a 4-KByte page
  • 2-way set associative
  • 25 to 100s of cycles penalty for a TLB miss
  • L2 cache
  • 4 MBytes: 32,768 lines, each 128 Bytes long
  • 1-way (4-way on Nighthawk 2) set associative, randomized (!!)
  • Can only hold data from 1024 different pages
  • 35 cycle penalty for an L2 miss
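
A small sketch that just recomputes these capacities (every number is taken
from this slide); the resulting thresholds are the ones quoted on the next
slide:

    #include <stdio.h>

    int main(void) {
        long l1_bytes  = 512L * 128;             /* L1: 512 lines x 128 B = 64 KB */
        long tlb_reach = 256L * 4 * 1024;        /* TLB: 256 entries x 4 KB = 1 MB */
        long l2_bytes  = 32768L * 128;           /* L2: 32,768 lines x 128 B = 4 MB */
        long l2_pages  = l2_bytes / (4 * 1024);  /* 4 MB / 4 KB pages = 1024 pages */

        printf("L1 = %ld KB, TLB reach = %ld KB, L2 = %ld KB (%ld pages)\n",
               l1_bytes / 1024, tlb_reach / 1024, l2_bytes / 1024, l2_pages);
        return 0;
    }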

14
SO WHAT ??
  • SPs should do well with power-of-2 strides
  • L1 cache is almost fully-associative
  • Randomization makes L2 associativity
    unpredictable
  • Sequential access (or small strides) is good
  • Random access within a limited range is OK
  • Within 64 KBytes (in L1): 1 cycle per MemOp
  • Within 1 MByte: up to a 7-cycle L1 penalty
  • Within 4 MBytes: may get a 25-cycle TLB penalty
  • Larger ranges: huge (80 to 200 cycle) penalties
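
The plots on the following slides come from loops that sum a fixed number of
integers spaced a given stride apart. A minimal sketch of that kind of
microbenchmark (the timing harness, the repetition count, the stride
doubling, and the 222 MHz clock are illustrative assumptions, not the
authors' code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Sum `count` ints spaced `stride` elements apart.  A warm-up pass
       touches the data first, so the timed passes measure the "second time
       through the data". */
    static double cycles_per_load(const int *data, long count, long stride,
                                  double mhz) {
        const int reps = 1000;
        volatile long sum = 0;
        long i;
        int r;

        for (i = 0; i < count * stride; i += stride)      /* warm-up pass */
            sum += data[i];

        clock_t t0 = clock();
        for (r = 0; r < reps; r++)
            for (i = 0; i < count * stride; i += stride)  /* timed passes */
                sum += data[i];
        clock_t t1 = clock();

        double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
        return seconds * mhz * 1e6 / ((double)count * reps);
    }

    int main(void) {
        const long count = 4440;      /* element count from the strided-access slides */
        const double mhz = 222.0;     /* assumed clock rate */
        long stride;

        for (stride = 1; stride <= 4096; stride *= 2) {
            int *data = calloc((size_t)(count * stride), sizeof(int));
            if (data == NULL)
                break;
            printf("stride %5ld: %.2f cycles per load\n",
                   stride, cycles_per_load(data, count, stride, mhz));
            free(data);
        }
        return 0;
    }

Run on an interactive node, a loop like this reproduces the behaviour
annotated on the charts that follow: cheap loads while several elements
share a cache line, a jump once every load touches a new cache line, and a
further jump once every load touches a new page.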

15
Stride-one memory access: sum a list of ints, interactive node, second time
through the data.
(Chart: performance vs. data size, with the L1-cache and L2-cache regions
marked.)
16
Stride-one memory access: sum a list of floats, batch job, second time
through the data.
Uncached data: 4.6 cycles per load.
(Chart: performance vs. data size, with the L1-cache, L2-cache, and TLB
regions marked.)
17
Strided Memory Access
Program adds 4,440 integers located at a given stride (performed on an
interactive node).
(Chart: cost per element vs. stride, annotated where L1 misses start, where
TLB misses start, and at "> 1 element/cacheline", "> 1 element/page", and
"1 element/page (as bad as it gets)".)
18
Strided Memory Access
Program adds 22,200 integers located at a given stride.
(Chart: cost per element vs. stride, annotated at stride 64, at
"> 1 element/cacheline" (shows the effect of the L2 cache), at
"> 1 element/page" (shows the effect of the TLB), and at "1 element/page
(as bad as it gets)".)
19
Strided Memory Access
Squares: the 4,440-element sum; diamonds: the 22,200-element sum.
(Chart: both curves overlaid, annotated at stride 64, at
"> 1 element/cacheline", "> 1 element/page", and "1 element/page (as bad as
it gets)"; the values 55 and 110 are marked on the chart.)
20
Miscellany
  • Excellent reference: RS/6000 Scientific and Technical Computing:
    POWER3 Introduction and Tuning Guide
  • Use ESSL and PESSL if appropriate
  • MASS is much faster for intrinsic functions
  • But may differ in last bit from IEEE standard
  • I'm carter@cs.ucsd.edu
  • But I'm outta here for sabbatical!