Performance Programming: Exploiting the Power Processor

1
Performance Programming: Exploiting the Power Processor
  • Larry Carter
  • Sean Peisert

2
Topics
  • Exploiting the Power Processor
  • Peak processor performance
  • Is it attainable?
  • Reducing memory operations
  • Instruction-level parallelism
  • C versus Fortran
  • Cache and TLB issues
  • Examples

3
Approach
  • Engineer's method
  • DO UNTIL (exhausted)
  • tweak something
  • IF (better) THEN accept_change
  • Scientific method
  • DO UNTIL (enlightened)
  • make hypothesis
  • experiment
  • revise hypothesis

4
Power3's power and limits
  • Eight pipelined functional units
  • 2 floating point
  • 2 load/store
  • 2 single-cycle integer
  • 1 multi-cycle integer
  • 1 branch
  • Powerful operations
  • Fused multiply-add (FMA)
  • Load (or Store) update
  • Branch on count
  • Launch 4 ops per cycle
  • Can't launch 2 stores/cycle
  • FMA pipe 3-4 cycles long
  • Memory hierarchy
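
As a back-of-the-envelope check on what "peak performance" means here, a
minimal sketch, assuming a 222 MHz clock (the clock rate is not stated on
this slide; 222 MHz is the rate that makes the 772 MFLOP/s figure on the
next slide come out right):

    #include <stdio.h>

    int main(void) {
        /* Assumed clock rate in MHz -- an assumption, not from the slide. */
        double clock_mhz = 222.0;

        /* 2 FP units, each retiring one fused multiply-add (2 flops) per cycle. */
        double peak_mflops = 2 * 2 * clock_mhz;

        /* The next slide's kernel: 8 FMAs (16 flops) every 4.6 cycles. */
        double measured_mflops = 16.0 / 4.6 * clock_mhz;

        printf("peak     = %.0f MFLOP/s\n", peak_mflops);       /* ~888 */
        printf("measured = %.0f MFLOP/s\n", measured_mflops);   /* ~772 */
        return 0;
    }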

5
Can its power be harnessed?

Compiler-generated inner loop:

CL.6:
    FMA    fp31=fp31,fp2,fp0,fcr
    LFL    fp1=()double(gr3,16)
    FNMS   fp30=fp30,fp2,fp0,fcr
    LFDU   fp3,gr3=()double(gr3,32)
    FMA    fp24=fp24,fp0,fp1,fcr
    FNMS   fp25=fp25,fp0,fp1,fcr
    LFL    fp0=()double(gr3,24)
    FMA    fp27=fp27,fp2,fp3,fcr
    FNMS   fp26=fp26,fp2,fp3,fcr
    LFL    fp2=()double(gr3,8)
    FMA    fp29=fp29,fp1,fp3,fcr
    FNMS   fp28=fp28,fp1,fp3,fcr
    BCT    ctr=CL.6,

Source loop:

    for (j=0; j<n; j+=4) {
        p00 += a[j+0]*a[j+2];
        m00 -= a[j+0]*a[j+2];
        p01 += a[j+1]*a[j+3];
        m01 -= a[j+1]*a[j+3];
        p10 += a[j+0]*a[j+3];
        m10 -= a[j+0]*a[j+3];
        p11 += a[j+1]*a[j+2];
        m11 -= a[j+1]*a[j+2];
    }

8 FMAs, 4 loads. Runs at 4.6 cycles/iteration (about 772 MFLOP/s).

6
Can its power be harnessed (part II)
  • 8 FMA, 4 Load - 1.15 cycle/load (previous slide)
  • 8 FMA, 8 Load
  • for (j=0; j<n; j+=8) {
        p00 += a[j+0]*a[j+2];
        m00 -= a[j+0]*a[j+2];
        p01 += a[j+1]*a[j+3];
        m01 -= a[j+1]*a[j+3];
        p10 += a[j+4]*a[j+6];
        m10 -= a[j+4]*a[j+6];
        p11 += a[j+5]*a[j+7];
        m11 -= a[j+5]*a[j+7];
    }
  • Batch job: 0.6 cycle/load (740 MFLOP/sec)
  • Interactive: 1.2 cycle/load (370 MFLOP/sec)
  • Interactive nodes have a 1 cycle/MemOp barrier!
  • (the AGEN unit is disabled!)

7
FLOP to MemOp ratio
  • Most programs have at most one FMA per MemOp
  • Matrix-vector product: (k+1) loads, k FMAs
  • FFT butterfly: 8 MemOps, 10 flops (but only 5 or 6 FMAs)
  • DAXPY: 2 loads, 1 store, 1 FMA
  • DDOT: 2 loads, 1 FMA
  • A few have more (but they are in libraries)
  • Matrix multiply (well-tuned): 2 FMAs per load
  • Radix-8 FFT
  • Your program is limited by memory operations!
    (see the counting sketch below)
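
To make the counting above concrete, here is a minimal sketch of the DAXPY
and DDOT kernels named on this slide (textbook BLAS-1 definitions, not the
authors' code), with the per-element MemOp and FMA counts in the comments:

    #include <stdio.h>

    /* DAXPY: y[i] += a * x[i]
       Per element: 2 loads (x[i], y[i]) + 1 store (y[i]) = 3 MemOps,
       but only 1 FMA (2 flops). */
    static void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* DDOT: 2 loads per element, 1 FMA -- again MemOp-bound. */
    static double ddot(int n, const double *x, const double *y) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }

    int main(void) {
        double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
        daxpy(4, 2.0, x, y);                    /* y becomes {6, 7, 8, 9} */
        printf("ddot = %g\n", ddot(4, x, y));   /* 6 + 14 + 24 + 36 = 80 */
        return 0;
    }

With two load/store units, DAXPY's three MemOps per element take at least
1.5 cycles, so one FMA every 1.5 cycles is the best case and the FMA units
sit mostly idle.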

8
Decreasing MemOp to FLOP Ratio
Original loop:

    for (i=1; i<N; i++)
        for (j=1; j<N; j++)
            b[i][j] = 0.25 *
                (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

Unrolled by three rows:

    for (i=1; i<N-2; i+=3)
        for (j=1; j<N; j++) {
            b[i+0][j] = ...
            b[i+1][j] = ...
            b[i+2][j] = ...
        }
    for ( ; i<N; i++)
        ...    /* Do last rows */

Original loop: 3 loads, 1 store per 4 flops (1 MemOp per flop).
Unrolled loop: 5 loads, 3 stores per 12 flops (0.67 MemOps per flop).
9
The effect of pipeline latency
for (i=0; i<size; i++) sum = a[i] + sum;

3.86 cycles/addition
The next add can't start until the previous one has finished
(3 to 4 cycles later).

for (i=0; i<size; i+=8) {
    s0 += a[i+0]; s4 += a[i+4];
    s1 += a[i+1]; s5 += a[i+5];
    s2 += a[i+2]; s6 += a[i+6];
    s3 += a[i+3]; s7 += a[i+7];
}
sum = s0+s1+s2+s3+s4+s5+s6+s7;
0.5 cycle/addition
May change answer due to different rounding.
10
What's so great about Fortran??
C source:        for (i=0; i<N; i++) b[i] = a[i];

Fortran source:  DO I = 1, N
                    A(I) = B(I)
                 ENDDO

Code generated for the C loop:

CL.6:
    ST4U   gr4,()int(gr4,4)=gr24
    L4AU   gr24,gr3=()int(gr3,4)
    BCT    ctr=CL.6,

Code generated for the Fortran loop (unrolled by four):

CL.8:
    L4A    gr0=b(gr5,4)
    L4A    gr6=b(gr5,8)
    L4A    gr7=b(gr5,12)
    L4AU   gr8,gr5=b(gr5,16)
    ST4A   a(gr4,8)=gr6
    ST4A   a(gr4,4)=gr0
    ST4A   a(gr4,12)=gr7
    ST4U   gr4,a(gr4,16)=gr8
    BCT    ctr=CL.8,
11
Fortran vs. C - what's going on??
  • C prevents compiler from unrolling code
  • A feature, not a bug!
  • User may want b[0] and a[1] to be the same location
  • a tricky way to set a[n], ..., a[1] all to a[0]
    (see the sketch below)
  • Most C compilers don't try to prove non-aliasing
  • a and b were malloc-ed in this example
  • Fortran doesn't allow arrays to be aliased
  • Unless explicit, e.g. via EQUIVALENCE
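
A minimal sketch of the aliasing trick described above (the concrete array
contents are illustrative, not from the slide): when b points one element
past the start of a, the plain copy loop propagates a[0] down the array, so
the compiler may not reorder loads past stores unless it can prove that a
and b do not overlap.

    #include <stdio.h>

    int main(void) {
        int a[5] = {7, 1, 2, 3, 4};
        int *b = &a[1];              /* b[0] and a[1] are the same location */

        /* Same loop shape as the previous slide's C example: b[i] = a[i].
           Because of the overlap, each iteration reads the value written by
           the previous one, so a[1]..a[4] all become a[0]. */
        for (int i = 0; i < 4; i++)
            b[i] = a[i];

        for (int i = 0; i < 5; i++)
            printf("%d ", a[i]);     /* prints: 7 7 7 7 7 */
        printf("\n");
        return 0;
    }

If the compiler hoisted all four loads ahead of the stores, the way the
Fortran-generated code on the previous slide does, this program would print
a different answer; that is exactly why the C compiler must stay
conservative.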

12
Fortran vs. C - does it matter??
  • Yes: Fortran runs at 1.1 cycles per load-store,
    while the C code runs at 2.2 cycles per load-store
  • No: you could get the Fortran object code from
    the C below

for (i=0; i<N-3; i+=4) {
    b0 = a[i+0];  b1 = a[i+1];
    b2 = a[i+2];  b3 = a[i+3];
    b[i+0] = b0;  b[i+1] = b1;
    b[i+2] = b2;  b[i+3] = b3;
}
for ( ; i<N; i++) b[i] = a[i];
13
The Memory Hierarchy
  • L1 data cache
  • 64 KBytes: 512 lines, each 128 Bytes long
  • 128-way set associative (!!)
  • Prefetching for up to 4 streams
  • 6 or 7 cycle penalty for an L1 cache miss
  • Data TLB
  • 1 MB reach: 256 entries, each mapping a 4-KByte page
  • 2-way set associative
  • 25 to 100s of cycles penalty for a TLB miss
  • L2 cache
  • 4 MBytes: 32,768 lines, each 128 Bytes long
  • 1-way (4-way on Nighthawk 2) set associative, randomized (!!)
  • Can only hold data from 1024 different pages
  • 35 cycle penalty for an L2 miss
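
A small sketch that just recomputes these capacities (every number is taken
from this slide); the resulting thresholds are the ones quoted on the next
slide:

    #include <stdio.h>

    int main(void) {
        long l1_bytes  = 512L * 128;             /* L1: 512 lines x 128 B = 64 KB */
        long tlb_reach = 256L * 4 * 1024;        /* TLB: 256 entries x 4 KB = 1 MB */
        long l2_bytes  = 32768L * 128;           /* L2: 32,768 lines x 128 B = 4 MB */
        long l2_pages  = l2_bytes / (4 * 1024);  /* 4 MB / 4 KB pages = 1024 pages */

        printf("L1 = %ld KB, TLB reach = %ld KB, L2 = %ld KB (%ld pages)\n",
               l1_bytes / 1024, tlb_reach / 1024, l2_bytes / 1024, l2_pages);
        return 0;
    }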

14
SO WHAT ??
  • SPs should do well with power-of-2 strides
  • L1 cache is almost fully-associative
  • Randomization makes L2 associativity
    unpredictable
  • Sequential access (or small strides) is good
  • Random access within a limited range is OK
  • Within 64 KBytes (in L1): 1 cycle per MemOp
  • Within 1 MByte: up to a 7-cycle L1 penalty
  • Within 4 MBytes: may get a 25-cycle TLB penalty
  • Larger ranges: huge (80 to 200 cycle) penalties
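
The plots on the following slides come from loops that sum a fixed number of
integers spaced a given stride apart. A minimal sketch of that kind of
microbenchmark (the timing harness, the repetition count, the stride
doubling, and the 222 MHz clock are illustrative assumptions, not the
authors' code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Sum `count` ints spaced `stride` elements apart.  A warm-up pass
       touches the data first, so the timed passes measure the "second time
       through the data". */
    static double cycles_per_load(const int *data, long count, long stride,
                                  double mhz) {
        const int reps = 1000;
        volatile long sum = 0;
        long i;
        int r;

        for (i = 0; i < count * stride; i += stride)      /* warm-up pass */
            sum += data[i];

        clock_t t0 = clock();
        for (r = 0; r < reps; r++)
            for (i = 0; i < count * stride; i += stride)  /* timed passes */
                sum += data[i];
        clock_t t1 = clock();

        double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
        return seconds * mhz * 1e6 / ((double)count * reps);
    }

    int main(void) {
        const long count = 4440;      /* element count from the strided-access slides */
        const double mhz = 222.0;     /* assumed clock rate */
        long stride;

        for (stride = 1; stride <= 4096; stride *= 2) {
            int *data = calloc((size_t)(count * stride), sizeof(int));
            if (data == NULL)
                break;
            printf("stride %5ld: %.2f cycles per load\n",
                   stride, cycles_per_load(data, count, stride, mhz));
            free(data);
        }
        return 0;
    }

Run on an interactive node, a loop like this reproduces the behaviour
annotated on the charts that follow: cheap loads while several elements
share a cache line, a jump once every load touches a new cache line, and a
further jump once every load touches a new page.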

15
Stride-one memory access: sum a list of ints, interactive node, second time
through the data.
(Chart: performance vs. data size, with the L1-cache and L2-cache regions
marked.)
16
Stride-one memory access: sum a list of floats, batch job, second time
through the data.
Uncached data: 4.6 cycles per load.
(Chart: performance vs. data size, with the L1-cache, L2-cache, and TLB
regions marked.)
17
Strided Memory Access
Program adds 4,440 integers located at a given stride (performed on an
interactive node).
(Chart: cost per element vs. stride, annotated where L1 misses start, where
TLB misses start, and at "> 1 element/cacheline", "> 1 element/page", and
"1 element/page (as bad as it gets)".)
18
Strided Memory Access
Program adds 22,200 integers located at a given stride.
(Chart: cost per element vs. stride, annotated at stride 64, at
"> 1 element/cacheline" (shows the effect of the L2 cache), at
"> 1 element/page" (shows the effect of the TLB), and at "1 element/page
(as bad as it gets)".)
19
Strided Memory Access
Squares: the 4,440-element sum; diamonds: the 22,200-element sum.
(Chart: both curves overlaid, annotated at stride 64, at
"> 1 element/cacheline", "> 1 element/page", and "1 element/page (as bad as
it gets)"; the values 55 and 110 are marked on the chart.)
20
Miscellany
  • Excellent reference: RS/6000 Scientific and Technical Computing:
    POWER3 Introduction and Tuning Guide
  • Use ESSL and PESSL if appropriate
  • MASS is much faster for intrinsic functions
  • But may differ in last bit from IEEE standard
  • I'm carter@cs.ucsd.edu
  • But I'm outta here for sabbatical!