1
Optimization for the Cray XT3 MPP Supercomputer
John M. Levesque, June 2005
2
Outline of Optimization Section
  • Single node (socket) Optimization
  • I/O optimization
  • MPI optimization
  • SHMEM
  • Scaling to large processor counts

3
Bandwidth as a function of Array Size
4
AMD Opteron Processor
  • 36 entry FPU instruction scheduler
  • 64-bit/80-bit FP realized throughput: (1 Mul + 1
    Add)/cycle, 1.9 FLOPs/cycle
  • 32-bit FP realized throughput: (2 Mul + 2
    Add)/cycle, 3.4 FLOPs/cycle

5
Simplified memory hierarchy on the AMD Opteron
registers
  • 16 SSE2 128-bit registers
  • 16 64-bit registers
registers <-> L1 data cache
  • 2 x 8 Bytes per clock, i.e. either 2 loads, 1
    load + 1 store, or 2 stores (38 GB/s at 2.4 GHz)
L1 data cache
  • 64 Byte cache line
  • complete data cache lines are loaded from main
    memory, if not in L2 cache
  • if the L1 data cache needs to be refilled, the
    displaced lines are stored back to L2 cache
L1 data cache <-> L2 cache
  • 8 Bytes per clock
L2 cache
  • 64 Byte cache line
  • write-back cache: data offloaded from the L1 data
    cache are stored here first, until they are
    flushed out to main memory
L2 cache <-> main memory
  • 16 Bytes wide data bus => 6.4 GB/s for DDR400
Main memory
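The bandwidth figures on this slide follow from simple arithmetic: bytes moved per transfer times transfers per second. The sketch below reproduces them. The helper names are illustrative, the 2.4 GHz clock is the one quoted on the slide, and DDR400 is taken to mean 400 million transfers per second, which is an assumption of this sketch.

```c
#include <stdio.h>

/* Peak bandwidth in GB/s: bytes per transfer times transfers per second. */
double peak_gb_per_s(double bytes_per_transfer, double transfers_per_s)
{
    return bytes_per_transfer * transfers_per_s / 1.0e9;
}

/* Registers <-> L1: 2 x 8 bytes per clock (from the slide) at 2.4 GHz. */
double l1_peak(void) { return peak_gb_per_s(2.0 * 8.0, 2.4e9); }

/* L1 <-> L2: 8 bytes per clock (from the slide) at the same clock. */
double l2_peak(void) { return peak_gb_per_s(8.0, 2.4e9); }

/* Main memory: 16-byte bus, DDR400 assumed to be 400e6 transfers/s. */
double mem_peak(void) { return peak_gb_per_s(16.0, 400.0e6); }

/* Print the three peak rates, one per line. */
void print_peaks(void)
{
    printf("L1  %.1f GB/s\n", l1_peak());
    printf("L2  %.1f GB/s\n", l2_peak());
    printf("mem %.1f GB/s\n", mem_peak());
}
```

The roughly sixfold gap between the L1 rate and the main memory rate is why the kernels that follow are judged by how many operations they perform per operand fetched.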
6
C     11 OPERATIONS - 2 OPERANDS   RATIO 11/2
      II = 1
      ISTRIDE = 128
      DO 41075 I = 1, N
         Y(II) = C0 + X(II) * (C1 + X(II) * (C2 + X(II)
     *         * (C3 + X(II) * (C4 + X(II) * (C5 + X(II))))))
         II = II + ISTRIDE
41075 CONTINUE
7
C     3 OPERATIONS - 5 OPERANDS   RATIO 3/5
      DO 41023 I = 1, N
         A(I) = B(I) * C(I) + D(I) * E(I)
41023 CONTINUE
8
C     17 OPERATIONS - 2 OPERANDS   RATIO 17/2
      DO LLLLL = 1, NREPS
      DO 41018 I = 1, N
         Y(IY(I)) = C0 + X(IX(I)) * (C1 + X(IX(I)) * (C2
     *      + X(IX(I)) * (C3 + X(IX(I)) * (C4 + X(IX(I)) * (C5
     *      + X(IX(I)) * (C6 + X(IX(I)) * (C7 + X(IX(I)) * (C8
     *      + X(IX(I))))))))))
41018 CONTINUE
11
C     DIMENSION A(128,N)
      DO 41080 I = 1, N
         A( 1,I) = C1*A(13,I) + C2*A(12,I) + C3*A(11,I)
     *           + C4*A(10,I) + C5*A( 9,I) + C6*A( 8,I)
     *           + C7*A( 7,I) + C0*(A( 5,I) + A( 6,I)) + A( 3,I)
41080 CONTINUE
12
C     DIMENSION B(13,N)
      DO 41081 I = 1, N
         B( 1,I) = C1*B(13,I) + C2*B(12,I) + C3*B(11,I)
     *           + C4*B(10,I) + C5*B( 9,I) + C6*B( 8,I)
     *           + C7*B( 7,I) + C0*(B( 5,I) + B( 6,I)) + B( 3,I)
41081 CONTINUE
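The two kernels above differ only in the leading dimension of the array (128 versus 13). One way to see why that matters is to count how many cache lines and memory pages each version touches. The sketch below does that under stated assumptions: 8-byte elements, the 64-byte cache line from the memory hierarchy slide, a 4096-byte page, and an array that starts on a block boundary. The function name and the page size are illustrative and do not come from the slides.

```c
/* Rows of each column that the kernel reads or writes (1-based). */
static const int ROWS[] = {1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13};
enum { NROWS = sizeof ROWS / sizeof ROWS[0] };

/*
 * Count the distinct blocks of block_bytes touched by n iterations over a
 * column-major array of 8-byte elements with leading dimension ld.
 * Addresses are visited in increasing order (requires ld >= 13), so a
 * block is new exactly when it differs from the previous one.
 */
long blocks_touched(long ld, long n, long block_bytes)
{
    long count = 0;
    long last = -1;
    for (long i = 0; i < n; i++) {
        for (int r = 0; r < NROWS; r++) {
            long byte = (i * ld + (ROWS[r] - 1)) * 8;
            long blk = byte / block_bytes;
            if (blk != last) {
                count++;
                last = blk;
            }
        }
    }
    return count;
}
```

For n = 1024, `blocks_touched(128, 1024, 64)` gives 2048 cache lines against 1664 for a leading dimension of 13, and `blocks_touched(128, 1024, 4096)` gives 256 pages against 26. The page count is the larger effect in this model, which points at TLB pressure and at the 1024-byte jump between consecutive iterations rather than at raw cache line traffic.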
14
C     THE ORIGINAL
      DO 41090 K = KA, KE, -1
      DO 41090 J = JA, JE
      DO 41090 I = IA, IE
         A(K,L,I,J) = A(K,L,I,J) - B(J,1,I,K)*A(K+1,L,I,1)
     *      - B(J,2,I,K)*A(K+1,L,I,2) - B(J,3,I,K)*A(K+1,L,I,3)
     *      - B(J,4,I,K)*A(K+1,L,I,4) - B(J,5,I,K)*A(K+1,L,I,5)
41090 CONTINUE
15
C     THE RESTRUCTURED
      DO 41091 K = KA, KE, -1
      DO 41091 J = JA, JE
      DO 41091 I = IA, IE
         AA(I,K,L,J) = AA(I,K,L,J) - BB(I,J,1,K)*AA(I,K+1,L,1)
     *      - BB(I,J,2,K)*AA(I,K+1,L,2) - BB(I,J,3,K)*AA(I,K+1,L,3)
     *      - BB(I,J,4,K)*AA(I,K+1,L,4) - BB(I,J,5,K)*AA(I,K+1,L,5)
41091 CONTINUE
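The restructured loop moves the innermost loop index I into the first array subscript. Because Fortran stores arrays column-major, this changes the memory stride of the inner loop from the product of the first two extents down to one element. The sketch below computes both strides. The function names and the extents used in the usage note are hypothetical and only illustrate the arithmetic.

```c
/*
 * Column-major (Fortran-order) offset, in elements, of the 1-based element
 * (i1,i2,i3,i4) in an array whose first three extents are d1, d2, d3.
 */
long offset4(long i1, long i2, long i3, long i4, long d1, long d2, long d3)
{
    return (i1 - 1) + d1 * ((i2 - 1) + d2 * ((i3 - 1) + d3 * (i4 - 1)));
}

/* Original layout A(K,L,I,J): distance between I and I+1, in elements. */
long stride_original(long dk, long dl, long di)
{
    return offset4(1, 1, 2, 1, dk, dl, di) - offset4(1, 1, 1, 1, dk, dl, di);
}

/* Restructured layout AA(I,K,L,J): distance between I and I+1. */
long stride_restructured(long di, long dk, long dl)
{
    return offset4(2, 1, 1, 1, di, dk, dl) - offset4(1, 1, 1, 1, di, dk, dl);
}
```

With hypothetical extents of 20 for K and 5 for L, `stride_original(20, 5, 100)` is 100 elements, so each inner iteration lands on a different cache line, while `stride_restructured(100, 20, 5)` is 1.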
17
C     GAUSS ELIMINATION
      DO 43020 I = 1, MATDIM
         A(I,I) = 1. / A(I,I)
         DO 43020 J = I+1, MATDIM
            A(J,I) = A(J,I) * A(I,I)
            DO 43020 K = I+1, MATDIM
               A(J,K) = A(J,K) - A(J,I) * A(I,K)
43020 CONTINUE
18
C     GAUSS ELIMINATION
      DO 43020 I = 1, MATDIM
         A(I,I) = 1. / A(I,I)
         DO 43020 J = I+1, MATDIM
            A(J,I) = A(J,I) * A(I,I)
cpgi$l nodepchk
            DO 43020 K = I+1, MATDIM
               A(J,K) = A(J,K) - A(J,I) * A(I,K)
43020 CONTINUE
20
C     THE ORIGINAL
      DO 43030 I = 2, N
         DO 43030 K = 1, I-1
            A(I) = A(I) + B(I,K) * A(I-K)
43030 CONTINUE
21
      DO 43031 I = 2, N
cpgi$l nodepchk
         DO 43031 K = 1, I-1
            A(I) = A(I) + B(I,K) * A(I-K)
43031 CONTINUE
23
      DO 43040 J = 2, 8
         N1 = J
         N2 = J - 1
         DO 43040 I = 2, N
            A(I,N1) = A(I-1,N2) * B(I,J) + C(I)
43040 CONTINUE
24
      DO 43041 J = 2, 8
         DO 43041 I = 2, N
            A(I,J) = A(I-1,J-1) * B(I,J) + C(I)
43041 CONTINUE
26
C     THE ORIGINAL
      DO 43060 KX = 2, 3
      DO 43060 KY = 2, N
         D(KY) = A(KX,KY+1,NL12) - A(KX,KY-1,NL12)
         E(KY) = B(KX,KY+1,NL22) - B(KX,KY-1,NL22)
         F(KY) = C(KX,KY+1,NL32) - C(KX,KY-1,NL32)
         A(KX,KY,NL11) = A(KX,KY,NL11)
     *      + C1*D(KY) + C2*E(KY) + C3*F(KY)
     *      + C0*(A(KX+1,KY,NL1) - 2.*A(KX,KY,NL1)
     *      + A(KX-1,KY,NL1))
         B(KX,KY,NL21) = B(KX,KY,NL21)
     *      + C4*D(KY) + C5*E(KY) + C6*F(KY)
     *      + C0*(B(KX+1,KY,NL1) - 2.*B(KX,KY,NL1)
     *      + B(KX-1,KY,NL1))
         C(KX,KY,NL31) = C(KX,KY,NL31)
     *      + C7*D(KY) + C8*E(KY) + C9*F(KY)
     *      + C0*(C(KX+1,KY,NL1) - 2.*C(KX,KY,NL1)
     *      + C(KX-1,KY,NL1))
43060 CONTINUE
27
      DO 43061 KX = 2, 3
cpgi$l nodepchk
      DO 43061 KY = 2, N
         D(KY) = A(KX,KY+1,NL12) - A(KX,KY-1,NL12)
         E(KY) = B(KX,KY+1,NL22) - B(KX,KY-1,NL22)
         F(KY) = C(KX,KY+1,NL32) - C(KX,KY-1,NL32)
         A(KX,KY,NL11) = A(KX,KY,NL11)
     *      + C1*D(KY) + C2*E(KY) + C3*F(KY)
     *      + C0*(A(KX+1,KY,NL1) - 2.*A(KX,KY,NL1)
     *      + A(KX-1,KY,NL1))
         B(KX,KY,NL21) = B(KX,KY,NL21)
     *      + C4*D(KY) + C5*E(KY) + C6*F(KY)
     *      + C0*(B(KX+1,KY,NL1) - 2.*B(KX,KY,NL1)
     *      + B(KX-1,KY,NL1))
         C(KX,KY,NL31) = C(KX,KY,NL31)
     *      + C7*D(KY) + C8*E(KY) + C9*F(KY)
     *      + C0*(C(KX+1,KY,NL1) - 2.*C(KX,KY,NL1)
     *      + C(KX-1,KY,NL1))
43061 CONTINUE
29
C     THE ORIGINAL
      DO 43070 I = 1, N
         A(IA(I)) = A(IA(I)) + C0 * B(I)
43070 CONTINUE
30
cpgi$l nodepchk
      DO 43071 I = 1, N
         A(IA(I)) = A(IA(I)) + C0 * B(I)
43071 CONTINUE
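The nodepchk directive tells the compiler to skip dependence checking for the loop. For an indirectly addressed update such as the one above, that is only safe when the index vector IA has no repeated entries, since two iterations that hit the same element depend on each other. The check below is a minimal sketch of how a program might verify that before relying on the directive. The function and its calling convention are hypothetical and are not part of any compiler or library.

```c
#include <stdlib.h>

/*
 * Returns 1 if the n entries of ia (values expected in 1..max_index) are
 * all distinct, 0 if any value repeats or is out of range, and -1 if the
 * scratch array cannot be allocated.
 */
int indices_are_unique(const int *ia, int n, int max_index)
{
    unsigned char *seen = calloc((size_t)max_index + 1, 1);
    int unique = 1;

    if (seen == NULL)
        return -1;
    for (int i = 0; i < n; i++) {
        int idx = ia[i];
        if (idx < 1 || idx > max_index || seen[idx]) {
            unique = 0;   /* repeated or invalid index: iterations conflict */
            break;
        }
        seen[idx] = 1;
    }
    free(seen);
    return unique;
}

/* A permutation of 1..4: every iteration updates a different element. */
int demo_permutation(void)
{
    int ia[] = {3, 1, 4, 2};
    return indices_are_unique(ia, 4, 4);
}

/* Index 3 appears twice: two iterations update the same element. */
int demo_repeated(void)
{
    int ia[] = {3, 1, 3, 2};
    return indices_are_unique(ia, 4, 4);
}
```

If the check fails, the loop should be left without the directive so the compiler keeps the iterations in order.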
32
C     THE ORIGINAL
      DO 43100 J = 2, N
         AH = B(J) - B(J-1)
         DO 43100 I = 2, N
            A(I,J) = AH * A(I-1,J) + C(I,J)
43100 CONTINUE
33
      DO 43101 J = 2, N
         VAH(J) = B(J) - B(J-1)
43101 CONTINUE
      DO 43102 I = 2, N
         DO 43102 J = 2, N
            A(I,J) = VAH(J) * A(I-1,J) + C(I,J)
43102 CONTINUE
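The restructuring here has two parts: the scalar AH is promoted to an array VAH so it no longer ties the two loops together, and the loops are interchanged so that the recurrence on I moves to the outer loop and the inner loop has independent iterations. The C sketch below checks that both orders produce the same values. The size N, the test data, and the function names are illustrative assumptions. C arrays are row-major, so only the dependence argument carries over from the Fortran, not the stride behaviour.

```c
#define N 8   /* illustrative problem size; indices 1..N are used */

/* Original form: the recurrence on i is the inner loop. */
void sweep_original(double a[N + 1][N + 1], double b[N + 1],
                    double c[N + 1][N + 1])
{
    for (int j = 2; j <= N; j++) {
        double ah = b[j] - b[j - 1];
        for (int i = 2; i <= N; i++)
            a[i][j] = ah * a[i - 1][j] + c[i][j];
    }
}

/* Restructured form: AH promoted to vah[], loops interchanged. */
void sweep_restructured(double a[N + 1][N + 1], double b[N + 1],
                        double c[N + 1][N + 1])
{
    double vah[N + 1];
    for (int j = 2; j <= N; j++)
        vah[j] = b[j] - b[j - 1];
    for (int i = 2; i <= N; i++)
        for (int j = 2; j <= N; j++)      /* no dependence between j values */
            a[i][j] = vah[j] * a[i - 1][j] + c[i][j];
}

/* Runs both forms on the same small integer data; returns 1 if they match. */
int sweeps_agree(void)
{
    static double a1[N + 1][N + 1], a2[N + 1][N + 1], c[N + 1][N + 1];
    double b[N + 1];

    for (int j = 0; j <= N; j++)
        b[j] = (double)((j * j) % 5);
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++) {
            a1[i][j] = a2[i][j] = (double)((i + j) % 3);
            c[i][j] = (double)((i * j) % 4);
        }
    sweep_original(a1, b, c);
    sweep_restructured(a2, b, c);
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            if (a1[i][j] != a2[i][j])
                return 0;
    return 1;
}
```

The test data are small integers so every intermediate value is exactly representable and the comparison can be exact.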
35
C     THE ORIGINAL
      DO 43111 J = 2, N
         AH = B(J) - B(J-1)
         DO 43110 I = 2, N
            A(I,J) = AH * A(I-1,J) + C(I,J)
43110    CONTINUE
         BH = D(J) - D(J-1)
         DO 43112 I = N, 2, -1
            A(I,J) = BH * A(I+1,J) + C(I,J)
43112    CONTINUE
43111 CONTINUE
36
C     THE RESTRUCTURED
      DO 43113 J = 2, N
         VAH(J) = B(J) - B(J-1)
43113 CONTINUE
      DO 43114 I = 2, N
         DO 43114 J = 2, N
            A(I,J) = VAH(J) * A(I-1,J) + C(I,J)
43114 CONTINUE
      DO 43115 J = 2, N
         VBH(J) = D(J) - D(J-1)
43115 CONTINUE
      DO 43116 I = N, 2, -1
         DO 43116 J = 2, N
            A(I,J) = VBH(J) * A(I+1,J) + C(I,J)
43116 CONTINUE
38
      DO 43140 J = 2, N
         DO 43140 I = 2, N
            A(I,J,1) = A(I,J,1) - B(I,J) * A(I-1,J,1)
     *                          - C(I,J) * A(I,J-1,1)
            A(I,J,2) = A(I,J,2) - B(I,J) * A(I-1,J,2)
     *                          - C(I,J) * A(I,J-1,2)
            A(I,J,3) = A(I,J,3) - B(I,J) * A(I-1,J,3)
     *                          - C(I,J) * A(I,J-1,3)
43140 CONTINUE
39
C     THE RESTRUCTURED
      NDIAGS = 2 * N - 3
      ISTART = 1
      JSTART = 2
      LDIAG  = 0
      DO 43141 IDIAGS = 1, NDIAGS
         IF (IDIAGS .LE. N-1) THEN
            ISTART = ISTART + 1
            LDIAG  = LDIAG + 1
         ELSE
            JSTART = JSTART + 1
            LDIAG  = LDIAG - 1
         ENDIF
         I = ISTART + 1
         J = JSTART - 1
!pgi$l nodepchk
         DO 43142 IPOINT = 1, LDIAG
            I = I - 1
            J = J + 1
            A(I,J,1) = A(I,J,1) - B(I,J) * A(I-1,J,1)
     *                          - C(I,J) * A(I,J-1,1)
            A(I,J,2) = A(I,J,2) - B(I,J) * A(I-1,J,2)
     *                          - C(I,J) * A(I,J-1,2)
            A(I,J,3) = A(I,J,3) - B(I,J) * A(I-1,J,3)
     *                          - C(I,J) * A(I,J-1,3)
43142    CONTINUE
43141 CONTINUE
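The restructured loop walks the grid along anti-diagonals. Each point depends only on its neighbours at (I-1,J) and (I,J-1), which both lie on the previous diagonal, so all points of one diagonal can be updated independently. The C sketch below transcribes the diagonal bookkeeping and checks it against the straightforward loop order. It uses a single plane instead of the three on the slide, and the size N, the test data, and the function names are illustrative assumptions.

```c
#define N 8   /* illustrative grid size; the interior is 2..N both ways */

/* One point update; reads the neighbours at (i-1,j) and (i,j-1). */
static void update(double a[N + 1][N + 1], double b[N + 1][N + 1],
                   double c[N + 1][N + 1], int i, int j)
{
    a[i][j] = a[i][j] - b[i][j] * a[i - 1][j] - c[i][j] * a[i][j - 1];
}

/* Original order: j outer, i inner. */
void sweep_rowwise(double a[N + 1][N + 1], double b[N + 1][N + 1],
                   double c[N + 1][N + 1])
{
    for (int j = 2; j <= N; j++)
        for (int i = 2; i <= N; i++)
            update(a, b, c, i, j);
}

/* Restructured order: one anti-diagonal at a time. */
void sweep_diagonal(double a[N + 1][N + 1], double b[N + 1][N + 1],
                    double c[N + 1][N + 1])
{
    int ndiags = 2 * N - 3;
    int istart = 1, jstart = 2, ldiag = 0;

    for (int idiag = 1; idiag <= ndiags; idiag++) {
        if (idiag <= N - 1) {   /* diagonals grow until the main one */
            istart++;
            ldiag++;
        } else {                /* then shrink again */
            jstart++;
            ldiag--;
        }
        int i = istart + 1;
        int j = jstart - 1;
        for (int ipoint = 1; ipoint <= ldiag; ipoint++) {
            i--;
            j++;
            update(a, b, c, i, j);   /* points on one diagonal are independent */
        }
    }
}

/* Runs both orders on the same small integer data; returns 1 if they match. */
int wavefront_agrees(void)
{
    static double a1[N + 1][N + 1], a2[N + 1][N + 1];
    static double b[N + 1][N + 1], c[N + 1][N + 1];

    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++) {
            a1[i][j] = a2[i][j] = (double)((i + 2 * j) % 3);
            b[i][j] = (double)((i + j) % 2);
            c[i][j] = (double)((i * j) % 2);
        }
    sweep_rowwise(a1, b, c);
    sweep_diagonal(a2, b, c);
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            if (a1[i][j] != a2[i][j])
                return 0;
    return 1;
}
```

An (N-1) by (N-1) interior has 2N-3 anti-diagonals, which matches the NDIAGS value on the slide.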
42
C     THE ORIGINAL
      BR = 0.0
      DO 44020 I = 1, N
         BL = BR
         BR = (I-1) * DELB
         A(I) = (BR - BL) * C(I) + (BR**2 - BL**2) * C(I)**2
44020 CONTINUE
43
      BSQ(1) = 0.0
      A(1) = 0.0
      B = 0.0
      DO 44022 I = 2, N
         B = B + DELB
         BSQ(I) = B ** 2
         A(I) = C(I) * (DELB + C(I) * (BSQ(I) - BSQ(I-1)))
44022 CONTINUE
C
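The restructured loop replaces the multiply (I-1)*DELB with a running sum, keeps the squares in BSQ so each is computed once, and factors C(I) out of the expression. For I >= 2 the difference BR - BL is always DELB, which is why the two forms agree. The C sketch below compares them. The size N, the test data, and the function names are illustrative assumptions. The data are chosen to be exactly representable, so any difference would come from the algebra rather than from rounding.

```c
#define N 64   /* illustrative length; indices 1..N are used */

/* Original form: recomputes BR by multiplication and squares it twice. */
void fill_original(double a[N + 1], const double c[N + 1], double delb)
{
    double br = 0.0;
    for (int i = 1; i <= N; i++) {
        double bl = br;
        br = (i - 1) * delb;
        a[i] = (br - bl) * c[i] + (br * br - bl * bl) * c[i] * c[i];
    }
}

/* Restructured form: running sum for B, stored squares, C(I) factored out. */
void fill_restructured(double a[N + 1], const double c[N + 1], double delb)
{
    double bsq[N + 1];
    double b = 0.0;

    bsq[1] = 0.0;
    a[1] = 0.0;
    for (int i = 2; i <= N; i++) {
        b = b + delb;
        bsq[i] = b * b;
        a[i] = c[i] * (delb + c[i] * (bsq[i] - bsq[i - 1]));
    }
}

/* Largest absolute difference between the two forms on sample data. */
double max_difference(void)
{
    double c[N + 1], a1[N + 1], a2[N + 1];
    double worst = 0.0;

    for (int i = 0; i <= N; i++)
        c[i] = 1.0 + 0.5 * (double)(i % 3);
    fill_original(a1, c, 0.25);
    fill_restructured(a2, c, 0.25);
    for (int i = 1; i <= N; i++) {
        double d = a1[i] - a2[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

With a step such as 0.1 that is not exactly representable, the running sum and the multiplication would round differently, so a small tolerance rather than exact equality is the appropriate comparison in general.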
45
      PF = 0.0
      DO 44030 I = 2, N
         AV = B(I) * RV
         PB = PF
         PF = C(I)
         IF ((D(I) + D(I+1)) .LT. 0.) PF = -C(I+1)
         AA = E(I) - E(I-1) + F(I) - F(I-1)
     1      + G(I) + G(I-1) - H(I) - H(I-1)
         BB = R(I) + S(I-1) + T(I) + T(I-1)
     1      - U(I) - U(I-1) + V(I) + V(I-1)
     2      - W(I) + W(I-1) - X(I) + X(I-1)
         A(I) = AV * (AA + BB + PF - PB + Y(I) - Z(I)) + A(I)
44030 CONTINUE
46
      VPF(1) = 0.0
      DO 44031 I = 2, N
         AV = B(I) * RV
         VPF(I) = C(I)
         IF ((D(I) + D(I+1)) .LT. 0.) VPF(I) = -C(I+1)
         AA = E(I) - E(I-1) + F(I) - F(I-1)
     1      + G(I) + G(I-1) - H(I) - H(I-1)
         BB = R(I) + S(I-1) + T(I) + T(I-1)
     1      - U(I) - U(I-1) + V(I) + V(I-1)
     2      - W(I) + W(I-1) - X(I) + X(I-1)
         A(I) = AV * (AA + BB + VPF(I) - VPF(I-1) + Y(I) - Z(I))
     1      + A(I)
44031 CONTINUE
48
      DO 44050 I = 1, N
         DO 44050 J = 1, N
            A(I,J) = 0.0
            DO 44050 K = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
44050 CONTINUE
49
      DO 44051 J = 1, N
         DO 44051 I = 1, N
            A(I,J) = 0.0
44051 CONTINUE
      DO 44052 K = 1, N
         DO 44052 J = 1, N
            DO 44052 I = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
44052 CONTINUE
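The restructured matrix multiply splits off the initialization and reorders the loops so that I, the first subscript, is innermost. In column-major storage that makes the accesses to A and B unit-stride and leaves C(K,J) constant in the inner loop. The C sketch below mirrors both loop orders using a column-major indexing macro so the stride argument carries over. The macro, the size N, the test data, and the function names are illustrative assumptions.

```c
#define N 8   /* illustrative matrix order */

/* Fortran-style element (i,j), 1-based, column-major, in a flat array. */
#define AT(m, i, j) ((m)[((j) - 1) * N + ((i) - 1)])

/* Original order: i, j, k. The inner k loop strides through B by N. */
void matmul_ijk(double *a, const double *b, const double *c)
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) {
            AT(a, i, j) = 0.0;
            for (int k = 1; k <= N; k++)
                AT(a, i, j) = AT(a, i, j) + AT(b, i, k) * AT(c, k, j);
        }
}

/* Restructured order: k, j, i. The inner i loop is unit-stride in A and B. */
void matmul_kji(double *a, const double *b, const double *c)
{
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= N; i++)
            AT(a, i, j) = 0.0;
    for (int k = 1; k <= N; k++)
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                AT(a, i, j) = AT(a, i, j) + AT(b, i, k) * AT(c, k, j);
}

/* Runs both orders on small integer data; returns 1 if the results match. */
int matmul_orders_agree(void)
{
    double a1[N * N], a2[N * N], b[N * N], c[N * N];

    for (int n = 0; n < N * N; n++) {
        b[n] = (double)(n % 7);
        c[n] = (double)((3 * n) % 5);
    }
    matmul_ijk(a1, b, c);
    matmul_kji(a2, b, c);
    for (int n = 0; n < N * N; n++)
        if (a1[n] != a2[n])
            return 0;
    return 1;
}
```

Both orders add the K terms for a given element in the same sequence, so the interchange does not change the result.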
51
      DO 44060 I = 1, N
         A(I) = 0.0
         DO 44060 J = 1, I
            A(I) = A(I) + B(I,J) * C(J,I)
44060 CONTINUE
52
      DO 44061 I = 1, N
         A(I) = 0.0
44061 CONTINUE
      DO 44062 J = 1, N
         DO 44062 I = J, N
            A(I) = A(I) + B(I,J) * C(J,I)
44062 CONTINUE
54
C     THE ORIGINAL
      DO 46011 J = 1, 4
         DO 46010 I = 1, N
            C(J,I) = 0.0
46010    CONTINUE
         DO 46011 K = 1, 4
            DO 46011 I = 1, N
               C(J,I) = C(J,I) + A(J,K) * B(K,I)
46011 CONTINUE
55
C     THE RESTRUCTURED
      DO 46012 I = 1, N
         C(1,I) = A(1,1) * B(1,I) + A(1,2) * B(2,I)
     *          + A(1,3) * B(3,I) + A(1,4) * B(4,I)
         C(2,I) = A(2,1) * B(1,I) + A(2,2) * B(2,I)
     *          + A(2,3) * B(3,I) + A(2,4) * B(4,I)
         C(3,I) = A(3,1) * B(1,I) + A(3,2) * B(2,I)
     *          + A(3,3) * B(3,I) + A(3,4) * B(4,I)
         C(4,I) = A(4,1) * B(1,I) + A(4,2) * B(2,I)
     *          + A(4,3) * B(3,I) + A(4,4) * B(4,I)
46012 CONTINUE
57
C     THE ORIGINAL
      DO 46020 I = 1, N
         DO 46020 J = 1, 4
            A(I,J) = 0.
            DO 46020 K = 1, 4
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
46020 CONTINUE
58
      DO 46021 I = 1, N
         A(I,1) = B(I,1) * C(1,1) + B(I,2) * C(2,1)
     *          + B(I,3) * C(3,1) + B(I,4) * C(4,1)
         A(I,2) = B(I,1) * C(1,2) + B(I,2) * C(2,2)
     *          + B(I,3) * C(3,2) + B(I,4) * C(4,2)
         A(I,3) = B(I,1) * C(1,3) + B(I,2) * C(2,3)
     *          + B(I,3) * C(3,3) + B(I,4) * C(4,3)
         A(I,4) = B(I,1) * C(1,4) + B(I,2) * C(2,4)
     *          + B(I,3) * C(3,4) + B(I,4) * C(4,4)
46021 CONTINUE
60
      DO 46030 J = 1, N
         DO 46030 I = 1, N
            A(I,J) = 0.
46030 CONTINUE
      DO 46031 K = 1, N
         DO 46031 J = 1, N
            DO 46031 I = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
46031 CONTINUE
61
C     THE RESTRUCTURED
      DO 46032 J = 1, N
         DO 46032 I = 1, N
            A(I,J) = 0.
46032 CONTINUE
C
      DO 46033 K = 1, N-5, 6
         DO 46033 J = 1, N
            DO 46033 I = 1, N
               A(I,J) = A(I,J) + B(I,K  ) * C(K  ,J)
     *                         + B(I,K+1) * C(K+1,J)
     *                         + B(I,K+2) * C(K+2,J)
     *                         + B(I,K+3) * C(K+3,J)
     *                         + B(I,K+4) * C(K+4,J)
     *                         + B(I,K+5) * C(K+5,J)
46033 CONTINUE
C
      DO 46034 KK = K, N
         DO 46034 J = 1, N
            DO 46034 I = 1, N
               A(I,J) = A(I,J) + B(I,KK) * C(KK,J)
46034 CONTINUE
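The restructured version unrolls the outer K loop by six, so each pass over A(I,J) performs six multiply-adds per load and store of A, and a cleanup loop handles the leftover K values when N is not a multiple of six. The C sketch below shows the same pattern, including the reuse of the loop counter as the starting point of the cleanup loop. The column-major macro, the size N (deliberately not a multiple of six), the test data, and the function names are illustrative assumptions.

```c
#define N 8   /* illustrative order; 8 = 6 + 2 exercises the cleanup loop */

/* Fortran-style element (i,j), 1-based, column-major, in a flat array. */
#define AT(m, i, j) ((m)[((j) - 1) * N + ((i) - 1)])

static void zero(double *a)
{
    for (int n = 0; n < N * N; n++)
        a[n] = 0.0;
}

/* Reference: plain k, j, i loop nest. */
void matmul_reference(double *a, const double *b, const double *c)
{
    zero(a);
    for (int k = 1; k <= N; k++)
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                AT(a, i, j) = AT(a, i, j) + AT(b, i, k) * AT(c, k, j);
}

/* Outer k loop unrolled by 6, followed by a cleanup loop. */
void matmul_unrolled(double *a, const double *b, const double *c)
{
    int k;

    zero(a);
    for (k = 1; k <= N - 5; k += 6)
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                AT(a, i, j) = AT(a, i, j)
                    + AT(b, i, k)     * AT(c, k,     j)
                    + AT(b, i, k + 1) * AT(c, k + 1, j)
                    + AT(b, i, k + 2) * AT(c, k + 2, j)
                    + AT(b, i, k + 3) * AT(c, k + 3, j)
                    + AT(b, i, k + 4) * AT(c, k + 4, j)
                    + AT(b, i, k + 5) * AT(c, k + 5, j);

    /* k now holds the first index the unrolled loop did not cover. */
    for (int kk = k; kk <= N; kk++)
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                AT(a, i, j) = AT(a, i, j) + AT(b, i, kk) * AT(c, kk, j);
}

/* Runs both versions on small integer data; returns 1 if they match. */
int unrolled_matches_reference(void)
{
    double a1[N * N], a2[N * N], b[N * N], c[N * N];

    for (int n = 0; n < N * N; n++) {
        b[n] = (double)(n % 7);
        c[n] = (double)((3 * n) % 5);
    }
    matmul_reference(a1, b, c);
    matmul_unrolled(a2, b, c);
    for (int n = 0; n < N * N; n++)
        if (a1[n] != a2[n])
            return 0;
    return 1;
}
```

The unroll factor of six comes from the slide; the best factor on a given machine depends on register count and is worth measuring.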
63
I/O Optimization
  • Lustre
  • Stripe Setting
  • I/O Buffering

64
Determining Stripe Size
  • lfs find -v /lustre/scratch/jlbeck
  • Notice this directory has no stripe information.
    This means the width is ONE. If I were to create
    a file in this directory it would inherit the
    properties of the directory.

OBDS
0 ost1_UUID ACTIVE
...
15 ost16_UUID ACTIVE
/lustre/scratch/jlbeck/ has no stripe info
65
Stripe setting
  • What should my stripe width be?
  • Lots of processors writing to individual files:
    sw = 1
  • 1 I/O process writing for all: sw small, > 1
  • Lots of processes all writing to 1 file: sw
    large
  • Set your stripe width
  • lfs setstripe <filename> <stripe-size>
    <start-ost> <stripe-count>
  • lfs setstripe file 0 1 5
  • lfs setstripe file 0 0 -1

66
IOBUF
  • iobuf is an I/O buffering library. iobuf
    intercepts the I/O calls (open, read, etc.) from
    a program and provides an additional layer of
    buffering. In the case of XT3, iobuf replaces the
    stdio (glibc/libio) layer of buffering. By
    asynchronously prefetching and caching file data,
    I/O wait time for programs which read or write
    large files sequentially can be reduced.
  • iobuf can gather run time statistics and print a
    summary report of I/O activity for each file.
  • If a memory allocation error occurs, buffering is
    reduced or disabled for that file and a
    diagnostic is printed to stderr. When the file is
    opened, a single buffer is allocated if buffering
    is enabled. The allocation of additional buffers
    is done when a buffer is needed. When a file is
    closed, its buffers are freed unless asynchronous
    I/O is still pending on the buffer

67
IOBUF
  • File selection and parameters for buffering are
    set through the IOBUF_PARAMS environment
    variable. The default if IOBUF_PARAMS is not set
    is no buffering (the I/O call is passed on to the
    next layer without intervention). The general
    format is a comma-separated list of
    specifications.
  • IOBUF_PARAMS=spec1,spec2,spec3,...

68
IOBUF
69
Example
  • export IOBUF_SIZE=32768
  • export IOBUF_COUNT=5
  • export IOBUF_PARAMS='//eagerflushlazyclose,stdouteagerflushlazyclose'

70
1. Introduction to SHMEM
  • Programming Model
  • Memory is private to each process
  • Remotely accessible, not shared
  • SHMEM is one-sided message passing model
  • Put and get operations
  • SHMEM is SPMD programming model
  • SHMEM application can be part of MPMD MPI job

71
Introduction (cont.)
  • Symmetric Data Objects
  • Primary concept in SHMEM
  • Virtual addresses of a symmetric data object on
    different processes have a definite, known
    relationship
  • Access remote symmetric data objects by using
    the address of the corresponding local data object
  • C: global, static or shmalloc'd data
  • Fortran: common block, SAVE attribute or
    shpalloc'd data

72
Introduction (cont.)
long A[10];
void foobar(void) {
    long S[10];
    if (my_pe == 0) shmem_put64(A, S, 10, 1);
}
Memory layout, identical on PE 0 and PE 1 (top to
bottom): stack (holds S), heap, symmetric heap, data
(holds A), text
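The point of the picture is that A sits at the same place in the data segment of every PE, so PE 0 can name the remote copy by passing its own address of A, while S lives on the private stack and can only be a source. The sketch below imitates that addressing rule inside a single process, with no SHMEM library involved. `sim_put64`, `pe_data`, and the two-PE setup are hypothetical stand-ins written for this illustration; they are not the Cray SHMEM API, and unlike the real call the caller's PE number is passed explicitly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NPES = 2 };

/* Every simulated PE owns one copy of the same data segment layout. */
typedef struct {
    int64_t A[10];
} data_segment;

static data_segment pe_data[NPES];

/*
 * Simulated one-sided put of 64-bit elements. The target is an address
 * inside the caller's own segment; the same offset is applied inside the
 * destination PE's segment, which is what makes the object "symmetric".
 */
void sim_put64(int my_pe, void *target, const void *source,
               size_t nelems, int dest_pe)
{
    size_t offset = (size_t)((char *)target - (char *)&pe_data[my_pe]);
    memcpy((char *)&pe_data[dest_pe] + offset, source,
           nelems * sizeof(int64_t));
}

/*
 * PE 0 copies its private array S into A on PE 1. Returns 1 if PE 1's A
 * now holds S and PE 0's own A is untouched.
 */
int demo_put(void)
{
    int64_t S[10];

    for (int i = 0; i < 10; i++)
        S[i] = 100 + i;
    memset(pe_data, 0, sizeof pe_data);
    sim_put64(0, pe_data[0].A, S, 10, 1);
    for (int i = 0; i < 10; i++)
        if (pe_data[1].A[i] != S[i] || pe_data[0].A[i] != 0)
            return 0;
    return 1;
}
```

On a real system the offset calculation is done by the library, and the copy travels over the network without any action by the destination PE.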
73
Introduction (cont.)
  • Goal
  • Deliver best possible communication
    performance by minimizing overhead associated
    with data transfer

74
2. Cray SHMEM Implementation on XT3
  • XT3 uses Portals Networking Protocol
  • One-sided RMA protocol
  • Guarantees reliable, ordered message delivery
    between pairs of processes
  • Connection-less
  • Designed specifically for scalability
  • Cray SHMEM layered on top of Portals 3.3

75
Cray SHMEM on XT3 (cont.)
  • Portals resources
  • Memory Descriptor (MD) identifies a memory region
    to be used in operation
  • Event Queue (EQ) used to record information about
    operation
  • SHMEM start-up
  • Set up Portals resources
  • MDs to describe four memory regions
  • EQ to monitor transfer completions

76
Cray SHMEM on XT3 (cont.)
  • SHMEM data transfer
  • Source and target addresses determine which MDs
    and EQs to supply to Portals call
  • Execute Portals put or get command
  • Monitor EQ for completion event if necessary
  • Persistent Portals resources => low overhead on
    transmit path

77
3. Cray SHMEM 1.0 Release
  • Functionality Supported
  • Initialization and Clean up
  • shmem_init or start_pes
  • shmem_finalize
  • Queries
  • shmem_my_pe, shmem_n_pes
  • Puts and Gets
  • shmem_xxx_{put,get} (generic, different types)
  • shmem_{put,get}xxx (different bit counts)
  • shmem_{put,get}mem

78
Cray SHMEM 1.0 Release (cont.)
  • Functionality Supported (cont.)
  • Synchronization
  • shmem_fence
  • shmem_quiet
  • shmem_barrier_all
  • shmem_barrier
  • Wait
  • shmem_xxx_wait (generic different integer
    types)
  • shmem_xxx_wait_until (generic different
    integer types)

79
Cray SHMEM 1.0 Release (cont.)
  • Functionality Supported (cont.)
  • Broadcast
  • shmem_broadcastxxx (generic different bit
    counts)
  • Reductions
  • shmem_xxx_yyy_to_all for operations sum, prod,
    max, min, and, or, xor (different types)
  • Currently supported on all PEs only

80
Cray SHMEM 1.0 Release (cont.)
  • Functionality Supported (cont.)
  • Events
  • shmem_{clear,set,test,wait}_event
  • Strided Puts and Gets
  • shmem_xxx_i{put,get} (generic, different types)
  • shmem_i{put,get}xxx (different bit counts)

81
Cray SHMEM 1.0 Release (cont.)
  • Functionality Supported (cont.)
  • Symmetric Heap management
  • shmalloc
  • shfree
  • shrealloc
  • Fortran Interface
  • Functions corresponding to C interface
  • include mpp/shmem.fh

82
Cray SHMEM 1.0 Release (cont.)
  • Preliminary Performance Data
  • Simple SHMEM get/put operations map well
    onto XT3 architecture
  • Advanced SHMEM operations do not map well
    onto XT3 architecture
  • Portals not tuned yet, e.g. no OS-bypass,
    most Portals calls require a system call