Title: Eliminating affinity tests and simplifying shared accesses in UPC
1. Eliminating affinity tests and simplifying shared accesses in UPC
- Rahul Garg, Kit Barton, Calin Cascaval
- Gheorghe Almasi, Jose Nelson Amaral
- University of Alberta
- IBM Research
3. Shared arrays
- Arrays can be shared between all threads
- E.g. shared [2] double A[9];
- Assuming THREADS == 3
- 1-D block-cyclic distribution, similar to HPF cyclic(k)
[Figure: block-cyclic layout of A[0..8] across 3 threads; with blocking factor 2, thread 0 owns elements {0, 1, 6, 7}, thread 1 owns {2, 3, 8}, thread 2 owns {4, 5}]
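The owning thread of each element follows directly from the blocking factor. A minimal plain-C sketch (not from the slides; owner_of is a hypothetical helper) that reproduces the layout above:

    #include <stdio.h>

    /* Owner of element i in a shared [BF] array: blocks of BF
       consecutive elements are dealt out to threads round-robin. */
    static int owner_of(int i, int BF, int THREADS) {
        return (i / BF) % THREADS;
    }

    int main(void) {
        /* shared [2] double A[9] with THREADS == 3 */
        for (int i = 0; i < 9; i++)
            printf("A[%d] -> thread %d\n", i, owner_of(i, 2, 3));
        return 0;
    }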
4. Vector addition example
    #include <upc.h>
    #include <stdio.h>

    shared [2] double A[10];
    shared [3] double B[10], C[10];

    int main() {
        int i;
        upc_forall(i = 0; i < 10; i++; &C[i])
            C[i] = A[i] + B[i];
    }
5. Outline of talk
- upc_forall loops: syntax and uses
- Compiling upc_forall loops
- Data distributions in UPC
- Multiblocking distributions
- Privatization of accesses
- Results
6. upc_forall and affinity tests
- upc_forall is a work distribution construct
- Form:

    shared [BF] double A[M];
    upc_forall(i = 0; i < N; i++; &A[i]) {
        // loop body
    }

- The affinity test expression (here &A[i], the fourth clause) determines which thread executes which iteration.
7. Affinity test elimination: naive
    shared [BF] double A[M];
    upc_forall(i = 0; i < M; i++; &A[i]) {
        // loop body
    }

becomes

    shared [BF] double A[M];
    for (i = 0; i < M; i++)
        if (upc_threadof(&A[i]) == MYTHREAD) {
            // loop body
        }
8. Affinity test elimination: optimized
    shared [BF] double A[M];
    upc_forall(i = 0; i < M; i++; &A[i]) {
        // loop body
    }

becomes

    shared [BF] double A[M];
    for (i = MYTHREAD * BF; i < M; i += BF * THREADS)
        for (j = i; j < i + BF; j++) {
            // loop body
        }
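As a sanity check (mine, not from the talk), a plain-C simulation with hypothetical parameters confirming that the strip-mined loop visits exactly the iterations the naive affinity test selects; it adds a j < M guard for the ragged last block, which the slide elides:

    #include <stdio.h>

    #define BF      2
    #define THREADS 3
    #define M       17   /* deliberately not a multiple of BF * THREADS */

    int main(void) {
        for (int t = 0; t < THREADS; t++) {   /* t plays the role of MYTHREAD */
            printf("thread %d:", t);
            /* Optimized form: jump from owned block to owned block,
               then walk each block element by element. */
            for (int i = t * BF; i < M; i += BF * THREADS)
                for (int j = i; j < i + BF && j < M; j++) {
                    /* The naive test would be (j / BF) % THREADS == t. */
                    if ((j / BF) % THREADS == t)
                        printf(" %d", j);
                    else
                        printf(" MISMATCH at %d", j);
                }
            printf("\n");
        }
        return 0;
    }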
9. Integer affinity tests
    upc_forall(i = 0; i < M; i++; i) {
        // loop body
    }

becomes

    for (i = MYTHREAD; i < M; i += THREADS) {
        // loop body
    }
10. Data distributions for shared arrays
- The official UPC spec only supports 1-D block-cyclic distributions
- The IBM xlupc compiler supports a more general distribution, 'multidimensional blocking', e.g. shared [2][3] double A[5][5];
- Divide the array into multidimensional tiles
- Distribute the tiles among threads in cyclic fashion
- More general than the UPC spec, but not as general as ScaLAPACK or HPF
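A plain-C sketch of one natural reading of this distribution, assuming tiles are numbered row-major and dealt out cyclically (the exact tile-to-thread map used by xlupc may differ):

    #include <stdio.h>

    #define B1 2        /* tile rows:    shared [2][3]   */
    #define B2 3        /* tile columns                  */
    #define N1 5        /* array rows:   double A[5][5]  */
    #define N2 5        /* array columns                 */
    #define THREADS 4

    /* Tiles are numbered row-major and assigned to threads cyclically. */
    static int owner_of(int i, int j) {
        int tiles_per_row = (N2 + B2 - 1) / B2;   /* ceil(N2 / B2) */
        int tile = (i / B1) * tiles_per_row + (j / B2);
        return tile % THREADS;
    }

    int main(void) {
        /* Print the owning thread of every element of A. */
        for (int i = 0; i < N1; i++) {
            for (int j = 0; j < N2; j++)
                printf("%d ", owner_of(i, j));
            printf("\n");
        }
        return 0;
    }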
12. Locality analysis and privatization
- Consider:

    shared [2][3] double A[5][6], B[5][6];
    for (i = 0; i < 4; i++)
        upc_forall(j = 0; j < 4; j++; &A[i][j])
            A[i][j] = B[i+1][j];

- What code should we generate for the references A[i][j] and B[i+1][j]?
13. Shared access code generation
    for (i = 0; i < 4; i++)
        upc_forall(j = 0; j < 4; j++; &A[i][j])
            A[i][j] = B[i+1][j];

becomes

    for (i = 0; i < 4; i++)
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = shared_deref(B, i+1, j);
            shared_assign(A, i, j, val);
        }
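To see what is at stake, here is a plain-C model (my illustration, reusing the assumed row-major tile map from above, not xlupc's actual runtime) that walks this loop nest and classifies each access as local or remote:

    #include <stdio.h>

    /* shared [2][3] double A[5][6], B[5][6]; THREADS == 4 assumed. */
    #define B1 2
    #define B2 3
    #define THREADS 4

    static int owner_of(int i, int j) {
        int tiles_per_row = (6 + B2 - 1) / B2;    /* arrays have 6 columns */
        return ((i / B1) * tiles_per_row + j / B2) % THREADS;
    }

    int main(void) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                int me = owner_of(i, j);          /* affinity: &A[i][j] */
                /* A[i][j] is owned by the executing thread by construction.
                   B[i+1][j] stays local only while i+1 remains in the same
                   tile row, i.e. while (i + 1) % B1 != 0. */
                printf("i=%d j=%d: A local, B %s\n", i, j,
                       owner_of(i + 1, j) == me ? "local" : "remote");
            }
        return 0;
    }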
14. Shared access code generation (continued)
    for (i = 0; i < 4; i++)
        upc_forall(j = 0; j < 4; j++; &A[i][j])
            A[i][j] = B[i+1][j];

- Do we really need the function calls?
- A[i][j] should be only a memory load/store, since the affinity test guarantees the executing thread owns it
- What about B[i+1][j]? On an SMP this should be just a load. On hybrids?
16. Locality analysis: intuition
    for (i = 0; i < 4; i++)
        upc_forall(j = 0; j < 4; j++; &A[i][j])
            A[i][j] = B[i+1][j];

- The locality of B[i+1][j] can only change when the index (i+1) crosses a block boundary in some dimension
- Block boundaries are at 0, BF, 2*BF, ...
- (i+1) % BF == 0 marks a block boundary
- So we only need to test whether (i+1) % BF == 0 to find the places where locality can change! (With BF = 2, that happens at i = 1 and i = 3.)
17. Locality analysis
    for (i = 0; i < 4; i++)
        upc_forall(j = 0; j < 4; j++; &A[i][j])
            A[i][j] = B[i+1][j];

- Define the offset vector (k1, k2); here k1 = 1, k2 = 0
- k1 and k2 are integer constants
- The reference crosses a block boundary where (i + k1) % BF == 0
- Two cases: i % BF < (BF - k1 % BF) and i % BF >= (BF - k1 % BF)
- We call the split point BF - k1 % BF the 'cut' (sketched below)
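A minimal plain-C sketch (mine) of the cut for the running example: BF = 2 in the blocked dimension and k1 = 1 give cut = 2 - 1 = 1, so locality is uniform on each side of the cut:

    #include <stdio.h>

    int main(void) {
        const int BF = 2, k1 = 1;       /* shared [2][3], offset (1, 0) */
        const int cut = BF - k1 % BF;   /* = 1 for the running example  */

        for (int i = 0; i < 4; i++) {
            /* Below the cut, i and i + k1 sit in the same block of size
               BF, so B[i+k1][j] has the same locality as A[i][j]. */
            if (i % BF < cut)
                printf("i=%d: same block, B[i+1][j] is local\n", i);
            else
                printf("i=%d: block boundary crossed, B[i+1][j] may be remote\n", i);
        }
        return 0;
    }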
18. Shared access code generation (after locality analysis)
    for (i = 0; i < 4; i++) {
        if ((i % 2) < 1) {
            /* i % BF is below the cut: B[i+1][j] is local */
            upc_forall(j = 0; j < 4; j++; &A[i][j]) {
                val = memory_load(B, i+1, j);
                memory_store(A, i, j, val);
            }
        } else {
            /* block boundary crossed: B[i+1][j] may be remote */
            upc_forall(j = 0; j < 4; j++; &A[i][j]) {
                val = shared_deref(B, i+1, j);
                memory_store(A, i, j, val);
            }
        }
    }
19. Locality analysis algorithm
- For each shared reference in the loop:
  - Check if the blocking factor matches that of the affinity expression
  - Check if the distance vector is constant
- If the reference is eligible:
  - Generate cut expressions
  - Put each cut into a sorted cut list
- Replicate the loop body as necessary, once per region between cuts
- Insert a direct memory load/store for local references; otherwise insert an RTS call (a toy version of the cut machinery follows below)
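A runnable toy model (my sketch, not the xlupc implementation) of the cut machinery: given the constant offsets of the eligible references, build the sorted cut list and print the regions of i % BF, each of which would receive one specialized copy of the loop body:

    #include <stdio.h>
    #include <stdlib.h>

    #define BF 2

    static int cmp_int(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    int main(void) {
        int offsets[] = {0, 1};          /* offsets of A[i][j] and B[i+1][j] */
        int n = sizeof offsets / sizeof offsets[0];
        int cuts[8], ncuts = 0;

        /* One cut per eligible reference whose offset actually crosses
           a block boundary inside the i % BF range. */
        for (int r = 0; r < n; r++) {
            int cut = BF - offsets[r] % BF;
            if (cut < BF)                /* offset 0 yields no cut */
                cuts[ncuts++] = cut;
        }
        qsort(cuts, ncuts, sizeof(int), cmp_int);

        /* Each non-empty region between consecutive cuts gets one
           specialized copy of the loop body (duplicates collapse). */
        int lo = 0;
        for (int c = 0; c <= ncuts; c++) {
            int hi = (c < ncuts) ? cuts[c] : BF;
            if (hi > lo)
                printf("region %d <= i %% BF < %d: one body version\n", lo, hi);
            lo = hi;
        }
        return 0;
    }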
20. Improvements of locality analysis in isolation
21. Improvements of affinity test elimination in isolation
22. Results: vector addition
23. Matrix-vector multiplication
24. Matrix-vector scalability
25. Conclusions
- UPC requires extensive compiler support
- upc_forall is a challenging construct to compile efficiently
- Shared access implementation requires compiler support
- The optimizations working together produce good results
- Compiler optimizations can produce a >80x speedup over unoptimized code
- If one optimization fails, the results can still be poor