Title: Data Parallelism in High Performance Fortran
1. Administrative Stuff
- Location change for the lecture on Friday, March 2:
- Education Building, Room 2 (one lecture only).
2. CS 320 / ECE 392 / CSE 302
Data Parallelism in High Performance Fortran
Department of Computer Science, University of Illinois at Urbana-Champaign
3. Contents
- High Performance Fortran
- Parallelism constructs
  - FORALL
  - PURE functions
  - INDEPENDENT
- Data Distribution Directives
  - ALIGN
  - DISTRIBUTE
  - TEMPLATE
  - PROCESSORS
4. References
- HPF specification (v2.0), available online at
  http://dacnet.rice.edu/Depts/CRPC/HPFF/versions/hpf2/hpf-v20/index.html
- Includes material from documentation, slides, and papers on HPF at Rice University.
5. What is HPF?
- HPF is a standard for data-parallel programming.
- Extends Fortran-77 or Fortran-90 (in theory also C, though this is not used in practice).
6. Principle of HPF
- Extend a sequential language with data distribution directives that specify on which processor a certain part of an array should reside.
- The source program is written as for a single processor.
- The compiler then produces:
- a data-parallel program (SPMD),
- the communication between the processes.
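- The flavor of the approach, as a minimal sketch (assuming a BLOCK distribution; the directive syntax is defined later in these slides):
  REAL a(1000)
  !HPF$ DISTRIBUTE a(BLOCK)   ! hint: spread a in equal blocks over the processors
  a = a + 1.0                 ! the compiler turns this whole-array statement into
                              ! SPMD code plus any needed communication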
7. What the Standard Says
- Can be used with both Fortran-77 and Fortran-90.
- Distribution directives are just a hint; the compiler can ignore them.
- HPF can be used on both shared-memory and distributed-memory hardware platforms.
8. In Reality
- HPF is always used with Fortran-90.
- Distribution directives are a must.
- HPF is used on both shared-memory and distributed-memory platforms.
- But the truth is that the language was really meant for distributed-memory platforms.
9. HPF: Additional Expressions of Parallelism
- FORALL (data-parallel) array assignment.
- PURE functions.
- The INDEPENDENT construct.
10. FORALL Array Assignment
- FORALL(subscript = lower_bound : upper_bound : stride, mask) array-assignment
- Executes all iterations of the subscript loop in parallel for the given set of indices where the mask is true.
- May have multiple dimensions.
- Same semantics as Fortran-90 array assignment: first compute the right-hand side, then assign to the left-hand side (illustrated in the sketch after this list).
- Only one assignment to a particular element is allowed (not checked by the compiler!).
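- A small illustration of the evaluate-then-assign semantics (a minimal sketch, assuming a real array X of size n):
  ! FORALL evaluates every right-hand side before performing any
  ! assignment, so each X(i) receives the OLD value of X(i-1):
  FORALL(i=2:n) X(i) = X(i-1)
  ! A sequential DO loop instead propagates X(1) into all elements:
  DO i = 2, n
    X(i) = X(i-1)
  ENDDO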
11. Examples
- Example 1:
- do i = 1, 100
-   X(i,i) = 0.0
- enddo
- becomes
- FORALL(i=1:100) X(i,i) = 0.0
- Example 2:
- FORALL(i=1:50) D(i) = E(2*i-1) + E(2*i)
12. Examples
- A multiple-dimension example with use of the mask option.
- Set all the elements of X above the diagonal to the sum of their indices:
- FORALL(i=1:100, j=1:100, i<j) X(i,j) = i+j
13. PURE functions/subroutines
- Defined to be side-effect free, so they can be executed concurrently.
- Example: if nitns() is declared as a PURE function, then
- FORALL(i=1:M, j=1:N) mandel(i,j) = nitns(CMPLX(.1*i, .1*j))
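- What such a function might look like (a sketch only; the name nitns comes from the slide, but the Mandelbrot-style body, the iteration cap of 100, and the escape radius of 2.0 are illustrative assumptions):
  PURE INTEGER FUNCTION nitns(c)
    COMPLEX, INTENT(IN) :: c   ! PURE requires INTENT(IN) dummy arguments
    COMPLEX :: z
    INTEGER :: n
    z = c
    n = 0
    ! count iterations of z = z*z + c until divergence or the cap
    DO WHILE (ABS(z) < 2.0 .AND. n < 100)
      z = z*z + c
      n = n + 1
    ENDDO
    nitns = n                  ! no side effects, so safe inside FORALL
  END FUNCTION nitns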
14. The INDEPENDENT Clause
- !HPF$ INDEPENDENT
- DO ...
-   ...
- ENDDO
- Specifies that the iterations of the loop can be executed in any order (concurrently).
15. Examples
- !HPF$ INDEPENDENT
- DO i = 1, 100
-   DO j = 1, 100
-     IF (i.NE.j) A(i,j) = 1.0
-     IF (i.EQ.j) A(i,j) = 0.0
-   ENDDO
- ENDDO
16. Examples: Nesting
- !HPF$ INDEPENDENT
- DO i = 1, 100
-   !HPF$ INDEPENDENT
-   DO j = 1, 100
-     IF (i.NE.j) A(i,j) = 1.0
-     IF (i.EQ.j) A(i,j) = 0.0
-   ENDDO
- ENDDO
17. HPF/Fortran-90 Matrix Multiply
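- In Fortran-90 itself, the whole computation can be written as a single data-parallel intrinsic call (a minimal sketch of the Fortran-90 form):
  C = MATMUL(A, B)   ! intrinsic whole-array matrix multiply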
18. HPF Matrix Multiply
- C = 0.0
- do k = 1, n
-   FORALL(i=1:n, j=1:n)
-     C(i,j) = C(i,j) + A(i,k) * B(k,j)
-   END FORALL
- enddo
- (The k loop stays sequential: every iteration of k updates the same element C(i,j), which a FORALL may assign only once.)
19. HPF Matrix Multiply
- !HPF$ INDEPENDENT
- DO i = 1, n
-   DO j = 1, n
-     C(i,j) = 0.0
-     DO k = 1, n
-       C(i,j) = C(i,j) + A(i,k) * B(k,j)
-     ENDDO
-   ENDDO
- ENDDO
20. HPF Matrix Multiply
- !HPF$ INDEPENDENT
- DO i = 1, n
-   !HPF$ INDEPENDENT
-   DO j = 1, n
-     C(i,j) = 0.0
-     DO k = 1, n
-       C(i,j) = C(i,j) + A(i,k) * B(k,j)
-     ENDDO
-   ENDDO
- ENDDO
21. PROCESSORS Directive
- Declares abstract processor arrangements (single processors or processor arrays):
  !HPF$ PROCESSORS p(4), q(NUMBER_OF_PROCESSORS()/2, 2)
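- A distribution can name an arrangement explicitly with the ONTO clause (a minimal sketch; the array a is an illustrative assumption):
  REAL a(100)
  !HPF$ PROCESSORS p(4)
  !HPF$ DISTRIBUTE a(BLOCK) ONTO p   ! a(1:25) on p(1), a(26:50) on p(2), ...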
22. ALIGN Directive
- Relates elements of an array to those of another array or template, such that the aligned elements are stored on the same processor(s):
  REAL a(4), b(4), c(8), a2(4,4), b2(4,4)
  !HPF$ ALIGN a(:) WITH b(:)
  !HPF$ ALIGN a(:) WITH c(2:8:2)
  !HPF$ ALIGN a(:) WITH c(4:1:-1)
  !HPF$ ALIGN a(:) WITH b2(*,:)
  !HPF$ ALIGN a2(:,*) WITH b(:)
  !HPF$ ALIGN a2(i,j) WITH b2(j,i)
23. TEMPLATE Directive
- Declares an abstract scalar or array:
- no storage allocated
- used just for data alignment and distribution: data objects can be aligned with templates, and templates can be distributed (see the sketch below).
  !HPF$ TEMPLATE t(10), t2(10,10), u(m*n)
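- How TEMPLATE, ALIGN, and DISTRIBUTE combine (a sketch; the arrays and sizes are illustrative assumptions):
  REAL a(10,10), b(10)
  !HPF$ TEMPLATE t2(10,10)
  !HPF$ ALIGN a(:,:) WITH t2(:,:)    ! a lives wherever t2 lives
  !HPF$ ALIGN b(:) WITH t2(:,1)      ! b follows the first column of t2
  !HPF$ DISTRIBUTE t2(BLOCK,BLOCK)   ! distributing t2 places a and b too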
24. Data Distributions
- HPF provides data distribution directives to specify which processor owns what data.
- Owner-computes rule: the owner of the data does the computation on the data.
- Goal: improve locality, reduce communication, and improve performance.
25. Data Distribution Definition
- !HPF$ DISTRIBUTE <array> <distribution>
- <distribution> (in each dimension):
- * : no distribution
- BLOCK
- BLOCK(k): k is the block size, default n/p
- CYCLIC
- CYCLIC(k): k is the cycle size, default 1
- An array without distribution is replicated (see the small worked example below).
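- A worked example, assuming two abstract processors p1 and p2 (the three directives are alternatives for the same array, shown side by side for comparison):
  REAL x(8)
  !HPF$ DISTRIBUTE x(BLOCK)       ! p1: x(1:4)           p2: x(5:8)
  !HPF$ DISTRIBUTE x(CYCLIC)      ! p1: x(1,3,5,7)       p2: x(2,4,6,8)
  !HPF$ DISTRIBUTE x(CYCLIC(2))   ! p1: x(1:2), x(5:6)   p2: x(3:4), x(7:8)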
26. Data Distribution Examples
- !HPF$ DISTRIBUTE A(BLOCK,BLOCK)
27. Data Distribution Examples
- !HPF$ DISTRIBUTE A(BLOCK,*)
28. Data Distribution Examples
- !HPF$ DISTRIBUTE A(*,BLOCK)
29. Data Distribution Examples
- !HPF$ DISTRIBUTE A(*,CYCLIC)
30. Data Distribution Examples
- !HPF$ DISTRIBUTE A(*,CYCLIC(2))
31. Data Distribution Examples
- !HPF$ DISTRIBUTE A(BLOCK,CYCLIC)
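- For instance, combining BLOCK and CYCLIC on a 2x2 processor arrangement (a sketch; the 4x4 array and the arrangement are illustrative assumptions):
  REAL a(4,4)
  !HPF$ PROCESSORS p(2,2)
  !HPF$ DISTRIBUTE a(BLOCK,CYCLIC) ONTO p
  ! rows 1:2 go to the first processor row, rows 3:4 to the second;
  ! odd columns go to the first processor column, even columns to the second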
32. Difference between OpenMP and HPF
- In OpenMP, the user specifies the distribution of iterations.
- Data travels to the processor executing the iteration.
- In HPF, the user specifies the distribution of data.
- Computation is done by the processor owning the data.
33. HPF Matrix Multiply
- !HPF$ DISTRIBUTE C(BLOCK,*)
- <standard matrix multiply code>
- Leads to the same computation as the OpenMP expression of matrix multiply: each processor computes a contiguous set of rows.
34. HPF Matrix Multiply
- !HPF$ DISTRIBUTE C(*,BLOCK)
- <standard matrix multiply code>
- Would cause each processor to compute a contiguous set of columns.
35. HPF Matrix Multiply
- !HPF$ DISTRIBUTE C(BLOCK,BLOCK)
- <standard matrix multiply code>
- Each processor computes a rectangular sub-array of the result.
36. Gaussian elimination
- (without pivoting)
- for( i=0; i<n; i++ )
-   for( j=i+1; j<n; j++ )
-     for( k=i+1; k<n; k++ )
-       a[j][k] = a[j][k] - a[j][i]*a[i][k] / a[i][i];
- The for-j loop is the outermost parallelizable loop.
37. OpenMP Gauss
- for( i=0; i<n; i++ ) {
-   #pragma omp parallel for private(k)
-   for( j=i+1; j<n; j++ )
-     for( k=i+1; k<n; k++ )
-       a[j][k] = a[j][k] - a[j][i]*a[i][k] / a[i][i];
- }
38. HPF Gauss
- !HPF$ DISTRIBUTE A(CYCLIC,*)
- DO i = 1, n
-   !HPF$ INDEPENDENT
-   DO j = i+1, n
-     DO k = i+1, n
-       A(j,k) = A(j,k) - A(j,i)*A(i,k)/A(i,i)
-     ENDDO
-   ENDDO
- ENDDO
39. Difference with OpenMP Gauss
- In HPF, the cyclic distribution of A is useful for load balance.
- In OpenMP, block scheduling of the iterations suffices (because the iterations are re-distributed at each new pragma).
40. Difference with OpenMP Gauss
- In HPF, each processor keeps working on the same data/rows (owner computes).
- In OpenMP, data/rows move between processors.
- HPF is potentially more efficient (increased locality; more about this later).
41. How an HPF compiler works
- Parallelization is based on the Fortran-90 and HPF concurrency constructs.
- Assign data to processors based on the distributions.
- Compute data on the owning processor.
- Move other data necessary for the computation to that processor.
42. Hard Part of an HPF Compiler
- Communication optimization.
- Avoid lots of small messages; optimize towards few large messages.
- Absolutely critical to good performance.
43. Performance impact of distribution
- Back to matrix multiply:
- !HPF$ DISTRIBUTE C(BLOCK,*)
- Causes C to be row-distributed, and A and B to be replicated.
- No communication.
44. Performance impact of distribution
- Back to matrix multiply:
- !HPF$ DISTRIBUTE C(BLOCK,*), A(BLOCK,*)
- Causes C and A to be row-distributed, and B to be replicated.
- No communication.
45. Performance impact of distribution
- Back to matrix multiply:
- !HPF$ DISTRIBUTE C(BLOCK,*), A(*,BLOCK)
- Causes C to be row-distributed, A to be column-distributed, and B to be replicated.
- Lots of communication!
46. Performance impact of distribution
- [Figure: C = A x B, with B replicated; annotated to show the part of A that will have to move to processor 0.]
47. Things Can Get Worse
- Sometimes the compiler cannot determine exactly what data needs to be moved:
- B(i) = A(INDEX(i))
- (where INDEX(i) is determined dynamically)
- A conservative estimate needs to be made.
- This often leads to a broadcast of all the data.
- Better methods are known, but they are difficult.
48. Summary
- Data-parallelism features of HPF.
- Comparison with OpenMP.