Transcript and Presenter's Notes

Title: Compiling Fortran D


1
Compiling Fortran D
  • For MIMD Distributed Machines
  • Authors: Seema Hiranandani, Ken Kennedy, Chau-Wen
    Tseng
  • Published: 1992
  • Presented by Sunjeev Sikand
  • Thursday, September 17, 2009

2
Problem
  • Parallel computers represent the only plausible
    way to continue to increase the computational
    power available to scientists and engineers
  • However, they are difficult to program
  • In particular, MIMD machines require message
    passing between separate address spaces and
    synchronization among processors

3
Problem cont.
  • Because parallel programs are machine-specific,
    scientists are discouraged from writing them:
    their investment is lost when the program changes
    or a new architecture arrives
  • Vectorizable programs, by contrast, are easy to
    maintain, debug, and port, and the compiler does
    the work

4
Solution
  • Previous Fortran dialects lack a means of
    specifying a data decomposition
  • The authors believe that if a program is written
    in a data parallel programming style with
    reasonable data decompositions it can be
    implemented efficiently.
  • Thus they propose to develop a compiler
    technology to establish such a machine-independent
    programming model.
  • Want to reduce both communication and load
    imbalance

5
Data Decomposition
  • A decomposition is an abstract problem or index
    domain; it does not require any storage
  • Each element of a decomposition represents a unit
    of computation
  • The DECOMPOSITION statement declares the name,
    dimensionality, and size of a decomposition for
    later use
  • There are two levels of parallelism in data
    parallel applications

6
Decomposition Statement
DECOMPOSITION D(N,N)
7
Data Decomposition - Alignment
  • The first level of parallelism is array
    alignment (problem mapping), that is, how arrays
    are aligned with respect to one another
  • Represents the minimal requirements for reducing
    data movement for the program given an unlimited
    number of processors
  • Machine independent, it depends on the
    fine-grained parallelism defined by the
    individual members of the data arrays

8
Alignment cont.
  • Corresponding elements in aligned arrays are
    always mapped to the same processor
  • Array operations between aligned arrays are
    usually more efficient than array operations
    between arrays that are not known to be aligned.

9
Alignment Example
  • REAL A(N,N)
  • DECOMPOSITION D(N,N)
  • ALIGN A(I,J) with D(J-2,I+3)
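  A minimal sketch (not from the paper; the helper name is hypothetical) of
  what this ALIGN rule means: every array element A(i,j) is mapped to
  decomposition element D(j-2, i+3), and arrays aligned to the same
  decomposition element are guaranteed to reside on the same processor.

    # Illustrative Python rendering of ALIGN A(I,J) with D(J-2,I+3).
    def align_A(i, j):
        """Map array element A(i, j) to its decomposition element D(d1, d2)."""
        return (j - 2, i + 3)

    # Whatever distribution D later receives, elements of different arrays
    # that map to the same D(d1, d2) share a processor.
    print(align_A(5, 10))   # A(5,10) -> D(8, 8)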

10
Data Decomposition - Distribution
  • The other level of parallelism is distribution
    (machine mapping), that is, how arrays are
    distributed on the actual parallel machine
  • Represents the translation of the problem onto
    the finite resources of the machine
  • Affected by the topology, communication
    mechanisms, size of local memory, and number of
    processors on the underlying machine

11
Distribution cont.
  • Specified by assigning an independent attribute
    to each dimension.
  • Predefined attributes include BLOCK, CYCLIC, and
    BLOCK_CYCLIC
  • The ":" symbol marks dimensions that are not
    distributed (see the sketch below)
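  A hedged sketch (illustrative Python, not the compiler's code) of the usual
  owner/local-index arithmetic behind these attributes, assuming N elements
  (1-based), P processors, and a block size b for BLOCK_CYCLIC:

    from math import ceil

    def block_owner(i, N, P):
        """BLOCK: contiguous chunks of ceil(N/P) elements per processor."""
        b = ceil(N / P)
        return (i - 1) // b + 1, (i - 1) % b + 1      # (processor, local index)

    def cyclic_owner(i, N, P):
        """CYCLIC: elements dealt out round-robin, one at a time."""
        return (i - 1) % P + 1, (i - 1) // P + 1

    def block_cyclic_owner(i, N, P, b):
        """BLOCK_CYCLIC(b): blocks of b elements dealt out round-robin."""
        blk = (i - 1) // b
        return blk % P + 1, (blk // P) * b + (i - 1) % b + 1

    # Example: element 30 of a 100-element dimension over 4 processors.
    print(block_owner(30, 100, 4))    # -> (2, 5)
    print(cyclic_owner(30, 100, 4))   # -> (2, 8)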

12
Distribution Example 1
DISTRIBUTE D(:,BLOCK)
13
Distribution Example 2
DISTRIBUTE D(:,CYCLIC)
14
Fortran D Compiler
  • The two major steps in writing a data parallel
    program are selecting a data decomposition and
    using it to derive node programs with explicit
    data movement
  • The former is left to the user
  • The latter is generated automatically by the
    compiler once a data decomposition is given
  • The compiler translates the program into an SPMD
    program with explicit message passing that
    executes directly on the nodes of the
    distributed-memory machine

15
Fortran D Compiler Structure
  • 1. Program Analysis
       a. Dependence Analysis
       b. Data Decomposition Analysis
       c. Partitioning Analysis
       d. Communication Analysis
  • 2. Program Optimization
       a. Message vectorization
       b. Collective communications
       c. Run-time processing
       d. Pipelined computations
  • 3. Code Generation
       a. Program partitioning
       b. Message generation
       c. Storage management

16
Partition Analysis
  • Converting global to local indices

Original program:
  REAL A(100)
  do i = 1, 100
    A(i) = 0.0
  enddo

SPMD node program:
  REAL A(25)
  do i = 1, 25
    A(i) = 0.0
  enddo
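  The same conversion, written out as an illustrative Python sketch (helper
  names are hypothetical) for the BLOCK-distributed A(100) over 4 processors
  shown above, with 25 elements per processor:

    BLOCK = 25    # elements per processor: 100 elements over 4 processors

    def global_to_local(g):
        p = (g - 1) // BLOCK + 1         # owning processor, 1..4
        return p, (g - 1) % BLOCK + 1    # local index on that processor, 1..25

    def local_to_global(p, l):
        return (p - 1) * BLOCK + l

    print(global_to_local(60))       # -> (3, 10): global A(60) is A(10) on processor 3
    print(local_to_global(3, 10))    # -> 60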
17
Jacobi Relaxation
  • In the grid approximation that discretizes the
    physical problem, the heat flow into any given
    point at a given moment is the sum of the four
    temperature differences between that point and
    each of the four points surrounding it.
  • Translating this into an iterative method, the
    correct solution can be found if the temperature
    of a given grid point at a given iteration is
    taken to be the average of the temperatures of
    the four surrounding grid points at the previous
    iteration.
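  Written as an update formula, with superscripts marking the iteration (in
  the code on the next slide, B holds iteration k-1 and A iteration k):

    A^{(k)}_{i,j} = ( A^{(k-1)}_{i-1,j} + A^{(k-1)}_{i+1,j}
                      + A^{(k-1)}_{i,j-1} + A^{(k-1)}_{i,j+1} ) / 4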

18
Jacobi Relaxation Code
  • REAL A(100,100), B(100,100)
  • DECOMPOSITION D(100,100)
  • ALIGN A, B with D
  • DISTRIBUTE D(:,BLOCK)
  • do k = 1, time
  •   do j = 2, 99
  •     do i = 2, 99
  • S1      A(i,j) = (B(i,j-1) + B(i-1,j) +
  •                   B(i+1,j) + B(i,j+1)) / 4
  •     enddo
  •   enddo
  •   do j = 2, 99
  •     do i = 2, 99
  • S2      B(i,j) = A(i,j)
  •     enddo
  •   enddo
  • enddo

19
Jacobi Relaxation Processor Layout
  • Compiling for a four-processor machine.
  • Both arrays A and B are aligned identically with
    decomposition D, so they have the same
    distribution as D.
  • Because the first dimension of D is local and the
    second dimension is block-distributed, the local
    index set for both A and B on each processor (in
    local indices) is [1:100, 1:25].

20
Jacobi Relaxation cont.
21
Jacobi Relaxation cont.
  • The iteration set of the loop nest (in global
    indices) is [1:time, 2:99, 2:99].
  • Local iteration sets for each processor (in local
    indices), derived in the sketch below:
  • Proc(1): [1:time, 2:25, 2:99]
  • Proc(2:3): [1:time, 1:25, 2:99]
  • Proc(4): [1:time, 1:24, 2:99]
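  The sketch below (illustrative Python, 1-based indices; not the compiler's
  implementation) shows how the local j bounds above can be derived:
  intersect the global range j = 2..99 with the block of columns each
  processor owns, then convert to local indices.

    BLOCK = 25    # columns per processor for DISTRIBUTE D(:,BLOCK) on 4 processors

    def local_j_bounds(p, g_lo=2, g_hi=99):
        own_lo = (p - 1) * BLOCK + 1          # first global column owned by p
        own_hi = p * BLOCK                    # last global column owned by p
        lo = max(g_lo, own_lo) - own_lo + 1   # clip to owned columns, localize
        hi = min(g_hi, own_hi) - own_lo + 1
        return lo, hi

    for p in range(1, 5):
        print(p, local_j_bounds(p))
    # 1 (2, 25)   2 (1, 25)   3 (1, 25)   4 (1, 24)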

22
Generated Jacobi
  • REAL A(100,25), B(100,0:26)
  • if (Plocal = 1) lb1 = 2 else lb1 = 1
  • if (Plocal = 4) ub1 = 24 else ub1 = 25
  • do k = 1, time
  •   if (Plocal > 1) send(B(2:99,1), Pleft)
  •   if (Plocal < 4) send(B(2:99,25), Pright)
  •   if (Plocal < 4) recv(B(2:99,26), Pright)
  •   if (Plocal > 1) recv(B(2:99,0), Pleft)
  •   do j = lb1, ub1
  •     do i = 2, 99
  •       A(i,j) = (B(i,j-1) + B(i-1,j) +
  •                 B(i+1,j) + B(i,j+1)) / 4
  •     enddo
  •   enddo

23
Generated Jacobi cont.
  •   do j = lb1, ub1
  •     do i = 2, 99
  •       B(i,j) = A(i,j)
  •     enddo
  •   enddo
  • enddo
  • The only true cross-processor dependences are
    carried by the k loop, so messages can be
    vectorized (see the sketch below)
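  A hedged sketch of this generated SPMD node program in Python with mpi4py
  (an assumption made for illustration; the compiler actually emits Fortran
  with its own send/recv primitives). Each of the 4 ranks keeps its 100x25
  block plus ghost columns 0 and 26, exchanges whole boundary columns once
  per k iteration (message vectorization), and then sweeps locally:

    # Run as: mpiexec -n 4 python jacobi_spmd.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p = comm.Get_rank() + 1                            # Plocal in 1..4
    left  = comm.Get_rank() - 1 if p > 1 else MPI.PROC_NULL
    right = comm.Get_rank() + 1 if p < 4 else MPI.PROC_NULL

    # Index [i, j] mimics B(i, j): rows 1..100, columns 0..26 (ghost columns
    # 0 and 26); row 0 is unused so the slide's 1-based indices carry over.
    A = np.zeros((101, 27))
    B = np.random.rand(101, 27)

    lb1 = 2 if p == 1 else 1                           # skip global boundary
    ub1 = 24 if p == 4 else 25
    time = 10

    for k in range(time):
        # Message vectorization: one message per neighbour per iteration,
        # carrying the whole boundary column B(2:99, *).
        recv_l, recv_r = np.empty(98), np.empty(98)
        comm.Sendrecv(B[2:100, 1].copy(),  dest=left,  recvbuf=recv_r, source=right)
        comm.Sendrecv(B[2:100, 25].copy(), dest=right, recvbuf=recv_l, source=left)
        if right != MPI.PROC_NULL: B[2:100, 26] = recv_r
        if left  != MPI.PROC_NULL: B[2:100, 0]  = recv_l

        for j in range(lb1, ub1 + 1):                  # local Jacobi sweep
            A[2:100, j] = (B[2:100, j-1] + B[1:99, j]
                           + B[3:101, j] + B[2:100, j+1]) / 4
        B[2:100, lb1:ub1+1] = A[2:100, lb1:ub1+1]

  Sendrecv is used instead of the slide's separate send/recv pair only to
  keep the sketch deadlock-free regardless of message size; the communication
  pattern and volume are the same.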

24
Pipelined Computation
  • In loosely synchronous computations all processors
    execute in loose lockstep, alternating between
    phases of local computation and global
    communication, e.g. Red-Black SOR and Jacobi
  • However, some computations, such as SOR, contain
    loop-carried dependences
  • These present an opportunity to exploit
    parallelism through pipelining

25
Pipelined Computation cont.
  • The observation is that for some pipelined
    computations the program order must be changed
  • Fine-grained pipelining interchanges
    cross-processor loops as deeply as possible to
    improve sequential computation, but incurs the
    most communication overhead
  • Coarse-grained pipelining uses strip mining and
    loop interchange to adjust the granularity of the
    pipelining; it decreases communication overhead
    at the expense of some parallelism (see the
    sketch below)
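  A rough sketch (not from the paper) of the coarse-grained pipelining loop
  structure for an SOR-like recurrence whose rows are block-distributed
  across processors; send and recv are stand-in stubs, and STRIP sets the
  granularity trade-off described above:

    import numpy as np

    STRIP = 8     # larger strips: fewer, bigger messages, but the downstream
                  # processor waits longer before it can start

    def send(buf, dest): pass               # placeholder for a real message send
    def recv(src, n): return np.zeros(n)    # placeholder boundary from upstream

    def sweep_pipelined(a, prev_proc, next_proc):
        """a: this processor's block of rows; a[0, :] is a ghost row.
        The recurrence a[i,j] = f(a[i-1,j], a[i,j-1]) is loop-carried in both
        dimensions, so processors form a pipeline along the row dimension."""
        nrows, ncols = a.shape
        for j0 in range(1, ncols, STRIP):             # strip-mined column loop
            j1 = min(j0 + STRIP, ncols)
            if prev_proc is not None:                 # boundary row for this strip
                a[0, j0:j1] = recv(prev_proc, j1 - j0)
            for j in range(j0, j1):                   # compute one strip locally
                for i in range(1, nrows):
                    a[i, j] = 0.5 * (a[i - 1, j] + a[i, j - 1])
            if next_proc is not None:                 # forward this strip's last row
                send(a[nrows - 1, j0:j1], next_proc)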

26
Conclusions
  • A usable and efficient machine independent
    parallel programming model is needed to make
    large-scale parallel machines useful to
    scientific programmers
  • Fortran D, with its data decomposition model,
    performs message vectorization, collective
    communication, fine-grained pipelining, and
    several other optimizations for block-distributed
    arrays
  • The Fortran D compiler will generate efficient
    code for a large class of data parallel programs
    with minimal effort

27
Discussion
  • Q: How is this applicable to sensor networks?
  • A: There is no explicit reference to sensor
    networks, as this paper was written over a decade
    ago. But the authors provide a unified programming
    methodology to distribute data and communicate
    among processors. Replace the processors with
    motes and you will see this is indeed relevant.

28
Discussion cont.
  • Q: What about issues such as fault tolerance?
  • A: Point well taken. If a message is lost, it
    doesn't seem as though the infrastructure is
    there to deal with it. The model could be
    extended with redundant computation, or perhaps
    even checkpointing, but as someone mentioned, the
    memory of motes may be an issue here.
  • Q: They provide a means for load balancing; is
    this even applicable to sensor networks?
  • A: Yes, it is. In sensor networks we want to
    balance the load so that energy isn't completely
    spent on any one mote.