Transcript and Presenter's Notes

Title: Compiling Fortran D


1
Compiling Fortran D
  • For MIMD Distributed Machines
  • Authors: Seema Hiranandani, Ken Kennedy, Chau-Wen
    Tseng
  • Published: 1992
  • Presented by Sunjeev Sikand
  • Thursday, September 17, 2009

2
Problem
  • Parallel computers represent the only plausible
    way to continue to increase the computational
    power available to scientists and engineers
  • However, they are difficult to program
  • In particular, MIMD machines require message
    passing between separate address spaces and
    synchronization among processors

3
Problem cont.
  • Because parallel programs are machine-specific,
    scientists are discouraged from writing them:
    their investment is lost when the program changes
    or a new architecture arrives
  • Vectorizable programs, by contrast, are easy to
    maintain, debug, and port, and the compiler does
    the work

4
Solution
  • Previous Fortran dialects lack a means of
    specifying a data decomposition
  • The authors believe that if a program is written
    in a data parallel programming style with
    reasonable data decompositions it can be
    implemented efficiently.
  • Thus they propose to develop a compiler
    technology to establish such a machine-independent
    programming model.
  • Want to reduce both communication and load
    imbalance

5
Data Decomposition
  • A decomposition is an abstract problem or index
    domain; it does not require any storage
  • Each element of a decomposition represents a unit
    of computation
  • The DECOMPOSITION statement declares the name,
    dimensionality, and size of a decomposition for
    later use
  • There are two levels of parallelism in data
    parallel applications

6
Decomposition Statement
DECOMPOSITION D(N,N)
7
Data Decomposition - Alignment
  • The first level of parallelism is array
    alignment (problem mapping), that is, how arrays
    are aligned with respect to one another
  • Represents the minimal requirements for reducing
    data movement for the program given an unlimited
    number of processors
  • Machine independent, it depends on the
    fine-grained parallelism defined by the
    individual members of the data arrays

8
Alignment cont.
  • Corresponding elements in aligned arrays are
    always mapped to the same processor
  • Array operations between aligned arrays are
    usually more efficient than array operations
    between arrays that are not known to be aligned.

9
Alignment Example
  • REAL A(N,N)
  • DECOMPOSITION D(N,N)
  • ALIGN A(I,J) with D(J-2,I+3)
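  A minimal sketch (not from the paper; the helper name is hypothetical) of
  what this ALIGN rule means: every array element A(i,j) is mapped to
  decomposition element D(j-2, i+3), and arrays aligned to the same
  decomposition element are guaranteed to reside on the same processor.

    # Illustrative Python rendering of ALIGN A(I,J) with D(J-2,I+3).
    def align_A(i, j):
        """Map array element A(i, j) to its decomposition element D(d1, d2)."""
        return (j - 2, i + 3)

    # Whatever distribution D later receives, elements of different arrays
    # that map to the same D(d1, d2) share a processor.
    print(align_A(5, 10))   # A(5,10) -> D(8, 8)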

10
Data Decomposition - Distribution
  • The other level of parallelism is distribution
    (machine mapping), that is, how arrays are
    distributed on the actual parallel machine
  • Represents the translation of the problem onto
    the finite resources of the machine
  • Affected by the topology, communication
    mechanisms, size of local memory, and number of
    processors on the underlying machine

11
Distribution cont.
  • Specified by assigning an independent attribute
    to each dimension.
  • Predefined attributes include BLOCK, CYCLIC, and
    BLOCK_CYCLIC
  • The ":" symbol marks dimensions that are not
    distributed (see the sketch below)
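  A hedged sketch (illustrative Python, not the compiler's code) of the usual
  owner/local-index arithmetic behind these attributes, assuming N elements
  (1-based), P processors, and a block size b for BLOCK_CYCLIC:

    from math import ceil

    def block_owner(i, N, P):
        """BLOCK: contiguous chunks of ceil(N/P) elements per processor."""
        b = ceil(N / P)
        return (i - 1) // b + 1, (i - 1) % b + 1      # (processor, local index)

    def cyclic_owner(i, N, P):
        """CYCLIC: elements dealt out round-robin, one at a time."""
        return (i - 1) % P + 1, (i - 1) // P + 1

    def block_cyclic_owner(i, N, P, b):
        """BLOCK_CYCLIC(b): blocks of b elements dealt out round-robin."""
        blk = (i - 1) // b
        return blk % P + 1, (blk // P) * b + (i - 1) % b + 1

    # Example: element 30 of a 100-element dimension over 4 processors.
    print(block_owner(30, 100, 4))    # -> (2, 5)
    print(cyclic_owner(30, 100, 4))   # -> (2, 8)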

12
Distribution Example 1
DISTRIBUTE D(:,BLOCK)
13
Distribution Example 2
DISTRIBUTE D(:,CYCLIC)
14
Fortran D Compiler
  • The two major steps in writing a data parallel
    program are selecting a data decomposition and
    using it to derive node programs with explicit
    data movement
  • The former is left to the user
  • The latter is generated automatically by the
    compiler once a data decomposition is given
  • The compiler translates the program into an SPMD
    program with explicit message passing that
    executes directly on the nodes of the
    distributed-memory machine

15
Fortran D Compiler Structure
  • 1. Program Analysis
       a. Dependence Analysis
       b. Data Decomposition Analysis
       c. Partitioning Analysis
       d. Communication Analysis
  • 2. Program Optimization
       a. Message vectorization
       b. Collective communications
       c. Run-time processing
       d. Pipelined computations
  • 3. Code Generation
       a. Program partitioning
       b. Message generation
       c. Storage management

16
Partition Analysis
  • Converting global to local indices

Original program:
  REAL A(100)
  do i = 1, 100
    A(i) = 0.0
  enddo

SPMD node program:
  REAL A(25)
  do i = 1, 25
    A(i) = 0.0
  enddo
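  The same conversion, written out as an illustrative Python sketch (helper
  names are hypothetical) for the BLOCK-distributed A(100) over 4 processors
  shown above, with 25 elements per processor:

    BLOCK = 25    # elements per processor: 100 elements over 4 processors

    def global_to_local(g):
        p = (g - 1) // BLOCK + 1         # owning processor, 1..4
        return p, (g - 1) % BLOCK + 1    # local index on that processor, 1..25

    def local_to_global(p, l):
        return (p - 1) * BLOCK + l

    print(global_to_local(60))       # -> (3, 10): global A(60) is A(10) on processor 3
    print(local_to_global(3, 10))    # -> 60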
17
Jacobi Relaxation
  • In the grid approximation that discretizes the
    physical problem, the heat flow into any given
    point at a given moment is the sum of the four
    temperature differences between that point and
    each of the four points surrounding it.
  • Translating this into an iterative method, the
    correct solution can be found if the temperature
    of a given grid point at a given iteration is
    taken to be the average of the temperatures of
    the four surrounding grid points at the previous
    iteration.
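  Written as an update formula, with superscripts marking the iteration (in
  the code on the next slide, B holds iteration k-1 and A iteration k):

    A^{(k)}_{i,j} = ( A^{(k-1)}_{i-1,j} + A^{(k-1)}_{i+1,j}
                      + A^{(k-1)}_{i,j-1} + A^{(k-1)}_{i,j+1} ) / 4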

18
Jacobi Relaxation Code
  • REAL A(100,100), B(100,100)
  • DECOMPOSITION D(100,100)
  • ALIGN A, B with D
  • DISTRIBUTE D(:,BLOCK)
  • do k = 1, time
  •   do j = 2, 99
  •     do i = 2, 99
  • S1      A(i,j) = (B(i,j-1) + B(i-1,j) +
  •                   B(i+1,j) + B(i,j+1)) / 4
  •     enddo
  •   enddo
  •   do j = 2, 99
  •     do i = 2, 99
  • S2      B(i,j) = A(i,j)
  •     enddo
  •   enddo
  • enddo

19
Jacobi Relaxation Processor Layout
  • Compiling for a four-processor machine.
  • Both arrays A and B are aligned identically with
    decomposition D, so they have the same
    distribution as D.
  • Because the first dimension of D is local and the
    second dimension is block-distributed, the local
    index set for both A and B on each processor (in
    local indices) is [1:100, 1:25].

20
Jacobi Relaxation cont.
21
Jacobi Relaxation cont.
  • The iteration set of the loop nest (in global
    indices) is [1:time, 2:99, 2:99].
  • Local iteration sets for each processor (in local
    indices), derived in the sketch below:
  • Proc(1): [1:time, 2:25, 2:99]
  • Proc(2:3): [1:time, 1:25, 2:99]
  • Proc(4): [1:time, 1:24, 2:99]
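  The sketch below (illustrative Python, 1-based indices; not the compiler's
  implementation) shows how the local j bounds above can be derived:
  intersect the global range j = 2..99 with the block of columns each
  processor owns, then convert to local indices.

    BLOCK = 25    # columns per processor for DISTRIBUTE D(:,BLOCK) on 4 processors

    def local_j_bounds(p, g_lo=2, g_hi=99):
        own_lo = (p - 1) * BLOCK + 1          # first global column owned by p
        own_hi = p * BLOCK                    # last global column owned by p
        lo = max(g_lo, own_lo) - own_lo + 1   # clip to owned columns, localize
        hi = min(g_hi, own_hi) - own_lo + 1
        return lo, hi

    for p in range(1, 5):
        print(p, local_j_bounds(p))
    # 1 (2, 25)   2 (1, 25)   3 (1, 25)   4 (1, 24)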

22
Generated Jacobi
  • REAL A(100,25), B(100,0:26)
  • if (Plocal = 1) lb1 = 2 else lb1 = 1
  • if (Plocal = 4) ub1 = 24 else ub1 = 25
  • do k = 1, time
  •   if (Plocal > 1) send(B(2:99,1), Pleft)
  •   if (Plocal < 4) send(B(2:99,25), Pright)
  •   if (Plocal < 4) recv(B(2:99,26), Pright)
  •   if (Plocal > 1) recv(B(2:99,0), Pleft)
  •   do j = lb1, ub1
  •     do i = 2, 99
  •       A(i,j) = (B(i,j-1) + B(i-1,j) +
  •                 B(i+1,j) + B(i,j+1)) / 4
  •     enddo
  •   enddo

23
Generated Jacobi cont.
  •   do j = lb1, ub1
  •     do i = 2, 99
  •       B(i,j) = A(i,j)
  •     enddo
  •   enddo
  • enddo
  • The only true cross-processor dependences are
    carried by the k loop, so messages can be
    vectorized (see the sketch below)
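  A hedged sketch of this generated SPMD node program in Python with mpi4py
  (an assumption made for illustration; the compiler actually emits Fortran
  with its own send/recv primitives). Each of the 4 ranks keeps its 100x25
  block plus ghost columns 0 and 26, exchanges whole boundary columns once
  per k iteration (message vectorization), and then sweeps locally:

    # Run as: mpiexec -n 4 python jacobi_spmd.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p = comm.Get_rank() + 1                            # Plocal in 1..4
    left  = comm.Get_rank() - 1 if p > 1 else MPI.PROC_NULL
    right = comm.Get_rank() + 1 if p < 4 else MPI.PROC_NULL

    # Index [i, j] mimics B(i, j): rows 1..100, columns 0..26 (ghost columns
    # 0 and 26); row 0 is unused so the slide's 1-based indices carry over.
    A = np.zeros((101, 27))
    B = np.random.rand(101, 27)

    lb1 = 2 if p == 1 else 1                           # skip global boundary
    ub1 = 24 if p == 4 else 25
    time = 10

    for k in range(time):
        # Message vectorization: one message per neighbour per iteration,
        # carrying the whole boundary column B(2:99, *).
        recv_l, recv_r = np.empty(98), np.empty(98)
        comm.Sendrecv(B[2:100, 1].copy(),  dest=left,  recvbuf=recv_r, source=right)
        comm.Sendrecv(B[2:100, 25].copy(), dest=right, recvbuf=recv_l, source=left)
        if right != MPI.PROC_NULL: B[2:100, 26] = recv_r
        if left  != MPI.PROC_NULL: B[2:100, 0]  = recv_l

        for j in range(lb1, ub1 + 1):                  # local Jacobi sweep
            A[2:100, j] = (B[2:100, j-1] + B[1:99, j]
                           + B[3:101, j] + B[2:100, j+1]) / 4
        B[2:100, lb1:ub1+1] = A[2:100, lb1:ub1+1]

  Sendrecv is used instead of the slide's separate send/recv pair only to
  keep the sketch deadlock-free regardless of message size; the communication
  pattern and volume are the same.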

24
Pipelined Computation
  • In loosely synchronous computations all processors
    execute in loose lockstep, alternating between
    phases of local computation and global
    communication, e.g. Red-Black SOR and Jacobi
  • However, some computations, such as SOR, contain
    loop-carried dependences
  • These present an opportunity to exploit
    parallelism through pipelining

25
Pipelined Computation cont.
  • The observation is that for some pipelined
    computations the program order must be changed
  • Fine-grained pipelining interchanges
    cross-processor loops as deeply as possible to
    improve sequential computation, but incurs the
    most communication overhead
  • Coarse-grained pipelining uses strip mining and
    loop interchange to adjust the granularity of the
    pipelining; it decreases communication overhead
    at the expense of some parallelism (see the
    sketch below)
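  A rough sketch (not from the paper) of the coarse-grained pipelining loop
  structure for an SOR-like recurrence whose rows are block-distributed
  across processors; send and recv are stand-in stubs, and STRIP sets the
  granularity trade-off described above:

    import numpy as np

    STRIP = 8     # larger strips: fewer, bigger messages, but the downstream
                  # processor waits longer before it can start

    def send(buf, dest): pass               # placeholder for a real message send
    def recv(src, n): return np.zeros(n)    # placeholder boundary from upstream

    def sweep_pipelined(a, prev_proc, next_proc):
        """a: this processor's block of rows; a[0, :] is a ghost row.
        The recurrence a[i,j] = f(a[i-1,j], a[i,j-1]) is loop-carried in both
        dimensions, so processors form a pipeline along the row dimension."""
        nrows, ncols = a.shape
        for j0 in range(1, ncols, STRIP):             # strip-mined column loop
            j1 = min(j0 + STRIP, ncols)
            if prev_proc is not None:                 # boundary row for this strip
                a[0, j0:j1] = recv(prev_proc, j1 - j0)
            for j in range(j0, j1):                   # compute one strip locally
                for i in range(1, nrows):
                    a[i, j] = 0.5 * (a[i - 1, j] + a[i, j - 1])
            if next_proc is not None:                 # forward this strip's last row
                send(a[nrows - 1, j0:j1], next_proc)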

26
Conclusions
  • A usable and efficient machine independent
    parallel programming model is needed to make
    large-scale parallel machines useful to
    scientific programmers
  • Fortran D, with its data decomposition model,
    performs message vectorization, collective
    communication, fine-grained pipelining, and
    several other optimizations for block-distributed
    arrays
  • The Fortran D compiler will generate efficient
    code for a large class of data parallel programs
    with minimal effort

27
Discussion
  • Q: How is this applicable to sensor networks?
  • A: There is no explicit reference to sensor
    networks, as this paper was written over a decade
    ago. But the authors provide a unified programming
    methodology to distribute data and communicate
    among processors. Replace the processors with
    motes and you will see this is indeed relevant.

28
Discussion cont.
  • Q: What about issues such as fault tolerance?
  • A: Point well taken. If a message is lost, it
    doesn't seem as though the infrastructure is
    there to deal with it. The model could be
    extended with redundant computation, or perhaps
    even checkpointing, but as someone mentioned, the
    memory of motes may be an issue here.
  • Q: They provide a means for load balancing; is
    this even applicable to sensor networks?
  • A: Yes, it is. In sensor networks we want to
    balance the load so that energy isn't completely
    spent on any one mote.