Title: Parallel Programming: Fortran M and HPF
1 Parallel Programming: Fortran M and HPF
2 FM Overview
- FM (like CC++) is a set of extensions to the basic Fortran language
- Language constructs to create tasks and channels
- Constructs to send and receive messages
- Ensures deterministic execution
- Mapping decisions do not affect design
3 Fortran: a short history
- First successful high-level language
- Created by IBM in 1955
- Adopted by most scientific and military institutions (over Assembly)
- ASA standard published in 1966 became FORTRAN 66
- FORTRAN 77 became the stable standard and is
still used by many compilers today
4 Fortran: a short history
- FORTRAN 90/95 (unofficial names) added many of the properties common in other HLLs
- Free-format source code form (column independent)
- Modern control structures (CASE, DO WHILE)
- Records (structures)
- Array notation (array sections, array operators, etc.)
- Dynamic memory allocation
- Derived types and operator overloading
- Keyword argument passing, INTENT (in, out, inout)
- Numeric precision and range control
- Modules
5 Fortran: a short history
- Scientists still use it because
- Variable-dimension array arguments in subroutines
- Built-in complex arithmetic
- A compiler-supported infix exponentiation operator that is generic with respect to both precision and type
- Array notation that allows operations on array sections
6 FM Introduction
- Any valid Fortran program is a valid FM program except
- COMMON must be renamed to PROCESS COMMON
- Compilers usually do this for you
- Emphasizes modularity
7 FM Introduction
- Processes
- Single and Multiple-producer channels
- Process blocks and do-loops
- Sending and receiving messages
- Mapping
- Variable passing
9 Concurrency: Defining Processes
- Tasks are implemented in FM as processes
- A process definition defines its interface to its environment
- program fm_bridge_construction
- INPORT (integer) pi
- OUTPORT (integer) po
- CHANNEL(in=pi, out=po)
10 Concurrency: Defining Processes
- Typed port variables
- INPORT (integer, real) p1
- INPORT (real x(128)) p2
- INPORT (integer m, real x(m)) p3
11 Concurrency: Creating Processes
- An FM program starts out as a single process that spawns additional processes (like CC++)
- PROCESSES
- statement_1
- .
- .
- .
- statement_n
- ENDPROCESSES
12 Concurrency: Creating Processes
- One standard subroutine call can be made (e.g. call subroutine_1(po1))
- All other subroutine calls must be to processes and are made using the PROCESSCALL statement
- PROCESSES
- PROCESSCALL worker(pi1)
- PROCESSCALL worker(pi2)
- PROCESSCALL process_master(po1,po2)
- ENDPROCESSES
13 Concurrency: Creating Processes
- Statements in a PROCESSES block execute concurrently
- The block terminates when all of the child processes return
14 Concurrency: Creating Processes
- Multiple instances of the same process can be created with the PROCESSDO statement
- PROCESSDO i = 1,10
- PROCESSCALL myprocess
- ENDPROCESSDO
15 Concurrency: Creating Processes
- PROCESSDO can be nested
- PROCESSES
- PROCESSCALL master
- PROCESSDO i = 1,10
- PROCESSCALL worker
- ENDPROCESSDO
- ENDPROCESSES
16 Communication
- FM processes cannot share data directly
- Channels can be single-producer, single-consumer
or multiple-producer, single-consumer
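FM itself is not shown running here, but the task/channel model maps onto familiar constructs. A minimal Python sketch of a single-producer, single-consumer channel (an analogy, not FM: `queue.Queue` plays the channel, threads play the processes, and all names are hypothetical):

```python
# Python analogy (not FM): a channel modeled as a FIFO queue
# connecting one producer task to one consumer task.
import queue
import threading

channel = queue.Queue()          # plays the role of an FM channel

def producer(outport):
    for i in range(5):
        outport.put(i)           # like SEND(po) i

def consumer(inport, results):
    for _ in range(5):
        results.append(inport.get())  # like RECEIVE(pi) num

results = []
p = threading.Thread(target=producer, args=(channel,))
c = threading.Thread(target=consumer, args=(channel, results))
p.start(); c.start()
p.join(); c.join()
print(results)  # [0, 1, 2, 3, 4]
```

Because the channel is FIFO and has a single producer, the consumer sees messages in exactly the order sent, which is the deterministic behavior FM guarantees.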
17 Communication: Creating Channels
- Channels are created using the CHANNEL statement
- CHANNEL(in=inport, out=outport)
- Defines both the input port and the output port
18 Communication: Creating Channels
19 Communication: Sending Messages
- A process sends a message by applying the SEND statement to an outport
- OUTPORT (integer, real x(10)) po
- ...
- SEND(po) i, a
20 Communication: Sending Messages
- ENDCHANNEL is used to send an end-of-channel (EOC) message and set the outport variable to null
- SEND is non-blocking (asynchronous)
21 Communication: Receiving Messages
- A process receives a message by using the RECEIVE statement on an inport
- INPORT (integer n, real x(n)) pi
- integer num
- real a(128, 128)
- RECEIVE(pi) num, a(1,offset)
22 Communication: Receiving Messages
- The end=label argument causes execution to continue at the label when an EOC message is received
- PROCESS bridge(pi) ! Process definition
- INPORT (integer) pi ! Argument inport
- integer num ! Local variable
- do while(.true.) ! While not done
- RECEIVE(port=pi, end=10) num ! Receive message
- call use_girder(num) ! Process message
- enddo !
- 10 end ! End of process
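The EOC protocol in the bridge example can be mimicked in plain Python (an analogy, not FM: a sentinel object stands in for the EOC message, and `break` plays the role of the jump to label 10):

```python
# Python analogy (not FM): ENDCHANNEL modeled as a sentinel message;
# the receive loop exits when it sees it, like end=10 in the FM code.
import queue

EOC = object()                  # end-of-channel marker
ch = queue.Queue()

for girder in [3, 1, 4]:        # sender side: SEND(po) num, repeatedly...
    ch.put(girder)
ch.put(EOC)                     # ...then ENDCHANNEL(po)

used = []
while True:                     # receiver side: do while(.true.)
    msg = ch.get()
    if msg is EOC:              # RECEIVE(port=pi, end=10) num
        break                   # continue at label 10
    used.append(msg)            # call use_girder(num)

print(used)  # [3, 1, 4]
```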
23 Unstructured Communication
- The identity of communicating processes changes during program execution
- Many-to-one
- Many-to-many
- Dynamic creation of channels
24 Many-to-One Communication
- FM's MERGER statement creates a FIFO message queue
- Allows multiple outports to reference it
- MERGER(in=inport, out=outport_specifier)
- The outport_specifier can be an outport, a list of outport_specifiers, or an array section from an outport array
25 Many-to-One Communication
- INPORT (integer) pi ! Single inport
- OUTPORT(integer) pos(4) ! Four outports
- MERGER(in=pi, out=pos(:)) ! Merger
- PROCESSES
- call consumer(pi) ! Single consumer
- PROCESSDO i = 1,4
- PROCESSCALL producer(pos(i))
- ENDPROCESSDO
- ENDPROCESSES
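The merger's behavior can be sketched in Python (an analogy, not FM: one shared `queue.Queue` plays the merged channel; the producer and message names are made up):

```python
# Python analogy (not FM): a MERGER modeled as one queue shared by
# four producers; messages interleave nondeterministically, none are lost.
import queue
import threading

merger = queue.Queue()

def producer(i, out):
    for k in range(3):
        out.put((i, k))          # SEND on pos(i)

threads = [threading.Thread(target=producer, args=(i, merger))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()

received = []
while not merger.empty():        # the single consumer drains the stream
    received.append(merger.get())

print(len(received))  # 12 messages; order across producers not guaranteed
```

The interleaving across producers is nondeterministic, which is exactly why MERGER forfeits FM's determinism guarantee, but every message is delivered exactly once and each producer's own messages stay in order.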
26 Many-to-Many Communication
- Similar in implementation to many-to-one code, using multiple mergers
- OUTPORT (integer) pos(3,4) ! 3x4 outports
- INPORT (integer) pis(3) ! 3 inports
- do i = 1,3 ! 3 mergers
- MERGER(in=pis(i), out=pos(i,:)) !
- enddo !
- PROCESSES !
- PROCESSDO i = 1,4 !
- PROCESSCALL producer(pos(1,i)) ! 4 producers
- ENDPROCESSDO !
- PROCESSDO i = 1,3 !
- PROCESSCALL consumers(pis(i)) ! 3 consumers
- ENDPROCESSDO !
- ENDPROCESSES !
27 Dynamic Channel Structures
- I/O ports can be sent via inports or outports
- INPORT (OUTPORT (integer)) pi
- OUTPORT (integer) qo ! Outport
- RECEIVE(pi) qo ! Receive outport
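Sending a port over a channel can be sketched the same way (a Python analogy, not FM: a `queue.Queue` standing in for a reply outport is itself sent through a request channel):

```python
# Python analogy (not FM): a port can itself travel in a message,
# as in INPORT (OUTPORT (integer)) pi above.
import queue

requests = queue.Queue()         # channel whose messages are ports

reply_q = queue.Queue()          # a fresh outport: OUTPORT (integer) qo
requests.put(reply_q)            # SEND the port itself as message data

qo = requests.get()              # RECEIVE(pi) qo: the port arrives
qo.put(42)                       # communicate on the newly received port

answer = reply_q.get()
print(answer)  # 42
```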
29 Asynchronous Communication
- Specialized data tasks are used to read and write data requests
- Can be implemented with the methodology just discussed
- A distributed-data version can be accomplished using PROBE to poll an inport
30 Asynchronous Communication
- PROBE sets a logical flag to denote whether there is a message in an inport queue
- inport (T) requests ! T an arbitrary type
- logical eflag
- do while (.true.) ! Repeat
- call advance_local_search ! Compute
- PROBE(requests, empty=eflag) ! Poll for requests
- if (.not. eflag) call respond_to_requests
- enddo
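The polling pattern above can be sketched in Python (an analogy, not FM: `Queue.empty()` plays the role of PROBE's empty= flag, and the loop is bounded so the sketch terminates):

```python
# Python analogy (not FM): PROBE modeled with Queue.empty(),
# polling for requests between compute steps.
import queue

requests = queue.Queue()
requests.put("status?")          # a pending request from some other task

handled = []
for step in range(3):            # stands in for do while (.true.)
    pass                         # call advance_local_search (compute stub)
    if not requests.empty():     # PROBE(requests, empty=eflag)
        handled.append(requests.get())  # call respond_to_requests

print(handled)  # ['status?']
```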
31 Determinism
- MERGER and PROBE are non-deterministic constructs
- Any program not using these is guaranteed to be deterministic
- The following two fragments are equivalent:
- PROCESSDO i = 1,2
- PROCESSCALL proc(i,x)
- ENDPROCESSDO
- PROCESSES
- PROCESSCALL proc(1,x)
- PROCESSCALL proc(2,x)
- ENDPROCESSES
32 Argument Passing
- By default, variables passed on ports are copied to and from the call
- If a process only reads a value, then there is no need to copy the variable back to the calling process
- Use the INTENT statement to keep from copying back to the calling process
33 Mapping
- Mapping in FM changes execution time but not the correctness of an algorithm
- PROCESSORS specifies the shape and dimension of a virtual processor array
- LOCATION maps processes to processors
- SUBMACHINE specifies that a process should execute in a subset of the array
34 Mapping: Virtual Computers
35 Mapping: Process Placement
- The LOCATION annotation is similar in form to the PROCESSORS statement
- statement LOCATION(index_to_processor_array)
36 Mapping: Process Placement
- Example
- program ring
- parameter(P=3)
- PROCESSORS(P)
- ...
- PROCESSDO i = 1,P
- PROCESSCALL ringnode(i, P, pi(i), po(i)) LOCATION(i)
- ENDPROCESSDO
37 Mapping: Submachines
- Same format as the LOCATION annotation
- SUBMACHINE sets up a new virtual computer within
the current virtual computer (set up by a
PROCESSORS statement or another SUBMACHINE)
38 Performance Issues
- Because FM directly uses the task/channel metaphor, previous techniques can be applied directly
- A SEND incurs only one communication cost (not two), although FM code tends to have more SENDs
39 Performance Issues
- Process creation
- Cost depends on the compiler
- If Unix processes are created, then they can be expensive
- If threads are created, then they are relatively cheap
- Fairness
- Compiler optimization
40 Break
- Turn in proposals if you haven't already
41 Data Parallelism with HPF
- Data parallelism is when the same operation is applied to elements of a data ensemble
- A data-parallel program is a sequence of such operations
42 Data Parallelism: Concurrency
- Data structures operated on in a data-parallel program can be regular (arrays) or irregular (trees, sparse matrices, etc.)
- HPF requires that they be arrays
43 Data Parallelism: Concurrency
- Explicit vs. implicit parallel constructs
- Explicit
- A = B*C ! A, B, C are arrays
- Implicit
- do i = 1,m
- do j = 1,n
- A(i,j) = B(i,j)*C(i,j)
- enddo
- enddo
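The contrast can be illustrated with NumPy, whose whole-array operations behave much like Fortran 90/HPF array syntax (the array shapes and values here are made up for illustration):

```python
# NumPy stands in for Fortran 90/HPF array syntax: the explicit
# whole-array product and the implicit element loops compute the same A.
import numpy as np

m, n = 3, 4
B = np.arange(m * n, dtype=float).reshape(m, n)
C = np.full((m, n), 2.0)

A_explicit = B * C               # explicit: A = B*C, one array operation

A_implicit = np.empty((m, n))    # implicit: element-by-element loops
for i in range(m):
    for j in range(n):
        A_implicit[i, j] = B[i, j] * C[i, j]

print(np.array_equal(A_explicit, A_implicit))  # True
```

The explicit form exposes the concurrency directly; in the implicit form a compiler must recognize that the loop iterations are independent.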
44 Data Parallelism: Concurrency
- HPF compilation can introduce additional communication depending on how data elements are distributed
- real y, s, X(100)
- X = X*y
- do i = 2,99
- X(i) = (X(i-1) + X(i+1))/2
- enddo
- s = SUM(X)
Possible Communication
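One way to see where the communication comes from: assume X(100) is block-distributed over 4 processors (25 elements each, an assumed layout the fragment above does not itself specify). Each update of a block-boundary element then reads a value owned by a neighboring processor:

```python
# Sketch: count off-processor reads in X(i) = (X(i-1) + X(i+1))/2
# under an assumed BLOCK distribution of X(100) over 4 processors.
N, P = 100, 4
block = N // P                        # 25 elements per processor

def owner(i):
    return i // block                 # processor owning element i (0-based)

remote_reads = 0
for i in range(1, N - 1):             # do i = 2,99 in 1-based Fortran
    for j in (i - 1, i + 1):          # the X(i-1) and X(i+1) operands
        if owner(j) != owner(i):
            remote_reads += 1

print(remote_reads)  # 6: two reads per internal block boundary
```

SUM(X) adds a reduction across all processors on top of these neighbor exchanges.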
45 Data Parallelism: Locality
- Obviously, data location can drastically affect the performance of a program
- HPF allows the programmer to dictate how a data structure is to be distributed
- !HPF$ PROCESSORS pr(16)
- real X(1024)
- !HPF$ DISTRIBUTE X(BLOCK) ONTO pr
46 Data Parallelism: Design
- Higher level
- Not required to specify communications
- The compiler determines communication from the specified data distribution
- More restrictive
- Not all algorithms can be specified as
data-parallel
47 Data Parallelism: Languages
- Fortran 90
- High Performance Fortran
48 Fortran 90
- This is a complex language that extends Fortran 77
- Pointers
- User-defined types
- Dynamic storage
- More
- Array assignment
- Array intrinsic functions
49 Fortran 90: Array Assignment Statement
- A typical scalar operation can be applied to arrays
- integer A(10,10), B(10,10), c
- A = B + c
50 Fortran 90: Array Assignment Statement
- Subsets can also be referenced
- Masked array assignment
- WHERE(X /= 0) X = 1.0/X
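A NumPy analogy of the masked assignment (NumPy stands in for Fortran 90 here; boolean-mask indexing plays the role of WHERE):

```python
# NumPy analogy of WHERE(X /= 0) X = 1.0/X: reciprocate only the
# nonzero elements, leaving the zeros untouched.
import numpy as np

X = np.array([2.0, 0.0, 4.0, 0.5])
mask = X != 0.0                      # the WHERE mask
X[mask] = 1.0 / X[mask]              # masked update, no division by zero

print(X.tolist())  # [0.5, 0.0, 0.25, 2.0]
```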
51 Fortran 90: Array Intrinsic Functions
52 Fortran 90: Finite Difference
53 Data Distribution
- Fortran 90's array capabilities provide opportunities for concurrency but not locality
- HPF adds directives that give the programmer some control over locality
- PROCESSORS
- ALIGN
- DISTRIBUTE
54 Data Distribution: Processors
- Same as the PROCESSORS statement in FM except for the directive flag
- !HPF$ PROCESSORS P(32)
- !HPF$ PROCESSORS Q(4,8)
55 Data Distribution: Alignment
- Data elements of different arrays may relate to one another
- The ALIGN directive is used to specify which elements should, if possible, be collocated
- !HPF$ ALIGN array WITH target
56 Data Distribution: Alignment
- Simplest form
- real B(50), C(50)
- !HPF$ ALIGN C(:) WITH B(:)
57 Data Distribution: Alignment
58 Data Distribution: DISTRIBUTE directive
- Each dimension of an array can be distributed to processors in one of three ways
- (none)
- BLOCK(n): block distribution (default n = N/P)
- CYCLIC(n): cyclic distribution (default n = 1)
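The two distribution rules reduce to simple index arithmetic. A Python sketch (not HPF itself; N, P, and the default block size n = N/P are illustrative):

```python
# Which processor owns element i of an N-element array, under the
# default BLOCK and CYCLIC(1) distributions over P processors.
N, P = 8, 4

def block_owner(i, n=N, p=P):
    return i // (n // p)             # BLOCK: contiguous chunks of N/P

def cyclic_owner(i, p=P):
    return i % p                     # CYCLIC(1): deal elements round-robin

print([block_owner(i) for i in range(N)])   # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_owner(i) for i in range(N)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

BLOCK favors locality for nearest-neighbor access patterns; CYCLIC favors load balance when work varies with the index.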
59 Data Distribution: DISTRIBUTE directive
60 HPF Finite Difference
61 More Complex Data Mapping
- Fortran 90 array operations generally require conformable (comparably shaped) arrays
- This is not always the mapping that needs to occur
62 FORALL statement
- Allows more general assignments to sections of an array
- FORALL (i=1:m, j=1:n) X(i,j) = i*j
- FORALL (i=1:n, j=1:n, i<j) Y(i,j) = 0.0
- FORALL (i=1:n) Z(i,i) = 0.0
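NumPy analogies of the three FORALL assignments (an illustration, not HPF; note NumPy's 0-based indexing versus Fortran's 1-based):

```python
# NumPy stand-ins for the three FORALL statements above.
import numpy as np

m, n = 2, 3
i = np.arange(1, m + 1).reshape(m, 1)   # 1-based row indices
j = np.arange(1, n + 1).reshape(1, n)   # 1-based column indices
X = (i * j).astype(float)               # FORALL (i=1:m, j=1:n) X(i,j) = i*j

Y = np.ones((n, n))
iu = np.triu_indices(n, k=1)            # strictly upper triangle: i < j
Y[iu] = 0.0                             # FORALL (..., i<j) Y(i,j) = 0.0

Z = np.ones((n, n))
np.fill_diagonal(Z, 0.0)                # FORALL (i=1:n) Z(i,i) = 0.0

print(X[1, 2])  # X(2,3) in Fortran terms: 6.0
```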
63 FORALL statement
- To maintain determinism, a FORALL statement can only write to an element once
- FORALL (i=1:n) A(Index(i)) = B(i)
64 INDEPENDENT directive
- Modifies a do-loop
- Tells the compiler that each iteration of the do-loop is independent
- !HPF$ INDEPENDENT
- do i = 1,n
- A(Index(i)) = B(i)
- enddo
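The property the directive asserts can be stated directly: the loop writing A(Index(i)) is independent only if Index never repeats a value, so no two iterations write the same element. A small Python sketch of that check (the helper name is made up):

```python
# Iterations of "do i: A(Index(i)) = B(i)" are independent exactly when
# no two iterations target the same element of A.
def is_independent(index):
    return len(set(index)) == len(index)   # no duplicate write targets

print(is_independent([3, 1, 4, 2]))  # True: safe to mark INDEPENDENT
print(is_independent([3, 1, 3, 2]))  # False: two iterations write A(3)
```

The compiler cannot verify this at compile time for an arbitrary Index array, which is why the directive is an assertion by the programmer rather than something the compiler checks.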
65 Discovering Physical Processors
- A call to NUMBER_OF_PROCESSORS() returns the number of physical processors
- !HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS())
66 Discovering Processor Configuration
- PROCESSORS_SHAPE() can be called to determine the connection scheme
- integer Q(SIZE(PROCESSORS_SHAPE()))
67 Discovering Processors: Examples
68 Performance Issues
- Programming skill
- Compiler
- Sequential Bottlenecks
- Communication Costs
69 Performance Issues: Compilation
- HPF compilers typically use the owner-computes rule to decide which processor runs which operation
- Communication operations are then optimized; specifically, message passing is moved out of loops
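A Python sketch of the owner-computes rule (an illustration, not an HPF compiler; BLOCK ownership of A is assumed): the processor that owns the left-hand-side element performs the assignment.

```python
# Owner-computes sketch: for A(i) = B(i) + 1, the iteration for index i
# is assigned to the processor that owns A(i) under an assumed BLOCK map.
N, P = 8, 4

def owner(i):
    return i // (N // P)                 # assumed BLOCK ownership of A

A = [0] * N
B = [i * i for i in range(N)]

work = {p: [] for p in range(P)}         # iterations each processor runs
for i in range(N):                       # conceptually: A(i) = B(i) + 1
    p = owner(i)                         # owner of the LHS element A(i)
    work[p].append(i)
    A[i] = B[i] + 1                      # executed (conceptually) by p

print(work[0])  # processor 0 computes the elements of A it owns: [0, 1]
```

If B were distributed differently from A, the owner of A(i) would first have to fetch B(i) from its owner, which is the communication the compiler then tries to hoist out of loops.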
70 Performance Issues: Sequential Bottlenecks
- These occur when
- the programmer has not provided enough opportunities for parallelization
- concurrency is implicit and the compiler fails to recognize it
71 Performance Issues: Communication Costs
- F90 and HPF can incur a great deal of communication cost
- Intrinsics and array operations can use values across an entire array or multiple arrays
- Non-aligned arrays
73 Performance Issues: Communication Costs
- Switching decompositions at procedure boundaries
- Compiler optimizations may not use suggestions
made by the programmer
74 Summary
- Fortran M
- Data parallelism
- Fortran 90
- High Performance Fortran