DATA PARALLEL LANGUAGES Chapter 4b

Provided by: johnni2

Learn more at: https://www.cs.kent.edu

1
DATA PARALLEL LANGUAGES (Chapter 4b)

MultiC,
Fortran 90, and HPF

2
The MultiC Language

References
The multiC Programming Language, Preliminary
Documentation, WaveTracer, PUB-00001-001-00.80,
Jan. 1991.
The multiC Programming Language, User
Documentation, WaveTracer, PUB-00001-001-1.02,
June 1992.
Note: This presentation is based on the 1991
manual unless otherwise noted (e.g., "manuals"
refers to both versions).
MultiC is the language used on the WaveTracer and
the Zephyr SIMD computers.
The Zephyr is a second-generation WaveTracer, but
was never commercially available.
We were given 10 Zephyrs and several other
incomplete Zephyrs to use for spare parts.
A version of MultiC was designed for their
third-generation computer, but neither was released.
Both MultiC and a parallel language designed for
the MasPar are fairly similar to an earlier
parallel language called C*.
C* was designed by Guy Steele for the Connection
Machine.
All are data parallel and extensions of the C
language.
An assembler was also written for the WaveTracer
(and probably the Zephyr).
It was intended for use only by company
technicians.

Information about the assembler was released to
WaveTracer customers on a need-to-know basis.
No manual was distributed, but some details were
recorded in a short report.
Professor Potter was given some details needed to
put the ASC language on the WaveTracer.
MultiC is an extension of ANSI C, as documented
in the following book:
The C Programming Language, Second Edition, 1988,
Kernighan and Ritchie.
The WaveTracer computer is called a Data
Transport Computer (DTC) in the manual:
a large amount of data can be moved in parallel
using interprocessor communications.
Primary expected uses for WaveTracer were
scientific modeling and scientific computation
acoustic waves
heat flow
fluid flow
medical imaging
molecular modeling
neural networks
The 3-D applications are supported by a 3D mesh
on the WaveTracer
Done by sampling a finite set of points (nodes)
in space.

4
WaveTracer Architecture Background

Architecture for Zephyr is fairly similar
Exceptions will be mentioned whenever known
Each board has 4096 bit-serial processors, which
can be connected in any of the following ways
16x16x16 cube in 3D space
64x64 square in 2D space
4096 array in 1D space
The 3D architecture is native on the WT and the
other networks are supported in hardware using
primarily the 3D hardware
The Zephyr probably has a 2D network and only
simulates the more expensive 3D network using
system software.
WaveTracer was available in 1, 2, or 4 boards,
arranged as follows
2 boards were arranged as a 16x32x16 cube
one cube stacked on the top of another cube
8192 processors overall

5
WaveTracer Architecture (Cont)

Four boards are arranged as a 32x32x16 cube
16,384 processors
Arranged as two columns of stacked cubes
Computer supports automatic creation of virtual
processors and network connections to connect
these virtual processors.
If each processor supports k nodes, this slows
down execution speed by a factor of k
Each processor performs each operation k times.
Limited by the amount of memory required for each
virtual node
In practice, slowdown is usually less than k
The set of virtual processors supported by a
physical processor is called its territory.
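
The factor-of-k slowdown can be pictured with a small Python model. This is only an illustrative sketch: the names `territories` and `parallel_step` and the contiguous assignment of virtual nodes to physical processors are assumptions made here, not the WaveTracer's actual mapping.

```python
def territories(num_virtual, num_physical):
    """Split virtual node ids into one contiguous territory per physical PE."""
    k = -(-num_virtual // num_physical)  # ceiling division: nodes per PE
    return [list(range(p * k, min((p + 1) * k, num_virtual)))
            for p in range(num_physical)]


def parallel_step(values, num_physical, op):
    """Apply op to every virtual node; return the results and the sub-step count."""
    terr = territories(len(values), num_physical)
    k = max(len(t) for t in terr)
    result = list(values)
    for step in range(k):          # k sequential sub-steps per operation
        for t in terr:             # conceptually simultaneous across all PEs
            if step < len(t):
                result[t[step]] = op(values[t[step]])
    return result, k


# Hypothetical sizes: 8 virtual nodes on 4 physical PEs, so k = 2.
doubled, substeps = parallel_step(list(range(8)), 4, lambda v: 2 * v)
print(doubled, substeps)
```

Each physical processor walks through its own territory one node at a time, which is why one parallel operation costs k sub-steps in this model.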

6
Specifiers for MultiC Variables

Any datatype in C except pointers can be declared
to be a parallel variable using the declaration
multi
This replicates the data object for each
processor to produce a 1-, 2-, or 3-dimensional data
object.
In a parallel execution, all multi objects must
have the same dimension.
The multi declaration follows the same format as
ANSI C, e.g.,
multi int imag, buffer;
The uni declaration is used to declare a scalar
variable.
It is the default and need not be shown.
The following are equivalent:
uni int ptr;
int ptr;
Bit-Length Variables
Can be of type uni or multi
Allow the user to save memory
All operations can be performed on these
bit-length values
Example: A 2-color image can be declared by
multi unsigned int image:1;
and an 8-color image by
multi unsigned int picture:3;

7
Some Control Flow Commands

For uni type data structures, control flow in
MultiC is identical to that in ANSI C.
The parallel IF-ELSE Statement
As in ASC, both the IF and ELSE portions of the
code are executed.
As with ASC, the IF is a mask-setting operation
rather than a branching command.
FORMAT: Same as for C
WARNING: In contrast to ASC, both sets of
statements are always executed.
Even if no responders are active in one part, the
sequential commands in that part are executed.
Example: count = count + 1;
The parallel WHILE statement
The format used is
while (condition)
The repetition continues as long as condition
is satisfied by one or more responders.
Only those responders (i.e., the ones that satisfied
condition prior to this pass through the
body of the while) are active during the execution
of the body of the while.
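
The mask-setting behavior of the parallel IF-ELSE and WHILE can be imitated in plain Python, with one list element standing in for each PE. This is a sketch under assumptions, not MultiC: the function names are made up here, and it assumes that a PE which fails the WHILE test stays inactive for the rest of the loop.

```python
def parallel_if_else(data, cond, then_op, else_op):
    """Run both branches over data; cond only selects which PEs each one updates."""
    then_mask = [cond(v) for v in data]
    else_mask = [not m for m in then_mask]
    out = list(data)
    branches_run = 0
    for mask, op in ((then_mask, then_op), (else_mask, else_op)):
        branches_run += 1  # scalar (uni) statements here run even with no responders
        out = [op(v) if active else v for v, active in zip(out, mask)]
    return out, branches_run


def parallel_while(data, cond, body):
    """Repeat body on the PEs that still satisfy cond until no responder is left."""
    out = list(data)
    active = [True] * len(out)
    while True:
        # Assumption: a PE that fails the test drops out for the rest of the loop.
        active = [a and cond(v) for a, v in zip(active, out)]
        if not any(active):
            break
        out = [body(v) if a else v for v, a in zip(out, active)]
    return out


values, branches = parallel_if_else([1, 5, 8, 3], lambda v: v > 4,
                                    lambda v: v * 10, lambda v: -v)
print(values, branches)
print(parallel_while([1, 4, 7], lambda v: v < 6, lambda v: v + 2))
```

The counter `branches_run` plays the role of a uni statement placed inside each branch: it is incremented twice no matter how the responders are split.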

8
Other Commands

Jump Statements
goto, return, continue, break
These commands are in conflict with structured
programming and should be used with restraint.
Parallel Reduction Operators
*=   Accumulative Product
/=   Reciprocal Accumulative Product
+=   Accumulative Sum
-=   Negate then Accumulative Sum
&=   Accumulative bitwise AND
|=   Accumulative bitwise OR
>?=  Accumulative Maximum
<?=  Accumulative Minimum
Each of the above reduction operations returns a
uni value and provides a powerful arithmetic
operation.
Each accumulative operation would otherwise
require one or more ANSI C loop constructs.
Example: If A is a multi data type
largest_value = >?= A;
smallest_value = <?= A;
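
To make the reductions concrete, the sketch below shows what most of them compute, using a Python list for the multi variable and a second list for the activity mask. The helper `reduce_active` is hypothetical, and returning None when no PE is active is an assumption of this sketch, since the slides do not say what MultiC does in that case. The reciprocal accumulative product is left out because its exact definition is not given here.

```python
from functools import reduce
import operator


def reduce_active(values, mask, op, transform=lambda v: v):
    """Combine the values held by active PEs into one scalar (None if no PE is active)."""
    active = [transform(v) for v, m in zip(values, mask) if m]
    return reduce(op, active) if active else None


A = [3, 9, 4, 7]
mask = [True, True, False, True]   # the third PE is inactive

total = reduce_active(A, mask, operator.add)                       # accumulative sum
negated_total = reduce_active(A, mask, operator.add,
                              transform=operator.neg)              # negate, then sum
product = reduce_active(A, mask, operator.mul)                     # accumulative product
all_bits = reduce_active(A, mask, operator.and_)                   # bitwise AND
any_bits = reduce_active(A, mask, operator.or_)                    # bitwise OR
largest_value = reduce_active(A, mask, max)                        # accumulative maximum
smallest_value = reduce_active(A, mask, min)                       # accumulative minimum
print(total, largest_value, smallest_value)
```

Each call replaces the explicit loop over elements that a sequential C program would need.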

Data Replication
Example
multi int A = 0;
...
A = 2;
First statement stores 0 in every A field
(compile time).
Last statement stores 2 in the A field of every
active PE.
Interprocessor Communications
Operators have the form
[dx; dy; dz]m
This operator shifts the components of the
multi variable m of all active processors along
one or more coordinate dimensions.
Example: A = [-1; 2; 1]B;
Causes each active processor to move the data in
its B field to the A field of the processor at
the following location:
one unit in the negative X direction
two units in the positive Y direction
one unit in the positive Z direction
Coordinate Axes

Conventions
If the value of dz is not specified, it is
assumed to be 0.
If the values of dy and dz are not
specified, both are assumed to be 0.
Example: [x; y]V is the same as [x; y; 0]V
Inactive processor actions
An inactive processor does not send its own data
to another processor.
It still participates in moving the data from other
processors along.
Transmission of data occurs in lock step (SIMD
fashion) without congestion or buffering.
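
The shift operation can be modeled in Python with a dictionary keyed by (x, y, z) coordinates. This is a hedged sketch: `make_grid` and `shift` are names invented here, and the treatment of the mesh boundary (data leaving the mesh is dropped, and a PE that receives nothing keeps its old value) is an assumption, because the slides do not describe edge behavior.

```python
from itertools import product


def make_grid(nx, ny, nz, fill):
    """Build a multi variable as a dict keyed by PE coordinates."""
    return {(x, y, z): fill(x, y, z)
            for x, y, z in product(range(nx), range(ny), range(nz))}


def shift(A, B, active, dx, dy=0, dz=0):
    """Return the new A after every active PE sends its B value to the PE at the offset."""
    new_A = dict(A)
    for (x, y, z), value in B.items():
        if not active[(x, y, z)]:
            continue                      # inactive PEs do not send their own data
        target = (x + dx, y + dy, z + dz)
        if target in new_A:               # assumption: data leaving the mesh is dropped
            new_A[target] = value
    return new_A


B = make_grid(3, 3, 3, lambda x, y, z: 100 * x + 10 * y + z)
A = make_grid(3, 3, 3, lambda x, y, z: -1)
active = make_grid(3, 3, 3, lambda x, y, z: True)

# Offsets -1, 2, 1 as in the example above: PE (1,0,0) sends to PE (0,2,1).
A = shift(A, B, active, -1, 2, 1)
print(A[(0, 2, 1)])
```

The default arguments `dy=0` and `dz=0` mirror the convention that unspecified offsets are taken to be 0.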
Coordinate Functions
Used to return a coordinate for each active
virtual processor.
Format: multi_x(), multi_y(), and multi_z()
Example
if (multi_x() == 0 && multi_y() == 2 && multi_z() == 1)
    u = += A;
Note that all processors except the one at
(0,2,1) are inactive within the body of the IF.
The accumulated sum of the active components of
the multi variable A is just the value of the
component of A at processor (0,2,1).
Effect of this example is to store the value of A
at (0,2,1) in the uni variable u.

If the second command in the example is changed
to
A = u;
the effect is to store the contents of the uni
variable u
into multi variable A at location (0,2,1).
(See manual pp. 11-13 and 11-14 for more details.)
Arrays
Multi-pointers are not supported.
Cannot have a parallel variable containing a
pointer in each component of the array.
uni pointers to multi variables are allowed.
Array Examples
int array_1[10];
int array_2[5][5];
multi int array_3[5];
array_1 is a 1-dimensional standard C array
array_2 is a 2-dimensional standard C array
array_3 is a 1-dimensional array of multi
variables
MULTI_PERFORM Command
Command gives the size of each dimension of all
multi-values prior to calling for a parallel
execution.
Format

multi_perform is normally called within the main
program.
Usually calls a subroutine that includes all of
the
parallel work
parallel I/O
The main program usually includes
Opening and closing of files
Some of the scalar I/O
define and include statements
When multi_perform is called, it initializes any
extern and static multi objects
In the previous example, multi_perform calls
func. After func returns, the multi space created
for it becomes undefined.
The perror function is extended to print error
messages corresponding to errno numbers resulting
from the execution of MultiC extensions.
Has the following format
if (multi_perform(func, x, y, z)) perror(argv[0]);
See usage in the examples in Appendix A.
More information on page 11-2 of the manual.
Examples in Manual
Many examples in the manual
17 in appendices alone
Also stored under exname.mc in the MultiC package

13
The AnyResponder Function

Code Segment for Tallying Responders
unsigned int not_tall, tall;
multi float height;
load_height   /* assigns value in inches to height */
if (height > 70)
    tall = += (multi int)1;
else
    not_tall = += (multi int)1;
printf("There are %d tall people\n", tall);
Comments on Code Segment
Note that the construct
+= (multi int)1
counts the active PEs (i.e., responders).
This technique avoids setting up a bit field to
use to tally active PEs.
Instead it sets up a temporary multi variable.
Can be used to see whether there is at least one
responder at any given time.
Check to see if the resulting sum is positive.
Provides a technique to define the AnyResponder
function needed for associative programming.
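
The tallying idea and the AnyResponder test built on it can be sketched in Python, with a list of booleans standing in for the activity mask. The function names and the sample heights are made up for illustration.

```python
def count_responders(mask):
    """Sum a 1 contributed by every active PE."""
    return sum(1 for active in mask if active)


def any_responder(mask):
    """True when at least one PE is active, i.e. the tally is positive."""
    return count_responders(mask) > 0


height = [72.0, 65.5, 70.0, 74.2]            # hypothetical heights in inches
tall_mask = [h > 70 for h in height]         # the IF sets the mask
tall = count_responders(tall_mask)
not_tall = count_responders([not m for m in tall_mask])   # the ELSE part
print("There are %d tall people" % tall)
```

Checking `any_responder` before a block of parallel work is the usual way an associative program decides whether a search found anything.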

14
Accessing Components from Multi Variables

Code from page 11-13 or 11-14 of the MultiC manuals
#include <multi.h>    /* includes multi library */
#include <stdlib.h>
#include <stdio.h>
void work(void)
{
    uni int a, b, c, u;
    multi int n;
    /* Code goes here to assign values to n */
    /* Code goes here to assign values to a, b, c */
    if (multi_x() == a && multi_y() == b && multi_z() == c)
        u = += n;    /* Assigns value of n at PE(a,b,c) to u */
}
int main(int argc, char *argv[])
{
    if (multi_perform(work, 7, 7, 7))
        perror(argv[0]);
    exit(EXIT_SUCCESS);
}

15
The oneof and next Functions

Function oneof provides a way of selecting one
out of several active processors.
Defined in the Multi Struct program (A.15) in the manual
Procedure is essential for associative
programming.
Code for oneof
multi unsigned oneof(void):1
{
    /* Store the coordinate values in multi variables x and y */
    multi unsigned x = multi_x(),
                   y = multi_y(),
                   uno:1 = 0;
    /* Next select processor with highest coordinate value */
    if (x == >?= x)
        if (y == >?= y)
            uno = 1;
    return uno;
}
Note that multi variable uno stores a 1 for
exactly one processor and all the other
components of uno store a 0.
The function oneof can be used by another
procedure which is called by multi_perform.
An example of oneof being called by another
procedure is given on pages A46-50 of the
manuals.
Should be usable in the form
if (oneof())   /* Check to see if an active responder exists */

The preceding procedure assumed a 2D configuration
of processors with z = 1.
If the configuration is 3D, the process of selecting
the coordinates can be continued by also
selecting the highest z-coordinate.
Stepping through the active PEs (i.e., next)
Provides the MultiC equivalent of the ASC next
command.
An additional one-bit multi integer variable
called bi (for busy-idle) is needed.
First, set bi to zero.
Activate the PEs you wish to step through.
Next, have the active PEs write a 1 into bi.
Use
if (oneof())
to restrict the mask to one of the active PEs.
Perform all desired operations with the active PE.
Have the active PE set its bi value to 0 and then
exit the preceding if statement.
Use the += (accumulative sum) operator to see
if any PEs remain to be processed.
If so, return to the step above that calls oneof.
This return can be implemented using a while loop.
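
The selection rule of oneof and the busy-idle stepping loop can be modeled in Python as follows. This is a sketch of the logic only, for a 2D configuration: the coordinate list, the `visit` callback, and the function names are assumptions introduced for illustration.

```python
def oneof(coords, active):
    """Return a mask that keeps exactly one active PE: highest x, then highest y."""
    live = [c for c, a in zip(coords, active) if a]
    if not live:
        return [False] * len(coords)
    max_x = max(x for x, _ in live)
    max_y = max(y for x, y in live if x == max_x)
    return [a and c == (max_x, max_y) for c, a in zip(coords, active)]


def step_through(coords, active, visit):
    """Visit the active PEs one at a time, clearing each PE's busy-idle bit afterwards."""
    bi = list(active)               # busy-idle flag: True for PEs still to be processed
    results = []
    while sum(bi) > 0:              # accumulative sum tells whether any PE remains
        chosen = oneof(coords, bi)
        i = chosen.index(True)
        results.append(visit(coords[i]))
        bi[i] = False               # the selected PE marks itself as done
    return results


coords = [(x, y) for x in range(2) for y in range(2)]
active = [True, False, True, True]
print(oneof(coords, active))
print(step_through(coords, active, lambda c: c))
```

The while loop here is the "return to the step above" from the list: it keeps calling `oneof` until the tally of busy PEs reaches zero.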

17
Sequential Printing of Multi Variable Values

Example: Print a block of the image 2D bit array.
A function select_int is used which will return
the value of image at the specified (x,y,z)
coordinate.
The printing occurs in two loops, which
increment the value of x from 0 to some
specified constant and
increment the value of y from 0 to some
specified constant.
This example is from page 8-1 of the manuals and
is used in an example on pp. A16-18 of the 1991
manual and pp. A12-14 of the 1992 manual.
The select_int function
int select_int(multi *mptr, int x, int y, int z)
/* Here, mptr is a uni pointer to type multi */
{
    int r;
    if (multi_x() == x &&
        multi_y() == y &&
        multi_z() == z)
        /* Restricts scope to the one PE at (x,y,z) */
        r = |= *mptr;
    /* OR reduction operator transfers binary value
       of multi variable at (x,y,z) to the uni variable */
    return r;
}

The two loops to print a block of values of the
image multi variable:
for (y = 0; y < ysize; y++) {
    for (x = 0; x < xsize; x++)
        printf("%d", select_int(image, x, y, z));
    printf("\n");
}
The above technique can be adapted to print or read
multi variables or parts of multi variables.
Efficient as long as the number of locations
accessed is small.
If I/O operations involve large amounts of
data, the more efficient data transfer functions
described in the manuals (Chapter 8 and Sections 11.2
and 11.13) should be used.
The functions multi_fread and multi_fwrite are
analogous to fread and fwrite in C. Information
about them is given on pages 11-1 to 11-4 of the
manuals.

19
Moving Data between Uni Arrays and Multi Variables

The following functions allow the user to move
data between uni arrays and multi variables
multi_from_uni ...
multi_to_uni ...
The "..." above may be replaced with a data type
such as
char
short
int
long
float
double
cfloat
cdouble
These functions are illustrated in several of the
examples.

20
Compiling and Executing Programs on the Zephyr

A 4k Zephyr machine is available for use in the
Parallel and Associative Computing Lab.
It is presently connected to a Windows 2003
Server which supports remote desktop for
interactive use. However, you may use the
computer directly at the console while the lab is
open
Visual Studio 2002 has been installed on the
server. The MultiC language uses a compiler
wrapper to translate MultiC code into Visual C
code.
Programming the Zephyr on a Windows 2003 system
is similar to that using command line programming
tools in UNIX.
You can edit your program using Edit or
Notepad
You can compile and create an executable using
nmake
You can execute your program using the Visual
Studio Command Shell
This is a special DOS shell that has extra path,
include, and library environment variables used
by the compiler and linker.

21
Compiling and Executing Programs on the Zephyr

Log in or use Remote Desktop Connection to
zserver.cs.kent.edu
From Windows XP choose Start | Programs |
Accessories | Communications | Remote Desktop
Connection
Enter your login name and password and click on
OK
Open a command window and run the DTC Monitor
program
Type dtcmonitor at the command prompt.
This is a daemon program that serializes and
controls executables using the Zephyr.
When this is 100% complete, you can then execute
programs on the Zephyr.
You can minimize this command shell.
Important: When you are finished, enter CTRL-C to
end the dtcmonitor.
Create a folder on your desktop for programs.
You can copy the example Zephyr MultiC program
from D:\Common\zephyrtest to your local folder
and rename it for your programming assignment.

22
Compiling and Executing Programs on the Zephyr

Create or edit your MultiC program using DOS edit
or Windows Notepad.
From the Visual Studio Command Shell type either
edit anyprog.mc
notepad anyprog.mc
Make sure that the file extension is .mc
Save your work before compiling
Modify the makefile template and change the names
of the MultiC file and object file to those used
in your programming assignment.
Compile and link your program by typing
nmake /f anyprog.mak
nmake (for the default Makefile)
Execute your program by typing the name of your
executable at the command prompt.
When you are finished enter CTRL-C to end the
dtcmonitor.

23
OMIT FOR PRESENT(Multi-C Recursion)

It is possible to write recursive multi
functions in MultiC, but you have to test whether
there are active PEs still working.
Consider the following MultiC function
multi int factorial(multi int n)
{
    multi int r;
    if (n != 1)
        r = factorial(n-1) * n;
    else
        r = 1;
    return r;
}
What happens?

24
OMIT FOR PRESENT (MultiC Recursion Example)

Recursion
multi int factorial(multi int n)
{
    multi int r;
    /* stop calculating if every component has been computed */
    if (!(+= (multi int)1))
        return (multi int)0;
    /* otherwise, continue calculating */
    if (n > 1)
        r = factorial(n-1) * n;
    else
        r = 1;
    return r;
}

25
Fortran 90 and HPF (High Performance Fortran)

A de facto standard for scientific and
engineering computations

26
Fortran 90 AND HPF

References
[19] Ian Foster, Designing and Building Parallel
Programs (online copy), Chapter 7.
[8] Jordan and Alaghband, Fundamentals of
Parallel Processing, Section 3.6.
Recall that data parallelism refers to the concurrency
that occurs when the same operation is
executed on some or all elements of a data set.
A data parallel program is a sequence of such
operations.
Fortran 90 (or F90) is a data-parallel programming
language.
Some job control algorithms cannot be expressed
in a data parallel language.
F90's array assignment statement and array
functions can be used to specify certain types of
data parallel computation.
F90 forms the basis of HPF (High Performance
Fortran), which augments F90 with a small set of
extensions.
In F90 and HPF, the (parallel) data structures
operated on are restricted to arrays.
E.g., data types such as trees, sets, etc. are
not supported.
All array elements must be of the same type.
Fortran arrays can have up to 7 dimensions.

Parallelism in F90 can be expressed explicitly,
as in the array assignment statement
A = B*C    ! A, B, C are arrays
Compilers may be able to detect implicit
parallelism, as in the following example
do i = 1,m
  do j = 1,n
    A(i,j) = B(i,j)*C(i,j)
  enddo
enddo
Parallel execution of the above code depends on the
fact that the iterations of the do-loops are independent,
i.e., no iteration writes a variable
that another iteration reads or writes.
Compilation can also introduce communication
operations when the computation mapped to one PE
requires data mapped to another PE.
Communication operations in F90 (and HPF) are
inferred by the compiler and do not need to be
specified by the programmer.
These are derived by the compiler from the data
decomposition specified by the programmer.
F90 allows a variety of scalar operations (i.e.,
operations defined on a single value) to be applied to an
entire array.

All of F90's unary and binary operations can be
applied to arrays as well, as illustrated in
the examples below
real A(10,20), B(10,20), c
logical L(10,20)
A = B + c
A = A + 1.0
A = sqrt(A)
L = A .EQ. B
The function of the mask is handled in F90 by the
where statement, which has two forms.
The first form uses the where to restrict the array
elements on which an assignment is performed.
For example, the following replaces each nonzero
entry of array x with its reciprocal
where (x /= 0) x = 1.0/x
The second form of the where is block structured
and has the form
where (mask-expression)
  array_assignment
elsewhere
  array_assignment
end where
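
The masked assignment performed by where can be mimicked in Python to show which elements are touched. This is a rough model rather than Fortran: `where_assign` is a hypothetical helper, and a plain list stands in for the array.

```python
def where_assign(x, mask_fn, then_fn, else_fn=None):
    """Apply then_fn where the mask holds, else_fn (if given) elsewhere."""
    out = []
    for v in x:
        if mask_fn(v):
            out.append(then_fn(v))      # right-hand side is evaluated only under the mask
        elif else_fn is not None:
            out.append(else_fn(v))      # the elsewhere part of the block form
        else:
            out.append(v)               # unmasked elements are left unchanged
    return out


x = [2.0, 0.0, 4.0]

# First form: reciprocal of each nonzero entry; the zero entry is never divided.
reciprocals = where_assign(x, lambda v: v != 0, lambda v: 1.0 / v)

# Block form with an elsewhere part that stores -1.0 in the masked-out entries.
with_elsewhere = where_assign(x, lambda v: v != 0, lambda v: 1.0 / v,
                              else_fn=lambda v: -1.0)
print(reciprocals, with_elsewhere)
```

Because the right-hand side is only evaluated under the mask, the division by zero that an unmasked array expression would attempt never happens.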

29
Some F90 Array Intrinsic Functions

Array intrinsic functions below assume a vector
version of an array is formed using column
major ordering
Some F90 array intrinsic functions
RESHAPE(A,...) converts array A into a new array
with specified shape and fill
PACK(A, MASK, FILL) forms a vector from masked
elements of A, using fill as needed.
UNPACK(A,MASK, FILL) replaces masked elements
with elements from FILL vector
MERGE(A, B, MASK) returns array of masked A
entries and unmasked entries of B
SPREAD(A, DIM, N) replicates array A, using N
copies to form a new array of one larger
dimension
CSHIFT(A, SHIFT, DIM) circular shift (rotation) of
the elements of A along dimension DIM
EOSHIFT(A,...) elements of A are shifted off the
end along the specified dimension, with the vacated end
positions filled from either a specified scalar or an array
of dimension 1 less than A
TRANSPOSE(A) returns the transpose of array A.
Some array intrinsic functions that perform
computation
MAXVAL(A) returns the maximum value of A
MINVAL(A) returns the minimum value of A
SUM(A) returns the sum of the elements of A
PRODUCT(A) returns the product of the elements of A
MAXLOC(A) returns the indices of the max value in A
MINLOC(A) returns the indices of the min value in A
MATMUL(A,B) returns the matrix product of A and B

30
The HPF Data Distribution Extension

Reference: [19] Ian Foster, Designing and
Building Parallel Programs (online copy),
Chapter 7.
F90 array expressions specify opportunities for
parallel execution but give no control over how to
perform them so that communication is minimized.
HPF handling of data distribution involves three
directives
The PROCESSORS directive specifies the shape and
size of the array of abstract processors.
The ALIGN directive is used to align elements of
different arrays with each other, indicating that
they should be distributed in the same manner.
The DISTRIBUTE directive is used to distribute an
object (and all objects aligned with it) onto an
abstract processor array.
The data distribution directives can have a major
impact on a program's performance (but not on the
results computed), affecting
Partitioning of data to processors
Agglomeration: considering the value of combining
tasks to produce fewer, larger tasks
Communications required to coordinate task
execution
Mapping of tasks to processors

31
HPF Data Distribution (Cont.)

Data distribution directives are recommendations
to an HPF compiler, not instructions.
The compiler can ignore them if it determines that
this will improve performance.
PROCESSORS directive
Creates an arrangement for abstract processors
and gives this arrangement a name.
Example: !HPF$ PROCESSORS P(4,8)
Normally one abstract processor is created for
each physical processor.
There could be more abstract processors than
physical ones.
However, HPF does not specify a way of mapping
abstract to physical processors.
ALIGN Directive
Specifies array elements that should, if
possible, be mapped to the same processor.
Operations involving data objects that are
aligned are likely to be more efficient due to
reduced communication costs if on the same PE.
EXAMPLE
real B(50), C(50)
!HPF$ ALIGN C(:) WITH B(:)

32
HPF Data Distribution (Cont.)

ALIGN Directive (cont.)
A * can be used to collapse dimensions (i.e.,
to match one element with many elements).
Considerable flexibility is allowed in specifying
which array elements are to be aligned.
Dummy variables can be used for dimensions.
Integer formulas can be used to specify offsets.
An align statement can be used to specify that
elements of an array should be replicated over
certain processors.
Costly if replicated arrays are updated often.
Increases communication or redundant computation.
DISTRIBUTE Directive
Indicates how data are to be distributed among
processor memories.
Specifies for each dimension of an array one of
three ways that the array elements will be
distributed among the processors
*         No distribution
BLOCK(n)  Block distribution
(default n = N/P)
CYCLIC(n) Cyclic distribution
(default n = 1)

33
HPF Data Distribution (Cont.)

DISTRIBUTE Directive (cont.)
Block distribution divides the items/indices in
that dimension into equal-sized blocks of size
N/P.
Cyclic distribution maps every Pth index to the
same processor.
Applies not only to the named array but also to
any array that is aligned to it.
The following DISTRIBUTE directive specifies a
mapping for all three arrays.
!HPF$ PROCESSORS p(20)
real A(100,100), B(100,100), C(100,100)
!HPF$ ALIGN B(:,:) WITH A(:,:)
!HPF$ DISTRIBUTE A(BLOCK,*) ONTO p
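
The difference between block and cyclic distribution is easiest to see by computing which processor owns each index. The Python sketch below does this for one dimension. It uses 0-based indices and processor numbers, rounds the default block size N/P up when P does not divide N, and uses function names chosen only for illustration.

```python
def block_owner(i, n_items, n_procs, block=None):
    """Processor that owns 0-based index i under a block distribution."""
    if block is None:
        block = -(-n_items // n_procs)   # default block size N/P, rounded up
    return i // block


def cyclic_owner(i, n_procs, block=1):
    """Processor that owns 0-based index i under a cyclic distribution."""
    return (i // block) % n_procs


N, P = 8, 4
print([block_owner(i, N, P) for i in range(N)])
print([cyclic_owner(i, P) for i in range(N)])
print([cyclic_owner(i, 2, block=2) for i in range(N)])
```

With 100 rows distributed by block over 20 processors, as in the directive example above, each processor holds 5 consecutive rows, so the last row belongs to the last processor.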

34
HPF Concurrency

The F90 array assignment statements provide a
convenient way of specifying data parallel
operations.
However, this does not apply to all data parallel
operations, as the array on the right-hand side must
have the same shape as the one on the left-hand
side.
HPF provides two other constructs to exploit data
parallelism, namely the FORALL statement and the
INDEPENDENT directive.
The FORALL Statement
Allows more general assignments to sections of
an array.
General form is
FORALL (triplet, ... , triplet, mask) assignment
Examples
FORALL (i=1:m, j=1:n) X(i,j) = i+j
FORALL (i=1:n, j=1:n, i<j) Y(i,j) = 0.0
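
A FORALL can be read as "evaluate every right-hand side, then store all results". The Python sketch below imitates that for the two examples above, which are assumed here to assign i + j to X(i,j) and to zero the elements of Y with i < j. The helper `forall` is hypothetical and works on nested lists with 1-based index values.

```python
from itertools import product


def forall(array, ranges, rhs, mask=lambda *idx: True):
    """Assign rhs(i, j) to element (i, j) for every selected 1-based index pair."""
    # Evaluate every right-hand side first, then store, so no assignment
    # can observe a value written by another index combination.
    pending = [(idx, rhs(*idx)) for idx in product(*ranges) if mask(*idx)]
    for (i, j), value in pending:
        array[i - 1][j - 1] = value
    return array


m, n = 2, 3
X = forall([[0] * n for _ in range(m)],
           [range(1, m + 1), range(1, n + 1)],
           lambda i, j: i + j)

Y = forall([[1.0] * 3 for _ in range(3)],
           [range(1, 4), range(1, 4)],
           lambda i, j: 0.0,
           mask=lambda i, j: i < j)
print(X)
print(Y)
```

The mask argument corresponds to the optional mask expression in the general form: index combinations that fail it are simply skipped.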
The INDEPENDENT Directive and Do-Loops
The INDEPENDENT directive can be used to assert
that the iterations of a do-loop can be performed
independently, that is
They can be performed in any order
They can be performed concurrently
The INDEPENDENT directive must immediately
precede the do-loop that it applies to.
Examples of independent and non-independent
do-loops are given in [19], Foster, pp. 258-9.

35
Additional HPF Comments

An HPF program typically consists of a sequence of
calls to subroutines and functions.
The data distribution that is best for a
subroutine may be different from the data
distribution used in the calling program.
Two possible strategies for handling this
situation are
Specify a local distribution using DISTRIBUTE and
ALIGN, even if this requires expensive data
movement on entering.
The cost normally occurs on return as well.
Use whatever data distribution is used in the
calling program, even if not optimal. This
requires use of the INHERIT directive.
Both F90 and HPF intrinsic functions (e.g., SUM,
MAXVAL) combine data from entire arrays and
involve considerable communication.
Some other F90/HPF intrinsic functions such as
DOT_PRODUCT involve communication cost only if
their arguments are not aligned.
Array operations involving the FORALL statement
can result in communication if the computation of
a value for an element A(i) requires data values
that are not on the same processor (e.g., B(j)).