PRAM and Basic Algorithms - PowerPoint PPT Presentation

1
PRAM and Basic Algorithms
by Shietung Peng
2
PRAM Submodels

Four sub-models of the PRAM model have been
defined based on whether concurrent reads from or
writes to the same location are allowed.

3
Data Broadcasting

One-to-all and all-to-all broadcasting in the
CREW or CRCW PRAM can be done in O(1) and O(p)
steps, respectively.
In the EREW PRAM, concurrent reads are not
allowed, so one has to make p copies of the data
value using the recursive doubling technique,
which takes about lg p steps.

4
Data Broadcasting

An example of data broadcasting in EREW PRAM via
recursive doubling.

5
Data Broadcasting

It is more efficient to turn off the processors
that do not perform useful work in a given step.

6
Data Broadcasting

EREW PRAM data broadcasting without redundant
copying.
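
The recursive doubling idea can be sketched as a small sequential simulation. This is only an illustration under assumptions: the function name erew_broadcast and the vector name B are chosen here, and the inner loop stands in for processors that would run in parallel.

```python
def erew_broadcast(value, p):
    """Simulate recursive-doubling broadcast on an EREW PRAM with p processors.

    Returns the filled broadcast vector B and the number of parallel steps.
    """
    B = [None] * p
    B[0] = value      # the source processor writes its value into B[0]
    filled = 1        # B[0 .. filled-1] currently hold the value
    steps = 0
    while filled < p:
        # Processor j (0 <= j < filled) copies B[j] to B[j + filled], but only
        # if that target exists; idle processors are simply switched off.
        # Each cell is read by at most one processor and written by at most
        # one processor, so the step respects the EREW restriction.
        for j in range(min(filled, p - filled)):
            B[j + filled] = B[j]
        filled = min(2 * filled, p)
        steps += 1
    return B, steps
```

For p = 8 the simulation fills B in 3 steps, matching the lg p bound; for p = 5 it also needs 3 steps, the last of which uses only one processor.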

7
Data Broadcasting

To perform all-to-all broadcasting, first, in one
step, all of the values to be broadcast are
written into the broadcast vector B. Each
processor then reads the other p-1 values in p-1
steps.
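
A minimal sketch of this all-to-all scheme follows. The skewed read order (processor j reads B[(j + k) mod p] in step k) is an assumption made here so that no two processors read the same cell in the same step; the slide only states that p-1 read steps are needed.

```python
def all_to_all_broadcast(values):
    """Simulate all-to-all broadcasting among p = len(values) processors.

    Returns, for each processor j, the list of values it has seen:
    its own value first, then the p-1 values read from B.
    """
    p = len(values)
    B = list(values)                     # step 1: processor j writes B[j]
    seen = [[values[j]] for j in range(p)]
    for k in range(1, p):                # p-1 read steps
        for j in range(p):               # conceptually parallel
            # Distinct j give distinct cells (j + k) mod p, so the reads
            # in one step are exclusive.
            seen[j].append(B[(j + k) % p])
    return seen
```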

8
Data Broadcasting: an Application

The all-to-all broadcasting algorithm can be used
to develop an O(p)-time sorting algorithm. A more
efficient PRAM sorting algorithm will be given later.
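
One way such an O(p)-time sort could work is rank counting: while reading the other p-1 values, each processor counts how many of them must precede its own value, then writes its value to that position. The sketch below assumes this approach; the tie-breaking rule by index and the names pram_rank_sort, rank and S are illustrative choices, not taken from the slides.

```python
def pram_rank_sort(values):
    """Sort p values with p simulated processors in O(p) parallel steps."""
    p = len(values)
    B = list(values)                 # all-to-all broadcast vector
    rank = [0] * p
    for k in range(1, p):            # p-1 read steps, as in all-to-all broadcast
        for j in range(p):           # conceptually parallel
            other = (j + k) % p
            # Count values that must precede B[j]; ties are broken by index
            # so that every processor ends up with a distinct rank.
            if B[other] < B[j] or (B[other] == B[j] and other < j):
                rank[j] += 1
    S = [None] * p
    for j in range(p):               # one final exclusive-write step
        S[rank[j]] = B[j]
    return S
```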

9
Parallel Prefix Computation

Similar to the broadcasting algorithm, parallel
prefix computation on an EREW PRAM can be done by
the recursive doubling technique.
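
As a rough sketch of recursive doubling for prefix computation, assuming an associative operator op and using names chosen here: in the step with distance d, element j combines its current partial result with the one d positions to its left.

```python
import operator


def prefix_recursive_doubling(x, op=operator.add):
    """Simulate parallel prefix by recursive doubling.

    Returns the list of prefix results and the number of parallel steps.
    """
    s = list(x)
    p = len(s)
    d = 1
    steps = 0
    while d < p:
        prev = list(s)               # a synchronous step reads old values only
        for j in range(d, p):        # conceptually parallel
            s[j] = op(prev[j - d], prev[j])
        d *= 2
        steps += 1
    return s, steps
```

With x = [1, 2, 3, 4, 5] and addition, the partial results are [1, 3, 5, 7, 9], then [1, 3, 6, 10, 14], then [1, 3, 6, 10, 15].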

10
Parallel Prefix Computation

We can also use the divide-and-conquer paradigm
to design two other algorithms for this problem.
The first algorithm repeatedly combines
consecutive elements to obtain a list of half the
size, and yields correct values for all
odd-indexed results. The even-indexed results are
then found in a single step. The total time of
this algorithm is about 2 lg p steps.
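
The first algorithm might be sketched recursively as follows, assuming 0-based indexing (so the recursion delivers the results at indices 1, 3, 5, ...) and an associative operator. Each level costs one combining step and one fix-up step, which is where the factor of 2 comes from. The function name is illustrative.

```python
import operator


def prefix_pairwise(x, op=operator.add):
    """Parallel prefix by combining adjacent pairs, then recursing on p/2."""
    p = len(x)
    if p <= 1:
        return list(x)
    # Step 1 (parallel): combine adjacent pairs into a list of half the size.
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(p // 2)]
    sub = prefix_pairwise(pairs, op)
    s = [None] * p
    s[0] = x[0]
    # The recursive results are exactly the odd-indexed prefixes.
    for i in range(p // 2):
        s[2 * i + 1] = sub[i]
    # Step 2 (parallel): each remaining even-indexed prefix needs one more op.
    for i in range(1, (p + 1) // 2):
        s[2 * i] = op(sub[i - 1], x[2 * i])
    return s
```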

11
Parallel Prefix Computation

The second algorithm performs parallel prefix
computation separately on two sub-lists, the
odd-indexed and the even-indexed. The final
results are then obtained by pair-wise
combination of adjacent partial results in a
single step. The total time of this algorithm is
about lg p steps.
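
A sketch of the second algorithm is given below. One caveat worth stating as an assumption: because the two sub-lists interleave, combining their partial results reorders the operands, so this sketch is only valid for an operator that is commutative as well as associative (such as addition or max). The function name is illustrative.

```python
import operator


def prefix_even_odd(x, op=operator.add):
    """Parallel prefix via separate prefixes on even- and odd-indexed items.

    Assumes op is associative AND commutative.
    """
    p = len(x)
    if p <= 1:
        return list(x)
    even = prefix_even_odd(x[0::2], op)   # prefixes of x[0], x[2], x[4], ...
    odd = prefix_even_odd(x[1::2], op)    # prefixes of x[1], x[3], x[5], ...
    s = [None] * p
    s[0] = even[0]
    for j in range(1, p):                 # one parallel combining step
        if j % 2 == 1:
            s[j] = op(even[j // 2], odd[j // 2])
        else:
            s[j] = op(even[j // 2], odd[j // 2 - 1])
    return s
```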

12
Ranking the Elements of a Linked List

The problem can be defined as follows: Given a
linear linked list, rank the list elements in
terms of the distance from each to the terminal
element.
The PRAM input and output data structures are
depicted by the following figure.

13
The Ranking Algorithm

The algorithm uses the pointer jumping technique.

14
The Ranking Algorithm

The following example shows the intermediate
values in the vectors rank and next, initially
and after each of the three iterations.
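
Pointer jumping can be sketched as follows. The representation is an assumption made for this example: the list is given as an array next_ptr in which the terminal element points to itself, and the sample list used in the usage note is made up rather than taken from the figure.

```python
def rank_list(next_ptr):
    """Rank linked-list elements by distance to the terminal element.

    next_ptr[j] is the index of the element after j; the terminal element
    points to itself. Returns (rank, number_of_iterations).
    """
    p = len(next_ptr)
    nxt = list(next_ptr)
    rank = [0 if nxt[j] == j else 1 for j in range(p)]
    iterations = 0
    # Keep jumping until every pointer has reached the terminal element.
    while any(nxt[nxt[j]] != nxt[j] for j in range(p)):
        old_rank, old_next = list(rank), list(nxt)   # synchronous step
        for j in range(p):                           # conceptually parallel
            rank[j] = old_rank[j] + old_rank[old_next[j]]
            nxt[j] = old_next[old_next[j]]
        iterations += 1
    return rank, iterations
```

For the six-element list 2 -> 0 -> 3 -> 1 -> 4 -> 5, encoded as next_ptr = [3, 4, 0, 1, 5, 5], the sketch returns the ranks [4, 2, 5, 3, 1, 0] after three iterations.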

15
Some Implementation Aspects

We discuss a number of practical considerations
that are important in transforming a PRAM
algorithm into an efficient program for an actual
shared-memory parallel computer.
The most important of these relates to data
layout in the shared memory.
In any physical implementation of shared memory,
the m memory locations will be in B memory banks
(modules), each bank holding m/B addresses.
Typically, a memory bank can provide access to a
single memory word in a given memory cycle.

16
Some Implementation Aspects

Even if the PRAM algorithm assumes the EREW
sub-model where no two processors access the same
memory location in the same memory cycle, memory
bank conflicts may still arise.
A possible solution is to try to lay out the data
in the shared memory and organize the
computational steps of the algorithm so that a
memory bank is accessed at most once in each
cycle.
This is quite a challenging problem. The main
ideas relating to layout methods are best
explained by the matrix multiplication problem.

17
Matrix Multiplication

Given m-by-m matrices A and B, the PRAM matrix
multiplication algorithm with p processors
is given below
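
The listing from the slide is not reproduced in this transcript. As a minimal stand-in, the sketch below assumes one virtual processor per output element (i, j), each accumulating its inner product over m synchronous steps; this assignment of work to processors is an assumption and may differ from the algorithm on the slide.

```python
def pram_matmul(A, B):
    """Simulate an m*m-processor PRAM product of two m-by-m matrices."""
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for k in range(m):               # m synchronous steps
        for i in range(m):           # the two inner loops stand in for
            for j in range(m):       # the m*m parallel processors (i, j)
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Note that in step k all processors (i, 0), ..., (i, m-1) read the same element A[i][k], which requires concurrent reads.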

18
Matrix Multiplication

The m processors P(i,j), 0 <= j < m, would need
to read row i of the matrix A for their
computation. In order to avoid multiple accesses
to the same matrix element, we skew the accesses
so that P(i,j) reads the elements of row i
beginning with A(i,j). In this way, the entire row
i of A is read out in every cycle, albeit with
the elements distributed differently to the
processors in each cycle.
To ensure that conflict-free parallel access to
all elements of each row of A is possible in
every cycle, the data layout must assign
different columns of A to different memory banks.
This is possible if the data storage is
column-major (see the figure on the next slide).
However, processors P(i,j), 0 <= i < m, all
access the j-th column of B. Therefore, the
column-major storage scheme will lead to memory
bank conflicts for all such accesses to the
columns of B.
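
The skewed access order can be illustrated with a small simulation. It assumes one virtual processor P(i,j) per output element, with P(i,j) starting at column j of row i and wrapping around; the function name and the returned flag are choices made for this sketch.

```python
def skewed_matmul(A, B):
    """Matrix product in which P(i, j) reads row i of A starting at column j.

    Returns (C, rows_read_without_overlap). The flag is True if, in every
    step, the m processors of each row group read m distinct elements of A.
    """
    m = len(A)
    C = [[0] * m for _ in range(m)]
    distinct = True
    for step in range(m):                    # m synchronous steps
        for i in range(m):
            columns_read = set()
            for j in range(m):               # processors P(i, 0..m-1)
                k = (j + step) % m           # skewed column index
                columns_read.add(k)
                C[i][j] += A[i][k] * B[k][j]
            distinct = distinct and len(columns_read) == m
    return C, distinct
```

The product is unchanged because each processor still visits every k exactly once, only in a rotated order.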

19
Column-major Storage Scheme

A matrix can be stored in column-major order to
allow concurrent access to rows as depicted in
the following figure.

20
Skewed Matrix Storage Scheme

A matrix can be laid out in memory in such a way
that both columns and rows are accessible in
parallel without memory bank conflicts.
In this scheme, the matrix element (i,j) is found
in location i of module (i + j) mod B.
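
The difference between the two layouts can be checked with a short sketch. It assumes a 6-by-6 matrix in B = 6 banks, that column-major storage places column j entirely in bank j, and that the skewed scheme uses bank (i + j) mod B; all function names are illustrative.

```python
def conflict_free(cells, bank_of, num_banks):
    """True if the given (i, j) cells all fall in different memory banks."""
    banks = [bank_of(i, j) % num_banks for (i, j) in cells]
    return len(set(banks)) == len(banks)


def column_major_bank(i, j):
    return j            # assumption: column j is stored in bank j


def skewed_bank(i, j):
    return i + j        # reduced mod B inside conflict_free


def row(i, m):
    return [(i, j) for j in range(m)]


def column(j, m):
    return [(i, j) for i in range(m)]


def main_diagonal(m):
    return [(i, i) for i in range(m)]
```

With m = B = 6, rows are conflict-free in both layouts, columns only in the skewed one, and the main diagonal still conflicts in the skewed layout because 2i mod 6 repeats.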

21
Linear Skewing Scheme

It is more convenient to deal with vectors rather
than matrices for the memory layout problem for
conflict-free parallel access.
The 6-by-6 matrix of the previous figure can be
viewed as a vector of length 36, as shown by the
figure below.

22
Linear Skewing Scheme

Thus, the memory data layout problem is reduced
to the following: Given a vector of length l,
store it in B memory banks in such a way that
accesses with given strides are conflict-free.
A linear skewing scheme is one that stores the
k-th vector element in the bank (a + kb) mod B.
With a linear skewing scheme, the vector elements
k, k + s, k + 2s, ..., k + (B-1)s will be assigned
to different memory modules iff sb is relatively
prime with respect to B (the number of memory
banks).
A simple way to guarantee conflict-free parallel
access for all strides that are not multiples of
B is to choose B to be a prime number (with b not
a multiple of B). Notice that, in the vector
view, a column has a stride of 1, a row has a
stride of m, a diagonal has a stride of m + 1,
and an anti-diagonal has a stride of m - 1.
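
The relative-primality condition can be checked numerically. The sketch assumes the bank of element k is (a + k*b) mod B, reading the formula on this slide with a plus sign; the function name and parameter order are illustrative.

```python
from math import gcd


def stride_conflict_free(s, a, b, B, k=0):
    """True if elements k, k+s, ..., k+(B-1)s land in B different banks."""
    banks = {(a + (k + t * s) * b) % B for t in range(B)}
    return len(banks) == B


def agrees_with_gcd_rule(max_stride, a, b, B):
    """Compare the simulation with the rule gcd(s*b, B) == 1 for all strides."""
    return all(
        stride_conflict_free(s, a, b, B) == (gcd(s * b, B) == 1)
        for s in range(1, max_stride + 1)
    )
```

For m = 6 and B = 7 with a = 0 and b = 1, strides 1, 5 and 6 are conflict-free, but the diagonal stride m + 1 = 7 is a multiple of B and therefore conflicts, which is why the choice of B and of the coefficients matters.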

23
Exercise 4

Broadcasting on a PRAM
Find the speed-up, efficiency, and the various
other measures for the PRAM broadcasting
algorithms in the lecture note.
Show how two separate broadcasts by two
processors can be done in only one or two extra
EREW PRAM steps compared with a single broadcast.
Modify the broadcasting algorithms such that a
processor that obtains the value broadcast by
Processor i keeps it in a register and does not
have to read it from the memory each time.

24
Exercise 4

Parallel prefix on a PRAM
Develop a PRAM algorithm for an incomplete
parallel prefix computation involving p or fewer
elements in the input vector X[0..p-1]. In this
variant, some elements of X may be marked as
being invalid and the i-th prefix is defined as
the combination of all valid elements up to the
i-th element.
Develop a PRAM algorithm for a partitioned
parallel prefix computation defined as follows.
The input X consists of p elements. A Boolean
vector Y is also given, with Y[i] = 1 indicating
that X[i] is the first element of a new
partition. The partitioned prefix of an entry is
the combination of all elements in the same
partition up to that entry.

25
Exercise 4

Consider the linear skewing scheme
s(i,j) = (ai + bj) mod B that yields the index of
the memory bank
where the element (i,j) of an m-by-m matrix is
stored. In order to have conflict-free parallel
access to rows, columns, diagonals, and
anti-diagonals, prove that it is sufficient to
choose B to be the smallest prime number that is
no less than max(m,5).
