PRAM and Basic Algorithms - PowerPoint PPT Presentation

1
PRAM and Basic Algorithms
by Shietung Peng
2
PRAM Submodels
  • Four sub-models of the PRAM model (EREW, ERCW,
    CREW, and CRCW) have been defined based on whether
    concurrent reads from or writes to the same
    location are allowed.

3
Data Broadcasting
  • One-to-all and all-to-all broadcasting in the
    CREW or CRCW PRAM can be done in O(1) and O(p)
    steps, respectively.
  • In the EREW PRAM, one has to make p copies of the
    data value using the recursive doubling technique.

4
Data Broadcasting
  • An example of data broadcasting in EREW PRAM via
    recursive doubling.

5
Data Broadcasting
  • It is more efficient to turn off the processors
    that do not perform useful work in a given step.

6
Data Broadcasting
  • EREW PRAM data broadcasting without redundant
    copying.
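
The broadcast without redundant copying can be sketched as a sequential simulation. This is a minimal sketch, not the lecture's own listing: the function name erew_broadcast and the choice to return the step count are assumptions made for illustration. In each step only processors j with 0 <= j < min(s, p - s) are active, and each reads B[j] and writes B[j + s].

```python
def erew_broadcast(value, p):
    """Simulate an EREW broadcast of `value` into a p-element vector B."""
    B = [None] * p
    B[0] = value              # the broadcasting processor writes into B[0]
    s, steps = 1, 0
    while s < p:
        # Active processors: 0 <= j < min(s, p - s). Processor j reads B[j]
        # and writes B[j + s], so no cell is read or written twice in a step.
        for j in range(min(s, p - s)):
            B[j + s] = B[j]
        s *= 2                # the number of copies doubles in every step
        steps += 1
    return B, steps
```

For p = 10 the simulation fills all ten cells in 4 steps, which matches the ceiling of lg 10.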

7
Data Broadcasting
  • To perform all-to-all broadcasting, first, in one
    step, all of the values to be broadcast are
    written into the broadcast vector B. Each
    processor then reads the other p-1 values in p-1
    steps.
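
A minimal simulation of this all-to-all scheme is sketched below. The read schedule, in which processor j reads B[(j + k) mod p] in step k, is an assumption chosen so that no two processors read the same cell in the same step; the function name is hypothetical.

```python
def all_to_all_broadcast(values):
    """Return, for each processor j, the p - 1 values it reads from B."""
    p = len(values)
    B = list(values)                      # one step: processor j writes B[j]
    received = [[] for _ in range(p)]
    for k in range(1, p):                 # p - 1 read steps
        for j in range(p):
            # Offsets differ per processor, so each cell of B is read by
            # exactly one processor in a given step (exclusive read).
            received[j].append(B[(j + k) % p])
    return received
```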

8
Data Broadcasting: an Application
  • The all-to-all broadcasting algorithm can be used
    to develop an O(p)-time sorting algorithm. A more
    efficient PRAM algorithm will be given later.
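
One way such a sort could work is by ranking: each processor counts how many keys precede its own and then writes its key to that position. The sketch below assumes this rank-based approach and the same (j + k) mod p read schedule as in all-to-all broadcasting; the name rank_sort and the tie-breaking rule by index are illustrative choices, not taken from the slides.

```python
def rank_sort(S):
    """Sort p keys in p - 1 simulated read steps plus one write step."""
    p = len(S)
    R = [0] * p                           # R[j] = rank of key S[j]
    for k in range(1, p):                 # p - 1 steps
        for j in range(p):                # all processors act in parallel
            l = (j + k) % p
            # Ties are broken by index so that equal keys get distinct ranks.
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1
    out = [None] * p
    for j in range(p):                    # final step: write key to its rank
        out[R[j]] = S[j]
    return out
```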

9
Parallel Prefix Computation
  • Similar to the broadcasting algorithm, parallel
    prefix computation on an EREW PRAM can be done by
    the recursive doubling technique.
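
A sketch of prefix computation by recursive doubling follows. It is a sequential simulation under the assumption that each step is split into a read phase and a write phase (modeled by the snapshot `old`); the function name and the default operator are illustrative.

```python
import operator

def prefix_recursive_doubling(X, op=operator.add):
    """Compute all prefixes of X under the associative operator `op`."""
    p = len(X)
    S = list(X)
    s = 1
    while s < p:
        old = S[:]                        # all reads happen before any write
        for j in range(s, p):             # processors s..p-1 are active
            # Combine the partial result s positions to the left with ours.
            S[j] = op(old[j - s], old[j])
        s *= 2                            # the combining distance doubles
    return S
```

The operand order (left partial result first) is kept so that the sketch also works for non-commutative operators such as string concatenation.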

10
Parallel Prefix Computation
  • We can also use the divide-and-conquer paradigm
    to design two other algorithms for this problem.
  • The first algorithm repeatedly combines
    consecutive elements to obtain a list of half the
    size, and yields correct values for all
    odd-indexed results. The even-indexed results are
    then found in a single step. The total time of
    this algorithm is 2 lg p.
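
The first divide-and-conquer algorithm can be sketched recursively as follows. The name prefix_dc1 is hypothetical, and the recursion is a sequential stand-in for the parallel steps. The time bound assumes p is a power of 2, although the sketch itself accepts any length.

```python
import operator

def prefix_dc1(X, op=operator.add):
    """Prefix computation by halving: pair up, recurse, fill in the rest."""
    p = len(X)
    if p <= 1:
        return list(X)
    # Step 1: combine consecutive elements into a list of half the size.
    pairs = [op(X[2 * i], X[2 * i + 1]) for i in range(p // 2)]
    Z = prefix_dc1(pairs, op)             # prefixes of the half-size list
    S = [None] * p
    for i in range(p // 2):
        S[2 * i + 1] = Z[i]               # odd-indexed results are final
    # Step 2: even-indexed results need one more combination each.
    S[0] = X[0]
    for i in range(1, (p + 1) // 2):
        S[2 * i] = op(Z[i - 1], X[2 * i])
    return S
```

Each level of the recursion costs two steps (one to pair up, one to fill in the even-indexed results), which is where the factor of 2 in the running time comes from.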

11
Parallel Prefix Computation
  • The second algorithm performs parallel prefix
    computation separately on two sub-lists, the
    odd-indexed and the even-indexed. The final
    results are then obtained by pair-wise
    combination of adjacent partial results in a
    single step. The total time of this algorithm is
    lg p.
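
A sketch of the second algorithm is given below, with the hypothetical name prefix_dc2. Note that combining a prefix of the even-indexed sub-list with a prefix of the odd-indexed sub-list reorders the operands, so this sketch gives correct results only when the operator is commutative as well as associative (for example addition or max).

```python
import operator

def prefix_dc2(X, op=operator.add):
    """Prefix computation via separate prefixes on even and odd sub-lists."""
    p = len(X)
    if p <= 1:
        return list(X)
    E = prefix_dc2(X[0::2], op)           # prefixes over x0, x2, x4, ...
    O = prefix_dc2(X[1::2], op)           # prefixes over x1, x3, x5, ...
    S = [None] * p
    S[0] = E[0]
    for k in range(1, p):                 # one parallel combining step
        i = k // 2
        if k % 2 == 0:
            S[k] = op(E[i], O[i - 1])     # x0..x(2i) = evens to i, odds to i-1
        else:
            S[k] = op(E[i], O[i])         # x0..x(2i+1) = evens to i, odds to i
    return S
```

The two recursive calls are independent and would run in parallel on a PRAM, so each level adds only the single combining step.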

12
Ranking the Elements of a Linked List
  • The problem can be defined as follows: Given a
    linear linked list, rank the list elements in
    terms of the distance from each to the terminal
    element.
  • The PRAM input and output data structures are
    depicted in the following figure.

13
The Ranking Algorithm
  • The algorithm uses the pointer jumping technique.

14
The Ranking Algorithm
  • The following example shows the intermediate
    values in the vectors rank and next, initially
    and after each of the three iterations.
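
Pointer jumping on the vectors rank and next can be sketched as follows. The sketch assumes the terminal element is marked by pointing to itself; the function name and the stopping test are illustrative. Each iteration is modeled as a synchronous step in which all reads use the old values.

```python
def list_rank(next_ptr):
    """Rank list elements by distance to the terminal (self-pointing) one."""
    p = len(next_ptr)
    nxt = list(next_ptr)
    # Initially every element except the terminal is one link from its next.
    rank = [0 if nxt[j] == j else 1 for j in range(p)]
    # Continue until every pointer has reached the terminal element.
    while any(nxt[j] != nxt[nxt[j]] for j in range(p)):
        # Both updates read only old values, as in one synchronous step.
        rank = [rank[j] + rank[nxt[j]] for j in range(p)]
        nxt = [nxt[nxt[j]] for j in range(p)]
    return rank
```

For example, the list 3 -> 0 -> 2 -> 1 -> 4 is encoded as next = [2, 4, 1, 0, 4], and two jumping iterations suffice to rank its five elements.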

15
Some Implementation Aspects
  • We discuss a number of practical considerations
    that are important in transforming a PRAM
    algorithm into an efficient program for an actual
    shared-memory parallel computer.
  • The most important of these relates to data
    layout in the shared memory.
  • In any physical implementation of shared memory,
    the m memory locations will be in B memory banks
    (modules), each bank holding m/B addresses.
    Typically, a memory bank can provide access to a
    single memory word in a given memory cycle.

16
Some Implementation Aspects
  • Even if the PRAM algorithm assumes the EREW
    sub-model where no two processors access the same
    memory location in the same memory cycle, memory
    bank conflicts may still arise.
  • A possible solution is to try to lay out the data
    in the shared memory and organize the
    computational steps of the algorithm so that a
    memory bank is accessed at most once in each
    cycle.
  • This is quite a challenging problem. The main
    ideas relating to layout methods are best
    explained by the matrix multiplication problem.

17
Matrix Multiplication
  • Given m-by-m matrices A and B, the PRAM matrix
    multiplication algorithm with p = m^2 processors
    is given below.
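
Since the original listing is not reproduced in this transcript, here is a minimal sketch of what such an algorithm might look like, assuming one processor per element of the result (m*m processors, an assumption of this sketch). The two outer loops stand in for the processors, which would all run in parallel; only the inner loop over k takes time on the PRAM.

```python
def pram_matmul(A, B):
    """Simulate PRAM matrix multiplication with one processor per C[i][j]."""
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(m):                    # processor index i (parallel)
        for j in range(m):                # processor index j (parallel)
            t = 0
            for k in range(m):            # m sequential steps per processor
                t += A[i][k] * B[k][j]
            C[i][j] = t
    return C
```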

18
Matrix Multiplication
  • The m processors P(i,0), P(i,1), ..., P(i,m-1)
    would need to read row i of the matrix A for their
    computation. In order to avoid multiple accesses
    to the same matrix element, we skew the accesses
    so that P(i,j) reads the elements of row i
    beginning with element (i,j). In this way, the
    entire row i of A is read out in every cycle,
    albeit with the elements distributed differently
    to the processors in each cycle.
  • To ensure that conflict-free parallel access to
    all elements of each row of A is possible in
    every cycle, the data layout must assign
    different columns of A to different memory banks.
    This is possible if the data storage is
    column-major (see the figure on the next page).
  • However, processors P(0,j), P(1,j), ..., P(m-1,j)
    all access the j-th column of B. Therefore, the
    column-major storage scheme will lead to memory
    bank conflicts for all such accesses to the
    columns of B.
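
The skewed access order can be sketched as follows. The processor naming P(i, j) and both function names are assumptions of this sketch: in cycle k, processor P(i, j) reads column (j + k) mod m of row i of A, so the m processors of a row touch m different elements of A in every cycle.

```python
def skewed_columns(m):
    """schedule[k][j] = column of A's row i read by P(i, j) in cycle k."""
    return [[(j + k) % m for j in range(m)] for k in range(m)]

def matmul_skewed(A, B):
    """Matrix product with every P(i, j) starting its row scan at column j."""
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for k in range(m):                    # one PRAM cycle per value of k
        for i in range(m):
            for j in range(m):            # all P(i, j) act in parallel
                c = (j + k) % m           # skewed column index
                C[i][j] += A[i][c] * B[c][j]
    return C
```

The result is unchanged because each processor still visits every column exactly once, only in a rotated order.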

19
Column-major Storage Scheme
  • A matrix can be stored in column-major order to
    allow concurrent access to rows as depicted in
    the following figure.

20
Skewed Matrix Storage Scheme
  • A matrix can be laid out in memory in such a way
    that both columns and rows are accessible in
    parallel without memory bank conflicts.
  • In this scheme, the matrix element (i,j) is found
    in location i of module (i + j) mod B.
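
The difference between the two layouts can be checked with a small sketch. The helper names are hypothetical, and column-major storage is modeled here under the assumption that column j is held in bank j mod B.

```python
def bank_column_major(i, j, B):
    """Bank of element (i, j) when column j is stored in bank j mod B."""
    return j % B

def bank_skewed(i, j, B):
    """Bank of element (i, j) under the skewed storage scheme."""
    return (i + j) % B

def conflict_free(cells, bank, B):
    """True if all the given (i, j) cells fall in different banks."""
    banks = [bank(i, j, B) for i, j in cells]
    return len(set(banks)) == len(banks)

m = B = 6
row = [(2, j) for j in range(m)]          # row 2 of a 6-by-6 matrix
col = [(i, 3) for i in range(m)]          # column 3
diag = [(i, i) for i in range(m)]         # main diagonal
```

With these definitions, both the row and the column are conflict-free under bank_skewed, whereas the column conflicts under bank_column_major. The main diagonal still conflicts under bank_skewed when B = 6, since (i + i) mod 6 takes only three distinct values.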

21
Linear Skewing Scheme
  • It is more convenient to deal with vectors rather
    than matrices for the memory layout problem for
    conflict-free parallel access.
  • The 6-by-6 matrix of the previous figure can be
    viewed as a vector of length 36, as shown by the
    figure below.

22
Linear Skewing Scheme
  • Thus, the memory data layout problem is reduced
    to the following: Given a vector of length l,
    store it in B memory banks in such a way that
    accesses with the strides of interest are
    conflict-free.
  • A linear skewing scheme is one that stores the
    k-th vector element in the bank (a + kb) mod B.
  • With a linear skewing scheme, the vector elements
    k, k+s, k+2s, ..., k+(B-1)s will be assigned to
    different memory modules iff sb is relatively
    prime with respect to B (the number of memory
    banks).
  • A simple way to guarantee conflict-free parallel
    access for all strides that are not multiples of
    B is to choose B to be a prime number. Note that
    a column access has a stride of 1, a row access a
    stride of m, a diagonal access a stride of m+1,
    and an anti-diagonal access a stride of m-1.
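
The relative-primality condition can be checked by brute force with a short sketch. The function names are hypothetical, and a, b, s, and B are the symbols of the linear skewing scheme, with element k stored in bank (a + kb) mod B.

```python
from math import gcd

def bank_of(k, a, b, B):
    """Bank holding vector element k under linear skewing."""
    return (a + k * b) % B

def stride_conflict_free(s, b, B, k=0, a=0):
    """True if elements k, k+s, ..., k+(B-1)s land in B distinct banks."""
    banks = {bank_of(k + t * s, a, b, B) for t in range(B)}
    return len(banks) == B

def predicted(s, b, B):
    """The condition stated above: s*b relatively prime to B."""
    return gcd(s * b, B) == 1
```

As an illustration, take m = 6 with b = 1. With B = 7 banks, strides 1, m = 6, and m - 1 = 5 are conflict-free, but the diagonal stride m + 1 = 7 is a multiple of B and sends every element to the same bank. With B = 11, all four strides are conflict-free.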

23
Exercise 4
  • Broadcasting on a PRAM
  • Find the speed-up, efficiency, and the various
    other measures for the PRAM broadcasting
    algorithms in the lecture note.
  • Show how two separate broadcasts by two
    processors can be done in only one or two extra
    EREW PRAM steps compared with a single broadcast.
  • Modify the broadcasting algorithms such that a
    processor that obtains the value broadcast by
    Processor i keeps it in a register and does not
    have to read it from the memory each time.

24
Exercise 4
  • Parallel prefix on a PRAM
  • Develop a PRAM algorithm for an incomplete
    parallel prefix computation involving p or fewer
    elements in the input vector X[0..p-1]. In this
    variant, some elements of X may be marked as
    being invalid and the i-th prefix is defined as
    the combination of all valid elements up to the
    i-th element.
  • Develop a PRAM algorithm for a partitioned
    parallel prefix computation defined as follows.
    The input X consists of p elements. A Boolean
    vector Y is also given, with Y[i] = 1 indicating
    that X[i] is the first element of a new
    partition. The partitioned prefix of an entry is
    the combination of all elements in the same
    partition up to that entry.

25
Exercise 4
  • Consider the linear skewing scheme s(i,j) = (ai + bj)
    mod B that yields the index of the memory bank
    where the element (i,j) of an m-by-m matrix is
    stored. In order to have conflict-free parallel
    access to rows, columns, diagonals, and
    anti-diagonals, prove that it is sufficient to
    choose B to be the smallest prime number that is
    no less than max(m,5).