A Taste of Parallel Algorithms - PowerPoint PPT Presentation

About This Presentation
Title:

A Taste of Parallel Algorithms


Slides: 21
Provided by: spe9

Transcript and Presenter's Notes

Title: A Taste of Parallel Algorithms


1
A Taste of Parallel Algorithms
by Shietung Peng
2
Some Simple Computations
  • We examine some of the following five simple
    building-block parallel operations on three
    simple parallel architectures: linear array,
    binary tree, and 2D mesh.
  • Semi-group computation
  • Parallel prefix computation
  • Packet routing
  • Broadcasting and multicasting
  • Sorting

3
Some Simple Architectures
  • A linear array or ring
  • A balanced binary tree
  • A 2D mesh or torus

4
Algorithms for Linear Array
  • A special case of semi-group computation, namely
    maximum finding, is shown below. In each step, a
    processor sends its max-thus-far value
    (initialized to its own data value) to its two
    neighbors. Each processor, on receiving values
    from its left and right neighbors, sets its
    max-thus-far value to the largest of the three
    values. After at most p - 1 steps, every
    processor holds the global maximum.
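The maximum-finding step can be simulated sequentially. The following is a minimal sketch, assuming a linear array without wraparound and synchronous steps; the function name `linear_array_max` is hypothetical and only for illustration.

```python
def linear_array_max(values):
    """Simulate synchronous maximum finding on a linear array (no wraparound)."""
    p = len(values)
    current = list(values)              # max-thus-far value of each processor
    for _ in range(p - 1):              # p - 1 steps cover the two farthest ends
        nxt = []
        for i in range(p):
            # End processors have only one neighbor; reuse their own value.
            left = current[i - 1] if i > 0 else current[i]
            right = current[i + 1] if i < p - 1 else current[i]
            nxt.append(max(left, current[i], right))
        current = nxt                   # all processors update at the same time
    return current


print(linear_array_max([5, 2, 8, 6, 3, 7, 9, 1, 4]))  # every entry is 9
```

Building the next state in a separate list models the fact that all processors exchange values in the same step.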

5
Algorithms for Linear Array
  • The algorithm for parallel prefix computation is
    similar to the semi-group computation algorithm.
    The processor at the left end becomes active and
    sends its data value to the right. On receiving a
    value from its left neighbor, a processor becomes
    active, sums up the value received from the left
    and its own data value, and sends the result to
    the right.
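The left-to-right activation described above can be sketched as a simple loop. This is an illustrative simulation, assuming addition as the prefix operator; `linear_array_prefix` is a hypothetical name.

```python
def linear_array_prefix(values):
    """Simulate prefix sums on a linear array, one processor activated per step."""
    result = []
    incoming = None                     # the leftmost processor receives nothing
    for own in values:                  # processor i becomes active in step i
        partial = own if incoming is None else incoming + own
        result.append(partial)          # processor i keeps its prefix result
        incoming = partial              # ...and sends it to its right neighbor
    return result


print(linear_array_prefix([5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [5, 7, 15, 21, 24, 31, 40, 41, 45]
```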

6
Algorithms for Linear Array
  • Extending the parallel prefix and semi-group
    algorithms to the case where each processor
    holds several data items is straightforward.
    Each processor does a prefix computation on its
    own data set of size n/p, then takes part in a
    diminished parallel prefix computation on the
    local totals (processor i obtains the prefix up
    to the (i-1)th value), and finally combines this
    result with its locally computed prefixes. In
    all, 2n/p + p - 2 computation steps and p - 1
    communication steps are required.
  • An example of computing prefix sums on a linear
    array with two items per processor is shown on
    the next page.
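The three stages for several items per processor can be sketched as follows. This is a minimal sequential simulation, assuming non-empty blocks and addition as the operator; `blocked_prefix_sum` is a hypothetical name.

```python
from itertools import accumulate


def blocked_prefix_sum(blocks):
    """Prefix sums when processor i holds the list blocks[i] (each non-empty)."""
    # Stage 1: every processor computes prefix sums over its own items.
    local = [list(accumulate(block)) for block in blocks]

    # Stage 2: diminished prefix over the block totals. Processor i receives
    # the sum of the totals of processors 0 .. i-1 (0 for the leftmost one).
    offsets = []
    running = 0
    for prefixes in local:
        offsets.append(running)
        running += prefixes[-1]

    # Stage 3: every processor adds its offset to its local prefixes.
    return [[off + x for x in prefixes] for off, prefixes in zip(offsets, local)]


print(blocked_prefix_sum([[5, 2], [8, 6], [3, 7]]))
# [[5, 7], [15, 21], [24, 31]]
```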

7
Algorithms for Linear Array
8
Algorithms for Linear Array
  • We consider two versions of sorting on a linear
    array: with and without I/O. Sorting with the
    keys input sequentially from the left is depicted
    below.
  • Each processor, on receiving a key value from
    the left, compares the received value with the
    local value. The smaller of the two values is
    kept and the larger value is passed on to the
    right.
  • The total sorting time is equal to the I/O time.
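The keep-the-smaller, pass-the-larger rule can be sketched as below. For simplicity this simulation pushes one key fully through the array before the next one enters, so it shows the final contents but not the pipelined timing; `pipeline_sort` is a hypothetical name.

```python
def pipeline_sort(keys):
    """Sort keys fed one at a time into the left end of a linear array."""
    p = len(keys)
    cells = [None] * p                  # local value per processor; None = empty
    for key in keys:                    # keys arrive sequentially from the left
        moving = key
        for i in range(p):
            if cells[i] is None:        # an empty processor stores what it receives
                cells[i] = moving
                break
            if moving < cells[i]:       # keep the smaller value locally...
                cells[i], moving = moving, cells[i]
            # ...and pass the larger value on to the right neighbor
    return cells


print(pipeline_sort([5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```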

9
Algorithms for Linear Array
  • If the key values are already in place, one per
    processor, then an algorithm known as odd-even
    transposition can be used for sorting.
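Odd-even transposition sort can be sketched as a sequential simulation. This follows the usual textbook formulation (an assumption here, since the slide does not spell out the steps): steps alternate between compare-exchanges on pairs starting at even positions and pairs starting at odd positions, for p steps in total. The name `odd_even_transposition_sort` is hypothetical.

```python
def odd_even_transposition_sort(keys):
    """Simulate odd-even transposition sort with one key per processor."""
    a = list(keys)
    p = len(a)
    for step in range(p):               # p compare-exchange steps suffice
        start = step % 2                # alternate between pairs (0,1),(2,3),...
        for i in range(start, p - 1, 2):  # ...and pairs (1,2),(3,4),...
            if a[i] > a[i + 1]:         # compare-exchange: smaller key moves left
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


print(odd_even_transposition_sort([5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

All pairs handled in one step are disjoint, which is why they can be processed at the same time on a real linear array.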

10
Algorithms for Linear Array
  • The odd-even transposition algorithm uses p
    processors to sort p keys in p compare-exchange
    steps. How good is the algorithm? Assume that the
    best sequential sorting algorithm takes p lg p
    steps. Then, we have T(p) = p, W(p) ≈ p^2/2,
    S(p) = lg p, E(p) = (lg p)/p, R(p) = p/(2 lg p),
    U(p) = 1/2, and Q(p) = S(p)E(p)/R(p) =
    2(lg p)^3/p^2.
  • In practice, the number n of keys to be sorted is
    greater than the number p of processors. In this
    case, each processor first sorts its list of size
    n/p using any efficient sequential sorting
    algorithm. Next, we perform the odd-even
    transposition sort as before except that each
    compare-exchange step is replaced by a
    merge-split step. For example, if P0 is holding
    (1,3,7,8) and P1 has (2,4,5,9), a merge-split
    step will turn the lists into (1,2,3,4) and
    (5,7,8,9), respectively.
  • The total time of this generalized algorithm is
    (n/p) lg(n/p) + 2n: the local sort takes
    (n/p) lg(n/p) steps, and each of the p
    merge-split steps takes about 2n/p.
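The merge-split step and the generalized sort can be sketched as follows. This is an illustrative simulation, assuming equal-sized blocks; `merge_split` and `block_odd_even_sort` are hypothetical names.

```python
from heapq import merge


def merge_split(left, right):
    """Merge two sorted lists; the left processor keeps the smaller half."""
    merged = list(merge(left, right))
    return merged[:len(left)], merged[len(left):]


def block_odd_even_sort(blocks):
    """Odd-even transposition sort with n/p keys per processor."""
    lists = [sorted(block) for block in blocks]     # local sequential sort
    p = len(lists)
    for step in range(p):                           # p merge-split steps
        for i in range(step % 2, p - 1, 2):
            lists[i], lists[i + 1] = merge_split(lists[i], lists[i + 1])
    return lists


print(merge_split([1, 3, 7, 8], [2, 4, 5, 9]))
# ([1, 2, 3, 4], [5, 7, 8, 9])
print(block_odd_even_sort([[8, 1], [7, 2], [6, 3], [5, 4]]))
# [[1, 2], [3, 4], [5, 6], [7, 8]]
```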

11
Algorithms for Binary Tree
  • In algorithms for a binary tree of processors, we
    assume that the data elements are initially held
    by the leaf processors only. The non-leaf
    processors participate in the computation, but do
    not have data elements of their own.
  • The binary-tree architecture is ideally suited
    for parallel-prefix computation. The algorithm
    consists of two phases: an upward phase followed
    by a downward phase. The two phases are depicted
    by the figure on the next page.
  • Given a list of 0s and 1s, the rank of each 1 in
    the list can be determined by a prefix sum
    computation.
12
Algorithms for Binary Tree
  • In the downward phase, each processor receives a
    value p from its parent and a value l from its
    left child (here p denotes the parent's value,
    not the number of processors). It then passes p
    to its left child, combines p and l, and sends
    the result to its right child.
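The two phases can be sketched recursively. This is a minimal simulation, assuming addition as the operator, the identity 0 entering at the root, and each leaf combining the value it receives with its own data item at the end; the names `tree_prefix_sum`, `upward`, and `downward` are hypothetical.

```python
def tree_prefix_sum(leaves):
    """Prefix sums over leaf values using an upward and a downward tree phase."""
    n = len(leaves)
    left_value = {}                     # per inner node: value l from its left child
    result = [0] * n

    def upward(lo, hi):
        """Return the sum of leaves[lo:hi], i.e. the value sent to the parent."""
        if hi - lo == 1:
            return leaves[lo]
        mid = (lo + hi) // 2
        l = upward(lo, mid)
        r = upward(mid, hi)
        left_value[(lo, hi)] = l        # remembered for the downward phase
        return l + r

    def downward(lo, hi, from_parent):
        """from_parent is the sum of all leaves to the left of this subtree."""
        if hi - lo == 1:
            result[lo] = from_parent + leaves[lo]
            return
        mid = (lo + hi) // 2
        downward(lo, mid, from_parent)                          # pass p to the left
        downward(mid, hi, from_parent + left_value[(lo, hi)])   # p combined with l

    if n:
        upward(0, n)
        downward(0, n, 0)               # the root receives the identity element
    return result


# Ranking the 1s in a 0/1 list: the prefix sum at each 1 is its rank.
print(tree_prefix_sum([1, 0, 1, 1, 0, 1, 0, 1]))
# [1, 1, 2, 3, 3, 4, 4, 5]
```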

13
Algorithms for Binary Tree
  • For sorting, we can use an algorithm similar to
    bubble sort that allows the smaller elements in
    the leaves to bubble up to the root processor
    first. Then, the root sends the elements to leaf
    nodes in the proper order.
  • Initially, each leaf has a single data item and
    all other nodes are empty. In the upward phase,
    each inner node has storage space for two values
    migrating upward from its left and right
    sub-trees. There are three cases:
  • Contains 2 items: do nothing.
  • Contains 1 item that came from the left (right):
    get the smaller item from the right (left) child.
  • Empty: get the smaller item from each child.

14
Algorithms for Binary Tree
  • In the downward phase, each node knows the number
    of leaf nodes in its left sub-tree. If the rank
    of the element received from above is larger than
    the number of leaf nodes to the left, the data
    item is sent to the right; otherwise, it is sent
    to the left.
  • The figure on the next page shows the upward data
    movement (up to the point when the smallest
    element is in the root node, ready to begin the
    downward movement).
  • Because the bisection width of the binary tree
    is 1, the above linear-time algorithm cannot be
    improved asymptotically.
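The downward routing decision can be sketched as follows. This is only an illustration: it assumes 1-based ranks, a left sub-tree that holds half of the leaves (rounded down), and that the rank is reduced by the left sub-tree size whenever an item moves right, which is one way (not stated on the slide) to keep the comparison local to each node. The name `route_to_leaf` is hypothetical.

```python
def route_to_leaf(rank, num_leaves):
    """Return (path, leaf_index) for the item of the given 1-based rank."""
    path = []
    first_leaf = 0                      # index of the leftmost leaf in this sub-tree
    size = num_leaves                   # number of leaves in this sub-tree
    while size > 1:
        left = size // 2                # leaves in the left sub-tree
        if rank > left:                 # rank exceeds the left leaf count: go right
            path.append("R")
            rank -= left                # assumed local rank adjustment
            first_leaf += left
            size -= left
        else:                           # otherwise: go left
            path.append("L")
            size = left
    return "".join(path), first_leaf


print(route_to_leaf(1, 8))  # ('LLL', 0)
print(route_to_leaf(5, 8))  # ('RLL', 4)
print(route_to_leaf(8, 8))  # ('RRR', 7)
```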

15
Algorithms for Binary Tree
16
Algorithms for 2D Mesh
  • The linear array algorithms can be used as
    building blocks in the 2D mesh algorithms. This
    leads to simple algorithms, but not necessarily
    the most efficient ones.
  • Parallel prefix computation in a 2D mesh can be
    done easily in the following three phases,
    similar to that in a linear array for the case
    n > p.
  • (1) do a parallel prefix computation on each row,
  • (2) do a diminished parallel prefix computation
    in the rightmost column, and
  • (3) broadcast the results in the rightmost column
    to all of the elements in the same rows and
    combine with the local prefix values.

17
Algorithms for 2D Mesh
  • The parallel prefix algorithm in a 2D mesh takes
    O(√p) time on a √p-by-√p mesh, since each of the
    three phases is a single pass along a row or
    column.
  • Next, we describe, without proof, the simple
    version of a sorting algorithm known as shear
    sort. The algorithm consists of ⌈log2 r⌉ + 1
    phases in a 2D mesh with r rows. In each phase
    except the last one, all rows are sorted
    independently in a snakelike order:
    even-numbered rows from left to right,
    odd-numbered rows from right to left. Then, all
    columns are sorted independently from top to
    bottom. In the final phase, rows are sorted from
    left to right.
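Shear sort can be sketched as a sequential simulation. This sketch assumes the usual phase count of ⌈log2 r⌉ + 1 for a mesh with r rows, numbers rows from 0, and uses Python's built-in sort as a stand-in for the row and column sorts that a real mesh would perform with odd-even transposition. The name `shear_sort` is hypothetical.

```python
import math


def shear_sort(grid):
    """Sort a non-empty rectangular grid (list of rows) into row-major order."""
    a = [list(row) for row in grid]
    r, c = len(a), len(a[0])
    full_phases = math.ceil(math.log2(r))       # phases with row and column sorts
    for _ in range(full_phases):
        # Snakelike row sort: even rows ascending, odd rows descending.
        for i in range(r):
            a[i].sort(reverse=(i % 2 == 1))
        # Column sort, top to bottom.
        for j in range(c):
            column = sorted(a[i][j] for i in range(r))
            for i in range(r):
                a[i][j] = column[i]
    # Final phase: every row is sorted from left to right.
    for row in a:
        row.sort()
    return a


print(shear_sort([[9, 8, 7], [6, 5, 4], [3, 2, 1]]))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```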

18
Algorithms for 2D Mesh
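  • Using the odd-even transposition algorithm for
    row-sort and column-sort in the shear-sort
    algorithm, we get that the algorithm needs
    ⌈log2 r⌉(p/r + r) + p/r compare-exchange steps
    for sorting in row-major order on a mesh with r
    rows and p/r columns: each row sort takes p/r
    steps, each column sort takes r steps, and the
    ⌈log2 r⌉ row-and-column phases are followed by
    one final row sort.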
  • The figure below shows the execution of the
    algorithm in a 3-by-3 mesh.

19
Exercise 2
  • Given n data items, determine the optimal number
    p of processors in a linear array such that if
    the n data items are distributed in the
    processors with each holding approximately n/p
    elements, the time to perform parallel prefix
    computation is minimized.
  • Compute the effectiveness measures introduced in
    the lecture notes for the parallel prefix
    computation algorithms on the linear array,
    binary tree, and 2D mesh architectures.

20
Exercise 2
  • Shear-sort on 2D mesh of processors
  • Write down the number of compare-exchange steps
    required to perform shear-sort in a 2D mesh with
    r rows and p/r columns.
  • Compute the effectiveness measures for the
    shear-sort based on the results above.
  • Discuss the best row/column ratio that minimizes
    the sorting time.
  • How would shear-sort work if each processor
    initially holds more than one key?