1
Introduction to Parallel Architectures
Dr. Laurence Boxer, Niagara University
2
Parallel Computers
  • Purpose - speed
  • Divide a problem among processors
  • Let each processor work on its portion of problem
    in parallel (simultaneously) with other
    processors
  • Ideal - if p is the number of processors, get
    solution in 1/p of the time used by a computer of
    1 processor
  • Actual - rarely get that much speedup, due to
    delays for interprocessor communications
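
A quick worked example of the ideal-versus-actual gap (all numbers below are invented for illustration):

```python
sequential_time = 80.0   # seconds on a 1-processor computer (assumed)
p = 8                    # number of processors

ideal_time = sequential_time / p          # 10.0 s: perfect 1/p speedup
actual_time = 14.0                        # assumed: communication delays add overhead
speedup = sequential_time / actual_time   # ~5.7x, short of the ideal 8x
print(f"ideal: {ideal_time:.1f}s; actual speedup: {speedup:.1f}x of ideal {p}x")
```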

3
Graphs of relevant functions
4
Architectural issues
Limitations on speed
  • Communications diameter - how many communication
    steps are necessary to send data from processor
    that has it to processor that needs it - large is
    bad
  • Bisection width - how many wires must be cut to
    split the network in half - a measure of how fast
    massive amounts of data can be moved through the
    network - large is good

Limitation on expansion
  • Degree of network - important to scalability
    (ability to expand number of processors) - large
    is bad
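
To make these three measures concrete, here is a small sketch computing them for the networks discussed on the following slides (standard textbook values; the mesh is assumed square with n a perfect square, and the hypercube has n a power of 2):

```python
import math

def linear_array(n):
    # chain of n processors: degree 2, one link in the middle, n-1 hops end to end
    return {"degree": 2, "bisection_width": 1, "comm_diameter": n - 1}

def mesh(n):
    side = math.isqrt(n)  # sqrt(n) x sqrt(n) grid
    return {"degree": 4, "bisection_width": side, "comm_diameter": 2 * (side - 1)}

def hypercube(n):
    d = n.bit_length() - 1  # dimension = log2(n)
    return {"degree": d, "bisection_width": n // 2, "comm_diameter": d}

for name, f in [("linear array", linear_array), ("mesh", mesh), ("hypercube", hypercube)]:
    print(name, f(16))
```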

5
PRAM - Parallel Random Access Machine
  • Shared memory yields fast communications

Any processor can send data to any other
processor in Θ(1) time, as follows:
  • Source processor writes the data to memory
  • Destination processor reads the data from memory

  • Fast communications make this model a theoretical
    ideal for the fastest possible parallel algorithms
    for a given number of processors
  • Impractical - too many wires if there are lots of processors

6
[Slide image, no transcript: the parallel total algorithm referred to on the next slide]
7
Notice the tree structure of the previous
algorithm
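
A minimal sketch of that tree-structured total, simulated sequentially (on a PRAM, each pair in a round would be combined by a separate processor, so n values are totaled in ceil(log2 n) rounds):

```python
def pram_total(values):
    vals = list(values)
    while len(vals) > 1:
        # one parallel round: processor i adds vals[2i] and vals[2i+1];
        # a lone leftover value simply passes through to the next round
        vals = [sum(vals[i:i + 2]) for i in range(0, len(vals), 2)]
    return vals[0]

print(pram_total(range(1, 9)))  # 36, computed in 3 "parallel" rounds
```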
8
Linear array architecture
  • Degree of network: 2 - easily expanded
  • Bisection width: 1 - can't move large amounts of
    data efficiently across the network
  • Communication diameter: n-1 - won't perform
    global communication operations efficiently

9
Total on linear array
  • Assume 1 item per processor
  • The communication diameter implies a running time of Ω(n)
  • Since Θ(n) is the time required to total n items
    on a RAM, there is no asymptotic benefit to using
    a linear array for this problem
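
A sketch of the linear-array total, simulated sequentially (the total is accumulated left to right, so the rightmost processor holds it after n-1 communication steps, matching the communication diameter):

```python
def linear_array_total(values):
    partial = list(values)        # partial[i] lives on processor i
    n = len(partial)
    for step in range(n - 1):     # n-1 communication steps
        # processor 'step' sends its partial sum one link to the right
        partial[step + 1] += partial[step]
    return partial[-1]            # rightmost processor holds the total

print(linear_array_total([3, 1, 4, 1, 5]))  # 14 after 4 steps
```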

10
Input-based sorting on a linear array
  • The algorithm illustrated is a version of
    Selection Sort - each processor selects the
    smallest value it sees and passes the others to the
    right.
  • Time is proportional to the communication diameter: Θ(n)
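
A sequential simulation of the per-processor logic (the streaming order and names are illustrative; on the real array the passing is pipelined, which is what keeps the time at Θ(n)):

```python
def linear_array_sort(values):
    kept = []                             # kept[i] = value held by processor i
    for v in values:                      # values stream in at the left end
        for i in range(len(kept)):
            if v < kept[i]:               # processor i keeps the smaller value
                kept[i], v = v, kept[i]   # and passes the larger one right
        kept.append(v)                    # first idle processor keeps the rest
    return kept                           # processor i ends with the i-th smallest

print(linear_array_sort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```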

11
Mesh architecture
  • Square grid of processors
  • Each processor connected by a communication link to
    its N, S, E, W neighbors
  • Degree of network: 4 - makes expansion easy - can
    introduce adjacent meshes and connect border
    processors

12
Application - sorting: could have initial data all
in the wrong half of the mesh, as shown.
In 1 time unit, the amount of data that can cross into the
correct half of the mesh is at most the bisection width, Θ(√n).
Since all n items must get to the correct half-mesh, the
time required to sort is Ω(n / √n) = Ω(√n).
13
In a mesh, each of these steps takes Θ(√n) time.
Hence, the time for a broadcast is Θ(√n).
14
Semigroup operation (e.g., total) in mesh
1. Roll up the columns in parallel, totaling each
   column in a last-row processor by sending data
   downward. Time: Θ(√n)
2. Roll up the last row to get the total in a corner
   processor. Time: Θ(√n)
3. Broadcast the total from the corner to all
   processors. Time: Θ(√n)
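
A sketch of the three phases, simulated sequentially on a √n x √n grid stored as a list of rows (each phase takes Θ(√n) parallel steps on a real mesh):

```python
def mesh_total(grid):
    side = len(grid)
    g = [row[:] for row in grid]
    # 1. Roll up columns: each step, row r sends down to row r+1 (all
    #    columns in parallel on a mesh); after side-1 steps the last
    #    row holds the column totals.
    for r in range(side - 1):
        for c in range(side):
            g[r + 1][c] += g[r][c]
    # 2. Roll up the last row; the corner processor gets the grand total.
    for c in range(side - 1):
        g[side - 1][c + 1] += g[side - 1][c]
    total = g[side - 1][side - 1]
    # 3. Broadcast the total from the corner to every processor
    #    (along the row, then up the columns), again Θ(√n) steps.
    return [[total] * side for _ in range(side)]

print(mesh_total([[1, 2], [3, 4]]))  # every processor now holds 10
```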
15
Mesh total algorithm - continued
The previous algorithm could run in approximately
half the time by gathering the total in a center,
rather than corner, processor.
However, the running time is still Θ(√n),
i.e., still approximately proportional to the
communication diameter (with a smaller constant of
proportionality).
16
Hypercube
  • Number n of processors is a power of 2
  • Processors are numbered from 0 to n-1
  • Connected processors are those whose binary
    labels differ in exactly 1 bit.
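
For example, the neighbor labels can be computed by flipping each bit of the binary label in turn:

```python
def hypercube_neighbors(i, d):
    # in a hypercube of 2^d processors, processor i is linked to the
    # d processors whose labels differ from i in exactly one bit
    return [i ^ (1 << b) for b in range(d)]

print(hypercube_neighbors(5, 3))  # 5 = 101 -> [4 (100), 7 (111), 1 (001)]
```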

17
Illustration of the total operation in a hypercube.
Reverse the direction of the arrows to broadcast the result.
Time: Θ(log n)
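
A sequential simulation of a hypercube total, sketched as a variant in which both neighbors across each dimension exchange partial sums every round, so all processors end up holding the total (folding the reversed-arrow broadcast into the same Θ(log n) rounds):

```python
def hypercube_total(values):
    vals = list(values)          # vals[i] lives on processor i
    n = len(vals)                # assumed to be a power of 2
    d = n.bit_length() - 1       # dimension = log2(n)
    for b in range(d):           # one round per dimension
        # each processor combines with its neighbor across bit b
        vals = [vals[i] + vals[i ^ (1 << b)] for i in range(n)]
    return vals                  # every processor holds the total

print(hypercube_total([1, 2, 3, 4]))  # [10, 10, 10, 10] after 2 rounds
```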
18
Coarse-grained parallelism
  • Most of the previous discussion was of fine-grained
    parallelism - the number of processors is comparable
    to the number of data items
  • Realistically, few budgets accommodate such
    expensive computers - more likely to use
    coarse-grained parallelism, with relatively few
    processors compared with the number of data items.
  • Coarse-grained algorithms are often based on each
    processor boiling its share of the data down to a
    single partial result, then using a fine-grained
    algorithm to combine these partial results

19
Example: coarse-grained total
Suppose n data items are distributed evenly (n/p per
processor) among p processors.
1. In parallel, each processor totals its share
of the data. Time: Θ(n/p)
2. Use a fine-grained algorithm to add the
partial sums (the total residing in one processor)
and broadcast the result to all processors. In the
case of a mesh, time: Θ(√p)
Total time for the mesh: Θ(n/p + √p)
Since n/p = Ω(√p) whenever n = Ω(p^(3/2)),
this is Θ(n/p) - optimal.
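
A sketch of the two phases, simulated sequentially (a tree reduction stands in for the fine-grained combine; the even split is assumed):

```python
def coarse_grained_total(data, p):
    chunk = len(data) // p       # assume p divides n evenly
    # 1. Local phase: each processor totals its n/p items, Θ(n/p),
    #    all processors working in parallel on a real machine.
    partials = [sum(data[i * chunk:(i + 1) * chunk]) for i in range(p)]
    # 2. Fine-grained combine: pairwise tree reduction of the p partial
    #    sums, Θ(log p) parallel rounds (Θ(√p) on a mesh).
    while len(partials) > 1:
        partials = [sum(partials[i:i + 2]) for i in range(0, len(partials), 2)]
    return partials[0]

print(coarse_grained_total(list(range(1, 17)), 4))  # 136
```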
20
More info
Algorithms Sequential and Parallel
by
Russ Miller and Laurence Boxer
Prentice-Hall, 2000
(available December, 1999)