Title: Introduction to Parallel Architectures
1. Introduction to Parallel Architectures
Dr. Laurence Boxer, Niagara University
2. Parallel Computers
- Purpose: speed
- Divide a problem among processors
- Let each processor work on its portion of the problem in parallel (simultaneously) with the other processors
- Ideal: if p is the number of processors, get the solution in 1/p of the time used by a 1-processor computer
- Actual: we rarely get that much speedup, due to delays for interprocessor communications (compare the two in the sketch below)
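As a small numeric illustration of the ideal-versus-actual contrast, here is a minimal Python sketch; the running times and processor count are hypothetical, not taken from the slides.

def speedup(t_sequential, t_parallel):
    # Speedup: time on 1 processor divided by time on p processors.
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, p):
    # Efficiency: speedup divided by p; 1.0 in the ideal case.
    return speedup(t_sequential, t_parallel) / p

# Hypothetical measurements: the ideal on 8 processors would be 100/8 = 12.5 s,
# but interprocessor communication pushes the actual time to 16 s.
t1, tp, p = 100.0, 16.0, 8
print(speedup(t1, tp))        # 6.25  (ideal would be 8)
print(efficiency(t1, tp, p))  # 0.78125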
3. Graphs of relevant functions
4. Architectural issues
Limitations on speed
- Communication diameter: how many communication steps are necessary to send data from the processor that has it to the processor that needs it; large is bad.
- Bisection width: how many wires must be cut to cut the network in half; a measure of how fast massive amounts of data can be moved through the network; large is good.
Limitation on expansion
- Degree of network: important to scalability (the ability to expand the number of processors); large is bad. (Standard values of these measures for the networks that follow are tabulated in the sketch below.)
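The three measures above have simple closed forms for the networks discussed on the later slides. The following Python sketch tabulates the standard textbook values for a linear array, a square mesh, and a hypercube of n processors; it is provided for reference and is not part of the original slides.

import math

def network_metrics(kind, n):
    # Degree, communication diameter, and bisection width of an
    # n-processor network of the given kind (standard textbook values).
    if kind == "linear array":
        return {"degree": 2, "diameter": n - 1, "bisection": 1}
    if kind == "mesh":                       # square grid with side sqrt(n)
        side = math.isqrt(n)
        return {"degree": 4, "diameter": 2 * (side - 1), "bisection": side}
    if kind == "hypercube":                  # n must be a power of 2
        d = n.bit_length() - 1               # log2(n)
        return {"degree": d, "diameter": d, "bisection": n // 2}
    raise ValueError(kind)

for kind in ("linear array", "mesh", "hypercube"):
    print(kind, network_metrics(kind, 64))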
5. PRAM - Parallel Random Access Machine
- Shared memory yields fast communications. Any processor can send data to any other processor in Θ(1) time, as follows (sketched below):
  - Source processor writes the data to shared memory
  - Destination processor reads the data from shared memory
- Fast communications make this model a theoretical ideal for the fastest possible parallel algorithms for a given number of processors
- Impractical - too many wires if there are lots of processors
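A minimal sketch of the write-then-read communication pattern just described, with shared memory modeled as a Python list; the list, the cell index, and the helper names are illustrative only, not part of the PRAM definition.

# Shared memory modeled as a plain list; any PRAM processor may read or
# write any cell in a single step.
shared = [None] * 16

def send(value, cell):
    # Step 1: the source processor writes its data to shared memory.
    shared[cell] = value

def receive(cell):
    # Step 2: the destination processor reads the data from shared memory.
    return shared[cell]

send(42, cell=3)        # processor i writes
print(receive(cell=3))  # processor j reads -> 42, after Θ(1) steps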
6. (Figure slide; no transcript.)
7. Notice the tree structure of the previous algorithm.
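The figure on slide 6 is not transcribed, but the tree structure referred to here is the usual pairwise-combining pattern for computing a total on a PRAM: in each of ceil(log2 n) rounds, pairs of partial sums are combined in parallel. The Python sketch below is my own sequential simulation of that data movement, not code from the slides.

def pram_tree_total(values):
    # Tree-structured total: each round halves the number of active
    # partial sums, so ceil(log2 n) parallel rounds suffice on a PRAM.
    sums = list(values)
    stride = 1
    while stride < len(sums):
        # On a PRAM all of these additions happen simultaneously.
        for i in range(0, len(sums) - stride, 2 * stride):
            sums[i] += sums[i + stride]
        stride *= 2
    return sums[0]

print(pram_tree_total(range(8)))  # 28, after 3 rounds for n = 8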
8. Linear array architecture
- Degree of network: 2 - easily expanded
- Bisection width: 1 - can't move large amounts of data efficiently across the network
- Communication diameter: n-1 - won't perform global communication operations efficiently
9. Total on linear array
- Assume 1 item per processor
- The communication diameter of n-1 implies a running time of Θ(n) (simulated in the sketch below)
- Since this is the time required to total n items on a RAM, there is no asymptotic benefit to using a linear array for this problem
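A sketch of the linear-array total under the slide's assumption of one item per processor: partial sums are passed one link at a time toward the rightmost processor, so n-1 communication steps are needed. This is a sequential Python simulation of the data movement.

def linear_array_total(items):
    # partial[i] is the value currently held by processor i; at each step
    # processor i passes its partial sum one link to the right.
    partial = list(items)
    n = len(partial)
    for i in range(n - 1):          # n - 1 communication steps in all
        partial[i + 1] += partial[i]
    return partial[-1]

print(linear_array_total([3, 1, 4, 1, 5, 9, 2, 6]))  # 31, after 7 steps: Θ(n) time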
10. Input-based sorting on a linear array
- The algorithm illustrated is a version of Selection Sort - each processor keeps the smallest value it sees and passes the others to the right.
- Time is proportional to the communication diameter, i.e., Θ(n). (A reconstruction of the algorithm is sketched below.)
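A reconstruction of the illustrated algorithm (my own sketch, since the figure itself is not transcribed): values enter the array from the left, each processor keeps the smallest value it has seen so far, and larger values are passed on to the right neighbor. The simulation below moves one value at a time; on the real array the comparisons in different processors are pipelined, giving Θ(n) total time.

def linear_array_sort(values):
    # held[i] is the value kept by processor i; after the pipeline drains,
    # processor i holds the i-th smallest input value.
    n = len(values)
    held = [None] * n
    for x in values:                    # values entering from the left
        for i in range(n):
            if held[i] is None:
                held[i] = x
                break
            if x < held[i]:             # keep the smaller, pass the larger right
                held[i], x = x, held[i]
    return held

print(linear_array_sort([5, 2, 9, 1, 7, 3, 8, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]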
11. Mesh architecture
- Square grid of processors
- Each processor is connected by a communication link to its N, S, E, W neighbors (see the sketch below)
- Degree of network: 4 - makes expansion easy - can introduce adjacent meshes and connect border processors
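For concreteness, a small sketch of the wiring just described: processors are identified by (row, column) positions in a side x side grid, and each links to its N, S, E, W neighbors. The coordinate convention is mine; border processors simply have fewer than 4 neighbors.

def mesh_neighbors(row, col, side):
    # N, S, E, W neighbors of processor (row, col) in a side x side mesh;
    # positions outside the grid are dropped, so border degree is < 4.
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < side and 0 <= c < side]

print(mesh_neighbors(0, 0, 4))  # corner processor: 2 neighbors
print(mesh_neighbors(2, 1, 4))  # interior processor: 4 neighbors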
12. Application: sorting
Could have the initial data all in the wrong half of the mesh, as shown.
In 1 time unit, the amount of data that can cross into the correct half of the mesh is at most the bisection width, Θ(√n).
Since all n items must get to the correct half-mesh, the time required to sort is Ω(n / √n) = Ω(√n).
13. In a mesh, each of these steps takes Θ(√n) time.
Hence, the time for a broadcast is Θ(√n).
14. Semigroup operation (e.g., total) in a mesh
1. Roll up the columns in parallel, totaling each column into the last row by sending data downward. Time: Θ(√n)
2. Roll up the last row to get the total in a corner processor. Time: Θ(√n)
3. Broadcast the total from the corner to all processors. Time: Θ(√n)
(The three phases are simulated in the sketch below.)
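A Python simulation of the three phases above on a side x side grid of values (my own reconstruction; each phase moves data only between N/S/E/W neighbors, so each costs Θ(√n) parallel steps on the mesh).

def mesh_total(grid):
    # grid[r][c] is the value held by processor (r, c) of a side x side mesh;
    # returns what every processor holds after the three phases.
    side = len(grid)

    # Phase 1: roll each column downward into the last row
    # (side - 1 neighbor-to-neighbor steps on the mesh).
    last_row = [sum(grid[r][c] for r in range(side)) for c in range(side)]

    # Phase 2: roll the last row into a corner (another side - 1 steps).
    corner = sum(last_row)

    # Phase 3: broadcast the corner value to all processors
    # (e.g., up the last column and then along the rows, Θ(side) steps).
    return [[corner] * side for _ in range(side)]

grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(mesh_total(grid)[0][0])  # 45 on every processor; total time Θ(√n)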
15. Mesh total algorithm - continued
The previous algorithm could run in approximately half the time by gathering the total in a center, rather than a corner, processor.
However, the running time is still Θ(√n), i.e., still approximately proportional to the communication diameter (with a smaller constant of proportionality).
16. Hypercube
- The number n of processors is a power of 2
- Processors are numbered from 0 to n-1
- Connected processors are those whose binary labels differ in exactly 1 bit (see the sketch below).
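A quick sketch of the labeling rule above: the neighbors of processor i in an n-processor hypercube are obtained by flipping each of the log2(n) bits of i's label in turn, so the degree of the network is log2(n).

def hypercube_neighbors(i, n):
    # Neighbors of processor i in an n-processor hypercube (n a power of 2):
    # flip each of the log2(n) bits of i's binary label.
    dim = n.bit_length() - 1
    return [i ^ (1 << b) for b in range(dim)]

print(hypercube_neighbors(5, 16))  # [4, 7, 1, 13] - labels differing from 0101 in one bit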
17. Illustration of the total operation in a hypercube.
Reverse the direction of the arrows to broadcast the result.
Time: Θ(log n) (simulated in the sketch below)
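The figure is not transcribed, but the standard hypercube total works as follows: in round b, each processor whose label has bit b equal to 0 adds in the partial sum of its neighbor across dimension b, so after log2(n) rounds processor 0 holds the total; reversing the pattern broadcasts it back. The Python sketch below simulates the data movement and is not code from the slides.

def hypercube_total(values):
    # values[i] lives on processor i; n = len(values) is a power of 2.
    # After log2(n) rounds, processor 0 holds the total.
    partial = list(values)
    n = len(partial)
    dim = n.bit_length() - 1
    for b in range(dim):
        for i in range(n):
            if i & (1 << b) == 0:                      # bit b of the label is 0:
                partial[i] += partial[i ^ (1 << b)]    # add neighbor across bit b
    return partial[0]

print(hypercube_total(range(8)))  # 28, after 3 rounds: Θ(log n) time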
18. Coarse-grained parallelism
- Most of the previous discussion was of fine-grained parallelism - the number of processors is comparable to the number of data items
- Realistically, few budgets accommodate such expensive computers - more likely to use coarse-grained parallelism, with relatively few processors compared with the number of data items.
- Coarse-grained algorithms are often based on each processor boiling its share of the data down to a single partial result, then using a fine-grained algorithm to combine these partial results
19. Example: coarse-grained total
Suppose n data are distributed evenly (n/p per processor) among p processors.
1. In parallel, each processor totals its share of the data. Time: Θ(n/p)
2. Use a fine-grained algorithm to add the partial sums (total residing in one processor) and broadcast the result to all processors. In the case of a mesh, time: Θ(√p)
Total time for the mesh: Θ(n/p + √p)
Since √p = O(n/p) in the coarse-grained setting (n is much larger than p), this is Θ(n/p) - optimal. (See the sketch below.)
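Putting the two steps together, a Python sketch of the coarse-grained total; the fine-grained combine of the p partial sums is abstracted to a single call, and the helper name and data layout are mine rather than the slides'.

def coarse_grained_total(data, p):
    # Coarse-grained total of n = len(data) items on p processors.
    n = len(data)
    share = n // p                 # assume p divides n evenly, n/p items each
    # Step 1: each processor totals its own share in parallel, Θ(n/p) time.
    local_sums = [sum(data[i * share:(i + 1) * share]) for i in range(p)]
    # Step 2: combine the p partial sums with a fine-grained algorithm and
    # broadcast (Θ(√p) time on a mesh); every processor then holds the total.
    return sum(local_sums)

data = list(range(64))
print(coarse_grained_total(data, p=4))  # 2016; total time Θ(n/p + √p) = Θ(n/p)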
20. More info
Algorithms Sequential and Parallel, by Russ Miller and Laurence Boxer, Prentice-Hall, 2000 (available December 1999).