Title: Introduction to Parallel Architectures
1. Introduction to Parallel Architectures
Dr. Laurence Boxer, Niagara University
2. Parallel Computers
- Purpose: speed
- Divide a problem among processors
- Let each processor work on its portion of the problem in parallel (simultaneously) with the other processors
- Ideal: if p is the number of processors, get the solution in 1/p of the time used by a 1-processor computer
- Actual: we rarely get that much speedup, due to delays for interprocessor communications (compare the two in the sketch below)
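As a small numeric illustration of the ideal-versus-actual contrast, here is a minimal Python sketch; the running times and processor count are hypothetical, not taken from the slides.

def speedup(t_sequential, t_parallel):
    # Speedup: time on 1 processor divided by time on p processors.
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, p):
    # Efficiency: speedup divided by p; 1.0 in the ideal case.
    return speedup(t_sequential, t_parallel) / p

# Hypothetical measurements: the ideal on 8 processors would be 100/8 = 12.5 s,
# but interprocessor communication pushes the actual time to 16 s.
t1, tp, p = 100.0, 16.0, 8
print(speedup(t1, tp))        # 6.25  (ideal would be 8)
print(efficiency(t1, tp, p))  # 0.78125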
3. Graphs of relevant functions
4. Architectural issues
Limitations on speed
- Communication diameter: how many communication steps are necessary to send data from the processor that has it to the processor that needs it; large is bad.
- Bisection width: how many wires must be cut to cut the network in half; a measure of how fast massive amounts of data can be moved through the network; large is good.
Limitation on expansion
- Degree of network: important to scalability (the ability to expand the number of processors); large is bad. (Standard values of these measures for the networks that follow are tabulated in the sketch below.)
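The three measures above have simple closed forms for the networks discussed on the later slides. The following Python sketch tabulates the standard textbook values for a linear array, a square mesh, and a hypercube of n processors; it is provided for reference and is not part of the original slides.

import math

def network_metrics(kind, n):
    # Degree, communication diameter, and bisection width of an
    # n-processor network of the given kind (standard textbook values).
    if kind == "linear array":
        return {"degree": 2, "diameter": n - 1, "bisection": 1}
    if kind == "mesh":                       # square grid with side sqrt(n)
        side = math.isqrt(n)
        return {"degree": 4, "diameter": 2 * (side - 1), "bisection": side}
    if kind == "hypercube":                  # n must be a power of 2
        d = n.bit_length() - 1               # log2(n)
        return {"degree": d, "diameter": d, "bisection": n // 2}
    raise ValueError(kind)

for kind in ("linear array", "mesh", "hypercube"):
    print(kind, network_metrics(kind, 64))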
5. PRAM - Parallel Random Access Machine
- Shared memory yields fast communications. Any processor can send data to any other processor in Θ(1) time, as follows (sketched below):
  - Source processor writes the data to shared memory
  - Destination processor reads the data from shared memory
- Fast communications make this model a theoretical ideal for the fastest possible parallel algorithms for a given number of processors
- Impractical - too many wires if there are lots of processors
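A minimal sketch of the write-then-read communication pattern just described, with shared memory modeled as a Python list; the list, the cell index, and the helper names are illustrative only, not part of the PRAM definition.

# Shared memory modeled as a plain list; any PRAM processor may read or
# write any cell in a single step.
shared = [None] * 16

def send(value, cell):
    # Step 1: the source processor writes its data to shared memory.
    shared[cell] = value

def receive(cell):
    # Step 2: the destination processor reads the data from shared memory.
    return shared[cell]

send(42, cell=3)        # processor i writes
print(receive(cell=3))  # processor j reads -> 42, after Θ(1) steps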
6. (Figure slide; no transcript.)
7. Notice the tree structure of the previous algorithm.
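The figure on slide 6 is not transcribed, but the tree structure referred to here is the usual pairwise-combining pattern for computing a total on a PRAM: in each of ceil(log2 n) rounds, pairs of partial sums are combined in parallel. The Python sketch below is my own sequential simulation of that data movement, not code from the slides.

def pram_tree_total(values):
    # Tree-structured total: each round halves the number of active
    # partial sums, so ceil(log2 n) parallel rounds suffice on a PRAM.
    sums = list(values)
    stride = 1
    while stride < len(sums):
        # On a PRAM all of these additions happen simultaneously.
        for i in range(0, len(sums) - stride, 2 * stride):
            sums[i] += sums[i + stride]
        stride *= 2
    return sums[0]

print(pram_tree_total(range(8)))  # 28, after 3 rounds for n = 8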
8. Linear array architecture
- Degree of network: 2 - easily expanded
- Bisection width: 1 - can't move large amounts of data efficiently across the network
- Communication diameter: n-1 - won't perform global communication operations efficiently
9. Total on linear array
- Assume 1 item per processor
- The communication diameter of n-1 implies a running time of Θ(n) (simulated in the sketch below)
- Since this is the time required to total n items on a RAM, there is no asymptotic benefit to using a linear array for this problem
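A sketch of the linear-array total under the slide's assumption of one item per processor: partial sums are passed one link at a time toward the rightmost processor, so n-1 communication steps are needed. This is a sequential Python simulation of the data movement.

def linear_array_total(items):
    # partial[i] is the value currently held by processor i; at each step
    # processor i passes its partial sum one link to the right.
    partial = list(items)
    n = len(partial)
    for i in range(n - 1):          # n - 1 communication steps in all
        partial[i + 1] += partial[i]
    return partial[-1]

print(linear_array_total([3, 1, 4, 1, 5, 9, 2, 6]))  # 31, after 7 steps: Θ(n) time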
10. Input-based sorting on a linear array
- The algorithm illustrated is a version of Selection Sort - each processor keeps the smallest value it sees and passes the others to the right.
- Time is proportional to the communication diameter, i.e., Θ(n). (A reconstruction of the algorithm is sketched below.)
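A reconstruction of the illustrated algorithm (my own sketch, since the figure itself is not transcribed): values enter the array from the left, each processor keeps the smallest value it has seen so far, and larger values are passed on to the right neighbor. The simulation below moves one value at a time; on the real array the comparisons in different processors are pipelined, giving Θ(n) total time.

def linear_array_sort(values):
    # held[i] is the value kept by processor i; after the pipeline drains,
    # processor i holds the i-th smallest input value.
    n = len(values)
    held = [None] * n
    for x in values:                    # values entering from the left
        for i in range(n):
            if held[i] is None:
                held[i] = x
                break
            if x < held[i]:             # keep the smaller, pass the larger right
                held[i], x = x, held[i]
    return held

print(linear_array_sort([5, 2, 9, 1, 7, 3, 8, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]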
11. Mesh architecture
- Square grid of processors
- Each processor is connected by a communication link to its N, S, E, W neighbors (see the sketch below)
- Degree of network: 4 - makes expansion easy - can introduce adjacent meshes and connect border processors
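For concreteness, a small sketch of the wiring just described: processors are identified by (row, column) positions in a side x side grid, and each links to its N, S, E, W neighbors. The coordinate convention is mine; border processors simply have fewer than 4 neighbors.

def mesh_neighbors(row, col, side):
    # N, S, E, W neighbors of processor (row, col) in a side x side mesh;
    # positions outside the grid are dropped, so border degree is < 4.
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < side and 0 <= c < side]

print(mesh_neighbors(0, 0, 4))  # corner processor: 2 neighbors
print(mesh_neighbors(2, 1, 4))  # interior processor: 4 neighbors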
12. Application: sorting
Could have the initial data all in the wrong half of the mesh, as shown.
In 1 time unit, the amount of data that can cross into the correct half of the mesh is at most the bisection width, Θ(√n).
Since all n items must get to the correct half-mesh, the time required to sort is Ω(n / √n) = Ω(√n).
13. In a mesh, each of these steps takes Θ(√n) time.
Hence, the time for a broadcast is Θ(√n).
14. Semigroup operation (e.g., total) in a mesh
1. Roll up the columns in parallel, totaling each column into the last row by sending data downward. Time: Θ(√n)
2. Roll up the last row to get the total in a corner processor. Time: Θ(√n)
3. Broadcast the total from the corner to all processors. Time: Θ(√n)
(The three phases are simulated in the sketch below.)
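A Python simulation of the three phases above on a side x side grid of values (my own reconstruction; each phase moves data only between N/S/E/W neighbors, so each costs Θ(√n) parallel steps on the mesh).

def mesh_total(grid):
    # grid[r][c] is the value held by processor (r, c) of a side x side mesh;
    # returns what every processor holds after the three phases.
    side = len(grid)

    # Phase 1: roll each column downward into the last row
    # (side - 1 neighbor-to-neighbor steps on the mesh).
    last_row = [sum(grid[r][c] for r in range(side)) for c in range(side)]

    # Phase 2: roll the last row into a corner (another side - 1 steps).
    corner = sum(last_row)

    # Phase 3: broadcast the corner value to all processors
    # (e.g., up the last column and then along the rows, Θ(side) steps).
    return [[corner] * side for _ in range(side)]

grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(mesh_total(grid)[0][0])  # 45 on every processor; total time Θ(√n)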
15. Mesh total algorithm - continued
The previous algorithm could run in approximately half the time by gathering the total in a center, rather than a corner, processor.
However, the running time is still Θ(√n), i.e., still approximately proportional to the communication diameter (with a smaller constant of proportionality).
16. Hypercube
- The number n of processors is a power of 2
- Processors are numbered from 0 to n-1
- Connected processors are those whose binary labels differ in exactly 1 bit (see the sketch below).
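A quick sketch of the labeling rule above: the neighbors of processor i in an n-processor hypercube are obtained by flipping each of the log2(n) bits of i's label in turn, so the degree of the network is log2(n).

def hypercube_neighbors(i, n):
    # Neighbors of processor i in an n-processor hypercube (n a power of 2):
    # flip each of the log2(n) bits of i's binary label.
    dim = n.bit_length() - 1
    return [i ^ (1 << b) for b in range(dim)]

print(hypercube_neighbors(5, 16))  # [4, 7, 1, 13] - labels differing from 0101 in one bit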
17. Illustration of the total operation in a hypercube.
Reverse the direction of the arrows to broadcast the result.
Time: Θ(log n) (simulated in the sketch below)
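The figure is not transcribed, but the standard hypercube total works as follows: in round b, each processor whose label has bit b equal to 0 adds in the partial sum of its neighbor across dimension b, so after log2(n) rounds processor 0 holds the total; reversing the pattern broadcasts it back. The Python sketch below simulates the data movement and is not code from the slides.

def hypercube_total(values):
    # values[i] lives on processor i; n = len(values) is a power of 2.
    # After log2(n) rounds, processor 0 holds the total.
    partial = list(values)
    n = len(partial)
    dim = n.bit_length() - 1
    for b in range(dim):
        for i in range(n):
            if i & (1 << b) == 0:                      # bit b of the label is 0:
                partial[i] += partial[i ^ (1 << b)]    # add neighbor across bit b
    return partial[0]

print(hypercube_total(range(8)))  # 28, after 3 rounds: Θ(log n) time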
18. Coarse-grained parallelism
- Most of the previous discussion was of fine-grained parallelism - the number of processors is comparable to the number of data items
- Realistically, few budgets accommodate such expensive computers - more likely to use coarse-grained parallelism, with relatively few processors compared with the number of data items.
- Coarse-grained algorithms are often based on each processor boiling its share of the data down to a single partial result, then using a fine-grained algorithm to combine these partial results
19. Example: coarse-grained total
Suppose n data are distributed evenly (n/p per processor) among p processors.
1. In parallel, each processor totals its share of the data. Time: Θ(n/p)
2. Use a fine-grained algorithm to add the partial sums (total residing in one processor) and broadcast the result to all processors. In the case of a mesh, time: Θ(√p)
Total time for the mesh: Θ(n/p + √p)
Since √p = O(n/p) in the coarse-grained setting (n is much larger than p), this is Θ(n/p) - optimal. (See the sketch below.)
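Putting the two steps together, a Python sketch of the coarse-grained total; the fine-grained combine of the p partial sums is abstracted to a single call, and the helper name and data layout are mine rather than the slides'.

def coarse_grained_total(data, p):
    # Coarse-grained total of n = len(data) items on p processors.
    n = len(data)
    share = n // p                 # assume p divides n evenly, n/p items each
    # Step 1: each processor totals its own share in parallel, Θ(n/p) time.
    local_sums = [sum(data[i * share:(i + 1) * share]) for i in range(p)]
    # Step 2: combine the p partial sums with a fine-grained algorithm and
    # broadcast (Θ(√p) time on a mesh); every processor then holds the total.
    return sum(local_sums)

data = list(range(64))
print(coarse_grained_total(data, p=4))  # 2016; total time Θ(n/p + √p) = Θ(n/p)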
20. More info
Algorithms Sequential and Parallel, by Russ Miller and Laurence Boxer, Prentice-Hall, 2000 (available December 1999).