Parallel Computing Final Exam Review - PowerPoint PPT Presentation

Slides: 142
Provided by: xw

1
Parallel Computing Final Exam Review
2
What is Parallel computing?
  • Parallel computing involves performing parallel
    tasks using more than one computer.
  • Example in real life with related principles --
    book shelving in a library
  • Single worker
  • P workers with each worker stacking n/p books,
    but with an arbitration problem (many workers try
    to stack the next book on the same shelf).
  • P workers with each worker stacking n/p books,
    but without an arbitration problem (each worker
    works on a different set of shelves).

3
Important Issues in parallel computing
  • Task/Program Partitioning.
  • How to split a single task among the processors
    so that each processor performs the same amount
    of work, and all processors work collectively to
    complete the task.
  • Data Partitioning.
  • How to split the data evenly among the processors
    in such a way that processor interaction is
    minimized.
  • Communication/Arbitration.
  • How we allow communication among different
    processors and how we arbitrate communication
    related conflicts.

4
Challenges
  1. Design of parallel computers so that we resolve
    the above issues.
  2. Design, analysis and evaluation of parallel
    algorithms run on these machines.
  3. Portability and scalability issues related to
    parallel programs and algorithms
  4. Tools and libraries used in such systems.

5
Units of Measure in HPC
  • High Performance Computing (HPC) units are
  • Flop: floating point operation
  • Flop/s: floating point operations per second
  • Bytes: size of data (a double precision floating
    point number is 8 bytes)
  • Typical sizes are millions, billions,
    trillions, ...
  • Mega: Mflop/s = $10^6$ flop/sec; Mbyte = $2^{20}$ = 1048576 ≈ $10^6$ bytes
  • Giga: Gflop/s = $10^9$ flop/sec; Gbyte = $2^{30}$ ≈ $10^9$ bytes
  • Tera: Tflop/s = $10^{12}$ flop/sec; Tbyte = $2^{40}$ ≈ $10^{12}$ bytes
  • Peta: Pflop/s = $10^{15}$ flop/sec; Pbyte = $2^{50}$ ≈ $10^{15}$ bytes
  • Exa: Eflop/s = $10^{18}$ flop/sec; Ebyte = $2^{60}$ ≈ $10^{18}$ bytes
  • Zetta: Zflop/s = $10^{21}$ flop/sec; Zbyte = $2^{70}$ ≈ $10^{21}$ bytes
  • Yotta: Yflop/s = $10^{24}$ flop/sec; Ybyte = $2^{80}$ ≈ $10^{24}$ bytes
  • See www.top500.org for the current list of fastest
    machines

6
What is a parallel computer?
  • A parallel computer is a collection of processors
    that cooperatively solve computationally
    intensive problems faster than other computers.
  • Parallel algorithms allow the efficient
    programming of parallel computers.
  • This way the waste of computational resources can
    be avoided.
  • Parallel computer vs. supercomputer
  • A supercomputer is a general-purpose
    computer that can solve computationally intensive
    problems faster than traditional computers.
  • A supercomputer may or may not be a parallel
    computer.

7
Flynn's taxonomy of computer architectures
(control mechanism)
  • Depending on the execution and data streams
    computer architectures can be distinguished into
    the following groups.
  • (1) SISD (Single Instruction Single Data): This
    is a sequential computer.
  • (2) SIMD (Single Instruction Multiple Data):
    This is a parallel machine like the Thinking
    Machines CM-200. SIMD machines are suited for
    data-parallel programs where the same set of
    instructions is executed on a large data set.
  • Some of the earliest parallel computers such as
    the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1
    belonged to this class of machines.
  • (3) MISD (Multiple Instructions Single Data):
    Some consider a systolic array a member of this
    group.
  • (4) MIMD (Multiple Instructions Multiple Data):
    All other parallel machines. A MIMD architecture
    can be MPMD or SPMD. In a Multiple Program
    Multiple Data organization, each processor
    executes its own program, as opposed to a Single
    Program Multiple Data organization, in which a
    single program is executed by all processors.
  • Examples of such platforms include current
    generation Sun Ultra Servers, SGI Origin Servers,
    multiprocessor PCs, workstation clusters, and the
    IBM SP.
  • Note: Some consider the CM-5 a combination of
    MIMD and SIMD, as it contains control hardware
    that allows it to operate in a SIMD mode.

8
SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical
MIMD architecture (b).
9
Taxonomy based on Address-Space Organization
(memory distribution)
  • Message-Passing Architecture
  • In a distributed memory machine each processor
    has its own memory. Each processor can access its
    own memory faster than it can access the memory
    of a remote processor (NUMA for Non-Uniform
    Memory Access). This architecture is also known
    as message-passing architecture and such machines
    are commonly referred to as multicomputers.
  • Examples: Cray T3D/T3E, IBM SP1/SP2, workstation
    clusters.
  • Shared-Address-Space Architecture
  • Provides hardware support for read/write to a
    shared address space. Machines built this way are
    often called multiprocessors.
  • (1) A shared memory machine has a single address
    space shared by all processors (UMA, for Uniform
    Memory Access).
  • The time taken by a processor to access any
    memory word in the system is identical.
  • Examples: SGI Power Challenge, SMP machines.
  • (2) A distributed shared memory system is a
    hybrid between the two previous ones. A global
    address space is shared among the processors but
    is physically distributed among them. Example: SGI
    Origin 2000.
  • Note: The existence of caches in shared-memory
    parallel machines causes cache coherence problems
    when a cached variable is modified by one processor
    and the shared variable is requested by another
    processor. cc-NUMA stands for cache-coherent NUMA
    architectures (e.g. Origin 2000).

10
NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures: (a)
uniform-memory-access shared-address-space
computer; (b) uniform-memory-access
shared-address-space computer with caches and
memories; (c) non-uniform-memory-access
shared-address-space computer with local memory
only.
11
Message Passing vs. Shared Address Space
Platforms
  • Message passing requires little hardware support,
    other than a network.
  • Shared address space platforms can easily emulate
    message passing. The reverse is more difficult to
    do (in an efficient manner).

12
Taxonomy based on processor granularity
  • Granularity sometimes refers to the power of
    individual processors. Sometimes it is also used to
    denote the degree of parallelism.
  • (1) A coarse-grained architecture consists of
    (usually few) powerful processors (e.g. old Cray
    machines).
  • (2) A fine-grained architecture consists of
    (usually many inexpensive) processors (e.g. Thinking
    Machines CM-200, CM-2).
  • (3) A medium-grained architecture is between the
    two (e.g. CM-5).
  • Process Granularity refers to the amount of
    computation assigned to a particular processor of
    a parallel machine for a given parallel program.
    It also refers, within a single program, to the
    amount of computation performed before
    communication is issued. If the amount of
    computation is small (low degree of concurrency)
    a process is fine-grained. Otherwise granularity
    is coarse.

13
Taxonomy based on processor synchronization
  • (1) In a fully synchronous system a global clock
    is used to synchronize all operations performed
    by the processors.
  • (2) An asynchronous system lacks any
    synchronization facilities. Processor
    synchronization needs to be explicit in a user's
    program.
  • (3) A bulk-synchronous system comes in between a
    fully synchronous and an asynchronous system.
    Synchronization of processors is required only at
    certain parts of the execution of a parallel
    program.

14
Physical Organization of Parallel Platforms
Ideal architecture (PRAM)
  • The Parallel Random Access Machine (PRAM) is one
    of the simplest ways to model a parallel
    computer.
  • A PRAM consists of a collection of (sequential)
    processors that can synchronously access a global
    shared memory in unit time. Each processor can
    thus access the shared memory as fast (and as
    efficiently) as it can access its own local
    memory.
  • The main advantage of the PRAM is its simplicity
    in capturing parallelism and abstracting away
    communication and synchronization issues related
    to parallel computing.
  • Processors are considered to be in abundance and
    unlimited in number. The resulting PRAM
    algorithms thus exhibit unlimited parallelism
    (number of processors used is a function of
    problem size).
  • The abstraction thus offered by the PRAM is a
    fully synchronous collection of processors and a
    shared memory which makes it popular for parallel
    algorithm design.
  • It is, however, this abstraction that also makes
    the PRAM unrealistic from a practical point of
    view.
  • Full synchronization offered by the PRAM is too
    expensive and time demanding in parallel machines
    currently in use.
  • Remote memory (i.e. shared memory) access is
    considerably more expensive in real machines than
    local memory access
  • UMA machines with unlimited parallelism are
    difficult to build.

15
Four Subclasses of PRAM
  • Depending on how concurrent access to a single
    memory cell (of the shared memory) is resolved,
    there are various PRAM variants.
  • ER (Exclusive Read) or EW (Exclusive Write) PRAMs
    do not allow concurrent access of the shared
    memory.
  • It is allowed, however, for CR (Concurrent Read)
    or CW (Concurrent Write) PRAMs.
  • Combining the rules for read and write access
    there are four PRAM variants
  • EREW
  • access to a memory location is exclusive. No
    concurrent read or write operations are allowed.
  • Weakest PRAM model
  • CREW
  • Multiple read accesses to a memory location are
    allowed. Multiple write accesses to a memory
    location are serialized.
  • ERCW
  • Multiple write accesses to a memory location are
    allowed. Multiple read accesses to a memory
    location are serialized.
  • Can simulate an EREW PRAM
  • CRCW
  • Allows multiple read and write accesses to a
    common memory location.
  • Most powerful PRAM model
  • Can simulate both EREW PRAM and CREW PRAM

16
Resolve concurrent write access
  • (1) in the arbitrary PRAM, if multiple processors
    write into a single shared memory cell, then an
    arbitrary processor succeeds in writing into this
    cell.
  • (2) in the common PRAM, processors must write the
    same value into the shared memory cell.
  • (3) in the priority PRAM the processor with the
    highest priority (smallest or largest indexed
    processor) succeeds in writing.
  • (4) in the combining PRAM, if more than one
    processor writes into the same memory cell, the
    result written into it depends on the combining
    operator. If it is the sum operator, the sum of
    the values is written; if it is the maximum
    operator, the maximum is written.

Note: An algorithm designed for the common PRAM
can be executed on a priority or arbitrary PRAM
and exhibit similar complexity. The same holds
for an arbitrary PRAM algorithm when run on a
priority PRAM.
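
The four concurrent-write rules can be sketched as a small function that decides what ends up in one shared cell after a single write step. This is an illustrative sketch only: the function name, the (processor index, value) representation, the choice of "first listed writer" for the arbitrary rule, and "smallest index wins" for the priority rule are all assumptions, not part of the slides.

```python
from functools import reduce

def resolve_concurrent_write(writes, rule, combine=None):
    """Return the value stored in one shared cell after a concurrent write.

    writes  -- list of (processor_index, value) pairs aimed at the same cell
    rule    -- "common", "arbitrary", "priority" or "combining"
    combine -- binary operator, used only by the combining rule
    """
    if not writes:
        raise ValueError("no processor wrote to the cell")
    values = [value for _, value in writes]
    if rule == "common":
        # All writers must agree on the value; otherwise the step is illegal.
        if len(set(values)) != 1:
            raise ValueError("common PRAM requires identical values")
        return values[0]
    if rule == "arbitrary":
        # Any writer may succeed; this sketch simply picks the first listed.
        return values[0]
    if rule == "priority":
        # Assumption: the smallest processor index has the highest priority.
        return min(writes, key=lambda w: w[0])[1]
    if rule == "combining":
        # Fold all written values with the combining operator (sum, max, ...).
        return reduce(combine, values)
    raise ValueError(f"unknown rule: {rule}")
```

A common-PRAM step (all values equal) gives the same result under the arbitrary and priority rules, which is the intuition behind the note above.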
17
Interconnection Networks for Parallel Computers
  • Interconnection networks carry data between
    processors and to memory.
  • Interconnects are made of switches and links
    (wires, fiber).
  • Interconnects are classified as static or
    dynamic.
  • Static networks
  • Consist of point-to-point communication links
    among processing nodes
  • Also referred to as direct networks
  • Dynamic networks
  • Built using switches (switching element) and
    links
  • Communication links are connected to one another
    dynamically by the switches to establish paths
    among processing nodes and memory banks.

18
Static and Dynamic Interconnection Networks
Classification of interconnection networks: (a) a
static network and (b) a dynamic network.
19
Network Topologies
  • Bus-Based Networks
  • The simplest network, consisting of a shared
    medium (bus) that is common to all the nodes.
  • The distance between any two nodes in the network
    is constant (O(1)).
  • Ideal for broadcasting information among nodes.
  • Scalable in terms of cost, but not scalable in
    terms of performance.
  • The bounded bandwidth of a bus places limitations
    on the overall performance as the number of nodes
    increases.
  • Typical bus-based machines are limited to dozens
    of nodes.
  • Sun Enterprise servers and Intel Pentium based
    shared-bus multiprocessors are examples of such
    architectures

20
Network Topologies
  • Crossbar Networks
  • Employs a grid of switches or switching nodes to
    connect p processors to b memory banks.
  • Nonblocking network
  • The connection of a processing node to a memory
    bank does not block the connection of any other
    processing nodes to other memory banks.
  • The total number of switching nodes required is
    $\Theta(pb)$. (It is reasonable to assume $b \ge p$.)
  • Scalable in terms of performance
  • Not scalable in terms of cost.
  • Examples of machines that employ crossbars
    include the Sun Ultra HPC 10000 and the Fujitsu
    VPP500

21
Network Topologies Multistage Network
  • Multistage Networks
  • Intermediate class of networks between bus-based
    network and crossbar network
  • Blocking networks: access to a memory bank by a
    processor may disallow access to another memory
    bank by another processor.
  • More scalable than the bus-based network in terms
    of performance, more scalable than crossbar
    network in terms of cost.

The schematic of a typical multistage
interconnection network.
22
Network Topologies Multistage Omega Network
  • Omega network
  • Consists of $\log p$ stages, where $p$ is the number of
    inputs (processing nodes) and also the number of
    outputs (memory banks).
  • Each stage consists of an interconnection pattern
    that connects $p$ inputs and $p$ outputs.
  • Perfect shuffle (left rotation)
  • Each switch has two connection modes
  • Pass-through connection: the inputs are sent
    straight through to the outputs.
  • Cross-over connection: the inputs to the
    switching node are crossed over and then sent
    out.
  • Has $(p/2) \log p$ switching nodes

23
Network Topologies Multistage Omega Network
A complete omega network with the perfect
shuffle interconnects and switches can now be
illustrated.
A complete omega network connecting eight inputs
and eight outputs. An omega network has $(p/2) \log p$
switching nodes, and the cost of such a
network grows as $\Theta(p \log p)$.
24
Network Topologies Multistage Omega Network
Routing
  • Let s be the binary representation of the source
    and d be that of the destination processor.
  • The data traverses the link to the first
    switching node. If the most significant bits of s
    and d are the same, then the data is routed in
    pass-through mode by the switch; otherwise, it is
    routed in crossover mode.
  • This process is repeated for each of the log p
    switching stages, using the next most significant
    bit at each stage.
  • Note that this is not a non-blocking switch.
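
The routing rule can be traced in a few lines of code. The sketch below assumes one common way of modeling the network: line numbers are n-bit integers, each stage first applies the perfect shuffle (a left rotation of the line number) and then a 2x2 switch that either keeps or flips the lowest bit. The function name and the returned lists are illustrative choices, not taken from the slides.

```python
def omega_route(src, dst, n_bits):
    """Trace a message through an omega network with 2**n_bits inputs.

    Returns (positions, modes): the line number the message occupies after
    each stage, and the switch setting used at each stage.
    """
    mask = (1 << n_bits) - 1
    pos = src
    positions, modes = [], []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the line number left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # After the rotation, the low bit is the stage-th most significant
        # bit of src; compare it with the same bit of dst.
        s_bit = pos & 1
        d_bit = (dst >> (n_bits - 1 - stage)) & 1
        if s_bit == d_bit:
            modes.append("pass-through")
        else:
            modes.append("crossover")
            pos ^= 1  # leave the switch on its other output
        positions.append(pos)
    return positions, modes
```

In this model, `omega_route(0b010, 0b111, 3)` and `omega_route(0b110, 0b100, 3)` both occupy line 101 after the first stage, so the two messages contend for the same link and one of them has to wait. That is the sense in which the network is blocking.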

25
Network Topologies Multistage Omega Network
Routing
An example of blocking in an omega network: one of
the messages (010 to 111 or 110 to 100) is
blocked at link AB.
26
Network Topologies - Fixed Connection Networks
(static)
  • Completely-connected network
  • Star-Connected Network
  • Linear array
  • 2d-array or 2d-mesh or mesh
  • 3d-mesh
  • Complete Binary Tree (CBT)
  • 2d-Mesh of Trees
  • Hypercube

27
Evaluating Static Interconnection Network
  • One can view an interconnection network as a
    graph whose nodes correspond to processors and
    its edges to links connecting neighboring
    processors. The properties of these
    interconnection networks can be described in
    terms of a number of criteria.
  • (1) Set of processor nodes $V$. The cardinality of
    $V$ is the number of processors $p$ (also denoted by
    $n$).
  • (2) Set of edges $E$ linking the processors. An
    edge $e = (u, v)$ is represented by a pair $(u, v)$
    of nodes. If the graph $G = (V, E)$ is directed,
    this means that there is a unidirectional link
    from processor $u$ to $v$. If the graph is
    undirected, the link is bidirectional. In almost
    all networks that will be considered in this
    course, communication links will be bidirectional.
    The exceptions will be clearly distinguished.
  • (3) The degree $d_u$ of node $u$ is the number of
    links containing $u$ as an endpoint. If graph $G$ is
    directed we distinguish between the out-degree of
    $u$ (number of pairs $(u, v) \in E$, for any $v \in V$)
    and, similarly, the in-degree of $u$. The degree $d$
    of graph $G$ is the maximum of the degrees of its
    nodes, i.e. $d = \max_u d_u$.

28
Evaluating Static Interconnection Network (cont.)
  • (4) The diameter $D$ of graph $G$ is the maximum of
    the lengths of the shortest paths linking any two
    nodes of $G$. A shortest path between $u$ and $v$ is
    the path of minimal length linking $u$ and $v$. We
    denote the length of this shortest path by $d_{uv}$.
    Then, $D = \max_{u,v} d_{uv}$. The diameter of a graph $G$
    denotes the maximum delay (in terms of number of
    links traversed) that will be incurred when a
    packet is transmitted from one node to the other
    of the pair that contributes to $D$ (i.e. from $u$ to
    $v$ or the other way around, if $D = d_{uv}$). Of course,
    such a delay holds only if messages follow
    shortest paths (the case for most routing
    algorithms).
  • (5) Latency is the total time to send a message
    including software overhead. Message latency is
    the time to send a zero-length message.
  • (6) Bandwidth is the number of bits transmitted
    in unit time.
  • (7) Bisection width is the minimum number of links
    that need to be removed from $G$ to split the nodes
    into two sets of about the same size ($\pm 1$).

29
Network Topologies - Fixed Connection Networks
(static)
  • Completely-connected network
  • Each node has a direct communication link to
    every other node in the network.
  • Ideal in the sense that a node can send a message
    to another node in a single step.
  • Static counterpart of crossbar switching networks
  • Nonblocking
  • Star-Connected Network
  • One processor acts as the central processor.
    Every other processor has a communication link
    connecting it to this central processor.
  • Similar to bus-based network.
  • The central processor is the bottleneck.

30
Network Topologies Completely Connected and Star
Connected Networks
Example of an 8-node completely connected network.
(a) A completely-connected network of eight
nodes (b) a star connected network of nine
nodes.
31
Network Topologies - Fixed Connection Networks
(static)
  • Linear array
  • In a linear array, each node (except the two nodes
    at the ends) has two neighbors, one each to its
    left and right.
  • Extension: ring or 1-D torus (linear array with
    wraparound).
  • 2d-array or 2d-mesh or mesh
  • The processors are ordered to form a
    2-dimensional structure (square) so that each
    processor is connected to its four neighbors
    (north, south, east, west), except perhaps for the
    processors on the boundary.
  • Extension of the linear array to two dimensions:
    each dimension has $\sqrt{p}$ nodes, with a node
    identified by a two-tuple $(i, j)$.
  • 3d-mesh
  • A generalization of a 2d-mesh in three
    dimensions. Exercise: Find the characteristics of
    this network and its generalization in $k$
    dimensions ($k > 2$).
  • Complete Binary Tree (CBT)
  • There is only one path between any pair of
    nodes.
  • Static tree network: has a processing element at
    each node of the tree.
  • Dynamic tree network: nodes at intermediate
    levels are switching nodes and the leaf nodes are
    processing elements.
  • Communication bottleneck at higher levels of the
    tree.
  • Solution: increase the number of communication
    links and switching nodes closer to the root.

32
Network Topologies Linear Arrays
Linear arrays (a) with no wraparound links (b)
with wraparound link.
33
Network Topologies Two- and Three Dimensional
Meshes
Two and three dimensional meshes (a) 2-D mesh
with no wraparound (b) 2-D mesh with wraparound
link (2-D torus) and (c) a 3-D mesh with no
wraparound.
34
Network Topologies Tree-Based Networks
Complete binary tree networks (a) a static tree
network and (b) a dynamic tree network.
35
Network Topologies Fat Trees
A fat tree network of 16 processing nodes.
36
Evaluating Network Topologies
  • Linear array
  • $|V| = N$, $|E| = N - 1$, $d = 2$, $D = N - 1$, $bw = 1$
    (bisection width).
  • 2d-array or 2d-mesh or mesh
  • For $|V| = N$, we have a $\sqrt{N} \times \sqrt{N}$ mesh structure,
    with $|E| \le 2N = O(N)$, $d = 4$, $D = 2\sqrt{N} - 2$,
    $bw = \sqrt{N}$.
  • 3d-mesh
  • Exercise: Find the characteristics of this
    network and its generalization in $k$ dimensions
    ($k > 2$).
  • Complete Binary Tree (CBT) on $N = 2^n$ leaves
  • For a complete binary tree on $N$ leaves, we define
    the level of a node to be its distance from the
    root. The root is of level 0 and the number of
    nodes of level $i$ is $2^i$. Then, $|V| = 2N - 1$,
    $|E| = 2N - 2 = O(N)$, $d = 3$, $D = 2 \lg N$, $bw = 1$.
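
The node count, edge count, degree and diameter quoted above can be checked by building small instances and running a breadth-first search from every node. This is a brute-force sketch for small networks only. The adjacency-list representation, the function names and the heap-style numbering of tree nodes are assumptions made for illustration.

```python
from collections import deque

def linear_array(n):
    # Node i is linked to i-1 and i+1 where those exist.
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def mesh_2d(side):
    # side x side mesh without wraparound links.
    adj = {}
    for r in range(side):
        for c in range(side):
            nbrs = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
            adj[(r, c)] = [(a, b) for a, b in nbrs
                           if 0 <= a < side and 0 <= b < side]
    return adj

def complete_binary_tree(leaves):
    # Heap numbering: node k has parent k // 2; leaves must be a power of 2.
    total = 2 * leaves - 1
    adj = {k: [] for k in range(1, total + 1)}
    for k in range(2, total + 1):
        adj[k].append(k // 2)
        adj[k // 2].append(k)
    return adj

def network_metrics(adj):
    """Return (|V|, |E|, degree d, diameter D) of an undirected graph."""
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    degree = max(len(nbrs) for nbrs in adj.values())
    diameter = 0
    for start in adj:
        # BFS gives shortest-path lengths d_uv from this start node.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diameter = max(diameter, max(dist.values()))
    return len(adj), edges, degree, diameter
```

For example, `network_metrics(mesh_2d(4))` covers the case N = 16, where the formulas give d = 4 and D = 2·4 − 2 = 6.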

37
Network Topologies - Fixed Connection Networks
(static)
  • 2d-Mesh of Trees
  • An $N^2$-leaf 2d-MOT consists of $N^2$ nodes ordered as
    in an $N \times N$ 2d-array (but without the links). The $N$
    rows and $N$ columns of the 2d-MOT form $N$ row CBTs
    and $N$ column CBTs respectively.
  • For such a network, $|V| = N^2 + 2N(N - 1)$,
    $|E| = O(N^2)$, $d = 3$, $D = 4 \lg N$, $bw = N$. The 2d-MOT
    possesses an interesting decomposition property:
    if the $2N$ roots of the CBTs are removed we get four
    $N/2 \times N/2$ 2d-MOTs.
  • The 2d-Mesh of Trees (2d-MOT) combines the
    advantages of 2d-meshes and binary trees. A
    2d-mesh has large bisection width but large
    diameter ($\sqrt{N}$). On the other hand a binary tree on
    $N$ leaves has small bisection width but small
    diameter. The 2d-MOT has small diameter and large
    bisection width.
  • A 3d-MOT can be defined similarly.

38
Network Topologies - Fixed Connection Networks
(static)
  • Hypercube
  • The hypercube is the major representative of a
    class of networks that are called hypercubic
    networks. Other such networks are the butterfly,
    the shuffle-exchange graph, the de Bruijn graph,
    cube-connected cycles, etc.
  • Each vertex of an $n$-dimensional hypercube is
    represented by a binary string of length $n$.
    Therefore there are $|V| = 2^n = N$ vertices in
    such a hypercube. Two vertices are connected by
    an edge if their strings differ in exactly one
    bit position.
  • Let $u = u_1 u_2 \ldots u_i \ldots u_n$. An edge is a
    dimension $i$ edge if it links two nodes that
    differ in the $i$-th bit position.
  • This way vertex $u$ is connected to vertex
    $u^i = u_1 u_2 \ldots \bar{u}_i \ldots u_n$ with a dimension $i$ edge.
    Therefore $|E| = N \lg N / 2$ and $d = \lg N = n$.
  • The hypercube is the first network examined so
    far that has degree that is not a constant but a
    very slowly growing function of $N$. The diameter
    of the hypercube is $D = \lg N$. A path from node $u$
    to node $v$ can be determined by correcting the
    bits of $u$ to agree with those of $v$ starting from
    dimension 1 in a left-to-right fashion. The
    bisection width of the hypercube is $bw = N/2 = \Theta(N)$.
    This is a result of the following property of the
    hypercube. If all $N/2$ edges of dimension $i$ are
    removed from an $n$-dimensional hypercube, we get
    two hypercubes, each one of dimension $n - 1$.
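
The left-to-right bit-correction routing described above can be sketched directly on integer node labels. The sketch assumes that dimension 1 is the most significant bit of an n-bit label. The function names are illustrative.

```python
def hypercube_route(u, v, n):
    """Path from node u to node v in an n-dimensional hypercube.

    Bits are corrected from dimension 1 (leftmost) to dimension n, so each
    hop crosses exactly one edge of the dimension being fixed.
    """
    path = [u]
    current = u
    for i in range(1, n + 1):
        bit = 1 << (n - i)          # mask for dimension i
        if (current ^ v) & bit:     # u and v disagree in dimension i
            current ^= bit
            path.append(current)
    return path

def hypercube_edge_count(n):
    """Count edges by enumeration; should equal N lg N / 2 with N = 2**n."""
    nodes = 1 << n
    return sum(1 for u in range(nodes) for i in range(n)
               if u < (u ^ (1 << i)))
```

The number of hops equals the number of bit positions in which u and v differ, so it never exceeds n = lg N, which matches the stated diameter.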

39
Network Topologies Hypercubes and their
Construction
Construction of hypercubes from hypercubes of
lower dimension.
40
Evaluating Static Interconnection Networks
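Network | Diameter | Bisection Width | Arc Connectivity | Cost (No. of links)
Completely-connected | $1$ | $p^2/4$ | $p - 1$ | $p(p - 1)/2$
Star | $2$ | $1$ | $1$ | $p - 1$
Complete binary tree | $2 \log((p + 1)/2)$ | $1$ | $1$ | $p - 1$
Linear array | $p - 1$ | $1$ | $1$ | $p - 1$
2-D mesh, no wraparound | $2(\sqrt{p} - 1)$ | $\sqrt{p}$ | $2$ | $2(p - \sqrt{p})$
2-D wraparound mesh | $2\lfloor \sqrt{p}/2 \rfloor$ | $2\sqrt{p}$ | $4$ | $2p$
Hypercube | $\log p$ | $p/2$ | $\log p$ | $(p \log p)/2$
Wraparound k-ary d-cube | $d\lfloor k/2 \rfloor$ | $2k^{d-1}$ | $2d$ | $dp$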
41
Performance Metrics for Parallel Systems
  • Number of processing elements p
  • Execution Time
  • Parallel runtime: the time that elapses from the
    moment a parallel computation starts to the
    moment the last processing element finishes
    execution.
  • $T_s$: serial runtime
  • $T_p$: parallel runtime
  • Total Parallel Overhead $T_o$
  • Total time collectively spent by all the
    processing elements, minus the running time
    required by the fastest known sequential algorithm
    for solving the same problem on a single processing
    element.
  • $T_o = p T_p - T_s$

42
Performance Metrics for Parallel Systems
  • Speedup S
  • The ratio of the serial runtime of the best
    sequential algorithm for solving a problem to the
    time taken by the parallel algorithm to solve the
    same problem on p processing elements.
  • $S = T_s(\text{best}) / T_p$
  • Example: adding $n$ numbers: $T_p = \Theta(\log n)$,
    $T_s = \Theta(n)$, $S = \Theta(n / \log n)$
  • Theoretically, speedup can never exceed the
    number of processing elements $p$ ($S \le p$).
  • Proof: Assume the speedup is greater than $p$. Then
    each processing element can spend less than time
    $T_s/p$ solving the problem. In this case, a single
    processing element could emulate the $p$ processing
    elements and solve the problem in fewer than $T_s$
    units of time. This is a contradiction because
    speedup, by definition, is computed with respect
    to the best sequential algorithm.
  • Superlinear speedup: In practice, a speedup
    greater than $p$ is sometimes observed. This
    usually happens when the work performed by a
    serial algorithm is greater than that of its
    parallel formulation, or due to hardware features
    that put the serial implementation at a disadvantage.

43
Example for Superlinear speedup
  • Superlinear speedup
  • Example 1: Superlinear effects from caches. With
    a problem instance of size A and a 64KB cache,
    the cache hit rate is 80%. Assuming a cache latency
    of 2 ns and a DRAM latency of 100 ns, the
    memory access time is $2 \times 0.8 + 100 \times 0.2 = 21.6$ ns.
    If the computation is memory bound and performs one
    FLOP per memory access, this corresponds to a
    processing rate of 46.3 MFLOPS. With a problem
    instance of size A/2 and a 64KB cache, the cache
    hit rate is higher, i.e., 90%; 8% of the remaining
    data comes from local DRAM and the other 2% comes
    from remote DRAM with a latency of 400 ns. The
    memory access time is then
    $2 \times 0.9 + 100 \times 0.08 + 400 \times 0.02 = 17.8$ ns.
    The corresponding execution rate at each
    processor is 56.18 MFLOPS, and for two processors
    the total processing rate is 112.36 MFLOPS. The
    speedup is then $112.36 / 46.3 = 2.43$!
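
The arithmetic in this example is easy to reproduce. The sketch below recomputes the average access times, the processing rates and the speedup from the stated hit rates and latencies. The variable and function names are illustrative, and the conversion assumes one FLOP per memory access, as in the example.

```python
def mflops(access_time_ns):
    # One FLOP per memory access: 1 FLOP per nanosecond equals 1000 MFLOPS.
    return 1000.0 / access_time_ns

# One processor, instance size A: 80% cache hits (2 ns), 20% DRAM (100 ns).
serial_access_ns = 0.8 * 2 + 0.2 * 100

# Two processors, instance size A/2 each: 90% cache hits (2 ns),
# 8% local DRAM (100 ns), 2% remote DRAM (400 ns).
parallel_access_ns = 0.9 * 2 + 0.08 * 100 + 0.02 * 400

serial_rate = mflops(serial_access_ns)
parallel_rate = 2 * mflops(parallel_access_ns)   # two processors combined
speedup = parallel_rate / serial_rate

print(f"serial:   {serial_access_ns:.1f} ns, {serial_rate:.1f} MFLOPS")
print(f"parallel: {parallel_access_ns:.1f} ns, {parallel_rate:.2f} MFLOPS")
print(f"speedup on 2 processors: {speedup:.2f}")
```

The speedup exceeds 2 only because the per-access time drops when each processor's share of the data fits the cache better. With identical access times the ratio would be exactly 2.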

44
Example for Superlinear speedup
  • Superlinear speedup
  • Example 2: Superlinear effects due to exploratory
    decomposition. Explore the leaf nodes of an
    unstructured tree. Each leaf has a label
    associated with it and the objective is to find a
    node with a specified label, say 'S'. The
    solution node is the rightmost leaf in the tree.
    A serial formulation of this problem based on
    depth-first tree traversal explores the entire
    tree, i.e. all 14 nodes, taking 14 time units.
    Now consider a parallel formulation in which the
    left subtree is explored by processing element 0 and
    the right subtree is explored by processing
    element 1. The total work done by the parallel
    algorithm is only 9 nodes and the corresponding
    parallel time is 5 time units. The speedup is
    then $14/5 = 2.8$.

45
Performance Metrics for Parallel Systems (cont.)
  • Efficiency E
  • Ratio of speedup to the number of processing
    elements.
  • $E = S/p$
  • A measure of the fraction of time for which a
    processing element is usefully employed.
  • Example: adding $n$ numbers on $n$ processing
    elements: $T_p = \Theta(\log n)$, $T_s = \Theta(n)$,
    $S = \Theta(n / \log n)$, $E = \Theta(1 / \log n)$
  • Cost (also called Work or processor-time product)
    W
  • Product of parallel runtime and the number of
    processing elements used.
  • $W = T_p \cdot p$
  • Example: adding $n$ numbers on $n$ processing
    elements: $W = \Theta(n \log n)$.
  • Cost-optimal: a parallel system is cost-optimal
    if the cost of solving a problem on
    a parallel computer has the same asymptotic
    growth (in $\Theta$ terms), as a function of the input
    size, as the fastest-known sequential algorithm on
    a single processing element.
  • Problem Size $W_2$
  • The number of basic computation steps in the best
    sequential algorithm to solve the problem on a
    single processing element.
  • $W_2 = T_s$ of the fastest known algorithm to solve the
    problem on a sequential computer.
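
The metrics for adding n numbers on n processing elements can be made concrete by simulating the pairwise (tree) summation and counting rounds. This is a sketch under simplifying assumptions that are not in the slides: each addition costs one time unit, communication is free, and $T_s = n - 1$ additions. The function names are illustrative.

```python
def tree_sum_rounds(values):
    """Sum values by rounds of pairwise additions; return (total, rounds).

    All additions within a round are independent, so each round takes one
    parallel step when enough processing elements are available.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element carries over unchanged
        level = nxt
        rounds += 1
    return level[0], rounds

def addition_metrics(n):
    """Speedup, efficiency, cost and overhead for n >= 2 numbers, p = n."""
    total, t_p = tree_sum_rounds(range(1, n + 1))
    t_s = n - 1            # assumption: serial algorithm does n - 1 additions
    p = n
    speedup = t_s / t_p
    return {
        "sum": total,
        "Ts": t_s,
        "Tp": t_p,
        "S": speedup,
        "E": speedup / p,
        "cost": p * t_p,
        "overhead": p * t_p - t_s,
    }
```

For n = 16 this gives $T_p = 4$ and a cost of 64 against $T_s = 15$. The cost grows like $n \log n$ while the serial time grows like $n$, which is why this formulation is not cost-optimal.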

46
Parallel vs Sequential Computing: Amdahl's Law
  • Theorem 0.1 (Amdahl's Law) Let $f$, $0 \le f \le 1$, be
    the fraction of a computation that is inherently
    sequential. Then the maximum obtainable speedup $S$
    on $p$ processors is $S = 1 / (f + (1 - f)/p)$.
  • Proof. Let $T$ be the sequential running time for
    the named computation. $fT$ is the time spent on
    the inherently sequential part of the program. On
    $p$ processors the remaining computation, if fully
    parallelizable, would achieve a running time of
    at most $(1 - f)T/p$. This way the running time of
    the parallel program on $p$ processors is the sum
    of the execution times of the sequential and
    parallel components, that is, $fT + (1 - f)T/p$. The
    maximum allowable speedup is therefore
    $S = T / (fT + (1 - f)T/p)$ and the result is proven.

47
Amdahl's Law
  • Amdahl used this observation to advocate the
    building of even more powerful sequential
    machines, as one cannot gain much by using
    parallel machines. For example, if $f = 10\%$, then
    $S \to 10$ as $p \to \infty$. The underlying assumption in
    Amdahl's Law is that the sequential component of
    a program is a constant fraction of the whole
    program. In many instances, as problem size
    increases, the fraction of computation that is
    inherently sequential decreases. In
    many cases even a speedup of 10 is quite
    significant by itself.
  • In addition, Amdahl's Law is based on the concept
    that parallel computing always tries to minimize
    parallel time. In some cases a parallel computer
    is used to increase the problem size that can be
    solved in a fixed amount of time. For example, in
    weather prediction this would increase the
    accuracy of, say, a three-day forecast or would
    allow a more accurate five-day forecast.

48
Parallel vs Sequential Computing: Gustafson's Law
  • Theorem 0.2 (Gustafson's Law) Let the execution
    time of a parallel algorithm consist of a
    sequential segment $fT$ and a parallel segment
    $(1 - f)T$, and let the sequential segment be
    constant. The scaled speedup of the algorithm is then
    $S = (fT + (1 - f)Tp) / (fT + (1 - f)T) = f + (1 - f)p$.
  • For $f = 0.05$ and $p = 20$, we get $S = 19.05$, whereas
    Amdahl's Law gives $S = 10.26$.

Segment | 1 proc | p proc
Sequential | $fT$ | $fT$
Parallel | $(1 - f)Tp$ | $(1 - f)T$
Total | $T(f + (1 - f)p)$ | $T$

  • Amdahl's Law assumes that problem size is fixed
    when it deals with scalability. Gustafson's Law
    assumes that running time is fixed.
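
Both laws are one-line formulas, so a short sketch is enough to compare them. The function names are illustrative. The value p = 20 is an inference: it is the processor count for which f = 0.05 reproduces the quoted speedups of 19.05 and 10.26.

```python
def amdahl_speedup(f, p):
    """Fixed problem size: S = 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)

def gustafson_speedup(f, p):
    """Fixed running time (scaled problem): S = f + (1 - f) * p."""
    return f + (1.0 - f) * p

if __name__ == "__main__":
    f, p = 0.05, 20
    print(f"Amdahl:    {amdahl_speedup(f, p):.2f}")
    print(f"Gustafson: {gustafson_speedup(f, p):.2f}")
    # With f = 0.1 Amdahl's speedup stays below 1 / f = 10 for any p.
    for procs in (10, 100, 10_000):
        print(procs, f"{amdahl_speedup(0.1, procs):.3f}")
```

The gap between the two numbers comes entirely from the assumption about what is held fixed: Amdahl keeps the problem size fixed and shrinks the time, while Gustafson keeps the time fixed and grows the problem.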

49
Brents Scheduling Principle(Emulations)
  • Suppose we have an unlimited parallelism
    efficient parallel algorithm, i.e. an algorithm
    that runs on zillions of processors. In practice
    zillions of processors may not available. Suppose
    we have only p processors. A question that arises
    is what can we do to run the efficient zillion
    processor algorithm on our limited machine.
  • One answer is emulation simulate the zillion
    processor algorithm on the p processor machine.
  • Theorem 0.3 (Brents Principle) Let the execution
    time of a parallel algorithm requires m
    operations and runs in parallel time t. Then
    running this algorithm on a limited processor
    machine with only p processors would require time
    m/p t.
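  • Proof: Let $m_i$ be the number of computational
    operations at the i-th step, i.e.
    $\sum_{i=1}^{t} m_i = m$. If we assign the p
    processors on the i-th step to work on these
    $m_i$ operations they can conclude in time
    $\lceil m_i/p \rceil \le m_i/p + 1$. Thus the
    total running time on p processors would be
    $\sum_{i=1}^{t} \lceil m_i/p \rceil \le \sum_{i=1}^{t} (m_i/p + 1) = m/p + t$.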
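
The bound in Brent's principle can be checked on a small instance. The sketch below is illustrative only: the function name brent_emulated_time and the array m of per-step operation counts are assumptions, not part of the slides. It charges each step of the original algorithm the ceiling of m[i]/p time units on the p-processor machine.

```c
#include <assert.h>

/* Time to emulate a t-step parallel algorithm on p processors,
   where step i performs m[i] operations: the sum of ceil(m[i] / p). */
long brent_emulated_time(const long *m, int t, int p)
{
    assert(p >= 1 && t >= 0);
    long total = 0;
    for (int i = 0; i < t; i++) {
        assert(m[i] >= 0);
        total += (m[i] + p - 1) / p;   /* integer ceiling of m[i] / p */
    }
    return total;
}
```

For a parallel-sum style schedule with m = {8, 4, 2, 1} (so m = 15 operations in t = 4 steps) and p = 3, the emulation takes 3 + 2 + 1 + 1 = 7 steps, within the bound m/p + t = 5 + 4 = 9.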

50
The Message Passing Interface (MPI): Introduction
  • The Message-Passing Interface (MPI) is an attempt
    to create a standard to allow tasks executing on
    multiple processors to communicate through some
    standardized communication primitives.
  • It defines a standard library for message passing
    that one can use to develop message-passing
    programs using C or Fortran.
  • The MPI standard defines both the syntax and the
    semantics of this functional interface to
    message passing.
  • MPI comes in a variety of flavors: freely
    available ones such as LAM-MPI and MPICH, and
    also commercial versions such as Critical
    Software's WMPI.
  • It supports message passing on a variety of
    platforms, from Linux-based or Windows-based PCs
    to supercomputer and multiprocessor systems.
  • After the introduction of MPI, whose
    functionality includes a set of 125 functions, a
    revision of the standard took place that added
    C++ support, external memory accessibility and
    also Remote Memory Access (similar to BSP's put
    and get capability) to the standard. The
    resulting standard is known as MPI-2 and has
    grown to almost 241 functions.

51
The Message Passing Interface: A minimum set
  • A minimum set of MPI functions is described
    below. MPI functions use the prefix MPI_ and
    after the prefix the remaining keyword starts
    with a capital letter.
  • A brief explanation of the primitives can be
    found in the textbook (beginning page 242). A
    more elaborate presentation is available in the
    optional book.

Function Class | Standard | Function | Operation
Initialization and Termination | MPI | MPI_Init | Start of SPMD code
                               | MPI | MPI_Finalize | End of SPMD code
Abnormal Stop | MPI | MPI_Abort | One process halts all
Process Control | MPI | MPI_Comm_size | Number of processes
                | MPI | MPI_Comm_rank | Identifier of calling process
                | MPI | MPI_Wtime | Local (wall-clock) time
Message Passing | MPI | MPI_Send | Send message to remote proc.
                | MPI | MPI_Recv | Receive message from remote proc.
52
General MPI program
#include <mpi.h>
int main(int argc, char **argv)
{
    /* no MPI functions called before this */
    MPI_Init(&argc, &argv);
    . . .
    MPI_Finalize();
    /* no MPI functions called after this */
    return 0;
}
53
MPI and MPI-2: Initialization and Termination
  • #include <mpi.h>
  • int MPI_Init(int *argc, char ***argv);
  • int MPI_Finalize(void);
  • Multiple processes from the same source are
    created by calling MPI_Init, and these
    processes are safely terminated by calling
    MPI_Finalize.
  • The arguments of MPI_Init are pointers to the
    main function's parameters argc and argv. On
    return they hold the command line arguments
    minus the ones that were used/processed by the
    MPI implementation. Thus command line
    processing should only be performed in the
    program after the execution of this function
    call. A successful call returns MPI_SUCCESS;
    otherwise an implementation-dependent error
    code is returned.
  • Definitions are available in <mpi.h>.

54
MPI and MPI-2: Abort
  • int MPI_Abort(MPI_Comm comm, int errcode);
  • Note that MPI_Abort aborts an MPI program
    cleanly. The first argument is a communicator
    and the second argument is an integer error code.
  • A communicator is a collection of processes that
    can send messages to each other. A default
    communicator is MPI_COMM_WORLD, which consists of
    all the processes running when program execution
    begins.

55
MPI and MPI-2: Communicators, Process Control
  • Under MPI, a communication domain is a set of
    processors that are allowed to communicate with
    each other. Information about such a domain is
    stored in a communicator that uniquely identifies
    the processors that participate in a
    communication operation.
  • The default communication domain consists of all
    the processors of a parallel execution; it is
    called MPI_COMM_WORLD. By using multiple
    communicators between possibly overlapping
    groups of processors we make sure that messages
    do not interfere with each other.
  • #include <mpi.h>
  • int MPI_Comm_size(MPI_Comm comm, int *size);
  • int MPI_Comm_rank(MPI_Comm comm, int *rank);
  • Thus
  • MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  • MPI_Comm_rank(MPI_COMM_WORLD, &pid);
  • return the number of processors nprocs and the
    processor id pid of the calling processor.

56
MPI and MPI-2: Communicators, Process Control
Example
  • A "hello world!" program in MPI is the following
    one.
  • #include <stdio.h>
  • #include <mpi.h>
  • int main(int argc, char **argv) {
  •   int nprocs, mypid;
  •   MPI_Init(&argc, &argv);
  •   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  •   MPI_Comm_rank(MPI_COMM_WORLD, &mypid);
  •   printf("Hello world from process %d of total %d\n", mypid, nprocs);
  •   MPI_Finalize();
  •   return 0;
  • }

57
Submit job to Grid Engine
  • (1) Create an MPI program, e.g. example.c.
  • (2) Compile the program using mpicc -O2
    example.c -o example
  • (3) Type the command vi submit
  • (4) Put the following into submit:
  • #!/bin/bash
  • #$ -S /bin/sh
  • #$ -N example
  • #$ -q default
  • #$ -pe lammpi 4
  • #$ -cwd
  • mpirun C ./example
  • (5) Then run chmod a+x submit
  • (6) Then run qsub submit
  • (7) Run qstat to check the status of your program

58
MPI: Message-Passing primitives
  • #include <mpi.h>
  • /* Blocking send and receive */
  • int MPI_Send(void *buf, int count, MPI_Datatype
    dtype, int dest, int tag, MPI_Comm comm);
  • int MPI_Recv(void *buf, int count, MPI_Datatype
    dtype, int src, int tag, MPI_Comm comm,
    MPI_Status *stat);
  • /* Non-Blocking send and receive */
  • int MPI_Isend(void *buf, int count, MPI_Datatype
    dtype, int dest, int tag, MPI_Comm comm,
    MPI_Request *req);
  • int MPI_Irecv(void *buf, int count, MPI_Datatype
    dtype, int src, int tag, MPI_Comm comm,
    MPI_Request *req);
  • int MPI_Wait(MPI_Request *preq, MPI_Status
    *stat);
  • buf - initial address of send/receive buffer
  • count - number of elements in send buffer
    (nonnegative integer) or maximum number of
    elements in receive buffer.
  • dtype - datatype of each send/receive buffer
    element (handle)
  • dest, src - rank of destination/source (integer)
  • Wild-card: MPI_ANY_SOURCE for recv only. No
    wildcard for dest.
  • tag - message tag (integer). Range 0...32767
    (the minimum range the standard guarantees).
  • Wild-card: MPI_ANY_TAG for recv only; send must
    specify tag.
  • comm - communicator (handle)
  • stat - status object (Status), which can be the
    MPI constant MPI_STATUS_IGNORE; otherwise it
    returns the source and tag of the message that
    was actually received.

59
Data type correspondence between MPI and C
  • MPI_CHAR --> signed char
  • MPI_SHORT --> signed short int
  • MPI_INT --> signed int
  • MPI_LONG --> signed long int
  • MPI_UNSIGNED_CHAR --> unsigned char
  • MPI_UNSIGNED_SHORT --> unsigned short int
  • MPI_UNSIGNED --> unsigned int
  • MPI_UNSIGNED_LONG --> unsigned long int
  • MPI_FLOAT --> float
  • MPI_DOUBLE --> double
  • MPI_LONG_DOUBLE --> long double
  • MPI_BYTE (no corresponding C type)
  • MPI_PACKED (no corresponding C type)

60
MPI: Message-Passing primitives
  • The MPI_Send and MPI_Recv functions are blocking,
    that is, they do not return unless it is safe to
    modify or use the contents of the send/receive
    buffer.
  • MPI also provides non-blocking send and
    receive primitives. These are MPI_Isend and
    MPI_Irecv, where the I stands for Immediate.
  • These functions allow a process to post that it
    wants to send to or receive from another process,
    and then allow the process to call a function
    (e.g. MPI_Wait) to complete the send-receive pair.
  • Non-blocking send-receives allow for the
    overlapping of computation/communication. Thus
    MPI_Wait plays the role of synchronizer: the
    send/receive are only advisories, and the
    communication is only guaranteed to have
    completed once MPI_Wait returns.

61
MPI: Message-Passing primitives: tag
  • A tag is simply an integer argument that is
    passed to a communication function and that can
    be used to uniquely identify a message. For
    example, in MPI if process A sends a message to
    process B, then in order for B to receive the
    message, the tag used in A's call to MPI_Send
    must be the same as the tag used in B's call to
    MPI_Recv. Thus, if the characteristics of two
    messages sent by A to B are identical (i.e., same
    count and datatype), then A and B can distinguish
    between the two by using different tags.
  • For example, suppose A is sending two floats, x
    and y, to B. Then the processes can be sure that
    the values are received correctly, regardless of
    the order in which A sends and B receives,
    provided different tags are used:
  • /* Assume system provides some buffering */
  • if (my_rank == A) {
  •   tag = 0;
  •   MPI_Send(&x, 1, MPI_FLOAT, B, tag,
      MPI_COMM_WORLD);
  •   . . .
  •   tag = 1;
  •   MPI_Send(&y, 1, MPI_FLOAT, B, tag,
      MPI_COMM_WORLD);
  • } else if (my_rank == B) {
  •   tag = 1;
  •   MPI_Recv(&y, 1, MPI_FLOAT, A, tag,
      MPI_COMM_WORLD, &status);
  •   . . .
  •   tag = 0;
  •   MPI_Recv(&x, 1, MPI_FLOAT, A, tag,
      MPI_COMM_WORLD, &status);
  • }

62
MPI Message-Passing primitives: Communicators
  • Now suppose one message from process A to
    process B is being sent by the library, and
    another, with identical characteristics, is
    being sent by the user's code. Unless the
    library developer insists that user programs
    refrain from using certain tag values, the
    tag-based approach cannot be made to work.
    Clearly, partitioning the set of possible tags
    is at best an inconvenience: if one wishes to
    modify an existing user code so that it can use
    a library that partitions tags, each
    message-passing function in the entire user code
    must be checked.
  • The solution that was ultimately decided on was
    the communicator.
  • Formally, a communicator is a pair of objects:
    the first is a group or ordered collection of
    processes, and the second is a context, which can
    be viewed as a unique, system-defined tag. Every
    communication function in MPI takes a
    communicator argument, and a communication can
    succeed only if all the processes participating
    in the communication use the same communicator
    argument. Thus, a library can either require that
    its functions be passed a unique library-specific
    communicator, or its functions can create their
    own unique communicator. In either case, it is
    straightforward for the library designer and the
    user to make certain that their messages are not
    confused.

63
For example, suppose now that the user's code is
sending a float, x, from process A to process B,
while the library is sending a float, y, from A
to B:

/* Assume system provides some buffering */
void User_function(int my_rank, float *x)
{
    MPI_Status status;
    if (my_rank == A) {
        /* MPI_COMM_WORLD is pre-defined in MPI */
        MPI_Send(x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
    } else if (my_rank == B) {
        MPI_Recv(x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
    }
    . . .
}

void Library_function(float *y)
{
    MPI_Comm library_comm;
    MPI_Status status;
    int my_rank;
    /* Create a communicator with the same group */
    /* as MPI_COMM_WORLD, but a different context */
    MPI_Comm_dup(MPI_COMM_WORLD, &library_comm);
    /* Get process rank in new communicator */
    MPI_Comm_rank(library_comm, &my_rank);
    if (my_rank == A) {
        MPI_Send(y, 1, MPI_FLOAT, B, 0, library_comm);
    } else if (my_rank == B) {
        MPI_Recv(y, 1, MPI_FLOAT, A, 0, library_comm, &status);
    }
    . . .
}

int main(int argc, char **argv)
{
    . . .
    if (my_rank == A) {
        User_function(A, &x);
        . . .
        Library_function(&y);
    } else if (my_rank == B) {
        Library_function(&y);
        . . .
        User_function(B, &x);
    }
    . . .
}
64
MPI Message-Passing primitives: User-defined
datatypes
  • The second main innovation in MPI, user-defined
    datatypes, allows programmers to create messages
    consisting of logically unified sets of data
    rather than only physically contiguous blocks of
    data.
  • Loosely, an MPI datatype is a sequence of
    displacements in memory together with a
    collection of basic datatypes (e.g., int, float,
    double, and char). Thus, an MPI datatype
    specifies the layout in memory of data to be
    collected into a single message or data to be
    distributed from a single message.
  • For example, suppose we specify a sparse matrix
    entry with the following definition.
  • typedef struct {
  •   double entry;
  •   int row, col;
  • } mat_entry_t;
  • MPI provides functions for creating a variable
    that stores the layout in memory of a variable of
    type mat_entry_t. One does this by first defining
    an MPI datatype
  • MPI_Datatype mat_entry_mpi_t;
  • to be used in communication functions, and then
    calling various MPI functions to initialize
    mat_entry_mpi_t so that it contains the required
    layout. Then, if we define
  • mat_entry_t x;
  • we can send x by simply calling
  • MPI_Send(&x, 1, mat_entry_mpi_t, dest, tag,
    comm);
  • and we can receive x with a similar call to
    MPI_Recv.

65
MPI: An example with the blocking operations
  • #include <stdio.h>
  • #include <mpi.h>
  • #define N 10000000 // Choose N to be multiple of
    nprocs to avoid problems.
  • // Note: for this N the sum N(N+1)/2 exceeds the
    range of a 32-bit int; use a wider type or a
    smaller N.
  • // Parallel sum of 1 , 2 , 3, ... , N
  • int main(int argc, char **argv) {
  •   int pid, nprocs, i, j;
  •   int sum, start, end, total;
  •   MPI_Status status;
  •   MPI_Init(&argc, &argv);
  •   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  •   MPI_Comm_rank(MPI_COMM_WORLD, &pid);
  •   sum = 0; total = 0;
  •   start = (N/nprocs)*pid + 1; // Each processor
  •   end = (N/nprocs)*(pid+1);
  •   for (i = start; i <= end; i++) sum += i;
  •   if (pid != 0)
  •     MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
  •   else

66
Non-Blocking send and Receive
  • /* Non-Blocking send and receive */
  • int MPI_Isend(void *buf, int count, MPI_Datatype
    dtype, int dest, int tag, MPI_Comm comm,
    MPI_Request *req);
  • int MPI_Irecv(void *buf, int count, MPI_Datatype
    dtype, int src, int tag, MPI_Comm comm,
    MPI_Request *req);
  • int MPI_Wait(MPI_Request *preq, MPI_Status
    *stat);
  • #include "mpi.h"
  • int MPI_Wait(MPI_Request *request, MPI_Status
    *status);
  • Waits for an MPI send or receive to complete
  • Input Parameter
  • request (handle)
  • Output Parameter
  • status object (Status). May be
    MPI_STATUS_IGNORE.

67
MPI: An example with the non-blocking operations
  • #include <stdio.h>
  • #include <mpi.h>
  • #define N 10000000 // Choose N to be multiple of
    nprocs to avoid problems.
  • // Parallel sum of 1 , 2 , 3, ... , N
  • int main(int argc, char **argv) {
  •   int pid, nprocs, i, j;
  •   int sum, start, end, total;
  •   MPI_Status status;
  •   MPI_Request request;
  •   MPI_Init(&argc, &argv);
  •   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  •   MPI_Comm_rank(MPI_COMM_WORLD, &pid);
  •   sum = 0; total = 0;
  •   start = (N/nprocs)*pid + 1; // Each processor
  •   end = (N/nprocs)*(pid+1);
  •   for (i = start; i <= end; i++) sum += i;
  •   if (pid != 0)
  •     // MPI_Send(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
  •     MPI_Isend(&sum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD,
        &request);
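
The comment "Choose N to be multiple of nprocs to avoid problems" refers to the start/end formulas, which drop the last N mod nprocs values when N is not a multiple of nprocs. One common way around this is sketched below in plain C (no MPI calls). The helper name block_range and its parameters are hypothetical and not part of the slides; it gives the first (n mod nprocs) processes one extra element each.

```c
#include <assert.h>

/* Inclusive range [*start, *end] of the integers 1..n owned by process pid
   out of nprocs. The first (n % nprocs) processes get one extra element. */
void block_range(long n, int nprocs, int pid, long *start, long *end)
{
    assert(n >= 0 && nprocs >= 1 && pid >= 0 && pid < nprocs);
    long base = n / nprocs;
    long extra = n % nprocs;
    long lo = pid * base + (pid < extra ? pid : extra);
    long len = base + (pid < extra ? 1 : 0);
    *start = lo + 1;
    *end = lo + len;
}
```

When n is a multiple of nprocs this reduces to the formulas used in the example, e.g. n = 8, nprocs = 4, pid = 1 yields the range 3..4.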

68
MPI Basic Collective Operations
  • One simple collective operation:
  • int MPI_Bcast(void *message, int count,
    MPI_Datatype datatype, int root, MPI_Comm comm);
  • The routine MPI_Bcast sends data from one process
    to all others.

69
MPI_Bcast
[Figure: MPI_Bcast across Process 0, Process 1, Process 2 and Process 3, marking where the data is present.]
70
Simple Program that Demonstrates MPI_Bcast
  • #include <mpi.h>
  • #include <stdio.h>
  • int main(int argc, char **argv) {
  •   int k, id, p, size;
  •   MPI_Init(&argc, &argv);
  •   MPI_Comm_rank(MPI_COMM_WORLD, &id);
  •   MPI_Comm_size(MPI_COMM_WORLD, &size);
  •   if (id == 0)
  •     k = 20;
  •   else
  •     k = 10;
  •   for (p = 0; p < size; p++) {
  •     if (id == p)
  •       printf("Process %d: k = %d before\n", id, k);
  •   }
  •   // note: MPI_Bcast must be put where all
      other processes
  •   // can see it.
  •   MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
  •   for (p = 0; p < size; p++)

71
Simple Program that Demonstrates MPI_Bcast
  • The output would look like:
  • Process 0: k = 20 before
  • Process 0: k = 20 after
  • Process 3: k = 10 before
  • Process 3: k = 20 after
  • Process 2: k = 10 before
  • Process 2: k = 20 after
  • Process 1: k = 10 before
  • Process 1: k = 20 after

72
Parallel Algorithm Assumptions
  • Convention: In this subject we name processors
    arbitrarily either 0, 1, . . . , p - 1 or 1, 2, .
    . . , p.
  • The input to a particular problem would reside in
    the cells of the shared memory. We assume, in
    order to simplify the exposition of our
    algorithms, that a cell is wide enough (in bits
    or bytes) to accommodate a single instance of the
    input (e.g. a key or a floating point number). If
    the input is of size n, the first n cells
    numbered 0, . . . , n - 1 store the input.
  • We assume that the number of processors of the
    PRAM is n or a polynomial function of the size n
    of the input. Processor indices are 0, 1, . . . ,
    n - 1.

73
PRAM Algorithm: Matrix Multiplication
  • Matrix Multiplication
  • A simple algorithm for multiplying two
    $n \times n$ matrices on a CREW PRAM with time
    complexity $T = O(\lg n)$ and $P = n^3$ follows.
    For convenience, processors are indexed as
    triples $(i, j, k)$, where $i, j, k = 1, \ldots, n$.
    In the first step processor $(i, j, k)$
    concurrently reads $a_{ij}$ and $b_{jk}$ and
    performs the multiplication $a_{ij} b_{jk}$. In
    the following steps, for all $i, k$ the results
    $(i, *, k)$ are combined, using the parallel sum
    algorithm, to form $c_{ik} = \sum_j a_{ij} b_{jk}$.
    After $\lg n$ steps, the result $c_{ik}$ is thus
    computed.
  • The same algorithm also works on the EREW PRAM
    with the same time and processor complexity.
    Only the first step of the CREW algorithm needs
    to be changed. We avoid concurrency by
    broadcasting element $a_{ij}$ to processors
    $(i, j, *)$ using the broadcasting algorithm of
    the EREW PRAM in $O(\lg n)$ steps. Similarly,
    $b_{jk}$ is broadcast to processors $(*, j, k)$.
  • The above algorithm also shows how an n-processor
    EREW PRAM can simulate an n-processor CREW PRAM
    with an O(lg n) slowdown.

74
Matrix Multiplication
  • Step | CREW | EREW
  • 1. a_ij to all (i,j,*) procs | O(1) | O(lg n)
  •    b_jk to all (*,j,k) procs | O(1) | O(lg n)
  • 2. a_ij * b_jk at (i,j,k) proc | O(1) | O(1)
  • 3. parallel sum_j a_ij * b_jk at (i,*,k) procs
    (n procs participate) | O(lg n) | O(lg n)
  • 4. c_ik = sum_j a_ij * b_jk | O(1) | O(1)
  • T = O(lg n), P = O(n^3), W = O(n^3 lg n), W2 = O(n^3)
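
The steps in the table can be traced with a sequential simulation. The sketch below is illustrative and not from the slides: the function name pram_matmul, the fixed size N = 4 (assumed to be a power of two) and the 0-based indexing are assumptions. Each inner loop body stands for the work of one PRAM processor (i, j, k), and each value of stride stands for one of the lg N rounds of the parallel sum.

```c
#include <assert.h>

#define N 4   /* matrix dimension; assumed to be a power of two */

/* Sequential simulation of the n^3-processor PRAM algorithm:
   one step forms all products, then lg N rounds of pairwise sums over j. */
void pram_matmul(double a[N][N], double b[N][N], double c[N][N])
{
    static double prod[N][N][N];   /* prod[i][j][k] held by processor (i, j, k) */

    assert(N >= 1 && (N & (N - 1)) == 0);

    /* Every processor (i, j, k) computes a[i][j] * b[j][k]. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                prod[i][j][k] = a[i][j] * b[j][k];

    /* Parallel sum over j: each round halves the number of active terms. */
    for (int stride = N / 2; stride >= 1; stride /= 2)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < stride; j++)
                for (int k = 0; k < N; k++)
                    prod[i][j][k] += prod[i][j + stride][k];

    /* The total for each (i, k) now sits at processor (i, 0, k). */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            c[i][k] = prod[i][0][k];
}
```

With N = 4 the summation loop runs for stride = 2 and stride = 1, i.e. lg 4 = 2 rounds, matching the O(lg n) entry in step 3.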

75
PRAM Algorithm: Logical AND operation
  • Problem. Let $X_1, \ldots, X_n$ be binary/boolean
    values. Find $X = X_1 \wedge X_2 \wedge \ldots \wedge X_n$.
  • The sequential problem accepts a P = 1, T =
    O(n), W = O(n) direct solution.
  • An EREW PRAM algorithm solution for this problem
    works the same way as the PARALLEL SUM algorithm
    and its performance is P = O(n), T = O(lg n), W =
    O(n lg n), along with the improvements in P and W
    mentioned for the PARALLEL SUM algorithm.
  • In the remainder we will investigate a CRCW PRAM
    algorithm. Let binary value $X_i$ reside in the
    shared memory location i. We can find
    $X = X_1 \wedge X_2 \wedge \ldots \wedge X_n$ in
    constant time on a CRCW PRAM. Processor 1 first
    writes a 1 in shared memory cell 0. If
    $X_i = 0$, processor i writes a 0 in memory cell
    0. The result X is then stored in this memory
    cell.
  • The result stored in cell 0 is 1 (TRUE) unless a
    processor writes a 0 in cell 0; in that case one
    of the $X_i$ is 0 (FALSE) and the result X
    should be FALSE, as it is.

76
Logical AND operation
  • begin Logical AND (X1, . . ., Xn)
  • 1. Processor 1 writes 1 in cell 0.
  • 2. if Xi = 0, processor i writes 0 into cell 0.
  • end Logical AND
  • Exercise: Give an O(1) CRCW algorithm for LOGICAL
    OR.
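
The two steps of Logical AND can be traced with a sequential simulation. This is a sketch under stated assumptions: the function name crcw_and and the 0-based array x are illustrative, and the loop stands in for n processors that would all act in the same parallel step on a CRCW PRAM.

```c
#include <assert.h>

/* Sequential simulation of the constant-time CRCW PRAM logical AND.
   x[0..n-1] are 0/1 values; each loop iteration stands for one processor
   acting in the same parallel step. */
int crcw_and(const int *x, int n)
{
    assert(n >= 1);
    int cell0 = 1;                 /* step 1: processor 1 writes 1 into cell 0 */
    for (int i = 0; i < n; i++)    /* step 2: all processors act concurrently */
        if (x[i] == 0)
            cell0 = 0;             /* concurrent writers all write the same value */
    return cell0;
}
```

Because every processor that writes in step 2 writes the same value 0, the outcome does not depend on how the machine resolves the concurrent writes.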

77
Parallel Operations with Multiple Outputs:
Parallel Prefix
  • Problem definition: Given a set of n values
    $x_0, x_1, \ldots, x_{n-1}$ and an associative
    operator, say $\oplus$, the parallel prefix
    problem is to compute the following n
    results/sums.
  • 0: $x_0$,
  • 1: $x_0 \oplus x_1$,
  • 2: $x_0 \oplus x_1 \oplus x_2$,
  • . . .
  • n - 1: $x_0 \oplus x_1 \oplus \ldots \oplus x_{n-1}$.
  • Parallel prefix is also called prefix sums or
    scan. It has many uses in parallel computing such
    as in load-balancing the work assigned to
    processors and compacting data structures such as
    arrays.
  • We shall prove that computing ALL THE SUMS is no
    more difficult than computing the single sum
    $x_0 \oplus \ldots \oplus x_{n-1}$.

78
Parallel Prefix Algorithm 1: divide-and-conquer
  • Inputs x0 x1 x2 x3 x4 x5 x6 x7 enter a Parallel
    Prefix "Box" for 8 inputs.
  • The box consists of 2 PP Boxes for 4 inputs
    each: Box 1 (x0...x3) and Box 2 (x4...x7).
  • Take the rightmost output of Box 1 and combine
    it with the outputs of Box 2.
  • Outputs from Box 1: x0, x0x1, x0...x2, x0...x3.
  • Outputs after combining: x0...x4, x0...x5,
    x0...x6, x0...x7.
79
Parallel Prefix Algorithm 2
  • An algorithm for parallel prefix on an EREW PRAM
    would require lg n phases. In phase i, processor
    j reads the contents of cells j and $j - 2^i$ (if
    it exists), combines them and stores the result
    in cell j.
  • The EREW PRAM algorithm that solves the parallel
    prefix problem has performance P = O(n), T =
    O(lg n), and W = O(n lg n), W2 = O(n).
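
The phases of Algorithm 2 can also be simulated sequentially. This sketch makes several assumptions that are not in the slides: the name prefix_doubling, the bound MAX_N, phases numbered from i = 0 so that the offset is d = 2^i = 1, 2, 4, ..., and addition as the operator. A snapshot of the cells is taken at the start of each phase to mimic the PRAM convention that all reads of a step happen before its writes.

```c
#include <assert.h>
#include <string.h>

#define MAX_N 1024   /* illustrative bound on the input size */

/* Sequential simulation of the lg n phase algorithm. In the phase with
   offset d = 2^i, "processor" j combines cells j - d and j. A snapshot of
   the cells is read so that all reads of a phase precede its writes. */
void prefix_doubling(long *x, int n)
{
    assert(n >= 0 && n <= MAX_N);
    long old[MAX_N];
    for (int d = 1; d < n; d *= 2) {
        memcpy(old, x, (size_t)n * sizeof x[0]);
        for (int j = d; j < n; j++)
            x[j] = old[j - d] + old[j];
    }
}
```

For n = 8 the outer loop runs for d = 1, 2, 4, i.e. lg 8 = 3 phases, and the input 1, 2, ..., 8 ends as 1, 3, 6, 10, 15, 21, 28, 36.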

80
Parallel Prefix Algorithm 2 Example
  • For visualization purposes, the second step is
    written in two different lines. When we write x1
    . . . x5 we mean
  • $x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5$.
  • x1 x2 x3