Optimizing communication operations, CS433, Spring 2001

Learn more at: http://charm.cs.uiuc.edu
1
Optimizing communication operations
CS433, Spring 2001
  • Laxmikant Kale

2
Communication optimization
  • Several patterns of communication are prevalent
  • How to optimize them in different contexts
  • Examples
  • Broadcasts, reductions, parallel prefix
  • Permutation
  • Each-to-all (or most) individualized messages
  • Each-to-all (or most) broadcasts
  • Context varies
  • Size of data
  • Number of processors
  • Interconnection network (mostly ignored)

3
Broadcasts
  • How do we optimize broadcast?
  • Is there any scope for optimization?
  • Basic issue: there is overhead in sending and
    receiving each message
  • Processor overhead
  • Baseline algorithm: spanning tree
  • k-ary spanning trees
  • What k is optimal?
  • Skewed trees and hypercubes
  • Take into account the fact that the first
    processor we sent a message to is ready before
    the last one
  • so the last one will be on the critical path
  • Assign less work to later subtrees
  • Actual shape of the tree depends on timing
    constants
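The question of the optimal k can be explored with a toy timing model (the model and the constants alpha/latency are my own assumptions, not from the lecture): a sender transmits to its children one at a time, so later children start their subtrees later.

```python
def bcast_time(p, k, alpha=1.0, latency=0.0):
    """Completion time to reach p processors via a k-ary spanning tree.

    Toy model (an assumption): each send costs `alpha` of sender overhead,
    so child i receives at (i + 1) * alpha + latency and only then starts
    broadcasting within its own subtree.
    """
    if p <= 1:
        return 0.0
    rest = p - 1                       # processors beyond the root
    m = min(k, rest)                   # number of children actually used
    # split the remaining processors as evenly as possible among subtrees
    sizes = [rest // m + (1 if i < rest % m else 0) for i in range(m)]
    return max((i + 1) * alpha + latency
               + bcast_time(sizes[i], k, alpha, latency)
               for i in range(m))
```

Sweeping k for fixed p shows the interior optimum: for p = 4 with unit overhead and zero latency, k = 2 finishes at time 2 while k = 1 and k = 3 both take 3; raising latency relative to alpha favors shallower (higher-k) trees, which is the slide's point that the best shape depends on the timing constants.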

4
Reductions
  • Inverse of broadcasts
  • So, the same arguments and techniques apply

5
Broadcasts/reductions with migrating objects
  • In Charm++
  • Object arrays consist of tens to hundreds of
    elements per processor
  • Collective operations?
  • Processor-based spanning tree,
  • along with per processor local multicast or
    collection
  • When objects migrate in the middle of a
    reduction?
  • More complex algorithm is needed
  • Send up a count of contributing objects, along
    with the quantity being reduced
  • If the total number of objects doesn't match, wait
    for straggler (i.e., delayed) contributions.
  • Need to make sure to avoid deadlock
  • Broadcasts
  • Must make sure to avoid double delivery and
    missed delivery for migrating objects
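The counting idea can be sketched in a few lines (a simplified model, not the actual Charm++ implementation): each contribution carries a partial value plus the number of objects it covers, and the reduction finalizes only once the counts add up, so contributions delayed by migration are simply awaited.

```python
def reduce_with_counts(expected_objects, contributions):
    """Sum partial reductions, finalizing only when every object is counted.

    `contributions` is a sequence of (object_count, partial_sum) pairs in
    arrival order; pairs from migrated objects may arrive arbitrarily late.
    """
    seen, total = 0, 0
    for count, partial in contributions:
        seen += count
        total += partial
        if seen == expected_objects:
            return total        # all contributing objects accounted for
    raise RuntimeError("straggler contributions still missing")

# 5 objects spread over 3 processors; the last pair is a late migrant:
# reduce_with_counts(5, [(2, 10), (2, 20), (1, 7)]) == 37
```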

6
Parallel Prefix
  • The i-th element of the output is the sum of all
    previous elements of the input
  • Seemingly hard to parallelize because of the
    dependence

Parallel version 1: N (virtual?) processors. Each
processor i sends its value to processor i + 2^k
in the k-th phase. log N phases.

Sequential version (B starts as a copy of A):
for (i = 0; i < N; i++)
  for (j = 0; j < i; j++)
    B[i] = B[i] + A[j];
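Parallel version 1 can be simulated sequentially, one list slot per virtual processor (the receive-side view of "i sends to i + 2^k" is that slot i adds in the value from slot i - 2^k):

```python
def parallel_prefix(a):
    """Inclusive prefix sum via the doubling scheme: in phase k, slot i
    adds in the value held by slot i - 2**k; log2(N) phases total."""
    b = list(a)
    step = 1
    while step < len(b):
        # every slot updates "simultaneously", so build the new list first
        b = [b[i] + (b[i - step] if i >= step else 0) for i in range(len(b))]
        step *= 2
    return b

# parallel_prefix([1, 2, 3, 4]) == [1, 3, 6, 10]   (2 phases for N = 4)
```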
7
Parallel prefix: communication optimization
  • Too much communication in the parallel algorithm
  • Log N communication steps,
  • Processor sends N/P messages in each phase
  • each message containing one scalar (int or
    double) only
  • First idea: use each-to-all individualized
    messages to optimize each phase
  • reduces to log N steps, each with a log P
    messages, at best.
  • Note each processor has N/P values.
  • Idea: treat each processor's block as one number
  • Do a local sum, followed by a parallel prefix of
    size P, followed by local update.
  • Log P phases, each with just one message per
    processor
  • Implementation: be careful to avoid mixing up
    messages across phases
  • Does this method add computation cost?
  • The sum is carried out twice
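The three steps above (local sum, size-P prefix over the block sums, local update) can be sketched as follows; in a real run only step 2 communicates, with one message per processor per phase.

```python
def block_prefix(blocks):
    """Prefix sum over distributed data: each inner list is one processor's
    block of N/P values."""
    # step 1: each processor sums its own block
    local_sums = [sum(block) for block in blocks]
    # step 2: exclusive prefix of the P block sums gives each block's offset
    # (done with the log P doubling scheme in the parallel setting)
    offsets, running = [], 0
    for s in local_sums:
        offsets.append(running)
        running += s
    # step 3: local inclusive prefix, shifted by the block's offset
    result = []
    for block, offset in zip(blocks, offsets):
        acc, row = offset, []
        for x in block:
            acc += x
            row.append(acc)
        result.append(row)
    return result

# block_prefix([[1, 2], [3, 4]]) == [[1, 3], [6, 10]]
```

Note that each block's values are accumulated in step 1 and again in step 3, which is the extra computation the slide asks about.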

8
Each to all, individualized messages
  • Each processor sends (P-1) distinct messages to
    the other processors
  • P-1 sends and receives
  • If messages are short, the fixed cost per message
    will overwhelm
  • How to optimize?

9
Each to all, individualized messages
  • How to optimize?
  • Row-column multicasts
  • Dimensional exchange
  • k-dimensional variants

10
Row-column multicast
  • Idea: organize the processors in a 2D matrix (k × k)
  • Phase 1
  • Send (k-1) messages, to each processor in your
    row.
  • Each message contains k messages, one for each
    processor in that column
  • Wait for (k-1) messages from processors in your
    row
  • Phase 2
  • Sort out the data into (k-1) messages for each
    processor in your column
  • Save data meant for you
  • Send (k-1) messages in your column
  • Cost
  • 2(k-1) messages, each containing m·k bytes
  • Instead of (P-1) messages, each containing m
    bytes
  • The advantage? Reduced per-message cost
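The two phases can be simulated to check that every payload is delivered exactly once (a sketch with invented names; for simplicity the "send to yourself" bundle is routed like the others, where a real implementation would keep it local):

```python
def row_column_alltoall(k, data):
    """Simulate the two-phase all-to-all on a k x k processor grid.

    data[src][dst] is the payload src wants delivered to dst;
    processor p sits at row p // k, column p % k.
    """
    P = k * k
    # phase 1: send each row peer the bundle destined for that peer's column
    inbox = [[] for _ in range(P)]
    for src in range(P):
        row = src // k
        for col in range(k):
            peer = row * k + col        # row neighbour handling column `col`
            inbox[peer].extend(
                (src, r * k + col, data[src][r * k + col]) for r in range(k))
    # phase 2: each processor forwards what it holds along its own column
    result = [dict() for _ in range(P)]
    for holder in range(P):
        for src, dst, payload in inbox[holder]:
            result[dst][src] = payload
    return result
```

Per processor this is 2(k-1) messages of m·k bytes instead of P-1 messages of m bytes, which is exactly the reduced per-message-cost tradeoff the slide describes.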

11
Row-column broadcast generalization
  • When is it better than direct message send?
  • m is small (short messages)
  • per-message fixed costs are high
  • If the costs are right, one can optimize further
  • Use a 3D grid, instead of 2D
  • use a higher-dimensional grid
  • Use a hypercube
  • What are the costs for each of those?
  • How to handle empty messages?

12
Costs
  • As we increase the number of dimensions,
  • the total number of messages decreases
  • the total bandwidth used increases
  • Extremes: the binary (hyper)cube or the 2-D grid
    often represent the optima, depending on specific
    constants (esp. m)
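A back-of-envelope model of this tradeoff (my own formula, generalizing the 2-D analysis: a d-dimensional grid with side k = P^(1/d) sends (k-1) messages of m·P/k bytes in each of d phases):

```python
def alltoall_cost(P, d, m=1.0):
    """Per-processor cost of all-to-all on a d-dimensional grid with side
    k = P**(1/d): returns (message count, total bytes sent)."""
    k = round(P ** (1.0 / d))
    assert k ** d == P, "P must be a perfect d-th power"
    messages = d * (k - 1)
    total_bytes = messages * m * P / k   # each message bundles P/k payloads
    return messages, total_bytes

# P = 64, m = 1:  direct send (d=1): 63 msgs, 63 bytes
#                 2-D grid   (d=2): 14 msgs, 112 bytes
#                 hypercube  (d=6):  6 msgs, 192 bytes
```

d = 1 recovers direct sends (P-1 messages of m bytes), and pushing toward the hypercube keeps cutting the message count while inflating the bytes moved, so the winner depends on the per-message overhead relative to m.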

13
Other methods?
  • Spanning tree
  • Everyone sends one message to root,
  • root partitions data, sends to each subtree

14
Each to all broadcast
  • Dimensional exchange
  • Row-column?
  • Spanning tree.
  • Cost analysis
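For the each-to-all broadcast, dimensional exchange can be sketched as follows (a simulation under the usual power-of-two assumption): in each phase a processor swaps everything it has accumulated so far with its partner across one hypercube dimension.

```python
def allgather_hypercube(values):
    """Each-to-all broadcast by dimensional exchange on P = 2**d processors:
    in phase k, processor p exchanges its accumulated set with p XOR 2**k,
    so after log2(P) phases everyone holds all P values."""
    P = len(values)
    assert P > 0 and P & (P - 1) == 0, "P must be a power of two"
    have = [{p: values[p]} for p in range(P)]
    mask = 1
    while mask < P:
        nxt = [dict(h) for h in have]
        for p in range(P):
            nxt[p].update(have[p ^ mask])   # both directions of the swap
        have = nxt
        mask *= 2
    return have
```

This is log P messages per processor, with message size doubling each phase, one of the cost profiles to compare against the row-column and spanning-tree variants.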