Optimizing communication operations, CS433, Spring 2001

Learn more at: http://charm.cs.uiuc.edu
1
Optimizing communication operations
CS433, Spring 2001
  • Laxmikant Kale

2
Communication optimization
  • Several patterns of communication are prevalent
  • How to optimize them in different contexts
  • Examples
  • Broadcasts, reductions, parallel prefix
  • Permutation
  • Each-to-all (or most) individualized messages
  • Each-to-all (or most) broadcasts
  • Context varies
  • Size of data
  • Number of processors
  • Interconnection network (mostly ignored)

3
Broadcasts
  • How do we optimize broadcast?
  • Is there any scope for optimization?
  • Basic issue: there is overhead in sending and
    receiving each message
  • Processor overhead
  • Baseline algorithm: spanning tree
  • k-ary spanning trees
  • What k is optimal?
  • Skewed trees and hypercubes
  • Take into account the fact that the first
    processor we sent a message to is ready before
    the last one
  • so the last one will be on the critical path
  • Assign less work to later subtrees
  • Actual shape of the tree depends on timing
    constants
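The question of the optimal k can be explored with a toy timing model (the model and the constants alpha/latency are my own assumptions, not from the lecture): a sender transmits to its children one at a time, so later children start their subtrees later.

```python
def bcast_time(p, k, alpha=1.0, latency=0.0):
    """Completion time to reach p processors via a k-ary spanning tree.

    Toy model (an assumption): each send costs `alpha` of sender overhead,
    so child i receives at (i + 1) * alpha + latency and only then starts
    broadcasting within its own subtree.
    """
    if p <= 1:
        return 0.0
    rest = p - 1                       # processors beyond the root
    m = min(k, rest)                   # number of children actually used
    # split the remaining processors as evenly as possible among subtrees
    sizes = [rest // m + (1 if i < rest % m else 0) for i in range(m)]
    return max((i + 1) * alpha + latency
               + bcast_time(sizes[i], k, alpha, latency)
               for i in range(m))
```

Sweeping k for fixed p shows the interior optimum: for p = 4 with unit overhead and zero latency, k = 2 finishes at time 2 while k = 1 and k = 3 both take 3; raising latency relative to alpha favors shallower (higher-k) trees, which is the slide's point that the best shape depends on the timing constants.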

4
Reductions
  • Inverse of broadcasts
  • So, the same arguments and techniques apply

5
Broadcasts/reductions with migrating objects
  • In Charm++
  • Object arrays consist of tens to hundreds of
    elements per processor
  • Collective operations?
  • Processor-based spanning tree,
  • along with per processor local multicast or
    collection
  • When objects migrate in the middle of a
    reduction?
  • More complex algorithm is needed
  • Send up a count of contributing objects, along
    with the quantity being reduced
  • If the total number of objects doesn't match, wait
    for straggler (i.e., delayed) contributions.
  • Need to make sure to avoid deadlock
  • Broadcasts
  • Must make sure to avoid double delivery and
    missed delivery for migrating objects
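The counting idea can be sketched in a few lines (a simplified model, not the actual Charm++ implementation): each contribution carries a partial value plus the number of objects it covers, and the reduction finalizes only once the counts add up, so contributions delayed by migration are simply awaited.

```python
def reduce_with_counts(expected_objects, contributions):
    """Sum partial reductions, finalizing only when every object is counted.

    `contributions` is a sequence of (object_count, partial_sum) pairs in
    arrival order; pairs from migrated objects may arrive arbitrarily late.
    """
    seen, total = 0, 0
    for count, partial in contributions:
        seen += count
        total += partial
        if seen == expected_objects:
            return total        # all contributing objects accounted for
    raise RuntimeError("straggler contributions still missing")

# 5 objects spread over 3 processors; the last pair is a late migrant:
# reduce_with_counts(5, [(2, 10), (2, 20), (1, 7)]) == 37
```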

6
Parallel Prefix
  • The i-th element of the output is the sum of all
    previous elements of the input
  • Seemingly hard to parallelize because of the
    dependence

Parallel version 1: N (virtual?) processors. Each
processor i sends its value to processor i + 2^k
in the k-th phase. log N phases.

Sequential version (B starts as a copy of A):
for (i = 0; i < N; i++)
  for (j = 0; j < i; j++)
    B[i] = B[i] + A[j];
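Parallel version 1 can be simulated sequentially, one list slot per virtual processor (the receive-side view of "i sends to i + 2^k" is that slot i adds in the value from slot i - 2^k):

```python
def parallel_prefix(a):
    """Inclusive prefix sum via the doubling scheme: in phase k, slot i
    adds in the value held by slot i - 2**k; log2(N) phases total."""
    b = list(a)
    step = 1
    while step < len(b):
        # every slot updates "simultaneously", so build the new list first
        b = [b[i] + (b[i - step] if i >= step else 0) for i in range(len(b))]
        step *= 2
    return b

# parallel_prefix([1, 2, 3, 4]) == [1, 3, 6, 10]   (2 phases for N = 4)
```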
7
Parallel prefix: communication optimization
  • Too much communication in the parallel algorithm
  • Log N communication steps,
  • Processor sends N/P messages in each phase
  • each message containing one scalar (int or
    double) only
  • First idea: use each-to-all individualized
    messages to optimize each phase
  • reduces to log N steps, each with a log P
    messages, at best.
  • Note each processor has N/P values.
  • Idea: treat each processor's block as one number
  • Do a local sum, followed by a parallel prefix of
    size P, followed by local update.
  • Log P phases, each with just one message per
    processor
  • Implementation: be careful to avoid mixing up
    messages across phases
  • Does this method add computation cost?
  • The sum is carried out twice
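The three steps above (local sum, size-P prefix over the block sums, local update) can be sketched as follows; in a real run only step 2 communicates, with one message per processor per phase.

```python
def block_prefix(blocks):
    """Prefix sum over distributed data: each inner list is one processor's
    block of N/P values."""
    # step 1: each processor sums its own block
    local_sums = [sum(block) for block in blocks]
    # step 2: exclusive prefix of the P block sums gives each block's offset
    # (done with the log P doubling scheme in the parallel setting)
    offsets, running = [], 0
    for s in local_sums:
        offsets.append(running)
        running += s
    # step 3: local inclusive prefix, shifted by the block's offset
    result = []
    for block, offset in zip(blocks, offsets):
        acc, row = offset, []
        for x in block:
            acc += x
            row.append(acc)
        result.append(row)
    return result

# block_prefix([[1, 2], [3, 4]]) == [[1, 3], [6, 10]]
```

Note that each block's values are accumulated in step 1 and again in step 3, which is the extra computation the slide asks about.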

8
Each to all, individualized messages
  • Each processor sends (P-1) distinct messages to
    the other processors
  • P-1 sends and receives
  • If messages are short, the fixed cost per message
    will overwhelm
  • How to optimize?

9
Each to all, individualized messages
  • How to optimize?
  • Row-column multicasts
  • Dimensional exchange
  • k-dimensional variants

10
Row-column multicast
  • Idea: organize the processors in a 2D matrix (k × k)
  • Phase 1
  • Send (k-1) messages, to each processor in your
    row.
  • Each message contains k messages, one for each
    processor in that column
  • Wait for (k-1) messages from processors in your
    row
  • Phase 2
  • Sort out the data into (k-1) messages for each
    processor in your column
  • Save data meant for you
  • Send (k-1) messages in your column
  • Cost
  • 2(k-1) messages, each containing m·k bytes
  • Instead of (P-1) messages, each containing m
    bytes
  • The advantage? Reduced per-message cost
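The two phases can be simulated to check that every payload is delivered exactly once (a sketch with invented names; for simplicity the "send to yourself" bundle is routed like the others, where a real implementation would keep it local):

```python
def row_column_alltoall(k, data):
    """Simulate the two-phase all-to-all on a k x k processor grid.

    data[src][dst] is the payload src wants delivered to dst;
    processor p sits at row p // k, column p % k.
    """
    P = k * k
    # phase 1: send each row peer the bundle destined for that peer's column
    inbox = [[] for _ in range(P)]
    for src in range(P):
        row = src // k
        for col in range(k):
            peer = row * k + col        # row neighbour handling column `col`
            inbox[peer].extend(
                (src, r * k + col, data[src][r * k + col]) for r in range(k))
    # phase 2: each processor forwards what it holds along its own column
    result = [dict() for _ in range(P)]
    for holder in range(P):
        for src, dst, payload in inbox[holder]:
            result[dst][src] = payload
    return result
```

Per processor this is 2(k-1) messages of m·k bytes instead of P-1 messages of m bytes, which is exactly the reduced per-message-cost tradeoff the slide describes.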

11
Row-column broadcast generalization
  • When is it better than direct message send?
  • m is small (short messages)
  • per-message fixed costs are high
  • If the costs are right, one can optimize further
  • Use a 3D grid, instead of 2D
  • use a higher-dimensional grid
  • Use a hypercube
  • What are the costs for each of those?
  • How to handle empty messages?

12
Costs
  • As we increase the number of dimensions,
  • the total number of messages decreases
  • the total bandwidth used increases
  • Extremes: the binary (hyper)cube or the 2-D grid
    often represent the optima, depending on specific
    constants (esp. m)
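A back-of-envelope model of this tradeoff (my own formula, generalizing the 2-D analysis: a d-dimensional grid with side k = P^(1/d) sends (k-1) messages of m·P/k bytes in each of d phases):

```python
def alltoall_cost(P, d, m=1.0):
    """Per-processor cost of all-to-all on a d-dimensional grid with side
    k = P**(1/d): returns (message count, total bytes sent)."""
    k = round(P ** (1.0 / d))
    assert k ** d == P, "P must be a perfect d-th power"
    messages = d * (k - 1)
    total_bytes = messages * m * P / k   # each message bundles P/k payloads
    return messages, total_bytes

# P = 64, m = 1:  direct send (d=1): 63 msgs, 63 bytes
#                 2-D grid   (d=2): 14 msgs, 112 bytes
#                 hypercube  (d=6):  6 msgs, 192 bytes
```

d = 1 recovers direct sends (P-1 messages of m bytes), and pushing toward the hypercube keeps cutting the message count while inflating the bytes moved, so the winner depends on the per-message overhead relative to m.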

13
Other methods?
  • Spanning tree
  • Everyone sends one message to root,
  • root partitions data, sends to each subtree

14
Each to all broadcast
  • Dimensional exchange
  • Row-column?
  • Spanning tree.
  • Cost analysis
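For the each-to-all broadcast, dimensional exchange can be sketched as follows (a simulation under the usual power-of-two assumption): in each phase a processor swaps everything it has accumulated so far with its partner across one hypercube dimension.

```python
def allgather_hypercube(values):
    """Each-to-all broadcast by dimensional exchange on P = 2**d processors:
    in phase k, processor p exchanges its accumulated set with p XOR 2**k,
    so after log2(P) phases everyone holds all P values."""
    P = len(values)
    assert P > 0 and P & (P - 1) == 0, "P must be a power of two"
    have = [{p: values[p]} for p in range(P)]
    mask = 1
    while mask < P:
        nxt = [dict(h) for h in have]
        for p in range(P):
            nxt[p].update(have[p ^ mask])   # both directions of the swap
        have = nxt
        mask *= 2
    return have
```

This is log P messages per processor, with message size doubling each phase, one of the cost profiles to compare against the row-column and spanning-tree variants.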