Title: Optimizing communication operations (CS433, Spring 2001)
2. Communication optimization
- Several patterns of communication are prevalent
- How to optimize them in different contexts
- Examples
  - Broadcasts, reductions, parallel prefix
  - Permutations
  - Each-to-all (or most) individualized messages
  - Each-to-all (or most) broadcasts
- Context varies
  - Size of data
  - Number of processors
  - Interconnection network (mostly ignored)
3. Broadcasts
- How do we optimize broadcast?
- Is there any scope for optimization?
- Basic issue: there is overhead in sending and receiving each message
  - Processor overhead
- Baseline algorithm: spanning tree (see sketch below)
  - k-ary spanning trees
  - What k is optimal?
- Skewed trees and hypercubes
  - Take into account the fact that the first processor we send a message to is ready before the last one, so the last one will be on the critical path
  - Assign less work to later subtrees
  - The actual shape of the tree depends on timing constants
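
A minimal sketch of the baseline, written against MPI rather than the Charm++ setting of the course; the arity K and the name kary_bcast are illustrative, and the skewed-tree refinement is not shown:

    /* k-ary spanning-tree broadcast over MPI ranks; rank 0 is the root. */
    #include <mpi.h>

    #define K 4   /* tree arity; the optimal k depends on timing constants */

    static void kary_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Receive the data from the parent unless we are the root. */
        if (rank != 0) {
            int parent = (rank - 1) / K;
            MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
        }
        /* Forward to up to K children; earlier children can start
           forwarding before later ones have received the data. */
        for (int c = 1; c <= K; c++) {
            int child = rank * K + c;
            if (child < size)
                MPI_Send(buf, count, type, child, 0, comm);
        }
    }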
4. Reductions
- Inverse of broadcasts
- So, the same arguments and techniques apply
5. Broadcasts/reductions with migrating objects
- In Charm++
  - Object arrays consist of tens to hundreds of elements per processor
  - Collective operations?
- Processor-based spanning tree
  - along with a per-processor local multicast or collection
- What if objects migrate in the middle of a reduction?
  - A more complex algorithm is needed
  - Send up a count of contributing objects, along with the quantity being reduced (see sketch below)
  - If the total number of objects doesn't match, wait for straggler (i.e. delayed) contributions
  - Need to make sure to avoid deadlock
- Broadcasts
  - Must make sure to avoid double delivery and missed delivery for migrating objects
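
To make the counting idea concrete, a sketch of the message format and merge step; the type and function names are hypothetical, not the Charm++ API:

    /* Each reduction message carries a count of contributing objects,
       so a node (or the root) can detect straggler contributions from
       objects that migrated mid-reduction. */
    typedef struct {
        int    count;   /* number of object contributions folded in */
        double value;   /* the quantity being reduced (here, a sum) */
    } Contribution;

    /* Fold a child's partial reduction into the local accumulator. */
    static void merge(Contribution *acc, const Contribution *in)
    {
        acc->count += in->count;
        acc->value += in->value;
    }

    /* At the root: if acc.count != total number of objects, hold the
       result and wait for delayed contributions before delivering. */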
6. Parallel Prefix
- The i-th element of the output is the sum of all input elements up to (and including) the i-th
- Seemingly hard to parallelize because of the dependence

Sequential definition (B zero-initialized):

    for (i = 0; i < N; i++)
        for (j = 0; j <= i; j++)
            B[i] = B[i] + A[j];

- Parallel version 1: N (virtual?) processors
  - Each processor i sends its value to processor i + 2^k in the k-th phase
  - log N phases (see sketch below)
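
A sketch of parallel version 1, assuming one value per MPI rank; the name prefix_scan is illustrative. Tagging messages by phase keeps the phases from mixing:

    #include <mpi.h>

    /* In phase k, rank i passes its running sum to rank i + 2^k and
       folds in the value received from rank i - 2^k. */
    static double prefix_scan(double x, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double sum = x;
        for (int d = 1; d < size; d <<= 1) {   /* d = 2^k */
            int dest = (rank + d < size) ? rank + d : MPI_PROC_NULL;
            int src  = (rank - d >= 0)   ? rank - d : MPI_PROC_NULL;
            double recvd = 0.0;
            /* Tag by phase (d) so messages from different phases
               cannot be mixed up. */
            MPI_Sendrecv(&sum, 1, MPI_DOUBLE, dest, d,
                         &recvd, 1, MPI_DOUBLE, src, d,
                         comm, MPI_STATUS_IGNORE);
            if (src != MPI_PROC_NULL)
                sum += recvd;
        }
        return sum;   /* inclusive prefix over ranks 0..rank */
    }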
7. Parallel prefix: communication optimization
- Too much communication in the parallel algorithm
  - log N communication steps
  - each processor sends N/P messages in each phase
  - each message contains only one scalar (int or double)
- First idea: use each-to-all individualized messages to optimize each phase
  - reduces to log N steps, each with log P messages per processor, at best
  - Note: each processor has N/P values
- Idea: treat each processor's block as one number
  - Do a local sum, followed by a parallel prefix of size P, followed by a local update (see sketch below)
  - log P phases, each with just one message per processor
  - Implementation: be careful to avoid mixing up messages across phases
- Does this method sacrifice computation cost?
  - The sum is carried out twice
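
A sketch of the block-based scheme, assuming MPI and n = N/P values per rank; MPI_Exscan stands in for the hand-written log P prefix phases, and block_prefix is an illustrative name:

    #include <mpi.h>

    /* In-place prefix sum of the rank's block a[0..n-1], made globally
       correct by adding the sum of all lower-ranked blocks. */
    static void block_prefix(double *a, int n, MPI_Comm comm)
    {
        /* Local prefix pass; a[n-1] ends up as the block total. */
        for (int i = 1; i < n; i++)
            a[i] += a[i - 1];

        /* Parallel prefix of size P over the block totals
           (the log P phases happen inside the library call). */
        double total  = (n > 0) ? a[n - 1] : 0.0;
        double offset = 0.0;
        MPI_Exscan(&total, &offset, 1, MPI_DOUBLE, MPI_SUM, comm);

        /* Local update: the second pass over the data that the
           "sum is carried out twice" bullet refers to. */
        int rank;
        MPI_Comm_rank(comm, &rank);
        if (rank > 0)   /* MPI_Exscan leaves offset undefined on rank 0 */
            for (int i = 0; i < n; i++)
                a[i] += offset;
    }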
8. Each-to-all, individualized messages
- Each processor sends (P-1) distinct messages, one to each other processor
  - (P-1) sends and receives
- If messages are short, the fixed cost per message will dominate
- How to optimize?
9. Each-to-all, individualized messages
- Row-column multicasts
- Dimensional exchange
- k-dimensional variants
10. Row-column multicast
- Idea: organize the processors in a 2D matrix (k x k); see sketch below
- Phase 1
  - Send (k-1) messages, one to each other processor in your row
  - Each message contains k messages, one for each processor in the destination column
  - Wait for (k-1) messages from the processors in your row
- Phase 2
  - Sort the received data into (k-1) messages, one for each processor in your column
  - Save the data meant for you
  - Send the (k-1) messages down your column
- Cost
  - 2(k-1) messages, each containing m(k-1) bytes
  - instead of (P-1) messages, each containing m bytes
  - The advantage? Reduced per-message cost
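
A sketch of the two phases, assuming MPI, P = k*k ranks, and one double per destination; grid_alltoall and the buffer layout are illustrative, and a real version would handle variable message sizes:

    #include <mpi.h>
    #include <stdlib.h>

    /* out[dest] holds the value for each destination rank; on return,
       in[src] holds the value received from each source rank. */
    static void grid_alltoall(const double *out, double *in, int k, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int R = rank / k, C = rank % k;            /* my grid coordinates */

        /* mid[c*k + r]: value from source (R,c) destined for (r,C). */
        double *mid    = malloc((size_t)(k * k) * sizeof *mid);
        double *bundle = malloc((size_t)k * sizeof *bundle);

        /* Phase 1: along my row. The bundle for column c carries one
           value for each processor (r,c) in that column. */
        for (int c = 0; c < k; c++) {
            for (int r = 0; r < k; r++)
                bundle[r] = out[r * k + c];
            if (c == C)
                for (int r = 0; r < k; r++) mid[c * k + r] = bundle[r];
            else
                MPI_Sendrecv(bundle, k, MPI_DOUBLE, R * k + c, 1,
                             &mid[c * k], k, MPI_DOUBLE, R * k + c, 1,
                             comm, MPI_STATUS_IGNORE);
        }

        /* Phase 2: along my column. The bundle for row r carries the k
           values (one per source in my row) destined for (r,C). */
        for (int r = 0; r < k; r++) {
            for (int c = 0; c < k; c++)
                bundle[c] = mid[c * k + r];
            if (r == R)
                for (int c = 0; c < k; c++) in[R * k + c] = bundle[c];
            else
                MPI_Sendrecv(bundle, k, MPI_DOUBLE, r * k + C, 2,
                             &in[r * k], k, MPI_DOUBLE, r * k + C, 2,
                             comm, MPI_STATUS_IGNORE);
        }
        free(mid);
        free(bundle);
    }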
11. Row-column broadcast generalization
- When is it better than direct message sends?
  - m is small (short messages)
  - per-message fixed costs are high
- If the costs are right, one can optimize further
  - use a 3D grid instead of a 2D one
  - use a higher-dimensional grid
  - use a hypercube
- What are the costs for each of those?
- How to handle empty messages?
12. Costs
- As we increase the number of dimensions
  - the total number of messages decreases
  - the total bandwidth used increases (each value is forwarded more times)
- Extremes: the binary hypercube or the 2-D grid often represent the optima, depending on specific constants (esp. m); see the cost sketch below
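
A back-of-the-envelope comparison of the d-dimensional variants under a simple alpha-beta model; all constants here are made up for illustration, and the formulas assume a side of k = P^(1/d), with d(k-1) messages of m*k^(d-1) bytes per processor:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double alpha = 50e-6;  /* per-message overhead, seconds (assumed) */
        const double beta  = 10e-9;  /* per-byte transfer cost, seconds (assumed) */
        const double m     = 8.0;    /* bytes per destination, e.g. one double */
        const int    P     = 4096;
        const int    dmax  = (int)(log((double)P) / log(2.0) + 0.5);

        /* d = 1 is the direct send; d = log2(P) is the binary hypercube. */
        for (int d = 1; d <= dmax; d++) {
            double k     = pow((double)P, 1.0 / d);
            double msgs  = d * (k - 1);
            double bytes = msgs * m * pow(k, (double)(d - 1));
            printf("d=%2d  k=%7.1f  msgs=%8.1f  est. time=%8.3f ms\n",
                   d, k, msgs, (msgs * alpha + bytes * beta) * 1e3);
        }
        return 0;
    }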
13. Other methods?
- Spanning tree
  - Everyone sends one message to the root
  - The root partitions the data and sends it on to each subtree
14. Each-to-all broadcast
- Dimensional exchange (see sketch below)
- Row-column?
- Spanning tree
- Cost analysis
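
A sketch of dimensional exchange for the each-to-all broadcast, assuming MPI, one double per rank, and P a power of two; the name dim_exchange_allgather is illustrative:

    #include <mpi.h>

    /* In step k, partners differing in bit k swap everything gathered
       so far, doubling the data: log2(P) messages per processor. */
    static void dim_exchange_allgather(double myval, double *all /* P entries */,
                                       int P, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        all[rank] = myval;
        for (int bit = 1; bit < P; bit <<= 1) {
            int partner = rank ^ bit;           /* differs in this bit */
            int mybase  = rank & ~(bit - 1);    /* start of my size-bit block */
            MPI_Sendrecv(all + mybase, bit, MPI_DOUBLE, partner, bit,
                         all + (mybase ^ bit), bit, MPI_DOUBLE, partner, bit,
                         comm, MPI_STATUS_IGNORE);
        }
    }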