Title: Communication and Computation on Arrays with Reconfigurable Optical Buses
1. Communication and Computation on Arrays with Reconfigurable Optical Buses
- Yi Pan, Ph.D.
- IEEE Computer Society Distinguished Visitors Program Speaker
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
2. Background
- Existing interconnection schemes have problems.
- Static interconnection networks (such as the hypercube) provide only limited connectivity between processors.
- Their time complexities are lower-bounded by the network diameter.
3. Electronic Buses
- Messages cannot be transmitted concurrently on a bus.
- The bus therefore becomes a major bottleneck in the system.
4. Reconfigurable Buses
- Messages can be transmitted concurrently on different bus segments.
- Using a single global bus removes the diameter problem; however, concurrent transmission and the single global bus cannot be exploited at the same time.
- The bus is still a potential bottleneck when transferring a large amount of data.
5. Optical Signal Transmission on Waveguides
- Unidirectional propagation.
- Predictable propagation delay per unit length.
- These two properties make message pipelining possible.
6. Linear Arrays with an Optical Bus
- Three waveguides: message waveguides, reference waveguides, and select waveguides.
- Messages are organized as fixed-length message frames.
- Optical signals propagate unidirectionally: from left to right on the upper segment and from right to left on the lower segment.
7. Pipelined Bus
8. Bus Structure and Addressing Scheme
- A fixed unit delay Δ is added between any two adjacent processors on the receiving segments of the reference waveguides.
- A conditional delay Δ is added between any two adjacent processors on the transmitting segments of the select waveguides.
- When the relative time delay of a select pulse and a reference pulse brings them to a processor simultaneously, their coincidence produces a double-height pulse.
9. LARPBS Structure
- Three waveguides.
- Conditional and fixed delay switches.
- Segment switches.
10. Segmenting
11. Coincident Pulse Technique
- The source sends a reference pulse at time t_ref and a select pulse at time t_sel.
- The source also sends a message frame on the message waveguide.
- Whenever a processor detects a coincidence of the two pulses, it reads the message frame.
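As a rough sketch of the idea, the coincidence condition can be modeled sequentially in Python (the array size N, the delay counts, and the function name are illustrative assumptions, not the optical hardware itself):

```python
N = 8  # assumed number of processors

def coincides_at(t_ref, t_sel):
    """Return the processors that detect a pulse coincidence.

    Model: on the receiving segment the reference pulse passes one
    unit delay per processor, so it reaches processor i at t_ref + i;
    the select pulse passes no fixed delays, so it reaches every
    processor at t_sel.  Processor i reads the message frame iff the
    two pulses arrive at it together.
    """
    return [i for i in range(N) if t_ref + i == t_sel]

# To address processor 5, the source offsets the select pulse by 5 units.
print(coincides_at(t_ref=0, t_sel=5))  # [5]
```

In this toy model, choosing the offset t_sel - t_ref selects exactly one destination, which is the essence of the coincident pulse technique.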
12. Addressing Techniques
- Coincident pulses.
- Select and reference frames.
14. Basic Data Movement Operations
- One-to-One Communication:
- The source processor sends a reference pulse at time t_ref (the beginning of a bus cycle) on the reference waveguide.
- It also sends a select pulse at time t_sel on the select waveguide.
- The source processor also sends a message frame on the message waveguide, which propagates synchronously with the reference pulse.
15. Basic Data Movement Operations
- Broadcast:
- All conditional switches are set to straight.
- The source processor sends a reference pulse at the beginning of its address frame and sends N consecutive select pulses in its address frame on the select waveguide.
16. Broadcast Switch Setup
17. Basic Data Movement Operations
- Multicast:
- Each source processor sends a reference pulse at the beginning of its address frame.
- A source processor sends one select pulse in its address frame, on the select waveguide, for each intended destination.
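The frame-based addressing behind broadcast and multicast can be sketched as follows (a sequential Python model; the array size, frame encoding as a 0/1 list, and function name are assumptions for illustration):

```python
N = 8  # assumed number of processors

def deliver(select_frame, message):
    """Sketch of frame-based addressing: the select frame has one
    slot per processor; a pulse (1) in slot j makes processor j read
    the message frame.  Returns {processor: message received}."""
    return {j: message for j in range(N) if select_frame[j] == 1}

# Broadcast = all N slots carry a pulse; multicast = a chosen subset.
frame = [0] * N
for dest in (1, 4, 6):          # hypothetical destination set
    frame[dest] = 1
print(deliver(frame, "msg"))    # processors 1, 4, and 6 receive "msg"
```

One select frame thus addresses any subset of processors in a single bus cycle, which is what makes multicast as cheap as one-to-one communication in this model.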
18. Basic Operations
- Compression:
- The compression algorithm moves all active data items to the right side of the array.
- Processor i sets its local switch S(i) to cross if X(i) = 1, and to straight if X(i) = 0.
- Then each processor i with X(i) = 1 injects a reference pulse on the reference waveguide and a select pulse on the select waveguide at the beginning of a bus cycle.
- Such a processor also sends a message frame containing its local data in memory location D through the message waveguide during the bus cycle.
- Processors with X(i) = 0 put no pulse or message on the three waveguides.
19. Compression
- The select pulse sent by the processor whose index is the k-th largest in the active set passes k conditional delays on the transmitting segments of the select waveguide, because the k processors to its right in the active set have their switches set to cross.
- Since both the select and the reference pulse have passed k delays (on the select and reference waveguides, respectively) when they arrive at processor N - k, the two pulses meet only at processor N - k.
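The net effect of compression can be sketched as follows (a sequential Python model of the result, not of the single-bus-cycle optical routing; the use of None for vacated slots is an assumption for illustration):

```python
def compress(data, active):
    """Sketch of the compression operation's effect: all items whose
    activity bit X(i) is 1 end up packed at the right side of the
    array, keeping their original relative order; other slots are
    left empty (None).  `data` and `active` are equal-length lists."""
    n = len(data)
    kept = [d for d, a in zip(data, active) if a == 1]
    return [None] * (n - len(kept)) + kept

print(compress(list("abcdef"), [1, 0, 0, 1, 1, 0]))
# -> [None, None, None, 'a', 'd', 'e']
```

On the LARPBS this whole rearrangement takes O(1) bus cycles, since every active processor's pulse pair coincides directly at its destination.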
21. Basic Operations
- Split operation: separates the active set from the inactive set.
- First, call the compression algorithm to move all data elements of the active set to the upper part of the array.
- Second, call the compression algorithm to move all data elements of the inactive set to the upper part of the array (into memory location D2).
- Third, move all data items in memory location D2 left by s positions, where s is the number of active elements.
- The whole operation uses O(1) bus cycles.
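The outcome of the three steps can be sketched sequentially (a Python model of the result only; which group lands on which side follows the compression convention assumed here, not necessarily the talk's figures):

```python
def split(data, active):
    """Sketch of the split operation's effect: active items form one
    contiguous group and inactive items the other, each group keeping
    its original relative order."""
    act = [d for d, a in zip(data, active) if a == 1]
    inact = [d for d, a in zip(data, active) if a == 0]
    return act + inact

print(split(list("abcdef"), [0, 1, 0, 1, 1, 0]))
# -> ['b', 'd', 'e', 'a', 'c', 'f']
```

Because each of the three steps is a constant number of bus cycles, the whole split is O(1) bus cycles, which is what the integer-sorting algorithm later relies on.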
22. Basic Operation: Binary Sum
- Processor i sets its switch S(i) on the transmitting segment to straight if A(i) = 1, and to cross otherwise.
- Processor i injects a reference pulse and a select pulse on the reference and select buses, respectively, at the beginning of a bus cycle.
- The other processors count the number of delays encountered and feed the count back to processor i.
- The operation takes O(1) bus cycles.
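What the operation computes can be sketched as follows (a sequential Python model; on the bus the count is read off from the pulse coincidence position in one cycle, whereas here it is simply tallied):

```python
def binary_sum(bits):
    """Sketch of the binary-sum operation: the number of 1-bits among
    the processors' local bits A(i).  With switches set to cross when
    A(i) = 0, the number of conditional delays the pulses pass equals
    the number of 0-bits, so the coincidence position encodes the sum."""
    n = len(bits)
    crosses = sum(1 for b in bits if b == 0)  # delays = count of A(i) = 0
    return n - crosses                         # hence the count of 1-bits

print(binary_sum([1, 0, 1, 1, 0, 1]))  # 4
```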
24. Basic Operation: General Sorting
- A quicksort algorithm for N elements has been proposed for the LARPBS model of size N; it runs in O(log N) expected time.
- The idea is to apply the partition and split operations recursively.
25. Basic Operation: Integer Sorting
- N integers of k bits each can be sorted in O(k) steps.
- Repeat the split algorithm k times, using the l-th bit as the mark for the active set in the l-th iteration, from the least significant bit to the most significant.
- After the k iterations, all N integers are in sorted order.
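This is radix sort built from the split operation, and it can be sketched sequentially (a Python model of the k passes; function names are illustrative, and the least-significant-bit-first order with a stable split is what makes the result sorted):

```python
def split_by_bit(data, bit):
    """Stable split on one bit: items whose bit is 0 first, then items
    whose bit is 1.  On the LARPBS each such pass is O(1) bus cycles."""
    zeros = [x for x in data if (x >> bit) & 1 == 0]
    ones = [x for x in data if (x >> bit) & 1 == 1]
    return zeros + ones

def integer_sort(data, k):
    """k passes of split, least significant bit first (radix sort)."""
    for bit in range(k):
        data = split_by_bit(data, bit)
    return data

print(integer_sort([5, 3, 0, 7, 2, 6], k=3))  # [0, 2, 3, 5, 6, 7]
```

Stability of each pass is essential: it preserves the order established by the lower-bit passes, which is why k stable splits yield a fully sorted sequence in O(k) steps.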
26. Basic Operation: Maximum Finding
- An O(log log N) time algorithm using N processors.
- Partition the input into groups small enough that enough processors can be allocated to each group to find that group's maximum in constant time by the algorithm above.
- As the computation progresses, the number of candidates for the maximum shrinks; the number of processors available per candidate therefore grows, so the group size can be increased.
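The growing-group-size idea can be sketched with a sequential simulation (a Python model; the particular group-size schedule N // candidates is the standard doubly-logarithmic one and an assumption here, not taken verbatim from the talk):

```python
def max_rounds(values):
    """Simulate the rounds of the doubly-logarithmic maximum scheme
    with N = len(values) processors.  With c candidates remaining,
    groups of about N // c elements can each find their maximum in
    O(1) time, so c shrinks roughly as c -> c**2 // N per round,
    giving O(log log N) rounds.  Returns (maximum, rounds used)."""
    n = len(values)
    candidates = list(values)
    rounds = 0
    while len(candidates) > 1:
        group = max(2, n // len(candidates))  # group size grows each round
        candidates = [max(candidates[i:i + group])
                      for i in range(0, len(candidates), group)]
        rounds += 1
    return candidates[0], rounds

print(max_rounds([3, 9, 1, 7]))  # (9, 2)
```

Running this on inputs of increasing size shows the round count growing far more slowly than log N, which is the point of the candidate-reduction argument.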
27. Scalability Issue
- The term scalability has two uses.
- It can refer to scaling a system up to accommodate ever-increasing user demand, or to scaling a system down to improve cost-effectiveness.
- If a model is scalable, a programmer need not be overly concerned with the actual size of the machine on which a program will run.
28. Scalability Issue
- Here we are concerned with running algorithms efficiently on machines of scaled-down size.
- If the total running time increases only by a factor of O(N/P), the algorithm is scalable.
29. Mapping Schemes
- Two obvious schemes exist to map N input elements to P processors.
- Cyclic mapping maps element x to processor (x mod P).
- Block mapping divides the array into P contiguous chunks and maps the i-th chunk of N/P elements to the i-th processor.
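The two mappings can be stated in a few lines (a Python sketch; the function names are illustrative, and block mapping assumes P divides N for simplicity):

```python
def cyclic_owner(x, P):
    """Cyclic mapping: element x lives on processor x mod P."""
    return x % P

def block_owner(x, N, P):
    """Block mapping: the i-th chunk of N // P consecutive elements
    lives on processor i (assumes P divides N)."""
    return x // (N // P)

N, P = 12, 4
print([cyclic_owner(x, P) for x in range(N)])
# -> [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print([block_owner(x, N, P) for x in range(N)])
# -> [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Cyclic mapping balances load across processors for skewed access patterns, while block mapping keeps neighboring elements on the same processor; the scalability results below move data between the two layouts.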
30. Scaled One-to-One Permutation
- Permutation routing of N data elements on a p-processor LARPBS.
31. Scaled One-to-One Permutation
- Naive algorithm: every processor sends its local data one element at a time.
- Data contention makes this require O((N/p) · p) = O(N) time.
32. Scaled One-to-One Permutation
- Algorithm using radix sorting: each processor first sorts its local data by destination address to avoid conflicts.
- Time complexity is O((N/p) log N) (PDCS '97, by Trahan, Pan, Vaidyanathan, and Bourgeois).
33. Scaled One-to-One Permutation
- Randomized algorithm (IEEE TPDS '97, by Rajasekaran and Sahni).
- Time complexity is O(N/p) with high probability.
34. Scaled One-to-One Permutation
- Optimal algorithm: assign a color to each processor and use a more elaborate scheme to sort messages by destination color (JPDC 2000).
- Time complexity is O(N/p).
- In general, for an h-relation, each processor of a p-processor LARPBS is the source of at most h messages and the destination of at most h messages.
- Time complexity is O(h).
35. Scalability Results
- Conversion between cyclic and block mappings of N elements can be performed by a P-processor LARPBS in O(N/P) time.
- All the basic data movement operations discussed above are scalable.
- Many application algorithms built from these operations are also scalable.
36. Future Research
- Routing in 2D arrays with optical buses.
- Fault tolerance.
- Optimal emulation of large arrays.
- Other parallel algorithms.