Title: Optics group
Stage-Distributed Time-Division Permutation
Routing in a Multistage Optically Interconnected
Fabric
Alvaro Cassinelli(1), Makoto Naruse(2), Alain
Goulet(1), and Masatoshi Ishikawa(1) (1)
University of Tokyo, Dept. Information Physics
and Computing, 7-3-1 Hongo Bunkyo-ku, Tokyo
113-0033, Japan. (2) Communications Research
Laboratory, 4-2-1 Nukui-kita, Koganei, Tokyo
184-8795, Japan.
http://www.k2.t.u-tokyo.ac.jp/index-e.html
Plan of the presentation
I. Introduction: space-domain optical switching
fabrics
II. Column-Control in Multistage Interconnection
Networks (CCMINs)
III. Folded Optical Implementation of a
transparent CCMIN
IV. Packet switching in a buffered CCMIN (new)
V. Conclusion and Further Research
VI. Some References
I. Introduction: the problem under study
How can we design an efficient optical switching
fabric that addresses:
- Processor-memory bottleneck in Supercomputers
- Router bottleneck in Next Generation Optical
Internet
These problems have some similarities: low
latency required, synchronization, high
bandwidth.
Traffic characteristics vary:
synchronous/asynchronous, regular/arbitrary
request patterns, fixed/variable length of data
bursts (granularity).
In fact, the above problems are case studies
among a continuum of situations
I. Introduction: optics inside routers
Where can optics be used?
- to interconnect router subsystems
- at the (unbuffered) switching fabric (OXC)
- at the interfaces and controller (all-optical
routing)
II. Column-Control in Multistage Interconnection
Networks
II.1 Multistage Interconnection Networks
II.2 Column-Control in MINs
II.3 Permutation Capacity of CCMIN
II.4 Unbuffered CCMIN for permutation routing
II.1 Multistage Interconnection Networks
Basic switching fabric: full crossbar (XC)
Advantages:
- Wide-sense non-blocking
- Low latency
Drawbacks:
- O(N^2) complexity (using 2x2 switches)
- Simultaneous switching noise
- Central controller bottleneck
- Poor modularity
Switching modes:
- Circuit switching: good for low-latency
memory-processor communications.
- Packet switching: maximum throughput of 63%
without buffers (uniform traffic).
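The 63% figure for the unbuffered crossbar can be checked numerically. The sketch below is an illustration, not part of the original slides: it assumes N inputs that each request a uniformly random output with probability `load` per time slot, and that each output serves one of its requests and drops the rest. The function names are hypothetical. Under these assumptions the throughput per port is 1 - (1 - load/N)^N, which is roughly 0.635 for N = 64 at full load and tends to 1 - 1/e as N grows.

```python
import math
import random


def crossbar_throughput(n_ports, load=1.0):
    """Closed form: probability that a given output receives at least one request."""
    return 1.0 - (1.0 - load / n_ports) ** n_ports


def simulate_crossbar(n_ports, load=1.0, slots=2000, seed=0):
    """Monte Carlo estimate of the same quantity (unbuffered, uniform traffic)."""
    rng = random.Random(seed)
    served = 0
    for _ in range(slots):
        # Each active input picks one output; an output serves a single request.
        requested = {
            rng.randrange(n_ports)
            for _ in range(n_ports)
            if rng.random() < load
        }
        served += len(requested)
    return served / (n_ports * slots)


if __name__ == "__main__":
    print("closed form, N=64 :", crossbar_throughput(64))
    print("simulation,  N=64 :", simulate_crossbar(64))
    print("limit 1 - 1/e     :", 1.0 - math.exp(-1.0))
```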
II.1 Multistage Interconnection Networks
A MIN still has full point-to-point connectivity
(and is self-routing).
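Self-routing can be illustrated with destination-tag routing. The sketch below uses an Omega network (perfect shuffle before each stage of 2x2 switches) as a stand-in, because its routing rule is compact; the slides themselves work with a Baseline network, so this is an assumption made for illustration only, and the function name is hypothetical. At each stage the switch output is chosen by one bit of the destination address, most significant bit first, with no central controller.

```python
def omega_route(src, dst, n):
    """Return the list of line numbers visited from src to dst in an
    n-stage Omega network (N = 2**n lines), using destination-tag routing."""
    mask = (1 << n) - 1
    line = src
    path = [line]
    for stage in range(n):
        # Perfect shuffle: rotate the n-bit line address left by one.
        line = ((line << 1) | (line >> (n - 1))) & mask
        # The 2x2 switch sets the lowest bit to the current destination bit.
        dest_bit = (dst >> (n - 1 - stage)) & 1
        line = (line & ~1) | dest_bit
        path.append(line)
    return path


if __name__ == "__main__":
    print(omega_route(src=5, dst=2, n=3))
```

Every source reaches every destination with this rule, which is the full point-to-point connectivity mentioned above; two paths may still collide inside the network, which is the blocking discussed in the following slides.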
II.2 Column-Control in MINs
- Column-control simplifies hardware and control
II.3 Permutation Capacity of CCMIN
However: if blocking was a problem for a MIN...
[Figure label: local blocking]
II.3 Permutation Capacity of CCMIN
64x64 network
- Requests serviced by circuit switching (or by
on-the-fly packet switching)
- Input requests are independent Bernoulli trials
(parameter: the input request probability per unit time)
- Uniform traffic: equal probability of requesting
any output port
[Plot: probability of request acceptance vs. input
request probability per unit time, for crossbar,
standard MIN and CCMIN]
- Crossbar: tends to 63% when N → ∞, because of HOL blocking.
- Standard MIN and CCMIN: both tend to 0 when N → ∞.
CCMIN cannot be used to service arbitrary
requests in a circuit-switched manner!
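The decay of the standard MIN curve with network size can be reproduced with the classical stage-by-stage recurrence for unbuffered networks of 2x2 switches under uniform traffic: if p is the probability that a switch input carries a request, the probability that a given switch output carries one is 1 - (1 - p/2)^2. The sketch below is an illustration under that textbook model, not the simulation behind the plot, and it does not model the CCMIN; the function name is hypothetical.

```python
def min_throughput(n_stages, load=1.0):
    """Probability that an output of an unbuffered MIN built from 2x2
    switches carries a request, for uniform independent requests."""
    p = load
    for _ in range(n_stages):
        # A switch output is busy unless neither of its two inputs selects it.
        p = 1.0 - (1.0 - p / 2.0) ** 2
    return p


if __name__ == "__main__":
    for n in (3, 6, 10, 20):
        p_out = min_throughput(n, load=1.0)
        print(f"N = 2^{n}: throughput per port {p_out:.4f}")
```

For the 64x64 case (six stages) at full load the recurrence gives a throughput of about 0.36 per port, and the value keeps falling as stages are added.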
II.4 Unbuffered CCMIN for permutation routing
[Figure: 4-D hypercube-connected multiprocessor,
nodes 1 to 16, cube permutations C1 to C4]
Synchronous, weakly connected parallel computer
(all processors use the same permutation in each time slot)
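As an example of an algorithm that only needs one common permutation per time slot, consider a global sum (all-reduce) on a hypercube. The sketch below is an illustration, not taken from the slides: it assumes 0-based node numbers and models the cube permutation of dimension k as "exchange with the node whose address differs in bit k". The function name is hypothetical.

```python
def hypercube_allreduce(values):
    """Sum-reduce over a hypercube: in time slot k every node exchanges its
    partial sum with its neighbour along dimension k (one cube permutation
    shared by all processors), then adds the received value."""
    n_nodes = len(values)
    dims = n_nodes.bit_length() - 1
    if 1 << dims != n_nodes:
        raise ValueError("number of nodes must be a power of two")
    current = list(values)
    for k in range(dims):
        # Same permutation for every processor in this slot: i <-> i XOR 2^k.
        current = [current[i] + current[i ^ (1 << k)] for i in range(n_nodes)]
    return current


if __name__ == "__main__":
    print(hypercube_allreduce(list(range(16))))
```

After n = 4 time slots every one of the 16 nodes holds the total, using only the four cube permutations.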
III. Folded Optical Implementation of a
transparent CCMIN
III.1 Designing a CCMIN for circuit-switched
permutation routing
III.2 Folded Optical Implementation
III.3 Experimental Demonstration
III.4 Possible applications
III.1 Designing a CCMIN for circuit-switched
permutation routing
3-stage CC-Baseline network
- Number of permutations: 2^n (n = 3)
- These are {c3, id} x {c2, id} x {c1, id}
- These are just the permutations required to
implement the (3D) hypercube!
[Figure: stage permutation pairs (c3, id), (c2, id), (c1, id)]
A multistage version of most parallel-computer
direct-network topologies (hypercube,
cube-connected cycles, de Bruijn, etc.) can be
implemented as a CCMIN with properly designed
inter-stage permutation modules.
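The count of 2^n permutations can be verified by enumeration. The sketch below is an illustration that assumes each of the n stages contributes either the identity or one cube permutation, as stated on the slide, and that addresses are 0-based; all names are hypothetical. Composing one choice per stage yields exactly the permutations "XOR the address with a fixed mask".

```python
from itertools import product


def cube(k, n):
    """Cube permutation c_k on 2**n addresses: complement bit k (1-based)."""
    return [i ^ (1 << (k - 1)) for i in range(1 << n)]


def identity(n):
    return list(range(1 << n))


def compose(first, second):
    """Permutation obtained by applying `first`, then `second`."""
    return [second[first[i]] for i in range(len(first))]


def ccmin_permutations(n):
    """All permutations reachable with one {c_k, id} choice per stage."""
    reachable = set()
    for choices in product((False, True), repeat=n):
        perm = identity(n)
        # Stages are traversed in the order c_n, ..., c_1.
        for k, use_cube in zip(range(n, 0, -1), choices):
            perm = compose(perm, cube(k, n) if use_cube else identity(n))
        reachable.add(tuple(perm))
    return reachable


if __name__ == "__main__":
    perms = ccmin_permutations(3)
    print(len(perms), "distinct permutations")
```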
III.2 Folded Optical Implementation
Multistage Interconnection Network architecture
- planar implementation: electronic, planar
lightwave circuit (PLC)
- 3D implementation: free space, guided-wave
Dense, efficient 3D folded inter-stage optical
interconnects
Optical Multistage Architecture Paradigm (fixed
interconnections)
III.2 Folded Optical Implementation
slide not shown in main presentation
Guided-wave (fiber-based) modules vs. free-space
- Fixed interconnections, no broadcast: optical
fiber is adequate.
- Better efficiency (and, just like free-space
optics, no crosstalk in 3D).
- No space-invariance imposed.
- Precise and robust alignment possible.
- Theoretically more volume-efficient than the
free-space counterpart.
- Hard to build? Not fundamentally difficult
(can be automated, permutation decomposition
possible).
- Alignment of output and input.
- Power dissipation: the fundamental limit is very
far off compared with electronics.
[Figure: integrated 2D folded perfect-shuffle
permutation module; prototype fiber module
(fibers and holders)]
Waveguide arrays for fixed, point-to-point and
space-variant interconnections are an interesting
alternative to free-space optics.
Prototype (non-integrated) 4x4 fiber module
slide not shown in main presentation
Input (VCSEL 8544nm)
Output (CCD)
Two holder prototypes: Zirconium, SiO2
Pitch: 2505 µm
Multimode graded-index fibers: NA = 0.21
(core 50 µm, cladding 126 µm)
Transmission loss: 3 dB/km
III.2 Multiple-permutation module
Besides density, reduced crosstalk and optical
efficiency, there is another nice feature of the
guided-wave approach to plane-to-plane optical
interconnections
Cube Permutations for N = 2^n
slide not shown in main presentation
Cube permutation c_k (complements bit k of the address):
$c_k(b_n, \dots, b_{k+1}, b_k, b_{k-1}, \dots, b_2, b_1) = (b_n, \dots, b_{k+1}, \overline{b_k}, b_{k-1}, \dots, b_2, b_1)$
Unfolded (example with n = 4): [Figure: c1, c3, c4]
Folded:
If k ≤ n/2, c_k exchanges only rows. If k > n/2, c_k
exchanges only columns. The modules are just the
same, rotated.
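The row/column property of the folded cube permutations can be checked directly. The sketch below is an illustration under an assumed folding, which is not specified on the slide: the low n/2 address bits are taken as the row index and the high n/2 bits as the column index of a square 2D array. With this convention a cube permutation c_k changes only the row coordinate when k ≤ n/2 and only the column coordinate otherwise. The function names are hypothetical.

```python
def fold(index, n):
    """Map a linear n-bit address to (row, column) on a 2^(n/2) x 2^(n/2) grid.
    Assumed convention: low half of the bits = row, high half = column."""
    half = n // 2
    row = index & ((1 << half) - 1)
    col = index >> half
    return row, col


def folded_action(k, n):
    """Return which grid coordinates the cube permutation c_k modifies."""
    changed = set()
    for i in range(1 << n):
        row0, col0 = fold(i, n)
        row1, col1 = fold(i ^ (1 << (k - 1)), n)
        if row0 != row1:
            changed.add("rows")
        if col0 != col1:
            changed.add("columns")
    return changed


if __name__ == "__main__":
    for k in range(1, 5):
        print(f"c{k} exchanges only:", folded_action(k, 4))
```

Swapping the roles of rows and columns in `fold` corresponds to rotating the module by 90 degrees, which is why the same physical module can serve both cases.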
III.2 Experimental Demonstration
Plane mapping (folding)
Row-column folded bi-permutation module
Unfolded hypercube and identity permutations
Prototype implementation using optical fibers
(*) not unique!
III.2 Experimental Demonstration
slide not shown in main presentation
The topology is mapped on a plane:
four-dimensional hypercube-connected
multiprocessor
(processors interconnected through a 2D optical
socket or lying in a VLSI chip matrix)
Spanned 4D hypercube (uses four bi-permutation
modules)
III.2 Experimental Demonstration
slide not shown in main presentation
[Figure: input (VCSEL array), exit of first module,
input of second module, output (CCD camera)]
Commutation pitch: 125 µm
Alignment tolerance: ±5 µm (half peak power).
Inter-module coupling efficiency: 1.7 dB (no
additional optics, matching oil or antireflection
coating).
=> Validation of the simple cascaded architecture.
III.2 Experimental Demonstration
Visualization of 2D permutation switching using a
pair of modules
C2 or Id
C1 or Id
III.2 Demonstration: electromechanical actuator
X-Y electromagnetically actuated device
(can vibrate the module in both X and Y
directions: in principle, permutation
interleaving is possible in both directions)
Resonant frequency: 430 Hz (±62.5 µm)
(Micro-electromechanical (MEMS) actuators may
also be an interesting alternative when switching
latency in the millisecond range is tolerable)
III.2 Demonstration: electromechanical actuator
slide not shown in main presentation
Resonant-frequency round-robin permutation
scheduling
[Figure: timeline of successive time slots
assigned to Interconnect 1, Interconnect 2,
Interconnect 3, ..., Interconnect N]
III.2 Demonstration: electromechanical actuator
slide not shown in main presentation
Input: slow row/column scan of VCSEL array
- No electromagnetic actuation: fixed Identity
permutation.
- Electromagnetic actuation: Identity and Cube2
permutations alternate at 860 Hz.
III.2 Demonstration: electromechanical actuator
Input: 635 nm laser modulated at 500 MHz
Output: high-speed photodetector
[Figure: actuator position and photodetector
signal vs. time; time slot of 200 µs]
- Switching latency between interconnections:
0.96 ms (*)
- Time slot (3 dB): 200 µs
- With a 10 Gb/s optical link, the burst size is
2 Mbit per channel (every millisecond), i.e. an
average bandwidth of about 2 Gb/s per channel
(*) MEMS routers: ms range.
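The timing budget on this slide can be worked through numerically. The sketch below is an illustration that takes the 3 dB time slot to be 200 µs, an interpretation that is consistent with the 860 Hz alternation rate and the 0.96 ms switching latency (one alternation period is about 1.16 ms). The variable and function names are hypothetical.

```python
ALTERNATION_RATE_HZ = 860.0      # permutations alternate at 860 Hz
SWITCHING_LATENCY_S = 0.96e-3    # dead time between interconnections
LINK_RATE_BPS = 10e9             # assumed 10 Gb/s optical link
TIME_SLOT_S = 200e-6             # assumed usable (3 dB) time slot


def burst_budget():
    """Return (period, slot left after latency, burst size, average rate)."""
    period = 1.0 / ALTERNATION_RATE_HZ
    slot_from_period = period - SWITCHING_LATENCY_S
    burst_bits = LINK_RATE_BPS * TIME_SLOT_S
    average_bps = burst_bits / period
    return period, slot_from_period, burst_bits, average_bps


if __name__ == "__main__":
    period, slot, burst, average = burst_budget()
    print(f"alternation period : {period * 1e3:.3f} ms")
    print(f"slot after latency : {slot * 1e6:.0f} us")
    print(f"burst size         : {burst / 1e6:.1f} Mbit")
    print(f"average bandwidth  : {average / 1e9:.2f} Gb/s")
```

The average comes out slightly below 2 Gb/s because one burst is sent per 1.16 ms period rather than per exact millisecond.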
III.4 Possible applications of an optical CCMIN
Possible computing applications:
- The present system is not usable for typical
memory-processor communications, which require
low latencies (< 100 ns), unless other
switching hardware is used (acousto-optic cells:
µs range; electro-optical materials: ns range).
- If processing time is long (so that a slow
switching latency is tolerable) and data bursts
are large, the electromechanical system may be
used (FFT, large database retrieval, etc.).
Communication networks:
- Burst switching at the WAN level (ms-range
reconfiguration times).
- Scientific, dedicated, transparent networks with
long holding times and high bandwidth
(TransLight, GLIF). MEMS switches are currently
used (reconfiguration times in the range of a
second are acceptable). An optical GSMIN may be
used to regularly provide interconnection
configurations.
- If the switching time is reduced, it can be used
to perform cyclic permutation scheduling in a
virtual output queued (VOQ) switch, leading to
100% throughput (Stanford Tiny Tera switch).
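The claim about cyclic permutation scheduling in a VOQ switch can be illustrated with a small simulation. The sketch below is a toy model, not a description of any existing switch: it assumes uniform Bernoulli arrivals, unbounded virtual output queues, and a schedule that connects input i to output (i + t) mod N in time slot t, with no arbitration at all. The function name and parameters are hypothetical.

```python
import random


def simulate_cyclic_voq(n_ports=8, load=0.9, slots=20000, seed=0):
    """Return (throughput, offered load) per port for a VOQ switch driven
    by a fixed cyclic sequence of permutations."""
    rng = random.Random(seed)
    # voq[i][j] = number of packets waiting at input i for output j.
    voq = [[0] * n_ports for _ in range(n_ports)]
    arrived = departed = 0
    for t in range(slots):
        for i in range(n_ports):
            if rng.random() < load:
                voq[i][rng.randrange(n_ports)] += 1
                arrived += 1
        # Blind schedule: the permutation used depends only on the slot number.
        for i in range(n_ports):
            j = (i + t) % n_ports
            if voq[i][j] > 0:
                voq[i][j] -= 1
                departed += 1
    total = n_ports * slots
    return departed / total, arrived / total


if __name__ == "__main__":
    throughput, offered = simulate_cyclic_voq()
    print(f"offered load {offered:.3f}, throughput {throughput:.3f}")
```

Under uniform traffic each virtual output queue receives load/N packets per slot and is served once every N slots, so the queues stay stable for any load below 1 and the carried traffic follows the offered traffic.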
slide not shown in main presentation
Burst interconnection within a short time
slot (e.g. 10 Gb/s, 100 ns → 1 kbit)
[Figure: timeline alternating one computation
stage (e.g. 1 ms) and one burst interconnect;
interconnection switching interval (e.g. 1 ms)]
Slow switching may be okay
IV. Packet switching in a buffered CCMIN
IV.1 Buffering in blocking networks
IV.2 FIFO Buffered CCMIN architecture
IV.3 Performance evaluation
IV.4 Delay-line buffered architecture
IV.1 Buffering for packet switching
Blocking is a serious drawback for circuit
switching. It is less serious for packet switching:
- Unbuffered networks (even wide-sense
non-blocking ones) suffer from HOL blocking:
buffering is unavoidable.
- Input queues, output queues, virtual output
queues and internal buffering have been explored
in crossbars as well as in MINs.
- However, an advantage of buffered MINs over
buffered crossbars is that the stage-distributed
switching marries well with the distribution of
buffering (thus avoiding large buffers).
Buffering is a solution adopted in usual MINs:
how much is a CCMIN improved by buffering?
IV.2 FIFO Buffered CCMIN architecture
[Figure: CCMIN with inter-stage FIFO buffers]
Why might this architecture compare well with
standard buffered MINs?
- For uniform traffic, at each stage half of the
packets wait and half pass: individual
switch/buffer control is, presumably, not really
required.
- What's more, arbitration for configuring the
global switches may not be necessary at all!
IV.3 Performance: global control vs. local control
Seven-stage, 128x128 input/output fabrics
(remark: inter-stage transfer with maximum speed-up
equal to the size of the buffer)
[Plot: probability of packet acceptance vs. input
request probability per unit time, for crossbar,
standard MIN and Global Switched MIN (GSMIN),
with buffer sizes 0 to 6]
Performance of the Global Switched MIN compares
very well with that of a standard MIN.
- GSMIN performance improves more quickly with
buffer size.
- For a buffer size of 5 packets, performances are
equivalent.
- For a buffer size of 3 packets, performance is
better than that of the crossbar.
IV.3 Performance: global control with blind
alternation
Blind switch alternation of a GSMIN
[Plot: probability of packet acceptance vs. input
request probability per unit time, for crossbar,
blind alternation and fair switching, with buffer
sizes 0 to 6]
As expected, blind alternation of switch states
gives the same performance as fair switch
selection (for uniform traffic).
This is very interesting, because it means that a
standard MIN can be operated blindly if traffic
is uniform enough. The interconnection scheduling
bottleneck (Clos, etc.) is eliminated by using a
Time-Division Permutation Routing strategy.
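Blind alternation can be mimicked with a toy model. The sketch below is an illustration under stated assumptions, not the simulator behind the plots: it assumes a cube-type CCMIN in which stage s offers only two global states (identity, or the cube permutation on address bit s), one FIFO per line in front of each stage, at most one packet transferred per line and per time slot (no speed-up), and uniform Bernoulli traffic. The function name and all parameters are hypothetical.

```python
import random
from collections import deque


def simulate_blind_ccmin(n=4, buffer_size=2, load=0.5, slots=2000, seed=1):
    """Return (acceptance ratio, delivered packets, misrouted packets)."""
    rng = random.Random(seed)
    size = 1 << n
    # queues[s][x]: FIFO of destination addresses waiting on line x at stage s.
    queues = [[deque() for _ in range(size)] for _ in range(n)]
    offered = accepted = delivered = misrouted = 0
    for t in range(slots):
        # Last stage first, so a packet advances at most one stage per slot.
        for s in reversed(range(n)):
            bit = 1 << s
            state = (t + s) & 1  # blind alternation: 0 = identity, 1 = cube
            for x in range(size):
                queue = queues[s][x]
                if not queue:
                    continue
                dest = queue[0]
                needed = 1 if (x ^ dest) & bit else 0
                if needed != state:
                    continue  # head-of-line packet waits for the other state
                y = x ^ bit if state else x
                if s == n - 1:
                    queue.popleft()
                    delivered += 1
                    misrouted += int(y != dest)
                elif len(queues[s + 1][y]) < buffer_size:
                    # The stage applies a permutation, so y is unique to x.
                    queues[s + 1][y].append(queue.popleft())
        for x in range(size):
            if rng.random() < load:
                offered += 1
                if len(queues[0][x]) < buffer_size:
                    queues[0][x].append(rng.randrange(size))
                    accepted += 1
    return accepted / offered, delivered, misrouted


if __name__ == "__main__":
    for b in (1, 2, 4):
        ratio, done, bad = simulate_blind_ccmin(buffer_size=b)
        print(f"buffer {b}: acceptance {ratio:.3f}, delivered {done}, misrouted {bad}")
```

No arbitration appears anywhere in the loop: the state of each stage depends only on the slot number, and every packet still reaches its destination because stage s fixes address bit s and later stages never modify it.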
IV.4 Delay-line buffered architecture
Reliable optical memories are still too difficult
to implement...
What about just delaying packets?
(since there are only two states per stage, even
a single delay line may give good performance)
[Figure: stage with delay-line buffer between
input and output]
slide not shown in main presentation
We did not study a standard MIN with
delay lines.
[Figure: switch with delay-line buffer between
input and output]
IV.4 Performance of a delay-line buffered
architecture
Blind alternation of global switch states is
assumed
(we did not study a standard MIN with
delay lines)
[Plot: probability of packet acceptance vs. input
request probability per unit time, for crossbar,
delay-line architecture and Global Switched MIN
with buffer sizes 0 to 6]
Using a single selectable delay per channel and
per stage, performance lies somewhere between
that of the one-packet and two-packet FIFO
buffered architectures.
V. Conclusion
V.1 Results
V.2 Further Research
V.1 Conclusion
Summarizing:
- Column-Control simplifies MIN hardware and
control
- Column-Control MIN may have enough permutation
capacity for specific applications (highly
parallel algorithms)
- Column-Controlled MIN can be efficiently
implemented using dense plane-to-plane optical
interconnections
- Column-Controlled MIN can be used for packet
switching if buffered, giving roughly the same
performance as standard MINs
- Path-selection mechanism may be blind (i.e.
round-robin, time-division permutation routing)
without appreciable degradation of performance.
V.2 Further Research
On transparent circuit-switched CCMINs:
- An arbitrary permutation request may be serviced
by multiplexing in time the available set of
permutations. This needs input buffers and
speed-up (i.e. short switching latency). This has
been explored in standard MINs using 2x2
switches.
- Design of active modules and multi-function
modules (containing more than two permutations,
but also other optical functions, e.g. optical
delay lines).
On buffered packet-switched CCMINs:
- How heavily do the studied architectures rely
on the URM assumption? Study more realistic
traffic models / ways to balance non-regular
traffic.
- Other models of buffers: in particular,
inter-stage virtual output queues (VOQ) may give
very good performance in a CCMIN (because with a
speed-up of only 2, each stage will have 100%
throughput). Two parallel delay-line buffers?
slide not shown in main presentation
V.2 Fast switching permutation modules
Stack of PLC layers coupled in the normal
direction
Reconfiguration time can be of the order of
nanoseconds!
- Simulation of a crossbar by speed-up (TDM
connections for local area networks)
- Core of permutation-routing switches for
inter-processor communications in a parallel
computer
slide not shown in main presentation
V.2 Advanced further research
VOQ and speed-up, plus optimal permutation
decomposition, are the basic ingredients of the
Birkhoff-von Neumann (BVN) switch with 100%
throughput (plus load-balancing to simplify the
decomposition => Tiny Tera switch). Based on this
observation, it will be interesting to study:
1) a constrained decomposition of a rate
matrix onto the set of available CCMIN
permutations;
2) a multistage version of the BVN
switch, where the permutation decomposition is
done
a) at each stage (using bi-permutation
modules, this will probably lead to a simple
forced-alternate mode and reduce the size of the
VOQ to only 2, which may be accommodated by
simple delay lines!), or
b) every few stages, so
that the available set of permutations will be
very reduced, but still larger than 2. This may
optimize the design of buffer functions (no need
to put them in all stages).
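Point 1) can be made concrete in the simplest case. The sketch below is an illustration, not a result from the presentation: it assumes a cube-type CCMIN whose available permutations are exactly "XOR the address with a fixed mask", and checks whether a rate matrix can be written as a weighted sum of those permutations. Because these permutation matrices have disjoint supports, the decomposition exists only if entry (i, i XOR m) is the same for every input i, and the weights are then unique. The function name is hypothetical.

```python
def decompose_on_xor_permutations(rate, tol=1e-12):
    """Return weights w[m] with rate = sum_m w[m] * P_m, where P_m maps
    input i to output i XOR m, or None if no such decomposition exists."""
    size = len(rate)
    if size == 0 or size & (size - 1):
        raise ValueError("matrix size must be a power of two")
    weights = []
    for mask in range(size):
        w = rate[0][mask]
        # Every input must send the same rate to its partner under this mask.
        if any(abs(rate[i][i ^ mask] - w) > tol for i in range(size)):
            return None
        weights.append(w)
    return weights


if __name__ == "__main__":
    n_ports = 8
    uniform = [[1.0 / n_ports] * n_ports for _ in range(n_ports)]
    print(decompose_on_xor_permutations(uniform))
    shift = [[1.0 if j == (i + 1) % 4 else 0.0 for j in range(4)] for i in range(4)]
    print(decompose_on_xor_permutations(shift))
```

The uniform rate matrix decomposes with equal weights on all masks, which matches the blind round-robin schedule; a cyclic shift does not decompose at all, which shows why non-uniform traffic needs either buffering with time multiplexing or load balancing.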
Thank you for your attention
VI. Some References
slide not shown in main presentation
- Traffic models: J. Cao et al., Internet traffic
tends toward Poisson and independent as the load
increases, in Nonlinear Estimation and
Classification, eds. C. Holmes et al., Springer,
NY, 2002.
- Thermo-optic matrix: [Goh01]. Round-robin (TDM):
[Thompson91].
- Crosstalk can be solved by decomposing a
permutation into semi-permutations, with an
increase in the number of network stages: [Qiao].
- Y. Li and J. Popelek, Volume-consumption
comparisons of free-space and guided-wave optical
interconnections, Appl. Opt., Vol. 39, No. 11,
pp. 1815-1825, April 2000.
- Study of inter-stage VOQ in MINs: [Kolias],
Dual Banyan Switch.
- W.J. Dally, Virtual-Channel Flow Control, IEEE
Trans. Parallel and Distr. Systems, Vol. 3, No. 2,
Mar. 1992, pp. 194-205. Dally studies DAMQ
(dynamically allocated multi-queue buffers),
which look quite similar to hop-mode buffers.