Title: Heiko Schr
1Routing? Sorting? Image Processing? Sparse Matrices?
Reconfigurable Meshes!
Heiko Schröder, 1998
2Reconfigurable architectures
- FPGAs
- reconfigurable multibus
- reconfigurable networks (Transputers, PVM)
- dynamically reconfigurable mesh
- Aim
- efficiency
- special purpose --> general purpose architectures
3contents
- 1.) Motivation for the reconfigurable mesh
- 2.) Routing (and sorting)
- better than PRAM
- better than mesh
- 3.) Image processing
- 4.) Sparse matrix multiplication
- 5.) Bounded bus length
4PRAM
[Figure: two rows of nodes numbered 0 to 9]
diameter O(1), bisection width Θ(n)
EREW CRCW
5Mesh/Torus
6Hypercube
diameter O(log n), bisection width Θ(n)
7reconfigurable mesh
reconfigurable mesh = mesh + interior connections
15 positions
diameter 1 !!
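The figure of 15 is consistent with counting the ways a processor can group its four ports (N, E, S, W) into connected sets. The Python sketch below enumerates them. Reading a switch position as a set partition of the ports is an assumption about what the slide counts, and `set_partitions` is an illustrative helper, not a library function.

```python
def set_partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new group of its own.
        yield [[first]] + partition


# Each group is a set of ports that the processor fuses together.
positions = list(set_partitions(["N", "E", "S", "W"]))
print(len(positions))  # 15
```

Examples of positions are `[["N", "S"], ["E", "W"]]` (two crossing buses) and `[["N", "E", "S", "W"]]` (all ports fused).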
8global OR
Time O(1) on RM -- Ω(log n) on EREW-PRAM
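A minimal Python sketch of why the global OR needs only one bus cycle, assuming all processors fuse their ports into a single bus and every processor holding a 1 writes to it. The sequential loop stands in for writes that happen simultaneously on the real machine; `global_or` is a hypothetical name.

```python
def global_or(bits):
    """Simulate one bus cycle in which all processors share a single bus."""
    bus = 0
    for b in bits:            # on the RM all writes happen in the same step
        if b == 1:
            bus = 1           # every writer drives the same signal, so no conflict
    return [bus] * len(bits)  # every processor reads the bus


print(global_or([0, 0, 1, 0]))  # [1, 1, 1, 1]
print(global_or([0, 0, 0]))     # [0, 0, 0]
```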
9Prefix sum
0 1 1 0 1 0 0 1 1 1
Time O(1), Area Θ(n x n)
Fast but expensive
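The constant-time prefix sum can be sketched in Python as a staircase bus, assuming the usual construction: a signal enters at the top-left, and every column that holds a 1 routes the bus one row down. The row on which the signal passes column j is the prefix sum up to j. The loop below walks the bus sequentially, whereas on the RM the signal crosses the whole array in one step; `prefix_sums` is a hypothetical name.

```python
def prefix_sums(bits):
    """Follow a signal through an (n+1)-row staircase bus, one column per bit."""
    row = 0                   # the signal is injected at the left end of row 0
    sums = []
    for b in bits:            # columns from left to right
        if b == 1:
            row += 1          # a 1-column shifts the bus one row down
        sums.append(row)      # the processor in this row and column sees the signal
    return sums


print(prefix_sums([0, 1, 1, 0, 1, 0, 0, 1, 1, 1]))
# [0, 1, 2, 2, 3, 3, 3, 4, 5, 6]
```

The array needs one row per possible sum and one column per input bit, which is where the n x n area comes from.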
10Modulo 3 counter
Time O(1) on RM -- Ω(log n / log log n) on CRCW-PRAM
11modulo k^2 counter (ranking)
- 2-digit numbers to the base k represent all numbers smaller than k^2.
- 1.) determine x mod k (lsd)
- 2.) count number of wraps (msd).
--> modulo k^2 counting in 2 steps on a k x k^2 array
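The two steps can be sketched in Python as two passes over a k-row wrap-around staircase. This is a sequential simulation under assumptions: the bus wraps from the bottom row to the top row, and a processor can detect a wrap in its own column. `count_mod_k` and `count_mod_k_squared` are hypothetical names. With k = 3, the first function behaves like the modulo 3 counter of the previous slide.

```python
def count_mod_k(bits, k):
    """One pass: return (number of 1s mod k, per-column wrap flags)."""
    row = 0
    wraps = []
    for b in bits:
        wrapped = 0
        if b == 1:
            row += 1
            if row == k:      # the signal leaves the bottom row, re-enters on top
                row = 0
                wrapped = 1
        wraps.append(wrapped)
    return row, wraps


def count_mod_k_squared(bits, k):
    lsd, wraps = count_mod_k(bits, k)   # step 1: x mod k (least significant digit)
    msd, _ = count_mod_k(wraps, k)      # step 2: number of wraps mod k (most significant digit)
    return msd * k + lsd


bits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]   # six 1s
print(count_mod_k(bits, 3)[0])          # 0
print(count_mod_k_squared(bits, 3))     # 6
```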
12enumeration / prefix sum
time O(log n)
wire efficiency ! -- (compared with tree) 1/2
number of processors
13permutation routing - 2 steps
2 steps !!!
14Kunde's all-to-all mapping
Sorting:
- sort blocks
- all-to-all (columns)
- sort blocks
- all-to-all (rows)
- o-e-sort blocks
15sorting in constant time
block
Sort blocks
Complete sort: sort blocks, all-to-all (2), sort blocks, all-to-all (2), o-e-sort blocks
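The five phases can be tried out on a flat list. The Python sketch below is a sequential illustration under stated assumptions, not the mesh algorithm itself: the keys are split into k blocks of size B with k dividing B and B >= k^2, the first all-to-all sends the element at local position i to block i mod k, and the second sends the i-th chunk of every block to block i. On the slide the two all-to-all mappings run along columns and rows of the mesh. `all_to_all_sort` is a hypothetical name.

```python
import random


def all_to_all_sort(keys, k):
    n = len(keys)
    B = n // k                                   # block size
    assert n == k * B and B % k == 0 and B >= k * k
    chunk = B // k

    # Sort blocks.
    blocks = [sorted(keys[j * B:(j + 1) * B]) for j in range(k)]
    # All-to-all 1: position i of every block goes to block i mod k; sort blocks.
    blocks = [sorted(x for blk in blocks for x in blk[c::k]) for c in range(k)]
    # All-to-all 2: chunk i of every block goes to block i.
    blocks = [[x for blk in blocks for x in blk[i * chunk:(i + 1) * chunk]]
              for i in range(k)]
    # Odd-even sort of adjacent blocks: pairs (0,1),(2,3),... then (1,2),(3,4),...
    for start in (0, 1):
        for j in range(start, k - 1, 2):
            merged = sorted(blocks[j] + blocks[j + 1])
            blocks[j], blocks[j + 1] = merged[:B], merged[B:]
    return [x for blk in blocks for x in blk]


random.seed(1)
keys = random.sample(range(1000), 64)            # 64 distinct keys, k = 4, B = 16
print(all_to_all_sort(keys, 4) == sorted(keys))
```

After the second all-to-all every key sits in its final block or a neighbouring one, which is why a single odd-even pass over adjacent blocks is enough.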
16- better than PRAM --- but useless!!!
17Kunde's all-to-all mapping
n x n
18vertical all-to-all
19horizontal all-to-all
20Use of bus
1 step
(k/2)^2 steps
2 steps
3 steps
3 steps
2 steps
1 step
21sorting in optimal time Kunde / Schröder
- (k/2)^2 steps
- k = n^(1/3)
- each step takes n^(1/3) time
- --> T = n/4
Sorting: sort blocks (O(n^(2/3))), all-to-all (n/2), sort blocks (O(n^(2/3))), all-to-all (n/2), sort blocks (O(n^(2/3)))
time n + o(n)
22Why optimal?
23Use of theorem
1.) n keys on a k x k RM: Time > n/k.
Proof: Wherever the data is stored, there is always a bisection of length k. This can be demonstrated by sweeping from left to right through the array. Q.e.d.
2.) n x n keys on an n x n RM: Time > n.
Proof: trivial.
24n + o(n)
Optimal --- but ...
25enumeration / prefix sum
time O(log n)
wire efficiency ! -- (compared with tree) 1/2
number of processors
26ABCD-routing
B
Row-major enumeration of A, B, C and D packets
within each quadrant in time 4 log n. Determine
destination position of each packet.
27elementary steps
28time analysis
time 3 x n/2 T3no(n)
29T < 2n
mesh-diameter 2n
30enough of routing/sorting
Constant factor! Can we do better? What kind of problems?
Image processing, sparse problems!
31Image processing
- Border following
- Edge detection
- Component labeling
- Skeletons
- Transforms
32Component labelling
While own label is not received:
1.) Candidates break bus and send their label
   a) clockwise
   b) anti-clockwise
2.) Candidates switch off and restore bus if they see smaller label
Time O(1) -- O(log n)
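The loop above can be simulated in Python on a ring of labels. This is a sketch under assumptions: the labels are distinct, the ring stands for the bus running along one component, and a switched-off candidate simply passes signals through. `label_component` is a hypothetical name.

```python
def label_component(labels):
    """Return (winning label, number of rounds) for distinct labels on a ring."""
    candidates = list(range(len(labels)))      # indices in ring order
    rounds = 0
    while True:
        rounds += 1
        m = len(candidates)
        if m == 1:
            # The only candidate left receives its own label from both sides.
            return labels[candidates[0]], rounds
        survivors = []
        for pos, idx in enumerate(candidates):
            # Labels arriving from the nearest candidate on either side.
            left = labels[candidates[(pos - 1) % m]]
            right = labels[candidates[(pos + 1) % m]]
            if labels[idx] < left and labels[idx] < right:
                survivors.append(idx)          # saw no smaller label: stay on
        candidates = survivors


print(label_component([5, 3, 8, 1, 9, 2, 7, 4]))  # (1, 3)
```

Only local minima survive a round, so at least half of the candidates switch off each time, which gives the O(log n) worst case. A favourable arrangement finishes in a constant number of rounds.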
33Transforms
- Wavelet transform: time log n on RM
- -- time n on mesh
- FFT: time n on RM and mesh
- Hough transform: time m x log n on RM
- -- time m x n on mesh
34systolic matrix multiplication
B
time n
A
C
35sparse matrix multiplication
x
A
B
C
36unlimited bus length
1 2
3
37A row-sparse B column-sparse
Repeat k times
Begin
  horizontal ring broadcast
  Repeat k times
    vertical ring broadcast
End.
k
B
k
A
C
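The nested broadcast loop can be simulated sequentially in Python. The sketch assumes that A has at most k non-zeros per row and B at most k per column, that the i-th horizontal broadcast delivers the i-th non-zero of each row of A (with its column index) to the whole row, and that the j-th vertical broadcast delivers the j-th non-zero of each column of B (with its row index) to the whole column. A processor multiplies only when the two indices match. `sparse_multiply` is a hypothetical name.

```python
def sparse_multiply(A, B, k):
    n = len(A)
    # Non-zeros of each row of A as (column index, value).
    a_rows = [[(c, v) for c, v in enumerate(row) if v != 0] for row in A]
    # Non-zeros of each column of B as (row index, value).
    b_cols = [[(r, B[r][c]) for r in range(n) if B[r][c] != 0] for c in range(n)]
    C = [[0] * n for _ in range(n)]
    steps = 0
    for i in range(k):                     # horizontal ring broadcast i
        for j in range(k):                 # vertical ring broadcast j
            steps += 1
            for r in range(n):             # all processors (r, c) act in parallel
                if i >= len(a_rows[r]):
                    continue
                a_idx, a_val = a_rows[r][i]
                for c in range(n):
                    if j >= len(b_cols[c]):
                        continue
                    b_idx, b_val = b_cols[c][j]
                    if a_idx == b_idx:     # matching inner index: accumulate product
                        C[r][c] += a_val * b_val
    return C, steps


A = [[1, 0, 2], [0, 3, 0], [4, 0, 0]]      # at most 2 non-zeros per row
B = [[1, 0, 0], [0, 2, 0], [3, 0, 4]]      # at most 2 non-zeros per column
print(sparse_multiply(A, B, 2))
# ([[7, 0, 8], [0, 6, 0], [4, 0, 0]], 4)
```

The step counter reaches k^2, independent of n.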
38lower bound (c,r)
k3
n48
B
A
C
39splitting the problem
Repeat k times
Begin
  vertical ring broadcast
  Repeat s times
    horizontal ring broadcast
End.
[Figure: matrices A, B and C split into parts of size s and r; labels AA, CA, BA, CCC, T, "k B-elements", "first s"]
time ks
40CR
A has nk non-zero elements => Ar has at most nk/s non-zero rows => for s = sqrt(n), Ar has at most k sqrt(n) non-zero rows.
As B is a CC-problem => it takes time k sqrt(n).
41Ar B: calculating products
time k^2
42column sum
i-1
i
i+1
row i
time log n
43routing within columns
44Reconfigurable architectures
- Reconfigurable mesh ?
- constant diameter !
No !!!
Physical laws!
45Physical limits
- 30cm/ns
- on chip 1cm/ns
- --> bounded bus length
c = 300 000 km/sec
good idea !
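The figures on this slide can be checked with a little arithmetic. The Python sketch below converts the speed of light into cm/ns and estimates how far a signal travels in one clock cycle at the on-chip speed of 1 cm/ns. The clock rate in the example is a hypothetical value, and `max_bus_length_cm` is an illustrative helper.

```python
C_KM_PER_S = 300_000                       # speed of light, from the slide

# 1 km = 100 000 cm and 1 s = 1e9 ns
c_cm_per_ns = C_KM_PER_S * 100_000 / 1e9
print(c_cm_per_ns)                         # 30.0


def max_bus_length_cm(clock_mhz, speed_cm_per_ns=1.0):
    """Distance a signal covers in one clock cycle (on-chip speed by default)."""
    cycle_ns = 1000.0 / clock_mhz
    return speed_cm_per_ns * cycle_ns


print(max_bus_length_cm(100))              # 10.0 (hypothetical 100 MHz clock)
```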
46bounded broadcast
1 2
3
time k + n/l
47creating main stations
[Figure: stations labelled 1, 2, 3; rows of 1s, 2s and 3s]
time k
48A row-sparse B column-sparse
Create main stations 1,...,k for A and B (time n/l + k)
For i = 1,...,k do
Begin
  horizontal ring broadcast i of A
  For j = 1,...,k do
    vertical ring broadcast j of B
End.
k
B
k
A
C
49A and B column-sparse
Create main stations 1,...,k for A (time n/l + k)
For i = 1,...,k do
Begin
  horizontal ring broadcast i
  k bounded vertical broadcasts of products
  merging new products
End.
50remove minor stations
1 2
3
51results
Time n (n x n mesh)
A and B column sparse: (k^2) (k^2 + 2n/l)
A and B row sparse: (k^2) (k^2 + 2n/l)
A row sparse, B column sparse: (k^2) (k^2 + n/l)
A column sparse, B row sparse: (11n/l)
- image processing
- sorting
- routing
- load balancing
52- The RM is in some cases better than PRAM
- The RM is always at least as good as mesh
- The RM is often better than the mesh
The End