Heiko Schröder - PowerPoint PPT Presentation

Slides: 53

Provided by: professor50

Learn more at: http://web.cecs.pdx.edu

Transcript and Presenter's Notes

Title: Heiko Schröder

1
Routing? Sorting? Image Processing? Sparse Matrices?
Reconfigurable Meshes!
Heiko Schröder, 1998
2
Reconfigurable architectures

FPGAs
reconfigurable multibus
reconfigurable networks (Transputers, PVM)
dynamically reconfigurable mesh
Aim
efficiency
special purpose --> general purpose architectures

3
contents

1.) Motivation for the reconfigurable mesh
2.) Routing (and sorting)
better than PRAM
better than mesh
3.) Image processing
4.) Sparse matrix multiplication
5.) Bounded bus length

4
PRAM
[Figure: two rows of cells numbered 0 to 9]
diameter O(1), bisection width Θ(n)
EREW CRCW
5
Mesh/Torus
6
Hypercube
diameter O(log n), bisection width Θ(n)
7
reconfigurable mesh
reconfigurable mesh = mesh + interior connections
15 positions
diameter 1 !!
8
global OR
Time O(1) on RM -- Ω(log n) on EREW-PRAM
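A minimal Python sketch of this comparison, under assumptions that are not on the slide: on the RM all processors fuse into one bus and every processor holding a 1 writes to it (concurrent writes of the same signal are assumed to be allowed), while the EREW-PRAM is modelled as a pairwise reduction tree. The function names are hypothetical.

```python
def global_or_rm(bits):
    """Global OR on one fused bus: returns (result, number of bus steps)."""
    # Assumption: every processor holding a 1 writes the same signal,
    # and every processor reads the bus in the same step.
    bus_signal = any(bits)
    return bus_signal, 1


def global_or_erew(bits):
    """Global OR by pairwise combining: returns (result, number of rounds)."""
    vals = [bool(b) for b in bits]  # assumes at least one input bit
    rounds = 0
    while len(vals) > 1:
        nxt = []
        for i in range(0, len(vals), 2):
            # each cell is read by exactly one processor per round
            nxt.append(any(vals[i:i + 2]))
        vals = nxt
        rounds += 1
    return vals[0], rounds
```

For 16 input bits the tree needs 4 rounds, while the bus version needs a single step regardless of n.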
9
Prefix sum
0 1 1 0 1 0 0 1 1 1
Time O(1), Area Θ(n x n)
Fast but expensive
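The constant-time prefix sum can be illustrated with a small simulation. This is a sketch under an assumed switch scheme that is common for reconfigurable meshes (the slide itself does not spell it out): a column holding a 0 passes the signal straight through, a column holding a 1 bends the bus into the next row, and the row on which the signal leaves column j is the prefix sum up to j. The function name and the setting labels are hypothetical.

```python
def rm_prefix_sum(bits):
    """Trace one signal through an (n+1)-row, n-column mesh."""
    n = len(bits)
    rows = n + 1
    # Each column chooses its switch setting locally from its own bit.
    settings = ["step" if b else "straight" for b in bits]
    row = 0  # the signal enters at the west end of row 0
    prefix = []
    for setting in settings:
        if setting == "step":
            row += 1  # the bus bends into the next row
        assert row < rows  # n + 1 rows are always enough
        prefix.append(row)
    return prefix


# Bit row shown on the slide
print(rm_prefix_sum([0, 1, 1, 0, 1, 0, 0, 1, 1, 1]))
```

The loop only traces the path of the signal; on the mesh itself all switches are set at once and the signal propagates in one bus step, which is where the n x n area goes.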
10
Modulo 3 counter
Time O(1) on RM, Ω(log n / log log n) on CRCW-PRAM
11
modulo k^2 counter (ranking)

2-digit numbers to the base k represent all numbers smaller than k^2.
1.) determine x mod k (lsd)
2.) count number of wraps (msd).

--> modulo k^2 counting in 2 steps on a k x k^2 array
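The two steps can be sketched in Python as follows. The sketch assumes a k-row mesh with wrap-around, where each 1 in the input moves the signal one row down and a move from the last row back to row 0 counts as a wrap; the wrap positions are then counted modulo k by the same mechanism. Function names are hypothetical.

```python
def trace_mod_k(bits, k):
    """Return (count of ones mod k, list marking where the signal wrapped)."""
    row = 0
    wraps = []
    for bit in bits:
        wrapped = 0
        if bit:
            row += 1
            if row == k:  # signal leaves the last row and re-enters row 0
                row = 0
                wrapped = 1
        wraps.append(wrapped)
    return row, wraps


def count_mod_k_squared(bits, k):
    """Count the ones in bits modulo k*k using two mod-k passes."""
    lsd, wraps = trace_mod_k(bits, k)  # step 1: x mod k
    msd, _ = trace_mod_k(wraps, k)     # step 2: number of wraps mod k
    return msd * k + lsd
```

With k = 3 and seven ones in the input, step 1 gives lsd = 1 with two wraps, step 2 gives msd = 2, and the result is 2 * 3 + 1 = 7.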
12
enumeration / prefix sum

1 1 1 1 1 1 1 1

time O(log n)
wire efficiency ! -- (compared with tree) 1/2
number of processors
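One standard way to reach O(log n) for enumeration is recursive doubling; the sketch below uses that technique as a stand-in, since the slide does not show the exact wiring. In step d every position adds the value found 2^d positions to its left. The function name is hypothetical.

```python
def enumerate_log_steps(values):
    """Prefix sums by recursive doubling: returns (prefix sums, step count)."""
    vals = list(values)
    n = len(vals)
    steps = 0
    dist = 1
    while dist < n:
        # all positions update simultaneously from the previous step's values
        vals = [vals[i] + (vals[i - dist] if i >= dist else 0) for i in range(n)]
        dist *= 2
        steps += 1
    return vals, steps


# The row of eight 1s from the slide is enumerated 1..8 in 3 steps.
print(enumerate_log_steps([1] * 8))
```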
13
permutation routing - 2 steps
2 steps !!!
14
Kunde's all-to-all mapping
Sorting: sort blocks, all-to-all (columns), sort blocks, all-to-all (rows), o-e-sort blocks
15
sorting in constant time
block
Sort blocks
Complete sort: sort blocks, all-to-all (2), sort blocks, all-to-all (2), o-e-sort blocks
16

better than PRAM --- but useless!!!

17
Kunde's all-to-all mapping
n x n
18
vertical all-to-all
19
horizontal all-to-all
20
Use of bus
1 step
(k/2)^2 steps
2 steps
3 steps
3 steps
2 steps
1 step
21
sorting in optimal time Kunde / Schröder

(k/2)^2 steps
k = n^(1/3)
each step takes n^(1/3) time
--> T = n/4

Sorting: sort blocks (O(n^(2/3))), all-to-all (n/2), sort blocks (O(n^(2/3))), all-to-all (n/2), sort blocks (O(n^(2/3)))
time n + o(n)
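The arithmetic behind these figures can be written out as follows, assuming the flattened exponents on the slide read as (k/2)^2, k = n^(1/3) and O(n^(2/3)). The subscript on the second T is added here for clarity only.

```latex
\begin{align*}
T &= \left(\frac{k}{2}\right)^{2} \cdot n^{1/3}
   = \frac{n^{2/3}}{4} \cdot n^{1/3}
   = \frac{n}{4}, \qquad k = n^{1/3}, \\
T_{\text{total}} &= 3 \cdot O\!\left(n^{2/3}\right) + 2 \cdot \frac{n}{2}
   = n + o(n).
\end{align*}
```

The three block-sorting phases contribute only lower-order terms, so the two all-to-all phases determine the leading term n.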
22
Why optimal?
23
Use of theorem
1.) n keys on a k x k RM: Time > n/k.
Proof: Wherever the data is stored there is always a bisection of length k -- this can be demonstrated by sweeping from left to right through the array. Q.e.d.
2.) n x n keys on an n x n RM: Time > n.
Proof: trivial
24
n + o(n)
Optimal --- but ...
25
enumeration / prefix sum

1 1 1 1 1 1 1 1

time O(log n)
wire efficiency ! -- (compared with tree) 1/2
number of processors
26
ABCD-routing

move and smooth

B
Row-major enumeration of A, B, C and D packets
within each quadrant in time 4 log n. Determine
destination position of each packet.
27
elementary steps
28
time analysis
time 3 x n/2
T = 3n + o(n)
29
T < 2n
mesh-diameter 2n
30
enough of routing/sorting
Constant factor! Can we do better? What kind of problems?
Image processing, sparse problems!
31
Image processing

Border following
Edge detection
Component labeling
Skeletons
Transforms

32
Component labelling
While own label is not received
1.) Candidates break bus and send their label
a) clockwise
b) anti-clockwise
2.) Candidates switch off and restore bus if they see smaller label
Time O(1) -- O(log n)
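The loop can be simulated on a single ring of candidates. This is a sketch under assumptions that go beyond the slide: labels are distinct, the candidates sit on one closed border, and each round costs O(1) bus steps. A candidate survives a round only if its label is smaller than both labels it receives, so at most half survive and the number of rounds is O(log n). The function name is hypothetical.

```python
def label_ring(labels):
    """labels: distinct candidate labels in clockwise order around one border.

    Returns (winning label, number of rounds).
    """
    active = list(range(len(labels)))  # positions that are still candidates
    rounds = 0
    while True:
        rounds += 1
        m = len(active)
        if m == 1:
            # the last candidate receives its own label back: done
            return labels[active[0]], rounds
        survivors = []
        for pos, idx in enumerate(active):
            # labels arriving from the nearest active candidate on each side
            from_one_side = labels[active[(pos - 1) % m]]
            from_other_side = labels[active[(pos + 1) % m]]
            if labels[idx] < from_one_side and labels[idx] < from_other_side:
                survivors.append(idx)  # otherwise: switch off, restore bus
        active = survivors


print(label_ring([5, 3, 8, 1, 7, 2, 6, 4]))
```

For the eight labels above, four candidates survive the first round, one survives the second, and the third round confirms label 1 as the component label.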
33
Transforms

Wavelet transform: time log n on RM -- time n on mesh
FFT: time n on RM and mesh
Hough transform: time m x log n on RM -- time m x n on mesh

34
systolic matrix multiplication
[Figure: systolic array with matrices A, B and C]
time n
35
sparse matrix multiplication
[Figure: A x B = C]
36
unlimited bus length

ring broadcast

[Figure: ring broadcast with labels 1, 2, 3]
37
A row-sparse, B column-sparse
Repeat k times
Begin
  horizontal ring broadcast
  Repeat k times vertical ring broadcast
End.
[Figure: matrices A, B and C; A and B annotated with k]
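The nested loop can be sketched sequentially in Python. The sketch assumes one plausible reading of the two broadcasts, which the slide does not confirm: in round (a, b) processor (i, j) receives the a-th non-zero of row i of A and the b-th non-zero of column j of B, each with its index, and accumulates the product when the indices match. Every function and variable name is hypothetical.

```python
def sparse_multiply(A, B, k):
    """C = A * B for row-sparse A and column-sparse B (at most k non-zeros each).

    Returns (C, number of broadcast rounds).
    """
    n = len(A)
    # (column index, value) of the non-zeros in each row of A
    a_lists = [[(c, v) for c, v in enumerate(row) if v != 0] for row in A]
    # (row index, value) of the non-zeros in each column of B
    b_lists = [[(r, B[r][j]) for r in range(n) if B[r][j] != 0] for j in range(n)]
    assert all(len(lst) <= k for lst in a_lists), "A is not row-sparse for this k"
    assert all(len(lst) <= k for lst in b_lists), "B is not column-sparse for this k"

    C = [[0] * n for _ in range(n)]
    rounds = 0
    for a in range(k):          # horizontal broadcast of the a-th element of A
        for b in range(k):      # vertical broadcast of the b-th element of B
            rounds += 1
            for i in range(n):
                if a >= len(a_lists[i]):
                    continue
                col_a, val_a = a_lists[i][a]
                for j in range(n):
                    if b >= len(b_lists[j]):
                        continue
                    row_b, val_b = b_lists[j][b]
                    if col_a == row_b:  # indices match: a term of C[i][j]
                        C[i][j] += val_a * val_b
    return C, rounds
```

The two inner loops over i and j stand in for the n x n processors working in parallel, so the round count k * k is the quantity that corresponds to parallel time.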
38
lower bound (c,r)
k = 3
n = 48
[Figure: matrices A, B and C]
39
splitting the problem
Repeat k times
Begin
  vertical ring broadcast
  Repeat s times horizontal ring broadcast
End.
[Figure: matrices A, B and C split into parts labelled s and r; k B-elements, first s]
time ks
40
CR
A has nk non-zero elements ⇒ Ar has at most nk/s non-zero rows ⇒ for s = √n, Ar has at most k√n non-zero rows.
As B is a CC-problem ⇒ it takes time k√n.
41
Ar B calculating products
time k^2
42
column sum
[Figure: rows i-1, i and i+1; row i]
time log n
43
routing within columns
44
Reconfigurable architectures

Reconfigurable mesh ?
constant diameter !

No !!!
Physical laws!
45
Physical limits

30 cm/ns
on chip 1 cm/ns
--> bounded bus length

c = 300 000 km/sec
good idea !
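A short calculation shows how these speeds bound the bus length. The clock frequencies are assumptions chosen for illustration and do not come from the slides; the function names are hypothetical.

```python
def speed_cm_per_ns(km_per_sec):
    """Convert a speed from km/sec to cm/ns."""
    return km_per_sec * 1e5 / 1e9  # 1 km = 1e5 cm, 1 sec = 1e9 ns


def max_bus_length_cm(speed_cm_ns, clock_ghz):
    """Longest bus a signal can traverse within one clock cycle."""
    cycle_ns = 1.0 / clock_ghz  # assumed: one bus step must fit in one cycle
    return speed_cm_ns * cycle_ns


print(speed_cm_per_ns(300000))      # speed of light in cm/ns
print(max_bus_length_cm(1.0, 1.0))  # on-chip signal, assumed 1 GHz clock
```

At the on-chip speed of 1 cm/ns and an assumed 1 GHz clock, a signal covers about 1 cm per cycle, so a bus spanning a large mesh cannot be treated as a single-step connection.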
46
bounded broadcast
[Figure: bounded broadcast with labels 1, 2, 3]
time k + n/l
47
creating main stations
[Figure: main stations labelled 1, 2, 3]
time k
48
A row-sparse, B column-sparse
Create main stations 1,...,k for A and B (time n/l + k)
For i = 1,...,k do
Begin
  horizontal ring broadcast i of A
  For j = 1,...,k do vertical ring broadcast j of B
End.
[Figure: matrices A, B and C; A and B annotated with k]
49
A and B column-sparse
Create main stations 1,...,k for A (time n/l + k)
For i = 1,...,k do
Begin
  horizontal ring broadcast i
  k bounded vertical broadcasts of products
  merging new products
End.
50
remove minor stations
[Figure: stations labelled 1, 2, 3]
51
results
Time n (n x n mesh)
A and B column sparse: (k^2) (k^2 + 2n/l)
A and B row sparse: (k^2) (k^2 + 2n/l)
A row sparse, B column sparse: (k^2) (k^2 + n/l)
A column sparse, B row sparse: (11n/l)