Heiko Schr - PowerPoint PPT Presentation

About This Presentation
Title:

Heiko Schr

Description:

reconfigurable networks (Transputers, PVM) dynamically ... Mesh/Torus. Diameter ( ) bisection width ( ) 2D mesh. Heiko Schröder, 1998. Reconfigurable mesh 6 ...

Slides: 53
Provided by: professor50
Learn more at: http://web.cecs.pdx.edu

Transcript and Presenter's Notes


1
Routing? Sorting? Image Processing? Sparse Matrices?
Reconfigurable Meshes!
Heiko Schröder, 1998
2
Reconfigurable architectures
  • FPGAs
  • reconfigurable multibus
  • reconfigurable networks (Transputers, PVM)
  • dynamically reconfigurable mesh
  • Aim
  • efficiency
  • special purpose --> general purpose architectures

3
contents
  • 1.) Motivation for the reconfigurable mesh
  • 2.) Routing (and sorting)
  • better than PRAM
  • better than mesh
  • 3.) Image processing
  • 4.) Sparse matrix multiplication
  • 5.) Bounded bus length

4
PRAM
[Figure: two rows of nodes numbered 0 to 9]
diameter O(1), bisection width Θ(n)
EREW CRCW
5
Mesh/Torus
6
Hypercube
diameter O(log n), bisection width Θ(n)
7
reconfigurable mesh
reconfigurable mesh = mesh + interior connections
15 positions
diameter 1 !!
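A plausible reading of the "15 positions" is that each processor can join its four ports into groups in 15 different ways, which is the number of set partitions of a four-element set. The sketch below enumerates them; the port names N, E, S, W and the function name are assumptions made for illustration, not taken from the slides.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a group of its own.
        yield [[first]] + partition


PORTS = ["N", "E", "S", "W"]  # hypothetical port labels
positions = list(set_partitions(PORTS))

for groups in positions:
    # Ports in the same group are connected to each other inside the processor.
    print(groups)
print(len(positions))  # 15
```

A partition such as `[['N', 'S'], ['E', 'W']]` lets two buses cross, while `[['N', 'E', 'S', 'W']]` fuses all four ports into one bus.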
8
global OR
Time O(1) on RM -- Ω(log n) on EREW-PRAM
9
Prefix sum
0 1 1 0 1 0 0 1 1 1
Time O(1), Area Θ(n × n)
Fast but expensive
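The constant-time prefix sum is usually explained as a "staircase" bus: a signal enters at the top-left corner, and every column holding a 1 shifts the bus down by one row, so the row on which the signal leaves column j equals the number of 1s in columns 0..j. The sketch below traces that path sequentially, assuming this staircase construction is what the slide depicts; on the mesh itself all switches are set at once and the signal travels in one step.

```python
def prefix_sums_on_rm(bits):
    """Trace the staircase bus through an (n+1) x n array, column by column.

    Returns, for each column, the row on which the signal leaves it.
    """
    row = 0  # the signal is injected on row 0
    exit_rows = []
    for bit in bits:
        if bit == 1:
            row += 1  # a column holding a 1 steps the bus down one row
        exit_rows.append(row)
    # n + 1 rows are enough because the bus steps down at most n times.
    assert row <= len(bits)
    return exit_rows


bits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]  # the bit pattern from the slide
print(prefix_sums_on_rm(bits))  # [0, 1, 2, 2, 3, 3, 3, 4, 5, 6]
```

The array needs one row per possible sum, which is why the area grows with n × n even though the time stays constant.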
10
Modulo 3 counter
Time O(1) on RM, Ω(log n / log log n) on CRCW-PRAM
11
modulo k² counter (ranking)
  • Two-digit numbers in base k represent all numbers smaller than k².
  • 1.) determine x mod k (lsd)
  • 2.) count number of wraps (msd).

--> modulo k² counting in 2 steps on a k × k² array
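The two-digit idea can be sketched as follows: a bus that wraps around after k rows yields the least significant digit (x mod k), and the number of wraps, itself taken mod k, yields the most significant digit. This is a sequential illustration under the assumption that "wraps" means passes of the bus from the last row back to the first; the slides do not spell out how the wraps are counted on the array, and the function name is hypothetical.

```python
def count_mod_k_squared(bits, k):
    """Return (msd, lsd) of the number of 1s in `bits`, taken modulo k*k."""
    row = 0    # current bus row, always in 0..k-1
    wraps = 0  # how often the bus wrapped from row k-1 back to row 0
    for bit in bits:
        if bit:
            row += 1
            if row == k:
                row = 0
                wraps += 1
    lsd = row          # step 1: x mod k
    msd = wraps % k    # step 2: number of wraps, itself counted mod k
    return msd, lsd


msd, lsd = count_mod_k_squared([1] * 7, k=3)
print(msd, lsd)        # 2 1
print(msd * 3 + lsd)   # 7
```

With ten 1s and k = 3 the result is (0, 1), since 10 mod 9 = 1.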
12
enumeration / prefix sum
  • 1 1 1 1 1 1 1 1

time O(log n)
wire efficiency ! -- (compared with tree) 1/2
number of processors
13
permutation routing - 2 steps
2 steps !!!
14
Kunde's all-to-all mapping
Sorting: sort blocks, all-to-all (columns), sort blocks, all-to-all (rows), o-e-sort blocks
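As a rough illustration of the data movement only, an all-to-all mapping can be modelled as every block cutting its contents into k equal pieces and sending piece j to block j. The sketch below works on a flat list of blocks and is an assumption about the general shape of the mapping, not a reproduction of the exact layout used on the mesh.

```python
def all_to_all(blocks):
    """Send piece j of every block to block j (a block-level transpose)."""
    k = len(blocks)
    size = len(blocks[0])
    assert all(len(b) == size for b in blocks), "blocks must have equal size"
    assert size % k == 0, "block size must be divisible by the number of blocks"
    piece = size // k
    result = []
    for j in range(k):
        new_block = []
        for i in range(k):
            # Block j receives the j-th piece of block i.
            new_block.extend(blocks[i][j * piece:(j + 1) * piece])
        result.append(new_block)
    return result


blocks = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(all_to_all(blocks))              # [[0, 1, 4, 5], [2, 3, 6, 7]]
print(all_to_all(all_to_all(blocks)))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

After one mapping every block holds an equal share from every other block, which is what lets the following block sort see a sample of the whole input.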
15
sorting in constant time
block
Sort blocks
Complete sort: sort blocks, all-to-all (2), sort blocks, all-to-all (2), o-e-sort blocks
16
  • better than PRAM --- but useless!!!

17
Kunde's all-to-all mapping
n x n
18
vertical all-to-all
19
horizontal all-to-all
20
Use of bus
1 step
(k/2)² steps
2 steps
3 steps
3 steps
2 steps
1 step
21
sorting in optimal time Kunde / Schröder
  • (k/2)² steps
  • k = n^(1/3)
  • each step takes n^(1/3) time
  • --> T = n/4

Sorting: sort blocks (O(n^(2/3))), all-to-all (n/2), sort blocks (O(n^(2/3))), all-to-all (n/2), sort blocks (O(n^(2/3))); time n + o(n)
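The arithmetic behind these figures can be written out as below. This is a reconstruction that reads the slide's garbled terms as k = n^{1/3} and n + o(n); the subscripted names are labels introduced here, not the author's notation.

```latex
\begin{align*}
T_{\text{steps}} &= \left(\frac{k}{2}\right)^{2} \cdot n^{1/3}
  = \frac{k^{2}\, n^{1/3}}{4}
  = \frac{n^{2/3}\, n^{1/3}}{4}
  = \frac{n}{4}
  && \text{with } k = n^{1/3}, \\
T_{\text{sort}} &= 3 \cdot O\!\left(n^{2/3}\right) + 2 \cdot \frac{n}{2}
  = n + o(n).
\end{align*}
```

The three block sorts contribute only lower-order terms, so the two all-to-all phases dominate the running time.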
22
Why optimal?
23
Use of theorem
1.) n keys on a k × k RM: Time > n/k.
Proof: Wherever the data is stored, there is always a bisection of length k; this can be demonstrated by sweeping from left to right through the array. Q.e.d.
2.) n × n keys on an n × n RM: Time > n.
Proof: trivial.
24
n + o(n)
Optimal --- but ...
25
enumeration / prefix sum
  • 1 1 1 1 1 1 1 1

time O(log n)
wire efficiency ! -- (compared with tree) 1/2
number of processors
26
ABCD-routing
  • move and smooth

B
Row-major enumeration of A, B, C and D packets
within each quadrant in time 4 log n. Determine
destination position of each packet.
27
elementary steps
28
time analysis
time 3 × n/2, T = 3n + o(n)
29
T < 2n
mesh diameter = 2n
30
enough of routing/sorting
Constant factor! Can we do better? What kind of problems? Image processing, sparse problems!
31
Image processing
  • Border following
  • Edge detection
  • Component labeling
  • Skeletons
  • Transforms

32
Component labelling
While own label is not received:
1.) Candidates break the bus and send their label a) clockwise b) anti-clockwise
2.) Candidates switch off and restore the bus if they see a smaller label
Time O(1) -- O(log n)
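The loop above can be simulated on a single ring of candidates. The sketch assumes distinct labels and models one bus cycle per round: each active candidate sees the labels of its nearest active neighbours in both directions and switches off if either is smaller. The function name and the list-based ring are illustrative assumptions.

```python
def elect_min_label(labels):
    """Simulate the candidate elimination on one ring.

    `labels` are distinct candidate labels in ring order.
    Returns (winning_label, number_of_rounds).
    """
    active = list(labels)
    rounds = 0
    while True:
        rounds += 1
        m = len(active)
        if m == 1:
            # The only remaining candidate receives its own label back.
            return active[0], rounds
        survivors = []
        for i, label in enumerate(active):
            clockwise = active[(i + 1) % m]       # label arriving from one side
            anticlockwise = active[(i - 1) % m]   # label arriving from the other
            if label < clockwise and label < anticlockwise:
                survivors.append(label)           # stays a candidate
            # otherwise: switch off and restore the bus
        active = survivors


print(elect_min_label([5, 3, 8, 1, 9, 2, 7, 4]))  # (1, 3)
```

No two neighbouring candidates can both survive a round, so at least half of them drop out each time, which matches the O(log n) worst case; a ring with a single candidate finishes in one round.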
33
Transforms
  • Wavelet transform: time log n on RM -- time n on mesh
  • FFT: time n on RM and mesh
  • Hough transform: time m × log n on RM -- time m × n on mesh

34
systolic matrix multiplication
B
time n
A
C
35
sparse matrix multiplication
x

A
B
C
36
unlimited bus length
  • ring broadcast

1 2
3
37
A row-sparse, B column-sparse
Repeat k times
Begin
  horizontal ring broadcast
  Repeat k times
    vertical ring broadcast
End.
k
B
k
A
C
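The nested broadcast loop can be sketched sequentially as follows, assuming that "row-sparse" means at most k non-zero entries per row of A and "column-sparse" means at most k non-zero entries per column of B, and that each broadcast carries a value together with its index. Processor (i, j) accumulates a product whenever the two indices match. Names and data layout are hypothetical.

```python
def sparse_multiply(A, B, k):
    """Multiply row-sparse A by column-sparse B in k*k broadcast steps."""
    n = len(A)
    # Non-zero entries as (index, value): per row of A and per column of B.
    a_rows = [[(c, A[i][c]) for c in range(n) if A[i][c] != 0] for i in range(n)]
    b_cols = [[(r, B[r][j]) for r in range(n) if B[r][j] != 0] for j in range(n)]
    assert all(len(row) <= k for row in a_rows), "A is not row-sparse for this k"
    assert all(len(col) <= k for col in b_cols), "B is not column-sparse for this k"

    C = [[0] * n for _ in range(n)]
    steps = 0
    for p in range(k):        # horizontal ring broadcast of the p-th A element
        for q in range(k):    # vertical ring broadcast of the q-th B element
            steps += 1
            for i in range(n):
                if p >= len(a_rows[i]):
                    continue
                a_index, a_value = a_rows[i][p]
                for j in range(n):
                    if q >= len(b_cols[j]):
                        continue
                    b_index, b_value = b_cols[j][q]
                    if a_index == b_index:   # processor (i, j) multiplies
                        C[i][j] += a_value * b_value
    return C, steps


A = [[1, 0, 2], [0, 3, 0], [4, 0, 0]]
B = [[0, 1, 0], [2, 0, 0], [0, 3, 5]]
print(sparse_multiply(A, B, k=2))  # ([[0, 7, 10], [6, 0, 0], [0, 4, 0]], 4)
```

The two inner loops over i and j stand in for work that all processors do simultaneously, so only the k × k broadcast steps count towards the running time.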
38
lower bound (c,r)
k = 3
n = 48
B
A
C
39
splitting the problem
Repeat k times
Begin
  vertical ring broadcast
  Repeat s times
    horizontal ring broadcast
End.
[Figure: matrices A, B and C split into parts labelled s and r; labels: k B-elements, first s]
time ks
40
CR
A has nk non-zero elements ⇒ Ar has at most nk/s non-zero rows ⇒ for s = √n, Ar has at most k√n non-zero rows. As B is a CC-problem ⇒ it takes time k√n.
41
Ar B calculating products
time k²
42
column sum
i-1
i
i+1
row i
time log n
43
routing within columns
44
Reconfigurable architectures
  • Reconfigurable mesh ?
  • constant diameter !

No !!!
Physical laws!
45
Physical limits
  • 30cm/ns
  • on chip 1cm/ns
  • --> bounded bus length

c = 300 000 km/sec
good idea !
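The 30 cm/ns figure follows directly from the speed of light, and the same arithmetic gives a rough upper bound on how far a bus signal can travel in one cycle. The sketch below checks the conversion; the cycle time passed to the second function is a hypothetical parameter, not a value from the slides.

```python
C_KM_PER_S = 300_000  # speed of light as quoted on the slide


def km_per_s_to_cm_per_ns(v_km_per_s):
    """Convert km/s to cm/ns (1 km = 1e5 cm, 1 s = 1e9 ns)."""
    return v_km_per_s * 1e5 / 1e9


def max_bus_length_cm(speed_cm_per_ns, cycle_ns):
    """Longest bus a signal can traverse within one cycle."""
    return speed_cm_per_ns * cycle_ns


print(km_per_s_to_cm_per_ns(C_KM_PER_S))  # 30.0
print(max_bus_length_cm(1.0, 2.0))        # 2.0, on-chip speed, assumed 2 ns cycle
```

With the on-chip speed of 1 cm/ns from the slide, a bus spanning many processors cannot be crossed in a single short cycle, which is the motivation for bounding the bus length.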
46
bounded broadcast
1 2
3
time k + n/l
47
creating main stations
[Figure: main stations 1, 2, 3 replicated along rows of 1s, 2s and 3s]
time k
48
A row-sparse, B column-sparse
Create main stations 1,...,k for A and B (time n/l + k)
For i = 1,...,k do
Begin
  horizontal ring broadcast i of A
  For j = 1,...,k do
    vertical ring broadcast j of B
End.
k
B
k
A
C
49
A and B column-sparse
Create main stations 1,...,k for A (time n/l + k)
For i = 1,...,k do
Begin
  horizontal ring broadcast i
  k bounded vertical broadcasts of products
  merging new products
End.
50
remove minor stations
1 2
3
51
results
Time n (n × n mesh)
A and B column sparse: (k²) (k² + 2n/l)
A and B row sparse: (k²) (k² + 2n/l)
A row sparse, B column sparse: (k²) (k² + n/l)
A column sparse, B row sparse: (11n/l)
  • image processing
  • sorting
  • routing
  • load balancing

52
  • The RM is in some cases better than PRAM
  • The RM is always at least as good as mesh
  • The RM is often better than the mesh

The End