Title: The Future of Parallel Computing
1The Future of Parallel Computing
SA ISA PIPS RM OH
Special Purpose Mesh Architectures
Heiko Schröder, 1998
2Contents
- Why meshes?
- Application specific parallel mesh architectures
  - Systolic Arrays
  - Instruction Systolic Arrays
  - PIPS
  - Reconfigurable mesh
  - Optical Highway
3Physical limits
- 10^12 OPS -- 0.3 mm/OP
- 1000 PEs with 10^9 OPS -- 30 cm/OP
- massive parallelism
- distributed memory
c = 300 000 km/sec
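The two distances on this slide follow from the speed of light divided by the operation rate. The sketch below assumes rates of 10^12 OPS for a single processor and 10^9 OPS for each of 1000 PEs; these rates are inferred because they reproduce the 0.3 mm and 30 cm figures, and the function name is purely illustrative.

```python
C_KM_PER_SEC = 300_000  # speed of light as given on the slide


def distance_per_op_mm(ops_per_sec: float) -> float:
    """Distance in mm that light can travel during one operation."""
    c_mm_per_sec = C_KM_PER_SEC * 1_000_000  # km -> mm
    return c_mm_per_sec / ops_per_sec


# Assumed rates (inferred, not stated explicitly on the slide).
single_processor = distance_per_op_mm(1e12)  # about 0.3 mm
one_of_1000_pes = distance_per_op_mm(1e9)    # 300 mm, i.e. 30 cm

print(single_processor, one_of_1000_pes)
```

With the same total throughput, each of the 1000 slower PEs can reach a thousand times further per operation, which is the argument for massive parallelism with distributed memory.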
4Processor power
5Scaling
- Factor 2
- 1/2 width
- 1/2 height
- 1/2 switching time
0.5 µ
8 x performance!
0.25 µ
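One way to read the "8 x performance" figure: halving width and height fits four times as many transistors into the same area, and halving the switching time doubles the speed of each. The sketch below only spells out that arithmetic; the decomposition into a density gain and a speed gain is an assumption about how the slide arrives at 8.

```python
def scaling_gain(shrink: float = 2.0) -> float:
    """Performance gain when all feature sizes shrink by `shrink`."""
    density_gain = shrink * shrink  # 1/shrink width x 1/shrink height
    speed_gain = shrink             # 1/shrink switching time
    return density_gain * speed_gain


print(scaling_gain(2.0))  # step from 0.5 µ to 0.25 µ
```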
6CMOS transistors
[Chart: size of minimal transistor, 10 µm down to 0.01 µm (ca. 0.03 µm marked), over the years 1960 to 2030]
7Mesh/Torus
8Hypercube
9VLSI
- Very
- Large
- Scale
- Integration
- simple cells
- few types
- regular architecture
- short connections
- mesh -- torus
10Pin limitations
11Bisection width
12Programming
- SA --- Systolic Array
- SIMD --- Single Instruction Multiple Data
- ISA --- Instruction Systolic Array
- MIMD --- Multiple Instruction Multiple Data
13parallel merge
- initial situation
- 1.) sort columns
- (odd-even-transposition sort)
- 2.) sort rows
- (odd-even-transposition sort)
- sorted!
x1 x2 x3 x4 x5 x6 x7 ... x17 x18
y1 y2 y3 y4 y5 y6 y7 ... y17 y18
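The two-step merge (sort columns, then sort rows) can be tried out in a few lines. This is a minimal sketch, not the layout from the slide: it assumes the x sequence fills the top rows left to right and the y sequence fills the rows below right to left, and that both lengths are multiples of the column count. The names `mesh_merge` and `cols` are illustrative.

```python
def odd_even_transposition_sort(values):
    """Sort a list with n rounds of neighbour compare-exchange steps."""
    a = list(values)
    n = len(a)
    for round_no in range(n):
        # Even rounds compare pairs (0,1),(2,3),...; odd rounds (1,2),(3,4),...
        for i in range(round_no % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


def mesh_merge(xs, ys, cols):
    """Merge two sorted lists on a mesh with `cols` columns.

    Assumed layout (not taken from the slides): xs fills the top rows
    left to right, ys fills the rows below right to left.
    """
    if len(xs) % cols or len(ys) % cols:
        raise ValueError("both lengths must be multiples of cols")
    rows = [xs[i:i + cols] for i in range(0, len(xs), cols)]
    rows += [ys[i:i + cols][::-1] for i in range(0, len(ys), cols)]
    # Step 1: sort every column.
    columns = [odd_even_transposition_sort(col) for col in zip(*rows)]
    rows = [list(row) for row in zip(*columns)]
    # Step 2: sort every row.
    rows = [odd_even_transposition_sort(row) for row in rows]
    # Read the mesh out in row-major order.
    return [v for row in rows for v in row]


print(mesh_merge([1, 4, 5, 8, 9, 12], [2, 3, 6, 7, 10, 11], cols=3))
```

With this layout the column sort leaves at most one unsorted row for any 0-1 input, so the row sort finishes the merge.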
14 0-1 principle
- The 0-1 principle states that if a sorter sorts all sequences of 0s and 1s correctly, then it is a correct sorter.
- The sorter must be based on moving data.
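The principle makes exhaustive testing practical: a compare-exchange network on n wires needs only its 2^n inputs of 0s and 1s checked. The sketch below does this for odd-even transposition sort; the function names and the representation of a network as a list of wire pairs are illustrative choices.

```python
from itertools import product


def oets_network(n, rounds):
    """Comparator pairs of odd-even transposition sort on n wires."""
    return [(i, i + 1)
            for r in range(rounds)
            for i in range(r % 2, n - 1, 2)]


def apply_network(network, values):
    """Run the compare-exchange steps; data only moves, it is never inspected otherwise."""
    a = list(values)
    for i, j in network:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a


def sorts_all_zero_one_inputs(network, n):
    """True if the network sorts every 0-1 sequence of length n."""
    return all(apply_network(network, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))


print(sorts_all_zero_one_inputs(oets_network(5, 5), 5))  # n rounds
print(sorts_all_zero_one_inputs(oets_network(5, 3), 5))  # too few rounds
```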
15MIMD-mesh (clocked)
min
max
Time 2n
16systolic merge
37Characteristics of SAs
- Extremely high cost-performance
- no flexibility -- long development time
- Suitable for special signal processing tasks?
38Systolic architectures I
39Systolic architectures II
40ISA merge
71Hough transform on the ISA
- good line detection method
Fast tomography
72robot vision
projector
CCD
CCD
73Use of the ISA
Special features
- fast aggregate functions (sum, carry)
- fast local communication
- no local memory
- typical improvement over PC: factor 20-30
Areas of application for ISA
- automatic optical quality control
- real time signal processing
- computer graphics / visualization
- linear equations
- Cryptography --> Tele-medicine?
74Instruction Systolic Array
75PIPS (1990-94)
- 32x32 torus
- 16 bit parallel communication
- 16 bit add
- prefetch
1 M bit
1 M bit
memory control
BHP -- CSIRO -- NU -- ADFA 1.4 M
76Special features
- local memory
- SIMD-torus
- memory pre-fetch
Applications
- visualization
- 3D-simulation (CFD, FEM)
78PIPS
79Use in industry ?
[Chart: performance in Gflops, Research vs. Industry, 1993 to 1996; data labels as extracted: 3675, 2121, 1327, 1168, 648, 693, 248, 126]
80Investments
[Chart: investments into parallel computers (M), Research vs. Industry, 1993 to 1996; axis 0 to 3500]
81Concentration
[Chart: number of manufacturers, 1993 to 1996; data labels as extracted: 49, 21, 19, 11]
82Degree of Parallelism
[Chart: number of new systems (axis 0 to 450) by degree of parallelism (1 to 63, 64 to 255, 256 to 1023, 1024 and more), half-yearly from May-93 to Nov-97]
83Evaluation
Cost computation time
- Parallel computers with standard components
- Embedded parallel systems
84reconfigurable mesh
reconfigurable mesh = mesh + interior connections
low cost
15 positions
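One common reading of "15 positions" is that each processor has four ports (N, E, S, W) and its switch can fuse them in any grouping, which gives one position per way of partitioning the four ports. The slides do not spell this out, so the port model below is an assumption; the sketch simply enumerates the groupings.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn ...
        for i, group in enumerate(partition):
            yield partition[:i] + [[first] + group] + partition[i + 1:]
        # ... or give it a group of its own.
        yield [[first]] + partition


PORTS = ["N", "E", "S", "W"]  # assumed port names
positions = list(set_partitions(PORTS))

print(len(positions))
for p in positions:
    print(p)
```

Groups of size one are unconnected ports; a single group of four fuses all ports into one bus segment.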
85global OR and modulo 3
log n on EREW-PRAM
log n / log log n on CRCW-PRAM
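A common way to picture why these two functions are cheap on a reconfigurable mesh is to follow a signal along a bus whose shape depends on the input bits. The sketch below is a sequential simulation under idealized assumptions that are not taken from the slides: one shared bus for the OR, and three bus rows for the modulo count, where a column holding a 1 shifts the signal one row down (wrapping from row 2 to row 0) and a column holding a 0 passes it straight through. On the mesh the switches are set simultaneously and the signal then propagates in one step.

```python
def global_or(bits):
    """All PEs share one bus; any PE holding a 1 drives it."""
    bus = 0
    for b in bits:  # conceptually simultaneous writes
        if b:
            bus = 1
    return bus


def count_mod3(bits):
    """Row on which a signal injected in row 0 leaves the last column."""
    row = 0
    for b in bits:  # one column of switches per input bit
        if b:
            row = (row + 1) % 3  # assumed wrap-around connection
    return row


bits = [1, 0, 1, 1, 0, 1, 1]
print(global_or(bits), count_mod3(bits))
```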
86sorting with all-to-all mapping
Sorting
- sort blocks
- all-to-all (columns)
- sort blocks
- all-to-all (rows)
- o-e-sort blocks
87all-to-all mapping
n x n
88vertical all-to-all
89horizontal all-to-all
90 1 step
(k/2)^2 steps
2 steps
3 steps
3 steps
2 steps
1 step
91sorting in optimal time
- (k/2)^2 steps
- k = n^{1/3}
- each step takes n^{1/3} time
- --> T = n/4
Sorting
- sort blocks (O(n^{2/3}))
- all-to-all (n/2)
- sort blocks (O(n^{2/3}))
- all-to-all (n/2)
- sort blocks (O(n^{2/3}))
time: n + o(n)
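The time bound on this slide can be checked with a short calculation. The sketch assumes that n is a perfect cube, that k is the cube root of n, and that each all-to-all phase consists of one vertical and one horizontal pass of n/4 each; the last point is a reading of the preceding slides rather than something they state outright. The block sorts contribute only the lower-order term and are left out.

```python
def pass_time(n: int) -> float:
    """One all-to-all pass: (k/2)^2 steps, each taking n^(1/3) time."""
    k = round(n ** (1 / 3))  # k = n^(1/3), n assumed to be a perfect cube
    steps = (k / 2) ** 2
    return steps * k         # equals n / 4


def sorting_leading_term(n: int) -> float:
    """Leading term of the sorting time, ignoring the O(n^(2/3)) block sorts."""
    one_all_to_all = 2 * pass_time(n)  # assumed: vertical + horizontal pass
    return 2 * one_all_to_all          # the algorithm uses two all-to-all phases


n = 4096  # 16^3
print(pass_time(n), sorting_leading_term(n))
```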
92Reconfigurable mesh
Special features
- SIMD
- constant diameter
- faster than PRAM?
Suitable applications
- routing / sorting / load balancing
- sparse matrix multiplication
- segmentation / component labeling
- feature extraction
- image database?
93Reconfigurable mesh
94Optical Highway
All-to-all connection
W1 P100 W100 P22
96Features of optically connected meshes
- SIMD/SPMD/MIMD
- implement all major architectures
- all-to-all communication in 2 steps
- Bulk synchronous processing (BSP)
- no latency hiding
- no pin-limitation
Applications
- coarse grain parallel computing only?
- ray-tracing?
- ???
97Optical Highway
1. H. Schröder et al., RMB --- A Reconfigurable Multiple Bus Network, HPCA 96, San Jose, 1996
2. H. Schröder, O. Sykora, I. Vrto, Optical All-to-All Communication for some Product Graphs, SOFSEM '97, Milovy, Czech Republic, 1997
98Bisection-width / Diameter
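For comparison, the textbook values of diameter and bisection width for the three topologies shown earlier can be tabulated with a few lines. The sketch assumes n processors, a square mesh or torus with an even side length, and a hypercube whose size is a power of two; the function names are illustrative.

```python
import math


def mesh(n):
    """Square mesh without wrap-around links."""
    side = math.isqrt(n)
    return {"diameter": 2 * (side - 1), "bisection_width": side}


def torus(n):
    """Square mesh with wrap-around links in both dimensions."""
    side = math.isqrt(n)
    return {"diameter": 2 * (side // 2), "bisection_width": 2 * side}


def hypercube(n):
    """Hypercube with n = 2^d nodes."""
    dim = n.bit_length() - 1
    return {"diameter": dim, "bisection_width": n // 2}


n = 1024
for name, topology in (("mesh", mesh), ("torus", torus), ("hypercube", hypercube)):
    print(name, topology(n))
```

The hypercube has logarithmic diameter and a bisection width linear in n, while mesh and torus are limited to the order of the square root of n in both measures.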
99Suitable problems ?
diameter log n, bisection width n
- SA: suitable applications?
- ISA: 2D-problems, aggregate functions, local communication
- PIPS: 3D-problems, local communication
- RM: diameter-bound > bisection-width-bound
- OH: PRAM equivalent?
100?
?
?
?