Title: Membrane Computing in the Connex Environment
1 Membrane Computing in the Connex Environment
Gheorghe Stefan
BrightScale Inc., Sunnyvale, CA
Politehnica University of Bucharest
gstefan_at_brightscale.com
2 Outline
- Integral Parallel Architecture
- The Connex Chip
- The Connex Architecture
- How to Use the Connex Environment
- Concluding Remarks
3 Integral Parallel Architecture
- The Ubiquitousness of Parallelism Asks for Integral Parallel Architectures
- Partial Recursive Functions and Parallel Computation
- A Functional Taxonomy of Parallel Computation
4 Parallelism cannot be avoided anymore
- Intel's approach
- Multi-processors
- The best approach for multi-threading on MIMD architecture
- Inefficient on SIMD architecture
- Ignores the MISD architecture
- Many-processors asking for another taxonomy
- They work as accelerators
- They perform critical functions
- Berkeley's 13 dwarfs is a functional approach for many-processors
- Real applications ask for all kinds of parallelism to solve corner cases, the places where the devil hides
5 Partial Recursive Functions and Parallel Computation
- Composition Rule: the Basic Parallel Structures
- Primitive Recursive Rule
- Minimalization Rule
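6 Composition: the Associated Structure
- $f(x_0, \dots, x_{n-1}) = g(h_0(x_0, \dots, x_{n-1}), h_1(x_0, \dots, x_{n-1}), \dots, h_{m-1}(x_0, \dots, x_{n-1}))$
[Figure: the inputs $x_0, \dots, x_{n-1}$ feed the blocks $h_0, h_1, \dots, h_{m-1}$ in parallel; their outputs feed $g$, which delivers $f(x_0, \dots, x_{n-1}) = g(h_0, h_1, \dots, h_{m-1})$.]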
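The composition rule can be illustrated with a minimal Python sketch. The helper name `compose` and the example functions are hypothetical, chosen only for illustration; on parallel hardware the $h_i$ would be evaluated at the same time, while here they run one after another.

```python
def compose(g, hs):
    """Return f with f(x_0, ..., x_{n-1}) = g(h_0(...), ..., h_{m-1}(...))."""
    def f(*xs):
        # Every h_i sees the same inputs; the m evaluations are independent.
        return g(*(h(*xs) for h in hs))
    return f


# Hypothetical example: f(x, y) = (x + y) * (x - y)
f = compose(lambda a, b: a * b,
            [lambda x, y: x + y, lambda x, y: x - y])
result = f(5, 3)
```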
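7 Data Parallel Composition
- $X = \{x_0, \dots, x_{n-1}\} \rightarrow \{h(x_0), h(x_1), \dots, h(x_{n-1})\}$
[Figure: each input $x_0, x_1, \dots, x_{n-1}$ feeds its own copy of the block $h$, producing $h(x_0), h(x_1), \dots, h(x_{n-1})$.]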
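As a rough illustration, data parallel composition is a map of one function over a vector. The sketch below uses a plain Python list as the vector; `data_parallel` is a hypothetical name, and the sequential list comprehension only stands in for what a SIMD-like machine would do in a single step.

```python
def data_parallel(h, xs):
    """Apply the same function h to every component of the vector xs."""
    return [h(x) for x in xs]


squares = data_parallel(lambda x: x * x, [0, 1, 2, 3])
```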
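8 Speculative Composition
- function vector $H = \{h_0, h_1, \dots, h_{n-1}\}$, scalar $x$
- $H(x) = \{h_0(x), h_1(x), \dots, h_{n-1}(x)\}$
[Figure: the scalar $x$ is broadcast to the blocks $h_0, h_1, \dots, h_{n-1}$, which produce $h_0(x), h_1(x), \dots, h_{n-1}(x)$.]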
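Speculative composition is the dual of the data parallel case: many functions, one operand. A minimal sketch, assuming Python callables as the function vector (`speculative` and the three example functions are illustrative names only):

```python
def speculative(hs, x):
    """Apply every function of the vector hs to the same scalar x."""
    return [h(x) for h in hs]


candidates = speculative(
    [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x], 3)
```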
9 Serial Composition
Time parallelism
The general case: $f(x) = g_1(g_2(g_3(\dots g_p(x) \dots)))$
[Figure: $x$ enters the block $h$, whose output feeds the block $g$, giving $f(x) = g(h(x))$.]
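The general case can be sketched as a chain of stages. The helper `serial` is a hypothetical name; the sketch only shows the functional result for one input, whereas time parallelism on a real pipe comes from having different stages work on different elements of an input stream at the same moment.

```python
from functools import reduce


def serial(gs):
    """gs = [g_1, ..., g_p]; return f with f(x) = g_1(g_2(... g_p(x) ...))."""
    def f(x):
        # g_p is applied first, g_1 last, so the list is walked in reverse.
        return reduce(lambda acc, g: g(acc), reversed(gs), x)
    return f


# Hypothetical two-stage example: f(x) = (2 * x) + 1
f = serial([lambda v: v + 1, lambda v: 2 * v])
stream_out = [f(x) for x in [0, 1, 5]]
```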
10 Reduction Composition
[Figure: the inputs $x_0, x_1, \dots, x_{m-1}$ feed a single block $g$, which delivers the scalar $g(x_0, x_1, \dots, x_{m-1})$.]
11 Primitive recursive rule
- $f(x,y) = h(x, f(x, y-1))$, where $f(x,0) = g(x)$
- $f(x,y) = h(x, h(x, h(x, \dots h(x, g(x)) \dots )))$
- Parallel solution makes sense only if the function must be computed many times.
- Implementations
- Data parallel composition
- Loop in a serial composition
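The unrolled form of the rule can be sketched as a loop that applies $h$ exactly $y$ times on top of $g(x)$. The helper name and the multiplication example are assumptions made for illustration; the final list comprehension mimics the data parallel implementation, where the same $f$ is evaluated for many values of $y$.

```python
def primitive_recursion(g, h):
    """Return f with f(x, 0) = g(x) and f(x, y) = h(x, f(x, y - 1))."""
    def f(x, y):
        acc = g(x)
        for _ in range(y):        # y nested applications of h
            acc = h(x, acc)
        return acc
    return f


# Hypothetical example: multiplication as repeated addition.
mul = primitive_recursion(lambda x: 0, lambda x, prev: x + prev)
products = [mul(3, y) for y in range(5)]   # one f(x, y_i) per "cell"
```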
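12 Data Parallel Composition for the Primitive Recursive Rule
- $x, Y = \{y_0, \dots, y_{n-1}\} \rightarrow \{f(x,y_0), f(x,y_1), \dots, f(x,y_{n-1})\}$
[Figure: the pairs $(x, y_0), (x, y_1), \dots, (x, y_{n-1})$ each feed their own block $h$, producing $f(x, y_0), f(x, y_1), \dots, f(x, y_{n-1})$.]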
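13 Serial Composition for the Primitive Recursive Rule
- $x, \langle Y \rangle = \langle y_0, \dots, y_{n-1} \rangle \rightarrow \langle F \rangle = \langle f(x,y_0), f(x,y_1), \dots, f(x,y_{p-1}) \rangle$
[Figure: $x$ and the stream $\langle Y \rangle$ enter a chain of blocks $h$; a selector (sel) picks the stage output that forms the stream $\langle F \rangle$.]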
14 Minimalization rule
- $f(x) = \min(y)[m(x,y) = 0]$
- Implementations
- Speculative composition and reduction composition
- Serial composition and reduction composition
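15 Speculative Composition and Reduction Composition for Minimalization
[Figure: $x$ is broadcast to $n$ blocks computing $m(x,0), m(x,1), \dots, m(x,n-1)$; the pairs $(m(x,0), 0), \dots, (m(x,n-1), n-1)$ feed a reduction block "first 0", which returns the index $i$, so that $f(x) = i$.]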
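The two-step structure (speculate on all candidates, then reduce to the first zero) can be sketched as follows. The bound `n` on the search range, the name `minimalize` and the square-root example are assumptions for illustration; the true minimalization rule is unbounded and may not terminate.

```python
def minimalize(m, x, n):
    """Smallest y in [0, n) with m(x, y) == 0, or None if there is none."""
    # Speculative composition: evaluate m(x, 0), ..., m(x, n-1).
    # On parallel hardware these n evaluations would happen together.
    pairs = [(m(x, y), y) for y in range(n)]
    # Reduction composition ("first 0"): keep the smallest index with value 0.
    zeros = [y for value, y in pairs if value == 0]
    return min(zeros) if zeros else None


# Hypothetical example: integer square root of a perfect square.
root = minimalize(lambda x, y: y * y - x, 49, 16)
missing = minimalize(lambda x, y: y * y - x, 50, 16)
```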
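16 Serial Composition and Reduction Composition for Minimalization
[Figure: a pipe with stages $P_{i-5}, P_{i-4}, P_{i-3}, P_{i-2}, P_{i-1}, P_i$ ($P_i$ is the i-th pipe stage); a selection code chooses among $y_{i-1}, y_{i-2}, \dots, y_{i-s}$ to form $y_i$. Example of dynamic reconfiguration.]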
17 Functional Taxonomy of Parallel Computation
- Data Parallel Computation uses SIMD-like machines
- Time Parallel Computation is a very special sort of MIMD, used to compute only one function
- Speculative Computation is an MISD machine, completely ignored by actual implementations
18 Integral Parallel Architecture
- An Integral Parallel Architecture (IPA) uses all kinds of parallelism to build a real machine, in two versions:
- complex IPA: all types of parallel mechanisms tightly interleaved on the same physical structure (pipelined, superscalar, speculative general-purpose processors)
- intensive IPA: all types of parallel mechanisms highly separated, implemented on specific physical structures (accelerators for embedded computation in a SoC approach)
19 Intensive IPA
- Intensive IPAs are used as accelerators for complex IPAs
- Monolithic intensive IPA: the same machine works in two modes
- Data parallel
- Time parallel
- Segregated intensive IPA: two distinct machines are used, one for data parallel computation and the other for time parallel (i.e. speculative) computation
20 The Connex Chip
- The organization of BA1024
- multi-core area of 4 MIPS
- many-core data parallel area of 1024 simple PEs
- speculative time parallel pipe of 8 PEs
- interfaces (DDR, PCI, video and audio interfaces for 2 HDTV channels)
21 The Connex System
- Connex Array: 1,024 linearly connected 16-bit Processing Cells
- Sequencer: 32-bit stack machine with program memory and data memory (4KB data, 32Kb program memory); issues in each cycle (on a 2-stage pipe) one 64-bit instruction for the Connex Array and a 24-bit instruction for itself
- IO Controller: 32-bit stack machine (4KB data, 4KB program memory); controls a 3.2 GB/s IO channel
- Processing Cell: Integer unit, data memory, Boolean unit
- The I/O channel works in parallel with code running on the Connex Array
[Figure: one Processing Cell with a 16-bit ALU, registers R0 to R7 (R0 labeled Connex, R1 labeled I/O, R2 labeled AUX) and a 16-bit RAM for data with addresses 0 to 255, shown next to the Sequencer, the I/O Controller and the Connex Array.]
22 Connex Array Structure
- Processing Cells are linearly connected using only the register R0
- The IO Plane consists of all R1s, supervised mainly by the IO Controller
- Conditional execution based on the state of the Boolean unit
- Integer unit, Boolean unit and Data memory execute in each cycle command fields from a 64-bit instruction issued by the Sequencer
- Vector reduction operations with scalar results in the TOS of the Sequencer (receiving data from the array of cells through a 3-stage pipe)
[Figure: cells 0 to 1023, each with a 16-bit ALU, registers R0 to R7, a data memory with addresses 0 to 255, and a Boolean state shown as on or off.]
23 I/O System
[Figure: block diagram of the I/O system. Blocks shown: Connex Array with its I/O Plane, IS, IOC, Interrupts, Switch Fabric (128-bit word), DDR-DRAM Controller, and four DRAMs.]
24 Test ICE
[Figure: block diagram of BA1024. Blocks shown: ConnexArray Programmable Media Processor (Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Video Merge/Blend, Motion Adaptive De-interlacing); Instruction Sequencer; I/O Sequencer; Configurable Switch Fabric; 64-bit Wide DRAM; four BT.656/1120 video ports; audio ports (1x-I2S, 4xI2S, S/PDIF); PCI v2.2 or Generic; Flash; Test.]
25 The Connex Architecture
- Vectors and selections
- Programming Connex
- Performance
26 Vectors and Selections
- Linear array of processing elements → vectors
- Local data memory in each processing element → array of vectors
- Data-dependent operations at the level of each processing element → selections
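27 Full Line Operations
- Line k = Line i OP Line j
- Line k = Line i OP scalar value (repeated for all elements)
[Figure: the data memory of the array drawn as 256 lines (0 to 255) by 1,024 columns (0 to 1023) of 16-bit data operands; Line i and Line j are combined by an operator (+, -, XOR, etc.) into Line k.]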
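28 Columns Active Based On Repeating Patterns
Mark all odd columns active. Or mark every third column active. Or mark every third and fourth column active, etc.
[Figure: 256 lines (0 to 255) by 1,024 columns (0 to 1023); Line i and Line j are combined by an operator (+, -, XOR, etc.) into Line k only in the columns marked active by the repeating pattern.]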
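29 Columns Active Based On Data Content
Apparently random columns are marked active, based on data-dependent results of previous operations.
[Figure: 256 lines (0 to 255) by 1,024 columns (0 to 1023); Line i and Line j are combined by an operator (+, -, XOR, etc.) into Line k only in the data-selected columns.]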
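The three activation styles (all columns, a repeating pattern, a data-dependent selection) can be emulated with Python lists. This is a sketch under stated assumptions: `line_op` is a hypothetical helper, 8 columns stand in for 1,024, and inactive columns are assumed to keep the previous content of Line k.

```python
import operator

N = 8  # stand-in for the 1,024 columns of the Connex Array


def line_op(op, line_i, line_j, active, line_k):
    """Line k = Line i OP Line j in active columns; others keep Line k."""
    return [op(a, b) if on else old
            for a, b, on, old in zip(line_i, line_j, active, line_k)]


line_i = list(range(N))     # 0, 1, ..., 7
line_j = [10] * N           # scalar value repeated for all elements
line_k = [0] * N

all_columns = [True] * N                        # full line operation
odd_columns = [c % 2 == 1 for c in range(N)]    # repeating pattern
data_columns = [v > 3 for v in line_i]          # data-dependent selection

full = line_op(operator.add, line_i, line_j, all_columns, line_k)
odd = line_op(operator.add, line_i, line_j, odd_columns, line_k)
by_data = line_op(operator.xor, line_i, line_j, data_columns, line_k)
```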
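30 Outer-Loop Parallelism
Example: 128 sets of 8x8 run in parallel in a 1024-cell array
[Figure: the 1,024 columns (0 to 1023) are divided into groups of 8 columns; lines 0 to 7 of each group hold one 8x8 block, and Line i and Line j are lines of the 256-line memory.]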
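A small sketch of why this layout is called outer-loop parallelism: the inner loop over the 8 lines of a block stays sequential, while the outer loop over the 128 blocks disappears because every full-line operation touches all blocks at once. The data values and the column-sum computation are hypothetical, chosen only so the result is easy to check.

```python
CELLS, BLOCK = 1024, 8
NUM_BLOCKS = CELLS // BLOCK          # 128 blocks of 8x8 side by side

# lines[r][c]: element of line r in column c; block b owns columns 8b..8b+7.
# Hypothetical test data: every element equals its block index plus its line.
lines = [[(c // BLOCK) + r for c in range(CELLS)] for r in range(BLOCK)]

# Inner loop (8 steps) is serial; each step is one full-line addition
# that updates the column sums of all 128 blocks together.
acc = [0] * CELLS
for r in range(BLOCK):
    acc = [a + v for a, v in zip(acc, lines[r])]
```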
31 Programming Connex
    int main() {
        vector V1 = 2;                 // V1 = {2, 2, ..., 2}
        vector V2 = 3;                 // V2 = {3, 3, ..., 3}
        vector V;                      // V = {0, 0, ..., 0}
        vector Index = indexvector();  // Index = {0, 1, ...}
        V = mm_absdiff(V1, V2);        // V = {1, 1, ..., 1}
        return 0;
    }

    // Find the absolute difference between two vectors
    vector mm_absdiff(vector V1, vector V2) {
        vector V;
        V = V1 - V2;
        WHERE (V < 0) {
            V = -V;                    // V = abs(V)
        }
        ENDW
        return V;
    }
- VectorC is an extension/restriction of C
- Code that operates on scalar data is written in regular C notation
- Connex-specific operators are defined as functions for features not available in C, e.g. operations on vectors and selections (Boolean vectors)
- VectorC uses sequential operators and control structures on vector and select data-types
- Using VectorC, the Connex Machine is programmed the same way as conventional sequential machines
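For readers without a VectorC toolchain, the behavior of the program above can be approximated in plain Python. This is an emulation sketch, not VectorC: lists stand in for the vector type, a list of Booleans stands in for the selection created by WHERE, and the vector length of 8 (instead of 1,024) is an assumption made to keep the output readable.

```python
N = 8  # stand-in for the 1,024 cells of the Connex Array


def indexvector(n=N):
    """Emulates a vector holding each cell's own index."""
    return list(range(n))


def mm_absdiff(v1, v2):
    """Absolute difference of two vectors, component by component."""
    v = [a - b for a, b in zip(v1, v2)]              # V = V1 - V2
    select = [x < 0 for x in v]                      # WHERE (V < 0)
    v = [-x if s else x for x, s in zip(v, select)]  # V = -V in selected cells
    return v                                         # ENDW restores all cells


v = mm_absdiff([2] * N, [3] * N)
w = mm_absdiff(indexvector(), [3] * N)
```

The second call shows a case where the selection differs from cell to cell: only the cells whose index is below 3 are negated.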
32 Overall performance of BA1024
- 200 GOP/sec
- 3.2 GB/sec external bandwidth
- 400 GB/sec internal bandwidth
- > 60 GOP/Watt
- > 2 GOP/mm2
- Note: 1 OP = 16-bit simple integer operation (excluding multiplication)
33 How to Use the Connex Environment for Membrane Computation
- Example (G. Paun)
- the initial configuration: [1 [2 [3 a f c ]3 ]2 ]1 ...
- R1: e → (e, out), f → f
- R2: b → d, d → de, ff → f, cf → cdδ
- R3: a → ab, a → bδ, f → ff
34 The first example of processing
Initial vector: (1,) (2,) (3,) (0,a) (0,f) (0,c) (3,) (2,) (1,) ...
a f c ...
a → ab, f → ff:        a b f f c ...                    // 11 clock cycles
a → ab, f → ff:        a b b f f f f c ...              // 15 clock cycles
a → bδ, f → ff:        b b b f f f f f f f f c ...      // 27 clock cycles
b → d, ff → f:         d d d f f f f c ...              // 10 clock cycles
d → de, ff → f:        d e d e d e f f c ...            // 10 clock cycles
d → de, cf → cdδ:      d e e d e e d e e d f c ...      // 10 clock cycles
e → (e, out), f → f:   d d d d f c e e e e e e ...      // 15 clock cycles
total: 98 clock cycles
35 The second example of processing
Initial vector: (1,) (2,) (3,) (1,a) (1,f) (1,c) (3,) (2,) (1,) ...
1a 1f 1c ... →
1a 1b 2f 1c ... →      // in 5 clock cycles
1a 2b 4f 1c ... →      // in 5 clock cycles
3b 8f 1c ... →         // in 10 clock cycles
3d 4f 1c ... →         // in 7 clock cycles
3d 3e 2f 1c ... →      // in 8 clock cycles
4d 3e 1f 1c ... →      // in 8 clock cycles
4d 1f 1c 3e ...        // in 5 clock cycles
total: 48 clock cycles
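The counted representation used in this example (one multiplicity per symbol) lends itself to a compact emulation of the first three steps of the trace. The sketch below is only a model of the rewriting, under stated assumptions: `step` is a hypothetical helper, rules are non-cooperative (one symbol on the left-hand side), the caller picks which rule applies to a symbol in each step, and membrane dissolution is not modeled. It says nothing about clock cycles on the Connex Array.

```python
from collections import Counter


def step(multiset, rules):
    """Rewrite every object in parallel; symbols without a rule are kept.

    rules maps one symbol to the string of symbols that replaces it.
    """
    nxt = Counter()
    for sym, count in multiset.items():
        for product in rules.get(sym, sym):
            nxt[product] += count   # all copies of sym are rewritten together
    return nxt


grow = {"a": "ab", "f": "ff"}               # a → ab, f → ff
m0 = Counter({"a": 1, "f": 1, "c": 1})      # 1a 1f 1c
m1 = step(m0, grow)                         # 1a 1b 2f 1c
m2 = step(m1, grow)                         # 1a 2b 4f 1c
m3 = step(m2, {"a": "b", "f": "ff"})        # a → b (dissolution omitted), f → ff
```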
36 The third example of processing
The third membrane is duplicated (multiplied), but the content can be different.
1a 1f 1c       2a 1f 1c ... →
1a 1b 2f 1c    2a 2b 2f 1c ... →      // in 5 clock cycles
1a 2b 4f 1c    2a 4b 4f 1c ... →      // in 5 clock cycles
3b 8f 1c       6b 8f 1c ... →         // in 10 clock cycles
3d 4f 1c       6d 4f 1c ... →         // in 7 clock cycles
3d 3e 2f 1c    6d 6e 2f 1c ... →      // in 8 clock cycles
4d 3e 1f 1c    7d 6e 1f 1c ... →      // in 8 clock cycles
4d 1f 1c       7d 1f 1c    9e ...     // in 10 clock cycles
total: 53 clock cycles
For up to 200 level-3 membranes the number of clock cycles remains 53.
37 Concluding Remarks
- Functional taxonomy vs. Flynn taxonomy
- Connex architecture accelerates membrane computation
- An efficient P-architecture asks for few additional features to the Connex architecture
- Why not a P-language?
38 Main technical contributors to the Connex project
- Emanuele Altieri, BrightScale Inc., CA
- Lazar Bivolarski, BrightScale Inc., CA
- Frank Ho, BrightScale Inc., CA
- Mihaela Malita, St. Anselm College, NH
- Bogdan Mitu, BrightScale Inc., CA
- Dominique Thiebaut, Smith College, MA
- Tom Thomson, BrightScale Inc., CA
- Dan Tomescu, BrightScale Inc., CA
39 Thank You
- Mihaela's webpage on VectorC: www.anselm.edu/homepage/mmalita/
- Q&A