Membrane Computing in the Connex Environment


1
Membrane Computing in the Connex Environment
Gheorghe Stefan
BrightScale Inc., Sunnyvale, CA
Politehnica University of Bucharest
gstefan_at_brightscale.com
2
Outline

Integral Parallel Architecture
The Connex Chip
The Connex Architecture
How to Use the Connex Environment
Concluding Remarks

3
Integral Parallel Architecture

The Ubiquitousness of Parallelism Asks for Integral Parallel Architectures
Partial Recursive Functions and Parallel Computation
A Functional Taxonomy of Parallel Computation

4
Parallelism cannot be avoided anymore

Intel's approach: multi-processors
Multi-processors are the best approach for multi-threading on a MIMD architecture
They are inefficient on a SIMD architecture
They ignore the MISD architecture
Many-processors ask for another taxonomy
They work as accelerators
They perform critical functions
Berkeley's 13 dwarfs are a functional approach for many-processors
Real applications ask for all kinds of parallelism to solve corner cases, the places where the devil hides

5
Partial Recursive Functions and Parallel Computation

Composition Rule: the Basic Parallel Structures
Primitive Recursive Rule
Minimalization Rule

6
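Composition: the Associated Structure

$f(x_0, \ldots, x_{n-1}) = g(h_0(x_0, \ldots, x_{n-1}), h_1(x_0, \ldots, x_{n-1}), \ldots, h_{m-1}(x_0, \ldots, x_{n-1}))$

[Figure: the inputs $x_0, \ldots, x_{n-1}$ feed the blocks $h_0, h_1, \ldots, h_{m-1}$ in parallel; their outputs feed the block $g(h_0, h_1, \ldots, h_{m-1})$, which produces $f(x_0, \ldots, x_{n-1})$.]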
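The composition rule can be sketched in a few lines of Python. This is only an illustration: the helper name `compose` and the example functions are assumptions, not part of the slides, and the `h_i` are evaluated one after another here although the structure lets them run on separate units.

```python
def compose(g, hs):
    """Return f with f(*xs) = g(h_0(*xs), ..., h_{m-1}(*xs))."""
    def f(*xs):
        # Every h_i depends only on xs, so the m evaluations are independent
        # of each other; a parallel machine could compute them side by side.
        return g(*(h(*xs) for h in hs))
    return f


# Hypothetical example: f(x0, x1) = (x0 + x1) * (x0 - x1)
f = compose(lambda a, b: a * b,
            [lambda x0, x1: x0 + x1, lambda x0, x1: x0 - x1])
print(f(5, 3))  # prints 16
```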
7
Data Parallel Composition

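$X = \{x_0, \ldots, x_{n-1}\} \rightarrow \{h(x_0), h(x_1), \ldots, h(x_{n-1})\}$

[Figure: n copies of the block $h$ work in parallel; copy i receives $x_i$ and outputs $h(x_i)$.]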
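As a rough sketch, data parallel composition is a map of one function over a vector. The name `data_parallel` is hypothetical, and the list comprehension stands in for n cells that would apply `h` at the same time.

```python
def data_parallel(h, xs):
    """Apply the same function h to every component of the vector xs."""
    # Each element is processed independently, one "cell" per element.
    return [h(x) for x in xs]


squares = data_parallel(lambda x: x * x, [0, 1, 2, 3])
print(squares)  # prints [0, 1, 4, 9]
```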
8
Speculative Composition

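Function vector $H = \{h_0, h_1, \ldots, h_{n-1}\}$, scalar $x$
$H(x) = \{h_0(x), h_1(x), \ldots, h_{n-1}(x)\}$

[Figure: the scalar $x$ is broadcast to the blocks $h_0, h_1, \ldots, h_{n-1}$, which output $h_0(x), h_1(x), \ldots, h_{n-1}(x)$.]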
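Speculative composition is the dual of the data parallel case: many functions, one input. The sketch below uses a hypothetical helper `speculative` and made-up example functions to show the shape of the computation.

```python
def speculative(hs, x):
    """Apply every function of the vector hs to the same scalar x."""
    # All h_i receive the same x; no h_i needs the result of another.
    return [h(x) for h in hs]


# Hypothetical function vector: h_k(x) = x + k for k = 0..3
hs = [lambda x, k=k: x + k for k in range(4)]
print(speculative(hs, 10))  # prints [10, 11, 12, 13]
```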
9
Serial Composition

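$f(x) = g(h(x))$

Time parallelism
The general case: $f(x) = g_1(g_2(g_3(\ldots g_p(x) \ldots)))$

[Figure: $x$ enters the block $h$, whose output enters the block $g$, producing $f(x) = g(h(x))$.]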
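Time parallelism appears when a stream of inputs flows through the chain of functions: each stage works on a different item in the same clock tick. The following is a minimal clock-by-clock simulation, assuming one register per stage and using `None` as the "empty" marker; the function name `pipeline` and the three example stages are illustrative only.

```python
def pipeline(stages, stream):
    """Push a stream through a pipe; stages[0] is applied first.

    Returns (outputs, clocks). None marks an empty stage register,
    so the stream itself must not contain None.
    """
    regs = [None] * len(stages)      # output register of each stage
    pending = list(stream)
    outputs = []
    clocks = 0
    while pending or any(r is not None for r in regs):
        if regs[-1] is not None:     # a finished item leaves the pipe
            outputs.append(regs[-1])
        # Update from the last stage backwards so that each stage
        # reads the value its predecessor held before this tick.
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        regs[0] = stages[0](pending.pop(0)) if pending else None
        clocks += 1
    return outputs, clocks


stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs, clocks = pipeline(stages, [1, 2, 3, 4])
print(outputs, clocks)  # prints [1, 3, 5, 7] 7
```

In this model n items cross p stages in n + p ticks, instead of the n * p function applications a single unit would perform one after another.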
10
Reduction Composition

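$f(x_0, \ldots, x_{m-1}) = g(x_0, \ldots, x_{m-1})$

[Figure: the inputs $x_0, x_1, \ldots, x_{m-1}$ enter a single block $g(x_0, x_1, \ldots, x_{m-1})$, which outputs the scalar $g(x_0, \ldots, x_{m-1})$.]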
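When the reduction function is built from an associative two-input operation (an assumption; the slide only shows a generic m-input block `g`), the vector can be reduced in a logarithmic number of levels. A small sketch, with the hypothetical name `tree_reduce`:

```python
import operator


def tree_reduce(op, xs):
    """Reduce a non-empty vector with an associative binary op.

    Returns (result, levels); each level combines neighbouring pairs,
    and all pairs of one level are independent of each other.
    """
    level = list(xs)
    levels = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd element is carried to the next level
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


print(tree_reduce(operator.add, range(8)))  # prints (28, 3)
```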
11
Primitive recursive rule

$f(x, y) = h(x, f(x, y-1))$, where $f(x, 0) = g(x)$
$f(x, y) = h(x, h(x, h(x, \ldots h(x, g(x)) \ldots)))$
A parallel solution makes sense only if the function must be computed many times.
Implementations:
Data parallel composition
Loop in a serial composition
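The recursion unrolls into a loop, and the data parallel variant runs that loop in lockstep for many values of y. The sketch below assumes the slide's simplified form $f(x, y) = h(x, f(x, y-1))$; the names `prim_rec` and `prim_rec_data_parallel` are hypothetical, and the per-cell condition `step < y` plays the role of a selection that switches finished cells off.

```python
def prim_rec(g, h):
    """Return f with f(x, 0) = g(x) and f(x, y) = h(x, f(x, y - 1))."""
    def f(x, y):
        acc = g(x)
        for _ in range(y):
            acc = h(x, acc)
        return acc
    return f


def prim_rec_data_parallel(g, h, x, ys):
    """Compute [f(x, y) for y in ys] with all cells stepping together."""
    accs = [g(x)] * len(ys)
    for step in range(max(ys, default=0)):
        # Every cell sees the same operation; cells whose y is already
        # reached stay inactive and keep their value.
        accs = [h(x, a) if step < y else a for a, y in zip(accs, ys)]
    return accs


# Hypothetical example: g(x) = 1, h(x, acc) = x * acc gives f(x, y) = x ** y
power = prim_rec(lambda x: 1, lambda x, acc: x * acc)
print(power(2, 5))                                      # prints 32
print(prim_rec_data_parallel(lambda x: 1, lambda x, acc: x * acc,
                             2, [0, 1, 2, 3]))          # prints [1, 2, 4, 8]
```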

12
Data Parallel Composition for the Primitive Recursive Rule

$x, Y = \{y_0, \ldots, y_{n-1}\} \rightarrow \{f(x, y_0), f(x, y_1), \ldots, f(x, y_{n-1})\}$

[Figure: n copies of the block $h$ work in parallel; copy i receives $(x, y_i)$ and outputs $f(x, y_i)$.]
13
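Serial Composition for the Primitive Recursive Rule

$x, \langle Y \rangle = \langle y_0, \ldots, y_{n-1} \rangle \rightarrow \langle F \rangle = \langle f(x, y_0), f(x, y_1), \ldots, f(x, y_{n-1}) \rangle$

[Figure: $x$ and the stream $\langle Y \rangle$ enter a pipe of $h$ blocks; a sel block selects the output stream $\langle F \rangle$.]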
14
Minimalization rule

$f(x) = \min(y)[m(x, y) = 0]$
Implementations:
Speculative composition and reduction composition
Serial composition and reduction composition

15
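Speculative Composition and Reduction Composition for Minimalization

[Figure: $x$ is broadcast to n blocks computing $m(x, 0), m(x, 1), \ldots, m(x, n-1)$; block i outputs the pair $(m(x, i), i)$; a "first 0" reduction block returns the index i, so that $f(x) = i$.]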
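The two phases of this structure can be imitated in Python: a speculative phase that evaluates m(x, y) for a whole range of candidates y, followed by a reduction that keeps the first candidate giving 0. The bound `n` on the candidates is an assumption imposed by the finite number of blocks (true minimalization is unbounded), and the function name is hypothetical.

```python
def minimalize_speculative(m, x, n):
    """Return the smallest y in range(n) with m(x, y) == 0, or None."""
    # Speculative phase: all candidates are tried, most of them "in vain".
    pairs = [(m(x, y), y) for y in range(n)]
    # Reduction phase: keep the first index whose value is 0.
    return next((y for value, y in pairs if value == 0), None)


# Hypothetical example: the integer square root of a perfect square
root = minimalize_speculative(lambda x, y: x - y * y, 49, 16)
print(root)  # prints 7
```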
16
Serial Composition and Reduction Composition for Minimalization

Example of dynamic reconfiguration ($P_i$ is the i-th pipe stage)

[Figure: pipe stages $P_{i-5}, P_{i-4}, P_{i-3}, P_{i-2}, P_{i-1}, P_i$; labels: $y_{i-1}, y_{i-2}, \ldots, y_{i-s}$, selection code, $y_i$.]
17
Functional Taxonomy of Parallel Computation

Data Parallel Computation uses SIMD-like machines
Time Parallel Computation is a very special sort of MIMD, used to compute only one function
Speculative Computation is a MISD machine, completely ignored by actual implementations

18
Integral Parallel Architecture

An Integral Parallel Architecture (IPA) uses all kinds of parallelism to build a real machine, in two versions:
complex IPA: all types of parallel mechanisms tightly interleaved on the same physical structure (pipelined, superscalar, speculative general purpose processors)
intensive IPA: all types of parallel mechanisms highly separated, implemented on specific physical structures (accelerators for embedded computation in a SoC approach)

19
Intensive IPA

Intensive IPAs are used as accelerators for complex IPAs
Monolithic intensive IPA: the same machine works in two modes
Data parallel
Time parallel
Segregated intensive IPA: two distinct machines are used, one for data parallel computation and the other for time parallel (i.e. speculative) computation

20
The Connex Chip

The organization of BA1024:
multi-core area of 4 MIPS
many-core data parallel area of 1024 simple PEs
speculative time parallel pipe of 8 PEs
interfaces (DDR, PCI, video and audio interfaces for 2 HDTV channels)

21
The Connex System

Connex Array: 1,024 linearly connected 16-bit Processing Cells
Sequencer: 32-bit stack machine with program memory and data memory (4KB data, 32Kb program memory); issues in each cycle (on a 2-stage pipe) one 64-bit instruction for the Connex Array and a 24-bit instruction for itself
I/O Controller: 32-bit stack machine (4KB data, 4KB program memory); controls a 3.2 GB/s I/O channel
Processing Cell: integer unit (16-bit ALU), data memory (16-bit RAM for data, addresses 0 to 255), Boolean unit
The I/O channel works in parallel with code running on the Connex Array

[Figure: a processing cell with registers R0 (Connex), R1 (I/O), R2 (AUX) and R3 to R7, a 16-bit ALU and the data memory, shown next to the Sequencer, the I/O Controller and the Connex Array.]
22
Connex Array Structure

Processing Cells are linearly connected using only the register R0
The IO Plane consists of all the R1 registers, supervised mainly by the IO Controller
Conditional execution is based on the state of the Boolean unit
The integer unit, the Boolean unit and the data memory execute in each cycle command fields from a 64-bit instruction issued by the Sequencer
Vector reduction operations leave scalar results in the TOS of the Sequencer (which receives data from the array of cells through a 3-stage pipe)

[Figure: cells 0, 1, ..., 1023, each with a 16-bit ALU, registers R0 to R7, data memory addresses 0 to 255, and a Boolean unit state (on/off).]
23
I/O System

[Figure: block diagram with the Connex Array and its I/O Plane, the IS, the IOC with interrupts, a Switch Fabric (128-bit word), and a DDR-DRAM Controller connected to four DRAMs.]
24
[Figure: block diagram of the BA1024. ConnexArray Programmable Media Processor (Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Video Merge/Blend, Motion Adaptive De-interlacing); Instruction Sequencer; I/O Sequencer; Configurable Switch Fabric; 64-bit wide DRAM; four BT.656/1120 video ports; audio ports (1x-I2S, 4xI2S, S/PDIF); PCI v2.2 or Generic; Flash; Test and Test ICE.]
25
The Connex Architecture

Vectors and selections
Programming Connex
Performances

26
Vectors and Selections

Linear array of processing elements → vectors
Local data memory in each processing element → array of vectors
Data-dependent operations at the level of each processing element → selections

27
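Full Line Operations

Line k = Line i OP Line j
Line k = Line i OP scalar value (repeated for all elements)
OP: +, -, XOR, etc.

[Figure: the vector memory as 256 lines (0 to 255) by 1,024 columns (0 to 1023), each element a 16-bit data operand; Line i and Line j are combined into Line k.]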
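A full line operation can be modelled as an element-wise operation over two lists. This is a sketch under assumptions: the wrap-around to 16 bits is assumed from the 16-bit operand width, and `line_op`, `WIDTH` and `MASK` are illustrative names, not Connex identifiers.

```python
import operator

WIDTH = 1024        # number of columns (cells) in the array
MASK = 0xFFFF       # assumed 16-bit wrap-around of every result


def line_op(op, line_i, operand):
    """Line k = Line i OP Line j, or Line i OP scalar if operand is an int."""
    if isinstance(operand, int):
        operand = [operand] * len(line_i)   # scalar repeated for all elements
    return [op(a, b) & MASK for a, b in zip(line_i, operand)]


line_i = list(range(WIDTH))
line_j = [1] * WIDTH
line_k = line_op(operator.add, line_i, line_j)
print(line_k[:3], line_k[-1])                      # prints [1, 2, 3] 1024
print(line_op(operator.xor, line_i, 0x00FF)[:2])   # prints [255, 254]
```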
28
Columns Active Based On Repeating Patterns

Mark all odd columns active. Or mark every third column active. Or mark every third and fourth column active, etc.

[Figure: 256 lines (0 to 255) by 1,024 columns (0 to 1023); Line k = Line i OP Line j (+, -, XOR, etc.) is computed only in the active columns.]
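Pattern-based activation amounts to a Boolean vector computed from the column index. In the sketch below, inactive columns keep the old content of Line k; the helper name, the 8-column stand-in for the 1,024 columns, and the phase chosen for "every third column" are all assumptions.

```python
import operator


def masked_line_op(op, line_k, line_i, line_j, active):
    """Line k = Line i OP Line j in active columns; others keep Line k."""
    return [op(a, b) if on else old
            for old, a, b, on in zip(line_k, line_i, line_j, active)]


COLUMNS = 8                                          # small stand-in for 1024
odd_columns = [c % 2 == 1 for c in range(COLUMNS)]
every_third = [c % 3 == 0 for c in range(COLUMNS)]   # starting at column 0

line_i = [10] * COLUMNS
line_j = list(range(COLUMNS))
line_k = [0] * COLUMNS

result_odd = masked_line_op(operator.add, line_k, line_i, line_j, odd_columns)
result_third = masked_line_op(operator.add, line_k, line_i, line_j, every_third)
print(result_odd)    # prints [0, 11, 0, 13, 0, 15, 0, 17]
print(result_third)  # prints [10, 0, 0, 13, 0, 0, 16, 0]
```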
29
Columns Active Based On Data Content

Apparently random columns are active (marked), based on data-dependent results of previous operations.

[Figure: 256 lines (0 to 255) by 1,024 columns (0 to 1023); Line k = Line i OP Line j (+, -, XOR, etc.) is computed only in the marked columns.]
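Data-dependent activation can be illustrated with an element-wise maximum: a comparison marks the columns, and a copy is then performed only where the mark is set. The values and variable names below are made up for the illustration.

```python
line_i = [5, -3, 8, -1, 0, 7]
line_j = [2, 4, 9, -6, 0, 1]

# A previous operation (here a comparison) produces the selection.
active = [a > b for a, b in zip(line_i, line_j)]

# Line k starts as a copy of Line j; only marked columns take Line i.
line_k = list(line_j)
line_k = [a if on else old for old, a, on in zip(line_k, line_i, active)]

print(active)   # prints [True, False, False, True, False, True]
print(line_k)   # prints [5, 4, 9, -1, 0, 7]
```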
30
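Outer-Loop Parallelism

Example: 128 sets of 8x8 run in parallel in a 1024-cell array

[Figure: the 1,024 columns (0 to 1023) divided into 8x8 blocks laid side by side, each occupying 8 columns and lines 0 to 7; Line i and Line j are marked within the 256 lines (0 to 255).]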
Programming Connex

int main() {
    vector V1 = 2;                  // V1 = 2, 2, ..., 2
    vector V2 = 3;                  // V2 = 3, 3, ..., 3
    vector V;                       // V = 0, 0, ..., 0
    vector Index = indexvector();   // Index = 0, 1, ...
    V = mm_absdiff(V1, V2);         // V = 1, 1, ..., 1
    return 0;
}

// Find the absolute difference between two vectors
vector mm_absdiff(vector V1, vector V2) {
    vector V;
    V = V1 - V2;
    WHERE (V < 0)
        V = -V;                     // V = abs(V)
    ENDW
    return V;
}

VectorC is an extension/restriction of C
Code that operates on scalar data is written in regular C notation
Connex-specific operators are defined as functions for features not available in C, e.g. operations on vectors and selections (Boolean vectors)
VectorC uses sequential operators and control structures on vector and select data types
Using VectorC, the Connex Machine is programmed the same way as conventional sequential machines
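For readers without access to VectorC, the absolute-difference example can be imitated in plain Python. This is an emulation under assumptions, not the VectorC semantics themselves: vectors are lists of `N` integers (with `N` taken from the 1,024-cell array), and the WHERE block is modelled as a Boolean selection that decides, cell by cell, whether the assignment takes effect.

```python
N = 1024                     # assumed vector length: one element per cell


def indexvector():
    """Emulates indexvector(): element c holds its own column index."""
    return list(range(N))


def mm_absdiff(v1, v2):
    """Absolute difference of two vectors, written the predicated way."""
    v = [a - b for a, b in zip(v1, v2)]          # V = V1 - V2
    selection = [x < 0 for x in v]               # WHERE (V < 0)
    v = [-x if on else x                         #     V = -V
         for x, on in zip(v, selection)]         # ENDW
    return v


v1 = [2] * N
v2 = [3] * N
v = mm_absdiff(v1, v2)
print(v[:4])                                      # prints [1, 1, 1, 1]
print(mm_absdiff(indexvector(), [2] * N)[:5])     # prints [2, 1, 0, 1, 2]
```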

32
Overall performance of BA1024

200 GOP/sec
3.2 GB/sec external bandwidth
400 GB/sec internal bandwidth
> 60 GOP/Watt
> 2 GOP/mm²
Note: 1 OP = one 16-bit simple integer operation (excluding multiplication)
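A quick back-of-the-envelope check relates these figures to each other. It assumes that each of the 1,024 cells completes one operation per clock cycle, which the slides do not state; the derived clock rate, power and area are therefore only implied estimates, not quoted specifications.

```python
cells = 1024
ops_per_sec = 200e9               # 200 GOP/sec
internal_bytes_per_sec = 400e9    # 400 GB/sec internal bandwidth

# Assumption: one operation per cell per cycle.
implied_clock_hz = ops_per_sec / cells
bytes_per_op = internal_bytes_per_sec / ops_per_sec

# "> 60 GOP/Watt" and "> 2 GOP/mm2" give upper bounds at 200 GOP/sec.
max_power_watts = ops_per_sec / 60e9
max_area_mm2 = ops_per_sec / 2e9

print(round(implied_clock_hz / 1e6, 1))   # prints 195.3 (MHz)
print(bytes_per_op)                       # prints 2.0, one 16-bit word per operation
print(round(max_power_watts, 2))          # prints 3.33
print(max_area_mm2)                       # prints 100.0
```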

33
How to Use the Connex Environment for Membrane Computation

Example (G. Paun)
The initial configuration: [1 [2 [3 a f c ]3 ]2 ]1 ...
R1: e → (e, out), f → f
R2: b → d, d → de, ff → f, cf → cdδ
R3: a → ab, a → bδ, f → ff

34
The first example of processing

Initial vector: (1,[) (2,[) (3,[) (0,a) (0,f) (0,c) (3,]) (2,]) (1,]) ...
a f c ...
a → ab, f → ff: a b f f c ... // 11 clock cycles
a → ab, f → ff: a b b f f f f c ... // 15 clock cycles
a → bδ, f → ff: b b b f f f f f f f f c ... // 27 clock cycles
b → d, ff → f: d d d f f f f c ... // 10 clock cycles
d → de, ff → f: d e d e d e f f c ... // 10 clock cycles
d → de, cf → cdδ: d e e d e e d e e d f c ... // 10 clock cycles
e → (e, out), f → f: d d d d f c e e e e e e ... // 15 clock cycles
Total: 98 clock cycles
35
The second example of processing

Initial vector: (1,[) (2,[) (3,[) (1,a) (1,f) (1,c) (3,]) (2,]) (1,]) ...
1a 1f 1c ...
→ 1a 1b 2f 1c ... // in 5 clock cycles
→ 1a 2b 4f 1c ... // in 5 clock cycles
→ 3b 8f 1c ... // in 10 clock cycles
→ 3d 4f 1c ... // in 7 clock cycles
→ 3d 3e 2f 1c ... // in 8 clock cycles
→ 4d 3e 1f 1c ... // in 8 clock cycles
→ 4d 1f 1c 3e ... // in 5 clock cycles
Total: 48 clock cycles
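The multiplicity representation used in this example (one count per symbol instead of one vector element per object) is easy to mimic with a multiset. The sketch below replays only the first five steps of the trace. It is a simplified model under stated assumptions: all rules of a step are applied as many times as possible, the left-hand sides of the rules used together never share a symbol, and membrane dissolution is reduced to dropping the dissolving symbol from the right-hand side. The function name `step` is hypothetical.

```python
from collections import Counter


def step(config, rules):
    """Apply every rule as often as possible to the multiset config.

    rules is a list of (lhs, rhs) strings, e.g. ("ff", "f").
    Assumes that the left-hand sides do not share symbols.
    """
    start = Counter(config)
    result = Counter(config)
    for lhs, rhs in rules:
        need = Counter(lhs)
        # How many times the rule fits into the starting multiset.
        times = min(start[s] // k for s, k in need.items())
        for s, k in need.items():
            result[s] -= k * times
        for s, k in Counter(rhs).items():
            result[s] += k * times
    return +result               # unary plus drops symbols with count 0


c0 = Counter("afc")                              # 1a 1f 1c
c1 = step(c0, [("a", "ab"), ("f", "ff")])        # 1a 1b 2f 1c
c2 = step(c1, [("a", "ab"), ("f", "ff")])        # 1a 2b 4f 1c
c3 = step(c2, [("a", "b"), ("f", "ff")])         # 3b 8f 1c (dissolution dropped)
c4 = step(c3, [("b", "d"), ("ff", "f")])         # 3d 4f 1c
c5 = step(c4, [("d", "de"), ("ff", "f")])        # 3d 3e 2f 1c
print(sorted(c5.items()))
```

Because a rule touches one counter per symbol, the cost of a step in this representation does not grow with the number of copies of each object, which is consistent with the lower cycle counts of the second example.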
36
The third example of processing

The third membrane is duplicated (multiplied), but the content can be different
1a 1f 1c 2a 1f 1c ...
→ 1a 1b 2f 1c 2a 2b 2f 1c ... // in 5 clock cycles
→ 1a 2b 4f 1c 2a 4b 4f 1c ... // in 5 clock cycles
→ 3b 8f 1c 6b 8f 1c ... // in 10 clock cycles
→ 3d 4f 1c 6d 4f 1c ... // in 7 clock cycles
→ 3d 3e 2f 1c 6d 6e 2f 1c ... // in 8 clock cycles
→ 4d 3e 1f 1c 7d 6e 1f 1c ... // in 8 clock cycles
→ 4d 1f 1c 7d 1f 1c 9e ... // in 10 clock cycles
Total: 53 clock cycles
For up to 200 level-3 membranes the number of clock cycles remains 53.
37
Concluding Remarks

Functional taxonomy vs. Flynn taxonomy
The Connex architecture accelerates membrane computation
An efficient P-architecture asks for only a few features added to the Connex architecture
Why not a P-language?

38
Main technical contributors to the Connex project

Emanuele Altieri, BrightScale Inc., CA
Lazar Bivolarski, BrightScale Inc., CA
Frank Ho, BrightScale Inc., CA
Mihaela Malita, St. Anselm College, NH
Bogdan Mitu, BrightScale Inc., CA
Dominique Thiebaut, Smith College, MA
Tom Thomson, BrightScale Inc., CA
Dan Tomescu, BrightScale Inc., CA