Title: The BSP Model (Bulk-Synchronous Parallel)
1 Frédéric Gava
The BSP Model (Bulk-Synchronous Parallel)
2 Background
Parallel programming
3 The BSP model
BSP architecture
- Characterized by:
- p: number of processors
- r: processor speed
- L: cost of a global synchronization
- g: cost of a communication phase in which each processor
sends or receives at most 1 word
4 Model of execution
Beginning of the super-step i:
Local computation on each processor
Global (collective) communications between processors
Global synchronization: exchanged data available for the
next super-step
Cost(i) = (max_{0≤x<p} w_x^i) + h^i·g + L
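As a small sketch, the cost of one super-step (maximum local work, plus g per word of the h-relation, plus L for the barrier) can be computed directly; P, g, L and the sample values below are hypothetical, since on a real machine g and L come from benchmarks:

```c
#include <assert.h>

#define P 4
static const double g = 5.0;    /* hypothetical cost per word communicated */
static const double L = 100.0;  /* hypothetical cost of a global synchronization */

/* Cost of one super-step: slowest local computation, plus h*g for
   the h-relation, plus L for the synchronization barrier. */
double superstep_cost(const double w[P], const double h[P]) {
    double wmax = 0.0, hmax = 0.0;
    for (int x = 0; x < P; x++) {
        if (w[x] > wmax) wmax = w[x];   /* max local work w_x */
        if (h[x] > hmax) hmax = h[x];   /* h = max words sent or received */
    }
    return wmax + hmax * g + L;
}
```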
5 Example of a BSP machine
6 Cost model
- Cost(program) = sum of the costs of its super-steps
- BSP computation: scalable, portable, predictable
- BSP algorithm design: minimize W (computation time),
H (communication volume) and S (number of super-steps)
- Cost(program) = W + g·H + S·L
- g and L are measurable (by benchmarks), hence the
possibility of prediction
- Main principles
- Load balancing minimizes W
- Data locality minimizes H
- Coarse granularity minimizes S
- In general, data locality good, network locality bad!
- Typically, problem size n >>> p (slackness)
- Input/output distribution even, but otherwise arbitrary
7 A libertarian model
- No master
- Homogeneous power of the nodes
- Global (collective) decision procedures instead
- No god
- Confluence (no divine intervention)
- Predictable cost
- Scalable performance
- Practiced but confined
8 Advantages and drawbacks
- Advantages
- Allows cost prediction and is deadlock-free
- Structuring the execution and thus bulk-sending can be
very efficient on many architectures (multi-cores,
clusters, etc.): sending one file of 1000 bytes performs
better than sending 1000 files of 1 byte
- Abstract architecture: portable
- Drawbacks
- Some algorithmic patterns do not fit well in the BSP
model: pipelines, etc.
- Some problems are difficult (or impossible) to fit into a
coarse-grain execution model (fine-grained parallelism)
- The abstract architecture does not exploit some efficient
possibilities of particular architectures (clusters of
multi-cores, grids) and thus needs other libraries or
models of execution
9 Example: broadcast
- Direct broadcast (one super-step)
(figure: processors 0, 1, 2)
BSP cost = p·n·g + L
- Broadcast with 2 super-steps (scatter, then total exchange)
BSP cost = 2·n·g + 2·L
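The two-super-step broadcast can be simulated sequentially: the source scatters one chunk per processor, then every processor sends its chunk to all the others. This is only a sketch (arrays stand in for real communication; P and N are hypothetical):

```c
#include <string.h>

#define P 4
#define N 8   /* message length, assumed divisible by P */

/* Simulated two-phase broadcast of msg among P processors.
   recv[i] ends up holding the full message on processor i. */
void broadcast_two_steps(const int msg[N], int recv[P][N]) {
    int chunk[P][N / P];   /* the chunk held by each processor */

    /* Super-step 1: scatter. The source sends chunk j (n/P words)
       to processor j: about n*g + L. */
    for (int j = 0; j < P; j++)
        memcpy(chunk[j], msg + j * (N / P), (N / P) * sizeof(int));

    /* Super-step 2: total exchange. Each processor sends its n/P
       words to the P-1 others and receives the P-1 other chunks:
       again about n*g + L, for 2*n*g + 2*L in total instead of
       p*n*g + L. */
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            memcpy(recv[i] + j * (N / P), chunk[j], (N / P) * sizeof(int));
}
```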
10 BSP algorithms
- Matrices: multiplication, inversion, decomposition,
linear algebra, etc.
- Sparse matrices: likewise.
- Graphs: shortest path, decomposition, etc.
- Geometry: Voronoi diagram, polygon intersection, etc.
- FFT: Fast Fourier Transform
- Pattern matching
- Etc.
11 Parallel prefixes
- If we suppose an associative operator ⊕
- a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c, or better
- a ⊕ (b ⊕ (c ⊕ d)) = (a ⊕ b) ⊕ (c ⊕ d)
- Example
12 Parallel Prefixes
- Classical log(p) super-steps method
(figure: processors 0, 1, 2, 3)
Cost = log(p) × (Time(op) + Size(d)·g + L)
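The classical method can be sketched as a sequential simulation: at step 2^k, processor i sends its partial result to processor i + 2^k, which combines it after the barrier, giving log2(P) super-steps. The operator here is integer addition (any associative operator would do); P is hypothetical:

```c
/* Simulated log2(P)-super-step inclusive parallel prefix. */
#define P 8

void parallel_prefix(int v[P]) {
    int recv[P];
    for (int step = 1; step < P; step *= 2) {  /* log2(P) super-steps */
        /* communication phase: processor i sends v[i] to i + step */
        for (int i = 0; i + step < P; i++)
            recv[i + step] = v[i];
        /* computation phase after the barrier */
        for (int i = step; i < P; i++)
            v[i] = recv[i] + v[i];             /* the associative operator */
    }
}
```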
13 Parallel Prefixes
- Divide-and-conquer method
(figure: processors 0, 1, 2, 3)
14 Our parallel machine
- Cluster of PCs
- Pentium IV 2.8 GHz
- 512 MB RAM
- A front-end Pentium IV 2.8 GHz, 512 MB RAM
- Gigabit Ethernet cards and switch
- Ubuntu 7.04 as OS
15 Our BSP parameters: g
16 Our BSP parameters: L
17 How to read benchmarks
- There are many ways to publish benchmarks
- Tables
- Graphics
- The goal is to say "it is a good parallel method, see my
benchmarks", but it is often easy to arrange the
presentation of the graphics to hide the problems
- Using graphics (from the easiest to hide problems with,
to the hardest):
- Increase the size of the data for some fixed numbers of
processors
- Increase the number of processors for a typical size of
data
- Acceleration, i.e. Time(seq)/Time(par)
- Efficiency, i.e. Acceleration/Number of processors
- Increase both the number of processors and the size of
the data
18 Increase number of processors
19 Acceleration
20 Efficiency
21 Increase data and processors
22 Super-linear acceleration?
- Better than the theoretical acceleration. Possible if the
data fit better in the cache memories than in the RAM,
due to the small amount of data on each processor.
- Why the impact of caches? Mainly, each processor has a
small amount of memory called cache. Access to this
memory is (almost) twice as fast as RAM accesses.
- Take for example the multiplication of matrices
23 Fast multiplication
- A straightforward C implementation of res = mult(A,B)
(matrices of size N×N) can look like this:

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        res[i][j] += a[i][k] * b[k][j];

- Consider the following equation:

  res[i][j] = Σ_k a[i][k] × bᵀ[j][k]

- where bᵀ is the transpose of matrix b
24 Fast multiplication
- One can implement this equation in C as:

  double tmp[N][N];
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      tmp[i][j] = b[j][i];
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        res[i][j] += a[i][k] * tmp[j][k];

- where tmp is the transpose of b
- This new multiplication is 2 times faster. With other
cache optimisations, one can get a program 64 times
faster without really modifying the algorithm.
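The speed-up comes from accessing tmp row-wise (cache-friendly) instead of striding through b column-wise. A minimal self-contained check that both loops compute the same product (small hypothetical N; the speed-up itself only shows for large N):

```c
#include <string.h>

#define N 64

static double a[N][N], b[N][N], res1[N][N], res2[N][N], tmp[N][N];

void mult_naive(void) {
    memset(res1, 0, sizeof res1);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                res1[i][j] += a[i][k] * b[k][j];   /* strides through b column-wise */
}

void mult_transposed(void) {
    memset(res2, 0, sizeof res2);
    for (int i = 0; i < N; i++)            /* tmp = transpose of b */
        for (int j = 0; j < N; j++)
            tmp[i][j] = b[j][i];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                res2[i][j] += a[i][k] * tmp[j][k]; /* both accesses row-wise */
}
```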
25 More complicated examples
26 N-body problem
27 Presentation
- We have a set of bodies
- coordinates in 2D or 3D
- point mass
- The classic N-body problem is to calculate the
gravitational energy of N point masses, that is
E = −G Σ_{1≤i<j≤N} m_i·m_j / ‖q_i − q_j‖
- Quadratic complexity
- In practice, N is very big and sometimes it is impossible
to keep the whole set in main memory
28 Parallel methods
- Each processor has a sub-part of the original set
- Parallel method, on each processor:
1) compute the local interactions
2) compute the interactions with the other point masses
3) parallel prefix of the local results
- For 2), simple parallel methods:
- using a total exchange of the sub-sets
- using a systolic loop
29 Systolic loop
(figure: blocks circulating among processors 0, 1, 2, 3)
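A minimal sequential simulation of the systolic loop: each processor owns a block of bodies, and a copy of each block travels around the ring so that after P-1 super-steps every processor has seen every other block. The "interaction" here is just a product of masses (an assumption for illustration; P and B are hypothetical), with cross-processor pairs halved because each is counted on both endpoints:

```c
#define P 4
#define B 2   /* bodies per processor */

double systolic_energy(const double mass[P][B]) {
    double travel[P][B], next[P][B], local[P];
    for (int i = 0; i < P; i++) {
        local[i] = 0.0;
        for (int k = 0; k < B; k++) travel[i][k] = mass[i][k];
        /* local interactions inside the block */
        for (int k = 0; k < B; k++)
            for (int l = k + 1; l < B; l++)
                local[i] += mass[i][k] * mass[i][l];
    }
    for (int step = 1; step < P; step++) {   /* P-1 super-steps */
        for (int i = 0; i < P; i++)          /* ring shift of the blocks */
            for (int k = 0; k < B; k++)
                next[(i + 1) % P][k] = travel[i][k];
        for (int i = 0; i < P; i++)
            for (int k = 0; k < B; k++) {
                travel[i][k] = next[i][k];
                for (int l = 0; l < B; l++)  /* interactions with the visitor */
                    local[i] += mass[i][l] * travel[i][k] / 2.0;
            }
    }
    double total = 0.0;                      /* stands for the final prefix */
    for (int i = 0; i < P; i++) total += local[i];
    return total;
}
```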
30 Benchmarks and BSP predictions
31 Benchmarks and BSP predictions
32 Benchmarks and BSP predictions
33 Parallel methods
- There exist many better algorithms than this one
- Especially considering that computing all interactions is
not needed (distant molecules)
- One classic algorithm is to divide the space into
sub-spaces, compute recursively the n-body problem on
each sub-space (which thus has sub-sub-spaces), and only
consider interactions between these sub-spaces. The
recursion stops when there are at most two molecules in
the sub-space
- That gives n·log(n) computations
34 Sieve of Eratosthenes
35 Presentation
- Classic: find the prime numbers by enumeration
- Pure functional implementation using lists
- Complexity: n·log(n)/log(log(n))
- We used
- elim: int list → int → int list, which deletes from a
list all the integers multiple of the given parameter
- final elim: int list → int list → int list, which
iterates elim
- seq_generate: int → int → int list, which returns the
list of integers between 2 bounds
- select: int → int list → int list, which gives the
first prime numbers of a list.
36 Parallel methods
- Simple parallel methods
- using a kind of scan
- using a direct sieve
- using a recursive one
- Different partitions of the data
- per block (for the scan)
- cyclic distribution, e.g. with 3 processors:
processor 0: 11,14,17,20,23
processor 1: 12,15,18,21,24
processor 2: 13,16,19,22,25
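The two distributions can be sketched as owner functions over the integers 2..n; the cyclic mapping (v − 2) mod p matches the example rows above (11, 14, 17, … on processor 0 with 3 processors). P is hypothetical:

```c
#define P 3

/* Block distribution: contiguous ranges of ceil((n-1)/P) integers. */
int owner_block(int v, int n) {          /* v in [2, n] */
    int count = n - 1;                   /* how many integers 2..n */
    int block = (count + P - 1) / P;     /* ceil(count / P) per processor */
    return (v - 2) / block;
}

/* Cyclic distribution: 2 -> 0, 3 -> 1, 4 -> 2, 5 -> 0, ... */
int owner_cyclic(int v) {
    return (v - 2) % P;
}
```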
37 Scan version
- Method using a scan
- Each processor computes a local sieve (processor 0 thus
contains the first prime numbers)
- then our scan is applied, and on processor i we eliminate
the integers that are multiples of integers of processors
i-1, i-2, etc.
- Cost: as a scan (logarithmic number of super-steps)
38 Direct version
- Method
- each processor computes a local sieve
- then the integers that are less than √n are globally
exchanged, and a new sieve is applied to this list of
integers (thus giving the first prime numbers)
- each processor eliminates, in its own list, the integers
that are multiples of these first primes
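A sequential simulation of the direct version (a sketch; it assumes the globally exchanged integers are those up to √N and a block distribution, with N and P hypothetical):

```c
#define N 100
#define P 4

static int is_prime[N + 1];

void direct_sieve(void) {
    int limit = 1;                                /* limit = floor(sqrt(N)) */
    while ((limit + 1) * (limit + 1) <= N) limit++;
    for (int v = 0; v <= N; v++) is_prime[v] = (v >= 2);

    /* sieve of the exchanged small integers 2..sqrt(N) */
    for (int p = 2; p <= limit; p++)
        if (is_prime[p])
            for (int m = p * p; m <= limit; m += p) is_prime[m] = 0;

    /* each processor eliminates multiples of these first primes
       in its own block of 2..N */
    int block = (N - 1 + P - 1) / P;
    for (int proc = 0; proc < P; proc++) {
        int lo = 2 + proc * block;
        int hi = (lo + block - 1 < N) ? lo + block - 1 : N;
        for (int p = 2; p <= limit; p++)
            if (is_prime[p])
                for (int m = (lo + p - 1) / p * p; m <= hi; m += p)
                    if (m > p) is_prime[m] = 0;   /* keep p itself prime */
    }
}
```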
39 Inductive version
- Recursive method by induction over n
- We suppose that the inductive step gives the first primes
up to √n
- we perform a total exchange on them to eliminate the
non-primes
- The end of the induction comes from the BSP cost: we stop
when n is small enough that the sequential method is
faster than the parallel one
- Cost
40 Benchmarks and BSP predictions
41 Benchmarks and BSP predictions
42 Benchmarks and BSP predictions
43 Benchmarks and BSP predictions
44 Parallel sample sorting
45 Presentation
- Each processor has a listed set of data (array, list,
etc.)
- The goal is that
- data on each processor are ordered
- data on processor i are smaller than data on processor
i+1
- good balancing
- Parallel sorting is not very efficient due to too many
communications
- But useful, and more efficient than gathering all the
data on one processor and then sorting them
46 BSP Parallel Sort
47 Tiskin's Sampling Sort
(figure: initial data on 3 processors)
processor 0: 1,11,16,7,14,2,20
processor 1: 18,9,13,21,6,12,4
processor 2: 15,5,19,3,17,8,10
48 Tiskin's Sampling Sort
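The general sampling-sort idea can be simulated sequentially: each processor sorts its block, contributes P regular samples, P-1 splitters are chosen from the gathered samples, and each element is routed to the processor owning its splitter interval. This is a generic regular-sampling sketch, not claimed to be Tiskin's exact algorithm; P and B match the 3×7 example data of these slides:

```c
#include <stdlib.h>

#define P 3
#define B 7   /* elements per processor */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* data[i] is processor i's block; out[i] receives its final sorted
   bucket, out_len[i] its size. */
void sample_sort(int data[P][B], int out[P][P * B], int out_len[P]) {
    int samples[P * P], splitters[P - 1];
    for (int i = 0; i < P; i++) {
        qsort(data[i], B, sizeof(int), cmp_int);       /* local sort */
        for (int s = 0; s < P; s++)
            samples[i * P + s] = data[i][s * B / P];   /* P regular samples */
    }
    qsort(samples, P * P, sizeof(int), cmp_int);       /* gather + sort samples */
    for (int s = 1; s < P; s++)
        splitters[s - 1] = samples[s * P];             /* P-1 splitters */
    for (int i = 0; i < P; i++) out_len[i] = 0;
    for (int i = 0; i < P; i++)                        /* route elements */
        for (int k = 0; k < B; k++) {
            int dest = 0;
            while (dest < P - 1 && data[i][k] >= splitters[dest]) dest++;
            out[dest][out_len[dest]++] = data[i][k];
        }
    for (int i = 0; i < P; i++)                        /* final local sort */
        qsort(out[i], out_len[i], sizeof(int), cmp_int);
}
```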
49 Benchmarks and BSP predictions
50 Benchmarks and BSP predictions
51 Matrix multiplication
52 Naive parallel algorithm
- We have two matrices A and B of size n×n
- We suppose
- Each matrix is distributed by blocks of size
- That is, element A(i,j) is on processor
- Algorithm
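The slide leaves the block size and the processor mapping to a figure; a common convention (an assumption here) is an s×s processor grid with s = √p and blocks of size (n/s)×(n/s), so that element A(i,j) lives on processor (i div (n/s), j div (n/s)):

```c
#define S 2   /* sqrt(p): 4 hypothetical processors in a 2x2 grid */
#define N 8   /* matrix size, assumed divisible by S */

/* Row-major id of the processor owning element (i,j). */
int owner(int i, int j) {
    int bs = N / S;                   /* block side length */
    return (i / bs) * S + (j / bs);
}
```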
53 Benchmarks and BSP predictions
54 Benchmarks and BSP predictions
55 Benchmarks and BSP predictions
56 Benchmarks and BSP predictions