Title: Mini Symposium: Adaptive Algorithms for Scientific Computing
2 Mini Symposium: Adaptive Algorithms for Scientific Computing
- Adaptive, hybrid, oblivious: what do these terms mean?
- Taxonomy of autonomic computing [Ganek & Corbi 2003]: self-configuring / self-healing / self-optimising / self-protecting
- Objective: towards an analysis based on the algorithm's performance
- 9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch et al., AHA Team, INRIA-CNRS, Grenoble, France
- 10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
- 10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, Gudula Rünger, U. Bayreuth, Germany
- 11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA
3 Adaptive algorithms: Theory and applications
- Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
- IMAG-INRIA Workgroup on Adaptive and Hybrid Algorithms, Grenoble, France
Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
4 Why adaptive algorithms, and how?
Resource availability is versatile, and input data vary. Measures on resources and measures on data allow adaptation to improve performance:
- Scheduling: partitioning, load-balancing, work-stealing
- Calibration: tuning parameters (block size, cache, choice of instructions), priority management
5 Modeling a hybrid algorithm
- Several algorithms solve the same problem f, e.g. algo_f1, algo_f2(block size), ..., algo_fk
- Each algo_fi may be recursive: algo_fi(n, ...) calls f(n-1, ...), f(n/2, ...), etc. (a sketch follows this slide)
- E.g. practical hybrids:
- Atlas, Goto, FFPack
- FFTW
- cache-oblivious B-tree
- any parallel program with scheduling support: Cilk, Athapascan/Kaapi, Nesl, TLib
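As a minimal, self-contained sketch of this model (hypothetical code, not from the talk: the names algo_f1 and hybrid_f, the reduction problem, and the cutoff 64 are all assumptions), a simple hybrid dispatches between a plain variant and a recursive splitting:

#include <cstddef>
#include <numeric>
#include <vector>

// algo_f1: plain sequential variant of f (here, f = sum reduction).
double algo_f1(const double* x, std::size_t n) {
    return std::accumulate(x, x + n, 0.0);
}

// Recursive dispatcher: the single threshold makes this a "simple"
// hybrid (O(1) choices); letting the split factor vary at every
// recursive level would make it "baroque".
double hybrid_f(const double* x, std::size_t n) {
    if (n < 64) return algo_f1(x, n);        // tuned cutoff (assumed value)
    std::size_t h = n / 2;                   // here: f(n) -> f(n/2), f(n - n/2)
    return hybrid_f(x, h) + hybrid_f(x + h, n - h);
}

int main() {
    std::vector<double> v(1000, 1.0);
    return hybrid_f(v.data(), v.size()) == 1000.0 ? 0 : 1;
}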
6 How to manage the overhead due to choices?
- Classification 1/2:
- Simple hybrid iff O(1) choices, e.g. the block size in Atlas
- Baroque hybrid iff an unbounded number of choices, e.g. the recursive splitting factors in FFTW
- Choices are either dynamic or pre-computed based on input properties.
7 Choices may or may not be based on architecture parameters.
- Classification 2/2: a hybrid is
- Oblivious: the control flow depends neither on static properties of the resources nor on the input, e.g. cache-oblivious algorithms [Bender]
- Tuned: strategic choices are based on static parameters, e.g. block size w.r.t. cache, granularity; either engineered-tuned or self-tuned, e.g. the ATLAS and GOTO libraries, FFTW, LinBox/FFLAS [Saunders et al.]
- Adaptive: self-configuration of the algorithm, dynamic, based on input properties or resource circumstances discovered at run-time (e.g. idle processors, data properties), e.g. TLib [Rauber & Rünger]
8 Examples
- BLAS libraries
- Atlas: simple, self-tuned
- Goto: simple, engineered-tuned
- LinBox / FFLAS: simple, self-tuned and adaptive [Saunders et al.]
- FFTW
- Halving factor: baroque, tuned
- Stopping criterion: simple, tuned
- Parallel algorithms and scheduling
- Choice of parallel degree, e.g. TLib [Rauber & Rünger]
- Work-stealing schedule: baroque hybrid
9 Adaptive algorithms: Theory and applications
- Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
- INRIA-CNRS Project on Adaptive and Hybrid Algorithms, Grenoble, France
Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
10 Work-stealing (1/2)
Work W1 = total number of operations performed.
Depth W∞ = number of operations on a critical path (parallel time on an unbounded number of resources).
- Work-stealing: a greedy schedule, but distributed and randomized (a toy deque sketch follows this slide)
- Each processor manages locally the tasks it creates
- When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
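A minimal sketch of the per-processor structure this describes (an illustration only: real runtimes such as Cilk and Kaapi use lock-free protocols rather than a mutex, and the class name here is invented):

#include <deque>
#include <functional>
#include <mutex>
#include <optional>

using Task = std::function<void()>;

// Per-processor task deque: the owner works depth-first at the
// bottom (like a call stack); a thief steals the oldest ready task
// at the top (breadth-first).
class WorkStealingDeque {
    std::deque<Task> tasks;
    std::mutex m;
public:
    void push(Task t) {                       // owner creates a task
        std::lock_guard<std::mutex> l(m);
        tasks.push_back(std::move(t));
    }
    std::optional<Task> pop() {               // owner resumes its newest task
        std::lock_guard<std::mutex> l(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back()); tasks.pop_back(); return t;
    }
    std::optional<Task> steal() {             // thief takes the oldest task
        std::lock_guard<std::mutex> l(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front()); tasks.pop_front(); return t;
    }
};

int main() {
    WorkStealingDeque d;
    d.push([]{ /* work */ });
    if (auto t = d.steal()) (*t)();           // an idle processor steals
}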
11 Work-stealing (2/2)
Work W1 = total number of operations performed.
Depth W∞ = number of operations on a critical path (parallel time on an unbounded number of resources).
- Interests: suited to heterogeneous architectures, with slight modification [Bender-Rabin 02]; with good probability, a near-optimal schedule on p processors with average speed πave: Tp < W1/(p·πave) + O(W∞/πave)
- NB: the number of successful steals (task migrations) is < p·W∞ [Blumofe 98, Narlikar 01, Bender 02]
- Implementation: work-first principle [Cilk, Kaapi]
- Local parallelism is implemented by sequential function calls
- Restrictions to ensure validity of the default sequential schedule: series-parallel in Cilk, reference order in Kaapi
12 Work-stealing and adaptability
- Work-stealing ensures allocation of processors to tasks transparently to the application, with provable performance
- Supports addition of new resources
- Supports resilience of resources and fault tolerance (crash faults, network failures, ...): checkpoint/restart mechanisms with provable performance [Porch, Kaapi, ...]
- Baroque hybrid adaptation: there is an (implicit) dynamic choice between two algorithms:
- a sequential (local) algorithm, depth-first (the default choice)
- a parallel algorithm, breadth-first
- The choice is performed at runtime, depending on resource idleness
- Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]:
- parallel Divide & Conquer computations
- tree searching, Branch & Bound
- Suited when both the sequential and the parallel algorithm perform (almost) the same number of operations
13 But parallelism often has a cost!
- Solution: mix a sequential and a parallel algorithm
- Basic technique: run the parallel algorithm down to a certain "grain", then use the sequential one
- Problem: W∞ increases too, and with it the number of migrations and the inefficiency
- Work-preserving speed-up [Bini-Pan 94], cascading [Jaja 92]: a careful interplay of both algorithms builds one with both W∞ small and W1 = O(Wseq):
- divide the sequential algorithm into blocks
- each block is computed with the (non-work-optimal) parallel algorithm (see the sketch below)
- drawback: sequential at coarse grain, parallel at fine grain
- Adaptive granularity: the dual approach; parallelism is extracted at run-time from any sequential task
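A sketch of the block technique for a prefix product (hypothetical code: the inner loop stands in for the non-work-optimal parallel algorithm, which would process one block in parallel, while the carry chains the blocks sequentially at coarse grain):

#include <algorithm>
#include <cstddef>
#include <vector>

// Run the computation block by block: blocks are chained
// sequentially (coarse grain); within a block, a parallel prefix
// algorithm would be used, keeping W1 = O(Wseq) overall.
void blocked_prefix(std::vector<double>& a, std::size_t block) {
    double carry = 1.0;                       // prefix of everything before the block
    for (std::size_t s = 0; s < a.size(); s += block) {
        std::size_t e = std::min(s + block, a.size());
        for (std::size_t i = s; i < e; ++i)   // stand-in for the parallel algorithm
            a[i] = (i == s ? carry : a[i - 1]) * a[i];
        carry = a[e - 1];                     // sequential dependency between blocks
    }
}

int main() {
    std::vector<double> a(8, 2.0);
    blocked_prefix(a, 3);
    return a[7] == 256.0 ? 0 : 1;             // 2^8
}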
14 Self-adaptive grain algorithm
- Based on the work-first principle: always execute a sequential algorithm, to reduce parallelism overhead
- Use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation
- Hypothesis: two algorithms (sketched after this slide)
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation; at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
- Examples: iterated product [Vernizzi 05], gzip / compression [Kerfali 04], MPEG-4 / H264 [Bernard 06], prefix computation [Traore 06]
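A sketch of this coupling for a simple loop, using the slide's names SeqCompute and LastPartComputation (the shared-range representation and the per-iteration lock are simplifications: a real work-first implementation keeps the owner's path lock-free):

#include <cstddef>
#include <mutex>

// Shared range [first, last) of remaining iterations.
struct AdaptiveRange {
    std::size_t first = 0, last = 0;
    std::mutex m;
};

// SeqCompute: the default sequential algorithm, run by the owner.
template <class Body>
void seq_compute(AdaptiveRange& r, Body body) {
    for (;;) {
        std::size_t i;
        {
            std::lock_guard<std::mutex> l(r.m);
            if (r.first >= r.last) return;    // done, or the rest was stolen
            i = r.first++;
        }
        body(i);                              // one sequential step
    }
}

// LastPartComputation: an idle thief extracts the second half of the
// remaining work, to be processed in parallel; fails if too small.
inline bool last_part_computation(AdaptiveRange& r,
                                  std::size_t& sf, std::size_t& sl) {
    std::lock_guard<std::mutex> l(r.m);
    if (r.last - r.first < 2) return false;
    std::size_t mid = r.first + (r.last - r.first) / 2;
    sf = mid; sl = r.last;                    // thief takes [mid, last)
    r.last = mid;                             // owner keeps [first, mid)
    return true;
}

int main() {
    AdaptiveRange r; r.last = 100;
    long sum = 0;
    seq_compute(r, [&](std::size_t i) { sum += (long)i; }); // owner alone
    return sum == 4950 ? 0 : 1;
}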
15 Adaptive algorithms: Theory and applications
- Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
- INRIA-CNRS Project on Adaptive and Hybrid Algorithms, Grenoble, France
Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation
16 Prefix computation: an example where parallelism always costs
π1 = a0·a1, π2 = a0·a1·a2, ..., πn = a0·a1···an
- Sequential algorithm: for (i = 1; i <= n; i++) π[i] = π[i-1] · a[i]; so W1 = W∞ = n
- Parallel algorithm [Ladner-Fischer] (a recursive sketch follows):
[Tree diagram over a0, a1, a2, a3, a4, ..., an-1, an]
W∞ = 2·log n, but W1 ≈ 2·n: twice as expensive as the sequential algorithm
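A sketch of the recursive pairing scheme behind Ladner-Fischer, written here as sequential C++ (the function name and types are assumptions; the two loops at each level are the parts that would run in parallel):

#include <cstddef>
#include <vector>

// Pair up neighbours, recurse on the n/2 products, then fill in:
// odd positions are copies from the recursion, even positions cost
// one product each. Counting products: ~n per level across both
// loops, ~2n in total (W1), with ~2 log n parallel steps (W_inf).
void parallel_prefix(std::vector<double>& a) {
    std::size_t n = a.size();
    if (n <= 1) return;
    std::vector<double> b(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)   // parallel step: pairing
        b[i] = a[2 * i] * a[2 * i + 1];
    parallel_prefix(b);                        // recurse on n/2 pairs
    for (std::size_t i = 1; i < n; ++i)        // parallel step: fill in
        a[i] = (i % 2) ? b[i / 2] : b[i / 2 - 1] * a[i];
}

int main() {
    std::vector<double> a{2, 2, 2, 2};
    parallel_prefix(a);                        // a = {2, 4, 8, 16}
    return a[3] == 16.0 ? 0 : 1;
}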
17 Adaptive prefix computation
- Any parallel prefix performs at least W1 ≥ 2n - W∞ operations
- Strict lower bound on p identical processors: Tp ≥ 2n/(p+1), by a block algorithm with pipeline [Nicolau et al. 2000]
- Application of the adaptive scheme:
- one process performs the main sequential computation
- the other, work-stealer, processes compute parallel "segmented" prefixes
- Near-optimal performance on processors with changing speeds: Tp < 2n/((p+1)·πave) + O(log n / πave), where the first term matches the lower bound (both bounds are restated in LaTeX below)
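Written out in LaTeX, the lower bound and the adaptive guarantee stated above read:

% Lower bound for any prefix computation on p identical processors,
% and the adaptive algorithm's guarantee on processors of average
% speed \pi_{ave} (both as stated on this slide):
\[
  T_p \ge \frac{2n}{p+1},
  \qquad
  T_p < \frac{2n}{(p+1)\,\pi_{\mathrm{ave}}}
        + O\!\left(\frac{\log n}{\pi_{\mathrm{ave}}}\right).
\]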
18 Scheme of the proof
- Dynamic coupling of two algorithms that complete simultaneously:
- the sequential one performs the (optimal) number of operations S
- the parallel one performs X operations
- dynamic splitting is always possible down to the finest grain, but execution stays locally sequential
- scheduled by work-stealing on p-1 processors
- the critical path is small: O(log X)
- each non-constant-time task can be split (variable speeds)
- Analysis:
- the algorithmic scheme ensures Ts = Tp + O(log X), which bounds the total number X of operations performed, and hence the overhead of parallelism (S + X minus the optimal number of operations)
- comparison to the lower bound on the number of operations concludes
19-24 Adaptive Prefix on 3 processors
[Animation slides: successive snapshots of the adaptive computation; the labels π1, π3, π5, π6, π7, π8, π9, π10, π11, π12 mark the prefix values computed at each step. The final frame notes: implicit critical path on the sequential process.]
25 Adaptive prefix: some experiments
Joint work with Daouda Traore.
Prefix of 10,000 elements on an 8-processor SMP (IA64 / Linux).
[Two plots: time (s) versus number of processors, with and without external load.]
- Multi-user context: Adaptive is the fastest, with a 15% benefit over a static-grain algorithm
- Single-user context: Adaptive is equivalent to
- the sequential algorithm on 1 processor
- the optimal 2-processor parallel algorithm on 2 processors
- ...
- the optimal 8-processor parallel algorithm on 8 processors
26 The Prefix race: sequential / parallel, fixed / adaptive
On each of the 10 executions, the adaptive version completes first.
27 With double sum (r_i = r_{i-1} + x_i)
Finest grain limited to 1 page = 16384 bytes = 2048 doubles.
Single user; processors with variable speeds.
Remark: for n = 4,096,000 doubles,
- pure sequential: 0.20 s
- minimal grain of 100 doubles: 0.26 s on 1 processor and 0.175 s on 2 processors (close to the lower bound)
28 E.g. Triangular system solving
1/ x1 = b1 / a11
2/ For k = 2..n: bk = bk - ak1·x1
[Diagram: a system of dimension n reduces to a system of dimension n-1; a sketch follows.]
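A sequential sketch of this elimination scheme (hypothetical code; dense row-major storage is an assumption):

#include <cstddef>
#include <vector>

// Forward substitution written as on the slide: solve for x1, then
// update the right-hand side, so a lower-triangular system of
// dimension n becomes one of dimension n-1.
void solve_lower_triangular(const std::vector<std::vector<double>>& A,
                            std::vector<double>& b,
                            std::vector<double>& x) {
    std::size_t n = b.size();
    for (std::size_t k = 0; k < n; ++k) {
        x[k] = b[k] / A[k][k];                 // step 1: x1 = b1 / a11
        for (std::size_t i = k + 1; i < n; ++i)
            b[i] -= A[i][k] * x[k];            // step 2: bk -= ak1 * x1
    }
}

int main() {
    std::vector<std::vector<double>> A{{2, 0}, {1, 4}};
    std::vector<double> b{4, 6}, x(2);
    solve_lower_triangular(A, b, x);           // 2x = 4; x + 4y = 6
    return (x[0] == 2.0 && x[1] == 1.0) ? 0 : 1;
}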
29 E.g. Triangular system solving (diagram slide)
30 Conclusion
- Adaptive: what are the choices, and how to choose?
- Illustration: adaptive parallel prefix based on work-stealing
- a self-tuned, baroque hybrid: O(p·log n) choices
- achieves near-optimal performance
- processor-oblivious
- A generic adaptive scheme to implement parallel algorithms with provable performance
31 Mini Symposium: Adaptive Algorithms for Scientific Computing
- Adaptive, hybrid, oblivious: what do these terms mean?
- Taxonomy of autonomic computing [Ganek & Corbi 2003]: self-configuring / self-healing / self-optimising / self-protecting
- Objective: towards an analysis based on the algorithm's performance
- 9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch et al., AHA Team, INRIA-CNRS, Grenoble, France
- 10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
- 10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, U. Bayreuth, Germany
- 11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA
32 Questions?
33 Some examples (1/2)
- Adaptive algorithms used empirically and theoretically:
- Atlas (2001): dense linear algebra library
- instruction set and instruction schedule
- self-calibration of the block size at installation time on the machine
- FFTW (1998, ...): FFT(n) → p FFT(q) and q FFT(p)
- for any n, for any recursive call FFT(n), pre-compute the best value for p: the optimal splitting for vector size n is pre-computed on the machine
- Cache-oblivious B-trees:
- recursive block splitting to minimize page faults
- self-adaptation to the memory hierarchy
- Work-stealing (Cilk (1998, ...), ... (2000, ...)): recursive parallelism
- choice between a sequential depth-first schedule and a breadth-first schedule
- "work-first" principle: optimize the local sequential execution and put the overhead on the rare steals from idle processors
- implicitly adaptive
34 Some examples (2/2)
- Moldable tasks: bi-criteria scheduling with guarantees [Trystram et al. 2004]
- recursive combination alternating an approximation for each criterion
- self-adaptation with guaranteed performance for each criterion
- "Cache-oblivious" algorithms [Bender et al. 2004]
- recursive block splitting that minimizes page faults
- self-adaptation to the memory hierarchy (B-tree)
- "Processor-oblivious" algorithms [Roch et al. 2005]
- recursive combination of 2 algorithms, sequential and parallel
- self-adaptation to resource idleness
35 Best case: the parallel algorithm is efficient
- W∞ is small and W1 = Wseq
- the parallel algorithm is an optimal sequential one
- examples: parallel Divide & Conquer algorithms
- Implementation: work-first principle, no overhead for local execution of tasks (see the sketch below)
- Examples:
- Cilk: THE protocol
- Kaapi: compare-and-swap only
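An illustration of the compare-and-swap idea (a hypothetical sketch, not the Kaapi or Cilk source; real protocols make the owner's common path even cheaper than a CAS):

#include <atomic>

// A task slot the owner executes by default; a thief claims it with
// a single atomic compare-and-swap, so exactly one of the two runs
// it and the local sequential path needs no lock.
enum class State { Ready, Claimed };

struct TaskSlot {
    std::atomic<State> state{State::Ready};

    bool run_local() {                        // owner path (work-first)
        State expected = State::Ready;
        return state.compare_exchange_strong(expected, State::Claimed);
    }
    bool try_steal() {                        // thief path: identical claim
        State expected = State::Ready;
        return state.compare_exchange_strong(expected, State::Claimed);
    }
};

int main() {
    TaskSlot s;
    bool owner = s.run_local();               // succeeds
    bool thief = s.try_steal();               // fails: already claimed
    return (owner && !thief) ? 0 : 1;
}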
36 Experimentation: knary benchmark
Distributed architecture: iCluster, Athapascan

procs   speed-up
8       7.83
16      15.6
32      30.9
64      59.2
100     90.1

SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
Ts = 2397 s ≈ T1 = 2435 s
37 High potential degree of parallelism
In "practice": coarse granularity, splitting into p resources. Drawback: heterogeneous architectures, dynamic speeds (πi(t) = speed of processor i at time t).
In "theory": fine granularity, maximal parallelism. Drawback: overhead of task management.
How to choose/adapt the granularity?
38 How to obtain an efficient fine-grain algorithm?
- Hypotheses for efficiency of work-stealing:
- the parallel algorithm is "work-optimal"
- T∞ is very small (recursive parallelism)
- Problem: fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
- overhead due to parallelism creation and synchronization
- but also arithmetic overhead
39 Self-adaptive grain algorithms
- Recursive computations with local sequential computation
- Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm
- Hypothesis: two algorithms
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation → at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
- Examples: iterated product [Vernizzi], gzip / compression [Kerfali], MPEG-4 / H264 [Bernard], prefix computation [Traore]
40 Adaptive Prefix versus optimal on identical processors
41 Illustration: adaptive parallel prefix
- Adaptive parallel computing on non-uniform and shared resources
- Example: adaptive prefix computation
42 Indeed, parallelism often costs... E.g. prefix computation: P1 = a0·a1, P2 = a0·a1·a2, ..., Pn = a0·a1···an
- Sequential algorithm: for (i = 1; i <= n; i++) P[i] = P[i-1] · a[i]; W1 = n
- Parallel algorithm [Ladner-Fischer]: W∞ = 2·log n, but W1 ≈ 2·n: twice as expensive as the sequential algorithm