Title: On the domain decomposition approach in some convection-diffusion-reaction problems
On the domain decomposition approach in some convection-diffusion-reaction problems
- K. Georgiev (1), Z. Zlatev (2)
- (1) Institute for Parallel Processing, Bulgarian Academy of Sciences, Bulgaria
- (2) National Environmental Research Institute, Roskilde, Denmark
Introduction
- Efficient numerical methods and algorithms for solving convection-diffusion-reaction systems are of high priority, because of the numerous practically important problems in which such systems arise.
- Prominent among these are simulations in
  - air pollution modelling
  - pipe networks
  - acoustics
  - turbulent kinetic energy and its dispersion
  - non-Newtonian flows (extra stresses)
  - magnetohydrodynamics
  - modelling of zeolite filters
  - etc.
The Mathematical Model
- Danish Eulerian Model for long-range transport of air pollutants
Splitting into submodels
Numerical treatment
- Finite elements (1D linear first-order, bilinear, nonconforming)
Numerical treatment in the 2D case
- Finite elements (1D linear first-order, bilinear, nonconforming)
- Predictor-corrector methods with several different correctors in the advection-diffusion submodel
- QSSA (Quasi-Steady-State Algorithm) in the chemistry-emission submodel (a minimal sketch of a QSSA step is given below)
- Exact solution in the deposition submodel
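The slides only name QSSA, so the following is a minimal sketch of how a single QSSA step for one species is commonly implemented (using the standard quasi-steady-state treatment of dc/dt = P - L*c), not the actual UNI-DEM routine; the function name and the regime thresholds are assumptions.

   /* Minimal sketch of a QSSA step for one species: the production term P and
      the loss coefficient L are assumed to have been evaluated from the current
      concentrations; names and thresholds are illustrative.                    */
   #include <math.h>

   double qssa_step(double c, double P, double L, double dt)
   {
       if (L * dt < 0.01)        /* slow loss: an explicit Euler step suffices  */
           return c + dt * (P - L * c);
       if (L * dt > 10.0)        /* fast loss: species is close to steady state */
           return P / L;
       /* intermediate regime: exact solution of dc/dt = P - L*c with frozen P, L */
       return P / L + (c - P / L) * exp(-L * dt);
   }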
Size of the computational task
Parallelization strategy
- Distributed-memory parallelization model via the Message Passing Interface (MPI): maximum portability of the code
- Based on domain decomposition of the horizontal grid
  - domain overlapping in the advection-diffusion submodel (a halo-exchange sketch is given below)
  - non-overlapping subdomains in the chemistry-deposition submodel
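Because the advection-diffusion subdomains overlap, each process has to refresh its overlap (halo) rows from its neighbours at every step. The sketch below shows one way to do this with MPI_Sendrecv for a one-dimensional strip decomposition of the horizontal grid; the strip layout, the halo width of one row and all names are assumptions, not details of UNI-DEM.

   /* Sketch: exchange one halo row with the north/south neighbours for a 1-D
      strip decomposition of the horizontal grid. Rows 1..nloc are interior,
      rows 0 and nloc+1 are halos; neighbours outside the domain are passed
      as MPI_PROC_NULL.                                                       */
   #include <mpi.h>

   #define NX 96                /* number of columns in the strip (assumption) */

   void exchange_halo(double *c, int nloc, int north, int south, MPI_Comm comm)
   {
       /* send my northern edge, receive my southern halo */
       MPI_Sendrecv(&c[1 * NX],          NX, MPI_DOUBLE, north, 0,
                    &c[(nloc + 1) * NX], NX, MPI_DOUBLE, south, 0,
                    comm, MPI_STATUS_IGNORE);
       /* send my southern edge, receive my northern halo */
       MPI_Sendrecv(&c[nloc * NX],       NX, MPI_DOUBLE, south, 1,
                    &c[0],               NX, MPI_DOUBLE, north, 1,
                    comm, MPI_STATUS_IGNORE);
   }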
DD (domain decomposition) approach
Need for parallel computations
- SunFire 6800 (24 CPU UltraSparc-III/750 MHz) at DTU, Lyngby, Denmark
- Grid 480 x 480 (10 km resolution)
- 2D version of UNI-DEM
- ONE processor
- 4 017 852 sec. = 1116 h = 46.5 days!
Danish Eulerian Model (UNI-DEM)
- vector computers (CRAY C92A, Fujitsu, etc.)
- parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, etc.)
- parallel computers with shared memory (SGI Origin, SUN, etc.)
- parallel computers with two levels of parallelism (IBM SMP, Macintosh G4 clusters, etc.)
Some numerical experiments
- Machines used
  - SunFire 6800 at DTU, Lyngby, Denmark (24 CPU UltraSparc-III/750 MHz)
  - Macintosh G4 Power PC cluster at IPP, Sofia (Linux cluster, 4 nodes x 2 CPU G4/450 MHz)
Some numerical experiments
- Computing time (in sec.) and speedup on the Macintosh cluster
  Grid size    1 proc.     2 proc.          4 proc.          8 proc.
  96 x 96      65 036      33 698 (1.93)    17 185 (3.78)    9 684 (6.72)
  288 x 288    1 338 960   699 424 (1.91)   366 548 (3.65)   175 066 (7.65)
- Computing time (in sec.) and speedup on the SunFire 6800
  Grid size    1 proc.     2 proc.          4 proc.          8 proc.
  96 x 96      52 744      23 217 (2.27)    13 296 (3.97)    7 765 (6.79)
  288 x 288    709 339     327 400 (2.17)   198 030 (3.58)   100 033 (7.09)
Some numerical experiments
- SunFire 6800 vs. Macintosh G4 cluster
- Ratio of top performances: SUN top performance / G4 top performance = 1.67
- Ratio of computing times (Macintosh G4 time / SunFire 6800 time):
  Grid size    1 proc.   2 proc.   4 proc.   8 proc.
  96 x 96      1.23      1.45      1.29      1.25
  288 x 288    1.89      2.14      1.85      1.75
Acknowledgments
- This research was supported in part by
  - Grant IO-01/03 of the Bulgarian NSF
  - a grant from the NATO Scientific Programme (CRG 960505)
Implementations
- for sequential computers
- for vector computers (CRAY C92A, Fujitsu, etc.)
- for parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, etc.)
- for parallel computers with shared memory (SGI Origin, SUN, etc.)
- for parallel computers with two levels of parallelism: distributed memory between the nodes and shared memory inside the nodes, each of which consists of several processors (IBM SMP, clusters of multiprocessor nodes, etc.); a hybrid MPI + OpenMP skeleton is sketched below
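On machines with two levels of parallelism the usual pattern is MPI between the nodes and OpenMP threads inside each node. The skeleton below only illustrates that pattern; the grid size, the work split and all names are assumptions, not the UNI-DEM implementation.

   /* Sketch of a two-level (MPI + OpenMP) setup: one MPI process per node,
      OpenMP threads over the local grid points inside each node.           */
   #include <mpi.h>
   #include <omp.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int provided, rank, nprocs;

       /* thread support level in which only the master thread calls MPI */
       MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

       int nlocal = 96 * 96 / nprocs;  /* grid points per process (illustrative) */

       #pragma omp parallel for schedule(static)
       for (int i = 0; i < nlocal; i++) {
           /* chemistry/deposition work for local grid point i would go here */
       }

       if (rank == 0)
           printf("%d MPI processes, up to %d OpenMP threads each\n",
                  nprocs, omp_get_max_threads());

       MPI_Finalize();
       return 0;
   }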
Space discretization
- 32 x 32 (150 km resolution)
- 96 x 96 (50 km resolution)
- 288 x 288 (16.7 km resolution)
- 480 x 480 (10 km resolution)
Need for parallel computations
- Number of equations per system of ODEs treated at every time-step (typically 3456 time-steps for a one-month period)
  No. of species   32x32x10    96x96x10     288x288x10    480x480x10
  35               358 400     3 225 600    29 030 400    80 640 000
  56               573 440     5 160 960    46 448 640    129 024 000
  168              1 720 320   15 482 880   139 345 920   387 072 000
Computing time per module
- Computing time (in sec.) on one processor of the IBM SMP
  Module              Comp. time   Percent
  Chemistry           16 147       83.09
  Advection           3 013        15.51
  Initialization      2            0.00
  Input operations    50           0.26
  Output operations   220          1.13
  Total time          19 432       100.00
Chemistry chunks
- To reduce the computing time used in the chemical module, it is worthwhile to divide the arrays into smaller portions (chunks).
- Copy data from the appropriate sections of the large arrays into small arrays where the chunks are stored. Then the bulk of the computational work is performed using data from the small arrays (which will hopefully stay in the caches).
Chemistry chunks
- Let M be the length of the leading dimension of the 2D arrays used in the chemical module.
- Divide these arrays into nchunks chunks.
- The leading dimension of the resulting smaller arrays is nsize = M/nchunks.
Chemistry chunks
- The largest arrays in the chemical submodel are
  - (a) three arrays for the concentrations,
  - (b) one array for the emissions,
  - (c) one array for the time-dependent chemical rate coefficients, and
  - (d) three arrays for the depositions (dry, wet and total).
Chemistry chunks
  DO ichunk = 1, nchunks
     Copy chunk ichunk from some of the eight large arrays into small
     two-dimensional arrays with leading dimension nsize
     DO j = 1, nspecies
        DO i = 1, nsize
           Perform the chemical reactions involving species j
           for grid-point i
        END DO
     END DO
     Copy some of the small two-dimensional arrays with leading dimension
     nsize back into chunk ichunk of the corresponding large arrays
  END DO
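A compilable C sketch of the same cache-blocking idea as the loop above, assuming row-major storage with one large concentration array of NSPECIES x M values; all sizes and names are illustrative and not taken from UNI-DEM.

   /* Sketch: perform the chemical work in chunks of nsize grid points so that
      the working set of the inner loops stays in cache.                       */
   #include <string.h>

   #define NSPECIES 35
   #define M        9216               /* 96 x 96 grid points (assumption)     */
   #define MAXCHUNK 2048               /* nsize is assumed not to exceed this  */

   void chemistry_in_chunks(double conc[NSPECIES][M], int nchunks)
   {
       static double small[NSPECIES][MAXCHUNK];   /* buffer for one chunk     */
       int nsize = M / nchunks;                   /* assume nchunks divides M */

       for (int ichunk = 0; ichunk < nchunks; ichunk++) {
           int start = ichunk * nsize;

           /* copy chunk ichunk of every species into the small buffer */
           for (int j = 0; j < NSPECIES; j++)
               memcpy(small[j], &conc[j][start], nsize * sizeof(double));

           /* bulk of the work: chemical reactions on the cached chunk */
           for (int j = 0; j < NSPECIES; j++)
               for (int i = 0; i < nsize; i++) {
                   /* reaction terms for species j at grid point start+i
                      would update small[j][i] here                      */
               }

           /* copy the updated chunk back into the large array */
           for (int j = 0; j < NSPECIES; j++)
               memcpy(&conc[j][start], small[j], nsize * sizeof(double));
       }
   }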
Chemistry chunks
- Computing time (in sec.), 2D UNI-DEM, (96 x 96) grid, one processor
  Chunk size   Fujitsu   Origin 2000   Mac G4   IBM SMP
  1            76 964    14 847        6 952    10 313
  48           2 611     12 114        5 792    5 225
  9216         494       18 549        12 893   19 432
Chemistry chunks
- Conclusions
- The optimal length of the chunks depends on the memory hierarchy of the computer.
- The length of the chunks, nsize, should be a parameter which can be selected in the main program.
Chemistry chunks
- Conclusions (cont.)
- The use of long chunks leads to many cache misses and, therefore, to a very significant increase of the computing time.
- The use of very short chunks increases too much the number of copies that have to be made and, thus, the computing time.
- The use of medium-sized chunks, over a rather large range, normally gives very good results.
UNI-DEM on shared-memory computers
- OpenMP directives are used
- (i) to get good results on different shared-memory computers when such directives are used, and
- (ii) to achieve a high degree of portability.
UNI-DEM on shared-memory computers
- It is important to identify the parallel tasks and to group them in an appropriate way when necessary.
- The horizontal advection and diffusion
- The horizontal advection and diffusion can be carried out independently for every chemical compound (and, in the 3-D version, for every layer).
UNI-DEM on shared-memory computers
- The chemistry and deposition
- These two processes can be carried out in parallel for every grid-point.
- The number of parallel tasks is equal to the number of grid-points, but each task is small.
- Therefore, the tasks should be grouped in an appropriate way (see the OpenMP sketch below).
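One common way to group the many small per-grid-point chemistry/deposition tasks under OpenMP is to parallelize over chunks of grid points rather than over individual points. The sketch below only illustrates that grouping; the chunk size, array layout and names are assumptions, not UNI-DEM's code.

   /* Sketch: OpenMP parallelization of the chemistry/deposition step over
      chunks of grid points, so each thread gets a few sizeable tasks
      instead of very many tiny ones.                                      */
   #include <omp.h>

   #define NPOINTS  (96 * 96)    /* horizontal grid points (assumption) */
   #define NSPECIES 35
   #define CHUNK    48           /* grid points handled per task        */

   void chemistry_step(double conc[NSPECIES][NPOINTS], double dt)
   {
       (void)dt;                 /* dt would be used by the real reaction terms */
       #pragma omp parallel for schedule(dynamic)
       for (int start = 0; start < NPOINTS; start += CHUNK) {
           int end = (start + CHUNK < NPOINTS) ? start + CHUNK : NPOINTS;
           for (int j = 0; j < NSPECIES; j++)
               for (int i = start; i < end; i++) {
                   /* chemistry and deposition update of conc[j][i] over the
                      time step would be performed here                      */
               }
       }
   }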
UNI-DEM on shared-memory computers
- The vertical exchange
- The vertical exchange along each vertical grid-line is a parallel task; the number of these tasks is determined by the grid size (Nx x Ny x Nz).
- If the grid is fine, the number of these tasks becomes enormous.
- However, the parallel tasks are not very big and have to be grouped.
UNI-DEM on distributed-memory computers
- Message Passing Interface (MPI)
- The space domain of the model is divided into several sub-domains (the number of sub-domains being equal to the number of processors assigned to the job); a partitioning sketch is given below.
- Each processor works on its own sub-domain.
DD of the domain
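A minimal sketch of how such a partition of the grid rows over the MPI processes can be computed, assuming a strip-wise splitting of the Ny rows; the layout and names are assumptions about one reasonable choice, not the decomposition actually used in UNI-DEM.

   /* Sketch: split Ny grid rows into contiguous strips, one per MPI process,
      distributing any remainder over the first few ranks.                   */
   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank, nprocs, Ny = 480;          /* 480 x 480 grid (assumption) */

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

       int base  = Ny / nprocs, rem = Ny % nprocs;
       int nloc  = base + (rank < rem ? 1 : 0);             /* rows on this rank */
       int first = rank * base + (rank < rem ? rank : rem); /* first global row  */

       printf("rank %d handles rows %d..%d\n", rank, first, first + nloc - 1);

       MPI_Finalize();
       return 0;
   }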
UNI-DEM on distributed-memory computers
- The pre-processing procedure
- The input data (the meteorological data and the emission data) are distributed, consistently with the sub-domains, to the assigned processors.
- Not only does each processor work on its own sub-domain, it also has access to all the meteorological and emission data needed in the run.
UNI-DEM on distributed-memory computers
- The post-processing procedure
- During the run, each processor prepares output data files for its own sub-domain. At the end of the job all these files have to be collected on one of the processors and prepared for future use.
- The pre-processing and post-processing procedures are used in order to reduce as much as possible the communication during the actual computations.
Some performance results
- SGI Origin 2000, (96 x 96 x 10) grid
  Proc.   Time (sec.)   Speedup   Efficiency (%)
  1       42 907        --        --
  32      2 215         19.37     61
Some performance results
- IBM SP, (480 x 480) grid
  Proc.   Time (sec.)   Speedup   Efficiency (%)
  8       54 978        --        --
  32      15 998        3.44      86
Some performance results
- IBM SMP, (96 x 96) grid
  Proc.   Time (sec.)   Speedup   Efficiency (%)
  1       5 978         --        --
  16      424           12.32     72