Title: On the domain decomposition approach in some convection-diffusion-reaction problems
On the domain decomposition approach in some convection-diffusion-reaction problems
- K. Georgiev (1), Z. Zlatev (2)
- (1) Institute for Parallel Processing, Bulgarian Academy of Sciences, Bulgaria
- (2) National Environmental Research Institute, Roskilde, Denmark
Introduction
- Efficient numerical methods and algorithms for solving convection-diffusion-reaction systems are of high priority, because of the numerous practically important problems in which such systems arise.
- Prominent among these are simulations in
  - air pollution modelling
  - pipe networks
  - acoustics
  - turbulent kinetic energy and its dispersion
  - non-Newtonian flows (extra stresses)
  - magnetohydrodynamics
  - modelling of zeolite filters
  - etc.
The Mathematical Model
- Danish Eulerian Model for long-range transport of air pollutants
Splitting into submodels
Numerical treatment
- Finite elements (1D linear first-order, bilinear, nonconforming)
Numerical treatment in the 2D case
- Finite elements (1D linear first-order, bilinear, nonconforming)
- Predictor-corrector methods with several different correctors in the advection-diffusion submodel
- QSSA (Quasi-Steady-State Algorithm) in the chemistry-emission submodel (a minimal sketch of a QSSA step is given below)
- Exact solution in the deposition submodel
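The slides only name QSSA, so the following is a minimal sketch of how a single QSSA step for one species is commonly implemented (using the standard quasi-steady-state treatment of dc/dt = P - L*c), not the actual UNI-DEM routine; the function name and the regime thresholds are assumptions.

   /* Minimal sketch of a QSSA step for one species: the production term P and
      the loss coefficient L are assumed to have been evaluated from the current
      concentrations; names and thresholds are illustrative.                    */
   #include <math.h>

   double qssa_step(double c, double P, double L, double dt)
   {
       if (L * dt < 0.01)        /* slow loss: an explicit Euler step suffices  */
           return c + dt * (P - L * c);
       if (L * dt > 10.0)        /* fast loss: species is close to steady state */
           return P / L;
       /* intermediate regime: exact solution of dc/dt = P - L*c with frozen P, L */
       return P / L + (c - P / L) * exp(-L * dt);
   }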
Size of the computational task
Parallelization strategy
- Distributed-memory parallelization model via the Message Passing Interface (MPI): maximum portability of the code
- Based on domain decomposition of the horizontal grid
  - domain overlapping in the advection-diffusion submodel (a halo-exchange sketch is given below)
  - non-overlapping subdomains in the chemistry-deposition submodel
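Because the advection-diffusion subdomains overlap, each process has to refresh its overlap (halo) rows from its neighbours at every step. The sketch below shows one way to do this with MPI_Sendrecv for a one-dimensional strip decomposition of the horizontal grid; the strip layout, the halo width of one row and all names are assumptions, not details of UNI-DEM.

   /* Sketch: exchange one halo row with the north/south neighbours for a 1-D
      strip decomposition of the horizontal grid. Rows 1..nloc are interior,
      rows 0 and nloc+1 are halos; neighbours outside the domain are passed
      as MPI_PROC_NULL.                                                       */
   #include <mpi.h>

   #define NX 96                /* number of columns in the strip (assumption) */

   void exchange_halo(double *c, int nloc, int north, int south, MPI_Comm comm)
   {
       /* send my northern edge, receive my southern halo */
       MPI_Sendrecv(&c[1 * NX],          NX, MPI_DOUBLE, north, 0,
                    &c[(nloc + 1) * NX], NX, MPI_DOUBLE, south, 0,
                    comm, MPI_STATUS_IGNORE);
       /* send my southern edge, receive my northern halo */
       MPI_Sendrecv(&c[nloc * NX],       NX, MPI_DOUBLE, south, 1,
                    &c[0],               NX, MPI_DOUBLE, north, 1,
                    comm, MPI_STATUS_IGNORE);
   }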
DD (domain decomposition) approach
Need for parallel computations
- SunFire 6800 (24 CPU UltraSparc-III/750 MHz) at DTU, Lyngby, Denmark
- Grid 480 x 480 (10 km resolution)
- 2D version of UNI-DEM
- ONE processor
- 4 017 852 sec. = 1116 h = 46.5 days!
Danish Eulerian Model (UNI-DEM)
- vector computers (CRAY C92A, Fujitsu, etc.)
- parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, etc.)
- parallel computers with shared memory (SGI Origin, SUN, etc.)
- parallel computers with two levels of parallelism (IBM SMP, Macintosh G4 clusters, etc.)
Some numerical experiments
- Machines used
  - SunFire 6800 at DTU, Lyngby, Denmark (24 CPU UltraSparc-III/750 MHz)
  - Macintosh G4 Power PC cluster at IPP, Sofia (Linux cluster, 4 nodes x 2 CPU G4/450 MHz)
Some numerical experiments
- Computing time (in sec.) and speedup on the Macintosh cluster
  Grid size    1 proc.     2 proc.          4 proc.          8 proc.
  96 x 96      65 036      33 698 (1.93)    17 185 (3.78)    9 684 (6.72)
  288 x 288    1 338 960   699 424 (1.91)   366 548 (3.65)   175 066 (7.65)
- Computing time (in sec.) and speedup on the SunFire 6800
  Grid size    1 proc.     2 proc.          4 proc.          8 proc.
  96 x 96      52 744      23 217 (2.27)    13 296 (3.97)    7 765 (6.79)
  288 x 288    709 339     327 400 (2.17)   198 030 (3.58)   100 033 (7.09)
Some numerical experiments
- SunFire 6800 vs. Macintosh G4 cluster
- Ratio of top performances: SUN top performance / G4 top performance = 1.67
- Ratio of computing times (Macintosh G4 time / SunFire 6800 time):
  Grid size    1 proc.   2 proc.   4 proc.   8 proc.
  96 x 96      1.23      1.45      1.29      1.25
  288 x 288    1.89      2.14      1.85      1.75
Acknowledgments
- This research was supported in part by
  - Grant IO-01/03 of the Bulgarian NSF
  - a grant from the NATO Scientific Programme (CRG 960505)
Implementations
- for sequential computers
- for vector computers (CRAY C92A, Fujitsu, etc.)
- for parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, etc.)
- for parallel computers with shared memory (SGI Origin, SUN, etc.)
- for parallel computers with two levels of parallelism: distributed memory between the nodes and shared memory inside the nodes, each of which consists of several processors (IBM SMP, clusters of multiprocessor nodes, etc.); a hybrid MPI + OpenMP skeleton is sketched below
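On machines with two levels of parallelism the usual pattern is MPI between the nodes and OpenMP threads inside each node. The skeleton below only illustrates that pattern; the grid size, the work split and all names are assumptions, not the UNI-DEM implementation.

   /* Sketch of a two-level (MPI + OpenMP) setup: one MPI process per node,
      OpenMP threads over the local grid points inside each node.           */
   #include <mpi.h>
   #include <omp.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int provided, rank, nprocs;

       /* thread support level in which only the master thread calls MPI */
       MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

       int nlocal = 96 * 96 / nprocs;  /* grid points per process (illustrative) */

       #pragma omp parallel for schedule(static)
       for (int i = 0; i < nlocal; i++) {
           /* chemistry/deposition work for local grid point i would go here */
       }

       if (rank == 0)
           printf("%d MPI processes, up to %d OpenMP threads each\n",
                  nprocs, omp_get_max_threads());

       MPI_Finalize();
       return 0;
   }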
Space discretization
- 32 x 32 (150 km resolution)
- 96 x 96 (50 km resolution)
- 288 x 288 (16.7 km resolution)
- 480 x 480 (10 km resolution)
Need for parallel computations
- Number of equations per system of ODEs treated at every time-step (typically 3456 time-steps for a one-month period)
  No. of species   32x32x10    96x96x10     288x288x10    480x480x10
  35               358 400     3 225 600    29 030 400    80 640 000
  56               573 440     5 160 960    46 448 640    129 024 000
  168              1 720 320   15 482 880   139 345 920   387 072 000
Computing time per module
- Computing time (in sec.) on one processor of the IBM SMP
  Module              Comp. time   Percent
  Chemistry           16 147       83.09
  Advection           3 013        15.51
  Initialization      2            0.00
  Input operations    50           0.26
  Output operations   220          1.13
  Total time          19 432       100.00
Chemistry chunks
- To reduce the computing time used in the chemical module, it is worthwhile to divide the arrays into smaller portions (chunks).
- Copy data from the appropriate sections of the large arrays into small arrays where the chunks are stored. Then the bulk of the computational work is performed using data from the small arrays (which will hopefully stay in the caches).
Chemistry chunks
- Let M be the length of the leading dimension of the 2D arrays used in the chemical module.
- Divide these arrays into nchunks chunks.
- The leading dimension of the resulting smaller arrays is nsize = M/nchunks.
Chemistry chunks
- The largest arrays in the chemical submodel are
  - (a) three arrays for the concentrations,
  - (b) one array for the emissions,
  - (c) one array for the time-dependent chemical rate coefficients, and
  - (d) three arrays for the depositions (dry, wet and total).
Chemistry chunks
  DO ichunk = 1, nchunks
     Copy chunk ichunk from some of the eight large arrays into small
     two-dimensional arrays with leading dimension nsize
     DO j = 1, nspecies
        DO i = 1, nsize
           Perform the chemical reactions involving species j
           for grid-point i
        END DO
     END DO
     Copy some of the small two-dimensional arrays with leading dimension
     nsize back into chunk ichunk of the corresponding large arrays
  END DO
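A compilable C sketch of the same cache-blocking idea as the loop above, assuming row-major storage with one large concentration array of NSPECIES x M values; all sizes and names are illustrative and not taken from UNI-DEM.

   /* Sketch: perform the chemical work in chunks of nsize grid points so that
      the working set of the inner loops stays in cache.                       */
   #include <string.h>

   #define NSPECIES 35
   #define M        9216               /* 96 x 96 grid points (assumption)     */
   #define MAXCHUNK 2048               /* nsize is assumed not to exceed this  */

   void chemistry_in_chunks(double conc[NSPECIES][M], int nchunks)
   {
       static double small[NSPECIES][MAXCHUNK];   /* buffer for one chunk     */
       int nsize = M / nchunks;                   /* assume nchunks divides M */

       for (int ichunk = 0; ichunk < nchunks; ichunk++) {
           int start = ichunk * nsize;

           /* copy chunk ichunk of every species into the small buffer */
           for (int j = 0; j < NSPECIES; j++)
               memcpy(small[j], &conc[j][start], nsize * sizeof(double));

           /* bulk of the work: chemical reactions on the cached chunk */
           for (int j = 0; j < NSPECIES; j++)
               for (int i = 0; i < nsize; i++) {
                   /* reaction terms for species j at grid point start+i
                      would update small[j][i] here                      */
               }

           /* copy the updated chunk back into the large array */
           for (int j = 0; j < NSPECIES; j++)
               memcpy(&conc[j][start], small[j], nsize * sizeof(double));
       }
   }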
Chemistry chunks
- Computing time (in sec.), 2D UNI-DEM, (96 x 96) grid, one processor
  Chunk size   Fujitsu   Origin 2000   Mac G4   IBM SMP
  1            76 964    14 847        6 952    10 313
  48           2 611     12 114        5 792    5 225
  9216         494       18 549        12 893   19 432
Chemistry chunks
- Conclusions
- The optimal length of the chunks depends on the memory hierarchy of the computer.
- The length of the chunks, nsize, should be a parameter which can be selected in the main program.
Chemistry chunks
- Conclusions (cont.)
- The use of long chunks leads to many cache misses and, therefore, to a very significant increase of the computing time.
- The use of very short chunks increases too much the number of copies that have to be made and, thus, the computing time.
- The use of medium-sized chunks, over a rather large range, normally gives very good results.
UNI-DEM on shared-memory computers
- OpenMP directives are used
- (i) to get good results on different shared-memory computers when such directives are used, and
- (ii) to achieve a high degree of portability.
UNI-DEM on shared-memory computers
- It is important to identify the parallel tasks and to group them in an appropriate way when necessary.
- The horizontal advection and diffusion
- The horizontal advection and diffusion can be carried out independently for every chemical compound (and, in the 3-D version, for every layer).
UNI-DEM on shared-memory computers
- The chemistry and deposition
- These two processes can be carried out in parallel for every grid-point.
- The number of parallel tasks is equal to the number of grid-points, but each task is small.
- Therefore, the tasks should be grouped in an appropriate way (see the OpenMP sketch below).
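One common way to group the many small per-grid-point chemistry/deposition tasks under OpenMP is to parallelize over chunks of grid points rather than over individual points. The sketch below only illustrates that grouping; the chunk size, array layout and names are assumptions, not UNI-DEM's code.

   /* Sketch: OpenMP parallelization of the chemistry/deposition step over
      chunks of grid points, so each thread gets a few sizeable tasks
      instead of very many tiny ones.                                      */
   #include <omp.h>

   #define NPOINTS  (96 * 96)    /* horizontal grid points (assumption) */
   #define NSPECIES 35
   #define CHUNK    48           /* grid points handled per task        */

   void chemistry_step(double conc[NSPECIES][NPOINTS], double dt)
   {
       (void)dt;                 /* dt would be used by the real reaction terms */
       #pragma omp parallel for schedule(dynamic)
       for (int start = 0; start < NPOINTS; start += CHUNK) {
           int end = (start + CHUNK < NPOINTS) ? start + CHUNK : NPOINTS;
           for (int j = 0; j < NSPECIES; j++)
               for (int i = start; i < end; i++) {
                   /* chemistry and deposition update of conc[j][i] over the
                      time step would be performed here                      */
               }
       }
   }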
UNI-DEM on shared-memory computers
- The vertical exchange
- The vertical exchange along each vertical grid-line is a parallel task; the number of these tasks is determined by the grid size (Nx x Ny x Nz).
- If the grid is fine, the number of these tasks becomes enormous.
- However, the parallel tasks are not very big and have to be grouped.
UNI-DEM on distributed-memory computers
- Message Passing Interface (MPI)
- The space domain of the model is divided into several sub-domains (the number of sub-domains being equal to the number of processors assigned to the job); a partitioning sketch is given below.
- Each processor works on its own sub-domain.
DD of the domain
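A minimal sketch of how such a partition of the grid rows over the MPI processes can be computed, assuming a strip-wise splitting of the Ny rows; the layout and names are assumptions about one reasonable choice, not the decomposition actually used in UNI-DEM.

   /* Sketch: split Ny grid rows into contiguous strips, one per MPI process,
      distributing any remainder over the first few ranks.                   */
   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank, nprocs, Ny = 480;          /* 480 x 480 grid (assumption) */

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

       int base  = Ny / nprocs, rem = Ny % nprocs;
       int nloc  = base + (rank < rem ? 1 : 0);             /* rows on this rank */
       int first = rank * base + (rank < rem ? rank : rem); /* first global row  */

       printf("rank %d handles rows %d..%d\n", rank, first, first + nloc - 1);

       MPI_Finalize();
       return 0;
   }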
UNI-DEM on distributed-memory computers
- The pre-processing procedure
- The input data (the meteorological data and the emission data) are distributed, consistently with the sub-domains, to the assigned processors.
- Not only does each processor work on its own sub-domain, it also has access to all the meteorological and emission data needed in the run.
UNI-DEM on distributed-memory computers
- The post-processing procedure
- During the run, each processor prepares output data files for its own sub-domain. At the end of the job all these files have to be collected on one of the processors and prepared for future use.
- The pre-processing and post-processing procedures are used in order to reduce as much as possible the communication during the actual computations.
Some performance results
- SGI Origin 2000, (96 x 96 x 10) grid
  Proc.   Time (sec.)   Speedup   Efficiency (%)
  1       42 907        --        --
  32      2 215         19.37     61
Some performance results
- IBM SP, (480 x 480) grid
  Proc.   Time (sec.)   Speedup   Efficiency (%)
  8       54 978        --        --
  32      15 998        3.44      86
Some performance results
- IBM SMP, (96 x 96) grid
  Proc.   Time (sec.)   Speedup   Efficiency (%)
  1       5 978         --        --
  16      424           12.32     72