Title: Introducción a la computación de altas prestaciones (Introduction to High Performance Computing)
1Introducción a la computación de altas prestaciones (Introduction to High
Performance Computing)
Francisco Almeida and Francisco de Sande
Departamento de Estadística, I.O. y Computación, Universidad de La Laguna
La Laguna, February 12, 2004
2Questions
- Why Parallel Computers?
- How Can the Quality of the Algorithms be
Analyzed? - How Should Parallel Computers Be Programmed?
- Why the Message Passing Programming Paradigm?
- Why the Shared Memory Programming Paradigm?
3OUTLINE
- Introduction to Parallel Computing
- Performance Metrics
- Models of Parallel Computers
- The MPI Message Passing Library
- Examples
- The OpenMP Shared Memory Library
- Examples
- Improvements in black hole detection using
parallelism
4Why Parallel Computers ?
- Applications Demanding more Computational Power
- Artificial Intelligence
- Weather Prediction
- Biosphere Modeling
- Processing of Large Amounts of Data (from sources
such as satellites) - Combinatorial Optimization
- Image Processing
- Neural Network
- Speech Recognition
- Natural Language Understanding
- etc..
(Figure: the growth in performance and cost of SUPERCOMPUTERS from the 1960s
to the 1990s.)
5Top500
6Performance Metrics
7Speed-up
- Ts, Sequential Run Time: time elapsed between the beginning and the end of
  the execution of the algorithm on a sequential computer.
- Tp, Parallel Run Time: time that elapses from the moment a parallel
  computation starts to the moment the last processor finishes its execution.
- Speed-up = Ts / Tp ≤ p
- Ts is the time of the best sequential algorithm to solve the problem.
8Speed-up
9Speed-up
10Speed-up
11Speed-up
Optimal Number of Processors
12Efficiency
- In practice, the ideal behavior of a speed-up equal to p is not achieved,
  because while executing a parallel algorithm the processing elements cannot
  devote 100% of their time to the computations of the algorithm.
- Efficiency: a measure of the fraction of time for which a processing element
  is usefully employed.
- E = (Speed-up / p) x 100%  (see the worked example below)
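As a compact restatement of the two metrics above (the numbers in the example
are illustrative, not taken from the slides):

\[
S = \frac{T_s}{T_p}, \qquad E = \frac{S}{p}\times 100\%
\]

For instance, with \(T_s = 100\) s, \(T_p = 30\) s and \(p = 4\):
\(S \approx 3.33\) and \(E \approx 83\%\).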
13Efficiency
14Amdahl's Law
- Amdahl's law attempts to give an upper bound for the speed-up, based on the
  nature of the algorithm chosen for the parallel implementation.
- Seq: proportion of time that the algorithm needs to spend in purely
  sequential parts.
- Par: proportion of time that might be done in parallel.
- Seq + Par = 1 (where 1 is for algebraic simplicity)
- Maximum Speed-up = (Seq + Par) / (Seq + Par / p) = 1 / (Seq + Par / p)

  Seq (%)   Seq     Par     Maximum Speed-up
  0.1       0.001   0.999   500.25
  0.5       0.005   0.995   166.81
  1         0.01    0.99     90.99
  10        0.1     0.9       9.91
  (p = 1000)
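As a check of the first row of the table above (all values taken from the
table itself):

\[
\text{Maximum Speed-up} = \frac{1}{\mathrm{Seq} + \mathrm{Par}/p}
 = \frac{1}{0.001 + 0.999/1000} \approx 500.25 \qquad (p = 1000)
\]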
15Example
- A problem to be solved many times over several different inputs.
- Evaluate F(x,y,z)
- x in 1, ..., 20; y in 1, ..., 10; z in 1, ..., 3
- The total number of evaluations is 20 x 10 x 3 = 600.
- The cost to evaluate F at one point (x, y, z) is t.
- The total running time is t x 600.
- If t is equal to 3 hours, the total running time for the 600 evaluations is
  1800 hours ≈ 75 days.
16Speed-up
17Models
18The Sequential Model
- The RAM model expresses computations on von Neumann architectures.
- The von Neumann architecture is universally accepted for sequential
  computations.
RAM
Von Neumann
19The Parallel Model
- Computational Models: PRAM, BSP, LogP
- Programming Models: PVM, MPI, HPF, Threads, OpenMP
- Architectural Models: Parallel Architectures
20Address-Space Organization
21Digital AlphaServer 8400
Hardware
- Shared Memory
- Bus Topology
- C4-CEPBA
- 10 Alpha 21164 processors
- 2 Gb Memory
- 8.8 Gflop/s
22SGI Origin 2000
Hardware
- Shared-Distributed Memory
- Hypercubic Topology
- C4-CEPBA
- 64 R10000 processors
- 8 Gb memory
- 32 Gflop/s
23The SGI Origin 3000 Architecture (1/2)
- jen50.ciemat.es
- 160 processors MIPS R14000 / 600MHz
- On 40 nodes with 4 processors each
- Data and instruction cache on-chip
- Irix Operating System
- Hypercubic Network
24The SGI Origin 3000 Architecture (2/2)
- cc-Numa memory Architecture
- 1 Gflops Peak Speed
- 8 MB external cache
- 1 Gb main memory each proc.
- 1 Tb Hard Disk
25Beowulf Computers
- COTS Commercial-Off-The-Shelf computers
26Towards Grid Computing.
Source www.globus.org updated
27The Parallel Model
- Computational Models: PRAM, BSP, LogP
- Programming Models: PVM, MPI, HPF, Threads, OpenMP
- Architectural Models: Parallel Architectures
28Drawbacks that arise when solving Problems using
Parallelism
- Parallel programming is more complex than sequential programming.
- Results may vary as a consequence of intrinsic non-determinism.
- New problems: deadlocks, starvation, ...
- It is more difficult to debug parallel programs.
- Parallel programs are less portable.
29The Message Passing Model
30The Message Passing Model
Send(parameters) Recv(parameters)
32MPI
(Figure: MPI and its precursors - CMMD, pvm, Express, Zipcode, p4, PARMACS,
EUI - together with parallel libraries, parallel applications and parallel
languages.)
33MPI
- What Is MPI?
- Message Passing Interface standard
- The first standard and portable message
passing library with good performance - "Standard" by consensus of MPI Forum
participants from over 40 organizations - Finished and published in May 1994, updated
in June 1995 - What does MPI offer?
- Standardization - on many levels
- Portability - to existing and new systems
- Performance - comparable to vendors'
proprietary libraries - Richness - extensive functionality, many
quality implementations
34A Simple MPI Program
MPI hello.c

  #include <stdio.h>
  #include "mpi.h"

  main(int argc, char *argv[]) {
    int name, p, source, dest, tag = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    printf("Hello from processor %d of %d\n", name, p);
    MPI_Finalize();
  }
35A Simple MPI Program
MPI helloms.c

  #include <stdio.h>
  #include <string.h>
  #include "mpi.h"

  main(int argc, char *argv[]) {
    int name, p, source, dest, tag = 0;
    char message[100];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (name != 0) {
      printf("Processor %d of %d\n", name, p);
      sprintf(message, "greetings from process %d!", name);
      dest = 0;
      MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag,
               MPI_COMM_WORLD);
    } else {
      printf("processor 0, p = %d\n", p);
      for (source = 1; source < p; source++) {
        MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
        printf("%s\n", message);
      }
    }
    MPI_Finalize();
  }
36Linear Model to Predict Communication Performance
- Time to send n bytes ≈ β + n·τ  (a startup latency β plus a per-byte cost τ;
  see the sketch below)
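A minimal sketch of this linear model in C (the constants β and τ below are
illustrative assumptions, not values from the slides; in practice they are
measured, e.g. with a ping-pong benchmark):

  /* Linear communication model: t(n) = beta + n * tau. */
  #include <stdio.h>

  double comm_time(long n_bytes, double beta, double tau) {
    return beta + n_bytes * tau;          /* startup latency + per-byte cost */
  }

  int main(void) {
    double beta = 50e-6;                  /* assumed 50 us startup latency      */
    double tau  = 10e-9;                  /* assumed 10 ns per byte (~100 MB/s) */
    printf("1 KB: %g s   1 MB: %g s\n",
           comm_time(1L << 10, beta, tau), comm_time(1L << 20, beta, tau));
    return 0;
  }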
37Performance Prediction: Fast Ethernet, Gigabit Ethernet, Myrinet
38Basic Communication Operations
39One-to-all broadcast / Single-node Accumulation
(Figure: one-to-all broadcast, in which processor 0 sends a message M to
processors 1, ..., p, and its dual operation, single-node accumulation,
performed in steps 1, ..., p.)
40Broadcast on Hypercubes
41Broadcast on Hypercubes
42MPI Broadcast
- int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root,
  MPI_Comm comm)
- Broadcasts a message from the process with rank "root" to all other
  processes of the group.
43Reduction on Hypercubes
- @ is a commutative and associative operator
- Ai is in processor i
- Every processor has to obtain A0 @ A1 @ ... @ A(P-1)
(Figure: reduction on a 3-dimensional hypercube, nodes labelled with binary
addresses 000, ..., 111; see the sketch below.)
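A hedged sketch of this dimension-exchange reduction written with MPI
point-to-point calls (illustrative, not code from the slides; it assumes the
number of processes is a power of two and uses "+" as the operator @; in
practice MPI_Reduce / MPI_Allreduce do this for you):

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[]) {
    int rank, p, mask, partner;
    double a, partial;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    a = (double) rank;                       /* Ai held by processor i */
    for (mask = 1; mask < p; mask <<= 1) {   /* one step per hypercube dimension */
      partner = rank ^ mask;                 /* neighbour across this dimension  */
      MPI_Sendrecv(&a, 1, MPI_DOUBLE, partner, 0,
                   &partial, 1, MPI_DOUBLE, partner, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      a += partial;                          /* after log2(p) steps every node   */
    }                                        /* holds A0 + A1 + ... + A(P-1)     */
    if (rank == 0) printf("reduction result = %g\n", a);
    MPI_Finalize();
    return 0;
  }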
44Reductions with MPI
- int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
  MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
  Reduces values on all processes to a single value.
- int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
  Combines values from all processes and distributes the result back to all
  processes (see the usage sketch below).
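A minimal usage sketch of MPI_Allreduce (illustrative, not from the slides):
every process contributes one value and every rank receives the sum:

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[]) {
    int rank, p;
    double mine, total;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    mine = rank + 1.0;                   /* each process contributes one value */
    MPI_Allreduce(&mine, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d of %d: total = %g\n", rank, p, total); /* same on all ranks */
    MPI_Finalize();
    return 0;
  }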
45All-to-All Broadcast / Multinode Accumulation
(Figure: all-to-all broadcast, in which every processor i starts with its own
message Mi and ends up with all messages M0, ..., Mp, and its dual operation,
multinode accumulation.)
- Reductions, Prefix sums
46MPI Collective Operations
MPI Operator   Operation
-------------  ----------------------
MPI_MAX        maximum
MPI_MIN        minimum
MPI_SUM        sum
MPI_PROD       product
MPI_LAND       logical and
MPI_BAND       bitwise and
MPI_LOR        logical or
MPI_BOR        bitwise or
MPI_LXOR       logical exclusive or
MPI_BXOR       bitwise exclusive or
MPI_MAXLOC     max value and location
MPI_MINLOC     min value and location
47Computing π: Sequential

  double x, pi = 0.0, h;
  long i, n;
  ...
  h = 1.0 / (double) n;
  for (i = 0; i < n; i++) {
    x = (i + 0.5) * h;
    pi += f(x) * h;          /* f(x) = 4.0 / (1.0 + x*x) */
  }

(Figure: plot of f(x) = 4/(1+x^2) on [0,1]; the area under the curve is π.)
48Computing π: Parallel (MPI)

  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  h = 1.0 / (double) n;
  mypi = 0.0;
  for (i = name; i < n; i += numprocs) {
    x = h * (i + 0.5);
    mypi += f(x) * h;
  }
  MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

(Figure: the same plot of f(x) = 4/(1+x^2) on [0,1].)
49The Master Slave Paradigm
Master
Slaves
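A minimal sketch (illustrative, not from the slides) of the master-slave
pattern with MPI: rank 0 (the master) hands out task ids, the other ranks
(slaves) return results, and -1 is the stop signal. It assumes at least two
processes; the "work" (squaring the task id) is a stand-in:

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[]) {
    int rank, p, w, task, result, ntasks = 20, next = 0, done = 0;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (rank == 0) {                                  /* master */
      for (w = 1; w < p; w++) {                       /* seed every slave once */
        task = (next < ntasks) ? next++ : -1;
        MPI_Send(&task, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
      }
      while (done < ntasks) {                         /* collect and resend */
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
        done++;
        task = (next < ntasks) ? next++ : -1;
        MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
      }
    } else {                                          /* slave */
      MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
      while (task != -1) {
        result = task * task;                         /* stand-in for real work */
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
      }
    }
    MPI_Finalize();
    return 0;
  }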
50Condor - University of Wisconsin-Madison
www.cs.wisc.edu/condor
- A problem to be solved many times over several different inputs.
- The problem to be solved is computationally expensive.
51Condor
- Condor is a specialized workload management
system for compute-intensive jobs. - Like other full-featured batch systems, Condor
provides a job queueing mechanism, scheduling
policy, priority scheme, resource monitoring, and
resource management. - Users submit their serial or parallel jobs to
Condor, Condor places them into a queue, chooses
when and where to run the jobs based upon a
policy, carefully monitors their progress, and
ultimately informs the user upon completion. - Condor can be used to manage a cluster of
dedicated compute nodes (such as a "Beowulf"
cluster). In addition, unique mechanisms enable
Condor to effectively harness wasted CPU power
from otherwise idle desktop workstations. - In many circumstances Condor is able to
transparently produce a checkpoint and migrate a
job to a different machine which would otherwise
be idle. - As a result, Condor can be used to seamlessly
combine all of an organization's computational
power into one resource.
52What is OpenMP?
- Application Program Interface (API) for shared
memory parallel programming - What the application programmer inserts into code
to make it run in parallel - Addresses only shared memory multiprocessors
- Directive based approach with library support
- Concept of base language and extensions to base
language - OpenMP is available for Fortran 90 / 77 and C /
C
53OpenMP is not...
- Not Automatic parallelization
- User explicitly specifies parallel execution
- Compiler does not ignore user directives even if
wrong - Not just loop level parallelism
- Functionality to enable coarse grained
parallelism - Not a research project
- Only practical constructs that can be implemented
with high performance in commercial compilers - Goal of parallel programming application speedup
- Simple/Minimal with Opportunities for
Extensibility
54Why OpenMP?
- Parallel programming landscape before OpenMP
- Standard way to program distributed memory
computers (MPI and PVM) - No standard API for shared memory programming
- Several vendors had directive based APIs for
shared memory programming - Silicon Graphics, Cray Research, Kuck
Associates, Digital Equipment . - All different, vendor proprietary
- Sometimes similar but with different spellings
- Most were targeted at loop level parallelism
- Limited functionality - mostly for parallelizing
loops
55Why OpenMP? (cont.)
- Commercial users, high end software vendors have
big investment in existing code - Not very eager to rewrite their code in new
languages - Performance concerns of new languages
- End result users who want portability forced to
program shared memory machines using MPI - Library based, good performance and scalability
- But sacrifice the built in shared memory
advantages of the hardware - Both require major investment in time and money
- Major effort: the entire program needs to be rewritten
- New features need to be curtailed during conversion
56The OpenMP API
- Multi-platform shared-memory parallel programming
- OpenMP is portable supported by Compaq, HP, IBM,
Intel, SGI, Sun and others on Unix and NT - Multi-Language C, C, F77, F90
- Scalable
- Loop Level parallel control
- Coarse grained parallel control
57The OpenMP API
- Single source parallel/serial programming
- OpenMP is not intrusive (to original serial
code).
- Directives appear as comment statements in Fortran and as pragmas in C/C++.
- Incremental implementation:
- OpenMP programs can be implemented incrementally, one subroutine (function)
  or even one do (for) loop at a time.

  !$omp parallel do
  do i = 1, n
    ...
  enddo

  #pragma omp parallel for
  for (i = 0; i < n; i++) {
    ...
  }
58Threads
- Multithreading
- Sharing a single CPU between multiple tasks (or
"threads") in a way designed to minimise the time
required to switch threads - This is accomplished by sharing as much as
possible of the program execution environment
between the different threads so that very little
state needs to be saved and restored when
changing thread. - Threads share more of their environment with each
other than do tasks under multitasking. - Threads may be distinguished only by the value of
their program counters and stack pointers while
sharing a single address space and set of global
variables
59OpenMP Overview How do threads interact?
- OpenMP is a shared memory model
- Threads communicate by sharing variables
- Unintended sharing of data causes race
conditions - race condition when the programs outcome
changes as the threads are scheduled differently - To control race conditions
- Use synchronizations to protect data conflicts
- Synchronization is expensive so
- Change how data is accessed to minimize the need
for synchronization
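An illustrative sketch (not from the slides) of the same sum computed with a
data race and then safely with a reduction clause:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
    const int n = 1000000;
    double racy = 0.0, safe = 0.0;
    int i;

    #pragma omp parallel for              /* race: all threads update racy at once */
    for (i = 0; i < n; i++)
      racy += 1.0;                        /* result changes from run to run        */

    #pragma omp parallel for reduction(+:safe)  /* each thread gets a private copy */
    for (i = 0; i < n; i++)                     /* combined at the end             */
      safe += 1.0;

    printf("racy = %g  safe = %g (expected %d)\n", racy, safe, n);
    return 0;
  }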
60OpenMP Parallel Computing Solution Stack
- User layer
- Programming layer (OpenMP API): Directives, OpenMP Library, Environment
  Variables
- System layer: Runtime Library, OS/system support for shared memory
61Reasoning about programming
- Programming is a process of successive refinement
of a solution relative to a hierarchy of models
- The models represent the problem at different levels of abstraction
- The top levels express the problem in the original problem domain
- The lower levels represent the problem in the computer's domain
- The models are informal, but detailed enough to
support simulation - Source J.-M. Hoc, T.R.G. Green, R. Samurcay and
D.J. Gilmore (eds.), Psychology of Programming,
Academic Press Ltd., 1990
62Layers of abstraction in Programming
(Figure: the domains Problem, Algorithm, Source Code, Computation and
Hardware, with models acting as bridges between adjacent domains. OpenMP only
defines two of these bridges: the programming model and the computational
model.)
63The OpenMP Computational Model
- OpenMP was created with a particular abstract
machine or computational model in mind - Multiple processing elements
- A shared address space with equal-time access
for each processor - Multiple light weight processes (threads) managed
outside of OpenMP (the OS or some other third
party)
64The OpenMP programming model
- fork-join parallelism
- Master thread spawns a team of threads as needed
- Parallelism is added incrementally, i.e. the sequential program evolves
  into a parallel program
(Figure: the master thread forking teams of threads at parallel regions,
including a nested parallel region.)
65So, How good is OpenMP?
- A high quality programming environment supports
transparent mapping between models - OpenMP does this quite well for the models it
defines - Programming model
- Threads forked by OpenMP map onto threads in
modern OSs. - Computational model
- Multiprocessor systems with cache coherency map
onto OpenMP shared address space
66What about the cost model?
- OpenMP doesn't say much about the cost model
- programmers are left to their own devices
- Real systems have memory hierarchies; OpenMP's assumed machine model doesn't
- Caches mean some data is closer to some
processors - Scalable multiprocessor systems organize their
RAM into modules - another source of NUMA - OpenMP programmers must deal with these issues as
they - Optimize performance on each platform
- Scale to run on larger NUMA machines
67What about the specification model?
- Programmers reason in terms of a specification
model as they design parallel algorithms - Some parallel algorithms are natural in OpenMP
- Specification models implied by loop-splitting and SPMD algorithms map well
  onto OpenMP's programming model
- Some parallel algorithms are hard for OpenMP
- Recursive problems and list processing are challenging for OpenMP's models
68Is OpenMP a good API?
(Table: ratings of the four bridging models - specification, programming,
computational and cost: Fair (5), Good (8), Good (9), Poor (3).)
Overall score: OpenMP is OK (6), but it sure could be better!
69OpenMP today
- Hardware Vendors
- Compaq/Digital (DEC)
- Hewlett-Packard (HP)
- IBM
- Intel
- Silicon Graphics
- Sun Microsystems
3rd Party Software Vendors: Absoft, Edinburgh Portable Compilers (EPC),
Kuck & Associates (KAI), Myrias, Numerical Algorithms Group (NAG),
Portland Group (PGI)
70The OpenMP components
- Directives
- Environment Variables
- Shared / Private Variables
- Runtime Library
- OS Threads
71The OpenMP directives
- Directives are special comments in the language
- Fortran fixed form: !$OMP, C$OMP, *$OMP
- Fortran free form: !$OMP
- Special comments are interpreted by OpenMP compilers

      w = 1.0/n
      sum = 0.0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
      do I = 1, n
        x = w*(I-0.5)
        sum = sum + f(x)
      end do
      pi = w*sum
      print *, pi
      end

Comment in Fortran, but interpreted by OpenMP compilers.
Don't worry about details now!
72The OpenMP directives
- Look like a comment (sentinel / pragma syntax)
- F77/F90: !$OMP directive_name [clauses]
- C/C++: #pragma omp directive_name [clauses]
- Declare start and end of multithread execution
- Control work distribution
- Control how data is brought into and taken out of
parallel sections - Control how data is written/read inside sections
73The OpenMP environment variables
- OMP_NUM_THREADS - number of threads to run in a
parallel section - MPSTKZ size of stack to provide for each thread
- OMP_SCHEDULE - controls the scheduling of loop iterations
- setenv OMP_SCHEDULE "STATIC, 5"
- setenv OMP_SCHEDULE "GUIDED, 8"
- setenv OMP_SCHEDULE "DYNAMIC"
- OMP_DYNAMIC
- OMP_NESTED
74Shared / Private variables
- Shared variables can be accessed by all of the
threads. - Private variables are local to each thread
- In a typical parallel loop, the loop index is
private, while the data being indexed is shared
!$omp parallel do
!$omp& shared(X,Y,Z), private(I)
      do I = 1, 100
        Z(I) = X(I) + Y(I)
      end do
!$omp end parallel do
75OpenMP Runtime routines
- Writing a parallel section of code is matter of
asking two questions - How many threads are working in this section?
- Which thread am I?
- Other things you may wish to know
- How many processors are there?
- Am I in a parallel section?
- How do I control the number of threads?
- Control the execution by setting directive state
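A small sketch (illustrative, not from the slides) answering those questions
with the standard OpenMP runtime calls:

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
    omp_set_num_threads(omp_get_num_procs());    /* one thread per processor */
    #pragma omp parallel
    {
      int nthreads = omp_get_num_threads();      /* how many threads here?   */
      int tid      = omp_get_thread_num();       /* which thread am I?       */
      if (tid == 0)
        printf("%d threads on %d procs, in parallel: %d\n",
               nthreads, omp_get_num_procs(), omp_in_parallel());
    }
    return 0;
  }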
76OS threads
- In the case of Linux, it needs to be installed
with an SMP kernel. - Not a good idea to assign more threads than CPUs
available - omp_set_num_threads(omp_get_num_procs())
77A simple example: computing π
78Computing π

  double t, pi = 0.0, w;
  long i, n = 100000000;
  ...
  w = 1.0 / n;
  for (i = 0; i < n; i++) {
    t = (i + 0.5) * w;
    pi += 4.0 / (1.0 + t*t);
  }
  pi *= w;
  ...
79Computing π
- #pragma omp directives in C
- Ignored by non-OpenMP compilers

  double t, pi = 0.0, w;
  long i, n = 100000000;
  ...
  w = 1.0 / n;
  #pragma omp parallel for reduction(+:pi) private(i,t)
  for (i = 0; i < n; i++) {
    t = (i + 0.5) * w;
    pi += 4.0 / (1.0 + t*t);
  }
  pi *= w;
  ...
80Computing π on a SunFire 6800
81Compiling OpenMP programs
- OpenMP directives are ignored by default
- Example SGI Irix platforms
- f90 -O3 foo.f
- cc -O3 foo.c
- OpenMP directives are enabled with -mp
- Example SGI Irix platforms
- f90 -O3 -mp foo.f
- cc -O3 -mp foo.c
82Fortran example
- OpenMP directives used:
- c$omp parallel [clauses]
- c$omp end parallel

      program f77_parallel
      implicit none
      integer n, m, i, j
      parameter (n=10, m=20)
      integer a(n,m)
c$omp parallel default(none)
c$omp& private(i,j) shared(a)
      do j=1,m
        do i=1,n
          a(i,j) = i + (j-1)*n
        enddo
      enddo
c$omp end parallel
      end

- Parallel clauses include:
- default(none|private|shared)
- private(...)
- shared(...)
83Fortran example
- (the same code as the previous slide)
- Each arrow denotes one thread
- All threads perform identical task
84Fortran example
- With default scheduling,
- Thread a works on j = 1-5
- Thread b on j = 6-10
- Thread c on j = 11-15
- Thread d on j = 16-20
- (the same code as the previous slide)
85The Foundations of OpenMP
Layers of abstractions or models used to
understand and use OpenMP
86Summary of OpenMP Basics
- Parallel Region (to create threads)
- C$OMP PARALLEL          #pragma omp parallel
- Worksharing (to split up work between threads)
- C$OMP DO                #pragma omp for
- C$OMP SECTIONS          #pragma omp sections
- C$OMP SINGLE            #pragma omp single
- C$OMP WORKSHARE
- Data Environment (to manage data sharing)
- directive: THREADPRIVATE
- clauses: shared, private, lastprivate, reduction, copyin, copyprivate
- Synchronization
- directives: critical, barrier, atomic, flush, ordered, master
- Runtime functions / environment variables
(A small combined example follows below.)
87Improvements in black hole detection using
parallelism
Francisco Almeida1, Evencio Mediavilla2, Álex
Oscoz2 and Francisco de Sande1
1 Departamento de Estadística, I.O. y
Computación
2 Instituto de Astrofísica de
Canarias
Universidad de La Laguna, Tenerife, Canary Islands, SPAIN
Dresden, September 3, 2003
88Introduction (1/3)
- Very frequently there is a divorce between
computer scientists and researchers in other
scientific disciplines - This work collects the experiences of a
collaboration between researchers coming from two
different fields astrophysics and parallel
computing - We present different approaches to the
parallelization of a scientific code that solves
an important problem in astrophysics - the detection of supermassive black holes
89Introduction (2/3)
- The IAC co-authors initially developed a
Fortran77 code solving the problem - The execution time for this original code was not
acceptable and that motivated them to contact
with researchers with expertise in the parallel
computing field - We know in advance that these scientist
programmers deal with intense time-consuming
sequential codes that are not difficult to tackle
using HPC techniques - Researchers with a purely scientific background
are interested in these techniques, but they are
not willing to spend time learning about them
90Introduction (3/3)
- One of our constraints was to introduce the
minimum amount of changes in the original code - Even with the knowledge that some optimizations
could be done in the sequential code
- Preserving the use of the NAG functions was another restriction on our
  development
91Outline
- The Problem
- Black holes and quasars
- The method gravitational lensing
- Fluctuations in quasars light curves
- Mathematical formulation of the problem
- Sequential code
- Parallelizations MPI, OpenMP, Mixed MPI/OpenMP
- Computational results
- Conclusions
92Black holes
- Supermassive black holes (SMBH) are believed to exist in the nuclei of
  many, if not all, galaxies
- Some of these objects are surrounded by a disk of material continuously
  spiraling towards the deep gravitational potential well of the SMBH and
  releasing huge quantities of energy, giving rise to the phenomenon known as
  quasars (QSO)
93The accretion disk
94Quasars (Quasi Stellar Objects, QSO)
- QSOs are currently believed to be the most
luminous and distant objects in the universe - QSOs are the cores of massive galaxies with super
giant black holes that devour stars at a rapid
rate, enough to produce the amount of energy
observed by a telescope
95The method
- We are interested in objects of dimensions
comparable to the Solar System in galaxies very
far away from the Milky Way - Objects of this size can not be directly imaged
and alternative observational methods are used to
study their structure - The method we use is the observation of QSO
images affected by a microlensing event to study
the structure of the accretion disk
96Gravitational Microlensing
- Gravitational lensing (the attraction of light by
matter) was predicted by General Relativity and
observationally confirmed in 1919
- If light from a QSO passes through a galaxy located between the QSO and the
  observer, it is possible that a star in the intervening galaxy crosses the
QSO light beam - The gravitational field of the star amplifies the
light emission coming from the accretion disk
(gravitational microlensing)
97Microlensing Quasar-Star
98Microlensing
- The phenomenon is more complex because the
magnification is not due to a single isolated
microlens, but is rather a collective effect of many stars
- As the stars are moving with respect to the QSO
light beam, the amplification varies during the
crossing
99Microlensing
100Double Microlens
101Multiple Microlens
- Magnification pattern in the source plane,
produced by a dense field of stars in the lensing
galaxy. - The color reflects the magnification as a
function of the quasar position the sequence
blue-green-red-yellow indicates increasing
magnification
102Q2237+0305 (1/2)
- So far the best example of a microlensed quasar is the quadruple quasar
  Q2237+0305
103Q2237+0305 (2/2)
104Fluctuations in Q2237+0305 light curves
- Light curves of the four images of Q2237+0305 over a period of almost ten
  years
- The changes in relative brightness are very obvious
105Q2237+0305
- In Q2237+0305, and thanks to the unusually small distance between the
  observer and the lensing galaxy, the microlensing events have a timescale of
  the order of months
- We observed Q2237+0305 from October 1999 during approximately 4 months at
  the Roque de los Muchachos Observatory
106Fluctuations in light curves
- The curve representing the change in luminosity
of the QSO with time depends on the position of
the star and also on the structure of the
accretion disk - Microlens-induced fluctuations in the observed
brightness of the quasar contain information
about the light-emitting source (size of
continuum region or broad line region of the
quasar, its brightness profile, etc.) - Hence from a comparison between observed and
simulated quasar microlensing we can draw
conclusions about the accretion disk
107Q2237+0305 light curves (2/2)
- Our goal is to model light curves of QSO images
affected by a microlensing event to study the
unresolved structure of the accretion disk
108Mathematical formulation (1/5)
- Leaving aside the physical meaning of the different variables, the function
  modeling the dependence of the observed flux on time t can be written as a
  five-parameter expression (the equation itself is not reproduced in this
  transcript), where one of the parameters is the ratio between the outer and
  inner radii of the accretion disk (we adopt a fixed value for it).
109Mathematical formulation (2/5)
- To speed up the computation, G has been
approximated by using MATHEMATICA
110Mathematical formulation (3/5)
- Our goal is to estimate the values of the parameters of the observed flux
  by fitting the theoretical model to the observational data
111Mathematical formulation (4/5)
- Specifically, to find the values of the 5 parameters that minimize the
  error between the theoretical model and the observational data according to
  a chi-square criterion (see below), where
- N is the number of data points, corresponding to times ti (i = 1, 2, ..., N)
- the theoretical function is evaluated at each time ti
- an observational error is associated to each data value
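The chi-square expression itself did not survive in this transcript; a
standard form consistent with the definitions above (the symbols F_obs, F and
sigma_i are assumed names for the observed flux, the theoretical function and
the observational errors) would be:

\[
\chi^2 \;=\; \sum_{i=1}^{N}
\frac{\bigl[\,F_{\mathrm{obs}}(t_i) - F(t_i)\,\bigr]^2}{\sigma_i^2}
\]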
112Mathematical formulation (5/5)
- The determination of the minimum in the 5-parameter space depends on the
  initial conditions, so
- we consider a 5-dimensional grid of starting points
- with m sampling intervals in each variable
- for each one of the points of this grid we compute a local minimum
- Finally, we select the absolute minimum among them
- To minimize we use the NAG routine e04ccf, which only requires evaluation of
  the function and not of its derivatives
113The sequential code
      program seq_black_hole
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
c     Data input
c     Initialize best solution
      do k1 = 1, m
       do k2 = 1, m
        do k3 = 1, m
         do k4 = 1, m
          do k5 = 1, m
c           Initialize starting point x(1), ..., x(5)
            call jic2(nfit,x,fx)
            call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
            if (fx improves best fx) then
              update(best (x, fx))
            endif
          enddo
         enddo
        enddo
       enddo
      enddo
      end
114The sequential code
- (the same code as the previous slide)
115Sequential Times
- On a Sun Blade 100 workstation running Solaris 5.0 and using the native
  Fortran77 Sun compiler (v. 5.0) with full optimizations, this code takes:
- 5.89 hours for sampling intervals of size m = 4
- 12.45 hours for sampling intervals of size m = 5
- On a SGI Origin 3000 using the native MIPSpro F77 compiler (v. 7.4) with
  full optimizations:
- 0.91 hours for m = 4
- 2.74 hours for m = 5
116Loop transformation
      program seq_black_hole2
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      integer k1, k2, k3, k4, k5
      common/var/t, fit
c     Data input
c     Initialize best solution
      do k = 1, m**5
c       Index transformation
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
      end
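A hedged sketch (in C, illustrative only, not the authors' code) of what the
index transformation does: a single counter k in [0, m^5) is decoded back into
the five original indices, so the flat iteration space can later be
distributed among processes or threads:

  #include <stdio.h>

  int main(void) {
    const int m = 4;
    long k, total = 1;
    int idx[5], d;
    for (d = 0; d < 5; d++) total *= m;            /* total = m^5 iterations */
    for (k = 0; k < total; k++) {
      long r = k;
      for (d = 0; d < 5; d++) {                    /* k -> (k1..k5), 1-based */
        idx[d] = (int)(r % m) + 1;
        r /= m;
      }
      /* ... initialize the starting point from idx[] and run the local search ... */
      if (k < 3 || k == total - 1)                 /* print a few mappings as a check */
        printf("k=%ld -> (%d,%d,%d,%d,%d)\n",
               k, idx[0], idx[1], idx[2], idx[3], idx[4]);
    }
    return 0;
  }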
117MPI and OpenMP (1/2)
- In the last years OpenMP and MPI have been
universally accepted as the standard tools to
develop parallel applications - OpenMP
- is a standard for shared memory programming
- It uses a fork-join model and
- is mainly based on compiler directives, added to the code, that indicate to
  the compiler the regions of code to be executed in parallel
- MPI
- Uses an SPMD model
- Processes can read and write only to their
respective local memory - Data are copied across local memories using
subroutine calls - The MPI standard defines the set of functions and
procedures available to the programmer
118MPI and OpenMP (2/2)
- Each of these two alternatives has both advantages and disadvantages
- Very frequently it is not obvious which one should be selected for a
  specific code:
- MPI programs run on both distributed and shared memory architectures, while
  OpenMP runs only on shared memory
- The abstraction level is higher in OpenMP
- MPI is particularly suited to coarse grain parallelism; OpenMP is suitable
  for both coarse and fine grain parallelism
- While it is easy to obtain a parallel version of a sequential code in
  OpenMP, it usually requires a higher level of expertise in the case of MPI.
  A parallel code can be written in OpenMP just by inserting proper directives
  in a sequential code, preserving its semantics, while in MPI major changes
  are usually needed
- Portability is higher in MPI
119MPI-OpenMP hybridization (1/2)
- Hybrid codes match the architecture of SMP
clusters - SMP clusters are an increasingly popular
platforms - MPI may suffer from efficiency problems in shared
memory architectures - MPI codes may need too much memory
- Some vendors have attempted to extend MPI to shared memory, but the result
  is not as efficient as OpenMP
- OpenMP is easy to use but it is limited to the
shared memory architecture
120MPI-OpenMP hybridization (2/2)
- A hybrid code may provide better scalability
- Or simply enable a problem to exploit more
processors - Not necessarily faster than pure MPI/OpenMP
- It depends on the code, architecture, and how the
programming models interact
121MPI code
      program black_hole_mpi
      include 'mpif.h'
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
c     Data input
c     Initialize best solution
      do k = myid, m**5 - 1, numprocs
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
      call MPI_FINALIZE( ierr )
      end
122MPI code
- (the same code as the previous slide)
123OpenMP code
      program black_hole_omp
      implicit none
      parameter(m = 4)
      ...
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, longitud
      double precision t, fit(5)
      common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
c     Data input
c     Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
!$OMP&  PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
!$OMP&  COPYIN(/data/) LASTPRIVATE(fx)
      do k = 0, m**5 - 1
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
!$OMP END PARALLEL DO
      end
124OpenMP code
- (the same code as the previous slide)
125OpenMP code
- (the same code as the previous slide)
126Hybrid MPI-OpenMP code

      program black_hole_mpi_omp
      double precision t2(100), s(100), ne(100), fa(100), efa(100)
      common/data/t2, fa, efa, length
      double precision t, fit(5)
      common/var/t, fit
!$OMP THREADPRIVATE(/var/, /data/)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_numprocs, ierr)
c     Data input
c     Initialize best solution
!$OMP PARALLEL DO DEFAULT(SHARED)
!$OMP&  PRIVATE(tid,k,maxcal,ftol,ifail,w1,w2,w3,w4,w5,w6,x)
!$OMP&  COPYIN(/data/) LASTPRIVATE(fx)
      do k = myid, m**5 - 1, mpi_numprocs
c       Initialize starting point x(1), ..., x(5)
        call jic2(nfit,x,fx)
        call e04ccf(nfit,x,fx,ftol,niw,w1,w2,w3,w4,w5,w6,jic2,...)
        if (fx improves best fx) then
          update(best (x, fx))
        endif
      enddo
c     Reduce the OpenMP best solution
!$OMP END PARALLEL DO
c     Reduce the MPI best solution
      call MPI_FINALIZE(ierr)
      end
127Hybrid MPI-OpenMP code
- (the same code as the previous slide)
128The SGI Origin 3000 Architecture (1/2)
- jen50.ciemat.es
- 160 processors MIPS R14000 / 600MHz
- On 40 nodes with 4 processors each
- Data and instruction cache on-chip
- Irix Operating System
- Hypercubic Network
129The SGI Origin 3000 Architecture (2/2)
- cc-Numa memory Architecture
- 1 Gflops Peak Speed
- 8 MB external cache
- 1 Gb main memory each proc.
- 1 Tb Hard Disk
130Computational results (m = 4)
- As we do not have exclusive mode access to the
architecture, the times correspond to the minimum
time from 5 different executions - The figures corresponding to the mixed mode code
(label MPI-OpenMP) correspond to the minimum
times obtained for different combinations of MPI
processes/OpenMP threads
131Parallel execution time (m = 4)
132Speedup (m = 4)
133Results for mixed MPI-OpenMP
134Computational results (m = 5)
135Parallel execution time (m = 5)
136Speedup (m = 5)
137Results for mixed MPI-OpenMP (m = 5)
138Conclusions (1/3)
- The computational results obtained from all the
parallel versions confirm the robustness of the
method - For the case of non-expert users and the kind of
codes we have been dealing with, we believe that
MPI parallel versions are easier and safer - In the case of OpenMP, the proper usage of data
scope attributes for the variables involved may
be a handicap for users with non-parallel
programming expertise - The higher current portability of the MPI version
is another factor to be considered
139Conclusions (2/3)
- The mixed MPI/OpenMP parallel version is the most
expertise-demanding - Nevertheless, as it has been stated by several
authors and even for the case of a hybrid
architecture, this version does not offer any
clear advantage and it has the disadvantage that
the combination of processes/threads has to be
tuned
140Conclusions (3/3)
- This work concludes a first step of cooperation in applying HPC techniques
  to improve the performance of astrophysics codes
- The scientific aim of applying HPC to computationally-intensive codes in
  astrophysics has been successfully achieved
141Conclusions and Future Work
- The relevance of our results does not come directly from the particular
  application chosen, but from showing that parallel computing techniques are
  the key to tackling large-scale real problems in the mentioned scientific
  field
- From now on we plan to continue this fruitful collaboration by applying
  parallel computing techniques to other challenging astrophysical problems
142Supercomputing Centers
- http://www.ciemat.es/
- http://www.cepba.upc.es/
- http://www.epcc.ed.ac.uk/
143MPI links
- Message Passing Interface Forum
- http://www.mpi-forum.org/
- MPI: The Complete Reference
- http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
- Parallel Programming with MPI, by Peter Pacheco:
  http://www.cs.usfca.edu/mpi/
144OpenMP links
- http://www.openmp.org/
- http://www.compunity.org/
145¡Gracias! (Thank you!)
Introducción a la computación de altas prestaciones (Introduction to High
Performance Computing)
Francisco Almeida and Francisco de Sande
(falmeida, fsande)@ull.es
Departamento de Estadística, I.O. y Computación, Universidad de La Laguna
La Laguna, February 12, 2004