Title: Introduction to Scientific Computing on Linux Clusters
1. Introduction to Scientific Computing on Linux Clusters
Doug Sondak, Linux Clusters and Tiled Display Walls, July 30 - August 1, 2002
2. Outline
- Why Clusters?
- Parallelization
  - example: Game of Life
  - performance metrics
- Ways to Fool the Masses
- Summary
3. Why Clusters?
- Scientific computing has traditionally been performed on fast, specialized machines
- Buzzword: Commodity Computing
  - clustering cheap, off-the-shelf processors
  - can achieve good performance at a low cost if the applications scale well
4. Clusters (2)
- 102 clusters in current Top 500 list
  - http://www.top500.org/list/2001/06/
- Reasonable parallel efficiency is the key
- generally use message passing, even if there are shared-memory CPUs in each box
5. Compilers
- Linux Fortran compilers (F90/95)
  - available from many vendors, e.g., Absoft, Compaq, Intel, Lahey, NAG, Portland Group, Salford
  - g77 is free, but is restricted to Fortran 77 and is relatively slow
6. Compilers (2)
- Intel offers a free, unsupported Fortran compiler for non-commercial purposes
  - full F95
  - OpenMP
  - http://www.intel.com/software/products/compilers/f60l/noncom.htm
7. Compilers (3)
http://www.polyhedron.com/
8. Compilers (4)
- Linux C/C++ compilers
  - gcc/g++ seems to be the standard, usually described as a good compiler
  - also available from vendors, e.g., Compaq, Intel, Portland Group
9. Parallelization of Scientific Codes
10. Domain Decomposition
- Typically perform operations on arrays
  - e.g., setting up and solving a system of equations
- domain decomposition
  - arrays are broken into chunks, and each chunk is handled by a separate processor
  - processors operate simultaneously on their own chunks of the array
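As a sketch of the chunking idea (the helper name is hypothetical, not from the slides), a 1-D decomposition assigns each processor a near-equal slice of the array:

```python
# Sketch of 1-D domain decomposition: split an array of length n
# across p processors as evenly as possible.

def chunk_bounds(n, p, rank):
    """Return the [start, end) index range owned by `rank` of `p` processors."""
    base, extra = divmod(n, p)
    # the first `extra` ranks each get one extra element
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end

# Example: 10 elements over 3 processors -> chunk sizes 4, 3, 3
bounds = [chunk_bounds(10, 3, r) for r in range(3)]
```

Each processor then loops only over its own index range, so the chunks are processed simultaneously.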
11. Other Methods
- Parallelization is also possible without domain decomposition
  - less common
  - e.g., process one set of inputs while reading another set of inputs from a file
12. Embarrassingly Parallel
- if operations are completely independent of one another, this is called embarrassingly parallel
  - e.g., initializing an array
  - some Monte Carlo simulations
- not usually the case
13. Game of Life
- Early, simple cellular automaton
- created by John Conway
- 2-D grid of cells
- each cell has one of 2 states (alive or dead)
- cells are initialized with some distribution of alive and dead states
14. Game of Life (2)
- at each time step, states are modified based on the states of adjacent cells (including diagonals)
- Rules of the game:
  - 3 alive neighbors: alive
  - 2 alive neighbors: no change
  - otherwise: dead
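The rules above can be sketched as a serial update function (a minimal illustration; cells outside the grid are assumed dead, which the slides do not specify):

```python
def life_step(grid):
    """One Game of Life update on a 2-D list of 0/1 cells.

    Rules from the slide: 3 live neighbors -> alive,
    2 live neighbors -> no change, otherwise -> dead.
    Cells outside the grid are treated as dead (an assumption).
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # count the 8 neighbors, including diagonals
            alive = sum(grid[a][b]
                        for a in range(max(i - 1, 0), min(i + 2, rows))
                        for b in range(max(j - 1, 0), min(j + 2, cols))
                        if (a, b) != (i, j))
            if alive == 3:
                new[i][j] = 1
            elif alive == 2:
                new[i][j] = grid[i][j]
            else:
                new[i][j] = 0
    return new
```

For example, a horizontal row of three live cells (a "blinker") flips to a vertical row after one step.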
15. Game of Life (3)
[figure]
16. Game of Life (4)
- Parallelize on 2 processors
  - assign a block of columns to each processor
- Problem: what happens at the split?
17. Game of Life (5)
- At each time step, pass the overlap data from processor to processor
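In a cluster code the overlap columns would travel as MPI messages; as a single-process sketch (no MPI, hypothetical names), the exchange amounts to copying each subdomain's edge column into its neighbor's overlap ("ghost") column:

```python
def exchange_overlap(left, right):
    """Copy edge columns between two column-block subdomains.

    `left` and `right` are 2-D lists; left's last column and right's
    first column are ghost (overlap) columns.  In a real cluster code
    these copies would be MPI sends/receives; here they are plain
    assignments, for illustration only.
    """
    for row_l, row_r in zip(left, right):
        row_l[-1] = row_r[1]    # right's first interior column -> left's ghost
        row_r[0] = row_l[-2]    # left's last interior column  -> right's ghost

left = [[1, 2, 0],   # last column is the ghost column
        [3, 4, 0]]
right = [[0, 5, 6],  # first column is the ghost column
         [0, 7, 8]]
exchange_overlap(left, right)
```

After the exchange, each processor can apply the Game of Life rules to its interior cells using only local data.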
18. Message Passing
- The largest bottleneck to good parallel efficiency is usually message passing
  - much slower than number crunching
- set up your algorithm to minimize message passing
  - minimize the surface-to-volume ratio of the subdomains
19Domain Decomp.
For this domain
To run on 2 processors, decompose like this
Not like this
20. How to Pass Msgs.
- MPI is the recommended method
- PVM may also be used
- MPICH
  - most common
  - free download
  - http://www-unix.mcs.anl.gov/mpi/mpich/
- others also available, e.g., LAM
21. How to Pass Msgs. (2)
- some MPI tutorials
  - Boston University: http://scv.bu.edu/Tutorials/MPI/
  - NCSA: http://pacont.ncsa.uiuc.edu:8900/public/MPI/
22. Performance
23. Code Timing
- How well has the code been parallelized?
- CPU time vs. wallclock time
  - both are seen in the literature
  - I prefer wallclock
    - only for dedicated processors
    - CPU time doesn't account for load imbalance
- unix time command
- Fortran system_clock subroutine
- MPI_Wtime
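A wallclock timing sketch (in Python for brevity; a Fortran code would call system_clock, and an MPI code MPI_Wtime, but the pattern is the same):

```python
import time

def timed(func, *args):
    """Return (result, elapsed wallclock seconds) for one call.

    time.perf_counter() measures wallclock rather than CPU time,
    so it also captures load imbalance and communication waits.
    """
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

total, secs = timed(sum, range(1_000_000))
```

On a dedicated processor this elapsed time is the quantity to use in the speedup formulas that follow.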
24. Parallel Speedup
- quantify how well we have parallelized our code

  S_n = T_1 / T_n

  where
  - S_n = parallel speedup
  - n = number of processors
  - T_1 = time on 1 processor
  - T_n = time on n processors
25. Parallel Speedup (2)
[figure]
26. Parallel Efficiency

  η_n = T_1 / (n T_n) = S_n / n

  where
  - η_n = parallel efficiency
  - T_1 = time on 1 processor
  - T_n = time on n processors
  - n = number of processors
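The two definitions translate directly into code (a trivial sketch, with an illustrative example in the comment):

```python
def speedup(t1, tn):
    """Parallel speedup S_n = T_1 / T_n."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Parallel efficiency eta_n = T_1 / (n * T_n) = S_n / n."""
    return t1 / (n * tn)

# e.g., 100 s serial, 30 s on 4 processors:
# S_4 = 100/30 = 3.33..., eta_4 = 3.33/4 = 0.83 (83% efficiency)
```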
27. Parallel Efficiency (2)
[figure]
28. Parallel Efficiency (3)
- What is a reasonable level of parallel efficiency?
- Depends on
  - how much CPU time you have available
  - when the paper is due
- can think of (1 - η) as wasted CPU time
- my personal rule of thumb: 60%
29. Parallel Efficiency (4)
- Superlinear speedup
  - parallel efficiency > 1.0
- sometimes quoted in the literature
- generally attributed to cache effects
  - subdomains fit entirely in cache, while the entire domain does not
- this is very problem dependent
- be suspicious!
30. Amdahl's Law
- There are always some operations which are performed serially
- we want a large fraction of the code to execute in parallel
31. Amdahl's Law (2)
- Let the fraction of code that executes serially be denoted s
- Let the fraction of code that executes in parallel be denoted p
32. Amdahl's Law (3)
- Noting that p = (1 - s), the parallel speedup is

  S_n = 1 / (s + p/n)        (Amdahl's Law)
33. Amdahl's Law (4)
- The parallel efficiency is

  η_n = 1 / (n s + p)        (alternate version of Amdahl's Law)
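Both forms of Amdahl's law can be checked numerically with a short sketch:

```python
def amdahl_speedup(s, n):
    """Amdahl's law: S_n = 1 / (s + p/n), with p = 1 - s."""
    return 1.0 / (s + (1.0 - s) / n)

def amdahl_efficiency(s, n):
    """Alternate form: eta_n = S_n / n = 1 / (n*s + p)."""
    return 1.0 / (n * s + (1.0 - s))

# Even 5% serial code caps the speedup:
# as n grows without bound, S_n approaches 1/s = 20
```

For s = 0, the speedup is the ideal S_n = n; any s > 0 drives the efficiency toward zero as n grows, which is the point of the next slides.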
34. Amdahl's Law (5)
[figure]
35. Amdahl's Law (6)
- Should we despair?
- No!
- bigger machines solve bigger problems
  - smaller value of s
- if you want to run on a large number of processors, try to minimize s
36. Ways to Fool the Masses
- full title: Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers
- Created by David Bailey of NASA Ames in 1991
- the following is a selection of those ways, some paraphrased
37. Ways to Fool (2)
- Scale problem size with number of processors
- Project results linearly
  - 2 proc., 1 hr. → 1800 proc., 1 sec.
- Present the performance of a kernel, and represent it as the performance of the application
38. Ways to Fool (3)
- Compare with old code on an obsolete system
- Quote MFLOPS based on the parallel implementation, not the best serial implementation
  - increase the number of operations rather than decreasing time
39. Ways to Fool (4)
- Quote parallel speedup, making sure the single-processor version is slow
- Mutilate the algorithm used in the parallel implementation to match the architecture
  - explicit vs. implicit PDE solvers
- Measure parallel times on a dedicated system, serial times in a busy environment
40. Ways to Fool (5)
- If all else fails, show pretty pictures and animated videos, and don't talk about performance.
41. Summary
- Clusters are viable platforms for relatively low-cost scientific computing
- parallel considerations are similar to other platforms
- MPI is a free, effective message-passing API
- be careful with performance timings