Title: Parallel computing on Nanco: an introductory course
1. Parallel computing on Nanco: an introductory course
- Anne Weill Zrahia
- Technion, Computer Center
- July 2007
2. Parallel Programming on the Nanco
- 1) Parallelization Concepts
- 2) Nanco Computer Design
- 3) Orientation on Nanco
- 4) Parallel Programming - MPI
- 5) Queuing system - SGE
4. Parallel Power for HPC
- A closely coupled, scalable set of interconnected computer systems, sharing common hardware and software infrastructure, providing a parallel set of resources to applications for improved performance.
5. Resources needed for applications arising from nanotechnology
- Large memory: Tbytes
- High floating-point computing speed: Tflops
- High data throughput: state of the art
6. Parallel classification
- Parallel architectures
  - Shared Memory / Distributed Memory
- Programming paradigms
  - Data parallel / Message passing
7. Shared Memory
- Each processor can access any part of the memory
- Access times are uniform (in principle)
- Easier to program (no explicit message passing)
- Bottleneck when several tasks access the same location
8. SMP architecture
[Figure: four processors (P) attached to a single shared Memory]
9. Distributed Memory
- Each processor can access only its local memory
- Access times depend on location
- Processors must communicate via explicit message passing
10. Distributed Memory
[Figure: processor-memory pairs connected by an interconnection network]
11. Message Passing Programming
- Separate program on each processor
- Local memory
- Control over distribution and transfer of data
- Additional complexity of debugging due to communications
12. Why not a cluster?
- A single SMP system is easier to purchase and maintain
- Ease of programming in SMP systems
13. Why a cluster?
- Scalability
- Total available physical RAM
- Reduced cost
- But
14. Performance issues
- Concurrency: ability to perform actions simultaneously
- Scalability: performance is not impaired by increasing the number of processors
- Locality: high ratio of local memory accesses to remote memory accesses (or low communication)
15. SP2 Benchmark
- Goal: checking performance of real-world applications on the SP2
- Execution time (seconds): CPU time for applications
- Speedup = (execution time for 1 processor) / (execution time for p processors)
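The speedup ratio defined on this slide can be sketched as a small C helper. This is only an illustration: the function names are hypothetical, and the efficiency measure (speedup divided by the number of processors) is a common companion metric that is not on the slide itself.

```c
#include <assert.h>

/* Speedup S(p) = T(1) / T(p): how many times faster the run on
   p processors is compared with the run on 1 processor. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency E(p) = S(p) / p (assumption: not on the slide).
   A value of 1.0 would correspond to ideal, linear scaling. */
double efficiency(double t_serial, double t_parallel, int p)
{
    assert(p > 0);
    return speedup(t_serial, t_parallel) / (double)p;
}
```

With made-up timings of 100 seconds on 1 processor and 25 seconds on 8 processors, speedup(100.0, 25.0) gives 4 and efficiency(100.0, 25.0, 8) gives 0.5.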
17. 2) Nanco design
18. Nanco architecture
19. Configuration
[Figure: node1, node2, ..., node64, each with processors (P) and memory (M), connected through an Infiniband switch]
20. Configuration
- 64 dual-processor, dual-core compute nodes, each processor a dual-core Opteron Rev. F
- 8 GB RAM per node
- 2 master nodes for high availability (H/A), also Opterons
- Infiniband interconnect: switch and HCAs
- NetApp storage
22. AMD Opteron processor
23. Memory bottleneck
24. AMD HyperTransport
26. How does this reflect on performance?
27. Performance
- Access to local memory: 1 hop
- Access to the 2nd processor's memory: 2 hops
- Prefetch can be useful for predictable patterns
- Multithreading can be used at node level
28. Infiniband interconnect
29. 3) Orientation on Nanco
30. Getting started
- Security
- Logging in
- Shell environment
- Transferring files
31. System access - security
- Secure access
- X-tunnelling (for graphics)
- Can use ssh -X for tunnelling
32. Working on Nanco
- Because of high availability, we have 2 master nodes (masternode1 and masternode2) as points of entry to the cluster.
- Login: ssh nanco.technion.ac.il, and you will be redirected to one of the masters
33. Login Environment
- Paths and environment variables have been set up (change things with care)
- tcsh is the default shell (you can switch to bash if you like)
- User-modifiable environment variables are in .cshrc in the home directory
- Home directory is /u/courseXX
34. Compilers
- Options are gcc, gcc4, suncc for C
- g++, sunCC for C++
- g77 (no F90), gfortran, sunf90 for Fortran 77/Fortran 90
35. Useful commands
- ssh-key: a script to allow ssh to all nodes
- top: to see your processes. Attention: you must log in to the actual machine to see your process
- ps -u <username>: to see processes
36. Useful commands (cont.)
- parps: a script to see running processes on a set of nodes. Usage: parps n1 n2 (from node<n1> to node<n2>)
- parshow: a script to see where a particular executable is running
37. Flags for compilation
- sunf90 -fast -xO5 -xarch=amd64a myprog.f -o myprog
- gcc -O3 -march=opteron myprog.c -o myprog
38. Compilation with MPI
- Most MPI implementations support C, C++, Fortran 77 and Fortran 90 bindings.
- Compilation scripts are of the type mpif77, mpif90, mpicc, etc.
- You can specify generic compiler options
39. 4) Parallel programming with MPI
40. What is MPI?
- A message-passing library specification
- Extended message-passing model
- Not specific to an implementation or computer
41. Basics of MPI programming
- MPI is a message-passing library
- Assumes a distributed memory architecture
- Includes routines for performing communication (exchange of data and synchronization) among the processors.
42. Message Passing
- Data transfer + synchronization
- Synchronization: the act of bringing one or more processes to known points in their execution
- Distributed memory: memory split up into segments, each of which may be accessed by only one process.
43. Message Passing
[Figure: handshake between two processes: "May I send?", "Yes", then the data is sent]
44. MPI Standard
- Standard by consensus, designed in an open forum
- Introduced by the MPI Forum in May 1994, updated in June 1995.
- MPI-2 (1997) provides extensions to the MPI standard
45. Why use MPI?
- Standardization
- Portability
- Performance
- Richness
- Designed to enable libraries
46. Writing an MPI Program
- If there is a serial version, make sure it is debugged
- If not, try to write a serial version first
- When debugging in parallel, start with a few nodes first.
47. Format of MPI routines
48. Six useful MPI functions
49. Communication routines
50. End MPI part of program
51. The simplest MPI program
52. Exercise 1: running a simple MPI program
53. Exercise 2: modifying and using send/receive
54. MPI Messages
- DATA: data to be sent
- ENVELOPE: information to route the data.
55. Description of MPI_Send (MPI_Recv)
56. Description of MPI_Send (MPI_Recv)
57. program hello (Fortran)
    program hello
    implicit none
    include 'mpif.h'
    integer status(MPI_STATUS_SIZE)
    integer ierror, size, rank, tag, i
    character*12 message

    call MPI_INIT(ierror)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
    tag = 100
    if (rank .eq. 0) then
       ! rank 0 sends the message to every other rank
       message = 'Hello, world'
       do i = 1, size-1
          call MPI_SEND(message, 12, MPI_CHARACTER, i, &
                        tag, MPI_COMM_WORLD, ierror)
       enddo
    else
       ! all other ranks receive the message from rank 0
       call MPI_RECV(message, 12, MPI_CHARACTER, 0, &
                     tag, MPI_COMM_WORLD, status, ierror)
    endif
    print *, 'node', rank, ':', message

    call MPI_FINALIZE(ierror)
    end
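58. Hello program (C)
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int tag = 100;
        int rank, size, i;
        MPI_Status status;
        char message[12];   /* "Hello,world" plus the terminating '\0' */

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        strcpy(message, "Hello,world");
        if (rank == 0) {
            /* rank 0 sends the message to every other rank */
            for (i = 1; i < size; i++)
                MPI_Send(message, 12, MPI_CHAR, i, tag, MPI_COMM_WORLD);
        } else {
            /* all other ranks receive the message from rank 0 */
            MPI_Recv(message, 12, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        }
        printf("node %d : %s \n", rank, message);
        MPI_Finalize();
        return 0;
    }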
59. Hellosend
60. Some useful remarks
- Source: MPI_ANY_SOURCE means that any source is acceptable
- Tags specified by sender and receiver must match, or MPI_ANY_TAG: any tag is acceptable
- Communicator must be the same for send/receive. Usually MPI_COMM_WORLD
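The matching rules in these remarks can be summarized as a small predicate. The sketch below is a plain C model of the rule, not MPI code: ANY_SOURCE and ANY_TAG are hypothetical stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG (whose real values come from mpi.h), and the communicator is represented by an illustrative integer id.

```c
#include <assert.h>

/* Hypothetical stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG;
   the real constants are defined by the MPI implementation. */
#define ANY_SOURCE (-1)
#define ANY_TAG    (-1)

/* Envelope of a message: who sent it, with which tag,
   and in which communicator (an illustrative integer id here). */
struct envelope {
    int source;
    int tag;
    int comm;
};

/* Returns 1 if a receive posted with (want_source, want_tag, want_comm)
   accepts a message carrying envelope e, and 0 otherwise. */
int receive_matches(struct envelope e, int want_source, int want_tag, int want_comm)
{
    assert(e.source >= 0 && e.tag >= 0);   /* a sent message has a concrete source and tag */
    if (e.comm != want_comm)
        return 0;                          /* communicator must be the same */
    if (want_source != ANY_SOURCE && want_source != e.source)
        return 0;                          /* source must match unless the wildcard is used */
    if (want_tag != ANY_TAG && want_tag != e.tag)
        return 0;                          /* tag must match unless the wildcard is used */
    return 1;
}
```

For instance, a message sent by rank 0 with tag 100 is accepted by a receive asking for (0, 100) or for (ANY_SOURCE, ANY_TAG) in the same communicator, but not by a receive asking for tag 99.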
61. Computing pi using MPI
62. Computing pi using MPI (2)
63. Computing pi using MPI (3)
64. Computing pi using MPI (4)
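The code for these slides is not part of this transcript. A common way to compute pi in MPI courses, assumed here, is to integrate 4/(1+x^2) over [0,1] with the midpoint rule, giving each process every nprocs-th interval and summing the partial results (in MPI, typically with MPI_Reduce). The sketch below models that decomposition in plain C without MPI calls; pi_partial and pi_estimate are hypothetical names, and the loop over ranks stands in for the reduction step.

```c
#include <assert.h>

/* Partial sum that one process (rank) would compute:
   it handles intervals rank, rank + nprocs, rank + 2*nprocs, ... */
double pi_partial(int rank, int nprocs, long n)
{
    assert(n > 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    double h = 1.0 / (double)n;             /* width of one interval */
    double sum = 0.0;
    for (long i = rank; i < n; i += nprocs) {
        double x = h * ((double)i + 0.5);   /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Serial stand-in for the reduction step:
   add up the partial results of all ranks. */
double pi_estimate(int nprocs, long n)
{
    double total = 0.0;
    for (int rank = 0; rank < nprocs; rank++)
        total += pi_partial(rank, nprocs, n);
    return total;
}
```

Calling pi_estimate(4, 100000) models a run on 4 processes with 100000 intervals. In a real MPI program each process would call only pi_partial with its own rank, and the sum would be collected on one process.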
65. Broadcast
- Send data on one node to all other nodes in the communicator.
- MPI_Bcast(buffer, count, datatype, root, comm, ierr)
66. Broadcast
[Figure: broadcast: the data item A0 held by P0 is copied to P0, P1, P2 and P3]
67. Performance evaluation
- Fortran
    real*8 t1
    t1 = MPI_Wtime()   ! Returns elapsed time
- C
    double t1;
    t1 = MPI_Wtime();
68. MPI References
- The MPI Standard: www-unix.mcs.anl.gov/mpi/index.html
- Parallel Programming with MPI, Peter S. Pacheco, Morgan Kaufmann, 1997
- Using MPI, W. Gropp, Ewing Lusk, Anthony Skjellum, The MIT Press, 1999.
69. 5) Queuing system: Sun Grid Engine
70. Sun Grid Engine
- Open-source batch queuing system similar to PBS or LSF
- Automatically runs jobs on less loaded nodes
- Queues jobs for later execution to avoid overloading the system
71. Queues definition
- System job execution policy
- Resource allocation
- Resource limits
- Accounting
72. SGE properties
- Can schedule serial or MPI jobs
  - serial jobs run in individual host queues
  - parallel jobs must include a parallel environment request
73. Working with SGE jobs
- There are commands for querying or modifying the status of a job running or queued by SGE
  - qsub: submit a job
  - qstat: query the status of a job
  - qdel: delete a job from SGE
74. Submitting a serial job
- Create a submit script (basic.sh):
    #!/bin/sh
    # scalar example
    echo "This code is running on"
    hostname
    date
    # end of script
75. Submitting a serial job
- The job is submitted to SGE using the qsub command:
    qsub basic.sh
76. Two ways of submitting
- With arguments:
    qsub -o outputfile -j y -cwd basic.sh
- In the submit script
77. Monitoring a job - qstat
- To list the status and node properties:
    qstat
78. Monitoring a job - qstat
- qstat output: important fields
- Job identifier
- Job status
  - qw: job queued and waiting
  - t: job transferring and about to start
  - r: job running on listed hosts
  - d: job has been marked for deletion
79. Deleting a job - qdel
- Single job:
    qdel 151
- List of jobs:
    qdel 151 152 153
- All jobs under a user:
    qdel -u artemis
80. Output produced by jobs
- By default, we get 2 files:
  - <script>.o.<jobid>: standard output
  - <script>.e.<jobid>: error messages
- For parallel jobs, also:
  - <script>.po.<jobid>: list of processors the job ran on
81. Debugging job failures
82. Script for submitting parallel jobs
- mpisub gets as input the number of processors and a script
- Example: mpisub 8 <myscript.sh>
83. Parallel MPI jobs and SGE
- SGE uses the concept of a parallel environment (PE)
- Several PEs can coexist on the machine
- Each host has an associated queue and resource list (time, memory)
- A PE is a list of hosts along with a set number of job slots
84. List of queues
85. qstat options
86. Thanks for your attention!