Using the Linux Cluster at OSC Science - PowerPoint PPT Presentation

Provided by: LeslieS96
Learn more at: http://cecs.wright.edu
1
Using the Linux Cluster at OSC
Science & Technology Support and Systems Staff
High Performance Computing
The Ohio Supercomputer Center
1224 Kinnear Road
Columbus, Ohio 43212
2
Contents
  • Introduction
  • Accessing the Linux Cluster at OSC
  • User Environment
  • Development Environment
  • Job Scheduling
  • Cluster Parallel Programming

3
Introduction
  • Linux clusters are becoming mature
  • Scientists at NASA Goddard started the Beowulf
    idea, see http://www.beowulf.org
  • Variation on an old theme: NOWs, COWs, etc.
  • Rich tool environment, free and commercial
  • Third-party adoption
  • LS-Dyna, Fluent and other scientific codes
  • Large choice of commercial compilers
  • Integrators

4
Tutorial Objectives
  • Logging into the cluster
  • Understanding the layout and hardware of the OSC
    cluster
  • Compiling a program
  • Submitting a job to the batch queue system
  • Compiling and running parallel programs

5
Cluster Layout
6
Hardware Introduction
  • The OSC Beowulf cluster consists of the
    following
  • A front-end node for interactive use, compiling,
    testing, etc.
  • Several (currently 32) compute nodes used by
    parallel jobs.
  • A high-speed system area network (SAN) for
    inter-node communication.
  • External network access.

7
Front End Node Configuration
  • Quad Intel Pentium III Xeon processors running at
    550 MHz with 512kB of L2 cache.
  • 2 GB RAM.
  • Dual UW SCSI controllers supporting 72 GB of SCSI
    disk (mirrored system disk, /usr/local for
    cluster-wide software).
  • Dual Fast Ethernet interfaces.
  • HIPPI interface for fast access to the OSC mass
    storage server (mss.osc.edu).

8
Processor Performance
  • All of the nodes in the OSC Beowulf cluster use
    the Intel Pentium III Xeon processor with a 550
    MHz clock
  • x86 instruction decoder in front of a RISC-style
    super-scalar execution core with out-of-order
    execution.
  • 32 kB L1 instruction cache, 32 kB L1 data cache,
    and 512 kB unified L2 cache.
  • 5 execution units: 2 integer units, 2 load/store
    units, and 1 floating-point unit.
  • 14-stage pipeline.
  • 550 MFLOPS peak, 120 MFLOPS on Linpack 100x100.

9
Memory Performance
  • All of the nodes in the OSC Beowulf cluster use
    100MHz SDRAM memory
  • 64-bit wide data path.
  • 6ns latency.
  • 800 MB/s peak, 300 MB/s on stream_d memory copy.
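
The peak figures on the processor and memory slides follow from simple arithmetic, sketched below. The assumption that the single floating-point unit retires one result per clock is ours (it is what makes 550 MHz correspond to 550 MFLOPS); the efficiency percentages are truncated to integers by shell arithmetic.

```shell
#!/bin/sh
# Memory: 64-bit data path at 100 MHz.
bus_bits=64
mem_mhz=100
peak_bw=$(( bus_bits / 8 * mem_mhz ))        # bytes per transfer x MHz = MB/s

# Processor: 550 MHz clock, one FP result per cycle (assumption).
clock_mhz=550
flops_per_cycle=1
peak_mflops=$(( clock_mhz * flops_per_cycle ))

echo "peak memory bandwidth: ${peak_bw} MB/s"
echo "stream copy efficiency: $(( 300 * 100 / peak_bw ))%"
echo "peak floating point: ${peak_mflops} MFLOPS"
echo "Linpack 100x100 efficiency: $(( 120 * 100 / peak_mflops ))%"
```

Measured rates well below peak are typical for small benchmarks like these, which is why the slides quote both numbers.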

10
Cluster Interconnect
  • Myrinet interconnects are full duplex, 1.28+1.28
    Gbit/second channels; 2.0+2.0 Gbit/second
    channels are available today.
  • The driver provides user level, OS bypass
    communication primitives.
  • Memory registration to implement zero copy
    protocols.
  • Communication primitives provided through GM,
    Glen's Messages.

[Figure: software layers, top to bottom: User Level Programs; MPI, ch_gm; GM]
11
Accessing the Linux Cluster at OSC
  • There are several ways to remotely access the
    front end node of the Linux cluster,
    oscbw.osc.edu.
  • You can use telnet:
  • telnet oscbw.osc.edu
  • rsh and rlogin are also available:
  • rlogin oscbw.osc.edu -l myuserid
  • However, we encourage you to use ssh if possible:
  • ssh oscbw.osc.edu -l myuserid
  • ssh sends your commands over an encrypted stream,
    so your passwords and all data transferred can't
    be sniffed over the network.
  • ssh is also the recommended method if you will be
    using the interactive batch queue (required for
    parallel debugging).

12
Remote X Display from the Beowulf Cluster
  • You can run applications which use the X Window
    System on the front end node and have them
    displayed on your remote workstation or PC.
  • If you use ssh, you should be able to display X
    applications remotely with no further work; ssh
    does all the necessary steps itself.
  • If you use telnet, rlogin, or rsh, you need to
    set an environment variable called DISPLAY in
    your session on the front end node to point to
    your workstation:
  • export DISPLAY=mypc.some.edu:0.0 (for ksh users)
  • setenv DISPLAY mypc.some.edu:0.0 (for csh users)
  • You also need to tell your workstation that the
    front end node is allowed to display to it:
  • xhost +oscbw.osc.edu

13
User Environment
  • Shells supported at OSC on the cluster are:
  • ksh
  • bash
  • tcsh and csh
  • The modules interface is a way to allow
    multiple versions of software to coexist.
  • Modules allow you to add or remove software from
    your environment without having to manually
    modify environment variables.
  • This is a Cray-ism which OSC has adopted for
    all of our HPC systems; the OSC Beowulf uses a
    modules implementation from Los Alamos National
    Lab.

14
Using modules
  • You can get a list of modules you currently have
    loaded by running module list:
  • oscbw> module list
  • Currently Loaded Modulefiles:
  • 1) pbs_2_2_0
  • 2) pgi_3_1
  • 3) modules_0_2
  • 4) mpich_gm
  • To get a list of all available modules, run
    module avail:
  • oscbw> module avail
  • ---------- /usr/local/lanl-modules-0.2/modules ----------
  • hdf -> hdf_4_1_2
  • pbs -> pbs_2_1_13
  • list continues...

15
Using Modules (cont)
  • To add a software module to your environment, run
    module load <modulename>:
  • oscbw> module load scms
  • oscbw> which scms
  • /usr/local/scms/bin/scms
  • oscbw> module list
  • Currently Loaded Modulefiles:
  • scms
  • To remove a software package from your
    environment, run module unload <modulename>:
  • oscbw> module unload scms
  • oscbw> which scms
  • scms: Command not found.
  • oscbw> module list
  • Currently Loaded Modulefiles:
  • no scms
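
In scripts that may run both on and off the cluster, it can help to guard module load so the script does not fail where the modules interface is missing. The following is a minimal sketch; the helper name load_if_available is hypothetical, and only the module load command itself comes from the slides.

```shell
#!/bin/sh
# Load a module only when the "module" command exists in this shell.
# load_if_available is a hypothetical helper, not part of the modules package.
load_if_available() {
    if command -v module >/dev/null 2>&1; then
        module load "$1"
    else
        echo "module command not found; skipping $1" >&2
        return 1
    fi
}
```

Usage would look like `load_if_available scms || echo "continuing without scms"`.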

16
Development Environment
  • Portland Group Compilers
  • Vendor of compilers for traditional HPC systems.
  • Contracted by DOE and Intel to provide compilers
    for Intel ASCI Red.
  • Optimizing compiler for Intel P6 core.
  • Linux, Solaris and MS Windows (x86 only).
  • Compiler suite includes:
  • C (pgcc)
  • C++ (pgCC)
  • Fortran 77 (pgf77)
  • Fortran 90 (pgf90)
  • High Performance Fortran - HPF (pghpf)
  • Link compatible with GCC objects and libraries.
  • Includes debugger and profiler (can use GDB).

17
Portland Group Compilers C Options
  • -B (allows C++ style comments)
  • -mp (enables support for OpenMP and SGI-style PCF
    pragmas for parallelization)
  • -Xa (enforces strict ANSI C compliance)
  • -Xc (enforces loose ANSI C compliance)
  • -Xs (enforces strict K&R v1 C compliance)
  • -Xt (enforces loose K&R v1 C compliance)
  • Recommended flags: -Xa -fast -tp p6 -Mvect=assoc
    -Mvect=cachesize:524288

18
Portland Group Compilers Common Options
  • Most of these are identical to their counterparts
    in the GNU compilers
  • -c (compile only; do not link)
  • -DMACRO=value (defines preprocessor macro MACRO
    with optional value; default value is 1)
  • -g (generate symbols for debugging; disables
    optimization)
  • -I/dir/name (add /dir/name to the list of
    directories to be searched for included files)
  • -lname (add library libname.{a,so} to the list of
    libraries to be linked)
  • -L/dir/name (add /dir/name to the list of
    directories to be searched for library files)
  • -o outfile (name resulting output file outfile;
    default is a.out)
  • -UMACRO (removes definition of MACRO from
    preprocessor)
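
The common options combine in the usual compile-then-link pattern. The sketch below only assembles and prints the command lines (a dry run), so it can be read without the Portland Group compilers installed. The file names solver.c and libmylib, the macro NMAX, and the include and library directories are hypothetical.

```shell
#!/bin/sh
CC=pgcc
CFLAGS="-Xa -fast -tp p6"

# Compile only (-c), define a macro (-D), add an include directory (-I).
compile_cmd="$CC $CFLAGS -DNMAX=1000 -I$HOME/include -c solver.c -o solver.o"

# Link: add a library directory (-L) and a library (-l), name the output (-o).
link_cmd="$CC solver.o -L$HOME/lib -lmylib -o solver"

echo "$compile_cmd"
echo "$link_cmd"
```

On the cluster, the two printed lines could be run as-is once the placeholder names are replaced with real ones.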

19
Portland Group Compilers C++ Options
  • -A (enforces strict ANSI C++ compliance)
  • --exceptions (enables ANSI C++ exceptions)
  • -mp (enables support for OpenMP and SGI-style PCF
    pragmas for parallelization)
  • --prelink-objects (enables support for template
    libraries within template libraries)
  • -tall (forces all templates to be instantiated)
  • -tlocal (forces template instantiations to be
    local)
  • -tnone (forces no templates to be instantiated)
  • -tused (instantiates only those templates used)
  • Recommended flags: -A -fast -tp p6 -Mvect=assoc
    -Mvect=cachesize:524288 --prelink-objects

20
Portland Group Compilers F77/F90 Options
  • -byteswapio (uses byte-swapping unformatted I/O
    compatible with Sun and SGI systems)
  • -i4 (assumes 4-byte INTEGERs; default)
  • -i8 (assumes 8-byte INTEGERs)
  • -module /dir/name (adds /dir/name to the list of
    directories searched for F90 modules)
  • -mp (enables support for OpenMP and SGI-style PCF
    directives for parallelization)
  • -r4 (assumes 4-byte REALs; default)
  • -r8 (assumes 8-byte REALs)
  • -Mcraypointer (forces compatibility with Cray
    CF77 pointer semantics)
  • Recommended flags: -fast -tp p6 -Mvect=assoc
    -Mvect=cachesize:524288

21
Optimization Usage
  • Vectorizer can optimize countable loops with
    large arrays.
  • Use -Minfo=loop to have the compiler report which
    optimizations (unrolling, vectorization) were
    applied to the loops.
  • Cache size can be specified to maximize cache
    re-use with -Mvect=cachesize.
  • Use -Mneginfo=loop to provide information about
    why a loop was not a candidate for vectorization.
  • Can specify number of times to unroll a loop.
  • Can use -Minline to inline functions. This can
    improve the performance of calls to functions
    inside of subroutines.
  • Is not useful for functions that have an
    execution time >> penalty for the jump.
  • This option will sacrifice code compactness for
    efficiency.

22
Optimizations Usage (cont.)
  • All command line optimizations are available
    through directives or pragmas.
  • Can be used to enable or disable specific
    optimizations.

23
Caveats for Portland Compilers
  • F77 and F90 are separate front-ends.
  • Debugger cannot display floating point registers.
  • Code compiled with the Portland compilers is
    compatible with GDB:
  • Initial listing of code does not work.
  • Set break point or watch point where desired.
  • Profiler can be difficult or impossible to use on
    parallel codes.
  • Complete compiler suite documentation can be
    found at http://www.pgroup.com/docs.htm

24
MPI Compiler Wrappers
  • The MPICH/GM implementation of MPI uses a set of
    compiler scripts to keep users from having to
    remember how to set include and library paths for
    their MPI compiles. These scripts call the
    system compilers to do the actual compilation.
    The scripts support the following languages:
  • C (mpicc -- wrapper for pgcc)
  • C++ (mpiCC -- wrapper for pgCC)
  • Fortran 77 (mpif77 -- wrapper for pgf77)
  • Fortran 90 (mpif90 -- wrapper for pgf90)
  • These compiler scripts accept the same arguments
    as the compiler they wrap, i.e. mpicc accepts the
    same arguments as pgcc, mpif77 accepts the same
    arguments as pgf77, etc.

25
MPI Compiler Wrappers (cont.)
  • The MPI compiler wrappers also accept a few
    command line arguments of their own
  • -mpilog (generates MPE log files compatible with
    the jumpshot MPI profiler)
  • -mpitrace (prints trace messages on entry and
    exit to all MPI routines)

26
When the MPI Compiler Wrappers Break
  • Occasionally, a program's build process will make
    assumptions about quoting around the
    arguments for compilers that will not work with
    the MPI compiler wrappers (which are, after all,
    only shell scripts). In these cases, you should
    use the Portland Group compilers directly and use
    the following environment variables:
  • Compile with
  • $MPI_CFLAGS (C)
  • $MPI_CXXFLAGS (C++)
  • $MPI_FFLAGS (F77)
  • $MPI_F90FLAGS (F90)
  • Link with $MPI_LIBS
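
Bypassing the wrappers then amounts to passing these variables to the Portland Group compiler yourself. A minimal dry-run sketch follows: it prints the commands rather than running them. The source file nblock.c is borrowed from the parallel job example later in the talk, and the fallback values assigned to the variables are placeholders of ours so the sketch prints something off-cluster; on the cluster the real values are assumed to come from the environment.

```shell
#!/bin/sh
# Placeholder defaults (hypothetical paths); the cluster environment is
# assumed to provide the real values.
: "${MPI_CFLAGS:=-I/path/to/mpi/include}"
: "${MPI_LIBS:=-L/path/to/mpi/lib -lmpich}"

# Compile with the MPI include flags, then link with the MPI libraries.
compile_cmd="pgcc $MPI_CFLAGS -c nblock.c -o nblock.o"
link_cmd="pgcc nblock.o $MPI_LIBS -o nblock"

echo "$compile_cmd"
echo "$link_cmd"
```

In a Makefile the same idea applies: set the compiler to pgcc and append the MPI variables to the compile and link flags.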

27
Libraries
  • The OSC Beowulf cluster has several Fortran
    numerical libraries installed which can be used
    in conjunction with the Portland Group compilers:
  • BLAS (link with -lblas)
  • LAPACK (link with -llapack -lblas)
  • LAPACK90 (link with -L/usr/local/lib -llapack90
    -llapack -lblas)
  • BLACS, PBLAS, and ScaLAPACK (compile with mpif77
    or mpif90, link with $SCALAPACK $PBLAS
    $FBLACS)
  • A public domain version of Cray's libsci FFT
    routines (link with
    -L/usr/local/lib -lsci)
  • PETSC (module load petsc to use; look at the
    example Makefiles in $PETSC_ROOT/examples for
    how to build programs which use it)

28
Libraries (cont)
  • The OSC Beowulf cluster also has several I/O
    libraries installed for writing files in platform
    independent formats:
  • HDF (module load hdf to use, compile with
    $HDF_INCLUDE, link with $HDF_LIBS)
  • HDF5 (module load hdf5 to use, compile with
    $HDF5_INCLUDE, link with $HDF5_LIBS)
  • NetCDF (link with -lnetcdf for C or Fortran, or
    -lnetcdf_c++ for C++)

29
Job Scheduling
  • Why Job Scheduling Software?
  • In an ideal world, users would coordinate with
    each other and no conflicts would be encountered
    when running jobs on a cluster.
  • Unfortunately in real life we have limited
    resources (processors, memory and network
    interfaces)
  • Users, faced with time deadlines of their own,
    will want to execute jobs on the cluster as it
    fits with their schedule
  • High throughput users can swamp the whole system,
    if allowed
  • Users can check for CPU availability (system
    load), but how many will check memory or network
    interface availability?
  • Job scheduling system allows you to enforce a
    system policy
  • Policy can be established by management or peer
    review
  • Enforcement of policy controls the maximum
    resources available and the order in which
    jobs are allocated these resources

30
Job Scheduling Software for Clusters
  • There are several batch queuing systems available
    for Linux-based clusters, depending on what your
    needs are. Here are just a few:
  • Condor (http://www.cs.wisc.edu/condor/)
  • DQS (http://www.scri.fsu.edu/pasko/dqs.html)
  • Generic NQS (http://www.gnqs.org/)
  • Job Manager (http://bond.imm.dtu.dk/jobd/)
  • GNU Queue (http://www.gnu.org/software/queue/queue.html)
  • LSF (http://www.platform.com/ -- commercial)
  • Portable Batch System (PBS) (http://pbs.pbspro.com/)

31
Introduction to PBS
  • PBS is short for Portable Batch System; it is
    an open source batch queuing system.
  • It is an outgrowth/extension of the NQS batch
    queuing system from the NAS project at NASA Ames
    Research Center.
  • PBS is available for virtually anything that is
    UNIX-like, including Linux, the BSDs, UNICOS,
    IRIX, Solaris, AIX, HP/UX, and Digital UNIX.

32
How PBS Handles a Job
  • User determines resource requirements for a job
    and writes a batch script.
  • User submits job to PBS with the qsub command.
  • PBS places the job into a queue based on its
    resource requests and runs the job when those
    resources become available.
  • The job runs until it either completes or exceeds
    one of its resource request limits.
  • PBS copies the job's output into the directory
    from which the job was submitted and optionally
    notifies the user via email that the job has
    ended.

33
Determining Job Requirements
  • For single CPU jobs, PBS needs to know at least
    two resource requirements
  • CPU time
  • memory
  • For multiprocessor parallel jobs, PBS also needs
    to know how many nodes/CPUs are required.
  • Other things to consider
  • Job name?
  • Working in /tmp or TMPDIR?
  • Where to put standard output and error output?
  • Should the system email when the job completes?

34
PBS Job Scripts
  • A PBS job script is just a regular shell script
    with some comments (the ones starting with #PBS)
    which are meaningful to PBS. These comments are
    used to specify properties of the job.
  • PBS job scripts always start in your home
    directory, $HOME. If you need to work in another
    directory, your job script will need to cd to
    there.
  • Every PBS job has a unique temporary directory,
    $TMPDIR. This is on each compute node's local
    disk array and thus is much faster than your home
    directory, which is mounted over the network from
    the mass storage server. For best I/O
    performance, you should try to copy all the files
    you need into $TMPDIR, do your work there, and
    then copy any files you want to keep back to your
    home directory.
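
The copy-in, compute, copy-out pattern described above can be tried on any machine with the sketch below. It uses stand-ins so that it runs outside PBS: WORKDIR plays the role of a directory under $HOME, SCRATCH plays the role of the per-job $TMPDIR, and the sort command is a placeholder for the real computation. All of these names are ours, not part of PBS.

```shell
#!/bin/sh
# Stand-ins: in a real job, PBS provides $TMPDIR and the work directory
# would live under $HOME.
WORKDIR=$(mktemp -d)
SCRATCH=${TMPDIR:-/tmp}/stage-demo.$$
mkdir -p "$SCRATCH"
echo "1 2 3" > "$WORKDIR/input.dat"

# 1. Copy the input files to fast local scratch.
cp "$WORKDIR/input.dat" "$SCRATCH"/

# 2. Do the work in scratch (placeholder computation).
cd "$SCRATCH" || exit 1
tr ' ' '\n' < input.dat | sort -rn > output.dat

# 3. Copy back the results worth keeping, then clean up.
cp output.dat "$WORKDIR"/
cd "$WORKDIR" && rm -rf "$SCRATCH"
```

Only step 3's copy survives the job, so anything left in scratch and not copied back should be considered lost.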

35
PBS Job Scripts (cont)
  • Useful PBS options:
  • -e errfile (redirect standard error to errfile)
  • -I (run as an interactive job)
  • -j oe (combine standard output and standard
    error)
  • -l cput=N (request N seconds of CPU time; N can
    also be in hh:mm:ss form)
  • -l mem=N[KMG][BW] (request N [kilo|mega|giga]
    {bytes|words} of memory)
  • -l nodes=N:ppn=M (request N nodes with M
    processors per node)
  • -m e (mail the user when the job completes)
  • -m a (mail the user if the job aborts)
  • -o outfile (redirect standard output to outfile)
  • -N jobname (name the job jobname)
  • -S shell (use shell instead of your login shell
    to interpret the batch script; must include a
    complete path)
  • -V (job inherits the full environment of the
    current shell, including DISPLAY)
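
Since the CPU time limit can be given either in seconds or in hh:mm:ss form, a small converter is handy for checking that two requests are equivalent. This is a sketch only; the function name hms_to_seconds is hypothetical and not part of PBS.

```shell
#!/bin/sh
# Convert an hh:mm:ss time limit into seconds.
# awk treats fields such as "08" as decimal numbers, so leading zeros are safe.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

hms_to_seconds 40:00:00    # a 40-hour limit
hms_to_seconds 0:01:00     # a 1-minute limit
```

A request for 40 hours therefore corresponds to 144000 seconds of CPU time.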

36
A First Batch Script
  • Here is a simple batch job:

    #PBS -l cput=40:00:00
    #PBS -l nodes=1:ppn=1
    #PBS -N cdnz3d
    #PBS -j oe
    #PBS -S /bin/ksh
    cd $HOME/Beowulf/cdnz3d
    cp *.dat cdnz3d.in cdnz3dxyz.bin $TMPDIR
    cd $TMPDIR
    /usr/bin/time ./cdnz3d > cdnz3d.hist
    cp cdnz3d.out cdnz3dq.bin $HOME/Beowulf/cdnz3d

  • This job asks for one CPU on one node, and 40
    hours of CPU time. Its name is cdnz3d.

37
Monitoring a Job
  • The status of all the jobs running on the Beowulf
    cluster can be shown with the qstat command:

    oscbw> qstat -a
    oscbw.cluster.osc.edu:
                                                                Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    80.oscbw.clust  osu2376  serial   h1.com       1207   1   1   64mb 16:40 R 01:05 node01
    86.oscbw.clust  troy     serial   cdnz3d        776   1   1   36mb 40:00 R 00:00 node02
    93.oscbw.clust  cls022   serial   NAME          787   1   1  128mb 11:06 R 00:00 node04
    101.oscbw.clus  cls022   SMP      imid1        6542   1   2   64mw 10:00 R 00:40 node01

38
qstat Output Fields
  • Job Id (request number)
  • Username (userid)
  • Queue (queue the job is in)
  • Jobname (name of the job)
  • SessId (job identifier)
  • NDS (number of nodes requested)
  • TSK (number of CPUs per node requested)
  • Req'd Memory (memory requested if waiting or
    used if running)
  • Req'd Time (CPU time requested)
  • S (status)
  • R (running)
  • Q (queued and waiting)
  • Elap Time (time the job has been running)
  • nodes the job is running on
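
Because the job lines are whitespace-separated in the field order listed above, they are easy to filter with awk. The sketch below runs against a copy of the sample job lines from the previous slide rather than a live qstat, so it works anywhere; the function name jobs_for_user is hypothetical.

```shell
#!/bin/sh
# Sample job lines in qstat -a field order (headers omitted).
sample='80.oscbw.clust  osu2376  serial   h1.com  1207 1 1  64mb 16:40 R 01:05 node01
86.oscbw.clust  troy     serial   cdnz3d   776 1 1  36mb 40:00 R 00:00 node02
93.oscbw.clust  cls022   serial   NAME     787 1 1 128mb 11:06 R 00:00 node04
101.oscbw.clus  cls022   SMP      imid1   6542 1 2  64mw 10:00 R 00:40 node01'

# Print job ID, job name, and status for one user.
# Fields: 1=Job ID 2=Username 3=Queue 4=Jobname ... 10=S (status).
jobs_for_user() {
    awk -v user="$1" '$2 == user { print $1, $4, $10 }'
}

printf '%s\n' "$sample" | jobs_for_user cls022
```

On the cluster, the same filter could be fed from `qstat -a` directly, provided the header lines are skipped or simply fail to match the user name.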

39
Killing a Job
  • If, for whatever reason, you need to delete a
    queued job or kill a running job, use the qdel
    command.
  • Usage: qdel request_number

40
SMP Jobs
  • So far, the job scripts we've seen have been
    serial, uniprocessor jobs. The following is an
    example of a job that uses more than one
    processor on a single node:

    oscbw:~/Beowulf/omp> more smp.pbs
    #PBS -N smp
    #PBS -j oe
    #PBS -S /bin/ksh
    #PBS -l nodes=1:ppn=4
    #PBS -l cput=0:01:00
    cd $HOME/Beowulf/omp
    export OMP_NUM_THREADS=4
    /usr/bin/time ./matmul-omp

41
More on SMP and Serial Jobs
  • The only real difference between a uniprocessor
    job and an SMP job (at least from PBS's point of
    view) is the -l nodes=1:ppn=4 limit in the SMP
    job. This tells PBS to allow the job to run four
    processes (or threads) concurrently on one node.
  • If you simply request a number of nodes (e.g. -l
    nodes=1), PBS will assume that you want one
    processor per node.
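
The total number of processors a request implies is nodes times ppn, with ppn defaulting to 1 as described above. A minimal sketch of that rule, using a hypothetical helper named total_procs that handles only the two forms shown in this talk (nodes=N and nodes=N:ppn=M):

```shell
#!/bin/sh
# Compute nodes x ppn from a PBS node request string.
total_procs() {
    spec=$1
    nodes=${spec#nodes=}      # drop the leading "nodes="
    nodes=${nodes%%:*}        # drop anything after the first ":"
    case $spec in
        *:ppn=*) ppn=${spec##*:ppn=} ;;
        *)       ppn=1 ;;     # PBS assumes one processor per node
    esac
    echo $(( nodes * ppn ))
}

total_procs nodes=1:ppn=4
total_procs nodes=1
total_procs nodes=4:ppn=4
```

For an OpenMP job such as smp.pbs, this total is also a sensible value for OMP_NUM_THREADS.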

42
Parallel Jobs
  • Both serial and SMP jobs run on only 1 node.
    However, most MPI programs should be run on more
    than 1 node. Here is an example of how to do
    that:

    #PBS -N nblock
    #PBS -j oe
    #PBS -l nodes=4:ppn=4
    #PBS -l cput=1:00:00
    cd ~/Beowulf/mpi-c
    mpiexec ./nblock

43
mpiexec Format
  • mpiexec [OPTION]... executable [args]...
  • -n numproc: use only the specified number of
    processes (optional).
  • -tv, -totalview: debug using TotalView (does
    not work with ch_gm).
  • -perif: allocate only one process per Myrinet
    interface. This flag can be used to ensure
    maximum communication bandwidth available to
    each process.
  • -pernode: allocate only one process per compute
    node. For SMP nodes, only one processor will be
    allocated a job. This flag is used to implement
    multiple-level parallelism, with MPI between
    nodes and threads within a node.

44
Other Sources of Information
  • OSC technical information server,
    http://oscinfo.osc.edu
  • OSC state-wide software licenses,
    http://oscinfo.osc.edu/software/ssd.html
  • Linux Fortran web page,
    http://studbolt.physast.uga.edu/templon/fortran.html
  • Cygnus/FSF GCC homepage, http://gcc.gnu.org
  • Scientific Applications on Linux,
    http://SAL.KachinaTech.COM/index.shtml
  • Myricom homepage, http://www.myri.com