1
Using DataStar
Mahidhar Tatineni and Amit Majumdar, SDSC
2
DataStar Configuration
  • 15.6 TF, 2528 processors total
  • 11 32-way 1.7 GHz IBM p690 nodes
  • 2 nodes with 64 GB memory for login and system use
  • 4 nodes with 128 GB memory and 1 node with 256 GB
    for batch scientific computation
  • 3 nodes with 128 GB memory for database,
    DiscoveryLink, HPSS
  • 1 node with 256 GB memory for interactive use
    (post-processing, visualization)
  • 176 8-way 1.5 GHz IBM p655 nodes
  • 16 GB memory
  • Batch scientific computation
  • 96 8-way 1.7 GHz IBM p655 nodes
  • 32 GB memory
  • Batch scientific computation
  • All nodes Federation switch attached
  • All nodes SAN attached
  • Parallel filesystems: 116 TB GPFS and 750 TB GPFS-WAN
    (shared with Blue Gene, TG IA-64 cluster)

3
Logging in
  • SSH2 (Secure Shell 2) client
  • http://www.sdsc.edu/us/consulting/ssh.html
  • For this workshop we will be using
    dspoe.sdsc.edu for submitting and running jobs in
    the express queue.
  • ssh <username>@dspoe.sdsc.edu (DataStar)

4
Moving files
  • SCP
  • scp original_file user@dspoe.sdsc.edu:/to_dir/copied_file
    (see the example commands below)
  • BBFTP
  • For files > 2 GB
  • http://www.sdsc.edu/us/resources/datastar/getstart.html#migrate
  • High Performance Storage System (HPSS) /
    Storage Resource Broker (SRB)
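A minimal sketch of the transfer commands, assuming a workshop account on
dspoe.sdsc.edu; the file names and the /gpfs/username destination directory
are placeholders:

    # Copy a local file to DataStar (prompts for your password/passphrase)
    scp results.dat username@dspoe.sdsc.edu:/gpfs/username/results.dat

    # Copy a file back from DataStar into the current local directory
    scp username@dspoe.sdsc.edu:/gpfs/username/results.dat .

    # For files larger than 2 GB, use bbftp instead; see the
    # getting-started link above for its usage.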

5
File System Structure
  • /home - 1 GB quota. Backed up. Do not store large
    files or output here.
  • /scratch - Local file system (64 GB on each
    node). Space cleaned after each job, so stage
    results off before the job ends (see the sketch
    below).
  • /gpfs - 116 TB shared parallel file system.
  • NOT backed up. Purgeable.
  • /gpfs-wan - Visible for read-write operations via
    the TG network (750 TB)
  • HPSS (archival storage) - 25 PB of tape capacity
  • http://www.sdsc.edu/us/resources/hpss/
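A minimal staging sketch, to be run at the end of a batch job; the directory
names under /gpfs and /scratch are illustrative, not fixed paths:

    # Copy results from node-local /scratch to shared GPFS before the
    # job finishes, since /scratch is cleaned after each job
    mkdir -p /gpfs/username/run01
    cp /scratch/username/output*.dat /gpfs/username/run01/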

6
Batch/Interactive computing
  • Batch job environment
  • Job Manager: LoadLeveler (tool from IBM)
  • Job Scheduler: Catalina (SDSC in-house tool)
  • Job Monitoring: Various commands
  • Batch and interactive use on different nodes.
  • DataStar Login Nodes
  • dslogin.sdsc.edu
  • dspoe.sdsc.edu
  • dsdirect.sdsc.edu

8
Queues and nodes
  • Start with dspoe (interactive queues)
  • Do production runs from dslogin (normal and
    normal32 queues)
  • Use express queues from dspoe to get it right
    now.
  • Use dsdirect for special needs.

9
LoadLeveler Commands
  • Show the current queue state: llq
  • Submit a job to the queue: llsubmit
  • Cancel your job in the queue: llcancel
  • Special (more useful) commands from SDSC's
    in-house tool Catalina:
  • showq - look at the status of the queue
  • show_bf - look at backfill window
    opportunities
  • Note: When a job shows None as the start time
    or says BADRESOURCELIST in the showq output,
    you are requesting resources which are not
    available on the machine (for example, asking for
    too much memory, too many nodes, or too many CPUs
    per node). Example command usage is sketched
    below.
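A minimal usage sketch of these commands; myjob.cmd and the job-step
identifier passed to llcancel are placeholders (use the id reported by
llsubmit or llq):

    llq                   # list jobs currently in the queues
    llq -u $USER          # show only your own jobs
    llsubmit myjob.cmd    # submit a job command file; prints a job-step id
    llcancel ds100.1234.0 # cancel a job by its job-step id
    showq                 # Catalina view of running/queued jobs
    show_bf               # current backfill windows (idle nodes and durations)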

10
Sample Job Scripts
  • Example files are located here:
  • /gpfs/projects/workshop/running_jobs
  • Copy the whole directory.
  • Use the Makefile to compile the source code.
  • Edit the parameters in the job submission
    scripts.
  • The examples illustrate the use of LoadLeveler to
    submit and follow jobs (a generic script is
    sketched below).
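The workshop directory holds the authoritative scripts; the following is
only a generic sketch of a LoadLeveler command file for a parallel POE job.
The class, node counts, wall-clock time, network statement, and executable
name are illustrative values, not DataStar-specific requirements:

    #!/bin/ksh
    # Sample LoadLeveler command file (illustrative values)
    #@ job_type        = parallel
    #@ class           = express              # express, normal, normal32, high
    #@ node            = 2                    # number of p655 nodes
    #@ tasks_per_node  = 8                    # 8 MPI tasks per 8-way node
    #@ wall_clock_time = 00:30:00             # request only what the run needs
    #@ network.MPI     = sn_all,not_shared,US # Federation switch, user space
    #@ output          = job.$(jobid).out
    #@ error           = job.$(jobid).err
    #@ notification    = never
    #@ queue

    poe ./my_mpi_program

Submit it with llsubmit and follow it with llq, as on the previous slide.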

11
Backfill window: show_bf
  • Scenario: Queue draining for running big job(s).
  • Many nodes will sit idle while the big job waits
    for enough currently running jobs to finish and
    free up the required number of nodes.
  • Use the show_bf command to identify the idle
    nodes (and the duration they are available
    for) and use them immediately (see the sketch
    below).
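A minimal sketch of the backfill workflow; the node count and window length
are hypothetical, read the real values from your own show_bf output:

    show_bf                    # suppose it reports 8 nodes free for 2 hours

    # Size the job to fit inside that window, e.g. in the command file:
    #   #@ node            = 8
    #   #@ wall_clock_time = 01:45:00   # comfortably under the window
    llsubmit backfill_job.cmd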

12
SDSC Job Priorities - 1
  • Priorities are determined by a number
    of weighting factors
  • Job size
  • Jobs of > 128 nodes (1024 procs) get highest
    priority
  • Prevents wasted machine dry-outs
  • Favors jobs that can be run nowhere else
  • Allocation size
  • PIs with 1.2M hours need to run more jobs than
    those with 10k hours

13
SDSC Job Priorities - 2
  • Priorities determined by a number of weighting
    factors (cont.)
  • Priority
  • High, normal, express queues
  • Charge factors: high x2, normal x1, express x1.8
  • 4 nodes reserved 24/7 just for express jobs
  • Wait time
  • Jobs increase in priority as they age
  • Big boost for normal jobs older than 4 days or
    high jobs older than 2 days
  • Boosted just under the large jobs

14
Tips to reduce queue wait time
  • Every time you submit a job, look for
    any possible backfill windows (use show_bf)
  • Try to estimate your job runtime and ask
    for the exact amount you need;
    do not just ask for the max of 18 hrs
    (see the sketch below)
  • If possible, scale up your job to a larger
    number of processors/nodes.
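For example, in the LoadLeveler command file set the wall-clock limit to a
realistic estimate plus a small margin rather than the queue maximum (the
value below is illustrative):

    # Run measured at ~2.5 hours: request 3 hours, not the 18-hour maximum
    #@ wall_clock_time = 03:00:00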

15
Archival Storage: HPSS
  • What is HPSS?
  • The centralized, long-term data storage
    system at SDSC is the
    High Performance Storage System (HPSS)
  • Currently stores more than 3 PB of data (as of
    June 2006)
  • Total system capacity of 25 PB
  • Data added at an average rate of 100 TB per month
    (between Aug 05 and Feb 06)

16
SDSC Resources: Applications
  • SDSC provides a wide range of software
    applications installed on the production
    computing platforms. Ranging from finite element
    codes to state-of-the-art visualization packages,
    these applications are available to any
    researcher who has a computing allocation at
    SDSC.
  • The applications are listed on the SDSC
    Applications page:
  • http://www.sdsc.edu/us/resources/applications.html
  • Information is also available on the TeraGrid
    Software page:
  • http://hpcsoftware.teragrid.org/Software/user/index.php

17
SDSC DataStar Applications and Libraries
  • The third-party applications and libraries are
    usually located in /usr/local/apps64 (for 64-bit
    applications) and /usr/local/apps32 (for 32-bit
    applications).
  • The /usr/local/apps directory corresponds to the
    64-bit directory.
  • Users must take care to link the correct versions
    of the libraries when compiling, i.e. use the
    64-bit libraries when compiling with the -q64
    option or when compiling with OBJECT_MODE=64.
  • For example, if you need the single-precision
    version of the fftw 2.1.5 library and you are
    compiling in 64-bit mode, you must link the
    library in
  • /usr/local/apps64/fftw215s/lib/
    (see the compile sketch below)
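A minimal compile-and-link sketch under those assumptions; the compiler
driver, source file, and library names are illustrative, so check the actual
library names with ls /usr/local/apps64/fftw215s/lib before linking:

    # 64-bit compile, so link against the 64-bit single-precision fftw build
    export OBJECT_MODE=64
    mpxlf90_r -q64 -o mycode mycode.f90 \
        -I/usr/local/apps64/fftw215s/include \
        -L/usr/local/apps64/fftw215s/lib -lsrfftw -lsfftw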