Title: NCCS User Forum
1 NCCS User Forum
2 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
3 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
4 Key Accomplishments
- SCU4 added to Discover and currently running in pioneer mode
- Explore decommissioned and removed
- Discover filesystems converted to GPFS 3.2 native mode
5 Discover Utilization, Past Year
[Chart: Discover utilization over the past year. Annotations mark the point at which the SCU3 cores were added, utilization values of 64.4, 67.1, and 73.3 percent, and monthly totals of 1,320,683 and 2,446,365 CPU hours.]
6 Discover Utilization
7 Discover Queue Expansion Factor
Weighted over all queues for all jobs (Background and Test queues excluded).
[Chart: expansion factor over time, where expansion factor = (Eligible Time + Run Time) / Run Time.]
8 Discover Availability
- September through November availability
  - 13 outages
    - 9 unscheduled
      - 0 hardware failures
      - 7 software failures
      - 2 extended maintenance windows
    - 4 scheduled
  - 104.3 hours total downtime
    - 68.3 unscheduled
    - 36.0 scheduled
- Longest outages
  - 11/28-29 GPFS hang, 21 hrs
  - 11/12 Electrical maintenance and Discover reprovisioning, 18 hrs (scheduled outage)
  - 10/1 SCU4 integration, 11.5 hrs (scheduled outage plus extension)
  - 9/2-3 Subnet Manager hang, 11.3 hrs
  - 11/6 GPFS hang, 10.9 hrs
[Chart: outage timeline labeled by cause: GPFS hangs, Subnet Manager hangs and maintenance, SCU4 integration, switch reconfiguration, and electrical maintenance with Discover reprovisioning.]
9 Current Issues on Discover: GPFS Hangs
- Symptom: GPFS hangs resulting from users running nodes out of memory.
- Impact: Users cannot log in or use the filesystem. System admins must reboot the affected nodes.
- Status: Implemented additional monitoring and reporting tools.
10 Current Issues on Discover: Problems with PBS -V Option
- Symptom: Jobs with large environments not starting.
- Impact: Jobs placed on hold by PBS.
- Status: Consulting with Altair. In the interim, don't use -V to pass the full environment; instead use -v or define the necessary variables within job scripts, as sketched below.
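A minimal sketch of the workaround, assuming a hypothetical executable and variable names; qsub's -v option takes a comma-separated list of environment variables to export to the job:

    #!/bin/csh -f
    # Pass only the variables the job actually needs instead of -V:
    #PBS -v MODEL_CONFIG,SCRATCH_DIR
    #PBS -l select=4:ncpus=8
    #PBS -l walltime=02:00:00

    # ...or define what you need directly inside the script:
    setenv OMP_NUM_THREADS 8

    cd $PBS_O_WORKDIR
    ./my_model.exe    # hypothetical executable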
11 Resolved Issues on Discover: InfiniBand Subnet Manager
- Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems with other nodes.
- Impact: Job failures due to node removal.
- Status: Modified several Subnet Manager configuration parameters on 9/17 based on IBM recommendations. The problem has not recurred.
12 Resolved Issues on Discover: PBS Hangs
- Symptom: PBS server experiencing 3-minute hangs several times per day.
- Impact: PBS-related commands (qsub, qstat, etc.) hang.
- Status: Identified periodic use of two communication ports also used for hardware management functions. Implemented a workaround on 9/17 to prevent conflicting use of these ports. No further occurrences.
13 Resolved Issues on Discover: Intermittent NFS Problems
- Symptom: Inability to access archive filesystems.
- Impact: Hung commands and sessions when attempting to access ARCHIVE.
- Status: Identified a hardware problem with the Force10 E600 network switch. Implemented a workaround and replaced the line card. No further occurrences.
14 Future Enhancements
- Discover cluster
  - Hardware platform
  - Additional storage
- Data Portal
  - Hardware platform
- Analysis environment
  - Hardware platform
- DMF
  - Hardware platform
15 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
16 Very High Level of What to Expect in FY09
[Timeline chart spanning FY09, October through September, with the following activities:]
- Analysis System
- DMF from IRIX to Linux
Major Initiatives
- Cluster Upgrade (Nehalem)
- Data Management Initiative
- Delivery of IBM Cell
- Continued Scalability Testing
- Discover FC and Disk Addition
Other Activities
- Additional Discover Disk
- Discover SW Stack Upgrade
- New Tape Drives
17 Adapting the Overall Architecture
- Services will have
  - More independent SW stacks
  - Consistent user environment
  - Fast access to the GPFS file systems
  - Large additional disk capacity for longer storage of files within GPFS
- This will result in
  - Fewer downtimes
  - Rolling outages (not everything at once)
18 Conceptual Architecture Diagram
[Diagram: a 10 GbE LAN links Discover (batch: Base, SCU1, SCU2, SCU3, SCU4, Viz), the interactive Analysis Nodes, the FY09 Compute Upgrade (Nehalem), the Data Portal, and the DMF Archive; these components connect over InfiniBand to GPFS I/O servers backed by SANs.]
19 What is the Analysis Environment?
- Initial technical implementation plan
  - Large shared-memory nodes (at least 256 GB)
  - 16-core nodes with 16 GB/core
  - Interactive (not batch) direct logins
  - Fast access to GPFS
  - 10 GbE network connectivity
  - Software stack consistent with Discover
  - Independent of the compute stack (coupled only by GPFS)
  - Additional storage, specific to analysis, for staging from the archive
  - Visibility and easy access to the archive and data portal (NFS)
20 Excited about Intel Nehalem
- Quick specs
  - Core i7, 45 nm
  - 731 million transistors per quad-core
  - 2.66 GHz to 2.93 GHz
  - Private L1 cache (32 KB) and L2 cache (256 KB) per core
  - Shared L3 cache (up to 8 MB) across all cores
  - 1,066 MHz DDR3 memory (3 channels per processor)
- Important features
  - Intel QuickPath Interconnect
  - Turbo Boost
  - Hyper-Threading
- Learn more at
  - http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
  - http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
21 Nehalem versus Harpertown
- Single-thread improvement (will vary by application)
- Larger cache, with the 8 MB L3 shared across all cores
- Memory-to-processor bandwidth dramatically increased over Harpertown
  - Initial measurements have shown a 3x to 4x memory-to-processor bandwidth increase
22 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
23 Issues from Last User Forum: Shared Project Space
- Implementation of shared project space on Discover
- Status: resolved
  - Available for projects by request
  - Accessible via /share (usage deprecated)
  - Accessible via $SHARE (correct usage; see the example below)
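A minimal usage sketch; the project directory name is hypothetical:

    # Reference the shared project space through the environment variable
    cd $SHARE/my_project
    cp big_output.nc $SHARE/my_project/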
24 Issues from Last User Forum: Increase Queue Limits
- Increase CPU time limits in queues
- Status: resolved (see the example job request after the table)

  Queue           Priority   Max CPUs   Max Hours
  test                 101       2064          12
  general_hi            80        512          24
  debug                 70         32           1
  general_long          55        256          24
  general               50        256          12
  general_small         50         16          12
  background             1        256           4
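An illustrative job request that stays within the general_long limits above; the select line assumes PBS Pro chunk syntax and 8-core nodes:

    #PBS -q general_long
    #PBS -l select=32:ncpus=8     # 32 x 8 = 256 CPUs, the general_long maximum
    #PBS -l walltime=24:00:00     # at the 24-hour limit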
25 Issues from Last User Forum: Commands to Access DMF
- Implementation: dmget and dmput
- Status: test version ready to be enabled on Discover login nodes
  - Reason for the delay was that dmget on non-DMF-managed files would hang
  - There may still be stability issues
  - E-mail will be sent soon notifying users of availability (usage sketch below)
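A minimal sketch of typical DMF usage once the commands are enabled; the archive paths are hypothetical, and the -r behavior (releasing the online copy) reflects standard DMF documentation rather than NCCS-specific guidance:

    # Recall a migrated file from tape before reading it
    dmget /archive/u/username/run42/restart.nc

    # When finished, allow the online copy to be released back to tape
    dmput -r /archive/u/username/run42/restart.nc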
26 Issues from Last User Forum: Enabling Sentinel Jobs
- Running a sentinel subjob to watch a main parallel compute subjob within a single PBS job
- Status: under investigation
  - Requires an NFS mount of the data portal file system on Discover gateway nodes
  - Requires some special PBS usage to specify how subjobs will land on nodes (see the sketch below)
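A hedged illustration of the kind of PBS request involved, assuming PBS Pro chunk syntax; the exact directives NCCS will recommend are still being worked out:

    # One small chunk for the sentinel plus eight 8-core chunks for the
    # compute subjob, spread across distinct nodes (illustrative only):
    #PBS -l select=1:ncpus=1+8:ncpus=8
    #PBS -l place=scatter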
27 Other Issues: Poor Interactive Response
- Slow interactive response on Discover
- Status: resolved
  - Router line card replaced
  - Automatic monitoring instituted to detect future problems promptly
28 Other Issues: Parallel Jobs > 300-400 CPUs
- Some users experiencing problems running on more than 300-400 CPUs on Discover
- Status: resolved
  - "stacksize unlimited" needed in the .cshrc file (see below)
  - Intel MPI passes the environment, including settings in startup files
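The corresponding csh setting, typically added to ~/.cshrc:

    # ~/.cshrc
    # Remove the per-process stack size limit so large MPI runs do not fail
    limit stacksize unlimited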
29 Other Issues: Parallel Jobs > 1,500 CPUs
- Many jobs won't run at more than 1,500 CPUs
- Status: under investigation
  - Some simple jobs will run
  - NCCS is consulting with IBM and Intel to resolve the issue
  - Software upgrades probably required
  - Solution may also fix slow Intel MPI startup
30 Other Issues: Visibility of the Archive
- Visibility of the archive from Discover
- Current status
  - Compute/viz nodes don't have external network connections
  - Hard NFS mounts guarantee data integrity, but if there is an NFS hang, the node hangs
  - Login/gateway nodes may use a soft NFS mount, but at the risk of data corruption
  - bbftp or scp (to Dirac) is preferred over cp when copying data (example below)
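A minimal sketch of pushing data to the archive host instead of copying across the NFS mount; the host alias, username, and paths are placeholders:

    # From a Discover login node, copy results to the archive host with scp
    scp run42_output.tar username@dirac:/archive/u/username/run42/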
31 DMF Transition
- Dirac due to be replaced in Q2 CY09
  - Interactive host for GrADS, IDL, MATLAB, etc.
  - Much larger memory
  - GPFS shared with Discover
  - Significant increase in GPFS storage
- Impacts to Dirac users
  - Source code must be recompiled
  - COTS software must be relicensed/rehosted
  - Old Dirac stays up until the migration is complete
32 Help Us Help You
- Don't use PBS -V (jobs hang with the error "too many failed attempts to start")
- Direct stdout and stderr to specific files, or you will fill up the PBS spool directory (see the sketch below)
- Use an interactive batch session instead of an interactive session on a login node
- If you suspect your job is crashing nodes, call us before running again
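A hedged sketch of the first three recommendations; the file paths are placeholders:

    # In the job script: name stdout/stderr files explicitly so PBS does not
    # spool them
    #PBS -o /discover/nobackup/username/run42.out
    #PBS -e /discover/nobackup/username/run42.err

    # For interactive work, request an interactive batch session on a compute
    # node instead of working on a login node:
    qsub -I -q general -l select=1:ncpus=8 -l walltime=01:00:00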
33 Help Us Help You (continued)
- Try to be specific when reporting problems, for example
  - If the archive is broken, specify the symptoms
  - If files are inaccessible or can't be recalled, please send us the file names
34 Plans
- Implement a better scheduling policy
- Implement integrated job performance monitoring
- Implement better job metrics reporting
- Or ...?
35 Feedback
- Now: Voice your
  - Praises?
  - Complaints?
  - Suggestions?
- Later: NCCS Support
  - support@nccs.nasa.gov
  - (301) 286-9120
- Later: USG Lead (me!)
  - William.A.Ward@nasa.gov
  - (301) 286-2954
36 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
37 Open Discussion: Questions and Comments