Title: NCCS User Forum
1 NCCS User Forum
2 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
3 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
4 Key Accomplishments
- SCU4 added to Discover and currently running in pioneer mode
- Explore decommissioned and removed
- Discover filesystems converted to GPFS 3.2 native mode
5 Discover Utilization, Past Year
[Chart: Discover utilization over the past year. Annotations mark the point at which the SCU3 cores were added, utilization values of 64.4, 67.1, and 73.3 percent, and monthly totals of 1,320,683 and 2,446,365 CPU hours.]
6 Discover Utilization
7 Discover Queue Expansion Factor
Weighted over all queues for all jobs (Background and Test queues excluded).
[Chart: expansion factor over time, where expansion factor = (Eligible Time + Run Time) / Run Time.]
8 Discover Availability
- September through November availability
  - 13 outages
    - 9 unscheduled
      - 0 hardware failures
      - 7 software failures
      - 2 extended maintenance windows
    - 4 scheduled
  - 104.3 hours total downtime
    - 68.3 unscheduled
    - 36.0 scheduled
- Longest outages
  - 11/28-29 GPFS hang, 21 hrs
  - 11/12 Electrical maintenance and Discover reprovisioning, 18 hrs (scheduled outage)
  - 10/1 SCU4 integration, 11.5 hrs (scheduled outage plus extension)
  - 9/2-3 Subnet Manager hang, 11.3 hrs
  - 11/6 GPFS hang, 10.9 hrs
[Chart: outage timeline labeled by cause: GPFS hangs, Subnet Manager hangs and maintenance, SCU4 integration, switch reconfiguration, and electrical maintenance with Discover reprovisioning.]
9 Current Issues on Discover: GPFS Hangs
- Symptom: GPFS hangs resulting from users running nodes out of memory.
- Impact: Users cannot log in or use the filesystem. System admins must reboot the affected nodes.
- Status: Implemented additional monitoring and reporting tools.
10 Current Issues on Discover: Problems with PBS -V Option
- Symptom: Jobs with large environments not starting.
- Impact: Jobs placed on hold by PBS.
- Status: Consulting with Altair. In the interim, don't use -V to pass the full environment; instead use -v or define the necessary variables within job scripts, as sketched below.
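A minimal sketch of the workaround, assuming a hypothetical executable and variable names; qsub's -v option takes a comma-separated list of environment variables to export to the job:

    #!/bin/csh -f
    # Pass only the variables the job actually needs instead of -V:
    #PBS -v MODEL_CONFIG,SCRATCH_DIR
    #PBS -l select=4:ncpus=8
    #PBS -l walltime=02:00:00

    # ...or define what you need directly inside the script:
    setenv OMP_NUM_THREADS 8

    cd $PBS_O_WORKDIR
    ./my_model.exe    # hypothetical executable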
11 Resolved Issues on Discover: InfiniBand Subnet Manager
- Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems with other nodes.
- Impact: Job failures due to node removal.
- Status: Modified several Subnet Manager configuration parameters on 9/17 based on IBM recommendations. The problem has not recurred.
12 Resolved Issues on Discover: PBS Hangs
- Symptom: PBS server experiencing 3-minute hangs several times per day.
- Impact: PBS-related commands (qsub, qstat, etc.) hang.
- Status: Identified periodic use of two communication ports also used for hardware management functions. Implemented a workaround on 9/17 to prevent conflicting use of these ports. No further occurrences.
13 Resolved Issues on Discover: Intermittent NFS Problems
- Symptom: Inability to access archive filesystems.
- Impact: Hung commands and sessions when attempting to access ARCHIVE.
- Status: Identified a hardware problem with the Force10 E600 network switch. Implemented a workaround and replaced the line card. No further occurrences.
14 Future Enhancements
- Discover cluster
  - Hardware platform
  - Additional storage
- Data Portal
  - Hardware platform
- Analysis environment
  - Hardware platform
- DMF
  - Hardware platform
15 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
16 Very High Level of What to Expect in FY09
[Timeline chart spanning FY09, October through September, with the following activities:]
- Analysis System
- DMF from IRIX to Linux
Major Initiatives
- Cluster Upgrade (Nehalem)
- Data Management Initiative
- Delivery of IBM Cell
- Continued Scalability Testing
- Discover FC and Disk Addition
Other Activities
- Additional Discover Disk
- Discover SW Stack Upgrade
- New Tape Drives
17 Adapting the Overall Architecture
- Services will have
  - More independent SW stacks
  - Consistent user environment
  - Fast access to the GPFS file systems
  - Large additional disk capacity for longer storage of files within GPFS
- This will result in
  - Fewer downtimes
  - Rolling outages (not everything at once)
18 Conceptual Architecture Diagram
[Diagram: a 10 GbE LAN links Discover (batch: Base, SCU1, SCU2, SCU3, SCU4, Viz), the interactive Analysis Nodes, the FY09 Compute Upgrade (Nehalem), the Data Portal, and the DMF Archive; these components connect over InfiniBand to GPFS I/O servers backed by SANs.]
19 What is the Analysis Environment?
- Initial technical implementation plan
  - Large shared-memory nodes (at least 256 GB)
  - 16-core nodes with 16 GB/core
  - Interactive (not batch) direct logins
  - Fast access to GPFS
  - 10 GbE network connectivity
  - Software stack consistent with Discover
  - Independent of the compute stack (coupled only by GPFS)
  - Additional storage, specific to analysis, for staging from the archive
  - Visibility and easy access to the archive and data portal (NFS)
20 Excited about Intel Nehalem
- Quick specs
  - Core i7, 45 nm
  - 731 million transistors per quad-core
  - 2.66 GHz to 2.93 GHz
  - Private L1 cache (32 KB) and L2 cache (256 KB) per core
  - Shared L3 cache (up to 8 MB) across all cores
  - 1,066 MHz DDR3 memory (3 channels per processor)
- Important features
  - Intel QuickPath Interconnect
  - Turbo Boost
  - Hyper-Threading
- Learn more at
  - http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
  - http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
21 Nehalem versus Harpertown
- Single-thread improvement (will vary by application)
- Larger cache, with the 8 MB L3 shared across all cores
- Memory-to-processor bandwidth dramatically increased over Harpertown
  - Initial measurements have shown a 3x to 4x memory-to-processor bandwidth increase
22 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
23 Issues from Last User Forum: Shared Project Space
- Implementation of shared project space on Discover
- Status: resolved
  - Available for projects by request
  - Accessible via /share (usage deprecated)
  - Accessible via $SHARE (correct usage; see the example below)
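A minimal usage sketch; the project directory name is hypothetical:

    # Reference the shared project space through the environment variable
    cd $SHARE/my_project
    cp big_output.nc $SHARE/my_project/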
24 Issues from Last User Forum: Increase Queue Limits
- Increase CPU time limits in queues
- Status: resolved (see the example job request after the table)

  Queue           Priority   Max CPUs   Max Hours
  test                 101       2064          12
  general_hi            80        512          24
  debug                 70         32           1
  general_long          55        256          24
  general               50        256          12
  general_small         50         16          12
  background             1        256           4
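An illustrative job request that stays within the general_long limits above; the select line assumes PBS Pro chunk syntax and 8-core nodes:

    #PBS -q general_long
    #PBS -l select=32:ncpus=8     # 32 x 8 = 256 CPUs, the general_long maximum
    #PBS -l walltime=24:00:00     # at the 24-hour limit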
25 Issues from Last User Forum: Commands to Access DMF
- Implementation: dmget and dmput
- Status: test version ready to be enabled on Discover login nodes
  - Reason for the delay was that dmget on non-DMF-managed files would hang
  - There may still be stability issues
  - E-mail will be sent soon notifying users of availability (usage sketch below)
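A minimal sketch of typical DMF usage once the commands are enabled; the archive paths are hypothetical, and the -r behavior (releasing the online copy) reflects standard DMF documentation rather than NCCS-specific guidance:

    # Recall a migrated file from tape before reading it
    dmget /archive/u/username/run42/restart.nc

    # When finished, allow the online copy to be released back to tape
    dmput -r /archive/u/username/run42/restart.nc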
26 Issues from Last User Forum: Enabling Sentinel Jobs
- Running a sentinel subjob to watch a main parallel compute subjob within a single PBS job
- Status: under investigation
  - Requires an NFS mount of the data portal file system on Discover gateway nodes
  - Requires some special PBS usage to specify how subjobs will land on nodes (see the sketch below)
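A hedged illustration of the kind of PBS request involved, assuming PBS Pro chunk syntax; the exact directives NCCS will recommend are still being worked out:

    # One small chunk for the sentinel plus eight 8-core chunks for the
    # compute subjob, spread across distinct nodes (illustrative only):
    #PBS -l select=1:ncpus=1+8:ncpus=8
    #PBS -l place=scatter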
27 Other Issues: Poor Interactive Response
- Slow interactive response on Discover
- Status: resolved
  - Router line card replaced
  - Automatic monitoring instituted to detect future problems promptly
28 Other Issues: Parallel Jobs > 300-400 CPUs
- Some users experiencing problems running on more than 300-400 CPUs on Discover
- Status: resolved
  - "stacksize unlimited" needed in the .cshrc file (see below)
  - Intel MPI passes the environment, including settings in startup files
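The corresponding csh setting, typically added to ~/.cshrc:

    # ~/.cshrc
    # Remove the per-process stack size limit so large MPI runs do not fail
    limit stacksize unlimited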
29 Other Issues: Parallel Jobs > 1,500 CPUs
- Many jobs won't run at more than 1,500 CPUs
- Status: under investigation
  - Some simple jobs will run
  - NCCS is consulting with IBM and Intel to resolve the issue
  - Software upgrades probably required
  - Solution may also fix slow Intel MPI startup
30 Other Issues: Visibility of the Archive
- Visibility of the archive from Discover
- Current status
  - Compute/viz nodes don't have external network connections
  - Hard NFS mounts guarantee data integrity, but if there is an NFS hang, the node hangs
  - Login/gateway nodes may use a soft NFS mount, but at the risk of data corruption
  - bbftp or scp (to Dirac) is preferred over cp when copying data (example below)
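A minimal sketch of pushing data to the archive host instead of copying across the NFS mount; the host alias, username, and paths are placeholders:

    # From a Discover login node, copy results to the archive host with scp
    scp run42_output.tar username@dirac:/archive/u/username/run42/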
31 DMF Transition
- Dirac due to be replaced in Q2 CY09
  - Interactive host for GrADS, IDL, MATLAB, etc.
  - Much larger memory
  - GPFS shared with Discover
  - Significant increase in GPFS storage
- Impacts to Dirac users
  - Source code must be recompiled
  - COTS software must be relicensed/rehosted
  - Old Dirac stays up until the migration is complete
32 Help Us Help You
- Don't use PBS -V (jobs hang with the error "too many failed attempts to start")
- Direct stdout and stderr to specific files, or you will fill up the PBS spool directory (see the sketch below)
- Use an interactive batch session instead of an interactive session on a login node
- If you suspect your job is crashing nodes, call us before running again
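A hedged sketch of the first three recommendations; the file paths are placeholders:

    # In the job script: name stdout/stderr files explicitly so PBS does not
    # spool them
    #PBS -o /discover/nobackup/username/run42.out
    #PBS -e /discover/nobackup/username/run42.err

    # For interactive work, request an interactive batch session on a compute
    # node instead of working on a login node:
    qsub -I -q general -l select=1:ncpus=8 -l walltime=01:00:00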
33 Help Us Help You (continued)
- Try to be specific when reporting problems, for example
  - If the archive is broken, specify the symptoms
  - If files are inaccessible or can't be recalled, please send us the file names
34 Plans
- Implement a better scheduling policy
- Implement integrated job performance monitoring
- Implement better job metrics reporting
- Or ...?
35 Feedback
- Now: Voice your
  - Praises?
  - Complaints?
  - Suggestions?
- Later: NCCS Support
  - support@nccs.nasa.gov
  - (301) 286-9120
- Later: USG Lead (me!)
  - William.A.Ward@nasa.gov
  - (301) 286-2954
36 Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
37 Open Discussion: Questions and Comments