NCCS User Forum - Presentation Transcript

1
NCCS User Forum
  • 11 December 2008

2
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
3
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
4
Key Accomplishments
  • SCU4 added to Discover and currently running in
    "pioneer" mode
  • Explore decommissioned and removed
  • Discover filesystems converted to GPFS 3.2 native
    mode

5
Discover Utilization - Past Year
(Chart: Discover utilization over the past year. Annotations include
"SCU3 cores added"; utilization values of 64.4, 67.1, and 73.3; and
1,320,683 and 2,446,365 CPU hours.)
6
Discover Utilization
7
Discover Queue Expansion Factor
Weighted over all queues for all jobs (Background
and Test queues excluded)
Expansion Factor = (Eligible Time + Run Time) / Run Time
8
Discover Availability
  • September through November availability
  • 13 outages
    • 9 unscheduled: 0 hardware failures, 7 software
      failures, 2 extended maintenance windows
    • 4 scheduled
  • 104.3 hours total downtime
    • 68.3 hours unscheduled
    • 36.0 hours scheduled
  • Longest outages
    • 11/28-29 GPFS hang, 21 hrs
    • 11/12 Electrical maintenance and Discover
      reprovisioning, 18 hrs (scheduled outage)
    • 10/1 SCU4 integration, 11.5 hrs (scheduled
      outage plus extension)
    • 9/2-3 Subnet Manager hang, 11.3 hrs
    • 11/6 GPFS hang, 10.9 hrs

(Timeline chart of outages by cause: GPFS hangs, Subnet Manager hangs
and maintenance, SCU4 integration and switch reconfiguration,
electrical maintenance, and Discover reprovisioning.)
9
Current Issues on Discover: GPFS Hangs
  • Symptom: GPFS hangs resulting from users running
    nodes out of memory.
  • Impact: Users cannot log in or use the filesystem.
    System admins must reboot the affected nodes.
  • Status: Implemented additional monitoring and
    reporting tools.

10
Current Issues on Discover: Problems with PBS -V
Option
  • Symptom: Jobs with large environments not
    starting
  • Impact: Jobs placed on hold by PBS
  • Status: Consulting with Altair. In the interim,
    don't use -V to pass the full environment;
    instead, use -v or define the necessary variables
    within job scripts (see the sketch below).
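
A minimal job-script sketch of the workaround; the walltime, variable
names, and executable below are placeholders, not NCCS defaults.

    #!/bin/csh
    #PBS -l walltime=01:00:00
    # Pass only the variables the job actually needs (-v) rather than
    # the full login environment (-V), which can keep jobs with large
    # environments from starting.
    #PBS -v MODEL_CONFIG,EXPERIMENT_ID

    cd $PBS_O_WORKDIR
    mpirun ./my_model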

11
Resolved Issues on Discover: InfiniBand Subnet
Manager
  • Symptom: Working nodes erroneously removed from
    GPFS following InfiniBand subnet problems with
    other nodes.
  • Impact: Job failures due to node removal
  • Status: Modified several subnet manager
    configuration parameters on 9/17 based on IBM
    recommendations. The problem has not recurred.

12
Resolved Issues on Discover: PBS Hangs
  • Symptom: PBS server experiencing 3-minute hangs
    several times per day
  • Impact: PBS-related commands (qsub, qstat, etc.)
    hang
  • Status: Identified periodic use of two
    communication ports also used for hardware
    management functions. Implemented a workaround on
    9/17 to prevent conflicting use of these ports.
    No further occurrences.

13
Resolved Issues on Discover: Intermittent NFS
Problems
  • Symptom: Inability to access archive filesystems
  • Impact: Hung commands and sessions when
    attempting to access $ARCHIVE
  • Status: Identified a hardware problem with the
    Force10 E600 network switch. Implemented a
    workaround and replaced the line card. No further
    occurrences.

14
Future Enhancements
  • Discover Cluster
    • Hardware platform
    • Additional storage
  • Data Portal
    • Hardware platform
  • Analysis environment
    • Hardware platform
  • DMF
    • Hardware platform

15
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
16
Very High Level of What to Expect in FY09
(Timeline chart, Oct 2008 through Sep 2009, grouping work into Major
Initiatives and Other Activities: Analysis System; DMF from IRIX to
Linux; Cluster Upgrade (Nehalem); Data Management Initiative; Delivery
of IBM Cell; Continued Scalability Testing; Discover FC and Disk
Addition; Additional Discover Disk; Discover SW Stack Upgrade; New
Tape Drives.)
17
Adapting the Overall Architecture
  • Services will have
    • More independent SW stacks
    • Consistent user environment
    • Fast access to the GPFS file systems
    • Large additional disk capacity for longer
      storage of files within GPFS
  • This will result in
    • Fewer downtimes
    • Rolling outages (not everything at once)

18
Conceptual Architecture Diagram
(Diagram: the Discover batch cluster (Base, SCU1-SCU4, Viz), the FY09
Compute Upgrade (Nehalem), Analysis Nodes (interactive), the Data
Portal, and the DMF Archive, each reaching GPFS I/O servers and SAN
storage over InfiniBand, with a 10 GbE LAN connecting the services.)
19
What is the Analysis Environment?
  • Initial technical implementation plan
  • Large shared-memory nodes (at least 256 GB)
  • 16-core nodes with 16 GB/core
  • Interactive (not batch) direct logins
  • Fast access to GPFS
  • 10 GbE network connectivity
  • Software stack consistent with Discover
  • Independent of the compute stack (coupled only by
    GPFS)
  • Additional storage, dedicated to analysis, for
    staging data from the archive
  • Visibility and easy access to the archive and
    data portal (NFS)

20
Excited about Intel Nehalem
  • Quick Specs
    • Core i7, 45 nm
    • 731 million transistors per quad-core
    • 2.66 GHz to 2.93 GHz
    • Private L1 (32 KB) and L2 (256 KB) cache per core
    • Shared L3 cache (up to 8 MB) across all the cores
    • 1,066 MHz DDR3 memory (3 channels per processor)
  • Important Features
    • Intel QuickPath Interconnect
    • Turbo Boost
    • Hyper-Threading
  • Learn more at
    • http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
    • http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

21
Nehalem versus Harpertown
  • Single-thread performance improvement (will vary
    based on application)
  • Larger cache, with the 8 MB L3 shared across all
    cores
  • Memory-to-processor bandwidth dramatically
    increased over Harpertown
  • Initial measurements have shown a 3x to 4x
    memory-to-processor bandwidth increase

22
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
23
Issues from Last User Forum: Shared Project Space
  • Implementation of shared project space on
    Discover
  • Status: resolved
  • Available for projects by request
  • Accessible via /share: usage deprecated
  • Accessible via $SHARE: correct usage (see the
    example below)
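
A brief usage sketch; the project directory and file names below are
placeholders.

    # Correct: reach shared project space through the environment variable
    cd $SHARE/myproject
    cp results.nc $SHARE/myproject/

    # Deprecated: hard-coding the /share path
    # cd /share/myproject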

24
Issues from Last User Forum: Increase Queue Limits
  • Increase CPU time limits in queues
  • Status: resolved (an example submission follows
    the table)

Queue          Priority  Max CPUs  Max Hours
test                101      2064         12
general_hi           80       512         24
debug                70        32          1
general_long         55       256         24
general              50       256         12
general_small        50        16         12
background            1       256          4
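
For illustration, a submission against one of the queues above; the
walltime and script name are placeholders.

    # Submit to general_long, which allows up to 256 CPUs and 24 hours
    qsub -q general_long -l walltime=24:00:00 run_model.csh
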
25
Issues from Last User Forum: Commands to Access
DMF
  • Implementation: dmget and dmput
  • Status: test version ready to be enabled on
    Discover login nodes
  • Reason for delay: dmget on files not managed by
    DMF would hang
  • There may still be stability issues
  • E-mail will be sent soon notifying users of
    availability (basic usage is sketched below)
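
A sketch of typical dmget/dmput usage once the commands are enabled;
the archive path is a placeholder, and options may differ on Discover.

    # Recall a migrated (offline) file from tape to disk before reading it
    dmget /archive/u/username/run01/output.nc

    # Migrate a file back to the tape archive
    dmput /archive/u/username/run01/output.nc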

26
Issues from Last User Forum: Enabling Sentinel
Jobs
  • Running a sentinel subjob to watch a main
    parallel compute subjob in a single PBS job
  • Status: under investigation
  • Requires an NFS mount of the data portal file
    system on Discover gateway nodes
  • Requires some special PBS usage to specify how
    subjobs will land on nodes (a conceptual sketch
    follows)
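
A conceptual sketch only, since the required PBS node placement is
still under investigation; the sentinel script and model executable
named here are hypothetical.

    #!/bin/csh
    #PBS -l walltime=12:00:00

    cd $PBS_O_WORKDIR

    # Launch a lightweight sentinel subjob in the background; it watches
    # the run's output files while the compute subjob runs and exits on
    # its own when it sees a completion marker.
    ./sentinel_watch.csh &

    # Run the main parallel compute subjob.
    mpirun ./main_model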

27
Other Issues: Poor Interactive Response
  • Slow interactive response on Discover
  • Status: resolved
  • Router line card replaced
  • Automatic monitoring instituted to promptly
    detect future problems

28
Other Issues: Parallel Jobs > 300-400 CPUs
  • Some users experiencing problems running jobs
    larger than 300-400 CPUs on Discover
  • Status: resolved
  • An unlimited stacksize setting in the .cshrc file
    is needed (see the snippet below)
  • Intel MPI passes the environment, including
    settings in startup files
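
A minimal csh/tcsh snippet for ~/.cshrc; the exact form in an
individual startup file may vary.

    # Remove the per-process stack size limit so it is inherited by
    # the job's MPI processes
    limit stacksize unlimited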

29
Other Issues: Parallel Jobs > 1500 CPUs
  • Many jobs won't run at > 1500 CPUs
  • Status: under investigation
  • Some simple jobs will run
  • NCCS is consulting with IBM and Intel to resolve
    the issue
  • Software upgrades probably required
  • The solution may also fix slow Intel MPI startup

30
Other Issues: Visibility of the Archive
  • Visibility of the archive from Discover
  • Current Status
  • Compute/viz nodes don't have external network
    connections
  • Hard NFS mounts guarantee data integrity, but
    if there is an NFS hang, the node hangs
  • Login/gateway nodes may use a soft NFS mount,
    but with a risk of data corruption
  • bbftp or scp (to Dirac) is preferred over cp when
    copying data (see the example below)
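
A hedged example of copying output over the network instead of using
cp across the NFS mount; the host alias and paths are placeholders.

    # Copy results to the archive host (Dirac) with scp
    scp run01/output.nc username@dirac:/archive/u/username/run01/

    # bbftp can give higher throughput for large files; see the NCCS
    # user documentation for the recommended bbftp command line.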

31
DMF Transition
  • Dirac due to be replaced in Q2 CY09
  • Interactive host for GrADS, IDL, MATLAB, etc.
  • Much larger memory
  • GPFS shared with Discover
  • Significant increase in GPFS storage
  • Impacts to Dirac users
  • Source code must be recompiled
  • COTS software must be relicensed/rehosted
  • Old Dirac stays up until the migration is
    complete

32
Help Us Help You
  • Don't use PBS -V (jobs hang with the error "too
    many failed attempts to start")
  • Direct stdout and stderr to specific files (see
    below), or you will fill up the PBS spool
    directory
  • Use an interactive batch session (see below)
    instead of an interactive session on a login node
  • If you suspect your job is crashing nodes, call
    us before running again
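
Sketches of both practices; the output paths and walltime are
placeholders.

    # In a job script: send stdout and stderr to specific files
    #PBS -o /discover/nobackup/username/run01.out
    #PBS -e /discover/nobackup/username/run01.err

    # From a login node: request an interactive batch session instead
    # of running work interactively on the login node itself
    qsub -I -l walltime=01:00:00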

33
Help Us Help You (continued)
  • Try to be specific when reporting problems, for
    example
  • If the archive is "broken," specify the symptoms
  • If files are inaccessible or can't be recalled,
    please send us the file names

34
Plans
  • Implement a better scheduling policy
  • Implement integrated job performance monitoring
  • Implement better job metrics reporting
  • Or

35
Feedback
  • Now: Voice your
  • Praises?
  • Complaints?
  • Suggestions?
  • Later: NCCS Support
  • support@nccs.nasa.gov
  • (301) 286-9120
  • Later: USG Lead (me!)
  • William.A.Ward@nasa.gov
  • (301) 286-2954

36
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
Current System Status: Fred Reitz, Operations Manager
NCCS Compute Capabilities: Dan Duffy, Lead Architect
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
37
Open Discussion: Questions and Comments