Transcript and Presenter's Notes

Title: NCCS User Forum


1
NCCS User Forum
  • 22 September 2009

2
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
3
Key Accomplishments
  • Incorporation of SCU5 processors into general
    queue pool
  • Capability to run large jobs (4000 cores) on
    SCU5
  • Analysis nodes placed in production
  • Migrated DMF from Dirac (Irix) to Palm (Linux)

4
New NCCS Staff Members
  • Lynn Parnell, Ph.D. Engineering Mechanics, High
    Performance Computing Lead
  • Matt Koop, Ph.D. Computer Science, User Services
  • Tom Maxwell, Ph.D. Physics, Analysis System Lead

5
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
6
Key Accomplishments
  • Discover/Analysis Environment
    - Added SCU5 (cluster totals 10,840 compute CPUs, 110 TF)
    - Placed analysis nodes (dali01-dali06) in production status
    - Implemented storage area network (SAN)
    - Implemented GPFS multicluster feature
    - Upgraded GPFS
    - Implemented RDMA
    - Implemented InfiniBand token network
  • Discover/Data Portal
    - Implemented NFS mounts for select Discover data on Data Portal
  • Data Portal
    - Migrated all users/applications to HP Bladeservers
    - Upgraded GPFS
    - Implemented GPFS multicluster feature
    - Implemented InfiniBand IP network
    - Upgraded SLES10 operating system to SP2
  • DMF
    - Migrated DMF from Irix to Linux
  • Other

7
Discover 2009 Daily Utilization Percentage
8
Discover Daily Utilization Percentage by Group, May-August 2009
8/13/09: SCU5 (4,128 cores added)
9
Discover Total CPU Consumption, Past 12 Months (CPU Hours)
9/4/08: SCU3 (2,064 cores added)
2/4/09: SCU4 (544 cores added)
2/19/09: SCU4 (240 cores added)
2/27/09: SCU4 (1,280 cores added)
8/13/09: SCU5 (4,128 cores added)
10
Discover Job Analysis August 2009
11
Discover Job Analysis August 2009
12
Discover Availability
Scheduled Maintenance (Jun-Aug):
  • 10 Jun, 17 hrs 5 min: GPFS (Token and Subnets, 3.2.1-12)
  • 24 Jun, 12 hrs: GPFS (RDMA, Multicluster, SCU5 integration)
  • 29 Jul, 12 hrs: GPFS 3.2.1-13, OFED 1.4, DDN firmware
  • 30 Jul, 2 hrs 20 min: DDN controller replacement
  • 19 Aug, 4 hrs: NASA AUID transition

Unscheduled Outages (Jun-Aug):
  • 16 Jun, 3 hrs 35 min: nodes out of memory
  • 24 Jun, 4 hrs 39 min: maintenance extension
  • 6-7 Jul, 4 hrs 18 min: internal switch error
  • 13 Jul, 2 hrs 59 min: GPFS error
  • 14 Jul, 26 min: nodes out of memory
  • 20 Jul, 2 hrs 2 min: GPFS error
  • 29 Jul, 55 min: maintenance extension
  • 19 Aug, 2 hrs 45 min: maintenance extension
13
Current Issues on Discover: Login Node Hangs
  • Symptom: Login nodes become unresponsive.
  • Impact: Users cannot log in.
  • Status: A solution is being developed and tested. The issue arose during critical security patch installation.

14
Current Issues on DMF: Post-Migration Clean-Up
  • Symptoms: Various.
  • Impact: Various.
  • Status: Issues are addressed as they are encountered and reported.

15
Future Enhancements
  • Discover Cluster
    - PBS V10
    - Additional storage
    - SLES10 SP2
  • Data Portal
    - GDS OPeNDAP performance enhancements
    - Use of GPFS-CNFS for improved NFS mount availability

16
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
17
I/O Study Team
  • Dan Kokron
  • Bill Putman
  • Dan Duffy
  • Bill Ward
  • Tyler Simon
  • Matt Koop
  • Harper Pryor
  • Building on work by SIVO and GMAO (Brent Swartz)

18
Representative GEOS Output
  • Dan Kokron has generated many runs containing data in order to characterize the GEOS I/O
  • 720-core, quarter-degree GEOS with YOTC-like history
  • Number of processes that write: 67
  • Total amount of data: 225 GB (written to multiple files)
  • Average write size: 1.7 MB
  • Running in dnb33
  • Using Nehalem cores (GPFS with RDMA)
  • Average bandwidth:
  • Timing the entire CFIO calls results in a bandwidth of 3.8 MB/sec
  • Timing just the NetCDF ncvpt calls results in a bandwidth of 44.4 MB/sec
  • Why is this so slow?

19
Kernel Benchmarks
  • Used the open-source I/O kernel benchmarks xdd and iozone
  • Achieved over 1 GB/sec to all the new nobackup file systems
  • Wrote two representative one-node C-code benchmarks:
  • One using C writes and appending to files
  • One using NetCDF writes with chunking and appending to files
  • Ran these benchmarks writing out exactly the same data as process 0 in the GEOS run (a minimal Python sketch of this style of timing appears below)
  • C writes: average bandwidth of around 900 MB/sec (consistent with kernel benchmarks)
  • NetCDF writes: average bandwidth of around 600 MB/sec
  • Why is GEOS I/O running so slow?

C writes: average bandwidth 900 MB/sec
NetCDF writes: average bandwidth 600 MB/sec
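
The one-node benchmarks described above were written in C and are not reproduced here. As a rough illustration of the measurement itself (timing repeated fixed-size appends and reporting MB/sec), a minimal Python sketch follows; the 1.7 MB write size comes from the GEOS numbers above, while the file path and write count are arbitrary placeholders.

    import os
    import time

    WRITE_SIZE = 1_700_000      # ~1.7 MB, the average GEOS write size cited earlier
    NUM_WRITES = 1000           # arbitrary; pick a count large enough to time reliably
    PATH = "bench.out"          # hypothetical target on the file system under test

    buf = os.urandom(WRITE_SIZE)
    start = time.time()
    with open(PATH, "ab") as f:             # append, as in the benchmarks above
        for _ in range(NUM_WRITES):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())                # include the cost of pushing data to the file system
    elapsed = time.time() - start
    mb = WRITE_SIZE * NUM_WRITES / 1e6
    print(f"wrote {mb:.0f} MB in {elapsed:.1f} s -> {mb / elapsed:.1f} MB/sec")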
20
Effect of NetCDF Chunking
  • How does changing the NetCDF chunk size affect the overall performance?
  • The table shows runs varying the chunk size, averaged over 10 runs for each chunk size
  • Used the NetCDF kernel benchmark
  • The smallest chunk size reproduces the GEOS bandwidth
  • As best as we can tell, this is roughly equivalent to the default chunk size
  • The best chunk size turned out to be about the size of the array being written, 3 MB (a sketch of setting NetCDF chunk sizes follows the references below)

Chunk size (floats)   Chunk size (KB)   Average Bandwidth (MB/sec)
1K                    4                 37
32K                   128               262
128K                  512               492
512K                  2,048             537
1M                    4,096             596
2M                    8,192             497
3M                    12,288            369
6M                    24,576            477
10M                   40,960            327
  • References:
  • NetCDF-4 Performance Report, Lee et al., June 2008.
  • NetCDF on-line tutorial: http://www.unidata.ucar.edu/software/netcdf/docs_beta/netcdf-tutorial.html
  • "Benchmarking I/O Performance with GEOSdas" and other Modeling Guru posts: https://modelingguru.nasa.gov/clearspace/message/56155615
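
The kernel benchmark itself used the NetCDF C/Fortran interfaces and is not reproduced here. As an illustration of where a chunk size is specified, a minimal sketch using the netCDF4 Python bindings follows; the 1080 x 721 dimensions mirror the write size discussed on the next slide, and the file and variable names are made up.

    import numpy as np
    from netCDF4 import Dataset     # netCDF4-python bindings; GEOS itself uses the Fortran/C API

    nc = Dataset("chunk_demo.nc4", "w", format="NETCDF4")   # hypothetical output file
    nc.createDimension("lon", 1080)
    nc.createDimension("lat", 721)
    nc.createDimension("time", None)                        # unlimited record dimension

    # Chunk size chosen to cover one full 2-D slab (1080 x 721 floats, roughly 3 MB),
    # the neighborhood that gave the best bandwidth in the table above.
    var = nc.createVariable("field", "f4", ("time", "lat", "lon"),
                            chunksizes=(1, 721, 1080))

    slab = np.random.rand(721, 1080).astype("f4")
    for t in range(10):             # write a few records; each write fills one chunk
        var[t, :, :] = slab
    nc.close()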

21
Setting Chunk Size in GEOS
  • Dan K. ran several baseline runs to make sure we
    were measuring things correctly
  • Turned on chunking and set the chunk size equal
    to the write size (1080x721x1x1)
  • Dramatic improvement in ncvpt bandwidth
  • Why was the last run so slow?
  • Because we had a file system hang during that run

Run - Description - ncvpt Bandwidth (MB/sec):
  • Base Line 1 - baseline run with time stamps at each write statement - 44.47
  • Base Line 2 - time stamps printed before and after the call to ncvpt - 76.35
  • Base Line 3 - printing of time stamps moved after the call to ncvpt - 64.69
  • Using NetCDF Chunking - initial run with NetCDF chunking turned on - 409.87
  • Using NetCDF Chunking and Fortran Buffering (1) - I/O buffering in the Intel I/O library on top of NetCDF chunking - 421.23
  • Using NetCDF Chunking and Fortran Buffering (2) - same as the previous run, with very different results - 45.17
22
What next?
  • Further explore chunk sizes in NetCDF
  • What is the best chunk size?
  • Do you set the chunk sizes for write performance
    or for read performance?
  • Once a file has been written with a set chunk
    size, it cannot be changed without rewriting the
    file.
  • Need to better understand the variability seen in
    the file system performance
  • Not uncommon to see a 2x or greater difference in
    performance from run to run
  • Turn the NetCDF kernel benchmark into a
    multi-node benchmark
  • Use this benchmark for testing system changes and
    potential new systems
  • Compare performance across NCCS and NAS systems
  • Write up results

23
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
24
Ticket Closure Percentiles, 1 March to 31 August 2009
25
Issue: Parallel Jobs > 1,500 CPUs
  • Original problem: Many jobs would not run at > 1,500 CPUs
  • Status at last Forum: Resolved using a different version of the DAPL library
  • Current status: Now able to run at 4,000 CPUs using MVAPICH on SCU5

26
Issue: Getting Jobs into Execution
  • Long wait for queued jobs before launching
  • Reasons:
  • SCALI=TRUE is restrictive
  • Per-user and per-project limits on the number of eligible jobs (use qstat -is)
  • Scheduling policy: first-fit on the job list, ordered by queue priority and queue time
  • User Services will be contacting folks using SCALI=TRUE to assist them in migrating away from this feature

27
Future User Forums
  • NCCS User Forum schedule:
  • 8 Dec 2009, 9 Mar 2010, 8 Jun 2010, 14 Sep 2010, and 7 Dec 2010
  • All on Tuesdays
  • All 2:00-3:30 PM
  • All in Building 33, Room H114
  • Published:
  • On http://nccs.nasa.gov/
  • On GSFC-CAL-NCCS-Users

28
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
29
Sustained System Performance
  • What is the overall system performance?
  • Many different benchmarks or peak numbers are
    available
  • Often unrealistic or not relevant
  • SSP refers to a set of benchmarks that evaluates
    performance as related to real workloads on the
    system
  • SSP concepts originated from NERSC (LBNL)

30
Performance Monitoring
  • Not just for evaluating a new system
  • Ever wonder if a system change has affected
    performance?
  • Often changes can be subtle and not detected with
    normal system validation tools
  • Silent corruption
  • Slowness
  • Find out immediately instead of after running the
    application and getting an error

31
Performance Monitoring (contd.)
  • Run real workloads (SSP) to determine performance
    changes over time
  • Quickly determine if something is broken or slow
  • Perform data verification
  • Run automatically on a regular basis as well as
    after system changes
  • e.g., a change to the compiler, MPI version, or OS update (a minimal sketch of such a check appears below)
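
The NCCS harness for these runs is not described in the slides; as an illustration of the idea only (run a fixed workload on a schedule, compare its runtime against a stored baseline, and flag anything markedly slower or broken), here is a minimal Python sketch with a made-up benchmark command, baseline file, and threshold.

    import json
    import pathlib
    import subprocess
    import time

    BASELINE = pathlib.Path("ssp_baseline.json")    # hypothetical record of a past runtime
    CMD = ["./run_ssp_case.sh"]                     # placeholder for a real SSP workload
    THRESHOLD = 1.10                                # flag runs more than 10% slower than baseline

    start = time.time()
    subprocess.run(CMD, check=True)                 # check=True also catches outright failures
    elapsed = time.time() - start

    if BASELINE.exists():
        baseline = json.loads(BASELINE.read_text())["seconds"]
        if elapsed > THRESHOLD * baseline:
            print(f"REGRESSION: {elapsed:.0f} s vs baseline {baseline:.0f} s")
        else:
            print(f"OK: {elapsed:.0f} s (baseline {baseline:.0f} s)")
    else:
        BASELINE.write_text(json.dumps({"seconds": elapsed}))
        print(f"Baseline recorded: {elapsed:.0f} s")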

NERSC SSP Example Chart
32
Meaningful Measurements
  • How you can help
  • We need your application and a representative
    dataset for your application
  • Ideally should take 20-30 minutes to run at
    various processor counts
  • Your benefits
  • Changes to the system that affect your
    application will be noticed immediately
  • Data will be placed on NCCS website to show
    system performance over time

33
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
34
Discover Job Monitor
  • All data is presented as a current system snapshot, updated at 5-minute intervals.
  • Displays system load as a percentage
  • Displays the number of running jobs and the number of cores in use
  • Queued jobs and job wait times
  • Displays current qstat -a output
  • Interactive Historical Utilization Chart
  • Message of the day
  • Displays average number of cores per job
  • Job Monitor

35
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
36
Climate Data Analysis
  • Climate models are generating ever-increasing
    amounts of output data.
  • Larger datasets are making it increasingly
    cumbersome for scientists to perform analyses on
    their desktop computers.
  • Server-side analysis of climate model results is
    quickly becoming a necessity.

37
Parallelizing Application Scripts
  • Many data-processing shell scripts can be easily parallelized
  • MATLAB, IDL, etc.
  • Use task parallelism to process multiple files in parallel
  • Each file is processed on a separate core within a single dali node
  • Limit the load on dali (16 cores per node)
  • Max 10 compute-intensive processes per node (a task-parallel sketch follows the loop examples below)

Serial version:
  while ( ... )
    # process another file
    run.grid.qd.s
  end

Parallel version:
  while ( ... )
    # process another file
    run.grid.qd.s
  end
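
As transcribed, the serial and parallel loops above read identically; whatever distinguished the parallel variant on the original slide did not survive the transcription. As an illustration of the same task-parallel pattern, here is a minimal Python sketch that runs the per-file script on up to 10 files at a time, matching the recommended per-node load; the file list and the script's argument convention are assumptions.

    import glob
    import subprocess
    from multiprocessing import Pool

    MAX_WORKERS = 10                     # recommended cap of compute-intensive processes per dali node
    FILES = sorted(glob.glob("*.nc"))    # placeholder file list

    def process_one(path):
        # Each worker runs the existing serial script on one file;
        # passing the file as a command-line argument is an assumption.
        subprocess.run(["./run.grid.qd.s", path], check=True)
        return path

    if __name__ == "__main__":
        with Pool(processes=MAX_WORKERS) as pool:
            for done in pool.imap_unordered(process_one, FILES):
                print("finished", done)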
38
ParaView
  • Open-source, multi-platform visualization
    application
  • Developed by Kitware, Inc. (authors of VTK)
  • Designed to process large data sets
  • Built on parallel VTK
  • Client-server architecture
  • Client: Qt-based desktop application
  • Data Server: MPI-based parallel application on dali
  • Parallel streaming filters for data processing
  • Large library of existing filters
  • Highly extensible using plugins
  • Plugin development required for HDF, NetCDF, OBS
    data
  • No existing climate-specific tools or algorithms
  • Data Server being integrated into ESG

39
ParaView Client
  • Qt desktop application that controls data access, processing, analysis, and visualization

40
ParaView Client Features
41
Analysis Workflow Configuration
  • Configure a parallel streaming pipeline for data
    analysis

42
ParaView Applications
Polar Vortex Breakdown Simulation
Golevka Asteroid Explosion Simulation
3D Rayleigh-Benard problem
Cross Wind Fire Simulation
43
Climate Data Analysis Toolkit
  • Integrated environment for data processing, visualization, and analysis
  • Integrates numerous software modules in a Python shell
  • Open source, with a large, diverse set of contributors
  • Analysis environment for ESG, developed at LLNL

44
Data Manipulation
  • Exploits NumPy Array and Masked Array
  • Adds persistent climate metadata
  • Exposes NumPy, SciPy, and RPy mathematical operations (listed below; a minimal masked-array sketch follows the list)

Clustering, FFT, image processing, linear algebra, interpolation, max entropy, optimization, signal processing, statistical functions, convolution, sparse matrices, regression, spatial algorithms
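
CDAT's cdms2 variables wrap NumPy masked arrays and attach persistent climate metadata; the cdms2 calls themselves are not shown here. A minimal NumPy sketch of the underlying masked-array behaviour, with made-up temperature values and a 1e20 fill value, looks like this:

    import numpy as np
    import numpy.ma as ma

    # Fake 3 x 4 field in which 1e20 marks missing points.
    data = np.array([[290.1, 291.3, 1e20, 289.8],
                     [288.7, 1e20, 290.0, 291.1],
                     [287.5, 288.2, 289.0, 1e20]])
    field = ma.masked_values(data, 1e20)    # masked points are excluded from statistics

    print("mean ignoring missing points:", field.mean())
    print("row means:", field.mean(axis=1))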
45
Grid Support
  • Spherical Coordinate Remapping and Interpolation
    Package
  • remapping and interpolation between grids on a
    sphere
  • Map between any pair of lat-long grids
  • GridSpec
  • Standard description of earth system model grids
  • To be implemented in NetCDF CF convention
  • Implemented in CMOR
  • MoDAVE
  • Grid visualization

46
Climate Analysis
  • Genutil and Cdutil (PCMDI)
  • General utilities for climate data analysis
  • Statistics, array and color manipulation, selection, etc.
  • Climate Utilities
  • time extraction, averages, bounds, interpolation
  • masking/regridding, region extraction
  • PyClimate
  • Toolset for analyzing climate variability
  • Empirical Orthogonal Function (EOF) analysis (see the sketch after this list)
  • Analysis of coupled data sets
  • Singular Value Decomposition (SVD)
  • Canonical Correlation Analysis (CCA)
  • Linear digital filters
  • Kernel-based probability density function estimation
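
PyClimate supplies these analyses directly; its API is not reproduced here. As a rough illustration of what an EOF analysis computes, here is a minimal NumPy sketch that extracts EOFs, principal components, and explained variance from a synthetic time-by-space anomaly matrix via singular value decomposition.

    import numpy as np

    # Synthetic dataset: 120 time steps of a field sampled at 500 grid points.
    rng = np.random.default_rng(0)
    data = rng.standard_normal((120, 500))

    anom = data - data.mean(axis=0)              # remove the time mean at each grid point
    u, s, vt = np.linalg.svd(anom, full_matrices=False)

    eofs = vt                                    # rows are the spatial patterns (EOFs)
    pcs = u * s                                  # principal-component time series
    var_frac = s**2 / np.sum(s**2)               # fraction of variance explained per mode

    print("variance explained by the first 3 modes:", var_frac[:3])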

47
CDAT Climate Diagnostics
  • Provides a common environment for climate
    research
  • Uniform diagnostics for model evaluation and
    comparison

Taylor Diagram, Thermodynamic Plot, Performance Portrait Plot, Wheeler-Kiladis Analysis
48
Contributed Packages
  • PyGrADS (potential)
  • AsciiData
  • BinaryIO
  • ComparisonStatistics
  • CssGrid
  • DsGrid
  • Egenix
  • EOF
  • EzTemplate
  • HDF5Tools
  • IOAPITools
  • Ipython
  • Lmoments
  • MSU
  • NatGrid
  • ORT
  • PyLoapi
  • PynCl
  • RegridPack
  • ShGrid
  • SP
  • SpanLib
  • SpherePack
  • Trends
  • Twisted
  • ZonalMeans
  • ZopeInterface

49
Visualization
  • Visualization and Control System (VCS)
  • Standard CDAT 1D and 2D graphics package
  • Integrated Contributed 2D Packages
  • Xmgrace
  • Matplotlib
  • IaGraph
  • Integrated Contributed 3D packages
  • ViSUS
  • VTK
  • NcVTK
  • MoDAVE

50
Visual Climate Data Analysis Tools (VCDAT)
  • CDAT GUI; facilitates:
  • Data access
  • Data processing and analysis
  • Data visualization
  • Accepts python input
  • Commands and scripts
  • Saves state
  • Converts keystrokes to python
  • Online help

51
MoDAVE
  • Visualization of mosaic grids
  • Parallelized using MPI
  • Integration into CDAT is in progress
  • Developed by Tech-X and LLNL

Cubed sphere visualization
52
ViSUS in CDAT
  • Data streaming application
  • Progressive processing and visualization of large scientific datasets
  • Future capabilities for petascale dataset streaming
  • Simultaneous visualization of multiple (1D, 2D, 3D) data representations

53
VisTrails
  • Scientific workflow and provenance management
    system.
  • Interface for next version of CDAT
  • history trees, data pipelines, visualization
    spreadsheet, provenance capture

54
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
55
Background
  • Scientists generate large data files
  • Processing the files consists of executing a
    series of independent tasks
  • Ensemble runs of models
  • All the tasks are run on one CPU

56
PoDS
  • Task parallelism tool taking advantage of
    distributed architectures as well as multi-core
    capabilities
  • For running serial independent tasks across nodes
  • Does not make any assumptions about the underlying applications being executed
  • Can be ported to other platforms

57
PoDS Features
  • Dynamic assessment of resource availability
  • Each task is timed
  • A summary report is provided

58
Task Assignment
(Diagram: Commands 1 through 9 from the execution file are distributed across Node 1, Node 2, and Node 3.)
59
PoDS Usage
  • Usage: pods.py [-help] execFile CpusPerNode
  • execFile: file listing all the independent tasks to be executed
  • CpusPerNode: number of CPUs to use per node. If not provided, PoDS will automatically use the number of CPUs available on each node.

60
Simple Example
  • Randomly generates an integer n between 0 and 10^9
  • Loops over n to perform some basic operations
  • Each time the application is called, a different n is obtained. We want to run the application 150 times (a sketch of building the corresponding PoDS execution file follows).
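
One way to drive the 150 runs described above is to generate the execution file from a short script and then hand it to PoDS. A minimal sketch, assuming the application is an executable named ./my_serial_app (a placeholder) and that pods.py sits in the package directory given on the "More Information" slide:

    import subprocess

    # Write an execution file with one independent task per line (150 runs
    # of the same serial application, each logging to its own file).
    with open("execFile", "w") as f:
        for i in range(150):
            f.write(f"./my_serial_app > run_{i}.log\n")   # placeholder application name

    # Launch PoDS with 8 CPUs per node; omit the count to let PoDS detect
    # the CPUs available on each node (per the usage on the previous slide).
    subprocess.run(["/usr/local/other/pods/pods.py", "execFile", "8"], check=True)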

61
Timing Numbers
Nodes   Cores/Node   Time (s)
1       1            990
1       2            496
1       4            256
1       8            133
2       1            497
2       2            247
2       4            131
2       8            61
62
More Information
  • User's Guide on Modeling Guru: https://modelingguru.nasa.gov/clearspace/docs/DOC-1582
  • Package available at /usr/local/other/pods

63
Agenda
Welcome and Introduction: Phil Webster, CISTO Chief
SSP Test: Matt Koop, User Services
Current System Status: Fred Reitz, Operations Lead
Discover Job Monitor: Tyler Simon, User Services
NCCS Compute Capabilities: Dan Duffy, Lead Architect
Analysis System Updates: Tom Maxwell, Analysis Lead
PoDS: Jules Kouatchou, SIVO
User Services Updates: Bill Ward, User Services Lead
Questions and Comments: Phil Webster, CISTO Chief
64
Important Contacts
  • NCCS Support: support@nccs.nasa.gov, (301) 286-9120
  • Analysis Lead: Thomas.Maxwell@nasa.gov, (301) 286-7810
  • I/O Improvements: Daniel.Q.Duffy@nasa.gov, (301) 286-8830
  • PoDS Info: Jules.Kouatchou-1@nasa.gov, (301) 286-6059
  • User Services Lead: William.A.Ward@nasa.gov, (301) 286-2954