Title: Production Linux Capacity Computing at Los Alamos
1 Production Linux Capacity Computing at Los Alamos
- Steven R. Shaw, CCN-7
- High Performance Computing Systems
- Computing, Communications, and Networking Division
2 Topics
- Vision and goals
- Methodology
- Clustermatic and other components
- Current systems
- Lightning
- Flash
- Configuration management
- Operational opportunities
- Lessons learned
- Current and future work
- Questions
3 Our Capacity Vision
- The capacity vision, to meet the requirements of programs within reasonable resources, is to consolidate architectures, leverage commodity computing and Linux open source software, and standardize deployment over capacity systems. (Cheryl Wampler, ASC PI Meeting, March 1-4, 2004)
4 Goals for Production Linux Capacity Computing
- Respond to the need for additional capacity computing
- Provide stability and continuity for the user community
- Lower integration and operational costs by leveraging internal resources and open source software
- Use repeatable processes and automation to deploy new capacity quickly and to efficiently operate and maintain existing systems
5 Goals (continued)
- Provide more compute cycles to users by making systems easier to build and manage; do more with available resources
- Move toward a separate common file system, not tied to specific platforms
- Also move toward a separate, standards-based, scalable I/O network to file systems, archival storage, and other services
6 Methodology
- CCN Division took on the role of system integrator
- Successful collaborative relationships were established with CCS-1 (LA-MPI, Science Appliance), CCN-8 (Panasas FS, compilers, and tools), CCN-5 (network integration), and third-party software suppliers
- Built upon our Linux cluster experience from Pink and other systems
7 Pink Configuration
- 64 dual-processor I/O nodes
- 958 dual-processor production computing nodes
- GigE network
- Myrinet
- 1 dual-processor BProc master node
- 2 dual-processor front-end nodes
- Panasas Global FS
- LANL Yellow (soon Turquoise) network
- Open NFS servers
8 Science Appliance
- The key software in a Science Appliance is a suite that LANL developed called "Clustermatic"
- Clustermatic can completely control a cluster, from the BIOS up to a high-level programming environment
- It features the Beowulf Distributed Process Space (BProc), LinuxBIOS, and a variety of other open-source kernel modifications, utilities, and libraries
- Very quick node boot times
- Cluster boot and upgrade in minutes
- Manageable nodes from power-on
- Single system image for the entire cluster
- Quick process migration
9 Clustermatic Awards
- Research and Development Magazine's 2004 R&D 100 award:
- "Clustermatic is a revolutionary software suite for managing, monitoring, administering and operating clusters on network-connected computers running as a high-performance system. Clustermatic increases reliability and efficiency, decreases node autonomy, simplifies computer programming, reduces administration costs, and minimizes a user's reliance on unpredictable software, enabling commodity-based cluster networks to compete with the higher-cost supercomputers."
- The Clustermatic system was awarded the Excellence in Cluster Technology Award for Open Source Cluster Solutions at the 2004 ClusterWorld Conference and Exposition in April 2004.
10 Clustermatic Components
- A traditional cluster is built by replicating a complete system software environment on every node.
- In a Science Appliance (Clustermatic system), we have master nodes and slave nodes, but only the master nodes have a fully configured system.
- The slave nodes run a minimal software stack consisting of LinuxBIOS, Linux, and BProc.
- Culture change for users: not every tool and library exists on the slave nodes.
11 Clustermatic Components
- Most importantly, BProc enables a distributed process space across the nodes of the cluster: all user processes running on the slave nodes appear as processes running on the master node.
- Users create processes on the master node, and the system migrates them to the slave nodes.
- Standard input, output, and error streams are redirected to the master node.
- (Diagram: slave node processes appearing in the master node's process space.)
- Processes remain visible and controllable on the master.
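- A minimal C sketch follows, assuming the libbproc programming interface (sys/bproc.h) that ships with Clustermatic; the function names bproc_rfork() and bproc_currnode() and the -lbproc link flag come from that library's documentation, not from this presentation, so treat this as illustrative rather than authoritative.

    /*
     * Illustrative BProc process-migration sketch (assumes Clustermatic's
     * libbproc; compile on the master node, e.g. gcc rfork_demo.c -lbproc).
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <sys/bproc.h>              /* bproc_rfork(), bproc_currnode() */

    int main(int argc, char **argv)
    {
        int node = (argc > 1) ? atoi(argv[1]) : 0;   /* target slave node */
        pid_t pid;

        /* Like fork(), but the child is migrated to the requested slave
         * node while remaining in the master node's process space, so
         * ps and kill on the master still see and control it. */
        pid = bproc_rfork(node);
        if (pid < 0) {
            perror("bproc_rfork");
            return 1;
        }

        if (pid == 0) {
            /* Child: now running on the slave node; stdout is redirected
             * back to the master node by BProc. */
            printf("child %d running on node %d\n",
                   (int)getpid(), bproc_currnode());
            return 0;
        }

        /* Parent: still on the master node; reap the migrated child. */
        waitpid(pid, NULL, 0);
        printf("parent on node %d reaped child %d\n",
               bproc_currnode(), (int)pid);
        return 0;
    }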
12 Other Key Components
- Panasas file system
- LA-MPI (see the MPI sketch after this list)
- User environment similar to other Los Alamos systems
- HPSS
- LSF
- TotalView
- HPC toolkit
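- Because LA-MPI implements the standard MPI interface, ordinary MPI codes need no source changes to run on these systems. The sketch below is a generic MPI program of that kind, not code from the presentation; the compile wrapper and launch command are site-specific and omitted.

    /* Generic MPI example: each rank reports where it is running.
     * Under BProc, every rank's output is forwarded to the master node. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        printf("rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }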
13 Science Appliance Systems at LANL
- Lightning, Pink, Grendels, Flash, TLC
- MPI and LSF are BProc-integrated.
- Result: LANL Science Appliance systems are easy to use but are different from other LANL systems.
14 Los Alamos Platforms
15 Lightning Capacity System Overview (last week)
- System Hardware
- 1408 dual-processor LNXI AMD Opteron nodes (11.26 TeraOps peak, 5.6 TB memory)
- One Arima Rio Works HDAMA system board with AMD 8111 and 8131 chipsets
- Two 2.0 GHz 64-bit processors with 1 MB L2 cache/node
- Four GB of memory/node
- One 120-GB disk drive/node
- One ICEBOX controller/node for hardware monitoring
- Scalable to 2048 nodes (scalable design plans for interconnect)
- Myrinet interconnect (latency 7 usec, bandwidth 250 MB/sec); see the ping-pong sketch after the software list below
- Gigabit copper network to network services such as NFS and Panasas
- A copper-based 10/100 network for system monitoring, system reboot, etc.
- System Software
- Linux
- Clustermatic software
- Beoboot, LinuxBIOS, BProc, Supermon
- Compilers
- Message passing - LA-MPI
- Debugging - TotalView
- Archival storage - HPSS
- Resource management - Load Sharing Facility (LSF)
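- The Myrinet latency and bandwidth figures quoted above are the kind of numbers a simple two-rank MPI ping-pong test reports. The sketch below is a generic version of such a test; the message size and repetition count are arbitrary illustrative choices, not values from the slide.

    /* Generic MPI ping-pong sketch for estimating point-to-point latency
     * and bandwidth between ranks 0 and 1 (run with at least two ranks). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS  1000
    #define BYTES (1 << 20)              /* 1 MiB messages for bandwidth */

    static void pingpong(char *buf, int count, int rank)
    {
        MPI_Status st;
        if (rank == 0) {
            MPI_Send(buf, count, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, count, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, count, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, count, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    int main(int argc, char **argv)
    {
        int rank, i;
        double t0;
        char *buf = malloc(BYTES);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Latency: half the round-trip time of zero-byte messages. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++)
            pingpong(buf, 0, rank);
        if (rank == 0)
            printf("latency   ~ %.1f usec\n",
                   (MPI_Wtime() - t0) / (2.0 * REPS) * 1e6);

        /* Bandwidth: bytes moved per second with large messages
         * (each iteration moves BYTES in each direction). */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++)
            pingpong(buf, BYTES, rank);
        if (rank == 0)
            printf("bandwidth ~ %.0f MB/sec\n",
                   2.0 * REPS * (double)BYTES / (MPI_Wtime() - t0) / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }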
16 Lightning Integration and Deployment
- Timeline chart spanning August 2003 through November 2004; recoverable milestones:
- Contract signed mid-July 2003
- System delivered
- Integration/acceptance test
- Beta mode
- Limited availability
- General availability
- Laboratory standdown
- Secure environment
- Level 2 milestone (SCS): Lightning user environment
- Level 2 milestone (PC): November 25, 2005
- DP Award of Excellence for the integration effort
- Linpack run: 8.051 TF 64-bit Linpack, #6 on Top500
17 Lightning last week
18 Linux Production and Development Environment Model
- Production segments
- Development environments
- Support and system functions
19 Flash Timeline
- Assemble hardware: 11/17-11/19/04
- Stabilize hardware: 11/20-11/24
- Acceptance testing complete: 12/1
- Software install: 12/2-12/17
- 88 person-hours
- First I/O node system on Opteron
- Panasas and network setup in parallel
- Friendly users on 12/19
20 Configuration Management
- Philosophy
- All maintenance and installation is done within the configuration management system
- Motivation
- Do more with available resources
- Automation is key
- Expertise is encoded
- Automated systems are consistent and tireless
- Prevent errors and mitigate consequences
- Avoid creating error-likely situations
- Correlate effect with cause
- Manual actions reduce the capacity to respond
21 Configuration Management
- A framework for automating, to the fullest extent possible and in a cross-platform, common fashion, the configuration of a product
- Differentiate products at major boundaries that make sense (OS, Linux version, BProc or not, chip architecture, unique service, etc.)
- Databases become the documentation
22 Configuration Management Culture Change
- The database is pointless if the system diverges from its description due to actions taken outside the database
- All changes, even those that are temporary or for debugging, must be done using our configuration management tools
23 Configuration Management Tools
- Rsync: high-confidence mirroring of files
- Systemimager: installation, replication, and disaster recovery of the core system
- Cfengine: rule-based files for installation and configuration actions
- Systemimager provides the body; cfengine creates the soul
24 More Configuration Management Tools
- Revision Control System (RCS): track origin and history
- Annotated history within the cfengine database
- RPM (Red Hat Package Manager)
- Deterministic, verifiable, removable
- Culture change for some of our suppliers
25 Configuration Management Automation and Discipline
- Leads to systems that:
- Are more predictable (behavior can be ascertained from the database)
- Are more scalable (copies are easier)
- Are better documented
- Are easier to debug
- Are easier to repair
- Enables us to accomplish more with our available resources
26 Operational Opportunities
- Hardware maintenance
- The field-replaceable unit is the node
- Rapid boot time dramatically shortens the time to repair
- Use operations staff for hands-on maintenance; the vendor becomes a parts supplier and second-tier support
- Repair the node during prime time and burn it in, maintaining a supply of tested spares
- Increased job content and satisfaction for operators
27 Operational Opportunities
- Automated interrupt reporting
- When a node becomes interrupted, the HPC operators are notified by email and a GUI display
- Event-driven notification
- A record for the interrupt is generated automatically in the Remedy database, and its status is left open awaiting problem resolution
- When a node is returned to service, the Remedy ticket is automatically updated with the time
- In many cases the cause of the interrupt and the associated error message are captured in the ticket
- Results in more complete and accurate information
28 Lessons Learned
- Integration issues
- Be sure your suppliers understand your production support needs and are committed
- Remember: you own the complete support chain
- Culture change issues
- Users shift from "every tool everywhere" to a more deterministic model
- Be willing to negotiate the "rightweight" system
- Administrators: configuration management discipline
- Software suppliers: conform to the configuration management requirements
- BProc master node loading
29 Current and Future Work
- Lightning
- Integrate 256 additional nodes
- Reconfigure GigE and implement I/O nodes
- Increase Panasas to 200 TB
- 8Gb on all nodes
- Lightning and Flash
- 64-bit Linux 2.6, BProc V4
- POSIX threads
- OpenMPI
- PScalBB (scalable and available I/O network design)
30 Thanks and Questions
- My thanks to the following people for providing and helping with content:
- Harvey Wasserman, CCN-7
- Dave Neal, Jerry DeLapp, and Daryl Grunau, CCN-9
- Ron Minnich, CCS-1
- Cheryl Wampler, PADNWP
- Thank you for your attention, and now for your questions.