Title: Production Linux Capacity Computing at Los Alamos
1 Production Linux Capacity Computing at Los Alamos
- Steven R. Shaw, CCN-7
- High Performance Computing Systems
- Computing, Communications, and Networking Division
- Vision and goals
- Methodology
- Clustermatic and other components
- Current systems
- Lightning
- Flash
- Configuration management
- Operational opportunities
- Lessons learned
- Current and future work
- Questions
3 Our Capacity Vision
- to meet requirements of programs within reasonable resources, is to consolidate architectures, leverage commodity computing, Linux open source software, and standardized deployment over capacity systems
- Cheryl Wampler, ASC PI Meeting, March 1-4, 2004
4 Goals for Production Linux Capacity Computing
- Respond to the need for additional capacity computing
- Provide stability and continuity for the user community
- Lower integration and operational costs by leveraging internal resources and open source software
- Use repeatable processes and automation to deploy new capacity quickly and to efficiently operate and maintain existing systems
5 Goals (continued)
- Provide more compute cycles to users by making systems easier to build and manage: do more with available resources
- Move toward a separate common file system, not tied to specific platforms
- Also move toward a separate standards-based, scalable I/O network to file systems, archival storage, and other services
- CCN Division took on the role of system integrator
- Successful collaborative relationships were established with CCS-1 (LA-MPI, Science Appliance), CCN-8 (Panasas FS, compilers, and tools), CCN-5 (network integration), and third-party software suppliers
- Built upon our Linux cluster experience from Pink and other systems
7 Pink Configuration
64 dual-processor I/O nodes
958 dual-processor production computing nodes
1 dual-processor BProc master node
2 dual-processor front-end nodes
Panasas Global FS
LANL Yellow (soon Turquoise) network
8 Science Appliance
- The key software in a Science Appliance is a suite that LANL developed called "Clustermatic"
- Clustermatic can completely control a cluster, from the BIOS up to a high-level programming environment.
- It features the Beowulf Distributed Process Space (BProc), LinuxBIOS, and a variety of other open-source kernel modifications, utilities, and libraries.
- Very quick node boot times
- Cluster boot and upgrade in minutes
- Manageable nodes from power-on
- Single system image for the entire cluster
- Quick process migration
9 Clustermatic Awards
- Research and Development Magazine's 2004 Research and Development 100.
- Clustermatic is a revolutionary software suite for managing, monitoring, administering and operating clusters on network-connected computers running as a high-performance system. Clustermatic increases reliability and efficiency, decreases node autonomy, simplifies computer programming, reduces administration costs, and minimizes a user's reliance on unpredictable software, enabling commodity-based cluster networks to compete with the higher-cost supercomputers.
- The Clustermatic system was awarded the Excellence in Cluster Technology Award for Open Source Cluster Solutions at the 2004 ClusterWorld Conference and Exposition, in April 2004.
10 Clustermatic Components
- A traditional cluster is built by replicating a complete system software environment on every node.
- In a Science Appliance (Clustermatic system), we have master nodes and slave nodes, but only the master nodes have a fully configured system.
- The slave nodes run a minimal software stack consisting of LinuxBIOS, Linux, and BProc.
- Culture change for users: not every tool and library exists on the slave nodes.
11 Clustermatic Components
- Most importantly, BProc enables a distributed process space across nodes within the cluster: all user processes running on the slave nodes appear as processes running on the master node.
- Users create processes on the master node and the system migrates them to the slave nodes.
- Standard input, output, and error streams are redirected to the master node.
[Diagram: master node and slave nodes]
- Processes remain visible and controllable on the master.
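The single process space described on this slide can be illustrated with a toy model. This is only a conceptual sketch, not BProc's interface: the `Process` and `MasterNode` classes, the PID numbering, and the node number are all hypothetical. It shows the one property the slide stresses, namely that a migrated process keeps its entry in the master's process table, so it can still be listed and signalled from the master.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Process:
    pid: int
    command: str
    node: Optional[int] = None  # None means "running on the master"
    alive: bool = True


@dataclass
class MasterNode:
    """Toy model (hypothetical): the master owns the only process table."""
    next_pid: int = 1000
    table: Dict[int, Process] = field(default_factory=dict)

    def spawn(self, command: str) -> Process:
        # Processes are always created on the master first.
        proc = Process(pid=self.next_pid, command=command)
        self.table[proc.pid] = proc
        self.next_pid += 1
        return proc

    def migrate(self, pid: int, node: int) -> None:
        # Migration changes where the process runs, not its PID or its
        # entry in the master's table.
        self.table[pid].node = node

    def ps(self) -> List[str]:
        # A listing on the master shows every live process, local or remote.
        rows = []
        for p in sorted(self.table.values(), key=lambda p: p.pid):
            if p.alive:
                where = "master" if p.node is None else f"node {p.node}"
                rows.append(f"{p.pid} {p.command} ({where})")
        return rows

    def kill(self, pid: int) -> None:
        # Signals are issued on the master, whichever node runs the process.
        self.table[pid].alive = False


master = MasterNode()
job = master.spawn("solver")
master.migrate(job.pid, node=17)      # hypothetical slave node number
listing_before = master.ps()          # the remote process is still listed
master.kill(job.pid)
listing_after = master.ps()           # and it can be removed from the master
```

In this model the slave node holds no process table of its own, which mirrors the minimal slave software stack described earlier.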
12 Other Key Components
- Panasas file system
- User environment similar to other Los Alamos systems
- HPSS
- TotalView
- HPC toolkit
13 Science Appliance Systems at LANL
- Lightning, Pink, Grendels, Flash, TLC
- MPI and LSF are BProc-integrated.
- Result: LANL Science Appliance systems are easy to use but are different from other LANL systems
14 Los Alamos Platforms
15 Lightning Capacity System Overview (last week)
- System Hardware
- 1408 dual-processor LNXI AMD Opteron nodes (11.26 TeraOps peak, 5.6 TB memory)
- One Arima Rio Works HDAMA system board with AMD 8111 and 8131 chipsets
- Two 2.0 GHz 64-bit processors with 1 MB L2 cache/node
- Four GB of memory/node
- One 120-GB disk drive/node
- One ICEBOX controller/node for hardware monitoring
- Scalable to 2048 nodes (scalable design plans for interconnect)
- Myrinet interconnect (latency 7 usec, bandwidth 250 MB/sec)
- Gigabit copper network to network services such as NFS, Panasas
- A copper-based 10/100 network for system monitoring, system reboot, etc.
- System Software
- Linux
- Clustermatic software
- Beoboot, LinuxBIOS, BProc, Supermon
- Compilers
- Message Passing
- Debugging - TotalView
- Archival storage - HPSS
- Resource management - Load Sharing Facility (LSF)
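The headline hardware figures on this slide can be checked with a few lines of arithmetic. The node count, processor count, clock rate, and memory per node come from the slide; the figure of two floating-point operations per processor per cycle is an assumption added here, not something the slide states.

```python
NODES = 1408                 # from the slide
PROCESSORS_PER_NODE = 2      # from the slide
CLOCK_HZ = 2.0e9             # from the slide (2.0 GHz)
FLOPS_PER_CYCLE = 2          # assumption: one add and one multiply per cycle
MEMORY_PER_NODE_GB = 4       # from the slide

peak_flops = NODES * PROCESSORS_PER_NODE * CLOCK_HZ * FLOPS_PER_CYCLE
peak_teraops = peak_flops / 1e12
total_memory_tb = NODES * MEMORY_PER_NODE_GB / 1000  # decimal terabytes

print(f"peak: {peak_teraops:.2f} TeraOps")   # peak: 11.26 TeraOps
print(f"memory: {total_memory_tb:.1f} TB")   # memory: 5.6 TB
```

Under that assumption the result matches the quoted 11.26 TeraOps peak and 5.6 TB of memory.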
16 Lightning Integration and Deployment
Contract signed mid-July 2003
Level 2 (SCS) Lightning User Environment
Level 2 (PC) November 25, 2005
System Delivered
Beta mode
Integration/Acceptance Test
Limited Availability
Laboratory Standdown
Secure Environment
DP Award of Excellence For Integration Effort
Linpack run: 8.051 TF 64-bit Linpack, #6 on Top500
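As a rough check on the Linpack result, the measured figure can be compared with the 11.26 TeraOps peak quoted on the system overview slide. This is only a back-of-the-envelope sketch; it assumes both numbers refer to the same 1408-node configuration.

```python
RMAX_TF = 8.051    # measured 64-bit Linpack result, from this slide
RPEAK_TF = 11.26   # peak figure, from the system overview slide

efficiency = RMAX_TF / RPEAK_TF
print(f"Linpack efficiency: {efficiency:.1%}")   # Linpack efficiency: 71.5%
```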
17 Lightning last week
18 Linux production and development environment model
- Production segments
- Development environments
- Support and system functions
19 Flash Timeline
- Assemble hardware 11/17-11/19/04
- Stabilize hardware 11/20-11/24
- Acceptance testing complete 12/1
- Software install 12/2-12/17
- 88 person-hours
- First I/O node system on Opteron
- Panasas and network setup in parallel
- Friendly users on 12/19
20 Configuration Management
- Philosophy
- All maintenance and installation is done within the configuration management system
- Motivation
- Do more with available resources
- Automation is key
- Expertise is encoded
- Automated systems are consistent and tireless
- Prevent errors and mitigate consequences
- Avoid creating error-likely situations
- Correlate effect with cause
- Manual actions reduce the capacity to respond
21 Configuration Management
- A framework for automating, to the fullest extent possible, in a cross-platform and common fashion, the configuration of a product.
- Differentiate products at major boundaries that make sense (O/S, Linux version, BProc or not, chip architecture, unique service, etc.)
- Databases become the documentation
22 Configuration Management: Culture Change
- The database is pointless if the system diverges from its description due to actions taken outside the database
- All changes, even those that are temporary or for debugging, must be made using our configuration management tools
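The divergence this slide warns about can be made concrete with a small drift check. This is a minimal sketch, not how the tools named in this talk work internally: the "database" here is just a dictionary of expected SHA-256 checksums, and the file name, contents, and helper functions (`record`, `find_drift`) are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record(paths: List[Path]) -> Dict[str, str]:
    """Build the 'database': the expected checksum of each managed file."""
    return {str(p): sha256_of(p) for p in paths}


def find_drift(database: Dict[str, str]) -> List[str]:
    """Return managed files that are missing or differ from the database."""
    drifted = []
    for name, expected in sorted(database.items()):
        path = Path(name)
        if not path.is_file() or sha256_of(path) != expected:
            drifted.append(name)
    return drifted


with tempfile.TemporaryDirectory() as tmp:
    managed = Path(tmp) / "ntp.conf"               # hypothetical managed file
    managed.write_text("server time.example.org\n")
    database = record([managed])
    clean = find_drift(database)                   # nothing has changed yet

    # A "temporary" manual edit made outside the configuration tools.
    managed.write_text("server debug.example.org\n")
    drift = [Path(n).name for n in find_drift(database)]
```

After the manual edit the file no longer matches its recorded description, which is exactly the state in which the database stops being usable as documentation.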
23 Configuration Management Tools
- rsync - High-confidence mirroring of files
- systemimager - Installation, replication, and disaster recovery of the core system
- cfengine - Rule-based files for installation and configuration actions
- systemimager provides the body, cfengine creates the soul
24 More Configuration Management Tools
- Revision Control System (RCS) - Track origin and history
- Annotated history within the cfengine database
- RPM (Red Hat Package Manager)
- Deterministic, verifiable, removable
- Culture change for some of our suppliers
25 Configuration Management: Automation and Discipline
- Leads to systems that are
- More predictable: behavior can be ascertained from the database
- More scalable: copies are easier
- Better documented
- Easier to debug
- Easier to repair
- Enables us to accomplish more with our available resources
26 Operational Opportunities
- Hardware maintenance
- Field replaceable unit is the node
- Rapid boot time dramatically shortens the time to repair
- Use operations staff for hands-on maintenance; vendor becomes a parts supplier and second-tier support
- Repair the node during prime time and burn-in, maintaining a supply of tested spares.
- Increased job content and satisfaction for operations staff
27 Operational Opportunities
- Automated interrupt reporting
- When a node becomes interrupted, the HPC operators are notified by email and a GUI display.
- Event-driven notification.
- A record for the interrupt is generated automatically in the Remedy database and its status is left open awaiting the problem resolution.
- When a node is returned to service, the Remedy ticket is automatically updated with the time.
- In many cases the cause of the interrupt and associated error message are captured in the ticket.
- Results in more complete and accurate information.
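The ticket lifecycle described on this slide can be sketched as a small event handler. This is an illustration only and does not use the Remedy interface: the `Ticket` and `InterruptTracker` classes, the node name, the cause string, and the timestamps are all hypothetical, and the notification list stands in for the operator email and GUI display.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Ticket:
    node: str
    opened: datetime
    cause: Optional[str] = None
    closed: Optional[datetime] = None


class InterruptTracker:
    """Hypothetical stand-in for the monitoring-to-ticketing glue."""

    def __init__(self) -> None:
        self.tickets: List[Ticket] = []
        self.open_by_node: Dict[str, Ticket] = {}
        self.notifications: List[str] = []

    def node_down(self, node: str, when: datetime,
                  cause: Optional[str] = None) -> Ticket:
        # One open ticket per node: a repeated "down" event is ignored.
        if node in self.open_by_node:
            return self.open_by_node[node]
        ticket = Ticket(node=node, opened=when, cause=cause)
        self.tickets.append(ticket)
        self.open_by_node[node] = ticket
        # Stand-in for the operator email and GUI update.
        self.notifications.append(f"{node} interrupted: {cause or 'unknown'}")
        return ticket

    def node_up(self, node: str, when: datetime) -> Optional[Ticket]:
        # Returning to service stamps the open ticket with the time.
        ticket = self.open_by_node.pop(node, None)
        if ticket is not None:
            ticket.closed = when
        return ticket


tracker = InterruptTracker()
tracker.node_down("n0421", datetime(2005, 1, 10, 9, 0), cause="ECC error")
tracker.node_down("n0421", datetime(2005, 1, 10, 9, 1))   # duplicate event
ticket = tracker.node_up("n0421", datetime(2005, 1, 10, 9, 12))
downtime_minutes = (ticket.closed - ticket.opened).total_seconds() / 60
```

Because both timestamps are recorded by the events themselves, the downtime for each interrupt can be computed from the ticket without relying on anyone's recollection.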
28 Lessons Learned
- Integration issues
- Be sure your suppliers understand your production support needs and are committed
- Remember you own the complete support chain
- Culture change issues
- Users - shift from "every tool everywhere" to a more deterministic model
- Be willing to negotiate the right-weight system
- Administrators - configuration management discipline
- Software suppliers - conform to the configuration management requirements
- BProc master node loading
29 Current and Future Work
- Lightning
- Integrate 256 additional nodes
- Reconfigure Gig-E and implement I/O nodes
- Increase Panasas to 200 TB
- 8 GB on all nodes
- Lightning and Flash
- 64-bit Linux 2.6, BProc V4
- POSIX threads
- OpenMPI
- PScalBB (scalable and available I/O network)
30 Thanks and Questions
- My thanks to the following people for providing and helping with content
- Harvey Wasserman, CCN-7
- Dave Neal, Jerry DeLapp and Daryl Grunau, CCN-9
- Ron Minnich, CCS-1
- Cheryl Wampler, PADNWP
- Thank you for your attention, and now for your questions