Title: Using Description-Based Methods for Provisioning Clusters
1. Using Description-Based Methods for Provisioning Clusters
- Philip M. Papadopoulos
- Program Director, Grids and Clusters
- San Diego Supercomputer Center (SDSC)
- California Institute for Telecommunications and Information Technology (Calit2)
- University of California, San Diego
2. Computing Clusters
- Prefetch: installation of a raw Rocks cluster
- How we spent a recent weekend
- Why should you care about software configuration and large-scale provisioning?
- Description-based configuration
- Taking the administrator out of cluster administration
- XML-based assembly instructions
- What's next
3. Prefetch
- Install a Rocks cluster while this talk progresses
- Will check in every now and then to see how it is going
- Saw this equipment for the first time about ½ hour ago
4. A Story for the Top500
- Attempt a TOP500 run on two fused 128-node PIII (1 GHz, 1 GB memory) clusters
- 100 Mbit Ethernet, Gigabit to the frontend
- Myrinet 2000, 128-port switch on each cluster
- Questions
- What LINPACK performance could we get?
- Would Rocks scale to 256 nodes?
- Could we set up, tear down, and run benchmarks in the allotted 48 hours?
- Would we get any sleep?
5. Setup
[Diagram: a new frontend connected to two 128-node clusters (120 nodes on Myrinet each) via 8 Myrinet cross connects]
- Fri, 5:30 pm. Started. Built new frontend. Physically rewired Myrinet, added an Ethernet switch.
- Fri, midnight. Solved some Ethernet issues. Completely reinstalled all nodes.
- Sat, 12:30 am. Went to sleep.
- Sat, 6:30 am. Woke up. Submitted first LINPACK runs (225 nodes).
6. Mid-Experiment
- Sat, 10:30 am. Fixed a failed disk and a bad SDRAM. Started runs on 240 nodes. Went shopping at 11:30 am.
- 4:30 pm. Examined output. Started 6 hours of runs.
- 11:30 pm. Examined more output. Submitted final large runs. Went to sleep.
- 285 GFlops
- 59.5% of peak
- Over 22 hours of continuous computing
240 dual PIII (1 GHz, 1 GB), Myrinet
7. Final Clean-up
- Sun, 9:00 am. Submitted 256-node Ethernet HPL.
- Sun, 11:00 am. Added GigE to the frontend. Started a complete reinstall of all 256 nodes.
- 40 minutes for a complete reinstall (return to original system state)
- Did find some (fixable) issues at this scale
- Too many open database connections (fixing for next release)
- DHCP client not resilient enough. Server was fine.
- Sun, 1:00 pm. Restored wiring, rebooted original frontends. Started reinstall tests. Some debugging of installation issues uncovered at 256 nodes.
- Sun, 5:00 pm. Went home. Drank beer.
- #233 on the November 2002 Top500 list
- Built from stock Rocks
8. Key Software Components
- HPL, ATLAS (Univ. of Tennessee)
- The truckload of open source software that we all build on
- Red Hat Linux distribution and installer
- de facto standard
- NPACI Rocks
- Complete cluster-aware toolkit that extends a Red Hat Linux distribution
- (Inherent belief that simple and fast can beat out complex and fast)
9. Why Should You Care about Cluster Configuration/Administration?
- "We can barely make clusters work." - Gordon Bell at CCGCS02 (Lyon, France)
- "No real breakthroughs in system administration in the last 20 years." - P. Beckman at CCGCS02
- System administrators love to continuously twiddle knobs
- Similar to a mechanic continually adjusting the air/fuel mixture on your car
- Or worse, randomly unplugging/plugging spark plug wires
- Turnkey/automated systems remove the system admin from the equation
- Sysadmins don't like this.
- This is good
- Think of a cluster as an appliance, not a sysadmin playground
10. Clusters aren't homogeneous
- Real clusters are more complex than a pile of identical computing nodes
- Hardware divergence as the cluster changes over time
- Logical heterogeneity of specialized servers
- IO servers, job submission, specialized configs for apps, viz nodes
- The garden-variety cluster owner shouldn't have to fight the provisioning issues.
- Get them to the point where they are fighting the hard application parallelization issues
- They should be able to easily follow software/hardware trends
- Moderate-sized (up to 128 nodes) clusters are the standard grid endpoints
- A non-uber-administrator handles two 128-node
Rocks clusters at SIO (a 260:1 system-to-admin ratio)
11. NPACI Rocks Toolkit - rocks.npaci.edu
- Techniques and software for easy installation, management, monitoring, and update of Linux clusters
- A complete cluster-aware distribution and configuration system.
- Installation
- Bootable CD + floppy which contain all the packages and site configuration info to bring up an entire cluster
- Management and update philosophies
- Trivial to completely reinstall any (or all) nodes.
- Nodes are 100% automatically configured
- Red Hat Kickstart to define the software/configuration of nodes
- Software is packaged in a query-enabled format
- Never try to figure out if node software is consistent
- If you ever ask yourself this question, reinstall the node
- Extensible, programmable infrastructure for all node types that make up a real cluster.
12. Tools Integrated
- Standard cluster tools
- MPICH, PVM, PBS, Maui (SSH, SSL -> Red Hat)
- Rocks add-ons
- Myrinet support
- GM device build (RPM), RPC-based port reservation (usher-patron)
- mpi-launch (understands the port reservation system)
- rocks-dist (distribution workhorse)
- XML (programmable) Kickstart
- eKV (console redirect to Ethernet during install)
- Automated MySQL database setup
- Ganglia monitoring (U.C. Berkeley and NPACI)
- Stupid pet administration scripts
- Other tools
- PVFS
- ATLAS BLAS, High Performance Linpack
13. Support for Myrinet
- The Myrinet device driver must be versioned to the exact kernel version (e.g., SMP, options) running on a node
- Source is compiled at reinstallation on every (Myrinet) node (adds 2 minutes to installation) (a source RPM, by the way)
- The device module is then installed (insmod)
- GM_mapper is run (adds the node to the network)
- Myrinet ports are limited and must be identified with a particular rank in a parallel program
- RPC-based reservation system for Myrinet ports
- Client requests port reservations from the desired nodes
- Rank mapping file (gm.conf) created on-the-fly
- No centralized service needed to track port allocation
- mpi-launch hides all the details of this
- HPL (LINPACK) comes pre-packaged for Myrinet
- Build your Rocks cluster, see where it sits on the Top500
14. Key Ideas
- No difference between OS and application software
- The OS installation is completely disposable
- Unique state that is kept only at a node is bad
- Creating unique state at the node is even worse
- Software bits (packages) are separated from configuration
- Diametrically opposite from "golden image" methods
- Description-based configuration rather than image-based
- The installed OS is compiled from a graph.
- Inheritance of software configurations
- Distribution
- Configuration
- Single-step installation of updated software and OS
- Security patches are pre-applied to the distribution, not post-applied on the node
15. Rocks extends installation to be a straightforward way to manage software on a cluster
- It becomes trivial to ensure software consistency across a cluster
- For our customer base, stability is critical.
16. Rocks Disentangles Software Bits (Distributions) and Configuration
[Diagram: the collection of all possible software packages (AKA the distribution, i.e., RPMs) plus descriptive information to configure a node are combined into a kickstart file for each appliance type: compute node, IO server, web server]
17. Managing Software Distributions
[Same diagram as the previous slide, highlighting the software-package (distribution/RPM) side of the kickstart-file pipeline for compute node, IO server, and web server appliances]
18. Rocks-dist: A Repeatable Process for the Creation of Localized Distributions
- rocks-dist mirror
- Mirrors the Rocks release: Rocks 2.2 release, Rocks 2.2 updates
- rocks-dist dist
- Creates the local distribution: Rocks 2.2 release, Rocks 2.2 updates, local software, contributed software
- This is the same procedure NPACI Rocks uses.
- Organizations can customize Rocks for their site.
- Iterate, extend as needed
19. Description-based Configuration
[Same diagram as slide 16, highlighting the descriptive node-configuration side that, together with the RPMs in the distribution, produces the kickstart file for each appliance (compute node, IO server, web server)]
20. What is a Kickstart file?
Setup/Packages (20%)
Package Configuration (80%)
  cdrom
  zerombr yes
  bootloader --location mbr --useLilo
  skipx
  auth --useshadow --enablemd5
  clearpart --all
  part /boot --size 128
  part swap --size 128
  part / --size 4096
  part /export --size 1 --grow
  lang en_US
  langsupport --default en_US
  keyboard us
  mouse genericps/2
  timezone --utc GMT
  rootpw --iscrypted nrDG4Vb8OjjQ.
  text
  install
  reboot

  %packages
  @Base
  @Emacs
  @GNOME
  %post
  cat > /etc/nsswitch.conf << 'EOF'
  passwd:     files
  shadow:     files
  group:      files
  hosts:      files dns
  bootparams: files
  ethers:     files
  EOF

  cat > /etc/ntp.conf << 'EOF'
  server ntp.ucsd.edu
  server 127.127.1.1
  fudge 127.127.1.1 stratum 10
  authenticate no
  driftfile /etc/ntp/drift
  EOF

  /bin/mkdir -p /etc/ntp
  cat > /etc/ntp/step-tickers << 'EOF'
  ntp.ucsd.edu
  EOF

  /usr/sbin/ntpdate ntp.ucsd.edu
  /sbin/hwclock --systohc
Portable (ASCII), Not Programmable, O(30KB)
21. What are the Issues?
- The kickstart file is ASCII
- There is some structure
- Pre-configuration
- Package list
- Post-configuration
- Not a programmable format
- The most complicated section is post-configuration
- Usually this is handcrafted
- Really want to be able to build sections of the kickstart file from pieces
- Straightforward extension to new software, a different OS
22. Focus on the notion of appliances
- How do you define the configuration of nodes with special attributes/capabilities?
23. Assembly Graph of a Complete Cluster
- Complete appliances (compute, NFS, frontend, desktop, ...)
- Some key shared configuration nodes (slave-node, node, base); a sketch of such a graph, expressed as edges, appears below
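- A minimal sketch of what such an assembly graph could look like, written as graph/edge elements in the spirit of the Rocks kickstart graph; the element and node names here are illustrative assumptions, not the shipped graph:

  <?xml version="1.0" standalone="no"?>
  <graph>
    <!-- shared configuration nodes form an inheritance chain -->
    <edge from="node"       to="base"/>
    <edge from="slave-node" to="node"/>
    <!-- complete appliances hang off the shared nodes -->
    <edge from="compute"  to="slave-node"/>
    <edge from="nfs"      to="slave-node"/>
    <edge from="frontend" to="node"/>
    <!-- shared nodes pull in common modules, e.g. an SSH module -->
    <edge from="base" to="ssh"/>
  </graph>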
24. Describing Appliances
- The purple appliances all include slave-node
- Or are derived from slave-node
- Small differences are readily apparent
- Portal and NFS have an extra NIC; Compute does not
- Compute runs pbs-mom; NFS and Portal do not
- Can compose some appliances
- compute-pvfs IsA compute and IsA pvfs-io (sketched below)
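- In graph terms, that composition is just two edges out of the new appliance node; the names below are again illustrative assumptions:

  <!-- compute-pvfs inherits the full compute configuration
       plus the pvfs-io configuration -->
  <edge from="compute-pvfs" to="compute"/>
  <edge from="compute-pvfs" to="pvfs-io"/>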
25. Architecture Dependencies
- Focus only on the differences between architectures
- Logically, an IA-64 compute node is identical to an IA-32 one
- The architecture type is passed from the top of the graph
- Software bits (x86 vs. IA-64) are managed in the distribution
26. XML Used to Describe Modules

  <?xml version="1.0" standalone="no"?>
  <!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
  <kickstart>
    <description> Enable SSH </description>
    <package> ssh </package>
    <package> ssh-clients </package>
    <package> ssh-server </package>
    <package> ssh-askpass </package>
    <!-- include XFree86 packages for xauth -->
    <package> XFree86 </package>
    <package> XFree86-libs </package>
    <post>
  cat > /etc/ssh/ssh_config << 'EOF'   <!-- default client setup -->
  Host *
      CheckHostIP           no
      ForwardX11            yes
      ForwardAgent          yes
      StrictHostKeyChecking no
      UsePrivilegedPort     no
  ...
- Abstract package names (no versions, no architecture)
- ssh-client
- Not ssh-client-2.1.5.i386.rpm
- Allows an administrator to encapsulate a logical subsystem
- Node-specific configuration is retrieved from our database
- IP address
- Firewall policies
- Remote access policies
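- A module like the one above is pulled into a node's kickstart file via an edge in the assembly graph; one hypothetical hookup (the node name is an assumption):

  <!-- every appliance that inherits base also gets the SSH module -->
  <edge from="base" to="ssh"/>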
27. Space-Time and HTTP
[Space-time diagram, node appliances vs. frontends/servers: the node DHCPs and receives its IP and a kickstart URL; it sends a kickstart request, and the server generates the kickstart file on the fly (kpp and kgen, driven by the database); the node then requests packages, the server serves them, and the node installs the packages, runs the post-configuration, and reboots]
- HTTP
- The kickstart URL (generator) can be anywhere
- The package server can be (a different) anywhere
28. Subsystem Replacement is Easy
- Binaries are in a de facto standard package format (RPM)
- XML module files (components) are very simple
- Graph interconnection (the global assembly instructions) is separate from configuration
- Examples
- Replace PBS with Sun Grid Engine (see the sketch below)
- Upgrade the version of OpenSSH or GCC
- Turn on RSH (not recommended)
- Purchase a commercial compiler (recommended)
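- As a hedged illustration of the PBS-to-Grid-Engine swap (module and appliance names are assumptions, not the actual Rocks node names), only edges in the site's graph XML change; the RPMs live in the distribution and the XML modules stay self-contained:

  <!-- before: compute appliances pull in the PBS execution daemon -->
  <!-- <edge from="compute" to="pbs-mom"/> -->

  <!-- after: point the same appliance at a Grid Engine module instead -->
  <edge from="compute" to="sge-execd"/>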
29. Monitor Built on Ganglia (UCB and SDSC)
30. Installation, Reboot, Performance
[Chart: 32-node re-install timeline with markers for start, finish, reboot, and start of HPL]
- < 15 minutes to reinstall a 32-node subcluster (including rebuilding the Myrinet driver)
- 2.3 minutes for a 128-node reboot
31. Rocks IA64 Port
32. Timeline
- November 2001 - Beta release of Rocks for Itanium
- Did not support Itanium 2
- Only supported the Compaq ProLiant IA64 platform
- September 2002 - Purchased an HP rx2600 workstation
- Came with Debian and SystemImager
- Rocks requires Red Hat and Kickstart
- November 1, 2002 - Received beta Red Hat Advanced Server
- Graciously provided by a friend (and well-known Linux kernel developer) inside HP Labs
- November 14, 2002 - Port of Rocks 2.3 to IA64 completed
- x86 version based on Red Hat 7.3
- IA64 version based on Advanced Server 2.1 (Red Hat 7.2)
- November 16, 2002 - Flew to Baltimore for SC2002
- Demoed Rocks 2.3 running on HP IA64 gear on the show floor
33. Thirteen-Day Porting Effort from IA32 to Itanium 2
- Almost true?
- We did a partial port the previous year for Itanium 1
- But that was 4 Rocks releases ago
- Challenges
- Obtaining HP Itanium 2 hardware
- Obtaining the Red Hat OS
- Dealing with ELILO and EFI
34. Boot Loaders
- LILO - x86
- LInux LOader (doesn't support IA64)
- Uses the Master Boot Record (MBR) of the disk to bootstrap
- ELILO - IA64
- Enhanced version of LILO for IA64
- Uses EFI to bootstrap
- Similar to the boot PROM on other workstations
- Sun - OpenBoot
- Alpha - SRM
- From a Rocks perspective, this is the real distinction between x86 and IA64.
35. Software Stack
- Rocks software recompiled on IA64
- Everything was already 64-bit clean
- And most of our source code is Python (interpreted)
- Even our patches to Red Hat's installer simply recompiled
- Tracked down, or rebuilt, third-party software
- Rebuilt several x86 RPMs from source to produce IA64 packages
- Some packages are missing (e.g., Sun Grid Engine)
- Others are slightly changed (e.g., OpenPBS does not use Maui for IA64)
- The kickstart graph was updated for these changes (sketched below)
- Annotated some edges as x86-only (arch="i386")
- Added some new IA64-only edges (arch="ia64")
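- A hedged sketch of what such arch-conditional edges might look like in the graph XML (module and appliance names are illustrative assumptions):

  <!-- pulled in only on x86 appliances, e.g. the Maui scheduler -->
  <edge from="frontend" to="maui" arch="i386"/>

  <!-- pulled in only on IA64 appliances, e.g. ELILO boot support -->
  <edge from="compute" to="elilo" arch="ia64"/>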
36. Future Work
- Release Rocks for IA64
- Resolve Advanced Server licensing issues
- Preferred release media is DVD (currently in the lab)
- Testing
- Need to test Rocks IA64 at scale
- Have built a 256-node x86 cluster
- PXE / EFI
- Currently EFI does not support PXE
- Need to track EFI (and ELILO) development
- Gelato
- Have expressed interest in membership to the Gelato leadership
- Rocks is already cited on the Gelato web portal
37. What's still missing?
- Improved monitoring
- Monitoring grids of clusters
- Personal cluster monitor
- Straightforward integration with Grid (software)
- Will use NMI (NSF Middleware Initiative) software as a basis (a Grid endpoint should be no harder to set up than a cluster)
- Any sort of real IO story
- PVFS is a toy (no reliability)
- NFS doesn't scale properly (no stability)
- MPI-IO is only good for people willing to rework large sections of code (most users want read/write/open/close)
- Real parallel job control
- MPD looks promising
38. Summary
- 100s of clusters have been built with Rocks on a wide variety of physical hardware
- Installation/customization is done in a straightforward, programmatic way
- Scaling is excellent
- HTTP is used as the transport for reliability/performance
- The configuration server does not have to be in the cluster
- The package server does not have to be in the cluster
- (Sounds grid-like)
- Already on the Itanium 2 curve