Title: Using Description-Based Methods for Provisioning Clusters
1. Using Description-Based Methods for Provisioning Clusters
- Philip M. Papadopoulos
- Program Director, Grids and Clusters
- San Diego Supercomputer Center (SDSC)
- California Institute for Telecommunications and Information Technology (Calit2)
- University of California, San Diego
2. Computing Clusters
- Prefetch: installation of a raw Rocks cluster
- How we spent a recent weekend
- Why should you care about software configuration and large-scale provisioning?
- Description-based configuration
- Taking the administrator out of cluster administration
- XML-based assembly instructions
- What's next
3. Prefetch
- Install a Rocks cluster while this talk progresses
- Will check in every now and then to see how it is going
- Saw this equipment for the first time about ½ hour ago
4. A Story for the Top500
- Attempt a TOP500 run on two fused 128-node PIII (1 GHz, 1 GB memory) clusters
- 100 Mbit Ethernet, Gigabit to the frontend
- Myrinet 2000, 128-port switch on each cluster
- Questions
- What LINPACK performance could we get?
- Would Rocks scale to 256 nodes?
- Could we set up, tear down, and run benchmarks in the allotted 48 hours?
- Would we get any sleep?
5. Setup
[Diagram: a new frontend connected to two 128-node clusters (120 nodes on Myrinet each) via 8 Myrinet cross connects]
- Fri, 5:30 pm. Started. Built new frontend. Physically rewired Myrinet, added an Ethernet switch.
- Fri, midnight. Solved some Ethernet issues. Completely reinstalled all nodes.
- Sat, 12:30 am. Went to sleep.
- Sat, 6:30 am. Woke up. Submitted first LINPACK runs (225 nodes).
6. Mid-Experiment
- Sat, 10:30 am. Fixed a failed disk and a bad SDRAM. Started runs on 240 nodes. Went shopping at 11:30 am.
- 4:30 pm. Examined output. Started 6 hours of runs.
- 11:30 pm. Examined more output. Submitted final large runs. Went to sleep.
- 285 GFlops
- 59.5% of peak
- Over 22 hours of continuous computing
240 dual PIII (1 GHz, 1 GB), Myrinet
7. Final Clean-up
- Sun, 9:00 am. Submitted 256-node Ethernet HPL.
- Sun, 11:00 am. Added GigE to the frontend. Started a complete reinstall of all 256 nodes.
- 40 minutes for a complete reinstall (return to original system state)
- Did find some (fixable) issues at this scale
- Too many open database connections (fixing for next release)
- DHCP client not resilient enough. Server was fine.
- Sun, 1:00 pm. Restored wiring, rebooted original frontends. Started reinstall tests. Some debugging of installation issues uncovered at 256 nodes.
- Sun, 5:00 pm. Went home. Drank beer.
- #233 on the November 2002 Top500 list
- Built from stock Rocks
8. Key Software Components
- HPL, ATLAS (Univ. of Tennessee)
- The truckload of open source software that we all build on
- Red Hat Linux distribution and installer
- de facto standard
- NPACI Rocks
- Complete cluster-aware toolkit that extends a Red Hat Linux distribution
- (Inherent belief that simple and fast can beat out complex and fast)
9. Why Should You Care about Cluster Configuration/Administration?
- "We can barely make clusters work." - Gordon Bell at CCGCS02 (Lyon, France)
- "No real breakthroughs in system administration in the last 20 years." - P. Beckman at CCGCS02
- System administrators love to continuously twiddle knobs
- Similar to a mechanic continually adjusting the air/fuel mixture on your car
- Or worse, randomly unplugging/plugging spark plug wires
- Turnkey/automated systems remove the system admin from the equation
- Sysadmins don't like this.
- This is good
- Think of a cluster as an appliance, not a sysadmin playground
10. Clusters aren't homogeneous
- Real clusters are more complex than a pile of identical computing nodes
- Hardware divergence as the cluster changes over time
- Logical heterogeneity of specialized servers
- IO servers, job submission, specialized configs for apps, viz nodes
- The garden-variety cluster owner shouldn't have to fight the provisioning issues.
- Get them to the point where they are fighting the hard application parallelization issues
- They should be able to easily follow software/hardware trends
- Moderate-sized (up to 128 nodes) clusters are the standard grid endpoints
- A non-uber-administrator handles two 128-node
Rocks clusters at SIO (a 260:1 system-to-admin ratio)
11. NPACI Rocks Toolkit - rocks.npaci.edu
- Techniques and software for easy installation, management, monitoring, and update of Linux clusters
- A complete cluster-aware distribution and configuration system.
- Installation
- Bootable CD + floppy which contain all the packages and site configuration info to bring up an entire cluster
- Management and update philosophies
- Trivial to completely reinstall any (or all) nodes.
- Nodes are 100% automatically configured
- Red Hat Kickstart to define the software/configuration of nodes
- Software is packaged in a query-enabled format
- Never try to figure out if node software is consistent
- If you ever ask yourself this question, reinstall the node
- Extensible, programmable infrastructure for all node types that make up a real cluster.
12. Tools Integrated
- Standard cluster tools
- MPICH, PVM, PBS, Maui (SSH, SSL -> Red Hat)
- Rocks add-ons
- Myrinet support
- GM device build (RPM), RPC-based port reservation (usher-patron)
- mpi-launch (understands the port reservation system)
- rocks-dist (distribution workhorse)
- XML (programmable) Kickstart
- eKV (console redirect to Ethernet during install)
- Automated MySQL database setup
- Ganglia monitoring (U.C. Berkeley and NPACI)
- Stupid pet administration scripts
- Other tools
- PVFS
- ATLAS BLAS, High Performance Linpack
13. Support for Myrinet
- The Myrinet device driver must be versioned to the exact kernel version (e.g., SMP, options) running on a node
- Source is compiled at reinstallation on every (Myrinet) node (adds 2 minutes to installation) (a source RPM, by the way)
- The device module is then installed (insmod)
- GM_mapper is run (adds the node to the network)
- Myrinet ports are limited and must be identified with a particular rank in a parallel program
- RPC-based reservation system for Myrinet ports
- Client requests port reservations from the desired nodes
- Rank mapping file (gm.conf) created on-the-fly
- No centralized service needed to track port allocation
- mpi-launch hides all the details of this
- HPL (LINPACK) comes pre-packaged for Myrinet
- Build your Rocks cluster, see where it sits on the Top500
14. Key Ideas
- No difference between OS and application software
- The OS installation is completely disposable
- Unique state that is kept only at a node is bad
- Creating unique state at the node is even worse
- Software bits (packages) are separated from configuration
- Diametrically opposite from "golden image" methods
- Description-based configuration rather than image-based
- The installed OS is compiled from a graph.
- Inheritance of software configurations
- Distribution
- Configuration
- Single-step installation of updated software and OS
- Security patches are pre-applied to the distribution, not post-applied on the node
15. Rocks extends installation to be a straightforward way to manage software on a cluster
- It becomes trivial to ensure software consistency across a cluster
- For our customer base, stability is critical.
16. Rocks Disentangles Software Bits (Distributions) and Configuration
[Diagram: the collection of all possible software packages (AKA the distribution, i.e., RPMs) plus descriptive information to configure a node are combined into a kickstart file for each appliance type: compute node, IO server, web server]
17. Managing Software Distributions
[Same diagram as the previous slide, highlighting the software-package (distribution/RPM) side of the kickstart-file pipeline for compute node, IO server, and web server appliances]
18. Rocks-dist: A Repeatable Process for the Creation of Localized Distributions
- rocks-dist mirror
- Mirrors the Rocks release: Rocks 2.2 release, Rocks 2.2 updates
- rocks-dist dist
- Creates the local distribution: Rocks 2.2 release, Rocks 2.2 updates, local software, contributed software
- This is the same procedure NPACI Rocks uses.
- Organizations can customize Rocks for their site.
- Iterate, extend as needed
19. Description-based Configuration
[Same diagram as slide 16, highlighting the descriptive node-configuration side that, together with the RPMs in the distribution, produces the kickstart file for each appliance (compute node, IO server, web server)]
20. What is a Kickstart file?
Setup/Packages (20%)
Package Configuration (80%)
  cdrom
  zerombr yes
  bootloader --location mbr --useLilo
  skipx
  auth --useshadow --enablemd5
  clearpart --all
  part /boot --size 128
  part swap --size 128
  part / --size 4096
  part /export --size 1 --grow
  lang en_US
  langsupport --default en_US
  keyboard us
  mouse genericps/2
  timezone --utc GMT
  rootpw --iscrypted nrDG4Vb8OjjQ.
  text
  install
  reboot

  %packages
  @Base
  @Emacs
  @GNOME
  %post
  cat > /etc/nsswitch.conf << 'EOF'
  passwd:     files
  shadow:     files
  group:      files
  hosts:      files dns
  bootparams: files
  ethers:     files
  EOF

  cat > /etc/ntp.conf << 'EOF'
  server ntp.ucsd.edu
  server 127.127.1.1
  fudge 127.127.1.1 stratum 10
  authenticate no
  driftfile /etc/ntp/drift
  EOF

  /bin/mkdir -p /etc/ntp
  cat > /etc/ntp/step-tickers << 'EOF'
  ntp.ucsd.edu
  EOF

  /usr/sbin/ntpdate ntp.ucsd.edu
  /sbin/hwclock --systohc
Portable (ASCII), Not Programmable, O(30KB)
21. What are the Issues?
- The kickstart file is ASCII
- There is some structure
- Pre-configuration
- Package list
- Post-configuration
- Not a programmable format
- The most complicated section is post-configuration
- Usually this is handcrafted
- Really want to be able to build sections of the kickstart file from pieces
- Straightforward extension to new software, a different OS
22. Focus on the notion of appliances
- How do you define the configuration of nodes with special attributes/capabilities?
23. Assembly Graph of a Complete Cluster
- Complete appliances (compute, NFS, frontend, desktop, ...)
- Some key shared configuration nodes (slave-node, node, base); a sketch of such a graph, expressed as edges, appears below
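- A minimal sketch of what such an assembly graph could look like, written as graph/edge elements in the spirit of the Rocks kickstart graph; the element and node names here are illustrative assumptions, not the shipped graph:

  <?xml version="1.0" standalone="no"?>
  <graph>
    <!-- shared configuration nodes form an inheritance chain -->
    <edge from="node"       to="base"/>
    <edge from="slave-node" to="node"/>
    <!-- complete appliances hang off the shared nodes -->
    <edge from="compute"  to="slave-node"/>
    <edge from="nfs"      to="slave-node"/>
    <edge from="frontend" to="node"/>
    <!-- shared nodes pull in common modules, e.g. an SSH module -->
    <edge from="base" to="ssh"/>
  </graph>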
24. Describing Appliances
- The purple appliances all include slave-node
- Or are derived from slave-node
- Small differences are readily apparent
- Portal and NFS have an extra NIC; Compute does not
- Compute runs pbs-mom; NFS and Portal do not
- Can compose some appliances
- compute-pvfs IsA compute and IsA pvfs-io (sketched below)
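- In graph terms, that composition is just two edges out of the new appliance node; the names below are again illustrative assumptions:

  <!-- compute-pvfs inherits the full compute configuration
       plus the pvfs-io configuration -->
  <edge from="compute-pvfs" to="compute"/>
  <edge from="compute-pvfs" to="pvfs-io"/>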
25. Architecture Dependencies
- Focus only on the differences between architectures
- Logically, an IA-64 compute node is identical to an IA-32 one
- The architecture type is passed from the top of the graph
- Software bits (x86 vs. IA-64) are managed in the distribution
26. XML Used to Describe Modules

  <?xml version="1.0" standalone="no"?>
  <!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
  <kickstart>
    <description> Enable SSH </description>
    <package> ssh </package>
    <package> ssh-clients </package>
    <package> ssh-server </package>
    <package> ssh-askpass </package>
    <!-- include XFree86 packages for xauth -->
    <package> XFree86 </package>
    <package> XFree86-libs </package>
    <post>
  cat > /etc/ssh/ssh_config << 'EOF'   <!-- default client setup -->
  Host *
      CheckHostIP           no
      ForwardX11            yes
      ForwardAgent          yes
      StrictHostKeyChecking no
      UsePrivilegedPort     no
  ...
- Abstract package names (no versions, no architecture)
- ssh-client
- Not ssh-client-2.1.5.i386.rpm
- Allows an administrator to encapsulate a logical subsystem
- Node-specific configuration is retrieved from our database
- IP address
- Firewall policies
- Remote access policies
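- A module like the one above is pulled into a node's kickstart file via an edge in the assembly graph; one hypothetical hookup (the node name is an assumption):

  <!-- every appliance that inherits base also gets the SSH module -->
  <edge from="base" to="ssh"/>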
27. Space-Time and HTTP
[Space-time diagram, node appliances vs. frontends/servers: the node DHCPs and receives its IP and a kickstart URL; it sends a kickstart request, and the server generates the kickstart file on the fly (kpp and kgen, driven by the database); the node then requests packages, the server serves them, and the node installs the packages, runs the post-configuration, and reboots]
- HTTP
- The kickstart URL (generator) can be anywhere
- The package server can be (a different) anywhere
28. Subsystem Replacement is Easy
- Binaries are in a de facto standard package format (RPM)
- XML module files (components) are very simple
- Graph interconnection (the global assembly instructions) is separate from configuration
- Examples
- Replace PBS with Sun Grid Engine (see the sketch below)
- Upgrade the version of OpenSSH or GCC
- Turn on RSH (not recommended)
- Purchase a commercial compiler (recommended)
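- As a hedged illustration of the PBS-to-Grid-Engine swap (module and appliance names are assumptions, not the actual Rocks node names), only edges in the site's graph XML change; the RPMs live in the distribution and the XML modules stay self-contained:

  <!-- before: compute appliances pull in the PBS execution daemon -->
  <!-- <edge from="compute" to="pbs-mom"/> -->

  <!-- after: point the same appliance at a Grid Engine module instead -->
  <edge from="compute" to="sge-execd"/>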
29. Monitor Built on Ganglia (UCB and SDSC)
30. Installation, Reboot, Performance
[Chart: 32-node re-install timeline with markers for start, finish, reboot, and start of HPL]
- < 15 minutes to reinstall a 32-node subcluster (including rebuilding the Myrinet driver)
- 2.3 minutes for a 128-node reboot
31. Rocks IA64 Port
32. Timeline
- November 2001 - Beta release of Rocks for Itanium
- Did not support Itanium 2
- Only supported the Compaq ProLiant IA64 platform
- September 2002 - Purchased an HP rx2600 workstation
- Came with Debian and SystemImager
- Rocks requires Red Hat and Kickstart
- November 1, 2002 - Received beta Red Hat Advanced Server
- Graciously provided by a friend (and well-known Linux kernel developer) inside HP Labs
- November 14, 2002 - Port of Rocks 2.3 to IA64 completed
- x86 version based on Red Hat 7.3
- IA64 version based on Advanced Server 2.1 (Red Hat 7.2)
- November 16, 2002 - Flew to Baltimore for SC2002
- Demoed Rocks 2.3 running on HP IA64 gear on the show floor
33. Thirteen-Day Porting Effort from IA32 to Itanium 2
- Almost true?
- We did a partial port the previous year for Itanium 1
- But that was 4 Rocks releases ago
- Challenges
- Obtaining HP Itanium 2 hardware
- Obtaining the Red Hat OS
- Dealing with ELILO and EFI
34. Boot Loaders
- LILO - x86
- LInux LOader (doesn't support IA64)
- Uses the Master Boot Record (MBR) of the disk to bootstrap
- ELILO - IA64
- Enhanced version of LILO for IA64
- Uses EFI to bootstrap
- Similar to the boot PROM on other workstations
- Sun - OpenBoot
- Alpha - SRM
- From a Rocks perspective, this is the real distinction between x86 and IA64.
35. Software Stack
- Rocks software recompiled on IA64
- Everything was already 64-bit clean
- And most of our source code is Python (interpreted)
- Even our patches to Red Hat's installer simply recompiled
- Tracked down, or rebuilt, third-party software
- Rebuilt several x86 RPMs from source to produce IA64 packages
- Some packages are missing (e.g., Sun Grid Engine)
- Others are slightly changed (e.g., OpenPBS does not use Maui for IA64)
- The kickstart graph was updated for these changes (sketched below)
- Annotated some edges as x86-only (arch="i386")
- Added some new IA64-only edges (arch="ia64")
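- A hedged sketch of what such arch-conditional edges might look like in the graph XML (module and appliance names are illustrative assumptions):

  <!-- pulled in only on x86 appliances, e.g. the Maui scheduler -->
  <edge from="frontend" to="maui" arch="i386"/>

  <!-- pulled in only on IA64 appliances, e.g. ELILO boot support -->
  <edge from="compute" to="elilo" arch="ia64"/>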
36. Future Work
- Release Rocks for IA64
- Resolve Advanced Server licensing issues
- Preferred release media is DVD (currently in the lab)
- Testing
- Need to test Rocks IA64 at scale
- Have built a 256-node x86 cluster
- PXE / EFI
- Currently EFI does not support PXE
- Need to track EFI (and ELILO) development
- Gelato
- Have expressed interest in membership to the Gelato leadership
- Rocks is already cited on the Gelato web portal
37. What's still missing?
- Improved monitoring
- Monitoring grids of clusters
- Personal cluster monitor
- Straightforward integration with Grid (software)
- Will use NMI (NSF Middleware Initiative) software as a basis (a Grid endpoint should be no harder to set up than a cluster)
- Any sort of real IO story
- PVFS is a toy (no reliability)
- NFS doesn't scale properly (no stability)
- MPI-IO is only good for people willing to rework large sections of code (most users want read/write/open/close)
- Real parallel job control
- MPD looks promising
38. Summary
- 100s of clusters have been built with Rocks on a wide variety of physical hardware
- Installation/customization is done in a straightforward, programmatic way
- Scaling is excellent
- HTTP is used as the transport for reliability/performance
- The configuration server does not have to be in the cluster
- The package server does not have to be in the cluster
- (Sounds grid-like)
- Already on the Itanium 2 curve