1
Using Description-Based Methods for Provisioning
Clusters
  • Philip M. Papadopoulos
  • Program Director, Grids and Clusters
  • San Diego Supercomputer Center (SDSC)
  • California Institute for Telecommunications and
    Information Technology (CalIT2)
  • University of California, San Diego

2
Computing Clusters
  • Prefetch Installation of a Raw Rocks Cluster
  • How we spent a recent Weekend
  • Why you should care about software configuration
    and large-scale provisioning
  • Description-based configuration
  • Taking the administrator out of cluster
    administration
  • XML-based assembly instructions
  • What's next

3
Prefetch
  • Install a Rocks cluster while this talk
    progresses
  • Will check in every now and then to see how it is
    going
  • Saw this equipment for the first time about ½
    hour ago

4
A Story for the Top500
  • Attempt a Top500 run on two fused 128-node PIII
    (1 GHz, 1 GB mem) clusters
  • 100 Mbit Ethernet, Gigabit to frontend
  • Myrinet 2000; a 128-port switch on each cluster
  • Questions
  • What LINPACK performance could we get?
  • Would Rocks scale to 256 nodes?
  • Could we set up/teardown and run benchmarks in
    the allotted 48 hours?
  • Would we get any sleep?

5
Setup
[Diagram: a new frontend, 8 Myrinet cross-connects, and two banks of 128 nodes (120 on Myrinet each)]
  • Fri: Started 5:30pm. Built new frontend. Physically
    rewired Myrinet, added Ethernet switch.
  • Fri midnight: Solved some Ethernet issues.
    Completely reinstalled all nodes.
  • Sat 12:30a: Went to sleep.
  • Sat 6:30a: Woke up. Submitted first LINPACK runs
    (225 nodes)

6
Mid-Experiment
  • Sat 10:30a: Fixed a failed disk and a bad SDRAM.
    Started runs on 240 nodes. Went shopping at 11:30a.
  • 4:30p: Examined output. Started 6 hrs of runs.
  • 11:30p: Examined more output. Submitted final
    large runs. Went to sleep.
  • 285 GFlops
  • 59.5% of peak
  • Over 22 hours of continuous computing

240 dual PIII (1 GHz, 1 GB) nodes on Myrinet
7
Final Clean-up
  • Sun 9:00a: Submitted 256-node Ethernet HPL
  • Sun 11:00a: Added GigE to frontend. Started
    complete reinstall of all 256 nodes.
  • 40 Minutes for Complete reinstall (return to
    original system state)
  • Did find some (fixable) issues at this scale
  • Too Many open database connections (fixing for
    next release)
  • DHCP client not resilient enough. Server was
    fine.
  • Sun 1:00p: Restored wiring, rebooted original
    frontends. Started reinstall tests. Some debugging
    of installation issues uncovered at 256 nodes.
  • Sun 5:00p: Went home. Drank beer.
  • #233 on the November 2002 Top500 list
  • Built from stock Rocks

8
Key Software Components
  • HPL + ATLAS (Univ. of Tennessee)
  • The truckload of open source software that we all
    build on
  • Red Hat Linux distribution and installer
  • de facto standard
  • NPACI Rocks
  • Complete cluster-aware toolkit that extends a
    Red Hat Linux distribution
  • (Inherent Belief that simple and fast can beat
    out complex and fast)

9
Why Should You Care about Cluster
Configuration/Administration?
  • "We can barely make clusters work." - Gordon Bell
    @ CCGCS02 (Lyon, France)
  • "No real breakthroughs in system administration
    in the last 20 years." - P. Beckman @ CCGCS02
  • System Administrators love to continuously
    twiddle knobs
  • Similar to a mechanic continually adjusting
    air/fuel mixture on your car
  • Or worse Randomly unplugging/plugging spark plug
    wires
  • Turnkey/Automated systems remove the system admin
    from the equation
  • The sysadmin doesn't like this.
  • This is good
  • Think of a cluster as an appliance. Not a
    sysadmin playground

10
Clusters aren't homogeneous
  • Real clusters are more complex than a pile of
    identical computing nodes
  • Hardware divergence as cluster changes over time
  • Logical heterogeneity of specialized servers
  • IO servers, Job Submission, Specialized configs
    for apps, Viz nodes
  • Garden-variety cluster owners shouldn't have to
    fight provisioning issues.
  • Get them to the point where they are fighting the
    hard application parallelization issues
  • They should be able to easily follow
    software/hardware trends
  • Moderate-sized (up to 128 nodes) clusters are the
    standard grid endpoints
  • A non-uber-administrator handles two 128-node
    Rocks clusters at SIO (roughly a 260:1
    system-to-admin ratio)

11
NPACI Rocks Toolkit rocks.npaci.edu
  • Techniques and software for easy installation,
    management, monitoring and update of Linux
    clusters
  • A complete cluster-aware distribution and
    configuration system.
  • Installation
  • Bootable CD + floppy which contains all the
    packages and site configuration info to bring up
    an entire cluster
  • Management and update philosophies
  • Trivial to completely reinstall any (all) nodes
  • Nodes are 100% automatically configured
  • Red Hat Kickstart to define software/configuration
    of nodes
  • Software is packaged in a query-enabled format
    (RPM); see the consistency sketch after this list
  • Never try to figure out if node software is
    consistent
  • If you ever ask yourself this question, reinstall
    the node
  • Extensible, programmable infrastructure for all
    node types that make up a real cluster.
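Because the query-enabled format is RPM, consistency questions can in principle be answered with ordinary package queries; the Rocks answer, though, is to skip the forensics and reinstall. A minimal sketch of such a check, assuming standard rpm/ssh tooling and a hypothetical node name compute-0-0:

    # Compare a node's package list against the frontend (illustrative only)
    rpm -qa | sort > /tmp/frontend.pkgs
    ssh compute-0-0 'rpm -qa | sort' > /tmp/node.pkgs
    diff /tmp/frontend.pkgs /tmp/node.pkgs || echo "drift detected: reinstall the node"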

12
Tools Integrated
  • Standard cluster tools
  • MPICH, PVM, PBS, Maui (SSH, SSL -> Red Hat)
  • Rocks add-ons
  • Myrinet support
  • GM device build (RPM), RPC-based port-reservation
    (usher-patron)
  • Mpi-launch (understands port reservation system)
  • Rocks-dist, the distribution workhorse
  • XML (programmable) Kickstart
  • eKV (console redirect to ethernet during install)
  • Automated MySQL database setup
  • Ganglia Monitoring (U.C. Berkeley and NPACI)
  • Stupid pet administration scripts
  • Other tools
  • PVFS
  • ATLAS BLAS, High Performance Linpack

13
Support for Myrinet
  • Myrinet device driver must be versioned to the
    exact kernel version (e.g., SMP, options) running
    on a node
  • Source is compiled at reinstallation on every
    (Myrinet) node, adding about 2 minutes to the
    installation (it is a source RPM, by the way)
  • Device module is then installed (insmod)
  • GM_mapper is run to add the node to the network
    (see the sketch after this list)
  • Myrinet ports are limited and must be identified
    with a particular rank in a parallel program
  • RPC-based reservation system for Myrinet ports
  • Client requests port reservation from desired
    nodes
  • Rank mapping file (gm.conf) created on-the-fly
  • No centralized service needed to track port
    allocation
  • MPI-launch hides all the details of this
  • HPL (LINPACK) comes pre-packaged for Myrinet
  • Build your Rocks cluster, see where it sits on
    the Top500
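A rough shell sketch of the per-node driver step described above. The package name, RPM output path, and mapper invocation are illustrative assumptions; exact names vary with the GM release and the Rocks version.

    # Rebuild the GM source RPM against the kernel this node is actually running
    rpm --rebuild gm-*.src.rpm
    rpm -ivh /usr/src/redhat/RPMS/i686/gm-*.rpm

    # Load the freshly built module, then map the node into the Myrinet fabric
    /sbin/insmod gm
    gm_mapper    # name as on the slide; the stock GM tool may be spelled differently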

14
Key Ideas
  • No difference between OS and application software
  • OS installation is completely disposable
  • Unique state that is kept only at a node is bad
  • Creating unique state at the node is even worse
  • Software bits (packages) are separated from
    configuration
  • Diametrically opposite from "golden image"
    methods (a caricature of the contrast follows
    this list)
  • Description-based configuration rather than
    image-based
  • Installed OS is compiled from a graph.
  • Inheritance of software configurations
  • Distribution
  • Configuration
  • Single-step installation of updated software + OS
  • Security patches are pre-applied to the
    distribution, not post-applied on the node
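A caricature of that contrast in shell. The hostname, device names, and the assumption that the node network-installs on its next boot are illustrative, not taken from the Rocks tooling.

    # Image-based (caricature): clone a hand-maintained "golden" disk image
    dd if=/dev/hda | ssh compute-0-0 'dd of=/dev/hda'

    # Description-based (Rocks): just reinstall; the node's kickstart file is
    # regenerated from the graph plus per-node data, so no image is maintained
    ssh compute-0-0 '/sbin/shutdown -r now'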

15
Rocks extends installation to be a
straightforward way to manage software on a
cluster
  • It becomes trivial to ensure software consistency
    across a cluster
  • For our customer base, stability is critical.

16
Rocks Disentangles Software Bits (distributions)
and Configuration
[Diagram: the collection of all possible software packages (RPMs, AKA the distribution) is combined with descriptive information to configure a node, producing a kickstart file for each appliance: compute node, IO server, web server]
17
Managing Software Distributions
[Same diagram, with the distribution side highlighted: the collection of all possible software packages (RPMs)]
18
Rocks-dist: a repeatable process for creating
localized distributions
  • rocks-dist mirror pulls from the Rocks mirror:
    the Rocks 2.2 release plus the Rocks 2.2 updates
  • rocks-dist dist creates the distribution from the
    Rocks 2.2 release, Rocks 2.2 updates, local
    software, and contributed software
  • This is the same procedure NPACI Rocks uses.
  • Organizations can customize Rocks for their site.
  • Iterate and extend as needed (see the sketch after
    this list)
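The two subcommands above are the entire site workflow. A minimal sketch, assuming the commands run on the frontend that serves installs; only the subcommand names come from the slide, the comments are illustrative:

    # Mirror the upstream release and its updates onto the local frontend
    rocks-dist mirror

    # Fold local and contributed RPMs into the distribution that nodes
    # install from; rerun both steps whenever the software changes
    rocks-dist dist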

19
Description-based Configuration
[Same diagram, with the configuration side highlighted: the descriptive information used to build each appliance's kickstart file]
20
What is a Kickstart file?
Setup and packages (~20% of the file):

cdrom
zerombr yes
bootloader --location mbr --useLilo
skipx
auth --useshadow --enablemd5
clearpart --all
part /boot --size 128
part swap --size 128
part / --size 4096
part /export --size 1 --grow
lang en_US
langsupport --default en_US
keyboard us
mouse genericps/2
timezone --utc GMT
rootpw --iscrypted nrDG4Vb8OjjQ.
text
install
reboot

%packages
@Base
@Emacs
@GNOME

Package configuration (~80% of the file):

%post
cat > /etc/nsswitch.conf << 'EOF'
passwd:     files
shadow:     files
group:      files
hosts:      files dns
bootparams: files
ethers:     files
EOF

cat > /etc/ntp.conf << 'EOF'
server ntp.ucsd.edu
server 127.127.1.1
fudge 127.127.1.1 stratum 10
authenticate no
driftfile /etc/ntp/drift
EOF

/bin/mkdir -p /etc/ntp
cat > /etc/ntp/step-tickers << 'EOF'
ntp.ucsd.edu
EOF
/usr/sbin/ntpdate ntp.ucsd.edu
/sbin/hwclock --systohc
Portable (ASCII), Not Programmable, O(30KB)
21
What are the Issues?
  • Kickstart file is ASCII
  • There is some structure
  • Pre-configuration
  • Package list
  • Post-configuration
  • Not a programmable format
  • Most complicated section is post-configuration
  • Usually this is handcrafted
  • Really Want to be able to build sections of the
    kickstart file from pieces
  • Straightforward extension to new software,
    different OS

22
Focus on the notion of appliances
  • How do you define the configuration of nodes with
    special attributes/capabilities?

23
Assembly Graph of a Complete Cluster
- Complete appliances (compute, NFS, frontend,
desktop, ...)
- Some key shared configuration nodes
(slave-node, node, base)
24
Describing Appliances
  • Purple appliances all include slave-node
  • Or derived from slave-node
  • Small differences are readily apparent
  • Portal and NFS include extra-nic; Compute does not
  • Compute runs pbs-mom; NFS and Portal do not
  • Can compose some appliances
  • Compute-pvfs IsA compute and IsA pvfs-io

25
Architecture Dependencies
  • Focus only on the differences in architectures
  • Logically, an IA-64 compute node is identical to
    an IA-32 one
  • The architecture type is passed from the top of
    the graph
  • Software bits (x86 vs. IA64) are managed in the
    distribution

26
XML Used to Describe Modules
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@"
  [<!ENTITY ssh "openssh">]>
<kickstart>
  <description> Enable SSH </description>

  <package> &ssh; </package>
  <package> &ssh;-clients </package>
  <package> &ssh;-server </package>
  <package> &ssh;-askpass </package>

  <!-- include XFree86 packages for xauth -->
  <package> XFree86 </package>
  <package> XFree86-libs </package>

  <post>
<!-- default client setup -->
cat > /etc/ssh/ssh_config << 'EOF'
Host *
    CheckHostIP           no
    ForwardX11            yes
    ForwardAgent          yes
    StrictHostKeyChecking no
    UsePrivilegedPort     no
  • Abstract package names carry no version or
    architecture
  • ssh-client
  • not ssh-client-2.1.5.i386.rpm
  • Allow an administrator to encapsulate a logical
    subsystem
  • Node-specific configuration is retrieved from our
    database
  • IP Address
  • Firewall policies
  • Remote access policies

27
Space-Time and HTTP
[Sequence diagram, node appliances vs. frontends/servers: DHCP returns an IP and a kickstart URL; the node sends a kickstart request; kpp and kgen generate the kickstart file from the database; the node requests packages and the server serves them; the node installs the packages, runs post-configuration, and reboots]
  • Everything travels over HTTP (see the sketch after
    this list)
  • The kickstart URL (generator) can be anywhere
  • The package server can be (a different) anywhere

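Because the whole exchange is plain HTTP, the node side can be mimicked with ordinary tools. The URLs below are purely illustrative; the actual kickstart CGI path and package layout depend on the Rocks release.

    # Ask the (possibly remote) kickstart generator for this node's description
    wget -O /tmp/ks.cfg 'http://frontend.example.org/install/kickstart.cgi'

    # Packages are then pulled from a package server that may live elsewhere
    wget 'http://packages.example.org/install/RedHat/RPMS/openssh-clients.i386.rpm'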
28
Subsystem Replacement is Easy
  • Binaries are in de facto standard package format
    (RPM)
  • XML module files (components) are very simple
  • Graph interconnection (global assembly
    instructions) is separate from configuration
  • Examples (a sketch of the general workflow follows
    this list)
  • Replace PBS with Sun Grid Engine
  • Upgrade version of OpenSSH or GCC
  • Turn on RSH (not recommended)
  • Purchase commercial compiler (recommended)
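Each of these swaps follows the same pattern: drop the replacement RPMs into the site-local part of the distribution, adjust the XML module that names the old packages, rebuild, and reinstall. A rough sketch; the contrib directory is an assumption, not quoted from the Rocks documentation.

    # Add the replacement packages to the site-local part of the distribution
    cp sge-*.rpm /home/install/contrib/RPMS/   # directory is illustrative
    rocks-dist dist                            # rebuild the local distribution

    # Edit the XML module that pulls in the old subsystem (e.g., swap the PBS
    # packages for SGE), then reinstall the affected appliances; no node is
    # ever patched by hand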

29
Monitor Built on Ganglia (UCB and SDSC)
30
Installation, Reboot, Performance
32 Node Re-Install
  • < 15 minutes to reinstall a 32-node subcluster
    (including rebuilding the Myrinet driver)
  • 2.3 min for a 128-node reboot

[Chart: 32-node reinstall timeline, marking start, finish, reboot, and the start of HPL]
31
Rocks IA64 port
32
Timeline
  • November 2001 - Beta of Rocks for Itanium release
  • Did not support Itanium2
  • Only supported Compaq Proliant IA64 platform
  • September 2002 - Purchased HP rx2600 workstation
  • Came with Debian and SystemImager
  • Rocks requires Red Hat and Kickstart
  • November 1, 2002 - Received Beta Red Hat Advanced
    Server
  • Graciously provided by a friend (and well known
    Linux kernel developer) inside of HP Labs
  • November 14, 2002 - Port of Rocks 2.3 to IA64
    completed
  • x86 version based on Red Hat 7.3
  • IA64 version based on Advanced Server 2.1 (Red
    Hat 7.2)
  • November 16, 2002 - Flew to Baltimore for SC2002
  • Demoed Rocks 2.3 running on HP IA64 gear on the
    show floor

33
Thirteen-day porting effort from IA32 to Itanium 2
  • Almost true?
  • We did a partial port the previous year for
    Itanium 1
  • But, this was 4 releases ago for Rocks
  • Challenges
  • Obtaining HP Itanium 2 hardware
  • Obtaining Red Hat OS
  • Dealing with ELILO and EFI

34
Boot Loaders
  • LILO - x86
  • LInux LOader (doesn't support IA64)
  • Uses Master Boot Record (MBR) of disk to
    bootstrap
  • ELILO - IA64
  • Enhanced version of LILO for IA64
  • Uses EFI to bootstrap
  • Similar to boot prom on other workstations
  • Sun - OpenBoot
  • Alpha - SRM
  • From a Rocks perspective this is the real
    distinction between x86 and IA64.

35
Software Stack
  • Rocks software recompiled on IA64
  • Everything was already 64-bit clean
  • And most of our source code is Python
    (interpreted)
  • Even our patches to Red Hat's installer simply
    recompiled
  • Tracked down, or rebuilt, third party software
  • Rebuilt several x86 RPMs from source to produce
    IA64 packages
  • Some packages are missing (e.g. Sun Grid Engine)
  • Others are slightly changed (e.g. OpenPBS does
    not use Maui for IA64)
  • The Kickstart graph was updated for these changes
  • Annotated some edges as x86-only (arch="i386")
  • Added some new IA64-only edges (arch="ia64")

36
Future Work
  • Release Rocks for IA64
  • Resolve Advanced Server licensing issues
  • Preferred release media is DVD (currently in the
    lab)
  • Testing
  • Need to test Rocks IA64 at scale
  • Have built a 256-node x86 cluster
  • PXE / EFI
  • Currently EFI does not support PXE
  • Need to track EFI (and ELILO) development
  • Gelato
  • Have expressed interest in membership to Gelato
    leadership
  • Rocks is already cited on the Gelato web-portal

37
What's still missing?
  • Improved Monitoring
  • Monitoring Grids of Clusters
  • Personal cluster monitor
  • Straightforward Integration with Grid (Software)
  • Will use NMI (NSF Middleware Initiative) software
    as a basis (a Grid endpoint should be no harder to
    set up than a cluster)
  • Any sort of real IO story
  • PVFS is a toy (no reliability)
  • NFS doesn't scale properly (no stability)
  • MPI-IO is only good for people willing to rework
    large sections of code (most users want
    open/close/read/write)
  • Real parallel job control
  • MPD looks promising

38
Summary
  • 100s of clusters have been built with Rocks on a
    wide variety of physical hardware
  • Installation/Customization is done in a
    straightforward programmatic way
  • Scaling is excellent
  • HTTP is used as a transport for
    reliability/performance
  • Configuration Server does not have to be in the
    cluster
  • Package Server does not have to be in the cluster
  • (Sounds grid-like)
  • Already on the Itanium2 curve