Title: Rocks
1 Rocks
- University of Michigan
- MGRID April 2005
- Federico D. Sacerdoti
- SDSC Rocks Cluster Group
2 Primary Goal
- Make clusters easy
- Target audience: scientists who want a capable computational resource in their own lab
A (mad) scientist in need of computing power
3 The Way to manage software
- Not fun to care for and feed a system
- Codify all configuration
- Test every software component a priori
- Code the configuration of services like you code applications
- Test, test, test
- Takes longer at the outset
- But we can repeat the config with 100% accuracy every time
- Rocks will win the INSTALLATION OLYMPICS
- Time to bring a node from bare bones to fully functional, with an arbitrary number of services and components
- All compute nodes are automatically installed
- Critical for scaling in clusters
4 Complexity Hiding
- Too hard to repeat OS and service configuration for N nodes, where N is large. Automate.
- Rocks is not the first to do this
- OSCAR / SystemImager (Some assembly required)
- Radmin
- Rocks is unique in its Complexity Hiding philosophy
- All configuration is pre-tested and hidden
- End user needs computing capabilities, not exposure to the pipes, wires, and fittings of cluster services.
- Mature industries are all similar
- Automotive
- Electrical
- Civil Engineering (Structures)
- No such thing as a wall socket administrator (Katz 2004)
5 Complexity Hiding
- Cluster System Administrators are still valuable
- When you want to customize, the structure is there for you. You can always raise the hood and tinker.
- Rocks uses unmodified RPMs for the software bits and bytes: easy to find, add, and use.
- Integration and testing of new software is the hard part for a SysAdmin. Always has been.
- We all have to learn how to configure a new service (Condor, SGE, Globus, X.509, MyProxy), but LEARN IT ONCE.
- Every command typed to configure a service is codified
- Takes forever, testing is time-consuming, but the rewards are immeasurable
6 Codify All Configuration
- How do you configure NTP on Rocks compute nodes?
<post>

<!-- Configure NTP to use an external server -->
<file name="/etc/ntp.conf">
server <var name="Kickstart_PrivateNTPHost"/>
authenticate no
driftfile /var/lib/ntp/drift
</file>

<!-- Force the clock to be set to the server upon reboot -->
/bin/mkdir -p /etc/ntp
<file name="/etc/ntp/step-tickers">
<var name="Kickstart_PrivateNTPHost"/>
</file>

<!-- Force the clock to be set to the server right now -->
/usr/sbin/ntpdate <var name="Kickstart_PrivateNTPHost"/>
/sbin/hwclock --systohc

</post>
ntp-client.xml
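For illustration, a sketch of roughly what this fragment becomes in a node's generated kickstart %post section once the <var> reference is resolved. The 10.1.1.1 address standing in for Kickstart_PrivateNTPHost is an assumed example value, and the heredocs are just one way the <file> tags could be rendered:

%post
# /etc/ntp.conf written by the first <file> tag, pointing at the frontend
cat > /etc/ntp.conf << 'EOF'
server 10.1.1.1
authenticate no
driftfile /var/lib/ntp/drift
EOF

# step-tickers so the clock is stepped to the server on every reboot
/bin/mkdir -p /etc/ntp
cat > /etc/ntp/step-tickers << 'EOF'
10.1.1.1
EOF

# set the clock right now, and push the time into the hardware clock
/usr/sbin/ntpdate 10.1.1.1
/sbin/hwclock --systohc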
7 More Philosophy
- Use installation as the common mechanism to manage a cluster
- Rocks formats and installs a / partition
- On initial install (from bare metal)
- When replacing a dead node
- When adding new nodes
- Rocks also uses installation to keep software consistent
- If you catch yourself wondering whether a node's software is up-to-date, reinstall! (See the sketch below.)
- In 10 minutes, all doubt is erased
- Rocks doesn't attempt to incrementally update software
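A minimal sketch of forcing such a reinstall from the frontend. The cluster-kickstart script path and the shoot-node helper are assumptions about the tooling shipped with Rocks in this era, not commands named on the slide:

# Ask one compute node to reinstall itself from the frontend
# (script path assumed; present on Rocks compute nodes of this vintage)
ssh compute-0-0 /boot/kickstart/cluster-kickstart

# Or, if the convenience wrapper is installed on the frontend:
shoot-node compute-0-0

# Roughly ten minutes later the node is back, freshly installed and
# consistent with the rest of the cluster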
8 Rocks uses hard disks
- Rocks employs disks on nodes
- As a performance optimization (compare to NFS-mounting /)
- Why not? Disks are free and reliable
- MTBF of a disk is now higher than that of the chassis (1 million hours vs. 45k for the chassis)
- No significant discount from buying nodes without disks
- Less flexible?
- Use NFS overlay mounts wherever you want (see the sketch below)
- Less secure?
- Use an encrypted filesystem if you need red/black modes.
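A minimal illustration of such an overlay mount on a compute node. The frontend-0 hostname and /export/apps path are placeholders for whatever the frontend actually exports, not names fixed by Rocks:

# Root stays on the local disk; a shared application tree is overlaid via NFS
mkdir -p /share/apps
mount -t nfs frontend-0:/export/apps /share/apps

# Or record it in /etc/fstab so it is mounted at every boot
echo 'frontend-0:/export/apps  /share/apps  nfs  defaults  0 0' >> /etc/fstab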
9 Architecture
10 Philosophy
- Run on heterogeneous, standard, high-volume components
- Use the components that offer the best price/performance!
- Given the track record of general-purpose processors, any other strategy is risky.
- No stopping at the thermal frequency wall: dual core, quad core.
- Requires a more intelligent installer if your hardware is not identical.
- Red Hat fundamentally has this problem (the set of worldwide Linux users is maximally heterogeneous)
- Their ability to discover and configure hardware is top-notch, so why not leverage their work!
11 Rocks Hardware Architecture
12 Minimum Components
- Local hard drive
- Power
- Ethernet
- OS on all nodes (not SSI)
- i386 (Pentium/Athlon), x86_64 (Opteron/EM64T), ia64 (Itanium)
13 Minimum Hardware Requirements
- Frontend
- 2 ethernet connections
- 18 GB disk drive
- 512 MB memory
- Compute
- 1 ethernet connection
- 18 GB disk drive
- 512 MB memory
- Power
- Ethernet
14 Optional Components
- High-performance network
- Myrinet
- Infiniband (Infinicon or Voltaire)
- Network-addressable power distribution unit
- keyboard/video/mouse network not required
- Non-commodity
- How do you manage your management network?
15 Storage
- NFS
- The frontend exports all home directories
- Parallel Virtual File System version 1
- System nodes can be targeted as Compute PVFS or strictly PVFS nodes
- Lustre Roll is in development
16 Standard Rocks Storage
- Exported to compute nodes via NFS
17 Network Attached Storage
- A NAS box is an embedded NFS appliance
18 Parallel Virtual File System
19 Cluster Software Stack
20 Rocks Rolls
- Rolls are containers for software packages and the configuration scripts for those packages
- Rolls dissect a monolithic distribution
21 Rolls
- Think of a roll as a package for a car
22 Rolls: User-Customizable Frontends
- Rolls are added by the Red Hat installer
- Software within a roll is added and configured at
initial installation time
23 Red Hat Installer Modified to Accept Rolls
24 Approach
- Install a frontend
- Insert Rocks Base CD
- Insert Roll CDs (optional components)
- Answer 7 screens of configuration data
- Drink coffee (takes about 30 minutes to install)
- Install compute nodes
- Login to frontend
- Execute insert-ethers
- Boot compute node with Rocks Base CD (or PXE)
- Insert-ethers discovers nodes
- Go to step 3
- Add user accounts
- Start computing
- Optional Rolls
- Condor
- Grid (based on NMI R4)
- Intel (compilers)
- Java
- SCE (developed in Thailand)
- Sun Grid Engine
- PBS (developed in Norway)
- Area51 (security monitoring tools)
25 Login to Frontend
- Create an ssh public/private key pair
- Asks for a passphrase
- These keys are used to securely log in to compute nodes without having to enter a password each time
- Execute insert-ethers
- This utility listens for new compute nodes
26 Insert-ethers
- Used to integrate appliances into the cluster
- A DHCP listener
- Registers new nodes (workflow sketched below)
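A sketch of that workflow from the frontend's point of view. insert-ethers itself is menu-driven, so comments stand in for its interactive prompts, and compute-0-0 is simply the name the first discovered node would receive:

# On the frontend: start listening for new nodes
insert-ethers
#   -> select the "Compute" appliance type from the menu

# Power on the compute node and let it PXE boot (or boot the Rocks Base CD).
# insert-ethers sees its DHCP request, records the MAC address in the
# database, names the node (compute-0-0, compute-0-1, ...), and the node
# then pulls its kickstart file from the frontend and installs itself.

# Optionally watch the install over the ethernet-only eKV console
ssh -p 2200 compute-0-0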
27 Boot a Compute Node in Installation Mode
- Instruct the node to network boot
- Network boot forces the compute node to run the PXE protocol (Pre-eXecution Environment)
- Can also use the Rocks Base CD
- If there is no CD and no PXE-enabled NIC, you can use a boot floppy built from Etherboot (http://www.rom-o-matic.net)
28 Insert-ethers Discovers the Node
29 Insert-ethers Status
30 eKV: Ethernet Keyboard and Video
- Monitor your compute node installation over the ethernet network
- No KVM required!
- During compute node installation, execute on the frontend: ssh -p2200 compute-0-0
31 eKV: View Console Install via SSH
32 Node Info Stored In A MySQL Database
- If you know SQL, you can execute powerful commands
- Rocks-supplied command-line utilities are tied into the database
- E.g., get the hostname of the bottom 8 nodes of each cabinet:
cluster-fork --query="select name from nodes where rank<9" hostname
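A couple of further uses, sketched on the assumption that cluster-fork runs the given command on every compute node when no query is supplied; the uptime command here is purely illustrative:

# Run a command on every compute node (the default node set)
cluster-fork uptime

# Use an SQL query against the cluster database to pick the targets:
# here, only the rank-0 node of each cabinet
cluster-fork --query="select name from nodes where rank=0" uptime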
33 Cluster Database Backbone
34 Kickstart
- Red Hat's Kickstart
- Monolithic flat ASCII file
- No macro language
- Requires forking based on site information and node type.
- Rocks XML Kickstart
- Decomposes a kickstart file into nodes and a graph
- The graph specifies an OO framework
- Each node specifies a service and its configuration
- Macros and SQL for site configuration
- Compiles the flat kickstart file from a web CGI script
35 Kickstart Compile from Graph
Sent to node (http)
Compile (kgen)
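A sketch of that hand-off from the installing node's side. The CGI URL shown is only a placeholder, since the slides name the mechanism (kgen behind a web CGI script) but not the path:

# The installing node requests its kickstart file from the frontend.
# On the frontend, the CGI runs kgen: it traverses the graph, pulls site
# values out of the SQL database, expands the macros, and returns a flat
# Red Hat kickstart file over HTTP.
wget -O /tmp/ks.cfg "http://frontend-0/install/kickstart.cgi"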
36 Sample Node File

<kickstart>

<description>
Enable SSH
</description>

<package>openssh</package>
<package>openssh-clients</package>
<package>openssh-server</package>
<package>openssh-askpass</package>

<post>

<file name="/etc/ssh/ssh_config">
Host *
    CheckHostIP             no
    ForwardX11              yes
    ForwardAgent            yes
    StrictHostKeyChecking   no
    UsePrivilegedPort       no
    FallBackToRsh           no
    Protocol                1,2
</file>

chmod o+rx /root
mkdir /root/.ssh
chmod o+rx /root/.ssh

</post>
</kickstart>
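Node files are also the hook for raising the hood. A sketch of a site-local customization, assuming the conventional extend-compute.xml mechanism and a 4.0.0-style site-profiles path on the frontend; the path, the emacs package, and the rocks-dist invocation are illustrative, not taken from the slides:

# On the frontend: add a package and a small post step to every compute node
mkdir -p /home/install/site-profiles/4.0.0/nodes
cat > /home/install/site-profiles/4.0.0/nodes/extend-compute.xml << 'EOF'
<kickstart>

<description>
Site-local additions to compute nodes
</description>

<package>emacs</package>

<post>
echo "configured by extend-compute.xml" >> /etc/motd
</post>

</kickstart>
EOF

# Rebuild the distribution so the new node is merged into the graph,
# then reinstall the compute nodes to pick up the change
cd /home/install
rocks-dist dist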
37 Sample Graph File

<?xml version="1.0" standalone="no"?>
<graph>

<description>
Default Graph for Rocks.
</description>

<edge from="base" to="scripting"/>
<edge from="base" to="ssh"/>
<edge from="base" to="ssl"/>
<edge from="base" to="grub" arch="i386,x86_64"/>
<edge from="base" to="elilo" arch="ia64"/>

<edge from="node" to="base"/>
<edge from="node" to="accounting"/>

<edge from="slave-node" to="node"/>
<edge from="slave-node" to="autofs-client"/>
<edge from="slave-node" to="dhcp-client"/>
<edge from="slave-node" to="snmp-server"/>
<edge from="slave-node" to="node-certs"/>

<edge from="compute" to="slave-node"/>

<edge from="master-node" to="node"/>
<edge from="master-node" to="x11"/>

</graph>
38 Kickstart framework
39 Kickstart Graph with Roll
HPC
base
40 Compute Node Installation Timeline
41 Available Rolls
- Area51
- Tripwire and chkrootkit
- Condor
- High-throughput computing grid package
- IB
- Infiniband drivers and MPI from Infinicon
- Intel
- Compiler and libraries for Intel-based clusters (Scalable Systems)
- Grid
- NMI packaging of Globus
- PBS/Maui
- Job scheduling
- SCE
- Scalable cluster environment (Thailand)
- SGE
- Job scheduling
- Viz
- Easily set up nVidia-based viz clusters
- Java
- Java environment
- RxC
- Graphical cluster management tool (Scalable Systems)
- Lava
- Workload management (Platform Computing)
- IB-Voltaire
- Infiniband drivers and MPI from Voltaire
42 Futures
43 Rocks 4.0.0
- Currently in beta
- Based on RHEL 4.0
- Kernel v2.6
- Using CentOS as base operating environment
- CentOS is a RHEL rebuild
- When asked for a roll, input stock CentOS CDs
- Implication
- Opens the door for using any RHEL-based media
- Official RHEL bits
- Other RHEL clones (e.g., Scientific Linux)
44 More Rolls
- Application-specific rolls
- Oil and Gas
- Computational Chemistry
- Rendering
- Bioinformatics
45 Largest Known Rocks Clusters
- Scientific
- Our bread and butter.
- Tungston2 (1040 CPUs - NCSA)
- Fermilab Farms (1500 CPUs, subclustered)
- Lonestar (1024 CPUs - TACC)
- Iceburg (600 CPUs - Stanford)
- Commercial
- Niobe Cluster (288 CPUs - AMD Sunnyvale labs)
- Oil & Gas: rumors of 1000s of nodes (unspecified)
- Dell as hardware vendor / Platform for Rocks support
- Beginning to see Rocks on RFP requirement lists