1Clustering with OSCAR
Ottawa Linux Symposium (OLS02)
June 29, 2002
Thomas Naughton  <naughtont@ornl.gov>
Oak Ridge National Laboratory
2Also presenting today
Sean Dague, IBM (SIS)
Brian Luethke, ORNL (C3)
Steve DuChene, BGS (Ganglia)
3OSCAR: Open Source Cluster Application Resources
- Snapshot of best known methods for building, programming, and using clusters.
- Consortium of academic/research and industry members.
4Project Overview
5What does it do?
- Wizard-based cluster software installation (OS environment)
- Automatically configures cluster components
- Increases consistency among cluster builds
- Reduces time to build/install a cluster
- Reduces need for expertise
6Functional Areas
- cluster installation
- programming environment
- workload management
- security
- administration
- maintenance
- documentation
- packaging
7OCG/OSCAR Background
8History & Organization
- What is Open Cluster Group (OCG)?
- How is OSCAR related to OCG?
- When was it started?
- Why was it started?
- What is the industry / academic/research facet?
9OSCAR Members
- Dell
- IBM
- Intel
- MSC.Software
- Bald Guy Software
- Silicon Graphics, Inc.
- Indiana University
- Lawrence Livermore National Lab
- NCSA
- Oak Ridge National Lab
blue denotes 2002 core members
10Software releases
oscar-1.0 RedHat 6.2 Apr 2001
oscar-1.1 RedHat 7.1 Jul 2001
oscar-1.2b RedHat 7.1 Jan 2002
oscar-1.2.1 RedHat 7.1 Feb 2002
oscar-1.2.1rh72 RedHat 7.2 Apr 2002
oscar-1.3beta RH 7.1/7.2, MDK 8.2 Jun 2002
NOTE: Early releases were LUI based; the latest releases are SIS based.
11Installation Overview
12Assumptions & Requirements
- Currently assume a single head node, multiple compute nodes configuration.
- User is able to install RedHat Linux with X Window support and set up the network for this machine (head node).
- Currently only support a single Ethernet interface (eth0) in the compute nodes.
- Selected RedHat for the current version; the design is to be distribution agnostic.
13An OSCAR Cluster
- Installed and configured items
- Head node services, e.g. DHCP, NFS
- Internal cluster networking configured
- SIS bootstraps compute-node installation; OS installed via network (PXE) or floppy boot
- OpenSSH/OpenSSL configured
- C3 power tools setup
- OpenPBS and MAUI installed and configured
- Message passing libraries installed: LAM/MPI, MPICH, PVM
- Env-Switcher/Modules installed and defaults set up
14OSCAR 1.3
- Continue to use SIS (replaced LUI in v1.2)
- Add Drop-in package support
- Supports Add/Del node
- Supports RH 7.1,7.2, MDK 8.2, and Itanium
- Add Env-Switcher/Modules
- Add Ganglia
- Updated packages: C3, LAM/MPI, MPICH, OpenPBS, OpenSSH/SSL, PVM
15OSCAR 1.3 base pkgs
Package Name Version
SIS 0.90-1/2.1.3oscar-1/1.25-1
C3 3.1
OpenPBS 2.2p11
MAUI 3.0.6p9
LAM/MPI 6.5.6
MPICH 1.2.4
PVM 3.4.46
Ganglia 2.2.3
Env-switcher/modules 1.0.4/3.1.6
16Virtual OSCAR Install
17Step 0
- Install RedHat on the head node (see also next slide)
- Include X Window support
- Configure external/internal networking (eth0, eth1)
- Create the RPM directory and copy RPMs from CD(s)
- Download OSCAR
- Available at http://oscar.sourceforge.net/
- Extract the tarball (see also next slide)
- Print/read the documentation
- Run the wizard (install_cluster ethX) to begin the install
18Step 0.5
- Installing the head node (standard or via KickStart, etc.)
- Configure networking/naming (internal/external NICs)
- Reboot, login as root, run the following commands
- Create the RPM dir (must be this path):
    [root@headnode root]# mkdir -p /tftpboot/rpm
- Insert RedHat CD1 into the drive:
    [root@headnode root]# mount /mnt/cdrom
    [root@headnode root]# cp -ar /mnt/cdrom/RedHat/RPMS/* \
    > /tftpboot/rpm
    wait...wait...wait...
    [root@headnode root]# eject /mnt/cdrom
- Insert RedHat CD2 into the drive:
    [root@headnode root]# mount /mnt/cdrom
    [root@headnode root]# cp -ar /mnt/cdrom/RedHat/RPMS/* \
    > /tftpboot/rpm
    wait...wait...wait...
19Step 0.75
    [root@headnode root]# eject /mnt/cdrom
    [root@headnode root]# cd
    [root@headnode root]# pwd
    /root
    [root@headnode root]# tar zxf oscar-1.3.tar.gz
    [root@headnode root]# cd oscar-1.3
    [root@headnode oscar-1.3]# ifconfig
- Look at the output and determine the internal interface. Ex.:
    eth1  Link encap:Ethernet  HWaddr 00:A0:CC:53:6D:F4
          inet addr:10.0.0.55  Bcast:10.0.0.255  Mask:255.255.255.0
    [root@headnode oscar-1.3]# ./install_cluster eth1
- Follow the steps in the Install Wizard...
20Install Wizard Overview
- Select default MPI.
- Build an image per client type (partition layout, HD type)
- Define clients (network info, image binding)
- Setup networking (collect MAC addresses, configure DHCP, build boot floppy)
- Boot clients / build
- Complete setup (post-install)
- Run test suite
- Use cluster
21OSCAR 1.3 Step-by-Step
- After untarring, tar zxvf oscar-1.3b3.tar.gz
22OSCAR 1.3 Step-by-Step
- NOTE: On RedHat 7.2, upgrade RPM to v4.0.4
23OSCAR 1.3 Step-by-Step
- Run the install script, ./install_cluster eth1
24-40: OSCAR 1.3 Step-by-Step (Install Wizard screenshots)
41OSCAR 1.3 Step-by-Step
PXE-capable nodes: select (temporarily) the NIC as the boot device. Otherwise use the autoinstall floppy (not as quick, but reliable!)
42-59: OSCAR 1.3 Step-by-Step (Install Wizard screenshots, continued)
60Community Usage
61OSCAR, over 40,000 customers served!
- oscar.sourceforge.net
- 41,046 downloads
- 121,355 page hits
- (May 17, 2002, 11:15am)
62More OSCAR Stats
- Known packages using OSCAR
- The NCSA "-in-a-Box" series
- MSC.Linux
- Large Installations
- LLNL: 3 clusters (236 nodes, 472 processors)
- ORNL: 3 clusters (150 nodes, 200 processors)
- SNS cluster
- etc
63MSC.Linux
- OSCAR based
- Adds
- Webmin tool
- Commercial grade integration and testing
64Cluster-in-a-Box
- OSCAR based
- www.ncsa.uiuc.edu/News/Access/Stories/IAB
- Cluster-in-a-Box
- Grid-in-a-Box
- Display Wall-in-a-Box
- Access Grid-in-a-Box
- Presently
- Cluster-in-a-Box: OSCAR
- Goal to add
- Myrinet
- Additional Alliance software
- IA-64
65eXtreme TORC powered by OSCAR
- Major users
- CSMD: SciDAC SSS scalability research and testing
- Spallation Neutron Source Facility: codes for neutronics performance, activation analysis, shielding analysis, and design engineering data support
- Genome Analysis and Systems Modeling: Genomic Integrated Supercomputing Toolkit
- SciDAC fusion codes
- CSMD: checkpoint/restart capability for out-of-core ScaLAPACK dense solvers
- 65 P4 machines
- peak performance: 129.7 GFLOPS
- memory: 50.152 GB
- disk: 2.68 TB
- dual interconnects
- gigabit ethernet
- fast ethernet
66OSCAR: most popular install management package
- clusters.top500.org
- May 17, 2002 11am
- OSCAR install 30
- OSCAR
- MSC.Linux
- NCSA in-a-box
- (Non-scientific survey)
67Component Details
68Cluster Management
- Cluster management is not a single service or package; it covers four main areas:
- System software mgmt: SIS
- Cluster-wide monitoring: Ganglia
- Parallel execution / command env: C3
- Power mgmt: none presently offered
69Components Presenters
- System Installation Suite (SIS)
- Sean Dague, IBM
- C3 Cluster Power Tools
- Brian Luethke, ORNL
- Ganglia monitoring system
- Steve DuChene, BGS
- Env-Switcher
- Thomas Naughton, ORNL
70BREAK
71System Installation Suite (SIS)
Component Presenter: Sean Dague, IBM
72System Installation Suite
Sean Dague  <sldague@us.ibm.com>
Software Engineer, IBM Linux Technology Center
73SIS = SystemImager + LUI
- SystemImager: image-based installation and maintenance tool
- LUI: resource-based cluster installation tool
- Projects merged in April of 2001
- Goals
- Support all Linux distributions
- Support a large number of architectures
- Make it easy to add support for new distros and architectures
- Make it so no one has to solve the massive installation issue again
- (i.e., do it once, do it right, do it for everyone)
74System Installation Suite at a glance
75SIS at a glance described
- Many different images and versions of images may be stored on an Image Server
- Image can be captured from an existing machine
- Image can be created from a set of packages directly on the Image Server
- rsync is used to propagate the image during installation
- Because rsync is used, maintenance is easily done (i.e., only changes are pulled across the network)
- Because replication is done at the file level and not the package level, it is very distribution agnostic
76What does it do for me?
- System Installation
- Fast and efficient way to install machines
- System Maintenance
- rsync only propagates the changes between client and image
- File System Migration
- Image an ext2 machine, image back as ext3 or reiserfs (XFS and JFS coming soon)
- Migrate systems from non-RAID to software RAID
- Easy Machine Backup
- Build replicas of machines
77Capturing an Image from a Golden Client
- getimage is the standard SystemImager way of capturing an image from a golden client to the image server (see the sketch below)
- On the client
- prepareclient: run on the client, sets up rsyncd on the client machine
- On the server
- getimage: rsyncs the image from the client to the server
- mkautoinstallscript: creates the autoinstall script on the server for the image
- addclients: adds client definitions for the image
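A minimal sketch of that sequence, assuming a hypothetical golden client (node1) and image name (oscarimage); the exact option spellings vary between SystemImager releases, so treat them as assumptions and confirm against the man pages:

    # on the golden client (as root): export its filesystem via rsyncd
    prepareclient
    # on the image server: pull the image and generate its autoinstall script
    getimage -golden-client node1 -image oscarimage    # option names assumed from SystemImager 2.x docs
    mkautoinstallscript -image oscarimage
    # define which clients will install this image (interactive)
    addclients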
78Creating an Image directly on the Image Server
- buildimage is the System Installer program to create an image directly on the server from an RPM list and disk partition file
- On the server
- mksiimage: builds the base image
- mksidisk: creates disk partition table information
- mkautoinstallscript: builds the autoinstall script for the image
- mksimachine: creates client definitions for a machine
- System Installer stores all the image and client info in a flat-file database for other applications to utilize
79Tksis System Installation Suite GUI
- Perl-Tk GUI for System Installation Suite, available as the systeminstaller-x11 package (still in early stages)
- Currently only interfaces with System Installer buildimage calls (will integrate with SystemImager calls in the near future)
- Provides an easy-to-use interface for installation
- Component panels may easily be integrated into other Perl-based installation tools
80Installing an image part 1
- Image can be autoinstalled via diskette, CD, or network
- mkautoinstalldiskette: creates autoinstall floppy
- mkautoinstallcd: creates autoinstall CD ISO
- mkbootserver: creates PXE autoinstall server
- Boot steps
- autoinstall media boots
- looks for local.cfg (network information) or uses DHCP to get an IP
- determines hostname from IP address or local.cfg
- fetches the <hostname>.sh autoinstall script from the Image Server
81Installing an image part 2
- Autoinstall steps
- rsync over any additionally needed utilities (mkraid, raidstop, raidstart, mkreiserfs, etc.)
- partition disk drives using sfdisk
- format and mount all filesystems
- rsync image from Image Server
- run systemconfigurator to set up networking and bootloader
- unmount all filesystems
- do specified postinstall action (one of beep, shutdown, or reboot)
- Autoinstall will dump to a shell if any errors are encountered
82Maintaining a machine
- Choice 1: Maintain the image directly (see the sketch below)
- Image is a full live filesystem
- you can chroot into the image
- compile code in the image
- run rpm -Uhv newpackage.rpm in the image
- Choice 2: Maintain the golden client
- Apply hot fixes to the golden client
- Rerun getimage to recapture the image
- updateclient resyncs a client to the image
- because rsync is used, only the changes between image and client are propagated
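A minimal sketch of both choices, assuming a hypothetical image named oscarimage; the image path and the updateclient options are assumptions to be checked against the installed SystemImager version:

    # Choice 1: work inside the stored image on the image server
    chroot /var/lib/systemimager/images/oscarimage /bin/sh   # image directory is an assumption
    rpm -Uhv /tmp/newpackage.rpm                             # hypothetical package, already copied into the image
    exit

    # Choice 2: after re-running getimage, resync a running client to the image
    updateclient -server imageserver -image oscarimage      # run on the client; option names assumed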
83Who's using it?
- SIS / SystemImager
- All users of SystemImager > 2.0 are SIS users
- OSCAR 1.2 uses SIS for installation
- SCore, Clubmask, and other clustering groups are interested in using SIS for installation
- SystemImager 2.0 and System Configurator 1.0 accepted into the Debian 3.0 distribution
84Future Directions
- System Installer 1.0
- Debian Package support
- IA64 arch support
- SystemImager 2.2
- devfs clients
- IA64 arch support
- SystemImager 2.4
- PPC, S390, and PA-RISC arches
- JFS and XFS file systems
- remote logging
- internal API (allows for Tksis integration)
- Inclusion in more Linux Distributions
- Unified GUI for System Installer and SystemImager
85Questions?
- System Installation Suite: http://sisuite.org
- SystemImager: http://systemimager.org
- System Installer: http://systeminstaller.sf.net
- OSCAR: http://oscar.sf.net
- Team can be found on the #sisuite and #systemimager channels on irc.openprojects.net
86Cluster Command & Control (C3)
Component Presenter: Brian Luethke, ORNL
87C3 Cluster Power Tools (Cluster Command & Control)
Presented by Brian Luethke
Brian Luethke, John Mugler, Thomas Naughton, Stephen Scott
88Overview
- command line based
- single system illusion (SSi): single machine interface
- cluster configuration file
- ability to rapidly deploy software and system images from the server
- command line list options enable sub-cluster management
- distributed file scatter and gather operations
- execution of non-interactive commands
- multiple cluster capability from a single entry point
89Building Blocks
- System administration
- cpushimage - push an image across the cluster
- cshutdown - remote shutdown to reboot or halt the cluster
- User tools
- cpush - push single file -to- directory
- crm - delete single file -to- directory
- cget - retrieve files from each node
- ckill - kill a process on each node
- cexec - execute an arbitrary command on each node
- cexecs - serial mode, useful for debugging
- clist - list each cluster available and its type
- cname - returns a node name from a given node position
- cnum - returns a node position from a given node name
90Cluster Classification Scheme
- Direct local
- The cluster nodes are known at run time
- The command is run from the head node
- Direct remote
- The cluster nodes are known at run time
- The command is not run from the head node
- Indirect remote
- The cluster nodes are not known at run time
- The command is not run from the head node
- Notes
- Local or remote is determined by comparing the head node name to the local hostname
- Indirect clusters will execute on the default cluster of the head node specified
91Cluster Configuration File
- default cluster configuration file: /etc/c3.conf

    cluster torc {              # direct local cluster
        orc-00b:node0
        node1-4
        exclude 3
    }

    cluster htorc {             # indirect remote cluster
        htorc-00
    }

- user-specified configuration file: /somewhere/list_of_nodes

    cluster auto-gen {          # direct remote cluster
        node0.csm.ornl.gov
        node1.csm.ornl.gov
        node2.csm.ornl.gov
        node3.csm.ornl.gov
        dead node4.csm.ornl.gov
    }
92Configuration File Information
- Offline node specifiers
- exclude tag applies to ranges
- dead applies to single machines
- Important for node ranges on the command line
- Cluster definition blocks as meta-clusters
- Groups based on hardware
- Groups based on software
- Groups based on role
- User-specified cluster configuration files (see the example below)
- Specified at runtime
- User can create both sub-clusters and super-clusters
- Useful for scripting
- Cannot have an indirect local cluster (the info has to be somewhere)
- Infinite loop warning: when using an indirect remote cluster, the default cluster on the remote head node is executed; this could make a call back.
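For example, a user-specified configuration file can be handed to any C3 command with the documented -f/--file option; the file name and node names below are made up for illustration:

    # ~/mycluster.conf : a user-defined sub-cluster (direct local; hypothetical names)
    cluster mysub {
        headnode:node0
        node1-2
    }

    # use it at runtime instead of /etc/c3.conf
    cexec -f ~/mycluster.conf mysub: hostname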
93MACHINE DEFINITIONS (Ranges) on Command Line
- MACHINE DEFINITIONS as used on the command line
- Position number from configuration file
- Begins at 0
- Does not include the head node
- dead and exclude maintain a node's position
- Format on the command line (more examples below)
- First the cluster name from the configuration file, followed by a colon
- cluster2: would represent all nodes on cluster2
- : alone signifies the default cluster
- ranges and single nodes are separated by a comma
- cluster2:1-5,7 executes on nodes 1, 2, 3, 4, 5, 7
- :4 executes on the node at position 4 on the default cluster
- cexec torc:1-5,7 hostname
94Execution Model External to Multi-Cluster
[Diagram: the desktop knows only the cluster head nodes (TORC, eXtremeTORC, HighTORC); each cluster head node knows its own compute nodes (node 1, node 2, ... node 7)]
95Execution Model External to Multi-Cluster
On the desktop: indirect remotes (several in one file)

    cluster torc         { torc }
    cluster extreme_torc { xtorc }
    cluster high_torc    { htorc }

On eXtremeTORC: direct local

    cluster xtorc {
        xtorc:node0
        node1-7
    }
96cpush
Usage: cpush [OPTIONS] [MACHINE DEFINITIONS] source [target]
  -h, --help              display help message
  -f, --file <filename>   alternate cluster configuration file; default is /etc/c3.conf
  -l, --list <filename>   list of files to push (single file per line; column 1 = SRC, column 2 = DEST)
  -i                      interactive mode, ask once before executing
  --head                  execute command on head node, does not execute on compute nodes
  --nolocal               the source file or directory lies on the head node of the remote cluster
  -b, --blind             pushes the entire file (normally cpush uses rsync)
97cpush
- To move a single file:
    cpush /home/filename
  This pushes the file filename to /home on each compute node.
- To move a single file, renaming it on the cluster nodes:
    cpush /home/filename1 /home/filename2
  Push the file filename1 to each compute node in the cluster, renaming it to filename2 on the cluster nodes.
- To move a set of files listed in a file:
    cpush --list /home/filelist escaflowne:
  This pushes each file in the filelist to wherever it is specified to be sent. The filelist format is on the next slide.
98Notes on using a file list
- One file per line
- If no destination is specified, the file is pushed to the same location it occupies on the local machine
- No comments
- Example file (see the invocation below):
    /home/filename
    /home/filename2 /tmp
    /home/filename3 /tmp/filename4
- The first line pushes the file filename to /home on each compute node
- The second line pushes the file filename2 to /tmp on each compute node
- The third line pushes the file filename3 to /tmp on each compute node, renaming the file to filename4
- All options on the command line are applied to each file. In a filelist, you cannot specify that file one uses the --nolocal option while file two goes to a different machine definition.
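Assuming the example filelist above is saved as /home/filelist, it would be pushed with the -l/--list option shown on the cpush usage slide (the cluster name is illustrative):

    cpush -l /home/filelist                    # push to the default cluster
    cpush --list /home/filelist escaflowne:    # or to a named cluster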
99cexec
Usage: cexec(s) [OPTIONS] [MACHINE DEFINITIONS] command
  --help, -h              display help message
  --file, -f <filename>   alternate cluster configuration file; if one is not supplied then /etc/c3.conf will be used
  -i                      interactive mode, ask once before executing
  --head                  execute command on head node, does not execute on the cluster

Using cexecs executes the serial version of cexec.
100cexec
- To simply execute a command:
    cexec mkdir temp
  This executes mkdir temp on each node in the cluster. The working directory of the cexec command is always your home directory, thus temp would be created in ~/.
- To print the machine name and then execute the string (serial version only):
    cexecs hostname
  This executes hostname on each node in the cluster. This differs from cexec in that each node is executed before the next one. This is useful if a node is offline and you wish to see which one.
101cexec
- To execute a command with wildcards on several clusters:
    cexec cluster1: cluster2:2-5 "ls /tmp/pvmd*"
  This will execute ls /tmp/pvmd* on each compute node on cluster1 and on nodes 2, 3, 4, and 5 of cluster2. Notice the use of the quotes; this keeps the shell from interpreting the command until it reaches the compute nodes.
- Using pipes:
    cexec "ps -A | grep a.out"
    cexec ps -A | grep a.out
  In the first example the | symbol is enclosed in the quotes. In this case ps -A | grep a.out is executed on each node, so you get the standard cexec output format, with a.out in each node's block if it exists. In the second example ps -A is executed on each node and all the a.out lines are grepped out of the combined output. This demonstrates that placement of quotes is very important. Example output is on the next slide.
102cexec quotation example
cexec "ps -A | grep xinetd"
    local processing node:  node1
    local processing node:  node2
    --------- node1---------
    9738 ?        00:00:00 xinetd
    --------- node2---------
    4856 ?        00:00:00 xinetd

cexec ps -A | grep xinetd
    9738 ?        00:00:00 xinetd
    4856 ?        00:00:00 xinetd
103cname
Usage: cname [OPTIONS] [MACHINE DEFINITIONS]
  --help, -h              display help message
  --file, -f <filename>   alternate cluster configuration file; if one is not supplied then /etc/c3.conf will be used
104cname
- To search the default cluster:
    cname :0-5
  This returns the node names for the nodes occupying slots 0, 1, 2, 3, 4, and 5 in the default configuration file.
- To search a specific cluster:
    cname cluster1: cluster2:4-8
  All of the nodes in cluster1 are returned, and nodes 4, 5, 6, 7, and 8 are returned from cluster2.
105cnum
Usage: cnum [OPTIONS] [MACHINE DEFINITIONS] node_name
  --help, -h              display help message
  --file, -f <filename>   alternate cluster configuration file; if one is not supplied then /etc/c3.conf will be used
106cnum
- To search the default cluster:
    cnum node2
  This returns the node position (number) that node2 occupies in the default cluster configuration file.
- To search several clusters in the configuration file:
    cnum cluster1: cluster2: gundam eva
  This returns the node positions that the nodes gundam and eva occupy in both cluster1 and cluster2. If a node does not exist in a cluster, no node number is returned.
107clist
Usage: clist [OPTIONS]
  --help, -h              display help message
  --file, -f <filename>   alternate cluster configuration file; if one is not supplied then /etc/c3.conf is used
108clist
- To list all the clusters from the default configuration file:
    clist
  This lists each cluster in the default configuration file and its type (direct local, direct remote, or indirect remote).
- To list all the clusters from an alternate file:
    clist -f cluster.conf
  This lists each cluster in the specified configuration file and its type (direct local, direct remote, or indirect remote).
109Multiple cluster examples
- Command line: same as single clusters, only specify several clusters
- Example: installing an rpm on two clusters
- First push the rpm out to the cluster nodes:
    cpush : xtorc: example-1.0-1.rpm
- Use RPM to install the application:
    cexec : xtorc: rpm -i example-1.0-1.rpm
- Check for errors in the installation:
    cexec : xtorc: rpm -q example
- Notice the addition of the xtorc: cluster specifier, the only difference from the single-cluster examples
- All clusters in this list will participate in the command (the standalone : represents the default cluster)
110Usage Notes
- By default C3 does not execute commands on the head node
- Use the --head option to execute only on the head node
- The interactive option only asks once before execution
- Commands only need to be homogeneous within themselves
- Example: binary and data on an Intel cluster and an HP-UX cluster
- Data can be pushed to both systems:
    cpush --head intel: hp: data.txt
- Binary for each cluster:
    cpush --head intel: app.intel app
    cpush --head hp: app.HPUX app
- Then execute app:
    cexec --head intel: hp: app
111Usage Notes
- Notes on using multiple clusters
- Very powerful, but with power comes danger
- malformed commands can be VERY bad
- homogeneous within itself becomes very important
- a crm across all clusters could bring down MANY nodes
- Extend nearly all unix/linux gotchas to multiple clusters / many nodes, and very fast
- High-level administrators can easily set policies on several clusters from a single access point
- Federated clusters: those within a single domain
- Meta-clusters: wide-area joined clusters
112Contact Information
torc@msr.csm.ornl.gov        contact the ORNL cluster team
www.csm.ornl.gov/torc/C3     version 3.1 (current release)
www.csm.ornl.gov/TORC        ORNL team site
www.openclustergroup.org     C3 v3.1 included in OSCAR 1.3
113Ganglia
Component Presenter: Steve DuChene, BGS
114Ganglia
115Overview
- Ganglia provides a real-time cluster monitoring environment.
- Communication takes place between nodes across a multicast network using XML / XDR formatted text.
- Ganglia currently runs on Linux, FreeBSD, Solaris, AIX, and IRIX.
116History
- Ganglia was developed as part of the Millennium Project at the UC Berkeley Computer Science Division.
- Principal author is Matt Massie <massie@cs.berkeley.edu>
- Packaged for OSCAR by Steve DuChene <linux-clusters@mindspring.com>
117Ganglia for monitoring
- gmond: multithreaded daemon which acts as a server for monitoring a host
- Additional utilities
- gmetric: allows adding arbitrary host metrics to the monitoring data stream
- gstat: CLI to get a cluster status report
118Graphical Interface
- PHP/rrdtool web client.
- Creates history graphs of individual data streams and formats the output for web display.
119gmond, a few specifics
- Each gmond stores all of the information for the entire cluster locally in memory.
- Opens up port 8649; a telnet to this port will result in a dump of all the information stored in memory (XML formatted), as shown below.
- Additionally, when a change occurs in the host that is being monitored, the gmond multicasts this information to the other gmonds.
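For example, any node's gmond can be queried directly over that port (the node name is illustrative):

    telnet node1 8649     # dumps the cluster state as XML, then closes the connection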
120gstat
- The simplest command-line client available.
- gstat: shows all nodes with basic load info
- gstat --help: shows general options
- gstat --dead: shows dead nodes
- gstat -m: lists the nodes from least to most loaded (examples below)
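Two quick checks built only from the options above; the assumption that gstat -m prints one node per line is mine, not the slide's:

    gstat --dead             # any nodes listed here are unreachable
    gstat -m | head -5       # roughly: the five least-loaded nodes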
121Gmetric
- gmetric announces a metric value to the rest of the gmond multicast channel. Main command line options are (example below):
- --name=String: what appears in the list of monitored metrics
- --value=String: value of the metric
- --type=String: one of string, int8, uint8, int16, uint16, float, double
- --units=String: e.g., Degrees F or Kilobytes
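A hypothetical invocation using only the options listed above (the metric name and the df parsing are made up for illustration):

    gmetric --name=tmp_free --value=$(df -k /tmp | awk 'NR==2 {print $4}') \
            --type=float --units=Kilobytes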
122Example LM Sensor output

    w83782d-i2c-0-2d
    Adapter: SMBus Via Pro adapter at 5000
    Algorithm: Non-I2C SMBus adapter
    VCore 1:   1.40 V   (min 0.00 V,   max 0.00 V)
    VCore 2:   1.42 V   (min 0.00 V,   max 0.00 V)
    3.3V:      3.32 V   (min 2.97 V,   max 3.63 V)
    5V:        4.94 V   (min 4.50 V,   max 5.48 V)
    12V:      12.16 V   (min 10.79 V,  max 13.11 V)
    -12V:    -12.29 V   (min -13.21 V, max -10.90 V)
    -5V:      -5.10 V   (min -5.51 V,  max -4.51 V)
    V5SB:      4.99 V   (min 4.50 V,   max 5.48 V)
    VBat:      3.15 V   (min 2.70 V,   max 3.29 V)
    fan1:    10714 RPM  (min 3000 RPM, div 2)
    fan2:    10887 RPM  (min 3000 RPM, div 2)
    fan3:        0 RPM  (min 1500 RPM, div 4)
    temp1:     -48 C    (limit 60 C, hysteresis 50 C)  sensor: thermistor
    temp2:    43.5 C    (limit 60 C, hysteresis 50 C)  sensor: PII/Celeron diode
    temp3:    40.5 C    (limit 60 C, hysteresis 50 C)  sensor: PII/Celeron diode
    vid:      0.00 V
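Readings like these are exactly what gmetric is for. A hedged sketch of feeding one of them into Ganglia, assuming the lm_sensors sensors command is installed and that its temp2 line looks like the output above (the awk field position is an assumption):

    TEMP=$(sensors | awk '/^temp2:/ {print $2}')    # e.g. 43.5
    gmetric --name=cpu_temp --value=$TEMP --type=float --units=Celsius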
123The web client php/rrd
124Displaying all the metrics.
125gmond across multiple clusters
- gmond --trusted_host xxx.xxx.xxx.xxx
- Allows setting up a unicast connection to another gmond across the Internet.
- Must do this on each gmond, so that the communication is two-way.
126Ganglia Summary
- gmond is scalable because of its use of multicast.
- gmond is useful, as it allows real-time information gathering of which hosts are alive before running a job.
- Available at ganglia.sourceforge.net
- Now an included package in OSCAR.
127Ganglia / C3 Example
128Description of sync_users
- The sync_users script that ships with OSCAR is a very simple example usage of cpush to distribute the files
- /etc/{passwd,group,shadow,gshadow}
- to the nodes, manually or via a cron entry.
129Statement of Problem
- The default sync_users script is very simple, and a very annoying characteristic is that it stalls when any of the nodes are down (it stalls until the SSH timeout for that node).
- All available nodes roll by perfectly, but the script pushes 2-4 files and the stall happens at the end of each cpush (file). Therefore, if the timeout is 2 minutes, it could hang for 8 minutes if no CTRL-C is applied.
130Re-Statement of Problem
- Need some way to dynamically determine the down nodes and skip them when running sync_users.
- Also, need to display the list of missed nodes.
131Enter Ganglia
- The same day the sync_users discussion took place, Ganglia was demoed by a group member.
- Ganglia maintains information about nodes in the cluster and, most relevantly, it offers a nice tool, gstat, with options to list available nodes and their load!
132Quick sync_users2
- So, a quick sync_users2 was whipped up using Ganglia's gstat in conjunction with C3's cpush to make a smarter script.
- The script uses output from cname and gstat -m
- The output is massaged to build the cpush command line and to clearly report missed nodes.
133Usage example
- Things could almost be done from the command line, like this:
    # gstat -m > upnodes.tmp
    # cpush -l upnodes.tmp /etc/passwd
    # rm upnodes.tmp
- (repeat for all files: passwd, group, shadow, gshadow)
- Instead, just type:
    # ./sync_users2
134Perl Script Summary
- Build a hash of the default cluster in the c3.conf file (using cname):
    $c3conf = munge_c3conf("/opt/c3-3/cname");    # name -> num
- Get the list of up/available nodes (via Ganglia):
    @uplist = get_nodelist("/usr/bin/gstat -m");
- Munge the standard node list into C3-3 format (nodeN -> N):
    @c3nodelist = c3ify_nodelist($aref_uplist, $href_c3conf);
- Build the C3-3 command-line node list:
    $nodes = ":" . join(",", @c3nodelist);
- Distribute the files with the above command-line node list:
    cpush $nodes /etc/passwd
- Print missed nodes info:
    @missed = get_missednodes($aref_uplist, $href_c3conf);
    print "\n Missed nodes:\n @missed \n";
135Ganglia / C3 Comments
- This is just a simple application of C3 and Ganglia.
- The goal was to use these two tools to create a smarter sync_users; this has been met.
- Since C3 and Ganglia can be used by standard users (not just root), this method could be used by anyone for user-level scripts.
136ganglia is for specific metrics
- A Python script, added for convenience.
- It is both an executable and a class (library).
- gmond monitors 15 metrics by default.
- ganglia --help to see the metrics.
- To run: ganglia <metric> [<metric> ...]
- Example: ganglia cpu_nice
137Env-Switcher
Component Presenter: Thomas Naughton, ORNL
138Env-Switcher
- Written by Jeff Squyres
- jsquyres@lam-mpi.org
- Uses the modules package
139The OSCAR switcher package
- Contains 2 RPMs
- modules
- env-switcher
- Each RPM has a different intended use
140Super-short explanation
- modules
- Changes the current shell environment
- Changes are non-persistent: current shell only
- env-switcher
- Changes future shell environments
- Changes are persistent: all future shells
- Controls the list of which modules are loaded at each future shell invocation
141Design goals for OSCAR switcher package
- Allow users an easy way to persistently control their shell environment without needing to edit their dot files
- Strongly discourage the use of /etc/profile.d scripts in OSCAR
- Use the already-existing modules package
- Contains sophisticated controls for shell environment manipulation
- Uses deep voodoo to change the current shell env.
142Design goals for OSCAR switcher package
- Cannot interfere with advanced users wanting to use modules without switcher
- Two-tier system of defaults
- System-level default
- User-level defaults (which always override the system default)
- E.g., the system default is to use LAM/MPI, but user bob wants to have MPICH as his default
143Why doesn't switcher change the current environment?
- Changing the current env requires deep voodoo
- Cannot layer switcher over modules to change the current mechanism
- at least, not without re-creating the entire "change the current env." mechanism
- The modules package already does this
- Seems redundant to re-invent this mechanism
- Users can use the module command to change the current environment
144Why discourage /etc/profile.d scripts?
- Such scripts are not always loaded
- Canonical example is rsh/ssh
- For non-interactive remote shells, profile.d scripts are not loaded
- Non-interactive remote shells are used by all MPI and PVM implementations
- The modules philosophy is a fine-grained approach to making software packages available
- In contrast to the monolithic /usr/bin approach
145The modules software package
- modules.sourceforge.net
- At the core of modules:
- Set of Tcl scripts called modulefiles
- Each modulefile loads a single software package into the environment
- Can modify anything in the environment (e.g., PATH)
- Each modulefile is reversible: you can load and unload them from the environment
146The modules software package
- Loading and unloading modules requires individual commands; no persistent changes
- Examples (a few more common commands are shown below):
- module load lam-6.5.6
- module unload pvm
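Other everyday module commands from the standard modules package (not OSCAR-specific):

    module avail     # list the modulefiles available on the system
    module list      # show the modulefiles loaded in the current shell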
147The env-switcher software package
- Controls the set of modulefiles that are loaded for each shell
- Guarantees that this set is loaded for all shells
- Including the corner cases of rsh/ssh
- Allows users to manipulate this set via the command line
- Current command line syntax is somewhat clunky
- Will be made nicer by OSCAR 1.3 stable
148OSCAR's three kinds of modulefiles
- Normal
- Not automatically loaded by OSCAR
- /opt/modules/modulefiles
- Auto-loaded
- Guaranteed to be loaded by OSCAR for every shell
- /opt/modules/oscar-modulefiles
149OSCAR's three kinds of modulefiles
- Switcher-controlled
- May or may not be loaded by OSCAR, depending on system and user defaults
- No fixed directory location for these modulefiles
- Use the switcher command to register switcher-controlled modulefiles
150What do RPM / OSCAR package authors need to do?
- Do not provide /etc/profile.d scripts
- Provide a modulefile instead
- Decide how that modulefile will be used in OSCAR
- Normal
- Auto-loaded
- Switcher-controlled
- Install the modulefile in %post as appropriate
- Uninstall the modulefile in %preun
151Still to be done in switcher
- Add simplified command line syntax for users
- Add a man page
- Add some form of documentation in OSCAR for using
switcher to change MPI implementation
152Future Development
153OSCAR v1.4
- Major topics
- Node grouping
- GUI/CLI/Wizard
- Publish API for OSCAR DB
- Packages exploit DB via API
- Security enhancements: compute / head node
- User-selectable pkgs for contrib pkgs
- Mandrake support (if not already avail)
154OSCAR v1.5 ? v2.0
- Major topics
- Add/Delete package
- OSCAR itself a package
- Maintenance of nodes/images via GUI & CLI
155Future OCG
156Future OSCAR
- OSCAR migration/upgrade: db migration, etc.
- Support for non-RPM packages
- Support for other UNIXes
- Support for Diskless nodes
157Join OSCAR
[Cartoon: OSCAR Research Center, "Cluster Lab For The Gifted"]
www.openclustergroup.org
oscar.sourceforge.net
sourceforge.net/projects/oscar