Title: CLUSTERMATIC An Innovative Approach to Cluster Computing
1CLUSTERMATIC An Innovative Approach to Cluster Computing
- 2004 LACSI Symposium
- The Los Alamos Computer Science Institute
LA-UR-03-8015
2Tutorial Outline (Morning)
3Tutorial Outline (Afternoon)
4Tutorial Introduction
- Tutorial is divided into modules
- Each module has clear objectives
- Modules comprise a short theory component, followed by hands-on exercises
- (icons on the slides indicate theory vs. hands-on segments)
- Please ask questions at any time!
5Module 1 Overview of Clustermatic
Presenter: Greg Watson
- Objective
- To provide a brief overview of the Clustermatic architecture
- Contents
- What is Clustermatic?
- Why Use Clustermatic?
- Clustermatic Components
- Installing Clustermatic
- More Information
- http://www.clustermatic.org
6What is Clustermatic?
- Clustermatic is a suite of software that completely controls a cluster, from the BIOS to the high-level programming environment
- Clustermatic is modular
- Each component is responsible for a specific set of activities in the cluster
- Each component can be used independently of the other components
7Why Use Clustermatic?
- Clustermatic clusters are easy to build, manage and program
- A cluster can be installed and operational in a few minutes
- The architecture is designed for simplicity, performance and reliability
- Utilization is maximized by ensuring the machine is always available
- Supports machines from 2 to 1024 nodes (and counting)
- System administration is no more onerous than for a single machine, regardless of the size of the cluster
- Upgrade the O/S on the entire machine with a single command
- No need to synchronize node software versions
- The entire software suite is GPL open-source
8Clustermatic Components
- LinuxBIOS
- Replaces the normal BIOS
- Improves boot performance and node startup times
- Eliminates reliance on a proprietary BIOS
- No interaction required, important for 100s of nodes
[Architecture diagram: LinuxBIOS layer]
9Clustermatic Components
- Linux
- Mature O/S
- Demonstrated performance in HPC applications
- No proprietary O/S issues
- Extensive hardware and network device support
10Clustermatic Components
- V9FS
- Avoids problems associated with global mounts
- Processes are provided with a private shared filesystem
- Namespace exists only for the duration of the process
- Nodes are returned to a pristine state once the process is complete
[Architecture diagram: Users / Compilers, Debuggers / BJS / BProc, Supermon, MPI / Beoboot, v9fs / Linux / LinuxBIOS - v9fs highlighted]
11Clustermatic Components
- Beoboot
- Manages booting of cluster nodes
- Employs a tree-based boot scheme for fast/scalable booting
- Responsible for configuring nodes once they have booted
[Architecture diagram: Beoboot highlighted]
12Clustermatic Components
- BProc
- Manages a single process space across the machine
- Responsible for process startup and management
- Provides commands for starting processes, copying files to nodes, etc.
[Architecture diagram: BProc highlighted]
13Clustermatic Components
- BJS
- BProc Job Scheduler
- Enforces policies for allocating jobs to nodes
- Nodes are allocated to pools, which can have different policies
[Architecture diagram: BJS highlighted]
14Clustermatic Components
- Supermon
- Provides a system monitoring infrastructure
- Provides kernel and hardware status information
- Low overhead on compute nodes and interconnect
- Extensible protocol based on s-expressions
[Architecture diagram: Supermon highlighted]
15Clustermatic Components
- MPI
- Uses standard MPICH 1.2 (ANL) or LA-MPI (LANL)
- Supports Myrinet (GM) and Ethernet (P4) devices
- Supports debugging with TotalView
[Architecture diagram: MPI highlighted]
16Clustermatic Components
- Compilers / Debuggers
- Commercial and non-commercial compilers available
- GNU, Intel, Absoft
- Commercial and non-commercial debuggers available
- gdb, TotalView, DDT
[Architecture diagram: Compilers/Debuggers highlighted]
17Linux Support
- Linux Variants
- For RedHat Linux
- Installed as a series of RPMs
- Supports RH 9 (2.4.22 kernel)
- For other Linux distributions
- Must be compiled and installed from source
18Tutorial CD Contents
- RPMs for all Clustermatic components
- Architectures included for x86, x86_64, athlon, ppc and alpha
- Full distribution available on the Clustermatic web site (www.clustermatic.org)
- SRPMs for all Clustermatic components
- Miscellaneous RPMs
- Full source tree for LinuxBIOS (gzipped tar format)
- Source for MPI example programs
- Presentation handouts
19Cluster Hardware Setup
- Laptop installed with RH9
- Will act as the master node
- Two slave nodes
- Preloaded with LinuxBIOS and a phase 1 kernel in flash
- iTuner M-100: VIA EPIA 533MHz, 128MB
- 8-port 100baseT switch
- Total cost (excluding laptop): $800
20Clustermatic Installation
- Installation process for RedHat
- Log into laptop
- Username: root
- Password: lci2004
- Insert and mount CD-ROM
- mount /mnt/cdrom
- Locate install script
- cd /mnt/cdrom/LCI
- Install Clustermatic
- ./install_clustermatic
- Reboot to load new kernel
- reboot
21Module 2 BProc Beoboot
Presenter: Erik Hendriks
- Objective
- To introduce BProc and gain a basic understanding of how it works
- To introduce Beoboot and understand how it fits together with BProc
- To configure and manage a BProc cluster
- Contents
- Overview of BProc
- Overview of Beoboot
- Configuring BProc For Your Cluster
- Bringing Up BProc
- Bringing Up The Nodes
- Using the Cluster
- Managing a Cluster
- Troubleshooting Techniques
22BProc Overview
- BProc: Beowulf Distributed Process Space
- BProc is a Linux kernel modification which provides
- A single system image for process control in a cluster
- Process migration for creating processes in a cluster
- BProc is the foundation for the rest of the Clustermatic software
23Process Space
- A process space is
- A pool of process IDs
- A process tree
- A set of parent/child relationships
- Every instance of the Linux kernel has a process space
- A distributed process space allows parts of one node's process space to exist on other nodes
24Distributed Process Space
- With a distributed process space, some processes will exist on other nodes
- Every remote process has a placeholder in the process tree
- All remote processes remain visible
- Process-related system calls (fork, wait, kill, etc.) work identically on local and remote processes
- kill works on remote processes
- No runaway processes
- ptrace works on remote processes
- strace, gdb and TotalView work transparently on remote processes!
[Diagram: a node with two remote processes]
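A quick way to see the single process space in practice (a minimal sketch; the node number and sleep duration are illustrative) is to start a long-running process on a node with bpsh and then control it from the master:
- bpsh 1 sleep 600 &
- ps -ef | grep sleep
- The sleep process appears in the master's process table even though it is running on node 1
- kill %1
- Ordinary shell job control terminates the remote process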
25Distributed Process Space Example
[Diagram: master and slave process trees, with remote process B on the slave]
- The master starts processes on slave nodes
- These remote processes remain visible on the master node
- Not all processes on the slave are part of the master's process space
26Process Creation Example
[Diagram: process A on the master and slave; A calls fork() to create process B]
- Process A migrates to the slave node
- Process A calls fork() to create a child process B
- A new placeholder for B is created on the master
- Once the placeholder exists, B is allowed to run
27BProc in a Cluster
- In a BProc cluster, there is a single master and many slaves
- Users (including root) only log into the master
- The master's process space is the process space for the cluster
- All processes in the cluster are
- Created from the master
- Visible on the master
- Controlled from the master
28Process Migration
- BProc provides a process migration system to place processes on other nodes in the cluster
- Process migration on BProc is not
- Transparent
- Preemptive
- A process must call the migration system call in order to move
- Process migration on BProc is
- Very fast (1.9s to place a 16MB process on 1024 nodes)
- Scalable
- It can create many copies of the same process (e.g. MPI startup) very efficiently - O(log copies)
29Process Migration
- Process migration does preserve
- The contents of memory and memory-related metadata
- CPU state (registers)
- Signal handler state
- Process migration does not preserve
- Shared memory regions
- Open files
- SysV IPC resources
- Just about anything else that isn't memory
30Running on a Slave Node
- BProc is a process management system
- All other system calls are handled locally on the slave node
- BProc does not impose any extra overhead on non-process-related system calls
- File and network I/O are always handled locally
- Calling open() will not cause contact with the master node
- This means network and file I/O are as fast as they can be
31Implications
- All processes are started from the master with process migration
- All processes remain visible on the master
- No runaways
- Normal UNIX process control works for ALL processes in the cluster
- No need for direct interaction
- There is no need to log into a node to control what is running there
- No software is required on the nodes except the BProc slave daemon
- ZERO software maintenance on the nodes!
- Diskless nodes without NFS root
- Reliable nodes
32Beoboot
- BProc does not provide any mechanism to get a node booted
- Beoboot fills this role
- Hardware detection and driver loading
- Configuration of network hardware
- Generic network boot using Linux
- Starts the BProc slave daemon
- Beoboot also provides the corresponding boot servers and utility programs on the front end
33Booting a Slave Node
[Boot sequence diagram - master/slave exchange:]
- Phase 1: small kernel, minimal functionality
- Slave request (who am I?); master response (IPs, servers, etc.)
- Slave requests the phase 2 image; master sends the phase 2 image
- Slave loads the phase 2 image (using magic)
- Phase 2: operational kernel, full featured
- Slave request (who am I again?); master response
- BProc slave connects to the master
34Loading the Phase 2 Image
- Two Kernel Monte is a piece of software which will load a new Linux kernel, replacing one that is already running
- This allows you to use Linux as your boot loader!
- Using Linux means you can use any network that Linux supports
- There is no PXE BIOS or Etherboot support for Myrinet, Quadrics or Infiniband
- Pink network boots on Myrinet, which allowed us to avoid buying a 1024-port ethernet network
- Currently supports x86 (including AMD64) and Alpha
35BProc Configuration
- Main configuration file
- /etc/clustermatic/config
- Edit with favorite text editor
- Lines consist of comments (starting with #)
- The rest are a keyword followed by arguments
- Specify interface
- interface eth0 10.0.4.1 255.255.255.0
- eth0 is interface connected to nodes
- IP of master node is 10.0.4.1
- Netmask of master node is 255.255.255.0
- Interface will be configured when BProc is started
36BProc Configuration
- Specify range of IP addresses for nodes
- iprange 0 10.0.4.10 10.0.4.14
- Start assigning IP addresses at node 0
- First address is 10.0.4.10, last is 10.0.4.14
- The size of this range determines the number of nodes in the cluster
- The next entries are the default libraries to be installed on nodes
- Can explicitly specify libraries or extract library information from an executable
- Need to add an entry to install extra libraries
- librariesfrombinary /bin/ls /usr/bin/gdb
- The bplib command can be used to see the libraries that will be loaded
37BProc Configuration
- Next line specifies the name of the phase 2 image
- bootfile /var/clustermatic/boot.img
- Should be no need to change this
- Need to add a line to specify the kernel command line
- kernelcommandline apm=off console=ttyS0,19200
- Turn APM support off (since these nodes don't have any)
- Set the console to use ttyS0 at 19200 baud
- This is used by the beoboot command when building the phase 2 image
38BProc Configuration
- Final lines specify the ethernet addresses of the nodes; examples given:
- node 0 00:50:56:00:00:00
- node 00:50:56:00:00:01
- Needed so the node can learn its IP address from the master
- The first 0 is optional; it assigns this address to node 0
- Can automatically determine and add ethernet addresses using the nodeadd command
- We will use this command later, so no need to change the file now (see the example configuration below)
- Save the file and exit from the editor
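Pulling the directives from the last few slides together, a minimal /etc/clustermatic/config for this tutorial cluster might look like the sketch below (the values are the illustrative ones used above; check the file shipped with Clustermatic for the definitive syntax):
  # interface connected to the nodes: master IP and netmask
  interface eth0 10.0.4.1 255.255.255.0
  # node IP addresses, starting at node 0
  iprange 0 10.0.4.10 10.0.4.14
  # libraries to install on the nodes
  librariesfrombinary /bin/ls /usr/bin/gdb
  # phase 2 boot image and kernel command line
  bootfile /var/clustermatic/boot.img
  kernelcommandline apm=off console=ttyS0,19200
  # ethernet addresses of the nodes (normally added by nodeadd)
  node 0 00:50:56:00:00:00
  node 00:50:56:00:00:01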
39BProc Configuration
- Other configuration files
- Should not need to be changed for this configuration
- /etc/clustermatic/config.boot
- Specifies PCI devices that are going to be used by the nodes at boot time
- Modules are included in the phase 1 and phase 2 boot images
- By default the node will try all network interfaces it can find
- /etc/clustermatic/node_up.conf
- Specifies actions to be taken in order to bring a node up
- Load modules
- Configure network interfaces
- Probe for PCI devices
- Copy files and special devices out to node
40Bringing Up BProc
- Check BProc will be started at boot time
- chkconfig --list clustermatic
- Restart master daemon and boot server
- service bjs stop
- service clustermatic restart
- service bjs start
- Load the new configuration
- BJS uses BProc, so needs to be stopped first
- Check interface has been configured correctly
- ifconfig eth0
- Should have IP address we specified in config file
41Build a Phase 2 Image
- Run the beoboot command on the master
- beoboot -2 -n --plugin mon
- -2 this is a phase 2 image
- -n image will boot over network
- --plugin add plugin to the boot image
- The following warning messages can be safely ignored
- WARNING: Didn't find a kernel module called gmac.o
- WARNING: Didn't find a kernel module called bmac.o
- Check the phase 2 image is available
- ls -l /var/clustermatic/boot.img
42Bringing Up The First Node
- Ensure both nodes are powered off
- Run the nodeadd command on the master
- /usr/lib/beoboot/bin/nodeadd -a -e -n 0 eth0
- -a automatically reload daemon
- -e write a node number for every node
- -n 0 start node numbering at 0
- eth0 interface to listen on for RARP requests
- Power on the first node
- Once the node boots, nodeadd will display a message
- New MAC 00:30:48:23:ac:9c
- Sending SIGHUP to beoserv.
43Bringing Up The Second Node
- Power on the second node
- In a few seconds you should see another message
- New MAC 00:30:48:23:ad:e1
- Sending SIGHUP to beoserv.
- Exit nodeadd when the second node is detected (^C)
- At this point, the cluster is up and fully operational
- Check cluster status
- bpstat -U
- Node(s)  Status  Mode        User  Group
- 0-1      up      ---x------  root  root
44Using the Cluster
- bpsh
- Migrates a process to one or more nodes
- Process is started on the front-end, but is immediately migrated onto the nodes
- Effect is similar to the rsh command, but no login is performed and no shell is started
- I/O forwarding can be controlled
- Output can be prefixed with the node number
- Run date command on all nodes which are up
- bpsh -a -p date
- See other arguments that are available
- bpsh -h
45Using the Cluster
- bpcp
- Copies files to a node
- Files can come from master node, or other nodes
- Note that a node only has a ram disk by default
- Copy /etc/hosts from the master to /tmp/hosts on node 0
- bpcp /etc/hosts 0:/tmp/hosts
- bpsh 0 cat /tmp/hosts
46Managing the Cluster
- bpstat
- Shows status of nodes
- up: node is up and available
- down: node is down or can't be contacted by the master
- boot: node is coming up (running node_up)
- error: an error occurred while the node was booting
- Shows owner and group of the node
- Combined with permissions, determines who can start jobs on the node
- Shows permissions of the node
- ---x------ execute permission for the node owner
- ------x--- execute permission for users in the node group
- ---------x execute permission for other users
47Managing the Cluster
- bpctl
- Control a node's status
- Reboot node 1 (takes about a minute)
- bpctl -S 1 -R
- Set state of node 0
- bpctl -S 0 -s groovy
- Only up, down, boot and error have special meaning; everything else means 'not down'
- Set the owner of node 0
- bpctl -S 0 -u nobody
- Set the permissions of node 0 so anyone can execute a job
- bpctl -S 0 -m 111
48Managing the Cluster
- bplib
- Manage libraries that are loaded on a node
- List libraries to be loaded
- bplib -l
- Add a library to the list
- bplib -a /lib/libcrypt.so.1
- Remove a library from the list
- bplib -d /lib/libcrypt.so.1
49Troubleshooting Techniques
- The tcpdump command can be used to check for node activity during and after a node has booted
- Connect a cable to the serial port on the node to check console output for errors in the boot process
- Once a node reaches node_up processing, messages will be logged in /var/log/clustermatic/node.N (where N is the node number)
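For example (a sketch using the interface and node numbering from this tutorial), you can watch for a slave's boot-time RARP traffic and follow its node_up log:
- tcpdump -i eth0 -n -e rarp
- tail -f /var/log/clustermatic/node.0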
50Module 3 LinuxBIOS
Presenter: Ron Minnich
- Objective
- To introduce LinuxBIOS
- Build and install LinuxBIOS on a cluster node
- Contents
- Overview
- Obtaining LinuxBIOS
- Source Tree
- Building LinuxBIOS
- Installing LinuxBIOS
- Booting a Cluster without LinuxBIOS
- More Information
- http://www.linuxbios.org
51LinuxBIOS Overview
- Replacement for proprietary BIOS
- Based entirely on open source code
- Can boot from a variety of devices
- Supports a wide range of architectures
- Intel P3 and P4
- AMD K7 and K8 (Opteron)
- PPC
- Alpha
- Ports available for many systems:
compaq, ibm, lippert, rlx, tyan, advantech, dell, intel, matsonic, sis, via, asus, digitallogic, irobot, motorola, stpc, winfast6300, bcm, elitegroup, lanner, nano, supermicro, bitworks, leadtek, pcchips, supertek, cocom, gigabit, lex, rcn, technoland
52Why Use LinuxBIOS?
- Proprietary BIOSes are inherently interactive
- Major problem when building clusters with 100s or 1000s of nodes
- Proprietary BIOSes misconfigure hardware
- Impossible to fix
- Examples that really happen
- Put in faster memory, but it doesn't run faster
- Can misconfigure PCI address space - huge problem
- Proprietary BIOSes can't boot over HPC networks
- No Myrinet or Quadrics drivers for Phoenix BIOS
- LinuxBIOS is FAST
- This is the least important thing about LinuxBIOS
53Definitions
- Bus
- Two or more wires used to connect two or more chips
- Bridge
- A chip that connects two or more buses of the same or different type
- Mainboard
- Aka motherboard/platform
- Carrier for chips that are interconnected via buses and bridges
- Target
- A particular instance of a mainboard, chips and LinuxBIOS configuration
- Payload
- Software loaded by LinuxBIOS from non-volatile storage into RAM
54Typical Mainboard
[Block diagram of a typical mainboard: two CPUs on the front-side bus connect to the northbridge (DDR RAM, AGP video); I/O buses (PCI) connect the northbridge to the southbridge, which carries the BIOS chip and legacy devices (keyboard, floppy)]
55What Is LinuxBIOS?
- That question has changed over time
- In 1999, at the start of the project, LinuxBIOS was literal
- Linux is the BIOS
- Hence the name
- The key questions are
- Can you learn all about the hardware on the system by asking the hardware on the system?
- Does the OS know how to do that?
- The answer, in 1995 or so on PCs, was NO in both cases
- The OS needed the BIOS to do significant work to get the machine ready to use
56What Does The BIOS Do Anyway?
- Make the processor(s) sane
- Make the chipsets sane
- Make the memory work (HARD on newer systems)
- Set up devices so you can talk to them
- Set up interrupts so they go to the right place
- Initialize memory even though you don't want it to
- Totally useless memory test
- I've never seen a useful BIOS memory test
- Spin up the disks
- Load primary bootstrap from the right place
- Start up the bootstrap
57Is It Possible With Open-Source Software?
- 1995: very hard - tightly coded assembly that barely fits into 32KB
- 1999: pretty easy - the Flash is HUGE (256KB at least)
- So the key in 1999 was knowing how to do the startup
- Lots of secret knowledge which took a while to work out
- Vendors continue to make this hard; some help
- AMD is a good example of a very helpful vendor
- The LinuxBIOS community wrote the first-ever open-source code that could
- Start up Intel and AMD SMPs
- Enable the L2 cache on the PII
- Initialize SDRAM and DDRAM
58Only Really Became Possible In 1999
- Huge 512KB Flash parts could hold the huge kernel
- Almost 400KB
- PCI bus had self-identifying hardware
- Old ISA, EISA, etc. were DEAD, thank goodness!
- SGI Visual Workstation showed you could build x86 systems without a standard BIOS
- Linux learned how to do a lot of configuration, ignoring the BIOS
- In summary
- The hardware could do it (we thought)
- Linux could do it (we thought)
59LinuxBIOS Image In The 512KB Flash
60The Basic Load Sequence ca. 1999
- Top 16 bytes: jump to the top 64K
- Top 64K:
- Set up hardware for Linux
- Copy Linux from flash to the bottom of memory
- Jump to 0x100020 (start of Linux)
- Linux: do all the stuff you normally do
- 2.2: not much, was a problem
- 2.4: did almost everything
- In 1999, Linux did not do all we needed (2.2)
- In 2000, 2.4 could do almost as much as we wanted
- The 64K bootstrap ended up doing more than we planned
61What We Thought Linux Would Do
- Do ALL the PCI setup
- Do ALL the odd processor setup
- In fact, do everything - all the 64K code had to do was copy Linux to RAM
62What We Changed (Due To Hardware)
- DRAM does not start life operational, like in the old days
- Turn-on for DRAM is very complex
- The single hardest part of LinuxBIOS is DRAM support
- To turn on DRAM, you need to turn on chipsets
- To turn on chipsets, you need to set up PCI
- And, on AMD Athlon SMPs, we need to grab hold of all the CPUs (save one) and idle them
- So the 64K chunk ended up doing more
63Getting To DRAM
[Same mainboard block diagram as slide 54, illustrating the chipset path (southbridge, northbridge) that must be initialized before DRAM can be reached]
64Another Problem
- IRQ wiring cannot be determined from hardware!
- A botch in PCI results in having to put tables in the BIOS
- This is true for all motherboards
- So, although PCI hardware is self-identifying, hardware interrupts are not
- So Linux can't figure out which interrupt is for which card
- LinuxBIOS has to pick up this additional function
65The PCI Interrupt Botch
[Diagram: PCI interrupt pins A-D and how they are wired to interrupt lines 1-4 across slots]
66What We Changed (Due To Linux)
- Linux could not set up a totally empty PCI bus
- Needed some minimal configuration
- Linux couldn't find the IRQs
- Not really its fault, but...
- Linux needed SMP hardware set up as per the BIOS
- Linux needed per-CPU hardware set up as per the BIOS
- Linux needed tables (IRQ, ACPI, etc.) set up as per the BIOS
- Over time, this is changing
- Someone has a patent on the SRAT ACPI table
- SRAT describes hardware
- So Linux ignores SRAT, talks to hardware directly
67As Of 2000/2001
- We could boot Linux from flash (quickly)
- Linux would find the hardware and the tables ready for it
- Linux would be up and running in 3-12 seconds
- Problem solved?
68Problems
- Looking at trends, in 1999 we counted on motherboard flash sizes doubling every 2 years or so
- From 1999 to 2000 the average flash size either shrank or stayed the same
- Linux continued to grow in size, though
- Linux outgrew the existing flash parts, even as they were getting smaller
- Vendors went to a socket that couldn't hold a larger replacement
- Why did vendors do this?
- Everyone wants cheap mainboards!
69LinuxBIOS Was Too Big
- Enter the alternate bootstraps
- Etherboot
- FILO
- Built-in etherboot
- Built-in USB loader
70The New Picture
[Diagram - the new layout:]
- Flash (256KB): top 16 bytes (jump), top 64K (LinuxBIOS), next 64K (Etherboot), rest empty
- Compact Flash (32MB): Linux kernel, loaded over the IDE channel by the bootloader, rest empty
71LinuxBIOS Now
- The aggregate of the 64K loader, Etherboot (or FILO), and Linux from Compact Flash?
- Too confusing
- LinuxBIOS now means only the 64K piece, even though it's not Linux any more
- On older systems, LinuxBIOS loads Etherboot, which loads Linux from Compact Flash
- Compact Flash read as a raw set of blocks
- On newer systems, LinuxBIOS loads FILO, which loads Linux from Compact Flash
- Compact Flash treated as an ext2 filesystem
72Final Question
- You're reflashing 1024 nodes on a cluster and the power fails
- You're now the proud owner of 1024 bricks, right?
- Wrong
- Linux NetworX developed fallback BIOS technology
73Fallback BIOS
[Flash (256KB) layout: jump to BIOS, fallback BIOS, normal BIOS, fallback FILO, normal FILO]
- 'Jump to BIOS' jumps to the fallback BIOS
- Fallback BIOS checks conditions
- Was the last boot successful?
- Do we want to just use fallback anyway?
- Does normal BIOS look ok?
- If things are good, use normal
- If things are bad, use fallback
- Note there is also a fallback and normal FILO
- These load different files from CF
- So the normal kernel, FILO, and BIOS can all be hosed and you're OK
74Rules For Upgrading Flash
- NEVER replace the fallback BIOS
- NEVER replace the fallback FILO
- NEVER replace the fallback kernel
- Mess up other images at will, because you can
always fall back
75A Last Word On Flash Size
- Flash size decreased to 256KB from 1999-2003
- Driven by packaging constraints
- Newer technology uses address multiplexing to pack lots of address bits onto 3 address lines - up to 128 MB!
- Driven by cell phone and MP3 player demand
- So the same small package can support 1, 2, 4 or 8 MB
- We will need them: kernel + initrd can be 4MB!
- This will allow us to realize our original vision
- Linux in flash
- Etherboot, FILO, etc., are really a hiccup
76Source Tree
- /console
- Device independent console support
- /cpu
- Implementation specific files
- /devices
- Dynamic device allocation routines
- /include
- Header files
- /lib
- Generic library functions (atoi)
- COPYING
- NEWS
- ChangeLog
- documentation
- Not enough!
- src
- /arch
- Architecture-specific files, including initial startup code
- /boot
- Main LinuxBIOS entry code: hardwaremain()
- /config
- Configuration for a given platform
77Source Tree
- /stream
- Source of payload data
- /superio
- Chip to emulate legacy crap
- targets
- Instances of specific platforms
- utils
- Utility programs
- /mainboard
- Mainboard specific code
- /northbridge
- Memory and bus interface routines
- /pc80
- Legacy crap
- /pmc
- Processor mezzanine cards
- /ram
- Generic RAM support
- /sdram
- Synchronous RAM support
- /southbridge
- Bridge to interface to legacy crap
78Building LinuxBIOS
- For this demonstration, untar source from CDROM
- mount /mnt/cdrom
- cd /tmp
- tar zxvf /mnt/cdrom/LCI/linuxbios/freebios2.tgz
- cd freebios2
- Find target that matches your hardware
- cd targets/via/epia
- Edit the configuration file Config.lb and change any settings specific to your board
- Should not need to make any changes in this case
79Building LinuxBIOS
- Build the target configuration files
- cd ../..
- ./buildtarget via/epia
- Now build the ROM image
- cd via/epia/epia
- make
- Should result in a single file
- linuxbios.rom
- Copy ROM image onto a node
- bpcp linuxbios.rom 0:/tmp
80Installing LinuxBIOS
- This will overwrite old BIOS with LinuxBIOS
- Prudent to keep a copy of the old BIOS chip
- A bad BIOS is useless junk
- Build flash utility
- cd /tmp/freebios2/util/flash_and_burn
- make
- Now flash the ROM image - please do not do this step
- bpsh 0 ./flash_rom /tmp/linuxbios.rom
- Reboot node and make sure it comes up
- bpctl -S 0 -R
- Use BProc troubleshooting techniques if not!
81Booting a Cluster Without LinuxBIOS
- Although an important part of Clustermatic, it's not always possible to deploy LinuxBIOS
- Requires a detailed understanding of technical details
- May not be available for a particular mainboard
- In this situation it is still possible to set up and boot a cluster using a combination of DHCP, TFTP and PXE
- Dynamic Host Configuration Protocol (DHCP)
- Used by the node to obtain its IP address and bootloader image name
- Trivial File Transfer Protocol (TFTP)
- Simple protocol to transfer files across an IP network
- Preboot Execution Environment (PXE)
- BIOS support for network booting
82Configuring DHCP
- Copy configuration file
- cp /mnt/cdrom/LCI/pxe/dhcpd.conf /etc
- Contains the following entry (one host entry for each node):
- ddns-update-style ad-hoc;
- subnet 10.0.4.0 netmask 255.255.255.0 {
-   host node1 {
-     hardware ethernet xx:xx:xx:xx:xx:xx;
-     fixed-address 10.0.4.14;
-     filename "pxelinux.0";
-   }
- }
- Replace xx:xx:xx:xx:xx:xx with the MAC address of the node
- Restart the server to load the new configuration
- service dhcpd restart
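Optionally (an extra step, not on the original slide), the new configuration can be syntax-checked before the restart:
- dhcpd -t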
83Configuring TFTP
- Create directory to hold bootloader
- mkdir -p /tftpboot
- Edit TFTP config file
- /etc/xinetd.d/tftp
- Enable TFTP
- Change
- disable = yes
- To
- disable = no
- Restart server
- service xinetd restart
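As an optional sanity check (assuming the tftp-hpa client is installed; the file name is illustrative), fetch a file back from the local TFTP server:
- cp /etc/hosts /tftpboot/test
- tftp localhost -c get test
- rm /tftpboot/test test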
84Configuring PXE
- Depends on BIOS, enabled through menu
- Create correct directories
- mkdir -p /tftpboot/pxelinux.cfg
- Copy bootloader and config file
- cd /mnt/cdrom/LCI/pxe
- cp pxelinux.0 /tftpboot/
- cp default /tftpboot/pxelinux.cfg/
- Generate a bootable phase 2 image
- beoboot -2 -i -o /tftpboot/node --plugin mon
- Creates a kernel and initrd image
- /tftpboot/node
- /tftpboot/node.initrd
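The default file copied above is supplied on the CD; for reference, a minimal PXELINUX configuration for this setup might look like the sketch below (the label name and kernel command line are assumptions; the kernel and initrd names match the files generated above):
  DEFAULT beoboot
  LABEL beoboot
    KERNEL node
    APPEND initrd=node.initrd console=ttyS0,19200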
85Booting The Cluster
- Run nodeadd to add node to config file
- /usr/lib/beoboot/bin/nodeadd -a -e eth0
- Node can now be powered on
- BIOS uses DHCP to obtain IP address and filename
- pxelinux.0 will be loaded
- pxelinux.0 will in turn load the phase 2 image and initrd
- Node should boot
- Check status using bpstat command
- Requires monitor to observe behavior of node
86Module 4 Filesystems
Presenter: Ron Minnich
- Objective
- To show the different kinds of filesystems that can be used with a BProc cluster and demonstrate the advantages and disadvantages of each
- Contents
- Overview
- No Local Disk, No Network Filesystem
- Local Disks
- Global Network Filesystems
- NFS
- Third Party Filesystems
- Private Network Filesystems
- V9FS
87Filesystems Overview
- Nodes in a Clustermatic cluster do not require any type of local or network filesystem to operate
- Jobs that operate with only local data need no other filesystems
- Clustermatic can provide a range of different filesystem options
88No Local Disk, No Network Filesystem
- Root filesystem is a tmpfs located in system RAM, so its size is limited to the RAM size of the nodes
- Applications that need an input deck must copy the necessary files to the nodes prior to execution and from the nodes after execution
- A 30K input deck can be copied to 1023 nodes in under 2.5 seconds
- This can be a very fast option for suitable applications
- Removes node dependency on a potentially unreliable fileserver
89Local Disks
- Nodes can be provided with one or more local disks
- Disks are automatically mounted by creating an entry in /etc/clustermatic/fstab
- Solves the local space problem, but filesystems are still not shared
- Also reduces reliability of the nodes, since they are now dependent on spinning hardware
90NFS
- Simplest solution for providing a shared filesystem on the nodes
- Will work in most environments
- Nodes are now dependent on the availability of the NFS server
- Master can act as the NFS server
- Adds extra load
- Master may already be loaded if there are a large number of nodes
- A better option is to provide a dedicated server
- Configuration can be more complex if the server is on a different network
- May require multiple network adapters in the master
- Performance is never going to be high
91Configuring Master as NFS Server
- Standard Linux NFS configuration on server
- Check NFS is enabled at boot time
- chkconfig --list nfs
- chkconfig nfs on
- Start NFS daemons
- service nfs start
- Add exported filesystem to /etc/exports
- /home 10.0.4.0/24(rw,sync,no_root_squash)
- Export filesystem
- exportfs -a
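To confirm that the export is active (an optional check, not part of the original steps), list the server's export table:
- showmount -e localhost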
92Configuring Nodes To Use NFS
- Edit /etc/clustermatic/fstab to mount the filesystem when a node boots
- MASTER:/home /home nfs nolock 0 0
- MASTER will be replaced with the IP address of the front end
- nolock must be used unless portmap is run on each node
- /home will be automatically created on the node at boot time
- Reboot the nodes
- bpctl -S allup -R
- When the nodes have rebooted, check the NFS mount is available
- bpsh 0-1 df
93Third Party Filesystems
- GPFS (http://www.ibm.com)
- Panasas (http://www.panasas.com)
- Lustre (http://www.lustre.org)
94GPFS
- Supports up to the 2.4.21 kernel (latest is 2.4.26 or 2.6.5)
- Data striping across multiple disks and multiple nodes
- Client-side data caching
- Large blocksize option for higher efficiencies
- Read-ahead and write-behind support
- Block-level locking supports concurrent access to files
- Network Shared Disk model
- A subset of nodes is allocated as storage nodes
- Software layer ships I/O requests from application nodes to storage nodes across the cluster interconnect
- Direct Attached model
- Each node must have a direct connection to all disks
- Requires a Fibre Channel switch and Storage Area Network disk configuration
95Panasas
- Latest version supports 2.4.26 kernel
- Object Storage Device (OSD)
- Intelligent disk drive
- Can be directly accessed in parallel
- PanFS Client
- Object-based installable filesystem
- Handles all mounting, namespace operations and file I/O operations
- Parallel access to multiple object storage devices
- Metadata Director
- Separate control path for managing OSDs
- Mapping of directories and files to data objects
- Authentication and secure access
- Metadata Director and OSD require dedicated proprietary hardware
- PanFS Client is open source
96Lustre
- Lustre Lite supports 2.4.24 kernel
- Full Lustre will support 2.6 kernel
- Lustre = Lustre Lite + clustered metadata scalability
- All open source
- Meta Data Servers (MDSs)
- Supports all filesystem namespace operations
- Lock manager and concurrency support
- Transaction log of metadata operations
- Handles failover of metadata servers
- Object Storage Targets (OSTs)
- Handles actual file I/O operations
- Manages storage on Object-Based Disks (OBDs)
- Object-Based Disk drivers support normal Linux filesystems
- Arbitrary network support through a Network Abstraction Layer
- MDSs and OSTs can be standard Linux hosts
97V9FS
- Provides a shared private network filesystem
- Shared
- All nodes running a parallel process can access the filesystem
- Private
- Only processes in a single process group can see or access files in the filesystem
- Mounts exist only for the duration of the process
- Node cleanup is automatic
- No hanging mount problems
- Protocol is lightweight
- Pluggable authentication services
98V9FS
- Experimental
- Can be mounted across a secure channel (e.g. ssh) for additional security
- 1000 concurrent mounts in 20 seconds
- Multiple servers will improve this
- Servers can run on cluster nodes or dedicated systems
- Filesystem can use the cluster interconnect or a dedicated network
- More information
- http://v9fs.sourceforge.net
99Configuring Master as V9FS Server
- Start server
- v9fs_server
- Can be started at boot if desired
- Create mount point on nodes
- bpsh 0-1 mkdir /private
- Can add the mkdir command to the end of the node_up script if desired
100V9FS Server Commands
- Define filesystems to be mounted on the nodes
- v9fs_addmount 10.0.4.1/home /private
- List filesystems to be mounted
- v9fs_lsmount
101V9FS On The Cluster
- Once filesystem mounts have been defined on the server, filesystems will be automatically mounted when a process is migrated to the node
- cp /etc/hosts /home
- bpsh 0-1 ls -l /private
- bpsh 0 cat /private/hosts
- Remove filesystems to be mounted
- v9fs_rmmount /private
- bpsh 0-1 ls -l /private
102One Note
- Note that we ran the file server as root
- You can actually run the file server as you
- If run as you, there is added security
- The server can't run amok
- And subtracted security
- We need a better authentication system
- Can use ssh, but something tailored to the cluster would be better
- Note that the server can chroot for even more safety
- Or be told to serve from a file, not a file system
- There is tremendous flexibility and capability in this approach
103Also
- Recall that on 2.4.19 and later there is a /proc entry for each process - /proc/mounts
- It really is quite private
- There is a lot of potential capability here we have not started to use
- Still trying to determine need/demand
104Why Use V9FS?
- You've got some wacko library you need to use for one application
- You've got a giant file which you want to serve as a file system
- You've got data that you want visible to you only
- Original motivation: compartmentation in grids (1996)
- You want a mount point but it's not possible for some reason
- You want an encrypted data file system
105Wacko Library
- Clustermatic systems (intentionally) limit the number of libraries on nodes
- Current systems have about 2GB worth of libraries
- Putting all these on nodes would take 2GB of memory!
- Keeping node configuration consistent is a big task on 1000 nodes
- Need to do rsync, or whatever
- Lots of work, lots of time, for libraries you don't need
- What if you want some special library available all the time?
- Painful to ship it out, set up paths, etc., every time
- V9FS allows custom mounts to be served from your home directory
106Giant File As File System
- V9FS is a user-level server
- i.e. an ordinary program
- On Plan 9, there are all sorts of nifty uses of this
- Servers for making a tar file look like a read-only file system
- Or a cpio archive, or whatever
- So, instead of trying to locate something in the middle of a huge tar file
- Run the server to serve the tar file
- Save disk blocks and time
107Data Visible To You Only
- This usage is still very important
- Run your own personal server (assuming authentication is fixed) or use the global server
- Files that you see are not visible to anyone else at all
- Even root
- On Unix, if you can't get to the mount point, you can't see the files
- On Linux with private mounts, other people don't even know the mount point exists
108You Want A Mount Point But Can't Get One
- "Please Mr. Sysadmin, sir, can I have another mount point?"
- "NO!"
- System administrators have enough to do without having to
- Modify fstab on all nodes
- Modify permissions on a server
- And so on
- Just to make your library available on the nodes?
- Doubtful
- V9FS gives a level of flexibility that you can't get otherwise
109Want Encrypted Data File System
- This one is really interesting
- Crypto file systems are out there in abundance
- But they always require lots of root involvement to set up
- Since V9FS is user-level, you can run one yourself
- Set up your own keys, crypto, all your own stuff
- Serve a file system out of one big encrypted file
- Copy the file elsewhere, leaving it encrypted
- Not easily done with existing file systems
- So you have a personal, portable, encrypted file system
110So Why Use V9FS?
- Opens up a wealth of new ways to store, access and protect your data
- Don't have to bother system administrators all the time
- Can extend the file system name space of a node to your specification
- Can create a whole file system in one file, and easily move that file system around (cp, scp, etc.)
- Can do special per-user policy on the file system
- Tar or compressed file format
- Per-user crypto file system
- Provides capabilities you can't get any other way
111Module 5 Supermon
Presenter: Matt Sottile
- Objectives
- Present an overview of Supermon
- Demonstrate how to install and use Supermon to monitor a cluster
- Contents
- Overview of Supermon
- Starting Supermon
- Monitoring the Cluster
- More Information
- http://supermon.sourceforge.net
112Overview of Supermon
- Provides monitoring solution for clusters
- Capable of high sampling rates (Hz)
- Very small memory and computational footprint
- Sampling rates are controlled by clients at run-time
- Completely extensible without modification
- User applications
- Kernel modules
113Node View
- Data sources
- Kernel module(s)
- User application
- Mon daemon
- IANA-registered port number
- 2709
114Cluster View
- Data sources
- Node mon daemons
- Other supermons
- Supermon daemon
- Same port number
- 2709
- Same protocol at every level
- Composable, extensible
115Data Format
- S-expressions
- Used in LISP, Scheme, etc.
- Very mature
- Extensible, composable, ASCII
- Very portable
- Easily changed to support richer data and structures
- Composable
- (expr 1) o (expr 2) = ((expr 1) (expr 2))
- Fast to parse, low memory and time overhead
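Purely for illustration (the variable names are hypothetical, not Supermon's actual schema), monitoring data composed this way might look like:
- (mon (cpuinfo (usertime 1234) (systime 567)) (meminfo (free 102400)))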
116Data Protocol
- # command
- Provides a description of what data is provided and how it is structured
- Shows how the data is organized in terms of rough categories containing specific data variables (e.g. cpuinfo category, usertime variable)
- S command
- Request actual data
- Structure matches that described by the # command
- R command
- Revive clients that disappeared and were restarted
- N command
- Add new clients
117User Defined Data
- Each node allows user-space programs to push data into mon to be sent out on the next sample
- Only requirement:
- Data is arbitrary text
- Recommended to be an s-expression
- Very simple interface
- Uses UNIX domain socket for security
118Starting Supermon
- Start supermon daemon
- supermon n0 n1 2> /dev/null
- Check output from the kernel
- bpsh 1 cat /proc/sys/supermon/#
- bpsh 1 cat /proc/sys/supermon/S
- Check sensor output from the kernel
- bpsh 1 cat /proc/sys/supermon_sensors_t/#
- bpsh 1 cat /proc/sys/supermon_sensors_t/S
119Supermon In Action
- Check mon output from a node
- telnet n1 2709
- S
- close
- Check output from supermon daemon
- telnet localhost 2709
- S
- close
120Supermon In Action
- Read supermon data and display to console
- supermon_stats options
- Create trace file for off-line analysis
- supermon_tracer options
- supermon_stats can be used to process trace data off-line
121Module 6 BJS
Presenter: Matt Sottile
- Objectives
- Introduce the BJS scheduler
- Configure and submit jobs using BJS
- Contents
- Overview of BJS
- BJS Configuration
- Using BJS
122Overview of BJS
- Designed to cover the needs of most users
- Simple, easy to use
- Extensible interface for adding policies
- Used in production environments
- Optimized for use with BProc
- Traditional schedulers require O(N) processes; BJS requires O(1)
- Schedules and unschedules 1000 processes in 0.1 seconds
123BJS Configuration
- Nodes are divided into pools, each with a policy
- Standard policies
- Filler
- Attempts to backfill unused nodes
- Shared
- Allows multiple jobs to run on a single node
- Simple
- Very simple FIFO scheduling algorithm
124Extending BJS
- BJS was designed to be extensible
- Policies are plug-ins
- They require coding to the BJS C API
- Not hard, but nontrivial
- Particularly useful for installation-specific policies
- Based on shared-object libraries
- A fair-share policy is currently in testing at LANL for BJS
- Enforce fairness between groups
- Enforce fairness between users within a group
- Optimal scheduling between a user's own jobs
125BJS Configuration
- BJS configuration file
- /etc/clustermatic/bjs.config
- Global configuration options (usually don't need to be changed)
- Location of spool files
- spooldir
- Location of dynamically loaded policy modules
- policypath
- Location of UNIX domain socket
- socketpath
- Location of user accounting log file
- acctlog
126BJS Configuration
- Per-pool configuration options
- Defines the default pool
- pool default
- Name of the policy module for this pool (must exist in policydir)
- policy filler
- Nodes that are in this pool
- nodes 0-10000
- Maximum duration of a job (wall clock time)
- maxsecs 86400
- Optional: users permitted to submit to this pool
- users
- Optional: groups permitted to submit to this pool
- groups
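Putting these keywords together, the per-pool section of bjs.config for this tutorial cluster might look like the sketch below (the node range is narrowed to our two nodes; the exact stanza syntax should be checked against the bjs.config shipped with Clustermatic):
  pool default
  policy filler
  nodes 0-1
  maxsecs 86400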
127BJS Configuration
- Restart BJS daemon to accept changes
- service bjs restart
- Check nodes are available
- bjsstat
- Pool: default   Nodes (total/up/free): 5/2/2
- ID   User   Command   Requirements
128Using BJS
- bjssub
- Submit a request to allocate nodes
- ONLY runs the command on the front end
- The command is responsible for executing on nodes
- -p specify node pool
- -n number of nodes to allocate
- -s run time of job (in seconds)
- -i run in interactive mode
- -b run in batch mode (default)
- -D set working directory
- -O redirect command output to file
129Using BJS
- bjsstat
- Show status of node pools
- Name of pool
- Total number of nodes in pool
- Number of operational nodes in pool
- Number of free nodes in pool
- Lists status of jobs in each pool
130Using BJS
- bjsctl
- Terminate a running job
- -r specify ID number of job to terminate
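For example, to terminate a job whose ID is 59 (the ID reported by bjsstat, or at submission time as in the interactive example on the next slide):
- bjsctl -r 59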
131Interactive vs Batch
- Interactive jobs
- Schedule a node or set of nodes for use interactively
- bjssub will wait until nodes are available, then run the command
- Good during development
- Good for single-run, short-runtime jobs
- Hands-on interaction with nodes
- bjssub -p default -n 2 -s 1000 -i bash
- Waiting for interactive job nodes.
- (nodes 0 1)
- Starting interactive job.
- NODES=0,1
- JOBID=59
- > bpsh $NODES date
- > exit
132Interactive vs Batch
- Batch jobs
- Schedule a job to run as soon as the requested nodes are available
- bjssub will queue the command until nodes are available
- Good for long-running jobs that require little or no interaction
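As an illustration (the script name, output file and working directory are hypothetical; the options are those listed on the bjssub slide), a batch submission might look like:
- bjssub -p default -n 2 -s 3600 -b -D /home/user -O myjob.out ./myjob.sh
- where myjob.sh is a script that uses bpsh (or mpirun) to start the real work on the allocated nodes, since the submitted command itself runs only on the front end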