1
CLUSTERMATIC An Innovative Approach to Cluster
Computing
  • 2004 LACSI Symposium
  • The Los Alamos Computer Science Institute

LA-UR-03-8015
2
Tutorial Outline (Morning)
3
Tutorial Outline (Afternoon)
4
Tutorial Introduction
  • Tutorial is divided into modules
  • Each module has clear objectives
  • Modules comprise a short theory component, followed
    by a hands-on component
  • Icons on the slides indicate theory or hands-on
    material

Please ask questions at any time!
5
Module 1: Overview of Clustermatic
Presenter: Greg Watson
  • Objective
  • To provide a brief overview of the Clustermatic
    architecture
  • Contents
  • What is Clustermatic?
  • Why Use Clustermatic?
  • Clustermatic Components
  • Installing Clustermatic
  • More Information
  • http://www.clustermatic.org

6
What is Clustermatic?
  • Clustermatic is a suite of software that
    completely controls a cluster from the BIOS to
    the high level programming environment
  • Clustermatic is modular
  • Each component is responsible for a specific set
    of activities in the cluster
  • Each component can be used independently of other
    components

7
Why Use Clustermatic?
  • Clustermatic clusters are easy to build, manage
    and program
  • A cluster can be installed and operational in a
    few minutes
  • The architecture is designed for simplicity,
    performance and reliability
  • Utilization is maximized by ensuring machine is
    always available
  • Supports machines from 2 to 1024 nodes (and
    counting)
  • System administration is no more onerous than for
    a single machine, regardless of the size of the
    cluster
  • Upgrade O/S on entire machine with a single
    command
  • No need to synchronize node software versions
  • The entire software suite is GPL open-source

8
Clustermatic Components
  • LinuxBIOS
  • Replaces normal BIOS
  • Improves boot performance and node startup times
  • Eliminates reliance on proprietary BIOS
  • No interaction required, important for 100s of
    nodes

9
Clustermatic Components
  • Linux
  • Mature O/S
  • Demonstrated performance in HPC applications
  • No proprietary O/S issues
  • Extensive hardware and network device support

10
Clustermatic Components
  • V9FS
  • Avoids problems associated with global mounts
  • Processes are provided with a private shared
    filesystem
  • Namespace exists only for duration of process
  • Nodes are returned to pristine state once
    process is complete

Figure: Clustermatic software stack (Users, Compilers & Debuggers, BJS, BProc, Supermon, MPI, Beoboot, v9fs, Linux, LinuxBIOS)
11
Clustermatic Components
  • Beoboot
  • Manages booting cluster nodes
  • Employs a tree-based boot scheme for
    fast/scalable booting
  • Responsible for configuring nodes once they have
    booted

12
Clustermatic Components
  • BProc
  • Manages a single process-space across machine
  • Responsible for process startup and management
  • Provides commands for starting processes, copying
    files to nodes, etc.

13
Clustermatic Components
  • BJS
  • BProc Job Scheduler
  • Enforces policies for allocating jobs to nodes
  • Nodes are allocated to pools which can have
    different policies

14
Clustermatic Components
  • Supermon
  • Provides a system monitoring infrastructure
  • Provides kernel and hardware status information
  • Low overhead on compute nodes and interconnect
  • Extensible protocol based on s-expressions
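
The slide says only that the Supermon protocol is based on s-expressions. As a rough illustration of why that format is easy to extend and parse, here is a minimal s-expression reader. The sample message and its field names are hypothetical and are not taken from Supermon itself.

```python
def tokenize(text):
    """Split an s-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursively build nested Python lists from a token list."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return items
    # Atoms become ints when possible, otherwise stay as strings.
    try:
        return int(token)
    except ValueError:
        return token


# Hypothetical sample; real Supermon messages use their own field names.
sample = "(node 0 (cpu (user 12) (system 3)) (mem (free 96)))"
tree = parse(tokenize(sample))
print(tree)
```

New fields can be added as extra nested lists without breaking a reader that ignores keys it does not know.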

15
Clustermatic Components
  • MPI
  • Uses standard MPICH 1.2 (ANL) or LA-MPI (LANL)
  • Supports Myrinet (GM) and Ethernet (P4) devices
  • Supports debugging with TotalView

16
Clustermatic Components
  • Compilers & Debuggers
  • Commercial and non-commercial compilers available
  • GNU, Intel, Absoft
  • Commercial and non-commercial debuggers available
  • GDB, TotalView, DDT

17
Linux Support
  • Linux Variants
  • For RedHat Linux
  • Installed as a series of RPMs
  • Supports RH 9 with the 2.4.22 kernel
  • For other Linux distributions
  • Must be compiled and installed from source

18
Tutorial CD Contents
  • RPMs for all Clustermatic components
  • Architectures included for x86, x86_64, athlon,
    ppc and alpha
  • Full distribution available on Clustermatic web
    site (www.clustermatic.org)
  • SRPMs for all Clustermatic components
  • Miscellaneous RPMs
  • Full source tree for LinuxBIOS (gzipped tar
    format)
  • Source for MPI example programs
  • Presentation handouts

19
Cluster Hardware Setup
  • Laptop installed with RH9
  • Will act as the master node
  • Two slave nodes
  • Preloaded with LinuxBIOS and a phase 1 kernel in
    flash
  • iTuner M-100 VIA EPIA 533MHz 128MB
  • 8 port 100baseT switch
  • Total cost (excluding laptop) $800

20
Clustermatic Installation
  • Installation process for RedHat
  • Log into laptop
  • Username: root
  • Password: lci2004
  • Insert and mount CD-ROM
  • mount /mnt/cdrom
  • Locate install script
  • cd /mnt/cdrom/LCI
  • Install Clustermatic
  • ./install_clustermatic
  • Reboot to load new kernel
  • reboot

21
Module 2: BProc & Beoboot
Presenter: Erik Hendriks
  • Objective
  • To introduce BProc and gain a basic understanding
    of how it works
  • To introduce Beoboot and understand how it fits
    together with BProc
  • To configure and manage a BProc cluster
  • Contents
  • Overview of BProc
  • Overview of Beoboot
  • Configuring BProc For Your Cluster
  • Bringing Up BProc
  • Bringing Up The Nodes
  • Using the Cluster
  • Managing a Cluster
  • Troubleshooting Techniques

22
BProc Overview
  • BProc = Beowulf Distributed Process Space
  • BProc is a Linux kernel modification which
    provides
  • A single system image for process control in a
    cluster
  • Process migration for creating processes in a
    cluster
  • BProc is the foundation for the rest of the
    Clustermatic software

23
Process Space
  • A process space is
  • A pool of process IDs
  • A process tree
  • A set of parent/child relationships
  • Every instance of the Linux kernel has a process
    space
  • A distributed process space allows parts of one
    node's process space to exist on other nodes

24
Distributed Process Space
  • With a distributed process space, some processes
    will exist on other nodes
  • Every remote process has a place holder in the
    process tree
  • All remote processes remain visible
  • Process related system calls (fork, wait, kill,
    etc.) work identically on local and remote
    processes
  • Kill works on remote processes
  • No runaway processes
  • Ptrace works on remote processes
  • Strace, gdb, TotalView transparently work on
    remote processes!
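
The point of the slide is that ordinary process calls need no changes to act on remote processes. The sketch below uses only standard POSIX calls through Python's `os` module on a local child. Nothing in it is BProc-specific. The claim from the slide is that the same `kill` and `wait` calls would work unchanged if the child's real body were on another node.

```python
import os
import signal
import time


def spawn_and_kill():
    """Fork a child, signal it, and return the signal that ended it."""
    pid = os.fork()
    if pid == 0:
        # Child: sleep until the parent signals us.
        while True:
            time.sleep(1)
    # Parent: plain kill() and waitpid() on the child's PID.
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None


if __name__ == "__main__":
    print(spawn_and_kill())
```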

A node with two remote processes
25
Distributed Process Space Example
Figure: master and slave nodes, with process B shown
  • The master starts processes on slave nodes
  • These remote processes remain visible on the
    master node
  • Not all processes on the slave are part of the
    master's process space

26
Process Creation Example
Figure: master and slave nodes, showing process A, its fork() call and child process B
  • Process A migrates to the slave node
  • Process A calls fork() to create a child
    process B
  • A new place holder for B is created
  • Once the place holder exists B is allowed to run

27
BProc in a Cluster
  • In a BProc cluster, there is a single master and
    many slaves
  • Users (including root) only log into the master
  • The master's process space is the process space
    for the cluster
  • All processes in the cluster are
  • Created from the master
  • Visible on the master
  • Controlled from the master

28
Process Migration
  • BProc provides a process migration system to
    place processes on other nodes in the cluster
  • Process migration on BProc is not
  • Transparent
  • Preemptive
  • A process must call the migration system call in
    order to move
  • Process migration on BProc is
  • Very fast (1.9s to place a 16MB process on 1024
    nodes)
  • Scalable
  • It can create many copies for the same process
    (e.g. MPI startup) very efficiently
  • O(log copies)
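
To see where the O(log copies) figure can come from, the sketch below counts rounds under the assumption that every existing copy creates one further copy per round. The slides do not state the actual fan-out BProc uses, so the doubling factor here is illustrative only.

```python
def rounds_to_reach(copies):
    """Rounds needed if every existing copy spawns one more copy per round."""
    have, rounds = 1, 0
    while have < copies:
        have *= 2
        rounds += 1
    return rounds


for n in (2, 16, 1024):
    print(n, "copies need", rounds_to_reach(n), "rounds")
```

Under this assumption 1024 copies take 10 rounds, rather than the 1023 sequential steps a single sender would need.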

29
Process Migration
  • Process migration does preserve
  • The contents of memory and memory related
    metadata
  • CPU State (registers)
  • Signal handler state
  • Process migration does not preserve
  • Shared memory regions
  • Open files
  • SysV IPC resources
  • Just about anything else that isn't memory

30
Running on a Slave Node
  • BProc is a process management system
  • All other system calls are handled locally on the
    slave node
  • BProc does not impose any extra overhead on
    non-process related system calls
  • File and Network I/O are always handled locally
  • Calling open() will not cause contact with the
    master node
  • This means network and file I/O are as fast as
    they can be

31
Implications
  • All processes are started from the master with
    process migration
  • All processes remain visible on the master
  • No runaways
  • Normal UNIX process control works for ALL
    processes in the cluster
  • No need for direct interaction
  • There is no need to log into a node to control
    what is running there
  • No software is required on the nodes except the
    BProc slave daemon
  • ZERO software maintenance on the nodes!
  • Diskless nodes without NFS root
  • Reliable nodes

32
Beoboot
  • BProc does not provide any mechanism to get a
    node booted
  • Beoboot fills this role
  • Hardware detection and driver loading
  • Configuration of network hardware
  • Generic network boot using Linux
  • Starts the BProc slave daemon
  • Beoboot also provides the corresponding boot
    servers and utility programs on the front end

33
Booting a Slave Node
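  • Phase 1: small kernel, minimal functionality
  • Slave to master: request (who am I?)
  • Master to slave: response (IPs, servers, etc.)
  • Slave to master: request phase 2 image
  • Master to slave: phase 2 image
  • Slave: load phase 2 image (using magic)
  • Phase 2: operational kernel, full featured
  • Slave to master: request (who am I again?)
  • Master to slave: response
  • Slave to master: BProc slave connect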
34
Loading the Phase 2 Image
  • Two Kernel Monte is a piece of software which
    will load a new Linux kernel replacing one that
    is already running
  • This allows you to use Linux as your boot loader!
  • Using Linux means you can use any network that
    Linux supports.
  • There is no PXE BIOS or Etherboot support for
    Myrinet, Quadrics or InfiniBand
  • Pink network-boots over Myrinet, which allowed us
    to avoid buying a 1024-port Ethernet network
  • Currently supports x86 (including AMD64) and Alpha

35
BProc Configuration
  • Main configuration file
  • /etc/clustermatic/config
  • Edit with favorite text editor
  • Lines consist of comments (starting with #)
  • Rest are keyword followed by arguments
  • Specify interface
  • interface eth0 10.0.4.1 255.255.255.0
  • eth0 is interface connected to nodes
  • IP of master node is 10.0.4.1
  • Netmask of master node is 255.255.255.0
  • Interface will be configured when BProc is started
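
The file format described above (comment lines, otherwise a keyword followed by arguments) can be read with a few lines of code. This is a sketch of that format only, not the parser BProc uses, and it assumes "#" is the comment character.

```python
def parse_config(text):
    """Return a list of (keyword, [arguments]) tuples from config text."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (assumed '#')
        if not line:
            continue
        keyword, *args = line.split()
        entries.append((keyword, args))
    return entries


sample = """\
# interface connected to the nodes
interface eth0 10.0.4.1 255.255.255.0
iprange 0 10.0.4.10 10.0.4.14
"""
print(parse_config(sample))
```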

36
BProc Configuration
  • Specify range of IP addresses for nodes
  • iprange 0 10.0.4.10 10.0.4.14
  • Start assigning IP addresses at node 0
  • First address is 10.0.4.10, last is 10.0.4.14
  • The size of this range determines the number of
    nodes in the cluster
  • Next entries are default libraries to be
    installed on nodes
  • Can explicitly specify libraries or extract
    library information from an executable
  • Need to add entry to install extra libraries
  • librariesfrombinary /bin/ls /usr/bin/gdb
  • The bplib command can be used to see libraries
    that will be loaded
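
Because the size of the iprange determines the node count, it is worth checking the arithmetic. The sketch below expands the example range into a node-number-to-address map. The function name is made up for illustration.

```python
import ipaddress


def node_addresses(first_node, first_ip, last_ip):
    """Map node numbers to IPs for an inclusive iprange."""
    start = int(ipaddress.IPv4Address(first_ip))
    end = int(ipaddress.IPv4Address(last_ip))
    return {
        first_node + i: str(ipaddress.IPv4Address(start + i))
        for i in range(end - start + 1)
    }


nodes = node_addresses(0, "10.0.4.10", "10.0.4.14")
print(len(nodes), "nodes:", nodes)
```

For the range on the slide this gives five nodes, numbered 0 to 4.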

37
BProc Configuration
  • Next line specifies the name of the phase 2 image
  • bootfile /var/clustermatic/boot.img
  • Should be no need to change this
  • Need to add a line to specify kernel command line
  • kernelcommandline apm=off console=ttyS0,19200
  • Turn APM support off (since these nodes don't
    have any)
  • Set console to use ttyS0 and speed to 19200
  • This is used by beoboot command when building
    phase 2 image

38
BProc Configuration
  • Final lines specify ethernet addresses of nodes,
    examples given
  • node 0 00:50:56:00:00:00
  • node 00:50:56:00:00:01
  • Needed so node can learn its IP address from
    master
  • First 0 is optional, assign this address to node
    0
  • Can automatically determine and add ethernet
    addresses using the nodeadd command
  • We will use this command later, so no need to
    change now
  • Save file and exit from editor

39
BProc Configuration
  • Other configuration files
  • Should not need to be changed for this
    configuration
  • /etc/clustermatic/config.boot
  • Specifies PCI devices that are going to be used
    by the nodes at boot time
  • Modules are included in phase 1 and phase 2 boot
    images
  • By default the node will try all network
    interfaces it can find
  • /etc/clustermatic/node_up.conf
  • Specifies actions to be taken in order to bring a
    node up
  • Load modules
  • Configure network interfaces
  • Probe for PCI devices
  • Copy files and special devices out to node

40
Bringing Up BProc
  • Check BProc will be started at boot time
  • chkconfig --list clustermatic
  • Restart master daemon and boot server
  • service bjs stop
  • service clustermatic restart
  • service bjs start
  • Load the new configuration
  • BJS uses BProc, so needs to be stopped first
  • Check interface has been configured correctly
  • ifconfig eth0
  • Should have IP address we specified in config file

41
Build a Phase 2 Image
  • Run the beoboot command on the master
  • beoboot -2 -n --plugin mon
  • -2 this is a phase 2 image
  • -n image will boot over network
  • --plugin add plugin to the boot image
  • The following warning messages can be safely
    ignored
  • WARNING: Didn't find a kernel module called
    gmac.o
  • WARNING: Didn't find a kernel module called
    bmac.o
  • Check phase 2 image is available
  • ls -l /var/clustermatic/boot.img

42
Bringing Up The First Node
  • Ensure both nodes are powered off
  • Run the nodeadd command on the master
  • /usr/lib/beoboot/bin/nodeadd -a -e -n 0 eth0
  • -a automatically reload daemon
  • -e write a node number for every node
  • -n 0 start node numbering at 0
  • eth0 interface to listen on for RARP requests
  • Power on the first node
  • Once the node boots, nodeadd will display a
    message
  • New MAC: 00:30:48:23:ac:9c
  • Sending SIGHUP to beoserv.

43
Bringing Up The Second Node
  • Power on the second node
  • In a few seconds you should see another message
  • New MAC: 00:30:48:23:ad:e1
  • Sending SIGHUP to beoserv.
  • Exit nodeadd when second node detected (^C)
  • At this point, cluster is up and fully
    operational
  • Check cluster status
  • bpstat -U
  • Node(s)  Status  Mode        User  Group
  • 0-1      up      ---x------  root  root

44
Using the Cluster
  • bpsh
  • Migrates a process to one or more nodes
  • Process is started on front-end, but is
    immediately migrated onto nodes
  • Effect similar to rsh command, but no login is
    performed and no shell is started
  • I/O forwarding can be controlled
  • Output can be prefixed with node number
  • Run date command on all nodes which are up
  • bpsh -a -p date
  • See other arguments that are available
  • bpsh -h

45
Using the Cluster
  • bpcp
  • Copies files to a node
  • Files can come from master node, or other nodes
  • Note that a node only has a ram disk by default
  • Copy /etc/hosts from master to /tmp/hosts on node
    0
  • bpcp /etc/hosts 0:/tmp/hosts
  • bpsh 0 cat /tmp/hosts

46
Managing the Cluster
  • bpstat
  • Shows status of nodes
  • up: node is up and available
  • down: node is down or can't be contacted by master
  • boot: node is coming up (running node_up)
  • error: an error occurred while the node was
    booting
  • Shows owner and group of node
  • Combined with permissions, determines who can
    start jobs on the node
  • Shows permissions of the node
  • ---x------: execute permission for node owner
  • ------x---: execute permission for users in node
    group
  • ---------x: execute permission for other users
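
The permission strings above follow the familiar owner/group/other layout, with only the execute bits in use. The sketch below renders a numeric mode in that form. It assumes the mode is octal, as with chmod, and the function name is made up for illustration.

```python
def mode_string(mode):
    """Render an octal execute mask (e.g. 0o111) as a bpstat-style string."""
    chars = ["-"] * 10
    if mode & 0o100:
        chars[3] = "x"  # owner
    if mode & 0o010:
        chars[6] = "x"  # group
    if mode & 0o001:
        chars[9] = "x"  # other
    return "".join(chars)


for mode in (0o100, 0o110, 0o111):
    print(oct(mode), mode_string(mode))
```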

47
Managing the Cluster
  • bpctl
  • Control a node's status
  • Reboot node 1 (takes about a minute)
  • bpctl -S 1 -R
  • Set state of node 0
  • bpctl -S 0 -s groovy
  • Only up, down, boot and error have special
    meaning, everything else means not down
  • Set owner of node 0
  • bpctl -S 0 -u nobody
  • Set permissions of node 0 so anyone can execute a
    job
  • bpctl -S 0 -m 111

48
Managing the Cluster
  • bplib
  • Manage libraries that are loaded on a node
  • List libraries to be loaded
  • bplib -l
  • Add a library to the list
  • bplib -a /lib/libcrypt.so.1
  • Remove a library from the list
  • bplib -d /lib/libcrypt.so.1

49
Troubleshooting Techniques
  • The tcpdump command can be used to check for node
    activity during and after a node has booted
  • Connect a cable to serial port on node to check
    console output for errors in boot process
  • Once node reaches node_up processing, messages
    will be logged in /var/log/clustermatic/node.N
    (where N is node number)

50
Module 3: LinuxBIOS
Presenter: Ron Minnich
  • Objective
  • To introduce LinuxBIOS
  • Build and install LinuxBIOS on a cluster node
  • Contents
  • Overview
  • Obtaining LinuxBIOS
  • Source Tree
  • Building LinuxBIOS
  • Installing LinuxBIOS
  • Booting a Cluster without LinuxBIOS
  • More Information
  • http://www.linuxbios.org

51
LinuxBIOS Overview
  • Replacement for proprietary BIOS
  • Based entirely on open source code
  • Can boot from a variety of devices
  • Supports a wide range of architectures
  • Intel P3 and P4
  • AMD K7 and K8 (Opteron)
  • PPC
  • Alpha
  • Ports available for many systems

advantech, asus, bcm, bitworks, cocom, compaq, dell,
digitallogic, elitegroup, gigabit, ibm, intel, irobot,
lanner, leadtek, lex, lippert, matsonic, motorola, nano,
pcchips, rcn, rlx, sis, stpc, supermicro, supertek,
technoland, tyan, via, winfast6300
52
Why Use LinuxBIOS?
  • Proprietary BIOSs are inherently interactive
  • Major problem when building clusters with 100s
    or 1000s of nodes
  • Proprietary BIOSs misconfigure hardware
  • Impossible to fix
  • Examples that really happen
  • Put in faster memory, but it doesn't run faster
  • Can misconfigure PCI address space - huge problem
  • Proprietary BIOSs can't boot over HPC networks
  • No Myrinet or Quadrics drivers for Phoenix BIOS
  • LinuxBIOS is FAST
  • This is the least important thing about LinuxBIOS

53
Definitions
  • Bus
  • Two or more wires used to connect two or more
    chips
  • Bridge
  • A chip that connects two or more busses of the
    same or different type
  • Mainboard
  • Aka motherboard/platform
  • Carrier for chips that are interconnected via
    buses and bridges
  • Target
  • A particular instance of a mainboard, chips and
    LinuxBIOS configuration
  • Payload
  • Software loaded by LinuxBIOS from non volatile
    storage into RAM

54
Typical Mainboard
Figure: block diagram of a typical mainboard (two CPUs, front-side bus, northbridge, DDR RAM, AGP video, I/O buses (PCI), southbridge, BIOS chip, keyboard, floppy and other legacy devices)
55
What Is LinuxBIOS?
  • That question has changed over time
  • In 1999, at the start of the project, LinuxBIOS
    was literal
  • Linux is the BIOS
  • Hence the name
  • The key questions are
  • Can you learn all about the hardware on the
    system by asking the hardware on the system?
  • Does the OS know how to do that?
  • The answer, in 1995 or so on PCs, was NO in
    both cases
  • OS needed the BIOS to do significant work to get
    the machine ready to use

56
What Does The BIOS Do Anyway?
  • Make the processor(s) sane
  • Make the chipsets sane
  • Make the memory work (HARD on newer systems)
  • Set up devices so you can talk to them
  • Set up interrupts so they go to the right place
  • Initialize memory even though you don't want it
    to
  • Totally useless memory test
  • I've never seen a useful BIOS memory test
  • Spin up the disks
  • Load primary bootstrap from the right place
  • Start up the bootstrap

57
Is It Possible With Open-Source Software?
  • 1995: very hard - tightly coded assembly that
    barely fits into 32KB
  • 1999: pretty easy - the Flash is HUGE (256KB at
    least)
  • So the key in 1999 was knowing how to do the
    startup
  • Lots of secret knowledge which took a while to
    work out
  • Vendors continue to make this hard, some help
  • AMD is good example of a very helpful vendor
  • LinuxBIOS community wrote the first-ever
    open-source code that could
  • Start up Intel and AMD SMPs
  • Enable L2 cache on the PII
  • Initialize SDRAM and DDR RAM

58
Only Really Became Possible In 1999
  • Huge 512K byte Flash parts could hold the huge
    kernel
  • Almost 400KB
  • PCI bus had self-identifying hardware
  • Old ISA, EISA, etc. were DEAD thank goodness!
  • SGI Visual Workstation showed you could build x86
    systems without standard BIOS
  • Linux learned how to do a lot of configuration,
    ignoring the BIOS
  • In summary
  • The hardware could do it (we thought)
  • Linux could do it (we thought)

59
LinuxBIOS Image In The 512KB Flash
60
The Basic Load Sequence ca. 1999
  • Top 16 bytes: jump to top 64K
  • Top 64K:
  • Set up hardware for Linux
  • Copy Linux from flash to bottom of memory
  • Jump to 0x100020 (start of Linux)
  • Linux: do all the stuff you normally do
  • 2.2: not much, which was a problem
  • 2.4: did almost everything
  • In 1999, Linux did not do all we needed (2.2)
  • In 2000, 2.4 could do almost as much as we want
  • The 64K bootstrap ended up doing more than we
    planned
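
The "top 16 bytes" and "top 64K" in the load sequence can be turned into concrete addresses. The sketch below assumes the conventional x86 arrangement, which the slide does not spell out: the flash part is mapped so that it ends at the 4 GiB boundary, and the processor starts executing 16 bytes below that boundary.

```python
FOUR_GIB = 1 << 32


def flash_map(flash_size):
    """Return key addresses for a flash part mapped just below 4 GiB."""
    return {
        "flash_base": FOUR_GIB - flash_size,
        "top_64k": FOUR_GIB - 64 * 1024,
        "reset_vector": FOUR_GIB - 16,
    }


# 512KB is the flash size used elsewhere in this module.
for name, addr in flash_map(512 * 1024).items():
    print(f"{name}: {addr:#010x}")
```

The 16 bytes at the reset vector are only enough for a jump, which is why the first step is a jump into the 64K bootstrap.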

61
What We Thought Linux Would Do
  • Do ALL the PCI setup
  • Do ALL the odd processor setup
  • In fact, do everything: all the 64K code had to
    do was copy Linux to RAM

62
What We Changed (Due To Hardware)
  • DRAM does not start life operational, like the
    old days
  • Turn-on for DRAM is very complex
  • The single hardest part of LinuxBIOS is DRAM
    support
  • To turn on DRAM, you need to turn on chipsets
  • To turn on chipsets, you need to set up PCI
  • And, on AMD Athlon SMPs, we need to grab hold of
    all the CPUs (save one) and idle them
  • So the 64K chunk ended up doing more

63
Getting To DRAM
Figure: mainboard block diagram showing the path to DRAM (two CPUs, front-side bus, northbridge, DDR RAM, AGP video, I/O buses (PCI), southbridge, BIOS chip, keyboard, floppy and other legacy devices)
64
Another Problem
  • IRQ wiring can not be determined from hardware!
  • Botch in PCI results in having to put tables in
    the BIOS
  • This is true for all motherboards
  • So, although PCI hardware is self-identifying,
    hardware interrupts are not
  • So Linux can't figure out what interrupt is for
    what card
  • LinuxBIOS has to pick up this additional function

65
The PCI Interrupt Botch
Figure: PCI interrupt wiring, shown twice with labels 1 2 3 4 and A B C D
66
What We Changed (Due To Linux)
  • Linux could not set up a totally empty PCI bus
  • Needed some minimal configuration
  • Linux couldn't find the IRQs
  • Not really its fault, but
  • Linux needed SMP hardware set up as per BIOS
  • Linux needed per-CPU hardware set up as per
    BIOS
  • Linux needed tables (IRQ, ACPI, etc.) set up as
    per BIOS
  • Over time, this is changing
  • Someone has a patent on the SRAT ACPI table
  • SRAT describes hardware
  • So Linux ignores SRAT, talks to hardware directly

67
As Of 2000/2001
  • We could boot Linux from flash (quickly)
  • Linux would find the hardware and the tables
    ready for it
  • Linux would be up and running in 3-12 seconds
  • Problem solved?

68
Problems
  • Looking at trends, in 1999 we counted on
    motherboard flash sizes doubling every 2 years or
    so
  • From 1999 to 2000 the average flash size either
    shrank or stayed the same
  • Linux continued to grow in size though
  • Linux outgrew the existing flash parts, even as
    they were getting smaller
  • Vendors went to a socket that couldn't hold a
    larger replacement
  • Why did vendors do this?
  • Everyone wants cheap mainboards!

69
LinuxBIOS Was Too Big
  • Enter the alternate bootstraps
  • Etherboot
  • FILO
  • Built-in etherboot
  • Built-in USB loader

70
The New Picture
Figure: the new layout. Flash (256KB): top 16 bytes, top 64K (LinuxBIOS), next 64K (Etherboot), remainder empty. Compact Flash (32MB): Linux kernel, loaded over IDE channel by bootloader, remainder empty.
71
LinuxBIOS Now
  • The aggregate of the 64K loader, Etherboot (or
    FILO), and Linux from Compact Flash?
  • Too confusing
  • LinuxBIOS now means only the 64K piece, even
    though it's not Linux any more
  • On older systems, LinuxBIOS loads Etherboot which
    loads Linux from Compact Flash
  • Compact Flash read as raw set of blocks
  • On newer systems, LinuxBIOS loads FILO which
    loads Linux from Compact Flash
  • Compact Flash treated as ext2 filesystem

72
Final Question
  • You're reflashing 1024 nodes on a cluster and the
    power fails
  • You're now the proud owner of 1024 bricks, right?
  • Wrong
  • Linux NetworX developed fallback BIOS technology

73
Fallback BIOS
Flash (256KB)
  • Jump to BIOS jumps to fallback BIOS
  • Fallback BIOS checks conditions
  • Was the last boot successful?
  • Do we want to just use fallback anyway?
  • Does normal BIOS look ok?
  • If things are good, use normal
  • If things are bad, use fallback
  • Note: there is also a fallback and a normal FILO
  • These load different files from CF
  • So normal kernel, FILO, and BIOS can be hosed and
    you're ok
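
The checks listed on this slide amount to a small decision function. The sketch below is one reading of them, in which the normal image is used only when every check passes. The function and parameter names are made up for illustration, and how LinuxBIOS records or tests each condition is not covered here.

```python
def choose_bios(last_boot_ok, force_fallback, normal_looks_ok):
    """Pick 'normal' only when every check on the slide passes."""
    if force_fallback:
        return "fallback"
    if last_boot_ok and normal_looks_ok:
        return "normal"
    return "fallback"


print(choose_bios(last_boot_ok=True, force_fallback=False, normal_looks_ok=True))
print(choose_bios(last_boot_ok=False, force_fallback=False, normal_looks_ok=True))
```

Defaulting to the fallback image on any doubt is what makes an interrupted reflash recoverable.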

Figure: flash layout (Jump to BIOS, Fallback BIOS, Normal BIOS, Fallback FILO, Normal FILO)
74
Rules For Upgrading Flash
  • NEVER replace the fallback BIOS
  • NEVER replace the fallback FILO
  • NEVER replace the fallback kernel
  • Mess up other images at will, because you can
    always fall back

75
A Last Word On Flash Size
  • Flash size decreased to 256KB from 1999-2003
  • Driven by packaging constraints
  • Newer technology uses address-address
    multiplexing to pack lots of address bits onto 3
    address lines - up to 128 MB!
  • Driven by cell phone and MP3 player demand
  • So the same small package can support 1, 2, 4 or 8 MB
  • Will need them: kernel + initrd can be 4MB!
  • This will allow us to realize our original vision
  • Linux in flash
  • Etherboot, FILO, etc., are really a hiccup

76
Source Tree
  • COPYING
  • NEWS
  • ChangeLog
  • documentation
  • Not enough!
  • src
  • /arch
  • Architecture specific files, including initial
    startup code
  • /boot
  • Main LinuxBIOS entry code: hardwaremain()
  • /config
  • Configuration for a given platform
  • /console
  • Device independent console support
  • /cpu
  • Implementation specific files
  • /devices
  • Dynamic device allocation routines
  • /include
  • Header files
  • /lib
  • Generic library functions (atoi)

77
Source Tree
  • /mainboard
  • Mainboard specific code
  • /northbridge
  • Memory and bus interface routines
  • /pc80
  • Legacy crap
  • /pmc
  • Processor mezzanine cards
  • /ram
  • Generic RAM support
  • /sdram
  • Synchronous RAM support
  • /southbridge
  • Bridge to interface to legacy crap
  • /stream
  • Source of payload data
  • /superio
  • Chip to emulate legacy crap
  • targets
  • Instances of specific platforms
  • utils
  • Utility programs

78
Building LinuxBIOS
  • For this demonstration, untar source from CDROM
  • mount /mnt/cdrom
  • cd /tmp
  • tar zxvf /mnt/cdrom/LCI/linuxbios/freebios2.tgz
  • cd freebios2
  • Find target that matches your hardware
  • cd targets/via/epia
  • Edit configuration file Config.lb and change any
    settings specific to your board
  • Should not need to make any changes in this case

79
Building LinuxBIOS
  • Build the target configuration files
  • cd ../..
  • ./buildtarget via/epia
  • Now build the ROM image
  • cd via/epia/epia
  • make
  • Should result in a single file
  • linuxbios.rom
  • Copy ROM image onto a node
  • bpcp linuxbios.rom 0/tmp

80
Installing LinuxBIOS
  • This will overwrite old BIOS with LinuxBIOS
  • Prudent to keep a copy of the old BIOS chip
  • Bad BIOS = useless junk
  • Build flash utility
  • cd /tmp/freebios2/util/flash_and_burn
  • make
  • Now flash the ROM image - please do not do this
    step
  • bpsh 0 ./flash_rom /tmp/linuxbios.rom
  • Reboot node and make sure it comes up
  • bpctl -S 0 -R
  • Use BProc troubleshooting techniques if not!

81
Booting a Cluster Without LinuxBIOS
  • Although an important part of Clustermatic, it's
    not always possible to deploy LinuxBIOS
  • Requires detailed understanding of technical
    details
  • May not be available for a particular mainboard
  • In this situation it is still possible to set up
    and boot a cluster using a combination of DHCP,
    TFTP and PXE
  • Dynamic Host Configuration Protocol (DHCP)
  • Used by node to obtain IP address and bootloader
    image name
  • Trivial File Transfer Protocol (TFTP)
  • Simple protocol to transfer files across an IP
    network
  • Preboot Execution Environment (PXE)
  • BIOS support for network booting

82
Configuring DHCP
  • Copy configuration file
  • cp /mnt/cdrom/LCI/pxi/dhcpd.conf /etc
  • Contains the following entry (one host entry
    for each node)
  • ddns-update-style ad-hoc;
  • subnet 10.0.4.0 netmask 255.255.255.0 {
  • host node1 {
  • hardware ethernet xx:xx:xx:xx:xx:xx;
  • fixed-address 10.0.4.14;
  • filename "pxelinux.0";
  • }
  • }
  • Replace xx:xx:xx:xx:xx:xx with MAC address of
    node
  • Restart server to load new configuration
  • service dhcpd restart
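
Since the configuration needs one host entry per node, it can be convenient to generate the entries instead of typing them. The sketch below formats a single host block in ISC dhcpd syntax. The function name is made up for illustration, and the MAC shown is the first node's address from the earlier nodeadd output, written with colons.

```python
def host_block(name, mac, ip, bootfile="pxelinux.0"):
    """Format one dhcpd.conf host entry for a cluster node."""
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f'  filename "{bootfile}";\n'
        f"}}"
    )


print(host_block("node1", "00:30:48:23:ac:9c", "10.0.4.14"))
```

The generated blocks belong inside the subnet declaration of dhcpd.conf.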

83
Configuring TFTP
  • Create directory to hold bootloader
  • mkdir -p /tftpboot
  • Edit TFTP config file
  • /etc/xinetd.d/tftp
  • Enable TFTP
  • Change
  • disable = yes
  • To
  • disable = no
  • Restart server
  • service xinetd restart

84
Configuring PXE
  • Depends on BIOS, enabled through menu
  • Create correct directories
  • mkdir -p /tftpboot/pxelinux.cfg
  • Copy bootloader and config file
  • cd /mnt/cdrom/LCI/pxe
  • cp pxelinux.0 /tftpboot/
  • cp default /tftpboot/pxelinux.cfg/
  • Generate a bootable phase 2 image
  • beoboot -2 -i -o /tftpboot/node --plugin mon
  • Creates a kernel and initrd image
  • /tftpboot/node
  • /tftpboot/node.initrd

85
Booting The Cluster
  • Run nodeadd to add node to config file
  • /usr/lib/beoboot/bin/nodeadd -a -e eth0
  • Node can now be powered on
  • BIOS uses DHCP to obtain IP address and filename
  • pxelinux.0 will be loaded
  • pxelinux.0 will in turn load phase 2 image and
    initrd
  • Node should boot
  • Check status using bpstat command
  • Requires monitor to observe behavior of node

86
Module 4: Filesystems
Presenter: Ron Minnich
  • Objective
  • To show the different kinds of filesystems that
    can be used with a BProc cluster and demonstrate
    the advantages and disadvantages of each
  • Contents
  • Overview
  • No Local Disk, No Network Filesystem
  • Local Disks
  • Global Network Filesystems
  • NFS
  • Third Party Filesystems
  • Private Network Filesystems
  • V9FS

87
Filesystems Overview
  • Nodes in a Clustermatic cluster do not require
    any type of local or network filesystem to
    operate
  • Jobs that operate with only local data need no
    other filesystems
  • Clustermatic can provide a range of different
    filesystem options

88
No Local Disk, No Network Filesystem
  • Root filesystem is a tmpfs located in system RAM,
    so size is limited to RAM size of nodes
  • Applications that need an input deck must copy
    necessary files to nodes prior to execution and
    from nodes after execution
  • 30K input deck can be copied to 1023 nodes in
    under 2.5 seconds
  • This can be a very fast option for suitable
    applications
  • Removes node dependency on potentially unreliable
    fileserver
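
A rough sketch of staging an input deck by hand is shown below. It assumes the BProc `bpcp` utility with its `node:path` destination form; the helper name `stage_deck`, the file name `input.deck`, and the /tmp destination are hypothetical. The function only prints the copy commands so they can be reviewed first, and a sequential loop like this is not the mechanism behind the timing quoted on the slide.

```shell
#!/bin/sh
# Print one copy command per node for staging an input file.
# Pipe the output into sh on a BProc master to actually run the copies.
stage_deck() {
    # $1: input file; remaining arguments: node numbers
    deck=$1
    shift
    for node in "$@"; do
        echo bpcp "$deck" "$node:/tmp/"
    done
}

stage_deck input.deck 0 1
```

Results can be collected after the run by reversing source and destination in the same way.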

89
Local Disks
  • Nodes can be provided with one or more local
    disks
  • Disks are automatically mounted by creating entry
    in /etc/clustermatic/fstab
  • Solves local space problem, but filesystems are
    still not shared
  • Also reduces reliability of nodes since they are
    now dependent on spinning hardware

90
NFS
  • Simplest solution to providing a shared
    filesystem on nodes
  • Will work in most environments
  • Nodes are now dependent on availability of NFS
    server
  • Master can act as NFS server
  • Adds extra load
  • Master may already be loaded if there are a large
    number of nodes
  • Better option is to provide a dedicated server
  • Configuration can be more complex if server is on
    a different network
  • May require multiple network adapters in master
  • Performance is never going to be high

91
Configuring Master as NFS Server
  • Standard Linux NFS configuration on server
  • Check NFS is enabled at boot time
  • chkconfig --list nfs
  • chkconfig nfs on
  • Start NFS daemons
  • service nfs start
  • Add exported filesystem to /etc/exports
  • /home 10.0.4.0/24(rw,sync,no_root_squash)
  • Export filesystem
  • exportfs -a
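
When the export step is scripted, it helps to make it idempotent so that rerunning the setup does not duplicate the entry. A minimal sketch, assuming the export line from this slide; the helper name `add_export` is hypothetical and the demonstration works on a scratch file rather than /etc/exports.

```shell
#!/bin/sh
# Append an export line only if an identical line is not already present.
add_export() {
    # $1: exports file, $2: complete export line
    grep -qxF -e "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Demonstration on a scratch file standing in for /etc/exports.
exports=$(mktemp)
add_export "$exports" '/home 10.0.4.0/24(rw,sync,no_root_squash)'
add_export "$exports" '/home 10.0.4.0/24(rw,sync,no_root_squash)'   # second call changes nothing
cat "$exports"
```

On the master, `exportfs -a` would then be run as shown above to publish the entry.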

92
Configuring Nodes To Use NFS
  • Edit /etc/clustermatic/fstab to mount filesystem
    when node boots
  • MASTER:/home /home nfs nolock 0 0
  • MASTER will be replaced with IP address of front
    end
  • nolock must be used unless portmap is run on each
    node
  • /home will be automatically created on node at
    boot time
  • Reboot nodes
  • bpctl -S allup -R
  • When nodes have rebooted, check NFS mount is
    available
  • bpsh 0-1 df
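
To make the MASTER placeholder concrete, the sketch below shows the kind of substitution described on this slide. It only illustrates the idea and is not how the Clustermatic boot scripts implement it; the function name `expand_master` is hypothetical and 10.0.4.1 is assumed as the front end address.

```shell
#!/bin/sh
# Replace the MASTER placeholder at the start of an fstab line with an IP address.
expand_master() {
    # $1: IP address of the front end; fstab lines are read from stdin
    sed "s/^MASTER:/$1:/"
}

printf '%s\n' 'MASTER:/home /home nfs nolock 0 0' | expand_master 10.0.4.1
```

The resulting line has the form a standard NFS mount expects, with the server address and exported path separated by a colon.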

93
Third Party Filesystems
  • GPFS (http://www.ibm.com)
  • Panasas (http://www.panasas.com)
  • Lustre (http://www.lustre.org)

94
GPFS
  • Supports up to 2.4.21 kernel (latest is 2.4.26 or
    2.6.5)
  • Data striping across multiple disks and multiple
    nodes
  • Client-side data caching
  • Large blocksize option for higher efficiencies
  • Read-ahead and write-behind support
  • Block level locking supports concurrent access to
    files
  • Network Shared Disk Model
  • Subset of nodes are allocated as storage nodes
  • Software layer ships I/O requests from
    application node to storage nodes across cluster
    interconnect
  • Direct Attached Model
  • Each node must have direct connection to all
    disks
  • Requires Fibre Channel Switch and Storage Area
    Network disk configuration

95
Panasas
  • Latest version supports 2.4.26 kernel
  • Object Storage Device (OSD)
  • Intelligent disk drive
  • Can be directly accessed in parallel
  • PanFS Client
  • Object-based installable filesystem
  • Handles all mounting, namespace operations, file
    I/O operations
  • Parallel access to multiple object storage
    devices
  • Metadata Director
  • Separate control path for managing OSDs
  • mapping of directories and files to data
    objects
  • Authentication and secure access
  • Metadata Director and OSD require dedicated
    proprietary hardware
  • PanFS Client is open source

96
Lustre
  • Lustre Lite supports 2.4.24 kernel
  • Full Lustre will support 2.6 kernel
  • Lustre Lite = Lustre - clustered metadata
    scalability
  • All open source
  • Meta Data Servers (MDSs)
  • Supports all filesystem namespace operations
  • Lock manager and concurrency support
  • Transaction log of metadata operations
  • Handles failover of metadata servers
  • Object Storage Targets (OSTs)
  • Handles actual file I/O operations
  • Manages storage on Object-Based Disks (OBDs)
  • Object-Based Disk drivers support normal Linux
    filesystems
  • Arbitrary network support through Network
    Abstraction Layer
  • MDSs and OSTs can be standard Linux hosts

97
V9FS
  • Provides a shared private network filesystem
  • Shared
  • All nodes running a parallel process can access
    the filesystem
  • Private
  • Only processes in a single process group can see
    or access files in the filesystem
  • Mounts exist only for duration of process
  • Node cleanup is automatic
  • No hanging mount problems
  • Protocol is lightweight
  • Pluggable authentication services

98
V9FS
  • Experimental
  • Can be mounted across a secure channel (e.g. ssh)
    for additional security
  • 1000 concurrent mounts in 20 seconds
  • Multiple servers will improve this
  • Servers can run on cluster nodes or dedicated
    systems
  • Filesystem can use cluster interconnect or
    dedicated network
  • More information
  • http://v9fs.sourceforge.net

99
Configuring Master as V9FS Server
  • Start server
  • v9fs_server
  • Can be started at boot if desired
  • Create mount point on nodes
  • bpsh 0-1 mkdir /private
  • Can add mkdir command to end of node_up script if
    desired

100
V9FS Server Commands
  • Define filesystems to be mounted on the nodes
  • v9fs_addmount 10.0.4.1:/home /private
  • List filesystems to be mounted
  • v9fs_lsmount

101
V9FS On The Cluster
  • Once filesystem mounts have been defined on the
    server, filesystems will be automatically mounted
    when a process is migrated to the node
  • cp /etc/hosts /home
  • bpsh 0-1 ls -l /private
  • bpsh 0 cat /private/hosts
  • Remove filesystems to be mounted
  • v9fs_rmmount /private
  • bpsh 0-1 ls -l /private

102
One Note
  • Note that we ran the file server as root
  • You can actually run the file server as you
  • If run as you, there is added security
  • The server can't run amok
  • And subtracted security
  • We need a better authentication system
  • Can use ssh, but something tailored to the
    cluster would be better
  • Note that the server can chroot for even more
    safety
  • Or be told to serve from a file, not a file
    system
  • There is tremendous flexibility and capability in
    this approach

103
Also
  • Recall that on 2.4.19 and later there is a /proc
    entry for each process
  • /proc/mounts
  • It really is quite private
  • There is a lot of potential capability here we
    have not started to use
  • Still trying to determine need/demand

104
Why Use V9FS?
  • You've got some wacko library you need to use for
    one application
  • You've got a giant file which you want to serve
    as a file system
  • You've got data that you want visible to you only
  • Original motivation: compartmentation in grids
    (1996)
  • You want a mount point but it's not possible for
    some reason
  • You want an encrypted data file system

105
Wacko Library
  • Clustermatic systems (intentionally) limit the
    number of libraries on nodes
  • Current systems have about 2GB worth of libraries
  • Putting all these on nodes would take 2GB of
    memory!
  • Keeping node configuration consistent is a big
    task on 1000 nodes
  • Need to do rsync, or whatever
  • Lots of work, lots of time for libraries you
    don't need
  • What if you want some special library available
    all the time?
  • Painful to ship it out, set up paths, etc., every
    time
  • V9FS allows custom mounts to be served from your
    home directory

106
Giant File As File System
  • V9FS is a user-level server
  • i.e. an ordinary program
  • On Plan 9, there are all sorts of nifty uses of
    this
  • Servers for making a tar file look like a
    read-only file system
  • Or cpio archive, or whatever
  • So, instead of trying to locate something in the
    middle of a huge tar file
  • Run the server to serve the tar file
  • Save disk blocks and time

107
Data Visible To You Only
  • This usage is still very important
  • Run your own personal server (assuming
    authentication is fixed) or use the global server
  • Files that you see are not visible to anyone else
    at all
  • Even root
  • On Unix, if you can't get to the mount point, you
    can't see the files
  • On Linux with private mounts, other people don't
    even know the mount point exists

108
You Want A Mount Point But Can't Get One
  • Please Mr. Sysadmin, sir, can I have another
    mount point?
  • NO!
  • System administrators have enough to do without having to
  • Modify fstab on all nodes
  • Modify permissions on a server
  • And so on
  • Just to make your library available on the nodes?
  • Doubtful
  • V9FS gives a level of flexibility that you can't
    get otherwise

109
Want Encrypted Data File System
  • This one is really interesting
  • Crypto file systems are out there in abundance
  • But they always require lots of root
    involvement to set up
  • Since V9FS is user-level, you can run one
    yourself
  • Set up your own keys, crypto, all your own stuff
  • Serve a file system out of one big encrypted file
  • Copy the file elsewhere, leaving it encrypted
  • Not easily done with existing file systems
  • So you have a personal, portable, encrypted file
    system

110
So Why Use V9FS?
  • Opens up a wealth of new ways to store, access
    and protect your data
  • Don't have to bother System Administrators all
    the time
  • Can extend the file system name space of a node
    to your specification
  • Can create a whole file system in one file, and
    easily move that file system around (cp, scp,
    etc.)
  • Can do special per-user policy on the file system
  • Tar or compressed file format
  • Per-user crypto file system
  • Provides capabilities you can't get any other way

111
Module 5: Supermon
Presenter: Matt Sottile
  • Objectives
  • Present an overview of supermon
  • Demonstrate how to install and use supermon to
    monitor a cluster
  • Contents
  • Overview of Supermon
  • Starting Supermon
  • Monitoring the Cluster
  • More Information
  • http://supermon.sourceforge.net

112
Overview of Supermon
  • Provides monitoring solution for clusters
  • Capable of high sampling rates (Hz)
  • Very small memory and computational footprint
  • Sampling rates are controlled by clients at
    run-time
  • Completely extensible without modification
  • User applications
  • Kernel modules

113
Node View
  • Data sources
  • Kernel module(s)
  • User application
  • Mon daemon
  • IANA-registered port number
  • 2709

114
Cluster View
  • Data sources
  • Node mon daemons
  • Other supermons
  • Supermon daemon
  • Same port number
  • 2709
  • Same protocol at every level
  • Composable, extensible

115
Data Format
  • S-expressions
  • Used in LISP, Scheme, etc.
  • Very mature
  • Extensible, composable, ASCII
  • Very portable
  • Easily changed to support richer data and
    structures
  • Composable
  • (expr 1) o (expr 2) = ((expr 1) (expr 2))
  • Fast to parse, low memory and time overhead
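
The composition rule can be tried out with plain text tools. This is a small sketch, assuming samples are simple strings without quoted parentheses; the function names `compose` and `balanced` and the sample values are hypothetical, and this is not Supermon's own parser.

```shell
#!/bin/sh
# Compose two s-expressions by wrapping them in one enclosing list.
compose() {
    printf '(%s %s)\n' "$1" "$2"
}

# Succeed only if every "(" has a matching ")" and none closes too early.
balanced() {
    printf '%s\n' "$1" | awk '
        { for (i = 1; i <= length($0); i++) {
              c = substr($0, i, 1)
              if (c == "(") depth++
              else if (c == ")" && --depth < 0) exit 1
          } }
        END { exit (depth != 0) }'
}

# Sample values are made up; only the shape matters.
a='(cpuinfo (usertime 42))'
b='(meminfo (free 1024))'
both=$(compose "$a" "$b")
echo "$both"
balanced "$both" && echo "well formed"
```

Because the result of composing two expressions is itself an s-expression, the same operation can be repeated at every level of a monitoring hierarchy.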

116
Data Protocol
  • # command
  • Provides description of what data is provided and
    how it is structured
  • Shows how the data is organized in terms of rough
    categories containing specific data variables
    (e.g. cpuinfo category, usertime variable)
  • S command
  • Request actual data
  • Structure matches that described in # command
  • R command
  • Revive clients that disappeared and were
    restarted
  • N command
  • Add new clients

117
User Defined Data
  • Each node allows user-space programs to push data
    into mon to be sent out on the next sample
  • Only requirement
  • Data is arbitrary text
  • Recommended to be an s-expression
  • Very simple interface
  • Uses UNIX domain socket for security

118
Starting Supermon
  • Start supermon daemon
  • supermon n0 n1 2> /dev/null
  • Check output from kernel
  • bpsh 1 cat /proc/sys/supermon/
  • bpsh 1 cat /proc/sys/supermon/S
  • Check sensor output from kernel
  • bpsh 1 cat /proc/sys/supermon_sensors_t/
  • bpsh 1 cat /proc/sys/supermon_sensors_t/S

119
Supermon In Action
  • Check mon output from a node
  • telnet n1 2709
  • S
  • close
  • Check output from supermon daemon
  • telnet localhost 2709
  • S
  • close
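
The telnet sessions above can be replaced by a one-line request when scripting. A minimal sketch, assuming a netcat (`nc`) binary is installed and that the daemon answers a single command line per connection as in the telnet example; the function names are hypothetical. Depending on the `nc` variant, an extra option may be needed to keep the connection open until the reply arrives.

```shell
#!/bin/sh
# Build a one-line protocol request (for example "S" to ask for a sample).
supermon_request() {
    printf '%s\n' "$1"
}

# Send one command to a mon or supermon daemon and print the reply.
supermon_query() {
    # $1: host, $2: command, $3: port (defaults to 2709)
    supermon_request "$2" | nc "$1" "${3:-2709}"
}

# Example invocations, to be run on a cluster where the daemons are up:
#   supermon_query n1 S          # sample from the mon daemon on node n1
#   supermon_query localhost S   # aggregated sample from supermon
```

The same function works against a node or the master because both levels listen on the same port and use the same protocol.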

120
Supermon In Action
  • Read supermon data and display to console
  • supermon_stats options
  • Create trace file for off-line analysis
  • supermon_tracer options
  • supermon_stats can be used to process trace data
    off-line

121
Module 6: BJS
Presenter: Matt Sottile
  • Objectives
  • Introduce the BJS scheduler
  • Configure and submit jobs using BJS
  • Contents
  • Overview of BJS
  • BJS Configuration
  • Using BJS

122
Overview of BJS
  • Designed to cover the needs of most users
  • Simple, easy to use
  • Extensible interface for adding policies
  • Used in production environments
  • Optimized for use with BProc
  • Traditional schedulers require O(N) processes,
    BJS requires O(1)
  • Schedules and unschedules 1000 processes in 0.1
    seconds

123
BJS Configuration
  • Nodes are divided into pools, each with a policy
  • Standard policies
  • Filler
  • Attempts to backfill unused nodes
  • Shared
  • Allows multiple jobs to run on a single node
  • Simple
  • Very simple FIFO scheduling algorithm
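
The difference between strict FIFO and backfilling can be seen in a toy model. The sketch below is not the BJS implementation: it ignores run times, which a real backfill policy would normally take into account, and the function name `schedule` and the job list format are hypothetical.

```shell
#!/bin/sh
# Toy scheduler: decide which queued jobs start on the free nodes.
# Usage: schedule POLICY FREE_NODES "id:nodes id:nodes ..."
schedule() {
    policy=$1
    free=$2
    started=""
    for job in $3; do
        id=${job%%:*}
        need=${job##*:}
        if [ "$need" -le "$free" ]; then
            free=$((free - need))
            started="$started $id"
        elif [ "$policy" = simple ]; then
            break    # FIFO: nothing may overtake the job at the head of the queue
        fi           # filler: skip this job and keep looking for one that fits
    done
    echo "${started# }"
}

schedule simple 4 "a:2 b:4 c:1"
schedule filler 4 "a:2 b:4 c:1"
```

With four free nodes, the FIFO model starts only job a because b blocks the queue, while the backfilling model also starts c on the nodes that would otherwise sit idle.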

124
Extending BJS
  • BJS was designed to be extensible
  • Policies are plug-ins
  • They require coding to the BJS C API
  • Not hard, but nontrivial
  • Particularly useful for installation-specific
    policies
  • Based on shared-object libraries
  • A fair-share policy is currently in testing at
    LANL for BJS
  • Enforce fairness between groups
  • Enforce fairness between users within a group
  • Optimal scheduling between a user's own jobs

125
BJS Configuration
  • BJS configuration file
  • /etc/clustermatic/bjs.config
  • Global configuration options (usually don't need
    to be changed)
  • Location of spool files
  • spooldir
  • Location of dynamically loaded policy modules
  • policypath
  • Location of UNIX domain socket
  • socketpath
  • Location of user accounting log file
  • acctlog

126
BJS Configuration
  • Per-pool configuration options
  • Defines the default pool
  • pool default
  • Name of policy module for this pool (must exist
    in policypath)
  • policy filler
  • Nodes that are in this pool
  • nodes 0-10000
  • Maximum duration of a job (wall clock time)
  • maxsecs 86400
  • Optional: Users permitted to submit to this pool
  • users
  • Optional: Groups permitted to submit to this pool
  • groups
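
Put together, the per-pool options above form one stanza in the configuration file. The sketch below assembles it from the values on this slide; the indentation is a readability choice and may not be required, the optional users and groups lines are left out because their list format is not shown here, and the helper name `write_pool` is hypothetical. It writes to a scratch file rather than /etc/clustermatic/bjs.config.

```shell
#!/bin/sh
# Append a pool stanza built from the options described on this slide.
write_pool() {
    # $1: output file, $2: pool name, $3: policy, $4: node range, $5: max seconds
    cat >> "$1" <<EOF
pool $2
    policy $3
    nodes $4
    maxsecs $5
EOF
}

# Demonstration on a scratch file.
conf=$(mktemp)
write_pool "$conf" default filler 0-10000 86400
cat "$conf"
```

A maxsecs value of 86400 corresponds to a 24 hour wall clock limit.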

127
BJS Configuration
  • Restart BJS daemon to accept changes
  • service bjs restart
  • Check nodes are available
  • bjsstat
  • Pool: default   Nodes (total/up/free): 5/2/2
  • ID User Command Requirements

128
Using BJS
  • bjssub
  • Submit a request to allocate nodes
  • ONLY runs the command on the front end
  • The command is responsible for executing on nodes
  • -p specify node pool
  • -n number of nodes to allocate
  • -s run time of job (in seconds)
  • -i run in interactive mode
  • -b run in batch mode (default)
  • -D set working directory
  • -O redirect command output to file

129
Using BJS
  • bjsstat
  • Show status of node pools
  • Name of pool
  • Total number of nodes in pool
  • Number of operational nodes in pool
  • Number of free nodes in pool
  • Lists status of jobs in each pool

130
Using BJS
  • bjsctl
  • Terminate a running job
  • -r specify ID number of job to terminate

131
Interactive vs Batch
  • Interactive jobs
  • Schedule a node or set of nodes for use
    interactively
  • bjssub will wait until nodes are available, then
    run the command
  • Good during development
  • Good for single run, short runtime jobs
  • Hands-on interaction with nodes
  • bjssub -p default -n 2 -s 1000 -i bash
  • Waiting for interactive job nodes.
  • (nodes 0 1)
  • Starting interactive job.
  • NODES=0,1
  • JOBID=59
  • > bpsh $NODES date
  • > exit

132
Interactive vs Batch
  • Batch jobs
  • Schedule a job to run as soon as requested nodes
    are available
  • bjssub will queue the command until nodes are
    available
  • Good for long running jobs that require little or
    no interaction
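
A batch job is typically a small script that BJS runs on the front end once nodes have been allocated. The following is a minimal sketch, assuming the NODES and JOBID variables seen in the interactive example are also set for batch jobs; the script name job.sh and the function name `job_main` are hypothetical. The bpsh call is skipped when bpsh is missing or no nodes were allocated, so the script can be tried on any machine.

```shell
#!/bin/sh
# job.sh (hypothetical name): submitted with, for example,
#   bjssub -p default -n 2 -s 1000 -b ./job.sh
job_main() {
    echo "job ${JOBID:-unknown} allocated nodes ${NODES:-none}"
    # The script itself runs on the front end; bpsh starts work on the nodes.
    if command -v bpsh >/dev/null 2>&1 && [ -n "${NODES:-}" ]; then
        bpsh "$NODES" hostname
    fi
}

job_main
```

This reflects the earlier point that bjssub only runs the command on the front end and the command is responsible for executing on the allocated nodes.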