Title: High Performance Linux Clusters
1. High Performance Linux Clusters
- Guru Session, Usenix, Boston
- June 30, 2004
- Greg Bruno, SDSC
2. Overview of San Diego Supercomputer Center
- Founded in 1985
- Non-military access to supercomputers
- Over 400 employees
- Mission: Innovate, develop, and deploy technology to advance science
- Recognized as an international leader in
- Grid and Cluster Computing
- Data Management
- High Performance Computing
- Networking
- Visualization
- Primarily funded by NSF
3. My Background
- 1984 - 1998: NCR - Helped to build the world's largest database computers
- Saw the transition from proprietary parallel systems to clusters
- 1999 - 2000: HPVM - Helped build Windows clusters
- 2000 - Now: Rocks - Helping to build Linux-based clusters
4. Why Clusters?
5. Moore's Law
6. Cluster Pioneers
- In the mid-1990s, the Network of Workstations project (UC Berkeley) and the Beowulf Project (NASA) asked the question:
Can You Build a High Performance Machine From Commodity Components?
7. The Answer is Yes
Source: Dave Pierce, SIO
8. The Answer is Yes
9. Types of Clusters
- High Availability
- Generally small (less than 8 nodes)
- Visualization
- High Performance
- Computational tools for scientific computing
- Large database machines
10. High Availability Cluster
- Composed of redundant components and multiple
communication paths
11. Visualization Cluster
- Each node in the cluster drives a display
12. High Performance Cluster
- Constructed with many compute nodes and often a
high-performance interconnect
13. Cluster Hardware Components
14. Cluster Processors
- Pentium/Athlon
- Opteron
- Itanium
15. Processors: x86
- Most prevalent processor used in commodity clustering
- Fastest integer processor on the planet
- 3.4 GHz Pentium 4, SPEC2000int: 1705
16. Processors: x86
- Capable floating point performance
- The #5 machine on the Top500 list is built with Pentium 4 processors
17. Processors: Opteron
- Newest 64-bit processor
- Excellent integer performance
- SPEC2000int: 1655
- Good floating point performance
- SPEC2000fp: 1691
- #10 machine on the Top500
18. Processors: Itanium
- First systems released June 2001
- Decent integer performance
- SPEC2000int: 1404
- Fastest floating-point performance on the planet
- SPEC2000fp: 2161
- Impressive Linpack efficiency: 86%
19. Processors Summary
Processor       GHz    SPECint   SPECfp   Price
Pentium 4 EE    3.4    1705      1561     $791
Athlon FX-51    2.2    1447      1423     $728
Opteron 150     2.4    1655      1644     $615
Itanium 2       1.5    1404      2161     $4,798
Itanium 2       1.3    1162      1891     $1,700
Power4          1.7    1158      1776     ????
20. But What Do You Really Build?
- Itanium: Dell PowerEdge 3250
- Two 1.4 GHz CPUs (1.5 MB cache)
- 11.2 Gflops peak
- 2 GB memory
- 36 GB disk
- $7,700
- Two 1.5 GHz CPUs (6 MB cache) make the system cost $17,700
- 1.4 GHz vs. 1.5 GHz
- 7% slower
- 130% cheaper
21. Opteron
- IBM eServer 325
- Two 2.0 GHz Opteron 246 CPUs
- 8 Gflops peak
- 2 GB memory
- 36 GB disk
- $4,539
- Two 2.4 GHz CPUs: $5,691
- 2.0 GHz vs. 2.4 GHz
- 17% slower
- 25% cheaper
22. Pentium 4 Xeon
- HP DL140
- Two 3.06 GHz CPUs
- 12 Gflops peak
- 2 GB memory
- 80 GB disk
- $2,815
- Two 3.2 GHz CPUs: $3,368
- 3.06 GHz vs. 3.2 GHz
- 4% slower
- 20% cheaper
(The peak Gflops figures on these three slides follow directly from clock rate and floating-point issue width; see the sketch below.)
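The peak numbers quoted above are simply sockets x clock rate x double-precision flops per cycle. A minimal sketch in C; the per-cycle issue widths (4 for Itanium 2's two FMA units, 2 for Opteron and Xeon with SSE2) are my assumption, not stated on the slides.

/* peak_gflops.c - reproduce the peak Gflops figures quoted on slides 20-22.
 * Assumption: 4 double-precision flops/cycle for Itanium 2 (two FMA units),
 * 2 flops/cycle for Opteron and Pentium 4 Xeon.  Build: cc peak_gflops.c */
#include <stdio.h>

static double peak_gflops(int sockets, double ghz, int flops_per_cycle)
{
    return sockets * ghz * flops_per_cycle;   /* GHz x flops/cycle = Gflops */
}

int main(void)
{
    printf("Dell PowerEdge 3250 (2 x 1.4 GHz Itanium 2): %.1f Gflops\n",
           peak_gflops(2, 1.4, 4));   /* 11.2, as on slide 20 */
    printf("IBM eServer 325 (2 x 2.0 GHz Opteron 246):   %.1f Gflops\n",
           peak_gflops(2, 2.0, 2));   /*  8.0, as on slide 21 */
    printf("HP DL140 (2 x 3.06 GHz Xeon):                %.1f Gflops\n",
           peak_gflops(2, 3.06, 2));  /* ~12.2, rounded to 12 on slide 22 */
    return 0;
}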
23. If You Had $100,000 To Spend On A Compute Farm
System                # of Boxes   Peak GFlops   Aggregate SPEC2000fp   Aggregate SPEC2000int
Pentium 4 3 GHz       35           420           89810                  104370
Opteron 246 2.0 GHz   22           176           56892                  57948
Itanium 1.4 GHz       12           132           46608                  24528
(The derivation of the box counts and aggregate columns from the per-box prices above is sketched below.)
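A sketch of the arithmetic behind this table, assuming the per-box prices quoted on slides 20-22 ($2,815, $4,539, $7,700); the slide does not state which prices were used, so treat them as an assumption. The box count is budget divided by unit price, and each aggregate column is boxes x 2 CPUs x the per-CPU figure.

/* compute_farm.c - how the "$100,000 compute farm" table can be derived.
 * Prices per box are assumed from slides 20-22; the SPEC aggregates are
 * computed the same way (boxes * 2 CPUs * per-CPU SPEC score). */
#include <stdio.h>

struct box { const char *name; double price; double peak_gflops; };

int main(void)
{
    const double budget = 100000.0;
    const struct box farm[] = {
        { "Pentium 4 3 GHz",     2815.0, 12.0 },
        { "Opteron 246 2.0 GHz", 4539.0,  8.0 },
        { "Itanium 1.4 GHz",     7700.0, 11.0 },  /* table's 132 implies 11 Gflops/box */
    };

    for (int i = 0; i < 3; i++) {
        int boxes = (int)(budget / farm[i].price);      /* whole boxes only */
        printf("%-22s %3d boxes  %6.0f peak Gflops\n",
               farm[i].name, boxes, boxes * farm[i].peak_gflops);
    }
    return 0;
}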
24. What People Are Buying
- Gartner study
- Servers shipped in 1Q04
- Itanium: 6,281
- Opteron: 31,184
- Opteron shipped 5x more servers than Itanium
25. What Are People Buying?
- Gartner study
- Servers shipped in 1Q04
- Itanium: 6,281
- Opteron: 31,184
- Pentium: 1,000,000
- Pentium shipped 30x more than Opteron
26. Interconnects
27. Interconnects
- Ethernet
- Most prevalent on clusters
- Low-latency interconnects
- Myrinet
- Infiniband
- Quadrics
- Ammasso
28. Why Low-Latency Interconnects?
- Performance
- Lower latency
- Higher bandwidth
- Accomplished through OS-bypass
29. How Low-Latency Interconnects Work
- Decrease latency for a packet by reducing the number of memory copies per packet (a ping-pong microbenchmark for measuring latency and bandwidth is sketched below)
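The latency and bandwidth figures quoted for each interconnect on the following slides are typically measured with an MPI ping-pong microbenchmark: rank 0 bounces a message off rank 1 and halves the round-trip time. This is a minimal sketch, not any vendor's benchmark code; run it over each interconnect's MPI to compare.

/* pingpong.c - minimal MPI ping-pong: half the round-trip time gives the
 * one-way latency (small messages) or the bandwidth (large messages).
 * Typical build/run:  mpicc pingpong.c -o pingpong ; mpirun -np 2 pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, size = argc > 1 ? atoi(argv[1]) : 8;   /* message size in bytes */
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_rtt = (MPI_Wtime() - t0) / (2.0 * REPS);
    if (rank == 0)
        printf("%d bytes: %.2f us one-way, %.1f MB/s\n",
               size, half_rtt * 1e6, size / half_rtt / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}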
30. Bisection Bandwidth
- Definition: If you split the system in half, what is the maximum amount of data that can pass between the two halves?
- Assuming 1 Gb/s links
- Bisection bandwidth: 1 Gb/s
31. Bisection Bandwidth
- Assuming 1 Gb/s links
- Bisection bandwidth: 2 Gb/s
32. Bisection Bandwidth
- Definition: Full bisection bandwidth is a network topology that can support N/2 simultaneous communication streams.
- That is, the nodes on one half of the network can communicate with the nodes on the other half at full speed (see the worked example below).
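As a worked example of the definition (a sketch, assuming the 1 Gb/s links used on the previous slides): with N nodes, full bisection means the N/2 streams crossing any half/half split can all run at link speed at once.

/* bisection.c - full bisection bandwidth for N nodes at a given link rate:
 * N/2 simultaneous streams, each at full link speed. */
#include <stdio.h>

int main(void)
{
    const double link_gbps = 1.0;          /* 1 Gb/s links, as on the slides */
    const int sizes[] = { 8, 64, 128 };    /* example cluster sizes (assumed) */

    for (int i = 0; i < 3; i++)
        printf("%3d nodes: full bisection = %.0f Gb/s\n",
               sizes[i], (sizes[i] / 2) * link_gbps);
    return 0;
}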
33. Large Networks
- When you run out of ports on a single switch, you must add another network stage
- In the example above: assuming 1 Gb/s links, uplinks from stage-1 switches to stage-2 switches must carry at least 6 Gb/s
34. Large Networks
- With low-port-count switches, large systems need many switches in order to maintain full bisection bandwidth
- A 128-node system with 32-port switches requires 12 switches and 256 total cables (the counting is sketched below)
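A sketch of where the 12-switch / 256-cable figure comes from, assuming a two-stage network in which each 32-port leaf switch dedicates half its ports to nodes and half to uplinks; that topology is an assumption, the slide does not spell it out.

/* two_stage.c - switch and cable count for a full-bisection two-stage network.
 * Assumption: each leaf switch uses half its ports for nodes, half for uplinks. */
#include <stdio.h>

int main(void)
{
    const int nodes = 128, ports = 32;

    int nodes_per_leaf   = ports / 2;               /* 16 nodes down             */
    int uplinks_per_leaf = ports / 2;               /* 16 uplinks up             */
    int leaves        = nodes / nodes_per_leaf;     /* 128/16 = 8 leaf switches  */
    int uplink_cables = leaves * uplinks_per_leaf;  /* 8*16 = 128 uplink cables  */
    int spines        = uplink_cables / ports;      /* 128/32 = 4 spine switches */
    int cables        = nodes + uplink_cables;      /* 128 + 128 = 256 cables    */

    printf("%d leaf + %d spine = %d switches, %d cables\n",
           leaves, spines, leaves + spines, cables);
    return 0;
}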
35. Myrinet
- Long-time interconnect vendor
- Delivering products since 1995
- Delivers a single 128-port full-bisection-bandwidth switch
- MPI Performance
- Latency: 6.7 us
- Bandwidth: 245 MB/s
- Cost/port (based on 64-port configuration): $1,000
- Switch + NIC + cable
- http://www.myri.com/myrinet/product_list.html
36. Myrinet
- Recently announced 256-port switch
- Available August 2004
37. Myrinet
- #5 system on the Top500 list
- System sustains 64% of peak performance
- But smaller Myrinet-connected systems hit 70-75% of peak
38. Quadrics
- QsNetII E-series
- Released at the end of May 2004
- Delivers 128-port standalone switches
- MPI Performance
- Latency: 3 us
- Bandwidth: 900 MB/s
- Cost/port (based on 64-port configuration): $1,800
- Switch + NIC + cable
- http://doc.quadrics.com/Quadrics/QuadricsHome.nsf/DisplayPages/A3EE4AED738B6E2480256DD30057B227
39. Quadrics
- #2 on the Top500 list
- Sustains 86% of peak
- Other Quadrics-connected systems on the Top500 list sustain 70-75% of peak
40. Infiniband
- Newest cluster interconnect
- Currently shipping 32-port and 192-port switches
- MPI Performance
- Latency: 6.8 us
- Bandwidth: 840 MB/s
- Estimated cost/port (based on 64-port configuration): $1,700 - $3,000
- Switch + NIC + cable
- http://www.techonline.com/community/related_content/24364
41. Ethernet
- Latency: 80 us
- Bandwidth: 100 MB/s
- Top500 list has Ethernet-based systems sustaining between 35-59% of peak
42. Ethernet
- What we did with 128 nodes and a $13,000 Ethernet network
- $101/port
- $28/port with our latest Gigabit Ethernet switch
- Sustained 48% of peak
- With Myrinet, we would have sustained 1 Tflop
- At a cost of $130,000
- Roughly 1/3 the cost of the system
43. Rockstar Topology
- 24-port switches
- Not a symmetric network
- Best case: 4:1 bisection bandwidth
- Worst case: 8:1
- Average: 5.3:1
44. Low-Latency Ethernet
- Brings OS-bypass to Ethernet
- Projected performance
- Latency: less than 20 us
- Bandwidth: 100 MB/s
- Potentially could merge the management and high-performance networks
- Vendor: Ammasso
45. Application Benefits
46. Storage
47. Local Storage
- Exported to compute nodes via NFS
48. Network Attached Storage
- A NAS box is an embedded NFS appliance
49. Storage Area Network
- Provides a disk block interface over a network (Fibre Channel or Ethernet)
- Moves the shared disks out of the servers and onto the network
- Still requires a central service to coordinate file system operations
50. Parallel Virtual File System
- PVFS version 1 has no fault tolerance
- PVFS version 2 (in beta) has fault-tolerance mechanisms
51. Lustre
- Open Source
- Object-based storage
- Files become objects, not blocks
52. Cluster Software
53. Cluster Software Stack
- Linux Kernel/Environment
- RedHat, SuSE, Debian, etc.
54. Cluster Software Stack
- HPC Device Drivers
- Interconnect drivers (e.g., Myrinet, Infiniband, Quadrics)
- Storage drivers (e.g., PVFS)
55. Cluster Software Stack
- Job Scheduling and Launching
- Sun Grid Engine (SGE)
- Portable Batch System (PBS)
- Load Sharing Facility (LSF)
56. Cluster Software Stack
- Cluster Software Management
- E.g., Rocks, OSCAR, Scyld
57. Cluster Software Stack
- Cluster State Management and Monitoring
- Monitoring: Ganglia, Clumon, Nagios, Tripwire, Big Brother
- Management: Node naming and configuration (e.g., DHCP)
58. Cluster Software Stack
- Message Passing and Communication Layer
- E.g., Sockets, MPICH, PVM (a minimal MPI example is sketched below)
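For the message-passing layer, the user-level interface most codes target is MPI (MPICH is listed above). A minimal send/receive sketch, assuming a typical build with mpicc and launch with mpirun; the exact launcher and any SGE/PBS integration depend on the stack and are not prescribed here.

/* hello_mpi.c - minimal message passing: every rank sends its hostname to rank 0.
 * Typical build/run (assumed):  mpicc hello_mpi.c -o hello_mpi ; mpirun -np 4 ./hello_mpi */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, len;
    char name[MPI_MAX_PROCESSOR_NAME], buf[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(name, &len);

    if (rank != 0) {
        /* every compute rank reports in to rank 0 */
        MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        printf("rank 0 of %d on %s\n", nprocs, name);
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(buf, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank %d of %d on %s\n", src, nprocs, buf);
        }
    }
    MPI_Finalize();
    return 0;
}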
59. Cluster Software Stack
- Parallel Code / Web Farm / Grid / Computer Lab
- Locally developed code
60. Cluster Software Stack
- Questions:
- How to deploy this stack across every machine in the cluster?
- How to keep this stack consistent across every machine?
61. Software Deployment
- Known methods
- Manual Approach
- Add-on method
- Bring up a frontend, then add cluster packages
- OpenMosix, OSCAR, Warewulf
- Integrated
- Cluster packages are added at frontend installation time
- Rocks, Scyld
62. Rocks
63. Primary Goal
- Make clusters easy
- Target audience: Scientists who want a capable computational resource in their own lab
64. Philosophy
- Caring for and feeding a system is not fun
- All compute nodes are 100% automatically installed
- Critical for scaling
- Essential to track software updates
- RHEL 3.0 has issued 232 source RPM updates since Oct 21
- Roughly 1 updated SRPM per day
- Run on heterogeneous, standard, high-volume components
- Use the components that offer the best price/performance!
65. More Philosophy
- Use installation as the common mechanism to manage a cluster
- Everyone installs a system
- On initial bring-up
- When replacing a dead node
- When adding new nodes
- Rocks also uses installation to keep software consistent
- If you catch yourself wondering whether a node's software is up-to-date, reinstall!
- In 10 minutes, all doubt is erased
- Rocks doesn't attempt to incrementally update software
66. Rocks Cluster Distribution
- Fully-automated, cluster-aware distribution
- Cluster on a CD set
- Software Packages
- Full Red Hat Linux distribution
- Red Hat Enterprise Linux 3.0 rebuilt from source
- De facto standard cluster packages
- Rocks packages
- Rocks community packages
- System Configuration
- Configures the services in the packages
67. Rocks Hardware Architecture
68. Minimum Components
- Local hard drive
- Power
- Ethernet
- OS on all nodes (not SSI)
- x86, Opteron, or IA64 server
69. Optional Components
- Myrinet high-performance network
- Infiniband support coming in Nov 2004
- Network-addressable power distribution unit
- Keyboard/video/mouse network not required
- Non-commodity
- How do you manage your management network?
- Crash carts have a lower TCO
70. Storage
- NFS
- The frontend exports all home directories
- Parallel Virtual File System version 1
- System nodes can be targeted as Compute + PVFS or strictly PVFS nodes
71. Minimum Hardware Requirements
- Frontend
- 2 ethernet connections
- 18 GB disk drive
- 512 MB memory
- Compute
- 1 ethernet connection
- 18 GB disk drive
- 512 MB memory
- Power
- Ethernet switches
72. Cluster Software Stack
73. Rocks Rolls
- Rolls are containers for software packages and the configuration scripts for those packages
- Rolls dissect a monolithic distribution
74. Rolls: User-Customizable Frontends
- Rolls are added by the Red Hat installer
- Software is added and configured at initial installation time
- Benefit: security patches are applied during initial installation
- This method is more secure than the add-on method
75. Red Hat Installer Modified to Accept Rolls
76. Approach
- Install a frontend
  1. Insert the Rocks Base CD
  2. Insert Roll CDs (optional components)
  3. Answer 7 screens of configuration data
  4. Drink coffee (the install takes about 30 minutes)
- Install compute nodes
  1. Login to the frontend
  2. Execute insert-ethers
  3. Boot a compute node with the Rocks Base CD (or PXE)
  4. Insert-ethers discovers the node
  5. Go to step 3
- Add user accounts
- Start computing
- Optional Rolls
- Condor
- Grid (based on NMI R4)
- Intel (compilers)
- Java
- SCE (developed in Thailand)
- Sun Grid Engine
- PBS (developed in Norway)
- Area51 (security monitoring tools)
77. Login to the Frontend
- Create an ssh public/private key pair
- You are asked for a passphrase
- These keys are used to securely log in to compute nodes without having to enter a password each time
- Execute insert-ethers
- This utility listens for new compute nodes
78. Insert-ethers
- Used to integrate appliances into the cluster
79. Boot a Compute Node in Installation Mode
- Instruct the node to network boot
- Network boot forces the compute node to run the PXE protocol (Pre-eXecution Environment)
- You can also use the Rocks Base CD
- If there is no CD and no PXE-enabled NIC, you can use a boot floppy built from Etherboot (http://www.rom-o-matic.net)
80. Insert-ethers Discovers the Node
81. Insert-ethers Status
82. eKV: Ethernet Keyboard and Video
- Monitor your compute node installation over the Ethernet network
- No KVM required!
- Execute: ssh compute-0-0
83. Node Info Stored In A MySQL Database
- If you know SQL, you can execute some powerful commands (an example query is sketched below)
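For example, a small C program using the MySQL client library can list the registered compute nodes. The database and table names below ("cluster", "nodes", column "Name") and the connection credentials are assumptions for illustration only; inspect the actual schema on your frontend before relying on them.

/* list_nodes.c - query the cluster database for node names.
 * The schema here (database "cluster", table "nodes", column "Name") is a
 * guess for illustration; check the real schema with "show tables".
 * Typical build: cc list_nodes.c -o list_nodes -lmysqlclient */
#include <stdio.h>
#include <mysql/mysql.h>

int main(void)
{
    MYSQL *db = mysql_init(NULL);
    /* credentials depend on how the frontend's MySQL accounts are set up */
    if (!mysql_real_connect(db, "localhost", NULL, NULL, "cluster", 0, NULL, 0)) {
        fprintf(stderr, "connect failed: %s\n", mysql_error(db));
        return 1;
    }
    if (mysql_query(db, "SELECT Name FROM nodes ORDER BY Name")) {
        fprintf(stderr, "query failed: %s\n", mysql_error(db));
        return 1;
    }
    MYSQL_RES *res = mysql_store_result(db);
    MYSQL_ROW row;
    while ((row = mysql_fetch_row(res)) != NULL)
        printf("%s\n", row[0]);          /* one node name per line */
    mysql_free_result(res);
    mysql_close(db);
    return 0;
}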
84. Cluster Database
85. Kickstart
- Red Hat's Kickstart
- Monolithic flat ASCII file
- No macro language
- Requires forking based on site information and node type
- Rocks XML Kickstart
- Decomposes a kickstart file into nodes and a graph
- The graph specifies an OO framework
- Each node specifies a service and its configuration
- Macros and SQL for site configuration
- Driven from a web CGI script
86. Sample Node File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>
<description>
Enable SSH
</description>
<package>&ssh;</package>
<package>&ssh;-clients</package>
<package>&ssh;-server</package>
<package>&ssh;-askpass</package>
<post>
<file name="/etc/ssh/ssh_config">
Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        FallBackToRsh           no
        Protocol                1,2
</file>
chmod o+rx /root
mkdir /root/.ssh
chmod o+rx /root/.ssh
</post>
</kickstart>
87. Sample Graph File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@GRAPH_DTD@">
<graph>
<description>
Default Graph for NPACI Rocks.
</description>
<edge from="base" to="scripting"/>
<edge from="base" to="ssh"/>
<edge from="base" to="ssl"/>
<edge from="base" to="lilo" arch="i386"/>
<edge from="base" to="elilo" arch="ia64"/>
<edge from="node" to="base" weight="80"/>
<edge from="node" to="accounting"/>
<edge from="slave-node" to="node"/>
<edge from="slave-node" to="nis-client"/>
<edge from="slave-node" to="autofs-client"/>
<edge from="slave-node" to="dhcp-client"/>
<edge from="slave-node" to="snmp-server"/>
<edge from="slave-node" to="node-certs"/>
<edge from="compute" to="slave-node"/>
<edge from="compute" to="usher-server"/>
<edge from="master-node" to="node"/>
<edge from="master-node" to="x11"/>
<edge from="master-node" to="usher-client"/>
</graph>
88. Kickstart Framework
89. Appliances
- Laptop / Desktop
- Appliances
- Final classes
- Node types
- Desktop IsA
- standalone
- Laptop IsA
- standalone
- pcmcia
- Code re-use is good
90. Architecture Differences
- Conditional inheritance
- Annotate edges with target architectures
- If i386
- Base IsA grub
- If ia64
- Base IsA elilo
- One graph, many CPUs
- Heterogeneity is easy
- Not for SSI or imaging
91. Installation Timeline
92. Status
93. But Are Rocks Clusters High-Performance Systems?
- Rocks clusters on the June 2004 Top500 list
94. (No Transcript)
95. What We Proposed To Sun
- Let's build a Top500 machine
- from the ground up
- in 2 hours
- in the Sun booth at Supercomputing '03
96. Rockstar Cluster (SC'03)
- Demonstrate
- We are now in the age of personal supercomputing
- Highlight the abilities of
- Rocks
- SGE
- Top500 list
- #201 - November 2003
- #413 - June 2004
- Hardware
- 129 Intel Xeon servers
- 1 Frontend Node
- 128 Compute Nodes
- Gigabit Ethernet
- $13,000 (US)
- 9 24-port switches
- 8 4-gigabit trunk uplinks